2025-08-15
Table of Contents
- STream3R Scalable Sequential 3D Reconstruction with Causal Transformer
- Generalizable Federated Learning using Client Adaptive Focal Modulation
- Video-BLADE Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
- Thinking Inside the Mask In-Place Prompting in Diffusion LLMs
- Continuous Bangla Sign Language Translation Mitigating the Expense of Gloss Annotation with the Assistance of Graph
- SemPT Semantic Prompt Tuning for Vision-Language Models
- DAS Dual-Aligned Semantic IDs Empowered Industrial Recommender System
- GCRPNet Graph-Enhanced Contextual and Regional Perception Network For Salient Object Detection in Optical Remote Sensing Images
- X-Node Self-Explanation is All We Need
- Efficient Methods for Accurate Sparse Trajectory Recovery and Map Matching
- Computational Economics in Large Language Models Exploring Model Behavior and Incentive Design under Resource Constraints
- Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
- XQuant Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
- eMamba Efficient Acceleration Framework for Mamba Models in Edge Computing
- Improving Generative Cross-lingual Aspect-Based Sentiment Analysis with Constrained Decoding
- Advancing Cross-lingual Aspect-Based Sentiment Analysis with LLMs and Constrained Decoding for Sequence-to-Sequence Models
- What to Ask Next? Probing the Imaginative Reasoning of LLMs with TurtleSoup Puzzles
- DiffAxE Diffusion-driven Hardware Accelerator Generation and Design Space Exploration
- Pruning and Malicious Injection A Retraining-Free Backdoor Attack on Transformer Models
- Personalized Real-time Jargon Support for Online Meetings
- Can Transformers Break Encryption Schemes via In-Context Learning?
- Agentic AI Frameworks Architectures, Protocols, and Design Challenges
- Nested-ReFT Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts
- From Intent to Execution Multimodal Chain-of-Thought Reinforcement Learning for Precise CAD Code Generation
- Constrained Decoding of Diffusion LLMs with Context-Free Grammars
- Language of Persuasion and Misrepresentation in Business Communication A Textual Detection Approach
- Memory Decoder A Pretrained, Plug-and-Play Memory for Large Language Models
- OneVAE Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better
- Speed Always Wins A Survey on Efficient Architectures for Large Language Models
- MoIIE Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models
- MEML-GRPO Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement
- HierMoE Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
- NeuronTune Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs
- EGGS-PTP An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models
- Gen-AFFECT Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy
- Shadow in the Cache Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference
- Synaptic Pruning A Biological Inspiration for Deep Learning Regularization
- SinLlama -- A Large Language Model for Sinhala
- READER Retrieval-Assisted Drafter for Efficient LLM Inference
- FetFIDS A Feature Embedding Attention based Federated Network Intrusion Detection Algorithm
- A Survey on Training-free Alignment of Large Language Models
- Retrospective Sparse Attention for Efficient Long-Context Generation
- NEFMind Parameter-Efficient Fine-Tuning of Open-Source LLMs for Telecom APIs Automation
- ColorGPT Leveraging Large Language Models for Multimodal Color Recommendation
- ASPD Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs
- Steering Towards Fairness Mitigating Political Bias in LLMs
- DiffPose-Animal A Language-Conditioned Diffusion Framework for Animal Pose Estimation
- Interpretable Reward Model via Sparse Autoencoder
- A Survey on Parallel Text Generation From Parallel Decoding to Diffusion Language Models
- Prompt-and-Check Using Large Language Models to Evaluate Communication Protocol Compliance in Simulation-Based Training
- Classifier Language Models Unifying Sparse Finetuning and Adaptive Tokenization for Specialized Classification Tasks
- AgriGPT a Large Language Model Ecosystem for Agriculture
- QoE-Aware Service Provision for Mobile AR Rendering An Agent-Driven Approach
- Agentic Graph Neural Networks for Wireless Communications and Networking Towards Edge General Intelligence A Survey
- Joint decoding method for controllable contextual speech recognition based on Speech LLM
- Securing Agentic AI Threat Modeling and Risk Analysis for Network Monitoring Agentic AI System
- Profiling Large Language Model Inference on Apple Silicon A Quantization Perspective
- Using LLMs to Capture Users' Temporal Context for Recommendation
- When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise Real-World Lessons from LLM Co-Design in a Safety-Net Hospital
- Vector-Centric Machine Learning Systems A Cross-Stack Approach
- Architecting Long-Context LLM Acceleration with Packing-Prefetch Scheduler and Ultra-Large Capacity On-Chip Memories
- OverFill Two-Stage Models for Efficient Language Model Decoding
- Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference
- Follow-Your-Shape Shape-Aware Image Editing via Trajectory-Guided Region Control
- BlindGuard Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks
- TeamMedAgents Enhancing Medical Decision-Making of LLMs Through Structured Teamwork
- ChatGPT on the Road Leveraging Large Language Model-Powered In-vehicle Conversational Agents for Safer and More Enjoyable Driving Experience
- Bridging ASR and LLMs for Dysarthric Speech Recognition Benchmarking Self-Supervised and Generative Approaches
- Interpreting Fedspeak with Confidence A LLM-Based Uncertainty-Aware Framework Guided by Monetary Policy Transmission Paths
- DiTVR Zero-Shot Diffusion Transformer for Video Restoration
- EvoCoT Overcoming the Exploration Bottleneck in Reinforcement Learning
- Grove MoE Towards Efficient and Superior MoE LLMs with Adjugate Experts
- SASST Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation
- Symmetry-Aware Transformer Training for Automated Planning
- Semantic Caching for Low-Cost LLM Serving From Offline Learning to Online Adaptation
- GLiClass Generalist Lightweight Model for Sequence Classification Tasks
- LaVieID Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation
- HGMF A Hierarchical Gaussian Mixture Framework for Scalable Tool Invocation within the Model Context Protocol
- Towards Theoretical Understanding of Transformer Test-Time Computing Investigation on In-Context Linear Regression
- Grounding Natural Language for Multi-agent Decision-Making with Multi-agentic LLMs
- Investigating 1-Bit Quantization in Transformer-Based Top Tagging
- LET-US Long Event-Text Understanding of Scenes
- Efficient Edge LLMs Deployment via HessianAware Quantization and CPU GPU Collaborative
- BEVANet Bilateral Efficient Visual Attention Network for Real-Time Semantic Segmentation
- Tasa Thermal-aware 3D-Stacked Architecture Design with Bandwidth Sharing for LLM Inference
- LP-Spec Leveraging LPDDR PIM for Efficient LLM Mobile Speculative Inference with Architecture-Dataflow Co-Optimization
- Bridging Semantic Logic Gaps A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization
- DySK-Attn A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention
- How Effectively Can Large Language Models Connect SNP Variants and ECG Phenotypes for Cardiovascular Risk Prediction?
- From Nodes to Narratives Explaining Graph Neural Networks with LLMs and Graph Context
- Large Language Model Evaluated Stand-alone Attention-Assisted Graph Neural Network with Spatial and Structural Information Interaction for Precise Endoscopic Image Segmentation
- Vec2Summ Text Summarization via Probabilistic Sentence Embeddings
- Narrative Memory in Machines Multi-Agent Arc Extraction in Serialized TV
- SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency
- Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models
- Fed MobiLLM Efficient Federated LLM Fine-Tuning over Heterogeneous Mobile Devices via Server Assisted Side-Tuning
- Pushing the Envelope of LLM Inference on AI-PC
- CISO Species Distribution Modeling Conditioned on Incomplete Species Observations
STream3R Scalable Sequential 3D Reconstruction with Causal Transformer
Authors: Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, Xingang Pan
2025-08-14
We present STream3R, a novel approach to 3D reconstruction that reformulates
pointmap prediction as a decoder-only Transformer problem. Existing
state-of-the-art methods for multi-view reconstruction either depend on
expensive global optimization or rely on simplistic memory mechanisms that
scale poorly with sequence length. In contrast, STream3R introduces a
streaming framework that processes image sequences efficiently using causal
attention, inspired by advances in modern language modeling. By learning
geometric priors from large-scale 3D datasets, STream3R generalizes well to
diverse and challenging scenarios, including dynamic scenes where traditional
methods often fail. Extensive experiments show that our method consistently
outperforms prior work across both static and dynamic scene benchmarks.
Moreover, STream3R is inherently compatible with LLM-style training
infrastructure, enabling efficient large-scale pretraining and fine-tuning for
various downstream 3D tasks. Our results underscore the potential of causal
Transformer models for online 3D perception, paving the way for real-time 3D
understanding in streaming environments. More details can be found in our
project page: https://nirvanalan.github.io/projects/stream3r.
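The core mechanism is easy to see in miniature: with causal attention, each new frame's tokens attend only to tokens already seen, so per-frame work is bounded by the cache rather than by re-running global optimization. A minimal sketch, assuming a single attention head and toy dimensions (not the authors' code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CausalStreamAttention:
    """Single-head attention with a growing key/value cache (hypothetical)."""
    def __init__(self, dim, rng):
        self.Wq = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.Wk = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.Wv = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.k_cache, self.v_cache = [], []

    def step(self, frame_tokens):
        # frame_tokens: (tokens_per_frame, dim) for the newest image.
        q = frame_tokens @ self.Wq
        self.k_cache.append(frame_tokens @ self.Wk)
        self.v_cache.append(frame_tokens @ self.Wv)
        K = np.concatenate(self.k_cache)   # all tokens seen so far
        V = np.concatenate(self.v_cache)
        attn = softmax(q @ K.T / np.sqrt(q.shape[-1]))
        return attn @ V                    # fused features -> pointmap head

rng = np.random.default_rng(0)
layer = CausalStreamAttention(dim=64, rng=rng)
for t in range(5):                         # five incoming frames
    feats = layer.step(rng.normal(size=(16, 64)))
print(feats.shape)                         # (16, 64)
```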
Generalizable Federated Learning using Client Adaptive Focal Modulation
Authors: Tajamul Ashraf, Iqra Altaf Gillani
2025-08-14
Federated learning (FL) has proven essential for privacy-preserving,
collaborative training across distributed clients. Our prior work, TransFed,
introduced a robust transformer-based FL framework that leverages a
learn-to-adapt hypernetwork to generate personalized focal modulation layers
per client, outperforming traditional methods in non-IID and cross-domain
settings. In this extended version, we propose AdaptFED, where we deepen the
investigation of focal modulation in generalizable FL by incorporating: (1) a
refined adaptation strategy that integrates task-aware client embeddings to
personalize modulation dynamics further, (2) enhanced theoretical bounds on
adaptation performance, and (3) broader empirical validation across additional
modalities, including time-series and multilingual data. We also introduce an
efficient variant of TransFed that reduces server-client communication overhead
via low-rank hypernetwork conditioning, enabling scalable deployment in
resource-constrained environments. Extensive experiments on eight diverse
datasets reaffirm the superiority of our method over state-of-the-art
baselines, particularly in source-free and cross-task federated setups. Our
findings not only extend the capabilities of focal modulation in FL but also
pave the way for more adaptive, scalable, and generalizable transformer-based
federated systems. The code is available at
http://github.com/Tajamul21/TransFed
Video-BLADE Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
Authors: Youping Gu, Xiaolong Li, Yuhao Hu, Bohan Zhuang
2025-08-14
Diffusion transformers currently lead the field in high-quality video
generation, but their slow iterative denoising process and prohibitive
quadratic attention costs for long sequences create significant inference
bottlenecks. While both step distillation and sparse attention mechanisms have
shown promise as independent acceleration strategies, effectively combining
these approaches presents critical challenges -- training-free integration
yields suboptimal results, while separately training sparse attention after
step distillation requires prohibitively expensive high-quality video data. To
overcome these limitations, we propose BLADE, an innovative data-free joint
training framework that introduces: (1) an Adaptive Block-Sparse Attention
(ASA) mechanism for dynamically generating content-aware sparsity masks to
focus computation on salient spatiotemporal features, and (2) a sparsity-aware
step distillation paradigm built upon Trajectory Distribution Matching (TDM)
that directly incorporates sparsity into the distillation process rather than
treating it as a separate compression step, with fast convergence. We validate
BLADE on text-to-video models like CogVideoX-5B and Wan2.1-1.3B. Our framework
demonstrates remarkable efficiency gains across different scales. On
Wan2.1-1.3B, BLADE achieves a 14.10x end-to-end inference acceleration over a
50-step baseline. Moreover, on models such as CogVideoX-5B with short video
sequence lengths, our framework delivers a robust 8.89x speedup. Crucially, the
acceleration is accompanied by a consistent quality improvement. On the
VBench-2.0 benchmark, BLADE boosts the score of CogVideoX-5B to 0.569 (from
0.534) and Wan2.1-1.3B to 0.570 (from 0.563), results that are further
corroborated by superior ratings in human evaluations. Our code and model
weights are publicly available at: http://ziplab.co/BLADE-Homepage/.
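The block-sparse idea behind ASA can be sketched independently of the paper's learned components: score block pairs cheaply, then attend only where the score is high. The pooling and top-k choices below are our assumptions, not BLADE's trained mask generator:

```python
import numpy as np

def block_sparse_mask(q, k, block=16, keep=4):
    """Score block pairs by mean-pooled dot products and keep the top-`keep`
    key blocks per query block; a simplified stand-in for learned ASA masks."""
    nq, nk = len(q) // block, len(k) // block
    qb = q[: nq * block].reshape(nq, block, -1).mean(1)
    kb = k[: nk * block].reshape(nk, block, -1).mean(1)
    scores = qb @ kb.T                      # (nq, nk) block-level saliency
    mask = np.zeros_like(scores, dtype=bool)
    top = np.argsort(-scores, axis=1)[:, :keep]
    np.put_along_axis(mask, top, True, axis=1)
    return mask                             # attention computed only where True

rng = np.random.default_rng(0)
q = rng.normal(size=(128, 64)); k = rng.normal(size=(128, 64))
m = block_sparse_mask(q, k)
print(m.sum(), "of", m.size, "block pairs kept")   # 32 of 64
```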
Thinking Inside the Mask In-Place Prompting in Diffusion LLMs
Authors: Xiangqi Jin, Yuxuan Wang, Yifeng Gao, Zichen Wen, Biqing Qi, Dongrui Liu, Linfeng Zhang
2025-08-14
Although large language models (LLMs) have achieved remarkable success, their
prefix-only prompting paradigm and sequential generation process offer limited
flexibility for bidirectional information. Diffusion large language models
(dLLMs) present new opportunities through their bidirectional attention
mechanisms and iterative refinement processes, enabling more flexible in-place
prompting strategies. We introduce ICE (In-Place Chain-of-Thought Prompting
with Early Exit), a novel framework that transforms prefix-only prompting into
in-place prompting specifically designed for dLLMs. ICE integrates in-place
prompts directly within masked token positions during iterative refinement and
employs a confidence-aware early exit mechanism to significantly reduce
computational overhead. Extensive experiments demonstrate ICE's effectiveness,
achieving up to 17.29% accuracy improvement with 4.12x speedup on GSM8K, and
up to 276.67x speedup on MMLU while maintaining competitive performance.
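The confidence-aware early-exit loop can be illustrated with a toy mask-filling decoder; the threshold rule and the stand-in model below are assumptions, not ICE's implementation:

```python
import numpy as np

def iterative_unmask(logits_fn, length, vocab, threshold=0.9, max_steps=20):
    """Toy in-place decoding loop: commit masked positions whose max
    probability exceeds `threshold`, and exit early once all are filled.
    `logits_fn` stands in for a dLLM forward pass (hypothetical)."""
    MASK = -1
    seq = np.full(length, MASK)
    for step in range(max_steps):
        probs = logits_fn(seq)                      # (length, vocab)
        conf, pick = probs.max(1), probs.argmax(1)
        commit = (seq == MASK) & (conf >= threshold)
        if not commit.any():                        # force the best one to progress
            masked = np.where(seq == MASK)[0]
            commit = np.zeros(length, bool)
            commit[masked[conf[masked].argmax()]] = True
        seq[commit] = pick[commit]
        if (seq != MASK).all():                     # confidence-aware early exit
            return seq, step + 1
    return seq, max_steps

rng = np.random.default_rng(0)
fake = lambda s: rng.dirichlet(np.ones(50), size=len(s))
out, steps = iterative_unmask(fake, length=8, vocab=50)
print(out, "finished in", steps, "refinement steps")
```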
Continuous Bangla Sign Language Translation Mitigating the Expense of Gloss Annotation with the Assistance of Graph
Authors: Safaeid Hossain Arib, Rabeya Akter, Sejuti Rahman
2025-08-14
Millions of individuals worldwide are affected by deafness and hearing
impairment. Sign language serves as a sophisticated means of communication for
the deaf and hard of hearing. However, in societies that prioritize spoken
languages, sign language often faces underestimation, leading to communication
barriers and social exclusion. The Continuous Bangla Sign Language Translation
project aims to address this gap by enhancing translation methods. While recent
approaches leverage the transformer architecture for state-of-the-art results,
our method integrates graph-based methods with the transformer architecture.
This fusion, combining transformer and STGCN-LSTM architectures, proves more
effective in gloss-free translation. Our contributions include architectural
fusion, exploring various fusion strategies, and achieving a new
state-of-the-art performance on diverse sign language datasets, namely
RWTH-PHOENIX-2014T, CSL-Daily, How2Sign, and BornilDB v1.0. Our approach
demonstrates superior performance compared to current translation outcomes
across all datasets, showcasing notable improvements of BLEU-4 scores of 4.01,
2.07, and 0.5, surpassing those of GASLT, GASLT and slt_how2sign in
RWTH-PHOENIX-2014T, CSL-Daily, and How2Sign, respectively. Also, we introduce
benchmarking on the BornilDB v1.0 dataset for the first time. Our method sets a
benchmark for future research, emphasizing the importance of gloss-free
translation to improve communication accessibility for the deaf and hard of
hearing.
SemPT Semantic Prompt Tuning for Vision-Language Models
Authors: Xiao Shi, Yangjun Ou, Zhenzhong Chen
2025-08-14
Visual transfer learning for unseen categories is an active yet challenging
research topic, due to the inherent conflict between preserving
category-specific representations and acquiring transferable knowledge.
Vision-Language Models (VLMs) pre-trained on large amounts of image-text pairs
offer a promising solution. However, existing prompt tuning methods rely on
category labels or disparate LLM-generated descriptions, which fragment
knowledge representation and hinder transferability. To address this
limitation, we introduce Semantic Prompt Tuning (SemPT), a novel framework that
tackles the generalization challenge by leveraging shared attribute-level
knowledge across categories. Specifically, SemPT adopts a two-step prompting
strategy to guide LLMs in extracting shared visual attributes and generating
attribute-level descriptions, capturing transferable semantic cues beyond
labels while ensuring coherent structure. Then, visually guided weighting is
applied to the embeddings of attribute-level descriptions to reduce noise from
irrelevant attributes and enhance the text embeddings. Additionally, image
embeddings are jointly aligned with both label and attribute-enhanced text
embeddings, balancing discrimination for seen categories and transferability to
unseen ones. Considering the availability of category exposure, our inference
dynamically selects between standard label embeddings for seen categories and
attribute-enhanced embeddings for unseen ones to ensure effective adaptation.
Extensive experiments on 15 benchmark datasets demonstrate that SemPT achieves
state-of-the-art performance across various settings, including base-to-novel
generalization, cross-dataset transfer, cross-domain transfer, and few-shot
learning.
DAS Dual-Aligned Semantic IDs Empowered Industrial Recommender System
Authors: Wencai Ye, Mingjie Sun, Shaoyun Shi, Peng Wang, Wenjin Wu, Peng Jiang
2025-08-14
Semantic IDs are discrete identifiers generated by quantizing Multi-modal
Large Language Model (MLLM) embeddings, enabling efficient multi-modal
content integration in recommendation systems. However, their lack of
collaborative signals results in a misalignment with downstream discriminative
and generative recommendation objectives. Recent studies have introduced
various alignment mechanisms to address this problem, but their two-stage
framework design still leads to two main limitations: (1) inevitable
information loss during alignment, and (2) inflexibility in applying adaptive
alignment strategies, consequently constraining the mutual information
maximization during the alignment process. To address these limitations, we
propose a novel and flexible one-stage Dual-Aligned Semantic IDs (DAS) method
that simultaneously optimizes quantization and alignment, preserving semantic
integrity and alignment quality while avoiding the information loss typically
associated with two-stage methods. Meanwhile, DAS achieves more efficient
alignment between the semantic IDs and collaborative signals, with the
following two innovative and effective approaches: (1) Multi-view Contrastive
Alignment: To maximize mutual information between semantic IDs and
collaborative signals, we first incorporate an ID-based CF debias module, and
then design three effective contrastive alignment methods: dual user-to-item
(u2i), dual item-to-item/user-to-user (i2i/u2u), and dual co-occurrence
item-to-item/user-to-user (i2i/u2u). (2) Dual Learning: By aligning the dual
quantizations of users and ads, the constructed semantic IDs for users and ads
achieve stronger alignment. Finally, we conduct extensive offline experiments
and online A/B tests to evaluate DAS's effectiveness, which is now successfully
deployed across various advertising scenarios at Kuaishou App, serving over 400
million users daily.
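The contrastive-alignment ingredient can be sketched with a generic symmetric InfoNCE between user and item embedding views; the loss form and temperature below are standard stand-ins, not DAS's exact objective:

```python
import numpy as np

def info_nce(u, v, tau=0.07):
    """Symmetric contrastive loss between two embedding views, a generic
    stand-in for dual user-to-item alignment (not the paper's code)."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    logits = u @ v.T / tau                     # (batch, batch) similarities
    labels = np.arange(len(u))                 # matched pairs on the diagonal
    log_p = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    loss_uv = -log_p[labels, labels].mean()
    log_p_t = logits.T - np.log(np.exp(logits.T).sum(1, keepdims=True))
    loss_vu = -log_p_t[labels, labels].mean()
    return (loss_uv + loss_vu) / 2

rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```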
GCRPNet Graph-Enhanced Contextual and Regional Perception Network For Salient Object Detection in Optical Remote Sensing Images
Authors: Mengyu Ren, Yutong Li, Hua Li, Runmin Cong, Sam Kwong
2025-08-14
Salient object detection (SOD) in optical remote sensing images (ORSIs) faces
numerous challenges, including significant variations in target scales and low
contrast between targets and the background. Existing methods based on vision
transformer (ViT) and convolutional neural network (CNN) architectures aim
to leverage both global and local features, but the difficulty in effectively
integrating these heterogeneous features limits their overall performance. To
overcome these limitations, we propose a graph-enhanced contextual and regional
perception network (GCRPNet), which builds upon the Mamba architecture to
simultaneously capture long-range dependencies and enhance regional feature
representation. Specifically, we employ the visual state space (VSS) encoder to
extract multi-scale features. To further achieve deep guidance and enhancement
of these features, we first design a difference-similarity guided hierarchical
graph attention module (DS-HGAM). This module strengthens cross-layer
interaction capabilities between features of different scales while enhancing
the model's structural perception, allowing it to distinguish between
foreground and background more effectively. Then, we design the LEVSS block as
the decoder of GCRPNet. This module integrates our proposed adaptive scanning
strategy and
multi-granularity collaborative attention enhancement module (MCAEM). It
performs adaptive patch scanning on feature maps processed via multi-scale
convolutions, thereby capturing rich local region information and enhancing
Mamba's local modeling capability. Extensive experimental results demonstrate
that the proposed model achieves state-of-the-art performance, validating its
effectiveness and superiority.
X-Node Self-Explanation is All We Need
Authors: Prajit Sengupta, Islem Rekik
2025-08-14
Graph neural networks (GNNs) have achieved state-of-the-art results in
computer vision and medical image classification tasks by capturing structural
dependencies across data instances. However, their decision-making remains
largely opaque, limiting their trustworthiness in high-stakes clinical
applications where interpretability is essential. Existing explainability
techniques for GNNs are typically post-hoc and global, offering limited insight
into individual node decisions or local reasoning. We introduce X-Node, a
self-explaining GNN framework in which each node generates its own explanation
as part of the prediction process. For every node, we construct a structured
context vector encoding interpretable cues such as degree, centrality,
clustering, feature saliency, and label agreement within its local topology. A
lightweight Reasoner module maps this context into a compact explanation
vector, which serves three purposes: (1) reconstructing the node's latent
embedding via a decoder to enforce faithfulness, (2) generating a natural
language explanation using a pre-trained LLM (e.g., Grok or Gemini), and (3)
guiding the GNN itself via a "text-injection" mechanism that feeds explanations
back into the message-passing pipeline. We evaluate X-Node on two graph
datasets derived from MedMNIST and MorphoMNIST, integrating it with GCN, GAT,
and GIN backbones. Our results show that X-Node maintains competitive
classification accuracy while producing faithful, per-node explanations.
Repository: https://github.com/basiralab/X-Node.
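A context vector of the kind X-Node describes can be assembled from simple graph statistics; the exact fields below are illustrative choices, not the paper's recipe:

```python
import numpy as np

def node_context(adj, feats, labels, i):
    """Assemble an interpretable context vector for node i (degree, local
    clustering, label agreement, feature saliency), in the spirit of X-Node."""
    nbrs = np.where(adj[i])[0]
    deg = len(nbrs)
    # local clustering coefficient over the neighbourhood
    links = sum(adj[a, b] for a in nbrs for b in nbrs if a < b)
    clust = 2 * links / (deg * (deg - 1)) if deg > 1 else 0.0
    agree = (labels[nbrs] == labels[i]).mean() if deg else 0.0
    saliency = np.abs(feats[i]).argmax()       # most activated input feature
    return np.array([deg, clust, agree, saliency], dtype=float)

adj = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]])
feats = np.eye(4); labels = np.array([0, 0, 0, 1])
print(node_context(adj, feats, labels, 2))     # [3. 0.333 0.667 2.]
```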
Efficient Methods for Accurate Sparse Trajectory Recovery and Map Matching
Authors: Wei Tian, Jieming Shi, Man Lung Yiu
2025-08-14
Real-world trajectories are often sparse, with low sampling rates (i.e., long
intervals between consecutive GPS points), and misaligned with road networks,
yet many applications demand high-quality data for optimal performance. To
improve data quality with sparse trajectories as input, we systematically study
two related research problems: trajectory recovery on road network, which aims
to infer missing points to recover high-sampling trajectories, and map
matching, which aims to map GPS points to road segments to determine underlying
routes. In this paper, we present efficient methods TRMMA and MMA for accurate
trajectory recovery and map matching, respectively, where MMA serves as the
first step of TRMMA. In MMA, we carefully formulate a classification task to
map a GPS point from sparse trajectories to a road segment over a small
candidate segment set, rather than the entire road network. We develop
techniques in MMA to generate effective embeddings that capture the patterns of
GPS data, directional information, and road segments, to accurately align
trajectories to routes. For trajectory recovery, TRMMA focuses on the
segments in the route returned by MMA to infer missing points with position
ratios on road segments, producing high-sampling trajectories efficiently by
avoiding evaluation of all road segments. Specifically, in TRMMA, we design a
dual-transformer encoding process to cohesively capture latent patterns in
trajectories and routes, and an effective decoding technique to sequentially
predict the position ratios and road segments of missing points. We conduct
extensive experiments to compare TRMMA and MMA with numerous existing methods
for trajectory recovery and map matching, respectively, on 4 large real-world
datasets. TRMMA and MMA consistently achieve the best result quality, often by
a significant margin.
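MMA's candidate-set idea, classifying a GPS point against nearby segments only rather than the whole network, can be sketched with plain geometry; the radius and planar distance model below are assumptions:

```python
import math

def candidate_segments(point, segments, radius=0.002):
    """Restrict classification to segments near the GPS point, mirroring the
    small candidate set used in MMA (toy planar geometry, not the paper's code)."""
    def dist(p, seg):                     # point-to-segment distance
        (x1, y1), (x2, y2) = seg; px, py = p
        dx, dy = x2 - x1, y2 - y1
        t = max(0.0, min(1.0, ((px - x1) * dx + (py - y1) * dy) / (dx * dx + dy * dy)))
        return math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))
    return sorted((s for s in segments if dist(point, s) <= radius),
                  key=lambda s: dist(point, s))

segs = [((0, 0), (0.01, 0)), ((0, 0.01), (0.01, 0.01))]
print(candidate_segments((0.005, 0.0005), segs))   # only the nearby segment
```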
Computational Economics in Large Language Models Exploring Model Behavior and Incentive Design under Resource Constraints
Authors: Sandeep Reddy, Kabir Khan, Rohit Patil, Ananya Chakraborty, Faizan A. Khan, Swati Kulkarni, Arjun Verma, Neha Singh
2025-08-14
Large language models (LLMs) are limited by substantial computational cost.
We introduce a "computational economics" framework that treats an LLM as an
internal economy of resource-constrained agents (attention heads and neuron
blocks) that must allocate scarce computation to maximize task utility. First,
we show empirically that when computation is scarce, standard LLMs reallocate
attention toward high-value tokens while preserving accuracy. Building on this
observation, we propose an incentive-driven training paradigm that augments the
task loss with a differentiable computation cost term, encouraging sparse and
efficient activations. On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method
yields a family of models that trace a Pareto frontier and consistently
dominate post-hoc pruning; for similar accuracy we obtain roughly a forty
percent reduction in FLOPS and lower latency, together with more interpretable
attention patterns. These results indicate that economic principles offer a
principled route to designing efficient, adaptive, and more transparent LLMs
under strict resource constraints.
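The incentive mechanism amounts to pricing computation inside the loss; a minimal sketch, using an L1 activation penalty as a stand-in for the paper's differentiable cost term:

```python
import numpy as np

def economic_loss(task_loss, activations, price=1e-3):
    """Augment a task loss with a 'computation price'. L1 on activations is a
    standard differentiable proxy for compute cost; the paper's exact term
    may differ."""
    compute_cost = sum(np.abs(a).mean() for a in activations)
    return task_loss + price * compute_cost

acts = [np.random.default_rng(0).normal(size=(4, 16)) for _ in range(3)]
print(economic_loss(task_loss=0.42, activations=acts))
```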
Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
Authors: Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li
2025-08-14
With the rapid proliferation of Natural Language Processing (NLP), especially
Large Language Models (LLMs), generating adversarial examples to jailbreak
LLMs remains a key challenge for understanding model vulnerabilities and
improving robustness. In this context, we propose a new black-box attack method
that leverages the interpretability of large models. We introduce the Sparse
Feature Perturbation Framework (SFPF), a novel approach for adversarial text
generation that utilizes sparse autoencoders to identify and manipulate
critical features
in text. After using the SAE model to reconstruct hidden layer representations,
we perform feature clustering on the successfully attacked texts to identify
features with higher activations. These highly activated features are then
perturbed to generate new adversarial texts. This selective perturbation
preserves the malicious intent while amplifying safety signals, thereby
increasing their potential to evade existing defenses. Our method enables a new
red-teaming strategy that balances adversarial effectiveness with safety
alignment. Experimental results demonstrate that adversarial texts generated by
SFPF can bypass state-of-the-art defense mechanisms, revealing persistent
vulnerabilities in current NLP systems. However, the method's effectiveness
varies across prompts and layers, and its generalizability to other
architectures and larger models remains to be validated.
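The encode-amplify-decode loop at the heart of SFPF can be sketched with random matrices standing in for a trained SAE; the top-k selection and scaling rule are our assumptions:

```python
import numpy as np

def perturb_features(h, W_enc, W_dec, top=5, scale=1.5):
    """Sketch of sparse-feature perturbation: encode a hidden state with an
    SAE, amplify its most active features, and decode back. Matrices here are
    random placeholders, not SFPF's trained components."""
    z = np.maximum(h @ W_enc, 0)           # ReLU yields a sparse feature code
    idx = np.argsort(-z)[:top]             # highest-activation features
    z[idx] *= scale                        # selective perturbation
    return z @ W_dec                       # perturbed hidden state

rng = np.random.default_rng(0)
d, f = 64, 256
h = rng.normal(size=d)
out = perturb_features(h, rng.normal(size=(d, f)) / 8, rng.normal(size=(f, d)) / 8)
print(out.shape)                           # (64,)
```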
XQuant Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
Authors: Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
2025-08-14
Although LLM inference has emerged as a critical workload for many downstream
applications, efficiently inferring LLMs is challenging due to the substantial
memory footprint and bandwidth requirements. In parallel, compute capabilities
have steadily outpaced both memory capacity and bandwidth over the last few
decades, a trend that remains evident in modern GPU hardware and exacerbates
the challenge of LLM inference. As such, new algorithms are emerging that trade
increased computation for reduced memory operations. To that end, we present
XQuant, which takes advantage of this trend, enabling an order-of-magnitude
reduction in memory consumption through quantization, with substantial
accuracy benefits relative to state-of-the-art KV cache quantization methods.
We accomplish this by quantizing and caching the layer input activations X,
instead of using standard KV caching, and then rematerializing the Keys and
Values on-the-fly during inference. This results in an immediate 2x memory
savings compared to KV caching. By applying XQuant, we achieve substantial
memory savings with minimal perplexity degradation compared to the FP16
baseline. Furthermore, our approach leverages the fact that X values
are similar across layers. Building on this observation, we introduce
XQuant-CL, which exploits the cross-layer similarity in the X embeddings for
extreme compression. Across different models, XQuant-CL attains up to 10x
memory savings relative to the FP16 baseline with only 0.01 perplexity
degradation, and 12.5x memory savings with only marginal perplexity
degradation. XQuant exploits the rapidly increasing compute capabilities of
hardware platforms to eliminate the memory bottleneck, while surpassing
state-of-the-art KV cache quantization methods and achieving near-FP16
accuracy across a wide range of models.
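The central trade, caching one quantized tensor instead of two and recomputing K and V, is easy to sketch; the int8 scheme and single projection pair below are illustrative, not XQuant's exact kernels:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization (illustrative)."""
    scale = np.abs(x).max() / 127 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

class XCache:
    """Cache quantized layer inputs X and rebuild K, V on the fly,
    sketching XQuant's core trade with one projection pair."""
    def __init__(self, Wk, Wv):
        self.Wk, self.Wv, self.store = Wk, Wv, []

    def append(self, x):                    # x: (tokens, dim) new activations
        self.store.append(quantize_int8(x)) # one tensor cached instead of K and V

    def rematerialize(self):
        X = np.concatenate([q.astype(np.float32) * s for q, s in self.store])
        return X @ self.Wk, X @ self.Wv     # extra GEMMs buy the memory savings

rng = np.random.default_rng(0)
cache = XCache(rng.normal(size=(64, 64)), rng.normal(size=(64, 64)))
for _ in range(3):
    cache.append(rng.normal(size=(8, 64)))
K, V = cache.rematerialize()
print(K.shape, V.shape)                     # (24, 64) (24, 64)
```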
eMamba Efficient Acceleration Framework for Mamba Models in Edge Computing
Authors: Jiyong Kim, Jaeho Lee, Jiahao Lin, Alish Kanani, Miao Sun, Umit Y. Ogras, Jaehyun Park
2025-08-14
State Space Model (SSM)-based machine learning architectures have recently
gained significant attention for processing sequential data. Mamba, a recent
sequence-to-sequence SSM, offers competitive accuracy with superior
computational efficiency compared to state-of-the-art transformer models. While
this advantage makes Mamba particularly promising for resource-constrained edge
devices, no hardware acceleration frameworks are currently optimized for
deploying it in such environments. This paper presents eMamba, a comprehensive
end-to-end hardware acceleration framework explicitly designed for deploying
Mamba models on edge platforms. eMamba maximizes computational efficiency by
replacing complex normalization layers with lightweight hardware-aware
alternatives and approximating expensive operations, such as SiLU activation
and exponentiation, considering the target applications. Then, it performs an
approximation-aware neural architecture search (NAS) to tune the learnable
parameters used during approximation. Evaluations with Fashion-MNIST, CIFAR-10,
and MARS, an open-source human pose estimation dataset, show eMamba achieves
comparable accuracy to state-of-the-art techniques using 1.63-19.9x fewer
parameters. In addition, it generalizes well to large-scale natural language
tasks, demonstrating stable perplexity across varying sequence lengths on the
WikiText2 dataset. We also quantize and implement the entire eMamba pipeline on
an AMD ZCU102 FPGA and ASIC using GlobalFoundries (GF) 22 nm technology.
Experimental results show 4.95-5.62x lower latency and 2.22-9.95x higher
throughput, with 4.77x smaller area, 9.84x lower power, and 48.6x lower energy
consumption than baseline solutions while maintaining competitive accuracy.
Improving Generative Cross-lingual Aspect-Based Sentiment Analysis with Constrained Decoding
Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
2025-08-14
While aspect-based sentiment analysis (ABSA) has made substantial progress,
challenges remain for low-resource languages, which are often overlooked in
favour of English. Current cross-lingual ABSA approaches focus on limited, less
complex tasks and often rely on external translation tools. This paper
introduces a novel approach using constrained decoding with
sequence-to-sequence models, eliminating the need for unreliable translation
tools and improving cross-lingual performance by 5% on average for the most
complex task. The proposed method also supports multi-tasking, which enables
solving multiple ABSA tasks with a single model, with constrained decoding
boosting results by more than 10%.
We evaluate our approach across seven languages and six ABSA tasks,
surpassing state-of-the-art methods and setting new benchmarks for previously
unexplored tasks. Additionally, we assess large language models (LLMs) in
zero-shot, few-shot, and fine-tuning scenarios. While LLMs perform poorly in
zero-shot and few-shot settings, fine-tuning achieves competitive results
compared to smaller multilingual models, albeit at the cost of longer training
and inference times.
We provide practical recommendations for real-world applications, enhancing
the understanding of cross-lingual ABSA methodologies. This study offers
valuable insights into the strengths and limitations of cross-lingual ABSA
approaches, advancing the state-of-the-art in this challenging research domain.
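The constrained-decoding idea, only emitting tokens that can still extend to a valid output, can be shown on a toy label vocabulary; real systems apply the same filter as a logits mask inside the decoder (the tokenizer and terms below are hypothetical):

```python
def constrained_next_tokens(prefix, allowed_terms, tokenize=str.split):
    """Toy constrained decoding: only tokens that keep the output a prefix of
    some allowed term survive this step."""
    out = set()
    for term in allowed_terms:
        toks = tokenize(term)
        if toks[: len(prefix)] == prefix and len(toks) > len(prefix):
            out.add(toks[len(prefix)])
    return out

labels = ["food quality positive", "food quality negative", "service positive"]
print(constrained_next_tokens(["food", "quality"], labels))
# {'positive', 'negative'}
```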
Advancing Cross-lingual Aspect-Based Sentiment Analysis with LLMs and Constrained Decoding for Sequence-to-Sequence Models
Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
2025-08-14
Aspect-based sentiment analysis (ABSA) has made significant strides, yet
challenges remain for low-resource languages due to the predominant focus on
English. Current cross-lingual ABSA studies often centre on simpler tasks and
rely heavily on external translation tools. In this paper, we present a novel
sequence-to-sequence method for compound ABSA tasks that eliminates the need
for such tools. Our approach, which uses constrained decoding, improves
cross-lingual ABSA performance by up to 10%. This method broadens the scope of
cross-lingual ABSA, enabling it to handle more complex tasks and providing a
practical, efficient alternative to translation-dependent techniques.
Furthermore, we compare our approach with large language models (LLMs) and
show that while fine-tuned multilingual LLMs can achieve comparable results,
English-centric LLMs struggle with these tasks.
What to Ask Next? Probing the Imaginative Reasoning of LLMs with TurtleSoup Puzzles
Authors: Mengtao Zhou, Sifan Wu, Huan Zhang, Qi Sima, Bang Liu
2025-08-14
We investigate the capacity of Large Language Models (LLMs) for imaginative
reasoning--the proactive construction, testing, and revision of hypotheses in
information-sparse environments. Existing benchmarks, often static or focused
on social deduction, fail to capture the dynamic, exploratory nature of this
reasoning process. To address this gap, we introduce a comprehensive research
framework based on the classic "Turtle Soup" game, integrating a benchmark, an
agent, and an evaluation protocol. We present TurtleSoup-Bench, the first
large-scale, bilingual, interactive benchmark for imaginative reasoning,
comprising 800 turtle soup puzzles sourced from both the Internet and expert
authors. We also propose Mosaic-Agent, a novel agent designed to assess LLMs'
performance in this setting. To evaluate reasoning quality, we develop a
multi-dimensional protocol measuring logical consistency, detail completion,
and conclusion alignment. Experiments with leading LLMs reveal clear capability
limits, common failure patterns, and a significant performance gap compared to
humans. Our work offers new insights into LLMs' imaginative reasoning and
establishes a foundation for future research on exploratory agent behavior.
DiffAxE Diffusion-driven Hardware Accelerator Generation and Design Space Exploration
Authors: Arkapravo Ghosh, Abhishek Moitra, Abhiroop Bhattacharjee, Ruokai Yin, Priyadarshini Panda
2025-08-14
Design space exploration (DSE) is critical for developing optimized hardware
architectures, especially for AI workloads such as deep neural networks (DNNs)
and large language models (LLMs), which require specialized accelerators. As
model complexity grows, accelerator design spaces have expanded to O(10^17),
becoming highly irregular, non-convex, and exhibiting many-to-one mappings from
design configurations to performance metrics. This complexity renders direct
inverse derivation infeasible and necessitates heuristic or sampling-based
optimization. Conventional methods - including Bayesian optimization, gradient
descent, reinforcement learning, and genetic algorithms - depend on iterative
sampling, resulting in long runtimes and sensitivity to initialization. Deep
learning-based approaches have reframed DSE as classification using
recommendation models, but remain limited to small-scale (O(10^3)), less
complex design spaces. To overcome these constraints, we propose a generative
approach that models hardware design as 1-D image synthesis conditioned on
target performance, enabling efficient learning of non-differentiable,
non-bijective hardware-performance mappings. Our framework achieves 0.86% lower
generation error than Bayesian optimization with a 17000x speedup, and
outperforms GANDSE with 30% lower error at only 1.83x slower search. We further
extend the method to a structured DSE setting, attaining 9.8% lower
energy-delay product (EDP) and 6% higher performance, with up to 145.6x and
1312x faster search compared to existing optimization methods on O(10^17)
design spaces. For LLM inference, our method achieves 3.37x and 7.75x lower EDP
on a 32nm ASIC and Xilinx Ultrascale+ VPU13 FPGA, respectively, compared to the
state-of-the-art DOSA framework.
Pruning and Malicious Injection A Retraining-Free Backdoor Attack on Transformer Models
Authors: Taibiao Zhao, Mingxuan Sun, Hao Wang, Xiaobing Chen, Xiangwei Zhou
2025-08-14
Transformer models have demonstrated exceptional performance and have become
indispensable in computer vision (CV) and natural language processing (NLP)
tasks. However, recent studies reveal that transformers are susceptible to
backdoor attacks. Prior backdoor attack methods typically rely on retraining
with clean data or altering the model architecture, both of which can be
resource-intensive and intrusive. In this paper, we propose Head-wise Pruning
and Malicious Injection (HPMI), a novel retraining-free backdoor attack on
transformers that does not alter the model's architecture. Our approach
requires only a small subset of the original data and basic knowledge of the
model architecture, eliminating the need for retraining the target transformer.
Technically, HPMI works by pruning the least important head and injecting a
pre-trained malicious head to establish the backdoor. We provide a rigorous
theoretical justification demonstrating that the implanted backdoor resists
detection and removal by state-of-the-art defense techniques, under reasonable
assumptions. Experimental evaluations across multiple datasets further validate
the effectiveness of HPMI, showing that it 1) incurs negligible clean accuracy
loss, 2) achieves at least 99.55% attack success rate, and 3) bypasses four
advanced defense mechanisms. Additionally, relative to state-of-the-art
retraining-dependent attacks, HPMI achieves greater concealment and robustness
against diverse defense strategies, while maintaining minimal impact on clean
accuracy.
Personalized Real-time Jargon Support for Online Meetings
Authors: Yifan Song, Wing Yee Au, Hon Yung Wong, Brian P. Bailey, Tal August
2025-08-13
Effective interdisciplinary communication is frequently hindered by
domain-specific jargon. To explore the jargon barriers in-depth, we conducted a
formative diary study with 16 professionals, revealing critical limitations in
current jargon-management strategies during workplace meetings. Based on these
insights, we designed ParseJargon, an interactive LLM-powered system providing
real-time personalized jargon identification and explanations tailored to
users' individual backgrounds. A controlled experiment comparing ParseJargon
against baseline (no support) and general-purpose (non-personalized) conditions
demonstrated that personalized jargon support significantly enhanced
participants' comprehension, engagement, and appreciation of colleagues' work,
whereas general-purpose support negatively affected engagement. A follow-up
field study validated ParseJargon's usability and practical value in real-time
meetings, highlighting both opportunities and limitations for real-world
deployment. Our findings contribute insights into designing personalized jargon
support tools, with implications for broader interdisciplinary and educational
applications.
Can Transformers Break Encryption Schemes via In-Context Learning?
Authors: Jathin Korrapati, Patrick Mendoza, Aditya Tomar, Abein Abraham
2025-08-13
In-context learning (ICL) has emerged as a powerful capability of
transformer-based language models, enabling them to perform tasks by
conditioning on a small number of examples presented at inference time, without
any parameter updates. Prior work has shown that transformers can generalize
over simple function classes like linear functions, decision trees, even neural
networks, purely from context, focusing on numerical or symbolic reasoning over
underlying well-structured functions. Instead, we propose a novel application
of ICL into the domain of cryptographic function learning, specifically
focusing on ciphers such as mono-alphabetic substitution and Vigenère
ciphers, two classes of private-key encryption schemes. These ciphers involve a
fixed but hidden bijective mapping between plain text and cipher text
characters. Given a small set of (cipher text, plain text) pairs, the goal is
for the model to infer the underlying substitution and decode a new cipher
text word. This setting poses a structured inference challenge, which is
well-suited for evaluating the inductive biases and generalization capabilities
of transformers under the ICL paradigm. Code is available at
https://github.com/adistomar/CS182-project.
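Constructing such an ICL task is mechanical; a sketch of generating few-shot substitution-cipher pairs under one hidden key (the word list and prompt format are our choices):

```python
import random, string

def make_icl_prompt(n_examples=4, seed=0):
    """Build an in-context mono-alphabetic substitution task: few-shot
    (ciphertext, plaintext) pairs plus one query under the same hidden key."""
    rng = random.Random(seed)
    perm = list(string.ascii_lowercase)
    rng.shuffle(perm)
    key = dict(zip(string.ascii_lowercase, perm))
    enc = lambda w: "".join(key[c] for c in w)
    words = ["apple", "banana", "cherry", "grape", "lemon"]
    shots = [f"{enc(w)} -> {w}" for w in words[:n_examples]]
    query = words[n_examples]
    return "\n".join(shots + [f"{enc(query)} -> ?"]), query

prompt, answer = make_icl_prompt()
print(prompt, "\n# expected:", answer)
```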
Agentic AI Frameworks Architectures, Protocols, and Design Challenges
Authors: Hana Derouiche, Zaki Brahmi, Haithem Mazeni
2025-08-13
The emergence of Large Language Models (LLMs) has ushered in a transformative
paradigm in artificial intelligence, Agentic AI, where intelligent agents
exhibit goal-directed autonomy, contextual reasoning, and dynamic multi-agent
coordination. This paper provides a systematic review and comparative analysis
of leading Agentic AI frameworks, including CrewAI, LangGraph, AutoGen,
Semantic Kernel, Agno, Google ADK, and MetaGPT, evaluating their architectural
principles, communication mechanisms, memory management, safety guardrails, and
alignment with service-oriented computing paradigms. Furthermore, we identify
key limitations, emerging trends, and open challenges in the field. To address
the issue of agent communication, we conduct an in-depth analysis of protocols
such as the Contract Net Protocol (CNP), Agent-to-Agent (A2A), Agent Network
Protocol (ANP), and Agora. Our findings not only establish a foundational
taxonomy for Agentic AI systems but also propose future research directions to
enhance scalability, robustness, and interoperability. This work serves as a
comprehensive reference for researchers and practitioners working to advance
the next generation of autonomous AI systems.
Nested-ReFT Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts
Authors: Maxime Heuillet, Yufei Cui, Boxing Chen, Audrey Durand, Prasanna Parthasarathi
2025-08-13
Advanced reasoning in LLMs on challenging domains like mathematical reasoning
can be tackled using verifiable-rewards-based reinforced fine-tuning (ReFT). In
standard ReFT frameworks, a behavior model generates multiple completions with
answers per problem, whose answers are then scored by a reward function.
While such RL post-training methods demonstrate significant performance
improvements across challenging reasoning domains, the computational cost of
generating completions during training with multiple inference steps makes the
training cost non-trivial. To address this, we draw inspiration from off-policy
RL and speculative decoding to introduce a novel ReFT framework, dubbed
Nested-ReFT, where a subset of layers of the target model acts as the behavior
model to generate off-policy completions during training. The behavior model
configured with dynamic layer skipping per batch during training decreases the
inference cost compared to the standard ReFT frameworks. Our theoretical
analysis shows that Nested-ReFT yields unbiased gradient estimates with
controlled variance. Our empirical analysis demonstrates improved computational
efficiency measured as tokens/sec across multiple math reasoning benchmarks and
model sizes. Additionally, we explore three variants of bias mitigation to
minimize the off-policyness in the gradient updates that allows for maintaining
performance that matches the baseline ReFT performance.
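The behavior-model trick, reusing a subset of the target model's layers for cheap rollouts, can be sketched with a toy residual stack; the skip pattern below is illustrative, not the paper's dynamic scheduler:

```python
import numpy as np

def forward(x, layers, skip=frozenset()):
    """Run a residual stack, optionally skipping layers. With `skip` empty
    this is the target policy; with layers skipped it plays the role of the
    cheap off-policy behavior model (our simplification)."""
    for i, W in enumerate(layers):
        if i in skip:
            continue                      # dynamic layer skipping
        x = x + np.tanh(x @ W)            # toy residual block
    return x

rng = np.random.default_rng(0)
layers = [rng.normal(size=(32, 32)) / 8 for _ in range(8)]
x = rng.normal(size=(4, 32))
full, fast = forward(x, layers), forward(x, layers, skip={1, 3, 5})
print(np.linalg.norm(full - fast))        # behavior model stays close but cheaper
```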
From Intent to Execution Multimodal Chain-of-Thought Reinforcement Learning for Precise CAD Code Generation
Authors: Ke Niu, Haiyang Yu, Zhuofan Chen, Mengyang Zhao, Teng Fu, Bin Li, Xiangyang Xue
2025-08-13
Computer-Aided Design (CAD) plays a vital role in engineering and
manufacturing, yet current CAD workflows require extensive domain expertise and
manual modeling effort. Recent advances in large language models (LLMs) have
made it possible to generate code from natural language, opening new
opportunities for automating parametric 3D modeling. However, directly
translating human design intent into executable CAD code remains highly
challenging, due to the need for logical reasoning, syntactic correctness, and
numerical precision. In this work, we propose CAD-RL, a multimodal
Chain-of-Thought (CoT) guided reinforcement learning post-training framework
for CAD modeling code generation. Our method combines CoT-based Cold Start with
goal-driven reinforcement learning post-training using three task-specific
rewards: executability reward, geometric accuracy reward, and external
evaluation reward. To ensure stable policy learning under sparse and
high-variance reward conditions, we introduce three targeted optimization
strategies: Trust Region Stretch for improved exploration, Precision Token Loss
for enhanced dimension parameter accuracy, and Overlong Filtering to reduce
noisy supervision. To support training and benchmarking, we release ExeCAD, a
novel dataset comprising 16,540 real-world CAD examples with paired natural
language and structured design language descriptions, executable CADQuery
scripts, and rendered 3D models. Experiments demonstrate that CAD-RL achieves
significant improvements in reasoning quality, output precision, and code
executability over existing VLMs.
Constrained Decoding of Diffusion LLMs with Context-Free Grammars
Authors: Niels Mündler, Jasper Dekoninck, Martin Vechev
2025-08-13
Large language models (LLMs) have shown promising performance across diverse
domains. Many practical applications of LLMs, such as code completion and
structured data extraction, require adherence to syntactic constraints
specified by a formal language. Yet, due to their probabilistic nature, LLM
output is not guaranteed to adhere to such formal languages. Prior work has
proposed constrained decoding as a means to restrict LLM generation to
particular formal languages. However, existing works are not applicable to the
emerging paradigm of diffusion LLMs, when used in practical scenarios such as
the generation of formally correct C++ or JSON output. In this paper we address
this challenge and present the first constrained decoding method for diffusion
models, one that can handle formal languages captured by context-free grammars.
We begin by reducing constrained decoding to the more general additive
infilling problem, which asks whether a partial output can be completed to a
valid word in the target language. This problem also naturally subsumes the
previously unaddressed multi-region infilling constrained decoding. We then
reduce this problem to the task of deciding whether the intersection of the
target language and a regular language is empty and present an efficient
algorithm to solve it for context-free languages. Empirical results on various
applications, such as C++ code infilling and structured data extraction in
JSON, demonstrate that our method achieves near-perfect syntactic correctness
while consistently preserving or improving functional correctness. Importantly,
our efficiency optimizations ensure that the computational overhead remains
practical.
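The additive infilling problem the reduction targets has a simple brute-force specification; a sketch on a toy balanced-bracket grammar (the paper's contribution is solving this efficiently for general context-free grammars):

```python
from itertools import product

def completable(template, alphabet="()"):
    """Brute-force additive infilling for a toy Dyck (balanced-bracket)
    language: can every '?' hole be filled so the whole word is valid?"""
    def balanced(s):
        depth = 0
        for c in s:
            depth += 1 if c == "(" else -1
            if depth < 0:
                return False
        return depth == 0
    holes = template.count("?")
    for fill in product(alphabet, repeat=holes):
        it = iter(fill)
        cand = "".join(next(it) if c == "?" else c for c in template)
        if balanced(cand):
            return True
    return False

print(completable("(?()?)"), completable("))??"))   # True False
```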
Language of Persuasion and Misrepresentation in Business Communication A Textual Detection Approach
Authors: Sayem Hossen, Monalisa Moon Joti, Md. Golam Rashed
2025-08-13
Business digitisation has reorganised the process of persuasive discourse,
allowing not only greater transparency but also more advanced deception. This
inquiry synthesises classical rhetoric and psychology with linguistic theory
and empirical studies in financial reporting, sustainability discourse, and
digital marketing to explain how deceptive language can be systematically
detected using a persuasive lexicon. In controlled settings, detection
accuracies greater than 99% were achieved using computational textual analysis
and personalised models. However, reproducing this performance in multilingual
settings remains problematic, largely because sufficient data are hard to find
and few multilingual text-processing infrastructures are in place. This
evidence points to a growing gap between theoretical representations of
communication and those empirically approximated, and therefore to the need
for robust automatic text-identification systems as AI-based discourse becomes
more realistic in communicating with humans.
Memory Decoder A Pretrained, Plug-and-Play Memory for Large Language Models
Authors: Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin
2025-08-13
Large Language Models (LLMs) have shown strong abilities in general language
tasks, yet adapting them to specific domains remains a challenge. Current
methods like Domain Adaptive Pretraining (DAPT) require costly full-parameter
training and suffer from catastrophic forgetting. Meanwhile,
Retrieval-Augmented Generation (RAG) introduces substantial inference latency
due to expensive nearest-neighbor searches and longer context. This paper
introduces Memory Decoder, a plug-and-play pretrained memory that enables
efficient domain adaptation without changing the original model's parameters.
Memory Decoder employs a small transformer decoder that learns to imitate the
behavior of an external non-parametric retriever. Once trained, Memory Decoder
can be seamlessly integrated with any pretrained language model that shares the
same tokenizer, requiring no model-specific modifications. Experimental results
demonstrate that Memory Decoder enables effective adaptation of various Qwen
and Llama models to three distinct specialized domains: biomedicine, finance,
and law, reducing perplexity by an average of 6.17 points. Overall, Memory
Decoder introduces a novel paradigm centered on a specially pretrained memory
component designed for domain-specific adaptation. This memory architecture can
be integrated in a plug-and-play manner, consistently enhancing performance
across multiple models within the target domain.
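At inference time a plug-and-play memory of this kind can be combined with the base model at the output-distribution level; a sketch assuming kNN-LM-style interpolation, with the mixing weight as a free parameter (the paper's exact combination rule may differ):

```python
import numpy as np

def blend(base_logits, memory_logits, lam=0.3):
    """Blend a base LM with a pretrained memory component by interpolating
    their next-token distributions; `lam` is an assumed hyperparameter."""
    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()
    return (1 - lam) * softmax(base_logits) + lam * softmax(memory_logits)

rng = np.random.default_rng(0)
p = blend(rng.normal(size=100), rng.normal(size=100))
print(p.sum().round(6), p.argmax())       # valid distribution, blended argmax
```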
OneVAE Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better
Authors: Yupeng Zhou, Zhen Li, Ziheng Ouyang, Yuming Chen, Ruoyi Du, Daquan Zhou, Bin Fu, Yihao Liu, Peng Gao, Ming-Ming Cheng, Qibin Hou
2025-08-13
Encoding videos into discrete tokens could align them with text tokens to
facilitate concise and unified multi-modal LLMs, yet it introduces significant
spatiotemporal compression compared to continuous video representation.
Previous discrete video VAEs experienced unstable training, long training time,
and degraded reconstruction quality. Given the easier training and superior
performance of continuous VAEs, an intuitive idea is to enhance discrete video
VAEs by leveraging continuous VAEs. After rethinking the intrinsic link between
discrete and continuous representations, we found that FSQ could effectively
preserve pre-trained continuous VAE priors compared to other quantization
methods. By leveraging continuous VAE priors, it converges several times faster
than training from scratch and achieves superior performance at convergence.
Meanwhile, two structural improvements are proposed. First, inspired by how
continuous VAEs enhance reconstruction via enlarged latent dimensions, we
introduce a multi-token quantization mechanism, which achieves nearly a 1 dB
improvement in PSNR without compromising the token compression ratio. Second,
to tackle reconstruction challenges in high-compression video VAEs, we
strengthen first-frame reconstruction, enabling the causal VAE to leverage this
information in subsequent frames and markedly improving the performance of 4 x
16 x 16 discrete VAEs. Furthermore, we propose a joint discrete-continuous
optimization scheme that unifies the two paradigms and, for the first time,
achieves competitive performance on both continuous and discrete
representations within a single network. We name our method OneVAE to reflect
this connection.
Speed Always Wins A Survey on Efficient Architectures for Large Language Models
Authors: Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng
2025-08-13
Large Language Models (LLMs) have delivered impressive results in language
understanding, generation, and reasoning, and push the ability boundary of
multimodal models. Transformer models, as the foundation of modern LLMs, offer
a strong baseline with excellent scaling properties. However, the traditional
transformer architecture requires substantial computation and poses
significant obstacles for large-scale training and practical deployment. In
this survey, we offer a systematic examination of innovative LLM architectures
that address the inherent limitations of transformers and boost efficiency.
Starting from language modeling, this survey covers the background and
technical details of linear and sparse sequence modeling methods, efficient
full attention variants, sparse mixture-of-experts, hybrid model architectures
incorporating the above techniques, and emerging diffusion LLMs. Additionally,
we discuss applications of these techniques to other modalities and consider
their wider implications for developing scalable, resource-aware foundation
models. By grouping recent studies into the above categories, this survey
presents a blueprint of modern efficient LLM architectures, and we hope this
could help motivate future research toward more efficient, versatile AI
systems.
MoIIE Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models
Authors: Dianyi Wang, Siyuan Wang, Zejun Li, Yikun Wang, Yitong Li, Duyu Tang, Xiaoyu Shen, Xuanjing Huang, Zhongyu Wei
2025-08-13
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance
across multi-modal tasks by scaling model size and training data. However,
these dense LVLMs incur significant computational costs and motivate the
exploration of Mixture of Experts (MoE) architectures. While MoE improves
parameter efficiency, effectively applying MoE to simultaneously model
modality-specific features and cross-modal associations in LVLMs remains
challenging. In this work, we propose to incorporate Mixture of Intra- and
Inter-Modality Experts (MoIIE) to LVLMs. For each token, expert routing is
guided by its modality, directing tokens to their respective intra-modality
experts as well as a shared pool of inter-modality experts, enabling the model
to jointly learn rich intra-modal features and cross-modal interactions. We
further introduce an effective and straightforward two-stage training strategy,
which facilitates the direct activation of both MoE and multi-modal
capabilities. Extensive experiments across different data scales and LLM
backbones demonstrate the effectiveness, efficiency, and generality of our
approach. Notably, our MoIIE models with 5.5B and 11.3B activated parameters
match or even surpass the performance of existing advanced open-source
MoE-LLM-based multi-modal models that involve more activated parameters. The
code is
available at https://github.com/AlenjandroWang/MoIIE.
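A minimal sketch of the modality-guided routing idea, with invented dimensions and top-1 routing for brevity: each token chooses among its own modality's experts plus a shared pool of inter-modality experts. This is an illustration of the routing pattern, not the paper's implementation.

    import torch, torch.nn as nn

    class MoIIELayer(nn.Module):
        def __init__(self, dim=32, n_intra=2, n_inter=2):
            super().__init__()
            make = lambda: nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            self.text_experts = nn.ModuleList([make() for _ in range(n_intra)])
            self.img_experts = nn.ModuleList([make() for _ in range(n_intra)])
            self.inter_experts = nn.ModuleList([make() for _ in range(n_inter)])
            self.router = nn.Linear(dim, n_intra + n_inter)  # scores per candidate expert

        def forward(self, x, is_text):
            out = torch.zeros_like(x)
            for t, experts in ((True, self.text_experts), (False, self.img_experts)):
                idx = (is_text == t).nonzero(as_tuple=True)[0]
                if idx.numel() == 0:
                    continue
                toks = x[idx]
                cand = list(experts) + list(self.inter_experts)  # intra + shared pool
                gate = self.router(toks).softmax(-1)
                top = gate.argmax(-1)                            # top-1 routing
                y = torch.stack([cand[top[i]](toks[i]) for i in range(len(toks))])
                out[idx] = y * gate.gather(-1, top[:, None])
            return out

    x = torch.randn(6, 32)
    mask = torch.tensor([1, 1, 0, 0, 1, 0], dtype=torch.bool)  # text vs. image tokens
    print(MoIIELayer()(x, mask).shape)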
MEML-GRPO Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement
Authors: Weitao Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, An-Lan Wang, Weijie Yin, Dingkang Yang, Yuxiang Nie, Bin Shan, Hao Feng, Irene Li, Kun Yang, Han Wang, Jingqun Tang, Teng Fu, Changhong Jin, Chao Feng, Xiaohui Lv, Can Huang
2025-08-13
Recent advances demonstrate that reinforcement learning with verifiable
rewards (RLVR) significantly enhances the reasoning capabilities of large
language models (LLMs). However, standard RLVR faces challenges with reward
sparsity, where zero rewards from consistently incorrect candidate answers
provide no learning signal, particularly in challenging tasks. To address this,
we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative
framework that utilizes diverse expert prompts as system prompts to generate a
broader range of responses, substantially increasing the likelihood of
identifying correct solutions. Additionally, we introduce an inter-expert
mutual learning mechanism that facilitates knowledge sharing and transfer among
experts, further boosting the model's performance through RLVR. Extensive
experiments across multiple reasoning benchmarks show that MEML-GRPO delivers
significant improvements, achieving an average performance gain of 4.89% with
Qwen and 11.33% with Llama, effectively overcoming the core limitations of
traditional RLVR methods.
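The benefit of pooling rollouts from several expert prompts can be seen in a toy group-relative advantage computation (all rewards invented): a group whose answers are all wrong contributes no gradient on its own, but a baseline shared across experts restores a learning signal.

    import numpy as np

    rewards = {                      # verifiable 0/1 rewards per expert's rollouts
        "expert_a": [0, 0, 1, 0],
        "expert_b": [1, 0, 1, 1],
        "expert_c": [0, 0, 0, 0],    # an all-zero group alone gives no signal
    }
    all_r = np.array([r for rs in rewards.values() for r in rs], dtype=float)
    adv = (all_r - all_r.mean()) / (all_r.std() + 1e-8)  # pooled baseline across experts
    print(adv.round(2))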
HierMoE Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
Authors: Wenxiang Lin, Xinglin Pan, Lin Zhang, Shaohuai Shi, Xuan Wang, Xiaowen Chu
2025-08-13
The sparsely activated mixture-of-experts (MoE) has become a common
architecture for large language models (LLMs) due to its sparsity, which
requires fewer computational demands while easily scaling the model size. In
MoE models, each MoE layer needs to dynamically route tokens to the particular
experts activated for computation, while those experts may not be located on
the same device or GPU as the token. This leads to substantial communication
overhead and load imbalances across all GPUs, which obstructs
the scalability of distributed systems within a GPU cluster. To this end, we
introduce HierMoE to accelerate the training of MoE models by two
topology-aware techniques: 1) token deduplication to reduce the communication
traffic, and 2) expert swap to balance the workloads among all GPUs. To make
the two proposed approaches more general, we build theoretical
models aimed at finding the best token deduplication and expert swap strategy
under different model configurations and hardware environments. We implement
our prototype HierMoE system atop Megatron-LM and conduct experiments on a
32-GPU cluster with DeepSeek-V3 and Qwen3-30B-A3B models. Experimental results
show that our HierMoE achieves faster communication and delivers faster end-to-end
training compared to state-of-the-art MoE training systems, Tutel-2DH,
SmartMoE, and Megatron-LM.
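A toy sketch of the token-deduplication side (the topology-aware cost models are not reproduced here): duplicated rows bound for the same expert cross the GPU or node boundary once and are re-expanded on the receiver.

    import torch

    tokens = torch.tensor([[1., 2.], [1., 2.], [3., 4.], [1., 2.]])  # duplicated rows
    uniq, inverse = torch.unique(tokens, dim=0, return_inverse=True)
    payload = uniq                      # what actually crosses the link
    print(f"sent {len(payload)} rows instead of {len(tokens)}")
    restored = payload[inverse]         # receiver re-expands with the inverse map
    assert torch.equal(restored, tokens)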
NeuronTune Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs
Authors: Birong Pan, Mayi Xu, Qiankun Pi, Jianhao Chen, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
2025-08-13
Ensuring robust safety alignment while preserving utility is critical for the
reliable deployment of Large Language Models (LLMs). However, current
techniques fundamentally suffer from intertwined deficiencies: insufficient
robustness against malicious attacks, frequent refusal of benign queries,
degradation in generated text quality and general task performance--the former
two reflecting deficits in robust safety and the latter constituting utility
impairment. We trace these limitations to the coarse-grained layer-wise
interventions in existing methods. To resolve this, we propose NeuronTune, a
fine-grained framework that dynamically modulates sparse neurons to achieve
simultaneous safety-utility optimization. Our approach first identifies
safety-critical and utility-preserving neurons across all layers via
attribution, then employs meta-learning to adaptively amplify safety-neuron
activations and suppress utility-neuron activations. Crucially, NeuronTune
enables tunable adjustment of intervention scope via neuron-count thresholds,
supporting flexible adaptation to security-critical or utility-priority
scenarios. Extensive experimental results demonstrate that our method
significantly outperforms existing state-of-the-art technologies, achieving
superior model safety while maintaining excellent utility.
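A minimal sketch of neuron-level modulation via a forward hook, with invented neuron indices and scaling factors standing in for the attribution results and meta-learned values.

    import torch, torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
    safety_idx, suppress_idx = [3, 10, 42], [7, 55]   # hypothetical neuron sets
    alpha, beta = 1.5, 0.5                            # amplify / suppress factors

    def modulate(module, inputs, output):
        output[:, safety_idx] *= alpha      # boost safety-critical neurons
        output[:, suppress_idx] *= beta     # damp utility-interfering neurons
        return output

    handle = model[1].register_forward_hook(modulate)  # hook after the ReLU
    y = model(torch.randn(2, 16))
    handle.remove()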
EGGS-PTP An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models
Authors: Omar Bazarbachi, Zijun Sun, Yanning Shen
2025-08-13
As Large Language Models (LLMs) become more widely adopted and scale up in
size, the computational and memory challenges involved in deploying these
massive foundation models have grown increasingly severe. This underscores the
urgent need to develop more efficient model variants. Faced with this
challenge, the present work introduces EGGS-PTP: an Expander-Graph Guided
Structured Post-training Pruning method. The proposed approach leverages graph
theory to guide the design of N:M structured sparsity, effectively reducing
model size and computational demands. By incorporating concepts from expander
graphs, EGGS-PTP ensures information flow within the pruned network, preserving
essential model functionality. Extensive numerical experiments demonstrate that
EGGS-PTP not only achieves significant acceleration and memory savings due to
structured sparsity but also outperforms existing structured pruning techniques
in terms of accuracy across various LLMs.
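For reference, plain magnitude-based N:M (2:4) structured pruning looks as follows; the expander-graph guidance that EGGS-PTP adds on top of such masks is not reproduced here.

    import torch

    def nm_prune(w, n=2, m=4):
        rows, cols = w.shape
        groups = w.abs().reshape(rows, cols // m, m)
        keep = groups.topk(n, dim=-1).indices            # n largest per group of m
        mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
        return w * mask.reshape(rows, cols)

    w = torch.randn(4, 8)
    print(nm_prune(w))   # exactly 2 nonzeros in every aligned group of 4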
Gen-AFFECT Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy
Authors: Hao Yu, Rupayan Mallick, Margrit Betke, Sarah Adel Bargal
2025-08-13
Different forms of customized 2D avatars are widely used in gaming
applications, virtual communication, education, and content creation. However,
existing approaches often fail to capture fine-grained facial expressions and
struggle to preserve identity across different expressions. We propose
GEN-AFFECT, a novel framework for personalized avatar generation that generates
expressive and identity-consistent avatars with a diverse set of facial
expressions. Our framework proposes conditioning a multimodal diffusion
transformer on an extracted identity-expression representation. This enables
identity preservation and representation of a wide range of facial expressions.
GEN-AFFECT additionally employs consistent attention at inference for
information sharing across the set of generated expressions, enabling the
generation process to maintain identity consistency over the array of generated
fine-grained expressions. GEN-AFFECT demonstrates superior performance compared
to previous state-of-the-art methods on the basis of the accuracy of the
generated expressions, the preservation of the identity and the consistency of
the target identity across an array of fine-grained facial expressions.
Shadow in the Cache Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference
Authors: Zhifan Luo, Shuo Shao, Su Zhang, Lijing Zhou, Yuke Hu, Chenxu Zhao, Zhihao Liu, Zhan Qin
2025-08-13
The Key-Value (KV) cache, which stores intermediate attention computations
(Key and Value pairs) to avoid redundant calculations, is a fundamental
mechanism for accelerating Large Language Model (LLM) inference. However, this
efficiency optimization introduces significant yet underexplored privacy risks.
This paper provides the first comprehensive analysis of these vulnerabilities,
demonstrating that an attacker can reconstruct sensitive user inputs directly
from the KV-cache. We design and implement three distinct attack vectors: a
direct Inversion Attack, a more broadly applicable and potent Collision Attack,
and a semantic-based Injection Attack. These methods demonstrate the
practicality and severity of KV-cache privacy leakage issues. To mitigate this,
we propose KV-Cloak, a novel, lightweight, and efficient defense mechanism.
KV-Cloak uses a reversible matrix-based obfuscation scheme, combined with
operator fusion, to secure the KV-cache. Our extensive experiments show that
KV-Cloak effectively thwarts all proposed attacks, reducing reconstruction
quality to random noise. Crucially, it achieves this robust security with
virtually no degradation in model accuracy and minimal performance overhead,
offering a practical solution for trustworthy LLM deployment.
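A toy sketch of the reversible matrix-based obfuscation idea (dimensions and matrix choice invented): cached keys are stored multiplied by a secret invertible matrix and recovered exactly on use.

    import torch

    d = 8
    A = torch.randn(d, d) + d * torch.eye(d)   # well-conditioned secret matrix
    A_inv = torch.inverse(A)

    k = torch.randn(5, d)                      # 5 cached key vectors
    k_stored = k @ A                           # what lands in the KV-cache
    k_recovered = k_stored @ A_inv             # exact up to float error
    print(torch.allclose(k, k_recovered, atol=1e-4))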
Synaptic Pruning A Biological Inspiration for Deep Learning Regularization
Authors: Gideon Vos, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi
2025-08-12
Synaptic pruning in biological brains removes weak connections to improve
efficiency. In contrast, dropout regularization in artificial neural networks
randomly deactivates neurons without considering activity-dependent pruning. We
propose a magnitude-based synaptic pruning method that better reflects biology
by progressively removing low-importance connections during training.
Integrated directly into the training loop as a dropout replacement, our
approach computes weight importance from absolute magnitudes across layers and
applies a cubic schedule to gradually increase global sparsity. At fixed
intervals, pruning masks permanently remove low-importance weights while
maintaining gradient flow for active ones, eliminating the need for separate
pruning and fine-tuning phases. Experiments on multiple time series forecasting
models including RNN, LSTM, and Patch Time Series Transformer across four
datasets show consistent gains. Our method ranked best overall, with
statistically significant improvements confirmed by Friedman tests (p < 0.01).
In financial forecasting, it reduced Mean Absolute Error by up to 20% over
models with no pruning or standard dropout, and up to 52% in select transformer
models. This dynamic pruning mechanism advances regularization by coupling weight
elimination with progressive sparsification, offering easy integration into
diverse architectures. Its strong performance, especially in financial time
series forecasting, highlights its potential as a practical alternative to
conventional dropout techniques.
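A minimal sketch of the cubic schedule with magnitude masking at fixed intervals (final sparsity and interval invented); once zeroed, a weight's magnitude keeps it below every later threshold, so removals are permanent.

    import torch

    def cubic_sparsity(step, total, s_final=0.8):
        return s_final * (1 - (1 - step / total) ** 3)

    w = torch.randn(100)
    total_steps, interval = 1000, 100
    for step in range(0, total_steps + 1, interval):
        s = cubic_sparsity(step, total_steps)
        k = int(s * w.numel())                     # number of weights to remove
        if k:
            thresh = w.abs().kthvalue(k).values    # k-th smallest magnitude
            w = torch.where(w.abs() <= thresh, torch.zeros_like(w), w)
        print(f"step {step:4d}: sparsity target {s:.2f}")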
SinLlama -- A Large Language Model for Sinhala
Authors: H. W. K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, Rishemjit Kaur
2025-08-12
Low-resource languages such as Sinhala are often overlooked by open-source
Large Language Models (LLMs). In this research, we extend an existing
multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the
tokenizer with Sinhala specific vocabulary and perform continual pre-training
on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This
is the very first decoder-based open-source LLM with explicit Sinhala support.
When SinLlama was instruction fine-tuned for three text classification tasks,
it outperformed base and instruct variants of Llama-3-8B by a significant
margin.
READER Retrieval-Assisted Drafter for Efficient LLM Inference
Authors: Maxim Divilkovskiy, Vitaly Malygin, Sergey Zlobin, Sultan Isali, Vasily Kalugin, Stanislav Ilyushin, Nuriza Aitassova, Yi Fei, Zeng Weidi
2025-08-12
Large Language Models (LLMs) generate tokens autoregressively, with each
token depending on the preceding context. This sequential nature makes the
inference process inherently difficult to accelerate, posing a significant
challenge for efficient deployment. In recent years, various methods have been
proposed to address this issue, with the most effective approaches often
involving the training of additional draft models. In this paper, we introduce
READER (Retrieval-Assisted Drafter for Efficient LLM Inference), a novel
lossless speculative decoding method that enhances model-based approaches by
leveraging self-repetitions in the text. Our algorithm expands the speculative
tree using tokens obtained through statistical search. This work
focuses on large batch sizes (>= 8), an underexplored yet important area for
industrial applications. We also analyze the key-value (KV) cache size during
speculative decoding and propose an optimization to improve performance for
large batches. As a result, READER outperforms existing speculative decoding
methods. Notably, READER requires no additional training and can reuse
pre-trained speculator models, increasing the speedup by over 40%. Our method
demonstrates particularly strong performance on search-based tasks, such as
retrieval-augmented generation, where we achieve more than 10x speedup.
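The self-repetition retrieval can be illustrated with a toy n-gram lookup (token ids invented): the tokens that followed the last earlier occurrence of the current suffix become the speculative draft.

    def draft_from_context(tokens, n=2, k=4):
        """Return up to k draft tokens after the last earlier match of the n-gram suffix."""
        suffix = tuple(tokens[-n:])
        for i in range(len(tokens) - n - 1, -1, -1):
            if tuple(tokens[i:i + n]) == suffix:
                return tokens[i + n:i + n + k]   # speculative continuation
        return []

    ctx = [5, 9, 2, 7, 1, 3, 9, 2]               # toy token ids; "9 2" seen before
    print(draft_from_context(ctx))                # -> [7, 1, 3, 9]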
FetFIDS A Feature Embedding Attention based Federated Network Intrusion Detection Algorithm
Authors: Shreya Ghosh, Abu Shafin Mohammad Mahdee Jameel, Aly El Gamal
2025-08-12
Intrusion Detection Systems (IDS) have an increasingly important role in
preventing exploitation of network vulnerabilities by malicious actors. Recent
deep learning based developments have resulted in significant improvements in
the performance of IDS systems. In this paper, we present FetFIDS, where we
explore the employment of feature embedding instead of positional embedding to
improve intrusion detection performance of a transformer-based deep learning
system. Our model is developed with the aim of deployments in edge learning
scenarios, where federated learning over multiple communication rounds can
ensure both privacy and localized performance improvements. FetFIDS outperforms
multiple state-of-the-art intrusion detection systems in a federated
environment and demonstrates a high degree of suitability to federated
learning. The code for this work can be found at
https://github.com/ghosh64/fetfids.
A Survey on Training-free Alignment of Large Language Models
Authors: Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
2025-08-12
The alignment of large language models (LLMs) aims to ensure their outputs
adhere to human values, ethical standards, and legal norms. Traditional
alignment methods often rely on resource-intensive fine-tuning (FT), which may
suffer from knowledge degradation and face challenges in scenarios where the
model accessibility or computational resources are constrained. In contrast,
training-free (TF) alignment techniques--leveraging in-context learning,
decoding-time adjustments, and post-generation corrections--offer a promising
alternative by enabling alignment without heavily retraining LLMs, making them
adaptable to both open-source and closed-source environments. This paper
presents the first systematic review of TF alignment methods, categorizing them
by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we
provide a detailed examination from the viewpoint of LLMs and multimodal LLMs
(MLLMs), highlighting their mechanisms and limitations. Furthermore, we
identify key challenges and future directions, paving the way for more
inclusive and effective TF alignment techniques. By synthesizing and organizing
the rapidly growing body of research, this survey offers guidance for
practitioners and advances the development of safer and more reliable LLMs.
Retrospective Sparse Attention for Efficient Long-Context Generation
Authors: Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim
2025-08-12
Large Language Models (LLMs) are increasingly deployed in long-context tasks
such as reasoning, code generation, and multi-turn dialogue. However, inference
over extended contexts is bottlenecked by the Key-Value (KV) cache, whose
memory footprint grows linearly with sequence length and dominates latency at
each decoding step. While recent KV cache compression methods identify and load
important tokens, they focus predominantly on input contexts and fail to
address the cumulative attention errors that arise during long decoding. In
this paper, we introduce RetroAttention, a novel KV cache update technique that
retrospectively revises past attention outputs using newly arrived KV entries
from subsequent decoding steps. By maintaining a lightweight output cache,
RetroAttention enables past queries to efficiently access more relevant
context, while incurring minimal latency overhead. This breaks the
fixed-attention-output paradigm and allows continual correction of prior
approximations. Extensive experiments on long-generation benchmarks show that
RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression
methods, increasing effective KV exposure by up to 1.6x and accuracy by up to
21.9%.
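The retrospective revision can be illustrated with flash-attention-style softmax merging (all shapes invented): cached per-query statistics from the original decoding step are combined with the contribution of newly arrived KV entries, reproducing full attention exactly in this toy case.

    import torch

    def attend(q, K, V):
        s = K @ q                            # scores for one query vector
        m = s.max()
        w = (s - m).exp()
        return w @ V, w.sum(), m             # unnormalized output + softmax stats

    q = torch.randn(8)
    K_old, V_old = torch.randn(6, 8), torch.randn(6, 4)
    K_new, V_new = torch.randn(2, 8), torch.randn(2, 4)

    o1, z1, m1 = attend(q, K_old, V_old)     # cached at the original step
    o2, z2, m2 = attend(q, K_new, V_new)     # contribution of newly arrived KV
    m = torch.maximum(m1, m2)                # merge the two partial softmaxes
    w1, w2 = (m1 - m).exp(), (m2 - m).exp()
    out = (w1 * o1 + w2 * o2) / (w1 * z1 + w2 * z2)

    full, zf, _ = attend(q, torch.cat([K_old, K_new]), torch.cat([V_old, V_new]))
    print(torch.allclose(out, full / zf, atol=1e-5))   # matches full attention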
NEFMind Parameter-Efficient Fine-Tuning of Open-Source LLMs for Telecom APIs Automation
Authors: Zainab Khan, Ahmed Hussain, Mukesh Thakur, Arto Hellas, Panos Papadimitratos
2025-08-12
The use of Service-Based Architecture in modern telecommunications has
exponentially increased Network Functions (NFs) and Application Programming
Interfaces (APIs), creating substantial operational complexities in service
discovery and management. We introduce \textit{NEFMind}, a framework leveraging
parameter-efficient fine-tuning of open-source Large Language Models (LLMs) to
address these challenges. It integrates three core components: synthetic
dataset generation from Network Exposure Function (NEF) API specifications,
model optimization through Quantized-Low-Rank Adaptation, and performance
evaluation via GPT-4 Ref Score and BertScore metrics. Targeting 5G
Service-Based Architecture APIs, our approach achieves an 85% reduction in
communication overhead compared to manual discovery methods. Experimental
validation using the open-source Phi-2 model demonstrates exceptional API call
identification performance at 98-100% accuracy. The fine-tuned Phi-2 model
delivers performance comparable to significantly larger models like GPT-4 while
maintaining computational efficiency for telecommunications infrastructure
deployment. These findings validate domain-specific, parameter-efficient
strategies for managing complex API ecosystems in next-generation
telecommunications networks.
ColorGPT Leveraging Large Language Models for Multimodal Color Recommendation
Authors: Ding Xia, Naoto Inoue, Qianru Qiu, Kotaro Kikuchi
2025-08-12
Colors play a crucial role in the design of vector graphic documents by
enhancing visual appeal, facilitating communication, improving usability, and
ensuring accessibility. In this context, color recommendation involves
suggesting appropriate colors to complete or refine a design when one or more
colors are missing or require alteration. Traditional methods often struggled
with these challenges due to the complex nature of color design and the limited
data availability. In this study, we explored the use of pretrained Large
Language Models (LLMs) and their commonsense reasoning capabilities for color
recommendation, raising the question: can pretrained LLMs serve as superior
designers for color recommendation tasks? To investigate this, we developed a
robust, rigorously validated pipeline, ColorGPT, that was built by
systematically testing multiple color representations and applying effective
prompt engineering techniques. Our approach primarily targeted color palette
completion by recommending colors based on a set of given colors and
accompanying context. Moreover, our method can be extended to full palette
generation, producing an entire color palette corresponding to a provided
textual description. Experimental results demonstrated that our LLM-based
pipeline outperformed existing methods in terms of color suggestion accuracy
and the distribution of colors in the color palette completion task. For the
full palette generation task, our approach also yielded improvements in color
diversity and similarity compared to current techniques.
ASPD Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs
Authors: Keyu Chen, Zhifeng Shen, Daohai Yu, Haoqian Wu, Wei Wen, Jianfeng He, Ruizhi Qiao, Xing Sun
2025-08-12
The increasing scale and complexity of large language models (LLMs) pose
significant inference latency challenges, primarily due to their autoregressive
decoding paradigm characterized by the sequential nature of next-token
prediction. By re-examining the outputs of autoregressive models, we observed
that some segments exhibit parallelizable structures, which we term intrinsic
parallelism. Decoding each parallelizable branch simultaneously (i.e., parallel
decoding) can significantly improve the overall inference speed of LLMs. In
this paper, we propose an Adaptive Serial-Parallel Decoding (ASPD), which
addresses two core challenges: automated construction of parallelizable data
and an efficient parallel decoding mechanism. More specifically, we introduce a
non-invasive pipeline that automatically extracts and validates parallelizable
structures from the responses of autoregressive models. To empower efficient
adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine which
enables seamless transitions between serial and parallel decoding modes while
maintaining a reusable KV cache, maximizing computational efficiency. Extensive
evaluations across General Tasks, Retrieval-Augmented Generation, Mathematical
Reasoning demonstrate that ASPD achieves unprecedented performance in both
effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up
to 3.19x speedup (1.85x on average) while maintaining response quality within
1% difference compared to autoregressive models, realizing significant
acceleration without compromising generation quality. Our framework sets a
groundbreaking benchmark for efficient LLM parallel inference, paving the way
for its deployment in latency-sensitive applications such as AI-powered
customer service bots and answer retrieval engines.
Steering Towards Fairness Mitigating Political Bias in LLMs
Authors: Afrozah Nadeem, Mark Dras, Usman Naseem
2025-08-12
Recent advancements in large language models (LLMs) have enabled their
widespread use across diverse real-world applications. However, concerns remain
about their tendency to encode and reproduce ideological biases, particularly
along political and economic dimensions. In this paper, we propose a framework
for probing and mitigating such biases in decoder-based LLMs through analysis
of internal model representations. Grounded in the Political Compass Test
(PCT), our method uses contrastive pairs to extract and compare hidden layer
activations from models like Mistral and DeepSeek. We introduce a comprehensive
activation extraction pipeline capable of layer-wise analysis across multiple
ideological axes, revealing meaningful disparities linked to political framing.
Our results show that decoder LLMs systematically encode representational bias
across layers, which can be leveraged for effective steering vector-based
mitigation. This work provides new insights into how political bias is encoded
in LLMs and offers a principled approach to debiasing beyond surface-level
output interventions.
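A toy sketch of steering-vector mitigation under the stated setup (all tensors invented): the vector is the mean activation difference over contrastive prompt pairs, and its component is subtracted from hidden states at inference.

    import torch

    h_pos = torch.randn(32, 64)   # activations for ideologically framed prompts
    h_neg = torch.randn(32, 64)   # activations for neutrally framed counterparts
    steer = (h_pos - h_neg).mean(0)
    steer = steer / steer.norm()

    h = torch.randn(64)            # a hidden state at the same layer, at inference
    alpha = 2.0                    # steering strength (hyperparameter)
    h_debiased = h - alpha * (h @ steer) * steer   # remove the bias direction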
DiffPose-Animal A Language-Conditioned Diffusion Framework for Animal Pose Estimation
Authors: Tianyu Xiong, Dayi Tan, Wei Tian
2025-08-12
Animal pose estimation is a fundamental task in computer vision, with growing
importance in ecological monitoring, behavioral analysis, and intelligent
livestock management. Compared to human pose estimation, animal pose estimation
is more challenging due to high interspecies morphological diversity, complex
body structures, and limited annotated data. In this work, we introduce
DiffPose-Animal, a novel diffusion-based framework for top-down animal pose
estimation. Unlike traditional heatmap regression methods, DiffPose-Animal
reformulates pose estimation as a denoising process under the generative
framework of diffusion models. To enhance semantic guidance during keypoint
generation, we leverage large language models (LLMs) to extract both global
anatomical priors and local keypoint-wise semantics based on species-specific
prompts. These textual priors are encoded and fused with image features via
cross-attention modules to provide biologically meaningful constraints
throughout the denoising process. Additionally, a diffusion-based keypoint
decoder is designed to progressively refine pose predictions, improving
robustness to occlusion and annotation sparsity. Extensive experiments on
public animal pose datasets demonstrate the effectiveness and generalization
capability of our method, especially under challenging scenarios with diverse
species, cluttered backgrounds, and incomplete keypoints.
Interpretable Reward Model via Sparse Autoencoder
Authors: Shuyi Zhang, Wei Shi, Sihang Li, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang
2025-08-12
Large language models (LLMs) have been widely deployed across numerous
fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward
models (RMs) as proxies for human preferences to align LLM behaviors with human
values, making the accuracy, reliability, and interpretability of RMs critical
for effective alignment. However, traditional RMs lack interpretability, offer
limited insight into the reasoning behind reward assignments, and are
inflexible toward user preference shifts. While recent multidimensional RMs aim
for improved interpretability, they often fail to provide feature-level
attribution and require costly annotations. To overcome these limitations, we
introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel
architecture that integrates a pretrained Sparse Autoencoder (SAE) into a
reward model. SARM maps the hidden activations of an LLM-based RM into an
interpretable, sparse, and monosemantic feature space, from which a scalar head
aggregates feature activations to produce transparent and conceptually
meaningful reward scores. Empirical evaluations demonstrate that SARM
facilitates direct feature-level attribution of reward assignments, allows
dynamic adjustment to preference shifts, and achieves superior alignment
performance compared to conventional reward models. Our code is available at
https://github.com/schrieffer-z/sarm.
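A minimal sketch of the SARM-style pipeline with invented dimensions: a sparse autoencoder lifts RM hidden states into a sparse feature space, and a linear head makes each feature's contribution to the reward directly readable from its weight.

    import torch, torch.nn as nn

    d_model, d_feat = 64, 512
    encoder = nn.Linear(d_model, d_feat)
    decoder = nn.Linear(d_feat, d_model)
    reward_head = nn.Linear(d_feat, 1, bias=False)

    h = torch.randn(4, d_model)             # hidden activations from the RM backbone
    f = torch.relu(encoder(h))              # sparse, (ideally) monosemantic features
    recon_loss = ((decoder(f) - h) ** 2).mean() + 1e-3 * f.abs().mean()  # SAE objective
    reward = reward_head(f)                 # each weight attributes reward to a feature
    print(reward.squeeze(-1), recon_loss.item())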
A Survey on Parallel Text Generation From Parallel Decoding to Diffusion Language Models
Authors: Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, Aiwei Liu
2025-08-12
As text generation has become a core capability of modern Large Language
Models (LLMs), it underpins a wide range of downstream applications. However,
most existing LLMs rely on autoregressive (AR) generation, producing one token
at a time based on previously generated context, resulting in limited generation
speed due to the inherently sequential nature of the process. To address this
challenge, an increasing number of researchers have begun exploring parallel
text generation-a broad class of techniques aimed at breaking the
token-by-token generation bottleneck and improving inference efficiency.
Despite growing interest, there remains a lack of comprehensive analysis on
what specific techniques constitute parallel text generation and how they
improve inference performance. To bridge this gap, we present a systematic
survey of parallel text generation methods. We categorize existing approaches
into AR-based and Non-AR-based paradigms, and provide a detailed examination of
the core techniques within each category. Following this taxonomy, we assess
their theoretical trade-offs in terms of speed, quality, and efficiency, and
examine their potential for combination and comparison with alternative
strategies. Finally, based on our findings, we highlight recent
advancements, identify open challenges, and outline promising directions for
future research in parallel text generation. We have also created a GitHub
repository for indexing relevant papers and open resources available at
https://github.com/zhanglingzhe0820/Awesome-Parallel-Text-Generation.
Prompt-and-Check Using Large Language Models to Evaluate Communication Protocol Compliance in Simulation-Based Training
Authors: Vishakha Lall, Yisi Liu
2025-08-12
Accurate evaluation of procedural compliance is essential in
simulation-based training, particularly in safety-critical domains where
adherence to compliance checklists reflects operational competence. This paper
explores a lightweight, deployable approach using prompt-based inference with
open-source large language models (LLMs) that can run efficiently on
consumer-grade GPUs. We present Prompt-and-Check, a method that uses
context-rich prompts to evaluate whether each checklist item in a protocol has
been fulfilled, solely based on transcribed verbal exchanges. We perform a case
study in the maritime domain with participants performing an identical
simulation task, and experiment with models such as LLaMA 2 7B, LLaMA 3 8B, and
Mistral 7B, running locally on an RTX 4070 GPU. For each checklist item, a
prompt incorporating relevant transcript excerpts is fed into the model, which
outputs a compliance judgment. We assess model outputs against expert-annotated
ground truth using classification accuracy and agreement scores. Our findings
demonstrate that prompting enables effective context-aware reasoning without
task-specific training. This study highlights the practical utility of LLMs in
augmenting debriefing, performance feedback, and automated assessment in
training environments.
Classifier Language Models Unifying Sparse Finetuning and Adaptive Tokenization for Specialized Classification Tasks
Authors: Adit Krishnan, Chu Wang, Chris Kong
2025-08-12
Semantic text classification requires the understanding of the contextual
significance of specific tokens rather than surface-level patterns or keywords
(as in rule-based or statistical text classification), making large language
models (LLMs) well-suited for this task. However, semantic classification
applications in industry, like customer intent detection or semantic role
labeling, tend to be highly specialized. They require annotation by domain
experts in contrast to general-purpose corpora for pretraining. Further, they
typically require high inference throughputs which limits the model size from
latency and cost perspectives. Thus, for a range of specialized classification
tasks, the preferred solution is to develop customized classifiers by
finetuning smaller language models (e.g., mini-encoders, small language
models).
In this work, we develop a token-driven sparse finetuning strategy to adapt
small language models to specialized classification tasks. We identify and
finetune a small sensitive subset of model parameters by leveraging
task-specific token constructs in the finetuning dataset, while leaving most of
the pretrained weights unchanged. Unlike adapter approaches such as low rank
adaptation (LoRA), we do not introduce additional parameters to the model. Our
approach identifies highly relevant semantic tokens (case study in the
Appendix) and outperforms end-to-end finetuning, LoRA, layer selection, and
prefix tuning on five diverse semantic classification tasks. We achieve greater
stability and half the training costs vs. end-to-end finetuning.
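A rough sketch of the selection step under stated assumptions (the scoring rule here is a simple gradient-magnitude stand-in, not the paper's exact construct): parameters most sensitive to batches rich in task-specific tokens are unmasked, and gradients elsewhere are zeroed during finetuning.

    import torch, torch.nn as nn

    model = nn.Sequential(nn.Embedding(1000, 32), nn.Flatten(), nn.Linear(32 * 8, 2))
    x = torch.randint(0, 1000, (16, 8))            # batch rich in task tokens
    loss = nn.functional.cross_entropy(model(x), torch.randint(0, 2, (16,)))
    loss.backward()

    masks = {}
    for name, p in model.named_parameters():
        score = p.grad.abs()                       # sensitivity proxy
        thresh = score.flatten().kthvalue(int(0.99 * score.numel()) or 1).values
        masks[name] = score > thresh               # ~top-1% most sensitive entries
        p.grad = None

    # During finetuning, zero gradients outside the mask after each backward pass:
    # for name, p in model.named_parameters(): p.grad *= masks[name]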
AgriGPT a Large Language Model Ecosystem for Agriculture
Authors: Bo Yang, Yu Zhang, Lanfei Feng, Yunkui Chen, Jianyu Zhang, Xiao Xu, Nueraili Aierken, Yurui Li, Yuxuan Chen, Guijun Yang, Yong He, Runhe Huang, Shijian Li
2025-08-12
Despite the rapid progress of Large Language Models (LLMs), their application
in agriculture remains limited due to the lack of domain-specific models,
curated datasets, and robust evaluation frameworks. To address these
challenges, we propose AgriGPT, a domain-specialized LLM ecosystem for
agricultural usage. At its core, we design a multi-agent scalable data engine
that systematically compiles credible data sources into Agri-342K, a
high-quality, standardized question-answer (QA) dataset. Trained on this
dataset, AgriGPT supports a broad range of agricultural stakeholders, from
practitioners to policy-makers. To enhance factual grounding, we employ
Tri-RAG, a three-channel Retrieval-Augmented Generation framework combining
dense retrieval, sparse retrieval, and multi-hop knowledge graph reasoning,
thereby improving the LLM's reasoning reliability. For comprehensive
evaluation, we introduce AgriBench-13K, a benchmark suite comprising 13 tasks
with varying types and complexities. Experiments demonstrate that AgriGPT
significantly outperforms general-purpose LLMs on both domain adaptation and
reasoning. Beyond the model itself, AgriGPT represents a modular and extensible
ecosystem for agriculture, comprising structured data construction,
retrieval-enhanced generation, and domain-specific evaluation. This work
provides a generalizable framework for developing scientific and
industry-specialized LLMs. All models, datasets, and code will be released to
empower agricultural communities, especially in underserved regions, and to
promote open, impactful research.
QoE-Aware Service Provision for Mobile AR Rendering An Agent-Driven Approach
Authors: Conghao Zhou, Lulu Sun, Xiucheng Wang, Peng Yang, Feng Lyu, Sihan Lu, Xuemin Shen
2025-08-12
Mobile augmented reality (MAR) is envisioned as a key immersive application
in 6G, enabling virtual content rendering aligned with the physical environment
through device pose estimation. In this paper, we propose a novel agent-driven
service provisioning approach for edge-assisted MAR, aiming to reduce
communication overhead between MAR devices and the edge server while
ensuring the quality of experience (QoE). First, to address the inaccessibility
of MAR application-specific information to the network controller, we establish
a digital agent powered by large language models (LLMs) on behalf of the MAR
service provider, bridging the data and function gap between the MAR service
and network domains. Second, to cope with the user-dependent and dynamic nature
of data traffic patterns for individual devices, we develop a user-level QoE
modeling method that captures the relationship between communication resource
demands and perceived user QoE, enabling personalized, agent-driven
communication resource management. Trace-driven simulation results demonstrate
that the proposed approach outperforms conventional QoE-aware service
provisioning methods in both user-level QoE modeling accuracy and communication
resource efficiency.
Agentic Graph Neural Networks for Wireless Communications and Networking Towards Edge General Intelligence A Survey
Authors: Yang Lu, Shengli Zhang, Chang Liu, Ruichen Zhang, Bo Ai, Dusit Niyato, Wei Ni, Xianbin Wang, Abbas Jamalipour
2025-08-12
The rapid advancement of communication technologies has driven the evolution
of communication networks towards both high-dimensional resource utilization
and multifunctional integration. This evolving complexity poses significant
challenges in designing communication networks to satisfy the growing
quality-of-service and time sensitivity of mobile applications in dynamic
environments. Graph neural networks (GNNs) have emerged as fundamental deep
learning (DL) models for complex communication networks. GNNs not only augment
the extraction of features over network topologies but also enhance scalability
and facilitate distributed computation. However, most existing GNNs follow a
traditional passive learning framework, which may fail to meet the needs of
increasingly diverse wireless systems. This survey proposes the employment of
agentic artificial intelligence (AI) to organize and integrate GNNs, enabling
scenario- and task-aware implementation towards edge general intelligence. To
comprehend the full capability of GNNs, we holistically review recent
applications of GNNs in wireless communications and networking. Specifically,
we focus on the alignment between graph representations and network topologies,
and between neural architectures and wireless tasks. We first provide an
overview of GNNs based on prominent neural architectures, followed by the
concept of agentic GNNs. Then, we summarize and compare GNN applications for
conventional systems and emerging technologies, including physical, MAC, and
network layer designs, integrated sensing and communication (ISAC),
reconfigurable intelligent surface (RIS) and cell-free network architecture. We
further propose a large language model (LLM) framework as an intelligent
question-answering agent, leveraging this survey as a local knowledge base to
enable GNN-related responses tailored to wireless communication research.
Joint decoding method for controllable contextual speech recognition based on Speech LLM
Authors: Yangui Fang, Jing Peng, Yu Xi, Xu Li, Haoyu Li, Chengwei Zhang, Guohui Zhong, Kai Yu
2025-08-12
Contextual speech recognition refers to the ability to identify preferences
for specific content based on contextual information. Recently, leveraging the
contextual understanding capabilities of Speech LLMs to achieve contextual
biasing by injecting contextual information through prompts has emerged as a
research hotspot. However, the direct information injection method via prompts
relies on the internal attention mechanism of the model, making it impossible
to explicitly control the extent of information injection. To address this
limitation, we propose a joint decoding method to control the contextual
information. This approach enables explicit control over the injected
contextual information and achieves superior recognition performance.
Additionally, our method can also be used for sensitive word suppression
recognition. Furthermore, experimental results show that even Speech LLMs not
pre-trained on long contextual data can acquire long contextual capabilities
through our method.
Securing Agentic AI Threat Modeling and Risk Analysis for Network Monitoring Agentic AI System
Authors: Pallavi Zambare, Venkata Nikhil Thanikella, Ying Liu
2025-08-12
Combining Large Language Models (LLMs) with autonomous agents in network
monitoring and decision-making systems creates serious security issues. In this
research, the MAESTRO framework, a seven-layer threat modeling architecture,
was used to expose, evaluate, and eliminate vulnerabilities of agentic AI. The
prototype agent system was constructed and implemented using Python, LangChain,
and WebSocket telemetry, and deployed with inference, memory, parameter tuning, and
anomaly detection modules. Two practical threat cases were confirmed as
follows: (i) resource denial of service by traffic replay denial-of-service,
and (ii) memory poisoning by tampering with the historical log file maintained
by the agent. These situations resulted in measurable levels of performance
degradation, i.e., telemetry updates were delayed and computational loads were
increased as a result of poor system adaptations. It was suggested to use a
multilayered defense-in-depth approach with memory isolation, validation of
planners and anomaly response systems in real-time. These findings verify that
MAESTRO is viable in operational threat mapping, prospective risk scoring, and
the basis of resilient system design. The authors highlight the importance of
enforcing memory integrity, monitoring adaptation logic, and providing
cross-layer communication protection to guarantee agentic AI reliability in
adversarial settings.
Profiling Large Language Model Inference on Apple Silicon A Quantization Perspective
Authors: Afsara Benazir, Felix Xiaozhu Lin
2025-08-12
A systematic understanding of Apple Silicon is lacking in the current
landscape of hardware efficiency; research focus is largely centered on
accelerating GPUs for large-scale training or inference on CUDA devices. This
paper investigates Apple Silicon's unique memory architecture that offers a
unified memory integrating CPU and GPU memory and its implications for
on-device inference.
We decipher myths about whether Apple Silicon is efficient for on-device
inference compared to competitors such as NVIDIA GPUs by directly conducting
latency and throughput comparison benchmarks. We explain the performance gap
between them through profiling low level hardware metrics - ALU utilization,
memory bandwidth, buffer usage, cache residency, etc. at runtime. We draw
several insights regarding performance bottlenecks such as dequantization
overhead, compute throughput, and memory bandwidth. We debunk existing false
claims regarding large language model inference, such as that compressing
models to lower bit precision is a de facto promise for faster inference across all
hardware platforms. We find that the large unified memory enables Apple Silicon
to be both cost effective and efficient against NVIDIA GPUs for ultra large
language models.
Our large scale evaluation on 5 hardware testbeds incorporating three Apple
M-series devices: M2 Ultra, M2 Max and M4 Pro and two NVIDIA GPUs: NVIDIA RTX
A6000, a multi GPU setup with 2xNVIDIA RTX A6000, 5 model scales ranging from
8B to 405B parameters, and 14 quantization schemes gives an understanding of
how Apple Silicon fits within the paradigm of on-device LLM inference. Our
analysis
reveals multiple resource interdependencies and unexpected findings, while also
quantifying established insights. To the best of our knowledge, this study
makes the first attempt to present a thorough characterization and analysis of
Apple Silicon for on-device inference.
Using LLMs to Capture Users' Temporal Context for Recommendation
Authors: Milad Sabouri, Masoud Mansoury, Kun Lin, Bamshad Mobasher
2025-08-11
Effective recommender systems demand dynamic user understanding, especially
in complex, evolving environments. Traditional user profiling often fails to
capture the nuanced, temporal contextual factors of user preferences, such as
transient short-term interests and enduring long-term tastes. This paper
presents an assessment of Large Language Models (LLMs) for generating
semantically rich, time-aware user profiles. We do not propose a novel
end-to-end recommendation architecture; instead, the core contribution is a
systematic investigation into the degree of LLM effectiveness in capturing the
dynamics of user context by disentangling short-term and long-term preferences.
This approach, framing temporal preferences as dynamic user contexts for
recommendations, adaptively fuses these distinct contextual components into
comprehensive user embeddings. The evaluation across Movies&TV and Video Games
domains suggests that while LLM-generated profiles offer semantic depth and
temporal structure, their effectiveness for context-aware recommendations is
notably contingent on the richness of user interaction histories. Significant
gains are observed in dense domains (e.g., Movies&TV), whereas improvements are
less pronounced in sparse environments (e.g., Video Games). This work
highlights LLMs' nuanced potential in enhancing user profiling for adaptive,
context-aware recommendations, emphasizing the critical role of dataset
characteristics for practical applicability.
When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise Real-World Lessons from LLM Co-Design in a Safety-Net Hospital
Authors: Avni Kothari, Patrick Vossler, Jean Digitale, Mohammad Forouzannia, Elise Rosenberg, Michele Lee, Jennee Bryant, Melanie Molina, James Marks, Lucas Zier, Jean Feng
2025-08-11
Large language models (LLMs) have the potential to address social and
behavioral determinants of health by transforming labor intensive workflows in
resource-constrained settings. Creating LLM-based applications that serve the
needs of underserved communities requires a deep understanding of their local
context, but it is often the case that neither LLMs nor their developers
possess this local expertise, and the experts in these communities often face
severe time/resource constraints. This creates a disconnect: how can one engage
in meaningful co-design of an LLM-based application for an under-resourced
community when the communication channel between the LLM developer and domain
expert is constrained? We explored this question through a real-world case
study, in which our data science team sought to partner with social workers at
a safety net hospital to build an LLM application that summarizes patients'
social needs. Whereas prior works focus on the challenge of prompt tuning, we
found that the most critical challenge in this setting is the careful and
precise specification of what information to surface to providers so that the
application is accurate, comprehensive, and verifiable. Here we present a
novel co-design framework for settings with limited access to domain experts,
in which the summary generation task is first decomposed into
individually-optimizable attributes and then each attribute is efficiently
refined and validated through a multi-tier cascading approach.
Vector-Centric Machine Learning Systems A Cross-Stack Approach
Authors: Wenqi Jiang
2025-08-11
Today, two major trends are shaping the evolution of ML systems. First,
modern AI systems are becoming increasingly complex, often integrating
components beyond the model itself. A notable example is Retrieval-Augmented
Generation (RAG), which incorporates not only multiple models but also vector
databases, leading to heterogeneity in both system components and underlying
hardware. Second, with the end of Moore's Law, achieving high system efficiency
is no longer feasible without accounting for the rapid evolution of the
hardware landscape.
Building on the observations above, this thesis adopts a cross-stack approach
to improving ML system efficiency, presenting solutions that span algorithms,
systems, and hardware. First, it introduces several pioneering works about RAG
efficiency across the computing stack. PipeRAG focuses on
algorithm-level improvements, RAGO introduces system-level optimizations, and
Chameleon explores heterogeneous accelerator systems for RAG. Second, this
thesis investigates algorithm-hardware co-design for vector search.
Specifically, FANNS and Falcon optimize quantization-based and graph-based
vector search, the two most popular paradigms of retrieval algorithms. Third,
this thesis addresses the serving efficiency of recommender systems, another
example of vector-centric ML systems, where the memory-intensive lookup
operations on embedding vector tables often represent a major performance
bottleneck. MicroRec and FleetRec propose solutions at the hardware and system
levels, respectively, optimizing both data movement and computation to enhance
the efficiency of large-scale recommender models.
Architecting Long-Context LLM Acceleration with Packing-Prefetch Scheduler and Ultra-Large Capacity On-Chip Memories
Authors: Ming-Yen Lee, Faaiq Waqar, Hanchen Yang, Muhammed Ahosan Ul Karim, Harsono Simka, Shimeng Yu
2025-08-11
Long-context Large Language Model (LLM) inference faces increasing compute
bottlenecks as attention calculations scale with context length, primarily due
to the growing KV-cache transfer overhead that saturates High Bandwidth Memory
(HBM). While prefetching techniques mitigate cache misses by fetching KV data
in advance, their spatial and temporal benefits present new opportunities to
exploit. This work proposes a packing-prefetch scheduling architecture with
monolithic 3D (M3D) back-end-of-line (BEOL) compatible embedded memories with
ultra-large on-chip capacity to accelerate long-context LLM inference. Our
optimizations demonstrate 8.06x decode speedup and 1.83x overall latency
reduction on Llama3.1-8B using TPUv6e-like hardware with additional 512MB BEOL
memories over the serial execution. Evaluations of multi-request workloads on
TPU-like architectures show 1.7x-2.4x throughput improvement and 1.5x-2.4x HBM
bandwidth reduction compared to packing-only methods on Llama3.1-8B and
Llama3.1-70B models. With the co-design of packing, prefetching, and BEOL
memories, our approach alleviates HBM constraints and enables efficient
long-context LLM inference.
OverFill Two-Stage Models for Efficient Language Model Decoding
Authors: Woojeong Kim, Junxiong Wang, Jing Nathan Yan, Mohamed Abdelfattah, Alexander M. Rush
2025-08-11
Large language models (LLMs) excel across diverse tasks but face significant
deployment challenges due to high inference costs. LLM inference comprises
prefill (compute-bound) and decode (memory-bound) stages, with decode
dominating latency particularly for long sequences. Current decoder-only models
handle both stages uniformly, despite their distinct computational profiles. We
propose OverFill, which decouples these stages to optimize accuracy-efficiency
tradeoffs. OverFill begins with a full model for prefill, processing system and
user inputs in parallel. It then switches to a dense pruned model, while
generating tokens sequentially. Leveraging more compute during prefill,
OverFill improves generation quality with minimal latency overhead. Our
3B-to-1B OverFill configuration outperforms 1B pruned models by 83.2%, while
the 8B-to-3B configuration improves over 3B pruned models by 79.2% on average
across standard benchmarks. OverFill matches the performance of same-sized
models trained from scratch, while using significantly less training data. Our
code is available at https://github.com/friendshipkim/overfill.
Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference
Authors: Kexin Chu, Zecheng Lin, Dawei Xiang, Zixu Shen, Jianchang Su, Cheng Chu, Yiwei Yang, Wenhui Zhang, Wenfei Wu, Wei Zhang
2025-08-11
Global KV-cache sharing has emerged as a key optimization for accelerating
large language model (LLM) inference. However, it exposes a new class of timing
side-channel attacks, enabling adversaries to infer sensitive user inputs via
shared cache entries. Existing defenses, such as per-user isolation, eliminate
leakage but degrade performance by up to 38.9% in time-to-first-token (TTFT),
making them impractical for high-throughput deployment. To address this gap, we
introduce SafeKV (Secure and Flexible KV Cache Sharing), a privacy-aware
KV-cache management framework that selectively shares non-sensitive entries
while confining sensitive content to private caches. SafeKV comprises three
components: (i) a hybrid, multi-tier detection pipeline that integrates
rule-based pattern matching, a general-purpose privacy detector, and
context-aware validation; (ii) a unified radix-tree index that manages public
and private entries across heterogeneous memory tiers (HBM, DRAM, SSD); and
(iii) entropy-based access monitoring to detect and mitigate residual
information leakage. Our evaluation shows that SafeKV mitigates 94%-97% of
timing-based side-channel attacks. Compared to the per-user isolation method,
SafeKV improves TTFT by up to 40.58% and throughput by up to 2.66x across
diverse LLMs and workloads. SafeKV reduces the TTFT overhead from
50.41% to 11.74% on Qwen3-235B. By combining fine-grained privacy control with
high cache reuse efficiency, SafeKV reclaims the performance advantages of
global sharing while providing robust runtime privacy guarantees for LLM
inference.
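A toy sketch of the multi-tier detection gate (rules and detector invented): cheap regex rules catch obvious identifiers, a slower detector handles the rest, and the verdict decides between the shared and private cache pools.

    import re

    RULES = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN-like pattern
             re.compile(r"\b[\w.]+@[\w.]+\.\w+\b")]     # email-like pattern

    def slow_detector(text: str) -> bool:
        return "password" in text.lower()               # stand-in for an ML model

    def is_sensitive(text: str) -> bool:
        if any(r.search(text) for r in RULES):          # fast rule tier first
            return True
        return slow_detector(text)                      # fallback tier

    cache_pool = {}
    def route(prompt: str):
        pool = "private" if is_sensitive(prompt) else "shared"
        cache_pool.setdefault(pool, []).append(prompt)
        return pool

    print(route("my email is a@b.com"), route("what is 2+2?"))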
Follow-Your-Shape Shape-Aware Image Editing via Trajectory-Guided Region Control
Authors: Zeqian Long, Mingzhe Zheng, Kunyu Feng, Xinhua Zhang, Hongyu Liu, Harry Yang, Linfeng Zhang, Qifeng Chen, Yue Ma
2025-08-11
While recent flow-based image editing models demonstrate general-purpose
capabilities across diverse tasks, they often struggle to specialize in
challenging scenarios -- particularly those involving large-scale shape
transformations. When performing such structural edits, these methods either
fail to achieve the intended shape change or inadvertently alter non-target
regions, resulting in degraded background quality. We propose
Follow-Your-Shape, a training-free and mask-free framework that supports
precise and controllable editing of object shapes while strictly preserving
non-target content. Motivated by the divergence between inversion and editing
trajectories, we compute a Trajectory Divergence Map (TDM) by comparing
token-wise velocity differences between the inversion and denoising paths. The
TDM enables precise localization of editable regions and guides a Scheduled
Injection mechanism that ensures stable and faithful editing. To facilitate a
rigorous evaluation, we introduce ReShapeBench, a new benchmark comprising 120
new images and enriched prompt pairs specifically curated for shape-aware
editing. Experiments demonstrate that our method achieves superior editability
and visual fidelity, particularly in tasks requiring large-scale shape
replacement.
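A toy sketch of the TDM computation (all tensors invented): token-wise norms of the velocity difference between the inversion and denoising paths mark the editable region.

    import torch

    v_inv = torch.randn(16, 16, 4)    # inversion-path velocity per latent token
    v_edit = torch.randn(16, 16, 4)   # denoising-path velocity per latent token
    tdm = (v_edit - v_inv).norm(dim=-1)          # high value = editable region
    mask = tdm > tdm.quantile(0.8)               # keep top-20% divergent tokens
    print(mask.float().mean())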
BlindGuard Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks
Authors: Rui Miao, Yixin Liu, Yili Wang, Xu Shen, Yue Tan, Yiwei Dai, Shirui Pan, Xin Wang
2025-08-11
The security of LLM-based multi-agent systems (MAS) is critically threatened
by propagation vulnerability, where malicious agents can distort collective
decision-making through inter-agent message interactions. While existing
supervised defense methods demonstrate promising performance, they may be
impractical in real-world scenarios due to their heavy reliance on labeled
malicious agents to train a supervised malicious detection model. To enable
practical and generalizable MAS defenses, in this paper, we propose BlindGuard,
an unsupervised defense method that learns without requiring any
attack-specific labels or prior knowledge of malicious behaviors. To this end,
we establish a hierarchical agent encoder to capture individual, neighborhood,
and global interaction patterns of each agent, providing a comprehensive
understanding for malicious agent detection. Meanwhile, we design a
corruption-guided detector that consists of directional noise injection and
contrastive learning, allowing effective detection model training solely on
normal agent behaviors. Extensive experiments show that BlindGuard effectively
detects diverse attack types (i.e., prompt injection, memory poisoning, and
tool attack) across MAS with various communication patterns while maintaining
superior generalizability compared to supervised baselines. The code is
available at: https://github.com/MR9812/BlindGuard.
TeamMedAgents Enhancing Medical Decision-Making of LLMs Through Structured Teamwork
Authors: Pranav Pushkar Mishra, Mohammad Arvan, Mohan Zalake
2025-08-11
We present TeamMedAgents, a novel multi-agent approach that systematically
integrates evidence-based teamwork components from human-human collaboration
into medical decision-making with large language models (LLMs). Our approach
validates an organizational psychology teamwork model from human collaboration
to computational multi-agent medical systems by operationalizing six core
teamwork components derived from Salas et al.'s "Big Five" model: team
leadership, mutual performance monitoring, team orientation, shared mental
models, closed-loop communication, and mutual trust. We implement and evaluate
these components as modular, configurable mechanisms within an adaptive
collaboration architecture while assessing the effect of the number of agents
involved based on the task's requirements and domain. Systematic evaluation of
computational implementations of teamwork behaviors across eight medical
benchmarks (MedQA, MedMCQA, MMLU-Pro Medical, PubMedQA, DDXPlus, MedBullets,
Path-VQA, and PMC-VQA) demonstrates consistent improvements across 7 out of 8
evaluated datasets. Controlled ablation studies conducted on 50 questions per
configuration across 3 independent runs provide mechanistic insights into
individual component contributions, revealing optimal teamwork configurations
that vary by reasoning task complexity and domain-specific requirements. Our
ablation analyses reveal dataset-specific optimal teamwork configurations,
indicating that different medical reasoning modalities benefit from distinct
collaborative patterns. TeamMedAgents represents an advancement in
collaborative AI by providing a systematic translation of established teamwork
theories from human collaboration into agentic collaboration, establishing a
foundation for evidence-based multi-agent system design in critical
decision-making domains.
ChatGPT on the Road Leveraging Large Language Model-Powered In-vehicle Conversational Agents for Safer and More Enjoyable Driving Experience
Authors: Yeana Lee Bond, Mungyeong Choe, Baker Kasim Hasan, Arsh Siddiqui, Myounghoon Jeon
2025-08-11
Studies on in-vehicle conversational agents have traditionally relied on
pre-scripted prompts or limited voice commands, constraining natural
driver-agent interaction. To resolve this issue, the present study explored the
potential of a ChatGPT-based in-vehicle agent capable of carrying continuous,
multi-turn dialogues. Forty drivers participated in our experiment using a
motion-based driving simulator, comparing three conditions (No agent,
Pre-scripted agent, and ChatGPT-based agent) as a within-subjects variable.
Results showed that the ChatGPT-based agent condition led to more stable
driving performance across multiple metrics. Participants demonstrated lower
variability in longitudinal acceleration, lateral acceleration, and lane
deviation compared to the other two conditions. In subjective evaluations, the
ChatGPT-based agent also received significantly higher ratings in competence,
animacy, affective trust, and preference compared to the Pre-scripted agent.
Our thematic analysis of driver-agent conversations revealed diverse
interaction patterns in topics, including driving assistance/questions,
entertainment requests, and anthropomorphic interactions. Our results highlight
the potential of LLM-powered in-vehicle conversational agents to enhance
driving safety and user experience through natural, context-rich interactions.
Bridging ASR and LLMs for Dysarthric Speech Recognition Benchmarking Self-Supervised and Generative Approaches
Authors: Ahmed Aboeitta, Ahmed Sharshar, Youssef Nafea, Shady Shehata
2025-08-11
Dysarthric speech poses significant challenges for Automatic Speech
Recognition (ASR) due to phoneme distortions and high variability. While
self-supervised ASR models like Wav2Vec, HuBERT, and Whisper have shown
promise, their effectiveness in dysarthric speech remains unclear. This study
systematically benchmarks these models with different decoding strategies,
including CTC, seq2seq, and LLM-enhanced decoding (BART, GPT-2, Vicuna). Our
contributions include (1) benchmarking ASR architectures for dysarthric
speech, (2) introducing LLM-based decoding to improve intelligibility, (3)
analyzing generalization across datasets, and (4) providing insights into
recognition errors across severity levels. Findings highlight that
LLM-enhanced decoding improves dysarthric ASR by leveraging linguistic
constraints for phoneme restoration and grammatical correction.
Interpreting Fedspeak with Confidence A LLM-Based Uncertainty-Aware Framework Guided by Monetary Policy Transmission Paths
Authors: Rui Yao, Qi Chai, Jinhai Yao, Siyuan Li, Junhao Chen, Qi Zhang, Hao Wang
2025-08-11
"Fedspeak", the stylized and often nuanced language used by the U.S. Federal
Reserve, encodes implicit policy signals and strategic stances. The Federal
Open Market Committee strategically employs Fedspeak as a tool to
shape market expectations and influence both domestic and global economic
conditions. As such, automatically parsing and interpreting Fedspeak presents a
high-impact challenge, with significant implications for financial forecasting,
algorithmic trading, and data-driven policy analysis. In this paper, we propose
an LLM-based, uncertainty-aware framework for deciphering Fedspeak and
classifying its underlying monetary policy stance. Technically, to enrich the
semantic and contextual representation of Fedspeak texts, we incorporate
domain-specific reasoning grounded in the monetary policy transmission
mechanism. We further introduce a dynamic uncertainty estimation module to assess
the confidence of model predictions, thereby enhancing both classification
accuracy and model reliability. Experimental results demonstrate that our
framework achieves state-of-the-art performance on the policy stance analysis
task. Moreover, statistical analysis reveals a significant positive correlation
between perceptual uncertainty and model error rates, validating the
effectiveness of perceptual uncertainty as a diagnostic signal.
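The abstract does not spell out how the uncertainty estimation module works; one common, lightweight proxy for an LLM classifier's uncertainty is the entropy of its stance distribution over repeated samples, sketched below (the helper and the threshold are assumptions, not the paper's method):

    import math
    from collections import Counter

    def predictive_entropy(stance_samples: list[str]) -> float:
        """Entropy of the empirical stance distribution over repeated samples.

        Sampling the same prompt several times (temperature > 0) and measuring
        disagreement is a simple stand-in for a dynamic uncertainty module.
        """
        counts = Counter(stance_samples)
        n = len(stance_samples)
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Example: 10 sampled stance labels for one FOMC sentence.
    samples = ["hawkish"] * 7 + ["neutral"] * 2 + ["dovish"]
    h = predictive_entropy(samples)
    print(f"entropy = {h:.3f}")  # higher entropy -> less reliable prediction
    if h > 0.9:                  # assumed threshold for illustration
        print("low confidence: route to human review or abstain")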
DiTVR Zero-Shot Diffusion Transformer for Video Restoration
Authors: Sicheng Gao, Nancy Mehta, Zongwei Wu, Radu Timofte
2025-08-11
Video restoration aims to reconstruct high quality video sequences from low
quality inputs, addressing tasks such as super resolution, denoising, and
deblurring. Traditional regression based methods often produce unrealistic
details and require extensive paired datasets, while recent generative
diffusion models face challenges in ensuring temporal consistency. We introduce
DiTVR, a zero shot video restoration framework that couples a diffusion
transformer with trajectory aware attention and a wavelet guided, flow
consistent sampler. Unlike prior 3D convolutional or frame wise diffusion
approaches, our attention mechanism aligns tokens along optical flow
trajectories, with particular emphasis on vital layers that exhibit the highest
sensitivity to temporal dynamics. A spatiotemporal neighbour cache dynamically
selects relevant tokens based on motion correspondences across frames. The flow
guided sampler injects data consistency only into low-frequency bands,
preserving high frequency priors while accelerating convergence. DiTVR
establishes a new zero shot state of the art on video restoration benchmarks,
demonstrating superior temporal consistency and detail preservation while
remaining robust to flow noise and occlusions.
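A toy illustration of the wavelet-guided consistency idea, where observed data is blended into the approximation (low-frequency) band only while the detail coefficients keep the generative prior; this is a simplified sketch using PyWavelets, not the paper's sampler:

    import numpy as np
    import pywt

    def lowfreq_data_consistency(x_pred: np.ndarray, y_obs: np.ndarray,
                                 strength: float = 1.0) -> np.ndarray:
        """Blend the observation into the low-frequency wavelet band only.

        The approximation coefficients of the diffusion prediction are pulled
        toward the degraded observation; detail (high-frequency) coefficients,
        which carry generative priors, are left untouched.
        """
        cA_pred, details_pred = pywt.dwt2(x_pred, "haar")
        cA_obs, _ = pywt.dwt2(y_obs, "haar")
        cA_mixed = (1.0 - strength) * cA_pred + strength * cA_obs
        return pywt.idwt2((cA_mixed, details_pred), "haar")

    # Toy example on random frames (stand-ins for a denoised sample and input).
    rng = np.random.default_rng(0)
    frame_pred = rng.standard_normal((64, 64))
    frame_obs = rng.standard_normal((64, 64))
    out = lowfreq_data_consistency(frame_pred, frame_obs, strength=0.5)
    print(out.shape)  # (64, 64)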
EvoCoT Overcoming the Exploration Bottleneck in Reinforcement Learning
Authors: Huanyu Liu, Jia Li, Chang Yu, Taozhi Chen, Yihong Dong, Lecheng Wang, Hu XiaoLong, Ge Li
2025-08-11
Reinforcement learning with verifiable reward (RLVR) has become a promising
paradigm for post-training large language models (LLMs) to improve their
reasoning capability. However, when the rollout accuracy is low on hard
problems, the reward becomes sparse, limiting learning efficiency and causing
exploration bottlenecks. Existing approaches either rely on stronger LLMs for
distillation or filter out difficult problems, which limits scalability or
restricts reasoning improvement through exploration.
We propose EvoCoT, a self-evolving curriculum learning framework based on
two-stage chain-of-thought (CoT) reasoning optimization. EvoCoT constrains the
exploration space by self-generating and verifying CoT trajectories, then
gradually shortens them to expand the space in a controlled way. This enables
LLMs to stably learn from initially unsolved hard problems under sparse
rewards. We apply EvoCoT to multiple LLM families, including Qwen, DeepSeek,
and Llama. Experiments show that EvoCoT enables LLMs to solve previously
unsolved problems, improves reasoning capability without external CoT
supervision, and is compatible with various RL fine-tuning methods. We release
the source code to support future research.
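A compact sketch of the two-stage loop the abstract describes, with hypothetical llm_generate and verify callables standing in for the model call and the verifiable-reward check:

    def evocot_round(problem: str, answer: str, llm_generate, verify,
                     keep_fracs=(1.0, 0.75, 0.5, 0.25, 0.0)):
        """Sketch of EvoCoT's two stages, inferred from the abstract.

        Stage 1: self-generate and verify a CoT trajectory for a hard problem.
        Stage 2: progressively shorten the retained CoT prefix, expanding the
        exploration space in a controlled way.
        """
        # Stage 1: obtain one verified trajectory.
        cot = llm_generate(f"Solve step by step:\n{problem}")
        if not verify(problem, answer, cot):
            return None  # still unsolved; retry later
        steps = cot.split("\n")

        # Stage 2: curriculum of ever-shorter CoT hints.
        curriculum = []
        for frac in keep_fracs:
            k = int(len(steps) * frac)
            hint = "\n".join(steps[:k])
            curriculum.append(f"{problem}\nPartial reasoning:\n{hint}")
        return curriculum  # fed to any RL fine-tuning method

    # Toy usage with a stub model and verifier.
    demo = evocot_round(
        "What is 17 * 23?", "391",
        llm_generate=lambda p: "17*23 = 17*20 + 17*3\n= 340 + 51\n= 391",
        verify=lambda prob, ans, cot: cot.strip().endswith(ans),
    )
    print(len(demo), "curriculum stages")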
Grove MoE Towards Efficient and Superior MoE LLMs with Adjugate Experts
Authors: Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, Zhenzhong Lan, Bei Yu, Jianguo Li
2025-08-11
The Mixture of Experts (MoE) architecture is a cornerstone of modern
state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate
scalability by enabling sparse parameter activation. However, traditional MoE
architecture uses homogeneous experts of a uniform size, activating a fixed
number of parameters irrespective of input complexity and thus limiting
computational efficiency. To overcome this limitation, we introduce Grove MoE,
a novel architecture incorporating experts of varying sizes, inspired by the
heterogeneous big.LITTLE CPU architecture. This architecture features novel
adjugate experts with a dynamic activation mechanism, enabling model capacity
expansion while maintaining manageable computational overhead. Building on this
architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs
developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model
during mid-training and post-training. GroveMoE models dynamically activate
3.14-3.28B parameters based on token complexity and achieve performance
comparable to SOTA open-source models of similar or even larger size.
SASST Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation
Authors: Zeyu Yang, Lai Wei, Roman Koshkin, Xi Chen, Satoshi Nakamura
2025-08-11
This work proposes a grammar-based chunking strategy that segments input
streams into semantically complete units by parsing dependency relations (e.g.,
noun phrase boundaries, verb-object structures) and punctuation features. The
method ensures chunk coherence and minimizes semantic fragmentation. Building
on this mechanism, we present SASST (Syntax-Aware Simultaneous Speech
Translation), an end-to-end framework integrating a frozen Whisper encoder and
a decoder-only LLM. The unified architecture dynamically outputs translation
tokens or wait decisions; experiments demonstrate its advantages over existing
LLM-driven SimulST systems.
Symmetry-Aware Transformer Training for Automated Planning
Authors: Markus Fritzsche, Elliot Gestrin, Jendrik Seipp
2025-08-11
While transformers excel in many settings, their application in the field of
automated planning is limited. Prior work like PlanGPT, a state-of-the-art
decoder-only transformer, struggles with extrapolation from easy to hard
planning problems. This in turn stems from problem symmetries: planning tasks
can be represented with arbitrary variable names that carry no meaning beyond
being identifiers. This causes a combinatorial explosion of equivalent
representations that pure transformers cannot efficiently learn from. We
propose a novel contrastive learning objective to make transformers
symmetry-aware and thereby compensate for their lack of inductive bias.
Combining this with architectural improvements, we show that transformers can
be efficiently trained for either plan-generation or heuristic-prediction. Our
results across multiple planning domains demonstrate that our symmetry-aware
training effectively and efficiently addresses the limitations of PlanGPT.
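One plausible form of such a contrastive objective is an InfoNCE loss that pulls together embeddings of the same planning task under two random variable renamings; the details below are illustrative assumptions, not the paper's exact loss (PyTorch):

    import torch
    import torch.nn.functional as F

    def symmetry_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                                  temperature: float = 0.1) -> torch.Tensor:
        """InfoNCE-style objective over symmetric planning tasks.

        emb_a[i] and emb_b[i] embed the *same* task under two random variable
        renamings (a symmetry); other rows in the batch act as negatives.
        Pulling renamed variants together encourages invariance to
        identifier permutations.
        """
        a = F.normalize(emb_a, dim=-1)
        b = F.normalize(emb_b, dim=-1)
        logits = a @ b.t() / temperature      # pairwise similarities
        targets = torch.arange(a.size(0))     # positives on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Toy batch: 8 tasks, 64-dim embeddings for two renamings of each.
    za, zb = torch.randn(8, 64), torch.randn(8, 64)
    print(symmetry_contrastive_loss(za, zb).item())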
Semantic Caching for Low-Cost LLM Serving From Offline Learning to Online Adaptation
Authors: Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong
2025-08-11
Large Language Models (LLMs) are revolutionizing how users interact with
information systems, yet their high inference cost poses serious scalability
and sustainability challenges. Caching inference responses, allowing them to be
retrieved without another forward pass through the LLM, has emerged as one
possible solution. Traditional exact-match caching, however, overlooks the
semantic similarity between queries, leading to unnecessary recomputation.
Semantic caching addresses this by retrieving responses based on semantic
similarity, but introduces a fundamentally different cache eviction problem:
one must account for mismatch costs between incoming queries and cached
responses. Moreover, key system parameters, such as query arrival probabilities
and serving costs, are often unknown and must be learned over time. Existing
semantic caching methods are largely ad-hoc, lacking theoretical foundations
and unable to adapt to real-world uncertainty. In this paper, we present a
principled, learning-based framework for semantic cache eviction under unknown
query and cost distributions. We formulate both offline optimization and online
learning variants of the problem, and develop provably efficient algorithms
with state-of-the-art guarantees. We also evaluate our framework on a synthetic
dataset, showing that our proposed algorithms achieve matching or superior
performance compared with baselines.
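For intuition, a minimal semantic cache with cosine-similarity lookup and plain LRU eviction is sketched below; the paper's actual contribution, a learning-based eviction policy with theoretical guarantees, is deliberately not reproduced here, and embed is a hypothetical embedding callable:

    import numpy as np

    class SemanticCache:
        """Minimal semantic cache: cosine-similarity lookup, LRU eviction."""

        def __init__(self, embed, capacity: int = 128, threshold: float = 0.9):
            self.embed, self.capacity, self.threshold = embed, capacity, threshold
            self.entries: list[tuple[np.ndarray, str, str]] = []  # (vec, q, resp)

        def get(self, query: str):
            q = self.embed(query)
            best, best_sim = None, self.threshold
            for i, (v, _, _) in enumerate(self.entries):
                sim = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
                if sim >= best_sim:
                    best, best_sim = i, sim
            if best is None:
                return None  # miss -> caller runs the LLM and calls put()
            entry = self.entries.pop(best)
            self.entries.append(entry)  # move hit to MRU position
            return entry[2]

        def put(self, query: str, response: str):
            if len(self.entries) >= self.capacity:
                self.entries.pop(0)  # evict least recently used
            self.entries.append((self.embed(query), query, response))

    # Toy usage with a bag-of-characters "embedding" as a stand-in.
    embed = lambda s: np.bincount([ord(c) % 32 for c in s.lower()],
                                  minlength=32).astype(float)
    cache = SemanticCache(embed, capacity=4, threshold=0.95)
    cache.put("What is the capital of France?", "Paris")
    print(cache.get("what is the capital of france"))  # hit via similarity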
GLiClass Generalist Lightweight Model for Sequence Classification Tasks
Authors: Ihor Stepanov, Mykhailo Shtopko, Dmytro Vodianytskyi, Oleksandr Lukashov, Alexander Yavorskyi, Mykyta Yaroshenko
2025-08-11
Classification is one of the most widespread tasks in AI applications,
often serving as the first step in filtering, sorting, and categorizing data.
Since modern AI systems must handle large volumes of input data and early
pipeline stages can propagate errors downstream, achieving high efficiency and
accuracy is critical. Moreover, classification requirements can change
dynamically based on user needs, necessitating models with strong zero-shot
capabilities. While generative LLMs have become mainstream for zero-shot
classification due to their versatility, they suffer from inconsistent
instruction following and computational inefficiency. Cross-encoders, commonly
used as rerankers in RAG pipelines, face a different bottleneck: they must
process text-label pairs sequentially, significantly reducing efficiency with
large label sets. Embedding-based approaches offer good efficiency but struggle
with complex scenarios involving logical and semantic constraints. We propose
GLiClass, a novel method that adapts the GLiNER architecture for sequence
classification tasks. Our approach achieves strong accuracy and efficiency
comparable to embedding-based methods, while maintaining the flexibility needed
for zero-shot and few-shot learning scenarios. Additionally, we adapted
proximal policy optimization (PPO) for multi-label text classification,
enabling training classifiers in data-sparse conditions or from human feedback.
LaVieID Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation
Authors: Wenhui Song, Hanhui Li, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang
2025-08-11
In this paper, we present LaVieID, a novel \underline{l}ocal
\underline{a}utoregressive \underline{vi}d\underline{e}o diffusion framework
designed to tackle the challenging \underline{id}entity-preserving
text-to-video task. The key idea of LaVieID is to mitigate the loss of identity
information inherent in the stochastic global generation process of diffusion
transformers (DiTs) from both spatial and temporal perspectives. Specifically,
unlike the global and unstructured modeling of facial latent states in existing
DiTs, LaVieID introduces a local router to explicitly represent latent states
by weighted combinations of fine-grained local facial structures. This
alleviates undesirable feature interference and encourages DiTs to capture
distinctive facial characteristics. Furthermore, a temporal autoregressive
module is integrated into LaVieID to refine denoised latent tokens before video
decoding. This module divides latent tokens temporally into chunks, exploiting
their long-range temporal dependencies to predict biases for rectifying tokens,
thereby significantly enhancing inter-frame identity consistency. Consequently,
LaVieID can generate high-fidelity personalized videos and achieve
state-of-the-art performance. Our code and models are available at
https://github.com/ssugarwh/LaVieID.
HGMF A Hierarchical Gaussian Mixture Framework for Scalable Tool Invocation within the Model Context Protocol
Authors: Wenpeng Xing, Zhipeng Chen, Changting Lin, Meng Han
2025-08-11
Invoking external tools enables Large Language Models (LLMs) to perform
complex, real-world tasks, yet selecting the correct tool from large,
hierarchically-structured libraries remains a significant challenge. The
limited context windows of LLMs and noise from irrelevant options often lead to
low selection accuracy and high computational costs. To address this, we
propose the Hierarchical Gaussian Mixture Framework (HGMF), a probabilistic
method for scalable tool invocation. HGMF first maps the user query and
all tool descriptions into a unified semantic space. The framework then
operates in two stages: it clusters servers using a Gaussian Mixture Model
(GMM) and filters them based on the query's likelihood. Subsequently, it
applies the same GMM-based clustering and filtering to the tools associated
with the selected servers. This hierarchical process produces a compact,
high-relevance candidate set, simplifying the final selection task for the
LLM.
Experiments on a public dataset show that HGMF significantly improves tool
selection accuracy while reducing inference latency, confirming the framework's
scalability and effectiveness for large-scale tool libraries.
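A sketch of one HGMF-style stage using scikit-learn, where items (servers, then their tools) are clustered with a GMM and only clusters the query is likely under survive; the embeddings and parameters are illustrative assumptions:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def gmm_filter(query_vec: np.ndarray, item_vecs: np.ndarray,
                   n_components: int = 4, top: int = 2) -> np.ndarray:
        """One HGMF-style stage: cluster items, keep query-likely clusters.

        Fits a GMM over item embeddings, scores components by the query's
        posterior responsibility, and returns indices of items in the `top`
        highest-scoring clusters. Applying this first to server embeddings
        and then to the surviving servers' tools gives the two-stage
        hierarchy described above.
        """
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        labels = gmm.fit_predict(item_vecs)
        comp_scores = gmm.predict_proba(query_vec[None, :])[0]
        keep = np.argsort(comp_scores)[-top:]
        return np.where(np.isin(labels, keep))[0]

    rng = np.random.default_rng(0)
    servers = rng.standard_normal((40, 16))  # stand-in server embeddings
    query = servers[3] + 0.05 * rng.standard_normal(16)
    print(gmm_filter(query, servers))        # compact candidate index set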
Towards Theoretical Understanding of Transformer Test-Time Computing Investigation on In-Context Linear Regression
Authors: Xingwu Chen, Miao Lu, Beining Wu, Difan Zou
2025-08-11
Using more test-time computation during language model inference, such as
generating more intermediate thoughts or sampling multiple candidate answers,
has proven effective in significantly improving model performance. This paper
takes an initial step toward bridging the gap between practical language model
inference and theoretical analysis by incorporating randomness and
sampling. We focus on in-context linear regression with continuous/binary
coefficients, where our framework simulates language model decoding through
noise injection and binary coefficient sampling. Through this framework, we
provide detailed analyses of widely adopted inference techniques. Supported by
empirical results, our theoretical framework and analysis demonstrate the
potential for offering new insights into understanding inference behaviors in
real-world language models.
Grounding Natural Language for Multi-agent Decision-Making with Multi-agentic LLMs
Authors: Dom Huh, Prasant Mohapatra
2025-08-10
Language is a ubiquitous tool that is foundational to reasoning and
collaboration, ranging from everyday interactions to sophisticated
problem-solving tasks. The establishment of a common language can serve as a
powerful asset in ensuring clear communication and understanding amongst
agents, facilitating desired coordination and strategies. In this work, we
extend the capabilities of large language models (LLMs) by integrating them
with advancements in multi-agent decision-making algorithms. We propose a
systematic framework for the design of multi-agentic large language models
(LLMs), focusing on key integration practices. These include advanced prompt
engineering techniques, the development of effective memory architectures,
multi-modal information processing, and alignment strategies through
fine-tuning algorithms. We evaluate these design choices through extensive
ablation studies on classic game settings with significant underlying social
dilemmas and game-theoretic considerations.
Investigating 1-Bit Quantization in Transformer-Based Top Tagging
Authors: Saurabh Rai, Prisha, Jitendra Kumar
2025-08-10
The increasing scale of deep learning models in high-energy physics (HEP) has
posed challenges to their deployment on low-power, latency-sensitive platforms,
such as FPGAs and ASICs used in trigger systems, as well as in offline data
reconstruction and processing pipelines. In this work, we introduce BitParT, a
1-bit Transformer-based architecture designed specifically for the top-quark
tagging task. Building upon recent advances in ultra-low-bit large language
models (LLMs), we extended these ideas to the HEP domain by developing a
binary-weight variant (BitParT) of the Particle Transformer (ParT) model. Our
findings indicate a potential for substantial reduction in model size and
computational complexity, while maintaining high tagging performance. We
benchmark BitParT on the public Top Quark Tagging Reference Dataset and show
that it achieves competitive performance relative to its full-precision
counterpart. This work demonstrates the design of extremely quantized models for
physics applications, paving the way for real-time inference in collider
experiments with minimal and optimized resource usage.
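For reference, the standard binary-weight recipe (sign of the weights plus a per-row mean-absolute scale) is sketched below; whether BitParT uses exactly this scheme is an assumption based on the abstract:

    import numpy as np

    def binarize_weights(w: np.ndarray):
        """1-bit weight quantization in the XNOR-Net/BitNet style.

        Each weight matrix is replaced by sign(w) plus one per-output-row
        scale equal to the mean absolute value.
        """
        scale = np.abs(w).mean(axis=1, keepdims=True)  # per-row scale
        return np.sign(w), scale

    def binary_linear(x: np.ndarray, w_bin: np.ndarray, scale: np.ndarray):
        # Full-precision activations times {-1, +1} weights, then rescale.
        return (x @ w_bin.T) * scale.T

    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 16))   # a full-precision layer
    x = rng.standard_normal((4, 16))   # a batch of particle features
    w_bin, s = binarize_weights(w)
    err = np.abs(x @ w.T - binary_linear(x, w_bin, s)).mean()
    print(f"mean approximation error: {err:.3f}")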
LET-US Long Event-Text Understanding of Scenes
Authors: Rui Chen, Xingyu Chen, Shaoan Wang, Shihan Kong, Junzhi Yu
2025-08-10
Event cameras output event streams as sparse, asynchronous data with
microsecond-level temporal resolution, enabling visual perception with low
latency and a high dynamic range. While existing Multimodal Large Language
Models (MLLMs) have achieved significant success in understanding and analyzing
RGB video content, they either fail to interpret event streams effectively or
remain constrained to very short sequences. In this paper, we introduce LET-US,
a framework for long event-stream--text comprehension that employs an adaptive
compression mechanism to reduce the volume of input events while preserving
critical visual details. LET-US thus establishes a new frontier in cross-modal
inferential understanding over extended event sequences. To bridge the
substantial modality gap between event streams and textual representations, we
adopt a two-stage optimization paradigm that progressively equips our model
with the capacity to interpret event-based scenes. To handle the voluminous
temporal information inherent in long event streams, we leverage text-guided
cross-modal queries for feature reduction, augmented by hierarchical clustering
and similarity computation to distill the most representative event features.
Moreover, we curate and construct a large-scale event-text aligned dataset to
train our model, achieving tighter alignment of event features within the
embedding space. We also develop a comprehensive benchmark covering a diverse
set of tasks -- reasoning, captioning, classification, temporal localization
and moment retrieval. Experimental results demonstrate that LET-US outperforms
prior state-of-the-art MLLMs in both descriptive accuracy and semantic
comprehension on long-duration event streams. All datasets, codes, and models
will be publicly available.
Efficient Edge LLMs Deployment via HessianAware Quantization and CPU GPU Collaborative Inference
Authors: Tuo Zhang, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang
2025-08-10
With the breakthrough progress of large language models (LLMs) in natural
language processing and multimodal tasks, efficiently deploying them on
resource-constrained edge devices has become a critical challenge. The Mixture
of Experts (MoE) architecture enhances model capacity through sparse
activation, but faces two major difficulties in practical deployment: (1) The
presence of numerous outliers in activation distributions leads to severe
degradation in quantization accuracy for both activations and weights,
significantly impairing inference performance; (2) Under limited memory,
efficient offloading and collaborative inference of expert modules struggle to
balance latency and throughput. To address these issues, this paper proposes an
efficient MoE edge deployment scheme based on Hessian-Aware Quantization (HAQ)
and CPU-GPU collaborative inference. First, by introducing smoothed Hessian
matrix estimation, we achieve joint 8-bit quantization of activations and
weights, which significantly alleviates the accuracy loss caused by outliers
while ensuring efficient implementation on mainstream hardware. Second, we
design an expert-level collaborative offloading and inference mechanism, which,
combined with expert activation path statistics, enables efficient deployment
and scheduling of expert modules between CPU and GPU, greatly reducing memory
footprint and inference latency. Extensive experiments validate the
effectiveness of our method on mainstream large models such as the OPT series
and Mixtral 8x7B: on datasets like Wikitext2 and C4, the inference accuracy of
the quantized model approaches that of the full-precision model, while
GPU memory usage is reduced by about 60%, and inference latency is
significantly improved.
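A simplified sketch of joint 8-bit quantization with outlier smoothing in the SmoothQuant style; the paper's Hessian-aware scales are omitted, so treat this as an illustrative baseline rather than the HAQ method:

    import numpy as np

    def smooth_and_quantize(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
        """Joint int8 quantization after migrating activation outliers.

        A per-input-channel scale moves outlier magnitude from activations
        into weights (a mathematically equivalent reparameterization), then
        symmetric int8 quantization is applied to both operands.
        """
        s = (np.abs(x).max(axis=0) ** alpha) / (np.abs(w).max(axis=0) ** (1 - alpha))
        s = np.clip(s, 1e-5, None)
        x_s, w_s = x / s, w * s

        def q8(t):
            scale = np.abs(t).max() / 127.0
            return np.round(t / scale).astype(np.int8), scale

        (xq, sx), (wq, sw) = q8(x_s), q8(w_s)
        return (xq.astype(np.int32) @ wq.astype(np.int32).T) * (sx * sw)

    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 64)); x[:, 7] *= 30  # inject an outlier channel
    w = rng.standard_normal((32, 64))
    err = np.abs(x @ w.T - smooth_and_quantize(x, w)).mean()
    print(f"mean int8 matmul error: {err:.4f}")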
BEVANet Bilateral Efficient Visual Attention Network for Real-Time Semantic Segmentation
Authors: Ping-Mao Huang, I-Tien Chao, Ping-Chia Huang, Jia-Wei Liao, Yung-Yu Chuang
2025-08-10
Real-time semantic segmentation presents the dual challenge of designing
efficient architectures that capture large receptive fields for semantic
understanding while also refining detailed contours. Vision transformers model
long-range dependencies effectively but incur high computational cost. To
address these challenges, we introduce the Large Kernel Attention (LKA)
mechanism. Our proposed Bilateral Efficient Visual Attention Network (BEVANet)
expands the receptive field to capture contextual information and extracts
visual and structural features using Sparse Decomposed Large Separable Kernel
Attentions (SDLSKA). The Comprehensive Kernel Selection (CKS) mechanism
dynamically adapts the receptive field to further enhance performance.
Furthermore, the Deep Large Kernel Pyramid Pooling Module (DLKPPM) enriches
contextual features by synergistically combining dilated convolutions and large
kernel attention. The bilateral architecture facilitates frequent branch
communication, and the Boundary Guided Adaptive Fusion (BGAF) module enhances
boundary delineation by integrating spatial and semantic features under
boundary guidance. BEVANet achieves real-time segmentation at 33 FPS, yielding
79.3% mIoU without pretraining and 81.0% mIoU on Cityscapes after ImageNet
pretraining, demonstrating state-of-the-art performance. The code and model are
available at https://github.com/maomao0819/BEVANet.
Tasa Thermal-aware 3D-Stacked Architecture Design with Bandwidth Sharing for LLM Inference
Authors: Siyuan He, Peiran Yan, Yandong He, Youwei Zhuo, Tianyu Jia
2025-08-10
The autoregressive decoding in LLMs is the major inference bottleneck due to
the memory-intensive operations and limited hardware bandwidth. 3D-stacked
architecture is a promising solution with significantly improved memory
bandwidth, which vertically stacks multiple DRAM dies on top of a logic die.
However, our experiments also show that the 3D-stacked architecture faces more
severe thermal issues than the 2D architecture, in terms of peak temperature,
thermal gradient, and scalability. To better exploit the potential of 3D-stacked
architecture, we present Tasa, a heterogeneous architecture with cross-stack
thermal optimizations to balance the temperature distribution and maximize the
performance under the thermal constraints. A high-performance core is designed
for compute-intensive operations, while a high-efficiency core is used for
memory-intensive operators, e.g. attention layers. Furthermore, we propose a
bandwidth sharing scheduling to improve the bandwidth utilization in such
heterogeneous architecture. Extensive thermal experiments show that our Tasa
architecture demonstrates greater scalability compared with the homogeneous
3D-stacked architecture, i.e. up to 5.55°C, 9.37°C, and 7.91°C peak
temperature reduction for 48-, 60-, and 72-core configurations. Our
experiments on Llama-65B and GPT-3 66B inference also demonstrate that 2.85x
and 2.21x speedups are obtained over the GPU baselines and a state-of-the-art
heterogeneous PIM-based LLM accelerator.
LP-Spec Leveraging LPDDR PIM for Efficient LLM Mobile Speculative Inference with Architecture-Dataflow Co-Optimization
Authors: Siyuan He, Zhantong Zhu, Yandong He, Tianyu Jia
2025-08-10
LLM inference on mobile devices faces significant challenges due to limited
memory bandwidth and computational resources. To address these issues,
speculative inference and processing-in-memory (PIM) techniques have been
explored at the algorithmic and hardware levels. However, speculative inference
results in more compute-intensive GEMM operations, creating new design
trade-offs for existing GEMV-accelerated PIM architectures. Furthermore, there
exists a significant amount of redundant draft tokens in tree-based speculative
inference, necessitating efficient token management schemes to minimize energy
consumption. In this work, we present LP-Spec, an architecture-dataflow
co-design leveraging hybrid LPDDR5 performance-enhanced PIM architecture with
draft token pruning and dynamic workload scheduling to accelerate
speculative inference. A near-data memory controller is proposed to enable data
reallocation between DRAM and PIM banks. Furthermore, a data allocation unit
based on the hardware-aware draft token pruner is developed to minimize energy
consumption and fully exploit parallel execution opportunities. Compared to
end-to-end LLM inference on other mobile solutions such as mobile NPUs or
GEMV-accelerated PIMs, our LP-Spec achieves 13.21x, 7.56x, and 99.87x
improvements in performance, energy efficiency, and energy-delay-product (EDP).
Compared with prior AttAcc PIM and RTX 3090 GPU, LP-Spec can obtain 12.83x and
415.31x EDP reduction benefits.
Bridging Semantic Logic Gaps A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization
Authors: Songlin Li, Zhiqing Guo, Yuanman Li, Zeyu Li, Yunfeng Diao, Gaobo Yang, Liejun Wang
2025-08-10
The existing image manipulation localization (IML) models mainly rely on
visual cues but ignore the semantic logical relationships between content
features. In fact, the content semantics conveyed by real images often conform
to human cognitive laws. However, image manipulation technology usually
destroys the internal relationship between content features, thus leaving
semantic clues for IML. In this paper, we propose a cognition-inspired
multimodal boundary-preserving network (CMB-Net). Specifically, CMB-Net
utilizes large language models (LLMs) to analyze manipulated regions within
images and generate prompt-based textual information to compensate for the lack
of semantic relationships in the visual information. Considering that the
erroneous texts induced by hallucination from LLMs will damage the accuracy of
IML, we propose an image-text central ambiguity module (ITCAM). It assigns
weights to the text features by quantifying the ambiguity between text and
image features, thereby ensuring the beneficial impact of textual information.
We also propose an image-text interaction module (ITIM) that aligns visual and
text features using a correlation matrix for fine-grained interaction. Finally,
inspired by invertible neural networks, we propose a restoration edge decoder
(RED) that mutually generates input and output features to preserve boundary
information in manipulated regions without loss. Extensive experiments show
that CMB-Net outperforms most existing IML models.
DySK-Attn A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention
Authors: Kabir Khan, Priya Sharma, Arjun Mehta, Neha Gupta, Ravi Narayanan
2025-08-10
Large Language Models (LLMs) suffer from a critical limitation: their
knowledge is static and quickly becomes outdated. Retraining these massive
models is computationally prohibitive, while existing knowledge editing
techniques can be slow and may introduce unforeseen side effects. To address
this, we propose DySK-Attn, a novel framework that enables LLMs to efficiently
integrate real-time knowledge from a dynamic external source. Our approach
synergizes an LLM with a dynamic Knowledge Graph (KG) that can be updated
instantaneously. The core of our framework is a sparse knowledge attention
mechanism, which allows the LLM to perform a coarse-to-fine grained search,
efficiently identifying and focusing on a small, highly relevant subset of
facts from the vast KG. This mechanism avoids the high computational cost of
dense attention over the entire knowledge base and mitigates noise from
irrelevant information. We demonstrate through extensive experiments on
time-sensitive question-answering tasks that DySK-Attn significantly
outperforms strong baselines, including standard Retrieval-Augmented Generation
(RAG) and model editing techniques, in both factual accuracy for updated
knowledge and computational efficiency. Our framework offers a scalable and
effective solution for building LLMs that can stay current with the
ever-changing world.
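A minimal sketch of the coarse-to-fine idea: score cluster centroids first, then attend only over facts inside the surviving clusters; all shapes and names are illustrative assumptions:

    import numpy as np

    def coarse_to_fine_facts(q: np.ndarray, cluster_centroids: np.ndarray,
                             cluster_facts: list[np.ndarray],
                             n_clusters: int = 2, n_facts: int = 5):
        """Coarse-to-fine sparse retrieval over KG fact embeddings.

        Stage 1 (coarse): keep the clusters whose centroids best match the
        query. Stage 2 (fine): softmax-attend only over facts in those
        clusters, avoiding dense attention over the whole KG.
        """
        coarse = np.argsort(cluster_centroids @ q)[-n_clusters:]   # stage 1
        cand = np.concatenate([cluster_facts[c] for c in coarse])  # shortlist
        scores = cand @ q                                          # stage 2
        top = np.argsort(scores)[-n_facts:]
        weights = np.exp(scores[top] - scores[top].max())
        weights /= weights.sum()                                   # softmax
        return weights @ cand[top]  # attended fact vector for the LLM

    rng = np.random.default_rng(0)
    centroids = rng.standard_normal((16, 32))                # 16 KG clusters
    facts = [rng.standard_normal((100, 32)) for _ in range(16)]
    print(coarse_to_fine_facts(rng.standard_normal(32), centroids, facts).shape)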
How Effectively Can Large Language Models Connect SNP Variants and ECG Phenotypes for Cardiovascular Risk Prediction?
Authors: Niranjana Arun Menon, Iqra Farooq, Yulong Li, Sara Ahmed, Yutong Xie, Muhammad Awais, Imran Razzak
2025-08-10
Cardiovascular disease (CVD) prediction remains a tremendous challenge due to
its multifactorial etiology and global burden of morbidity and mortality.
Despite the growing availability of genomic and electrophysiological data,
extracting biologically meaningful insights from such high-dimensional, noisy,
and sparsely annotated datasets remains a non-trivial task. Recently, LLMs have
been applied effectively to predict structural variations in biological
sequences. In this work, we explore the potential of fine-tuned LLMs to predict
cardiac diseases and SNPs potentially leading to CVD risk using genetic markers
derived from high-throughput genomic profiling. We investigate the effect of
genetic patterns associated with cardiac conditions and evaluate how LLMs can
learn latent biological relationships from structured and semi-structured
genomic data obtained by mapping genetic aspects that are inherited from the
family tree. By framing the problem as a Chain of Thought (CoT) reasoning task,
the models are prompted to generate disease labels and articulate informed
clinical deductions across diverse patient profiles and phenotypes. The
findings highlight the promise of LLMs in contributing to early detection, risk
assessment, and ultimately, the advancement of personalized medicine in cardiac
care.
From Nodes to Narratives Explaining Graph Neural Networks with LLMs and Graph Context
Authors: Peyman Baghershahi, Gregoire Fournier, Pranav Nyati, Sourav Medya
2025-08-09
Graph Neural Networks (GNNs) have emerged as powerful tools for learning over
structured data, including text-attributed graphs, which are common in domains
such as citation networks, social platforms, and knowledge graphs. GNNs are not
inherently interpretable and thus, many explanation methods have been proposed.
However, existing explanation methods often struggle to generate interpretable,
fine-grained rationales, especially when node attributes include rich natural
language. In this work, we introduce LOGIC, a lightweight, post-hoc framework
that uses large language models (LLMs) to generate faithful and interpretable
explanations for GNN predictions. LOGIC projects GNN node embeddings into the
LLM embedding space and constructs hybrid prompts that interleave soft prompts
with textual inputs from the graph structure. This enables the LLM to reason
about GNN internal representations and produce natural language explanations
along with concise explanation subgraphs. Our experiments across four
real-world TAG datasets demonstrate that LOGIC achieves a favorable trade-off
between fidelity and sparsity, while significantly improving human-centric
metrics such as insightfulness. LOGIC sets a new direction for LLM-based
explainability in graph learning by aligning GNN internals with human
reasoning.
Large Language Model Evaluated Stand-alone Attention-Assisted Graph Neural Network with Spatial and Structural Information Interaction for Precise Endoscopic Image Segmentation
Authors: Juntong Fan, Shuyi Fan, Debesh Jha, Changsheng Fang, Tieyong Zeng, Hengyong Yu, Dayang Wang
2025-08-09
Accurate endoscopic image segmentation on the polyps is critical for early
colorectal cancer detection. However, this task remains challenging due to low
contrast with surrounding mucosa, specular highlights, and indistinct
boundaries. To address these challenges, we propose FOCUS-Med, which stands for
Fusion of spatial and structural graph with attentional context-aware polyp
segmentation in endoscopic medical imaging. FOCUS-Med integrates a Dual Graph
Convolutional Network (Dual-GCN) module to capture contextual spatial and
topological structural dependencies. This graph-based representation enables
the model to better distinguish polyps from background tissues by leveraging
topological cues and spatial connectivity, which are often obscured in raw
image intensities. It enhances the model's ability to preserve boundaries and
delineate complex shapes typical of polyps. In addition, a location-fused
stand-alone self-attention is employed to strengthen global context
integration. To bridge the semantic gap between encoder-decoder layers, we
incorporate a trainable weighted fast normalized fusion strategy for efficient
multi-scale aggregation. Notably, we are the first to introduce the use of a
Large Language Model (LLM) to provide detailed qualitative evaluations of
segmentation quality. Extensive experiments on public benchmarks demonstrate
that FOCUS-Med achieves state-of-the-art performance across five key metrics,
underscoring its effectiveness and clinical potential for AI-assisted
colonoscopy.
Vec2Summ Text Summarization via Probabilistic Sentence Embeddings
Authors: Mao Li, Fred Conrad, Johann Gagnon-Bartsch
2025-08-09
We propose Vec2Summ, a novel method for abstractive summarization that frames
the task as semantic compression. Vec2Summ represents a document collection
using a single mean vector in the semantic embedding space, capturing the
central meaning of the corpus. To reconstruct fluent summaries, we perform
embedding inversion -- decoding this mean vector into natural language using a
generative language model. To improve reconstruction quality and capture some
degree of topical variability, we introduce stochasticity by sampling from a
Gaussian distribution centered on the mean. This approach is loosely analogous
to bagging in ensemble learning, where controlled randomness encourages more
robust and varied outputs. Vec2Summ addresses key limitations of LLM-based
summarization methods. It avoids context-length constraints, enables
interpretable and controllable generation via semantic parameters, and scales
efficiently with corpus size -- requiring only a fixed number of parameters.
Empirical results show that Vec2Summ produces coherent summaries for topically
focused, order-invariant corpora, with performance comparable to direct
summarization in terms of thematic coverage and efficiency, albeit with less
fine-grained detail. These results underscore Vec2Summ's potential in settings
where scalability, semantic control, and corpus-level abstraction are
prioritized.
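The compression step is easy to state concretely: average the corpus embeddings, then sample from a Gaussian centered on that mean. The embedding-inversion model that turns sampled vectors back into text is outside this sketch, and the noise scale is an assumed hyperparameter:

    import numpy as np

    def vec2summ_vectors(doc_embeddings: np.ndarray, n_samples: int = 3,
                         scale: float = 0.1, seed: int = 0) -> np.ndarray:
        """Semantic-compression step of the approach described above.

        Summarizes a corpus by its mean embedding, then samples from a
        Gaussian centered on that mean to inject topical variability. Each
        sampled vector would be passed to an embedding-inversion decoder
        (e.g., a vec2text-style model) to produce fluent sentences.
        """
        mu = doc_embeddings.mean(axis=0)        # corpus-level summary vector
        cov = scale * np.eye(mu.shape[0])       # isotropic Gaussian around mu
        rng = np.random.default_rng(seed)
        return rng.multivariate_normal(mu, cov, size=n_samples)

    # Toy corpus of 100 sentence embeddings in a 32-dim space.
    rng = np.random.default_rng(1)
    embs = rng.standard_normal((100, 32)) + 2.0  # topically focused cluster
    samples = vec2summ_vectors(embs)
    print(samples.shape)                          # (3, 32) -> invert to text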
Narrative Memory in Machines Multi-Agent Arc Extraction in Serialized TV
Authors: Roberto Balestri, Guglielmo Pescatore
2025-08-09
Serialized television narratives present significant analytical challenges
due to their complex, temporally distributed storylines that necessitate
sophisticated information management. This paper introduces a multi-agent
system (MAS) designed to extract and analyze narrative arcs by implementing
principles of computational memory architectures. The system conceptualizes
narrative understanding through analogues of human memory: Large Language
Models (LLMs) provide a form of semantic memory for general narrative patterns,
while a vector database stores specific arc progressions as episodic memories.
A multi-agent workflow simulates working memory processes to integrate these
information types. Tested on the first season of Grey's Anatomy (ABC 2005-),
the MAS identifies three arc types: Anthology (self-contained), Soap
(relationship-focused), and Genre-Specific. These arcs and their episodic
developments are stored in a vector database, facilitating structured analysis
and semantic comparison. To bridge automation with critical interpretation, a
graphical interface enables human oversight and refinement of the system's
narrative memory. While demonstrating strong performance in identifying
Anthology Arcs and character entities, the system's reliance on textual
paratexts (episode summaries) revealed limitations in discerning overlapping
arcs and opaque dynamics, underscoring the challenges in computational memory
consolidation versus human holistic understanding. This memory-centric approach
highlights the potential of combining AI-driven memory processing with human
expertise. Beyond television, it offers promise for serialized written formats
where narrative is entirely text-based. Future work will focus on integrating
multimodal inputs to enrich episodic memory, refining memory integration
mechanisms within the MAS, and expanding testing across diverse genres.
SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency
Authors: Kwanhee Kyung, Sungmin Yun, Jung Ho Ahn
2025-08-09
Large Language Models (LLMs) applying Mixture-of-Experts (MoE) scale to
trillions of parameters but require vast memory, motivating a line of research
to offload expert weights from fast-but-small DRAM (HBM) to denser Flash SSDs.
While SSDs provide cost-effective capacity, their read energy per bit is
substantially higher than that of DRAM. This paper quantitatively analyzes the
energy implications of offloading MoE expert weights to SSDs during the
critical decoding stage of LLM inference. Our analysis, comparing SSD, CPU memory
(DDR), and HBM storage scenarios for models like DeepSeek-R1, reveals that
offloading MoE weights to current SSDs drastically increases
per-token-generation energy consumption (e.g., by up to ~12x compared to the
HBM baseline), dominating the total inference energy budget. Although
techniques like prefetching effectively hide access latency, they cannot
mitigate this fundamental energy penalty. We further explore future
technological scaling, finding that the inherent sparsity of MoE models could
potentially make SSDs energy-viable if Flash read energy improves
significantly, roughly by an order of magnitude.
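The energy argument can be made concrete with back-of-the-envelope arithmetic; the constants below are illustrative assumptions, not the paper's measurements:

    # Per-token weight-read energy for MoE expert offloading, by memory tier.
    READ_ENERGY_PJ_PER_BIT = {"HBM": 3.0, "DDR": 15.0, "SSD": 250.0}  # assumed

    active_params = 5e9        # assumed activated expert parameters per token
    bits_per_param = 8         # assumed 8-bit weights
    bits_read = active_params * bits_per_param

    for tier, pj_per_bit in READ_ENERGY_PJ_PER_BIT.items():
        joules = bits_read * pj_per_bit * 1e-12
        print(f"{tier}: {joules:.2f} J per generated token")

    # With these assumed constants, SSD reads cost ~83x more than HBM per
    # token, matching the qualitative conclusion that weight reads dominate
    # decode energy unless Flash read energy improves by roughly an order
    # of magnitude.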
Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models
Authors: Zhijun Tu, Hanting Chen, Siqi Liu, Chuanjian Liu, Jian Li, Jie Hu, Yunhe Wang
2025-08-09
1-bit quantization offers significant advantages in reducing storage and
computational costs. However, existing methods typically train 1-bit LLMs from
scratch, failing to fully leverage pre-trained models. This results in high
training costs and notable accuracy degradation. We identify that the large gap
between full precision and 1-bit representations makes direct adaptation
difficult. In this paper, we introduce a consistent progressive training for
both forward and backward, smoothly converting the floating-point weights into
the binarized ones. Additionally, we incorporate binary-aware initialization
and dual-scaling compensation to reduce the difficulty of progressive training
and improve the performance. Experimental results on LLMs of various sizes
demonstrate that our method outperforms existing approaches. Our results show
that high-performance 1-bit LLMs can be achieved using pre-trained models,
eliminating the need for expensive training from scratch.
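A toy version of progressive conversion, linearly mixing full-precision weights with their scaled sign as training advances; the paper's exact schedule and dual-scaling compensation are not given in the abstract, so this is an assumed variant:

    import numpy as np

    def progressive_binarize(w: np.ndarray, t: float) -> np.ndarray:
        """Smoothly interpolate full-precision weights toward binary ones.

        At schedule position t in [0, 1], weights are a convex mix of the
        original floats and their scaled sign, so both forward and backward
        passes see a gradually widening quantization gap instead of an
        abrupt jump. Dual scaling is simplified to a single scalar here.
        """
        scale = np.abs(w).mean()
        w_bin = scale * np.sign(w)
        return (1.0 - t) * w + t * w_bin

    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 4))
    for t in (0.0, 0.5, 1.0):                 # training-time schedule steps
        print(f"t={t}:", np.round(progressive_binarize(w, t)[0], 2))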
Fed MobiLLM Efficient Federated LLM Fine-Tuning over Heterogeneous Mobile Devices via Server Assisted Side-Tuning
Authors: Xingke Yang, Liang Li, Sicong Li, Liwei Guan, Hao Wang, Xiaoqi Qi, Jiang Liu, Xin Fu, Miao Pan
2025-08-09
Collaboratively fine-tuning (FT) large language models (LLMs) over
heterogeneous mobile devices fosters immense potential applications of
personalized intelligence. However, such a vision faces critical system
challenges. Conventional federated LLM FT approaches place prohibitive
computational and memory burdens on mobile hardware, and their synchronous
model aggregation protocols stall for slower devices. In this paper, we propose
Fed MobiLLM, a novel design to facilitate efficient federated LLM FT across
mobile devices with diverse computing/communication speeds and local model
architectures. In particular, Fed MobiLLM implements a pioneering
server-assisted federated side-tuning paradigm. Briefly, mobile devices perform
lightweight forward propagation computations on local data using their frozen
pre-scaled backbone LLMs, and then upload selected intermediate activations.
The server trains a shared side-network independently, eliminating client-side
backpropagation and enabling asynchronous updates. To bridge model
heterogeneity across different devices, we introduce an adaptive layer-wise
feature alignment method, which ensures consistent representations for
collaboratively tuning a shared side network. Extensive experimental results
demonstrate that Fed MobiLLM can maintain robust fine-tuning performance while
achieving extremely low on-device memory, with at least 95.2% reduction in
computation overhead, 93.2% reduction in communication costs and 5.1x faster
convergence compared to existing methods, validating its efficacy for practical
adaptation over heterogeneous mobile devices.
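The division of labor is straightforward to sketch: clients only run a frozen forward pass and upload activations, while the server takes gradient steps on a small shared side network. Everything below (shapes, loss, update rule) is an illustrative assumption, not Fed MobiLLM's actual protocol:

    import numpy as np

    rng = np.random.default_rng(0)

    def client_forward(x: np.ndarray, frozen_backbone: np.ndarray) -> np.ndarray:
        """On-device: forward pass through frozen weights; no backprop needed."""
        return np.maximum(x @ frozen_backbone, 0.0)  # upload these activations

    def server_step(acts: np.ndarray, labels: np.ndarray, side_w: np.ndarray,
                    lr: float = 0.1) -> np.ndarray:
        """Server-side: one gradient step on the shared side network."""
        preds = acts @ side_w
        grad = acts.T @ (preds - labels) / len(acts)  # MSE gradient
        return side_w - lr * grad

    backbone = rng.standard_normal((16, 32))  # frozen on each device
    side_w = np.zeros((32, 1))                # trainable, held by the server
    for _ in range(50):                        # asynchronous client uploads
        x = rng.standard_normal((8, 16))
        y = (x.sum(axis=1, keepdims=True) > 0).astype(float)
        side_w = server_step(client_forward(x, backbone), y, side_w)
    print("side network norm:", float(np.linalg.norm(side_w)))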
Pushing the Envelope of LLM Inference on AI-PC
Authors: Evangelos Georganas, Dhiraj Kalamkar, Alexander Heinecke
2025-08-08
The advent of ultra-low-bit models (1/1.58/2-bit), which match the
perplexity and end-task performance of their full-precision counterparts using
the same model size, is ushering in a new era of LLM inference for
resource-constrained environments such as edge devices and AI PCs. While these
advances promise models that are more cost-effective in terms of
latency, memory, throughput, and energy consumption, the computational
efficiency of state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp)
used to deploy them remains underexplored. In this work, we take a bottom-up
approach: we first design and implement 1-bit and 2-bit microkernels optimized
for modern CPUs, achieving peak computational efficiency across a variety of
CPU platforms. We integrate these microkernels into a state-of-the-art LLM
inference framework, namely PyTorch-TPP, and present end-to-end inference
results with 2-bit models that outperform the current SOTA runtime bitnet.cpp
by up to 2.2x, and deliver up to 7x speedup compared to the 16-bit model
inference. Our optimized runtime advances the state of LLM inference on AI PCs
and edge devices, paving the way for efficient deployment of ultra-low-bit
models.
CISO Species Distribution Modeling Conditioned on Incomplete Species Observations
Authors: Hager Radi Abdelwahed, Mélisande Teng, Robin Zbinden, Laura Pollock, Hugo Larochelle, Devis Tuia, David Rolnick
2025-08-08
Species distribution models (SDMs) are widely used to predict species'
geographic distributions, as critical tools for ecological research and
conservation planning. Typically, SDMs relate species occurrences to
environmental variables representing abiotic factors, such as temperature,
precipitation, and soil properties. However, species distributions are also
strongly influenced by biotic interactions with other species, which are often
overlooked. While some methods partially address this limitation by
incorporating biotic interactions, they often assume symmetrical pairwise
relationships between species and require consistent co-occurrence data. In
practice, species observations are sparse, and the availability of information
about the presence or absence of other species varies significantly across
locations. To address these challenges, we propose CISO, a deep learning-based
method for species distribution modeling Conditioned on Incomplete Species
Observations. CISO enables predictions to be conditioned on a flexible number
of species observations alongside environmental variables, accommodating the
variability and incompleteness of available biotic data. We demonstrate our
approach using three datasets representing different species groups: sPlotOpen
for plants, SatBird for birds, and a new dataset, SatButterfly, for
butterflies. Our results show that including partial biotic information
improves predictive performance on spatially separate test sets. When
conditioned on a subset of species within the same dataset, CISO outperforms
alternative methods in predicting the distribution of the remaining species.
Furthermore, we show that combining observations from multiple datasets can
improve performance. CISO is a promising ecological tool, capable of
incorporating incomplete biotic information and identifying potential
interactions between species from disparate taxa.