2025-08-08
Table of Contents
- MisVisFix An Interactive Dashboard for Detecting, Explaining, and Correcting Misleading Visualizations using Large Language Models
- Sculptor Empowering LLMs with Cognitive Agency via Active Context Management
- Share Your Attention Transformer Weight Sharing via Matrix-based Dictionary Learning
- TRAIL Joint Inference and Refinement of Knowledge Graphs with Large Language Models
- CARD Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference
- Automatic LLM Red Teaming
- Evaluating, Synthesizing, and Enhancing for Customer Support Conversation
- FlexQ Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
- KVSink Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs
- ViLLA-MMBench A Unified Benchmark Suite for LLM-Augmented Multimodal Movie Recommendation
- Reasoning Beyond Labels Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts
- Benefit from Rich Tackling Search Interaction Sparsity in Search Enhanced Recommendation
- Excavate the potential of Single-Scale Features A Decomposition Network for Water-Related Optical Image Enhancement
- TCSAFormer Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation
- PAIRS Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG
- SQ-VDiT Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation
- Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency
- MultiRAG A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation
- Data Dependency Inference for Industrial Code Generation Based on UML Sequence Diagrams
- Compressing Chain-of-Thought in LLMs via Step Entropy
- Do language models accommodate their users? A study of linguistic convergence
- Attack the Messages, Not the Agents A Multi-round Adaptive Stealthy Tampering Framework for LLM-MAS
- AgentSME for Simulating Diverse Communication Modes in Smart Education
- Modeling Annotator Disagreement with Demographic-Aware Experts and Synthetic Perspectives
- LOST Low-rank and Sparse Pre-training for Large Language Models
- Automated SNOMED CT Concept Annotation in Clinical Text Using Bi-GRU Neural Networks
- xDeepServe Model-as-a-Service on Huawei CloudMatrix384
- Decomposed Reasoning with Reinforcement Learning for Relevance Assessment in UGC Platforms
- CompressKV Semantic Retrieval Heads Know What Tokens are Not Important Before Generation
- Beyond Manually Designed Pruning Policies with Second-Level Performance Prediction A Pruning Framework for LLMs
- Traffic-R1 Reinforced LLMs Bring Human-Like Reasoning to Traffic Signal Control Systems
- CAMERA Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
- VeOmni Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
- Isolating Culture Neurons in Multilingual Large Language Models
- Forecasting When to Forecast Accelerating Diffusion Models with Confidence-Gated Taylor
- LeanK Learnable K Cache Channel Pruning for Efficient Decoding
- Whispering Agents An event-driven covert communication protocol for the Internet of Agents
- Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation
- Amber Pruner Leveraging NM Activation Sparsity for Efficient Prefill in Large Language Models
- A Survey on AgentOps Categorization, Challenges, and Future Directions
- AlignGuard-LoRA Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
- Everyone Contributes! Incentivizing Strategic Cooperation in Multi-LLM Systems via Sequential Public Goods Games
- CVD-SfM A Cross-View Deep Front-end Structure-from-Motion System for Sparse Localization in Multi-Altitude Scenes
- IAUNet Instance-Aware U-Net
- Quantum-RAG and PunGPT2 Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language
- AGFT An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization
- SmallKV Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
- EAC-MoE Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
- RepoForge Training a SOTA Fast-thinking SWE Agent with an End-to-End Data Curation Pipeline Synergizing SFT and RL at Scale
- BlockA2A Towards Secure and Verifiable Agent-to-Agent Interoperability
- Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models
- Asking the Right Questions Benchmarking Large Language Models in the Development of Clinical Consultation Templates
- Towards Bridging Review Sparsity in Recommendation with Textual Edge Graph Representation
- REACT A Real-Time Edge-AI Based V2X Framework for Accident Avoidance in Autonomous Driving System
- Session-Based Recommendation with Validated and Enriched LLM Intents
- ReaGAN Node-as-Agent-Reasoning Graph Agentic Network
- EdgeInfinite-Instruct Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices
- Systematic Evaluation of Optimization Techniques for Long-Context Language Models
- Large AI Model-Enabled Secure Communications in Low-Altitude Wireless Networks Concepts, Perspectives and Case Study
MisVisFix An Interactive Dashboard for Detecting, Explaining, and Correcting Misleading Visualizations using Large Language Models
Authors: Amit Kumar Das, Klaus Mueller
2025-08-06
http://arxiv.org/abs/2508.04679v1
Misleading visualizations pose a significant challenge to accurate data
interpretation. While recent research has explored the use of Large Language
Models (LLMs) for detecting such misinformation, practical tools that also
support explanation and correction remain limited. We present MisVisFix, an
interactive dashboard that leverages both Claude and GPT models to support the
full workflow of detecting, explaining, and correcting misleading
visualizations. MisVisFix correctly identifies 96% of visualization issues and
addresses all 74 known visualization misinformation types, classifying them as
major, minor, or potential concerns. It provides detailed explanations,
actionable suggestions, and automatically generates corrected charts. An
interactive chat interface allows users to ask about specific chart elements or
request modifications. The dashboard adapts to newly emerging misinformation
strategies through targeted user interactions. User studies with visualization
experts and developers of fact-checking tools show that MisVisFix accurately
identifies issues and offers useful suggestions for improvement. By
transforming LLM-based detection into an accessible, interactive platform,
MisVisFix advances visualization literacy and supports more trustworthy data
communication.
Sculptor Empowering LLMs with Cognitive Agency via Active Context Management
Authors: Mo Li, L. H. Xu, Qitai Tan, Ting Cao, Yunxin Liu
2025-08-06
http://arxiv.org/abs/2508.04664v1
Large Language Models (LLMs) suffer from significant performance degradation
when processing long contexts due to proactive interference, where irrelevant
information in earlier parts of the context disrupts reasoning and memory
recall. While most research focuses on external memory systems to augment
LLMs' capabilities, we propose a complementary approach: empowering LLMs with
Active Context Management (ACM) tools to actively sculpt their internal working
memory. We introduce Sculptor, a framework that equips LLMs with three
categories of tools: (1) context fragmentation, (2) summary, hide, and restore,
and (3) intelligent search. Our approach enables LLMs to proactively manage
their attention and working memory, analogous to how humans selectively focus
on relevant information while filtering out distractions. Experimental
evaluation on information-sparse benchmarks, PI-LLM (proactive interference)
and NeedleBench Multi-Needle Reasoning, demonstrates that Sculptor
significantly improves performance even without specific training, leveraging
LLMs' inherent
tool calling generalization capabilities. By enabling Active Context
Management, Sculptor not only mitigates proactive interference but also
provides a cognitive foundation for more reliable reasoning across diverse
long-context tasks, highlighting that explicit context-control strategies,
rather than merely larger token windows, are key to robustness at scale.
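To make the three tool categories concrete, here is a minimal, hypothetical sketch of what ACM-style context tools could look like; the class and method names are illustrative stand-ins, not the paper's actual API.

```python
# Hypothetical sketch of Active Context Management (ACM) tools in the spirit
# of Sculptor; names and signatures are illustrative, not the paper's API.

class WorkingContext:
    """Holds context fragments the model can actively manage."""

    def __init__(self, text: str, chunk_size: int = 1000):
        # (1) Context fragmentation: split the raw context into fragments.
        self.fragments = [text[i:i + chunk_size]
                          for i in range(0, len(text), chunk_size)]
        self.hidden = {}  # fragment_id -> original text

    def summarize_and_hide(self, fragment_id: int, summary: str):
        # (2) Summary/hide/restore: replace a fragment with a short summary,
        # stashing the original so it can be restored later.
        self.hidden[fragment_id] = self.fragments[fragment_id]
        self.fragments[fragment_id] = f"[SUMMARY] {summary}"

    def restore(self, fragment_id: int):
        if fragment_id in self.hidden:
            self.fragments[fragment_id] = self.hidden.pop(fragment_id)

    def search(self, query: str) -> list[int]:
        # (3) Intelligent search: plain keyword match as a stand-in here.
        return [i for i, frag in enumerate(self.fragments)
                if query.lower() in frag.lower()]
```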
Share Your Attention Transformer Weight Sharing via Matrix-based Dictionary Learning
Authors: Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis
2025-08-06
http://arxiv.org/abs/2508.04581v1
Large language models (LLMs) have revolutionized AI applications, yet their
high computational and memory demands hinder their widespread deployment.
Existing compression techniques focus on intra-block optimizations (e.g.
low-rank approximation, attention head pruning), while the repetitive layered
structure of LLMs implies significant inter-block redundancy - a
dimension largely unexplored beyond key-value (KV) caching. Inspired by
dictionary learning in CNNs, we propose a framework for structured weight
sharing across transformer layers. Our approach decomposes attention projection
matrices into shared dictionary atoms, reducing the attention module's
parameters by 66.7% while achieving on-par performance. Unlike complex methods
requiring distillation or architectural changes, MASA (Matrix Atom Sharing in
Attention) operates as a drop-in replacement - trained with standard optimizers
- and represents each layer's weights as linear combinations of shared matrix
atoms. Experiments across scales (100M-700M parameters) show that MASA achieves
better benchmark accuracy and perplexity than grouped-query attention (GQA),
low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at
comparable parameter budgets. Ablation studies confirm robustness to the
dictionary size and the efficacy of shared representations in capturing
cross-layer statistical regularities. Extending to Vision Transformers (ViT),
MASA matches performance metrics on image classification and detection tasks
with 66.7% fewer attention parameters. By combining dictionary learning
strategies with transformer efficiency, MASA offers a scalable blueprint for
parameter-efficient models without sacrificing performance. Finally, we
investigate the possibility of employing MASA on pretrained LLMs to reduce
their number of parameters without experiencing any significant drop in their
performance.
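As a rough illustration of the dictionary-learning idea (not the paper's implementation), each layer's projection matrix can be materialized as a linear combination of shared atoms; the sizes below are arbitrary toy values.

```python
import torch

# Illustrative sketch of matrix-atom weight sharing (MASA-style); dimensions
# and the number of atoms are made-up examples, not the paper's settings.
d_model, n_layers, n_atoms = 512, 12, 4

# A shared dictionary of matrix atoms, reused by every layer.
atoms = torch.nn.Parameter(torch.randn(n_atoms, d_model, d_model) * 0.02)
# Per-layer mixing coefficients: each layer's projection is a linear
# combination of the shared atoms instead of a full independent matrix.
coeffs = torch.nn.Parameter(torch.randn(n_layers, n_atoms))

def layer_weight(layer: int) -> torch.Tensor:
    # W_l = sum_k c_{l,k} * A_k
    return torch.einsum("k,kij->ij", coeffs[layer], atoms)

# Parameter count: n_atoms*d^2 + n_layers*n_atoms, versus n_layers*d^2 for
# independent matrices -- a large reduction when n_atoms << n_layers.
```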
TRAIL Joint Inference and Refinement of Knowledge Graphs with Large Language Models
Authors: Xinkui Zhao, Haode Li, Yifan Zhang, Guanjie Cheng, Yueshen Xu
2025-08-06
http://arxiv.org/abs/2508.04474v1
Recent advances in large language models (LLMs) have unlocked powerful
reasoning and decision-making capabilities. However, their inherent dependence
on static parametric memory fundamentally limits their adaptability, factual
accuracy, and interpretability in knowledge-intensive scenarios. Knowledge
graphs (KGs), as structured repositories of explicit relational knowledge,
offer a promising approach for augmenting LLMs with external, interpretable
memory. Nevertheless, most existing methods that combine LLMs with KGs treat
reasoning and knowledge updating as separate processes, resulting in suboptimal
utilization of new information and hindering real-time updates. In this work,
we propose TRAIL: a novel, unified framework for Thinking, Reasoning, And
Incremental Learning that couples joint inference and dynamic KG refinement
with large language models. TRAIL enables LLM agents to iteratively explore,
update, and refine knowledge graphs during the reasoning process, employing a
confidence-driven mechanism for the generation, validation, and pruning of new
facts. This plug-and-play architecture facilitates seamless integration with
various LLMs, supporting continual adaptation without the need for retraining.
Extensive experiments on multiple benchmarks demonstrate that TRAIL outperforms
existing KG-augmented and retrieval-augmented
LLM baselines by 3% to 13%. More
importantly, these results represent a significant step toward developing
adaptive, memory-augmented language models capable of continual learning and
reliable, transparent reasoning.
CARD Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference
Authors: Enyu Zhou, Kai Sheng, Hao Chen, Xin He
2025-08-06
http://arxiv.org/abs/2508.04462v1
Speculative decoding (SD), where an extra draft model first provides multiple
draft tokens and the original target model then verifies these tokens in
parallel, has shown great power for inference
acceleration. However,
existing SD methods must adhere to the 'draft-then-verify' paradigm, which
forces drafting and verification processes to execute sequentially during SD,
resulting in inefficient inference performance and limiting the size of the
draft model. Furthermore, once a single token in the candidate sequence is
rejected during the drafting process, all subsequent candidate tokens must be
discarded, leading to inefficient drafting. To address these challenges, we
propose CARD, a cache-based parallel speculative decoding framework employing
a 'query-and-correct' paradigm. Specifically, CARD decouples drafting and
verification: the draft model generates candidate tokens to populate a shared
cache, while the target model concurrently rectifies the draft model's
generation direction. This effectively enables the target model to perform
inference at speed approaching that of the draft model. Our approach achieves
up to 4.83× speedup over vanilla decoding without requiring fine-tuning of
either the draft or target models. Our code is available at
https://github.com/hunzhizi/CARD.
Automatic LLM Red Teaming
Authors: Roman Belaire, Arunesh Sinha, Pradeep Varakantham
2025-08-06
http://arxiv.org/abs/2508.04451v1
Red teaming is critical for identifying vulnerabilities and building trust in
current LLMs. However, current automated methods for Large Language Models
(LLMs) rely on brittle prompt templates or single-turn attacks, failing to
capture the complex, interactive nature of real-world adversarial dialogues. We
propose a novel paradigm: training an AI to strategically `break' another AI.
By formalizing red teaming as a Markov Decision Process (MDP) and employing a
hierarchical Reinforcement Learning (RL) framework, we effectively address the
inherent
sparse reward and long-horizon challenges. Our generative agent learns
coherent, multi-turn attack strategies through a fine-grained, token-level harm
reward, enabling it to uncover subtle vulnerabilities missed by existing
baselines. This approach sets a new state-of-the-art, fundamentally reframing
red teaming as a dynamic, trajectory-based process (rather than a one-step
test) essential for robust AI deployment.
Evaluating, Synthesizing, and Enhancing for Customer Support Conversation
Authors: Jie Zhu, Huaixia Dou, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong
2025-08-06
http://arxiv.org/abs/2508.04423v1
Effective customer support requires not only accurate problem solving but
also structured and empathetic communication aligned with professional
standards. However, existing dialogue datasets often lack strategic guidance,
and real-world service data is difficult to access and annotate. To address
this, we introduce the task of Customer Support Conversation (CSC), aimed at
training customer service agents to respond using well-defined support
strategies. We propose a structured CSC framework grounded in COPC guidelines,
defining five conversational stages and twelve strategies to guide high-quality
interactions. Based on this, we construct CSConv, an evaluation dataset of
1,855 real-world customer-agent conversations rewritten using LLMs to reflect
deliberate strategy use, and annotated accordingly. Additionally, we develop a
role-playing approach that simulates strategy-rich conversations using
LLM-powered roles aligned with the CSC framework, resulting in the training
dataset RoleCS. Experiments show that fine-tuning strong LLMs on RoleCS
significantly improves their ability to generate high-quality, strategy-aligned
responses on CSConv. Human evaluations further confirm gains in problem
resolution. All code and data will be made publicly available at
https://github.com/aliyun/qwen-dianjin.
FlexQ Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
Authors: Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He
2025-08-06
http://arxiv.org/abs/2508.04405v1
Large Language Models (LLMs) demonstrate exceptional performance but entail
significant memory and computational costs, restricting their practical
deployment. While existing INT4/INT8 quantization reduces these costs, they
often degrade accuracy or lack optimal efficiency. INT6 quantization offers a
superior trade-off between model accuracy and inference efficiency, but lacks
hardware support in modern GPUs, forcing emulation via higher-precision
arithmetic units that limit acceleration.
In this paper, we propose FlexQ, a novel post-training INT6 quantization
framework combining algorithmic innovation with system-level optimizations.
FlexQ employs uniform 6-bit weight quantization across all layers, with
adaptive retention of 8-bit activations in layers identified through layer-wise
sensitivity analysis. To maximize hardware efficiency, we develop a specialized
high-performance GPU kernel supporting matrix multiplication for W6A6 and W6A8
representations via Binary Tensor Core (BTC) equivalents, effectively bypassing
the lack of native INT6 tensor cores. Evaluations on LLaMA models show FlexQ
maintains near-FP16 accuracy, with perplexity increases of no more than 0.05.
The proposed kernel achieves an average 1.39× speedup over ABQ-LLM on
LLaMA-2-70B linear layers. End-to-end, FlexQ delivers 1.33× inference
acceleration and 1.21× memory savings over SmoothQuant. Code is released
at https://github.com/FlyFoxPlayer/FlexQ.
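A toy fake-quantization sketch of the W6A6/W6A8 scheme described above, assuming simple per-row absmax scaling; the real system relies on custom BTC-based GPU kernels rather than this round-trip emulation.

```python
import torch

# Toy illustration of the quantization recipe FlexQ describes: uniform 6-bit
# weights everywhere, with activations kept at 8 bits in layers flagged as
# sensitive. Scale handling and kernels are far simpler than the real system.

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

def flexq_linear(x, weight, layer_is_sensitive: bool):
    w_q = fake_quant(weight, bits=6)               # W6 in every layer
    a_bits = 8 if layer_is_sensitive else 6        # A8 only where sensitive
    return fake_quant(x, bits=a_bits) @ w_q.t()

x = torch.randn(2, 16)
w = torch.randn(32, 16)
y = flexq_linear(x, w, layer_is_sensitive=True)    # W6A8 path
```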
KVSink Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs
Authors: Zunhai Su, Kehong Yuan
2025-08-06
http://arxiv.org/abs/2508.04257v1
Key-Value (KV) cache quantization has become a widely adopted optimization
technique for efficient large language model (LLM) inference by reducing
cache memory usage and mitigating memory-bound constraints. Recent studies have
emphasized the importance of preserving the original precision of KVs for the
first few tokens to ensure the protection of attention sinks. While this
approach has proven effective in mitigating performance degradation, its
underlying principles remain insufficiently understood. Moreover, it fails to
address the recent discovery that attention sinks can emerge beyond the initial
token positions. In this work, we elucidate the underlying mechanisms of
attention sinks during inference by examining their role in the cross-layer
evolution of extreme activation outliers. Additionally, we provide a
comprehensive analysis of the interplay between attention sinks and KV cache
quantization. Based on our enhanced understanding, we introduce
\textit{\textbf{KVSink}}, a plug-and-play method that effectively predicts sink
tokens with negligible overhead, enabling more thorough preservation. Extensive
experiments demonstrate that KVSink outperforms the existing Preserve-First-N
(PFN) strategy, offering more effective preservation of attention sinks during
KV cache quantization. Moreover, when applied to the well-established KVQuant
method, KVSink further improves perplexity (PPL) and reduces reliance on 16-bit
numerical outliers.
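A minimal sketch of the preservation idea, assuming a placeholder sink predictor and INT4 fake quantization; KVSink's actual prediction mechanism and quantizer details are described in the paper.

```python
import torch

# Sketch of sink-aware KV cache quantization: keep predicted "sink" token
# positions (plus the first few tokens) in full precision and quantize the
# rest to a low bit-width. The sink positions here are given by hand; the
# paper's contribution is predicting them cheaply.

def quantize_int4(x: torch.Tensor) -> torch.Tensor:
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 7
    return (x / scale).round().clamp(-8, 7) * scale  # fake-quant round trip

def compress_kv(kv: torch.Tensor, sink_positions: set[int], first_n: int = 4):
    keep = sink_positions | set(range(first_n))
    out = quantize_int4(kv.clone())
    for pos in keep:            # preserve attention sinks at full precision
        out[pos] = kv[pos]
    return out

kv = torch.randn(128, 64)       # (seq_len, head_dim), toy sizes
compressed = compress_kv(kv, sink_positions={0, 37})
```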
ViLLA-MMBench A Unified Benchmark Suite for LLM-Augmented Multimodal Movie Recommendation
Authors: Fatemeh Nazary, Ali Tourani, Yashar Deldjoo, Tommaso Di Noia
2025-08-06
http://arxiv.org/abs/2508.04206v1
Recommending long-form video content demands joint modeling of visual, audio,
and textual modalities, yet most benchmarks address only raw features or narrow
fusion. We present ViLLA-MMBench, a reproducible, extensible benchmark for
LLM-augmented multimodal movie recommendation. Built on MovieLens and MMTF-14K,
it aligns dense item embeddings from three modalities: audio (block-level,
i-vector), visual (CNN, AVF), and text. Missing or sparse metadata is
automatically enriched using state-of-the-art LLMs (e.g., OpenAI Ada),
generating high-quality synopses for thousands of movies. All text (raw or
augmented) is embedded with configurable encoders (Ada, LLaMA-2, Sentence-T5),
producing multiple ready-to-use sets. The pipeline supports interchangeable
early-, mid-, and late-fusion (concatenation, PCA, CCA, rank-aggregation) and
multiple backbones (MF, VAECF, VBPR, AMR, VMF) for ablation. Experiments are
fully declarative via a single YAML file. Evaluation spans accuracy (Recall,
nDCG) and beyond-accuracy metrics: cold-start rate, coverage, novelty,
diversity, fairness. Results show
LLM-based augmentation and strong text
embeddings boost cold-start and coverage, especially when fused with
audio-visual features. Systematic benchmarking reveals universal versus
backbone- or metric-specific combinations. Open-source code, embeddings, and
configs enable reproducible, fair multimodal RS research and advance principled
generative AI integration in large-scale recommendation. Code:
https://recsys-lab.github.io/ViLLA-MMBench
Reasoning Beyond Labels Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts
Authors: Millicent Ochieng, Anja Thieme, Ignatius Ezeani, Risa Ueno, Samuel Maina, Keshet Ronen, Javier Gonzalez, Jacki O'Neill
2025-08-06
http://arxiv.org/abs/2508.04199v1
Sentiment analysis in low-resource, culturally nuanced contexts challenges
conventional NLP approaches that assume fixed labels and universal affective
expressions. We present a diagnostic framework that treats sentiment as a
context-dependent, culturally embedded construct, and evaluate how large
language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp
messages from Nairobi youth health groups. Using a combination of
human-annotated data, sentiment-flipped counterfactuals, and rubric-based
explanation evaluation, we probe
LLM interpretability, robustness, and
alignment with human reasoning. Framing our evaluation through a social-science
measurement lens, we operationalize and interrogate
LLMs' outputs as an
instrument for measuring the abstract concept of sentiment. Our findings reveal
significant variation in model reasoning quality, with top-tier
LLMs
demonstrating interpretive stability, while open models often falter under
ambiguity or sentiment shifts. This work highlights the need for culturally
sensitive, reasoning-aware AI evaluation in complex, real-world
communication.
Benefit from Rich Tackling Search Interaction Sparsity in Search Enhanced Recommendation
Authors: Teng Shi, Weijie Yu, Xiao Zhang, Ming He, Jianping Fan, Jun Xu
2025-08-06
http://arxiv.org/abs/2508.04145v1
In modern online platforms, search and recommendation (S&R) often coexist,
offering opportunities for performance improvement through search-enhanced
approaches. Existing studies show that incorporating search signals boosts
recommendation performance. However, the effectiveness of these methods relies
heavily on rich search interactions. They primarily benefit a small subset of
users with abundant search behavior, while offering limited improvements for
the majority of users who exhibit only sparse search activity. To address the
problem of sparse search data in search-enhanced recommendation, we face two
key challenges: (1) how to learn useful search features for users with sparse
search interactions, and (2) how to design effective training objectives under
sparse conditions. Our idea is to leverage the features of users with rich
search interactions to enhance those of users with sparse search interactions.
Based on this idea, we propose GSERec, a method that utilizes message passing
on the User-Code Graphs to alleviate data sparsity in Search-Enhanced
Recommendation. Specifically, we utilize Large Language Models (LLMs) with
vector quantization to generate discrete codes, which connect similar users and
thereby construct the graph. Through message passing on this graph, embeddings
of users with rich search data are propagated to enhance the embeddings of
users with sparse interactions. To further ensure that the message passing
captures meaningful information from truly similar users, we introduce a
contrastive loss to better model user similarities. The enhanced user
representations are then integrated into downstream search-enhanced
recommendation models. Experiments on three real-world datasets show that
GSERec consistently outperforms baselines, especially for users with sparse
search behaviors.
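A rough sketch of the user-code-graph intuition under simplifying assumptions: users are quantized to discrete codes (plain nearest-centroid here, standing in for the LLM-assisted quantizer), and embeddings are averaged within each code group as a stand-in for message passing.

```python
import torch

# Illustrative GSERec-style propagation: connect users that share a discrete
# code, so embeddings of search-rich users can flow to search-sparse ones.

def assign_codes(user_emb: torch.Tensor, codebook: torch.Tensor):
    return torch.cdist(user_emb, codebook).argmin(dim=1)  # one code per user

def propagate(user_emb, codes, alpha=0.5):
    out = user_emb.clone()
    for c in codes.unique():
        members = (codes == c).nonzero(as_tuple=True)[0]
        mean = user_emb[members].mean(dim=0)   # message passing within a code
        out[members] = (1 - alpha) * user_emb[members] + alpha * mean
    return out

users = torch.randn(100, 32)                   # toy user embeddings
codebook = torch.randn(8, 32)                  # toy code centroids
enhanced = propagate(users, assign_codes(users, codebook))
```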
Excavate the potential of Single-Scale Features A Decomposition Network for Water-Related Optical Image Enhancement
Authors: Zheng Cheng, Wenri Wang, Guangyong Chen, Yakun Ju, Yihua Cheng, Zhisong Liu, Yanda Meng, Jintao Song
2025-08-06
http://arxiv.org/abs/2508.04123v1
Underwater image enhancement (UIE) techniques aim to improve visual quality
of images captured in aquatic environments by addressing degradation issues
caused by light absorption and scattering effects, including color distortion,
blurring, and low contrast. Current mainstream solutions predominantly employ
multi-scale feature extraction (MSFE) mechanisms to enhance reconstruction
quality through multi-resolution feature fusion. However, our extensive
experiments demonstrate that high-quality image reconstruction does not
necessarily rely on multi-scale feature fusion. Contrary to popular belief, our
experiments show that single-scale feature extraction alone can match or
surpass the performance of multi-scale methods, significantly reducing
complexity. To comprehensively explore single-scale feature potential in
underwater enhancement, we propose an innovative Single-Scale Decomposition
Network (SSD-Net). This architecture introduces an asymmetrical decomposition
mechanism that disentangles the input image into a clean layer and a
degradation layer. The former contains scene-intrinsic information and the
latter encodes
medium-induced interference. It uniquely combines CNN's local feature
extraction capabilities with Transformer's global modeling strengths through
two core modules: 1) Parallel Feature Decomposition Block (PFDB), implementing
dual-branch feature space decoupling via efficient attention operations and
adaptive sparsity; 2) Bidirectional Feature Communication Block
(BFCB), enabling cross-layer residual interactions for complementary feature
mining and fusion. This synergistic design preserves feature decomposition
independence while establishing dynamic cross-layer information pathways,
effectively enhancing degradation decoupling capacity.
TCSAFormer Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation
Authors: Zunhui Xia, Hongxing Li, Libin Lan
2025-08-06
http://arxiv.org/abs/2508.04058v1
In recent years, Transformer-based methods have achieved remarkable progress
in medical image segmentation due to their superior ability to capture
long-range dependencies. However, these methods typically suffer from two major
limitations. First, their computational complexity scales quadratically with
the input sequences. Second, the feed-forward network (FFN) modules in vanilla
Transformers typically rely on fully connected layers, which limits models'
ability to capture local contextual information and multiscale features
critical for precise semantic segmentation. To address these issues, we propose
an efficient medical image segmentation network, named TCSAFormer. The proposed
TCSAFormer adopts two key ideas. First, it incorporates a Compressed Attention
(CA) module, which combines token compression and pixel-level sparse attention
to dynamically focus on the most relevant key-value pairs for each query. This
is achieved by pruning globally irrelevant tokens and merging redundant ones,
significantly reducing computational complexity while enhancing the model's
ability to capture relationships between tokens. Second, it introduces a
Dual-Branch Feed-Forward Network (DBFFN) module as a replacement for the
standard FFN to capture local contextual features and multiscale information,
thereby strengthening the model's feature representation capability. We conduct
extensive experiments on three publicly available medical image segmentation
datasets: ISIC-2018, CVC-ClinicDB, and Synapse, to evaluate the segmentation
performance of TCSAFormer. Experimental results demonstrate that TCSAFormer
achieves superior performance compared to existing state-of-the-art (SOTA)
methods, while maintaining lower computational overhead, thus achieving an
optimal trade-off between efficiency and accuracy.
PAIRS Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG
Authors: Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang
2025-08-06
http://arxiv.org/abs/2508.04057v1
Retrieval-Augmented Generation (RAG) has become a cornerstone technique for
enhancing large language models (LLMs) with external knowledge. However,
current RAG systems face two critical limitations: (1) they inefficiently
retrieve information for every query, including simple questions that could be
resolved using the
's parametric knowledge alone, and (2) they risk
retrieving irrelevant documents when queries contain
information
signals. To address these gaps, we introduce Parametric-verified Adaptive
Information Retrieval and Selection (PAIRS), a training-free framework that
integrates parametric and retrieved knowledge to adaptively determine whether
to retrieve and how to select external information. Specifically, PAIRS employs
a dual-path generation mechanism: First, the
LLM produces both a direct answer
and a context-augmented answer using self-generated pseudo-context. When these
outputs converge, PAIRS bypasses external retrieval entirely, dramatically
improving the RAG system's efficiency. For divergent cases, PAIRS activates a
dual-path retrieval (DPR) process guided by both the original query and
self-generated contextual signals, followed by an Adaptive Information
Selection (AIS) module that filters documents through weighted similarity to
both sources. This simple yet effective approach can not only enhance
efficiency by eliminating unnecessary retrievals but also improve accuracy
through contextually guided retrieval and adaptive information selection.
Experimental results on six question-answering (QA) benchmarks show that PAIRS
reduces retrieval costs by around 25% (triggering for only 75% of queries)
while still improving accuracy-achieving +1.1% EM and +1.0% F1 over prior
baselines on average.
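A minimal sketch of the dual-path logic, with `llm` and `retrieve` as assumed stand-ins for a generation call and a retriever, and exact string match standing in for the paper's convergence test.

```python
# Hedged sketch of PAIRS-style adaptive retrieval; all helper behavior below
# is simplified for illustration and is not the paper's implementation.

def _sim(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))   # Jaccard word overlap

def select_docs(docs, query, pseudo_ctx, w=0.5, k=3):
    # Adaptive Information Selection: weighted similarity to both sources.
    return sorted(docs, key=lambda d: w * _sim(d, query)
                  + (1 - w) * _sim(d, pseudo_ctx), reverse=True)[:k]

def answer(query: str, llm, retrieve) -> str:
    direct = llm(f"Answer: {query}")
    pseudo_ctx = llm(f"Write brief background for: {query}")
    contextual = llm(f"Context: {pseudo_ctx}\nAnswer: {query}")
    if direct.strip() == contextual.strip():
        return direct                      # outputs converge: skip retrieval
    # Divergent case: dual-path retrieval guided by query and pseudo-context.
    docs = select_docs(retrieve(query) + retrieve(pseudo_ctx), query, pseudo_ctx)
    return llm("Context: " + " ".join(docs) + f"\nAnswer: {query}")
```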
SQ-VDiT Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation
Authors: Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu
2025-08-06
http://arxiv.org/abs/2508.04016v2
Diffusion Transformers have emerged as the mainstream paradigm for video
generation models. However, the use of up to billions of parameters incurs
significant computational costs. Quantization offers a promising solution by
reducing memory usage and accelerating inference. Nonetheless, we observe that
the joint modeling of spatial and temporal information in video diffusion
models (V-DMs) leads to extremely long token sequences, which introduces high
calibration variance and learning challenges. To address these issues, we
propose SQ-VDiT, a post-training quantization framework for V-DMs that
leverages Salient data and Sparse token distillation. During the calibration
phase, we identify that quantization performance is highly sensitive to the
choice of calibration data. To mitigate this, we introduce
\textit{Hessian-aware Salient Data Selection}, which constructs high-quality
calibration datasets by considering both diffusion and quantization
characteristics unique to V-DMs. To tackle the learning challenges, we further
analyze the
sparse attention patterns inherent in V-DMs. Based on this
observation, we propose \textit{Attention-guided Sparse Token Distillation},
which exploits token-wise attention distributions to emphasize tokens that are
more influential to the model's output. Under W4A6 quantization, SQ-VDiT
achieves lossless performance while delivering model compression and inference
acceleration. Code will be available at
https://github.com/wlfeng0509/s2q-vdit.
Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency
Authors: Md Arafat Sultan, Ramón Fernandez Astudillo
2025-08-06
http://arxiv.org/abs/2508.03979v1
Despite its simplicity and efficacy, the high token expenditure of
self-consistency can limit its practical utility. Here we investigate if
self-consistency can be made more token-efficient for long chain-of-thought
reasoning tasks, while preserving its parallelism, through early hypothesis
pruning. Concretely, we generate all solutions in parallel, but periodically
prune intermediate hypotheses that are deemed unnecessary based on two
lightweight indicators: (a) the model's own confidence in individual
hypotheses, and (b) lexical coverage of all current hypotheses by candidate
subsets that are under consideration for continued retention. We design a fast
weighted set cover algorithm that utilizes the two indicators; our evaluation
of five
LLMs on three math benchmarks shows that this method can improve token
efficiency for all models, by 10-35% in many cases.
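One plausible reading of the pruning step is a greedy weighted set cover over hypothesis token sets, sketched below; the paper's exact weighting and schedule may differ.

```python
# Hedged sketch of confidence-weighted set cover for hypothesis pruning.

def prune_hypotheses(hypotheses, confidences, keep):
    """hypotheses: list of token sets; confidences: per-hypothesis scores."""
    keep = min(keep, len(hypotheses))
    covered = set()
    chosen = []
    while len(chosen) < keep:
        remaining = [i for i in range(len(hypotheses)) if i not in chosen]
        # Greedy step: pick the hypothesis whose uncovered tokens, weighted
        # by the model's confidence in it, add the most lexical coverage.
        best = max(remaining, key=lambda i:
                   confidences[i] * (1 + len(hypotheses[i] - covered)))
        chosen.append(best)
        covered |= hypotheses[best]
    return chosen  # indices of hypotheses to keep generating

hyps = [{"a", "b", "c"}, {"b", "c"}, {"c", "d", "e"}]
print(prune_hypotheses(hyps, [0.9, 0.6, 0.7], keep=2))  # -> [0, 2]
```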
MultiRAG A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation
Authors: Wenlong Wu, Haofen Wang, Bohan Li, Peixuan Huang, Xinzhe Zhao, Lei Liang
2025-08-05
http://arxiv.org/abs/2508.03553v1
Retrieval Augmented Generation (RAG) has emerged as a promising solution to
address hallucination issues in Large Language Models (LLMs). However, the
integration of multiple retrieval sources, while potentially more informative,
introduces new challenges that can paradoxically exacerbate hallucination
problems. These challenges manifest primarily in two aspects: the
sparse distribution of multi-source data that hinders the capture of logical
relationships and the inherent inconsistencies among different sources that
lead to information conflicts. To address these challenges, we propose
MultiRAG, a novel framework designed to mitigate hallucination in multi-source
retrieval-augmented generation through knowledge-guided approaches. Our
framework introduces two key innovations: (1) a knowledge construction module
that employs multi-source line graphs to efficiently aggregate logical
relationships across different knowledge sources, effectively addressing the
sparse data distribution issue; and (2) a sophisticated retrieval module that
implements a multi-level confidence calculation mechanism, performing both
graph-level and node-level assessments to identify and eliminate unreliable
information nodes, thereby reducing hallucinations caused by inter-source
inconsistencies. Extensive experiments on four multi-domain query datasets and
two multi-hop QA datasets demonstrate that MultiRAG significantly enhances the
reliability and efficiency of knowledge retrieval in complex multi-source
scenarios. Our code is available at
https://github.com/wuwenlong123/MultiRAG.
Data Dependency Inference for Industrial Code Generation Based on UML Sequence Diagrams
Authors: Wenxin Mao, Zhitao Wang, Long Wang, Sirong Chen, Cuiyun Gao, Luyang Cao, Ziming Liu, Qiming Zhang, Jun Zhou, Zhi Jin
2025-08-05
http://arxiv.org/abs/2508.03379v2
Large language models (LLMs) excel at generating code from natural language
(NL) descriptions. However, the plain textual descriptions are inherently
ambiguous and often fail to capture complex requirements like intricate system
behaviors, conditional logic, and architectural constraints; implicit data
dependencies in service-oriented architectures are difficult to infer and
handle correctly. To bridge this gap, we propose a novel step-by-step code
generation framework named UML2Dep by leveraging unambiguous formal
specifications of complex requirements. First, we introduce an enhanced Unified
Modeling Language (UML) sequence diagram tailored for service-oriented
architectures. This diagram extends traditional visual syntax by integrating
decision tables and API specifications, explicitly formalizing structural
relationships and business logic flows in service interactions to rigorously
eliminate linguistic ambiguity. Second, recognizing the critical role of data
flow, we introduce a dedicated data dependency inference (DDI) task. DDI
systematically constructs an explicit data dependency graph prior to actual
code synthesis. To ensure reliability, we formalize DDI as a constrained
mathematical reasoning task through novel prompting strategies, aligning with
LLMs' mathematical strengths. Additional static parsing and dependency
pruning further reduce context complexity and cognitive load
associated with intricate specifications, thereby enhancing reasoning accuracy
and efficiency.
Compressing Chain-of-Thought in LLMs via Step Entropy
Authors: Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, Qiang Xu
2025-08-05
http://arxiv.org/abs/2508.03346v1
Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at
complex reasoning but generate verbose thought processes with considerable
redundancy, leading to increased inference costs and reduced efficiency. We
introduce a novel CoT compression framework based on step entropy, a metric
that quantifies the informational contribution of individual reasoning steps to
identify redundancy. Through theoretical analysis and extensive empirical
validation on mathematical reasoning benchmarks, we demonstrate that steps with
low entropy are indeed highly redundant. Our experiments reveal that an
astonishing 80\% of low-entropy intermediate steps can be pruned with minor
degradation in the final answer accuracy across DeepSeek-R1-7B, 14B and
Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning,
which severely impairs reasoning performance. Building on this, we propose a
novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and
Group Relative Policy Optimization (GRPO) reinforcement learning. This approach
enables LLMs to autonomously learn to generate compressed CoTs during inference
by strategically incorporating [SKIP] tokens. Our method significantly enhances
inference efficiency while rigorously preserving accuracy, offering
profound implications for practical
LLM deployment and a deeper understanding
of reasoning structures.
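As an illustration, one simple operationalization of step entropy is the mean token-level entropy within a step; the lowest-entropy steps are then pruned. This is a hedged sketch, not the paper's exact metric.

```python
import math

# Illustrative per-step entropy for CoT compression: the "entropy" of a step
# is taken here as the mean entropy of the model's per-token distributions.

def step_entropy(token_probs: list) -> float:
    ents = [-sum(p * math.log(p) for p in dist if p > 0)
            for dist in token_probs]
    return sum(ents) / max(1, len(ents))

def compress_cot(steps, probs_per_step, frac=0.8):
    # Prune up to `frac` of the steps, lowest-entropy (most redundant) first.
    order = sorted(range(len(steps)),
                   key=lambda i: step_entropy(probs_per_step[i]))
    pruned = set(order[: int(frac * len(steps))])
    return [s for i, s in enumerate(steps) if i not in pruned]

steps = ["Let x = 2.", "So 2 + 2 = 4.", "Therefore the answer is 4."]
probs = [[[0.5, 0.5]], [[0.9, 0.1]], [[0.99, 0.01]]]
print(compress_cot(steps, probs, frac=0.34))  # drops the lowest-entropy step
```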
Do language models accommodate their users? A study of linguistic convergence
Authors: Terra Blevins, Susanne Schmalwieser, Benjamin Roth
2025-08-05
http://arxiv.org/abs/2508.03276v1
While large language models (LLMs) are generally considered proficient in
generating language, how similar their language usage is to that of humans
remains understudied. In this paper, we test whether models exhibit linguistic
convergence, a core pragmatic element of human language
communication, asking:
do models adapt, or converge, to the linguistic patterns of their user? To
answer this, we systematically compare model completions of existing dialogues
to the original human responses across sixteen language models, three dialogue
corpora, and a variety of stylometric features. We find that models strongly
converge to the conversation's style, often significantly overfitting relative
to the human baseline. While convergence patterns are often feature-specific,
we observe consistent shifts in convergence across modeling settings, with
instruction-tuned and larger models converging less than their pretrained
counterparts. Given the differences between human and model convergence
patterns, we hypothesize that the underlying mechanisms for these behaviors are
very different.
Attack the Messages, Not the Agents A Multi-round Adaptive Stealthy Tampering Framework for LLM-MAS
Authors: Bingyu Yan, Ziyi Zhou, Xiaoming Zhang, Chaozhuo Li, Ruilin Zeng, Yirui Qi, Tianbo Wang, Litian Zhang
2025-08-05
http://arxiv.org/abs/2508.03125v1
Large language model-based multi-agent systems (LLM-MAS) effectively
accomplish complex and dynamic tasks through inter-agent communication, but
this reliance introduces substantial safety vulnerabilities. Existing attack
methods targeting LLM-MAS either compromise agent internals or rely on direct
and overt persuasion, which limits their effectiveness, adaptability, and
stealthiness. In this paper, we propose MAST, a Multi-round Adaptive Stealthy
Tampering framework designed to exploit communication vulnerabilities within
the system. MAST integrates Monte Carlo Tree Search with Direct Preference
Optimization to train an attack policy model that adaptively generates
effective multi-round tampering strategies. Furthermore, to preserve
stealthiness, we impose dual semantic and embedding similarity constraints
during the tampering process. Comprehensive experiments across diverse tasks,
communication architectures, and LLMs demonstrate that MAST consistently
achieves high attack success rates while significantly enhancing stealthiness
compared to baselines. These findings highlight the effectiveness,
stealthiness, and adaptability of MAST, underscoring the need for robust
safeguards in LLM-MAS.
AgentSME for Simulating Diverse Communication Modes in Smart Education
Authors: Wen-Xi Yang, Tian-Fang Zhao
2025-08-05
http://arxiv.org/abs/2508.03109v1
Generative agent models specifically tailored for smart education are
critical, yet remain relatively underdeveloped. A key challenge stems from the
inherent complexity of educational contexts: learners are human beings with
various cognitive behaviors, and pedagogy is fundamentally centered on
personalized human-to-human communication. To address this issue, this paper
proposes AgentSME, a unified generative agent framework powered by LLMs. Three
directional communication modes are considered in the models, namely Solo,
Mono, and Echo, reflecting different types of agency autonomy and communicative
reciprocity. Accuracy is adopted as the primary evaluation metric, complemented
by three diversity indices designed to assess the diversity of reasoning
contents. Six widely used LLMs are tested to validate the robustness of
communication modes across different model tiers, which are equally divided
into base-capacity and high-capacity configurations. The results show that
generative agents that employ the Echo
communication mode achieve the highest
accuracy scores, while DeepSeek exhibits the greatest diversity. This study
provides valuable information to improve agent learning capabilities and
inspire smart education models.
Modeling Annotator Disagreement with Demographic-Aware Experts and Synthetic Perspectives
Authors: Yinuo Xu, Veronica Derricks, Allison Earl, David Jurgens
2025-08-04
http://arxiv.org/abs/2508.02853v1
We present an approach to modeling annotator disagreement in subjective NLP
tasks through both architectural and data-centric innovations. Our model,
DEM-MoE (Demographic-Aware Mixture of Experts), routes inputs to expert
subnetworks based on annotator demographics, enabling it to better represent
structured, group-level variation compared to prior models. DEM-MoE
consistently performs competitively across demographic groups, and shows
especially strong results on datasets with high annotator disagreement. To
address demographic coverage, we test whether
LLM-generated synthetic
annotations via zero-shot persona prompting can be used for data imputation. We
show these synthetic judgments align moderately well with human annotations on
our data and offer a scalable way to potentially enrich training data. We then
propose and evaluate approaches for blending real and synthetic data, finding
that the optimal blending strategies depend on dataset structure. Together,
these contributions improve the
representation of diverse perspectives.
LOST Low-rank and Sparse Pre-training for Large Language Models
Authors: Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Liwu Xu, Tianjin Huang, Wenwu Wang, Shiwei Liu, Xilu Wang
2025-08-04
http://arxiv.org/abs/2508.02668v1
While large language models (LLMs) have achieved remarkable performance
across a wide range of tasks, their massive scale incurs prohibitive
computational and memory costs for pre-training from scratch. Recent studies
have investigated the use of low-rank parameterization as a means of reducing
model size and training cost. In this context,
sparsity is often employed as a
complementary technique to recover important information lost in low-rank
compression by capturing salient features in the residual space. However,
existing approaches typically combine low-rank and
sparse components in a
simplistic or ad hoc manner, often resulting in undesirable performance
degradation compared to full-rank training. In this paper, we propose
\textbf{LO}w-rank and \textbf{S}parse pre-\textbf{T}raining (\textbf{LOST}) for
LLMs, a novel method that ingeniously integrates low-rank and sparse
structures to enable effective training of LLMs from scratch under strict
efficiency constraints. LOST applies singular value decomposition to weight
matrices, preserving the dominant low-rank components, while allocating the
remaining singular values to construct channel-wise sparse components to
complement the expressiveness of low-rank training. We evaluate LOST on LLM
pretraining ranging from 60M to 7B parameters. Our experiments show that LOST
achieves
competitive or superior performance compared to full-rank models, while
significantly reducing both memory and compute overhead. Code is available at
\href{https://github.com/JiaxiLi1/LOST-Low-rank-and-Sparse-Training-for-Large-Language-Models}{LOST
Repo}.
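A sketch of the decomposition under stated assumptions: SVD keeps the dominant low-rank part, and the residual is restricted to its largest-norm channels as the sparse component; the exact allocation rule below is illustrative, not the paper's.

```python
import torch

# Illustrative LOST-style split of a weight matrix into a low-rank part plus
# a channel-wise sparse residual.

def lost_decompose(w: torch.Tensor, rank: int, keep_channels: int):
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    low_rank = u[:, :rank] @ torch.diag(s[:rank]) @ vh[:rank]  # dominant part
    residual = w - low_rank
    # Channel-wise sparsity: keep only the residual's largest-norm columns.
    idx = residual.norm(dim=0).topk(keep_channels).indices
    sparse = torch.zeros_like(residual)
    sparse[:, idx] = residual[:, idx]
    return low_rank, sparse

w = torch.randn(256, 256)
lr, sp = lost_decompose(w, rank=32, keep_channels=16)
print((w - (lr + sp)).norm() / w.norm())   # reconstruction error of the pair
```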
Automated SNOMED CT Concept Annotation in Clinical Text Using Bi-GRU Neural Networks
Authors: Ali Noori, Pratik Devkota, Somya Mohanty, Prashanti Manda
2025-08-04
http://arxiv.org/abs/2508.02556v1
Automated annotation of clinical text with standardized medical concepts is
critical for enabling structured data extraction and decision support. SNOMED
CT provides a rich ontology for labeling clinical entities, but manual
annotation is labor-intensive and impractical at scale. This study introduces a
neural sequence labeling approach for SNOMED CT concept recognition using a
Bidirectional GRU model. Leveraging a subset of MIMIC-IV, we preprocess text
with domain-adapted SpaCy and SciBERT-based tokenization, segmenting sentences
into overlapping 19-token chunks enriched with contextual, syntactic, and
morphological features. The Bi-GRU model assigns IOB tags to identify concept
spans and achieves strong performance with a 90 percent F1-score on the
validation set. These results surpass traditional rule-based systems and match
or exceed existing neural models. Qualitative analysis shows effective handling
of ambiguous terms and misspellings. Our findings highlight that lightweight
RNN-based architectures can deliver high-quality clinical concept annotation
with significantly lower computational cost than
transformer-based models,
making them well-suited for real-world deployment.
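A toy version of the chunking step; the stride is an assumption, since the abstract only specifies the 19-token window.

```python
# Segment a tokenized sentence into overlapping 19-token chunks, carrying
# IOB labels along; stride 10 is an assumed value for illustration.

def chunk(tokens, labels, size=19, stride=10):
    chunks = []
    for start in range(0, max(1, len(tokens) - size + 1), stride):
        chunks.append((tokens[start:start + size],
                       labels[start:start + size]))
    return chunks

tokens = [f"tok{i}" for i in range(40)]
labels = ["O"] * 40
labels[5:8] = ["B-CONCEPT", "I-CONCEPT", "I-CONCEPT"]  # a SNOMED CT span
for toks, labs in chunk(tokens, labels):
    print(len(toks), labs[:8])
```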
xDeepServe Model-as-a-Service on Huawei CloudMatrix384
Authors: Ao Xiao, Bangzheng He, Baoquan Zhang, Baoxing Huai, Bingji Wang, Bo Wang, Bo Xu, Boyi Hou, Chan Yang, Changhong Liu, Cheng Cui, Chenyu Zhu, Cong Feng, Daohui Wang, Dayun Lin, Duo Zhao, Fengshao Zou, Fu Wang, Gangqiang Zhang, Gengyuan Dan, Guanjie Chen, Guodong Guan, Guodong Yang, Haifeng Li, Haipei Zhu, Hao Feng, Hao Huang, Hao Xu, Hengrui Ma, Hengtao Fan, Hui Liu, Jia Li, Jiang Liu, Jiang Xu, Jie Meng, Jinhan Xin, Junhao Hu, Juwei Chen, Lan Yu, Lanxin Miao, Liang Liu, Linan Jing, Lu Zhou, Meina Han, Mingkun Deng, Mingyu Deng, Naitian Deng, Nizhong Lin, Peihan Zhao, Peng Pan, Pengfei Shen, Ping Li, Qi Zhang, Qin Zhang, Qingrong Xia, Qingyi Zhang, Qunchao Fu, Ren Guo, Ruimin Gao, Shaochun Li, Sheng Long, Shentian Li, Shining Wan, Shuai Shen, Shuangfu Zeng, Shuming Jing, Siqi Yang, Song Zhang, Tao Xu, Tianlin Du, Ting Chen, Wanxu Wu, Wei Jiang, Weinan Tong, Weiwei Chen, Wen Peng, Wenli Zhou, Wenquan Yang, Wenxin Liang, Xiang Liu, Xiaoli Zhou, Xin Jin, Xinyu Duan, Xu Li, Xu Zhang, Xusheng Chen, Yalong Shan, Yang Gan, Yao Lu, Yi Deng, Yi Zheng, Yingfei Zheng, Yiyun Zheng, Yizhou Shan, Yong Gao, Yongqiang Yang, Yuanjin Gong, Yue Yu, Yuetao Chen, Yukun Zhu, Yulong He, Yusu Zhao, Yuyan Wu, Zenan Zhang, Zhaojin Zhuo, Zhaoyang Ji, Zhefeng Wang, Zheng Wang, Zhenhua Yang, Zhenli Sheng, Zhibin Yu, Zhigang Ji, Zhihao Ren, Zhipeng Bian, Zhixia Liu, Zhiyu Dong, Zhonghua Li, Zhou Yu, Zhuoming Shen, Zhuwei Peng, Zi Ye, Zihao Xiang, Zimin Fu, Zixuan Zhang
2025-08-04
http://arxiv.org/abs/2508.02520v4
The rise of scaled-out LLMs and scaled-up SuperPods signals a new era in
large-scale AI infrastructure. LLMs continue to scale out via MoE, as seen in
recent models like DeepSeek, Kimi, and Qwen. In parallel, AI hardware is
scaling up, with Huawei's CloudMatrix384 SuperPod offering hundreds of GB/s
high-speed interconnects. Running large MoE models on SuperPod-scale hardware
brings new challenges. It requires new execution models, scalable scheduling,
efficient expert load balancing, and elimination of single points of failure.
This paper presents xDeepServe, Huawei Cloud's LLM serving system designed for
SuperPod-scale infrastructure. At its core is Transformerless, a disaggregated
architecture that decomposes transformer models into modular units--attention,
feedforward, and MoE--executed independently on NPUs connected via high-speed
fabric. We implement this design in two forms: disaggregated prefill-decode and
disaggregated MoE-attention. This fully disaggregated setup enables independent
scaling of compute and memory without sacrificing performance. To support this
architecture, we propose XCCL, a
communication library that leverages
CloudMatrix384's global shared memory to implement efficient point-to-point and
all-to-all primitives. We also extend our serving engine FlowServe with
system-level techniques, enabling scalable inference across hundreds of NPUs.
Decomposed Reasoning with Reinforcement Learning for Relevance Assessment in UGC Platforms
Authors: Xiaowei Yuan, Lei Jin, Haoxin Zhang, Yan Gao, Yi Wu, Yao Hu, Ziyang Huang, Jun Zhao, Kang Liu
2025-08-04
http://arxiv.org/abs/2508.02506v1
Retrieval-augmented generation (RAG) plays a critical role in user-generated
content (UGC) platforms, but its effectiveness depends heavily on accurate
relevance assessment of query-document pairs. Despite recent advances in
applying large language models (LLMs) to relevance modeling, UGC platforms
present unique challenges: 1) ambiguous user intent due to
sparse user feedback
in RAG scenarios, and 2) substantial noise introduced by informal and
unstructured language. To address these issues, we propose the Reinforced
Reasoning Model for Relevance Assessment (R3A), which introduces a decomposed
reasoning framework over queries and candidate documents before scoring. R3A
first leverages auxiliary high-ranked documents within the platform to infer
latent query intent. It then performs verbatim fragment extraction to justify
relevance decisions, thereby reducing errors caused by noisy UGC. Based on a
reinforcement learning framework, R3A is optimized to mitigate distortions
arising from ambiguous queries and unstructured content. Experimental results
show that R3A significantly outperforms existing baseline methods in terms of
relevance accuracy, across both offline benchmarks and online experiments.
CompressKV Semantic Retrieval Heads Know What Tokens are Not Important Before Generation
Authors: Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, Grace Li Zhang
2025-08-04
http://arxiv.org/abs/2508.02401v1
Recent advances in large language models (LLMs) have significantly boosted
long-context processing. However, the increasing key-value (KV) cache size
poses critical challenges to memory and execution efficiency. Most KV cache
compression methods rely on heuristic token eviction using all attention heads
in Grouped Query Attention (GQA)-based LLMs. This method ignores the different
functionalities of attention heads, leading to the eviction of critical tokens
and thus degrading the performance of LLMs.
To address the issue above, instead of using all the attention heads in
GQA-based LLMs to determine important tokens as in the previous work, we first
identify the attention heads in each layer that are not only capable of
retrieving the initial and final tokens of a prompt, but also capable of
retrieving important tokens within the text and attending to their surrounding
semantic context. Afterwards, we exploit such heads to determine the important
tokens and retain their corresponding KV cache pairs. Furthermore, we analyze
the cache eviction error of each layer individually and introduce a
layer-adaptive KV cache allocation strategy. Experimental results demonstrate
that the proposed CompressKV consistently outperforms state-of-the-art
approaches under various memory budgets on LongBench and Needle-in-a-Haystack
benchmarks. Our code is publicly available at:
https://github.com/TUDa-HWAI/CompressKV.git.
Beyond Manually Designed Pruning Policies with Second-Level Performance Prediction A Pruning Framework for LLMs
Authors: Zuxin Ma, Yunhe Cui, Yongbin Qin
2025-08-04
http://arxiv.org/abs/2508.02381v2
Non-uniform structured network pruning methods can effectively reduce Large
Language Model (LLM) size by eliminating redundant channels or layers, offering
) size by eliminating redundant channels or layers, offering
lower performance degradation than uniform strategies. However, existing
non-uniform methods rely heavily on manually designed pruning policies (e.g.,
layer importance and scaling factors), and therefore cannot efficiently adapt
to scenarios with dynamic pruning ratio requirements. Additionally, a critical
bottleneck -- the time-consuming evaluation of pruning policies -- further
limits the feasibility of iteratively and dynamically finding optimal pruning
policies. To address these limitations, we propose PPF (Predictive Pruning
Framework), a novel pruning framework for LLMs that eliminates manual design
dependencies via second-level performance prediction. PPF not only supports
real-time pruning decisions under dynamic pruning ratios but is also applicable
to static pruning scenarios. It employs an agent to produce adaptive, real-time
pruning actions, together with a lightweight performance predictor that can
evaluate a pruning policy in seconds, significantly speeding up the iterative
optimization process. Experiments on Llama2-7B and Llama3-8B show that PPF can
generate dynamic/static pruning policies and reduces perplexity by up to
33.4% (dynamic pruning) and 84.78% (static pruning) over existing methods,
outperforming manually designed pruning policies. The performance predictor
achieves second-level performance prediction with high accuracy (prediction
error < 0.0011). It reduces the mean evaluation latency from minute-level (1
minute and 38.02 seconds of test-set evaluation methods) to second-level (1.52
seconds), achieving over 64 times speedup. Our code will be available at
https://github.com/Ma-zx/PPF .
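A hypothetical shape for the second-level predictor: a small MLP mapping a per-layer pruning-ratio vector to a predicted perplexity, so candidate policies can be scored in a single forward pass. The architecture and input features below are assumptions, not PPF's actual design.

```python
import torch
import torch.nn as nn

# Assumed sketch of a lightweight pruning-policy performance predictor.
n_layers = 32
predictor = nn.Sequential(
    nn.Linear(n_layers, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),               # predicted perplexity (or its log)
)

policy = torch.rand(1, n_layers)     # candidate per-layer pruning ratios
predicted_ppl = predictor(policy)    # seconds-level scoring vs. a full test run
```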
Traffic-R1 Reinforced LLMs Bring Human-Like Reasoning to Traffic Signal Control Systems
Authors: Xingchen Zou, Yuhao Yang, Zheng Chen, Xixuan Hao, Yiqi Chen, Chao Huang, Yuxuan Liang
2025-08-04
http://arxiv.org/abs/2508.02344v1
Traffic signal control (TSC) is vital for mitigating congestion and
sustaining urban mobility. In this paper, we introduce Traffic-R1, a foundation
model with human-like reasoning for TSC systems. Our model is developed through
self-exploration and iteration of reinforced large language models (LLMs) with
expert guidance in a simulated traffic environment. Compared to traditional
reinforcement learning (RL) and recent
LLM-based methods, Traffic-R1 offers
three significant advantages. First, Traffic-R1 delivers zero-shot
generalisation, transferring unchanged to new road networks and
out-of-distribution incidents by utilizing its internal traffic control
policies and human-like reasoning. Second, its 3B-parameter architecture is
lightweight enough for real-time inference on mobile-class chips, enabling
large-scale edge deployment. Third, Traffic-R1 provides an explainable TSC
process and facilitates multi-intersection communication through its
self-iteration and a new synchronous communication network. Extensive
benchmarks demonstrate that Traffic-R1 sets a new state of the art,
outperforming strong baselines and training-intensive RL controllers. In
practice, the model now manages signals for more than 55,000 drivers daily,
shortening average queues by over 5% and halving operator workload. Our
checkpoint is available at https://huggingface.co/Season998/Traffic-R1.
CAMERA Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
Authors: Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che
2025-08-04
http://arxiv.org/abs/2508.02322v1
Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are
distinguished by their strong performance scaling with increasing parameters
across a wide range of tasks, yet they also suffer from substantial
computational and storage overheads. Notably, the performance gains of MoE
models do not scale proportionally with the growth in expert parameters. While
prior works attempt to reduce parameters via expert-level pruning, merging, or
decomposition, they still suffer from challenges in both performance and
computational efficiency. In this paper, we address these challenges by
introducing micro-expert as a finer-grained compression unit that spans across
matrices. We first establish a more fundamental perspective, viewing MoE layers
as mixtures of micro-experts, and present CAMERA, a lightweight and
training-free framework for identifying micro-expert redundancy. Our analysis
uncovers significant variance in micro-expert contributions during decoding.
Based on this insight, we further propose CAMERA-P, a structured micro-expert
pruning framework, and CAMERA-Q, a mixed-precision quantization scheme designed
for micro-experts. Extensive experiments on nine downstream tasks show that
CAMERA-P consistently outperforms strong baselines under pruning ratios ranging
from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under
aggressive 2-bit quantization, surpassing existing matrix- and channel-level
schemes. Notably, our method enables complete micro-expert analysis of
Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.
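As a rough illustration of what "micro-expert" granularity means, the hypothetical sketch below slices one expert's up-projection into row groups and ranks the groups by activation energy on calibration data; CAMERA's actual redundancy criterion may differ, and all sizes are invented.

    # Hypothetical micro-expert scoring: each group of FFN rows is a
    # micro-expert; rank groups by mean activation energy, keep the top 60%.
    import torch

    d_model, d_ff, group = 512, 2048, 64        # 64 rows per micro-expert
    W_up = torch.randn(d_ff, d_model)           # one expert's up-projection
    x = torch.randn(256, d_model)               # calibration activations

    h = torch.relu(x @ W_up.T)                  # (256, d_ff)
    scores = h.pow(2).mean(0).view(d_ff // group, group).sum(-1)
    keep = scores.topk(int(0.6 * scores.numel())).indices
    print("kept micro-experts:", sorted(keep.tolist()))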
VeOmni Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
Authors: Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, Xin Liu
2025-08-04
http://arxiv.org/abs/2508.02317v3
Recent advances in large language models (LLMs) have driven impressive
progress in omni-modal understanding and generation. However, training
omni-modal LLMs remains a significant challenge due to the heterogeneous model
architectures required to process diverse modalities, necessitating
sophisticated system design for efficient large-scale training. Existing
frameworks typically entangle model definition with parallel logic, incurring
limited scalability and substantial engineering overhead for end-to-end
omni-modal training. We present VeOmni, a modular and efficient training
framework to accelerate the development of omni-modal LLMs. VeOmni introduces
model-centric distributed recipes that decouple communication from
computation, enabling efficient 3D parallelism on omni-modal LLMs. VeOmni also
features a flexible configuration interface supporting seamless integration of
new modalities with minimal code change. Using VeOmni, an omni-modal
mixture-of-experts (MoE) model with 30B parameters can be trained with over
2,800 tokens/sec/GPU throughput and scale to 160K context lengths via 3D
parallelism on 128 GPUs, showcasing its superior efficiency and scalability for
training large omni-modal LLMs.
Isolating Culture Neurons in Multilingual Large Language Models
Authors: Danial Namazifard, Lukas Galke
2025-08-04
http://arxiv.org/abs/2508.02241v1
Language and culture are deeply intertwined, yet it remains unclear how and
where multilingual large language models encode culture. Here, we build on
an established methodology for identifying language-specific neurons and extend
it to localize and isolate culture-specific neurons, carefully disentangling
their overlap and interaction with language-specific neurons. To facilitate our
experiments, we introduce MUREL, a curated dataset of 85.2 million tokens
spanning six different cultures. Our localization and intervention experiments
show that LLMs encode different cultures in distinct neuron populations,
predominantly in upper layers, and that these culture neurons can be modulated
independently from language-specific neurons or those specific to other
cultures. These findings suggest that cultural knowledge and propensities in
multilingual language models can be selectively isolated and edited - promoting
fairness, inclusivity, and alignment. Code and data are available at
https://github.com/namazifard/Culture_Neurons .
Forecasting When to Forecast Accelerating Diffusion Models with Confidence-Gated Taylor
Authors: Xiaoliu Guan, Lielin Jiang, Hanqi Chen, Xu Zhang, Jiaxing Yan, Guanzhong Wang, Yi Liu, Zetao Zhang, Yu Wu
2025-08-04
http://arxiv.org/abs/2508.02240v2
Diffusion Transformers (DiTs) have demonstrated remarkable performance in
visual generation tasks. However, their low inference speed limits their
deployment in low-resource applications. Recent training-free approaches
exploit the redundancy of features across timesteps by caching and reusing past
representations to accelerate inference. Building on this idea, TaylorSeer
instead uses cached features to predict future ones via Taylor expansion.
However, its module-level prediction across all blocks (e.g.,
attention or feedforward modules) requires storing fine-grained intermediate
features, leading to notable memory and computation overhead. Moreover, it
adopts a fixed caching schedule without considering the varying accuracy of
predictions across timesteps, which can lead to degraded outputs when
prediction fails. To address these limitations, we propose a novel approach to
better leverage Taylor-based acceleration. First, we shift the Taylor
prediction target from the module level to the last block level, significantly
reducing the number of cached features. Furthermore, observing strong
sequential dependencies among Transformer blocks, we propose to use the error
between the Taylor-estimated and actual outputs of the first block as an
indicator of prediction reliability. If the error is small, we trust the Taylor
prediction for the last block; otherwise, we fall back to full computation,
thereby enabling a dynamic caching mechanism. Empirical results show that our
method achieves a better balance between speed and quality, achieving a 3.17x
speedup on FLUX, 2.36x on DiT, and 4.14x on Wan Video with negligible
quality drop. The project page is https://cg-taylor-acce.github.io/CG-Taylor/.
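A toy sketch of the confidence-gating logic (function names and the threshold tau are assumptions, not the released implementation): forecast a block's output by first-order extrapolation from its cached history, and fall back to full computation whenever the first block's forecast error is large.

    # Confidence-gated Taylor forecasting, first order: extrapolate from the
    # two most recent cached outputs; the first block's error gates reuse.
    import torch

    def taylor_forecast(y_prev, y_prev2):
        return y_prev + (y_prev - y_prev2)       # finite-difference 1st order

    def step(first_block, last_block, x, cache, tau=0.05):
        y1 = first_block(x)                      # always computed (cheap probe)
        y1_hat = taylor_forecast(cache["f1"][-1], cache["f1"][-2])
        rel_err = (y1_hat - y1).norm() / y1.norm()
        if rel_err < tau:                        # forecast deemed reliable
            yL = taylor_forecast(cache["fL"][-1], cache["fL"][-2])
        else:                                    # fall back to full compute
            yL = last_block(y1)
        cache["f1"].append(y1); cache["fL"].append(yL)
        return yL

    # Tiny demo with stand-in blocks:
    cache = {"f1": [torch.zeros(4)] * 2, "fL": [torch.zeros(4)] * 2}
    out = step(lambda x: 0.9 * x, lambda y: y + 1, torch.ones(4), cache)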
LeanK Learnable K Cache Channel Pruning for Efficient Decoding
Authors: Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu
2025-08-04
http://arxiv.org/abs/2508.02215v1
Large language models (LLMs) enable long-context tasks but face efficiency
challenges due to the growing key-value (KV) cache. We propose LeanK, a
learning-based method that prunes unimportant key (K) cache channels by
leveraging static channel sparsity. With a novel two-stage training process,
LeanK learns a channel-wise static mask that satisfies a specific sparsity
ratio and hardware alignment requirements. LeanK reduces GPU memory and
accelerates decoding without sacrificing accuracy. Experiments demonstrate up
to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel
enables 1.3x speedup for attention computation. We also provide insights into
model channels and attention heads during long-context inference by analyzing
the learned importance distribution. Our code is available at
https://aka.ms/LeanK.
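The sketch below shows the general shape of static K-channel pruning (shapes and the random mask are stand-ins for LeanK's learned mask, and the scaling choice is assumed): only the kept channels of K are stored and attended over, while V is untouched.

    # Hypothetical static K-cache channel pruning: keep a per-head subset of
    # key channels and compute attention scores over that subset only.
    import torch

    B, H, T, D = 1, 8, 1024, 128
    K = torch.randn(B, H, T, D)
    Q = torch.randn(B, H, 1, D)

    mask = torch.rand(H, D) > 0.7                    # stand-in learned mask
    idx = [m.nonzero().squeeze(-1) for m in mask]

    # Only kept channels are cached, shrinking K-cache memory per head.
    scores = torch.stack(
        [Q[:, h, :, idx[h]] @ K[:, h, :, idx[h]].transpose(-1, -2)
         for h in range(H)], dim=1) / D ** 0.5       # scale choice assumed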
Whispering Agents An event-driven covert communication protocol for the Internet of Agents
Authors: Kaibo Huang, Yukun Wei, WanSheng Wu, Tianhua Zhang, Zhongliang Yang, Linna Zhou
2025-08-04
http://arxiv.org/abs/2508.02188v1
The emergence of the Internet of Agents (IoA) introduces critical challenges
for privacy in sensitive, high-stakes domains. While standard
Agent-to-Agent (A2A) protocols secure message content, they are not designed to
protect the act of communication itself, leaving agents vulnerable to
surveillance and traffic analysis. We find that the rich, event-driven nature
of agent dialogues provides a powerful, yet untapped, medium for covert
communication. To harness this potential, we introduce and formalize the Covert
Event Channel, the first unified model for agent covert communication, driven
by three interconnected dimensions: the Storage, Timing, and
Behavioral channels. Based on this model, we design and engineer ΠCCAP, a
novel protocol that operationalizes this event-driven paradigm. Our
comprehensive evaluation demonstrates that ΠCCAP achieves high capacity and
robustness while remaining imperceptible to powerful LLM-based wardens,
establishing its practical viability. By systematically engineering this
channel, our work provides the foundational understanding essential for
developing the next generation of monitoring systems and defensive protocols
for a secure and trustworthy IoA.
Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation
Authors: Kuiyuan Ding, Caili Guo, Yang Yang, Zhongtian Du, Walid Saad
2025-08-04
http://arxiv.org/abs/2508.02148v1
Large-scale models (LSMs) can be an effective framework for semantic
representation and understanding, thereby providing a suitable tool for
designing semantic communication (SC) systems. However, their direct deployment
is often hindered by high computational complexity and resource requirements.
In this paper, a novel robust knowledge distillation based semantic
communication (RKD-SC) framework is proposed to enable efficient and
channel-noise-robust LSM-powered SC. The framework addresses
two key challenges: determining optimal compact model architectures and
effectively transferring knowledge while maintaining robustness against channel
noise. First, a knowledge distillation-based lightweight differentiable
architecture search (KDL-DARTS) algorithm is proposed. This algorithm
integrates knowledge distillation loss and a complexity penalty into the neural
architecture search process to identify high-performance, lightweight semantic
encoder architectures. Second, a novel two-stage robust knowledge distillation
(RKD) algorithm is developed to transfer semantic capabilities from an LSM
(teacher) to a compact encoder (student) and subsequently enhance system
robustness. To further improve resilience to channel impairments, a
channel-aware transformer (CAT) block is introduced as the channel codec,
trained under diverse channel conditions with variable-length outputs.
Extensive simulations on image classification tasks demonstrate that the RKD-SC
framework significantly reduces model parameters while preserving a high degree
of the teacher model's performance and exhibiting superior robustness compared
to existing methods.
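The KDL-DARTS objective can be pictured as an ordinary DARTS-style loss with two extra terms; the sketch below is a guess at its general form (the weights alpha, beta and the op-cost model are invented), not the paper's exact formulation.

    # Distillation-aware architecture-search loss: task CE + KD term +
    # differentiable complexity penalty over the architecture softmax.
    import torch
    import torch.nn.functional as F

    def kdl_darts_loss(student_logits, teacher_logits, labels,
                       arch_weights, op_costs, T=4.0, alpha=0.5, beta=1e-3):
        ce = F.cross_entropy(student_logits, labels)
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits / T, dim=-1),
                      reduction="batchmean") * T * T
        # Expected cost of the supernet under the current architecture weights.
        complexity = (F.softmax(arch_weights, dim=-1) * op_costs).sum()
        return (1 - alpha) * ce + alpha * kd + beta * complexity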
Amber Pruner Leveraging NM Activation Sparsity for Efficient Prefill in Large Language Models
Authors: Tai An, Ruwu Cai, Yanzhe Zhang, Yang Liu, Hao Chen, Pengcheng Xie, Sheng Chang, Yiwu Yao, Gongyi Wang
2025-08-04
http://arxiv.org/abs/2508.02128v1
In the era of large language models (LLMs), N:M sparsity has emerged as a
structured compression technique critical for accelerating inference. While
prior work has primarily focused on weight sparsity, it often suffers from
significant accuracy degradation. Activation sparsity, though promising, is
typically training-dependent and faces challenges in generalization. To address
these limitations, we introduce Amber Pruner, a training-free N:M activation
sparsity method designed specifically for the prefill stage, targeting the
acceleration of linear projection layers in LLMs. Extensive experiments across
multiple models and sparsity ratios (2:4, 4:8, and 8:16) demonstrate that Amber
Pruner can effectively sparsify and accelerate more than 55% of linear
computations without requiring model retraining. To further enhance generality
and efficiency, we propose Outstanding-Sparsity, a unified framework that
integrates Amber Pruner with post-training W8A8 quantization. Our approach
preserves strong performance across a range of downstream tasks, with notable
advantages in generative tasks. This work pioneers a new frontier in activation
sparsity, providing foundational insights that are poised to guide the
co-evolution of algorithms and architectures in the design of next-generation
AI systems.
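For readers unfamiliar with N:M schemes, the toy function below applies 2:4 sparsity to activations, keeping the two largest magnitudes in every group of four; it illustrates the structured constraint Amber Pruner exploits, not the authors' kernel or selection rule.

    # Toy 2:4 activation sparsification: zero the 2 smallest of every 4 values.
    import torch

    def nm_sparsify(x, n=2, m=4):
        g = x.reshape(-1, m)                            # group last dim by m
        topk = g.abs().topk(n, dim=-1).indices
        mask = torch.zeros_like(g).scatter_(-1, topk, 1.0)
        return (g * mask).reshape(x.shape)

    act = torch.randn(8, 16)
    print(nm_sparsify(act)[0])        # exactly 2 nonzeros per group of 4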
A Survey on AgentOps Categorization, Challenges, and Future Directions
Authors: Zexin Wang, Jingjing Li, Quan Zhou, Haotian Si, Yuanhao Liu, Jianhui Li, Gaogang Xie, Fei Sun, Dan Pei, Changhua Pei
2025-08-04
http://arxiv.org/abs/2508.02121v1
As the reasoning capabilities of Large Language Models (LLMs) continue to
advance, LLM-based agent systems offer advantages in flexibility and
interpretability over traditional systems, garnering increasing attention.
However, despite the widespread research interest and industrial application of
agent systems, these systems, like their traditional counterparts, frequently
encounter anomalies. These anomalies lead to instability and insecurity,
hindering their further development. Therefore, a comprehensive and systematic
approach to the operation and maintenance of agent systems is urgently needed.
Unfortunately, current research on the operations of agent systems is sparse.
To address this gap, we have undertaken a survey on agent system operations
with the aim of establishing a clear framework for the field, defining the
challenges, and facilitating further development. Specifically, this paper
begins by systematically defining anomalies within agent systems, categorizing
them into intra-agent anomalies and inter-agent anomalies. Next, we introduce a
novel and comprehensive operational framework for agent systems, dubbed Agent
System Operations (AgentOps). We provide detailed definitions and explanations
of its four key stages: monitoring, anomaly detection, root cause analysis, and
resolution.
AlignGuard-LoRA Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
Authors: Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha
2025-08-04
http://arxiv.org/abs/2508.02079v1
Low-rank adaptation (LoRA) has become a standard tool for efficiently
fine-tuning large language models (LLMs). Yet, even minor LoRA updates can
induce alignment drift, weakening safety and behavioral constraints through
entangled parameter changes. To address this, we propose AlignGuard-LoRA (AGL),
a principled framework for preserving alignment during fine-tuning. AGL
introduces several key components: a primary task loss for supervision, Fisher
Information Matrix-based regularization to restrict updates in
alignment-sensitive subspaces, and task-specific regularization to stabilize
the integration of new knowledge. We further introduce collision-aware
regularization, blending Riemannian overlap -- which penalizes coordinate-wise
interference -- and geodesic separation -- which encourages disjoint update
geometry. We curate DriftCaps, a targeted diagnostic benchmark of safe and
unsafe prompts designed to quantify alignment drift and safety degradation.
Empirical evaluations show that AGL mitigates alignment drift by up to 50% on
safety-critical benchmarks without degrading downstream task performance.
Comprehensive ablation confirms that each component contributes distinctly to
preserving latent safety behaviors. Finally, we derive and validate a scaling
law for catastrophic forgetting, revealing that AGL flattens post-finetuning
loss escalation while preserving adaptation dynamics. AGL is a structurally
grounded refinement of LoRA, ensuring alignment preservation with minimal
trade-offs. To encourage further exploration and development, we open-source
our implementation.
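One of AGL's ingredients, Fisher-guided restriction of updates, can be sketched as a quadratic penalty on the LoRA-induced weight delta; all names, shapes, and the diagonal-Fisher stand-in below are illustrative assumptions rather than the authors' code.

    # Illustrative Fisher-weighted penalty on the LoRA update B @ A: updates
    # along alignment-sensitive (high-Fisher) directions are taxed heavily.
    import torch

    def fisher_penalty(delta_w, fisher_diag, lam=1.0):
        return lam * (fisher_diag * delta_w.pow(2)).sum()

    A = torch.randn(8, 512, requires_grad=True)    # LoRA factors, rank 8
    B = torch.randn(512, 8, requires_grad=True)
    fisher = torch.rand(512 * 512)                 # stand-in diagonal Fisher
    loss = fisher_penalty((B @ A).flatten(), fisher)
    loss.backward()                                # gradients flow to A and B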
Everyone Contributes! Incentivizing Strategic Cooperation in Multi-LLM Systems via Sequential Public Goods Games
Authors: Yunhao Liang, Yuan Qu, Jingyuan Yang, Shaochong Lin, Zuo-Jun Max Shen
2025-08-04
http://arxiv.org/abs/2508.02076v1
Coordinating multiple large language models (LLMs) to solve complex tasks
collaboratively poses a fundamental trade-off between computation costs and
collective performance relative to individual models. We introduce a novel,
game-theoretically grounded reinforcement learning (RL) framework, the
Multi-Agent Cooperation Sequential Public Goods Game (MAC-SPGG), to
systematically incentivize cooperation in multi-LLM ensembles. In MAC-SPGG,
agents move in sequence, observing predecessors' outputs and updating beliefs
to condition their own contributions. By redesigning the public-goods reward,
effortful contributions become the unique Subgame Perfect Nash Equilibrium
(SPNE), which eliminates free-riding under traditional SPGG or PGG. Its
sequential protocol replaces costly round-based information exchanges with a
streamlined decision flow, cutting communication overhead while retaining
strategic depth. We prove the existence and uniqueness of the SPNE under
realistic parameters, and empirically show that MAC-SPGG-trained ensembles
outperform single-agent baselines, chain-of-thought prompting, and other
cooperative methods, even achieving comparable performance to large-scale
models across reasoning, math, code generation, and NLP tasks. Our results
highlight the power of structured, incentive-aligned MAC-SPGG cooperation for
scalable and robust multi-agent language generation.
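The public-goods structure is easy to state in a few lines; the toy payoff below (with an invented return factor r and effort cost) shows the quantity whose redesign, per the paper, makes full effort the unique equilibrium rather than free-riding.

    # Toy sequential public goods payoff: contributions are scaled by r,
    # shared equally, and each agent pays a private cost for its own effort.
    def spgg_payoffs(contributions, r=1.6, cost=1.0):
        pot = r * sum(contributions)
        n = len(contributions)
        return [pot / n - cost * c for c in contributions]

    # The third agent free-rides and earns the most under this naive reward,
    # which is exactly what a redesigned reward is meant to remove.
    print(spgg_payoffs([1.0, 1.0, 0.0, 1.0]))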
CVD-SfM A Cross-View Deep Front-end Structure-from-Motion System for Sparse Localization in Multi-Altitude Scenes
Authors: Yaxuan Li, Yewei Huang, Bijay Gaudel, Hamidreza Jafarnejadsani, Brendan Englot
2025-08-03
http://arxiv.org/abs/2508.01936v1
We present a novel multi-altitude camera pose estimation system, addressing
the challenges of robust and accurate localization across varied altitudes when
only considering image input. The system effectively handles diverse
environmental conditions and viewpoint variations by integrating the cross-view
transformer, deep features, and structure-from-motion into a unified framework.
To benchmark our method and foster further research, we introduce two newly
collected datasets specifically tailored for multi-altitude camera pose
estimation; datasets of this nature remain rare in the current literature. The
proposed framework has been validated through extensive comparative analyses on
these datasets, demonstrating that our system achieves superior performance in
both accuracy and robustness for multi-altitude
pose estimation tasks
compared to existing solutions, making it well suited for real-world robotic
applications such as aerial navigation, search and rescue, and automated
inspection.
IAUNet Instance-Aware U-Net
Authors: Yaroslav Prytula, Illia Tsiporenko, Ali Zeynalli, Dmytro Fishman
2025-08-03
http://arxiv.org/abs/2508.01928v1
Instance segmentation is critical in biomedical imaging to accurately
distinguish individual objects like cells, which often overlap and vary in
size. Recent query-based methods, where object queries guide segmentation, have
shown strong performance. While U-Net has been a go-to architecture in medical
image segmentation, its potential in query-based approaches remains largely
unexplored. In this work, we present IAUNet, a novel query-based U-Net
architecture. The core design features a full U-Net architecture, enhanced by a
novel lightweight convolutional Pixel decoder, making the model more efficient
and reducing the number of parameters. Additionally, we propose a Transformer
decoder that refines object-specific features across multiple scales. Finally,
we introduce the 2025 Revvity Full Cell Segmentation Dataset, a unique resource
with detailed annotations of overlapping cell cytoplasm in brightfield images,
setting a new benchmark for biomedical instance segmentation. Experiments on
multiple public datasets and our own show that IAUNet outperforms most
state-of-the-art fully convolutional, transformer-based, and query-based
models, as well as cell segmentation-specific models, setting a strong
baseline for cell
instance segmentation tasks. Code is available at
https://github.com/SlavkoPrytula/IAUNet
Quantum-RAG and PunGPT2 Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language
Authors: Jaskaranjeet Singh, Rakesh Thakur
2025-08-03
http://arxiv.org/abs/2508.01918v1
Despite the rapid advancement of large language models (LLMs), low-resource
languages remain largely excluded from the NLP landscape. We present PunGPT2,
the first fully open-source suite of Punjabi large language models, trained
from scratch on a 35GB domain-diverse corpus encompassing literature, religious
texts, news, and social discourse. Unlike prior multilingual approaches,
PunGPT2 captures rich syntactic and morphological features unique to Punjabi
through a tokenizer optimised with byte pair encoding and linguistically
aligned pretraining objectives. To improve factual grounding and domain recall,
we introduce Pun-RAG, a retrieval-augmented generation framework combining
PunGPT2 with a dense FAISS retriever over a curated Punjabi knowledge base. We
further develop Pun-Instruct, a parameter-efficient, instruction-tuned variant
using QLoRA, enabling robust zero-shot and instruction-following performance
with significantly reduced compute needs.
As a key innovation, we propose Quantum-RAG, a novel hybrid retrieval system
that fuses sparse retrieval (BM25) and dense methods with quantum-inspired
semantic
matching. By encoding queries using amplitude-based embeddings and retrieving
via quantum kernel similarity, Quantum-RAG achieves improved contextual
relevance with minimal memory overhead, marking the first practical integration
of quantum representations in low-resource language generation. Our models
significantly outperform strong multilingual baselines (mBERT, mT5, MuRIL) in
perplexity, factuality, and fluency. This work provides a scalable,
reproducible blueprint for extending LLM capabilities to underrepresented
languages and pioneers quantum-aware retrieval in low-resource NLP.
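One minimal reading of "quantum kernel similarity" is the fidelity between amplitude-encoded (L2-normalized) vectors, i.e., a squared inner product; the sketch below fuses that with a BM25 score. This is an interpretation for illustration, with an invented fusion weight alpha, not the paper's exact retriever.

    # Hybrid sparse + quantum-inspired dense scoring (illustrative only;
    # in practice the two score scales would be normalized before fusing).
    import numpy as np

    def quantum_kernel(q, d):
        q = q / np.linalg.norm(q)          # amplitude encoding as L2 norm
        d = d / np.linalg.norm(d)
        return float(np.dot(q, d) ** 2)    # fidelity |<q|d>|^2

    def hybrid_score(bm25, q_emb, d_emb, alpha=0.5):
        return alpha * bm25 + (1 - alpha) * quantum_kernel(q_emb, d_emb)

    rng = np.random.default_rng(1)
    print(hybrid_score(12.3, rng.normal(size=64), rng.normal(size=64)))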
AGFT An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization
Authors: Zicong Ye, Kunming Zhang, Guoming Tang
2025-08-03
http://arxiv.org/abs/2508.01744v1
The explosive growth of interactive Large Language Models (LLMs) has placed
unprecedented demands for low latency on cloud GPUs, forcing them into
high-power modes and causing escalating energy costs. Real-time inference
workloads exhibit significant dynamic volatility, presenting substantial
energy-saving opportunities. However, traditional static or rule-based power
management strategies struggle to exploit these opportunities without
compromising peak performance. To address this challenge, we propose AGFT (An
Adaptive GPU Frequency Tuner), a framework that employs online reinforcement
learning to autonomously learn an optimal frequency tuning policy. By
monitoring real-time features like request load and latency, AGFT utilizes
fine-grained frequency control for precise adjustments and intelligent action
space pruning for stable, efficient decision-making. This creates a robust,
automated energy management solution. We comprehensively evaluated AGFT in an
environment simulating realistic, fluctuating inference requests. The
experimental results demonstrate that AGFT successfully saves 44.3% of GPU
energy consumption while introducing a minimal performance latency overhead of
under 10%. This achievement translates into a comprehensive Energy-Delay
Product (EDP) optimization of up to 40.3%, clearly showing that our framework
can significantly enhance the energy efficiency and economic benefits of
existing LLM inference clusters without compromising service quality.
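In spirit, AGFT is an online RL loop over GPU frequency levels; the epsilon-greedy sketch below (invented state buckets, reward shape, and frequency list) shows one way the pieces fit, with the reward penalizing an energy-delay-product-style cost.

    # Hypothetical online tuner: bucket the observed load into a state, pick a
    # frequency epsilon-greedily, and update toward an EDP-style reward.
    import random

    FREQS = [900, 1200, 1500, 1800]       # MHz levels (a pruned action space)
    Q = {}                                # (load_bucket, freq) -> value

    def choose(load_bucket, eps=0.1):
        if random.random() < eps:
            return random.choice(FREQS)
        return max(FREQS, key=lambda f: Q.get((load_bucket, f), 0.0))

    def update(load_bucket, freq, energy_j, latency_s, lr=0.1):
        reward = -(energy_j * latency_s)  # lower energy-delay -> better
        key = (load_bucket, freq)
        Q[key] = Q.get(key, 0.0) + lr * (reward - Q.get(key, 0.0))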
SmallKV Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
Authors: Yi Zhao, Yajuan Peng, Cam-Tu Nguyen, Zuchao Li, Xiaoliang Wang, Hai Zhao, Xiaoming Fu
2025-08-03
http://arxiv.org/abs/2508.02751v1
KV cache eviction has emerged as an effective solution to alleviate resource
constraints faced by LLMs in long-context scenarios. However, existing
token-level eviction methods often overlook two critical aspects: (1) their
irreversible eviction strategy fails to adapt to dynamic attention patterns
during decoding (the saliency shift problem), and (2) they treat both
marginally important tokens and truly unimportant tokens equally, despite the
collective significance of marginal tokens to model performance (the marginal
information over-compression problem). To address these issues, we design two
compensation mechanisms based on the high similarity of attention matrices
between LLMs of different scales. We propose SmallKV, a small model assisted
compensation method for KV cache compression. SmallKV can maintain attention
matching between different-scale LLMs to: 1) assist the larger model in
perceiving globally important information of attention; and 2) use the smaller
model's attention scores to approximate those of marginal tokens in the larger
model. Extensive experiments on benchmarks including GSM8K, BBH, MT-Bench, and
LongBench demonstrate the effectiveness of SmallKV. Moreover, efficiency
evaluations show that SmallKV achieves 1.75-2.56x higher throughput than
baseline methods, highlighting its potential for efficient and performant
inference in resource-constrained environments.
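The compensation idea can be pictured with two aligned attention distributions; in this hypothetical sketch (random scores, invented budgets), the small model's attention is used to rescue "marginal" tokens the large model would otherwise evict outright.

    # Conceptual SmallKV-style selection: keep the large model's top tokens
    # exactly, and pick marginal tokens using the small model's scores.
    import torch

    T = 512
    attn_large = torch.softmax(torch.randn(T), dim=0)   # large-model scores
    attn_small = torch.softmax(torch.randn(T), dim=0)   # aligned small model

    k_keep, k_marginal = 128, 64
    important = attn_large.topk(k_keep).indices          # kept exactly
    kept = set(important.tolist())
    rest = torch.tensor([i for i in range(T) if i not in kept])
    # Approximate marginal tokens with the small model's scores instead of
    # dropping them (addressing marginal information over-compression).
    marginal = rest[attn_small[rest].topk(k_marginal).indices]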
EAC-MoE Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
Authors: Yuanteng Chen, Yuantian Shao, Peisong Wang, Jian Cheng
2025-08-03
http://arxiv.org/abs/2508.01625v1
Mixture-of-Experts (MoE) has demonstrated promising potential in scaling
LLMs. However, it is hindered by two critical challenges: (1) substantial GPU
memory consumption to load all experts; (2) low activated parameters cannot be
equivalently translated into inference acceleration effects. In this work, we
propose EAC-MoE, an Expert-Selection Aware Compressor for MoE-LLMs, which
deeply aligns with the characteristics of MoE from the perspectives of
quantization and pruning, and introduces two modules to address these two
challenges respectively: (1) The expert selection bias caused by low-bit
quantization is a major factor contributing to the performance degradation in
MoE-LLMs. Based on this, we propose Quantization with Expert-Selection
Calibration (QESC), which mitigates the expert selection bias by calibrating
the routers within the MoE; (2) There are always certain experts that are not
crucial for the corresponding tasks, yet causing inference latency. Therefore,
we propose Pruning based on Expert-Selection Frequency (PESF), which
significantly improves inference speed by pruning less frequently used experts
for the current task. Extensive experiments demonstrate that our approach
significantly reduces memory usage and improves inference speed with minimal
performance degradation.
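The PESF side of the method is straightforward to caricature: count router selections on a task's calibration prompts and skip rarely chosen experts. The threshold and sizes below are illustrative only, not the paper's settings.

    # Toy expert-selection frequency pruning: profile top-k routing counts,
    # then keep only experts whose selection frequency clears a threshold.
    import torch

    n_experts, n_tokens, k = 16, 10_000, 2
    router_logits = torch.randn(n_tokens, n_experts)
    picks = router_logits.topk(k, dim=-1).indices.flatten()
    freq = torch.bincount(picks, minlength=n_experts).float() / n_tokens

    active = (freq > 0.05).nonzero().squeeze(-1)   # experts kept for this task
    print(f"keeping {active.numel()}/{n_experts} experts")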
RepoForge Training a SOTA Fast-thinking SWE Agent with an End-to-End Data Curation Pipeline Synergizing SFT and RL at Scale
Authors: Zhilong Chen, Chengzong Zhao, Boyuan Chen, Dayi Lin, Yihao Chen, Arthur Leung, Gopi Krishnan Rajbahadur, Gustavo A. Oliva, Ahmed E. Hassan
2025-08-03
http://arxiv.org/abs/2508.01550v1
Training software engineering (SWE) LLMs is bottlenecked by expensive
infrastructure, inefficient evaluation pipelines, scarce training data, and
costly quality control. We present RepoForge, an autonomous, end-to-end
pipeline that generates, evaluates, and trains SWE agents at scale. Our key
contributions include: (1) RepoForge-8B-Agent, achieving 17.4% on
SWE-Bench-Verified, establishing a new state of the art for 8B non-thinking
LLMs; (2) 7,304 executable environments auto-generated from real GitHub
commits with zero manual intervention; (3) a 14x storage reduction (from 1.4GB
to 102MB per instance) via intelligent dependency management and image
pruning; (4) 70% faster evaluation using a Ray-powered distributed RepoForge
harness; (5) 19,000x cheaper labeling through our automated SPICE-based
difficulty assessment technique. By unifying
storage-efficient sandboxing, Ray-powered evaluation harness, automated data
generation, SPICE-based labeling, and bubble-free RL scaffold, we demonstrate
that even 8B models can reach new state-of-the-art performance on
demanding benchmarks like SWE-Bench-Verified. Our approach addresses critical
bottlenecks in SWE agent training: high storage costs of container-based
evaluation, inefficient sequential reward pipelines, limited availability of
high-quality training data, expensive manual labeling, and multi-turn RL
pipeline bottlenecks.
BlockA2A Towards Secure and Verifiable Agent-to-Agent Interoperability
Authors: Zhenhua Zou, Zhuotao Liu, Lepeng Zhao, Qiuyang Zhan
2025-08-02
http://arxiv.org/abs/2508.01332v2
The rapid adoption of agentic AI, powered by large language models (LLMs), is
transforming enterprise ecosystems with autonomous agents that execute complex
workflows. Yet we observe several key security vulnerabilities in LLM-driven
multi-agent systems (MASes): fragmented identity frameworks, insecure
communication channels, and inadequate defenses against Byzantine agents or
adversarial prompts. In this paper, we present the first systematic analysis of
these emerging multi-agent risks and explain why the legacy security strategies
cannot effectively address these risks. Afterwards, we propose BlockA2A, the
first unified multi-agent trust framework that enables secure and verifiable
communication and agent-to-agent interoperability. At a high level, BlockA2A
adopts
decentralized identifiers (DIDs) to enable fine-grained cross-domain agent
authentication, blockchain-anchored ledgers to enable immutable auditability,
and smart contracts to dynamically enforce context-aware access control
policies. BlockA2A eliminates centralized trust bottlenecks, ensures message
authenticity and execution integrity, and guarantees accountability across
agent interactions. Furthermore, we propose a Defense Orchestration Engine
(DOE) that actively neutralizes attacks through real-time mechanisms, including
Byzantine agent flagging, reactive execution halting, and instant permission
revocation. Empirical evaluations demonstrate BlockA2A's effectiveness in
neutralizing prompt-based, communication-based, behavioral, and systemic MAS
attacks. We formalize its integration into existing MAS and showcase a
practical implementation for Google's A2A protocol. Experiments confirm that
BlockA2A and DOE operate with sub-second overhead, enabling scalable deployment
in production LLM-based MAS environments.
Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models
Authors: Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat
2025-08-02
http://arxiv.org/abs/2508.01261v1
We present MoE-MLA-RoPE, a novel architecture that combines
Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary
Position Embeddings (RoPE) for efficient language modeling. Our approach
addresses the fundamental trade-off between model capacity and computational
efficiency through three key innovations: (1) fine-grained expert routing with
64 micro-experts and top-k selection, enabling flexible specialization
through 3.6 * 10^7 possible expert combinations; (2) shared expert isolation
that dedicates 2 always active experts for common patterns while routing to 6
of 62 specialized experts; and (3) gradient-conflict-free load balancing that
maintains expert utilization without interfering with primary loss
optimization.
Extensive experiments on models ranging from 17M to 202M parameters
demonstrate that MoE-MLA-RoPE with compression ratio r=d/2 achieves 68% KV
cache memory reduction and 3.2x inference speedup while maintaining competitive
perplexity (0.8% degradation). Compared to a baseline with 53.9M parameters,
MoE-MLA-RoPE improves validation loss by 6.9% over vanilla transformers while
using 42% fewer active parameters per forward pass. FLOP-matched experiments
reveal even larger gains: 11.1% improvement with 3.2x inference acceleration.
Automated evaluation using GPT-4 as a judge confirms
quality improvements in generation, with higher scores on coherence (8.1/10),
creativity (7.9/10) and grammatical correctness (8.2/10). Our results establish
that architectural novelty, not parameter scaling, defines the efficiency
frontier for resource-constrained language model deployment.
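The routing arithmetic (2 shared experts always on, top-k routing over 62 specialists) is compact enough to sketch; dimensions and names here are assumptions for illustration, not the paper's implementation.

    # Illustrative fine-grained router: softmax-weighted top-k over 62
    # specialist micro-experts; 2 shared experts would run unconditionally.
    import torch

    def route(x, gate, k=6):
        logits = gate(x)                    # (batch, 62) specialist scores
        top = logits.topk(k, dim=-1)
        weights = torch.softmax(top.values, dim=-1)
        return top.indices, weights

    gate = torch.nn.Linear(256, 62)
    idx, w = route(torch.randn(4, 256), gate)
    print(idx.shape, w.shape)               # (4, 6) routing per token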
Asking the Right Questions Benchmarking Large Language Models in the Development of Clinical Consultation Templates
Authors: Liam G. McCoy, Fateme Nateghi Haredasht, Kanav Chopra, David Wu, David JH Wu, Abass Conteh, Sarita Khemani, Saloni Kumar Maharaj, Vishnu Ravi, Arth Pahwa, Yingjie Weng, Leah Rosengaus, Lena Giang, Kelvin Zhenghao Li, Olivia Jee, Daniel Shirvani, Ethan Goh, Jonathan H. Chen
2025-08-02
http://arxiv.org/abs/2508.01159v1
This study evaluates the capacity of large language models (LLMs) to generate
structured clinical consultation templates for electronic consultation. Using
145 expert-crafted templates developed and routinely used by Stanford's
eConsult team, we assess frontier models -- including o3, GPT-4o, Kimi K2,
Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro -- for their ability to
produce clinically coherent, concise, and prioritized clinical question
schemas. Through a multi-agent pipeline combining prompt optimization, semantic
autograding, and prioritization analysis, we show that while models like o3
achieve high comprehensiveness (up to 92.2%), they consistently generate
excessively long templates and fail to correctly prioritize the most clinically
important questions under length constraints. Performance varies across
specialties, with significant degradation in narrative-driven fields such as
psychiatry and pain medicine. Our findings demonstrate that LLMs can enhance
structured clinical information exchange between physicians, while highlighting
the need for more robust evaluation methods that capture a model's ability to
prioritize clinically salient information within the time constraints of
real-world physician communication.
Towards Bridging Review Sparsity in Recommendation with Textual Edge Graph Representation
Authors: Leyao Wang, Xutao Mao, Xuhui Zhan, Yuying Zhao, Bo Ni, Ryan A. Rossi, Nesreen K. Ahmed, Tyler Derr
2025-08-02
http://arxiv.org/abs/2508.01128v1
Textual reviews enrich recommender systems with fine-grained preference
signals and enhanced explainability. However, in real-world scenarios, users
rarely leave reviews, resulting in severe sparsity that undermines the
effectiveness of existing models. A natural solution is to impute or generate
missing reviews to enrich the data. However, conventional imputation techniques
-- such as matrix completion and LLM-based augmentation -- either lose
contextualized semantics by embedding texts into vectors, or overlook
structural dependencies among user-item interactions. To address these
shortcomings, we propose TWISTER (ToWards Imputation on Sparsity with Textual
Edge Graph Representation), a unified framework that imputes missing reviews by
jointly modeling semantic and structural signals. Specifically, we represent
user-item interactions as a Textual-Edge Graph (TEG), treating reviews as edge
attributes. To capture relational context, we construct line-graph views and
employ a large language model as a graph-aware aggregator. For each interaction
lacking a textual review, our model aggregates the neighborhood's
natural-language representations to generate a coherent and personalized
review. Experiments on the Amazon and Goodreads datasets show that TWISTER
consistently outperforms traditional numeric, graph-based, and LLM baselines,
delivering higher-quality imputed reviews and, more importantly, enhanced
recommendation performance. In summary, TWISTER generates reviews that are more
helpful, authentic, and specific, while smoothing structural signals for
improved recommendations.
REACT A Real-Time Edge-AI Based V2X Framework for Accident Avoidance in Autonomous Driving System
Authors: Fengze Yang, Bo Yu, Yang Zhou, Xuewen Luo, Zhengzhong Tu, Chenxi Liu
2025-08-01
http://arxiv.org/abs/2508.01057v1
Collisions caused by human error are the most common type of multi-vehicle
crash, highlighting the critical need for autonomous driving (AD) systems to
leverage cooperative perception through Vehicle-to-Everything (V2X)
communication. This capability extends situational awareness beyond the
limitations of onboard sensors. However, current LLM-based V2X
frameworks suffer from limited generalization, shallow contextual reasoning,
and reliance on mono-modal inputs. Vision-Language Models (VLMs) offer enhanced
reasoning and multimodal integration but typically fall short of real-time
performance requirements in safety-critical applications. This paper presents
REACT, a real-time, V2X-integrated trajectory optimization framework built upon
a fine-tuned lightweight VLM. REACT integrates a set of specialized modules
that process multimodal inputs into optimized, risk-aware trajectories. To
ensure real-time performance on edge devices, REACT incorporates edge
adaptation strategies that reduce model complexity and accelerate inference.
Evaluated on the DeepAccident benchmark, REACT achieves state-of-the-art
performance, a 77% collision rate reduction, a 48.2% Video Panoptic Quality
(VPQ), and a 0.57-second inference latency on the Jetson AGX Orin. Ablation
studies validate the contribution of each input, module, and edge adaptation
strategy. These results demonstrate the feasibility of lightweight VLMs for
real-time edge-based cooperative planning and showcase the potential of
language-guided contextual reasoning to improve safety and responsiveness in
autonomous driving.
Session-Based Recommendation with Validated and Enriched LLM Intents
Authors: Gyuseok Lee, Yaokun Liu, Yifan Liu, Susik Yoon, Dong Wang, SeongKu Kang
2025-08-01
http://arxiv.org/abs/2508.00570v1
Session-based recommendation (SBR) aims to predict the next item for an
anonymous user in a timely manner. However, SBR suffers from data sparsity due
to the short and anonymous nature of sessions. Recently, an emerging line of
work has explored inferring the underlying user intents of a session using
large language models (LLMs), with the generated intents serving as auxiliary
training signals to enhance SBR models. Despite its promise, this approach
faces three key challenges: validating intent quality, incorporating
session-level multi-intents, and complementing inevitable LLM failure cases. In
this paper, we propose VELI4SBR, a two-stage framework that leverages Validated
and Enriched LLM-generated Intents for SBR. In the first stage, we generate
high-quality intents using a predict-and-correct loop that validates the
informativeness of LLM-generated intents with a global intent pool to constrain
the LLM's output space and reduce hallucination. In the second stage, we
enhance the SBR model using the generated intents through a lightweight
multi-intent prediction and fusion mechanism. Furthermore, we introduce a
training strategy that compensates for LLM failures by inferring intents from
inter-session behavioral similarities. Extensive experiments show that VELI4SBR
outperforms state-of-the-art baselines while improving explainability.
ReaGAN Node-as-Agent-Reasoning Graph Agentic Network
Authors: Minghao Guo, Xi Zhu, Jingyuan Huang, Kai Mei, Yongfeng Zhang
2025-08-01
http://arxiv.org/abs/2508.00429v2
Graph Neural Networks (GNNs) have achieved remarkable success in graph-based
learning by propagating information among neighbor nodes via predefined
aggregation mechanisms. However, such fixed schemes often suffer from two key
limitations. First, they cannot handle the imbalance in node informativeness --
some nodes are rich in information, while others remain sparse. Second,
predefined message passing primarily leverages local structural similarity
while ignoring global semantic relationships across the graph, limiting the
model's ability to capture distant but relevant information. We propose
Retrieval-augmented Graph Agentic Network (ReaGAN), an agent-based framework
that empowers each node with autonomous, node-level decision-making. Each node
acts as an agent that independently plans its next action based on its internal
memory, enabling node-level planning and adaptive message propagation.
Additionally, retrieval-augmented generation (RAG) allows nodes to access
semantically relevant content and build global relationships in the graph.
ReaGAN achieves competitive performance under few-shot in-context settings
using a frozen LLM backbone without fine-tuning, showcasing the potential of
agentic planning and local-global retrieval in graph learning.
EdgeInfinite-Instruct Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices
Authors: Jiyu Chen, Poh Seng Lim, Shuang Peng, Daxiong Luo, JungHau Foo, Yap Deep, Timothy Lee Jun Jie, Kelvin Teh Kae Wen, Fan Yang, Danyu Feng, Hao-Yun Chen, Peng-Wen Chen, Fangyuan Li, Xiaoxin Chen, Wong Wai Mun
2025-08-01
http://arxiv.org/abs/2508.00370v2
Deploying Transformer-based large language models (LLMs) on
resource-constrained edge devices for long-sequence tasks remains challenging
due to the quadratic time complexity of self-attention and growing Key-Value
(KV) cache demands. While existing KV cache optimizations improve memory
efficiency, they often fail to reduce time to first token (TTFT) and may
degrade performance through token pruning. Alternative sequence modeling
architectures address some of these limitations, but typically require full
retraining and lack infrastructure support. EdgeInfinite offers an efficient
solution by fine-tuning only a small subset of parameters, maintaining quality
while reducing both computational and memory costs, including improved TTFT.
However, its instruction-following ability is limited, and it lacks
mobile-specific optimizations. To address these issues, we propose
EdgeInfinite-Instruct, which introduces a Segmented Supervised Fine-Tuning
(S-SFT) strategy tailored to long-sequence tasks such as summarization and
question answering. We further optimized EdgeInfinite-Instruct for efficient
deployment on edge NPUs by employing fine-grained post-training quantization
(PTQ) to reduce computational demands while maintaining accuracy, and by
implementing a fixed-shape computation graph that balances memory usage and
on-device efficiency through scenario-specific customization of input token and
KV cache sizes. Experiments on long-context benchmarks and real-world mobile
tasks
show that our approach improves domain-specific performance while maintaining
efficiency on NPU-accelerated edge devices.
Systematic Evaluation of Optimization Techniques for Long-Context Language Models
Authors: Ammar Ahmed, Sheng Di, Franck Cappello, Zirui Liu, Jingoo Han, Ali Anwar
2025-08-01
http://arxiv.org/abs/2508.00305v1
Large language models (LLMs) excel across diverse natural language processing
tasks but face resource demands and limited context windows. Although
techniques like pruning, quantization, and token dropping can mitigate these
issues, their efficacy in long-context scenarios and system evaluation remains
issues, their efficacy in long-context scenarios and system evaluation remains
underexplored. This paper systematically benchmarks these optimizations,
characterizing memory usage, latency, and throughput, and studies how these
methods impact the quality of text generation. We first analyze individual
optimization methods for two LLM architectures supporting long context and then
systematically evaluate combinations of these techniques to assess how this
deeper analysis impacts performance metrics. We subsequently study the
scalability of individual optimization methods on a larger 70-billion-parameter
model. Our insights reveal that naively combining inference optimization
algorithms can adversely affect larger models due to compounded approximation
errors, compared to their smaller counterparts.
Experiments show that relying solely on F1 obscures these effects by hiding
precision-recall trade-offs in question answering tasks. By integrating
system-level profiling with task-specific insights, this study helps
practitioners and researchers explore and balance efficiency, accuracy, and
scalability across tasks and hardware configurations.
Large AI Model-Enabled Secure Communications in Low-Altitude Wireless Networks Concepts, Perspectives and Case Study
Authors: Chuang Zhang, Geng Sun, Jiacheng Wang, Yijing Lin, Weijie Yuan, Sinem Coleri, Dusit Niyato, Tony Q. S. Quek
2025-08-01
http://arxiv.org/abs/2508.00256v1
Low-altitude wireless networks (LAWNs) have the potential to revolutionize
communications by supporting a range of applications, including urban parcel
delivery, aerial inspections and air taxis. However, compared with traditional
wireless networks, LAWNs face unique security challenges due to low-altitude
operations, frequent mobility and reliance on unlicensed spectrum, making it
more vulnerable to some malicious attacks. In this paper, we investigate some
large artificial intelligence model (LAM)-enabled solutions for secure
communications in LAWNs. Specifically, we first explore the amplified security
risks and important limitations of traditional AI methods in LAWNs. Then, we
introduce the basic concepts of LAMs and delve into the role of LAMs in
addressing these challenges. To demonstrate the practical benefits of LAMs for
secure communications in LAWNs, we propose a novel LAM-based optimization
framework that leverages large language models (LLMs) to generate enhanced
state features on top of handcrafted representations, and to design intrinsic
rewards accordingly, thereby improving reinforcement learning performance for
secure communication tasks. Through a typical case study, simulation results
validate the effectiveness of the proposed framework. Finally, we outline
future directions for integrating LAMs into secure LAWN applications.