2025-08-08
Table of Contents
- MisVisFix An Interactive Dashboard for Detecting, Explaining, and Correcting Misleading Visualizations using Large Language Models
- Sculptor Empowering LLMs with Cognitive Agency via Active Context Management
- Share Your Attention Transformer Weight Sharing via Matrix-based Dictionary Learning
- TRAIL Joint Inference and Refinement of Knowledge Graphs with Large Language Models
- CARD Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference
- Automatic LLM Red Teaming
- Evaluating, Synthesizing, and Enhancing for Customer Support Conversation
- FlexQ Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
- KVSink Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs
- ViLLA-MMBench A Unified Benchmark Suite for LLM-Augmented Multimodal Movie Recommendation
- Reasoning Beyond Labels Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts
- Benefit from Rich Tackling Search Interaction Sparsity in Search Enhanced Recommendation
- Excavate the potential of Single-Scale Features A Decomposition Network for Water-Related Optical Image Enhancement
- TCSAFormer Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation
- PAIRS Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG
- SQ-VDiT Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation
- Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency
- MultiRAG A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation
- Data Dependency Inference for Industrial Code Generation Based on UML Sequence Diagrams
- Compressing Chain-of-Thought in LLMs via Step Entropy
- Do language models accommodate their users? A study of linguistic convergence
- Attack the Messages, Not the Agents A Multi-round Adaptive Stealthy Tampering Framework for LLM-MAS
- AgentSME for Simulating Diverse Communication Modes in Smart Education
- Modeling Annotator Disagreement with Demographic-Aware Experts and Synthetic Perspectives
- LOST Low-rank and Sparse Pre-training for Large Language Models
- Automated SNOMED CT Concept Annotation in Clinical Text Using Bi-GRU Neural Networks
- xDeepServe Model-as-a-Service on Huawei CloudMatrix384
- Decomposed Reasoning with Reinforcement Learning for Relevance Assessment in UGC Platforms
- CompressKV Semantic Retrieval Heads Know What Tokens are Not Important Before Generation
- Beyond Manually Designed Pruning Policies with Second-Level Performance Prediction A Pruning Framework for LLMs
- Traffic-R1 Reinforced LLMs Bring Human-Like Reasoning to Traffic Signal Control Systems
- CAMERA Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
- VeOmni Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
- Isolating Culture Neurons in Multilingual Large Language Models
- Forecasting When to Forecast Accelerating Diffusion Models with Confidence-Gated Taylor
- LeanK Learnable K Cache Channel Pruning for Efficient Decoding
- Whispering Agents An event-driven covert communication protocol for the Internet of Agents
- Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation
- Amber Pruner Leveraging NM Activation Sparsity for Efficient Prefill in Large Language Models
- A Survey on AgentOps Categorization, Challenges, and Future Directions
- AlignGuard-LoRA Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
- Everyone Contributes! Incentivizing Strategic Cooperation in Multi-LLM Systems via Sequential Public Goods Games
- CVD-SfM A Cross-View Deep Front-end Structure-from-Motion System for Sparse Localization in Multi-Altitude Scenes
- IAUNet Instance-Aware U-Net
- Quantum-RAG and PunGPT2 Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language
- AGFT An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization
- SmallKV Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
- EAC-MoE Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
- RepoForge Training a SOTA Fast-thinking SWE Agent with an End-to-End Data Curation Pipeline Synergizing SFT and RL at Scale
- BlockA2A Towards Secure and Verifiable Agent-to-Agent Interoperability
- Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models
- Asking the Right Questions Benchmarking Large Language Models in the Development of Clinical Consultation Templates
- Towards Bridging Review Sparsity in Recommendation with Textual Edge Graph Representation
- REACT A Real-Time Edge-AI Based V2X Framework for Accident Avoidance in Autonomous Driving System
- Session-Based Recommendation with Validated and Enriched LLM Intents
- ReaGAN Node-as-Agent-Reasoning Graph Agentic Network
- EdgeInfinite-Instruct Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices
- Systematic Evaluation of Optimization Techniques for Long-Context Language Models
- Large AI Model-Enabled Secure Communications in Low-Altitude Wireless Networks Concepts, Perspectives and Case Study
MisVisFix An Interactive Dashboard for Detecting, Explaining, and Correcting Misleading Visualizations using Large Language Models
Authors: Amit Kumar Das, Klaus Mueller
2025-08-06
http://arxiv.org/abs/2508.04679v1
Misleading visualizations pose a significant challenge to accurate data
interpretation. While recent research has explored the use of Large Language
Models (LLMs) for detecting such misinformation, practical tools that also
support explanation and correction remain limited. We present MisVisFix, an
interactive dashboard that leverages both Claude and GPT models to support the
full workflow of detecting, explaining, and correcting misleading
visualizations. MisVisFix correctly identifies 96% of visualization issues and
addresses all 74 known visualization misinformation types, classifying them as
major, minor, or potential concerns. It provides detailed explanations,
actionable suggestions, and automatically generates corrected charts. An
interactive chat interface allows users to ask about specific chart elements or
request modifications. The dashboard adapts to newly emerging misinformation
strategies through targeted user interactions. User studies with visualization
experts and developers of fact-checking tools show that MisVisFix accurately
identifies issues and offers useful suggestions for improvement. By
transforming LLM-based detection into an accessible, interactive platform,
MisVisFix advances visualization literacy and supports more trustworthy data
communication.
Sculptor Empowering LLMs with Cognitive Agency via Active Context Management
Authors: Mo Li, L. H. Xu, Qitai Tan, Ting Cao, Yunxin Liu
2025-08-06
http://arxiv.org/abs/2508.04664v1
Large Language Models (LLMs) suffer from significant performance degradation
when processing long contexts due to proactive interference, where irrelevant
information in earlier parts of the context disrupts reasoning and memory
recall. While most research focuses on external memory systems to augment
LLMs' capabilities, we propose a complementary approach: empowering LLMs with
Active Context Management (ACM) tools to actively sculpt their internal working
memory. We introduce Sculptor, a framework that equips LLMs with three
categories of tools: (1) context fragmentation, (2) summary, hide, and restore,
and (3) intelligent search. Our approach enables LLMs to proactively manage
their attention and working memory, analogous to how humans selectively focus
on relevant information while filtering out distractions. Experimental
evaluation on information-sparse benchmarks, PI-LLM (proactive interference)
and NeedleBench Multi-Needle Reasoning, demonstrates that Sculptor
significantly improves performance even without specific training, leveraging
LLMs' inherent
tool calling generalization capabilities. By enabling Active Context
Management, Sculptor not only mitigates proactive interference but also
provides a cognitive foundation for more reliable reasoning across diverse
long-context tasks, highlighting that explicit context-control strategies,
rather than merely larger token windows, are key to robustness at scale.
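To make the three tool categories concrete, here is a minimal, hypothetical sketch of what ACM-style context tools could look like; the class and method names are illustrative stand-ins, not the paper's actual API.

```python
# Hypothetical sketch of Active Context Management (ACM) tools in the spirit
# of Sculptor; names and signatures are illustrative, not the paper's API.

class WorkingContext:
    """Holds context fragments the model can actively manage."""

    def __init__(self, text: str, chunk_size: int = 1000):
        # (1) Context fragmentation: split the raw context into fragments.
        self.fragments = [text[i:i + chunk_size]
                          for i in range(0, len(text), chunk_size)]
        self.hidden = {}  # fragment_id -> original text

    def summarize_and_hide(self, fragment_id: int, summary: str):
        # (2) Summary/hide/restore: replace a fragment with a short summary,
        # stashing the original so it can be restored later.
        self.hidden[fragment_id] = self.fragments[fragment_id]
        self.fragments[fragment_id] = f"[SUMMARY] {summary}"

    def restore(self, fragment_id: int):
        if fragment_id in self.hidden:
            self.fragments[fragment_id] = self.hidden.pop(fragment_id)

    def search(self, query: str) -> list[int]:
        # (3) Intelligent search: plain keyword match as a stand-in here.
        return [i for i, frag in enumerate(self.fragments)
                if query.lower() in frag.lower()]
```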
Share Your Attention Transformer Weight Sharing via Matrix-based Dictionary Learning
Authors: Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis
2025-08-06
http://arxiv.org/abs/2508.04581v1
Large language models (LLMs) have revolutionized AI applications, yet their
high computational and memory demands hinder their widespread deployment.
Existing compression techniques focus on intra-block optimizations (e.g.
low-rank approximation, attention head pruning), while the repetitive layered
structure of LLMs implies significant inter-block redundancy - a
dimension largely unexplored beyond key-value (KV) caching. Inspired by
dictionary learning in CNNs, we propose a framework for structured weight
sharing across transformer layers. Our approach decomposes attention projection
matrices into shared dictionary atoms, reducing the attention module's
parameters by 66.7% while achieving on-par performance. Unlike complex methods
requiring distillation or architectural changes, MASA (Matrix Atom Sharing in
Attention) operates as a drop-in replacement - trained with standard optimizers
- and represents each layer's weights as linear combinations of shared matrix
atoms. Experiments across scales (100M-700M parameters) show that MASA achieves
better benchmark accuracy and perplexity than grouped-query attention (GQA),
low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at
comparable parameter budgets. Ablation studies confirm robustness to the
dictionary size and the efficacy of shared representations in capturing
cross-layer statistical regularities. Extending to Vision Transformers (ViT),
MASA matches performance metrics on image classification and detection tasks
with 66.7% fewer attention parameters. By combining dictionary learning
strategies with transformer efficiency, MASA offers a scalable blueprint for
parameter-efficient models without sacrificing performance. Finally, we
investigate the possibility of employing MASA on pretrained LLMs to reduce
their number of parameters without experiencing any significant drop in their
performance.
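As a rough illustration of the dictionary-learning idea (not the paper's implementation), each layer's projection matrix can be materialized as a linear combination of shared atoms; the sizes below are arbitrary toy values.

```python
import torch

# Illustrative sketch of matrix-atom weight sharing (MASA-style); dimensions
# and the number of atoms are made-up examples, not the paper's settings.
d_model, n_layers, n_atoms = 512, 12, 4

# A shared dictionary of matrix atoms, reused by every layer.
atoms = torch.nn.Parameter(torch.randn(n_atoms, d_model, d_model) * 0.02)
# Per-layer mixing coefficients: each layer's projection is a linear
# combination of the shared atoms instead of a full independent matrix.
coeffs = torch.nn.Parameter(torch.randn(n_layers, n_atoms))

def layer_weight(layer: int) -> torch.Tensor:
    # W_l = sum_k c_{l,k} * A_k
    return torch.einsum("k,kij->ij", coeffs[layer], atoms)

# Parameter count: n_atoms*d^2 + n_layers*n_atoms, versus n_layers*d^2 for
# independent matrices -- a large reduction when n_atoms << n_layers.
```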
TRAIL Joint Inference and Refinement of Knowledge Graphs with Large Language Models
Authors: Xinkui Zhao, Haode Li, Yifan Zhang, Guanjie Cheng, Yueshen Xu
2025-08-06
http://arxiv.org/abs/2508.04474v1
Recent advances in large language models (LLMs) have unlocked powerful
reasoning and decision-making capabilities. However, their inherent dependence
on static parametric memory fundamentally limits their adaptability, factual
accuracy, and interpretability in knowledge-intensive scenarios. Knowledge
graphs (KGs), as structured repositories of explicit relational knowledge,
offer a promising approach for augmenting LLMs with external, interpretable
memory. Nevertheless, most existing methods that combine LLMs with KGs treat
reasoning and knowledge updating as separate processes, resulting in suboptimal
utilization of new information and hindering real-time updates. In this work,
we propose TRAIL: a novel, unified framework for Thinking, Reasoning, And
Incremental Learning that couples joint inference and dynamic KG refinement
with large language models. TRAIL enables LLM agents to iteratively explore,
update, and refine knowledge graphs during the reasoning process, employing a
confidence-driven mechanism for the generation, validation, and pruning of new
facts. This plug-and-play architecture facilitates seamless integration with
various LLMs, supporting continual adaptation without the need for retraining.
Extensive experiments on multiple benchmarks demonstrate that TRAIL outperforms
existing KG-augmented and retrieval-augmented
LLM baselines by 3% to 13%. More
importantly, these results represent a significant step toward developing
adaptive, memory-augmented language models capable of continual learning and
reliable, transparent reasoning.
CARD Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference
Authors: Enyu Zhou, Kai Sheng, Hao Chen, Xin He
2025-08-06
http://arxiv.org/abs/2508.04462v1
Speculative decoding (SD), where an extra draft model first provides multiple
draft tokens and the original target model then verifies these tokens in
parallel, has shown great power for inference
acceleration. However,
existing SD methods must adhere to the 'draft-then-verify' paradigm, which
forces drafting and verification processes to execute sequentially during SD,
resulting in inefficient inference performance and limiting the size of the
draft model. Furthermore, once a single token in the candidate sequence is
rejected during the drafting process, all subsequent candidate tokens must be
discarded, leading to inefficient drafting. To address these challenges, we
propose CARD, a cache-based parallel speculative decoding framework employing
a 'query-and-correct' paradigm. Specifically, CARD decouples drafting and
verification: the draft model generates candidate tokens to populate a shared
cache, while the target model concurrently rectifies the draft model's
generation direction. This effectively enables the target model to perform
inference at speed approaching that of the draft model. Our approach achieves
up to 4.83× speedup over vanilla decoding without requiring fine-tuning of
either the draft or target models. Our code is available at
https://github.com/hunzhizi/CARD.
Automatic LLM Red Teaming
Authors: Roman Belaire, Arunesh Sinha, Pradeep Varakantham
2025-08-06
http://arxiv.org/abs/2508.04451v1
Red teaming is critical for identifying vulnerabilities and building trust in
current LLMs. However, current automated methods for Large Language Models
(LLMs) rely on brittle prompt templates or single-turn attacks, failing to
capture the complex, interactive nature of real-world adversarial dialogues. We
propose a novel paradigm: training an AI to strategically `break' another AI.
By formalizing red teaming as a Markov Decision Process (MDP) and employing a
hierarchical Reinforcement Learning (RL) framework, we effectively address the
inherent
sparse reward and long-horizon challenges. Our generative agent learns
coherent, multi-turn attack strategies through a fine-grained, token-level harm
reward, enabling it to uncover subtle vulnerabilities missed by existing
baselines. This approach sets a new state-of-the-art, fundamentally reframing
red teaming as a dynamic, trajectory-based process (rather than a one-step
test) essential for robust AI deployment.
Evaluating, Synthesizing, and Enhancing for Customer Support Conversation
Authors: Jie Zhu, Huaixia Dou, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong
2025-08-06
http://arxiv.org/abs/2508.04423v1
Effective customer support requires not only accurate problem solving but
also structured and empathetic communication aligned with professional
standards. However, existing dialogue datasets often lack strategic guidance,
and real-world service data is difficult to access and annotate. To address
this, we introduce the task of Customer Support Conversation (CSC), aimed at
training customer service agents to respond using well-defined support
strategies. We propose a structured CSC framework grounded in COPC guidelines,
defining five conversational stages and twelve strategies to guide high-quality
interactions. Based on this, we construct CSConv, an evaluation dataset of
1,855 real-world customer-agent conversations rewritten using LLMs to reflect
deliberate strategy use, and annotated accordingly. Additionally, we develop a
role-playing approach that simulates strategy-rich conversations using
LLM-powered roles aligned with the CSC framework, resulting in the training
dataset RoleCS. Experiments show that fine-tuning strong LLMs on RoleCS
significantly improves their ability to generate high-quality, strategy-aligned
responses on CSConv. Human evaluations further confirm gains in problem
resolution. All code and data will be made publicly available at
https://github.com/aliyun/qwen-dianjin.
FlexQ Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
Authors: Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He
2025-08-06
http://arxiv.org/abs/2508.04405v1
Large Language Models (LLMs) demonstrate exceptional performance but entail
significant memory and computational costs, restricting their practical
deployment. While existing INT4/INT8 quantization reduces these costs, they
often degrade accuracy or lack optimal efficiency. INT6 quantization offers a
superior trade-off between model accuracy and inference efficiency, but lacks
hardware support in modern GPUs, forcing emulation via higher-precision
arithmetic units that limit acceleration.
In this paper, we propose FlexQ, a novel post-training INT6 quantization
framework combining algorithmic innovation with system-level optimizations.
FlexQ employs uniform 6-bit weight quantization across all layers, with
adaptive retention of 8-bit activations in layers identified through layer-wise
sensitivity analysis. To maximize hardware efficiency, we develop a specialized
high-performance GPU kernel supporting matrix multiplication for W6A6 and W6A8
representations via Binary Tensor Core (BTC) equivalents, effectively bypassing
the lack of native INT6 tensor cores. Evaluations on LLaMA models show FlexQ
maintains near-FP16 accuracy, with perplexity increases of no more than 0.05.
The proposed kernel achieves an average 1.39× speedup over ABQ-LLM on
LLaMA-2-70B linear layers. End-to-end, FlexQ delivers 1.33× inference
acceleration and 1.21× memory savings over SmoothQuant. Code is released
at https://github.com/FlyFoxPlayer/FlexQ.
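A toy fake-quantization sketch of the W6A6/W6A8 scheme described above, assuming simple per-row absmax scaling; the real system relies on custom BTC-based GPU kernels rather than this round-trip emulation.

```python
import torch

# Toy illustration of the quantization recipe FlexQ describes: uniform 6-bit
# weights everywhere, with activations kept at 8 bits in layers flagged as
# sensitive. Scale handling and kernels are far simpler than the real system.

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

def flexq_linear(x, weight, layer_is_sensitive: bool):
    w_q = fake_quant(weight, bits=6)               # W6 in every layer
    a_bits = 8 if layer_is_sensitive else 6        # A8 only where sensitive
    return fake_quant(x, bits=a_bits) @ w_q.t()

x = torch.randn(2, 16)
w = torch.randn(32, 16)
y = flexq_linear(x, w, layer_is_sensitive=True)    # W6A8 path
```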
KVSink Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs
Authors: Zunhai Su, Kehong Yuan
2025-08-06
http://arxiv.org/abs/2508.04257v1
Key-Value (KV) cache quantization has become a widely adopted optimization
technique for efficient large language model (LLM) inference by reducing
cache memory usage and mitigating memory-bound constraints. Recent studies have
emphasized the importance of preserving the original precision of KVs for the
first few tokens to ensure the protection of attention sinks. While this
approach has proven effective in mitigating performance degradation, its
underlying principles remain insufficiently understood. Moreover, it fails to
address the recent discovery that attention sinks can emerge beyond the initial
token positions. In this work, we elucidate the underlying mechanisms of
attention sinks during inference by examining their role in the cross-layer
evolution of extreme activation outliers. Additionally, we provide a
comprehensive analysis of the interplay between attention sinks and KV cache
quantization. Based on our enhanced understanding, we introduce
\textit{\textbf{KVSink}}, a plug-and-play method that effectively predicts sink
tokens with negligible overhead, enabling more thorough preservation. Extensive
experiments demonstrate that KVSink outperforms the existing Preserve-First-N
(PFN) strategy, offering more effective preservation of attention sinks during
KV cache quantization. Moreover, when applied to the well-established KVQuant
method, KVSink further improves perplexity (PPL) and reduces reliance on 16-bit
numerical outliers.
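A minimal sketch of the preservation idea, assuming a placeholder sink predictor and INT4 fake quantization; KVSink's actual prediction mechanism and quantizer details are described in the paper.

```python
import torch

# Sketch of sink-aware KV cache quantization: keep predicted "sink" token
# positions (plus the first few tokens) in full precision and quantize the
# rest to a low bit-width. The sink positions here are given by hand; the
# paper's contribution is predicting them cheaply.

def quantize_int4(x: torch.Tensor) -> torch.Tensor:
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 7
    return (x / scale).round().clamp(-8, 7) * scale  # fake-quant round trip

def compress_kv(kv: torch.Tensor, sink_positions: set[int], first_n: int = 4):
    keep = sink_positions | set(range(first_n))
    out = quantize_int4(kv.clone())
    for pos in keep:            # preserve attention sinks at full precision
        out[pos] = kv[pos]
    return out

kv = torch.randn(128, 64)       # (seq_len, head_dim), toy sizes
compressed = compress_kv(kv, sink_positions={0, 37})
```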
ViLLA-MMBench A Unified Benchmark Suite for LLM-Augmented Multimodal Movie Recommendation
Authors: Fatemeh Nazary, Ali Tourani, Yashar Deldjoo, Tommaso Di Noia
2025-08-06
http://arxiv.org/abs/2508.04206v1
Recommending long-form video content demands joint modeling of visual, audio,
and textual modalities, yet most benchmarks address only raw features or narrow
fusion. We present ViLLA-MMBench, a reproducible, extensible benchmark for
LLM-augmented multimodal movie recommendation. Built on MovieLens and MMTF-14K,
it aligns dense item embeddings from three modalities: audio (block-level,
i-vector), visual (CNN, AVF), and text. Missing or sparse metadata is
automatically enriched using state-of-the-art LLMs (e.g., OpenAI Ada),
generating high-quality synopses for thousands of movies. All text (raw or
augmented) is embedded with configurable encoders (Ada, LLaMA-2, Sentence-T5),
producing multiple ready-to-use sets. The pipeline supports interchangeable
early-, mid-, and late-fusion (concatenation, PCA, CCA, rank-aggregation) and
multiple backbones (MF, VAECF, VBPR, AMR, VMF) for ablation. Experiments are
fully declarative via a single YAML file. Evaluation spans accuracy (Recall,
nDCG) and beyond-accuracy metrics: cold-start rate, coverage, novelty,
diversity, fairness. Results show
LLM-based augmentation and strong text
embeddings boost cold-start and coverage, especially when fused with
audio-visual features. Systematic benchmarking reveals universal versus
backbone- or metric-specific combinations. Open-source code, embeddings, and
configs enable reproducible, fair multimodal RS research and advance principled
generative AI integration in large-scale recommendation. Code:
https://recsys-lab.github.io/ViLLA-MMBench
Reasoning Beyond Labels Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts
Authors: Millicent Ochieng, Anja Thieme, Ignatius Ezeani, Risa Ueno, Samuel Maina, Keshet Ronen, Javier Gonzalez, Jacki O'Neill
2025-08-06
http://arxiv.org/abs/2508.04199v1
Sentiment analysis in low-resource, culturally nuanced contexts challenges
conventional NLP approaches that assume fixed labels and universal affective
expressions. We present a diagnostic framework that treats sentiment as a
context-dependent, culturally embedded construct, and evaluate how large
language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp
messages from Nairobi youth health groups. Using a combination of
human-annotated data, sentiment-flipped counterfactuals, and rubric-based
explanation evaluation, we probe
LLM interpretability, robustness, and
alignment with human reasoning. Framing our evaluation through a social-science
measurement lens, we operationalize and interrogate
LLMs' outputs as an
instrument for measuring the abstract concept of sentiment. Our findings reveal
significant variation in model reasoning quality, with top-tier
LLMs
demonstrating interpretive stability, while open models often falter under
ambiguity or sentiment shifts. This work highlights the need for culturally
sensitive, reasoning-aware AI evaluation in complex, real-world
communication.
Benefit from Rich Tackling Search Interaction Sparsity in Search Enhanced Recommendation
Authors: Teng Shi, Weijie Yu, Xiao Zhang, Ming He, Jianping Fan, Jun Xu
2025-08-06
http://arxiv.org/abs/2508.04145v1
In modern online platforms, search and recommendation (S&R) often coexist,
offering opportunities for performance improvement through search-enhanced
approaches. Existing studies show that incorporating search signals boosts
recommendation performance. However, the effectiveness of these methods relies
heavily on rich search interactions. They primarily benefit a small subset of
users with abundant search behavior, while offering limited improvements for
the majority of users who exhibit only sparse search activity. To address the
problem of sparse search data in search-enhanced recommendation, we face two
key challenges: (1) how to learn useful search features for users with sparse
search interactions, and (2) how to design effective training objectives under
sparse conditions. Our idea is to leverage the features of users with rich
search interactions to enhance those of users with sparse search interactions.
Based on this idea, we propose GSERec, a method that utilizes message passing
on the User-Code Graphs to alleviate data sparsity in Search-Enhanced
Recommendation. Specifically, we utilize Large Language Models (LLMs) with
vector quantization to generate discrete codes, which connect similar users and
thereby construct the graph. Through message passing on this graph, embeddings
of users with rich search data are propagated to enhance the embeddings of
users with sparse interactions. To further ensure that the message passing
captures meaningful information from truly similar users, we introduce a
contrastive loss to better model user similarities. The enhanced user
representations are then integrated into downstream search-enhanced
recommendation models. Experiments on three real-world datasets show that
GSERec consistently outperforms baselines, especially for users with sparse
search behaviors.
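A rough sketch of the user-code-graph intuition under simplifying assumptions: users are quantized to discrete codes (plain nearest-centroid here, standing in for the LLM-assisted quantizer), and embeddings are averaged within each code group as a stand-in for message passing.

```python
import torch

# Illustrative GSERec-style propagation: connect users that share a discrete
# code, so embeddings of search-rich users can flow to search-sparse ones.

def assign_codes(user_emb: torch.Tensor, codebook: torch.Tensor):
    return torch.cdist(user_emb, codebook).argmin(dim=1)  # one code per user

def propagate(user_emb, codes, alpha=0.5):
    out = user_emb.clone()
    for c in codes.unique():
        members = (codes == c).nonzero(as_tuple=True)[0]
        mean = user_emb[members].mean(dim=0)   # message passing within a code
        out[members] = (1 - alpha) * user_emb[members] + alpha * mean
    return out

users = torch.randn(100, 32)                   # toy user embeddings
codebook = torch.randn(8, 32)                  # toy code centroids
enhanced = propagate(users, assign_codes(users, codebook))
```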
Excavate the potential of Single-Scale Features A Decomposition Network for Water-Related Optical Image Enhancement
Authors: Zheng Cheng, Wenri Wang, Guangyong Chen, Yakun Ju, Yihua Cheng, Zhisong Liu, Yanda Meng, Jintao Song
2025-08-06
http://arxiv.org/abs/2508.04123v1
Underwater image enhancement (UIE) techniques aim to improve visual quality
of images captured in aquatic environments by addressing degradation issues
caused by light absorption and scattering effects, including color distortion,
blurring, and low contrast. Current mainstream solutions predominantly employ
multi-scale feature extraction (MSFE) mechanisms to enhance reconstruction
quality through multi-resolution feature fusion. However, our extensive
experiments demonstrate that high-quality image reconstruction does not
necessarily rely on multi-scale feature fusion. Contrary to popular belief, our
experiments show that single-scale feature extraction alone can match or
surpass the performance of multi-scale methods, significantly reducing
complexity. To comprehensively explore single-scale feature potential in
underwater enhancement, we propose an innovative Single-Scale Decomposition
Network (SSD-Net). This architecture introduces an asymmetrical decomposition
mechanism that disentangles the input image into a clean layer and a
degradation layer. The former contains scene-intrinsic information and the
latter encodes
medium-induced interference. It uniquely combines CNN's local feature
extraction capabilities with Transformer's global modeling strengths through
two core modules: 1) Parallel Feature Decomposition Block (PFDB), implementing
dual-branch feature space decoupling via efficient attention operations and
adaptive sparsity; 2) Bidirectional Feature Communication Block
(BFCB), enabling cross-layer residual interactions for complementary feature
mining and fusion. This synergistic design preserves feature decomposition
independence while establishing dynamic cross-layer information pathways,
effectively enhancing degradation decoupling capacity.
TCSAFormer Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation
Authors: Zunhui Xia, Hongxing Li, Libin Lan
2025-08-06
http://arxiv.org/abs/2508.04058v1
In recent years, Transformer-based methods have achieved remarkable progress
in medical image segmentation due to their superior ability to capture
long-range dependencies. However, these methods typically suffer from two major
limitations. First, their computational complexity scales quadratically with
the input sequences. Second, the feed-forward network (FFN) modules in vanilla
Transformers typically rely on fully connected layers, which limits models'
ability to capture local contextual information and multiscale features
critical for precise semantic segmentation. To address these issues, we propose
an efficient medical image segmentation network, named TCSAFormer. The proposed
TCSAFormer adopts two key ideas. First, it incorporates a Compressed Attention
(CA) module, which combines token compression and pixel-level sparse attention
to dynamically focus on the most relevant key-value pairs for each query. This
is achieved by pruning globally irrelevant tokens and merging redundant ones,
significantly reducing computational complexity while enhancing the model's
ability to capture relationships between tokens. Second, it introduces a
Dual-Branch Feed-Forward Network (DBFFN) module as a replacement for the
standard FFN to capture local contextual features and multiscale information,
thereby strengthening the model's feature representation capability. We conduct
extensive experiments on three publicly available medical image segmentation
datasets: ISIC-2018, CVC-ClinicDB, and Synapse, to evaluate the segmentation
performance of TCSAFormer. Experimental results demonstrate that TCSAFormer
achieves superior performance compared to existing state-of-the-art (SOTA)
methods, while maintaining lower computational overhead, thus achieving an
optimal trade-off between efficiency and accuracy.
PAIRS Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG
Authors: Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang
2025-08-06
http://arxiv.org/abs/2508.04057v1
Retrieval-Augmented Generation (RAG) has become a cornerstone technique for
enhancing large language models (LLMs) with external knowledge. However,
current RAG systems face two critical limitations: (1) they inefficiently
retrieve information for every query, including simple questions that could be
resolved using the
's parametric knowledge alone, and (2) they risk
retrieving irrelevant documents when queries contain
information
signals. To address these gaps, we introduce Parametric-verified Adaptive
Information Retrieval and Selection (PAIRS), a training-free framework that
integrates parametric and retrieved knowledge to adaptively determine whether
to retrieve and how to select external information. Specifically, PAIRS employs
a dual-path generation mechanism: First, the
LLM produces both a direct answer
and a context-augmented answer using self-generated pseudo-context. When these
outputs converge, PAIRS bypasses external retrieval entirely, dramatically
improving the RAG system's efficiency. For divergent cases, PAIRS activates a
dual-path retrieval (DPR) process guided by both the original query and
self-generated contextual signals, followed by an Adaptive Information
Selection (AIS) module that filters documents through weighted similarity to
both sources. This simple yet effective approach can not only enhance
efficiency by eliminating unnecessary retrievals but also improve accuracy
through contextually guided retrieval and adaptive information selection.
Experimental results on six question-answering (QA) benchmarks show that PAIRS
reduces retrieval costs by around 25% (triggering for only 75% of queries)
while still improving accuracy-achieving +1.1% EM and +1.0% F1 over prior
baselines on average.
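A minimal sketch of the dual-path logic, with `llm` and `retrieve` as assumed stand-ins for a generation call and a retriever, and exact string match standing in for the paper's convergence test.

```python
# Hedged sketch of PAIRS-style adaptive retrieval; all helper behavior below
# is simplified for illustration and is not the paper's implementation.

def _sim(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))   # Jaccard word overlap

def select_docs(docs, query, pseudo_ctx, w=0.5, k=3):
    # Adaptive Information Selection: weighted similarity to both sources.
    return sorted(docs, key=lambda d: w * _sim(d, query)
                  + (1 - w) * _sim(d, pseudo_ctx), reverse=True)[:k]

def answer(query: str, llm, retrieve) -> str:
    direct = llm(f"Answer: {query}")
    pseudo_ctx = llm(f"Write brief background for: {query}")
    contextual = llm(f"Context: {pseudo_ctx}\nAnswer: {query}")
    if direct.strip() == contextual.strip():
        return direct                      # outputs converge: skip retrieval
    # Divergent case: dual-path retrieval guided by query and pseudo-context.
    docs = select_docs(retrieve(query) + retrieve(pseudo_ctx), query, pseudo_ctx)
    return llm("Context: " + " ".join(docs) + f"\nAnswer: {query}")
```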
SQ-VDiT Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation
Authors: Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu
2025-08-06
http://arxiv.org/abs/2508.04016v2
Diffusion Transformers have emerged as the mainstream paradigm for video
generation models. However, the use of up to billions of parameters incurs
significant computational costs. Quantization offers a promising solution by
reducing memory usage and accelerating inference. Nonetheless, we observe that
the joint modeling of spatial and temporal information in video diffusion
models (V-DMs) leads to extremely long token sequences, which introduces high
calibration variance and learning challenges. To address these issues, we
propose SQ-VDiT, a post-training quantization framework for V-DMs that
leverages Salient data and Sparse token distillation. During the calibration
phase, we identify that quantization performance is highly sensitive to the
choice of calibration data. To mitigate this, we introduce
\textit{Hessian-aware Salient Data Selection}, which constructs high-quality
calibration datasets by considering both diffusion and quantization
characteristics unique to V-DMs. To tackle the learning challenges, we further
analyze the
sparse attention patterns inherent in V-DMs. Based on this
observation, we propose \textit{Attention-guided Sparse Token Distillation},
which exploits token-wise attention distributions to emphasize tokens that are
more influential to the model's output. Under W4A6 quantization, SQ-VDiT
achieves lossless performance while delivering model compression and inference
acceleration. Code will be available at
https://github.com/wlfeng0509/s2q-vdit.
Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency
Authors: Md Arafat Sultan, Ramón Fernandez Astudillo
2025-08-06
http://arxiv.org/abs/2508.03979v1
Despite its simplicity and efficacy, the high token expenditure of
self-consistency can limit its practical utility. Here we investigate if
self-consistency can be made more token-efficient for long chain-of-thought
reasoning tasks, while preserving its parallelism, through early hypothesis
pruning. Concretely, we generate all solutions in parallel, but periodically
prune intermediate hypotheses that are deemed unnecessary based on two
lightweight indicators: (a) the model's own confidence in individual
hypotheses, and (b) lexical coverage of all current hypotheses by candidate
subsets that are under consideration for continued retention. We design a fast
weighted set cover algorithm that utilizes the two indicators; our evaluation
of five
LLMs on three math benchmarks shows that this method can improve token
efficiency for all models, by 10-35% in many cases.
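One plausible reading of the pruning step is a greedy weighted set cover over hypothesis token sets, sketched below; the paper's exact weighting and schedule may differ.

```python
# Hedged sketch of confidence-weighted set cover for hypothesis pruning.

def prune_hypotheses(hypotheses, confidences, keep):
    """hypotheses: list of token sets; confidences: per-hypothesis scores."""
    keep = min(keep, len(hypotheses))
    covered = set()
    chosen = []
    while len(chosen) < keep:
        remaining = [i for i in range(len(hypotheses)) if i not in chosen]
        # Greedy step: pick the hypothesis whose uncovered tokens, weighted
        # by the model's confidence in it, add the most lexical coverage.
        best = max(remaining, key=lambda i:
                   confidences[i] * (1 + len(hypotheses[i] - covered)))
        chosen.append(best)
        covered |= hypotheses[best]
    return chosen  # indices of hypotheses to keep generating

hyps = [{"a", "b", "c"}, {"b", "c"}, {"c", "d", "e"}]
print(prune_hypotheses(hyps, [0.9, 0.6, 0.7], keep=2))  # -> [0, 2]
```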
MultiRAG A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation
Authors: Wenlong Wu, Haofen Wang, Bohan Li, Peixuan Huang, Xinzhe Zhao, Lei Liang
2025-08-05
http://arxiv.org/abs/2508.03553v1
Retrieval Augmented Generation (RAG) has emerged as a promising solution to
address hallucination issues in Large Language Models (LLMs). However, the
integration of multiple retrieval sources, while potentially more informative,
introduces new challenges that can paradoxically exacerbate hallucination
problems. These challenges manifest primarily in two aspects: the
sparse distribution of multi-source data that hinders the capture of logical
relationships and the inherent inconsistencies among different sources that
lead to information conflicts. To address these challenges, we propose
MultiRAG, a novel framework designed to mitigate hallucination in multi-source
retrieval-augmented generation through knowledge-guided approaches. Our
framework introduces two key innovations: (1) a knowledge construction module
that employs multi-source line graphs to efficiently aggregate logical
relationships across different knowledge sources, effectively addressing the
sparse data distribution issue; and (2) a sophisticated retrieval module that
implements a multi-level confidence calculation mechanism, performing both
graph-level and node-level assessments to identify and eliminate unreliable
information nodes, thereby reducing hallucinations caused by inter-source
inconsistencies. Extensive experiments on four multi-domain query datasets and
two multi-hop QA datasets demonstrate that MultiRAG significantly enhances the
reliability and efficiency of knowledge retrieval in complex multi-source
scenarios. Our code is available at
https://github.com/wuwenlong123/MultiRAG.
Data Dependency Inference for Industrial Code Generation Based on UML Sequence Diagrams
Authors: Wenxin Mao, Zhitao Wang, Long Wang, Sirong Chen, Cuiyun Gao, Luyang Cao, Ziming Liu, Qiming Zhang, Jun Zhou, Zhi Jin
2025-08-05
http://arxiv.org/abs/2508.03379v2
Large language models (LLMs) excel at generating code from natural language
(NL) descriptions. However, the plain textual descriptions are inherently
ambiguous and often fail to capture complex requirements like intricate system
behaviors, conditional logic, and architectural constraints; implicit data
dependencies in service-oriented architectures are difficult to infer and
handle correctly. To bridge this gap, we propose a novel step-by-step code
generation framework named UML2Dep by leveraging unambiguous formal
specifications of complex requirements. First, we introduce an enhanced Unified
Modeling Language (UML) sequence diagram tailored for service-oriented
architectures. This diagram extends traditional visual syntax by integrating
decision tables and API specifications, explicitly formalizing structural
relationships and business logic flows in service interactions to rigorously
eliminate linguistic ambiguity. Second, recognizing the critical role of data
flow, we introduce a dedicated data dependency inference (DDI) task. DDI
systematically constructs an explicit data dependency graph prior to actual
code synthesis. To ensure reliability, we formalize DDI as a constrained
mathematical reasoning task through novel prompting strategies, aligning with
LLMs' mathematical strengths. Additional static parsing and dependency
pruning further reduce context complexity and cognitive load
associated with intricate specifications, thereby enhancing reasoning accuracy
and efficiency.
Compressing Chain-of-Thought in LLMs via Step Entropy
Authors: Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, Qiang Xu
2025-08-05
http://arxiv.org/abs/2508.03346v1
Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at
complex reasoning but generate verbose thought processes with considerable
redundancy, leading to increased inference costs and reduced efficiency. We
introduce a novel CoT compression framework based on step entropy, a metric
that quantifies the informational contribution of individual reasoning steps to
identify redundancy. Through theoretical analysis and extensive empirical
validation on mathematical reasoning benchmarks, we demonstrate that steps with
low entropy are indeed highly redundant. Our experiments reveal that an
astonishing 80\% of low-entropy intermediate steps can be pruned with minor
degradation in the final answer accuracy across DeepSeek-R1-7B, 14B and
Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning,
which severely impairs reasoning performance. Building on this, we propose a
novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and
Group Relative Policy Optimization (GRPO) reinforcement learning. This approach
enables LLMs to autonomously learn to generate compressed CoTs during inference
by strategically incorporating [SKIP] tokens. Our method significantly enhances
inference efficiency while rigorously preserving accuracy, offering
profound implications for practical
LLM deployment and a deeper understanding
of reasoning structures.
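As an illustration, one simple operationalization of step entropy is the mean token-level entropy within a step; the lowest-entropy steps are then pruned. This is a hedged sketch, not the paper's exact metric.

```python
import math

# Illustrative per-step entropy for CoT compression: the "entropy" of a step
# is taken here as the mean entropy of the model's per-token distributions.

def step_entropy(token_probs: list) -> float:
    ents = [-sum(p * math.log(p) for p in dist if p > 0)
            for dist in token_probs]
    return sum(ents) / max(1, len(ents))

def compress_cot(steps, probs_per_step, frac=0.8):
    # Prune up to `frac` of the steps, lowest-entropy (most redundant) first.
    order = sorted(range(len(steps)),
                   key=lambda i: step_entropy(probs_per_step[i]))
    pruned = set(order[: int(frac * len(steps))])
    return [s for i, s in enumerate(steps) if i not in pruned]

steps = ["Let x = 2.", "So 2 + 2 = 4.", "Therefore the answer is 4."]
probs = [[[0.5, 0.5]], [[0.9, 0.1]], [[0.99, 0.01]]]
print(compress_cot(steps, probs, frac=0.34))  # drops the lowest-entropy step
```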
Do language models accommodate their users? A study of linguistic convergence
Authors: Terra Blevins, Susanne Schmalwieser, Benjamin Roth
2025-08-05
http://arxiv.org/abs/2508.03276v1
While large language models (LLMs) are generally considered proficient in
generating language, how similar their language usage is to that of humans
remains understudied. In this paper, we test whether models exhibit linguistic
convergence, a core pragmatic element of human language
communication, asking:
do models adapt, or converge, to the linguistic patterns of their user? To
answer this, we systematically compare model completions of existing dialogues
to the original human responses across sixteen language models, three dialogue
corpora, and a variety of stylometric features. We find that models strongly
converge to the conversation's style, often significantly overfitting relative
to the human baseline. While convergence patterns are often feature-specific,
we observe consistent shifts in convergence across modeling settings, with
instruction-tuned and larger models converging less than their pretrained
counterparts. Given the differences between human and model convergence
patterns, we hypothesize that the underlying mechanisms for these behaviors are
very different.
Attack the Messages, Not the Agents A Multi-round Adaptive Stealthy Tampering Framework for LLM-MAS
Authors: Bingyu Yan, Ziyi Zhou, Xiaoming Zhang, Chaozhuo Li, Ruilin Zeng, Yirui Qi, Tianbo Wang, Litian Zhang
2025-08-05
http://arxiv.org/abs/2508.03125v1
Large language model-based multi-agent systems (LLM-MAS) effectively
accomplish complex and dynamic tasks through inter-agent communication, but
this reliance introduces substantial safety vulnerabilities. Existing attack
methods targeting LLM-MAS either compromise agent internals or rely on direct
and overt persuasion, which limits their effectiveness, adaptability, and
stealthiness. In this paper, we propose MAST, a Multi-round Adaptive Stealthy
Tampering framework designed to exploit communication vulnerabilities within
the system. MAST integrates Monte Carlo Tree Search with Direct Preference
Optimization to train an attack policy model that adaptively generates
effective multi-round tampering strategies. Furthermore, to preserve
stealthiness, we impose dual semantic and embedding similarity constraints
during the tampering process. Comprehensive experiments across diverse tasks,
communication architectures, and LLMs demonstrate that MAST consistently
achieves high attack success rates while significantly enhancing stealthiness
compared to baselines. These findings highlight the effectiveness,
stealthiness, and adaptability of MAST, underscoring the need for robust
safeguards in LLM-MAS.
AgentSME for Simulating Diverse Communication Modes in Smart Education
Authors: Wen-Xi Yang, Tian-Fang Zhao
2025-08-05
http://arxiv.org/abs/2508.03109v1
Generative agent models specifically tailored for smart education are
critical, yet remain relatively underdeveloped. A key challenge stems from the
inherent complexity of educational contexts: learners are human beings with
various cognitive behaviors, and pedagogy is fundamentally centered on
personalized human-to-human communication. To address this issue, this paper
proposes AgentSME, a unified generative agent framework powered by LLMs. Three
directional communication modes are considered in the models, namely Solo,
Mono, and Echo, reflecting different types of agency autonomy and communicative
reciprocity. Accuracy is adopted as the primary evaluation metric, complemented
by three diversity indices designed to assess the diversity of reasoning
contents. Six widely used LLMs are tested to validate the robustness of
communication modes across different model tiers, which are equally divided
into base-capacity and high-capacity configurations. The results show that
generative agents that employ the Echo
communication mode achieve the highest
accuracy scores, while DeepSeek exhibits the greatest diversity. This study
provides valuable information to improve agent learning capabilities and
inspire smart education models.
Modeling Annotator Disagreement with Demographic-Aware Experts and Synthetic Perspectives
Authors: Yinuo Xu, Veronica Derricks, Allison Earl, David Jurgens
2025-08-04
http://arxiv.org/abs/2508.02853v1
We present an approach to modeling annotator disagreement in subjective NLP
tasks through both architectural and data-centric innovations. Our model,
DEM-MoE (Demographic-Aware Mixture of Experts), routes inputs to expert
subnetworks based on annotator demographics, enabling it to better represent
structured, group-level variation compared to prior models. DEM-MoE
consistently performs competitively across demographic groups, and shows
especially strong results on datasets with high annotator disagreement. To
address demographic coverage, we test whether
LLM-generated synthetic
annotations via zero-shot persona prompting can be used for data imputation. We
show these synthetic judgments align moderately well with human annotations on
our data and offer a scalable way to potentially enrich training data. We then
propose and evaluate approaches for blending real and synthetic data, finding
that the optimal blending strategies depend on dataset structure. Together,
these contributions improve the
representation of diverse perspectives.
LOST Low-rank and Sparse Pre-training for Large Language Models
Authors: Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Liwu Xu, Tianjin Huang, Wenwu Wang, Shiwei Liu, Xilu Wang
2025-08-04
http://arxiv.org/abs/2508.02668v1
While large language models (LLMs) have achieved remarkable performance
across a wide range of tasks, their massive scale incurs prohibitive
computational and memory costs for pre-training from scratch. Recent studies
have investigated the use of low-rank parameterization as a means of reducing
model size and training cost. In this context,
sparsity is often employed as a
complementary technique to recover important information lost in low-rank
compression by capturing salient features in the residual space. However,
existing approaches typically combine low-rank and
sparse components in a
simplistic or ad hoc manner, often resulting in undesirable performance
degradation compared to full-rank training. In this paper, we propose
\textbf{LO}w-rank and \textbf{S}parse pre-\textbf{T}raining (\textbf{LOST}) for
LLMs, a novel method that ingeniously integrates low-rank and sparse
structures to enable effective training of LLMs from scratch under strict
efficiency constraints. LOST applies singular value decomposition to weight
matrices, preserving the dominant low-rank components, while allocating the
remaining singular values to construct channel-wise sparse components to
complement the expressiveness of low-rank training. We evaluate LOST on LLM
pretraining ranging from 60M to 7B parameters. Our experiments show that LOST
achieves
competitive or superior performance compared to full-rank models, while
significantly reducing both memory and compute overhead. Code is available at
\href{https://github.com/JiaxiLi1/LOST-Low-rank-and-Sparse-Training-for-Large-Language-Models}{LOST
Repo}.
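A sketch of the decomposition under stated assumptions: SVD keeps the dominant low-rank part, and the residual is restricted to its largest-norm channels as the sparse component; the exact allocation rule below is illustrative, not the paper's.

```python
import torch

# Illustrative LOST-style split of a weight matrix into a low-rank part plus
# a channel-wise sparse residual.

def lost_decompose(w: torch.Tensor, rank: int, keep_channels: int):
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    low_rank = u[:, :rank] @ torch.diag(s[:rank]) @ vh[:rank]  # dominant part
    residual = w - low_rank
    # Channel-wise sparsity: keep only the residual's largest-norm columns.
    idx = residual.norm(dim=0).topk(keep_channels).indices
    sparse = torch.zeros_like(residual)
    sparse[:, idx] = residual[:, idx]
    return low_rank, sparse

w = torch.randn(256, 256)
lr, sp = lost_decompose(w, rank=32, keep_channels=16)
print((w - (lr + sp)).norm() / w.norm())   # reconstruction error of the pair
```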
Automated SNOMED CT Concept Annotation in Clinical Text Using Bi-GRU Neural Networks
Authors: Ali Noori, Pratik Devkota, Somya Mohanty, Prashanti Manda
2025-08-04
http://arxiv.org/abs/2508.02556v1
Automated annotation of clinical text with standardized medical concepts is
critical for enabling structured data extraction and decision support. SNOMED
CT provides a rich ontology for labeling clinical entities, but manual
annotation is labor-intensive and impractical at scale. This study introduces a
neural sequence labeling approach for SNOMED CT concept recognition using a
Bidirectional GRU model. Leveraging a subset of MIMIC-IV, we preprocess text
with domain-adapted SpaCy and SciBERT-based tokenization, segmenting sentences
into overlapping 19-token chunks enriched with contextual, syntactic, and
morphological features. The Bi-GRU model assigns IOB tags to identify concept
spans and achieves strong performance with a 90 percent F1-score on the
validation set. These results surpass traditional rule-based systems and match
or exceed existing neural models. Qualitative analysis shows effective handling
of ambiguous terms and misspellings. Our findings highlight that lightweight
RNN-based architectures can deliver high-quality clinical concept annotation
with significantly lower computational cost than
transformer-based models,
making them well-suited for real-world deployment.
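A toy version of the chunking step; the stride is an assumption, since the abstract only specifies the 19-token window.

```python
# Segment a tokenized sentence into overlapping 19-token chunks, carrying
# IOB labels along; stride 10 is an assumed value for illustration.

def chunk(tokens, labels, size=19, stride=10):
    chunks = []
    for start in range(0, max(1, len(tokens) - size + 1), stride):
        chunks.append((tokens[start:start + size],
                       labels[start:start + size]))
    return chunks

tokens = [f"tok{i}" for i in range(40)]
labels = ["O"] * 40
labels[5:8] = ["B-CONCEPT", "I-CONCEPT", "I-CONCEPT"]  # a SNOMED CT span
for toks, labs in chunk(tokens, labels):
    print(len(toks), labs[:8])
```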
xDeepServe Model-as-a-Service on Huawei CloudMatrix384
Authors: Ao Xiao, Bangzheng He, Baoquan Zhang, Baoxing Huai, Bingji Wang, Bo Wang, Bo Xu, Boyi Hou, Chan Yang, Changhong Liu, Cheng Cui, Chenyu Zhu, Cong Feng, Daohui Wang, Dayun Lin, Duo Zhao, Fengshao Zou, Fu Wang, Gangqiang Zhang, Gengyuan Dan, Guanjie Chen, Guodong Guan, Guodong Yang, Haifeng Li, Haipei Zhu, Hao Feng, Hao Huang, Hao Xu, Hengrui Ma, Hengtao Fan, Hui Liu, Jia Li, Jiang Liu, Jiang Xu, Jie Meng, Jinhan Xin, Junhao Hu, Juwei Chen, Lan Yu, Lanxin Miao, Liang Liu, Linan Jing, Lu Zhou, Meina Han, Mingkun Deng, Mingyu Deng, Naitian Deng, Nizhong Lin, Peihan Zhao, Peng Pan, Pengfei Shen, Ping Li, Qi Zhang, Qin Zhang, Qingrong Xia, Qingyi Zhang, Qunchao Fu, Ren Guo, Ruimin Gao, Shaochun Li, Sheng Long, Shentian Li, Shining Wan, Shuai Shen, Shuangfu Zeng, Shuming Jing, Siqi Yang, Song Zhang, Tao Xu, Tianlin Du, Ting Chen, Wanxu Wu, Wei Jiang, Weinan Tong, Weiwei Chen, Wen Peng, Wenli Zhou, Wenquan Yang, Wenxin Liang, Xiang Liu, Xiaoli Zhou, Xin Jin, Xinyu Duan, Xu Li, Xu Zhang, Xusheng Chen, Yalong Shan, Yang Gan, Yao Lu, Yi Deng, Yi Zheng, Yingfei Zheng, Yiyun Zheng, Yizhou Shan, Yong Gao, Yongqiang Yang, Yuanjin Gong, Yue Yu, Yuetao Chen, Yukun Zhu, Yulong He, Yusu Zhao, Yuyan Wu, Zenan Zhang, Zhaojin Zhuo, Zhaoyang Ji, Zhefeng Wang, Zheng Wang, Zhenhua Yang, Zhenli Sheng, Zhibin Yu, Zhigang Ji, Zhihao Ren, Zhipeng Bian, Zhixia Liu, Zhiyu Dong, Zhonghua Li, Zhou Yu, Zhuoming Shen, Zhuwei Peng, Zi Ye, Zihao Xiang, Zimin Fu, Zixuan Zhang
2025-08-04
http://arxiv.org/abs/2508.02520v4
The rise of scaled-out LLMs and scaled-up SuperPods signals a new era in
large-scale AI infrastructure. LLMs continue to scale out via MoE, as seen in
recent models like DeepSeek, Kimi, and Qwen. In parallel, AI hardware is
scaling up, with Huawei's CloudMatrix384 SuperPod offering hundreds of GB/s
high-speed interconnects. Running large MoE models on SuperPod-scale hardware
brings new challenges. It requires new execution models, scalable scheduling,
efficient expert load balancing, and elimination of single points of failure.
This paper presents xDeepServe, Huawei Cloud's LLM serving system designed for
SuperPod-scale infrastructure. At its core is Transformerless, a disaggregated
architecture that decomposes transformer models into modular units--attention,
feedforward, and MoE--executed independently on NPUs connected via high-speed
fabric. We implement this design in two forms: disaggregated prefill-decode and
disaggregated MoE-attention. This fully disaggregated setup enables independent
scaling of compute and memory without sacrificing performance. To support this
architecture, we propose XCCL, a
communication library that leverages
CloudMatrix384's global shared memory to implement efficient point-to-point and
all-to-all primitives. We also extend our serving engine FlowServe with
system-level techniques, enabling scalable inference across hundreds of NPUs.
Decomposed Reasoning with Reinforcement Learning for Relevance Assessment in UGC Platforms
Authors: Xiaowei Yuan, Lei Jin, Haoxin Zhang, Yan Gao, Yi Wu, Yao Hu, Ziyang Huang, Jun Zhao, Kang Liu
2025-08-04
http://arxiv.org/abs/2508.02506v1
Retrieval-augmented generation (RAG) plays a critical role in user-generated
content (UGC) platforms, but its effectiveness depends heavily on accurate
relevance assessment of query-document pairs. Despite recent advances in
applying large language models (LLMs) to relevance modeling, UGC platforms
present unique challenges: 1) ambiguous user intent due to
sparse user feedback
in RAG scenarios, and 2) substantial noise introduced by informal and
unstructured language. To address these issues, we propose the Reinforced
Reasoning Model for Relevance Assessment (R3A), which introduces a decomposed
reasoning framework over queries and candidate documents before scoring. R3A
first leverages auxiliary high-ranked documents within the platform to infer
latent query intent. It then performs verbatim fragment extraction to justify
relevance decisions, thereby reducing errors caused by noisy UGC. Based on a
reinforcement learning framework, R3A is optimized to mitigate distortions
arising from ambiguous queries and unstructured content. Experimental results
show that R3A significantly outperforms existing baseline methods in terms of
relevance accuracy, across both offline benchmarks and online experiments.
CompressKV Semantic Retrieval Heads Know What Tokens are Not Important Before Generation
Authors: Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, Grace Li Zhang
2025-08-04
http://arxiv.org/abs/2508.02401v1
Recent advances in large language models (LLMs) have significantly boosted
long-context processing. However, the increasing key-value (KV) cache size
poses critical challenges to memory and execution efficiency. Most KV cache
compression methods rely on heuristic token eviction using all attention heads
in Grouped Query Attention (GQA)-based LLMs. This method ignores the different
functionalities of attention heads, leading to the eviction of critical tokens
and thus degrading the performance of LLMs.
To address the issue above, instead of using all the attention heads in
GQA-based LLMs to determine important tokens as in the previous work, we first
identify the attention heads in each layer that are not only capable of
retrieving the initial and final tokens of a prompt, but also capable of
retrieving important tokens within the text and attending to their surrounding
semantic context. Afterwards, we exploit such heads to determine the important
tokens and retain their corresponding KV cache pairs. Furthermore, we analyze
the cache eviction error of each layer individually and introduce a
layer-adaptive KV cache allocation strategy. Experimental results demonstrate
that the proposed CompressKV consistently outperforms state-of-the-art
approaches under various memory budgets on LongBench and Needle-in-a-Haystack
benchmarks. Our code is publicly available at:
https://github.com/TUDa-HWAI/CompressKV.git.
Beyond Manually Designed Pruning Policies with Second-Level Performance Prediction A Pruning Framework for LLMs
Authors: Zuxin Ma, Yunhe Cui, Yongbin Qin
2025-08-04
http://arxiv.org/abs/2508.02381v2
Non-uniform structured network pruning methods can effectively reduce Large
Language Model (LLM) size by eliminating redundant channels or layers, offering
) size by eliminating redundant channels or layers, offering
lower performance degradation than uniform strategies. However, existing
non-uniform methods rely heavily on manually designed pruning policies (e.g.,
layer importance and scaling factors), and therefore cannot efficiently adapt
to scenarios with dynamic pruning ratio requirements. Additionally, a critical
bottleneck -- the time-consuming evaluation of pruning policies -- further
limits the feasibility of iteratively and dynamically finding optimal pruning
policies. To address these limitations, we propose PPF (Predictive Pruning
Framework), a novel pruning framework for LLMs that eliminates manual design
dependencies via second-level performance prediction. PPF not only supports
real-time pruning decisions under dynamic pruning ratios but is also applicable
to static pruning scenarios. It employs an agent to produce adaptive, real-time
pruning actions, together with a lightweight performance predictor that can
evaluate a pruning policy in seconds, significantly speeding up the iterative
optimization process. Experiments on Llama2-7B and Llama3-8B show that PPF can
generate dynamic/static pruning policies and reduces perplexity by up to
33.4% (dynamic pruning) and 84.78% (static pruning) over existing methods,
outperforming manually designed pruning policies. The performance predictor
achieves second-level performance prediction with high accuracy (prediction
error < 0.0011). It reduces the mean evaluation latency from minute-level (1
minute and 38.02 seconds of test-set evaluation methods) to second-level (1.52
seconds), achieving over 64 times speedup. Our code will be available at
https://github.com/Ma-zx/PPF .
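A hypothetical shape for the second-level predictor: a small MLP mapping a per-layer pruning-ratio vector to a predicted perplexity, so candidate policies can be scored in a single forward pass. The architecture and input features below are assumptions, not PPF's actual design.

```python
import torch
import torch.nn as nn

# Assumed sketch of a lightweight pruning-policy performance predictor.
n_layers = 32
predictor = nn.Sequential(
    nn.Linear(n_layers, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),               # predicted perplexity (or its log)
)

policy = torch.rand(1, n_layers)     # candidate per-layer pruning ratios
predicted_ppl = predictor(policy)    # seconds-level scoring vs. a full test run
```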
Traffic-R1 Reinforced LLMs Bring Human-Like Reasoning to Traffic Signal Control Systems
Authors: Xingchen Zou, Yuhao Yang, Zheng Chen, Xixuan Hao, Yiqi Chen, Chao Huang, Yuxuan Liang
2025-08-04
http://arxiv.org/abs/2508.02344v1
Traffic signal control (TSC) is vital for mitigating congestion and
sustaining urban mobility. In this paper, we introduce Traffic-R1, a foundation
model with human-like reasoning for TSC systems. Our model is developed through
self-exploration and iteration of reinforced large language models (LLMs) with
expert guidance in a simulated traffic environment. Compared to traditional
reinforcement learning (RL) and recent
LLM-based methods, Traffic-R1 offers
three significant advantages. First, Traffic-R1 delivers zero-shot
generalisation, transferring unchanged to new road networks and
out-of-distribution incidents by utilizing its internal traffic control
policies and human-like reasoning. Second, its 3B-parameter architecture is
lightweight enough for real-time inference on mobile-class chips, enabling
large-scale edge deployment. Third, Traffic-R1 provides an explainable TSC
process and facilitates multi-intersection communication through its
self-iteration and a new synchronous communication network. Extensive
benchmarks demonstrate that Traffic-R1 sets a new state of the art,
outperforming strong baselines and training-intensive RL controllers. In
practice, the model now manages signals for more than 55,000 drivers daily,
shortening average queues by over 5% and halving operator workload. Our
checkpoint is available at https://huggingface.co/Season998/Traffic-R1.
CAMERA Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
Authors: Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che
2025-08-04
http://arxiv.org/abs/2508.02322v1
Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are
distinguished by their strong performance scaling with increasing parameters
across a wide range of tasks, yet they also suffer from substantial
computational and storage overheads. Notably, the performance gains of MoE
models do not scale proportionally with the growth in expert parameters. While
prior works attempt to reduce parameters via expert-level pruning, merging, or
decomposition, they still suffer from challenges in both performance and
computational efficiency. In this paper, we address these challenges by
introducing micro-expert as a finer-grained compression unit that spans across
matrices. We first establish a more fundamental perspective, viewing MoE layers
as mixtures of micro-experts, and present CAMERA, a lightweight and
training-free framework for identifying micro-expert redundancy. Our analysis
uncovers significant variance in micro-expert contributions during decoding.
Based on this insight, we further propose CAMERA-P, a structured micro-expert
pruning framework, and CAMERA-Q, a mixed-precision quantization scheme designed
for micro-experts. Extensive experiments on nine downstream tasks show that
CAMERA-P consistently outperforms strong baselines under pruning ratios ranging
from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under
aggressive 2-bit quantization, surpassing existing matrix- and channel-level
schemes. Notably, our method enables complete micro-expert analysis of
Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.
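As a rough illustration of what "micro-expert" granularity means, the hypothetical sketch below slices one expert's up-projection into row groups and ranks the groups by activation energy on calibration data; CAMERA's actual redundancy criterion may differ, and all sizes are invented.

    # Hypothetical micro-expert scoring: each group of FFN rows is a
    # micro-expert; rank groups by mean activation energy, keep the top 60%.
    import torch

    d_model, d_ff, group = 512, 2048, 64        # 64 rows per micro-expert
    W_up = torch.randn(d_ff, d_model)           # one expert's up-projection
    x = torch.randn(256, d_model)               # calibration activations

    h = torch.relu(x @ W_up.T)                  # (256, d_ff)
    scores = h.pow(2).mean(0).view(d_ff // group, group).sum(-1)
    keep = scores.topk(int(0.6 * scores.numel())).indices
    print("kept micro-experts:", sorted(keep.tolist()))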
VeOmni Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
Authors: Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, Xin Liu
2025-08-04
http://arxiv.org/abs/2508.02317v3
Recent advances in large language models (LLMs) have driven impressive
progress in omni-modal understanding and generation. However, training
omni-modal LLMs remains a significant challenge due to the heterogeneous model
architectures required to process diverse modalities, necessitating
sophisticated system design for efficient large-scale training. Existing
frameworks typically entangle model definition with parallel logic, incurring
limited scalability and substantial engineering overhead for end-to-end
omni-modal training. We present VeOmni, a modular and efficient training
framework to accelerate the development of omni-modal LLMs. VeOmni introduces
model-centric distributed recipes that decouple communication from
computation, enabling efficient 3D parallelism on omni-modal LLMs. VeOmni also
features a flexible configuration interface supporting seamless integration of
new modalities with minimal code change. Using VeOmni, an omni-modal
mixture-of-experts (MoE) model with 30B parameters can be trained with over
2,800 tokens/sec/GPU throughput and scale to 160K context lengths via 3D
parallelism on 128 GPUs, showcasing its superior efficiency and scalability for
training large omni-modal LLMs.
Isolating Culture Neurons in Multilingual Large Language Models
Authors: Danial Namazifard, Lukas Galke
2025-08-04
http://arxiv.org/abs/2508.02241v1
Language and culture are deeply intertwined, yet it remains unclear how and
where multilingual large language models encode culture. Here, we build on
an established methodology for identifying language-specific neurons and extend
it to localize and isolate culture-specific neurons, carefully disentangling
their overlap and interaction with language-specific neurons. To facilitate our
experiments, we introduce MUREL, a curated dataset of 85.2 million tokens
spanning six different cultures. Our localization and intervention experiments
show that LLMs encode different cultures in distinct neuron populations,
predominantly in upper layers, and that these culture neurons can be modulated
independently from language-specific neurons or those specific to other
cultures. These findings suggest that cultural knowledge and propensities in
multilingual language models can be selectively isolated and edited - promoting
fairness, inclusivity, and alignment. Code and data are available at
https://github.com/namazifard/Culture_Neurons .
Forecasting When to Forecast Accelerating Diffusion Models with Confidence-Gated Taylor
Authors: Xiaoliu Guan, Lielin Jiang, Hanqi Chen, Xu Zhang, Jiaxing Yan, Guanzhong Wang, Yi Liu, Zetao Zhang, Yu Wu
2025-08-04
http://arxiv.org/abs/2508.02240v2
Diffusion Transformers (DiTs) have demonstrated remarkable performance in
visual generation tasks. However, their low inference speed limits their
deployment in low-resource applications. Recent training-free approaches
exploit the redundancy of features across timesteps by caching and reusing past
representations to accelerate inference. Building on this idea, TaylorSeer
instead uses cached features to predict future ones via Taylor expansion.
However, its module-level prediction across all blocks (e.g.,
attention or feedforward modules) requires storing fine-grained intermediate
features, leading to notable memory and computation overhead. Moreover, it
adopts a fixed caching schedule without considering the varying accuracy of
predictions across timesteps, which can lead to degraded outputs when
prediction fails. To address these limitations, we propose a novel approach to
better leverage Taylor-based acceleration. First, we shift the Taylor
prediction target from the module level to the last block level, significantly
reducing the number of cached features. Furthermore, observing strong
sequential dependencies among Transformer blocks, we propose to use the error
between the Taylor-estimated and actual outputs of the first block as an
indicator of prediction reliability. If the error is small, we trust the Taylor
prediction for the last block; otherwise, we fall back to full computation,
thereby enabling a dynamic caching mechanism. Empirical results show that our
method achieves a better balance between speed and quality, achieving a 3.17x
speedup on FLUX, 2.36x on DiT, and 4.14x on Wan Video with negligible
quality drop. The project page is https://cg-taylor-acce.github.io/CG-Taylor/.
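A toy sketch of the confidence-gating logic (function names and the threshold tau are assumptions, not the released implementation): forecast a block's output by first-order extrapolation from its cached history, and fall back to full computation whenever the first block's forecast error is large.

    # Confidence-gated Taylor forecasting, first order: extrapolate from the
    # two most recent cached outputs; the first block's error gates reuse.
    import torch

    def taylor_forecast(y_prev, y_prev2):
        return y_prev + (y_prev - y_prev2)       # finite-difference 1st order

    def step(first_block, last_block, x, cache, tau=0.05):
        y1 = first_block(x)                      # always computed (cheap probe)
        y1_hat = taylor_forecast(cache["f1"][-1], cache["f1"][-2])
        rel_err = (y1_hat - y1).norm() / y1.norm()
        if rel_err < tau:                        # forecast deemed reliable
            yL = taylor_forecast(cache["fL"][-1], cache["fL"][-2])
        else:                                    # fall back to full compute
            yL = last_block(y1)
        cache["f1"].append(y1); cache["fL"].append(yL)
        return yL

    # Tiny demo with stand-in blocks:
    cache = {"f1": [torch.zeros(4)] * 2, "fL": [torch.zeros(4)] * 2}
    out = step(lambda x: 0.9 * x, lambda y: y + 1, torch.ones(4), cache)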
LeanK Learnable K Cache Channel Pruning for Efficient Decoding
Authors: Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu
2025-08-04
http://arxiv.org/abs/2508.02215v1
Large language models (LLMs) enable long-context tasks but face efficiency
challenges due to the growing key-value (KV) cache. We propose LeanK, a
learning-based method that prunes unimportant key (K) cache channels by
leveraging static channel sparsity. With a novel two-stage training process,
LeanK learns a channel-wise static mask that satisfies a specific sparsity
ratio and hardware alignment requirements. LeanK reduces GPU memory and
accelerates decoding without sacrificing accuracy. Experiments demonstrate up
to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel
enables 1.3x speedup for attention computation. We also provide insights into
model channels and attention heads during long-context inference by analyzing
the learned importance distribution. Our code is available at
https://aka.ms/LeanK.
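The sketch below shows the general shape of static K-channel pruning (shapes and the random mask are stand-ins for LeanK's learned mask, and the scaling choice is assumed): only the kept channels of K are stored and attended over, while V is untouched.

    # Hypothetical static K-cache channel pruning: keep a per-head subset of
    # key channels and compute attention scores over that subset only.
    import torch

    B, H, T, D = 1, 8, 1024, 128
    K = torch.randn(B, H, T, D)
    Q = torch.randn(B, H, 1, D)

    mask = torch.rand(H, D) > 0.7                    # stand-in learned mask
    idx = [m.nonzero().squeeze(-1) for m in mask]

    # Only kept channels are cached, shrinking K-cache memory per head.
    scores = torch.stack(
        [Q[:, h, :, idx[h]] @ K[:, h, :, idx[h]].transpose(-1, -2)
         for h in range(H)], dim=1) / D ** 0.5       # scale choice assumed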
Whispering Agents An event-driven covert communication protocol for the Internet of Agents
Authors: Kaibo Huang, Yukun Wei, WanSheng Wu, Tianhua Zhang, Zhongliang Yang, Linna Zhou
2025-08-04
http://arxiv.org/abs/2508.02188v1
The emergence of the Internet of Agents (IoA) introduces critical challenges
for privacy in sensitive, high-stakes domains. While standard
Agent-to-Agent (A2A) protocols secure message content, they are not designed to
protect the act of communication itself, leaving agents vulnerable to
surveillance and traffic analysis. We find that the rich, event-driven nature
of agent dialogues provides a powerful, yet untapped, medium for covert
communication. To harness this potential, we introduce and formalize the Covert
Event Channel, the first unified model for agent covert communication, driven
by three interconnected dimensions: the Storage, Timing, and
Behavioral channels. Based on this model, we design and engineer ΠCCAP, a
novel protocol that operationalizes this event-driven paradigm. Our
comprehensive evaluation demonstrates that ΠCCAP achieves high capacity and
robustness while remaining imperceptible to powerful LLM-based wardens,
establishing its practical viability. By systematically engineering this
channel, our work provides the foundational understanding essential for
developing the next generation of monitoring systems and defensive protocols
for a secure and trustworthy IoA.
Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation
Authors: Kuiyuan Ding, Caili Guo, Yang Yang, Zhongtian Du, Walid Saad
2025-08-04
http://arxiv.org/abs/2508.02148v1
Large-scale models (LSMs) can be an effective framework for semantic
representation and understanding, thereby providing a suitable tool for
designing semantic communication (SC) systems. However, their direct deployment
is often hindered by high computational complexity and resource requirements.
In this paper, a novel robust knowledge distillation based semantic
communication (RKD-SC) framework is proposed to enable efficient and
channel-noise-robust LSM-powered SC. The framework addresses
two key challenges: determining optimal compact model architectures and
effectively transferring knowledge while maintaining robustness against channel
noise. First, a knowledge distillation-based lightweight differentiable
architecture search (KDL-DARTS) algorithm is proposed. This algorithm
integrates knowledge distillation loss and a complexity penalty into the neural
architecture search process to identify high-performance, lightweight semantic
encoder architectures. Second, a novel two-stage robust knowledge distillation
(RKD) algorithm is developed to transfer semantic capabilities from an LSM
(teacher) to a compact encoder (student) and subsequently enhance system
robustness. To further improve resilience to channel impairments, a
channel-aware transformer (CAT) block is introduced as the channel codec,
trained under diverse channel conditions with variable-length outputs.
Extensive simulations on image classification tasks demonstrate that the RKD-SC
framework significantly reduces model parameters while preserving a high degree
of the teacher model's performance and exhibiting superior robustness compared
to existing methods.
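The KDL-DARTS objective can be pictured as an ordinary DARTS-style loss with two extra terms; the sketch below is a guess at its general form (the weights alpha, beta and the op-cost model are invented), not the paper's exact formulation.

    # Distillation-aware architecture-search loss: task CE + KD term +
    # differentiable complexity penalty over the architecture softmax.
    import torch
    import torch.nn.functional as F

    def kdl_darts_loss(student_logits, teacher_logits, labels,
                       arch_weights, op_costs, T=4.0, alpha=0.5, beta=1e-3):
        ce = F.cross_entropy(student_logits, labels)
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits / T, dim=-1),
                      reduction="batchmean") * T * T
        # Expected cost of the supernet under the current architecture weights.
        complexity = (F.softmax(arch_weights, dim=-1) * op_costs).sum()
        return (1 - alpha) * ce + alpha * kd + beta * complexity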
Amber Pruner Leveraging NM Activation Sparsity for Efficient Prefill in Large Language Models
Authors: Tai An, Ruwu Cai, Yanzhe Zhang, Yang Liu, Hao Chen, Pengcheng Xie, Sheng Chang, Yiwu Yao, Gongyi Wang
2025-08-04
http://arxiv.org/abs/2508.02128v1
In the era of large language models (LLMs), N:M sparsity has emerged as a
structured compression technique critical for accelerating inference. While
prior work has primarily focused on weight sparsity, it often suffers from
significant accuracy degradation. Activation sparsity, though promising, is
typically training-dependent and faces challenges in generalization. To address
these limitations, we introduce Amber Pruner, a training-free N:M activation
sparsity method designed specifically for the prefill stage, targeting the
acceleration of linear projection layers in LLMs. Extensive experiments across
multiple models and sparsity ratios (2:4, 4:8, and 8:16) demonstrate that Amber
Pruner can effectively sparsify and accelerate more than 55% of linear
computations without requiring model retraining. To further enhance generality
and efficiency, we propose Outstanding-Sparsity, a unified framework that
integrates Amber Pruner with post-training W8A8 quantization. Our approach
preserves strong performance across a range of downstream tasks, with notable
advantages in generative tasks. This work pioneers a new frontier in activation
sparsity, providing foundational insights that are poised to guide the
co-evolution of algorithms and architectures in the design of next-generation
AI systems.
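For readers unfamiliar with N:M schemes, the toy function below applies 2:4 sparsity to activations, keeping the two largest magnitudes in every group of four; it illustrates the structured constraint Amber Pruner exploits, not the authors' kernel or selection rule.

    # Toy 2:4 activation sparsification: zero the 2 smallest of every 4 values.
    import torch

    def nm_sparsify(x, n=2, m=4):
        g = x.reshape(-1, m)                            # group last dim by m
        topk = g.abs().topk(n, dim=-1).indices
        mask = torch.zeros_like(g).scatter_(-1, topk, 1.0)
        return (g * mask).reshape(x.shape)

    act = torch.randn(8, 16)
    print(nm_sparsify(act)[0])        # exactly 2 nonzeros per group of 4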
A Survey on AgentOps Categorization, Challenges, and Future Directions
Authors: Zexin Wang, Jingjing Li, Quan Zhou, Haotian Si, Yuanhao Liu, Jianhui Li, Gaogang Xie, Fei Sun, Dan Pei, Changhua Pei
2025-08-04
http://arxiv.org/abs/2508.02121v1
As the reasoning capabilities of Large Language Models (LLMs) continue to
advance, LLM-based agent systems offer advantages in flexibility and
interpretability over traditional systems, garnering increasing attention.
However, despite the widespread research interest and industrial application of
agent systems, these systems, like their traditional counterparts, frequently
encounter anomalies. These anomalies lead to instability and insecurity,
hindering their further development. Therefore, a comprehensive and systematic
approach to the operation and maintenance of agent systems is urgently needed.
Unfortunately, current research on the operations of agent systems is sparse.
To address this gap, we have undertaken a survey on agent system operations
with the aim of establishing a clear framework for the field, defining the
challenges, and facilitating further development. Specifically, this paper
begins by systematically defining anomalies within agent systems, categorizing
them into intra-agent anomalies and inter-agent anomalies. Next, we introduce a
novel and comprehensive operational framework for agent systems, dubbed Agent
System Operations (AgentOps). We provide detailed definitions and explanations
of its four key stages: monitoring, anomaly detection, root cause analysis, and
resolution.
AlignGuard-LoRA Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
Authors: Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha
2025-08-04
http://arxiv.org/abs/2508.02079v1
Low-rank adaptation (LoRA) has become a standard tool for efficiently
fine-tuning large language models (LLMs). Yet, even minor LoRA updates can
induce alignment drift, weakening safety and behavioral constraints through
entangled parameter changes. To address this, we propose AlignGuard-LoRA (AGL),
a principled framework for preserving alignment during fine-tuning. AGL
introduces several key components: a primary task loss for supervision, Fisher
Information Matrix-based regularization to restrict updates in
alignment-sensitive subspaces, and task-specific regularization to stabilize
the integration of new knowledge. We further introduce collision-aware
regularization, blending Riemannian overlap -- which penalizes coordinate-wise
interference -- and geodesic separation -- which encourages disjoint update
geometry. We curate DriftCaps, a targeted diagnostic benchmark of safe and
unsafe prompts designed to quantify alignment drift and safety degradation.
Empirical evaluations show that AGL mitigates alignment drift by up to 50% on
safety-critical benchmarks without degrading downstream task performance.
Comprehensive ablation confirms that each component contributes distinctly to
preserving latent safety behaviors. Finally, we derive and validate a scaling
law for catastrophic forgetting, revealing that AGL flattens post-finetuning
loss escalation while preserving adaptation dynamics. AGL is a structurally
grounded refinement of LoRA, ensuring alignment preservation with minimal
trade-offs. To encourage further exploration and development, we open-source
our implementation.
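One of AGL's ingredients, Fisher-guided restriction of updates, can be sketched as a quadratic penalty on the LoRA-induced weight delta; all names, shapes, and the diagonal-Fisher stand-in below are illustrative assumptions rather than the authors' code.

    # Illustrative Fisher-weighted penalty on the LoRA update B @ A: updates
    # along alignment-sensitive (high-Fisher) directions are taxed heavily.
    import torch

    def fisher_penalty(delta_w, fisher_diag, lam=1.0):
        return lam * (fisher_diag * delta_w.pow(2)).sum()

    A = torch.randn(8, 512, requires_grad=True)    # LoRA factors, rank 8
    B = torch.randn(512, 8, requires_grad=True)
    fisher = torch.rand(512 * 512)                 # stand-in diagonal Fisher
    loss = fisher_penalty((B @ A).flatten(), fisher)
    loss.backward()                                # gradients flow to A and B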
Everyone Contributes! Incentivizing Strategic Cooperation in Multi-LLM Systems via Sequential Public Goods Games
Authors: Yunhao Liang, Yuan Qu, Jingyuan Yang, Shaochong Lin, Zuo-Jun Max Shen
2025-08-04
http://arxiv.org/abs/2508.02076v1
Coordinating multiple large language models (LLMs) to solve complex tasks
collaboratively poses a fundamental trade-off between computation costs and
collective performance relative to individual models. We introduce a novel,
game-theoretically grounded reinforcement learning (RL) framework, the
Multi-Agent Cooperation Sequential Public Goods Game (MAC-SPGG), to
systematically incentivize cooperation in multi-LLM ensembles. In MAC-SPGG,
agents move in sequence, observing predecessors' outputs and updating beliefs
to condition their own contributions. By redesigning the public-goods reward,
effortful contributions become the unique Subgame Perfect Nash Equilibrium
(SPNE), which eliminates free-riding under traditional SPGG or PGG. Its
sequential protocol replaces costly round-based information exchanges with a
streamlined decision flow, cutting communication overhead while retaining
strategic depth. We prove the existence and uniqueness of the SPNE under
realistic parameters, and empirically show that MAC-SPGG-trained ensembles
outperform single-agent baselines, chain-of-thought prompting, and other
cooperative methods, even achieving comparable performance to large-scale
models across reasoning, math, code generation, and NLP tasks. Our results
highlight the power of structured, incentive-aligned MAC-SPGG cooperation for
scalable and robust multi-agent language generation.
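The public-goods structure is easy to state in a few lines; the toy payoff below (with an invented return factor r and effort cost) shows the quantity whose redesign, per the paper, makes full effort the unique equilibrium rather than free-riding.

    # Toy sequential public goods payoff: contributions are scaled by r,
    # shared equally, and each agent pays a private cost for its own effort.
    def spgg_payoffs(contributions, r=1.6, cost=1.0):
        pot = r * sum(contributions)
        n = len(contributions)
        return [pot / n - cost * c for c in contributions]

    # The third agent free-rides and earns the most under this naive reward,
    # which is exactly what a redesigned reward is meant to remove.
    print(spgg_payoffs([1.0, 1.0, 0.0, 1.0]))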
CVD-SfM A Cross-View Deep Front-end Structure-from-Motion System for Sparse Localization in Multi-Altitude Scenes
Authors: Yaxuan Li, Yewei Huang, Bijay Gaudel, Hamidreza Jafarnejadsani, Brendan Englot
2025-08-03
http://arxiv.org/abs/2508.01936v1
We present a novel multi-altitude camera pose estimation system, addressing
the challenges of robust and accurate localization across varied altitudes when
only considering image input. The system effectively handles diverse
environmental conditions and viewpoint variations by integrating the cross-view
transformer, deep features, and structure-from-motion into a unified framework.
To benchmark our method and foster further research, we introduce two newly
collected datasets specifically tailored for multi-altitude camera pose
estimation; datasets of this nature remain rare in the current literature. The
proposed framework has been validated through extensive comparative analyses on
these datasets, demonstrating that our system achieves superior performance in
both accuracy and robustness for multi-altitude
pose estimation tasks
compared to existing solutions, making it well suited for real-world robotic
applications such as aerial navigation, search and rescue, and automated
inspection.
IAUNet Instance-Aware U-Net
Authors: Yaroslav Prytula, Illia Tsiporenko, Ali Zeynalli, Dmytro Fishman
2025-08-03
http://arxiv.org/abs/2508.01928v1
Instance segmentation is critical in biomedical imaging to accurately
distinguish individual objects like cells, which often overlap and vary in
size. Recent query-based methods, where object queries guide segmentation, have
shown strong performance. While U-Net has been a go-to architecture in medical
image segmentation, its potential in query-based approaches remains largely
unexplored. In this work, we present IAUNet, a novel query-based U-Net
architecture. The core design features a full U-Net architecture, enhanced by a
novel lightweight convolutional Pixel decoder, making the model more efficient
and reducing the number of parameters. Additionally, we propose a Transformer
decoder that refines object-specific features across multiple scales. Finally,
we introduce the 2025 Revvity Full Cell Segmentation Dataset, a unique resource
with detailed annotations of overlapping cell cytoplasm in brightfield images,
setting a new benchmark for biomedical instance segmentation. Experiments on
multiple public datasets and our own show that IAUNet outperforms most
state-of-the-art fully convolutional, transformer-based, and query-based
models, as well as cell segmentation-specific models, setting a strong
baseline for cell
instance segmentation tasks. Code is available at
https://github.com/SlavkoPrytula/IAUNet
Quantum-RAG and PunGPT2 Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language
Authors: Jaskaranjeet Singh, Rakesh Thakur
2025-08-03
http://arxiv.org/abs/2508.01918v1
Despite the rapid advancement of large language models (LLMs), low-resource
languages remain largely excluded from the NLP landscape. We present PunGPT2,
the first fully open-source suite of Punjabi large language models, trained
from scratch on a 35GB domain-diverse corpus encompassing literature, religious
texts, news, and social discourse. Unlike prior multilingual approaches,
PunGPT2 captures rich syntactic and morphological features unique to Punjabi
through a tokenizer optimised with byte pair encoding and linguistically
aligned pretraining objectives. To improve factual grounding and domain recall,
we introduce Pun-RAG, a retrieval-augmented generation framework combining
PunGPT2 with a dense FAISS retriever over a curated Punjabi knowledge base. We
further develop Pun-Instruct, a parameter-efficient, instruction-tuned variant
using QLoRA, enabling robust zero-shot and instruction-following performance
with significantly reduced compute needs.
As a key innovation, we propose Quantum-RAG, a novel hybrid retrieval system
that fuses sparse retrieval (BM25) and dense methods with quantum-inspired
semantic
matching. By encoding queries using amplitude-based embeddings and retrieving
via quantum kernel similarity, Quantum-RAG achieves improved contextual
relevance with minimal memory overhead, marking the first practical integration
of quantum representations in low-resource language generation. Our models
significantly outperform strong multilingual baselines (mBERT, mT5, MuRIL) in
perplexity, factuality, and fluency. This work provides a scalable,
reproducible blueprint for extending LLM capabilities to underrepresented
languages and pioneers quantum-aware retrieval in low-resource NLP.
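One minimal reading of "quantum kernel similarity" is the fidelity between amplitude-encoded (L2-normalized) vectors, i.e., a squared inner product; the sketch below fuses that with a BM25 score. This is an interpretation for illustration, with an invented fusion weight alpha, not the paper's exact retriever.

    # Hybrid sparse + quantum-inspired dense scoring (illustrative only;
    # in practice the two score scales would be normalized before fusing).
    import numpy as np

    def quantum_kernel(q, d):
        q = q / np.linalg.norm(q)          # amplitude encoding as L2 norm
        d = d / np.linalg.norm(d)
        return float(np.dot(q, d) ** 2)    # fidelity |<q|d>|^2

    def hybrid_score(bm25, q_emb, d_emb, alpha=0.5):
        return alpha * bm25 + (1 - alpha) * quantum_kernel(q_emb, d_emb)

    rng = np.random.default_rng(1)
    print(hybrid_score(12.3, rng.normal(size=64), rng.normal(size=64)))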
AGFT An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization
Authors: Zicong Ye, Kunming Zhang, Guoming Tang
2025-08-03
http://arxiv.org/abs/2508.01744v1
The explosive growth of interactive Large Language Models (LLMs) has placed
unprecedented demands for low latency on cloud GPUs, forcing them into
high-power modes and causing escalating energy costs. Real-time inference
workloads exhibit significant dynamic volatility, presenting substantial
energy-saving opportunities. However, traditional static or rule-based power
management strategies struggle to exploit these opportunities without
compromising peak performance. To address this challenge, we propose AGFT (An
Adaptive GPU Frequency Tuner), a framework that employs online reinforcement
learning to autonomously learn an optimal frequency tuning policy. By
monitoring real-time features like request load and latency, AGFT utilizes
fine-grained frequency control for precise adjustments and intelligent action
space pruning for stable, efficient decision-making. This creates a robust,
automated energy management solution. We comprehensively evaluated AGFT in an
environment simulating realistic, fluctuating inference requests. The
experimental results demonstrate that AGFT successfully saves 44.3% of GPU
energy consumption while introducing a minimal performance latency overhead of
under 10%. This achievement translates into a comprehensive Energy-Delay
Product (EDP) optimization of up to 40.3%, clearly showing that our framework
can significantly enhance the energy efficiency and economic benefits of
existing LLM inference clusters without compromising service quality.
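In spirit, AGFT is an online RL loop over GPU frequency levels; the epsilon-greedy sketch below (invented state buckets, reward shape, and frequency list) shows one way the pieces fit, with the reward penalizing an energy-delay-product-style cost.

    # Hypothetical online tuner: bucket the observed load into a state, pick a
    # frequency epsilon-greedily, and update toward an EDP-style reward.
    import random

    FREQS = [900, 1200, 1500, 1800]       # MHz levels (a pruned action space)
    Q = {}                                # (load_bucket, freq) -> value

    def choose(load_bucket, eps=0.1):
        if random.random() < eps:
            return random.choice(FREQS)
        return max(FREQS, key=lambda f: Q.get((load_bucket, f), 0.0))

    def update(load_bucket, freq, energy_j, latency_s, lr=0.1):
        reward = -(energy_j * latency_s)  # lower energy-delay -> better
        key = (load_bucket, freq)
        Q[key] = Q.get(key, 0.0) + lr * (reward - Q.get(key, 0.0))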
SmallKV Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
Authors: Yi Zhao, Yajuan Peng, Cam-Tu Nguyen, Zuchao Li, Xiaoliang Wang, Hai Zhao, Xiaoming Fu
2025-08-03
http://arxiv.org/abs/2508.02751v1
KV cache eviction has emerged as an effective solution to alleviate resource
constraints faced by LLMs in long-context scenarios. However, existing
token-level eviction methods often overlook two critical aspects: (1) their
irreversible eviction strategy fails to adapt to dynamic attention patterns
during decoding (the saliency shift problem), and (2) they treat both
marginally important tokens and truly unimportant tokens equally, despite the
collective significance of marginal tokens to model performance (the marginal
information over-compression problem). To address these issues, we design two
compensation mechanisms based on the high similarity of attention matrices
between LLMs of different scales. We propose SmallKV, a small model assisted
compensation method for KV cache compression. SmallKV can maintain attention
matching between different-scale LLMs to: 1) assist the larger model in
perceiving globally important information of attention; and 2) use the smaller
model's attention scores to approximate those of marginal tokens in the larger
model. Extensive experiments on benchmarks including GSM8K, BBH, MT-Bench, and
LongBench demonstrate the effectiveness of SmallKV. Moreover, efficiency
evaluations show that SmallKV achieves 1.75-2.56x higher throughput than
baseline methods, highlighting its potential for efficient and performant
inference in resource-constrained environments.
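The compensation idea can be pictured with two aligned attention distributions; in this hypothetical sketch (random scores, invented budgets), the small model's attention is used to rescue "marginal" tokens the large model would otherwise evict outright.

    # Conceptual SmallKV-style selection: keep the large model's top tokens
    # exactly, and pick marginal tokens using the small model's scores.
    import torch

    T = 512
    attn_large = torch.softmax(torch.randn(T), dim=0)   # large-model scores
    attn_small = torch.softmax(torch.randn(T), dim=0)   # aligned small model

    k_keep, k_marginal = 128, 64
    important = attn_large.topk(k_keep).indices          # kept exactly
    kept = set(important.tolist())
    rest = torch.tensor([i for i in range(T) if i not in kept])
    # Approximate marginal tokens with the small model's scores instead of
    # dropping them (addressing marginal information over-compression).
    marginal = rest[attn_small[rest].topk(k_marginal).indices]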
EAC-MoE Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
Authors: Yuanteng Chen, Yuantian Shao, Peisong Wang, Jian Cheng
2025-08-03
http://arxiv.org/abs/2508.01625v1
Mixture-of-Experts (MoE) has demonstrated promising potential in scaling
LLMs. However, it is hindered by two critical challenges: (1) substantial GPU
memory consumption to load all experts; (2) low activated parameters cannot be
equivalently translated into inference acceleration effects. In this work, we
propose EAC-MoE, an Expert-Selection Aware Compressor for MoE-LLMs, which
deeply aligns with the characteristics of MoE from the perspectives of
quantization and pruning, and introduces two modules to address these two
challenges respectively: (1) The expert selection bias caused by low-bit
quantization is a major factor contributing to the performance degradation in
MoE-LLMs. Based on this, we propose Quantization with Expert-Selection
Calibration (QESC), which mitigates the expert selection bias by calibrating
the routers within the MoE; (2) There are always certain experts that are not
crucial for the corresponding tasks, yet causing inference latency. Therefore,
we propose Pruning based on Expert-Selection Frequency (PESF), which
significantly improves inference speed by pruning less frequently used experts
for the current task. Extensive experiments demonstrate that our approach
significantly reduces memory usage and improves inference speed with minimal
performance degradation.
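The PESF side of the method is straightforward to caricature: count router selections on a task's calibration prompts and skip rarely chosen experts. The threshold and sizes below are illustrative only, not the paper's settings.

    # Toy expert-selection frequency pruning: profile top-k routing counts,
    # then keep only experts whose selection frequency clears a threshold.
    import torch

    n_experts, n_tokens, k = 16, 10_000, 2
    router_logits = torch.randn(n_tokens, n_experts)
    picks = router_logits.topk(k, dim=-1).indices.flatten()
    freq = torch.bincount(picks, minlength=n_experts).float() / n_tokens

    active = (freq > 0.05).nonzero().squeeze(-1)   # experts kept for this task
    print(f"keeping {active.numel()}/{n_experts} experts")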
RepoForge Training a SOTA Fast-thinking SWE Agent with an End-to-End Data Curation Pipeline Synergizing SFT and RL at Scale
Authors: Zhilong Chen, Chengzong Zhao, Boyuan Chen, Dayi Lin, Yihao Chen, Arthur Leung, Gopi Krishnan Rajbahadur, Gustavo A. Oliva, Ahmed E. Hassan
2025-08-03
http://arxiv.org/abs/2508.01550v1
Training software engineering (SWE) LLMs is bottlenecked by expensive
infrastructure, inefficient evaluation pipelines, scarce training data, and
costly quality control. We present RepoForge, an autonomous, end-to-end
pipeline that generates, evaluates, and trains SWE agents at scale. Our key
contributions include: (1) RepoForge-8B-Agent, achieving 17.4% on
SWE-Bench-Verified, establishing a new state of the art for 8B non-thinking
LLMs; (2) 7,304 executable environments auto-generated from real GitHub
commits with zero manual intervention; (3) a 14x storage reduction (from 1.4GB
to 102MB per instance) via intelligent dependency management and image
pruning; (4) 70% faster evaluation using a Ray-powered distributed RepoForge
harness; (5) 19,000x cheaper labeling through our automated SPICE-based
difficulty assessment technique. By unifying
storage-efficient sandboxing, Ray-powered evaluation harness, automated data
generation, SPICE-based labeling, and bubble-free RL scaffold, we demonstrate
that even 8B models can reach new state-of-the-art performance on
demanding benchmarks like SWE-Bench-Verified. Our approach addresses critical
bottlenecks in SWE agent training: high storage costs of container-based
evaluation, inefficient sequential reward pipelines, limited availability of
high-quality training data, expensive manual labeling, and multi-turn RL
pipeline bottlenecks.
BlockA2A Towards Secure and Verifiable Agent-to-Agent Interoperability
Authors: Zhenhua Zou, Zhuotao Liu, Lepeng Zhao, Qiuyang Zhan
2025-08-02
http://arxiv.org/abs/2508.01332v2
The rapid adoption of agentic AI, powered by large language models (LLMs), is
transforming enterprise ecosystems with autonomous agents that execute complex
workflows. Yet we observe several key security vulnerabilities in LLM-driven
multi-agent systems (MASes): fragmented identity frameworks, insecure
communication channels, and inadequate defenses against Byzantine agents or
adversarial prompts. In this paper, we present the first systematic analysis of
these emerging multi-agent risks and explain why the legacy security strategies
cannot effectively address these risks. Afterwards, we propose BlockA2A, the
first unified multi-agent trust framework that enables secure and verifiable
communication and agent-to-agent interoperability. At a high level, BlockA2A
adopts
decentralized identifiers (DIDs) to enable fine-grained cross-domain agent
authentication, blockchain-anchored ledgers to enable immutable auditability,
and smart contracts to dynamically enforce context-aware access control
policies. BlockA2A eliminates centralized trust bottlenecks, ensures message
authenticity and execution integrity, and guarantees accountability across
agent interactions. Furthermore, we propose a Defense Orchestration Engine
(DOE) that actively neutralizes attacks through real-time mechanisms, including
Byzantine agent flagging, reactive execution halting, and instant permission
revocation. Empirical evaluations demonstrate BlockA2A's effectiveness in
neutralizing prompt-based, communication-based, behavioral, and systemic MAS
attacks. We formalize its integration into existing MAS and showcase a
practical implementation for Google's A2A protocol. Experiments confirm that
BlockA2A and DOE operate with sub-second overhead, enabling scalable deployment
in production LLM-based MAS environments.
Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models
Authors: Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat
2025-08-02
http://arxiv.org/abs/2508.01261v1
We present MoE-MLA-RoPE, a novel architecture that combines
Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary
Position Embeddings (RoPE) for efficient language modeling. Our approach
addresses the fundamental trade-off between model capacity and computational
efficiency through three key innovations: (1) fine-grained expert routing with
64 micro-experts and top-k selection, enabling flexible specialization
through 3.6 * 10^7 possible expert combinations; (2) shared expert isolation
that dedicates 2 always active experts for common patterns while routing to 6
of 62 specialized experts; and (3) gradient-conflict-free load balancing that
maintains expert utilization without interfering with primary loss
optimization.
Extensive experiments on models ranging from 17M to 202M parameters
demonstrate that MoE-MLA-RoPE with compression ratio r=d/2 achieves 68% KV
cache memory reduction and 3.2x inference speedup while maintaining competitive
perplexity (0.8% degradation). Compared to a baseline with 53.9M parameters,
MoE-MLA-RoPE improves validation loss by 6.9% over vanilla transformers while
using 42% fewer active parameters per forward pass. FLOP-matched experiments
reveal even larger gains: 11.1% improvement with 3.2x inference acceleration.
Automated evaluation using GPT-4 as a judge confirms
quality improvements in generation, with higher scores on coherence (8.1/10),
creativity (7.9/10) and grammatical correctness (8.2/10). Our results establish
that architectural novelty, not parameter scaling, defines the efficiency
frontier for resource-constrained language model deployment.
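The routing arithmetic (2 shared experts always on, top-k routing over 62 specialists) is compact enough to sketch; dimensions and names here are assumptions for illustration, not the paper's implementation.

    # Illustrative fine-grained router: softmax-weighted top-k over 62
    # specialist micro-experts; 2 shared experts would run unconditionally.
    import torch

    def route(x, gate, k=6):
        logits = gate(x)                    # (batch, 62) specialist scores
        top = logits.topk(k, dim=-1)
        weights = torch.softmax(top.values, dim=-1)
        return top.indices, weights

    gate = torch.nn.Linear(256, 62)
    idx, w = route(torch.randn(4, 256), gate)
    print(idx.shape, w.shape)               # (4, 6) routing per token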
Asking the Right Questions Benchmarking Large Language Models in the Development of Clinical Consultation Templates
Authors: Liam G. McCoy, Fateme Nateghi Haredasht, Kanav Chopra, David Wu, David JH Wu, Abass Conteh, Sarita Khemani, Saloni Kumar Maharaj, Vishnu Ravi, Arth Pahwa, Yingjie Weng, Leah Rosengaus, Lena Giang, Kelvin Zhenghao Li, Olivia Jee, Daniel Shirvani, Ethan Goh, Jonathan H. Chen
2025-08-02
http://arxiv.org/abs/2508.01159v1
This study evaluates the capacity of large language models (LLMs) to generate
structured clinical consultation templates for electronic consultation. Using
145 expert-crafted templates developed and routinely used by Stanford's
eConsult team, we assess frontier models -- including o3, GPT-4o, Kimi K2,
Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro -- for their ability to
produce clinically coherent, concise, and prioritized clinical question
schemas. Through a multi-agent pipeline combining prompt optimization, semantic
autograding, and prioritization analysis, we show that while models like o3
achieve high comprehensiveness (up to 92.2%), they consistently generate
excessively long templates and fail to correctly prioritize the most clinically
important questions under length constraints. Performance varies across
specialties, with significant degradation in narrative-driven fields such as
psychiatry and pain medicine. Our findings demonstrate that LLMs can enhance
structured clinical information exchange between physicians, while highlighting
the need for more robust evaluation methods that capture a model's ability to
prioritize clinically salient information within the time constraints of
real-world physician communication.
Towards Bridging Review Sparsity in Recommendation with Textual Edge Graph Representation
Authors: Leyao Wang, Xutao Mao, Xuhui Zhan, Yuying Zhao, Bo Ni, Ryan A. Rossi, Nesreen K. Ahmed, Tyler Derr
2025-08-02
http://arxiv.org/abs/2508.01128v1
Textual reviews enrich recommender systems with fine-grained preference
signals and enhanced explainability. However, in real-world scenarios, users
rarely leave reviews, resulting in severe sparsity that undermines the
effectiveness of existing models. A natural solution is to impute or generate
missing reviews to enrich the data. However, conventional imputation techniques
-- such as matrix completion and LLM-based augmentation -- either lose
contextualized semantics by embedding texts into vectors, or overlook
structural dependencies among user-item interactions. To address these
shortcomings, we propose TWISTER (ToWards Imputation on Sparsity with Textual
Edge Graph Representation), a unified framework that imputes missing reviews by
jointly modeling semantic and structural signals. Specifically, we represent
user-item interactions as a Textual-Edge Graph (TEG), treating reviews as edge
attributes. To capture relational context, we construct line-graph views and
employ a large language model as a graph-aware aggregator. For each interaction
lacking a textual review, our model aggregates the neighborhood's
natural-language representations to generate a coherent and personalized
review. Experiments on the Amazon and Goodreads datasets show that TWISTER
consistently outperforms traditional numeric, graph-based, and LLM baselines,
delivering higher-quality imputed reviews and, more importantly, enhanced
recommendation performance. In summary, TWISTER generates reviews that are more
helpful, authentic, and specific, while smoothing structural signals for
improved recommendations.
REACT A Real-Time Edge-AI Based V2X Framework for Accident Avoidance in Autonomous Driving System
Authors: Fengze Yang, Bo Yu, Yang Zhou, Xuewen Luo, Zhengzhong Tu, Chenxi Liu
2025-08-01
http://arxiv.org/abs/2508.01057v1
Collisions caused by human error are the most common type of multi-vehicle
crash, highlighting the critical need for autonomous driving (AD) systems to
leverage cooperative perception through Vehicle-to-Everything (V2X)
communication. This capability extends situational awareness beyond the
limitations of onboard sensors. However, current LLM-based V2X
frameworks suffer from limited generalization, shallow contextual reasoning,
and reliance on mono-modal inputs. Vision-Language Models (VLMs) offer enhanced
reasoning and multimodal integration but typically fall short of real-time
performance requirements in safety-critical applications. This paper presents
REACT, a real-time, V2X-integrated trajectory optimization framework built upon
a fine-tuned lightweight VLM. REACT integrates a set of specialized modules
that process multimodal inputs into optimized, risk-aware trajectories. To
ensure real-time performance on edge devices, REACT incorporates edge
adaptation strategies that reduce model complexity and accelerate inference.
Evaluated on the DeepAccident benchmark, REACT achieves state-of-the-art
performance, a 77% collision rate reduction, a 48.2% Video Panoptic Quality
(VPQ), and a 0.57-second inference latency on the Jetson AGX Orin. Ablation
studies validate the contribution of each input, module, and edge adaptation
strategy. These results demonstrate the feasibility of lightweight VLMs for
real-time edge-based cooperative planning and showcase the potential of
language-guided contextual reasoning to improve safety and responsiveness in
autonomous driving.
Session-Based Recommendation with Validated and Enriched LLM Intents
Authors: Gyuseok Lee, Yaokun Liu, Yifan Liu, Susik Yoon, Dong Wang, SeongKu Kang
2025-08-01
http://arxiv.org/abs/2508.00570v1
Session-based recommendation (SBR) aims to predict the next item for an
anonymous user in a timely manner. However, SBR suffers from data sparsity due
to the short and anonymous nature of sessions. Recently, an emerging line of
work has explored inferring the underlying user intents of a session using
large language models (LLMs), with the generated intents serving as auxiliary
training signals to enhance SBR models. Despite its promise, this approach
faces three key challenges: validating intent quality, incorporating
session-level multi-intents, and complementing inevitable LLM failure cases. In
this paper, we propose VELI4SBR, a two-stage framework that leverages Validated
and Enriched LLM-generated Intents for SBR. In the first stage, we generate
high-quality intents using a predict-and-correct loop that validates the
informativeness of LLM-generated intents with a global intent pool to constrain
the LLM's output space and reduce hallucination. In the second stage, we
enhance the SBR model using the generated intents through a lightweight
multi-intent prediction and fusion mechanism. Furthermore, we introduce a
training strategy that compensates for LLM failures by inferring intents from
inter-session behavioral similarities. Extensive experiments show that VELI4SBR
outperforms state-of-the-art baselines while improving explainability.
ReaGAN Node-as-Agent-Reasoning Graph Agentic Network
Authors: Minghao Guo, Xi Zhu, Jingyuan Huang, Kai Mei, Yongfeng Zhang
2025-08-01
http://arxiv.org/abs/2508.00429v2
Graph Neural Networks (GNNs) have achieved remarkable success in graph-based
learning by propagating information among neighbor nodes via predefined
aggregation mechanisms. However, such fixed schemes often suffer from two key
limitations. First, they cannot handle the imbalance in node informativeness --
some nodes are rich in information, while others remain sparse. Second,
predefined message passing primarily leverages local structural similarity
while ignoring global semantic relationships across the graph, limiting the
model's ability to capture distant but relevant information. We propose
Retrieval-augmented Graph Agentic Network (ReaGAN), an agent-based framework
that empowers each node with autonomous, node-level decision-making. Each node
acts as an agent that independently plans its next action based on its internal
memory, enabling node-level planning and adaptive message propagation.
Additionally, retrieval-augmented generation (RAG) allows nodes to access
semantically relevant content and build global relationships in the graph.
ReaGAN achieves competitive performance under few-shot in-context settings
using a frozen LLM backbone without fine-tuning, showcasing the potential of
agentic planning and local-global retrieval in graph learning.
EdgeInfinite-Instruct Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices
Authors: Jiyu Chen, Poh Seng Lim, Shuang Peng, Daxiong Luo, JungHau Foo, Yap Deep, Timothy Lee Jun Jie, Kelvin Teh Kae Wen, Fan Yang, Danyu Feng, Hao-Yun Chen, Peng-Wen Chen, Fangyuan Li, Xiaoxin Chen, Wong Wai Mun
2025-08-01
http://arxiv.org/abs/2508.00370v2
Deploying Transformer-based large language models (LLMs) on
resource-constrained edge devices for long-sequence tasks remains challenging
due to the quadratic time complexity of self-attention and growing Key-Value
(KV) cache demands. While existing KV cache optimizations improve memory
efficiency, they often fail to reduce time to first token (TTFT) and may
degrade performance through token pruning. Alternative sequence modeling
architectures address some of these limitations, but typically require full
retraining and lack infrastructure support. EdgeInfinite offers an efficient
solution by fine-tuning only a small subset of parameters, maintaining quality
while reducing both computational and memory costs, including improved TTFT.
However, its instruction-following ability is limited, and it lacks
mobile-specific optimizations. To address these issues, we propose
EdgeInfinite-Instruct, which introduces a Segmented Supervised Fine-Tuning
(S-SFT) strategy tailored to long-sequence tasks such as summarization and
question answering. We further optimized EdgeInfinite-Instruct for efficient
deployment on edge NPUs by employing fine-grained post-training quantization
(PTQ) to reduce computational demands while maintaining accuracy, and by
implementing a fixed-shape computation graph that balances memory usage and
on-device efficiency through scenario-specific customization of input token and
KV cache sizes. Experiments on long-context benchmarks and real-world mobile
tasks
show that our approach improves domain-specific performance while maintaining
efficiency on NPU-accelerated edge devices.
Systematic Evaluation of Optimization Techniques for Long-Context Language Models
Authors: Ammar Ahmed, Sheng Di, Franck Cappello, Zirui Liu, Jingoo Han, Ali Anwar
2025-08-01
http://arxiv.org/abs/2508.00305v1
Large language models (LLMs) excel across diverse natural language processing
tasks but face resource demands and limited context windows. Although
techniques like pruning, quantization, and token dropping can mitigate these
issues, their efficacy in long-context scenarios and system evaluation remains
issues, their efficacy in long-context scenarios and system evaluation remains
underexplored. This paper systematically benchmarks these optimizations,
characterizing memory usage, latency, and throughput, and studies how these
methods impact the quality of text generation. We first analyze individual
optimization methods for two LLM architectures supporting long context and then
systematically evaluate combinations of these techniques to assess how this
deeper analysis impacts performance metrics. We subsequently study the
scalability of individual optimization methods on a larger 70-billion-parameter
model. Our insights reveal that naively combining inference optimization
algorithms can adversely affect larger models due to compounded approximation
errors, compared to their smaller counterparts.
Experiments show that relying solely on F1 obscures these effects by hiding
precision-recall trade-offs in question answering tasks. By integrating
system-level profiling with task-specific insights, this study helps
practitioners and researchers explore and balance efficiency, accuracy, and
scalability across tasks and hardware configurations.
Large AI Model-Enabled Secure Communications in Low-Altitude Wireless Networks Concepts, Perspectives and Case Study
Authors: Chuang Zhang, Geng Sun, Jiacheng Wang, Yijing Lin, Weijie Yuan, Sinem Coleri, Dusit Niyato, Tony Q. S. Quek
2025-08-01
http://arxiv.org/abs/2508.00256v1
Low-altitude wireless networks (LAWNs) have the potential to revolutionize
communications by supporting a range of applications, including urban parcel
delivery, aerial inspections and air taxis. However, compared with traditional
wireless networks, LAWNs face unique security challenges due to low-altitude
operations, frequent mobility and reliance on unlicensed spectrum, making it
more vulnerable to some malicious attacks. In this paper, we investigate some
large artificial intelligence model (LAM)-enabled solutions for secure
communications in LAWNs. Specifically, we first explore the amplified security
risks and important limitations of traditional AI methods in LAWNs. Then, we
introduce the basic concepts of LAMs and delve into the role of LAMs in
addressing these challenges. To demonstrate the practical benefits of LAMs for
secure communications in LAWNs, we propose a novel LAM-based optimization
framework that leverages large language models (LLMs) to generate enhanced
state features on top of handcrafted representations, and to design intrinsic
rewards accordingly, thereby improving reinforcement learning performance for
secure communication tasks. Through a typical case study, simulation results
validate the effectiveness of the proposed framework. Finally, we outline
future directions for integrating LAMs into secure LAWN applications.