2025-10-09

Multi-Segment Photonic Power Converters for Energy Harvesting and High-Speed Optical Wireless Communication
VecInfer Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
lm-Meter Unveiling Runtime Inference Latency for On-Device Language Models
Downsized and Compromised? Assessing the Faithfulness of Model Compression
Influence Functions for Efficient Data Selection in Reasoning
Sample Smart, Not Hard Correctness-First Decoding for Better Reasoning in LLMs
Diffusion-Based Image Editing for Breaking Robust Watermarks
Training-Free Time Series Classification via In-Context Reasoning with LLM Agents
D$^3$QE Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection
BioAutoML-NAS An End-to-End AutoML Framework for Multimodal Insect Classification via Neural Architecture Search on Large-Scale Biodiversity Data
Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input
Flow4Agent Long-form Video Understanding via Motion Prior from Optical Flow
Rasterized Steered Mixture of Experts for Efficient 2D Image Regression
OneVision An End-to-End Generative Framework for Multi-view E-commerce Vision Search
Communication Enables Cooperation in LLM Agents A Comparison with Curriculum-Based Approaches
Federated Split Learning for Resource-Constrained Robots in Industrial IoT Framework Comparison, Optimization Strategies, and Future Directions
Uncovering Representation Bias for Investment Decisions in Open-Source Large Language Models
DecEx-RAG Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision
Teaching Machines to Speak Using Articulatory Control
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
Deciphering Invariant Feature Decoupling in Source-free Time Series Forecasting with Proxy Denoising
H1B-KV Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference
ARMOR High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization
CAM A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension
LANTERN Scalable Distillation of Large Language Models for Job-Person Fit and Explanation
AMAQ Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning
Model-based Deep Learning for Joint RIS Phase Shift Compression and WMMSE Beamforming
Draft, Verify, and Improve Toward Training-Aware Speculative Decoding
Scalable In-context Ranking with Generative Models
KVLinC KV Cache Quantization with Hadamard Rotation and Linear Correction
WeatherArchive-Bench Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives
DP-Adam-AC Privacy-preserving Fine-Tuning of Localizable Language Models Using Adam Optimization with Adaptive Clipping
Stratum System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving
Boomerang Distillation Enables Zero-Shot Model Size Interpolation
SSDD Single-Step Diffusion Decoder for Efficient Image Tokenization
Bidirectional Mammogram View Translation with Column-Aware and Implicit 3D Conditional Diffusion
ParallelBench Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs
Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models
Multilingual Routing in Mixture-of-Experts
The R(1)W(1) Communication Model for Self-Stabilizing Distributed Algorithms
A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification
Compressed Concatenation of Small Embedding Models
FedSRD Sparsify-Reconstruct-Decompose for Communication-Efficient Federated Large Language Models Fine-Tuning
Language Model Based Text-to-Audio Generation Anti-Causally Aligned Collaborative Residual Transformers
LaDiR Latent Diffusion Enhances LLMs for Text Reasoning
COSMIR Chain Orchestrated Structured Memory for Iterative Reasoning over Long Context
Multi-Agent Collaborative Intelligence Dual-Dial Control for Reliable LLM Reasoning
Compressed Convolutional Attention Efficient Attention in a Compressed Latent Space
REAR Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization
Speculative Actions A Lossless Framework for Faster Agentic Systems
SliceMoE Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling
Doctor-R1 Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
Don't Pass $\mathtt{@}k$ A Bayesian Framework for Large Language Model Evaluation
Scaling Sequence-to-Sequence Generative Neural Rendering
Let Features Decide Their Own Solvers Hybrid Feature Caching for Diffusion Transformers
PatternKV Flattening KV Representation Expands Quantization Headroom
Emergent Coordination in Multi-Agent Language Models
Beyond Next-Token Prediction A Performance Characterization of Diffusion versus Autoregressive Language Models
MoME Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
Can Linear Probes Measure LLM Uncertainty?
Enhancing Fake News Video Detection via LLM-Driven Creative Process Simulation
Fit Pixels, Get Labels Meta-learned Implicit Networks for Image Segmentation
Simulating and Understanding Deceptive Behaviors in Long-Horizon Interactions
Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs
SPEAR Soft Prompt Enhanced Anomaly Recognition for Time Series Data
Sliding Window Attention for Learned Video Compression
Multi-Agent Code-Orchestrated Generation for Reliable Infrastructure-as-Code
NoTVLA Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation
DHQA-4D Perceptual Quality Assessment of Dynamic 4D Digital Human
Algorithm Generation via Creative Ideation
Small Language Models for Agentic Systems A Survey of Architectures, Capabilities, and Deployment Trade offs
TROLL Trust Regions improve Reinforcement Learning for Large Language Models
MambaCAFU Hybrid Multi-Scale and Multi-Attention Model with Mamba-Based Fusion for Medical Image Segmentation
You Have Been LaTeXpOsEd A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models
EvoEngineer Mastering Automated CUDA Kernel Code Evolution with Large Language Models
Token Hidden Reward Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders
Decoupling Task-Solving and Output Formatting in LLM Generation
FieldFormer Physics-Informed Transformers for Spatio-Temporal Field Reconstruction from Sparse Sensors
Reactive Transformer (RxT) -- Stateful Real-Time Processing for Event-Driven Reactive Language Models
From Scope to Script An Automated Report Generation Model for Gastrointestinal Endoscopy
Harnessing the XMM-Newton data X-ray spectral modelling of 4XMM-DR11 detections and 4XMM-DR11s sources
Cache-to-Cache Direct Semantic Communication Between Large Language Models
Coevolutionary Continuous Discrete Diffusion Make Your Diffusion Language Model a Latent Reasoner
FocusAgent Simple Yet Effective Ways of Trimming the Large Context of Web Agents
OpenZL A Graph-Based Model for Compression
Improving Cooperation in Collaborative Embodied AI
CHORD Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration
Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders
TridentServe A Stage-level Serving System for Diffusion Pipelines
FlexiQ Adaptive Mixed-Precision Quantization for Latency/Accuracy Trade-Offs in Deep Neural Networks
Distributed Low-Communication Training with Decoupled Momentum Optimization
Prototyping Digital Social Spaces through Metaphor-Driven Design Translating Spatial Concepts into an Interactive Social Simulation
TokenFlow Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling
From Tokens to Nodes Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting
MALF A Multi-Agent LLM Framework for Intelligent Fuzzing of Industrial Control Protocols
To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration
HALO Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference
Mind the Gap Linguistic Divergence and Adaptation Strategies in Human-LLM Assistant vs. Human-Human Interactions
HyperAdaLoRA Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing Performance
ElasticMoE An Efficient Auto Scaling Method for Mixture-of-Experts Models
SAGE Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection
KaVa Latent Reasoning via Compressed KV-Cache Distillation
VideoNSA Native Sparse Attention Scales Video Understanding
Self-Forcing++ Towards Minute-Scale High-Quality Video Generation
From Frames to Clips Efficient Key Clip Selection for Long-Form Video Understanding
Contrastive Retrieval Heads Improve Attention-Based Re-Ranking
MMDEW Multipurpose Multiclass Density Estimation in the Wild
UpSafe$^\circ$C Upcycling for Controllable Safety in Large Language Models
KVComm Enabling Efficient LLM Communication through Selective KV Sharing
SoundReactor Frame-level Online Video-to-Audio Generation
Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning
LLM-Based Multi-Task Bangla Hate Speech Detection Type, Severity, and Target
Patch-as-Decodable-Token Towards Unified Multi-Modal Vision Tasks in MLLMs
MelCap A Unified Single-Codebook Neural Codec for High-Fidelity Audio Compression
HRTFformer A Spatially-Aware Transformer for Personalized HRTF Upsampling in Immersive Audio Rendering
Accelerating Attention with Basis Decomposition
TalkPlay-Tools Conversational Music Recommendation with LLM Tool Calling
ENLighten Lighten the Transformer, Enable Efficient Optical Acceleration
Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value
Asymmetric Proximal Policy Optimization mini-critics boost LLM reasoning
The Unseen Frontier Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM
Support Basis Fast Attention Beyond Bounded Entries

Multi-Segment Photonic Power Converters for Energy Harvesting and High-Speed Optical Wireless Communication

Authors: Othman Younus, Behnaz Majlesein, Richard Nacke, Isaac N. O. Osahon, Carmine Pellegrino, Sina Babadi, Iman Tavakkolnia, Henning Helmers, Harald Haas

2025-10-07

http://arxiv.org/abs/2510.06205v1

The demand for energy-efficient, high-speed wireless communication, coupled with the rapid rise of IoT devices, requires systems that integrate power harvesting with optical data reception to eliminate the need for charging or battery replacements. Recent advances have explored the use of solar cells as optical receivers for high-speed data detection alongside power harvesting. Gallium arsenide (GaAs)-based photonic power converters (PPCs) provide six times greater electron mobility than silicon- or cadmium telluride-based cells, enabling faster data detection and improved power efficiency. However, their bandwidth is constrained by junction capacitance, which increases with active area, creating a trade-off between power output and data rate. To address this, we propose and test multi-segment GaAs-based PPCs that serve as both energy harvesters and data detectors. By segmenting the active area into 2, 4, or 6 subcells, forming circular areas with diameters of 1, 1.5, or 2.08 mm, we reduce capacitance and boost bandwidth while preserving light collection. Fabricated on a semi-insulating GaAs substrate with etched trenches for electrical isolation, the series-connected subcells optimize absorption and minimize parasitic effects. The PPCs were used for an eye-safe 1.5 m optical wireless link, employing orthogonal frequency-division multiplexing (OFDM) with adaptive bit and power loading. The system achieved a world record data rate of 3.8 Gbps, which is four times higher than prior works. The system converts 39.7% of optical power from a beam of 2.3 mW, although the segmentation increases alignment sensitivity. These findings provide new solutions for off-grid backhaul for future networks, such as 6th generation (6G) cellular.

VecInfer Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization

Authors: Dingyu Yao, Chenxu Yang, Zhengyang Tong, Zheng Lin, Wei Liu, Jian Luan, Weiping Wang

2025-10-07

http://arxiv.org/abs/2510.06175v1

The Key-Value (KV) cache introduces substantial memory overhead during large language model (LLM) inference. Although existing vector quantization (VQ) methods reduce KV cache usage and provide flexible representational capacity across bit-widths, they suffer severe performance degradation at ultra-low bit-widths due to key cache outliers that hinder effective codebook utilization. To address this challenge, we propose VecInfer, a novel VQ method for aggressive KV cache compression while enabling efficient inference. By applying smooth and Hadamard transformations, VecInfer suppresses outliers in the key cache, enabling the codebook to comprehensively cover the original data distribution and thereby reducing quantization difficulty. To facilitate efficient deployment, we design an optimized CUDA kernel that fuses computation with dequantization to minimize memory access overhead. Extensive evaluations demonstrate that VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks. With only 2-bit quantization, VecInfer achieves performance comparable to full precision, while delivering up to $2.7\times$ speedup in large-batch self-attention computation and $8.3\times$ reduction in single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.
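To illustrate why an orthogonal Hadamard rotation can make a key vector easier to quantize, here is a minimal sketch in plain Python. It is not VecInfer's kernel or codebook: the helper names `hadamard` and `rotate` and the example vector are hypothetical, and the smoothing step mentioned in the abstract is omitted. The sketch only shows that the rotation spreads one outlier channel across all channels while leaving the vector norm unchanged.

```python
import math


def hadamard(n):
    """Build an n x n orthonormal Hadamard matrix (Sylvester construction).

    n must be a power of two.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # H_2m = [[H_m, H_m], [H_m, -H_m]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in row] for row in h]


def rotate(vec, h):
    """Multiply the matrix h by the vector vec."""
    return [sum(hij * v for hij, v in zip(row, vec)) for row in h]


# Hypothetical key vector with a single outlier channel (index 2).
key = [0.1, -0.2, 8.0, 0.05, 0.0, 0.15, -0.1, 0.2]
H = hadamard(len(key))
rotated = rotate(key, H)

peak_before = max(abs(x) for x in key)
peak_after = max(abs(x) for x in rotated)
norm_before = math.sqrt(sum(x * x for x in key))
norm_after = math.sqrt(sum(x * x for x in rotated))

print(f"peak magnitude: {peak_before:.2f} -> {peak_after:.2f}")
print(f"norm:           {norm_before:.4f} -> {norm_after:.4f}")
```

Because the rotation is orthogonal, applying the same matrix to the query leaves the query-key dot product unchanged, which is why such a rotation can be inserted without altering attention scores in exact arithmetic.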

lm-Meter Unveiling Runtime Inference Latency for On-Device Language Models

Authors: Haoxin Wang, Xiaolong Tu, Hongyu Ke, Huirong Chai, Dawei Chen, Kyungtae Han

2025-10-07

http://arxiv.org/abs/2510.06126v1

Large Language Models (LLMs) are increasingly integrated into everyday applications, but their prevalent cloud-based deployment raises growing concerns around data privacy and long-term sustainability. Running LLMs locally on mobile and edge devices (on-device LLMs) offers the promise of enhanced privacy, reliability, and reduced communication costs. However, realizing this vision remains challenging due to substantial memory and compute demands, as well as limited visibility into performance-efficiency trade-offs on resource-constrained hardware. We propose lm-Meter, the first lightweight, online latency profiler tailored for on-device LLM inference. lm-Meter captures fine-grained, real-time latency at both phase (e.g., embedding, prefill, decode, softmax, sampling) and kernel levels without auxiliary devices. We implement lm-Meter on commercial mobile platforms and demonstrate its high profiling accuracy with minimal system overhead, e.g., only 2.58% throughput reduction in prefill and 0.99% in decode under the most constrained Powersave governor. Leveraging lm-Meter, we conduct comprehensive empirical studies revealing phase- and kernel-level bottlenecks in on-device LLM inference, quantifying accuracy-efficiency trade-offs, and identifying systematic optimization opportunities. lm-Meter provides unprecedented visibility into the runtime behavior of LLMs on constrained platforms, laying the foundation for informed optimization and accelerating the democratization of on-device LLM systems. Code and tutorials are available at https://github.com/amai-gsu/LM-Meter.

Downsized and Compromised? Assessing the Faithfulness of Model Compression

Authors: Moumita Kamal, Douglas A. Talbert

2025-10-07

http://arxiv.org/abs/2510.06125v1

In real-world applications, computational constraints often require transforming large models into smaller, more efficient versions through model compression. While these techniques aim to reduce size and computational cost without sacrificing performance, their evaluations have traditionally focused on the trade-off between size and accuracy, overlooking the aspect of model faithfulness. This limited view is insufficient for high-stakes domains like healthcare, finance, and criminal justice, where compressed models must remain faithful to the behavior of their original counterparts. This paper presents a novel approach to evaluating faithfulness in compressed models, moving beyond standard metrics. We introduce and demonstrate a set of faithfulness metrics that capture how model behavior changes post-compression. Our contributions include introducing techniques to assess predictive consistency between the original and compressed models using model agreement, and applying chi-squared tests to detect statistically significant changes in predictive patterns across both the overall dataset and demographic subgroups, thereby exposing shifts that aggregate fairness metrics may obscure. We demonstrate our approaches by applying quantization and pruning to artificial neural networks (ANNs) trained on three diverse and socially meaningful datasets. Our findings show that high accuracy does not guarantee faithfulness, and our statistical tests detect subtle yet significant shifts that are missed by standard metrics, such as Accuracy and Equalized Odds. The proposed metrics provide a practical and more direct method for ensuring that efficiency gains through compression do not compromise the fairness or faithfulness essential for trustworthy AI.
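The two ingredients named in the abstract, model agreement and a chi-squared test on predictive patterns, can be sketched as follows. This is an illustrative reading, not the paper's procedure: the function names and toy predictions are hypothetical, and the test shown is a plain Pearson chi-squared test of homogeneity on predicted-label counts, which treats the two prediction sets as independent samples (a simplification, since both models predict on the same inputs).

```python
from collections import Counter


def agreement_rate(orig_preds, comp_preds):
    """Fraction of inputs on which both models predict the same label."""
    if not orig_preds or len(orig_preds) != len(comp_preds):
        raise ValueError("prediction lists must be non-empty and equal length")
    matches = sum(a == b for a, b in zip(orig_preds, comp_preds))
    return matches / len(orig_preds)


def chi_squared_statistic(orig_preds, comp_preds):
    """Pearson chi-squared statistic for a 2 x K table of predicted-label counts.

    Rows are the two models, columns are the predicted labels.
    Returns (statistic, degrees_of_freedom).
    """
    rows = [Counter(orig_preds), Counter(comp_preds)]
    labels = sorted(set(rows[0]) | set(rows[1]))
    row_totals = [sum(r.values()) for r in rows]
    grand_total = sum(row_totals)
    stat = 0.0
    for label in labels:
        col_total = rows[0][label] + rows[1][label]
        for row, row_total in zip(rows, row_totals):
            expected = row_total * col_total / grand_total
            stat += (row[label] - expected) ** 2 / expected
    return stat, len(labels) - 1


# Toy binary predictions on the same 100 inputs (hypothetical data).
original = [1] * 50 + [0] * 50
compressed = [1] * 30 + [0] * 70

rate = agreement_rate(original, compressed)
stat, dof = chi_squared_statistic(original, compressed)
CRITICAL_DF1_ALPHA_05 = 3.841  # chi-squared critical value, 1 dof, alpha = 0.05

print(f"agreement = {rate:.2f}, chi2 = {stat:.3f}, dof = {dof}")
print("significant shift" if stat > CRITICAL_DF1_ALPHA_05 else "no significant shift")
```

The same two functions can be applied per demographic subgroup by filtering both prediction lists with the same index mask before calling them.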

Influence Functions for Efficient Data Selection in Reasoning

Authors: Prateek Humane, Paolo Cudrano, Daniel Z. Kaplan, Matteo Matteucci, Supriyo Chakraborty, Irina Rish

2025-10-07

http://arxiv.org/abs/2510.06108v1

Fine-tuning large language models (LLMs) on chain-of-thought (CoT) data shows that a small amount of high-quality data can outperform massive datasets. Yet, what constitutes "quality" remains ill-defined. Existing reasoning methods rely on indirect heuristics such as problem difficulty or trace length, while instruction-tuning has explored a broader range of automated selection strategies, but rarely in the context of reasoning. We propose to define reasoning data quality using influence functions, which measure the causal effect of individual CoT examples on downstream accuracy, and introduce influence-based pruning, which consistently outperforms perplexity and embedding-based baselines on math reasoning within a model family.

Sample Smart, Not Hard Correctness-First Decoding for Better Reasoning in LLMs

Authors: Xueyan Li, Guinan Su, Mrinmaya Sachan, Jonas Geiping

2025-10-07

http://arxiv.org/abs/2510.05987v1

Large Language Models (LLMs) are increasingly applied to complex tasks that require extended reasoning. In such settings, models often benefit from diverse chains-of-thought to arrive at multiple candidate solutions. This requires balancing two competing objectives: injecting enough stochasticity to explore multiple reasoning chains, and ensuring sufficient accuracy and quality in each path. Existing works pursue the first objective by increasing exploration at highly uncertain steps with higher temperature or larger candidate token sets, while others improve reliability by rejecting samples with low confidence post-generation, implying that low confidence correlates with low answer quality. These two lines of thought are in conflict, as they conflate different sources of uncertainty. To resolve this, we argue that the decoding rule should be calibrated by correctness, not confidence alone. We should sample from tokens with higher estimated correctness, and reduce sampling where expected correctness is low. We propose simple strategies that achieve this goal: Greedy-Threshold makes sampling greedy at very low confidence steps. Calibrated-TopK and Calibrated-epsilon set truncation thresholds based on estimated rank-wise correctness. Together, our findings challenge prevailing heuristics about decoding under uncertainty and show gains across math and general reasoning benchmarks.
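The Greedy-Threshold idea described above (fall back to greedy selection at very low confidence steps, sample otherwise) might look like the sketch below. Several details are assumptions rather than the authors' definitions: confidence is taken to be the largest next-token probability, the threshold value 0.35 is arbitrary, and `greedy_threshold_sample` is a hypothetical name.

```python
import random


def greedy_threshold_sample(probs, threshold, rng):
    """Pick the next token from a {token: probability} mapping.

    Assumption: "confidence" is the maximum token probability. When it
    falls below `threshold`, return the argmax token instead of sampling.
    """
    top_token = max(probs, key=probs.get)
    if probs[top_token] < threshold:
        return top_token  # low confidence: decode greedily
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]  # otherwise sample


rng = random.Random(0)

# Flat, low-confidence distribution: the step becomes deterministic.
uncertain_step = {"a": 0.30, "b": 0.25, "c": 0.25, "d": 0.20}
low_conf_draws = {greedy_threshold_sample(uncertain_step, 0.35, rng) for _ in range(200)}

# Peaked, higher-confidence distribution: ordinary sampling is kept.
confident_step = {"a": 0.60, "b": 0.40}
high_conf_draws = {greedy_threshold_sample(confident_step, 0.35, rng) for _ in range(200)}

print(sorted(low_conf_draws), sorted(high_conf_draws))
```

Note that this inverts the common heuristic of raising temperature at uncertain steps: exploration is kept where the model is confident enough and removed where it is not.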

Diffusion-Based Image Editing for Breaking Robust Watermarks

Authors: Yunyi Ni, Finn Carter, Ze Niu, Emily Davis, Bo Zhang

2025-10-07

http://arxiv.org/abs/2510.05978v1

Robust invisible watermarking aims to embed hidden information into images such that the watermark can survive various image manipulations. However, the rise of powerful diffusion-based image generation and editing techniques poses a new threat to these watermarking schemes. In this paper, we present a theoretical study and method demonstrating that diffusion models can effectively break robust image watermarks that were designed to resist conventional perturbations. We show that a diffusion-driven "image regeneration" process can erase embedded watermarks while preserving perceptual image content. We further introduce a novel guided diffusion attack that explicitly targets the watermark signal during generation, significantly degrading watermark detectability. Theoretically, we prove that as an image undergoes sufficient diffusion-based transformation, the mutual information between the watermarked image and the embedded watermark payload vanishes, resulting in decoding failure. Experimentally, we evaluate our approach on multiple state-of-the-art watermarking schemes (including the deep learning-based methods StegaStamp, TrustMark, and VINE) and demonstrate near-zero watermark recovery rates after attack, while maintaining high visual fidelity of the regenerated images. Our findings highlight a fundamental vulnerability in current robust watermarking techniques against generative model-based attacks, underscoring the need for new watermarking strategies in the era of generative AI.

Training-Free Time Series Classification via In-Context Reasoning with LLM Agents

Authors: Songyuan Sui, Zihang Xu, Yu-Neng Chuang, Kwei-Herng Lai, Xia Hu

2025-10-07

http://arxiv.org/abs/2510.05950v1

Time series classification (TSC) spans diverse application scenarios, yet labeled data are often scarce, making task-specific training costly and inflexible. Recent reasoning-oriented large language models (LLMs) show promise in understanding temporal patterns, but purely zero-shot usage remains suboptimal. We propose FETA, a multi-agent framework for training-free TSC via exemplar-based in-context reasoning. FETA decomposes a multivariate series into channel-wise subproblems, retrieves a few structurally similar labeled examples for each channel, and leverages a reasoning LLM to compare the query against these exemplars, producing channel-level labels with self-assessed confidences; a confidence-weighted aggregator then fuses all channel decisions. This design eliminates the need for pretraining or fine-tuning, improves efficiency by pruning irrelevant channels and controlling input length, and enhances interpretability through exemplar grounding and confidence estimation. On nine challenging UEA datasets, FETA achieves strong accuracy under a fully training-free setting, surpassing multiple trained baselines. These results demonstrate that a multi-agent in-context reasoning framework can transform LLMs into competitive, plug-and-play TSC solvers without any parameter training. The code is available at https://github.com/SongyuanSui/FETATSC.
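The final fusion step, a confidence-weighted aggregator over channel-level decisions, can be sketched as below. The abstract does not specify the weighting rule, so summing self-assessed confidences per label is an assumption here, and `aggregate_channels` and the example votes are hypothetical.

```python
from collections import defaultdict


def aggregate_channels(channel_votes):
    """Fuse per-channel (label, confidence) pairs into one series-level label.

    Assumed rule: sum the confidences for each label and return the label
    with the largest total.
    """
    if not channel_votes:
        raise ValueError("at least one channel decision is required")
    scores = defaultdict(float)
    for label, confidence in channel_votes:
        scores[label] += confidence
    return max(scores, key=scores.get)


# Four channels: the label counts are tied 2-2, but confidence breaks the tie.
votes = [("walk", 0.9), ("run", 0.4), ("run", 0.4), ("walk", 0.2)]
print(aggregate_channels(votes))
```

Compared with a plain majority vote, this rule lets a single highly confident channel outweigh several hesitant ones, and it also resolves count ties without an extra rule.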

D$^3$QE Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection

Authors: Yanran Zhang, Bingyao Yu, Yu Zheng, Wenzhao Zheng, Yueqi Duan, Lei Chen, Jie Zhou, Jiwen Lu

2025-10-07

http://arxiv.org/abs/2510.05891v1

The emergence of visual autoregressive (AR) models has revolutionized image generation while presenting new challenges for synthetic image detection. Unlike previous GAN or diffusion-based methods, AR models generate images through discrete token prediction, exhibiting both marked improvements in image synthesis quality and unique characteristics in their vector-quantized representations. In this paper, we propose to leverage Discrete Distribution Discrepancy-aware Quantization Error (D$^3$QE) for autoregressive-generated image detection that exploits the distinctive patterns and the frequency distribution bias of the codebook existing in real and fake images. We introduce a discrete distribution discrepancy-aware transformer that integrates dynamic codebook frequency statistics into its attention mechanism, fusing semantic features and quantization error latent. To evaluate our method, we construct a comprehensive dataset termed ARForensics covering 7 mainstream visual AR models. Experiments demonstrate superior detection accuracy and strong generalization of D$^3$QE across different AR models, with robustness to real-world perturbations. Code is available at https://github.com/Zhangyr2022/D3QE.

BioAutoML-NAS An End-to-End AutoML Framework for Multimodal Insect Classification via Neural Architecture Search on Large-Scale Biodiversity Data

Authors: Arefin Ittesafun Abian, Debopom Sutradhar, Md Rafi Ur Rashid, Reem E. Mohamed, Md Rafiqul Islam, Asif Karim, Kheng Cher Yeo, Sami Azam

2025-10-07

http://arxiv.org/abs/2510.05888v1

Insect classification is important for agricultural management and ecological research, as it directly affects crop health and production. However, this task remains challenging due to the complex characteristics of insects, class imbalance, and large-scale datasets. To address these issues, we propose BioAutoML-NAS, the first BioAutoML model using multimodal data, including images and metadata, which applies neural architecture search (NAS) for images to automatically learn the best operations for each connection within each cell. Multiple cells are stacked to form the full network, each extracting detailed image feature representations. A multimodal fusion module combines image embeddings with metadata, allowing the model to use both visual and categorical biological information to classify insects. An alternating bi-level optimization training strategy jointly updates network weights and architecture parameters, while zero operations remove less important connections, producing sparse, efficient, and high-performing architectures. Extensive evaluation on the BIOSCAN-5M dataset demonstrates that BioAutoML-NAS achieves 96.81% accuracy, 97.46% precision, 96.81% recall, and a 97.05% F1 score, outperforming state-of-the-art transfer learning, transformer, AutoML, and NAS methods by approximately 16%, 10%, and 8% respectively. Further validation on the Insects-1M dataset obtains 93.25% accuracy, 93.71% precision, 92.74% recall, and a 93.22% F1 score. These results demonstrate that BioAutoML-NAS provides accurate, confident insect classification that supports modern sustainable farming.

Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input

Authors: Faeze Ghorbanpour, Alexander Fraser

2025-10-07

http://arxiv.org/abs/2510.05864v1

Large language models (LLMs) increasingly support applications that rely on extended context, from document processing to retrieval-augmented generation. While their long-context capabilities are well studied for reasoning and retrieval, little is known about their behavior in safety-critical scenarios. We evaluate LLMs' sensitivity to harmful content under extended context, varying type (explicit vs. implicit), position (beginning, middle, end), prevalence (0.01-0.50 of the prompt), and context length (600-6000 tokens). Across harmful content categories such as toxic, offensive, and hate speech, with LLaMA-3, Qwen-2.5, and Mistral, we observe similar patterns: performance peaks at moderate harmful prevalence (0.25) but declines when content is very sparse or dominant; recall decreases with increasing context length; harmful sentences at the beginning are generally detected more reliably; and explicit content is more consistently recognized than implicit. These findings provide the first systematic view of how LLMs prioritize and calibrate harmful content in long contexts, highlighting both their emerging strengths and the challenges that remain for safety-critical use.

Flow4Agent Long-form Video Understanding via Motion Prior from Optical Flow

Authors: Ruyang Liu, Shangkun Sun, Haoran Tang, Ge Li, Wei Gao

2025-10-07

http://arxiv.org/abs/2510.05836v1

Long-form video understanding has always been a challenging problem due to the significant redundancy in both temporal and spatial contents. This challenge is further exacerbated by the limited context length of Multimodal Large Language Models (MLLMs). To address this issue, many previous works have attempted to extract key video information, where the "key" is typically semantic-aware and heavily dependent on the CLIP model as prior. In this paper, we propose Flow4Agent, a novel framework that pioneers the use of motion priors from optical flow to facilitate LLM-based long video understanding. Flow4Agent mitigates the redundancy in long videos at both temporal and spatial levels through two core modules. Temporal Granularity Optimization (TGO) adaptively refines frame-level hierarchies: it first leverages coarse flow priors to group similar visual contents and then applies semantic priors to filter out highly irrelevant scene information. Motion Token Pruning (MTP) further refines the intra-frame visual representations, pruning high-redundancy video tokens using fine-grained optical flow information. Extensive experiments demonstrate that our Flow4Agent outperforms existing methods across a wide range of video MLLM benchmarks, especially for hour-level video understanding tasks, achieving 64.7% on Video-MME, 71.4% on MLVU and 60.4% on LongVideoBench.

Rasterized Steered Mixture of Experts for Efficient 2D Image Regression

Authors: Yi-Hsin Li, Thomas Sikora, Sebastian Knorr, Mårten Sjöström

2025-10-07

http://arxiv.org/abs/2510.05814v1

The Steered Mixture of Experts regression framework has demonstrated strong performance in image reconstruction, compression, denoising, and super-resolution. However, its high computational cost limits practical applications. This work introduces a rasterization-based optimization strategy that combines the efficiency of rasterized Gaussian kernel rendering with the edge-aware gating mechanism of the Steered Mixture of Experts. The proposed method is designed to accelerate two-dimensional image regression while maintaining the model's inherent sparsity and reconstruction quality. By replacing global iterative optimization with a rasterized formulation, the method achieves significantly faster parameter updates and more memory-efficient model representations. In addition, the proposed framework supports applications such as native super-resolution and image denoising, which are not directly achievable with standard rasterized Gaussian kernel approaches. The combination of fast rasterized optimization with the edge-aware structure of the Steered Mixture of Experts provides a new balance between computational efficiency and reconstruction fidelity for two-dimensional image processing tasks.

OneVision An End-to-End Generative Framework for Multi-view E-commerce Vision Search

Authors: Zexin Zheng, Huangyu Dai, Lingtao Mao, Xinyu Sun, Zihan Liang, Ben Chen, Yuqing Ding, Chenyi Lei, Wenwu Ou, Han Li, Kun Gai

2025-10-07

http://arxiv.org/abs/2510.05759v2

Traditional vision search, similar to search and recommendation systems, follows the multi-stage cascading architecture (MCA) paradigm to balance efficiency and conversion. Specifically, the query image undergoes feature extraction, recall, pre-ranking, and ranking stages, ultimately presenting the user with semantically similar products that meet their preferences. However, the multi-view representation discrepancy of the same object in the query and the conflicting optimization objectives across these stages make it difficult to achieve Pareto optimality in both user experience and conversion. In this paper, an end-to-end generative framework, OneVision, is proposed to address these problems. OneVision builds on VRQ, a vision-aligned residual quantization encoding, which can align the vastly different representations of an object across multiple viewpoints while preserving the distinctive features of each product as much as possible. Then a multi-stage semantic alignment scheme is adopted to maintain strong visual similarity priors while effectively incorporating user-specific information for personalized preference generation. In offline evaluations, OneVision performs on par with online MCA, while improving inference efficiency by 21% through dynamic pruning. In A/B tests, it achieves significant online improvements: +2.15% item CTR, +2.27% CVR, and +3.12% order volume. These results demonstrate that a semantic-ID-centric, generative architecture can unify retrieval and personalization while simplifying the serving pathway.

Communication Enables Cooperation in LLM Agents A Comparison with Curriculum-Based Approaches

Authors: Hachem Madmoun, Salem Lahlou

2025-10-07

http://arxiv.org/abs/2510.05748v1

Eliciting cooperation in multi-agent systems is critical for AI alignment. We investigate two approaches: direct communication and curriculum learning. In a 4-player Stag Hunt, a one-word "cheap talk" channel increases cooperation from 0% to 48.3%, demonstrating communication as a robust coordination mechanism. In contrast, we find that curriculum learning is highly sensitive to design choices: our pedagogical curriculum through progressively complex games reduced agent payoffs by 27.4% in an Iterated Public Goods Game with Punishment. Qualitative analysis reveals that curricula emphasizing defection-equilibrium games can induce "learned pessimism" in agents. These findings suggest that for coordination problems, simple protocols may be more reliable than experience-based training, and that curriculum design for social dilemmas requires careful attention to the strategic lessons embedded in game sequences.

Federated Split Learning for Resource-Constrained Robots in Industrial IoT Framework Comparison, Optimization Strategies, and Future Directions

Authors: Wanli Ni, Hui Tian, Shuai Wang, Chengyang Li, Lei Sun, Zhaohui Yang

2025-10-07

http://arxiv.org/abs/2510.05713v1

Federated split learning (FedSL) has emerged as a promising paradigm for enabling collaborative intelligence in industrial Internet of Things (IoT) systems, particularly in smart factories where data privacy, efficiency, and device heterogeneity are critical concerns. In this article, we present a comprehensive study of FedSL frameworks tailored for resource-constrained robots in industrial scenarios. We compare synchronous, asynchronous, hierarchical, and heterogeneous FedSL frameworks in terms of workflow, scalability, adaptability, and limitations under dynamic industrial conditions. Furthermore, we systematically categorize token fusion strategies into three paradigms: input-level (pre-fusion), intermediate-level (intra-fusion), and output-level (post-fusion), and summarize their respective strengths in industrial applications. We also provide adaptive optimization techniques to enhance the efficiency and feasibility of FedSL implementation, including model compression, split layer selection, computing frequency allocation, and wireless resource management. Simulation results validate the performance of these frameworks under industrial detection scenarios. Finally, we outline open issues and research directions of FedSL in future smart manufacturing systems.

Uncovering Representation Bias for Investment Decisions in Open-Source Large Language Models

Authors: Fabrizio Dimino, Krati Saxena, Bhaskarjit Sarmah, Stefano Pasquali

2025-10-07

http://arxiv.org/abs/2510.05702v1

Large Language Models are increasingly adopted in financial applications to support investment workflows. However, prior studies have seldom examined how these models reflect biases related to firm size, sector, or financial characteristics, which can significantly impact decision-making. This paper addresses this gap by focusing on representation bias in open-source Qwen models. We propose a balanced round-robin prompting method over approximately 150 U.S. equities, applying constrained decoding and token-logit aggregation to derive firm-level confidence scores across financial contexts. Using statistical tests and variance analysis, we find that firm size and valuation consistently increase model confidence, while risk factors tend to decrease it. Confidence varies significantly across sectors, with the Technology sector showing the greatest variability. When models are prompted for specific financial categories, their confidence rankings best align with fundamental data, moderately with technical signals, and least with growth indicators. These results highlight representation bias in Qwen models and motivate sector-aware calibration and category-conditioned evaluation protocols for safe and fair financial deployment.

DecEx-RAG Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision

Authors: Yongqi Leng, Yikun Lei, Xikai Liu, Meizhi Zhong, Bojian Xiong, Yurong Zhang, Yan Gao, Yi Wu, Yao Hu, Deyi Xiong

2025-10-07

http://arxiv.org/abs/2510.05691v1

Agentic Retrieval-Augmented Generation (Agentic RAG) enhances the processing capability for complex tasks through dynamic retrieval and adaptive workflows. Recent advances (e.g., Search-R1) have shown that outcome-supervised reinforcement learning demonstrates strong performance. However, this approach still suffers from inefficient exploration, sparse reward signals, and ambiguous global reward feedback. To address these challenges, we propose DecEx-RAG, which models RAG as a Markov Decision Process (MDP) incorporating decision-making and execution, while introducing an efficient pruning strategy to optimize data expansion. Through comprehensive process-level policy optimization, DecEx-RAG significantly enhances the autonomous task decomposition, dynamic retrieval, and high-quality answer generation capabilities of large language models (LLMs). Experiments show that DecEx-RAG achieves an average absolute performance improvement of $6.2\%$ across six datasets, significantly outperforming existing baselines. Moreover, the pruning strategy improves data construction efficiency by nearly $6\times$, providing an efficient solution for process-supervised RAG training. The code is available at https://github.com/sdsxdxl/DecEx-RAG.

Teaching Machines to Speak Using Articulatory Control

Authors: Akshay Anand, Chenxu Guo, Cheol Jun Cho, Jiachen Lian, Gopala Anumanchipalli

2025-10-07

http://arxiv.org/abs/2510.05619v1

Current speech production systems predominantly rely on large models that operate as black boxes, providing little interpretability or grounding in the physical mechanisms of human speech. We address this limitation by proposing a new framework: speech generation through explicit articulatory control. This reframes speech as a motor control task similar to robotic manipulation. Our approach uses reinforcement learning to train a policy that directly controls the movements of vocal tract articulators, such as the tongue, lips, and jaw, to produce syllable-level speech. Specifically, we employ the Proximal Policy Optimization algorithm to learn optimal articulatory movements based on acoustic feedback provided by our audio perceiver, Sylber. The resulting articulatory trajectories are decoded into audio using SPARC, a pre-trained articulatory-to-speech decoder. We train this framework on six target syllables, and it demonstrates successful convergence, with similarity scores between the policy-generated audio and the target syllables exceeding 0.85. Accurate human transcription of the audio for syllables such as "please", "loot", and "cat" demonstrates the intelligibility of this framework.

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Authors: Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu

2025-10-07

http://arxiv.org/abs/2510.05592v1

Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.
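
The credit-assignment idea in this abstract (one trajectory-level outcome broadcast to every turn, with group-normalized advantages) can be illustrated with a small sketch. This is only a generic GRPO-style normalization under assumed inputs; the function names, the `eps` constant, and the toy rewards are hypothetical, and the paper's actual objective is not given in the abstract.

```python
from statistics import mean, pstdev


def group_normalized_advantages(outcomes, eps=1e-8):
    """Normalize trajectory-level rewards within one sampled group.

    `outcomes` holds one scalar reward per trajectory (hypothetical input).
    """
    mu = mean(outcomes)
    sigma = pstdev(outcomes)
    return [(r - mu) / (sigma + eps) for r in outcomes]


def broadcast_to_turns(advantages, turns_per_trajectory):
    """Assign each trajectory's advantage to every one of its turns."""
    return [[a] * n for a, n in zip(advantages, turns_per_trajectory)]


# Toy group of four rollouts: two succeed, two fail.
outcomes = [1.0, 0.0, 0.0, 1.0]
turns = [3, 2, 4, 1]  # number of planner turns in each rollout
adv = group_normalized_advantages(outcomes)
per_turn = broadcast_to_turns(adv, turns)
```

Each inner list of `per_turn` could then weight a single-turn policy update, which is one way to read "a sequence of tractable single-turn policy updates".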

Deciphering Invariant Feature Decoupling in Source-free Time Series Forecasting with Proxy Denoising

Authors: Kangjia Yan, Chenxi Liu, Hao Miao, Xinle Wu, Yan Zhao, Chenjuan Guo, Bin Yang

2025-10-07

http://arxiv.org/abs/2510.05589v1

The proliferation of mobile devices generates a massive volume of time series across various domains, where effective time series forecasting enables a variety of real-world applications. This study focuses on a new problem of source-free domain adaptation for time series forecasting. It aims to adapt a pretrained model from sufficient source time series to the target time series domain without access to the source data, embracing data protection regulations. To achieve this, we propose TimePD, the first source-free time series forecasting framework with proxy denoising, where large language models (LLMs) are employed to benefit from their generalization capabilities. Specifically, TimePD consists of three key components: (1) dual-branch invariant disentangled feature learning that enforces representation- and gradient-wise invariance by means of season-trend decomposition; (2) lightweight, parameter-free proxy denoising that dynamically calibrates systematic biases of LLMs; and (3) knowledge distillation that bidirectionally aligns the denoised prediction and the original target prediction. Extensive experiments on real-world datasets offer insight into the effectiveness of the proposed TimePD, outperforming SOTA baselines by 9.3% on average.

H1B-KV Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference

Authors: Harshil Vejendla

2025-10-07

http://arxiv.org/abs/2510.05529v1

Autoregressive decoding in large language models (LLMs) requires caching a growing list of past key-value (KV) pairs, making long-context inference a memory-bound problem. While recent methods have explored quantizing the KV cache, evicting tokens, or using binary sketches for keys (e.g., Loki), these approaches often provide an incomplete solution by leaving one component (like values) uncompressed or by discarding context information. This paper introduces the Hybrid One-Bit KV Cache (H1B-KV), a comprehensive compression scheme that radically reduces memory usage without sacrificing context. H1B-KV represents each key vector using a 1-bit binary sketch, enabling hardware-friendly bitwise attention, and further compresses value vectors using 4-bit quantization. This holistic, hybrid approach allows a 7-billion parameter LLM to handle an 8k-token context with under 60 MB of cache memory - a 70x reduction. We demonstrate that after lightweight finetuning, H1B-KV matches full-precision performance not only on perplexity benchmarks but also on complex downstream tasks like mathematical reasoning (GSM8K), multi-task understanding (MMLU), and code generation (HumanEval). Our results show H1B-KV significantly outperforms leading quantization (KIVI), token eviction (Sparse), and key-only sketching (Loki) methods in quality-per-byte, establishing it as a robust solution for deploying LLMs in memory-constrained environments.
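
To make the two ingredients named in this abstract concrete (a 1-bit sketch per key dimension and 4-bit codes for values), here is a minimal pure-Python sketch. It assumes a plain sign-bit sketch and uniform min-max quantization; the paper's actual sketching and quantization schemes are not specified in the abstract, and all function names are hypothetical.

```python
def sign_sketch(vec):
    """Pack the sign bit of each dimension into one Python int."""
    bits = 0
    for i, x in enumerate(vec):
        if x >= 0:
            bits |= 1 << i
    return bits


def bitwise_score(q_bits, k_bits, dim):
    """Sign agreements minus disagreements, computed via XOR and popcount."""
    hamming = bin(q_bits ^ k_bits).count("1")
    return dim - 2 * hamming


def quantize_4bit(vec):
    """Uniform min-max quantization to integer codes in 0..15."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 15 or 1.0  # avoid a zero scale for constant vectors
    codes = [round((x - lo) / scale) for x in vec]
    return codes, scale, lo


def dequantize_4bit(codes, scale, lo):
    """Map 4-bit codes back to approximate real values."""
    return [c * scale + lo for c in codes]


key = [0.5, -1.2, 0.3, -0.1]
query = [1.0, -0.4, -0.2, -0.9]
score = bitwise_score(sign_sketch(query), sign_sketch(key), dim=4)

value = [0.0, 1.5, -1.5, 0.75]
codes, scale, lo = quantize_4bit(value)
restored = dequantize_4bit(codes, scale, lo)
```

The bitwise score equals the dot product of the two sign vectors, which is why it can stand in for a query-key similarity; the reconstruction error of each value entry is bounded by half the quantization step.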

ARMOR High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization

Authors: Lawrence Liu, Alexander Liu, Mengdi Wang, Tuo Zhao, Lin F. Yang

2025-10-07

http://arxiv.org/abs/2510.05528v1

Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to state-of-the-art pruning algorithms. Experiments on Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.
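
For readers unfamiliar with the 2:4 pattern referenced above (at most two nonzero weights in every group of four consecutive weights), the following sketch shows the simplest way to impose it: keep the two largest-magnitude entries per group. This is plain magnitude selection used purely as an illustration, not ARMOR's factorization; the function name and toy row are hypothetical.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude entries in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(x if i in keep else 0.0 for i, x in enumerate(group))
    return out


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
pruned = prune_2_of_4(row)
```

Directly zeroing weights this way is the kind of step the abstract contrasts with its factorized approach, where wrapper matrices around the 2:4 core are meant to absorb some of the resulting error.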

CAM A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension

Authors: Rui Li, Zeyu Zhang, Xiaohe Bo, Zihang Tian, Xu Chen, Quanyu Dai, Zhenhua Dong, Ruiming Tang

2025-10-07

http://arxiv.org/abs/2510.05520v1

Current Large Language Models (LLMs) are confronted with overwhelming information volume when comprehending long-form documents. This challenge raises the imperative of a cohesive memory module, which can elevate vanilla LLMs into autonomous reading agents. Despite the emergence of some heuristic approaches, a systematic design principle remains absent. To fill this void, we draw inspiration from Jean Piaget's Constructivist Theory, illuminating three traits of the agentic memory -- structured schemata, flexible assimilation, and dynamic accommodation. This blueprint forges a clear path toward a more robust and efficient memory system for LLM-based reading comprehension. To this end, we develop CAM, a prototype implementation of Constructivist Agentic Memory that simultaneously embodies the structurality, flexibility, and dynamicity. At its core, CAM is endowed with an incremental overlapping clustering algorithm for structured memory development, supporting both coherent hierarchical summarization and online batch integration. During inference, CAM adaptively explores the memory structure to activate query-relevant information for contextual response, akin to the human associative process. Compared to existing approaches, our design demonstrates dual advantages in both performance and efficiency across diverse long-text reading comprehension tasks, including question answering, query-based summarization, and claim verification.

LANTERN Scalable Distillation of Large Language Models for Job-Person Fit and Explanation

Authors: Zhoutong Fu, Yihan Cao, Yi-Lin Chen, Aman Lunia, Liming Dong, Neha Saraf, Ruijie Jiang, Yun Dai, Qingquan Song, Tan Wang, Guoyao Li, Derek Koh, Haichao Wei, Zhipeng Wang, Aman Gupta, Chengming Jiang, Jianqiang Shen, Liangjie Hong, Wenjing Zhang

2025-10-07

http://arxiv.org/abs/2510.05490v1

Large language models (LLMs) have achieved strong performance across a wide range of natural language processing tasks. However, deploying LLMs at scale for domain-specific applications, such as job-person fit and explanation in job seeking platforms, introduces distinct challenges. At LinkedIn, the job-person fit task requires analyzing a candidate's public profile against job requirements to produce both a fit assessment and a detailed explanation. Directly applying open-source or finetuned LLMs to this task often fails to yield high-quality, actionable feedback due to the complexity of the domain and the need for structured outputs. Moreover, the large size of these models leads to high inference latency and limits scalability, making them unsuitable for online use. To address these challenges, we introduce LANTERN, a novel LLM knowledge distillation framework tailored specifically for job-person fit tasks. LANTERN involves modeling over multiple objectives, an encoder model for classification purposes, and a decoder model for explanation purposes. To better distill the knowledge from a strong black-box teacher model to multiple downstream models, LANTERN incorporates multi-level knowledge distillation that integrates both data- and logit-level insights. In addition to introducing the knowledge distillation framework, we share our insights on post-training techniques and prompt engineering, both of which are crucial for successfully adapting LLMs to domain-specific downstream tasks. Extensive experimental results demonstrate that LANTERN significantly improves task-specific metrics for both job-person fit and explanation. Online evaluations further confirm its effectiveness, showing measurable gains in job seeker engagement, including a 0.24% increase in apply rate and a 0.28% increase in qualified applications.

AMAQ Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning

Authors: Yurun Song, Zhuoyi Yang, Ian G. Harris, Sangeetha Abdu Jyothi

2025-10-07

http://arxiv.org/abs/2510.05468v1

Large Language Models (LLMs) are scaling rapidly, creating significant challenges for collaborative server-client distributed training, particularly in terms of efficiency and computational overheads. To address these challenges, we implement Parameter-efficient Split Learning, which effectively balances efficiency and performance for collaborative training on low-resource devices. To reduce overhead in collaborative training, we introduce Adaptive Mixed-bit Activation Quantization (AMAQ), a strategy that progressively compresses activations and gradients from high precision (6 to 8 bits) to low precision (3 to 4 bits). AMAQ achieves this by effectively allocating bit budgets across channels based on feature-wise and layer-wise importance using bit regularization. Under the same bit budgets, AMAQ outperforms fixed-precision approaches, delivering about 2.5% higher generation accuracy and about 1.3% better classification accuracy for models like LLaMA3 8B and Qwen2.5 7B. In addition, it significantly enhances training stability and reduces ultra-low-bit representation collapse during training. Experiments demonstrate that AMAQ integrates effectively into practical multi-machine collaborative training setups, offering superior inference accuracy with only a modest overhead for bit adaptation during training. This trade-off makes AMAQ a practical and effective solution for collaborative training with minimal cost.

Model-based Deep Learning for Joint RIS Phase Shift Compression and WMMSE Beamforming

Authors: Alexander James Fernandes, Ioannis Psaromiligkos

2025-10-06

http://arxiv.org/abs/2510.05438v2

A model-based deep learning (DL) architecture is proposed for reconfigurable intelligent surface (RIS)-assisted multi-user communications to reduce the overhead of transmitting phase shift information from the access point (AP) to the RIS controller. The phase shifts are computed at the AP, which has access to the channel state information, and then encoded into a compressed binary control message that is sent to the RIS controller for element configuration. To help reduce beamformer mismatches due to phase shift compression errors, the beamformer is updated using weighted minimum mean square error (WMMSE) based on the effective channel resulting from the actual (decompressed) RIS reflection coefficients. By unrolling the iterative WMMSE algorithm as part of the wireless informed DL architecture, joint phase shift compression and WMMSE beamforming can be trained end-to-end. Simulations show that accounting for phase shift compression errors during beamforming significantly improves the sum-rate performance, even when the number of control bits is lower than the number of RIS elements.

Draft, Verify, and Improve Toward Training-Aware Speculative Decoding

Authors: Shrenik Bhansali, Larry Heck

2025-10-06

http://arxiv.org/abs/2510.05421v1

Autoregressive (AR) decoding is a major latency bottleneck for large language models. Speculative decoding (SD) accelerates AR decoding by letting a drafter propose multi-token blocks that a verifier accepts or rejects. However, many SD systems require heavy offline training or extra components. These choices raise data/compute cost and can yield brittle drafters under distribution drift. We introduce Draft, Verify, & Improve (DVI), a training-aware self-speculative framework that combines inference with continual online learning. We partition an LLM into a drafter and a verifier, and during generation, verifier accept/reject decisions are converted into supervision signals and used to update the drafter head. A simple KL $\rightarrow$ RL schedule bootstraps calibration via online distillation and then adds reward-masked cross-entropy with an on-policy policy-gradient term, preserving lossless, single-model deployment. On Spec-Bench, DVI achieves a $2.16\times$ wall-time speedup, on par with SoTA approaches like EAGLE-2, while using orders of magnitude less data for training, and ablations show that DVI outperforms KL-only online distillation. DVI demonstrates that training-aware self-speculation can deliver state-of-the-art, lossless speedups with minimal training overhead.
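
The draft-then-verify loop described in this abstract can be sketched with greedy token matching. The sketch below is a generic illustration, not DVI itself: `draft_next` and `verify_next` are hypothetical stand-ins for a drafter head and a verifier, and the verifier is called position by position for clarity, whereas practical systems score a whole block in one forward pass.

```python
def speculative_step(prefix, draft_next, verify_next, block_size):
    """Run one draft-then-verify step with greedy matching.

    Returns the tokens to append and how many drafted tokens were accepted.
    """
    # The drafter proposes a block of tokens autoregressively.
    proposal = []
    for _ in range(block_size):
        proposal.append(draft_next(prefix + proposal))

    # The verifier keeps the longest prefix of the block it agrees with.
    accepted = []
    for tok in proposal:
        target = verify_next(prefix + accepted)
        if tok != target:
            # First mismatch: emit the verifier's token and stop.
            accepted.append(target)
            return accepted, len(accepted) - 1
        accepted.append(tok)
    return accepted, len(accepted)


def verify_next(tokens):
    """Toy verifier: count upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy drafter: agrees with the verifier except right after a 3."""
    return 5 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


new_tokens, n_accepted = speculative_step([1], draft_next, verify_next, block_size=4)
```

Because every emitted token equals what the verifier would have produced on its own, the output is unchanged by drafting (the "lossless" property); the accepted count is the kind of accept/reject signal that could be turned into supervision for the drafter.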

Scalable In-context Ranking with Generative Models

Authors: Nilesh Gupta, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Inderjit Dhillon, Felix Yu

2025-10-06

http://arxiv.org/abs/2510.05396v2

In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR), which leverages contextual understanding of LLMs by directly incorporating the task description, candidate documents, and the query into the model's input prompt and tasking the LLM to identify relevant document(s). While it is effective, efficiency is a significant challenge in this paradigm, especially as the candidate list grows due to quadratic/super-linear scaling of attention operation with context length. To this end, this paper first identifies inherent and exploitable structures in the attention of LLMs finetuned for ICR: (1) inter-document block sparsity: attention is dense within each document block but sparse across different documents in the context; and (2) query-document block relevance: the attention scores from certain query tokens to a document block in middle layers strongly correlate with that document's actual relevance. Motivated by these observations, we introduce BlockRank (Blockwise In-context Ranking), a novel method that adapts the attention operation in an LLM by (a) architecturally enforcing the observed inter-document block sparsity, reducing attention complexity from quadratic to linear without loss in performance, and (b) optimizing query-document block relevance for true relevant documents during fine-tuning using an auxiliary contrastive training objective, improving retrieval in attention. Experiments on BEIR, MSMarco and NQ with Mistral-7B demonstrate that BlockRank Mistral matches or outperforms existing SOTA listwise rankers and a controlled fine-tuned baseline while being significantly more efficient at inference (4.7x for 100 MSMarco documents in context) and scaling gracefully to long-context shortlists, around 500 documents in-context (approximately 100K context length) within a second, presenting a scalable and effective solution for ICR.
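
One way to picture attention that is dense within each document block and cut off between different documents is as a boolean attention mask. The sketch below builds such a mask under assumptions that are not stated in the abstract: a shared instruction prefix visible to all tokens, causal ordering, and query tokens that may attend to everything before them. The segment labels and function name are hypothetical.

```python
def block_mask(segments):
    """Build mask[i][j] = True when token i may attend to token j.

    segments[i] is "inst", "query", or a document id for token i.
    """
    n = len(segments)
    mask = [[False] * n for _ in range(n)]
    for i, seg_i in enumerate(segments):
        for j, seg_j in enumerate(segments):
            if j > i:
                continue  # causal: no attention to later tokens
            if seg_i == "query" or seg_j == "inst" or seg_i == seg_j:
                mask[i][j] = True
    return mask


# Two instruction tokens, two tokens each for documents 0 and 1, one query token.
segments = ["inst", "inst", 0, 0, 1, 1, "query"]
mask = block_mask(segments)
allowed = sum(sum(row) for row in mask)
```

With a mask of this shape, each document token attends only to the prefix and its own block, so the per-document cost no longer grows with the number of other documents in the context.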

KVLinC KV Cache Quantization with Hadamard Rotation and Linear Correction

Authors: Utkarsh Saxena, Kaushik Roy

2025-10-06

http://arxiv.org/abs/2510.05373v1

Quantizing the key-value (KV) cache is a promising strategy for improving the inference efficiency of large language models (LLMs). However, aggressive quantization to very low precision (e.g., 2 bits) introduces significant errors in the stored key and value tensors, which propagate through the dot-product attention mechanism and ultimately degrade generation quality. To address this, we propose KVLinC, a framework to mitigate attention errors introduced by KV cache quantization in the extreme low-precision regime. KVLinC combines a Hadamard rotation, which reduces quantization error in values, with lightweight linear correction adapters that explicitly compensate for errors introduced by quantized keys. Across extensive evaluations on the LLaMA, Qwen2.5, and Qwen3 model families, KVLinC consistently matches or surpasses strong baselines while achieving higher KV-cache compression. Furthermore, we implement a custom attention kernel that results in up to 2.55x faster inference compared to a Flash Attention baseline, enabling efficient long-context inference.
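
The intuition behind pairing a Hadamard rotation with low-bit storage can be shown on a toy vector with one outlier: rotating spreads the outlier across all coordinates, which narrows the range a 2-bit grid has to cover. The sketch below assumes an orthonormal Walsh-Hadamard transform and uniform min-max quantization; it illustrates the general effect only, and the function names and toy vector are hypothetical rather than taken from the paper.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]


def quantize_dequantize(vec, bits=2):
    """Round each entry to the nearest point of a uniform min-max grid."""
    lo, hi = min(vec), max(vec)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels or 1.0
    return [round((x - lo) / scale) * scale + lo for x in vec]


def sq_error(a, b):
    """Sum of squared differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


x = [8.0, 0.1, -0.2, 0.15, -0.1, 0.05, 0.2, -0.15]  # one large outlier
plain_err = sq_error(x, quantize_dequantize(x))

# Rotate, quantize in the rotated basis, then rotate back.
# The orthonormal transform is its own inverse.
restored = fwht(quantize_dequantize(fwht(x)))
rotated_err = sq_error(x, restored)
```

On this toy input the rotated path gives a smaller reconstruction error than quantizing directly, because the quantization step shrinks once the outlier no longer dominates the range.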

WeatherArchive-Bench Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives

Authors: Yongan Yu, Xianda Du, Qingchen Hu, Jiahao Liang, Jingwei Ni, Dan Qiang, Kaiyu Huang, Grant McKenzie, Renee Sieber, Fengran Mo

2025-10-06

http://arxiv.org/abs/2510.05336v1

Historical archives on weather events are collections of enduring primary source records that offer rich, untapped narratives of how societies have experienced and responded to extreme weather events. These qualitative accounts provide insights into societal vulnerability and resilience that are largely absent from meteorological records, making them valuable for climate scientists to understand societal responses. However, their vast scale, noisy digitized quality, and archaic language make it difficult to transform them into structured knowledge for climate research. To address this challenge, we introduce WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which measures a system's ability to locate historically relevant passages from over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models (LLMs) can classify societal vulnerability and resilience indicators from extreme weather narratives. Extensive experiments across sparse, dense, and re-ranking retrievers, as well as a diverse set of LLMs, reveal that dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts. These findings highlight key limitations in reasoning about complex societal indicators and provide insights for designing more robust climate-focused RAG systems from archival contexts. The constructed dataset and evaluation framework are publicly available at https://anonymous.4open.science/r/WeatherArchive-Bench/.

DP-Adam-AC Privacy-preserving Fine-Tuning of Localizable Language Models Using Adam Optimization with Adaptive Clipping

Authors: Ruoxing Yang

2025-10-06

http://arxiv.org/abs/2510.05288v1

Large language models (LLMs) such as ChatGPT have evolved into powerful and ubiquitous tools. Fine-tuning on small datasets allows LLMs to acquire specialized skills for specific tasks efficiently. Although LLMs provide great utility in both general and task-specific use cases, they are limited by two security-related concerns. First, traditional LLM hardware requirements make them infeasible to run locally on consumer-grade devices. A remote network connection with the LLM provider's server is usually required, making the system vulnerable to network attacks. Second, fine-tuning an LLM for a sensitive task may involve sensitive data. Non-private fine-tuning algorithms produce models vulnerable to training data reproduction attacks. Our work addresses these security concerns by enhancing differentially private optimization algorithms and applying them to fine-tune localizable language models. We introduce adaptable gradient clipping along with other engineering enhancements to the standard DP-Adam optimizer to create DP-Adam-AC. We use our optimizer to fine-tune examples of two localizable LLM designs, small language model (Qwen2.5-0.5B) and 1.58-bit quantization (Bitnet-b1.58-2B). We demonstrate promising improvements in loss through experimentation with two synthetic datasets.

Stratum System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving

Authors: Yue Pan, Zihan Xia, Po-Kai Hsu, Lanxiang Hu, Hyungyo Kim, Janak Sharda, Minxuan Zhou, Nam Sung Kim, Shimeng Yu, Tajana Rosing, Mingu Kang

2025-10-06

http://arxiv.org/abs/2510.05245v1

As Large Language Models (LLMs) continue to evolve, Mixture of Experts (MoE) architecture has emerged as a prevailing design for achieving state-of-the-art performance across a wide range of tasks. MoE models use gating to activate only a handful of expert sub-networks per input, achieving billion-parameter capacity with inference costs akin to much smaller models. However, such models often pose challenges for hardware deployment due to the massive data volume introduced by the MoE layers. To address the challenges of serving MoE models, we propose Stratum, a system-hardware co-design approach that combines the novel memory technology Monolithic 3D-Stackable DRAM (Mono3D DRAM), near-memory processing (NMP), and GPU acceleration. The logic and Mono3D DRAM dies are connected through hybrid bonding, whereas the Mono3D DRAM stack and GPU are interconnected via a silicon interposer. Mono3D DRAM offers higher internal bandwidth than HBM thanks to the dense vertical interconnect pitch enabled by its monolithic structure, which supports implementations of higher-performance near-memory processing. Furthermore, we tackle the latency differences introduced by aggressive vertical scaling of Mono3D DRAM along the z-dimension by constructing internal memory tiers and assigning data across layers based on access likelihood, guided by topic-based expert usage prediction to boost NMP throughput. The Stratum system achieves up to 8.29x improvement in throughput and 7.66x better energy efficiency across various benchmarks compared to GPU baselines.
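
The statement that MoE gating activates "only a handful of expert sub-networks per input" is commonly realized with top-k routing. The sketch below shows that generic mechanism under assumed inputs; it is not Stratum's design, and the router logits, `k`, and function name are hypothetical.

```python
import math


def top_k_gate(router_logits, k):
    """Select the k highest-scoring experts and softmax over just those."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)  # subtract max for stability
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Five experts, two active for this token.
gates = top_k_gate([2.0, -1.0, 0.5, 1.0, 0.0], k=2)
```

Only the experts that appear as keys in `gates` would be evaluated for the token, which is why which experts get selected (and how often) matters for where their weights are placed in memory.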

Boomerang Distillation Enables Zero-Shot Model Size Interpolation

Authors: Sara Kangaslahti, Nihal V. Nayak, Jonathan Geuter, Marco Fumero, Francesco Locatello, David Alvarez-Melis

2025-10-06

http://arxiv.org/abs/2510.05064v1

Large language models (LLMs) are typically deployed under diverse memory and compute constraints. Existing approaches build model families by training each size independently, which is prohibitively expensive and provides only coarse-grained size options. In this work, we identify a novel phenomenon that we call boomerang distillation: starting from a large base model (the teacher), one first distills down to a small student and then progressively reconstructs intermediate-sized models by re-incorporating blocks of teacher layers into the student without any additional training. This process produces zero-shot interpolated models of many intermediate sizes whose performance scales smoothly between the student and teacher, often matching or surpassing pretrained or distilled models of the same size. We further analyze when this type of interpolation succeeds, showing that alignment between teacher and student through pruning and distillation is essential. Boomerang distillation thus provides a simple and efficient way to generate fine-grained model families, dramatically reducing training cost while enabling flexible adaptation across deployment environments. The code and models are available at https://github.com/dcml-lab/boomerang-distillation.

SSDD Single-Step Diffusion Decoder for Efficient Image Tokenization

Authors: Théophane Vallaeys, Jakob Verbeek, Matthieu Cord

2025-10-06

http://arxiv.org/abs/2510.04961v1

Tokenizers are a key component of state-of-the-art generative image models, extracting the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE), trained with reconstruction, perceptual and adversarial losses. Diffusion decoders have been proposed as a more principled alternative to model the distribution over images conditioned on the latent. However, matching the performance of KL-VAE still requires adversarial losses, as well as a higher decoding time due to iterative sampling. To address these limitations, we introduce a new pixel diffusion decoder architecture for improved scaling and training stability, benefiting from components and GAN-free training. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction trained without adversarial losses, reaching higher reconstruction quality and faster sampling than KL-VAE. In particular, SSDD improves reconstruction FID from $0.87$ to $0.50$ with $1.4\times$ higher throughput and preserves generation quality of DiTs with $3.8\times$ faster sampling. As such, SSDD can be used as a drop-in replacement for KL-VAE, and for building higher-quality and faster generative models.

Bidirectional Mammogram View Translation with Column-Aware and Implicit 3D Conditional Diffusion

Authors: Xin Li, Kaixiang Yang, Qiang Li, Zhiwei Wang

2025-10-06

http://arxiv.org/abs/2510.04947v1

Dual-view mammography, including craniocaudal (CC) and mediolateral oblique (MLO) projections, offers complementary anatomical views crucial for breast cancer diagnosis. However, in real-world clinical workflows, one view may be missing, corrupted, or degraded due to acquisition errors or artifacts, limiting the effectiveness of downstream analysis. View-to-view translation can help recover missing views and improve lesion alignment. Unlike natural images, this task in mammography is highly challenging due to large non-rigid deformations and severe tissue overlap in X-ray projections, which obscure pixel-level correspondences. In this paper, we propose Column-Aware and Implicit 3D Diffusion (CA3D-Diff), a novel bidirectional mammogram view translation framework based on a conditional diffusion model. To address cross-view structural misalignment, we first design a column-aware cross-attention mechanism that leverages the geometric property that anatomically corresponding regions tend to lie in similar column positions across views. A Gaussian-decayed bias is applied to emphasize local column-wise correlations while suppressing distant mismatches. Furthermore, we introduce an implicit 3D structure reconstruction module that back-projects noisy 2D latents into a coarse 3D feature volume based on breast-view projection geometry. The reconstructed 3D structure is refined and injected into the denoising UNet to guide cross-view generation with enhanced anatomical awareness. Extensive experiments demonstrate that CA3D-Diff achieves superior performance in bidirectional tasks, outperforming state-of-the-art methods in visual fidelity and structural consistency. Furthermore, the synthesized views effectively improve single-view malignancy classification in screening settings, demonstrating the practical value of our method in real-world diagnostics.
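
One way to picture a Gaussian-decayed column bias is as an additive term on the cross-attention logits that shrinks with the squared column distance between query and key. The sketch below is only an illustration under assumptions: the exact functional form, the `sigma` parameter, and the simplification to one token per column are not stated in the abstract.

```python
from math import exp


def column_bias(width, sigma):
    """Additive bias: 0 for matching columns, increasingly negative with distance."""
    return [[-((q - k) ** 2) / (2.0 * sigma ** 2) for k in range(width)]
            for q in range(width)]


def softmax(row):
    m = max(row)                      # subtract the max for numerical stability
    e = [exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]


def biased_attention_weights(logits, bias):
    """Add the bias to raw attention logits, then normalize each query row."""
    return [softmax([l + b for l, b in zip(l_row, b_row)])
            for l_row, b_row in zip(logits, bias)]


width = 5
flat_logits = [[0.0] * width for _ in range(width)]   # no content preference
weights = biased_attention_weights(flat_logits, column_bias(width, sigma=1.0))
```

With flat logits, each query column attends most strongly to the same column in the other view, and the weight decays symmetrically with column distance.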

ParallelBench Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs

Authors: Wonjun Kang, Kevin Galim, Seunghyuk Oh, Minjae Lee, Yuchen Zeng, Shuibai Zhang, Coleman Hooper, Yuezhou Hu, Hyung Il Koo, Nam Ik Cho, Kangwook Lee

2025-10-06

http://arxiv.org/abs/2510.04767v1

While most autoregressive LLMs are constrained to one-by-one decoding, diffusion LLMs (dLLMs) have attracted growing interest for their potential to dramatically accelerate inference through parallel decoding. Despite this promise, the conditional independence assumption in dLLMs causes parallel decoding to ignore token dependencies, inevitably degrading generation quality when these dependencies are strong. However, existing works largely overlook these inherent challenges, and evaluations on standard benchmarks (e.g., math and coding) are not sufficient to capture the quality degradation caused by parallel decoding. To address this gap, we first provide an information-theoretic analysis of parallel decoding. We then conduct case studies on analytically tractable synthetic list operations from both data distribution and decoding strategy perspectives, offering quantitative insights that highlight the fundamental limitations of parallel decoding. Building on these insights, we propose ParallelBench, the first benchmark specifically designed for dLLMs, featuring realistic tasks that are trivial for humans and autoregressive LLMs yet exceptionally challenging for dLLMs under parallel decoding. Using ParallelBench, we systematically analyze both dLLMs and autoregressive LLMs, revealing that: (i) dLLMs under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speedup without compromising quality. Our findings underscore the pressing need for innovative decoding methods that can overcome the current speed-quality trade-off. We release our benchmark to help accelerate the development of truly efficient dLLMs.

Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models

Authors: Raha Askari, Sina Zarrieß, Özge Alacam, Judith Sieker

2025-10-06

http://arxiv.org/abs/2510.04764v2

Implicit meanings are integral to human communication, making it essential for language models to be capable of identifying and interpreting them. Grice (1975) proposed a set of conversational maxims that guide cooperative dialogue, noting that speakers may deliberately violate these principles to express meanings beyond literal words, and that listeners, in turn, recognize such violations to draw pragmatic inferences. Building on Surian et al. (1996)'s study of children's sensitivity to violations of Gricean maxims, we introduce a novel benchmark to test whether language models pretrained on less than 10M and less than 100M tokens can distinguish maxim-adhering from maxim-violating utterances. We compare these BabyLMs across five maxims and situate their performance relative to children and a Large Language Model (LLM) pretrained on 3T tokens. We find that overall, models trained on less than 100M tokens outperform those trained on less than 10M, yet fall short of child-level and LLM competence. Our results suggest that modest data increases improve some aspects of pragmatic behavior, leading to finer-grained differentiation between pragmatic dimensions.

Multilingual Routing in Mixture-of-Experts

Authors: Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng

2025-10-06

http://arxiv.org/abs/2510.04694v1

Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly its tokens are routed to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside of the middle layers or targeting multilingual-specialized experts only yield performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.

The R(1)W(1) Communication Model for Self-Stabilizing Distributed Algorithms

Authors: Hirotsugu Kakugawa, Sayaka Kamei, Masahiro Shibata, Fukuhito Ooshita

2025-10-06

http://arxiv.org/abs/2510.04644v1

Self-stabilization is a versatile methodology in the design of fault-tolerant distributed algorithms for transient faults. A self-stabilizing system automatically recovers from any kind and any finite number of transient faults. This property is especially useful in modern distributed systems with a large number of components. In this paper, we propose a new communication and execution model named the R(1)W(1) model in which each process can read and write its own and neighbors' local variables in a single step. We propose self-stabilizing distributed algorithms in the R(1)W(1) model for the problems of maximal matching, minimal k-dominating set and maximal k-dependent set. Finally, we propose an example transformer, based on randomized distance-two local mutual exclusion, to simulate algorithms designed for the R(1)W(1) model in the synchronous message passing model with synchronized clocks.

A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification

Authors: Hao Liu, Yunhao Gao, Wei Li, Mingyang Zhang, Maoguo Gong, Lorenzo Bruzzone

2025-10-06

http://arxiv.org/abs/2510.04628v1

Deep learning-based methods have achieved significant success in remote sensing Earth observation data analysis. Numerous feature fusion techniques address multimodal remote sensing image classification by integrating global and local features. However, these techniques often struggle to extract structural and detail features from heterogeneous and redundant multimodal images. With the goal of introducing frequency domain learning to model key and detail features, this paper introduces the spatial-spectral-frequency interaction network (S$^2$Fin), which integrates pairwise fusion modules across the spatial, spectral, and frequency domains. Specifically, we propose a high-frequency enhancement that employs spatial-spectral attention to optimize the parameters of the high-frequency filter. Subsequently, a two-level spatial-frequency fusion strategy is introduced, comprising an adaptive frequency channel module that fuses low-frequency structures with enhanced high-frequency details, and a high-frequency resonance mask that emphasizes sharp edges via phase similarity. In addition, a spatial-spectral attention fusion module further enhances feature extraction at intermediate layers of the network. Experiments on four benchmark multimodal datasets with limited labeled data demonstrate that S$^2$Fin achieves superior classification performance, outperforming state-of-the-art methods. The code is available at https://github.com/HaoLiu-XDU/SSFin.

Compressed Concatenation of Small Embedding Models

Authors: Mohamed Ayoub Ben Ayad, Michael Dinzinger, Kanishka Ghosh Dastidar, Jelena Mitrovic, Michael Granitzer

2025-10-06

http://arxiv.org/abs/2510.04626v1

Embedding models are central to dense retrieval, semantic search, and recommendation systems, but their size often makes them impractical to deploy in resource-constrained environments such as browsers or edge devices. While smaller embedding models offer practical advantages, they typically underperform compared to their larger counterparts. To bridge this gap, we demonstrate that concatenating the raw embedding vectors of multiple small models can outperform a single larger baseline on standard retrieval benchmarks. To overcome the resulting high dimensionality of naive concatenation, we introduce a lightweight unified decoder trained with a Matryoshka Representation Learning (MRL) loss. This decoder maps the high-dimensional joint representation to a low-dimensional space, preserving most of the original performance without fine-tuning the base models. We also show that while concatenating more base models yields diminishing gains, the robustness of the decoder's representation under compression and quantization improves. Our experiments show that, on a subset of MTEB retrieval tasks, our concat-encode-quantize pipeline recovers 89\% of the original performance with a 48x compression factor when the pipeline is applied to a concatenation of four small embedding models.

FedSRD Sparsify-Reconstruct-Decompose for Communication-Efficient Federated Large Language Models Fine-Tuning

Authors: Guochen Yan, Luyuan Xie, Qingni Shen, Yuejian Fang, Zhonghai Wu

2025-10-06

http://arxiv.org/abs/2510.04601v2

The current paradigm of training large language models (LLMs) on publicly available Web data is becoming unsustainable, with high-quality data sources in specialized domains nearing exhaustion. Federated Learning (FL) emerges as a practical solution for the next generation of AI on a decentralized Web, enabling privacy-preserving collaborative fine-tuning by leveraging private data distributed across a global client base. While Low-Rank Adaptation (LoRA) is the standard for efficient fine-tuning, its application in federated settings presents a critical challenge: communication overhead remains a significant bottleneck across the Web's heterogeneous network conditions. The structural redundancy within LoRA parameters not only incurs a heavy communication burden but also introduces conflicts when aggregating client updates. To address this, we propose FedSRD, a Sparsify-Reconstruct-Decompose framework designed for communication-efficient federated LLM fine-tuning. We first introduce an importance-aware sparsification method that preserves the structural integrity of LoRA updates to reduce the uploaded parameter count. The server then reconstructs and aggregates these updates in a full-rank space to mitigate conflicts. Finally, it decomposes the global update into a low-rank format for broadcast, ensuring a symmetrically efficient cycle. We also propose an efficient variant, FedSRD-e, to reduce computational overhead. Experimental results on 10 benchmarks demonstrate that our framework significantly reduces communication costs by up to 90\% while even improving model performance on heterogeneous client data.

Language Model Based Text-to-Audio Generation Anti-Causally Aligned Collaborative Residual Transformers

Authors: Juncheng Wang, Chao Xu, Cheng Yu, Zhe Hu, Haoyu Xie, Guoqi Yu, Lei Shang, Shujun Wang

2025-10-06

http://arxiv.org/abs/2510.04577v1

While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LM training, and 2) descending semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren facilitates a promising pathway toward unified multi-modal generation frameworks.

LaDiR Latent Diffusion Enhances LLMs for Text Reasoning

Authors: Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin

2025-10-06

http://arxiv.org/abs/2510.04573v2

Large Language Models (LLMs) demonstrate their reasoning ability through chain-of-thought (CoT) generation. However, LLMs' autoregressive decoding may limit the ability to revisit and refine earlier tokens in a holistic manner, which can also lead to inefficient exploration for diverse solutions. In this paper, we propose LaDiR (Latent Diffusion Reasoner), a novel reasoning framework that unifies the expressiveness of continuous latent representation with the iterative refinement capabilities of latent diffusion models for an existing LLM. We first construct a structured latent reasoning space using a Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of thought tokens, preserving semantic information and interpretability while offering compact but expressive representations. Subsequently, we utilize a latent diffusion model that learns to denoise a block of latent thought tokens with a blockwise bidirectional attention mask, enabling longer horizon and iterative refinement with adaptive test-time compute. This design allows efficient parallel generation of diverse reasoning trajectories, allowing the model to plan and revise the reasoning process holistically. We conduct evaluations on a suite of mathematical reasoning and planning benchmarks. Empirical results show that LaDiR consistently improves accuracy, diversity, and interpretability over existing autoregressive, diffusion-based, and latent reasoning methods, revealing a new paradigm for text reasoning with latent diffusion.

COSMIR Chain Orchestrated Structured Memory for Iterative Reasoning over Long Context

Authors: Naman Gupta, Shreeyash Gowaikar, Arun Iyer, Kirankumar Shiragur, Ramakrishna B Bairi, Rishikesh Maurya, Ritabrata Maiti, Sankarshan Damle, Shachee Mishra Gupta

2025-10-06

http://arxiv.org/abs/2510.04568v1

Reasoning over very long inputs remains difficult for large language models (LLMs). Common workarounds either shrink the input via retrieval (risking missed evidence), enlarge the context window (straining selectivity), or stage multiple agents to read in pieces. In staged pipelines (e.g., Chain of Agents, CoA), free-form summaries passed between agents can discard crucial details and amplify early mistakes. We introduce COSMIR (Chain Orchestrated Structured Memory for Iterative Reasoning), a chain-style framework that replaces ad hoc messages with a structured memory. A Planner agent first turns a user query into concrete, checkable sub-questions. Worker agents process chunks via a fixed micro-cycle: Extract, Infer, Refine, writing all updates to the shared memory. A Manager agent then synthesizes the final answer directly from the memory. This preserves step-wise read-then-reason benefits while changing both the medium (structured memory) and the worker procedure (fixed micro-cycle), yielding higher faithfulness, better long-range aggregation, and auditability. On long-context QA from the HELMET suite, COSMIR reduces propagation-stage information loss and improves accuracy over a CoA baseline.

Multi-Agent Collaborative Intelligence Dual-Dial Control for Reliable LLM Reasoning

Authors: Edward Y. Chang, Ethan Y. Chang

2025-10-06

http://arxiv.org/abs/2510.04488v1

Multi-agent debate often wastes compute by using a fixed adversarial stance, aggregating without deliberation, or stopping on heuristics. We introduce MACI, an active controller with two independent dials that decouple information from behavior: an information dial that gates evidence by quality, and a behavior dial that schedules contentiousness from exploration to consolidation. A moderator tracks disagreement, overlap, evidence quality, and argument quality, and halts when gains plateau. We provide theory-lite guarantees for nonincreasing dispersion and provable termination, with a budget-feasible scheduler. Across clinical diagnosis and news-bias tasks, MACI improves accuracy and calibration while reducing tokens, and converts residual uncertainty into precision RAG plans that specify what to retrieve next. We use a cross-family LLM judge (CRIT) as a conservative soft weight and stop signal, validated for order invariance and judge-swap stability; stability depends on using high-capability judges. MACI turns debate into a budget-aware, measurable, and provably terminating controller.

Compressed Convolutional Attention Efficient Attention in a Compressed Latent Space

Authors: Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge

2025-10-06

http://arxiv.org/abs/2510.04476v1

Multi-headed Attention's (MHA) quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) shrink the cache, speeding decode, but leave compute, which determines prefill and training speed, largely unchanged. We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values and performs the entire attention operation inside the shared latent space. This simple design dramatically cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Because CCA is orthogonal to head-sharing, we combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which further tightens the compute-bandwidth Pareto frontier so that users can tune compression toward either FLOP or memory limits without sacrificing quality. Experiments show that CCGQA consistently outperforms both GQA and MLA at equal KV-cache compression on dense and MoE models. Additionally, we show that CCGQA outperforms all other attention methods on MoE models with half the KV-cache of GQA and MLA, achieving an 8x KV-cache compression with no drop in performance compared to standard MHA. CCA and CCGQA also dramatically reduce the FLOP cost of attention, which leads to substantially faster training and prefill than existing methods. On H100 GPUs, our fused CCA/CCGQA kernel reduces prefill latency by about 1.7x at a sequence length of 16k relative to MHA, and accelerates backward by about 1.3x.

REAR Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization

Authors: Qiyuan He, Yicong Li, Haotian Ye, Jinghao Wang, Xinyao Liao, Pheng-Ann Heng, Stefano Ermon, James Zou, Angela Yao

2025-10-06

http://arxiv.org/abs/2510.04450v1

Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance remains suboptimal against diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency, i.e., the AR-generated tokens may not be well-decoded by the tokenizer. To address this, we propose reAR, a simple training strategy introducing a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and predict the embedding of the target token under a noisy context. It requires no changes to the tokenizer, generation order, inference pipeline, or external models. Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).

Speculative Actions A Lossless Framework for Faster Agentic Systems

Authors: Naimeng Ye, Arnav Ahuja, Georgios Liargkovas, Yunan Lu, Kostis Kaffes, Tianyi Peng

2025-10-05

http://arxiv.org/abs/2510.04371v1

Despite growing interest in AI agents across industry and academia, their execution in an environment is often slow, hampering training, evaluation, and deployment. For example, a game of chess between two state-of-the-art agents may take hours. A critical bottleneck is that agent behavior unfolds sequentially: each action requires an API call, and these calls can be time-consuming. Inspired by speculative execution in microprocessors and speculative decoding in LLM inference, we propose speculative actions, a lossless framework for general agentic systems that predicts likely actions using faster models, enabling multiple steps to be executed in parallel. We evaluate this framework across three agentic environments: gaming, e-commerce, web search, and a "lossy" extension for an operating systems environment. In all cases, speculative actions achieve substantial accuracy in next-action prediction (up to 55%), translating into significant reductions in end-to-end latency. Moreover, performance can be further improved through stronger guessing models, top-K action prediction, multi-step speculation, and uncertainty-aware optimization, opening a promising path toward deploying low-latency agentic systems in the real world.

SliceMoE Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling

Authors: Harshil Vejendla

2025-10-05

http://arxiv.org/abs/2510.04286v1

Mixture-of-Experts (MoE) layers scale transformers by routing tokens to a subset of feed-forward experts. Token-level routing, however, assigns an entire semantic spectrum to each expert, creating capacity bottlenecks, load-balancing pathologies, and limited specialization. We introduce SliceMoE, an architecture that routes contiguous slices of a token's hidden vector. A d-dimensional embedding is partitioned into S slices, and for each slice, a lightweight shared router predicts the top-k experts. Experts operate on their assigned slices independently, and outputs are reassembled, maintaining per-token FLOP efficiency. Because slices from different tokens interleave within an expert, utilization is naturally smoother. We propose a slice-level capacity loss, cross-slice dropout, and efficient fused batched GEMM kernels. Experiments on WikiText-103 language modeling, WMT En-De translation, and three text-classification datasets show SliceMoE attains up to 1.7x faster inference than dense baselines, 12 to 18 percent lower perplexity than parameter-matched token-MoE, and improved expert balance, with interpretable expertise over syntactic versus semantic subspaces.
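
The slice-level routing described above can be sketched in a few lines. This is a minimal illustration, assuming equal-width slices and a linear router whose weights are shared across slices; the helper names and the toy router weights are hypothetical and do not come from the paper.

```python
def split_slices(vec, num_slices):
    """Partition a d-dimensional vector into S contiguous, equal-width slices."""
    d = len(vec)
    assert d % num_slices == 0, "d must be divisible by S in this sketch"
    width = d // num_slices
    return [vec[i * width:(i + 1) * width] for i in range(num_slices)]


def route_slice(slice_vec, router_weights, k):
    """Score every expert with a shared linear router and keep the top-k.

    router_weights has one row per expert, each row as long as a slice.
    """
    scores = [sum(w * x for w, x in zip(row, slice_vec)) for row in router_weights]
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]


token = [1.0, 0.0, 0.0, 1.0]                        # d = 4
router = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]       # 3 experts, slice width 2
assignments = [route_slice(s, router, k=1) for s in split_slices(token, 2)]
```

In this toy case the two slices of the same token are sent to different experts, which is the behaviour that distinguishes slice routing from token routing.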

Doctor-R1 Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning

Authors: Yunghwei Lai, Kaiming Liu, Ziyue Wang, Weizhi Ma, Yang Liu

2025-10-05

http://arxiv.org/abs/2510.04284v1

The professionalism of a human doctor in outpatient service depends on two core abilities: the ability to make accurate medical decisions and the medical consultation skill to conduct strategic, empathetic patient inquiry. Existing Large Language Models (LLMs) have achieved remarkable accuracy on medical decision-making benchmarks. However, they often lack the ability to conduct strategic and empathetic consultation, which is essential for real-world clinical scenarios. To address this gap, we propose Doctor-R1, an AI doctor agent trained to master both capabilities by asking high-yield questions and conducting strategic multi-turn inquiry to guide decision-making. Our framework introduces three key components: a multi-agent interactive environment, a two-tiered reward architecture that separately optimizes clinical decision-making and communicative inquiry skills, and an experience repository to ground policy learning in high-quality prior trajectories. We evaluate Doctor-R1 on OpenAI's HealthBench and MAQuE, assessed across multi-facet metrics, such as communication quality, user experience, and task accuracy. Remarkably, Doctor-R1 surpasses state-of-the-art open-source specialized LLMs by a substantial margin with higher parameter efficiency and outperforms powerful proprietary models. Furthermore, the human evaluations show a strong preference for Doctor-R1 to generate human-preferred clinical dialogue, demonstrating the effectiveness of the framework.

Don't Pass $\mathtt{@}k$ A Bayesian Framework for Large Language Model Evaluation

Authors: Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary

2025-10-05

http://arxiv.org/abs/2510.04265v1

Pass $@k$ is widely used to report performance for LLM reasoning, but it often yields unstable, misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass $@k$ and average accuracy over $N$ trials (avg $@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass $@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass $@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass $@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Code is available at https://mohsenhariri.github.io/bayes-kit
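
The closed-form posterior mentioned above follows from standard Dirichlet-categorical conjugacy. The sketch below shows that textbook calculation for a weighted rubric; it is not the authors' released code, and the function name and the default uniform prior are assumptions made for illustration.

```python
from math import sqrt


def dirichlet_posterior(counts, weights, prior=None):
    """Posterior mean and standard deviation of a weighted rubric score.

    counts[i]  : number of trials that landed in outcome category i
    weights[i] : rubric score assigned to category i (e.g. 0 = fail, 1 = pass)
    prior[i]   : Dirichlet pseudo-counts; defaults to uniform (assumption)
    """
    if prior is None:
        prior = [1.0] * len(counts)
    alpha = [c + p for c, p in zip(counts, prior)]      # conjugate update
    total = sum(alpha)
    probs = [a / total for a in alpha]                  # posterior mean of each p_i
    mean = sum(w * p for w, p in zip(weights, probs))
    second = sum(w * w * p for w, p in zip(weights, probs))
    var = (second - mean * mean) / (total + 1.0)        # Var of sum_i w_i * p_i
    return mean, sqrt(max(var, 0.0))


# Binary case: 3 failures and 7 successes out of 10 trials.
mean, std = dirichlet_posterior(counts=[3, 7], weights=[0.0, 1.0])
```

In the binary case with a uniform prior this reduces to a Beta(8, 4) posterior, so the mean is 8/12 rather than the raw 7/10, and the standard deviation gives a direct handle on how many trials are needed before two models can be separated.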

Scaling Sequence-to-Sequence Generative Neural Rendering

Authors: Shikun Liu, Kam Woh Ng, Wonbong Jang, Jiadong Guo, Junlin Han, Haozhe Liu, Yiannis Douratsos, Juan C. Pérez, Zijian Zhou, Chi Phung, Tao Xiang, Juan-Manuel Pérez-Rúa

2025-10-05

http://arxiv.org/abs/2510.04236v1

We present Kaleido, a family of generative models designed for photorealistic, unified object- and scene-level neural rendering. Kaleido operates on the principle that 3D can be regarded as a specialised sub-domain of video, expressed purely as a sequence-to-sequence image synthesis task. Through a systemic study of scaling sequence-to-sequence generative neural rendering, we introduce key architectural innovations that enable our model to: i) perform generative view synthesis without explicit 3D representations; ii) generate any number of 6-DoF target views conditioned on any number of reference views via a masked autoregressive framework; and iii) seamlessly unify 3D and video modelling within a single decoder-only rectified flow transformer. Within this unified framework, Kaleido leverages large-scale video data for pre-training, which significantly improves spatial consistency and reduces reliance on scarce, camera-labelled 3D datasets -- all without any architectural modifications. Kaleido sets a new state-of-the-art on a range of view synthesis benchmarks. Its zero-shot performance substantially outperforms other generative methods in few-view settings, and, for the first time, matches the quality of per-scene optimisation methods in many-view settings.

Let Features Decide Their Own Solvers Hybrid Feature Caching for Diffusion Transformers

Authors: Shikang Zheng, Guantao Chen, Qinming Zhou, Yuqi Lin, Lixuan He, Chang Zou, Peiliang Cai, Jiacheng Liu, Linfeng Zhang

2025-10-05

http://arxiv.org/abs/2510.04188v1

Diffusion Transformers offer state-of-the-art fidelity in image and video synthesis, but their iterative sampling process remains a major bottleneck due to the high cost of transformer forward passes at each timestep. To mitigate this, feature caching has emerged as a training-free acceleration technique that reuses or forecasts hidden representations. However, existing methods often apply a uniform caching strategy across all feature dimensions, ignoring their heterogeneous dynamic behaviors. Therefore, we adopt a new perspective by modeling hidden feature evolution as a mixture of ODEs across dimensions, and introduce HyCa, a Hybrid ODE solver inspired caching framework that applies dimension-wise caching strategies. HyCa achieves near-lossless acceleration across diverse domains and models, including a 5.55x speedup on FLUX, a 5.56x speedup on HunyuanVideo, and a 6.24x speedup on Qwen-Image and Qwen-Image-Edit without retraining.

PatternKV Flattening KV Representation Expands Quantization Headroom

Authors: Ji Zhang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

2025-10-05

http://arxiv.org/abs/2510.05176v1

KV cache in autoregressive LLMs eliminates redundant recomputation but has emerged as the dominant memory and bandwidth bottleneck during inference, notably with long contexts and test-time scaling. KV quantization is a key lever for reducing cache cost, but accuracy drops sharply as the native KV distribution lacks flatness and thus maintains a wide quantization range. Prior work focuses on isolating outliers, which caps their error but fails to flatten the overall distribution, leaving performance fragile under low-bit settings. In this work, we show that the K cache maintains a stable structure that evolves gradually with context, while the V cache carries latent semantic regularities. Building on these insights, we propose PatternKV, a pattern-aligned residual quantization scheme. It mines representative pattern vectors online, aligns each KV vector to its nearest pattern, and quantizes only the residual. This reshaping of the KV distribution flattens the quantization target and narrows its range, thereby improving the fidelity of low-bit KV quantization. Across long-context and test-time scaling settings on multiple backbones, PatternKV delivers consistent 2-bit gains, with a 0.08% average 4-bit drop relative to FP16, improves test-time scaling accuracy by 10% on average, and raises throughput by 1.4x while supporting 1.25x larger batches.
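
The align-then-quantize-the-residual idea can be illustrated with a small pure-Python sketch. Several parts are assumptions for illustration only: the pattern vectors are supplied by hand (the abstract does not say how they are mined online), the nearest pattern is chosen by squared Euclidean distance, and the residual is quantized with a plain per-vector min/max uniform quantizer.

```python
def nearest_pattern(vec, patterns):
    """Index of the pattern closest to vec (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(patterns)), key=lambda i: sqdist(vec, patterns[i]))


def quantize_uniform(values, bits):
    """Per-vector min/max uniform quantizer returning integer codes."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / scale) for v in values], lo, scale


def encode(vec, patterns, bits=2):
    """Store a pattern index plus the low-bit codes of the residual."""
    idx = nearest_pattern(vec, patterns)
    residual = [v - p for v, p in zip(vec, patterns[idx])]
    codes, lo, scale = quantize_uniform(residual, bits)
    return idx, codes, lo, scale


def decode(idx, codes, lo, scale, patterns):
    """Reconstruct the vector as pattern + dequantized residual."""
    return [p + lo + c * scale for p, c in zip(patterns[idx], codes)]


patterns = [[0.0, 0.0, 0.0, 0.0], [10.0, 10.0, 10.0, 10.0]]   # hypothetical
vec = [10.2, 9.9, 10.1, 9.8]
idx, codes, lo, scale = encode(vec, patterns, bits=2)
restored = decode(idx, codes, lo, scale, patterns)
```

Because the residual around the matched pattern spans only about 0.4 here, the four available 2-bit levels cover it with a step of roughly 0.13, so each element is restored to within half a step.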

Emergent Coordination in Multi-Agent Language Models

Authors: Christoph Riedl

2025-10-05

http://arxiv.org/abs/2510.05174v1

When are multi-agent systems merely a collection of individual agents versus an integrated collective with higher-order structure? We introduce an information-theoretic framework to test -- in a purely data-driven way -- whether multi-agent systems show signs of higher-order structure. This information decomposition lets us measure whether dynamical emergence is present in multi-agent systems, localize it, and distinguish spurious temporal coupling from performance-relevant cross-agent synergy. We implement both a practical criterion and an emergence capacity criterion operationalized as partial information decomposition of time-delayed mutual information (TDMI). We apply our framework to experiments using a simple guessing game without direct agent communication and only minimal group-level feedback with three randomized interventions. Groups in the control condition exhibit strong temporal synergy but only little coordinated alignment across agents. Assigning a persona to each agent introduces stable identity-linked differentiation. Combining personas with an instruction to "think about what other agents might do" shows identity-linked differentiation and goal-directed complementarity across agents. Taken together, our framework establishes that multi-agent systems can be steered with prompt design from mere aggregates to higher-order collectives. Our results are robust across emergence measures and entropy estimators, and not explained by coordination-free baselines or temporal dynamics alone. Without attributing human-like cognition to the agents, the patterns of interaction we observe mirror well-established principles of collective intelligence in human groups: effective performance requires both alignment on shared objectives and complementary contributions across members.
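
As a rough illustration of the quantity the framework builds on, the sketch below computes pairwise time-delayed mutual information with a plug-in (frequency-count) estimator. It assumes discrete-valued series and does not attempt the partial information decomposition itself; the function names are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in mutual information in bits between two aligned discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def tdmi(source, target, lag=1):
    """Time-delayed MI: what source at time t says about target at time t + lag."""
    return mutual_information(source[:-lag], target[lag:])

# Agent b repeats agent a's previous guess; agent c never changes its guess.
a = [0, 1] * 8
b = [0] + a[:-1]
c = [0] * 16

copy_coupling = tdmi(a, b, lag=1)  # close to 1 bit
no_coupling = tdmi(a, c, lag=1)    # zero: a constant target carries no information
```

A plug-in estimator is biased on short series, which is one reason the abstract's robustness check across entropy estimators matters.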

Beyond Next-Token Prediction A Performance Characterization of Diffusion versus Autoregressive Language Models

Authors: Minseo Kim, Coleman Hooper, Aditya Tomar, Chenfeng Xu, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

2025-10-05

http://arxiv.org/abs/2510.04146v1

Large Language Models (LLMs) have achieved state-of-the-art performance on a broad range of Natural Language Processing (NLP) tasks, including document processing and coding. Autoregressive Language Models (ARMs), which generate tokens sequentially conditioned on all previous tokens, have been the predominant paradigm for LLMs. However, while these networks have achieved high accuracy across a range of downstream tasks, they exhibit low arithmetic intensity due to the inherent sequential dependency with next-token prediction. Recently, Diffusion Language Models (DLMs) have emerged as a promising alternative architecture. DLMs generate output text in parallel, breaking the limitations of sequential dependency. However, the performance implications of DLMs relative to commonly deployed ARMs are not fully understood. In this work, we present a comprehensive performance study analyzing the performance characteristics of ARMs and DLMs, using both theoretical analysis and profiling data to characterize the trade-offs between these approaches. We illustrate that although DLMs exhibit higher arithmetic intensity compared to ARMs because of their capability to utilize parallelism across sequence lengths, they fail to scale effectively to longer contexts. We then explore DLMs with block-wise decoding, outlining how this approach allows for increased arithmetic intensity, while still scaling well to long contexts (similar to ARMs). We also show interesting trade-offs for batched inference, where we find that ARMs exhibit superior throughput, as they benefit more from parallelism across sequences in the batch. Finally, we highlight opportunities for accelerating DLM inference, and, in particular, highlight the importance of reducing the number of sampling steps for allowing open-source DLMs to provide improved latency relative to ARMs.
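
To make the arithmetic-intensity argument concrete, here is a back-of-the-envelope sketch for a single square linear layer. It is a simplified model, not the paper's analysis: it assumes FP16 values (2 bytes each), counts only weight and activation traffic, and ignores attention and caching.

```python
def arithmetic_intensity(tokens, d_model, bytes_per_value=2):
    """FLOPs per byte moved for one d_model x d_model linear layer over `tokens` tokens."""
    flops = 2 * tokens * d_model * d_model                      # multiply-accumulate count
    weight_bytes = bytes_per_value * d_model * d_model          # weights are read once
    activation_bytes = bytes_per_value * 2 * tokens * d_model   # read input, write output
    return flops / (weight_bytes + activation_bytes)

# One token per step (autoregressive decode) vs. 32 tokens per step (parallel block).
sequential = arithmetic_intensity(tokens=1, d_model=4096)
parallel = arithmetic_intensity(tokens=32, d_model=4096)
```

With one token per step the layer performs about one FLOP per byte, so it is memory-bound on typical accelerators; processing a block of tokens together reuses each weight many times and raises the ratio roughly in proportion to the block size.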

MoME Mixture of Matryoshka Experts for Audio-Visual Speech Recognition

Authors: Umberto Cappellazzo, Minsu Kim, Pingchuan Ma, Honglie Chen, Xubo Liu, Stavros Petridis, Maja Pantic

2025-10-05

http://arxiv.org/abs/2510.04136v1

Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce inference cost, but they require fixing a compression rate in advance and produce a single fixed-length output, offering no flexibility to balance information density and efficiency at inference time. Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities, allowing compression rates to be adjusted dynamically. However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen LLM with top-k routed and shared experts, allowing dynamic capacity allocation across scales and modalities. A shared router promotes consistent expert activation across granularities, enabling compressed sequences to benefit from representations learned at lower compression. Experiments on LRS2 and LRS3 demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters and maintaining robustness under noise. MoME unifies the adaptability of MRL with the efficiency of MoE, offering a scalable and interpretable solution for resource-aware speech recognition.
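
For readers unfamiliar with top-k routing, the following is a generic sketch of how a router might pick and weight experts for one token. It shows standard top-k softmax gating only; MoME's actual router, its shared experts, and its cross-scale sharing are not reproduced, and `top_k_route` is a hypothetical name.

```python
from math import exp

def top_k_route(logits, k):
    """Select the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)             # subtract the max for numerical stability
    raw = [exp(logits[i] - peak) for i in top]
    total = sum(raw)
    return {i: w / total for i, w in zip(top, raw)}

# Router scores for four experts; only the two best are activated.
gates = top_k_route([2.0, 0.0, 1.0, -1.0], k=2)
```

Using one router for every token granularity, as the abstract describes, would mean calling the same gating function regardless of the compression rate of the input sequence.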

Can Linear Probes Measure LLM Uncertainty?

Authors: Ramzi Dakhmouche, Adrien Letellier, Hossein Gorji

2025-10-05

http://arxiv.org/abs/2510.04108v1

Effective Uncertainty Quantification (UQ) represents a key aspect for reliable deployment of Large Language Models (LLMs) in automated decision-making and beyond. Yet, for LLM generation with multiple choice structure, the state-of-the-art in UQ is still dominated by the naive baseline given by the maximum softmax score. To address this shortcoming, we demonstrate that taking a principled approach via Bayesian statistics leads to improved performance despite leveraging the simplest possible model, namely linear regression. More precisely, we propose to train multiple Bayesian linear models, each predicting the output of a layer given the output of the previous one. Based on the obtained layer-level posterior distributions, we infer the global uncertainty level of the LLM by identifying a combination of distributional features, leading to an efficient UQ scheme. Numerical experiments on various LLMs show consistent improvement over state-of-the-art baselines.
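
The building block named above, a Bayesian linear model with a posterior over its weights, can be illustrated in one dimension. This toy sketch assumes a single scalar weight, a zero-mean Gaussian prior with precision `alpha`, and known noise precision `beta`; the paper's layer-to-layer probes are multivariate and its feature combination is not shown.

```python
def fit_bayesian_linear(xs, ys, alpha=1.0, beta=25.0):
    """Posterior mean and variance of w in y = w * x + noise (conjugate Gaussian case)."""
    precision = alpha + beta * sum(x * x for x in xs)
    mean = beta * sum(x * y for x, y in zip(xs, ys)) / precision
    return mean, 1.0 / precision

def predictive(x_new, mean, var_w, beta=25.0):
    """Predictive mean and variance at x_new: noise variance plus weight uncertainty."""
    return mean * x_new, 1.0 / beta + x_new * x_new * var_w

w_mean, w_var = fit_bayesian_linear([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
_, var_near = predictive(1.0, w_mean, w_var)
_, var_far = predictive(10.0, w_mean, w_var)
```

The predictive variance grows for inputs far from the training data, which is the kind of distributional feature a layer-level probe could expose as an uncertainty signal.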

Enhancing Fake News Video Detection via LLM-Driven Creative Process Simulation

Authors: Yuyan Bu, Qiang Sheng, Juan Cao, Shaofei Wang, Peng Qi, Yuhui Shi, Beizhe Hu

2025-10-05

http://arxiv.org/abs/2510.04024v1

The emergence of fake news on short video platforms has become a new significant societal concern, necessitating automatic video-news-specific detection. Current detectors primarily rely on pattern-based features to separate fake news videos from real ones. However, limited and less diversified training data lead to biased patterns and hinder their performance. This weakness stems from the complex many-to-many relationships between video material segments and fabricated news events in real-world scenarios: a single video clip can be utilized in multiple ways to create different fake narratives, while a single fabricated event often combines multiple distinct video segments. However, existing datasets do not adequately reflect such relationships due to the difficulty of collecting and annotating large-scale real-world data, resulting in sparse coverage and non-comprehensive learning of the characteristics of potential fake news video creation. To address this issue, we propose a data augmentation framework, AgentAug, that generates diverse fake news videos by simulating typical creative processes. AgentAug implements multiple LLM-driven pipelines of four fabrication categories for news video creation, combined with an active learning strategy based on uncertainty sampling to select the potentially useful augmented samples during training. Experimental results on two benchmark datasets demonstrate that AgentAug consistently improves the performance of short video fake news detectors.

Fit Pixels, Get Labels Meta-learned Implicit Networks for Image Segmentation

Authors: Kushal Vyas, Ashok Veeraraghavan, Guha Balakrishnan

2025-10-05

http://arxiv.org/abs/2510.04021v1

Implicit neural representations (INRs) have achieved remarkable successes in learning expressive yet compact signal representations. However, they are not naturally amenable to predictive tasks such as segmentation, where they must learn semantic structures over a distribution of signals. In this study, we introduce MetaSeg, a meta-learning framework to train INRs for medical image segmentation. MetaSeg uses an underlying INR that simultaneously predicts per-pixel intensity values and class labels. It then uses a meta-learning procedure to find optimal initial parameters for this INR over a training dataset of images and segmentation maps, such that the INR can simply be fine-tuned to fit pixels of an unseen test image, and automatically decode its class labels. We evaluated MetaSeg on 2D and 3D brain MRI segmentation tasks and report Dice scores comparable to commonly used U-Net models, but with $90\%$ fewer parameters. MetaSeg offers a fresh, scalable alternative to traditional resource-heavy architectures such as U-Nets and vision transformers for medical image segmentation. Our project is available at https://kushalvyas.github.io/metaseg.html .

Simulating and Understanding Deceptive Behaviors in Long-Horizon Interactions

Authors: Yang Xu, Xuanming Zhang, Min-Hsuan Yeh, Jwala Dhamala, Ousmane Dia, Rahul Gupta, Yixuan Li

2025-10-05

http://arxiv.org/abs/2510.03999v1

Deception is a pervasive feature of human communication and an emerging concern in large language models (LLMs). While recent studies document instances of LLM deception under pressure, most evaluations remain confined to single-turn prompts and fail to capture the long-horizon interactions in which deceptive strategies typically unfold. We introduce the first simulation framework for probing and evaluating deception in LLMs under extended sequences of interdependent tasks and dynamic contextual pressures. Our framework instantiates a multi-agent system: a performer agent tasked with completing tasks and a supervisor agent that evaluates progress, provides feedback, and maintains evolving states of trust. An independent deception auditor then reviews full trajectories to identify when and how deception occurs. We conduct extensive experiments across 11 frontier models, spanning both closed- and open-source systems, and find that deception is model-dependent, increases with event pressure, and consistently erodes supervisor trust. Qualitative analyses further reveal distinct strategies of concealment, equivocation, and falsification. Our findings establish deception as an emergent risk in long-horizon interactions and provide a foundation for evaluating future LLMs in real-world, trust-sensitive contexts.

Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs

Authors: Junjie Luo, Rui Han, Arshana Welivita, Zeleikun Di, Jingfu Wu, Xuzhe Zhi, Ritu Agarwal, Gordon Gao

2025-10-05

http://arxiv.org/abs/2510.03997v1

Understanding how patients perceive their physicians is essential to improving trust, communication, and satisfaction. We present a large language model (LLM)-based pipeline that infers Big Five personality traits and five patient-oriented subjective judgments. The analysis encompasses 4.1 million patient reviews of 226,999 U.S. physicians from an initial pool of one million. We validate the method through multi-model comparison and human expert benchmarking, achieving strong agreement between human and LLM assessments (correlation coefficients 0.72-0.89) and external validity through correlations with patient satisfaction (r = 0.41-0.81, all p<0.001). National-scale analysis reveals systematic patterns: male physicians receive higher ratings across all traits, with largest disparities in clinical competence perceptions; empathy-related traits predominate in pediatrics and psychiatry; and all traits positively predict overall satisfaction. Cluster analysis identifies four distinct physician archetypes, from "Well-Rounded Excellent" (33.8%, uniformly high traits) to "Underperforming" (22.6%, consistently low). These findings demonstrate that automated trait extraction from patient narratives can provide interpretable, validated metrics for understanding physician-patient relationships at scale, with implications for quality measurement, bias detection, and workforce development in healthcare.

SPEAR Soft Prompt Enhanced Anomaly Recognition for Time Series Data

Authors: Hanzhe Wei, Jiajun Wu, Jialin Yang, Henry Leung, Steve Drew

2025-10-04

http://arxiv.org/abs/2510.03962v1

Time series anomaly detection plays a crucial role in a wide range of fields, such as healthcare and internet traffic monitoring. The emergence of large language models (LLMs) offers new opportunities for detecting anomalies in the ubiquitous time series data. Traditional approaches struggle with variable-length time series sequences and context-based anomalies. We propose Soft Prompt Enhanced Anomaly Recognition (SPEAR), a novel approach to leverage LLMs for anomaly detection with soft prompts and quantization. Our methodology involves quantizing and transforming the time series data into input embeddings and combining them with learnable soft prompt embeddings. These combined embeddings are then fed into a frozen LLM. The soft prompts are updated iteratively based on a cross-entropy loss, allowing the model to adapt to time series anomaly detection. The use of soft prompts helps adapt LLMs effectively to time series tasks, while quantization ensures optimal handling of sequences, as LLMs are designed to handle discrete sequences. Our experimental results demonstrate that soft prompts effectively increase LLMs' performance in downstream tasks regarding time series anomaly detection.
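
The quantization step, turning real-valued readings into discrete token ids that an embedding table can look up, might look like the sketch below. Uniform min-max binning is an assumption made for illustration; the abstract does not say which quantizer SPEAR uses, and `quantize_series` is a hypothetical name.

```python
def quantize_series(series, num_bins):
    """Map each real value to an integer token id in [0, num_bins - 1] by uniform binning."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / num_bins if hi > lo else 1.0
    # The maximum value would land in bin `num_bins`, so clamp it into the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in series]

# A flat signal with one spike: the spike gets its own, rarely used token id.
tokens = quantize_series([1.0, 1.0, 1.0, 9.0, 1.0], num_bins=8)
```

Each token id would then index an embedding that is concatenated with the learnable soft prompt embeddings before being passed to the frozen model.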

Sliding Window Attention for Learned Video Compression

Authors: Alexander Kopte, André Kaup

2025-10-04

http://arxiv.org/abs/2510.03926v1

To manage the complexity of transformers in video compression, local attention mechanisms are a practical necessity. The common approach of partitioning frames into patches, however, creates architectural flaws like irregular receptive fields. When adapted for temporal autoregressive models, this paradigm, exemplified by the Video Compression Transformer (VCT), also necessitates computationally redundant overlapping windows. This work introduces 3D Sliding Window Attention (SWA), a patchless form of local attention. By enabling a decoder-only architecture that unifies spatial and temporal context processing, and by providing a uniform receptive field, our method significantly improves rate-distortion performance, achieving Bjøntegaard Delta-rate savings of up to 18.6% against the VCT baseline. Simultaneously, by eliminating the need for overlapping windows, our method reduces overall decoder complexity by a factor of 2.8, while its entropy model is nearly 3.5 times more efficient. We further analyze our model's behavior and show that while it benefits from long-range temporal context, excessive context can degrade performance.
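
The receptive-field irregularity caused by patch partitioning is easiest to see in one dimension. The sketch below is a simplification: it uses non-overlapping 1D patches and a symmetric 1D window, whereas the paper works with 3D (space and time) windows, and both function names are hypothetical.

```python
def patch_keys(i, length, patch):
    """Positions visible to query i when attention is confined to its own patch."""
    start = (i // patch) * patch
    return list(range(start, min(length, start + patch)))

def sliding_keys(i, length, window):
    """Positions visible to query i under a window centered on i."""
    half = window // 2
    return list(range(max(0, i - half), min(length, i + half + 1)))

# Query 3 sits at the right edge of its patch, query 4 at the left edge of the next one.
edge_right = patch_keys(3, length=12, patch=4)    # sees nothing to its right
edge_left = patch_keys(4, length=12, patch=4)     # sees nothing to its left
centered = sliding_keys(3, length=12, window=5)   # sees two neighbors on each side
```

With patches, the available context depends on where a position falls inside its patch; with a sliding window, every interior position gets the same centered neighborhood.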

Multi-Agent Code-Orchestrated Generation for Reliable Infrastructure-as-Code

Authors: Rana Nameer Hussain Khan, Dawood Wasif, Jin-Hee Cho, Ali Butt

2025-10-04

http://arxiv.org/abs/2510.03902v1

The increasing complexity of cloud-native infrastructure has made Infrastructure-as-Code (IaC) essential for reproducible and scalable deployments. While large language models (LLMs) have shown promise in generating IaC snippets from natural language prompts, their monolithic, single-pass generation approach often results in syntactic errors, policy violations, and unscalable designs. In this paper, we propose MACOG (Multi-Agent Code-Orchestrated Generation), a novel multi-agent LLM-based architecture for IaC generation that decomposes the task into modular subtasks handled by specialized agents: Architect, Provider Harmonizer, Engineer, Reviewer, Security Prover, Cost and Capacity Planner, DevOps, and Memory Curator. The agents interact via a shared-blackboard, finite-state orchestrator layer, and collectively produce Terraform configurations that are not only syntactically valid but also policy-compliant and semantically coherent. To ensure infrastructure correctness and governance, we incorporate Terraform Plan for execution validation and Open Policy Agent (OPA) for customizable policy enforcement. We evaluate MACOG using the IaC-Eval benchmark, where MACOG is the top enhancement across models, e.g., GPT-5 improves from 54.90 (RAG) to 74.02 and Gemini-2.5 Pro from 43.56 to 60.13, with concurrent gains on BLEU, CodeBERTScore, and an LLM-judge metric. Ablations show constrained decoding and deploy feedback are critical: removing them drops IaC-Eval to 64.89 and 56.93, respectively.

NoTVLA Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation

Authors: Zheng Huang, Mingyu Liu, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Xiaoman Li, Yiduo Jia, Hao Zhong, Hao Chen, Chunhua Shen

2025-10-04

http://arxiv.org/abs/2510.03895v1

Vision-Language-Action (VLA) models represent a pivotal advance in embodied intelligence, yet they confront critical barriers to real-world deployment, most notably catastrophic forgetting. This issue stems from their overreliance on continuous action sequences or action chunks, which inadvertently create isolated data silos that disrupt knowledge retention across tasks. To tackle these challenges, we propose the Narrowing of Trajectory VLA (NoTVLA) framework: a novel approach that narrows its focus to sparse trajectories, thereby avoiding the catastrophic forgetting associated with dense trajectory fine-tuning. A key innovation of NoTVLA lies in its trajectory planning strategy: instead of centering on the target object's trajectory, it leverages temporal compression and spatial reasoning pruning specifically for the robot end effector's trajectory. Furthermore, training is conducted using these sparse trajectories rather than dense action trajectories, an optimization that delivers remarkable practical advantages with better zero-shot performance. In multi-task evaluation scenarios, NoTVLA achieves superior performance and generalization compared to pi0 while operating under two critical constraints: it uses over an order of magnitude less computing power than pi0 and requires no wrist-mounted camera. This design ensures that NoTVLA's operational accuracy closely approximates that of single-task expert models. Crucially, it also preserves the model's inherent language capabilities, enabling zero-shot generalization in specific scenarios, supporting unified model deployment across multiple robot platforms, and fostering a degree of generalization even when perceiving tasks from novel perspectives.

DHQA-4D Perceptual Quality Assessment of Dynamic 4D Digital Human

Authors: Yunhao Li, Sijing Wu, Yucheng Zhu, Huiyu Duan, Zicheng Zhang, Guangtao Zhai

2025-10-04

http://arxiv.org/abs/2510.03874v1

With the rapid development of 3D scanning and reconstruction technologies, dynamic digital human avatars based on 4D meshes have become increasingly popular. A high-precision dynamic digital human avatar can be applied to various fields such as game production, animation generation, and remote immersive communication. However, these 4D human avatar meshes are prone to being degraded by various types of noise during the processes of collection, compression, and transmission, thereby affecting the viewing experience of users. In light of this fact, quality assessment of dynamic 4D digital humans becomes increasingly important. In this paper, we first propose a large-scale dynamic digital human quality assessment dataset, DHQA-4D, which contains 32 high-quality real-scanned 4D human mesh sequences, 1920 distorted textured 4D human meshes degraded by 11 textured distortions, as well as their corresponding textured and non-textured mean opinion scores (MOSs). Equipped with the DHQA-4D dataset, we analyze the influence of different types of distortion on human perception for textured dynamic 4D meshes and non-textured dynamic 4D meshes. Additionally, we propose DynaMesh-Rater, a novel large multimodal model (LMM) based approach that is able to assess both textured 4D meshes and non-textured 4D meshes. Concretely, DynaMesh-Rater extracts multi-dimensional features, including visual features from a projected 2D video, motion features from cropped video clips, and geometry features from the 4D human mesh, to provide comprehensive quality-related information. Then we utilize an LMM to integrate the multi-dimensional features and apply a LoRA-based instruction tuning technique to teach the LMM to predict the quality scores. Extensive experimental results on the DHQA-4D dataset demonstrate the superiority of our DynaMesh-Rater method over previous quality assessment methods.

Algorithm Generation via Creative Ideation

Authors: Ruiying Ma, Chieh-Jan Mike Liang, Yanjie Gao, Francis Y. Yan

2025-10-04

http://arxiv.org/abs/2510.03851v1

Designing system algorithms remains challenging, where the discontinuous nature of the solution space often forces system engineers to rely on generic heuristics at the expense of performance. We study whether LLMs can practically drive algorithm generation, and find that they are biased towards well-known generic designs, rather than making the creative leaps needed to navigate the discontinuous solution space. To address this limitation, we introduce MetaMuse, a framework for creative ideation built on three self-reflection principles: (1) quantifying solution diversity and usefulness in measurable performance space, rather than abstract idea space, (2) steering ideation through external stimuli, rather than internal randomness, and (3) constructing executable solutions using waypoint reasoning, rather than free-form chain-of-thought. Extensive evaluation shows that MetaMuse can generate high-performing solutions for two critical problems at a global cloud provider: cache replacement (reducing cache misses by up to 35.76%) and online bin packing (reducing bin usage by up to 30.93%).

Small Language Models for Agentic Systems A Survey of Architectures, Capabilities, and Deployment Trade offs

Authors: Raghav Sharma, Manan Mehta

2025-10-04

http://arxiv.org/abs/2510.03847v1

Small language models (SLMs; 1-12B params, sometimes up to 20B) are sufficient and often superior for agentic workloads where the objective is schema- and API-constrained accuracy rather than open-ended generation. We synthesize recent evidence across open and proprietary SLMs (Phi-4-Mini, Qwen-2.5-7B, Gemma-2-9B, Llama-3.2-1B/3B, Ministral-3B/8B, Apple on-device 3B, DeepSeek-R1-Distill) and connect it to modern evaluations (BFCL v3/v4, StableToolBench) and serving stacks (vLLM, SGLang, TensorRT-LLM) paired with guided decoding libraries (XGrammar, Outlines). We formalize SLM-default, LLM-fallback systems with uncertainty-aware routing and verifier cascades, and propose engineering metrics that reflect real production goals: cost per successful task (CPS), schema validity rate, executable call rate, p50/p95 latency, and energy per request. Guided decoding, strict JSON Schema outputs, and validator-first tool execution close much of the capability gap with larger models and often let SLMs match or surpass LLMs on tool use, function calling, and RAG at 10x-100x lower token cost with materially better latency and energy. We provide design patterns for agent stacks that prioritize SLMs: schema-first prompting, type-safe function registries, confidence scoring with verifier rollups, and lightweight adaptation via LoRA/QLoRA. We also delineate limits where LLM fallback remains valuable (open-domain reasoning and some long-horizon planning). The result is a practical blueprint for building fast, inexpensive, and reliable agents that default to SLMs while preserving headroom with targeted LLM assistance. Keywords: small language models, agents, function calling, structured outputs, JSON Schema, guided decoding, LoRA/QLoRA, routing, energy efficiency, edge inference
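
A minimal sketch of the SLM-default, LLM-fallback idea and of the cost-per-successful-task metric follows. Everything here is illustrative: the validator checks only required keys and their Python types rather than full JSON Schema, the confidence threshold of 0.7 is arbitrary, and the function names are hypothetical.

```python
import json

def schema_valid(raw, required):
    """True if `raw` parses to a JSON object holding every required key with the given type."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(
        isinstance(obj.get(key), expected) for key, expected in required.items())

def route(slm_output, slm_confidence, required, threshold=0.7):
    """Validator-first routing: keep the SLM answer only if it is valid and confident."""
    if schema_valid(slm_output, required) and slm_confidence >= threshold:
        return "slm"
    return "llm_fallback"

def cost_per_successful_task(costs, successes):
    """CPS: total spend divided by the number of tasks that succeeded."""
    wins = sum(successes)
    return sum(costs) / wins if wins else float("inf")

call_schema = {"name": str, "args": dict}
good = route('{"name": "get_weather", "args": {"city": "Oslo"}}', 0.9, call_schema)
bad = route('{"name": "get_weather"}', 0.9, call_schema)
cps = cost_per_successful_task([1.0, 1.0, 2.0], [True, False, True])
```

CPS charges failed attempts to the successful ones, so a cheap model that fails often can still come out more expensive than its per-token price suggests.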

TROLL Trust Regions improve Reinforcement Learning for Large Language Models

Authors: Philipp Becker, Niklas Freymuth, Serge Thilges, Fabian Otto, Gerhard Neumann

2025-10-04

http://arxiv.org/abs/2510.03817v1

On-policy Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints. The projection operates on a sparse subset of the model's most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language Models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model's inference behavior. Across datasets, model families, and advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.
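
To ground the contrast drawn above, the sketch below shows the two quantities involved: the standard PPO clipped surrogate for one token and the KL divergence between two token distributions. It does not implement TROLL's projection, which the abstract does not specify; `eps=0.2` is a common default, not a value from the paper.

```python
from math import log

def ppo_clip_term(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one token: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def categorical_kl(p, q):
    """KL(p || q) between two token distributions given as probability lists."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

old = [0.5, 0.3, 0.2]
# Both updates move the sampled token (index 0) by the same ratio of 1.2 ...
new_a = [0.6, 0.25, 0.15]
new_b = [0.6, 0.05, 0.35]
# ... so clipping treats them identically, while their KL to the old policy differs.
kl_a = categorical_kl(old, new_a)
kl_b = categorical_kl(old, new_b)
```

Clipping looks only at the probability ratio of the sampled token, which is why it is a loose proxy for a constraint on the whole next-token distribution.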

MambaCAFU Hybrid Multi-Scale and Multi-Attention Model with Mamba-Based Fusion for Medical Image Segmentation

Authors: T-Mai Bui, Fares Bougourzi, Fadi Dornaika, Vinh Truong Hoang

2025-10-04

http://arxiv.org/abs/2510.03786v1

In recent years, deep learning has shown near-expert performance in segmenting complex medical tissues and tumors. However, existing models are often task-specific, with performance varying across modalities and anatomical regions. Balancing model complexity and performance remains challenging, particularly in clinical settings where both accuracy and efficiency are critical. To address these issues, we propose a hybrid segmentation architecture featuring a three-branch encoder that integrates CNNs, Transformers, and a Mamba-based Attention Fusion (MAF) mechanism to capture local, global, and long-range dependencies. A multi-scale attention-based CNN decoder reconstructs fine-grained segmentation maps while preserving contextual consistency. Additionally, a co-attention gate enhances feature selection by emphasizing relevant spatial and semantic information across scales during both encoding and decoding, improving feature interaction and cross-scale communication. Extensive experiments on multiple benchmark datasets show that our approach outperforms state-of-the-art methods in accuracy and generalization, while maintaining comparable computational complexity. By effectively balancing efficiency and effectiveness, our architecture offers a practical and scalable solution for diverse medical imaging tasks. Source code and trained models will be publicly released upon acceptance to support reproducibility and further research.

You Have Been LaTeXpOsEd A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models

Authors: Richard A. Dubniczky, Bertalan Borsos, Tihanyi Norbert

2025-10-04

http://arxiv.org/abs/2510.03761v1

The widespread use of preprint repositories such as arXiv has accelerated the communication of scientific results but also introduced overlooked security risks. Beyond PDFs, these platforms provide unrestricted access to original source materials, including LaTeX sources, auxiliary code, figures, and embedded comments. In the absence of sanitization, submissions may disclose sensitive information that adversaries can harvest using open-source intelligence. In this work, we present the first large-scale security audit of preprint archives, analyzing more than 1.2 TB of source data from 100,000 arXiv submissions. We introduce LaTeXpOsEd, a four-stage framework that integrates pattern matching, logical filtering, traditional harvesting techniques, and large language models (LLMs) to uncover hidden disclosures within non-referenced files and LaTeX comments. To evaluate LLMs' secret-detection capabilities, we introduce LLMSec-DB, a benchmark on which we tested 25 state-of-the-art models. Our analysis uncovered thousands of PII leaks, GPS-tagged EXIF files, publicly available Google Drive and Dropbox folders, editable private SharePoint links, exposed GitHub and Google credentials, and cloud API keys. We also uncovered confidential author communications, internal disagreements, and conference submission credentials, exposing information that poses serious reputational risks to both researchers and institutions. We urge the research community and repository operators to take immediate action to close these hidden security gaps. To support open science, we release all scripts and methods from this study but withhold sensitive findings that could be misused, in line with ethical principles. The source code and related material are available at the project website https://github.com/LaTeXpOsEd

EvoEngineer Mastering Automated CUDA Kernel Code Evolution with Large Language Models

Authors: Ping Guo, Chenyu Zhu, Siyuan Chen, Fei Liu, Xi Lin, Zhichao Lu, Qingfu Zhang

2025-10-04

http://arxiv.org/abs/2510.03760v1

CUDA kernel optimization has become a critical bottleneck for AI performance, as deep learning training and inference efficiency directly depends on highly optimized GPU kernels. Despite the promise of Large Language Models (LLMs) for automating kernel optimization, this field suffers from a fragmented ecosystem of isolated and incomparable approaches with unclear problem formulations. Furthermore, general-purpose LLM code evolution methods cannot meet strict correctness requirements of CUDA kernel optimization. We address these fundamental challenges by first formalizing CUDA kernel optimization as a code optimization task with a clear objective, constraints, and evaluation metrics. We then establish the first systematic LLM-based code evolution framework, EvoEngineer, that provides guidance for designing and adapting optimization strategies to achieve a balance between performance and correctness. Finally, we implement a kernel optimization system based on this framework and conduct extensive experiments on 91 real-world CUDA kernels. Our results demonstrate that EvoEngineer achieves a principled balance between performance and correctness, with the highest averaged median speedup of 2.72x over baseline CUDA kernels and a code validity rate of 69.8%, outperforming existing methods on both dimensions. Our method achieves a maximum speedup of 36.75x among all operations over PyTorch kernels and delivers the highest speedup on 28 (56.0%) of 50 operations that achieve over 2x acceleration.

Token Hidden Reward Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning

Authors: Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis

2025-10-04

http://arxiv.org/abs/2510.03669v1

Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token's influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO's learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.
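
The sketch below illustrates the two ingredients the abstract combines: GRPO-style group-relative advantages and a sign-based reweighting of per-token learning signals. How THR itself is computed is not given in the abstract, so the THR value is simply an input here; the `boost` and `damp` factors and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each response's reward against its own sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def reweight(advantage, thr, boost=1.5, damp=0.5):
    """Exploitation-leaning variant: amplify positive-THR tokens, weaken negative-THR ones."""
    return advantage * (boost if thr > 0 else damp)

# Four sampled responses to one prompt, rewarded 1 if correct and 0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
amplified = reweight(advantages[0], thr=0.3)
weakened = reweight(advantages[0], thr=-0.3)
```

Swapping the `boost` and `damp` factors gives the reverse, exploration-leaning strategy that the abstract associates with Pass@K gains.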

Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders

Authors: Xu Wang, Yan Hu, Benyou Wang, Difan Zou

2025-10-04

http://arxiv.org/abs/2510.03659v1

Sparse Autoencoders (SAEs) are widely used to steer large language models (LLMs), based on the assumption that their interpretable features naturally enable effective model behavior steering. Yet, a fundamental question remains unanswered: does higher interpretability indeed imply better steering utility? To answer this question, we train 90 SAEs across three LLMs (Gemma-2-2B, Qwen-2.5-3B, Gemma-2-9B), spanning five architectures and six sparsity levels, evaluate their interpretability and steering utility based on SAEBench (arXiv:2501.12345) and AxBench (arXiv:2502.23456) respectively, and perform a rank-agreement analysis via Kendall's rank coefficients ($\tau_b$). Our analysis reveals only a relatively weak positive association ($\tau_b \approx 0.298$), indicating that interpretability is an insufficient proxy for steering performance. We conjecture that the interpretability-utility gap may stem from the selection of SAE features, as not all of them are equally effective for steering. To further find features that truly steer the behavior of LLMs, we propose a novel selection criterion called Delta Token Confidence, which measures how much amplifying a feature changes the next-token distribution. We show that our method improves the steering performance of three LLMs by 52.52% compared to the current best output-score-based criterion (arXiv:2503.34567). Strikingly, after selecting features with high Delta Token Confidence, the correlation between interpretability and utility vanishes ($\tau_b \approx 0$), and can even become negative. This further highlights the divergence between interpretability and utility for the most effective steering features.
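
The rank-agreement analysis mentioned above relies on Kendall's tau-b, which can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code; the function name `kendall_tau_b` and the example scores are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (tie-aware)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: excluded from both factors
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical scores for five SAEs (not data from the paper).
interpretability = [1, 2, 3, 4, 5]
steering_utility = [1, 3, 2, 5, 4]
tau = kendall_tau_b(interpretability, steering_utility)
```

A value near 1 means the two rankings agree, a value near 0 means they are unrelated, and a negative value means they tend to disagree.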

Decoupling Task-Solving and Output Formatting in LLM Generation

Authors: Haikang Deng, Po-Nien Kung, Nanyun Peng

2025-10-04

http://arxiv.org/abs/2510.03595v1

Large language models (LLMs) are increasingly adept at following instructions containing task descriptions to solve complex problems, such as mathematical reasoning and automatic evaluation (LLM-as-a-Judge). However, as prompts grow more complex, models often struggle to adhere to all instructions. This difficulty is especially common when instructive prompts intertwine reasoning directives -- specifying what the model should solve -- with rigid formatting requirements that dictate how the solution must be presented. The entanglement creates competing goals for the model, suggesting that more explicit separation of these two aspects could lead to improved performance. To this end, we introduce Deco-G, a framework that explicitly decouples format adherence from task solving. Deco-G handles format compliance with a separate tractable probabilistic model (TPM), while prompting LLMs with only task instructions. At each step, Deco-G combines next-token probabilities from the LLM with the TPM-calculated format compliance likelihood to form the output probability. To make this approach both practical and scalable for modern instruction-tuned LLMs, we introduce three key innovations: instruction-aware distillation, a flexible trie-building algorithm, and HMM state pruning for computational efficiency. We demonstrate the effectiveness of Deco-G across a wide range of tasks with diverse format requirements, including mathematical reasoning, LLM-as-a-judge, and event argument extraction. Overall, our approach yields 1.0% to 6.0% relative gain over regular prompting practice with guaranteed format compliance.
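
The per-step combination of model probabilities and format-compliance likelihood can be illustrated as follows. This is a sketch under an assumption: the abstract does not state the combination rule, so a renormalized product is used here. The function name and the toy numbers are hypothetical.

```python
def combine_step(llm_probs, format_likelihood):
    """Reweight next-token probabilities by a format-compliance likelihood.

    Assumption: the output distribution is proportional to the product
    p_llm(token) * p_format(token); the source does not specify the rule.
    """
    scores = {
        token: p * format_likelihood.get(token, 0.0)
        for token, p in llm_probs.items()
    }
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("no token is compatible with the format")
    return {token: s / total for token, s in scores.items()}


# Toy example: the format model rules out "The" and strongly prefers a number.
llm_probs = {"42": 0.5, "The": 0.4, "}": 0.1}
format_likelihood = {"42": 0.9, "The": 0.0, "}": 0.1}
output_probs = combine_step(llm_probs, format_likelihood)
```

Tokens that the format model assigns zero likelihood can never be sampled, which is one way a guarantee of format compliance could arise.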

FieldFormer Physics-Informed Transformers for Spatio-Temporal Field Reconstruction from Sparse Sensors

Authors: Ankit Bhardwaj, Ananth Balashankar, Lakshminarayanan Subramanian

2025-10-04

http://arxiv.org/abs/2510.03589v1

Spatio-temporal sensor data is often sparse, noisy, and irregular, and existing interpolation or learning methods struggle here because they either ignore governing PDEs or do not scale. We introduce FieldFormer, a transformer-based framework for mesh-free spatio-temporal field reconstruction that combines data-driven flexibility with physics-based structure. For each query, FieldFormer gathers a local neighborhood using a learnable velocity-scaled distance metric, enabling anisotropic adaptation to different propagation regimes. Neighborhoods are built efficiently via per-batch offset recomputation, and refined in an expectation-maximization style as the velocity scales evolve. Predictions are made by a local transformer encoder, and physics consistency is enforced through autograd-based PDE residuals and boundary-specific penalties. Across three benchmarks--a scalar anisotropic heat equation, a vector-valued shallow-water system, and a realistic advection-diffusion pollution simulation--FieldFormer consistently outperforms strong baselines by more than 40%. Our results demonstrate that FieldFormer enables accurate (RMSE $<10^{-2}$), efficient, and physically consistent field reconstruction from sparse (0.4%-2%) and noisy (10%) data.

Reactive Transformer (RxT) -- Stateful Real-Time Processing for Event-Driven Reactive Language Models

Authors: Adam Filipek

2025-10-03

http://arxiv.org/abs/2510.03561v1

The Transformer architecture has become the de facto standard for Large Language Models (LLMs), demonstrating remarkable capabilities in language understanding and generation. However, its application in conversational AI is fundamentally constrained by its stateless nature and the quadratic computational complexity ($O(L^2)$) with respect to sequence length $L$. Current models emulate memory by reprocessing an ever-expanding conversation history with each turn, leading to prohibitive costs and latency in long dialogues. This paper introduces the Reactive Transformer (RxT), a novel architecture designed to overcome these limitations by shifting from a data-driven to an event-driven paradigm. RxT processes each conversational turn as a discrete event in real time, maintaining context in an integrated, fixed-size Short-Term Memory (STM) system. The architecture features a distinct operational cycle where a generator-decoder produces a response based on the current query and the previous memory state, after which a memory-encoder and a dedicated Memory Attention network asynchronously update the STM with a representation of the complete interaction. This design fundamentally alters the scaling dynamics, reducing the total user-facing cost of a conversation from quadratic ($O(N^2 \cdot T)$) to linear ($O(N \cdot T)$) with respect to the number of interactions $N$. By decoupling response generation from memory updates, RxT achieves low latency, enabling truly real-time, stateful, and economically viable long-form conversations. We validated our architecture with a series of proof-of-concept experiments on synthetic data, demonstrating superior performance and constant-time inference latency compared to a baseline stateless model of comparable size.
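
The quadratic-versus-linear claim can be made concrete with a token-count sketch. It assumes every turn contributes exactly $T$ tokens and that cost is proportional to tokens processed; both are simplifications, and the function names are hypothetical.

```python
def stateless_tokens(num_turns, tokens_per_turn):
    """Tokens processed when turn k re-reads all k turns of history."""
    return sum(k * tokens_per_turn for k in range(1, num_turns + 1))


def reactive_tokens(num_turns, tokens_per_turn):
    """Tokens processed when each turn reads only the current interaction
    (context is carried by a fixed-size memory, not by replayed history)."""
    return num_turns * tokens_per_turn


# Assumed workload: 100 turns of 200 tokens each.
quadratic = stateless_tokens(100, 200)  # T * N(N+1)/2, i.e. O(N^2 * T)
linear = reactive_tokens(100, 200)      # N * T, i.e. O(N * T)
ratio = quadratic / linear
```

Under these assumptions the gap grows with the number of turns: the ratio is $(N+1)/2$, so doubling the conversation length roughly doubles the advantage.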

From Scope to Script An Automated Report Generation Model for Gastrointestinal Endoscopy

Authors: Evandros Kaklamanos, Kristjana Kristinsdottir, Jonathan Huang, Dustin Carlson, Rajesh Keswani, John Pandolfino, Mozziyar Etemadi

2025-10-03

http://arxiv.org/abs/2510.03543v1

Endoscopic procedures such as esophagogastroduodenoscopy (EGD) and colonoscopy play a critical role in diagnosing and managing gastrointestinal (GI) disorders. However, the documentation burden associated with these procedures places significant strain on gastroenterologists, contributing to inefficiencies in clinical workflows and physician burnout. To address this challenge, we propose a novel automated report generation model that leverages a transformer-based vision encoder and text decoder within a two-stage training framework. In the first stage, both components are pre-trained on image/text caption pairs to capture generalized vision-language features, followed by fine-tuning on image/report pairs to generate clinically meaningful findings. Our approach not only streamlines the documentation process but also holds promise for reducing physician workload and improving patient care.

Harnessing the XMM-Newton data X-ray spectral modelling of 4XMM-DR11 detections and 4XMM-DR11s sources

Authors: A. Viitanen, G. Mountrichas, H. Stiele, F. J. Carrera, A. Ruiz, J. Ballet, A. Akylas, A. Corral, M. Freyberg, A. Georgakakis, I. Georgantopoulos, S. Mateos, C. Motch, A. Nebot, H. Tranin, N. Webb

2025-10-03

http://arxiv.org/abs/2510.03409v1

The XMM-Newton X-ray observatory has played a prominent role in astrophysics, conducting precise and thorough observations of the X-ray sky for the past two decades. The most recent iteration of the 4XMM catalogue and one of its latest data releases, DR11, mark significant improvements over previous XMM-Newton catalogues, as a cornerstone for comprehending the diverse inhabitants of the X-ray sky. We employ detections and spectra extracted from the 4XMM-DR11 catalogue, subjecting them to fitting procedures using simple models. Our study operates within the framework of the XMM2ATHENA project, which focuses on developing state-of-the-art methods that exploit existing XMM-Newton data. We introduce and publicly release four catalogues containing measurements derived from X-ray spectral modelling of sources. The first catalogue encompasses outcomes obtained by fitting an absorbed power law model to all the extracted spectra for individual detections within the 4XMM-DR11 dataset. The second catalogue presents results obtained by fitting both an absorbed power law and an absorbed blackbody model to all unique physical sources listed in the 4XMM-DR11s catalogue, which documents source detection results from overlapping XMM-Newton observations. For the third catalogue we use the five-band count rates derived from the pipeline detection of X-ray sources to mimic low-resolution spectra and obtain a rough estimate of the spectral shape (absorbed power law) of all 4XMM-DR11 detections. In the fourth catalogue, we conduct spectral analyses for the subset of identified sources with extracted spectra, employing various models based on their classification into categories such as AGN, stars, X-ray binaries, and cataclysmic variables. The scientific potential of these catalogues is highlighted by discussing the capabilities of optical and mid-infrared colours for selecting absorbed AGN. (abridged)

Cache-to-Cache Direct Semantic Communication Between Large Language Models

Authors: Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang

2025-10-03

http://arxiv.org/abs/2510.03215v1

Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.

Coevolutionary Continuous Discrete Diffusion Make Your Diffusion Language Model a Latent Reasoner

Authors: Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang, Muhan Zhang, Lester Mackey, Tommi Jaakkola, Stephen Bates, Dinghuai Zhang

2025-10-03

http://arxiv.org/abs/2510.03206v1

Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and preliminary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain-of-thoughts, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between the theoretical expressiveness and empirical performance to their practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, it introduces additional difficulty in decoding tokens into the discrete token space from the continuous representation space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive with rich semantics in the latent space, and also has good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which show strong empirical performance in extensive language modeling experiments on real-world tasks.

FocusAgent Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Authors: Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert, Massimo Caccia, Jérémy Espinas, Alexandre Aussem, Véronique Eglin, Alexandre Lacoste

2025-10-03

http://arxiv.org/abs/2510.03204v1

Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases computational cost; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines, while reducing observation size by over 50%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.

OpenZL A Graph-Based Model for Compression

Authors: Yann Collet, Nick Terrell, W. Felix Handte, Danielle Rozenblit, Victor Zhang, Kevin Zhang, Yaelle Goldschlag, Jennifer Lee, Daniel Riegel, Stan Angelov, Nadav Rotem

2025-10-03

http://arxiv.org/abs/2510.03203v1

Research in general-purpose lossless compression over the last decade has largely found improvements in compression ratio that come at great cost to resource utilization and processing throughput. However, most production workloads require high throughput and low resource utilization, so most research systems have seen little adoption. Instead, real-world improvements in compression are increasingly often realized by building application-specific compressors which can exploit knowledge about the structure and semantics of the data being compressed. These systems easily outperform even the best generic compressors, but application-specific compression schemes are not without drawbacks. They are inherently limited in applicability and are difficult to maintain and deploy. We show that these challenges can be overcome with a new way of thinking about data compression. We propose the ``graph model'' of compression, a new theoretical framework for representing compression as a directed acyclic graph of modular codecs. This motivates OpenZL, an implementation of this model that compresses data into a self-describing wire format, any configuration of which can be decompressed by a universal decoder. OpenZL's design enables rapid development of tailored compressors with minimal code, its universal decoder eliminates deployment lag, and its investment in a well-vetted standard component library minimizes security risks. Experimental results demonstrate that OpenZL achieves superior compression ratios and speeds compared to state-of-the-art general-purpose compressors on a variety of real-world datasets. Internal deployments at Meta have also shown consistent improvements in size and/or speed, with development timelines reduced from months to days. OpenZL thus represents an advance in practical, scalable, and maintainable data compression for modern data-intensive applications.

Improving Cooperation in Collaborative Embodied AI

Authors: Hima Jacob Leven Suprabha, Laxmi Nag Laxminarayan Nagesh, Ajith Nair, Alvin Reuben Amal Selvaster, Ayan Khan, Raghuram Damarla, Sanju Hannah Samuel, Sreenithi Saravana Perumal, Titouan Puech, Venkataramireddy Marella, Vishal Sonar, Alessandro Suglia, Oliver Lemon

2025-10-03

http://arxiv.org/abs/2510.03153v1

The integration of Large Language Models (LLMs) into multiagent systems has opened new possibilities for collaborative reasoning and cooperation with AI agents. This paper explores different prompting methods and evaluates their effectiveness in enhancing agent collaborative behaviour and decision-making. We enhance CoELA, a framework designed for building Collaborative Embodied Agents that leverage LLMs for multi-agent communication, reasoning, and task coordination in shared virtual spaces. Through systematic experimentation, we examine different LLMs and prompt engineering strategies to identify optimised combinations that maximise collaboration performance. Furthermore, we extend our research by integrating speech capabilities, enabling seamless collaborative voice-based interactions. Our findings highlight the effectiveness of prompt optimisation in enhancing collaborative agent performance; for example, our best combination improved the efficiency of the system running with Gemma3 by 22% compared to the original CoELA system. In addition, the speech integration provides a more engaging user interface for iterative system development and demonstrations.

CHORD Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration

Authors: Tianqi Liu, Kairui Fu, Shengyu Zhang, Wenyan Fan, Zhaocheng Du, Jieming Zhu, Fan Wu, Fei Wu

2025-10-03

http://arxiv.org/abs/2510.03038v1

With the advancement of mobile device capabilities, deploying reranking models directly on devices has become feasible, enabling real-time contextual recommendations. When migrating models from cloud to devices, resource heterogeneity inevitably necessitates model compression. Recent quantization methods show promise for efficient deployment, yet they overlook device-specific user interests, resulting in compromised recommendation accuracy. While on-device finetuning captures personalized user preference, it imposes additional computational burden through local retraining. To address these challenges, we propose a framework for \underline{\textbf{C}}ustomizing \underline{\textbf{H}}ybrid-precision \underline{\textbf{O}}n-device model for sequential \underline{\textbf{R}}ecommendation with \underline{\textbf{D}}evice-cloud collaboration (\textbf{CHORD}), leveraging channel-wise mixed-precision quantization to simultaneously achieve personalization and resource-adaptive deployment. CHORD distributes randomly initialized models across heterogeneous devices and identifies user-specific critical parameters through auxiliary hypernetwork modules on the cloud. Our parameter sensitivity analysis operates across multiple granularities (layer, filter, and element levels), enabling precise mapping from user profiles to quantization strategy. Through on-device mixed-precision quantization, CHORD delivers dynamic model adaptation and accelerated inference without backpropagation, eliminating costly retraining cycles. We minimize communication overhead by encoding quantization strategies using only 2 bits per channel instead of 32-bit weights. Experiments on three real-world datasets with two popular backbones (SASRec and Caser) demonstrate the accuracy, efficiency, and adaptivity of CHORD.
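
The "2 bits per channel" encoding can be pictured as packing one small code per channel, four channels to a byte. The sketch below is illustrative only: the meaning of each code (for example, which bit width it selects) is not given in the abstract, and the function names are hypothetical.

```python
def pack_codes(codes):
    """Pack per-channel 2-bit codes (integers 0..3) into bytes, 4 per byte."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError("each code must fit in 2 bits")
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_codes(packed, num_channels):
    """Recover the per-channel codes from the packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(num_channels)]


# Hypothetical strategy for 64 channels.
strategy = [i % 4 for i in range(64)]
payload = pack_codes(strategy)
bytes_for_codes = len(payload)           # 2 bits per channel
bytes_for_fp32 = 4 * len(strategy)       # one 32-bit value per channel
```

For 64 channels the packed strategy takes 16 bytes, versus 256 bytes if a single 32-bit value were sent per channel, a 16x reduction in this simplified comparison.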

Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders

Authors: Kriz Tahimic, Charibeth Cheng

2025-10-03

http://arxiv.org/abs/2510.02917v1

As Large Language Models become integral to software development, with substantial portions of AI-suggested code entering production, understanding their internal correctness mechanisms becomes critical for safe deployment. We apply sparse autoencoders to decompose LLM representations, identifying directions that correspond to code correctness. We select predictor directions using t-statistics and steering directions through separation scores from base model representations, then analyze their mechanistic properties through steering, attention analysis, and weight orthogonalization. We find that code correctness directions in LLMs reliably predict incorrect code, while correction capabilities, though statistically significant, involve tradeoffs between fixing errors and preserving correct code. Mechanistically, successful code generation depends on attending to test cases rather than problem descriptions. Moreover, directions identified in base models retain their effectiveness after instruction-tuning, suggesting code correctness mechanisms learned during pre-training are repurposed during fine-tuning. Our mechanistic insights suggest three practical applications: prompting strategies should prioritize test examples over elaborate problem descriptions, predictor directions can serve as error alarms for developer review, and these same predictors can guide selective steering, intervening only when errors are anticipated to prevent the code corruption from constant steering.

TridentServe A Stage-level Serving System for Diffusion Pipelines

Authors: Yifei Xia, Fangcheng Fu, Hao Yuan, Hanke Zhang, Xupeng Miao, Yijun Liu, Suhan Ling, Jie Jiang, Bin Cui

2025-10-03

http://arxiv.org/abs/2510.02838v1

Diffusion pipelines, renowned for their powerful visual generation capabilities, have seen widespread adoption in generative vision tasks (e.g., text-to-image/video). These pipelines typically follow an encode--diffuse--decode three-stage architecture. Current serving systems deploy diffusion pipelines within a static, manual, and pipeline-level paradigm, allocating the same resources to every request and stage. However, through an in-depth analysis, we find that such a paradigm is inefficient due to the discrepancy in resource needs across the three stages of each request, as well as across different requests. Following the analysis, we propose the dynamic stage-level serving paradigm and develop TridentServe, a brand new diffusion serving system. TridentServe automatically and dynamically derives the placement plan (i.e., how each stage resides) for pipeline deployment and the dispatch plan (i.e., how the requests are routed) for request processing, co-optimizing the resource allocation for both model and requests. Extensive experiments show that TridentServe consistently improves SLO attainment and reduces average/P95 latencies by up to 2.5x and 3.6x/4.1x over existing works across a variety of workloads.

FlexiQ Adaptive Mixed-Precision Quantization for Latency/Accuracy Trade-Offs in Deep Neural Networks

Authors: Jaemin Kim, Hongjun Um, Sungkyun Kim, Yongjun Park, Jiwon Seo

2025-10-03

http://arxiv.org/abs/2510.02822v1

Neural networks commonly execute on hardware accelerators such as NPUs and GPUs because of their size and computation overhead. These accelerators are costly and it is hard to scale their resources to handle real-time workload fluctuations. We present FlexiQ, an adaptive mixed-precision quantization scheme for computer vision models. FlexiQ selectively applies low-bitwidth computation to feature channels with small value ranges and employs an efficient bit-lowering method to minimize quantization errors while maintaining inference accuracy. Furthermore, FlexiQ adjusts its low-bitwidth channel ratio in real time, enabling quantized models to effectively manage fluctuating inference workload. We implemented a FlexiQ prototype, including the mixed-precision inference runtime on our custom NPU and GPUs. Evaluated on eleven convolution- and transformer-based vision models, FlexiQ achieves on average 6.6% higher accuracy for 4-bit models with finetuning and outperforms four state-of-the-art quantization techniques. Moreover, our mixed-precision models achieved an efficient accuracy-latency trade-off, with the 50% 4-bit model incurring only 0.6% accuracy loss while achieving 40% of the speedup of the 100% 4-bit model over the 8-bit model. Latency evaluations on our NPU and GPUs confirmed that FlexiQ introduces minimal runtime overhead, demonstrating its hardware efficiency and overall performance benefits.

Distributed Low-Communication Training with Decoupled Momentum Optimization

Authors: Sasho Nedelkoski, Alexander Acker, Odej Kao, Soeren Becker, Dominik Scheinert

2025-10-03

http://arxiv.org/abs/2510.03371v1

The training of large models demands substantial computational resources, typically available only in data centers with high-bandwidth interconnects. However, reducing the reliance on high-bandwidth interconnects between nodes enables the use of distributed compute resources as an alternative to centralized data center training. Building on recent advances in distributed model training, we propose an approach that further reduces communication by combining infrequent synchronizations across distributed model replicas with gradient momentum compression. In particular, we treat the optimizer momentum as a signal and decompose the Nesterov momentum into high- and low-frequency components via the discrete cosine transform (DCT). Only the high-frequency components are synchronized across model replicas every $H$ steps. Empirically, our method achieves up to a $16\times$ reduction in communication compared to the baseline DiLoCo, and it generalizes across architectures, including transformer-based language models and convolutional neural networks for images. Overall, this work advances the feasibility of training large models on distributed nodes with low-bandwidth interconnects.
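
The frequency split of the momentum can be sketched with an orthonormal DCT-II and its inverse. This is a minimal pure-Python illustration on a flat vector, not the authors' implementation; the `cutoff` index, the function names, and the example values are assumptions.

```python
from math import cos, pi, sqrt


def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    return [
        (sqrt(1 / n) if k == 0 else sqrt(2 / n))
        * sum(x[i] * cos(pi / n * (i + 0.5) * k) for i in range(n))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(
            (sqrt(1 / n) if k == 0 else sqrt(2 / n))
            * coeffs[k]
            * cos(pi / n * (i + 0.5) * k)
            for k in range(n)
        )
        for i in range(n)
    ]


def split_momentum(momentum, cutoff):
    """Split a vector into low- and high-frequency parts at index `cutoff`."""
    coeffs = dct(momentum)
    low = idct([c if k < cutoff else 0.0 for k, c in enumerate(coeffs)])
    high = idct([c if k >= cutoff else 0.0 for k, c in enumerate(coeffs)])
    return low, high


# Hypothetical momentum vector; only `high` would be synchronized every H steps.
momentum = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, 3.0]
low, high = split_momentum(momentum, cutoff=2)
```

Because the transform is linear, the two parts sum back to the original momentum, so the split itself loses no information; savings come from exchanging only one part between replicas.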

Authors: Yoojin Hong, Martina Di Paola, Braahmi Padmakumar, Hwi Joon Lee, Mahnoor Shafiq, Joseph Seering

2025-10-03

http://arxiv.org/abs/2510.02759v1

Social media platforms are central to communication, yet their designs remain narrowly focused on engagement and scale. While researchers have proposed alternative visions for online spaces, these ideas are difficult to prototype within platform constraints. In this paper, we introduce a metaphor-driven system to help users imagine and explore new social media environments. The system translates users' metaphors into structured sets of platform features and generates interactive simulations populated with LLM-driven agents. To evaluate this approach, we conducted a study where participants created and interacted with simulated social media spaces. Our findings show that metaphors allow users to express distinct social expectations, and that perceived authenticity of the simulation depended on how well it captured dynamics like intimacy, participation, and temporal engagement. We conclude by discussing how metaphor-driven simulation can be a powerful design tool for prototyping alternative social architectures and expanding the design space for future social platforms.

TokenFlow Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling

Authors: Junyi Chen, Chuheng Du, Renyuan Liu, Shuochao Yao, Dingtian Yan, Jiang Liao, Shengzhong Liu, Fan Wu, Guihai Chen

2025-10-03

http://arxiv.org/abs/2510.02758v1

Real-time LLM interactions demand streamed token generations, where text tokens are progressively generated and delivered to users while balancing two objectives: responsiveness (i.e., low time-to-first-token) and steady generation (i.e., required time-between-tokens). Standard LLM serving systems suffer from the inflexibility caused by non-preemptive request scheduling and reactive memory management, leading to poor resource utilization and low request processing parallelism under request bursts. Therefore, we present TokenFlow, a novel LLM serving system with enhanced text streaming performance via preemptive request scheduling and proactive key-value (KV) cache management. TokenFlow dynamically prioritizes requests based on real-time token buffer occupancy and token consumption rate, while actively transferring KV cache between GPU and CPU memory in the background and overlapping I/O with computation to minimize request preemption overhead. Extensive experiments on Llama3-8B and Qwen2.5-32B across multiple GPUs (RTX 4090, A6000, H200) demonstrate that TokenFlow achieves up to 82.5% higher effective throughput (accounting for actual user consumption) while reducing P99 TTFT by up to 80.2%, without degrading overall token throughput.
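
One way to read "prioritizes requests based on token buffer occupancy and consumption rate" is to serve first the streams whose buffers will run dry soonest. The sketch below illustrates that idea only; the slack heuristic, class, and field names are assumptions rather than TokenFlow's actual policy.

```python
from dataclasses import dataclass


@dataclass
class StreamRequest:
    request_id: str
    buffered_tokens: int   # tokens generated but not yet consumed by the user
    consume_rate: float    # tokens per second the user reads


def slack_seconds(request):
    """Time until the user's buffer runs empty (assumed priority signal)."""
    if request.consume_rate <= 0:
        return float("inf")  # nothing is being consumed, so no urgency
    return request.buffered_tokens / request.consume_rate


def pick_running_set(requests, num_slots):
    """Choose the requests with the least slack; the rest can be preempted."""
    return sorted(requests, key=slack_seconds)[:num_slots]


pending = [
    StreamRequest("a", buffered_tokens=40, consume_rate=10.0),
    StreamRequest("b", buffered_tokens=0, consume_rate=10.0),
    StreamRequest("c", buffered_tokens=100, consume_rate=5.0),
    StreamRequest("d", buffered_tokens=5, consume_rate=0.0),
]
running = [r.request_id for r in pick_running_set(pending, num_slots=2)]
```

In this toy run, the request with an empty buffer is scheduled first, while a request with a large buffer can be paused without the user noticing a stall.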

From Tokens to Nodes Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting

Authors: Jianing Chen, Zehao Li, Yujun Cai, Hao Jiang, Shuqin Gao, Honglong Zhao, Tianlu Mao, Yucheng Zhang

2025-10-03

http://arxiv.org/abs/2510.02732v1

Dynamic 3D reconstruction from monocular videos remains difficult due to the ambiguity in inferring 3D motion from limited views and the computational demands of modeling temporally varying scenes. While recent sparse control methods alleviate computation by reducing millions of Gaussians to thousands of control points, they suffer from a critical limitation: they allocate points purely by geometry, leading to static redundancy and dynamic insufficiency. We propose a motion-adaptive framework that aligns control density with motion complexity. Leveraging semantic and motion priors from vision foundation models, we establish patch-token-node correspondences and apply motion-adaptive compression to concentrate control points in dynamic regions while suppressing redundancy in static backgrounds. Our approach achieves flexible representational density adaptation through iterative voxelization and motion tendency scoring, directly addressing the fundamental mismatch between control point allocation and motion complexity. To capture temporal evolution, we introduce spline-based trajectory parameterization initialized by 2D tracklets, replacing MLP-based deformation fields to achieve smoother motion representation and more stable optimization. Extensive experiments demonstrate significant improvements in reconstruction quality and efficiency over existing state-of-the-art methods.

MALF A Multi-Agent LLM Framework for Intelligent Fuzzing of Industrial Control Protocols

Authors: Bowei Ning, Xuejun Zong, Kan He

2025-10-03

http://arxiv.org/abs/2510.02694v1

Industrial control systems (ICS) are vital to modern infrastructure but increasingly vulnerable to cybersecurity threats, particularly through weaknesses in their communication protocols. This paper presents MALF (Multi-Agent LLM Fuzzing Framework), an advanced fuzzing solution that integrates large language models (LLMs) with multi-agent coordination to identify vulnerabilities in industrial control protocols (ICPs). By leveraging Retrieval-Augmented Generation (RAG) for domain-specific knowledge and QLoRA fine-tuning for protocol-aware input generation, MALF enhances fuzz testing precision and adaptability. The multi-agent framework optimizes seed generation, mutation strategies, and feedback-driven refinement, leading to improved vulnerability discovery. Experiments on protocols like Modbus/TCP, S7Comm, and Ethernet/IP demonstrate that MALF surpasses traditional methods, achieving a test case pass rate (TCPR) of 88-92% and generating more exception triggers (ETN). MALF also maintains over 90% seed coverage and Shannon entropy values between 4.2 and 4.6 bits, ensuring diverse, protocol-compliant mutations. Deployed in a real-world Industrial Attack-Defense Range for power plants, MALF identified critical vulnerabilities, including three zero-day flaws, one confirmed and registered by CNVD. These results validate MALF's effectiveness in real-world fuzzing applications. This research highlights the transformative potential of multi-agent LLMs in ICS cybersecurity, offering a scalable, automated framework that sets a new standard for vulnerability discovery and strengthens critical infrastructure security against emerging threats.

To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration

Authors: Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, Anshumali Shrivastava

2025-10-03

http://arxiv.org/abs/2510.02676v1

The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision floating-point formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an exponent concentration phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from $\alpha$-stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and 177.1% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era.
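
A rough way to see exponent concentration is to measure the entropy of the exponent field directly. The sketch below is an illustration under stated assumptions, not the paper's method: it uses synthetic Gaussian weights instead of trained weights, and the IEEE 754 float32 layout (8 exponent bits) because it is easy to unpack with the standard library. The names `fp32_exponent` and `entropy_bits` are made up for this example.

```python
import math
import random
import struct
from collections import Counter


def fp32_exponent(x: float) -> int:
    """Return the 8-bit biased exponent field of x stored as IEEE 754 float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF


def entropy_bits(symbols) -> float:
    """Empirical Shannon entropy of a sequence of symbols, in bits."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


rng = random.Random(0)
# Synthetic stand-in for a weight tensor (assumption: zero mean, std 0.02).
weights = [rng.gauss(0.0, 0.02) for _ in range(20000)]
exponent_entropy = entropy_bits(fp32_exponent(w) for w in weights)
print(f"exponent entropy: {exponent_entropy:.2f} bits (field width: 8 bits)")
```

Even for this toy distribution the exponent field carries far fewer than its 8 allocated bits of information, which is the kind of slack an entropy-aware encoder can exploit.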

HALO Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference

Authors: Shubham Negi, Kaushik Roy

2025-10-03

http://arxiv.org/abs/2510.02675v1

The rapid adoption of Large Language Models (LLMs) has driven a growing demand for efficient inference, particularly in latency-sensitive applications such as chatbots and personalized assistants. Unlike traditional deep neural networks, LLM inference proceeds in two distinct phases: the prefill phase, which processes the full input sequence in parallel, and the decode phase, which generates tokens sequentially. These phases exhibit highly diverse compute and memory requirements, which makes accelerator design particularly challenging. Prior works have primarily been optimized for high-batch inference or evaluated only short input context lengths, leaving the low-batch and long-context regime, which is critical for interactive applications, largely underexplored. We propose HALO, a heterogeneous memory-centric accelerator designed for these unique challenges of prefill and decode phases in low-batch LLM inference. HALO integrates HBM-based Compute-in-DRAM (CiD) with an on-chip analog Compute-in-Memory (CiM), co-packaged using 2.5D integration. To further improve hardware utilization, we introduce a phase-aware mapping strategy that adapts to the distinct demands of the prefill and decode phases. Compute-bound operations in the prefill phase are mapped to CiM to exploit its high-throughput matrix multiplication capability, while memory-bound operations in the decode phase are executed on CiD to benefit from reduced data movement within DRAM. Additionally, we present an analysis of the performance tradeoffs of LLMs under two architectural extremes: a fully CiD and a fully on-chip analog CiM design to highlight the need for a heterogeneous design. We evaluate HALO on LLaMA-2 7B and Qwen3 8B models. Our experimental results show that LLMs mapped to HALO achieve up to 18x geometric mean speedup over AttAcc, an attention-optimized mapping and 2.5x over CENT, a fully CiD based mapping.
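
The compute-bound versus memory-bound split between the two phases can be illustrated with a back-of-the-envelope arithmetic-intensity estimate for a single linear layer. This is a simplified roofline-style sketch, not HALO's model: it ignores attention, the KV cache, and on-chip reuse, and the layer size and 2-byte element width are assumptions chosen for illustration.

```python
def linear_layer_intensity(tokens: int, d_in: int, d_out: int,
                           bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for Y = X @ W, with X of shape (tokens, d_in)."""
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per MAC
    bytes_moved = bytes_per_elem * (
        tokens * d_in      # read activations
        + d_in * d_out     # read weights
        + tokens * d_out   # write outputs
    )
    return flops / bytes_moved


# Batch size 1: a 2048-token prompt processed at once vs. one token per step.
prefill = linear_layer_intensity(tokens=2048, d_in=4096, d_out=4096)
decode = linear_layer_intensity(tokens=1, d_in=4096, d_out=4096)
print(f"prefill: {prefill:.1f} FLOPs/byte, decode: {decode:.3f} FLOPs/byte")
```

With these numbers the same layer performs roughly a thousand times more arithmetic per byte during the parallel phase than during token-by-token generation, which is why weight traffic dominates the sequential phase at low batch sizes.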

Mind the Gap Linguistic Divergence and Adaptation Strategies in Human-LLM Assistant vs. Human-Human Interactions

Authors: Fulei Zhang, Zhou Yu

2025-10-03

http://arxiv.org/abs/2510.02645v1

As Large Language Models (LLMs) are increasingly deployed in customer-facing applications, a critical yet underexplored question is how users communicate differently with LLM chatbots compared to human agents. In this study, we present empirical evidence that users adopt distinct communication styles when they interact with chatbots versus human agents. Our analysis reveals significant differences in grammatical fluency, politeness, and lexical diversity in user language between the two settings. These findings suggest that models trained exclusively on human-human interaction data may not adequately accommodate the communication style shift that occurs once an LLM chatbot is deployed. To enhance LLM robustness to post-launch communication style changes, we experimented with two strategies: (1) data augmentation during the post-training phase and (2) inference-time user message reformulation. Our results indicate that models trained on stylistically diverse datasets significantly outperform those trained exclusively on original or stylistically uniform datasets, while inference-time reformulation proved less effective. These insights help us to better adapt our models for improved LLM-user interaction experiences.

HyperAdaLoRA Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing Performance

Authors: Hao Zhang, Zhenjia Li, Runfeng Bao, Yifan Gao, Xi Xiao, Bo Huang, Yuhang Wu, Tianyang Wang, Hao Xu

2025-10-03

http://arxiv.org/abs/2510.02630v1

Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), has emerged as a promising approach to fine-tuning large language models (LLMs) while reducing computational and memory overhead. However, LoRA assumes a uniform rank \textit{r} for each incremental matrix, not accounting for the varying significance of weight matrices across different modules and layers. AdaLoRA leverages Singular Value Decomposition (SVD) to parameterize updates and employs pruning of singular values to introduce dynamic rank allocation, thereby enhancing adaptability. However, during training it often suffers from slow convergence and high computational overhead. To address this issue, we propose HyperAdaLoRA, a novel framework that accelerates the convergence of AdaLoRA by leveraging a hypernetwork. Instead of directly optimizing the components of Singular Value Decomposition $(P, \Lambda, Q)$, HyperAdaLoRA employs a hypernetwork based on attention mechanisms to dynamically generate these parameters. By pruning the outputs of the hypernetwork that generates the singular values, dynamic rank allocation is achieved. Comprehensive experiments on various datasets and models demonstrate that our method achieves faster convergence without sacrificing performance. Additionally, further extension experiments on other LoRA-based approaches validate the broad applicability of our method.
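
To make the link between singular values and per-matrix rank concrete, here is a minimal sketch of budgeted rank allocation. It is a simplification under stated assumptions: entries are ranked by plain magnitude, which stands in for whatever importance score the actual methods use, and the matrix names and values are hypothetical.

```python
def allocate_ranks(singular_values: dict[str, list[float]],
                   budget: int) -> dict[str, int]:
    """Keep the `budget` largest-magnitude singular values across all matrices
    and return how many survive in each matrix (its effective rank)."""
    scored = [
        (abs(s), name)
        for name, values in singular_values.items()
        for s in values
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    ranks = {name: 0 for name in singular_values}
    for _, name in scored[:budget]:
        ranks[name] += 1
    return ranks


# Hypothetical diagonal entries of Lambda for three adapted matrices.
example = {
    "layer0.q_proj": [0.9, 0.4, 0.05, 0.01],
    "layer0.v_proj": [0.2, 0.1, 0.02, 0.01],
    "layer1.q_proj": [1.3, 0.8, 0.6, 0.3],
}
ranks = allocate_ranks(example, budget=6)
print(ranks)
```

Because the budget is shared globally, matrices whose singular values are small end up with low or zero rank while more significant matrices keep more directions, which is the non-uniform allocation the abstract contrasts with fixed-rank LoRA.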

ElasticMoE An Efficient Auto Scaling Method for Mixture-of-Experts Models

Authors: Gursimran Singh, Timothy Yu, Haley Li, Cheng Chen, Hanieh Sadri, Qintao Zhang, Yu Zhang, Ying Xiong, Yong Zhang, Zhenan Fan

2025-10-02

http://arxiv.org/abs/2510.02613v1

Mixture-of-Experts (MoE) models promise efficient scaling of large language models (LLMs) by activating only a small subset of experts per token, but their parallelized inference pipelines make elastic serving challenging. Existing strategies fall short: horizontal scaling provisions entire replicas of the current configuration, often tens to hundreds of accelerators, leading to coarse granularity, long provisioning delays, and costly overprovisioning. Vertical scaling offers finer adjustments but typically requires instance restarts, incurring downtime. These limitations make current approaches ill-suited for the bursty, short-lived traffic patterns common in cloud deployments. We present ElasticMoE, an elastic scaling framework for MoE LLMs that achieves fine-grained, low-latency, and zero-downtime scaling. ElasticMoE decouples inference execution from memory operations, enabling scaling steps to proceed concurrently with serving. An HBM Management Module (HMM) reuses weights and KV caches via zero-copy remapping, while high-bandwidth peer-to-peer transfers bring newly added accelerators online without interrupting service. A virtual-memory-based expert redistribution mechanism migrates MoE experts without costly buffer reallocations, reducing peak memory usage during expert parallelism reconfiguration. Our evaluation on Ascend NPUs with three popular MoE LLMs shows that ElasticMoE achieves up to 9x lower scale-up latency, up to 2x better throughput during scaling, and significantly improves SLO attainment compared to baselines. By enabling fine-grained, concurrent scaling with minimal disruption, ElasticMoE advances the practicality of deploying massive MoE LLMs in dynamic cloud environments.

SAGE Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection

Authors: Ashish Jha, Salman Ahmadi-Asl

2025-10-02

http://arxiv.org/abs/2510.02470v1

Training modern neural networks on large datasets is computationally and energy intensive. We present SAGE, a streaming data-subset selection method that maintains a compact Frequent Directions (FD) sketch of gradient geometry in $O(\ell D)$ memory and prioritizes examples whose sketched gradients align with a consensus direction. The approach eliminates $N \times N$ pairwise similarities and explicit $N \times \ell$ gradient stores, yielding a simple two-pass, GPU-friendly pipeline. Leveraging FD's deterministic approximation guarantees, we analyze how agreement scoring preserves gradient energy within the principal sketched subspace. Across multiple benchmarks, SAGE trains with small kept-rate budgets while retaining competitive accuracy relative to full-data training and recent subset-selection baselines, and reduces end-to-end compute and peak memory. Overall, SAGE offers a practical, constant-memory alternative that complements pruning and model compression for efficient training.
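
The agreement-scoring step can be sketched in isolation. The example below is a rough illustration under stated assumptions: it skips the Frequent Directions sketch entirely and operates on small, already-projected gradient vectors, and it takes the consensus direction to be the normalized mean of unit gradients, which the abstract does not specify.

```python
import math


def agreement_select(gradients: list[list[float]], keep: int) -> list[int]:
    """Return indices of the `keep` examples whose gradients have the highest
    cosine similarity with the mean unit-gradient (consensus) direction."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm > 0 else list(v)

    units = [normalize(g) for g in gradients]
    dim = len(units[0])
    consensus = normalize(
        [sum(u[d] for u in units) / len(units) for d in range(dim)]
    )
    scores = [sum(a * b for a, b in zip(u, consensus)) for u in units]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:keep])


# Hypothetical 2-D sketched gradients; example 2 points against the rest.
grads = [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.0], [0.8, -0.1]]
selected = agreement_select(grads, keep=3)
print(selected)
```

Scoring against a single consensus vector costs one dot product per example, which is how this style of selection avoids the $N \times N$ pairwise similarity matrix mentioned in the abstract.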

KaVa Latent Reasoning via Compressed KV-Cache Distillation

Authors: Anna Kuzina, Maciej Pioro, Paul N. Whatmough, Babak Ehteshami Bejnordi

2025-10-02

http://arxiv.org/abs/2510.02312v1

Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought process, but it suffers from a critical lack of supervision, limiting its effectiveness on complex, natural-language reasoning traces. In this work, we propose KaVa, the first framework that bridges this gap by distilling knowledge directly from a compressed KV-cache of the teacher into a latent-reasoning student via self-distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories. We show that the abstract, unstructured knowledge within compressed KV-cache, which lacks direct token correspondence, can serve as a rich supervisory signal for a latent reasoning student. Empirically, the approach consistently outperforms strong latent baselines, exhibits markedly smaller degradation from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency. These results establish compressed KV-cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of CoT-trained teachers with the efficiency and deployability of latent inference.

VideoNSA Native Sparse Attention Scales Video Understanding

Authors: Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu

2025-10-02

http://arxiv.org/abs/2510.02295v1

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined attention helps induce dynamic attention sinks.

Self-Forcing++ Towards Minute-Scale High-Quality Video Generation

Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh

2025-10-02

http://arxiv.org/abs/2510.02283v1

Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond the teacher's capability, avoiding common issues such as over-exposure and error accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon video demo can be found at https://self-forcing-plus-plus.github.io/

From Frames to Clips Efficient Key Clip Selection for Long-Form Video Understanding

Authors: Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, Garin Kessler

2025-10-02

http://arxiv.org/abs/2510.02262v1

Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision language tasks, yet their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. Existing solutions alleviate this issue by selecting a set of frames, thereby reducing token count, but such frame-wise selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity. In this work we systematically explore the impact of temporal information and demonstrate that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on Video-MME, LongVideoBench and MLVU benchmarks, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling Video LLMs to real-world video understanding applications. Project webpage is available at https://guangyusun.com/f2c .
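
The trade-off between clip length and spatial resolution under a fixed token count reduces to simple arithmetic. The sketch below is only an illustration of that constraint, not the paper's adaptive strategy; the budget of 8192 tokens and the count of 8 clips are hypothetical.

```python
def tokens_per_frame(budget: int, num_clips: int, clip_len: int) -> int:
    """Largest per-frame token count that keeps num_clips * clip_len frames
    within the total visual-token budget."""
    frames = num_clips * clip_len
    if frames <= 0:
        raise ValueError("need at least one frame")
    return budget // frames


budget = 8192  # hypothetical visual-token budget per video
for clip_len in (1, 2, 4, 8):
    per_frame = tokens_per_frame(budget, num_clips=8, clip_len=clip_len)
    total = 8 * clip_len * per_frame
    print(f"clip_len={clip_len} tokens/frame={per_frame} total={total}")
```

Doubling the clip length halves the tokens available per frame, so longer clips buy temporal context at the price of spatial detail while the total stays constant.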

Contrastive Retrieval Heads Improve Attention-Based Re-Ranking

Authors: Linh Tran, Yulong Li, Radu Florian, Wei Sun

2025-10-02

http://arxiv.org/abs/2510.02219v1

The strong zero-shot and long-context capabilities of recent Large Language Models (LLMs) have paved the way for highly effective re-ranking systems. Attention-based re-rankers leverage attention weights from transformer heads to produce relevance scores, but not all heads are created equal: many contribute noise and redundancy, thus limiting performance. To address this, we introduce CoRe heads, a small set of retrieval heads identified via a contrastive scoring metric that explicitly rewards heads whose high attention correlates with relevant documents, while downplaying heads whose high attention correlates with irrelevant documents. This relative ranking criterion isolates the most discriminative heads for re-ranking and yields a state-of-the-art list-wise re-ranker. Extensive experiments with three LLMs show that aggregated signals from CoRe heads, constituting less than 1% of all heads, substantially improve re-ranking accuracy over strong baselines. We further find that CoRe heads are concentrated in middle layers, and pruning the computation of the final 50% of model layers preserves accuracy while significantly reducing inference time and memory usage.

MMDEW Multipurpose Multiclass Density Estimation in the Wild

Authors: Villanelle O'Reilly, Jonathan Cox, Georgios Leontidis, Marc Hanheide, Petra Bosilj, James Brown

2025-10-02

http://arxiv.org/abs/2510.02213v1

Density map estimation can be used to estimate object counts in dense and occluded scenes where discrete counting-by-detection methods fail. We propose a multicategory counting framework that leverages a Twins pyramid vision-transformer backbone and a specialised multi-class counting head built on a state-of-the-art multiscale decoding approach. A two-task design adds a segmentation-based Category Focus Module, suppressing inter-category cross-talk at training time. Training and evaluation on the VisDrone and iSAID benchmarks demonstrate superior performance versus prior multicategory crowd-counting approaches (33%, 43% and 64% reduction in MAE), and the comparison with YOLOv11 underscores the necessity of crowd counting methods in dense scenes. The method's regional loss opens up multi-class crowd counting to new domains, demonstrated through the application to a biodiversity monitoring dataset, highlighting its capacity to inform conservation efforts and enable scalable ecological insights.

UpSafe$^\circ$C Upcycling for Controllable Safety in Large Language Models

Authors: Yuhao Sun, Zhuoer Xu, Shiwen Cui, Kun Yang, Lingyun Yu, Yongdong Zhang, Hongtao Xie

2025-10-02

http://arxiv.org/abs/2510.02194v1

Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques -- including external guardrails, inference-time guidance, and post-training alignment -- each face limitations in balancing safety, utility, and controllability. In this work, we propose UpSafe$^\circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our approach first identifies safety-critical layers and upcycles them into a sparse Mixture-of-Experts (MoE) structure, where the router acts as a soft guardrail that selectively activates original MLPs and added safety experts. We further introduce a two-stage SFT strategy to strengthen safety discrimination while preserving general capabilities. To enable flexible control at inference time, we introduce a safety temperature mechanism, allowing dynamic adjustment of the trade-off between safety and utility. Experiments across multiple benchmarks, base models, and model scales demonstrate that UpSafe$^\circ$C achieves robust safety improvements against harmful and jailbreak inputs, while maintaining competitive performance on general tasks. Moreover, analysis shows that safety temperature provides fine-grained inference-time control that achieves the Pareto-optimal frontier between utility and safety. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.

Authors: Xiangyu Shi, Marco Chiesa, Gerald Q. Maguire Jr., Dejan Kostic

2025-10-02

http://arxiv.org/abs/2510.03346v1

Large Language Models (LLMs) are increasingly deployed in multi-agent systems, where effective inter-model communication is crucial. Existing communication protocols either rely on natural language, incurring high inference costs and information loss, or on hidden states, which suffer from information concentration bias and inefficiency. To address these limitations, we propose KVComm, a novel communication framework that enables efficient communication between LLMs through selective sharing of KV pairs. KVComm leverages the rich information encoded in the KV pairs while avoiding the pitfalls of hidden states. We introduce a KV layer-wise selection strategy based on attention importance scores with a Gaussian prior to identify the most informative KV pairs for communication. Extensive experiments across diverse tasks and model pairs demonstrate that KVComm achieves comparable performance to the upper-bound method, which directly merges inputs to one model without any communication, while transmitting as few as 30% of layers' KV pairs. Our study highlights the potential of KV pairs as an effective medium for inter-LLM communication, paving the way for scalable and efficient multi-agent systems.
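
The layer-wise selection described above can be sketched as a scoring-and-top-k step. This is a guess at the general shape rather than the paper's procedure: the abstract does not say how the Gaussian prior is combined with the importance scores, so the multiplicative weighting, the `center` and `sigma` parameters, and the score values are all assumptions.

```python
import math


def select_layers(importance: list[float], fraction: float,
                  center: float, sigma: float) -> list[int]:
    """Weight per-layer importance by a Gaussian prior over the layer index
    and return the indices of the top `fraction` of layers."""
    weighted = [
        score * math.exp(-((idx - center) ** 2) / (2 * sigma ** 2))
        for idx, score in enumerate(importance)
    ]
    keep = max(1, round(fraction * len(importance)))
    order = sorted(range(len(weighted)), key=lambda i: weighted[i], reverse=True)
    return sorted(order[:keep])


# Hypothetical attention-importance scores for a 10-layer model.
scores = [0.2, 0.3, 0.5, 0.9, 1.0, 0.8, 0.4, 0.3, 0.2, 0.6]
chosen = select_layers(scores, fraction=0.3, center=4.0, sigma=2.0)
print(chosen)
```

In this toy run the prior suppresses the moderately scored last layer in favor of layers near the assumed center, so only three of ten layers would be shared.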

SoundReactor Frame-level Online Video-to-Audio Generation

Authors: Koichi Saito, Julian Tanke, Christian Simon, Masato Ishii, Kazuki Shimada, Zachary Novack, Zhi Zhong, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji

2025-10-02

http://arxiv.org/abs/2510.02110v1

Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end-to-end causality and targets low per-frame latency with audio-visual synchronization. Our model's backbone is a decoder-only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, which are aggregated into a single token per frame to maintain end-to-end causality and efficiency. The model is trained through diffusion pre-training followed by consistency fine-tuning to accelerate the diffusion head decoding. On a benchmark of diverse gameplay videos from AAA titles, our model successfully generates semantically and temporally aligned, high-quality full-band stereo audio, validated by both objective and human evaluations. Furthermore, our model achieves low per-frame waveform-level latency (26.3ms with the head NFE=1, 31.5ms with NFE=4) on 30FPS, 480p videos using a single H100. Demo samples are available at https://koichi-saito-sony.github.io/soundreactor/.

Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning

Authors: Xinyuan Song, Keyu Wang, PengXiang Li, Lu Yin, Shiwei Liu

2025-10-02

http://arxiv.org/abs/2510.02091v1

Recent studies suggest that the deeper layers of Large Language Models (LLMs) contribute little to representation learning and can often be removed without significant performance loss. However, such claims are typically drawn from narrow evaluations and may overlook important aspects of model behavior. In this work, we present a systematic study of depth utilization across diverse dimensions, including evaluation protocols, task categories, and model architectures. Our analysis confirms that very deep layers are generally less effective than earlier ones, but their contributions vary substantially with the evaluation setting. Under likelihood-based metrics without generation, pruning most layers preserves performance, with only the initial few being critical. By contrast, generation-based evaluation uncovers indispensable roles for middle and deeper layers in enabling reasoning and maintaining long-range coherence. We further find that knowledge and retrieval are concentrated in shallow components, whereas reasoning accuracy relies heavily on deeper layers -- yet can be reshaped through distillation. These results highlight that depth usage in LLMs is highly heterogeneous and context-dependent, underscoring the need for task-, metric-, and model-aware perspectives in both interpreting and compressing large models.

LLM-Based Multi-Task Bangla Hate Speech Detection Type, Severity, and Target

Authors: Md Arid Hasan, Firoj Alam, Md Fahad Hossain, Usman Naseem, Syed Ishtiaque Ahmed

2025-10-02

http://arxiv.org/abs/2510.01995v1

Online social media platforms are central to everyday communication and information seeking. While these platforms serve positive purposes, they also provide fertile ground for the spread of hate speech, offensive language, and bullying content targeting individuals, organizations, and communities. Such content undermines safety, participation, and equity online. Reliable detection systems are therefore needed, especially for low-resource languages where moderation tools are limited. In Bangla, prior work has contributed resources and models, but most are single-task (e.g., binary hate/offense) with limited coverage of multi-facet signals (type, severity, target). We address these gaps by introducing the first multi-task Bangla hate-speech dataset, BanglaMultiHate, one of the largest manually annotated corpora to date. Building on this resource, we conduct a comprehensive, controlled comparison spanning classical baselines, monolingual pretrained models, and LLMs under zero-shot prompting and LoRA fine-tuning. Our experiments assess LLM adaptability in a low-resource setting and reveal a consistent trend: although LoRA-tuned LLMs are competitive with BanglaBERT, culturally and linguistically grounded pretraining remains critical for robust performance. Together, our dataset and findings establish a stronger benchmark for developing culturally aligned moderation tools in low-resource contexts. For reproducibility, we will release the dataset and all related scripts.

Authors: Yongyi Su, Haojie Zhang, Shijie Li, Nanqing Liu, Jingyi Liao, Junyi Pan, Yuan Liu, Xiaofen Xing, Chong Sun, Chen Li, Nancy F. Chen, Shuicheng Yan, Xulei Yang, Xun Xu

2025-10-02

http://arxiv.org/abs/2510.01954v1

Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with the LLM's output textual tokens. A lightweight decoder then transforms the LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks show that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLMs. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.

MelCap A Unified Single-Codebook Neural Codec for High-Fidelity Audio Compression

Authors: Jingyi Li, Zhiyuan Zhao, Yunfei Liu, Lijian Lin, Ye Zhu, Jiahao Wu, Qiuqiang Kong, Yu Li

2025-10-02

http://arxiv.org/abs/2510.01903v1

Neural audio codecs have recently emerged as powerful tools for high-quality and low-bitrate audio compression, leveraging deep generative models to learn latent representations of audio signals. However, existing approaches either rely on a single quantizer that only processes the speech domain, or on multiple quantizers that are not well suited for downstream tasks. To address this issue, we propose MelCap, a unified "one-codebook-for-all" neural codec that effectively handles speech, music, and general sound. By decomposing audio reconstruction into two stages, our method preserves more acoustic details than previous single-codebook approaches, while achieving performance comparable to mainstream multi-codebook methods. In the first stage, audio is transformed into mel-spectrograms, which are compressed and quantized into compact single tokens using a 2D tokenizer. A perceptual loss is further applied to mitigate the over-smoothing artifacts observed in spectrogram reconstruction. In the second stage, a Vocoder recovers waveforms from the mel discrete tokens in a single forward pass, enabling real-time decoding. Both objective and subjective evaluations demonstrate that MelCap achieves quality comparable to state-of-the-art multi-codebook codecs, while retaining the computational simplicity of a single-codebook design, thereby providing an effective representation for downstream tasks.

HRTFformer A Spatially-Aware Transformer for Personalized HRTF Upsampling in Immersive Audio Rendering

Authors: Xuyi Hu, Jian Li, Shaojie Zhang, Stefan Goetz, Lorenzo Picinali, Ozgur B. Akan, Aidan O. T. Hogg

2025-10-02

http://arxiv.org/abs/2510.01891v1

Personalized Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating personalized HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing the number of measurements required. While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range spatial consistency and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbor dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model surpasses leading methods by a substantial margin in generating realistic, high-fidelity HRTFs.

Accelerating Attention with Basis Decomposition

Authors: Jialin Zhao

2025-10-02

http://arxiv.org/abs/2510.01718v1

Attention is a core operation in large language models (LLMs) and vision-language models (VLMs). We present BD Attention (BDA), the first lossless algorithmic reformulation of attention. BDA is enabled by a simple matrix identity from Basis Decomposition (BD), which restructures multi-head projections into a compact form while preserving exact outputs. Unlike I/O-aware system optimizations such as FlashAttention, BDA provides a mathematically guaranteed acceleration that is architecture-agnostic. On DeepSeek-V2-Lite (16B, FP16), BDA requires only 4s of offline preparation with no retraining required and, on modern GPUs, achieves 32% faster key/value projections and 25% smaller weights, while increasing end-to-end perplexity (PPL) by just 0.02% (FP16) or 0.0004% (FP32), a negligible effect on model performance. These results position BDA as the first theoretically exact method for lossless attention acceleration that is complementary to existing engineering-level optimizations. Our code is available at https://github.com/abcbdf/basis-decomposition-official.

TalkPlay-Tools Conversational Music Recommendation with LLM Tool Calling

Authors: Seungheon Doh, Keunwoo Choi, Juhan Nam

2025-10-02

http://arxiv.org/abs/2510.01698v3

While the recent developments in large language models (LLMs) have successfully enabled generative recommenders with natural language interactions, their recommendation behavior is limited, leaving other simpler yet crucial components such as metadata or attribute filtering underutilized in the system. We propose an LLM-based music recommendation system with tool calling to serve as a unified retrieval-reranking pipeline. Our system positions an LLM as an end-to-end recommendation system that interprets user intent, plans tool invocations, and orchestrates specialized components: boolean filters (SQL), sparse retrieval (BM25), dense retrieval (embedding similarity), and generative retrieval (semantic IDs). Through tool planning, the system predicts which types of tools to use, their execution order, and the arguments needed to find music matching user preferences, supporting diverse modalities while seamlessly integrating multiple database filtering methods. We demonstrate that this unified tool-calling framework achieves competitive performance across diverse recommendation scenarios by selectively employing appropriate retrieval methods based on user queries, envisioning a new paradigm for conversational music recommendation systems.

ENLighten Lighten the Transformer, Enable Efficient Optical Acceleration

Authors: Hanqing Zhu, Zhican Zhou, Shupeng Ning, Xuhao Wu, Ray Chen, Yating Wan, David Pan

2025-10-02

http://arxiv.org/abs/2510.01673v1

Photonic computing has emerged as a promising substrate for accelerating the dense linear-algebra operations at the heart of AI, yet adoption for large Transformer models remains in its infancy. We identify two bottlenecks: (1) costly electro--optic conversions and data-movement overheads that erode energy efficiency as model sizes scale; (2) a mismatch between limited on-chip photonic resources and Transformer scale, which forces frequent reuse of photonic tensor cores and dilutes throughput gains. To address these challenges, we introduce a hardware--software co-design framework. First, we propose \texttt{Lighten}, a PTC-aware flow that post-hoc decomposes each Transformer weight matrix into a low-rank component plus a structured-sparse component aligned to photonic tensor-core granularity, without lengthy retraining. Second, we present \texttt{ENLighten}, a reconfigurable photonic accelerator with dynamically adaptive tensor cores, driven by broadband light redistribution, enabling fine-grained sparsity support and full power gating of inactive parts. On ImageNet, \texttt{Lighten} prunes a Base-scale Vision Transformer by 50\% with $\approx$ 1\% accuracy drop after only 3 epochs (about 1 hour) of fine-tuning. Deployed on \texttt{ENLighten}, it achieves a $2.5\times$ improvement in energy--delay product over the state-of-the-art photonic Transformer accelerator.
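
To make the low-rank plus structured-sparse split concrete, here is a minimal NumPy sketch. It is not the paper's algorithm: the truncated SVD, the tile-energy selection rule, and the names `lowrank_plus_block_sparse`, `rank`, `block`, and `keep_frac` are all assumptions chosen for illustration, and the tile size only stands in for a tensor-core granularity.

```python
import numpy as np

def lowrank_plus_block_sparse(W, rank, block, keep_frac):
    """Approximate W as L + S: L has rank <= `rank`, S is block-sparse.

    Assumes both dimensions of W are divisible by `block` (hypothetical
    stand-in for a hardware tile size).
    """
    # Low-rank part from a truncated SVD.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]

    # Residual that the low-rank part fails to capture.
    R = W - L
    rows, cols = W.shape
    br, bc = rows // block, cols // block

    # View the residual as a grid of (block x block) tiles and score
    # each tile by its squared Frobenius norm.
    tiles = R.reshape(br, block, bc, block)
    energy = (tiles ** 2).sum(axis=(1, 3))

    # Keep only the highest-energy tiles; zero out all the others.
    k = max(1, int(round(keep_frac * br * bc)))
    thresh = np.sort(energy.ravel())[-k]
    mask = energy >= thresh
    S = (tiles * mask[:, None, :, None]).reshape(rows, cols)
    return L, S, mask
```

Because whole tiles are either kept or dropped, the residual term never has a larger error than the low-rank part alone, and the kept tiles line up with a fixed grid, which is the property a tile-based accelerator would need.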

Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value

Authors: Wangxuan Fan, Ching Wang, Siqi Li, Nan Liu

2025-10-02

http://arxiv.org/abs/2510.01663v1

For many real-world applications, understanding feature-outcome relationships is as crucial as achieving high predictive accuracy. While traditional neural networks excel at prediction, their black-box nature obscures underlying functional relationships. Kolmogorov--Arnold Networks (KANs) address this by employing learnable spline-based activation functions on edges, enabling recovery of symbolic representations while maintaining competitive performance. However, KAN's architecture presents unique challenges for network pruning. Conventional magnitude-based methods become unreliable due to sensitivity to input coordinate shifts. We propose \textbf{ShapKAN}, a pruning framework using Shapley value attribution to assess node importance in a shift-invariant manner. Unlike magnitude-based approaches, ShapKAN quantifies each node's actual contribution, ensuring consistent importance rankings regardless of input parameterization. Extensive experiments on synthetic and real-world datasets demonstrate that ShapKAN preserves true node importance while enabling effective network pruning. Our approach improves KAN's interpretability advantages, facilitating deployment in resource-constrained environments.
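
As background for the attribution step, the sketch below computes exact Shapley values for a tiny set of "nodes" by enumerating coalitions. It illustrates the general Shapley formula only, not ShapKAN itself: the function name `shapley_values`, the node labels, and the toy value function `toy_value` are assumptions made up for this example.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a float."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            # Weight of a coalition of size r that does not contain p.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def toy_value(coalition):
    """Hypothetical score: a and b contribute alone and also interact; c is inert."""
    v = 0.0
    if "a" in coalition:
        v += 1.0
    if "b" in coalition:
        v += 2.0
    if "a" in coalition and "b" in coalition:
        v += 2.0
    return v

phi = shapley_values(["a", "b", "c"], toy_value)
```

The interaction term is shared equally between `a` and `b`, the inert node `c` receives zero, and the attributions sum to the value of the full coalition, which is what makes a score of this kind usable as a ranking for pruning.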

Asymmetric Proximal Policy Optimization mini-critics boost LLM reasoning

Authors: Jiashun Liu, Johan Obando-Ceron, Han Lu, Yancheng He, Weixun Wang, Wenbo Su, Bo Zheng, Pablo Samuel Castro, Aaron Courville, Ling Pan

2025-10-02

http://arxiv.org/abs/2510.01656v1

Most recent RL for LLMs (RL4LLM) methods avoid explicit critics, replacing them with average advantage baselines. This shift is largely pragmatic: conventional value functions are computationally expensive to train at scale and often fail under sparse rewards and long reasoning horizons. We revisit this bottleneck from an architectural perspective and introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critic's role while remaining efficient in large-model settings. AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards. This design encourages diversity while preserving calibration, reducing value-estimation bias. Beyond robust estimation, AsyPPO leverages inter-critic uncertainty to refine the policy update: (i) masking advantages in states where critics agree and gradients add little learning signal, and (ii) filtering high-divergence states from entropy regularization, suppressing spurious exploration. After training on open-source data with only 5,000 samples, AsyPPO consistently improves learning stability and performance across multiple benchmarks over strong baselines, such as GRPO, achieving performance gains of more than six percent on Qwen3-4b-Base and about three percent on Qwen3-8b-Base and Qwen3-14b-Base over classic PPO, without additional tricks. These results highlight the importance of architectural innovations for scalable, efficient algorithms.
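
The two uses of inter-critic uncertainty can be pictured with a small NumPy sketch. This is a toy reading of the abstract, not the AsyPPO implementation: measuring disagreement by the standard deviation across critics, the quantile cutoffs `agree_frac` and `diverge_frac`, and the function name `critic_masks` are all assumptions.

```python
import numpy as np

def critic_masks(values, agree_frac=0.2, diverge_frac=0.2):
    """Build per-state masks from an ensemble of value estimates.

    values: array of shape (num_critics, num_states).
    Returns the ensemble mean, a mask for the advantage term, and a
    mask for the entropy term.
    """
    mean = values.mean(axis=0)
    # Disagreement between critics, one number per state (assumed metric).
    spread = values.std(axis=0)

    lo = np.quantile(spread, agree_frac)
    hi = np.quantile(spread, 1.0 - diverge_frac)

    # (i) Drop advantages where the critics agree most closely.
    adv_mask = spread > lo
    # (ii) Drop the most divergent states from the entropy bonus.
    ent_mask = spread < hi
    return mean, adv_mask, ent_mask
```

In a training loop the masks would multiply the per-state advantage and entropy terms before averaging. The cutoffs here are fixed fractions chosen purely for illustration.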

The Unseen Frontier Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM

Authors: Kwanhee Lee, Hyeondo Jang, Dongyeop Lee, Dan Alistarh, Namhoon Lee

2025-10-02

http://arxiv.org/abs/2510.01650v1

Neural network pruning is a promising technique to mitigate the excessive computational and memory requirements of large language models (LLMs). Despite its promise, however, progress in this area has diminished, as conventional methods are seemingly unable to surpass moderate sparsity levels (50-60%) without severely degrading model accuracy. This work breaks through the current impasse, presenting a principled and effective sparsification method called $\texttt{Elsa}$, which achieves extreme sparsity levels of up to 90% while retaining high model fidelity. This is done by identifying several limitations in current practice, all of which can be traced back to their reliance on a surrogate objective formulation. $\texttt{Elsa}$ tackles this issue directly and effectively via standard and well-established constrained optimization techniques based on ADMM. Our extensive experiments across a wide range of models and scales show that $\texttt{Elsa}$ achieves substantial improvements over existing methods; e.g., it achieves 7.8$\times$ lower perplexity than the best existing method on LLaMA-2-7B at 90% sparsity. Furthermore, we present $\texttt{Elsa}_{\text{-L}}$, a quantized variant that scales to extremely large models (27B), and establish its theoretical convergence guarantees. These results highlight meaningful progress in advancing the frontier of LLM sparsity, while promising that significant opportunities for further advancement may remain in directions that have so far attracted limited exploration.
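
For readers unfamiliar with ADMM under a sparsity constraint, the following NumPy sketch shows the standard three-step splitting on a toy problem. It is not Elsa: a least-squares loss is used here only as a convenient stand-in for the real training objective, and `project_topk`, `admm_sparse_lstsq`, `rho`, and `iters` are hypothetical names and settings.

```python
import numpy as np

def project_topk(M, k):
    """Keep the k largest-magnitude entries of M and zero the rest."""
    Z = np.zeros_like(M)
    if k > 0:
        idx = np.argpartition(np.abs(M).ravel(), -k)[-k:]
        Z.ravel()[idx] = M.ravel()[idx]
    return Z

def admm_sparse_lstsq(X, Y, k, rho=1.0, iters=200):
    """Toy ADMM for min 0.5*||XW - Y||^2 subject to W having <= k nonzeros."""
    d = X.shape[1]
    G = X.T @ X + rho * np.eye(d)   # system matrix reused by every W-update
    XtY = X.T @ Y
    W = np.linalg.solve(G, XtY)
    Z = project_topk(W, k)          # sparse copy of the weights
    U = np.zeros_like(W)            # scaled dual variable
    for _ in range(iters):
        # W-update: smooth loss plus quadratic pull toward Z - U.
        W = np.linalg.solve(G, XtY + rho * (Z - U))
        # Z-update: Euclidean projection onto the sparsity constraint.
        Z = project_topk(W + U, k)
        # Dual update: accumulate the remaining gap between W and Z.
        U = U + W - Z
    return Z
```

The returned `Z` satisfies the sparsity budget exactly at every iteration. Since the constraint set is non-convex, this toy version carries no convergence guarantee.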

Support Basis Fast Attention Beyond Bounded Entries

Authors: Maryam Aliakbarpour, Vladimir Braverman, Junze Yin, Haochen Zhang

2025-10-02

http://arxiv.org/abs/2510.01643v1

The quadratic complexity of softmax attention remains a central bottleneck in scaling large language models (LLMs). [Alman and Song, NeurIPS 2023] proposed a sub-quadratic attention approximation algorithm, but it works only under the restrictive bounded-entry assumption. Since this assumption rarely holds in practice, its applicability to modern LLMs is limited. In this paper, we introduce support-basis decomposition, a new framework for efficient attention approximation beyond bounded entries. We empirically demonstrate that the entries of the query and key matrices exhibit sub-Gaussian behavior. Our approach uses this property to split large and small entries, enabling exact computation on sparse components and polynomial approximation on dense components. We establish rigorous theoretical guarantees, proving a sub-quadratic runtime, and extend the method to a multi-threshold setting that eliminates all distributional assumptions. Furthermore, we provide the first theoretical justification for the empirical success of polynomial attention [Kacham, Mirrokni, and Zhong, ICML 2024], showing that softmax attention can be closely approximated by a combination of multiple polynomial attentions with sketching.
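
The reason for splitting large from small entries can be seen in a small NumPy sketch. This is an illustration of the idea only, not the paper's algorithm: it thresholds a score matrix directly rather than the query and key matrices, it makes no attempt at sub-quadratic runtime, and `taylor_exp`, `split_exp`, `threshold`, and `degree` are assumed names.

```python
import numpy as np
from math import factorial

def taylor_exp(x, degree):
    """Degree-`degree` Taylor polynomial of exp around 0, applied entrywise."""
    return sum(x ** j / factorial(j) for j in range(degree + 1))

def split_exp(S, threshold, degree):
    """Approximate exp(S): polynomial on small entries, exact on large ones."""
    large = np.abs(S) > threshold
    # A low-degree polynomial is accurate where entries are bounded.
    out = taylor_exp(S, degree)
    # The few large entries are handled exactly instead.
    out[large] = np.exp(S[large])
    return out, large

S = np.array([[0.1, -0.5],
              [0.9, 6.0]])
approx, large = split_exp(S, threshold=1.0, degree=4)
```

On this example the single outlier entry is computed exactly, while the polynomial alone would be far off at that entry. That gap is what a bounded-entry assumption hides.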