2025-10-09
Table of Contents
- Multi-Segment Photonic Power Converters for Energy Harvesting and High-Speed Optical Wireless Communication
- VecInfer Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
- lm-Meter Unveiling Runtime Inference Latency for On-Device Language Models
- Downsized and Compromised? Assessing the Faithfulness of Model Compression
- Influence Functions for Efficient Data Selection in Reasoning
- Sample Smart, Not Hard Correctness-First Decoding for Better Reasoning in LLMs
- Diffusion-Based Image Editing for Breaking Robust Watermarks
- Training-Free Time Series Classification via In-Context Reasoning with LLM Agents
- QE Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection
- BioAutoML-NAS An End-to-End AutoML Framework for Multimodal Insect Classification via Neural Architecture Search on Large-Scale Biodiversity Data
- Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input
- Flow4Agent Long-form Video Understanding via Motion Prior from Optical Flow
- Rasterized Steered Mixture of Experts for Efficient 2D Image Regression
- OneVision An End-to-End Generative Framework for Multi-view E-commerce Vision Search
- Communication Enables Cooperation in LLM Agents A Comparison with Curriculum-Based Approaches
- Federated Split Learning for Resource-Constrained Robots in Industrial IoT Framework Comparison, Optimization Strategies, and Future Directions
- Uncovering Representation Bias for Investment Decisions in Open-Source Large Language Models
- DecEx-RAG Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision
- Teaching Machines to Speak Using Articulatory Control
- In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
- Deciphering Invariant Feature Decoupling in Source-free Time Series Forecasting with Proxy Denoising
- H1B-KV Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference
- ARMOR High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization
- CAM A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension
- LANTERN Scalable Distillation of Large Language Models for Job-Person Fit and Explanation
- AMAQ Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning
- Model-based Deep Learning for Joint RIS Phase Shift Compression and WMMSE Beamforming
- Draft, Verify, and Improve Toward Training-Aware Speculative Decoding
- Scalable In-context Ranking with Generative Models
- KVLinC KV Cache Quantization with Hadamard Rotation and Linear Correction
- WeatherArchive-Bench Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives
- DP-Adam-AC Privacy-preserving Fine-Tuning of Localizable Language Models Using Adam Optimization with Adaptive Clipping
- Stratum System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving
- Boomerang Distillation Enables Zero-Shot Model Size Interpolation
- SSDD Single-Step Diffusion Decoder for Efficient Image Tokenization
- Bidirectional Mammogram View Translation with Column-Aware and Implicit 3D Conditional Diffusion
- ParallelBench Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs
- Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models
- Multilingual Routing in Mixture-of-Experts
- The R(1)W(1) Communication Model for Self-Stabilizing Distributed Algorithms
- A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification
- Compressed Concatenation of Small Embedding Models
- FedSRD Sparsify-Reconstruct-Decompose for Communication-Efficient Federated Large Language Models Fine-Tuning
- Language Model Based Text-to-Audio Generation Anti-Causally Aligned Collaborative Residual Transformers
- LaDiR Latent Diffusion Enhances LLMs for Text Reasoning
- COSMIR Chain Orchestrated Structured Memory for Iterative Reasoning over Long Context
- Multi-Agent Collaborative Intelligence Dual-Dial Control for Reliable LLM Reasoning
- Compressed Convolutional Attention Efficient Attention in a Compressed Latent Space
- REAR Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization
- Speculative Actions A Lossless Framework for Faster Agentic Systems
- SliceMoE Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling
- Doctor-R1 Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
- Don't Pass A Bayesian Framework for Large Language Model Evaluation
- Scaling Sequence-to-Sequence Generative Neural Rendering
- Let Features Decide Their Own Solvers Hybrid Feature Caching for Diffusion Transformers
- PatternKV Flattening KV Representation Expands Quantization Headroom
- Emergent Coordination in Multi-Agent Language Models
- Beyond Next-Token Prediction A Performance Characterization of Diffusion versus Autoregressive Language Models
- MoME Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
- Can Linear Probes Measure LLM Uncertainty?
- Enhancing Fake News Video Detection via LLM-Driven Creative Process Simulation
- Fit Pixels, Get Labels Meta-learned Implicit Networks for Image Segmentation
- Simulating and Understanding Deceptive Behaviors in Long-Horizon Interactions
- Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs
- SPEAR Soft Prompt Enhanced Anomaly Recognition for Time Series Data
- Sliding Window Attention for Learned Video Compression
- Multi-Agent Code-Orchestrated Generation for Reliable Infrastructure-as-Code
- NoTVLA Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation
- DHQA-4D Perceptual Quality Assessment of Dynamic 4D Digital Human
- Algorithm Generation via Creative Ideation
- Small Language Models for Agentic Systems A Survey of Architectures, Capabilities, and Deployment Trade offs
- TROLL Trust Regions improve Reinforcement Learning for Large Language Models
- MambaCAFU Hybrid Multi-Scale and Multi-Attention Model with Mamba-Based Fusion for Medical Image Segmentation
- You Have Been LaTeXpOsEd A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models
- EvoEngineer Mastering Automated CUDA Kernel Code Evolution with Large Language Models
- Token Hidden Reward Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
- Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders
- Decoupling Task-Solving and Output Formatting in LLM Generation
- FieldFormer Physics-Informed Transformers for Spatio-Temporal Field Reconstruction from Sparse Sensors
- Reactive Transformer (RxT) -- Stateful Real-Time Processing for Event-Driven Reactive Language Models
- From Scope to Script An Automated Report Generation Model for Gastrointestinal Endoscopy
- Harnessing the XMM-Newton data X-ray spectral modelling of 4XMM-DR11 detections and 4XMM-DR11s sources
- Cache-to-Cache Direct Semantic Communication Between Large Language Models
- Coevolutionary Continuous Discrete Diffusion Make Your Diffusion Language Model a Latent Reasoner
- FocusAgent Simple Yet Effective Ways of Trimming the Large Context of Web Agents
- OpenZL A Graph-Based Model for Compression
- Improving Cooperation in Collaborative Embodied AI
- CHORD Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration
- Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders
- TridentServe A Stage-level Serving System for Diffusion Pipelines
- FlexiQ Adaptive Mixed-Precision Quantization for Latency/Accuracy Trade-Offs in Deep Neural Networks
- Distributed Low-Communication Training with Decoupled Momentum Optimization
- Prototyping Digital Social Spaces through Metaphor-Driven Design Translating Spatial Concepts into an Interactive Social Simulation
- TokenFlow Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling
- From Tokens to Nodes Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting
- MALF A Multi-Agent LLM Framework for Intelligent Fuzzing of Industrial Control Protocols
- To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration
- HALO Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference
- Mind the Gap Linguistic Divergence and Adaptation Strategies in Human-LLM Assistant vs. Human-Human Interactions
- HyperAdaLoRA Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing Performance
- ElasticMoE An Efficient Auto Scaling Method for Mixture-of-Experts Models
- SAGE Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection
- KaVa Latent Reasoning via Compressed KV-Cache Distillation
- VideoNSA Native Sparse Attention Scales Video Understanding
- Self-Forcing++ Towards Minute-Scale High-Quality Video Generation
- From Frames to Clips Efficient Key Clip Selection for Long-Form Video Understanding
- Contrastive Retrieval Heads Improve Attention-Based Re-Ranking
- MMDEW Multipurpose Multiclass Density Estimation in the Wild
- UpSafeC Upcycling for Controllable Safety in Large Language Models
- KVComm Enabling Efficient LLM Communication through Selective KV Sharing
- SoundReactor Frame-level Online Video-to-Audio Generation
- Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning
- LLM-Based Multi-Task Bangla Hate Speech Detection Type, Severity, and Target
- Patch-as-Decodable-Token Towards Unified Multi-Modal Vision Tasks in MLLMs
- MelCap A Unified Single-Codebook Neural Codec for High-Fidelity Audio Compression
- HRTFformer A Spatially-Aware Transformer for Personalized HRTF Upsampling in Immersive Audio Rendering
- Accelerating Attention with Basis Decomposition
- TalkPlay-Tools Conversational Music Recommendation with LLM Tool Calling
- ENLighten Lighten the Transformer, Enable Efficient Optical Acceleration
- Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value
- Asymmetric Proximal Policy Optimization mini-critics boost LLM reasoning
- The Unseen Frontier Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM
- Support Basis Fast Attention Beyond Bounded Entries
Multi-Segment Photonic Power Converters for Energy Harvesting and High-Speed Optical Wireless Communication
Authors: Othman Younus, Behnaz Majlesein, Richard Nacke, Isaac N. O. Osahon, Carmine Pellegrino, Sina Babadi, Iman Tavakkolnia, Henning Helmers, Harald Haas
2025-10-07
The demand for energy-efficient high-speed wireless communication, coupled
with the rapid rise of IoT devices, requires systems that integrate power
harvesting with optical data reception to eliminate the need for charging or
battery replacements. Recent advances have explored the use of solar cells as
optical receivers for high-speed data detection alongside power harvesting.
GaAs-based photonic power converters (PPCs) provide six times greater electron
mobility than silicon- or cadmium telluride-based cells, enabling faster data
detection and improved power efficiency. However, their bandwidth is
constrained by junction capacitance, which increases with active area, creating
a trade-off between power output and data rate. To address this, we propose and
test multi-segment GaAs-based PPCs that serve as both energy harvesters and
data detectors. By segmenting the active area into 2, 4, or 6 subcells, forming
circular areas with diameters of 1, 1.5, or 2.08 mm, we reduce capacitance and
boost bandwidth while preserving light collection. Fabricated on a
semi-insulating GaAs substrate with etched trenches for electrical
isolation, the series-connected subcells optimize absorption and minimize
parasitic effects. The PPCs were used for an eye-safe 1.5 m optical
wireless link, employing orthogonal frequency-division multiplexing (OFDM) with
adaptive bit and power loading. The
system achieved a world record data rate of 3.8 Gbps, which is four times
higher than prior works. The system converts 39.7% of optical power from a
beam of 2.3 mW, although the segmentation increases the sensitivity of the
alignment. These findings provide new solutions for off-grid backhaul for
future communication networks, such as 6th generation (6G) cellular.
VecInfer Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
Authors: Dingyu Yao, Chenxu Yang, Zhengyang Tong, Zheng Lin, Wei Liu, Jian Luan, Weiping Wang
2025-10-07
The Key-Value (KV) cache introduces substantial memory overhead during large
language model (LLM) inference. Although existing vector quantization (VQ)
methods reduce KV cache usage and provide flexible representational capacity
across bit-widths, they suffer severe performance degradation at ultra-low
bit-widths due to key cache outliers that hinder effective codebook
utilization. To address this challenge, we propose VecInfer, a novel VQ method
for aggressive KV cache compression while enabling efficient inference. By
applying smooth and Hadamard transformations, VecInfer suppresses outliers in
the key cache, enabling the codebook to comprehensively cover the original data
distribution and thereby reducing quantization difficulty. To facilitate
efficient deployment, we design an optimized CUDA kernel that fuses computation
with dequantization to minimize memory access overhead. Extensive evaluations
demonstrate that VecInfer consistently outperforms existing quantization
baselines across both long-context understanding and mathematical reasoning
tasks. With only 2-bit quantization, VecInfer achieves performance comparable
to full precision, while speeding up large-batch self-attention computation
and reducing single-batch end-to-end latency on Llama-3.1-8B with a 196k
sequence length.
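To make the outlier-suppression idea concrete, here is a minimal sketch of how an orthonormal Hadamard rotation spreads a single dominant channel across all dimensions while leaving the vector norm unchanged. It is an illustration only, not the authors' implementation: the toy `key` vector and the helper names `hadamard` and `rotate` are assumptions, and the smooth transformation, the VQ codebook, and the fused CUDA kernel mentioned in the abstract are all omitted.

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # H_2n = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h

def rotate(vec):
    """Apply the orthonormal rotation H / sqrt(n) to a vector."""
    n = len(vec)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(h * v for h, v in zip(row, vec)) for row in hadamard(n)]

# Hypothetical key vector in which one channel dominates the others.
key = [0.1, -0.2, 0.05, 8.0, 0.0, 0.15, -0.1, 0.2]
rotated = rotate(key)

peak_before = max(abs(x) for x in key)
peak_after = max(abs(x) for x in rotated)
norm_before = math.sqrt(sum(x * x for x in key))
norm_after = math.sqrt(sum(x * x for x in rotated))

print(f"peak magnitude: {peak_before:.3f} -> {peak_after:.3f}")
print(f"vector norm:    {norm_before:.3f} -> {norm_after:.3f}")
```

Because the rotation is orthogonal, it can be undone exactly. The flatter value range after rotation is what would, under this sketch's assumptions, make a small codebook easier to fit.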
lm-Meter Unveiling Runtime Inference Latency for On-Device Language Models
Authors: Haoxin Wang, Xiaolong Tu, Hongyu Ke, Huirong Chai, Dawei Chen, Kyungtae Han
2025-10-07
Large Language Models (LLMs) are increasingly integrated into everyday
applications, but their prevalent cloud-based deployment raises growing
concerns around data privacy and long-term sustainability. Running LLMs locally
on mobile and edge devices (on-device LLMs) offers the promise of enhanced
privacy, reliability, and reduced communication costs. However, realizing this
vision remains challenging due to substantial memory and compute demands, as
well as limited visibility into performance-efficiency trade-offs on
resource-constrained hardware. We propose lm-Meter, the first lightweight,
online latency profiler tailored for on-device LLM inference. lm-Meter captures
fine-grained, real-time latency at both phase (e.g., embedding, prefill,
decode, softmax, sampling) and kernel levels without auxiliary devices. We
implement lm-Meter on commercial mobile platforms and demonstrate its high
profiling accuracy with minimal system overhead, e.g., only 2.58% throughput
reduction in prefill and 0.99% in decode under the most constrained Powersave
governor. Leveraging lm-Meter, we conduct comprehensive empirical studies
revealing phase- and kernel-level bottlenecks in on-device LLM inference,
quantifying accuracy-efficiency trade-offs, and identifying systematic
optimization opportunities. lm-Meter provides unprecedented visibility into the
runtime behavior of LLMs on constrained platforms, laying the foundation for
informed optimization and accelerating the democratization of on-device LLM
systems. Code and tutorials are available at
https://github.com/amai-gsu/LM-Meter.
Downsized and Compromised? Assessing the Faithfulness of Model Compression
Authors: Moumita Kamal, Douglas A. Talbert
2025-10-07
In real-world applications, computational constraints often require
transforming large models into smaller, more efficient versions through model
compression. While these techniques aim to reduce size and computational cost
without sacrificing performance, their evaluations have traditionally focused
on the trade-off between size and accuracy, overlooking the aspect of model
faithfulness. This limited view is insufficient for high-stakes domains like
healthcare, finance, and criminal justice, where compressed models must remain
faithful to the behavior of their original counterparts. This paper presents a
novel approach to evaluating faithfulness in compressed models, moving beyond
standard metrics. We introduce and demonstrate a set of faithfulness metrics
that capture how model behavior changes post-compression. Our contributions
include introducing techniques to assess predictive consistency between the
original and compressed models using model agreement, and applying chi-squared
tests to detect statistically significant changes in predictive patterns across
both the overall dataset and demographic subgroups, thereby exposing shifts
that aggregate fairness metrics may obscure. We demonstrate our approaches by
applying quantization and pruning to artificial neural networks (ANNs) trained
on three diverse and socially meaningful datasets. Our findings show that high
accuracy does not guarantee faithfulness, and our statistical tests detect
subtle yet significant shifts that are missed by standard metrics, such as
Accuracy and Equalized Odds. The proposed metrics provide a practical and more
direct method for ensuring that efficiency gains through compression do not
compromise the fairness or faithfulness essential for trustworthy AI.
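The two ingredients named in the abstract, model agreement and a chi-squared test on predictive patterns, can be sketched for a binary classifier as follows. This is one possible formulation under stated assumptions, not the paper's procedure: the function names, the toy prediction lists, and the choice of a 2x2 test of homogeneity (which ignores that the two models are evaluated on the same samples) are all illustrative.

```python
import math

def agreement_rate(orig_preds, comp_preds):
    """Fraction of samples on which both models predict the same label."""
    matches = sum(a == b for a, b in zip(orig_preds, comp_preds))
    return matches / len(orig_preds)

def chi2_prediction_shift(orig_preds, comp_preds):
    """Chi-squared statistic and p-value for a 2x2 table (model x binary label).

    Assumes labels are 0/1 and that each label is predicted at least once
    overall, so no expected count is zero.
    """
    table = [[preds.count(0), preds.count(1)] for preds in (orig_preds, comp_preds)]
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # With one degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical predictions from an original and a compressed model.
original = [1] * 60 + [0] * 40
compressed = [1] * 40 + [0] * 60

rate = agreement_rate(original, compressed)
stat, p_value = chi2_prediction_shift(original, compressed)
print(f"agreement={rate:.2f}, chi2={stat:.2f}, p={p_value:.4f}")
```

Running the same two functions on the rows belonging to a single demographic subgroup would give the per-subgroup view the abstract describes.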
Influence Functions for Efficient Data Selection in Reasoning
Authors: Prateek Humane, Paolo Cudrano, Daniel Z. Kaplan, Matteo Matteucci, Supriyo Chakraborty, Irina Rish
2025-10-07
Fine-tuning large language models (LLMs) on chain-of-thought (CoT) data shows
that a small amount of high-quality data can outperform massive datasets. Yet,
what constitutes "quality" remains ill-defined. Existing reasoning methods rely
on indirect heuristics such as problem difficulty or trace length, while
instruction-tuning has explored a broader range of automated selection
strategies, but rarely in the context of reasoning. We propose to define
reasoning data quality using influence functions, which measure the causal
effect of individual CoT examples on downstream accuracy, and introduce
influence-based pruning, which consistently outperforms perplexity and
embedding-based baselines on math reasoning within a model family.
Sample Smart, Not Hard Correctness-First Decoding for Better Reasoning in LLMs
Authors: Xueyan Li, Guinan Su, Mrinmaya Sachan, Jonas Geiping
2025-10-07
Large Language Models (LLMs) are increasingly applied to complex tasks that
require extended reasoning. In such settings, models often benefit from diverse
chains-of-thought to arrive at multiple candidate solutions. This requires
balancing two
competing objectives: to inject enough stochasticity to explore multiple
reasoning chains, and to ensure sufficient accuracy and quality in each path.
Existing works pursue the first objective by increasing exploration at highly
uncertain steps with higher temperature or larger candidate token sets, while
others improve reliability by rejecting samples with low confidence
post-generation, implying that low confidence correlates with low answer
quality. These two lines of thought are in conflict, as they conflate different
sources of uncertainty. To resolve this, we argue that the decoding rule should
be calibrated by correctness, not confidence alone. We should sample from
tokens with higher estimated correctness, and reduce sampling where expected
correctness is low. We propose simple strategies that achieve this goal:
Greedy-Threshold makes sampling greedy at very low confidence steps.
Calibrated-TopK and Calibrated-epsilon set the truncation threshold based on
estimated rank-wise correctness. Together, our findings challenge prevailing
heuristics about decoding under uncertainty and show gains across math and
general reasoning benchmarks.
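The Greedy-Threshold rule described above can be sketched in a few lines. This is a hedged illustration rather than the authors' code: it assumes "confidence" means the largest next-token probability, and the function name, the `threshold` default of 0.3, and the toy distributions are all hypothetical.

```python
import random

def greedy_threshold_sample(probs, threshold=0.3, rng=random):
    """Pick a token index from a next-token distribution.

    If the most likely token has probability below `threshold` (a very
    low-confidence step), return it greedily; otherwise sample from `probs`.
    """
    top = max(range(len(probs)), key=probs.__getitem__)
    if probs[top] < threshold:
        return top  # low confidence: fall back to the greedy choice
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Flat, low-confidence step: the argmax (index 2) is always returned.
print(greedy_threshold_sample([0.25, 0.20, 0.28, 0.27]))

# Confident step: the token is sampled, so index 0 is likely but not certain.
print(greedy_threshold_sample([0.90, 0.05, 0.03, 0.02]))
```

The design point is that stochasticity is kept where the model is reasonably confident and removed where sampling would mostly draw low-quality tokens. The Calibrated-TopK and Calibrated-epsilon variants need estimated rank-wise correctness and are not covered by this sketch.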
Diffusion-Based Image Editing for Breaking Robust Watermarks
Authors: Yunyi Ni, Finn Carter, Ze Niu, Emily Davis, Bo Zhang
2025-10-07
Robust invisible watermarking aims to embed hidden information into images
such that the watermark can survive various image manipulations. However, the
rise of powerful diffusion-based image generation and editing techniques poses
a new threat to these watermarking schemes. In this paper, we present a
theoretical study and method demonstrating that diffusion models can
effectively break robust image watermarks that were designed to resist
conventional perturbations. We show that a diffusion-driven "image
regeneration" process can erase embedded watermarks while preserving
perceptual image content. We further introduce a novel guided diffusion attack
that explicitly targets the watermark signal during generation, significantly
degrading watermark detectability. Theoretically, we prove that as an image
undergoes sufficient diffusion-based transformation, the mutual information
between the watermarked image and the embedded watermark payload vanishes,
resulting in decoding failure. Experimentally, we evaluate our approach on
multiple state-of-the-art watermarking schemes (including the deep
learning-based methods StegaStamp, TrustMark, and VINE) and demonstrate
near-zero watermark recovery rates after attack, while maintaining high visual
fidelity of the regenerated images. Our findings highlight a fundamental
vulnerability in current robust watermarking techniques against generative
model-based attacks, underscoring the need for new watermarking strategies in
the era of generative AI.
Training-Free Time Series Classification via In-Context Reasoning with LLM Agents
Authors: Songyuan Sui, Zihang Xu, Yu-Neng Chuang, Kwei-Herng Lai, Xia Hu
2025-10-07
Time series classification (TSC) spans diverse application scenarios, yet
labeled data are often scarce, making task-specific training costly and
inflexible. Recent reasoning-oriented large language models (LLMs) show promise
in understanding temporal patterns, but purely zero-shot usage remains
suboptimal. We propose FETA, a multi-agent framework for training-free TSC via
exemplar-based in-context reasoning. FETA decomposes a multivariate series into
channel-wise subproblems, retrieves a few structurally similar labeled examples
for each channel, and leverages a reasoning LLM to compare the query against
these exemplars, producing channel-level labels with self-assessed confidences;
a confidence-weighted aggregator then fuses all channel decisions. This design
eliminates the need for pretraining or fine-tuning, improves efficiency by
pruning irrelevant channels and controlling input length, and enhances
interpretability through exemplar grounding and confidence estimation. On nine
challenging UEA datasets, FETA achieves strong accuracy under a fully
training-free setting, surpassing multiple trained baselines. These results
demonstrate that a multi-agent in-context reasoning framework can transform
LLMs into competitive, plug-and-play TSC solvers without any parameter
training. The code is available at https://github.com/SongyuanSui/FETATSC.
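The confidence-weighted aggregation step can be illustrated with a small sketch. The abstract does not specify the weighting rule, so summing self-assessed confidences per label is an assumption here, and the function name and example votes are hypothetical.

```python
from collections import defaultdict

def aggregate_channel_votes(channel_votes):
    """Fuse per-channel (label, confidence) pairs into one series-level label.

    Each channel contributes its confidence to the score of the label it
    predicted; the label with the highest total score wins.
    """
    scores = defaultdict(float)
    for label, confidence in channel_votes:
        scores[label] += confidence
    return max(scores, key=scores.get)

# Hypothetical outputs from three channel-level agents.
votes = [("walk", 0.90), ("run", 0.40), ("run", 0.45)]
print(aggregate_channel_votes(votes))
```

In this example a plain majority vote would pick "run", while the weighted sum favors the single high-confidence "walk" vote (0.90 against 0.85), which shows how confidence can change the fused decision.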
QE Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection
Authors: Yanran Zhang, Bingyao Yu, Yu Zheng, Wenzhao Zheng, Yueqi Duan, Lei Chen, Jie Zhou, Jiwen Lu
2025-10-07
The emergence of visual autoregressive (AR) models has revolutionized image
generation while presenting new challenges for synthetic image detection.
Unlike previous GAN or diffusion-based methods, AR models generate images
through discrete token prediction, exhibiting both marked improvements in image
synthesis quality and unique characteristics in their vector-quantized
representations. In this paper, we propose to leverage Discrete Distribution
Discrepancy-aware Quantization Error (D3QE) for autoregressive-generated
image detection that exploits the distinctive patterns and the frequency
distribution bias of the codebook existing in real and fake images. We
introduce a discrete distribution discrepancy-aware transformer that integrates
dynamic codebook frequency statistics into its attention mechanism, fusing
semantic features and the quantization error latent. To evaluate our method, we
construct a comprehensive dataset termed ARForensics covering 7 mainstream
visual AR models. Experiments demonstrate superior detection accuracy and
strong generalization of D3QE across different AR models, with robustness to
real-world perturbations. Code is available at
https://github.com/Zhangyr2022/D3QE.
BioAutoML-NAS An End-to-End AutoML Framework for Multimodal Insect Classification via Neural Architecture Search on Large-Scale Biodiversity Data
Authors: Arefin Ittesafun Abian, Debopom Sutradhar, Md Rafi Ur Rashid, Reem E. Mohamed, Md Rafiqul Islam, Asif Karim, Kheng Cher Yeo, Sami Azam
2025-10-07
Insect classification is important for agricultural management and ecological
research, as it directly affects crop health and production. However, this task
remains challenging due to the complex characteristics of insects, class
imbalance, and large-scale datasets. To address these issues, we propose
BioAutoML-NAS, the first BioAutoML model using multimodal data, including
images and metadata, which applies neural architecture search (NAS) for images
to automatically learn the best operations for each connection within each
cell. Multiple cells are stacked to form the full network, each extracting
detailed image feature representations. A multimodal fusion module combines
image embeddings with metadata, allowing the model to use both visual and
categorical biological information to classify insects. An alternating bi-level
optimization training strategy jointly updates network weights and architecture
parameters, while zero operations remove less important connections, producing
sparse, efficient, and high-performing architectures. Extensive evaluation on
the BIOSCAN-5M dataset demonstrates that BioAutoML-NAS achieves 96.81%
accuracy, 97.46% precision, 96.81% recall, and a 97.05% F1 score, outperforming
state-of-the-art transfer learning, transformer, AutoML, and NAS methods by
approximately 16%, 10%, and 8% respectively. Further validation on the
Insects-1M dataset obtains 93.25% accuracy, 93.71% precision, 92.74% recall,
and a 93.22% F1 score. These results demonstrate that BioAutoML-NAS provides
accurate, confident insect classification that supports modern sustainable
farming.
Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input
Authors: Faeze Ghorbanpour, Alexander Fraser
2025-10-07
Large language models (LLMs) increasingly support applications that rely on
extended context, from document processing to retrieval-augmented generation.
While their long-context capabilities are well studied for reasoning and
retrieval, little is known about their behavior in safety-critical scenarios.
We evaluate LLMs' sensitivity to harmful content under extended context,
varying type (explicit vs. implicit), position (beginning, middle, end),
prevalence (0.01-0.50 of the prompt), and context length (600-6000 tokens).
Across harmful content categories such as toxic, offensive, and hate speech,
with LLaMA-3, Qwen-2.5, and Mistral, we observe similar patterns: performance
peaks at moderate harmful prevalence (0.25) but declines when content is very
sparse or dominant; recall decreases with increasing context length; harmful
sentences at the beginning are generally detected more reliably; and explicit
content is more consistently recognized than implicit. These findings provide
the first systematic view of how LLMs prioritize and calibrate harmful content
in long contexts, highlighting both their emerging strengths and the challenges
that remain for safety-critical use.
Flow4Agent Long-form Video Understanding via Motion Prior from Optical Flow
Authors: Ruyang Liu, Shangkun Sun, Haoran Tang, Ge Li, Wei Gao
2025-10-07
Long-form video understanding has always been a challenging problem due to
the significant redundancy in both temporal and spatial contents. This
challenge is further exacerbated by the limited context length of Multimodal
Large Language Models (MLLMs). To address this issue, many previous works have
attempted to extract key video information, where the "key" is typically
semantic-aware and heavily dependent on the CLIP model as prior. In this paper,
we propose Flow4Agent, a novel framework that pioneers the use of motion
priors from optical flow to facilitate LLM-based long video understanding.
Flow4Agent mitigates the redundancy in long videos at both temporal and spatial
levels through two core modules: Temporal Granularity Optimization (TGO)
adaptively refines frame-level hierarchies, which first leverages coarse flow
priors to group similar visual contents and then applies semantic priors to
filter out highly irrelevant scene information. Motion Token Pruning (MTP)
further refines the intra-frame visual representations, pruning high-redundancy
video tokens using fine-grained optical flow information. Extensive experiments
demonstrate that our Flow4Agent outperforms existing methods across a wide
range of video MLLM benchmarks, especially for hour-level video understanding
tasks, achieving 64.7% on Video-MME, 71.4% on MLVU and 60.4% on LongVideoBench.
Rasterized Steered Mixture of Experts for Efficient 2D Image Regression
Authors: Yi-Hsin Li, Thomas Sikora, Sebastian Knorr, Mårten Sjöström
2025-10-07
The Steered Mixture of Experts regression framework has demonstrated strong
performance in image reconstruction, compression, denoising, and
super-resolution. However, its high computational cost limits practical
applications. This work introduces a rasterization-based optimization strategy
that combines the efficiency of rasterized Gaussian kernel rendering with the
edge-aware gating mechanism of the Steered Mixture of Experts. The proposed
method is designed to accelerate two-dimensional image regression while
maintaining the model's inherent sparsity and reconstruction quality. By
replacing global iterative optimization with a rasterized formulation, the
method achieves significantly faster parameter updates and more
memory-efficient model representations. In addition, the proposed framework
supports applications such as native super-resolution and image denoising,
which are not directly achievable with standard rasterized Gaussian kernel
approaches. The combination of fast rasterized optimization with the edge-aware
structure of the Steered Mixture of Experts provides a new balance between
computational efficiency and reconstruction fidelity for two-dimensional image
processing tasks.
OneVision An End-to-End Generative Framework for Multi-view E-commerce Vision Search
Authors: Zexin Zheng, Huangyu Dai, Lingtao Mao, Xinyu Sun, Zihan Liang, Ben Chen, Yuqing Ding, Chenyi Lei, Wenwu Ou, Han Li, Kun Gai
2025-10-07
Traditional vision search, similar to search and recommendation systems,
follows the multi-stage cascading architecture (MCA) paradigm to balance
efficiency and conversion. Specifically, the query image undergoes feature
extraction, recall, pre-ranking, and ranking stages, ultimately presenting the
user with semantically similar products that meet their preferences. However, the
multi-view representation discrepancy of the same object in the query and the
optimization objective collide across these stages, making it difficult to
achieve Pareto optimality in both user experience and conversion. In this
paper, an end-to-end generative framework, OneVision, is proposed to address
these problems. OneVision builds on VRQ, a vision-aligned residual quantization
encoding, which can align the vastly different representations of an object
across multiple viewpoints while preserving the distinctive features of each
product as much as possible. Then a multi-stage semantic alignment scheme is
adopted to maintain strong visual similarity priors while effectively
incorporating user-specific information for personalized preference generation.
In offline evaluations, OneVision performs on par with online MCA, while
improving inference efficiency by 21% through dynamic pruning. In A/B tests, it
achieves significant online improvements: +2.15% item CTR, +2.27% CVR, and
+3.12% order volume. These results demonstrate that a semantic ID centric,
generative architecture can unify retrieval and personalization while
simplifying the
pathway.
Communication Enables Cooperation in LLM Agents A Comparison with Curriculum-Based Approaches
Authors: Hachem Madmoun, Salem Lahlou
2025-10-07
Eliciting cooperation in multi-agent systems is critical for AI
alignment. We investigate two approaches: direct communication and curriculum
learning. In a 4-player Stag Hunt, a one-word "cheap talk" channel increases
cooperation from 0% to 48.3%, demonstrating communication as a robust
coordination mechanism. In contrast, we find that curriculum learning is highly
sensitive to design choices: our pedagogical curriculum through progressively
complex games reduced agent payoffs by 27.4% in an Iterated Public Goods Game
with Punishment. Qualitative analysis reveals that curricula emphasizing
defection-equilibrium games can induce "learned pessimism" in agents. These
findings suggest that for coordination problems, simple communication protocols
may be more reliable than experience-based training, and that curriculum design
for social dilemmas requires careful attention to the strategic lessons
embedded in game sequences.
Federated Split Learning for Resource-Constrained Robots in Industrial IoT Framework Comparison, Optimization Strategies, and Future Directions
Authors: Wanli Ni, Hui Tian, Shuai Wang, Chengyang Li, Lei Sun, Zhaohui Yang
2025-10-07
Federated split learning (FedSL) has emerged as a promising paradigm for
enabling collaborative intelligence in industrial Internet of Things (IoT)
systems, particularly in smart factories where data privacy,
efficiency, and device heterogeneity are critical concerns. In this article, we
present a comprehensive study of FedSL frameworks tailored for
resource-constrained robots in industrial scenarios. We compare synchronous,
asynchronous, hierarchical, and heterogeneous FedSL frameworks in terms of
workflow, scalability, adaptability, and limitations under dynamic industrial
conditions. Furthermore, we systematically categorize token fusion strategies
into three paradigms: input-level (pre-fusion), intermediate-level
(intra-fusion), and output-level (post-fusion), and summarize their respective
strengths in industrial applications. We also provide adaptive optimization
techniques to enhance the efficiency and feasibility of FedSL implementation,
including model compression, split layer selection, computing frequency
allocation, and wireless resource management. Simulation results validate the
performance of these frameworks under industrial detection scenarios. Finally,
we outline open issues and research directions of FedSL in future smart
manufacturing systems.
Uncovering Representation Bias for Investment Decisions in Open-Source Large Language Models
Authors: Fabrizio Dimino, Krati Saxena, Bhaskarjit Sarmah, Stefano Pasquali
2025-10-07
Large Language Models are increasingly adopted in financial applications to
support investment workflows. However, prior studies have seldom examined how
these models reflect biases related to firm size, sector, or financial
characteristics, which can significantly impact decision-making. This paper
addresses this gap by focusing on representation bias in open-source Qwen
models. We propose a balanced round-robin prompting method over approximately
150 U.S. equities, applying constrained decoding and token-logit aggregation to
derive firm-level confidence scores across financial contexts. Using
statistical tests and variance analysis, we find that firm size and valuation
consistently increase model confidence, while risk factors tend to decrease it.
Confidence varies significantly across sectors, with the Technology sector
showing the greatest variability. When models are prompted for specific
financial categories, their confidence rankings best align with fundamental
data, moderately with technical signals, and least with growth indicators.
These results highlight representation bias in Qwen models and motivate
sector-aware calibration and category-conditioned evaluation protocols for safe
and fair financial LLM deployment.
DecEx-RAG Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision
Authors: Yongqi Leng, Yikun Lei, Xikai Liu, Meizhi Zhong, Bojian Xiong, Yurong Zhang, Yan Gao, Yi Wu, Yao Hu, Deyi Xiong
2025-10-07
Agentic Retrieval-Augmented Generation (Agentic RAG) enhances the processing
capability for complex tasks through dynamic retrieval and adaptive workflows.
Recent advances (e.g., Search-R1) have shown that outcome-supervised
reinforcement learning demonstrates strong performance. However, this approach
still suffers from inefficient exploration, sparse reward signals, and
ambiguous global reward feedback. To address these challenges, we propose
DecEx-RAG, which models RAG as a Markov Decision Process (MDP) incorporating
decision-making and execution, while introducing an efficient
strategy
to optimize data expansion. Through comprehensive process-level policy
optimization, DecEx-RAG significantly enhances the autonomous task
decomposition, dynamic retrieval, and high-quality answer generation
capabilities of large language models (LLMs). Experiments show that DecEx-RAG
achieves an average absolute performance improvement of across six
datasets, significantly outperforming existing baselines. Moreover, the
strategy improves data construction efficiency by nearly , providing
an efficient solution for process-supervised RAG training. The code is
available at https://github.com/sdsxdxl/DecEx-RAG.
Teaching Machines to Speak Using Articulatory Control
Authors: Akshay Anand, Chenxu Guo, Cheol Jun Cho, Jiachen Lian, Gopala Anumanchipalli
2025-10-07
Current speech production systems predominantly rely on large
models that operate as black boxes, providing little interpretability or
grounding in the physical mechanisms of human speech. We address this
limitation by proposing a new framework: speech generation through explicit
articulatory control. This reframes speech as a motor control task similar to
robotic manipulation. Our approach uses reinforcement learning to train a
policy that directly controls the movements of vocal tract articulators, such
as the tongue, lips, and jaw, to produce syllable-level speech. Specifically,
we employ the Proximal Policy Optimization algorithm to learn optimal
articulatory movements based on acoustic feedback provided by our audio
perceiver, Sylber. The resulting articulatory trajectories are decoded into
audio using SPARC, a pre-trained articulatory-to-speech decoder. We train this
framework on six target syllables, and it demonstrates successful convergence,
with similarity scores between the policy-generated audio and the target
syllables exceeding 0.85. Accurate human transcription of the audio for
syllables such as "please", "loot", and "cat" demonstrates the intelligibility
of this framework.
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
Authors: Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu
2025-10-07
Outcome-driven reinforcement learning has advanced reasoning in large
language models (LLMs), but prevailing tool-augmented approaches train a
single, monolithic policy that interleaves thoughts and tool calls under full
context; this scales poorly with long horizons and diverse tools and
generalizes weakly to new scenarios. Agentic systems offer a promising
alternative by decomposing work across specialized modules, yet most remain
training-free or rely on offline training decoupled from the live dynamics of
multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow
agentic framework that coordinates four modules (planner, executor, verifier,
generator) through an evolving memory and directly optimizes its planner inside
the multi-turn loop. To train on-policy in live environments, we propose
Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles
long-horizon, sparse-reward credit assignment by converting multi-turn
optimization into a sequence of tractable single-turn policy updates. It
broadcasts a single, verifiable trajectory-level outcome to every turn to align
local planner decisions with global success and stabilizes learning with
group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale
backbone outperforms top-performing baselines with average accuracy gains of
14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on
scientific tasks, even surpassing larger proprietary models like GPT-4o.
Further analyses confirm the benefits of in-the-flow optimization, showing
improved planning, enhanced tool-calling reliability, and positive scaling with
model size and reasoning turns.
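
The broadcast-and-normalize step described in this abstract can be sketched in a few lines of Python. This is a minimal illustration that assumes GRPO-style normalization (subtract the group mean, divide by the group standard deviation). The function names, the `eps` constant, and the toy rewards are hypothetical and are not taken from the AgentFlow code.

```python
from statistics import mean, pstdev

def group_normalized_advantages(outcomes, eps=1e-8):
    """Normalize trajectory-level rewards within one rollout group.

    Assumption: population standard deviation and a small eps for stability.
    """
    mu = mean(outcomes)
    sigma = pstdev(outcomes)
    return [(r - mu) / (sigma + eps) for r in outcomes]

def broadcast_to_turns(advantages, turns_per_trajectory):
    """Give every turn of a trajectory that trajectory's single advantage."""
    return [[a] * n for a, n in zip(advantages, turns_per_trajectory)]

# Hypothetical group of four rollouts for one query (1.0 = verified success).
outcomes = [1.0, 0.0, 0.0, 1.0]
turns = [3, 2, 4, 1]  # number of planner turns in each rollout

adv = group_normalized_advantages(outcomes)
per_turn = broadcast_to_turns(adv, turns)

for rollout, values in enumerate(per_turn):
    print(rollout, values)
```

Each turn can then be treated as an independent single-turn policy update weighted by its broadcast advantage, which is the sense in which the multi-turn problem becomes a sequence of single-turn ones.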
Deciphering Invariant Feature Decoupling in Source-free Time Series Forecasting with Proxy Denoising
Authors: Kangjia Yan, Chenxi Liu, Hao Miao, Xinle Wu, Yan Zhao, Chenjuan Guo, Bin Yang
2025-10-07
The proliferation of mobile devices generates a massive volume of time series
across various domains, where effective time series forecasting enables a
variety of real-world applications. This study focuses on a new problem of
source-free domain adaptation for time series forecasting. It aims to adapt a
pretrained model from sufficient source time series to the target time
series domain without access to the source data, embracing data protection
regulations. To achieve this, we propose TimePD, the first source-free time
series forecasting framework with proxy denoising, where large language models
(LLMs) are employed to benefit from their generalization capabilities.
Specifically, TimePD consists of three key components: (1) dual-branch
invariant disentangled feature learning that enforces representation- and
gradient-wise invariance by means of season-trend decomposition; (2)
lightweight, parameter-free proxy denoising that dynamically calibrates
systematic biases of LLMs; and (3) knowledge distillation that bidirectionally
aligns the denoised prediction and the original target prediction. Extensive
experiments on real-world datasets offer insight into the effectiveness of the
proposed TimePD, outperforming SOTA baselines by 9.3% on average.
H1B-KV Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference
Authors: Harshil Vejendla
2025-10-07
Autoregressive decoding in large language models (LLMs) requires caching a
growing list of past key-value (KV) pairs, making long-context inference a
memory-bound problem. While recent methods have explored quantizing the KV cache,
evicting tokens, or using binary sketches for keys (e.g., Loki), these
approaches often provide an incomplete solution by leaving one component (like
values) uncompressed or by discarding context information. This paper
introduces the Hybrid One-Bit KV Cache (H1B-KV), a comprehensive compression
scheme that radically reduces memory usage without sacrificing context. H1B-KV
represents each key vector using a 1-bit binary sketch, enabling
hardware-friendly bitwise attention, and further compresses value vectors using
4-bit quantization. This holistic, hybrid approach allows a 7-billion-parameter
LLM to handle an 8k-token context with under 60 MB of cache memory, a 70x
reduction. We demonstrate that after a lightweight finetuning, H1B-KV matches
full-precision performance not only on perplexity benchmarks but also on
complex downstream tasks like mathematical reasoning (GSM8K), multi-task
understanding (MMLU), and code generation (HumanEval). Our results show H1B-KV
significantly outperforms leading quantization (KIVI), token eviction
(SparseLLM), and key-only sketching (Loki) methods in quality-per-byte,
establishing it as a robust solution for deploying LLMs in memory-constrained
environments.
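
To make the idea of a 1-bit key sketch with bitwise attention scoring concrete, here is a small Python sketch. It assumes the simplest possible sketch, the sign bit of each coordinate, and scores a query against a key by counting agreeing bits. The abstract does not say how H1B-KV builds its sketches, so the functions `sign_sketch` and `bitwise_score` are illustrative assumptions rather than the paper's method.

```python
def sign_sketch(vec):
    """Pack the sign bit of each coordinate into one int (1 bit per dimension)."""
    bits = 0
    for i, x in enumerate(vec):
        if x >= 0:
            bits |= 1 << i
    return bits

def bitwise_score(q_bits, k_bits, dim):
    """Agreement score in [-dim, dim]: dim minus twice the Hamming distance."""
    hamming = bin(q_bits ^ k_bits).count("1")
    return dim - 2 * hamming

# Toy data (hypothetical): one query and two cached keys of dimension 4.
query = [0.3, -1.2, 0.8, 0.1]
keys = [
    [0.5, -0.7, 0.2, 0.9],    # same sign pattern as the query
    [-0.4, 1.1, -0.3, -0.6],  # opposite sign pattern
]

q = sign_sketch(query)
scores = [bitwise_score(q, sign_sketch(k), len(query)) for k in keys]
print(scores)
```

The XOR and popcount are the only operations needed per key, which is why such sketches are described as hardware-friendly; each cached key costs one bit per dimension instead of 16.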
ARMOR High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization
Authors: Lawrence Liu, Alexander Liu, Mengdi Wang, Tuo Zhao, Lin F. Yang
2025-10-07
Large language models (LLMs) present significant deployment challenges due to
their immense computational and memory requirements. While semi-structured
pruning, particularly 2:4 sparsity, offers a path to practical hardware
acceleration, existing methods often incur substantial performance degradation.
To bridge this gap, we introduce ARMOR (Adaptive Representation with
Matrix-factORization), a novel one-shot post-training pruning algorithm.
Instead of directly pruning weights, ARMOR factorizes each weight matrix into a
2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These
wrappers act as efficient pre- and post-transformation error correctors,
offering greater flexibility to preserve model quality compared to conventional
2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen
through a block coordinate descent algorithm that minimizes a layer-wise proxy
loss. We theoretically prove this optimization is guaranteed to converge to a
solution with a proxy loss less than or equal to that of state-of-the-art pruning
algorithms. Experiments on Llama (Touvron et al., 2023; Dubey et al., 2024) and
Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and
significantly outperforms state-of-the-art 2:4 pruning methods across a wide
range of downstream tasks and perplexity evaluations. ARMOR achieves this
superior performance while retaining the inference speedups and substantial
memory usage reductions of 2:4 pruning, establishing a more effective trade-off
between model compression and task accuracy.
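
For readers unfamiliar with the 2:4 pattern, the following Python sketch shows the conventional baseline that this abstract contrasts against: in every group of four consecutive weights, keep the two with the largest magnitude and zero the rest. It is not ARMOR's factorization; the function name and the toy row are hypothetical.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude entries in every group of four weights."""
    if len(row) % 4:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(x if i in keep else 0.0 for i, x in enumerate(group))
    return out

row = [0.9, -0.1, 0.05, -1.3, 0.2, 0.2, -0.7, 0.0]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

Because the positions of the surviving weights are fixed by the mask, any error it introduces cannot be repaired inside the sparse matrix itself, which is the gap the block diagonal wrappers described above are meant to close.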
CAM A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension
Authors: Rui Li, Zeyu Zhang, Xiaohe Bo, Zihang Tian, Xu Chen, Quanyu Dai, Zhenhua Dong, Ruiming Tang
2025-10-07
Current Large Language Models (LLMs) are confronted with overwhelming
information volume when comprehending long-form documents. This challenge
raises the imperative of a cohesive memory module, which can elevate vanilla
LLMs into autonomous reading agents. Despite the emergence of some heuristic
approaches, a systematic design principle remains absent. To fill this void, we
draw inspiration from Jean Piaget's Constructivist Theory, illuminating three
traits of the agentic memory -- structured schemata, flexible assimilation, and
dynamic accommodation. This blueprint forges a clear path toward a more robust
and efficient memory system for LLM-based reading comprehension. To this end,
we develop CAM, a prototype implementation of Constructivist Agentic Memory
that simultaneously embodies the structurality, flexibility, and dynamicity. At
its core, CAM is endowed with an incremental overlapping clustering algorithm
for structured memory development, supporting both coherent hierarchical
summarization and online batch integration. During inference, CAM adaptively
explores the memory structure to activate query-relevant information for
contextual response, akin to the human associative process. Compared to
existing approaches, our design demonstrates dual advantages in both
performance and efficiency across diverse long-text reading comprehension
tasks, including question answering, query-based summarization, and claim
verification.
LANTERN Scalable Distillation of Large Language Models for Job-Person Fit and Explanation
Authors: Zhoutong Fu, Yihan Cao, Yi-Lin Chen, Aman Lunia, Liming Dong, Neha Saraf, Ruijie Jiang, Yun Dai, Qingquan Song, Tan Wang, Guoyao Li, Derek Koh, Haichao Wei, Zhipeng Wang, Aman Gupta, Chengming Jiang, Jianqiang Shen, Liangjie Hong, Wenjing Zhang
2025-10-07
Large language models (LLMs) have achieved strong performance across a wide
range of natural language processing tasks. However, deploying LLMs at scale
for domain specific applications, such as job-person fit and explanation in job
seeking platforms, introduces distinct challenges. At LinkedIn, the job person
fit task requires analyzing a candidate's public profile against job
requirements to produce both a fit assessment and a detailed explanation.
Directly applying open source or finetuned LLMs to this task often fails to
yield high quality, actionable feedback due to the complexity of the domain and
the need for structured outputs. Moreover, the large size of these models leads
to high inference latency and limits scalability, making them unsuitable for
online use. To address these challenges, we introduce LANTERN, a novel
knowledge distillation framework tailored specifically for job person fit
tasks. LANTERN involves modeling over multiple objectives, an encoder model for
classification purpose, and a decoder model for explanation purpose. To better
distill the knowledge from a strong black box teacher model to multiple
downstream models, LANTERN incorporates multi level knowledge distillation that
integrates both data and logit level insights. In addition to introducing the
knowledge distillation framework, we share our insights on post training
techniques and prompt engineering, both of which are crucial for successfully
adapting LLMs to domain specific downstream tasks. Extensive experimental
results demonstrate that LANTERN significantly improves task specific metrics
for both job person fit and explanation. Online evaluations further confirm its
effectiveness, showing measurable gains in job seeker engagement, including a
0.24% increase in apply rate and a 0.28% increase in qualified applications.
AMAQ Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning
Authors: Yurun Song, Zhuoyi Yang, Ian G. Harris, Sangeetha Abdu Jyothi
2025-10-07
Large Language Models (LLMs) are scaling rapidly, creating significant
challenges for collaborative server client distributed training, particularly
in terms of communication efficiency and computational overheads. To address
these challenges, we implement Parameter-efficient Split Learning, which
effectively balances efficiency and performance for collaborative training on
low-resource devices.
To reduce communication overhead in collaborative training, we introduce
Adaptive Mixed bit Activation Quantization (AMAQ), a strategy that
progressively compresses activations and gradients from high precision (6 to 8
bits) to low precision (3 to 4 bits). AMAQ achieves this by effectively
allocating bit budgets across channels based on feature wise and layer wise
importance using bit regularization.
Under the same bit budgets, AMAQ outperforms fixed-precision approaches,
delivering about 2.5% higher generation accuracy and about 1.3% better
classification accuracy for models like LLaMA3 8B and Qwen2.5 7B. In addition,
it significantly enhances training stability and reduces ultra-low bit
representation collapse during the training.
Experiments demonstrate that AMAQ integrates effectively into practical
multi-machine collaborative training setups, offering superior inference
accuracy with only a modest communication overhead for bits adaptation during
training. This trade off makes AMAQ a practical and effective solution for
collaborative training with minimal communication cost.
Model-based Deep Learning for Joint RIS Phase Shift Compression and WMMSE Beamforming
Authors: Alexander James Fernandes, Ioannis Psaromiligkos
2025-10-06
A model-based deep learning (DL) architecture is proposed for reconfigurable
intelligent surface (RIS)-assisted multi-user communications to reduce the
overhead of transmitting phase shift information from the access point (AP) to
the RIS controller. The phase shifts are computed at the AP, which has access
to the channel state information, and then encoded into a compressed binary
control message that is sent to the RIS controller for element configuration.
To help reduce beamformer mismatches due to phase shift compression errors, the
beamformer is updated using weighted minimum mean square error (WMMSE) based on
the effective channel resulting from the actual (decompressed) RIS reflection
coefficients. By unrolling the iterative WMMSE algorithm as part of the
wireless communication-informed DL architecture, joint phase shift compression
and WMMSE beamforming can be trained end-to-end. Simulations show that
accounting for phase shift compression errors during beamforming significantly
improves the sum-rate performance, even when the number of control bits is
lower than the number of RIS elements.
Draft, Verify, and Improve Toward Training-Aware Speculative Decoding
Authors: Shrenik Bhansali, Larry Heck
2025-10-06
Autoregressive (AR) decoding is a major latency bottleneck for large language
models. Speculative decoding (SD) accelerates AR by letting a drafter propose
multi-token blocks that a verifier accepts or rejects. However, many SD systems
require heavy offline training or extra components. These choices raise
data/compute cost and can yield brittle drafters under distribution drift. We
introduce Draft, Verify, & Improve (DVI), a training-aware
self-speculative framework that combines inference with continual online
learning. We partition an LLM into a drafter and a verifier, and during
generation, verifier accept/reject decisions are converted into supervision
signals and used to update the drafter head. A simple KL→RL
schedule bootstraps calibration via online distillation and then adds
reward-masked cross-entropy with an on-policy policy-gradient term, preserving
lossless, single model deployment. On Spec-Bench, DVI achieves a
wall-time speedup, on par with SoTA approaches like EAGLE-2, while using orders of
magnitude less data for training, and ablations show that DVI outperforms
KL-only online distillation. DVI demonstrates that training-aware
self-speculation can deliver state-of-the-art, lossless speedups with minimal
training overhead.
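
The draft-then-verify loop and the reuse of accept/reject decisions as training signal can be illustrated with a short Python sketch. It assumes greedy verification, where a drafted token is accepted only if it equals the verifier's own choice at that position; the abstract does not state DVI's exact acceptance rule, so `verify_block` and the toy token ids are hypothetical.

```python
def verify_block(draft_tokens, verifier_tokens):
    """Greedy verification of one drafted block.

    Accept drafted tokens while they match the verifier, then emit the
    verifier's token at the first mismatch and stop. Returns the emitted
    tokens and one accept flag (1 or 0) per checked position; the flags
    are the kind of signal that could supervise the drafter online.
    """
    emitted, flags = [], []
    for d, v in zip(draft_tokens, verifier_tokens):
        if d == v:
            emitted.append(d)
            flags.append(1)
        else:
            emitted.append(v)  # correction keeps the output equal to the verifier's
            flags.append(0)
            break
    return emitted, flags

# Toy token ids (hypothetical): the drafter is wrong at the third position.
draft = [5, 9, 2, 7]
verifier = [5, 9, 4, 7]
emitted, flags = verify_block(draft, verifier)
print(emitted, flags)
```

Under this rule the emitted sequence is always what the verifier alone would have produced, which is the sense in which such schemes are called lossless; the speedup comes from checking several drafted tokens in one verifier pass.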
Scalable In-context Ranking with Generative Models
Authors: Nilesh Gupta, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Inderjit Dhillon, Felix Yu
2025-10-06
In-context Ranking (ICR) is an emerging paradigm for Information Retrieval
(IR), which leverages contextual understanding of LLMs by directly
incorporating the task description, candidate documents, and the query into the
model's input prompt and tasking the LLM to identify relevant document(s).
While it is effective, efficiency is a significant challenge in this paradigm,
especially as the candidate list grows due to quadratic/super-linear scaling of
attention operation with context length. To this end, this paper first
identifies inherent and exploitable structures in the attention of LLMs
finetuned for ICR: (1) inter-document block sparsity: attention is dense within
each document block but sparse across different documents in the context; and
(2) query-document block relevance: the attention scores from certain query
tokens to a document block in middle layers strongly correlate with that
document's actual relevance. Motivated by these observations, we introduce
BlockRank (Blockwise In-context Ranking), a novel method that adapts the
attention operation in an LLM by (a) architecturally enforcing the observed
inter-document block sparsity, reducing attention complexity from quadratic to
linear without loss in performance, and (b) optimizing query-document block
relevance for true relevant documents during fine-tuning using an auxiliary
contrastive training objective, improving retrieval in attention. Experiments
on BEIR, MSMarco and NQ with Mistral-7B demonstrate that BlockRank Mistral
matches or outperforms existing SOTA listwise rankers and controlled fine-tuned
baseline while being significantly more efficient at inference (4.7x for 100
MSMarco documents in context) and scaling gracefully to long-context
shortlists, around 500 documents in-context (approximately 100K context length)
within a second, presenting a scalable and effective solution for ICR.
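
The inter-document block structure described in this abstract can be pictured as an attention mask. The Python sketch below builds one under stated assumptions: document tokens attend causally to the instruction and to their own document only, while query tokens attend to everything before them. The exact masking rules used by BlockRank are not given in the abstract, so the layout, the segment kinds, and `block_mask` are illustrative.

```python
def block_mask(segments):
    """Build a causal block mask.

    segments: list of (kind, length) with kind in {"inst", "doc", "query"}.
    Returns mask[i][j] == True when token i may attend to token j.
    """
    seg_id, kinds = [], []
    for s, (kind, length) in enumerate(segments):
        seg_id += [s] * length
        kinds += [kind] * length
    n = len(seg_id)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only the same or earlier positions
            same_block = seg_id[i] == seg_id[j]
            to_instruction = kinds[j] == "inst"
            from_query = kinds[i] == "query"
            mask[i][j] = same_block or to_instruction or from_query
    return mask

# Hypothetical prompt: 2 instruction tokens, two 3-token documents, 2 query tokens.
layout = [("inst", 2), ("doc", 3), ("doc", 3), ("query", 2)]
mask = block_mask(layout)
allowed = sum(sum(row) for row in mask)
full_causal = len(mask) * (len(mask) + 1) // 2
print(allowed, "of", full_causal, "causal pairs kept")
```

With a fixed document length, the number of allowed pairs grows linearly in the number of documents rather than quadratically, because no document block ever attends to another.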
KVLinC KV Cache Quantization with Hadamard Rotation and Linear Correction
Authors: Utkarsh Saxena, Kaushik Roy
2025-10-06
Quantizing the key-value (KV) cache is a promising strategy for improving the
inference efficiency of large language models (LLMs). However, aggressive
quantization to very low precision (e.g., 2 bits) introduces significant errors
in the stored key and value tensors, which propagate through the dot-product
attention mechanism and ultimately degrade generation quality. To address this,
we propose KVLinC, a framework to mitigate attention errors introduced by KV
cache quantization in the extreme low-precision regime. KVLinC combines a
Hadamard rotation, which reduces quantization error in values, with lightweight
linear correction adapters that explicitly compensate for errors introduced by
quantized keys. Across extensive evaluations on the LLaMA, Qwen2.5, and Qwen3
model families,
KVLinC consistently matches or surpasses strong baselines while
achieving higher KV-cache compression. Furthermore, we implement a custom
attention kernel that results in up to 2.55x faster inference compared to a Flash
Attention baseline, enabling efficient long-context LLM inference.
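
Why a Hadamard rotation helps low-bit quantization can be seen on a toy vector: the rotation spreads a single outlier evenly across all coordinates, so a uniform quantizer no longer has to cover one extreme value. The Python sketch below implements a normalized fast Walsh-Hadamard transform; it is a generic illustration, not the KVLinC kernel, and the example vector is hypothetical.

```python
import math

def hadamard(vec):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two).

    The normalized transform is orthogonal and is its own inverse.
    """
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

outlier = [8.0, 0.0, 0.0, 0.0]   # one large channel, three empty ones
rotated = hadamard(outlier)      # energy spread over all four channels
restored = hadamard(rotated)     # applying the transform again undoes it
print(rotated, restored)
```

Because the rotation is orthogonal, it can be undone exactly after dequantization, so only the quantization error itself changes, not the represented values.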
WeatherArchive-Bench Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives
Authors: Yongan Yu, Xianda Du, Qingchen Hu, Jiahao Liang, Jingwei Ni, Dan Qiang, Kaiyu Huang, Grant McKenzie, Renee Sieber, Fengran Mo
2025-10-06
Historical archives on weather events are collections of enduring primary
source records that offer rich, untapped narratives of how societies have
experienced and responded to extreme weather events. These qualitative accounts
provide insights into societal vulnerability and resilience that are largely
absent from meteorological records, making them valuable for climate scientists
to understand societal responses. However, their vast scale, noisy digitized
quality, and archaic language make it difficult to transform them into
structured knowledge for climate research. To address this challenge, we
introduce WeatherArchive-Bench, the first benchmark for evaluating
retrieval-augmented generation (RAG) systems on historical weather archives.
WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which
measures a system's ability to locate historically relevant passages from over
one million archival news segments, and WeatherArchive-Assessment, which
evaluates whether Large Language Models (LLMs) can classify societal
vulnerability and resilience indicators from extreme weather narratives.
Extensive experiments across sparse, dense, and re-ranking retrievers, as well
as a diverse set of LLMs, reveal that dense retrievers often fail on historical
terminology, while LLMs frequently misinterpret vulnerability and resilience
concepts. These findings highlight key limitations in reasoning about complex
societal indicators and provide insights for designing more robust
climate-focused RAG systems from archival contexts. The constructed dataset and
evaluation framework are publicly available at
https://anonymous.4open.science/r/WeatherArchive-Bench/.
DP-Adam-AC Privacy-preserving Fine-Tuning of Localizable Language Models Using Adam Optimization with Adaptive Clipping
Authors: Ruoxing Yang
2025-10-06
Large language models (LLMs) such as ChatGPT have evolved into powerful and
ubiquitous tools. Fine-tuning on small datasets allows LLMs to acquire
specialized skills for specific tasks efficiently. Although LLMs provide great
utility in both general and task-specific use cases, they are limited by two
security-related concerns. First, traditional LLM hardware requirements make
them infeasible to run locally on consumer-grade devices. A remote network
connection with the LLM provider's server is usually required, making the
system vulnerable to network attacks. Second, fine-tuning an LLM for a
sensitive task may involve sensitive data. Non-private fine-tuning algorithms
produce models vulnerable to training data reproduction attacks. Our work
addresses these security concerns by enhancing differentially private
optimization algorithms and applying them to fine-tune localizable language
models. We introduce adaptable gradient clipping along with other engineering
enhancements to the standard DP-Adam optimizer to create DP-Adam-AC. We use our
optimizer to fine-tune examples of two localizable LLM designs: a small language
model (Qwen2.5-0.5B) and 1.58-bit quantization (Bitnet-b1.58-2B). We
demonstrate promising improvements in loss through experimentation with two
synthetic datasets.
Stratum System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving
Authors: Yue Pan, Zihan Xia, Po-Kai Hsu, Lanxiang Hu, Hyungyo Kim, Janak Sharda, Minxuan Zhou, Nam Sung Kim, Shimeng Yu, Tajana Rosing, Mingu Kang
2025-10-06
As Large Language Models (LLMs) continue to evolve, Mixture of Experts (MoE)
architecture has emerged as a prevailing design for achieving state-of-the-art
performance across a wide range of tasks. MoE models use sparse gating to
activate only a handful of expert sub-networks per input, achieving
billion-parameter capacity with inference costs akin to much smaller models.
However, such models often pose challenges for hardware deployment due to the
massive data volume introduced by the MoE layers. To address the challenges of
MoE models, we propose Stratum, a system-hardware co-design approach
that combines the novel memory technology Monolithic 3D-Stackable DRAM (Mono3D
DRAM), near-memory processing (NMP), and GPU acceleration. The logic and Mono3D
DRAM dies are connected through hybrid bonding, whereas the Mono3D DRAM stack
and GPU are interconnected via silicon interposer. Mono3D DRAM offers higher
internal bandwidth than HBM thanks to the dense vertical interconnect pitch
enabled by its monolithic structure, which supports implementations of
higher-performance near-memory processing. Furthermore, we tackle the latency
differences introduced by aggressive vertical scaling of Mono3D DRAM along the
z-dimension by constructing internal memory tiers and assigning data across
layers based on access likelihood, guided by topic-based expert usage
prediction to boost NMP throughput. The Stratum system achieves up to 8.29x
improvement in
throughput and 7.66x better energy efficiency across
various benchmarks compared to GPU baselines.
Boomerang Distillation Enables Zero-Shot Model Size Interpolation
Authors: Sara Kangaslahti, Nihal V. Nayak, Jonathan Geuter, Marco Fumero, Francesco Locatello, David Alvarez-Melis
2025-10-06
Large language models (LLMs) are typically deployed under diverse memory and
compute constraints. Existing approaches build model families by training each
size independently, which is prohibitively expensive and provides only
coarse-grained size options. In this work, we identify a novel phenomenon that
we call boomerang distillation: starting from a large base model (the teacher),
one first distills down to a small student and then progressively reconstructs
intermediate-sized models by re-incorporating blocks of teacher layers into the
student without any additional training. This process produces zero-shot
interpolated models of many intermediate sizes whose performance scales
smoothly between the student and teacher, often matching or surpassing
pretrained or distilled models of the same size. We further analyze when this
type of interpolation succeeds, showing that alignment between teacher and
student through
and distillation is essential. Boomerang distillation
thus provides a simple and efficient way to generate fine-grained model
families, dramatically reducing training cost while enabling flexible
adaptation across deployment environments. The code and models are available at
https://github.com/dcml-lab/boomerang-distillation.
SSDD Single-Step Diffusion Decoder for Efficient Image Tokenization
Authors: Théophane Vallaeys, Jakob Verbeek, Matthieu Cord
2025-10-06
Tokenizers are a key component of state-of-the-art generative image models,
extracting the most important features from the signal while reducing data
dimension and redundancy. Most current tokenizers are based on KL-regularized
variational autoencoders (KL-VAE), trained with reconstruction, perceptual and
adversarial losses. Diffusion decoders have been proposed as a more principled
alternative to model the distribution over images conditioned on the latent.
However, matching the performance of KL-VAE still requires adversarial losses,
as well as a higher decoding time due to iterative sampling. To address these
limitations, we introduce a new pixel diffusion decoder architecture for
improved scaling and training stability, benefiting from
components
and GAN-free training. We use distillation to replicate the performance of the
diffusion decoder in an efficient single-step decoder. This makes SSDD the
first diffusion decoder optimized for single-step reconstruction trained
without adversarial losses, reaching higher reconstruction quality and faster
sampling than KL-VAE. In particular, SSDD improves reconstruction FID with
higher throughput and preserves the generation quality of DiTs with faster
sampling. As such, SSDD can be used as
a drop-in replacement for KL-VAE, and for building higher-quality and faster
generative models.
Bidirectional Mammogram View Translation with Column-Aware and Implicit 3D Conditional Diffusion
Authors: Xin Li, Kaixiang Yang, Qiang Li, Zhiwei Wang
2025-10-06
Dual-view mammography, including craniocaudal (CC) and mediolateral oblique
(MLO) projections, offers complementary anatomical views crucial for breast
cancer diagnosis. However, in real-world clinical workflows, one view may be
missing, corrupted, or degraded due to acquisition errors or compression
artifacts, limiting the effectiveness of downstream analysis. View-to-view
translation can help recover missing views and improve lesion alignment. Unlike
natural images, this task in mammography is highly challenging due to large
non-rigid deformations and severe tissue overlap in X-ray projections, which
obscure pixel-level correspondences. In this paper, we propose Column-Aware and
Implicit 3D Diffusion (CA3D-Diff), a novel bidirectional mammogram view
translation framework based on a conditional diffusion model. To address
cross-view structural misalignment, we first design a column-aware
cross-attention mechanism that leverages the geometric property that
anatomically corresponding regions tend to lie in similar column positions
across views. A Gaussian-decayed bias is applied to emphasize local column-wise
correlations while suppressing distant mismatches. Furthermore, we introduce an
implicit 3D structure reconstruction module that back-projects noisy 2D latents
into a coarse 3D feature volume based on breast-view projection geometry. The
reconstructed 3D structure is refined and injected into the denoising UNet to
guide cross-view generation with enhanced anatomical awareness. Extensive
experiments demonstrate that CA3D-Diff achieves superior performance in
bidirectional tasks, outperforming state-of-the-art methods in visual fidelity
and structural consistency. Furthermore, the synthesized views effectively
improve single-view malignancy classification in screening settings,
demonstrating the practical value of our method in real-world diagnostics.
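As a rough illustration of the Gaussian-decayed column bias mentioned above, the sketch below adds a penalty of the form -(q - k)^2 / (2 sigma^2) to attention logits over column indices. The exact functional form, the `sigma` parameter, and the function names are assumptions for illustration; the abstract does not specify them.

```python
from math import exp


def column_bias(num_cols, sigma):
    """Additive bias that decays with squared column distance (assumed form)."""
    return [[-((q - k) ** 2) / (2.0 * sigma ** 2) for k in range(num_cols)]
            for q in range(num_cols)]


def softmax(row):
    m = max(row)
    e = [exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def biased_attention_row(logits, bias_row):
    """Attention weights for one query column after adding the bias."""
    return softmax([l + b for l, b in zip(logits, bias_row)])


bias = column_bias(5, sigma=1.0)
# With uninformative logits, the weights concentrate on nearby columns.
print(biased_attention_row([0.0] * 5, bias[2]))
```

A smaller `sigma` restricts attention to a narrower band of columns, while a large `sigma` approaches ordinary unbiased cross-attention.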
ParallelBench Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs
Authors: Wonjun Kang, Kevin Galim, Seunghyuk Oh, Minjae Lee, Yuchen Zeng, Shuibai Zhang, Coleman Hooper, Yuezhou Hu, Hyung Il Koo, Nam Ik Cho, Kangwook Lee
2025-10-06
While most autoregressive LLMs are constrained to one-by-one decoding,
diffusion LLMs (dLLMs) have attracted growing interest for their potential to
dramatically accelerate inference through parallel decoding. Despite this
promise, the conditional independence assumption in dLLMs causes parallel
decoding to ignore token dependencies, inevitably degrading generation quality
when these dependencies are strong. However, existing works largely overlook
these inherent challenges, and evaluations on standard benchmarks (e.g., math
and coding) are not sufficient to capture the quality degradation caused by
parallel decoding. To address this gap, we first provide an
information-theoretic analysis of parallel decoding. We then conduct case
studies on analytically tractable synthetic list operations from both data
distribution and decoding strategy perspectives, offering quantitative insights
that highlight the fundamental limitations of parallel decoding. Building on
these insights, we propose ParallelBench, the first benchmark specifically
designed for dLLMs, featuring realistic tasks that are trivial for humans and
autoregressive LLMs yet exceptionally challenging for dLLMs under parallel
decoding. Using ParallelBench, we systematically analyze both dLLMs and
autoregressive LLMs, revealing that: (i) dLLMs under parallel decoding can
suffer dramatic quality degradation in real-world scenarios, and (ii) current
parallel decoding strategies struggle to adapt their degree of parallelism
based on task difficulty, thus failing to achieve meaningful speedup without
compromising quality. Our findings underscore the pressing need for innovative
decoding methods that can overcome the current speed-quality trade-off. We
release our benchmark to help accelerate the development of truly efficient
dLLMs.
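The effect of the conditional independence assumption can be seen in a toy calculation. The sketch below is not from the paper: it takes a hypothetical joint distribution over two-token sequences, replaces it with the product of its per-position marginals (what sampling all positions independently in one step amounts to), and measures how much probability lands on sequences the joint never produces.

```python
from itertools import product


def marginals(joint):
    """Per-position token marginals of a joint over equal-length sequences."""
    length = len(next(iter(joint)))
    margs = [dict() for _ in range(length)]
    for seq, p in joint.items():
        for pos, tok in enumerate(seq):
            margs[pos][tok] = margs[pos].get(tok, 0.0) + p
    return margs


def parallel_distribution(joint):
    """Distribution obtained by sampling every position independently."""
    out = {}
    for combo in product(*[m.items() for m in marginals(joint)]):
        prob = 1.0
        for _, p in combo:
            prob *= p
        out[tuple(tok for tok, _ in combo)] = prob
    return out


def invalid_mass(joint):
    """Probability that independent sampling yields an impossible sequence."""
    return sum(p for seq, p in parallel_distribution(joint).items()
               if joint.get(seq, 0.0) == 0.0)


# Toy joint with strongly dependent tokens (illustrative data only).
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}
print(parallel_distribution(joint))
print(invalid_mass(joint))
```

When the tokens are independent under the joint, the invalid mass is zero, which matches the abstract's point that degradation appears when dependencies are strong.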
Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models
Authors: Raha Askari, Sina Zarrieß, Özge Alacam, Judith Sieker
2025-10-06
Implicit meanings are integral to human communication, making it essential
for language models to be capable of identifying and interpreting them. Grice
(1975) proposed a set of conversational maxims that guide cooperative dialogue,
noting that speakers may deliberately violate these principles to express
meanings beyond literal words, and that listeners, in turn, recognize such
violations to draw pragmatic inferences.
Building on Surian et al. (1996)'s study of children's sensitivity to
violations of Gricean maxims, we introduce a novel benchmark to test whether
language models pretrained on less than 10M and less than 100M tokens can
distinguish maxim-adhering from maxim-violating utterances. We compare these
BabyLMs across five maxims and situate their performance relative to children
and a Large Language Model (LLM) pretrained on 3T tokens.
We find that overall, models trained on less than 100M tokens outperform
those trained on less than 10M, yet fall short of child-level and LLM
competence. Our results suggest that modest data increases improve some aspects
of pragmatic behavior, leading to finer-grained differentiation between
pragmatic dimensions.
Multilingual Routing in Mixture-of-Experts
Authors: Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng
2025-10-06
Mixture-of-Experts (MoE) architectures have become the key to scaling modern
LLMs, yet little is understood about how their sparse routing dynamics respond
to multilingual data. In this work, we analyze expert routing patterns using
parallel multilingual datasets and present highly interpretable layer-wise
phenomena. We find that MoE models route tokens in language-specific ways in
the early and late decoder layers but exhibit significant cross-lingual routing
alignment in middle layers, mirroring parameter-sharing trends observed in
dense LLMs. In particular, we reveal a clear, strong correlation between a
model's performance in a given language and how similarly its tokens are routed
to English in these layers. Extending beyond correlation, we explore
inference-time interventions that induce higher cross-lingual routing
alignment. We introduce a method that steers the router by promoting
middle-layer task experts frequently activated in English, and it successfully
increases multilingual performance. These 1-2% gains are remarkably consistent
across two evaluation tasks, three models, and 15+ languages, especially given
that these simple interventions override routers of extensively trained,
state-of-the-art LLMs. In comparison, interventions outside of the middle
layers or targeting multilingual-specialized experts only yield performance
degradation. Altogether, we present numerous findings that explain how MoEs
process non-English text and demonstrate that generalization is limited by the
model's ability to leverage language-universal experts in all languages.
The R(1)W(1) Communication Model for Self-Stabilizing Distributed Algorithms
Authors: Hirotsugu Kakugawa, Sayaka Kamei, Masahiro Shibata, Fukuhito Ooshita
2025-10-06
Self-stabilization is a versatile methodology in the design of fault-tolerant
distributed algorithms for transient faults. A self-stabilizing system
automatically recovers from any kind and any finite number of transient faults.
This property is specifically useful in modern distributed systems with a large
number of components. In this paper, we propose a new communication and
execution model named the R(1)W(1) model in which each process can read and
write its own and neighbors' local variables in a single step. We propose
self-stabilizing distributed algorithms in the R(1)W(1) model for the problems
of maximal matching, minimal k-dominating set and maximal k-dependent set.
Finally, we propose an example transformer, based on randomized distance-two
local mutual exclusion, to simulate algorithms designed for the R(1)W(1) model
in the synchronous message passing model with synchronized clocks.
A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification
Authors: Hao Liu, Yunhao Gao, Wei Li, Mingyang Zhang, Maoguo Gong, Lorenzo Bruzzone
2025-10-06
Deep learning-based methods have achieved significant success in remote
sensing Earth observation data analysis. Numerous feature fusion techniques
address multimodal remote sensing image classification by integrating global
and local features. However, these techniques often struggle to extract
structural and detail features from heterogeneous and redundant multimodal
images. With the goal of introducing frequency domain learning to model key and
detail features, this paper introduces the spatial-spectral-frequency
interaction network (SFin), which integrates pairwise fusion modules across
the spatial, spectral, and frequency domains. Specifically, we propose a
high-frequency sparse enhancement transformer that employs sparse
spatial-spectral attention to optimize the parameters of the high-frequency
filter. Subsequently, a two-level spatial-frequency fusion strategy is
introduced, comprising an adaptive frequency channel module that fuses
low-frequency structures with enhanced high-frequency details, and a
high-frequency resonance mask that emphasizes sharp edges via phase similarity.
In addition, a spatial-spectral attention fusion module further enhances
feature extraction at intermediate layers of the network. Experiments on four
benchmark multimodal datasets with limited labeled data demonstrate that
SFin achieves superior classification, outperforming state-of-the-art
methods. The code is available at https://github.com/HaoLiu-XDU/SSFin.
Compressed Concatenation of Small Embedding Models
Authors: Mohamed Ayoub Ben Ayad, Michael Dinzinger, Kanishka Ghosh Dastidar, Jelena Mitrovic, Michael Granitzer
2025-10-06
Embedding models are central to dense retrieval, semantic search, and
recommendation systems, but their size often makes them impractical to deploy
in resource-constrained environments such as browsers or edge devices. While
smaller embedding models offer practical advantages, they typically
underperform compared to their larger counterparts. To bridge this gap, we
demonstrate that concatenating the raw embedding vectors of multiple small
models can outperform a single larger baseline on standard retrieval
benchmarks. To overcome the resulting high dimensionality of naive
concatenation, we introduce a lightweight unified decoder trained with a
Matryoshka Representation Learning (MRL) loss. This decoder maps the
high-dimensional joint representation to a low-dimensional space, preserving
most of the original performance without fine-tuning the base models. We also
show that while concatenating more base models yields diminishing gains, the
robustness of the decoder's representation under compression and quantization
improves. Our experiments show that, on a subset of MTEB retrieval tasks, our
concat-encode-quantize pipeline recovers 89% of the original performance with
a 48x compression factor when the pipeline is applied to a concatenation of
four small embedding models.
FedSRD Sparsify-Reconstruct-Decompose for Communication-Efficient Federated Large Language Models Fine-Tuning
Authors: Guochen Yan, Luyuan Xie, Qingni Shen, Yuejian Fang, Zhonghai Wu
2025-10-06
The current paradigm of training large language models (LLMs) on publicly
available Web data is becoming unsustainable, with high-quality data sources in
specialized domains nearing exhaustion. Federated Learning (FL) emerges as a
practical solution for the next generation of AI on a decentralized Web,
enabling privacy-preserving collaborative fine-tuning by leveraging private
data distributed across a global client base. While Low-Rank Adaptation (LoRA)
is the standard for efficient fine-tuning, its application in federated
settings presents a critical challenge: communication overhead remains a
significant bottleneck across the Web's heterogeneous network conditions. The
structural redundancy within LoRA parameters not only incurs a heavy
communication burden but also introduces conflicts when aggregating client
updates. To address this, we propose FedSRD, a Sparsify-Reconstruct-Decompose
framework designed for communication-efficient federated LLMs fine-tuning. We
first introduce an importance-aware sparsification method that preserves the
structural integrity of LoRA updates to reduce the uploaded parameter count.
The server then reconstructs and aggregates these updates in a full-rank space
to mitigate conflicts. Finally, it decomposes the global update into a
low-rank format for broadcast, ensuring a symmetrically efficient cycle. We
also propose an efficient variant, FedSRD-e, to reduce computational overhead.
Experimental results on 10 benchmarks demonstrate that our framework
significantly reduces communication costs by up to 90% while even improving
model performance on heterogeneous client data.
Language Model Based Text-to-Audio Generation Anti-Causally Aligned Collaborative Residual Transformers
Authors: Juncheng Wang, Chao Xu, Cheng Yu, Zhe Hu, Haoyu Xie, Guoqi Yu, Lei Shang, Shujun Wang
2025-10-06
While language models (LMs) paired with residual vector quantization (RVQ)
tokenizers have shown promise in text-to-audio (T2A) generation, they still lag
behind diffusion-based models by a non-trivial margin. We identify a critical
dilemma underpinning this gap: incorporating more RVQ layers improves audio
reconstruction fidelity but exceeds the generation capacity of conventional
LMs. To address this, we first analyze RVQ dynamics and uncover two key
limitations: 1) orthogonality of features across RVQ layers hinders effective
LMs training, and 2) descending semantic richness in tokens from deeper RVQ
layers exacerbates exposure bias during autoregressive decoding. Based on these
insights, we propose Siren, a novel LM-based framework that employs multiple
isolated transformers with causal conditioning and anti-causal alignment via
reinforcement learning. Extensive experiments demonstrate that Siren
outperforms both existing LM-based and diffusion-based T2A systems, achieving
state-of-the-art results. By bridging the representational strengths of LMs
with the fidelity demands of audio synthesis, our approach repositions LMs as
competitive contenders against diffusion models in T2A tasks. Moreover, by
aligning audio representations with linguistic structures, Siren facilitates a
promising pathway toward unified multi-modal generation frameworks.
LaDiR Latent Diffusion Enhances LLMs for Text Reasoning
Authors: Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin
2025-10-06
Large Language Models (LLMs) demonstrate their reasoning ability through
chain-of-thought (CoT) generation. However, LLM's autoregressive decoding may
limit the ability to revisit and refine earlier tokens in a holistic manner,
which can also lead to inefficient exploration for diverse solutions. In this
paper, we propose LaDiR (Latent Diffusion Reasoner), a novel reasoning
framework that unifies the expressiveness of continuous latent representation
with the iterative refinement capabilities of latent diffusion models for an
existing LLM. We first construct a structured latent reasoning space using a
Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of
thought tokens, preserving semantic information and interpretability while
offering compact but expressive representations. Subsequently, we utilize a
latent diffusion model that learns to denoise a block of latent thought tokens
with a blockwise bidirectional attention mask, enabling longer horizon and
iterative refinement with adaptive test-time compute. This design allows
efficient parallel generation of diverse reasoning trajectories, allowing the
model to plan and revise the reasoning process holistically. We conduct
evaluations on a suite of mathematical reasoning and planning benchmarks.
Empirical results show that LaDiR consistently improves accuracy, diversity,
and interpretability over existing autoregressive, diffusion-based, and latent
reasoning methods, revealing a new paradigm for text reasoning with latent
diffusion.
COSMIR Chain Orchestrated Structured Memory for Iterative Reasoning over Long Context
Authors: Naman Gupta, Shreeyash Gowaikar, Arun Iyer, Kirankumar Shiragur, Ramakrishna B Bairi, Rishikesh Maurya, Ritabrata Maiti, Sankarshan Damle, Shachee Mishra Gupta
2025-10-06
Reasoning over very long inputs remains difficult for large language models
(LLMs). Common workarounds either shrink the input via retrieval (risking
missed evidence), enlarge the context window (straining selectivity), or stage
multiple agents to read in pieces. In staged pipelines (e.g., Chain of Agents,
CoA), free-form summaries passed between agents can discard crucial details and
amplify early mistakes. We introduce COSMIR (Chain Orchestrated Structured
Memory for Iterative Reasoning), a chain-style framework that replaces ad hoc
messages with a structured memory. A Planner agent first turns a user query
into concrete, checkable sub-questions. Worker agents process chunks via a
fixed micro-cycle: Extract, Infer, Refine, writing all updates to the shared
memory. A Manager agent then Synthesizes the final answer directly from the
memory. This preserves step-wise read-then-reason benefits while changing both
the communication medium (structured memory) and the worker procedure (fixed
micro-cycle), yielding higher faithfulness, better long-range aggregation, and
auditability. On long-context QA from the HELMET suite, COSMIR reduces
propagation-stage information loss and improves accuracy over a CoA baseline.
Multi-Agent Collaborative Intelligence Dual-Dial Control for Reliable LLM Reasoning
Authors: Edward Y. Chang, Ethan Y. Chang
2025-10-06
Multi-agent debate often wastes compute by using a fixed adversarial stance,
aggregating without deliberation, or stopping on heuristics. We introduce MACI,
an active controller with two independent dials that decouple information from
behavior: an information dial that gates evidence by quality, and a behavior
dial that schedules contentiousness from exploration to consolidation. A
moderator tracks disagreement, overlap, evidence quality, and argument quality,
and halts when gains plateau. We provide theory-lite guarantees for
nonincreasing dispersion and provable termination, with a budget-feasible
scheduler. Across clinical diagnosis and news-bias tasks, MACI improves
accuracy and calibration while reducing tokens, and converts residual
uncertainty into precision RAG plans that specify what to retrieve next. We use
a cross-family LLM judge (CRIT) as a conservative soft weight and stop signal,
validated for order invariance and judge-swap stability; stability depends on
using high-capability judges. MACI turns debate into a budget-aware,
measurable, and provably terminating controller.
Compressed Convolutional Attention Efficient Attention in a Compressed Latent Space
Authors: Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge
2025-10-06
Multi-headed Attention's (MHA) quadratic compute and linearly growing KV-cache
make long-context transformers expensive to train and serve. Prior
works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA)
shrink the cache, speeding decode, but leave compute, which determines prefill
and training speed, largely unchanged. We introduce Compressed Convolutional
Attention (CCA), a novel attention method which down-projects queries, keys,
and values and performs the entire attention operation inside the shared latent
space. This simple design dramatically cuts parameters, KV-cache, and FLOPs all
at once by the desired compression factor. Because CCA is orthogonal to
head-sharing, we combine the two to form Compressed Convolutional Grouped Query
Attention (CCGQA), which further tightens the compute-bandwidth Pareto frontier
so that users can tune compression toward either FLOP or memory limits without
sacrificing quality. Experiments show that CCGQA consistently outperforms both
GQA and MLA at equal KV-cache compression on dense and MoE models.
Additionally, we show that CCGQA outperforms all other attention methods on MoE
models with half the KV-cache of GQA and MLA, achieving an 8x KV-cache
compression with no drop in performance compared to standard MHA. CCA and CCGQA
also dramatically reduce the FLOP cost of attention, which leads to
substantially faster training and prefill than existing methods. On H100 GPUs,
our fused CCA/CCGQA kernel reduces prefill latency by about 1.7x at a sequence
length of 16k relative to MHA, and accelerates backward by about 1.3x.
REAR Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization
Authors: Qiyuan He, Yicong Li, Haotian Ye, Jinghao Wang, Xinyao Liao, Pheng-Ann Heng, Stefano Ermon, James Zou, Angela Yao
2025-10-06
Visual autoregressive (AR) generation offers a promising path toward unifying
vision and language models, yet its performance remains suboptimal against
diffusion models. Prior work often attributes this gap to tokenizer limitations
and rasterization ordering. In this work, we identify a core bottleneck from
the perspective of generator-tokenizer inconsistency, i.e., the AR-generated
tokens may not be well-decoded by the tokenizer. To address this, we propose
reAR, a simple training strategy introducing a token-wise regularization
objective: when predicting the next token, the causal transformer is also
trained to recover the visual embedding of the current token and predict the
embedding of the target token under a noisy context. It requires no changes to
the tokenizer, generation order, inference pipeline, or external models.
Despite its simplicity, reAR substantially improves performance. On ImageNet,
it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard
rasterization-based tokenizer. When applied to advanced tokenizers, it achieves
a gFID of 1.42 with only 177M parameters, matching the performance with larger
state-of-the-art diffusion models (675M).
Speculative Actions A Lossless Framework for Faster Agentic Systems
Authors: Naimeng Ye, Arnav Ahuja, Georgios Liargkovas, Yunan Lu, Kostis Kaffes, Tianyi Peng
2025-10-05
Despite growing interest in AI agents across industry and academia, their
execution in an environment is often slow, hampering training, evaluation, and
deployment. For example, a game of chess between two state-of-the-art agents
may take hours. A critical bottleneck is that agent behavior unfolds
sequentially: each action requires an API call, and these calls can be
time-consuming. Inspired by speculative execution in microprocessors and
speculative decoding in LLM inference, we propose speculative actions, a
lossless framework for general agentic systems that predicts likely actions
using faster models, enabling multiple steps to be executed in parallel. We
evaluate this framework across three agentic environments: gaming, e-commerce,
web search, and a "lossy" extension for an operating systems environment. In
all cases, speculative actions achieve substantial accuracy in next-action
prediction (up to 55%), translating into significant reductions in end-to-end
latency. Moreover, performance can be further improved through stronger
guessing models, top-K action prediction, multi-step speculation, and
uncertainty-aware optimization, opening a promising path toward deploying
low-latency agentic systems in the real world.
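A one-step version of the idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' system: `slow_policy` stands for the authoritative (slow) agent call, `fast_policy` for a cheaper guessing model, and `env_step` for a side-effect-free environment transition, all of them hypothetical callables. The guessed action is executed in parallel with the slow call and kept only if the slow call agrees, so the final trajectory is identical to sequential execution.

```python
from concurrent.futures import ThreadPoolExecutor


def run_speculative(state, steps, slow_policy, fast_policy, env_step):
    """Run `steps` actions, speculating one step ahead with a fast guesser."""
    hits = 0
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(steps):
            truth = pool.submit(slow_policy, state)      # authoritative, slow
            guess = fast_policy(state)                   # cheap prediction
            ahead = pool.submit(env_step, state, guess)  # pre-execute the guess
            action = truth.result()
            if action == guess:
                state = ahead.result()                   # speculation accepted
                hits += 1
            else:
                state = env_step(state, action)          # discard and redo
    return state, hits
```

Because a wrong guess is simply thrown away, correctness never depends on the guesser; only the latency saved does.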
SliceMoE Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling
Authors: Harshil Vejendla
2025-10-05
Mixture-of-Experts (MoE) layers scale transformers by routing tokens to a
sparse subset of feed-forward experts. Token-level routing, however, assigns an
entire semantic spectrum to each expert, creating capacity bottlenecks,
load-balancing pathologies, and limited specialization. We introduce SliceMoE,
an architecture that routes contiguous slices of a token's hidden vector. A
d-dimensional embedding is partitioned into S slices, and for each slice, a
lightweight shared router predicts the top-k experts. Experts operate on their
assigned slices independently, and outputs are reassembled, maintaining
per-token FLOP efficiency. Because slices from different tokens interleave
within an expert, utilization is naturally smoother. We propose a slice-level
capacity loss, cross-slice dropout, and efficient fused batched GEMM kernels.
Experiments on WikiText-103 language modeling, WMT En-De translation, and three
text-classification datasets show SliceMoE attains up to 1.7x faster inference
than dense baselines, 12 to 18 percent lower perplexity than parameter-matched
token-MoE, and improved expert balance, with interpretable expertise over
syntactic versus semantic subspaces.
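The slice-level routing described above can be sketched in a few lines. The code is a minimal illustration, assuming a linear shared router (one weight vector per expert), softmax gating over the selected experts, and experts given as plain Python callables; these choices and all names are assumptions, not details from the paper.

```python
from math import exp


def split_slices(x, num_slices):
    """Partition a d-dimensional vector into contiguous equal slices."""
    size = len(x) // num_slices
    return [x[i * size:(i + 1) * size] for i in range(num_slices)]


def softmax(values):
    m = max(values)
    e = [exp(v - m) for v in values]
    s = sum(e)
    return [v / s for v in e]


def slice_moe(x, router, experts, num_slices, k):
    """Route each slice to its top-k experts and reassemble the outputs."""
    out = []
    for sl in split_slices(x, num_slices):
        # The same router weights score every slice (shared router).
        logits = [sum(w * v for w, v in zip(row, sl)) for row in router]
        chosen = sorted(range(len(logits)), key=lambda e: -logits[e])[:k]
        gates = softmax([logits[e] for e in chosen])
        mixed = [0.0] * len(sl)
        for g, e in zip(gates, chosen):
            mixed = [m + g * y for m, y in zip(mixed, experts[e](sl))]
        out.extend(mixed)  # concatenation restores the original dimension
    return out


router = [[1.0, 0.0], [0.0, 1.0]]
experts = [lambda s: [2 * v for v in s], lambda s: [-v for v in s]]
print(slice_moe([1.0, 0.0, 0.0, 1.0], router, experts, num_slices=2, k=1))
```

In this toy call the two slices of one token go to different experts, which is the behavior token-level routing cannot express.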
Doctor-R1 Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
Authors: Yunghwei Lai, Kaiming Liu, Ziyue Wang, Weizhi Ma, Yang Liu
2025-10-05
The professionalism of a human doctor in outpatient service depends on two
core abilities: the ability to make accurate medical decisions and the medical
consultation skill to conduct strategic, empathetic patient inquiry. Existing
Large Language Models (LLMs) have achieved remarkable accuracy on medical
decision-making benchmarks. However, they often lack the ability to conduct the
strategic and empathetic consultation, which is essential for real-world
clinical scenarios. To address this gap, we propose Doctor-R1, an AI doctor
agent trained to master both of the capabilities by asking high-yield questions
and conducting strategic multi-turn inquiry to guide decision-making. Our
framework introduces three key components: a multi-agent interactive
environment, a two-tiered reward architecture that separately optimizes
clinical decision-making and communicative inquiry skills, and an experience
repository to ground policy learning in high-quality prior trajectories. We
evaluate Doctor-R1 on OpenAI's HealthBench and MAQuE, assessed across
multi-facet metrics, such as communication quality, user experience, and task
accuracy. Remarkably, Doctor-R1 surpasses state-of-the-art open-source
specialized LLMs by a substantial margin with higher parameter efficiency and
outperforms powerful proprietary models. Furthermore, the human evaluations
show a strong preference for Doctor-R1 to generate human-preferred clinical
dialogue, demonstrating the effectiveness of the framework.
Don't Pass A Bayesian Framework for Large Language Model Evaluation
Authors: Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary
2025-10-05
Pass@k is widely used to report performance for LLM reasoning, but it often
yields unstable, misleading rankings, especially when the number of trials
(samples) is limited and compute is constrained. We present a principled
Bayesian evaluation framework that replaces Pass@k and average accuracy over
trials (avg) with posterior estimates of a model's underlying success
probability and credible intervals, yielding stable rankings and a transparent
decision rule for differences. Evaluation outcomes are modeled as categorical
(not just 0/1) with a Dirichlet prior, giving closed-form expressions for the
posterior mean and uncertainty of any weighted rubric and enabling the use of
prior evidence when appropriate. Theoretically, under a uniform prior, the
Bayesian posterior mean is order-equivalent to average accuracy (Pass@1),
explaining its empirical robustness while adding principled uncertainty.
Empirically, in simulations with known ground-truth success rates and on
AIME'24/'25, HMMT'25, and BrUMO'25, the Bayesian/avg procedure achieves faster
convergence and greater rank stability than Pass@k and recent variants,
enabling reliable comparisons at far smaller sample counts. The framework
clarifies when observed gaps are statistically meaningful (non-overlapping
credible intervals) versus noise, and it naturally extends to graded,
rubric-based evaluations. Together, these results recommend replacing Pass@k
for LLM evaluation and ranking with a posterior-based, compute-efficient
protocol that unifies binary and non-binary evaluation while making uncertainty
explicit. Code is available at https://mohsenhariri.github.io/bayes-kit
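The closed-form quantities mentioned above follow from standard Dirichlet-categorical conjugacy. The sketch below is an independent illustration, not the authors' bayes-kit code: given outcome counts per category, rubric weights, and a Dirichlet prior (uniform by default), it returns the posterior mean and standard deviation of the weighted score. The function name and argument layout are assumptions.

```python
from math import sqrt


def rubric_posterior(counts, weights, prior=None):
    """Posterior mean and std of sum_c w_c * p_c under a Dirichlet prior.

    counts  : observed outcomes per category
    weights : rubric score assigned to each category
    prior   : Dirichlet concentration per category (default: all ones)
    """
    if prior is None:
        prior = [1.0] * len(counts)           # uniform prior
    post = [a + n for a, n in zip(prior, counts)]
    total = sum(post)
    means = [a / total for a in post]         # E[p_c]
    mean = sum(w * m for w, m in zip(weights, means))
    second = sum(w * w * m for w, m in zip(weights, means))
    var = (second - mean * mean) / (total + 1.0)
    return mean, sqrt(max(var, 0.0))


# Binary case: 8 passes and 2 failures out of 10 trials.
mean, std = rubric_posterior(counts=[2, 8], weights=[0.0, 1.0])
print(mean, std)
```

In the binary case with a uniform prior the mean reduces to (passes + 1) / (trials + 2), which is monotone in plain accuracy for a fixed number of trials; that is consistent with the order-equivalence stated in the abstract.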
Scaling Sequence-to-Sequence Generative Neural Rendering
Authors: Shikun Liu, Kam Woh Ng, Wonbong Jang, Jiadong Guo, Junlin Han, Haozhe Liu, Yiannis Douratsos, Juan C. Pérez, Zijian Zhou, Chi Phung, Tao Xiang, Juan-Manuel Pérez-Rúa
2025-10-05
We present Kaleido, a family of generative models designed for
photorealistic, unified object- and scene-level neural rendering. Kaleido
operates on the principle that 3D can be regarded as a specialised sub-domain
of video, expressed purely as a sequence-to-sequence image synthesis task.
Through a systemic study of scaling sequence-to-sequence generative neural
rendering, we introduce key architectural innovations that enable our model to:
i) perform generative view synthesis without explicit 3D representations; ii)
generate any number of 6-DoF target views conditioned on any number of
reference views via a masked autoregressive framework; and iii) seamlessly
unify 3D and video modelling within a single decoder-only rectified flow
transformer. Within this unified framework, Kaleido leverages large-scale video
data for pre-training, which significantly improves spatial consistency and
reduces reliance on scarce, camera-labelled 3D datasets -- all without any
architectural modifications. Kaleido sets a new state-of-the-art on a range of
view synthesis benchmarks. Its zero-shot performance substantially outperforms
other generative methods in few-view settings, and, for the first time, matches
the quality of per-scene optimisation methods in many-view settings.
Let Features Decide Their Own Solvers Hybrid Feature Caching for Diffusion Transformers
Authors: Shikang Zheng, Guantao Chen, Qinming Zhou, Yuqi Lin, Lixuan He, Chang Zou, Peiliang Cai, Jiacheng Liu, Linfeng Zhang
2025-10-05
Diffusion Transformers offer state-of-the-art fidelity in image and video
synthesis, but their iterative sampling process remains a major bottleneck due
to the high cost of transformer forward passes at each timestep. To mitigate
this, feature caching has emerged as a training-free acceleration technique
that reuses or forecasts hidden representations. However, existing methods
often apply a uniform caching strategy across all feature dimensions, ignoring
their heterogeneous dynamic behaviors. Therefore, we adopt a new perspective by
modeling hidden feature evolution as a mixture of ODEs across dimensions, and
introduce HyCa, a Hybrid ODE solver inspired caching framework that applies
dimension-wise caching strategies. HyCa achieves near-lossless acceleration
across diverse domains and models, including 5.55 times speedup on FLUX, 5.56
times speedup on HunyuanVideo, 6.24 times speedup on Qwen-Image and
Qwen-Image-Edit without retraining.
PatternKV Flattening KV Representation Expands Quantization Headroom
Authors: Ji Zhang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
2025-10-05
KV cache in autoregressive LLMs eliminates redundant recomputation but has
emerged as the dominant memory and bandwidth bottleneck during inference,
notably with long contexts and test-time scaling. KV quantization is a key
lever for reducing cache cost, but accuracy drops sharply as the native KV
distribution lacks flatness and thus maintains a wide quantization range. Prior
work focuses on isolating outliers, which caps their error but fails to flatten
the overall distribution, leaving performance fragile under low-bit settings.
In this work, we show that the K cache maintains a stable structure that
evolves gradually with context, while the V cache carries latent semantic
regularities. Building on these insights, we propose PatternKV, a
pattern-aligned residual quantization scheme. It mines representative pattern
vectors online, aligns each KV vector to its nearest pattern, and quantizes
only the residual. This reshaping of the KV distribution flattens the
quantization target and narrows its range, thereby improving the fidelity of
low-bit KV quantization. Across long-context and test-time scaling settings on
multiple backbones, PatternKV delivers consistent 2-bit gains, with a 0.08%
average 4-bit drop relative to FP16, improves test-time scaling accuracy by 10%
on average, and raises throughput by 1.4x while supporting 1.25x larger
batches.
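The pattern-aligned residual idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the pattern vectors are fixed by hand rather than mined online, the quantizer is a plain per-vector min-max uniform quantizer, the toy data is synthetic, and all names (`nearest_pattern`, `quantize_uniform`, `residual_quantize`) are hypothetical.

```python
import random

def nearest_pattern(patterns, v):
    """Index of the pattern closest to v (squared Euclidean distance)."""
    return min(range(len(patterns)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(patterns[i], v)))

def quantize_uniform(x, bits):
    """Min-max uniform quantization of one vector; returns dequantized values."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return list(x)
    step = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((a - lo) / step) * step for a in x]

def residual_quantize(patterns, v, bits):
    """Quantize only v minus its nearest pattern, then add the pattern back."""
    p = patterns[nearest_pattern(patterns, v)]
    residual = [a - b for a, b in zip(v, p)]
    return [b + r for b, r in zip(p, quantize_uniform(residual, bits))]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy data (an assumption): each vector is one of two patterns plus small noise.
random.seed(0)
patterns = [[4.0, -3.0, 0.5, 2.0], [-2.0, 1.0, 3.0, -4.0]]
vectors = [[c + random.uniform(-0.1, 0.1) for c in random.choice(patterns)]
           for _ in range(200)]

# Average 2-bit reconstruction error, with and without pattern alignment.
direct_err = sum(mse(v, quantize_uniform(v, 2)) for v in vectors) / len(vectors)
residual_err = sum(mse(v, residual_quantize(patterns, v, 2))
                   for v in vectors) / len(vectors)
print(f"direct: {direct_err:.4f}  residual: {residual_err:.6f}")
```

In this toy setting the residual has a much narrower range than the raw vector, so the same 2-bit grid covers it with a far smaller step; that is the flattening effect the abstract describes, shown here only on synthetic data.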
Emergent Coordination in Multi-Agent Language Models
Authors: Christoph Riedl
2025-10-05
When are multi-agent systems merely a collection of individual agents
versus an integrated collective with higher-order structure? We introduce an
information-theoretic framework to test -- in a purely data-driven way --
whether multi-agent systems show signs of higher-order structure. This
information decomposition lets us measure whether dynamical emergence is
present in multi-agent LLM systems, localize it, and distinguish spurious
temporal coupling from performance-relevant cross-agent synergy. We implement
both a practical criterion and an emergence capacity criterion operationalized
as partial information decomposition of time-delayed mutual information (TDMI).
We apply our framework to experiments using a simple guessing game without
direct agent communication and only minimal group-level feedback with three
randomized interventions. Groups in the control condition exhibit strong
temporal synergy but only little coordinated alignment across agents. Assigning
a persona to each agent introduces stable identity-linked differentiation.
Combining personas with an instruction to ``think about what other agents might
do'' shows identity-linked differentiation and goal-directed complementarity
across agents. Taken together, our framework establishes that multi-agent
systems can be steered with prompt design from mere aggregates to higher-order
collectives. Our results are robust across emergence measures and entropy
estimators, and not explained by coordination-free baselines or temporal
dynamics alone. Without attributing human-like cognition to the agents, the
patterns of interaction we observe mirror well-established principles of
collective intelligence in human groups: effective performance requires both
alignment on shared objectives and complementary contributions across members.
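As a rough illustration of the quantity the framework builds on, the following Python sketch computes a plug-in estimate of time-delayed mutual information between two discrete series. It covers only the pairwise TDMI, not the partial information decomposition or the emergence criteria from the paper; the function name `tdmi` and the plug-in (frequency-count) estimator are assumptions made for this example.

```python
from collections import Counter
from math import log2

def tdmi(xs, ys, lag=1):
    """Plug-in estimate (in bits) of I(x_t ; y_{t+lag}) for discrete series."""
    if lag < 1 or lag >= len(xs) or len(xs) != len(ys):
        raise ValueError("need equal-length series and 1 <= lag < len(xs)")
    pairs = list(zip(xs[:-lag], ys[lag:]))
    n = len(pairs)
    joint = Counter(pairs)                 # counts of (x_t, y_{t+lag})
    px = Counter(a for a, _ in pairs)      # marginal counts of x_t
    py = Counter(b for _, b in pairs)      # marginal counts of y_{t+lag}
    return sum(c / n * log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

# A strictly alternating series fully determines its next value: 1 bit.
alternating = [t % 2 for t in range(101)]
print(tdmi(alternating, alternating, lag=1))

# A constant series carries no information about anything: 0 bits.
constant = [0] * 50
print(tdmi(constant, constant, lag=1))
```

Plug-in estimates are biased upward on short series, which is one reason robustness across entropy estimators, as reported in the abstract, matters.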
Beyond Next-Token Prediction A Performance Characterization of Diffusion versus Autoregressive Language Models
Authors: Minseo Kim, Coleman Hooper, Aditya Tomar, Chenfeng Xu, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
2025-10-05
Large Language Models (LLMs) have achieved state-of-the-art performance on a
broad range of Natural Language Processing (NLP) tasks, including document
processing and coding. Autoregressive Language Models (ARMs), which generate
tokens sequentially conditioned on all previous tokens, have been the
predominant paradigm for LLMs. However, while these networks have achieved high
accuracy across a range of downstream tasks, they exhibit low arithmetic
intensity due to the inherent sequential dependency with next-token prediction.
Recently, Diffusion Language Models (DLMs) have emerged as a promising
alternative architecture. DLMs generate output text in parallel, breaking the
limitations of sequential dependency. However, the performance implications of
DLMs relative to commonly deployed ARMs are not fully understood. In this work,
we present a comprehensive performance study analyzing the performance
characteristics of ARMs and DLMs, using both theoretical analysis and profiling
data to characterize the trade-offs between these approaches. We illustrate
that although DLMs exhibit higher arithmetic intensity compared to ARMs because
of their capability to utilize parallelism across sequence lengths, they fail
to scale effectively to longer contexts. We then explore DLMs with block-wise
decoding, outlining how this approach allows for increased arithmetic
intensity, while still scaling well to long contexts (similar to ARMs). We also
show interesting trade-offs for batched inference, where we find that ARMs
exhibit superior throughput, as they benefit more from parallelism across
sequences in the batch. Finally, we highlight opportunities for accelerating
DLM inference, and, in particular, highlight the importance of reducing the
number of sampling steps for allowing open-source DLMs to provide improved
latency relative to ARMs.
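The arithmetic-intensity argument can be made concrete with a back-of-the-envelope Python sketch. It models a single dense layer as one matrix multiply and assumes an idealized memory model in which each operand is read once and the output written once in FP16; the function name and the layer sizes are illustrative assumptions, not figures from the paper.

```python
def gemm_arithmetic_intensity(tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for a (tokens x d_in) @ (d_in x d_out) matrix multiply."""
    flops = 2 * tokens * d_in * d_out                 # one multiply + one add per term
    elems = tokens * d_in + d_in * d_out + tokens * d_out
    return flops / (bytes_per_elem * elems)

# One token per step (autoregressive decoding of a single sequence):
# the weight matrix dominates memory traffic, so intensity is about 1 FLOP/byte.
single = gemm_arithmetic_intensity(tokens=1, d_in=4096, d_out=4096)

# Many positions per step (parallel denoising over a 256-token block):
# the same weights are reused across positions, so intensity rises sharply.
block = gemm_arithmetic_intensity(tokens=256, d_in=4096, d_out=4096)

print(f"1 token: {single:.2f} FLOP/byte, 256 tokens: {block:.1f} FLOP/byte")
```

Batching sequences raises `tokens` for autoregressive models in the same way, which is consistent with the abstract's observation that ARMs benefit strongly from batch-level parallelism.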
MoME Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
Authors: Umberto Cappellazzo, Minsu Kim, Pingchuan Ma, Honglie Chen, Xubo Liu, Stavros Petridis, Maja Pantic
2025-10-05
Large language models (LLMs) have recently shown strong potential in
audio-visual speech recognition (AVSR), but their high computational demands
and sensitivity to token granularity limit their practicality in
resource-constrained settings. Token compression methods can reduce inference
cost, but they require fixing a compression rate in advance and produce a
single fixed-length output, offering no flexibility to balance information
density and efficiency at inference time. Matryoshka representation learning
(MRL) addresses this by enabling a single model to operate across multiple
token granularities, allowing compression rates to be adjusted dynamically.
However, current MRL-based methods treat each scale independently during
training, limiting cross-scale generalization, robustness at high compression,
and interpretability. To overcome these limitations, we propose MoME (Mixture
of Matryoshka Experts), a novel framework that integrates
Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen
LLM with top-k routed and shared experts, allowing dynamic capacity allocation
across scales and modalities. A shared router promotes consistent expert
activation across granularities, enabling compressed sequences to benefit from
representations learned at lower compression. Experiments on LRS2 and LRS3
demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR,
and VSR tasks, while requiring significantly fewer parameters and maintaining
robustness under noise. MoME unifies the adaptability of MRL with the
efficiency of MoE, offering a scalable and interpretable solution for
resource-aware speech recognition.
Can Linear Probes Measure LLM Uncertainty?
Authors: Ramzi Dakhmouche, Adrien Letellier, Hossein Gorji
2025-10-05
Effective Uncertainty Quantification (UQ) represents a key aspect for
reliable deployment of Large Language Models (LLMs) in automated
decision-making and beyond. Yet, for LLM generation with multiple choice
structure, the state-of-the-art in UQ is still dominated by the naive baseline
given by the maximum softmax score. To address this shortcoming, we demonstrate
that taking a principled approach via Bayesian statistics leads to improved
performance despite leveraging the simplest possible model, namely linear
regression. More precisely, we propose to train multiple Bayesian linear
models, each predicting the output of a layer given the output of the previous
one. Based on the obtained layer-level posterior distributions, we infer the
global uncertainty level of the LLM by identifying a combination of
distributional features, leading to an efficient UQ scheme. Numerical
experiments on various LLMs show consistent improvement over state-of-the-art
baselines.
Enhancing Fake News Video Detection via LLM-Driven Creative Process Simulation
Authors: Yuyan Bu, Qiang Sheng, Juan Cao, Shaofei Wang, Peng Qi, Yuhui Shi, Beizhe Hu
2025-10-05
The emergence of fake news on short video platforms has become a new
significant societal concern, necessitating automatic video-news-specific
detection. Current detectors primarily rely on pattern-based features to
separate fake news videos from real ones. However, limited and less diversified
training data lead to biased patterns and hinder their performance. This
weakness stems from the complex many-to-many relationships between video
material segments and fabricated news events in real-world scenarios: a single
video clip can be utilized in multiple ways to create different fake
narratives, while a single fabricated event often combines multiple distinct
video segments. However, existing datasets do not adequately reflect such
relationships due to the difficulty of collecting and annotating large-scale
real-world data, resulting in sparse coverage and non-comprehensive learning of
the characteristics of potential fake news video creation. To address this
issue, we propose a data augmentation framework, AgentAug, that generates
diverse fake news videos by simulating typical creative processes. AgentAug
implements multiple LLM-driven pipelines of four fabrication categories for
news video creation, combined with an active learning strategy based on
uncertainty sampling to select the potentially useful augmented samples during
training. Experimental results on two benchmark datasets demonstrate that
AgentAug consistently improves the performance of short video fake news
detectors.
Fit Pixels, Get Labels Meta-learned Implicit Networks for Image Segmentation
Authors: Kushal Vyas, Ashok Veeraraghavan, Guha Balakrishnan
2025-10-05
Implicit neural representations (INRs) have achieved remarkable successes in
learning expressive yet compact signal representations. However, they are not
naturally amenable to predictive tasks such as segmentation, where they must
learn semantic structures over a distribution of signals. In this study, we
introduce MetaSeg, a meta-learning framework to train INRs for medical image
segmentation. MetaSeg uses an underlying INR that simultaneously predicts
per-pixel intensity values and class labels. It then uses a meta-learning procedure
to find optimal initial parameters for this INR over a training dataset of
images and segmentation maps, such that the INR can simply be fine-tuned to fit
pixels of an unseen test image, and automatically decode its class labels. We
evaluated MetaSeg on 2D and 3D brain MRI segmentation tasks and report Dice
scores comparable to commonly used U-Net models, but with fewer parameters.
MetaSeg offers a fresh, scalable alternative to traditional
resource-heavy architectures such as U-Nets and vision transformers for medical
image segmentation. Our project is available at
https://kushalvyas.github.io/metaseg.html .
Simulating and Understanding Deceptive Behaviors in Long-Horizon Interactions
Authors: Yang Xu, Xuanming Zhang, Min-Hsuan Yeh, Jwala Dhamala, Ousmane Dia, Rahul Gupta, Yixuan Li
2025-10-05
Deception is a pervasive feature of human communication and an emerging
concern in large language models (LLMs). While recent studies document
instances of LLM deception under pressure, most evaluations remain confined to
single-turn prompts and fail to capture the long-horizon interactions in which
deceptive strategies typically unfold. We introduce the first simulation
framework for probing and evaluating deception in LLMs under extended sequences
of interdependent tasks and dynamic contextual pressures. Our framework
instantiates a multi-agent system: a performer agent tasked with completing
tasks and a supervisor agent that evaluates progress, provides feedback, and
maintains evolving states of trust. An independent deception auditor then
reviews full trajectories to identify when and how deception occurs. We conduct
extensive experiments across 11 frontier models, spanning both closed- and
open-source systems, and find that deception is model-dependent, increases with
event pressure, and consistently erodes supervisor trust. Qualitative analyses
further reveal distinct strategies of concealment, equivocation, and
falsification. Our findings establish deception as an emergent risk in
long-horizon interactions and provide a foundation for evaluating future LLMs
in real-world, trust-sensitive contexts.
Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs
Authors: Junjie Luo, Rui Han, Arshana Welivita, Zeleikun Di, Jingfu Wu, Xuzhe Zhi, Ritu Agarwal, Gordon Gao
2025-10-05
Understanding how patients perceive their physicians is essential to
improving trust, communication, and satisfaction. We present a large language
model (LLM)-based pipeline that infers Big Five personality traits and five
patient-oriented subjective judgments. The analysis encompasses 4.1 million
patient reviews of 226,999 U.S. physicians from an initial pool of one million.
We validate the method through multi-model comparison and human expert
benchmarking, achieving strong agreement between human and LLM assessments
(correlation coefficients 0.72-0.89) and external validity through correlations
with patient satisfaction (r = 0.41-0.81, all p<0.001). National-scale analysis
reveals systematic patterns: male physicians receive higher ratings across all
traits, with largest disparities in clinical competence perceptions;
empathy-related traits predominate in pediatrics and psychiatry; and all traits
positively predict overall satisfaction. Cluster analysis identifies four
distinct physician archetypes, from "Well-Rounded Excellent" (33.8%, uniformly
high traits) to "Underperforming" (22.6%, consistently low). These findings
demonstrate that automated trait extraction from patient narratives can provide
interpretable, validated metrics for understanding physician-patient
relationships at scale, with implications for quality measurement, bias
detection, and workforce development in healthcare.
SPEAR Soft Prompt Enhanced Anomaly Recognition for Time Series Data
Authors: Hanzhe Wei, Jiajun Wu, Jialin Yang, Henry Leung, Steve Drew
2025-10-04
Time series anomaly detection plays a crucial role in a wide range of fields,
such as healthcare and internet traffic monitoring. The emergence of large
language models (LLMs) offers new opportunities for detecting anomalies in the
ubiquitous time series data. Traditional approaches struggle with
variable-length time series sequences and context-based anomalies. We propose
Soft Prompt Enhanced Anomaly Recognition (SPEAR), a novel approach to leverage
LLMs for anomaly detection with soft prompts and quantization. Our methodology
involves quantizing and transforming the time series data into input embeddings
and combining them with learnable soft prompt embeddings. These combined
embeddings are then fed into a frozen LLM. The soft prompts are updated
iteratively based on a cross-entropy loss, allowing the model to adapt to time
series anomaly detection. The use of soft prompts helps adapt LLMs effectively
to time series tasks, while quantization ensures optimal handling of sequences,
as LLMs are designed to handle discrete sequences. Our experimental results
demonstrate that soft prompts effectively increase LLMs' performance in
downstream tasks regarding time series anomaly detection.
Sliding Window Attention for Learned Video Compression
Authors: Alexander Kopte, André Kaup
2025-10-04
To manage the complexity of transformers in video compression, local
attention mechanisms are a practical necessity. The common approach of
partitioning frames into patches, however, creates architectural flaws like
irregular receptive fields. When adapted for temporal autoregressive models,
this paradigm, exemplified by the Video Compression Transformer (VCT), also
necessitates computationally redundant overlapping windows. This work
introduces 3D Sliding Window Attention (SWA), a patchless form of local
attention. By enabling a decoder-only architecture that unifies spatial and
temporal context processing, and by providing a uniform receptive field, our
method significantly improves rate-distortion performance, achieving
Bjøntegaard Delta-rate savings of up to 18.6% against the VCT baseline.
Simultaneously, by eliminating the need for overlapping windows, our method
reduces overall decoder complexity by a factor of 2.8, while its entropy model
is nearly 3.5 times more efficient. We further analyze our model's behavior and
show that while it benefits from long-range temporal context, excessive context
can degrade performance.
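The contrast between patch-based and sliding-window local attention can be sketched in one dimension. This is a deliberate simplification of the paper's 3D setting: both masks below are causal and operate on a flat sequence, and the function names and the choice of causal masking inside patches are assumptions made for illustration.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when i may attend to j: the last `window` positions up to i."""
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

def patch_mask(seq_len, patch):
    """Causal attention restricted to the non-overlapping patch that contains i."""
    return [[i // patch == j // patch and j <= i for j in range(seq_len)]
            for i in range(seq_len)]

def context_sizes(mask):
    """Number of positions each query position can see."""
    return [sum(row) for row in mask]

swa = context_sizes(sliding_window_mask(8, 4))
patched = context_sizes(patch_mask(8, 4))
print(swa)      # [1, 2, 3, 4, 4, 4, 4, 4]
print(patched)  # [1, 2, 3, 4, 1, 2, 3, 4]
```

After the initial boundary, every position under the sliding window sees the same amount of context, whereas the patch layout resets at each patch border. That position-dependent context is a 1D analogue of the irregular receptive field the abstract refers to.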
Multi-Agent Code-Orchestrated Generation for Reliable Infrastructure-as-Code
Authors: Rana Nameer Hussain Khan, Dawood Wasif, Jin-Hee Cho, Ali Butt
2025-10-04
The increasing complexity of cloud-native infrastructure has made
Infrastructure-as-Code (IaC) essential for reproducible and scalable
deployments. While large language models (LLMs) have shown promise in
generating IaC snippets from natural language prompts, their monolithic,
single-pass generation approach often results in syntactic errors, policy
violations, and unscalable designs. In this paper, we propose MACOG
(Multi-Agent Code-Orchestrated Generation), a novel multi-agent LLM-based
architecture for IaC generation that decomposes the task into modular subtasks
handled by specialized agents: Architect, Provider Harmonizer, Engineer,
Reviewer, Security Prover, Cost and Capacity Planner, DevOps, and Memory
Curator. The agents interact via a shared-blackboard, finite-state orchestrator
layer, and collectively produce Terraform configurations that are not only
syntactically valid but also policy-compliant and semantically coherent. To
ensure infrastructure correctness and governance, we incorporate Terraform Plan
for execution validation and Open Policy Agent (OPA) for customizable policy
enforcement. We evaluate MACOG using the IaC-Eval benchmark, where MACOG is the
top enhancement across models, e.g., GPT-5 improves from 54.90 (RAG) to 74.02
and Gemini-2.5 Pro from 43.56 to 60.13, with concurrent gains on BLEU,
CodeBERTScore, and an LLM-judge metric. Ablations show constrained decoding and
deploy feedback are critical: removing them drops IaC-Eval to 64.89 and 56.93,
respectively.
NoTVLA Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation
Authors: Zheng Huang, Mingyu Liu, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Xiaoman Li, Yiduo Jia, Hao Zhong, Hao Chen, Chunhua Shen
2025-10-04
Vision-Language-Action (VLA) models represent a pivotal advance in embodied
intelligence, yet they confront critical barriers to real-world deployment,
most notably catastrophic forgetting. This issue stems from their overreliance
on continuous action sequences or action chunks, which inadvertently create
isolated data silos that disrupt knowledge retention across tasks. To tackle
these challenges, we propose the Narrowing of Trajectory VLA (NoTVLA)
framework: a novel approach that narrows its focus to sparse trajectories,
thereby avoiding the catastrophic forgetting associated with dense trajectory
fine-tuning. A key innovation of NoTVLA lies in its trajectory planning
strategy: instead of centering on the target object's trajectory, it leverages
temporal compression and spatial reasoning pruning specifically for the robot
end effector's trajectory. Furthermore, training is conducted using these sparse
trajectories rather than dense action trajectories, an optimization that
delivers remarkable practical advantages with better performance in zero-shot.
In multi-task evaluation scenarios, NoTVLA achieves superior performance and
generalization compared to pi0 while operating under two critical constraints:
it uses over an order of magnitude less computing power than pi0 and requires
no wrist-mounted camera. This design ensures that NoTVLA's operational accuracy
closely approximates that of single-task expert models. Crucially, it also
preserves the model's inherent language capabilities, enabling zero-shot
generalization in specific scenarios, supporting unified model deployment
across multiple robot platforms, and fostering a degree of generalization even
when perceiving tasks from novel perspectives.
DHQA-4D Perceptual Quality Assessment of Dynamic 4D Digital Human
Authors: Yunhao Li, Sijing Wu, Yucheng Zhu, Huiyu Duan, Zicheng Zhang, Guangtao Zhai
2025-10-04
With the rapid development of 3D scanning and reconstruction technologies,
dynamic digital human avatars based on 4D meshes have become increasingly
popular. A high-precision dynamic digital human avatar can be applied to
various fields such as game production, animation generation, and remote
immersive communication. However, these 4D human avatar meshes are prone to
being degraded by various types of noise during the processes of collection,
compression, and transmission, thereby affecting the viewing experience of
users. In light of this fact, quality assessment of dynamic 4D digital humans
becomes increasingly important. In this paper, we first propose a large-scale
dynamic digital human quality assessment dataset, DHQA-4D, which contains 32
high-quality real-scanned 4D human mesh sequences, 1920 distorted textured 4D
human meshes degraded by 11 textured distortions, as well as their
corresponding textured and non-textured mean opinion scores (MOSs). Equipped
with DHQA-4D dataset, we analyze the influence of different types of distortion
on human perception for textured dynamic 4D meshes and non-textured dynamic 4D
meshes. Additionally, we propose DynaMesh-Rater, a novel large multimodal model
(LMM) based approach that is able to assess both textured 4D meshes and
non-textured 4D meshes. Concretely, DynaMesh-Rater elaborately extracts
multi-dimensional features, including visual features from a projected 2D
video, motion features from cropped video clips, and geometry features from the
4D human mesh to provide comprehensive quality-related information. Then we
utilize an LMM to integrate the multi-dimensional features and conduct a
LoRA-based instruction tuning technique to teach the LMM to predict the
quality scores. Extensive experimental results on the DHQA-4D dataset
demonstrate the superiority of our DynaMesh-Rater method over previous quality
assessment methods.
Algorithm Generation via Creative Ideation
Authors: Ruiying Ma, Chieh-Jan Mike Liang, Yanjie Gao, Francis Y. Yan
2025-10-04
Designing system algorithms remains challenging, where the discontinuous
nature of the solution space often forces system engineers to rely on generic
heuristics at the expense of performance. We study whether LLMs can practically
drive algorithm generation, and find that they are biased towards well-known
generic designs, rather than making the creative leaps needed to navigate the
discontinuous solution space. To address this limitation, we introduce
MetaMuse, a framework for creative ideation built on three self-reflection
principles: (1) quantifying solution diversity and usefulness in measurable
performance space, rather than abstract idea space, (2) steering ideation
through external stimuli, rather than internal randomness, and (3) constructing
executable solutions using waypoint reasoning, rather than free-form
chain-of-thought. Extensive evaluation shows that MetaMuse can generate
high-performing solutions for two critical problems at a global cloud provider:
cache replacement (reducing cache misses by up to 35.76%) and online bin
packing (reducing bin usage by up to 30.93%).
Small Language Models for Agentic Systems A Survey of Architectures, Capabilities, and Deployment Trade-offs
Authors: Raghav Sharma, Manan Mehta
2025-10-04
Small language models (SLMs; 1-12B params, sometimes up to 20B) are
sufficient and often superior for agentic workloads where the objective is
schema- and API-constrained accuracy rather than open-ended generation. We
synthesize recent evidence across open and proprietary SLMs (Phi-4-Mini,
Qwen-2.5-7B, Gemma-2-9B, Llama-3.2-1B/3B, Ministral-3B/8B, Apple on-device 3B,
DeepSeek-R1-Distill) and connect it to modern evaluations (BFCL v3/v4,
StableToolBench) and serving stacks (vLLM, SGLang, TensorRT-LLM) paired with
guided decoding libraries (XGrammar, Outlines). We formalize SLM-default,
LLM-fallback systems with uncertainty-aware routing and verifier cascades, and
propose engineering metrics that reflect real production goals: cost per
successful task (CPS), schema validity rate, executable call rate, p50/p95
latency, and energy per request. Guided decoding, strict JSON Schema outputs,
and validator-first tool execution close much of the capability gap with larger
models and often let SLMs match or surpass LLMs on tool use, function calling,
and RAG at 10x-100x lower token cost with materially better latency and energy.
We provide design patterns for agent stacks that prioritize SLMs: schema-first
prompting, type-safe function registries, confidence scoring with verifier
rollups, and lightweight adaptation via LoRA/QLoRA. We also delineate limits
where fallback remains valuable (open-domain reasoning and some long-horizon
planning). The result is a practical blueprint for building fast, inexpensive,
and reliable agents that default to SLMs while preserving headroom with
targeted LLM assistance.
Keywords: small language models, agents, function calling, structured
outputs, JSON Schema, guided decoding, LoRA/QLoRA, routing, energy efficiency,
edge inference
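A minimal Python sketch of the SLM-default, LLM-fallback pattern with validator-first execution follows. Everything in it is illustrative: the two model functions are hypothetical stubs, the validator checks only required keys and types rather than a full JSON Schema, the fallback is triggered by validation failure alone (no uncertainty score), and cost per successful task is assumed here to mean total cost divided by the number of valid outcomes.

```python
import json

REQUIRED = {"tool": str, "args": dict}   # hypothetical tool-call shape

def validate(raw, required=REQUIRED):
    """Return the parsed call if raw is a JSON object with the required typed keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    if any(not isinstance(obj.get(k), t) for k, t in required.items()):
        return None
    return obj

def route(prompt, slm, llm, slm_cost=1.0, llm_cost=20.0):
    """Try the small model first; escalate only when its output fails validation."""
    call = validate(slm(prompt))
    if call is not None:
        return call, slm_cost, "slm"
    return validate(llm(prompt)), slm_cost + llm_cost, "llm"

def cost_per_success(results):
    """Total cost divided by the number of tasks that produced a valid call."""
    total = sum(cost for _, cost, _ in results)
    successes = sum(1 for call, _, _ in results if call is not None)
    return total / successes if successes else float("inf")

# Hypothetical stubs standing in for real model calls.
def toy_slm(prompt):
    if len(prompt) < 30:
        return '{"tool": "get_weather", "args": {"city": "Oslo"}}'
    return "Sure! I would call get_weather"   # not valid JSON

def toy_llm(prompt):
    return '{"tool": "get_weather", "args": {"city": "Oslo"}}'

prompts = ["Weather in Oslo?",
           "Please tell me what the weather will be like in Oslo tomorrow"]
results = [route(p, toy_slm, toy_llm) for p in prompts]
print([who for _, _, who in results], cost_per_success(results))
```

Validating before executing means a malformed call never reaches the tool, and the expensive model is paid for only on the requests where the cheap one fails.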
TROLL Trust Regions improve Reinforcement Learning for Large Language Models
Authors: Philipp Becker, Niklas Freymuth, Serge Thilges, Fabian Otto, Gerhard Neumann
2025-10-04
On-policy Reinforcement Learning (RL) with PPO-like clip objectives has
become the standard choice for reward-based fine-tuning of large language
models (LLMs). Although recent work has explored improved estimators of
advantages and normalization, the clipping mechanism itself has remained
untouched. Originally introduced as a proxy for principled KL-based trust
regions, clipping is a crude approximation that often causes unstable updates
and suboptimal performance. We replace the clip objective with a novel discrete
differentiable trust region projection, which provides principled token-level
KL constraints. The projection operates on a sparse subset of the model's most
important token logits to balance computational cost and projection
effectiveness. Our approach, Trust Region Optimization for Large Language
Models (TROLL), serves as a direct replacement for PPO-like clipping during
training and does not alter the model's inference behavior. Across datasets,
model families, and advantage-estimation methods, TROLL consistently
outperforms PPO-like clipping in terms of training speed, stability, and final
success rates.
MambaCAFU Hybrid Multi-Scale and Multi-Attention Model with Mamba-Based Fusion for Medical Image Segmentation
Authors: T-Mai Bui, Fares Bougourzi, Fadi Dornaika, Vinh Truong Hoang
2025-10-04
In recent years, deep learning has shown near-expert performance in
segmenting complex medical tissues and tumors. However, existing models are
often task-specific, with performance varying across modalities and anatomical
regions. Balancing model complexity and performance remains challenging,
particularly in clinical settings where both accuracy and efficiency are
critical. To address these issues, we propose a hybrid segmentation
architecture featuring a three-branch encoder that integrates CNNs,
Transformers, and a Mamba-based Attention Fusion (MAF) mechanism to capture
local, global, and long-range dependencies. A multi-scale attention-based CNN
decoder reconstructs fine-grained segmentation maps while preserving contextual
consistency. Additionally, a co-attention gate enhances feature selection by
emphasizing relevant spatial and semantic information across scales during both
encoding and decoding, improving feature interaction and cross-scale
communication. Extensive experiments on multiple benchmark datasets show that
our approach outperforms state-of-the-art methods in accuracy and
generalization, while maintaining comparable computational complexity. By
effectively balancing efficiency and effectiveness, our architecture offers a
practical and scalable solution for diverse medical imaging tasks. Source code
and trained models will be publicly released upon acceptance to support
reproducibility and further research.
You Have Been LaTeXpOsEd A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models
Authors: Richard A. Dubniczky, Bertalan Borsos, Tihanyi Norbert
2025-10-04
The widespread use of preprint repositories such as arXiv has accelerated the
communication of scientific results but also introduced overlooked security
risks. Beyond PDFs, these platforms provide unrestricted access to original
source materials, including LaTeX sources, auxiliary code, figures, and
embedded comments. In the absence of sanitization, submissions may disclose
sensitive information that adversaries can harvest using open-source
intelligence. In this work, we present the first large-scale security audit of
preprint archives, analyzing more than 1.2 TB of source data from 100,000 arXiv
submissions. We introduce LaTeXpOsEd, a four-stage framework that integrates
pattern matching, logical filtering, traditional harvesting techniques, and
large language models (LLMs) to uncover hidden disclosures within
non-referenced files and LaTeX comments. To evaluate LLMs' secret-detection
capabilities, we introduce LLMSec-DB, a benchmark on which we tested 25
state-of-the-art models. Our analysis uncovered thousands of PII leaks,
GPS-tagged EXIF files, publicly available Google Drive and Dropbox folders,
editable private SharePoint links, exposed GitHub and Google credentials, and
cloud API keys. We also uncovered confidential author communications, internal
disagreements, and conference submission credentials, exposing information that
poses serious reputational risks to both researchers and institutions. We urge
the research community and repository operators to take immediate action to
close these hidden security gaps. To support open science, we release all
scripts and methods from this study but withhold sensitive findings that could
be misused, in line with ethical principles. The source code and related
material are available at the project website https://github.com/LaTeXpOsEd
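To give a sense of what a pattern-matching pass over LaTeX comments might look like, here is a small Python sketch. It is not the LaTeXpOsEd pipeline: the two regular expressions are simplified assumptions (the comment pattern ignores verbatim environments and the `\\%` edge case, and the e-mail pattern is deliberately loose), and the function names are hypothetical.

```python
import re

# An unescaped '%' starts a LaTeX comment that runs to the end of the line.
COMMENT = re.compile(r"(?<!\\)%(.*)$", re.MULTILINE)
# Loose e-mail pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def latex_comments(source):
    """Return the text of every comment in a LaTeX source string."""
    return [m.group(1).strip() for m in COMMENT.finditer(source)]

def emails_in_comments(source):
    """Return e-mail-like strings that appear only inside comments."""
    return [addr for c in latex_comments(source) for addr in EMAIL.findall(c)]

sample = r"""
Accuracy improved by 5\% over the baseline. % TODO ask reviewer2@example.org about Table 3
% internal note: do not release
\section{Results}
"""

print(latex_comments(sample))
print(emails_in_comments(sample))
```

The escaped `\%` in the first line is correctly left alone, while the trailing comment and the full-line comment are both captured. A real audit would add many more detectors (keys, URLs, credentials) and the filtering and LLM stages the abstract describes.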
EvoEngineer Mastering Automated CUDA Kernel Code Evolution with Large Language Models
Authors: Ping Guo, Chenyu Zhu, Siyuan Chen, Fei Liu, Xi Lin, Zhichao Lu, Qingfu Zhang
2025-10-04
CUDA kernel optimization has become a critical bottleneck for AI performance,
as deep learning training and inference efficiency directly depends on highly
optimized GPU kernels.
Despite the promise of Large Language Models (LLMs) for automating kernel
optimization, this field suffers from a fragmented ecosystem of isolated and
incomparable approaches with unclear problem formulations.
Furthermore, general-purpose LLM code evolution methods cannot meet strict
correctness requirements of CUDA kernel optimization.
We address these fundamental challenges by first formalizing CUDA kernel
optimization as a code optimization task with a clear objective, constraints,
and evaluation metrics.
We then establish the first systematic LLM-based code evolution framework,
EvoEngineer, that provides guidance for designing and adapting optimization
strategies to achieve a balance between performance and correctness.
Finally, we implement a kernel optimization system based on this framework
and conduct extensive experiments on 91 real-world CUDA kernels.
Our results demonstrate that EvoEngineer achieves a principled balance
between performance and correctness, with the highest averaged median speedup
of 2.72x over baseline CUDA kernels and a code validity rate of
69.8%, outperforming existing methods on both dimensions.
Our method achieves a maximum speedup of 36.75x among all
operations over PyTorch kernels and delivers the highest speedup on 28
(56.0%) of 50 operations that achieve over 2x acceleration.
Token Hidden Reward Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
Authors: Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis
2025-10-04
Reinforcement learning with verifiable rewards has significantly advanced the
reasoning capabilities of large language models, yet how to explicitly steer
training toward exploration or exploitation remains an open problem. We
introduce Token Hidden Reward (THR), a token-level metric that quantifies each
token's influence on the likelihood of correct responses under Group Relative
Policy Optimization (GRPO). We find that training dynamics are dominated by a
small subset of tokens with high absolute THR values. Most interestingly,
tokens with positive THR strengthen confidence in correct outputs, thus
favoring exploitation, while tokens with negative THR preserve probability mass
for alternative outputs, enabling exploration. This insight suggests a natural
intervention: a THR-guided reweighting algorithm that modulates GRPO's learning
signals to explicitly bias training toward exploitation or exploration. We
validate the efficacy of this algorithm on diverse math reasoning benchmarks.
By amplifying tokens with positive THR value and weakening negative ones, our
algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse
strategy yields consistent gains in Pass@K accuracy, favoring exploration. We
further demonstrate that our algorithm integrates seamlessly with other RL
objectives such as GSPO and generalizes across architectures including Llama.
These findings establish THR as a principled and fine-grained mechanism for
dynamically controlling exploration and exploitation in RL-tuned LLMs,
providing new tools for targeted fine-tuning in reasoning-intensive
applications.
Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders
Authors: Xu Wang, Yan Hu, Benyou Wang, Difan Zou
2025-10-04
Sparse Autoencoders (SAEs) are widely used to steer large language models
(LLMs), based on the assumption that their interpretable features naturally
enable effective model behavior steering. Yet, a fundamental question remains
unanswered: does higher interpretability indeed imply better steering utility?
To answer this question, we train 90 SAEs across three LLMs (Gemma-2-2B,
Qwen-2.5-3B, Gemma-2-9B), spanning five architectures and six sparsity levels,
and evaluate their interpretability and steering utility based on SAEBench
(arXiv:2501.12345) and AxBench (arXiv:2502.23456) respectively, and perform a
rank-agreement analysis via Kendall's rank coefficients (tau b). Our analysis
reveals only a relatively weak positive association (tau b approx 0.298),
indicating that interpretability is an insufficient proxy for steering
performance. We conjecture that the interpretability-utility gap may stem from
the selection of SAE features, as not all of them are equally effective for
steering. To further find features that truly steer the behavior of LLMs, we
propose a novel selection criterion called Delta Token Confidence, which
measures how much amplifying a feature changes the next token distribution. We
show that our method improves the steering performance of three LLMs by 52.52
percent compared to the current best output-score-based criterion
(arXiv:2503.34567). Strikingly, after selecting features with high Delta Token
Confidence, the correlation between interpretability and utility vanishes (tau
b approx 0), and can even become negative. This further highlights the
divergence between interpretability and utility for the most effective steering
features.
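
As a rough illustration of the rank-agreement analysis described in the abstract above, the sketch below computes Kendall's tau-b for two lists of scores using only the standard library. The function name and the example scores are hypothetical and are not taken from the paper; the formula is the standard tau-b definition with tie correction.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (tie-corrected)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + ties_y_only
    untied_in_y = concordant + discordant + ties_x_only
    denom = sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical interpretability and steering-utility scores for five SAEs.
interpretability = [0.61, 0.74, 0.58, 0.80, 0.69]
utility = [0.30, 0.28, 0.35, 0.41, 0.25]
print(kendall_tau_b(interpretability, utility))
```

A value near 1 would mean the two rankings agree, a value near 0 that they are essentially unrelated, and a negative value that they tend to disagree.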
Decoupling Task-Solving and Output Formatting in LLM Generation
Authors: Haikang Deng, Po-Nien Kung, Nanyun Peng
2025-10-04
Large language models (LLMs) are increasingly adept at following instructions
containing task descriptions to solve complex problems, such as mathematical
reasoning and automatic evaluation (LLM-as-a-Judge). However, as prompts grow
more complex, models often struggle to adhere to all instructions. This
difficulty is especially common when instructive prompts intertwine reasoning
directives -- specifying what the model should solve -- with rigid formatting
requirements that dictate how the solution must be presented. The entanglement
creates competing goals for the model, suggesting that more explicit separation
of these two aspects could lead to improved performance. To this end, we
introduce Deco-G, a decoding framework that explicitly decouples format
adherence from task solving. Deco-G handles format compliance with a separate
tractable probabilistic model (TPM), while prompting LLMs with only task
instructions. At each decoding step, Deco-G combines next-token probabilities
from the LLM with the TPM-calculated format compliance likelihood to form the
output probability. To make this approach both practical and scalable for
modern instruction-tuned LLMs, we introduce three key innovations:
instruction-aware distillation, a flexible trie-building algorithm, and HMM
state pruning for computational efficiency. We demonstrate the effectiveness of
Deco-G across a wide range of tasks with diverse format requirements, including
mathematical reasoning, LLM-as-a-judge, and event argument extraction. Overall,
our approach yields 1.0% to 6.0% relative gain over regular prompting practice
with guaranteed format compliance.
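
The per-step combination described in the abstract above can be pictured with a small sketch. It assumes, purely for illustration, that the two signals are multiplied token by token and renormalized; the abstract only says they are "combined", so the exact rule, the function name, and the toy vocabulary below are assumptions rather than the authors' implementation.

```python
def combine_next_token(llm_probs, format_likelihood):
    """Weight each candidate token's LLM probability by a format-compliance
    likelihood and renormalize (an assumed product rule, not Deco-G's code)."""
    scores = {
        token: prob * format_likelihood.get(token, 0.0)
        for token, prob in llm_probs.items()
    }
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("no candidate token is compatible with the format")
    return {token: score / total for token, score in scores.items()}


# Toy example: the task model prefers free text, the format model requires "{".
llm_probs = {"The": 0.6, "{": 0.3, "42": 0.1}
format_likelihood = {"The": 0.0, "{": 1.0, "42": 0.0}
print(combine_next_token(llm_probs, format_likelihood))
```

Under this assumed rule, tokens that the format model rules out receive zero probability, which is one way a format guarantee could arise while the LLM sees only task instructions.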
FieldFormer Physics-Informed Transformers for Spatio-Temporal Field Reconstruction from Sparse Sensors
Authors: Ankit Bhardwaj, Ananth Balashankar, Lakshminarayanan Subramanian
2025-10-04
Spatio-temporal sensor data is often sparse, noisy, and irregular, and
existing interpolation or learning methods struggle here because they either
ignore governing PDEs or do not scale. We introduce FieldFormer, a
transformer-based framework for mesh-free spatio-temporal field reconstruction
that combines data-driven flexibility with physics-based structure. For each
query, FieldFormer gathers a local neighborhood using a learnable
velocity-scaled distance metric, enabling anisotropic adaptation to different
propagation regimes. Neighborhoods are built efficiently via per-batch offset
recomputation, and refined in an expectation-maximization style as the velocity
scales evolve. Predictions are made by a local transformer encoder, and physics
consistency is enforced through autograd-based PDE residuals and
boundary-specific penalties. Across three benchmarks--a scalar anisotropic heat
equation, a vector-valued shallow-water system, and a realistic
advection-diffusion pollution simulation--FieldFormer consistently outperforms
strong baselines by more than 40%. Our results demonstrate that FieldFormer
enables accurate (RMSE), efficient, and physically consistent field
reconstruction from sparse (0.4%-2%) and noisy (10%) data.
Reactive Transformer (RxT) -- Stateful Real-Time Processing for Event-Driven Reactive Language Models
Authors: Adam Filipek
2025-10-03
The Transformer architecture has become the de facto standard for Large
Language Models (LLMs), demonstrating remarkable capabilities in language
understanding and generation. However, its application in conversational AI is
fundamentally constrained by its stateless nature and the quadratic
computational complexity with respect to sequence length.
Current models emulate memory by reprocessing an ever-expanding conversation
history with each turn, leading to prohibitive costs and latency in long
dialogues. This paper introduces the Reactive Transformer (RxT), a novel
architecture designed to overcome these limitations by shifting from a
data-driven to an event-driven paradigm. RxT processes each conversational turn
as a discrete event in real-time, maintaining context in an integrated,
fixed-size Short-Term Memory (STM) system. The architecture features a distinct
operational cycle where a generator-decoder produces a response based on the
current query and the previous memory state, after which a memory-encoder and a
dedicated Memory Attention network asynchronously update the STM with a
representation of the complete interaction. This design fundamentally alters
the scaling dynamics, reducing the total user-facing cost of a conversation
from quadratic to linear with respect to
the number of interactions. By decoupling response generation from memory
updates, RxT achieves low latency, enabling truly real-time, stateful, and
economically viable long-form conversations. We validated our architecture with
a series of proof-of-concept experiments on synthetic data, demonstrating
superior performance and constant-time inference latency compared to a baseline
stateless model of comparable size.
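
The scaling claim in the abstract above can be made concrete with a back-of-the-envelope count. The sketch below assumes every turn contributes the same number of tokens and counts only the tokens fed to the model; the function names and the fixed turn length are illustrative assumptions, not quantities from the paper.

```python
def stateless_tokens(num_turns, tokens_per_turn):
    """Tokens processed when the whole history is re-read on every turn."""
    return sum(turn * tokens_per_turn for turn in range(1, num_turns + 1))


def stateful_tokens(num_turns, tokens_per_turn):
    """Tokens processed when each turn only sees the current interaction
    plus a fixed-size memory (the memory cost is treated as constant)."""
    return num_turns * tokens_per_turn


for turns in (10, 100, 1000):
    print(turns, stateless_tokens(turns, 100), stateful_tokens(turns, 100))
```

With 100 tokens per turn, the stateless count grows from 5,500 tokens at 10 turns to 505,000 at 100 turns, while the stateful count grows from 1,000 to 10,000, which matches the quadratic versus linear behavior the abstract describes.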
From Scope to Script An Automated Report Generation Model for Gastrointestinal Endoscopy
Authors: Evandros Kaklamanos, Kristjana Kristinsdottir, Jonathan Huang, Dustin Carlson, Rajesh Keswani, John Pandolfino, Mozziyar Etemadi
2025-10-03
Endoscopic procedures such as esophagogastroduodenoscopy (EGD) and
colonoscopy play a critical role in diagnosing and managing gastrointestinal
(GI) disorders. However, the documentation burden associated with these
procedures places significant strain on gastroenterologists, contributing to
inefficiencies in clinical workflows and physician burnout. To address this
challenge, we propose a novel automated report generation model that leverages
a transformer-based vision encoder and text decoder within a two-stage training
framework. In the first stage, both components are pre-trained on image/text
caption pairs to capture generalized vision-language features, followed by
fine-tuning on image/report pairs to generate clinically meaningful findings.
Our approach not only streamlines the documentation process but also holds
promise for reducing physician workload and improving patient care.
Harnessing the XMM-Newton data X-ray spectral modelling of 4XMM-DR11 detections and 4XMM-DR11s sources
Authors: A. Viitanen, G. Mountrichas, H. Stiele, F. J. Carrera, A. Ruiz, J. Ballet, A. Akylas, A. Corral, M. Freyberg, A. Georgakakis, I. Georgantopoulos, S. Mateos, C. Motch, A. Nebot, H. Tranin, N. Webb
2025-10-03
The XMM-Newton X-ray observatory has played a prominent role in astrophysics,
conducting precise and thorough observations of the X-ray sky for the past two
decades. The most recent iteration of the 4XMM catalogue and one of its latest
data releases, DR11, mark significant improvements over previous XMM-Newton
catalogues, serving as a cornerstone for comprehending the diverse inhabitants
of the X-ray sky. We employ detections and spectra extracted from the 4XMM-DR11
catalogue, subjecting them to fitting procedures using simple models. Our study
operates within the framework of the XMM2ATHENA project, which focuses on
developing state-of-the-art methods that exploit existing XMM-Newton data. We
introduce and publicly release four catalogues containing measurements derived
from X-ray spectral modelling of sources. The first catalogue encompasses
outcomes obtained by fitting an absorbed power law model to all the extracted
spectra for individual detections within the 4XMM-DR11 dataset. The second
catalogue presents results obtained by fitting both an absorbed power law and
an absorbed blackbody model to all unique physical sources listed in the
4XMM-DR11s catalogue, which documents source detection results from overlapping
XMM-Newton observations. For the third catalogue we use the five band count
rates derived from the pipeline detection of X-ray sources to mimic low
resolution spectra to get a rough estimate of the spectral shape (absorbed
power-law) of all 4XMM-DR11 detections. In the fourth catalogue, we conduct
spectral analyses for the subset of identified sources with extracted spectra,
employing various models based on their classification into categories such as
AGN, stars, X-ray binaries, and cataclysmic variables. The scientific potential
of these catalogues is highlighted by discussing the capabilities of optical
and mid-infrared colours for selecting absorbed AGN. (abridged)
Cache-to-Cache Direct Semantic Communication Between Large Language Models
Authors: Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang
2025-10-03
Multi-LLM systems harness the complementary strengths of diverse Large
Language Models, achieving performance and efficiency gains unattainable by a
single model. In existing designs, LLMs communicate through text, forcing
internal representations to be transformed into output token sequences. This
process both loses rich semantic information and incurs token-by-token
generation latency. Motivated by these limitations, we ask: Can LLMs
communicate beyond text? Oracle experiments show that enriching the KV-Cache
semantics can improve response quality without increasing cache size,
supporting KV-Cache as an effective medium for inter-model communication. Thus,
we propose Cache-to-Cache (C2C), a new paradigm for direct semantic
communication between LLMs. C2C uses a neural network to project and fuse the
source model's KV-cache with that of the target model to enable direct semantic
transfer. A learnable gating mechanism selects the target layers that benefit
from cache communication. Compared with text communication, C2C utilizes the
deep, specialized semantics from both models, while avoiding explicit
intermediate text generation. Experiments show that C2C achieves 8.5-10.5%
higher average accuracy than individual models. It further outperforms the text
communication paradigm by approximately 3.0-5.0%, while delivering an average
2.0x speedup in latency. Our code is available at
https://github.com/thu-nics/C2C.
Coevolutionary Continuous Discrete Diffusion Make Your Diffusion Language Model a Latent Reasoner
Authors: Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang, Muhan Zhang, Lester Mackey, Tommi Jaakkola, Stephen Bates, Dinghuai Zhang
2025-10-03
Diffusion language models, especially masked discrete diffusion models, have
achieved great success recently. While there are some theoretical and
preliminary empirical results showing the advantages of latent reasoning with
looped transformers or continuous chain-of-thoughts, continuous diffusion models
typically underperform their discrete counterparts. In this paper, we argue
that diffusion language models do not necessarily need to be in the discrete
space. In particular, we prove that continuous diffusion models have stronger
expressivity than discrete diffusions and looped transformers. We attribute the
contradiction between the theoretical expressiveness and empirical performance
to their practical trainability: while continuous diffusion provides
intermediate supervision that looped transformers lack, they introduce
additional difficulty decoding tokens into the discrete token space from the
continuous representation space. We therefore propose Coevolutionary Continuous
Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process
on the union of a continuous representation space and a discrete token space,
leveraging a single model to simultaneously denoise in the joint space. By
combining two modalities, CCDD is expressive with rich semantics in the latent
space, as well as good trainability and sample quality with the help of
explicit discrete tokens. We also propose effective architectures and advanced
training/sampling techniques for CCDD, which shows strong empirical
performance in extensive language modeling experiments on real-world tasks.
FocusAgent Simple Yet Effective Ways of Trimming the Large Context of Web Agents
Authors: Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert, Massimo Caccia, Jérémy Espinas, Alexandre Aussem, Véronique Eglin, Alexandre Lacoste
2025-10-03
Web agents powered by large language models (LLMs) must process lengthy web
page observations to complete user goals; these pages often exceed tens of
thousands of tokens. This saturates context limits and increases computational
cost; moreover, processing full pages exposes agents to security
risks such as prompt injection. Existing pruning strategies either discard
relevant content or retain irrelevant context, leading to suboptimal action
prediction. We introduce FocusAgent, a simple yet effective approach that
leverages a lightweight LLM retriever to extract the most relevant lines from
accessibility tree (AxTree) observations, guided by task goals. By pruning
noisy and irrelevant content, FocusAgent enables efficient reasoning while
reducing vulnerability to injection attacks. Experiments on WorkArena and
WebArena benchmarks show that FocusAgent matches the performance of strong
baselines, while reducing observation size by over 50%. Furthermore, a variant
of FocusAgent significantly reduces the success rate of prompt-injection
attacks, including banner and pop-up attacks, while maintaining task success
performance in attack-free settings. Our results highlight that targeted
LLM-based retrieval is a practical and robust strategy for building web agents
that are efficient, effective, and secure.
OpenZL A Graph-Based Model for Compression
Authors: Yann Collet, Nick Terrell, W. Felix Handte, Danielle Rozenblit, Victor Zhang, Kevin Zhang, Yaelle Goldschlag, Jennifer Lee, Daniel Riegel, Stan Angelov, Nadav Rotem
2025-10-03
Research in general-purpose lossless compression over the last decade has
largely found improvements in compression ratio that come at great cost to
resource utilization and processing throughput. However, most production
workloads require high throughput and low resource utilization, so most
research systems have seen little adoption. Instead, real world improvements in
compression are increasingly often realized by building application-specific
compressors which can exploit knowledge about the structure and semantics of
the data being compressed. These systems easily outperform even the best
generic compressors, but application-specific compression schemes are not
without drawbacks. They are inherently limited in applicability and are
difficult to maintain and deploy.
We show that these challenges can be overcome with a new way of thinking
about compression. We propose the ``graph model'' of compression, a new
theoretical framework for representing compression as a directed acyclic graph
of modular codecs. This motivates OpenZL, an implementation of this model that
compresses data into a self-describing wire format, any configuration of which
can be decompressed by a universal decoder. OpenZL's design enables rapid
development of tailored compressors with minimal code, its universal decoder
eliminates deployment lag, and its investment in a well-vetted standard
component library minimizes security risks. Experimental results demonstrate
that OpenZL achieves superior compression ratios and speeds compared to
state-of-the-art general-purpose compressors on a variety of real-world
datasets. Internal deployments at Meta have also shown consistent improvements
in size and/or speed, with development timelines reduced from months to days.
OpenZL thus represents an advance in practical, scalable, and maintainable data
compression for modern data-intensive applications.
Improving Cooperation in Collaborative Embodied AI
Authors: Hima Jacob Leven Suprabha, Laxmi Nag Laxminarayan Nagesh, Ajith Nair, Alvin Reuben Amal Selvaster, Ayan Khan, Raghuram Damarla, Sanju Hannah Samuel, Sreenithi Saravana Perumal, Titouan Puech, Venkataramireddy Marella, Vishal Sonar, Alessandro Suglia, Oliver Lemon
2025-10-03
The integration of Large Language Models (LLMs) into multiagent systems has
opened new possibilities for collaborative reasoning and cooperation with AI
agents. This paper explores different prompting methods and evaluates their
effectiveness in enhancing agent collaborative behaviour and decision-making.
We enhance CoELA, a framework designed for building Collaborative Embodied
Agents that leverage LLMs for multi-agent communication, reasoning, and task
coordination in shared virtual spaces. Through systematic experimentation, we
examine different LLMs and prompt engineering strategies to identify optimised
combinations that maximise collaboration performance. Furthermore, we extend
our research by integrating speech capabilities, enabling seamless
collaborative voice-based interactions. Our findings highlight the
effectiveness of prompt optimisation in enhancing collaborative agent
performance; for example, our best combination improved the efficiency of the
system running with Gemma3 by 22% compared to the original CoELA system. In
addition, the speech integration provides a more engaging user interface for
iterative system development and demonstrations.
CHORD Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration
Authors: Tianqi Liu, Kairui Fu, Shengyu Zhang, Wenyan Fan, Zhaocheng Du, Jieming Zhu, Fan Wu, Fei Wu
2025-10-03
With the advancement of mobile device capabilities, deploying reranking
models directly on devices has become feasible, enabling real-time contextual
recommendations. When migrating models from cloud to devices, resource
heterogeneity inevitably necessitates model compression. Recent quantization
methods show promise for efficient deployment, yet they overlook
device-specific user interests, resulting in compromised recommendation
accuracy. While on-device finetuning captures personalized user preference, it
imposes additional computational burden through local retraining. To address
these challenges, we propose a framework for Customizing Hybrid-precision
On-device model for sequential Recommendation with Device-cloud collaboration
(CHORD), leveraging channel-wise mixed-precision quantization to simultaneously
achieve
personalization and resource-adaptive deployment. CHORD distributes randomly
initialized models across heterogeneous devices and identifies user-specific
critical parameters through auxiliary hypernetwork modules on the cloud. Our
parameter sensitivity analysis operates across multiple granularities (layer,
filter, and element levels), enabling precise mapping from user profiles to
quantization strategy. Through on-device mixed-precision quantization, CHORD
delivers dynamic model adaptation and accelerated inference without
backpropagation, eliminating costly retraining cycles. We minimize
communication overhead by encoding quantization strategies using only 2 bits
per channel instead of 32-bit weights. Experiments on three real-world datasets
with two popular backbones (SASRec and Caser) demonstrate the accuracy,
efficiency, and adaptivity of CHORD.
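
To illustrate why a 2-bit-per-channel encoding is compact, the sketch below packs one 2-bit code per channel, four codes to a byte, and unpacks them again. The meaning of each code (for example, which bit-width a code selects) is not given in the abstract, so the codes here are placeholders and both helper functions are hypothetical.

```python
def pack_2bit(codes):
    """Pack per-channel codes in {0, 1, 2, 3} into bytes, four codes per byte."""
    packed = bytearray((len(codes) + 3) // 4)
    for index, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError("each code must fit in 2 bits")
        packed[index // 4] |= code << (2 * (index % 4))
    return bytes(packed)


def unpack_2bit(packed, num_channels):
    """Recover the first num_channels codes from the packed bytes."""
    return [
        (packed[index // 4] >> (2 * (index % 4))) & 0b11
        for index in range(num_channels)
    ]


codes = [0, 1, 2, 3, 3, 0]          # placeholder per-channel strategy codes
payload = pack_2bit(codes)
print(len(payload), unpack_2bit(payload, len(codes)))
```

For 64 channels this payload is 16 bytes, whereas sending one 32-bit value per channel would take 256 bytes, a 16x difference under these assumptions.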
Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders
Authors: Kriz Tahimic, Charibeth Cheng
2025-10-03
As Large Language Models become integral to software development, with
substantial portions of AI-suggested code entering production, understanding
their internal correctness mechanisms becomes critical for safe deployment. We
apply sparse autoencoders to decompose LLM representations, identifying
directions that correspond to code correctness. We select predictor directions
using t-statistics and steering directions through separation scores from base
model representations, then analyze their mechanistic properties through
steering, attention analysis, and weight orthogonalization. We find that code
correctness directions in LLMs reliably predict incorrect code, while
correction capabilities, though statistically significant, involve tradeoffs
between fixing errors and preserving correct code. Mechanistically, successful
code generation depends on attending to test cases rather than problem
descriptions. Moreover, directions identified in base models retain their
effectiveness after instruction-tuning, suggesting code correctness mechanisms
learned during pre-training are repurposed during fine-tuning. Our mechanistic
insights suggest three practical applications: prompting strategies should
prioritize test examples over elaborate problem descriptions, predictor
directions can serve as error alarms for developer review, and these same
predictors can guide selective steering, intervening only when errors are
anticipated to prevent the code corruption from constant steering.
TridentServe A Stage-level Serving System for Diffusion Pipelines
Authors: Yifei Xia, Fangcheng Fu, Hao Yuan, Hanke Zhang, Xupeng Miao, Yijun Liu, Suhan Ling, Jie Jiang, Bin Cui
2025-10-03
Diffusion pipelines, renowned for their powerful visual generation
capabilities, have seen widespread adoption in generative vision tasks (e.g.,
text-to-image/video). These pipelines typically follow an
encode--diffuse--decode three-stage architecture. Current serving systems
deploy diffusion pipelines within a static, manual, and pipeline-level
paradigm, allocating the same resources to every request and stage. However,
through an in-depth analysis, we find that such a paradigm is inefficient due
to the discrepancy in resource needs across the three stages of each request,
as well as across different requests. Following the analysis, we propose the
dynamic stage-level serving paradigm and develop TridentServe, a brand new
diffusion serving system. TridentServe automatically, dynamically derives the
placement plan (i.e., how each stage resides) for pipeline deployment and the
dispatch plan (i.e., how the requests are routed) for request processing,
co-optimizing the resource allocation for both model and requests. Extensive
experiments show that TridentServe consistently improves SLO attainment and
reduces average/P95 latencies by up to 2.5x and 3.6x/4.1x over existing works
across a variety of workloads.
FlexiQ Adaptive Mixed-Precision Quantization for Latency/Accuracy Trade-Offs in Deep Neural Networks
Authors: Jaemin Kim, Hongjun Um, Sungkyun Kim, Yongjun Park, Jiwon Seo
2025-10-03
Neural networks commonly execute on hardware accelerators such as NPUs and
GPUs because of their size and computation overhead. These accelerators are
costly and
it is hard to scale their resources to handle real-time workload fluctuations.
We present FlexiQ, an adaptive mixed-precision quantization scheme for
computer vision models. FlexiQ selectively applies low-bitwidth computation to
feature channels with small value ranges and employs an efficient bit-lowering
method to minimize quantization errors while maintaining inference accuracy.
Furthermore, FlexiQ adjusts its low-bitwidth channel ratio in real time,
enabling quantized models to effectively manage fluctuating inference workload.
We implemented a FlexiQ prototype, including the mixed-precision inference
runtime on our custom NPU and GPUs. Evaluated on eleven convolution- and
transformer-based vision models, FlexiQ achieves on average 6.6% higher
accuracy for 4-bit models with finetuning and outperforms four state-of-the-art
quantization techniques. Moreover, our mixed-precision models achieved an
efficient accuracy-latency trade-off, with the 50% 4-bit model incurring only
0.6% accuracy loss while achieving 40% of the speedup of the 100% 4-bit model
over 8-bit model. Latency evaluations on our NPU and GPUs confirmed that FlexiQ
introduces minimal runtime overhead, demonstrating its hardware efficiency and
overall performance benefits.
Distributed Low-Communication Training with Decoupled Momentum Optimization
Authors: Sasho Nedelkoski, Alexander Acker, Odej Kao, Soeren Becker, Dominik Scheinert
2025-10-03
The training of large models demands substantial computational resources,
typically available only in data centers with high-bandwidth interconnects.
However, reducing the reliance on high-bandwidth interconnects between nodes
enables the use of distributed compute resources as an alternative to
centralized data center training. Building on recent advances in distributed
model training, we propose an approach that further reduces communication by
combining infrequent synchronizations across distributed model replicas with
gradient momentum compression. In particular, we treat the optimizer momentum
as a signal and decompose the Nesterov momentum into high- and low-frequency
components via the discrete cosine transform (DCT). Only the high-frequency
components are synchronized across model replicas every steps. Empirically,
our method achieves up to a reduction in communication compared to
the baseline DiLoCo, and it generalizes across architectures, including
transformer-based language models and convolutional neural networks for images.
Overall, this work advances the feasibility of training large models on
distributed nodes with low-bandwidth interconnects.
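
The frequency split mentioned in the abstract above can be sketched with a plain orthonormal DCT-II. The code below treats a momentum vector as a 1-D signal, zeroes coefficients on either side of a cutoff index, and inverts each half. The cutoff, the function names, and the flat-vector view of the momentum are assumptions made for illustration; the paper's actual chunking and synchronization schedule are not specified in the abstract.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a list of floats."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(
            signal[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
        signal.append(total)
    return signal


def split_momentum(momentum, cutoff):
    """Return (low, high) frequency parts whose elementwise sum is momentum."""
    coeffs = dct(momentum)
    low = [c if k < cutoff else 0.0 for k, c in enumerate(coeffs)]
    high = [c if k >= cutoff else 0.0 for k, c in enumerate(coeffs)]
    return idct(low), idct(high)


low, high = split_momentum([0.5, -0.2, 0.1, 0.4, -0.3, 0.0, 0.2, -0.1], cutoff=2)
print(low)
print(high)
```

Because the transform is linear and invertible, the two parts always add back to the original momentum, so only the part chosen for synchronization needs to be exchanged between replicas.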
Prototyping Digital Social Spaces through Metaphor-Driven Design Translating Spatial Concepts into an Interactive Social Simulation
Authors: Yoojin Hong, Martina Di Paola, Braahmi Padmakumar, Hwi Joon Lee, Mahnoor Shafiq, Joseph Seering
2025-10-03
Social media platforms are central to communication, yet their designs remain
narrowly focused on engagement and scale. While researchers have proposed
alternative visions for online spaces, these ideas are difficult to prototype
within platform constraints. In this paper, we introduce a metaphor-driven
system to help users imagine and explore new social media environments. The
system translates users' metaphors into structured sets of platform features
and generates interactive simulations populated with LLM-driven agents. To
evaluate this approach, we conducted a study where participants created and
interacted with simulated social media spaces. Our findings show that metaphors
allow users to express distinct social expectations, and that perceived
authenticity of the simulation depended on how well it captured dynamics like
intimacy, participation, and temporal engagement. We conclude by discussing how
metaphor-driven simulation can be a powerful design tool for prototyping
alternative social architectures and expanding the design space for future
social platforms.
TokenFlow Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling
Authors: Junyi Chen, Chuheng Du, Renyuan Liu, Shuochao Yao, Dingtian Yan, Jiang Liao, Shengzhong Liu, Fan Wu, Guihai Chen
2025-10-03
Real-time LLM interactions demand streamed token generations, where text
tokens are progressively generated and delivered to users while balancing two
objectives: responsiveness (i.e., low time-to-first-token) and steady
generation (i.e., required time-between-tokens). Standard LLM serving systems
suffer from the inflexibility caused by non-preemptive request scheduling and
reactive memory management, leading to poor resource utilization and low
request processing parallelism under request bursts. Therefore, we present
TokenFlow, a novel LLM serving system with enhanced text streaming performance
via preemptive request scheduling and proactive key-value (KV) cache
management. TokenFlow dynamically prioritizes requests based on real-time token
buffer occupancy and token consumption rate, while actively transferring KV
cache between GPU and CPU memory in the background and overlapping I/O with
computation to minimize request preemption overhead. Extensive experiments on
Llama3-8B and Qwen2.5-32B across multiple GPUs (RTX 4090, A6000, H200)
demonstrate that TokenFlow achieves up to 82.5% higher effective throughput
(accounting for actual user consumption) while reducing P99 TTFT by up to
80.2%, without degrading overall token throughput.
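
One way to picture buffer-aware prioritization is to rank requests by how soon the user would run out of buffered tokens. The sketch below does this with a simple "seconds until stall" estimate. This ranking rule, the `Request` fields, and the `schedule` helper are illustrative assumptions; the abstract does not state TokenFlow's actual priority function.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    buffered_tokens: int      # tokens generated but not yet consumed
    consumption_rate: float   # tokens per second read by the user


def seconds_until_stall(request):
    """Estimated time before the user's token buffer runs dry."""
    if request.consumption_rate <= 0:
        return float("inf")
    return request.buffered_tokens / request.consumption_rate


def schedule(requests, slots):
    """Pick the requests closest to stalling; the rest may be preempted."""
    return sorted(requests, key=seconds_until_stall)[:slots]


pending = [
    Request("a", buffered_tokens=40, consumption_rate=10.0),
    Request("b", buffered_tokens=5, consumption_rate=10.0),
    Request("c", buffered_tokens=0, consumption_rate=20.0),
]
print([r.request_id for r in schedule(pending, slots=2)])
```

In this toy run, request "c" has an empty buffer and "b" will stall in half a second, so both are served before "a", which still has four seconds of buffered output.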
From Tokens to Nodes Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting
Authors: Jianing Chen, Zehao Li, Yujun Cai, Hao Jiang, Shuqin Gao, Honglong Zhao, Tianlu Mao, Yucheng Zhang
2025-10-03
Dynamic 3D reconstruction from monocular videos remains difficult due to the
ambiguity of inferring 3D motion from limited views and computational demands of
modeling temporally varying scenes. While recent sparse control methods
alleviate computation by reducing millions of Gaussians to thousands of control
points, they suffer from a critical limitation: they allocate points purely by
geometry, leading to static redundancy and dynamic insufficiency. We propose a
motion-adaptive framework that aligns control density with motion complexity.
Leveraging semantic and motion priors from vision foundation models, we
establish patch-token-node correspondences and apply motion-adaptive
compression to concentrate control points in dynamic regions while suppressing
redundancy in static backgrounds. Our approach achieves flexible
representational density adaptation through iterative voxelization and motion
tendency scoring, directly addressing the fundamental mismatch between control
point allocation and motion complexity. To capture temporal evolution, we
introduce spline-based trajectory parameterization initialized by 2D tracklets,
replacing MLP-based deformation fields to achieve smoother motion
representation and more stable optimization. Extensive experiments demonstrate
significant improvements in reconstruction quality and efficiency over existing
state-of-the-art methods.
MALF A Multi-Agent LLM Framework for Intelligent Fuzzing of Industrial Control Protocols
Authors: Bowei Ning, Xuejun Zong, Kan He
2025-10-03
Industrial control systems (ICS) are vital to modern infrastructure but
increasingly vulnerable to cybersecurity threats, particularly through
weaknesses in their protocols. This paper presents MALF (Multi-Agent LLM
Fuzzing Framework), an advanced fuzzing solution that integrates large
language models (LLMs) with multi-agent coordination to
identify vulnerabilities in industrial control protocols (ICPs). By leveraging
Retrieval-Augmented Generation (RAG) for domain-specific knowledge and QLoRA
fine-tuning for protocol-aware input generation, MALF enhances fuzz testing
precision and adaptability. The multi-agent framework optimizes seed
generation, mutation strategies, and feedback-driven refinement, leading to
improved vulnerability discovery. Experiments on protocols like Modbus/TCP,
S7Comm, and Ethernet/IP demonstrate that MALF surpasses traditional methods,
achieving a test case pass rate (TCPR) of 88-92% and generating more exception
triggers (ETN). MALF also maintains over 90% seed coverage and Shannon entropy
values between 4.2 and 4.6 bits, ensuring diverse, protocol-compliant
mutations. Deployed in a real-world Industrial Attack-Defense Range for power
plants, MALF identified critical vulnerabilities, including three zero-day
flaws, one confirmed and registered by CNVD. These results validate MALF's
effectiveness in real-world fuzzing applications. This research highlights the
transformative potential of multi-agent LLMs in ICS cybersecurity, offering a
scalable, automated framework that sets a new standard for vulnerability
discovery and strengthens critical infrastructure security against emerging
threats.
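
The Shannon entropy figures quoted above (4.2 to 4.6 bits) measure how evenly the symbols in generated mutations are distributed. A minimal sketch of that calculation follows, assuming entropy is taken over the byte values of a single frame; the abstract does not say which unit MALF measures, and the sample frame is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy_bits(payload: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte."""
    if not payload:
        return 0.0
    counts = Counter(payload)
    total = len(payload)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical mutated frame, for illustration only (not taken from the paper).
frame = bytes.fromhex("000100000006010300000002")
print(f"{shannon_entropy_bits(frame):.2f} bits per byte")
```

A payload made of one repeated byte scores 0 bits, and a payload that uses all 256 byte values equally often scores 8 bits, so values in between indicate mutations that are varied but not uniformly random.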
To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration
Authors: Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, Anshumali Shrivastava
2025-10-03
The scaling of Generative AI (GenAI) models into the hundreds of billions of
parameters makes low-precision computation indispensable for efficient
deployment. We argue that the fundamental solution lies in developing
low-precision floating-point formats, which inherently provide numerical
stability, memory savings, and hardware efficiency without dequantization
overhead. In this paper, we present a theoretical and empirical study of an
exponent concentration phenomenon in GenAI weights: exponents consistently
exhibit low entropy across architectures and modalities. We show that this
arises naturally from α-stable distributions induced by stochastic
gradient descent, and we prove tight bounds on the entropy of exponents. Our
analysis establishes a theoretical compression limit near FP4.67, which
motivates the design of a practical FP8 format. Building on these insights, we
propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with
entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs
up to 671B parameters demonstrate up to 26.9% memory savings and 177.1%
throughput acceleration, with perfectly lossless computations, i.e., no
deviation in model outputs. Our results establish exponent concentration as a
statistical law of trained models and open a principled path for lossless
low-precision floating-point design in the FP8 era.
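
The exponent concentration idea can be illustrated by measuring the empirical entropy of the exponent field over a set of weights. The sketch below is not the paper's method: it uses IEEE 754 float32 (whose 8-bit exponent field has the same width as bfloat16's) and assumes Gaussian samples as a stand-in for trained weights, whereas the abstract attributes the effect to α-stable distributions. All function names are illustrative.

```python
import math
import random
import struct
from collections import Counter


def float32_exponent(x: float) -> int:
    """Return the 8-bit biased exponent field of x stored as IEEE 754 float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return (bits >> 23) & 0xFF


def exponent_entropy_bits(weights) -> float:
    """Empirical Shannon entropy (bits) of the exponent field across weights."""
    counts = Counter(float32_exponent(w) for w in weights)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Assumption: Gaussian samples stand in for trained weights.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(100_000)]
entropy = exponent_entropy_bits(weights)
print(f"exponent entropy: {entropy:.2f} bits (field width: 8 bits)")
```

When the measured entropy is well below the 8-bit field width, an entropy coder can store the exponents in fewer bits without losing information, which is the kind of headroom a lossless scheme can exploit.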
HALO Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference
Authors: Shubham Negi, Kaushik Roy
2025-10-03
The rapid adoption of Large Language Models (LLMs) has driven a growing
demand for efficient inference, particularly in latency-sensitive applications
such as chatbots and personalized assistants. Unlike traditional deep neural
networks, LLM inference proceeds in two distinct phases: the prefill phase,
which processes the full input sequence in parallel, and the decode phase,
which generates tokens sequentially. These phases exhibit highly diverse
compute and memory requirements, which makes accelerator design particularly
challenging. Prior works have primarily been optimized for high-batch inference
or evaluated only short input context lengths, leaving the low-batch and long
context regime, which is critical for interactive applications, largely
underexplored.
We propose HALO, a heterogeneous memory-centric accelerator designed for
the unique challenges of the prefill and decode phases in low-batch LLM
inference. HALO integrates HBM-based Compute-in-DRAM (CiD) with an on-chip
analog Compute-in-Memory (CiM), co-packaged using 2.5D integration. To further
improve hardware utilization, we introduce a phase-aware mapping strategy
that adapts to the distinct demands of the prefill and decode phases.
Compute-bound operations in the prefill phase are mapped to CiM to exploit its
high-throughput matrix multiplication capability, while memory-bound operations
in the decode phase are executed on CiD to benefit from reduced data movement
within DRAM. Additionally, we present an analysis of the performance tradeoffs
of LLMs under two architectural extremes: a fully CiD and a fully on-chip
analog CiM design to highlight the need for a heterogeneous design. We evaluate
HALO on LLaMA-2 7B and Qwen3 8B models. Our experimental results show that LLMs
mapped to HALO achieve up to 18x geometric mean speedup over AttAcc, an
attention-optimized mapping, and 2.5x over CENT, a fully CiD-based mapping.
Mind the Gap Linguistic Divergence and Adaptation Strategies in Human-LLM Assistant vs. Human-Human Interactions
Authors: Fulei Zhang, Zhou Yu
2025-10-03
As Large Language Models (LLMs) are increasingly deployed in customer-facing
applications, a critical yet underexplored question is how users communicate
differently with LLM chatbots compared to human agents. In this study, we
present empirical evidence that users adopt distinct communication styles when
they interact with chatbots versus human agents. Our analysis reveals
significant differences in grammatical fluency, politeness, and lexical
diversity in user language between the two settings. These findings suggest
that models trained exclusively on human-human interaction data may not
adequately accommodate the communication style shift that occurs once an LLM
chatbot is deployed. To enhance LLM robustness to post-launch communication
style changes, we experimented with two strategies: (1) data augmentation
during the post-training phase and (2) inference-time user message
reformulation. Our results indicate that models trained on stylistically
diverse datasets significantly outperform those trained exclusively on original
or stylistically uniform datasets, while inference-time reformulation proved
less effective. These insights help us to better adapt our models for improved
LLM-user interaction experiences.
HyperAdaLoRA Accelerating LoRA Rank Allocation During Training via Hypernetworks without Sacrificing Performance
Authors: Hao Zhang, Zhenjia Li, Runfeng Bao, Yifan Gao, Xi Xiao, Bo Huang, Yuhang Wu, Tianyang Wang, Hao Xu
2025-10-03
Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation
(LoRA), has emerged as a promising approach to fine-tuning large language
models (LLMs) while reducing computational and memory overhead. However, LoRA
assumes a uniform rank r for each incremental matrix, not accounting
for the varying significance of weight matrices across different modules and
layers. AdaLoRA leverages Singular Value Decomposition (SVD) to parameterize
updates and employs pruning of singular values to introduce dynamic rank
allocation, thereby enhancing adaptability. However, during the training
process, it often encounters issues of slow convergence speed and high
computational overhead. To address this issue, we propose HyperAdaLoRA, a novel
framework that accelerates the convergence of AdaLoRA by leveraging a
hypernetwork. Instead of directly optimizing the components of Singular Value
Decomposition, HyperAdaLoRA employs a hypernetwork based on
attention mechanisms to dynamically generate these parameters. By pruning the
outputs of the hypernetwork that generates the singular values, dynamic rank
allocation is achieved. Comprehensive experiments on various datasets and
models demonstrate that our method achieves faster convergence without
sacrificing performance. Additionally, further extension experiments on other
LoRA-based approaches validate the broad applicability of our method.
ElasticMoE An Efficient Auto Scaling Method for Mixture-of-Experts Models
Authors: Gursimran Singh, Timothy Yu, Haley Li, Cheng Chen, Hanieh Sadri, Qintao Zhang, Yu Zhang, Ying Xiong, Yong Zhang, Zhenan Fan
2025-10-02
Mixture-of-Experts (MoE) models promise efficient scaling of large language
models (LLMs) by activating only a small subset of experts per token, but their
parallelized inference pipelines make elastic serving challenging. Existing
strategies fall short: horizontal scaling provisions entire replicas of the
current configuration, often tens to hundreds of accelerators, leading to
coarse granularity, long provisioning delays, and costly overprovisioning.
Vertical scaling offers finer adjustments but typically requires instance
restarts, incurring downtime. These limitations make current approaches
ill-suited for the bursty, short-lived traffic patterns common in cloud
deployments.
We present ElasticMoE, an elastic scaling framework for MoE LLMs that
achieves fine-grained, low-latency, and zero-downtime scaling. ElasticMoE
decouples inference execution from memory operations, enabling scaling steps to
proceed concurrently with serving. An HBM Management Module (HMM) reuses
weights and KV caches via zero-copy remapping, while high-bandwidth
peer-to-peer transfers bring newly added accelerators online without
interrupting service. A virtual memory based expert redistribution mechanism
migrates MoE experts without costly buffer reallocations, reducing peak memory
usage during expert parallelism reconfiguration.
Our evaluation on Ascend NPUs with three popular MoE LLMs shows that
ElasticMoE achieves up to 9x lower scale-up latency, up to 2x better throughput
during scaling, and significantly improves SLO attainment compared to
baselines. By enabling fine-grained, concurrent scaling with minimal
disruption, ElasticMoE advances the practicality of deploying massive MoE LLMs
in dynamic cloud environments.
SAGE Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection
Authors: Ashish Jha, Salman Ahmadi-Asl
2025-10-02
Training modern neural networks on large datasets is computationally and
energy intensive. We present SAGE, a streaming data-subset selection method
that maintains a compact Frequent Directions (FD) sketch of gradient geometry
in memory and prioritizes examples whose sketched gradients align
with a consensus direction. The approach eliminates pairwise
similarities and explicit gradient stores, yielding a simple
two-pass, GPU-friendly pipeline. Leveraging FD's deterministic approximation
guarantees, we analyze how agreement scoring preserves gradient energy within
the principal sketched subspace. Across multiple benchmarks, SAGE trains with
small kept-rate budgets while retaining competitive accuracy relative to
full-data training and recent subset-selection baselines, and reduces
end-to-end compute and peak memory. Overall, SAGE offers a practical,
constant-memory alternative that complements pruning and model compression for
efficient training.
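
The agreement-scoring step can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the gradients have already been projected into the sketched subspace (the Frequent Directions sketch itself is omitted), and it assumes the consensus direction is the normalized mean of unit-length sketched gradients. The function names and sample vectors are hypothetical.

```python
import math


def unit(v):
    """Scale v to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def agreement_scores(sketched_grads):
    """Cosine similarity of each sketched gradient with the consensus direction."""
    units = [unit(g) for g in sketched_grads]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    consensus = unit(mean)
    return [sum(a * b for a, b in zip(u, consensus)) for u in units]


def select_top(sketched_grads, keep_rate):
    """Indices of the highest-agreement examples under a kept-rate budget."""
    scores = agreement_scores(sketched_grads)
    k = max(1, int(len(scores) * keep_rate))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical 2-D sketched gradients; the last one opposes the others.
grads = [[1.0, 0.1], [0.9, 0.2], [1.0, -0.1], [-1.0, 0.0]]
print(select_top(grads, keep_rate=0.5))
```

Because each example is scored against a single consensus vector, no pairwise similarity matrix is ever formed, which is the property the abstract highlights.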
KaVa Latent Reasoning via Compressed KV-Cache Distillation
Authors: Anna Kuzina, Maciej Pioro, Paul N. Whatmough, Babak Ehteshami Bejnordi
2025-10-02
Large Language Models (LLMs) excel at multi-step reasoning problems with
explicit chain-of-thought (CoT), but verbose traces incur significant
computational costs and memory overhead, and often carry redundant, stylistic
artifacts. Latent reasoning has emerged as an efficient alternative that
internalizes the thought process, but it suffers from a critical lack of
supervision, limiting its effectiveness on complex, natural-language reasoning
traces. In this work, we propose KaVa, the first framework that bridges this
gap by distilling knowledge directly from a compressed KV-cache of the teacher
into a latent-reasoning student via self-distillation, leveraging the
representational flexibility of continuous latent tokens to align stepwise
trajectories. We show that the abstract, unstructured knowledge within
compressed KV-cache, which lacks direct token correspondence, can serve as a
rich supervisory signal for a latent reasoning student. Empirically, the
approach consistently outperforms strong latent baselines, exhibits markedly
smaller degradation from equation-only to natural-language traces, and scales
to larger backbones while preserving efficiency. These results establish
compressed KV-cache distillation as a scalable supervision signal for latent
reasoning, combining the accuracy of CoT-trained teachers with the efficiency
and deployability of latent inference.
VideoNSA Native Sparse Attention Scales Video Understanding
Authors: Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu
2025-10-02
Video understanding in multimodal language models remains limited by context
length: models often miss key transition frames and struggle to maintain
coherence across long time scales. To address this, we adapt Native Sparse
Attention (NSA) to video-language models. Our method, VideoNSA, adapts
Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We
employ a hardware-aware hybrid approach to attention, preserving dense
attention for text, while employing NSA for video. Compared to
token-compression and training-free sparse baselines, VideoNSA achieves
improved performance on long-video understanding, temporal reasoning, and
spatial benchmarks. Further ablation analysis reveals four key findings: (1)
reliable scaling to 128K tokens; (2) an optimal global-local attention
allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4)
the learnable combined sparse attention helps induce dynamic attention sinks.
Self-Forcing++ Towards Minute-Scale High-Quality Video Generation
Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh
2025-10-02
Diffusion models have revolutionized image and video generation, achieving
unprecedented visual quality. However, their reliance on transformer
architectures incurs prohibitively high computational costs, particularly when
extending generation to long videos. Recent work has explored autoregressive
formulations for long video generation, typically by distilling from
short-horizon bidirectional teachers. Nevertheless, given that teacher models
cannot synthesize long videos, the extrapolation of student models beyond their
training horizon often leads to pronounced quality degradation, arising from
the compounding of errors within the continuous latent space. In this paper, we
propose a simple yet effective approach to mitigate quality degradation in
long-horizon video generation without requiring supervision from long-video
teachers or retraining on long video datasets. Our approach centers on
exploiting the rich knowledge of teacher models to provide guidance for the
student model through sampled segments drawn from self-generated long videos.
Our method maintains temporal consistency while scaling video length by up to
20x beyond the teacher's capability, avoiding common issues such as over-exposure
and error accumulation without recomputing overlapping frames like previous
methods. When scaling up the computation, our method shows the capability of
generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the
maximum span supported by our base model's position embedding and more than 50x
longer than that of our baseline model. Experiments on standard benchmarks and
our proposed improved benchmark demonstrate that our approach substantially
outperforms baseline methods in both fidelity and consistency. Our long-horizon
videos demo can be found at https://self-forcing-plus-plus.github.io/
From Frames to Clips Efficient Key Clip Selection for Long-Form Video Understanding
Authors: Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, Garin Kessler
2025-10-02
Video Large Language Models (VLMs) have achieved remarkable results on a
variety of vision language tasks, yet their practical use is limited by the
"needle in a haystack" problem: the massive number of visual tokens produced
from raw video frames exhausts the model's context window. Existing solutions
alleviate this issue by selecting a sparse set of frames, thereby reducing
token count, but such frame-wise selection discards essential temporal
dynamics, leading to suboptimal reasoning about motion and event continuity. In
this work we systematically explore the impact of temporal information and
demonstrate that extending selection from isolated key frames to key clips,
which are short, temporally coherent segments, improves video understanding. To
maintain a fixed computational budget while accommodating the larger token
footprint of clips, we propose an adaptive resolution strategy that dynamically
balances spatial resolution and clip length, ensuring a constant token count
per video. Experiments on three long-form video benchmarks demonstrate that our
training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%,
and 10.3% on the Video-MME, LongVideoBench, and MLVU benchmarks, respectively.
These results highlight the importance of preserving temporal coherence in frame
selection and provide a practical pathway for scaling Video LLMs to real-world
video understanding applications. The project webpage is available at
https://guangyusun.com/f2c .
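
The trade-off between spatial resolution and clip length under a fixed token budget can be worked through with simple arithmetic. The sketch below is an illustration under stated assumptions, not the F2C algorithm: it assumes square frames, a hypothetical patch size of 14 pixels, and a cost of (side / patch)^2 visual tokens per frame. Integer rounding means the budget is treated as an upper bound here.

```python
import math


def frame_side_for_budget(token_budget: int, num_clips: int,
                          clip_len: int, patch: int = 14) -> int:
    """Largest square frame side (pixels, multiple of `patch`) within the budget.

    Assumes each frame costs (side / patch) ** 2 visual tokens.
    """
    frames = num_clips * clip_len
    tokens_per_frame = token_budget // frames
    patches_per_side = math.isqrt(tokens_per_frame)
    return patches_per_side * patch


def total_tokens(side: int, num_clips: int, clip_len: int, patch: int = 14) -> int:
    """Visual tokens consumed by all frames at the given resolution."""
    return num_clips * clip_len * (side // patch) ** 2


budget = 8192  # hypothetical per-video token budget
for clip_len in (1, 2, 4):
    side = frame_side_for_budget(budget, num_clips=8, clip_len=clip_len)
    print(clip_len, side, total_tokens(side, 8, clip_len))
```

Lengthening each clip from one frame to four halves the frame side in this setup, so temporal context is bought with spatial detail while the total token count stays within the same budget.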
Contrastive Retrieval Heads Improve Attention-Based Re-Ranking
Authors: Linh Tran, Yulong Li, Radu Florian, Wei Sun
2025-10-02
The strong zero-shot and long-context capabilities of recent Large Language
Models (LLMs) have paved the way for highly effective re-ranking systems.
Attention-based re-rankers leverage attention weights from transformer heads to
produce relevance scores, but not all heads are created equal: many
contribute noise and redundancy, thus limiting performance. To address this, we
introduce CoRe heads, a small set of retrieval heads identified via a
contrastive scoring metric that explicitly rewards high attention heads that
correlate with relevant documents, while downplaying nodes with higher
attention that correlate with irrelevant documents. This relative ranking
criterion isolates the most discriminative heads for re-ranking and yields a
state-of-the-art list-wise re-ranker. Extensive experiments with three LLMs
show that aggregated signals from CoRe heads, constituting less than 1% of all
heads, substantially improve re-ranking accuracy over strong baselines. We
further find that CoRe heads are concentrated in middle layers, and pruning the
computation of the final 50% of model layers preserves accuracy while significantly
reducing inference time and memory usage.
MMDEW Multipurpose Multiclass Density Estimation in the Wild
Authors: Villanelle O'Reilly, Jonathan Cox, Georgios Leontidis, Marc Hanheide, Petra Bosilj, James Brown
2025-10-02
Density map estimation can be used to estimate object counts in dense and
occluded scenes where discrete counting-by-detection methods fail. We propose a
multicategory counting framework that leverages a Twins pyramid
vision-transformer backbone and a specialised multi-class counting head built
on a state-of-the-art multiscale decoding approach. A two-task design adds a
segmentation-based Category Focus Module, suppressing inter-category cross-talk
at training time. Training and evaluation on the VisDrone and iSAID benchmarks
demonstrate superior performance versus prior multicategory crowd-counting
approaches (33%, 43% and 64% reduction in MAE), and the comparison with YOLOv11
underscores the necessity of crowd counting methods in dense scenes. The
method's regional loss opens up multi-class crowd counting to new domains,
demonstrated through the application to a biodiversity monitoring dataset,
highlighting its capacity to inform conservation efforts and enable scalable
ecological insights.
UpSafeC Upcycling for Controllable Safety in Large Language Models
Authors: Yuhao Sun, Zhuoer Xu, Shiwen Cui, Kun Yang, Lingyun Yu, Yongdong Zhang, Hongtao Xie
2025-10-02
Large Language Models (LLMs) have achieved remarkable progress across a wide
range of tasks, but remain vulnerable to safety risks such as harmful content
generation and jailbreak attacks. Existing safety techniques -- including
external guardrails, inference-time guidance, and post-training alignment --
each face limitations in balancing safety, utility, and controllability. In
this work, we propose UpSafeC, a unified framework for enhancing LLM
safety through safety-aware upcycling. Our approach first identifies
safety-critical layers and upcycles them into a sparse Mixture-of-Experts (MoE)
structure, where the router acts as a soft guardrail that selectively activates
original MLPs and added safety experts. We further introduce a two-stage SFT
strategy to strengthen safety discrimination while preserving general
capabilities. To enable flexible control at inference time, we introduce a
safety temperature mechanism, allowing dynamic adjustment of the trade-off
between safety and utility. Experiments across multiple benchmarks, base models,
and model scales demonstrate that UpSafeC achieves robust safety
improvements against harmful and jailbreak inputs, while maintaining
competitive performance on general tasks. Moreover, analysis shows that safety
temperature provides fine-grained inference-time control that achieves the
Pareto-optimal frontier between utility and safety. Our results highlight a new
direction for LLM safety: moving from static alignment toward dynamic, modular,
and inference-aware control.
KVComm Enabling Efficient LLM Communication through Selective KV Sharing
Authors: Xiangyu Shi, Marco Chiesa, Gerald Q. Maguire Jr., Dejan Kostic
2025-10-02
Large Language Models (LLMs) are increasingly deployed in multi-agent
systems, where effective inter-model communication is crucial. Existing
communication protocols either rely on natural language, incurring high
inference costs and information loss, or on hidden states, which suffer from
information concentration bias and inefficiency. To address these limitations,
we propose KVComm, a novel communication framework that enables efficient
communication between LLMs through selective sharing of KV pairs. KVComm
leverages the rich information encoded in the KV pairs while avoiding the
pitfalls of hidden states. We introduce a KV layer-wise selection strategy
based on attention importance scores with a Gaussian prior to identify the most
informative KV pairs for communication. Extensive experiments across diverse
tasks and model pairs demonstrate that KVComm achieves comparable performance
to the upper-bound method, which directly merges inputs to one model without
any communication, while transmitting as few as 30% of layers' KV pairs. Our
study highlights the potential of KV pairs as an effective medium for inter-LLM
communication, paving the way for scalable and efficient multi-agent systems.
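
A layer-wise selection of this kind can be sketched as below. The abstract does not specify how the attention importance scores and the Gaussian prior are combined, so this sketch assumes a simple product of the two; the prior's centre and width, the function name, and the sample scores are all hypothetical.

```python
import math


def select_layers(importance, keep_ratio=0.3, mu=None, sigma=None):
    """Pick layers whose prior-weighted importance is highest.

    importance: one attention importance score per layer (assumed given).
    mu, sigma: centre and width of the Gaussian prior over layer index
    (hypothetical defaults: the middle layer and a quarter of the depth).
    """
    n = len(importance)
    mu = (n - 1) / 2 if mu is None else mu
    sigma = n / 4 if sigma is None else sigma
    prior = [math.exp(-((i - mu) ** 2) / (2 * sigma ** 2)) for i in range(n)]
    # Assumption: combine importance and prior by multiplication.
    scores = [s * p for s, p in zip(importance, prior)]
    k = max(1, round(keep_ratio * n))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Hypothetical importance scores for a 10-layer model.
importance = [0.2, 0.3, 0.5, 0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.1]
print(select_layers(importance))
```

Only the KV pairs of the returned layers would be transmitted to the receiving model; with `keep_ratio=0.3` that corresponds to the 30% figure quoted in the abstract.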
SoundReactor Frame-level Online Video-to-Audio Generation
Authors: Koichi Saito, Julian Tanke, Christian Simon, Masato Ishii, Kazuki Shimada, Zachary Novack, Zhi Zhong, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
2025-10-02
Prevailing Video-to-Audio (V2A) generation models operate offline, assuming
an entire video sequence or chunks of frames are available beforehand. This
critically limits their use in interactive applications such as live content
creation and emerging generative world models. To address this gap, we
introduce the novel task of frame-level online V2A generation, where a model
autoregressively generates audio from video without access to future video
frames. Furthermore, we propose SoundReactor, which, to the best of our
knowledge, is the first simple yet effective framework explicitly tailored for
this task. Our design enforces end-to-end causality and targets low per-frame
latency with audio-visual synchronization. Our model's backbone is a
decoder-only causal transformer over continuous audio latents. For vision
conditioning, it leverages grid (patch) features extracted from the smallest
variant of the DINOv2 vision encoder, which are aggregated into a single token
per frame to maintain end-to-end causality and efficiency. The model is trained
through diffusion pre-training followed by consistency fine-tuning to
accelerate the diffusion head decoding. On a benchmark of diverse gameplay
videos from AAA titles, our model successfully generates semantically and
temporally aligned, high-quality full-band stereo audio, validated by both
objective and human evaluations. Furthermore, our model achieves low per-frame
waveform-level latency (26.3ms with the head NFE=1, 31.5ms with NFE=4) on
30FPS, 480p videos using a single H100. Demo samples are available at
https://koichi-saito-sony.github.io/soundreactor/.
Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning
Authors: Xinyuan Song, Keyu Wang, PengXiang Li, Lu Yin, Shiwei Liu
2025-10-02
Recent studies suggest that the deeper layers of Large Language Models (LLMs)
contribute little to representation learning and can often be removed without
significant performance loss. However, such claims are typically drawn from
narrow evaluations and may overlook important aspects of model behavior. In
this work, we present a systematic study of depth utilization across diverse
dimensions, including evaluation protocols, task categories, and model
architectures. Our analysis confirms that very deep layers are generally less
effective than earlier ones, but their contributions vary substantially with
the evaluation setting. Under likelihood-based metrics without generation,
pruning most layers preserves performance, with only the initial few being
critical. By contrast, generation-based evaluation uncovers indispensable roles
for middle and deeper layers in enabling reasoning and maintaining long-range
coherence. We further find that knowledge and retrieval are concentrated in
shallow components, whereas reasoning accuracy relies heavily on deeper layers
-- yet can be reshaped through distillation. These results highlight that depth
usage in LLMs is highly heterogeneous and context-dependent, underscoring the
need for task-, metric-, and model-aware perspectives in both interpreting and
compressing large models.
LLM-Based Multi-Task Bangla Hate Speech Detection Type, Severity, and Target
Authors: Md Arid Hasan, Firoj Alam, Md Fahad Hossain, Usman Naseem, Syed Ishtiaque Ahmed
2025-10-02
Online social media platforms are central to everyday communication and
information seeking. While these platforms serve positive purposes, they also
provide fertile ground for the spread of hate speech, offensive language, and
bullying content targeting individuals, organizations, and communities. Such
content undermines safety, participation, and equity online. Reliable detection
systems are therefore needed, especially for low-resource languages where
moderation tools are limited. In Bangla, prior work has contributed resources
and models, but most are single-task (e.g., binary hate/offense) with limited
coverage of multi-facet signals (type, severity, target). We address these gaps
by introducing the first multi-task Bangla hate-speech dataset,
BanglaMultiHate, one of the largest manually annotated corpora to date. Building
on this resource, we conduct a comprehensive, controlled comparison spanning
classical baselines, monolingual pretrained models, and LLMs under zero-shot
prompting and LoRA fine-tuning. Our experiments assess LLM adaptability in a
low-resource setting and reveal a consistent trend: although LoRA-tuned LLMs
are competitive with BanglaBERT, culturally and linguistically grounded
pretraining remains critical for robust performance. Together, our dataset and
findings establish a stronger benchmark for developing culturally aligned
moderation tools in low-resource contexts. For reproducibility, we will release
the dataset and all related scripts.
Patch-as-Decodable-Token Towards Unified Multi-Modal Vision Tasks in MLLMs
Authors: Yongyi Su, Haojie Zhang, Shijie Li, Nanqing Liu, Jingyi Liao, Junyi Pan, Yuan Liu, Xiaofen Xing, Chong Sun, Chen Li, Nancy F. Chen, Shuicheng Yan, Xulei Yang, Xun Xu
2025-10-02
Multimodal large language models (MLLMs) have advanced rapidly in recent
years. However, existing approaches for vision tasks often rely on indirect
representations, such as generating coordinates as text for detection, which
limits performance and prevents dense prediction tasks like segmentation. To
overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a
unified paradigm that enables MLLMs to directly generate both textual and
diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs),
derived from visual patch embeddings of query images and interleaved seamlessly
with the LLM's output textual tokens. A lightweight decoder then transforms the
LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior
methods, PaDT processes VRTs independently at each forward pass and dynamically
expands the embedding table, thus improving localization and differentiation
among similar objects. We further tailor a training strategy for PaDT by
randomly selecting VRTs for supervised fine-tuning and introducing a robust
per-token cross-entropy loss. Our empirical studies across four visual
perception and understanding tasks suggest that PaDT consistently achieves
state-of-the-art performance, even compared with significantly larger MLLM
models. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.
MelCap A Unified Single-Codebook Neural Codec for High-Fidelity Audio Compression
Authors: Jingyi Li, Zhiyuan Zhao, Yunfei Liu, Lijian Lin, Ye Zhu, Jiahao Wu, Qiuqiang Kong, Yu Li
2025-10-02
Neural audio codecs have recently emerged as powerful tools for high-quality
and low-bitrate audio compression, leveraging deep generative models to learn
latent representations of audio signals. However, existing approaches either
rely on a single quantizer that only processes the speech domain, or on multiple
quantizers that are not well suited for downstream tasks. To address this
issue, we propose MelCap, a unified "one-codebook-for-all" neural codec that
effectively handles speech, music, and general sound. By decomposing audio
reconstruction into two stages, our method preserves more acoustic details than
previous single-codebook approaches, while achieving performance comparable to
mainstream multi-codebook methods. In the first stage, audio is transformed
into mel-spectrograms, which are compressed and quantized into compact single
tokens using a 2D tokenizer. A perceptual loss is further applied to mitigate
the over-smoothing artifacts observed in spectrogram reconstruction. In the
second stage, a vocoder recovers waveforms from the discrete mel tokens in a
single forward pass, enabling real-time decoding. Both objective and subjective
evaluations demonstrate that MelCap achieves quality comparable to
state-of-the-art multi-codebook codecs, while retaining the computational
simplicity of a single-codebook design, thereby providing an effective
representation for downstream tasks.
HRTFformer A Spatially-Aware Transformer for Personalized HRTF Upsampling in Immersive Audio Rendering
Authors: Xuyi Hu, Jian Li, Shaojie Zhang, Stefan Goetz, Lorenzo Picinali, Ozgur B. Akan, Aidan O. T. Hogg
2025-10-02
Personalized Head-Related Transfer Functions (HRTFs) are starting to be
introduced in many commercial immersive audio applications and are crucial for
realistic spatial audio rendering. However, one of the main hesitations
regarding their introduction is that creating personalized HRTFs is impractical
at scale due to the complexities of the HRTF measurement process. To mitigate
this drawback, HRTF spatial upsampling has been proposed with the aim of
reducing measurements required. While prior work has seen success with
different machine learning (ML) approaches, these models often struggle with
long-range spatial consistency and generalization at high upsampling factors.
In this paper, we propose a novel transformer-based architecture for HRTF
upsampling, leveraging the attention mechanism to better capture spatial
correlations across the HRTF sphere. Working in the spherical harmonic (SH)
domain, our model learns to reconstruct high-resolution HRTFs from sparse input
measurements with significantly improved accuracy. To enhance spatial
coherence, we introduce a neighbor dissimilarity loss that promotes magnitude
smoothness, yielding more realistic upsampling. We evaluate our method using
both perceptual localization models and objective spectral distortion metrics.
Experiments show that our model surpasses leading methods by a substantial
margin in generating realistic, high-fidelity HRTFs.
Accelerating Attention with Basis Decomposition
Authors: Jialin Zhao
2025-10-02
Attention is a core operation in large language models (LLMs) and
vision-language models (VLMs). We present BD Attention (BDA), the first
lossless algorithmic reformulation of attention. BDA is enabled by a simple
matrix identity from Basis Decomposition (BD), which restructures multi-head
projections into a compact form while preserving exact outputs. Unlike
I/O-aware system optimizations such as FlashAttention, BDA provides a
mathematically guaranteed acceleration that is architecture-agnostic. On
DeepSeek-V2-Lite (16B, FP16), BDA requires only 4s of offline preparation with
no retraining required and, on modern GPUs, achieves 32% faster key/value
projections and 25% smaller weights, while increasing end-to-end perplexity
(PPL) by just 0.02% (FP16) or 0.0004% (FP32), a negligible effect on model
performance. These results position BDA as the first theoretically exact method
for lossless attention acceleration that is complementary to existing
engineering-level optimizations. Our code is available at
https://github.com/abcbdf/basis-decomposition-official.
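The abstract does not state the matrix identity itself. As a rough illustration of how a low-rank product can be stored more compactly with no loss, the sketch below uses a generic linear-algebra fact: a rank-r matrix is fully determined by r of its rows plus the coefficients that express the remaining rows in terms of them. The matrices, the choice of the first r rows as the basis, and all helper names are assumptions made for this example, not the paper's construction.

```python
from fractions import Fraction

def matmul(A, B):
    """Plain matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_2x2(M):
    """Closed-form inverse of a 2x2 matrix (assumes a nonzero determinant)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def to_fractions(M):
    """Exact rational arithmetic, so equality checks are not blurred by rounding."""
    return [[Fraction(x) for x in row] for row in M]

# Hypothetical rank-2 weight W = U @ V^T with U of shape 4x2 and V of shape 3x2.
U = to_fractions([[1, 2], [3, 1], [0, 1], [2, 2]])
V = to_fractions([[1, 0], [2, 1], [1, 3]])
W = matmul(U, [list(col) for col in zip(*V)])    # 4x3 matrix of rank 2

r = 2
basis = W[:r]                                    # r rows of W, stored as-is
coeff = matmul(U[r:], inverse_2x2(U[:r]))        # remaining rows in terms of the basis
rebuilt = basis + matmul(coeff, basis)           # stacked [B; C @ B]

stored = sum(len(row) for row in basis) + sum(len(row) for row in coeff)
factored = sum(len(row) for row in U) + sum(len(row) for row in V)
print(rebuilt == W, stored, factored)
```

In this toy case the basis form stores r(m + n - r) numbers rather than the r(m + n) of the factored form, and the reconstruction is exact rather than approximate.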
TalkPlay-Tools Conversational Music Recommendation with LLM Tool Calling
Authors: Seungheon Doh, Keunwoo Choi, Juhan Nam
2025-10-02
While the recent developments in large language models (LLMs) have
successfully enabled generative recommenders with natural language
interactions, their recommendation behavior is limited, leaving other simpler
yet crucial components such as metadata or attribute filtering underutilized in
the system. We propose an LLM-based music recommendation system with tool
calling to serve as a unified retrieval-reranking pipeline. Our system
positions an LLM as an end-to-end recommendation system that interprets user
intent, plans tool invocations, and orchestrates specialized components:
boolean filters (SQL), sparse retrieval (BM25), dense retrieval (embedding
similarity), and generative retrieval (semantic IDs). Through tool planning,
the system predicts which types of tools to use, their execution order, and the
arguments needed to find music matching user preferences, supporting diverse
modalities while seamlessly integrating multiple database filtering methods. We
demonstrate that this unified tool-calling framework achieves competitive
performance across diverse recommendation scenarios by selectively employing
appropriate retrieval methods based on user queries, envisioning a new paradigm
for conversational music recommendation systems.
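A minimal sketch of what executing such a tool plan could look like, covering only two of the four tool types (a SQL boolean filter followed by BM25 sparse retrieval). The table schema, the track data, the tool function names, and the hard-coded plan are all hypothetical; in the described system the plan would be produced by the LLM.

```python
import math
import sqlite3
from collections import Counter

def sql_filter(conn, genre, min_year):
    """Boolean filtering tool: returns (id, title, description) rows."""
    rows = conn.execute(
        "SELECT id, title, description FROM tracks WHERE genre = ? AND year >= ?",
        (genre, min_year),
    )
    return rows.fetchall()

def bm25_rank(candidates, query, k1=1.5, b=0.75):
    """Sparse retrieval tool: ranks candidate rows by BM25 over their descriptions."""
    docs = [row[2].lower().split() for row in candidates]
    avg_len = sum(len(d) for d in docs) / len(docs)
    doc_freq = Counter(term for d in docs for term in set(d))
    scored = []
    for row, doc in zip(candidates, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, row[1]))
    return sorted(scored, reverse=True)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (id INTEGER, title TEXT, genre TEXT, year INTEGER, description TEXT)")
conn.executemany("INSERT INTO tracks VALUES (?, ?, ?, ?, ?)", [
    (1, "Blue Train", "jazz", 1957, "hard bop saxophone classic"),
    (2, "So What", "jazz", 1959, "modal jazz trumpet cool"),
    (3, "Paranoid", "metal", 1970, "heavy riff guitar"),
    (4, "Naima", "jazz", 1959, "slow saxophone ballad"),
])

# Hypothetical plan: which tools to call, in what order, with which arguments.
plan = [("sql_filter", {"genre": "jazz", "min_year": 1958}),
        ("bm25_rank", {"query": "saxophone ballad"})]

candidates = sql_filter(conn, **plan[0][1])
ranking = bm25_rank(candidates, **plan[1][1])
print(ranking[0][1])
```

The filter narrows the catalogue cheaply before the ranking step runs, which is the kind of division of labor the retrieval-reranking pipeline describes.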
ENLighten Lighten the Transformer, Enable Efficient Optical Acceleration
Authors: Hanqing Zhu, Zhican Zhou, Shupeng Ning, Xuhao Wu, Ray Chen, Yating Wan, David Pan
2025-10-02
Photonic computing has emerged as a promising substrate for accelerating the
dense linear-algebra operations at the heart of AI, yet adoption for large
Transformer models remains in its infancy. We identify two bottlenecks: (1)
costly electro-optic conversions and data-movement overheads that erode energy
efficiency as model sizes scale; (2) a mismatch between limited on-chip
photonic resources and Transformer scale, which forces frequent reuse of
photonic tensor cores and dilutes throughput gains. To address these
challenges, we introduce a hardware-software co-design framework. First, we
propose Lighten, a PTC-aware flow that post-hoc decomposes
each Transformer weight matrix into a low-rank component plus a
structured-sparse component aligned to photonic tensor-core granularity,
without lengthy retraining. Second, we present ENLighten, a
reconfigurable photonic accelerator with dynamically adaptive tensor cores,
driven by broadband light redistribution, enabling fine-grained sparsity
support and full power gating of inactive parts. On ImageNet, Lighten
prunes a Base-scale Vision Transformer by 50% with 1% accuracy drop
after only 3 epochs (about 1 hour) of fine-tuning. Deployed on
ENLighten, it achieves an improvement in energy-delay
product over the state-of-the-art photonic Transformer accelerator.
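To make the "low-rank plus structured-sparse" decomposition concrete, here is a small sketch under stated assumptions: the low-rank part is a rank-1 approximation obtained by power iteration, and the structured-sparse part keeps only the highest-energy square tiles of the residual, with the tile size standing in for tensor-core granularity. The matrix, tile size, and function names are illustrative; the abstract does not specify how Lighten computes either component.

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rank1_approx(A, iters=50):
    """Power iteration for the leading singular pair; returns u, v with A ~ u v^T."""
    At = [list(col) for col in zip(*A)]
    v = [1.0] * len(A[0])
    for _ in range(iters):
        u = matvec(A, v)
        norm = sum(x * x for x in u) ** 0.5
        u = [x / norm for x in u]
        v = matvec(At, u)              # v absorbs the singular value
    return u, v

def top_blocks(R, block, keep):
    """Keep the `keep` highest-energy block x block tiles of R and zero the rest."""
    n_rows, n_cols = len(R), len(R[0])
    energy = {}
    for bi in range(0, n_rows, block):
        for bj in range(0, n_cols, block):
            energy[(bi, bj)] = sum(R[i][j] ** 2
                                   for i in range(bi, min(bi + block, n_rows))
                                   for j in range(bj, min(bj + block, n_cols)))
    kept = set(sorted(energy, key=energy.get, reverse=True)[:keep])
    S = [[R[i][j] if (i - i % block, j - j % block) in kept else 0.0
          for j in range(n_cols)] for i in range(n_rows)]
    return S, kept

def sq_error(A, B):
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Hypothetical 4x4 weight: a rank-1 pattern plus a perturbation in one 2x2 tile.
W = [[2.0, 4.0, 2.0, 2.0],
     [1.0, 2.0, 1.0, 1.0],
     [1.0, 2.0, 2.5, -0.5],
     [2.0, 4.0, 0.5, 3.5]]

u, v = rank1_approx(W)
L = [[ui * vj for vj in v] for ui in u]                       # low-rank component
R = [[w - l for w, l in zip(rw, rl)] for rw, rl in zip(W, L)]  # residual
S, kept = top_blocks(R, block=2, keep=1)                       # structured-sparse component
approx = [[l + s for l, s in zip(rl, rs)] for rl, rs in zip(L, S)]

err_low_rank = sq_error(W, L)
err_combined = sq_error(W, approx)
print(err_low_rank, err_combined)
```

Adding the kept tiles can only reduce the reconstruction error relative to the low-rank part alone, while whole tiles that are zero could in principle be skipped or power-gated in hardware.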
Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value
Authors: Wangxuan Fan, Ching Wang, Siqi Li, Nan Liu
2025-10-02
For many real-world applications, understanding feature-outcome relationships
is as crucial as achieving high predictive accuracy. While traditional neural
networks excel at prediction, their black-box nature obscures underlying
functional relationships. Kolmogorov-Arnold Networks (KANs) address this by
employing learnable spline-based activation functions on edges, enabling
recovery of symbolic representations while maintaining competitive performance.
However, KAN's architecture presents unique challenges for network pruning.
Conventional magnitude-based methods become unreliable due to sensitivity to
input coordinate shifts. We propose ShapKAN, a pruning framework using
Shapley value attribution to assess node importance in a shift-invariant
manner. Unlike magnitude-based approaches, ShapKAN quantifies each node's
actual contribution, ensuring consistent importance rankings regardless of
input parameterization. Extensive experiments on synthetic and real-world
datasets demonstrate that ShapKAN preserves true node importance while enabling
effective network pruning. Our approach reinforces KAN's interpretability
advantages, facilitating deployment in resource-constrained environments.
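For readers unfamiliar with Shapley attribution, the following sketch computes exact Shapley values for three hypothetical nodes by averaging each node's marginal contribution over all orderings. The value function `toy_value` is invented for this example (it is not ShapKAN's scoring function), and exact enumeration is only practical for a handful of nodes.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}

def toy_value(active):
    """Hypothetical model quality when only the nodes in `active` are kept."""
    score = 0.0
    if "n1" in active:
        score += 3.0
    if "n2" in active:
        score += 1.0
    if "n1" in active and "n2" in active:
        score += 2.0                   # interaction shared by n1 and n2
    return score                       # n3 never changes the score

phi = shapley_values(["n1", "n2", "n3"], toy_value)
prune_order = sorted(phi, key=phi.get)  # least important node first
print(phi, prune_order)
```

Because the scores depend only on how the model's quality changes when a node is removed, they are unaffected by rescaling or shifting the node's raw activations, which is the property a magnitude-based score lacks.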
Asymmetric Proximal Policy Optimization mini-critics boost LLM reasoning
Authors: Jiashun Liu, Johan Obando-Ceron, Han Lu, Yancheng He, Weixun Wang, Wenbo Su, Bo Zheng, Pablo Samuel Castro, Aaron Courville, Ling Pan
2025-10-02
Most recent RL for LLMs (RL4LLM) methods avoid explicit critics, replacing
them with average advantage baselines. This shift is largely pragmatic:
conventional value functions are computationally expensive to train at
scale and often fail under sparse rewards and long reasoning horizons. We
revisit this bottleneck from an architectural perspective and introduce
Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable
framework that restores the critic's role while remaining efficient in
large-model settings. AsyPPO employs a set of lightweight mini-critics, each
trained on disjoint prompt shards. This design encourages diversity while
preserving calibration, reducing value-estimation bias. Beyond robust
estimation, AsyPPO leverages inter-critic uncertainty to refine the policy
update: (i) masking advantages in states where critics agree and gradients add
little learning signal, and (ii) filtering high-divergence states from entropy
regularization, suppressing spurious exploration. After training on open-source
data with only 5,000 samples, AsyPPO consistently improves learning stability
and performance across multiple benchmarks over strong baselines, such as GRPO,
achieving performance gains of more than six percent on Qwen3-4b-Base and about
three percent on Qwen3-8b-Base and Qwen3-14b-Base over classic PPO, without
additional tricks. These results highlight the importance of architectural
innovations for scalable, efficient algorithms.
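The two uses of inter-critic uncertainty can be sketched as a pair of per-state masks. This is an illustrative reading only: the standard deviation as the disagreement measure, the quantile-style cutoffs `agree_frac` and `diverge_frac`, and the mean as the value baseline are assumptions, since the abstract does not give the exact criteria.

```python
from statistics import mean, pstdev

def asymmetric_masks(critic_values, agree_frac=0.25, diverge_frac=0.25):
    """critic_values[i] holds one value estimate per mini-critic for state i.

    Returns (baseline, advantage_mask, entropy_mask):
      * advantage_mask is 0 for the states where critics agree most,
      * entropy_mask is 0 for the states where critics diverge most.
    """
    spread = [pstdev(v) for v in critic_values]
    n = len(spread)
    order = sorted(range(n), key=lambda i: spread[i])      # most agreement first
    n_agree = int(n * agree_frac)
    n_diverge = int(n * diverge_frac)
    agree = set(order[:n_agree])
    diverge = set(order[n - n_diverge:])
    baseline = [mean(v) for v in critic_values]
    advantage_mask = [0.0 if i in agree else 1.0 for i in range(n)]
    entropy_mask = [0.0 if i in diverge else 1.0 for i in range(n)]
    return baseline, advantage_mask, entropy_mask

# Hypothetical value estimates from two mini-critics on four states.
values = [[1.0, 1.0], [0.0, 2.0], [0.5, 0.9], [3.0, 3.2]]
baseline, advantage_mask, entropy_mask = asymmetric_masks(values)
print(advantage_mask, entropy_mask)
```

In a training loop, the advantage mask would multiply the per-state policy-gradient term and the entropy mask would multiply the entropy bonus.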
The Unseen Frontier Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM
Authors: Kwanhee Lee, Hyeondo Jang, Dongyeop Lee, Dan Alistarh, Namhoon Lee
2025-10-02
Neural network pruning is a promising technique to mitigate the excessive
computational and memory requirements of large language models (LLMs). Despite
its promise, however, progress in this area has diminished, as conventional
methods are seemingly unable to surpass moderate sparsity levels (50-60%)
without severely degrading model accuracy. This work breaks through the current
impasse, presenting a principled and effective method
that achieves extreme sparsity levels of up to 90% while retaining high model
fidelity. This is done by identifying several limitations in current practice,
all of which can be traced back to their reliance on a surrogate objective
formulation. The proposed method tackles this issue directly and effectively via
standard and well-established constrained optimization techniques based on
ADMM. Our extensive experiments across a wide range of models and scales show
that the method achieves substantial improvements over existing methods;
e.g., it achieves 7.8 less perplexity than the best existing method on
LLaMA-2-7B at 90% sparsity. Furthermore, we present a quantized variant that
scales to extremely large
models (27B), and establish its theoretical convergence guarantees. These
results highlight meaningful progress in advancing the frontier of
LLM sparsity, while suggesting that significant opportunities for further
advancement may remain in directions that have so far attracted limited
exploration.
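As a reminder of how ADMM handles a hard sparsity constraint, the sketch below splits the variable into a free copy w and a sparse copy z and alternates between a closed-form w-update, a projection onto vectors with at most k nonzeros, and a dual update. The separable quadratic objective, the `curvature` weights, and the fixed `rho` are stand-ins chosen to keep the example short; the paper optimizes the true model objective at LLM scale, which this toy does not attempt.

```python
def project_top_k(x, k):
    """Euclidean projection onto vectors with at most k nonzeros."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]

def admm_prune(target, curvature, k, rho=1.0, iters=100):
    """Minimise 0.5 * sum(h_i * (w_i - t_i)^2) subject to ||w||_0 <= k."""
    n = len(target)
    z = project_top_k(target, k)
    u = [0.0] * n
    for _ in range(iters):
        # w-update: closed form because the toy objective is separable
        w = [(h * t + rho * (zi - ui)) / (h + rho)
             for h, t, zi, ui in zip(curvature, target, z, u)]
        # z-update: projection onto the sparsity constraint
        z = project_top_k([wi + ui for wi, ui in zip(w, u)], k)
        # dual update (scaled form)
        u = [ui + wi - zi for ui, wi, zi in zip(u, w, z)]
    return z

# Hypothetical dense weights and per-weight curvature.
dense = [0.9, -0.1, 0.5, 0.05, -1.2, 0.3]
curvature = [1.0, 4.0, 2.0, 1.0, 0.5, 3.0]
sparse = admm_prune(dense, curvature, k=2)
print(sparse)
```

The constraint is enforced exactly by the projection step, so the returned vector is sparse by construction rather than by thresholding after the fact.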
Support Basis Fast Attention Beyond Bounded Entries
Authors: Maryam Aliakbarpour, Vladimir Braverman, Junze Yin, Haochen Zhang
2025-10-02
The quadratic complexity of softmax attention remains a central bottleneck in
scaling large language models (LLMs). [Alman and Song, NeurIPS 2023] proposed a
sub-quadratic attention approximation algorithm, but it works only under the
restrictive bounded-entry assumption. Since this assumption rarely holds in
practice, its applicability to modern LLMs is limited.
In this paper, we introduce support-basis decomposition, a new framework for
efficient attention approximation beyond bounded entries. We empirically
demonstrate that the entries of the query and key matrices exhibit sub-Gaussian
behavior. Our approach uses this property to split large and small entries,
enabling exact computation on sparse components and polynomial approximation on
dense components. We establish rigorous theoretical guarantees, proving a
sub-quadratic runtime, and extend the method to a multi-threshold setting that
eliminates all distributional assumptions. Furthermore, we provide the first
theoretical justification for the empirical success of polynomial attention
[Kacham, Mirrokni, and Zhong, ICML 2024], showing that softmax attention can be
closely approximated by a combination of multiple polynomial attentions with
sketching.
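The intuition behind splitting large and small entries can be shown at the level of a single row of attention scores: a low-degree polynomial approximates the exponential well only for small inputs, so the few large scores are handled exactly. This sketch thresholds the scores directly and makes no attempt at sub-quadratic runtime; the threshold, polynomial degree, and scores are hypothetical, and the paper's decomposition acts on the query and key matrices rather than on scores.

```python
import math

def taylor_exp(x, degree):
    """Degree-`degree` Taylor polynomial of exp around 0."""
    term, total = 1.0, 1.0
    for k in range(1, degree + 1):
        term *= x / k
        total += term
    return total

def exact_softmax(scores):
    weights = [math.exp(s) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def split_softmax(scores, threshold=1.0, degree=4):
    """Polynomial on small scores, exact exponential on the (few) large ones."""
    weights = [taylor_exp(s, degree) if abs(s) <= threshold else math.exp(s)
               for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def poly_softmax(scores, degree=4):
    """Polynomial everywhere, as if all entries were assumed to be bounded."""
    weights = [taylor_exp(s, degree) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical row of attention scores with one large entry.
scores = [0.1, -0.4, 0.8, 6.0, -0.2]
exact = exact_softmax(scores)
err_split = max(abs(a - b) for a, b in zip(split_softmax(scores), exact))
err_poly = max(abs(a - b) for a, b in zip(poly_softmax(scores), exact))
print(err_split, err_poly)
```

With a single large score in the row, the all-polynomial version misestimates the dominant weight, while the split version stays close to the exact softmax.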