2026-02-06
Table of Contents
- DLM-Scope Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
- "It Talks Like a Patient, But Feels Different" Co-Designing AI Standardized Patients with Medical Learners
- Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
- Focus-Scan-Refine From Human Visual Perception to Efficient Visual Token Pruning
- Price of universality in vector quantization is at most 0.11 bit
- Efficient implementation of arbitrary Hermitian-preserving and trace-preserving maps
- Variational Speculative Decoding Rethinking Draft Training from Token Likelihood to Sequence Acceptance
- LongR Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards
- Towards Green AI Decoding the Energy of LLM Inference in Software Development
- Shiva-DiT Residual-Based Differentiable Top-K Selection for Efficient Diffusion Transformers
- SDFP Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration
- DisCa Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
- Forward Index Compression for Learned Sparse Retrieval
- THOR Inductive Link Prediction over Hyper-Relational Knowledge Graphs
- Speech-XL Towards Long-Form Speech Understanding in Large Speech Language Models
- RaBiT Residual-Aware Binarization Training for Accurate and Efficient LLMs
- Pool-based Active Learning as Noisy Lossy Compression Characterizing Label Complexity via Finite Blocklength Analysis
- High-Performance Moment-Encoded Lattice Boltzmann Method with Stability-Guided Quantization
- Hybrid Gated Flow (HGF) Stabilizing 1.58-bit LLMs via Selective Low-Rank Correction
- FedMosaic Federated Retrieval-Augmented Generation via Parametric Adapters
- Extreme Weather Nowcasting via Local Precipitation Pattern Prediction
- Diffusion-aided Extreme Video Compression with Lightweight Semantics Guidance
- Double-P Hierarchical Top-P Sparse Attention for Long-Context LLMs
- Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky
- TIDE Temporal Incremental Draft Engine for Self-Improving LLM Inference
- ARGaze Autoregressive Transformers for Online Egocentric Gaze Estimation
- SocialVeil Probing Social Intelligence of Language Agents under Communication Barriers
- Physics-Informed Diffusion Models for Vehicle Speed Trajectory Generation
- Protein Autoregressive Modeling via Multiscale Structure Generation
- Adaptive estimation of Sobolev-type energy functionals on the sphere
- OmniSIFT Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
- From independent patches to coordinated attention Controlling information flow in vision transformers
- Less Finetuning, Better Retrieval Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging
- Addressing Corpus Knowledge Poisoning Attacks on RAG Using Sparse Attention
- Towards Understanding and Avoiding Limitations of Convolutions on Graphs
- PIO-FVLM Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective
- LEAD Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation
- Disentangling meaning from language in LLM-based machine translation
- Domain decomposition methods and preconditioning strategies using generalized locally Toeplitz tools: proposals, analysis, and numerical validation
- Harmonia Algorithm-Hardware Co-Design for Memory- and Compute-Efficient BFP-based LLM Inference
- SalFormer360 a transformer-based saliency estimation model for 360-degree videos
- Nix and Fix Targeting 1000x Compression of 3D Gaussian Splatting with Diffusion Models
- LycheeDecode Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
- Incongruity-sensitive access to highly compressed strings
- Circuit-Restricted Weight Arithmetic for Selective Refusal
- Model-Dowser Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models
- Greedy-Gnorm A Gradient Matrix Norm-Based Alternative to Attention Entropy for Head Pruning
- DOS Dual-Flow Orthogonal Semantic IDs for Recommendation in Meituan
- Seg-ReSearch Segmentation with Interleaved Reasoning and External Search
- RASA Routing-Aware Safety Alignment for Mixture-of-Experts Models
- The Stretto Execution Engine for LLM-Augmented Data Systems
- Depth-Wise Emergence of Prediction-Centric Geometry in Large Language Models
- HoRD Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation
- LCUDiff Latent Capacity Upgrade Diffusion for Faithful Human Body Restoration
- Enabling Real-Time Colonoscopic Polyp Segmentation on Commodity CPUs via Ultra-Lightweight Architecture
- TurboBoA Faster and Exact Attention-aware Quantization without Backpropagation
- SparVAR Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
- Pruning Minimal Reasoning Graphs for Efficient Retrieval-Augmented Generation
- Internalizing LLM Reasoning via Discovery and Replay of Latent Actions
- From Assumptions to Actions Turning LLM Reasoning into Uncertainty-Aware Planning for Embodied Agents
- Disentangling Causal Importance from Emergent Structure in Multi-Expert Orchestration
- MiniRec Data-Efficient Reinforcement Learning for LLM-based Recommendation
- KVSmooth Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
- Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog
- AppleVLM End-to-end Autonomous Driving with Advanced Perception and Planning-Enhanced Vision-Language Models
- CoLT Reasoning with Chain of Latent Tool Calls
- Post-Quantum Identity-Based TLS for 5G Service-Based Architecture and Cloud-Native Infrastructure
- Following the TRAIL Predicting and Explaining Tomorrow's Hits with a Fine-Tuned LLM
- Adaptive 1D Video Diffusion Autoencoder
- OAT Ordered Action Tokenization
- Semantic Consensus Decoding Backdoor Defense for Verilog Code Generation
- Universal Quantized Berry-Dipole Flat Bands
DLM-Scope Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
Authors: Xu Wang, Bingqing Jiang, Yu Wan, Baosong Yang, Lingpeng Kong, Difan Zou
2026-02-05
Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, human-interpretable features and intervene on model behavior. Recently, as diffusion language models (DLMs) have become an increasingly promising alternative to autoregressive LLMs, it is essential to develop tailored mechanistic interpretability tools for this emerging class of models. In this work, we present DLM-Scope, the first SAE-based interpretability framework for DLMs, and demonstrate that trained Top-K SAEs can faithfully extract interpretable features. Notably, we find that inserting SAEs affects DLMs differently than autoregressive LLMs: while SAE insertion in LLMs typically incurs a loss penalty, in DLMs it can reduce cross-entropy loss when applied to early layers, a phenomenon absent or markedly weaker in LLMs. Additionally, SAE features in DLMs enable more effective diffusion-time interventions, often outperforming LLM steering. Moreover, we pioneer new SAE-based research directions for DLMs: we show that SAEs can provide useful signals for the DLM decoding order, and that SAE features remain stable during the post-training phase of DLMs. Our work establishes a foundation for mechanistic interpretability in DLMs and shows the great potential of applying SAEs to DLM-related tasks and algorithms.
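A Top-K SAE of the kind mentioned above keeps only the k largest latent pre-activations per input; a minimal numpy sketch (dimensions, initialization, and the pre-bias convention are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Top-K sparse autoencoder: keep the k largest pre-activations,
    zero the rest, then reconstruct linearly. Shapes: x (d,),
    W_enc (m, d), W_dec (d, m), with m >> d latent features."""
    pre = W_enc @ (x - b_dec) + b_enc        # encoder pre-activations
    z = np.zeros_like(pre)
    idx = np.argpartition(pre, -k)[-k:]      # indices of the k largest
    z[idx] = np.maximum(pre[idx], 0.0)       # ReLU on the survivors
    x_hat = W_dec @ z + b_dec                # linear decoder
    return z, x_hat

rng = np.random.default_rng(0)
d, m, k = 16, 64, 8
W_enc = rng.normal(size=(m, d)) / np.sqrt(d)
W_dec = rng.normal(size=(d, m)) / np.sqrt(m)
b_enc, b_dec = np.zeros(m), np.zeros(d)
z, x_hat = topk_sae_forward(rng.normal(size=d), W_enc, b_enc, W_dec, b_dec, k)
```

The hard sparsity constraint (at most k active features) is what makes each latent dimension a candidate interpretable feature.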
"It Talks Like a Patient, But Feels Different" Co-Designing AI Standardized Patients with Medical Learners
Authors: Zhiqi Gao, Guo Zhu, Huarui Luo, Dongyijie Primo Pan, Haoming Tang, Bingquan Zhang, Jiahuan Pei, Jie Li, Benyou Wang
2026-02-05
Standardized patients (SPs) play a central role in clinical training but are costly, difficult to scale, and inconsistent. Large language model (LLM)-based AI standardized patients (AI-SPs) promise flexible, on-demand practice, yet learners often report that they "talk like a patient but feel different." We interviewed 12 clinical-year medical students and conducted three co-design workshops to examine how learners experience the constraints of SP encounters and what they expect from AI-SPs. We identified six learner-centered needs, translated them into AI-SP design requirements, and synthesized a conceptual workflow. Our findings position AI-SPs as tools for deliberate practice and show that instructional usability, rather than conversational realism alone, drives learner trust, engagement, and educational value.
Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
Authors: Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li
2026-02-05
Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should possess the autonomy to navigate unknown environments guided solely by simple, high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense, step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing the video generation model into this field for the first time. Yet the prohibitive latency of generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up compared to the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.
Focus-Scan-Refine From Human Visual Perception to Efficient Visual Token Pruning
Authors: Enwei Tong, Yuanchao Bai, Yao Zhu, Junjun Jiang, Xianming Liu
2026-02-05
Vision-language models (VLMs) often generate massive numbers of visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive pruning. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source code can be found at https://github.com/ILOT-code/FSR
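The focus step of such pruning schemes can be sketched generically: score each visual token by a blend of saliency and instruction relevance, then keep a fixed budget. The blend weight and score names below are hypothetical illustrations, not FSR's actual formulas:

```python
import numpy as np

def prune_visual_tokens(tokens, saliency, relevance, budget, alpha=0.5):
    """Keep a fixed budget of visual tokens ranked by a blend of visual
    saliency and instruction relevance, so visually salient but
    query-irrelevant regions do not dominate.
    tokens: (n, d); saliency, relevance: (n,) scores."""
    score = alpha * saliency + (1 - alpha) * relevance
    keep = np.sort(np.argsort(score)[-budget:])  # top-budget, original order
    return tokens[keep], keep

rng = np.random.default_rng(2)
n, d = 100, 8
toks = rng.normal(size=(n, d))
sal = rng.random(n)       # e.g. attention-derived visual importance
rel = rng.random(n)       # e.g. similarity to the instruction embedding
kept, idx = prune_visual_tokens(toks, sal, rel, budget=25)
```

Keeping the surviving indices sorted preserves the tokens' original spatial order, which downstream attention layers generally expect.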
Price of universality in vector quantization is at most 0.11 bit
Authors: Alina Harbuzova, Or Ordentlich, Yury Polyanskiy
2026-02-05
Fast computation of a matrix product Wx is a workhorse of modern LLMs. To make their deployment more efficient, a popular approach is to use a low-precision approximation of W in place of the true W ("weight-only quantization"). Information theory demonstrates that an optimal algorithm for reducing the precision of W depends on the (second-order) statistics of W and requires a careful alignment of the vector quantization codebook with the PCA directions of W (a process known as "waterfilling allocation"). Dependence of the codebook on the statistics of W, however, is highly impractical. This paper proves that there exists a universal codebook that is simultaneously near-optimal for all possible statistics of W, in the sense of being at least as good as a statistics-adapted waterfilling codebook with rate reduced by 0.11 bit per dimension. Such a universal codebook would be an ideal candidate for a low-precision storage format, a topic of active modern research, but alas the existence proof is non-constructive.
Equivalently, our result shows the existence of a net in Euclidean space that is a nearly-optimal covering of a sphere simultaneously with respect to all Hilbert norms.
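For intuition, the waterfilling allocation referred to above assigns each PCA direction bits in proportion to its log-variance above a common "water level"; a small numpy sketch of textbook reverse waterfilling (illustrative of the classical allocation, not the paper's construction):

```python
import numpy as np

def waterfill_bits(variances, total_bits, iters=60):
    """Reverse waterfilling: dimension i gets b_i = max(0, 0.5*log2(var_i/theta)),
    with the water level theta chosen by bisection so the b_i sum to total_bits."""
    lo, hi = 1e-12, float(np.max(variances))
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        b = np.maximum(0.0, 0.5 * np.log2(variances / theta))
        if b.sum() > total_bits:
            lo = theta   # spending too many bits -> raise the water level
        else:
            hi = theta
    return b

# Variances along (hypothetical) PCA directions, 4 bits total budget.
var = np.array([8.0, 4.0, 2.0, 1.0, 0.25])
bits = waterfill_bits(var, total_bits=4.0)
```

High-variance directions receive more bits, and directions whose variance falls below the water level receive none, which is exactly the statistics-dependence the paper's universal codebook avoids paying more than 0.11 bit for.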
Efficient implementation of arbitrary Hermitian-preserving and trace-preserving maps
Authors: Weizhou Cai, Zi-Jie Chen, Xuanqiang Zhao, Xin Wang, Guang-Can Guo, Luyan Sun, Chang-Ling Zou
2026-02-05
Quantum control has been a cornerstone of quantum information science, driving major advances in quantum computing, quantum communication, and quantum sensing. Over the years, it has enabled the implementation of arbitrary completely positive and trace-preserving (CPTP) maps; an important next step is to extend control to Hermitian-preserving and trace-preserving (HPTP) maps, which underpin applications such as entanglement detection, quantum error mitigation, quantum simulation, and quantum machine learning. Here we present an efficient and fully constructive method for implementing arbitrary HPTP maps. Unlike existing methods that decompose an HPTP map into multiple CPTP maps or approximate it using bipartite Hamiltonians with large Hilbert spaces, our approach compiles a target HPTP map into a single executable CPTP map whose Kraus rank is guaranteed to be no larger than the intrinsic rank of the target HPTP map plus one, followed by simple classical post-processing. Numerical results for inverse noise channels used in quantum error mitigation, including bosonic photon loss, confirm substantial reductions in resources and highlight scalability in higher-dimensional settings. Together with our numerical benchmarks, these results validate the efficiency and versatility of the proposed framework, opening a route to broader quantum-information applications enabled by HPTP processing.
Variational Speculative Decoding Rethinking Draft Training from Token Likelihood to Sequence Acceptance
Authors: Xiandong Zou, Jianshu Li, Jing Huang, Pan Zhou
2026-02-05
Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We propose Variational Speculative Decoding (VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws MCMC samples from an oracle-filtered posterior, while the M-step maximizes a weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis confirms that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, significantly improving decoding efficiency.
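The sequence-acceptance objective sits on top of the standard speculative-decoding verification rule, which accepts a drafted token t with probability min(1, p(t)/q(t)) and resamples from the residual distribution on the first rejection. A toy numpy sketch of that rule (distributions are synthetic; this is the classic scheme, not VSD's training procedure):

```python
import numpy as np

def verify_draft(draft_tokens, q_probs, p_probs, rng):
    """Standard speculative-decoding verification: accept draft token t
    with probability min(1, p(t)/q(t)); on the first rejection, resample
    from the renormalized residual distribution max(p - q, 0) and stop.
    q_probs, p_probs: per-position draft/target distributions, shape (n, V)."""
    accepted = []
    for i, t in enumerate(draft_tokens):
        if rng.random() < min(1.0, p_probs[i, t] / q_probs[i, t]):
            accepted.append(t)
        else:
            residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted

rng = np.random.default_rng(1)
V, n = 5, 3
q = rng.dirichlet(np.ones(V), size=n)   # draft model distributions
p = rng.dirichlet(np.ones(V), size=n)   # target model distributions
draft = [int(rng.choice(V, p=q[i])) for i in range(n)]
out = verify_draft(draft, q, p, rng)
```

This rule guarantees the accepted sequence is distributed exactly as the target model would sample it, which is why draft training can focus purely on raising the acceptance rate.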
LongR Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards
Authors: Bowen Ping, Zijun Chen, Yiyao Yu, Tingfeng Hui, Junchi Yan, Baobao Chang
2026-02-05
Reinforcement learning has emerged as a key driver of LLM reasoning. This capability is equally pivotal in long-context scenarios, such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to effectively guide complex long-context reasoning. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain to quantify the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses to investigate the impact of reasoning chain length on efficiency and the model's robustness against distractors.
Towards Green AI Decoding the Energy of LLM Inference in Software Development
Authors: Lola Solovyeva, Fernando Castor
2026-02-05
Context: AI-assisted tools are increasingly integrated into software development workflows, but their reliance on large language models (LLMs) introduces substantial computational and energy costs. Understanding and reducing the energy footprint of LLM inference is therefore essential for sustainable software development. Objective: In this study, we conduct a phase-level analysis of LLM inference energy consumption, distinguishing between (1) the prefill phase, where the model processes the input and builds internal representations, and (2) the decoding phase, where output tokens are generated using the stored state. Method: We investigate six 6B-7B and four 3B-4B decoder-based models, evaluating them on the code-centric benchmarks HumanEval for code generation and LongBench for code understanding. Results: Our findings show that, within both parameter groups, models exhibit distinct energy patterns across phases. Furthermore, we observed that increases in prefill cost amplify the energy cost per token during decoding, with amplifications ranging from 1.3% to 51.8% depending on the model. Lastly, three out of ten models demonstrate babbling behavior, adding excessive content to the output that unnecessarily inflates energy consumption. We implemented babbling suppression for code generation, achieving energy savings ranging from 44% to 89% without affecting generation accuracy. Conclusion: These findings show that prefill costs influence decoding, which dominates energy consumption, and that babbling suppression can yield up to 89% energy savings. Reducing inference energy therefore requires both mitigating babbling behavior and limiting the impact of prefill on decoding.
Shiva-DiT Residual-Based Differentiable Top-K Selection for Efficient Diffusion Transformers
Authors: Jiaji Zhang, Hailiang Zhao, Guoxuan Zhu, Ruichao Sun, Jiaju Wu, Xinkui Zhao, Hanlin Tang, Weiyi Lu, Kan Liu, Tao Lan, Lin Qu, Shuiguang Deng
2026-02-05
Diffusion Transformers (DiTs) incur prohibitive computational costs due to the quadratic scaling of self-attention. Existing methods fail to simultaneously satisfy differentiability, efficiency, and the strict static budgets required for hardware deployment. To address this, we propose Shiva-DiT, which reconciles these conflicting requirements via residual-based differentiable Top-K selection. By leveraging a residual-aware straight-through estimator, our method enforces deterministic token counts for static compilation while preserving end-to-end learnability through residual gradient estimation. Furthermore, we introduce a Context-Aware Router and an Adaptive Ratio Policy to autonomously learn an adaptive pruning schedule. Experiments on mainstream models, including SD3.5, demonstrate that Shiva-DiT establishes a new Pareto frontier, achieving a 1.54x wall-clock speedup with superior fidelity compared to existing baselines, effectively eliminating ragged tensor overheads.
SDFP Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration
Authors: Hanyu Wei, Zunhai Su, Peng Lu, Chao Li, Spandan Tiwari, Ashish Sirasao, Yuhan Dong
2026-02-05
Large language models (LLMs) underpin interactive multimedia applications such as captioning, retrieval, recommendation, and creative content generation, yet their autoregressive decoding incurs substantial latency. Speculative decoding reduces latency using a lightweight draft model, but deployment is often limited by the cost and complexity of acquiring, tuning, and maintaining an effective draft model. Recent approaches usually require auxiliary training or specialization, and even training-free methods incur costly search or optimization. We propose SDFP, a fully training-free and plug-and-play framework that builds the draft model via Fisher Information Trace (FIT)-based layer pruning of a given LLM. Using layer sensitivity as a proxy for output perturbation, SDFP removes low-impact layers to obtain a compact draft while preserving compatibility with the original model for standard speculative verification. SDFP needs no additional training, hyperparameter tuning, or separately maintained drafts, enabling rapid, deployment-friendly draft construction. Across benchmarks, SDFP delivers 1.32x-1.5x decoding speedup without altering the target model's output distribution, supporting low-latency multimedia applications.
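FIT-style layer scoring can be approximated with the empirical Fisher, i.e. averaged squared gradients per layer, dropping the lowest-scoring layers to form the draft. A toy sketch under that assumption (not the paper's exact estimator; layer names and gradient shapes are synthetic):

```python
import numpy as np

def fit_scores(per_layer_grads):
    """Empirical Fisher trace per layer: mean over samples of the
    summed squared gradients of that layer's parameters."""
    return {name: float(np.mean([np.sum(g ** 2) for g in grads]))
            for name, grads in per_layer_grads.items()}

def prune_layers(scores, n_drop):
    """Drop the n_drop layers with the smallest Fisher trace,
    keeping the layers the output is most sensitive to."""
    order = sorted(scores, key=scores.get)
    return set(order[:n_drop])

# Synthetic gradients: later layers get larger gradient scales here,
# so the earliest layers should be pruned.
rng = np.random.default_rng(3)
grads = {f"layer{i}": [rng.normal(scale=0.1 * (i + 1), size=32)
                       for _ in range(8)] for i in range(6)}
dropped = prune_layers(fit_scores(grads), n_drop=2)
```

Because the surviving layers keep their original weights, the pruned draft stays vocabulary- and tokenizer-compatible with the target model, which is what standard speculative verification requires.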
DisCa Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
Authors: Chang Zou, Changlin Li, Yang Li, Patrol Li, Jianbing Wu, Xiao He, Songtao Liu, Zhao Zhong, Kailin Huang, Linfeng Zhang
2026-02-05
While diffusion models have achieved great success in video generation, this progress is accompanied by a rapidly escalating computational burden. Among existing methods, feature caching is popular due to its training-free nature and considerable speedup, but it inevitably suffers semantic and detail degradation under further acceleration. Another widely adopted method, training-aware step distillation, though successful in image generation, also faces drastic degradation in video generation at few steps. Furthermore, the quality loss becomes more severe when training-free feature caching is simply applied to step-distilled models, due to the fewer sampling steps. This paper introduces a distillation-compatible learnable feature caching mechanism for the first time. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. Through these initiatives, we push the acceleration boundaries further while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.
Forward Index Compression for Learned Sparse Retrieval
Authors: Sebastian Bruch, Martino Fontana, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
2026-02-05
Text retrieval using learned representations of queries and documents has, over the years, evolved into a highly effective approach to search. It is thanks to recent advances in approximate nearest neighbor search, with the emergence of highly efficient algorithms such as the inverted index-based Seismic and the graph-based HNSW, that retrieval with learned sparse representations became viable in practice. In this work, we scrutinize the efficiency of learned sparse retrieval algorithms and focus particularly on the size of a data structure that is common to all algorithmic flavors and that constitutes a substantial fraction of the overall index size: the forward index. In particular, we seek compression techniques that reduce the storage footprint of the forward index without compromising search quality or inner-product computation latency. In our examination of various integer compression techniques, we find that StreamVByte achieves the best trade-off between memory footprint, retrieval accuracy, and latency. We then improve on StreamVByte by introducing DotVByte, a new algorithm tailored to inner-product computation. Experiments on MS MARCO show that our improvements lead to significant space savings while maintaining retrieval efficiency.
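StreamVByte stores one control byte per group of four integers, with two bits per value giving its byte length; a minimal scalar Python encoder/decoder illustrating the layout (without the separate-stream SIMD tricks that make real implementations fast):

```python
def svb_encode(values):
    """StreamVByte layout: per group of four uint32s, one control byte
    whose 2-bit fields hold each value's byte length minus one, plus the
    values' little-endian bytes. Control and data streams are returned
    separately, as in the real format."""
    controls, data = bytearray(), bytearray()
    for i in range(0, len(values), 4):
        ctrl = 0
        for j, v in enumerate(values[i:i + 4]):
            nbytes = max(1, (v.bit_length() + 7) // 8)  # 1..4 bytes
            ctrl |= (nbytes - 1) << (2 * j)
            data += v.to_bytes(nbytes, "little")
        controls.append(ctrl)
    return bytes(controls), bytes(data)

def svb_decode(controls, data, count):
    """Inverse of svb_encode: read each value's length from the control
    stream, then slice its bytes out of the data stream."""
    out, pos = [], 0
    for i in range(count):
        nbytes = ((controls[i // 4] >> (2 * (i % 4))) & 0b11) + 1
        out.append(int.from_bytes(data[pos:pos + nbytes], "little"))
        pos += nbytes
    return out

nums = [3, 300, 70000, 2 ** 31, 7]
c, d = svb_encode(nums)
```

Keeping lengths out-of-band in control bytes is what lets vectorized decoders shuffle four values at once, a property a dot-product-oriented variant can exploit further.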
THOR Inductive Link Prediction over Hyper-Relational Knowledge Graphs
Authors: Weijian Yu, Yuhuan Lu, Dingqi Yang
2026-02-05
Knowledge graphs (KGs) have become a key ingredient supporting a variety of applications. Beyond the traditional triplet representation of facts, where a relation connects two entities, modern KGs contain an increasing number of hyper-relational facts, in which an arbitrary number of qualifiers attached to a triplet provide auxiliary information that further describes its rich semantics and can effectively boost reasoning performance in link prediction tasks. However, existing link prediction techniques over such hyper-relational KGs (HKGs) mostly focus on a transductive setting, where KG embedding models are learned from the specific vocabulary of a given KG and can subsequently make predictions only within that vocabulary, limiting their generalizability to previously unseen vocabularies. Against this background, we propose THOR, an inducTive link prediction technique for Hyper-relational knOwledge gRaphs. Specifically, we first introduce relation and entity foundation graphs, modeling their fundamental inter- and intra-fact interactions in HKGs, which are agnostic to any specific relations and entities. Afterward, THOR learns from the two foundation graphs with two parallel graph encoders followed by a decoder, which supports efficient masked training and fully inductive inference. We conduct a thorough evaluation of THOR on hyper-relational link prediction tasks across 12 datasets with different settings. Results show that THOR outperforms a sizable collection of baselines, yielding 66.1%, 55.9%, and 20.4% improvements over the best-performing rule-based, semi-inductive, and fully-inductive techniques, respectively. A series of ablation studies also reveals our key design factors capturing the structural invariance transferable across HKGs for inductive tasks.
Speech-XL Towards Long-Form Speech Understanding in Large Speech Language Models
Authors: Haoqin Sun, Chenyang Lyu, Shiwan Zhao, Xuanfan Ni, Xiangyu Kong, Longyue Wang, Weihua Luo, Yong Qin
2026-02-05
Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from the limited context length and the exorbitant memory footprint required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning, employing a curriculum learning strategy in which the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite utilizing significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
RaBiT Residual-Aware Binarization Training for Accurate and Efficient LLMs
Authors: Youngcheon You, Banseok Lee, Minseop Choi, Seonyoung Kim, Hyochan Chong, Changdong Kim, Youngmin Kim, Dongkyu Kim
2026-02-05
Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between memory efficiency and performance. Residual binarization enables hardware-friendly, matmul-free inference by stacking binary (±1) layers, but is plagued by pathological feature co-adaptation. We identify a key failure mode, which we term inter-path adaptation: during quantization-aware training (QAT), parallel residual binary paths learn redundant features, degrading the error-compensation structure and limiting the expressive capacity of the model. While prior work relies on heuristic workarounds (e.g., path freezing) that constrain the solution space, we propose RaBiT, a novel QAT framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. Its core mechanism sequentially derives each binary path from a single shared full-precision weight, ensuring that every path corrects the error of the preceding one. This process is stabilized by a robust initialization that prioritizes functional preservation over mere weight approximation. RaBiT redefines the 2-bit accuracy-efficiency frontier: it achieves state-of-the-art performance, rivals even hardware-intensive vector quantization (VQ) methods, and delivers an inference speed-up over full-precision models on an RTX 4090.
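The residual hierarchy RaBiT enforces can be contrasted with plain greedy residual binarization, where each binary path fits the sign pattern of the previous path's residual; a numpy sketch of that baseline construction (not RaBiT's training procedure):

```python
import numpy as np

def residual_binarize(W, n_paths):
    """Greedy residual binarization: path i stores sign(R_i) with scale
    mean(|R_i|); each path binarizes what the previous paths missed,
    so reconstruction error shrinks as paths stack."""
    R = W.copy()
    paths = []
    for _ in range(n_paths):
        alpha = np.abs(R).mean()       # optimal scalar scale for sign(R)
        B = np.sign(R)
        paths.append((alpha, B))
        R = R - alpha * B              # residual for the next path
    W_hat = sum(a * B for a, B in paths)
    return paths, W_hat

rng = np.random.default_rng(4)
W = rng.normal(size=(64, 64))
paths1, W1 = residual_binarize(W, 1)
paths2, W2 = residual_binarize(W, 2)
err1 = np.linalg.norm(W - W1)
err2 = np.linalg.norm(W - W2)
```

Each added path provably reduces the Frobenius error (by n * alpha^2 per step), which is exactly the error-compensation structure that inter-path co-adaptation degrades during joint training.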
Pool-based Active Learning as Noisy Lossy Compression Characterizing Label Complexity via Finite Blocklength Analysis
Authors: Kosuke Sugiyama, Masato Uchida
2026-02-05
This paper proposes an information-theoretic framework for analyzing the theoretical limits of pool-based active learning (AL), in which a subset of instances is selectively labeled. The proposed framework reformulates pool-based AL as a noisy lossy compression problem by mapping pool observations to noisy symbol observations, data selection to encoding, and learning to decoding. This correspondence enables a unified information-theoretic analysis of data selection and learning in pool-based AL. Applying finite blocklength analysis of noisy lossy compression, we derive information-theoretic lower bounds on label complexity and generalization error that serve as theoretical limits for a given learning algorithm under its associated optimal data selection strategy. Specifically, our bounds include terms that reflect overfitting induced by the learning algorithm and the discrepancy between its inductive bias and the target task, and they are closely related to established information-theoretic bounds and stability theory, which have not previously been applied to the analysis of pool-based AL. These properties yield a new theoretical perspective on pool-based AL.
High-Performance Moment-Encoded Lattice Boltzmann Method with Stability-Guided Quantization
Authors: Yixin Chen, Wei Li, David I. W. Levin, Kui Wu
2026-02-05
In this work, we present a memory-efficient, high-performance GPU framework for moment-based lattice Boltzmann methods (LBM) with fluid-solid coupling. We introduce a split-kernel scheme that decouples fluid updates from solid boundary handling, substantially reducing warp divergence and improving utilization on GPUs. We further perform the first von Neumann stability analysis of the high-order moment-encoded LBM (HOME-LBM) formulation, characterizing its spectral behavior and deriving stability bounds for individual moment components. These theoretical insights directly guide a practical 16-bit moment quantization without compromising numerical stability. Our framework achieves up to 6x speedup and reduces GPU memory footprint by up to 50% in fluid-only scenarios and 25% in scenes with complex solid boundaries compared to the state-of-the-art LBM solver, while preserving physical fidelity across a range of large-scale benchmarks and real-time demonstrations. The proposed approach enables scalable, stable, and high-resolution LBM simulation on a single GPU, bridging theoretical stability analysis with practical GPU algorithm design.
Hybrid Gated Flow (HGF) Stabilizing 1.58-bit LLMs via Selective Low-Rank Correction
Authors: David Alejandro Trejo Pizzo
2026-02-05
The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall" -- a hardware limitation where memory bandwidth, not compute, becomes the bottleneck. Recent 1.58-bit quantization techniques (e.g., BitNet b1.58) dramatically reduce memory footprint but typically incur a perplexity degradation of 20-25% compared to FP16 baselines. In this work, we introduce Hybrid Gated Flow (HGF), a dual-stream architecture that couples a 1.58-bit ternary backbone with a learnable, low-rank FP16 correction path controlled by adaptive gates.
Through extensive experiments on the TinyStories dataset across two training regimes (2500 and 3500 steps), we demonstrate that HGF 5.4 achieves a validation loss of 0.9306 compared to BitNet's 1.0294, recovering approximately 55% of the quality gap between pure ternary quantization and the FP16 baseline (0.8490). This recovery is achieved with only ~12-15% memory overhead beyond the ternary backbone.
Furthermore, we provide empirical evidence for an emergent phenomenon: quantization as structural regularization. While a full-precision differential attention baseline (Diff_Only) exhibited training instability, with validation loss exceeding 1.68, the ternary-anchored HGF maintained robust convergence throughout training. Finally, we report preliminary results extending this architecture to 1.2B and 3B parameter models trained on SlimPajama and FineWeb-Edu. These larger-scale experiments confirm that the architectural stability and quality recovery observed in small-scale proxies scale to production-grade language modeling regimes.
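The dual-stream idea above can be sketched in a few lines of NumPy: a BitNet-style ternary projection plus a gated low-rank FP16 correction. The layer sizes, rank, and gate value below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def ternary_quantize(W):
    """BitNet-b1.58-style ternarization: scale by the mean |w|,
    then round each weight to {-1, 0, +1}."""
    scale = np.abs(W).mean()
    return np.clip(np.round(W / (scale + 1e-8)), -1, 1), scale

d_in, d_out, rank = 64, 64, 4          # toy sizes (assumed, not from the paper)
W = rng.normal(size=(d_in, d_out))
W_t, s = ternary_quantize(W)

# Low-rank FP16 correction path with a per-layer gate g
# (learned adaptively in HGF; fixed here for illustration).
A = rng.normal(size=(d_in, rank)).astype(np.float16)
B = rng.normal(size=(rank, d_out)).astype(np.float16)
g = 0.5

def hgf_forward(x):
    backbone = x @ W_t * s                       # 1.58-bit ternary stream
    correction = x.astype(np.float16) @ A @ B    # FP16 low-rank stream
    return backbone + g * correction.astype(np.float64)

y = hgf_forward(rng.normal(size=(1, d_in)))
```

The memory overhead of the correction path scales with `rank * (d_in + d_out)`, which is how the paper's ~12-15% figure stays small relative to the backbone.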
FedMosaic Federated Retrieval-Augmented Generation via Parametric Adapters
Authors: Zhilin Liang, Yuxiang Wang, Zimu Zhou, Hainan Zhang, Boyi Liu, Yongxin Tong
2026-02-05
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding generation in external knowledge to improve factuality and reduce hallucinations. Yet most deployments assume a centralized corpus, which is infeasible in privacy-aware domains where knowledge remains siloed. This motivates federated RAG (FedRAG), where a central LLM server collaborates with distributed silos without sharing raw documents. In-context RAG violates this requirement by transmitting verbatim documents, whereas parametric RAG encodes documents into lightweight adapters that merge with a frozen LLM at inference, avoiding raw-text exchange. We adopt the parametric approach but face two challenges unique to FedRAG: high storage and communication costs from per-document adapters, and destructive aggregation caused by indiscriminately merging multiple adapters. We present FedMosaic, the first federated RAG framework built on parametric adapters. FedMosaic clusters semantically related documents into multi-document adapters with document-specific masks to reduce overhead while preserving specificity, and performs selective adapter aggregation to combine only relevance-aligned, non-conflicting adapters. Experiments show that FedMosaic achieves an average 10.9% higher accuracy than state-of-the-art methods across four categories, while lowering storage costs by 78.8% to 86.3% and communication costs by 91.4%, and never sharing raw documents.
Extreme Weather Nowcasting via Local Precipitation Pattern Prediction
Authors: Changhoon Song, Teng Yuan Chang, Youngjoon Hong
2026-02-05
Accurate forecasting of extreme weather events such as heavy rainfall or storms is critical for risk management and disaster mitigation. Although high-resolution radar observations have spurred extensive research on nowcasting models, precipitation nowcasting remains particularly challenging due to pronounced spatial locality, intricate fine-scale rainfall structures, and variability in forecasting horizons. While recent diffusion-based generative ensembles show promising results, they are computationally expensive and unsuitable for real-time applications. In contrast, deterministic models are computationally efficient but remain biased toward normal rainfall. Furthermore, the benchmark datasets commonly used in prior studies are themselves skewed--either dominated by ordinary rainfall events or restricted to extreme rainfall episodes--thereby hindering general applicability in real-world settings. In this paper, we propose exPreCast, an efficient deterministic framework for generating finely detailed radar forecasts, and introduce a newly constructed balanced radar dataset from the Korea Meteorological Administration (KMA), which encompasses both ordinary precipitation and extreme events. Our model integrates local spatiotemporal attention, a texture-preserving cubic dual upsampling decoder, and a temporal extractor to flexibly adjust forecasting horizons. Experiments on established benchmarks (SEVIR and MeteoNet) as well as on the balanced KMA dataset demonstrate that our approach achieves state-of-the-art performance, delivering accurate and reliable nowcasts across both normal and extreme rainfall regimes.
Diffusion-aided Extreme Video Compression with Lightweight Semantics Guidance
Authors: Maojun Zhang, Haotian Wu, Richeng Jin, Deniz Gunduz, Krystian Mikolajczyk
2026-02-05
Modern video codecs and learning-based approaches struggle with semantic reconstruction at extremely low bit-rates due to their reliance on low-level spatiotemporal redundancies. Generative models, especially diffusion models, offer a new paradigm for video compression by leveraging high-level semantic understanding and powerful visual synthesis. This paper proposes a video compression framework that integrates generative priors to drastically reduce bit-rate while maintaining reconstruction fidelity. Specifically, our method compresses high-level semantic representations of the video, then uses a conditional diffusion model to reconstruct frames from these semantics. To further improve compression, we characterize motion information with global camera trajectories and foreground segmentation: background motion is compactly represented by camera pose parameters, while foreground dynamics are captured by sparse segmentation masks. This significantly boosts compression efficiency, enabling decent video reconstruction at extremely low bit-rates.
Double-P Hierarchical Top-P Sparse Attention for Long-Context LLMs
Authors: Wentao Ni, Kangqi Zhang, Zhongming Yu, Oren Nelson, Mingu Lee, Hong Cai, Fatih Porikli, Jongryool Kim, Zhijian Liu, Jishen Zhao
2026-02-05
As long-context inference becomes central to large language models (LLMs), attention over growing key-value (KV) caches emerges as a dominant decoding bottleneck, motivating sparse attention for scalable inference. Fixed-budget top-k sparse attention cannot adapt to heterogeneous attention distributions across heads and layers, whereas top-p sparse attention directly preserves attention mass and provides stronger accuracy guarantees. Existing top-p methods, however, fail to jointly optimize top-p accuracy, selection overhead, and sparse attention cost, which limits their overall efficiency. We present Double-P, a hierarchical sparse attention framework that optimizes all three stages. Double-P first performs coarse-grained top-p estimation at the cluster level using size-weighted centroids, then adaptively refines computation through a second top-p stage that allocates token-level attention only when needed. Across long-context benchmarks, Double-P consistently achieves near-zero accuracy drop, reducing attention computation overhead by up to 1.8x and delivering up to 1.3x end-to-end decoding speedup over state-of-the-art fixed-budget sparse attention methods.
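The two-stage selection can be illustrated with NumPy for a single query and head. The round-robin cluster assignment stands in for the clustering Double-P builds on, and all sizes are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, n_clusters, p = 32, 256, 8, 0.9

q = rng.normal(size=d)                       # one query vector
keys = rng.normal(size=(n_tokens, d))        # cached keys
labels = np.arange(n_tokens) % n_clusters    # stand-in cluster assignment
centroids = np.stack([keys[labels == c].mean(0) for c in range(n_clusters)])
sizes = np.bincount(labels, minlength=n_clusters)

def top_p_indices(logits, p):
    """Smallest set of indices whose softmax mass reaches p."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    order = np.argsort(-w)
    k = np.searchsorted(np.cumsum(w[order]), p) + 1
    return order[:k]

# Stage 1: coarse top-p over size-weighted centroid scores.
keep_clusters = top_p_indices(centroids @ q + np.log(sizes), p)

# Stage 2: token-level top-p restricted to the surviving clusters.
cand = np.flatnonzero(np.isin(labels, keep_clusters))
selected = cand[top_p_indices(keys[cand] @ q, p)]
```

The `log(sizes)` term makes a centroid's score approximate the total softmax mass of its cluster, which is why size-weighting matters in the coarse stage.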
Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky
Authors: Hsuan-Yu Chou, Wajiha Naveed, Shuyan Zhou, Xiaowei Yang
2026-02-05
As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be effectively utilized for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to outperform traditional machine learning models zero-shot, the out-of-the-box capability of open-weight LLMs remains an open question.
Motivated by recent developments in reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by the Bluesky Moderation Service, and annotations by two authors, we find a considerable degree of overlap between the sensitivity (81%--97%) and specificity (91%--100%) of the open-weight LLMs and those (72%--98% and 93%--99%) of the proprietary ones. Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we identify inter-rater agreement across human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
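For reference, the two metrics quoted above come straight from a confusion matrix; the counts here are invented for illustration, not taken from the study.

```python
# Hypothetical confusion counts for one moderator model on labeled posts.
tp, fn = 90, 10    # harmful posts: flagged vs. missed
tn, fp = 950, 50   # benign posts: passed vs. wrongly flagged

sensitivity = tp / (tp + fn)   # fraction of harmful content caught
specificity = tn / (tn + fp)   # fraction of benign content left alone
print(f"sensitivity={sensitivity:.0%}, specificity={specificity:.0%}")
# prints: sensitivity=90%, specificity=95%
```

The paper's observation that specificity exceeds sensitivity for rudeness means models under-flag rude posts (more `fn`) while rarely flagging benign ones (few `fp`).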
TIDE Temporal Incremental Draft Engine for Self-Improving LLM Inference
Authors: Jiyoung Park, Hankyu Jang, Changseok Song, Wookeun Jung
2026-02-05
Speculative decoding can substantially accelerate LLM inference, but realizing its benefits in practice is challenging due to evolving workloads and system-level constraints. We present TIDE (Temporal Incremental Draft Engine), a serving-engine-native framework that integrates online draft adaptation directly into high-performance LLM inference systems. TIDE reuses target-model hidden states generated during inference as training signals, enabling zero-overhead draft adaptation without reloading the target model, and employs adaptive runtime control to activate speculation and training only when beneficial. TIDE exploits heterogeneous clusters by mapping decoupled inference and training to appropriate GPU classes. Across diverse real-world workloads, TIDE achieves up to 1.15x throughput improvement over static speculative decoding while reducing draft training time by 1.67x compared to approaches that recompute training signals.
ARGaze Autoregressive Transformers for Online Egocentric Gaze Estimation
Authors: Jia Li, Wenjie Zhao, Shijian Deng, Bolin Lai, Yuheng Wu, Ruijia Chen, Jon E. Froehlich, Yuhang Zhao, Yapeng Tian
2026-02-04
Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames, a task essential for augmented reality and assistive technologies. Unlike third-person gaze estimation, this setting lacks explicit head or eye signals, requiring models to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. We observe that gaze exhibits strong temporal continuity during goal-directed activities: knowing where a person looked recently provides a powerful prior for predicting where they look next. Inspired by vision-conditioned autoregressive decoding in vision-language models, we propose ARGaze, which reformulates gaze estimation as sequential prediction: at each timestep, a decoder predicts the current gaze by conditioning on (i) current visual features and (ii) a fixed-length Gaze Context Window of recent gaze target estimates. This design enforces causality and enables bounded-resource streaming inference. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation, with extensive ablations validating that autoregressive modeling with bounded gaze history is critical for robust prediction. We will release our source code and pre-trained models.
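The bounded-memory streaming loop implied by a fixed-length gaze context window can be sketched as follows. The window length, feature vector, and blending predictor are placeholders, not the paper's transformer decoder.

```python
import numpy as np
from collections import deque

K = 8                      # context window length (assumed, not from the paper)
window = deque(maxlen=K)   # recent 2-D gaze estimates; memory stays bounded

def predict_gaze(frame_feat, window):
    """Toy stand-in for the autoregressive decoder: blend a recent-gaze
    prior with a feature-derived term. The real model conditions a
    transformer decoder on both inputs."""
    prior = np.mean(window, axis=0) if window else np.array([0.5, 0.5])
    return np.clip(0.9 * prior + 0.1 * frame_feat[:2], 0.0, 1.0)

for t in range(30):                     # streaming over incoming frames
    feat = np.full(4, 0.5)              # placeholder visual features
    gaze = predict_gaze(feat, window)
    window.append(gaze)                 # causality: only past estimates kept
```

Because `deque(maxlen=K)` evicts the oldest estimate on append, per-frame cost and memory are constant regardless of video length, which is the point of bounded-resource streaming inference.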
SocialVeil Probing Social Intelligence of Language Agents under Communication Barriers
Authors: Keyang Xuan, Pengda Wang, Chongrui Ye, Haofei Yu, Tal August, Jiaxuan You
2026-02-04
Large language models (LLMs) are increasingly evaluated in interactive environments to test their social intelligence. However, existing benchmarks often assume idealized communication between agents, limiting our ability to diagnose whether LLMs can maintain and repair interactions in more realistic, imperfect settings. To close this gap, we present \textsc{SocialVeil}, a social learning environment that simulates social interaction under cognitive-difference-induced communication barriers. Grounded in a systematic literature review of communication challenges in human interaction, \textsc{SocialVeil} introduces three representative types of such disruption: \emph{semantic vagueness}, \emph{sociocultural mismatch}, and \emph{emotional interference}. We also introduce two barrier-aware evaluation metrics, \emph{unresolved confusion} and \emph{mutual understanding}, to evaluate interaction quality under impaired communication. Experiments across 720 scenarios and four frontier LLMs show that barriers consistently impair performance, with mutual understanding reduced by over 45\% on average and confusion elevated by nearly 50\%. Human evaluations validate the fidelity of the simulated barriers (ICC 0.78, Pearson r 0.80). We further demonstrate that adaptation strategies (Repair Instruction and Interactive Learning) have only a modest effect, remaining far from barrier-free performance. This work takes a step toward bringing social interaction environments closer to real-world communication, opening opportunities for exploring the social intelligence of LLM agents.
Physics-Informed Diffusion Models for Vehicle Speed Trajectory Generation
Authors: Vadim Sokolov, Farnaz Behnia, Dominik Karbowski
2026-02-04
Synthetic vehicle speed trajectory generation is essential for evaluating vehicle control algorithms and connected vehicle technologies. Traditional Markov chain approaches suffer from discretization artifacts and limited expressiveness. This paper proposes a physics-informed diffusion framework for conditional micro-trip synthesis, combining a dual-channel speed-acceleration representation with soft physics constraints that resolve the optimization conflicts inherent to hard-constraint formulations. We compare a 1D U-Net architecture against a transformer-based Conditional Score-based Diffusion Imputation (CSDI) model using 6,367 GPS-derived micro-trips. CSDI achieves superior distribution matching (Wasserstein distance 0.30 for speed, 0.026 for acceleration), strong indistinguishability from real data (discriminative score 0.49), and validated utility for downstream energy assessment tasks. The methodology enables scalable generation of realistic driving profiles for intelligent transportation systems (ITS) applications without costly field data collection.
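As a reminder, the one-dimensional Wasserstein distance used for distribution matching reduces, for equal-size empirical samples, to the mean gap between sorted values. The sample values below are invented.

```python
import numpy as np

def wasserstein_1d(a, b):
    """W1 between two equal-size empirical samples: average absolute
    difference of the sorted values (the 1-D optimal transport cost)."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

# Shifting a sample by a constant moves it by exactly that distance.
speeds_real = np.array([0.0, 5.0, 10.0, 20.0])
speeds_fake = speeds_real + 0.3
d = wasserstein_1d(speeds_real, speeds_fake)   # approximately 0.3
```

For unequal sample sizes one would compare the empirical quantile functions instead (e.g. `scipy.stats.wasserstein_distance`), but the sorted-sample form is enough to interpret the reported 0.30 and 0.026 figures.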
Protein Autoregressive Modeling via Multiscale Structure Generation
Authors: Yanru Qu, Cheng-Yen Hsieh, Zaixiang Zheng, Ge Liu, Quanquan Gu
2026-02-04
We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Exploiting the hierarchical nature of proteins, PAR generates structures in a manner that mimics sculpting a statue, forming a coarse topology and then refining structural details across scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; and (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Moreover, autoregressive models suffer from exposure bias, caused by the mismatch between the training and generation procedures, which substantially degrades structure generation quality. We effectively alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without requiring fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions, produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.
Adaptive estimation of Sobolev-type energy functionals on the sphere
Authors: Claudio Durastanti
2026-02-04
We study the estimation of quadratic Sobolev-type integral functionals of an unknown density on the unit sphere. The functional is defined through fractional powers of the Laplace--Beltrami operator and provides a global measure of smoothness and spectral energy. Our approach relies on spherical needlet frames, which yield a localized multiscale decomposition while preserving tight frame properties in the natural square-integrable function space on the sphere.
We construct unbiased estimators of suitably truncated versions of the functional and derive sharp oracle risk bounds through an explicit bias--variance analysis. When the smoothness of the density is unknown, we propose a Lepski-type data-driven selection of the resolution level. The resulting adaptive estimator achieves minimax-optimal rates over Sobolev classes, without resorting to nonlinear or thresholding-based methods.
OmniSIFT Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
Authors: Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, Yuanxing Zhang, Jiaheng Liu, Qiang Liu, Pengfei Wan, Liang Wang
2026-02-04
Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video compression module that removes video redundancy arising from both intra-frame structure and inter-frame correlation, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.
From independent patches to coordinated attention Controlling information flow in vision transformers
Authors: Kieran A. Murphy
2026-02-04
We make the information transmitted by attention an explicit, measurable quantity in vision transformers. By inserting variational information bottlenecks on all attention-mediated writes to the residual stream -- without other architectural changes -- we train models with an explicit information cost and obtain a controllable spectrum from independent patch processing to fully expressive global attention. On ImageNet-100, we characterize how classification behavior and information routing evolve across this spectrum, and provide initial insights into how global visual representations emerge from local patch processing by analyzing the first attention heads that transmit information. By biasing learning toward solutions with constrained internal communication, our approach yields models that are more tractable for mechanistic analysis and more amenable to control.
Less Finetuning, Better Retrieval Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging
Authors: Sameh Khattab, Jean-Philippe Corbeil, Osman Alperen Koraş, Amin Dada, Julian Friedrich, François Beaulieu, Paul Vozila, Jens Kleesiek
2026-02-04
Retrieval-augmented generation (RAG) has become the backbone of grounding Large Language Models (LLMs), improving knowledge updates and reducing hallucinations. Recently, LLM-based retriever models have shown state-of-the-art performance in RAG applications. However, several technical aspects of how to adapt general-purpose LLMs into effective domain-specific retrievers remain underexplored, especially in specialized domains such as biomedicine. We present Synthesize-Train-Merge (STM), a modular framework that enhances decoder-only LLMs with synthetic hard negatives, retrieval prompt optimization, and model merging. Experiments on a subset of 12 medical and general tasks from the MTEB benchmark show STM boosts task-specific experts by up to 23.5\% (average 7.5\%) and produces merged models that outperform both single experts and strong baselines without extensive pretraining. Our results demonstrate a scalable, efficient path for turning general LLMs into high-performing, domain-specialized retrievers that preserve general-domain capabilities while excelling on specialized tasks.
Addressing Corpus Knowledge Poisoning Attacks on RAG Using Sparse Attention
Authors: Sagie Dekel, Moshe Tennenholtz, Oren Kurland
2026-02-04
Retrieval Augmented Generation (RAG) is a highly effective paradigm for keeping LLM-based responses up-to-date and reducing the likelihood of hallucinations. Yet, RAG was recently shown to be quite vulnerable to corpus knowledge poisoning: an attacker injects misleading documents into the corpus to steer an LLM's output toward an undesired response. We argue that the standard causal attention mechanism in LLMs enables harmful cross-document interactions, specifically in cases of poisoning attacks. Accordingly, we introduce a novel defense approach for RAG: Sparse Document Attention RAG (SDAG). This is a block-sparse attention mechanism that disallows cross-attention between retrieved documents. SDAG requires a minimal inference-time change to the attention mask; no fine-tuning or additional architectural changes are needed. We present an empirical evaluation of LLM-based question answering (QA) with a variety of attack strategies on RAG. We show that our SDAG method substantially outperforms the standard causal attention mechanism in terms of attack success rate. We further demonstrate the clear merits of integrating SDAG with state-of-the-art RAG defense methods: the integration yields performance that is statistically significantly better than the state of the art.
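The defense amounts to a single change in the attention mask. A minimal NumPy sketch of a mask that keeps causal attention but blocks one retrieved document from attending to another (the token layout and helper name are assumptions, not the paper's code):

```python
import numpy as np

def sdag_mask(doc_lens, query_len):
    """Causal attention mask in which retrieved documents cannot attend
    to each other. Token order: [doc_1 | doc_2 | ... | query].
    Returns a boolean matrix; True means attention is allowed."""
    n = sum(doc_lens) + query_len
    allowed = np.tril(np.ones((n, n), dtype=bool))   # start from causal mask
    spans, start = [], 0
    for length in doc_lens:
        spans.append((start, start + length))
        start += length
    # Zero out cross-document blocks (doc i attending to doc j, i != j).
    for i, (si, ei) in enumerate(spans):
        for j, (sj, ej) in enumerate(spans):
            if i != j:
                allowed[si:ei, sj:ej] = False
    return allowed

m = sdag_mask([3, 2], query_len=2)
# Query tokens (last two rows) still see every document and each other causally;
# the two documents are mutually invisible.
```

Since the change is confined to the mask, it composes with any attention kernel that accepts an arbitrary boolean mask, which is why no fine-tuning is required.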
Towards Understanding and Avoiding Limitations of Convolutions on Graphs
Authors: Andreas Roth
2026-02-04
While message-passing neural networks (MPNNs) have shown promising results, their real-world impact remains limited. Although various limitations have been identified, their theoretical foundations remain poorly understood, leading to fragmented research efforts. In this thesis, we provide an in-depth theoretical analysis and identify several key properties limiting the performance of MPNNs. Building on these findings, we propose several frameworks that address these shortcomings. We identify two properties exhibited by many MPNNs: shared component amplification (SCA), where each message-passing iteration amplifies the same components across all feature channels, and component dominance (CD), where a single component gets increasingly amplified as more message-passing steps are applied. These properties lead to the observable phenomenon of rank collapse of node representations, which generalizes the established over-smoothing phenomenon. By generalizing and decomposing over-smoothing, we enable a deeper understanding of MPNNs, more targeted solutions, and more precise communication within the field. To avoid SCA, we show that utilizing multiple computational graphs or edge relations is necessary. Our multi-relational split (MRS) framework transforms any existing MPNN into one that leverages multiple edge relations. Additionally, we introduce the spectral graph convolution for multiple feature channels (MIMO-GC), which naturally uses multiple computational graphs. A localized variant, LMGC, approximates the MIMO-GC while inheriting its beneficial properties. To address CD, we demonstrate a close connection between MPNNs and the PageRank algorithm. Based on personalized PageRank, we propose a variant of MPNNs that allows for infinitely many message-passing iterations while preserving the initial node features. Collectively, these results deepen the theoretical understanding of MPNNs.
PIO-FVLM Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective
Authors: Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Ying Li, Rong Xiao, Chunhua Shen
2026-02-04
Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed from inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives: it recasts visual token reduction as preserving output-result invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered under the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed PIO-FVLM is training-free and compatible with FlashAttention, making it friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.67x prefill speedup, 2.11x inference speedup, 6.22x lower FLOPs, and 6.05x reduced KV cache overhead. Our code is available at https://github.com/ocy1/PIO-FVLM.
LEAD Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation
Authors: Ruixiao Yang, Yuanhe Tian, Xu Yang, Huiqi Li, Yan Song
2026-02-04
Radiology Report Generation (RRG) aims to produce accurate and coherent diagnostic reports from medical images. Although large vision-language models (LVLMs) improve report fluency and accuracy, they exhibit hallucinations, generating plausible yet image-ungrounded pathological details. Existing methods primarily rely on external knowledge guidance to facilitate alignment between generated text and visual information. However, these approaches often ignore the inherent priors and vision-language alignment biases of pretrained models, and they lack robustness due to reliance on constructed guidance. In this paper, we propose Layer-wise Expert-aligned Decoding (LEAD), a novel method that inherently modifies the LVLM decoding trajectory. A multiple-experts module is designed to extract distinct pathological features, which are integrated into each decoder layer via a gating mechanism. This layer-wise architecture enables the LVLM to consult expert features at every inference step via a learned gating function, thereby dynamically rectifying decoding biases and steering generation toward factual consistency. Experiments conducted on multiple public datasets demonstrate that LEAD yields effective improvements in clinical accuracy metrics and mitigates hallucinations while preserving high generation quality.
Disentangling meaning from language in LLM-based machine translation
Authors: Théo Lasnier, Armel Zebaze, Djamé Seddah, Rachel Bawden, Benoît Sagot
2026-02-04
Mechanistic Interpretability (MI) seeks to explain how neural networks implement their capabilities, but the scale of Large Language Models (LLMs) has limited prior MI work in Machine Translation (MT) to word-level analyses. We study sentence-level MT from a mechanistic perspective by analyzing attention heads to understand how LLMs internally encode and distribute translation functions. We decompose MT into two subtasks: producing text in the target language (i.e., target language identification) and preserving the input sentence's meaning (i.e., sentence equivalence). Across three families of open-source models and 20 translation directions, we find that distinct, sparse sets of attention heads specialize in each subtask. Based on this insight, we construct subtask-specific steering vectors and show that modifying just 1% of the relevant heads enables instruction-free MT performance comparable to instruction-based prompting, while ablating these heads selectively disrupts their corresponding translation functions.
Domain decomposition methods and preconditioning strategies using generalized locally Toeplitz tools proposals, analysis, and numerical validation
Authors: Abdessadek Rifqui, Ahmed Ratnani, Stefano Serra-Capizzano
2026-02-04
In the current work, we present a spectral analysis of the additive and multiplicative Schwarz methods within the framework of domain decomposition techniques, investigating the spectral properties of the classical Schwarz preconditioning matrix-sequences, with emphasis on their convergence behavior and on the effect of transmission operators. In particular, after a general presentation of the various options, we focus on restricted variants of the Schwarz methods aimed at improving parallel efficiency while preserving their convergence features. To rigorously describe and analyze the convergence behavior, we employ the theory of generalized locally Toeplitz (GLT) sequences, which provides a robust framework for studying the asymptotic spectral distribution of the discretized operators arising from Schwarz iterations. By associating each operator sequence with the appropriate GLT symbol, we derive explicit expressions for the GLT symbols of the convergence factors for both additive and multiplicative Schwarz methods. The GLT-based spectral approach offers a unified and systematic understanding of how the spectrum evolves with mesh refinement and matrix size (in the algebraic case). Our analysis not only deepens the theoretical understanding of classical Schwarz methods, but also establishes a foundation for examining future restricted or hybrid Schwarz variants using symbolic spectral tools. These results enable the prediction of the remarkable efficiency of block Jacobi/Gauss--Seidel and block additive/multiplicative Schwarz preconditioners for GLT sequences, as further illustrated through a wide range of numerical experiments.
Harmonia Algorithm-Hardware Co-Design for Memory- and Compute-Efficient BFP-based LLM Inference
Authors: Xinyu Wang, Jieyu Li, Yanan Sun, Weifeng He
2026-02-04
Large Language Models (LLMs) are powerful but incur high memory and computation costs. Quantization is an effective solution, with INT weights and FP activations being widely adopted to preserve accuracy. Prior works further reduce FP overhead by using block floating point (BFP) activations in linear layers, but fail to extend BFP to attention layers due to severe accuracy degradation, limiting overall efficiency. To address this challenge, we propose Harmonia, an algorithm-hardware co-design framework that enables all-layer BFP activations with a configurable hardware architecture. First, we systematically explore BFP configurations to achieve a better trade-off between accuracy and activation compression across all layers. Second, to reduce KV-cache storage and computation in attention layers, we introduce an asymmetric bit-allocation strategy combined with a hybrid offline-online outlier smoothing technique. This allows aggressive KV-cache quantization from FP16 to 4-bit-mantissa BFP with only 0.3% average accuracy loss. Third, to fully exploit all-layer BFP activations, we design dedicated hardware components, including a reconfigurable PE supporting mixed data formats (BFP-INT and BFP-BFP), a real-time FP16-to-BFP converter, and a tiling-aware dataflow to reduce memory traffic. We evaluate Harmonia on GEMM operations in both linear and attention layers across eight widely used LLMs. Compared with prior works, Harmonia achieves 3.84x (up to 5.05x) higher area efficiency, 2.03x (up to 3.90x) better energy efficiency, and 3.08x (up to 4.62x) speedup on average.
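As background for the BFP format this abstract relies on, here is a minimal, self-contained sketch of block-floating-point quantization: each block of values shares one power-of-two scale (the "exponent") and stores short signed mantissas. The function name and rounding scheme are our illustration, not Harmonia's actual implementation.

```python
import numpy as np

def to_bfp(x, block_size=16, mantissa_bits=4):
    """Illustrative block-floating-point quantizer: each block of
    `block_size` values shares one power-of-two scale, and values are
    rounded to signed `mantissa_bits`-bit mantissas."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-x.size) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    qmax = 2 ** (mantissa_bits - 1) - 1
    max_mag = np.abs(blocks).max(axis=1, keepdims=True)
    # Shared power-of-two scale chosen so the block maximum fits in range.
    scale = 2.0 ** np.ceil(np.log2(np.maximum(max_mag, 1e-300) / qmax))
    mant = np.clip(np.round(blocks / scale), -qmax - 1, qmax)
    return (mant * scale).reshape(-1)[: x.size]
```

The key point the abstract builds on is visible here: within a block, quantization error is bounded by half the shared step size, so blocks containing outliers lose precision, which is why outlier smoothing matters for attention activations.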
SalFormer360 a transformer-based saliency estimation model for 360-degree videos
Authors: Mahmoud Z. A. Wahba, Francesco Barbato, Sara Baldoni, Federica Battisti
2026-02-04
Saliency estimation has received growing attention in recent years due to its importance in a wide range of applications. In the context of 360-degree video, it has been particularly valuable for tasks such as viewport prediction and immersive content optimization. In this paper, we propose SalFormer360, a novel saliency estimation model for 360-degree videos built on a transformer-based architecture. Our approach combines an existing encoder architecture, SegFormer, with a custom decoder. The SegFormer model was originally developed for 2D segmentation tasks and has been fine-tuned to adapt it to 360-degree content. To further enhance prediction accuracy, we incorporate a Viewing Center Bias to reflect user attention in 360-degree environments. Extensive experiments on the three largest benchmark datasets for saliency estimation demonstrate that SalFormer360 outperforms existing state-of-the-art methods. In terms of Pearson Correlation Coefficient, our model achieves 8.4% higher performance on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking compared to the previous state of the art.
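A viewing-center bias of the kind mentioned above can be illustrated with a small sketch: a latitude-dependent Gaussian prior over an equirectangular frame (viewers tend to look near the horizontal center line), multiplied into the raw saliency map. The function names and the exact prior shape are our assumptions; the paper's formulation may differ.

```python
import numpy as np

def center_bias(h, w, sigma_frac=0.2):
    """Hypothetical viewing-center bias for an equirectangular frame:
    a Gaussian prior over latitude, peaked at the equator."""
    rows = np.arange(h)
    lat = (rows + 0.5) / h - 0.5            # latitude in [-0.5, 0.5)
    prior = np.exp(-0.5 * (lat / sigma_frac) ** 2)
    return np.tile(prior[:, None], (1, w))  # same bias along longitude

def apply_bias(saliency, bias):
    """Modulate a raw saliency map by the prior and renormalize."""
    out = saliency * bias
    return out / out.sum()
```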
Nix and Fix Targeting 1000x Compression of 3D Gaussian Splatting with Diffusion Models
Authors: Cem Eteke, Enzo Tartaglione
2026-02-04
3D Gaussian Splatting (3DGS) revolutionized novel view rendering. Instead of inferring from dense spatial points, as implicit representations do, 3DGS uses Gaussians. This enables real-time performance but increases space requirements, hindering applications such as immersive streaming. 3DGS compression emerged as a field aimed at alleviating this issue. While impressive progress has been made, at low rates compression introduces artifacts that degrade visual quality significantly. We introduce NiFi, a method for extreme 3DGS compression through restoration via artifact-aware, diffusion-based one-step distillation. We show that our method achieves state-of-the-art perceptual quality at extremely low rates, down to 0.1 MB, and towards 1000x rate improvement over 3DGS at comparable perceptual performance. The code will be open-sourced upon acceptance.
LycheeDecode Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
Authors: Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, Min Zhang
2026-02-04
The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of streaming heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, even the full-attention baseline. Crucially, this is accomplished with up to a 2.7x speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
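The retrieval/streaming split described above can be sketched minimally: a retrieval head ranks past tokens by attention mass and keeps the top-k indices, and streaming heads restrict their attention to that shared subset. This is an illustrative reading with hypothetical names, not LycheeDecode's actual HardKuma-based selector.

```python
import numpy as np

def select_crucial_tokens(attn_scores, k):
    """A retrieval head keeps the indices of the k highest-scoring
    past tokens (illustrative stand-in for the learned selector)."""
    k = min(k, attn_scores.shape[-1])
    idx = np.argpartition(attn_scores, -k)[-k:]
    return np.sort(idx)

def streaming_head_attend(q, K, V, crucial_idx):
    """A streaming head computes attention restricted to the shared
    crucial-token subset instead of the full KV cache."""
    Ks, Vs = K[crucial_idx], V[crucial_idx]
    logits = Ks @ q / np.sqrt(q.size)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ Vs
```

The memory saving comes from streaming heads touching only `len(crucial_idx)` cache rows rather than the full context.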
Incongruity-sensitive access to highly compressed strings
Authors: Ferdinando Cicalese, Zsuzsanna Lipták, Travis Gagie, Gonzalo Navarro, Nicola Prezza, Cristian Urbina
2026-02-04
Random access to highly compressed strings -- represented by straight-line programs or Lempel-Ziv parses, for example -- is a well-studied topic. Random access to such strings in strongly sublogarithmic time is impossible in the worst case, but previous authors have shown how to support faster access to specific characters and their neighbourhoods. In this paper we explore whether, since better compression can impede access, we can support faster access to relatively incompressible substrings of highly compressed strings. We first show how, given a run-length compressed straight-line program (RLSLP) or a block tree, we can build a data structure that supports access to any character in time logarithmic in the length of the longest repeated substring containing that character. That is, the more incongruous a character is with respect to the characters around it in a certain sense, the faster we can support access to it. We then prove a similar but more powerful and sophisticated result for parsings in which phrases' sources do not overlap much larger phrases, with the query time depending also on the number of phrases we must copy from their sources to obtain the queried character.
C-Δθ Circuit-Restricted Weight Arithmetic for Selective Refusal
Authors: Aditya Kasliwal, Pratinav Seth, Vinay Kumar Sankarapu
2026-02-04
Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ: Circuit-Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update ΔθC supported only on that circuit (typically <5% of parameters). Applying ΔθC yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
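The weight-arithmetic step lends itself to a compact sketch: derive a boolean mask covering the top few percent of parameters by attribution magnitude (standing in for EAP-IG localization), then apply the update only inside that mask. The names and thresholding rule are our assumptions.

```python
import numpy as np

def circuit_mask_from_scores(attr_scores, frac=0.05):
    """Keep the top `frac` of parameters by attribution magnitude
    (a generic stand-in for EAP-IG-style circuit localization)."""
    k = max(1, int(frac * attr_scores.size))
    thresh = np.sort(np.abs(attr_scores), axis=None)[-k]
    return np.abs(attr_scores) >= thresh

def apply_circuit_restricted_update(theta, delta, circuit_mask):
    """Zero the update outside the circuit, so only circuit parameters
    change in the edited checkpoint."""
    assert theta.shape == delta.shape == circuit_mask.shape
    return theta + np.where(circuit_mask, delta, 0.0)
```

The resulting array is an ordinary parameter tensor, which is the point: the edit ships as a standard checkpoint with no runtime hooks.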
Model-Dowser Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models
Authors: Hyeontaek Hwang, Nguyen Dinh Son, Daeyoung Kim
2026-02-04
Fine-tuning Multimodal Large Language Models (MLLMs) on task-specific data is an effective way to improve performance on downstream applications. However, such adaptation often leads to a degradation in generalization on pretrained tasks, a phenomenon known as Catastrophic Forgetting. Existing methods that aim to mitigate this issue either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly with increasing model size. To address these limitations, we propose Model-Dowser, a novel sparse fine-tuning approach for MLLMs. Model-Dowser measures a principled importance score for each model parameter with respect to pretrained generalization (prior to downstream adaptation) by jointly considering weight magnitudes, input activations, and output sensitivities. During fine-tuning, Model-Dowser selectively preserves high-importance parameters and updates the remaining ones. Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, demonstrate that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.
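A minimal sketch of an importance score combining the three ingredients named in the abstract (weight magnitude, input activations, output sensitivities) might look as follows; the exact combination and the freezing policy in Model-Dowser may differ.

```python
import numpy as np

def dowser_importance(weight, input_act_norm, output_sens):
    """Hedged sketch: score each parameter by its magnitude times the
    norm of the input activation it multiplies times the sensitivity
    of the output it feeds."""
    # weight: (out, in); input_act_norm: (in,); output_sens: (out,)
    return np.abs(weight) * input_act_norm[None, :] * output_sens[:, None]

def protect_mask(importance, keep_frac=0.3):
    """Freeze (exclude from updates) the top `keep_frac` parameters."""
    k = max(1, int(keep_frac * importance.size))
    thresh = np.sort(importance, axis=None)[-k]
    return importance >= thresh
```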
Greedy-Gnorm A Gradient Matrix Norm-Based Alternative to Attention Entropy for Head Pruning
Authors: Yuxi Guo, Paul Sheridan
2026-02-04
Attention head pruning has emerged as an effective technique for transformer model compression, an increasingly important goal in the era of Green AI. However, existing pruning methods often rely on static importance scores, which fail to capture the evolving role of attention heads during iterative removal. We propose Greedy-Gradient norm (Greedy-Gnorm), a novel head pruning algorithm that dynamically recalculates head importance after each pruning step. Specifically, each head is scored by the elementwise product of the l2-norms of its Q/K/V gradient blocks, as estimated from a hold-out validation set and updated at every greedy iteration. This dynamic approach to scoring mitigates stale rankings and better reflects gradient-informed importance as pruning progresses. Extensive experiments on BERT, ALBERT, RoBERTa, and XLM-RoBERTa demonstrate that Greedy-Gnorm consistently preserves accuracy under substantial head removal, outperforming attention entropy. By effectively reducing model size while maintaining task performance, Greedy-Gnorm offers a promising step toward more energy-efficient transformer model deployment.
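The scoring rule is stated concretely enough to sketch: a head's score is the product of the l2-norms of its Q, K, and V gradient blocks, recomputed after every removal. The greedy loop below is our illustrative rendering, with a caller-supplied `recompute` standing in for re-estimating gradients on the hold-out set.

```python
import numpy as np

def gnorm_score(grad_q, grad_k, grad_v):
    """Greedy-Gnorm head score: the product of the l2-norms of the
    head's Q, K, and V gradient blocks."""
    return (np.linalg.norm(grad_q) * np.linalg.norm(grad_k)
            * np.linalg.norm(grad_v))

def greedy_prune(head_grads, n_prune, recompute):
    """Greedy loop: score all remaining heads, remove the lowest-scoring
    one, then re-estimate gradients (via `recompute`) and repeat."""
    kept = set(head_grads)
    for _ in range(n_prune):
        scores = {h: gnorm_score(*head_grads[h]) for h in kept}
        kept.remove(min(scores, key=scores.get))
        head_grads = recompute(kept)  # fresh gradients after each removal
    return kept
```

Recomputing inside the loop is the paper's key departure from static scoring: a head's gradients, and hence its rank, change once its neighbors are removed.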
DOS Dual-Flow Orthogonal Semantic IDs for Recommendation in Meituan
Authors: Junwei Yin, Senjie Kou, Changhao Li, Shuli Wang, Xue Wei, Yinqiu Huang, Yinhua Zhu, Haitao Wang, Xingxing Wang
2026-02-04
Semantic IDs serve as a key component in generative recommendation systems. They not only incorporate open-world knowledge from large language models (LLMs) but also compress the semantic space to reduce generation difficulty. However, existing methods suffer from two major limitations: (1) the lack of contextual awareness in generation tasks leads to a gap between the Semantic ID codebook space and the generation space, resulting in suboptimal recommendations; and (2) suboptimal quantization methods exacerbate semantic loss in LLMs. To address these issues, we propose the Dual-Flow Orthogonal Semantic IDs (DOS) method. Specifically, DOS employs a user-item dual-flow framework that leverages collaborative signals to align the Semantic ID codebook space with the generation space. Furthermore, we introduce an orthogonal residual quantization scheme that rotates the semantic space to an appropriate orientation, thereby maximizing semantic preservation. Extensive offline experiments and online A/B testing demonstrate the effectiveness of DOS. The proposed method has been successfully deployed in Meituan's mobile application, serving hundreds of millions of users.
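Orthogonal residual quantization as described (rotate the space, then quantize residuals level by level) can be sketched as follows; the rotation matrix, codebooks, and greedy nearest-codeword assignment are generic illustrations, not DOS's learned components.

```python
import numpy as np

def orthogonal_residual_quantize(x, codebooks, R):
    """Greedy residual quantization after an orthogonal rotation R:
    rotate embeddings, quantize the residual at each level against that
    level's codebook, then map the reconstruction back."""
    r = x @ R                                # rotate to the new basis
    ids, recon = [], np.zeros_like(r)
    for cb in codebooks:                     # one codebook per level
        diff = (r - recon)[:, None, :] - cb[None, :, :]
        idx = (diff ** 2).sum(-1).argmin(1)  # nearest codeword per row
        ids.append(idx)
        recon = recon + cb[idx]
    return ids, recon @ R.T                  # back to the original space
```

Because R is orthogonal, distances are preserved; the orientation of the rotation only changes how reconstruction error is distributed across levels, which is the degree of freedom DOS optimizes.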
Seg-ReSearch Segmentation with Interleaved Reasoning and External Search
Authors: Tianming Liang, Qirui Du, Jian-Fang Hu, Haichao Jiang, Zicheng Lin, Wei-Shi Zheng
2026-02-04
Segmentation based on language has been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined by the frozen internal knowledge of MLLMs, which limits their potential for real-world scenarios that involve up-to-date information or domain-specific concepts. In this work, we propose Seg-ReSearch, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg-ReSearch empowers segmentation systems to handle dynamic, open-world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step-wise supervision. For evaluation, we construct OK-VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK-VOS and two existing reasoning segmentation benchmarks demonstrate that Seg-ReSearch improves on state-of-the-art approaches by a substantial margin. Code and data will be released at https://github.com/iSEE-Laboratory/Seg-ReSearch.
RASA Routing-Aware Safety Alignment for Mixture-of-Experts Models
Authors: Jiacheng Liang, Yuhui Wang, Tanqiu Jiang, Ting Wang
2026-02-04
Mixture-of-Experts (MoE) language models introduce unique challenges for safety alignment due to their routing mechanisms, which can enable degenerate optimization behaviors under standard full-parameter fine-tuning. In our preliminary experiments, we observe that naively applying full-parameter safety fine-tuning to MoE models can reduce attack success rates through routing or expert dominance effects, rather than by directly repairing Safety-Critical Experts. To address this challenge, we propose RASA, a routing-aware expert-level alignment framework that explicitly repairs Safety-Critical Experts while preventing routing-based bypasses. RASA identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only these experts under fixed routing, and subsequently enforces routing consistency with safety-aligned contexts. Across two representative MoE architectures and a diverse set of jailbreak attacks, RASA achieves near-perfect robustness, strong cross-attack generalization, and substantially reduced over-refusal, while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA. Our results suggest that robust MoE safety alignment benefits from targeted expert repair rather than global parameter updates, offering a practical and architecture-preserving alternative to prior approaches.
The Stretto Execution Engine for LLM-Augmented Data Systems
Authors: Gabriele Sanmartino, Matthias Urban, Paolo Papotti, Carsten Binnig
2026-02-04
LLM-augmented data systems enable semantic querying over structured and unstructured data, but executing queries with LLM-powered operators introduces a fundamental runtime--accuracy trade-off. In this paper, we present Stretto, a new execution engine that provides end-to-end query guarantees while efficiently navigating this trade-off in a holistic manner. To this end, Stretto formulates query planning as a constrained optimization problem and uses a gradient-based optimizer to jointly select operator implementations and allocate error budgets across pipelines. Moreover, to enable fine-grained execution choices, Stretto introduces a novel idea for how LLM caching can be used to realize a spectrum of different physical operators that transform a sparse design space into a dense continuum of runtime--accuracy trade-offs. Experiments show that Stretto outperforms state-of-the-art systems while consistently meeting quality guarantees.
Depth-Wise Emergence of Prediction-Centric Geometry in Large Language Models
Authors: Shahar Haim, Daniel C McNamee
2026-02-04
We show that decoder-only large language models exhibit a depth-wise transition from context-processing to prediction-forming phases of computation, accompanied by a reorganization of representational geometry. Using a unified framework combining geometric analysis with mechanistic intervention, we demonstrate that late-layer representations implement a structured geometric code that enables selective causal control over token prediction. Specifically, the angular organization of the representation geometry parametrizes prediction distributional similarity, while representation norms encode context-specific information that does not determine the prediction. Together, these results provide a mechanistic-geometric account of the dynamics of transforming context into predictions in LLMs.
HoRD Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation
Authors: Puyue Wang, Jiawei Hu, Yan Gao, Junyan Wang, Yu Zhang, Gillian Dobbie, Tao Gu, Wafa Johal, Ting Dang, Hong Jia
2026-02-04
Humanoid robots can suffer significant performance drops under small changes in dynamics, task specifications, or environment setup. We propose HoRD, a two-stage learning framework for robust humanoid control under domain shift. First, we train a high-performance teacher policy via history-conditioned reinforcement learning, where the policy infers latent dynamics context from recent state--action trajectories to adapt online to diverse randomized dynamics. Second, we perform online distillation to transfer the teacher's robust control capabilities into a transformer-based student policy that operates on root-relative 3D joint keypoint trajectories. By combining history-conditioned adaptation with online distillation, HoRD enables a single policy to adapt zero-shot to unseen domains without per-domain retraining. Extensive experiments show HoRD outperforms strong baselines in robustness and transfer, especially under unseen domains and external perturbations. Code and project page are available at https://tonywang-0517.github.io/hord/.
LCUDiff Latent Capacity Upgrade Diffusion for Faithful Human Body Restoration
Authors: Jue Gong, Zihan Zhou, Jingkai Wang, Shu Li, Libo Liu, Jianliang Lan, Yulun Zhang
2026-02-04
Existing methods for restoring degraded human-centric images often struggle with insufficient fidelity, particularly in human body restoration (HBR). Recent diffusion-based restoration methods commonly adapt pre-trained text-to-image diffusion models, where the variational autoencoder (VAE) can significantly bottleneck restoration fidelity. We propose LCUDiff, a stable one-step framework that upgrades a pre-trained latent diffusion model from a 4-channel latent space to a 16-channel latent space. For VAE fine-tuning, channel splitting distillation (CSD) is used to keep the first four channels aligned with pre-trained priors while allocating the additional channels to effectively encode high-frequency details. We further design prior-preserving adaptation (PPA) to smoothly bridge the mismatch between 4-channel diffusion backbones and the higher-dimensional 16-channel latent. In addition, we propose a decoder router (DeR) for per-sample decoder routing using restoration-quality score annotations, which improves visual quality across diverse conditions. Experiments on synthetic and real-world datasets show competitive results with higher fidelity and fewer artifacts under mild degradations, while preserving one-step efficiency. The code and model will be available at https://github.com/gobunu/LCUDiff.
Enabling Real-Time Colonoscopic Polyp Segmentation on Commodity CPUs via Ultra-Lightweight Architecture
Authors: Weihao Gao, Zhuo Deng, Zheng Gong, Lan Ma
2026-02-04
Early detection of colorectal cancer hinges on real-time, accurate polyp identification and resection. Yet current high-precision segmentation models rely on GPUs, making them impractical to deploy in primary hospitals, mobile endoscopy units, or capsule robots. To bridge this gap, we present the UltraSeg family, operating in an extreme-compression regime (<0.3 M parameters). UltraSeg-108K (0.108 M parameters) is optimized for single-center data, while UltraSeg-130K (0.13 M parameters) generalizes to multi-center, multi-modal images. By jointly optimizing encoder-decoder widths, incorporating constrained dilated convolutions to enlarge receptive fields, and integrating a cross-layer lightweight fusion module, the models achieve 90 FPS on a single CPU core without sacrificing accuracy. Evaluated on seven public datasets, UltraSeg retains >94% of the Dice score of a 31 M-parameter U-Net while using only 0.4% of its parameters, establishing a strong, clinically viable baseline for the extreme-compression domain and offering an immediately deployable solution for resource-constrained settings. This work provides not only a CPU-native solution for colonoscopy but also a reproducible blueprint for broader minimally invasive surgical vision applications. Source code is publicly available to ensure reproducibility and facilitate future benchmarking.
TurboBoA Faster and Exact Attention-aware Quantization without Backpropagation
Authors: Junhan Kim, Yeo Jeong Park, Seungwoo Son, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon
2026-02-04
The rapid growth of large language models (LLMs) has heightened the importance of post-training quantization (PTQ) for reducing memory and computation costs. Among PTQ methods, GPTQ has gained significant attention for its efficiency, enabling billion-scale LLMs to be quantized within a few GPU hours. However, GPTQ's assumption of layer-wise independence leads to severe accuracy drops in low-bit quantization regimes. Recently, BoA improved upon GPTQ by incorporating inter-layer dependencies within attention modules, but its reliance on sequential quantization across all out-channels makes it substantially less efficient. In this paper, we propose TurboBoA, a new backpropagation-free PTQ algorithm that preserves the accuracy benefits of BoA while significantly accelerating the process. TurboBoA introduces three key innovations: (i) joint quantization of multiple out-channels with a closed-form error compensation rule, which reduces sequential bottlenecks and yields more than a three-fold speedup; (ii) a correction mechanism for errors propagated from preceding quantized layers; and (iii) adaptive grid computation with coordinate descent refinement to maintain alignment during iterative updates. Extensive experiments demonstrate that TurboBoA delivers substantial speedups over BoA while consistently improving accuracy. When combined with outlier suppression techniques, it achieves state-of-the-art results in both weight-only and weight-activation quantization. The code will be available at https://github.com/SamsungLabs/TurboBoA.
SparVAR Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
Authors: Zekun Li, Ning Wang, Tongxin Bai, Changwang Mei, Peisong Wang, Shuang Qiu, Jian Cheng
2026-02-04
Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next-scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior methods often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from an earlier decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism, enabling high-efficiency sparse attention computation at large scales. Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves faster forward speed than FlashAttention. Extensive experiments demonstrate that SparVAR can reduce the generation time of an 8B model producing high-resolution images to around one second, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a clear speed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparVAR attains an even greater speed-up while maintaining competitive visual generation quality. Code is available at https://github.com/CAS-CLab/SparVAR.
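Two of the attention properties exploited here, sinks and locality, can be illustrated with a toy boolean mask: the first few key positions stay visible to every query, and each query additionally sees a local causal window. The actual cross-scale index-mapped pattern in SparVAR is more involved than this sketch.

```python
import numpy as np

def sink_plus_local_mask(q_len, kv_len, n_sink=4, window=8):
    """Illustrative sparse attention mask: attention sinks (first keys
    always visible) plus a local causal window per query. Queries are
    assumed aligned to the end of the key sequence."""
    mask = np.zeros((q_len, kv_len), dtype=bool)
    mask[:, :n_sink] = True                        # sink tokens
    for i in range(q_len):
        c = kv_len - q_len + i                     # aligned key position
        mask[i, max(0, c - window): c + 1] = True  # local causal window
    return mask
```

The cost saving is proportional to mask density: each query touches roughly `n_sink + window` keys instead of `kv_len`.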
Pruning Minimal Reasoning Graphs for Efficient Retrieval-Augmented Generation
Authors: Ning Wang, Kuanyan Zhu, Daniel Yuehwoon Yee, Yitang Gao, Shiying Huang, Zirun Xu, Sainyam Galhotra
2026-02-04
Retrieval-augmented generation (RAG) is now standard for knowledge-intensive tasks, but most systems still treat every query as fresh, repeatedly re-retrieving long passages and re-reasoning from scratch, inflating tokens, latency, and cost. We present AutoPrunedRetriever, a graph-style RAG system that persists the minimal reasoning subgraph built for earlier questions and incrementally extends it for later ones. AutoPrunedRetriever stores entities and relations in a compact, ID-indexed codebook and represents questions, facts, and answers as edge sequences, enabling retrieval and prompting over symbolic structure instead of raw text. To keep the graph compact, we apply a two-layer consolidation policy (fast ANN/KNN alias detection plus selective k-means once a memory threshold is reached) and prune low-value structure, while prompts retain only cluster representatives and genuinely new evidence. We instantiate two front ends: AutoPrunedRetriever-REBEL, which uses REBEL as a triplet parser, and AutoPrunedRetriever-llm, which swaps in an LLM extractor. On GraphRAG-Benchmark (Medical and Novel), both variants achieve state-of-the-art complex reasoning accuracy, improving over HippoRAG2 by roughly 9--11 points, and remain competitive on contextual summarization and generation. On our harder STEM and TV benchmarks, AutoPrunedRetriever again ranks first, while using up to two orders of magnitude fewer tokens than graph-heavy baselines, making it a practical substrate for long-running sessions, evolving corpora, and multi-agent pipelines.
Internalizing LLM Reasoning via Discovery and Replay of Latent Actions
Authors: Zhenning Shi, Yijia Zhu, Junhan Shi, Xun Zhang, Lei Wang, Congcong Miao
2026-02-04
The internalization of chain-of-thought processes into hidden states has emerged as a highly efficient paradigm for scaling test-time compute. However, existing activation steering methods rely on static control vectors that fail to adapt to the non-stationary evolution of complex reasoning tasks. To address this limitation, we propose STIR (Self-Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem. STIR introduces a synergistic three-stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes to crystallize steering primitives; (2) control basis construction curates a compact, geometrically diverse tool library; and (3) value-modulated trajectory intervention dynamically injects context-specific impulses via anchor-based gating. Extensive experiments on six arithmetic and logical benchmarks across four representative models demonstrate that STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding. These findings demonstrate that the benefits of explicit chain-of-thought can be realized through dynamic latent trajectory control, internalizing the reasoning process to bypass explicit generation while achieving superior fidelity. Our code is available at https://github.com/sznnzs/-Latent-Action.
From Assumptions to Actions Turning LLM Reasoning into Uncertainty-Aware Planning for Embodied Agents
Authors: SeungWon Seo, SooBin Lim, SeongRae Noh, Haneul Kim, HyeongYeop Kang
2026-02-04
Embodied agents operating in multi-agent, partially observable, and decentralized environments must plan and act despite pervasive uncertainty about hidden objects and collaborators' intentions. Recent advances in applying Large Language Models (LLMs) to embodied agents have addressed many long-standing challenges, such as high-level goal decomposition and online adaptation. Yet uncertainty is still primarily mitigated through frequent inter-agent communication. This incurs substantial token and time costs, and can disrupt established workflows when human partners are involved. We introduce PCE, a Planner-Composer-Evaluator framework that converts the fragmented assumptions latent in LLM reasoning traces into a structured decision tree. Internal nodes encode environment assumptions and leaves map to actions; each path is then scored by scenario likelihood, goal-directed gain, and execution cost to guide rational action selection without heavy communication. Across two challenging multi-agent benchmarks (C-WAH and TDW-MAT) and three diverse LLM backbones, PCE consistently outperforms communication-centric baselines in success rate and task efficiency while showing comparable token usage. Ablation results indicate that the performance gains obtained by scaling model capacity or reasoning depth persist even when PCE is applied, while PCE consistently raises the baseline across both capacity and reasoning-depth scales, confirming that structured uncertainty handling complements both forms of scaling. A user study further demonstrates that PCE produces communication patterns that human partners perceive as more efficient and trustworthy. Together, these results establish a principled route for turning latent LLM assumptions into reliable strategies for uncertainty-aware planning.
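The path-scoring step described above (scenario likelihood, goal-directed gain, execution cost per root-to-leaf path) suggests a simple selection rule. The sketch below uses a hypothetical linear scoring function, which may not match the paper's exact combination.

```python
from dataclasses import dataclass

@dataclass
class Path:
    """One root-to-leaf path: an assumed scenario ending in an action."""
    action: str
    likelihood: float  # how plausible the assumed scenario is
    gain: float        # goal-directed value if the scenario holds
    cost: float        # execution cost of the action

def select_action(paths):
    """Weight each path's gain by scenario likelihood, subtract cost,
    and pick the best action (illustrative scoring rule)."""
    return max(paths, key=lambda p: p.likelihood * p.gain - p.cost).action
```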
Disentangling Causal Importance from Emergent Structure in Multi-Expert Orchestration
Authors: Sudipto Ghosh, Sujoy Nath, Sunny Manchanda, Tanmoy Chakraborty
2026-02-04
Multi-expert systems, where multiple Large Language Models (LLMs) collaborate to solve complex tasks, are increasingly adopted for high-performance reasoning and generation. However, the orchestration policies governing expert interaction and sequencing remain largely opaque. We introduce INFORM, an interpretability analysis that treats orchestration as an explicit, analyzable computation, enabling the decoupling of expert interaction structure, execution order, and causal attribution. We use INFORM to evaluate an orchestrator on GSM8K, HumanEval, and MMLU using a homogeneous consortium of ten instruction-tuned experts drawn from LLaMA-3.1 8B, Qwen-3 8B, and DeepSeek-R1 8B, with controlled decoding-temperature variation, and a secondary heterogeneous consortium spanning 1B-7B parameter models. Across tasks, routing dominance is a poor proxy for functional necessity. We reveal a divergence between relational importance, captured by routing mass and interaction topology, and intrinsic importance, measured via gradient-based causal attribution: frequently selected experts often act as interaction hubs with limited causal influence, while sparsely routed experts can be structurally critical. Orchestration behaviors emerge asynchronously, with expert centralization preceding stable routing confidence and expert ordering remaining non-deterministic. Targeted ablations show that masking intrinsically important experts induces disproportionate collapse in interaction structure compared to masking frequent peers, confirming that INFORM exposes causal and structural dependencies beyond accuracy metrics alone.
MiniRec Data-Efficient Reinforcement Learning for LLM-based Recommendation
Authors: Lin Wang, Yang Zhang, Jingfan Chen, Xiaoyan Zhao, Fengbin Zhu, Qing Li, Tat-Seng Chua
2026-02-04
The integration of reinforcement learning (RL) into large language models (LLMs) has opened new opportunities for recommender systems by eliciting reasoning and improving user preference modeling. However, RL-based recommendation faces significant efficiency challenges, making full-data training costly. Existing data selection methods define sample value based on learnability or representativeness, yet their loss-, gradient-, or dataset-coverage-driven criteria often misalign with RL learning dynamics, resulting in suboptimal performance. To address this, we propose MiniRec, a data selection framework tailored for RL-based recommendation. MiniRec evaluates sample learnability using key RL signals -- rewards -- filtering out samples that are too easy (consistently high reward) or too difficult (consistently low reward). It assesses representativeness by aligning sample gradients with the approximated "ideal" global RL optimization trajectory, selecting samples that mainly drive model updates, and it also enforces diversity to reduce redundancy. Combined with a curriculum learning strategy that moves from easy to hard samples, MiniRec significantly reduces training cost while largely preserving performance. Extensive experiments demonstrate MiniRec's effectiveness, highlighting the importance of reward-aligned, trajectory-informed data selection in RL-based recommendation.
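A minimal sketch of the selection pipeline described above (reward-band filtering, gradient alignment with an approximated global direction, and easy-to-hard ordering). The thresholds, the cosine criterion, and the omission of the diversity step are assumptions of this sketch, not details from the paper:

```python
import math

def select_samples(samples, global_dir, k, r_lo=0.2, r_hi=0.8):
    """samples: list of (reward, gradient) pairs; global_dir: approximated
    ideal RL update direction. Thresholds r_lo/r_hi are illustrative."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv + 1e-12)

    # 1) Learnability: drop samples whose reward is too high (too easy)
    #    or too low (too hard) to yield a useful RL learning signal.
    pool = [(r, g) for r, g in samples if r_lo <= r <= r_hi]
    # 2) Representativeness: keep the k samples whose gradients best align
    #    with the approximated global optimization trajectory.
    pool.sort(key=lambda s: cos(s[1], global_dir), reverse=True)
    # 3) Curriculum: present the chosen subset from easy (high reward) to hard.
    return sorted(pool[:k], key=lambda s: -s[0])

chosen = select_samples(
    [(0.95, (1.0, 0.0)), (0.5, (1.0, 0.1)), (0.4, (-1.0, 0.0)),
     (0.05, (0.0, 1.0)), (0.6, (0.9, 0.2))],
    global_dir=(1.0, 0.0), k=2)
```

With these toy inputs the reward-band filter removes the 0.95 and 0.05 samples, alignment discards the anti-aligned gradient, and the survivors come back ordered easy-to-hard.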
KVSmooth Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
Authors: Siyu Jiang, Feiyang Chen, Xiaojin Zhang, Kun He
2026-02-04
Despite the significant progress of Multimodal Large Language Models (MLLMs) across diverse tasks, hallucination -- corresponding to the generation of visually inconsistent objects, attributes, or relations -- remains a major obstacle to their reliable deployment. Unlike pure language models, MLLMs must ground their generation process in visual inputs. However, existing models often suffer from semantic drift during decoding, causing outputs to diverge from visual facts as the sequence length increases. To address this issue, we propose KVSmooth, a training-free and plug-and-play method that mitigates hallucination by performing attention-entropy-guided adaptive smoothing on hidden states. Specifically, KVSmooth applies an exponential moving average (EMA) to both keys and values in the KV-Cache, while dynamically quantifying the sink degree of each token through the entropy of its attention distribution to adaptively adjust the smoothing strength. Unlike computationally expensive retraining or contrastive decoding methods, KVSmooth operates efficiently during inference without additional training or model modification. Extensive experiments demonstrate that KVSmooth significantly reduces hallucination while improving overall performance, achieving higher precision and recall simultaneously. In contrast, prior methods often improve one at the expense of the other, validating the effectiveness and generality of our approach.
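The KV-cache smoothing step can be sketched as follows. The entropy-to-strength mapping, its direction (stronger smoothing for high-entropy tokens), and the `beta_max` cap are assumptions of this sketch, not values from the paper:

```python
import math

def attn_entropy(p):
    # Shannon entropy of one token's attention distribution.
    return -sum(q * math.log(q + 1e-12) for q in p)

def ema_smooth_kv(keys, values, attn_rows, beta_max=0.5):
    """Attention-entropy-guided EMA over a KV cache. keys/values are lists
    of vectors; attn_rows[i] is the attention distribution of token i.
    Assumption: low-entropy (sink-like) tokens get weaker smoothing so
    their content is preserved; diffuse tokens are smoothed more."""
    max_ent = math.log(len(attn_rows[0]))
    sk, sv = [keys[0][:]], [values[0][:]]
    for i in range(1, len(keys)):
        beta = beta_max * attn_entropy(attn_rows[i]) / max_ent
        sk.append([(1 - beta) * k + beta * pk for k, pk in zip(keys[i], sk[-1])])
        sv.append([(1 - beta) * v + beta * pv for v, pv in zip(values[i], sv[-1])])
    return sk, sv

sk, sv = ema_smooth_kv(keys=[[1.0], [0.0]], values=[[2.0], [0.0]],
                       attn_rows=[[0.5, 0.5], [0.5, 0.5]])
```

With uniform attention the smoothing strength hits the cap, so the second token's key/value is pulled halfway toward the running average.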
Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog
Authors: Yiran Zhao, Shengyang Zhou, Zijian Wu, Tongyan Hu, Yuhui Xu, Rengan Dou, Kenji Kawaguchi, Shafiq Joty, Junnan Li, Michael Qizhe Shieh
2026-02-04
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but their substantial size often demands significant computational resources. To reduce resource consumption and accelerate inference, it is essential to eliminate redundant parameters without compromising performance. However, conventional pruning methods that directly remove such parameters often lead to a dramatic drop in performance on reasoning tasks, and require extensive post-training to recover the lost capabilities. In this work, we propose a gradual compacting method that divides the pruning process into multiple fine-grained iterations, applying a Prune-Tune Loop (PTL) at each stage to incrementally reduce model size while restoring performance with finetuning. This iterative approach, reminiscent of the "boiling frog" effect, enables the model to be progressively compressed without abrupt performance loss. Experimental results show that PTL can compress LLMs to nearly half their original size with only lightweight post-training, while maintaining performance comparable to the original model on reasoning tasks. Moreover, PTL is flexible and can be applied to various pruning strategies, such as neuron pruning and layer pruning, as well as different post-training methods, including continual pre-training and reinforcement learning. Additionally, experimental results confirm the effectiveness of PTL on a variety of tasks beyond mathematical reasoning, such as code generation, demonstrating its broad applicability.
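The iterative schedule is easy to state concretely. The geometric per-stage keep ratio below is one plausible choice rather than the paper's exact schedule, and `tune` stands in for any recovery step (continual pre-training, RL, etc.):

```python
def prune_tune_loop(model_size, target_size, n_stages, tune):
    """Gradual Prune-Tune Loop sketch: instead of pruning to the target in
    one shot, shrink by a small ratio per stage and recover performance
    with light finetuning in between."""
    ratio = (target_size / model_size) ** (1.0 / n_stages)  # per-stage keep ratio
    sizes, size = [], model_size
    for _ in range(n_stages):
        size *= ratio          # prune a small fraction of neurons/layers
        tune(size)             # restore capability before the next cut
        sizes.append(size)
    return sizes

# Halve a 100-unit model over five prune-tune stages (~13% pruned per stage).
sizes = prune_tune_loop(100.0, 50.0, n_stages=5, tune=lambda size: None)
```

Each stage removes only a small slice, which is what lets the finetuning step keep up with the capability loss.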
AppleVLM End-to-end Autonomous Driving with Advanced Perception and Planning-Enhanced Vision-Language Models
Authors: Yuxuan Han, Kunyuan Wu, Qianyi Shao, Renxiang Xiao, Zilu Wang, Cansen Jiang, Yi Xiao, Liang Hu, Yunjiang Lou
2026-02-04
End-to-end autonomous driving has emerged as a promising paradigm integrating perception, decision-making, and control within a unified learning framework. Recently, Vision-Language Models (VLMs) have gained significant attention for their potential to enhance the robustness and generalization of end-to-end driving models in diverse and unseen scenarios. However, existing VLM-based approaches still face challenges, including suboptimal lane perception, language understanding biases, and difficulties in handling corner cases. To address these issues, we propose AppleVLM, an advanced perception and planning-enhanced VLM model for robust end-to-end driving. AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making. Firstly, the vision encoder fuses spatial-temporal information from multi-view images across multiple timesteps using a deformable mechanism, enhancing robustness to camera variations and facilitating scalable deployment across different vehicle platforms. Secondly, unlike traditional VLM-based approaches, AppleVLM introduces a dedicated planning modality that encodes explicit Bird's-Eye-View spatial information, mitigating language biases in navigation instructions. Finally, a VLM decoder fine-tuned by a hierarchical Chain-of-Thought integrates vision, language, and planning features to output robust driving waypoints. We evaluate AppleVLM in closed-loop experiments on two CARLA benchmarks, achieving state-of-the-art driving performance. Furthermore, we deploy AppleVLM on an AGV platform and successfully showcase real-world end-to-end autonomous driving in complex outdoor environments.
CoLT Reasoning with Chain of Latent Tool Calls
Authors: Fangwei Zhu, Zhifang Sui
2026-02-04
Chain-of-Thought (CoT) is a critical technique for enhancing the reasoning ability of Large Language Models (LLMs), and latent reasoning methods have been proposed to accelerate the inefficient token-level reasoning chain. We notice that existing latent reasoning methods generally require model structure augmentation and exhaustive training, limiting their broader applicability. In this paper, we propose CoLT, a novel framework that implements latent reasoning as ``tool calls''. Instead of reasoning entirely in the latent space, CoLT generates seed tokens that contain the information of a reasoning step. When a latent tool call is triggered, a smaller external model takes the hidden states of the seed tokens as its input and unpacks them back into a full reasoning step. In this way, we can ensure that the main model reasons in the explicit token space, preserving its ability while improving efficiency. Experimental results on four mathematical datasets demonstrate that CoLT achieves higher accuracy and shorter reasoning length than baseline latent models, and is compatible with reinforcement learning algorithms and different decoder structures.
Post-Quantum Identity-Based TLS for 5G Service-Based Architecture and Cloud-Native Infrastructure
Authors: Vipin Kumar Rathi, Lakshya Chopra, Nikhil Kumar Rajput
2026-02-04
Cloud-native application platforms and latency-sensitive systems such as 5G Core networks rely heavily on certificate-based Public Key Infrastructure (PKI) and mutual TLS to secure service-to-service communication. While effective, this model introduces significant operational and performance overhead, which is further amplified in the post-quantum setting due to large certificates and expensive signature verification. In this paper, we present a certificate-free authentication framework for private distributed systems based on post-quantum Identity-Based Encryption (IBE). Our design replaces certificate- and signature-based authentication with identity-derived keys and identity-based key encapsulation, enabling mutually authenticated TLS connections without certificate transmission or validation. We describe an IBE-based replacement for private PKI, including identity lifecycle management, and show how it can be instantiated using a threshold Private Key Generator (T-PKG). We apply this framework to cloud-native application deployments and latency-sensitive 5G Core networks. In particular, we demonstrate how identity-based TLS integrates with the 5G Service-Based Architecture while preserving security semantics and 3GPP requirements, and we show how the same architecture can replace private PKI in Kubernetes, including its control plane, without disrupting existing trust domains or deployment models.
Following the TRAIL Predicting and Explaining Tomorrow's Hits with a Fine-Tuned LLM
Authors: Yinan Zhang, Zhixi Chen, Jiazheng Jing, Zhiqi Shen
2026-02-04
Large Language Models (LLMs) have been widely applied across multiple domains for their broad knowledge and strong reasoning capabilities. However, applying them to recommender systems is challenging, since it is hard for LLMs to extract user preferences from large user-item logs, and real-time per-user ranking over the full catalog is too time-consuming to be practical. Moreover, many existing recommender systems focus solely on ranking items while overlooking explanations, which could help improve predictive accuracy and make recommendations more convincing to users. Inspired by recent works that achieve strong recommendation performance by forecasting near-term item popularity, we propose TRAIL (TRend and explAnation Integrated Learner). TRAIL is a fine-tuned LLM that jointly predicts short-term item popularity and generates faithful natural-language explanations. It employs contrastive learning with positive and negative pairs to align its scores and explanations with structured trend signals, yielding accurate and explainable popularity predictions. Extensive experiments show that TRAIL outperforms strong baselines and produces coherent, well-grounded explanations.
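The contrastive alignment can be illustrated with a standard InfoNCE-style loss over embedding pairs. The embeddings, temperature, and cosine similarity below are assumptions of the sketch, not TRAIL's actual objective:

```python
import math

def info_nce(anchor, positive, negatives, tau=0.1):
    """Contrastive loss: pull the model's score/explanation embedding
    (anchor) toward the structured trend signal it should reflect
    (positive) and away from mismatched signals (negatives)."""
    def sim(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)) + 1e-12)

    logits = [sim(anchor, positive) / tau] + [sim(anchor, n) / tau for n in negatives]
    m = max(logits)  # stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]   # -log softmax probability of the positive pair

trend, wrong_trends = (1.0, 0.0), [(0.0, 1.0), (-1.0, 0.0)]
loss_aligned    = info_nce((1.0, 0.0), trend, wrong_trends)  # matches its trend
loss_misaligned = info_nce((0.0, 1.0), trend, wrong_trends)  # matches a negative
```

An explanation embedding that tracks its trend signal incurs near-zero loss, while one aligned with a negative pair is heavily penalized.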
Adaptive 1D Video Diffusion Autoencoder
Authors: Yao Teng, Minxuan Lin, Xian Liu, Shuai Wang, Xiao Yang, Xihui Liu
2026-02-04
Recent video generation models largely rely on video autoencoders that compress pixel-space videos into latent representations. However, existing video autoencoders suffer from three major limitations: (1) fixed-rate compression that wastes tokens on simple videos, (2) inflexible CNN architectures that prevent variable-length latent modeling, and (3) deterministic decoders that struggle to recover appropriate details from compressed latents. To address these issues, we propose One-Dimensional Diffusion Video Autoencoder (One-DVA), a transformer-based framework for adaptive 1D encoding and diffusion-based decoding. The encoder employs query-based vision transformers to extract spatiotemporal features and produce latent representations, while a variable-length dropout mechanism dynamically adjusts the latent length. The decoder is a pixel-space diffusion transformer that reconstructs videos with the latents as input conditions. With a two-stage training strategy, One-DVA achieves performance comparable to 3D-CNN VAEs on reconstruction metrics at identical compression ratios. More importantly, it supports adaptive compression and thus can achieve higher compression ratios. To better support downstream latent generation, we further regularize the One-DVA latent distribution for generative modeling and fine-tune its decoder to mitigate artifacts caused by the generation process.
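One way to realize a variable-length latent dropout is to keep a random-length prefix of the 1D token sequence during training, so the decoder learns to reconstruct from any budget. The prefix-keep form and the `min_keep` bound are assumptions of this sketch, not details from the paper:

```python
import random

def variable_length_dropout(latents, min_keep=4, rng=None):
    """Keep a random-length prefix of the 1D latent sequence. At inference,
    simple videos can then be encoded with fewer tokens (higher compression)
    while complex ones use the full budget."""
    rng = rng or random.Random()
    keep = rng.randint(min_keep, len(latents))  # inclusive bounds
    return latents[:keep]

latents = list(range(16))                       # stand-in for 16 latent tokens
out = variable_length_dropout(latents, min_keep=4, rng=random.Random(0))
```

Because only a prefix survives, the truncated sequence is always a valid (coarser) encoding of the same video.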
OAT Ordered Action Tokenization
Authors: Chaoqi Liu, Xiaoshen Han, Jiawei Gao, Yue Zhao, Haonan Chen, Yilun Du
2026-02-04
Autoregressive policies offer a compelling foundation for scalable robot learning by enabling discrete abstraction, token-level reasoning, and flexible inference. However, applying autoregressive modeling to continuous robot actions requires an effective action tokenization scheme. Existing approaches either rely on analytical discretization methods that produce prohibitively long token sequences, or learned latent tokenizers that lack structure, limiting their compatibility with next-token prediction. In this work, we identify three desiderata for action tokenization -- high compression, total decodability, and a left-to-right causally ordered token space -- and introduce Ordered Action Tokenization (OAT), a learned action tokenizer that satisfies all three. OAT discretizes action chunks into an ordered sequence of tokens using a transformer with registers, finite scalar quantization, and ordering-inducing training mechanisms. The resulting token space aligns naturally with autoregressive generation and enables prefix-based detokenization, yielding an anytime trade-off between inference cost and action fidelity. Across more than 20 tasks spanning four simulation benchmarks and real-world settings, autoregressive policies equipped with OAT consistently outperform prior tokenization schemes and diffusion-based baselines, while offering significantly greater flexibility at inference time.
Semantic Consensus Decoding Backdoor Defense for Verilog Code Generation
Authors: Guang Yang, Xing Hu, Xiang Chen, Xin Xia
2026-02-04
Large language models (LLMs) for Verilog code generation are increasingly adopted in hardware design, yet remain vulnerable to backdoor attacks where adversaries inject malicious triggers during training to induce vulnerable hardware designs. Unlike patchable software vulnerabilities, hardware trojans become irreversible once fabricated, making remediation extremely costly or impossible. Existing active defenses require access to training data, which is impractical for third-party LLM users, while passive defenses struggle against semantically stealthy triggers that blend naturally into design specifications. In this paper, we hypothesize that under the requirements of both effectiveness and stealthiness, attackers are strongly biased toward embedding triggers in non-functional requirements (e.g., style modifiers, quality descriptors) rather than functional specifications that determine hardware behavior. Exploiting this insight, we propose Semantic Consensus Decoding (SCD), an inference-time passive defense with two key components: (1) functional requirement extraction that identifies essential requirements from user specifications, and (2) consensus decoding that adaptively fuses output distributions based on the full user specification and the extracted functional requirements. When these distributions diverge significantly, SCD automatically suppresses suspicious components. Extensive experiments with three representative backdoor attacks demonstrate that SCD reduces the average attack success rate from 89% to under 3% with negligible impact on generation quality.
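The consensus fusion step can be sketched as a simple per-token rule over the two next-token distributions. The averaging, the absolute-difference divergence measure, and the threshold `tau` are assumptions of this sketch, not the paper's exact formulation:

```python
def consensus_decode(p_full, p_func, tau=0.35):
    """Fuse next-token distributions conditioned on the full specification
    (p_full) and on the extracted functional requirements (p_func);
    suppress tokens where the two diverge sharply (possible trigger
    influence), then renormalize."""
    fused = [0.0 if abs(a - b) > tau else 0.5 * (a + b)
             for a, b in zip(p_full, p_func)]
    z = sum(fused) or 1.0   # fallback if everything was suppressed
    return [x / z for x in fused]

# Token 0 is boosted only under the full prompt (trigger-like); tokens 1-2
# are supported by the functional requirements as well.
out = consensus_decode([0.7, 0.2, 0.1], [0.2, 0.5, 0.3])
```

The token favored only by the full (possibly triggered) prompt is zeroed out, while tokens the two views agree on keep their averaged mass.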
Universal Quantized Berry-Dipole Flat Bands
Authors: Qingyang Mo, Shuang Zhang
2026-02-04
Perfectly flat bands with nontrivial quantum geometry have emerged as a frontier for exotic topological phenomena and superconductors. Here, we unveil a universal family of quantized Berry-dipole flat bands in chiral-symmetric (2n+1)-band systems, where the central perfectly flat band carries a Berry-dipole moment d=n, with n an arbitrary integer, while preserving zero Chern number. We construct explicit lattice models to showcase three topological phenomena characterized by the Berry-dipole moment: a flat-band returning pump featuring bidirectional, soliton-like displacement of Wannier centers by exactly n unit cells per half cycle, a dipolar Haldane phase diagram arising from the competition between time-reversal and parity symmetries, and n pairs of bulk helical zero modes whose existence depends on the orientation of the pseudomagnetic field. Our findings establish a universal framework for topology beyond the Chern class in perfectly flat bands and provide a tunable platform for exploring quantum geometry and interaction-driven phases.