2026-02-06
Table of Contents
- DLM-Scope Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
- "It Talks Like a Patient, But Feels Different" Co-Designing AI Standardized Patients with Medical Learners
- Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
- Focus-Scan-Refine From Human Visual Perception to Efficient Visual Token Pruning
- Price of universality in vector quantization is at most 0.11 bit
- Efficient implementation of arbitrary Hermitian-preserving and trace-preserving maps
- Variational Speculative Decoding Rethinking Draft Training from Token Likelihood to Sequence Acceptance
- LongR Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards
- Towards Green AI Decoding the Energy of LLM Inference in Software Development
- Shiva-DiT Residual-Based Differentiable Top-K Selection for Efficient Diffusion Transformers
- SDFP Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration
- DisCa Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
- Forward Index Compression for Learned Sparse Retrieval
- THOR Inductive Link Prediction over Hyper-Relational Knowledge Graphs
- Speech-XL Towards Long-Form Speech Understanding in Large Speech Language Models
- RaBiT Residual-Aware Binarization Training for Accurate and Efficient LLMs
- Pool-based Active Learning as Noisy Lossy Compression Characterizing Label Complexity via Finite Blocklength Analysis
- High-Performance Moment-Encoded Lattice Boltzmann Method with Stability-Guided Quantization
- Hybrid Gated Flow (HGF) Stabilizing 1.58-bit LLMs via Selective Low-Rank Correction
- FedMosaic Federated Retrieval-Augmented Generation via Parametric Adapters
- Extreme Weather Nowcasting via Local Precipitation Pattern Prediction
- Diffusion-aided Extreme Video Compression with Lightweight Semantics Guidance
- Double-P Hierarchical Top-P Sparse Attention for Long-Context LLMs
- Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky
- TIDE Temporal Incremental Draft Engine for Self-Improving LLM Inference
- ARGaze Autoregressive Transformers for Online Egocentric Gaze Estimation
- SocialVeil Probing Social Intelligence of Language Agents under Communication Barriers
- Physics-Informed Diffusion Models for Vehicle Speed Trajectory Generation
- Protein Autoregressive Modeling via Multiscale Structure Generation
- Adaptive estimation of Sobolev-type energy functionals on the sphere
- OmniSIFT Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
- From independent patches to coordinated attention Controlling information flow in vision transformers
- Less Finetuning, Better Retrieval Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging
- Addressing Corpus Knowledge Poisoning Attacks on RAG Using Sparse Attention
- Towards Understanding and Avoiding Limitations of Convolutions on Graphs
- PIO-FVLM Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective
- LEAD Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation
- Disentangling meaning from language in LLM-based machine translation
- Domain decomposition methods and preconditioning strategies using generalized locally Toeplitz tools: proposals, analysis, and numerical validation
- Harmonia Algorithm-Hardware Co-Design for Memory- and Compute-Efficient BFP-based LLM Inference
- SalFormer360 a transformer-based saliency estimation model for 360-degree videos
- Nix and Fix Targeting 1000x Compression of 3D Gaussian Splatting with Diffusion Models
- LycheeDecode Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
- Incongruity-sensitive access to highly compressed strings
- Circuit-Restricted Weight Arithmetic for Selective Refusal
- Model-Dowser Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models
- Greedy-Gnorm A Gradient Matrix Norm-Based Alternative to Attention Entropy for Head Pruning
- DOS Dual-Flow Orthogonal Semantic IDs for Recommendation in Meituan
- Seg-ReSearch Segmentation with Interleaved Reasoning and External Search
- RASA Routing-Aware Safety Alignment for Mixture-of-Experts Models
- The Stretto Execution Engine for LLM-Augmented Data Systems
- Depth-Wise Emergence of Prediction-Centric Geometry in Large Language Models
- HoRD Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation
- LCUDiff Latent Capacity Upgrade Diffusion for Faithful Human Body Restoration
- Enabling Real-Time Colonoscopic Polyp Segmentation on Commodity CPUs via Ultra-Lightweight Architecture
- TurboBoA Faster and Exact Attention-aware Quantization without Backpropagation
- SparVAR Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
- Pruning Minimal Reasoning Graphs for Efficient Retrieval-Augmented Generation
- Internalizing LLM Reasoning via Discovery and Replay of Latent Actions
- From Assumptions to Actions Turning LLM Reasoning into Uncertainty-Aware Planning for Embodied Agents
- Disentangling Causal Importance from Emergent Structure in Multi-Expert Orchestration
- MiniRec Data-Efficient Reinforcement Learning for LLM-based Recommendation
- KVSmooth Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
- Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog
- AppleVLM End-to-end Autonomous Driving with Advanced Perception and Planning-Enhanced Vision-Language Models
- CoLT Reasoning with Chain of Latent Tool Calls
- Post-Quantum Identity-Based TLS for 5G Service-Based Architecture and Cloud-Native Infrastructure
- Following the TRAIL Predicting and Explaining Tomorrow's Hits with a Fine-Tuned LLM
- Adaptive 1D Video Diffusion Autoencoder
- OAT Ordered Action Tokenization
- Semantic Consensus Decoding Backdoor Defense for Verilog Code Generation
- Universal Quantized Berry-Dipole Flat Bands
DLM-Scope Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
Authors: Xu Wang, Bingqing Jiang, Yu Wan, Baosong Yang, Lingpeng Kong, Difan Zou
2026-02-05
Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, human-interpretable features and intervene on model behavior. Recently, as diffusion language models (DLMs) have become an increasingly promising alternative to autoregressive LLMs, it is essential to develop tailored mechanistic interpretability tools for this emerging class of models. In this work, we present DLM-Scope, the first SAE-based interpretability framework for DLMs, and demonstrate that trained Top-K SAEs can faithfully extract interpretable features. Notably, we find that inserting SAEs affects DLMs differently than autoregressive LLMs: while SAE insertion in LLMs typically incurs a loss penalty, in DLMs it can reduce cross-entropy loss when applied to early layers, a phenomenon absent or markedly weaker in LLMs. Additionally, SAE features in DLMs enable more effective diffusion-time interventions, often outperforming LLM steering. Moreover, we pioneer new SAE-based research directions for DLMs: we show that SAEs can provide useful signals for the DLM decoding order, and that SAE features remain stable during the post-training phase of DLMs. Our work establishes a foundation for mechanistic interpretability in DLMs and shows the great potential of applying SAEs to DLM-related tasks and algorithms.
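A Top-K SAE of the kind mentioned above keeps only the k largest latent pre-activations per input; a minimal numpy sketch (dimensions, initialization, and the pre-bias convention are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Top-K sparse autoencoder: keep the k largest pre-activations,
    zero the rest, then reconstruct linearly. Shapes: x (d,),
    W_enc (m, d), W_dec (d, m), with m >> d latent features."""
    pre = W_enc @ (x - b_dec) + b_enc        # encoder pre-activations
    z = np.zeros_like(pre)
    idx = np.argpartition(pre, -k)[-k:]      # indices of the k largest
    z[idx] = np.maximum(pre[idx], 0.0)       # ReLU on the survivors
    x_hat = W_dec @ z + b_dec                # linear decoder
    return z, x_hat

rng = np.random.default_rng(0)
d, m, k = 16, 64, 8
W_enc = rng.normal(size=(m, d)) / np.sqrt(d)
W_dec = rng.normal(size=(d, m)) / np.sqrt(m)
b_enc, b_dec = np.zeros(m), np.zeros(d)
z, x_hat = topk_sae_forward(rng.normal(size=d), W_enc, b_enc, W_dec, b_dec, k)
```

The hard sparsity constraint (at most k active features) is what makes each latent dimension a candidate interpretable feature.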
"It Talks Like a Patient, But Feels Different" Co-Designing AI Standardized Patients with Medical Learners
Authors: Zhiqi Gao, Guo Zhu, Huarui Luo, Dongyijie Primo Pan, Haoming Tang, Bingquan Zhang, Jiahuan Pei, Jie Li, Benyou Wang
2026-02-05
Standardized patients (SPs) play a central role in clinical training but are costly, difficult to scale, and inconsistent. Large language model (LLM)-based AI standardized patients (AI-SPs) promise flexible, on-demand practice, yet learners often report that they "talk like a patient but feel different." We interviewed 12 clinical-year medical students and conducted three co-design workshops to examine how learners experience the constraints of SP encounters and what they expect from AI-SPs. We identified six learner-centered needs, translated them into AI-SP design requirements, and synthesized a conceptual workflow. Our findings position AI-SPs as tools for deliberate practice and show that instructional usability, rather than conversational realism alone, drives learner trust, engagement, and educational value.
Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
Authors: Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li
2026-02-05
Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should possess the autonomy to navigate unknown environments guided solely by simple, high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense, step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing the video generation model into this field for the first time. Yet the prohibitive latency of generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up compared to the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.
Focus-Scan-Refine From Human Visual Perception to Efficient Visual Token Pruning
Authors: Enwei Tong, Yuanchao Bai, Yao Zhu, Junjun Jiang, Xianming Liu
2026-02-05
Vision-language models (VLMs) often generate massive numbers of visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive pruning. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source code can be found at https://github.com/ILOT-code/FSR
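The focus step of such pruning schemes can be sketched generically: score each visual token by a blend of saliency and instruction relevance, then keep a fixed budget. The blend weight and score names below are hypothetical illustrations, not FSR's actual formulas:

```python
import numpy as np

def prune_visual_tokens(tokens, saliency, relevance, budget, alpha=0.5):
    """Keep a fixed budget of visual tokens ranked by a blend of visual
    saliency and instruction relevance, so visually salient but
    query-irrelevant regions do not dominate.
    tokens: (n, d); saliency, relevance: (n,) scores."""
    score = alpha * saliency + (1 - alpha) * relevance
    keep = np.sort(np.argsort(score)[-budget:])  # top-budget, original order
    return tokens[keep], keep

rng = np.random.default_rng(2)
n, d = 100, 8
toks = rng.normal(size=(n, d))
sal = rng.random(n)       # e.g. attention-derived visual importance
rel = rng.random(n)       # e.g. similarity to the instruction embedding
kept, idx = prune_visual_tokens(toks, sal, rel, budget=25)
```

Keeping the surviving indices sorted preserves the tokens' original spatial order, which downstream attention layers generally expect.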
Price of universality in vector quantization is at most 0.11 bit
Authors: Alina Harbuzova, Or Ordentlich, Yury Polyanskiy
2026-02-05
Fast computation of a matrix product Wx is a workhorse of modern LLMs. To make their deployment more efficient, a popular approach is to use a low-precision approximation of W in place of the true W ("weight-only quantization"). Information theory demonstrates that an optimal algorithm for reducing the precision of W depends on the (second-order) statistics of W and requires a careful alignment of the vector quantization codebook with the PCA directions of W (a process known as "waterfilling allocation"). Dependence of the codebook on the statistics of W, however, is highly impractical. This paper proves that there exists a universal codebook that is simultaneously near-optimal for all possible statistics of W, in the sense of being at least as good as a statistics-adapted waterfilling codebook with rate reduced by 0.11 bit per dimension. Such a universal codebook would be an ideal candidate for a low-precision storage format, a topic of active modern research, but alas the existence proof is non-constructive.
Equivalently, our result shows the existence of a net in Euclidean space that is a nearly-optimal covering of a sphere simultaneously with respect to all Hilbert norms.
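For intuition, the waterfilling allocation referred to above assigns each PCA direction bits in proportion to its log-variance above a common "water level"; a small numpy sketch of textbook reverse waterfilling (illustrative of the classical allocation, not the paper's construction):

```python
import numpy as np

def waterfill_bits(variances, total_bits, iters=60):
    """Reverse waterfilling: dimension i gets b_i = max(0, 0.5*log2(var_i/theta)),
    with the water level theta chosen by bisection so the b_i sum to total_bits."""
    lo, hi = 1e-12, float(np.max(variances))
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        b = np.maximum(0.0, 0.5 * np.log2(variances / theta))
        if b.sum() > total_bits:
            lo = theta   # spending too many bits -> raise the water level
        else:
            hi = theta
    return b

# Variances along (hypothetical) PCA directions, 4 bits total budget.
var = np.array([8.0, 4.0, 2.0, 1.0, 0.25])
bits = waterfill_bits(var, total_bits=4.0)
```

High-variance directions receive more bits, and directions whose variance falls below the water level receive none, which is exactly the statistics-dependence the paper's universal codebook avoids paying more than 0.11 bit for.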
Efficient implementation of arbitrary Hermitian-preserving and trace-preserving maps
Authors: Weizhou Cai, Zi-Jie Chen, Xuanqiang Zhao, Xin Wang, Guang-Can Guo, Luyan Sun, Chang-Ling Zou
2026-02-05
Quantum control has been a cornerstone of quantum information science, driving major advances in quantum computing, quantum communication, and quantum sensing. Over the years, it has enabled the implementation of arbitrary completely positive and trace-preserving (CPTP) maps; an important next step is to extend control to Hermitian-preserving and trace-preserving (HPTP) maps, which underpin applications such as entanglement detection, quantum error mitigation, quantum simulation, and quantum machine learning. Here we present an efficient and fully constructive method for implementing arbitrary HPTP maps. Unlike existing methods that decompose an HPTP map into multiple CPTP maps or approximate it using bipartite Hamiltonians with large Hilbert spaces, our approach compiles a target HPTP map into a single executable CPTP map whose Kraus rank is guaranteed to be no larger than the intrinsic rank of the target HPTP map plus one, followed by simple classical post-processing. Numerical results for inverse noise channels used in quantum error mitigation, including bosonic photon loss, confirm substantial reductions in resources and highlight scalability in higher-dimensional settings. Together with our numerical benchmarks, these results validate the efficiency and versatility of the proposed framework, opening a route to broader quantum-information applications enabled by HPTP processing.
Variational Speculative Decoding Rethinking Draft Training from Token Likelihood to Sequence Acceptance
Authors: Xiandong Zou, Jianshu Li, Jing Huang, Pan Zhou
2026-02-05
Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We propose Variational Speculative Decoding (VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws MCMC samples from an oracle-filtered posterior, while the M-step maximizes a weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis confirms that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, significantly improving decoding efficiency.
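The sequence-acceptance objective sits on top of the standard speculative-decoding verification rule, which accepts a drafted token t with probability min(1, p(t)/q(t)) and resamples from the residual distribution on the first rejection. A toy numpy sketch of that rule (distributions are synthetic; this is the classic scheme, not VSD's training procedure):

```python
import numpy as np

def verify_draft(draft_tokens, q_probs, p_probs, rng):
    """Standard speculative-decoding verification: accept draft token t
    with probability min(1, p(t)/q(t)); on the first rejection, resample
    from the renormalized residual distribution max(p - q, 0) and stop.
    q_probs, p_probs: per-position draft/target distributions, shape (n, V)."""
    accepted = []
    for i, t in enumerate(draft_tokens):
        if rng.random() < min(1.0, p_probs[i, t] / q_probs[i, t]):
            accepted.append(t)
        else:
            residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted

rng = np.random.default_rng(1)
V, n = 5, 3
q = rng.dirichlet(np.ones(V), size=n)   # draft model distributions
p = rng.dirichlet(np.ones(V), size=n)   # target model distributions
draft = [int(rng.choice(V, p=q[i])) for i in range(n)]
out = verify_draft(draft, q, p, rng)
```

This rule guarantees the accepted sequence is distributed exactly as the target model would sample it, which is why draft training can focus purely on raising the acceptance rate.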
LongR Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards
Authors: Bowen Ping, Zijun Chen, Yiyao Yu, Tingfeng Hui, Junchi Yan, Baobao Chang
2026-02-05
Reinforcement learning has emerged as a key driver of LLM reasoning. This capability is equally pivotal in long-context scenarios, such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to effectively guide complex long-context reasoning. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain to quantify the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses to investigate the impact of reasoning chain length on efficiency and the model's robustness against distractors.
Towards Green AI Decoding the Energy of LLM Inference in Software Development
Authors: Lola Solovyeva, Fernando Castor
2026-02-05
Context: AI-assisted tools are increasingly integrated into software development workflows, but their reliance on large language models (LLMs) introduces substantial computational and energy costs. Understanding and reducing the energy footprint of LLM inference is therefore essential for sustainable software development. Objective: In this study, we conduct a phase-level analysis of LLM inference energy consumption, distinguishing between (1) the prefill phase, where the model processes the input and builds internal representations, and (2) the decoding phase, where output tokens are generated using the stored state. Method: We investigate six 6B-7B and four 3B-4B decoder-based models, evaluating them on the code-centric benchmarks HumanEval for code generation and LongBench for code understanding. Results: Our findings show that, within both parameter groups, models exhibit distinct energy patterns across phases. Furthermore, we observed that increases in prefill cost amplify the energy cost per token during decoding, with amplifications ranging from 1.3% to 51.8% depending on the model. Lastly, three out of ten models demonstrate babbling behavior, adding excessive content to the output that unnecessarily inflates energy consumption. We implemented babbling suppression for code generation, achieving energy savings ranging from 44% to 89% without affecting generation accuracy. Conclusion: These findings show that prefill costs influence decoding, which dominates energy consumption, and that babbling suppression can yield up to 89% energy savings. Reducing inference energy therefore requires both mitigating babbling behavior and limiting the impact of prefill on decoding.
Shiva-DiT Residual-Based Differentiable Top-K Selection for Efficient Diffusion Transformers
Authors: Jiaji Zhang, Hailiang Zhao, Guoxuan Zhu, Ruichao Sun, Jiaju Wu, Xinkui Zhao, Hanlin Tang, Weiyi Lu, Kan Liu, Tao Lan, Lin Qu, Shuiguang Deng
2026-02-05
Diffusion Transformers (DiTs) incur prohibitive computational costs due to the quadratic scaling of self-attention. Existing methods fail to simultaneously satisfy differentiability, efficiency, and the strict static budgets required for hardware deployment. To address this, we propose Shiva-DiT, which reconciles these conflicting requirements via residual-based differentiable Top-K selection. By leveraging a residual-aware straight-through estimator, our method enforces deterministic token counts for static compilation while preserving end-to-end learnability through residual gradient estimation. Furthermore, we introduce a Context-Aware Router and an Adaptive Ratio Policy to autonomously learn an adaptive pruning schedule. Experiments on mainstream models, including SD3.5, demonstrate that Shiva-DiT establishes a new Pareto frontier, achieving a 1.54x wall-clock speedup with superior fidelity compared to existing baselines, effectively eliminating ragged tensor overheads.
SDFP Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration
Authors: Hanyu Wei, Zunhai Su, Peng Lu, Chao Li, Spandan Tiwari, Ashish Sirasao, Yuhan Dong
2026-02-05
Large language models (LLMs) underpin interactive multimedia applications such as captioning, retrieval, recommendation, and creative content generation, yet their autoregressive decoding incurs substantial latency. Speculative decoding reduces latency using a lightweight draft model, but deployment is often limited by the cost and complexity of acquiring, tuning, and maintaining an effective draft model. Recent approaches usually require auxiliary training or specialization, and even training-free methods incur costly search or optimization. We propose SDFP, a fully training-free and plug-and-play framework that builds the draft model via Fisher Information Trace (FIT)-based layer pruning of a given LLM. Using layer sensitivity as a proxy for output perturbation, SDFP removes low-impact layers to obtain a compact draft while preserving compatibility with the original model for standard speculative verification. SDFP needs no additional training, hyperparameter tuning, or separately maintained drafts, enabling rapid, deployment-friendly draft construction. Across benchmarks, SDFP delivers 1.32x-1.5x decoding speedup without altering the target model's output distribution, supporting low-latency multimedia applications.
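FIT-style layer scoring can be approximated with the empirical Fisher, i.e. averaged squared gradients per layer, dropping the lowest-scoring layers to form the draft. A toy sketch under that assumption (not the paper's exact estimator; layer names and gradient shapes are synthetic):

```python
import numpy as np

def fit_scores(per_layer_grads):
    """Empirical Fisher trace per layer: mean over samples of the
    summed squared gradients of that layer's parameters."""
    return {name: float(np.mean([np.sum(g ** 2) for g in grads]))
            for name, grads in per_layer_grads.items()}

def prune_layers(scores, n_drop):
    """Drop the n_drop layers with the smallest Fisher trace,
    keeping the layers the output is most sensitive to."""
    order = sorted(scores, key=scores.get)
    return set(order[:n_drop])

# Synthetic gradients: later layers get larger gradient scales here,
# so the earliest layers should be pruned.
rng = np.random.default_rng(3)
grads = {f"layer{i}": [rng.normal(scale=0.1 * (i + 1), size=32)
                       for _ in range(8)] for i in range(6)}
dropped = prune_layers(fit_scores(grads), n_drop=2)
```

Because the surviving layers keep their original weights, the pruned draft stays vocabulary- and tokenizer-compatible with the target model, which is what standard speculative verification requires.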
DisCa Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
Authors: Chang Zou, Changlin Li, Yang Li, Patrol Li, Jianbing Wu, Xiao He, Songtao Liu, Zhao Zhong, Kailin Huang, Linfeng Zhang
2026-02-05
While diffusion models have achieved great success in video generation, this progress is accompanied by a rapidly escalating computational burden. Among existing methods, feature caching is popular due to its training-free nature and considerable speedup, but it inevitably suffers semantic and detail degradation under further acceleration. Another widely adopted method, training-aware step distillation, though successful in image generation, also faces drastic degradation in video generation at few steps. Furthermore, the quality loss becomes more severe when training-free feature caching is simply applied to step-distilled models, due to the fewer sampling steps. This paper introduces a distillation-compatible learnable feature caching mechanism for the first time. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. Through these initiatives, we push the acceleration boundaries further while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.
Forward Index Compression for Learned Sparse Retrieval
Authors: Sebastian Bruch, Martino Fontana, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
2026-02-05
Text retrieval using learned representations of queries and documents has, over the years, evolved into a highly effective approach to search. It is thanks to recent advances in approximate nearest neighbor search, with the emergence of highly efficient algorithms such as the inverted index-based Seismic and the graph-based HNSW, that retrieval with learned sparse representations became viable in practice. In this work, we scrutinize the efficiency of learned sparse retrieval algorithms and focus particularly on the size of a data structure that is common to all algorithmic flavors and that constitutes a substantial fraction of the overall index size: the forward index. In particular, we seek compression techniques that reduce the storage footprint of the forward index without compromising search quality or inner-product computation latency. In our examination of various integer compression techniques, we find that StreamVByte achieves the best trade-off between memory footprint, retrieval accuracy, and latency. We then improve on StreamVByte by introducing DotVByte, a new algorithm tailored to inner-product computation. Experiments on MS MARCO show that our improvements lead to significant space savings while maintaining retrieval efficiency.
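StreamVByte stores one control byte per group of four integers, with two bits per value giving its byte length; a minimal scalar Python encoder/decoder illustrating the layout (without the separate-stream SIMD tricks that make real implementations fast):

```python
def svb_encode(values):
    """StreamVByte layout: per group of four uint32s, one control byte
    whose 2-bit fields hold each value's byte length minus one, plus the
    values' little-endian bytes. Control and data streams are returned
    separately, as in the real format."""
    controls, data = bytearray(), bytearray()
    for i in range(0, len(values), 4):
        ctrl = 0
        for j, v in enumerate(values[i:i + 4]):
            nbytes = max(1, (v.bit_length() + 7) // 8)  # 1..4 bytes
            ctrl |= (nbytes - 1) << (2 * j)
            data += v.to_bytes(nbytes, "little")
        controls.append(ctrl)
    return bytes(controls), bytes(data)

def svb_decode(controls, data, count):
    """Inverse of svb_encode: read each value's length from the control
    stream, then slice its bytes out of the data stream."""
    out, pos = [], 0
    for i in range(count):
        nbytes = ((controls[i // 4] >> (2 * (i % 4))) & 0b11) + 1
        out.append(int.from_bytes(data[pos:pos + nbytes], "little"))
        pos += nbytes
    return out

nums = [3, 300, 70000, 2 ** 31, 7]
c, d = svb_encode(nums)
```

Keeping lengths out-of-band in control bytes is what lets vectorized decoders shuffle four values at once, a property a dot-product-oriented variant can exploit further.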
THOR Inductive Link Prediction over Hyper-Relational Knowledge Graphs
Authors: Weijian Yu, Yuhuan Lu, Dingqi Yang
2026-02-05
Knowledge graphs (KGs) have become a key ingredient supporting a variety of applications. Beyond the traditional triplet representation of facts, where a relation connects two entities, modern KGs contain an increasing number of hyper-relational facts, in which an arbitrary number of qualifiers attached to a triplet provide auxiliary information that further describes its rich semantics and can effectively boost reasoning performance in link prediction tasks. However, existing link prediction techniques over such hyper-relational KGs (HKGs) mostly focus on a transductive setting, where KG embedding models are learned from the specific vocabulary of a given KG and can subsequently make predictions only within that vocabulary, limiting their generalizability to previously unseen vocabularies. Against this background, we propose THOR, an inducTive link prediction technique for Hyper-relational knOwledge gRaphs. Specifically, we first introduce relation and entity foundation graphs, modeling their fundamental inter- and intra-fact interactions in HKGs, which are agnostic to any specific relations and entities. Afterward, THOR learns from the two foundation graphs with two parallel graph encoders followed by a decoder, which supports efficient masked training and fully inductive inference. We conduct a thorough evaluation of THOR on hyper-relational link prediction tasks across 12 datasets with different settings. Results show that THOR outperforms a sizable collection of baselines, yielding 66.1%, 55.9%, and 20.4% improvements over the best-performing rule-based, semi-inductive, and fully-inductive techniques, respectively. A series of ablation studies also reveals our key design factors capturing the structural invariance transferable across HKGs for inductive tasks.
Speech-XL Towards Long-Form Speech Understanding in Large Speech Language Models
Authors: Haoqin Sun, Chenyang Lyu, Shiwan Zhao, Xuanfan Ni, Xiangyu Kong, Longyue Wang, Weihua Luo, Yong Qin
2026-02-05
Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from the limited context length and the exorbitant memory footprint required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning, employing a curriculum learning strategy in which the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite utilizing significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
RaBiT Residual-Aware Binarization Training for Accurate and Efficient LLMs
Authors: Youngcheon You, Banseok Lee, Minseop Choi, Seonyoung Kim, Hyochan Chong, Changdong Kim, Youngmin Kim, Dongkyu Kim
2026-02-05
Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between memory efficiency and performance. Residual binarization enables hardware-friendly, matmul-free inference by stacking binary (±1) layers, but is plagued by pathological feature co-adaptation. We identify a key failure mode, which we term inter-path adaptation: during quantization-aware training (QAT), parallel residual binary paths learn redundant features, degrading the error-compensation structure and limiting the expressive capacity of the model. While prior work relies on heuristic workarounds (e.g., path freezing) that constrain the solution space, we propose RaBiT, a novel QAT framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. Its core mechanism sequentially derives each binary path from a single shared full-precision weight, ensuring that every path corrects the error of the preceding one. This process is stabilized by a robust initialization that prioritizes functional preservation over mere weight approximation. RaBiT redefines the 2-bit accuracy-efficiency frontier: it achieves state-of-the-art performance, rivals even hardware-intensive vector quantization (VQ) methods, and delivers an inference speed-up over full-precision models on an RTX 4090.
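The residual hierarchy RaBiT enforces can be contrasted with plain greedy residual binarization, where each binary path fits the sign pattern of the previous path's residual; a numpy sketch of that baseline construction (not RaBiT's training procedure):

```python
import numpy as np

def residual_binarize(W, n_paths):
    """Greedy residual binarization: path i stores sign(R_i) with scale
    mean(|R_i|); each path binarizes what the previous paths missed,
    so reconstruction error shrinks as paths stack."""
    R = W.copy()
    paths = []
    for _ in range(n_paths):
        alpha = np.abs(R).mean()       # optimal scalar scale for sign(R)
        B = np.sign(R)
        paths.append((alpha, B))
        R = R - alpha * B              # residual for the next path
    W_hat = sum(a * B for a, B in paths)
    return paths, W_hat

rng = np.random.default_rng(4)
W = rng.normal(size=(64, 64))
paths1, W1 = residual_binarize(W, 1)
paths2, W2 = residual_binarize(W, 2)
err1 = np.linalg.norm(W - W1)
err2 = np.linalg.norm(W - W2)
```

Each added path provably reduces the Frobenius error (by n * alpha^2 per step), which is exactly the error-compensation structure that inter-path co-adaptation degrades during joint training.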
Pool-based Active Learning as Noisy Lossy Compression Characterizing Label Complexity via Finite Blocklength Analysis
Authors: Kosuke Sugiyama, Masato Uchida
2026-02-05
This paper proposes an information-theoretic framework for analyzing the theoretical limits of pool-based active learning (AL), in which a subset of instances is selectively labeled. The proposed framework reformulates pool-based AL as a noisy lossy compression problem by mapping pool observations to noisy symbol observations, data selection to encoding, and learning to decoding. This correspondence enables a unified information-theoretic analysis of data selection and learning in pool-based AL. Applying finite blocklength analysis of noisy lossy compression, we derive information-theoretic lower bounds on label complexity and generalization error that serve as theoretical limits for a given learning algorithm under its associated optimal data selection strategy. Specifically, our bounds include terms that reflect overfitting induced by the learning algorithm and the discrepancy between its inductive bias and the target task, and they are closely related to established information-theoretic bounds and stability theory, which have not previously been applied to the analysis of pool-based AL. These properties yield a new theoretical perspective on pool-based AL.
High-Performance Moment-Encoded Lattice Boltzmann Method with Stability-Guided Quantization
Authors: Yixin Chen, Wei Li, David I. W. Levin, Kui Wu
2026-02-05
In this work, we present a memory-efficient, high-performance GPU framework for moment-based lattice Boltzmann methods (LBM) with fluid-solid coupling. We introduce a split-kernel scheme that decouples fluid updates from solid boundary handling, substantially reducing warp divergence and improving utilization on GPUs. We further perform the first von Neumann stability analysis of the high-order moment-encoded LBM (HOME-LBM) formulation, characterizing its spectral behavior and deriving stability bounds for individual moment components. These theoretical insights directly guide a practical 16-bit moment quantization without compromising numerical stability. Our framework achieves up to 6x speedup and reduces GPU memory footprint by up to 50% in fluid-only scenarios and 25% in scenes with complex solid boundaries compared to the state-of-the-art LBM solver, while preserving physical fidelity across a range of large-scale benchmarks and real-time demonstrations. The proposed approach enables scalable, stable, and high-resolution LBM simulation on a single GPU, bridging theoretical stability analysis with practical GPU algorithm design.
Hybrid Gated Flow (HGF) Stabilizing 1.58-bit LLMs via Selective Low-Rank Correction
Authors: David Alejandro Trejo Pizzo
2026-02-05
The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall" -- a hardware limitation where memory bandwidth, not compute, becomes the bottleneck. Recent 1.58-bit quantization techniques (e.g., BitNet b1.58) dramatically reduce memory footprint but typically incur a perplexity degradation of 20-25% compared to FP16 baselines. In this work, we introduce Hybrid Gated Flow (HGF), a dual-stream architecture that couples a 1.58-bit ternary backbone with a learnable, low-rank FP16 correction path controlled by adaptive gates.
Through extensive experiments on the TinyStories dataset across two training regimes (2500 and 3500 steps), we demonstrate that HGF 5.4 achieves a validation loss of 0.9306 compared to BitNet's 1.0294, recovering approximately 55% of the quality gap between pure ternary quantization and the FP16 baseline (0.8490). This recovery is achieved with only ~12-15% memory overhead beyond the ternary backbone.
Furthermore, we provide empirical evidence for an emergent phenomenon: quantization as structural regularization. While a full-precision differential attention baseline (Diff_Only) exhibited training instability, with validation loss exceeding 1.68, the ternary-anchored HGF maintained robust convergence throughout training. Finally, we report preliminary results extending this architecture to 1.2B and 3B parameter models trained on SlimPajama and FineWeb-Edu. These larger-scale experiments confirm that the architectural stability and quality recovery observed in small-scale proxies scale to production-grade language modeling regimes.
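The dual-stream idea above can be sketched in a few lines of NumPy: a BitNet-style ternary projection plus a gated low-rank FP16 correction. The layer sizes, rank, and gate value below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def ternary_quantize(W):
    """BitNet-b1.58-style ternarization: scale by the mean |w|,
    then round each weight to {-1, 0, +1}."""
    scale = np.abs(W).mean()
    return np.clip(np.round(W / (scale + 1e-8)), -1, 1), scale

d_in, d_out, rank = 64, 64, 4          # toy sizes (assumed, not from the paper)
W = rng.normal(size=(d_in, d_out))
W_t, s = ternary_quantize(W)

# Low-rank FP16 correction path with a per-layer gate g
# (learned adaptively in HGF; fixed here for illustration).
A = rng.normal(size=(d_in, rank)).astype(np.float16)
B = rng.normal(size=(rank, d_out)).astype(np.float16)
g = 0.5

def hgf_forward(x):
    backbone = x @ W_t * s                       # 1.58-bit ternary stream
    correction = x.astype(np.float16) @ A @ B    # FP16 low-rank stream
    return backbone + g * correction.astype(np.float64)

y = hgf_forward(rng.normal(size=(1, d_in)))
```

The memory overhead of the correction path scales with `rank * (d_in + d_out)`, which is how the paper's ~12-15% figure stays small relative to the backbone.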
FedMosaic Federated Retrieval-Augmented Generation via Parametric Adapters
Authors: Zhilin Liang, Yuxiang Wang, Zimu Zhou, Hainan Zhang, Boyi Liu, Yongxin Tong
2026-02-05
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding generation in external knowledge to improve factuality and reduce hallucinations. Yet most deployments assume a centralized corpus, which is infeasible in privacy-aware domains where knowledge remains siloed. This motivates federated RAG (FedRAG), where a central LLM server collaborates with distributed silos without sharing raw documents. In-context RAG violates this requirement by transmitting verbatim documents, whereas parametric RAG encodes documents into lightweight adapters that merge with a frozen LLM at inference, avoiding raw-text exchange. We adopt the parametric approach but face two challenges unique to FedRAG: high storage and communication costs from per-document adapters, and destructive aggregation caused by indiscriminately merging multiple adapters. We present FedMosaic, the first federated RAG framework built on parametric adapters. FedMosaic clusters semantically related documents into multi-document adapters with document-specific masks to reduce overhead while preserving specificity, and performs selective adapter aggregation to combine only relevance-aligned, non-conflicting adapters. Experiments show that FedMosaic achieves an average 10.9% higher accuracy than state-of-the-art methods across four categories, while lowering storage costs by 78.8% to 86.3% and communication costs by 91.4%, and never sharing raw documents.
Extreme Weather Nowcasting via Local Precipitation Pattern Prediction
Authors: Changhoon Song, Teng Yuan Chang, Youngjoon Hong
2026-02-05
Accurate forecasting of extreme weather events such as heavy rainfall or storms is critical for risk management and disaster mitigation. Although high-resolution radar observations have spurred extensive research on nowcasting models, precipitation nowcasting remains particularly challenging due to pronounced spatial locality, intricate fine-scale rainfall structures, and variability in forecasting horizons. While recent diffusion-based generative ensembles show promising results, they are computationally expensive and unsuitable for real-time applications. In contrast, deterministic models are computationally efficient but remain biased toward normal rainfall. Furthermore, the benchmark datasets commonly used in prior studies are themselves skewed--either dominated by ordinary rainfall events or restricted to extreme rainfall episodes--thereby hindering general applicability in real-world settings. In this paper, we propose exPreCast, an efficient deterministic framework for generating finely detailed radar forecasts, and introduce a newly constructed balanced radar dataset from the Korea Meteorological Administration (KMA), which encompasses both ordinary precipitation and extreme events. Our model integrates local spatiotemporal attention, a texture-preserving cubic dual upsampling decoder, and a temporal extractor to flexibly adjust forecasting horizons. Experiments on established benchmarks (SEVIR and MeteoNet) as well as on the balanced KMA dataset demonstrate that our approach achieves state-of-the-art performance, delivering accurate and reliable nowcasts across both normal and extreme rainfall regimes.
Diffusion-aided Extreme Video Compression with Lightweight Semantics Guidance
Authors: Maojun Zhang, Haotian Wu, Richeng Jin, Deniz Gunduz, Krystian Mikolajczyk
2026-02-05
Modern video codecs and learning-based approaches struggle with semantic reconstruction at extremely low bit-rates due to their reliance on low-level spatiotemporal redundancies. Generative models, especially diffusion models, offer a new paradigm for video compression by leveraging high-level semantic understanding and powerful visual synthesis. This paper proposes a video compression framework that integrates generative priors to drastically reduce bit-rate while maintaining reconstruction fidelity. Specifically, our method compresses high-level semantic representations of the video, then uses a conditional diffusion model to reconstruct frames from these semantics. To further improve compression, we characterize motion information with global camera trajectories and foreground segmentation: background motion is compactly represented by camera pose parameters, while foreground dynamics are captured by sparse segmentation masks. This significantly boosts compression efficiency, enabling decent video reconstruction at extremely low bit-rates.
Double-P Hierarchical Top-P Sparse Attention for Long-Context LLMs
Authors: Wentao Ni, Kangqi Zhang, Zhongming Yu, Oren Nelson, Mingu Lee, Hong Cai, Fatih Porikli, Jongryool Kim, Zhijian Liu, Jishen Zhao
2026-02-05
As long-context inference becomes central to large language models (LLMs), attention over growing key-value (KV) caches emerges as a dominant decoding bottleneck, motivating sparse attention for scalable inference. Fixed-budget top-k sparse attention cannot adapt to heterogeneous attention distributions across heads and layers, whereas top-p sparse attention directly preserves attention mass and provides stronger accuracy guarantees. Existing top-p methods, however, fail to jointly optimize top-p accuracy, selection overhead, and sparse attention cost, which limits their overall efficiency. We present Double-P, a hierarchical sparse attention framework that optimizes all three stages. Double-P first performs coarse-grained top-p estimation at the cluster level using size-weighted centroids, then adaptively refines computation through a second top-p stage that allocates token-level attention only when needed. Across long-context benchmarks, Double-P consistently achieves near-zero accuracy drop, reducing attention computation overhead by up to 1.8x and delivering up to 1.3x end-to-end decoding speedup over state-of-the-art fixed-budget sparse attention methods.
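The two-stage selection can be illustrated with NumPy for a single query and head. The round-robin cluster assignment stands in for the clustering Double-P builds on, and all sizes are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, n_clusters, p = 32, 256, 8, 0.9

q = rng.normal(size=d)                       # one query vector
keys = rng.normal(size=(n_tokens, d))        # cached keys
labels = np.arange(n_tokens) % n_clusters    # stand-in cluster assignment
centroids = np.stack([keys[labels == c].mean(0) for c in range(n_clusters)])
sizes = np.bincount(labels, minlength=n_clusters)

def top_p_indices(logits, p):
    """Smallest set of indices whose softmax mass reaches p."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    order = np.argsort(-w)
    k = np.searchsorted(np.cumsum(w[order]), p) + 1
    return order[:k]

# Stage 1: coarse top-p over size-weighted centroid scores.
keep_clusters = top_p_indices(centroids @ q + np.log(sizes), p)

# Stage 2: token-level top-p restricted to the surviving clusters.
cand = np.flatnonzero(np.isin(labels, keep_clusters))
selected = cand[top_p_indices(keys[cand] @ q, p)]
```

The `log(sizes)` term makes a centroid's score approximate the total softmax mass of its cluster, which is why size-weighting matters in the coarse stage.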
Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky
Authors: Hsuan-Yu Chou, Wajiha Naveed, Shuyan Zhou, Xiaowei Yang
2026-02-05
As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be effectively utilized for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to outperform traditional machine learning models zero-shot, the out-of-the-box capability of open-weight LLMs remains an open question.
Motivated by recent developments in reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by the Bluesky Moderation Service, and annotations by two authors, we find a considerable degree of overlap between the sensitivity (81%--97%) and specificity (91%--100%) of the open-weight LLMs and those (72%--98% and 93%--99%) of the proprietary ones. Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we identify inter-rater agreement across human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
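For reference, the two metrics quoted above come straight from a confusion matrix; the counts here are invented for illustration, not taken from the study.

```python
# Hypothetical confusion counts for one moderator model on labeled posts.
tp, fn = 90, 10    # harmful posts: flagged vs. missed
tn, fp = 950, 50   # benign posts: passed vs. wrongly flagged

sensitivity = tp / (tp + fn)   # fraction of harmful content caught
specificity = tn / (tn + fp)   # fraction of benign content left alone
print(f"sensitivity={sensitivity:.0%}, specificity={specificity:.0%}")
# prints: sensitivity=90%, specificity=95%
```

The paper's observation that specificity exceeds sensitivity for rudeness means models under-flag rude posts (more `fn`) while rarely flagging benign ones (few `fp`).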
TIDE Temporal Incremental Draft Engine for Self-Improving LLM Inference
Authors: Jiyoung Park, Hankyu Jang, Changseok Song, Wookeun Jung
2026-02-05
Speculative decoding can substantially accelerate LLM inference, but realizing its benefits in practice is challenging due to evolving workloads and system-level constraints. We present TIDE (Temporal Incremental Draft Engine), a serving-engine-native framework that integrates online draft adaptation directly into high-performance LLM inference systems. TIDE reuses target-model hidden states generated during inference as training signals, enabling zero-overhead draft adaptation without reloading the target model, and employs adaptive runtime control to activate speculation and training only when beneficial. TIDE exploits heterogeneous clusters by mapping decoupled inference and training to appropriate GPU classes. Across diverse real-world workloads, TIDE achieves up to 1.15x throughput improvement over static speculative decoding while reducing draft training time by 1.67x compared to approaches that recompute training signals.
ARGaze Autoregressive Transformers for Online Egocentric Gaze Estimation
Authors: Jia Li, Wenjie Zhao, Shijian Deng, Bolin Lai, Yuheng Wu, Ruijia Chen, Jon E. Froehlich, Yuhang Zhao, Yapeng Tian
2026-02-04
Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames, a task essential for augmented reality and assistive technologies. Unlike third-person gaze estimation, this setting lacks explicit head or eye signals, requiring models to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. We observe that gaze exhibits strong temporal continuity during goal-directed activities: knowing where a person looked recently provides a powerful prior for predicting where they look next. Inspired by vision-conditioned autoregressive decoding in vision-language models, we propose ARGaze, which reformulates gaze estimation as sequential prediction: at each timestep, a decoder predicts the current gaze by conditioning on (i) current visual features and (ii) a fixed-length Gaze Context Window of recent gaze target estimates. This design enforces causality and enables bounded-resource streaming inference. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation, with extensive ablations validating that autoregressive modeling with bounded gaze history is critical for robust prediction. We will release our source code and pre-trained models.
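The bounded-memory streaming loop implied by a fixed-length gaze context window can be sketched as follows. The window length, feature vector, and blending predictor are placeholders, not the paper's transformer decoder.

```python
import numpy as np
from collections import deque

K = 8                      # context window length (assumed, not from the paper)
window = deque(maxlen=K)   # recent 2-D gaze estimates; memory stays bounded

def predict_gaze(frame_feat, window):
    """Toy stand-in for the autoregressive decoder: blend a recent-gaze
    prior with a feature-derived term. The real model conditions a
    transformer decoder on both inputs."""
    prior = np.mean(window, axis=0) if window else np.array([0.5, 0.5])
    return np.clip(0.9 * prior + 0.1 * frame_feat[:2], 0.0, 1.0)

for t in range(30):                     # streaming over incoming frames
    feat = np.full(4, 0.5)              # placeholder visual features
    gaze = predict_gaze(feat, window)
    window.append(gaze)                 # causality: only past estimates kept
```

Because `deque(maxlen=K)` evicts the oldest estimate on append, per-frame cost and memory are constant regardless of video length, which is the point of bounded-resource streaming inference.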
SocialVeil Probing Social Intelligence of Language Agents under Communication Barriers
Authors: Keyang Xuan, Pengda Wang, Chongrui Ye, Haofei Yu, Tal August, Jiaxuan You
2026-02-04
Large language models (LLMs) are increasingly evaluated in interactive environments to test their social intelligence. However, existing benchmarks often assume idealized communication between agents, limiting our ability to diagnose whether LLMs can maintain and repair interactions in more realistic, imperfect settings. To close this gap, we present \textsc{SocialVeil}, a social learning environment that simulates social interaction under cognitive-difference-induced communication barriers. Grounded in a systematic literature review of communication challenges in human interaction, \textsc{SocialVeil} introduces three representative types of such disruption: \emph{semantic vagueness}, \emph{sociocultural mismatch}, and \emph{emotional interference}. We also introduce two barrier-aware evaluation metrics, \emph{unresolved confusion} and \emph{mutual understanding}, to evaluate interaction quality under impaired communication. Experiments across 720 scenarios and four frontier LLMs show that barriers consistently impair performance, with mutual understanding reduced by over 45\% on average and confusion elevated by nearly 50\%. Human evaluations validate the fidelity of the simulated barriers (ICC 0.78, Pearson r 0.80). We further demonstrate that adaptation strategies (Repair Instruction and Interactive Learning) have only a modest effect, remaining far from barrier-free performance. This work takes a step toward bringing social interaction environments closer to real-world communication, opening opportunities for exploring the social intelligence of LLM agents.
Physics-Informed Diffusion Models for Vehicle Speed Trajectory Generation
Authors: Vadim Sokolov, Farnaz Behnia, Dominik Karbowski
2026-02-04
Synthetic vehicle speed trajectory generation is essential for evaluating vehicle control algorithms and connected vehicle technologies. Traditional Markov chain approaches suffer from discretization artifacts and limited expressiveness. This paper proposes a physics-informed diffusion framework for conditional micro-trip synthesis, combining a dual-channel speed-acceleration representation with soft physics constraints that resolve the optimization conflicts inherent to hard-constraint formulations. We compare a 1D U-Net architecture against a transformer-based Conditional Score-based Diffusion Imputation (CSDI) model using 6,367 GPS-derived micro-trips. CSDI achieves superior distribution matching (Wasserstein distance 0.30 for speed, 0.026 for acceleration), strong indistinguishability from real data (discriminative score 0.49), and validated utility for downstream energy assessment tasks. The methodology enables scalable generation of realistic driving profiles for intelligent transportation systems (ITS) applications without costly field data collection.
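As a reminder, the one-dimensional Wasserstein distance used for distribution matching reduces, for equal-size empirical samples, to the mean gap between sorted values. The sample values below are invented.

```python
import numpy as np

def wasserstein_1d(a, b):
    """W1 between two equal-size empirical samples: average absolute
    difference of the sorted values (the 1-D optimal transport cost)."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

# Shifting a sample by a constant moves it by exactly that distance.
speeds_real = np.array([0.0, 5.0, 10.0, 20.0])
speeds_fake = speeds_real + 0.3
d = wasserstein_1d(speeds_real, speeds_fake)   # approximately 0.3
```

For unequal sample sizes one would compare the empirical quantile functions instead (e.g. `scipy.stats.wasserstein_distance`), but the sorted-sample form is enough to interpret the reported 0.30 and 0.026 figures.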
Protein Autoregressive Modeling via Multiscale Structure Generation
Authors: Yanru Qu, Cheng-Yen Hsieh, Zaixiang Zheng, Ge Liu, Quanquan Gu
2026-02-04
We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Exploiting the hierarchical nature of proteins, PAR generates structures in a manner that mimics sculpting a statue, forming a coarse topology and then refining structural details across scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; and (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Moreover, autoregressive models suffer from exposure bias, caused by the mismatch between the training and generation procedures, which substantially degrades structure generation quality. We effectively alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without requiring fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions, produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.
Adaptive estimation of Sobolev-type energy functionals on the sphere
Authors: Claudio Durastanti
2026-02-04
We study the estimation of quadratic Sobolev-type integral functionals of an unknown density on the unit sphere. The functional is defined through fractional powers of the Laplace--Beltrami operator and provides a global measure of smoothness and spectral energy. Our approach relies on spherical needlet frames, which yield a localized multiscale decomposition while preserving tight frame properties in the natural square-integrable function space on the sphere.
We construct unbiased estimators of suitably truncated versions of the functional and derive sharp oracle risk bounds through an explicit bias--variance analysis. When the smoothness of the density is unknown, we propose a Lepski-type data-driven selection of the resolution level. The resulting adaptive estimator achieves minimax-optimal rates over Sobolev classes, without resorting to nonlinear or thresholding-based methods.
OmniSIFT Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
Authors: Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, Yuanxing Zhang, Jiaheng Liu, Qiang Liu, Pengfei Wan, Liang Wang
2026-02-04
Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video compression module that removes video redundancy arising from both intra-frame structure and inter-frame correlation, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.
From independent patches to coordinated attention Controlling information flow in vision transformers
Authors: Kieran A. Murphy
2026-02-04
We make the information transmitted by attention an explicit, measurable quantity in vision transformers. By inserting variational information bottlenecks on all attention-mediated writes to the residual stream -- without other architectural changes -- we train models with an explicit information cost and obtain a controllable spectrum from independent patch processing to fully expressive global attention. On ImageNet-100, we characterize how classification behavior and information routing evolve across this spectrum, and provide initial insights into how global visual representations emerge from local patch processing by analyzing the first attention heads that transmit information. By biasing learning toward solutions with constrained internal communication, our approach yields models that are more tractable for mechanistic analysis and more amenable to control.
Less Finetuning, Better Retrieval Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging
Authors: Sameh Khattab, Jean-Philippe Corbeil, Osman Alperen Koraş, Amin Dada, Julian Friedrich, François Beaulieu, Paul Vozila, Jens Kleesiek
2026-02-04
Retrieval-augmented generation (RAG) has become the backbone of grounding Large Language Models (LLMs), improving knowledge updates and reducing hallucinations. Recently, LLM-based retriever models have shown state-of-the-art performance in RAG applications. However, several technical aspects of how to adapt general-purpose LLMs into effective domain-specific retrievers remain underexplored, especially in specialized domains such as biomedicine. We present Synthesize-Train-Merge (STM), a modular framework that enhances decoder-only LLMs with synthetic hard negatives, retrieval prompt optimization, and model merging. Experiments on a subset of 12 medical and general tasks from the MTEB benchmark show STM boosts task-specific experts by up to 23.5\% (average 7.5\%) and produces merged models that outperform both single experts and strong baselines without extensive pretraining. Our results demonstrate a scalable, efficient path for turning general LLMs into high-performing, domain-specialized retrievers that preserve general-domain capabilities while excelling on specialized tasks.
Addressing Corpus Knowledge Poisoning Attacks on RAG Using Sparse Attention
Authors: Sagie Dekel, Moshe Tennenholtz, Oren Kurland
2026-02-04
Retrieval Augmented Generation (RAG) is a highly effective paradigm for keeping LLM-based responses up-to-date and reducing the likelihood of hallucinations. Yet, RAG was recently shown to be quite vulnerable to corpus knowledge poisoning: an attacker injects misleading documents into the corpus to steer an LLM's output toward an undesired response. We argue that the standard causal attention mechanism in LLMs enables harmful cross-document interactions, specifically in cases of poisoning attacks. Accordingly, we introduce a novel defense approach for RAG: Sparse Document Attention RAG (SDAG). This is a block-sparse attention mechanism that disallows cross-attention between retrieved documents. SDAG requires a minimal inference-time change to the attention mask; no fine-tuning or additional architectural changes are needed. We present an empirical evaluation of LLM-based question answering (QA) with a variety of attack strategies on RAG. We show that our SDAG method substantially outperforms the standard causal attention mechanism in terms of attack success rate. We further demonstrate the clear merits of integrating SDAG with state-of-the-art RAG defense methods: the integration yields performance that is statistically significantly better than the state of the art.
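The defense amounts to a single change in the attention mask. A minimal NumPy sketch of a mask that keeps causal attention but blocks one retrieved document from attending to another (the token layout and helper name are assumptions, not the paper's code):

```python
import numpy as np

def sdag_mask(doc_lens, query_len):
    """Causal attention mask in which retrieved documents cannot attend
    to each other. Token order: [doc_1 | doc_2 | ... | query].
    Returns a boolean matrix; True means attention is allowed."""
    n = sum(doc_lens) + query_len
    allowed = np.tril(np.ones((n, n), dtype=bool))   # start from causal mask
    spans, start = [], 0
    for length in doc_lens:
        spans.append((start, start + length))
        start += length
    # Zero out cross-document blocks (doc i attending to doc j, i != j).
    for i, (si, ei) in enumerate(spans):
        for j, (sj, ej) in enumerate(spans):
            if i != j:
                allowed[si:ei, sj:ej] = False
    return allowed

m = sdag_mask([3, 2], query_len=2)
# Query tokens (last two rows) still see every document and each other causally;
# the two documents are mutually invisible.
```

Since the change is confined to the mask, it composes with any attention kernel that accepts an arbitrary boolean mask, which is why no fine-tuning is required.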
Towards Understanding and Avoiding Limitations of Convolutions on Graphs
Authors: Andreas Roth
2026-02-04
While message-passing neural networks (MPNNs) have shown promising results, their real-world impact remains limited. Although various limitations have been identified, their theoretical foundations remain poorly understood, leading to fragmented research efforts. In this thesis, we provide an in-depth theoretical analysis and identify several key properties limiting the performance of MPNNs. Building on these findings, we propose several frameworks that address these shortcomings. We identify two properties exhibited by many MPNNs: shared component amplification (SCA), where each message-passing iteration amplifies the same components across all feature channels, and component dominance (CD), where a single component gets increasingly amplified as more message-passing steps are applied. These properties lead to the observable phenomenon of rank collapse of node representations, which generalizes the established over-smoothing phenomenon. By generalizing and decomposing over-smoothing, we enable a deeper understanding of MPNNs, more targeted solutions, and more precise communication within the field. To avoid SCA, we show that utilizing multiple computational graphs or edge relations is necessary. Our multi-relational split (MRS) framework transforms any existing MPNN into one that leverages multiple edge relations. Additionally, we introduce the spectral graph convolution for multiple feature channels (MIMO-GC), which naturally uses multiple computational graphs. A localized variant, LMGC, approximates the MIMO-GC while inheriting its beneficial properties. To address CD, we demonstrate a close connection between MPNNs and the PageRank algorithm. Based on personalized PageRank, we propose a variant of MPNNs that allows for infinitely many message-passing iterations while preserving the initial node features. Collectively, these results deepen the theoretical understanding of MPNNs.
PIO-FVLM Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective
Authors: Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Ying Li, Rong Xiao, Chunhua Shen
2026-02-04
Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed from inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives: it recasts visual token reduction as preserving output-result invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered under the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed PIO-FVLM is training-free and compatible with FlashAttention, making it friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.67x prefill speedup, 2.11x inference speedup, 6.22x lower FLOPs, and 6.05x reduced KV cache overhead. Our code is available at https://github.com/ocy1/PIO-FVLM.
LEAD Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation
Authors: Ruixiao Yang, Yuanhe Tian, Xu Yang, Huiqi Li, Yan Song
2026-02-04
Radiology Report Generation (RRG) aims to produce accurate and coherent diagnostic reports from medical images. Although large vision-language models (LVLMs) improve report fluency and accuracy, they exhibit hallucinations, generating plausible yet image-ungrounded pathological details. Existing methods primarily rely on external knowledge guidance to facilitate alignment between generated text and visual information. However, these approaches often ignore the inherent priors and vision-language alignment biases of pretrained models, and they lack robustness due to reliance on constructed guidance. In this paper, we propose Layer-wise Expert-aligned Decoding (LEAD), a novel method that inherently modifies the LVLM decoding trajectory. A multiple-experts module is designed to extract distinct pathological features, which are integrated into each decoder layer via a gating mechanism. This layer-wise architecture enables the LVLM to consult expert features at every inference step via a learned gating function, thereby dynamically rectifying decoding biases and steering generation toward factual consistency. Experiments conducted on multiple public datasets demonstrate that LEAD yields effective improvements in clinical accuracy metrics and mitigates hallucinations while preserving high generation quality.
Disentangling meaning from language in LLM-based machine translation
Authors: Théo Lasnier, Armel Zebaze, Djamé Seddah, Rachel Bawden, Benoît Sagot
2026-02-04
Mechanistic Interpretability (MI) seeks to explain how neural networks implement their capabilities, but the scale of Large Language Models (LLMs) has limited prior MI work in Machine Translation (MT) to word-level analyses. We study sentence-level MT from a mechanistic perspective by analyzing attention heads to understand how LLMs internally encode and distribute translation functions. We decompose MT into two subtasks: producing text in the target language (i.e., target language identification) and preserving the input sentence's meaning (i.e., sentence equivalence). Across three families of open-source models and 20 translation directions, we find that distinct, sparse sets of attention heads specialize in each subtask. Based on this insight, we construct subtask-specific steering vectors and show that modifying just 1% of the relevant heads enables instruction-free MT performance comparable to instruction-based prompting, while ablating these heads selectively disrupts their corresponding translation functions.
Domain decomposition methods and preconditioning strategies using generalized locally Toeplitz tools proposals, analysis, and numerical validation
Authors: Abdessadek Rifqui, Ahmed Ratnani, Stefano Serra-Capizzano
2026-02-04
In the current work, we present a spectral analysis of the additive and multiplicative Schwarz methods within the framework of domain decomposition techniques, investigating the spectral properties of the classical Schwarz preconditioning matrix-sequences, with emphasis on their convergence behavior and on the effect of transmission operators. In particular, after a general presentation of the various options, we focus on restricted variants of the Schwarz methods aimed at improving parallel efficiency while preserving their convergence features. To rigorously describe and analyze the convergence behavior, we employ the theory of generalized locally Toeplitz (GLT) sequences, which provides a robust framework for studying the asymptotic spectral distribution of the discretized operators arising from Schwarz iterations. By associating each operator sequence with the appropriate GLT symbol, we derive explicit expressions for the GLT symbols of the convergence factors for both additive and multiplicative Schwarz methods. The GLT-based spectral approach offers a unified and systematic understanding of how the spectrum evolves with mesh refinement and matrix size (in the algebraic case). Our analysis not only deepens the theoretical understanding of classical Schwarz methods, but also establishes a foundation for examining future restricted or hybrid Schwarz variants using symbolic spectral tools. These results enable the prediction of the remarkable efficiency of block Jacobi/Gauss--Seidel and block additive/multiplicative Schwarz preconditioners for GLT sequences, as further illustrated through a wide range of numerical experiments.
Harmonia Algorithm-Hardware Co-Design for Memory- and Compute-Efficient BFP-based LLM Inference
Authors: Xinyu Wang, Jieyu Li, Yanan Sun, Weifeng He
2026-02-04
Large Language Models (LLMs) are powerful but incur high memory and computation costs. Quantization is an effective solution, with INT weights and FP activations being widely adopted to preserve accuracy. Prior works further reduce FP overhead by using block floating point (BFP) activations in linear layers, but fail to extend BFP to attention layers due to severe accuracy degradation, limiting overall efficiency. To address this challenge, we propose Harmonia, an algorithm-hardware co-design framework that enables all-layer BFP activations with a configurable hardware architecture. First, we systematically explore BFP configurations to achieve a better trade-off between accuracy and activation compression across all layers. Second, to reduce KV-cache storage and computation in attention layers, we introduce an asymmetric bit-allocation strategy combined with a hybrid offline-online outlier smoothing technique. This allows aggressive KV-cache quantization from FP16 to 4-bit-mantissa BFP with only 0.3% average accuracy loss. Third, to fully exploit all-layer BFP activations, we design dedicated hardware components, including a reconfigurable PE supporting mixed data formats (BFP-INT and BFP-BFP), a real-time FP16-to-BFP converter, and a tiling-aware dataflow to reduce memory traffic. We evaluate Harmonia on GEMM operations in both linear and attention layers across eight widely used LLMs. Compared with prior works, Harmonia achieves 3.84x (up to 5.05x) higher area efficiency, 2.03x (up to 3.90x) better energy efficiency, and 3.08x (up to 4.62x) speedup on average.
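As background for the BFP format this abstract relies on, here is a minimal, self-contained sketch of block-floating-point quantization: each block of values shares one power-of-two scale (the "exponent") and stores short signed mantissas. The function name and rounding scheme are our illustration, not Harmonia's actual implementation.

```python
import numpy as np

def to_bfp(x, block_size=16, mantissa_bits=4):
    """Illustrative block-floating-point quantizer: each block of
    `block_size` values shares one power-of-two scale, and values are
    rounded to signed `mantissa_bits`-bit mantissas."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-x.size) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    qmax = 2 ** (mantissa_bits - 1) - 1
    max_mag = np.abs(blocks).max(axis=1, keepdims=True)
    # Shared power-of-two scale chosen so the block maximum fits in range.
    scale = 2.0 ** np.ceil(np.log2(np.maximum(max_mag, 1e-300) / qmax))
    mant = np.clip(np.round(blocks / scale), -qmax - 1, qmax)
    return (mant * scale).reshape(-1)[: x.size]
```

The key point the abstract builds on is visible here: within a block, quantization error is bounded by half the shared step size, so blocks containing outliers lose precision, which is why outlier smoothing matters for attention activations.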
SalFormer360 a transformer-based saliency estimation model for 360-degree videos
Authors: Mahmoud Z. A. Wahba, Francesco Barbato, Sara Baldoni, Federica Battisti
2026-02-04
Saliency estimation has received growing attention in recent years due to its importance in a wide range of applications. In the context of 360-degree video, it has been particularly valuable for tasks such as viewport prediction and immersive content optimization. In this paper, we propose SalFormer360, a novel saliency estimation model for 360-degree videos built on a transformer-based architecture. Our approach combines an existing encoder architecture, SegFormer, with a custom decoder. The SegFormer model was originally developed for 2D segmentation tasks and has been fine-tuned to adapt it to 360-degree content. To further enhance prediction accuracy, we incorporate a Viewing Center Bias to reflect user attention in 360-degree environments. Extensive experiments on the three largest benchmark datasets for saliency estimation demonstrate that SalFormer360 outperforms existing state-of-the-art methods. In terms of Pearson Correlation Coefficient, our model achieves 8.4% higher performance on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking compared to the previous state of the art.
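A viewing-center bias of the kind mentioned above can be illustrated with a small sketch: a latitude-dependent Gaussian prior over an equirectangular frame (viewers tend to look near the horizontal center line), multiplied into the raw saliency map. The function names and the exact prior shape are our assumptions; the paper's formulation may differ.

```python
import numpy as np

def center_bias(h, w, sigma_frac=0.2):
    """Hypothetical viewing-center bias for an equirectangular frame:
    a Gaussian prior over latitude, peaked at the equator."""
    rows = np.arange(h)
    lat = (rows + 0.5) / h - 0.5            # latitude in [-0.5, 0.5)
    prior = np.exp(-0.5 * (lat / sigma_frac) ** 2)
    return np.tile(prior[:, None], (1, w))  # same bias along longitude

def apply_bias(saliency, bias):
    """Modulate a raw saliency map by the prior and renormalize."""
    out = saliency * bias
    return out / out.sum()
```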
Nix and Fix Targeting 1000x Compression of 3D Gaussian Splatting with Diffusion Models
Authors: Cem Eteke, Enzo Tartaglione
2026-02-04
3D Gaussian Splatting (3DGS) revolutionized novel view rendering. Instead of inferring from dense spatial points, as implicit representations do, 3DGS uses Gaussians. This enables real-time performance but increases space requirements, hindering applications such as immersive streaming. 3DGS compression emerged as a field aimed at alleviating this issue. While impressive progress has been made, at low rates compression introduces artifacts that degrade visual quality significantly. We introduce NiFi, a method for extreme 3DGS compression through restoration via artifact-aware, diffusion-based one-step distillation. We show that our method achieves state-of-the-art perceptual quality at extremely low rates, down to 0.1 MB, and towards 1000x rate improvement over 3DGS at comparable perceptual performance. The code will be open-sourced upon acceptance.
LycheeDecode Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
Authors: Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, Min Zhang
2026-02-04
The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of streaming heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, even the full-attention baseline. Crucially, this is accomplished with up to a 2.7x speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
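The retrieval/streaming split described above can be sketched minimally: a retrieval head ranks past tokens by attention mass and keeps the top-k indices, and streaming heads restrict their attention to that shared subset. This is an illustrative reading with hypothetical names, not LycheeDecode's actual HardKuma-based selector.

```python
import numpy as np

def select_crucial_tokens(attn_scores, k):
    """A retrieval head keeps the indices of the k highest-scoring
    past tokens (illustrative stand-in for the learned selector)."""
    k = min(k, attn_scores.shape[-1])
    idx = np.argpartition(attn_scores, -k)[-k:]
    return np.sort(idx)

def streaming_head_attend(q, K, V, crucial_idx):
    """A streaming head computes attention restricted to the shared
    crucial-token subset instead of the full KV cache."""
    Ks, Vs = K[crucial_idx], V[crucial_idx]
    logits = Ks @ q / np.sqrt(q.size)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ Vs
```

The memory saving comes from streaming heads touching only `len(crucial_idx)` cache rows rather than the full context.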
Incongruity-sensitive access to highly compressed strings
Authors: Ferdinando Cicalese, Zsuzsanna Lipták, Travis Gagie, Gonzalo Navarro, Nicola Prezza, Cristian Urbina
2026-02-04
Random access to highly compressed strings -- represented by straight-line programs or Lempel-Ziv parses, for example -- is a well-studied topic. Random access to such strings in strongly sublogarithmic time is impossible in the worst case, but previous authors have shown how to support faster access to specific characters and their neighbourhoods. In this paper we explore whether, since better compression can impede access, we can support faster access to relatively incompressible substrings of highly compressed strings. We first show how, given a run-length compressed straight-line program (RLSLP) or a block tree, we can build a data structure that supports access to any character in time logarithmic in the length of the longest repeated substring containing that character. That is, the more incongruous a character is with respect to the characters around it in a certain sense, the faster we can support access to it. We then prove a similar but more powerful and sophisticated result for parsings in which phrases' sources do not overlap much larger phrases, with the query time depending also on the number of phrases we must copy from their sources to obtain the queried character.
C-Δθ Circuit-Restricted Weight Arithmetic for Selective Refusal
Authors: Aditya Kasliwal, Pratinav Seth, Vinay Kumar Sankarapu
2026-02-04
Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ: Circuit-Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update ΔθC supported only on that circuit (typically <5% of parameters). Applying ΔθC yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
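The weight-arithmetic step lends itself to a compact sketch: derive a boolean mask covering the top few percent of parameters by attribution magnitude (standing in for EAP-IG localization), then apply the update only inside that mask. The names and thresholding rule are our assumptions.

```python
import numpy as np

def circuit_mask_from_scores(attr_scores, frac=0.05):
    """Keep the top `frac` of parameters by attribution magnitude
    (a generic stand-in for EAP-IG-style circuit localization)."""
    k = max(1, int(frac * attr_scores.size))
    thresh = np.sort(np.abs(attr_scores), axis=None)[-k]
    return np.abs(attr_scores) >= thresh

def apply_circuit_restricted_update(theta, delta, circuit_mask):
    """Zero the update outside the circuit, so only circuit parameters
    change in the edited checkpoint."""
    assert theta.shape == delta.shape == circuit_mask.shape
    return theta + np.where(circuit_mask, delta, 0.0)
```

The resulting array is an ordinary parameter tensor, which is the point: the edit ships as a standard checkpoint with no runtime hooks.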
Model-Dowser Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models
Authors: Hyeontaek Hwang, Nguyen Dinh Son, Daeyoung Kim
2026-02-04
Fine-tuning Multimodal Large Language Models (MLLMs) on task-specific data is an effective way to improve performance on downstream applications. However, such adaptation often leads to a degradation in generalization on pretrained tasks, a phenomenon known as Catastrophic Forgetting. Existing methods that aim to mitigate this issue either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly with increasing model size. To address these limitations, we propose Model-Dowser, a novel sparse fine-tuning approach for MLLMs. Model-Dowser measures a principled importance score for each model parameter with respect to pretrained generalization (prior to downstream adaptation) by jointly considering weight magnitudes, input activations, and output sensitivities. During fine-tuning, Model-Dowser selectively preserves high-importance parameters and updates the remaining ones. Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, demonstrate that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.
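A minimal sketch of an importance score combining the three ingredients named in the abstract (weight magnitude, input activations, output sensitivities) might look as follows; the exact combination and the freezing policy in Model-Dowser may differ.

```python
import numpy as np

def dowser_importance(weight, input_act_norm, output_sens):
    """Hedged sketch: score each parameter by its magnitude times the
    norm of the input activation it multiplies times the sensitivity
    of the output it feeds."""
    # weight: (out, in); input_act_norm: (in,); output_sens: (out,)
    return np.abs(weight) * input_act_norm[None, :] * output_sens[:, None]

def protect_mask(importance, keep_frac=0.3):
    """Freeze (exclude from updates) the top `keep_frac` parameters."""
    k = max(1, int(keep_frac * importance.size))
    thresh = np.sort(importance, axis=None)[-k]
    return importance >= thresh
```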
Greedy-Gnorm A Gradient Matrix Norm-Based Alternative to Attention Entropy for Head Pruning
Authors: Yuxi Guo, Paul Sheridan
2026-02-04
Attention head pruning has emerged as an effective technique for transformer model compression, an increasingly important goal in the era of Green AI. However, existing pruning methods often rely on static importance scores, which fail to capture the evolving role of attention heads during iterative removal. We propose Greedy-Gradient norm (Greedy-Gnorm), a novel head pruning algorithm that dynamically recalculates head importance after each pruning step. Specifically, each head is scored by the elementwise product of the l2-norms of its Q/K/V gradient blocks, as estimated from a hold-out validation set and updated at every greedy iteration. This dynamic approach to scoring mitigates stale rankings and better reflects gradient-informed importance as pruning progresses. Extensive experiments on BERT, ALBERT, RoBERTa, and XLM-RoBERTa demonstrate that Greedy-Gnorm consistently preserves accuracy under substantial head removal, outperforming attention entropy. By effectively reducing model size while maintaining task performance, Greedy-Gnorm offers a promising step toward more energy-efficient transformer model deployment.
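The scoring rule is stated concretely enough to sketch: a head's score is the product of the l2-norms of its Q, K, and V gradient blocks, recomputed after every removal. The greedy loop below is our illustrative rendering, with a caller-supplied `recompute` standing in for re-estimating gradients on the hold-out set.

```python
import numpy as np

def gnorm_score(grad_q, grad_k, grad_v):
    """Greedy-Gnorm head score: the product of the l2-norms of the
    head's Q, K, and V gradient blocks."""
    return (np.linalg.norm(grad_q) * np.linalg.norm(grad_k)
            * np.linalg.norm(grad_v))

def greedy_prune(head_grads, n_prune, recompute):
    """Greedy loop: score all remaining heads, remove the lowest-scoring
    one, then re-estimate gradients (via `recompute`) and repeat."""
    kept = set(head_grads)
    for _ in range(n_prune):
        scores = {h: gnorm_score(*head_grads[h]) for h in kept}
        kept.remove(min(scores, key=scores.get))
        head_grads = recompute(kept)  # fresh gradients after each removal
    return kept
```

Recomputing inside the loop is the paper's key departure from static scoring: a head's gradients, and hence its rank, change once its neighbors are removed.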
DOS Dual-Flow Orthogonal Semantic IDs for Recommendation in Meituan
Authors: Junwei Yin, Senjie Kou, Changhao Li, Shuli Wang, Xue Wei, Yinqiu Huang, Yinhua Zhu, Haitao Wang, Xingxing Wang
2026-02-04
Semantic IDs serve as a key component in generative recommendation systems. They not only incorporate open-world knowledge from large language models (LLMs) but also compress the semantic space to reduce generation difficulty. However, existing methods suffer from two major limitations: (1) the lack of contextual awareness in generation tasks leads to a gap between the Semantic ID codebook space and the generation space, resulting in suboptimal recommendations; and (2) suboptimal quantization methods exacerbate semantic loss in LLMs. To address these issues, we propose the Dual-Flow Orthogonal Semantic IDs (DOS) method. Specifically, DOS employs a user-item dual-flow framework that leverages collaborative signals to align the Semantic ID codebook space with the generation space. Furthermore, we introduce an orthogonal residual quantization scheme that rotates the semantic space to an appropriate orientation, thereby maximizing semantic preservation. Extensive offline experiments and online A/B testing demonstrate the effectiveness of DOS. The proposed method has been successfully deployed in Meituan's mobile application, serving hundreds of millions of users.
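Orthogonal residual quantization as described (rotate the space, then quantize residuals level by level) can be sketched as follows; the rotation matrix, codebooks, and greedy nearest-codeword assignment are generic illustrations, not DOS's learned components.

```python
import numpy as np

def orthogonal_residual_quantize(x, codebooks, R):
    """Greedy residual quantization after an orthogonal rotation R:
    rotate embeddings, quantize the residual at each level against that
    level's codebook, then map the reconstruction back."""
    r = x @ R                                # rotate to the new basis
    ids, recon = [], np.zeros_like(r)
    for cb in codebooks:                     # one codebook per level
        diff = (r - recon)[:, None, :] - cb[None, :, :]
        idx = (diff ** 2).sum(-1).argmin(1)  # nearest codeword per row
        ids.append(idx)
        recon = recon + cb[idx]
    return ids, recon @ R.T                  # back to the original space
```

Because R is orthogonal, distances are preserved; the orientation of the rotation only changes how reconstruction error is distributed across levels, which is the degree of freedom DOS optimizes.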
Seg-ReSearch Segmentation with Interleaved Reasoning and External Search
Authors: Tianming Liang, Qirui Du, Jian-Fang Hu, Haichao Jiang, Zicheng Lin, Wei-Shi Zheng
2026-02-04
Segmentation based on language has been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined by the frozen internal knowledge of MLLMs, which limits their potential for real-world scenarios that involve up-to-date information or domain-specific concepts. In this work, we propose Seg-ReSearch, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg-ReSearch empowers segmentation systems to handle dynamic, open-world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step-wise supervision. For evaluation, we construct OK-VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK-VOS and two existing reasoning segmentation benchmarks demonstrate that Seg-ReSearch improves on state-of-the-art approaches by a substantial margin. Code and data will be released at https://github.com/iSEE-Laboratory/Seg-ReSearch.
RASA Routing-Aware Safety Alignment for Mixture-of-Experts Models
Authors: Jiacheng Liang, Yuhui Wang, Tanqiu Jiang, Ting Wang
2026-02-04
Mixture-of-Experts (MoE) language models introduce unique challenges for safety alignment due to their routing mechanisms, which can enable degenerate optimization behaviors under standard full-parameter fine-tuning. In our preliminary experiments, we observe that naively applying full-parameter safety fine-tuning to MoE models can reduce attack success rates through routing or expert dominance effects, rather than by directly repairing Safety-Critical Experts. To address this challenge, we propose RASA, a routing-aware expert-level alignment framework that explicitly repairs Safety-Critical Experts while preventing routing-based bypasses. RASA identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only these experts under fixed routing, and subsequently enforces routing consistency with safety-aligned contexts. Across two representative MoE architectures and a diverse set of jailbreak attacks, RASA achieves near-perfect robustness, strong cross-attack generalization, and substantially reduced over-refusal, while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA. Our results suggest that robust MoE safety alignment benefits from targeted expert repair rather than global parameter updates, offering a practical and architecture-preserving alternative to prior approaches.
The Stretto Execution Engine for LLM-Augmented Data Systems
Authors: Gabriele Sanmartino, Matthias Urban, Paolo Papotti, Carsten Binnig
2026-02-04
LLM-augmented data systems enable semantic querying over structured and unstructured data, but executing queries with LLM-powered operators introduces a fundamental runtime--accuracy trade-off. In this paper, we present Stretto, a new execution engine that provides end-to-end query guarantees while efficiently navigating this trade-off in a holistic manner. To this end, Stretto formulates query planning as a constrained optimization problem and uses a gradient-based optimizer to jointly select operator implementations and allocate error budgets across pipelines. Moreover, to enable fine-grained execution choices, Stretto introduces a novel idea for how LLM caching can be used to realize a spectrum of different physical operators that transform a sparse design space into a dense continuum of runtime--accuracy trade-offs. Experiments show that Stretto outperforms state-of-the-art systems while consistently meeting quality guarantees.
Depth-Wise Emergence of Prediction-Centric Geometry in Large Language Models
Authors: Shahar Haim, Daniel C McNamee
2026-02-04
We show that decoder-only large language models exhibit a depth-wise transition from context-processing to prediction-forming phases of computation, accompanied by a reorganization of representational geometry. Using a unified framework combining geometric analysis with mechanistic intervention, we demonstrate that late-layer representations implement a structured geometric code that enables selective causal control over token prediction. Specifically, the angular organization of the representation geometry parametrizes prediction distributional similarity, while representation norms encode context-specific information that does not determine the prediction. Together, these results provide a mechanistic-geometric account of the dynamics of transforming context into predictions in LLMs.
HoRD Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation
Authors: Puyue Wang, Jiawei Hu, Yan Gao, Junyan Wang, Yu Zhang, Gillian Dobbie, Tao Gu, Wafa Johal, Ting Dang, Hong Jia
2026-02-04
Humanoid robots can suffer significant performance drops under small changes in dynamics, task specifications, or environment setup. We propose HoRD, a two-stage learning framework for robust humanoid control under domain shift. First, we train a high-performance teacher policy via history-conditioned reinforcement learning, where the policy infers latent dynamics context from recent state--action trajectories to adapt online to diverse randomized dynamics. Second, we perform online distillation to transfer the teacher's robust control capabilities into a transformer-based student policy that operates on root-relative 3D joint keypoint trajectories. By combining history-conditioned adaptation with online distillation, HoRD enables a single policy to adapt zero-shot to unseen domains without per-domain retraining. Extensive experiments show HoRD outperforms strong baselines in robustness and transfer, especially under unseen domains and external perturbations. Code and project page are available at https://tonywang-0517.github.io/hord/.
LCUDiff Latent Capacity Upgrade Diffusion for Faithful Human Body Restoration
Authors: Jue Gong, Zihan Zhou, Jingkai Wang, Shu Li, Libo Liu, Jianliang Lan, Yulun Zhang
2026-02-04
Existing methods for restoring degraded human-centric images often struggle with insufficient fidelity, particularly in human body restoration (HBR). Recent diffusion-based restoration methods commonly adapt pre-trained text-to-image diffusion models, where the variational autoencoder (VAE) can significantly bottleneck restoration fidelity. We propose LCUDiff, a stable one-step framework that upgrades a pre-trained latent diffusion model from a 4-channel latent space to a 16-channel latent space. For VAE fine-tuning, channel splitting distillation (CSD) is used to keep the first four channels aligned with pre-trained priors while allocating the additional channels to effectively encode high-frequency details. We further design prior-preserving adaptation (PPA) to smoothly bridge the mismatch between 4-channel diffusion backbones and the higher-dimensional 16-channel latent. In addition, we propose a decoder router (DeR) for per-sample decoder routing using restoration-quality score annotations, which improves visual quality across diverse conditions. Experiments on synthetic and real-world datasets show competitive results with higher fidelity and fewer artifacts under mild degradations, while preserving one-step efficiency. The code and model will be available at https://github.com/gobunu/LCUDiff.
Enabling Real-Time Colonoscopic Polyp Segmentation on Commodity CPUs via Ultra-Lightweight Architecture
Authors: Weihao Gao, Zhuo Deng, Zheng Gong, Lan Ma
2026-02-04
Early detection of colorectal cancer hinges on real-time, accurate polyp identification and resection. Yet current high-precision segmentation models rely on GPUs, making them impractical to deploy in primary hospitals, mobile endoscopy units, or capsule robots. To bridge this gap, we present the UltraSeg family, operating in an extreme-compression regime (<0.3 M parameters). UltraSeg-108K (0.108 M parameters) is optimized for single-center data, while UltraSeg-130K (0.13 M parameters) generalizes to multi-center, multi-modal images. By jointly optimizing encoder-decoder widths, incorporating constrained dilated convolutions to enlarge receptive fields, and integrating a cross-layer lightweight fusion module, the models achieve 90 FPS on a single CPU core without sacrificing accuracy. Evaluated on seven public datasets, UltraSeg retains >94% of the Dice score of a 31 M-parameter U-Net while using only 0.4% of its parameters, establishing a strong, clinically viable baseline for the extreme-compression domain and offering an immediately deployable solution for resource-constrained settings. This work provides not only a CPU-native solution for colonoscopy but also a reproducible blueprint for broader minimally invasive surgical vision applications. Source code is publicly available to ensure reproducibility and facilitate future benchmarking.
TurboBoA Faster and Exact Attention-aware Quantization without Backpropagation
Authors: Junhan Kim, Yeo Jeong Park, Seungwoo Son, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon
2026-02-04
The rapid growth of large language models (LLMs) has heightened the importance of post-training quantization (PTQ) for reducing memory and computation costs. Among PTQ methods, GPTQ has gained significant attention for its efficiency, enabling billion-scale LLMs to be quantized within a few GPU hours. However, GPTQ's assumption of layer-wise independence leads to severe accuracy drops in low-bit quantization regimes. Recently, BoA improved upon GPTQ by incorporating inter-layer dependencies within attention modules, but its reliance on sequential quantization across all out-channels makes it substantially less efficient. In this paper, we propose TurboBoA, a new backpropagation-free PTQ algorithm that preserves the accuracy benefits of BoA while significantly accelerating the process. TurboBoA introduces three key innovations: (i) joint quantization of multiple out-channels with a closed-form error compensation rule, which reduces sequential bottlenecks and yields more than a three-fold speedup; (ii) a correction mechanism for errors propagated from preceding quantized layers; and (iii) adaptive grid computation with coordinate descent refinement to maintain alignment during iterative updates. Extensive experiments demonstrate that TurboBoA delivers substantial speedups over BoA while consistently improving accuracy. When combined with outlier suppression techniques, it achieves state-of-the-art results in both weight-only and weight-activation quantization. The code will be available at https://github.com/SamsungLabs/TurboBoA.
SparVAR Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
Authors: Zekun Li, Ning Wang, Tongxin Bai, Changwang Mei, Peisong Wang, Shuang Qiu, Jian Cheng
2026-02-04
Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next-scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior methods often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from an earlier decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism, enabling high-efficiency sparse attention computation at large scales. Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves faster forward speed than FlashAttention. Extensive experiments demonstrate that SparVAR can reduce the generation time of an 8B model producing high-resolution images to around one second, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a clear speed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparVAR attains an even greater speed-up while maintaining competitive visual generation quality. Code is available at https://github.com/CAS-CLab/SparVAR.
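Two of the attention properties exploited here, sinks and locality, can be illustrated with a toy boolean mask: the first few key positions stay visible to every query, and each query additionally sees a local causal window. The actual cross-scale index-mapped pattern in SparVAR is more involved than this sketch.

```python
import numpy as np

def sink_plus_local_mask(q_len, kv_len, n_sink=4, window=8):
    """Illustrative sparse attention mask: attention sinks (first keys
    always visible) plus a local causal window per query. Queries are
    assumed aligned to the end of the key sequence."""
    mask = np.zeros((q_len, kv_len), dtype=bool)
    mask[:, :n_sink] = True                        # sink tokens
    for i in range(q_len):
        c = kv_len - q_len + i                     # aligned key position
        mask[i, max(0, c - window): c + 1] = True  # local causal window
    return mask
```

The cost saving is proportional to mask density: each query touches roughly `n_sink + window` keys instead of `kv_len`.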
Pruning Minimal Reasoning Graphs for Efficient Retrieval-Augmented Generation
Authors: Ning Wang, Kuanyan Zhu, Daniel Yuehwoon Yee, Yitang Gao, Shiying Huang, Zirun Xu, Sainyam Galhotra
2026-02-04
Retrieval-augmented generation (RAG) is now standard for knowledge-intensive tasks, but most systems still treat every query as fresh, repeatedly re-retrieving long passages and re-reasoning from scratch, inflating tokens, latency, and cost. We present AutoPrunedRetriever, a graph-style RAG system that persists the minimal reasoning subgraph built for earlier questions and incrementally extends it for later ones. AutoPrunedRetriever stores entities and relations in a compact, ID-indexed codebook and represents questions, facts, and answers as edge sequences, enabling retrieval and prompting over symbolic structure instead of raw text. To keep the graph compact, we apply a two-layer consolidation policy (fast ANN/KNN alias detection plus selective k-means once a memory threshold is reached) and prune low-value structure, while prompts retain only cluster representatives and genuinely new evidence. We instantiate two front ends: AutoPrunedRetriever-REBEL, which uses REBEL as a triplet parser, and AutoPrunedRetriever-llm, which swaps in an LLM extractor. On GraphRAG-Benchmark (Medical and Novel), both variants achieve state-of-the-art complex reasoning accuracy, improving over HippoRAG2 by roughly 9--11 points, and remain competitive on contextual summarization and generation. On our harder STEM and TV benchmarks, AutoPrunedRetriever again ranks first, while using up to two orders of magnitude fewer tokens than graph-heavy baselines, making it a practical substrate for long-running sessions, evolving corpora, and multi-agent pipelines.
Internalizing LLM Reasoning via Discovery and Replay of Latent Actions
Authors: Zhenning Shi, Yijia Zhu, Junhan Shi, Xun Zhang, Lei Wang, Congcong Miao
2026-02-04
The internalization of chain-of-thought processes into hidden states has emerged as a highly efficient paradigm for scaling test-time compute. However, existing activation steering methods rely on static control vectors that fail to adapt to the non-stationary evolution of complex reasoning tasks. To address this limitation, we propose STIR (Self-Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem. STIR introduces a synergistic three-stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes to crystallize steering primitives; (2) control basis construction curates a compact, geometrically diverse tool library; and (3) value-modulated trajectory intervention dynamically injects context-specific impulses via anchor-based gating. Extensive experiments on six arithmetic and logical benchmarks across four representative models demonstrate that STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding. These findings demonstrate that the benefits of explicit chain-of-thought can be realized through dynamic latent trajectory control, internalizing the reasoning process to bypass explicit generation while achieving superior fidelity. Our code is available at https://github.com/sznnzs/-Latent-Action.
From Assumptions to Actions Turning LLM Reasoning into Uncertainty-Aware Planning for Embodied Agents
Authors: SeungWon Seo, SooBin Lim, SeongRae Noh, Haneul Kim, HyeongYeop Kang
2026-02-04
Embodied agents operating in multi-agent, partially observable, and decentralized environments must plan and act despite pervasive uncertainty about hidden objects and collaborators' intentions. Recent advances in applying Large Language Models (LLMs) to embodied agents have addressed many long-standing challenges, such as high-level goal decomposition and online adaptation. Yet uncertainty is still primarily mitigated through frequent inter-agent communication. This incurs substantial token and time costs, and can disrupt established workflows when human partners are involved. We introduce PCE, a Planner-Composer-Evaluator framework that converts the fragmented assumptions latent in LLM reasoning traces into a structured decision tree. Internal nodes encode environment assumptions and leaves map to actions; each path is then scored by scenario likelihood, goal-directed gain, and execution cost to guide rational action selection without heavy communication. Across two challenging multi-agent benchmarks (C-WAH and TDW-MAT) and three diverse LLM backbones, PCE consistently outperforms communication-centric baselines in success rate and task efficiency while showing comparable token usage. Ablation results indicate that the performance gains obtained by scaling model capacity or reasoning depth persist even when PCE is applied, while PCE consistently raises the baseline across both capacity and reasoning-depth scales, confirming that structured uncertainty handling complements both forms of scaling. A user study further demonstrates that PCE produces communication patterns that human partners perceive as more efficient and trustworthy. Together, these results establish a principled route for turning latent LLM assumptions into reliable strategies for uncertainty-aware planning.
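The path-scoring step described above (scenario likelihood, goal-directed gain, execution cost per root-to-leaf path) suggests a simple selection rule. The sketch below uses a hypothetical linear scoring function, which may not match the paper's exact combination.

```python
from dataclasses import dataclass

@dataclass
class Path:
    """One root-to-leaf path: an assumed scenario ending in an action."""
    action: str
    likelihood: float  # how plausible the assumed scenario is
    gain: float        # goal-directed value if the scenario holds
    cost: float        # execution cost of the action

def select_action(paths):
    """Weight each path's gain by scenario likelihood, subtract cost,
    and pick the best action (illustrative scoring rule)."""
    return max(paths, key=lambda p: p.likelihood * p.gain - p.cost).action
```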
Disentangling Causal Importance from Emergent Structure in Multi-Expert Orchestration
Authors: Sudipto Ghosh, Sujoy Nath, Sunny Manchanda, Tanmoy Chakraborty
2026-02-04
Multi-expert systems, where multiple Large Language Models (LLMs) collaborate to solve complex tasks, are increasingly adopted for high-performance reasoning and generation. However, the orchestration policies governing expert interaction and sequencing remain largely opaque. We introduce INFORM, an interpretability analysis that treats orchestration as an explicit, analyzable computation, enabling the decoupling of expert interaction structure, execution order, and causal attribution. We use INFORM to evaluate an orchestrator on GSM8K, HumanEval, and MMLU using a homogeneous consortium of ten instruction-tuned experts drawn from LLaMA-3.1 8B, Qwen-3 8B, and DeepSeek-R1 8B, with controlled decoding-temperature variation, and a secondary heterogeneous consortium spanning 1B-7B parameter models. Across tasks, routing dominance is a poor proxy for functional necessity. We reveal a divergence between relational importance, captured by routing mass and interaction topology, and intrinsic importance, measured via gradient-based causal attribution: frequently selected experts often act as interaction hubs with limited causal influence, while sparsely routed experts can be structurally critical. Orchestration behaviors emerge asynchronously, with expert centralization preceding stable routing confidence and expert ordering remaining non-deterministic. Targeted ablations show that masking intrinsically important experts induces disproportionate collapse in interaction structure compared to masking frequent peers, confirming that INFORM exposes causal and structural dependencies beyond accuracy metrics alone.
MiniRec Data-Efficient Reinforcement Learning for LLM-based Recommendation
Authors: Lin Wang, Yang Zhang, Jingfan Chen, Xiaoyan Zhao, Fengbin Zhu, Qing Li, Tat-Seng Chua
2026-02-04
The integration of reinforcement learning (RL) into large language models (LLMs) has opened new opportunities for recommender systems by eliciting reasoning and improving user preference modeling. However, RL-based recommendation faces significant efficiency challenges, making full-data training costly. Existing data selection methods define sample value based on learnability or representativeness, yet their loss-, gradient-, or dataset-coverage-driven criteria often misalign with RL learning dynamics, resulting in suboptimal performance. To address this, we propose MiniRec, a data selection framework tailored for RL-based recommendation. MiniRec evaluates sample learnability using key RL signals -- rewards -- filtering out samples that are too easy (consistently high reward) or too difficult (consistently low reward). It assesses representativeness by aligning sample gradients with the approximated "ideal" global RL optimization trajectory, selecting samples that mainly drive model updates, and it also enforces diversity to reduce redundancy. Combined with a curriculum learning strategy that moves from easy to hard samples, MiniRec significantly reduces training cost while largely preserving performance. Extensive experiments demonstrate MiniRec's effectiveness, highlighting the importance of reward-aligned, trajectory-informed data selection in RL-based recommendation.
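A minimal sketch of the selection pipeline described above (reward-band filtering, gradient alignment with an approximated global direction, and easy-to-hard ordering). The thresholds, the cosine criterion, and the omission of the diversity step are assumptions of this sketch, not details from the paper:

```python
import math

def select_samples(samples, global_dir, k, r_lo=0.2, r_hi=0.8):
    """samples: list of (reward, gradient) pairs; global_dir: approximated
    ideal RL update direction. Thresholds r_lo/r_hi are illustrative."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv + 1e-12)

    # 1) Learnability: drop samples whose reward is too high (too easy)
    #    or too low (too hard) to yield a useful RL learning signal.
    pool = [(r, g) for r, g in samples if r_lo <= r <= r_hi]
    # 2) Representativeness: keep the k samples whose gradients best align
    #    with the approximated global optimization trajectory.
    pool.sort(key=lambda s: cos(s[1], global_dir), reverse=True)
    # 3) Curriculum: present the chosen subset from easy (high reward) to hard.
    return sorted(pool[:k], key=lambda s: -s[0])

chosen = select_samples(
    [(0.95, (1.0, 0.0)), (0.5, (1.0, 0.1)), (0.4, (-1.0, 0.0)),
     (0.05, (0.0, 1.0)), (0.6, (0.9, 0.2))],
    global_dir=(1.0, 0.0), k=2)
```

With these toy inputs the reward-band filter removes the 0.95 and 0.05 samples, alignment discards the anti-aligned gradient, and the survivors come back ordered easy-to-hard.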
KVSmooth Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
Authors: Siyu Jiang, Feiyang Chen, Xiaojin Zhang, Kun He
2026-02-04
Despite the significant progress of Multimodal Large Language Models (MLLMs) across diverse tasks, hallucination -- corresponding to the generation of visually inconsistent objects, attributes, or relations -- remains a major obstacle to their reliable deployment. Unlike pure language models, MLLMs must ground their generation process in visual inputs. However, existing models often suffer from semantic drift during decoding, causing outputs to diverge from visual facts as the sequence length increases. To address this issue, we propose KVSmooth, a training-free and plug-and-play method that mitigates hallucination by performing attention-entropy-guided adaptive smoothing on hidden states. Specifically, KVSmooth applies an exponential moving average (EMA) to both keys and values in the KV-Cache, while dynamically quantifying the sink degree of each token through the entropy of its attention distribution to adaptively adjust the smoothing strength. Unlike computationally expensive retraining or contrastive decoding methods, KVSmooth operates efficiently during inference without additional training or model modification. Extensive experiments demonstrate that KVSmooth significantly reduces hallucination while improving overall performance, achieving higher precision and recall simultaneously. In contrast, prior methods often improve one at the expense of the other, validating the effectiveness and generality of our approach.
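The KV-cache smoothing step can be sketched as follows. The entropy-to-strength mapping, its direction (stronger smoothing for high-entropy tokens), and the `beta_max` cap are assumptions of this sketch, not values from the paper:

```python
import math

def attn_entropy(p):
    # Shannon entropy of one token's attention distribution.
    return -sum(q * math.log(q + 1e-12) for q in p)

def ema_smooth_kv(keys, values, attn_rows, beta_max=0.5):
    """Attention-entropy-guided EMA over a KV cache. keys/values are lists
    of vectors; attn_rows[i] is the attention distribution of token i.
    Assumption: low-entropy (sink-like) tokens get weaker smoothing so
    their content is preserved; diffuse tokens are smoothed more."""
    max_ent = math.log(len(attn_rows[0]))
    sk, sv = [keys[0][:]], [values[0][:]]
    for i in range(1, len(keys)):
        beta = beta_max * attn_entropy(attn_rows[i]) / max_ent
        sk.append([(1 - beta) * k + beta * pk for k, pk in zip(keys[i], sk[-1])])
        sv.append([(1 - beta) * v + beta * pv for v, pv in zip(values[i], sv[-1])])
    return sk, sv

sk, sv = ema_smooth_kv(keys=[[1.0], [0.0]], values=[[2.0], [0.0]],
                       attn_rows=[[0.5, 0.5], [0.5, 0.5]])
```

With uniform attention the smoothing strength hits the cap, so the second token's key/value is pulled halfway toward the running average.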
Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog
Authors: Yiran Zhao, Shengyang Zhou, Zijian Wu, Tongyan Hu, Yuhui Xu, Rengan Dou, Kenji Kawaguchi, Shafiq Joty, Junnan Li, Michael Qizhe Shieh
2026-02-04
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but their substantial size often demands significant computational resources. To reduce resource consumption and accelerate inference, it is essential to eliminate redundant parameters without compromising performance. However, conventional pruning methods that directly remove such parameters often lead to a dramatic drop in performance on reasoning tasks, and require extensive post-training to recover the lost capabilities. In this work, we propose a gradual compacting method that divides the pruning process into multiple fine-grained iterations, applying a Prune-Tune Loop (PTL) at each stage to incrementally reduce model size while restoring performance with finetuning. This iterative approach, reminiscent of the "boiling frog" effect, enables the model to be progressively compressed without abrupt performance loss. Experimental results show that PTL can compress LLMs to nearly half their original size with only lightweight post-training, while maintaining performance comparable to the original model on reasoning tasks. Moreover, PTL is flexible and can be applied to various pruning strategies, such as neuron pruning and layer pruning, as well as different post-training methods, including continual pre-training and reinforcement learning. Additionally, experimental results confirm the effectiveness of PTL on a variety of tasks beyond mathematical reasoning, such as code generation, demonstrating its broad applicability.
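The iterative schedule is easy to state concretely. The geometric per-stage keep ratio below is one plausible choice rather than the paper's exact schedule, and `tune` stands in for any recovery step (continual pre-training, RL, etc.):

```python
def prune_tune_loop(model_size, target_size, n_stages, tune):
    """Gradual Prune-Tune Loop sketch: instead of pruning to the target in
    one shot, shrink by a small ratio per stage and recover performance
    with light finetuning in between."""
    ratio = (target_size / model_size) ** (1.0 / n_stages)  # per-stage keep ratio
    sizes, size = [], model_size
    for _ in range(n_stages):
        size *= ratio          # prune a small fraction of neurons/layers
        tune(size)             # restore capability before the next cut
        sizes.append(size)
    return sizes

# Halve a 100-unit model over five prune-tune stages (~13% pruned per stage).
sizes = prune_tune_loop(100.0, 50.0, n_stages=5, tune=lambda size: None)
```

Each stage removes only a small slice, which is what lets the finetuning step keep up with the capability loss.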
AppleVLM End-to-end Autonomous Driving with Advanced Perception and Planning-Enhanced Vision-Language Models
Authors: Yuxuan Han, Kunyuan Wu, Qianyi Shao, Renxiang Xiao, Zilu Wang, Cansen Jiang, Yi Xiao, Liang Hu, Yunjiang Lou
2026-02-04
End-to-end autonomous driving has emerged as a promising paradigm integrating perception, decision-making, and control within a unified learning framework. Recently, Vision-Language Models (VLMs) have gained significant attention for their potential to enhance the robustness and generalization of end-to-end driving models in diverse and unseen scenarios. However, existing VLM-based approaches still face challenges, including suboptimal lane perception, language understanding biases, and difficulties in handling corner cases. To address these issues, we propose AppleVLM, an advanced perception and planning-enhanced VLM model for robust end-to-end driving. AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making. Firstly, the vision encoder fuses spatial-temporal information from multi-view images across multiple timesteps using a deformable mechanism, enhancing robustness to camera variations and facilitating scalable deployment across different vehicle platforms. Secondly, unlike traditional VLM-based approaches, AppleVLM introduces a dedicated planning modality that encodes explicit Bird's-Eye-View spatial information, mitigating language biases in navigation instructions. Finally, a VLM decoder fine-tuned by a hierarchical Chain-of-Thought integrates vision, language, and planning features to output robust driving waypoints. We evaluate AppleVLM in closed-loop experiments on two CARLA benchmarks, achieving state-of-the-art driving performance. Furthermore, we deploy AppleVLM on an AGV platform and successfully showcase real-world end-to-end autonomous driving in complex outdoor environments.
CoLT Reasoning with Chain of Latent Tool Calls
Authors: Fangwei Zhu, Zhifang Sui
2026-02-04
Chain-of-Thought (CoT) is a critical technique for enhancing the reasoning ability of Large Language Models (LLMs), and latent reasoning methods have been proposed to accelerate the inefficient token-level reasoning chain. We notice that existing latent reasoning methods generally require model structure augmentation and exhaustive training, limiting their broader applicability. In this paper, we propose CoLT, a novel framework that implements latent reasoning as ``tool calls''. Instead of reasoning entirely in the latent space, CoLT generates seed tokens that contain the information of a reasoning step. When a latent tool call is triggered, a smaller external model takes the hidden states of the seed tokens as its input and unpacks them back into a full reasoning step. In this way, we can ensure that the main model reasons in the explicit token space, preserving its ability while improving efficiency. Experimental results on four mathematical datasets demonstrate that CoLT achieves higher accuracy and shorter reasoning length than baseline latent models, and is compatible with reinforcement learning algorithms and different decoder structures.
Post-Quantum Identity-Based TLS for 5G Service-Based Architecture and Cloud-Native Infrastructure
Authors: Vipin Kumar Rathi, Lakshya Chopra, Nikhil Kumar Rajput
2026-02-04
Cloud-native application platforms and latency-sensitive systems such as 5G Core networks rely heavily on certificate-based Public Key Infrastructure (PKI) and mutual TLS to secure service-to-service communication. While effective, this model introduces significant operational and performance overhead, which is further amplified in the post-quantum setting due to large certificates and expensive signature verification. In this paper, we present a certificate-free authentication framework for private distributed systems based on post-quantum Identity-Based Encryption (IBE). Our design replaces certificate- and signature-based authentication with identity-derived keys and identity-based key encapsulation, enabling mutually authenticated TLS connections without certificate transmission or validation. We describe an IBE-based replacement for private PKI, including identity lifecycle management, and show how it can be instantiated using a threshold Private Key Generator (T-PKG). We apply this framework to cloud-native application deployments and latency-sensitive 5G Core networks. In particular, we demonstrate how identity-based TLS integrates with the 5G Service-Based Architecture while preserving security semantics and 3GPP requirements, and we show how the same architecture can replace private PKI in Kubernetes, including its control plane, without disrupting existing trust domains or deployment models.
Following the TRAIL Predicting and Explaining Tomorrow's Hits with a Fine-Tuned LLM
Authors: Yinan Zhang, Zhixi Chen, Jiazheng Jing, Zhiqi Shen
2026-02-04
Large Language Models (LLMs) have been widely applied across multiple domains for their broad knowledge and strong reasoning capabilities. However, applying them to recommender systems is challenging, since it is hard for LLMs to extract user preferences from large user-item logs, and real-time per-user ranking over the full catalog is too time-consuming to be practical. Moreover, many existing recommender systems focus solely on ranking items while overlooking explanations, which could help improve predictive accuracy and make recommendations more convincing to users. Inspired by recent works that achieve strong recommendation performance by forecasting near-term item popularity, we propose TRAIL (TRend and explAnation Integrated Learner). TRAIL is a fine-tuned LLM that jointly predicts short-term item popularity and generates faithful natural-language explanations. It employs contrastive learning with positive and negative pairs to align its scores and explanations with structured trend signals, yielding accurate and explainable popularity predictions. Extensive experiments show that TRAIL outperforms strong baselines and produces coherent, well-grounded explanations.
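The contrastive alignment can be illustrated with a standard InfoNCE-style loss over embedding pairs. The embeddings, temperature, and cosine similarity below are assumptions of the sketch, not TRAIL's actual objective:

```python
import math

def info_nce(anchor, positive, negatives, tau=0.1):
    """Contrastive loss: pull the model's score/explanation embedding
    (anchor) toward the structured trend signal it should reflect
    (positive) and away from mismatched signals (negatives)."""
    def sim(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)) + 1e-12)

    logits = [sim(anchor, positive) / tau] + [sim(anchor, n) / tau for n in negatives]
    m = max(logits)  # stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]   # -log softmax probability of the positive pair

trend, wrong_trends = (1.0, 0.0), [(0.0, 1.0), (-1.0, 0.0)]
loss_aligned    = info_nce((1.0, 0.0), trend, wrong_trends)  # matches its trend
loss_misaligned = info_nce((0.0, 1.0), trend, wrong_trends)  # matches a negative
```

An explanation embedding that tracks its trend signal incurs near-zero loss, while one aligned with a negative pair is heavily penalized.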
Adaptive 1D Video Diffusion Autoencoder
Authors: Yao Teng, Minxuan Lin, Xian Liu, Shuai Wang, Xiao Yang, Xihui Liu
2026-02-04
Recent video generation models largely rely on video autoencoders that compress pixel-space videos into latent representations. However, existing video autoencoders suffer from three major limitations: (1) fixed-rate compression that wastes tokens on simple videos, (2) inflexible CNN architectures that prevent variable-length latent modeling, and (3) deterministic decoders that struggle to recover appropriate details from compressed latents. To address these issues, we propose One-Dimensional Diffusion Video Autoencoder (One-DVA), a transformer-based framework for adaptive 1D encoding and diffusion-based decoding. The encoder employs query-based vision transformers to extract spatiotemporal features and produce latent representations, while a variable-length dropout mechanism dynamically adjusts the latent length. The decoder is a pixel-space diffusion transformer that reconstructs videos with the latents as input conditions. With a two-stage training strategy, One-DVA achieves performance comparable to 3D-CNN VAEs on reconstruction metrics at identical compression ratios. More importantly, it supports adaptive compression and thus can achieve higher compression ratios. To better support downstream latent generation, we further regularize the One-DVA latent distribution for generative modeling and fine-tune its decoder to mitigate artifacts caused by the generation process.
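One way to realize a variable-length latent dropout is to keep a random-length prefix of the 1D token sequence during training, so the decoder learns to reconstruct from any budget. The prefix-keep form and the `min_keep` bound are assumptions of this sketch, not details from the paper:

```python
import random

def variable_length_dropout(latents, min_keep=4, rng=None):
    """Keep a random-length prefix of the 1D latent sequence. At inference,
    simple videos can then be encoded with fewer tokens (higher compression)
    while complex ones use the full budget."""
    rng = rng or random.Random()
    keep = rng.randint(min_keep, len(latents))  # inclusive bounds
    return latents[:keep]

latents = list(range(16))                       # stand-in for 16 latent tokens
out = variable_length_dropout(latents, min_keep=4, rng=random.Random(0))
```

Because only a prefix survives, the truncated sequence is always a valid (coarser) encoding of the same video.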
OAT Ordered Action Tokenization
Authors: Chaoqi Liu, Xiaoshen Han, Jiawei Gao, Yue Zhao, Haonan Chen, Yilun Du
2026-02-04
Autoregressive policies offer a compelling foundation for scalable robot learning by enabling discrete abstraction, token-level reasoning, and flexible inference. However, applying autoregressive modeling to continuous robot actions requires an effective action tokenization scheme. Existing approaches either rely on analytical discretization methods that produce prohibitively long token sequences, or learned latent tokenizers that lack structure, limiting their compatibility with next-token prediction. In this work, we identify three desiderata for action tokenization -- high compression, total decodability, and a left-to-right causally ordered token space -- and introduce Ordered Action Tokenization (OAT), a learned action tokenizer that satisfies all three. OAT discretizes action chunks into an ordered sequence of tokens using a transformer with registers, finite scalar quantization, and ordering-inducing training mechanisms. The resulting token space aligns naturally with autoregressive generation and enables prefix-based detokenization, yielding an anytime trade-off between inference cost and action fidelity. Across more than 20 tasks spanning four simulation benchmarks and real-world settings, autoregressive policies equipped with OAT consistently outperform prior tokenization schemes and diffusion-based baselines, while offering significantly greater flexibility at inference time.
Semantic Consensus Decoding Backdoor Defense for Verilog Code Generation
Authors: Guang Yang, Xing Hu, Xiang Chen, Xin Xia
2026-02-04
Large language models (LLMs) for Verilog code generation are increasingly adopted in hardware design, yet remain vulnerable to backdoor attacks where adversaries inject malicious triggers during training to induce vulnerable hardware designs. Unlike patchable software vulnerabilities, hardware trojans become irreversible once fabricated, making remediation extremely costly or impossible. Existing active defenses require access to training data, which is impractical for third-party LLM users, while passive defenses struggle against semantically stealthy triggers that blend naturally into design specifications. In this paper, we hypothesize that under the requirements of both effectiveness and stealthiness, attackers are strongly biased toward embedding triggers in non-functional requirements (e.g., style modifiers, quality descriptors) rather than functional specifications that determine hardware behavior. Exploiting this insight, we propose Semantic Consensus Decoding (SCD), an inference-time passive defense with two key components: (1) functional requirement extraction that identifies essential requirements from user specifications, and (2) consensus decoding that adaptively fuses output distributions based on the full user specification and the extracted functional requirements. When these distributions diverge significantly, SCD automatically suppresses suspicious components. Extensive experiments with three representative backdoor attacks demonstrate that SCD reduces the average attack success rate from 89% to under 3% with negligible impact on generation quality.
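The consensus fusion step can be sketched as a simple per-token rule over the two next-token distributions. The averaging, the absolute-difference divergence measure, and the threshold `tau` are assumptions of this sketch, not the paper's exact formulation:

```python
def consensus_decode(p_full, p_func, tau=0.35):
    """Fuse next-token distributions conditioned on the full specification
    (p_full) and on the extracted functional requirements (p_func);
    suppress tokens where the two diverge sharply (possible trigger
    influence), then renormalize."""
    fused = [0.0 if abs(a - b) > tau else 0.5 * (a + b)
             for a, b in zip(p_full, p_func)]
    z = sum(fused) or 1.0   # fallback if everything was suppressed
    return [x / z for x in fused]

# Token 0 is boosted only under the full prompt (trigger-like); tokens 1-2
# are supported by the functional requirements as well.
out = consensus_decode([0.7, 0.2, 0.1], [0.2, 0.5, 0.3])
```

The token favored only by the full (possibly triggered) prompt is zeroed out, while tokens the two views agree on keep their averaged mass.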
Universal Quantized Berry-Dipole Flat Bands
Authors: Qingyang Mo, Shuang Zhang
2026-02-04
Perfectly flat bands with nontrivial quantum geometry have emerged as a frontier for exotic topological phenomena and superconductors. Here, we unveil a universal family of quantized Berry-dipole flat bands in chiral-symmetric (2n+1)-band systems, where the central perfectly flat band carries a Berry-dipole moment d=n, with n an arbitrary integer, while preserving zero Chern number. We construct explicit lattice models to showcase three topological phenomena characterized by the Berry-dipole moment: a flat-band returning pump featuring bidirectional, soliton-like displacement of Wannier centers by exactly n unit cells per half cycle, a dipolar Haldane phase diagram arising from the competition between time-reversal and parity symmetries, and n pairs of bulk helical zero modes whose existence depends on the orientation of the pseudomagnetic field. Our findings establish a universal framework for topology beyond the Chern class in perfectly flat bands and provide a tunable platform for exploring quantum geometry and interaction-driven phases.