2026-01-30
Table of Contents
- Temporal Guidance for Large Language Models
- CE-GOCD: Central Entity-Guided Graph Optimization for Community Detection to Augment LLM Scientific Question Answering
- DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting
- Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis
- FBS: Modeling Native Parallel Reading inside a Transformer
- Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold
- FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning
- Sustainable Open-Source AI Requires Tracking the Cumulative Footprint of Derivatives
- HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning
- StarSD: One-for-Many Speculative Decoding
- Breaking the Overscaling Curse: Thinking Parallelism Before Parallel Thinking
- RecNet: Self-Evolving Preference Propagation for Agentic Recommender Systems
- Beyond Imitation: Reinforcement Learning for Active Latent Planning
- Frequency as Aperture: Enabling Embeddable Near-Field Sensing for 6G Wireless Radios
- Opinion Consensus Formation Among Networked Large Language Models
- On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression
- Don't double it: Efficient Agent Prediction in Occlusions
- Task-Awareness Improves LLM Generations and Uncertainty
- ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
- Compressed Sensing-Driven Near-Field Localization Exploiting Array of Subarrays
- ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management
- L Large Lookup Layers
- Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention
- NEMO: Execution-Aware Optimization Modeling via Autonomous Coding Agents
- Small models, big threats: Characterizing safety challenges from low-compute AI models
- Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving
- Robust Floquet Topological Phases and Anomalous π-Modes in Quasiperiodic Quantum Walks
- Token Entropy Regularization for Multi-modal Antenna Affiliation Identification
- NEXUS: Bit-Exact ANN-to-SNN Equivalence via Neuromorphic Gate Circuits with Surrogate-Free Training
- Lightweight High-Fidelity Low-Bitrate Talking Face Compression for 3D Video Conference
- User-Centric Phishing Detection: A RAG and LLM-Based Approach
- 3D imaging of the biphoton spatiotemporal wave packet
- Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification
- Delegation Without Living Governance
- Soft Quantization: Model Compression Via Weight Coupling
- ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling
- Do Reasoning Models Enhance Embedding Models?
- MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
- Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space
- BrainStack: Neuro-MoE with Functionally Guided Expert Routing for EEG-Based Language Decoding
- Planner-Auditor Twin Agentic Discharge Planning with FHIR-Based LLM Planning, Guideline Recall, Optional Caching and Self-Improvement
- ChunkWise LoRA: Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference
- CompSRT: Quantization and Pruning for Image Super Resolution Transformers
- Probabilistic Interpolation of Sagittarius A*'s Multi-Wavelength Light Curves Using Diffusion Models
- Evolutionary Strategies lead to Catastrophic Forgetting in LLMs
- MemCtrl: Using MLLMs as Active Memory Controllers on Embodied Agents
- Robust Federated Learning for Malicious Clients using Loss Trend Deviation Detection
- Context-Augmented Code Generation Using Programming Knowledge Graphs
- Leveraging Second-Order Curvature for Efficient Learned Image Compression: Theory and Empirical Evidence
- HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs
- SA-PEF: Step-Ahead Partial Error Feedback for Efficient Federated Learning
- P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering
- TGSBM: Transformer-Guided Stochastic Block Model for Link Prediction
- Investigating the Development of Task-Oriented Communication in Vision-Language Models
- MeCo: Enhancing LLM-Empowered Multi-Robot Collaboration via Similar Task Memoization
- DiffVC-RT: Towards Practical Real-Time Diffusion-based Perceptual Neural Video Compression
- Context Tokens are Anchors: Understanding the Repetition Curse in dMLLMs from an Information Flow Perspective
- Efficient Autoregressive Video Diffusion with Dummy Head
- Towards Sensitivity-Aware Language Models
- CtrlCoT: Dual-Granularity Chain-of-Thought Compression for Controllable Reasoning
- Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding
- Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs
- LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning
- Switchcodec: Adaptive residual-expert sparse quantization for high-fidelity neural audio coding
- TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs
- Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models
- Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching
- Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning
- VersaQ-3D: A Reconfigurable Accelerator Enabling Feed-Forward and Generalizable 3D Reconstruction via Versatile Quantization
- SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
- Hallucination Begins Where Saliency Drops
- StreamFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs
- Shallow-π: Knowledge Distillation for Flow-based VLAs
- Effect of initial Rayleigh mode on drop deformation and breakup under impulsive acceleration
Temporal Guidance for Large Language Models
Authors: Hong-Kai Zheng, Piji Li
2026-01-29
Contrastive Decoding (CD) enhances the generation quality of large language models (LLMs) but incurs significant additional computational overhead due to the need for an auxiliary model. Existing internal self-contrastive methods, such as Decoding by Contrasting Layers (DoLa), focus on discrepancies across different layers, which are notably unstable on small-scale models. In this work, based on the observation that LLMs exhibit local preferences, we propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Guidance (TeGu). Our method leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. To standardize the implementation of this mechanism, we further introduce a lightweight Conditional MTP Projector (cMTPP), which avoids maintaining multiple independent networks as required by other MTP modules. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.
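The generic contrastive-decoding recipe this abstract builds on can be sketched in a few lines: sharpen expert logits against a weaker amateur's, masking tokens the expert itself finds implausible. The weighting rule and names below are illustrative, not the paper's exact TeGu formulation.

```python
import math

def contrastive_logits(expert, amateur, alpha=0.5, tau=0.1):
    """Combine expert and amateur next-token logits (illustrative CD sketch).

    Tokens whose expert probability falls below `tau` times the max expert
    probability are masked out (the usual CD plausibility constraint)."""
    # softmax of expert logits, used only for the plausibility mask
    m = max(expert)
    probs = [math.exp(x - m) for x in expert]
    z = sum(probs)
    probs = [p / z for p in probs]
    cutoff = tau * max(probs)

    out = []
    for p, e, a in zip(probs, expert, amateur):
        if p < cutoff:
            out.append(float("-inf"))            # implausible token: masked
        else:
            out.append((1 + alpha) * e - alpha * a)  # push away from amateur
    return out
```

In TeGu the "amateur" distribution comes from the model's own MTP head rather than a separate smaller model, which is what removes the auxiliary-model overhead.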
CE-GOCD: Central Entity-Guided Graph Optimization for Community Detection to Augment LLM Scientific Question Answering
Authors: Jiayin Lan, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin, Guoping Hu
2026-01-29
Large Language Models (LLMs) are increasingly used for question answering over scientific research papers. Existing retrieval augmentation methods often rely on isolated text chunks or concepts but overlook deeper semantic connections between papers. This impairs the LLM's comprehension of scientific literature, hindering the comprehensiveness and specificity of its responses. To address this, we propose Central Entity-Guided Graph Optimization for Community Detection (CE-GOCD), a method that augments LLMs' scientific question answering by explicitly modeling and leveraging semantic substructures within academic knowledge graphs. Our approach operates by: (1) leveraging paper titles as central entities for targeted subgraph retrieval, (2) enhancing implicit semantic discovery via subgraph pruning and completion, and (3) applying community detection to distill coherent paper groups with shared themes. We evaluated the proposed method on three NLP literature-based question-answering datasets, and the results demonstrate its superiority over other retrieval-augmented baseline approaches, confirming the effectiveness of our framework.
DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting
Authors: Siru Zhong, Yiqiu Liu, Zhiqing Cui, Zezhi Shao, Fei Wang, Qingsong Wen, Yuxuan Liang
2026-01-29
Deep time series models are vulnerable to the noisy data ubiquitous in real-world applications. Existing robustness strategies either prune data or rely on costly prior quantification, failing to balance effectiveness and efficiency. In this paper, we introduce DropoutTS, a model-agnostic plugin that shifts the paradigm from "what" to learn to "how much" to learn. DropoutTS employs a Sample-Adaptive Dropout mechanism: leveraging spectral analysis to efficiently quantify instance-level noise via reconstruction residuals, it dynamically calibrates model learning capacity by mapping noise to adaptive dropout rates, selectively suppressing spurious fluctuations while preserving fine-grained fidelity. Extensive experiments across diverse noise regimes and open benchmarks show that DropoutTS consistently boosts the performance of strong backbones, delivering advanced robustness with negligible parameter overhead and no architectural modifications. Our code is available at https://github.com/CityMind-Lab/DropoutTS.
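The noise-to-dropout mapping can be sketched as a bounded squashing function from a per-sample residual score to a dropout rate; the sigmoid form, thresholds, and names here are assumptions for illustration, not DropoutTS's exact calibration rule.

```python
import math

def adaptive_dropout_rate(residual, tau=1.0, k=4.0, p_min=0.05, p_max=0.5):
    """Map an instance-level noise score (e.g. a spectral reconstruction
    residual) to a dropout rate in [p_min, p_max].

    Noisier samples get stronger dropout, so the model learns less from them;
    clean samples keep a near-minimal rate and retain fine-grained detail."""
    s = 1.0 / (1.0 + math.exp(-k * (residual - tau)))  # sigmoid squash
    return p_min + (p_max - p_min) * s
```

A plugin like this leaves the backbone untouched: only the dropout probability fed to existing dropout layers changes per training sample.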
Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis
Authors: Qingyue Yang, Jie Wang, Xing Li, Yinqi Bai, Xialiang Tong, Huiling Zhen, Jianye Hao, Mingxuan Yuan, Bin Li
2026-01-29
Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce \textbf{Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations} from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides efficient-inference approaches. Specifically, TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. Focusing on the predictable patterns, we further provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to downstream inference tasks, where a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at https://github.com/MIRALab-USTC/LLM-TAPPA.
FBS: Modeling Native Parallel Reading inside a Transformer
Authors: Tongxi Wang
2026-01-29
Large language models (LLMs) excel across many tasks, yet inference is still dominated by strictly token-by-token autoregression. Existing acceleration methods largely patch this pipeline and miss core human-reading ingredients: content-adaptive foresight, chunk-structure-aware compute allocation, and train--test consistency for preview/skimming. We propose the \textbf{Fovea-Block-Skip Transformer} (FBS), which injects a causal, trainable loop into Transformers via a Parafovea-Attention Window (PAW), Chunk-Head (CH), and Skip-Gate (SG). Across diverse benchmarks, FBS improves the quality-efficiency trade-off without increasing parameters, and ablations show the three modules are complementary.
Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold
Authors: Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello
2026-01-29
Key-value (KV) caching enables fast autoregressive decoding but at long contexts becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank, storing only the projections in the HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations.
For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns \emph{orthonormal} projection bases by directly minimizing \emph{decoder-layer output reconstruction error}. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank allocation under a user-specified error budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention on C4 perplexity and on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.
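The error-budgeted rank allocation described above reduces, per layer, to a table lookup once error-rank profiles are precomputed. A minimal sketch, with hypothetical profiles (the paper's allocation rule may differ):

```python
def allocate_ranks(profiles, budget):
    """Pick, per layer, the smallest candidate rank whose precomputed
    reconstruction error stays within `budget`.

    `profiles` maps layer -> list of (rank, error) pairs sorted by rank;
    falls back to the largest available rank when no rank meets the budget."""
    chosen = {}
    for layer, table in profiles.items():
        ok = [r for r, err in table if err <= budget]
        chosen[layer] = min(ok) if ok else max(r for r, _ in table)
    return chosen
```

This is why a single set of learned bases can serve many memory targets: tightening or loosening the budget only changes which column of the profile is selected, not the bases themselves.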
FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning
Authors: Xiaoyu Xu, Minxin Du, Kun Fang, Zi Liang, Yaxin Xiao, Zhicong Huang, Cheng Hong, Qingqing Ye, Haibo Hu
2026-01-29
Large language models (LLMs) demonstrate impressive capabilities across diverse tasks but raise concerns about privacy, copyright, and harmful materials. Existing LLM unlearning methods rarely consider the continual and high-volume nature of real-world deletion requests, which can cause utility degradation and catastrophic forgetting as requests accumulate. To address this challenge, we introduce \fit, a framework for continual unlearning that handles large numbers of deletion requests while maintaining robustness against both catastrophic forgetting and post-unlearning recovery. \fit mitigates degradation through rigorous data \underline{F}iltering, \underline{I}mportance-aware updates, and \underline{T}argeted layer attribution, enabling stable performance across long sequences of unlearning operations and achieving a favorable balance between forgetting effectiveness and utility retention. To support realistic evaluation, we present \textbf{PCH}, a benchmark covering \textbf{P}ersonal information, \textbf{C}opyright, and \textbf{H}armful content in sequential deletion scenarios, along with two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), which jointly assess forgetting quality and utility preservation. Extensive experiments on four open-source LLMs with hundreds of deletion requests show that \fit achieves the strongest trade-off between F.D. and R.U., surpasses existing methods on MMLU, CommonsenseQA, and GSM8K, and remains resistant against both relearning and quantization recovery attacks.
Sustainable Open-Source AI Requires Tracking the Cumulative Footprint of Derivatives
Authors: Shaina Raza, Iuliia Eyriay, Ahmed Y. Radwan, Nate Lesperance, Deval Pandya, Sedef Akinli Kocak, Graham W. Taylor
2026-01-29
Open-source AI is scaling rapidly, and model hubs now host millions of artifacts. Each foundation model can spawn large numbers of fine-tunes, adapters, quantizations, merges, and forks. We take the position that compute efficiency alone is insufficient for sustainability in open-source AI: lower per-run costs can accelerate experimentation and deployment, increasing aggregate environmental footprint unless impacts are measurable and comparable across derivative lineages. However, the energy use, water consumption, and emissions of these derivative lineages are rarely measured or disclosed in a consistent, comparable manner, leaving ecosystem-level impact largely invisible. We argue that sustainable open-source AI requires coordination infrastructure that tracks impacts across model lineages, not only base models. We propose Data and Impact Accounting (DIA), a lightweight, non-restrictive transparency layer that (i) standardizes carbon and water reporting metadata, (ii) integrates low-friction measurement into common training and inference pipelines, and (iii) aggregates reports through public dashboards to summarize cumulative impacts across releases and derivatives. DIA makes derivative costs visible and supports ecosystem-level accountability while preserving openness. https://vectorinstitute.github.io/ai-impact-accounting/
HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning
Authors: Jinhao Zhang, Yunquan Zhang, Zicheng Yan, Boyang Zhang, Jun Sun, Daning Cheng
2026-01-29
Post-Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the quantization loss landscape: a few high-curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant, not only achieving superior performance under standard W4A8 settings but also excelling in the highly challenging W3A16 ultra-low-bit regime, where it boosts GSM8K accuracy on Llama3-8B to 70.15\% and effectively avoids the logical collapse commonly seen under aggressive quantization.
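The sharpness measure HeRo-Q targets, the largest Hessian eigenvalue, is typically estimated by power iteration. The toy sketch below runs it on an explicit symmetric matrix; in practice one would use Hessian-vector products instead of materializing H, and this is a generic illustration rather than the paper's procedure.

```python
def power_iteration(H, iters=200):
    """Estimate the largest-magnitude eigenvalue of a symmetric matrix H
    (given as a list of rows) via repeated matrix-vector products.

    For a PSD loss Hessian this is the top eigenvalue, i.e. the curvature
    of the most quantization-sensitive direction."""
    n = len(H)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(H[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w)   # normalize by the max-abs component
        v = [x / lam for x in w]
    return lam
```

Shrinking this eigenvalue (e.g. by rotating the weight basis) flattens the worst-case direction, which is why quantization noise of the same magnitude then causes a smaller loss increase.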
StarSD: One-for-Many Speculative Decoding
Authors: Junhao He, Feiran You, Hongyang Du
2026-01-29
Speculative decoding accelerates autoregressive generation by separating token proposal from verification, but most existing approaches are designed for single-node execution and do not scale well to the multi-accelerator clusters used for serving modern Large Language Models (LLMs). We present StarSD, a one-for-many speculative decoding framework that uses a single draft model to serve multiple target models across distributed nodes via a star topology. StarSD decouples drafting and verification, enabling effective sharing of draft computation and preventing distributed accelerators from remaining idle under bursty workloads. We provide a system-level analysis that characterizes when and why a single draft model can remain fully utilized by multiple verifiers, yielding predictable latency and utilization gains. Extensive experiments in real-world distributed inference settings demonstrate that StarSD simplifies deployment and supports flexible resource allocation across heterogeneous accelerators, while maintaining output quality. These results indicate that StarSD is a practical and scalable framework for bringing speculative decoding to modern cloud and edge inference infrastructures.
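The draft-then-verify loop underlying any speculative-decoding system can be sketched with toy callables standing in for the models. This is the simple greedy-verification variant (accept the longest draft prefix the target agrees with), not StarSD's multi-target scheduling or the lossless sampling-based acceptance rule.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One greedy speculative-decoding step.

    `draft_next` / `target_next` map a token sequence to that model's next
    token. Returns the accepted tokens: the longest draft prefix the target
    agrees with, plus one token contributed by the target itself."""
    # 1) draft proposes k tokens autoregressively
    proposal, seq = [], list(prefix)
    for _ in range(k):
        t = draft_next(seq)
        proposal.append(t)
        seq.append(t)
    # 2) target verifies the proposal position by position
    accepted, seq = [], list(prefix)
    for t in proposal:
        if target_next(seq) == t:
            accepted.append(t)
            seq.append(t)
        else:
            break
    # 3) target always contributes one token past the accepted prefix,
    #    so every step makes progress even if the draft is rejected
    accepted.append(target_next(seq))
    return accepted
```

In a one-for-many setup, step 1 runs once on the shared draft node while step 2 runs independently on each target's accelerators, which is what keeps the verifiers busy under bursty load.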
Breaking the Overscaling Curse: Thinking Parallelism Before Parallel Thinking
Authors: Yiming Wang, Zhuosheng Zhang, Rui Wang
2026-01-29
Parallel thinking enhances reasoning by multi-path sampling and aggregation. In system-level evaluations, a global parallelism level N is allocated to all samples, typically set large to maximize overall dataset accuracy. However, due to sample heterogeneity, some samples can achieve comparable performance with a smaller N' < N, causing budget redundancy. This incompatibility between system-level efficacy and sample-level efficiency constitutes the overscaling curse. In this paper, we formalize and quantify the overscaling curse, showing its universality and severity in practice, and analyze its trigger mechanism. We then propose a lightweight method, T2, to break the overscaling curse, which utilizes latent representations to estimate the optimal parallelism level for each sample before decoding. Experiments show that T2 significantly reduces cost while maintaining comparable performance, enabling more efficient parallel thinking.
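The budget arithmetic behind per-sample parallelism has a closed form once a per-sample success probability is estimated (T2 predicts this from latent representations; the independence assumption and target coverage below are illustrative):

```python
import math

def min_parallelism(p, target=0.95, n_max=64):
    """Smallest N such that at least one of N independent samples succeeds
    with probability >= target, i.e. 1 - (1-p)^N >= target, capped at n_max."""
    if p <= 0.0:
        return n_max          # hopeless sample: spend the full budget
    if p >= 1.0:
        return 1              # trivial sample: one path suffices
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - p))
    return min(max(n, 1), n_max)
```

Easy samples (high p) thus collapse to N' of 1-2 paths while only hard samples consume the large global budget, which is exactly the redundancy the overscaling curse describes.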
RecNet: Self-Evolving Preference Propagation for Agentic Recommender Systems
Authors: Bingqian Li, Xiaolei Wang, Junyi Li, Weitao Li, Long Zhang, Sheng Chen, Wayne Xin Zhao, Ji-Rong Wen
2026-01-29
Agentic recommender systems leverage Large Language Models (LLMs) to model complex user behaviors and support personalized decision-making. However, existing methods primarily model preference changes based on explicit user-item interactions, which are sparse, noisy, and unable to reflect the real-time, mutual influences among users and items. To address these limitations, we propose RecNet, a self-evolving preference propagation framework that proactively propagates real-time preference updates across related users and items. RecNet consists of two complementary phases. In the forward phase, the centralized preference routing mechanism leverages router agents to integrate preference updates and dynamically propagate them to the most relevant agents. To ensure accurate and personalized integration of propagated preferences, we further introduce a personalized preference reception mechanism, which combines a message buffer for temporary caching and an optimizable, rule-based filter memory to guide selective preference assimilation based on past experience and interests. In the backward phase, the feedback-driven propagation optimization mechanism simulates a multi-agent reinforcement learning framework, using LLMs for credit assignment, gradient analysis, and module-level optimization, enabling continuous self-evolution of propagation strategies. Extensive experiments on various scenarios demonstrate the effectiveness of RecNet in modeling preference propagation for recommender systems.
Beyond Imitation: Reinforcement Learning for Active Latent Planning
Authors: Zhi Zheng, Wee Sun Lee
2026-01-29
Aiming at efficient and dense chain-of-thought (CoT) reasoning, latent reasoning methods fine-tune Large Language Models (LLMs) to substitute discrete language tokens with continuous latent tokens. These methods consume fewer tokens than conventional language CoT reasoning and have the potential to plan in a dense latent space. However, current latent tokens are generally supervised by imitating language labels. Considering that there can be multiple equivalent but diverse CoT labels for a question, passively imitating an arbitrary one may lead to inferior latent token representations and latent reasoning policies, undermining the potential planning ability and resulting in clear gaps between training and testing. In this work, we emphasize the importance of active planning over the representation space of latent tokens in achieving the optimal latent reasoning policy. We therefore propose the \underline{A}c\underline{t}ive Latent \underline{P}lanning method (ATP-Latent), which models the supervision process of latent tokens as a conditional variational auto-encoder (VAE) to obtain a smoother latent space. Moreover, to facilitate the most reasonable latent reasoning policy, ATP-Latent conducts reinforcement learning (RL) with an auxiliary coherence reward, calculated from the consistency between the VAE-decoded contents of latent tokens, enabling a guided RL process. In experiments on LLaMA-1B, ATP-Latent demonstrates +4.1\% accuracy and -3.3\% tokens on four benchmarks compared to advanced baselines. Codes are available at https://github.com/zz1358m/ATP-Latent-master.
Frequency as Aperture: Enabling Embeddable Near-Field Sensing for 6G Wireless Radios
Authors: Pin-Han Ho, Limei Peng, Yiming Miao, Xu Fan, Kairan Liang, Haoran Mei, Wei Duan
2026-01-29
Integrated sensing and communication (ISAC) is expected to be natively supported by future 6G wireless radios, yet most mmWave sensing solutions still rely on dedicated radar hardware incompatible with cost- and power-constrained wireless nodes. This article introduces Frequency-as-Aperture (FaA), a wireless-first sensing paradigm that repurposes inherent frequency agility into a virtual sensing aperture, enabling near-field perception with minimal RF front-end complexity. Using a single RF chain and a frequency-scanning leaky-wave antenna, FaA achieves two-dimensional spatial sensing by reusing the local oscillator (LO) frequency sweep already employed for wideband communication. From a wireless-system perspective, this shifts spatial sampling from the antenna domain to the frequency domain, embedding radar-grade spatial fingerprints directly into the communication RF chain. A case study shows that FaA provides fine angular and range discrimination with low power consumption and unit cost, demonstrating significantly higher architectural efficiency than conventional multi-channel MIMO-based sensing under identical physical and spectral constraints. These results indicate that near-field sensing can be seamlessly integrated into frequency-agile wireless radios, enabling hardware-efficient, embeddable, and privacy-preserving ISAC nodes for smart homes, wearables, and industrial edge deployments.
Opinion Consensus Formation Among Networked Large Language Models
Authors: Iris Yazici, Mert Kayaalp, Stefan Taga, Ali H. Sayed
2026-01-29
Can classical consensus models predict the group behavior of large language models (LLMs)? We examine multi-round interactions among LLM agents through the DeGroot framework, where agents exchange text-based messages over diverse communication graphs. To track opinion evolution, we map each message to an opinion score via sentiment analysis. We find that agents typically reach consensus and that the disagreement between agents decays exponentially. However, the limiting opinion departs from DeGroot's network-centrality-weighted forecast. The consensus between LLM agents turns out to be largely insensitive to initial conditions and instead depends strongly on the discussion subject and inherent biases. Nevertheless, transient dynamics align with classical graph theory, and the convergence rate of opinions is closely related to the second-largest eigenvalue of the graph's combination matrix. Together, these findings can be useful for LLM-driven social-network simulations and the design of resource-efficient multi-agent LLM applications.
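The DeGroot baseline the paper compares against is a linear iteration x <- Wx with a row-stochastic combination matrix W; consensus lands at the stationary-weight average of initial opinions, and the gap closes at a rate set by W's second-largest eigenvalue magnitude. A minimal sketch (the 3-agent chain below is an illustrative example, not the paper's setup):

```python
def degroot(W, x, steps=100):
    """Iterate DeGroot opinion updates x <- W x, where W is a row-stochastic
    combination matrix (list of rows) and x the vector of opinion scores."""
    for _ in range(steps):
        x = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return x
```

For the chain W = [[.5,.5,0],[.25,.5,.25],[0,.5,.5]] the eigenvalues are 1, 0.5, 0, so disagreement shrinks by about half per round and all agents converge to the centrality-weighted average (weights 0.25, 0.5, 0.25) of the initial scores — the forecast the LLM agents are observed to depart from.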
On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression
Authors: Xinwei Zhang, Hangcheng Liu, Li Bai, Hao Wang, Qingqing Ye, Tianwei Zhang, Haibo Hu
2026-01-29
Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that existing encoder-based attacks can substantially overestimate the robustness of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compressed inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.
Don't double it: Efficient Agent Prediction in Occlusions
Authors: Anna Rothenhäusler, Markus Mazzola, Andreas Look, Raghu Rajan, Joschka Bödecker
2026-01-29
Occluded traffic agents pose a significant challenge for autonomous vehicles, as hidden pedestrians or vehicles can appear unexpectedly, yet this problem remains understudied. Existing learning-based methods, while capable of inferring the presence of hidden agents, often produce redundant occupancy predictions in which a single agent is identified multiple times. This issue complicates downstream planning and increases computational load. To address this, we introduce MatchInformer, a novel transformer-based approach that builds on the state-of-the-art SceneInformer architecture. Our method improves upon prior work by integrating Hungarian matching, a standard assignment algorithm from object detection, into the training process to enforce a one-to-one correspondence between predictions and ground truth, thereby reducing redundancy. We further refine trajectory forecasts by decoupling an agent's heading from its motion, a strategy that improves the accuracy and interpretability of predicted paths. To better handle class imbalances, we propose using the Matthews Correlation Coefficient (MCC) to evaluate occupancy predictions. By considering all entries in the confusion matrix, MCC provides a robust measure even in sparse or imbalanced scenarios. Experiments on the Waymo Open Motion Dataset demonstrate that our approach improves reasoning about occluded regions and produces more accurate trajectory forecasts than prior methods.
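The MCC the authors advocate is a fixed formula over the four binary confusion-matrix counts; unlike accuracy, a trivial always-positive predictor scores 0 rather than looking good on imbalanced occupancy grids.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts.

    Returns a value in [-1, 1]; by convention 0 when any marginal is empty
    (the degenerate case where the denominator vanishes)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

For example, a perfect classifier gives mcc(5, 0, 0, 5) = 1.0, while predicting "occupied" everywhere on a 90/10 split gives mcc(9, 1, 0, 0) = 0.0, which is the robustness-to-imbalance property the abstract refers to.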
Task-Awareness Improves LLM Generations and Uncertainty
Authors: Tim Tomov, Dominik Fuchsgruber, Stephan Günnemann
2026-01-29
In many applications of LLMs, natural language responses often have an underlying structure, such as representing discrete labels, numerical values, or graphs. Yet, existing decoding and uncertainty estimation methods operate only in language space and largely disregard this structural information. We address this by modeling LLM outputs directly in a task-dependent latent structure. By equipping this structure with a dissimilarity measure, we can compute Bayes-optimal responses. These are not selected from sampled generations but are newly synthesized by combining individual responses in the latent space. Across different tasks, Bayes-optimal responses consistently outperform standard decoding methods like beam search. Moreover, quantifying uncertainty via the induced Bayesian risk captures variations in terms of the latent structure and improves alignment with output quality and correctness. Our decision-theoretic framework is applicable to any problem that admits a latent response structure and enables reliable task-aware LLM predictions.
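The decision-theoretic core can be sketched in a few lines: given sampled generations mapped into a latent space with dissimilarity d, the Bayes-optimal response minimizes the average dissimilarity to the samples (the Bayesian risk). The helper names are illustrative, not the paper's API.

```python
def bayes_risk(candidate, samples, d):
    """Average dissimilarity of `candidate` to the sampled latent outputs,
    i.e. a Monte Carlo estimate of the Bayesian risk."""
    return sum(d(candidate, s) for s in samples) / len(samples)

def best_of(candidates, samples, d):
    """Candidate with minimal Bayesian risk under dissimilarity d."""
    return min(candidates, key=lambda c: bayes_risk(c, samples, d))
```

For a numeric-answer task with squared-error dissimilarity, the minimizer is the sample mean: with samples [2, 2, 5], the synthesized answer 3 has lower risk than any individual generation, which is exactly the "newly synthesized, not selected" behavior the abstract describes.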
ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
Authors: Xiuyu Li, Jinkai Zhang, Mingyang Yi, Yu Li, Longqiang Wang, Yue Wang, Ju Fan
2026-01-29
Reinforcement Learning (RL) post-training alignment for language models is effective but also costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method to sample directly from the optimal RL policy. The transition probability applied to Masked Language Modeling (MLM) consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern serving frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLMs (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that ETS consistently improves generation quality, validating its effectiveness and design.
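The reference-policy-times-energy factorization admits a standard Monte Carlo treatment: draw from the reference policy and reweight by the exponentiated negative energy. The self-normalized estimator below is a generic sketch of that idea, not ETS's tailored importance-sampling estimator.

```python
import math

def energy_weighted_expectation(samples, energy, f):
    """Self-normalized importance-sampling estimate of E_p[f(x)], where
    p(x) is proportional to p_ref(x) * exp(-energy(x)) and `samples` are
    drawn from the reference policy p_ref."""
    w = [math.exp(-energy(x)) for x in samples]   # unnormalized weights
    z = sum(w)                                    # normalizer (partition sum)
    return sum(wi * f(x) for wi, x in zip(w, samples)) / z
```

With a constant energy the estimator reduces to the plain sample mean; as the energy increasingly penalizes some samples, the estimate shifts toward the low-energy (high-reward) ones, which is the tilt toward the optimal RL policy.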
Compressed Sensing-Driven Near-Field Localization Exploiting Array of Subarrays
Authors: Sai Pavan Deram, Jacopo Pegoraro, Javier Lorca Hernando, Jesus O. Lacruz, Joerg Widmer
2026-01-29
Near-field localization for ISAC requires large-aperture arrays, making fully-digital implementations prohibitively complex and costly. While subarray architectures can reduce cost, they introduce severe estimation ambiguity from grating lobes. To address both issues, we propose SHARE (Sparse Hierarchical Angle-Range Estimation), a novel two-stage sparse recovery algorithm. SHARE first performs coarse, unambiguous angle estimation using individual subarrays to resolve the grating lobe ambiguity. It then leverages the full aperture to perform a localized joint angle-range search. This hierarchical approach avoids an exhaustive and computationally intensive two-dimensional grid search while preserving the high resolution of the large aperture. Simulation results show that SHARE significantly outperforms conventional one-shot sparse recovery methods, such as Orthogonal Matching Pursuit (OMP), in both localization accuracy and robustness. Furthermore, we show that SHARE's overall localization accuracy is comparable to or even surpasses that of the fully-digital 2D-MUSIC algorithm, despite MUSIC having access to the complete, uncompressed data from every antenna element. SHARE therefore provides a practical path for high-resolution near-field ISAC systems.
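The coarse-then-fine structure can be sketched with a toy scoring function. This is only an illustration of the hierarchical search pattern, assuming an idealized unimodal spectrum; SHARE's actual stages operate on subarray measurements and a sparse recovery formulation.

```python
def hierarchical_search(score, angles, ranges, window=5):
    """Two-stage sketch: a cheap 1D angle scan first, then a joint
    angle-range search restricted to a local angular window, instead of
    scoring the full len(angles) x len(ranges) grid."""
    # Stage 1: coarse, range-agnostic angle estimate (mid-range slice).
    mid_r = ranges[len(ranges) // 2]
    i0 = max(range(len(angles)), key=lambda i: score(angles[i], mid_r))
    # Stage 2: localized joint angle-range refinement around the estimate.
    lo, hi = max(0, i0 - window), min(len(angles), i0 + window + 1)
    return max(((a, r) for a in angles[lo:hi] for r in ranges),
               key=lambda ar: score(*ar))

# Toy "spectrum" peaked at angle = 30 deg, range = 12 m.
score = lambda a, r: -((a - 30) ** 2 + (r - 12) ** 2)
angles = list(range(0, 91))    # 1-degree grid
ranges = list(range(1, 51))    # 1-metre grid
assert hierarchical_search(score, angles, ranges) == (30, 12)
```

Stage 2 evaluates roughly `2 * window * len(ranges)` points instead of `len(angles) * len(ranges)`, which is the cost saving the abstract describes.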
ScaleSim Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management
Authors: Zaifeng Pan, Yipeng Shen, Zhengding Hu, Zhuang Wang, Aninda Manocha, Zheng Wang, Zhongkai Yu, Yue Guan, Yufei Ding
2026-01-29
LLM-based multi-agent simulations are increasingly adopted across application domains, but remain difficult to scale due to GPU memory pressure. Each agent maintains private GPU-resident states, including models, prefix caches, and adapters, which quickly exhaust device memory as the agent count grows. We identify two key properties of these workloads: sparse agent activation and an estimable agent invocation order. Based on an analysis of representative workload classes, we introduce invocation distance, a unified abstraction that estimates the relative order in which agents will issue future LLM requests. Leveraging this abstraction, we present ScaleSim, a memory-efficient serving system for large-scale multi-agent simulations. ScaleSim enables proactive prefetching and priority-based eviction, supports diverse agent-specific memory through a modular interface, and achieves up to 1.74x speedup over SGLang on simulation benchmarks.
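The eviction side of invocation-distance management can be sketched as a Belady-style policy: evict the agent state whose next invocation is estimated to be farthest away. This is an illustrative reduction, with all class and field names invented; ScaleSim additionally prefetches and handles heterogeneous per-agent state.

```python
class InvocationDistanceCache:
    """Sketch of invocation-distance-based eviction: when over capacity,
    drop the agent whose estimated next invocation is farthest in the
    future, as predicted from the simulation's agent ordering."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = {}   # agent_id -> (estimated_next_invocation, state)

    def access(self, agent_id, state, next_invocation):
        self.resident[agent_id] = (next_invocation, state)
        if len(self.resident) > self.capacity:
            victim = max(self.resident, key=lambda a: self.resident[a][0])
            del self.resident[victim]

cache = InvocationDistanceCache(capacity=2)
cache.access("a", "state_a", next_invocation=3)
cache.access("b", "state_b", next_invocation=10)
cache.access("c", "state_c", next_invocation=5)   # evicts "b" (farthest)
assert set(cache.resident) == {"a", "c"}
```

When the invocation order is estimable, this policy approximates the offline-optimal (Belady) replacement that LRU cannot.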
L Large Lookup Layers
Authors: Albert Tseng, Christopher De Sa
2026-01-29
Modern language models typically achieve sparsity through Mixture-of-Experts (MoE) layers, which dynamically route tokens to dense MLP "experts." However, dynamic hard routing has a number of drawbacks, such as potentially poor hardware efficiency and needing auxiliary losses for stable training. In contrast, the tokenizer embedding table, which is natively sparse, largely avoids these issues by selecting a single embedding per token at the cost of not having contextual information. In this work, we introduce the Large Lookup Layer (L), which unlocks a new axis of sparsity by generalizing embedding tables to model decoder layers. L layers use static token-based routing to aggregate a set of learned embeddings per token in a context-dependent way, allowing the model to efficiently balance memory and compute by caching information in embeddings. L has two main components: (1) a systems-friendly architecture that allows for fast training and CPU-offloaded inference with no overhead, and (2) an information-theoretic embedding allocation algorithm that effectively balances speed and quality. We empirically test L by training LLMs with up to 2.6B active parameters and find that L strongly outperforms both dense models and iso-compute MoEs in both language modeling and downstream tasks.
Spava Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention
Authors: Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Ao Sun, Ziqi Yuan, Hao Zhou, Fandong Meng, Zhiyuan Liu
2026-01-29
The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose Spava, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, Spava reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of Spava, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB
NEMO Execution-Aware Optimization Modeling via Autonomous Coding Agents
Authors: Yang Song, Anoushka Vyas, Zirui Wei, Sina Khoshfetrat Pakazad, Henrik Ohlsson, Graham Neubig
2026-01-29
In this paper, we present NEMO, a system that translates Natural-language descriptions of decision problems into formal Executable Mathematical Optimization implementations, operating collaboratively with users or autonomously. Existing approaches typically rely on specialized large language models (LLMs) or bespoke, task-specific agents. Such methods are often brittle and complex, frequently generating syntactically invalid or non-executable code. NEMO instead centers on remote interaction with autonomous coding agents (ACAs), treated as a first-class abstraction analogous to API-based interaction with LLMs. This design enables the construction of higher-level systems around ACAs that structure, consolidate, and iteratively refine task specifications. Because ACAs execute within sandboxed environments, code produced by NEMO is executable by construction, allowing automated validation and repair. Building on this, we introduce novel coordination patterns with and across ACAs, including asymmetric validation loops between independently generated optimizer and simulator implementations (serving as a high-level validation mechanism), external memory for experience reuse, and robustness enhancements via minimum Bayes risk (MBR) decoding and self-consistency. We evaluate NEMO on nine established optimization benchmarks. As depicted in Figure 1, it achieves state-of-the-art performance on the majority of tasks, with substantial margins on several datasets, demonstrating the power of execution-aware agentic architectures for automated optimization modeling.
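A minimal form of MBR-style selection over executable candidates can be sketched as follows: run every candidate and keep the one whose output agrees with the largest share of the others. This is a generic self-consistency sketch, not NEMO's mechanism; the `execute` callable stands in for its sandboxed execution.

```python
from collections import Counter

def mbr_select(candidates, execute):
    """Minimum-Bayes-risk selection sketch: execute each candidate program
    and return the one whose output agrees with the most other candidates,
    i.e. the candidate minimizing expected disagreement risk."""
    outputs = [execute(c) for c in candidates]
    counts = Counter(outputs)
    best = max(range(len(candidates)), key=lambda i: counts[outputs[i]])
    return candidates[best]

# Toy: three generated "programs" for 2 + 2, one of them buggy.
progs = ["2+2", "2 + 2", "2*3"]
assert mbr_select(progs, execute=eval) == "2+2"
```

Because validation runs on execution results rather than code text, syntactically different but semantically equivalent candidates reinforce each other.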
Small models, big threats Characterizing safety challenges from low-compute AI models
Authors: Prateek Puri
2026-01-29
Artificial intelligence (AI) systems are revolutionizing fields such as medicine, drug discovery, and materials science; however, many technologists and policymakers are also concerned about the technology's risks. To date, most concrete policies around AI governance have focused on managing AI risk by considering the amount of compute required to operate or build a given AI system. However, low-compute AI systems are becoming increasingly more performant - and more dangerous. Driven by agentic workflows, parameter quantization, and other model compression techniques, capabilities once only achievable on frontier-level systems have diffused into low-resource models deployable on consumer devices. In this report, we profile this trend by downloading historical benchmark performance data for over 5,000 large language models (LLMs) hosted on HuggingFace, noting the model size needed to achieve competitive performance on benchmarks has decreased by more than 10X over the past year. We then simulate the computational resources needed for an actor to launch a series of digital societal harm campaigns - such as disinformation botnets, sexual extortion schemes, voice-cloning fraud, and others - using low-compute open-source models and find nearly all studied campaigns can easily be executed on consumer-grade hardware. This position paper argues that protection measures for high-compute models leave serious security holes for their low-compute counterparts, meaning it is urgent that both policymakers and technologists make greater efforts to understand and address this emerging class of threats.
Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving
Authors: Chendong Song, Meixuan Wang, Hang Zhou, Hong Liang, Yuan Lyu, Zixi Chen, Yuwei Fan, Zijie Zhou
2026-01-29
Attention-FFN Disaggregation (AFD) is an emerging architecture for LLM serving that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop a tractable analytical framework for sizing AFD bundles in an A-F topology, where the key difficulty is that Attention-side work is nonstationary: token context grows and requests are continuously replenished with random lengths, while FFN work is stable given the aggregated batch. Using a probabilistic workload model, we derive closed-form rules for the optimal A/F ratio that maximize average throughput per instance across the system. A trace-calibrated AFD simulator validates the theory: across workloads, the theoretical optimal A/F ratio matches the simulation-optimal within 10%, and consistently reduces idle time.
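A toy version of the sizing intuition can be checked numerically. The model below (perfectly divisible work, step time equal to the slower of the two sides) is a deliberate simplification of the paper's probabilistic workload model, and all names are illustrative.

```python
def best_split(total_devices, attn_work, ffn_work):
    """Toy A-F bundle sizing: with per-step Attention work `attn_work` and
    FFN work `ffn_work` (arbitrary units), the step time is
    max(attn_work / A, ffn_work / F). The minimizer equalizes the two sides,
    so the optimal A/F ratio tracks the ratio of the workloads."""
    return min(
        ((a, total_devices - a) for a in range(1, total_devices)),
        key=lambda af: max(attn_work / af[0], ffn_work / af[1]),
    )

# 12 devices, Attention work twice the FFN work -> A/F split near 2:1.
a, f = best_split(12, attn_work=8.0, ffn_work=4.0)
assert (a, f) == (8, 4)
```

Any deviation from the balanced split leaves one side idle each step, which is exactly the step-level blocking the abstract attributes to mis-sizing.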
Robust Floquet Topological Phases and Anomalous π-Modes in Quasiperiodic Quantum Walks
Authors: F. Iwase
2026-01-29
We uncover the global topological phase diagram of one-dimensional discrete-time quantum walks driven by Fibonacci-modulated coin parameters. Utilizing the mean chiral displacement (MCD) as a dynamical probe, we identify robust topological phases defined by a strictly quantized winding number and exponentially localized edge states. Crucially, we discover that these topological edge modes emerge not only at zero energy but also at the quasienergy zone boundary π, exhibiting identical localization robustness despite the fractal nature of the bulk spectrum. These results demonstrate that Floquet topological protection remains intact amidst quasiperiodic disorder, offering a concrete route to observing exotic non-equilibrium phases in photonic experiments.
Token Entropy Regularization for Multi-modal Antenna Affiliation Identification
Authors: Dong Chen, Ruoyu Li, Xinyan Zhang, Jialei Xu, Ruoseng Zhao, Zhikang Zhang, Lingyun Li, Zizhuang Wei
2026-01-29
Accurate antenna affiliation identification is crucial for optimizing and maintaining communication networks. Current practice, however, relies on the cumbersome and error-prone process of manual tower inspections. We propose a novel paradigm shift that fuses video footage of base stations, antenna geometric features, and Physical Cell Identity (PCI) signals, transforming antenna affiliation identification into multi-modal classification and matching tasks. Publicly available pretrained LLMs struggle with this unique task due to a lack of analogous data in the communications domain, which hampers cross-modal alignment. To address this, we introduce a dedicated training framework that aligns antenna images with corresponding PCI signals. To tackle the representation alignment challenge, we propose a novel Token Entropy Regularization (TER) module in the pretraining stage. Our experiments demonstrate that TER accelerates convergence and yields significant performance gains. Further analysis reveals that the entropy of the first token is modality-dependent. Code will be made available upon publication.
NEXUS Bit-Exact ANN-to-SNN Equivalence via Neuromorphic Gate Circuits with Surrogate-Free Training
Authors: Zhengzheng Tang
2026-01-29
Spiking Neural Networks (SNNs) promise energy-efficient computing through event-driven computation, yet all existing approaches sacrifice accuracy by approximating continuous values with discrete spikes. We propose NEXUS, a framework that achieves bit-exact ANN-to-SNN equivalence -- not approximate, but mathematically identical outputs. Our key insight is constructing all arithmetic operations, both linear and nonlinear, from pure IF neuron logic gates that implement IEEE-754 compliant floating-point arithmetic. Through spatial bit encoding (zero encoding error by construction), hierarchical neuromorphic gate circuits (from basic logic gates to complete transformer layers), and surrogate-free STE training (exact identity mapping rather than heuristic approximation), NEXUS produces outputs identical to standard ANNs up to machine precision. Experiments on models up to LLaMA-2 70B demonstrate identical task accuracy (0.00% degradation) with mean ULP error of only 6.19, while achieving 27-168,000x energy reduction on neuromorphic hardware. Crucially, spatial bit encoding's single-timestep design renders the framework inherently immune to membrane potential leakage (100% accuracy across all tested decay factors), while tolerating synaptic noise with >98% gate-level accuracy.
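The claim that exact arithmetic can be built from IF-neuron logic gates can be made concrete with a single-timestep integrate-and-fire gate and a full adder composed from it. This is an illustrative reconstruction of the idea, not the NEXUS circuit library; the gate weights and thresholds below are standard threshold-logic choices.

```python
def if_neuron(inputs, weights, threshold):
    """Integrate-and-fire neuron used as a logic gate: fire (1) iff the
    weighted input sum reaches the threshold, in a single timestep."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

AND = lambda a, b: if_neuron([a, b], [1, 1], 2)
OR  = lambda a, b: if_neuron([a, b], [1, 1], 1)
NOT = lambda a:    if_neuron([a],    [-1],   0)
XOR = lambda a, b: AND(OR(a, b), NOT(AND(a, b)))   # gate composition

def full_adder(a, b, carry_in):
    """One bit of exact binary addition built purely from IF-neuron gates,
    the kind of building block that scales up to full arithmetic circuits."""
    s = XOR(XOR(a, b), carry_in)
    carry_out = OR(AND(a, b), AND(carry_in, XOR(a, b)))
    return s, carry_out

assert all(
    full_adder(a, b, c) == ((a + b + c) % 2, (a + b + c) // 2)
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
)
```

Because each gate fires in one timestep on binary inputs, there is no temporal rate code to leak, which mirrors the abstract's leakage-immunity argument for spatial bit encoding.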
Lightweight High-Fidelity Low-Bitrate Talking Face Compression for 3D Video Conference
Authors: Jianglong Li, Jun Xu, Bingcong Lu, Zhengxue Cheng, Hongwei Hu, Ronghua Wu, Li Song
2026-01-29
The demand for immersive and interactive communication has driven advancements in 3D video conferencing, yet achieving high-fidelity 3D talking face representation at low bitrates remains a challenge. Traditional 2D video compression techniques fail to preserve fine-grained geometric and appearance details, while implicit neural rendering methods like NeRF suffer from prohibitive computational costs. To address these challenges, we propose a lightweight, high-fidelity, low-bitrate 3D talking face compression framework that integrates FLAME-based parametric modeling with 3DGS neural rendering. Our approach transmits only essential facial metadata in real time, enabling efficient reconstruction with a Gaussian-based head model. Additionally, we introduce a compact representation and compression scheme, including Gaussian attribute quantization and MLP optimization, to enhance transmission efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance, delivering high-quality facial rendering at extremely low bitrates, making it well-suited for real-time 3D video conferencing applications.
User-Centric Phishing Detection A RAG and LLM-Based Approach
Authors: Abrar Hamed Al Barwani, Abdelaziz Amara Korba, Raja Waseem Anwar
2026-01-29
The escalating sophistication of phishing emails necessitates a shift beyond traditional rule-based and conventional machine-learning-based detectors. Although large language models (LLMs) offer strong natural language understanding, using them as standalone classifiers often yields elevated false-positive (FP) rates, which mislabel legitimate emails as phishing and create significant operational burden. This paper presents a personalized phishing detection framework that integrates LLMs with retrieval-augmented generation (RAG). For each message, the system constructs user-specific context by retrieving a compact set of the user's historical legitimate emails and enriching it with real-time domain and URL reputation from a cyber-threat intelligence platform, then conditions the LLM's decision on this evidence. We evaluate four open-source LLMs (Llama4-Scout, DeepSeek-R1, Mistral-Saba, and Gemma2) on an email dataset collected from public and institutional sources. Results show high performance; for example, Llama4-Scout attains an F1-score of 0.9703 and achieves a 66.7% reduction in FPs with RAG. These findings validate that a RAG-based, user-profiling approach is both feasible and effective for building high-precision, low-friction email security systems that adapt to individual communication patterns.
3D imaging of the biphoton spatiotemporal wave packet
Authors: Yang Xue, Ze-Shan He, Hao-Shu Tian, Qin-Qin Wang, Bin-Tong Yin, Jun Zhong, Xiao-Ye Xu, Chuan-Feng Li, Guang-Can Guo
2026-01-29
Photons are among the most important carriers of quantum information owing to their rich degrees of freedom (DoFs), including various spatiotemporal structures. The ability to characterize these DoFs, as well as the hidden correlations among them, directly determines whether they can be exploited for quantum tasks. While various methods have been developed for measuring the spatiotemporal structure of classical light fields, owing to the technical challenges posed by weak photon flux, there have so far been no reports of observing such structures in their quantum counterparts, except for a few studies limited to correlations within individual DoFs. Here, we propose and experimentally demonstrate a self-referenced, high-efficiency, and all-optical method, termed 3D imaging of photonic wave packets, for comprehensive characterization of the spatiotemporal structure of a quantum light field, i.e., the biphoton spatiotemporal wave packet. Benefiting from this developed method, we successfully observe the spatial-spatial, spectral-spectral, and spatiotemporal correlations of biphotons generated via spontaneous parametric down-conversion, revealing rich local and nonlocal spatiotemporal structure in quantum light fields. This method will further advance the understanding of the dynamics in nonlinear quantum optics and expand the potential of photons for applications in quantum communication and quantum computing.
Less Noise, More Voice Reinforcement Learning for Reasoning via Instruction Purification
Authors: Yiju Guo, Tianyi Hu, Zexu Sun, Yankai Lin
2026-01-29
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens. LENS then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over 1.6x speedup. Our work highlights the critical role of removing interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
Delegation Without Living Governance
Authors: Wolfgang Rohde
2026-01-29
Most governance frameworks assume that rules can be defined in advance, systems can be engineered to comply, and accountability can be applied after outcomes occur. This model worked when machines replaced physical labor or accelerated calculation. It no longer holds when judgment itself is delegated to agentic AI systems operating at machine speed. The central issue here is not safety, efficiency, or employment. It is whether humans remain relevant participants in systems that increasingly shape social, economic, and political outcomes. This paper argues that static, compliance-based governance fails once decision-making moves to runtime and becomes opaque. It further argues that the core challenge is not whether AI is conscious, but whether humans can maintain meaningful communication, influence, and co-evolution with increasingly alien forms of intelligence. We position runtime governance, specifically a newly proposed concept called the Governance Twin [1], as a strong candidate for preserving human relevance, while acknowledging that accountability, agency, and even punishment must be rethought in this transition.
Soft Quantization Model Compression Via Weight Coupling
Authors: Daniel T. Bernstein, Luca Di Carlo, David Schwab
2026-01-29
We show that introducing short-range attractive couplings between the weights of a neural network during training provides a novel avenue for model compression. These couplings rapidly induce the discretization of a model's weight distribution, and they do so in a mixed-precision manner despite only relying on two additional hyperparameters. We demonstrate that, within an appropriate range of hyperparameters, our "soft quantization" scheme outperforms histogram-equalized post-training quantization on ResNet-20/CIFAR-10. Soft quantization provides both a new pipeline for the flexible compression of machine learning models and a new tool for investigating the trade-off between compression and generalization in high-dimensional loss landscapes.
ZipMoE Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling
Authors: Yuchen Yang, Yaru Zhao, Pu Yang, Shaowei Wang, Zhi-Hua Zhou
2026-01-29
While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large language models, their prohibitive memory footprint severely impedes practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy compression. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with provable performance guarantees. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation shows that ZipMoE substantially reduces inference latency and delivers higher throughput than state-of-the-art systems.
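The "semantically lossless" property can be illustrated with a generic byte-level round-trip: unlike quantization, decompression restores the parameters bit-exactly. This uses stdlib `zlib`/`struct` purely as a stand-in; ZipMoE's actual codec and cache scheduling are its own contributions.

```python
import random
import struct
import zlib

def pack_expert(weights):
    """Losslessly compress an expert's float32 parameters."""
    raw = struct.pack(f"{len(weights)}f", *weights)
    return zlib.compress(raw, level=6)

def unpack_expert(blob, n):
    """Decompress: the restored values are bit-identical to the inputs."""
    return list(struct.unpack(f"{n}f", zlib.decompress(blob)))

rng = random.Random(0)
# Toy low-entropy parameters: few distinct values, so zlib finds redundancy.
expert = [rng.choice([0.0, 0.5, -0.5, 1.0]) for _ in range(4096)]
blob = pack_expert(expert)
assert unpack_expert(blob, len(expert)) == expert   # bit-exact round-trip
assert len(blob) < 4096 * 4                         # smaller than raw fp32
```

The bit-exact guarantee is what lets a system like this preserve model behavior exactly, at the cost of decompression compute, which is why shifting from an I/O-bound to a compute-centric workflow matters on edge hardware.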
Do Reasoning Models Enhance Embedding Models?
Authors: Wun Yu Chan, Shaojin Chen, Huihao Jing, Kwun Hang Lau, Elton Chun-Chai Li, Zihao Wang, Haoran Li, Yangqiu Song
2026-01-29
State-of-the-art embedding models are increasingly derived from decoder-only Large Language Model (LLM) backbones adapted via contrastive learning. Given the emergence of reasoning models trained via Reinforcement Learning with Verifiable Rewards (RLVR), a natural question arises: do enhanced reasoning capabilities translate to superior semantic representations when these models serve as embedding initializations? Contrary to expectation, our evaluation on MTEB and BRIGHT reveals a null effect: embedding models initialized from RLVR-tuned backbones yield no consistent performance advantage over their base counterparts when subjected to identical training recipes. To unpack this paradox, we introduce Hierarchical Representation Similarity Analysis (HRSA), a framework that decomposes similarity across representation, geometry, and function levels. HRSA reveals that while RLVR induces an irreversible reorganization of the latent manifold's local geometry and a reversible drift of the coordinate basis, it preserves the global manifold geometry and linear readout. Consequently, subsequent contrastive learning drives strong alignment between base- and reasoning-initialized models, a phenomenon we term Manifold Realignment. Empirically, our findings suggest that unlike Supervised Fine-Tuning (SFT), RLVR optimizes trajectories within an existing semantic landscape rather than fundamentally restructuring the landscape itself.
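The distinction between a reversible coordinate-basis drift and a genuine geometry change can be probed with a standard representation-similarity measure. Linear CKA, shown below, is a common stand-in for this kind of analysis (the abstract does not specify HRSA's exact measures): it is invariant to orthogonal transforms, so a pure basis rotation leaves it at 1.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices
    (n_examples x features). Invariant to orthogonal transforms and
    isotropic scaling of either representation."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
base = rng.normal(size=(64, 16))                 # toy "base model" features
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # an orthogonal basis change
assert abs(linear_cka(base, base) - 1.0) < 1e-6
assert abs(linear_cka(base, base @ Q) - 1.0) < 1e-6   # drifted basis, same geometry
```

A measure with this invariance sees through reversible coordinate drift, so any drop it reports reflects an actual change in the representation's geometry.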
MAD Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
Authors: Sangyun Chung, Se Yeon Kim, Youngchae Chee, Yong Man Ro
2026-01-29
Multimodal Large Language Models (MLLMs) suffer from cross-modal hallucinations, where one modality inappropriately influences generation about another, leading to fabricated output. This exposes a more fundamental deficiency in modality-interaction control. To address this, we propose Modality-Adaptive Decoding (MAD), a training-free method that adaptively weights modality-specific decoding branches based on task requirements. MAD leverages the model's inherent ability to self-assess modality relevance by querying which modalities are needed for each task. The extracted modality probabilities are then used to adaptively weight contrastive decoding branches, enabling the model to focus on relevant information while suppressing cross-modal interference. Extensive experiments on CMM and AVHBench demonstrate that MAD significantly reduces cross-modal hallucinations across multiple audio-visual language models (7.8% and 2.0% improvements for VideoLLaMA2-AV, 8.7% and 4.7% improvements for Qwen2.5-Omni). Our approach demonstrates that explicit modality awareness through self-assessment is crucial for robust multimodal reasoning, offering a principled extension to existing contrastive decoding methods. Our code is available at https://github.com/top-yun/MAD
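The weighting mechanism can be sketched on raw logits: contrast the logits computed with and without a modality, scaled by the model's self-assessed relevance of that modality. This is a hypothetical reduction to one modality and a tiny vocabulary, not MAD's actual decoding path.

```python
def mad_logits(full_logits, ablated_logits, modality_weight, alpha=1.0):
    """Modality-adaptive contrastive decoding sketch: amplify what a
    modality contributes by contrasting logits with and without it,
    scaled by the self-assessed relevance weight in [0, 1]."""
    return [
        f + alpha * modality_weight * (f - a)
        for f, a in zip(full_logits, ablated_logits)
    ]

# Toy 3-token vocab; audio raises token 0, but the task is visual-only.
full = [2.0, 1.0, 0.5]        # logits with all modalities present
no_audio = [0.5, 1.2, 0.6]    # logits with the audio branch ablated
visual_task = mad_logits(full, no_audio, modality_weight=0.0)
audio_task = mad_logits(full, no_audio, modality_weight=1.0)
assert visual_task == full           # irrelevant modality: no contrastive shift
assert audio_task[0] > full[0]       # relevant modality: contribution amplified
```

When the self-assessed weight is zero, decoding falls back to the unmodified distribution, which is how an irrelevant modality is prevented from injecting interference.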
Output-Space Search Targeting LLM Generations in a Frozen Encoder-Defined Output Space
Authors: Tobias Materzok
2026-01-29
We introduce Output-Space Search (OS-Search), which turns generation into endpoint search. An outer loop selects a target z in a frozen encoder-defined 3D output space Z, and a retrieval-grounded policy trained with sequence-level RL generates outputs whose coordinates land near z under standard autoregressive decoding. This enables parallel sweeps and black-box optimization in Z without path-dependent token/program search. On stories, sweeping Z (text) yields 3.1x higher LLM-scored diversity than prompt-chaining. On code, Bayesian optimization over Z (code) improves an objective withheld from the controller under matched inference budgets while preserving validity.
BrainStack Neuro-MoE with Functionally Guided Expert Routing for EEG-Based Language Decoding
Authors: Ziyi Zhao, Jinzhao Zhou, Xiaowei Jiang, Beining Cao, Wenhao Ma, Yang Shen, Ren Li, Yu-Kai Wang, Chin-teng Lin
2026-01-29
Decoding linguistic information from electroencephalography (EEG) remains challenging due to the brain's distributed and nonlinear organization. We present BrainStack, a functionally guided neuro-mixture-of-experts (Neuro-MoE) framework that models the brain's modular functional architecture through anatomically partitioned expert networks. Each functional region is represented by a specialized expert that learns localized neural dynamics, while a transformer-based global expert captures cross-regional dependencies. A learnable routing gate adaptively aggregates these heterogeneous experts, enabling context-dependent expert coordination and selective fusion. To promote coherent representation across the hierarchy, we introduce cross-regional distillation, where the global expert provides top-down regularization to the regional experts. We further release SilentSpeech-EEG (SS-EEG), a large-scale benchmark comprising over 120 hours of EEG recordings from 12 subjects performing 24 silent words, the largest dataset of its kind. Experiments demonstrate that BrainStack consistently outperforms state-of-the-art models, achieving superior accuracy and generalization across subjects. Our results establish BrainStack as a functionally modular, neuro-inspired MoE paradigm that unifies neuroscientific priors with adaptive expert routing, paving the way for scalable and interpretable brain-language decoding.
Planner-Auditor Twin Agentic Discharge Planning with FHIR-Based LLM Planning, Guideline Recall, Optional Caching and Self-Improvement
Authors: Kaiyuan Wu, Aditya Nagori, Rishikesan Kamaleswaran
2026-01-28
Objective: Large language models (LLMs) show promise for clinical discharge planning, but their use is constrained by hallucination, omissions, and miscalibrated confidence. We introduce a self-improving, caching-optional Planner-Auditor framework that improves safety and reliability by decoupling generation from deterministic validation and targeted replay.
Materials and Methods: We implemented an agentic, retrospective, FHIR-native evaluation pipeline using MIMIC-IV-on-FHIR. For each patient, the Planner (an LLM) generates a structured discharge action plan with an explicit confidence estimate. The Auditor is a deterministic module that evaluates multi-task coverage, tracks calibration (Brier score, ECE proxies), and monitors action-distribution drift. The framework supports two-tier self-improvement: (i) within-episode regeneration when enabled, and (ii) cross-episode discrepancy buffering with replay for high-confidence, low-coverage cases.
Results: While context caching improved performance over baseline, the self-improvement loop was the primary driver of gains, increasing task coverage from 32% to 86%. Calibration improved substantially, with reduced Brier/ECE and fewer high-confidence misses. Discrepancy buffering further corrected persistent high-confidence omissions during replay.
Discussion: Feedback-driven regeneration and targeted replay act as effective control mechanisms to reduce omissions and improve confidence reliability in structured clinical planning. Separating an LLM Planner from a rule-based, observational Auditor enables systematic reliability measurement and safer iteration without model retraining.
Conclusion: The Planner-Auditor framework offers a practical pathway toward safer automated discharge planning using interoperable FHIR data access and deterministic auditing, supported by reproducible ablations and reliability-focused evaluation.
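The calibration metrics the Auditor tracks are standard and easy to state precisely. Below are textbook definitions of the Brier score and a binned ECE proxy on hypothetical confidence/outcome pairs; the paper's exact proxies may differ.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)

def expected_calibration_error(confidences, outcomes, bins=10):
    """ECE proxy: bin predictions by confidence and average the
    |accuracy - confidence| gap per bin, weighted by bin size."""
    n = len(outcomes)
    ece = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == bins - 1 and c == 1.0)]
        if idx:
            acc = sum(outcomes[i] for i in idx) / len(idx)
            conf = sum(confidences[i] for i in idx) / len(idx)
            ece += len(idx) / n * abs(acc - conf)
    return ece

# A perfectly calibrated planner vs an overconfident one.
assert brier_score([1.0, 0.0], [1, 0]) == 0.0
assert brier_score([0.9, 0.9], [0, 0]) > 0.5
assert expected_calibration_error([0.95, 0.95], [0, 0]) > 0.9
```

High-confidence misses show up in both metrics, which is what lets the discrepancy buffer target exactly those episodes for replay.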
ChunkWise LoRA Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference
Authors: Ketan Thakkar, Maitreyi Chatterjee, Ramasubramanian Balasubramanian, Achyuthan Jootoo, Rajendra Ugrani
2026-01-28
Recent advances in low-rank adaptation (LoRA) have enabled efficient fine-tuning of large language models (LLMs) with minimal additional parameters. However, existing LoRA methods apply static rank configurations uniformly across all input tokens, ignoring variation in token complexity and computational requirements. In this work, we propose ChunkWise LoRA, a dynamic and adaptive approach that partitions sequences into variable-length chunks based on token complexity and assigns each chunk a tailored low-rank configuration. Our system introduces a runtime scheduler that estimates token difficulty, performs adaptive chunking, and selects per-chunk LoRA rank and scaling using a rank-ladder mechanism. To preserve output consistency, we further introduce a boundary-safe composition module and integrate policy-driven KV-cache strategies. Experiments on benchmark datasets such as Wikitext-103 and SQuAD demonstrate that ChunkWise LoRA achieves up to 34% lower latency and 38% memory reduction compared to baseline LoRA, while maintaining or improving task performance metrics like BLEU, EM, and perplexity. The proposed framework remains fully compatible with existing transformer architectures and inference frameworks, providing a practical solution for real-world deployment of parameter-efficient LLMs.
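A rank-ladder assignment can be sketched as mapping each chunk's estimated difficulty onto an increasing ladder of LoRA ranks. The fixed chunk length, difficulty scores, and ladder values below are all hypothetical; the paper's scheduler uses variable-length chunks and its own difficulty estimator.

```python
def assign_ranks(token_difficulties, chunk_len=4, ladder=(4, 8, 16)):
    """Rank-ladder sketch: split the sequence into chunks, average each
    chunk's per-token difficulty in [0, 1], and map it onto an increasing
    ladder of LoRA ranks (easy chunks get small ranks)."""
    chunks = [token_difficulties[i:i + chunk_len]
              for i in range(0, len(token_difficulties), chunk_len)]
    ranks = []
    for ch in chunks:
        d = sum(ch) / len(ch)
        ranks.append(ladder[min(int(d * len(ladder)), len(ladder) - 1)])
    return ranks

# An easy prefix gets the small rank; a hard span gets the full rank.
difficulty = [0.1, 0.1, 0.2, 0.1,  0.9, 0.8, 0.95, 0.9]
assert assign_ranks(difficulty) == [4, 16]
```

Since the low-rank update cost scales with the rank, spending rank 16 only on hard chunks is where the latency and memory savings come from.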
CompSRT Quantization and Pruning for Image Super Resolution Transformers
Authors: Dorsa Zeinali, Hailing Wang, Yitian Zhang, Raymond Fu
2026-01-28
Model compression has become an important tool for making image super resolution models more efficient. However, the gap between the best compressed models and their full-precision counterparts remains large, and a deeper understanding of compression theory for more performant models is still needed. Prior research on quantization of LLMs has shown that Hadamard transformations lead to weights and activations with fewer outliers, which in turn improves performance. We argue that while the Hadamard transform does reduce the effect of outliers, an empirical analysis of how the transform functions is still needed. By studying the distributions of the weights and activations of SwinIR-light, we show with statistical analysis that the lower errors are caused by the Hadamard transform's ability to reduce value ranges and increase the proportion of values around zero. Based on these findings, we introduce CompSRT, a more performant way to compress the image super resolution network SwinIR-light. We perform Hadamard-based quantization, and we also perform scalar decomposition to introduce two additional trainable parameters. Our quantized models statistically significantly surpass the SOTA on standard metrics, with gains as large as 1.53 dB, and visibly improve visual quality by reducing blurriness at all bitwidths. At low bitwidths, to show our method is compatible with pruning for increased compression, we also prune a portion of the weights and show a further reduction in bits per parameter with performance comparable to SOTA.
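The outlier-spreading effect attributed to the Hadamard transform above is easy to reproduce numerically. The sketch below (a generic fast Walsh-Hadamard transform on a toy vector, not CompSRT's actual pipeline) shows that a single large outlier gets diluted across all coefficients, shrinking the range that a quantizer must cover.

```python
import math

def hadamard(x):
    """Normalized fast Walsh-Hadamard transform of a vector of length 2^k."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

# A toy weight vector with one large outlier, as often seen in LLM layers.
w = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, 8.0, -0.05]
hw = hadamard(w)
print(round(max(map(abs, w)), 2), round(max(map(abs, hw)), 2))  # 8.0 3.09
```

Because the transform is orthonormal, total energy is preserved, but the 8.0 outlier is spread across all eight coefficients, so a uniform quantization grid wastes far fewer levels on empty range.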
Probabilistic Interpolation of Sagittarius A*'s Multi-Wavelength Light Curves Using Diffusion Models
Authors: Gabriel Sasseville, Julie Hlavacek-Larrondo, Daryl Haggard, Alexandre Adam, Hadrien Paugnat, Gunther Witzel
2026-01-28
Understanding the variability of Sagittarius A* (Sgr A*) requires coordinated, multi-wavelength observations that span the electromagnetic spectrum. In this work, we focus on data from four key observatories: Chandra in the X-ray (2-8 keV), GRAVITY on the Very Large Telescope in the near-infrared (2.2 microns), Spitzer in the infrared (4.5 microns), and ALMA in the submillimeter (340 GHz). These multi-band observations are essential for probing the physics of accretion and emission near the black hole's event horizon, yet they suffer from irregular sampling, band-dependent noise, and substantial data gaps. These limitations complicate efforts to robustly identify flares and measure cross-band time lags, key diagnostics of the physical processes driving variability. To address this challenge, we introduce a diffusion-based generative model for interpolating irregularly sampled, multivariate astrophysical time series. This represents the first application of score-based diffusion models to astronomical time series. We also present the first diffusion-based model for light curve reconstruction that includes calibrated uncertainty estimates. The models are trained on simulated light curves constructed to match the statistical and observational characteristics of real Sgr A* data. These simulations capture correlated multi-band variability, realistic observation cadences, and wavelength-specific noise. We compare our models against a multi-output Gaussian Process. The diffusion model achieves superior accuracy and competitive calibration across both simulated and real datasets, demonstrating the promise of diffusion models for high-fidelity, uncertainty-aware reconstruction of multi-wavelength variability in Sgr A*.
Evolutionary Strategies lead to Catastrophic Forgetting in LLMs
Authors: Immanuel Abdi, Akshat Gupta, Micah Mok, Alexander Lu, Nicholas Lee, Gopala Anumanchipalli
2026-01-28
One of the biggest missing capabilities in current AI systems is the ability to learn continuously after deployment. Implementing such continually learning systems presents several challenges, one of which is the large memory requirement of the gradient-based algorithms used to train state-of-the-art LLMs. Evolutionary Strategies (ES) have recently re-emerged as a gradient-free alternative to traditional learning algorithms and have shown encouraging performance on specific tasks in LLMs. In this paper, we perform a comprehensive analysis of ES and specifically evaluate its forgetting curves when training for an increasing number of update steps. We first find that ES is able to reach performance close to GRPO on math and reasoning tasks with a comparable compute budget. However, and most importantly for continual learning, the performance gains of ES are accompanied by significant forgetting of prior abilities, limiting its applicability for training models online. We also explore the reason behind this behavior and show that the updates made using ES have orders of magnitude larger norm than corresponding GRPO updates, explaining the contrasting forgetting curves between the two algorithms. With this study, we aim to highlight the issue of forgetting in gradient-free algorithms like ES and hope to inspire future work to mitigate these issues.
MemCtrl Using MLLMs as Active Memory Controllers on Embodied Agents
Authors: Vishnu Sashank Dorbala, Dinesh Manocha
2026-01-28
Foundation models rely on in-context learning for personalized decision making. The limited size of the context window necessitates memory and retrieval systems such as RAG. These systems, however, often treat memory as a large offline storage space, which is unfavorable for embodied agents that are expected to operate online under strict memory and compute constraints. In this work, we propose MemCtrl, a novel framework that uses Multimodal Large Language Models (MLLMs) for controlling memory online. MemCtrl augments MLLMs with a trainable memory head μ that acts as a gate to determine which observations or reflections to retain, update, or discard during exploration. We evaluate two ways of training μ, 1) via an offline expert, and 2) via online RL, and observe significant improvement in overall embodied task completion ability for μ-augmented MLLMs. In particular, on augmenting two low-performing MLLMs with MemCtrl on multiple subsets of the EmbodiedBench benchmark, we observe that μ-augmented MLLMs improve by around 16% on average, with over 20% on specific instruction subsets. Finally, we present a qualitative analysis of the memory fragments collected by μ, noting the superior performance of μ-augmented MLLMs on long and complex instruction types.
Robust Federated Learning for Malicious Clients using Loss Trend Deviation Detection
Authors: Deepthy K Bhaskar, Minimol B, Binu V P
2026-01-28
Federated Learning (FL) facilitates collaborative model training among distributed clients while ensuring that raw data remains on local devices. Despite this advantage, FL systems are still exposed to risks from malicious or unreliable participants. Such clients can interfere with the training process by sending misleading updates, which can negatively affect the performance and reliability of the global model. Many existing defense mechanisms rely on gradient inspection, complex similarity computations, or cryptographic operations, which introduce additional overhead and may become unstable under non-IID data distributions. In this paper, we propose Federated Learning with Loss Trend Detection (FL-LTD), a lightweight and privacy-preserving defense framework that detects and mitigates malicious behavior by monitoring temporal loss dynamics rather than model gradients. The proposed approach identifies anomalous clients by detecting abnormal loss stagnation or abrupt loss fluctuations across communication rounds. To counter adaptive attackers, a short-term memory mechanism is incorporated to sustain mitigation for clients previously flagged as anomalous, while enabling trust recovery for stable participants. We evaluate FL-LTD on a non-IID federated MNIST setup under loss manipulation attacks. Experimental results demonstrate that the proposed method significantly enhances robustness, achieving a final test accuracy of 0.84, compared to 0.41 for standard FedAvg under attack. FL-LTD incurs negligible computational and communication overhead, maintains stable convergence, and avoids client exclusion or access to sensitive data, highlighting the effectiveness of loss-based monitoring for secure federated learning.
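Loss-trend monitoring of the kind described above can be illustrated with a robust outlier test on per-client loss deltas. The statistic below (median/MAD with a threshold of 3.5) is our own stand-in, not the paper's exact rule: a client whose round-to-round loss change deviates sharply from the cohort's trend is flagged.

```python
# Hypothetical sketch of loss-trend deviation detection; the median/MAD
# statistic and the 3.5 threshold are illustrative assumptions.

def _median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def flag_anomalous(loss_histories, thresh=3.5):
    """loss_histories: {client_id: [loss at round 0, round 1, ...]}."""
    # Per-client loss change over the most recent round.
    deltas = {c: h[-1] - h[-2] for c, h in loss_histories.items()}
    med = _median(list(deltas.values()))
    mad = _median([abs(d - med) for d in deltas.values()]) or 1e-12
    # Flag clients whose loss trend deviates far from the cohort's median.
    return {c for c, d in deltas.items() if abs(d - med) / mad > thresh}

histories = {
    "c1": [1.0, 0.80, 0.65],  # normal, steadily decreasing loss
    "c2": [1.1, 0.90, 0.72],  # normal
    "c3": [1.0, 0.85, 0.70],  # normal
    "c4": [1.0, 0.90, 2.50],  # abrupt loss spike -> likely malicious
}
print(flag_anomalous(histories))  # {'c4'}
```

Because only scalar losses are inspected, the check is cheap and never touches client gradients or raw data, which is the privacy argument the abstract makes.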
Context-Augmented Code Generation Using Programming Knowledge Graphs
Authors: Shahd Seddik, Fahd Seddik, Iman Saberi, Fatemeh Fard, Minh Hieu Huynh, Patanamon Thongtanunam
2026-01-28
Large Language Models (LLMs) excel at code generation but struggle with complex problems. Retrieval-Augmented Generation (RAG) mitigates this issue by integrating external knowledge, yet retrieval models often miss relevant context, and generation models hallucinate when given irrelevant data. We propose the Programming Knowledge Graph (PKG) for semantic representation and fine-grained retrieval of code and text. Our approach enhances retrieval precision through tree pruning and mitigates hallucinations via a re-ranking mechanism that integrates non-RAG solutions. Structuring external data into finer-grained nodes improves retrieval granularity. Evaluations on HumanEval and MBPP show up to 20% pass@1 accuracy gains and a 34% improvement over baselines on MBPP. Our findings demonstrate that the proposed PKG approach, together with the re-ranker, effectively addresses complex problems while having minimal negative impact on solutions that are already correct without RAG. The replication package is published at https://github.com/iamshahd/ProgrammingKnowledgeGraph
Leveraging Second-Order Curvature for Efficient Learned Image Compression Theory and Empirical Evidence
Authors: Yichi Zhang, Fengqing Zhu
2026-01-28
Training learned image compression (LIC) models entails navigating a challenging optimization landscape defined by the fundamental trade-off between rate and distortion. Standard first-order optimizers, such as SGD and Adam, struggle with \emph{gradient conflicts} arising from competing objectives, leading to slow convergence and suboptimal rate-distortion performance. In this work, we demonstrate that simply adopting a second-order quasi-Newton optimizer, \textbf{SOAP}, dramatically improves both training efficiency and final performance across diverse LICs. Our theoretical and empirical analyses reveal that Newton preconditioning inherently resolves the intra-step and inter-step update conflicts intrinsic to the R-D objective, facilitating faster, more stable convergence. Beyond training efficiency, we uncover a critical deployability benefit: second-order trained models exhibit significantly fewer activation and latent outliers. This substantially enhances robustness to post-training quantization. Together, these results establish second-order optimization, achievable as a seamless drop-in replacement for the existing optimizer, as a powerful, practical tool for advancing the efficiency and real-world readiness of LICs.
HESTIA A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs
Authors: Guoan Wang, Feiyu Wang, Zongwei Lv, Yikun Zong, Tong Yang
2026-01-28
As large language models (LLMs) continue to scale, deployment is increasingly bottlenecked by the memory wall, motivating a shift toward extremely low-bit quantization. However, most quantization-aware training (QAT) methods apply hard rounding and the straight-through estimator (STE) from the beginning of training, which prematurely discretizes the optimization landscape and induces persistent gradient mismatch between latent weights and quantized weights, hindering effective optimization of quantized models. To address this, we propose Hestia, a Hessian-guided differentiable QAT framework for extremely low-bit LLMs, which replaces the rigid step function with a temperature-controlled softmax relaxation to maintain gradient flow early in training while progressively hardening quantization. Furthermore, Hestia leverages a tensor-wise Hessian trace metric as a lightweight curvature signal to drive fine-grained temperature annealing, enabling sensitivity-aware discretization across the model. Evaluations on Llama-3.2 show that Hestia consistently outperforms existing ternary QAT baselines, yielding average zero-shot improvements of 5.39% and 4.34% for the 1B and 3B models, respectively. These results indicate that Hessian-guided relaxation effectively recovers representational capacity, establishing a more robust training path for 1.58-bit LLMs. The code is available at https://github.com/hestia2026/Hestia.
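A temperature-controlled softmax relaxation of rounding, the mechanism named above, can be sketched for a ternary grid. The exact parameterization in Hestia may differ; this toy only shows how annealing the temperature moves a soft, differentiable assignment toward hard rounding.

```python
import math

# Toy soft quantization onto the ternary grid {-1, 0, +1}; the specific
# distance-based logits are an illustrative choice, not Hestia's exact form.
LEVELS = [-1.0, 0.0, 1.0]

def soft_quantize(w, temperature):
    """Softmax over negative distances to each grid level, then expectation."""
    logits = [-abs(w - l) / temperature for l in LEVELS]
    m = max(logits)                       # stabilize the softmax
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return sum(p * l for p, l in zip(probs, LEVELS))

w = 0.7
for t in [1.0, 0.1, 0.01]:
    print(t, round(soft_quantize(w, t), 3))
# As t -> 0 the soft value approaches hard rounding of 0.7, i.e. 1.0
```

At high temperature the output is a smooth blend of grid levels, so gradients flow to the latent weight; as the temperature anneals, the assignment hardens and the gradient mismatch that plagues early hard rounding is avoided.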
SA-PEF Step-Ahead Partial Error Feedback for Efficient Federated Learning
Authors: Dawit Kiros Redie, Reza Arablouei, Stefan Werner
2026-01-28
Biased gradient compression with error feedback (EF) reduces communication in federated learning (FL), but under non-IID data the residual error can decay slowly, causing gradient mismatch and stalled progress in the early rounds. We propose step-ahead partial error feedback (SA-PEF), which integrates step-ahead (SA) correction with partial error feedback (PEF). SA-PEF recovers EF and step-ahead EF (SAEF) as limiting cases of the step-ahead coefficient. For non-convex objectives and contractive compressors, we establish a second-moment bound and a residual recursion that guarantee convergence to stationarity under heterogeneous data and partial client participation. The resulting rates match standard non-convex Fed-SGD guarantees up to constant factors, achieving convergence to a variance/heterogeneity floor with a fixed inner step size. Our analysis reveals a step-ahead-controlled residual contraction that explains the faster progress observed in the early training phase. To balance SAEF's rapid warm-up with EF's long-term stability, we select the step-ahead coefficient near its theory-predicted optimum. Experiments across diverse architectures and datasets show that SA-PEF consistently reaches target accuracy faster than EF.
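The error-feedback mechanism that SA-PEF builds on is simple to sketch. The following is generic EF with a top-k contractive compressor, not the paper's SA-PEF update rule: the compression error left over in each round is added back before compressing the next gradient, so no information is permanently lost.

```python
# Minimal sketch of classic error feedback (not SA-PEF itself).

def top_k(v, k):
    """Contractive compressor: keep the k largest-magnitude entries."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]

def ef_round(grad, residual, k):
    """One EF round: correct, compress, and store the leftover error."""
    corrected = [g + r for g, r in zip(grad, residual)]
    sent = top_k(corrected, k)                           # client transmits this
    residual = [c - s for c, s in zip(corrected, sent)]  # kept locally
    return sent, residual

residual = [0.0, 0.0, 0.0, 0.0]
sent, residual = ef_round([0.9, 0.1, -0.2, 0.05], residual, k=1)
print(sent)      # [0.9, 0.0, 0.0, 0.0]
print(residual)  # [0.0, 0.1, -0.2, 0.05]
# Next round the small entries are re-injected and can eventually be sent.
```

The slow residual decay the abstract mentions is visible here: under skewed (non-IID) gradients the residual can keep absorbing the same coordinates round after round, which is the pathology the step-ahead correction targets.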
P2S Probabilistic Process Supervision for General-Domain Reasoning Question Answering
Authors: Wenlin Zhong, Chengyuan Liu, Yiquan Wu, Bovin Tan, Changlong Sun, Yi Wang, Xiaozhong Liu, Kun Kuang
2026-01-28
While reinforcement learning with verifiable rewards (RLVR) has advanced reasoning in structured domains like mathematics and programming, its application to general-domain reasoning tasks remains challenging due to the absence of verifiable reward signals. To this end, methods like Reinforcement Learning with Reference Probability Reward (RLPR) have emerged, leveraging the probability of generating the final answer as a reward signal. However, these outcome-focused approaches neglect crucial step-by-step supervision of the reasoning process itself. To address this gap, we introduce Probabilistic Process Supervision (P2S), a novel self-supervision framework that provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps. During reinforcement learning, P2S synthesizes and filters a high-quality reference reasoning chain (gold-CoT). The core of our method is to calculate a Path Faithfulness Reward (PFR) for each reasoning step, derived from the conditional probability of generating the gold-CoT's suffix given the model's current reasoning prefix. Crucially, this PFR can be flexibly integrated with any outcome-based reward, directly tackling the reward sparsity problem by providing dense guidance. Extensive experiments on reading comprehension and medical question answering benchmarks show that P2S significantly outperforms strong baselines.
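The suffix-probability idea behind the PFR can be sketched independently of any real language model. Everything below is a toy: `toy_logprob` is a hand-written stand-in for an LM's scoring function, and the length normalization is our own assumption, but the shape of the computation (score the gold suffix conditioned on the model's current prefix) follows the description above.

```python
import math

def pfr(logprob, model_steps, gold_steps):
    """Per-step reward from the probability of completing with the gold suffix."""
    rewards = []
    for i in range(1, len(model_steps) + 1):
        prefix = model_steps[:i]
        lp = logprob(prefix, gold_steps[i:])
        # Length-normalize so short and long suffixes are comparable (our choice).
        rewards.append(math.exp(lp / max(len(gold_steps[i:]), 1)))
    return rewards

GOLD = ["a", "b", "c", "d"]

def toy_logprob(prefix, suffix):
    # Stand-in scorer: gold suffix tokens are cheap when the prefix so far
    # agrees with the gold chain, expensive once the reasoning drifts.
    agree = sum(p == g for p, g in zip(prefix, GOLD))
    per_step = -0.1 if agree == len(prefix) else -2.0
    return per_step * len(suffix)

faithful = pfr(toy_logprob, ["a", "b"], GOLD)   # stays on the gold path
drifting = pfr(toy_logprob, ["a", "x"], GOLD)   # second step drifts
print([round(r, 3) for r in faithful], [round(r, 3) for r in drifting])
```

The faithful trajectory keeps a high reward at every step, while the drifting one drops sharply at the step where it leaves the gold chain, which is exactly the dense, step-level signal the abstract claims over outcome-only rewards.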
TGSBM Transformer-Guided Stochastic Block Model for Link Prediction
Authors: Zhejian Yang, Songwei Zhao, Zilin Zhao, Hechang Chen
2026-01-28
Link prediction is a cornerstone of the Web ecosystem, powering applications from recommendation and search to knowledge graph completion and collaboration forecasting. However, large-scale networks present unique challenges: they contain hundreds of thousands of nodes and edges with heterogeneous and overlapping community structures that evolve over time. Existing approaches face notable limitations: traditional graph neural networks struggle to capture global structural dependencies, while recent graph transformers achieve strong performance but incur quadratic complexity and lack interpretable latent structure. We propose \textbf{TGSBM} (Transformer-Guided Stochastic Block Model), a framework that integrates the principled generative structure of Overlapping Stochastic Block Models with the representational power of sparse Graph Transformers. TGSBM comprises three main components: (i) \emph{expander-augmented sparse attention}, which enables near-linear complexity and efficient global mixing, (ii) a \emph{neural variational encoder} that infers structured posteriors over community memberships and strengths, and (iii) a \emph{neural edge decoder} that reconstructs links via OSBM's generative process, preserving interpretability. Experiments across diverse benchmarks demonstrate competitive performance (mean rank 1.6 under the HeaRT protocol), superior scalability with substantially faster training, and interpretable community structures. These results position TGSBM as a practical approach that strikes a balance between accuracy, efficiency, and transparency for large-scale link prediction.
Investigating the Development of Task-Oriented Communication in Vision-Language Models
Authors: Boaz Carmeli, Orr Paradise, Shafi Goldwasser, Yonatan Belinkov, Ron Meir
2026-01-28
We investigate whether \emph{LLM-based agents} can develop task-oriented communication protocols that differ from standard natural language in collaborative reasoning tasks. Our focus is on two core properties such task-oriented protocols may exhibit: Efficiency -- conveying task-relevant information more concisely than natural language, and Covertness -- becoming difficult for external observers to interpret, raising concerns about transparency and control. To investigate these aspects, we use a referential-game framework in which vision-language model (VLM) agents communicate, providing a controlled, measurable setting for evaluating language variants. Experiments show that VLMs can develop effective, task-adapted communication patterns. At the same time, they can develop covert protocols that are difficult for humans and external agents to interpret. We also observe spontaneous coordination between similar models without explicitly shared protocols. These findings highlight both the potential and the risks of task-oriented communication, and position referential games as a valuable testbed for future work in this area.
MeCo Enhancing LLM-Empowered Multi-Robot Collaboration via Similar Task Memoization
Authors: Baiqing Wang, Helei Cui, Bo Zhang, Xiaolong Zheng, Bin Guo, Zhiwen Yu
2026-01-28
Multi-robot systems have been widely deployed in real-world applications, providing significant improvements in efficiency and reductions in labor costs. However, most existing multi-robot collaboration methods rely on extensive task-specific training, which limits their adaptability to new or diverse scenarios. Recent research leverages the language understanding and reasoning capabilities of large language models (LLMs) to enable more flexible collaboration without specialized training. Yet, current LLM-empowered approaches remain inefficient: when confronted with identical or similar tasks, they must replan from scratch because they overlook task-level similarities. To address this limitation, we propose MeCo, a similarity-aware multi-robot collaboration framework that applies the principle of ``cache and reuse'' (a.k.a. memoization) to reduce redundant computation. Unlike simple task repetition, identifying and reusing solutions for similar but not identical tasks is far more challenging, particularly in multi-robot settings. To this end, MeCo introduces a new similarity testing method that retrieves previously solved tasks with high relevance, enabling effective plan reuse without re-invoking LLMs. Furthermore, we present MeCoBench, the first benchmark designed to evaluate performance in similar-task collaboration scenarios. Experimental results show that MeCo substantially reduces planning costs and improves success rates compared with state-of-the-art approaches.
DiffVC-RT Towards Practical Real-Time Diffusion-based Perceptual Neural Video Compression
Authors: Wenzhuo Ma, Zhenzhong Chen
2026-01-28
The practical deployment of diffusion-based Neural Video Compression (NVC) faces critical challenges, including severe information loss, prohibitive inference latency, and poor temporal consistency. To bridge this gap, we propose DiffVC-RT, the first framework designed to achieve real-time diffusion-based perceptual NVC. First, we introduce an Efficient and Informative Model Architecture. Through strategic module replacements, this architecture significantly reduces computational complexity while mitigating structural information loss. Second, to address generative flickering artifacts, we propose Explicit and Implicit Consistency Modeling. We enhance temporal consistency by explicitly incorporating a zero-cost Online Temporal Shift Module within the U-Net, complemented by hybrid implicit consistency constraints. Finally, we present an Asynchronous and Parallel Decoding Pipeline incorporating Mixed Half Precision, which enables asynchronous latent decoding and parallel frame reconstruction via a Batch-dimension Temporal Shift design. Experiments show that DiffVC-RT achieves 80.1% bitrate savings in terms of LPIPS over VTM-17.0 on the HEVC dataset, with real-time encoding and decoding speeds of 206 / 30 fps for 720p videos on an NVIDIA H800 GPU, marking a significant milestone in diffusion-based video compression.
Context Tokens are Anchors Understanding the Repetition Curse in dMLLMs from an Information Flow Perspective
Authors: Qiyan Zhao, Xiaofeng Zhang, Shuochen Chang, Qianyu Chen, Xiaosong Yuan, Xuhang Chen, Luoqi Liu, Jiajun Zhang, Xu-Yao Zhang, Da-Han Wang
2026-01-28
Recent diffusion-based Multimodal Large Language Models (dMLLMs) suffer from high inference latency and therefore rely on caching techniques to accelerate decoding. However, the application of caching mechanisms often introduces undesirable repetitive text generation, a phenomenon we term the \textbf{Repetition Curse}. To investigate the underlying mechanism behind this issue, we analyze repetition generation through the lens of information flow. Our work reveals three key findings: (1) context tokens aggregate semantic information as anchors and guide the final predictions; (2) as information propagates across layers, the entropy of context tokens converges in deeper layers, reflecting the model's growing prediction certainty; (3) repetition is typically linked to disruptions in the information flow of context tokens and to the inability of their entropy to converge in deeper layers. Based on these insights, we present \textbf{CoTA}, a plug-and-play method for mitigating repetition. CoTA enhances the attention of context tokens to preserve intrinsic information flow patterns, while introducing a penalty term to the confidence score during decoding to avoid outputs driven by uncertain context tokens. With extensive experiments, CoTA demonstrates significant effectiveness in alleviating repetition and achieves consistent performance improvements on general tasks. Code is available at https://github.com/ErikZ719/CoTA
Efficient Autoregressive Video Diffusion with Dummy Head
Authors: Hang Guo, Zhaoyang Jia, Jiahao Li, Bin Li, Yuanhao Cai, Jiangshan Wang, Yawei Li, Yan Lu
2026-01-28
The autoregressive video diffusion model has recently gained considerable research interest due to its causal modeling and iterative denoising. In this work, we identify that the multi-head self-attention in these models under-utilizes historical frames: approximately 25% of heads attend almost exclusively to the current frame, and discarding their caches incurs only minor performance degradation. Building on this, we propose Dummy Forcing, a simple yet effective method to control context accessibility across different heads. Specifically, the proposed heterogeneous memory allocation reduces head-wise context redundancy, accompanied by dynamic head programming to adaptively classify head types. Moreover, we develop a context packing technique to achieve more aggressive compression. Without additional training, Dummy Forcing delivers up to 2.0x speedup over the baseline, supporting video generation at 24.3 FPS with less than 0.5% quality drop. Project page is available at https://csguoh.github.io/project/DummyForcing/.
Towards Sensitivity-Aware Language Models
Authors: Dren Fazlija, Iyiola E. Olatunji, Daniel Kudenko, Sandipan Sikdar
2026-01-28
With LLMs increasingly deployed in corporate data management, it is crucial to ensure that these models do not leak sensitive information. In this context, the concept of sensitivity awareness has been introduced, enabling LLMs to adhere to predefined access-rights rules. However, it remains unclear how sensitivity awareness relates to established notions of privacy, such as differential privacy (DP), making it difficult to deploy meaningfully in real-world applications. In this work, we formalize the notion of sensitivity awareness and theoretically establish its connection to DP. Additionally, we develop a supervised fine-tuning recipe to make existing four-bit quantized LLMs more sensitivity-aware. With a performance boost of up to 21.7%, the fine-tuned LLMs not only substantially improve over their baselines but also outperform other full-precision open-source and commercial models of similar size in achieving sensitivity awareness, demonstrating the effectiveness of our proposed approach. At the same time, our method largely preserves the models' performance on other tasks, such as general instruction following and mathematical and common-sense reasoning.
CtrlCoT Dual-Granularity Chain-of-Thought Compression for Controllable Reasoning
Authors: Zhenxuan Fan, Jie Cao, Yang Dai, Zheqi Lv, Wenqiao Zhang, Zhongle Xie, Peng LU, Beng Chin Ooi
2026-01-28
Chain-of-thought (CoT) prompting improves reasoning but incurs high latency and memory cost due to verbose traces, motivating CoT compression with preserved correctness. Existing methods either shorten CoTs at the semantic level, which is often conservative, or prune tokens aggressively, which can miss task-critical cues and degrade accuracy. Moreover, combining the two is non-trivial due to sequential dependency, task-agnostic compression, and distribution mismatch. We propose \textbf{CtrlCoT}, a dual-granularity CoT compression framework that harmonizes semantic abstraction and token-level pruning through three components: Hierarchical Reasoning Abstraction produces CoTs at multiple semantic granularities; Logic-Preserving Distillation trains a logic-aware pruner to retain indispensable reasoning cues (e.g., numbers and operators) across compression ratios; and Distribution-Alignment Generation aligns compressed traces with fluent inference-time reasoning styles to avoid fragmentation. On MATH-500 with Qwen2.5-7B-Instruct, CtrlCoT uses 30.7% fewer tokens while achieving accuracy 7.6 percentage points higher than the strongest baseline, demonstrating more efficient and reliable reasoning. Our code will be publicly available at https://github.com/fanzhenxuan/Ctrl-CoT.
Youtu-Parsing Perception, Structuring and Recognition via High-Parallelism Decoding
Authors: Kun Yin, Yunfei Wu, Bing Liu, Zhongpeng Cai, Xiaotian Li, Huang Chen, Xin Li, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Qianyu Li, Antai Guo, Yanzhen Liao, Yanqiu Qu, Haodong Lin, Chengxu He, Shuangyin Liu
2026-01-28
This paper presents Youtu-Parsing, an efficient and versatile document parsing model designed for high-performance content extraction. The architecture employs a native Vision Transformer (ViT) featuring a dynamic-resolution visual encoder to extract shared document features, coupled with a prompt-guided Youtu-LLM-2B language model for layout analysis and region-prompted recognition. Leveraging this decoupled and feature-reusable framework, we introduce a high-parallelism decoding strategy comprising two core components: token parallelism and query parallelism. The token parallelism strategy concurrently generates up to 64 candidate tokens per inference step, which are subsequently validated through a verification mechanism. This approach yields a 5-11x speedup over traditional autoregressive decoding and is particularly well-suited to highly structured scenarios, such as table recognition. To further exploit the advantages of region-prompted recognition, the query parallelism strategy enables simultaneous content prediction for multiple bounding boxes (up to five), providing an additional 2x speedup while maintaining output quality equivalent to standard decoding. Youtu-Parsing covers a diverse range of document elements, including text, formulas, tables, charts, seals, and hierarchical structures. Furthermore, the model exhibits strong robustness when handling rare characters, multilingual text, and handwritten content. Extensive evaluations demonstrate that Youtu-Parsing achieves state-of-the-art (SOTA) performance on both the OmniDocBench and olmOCR-bench benchmarks. Overall, Youtu-Parsing demonstrates significant experimental value and practical utility for large-scale document intelligence applications.
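The propose-then-verify pattern behind token parallelism can be shown schematically. This is illustrative only: Youtu-Parsing's proposer and verifier are trained model components, not the simple functions below. A block of candidate tokens is proposed at once, and the longest prefix the verifier agrees with is accepted in a single step.

```python
# Schematic draft-then-verify loop (toy verifier, not the actual model).

def accept_prefix(candidates, verify_next):
    """verify_next(accepted_so_far) -> the verifier's next token."""
    accepted = []
    for tok in candidates:
        if verify_next(accepted) == tok:
            accepted.append(tok)
        else:
            break  # first disagreement ends the accepted prefix
    return accepted

# Toy verifier that spells out a fixed table-structure token sequence.
TARGET = ["<tr>", "<td>", "42", "</td>", "</tr>"]
verify = lambda acc: TARGET[len(acc)] if len(acc) < len(TARGET) else None

print(accept_prefix(["<tr>", "<td>", "41", "</td>"], verify))
# ['<tr>', '<td>'] -- two tokens accepted in one step instead of two
```

Highly structured output like table markup is exactly where long candidate prefixes tend to be accepted, which is why the abstract reports the largest speedups in table recognition.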
Concept Component Analysis A Principled Approach for Concept Extraction in LLMs
Authors: Yuhang Liu, Erdun Gao, Dong Gong, Anton van den Hengel, Javen Qinfeng Shi
2026-01-28
Developing human-understandable interpretations of large language models (LLMs) is becoming increasingly critical for their deployment in essential domains. Mechanistic interpretability seeks to address this need by extracting human-interpretable processes and concepts from LLMs' activations. Sparse autoencoders (SAEs) have emerged as a popular approach for extracting interpretable and monosemantic concepts by decomposing the LLM's internal representations into a dictionary. Despite their empirical progress, SAEs suffer from a fundamental theoretical ambiguity: a well-defined correspondence between LLM representations and human-interpretable concepts remains unclear. This lack of theoretical grounding gives rise to several methodological challenges, including difficulties in principled method design and evaluation criteria. In this work, we show that, under mild assumptions, LLM representations can be approximated as a {linear mixture} of the log-posteriors over concepts given the input context, through the lens of a latent variable model where concepts are treated as latent variables. This motivates a principled framework for concept extraction, namely Concept Component Analysis (ConCA), which aims to recover the log-posterior of each concept from LLM representations through an {unsupervised} linear unmixing process. We explore a specific variant, termed sparse ConCA, which leverages a sparsity prior to address the inherent ill-posedness of the unmixing problem. We implement 12 sparse ConCA variants and demonstrate their ability to extract meaningful concepts across multiple LLMs, offering theory-backed advantages over SAEs.
LLM-AutoDP Automatic Data Processing via LLM Agents for Model Fine-tuning
Authors: Wei Huang, Anda Cheng, Yinggui Wang, Lei Wang, Tao Wei
2026-01-28
Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (DP). In practice, DP strategies are typically developed through iterative manual analysis and trial-and-error adjustment. These processes inevitably incur high labor costs and may lead to privacy issues in high-privacy domains like healthcare due to direct human access to sensitive data. Thus, achieving automated data processing without exposing the raw data has become a critical challenge. To address this challenge, we propose LLM-AutoDP, a novel framework that leverages LLMs as agents to automatically generate and optimize data processing strategies. Our method generates multiple candidate strategies and iteratively refines them using feedback signals and comparative evaluations. This iterative in-context learning mechanism enables the agent to converge toward high-quality processing pipelines without requiring direct human intervention or access to the underlying data. To further accelerate strategy search, we introduce three key techniques: Distribution-Preserving Sampling, which reduces data volume while maintaining distributional integrity; Processing Target Selection, which uses a binary classifier to identify low-quality samples for focused processing; and a Cache-and-Reuse Mechanism, which minimizes redundant computations by reusing prior processing results. Results show that models trained on data processed by our framework achieve over 80% win rates against models trained on unprocessed data. Compared to AutoML baselines based on LLM agents, LLM-AutoDP achieves approximately a 65% win rate. Moreover, our acceleration techniques reduce the total search time by up to 10 times, demonstrating both effectiveness and efficiency.
SwitchCodec Adaptive Residual-Expert Sparse Quantization for High-Fidelity Neural Audio Coding
Authors: Xiangbo Wang, Wenbin Jiang, Jin Wang, Yubo You, Sheng Fang, Fei Wen
2026-01-28
Recent neural audio models often rely on residual vector quantization for high-fidelity coding, but using a fixed number of per-frame codebooks is suboptimal for the wide variability of audio content, especially for signals that are either very simple or highly complex. To address this limitation, we propose SwitchCodec, a neural audio codec based on Residual Experts Vector Quantization (REVQ). REVQ combines a shared quantizer with dynamically routed expert quantizers that are activated according to the input audio, decoupling bitrate from codebook capacity and improving quantization efficiency. This design ensures full training and utilization of each quantizer. In addition, a variable-bitrate mechanism adjusts the number of active expert quantizers at inference, enabling multi-bitrate operation without retraining. Experiments demonstrate that SwitchCodec surpasses existing baselines on both objective metrics and subjective listening tests.
TABED Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs
Authors: Minjae Lee, Wonjun Kang, Byeongkeun Ahn, Christian Classen, Kevin Galim, Seunghyuk Oh, Minghao Yan, Hyung Il Koo, Kangwook Lee
2026-01-28
Speculative decoding (SD) has proven effective for accelerating LLM inference by quickly generating draft tokens and verifying them in parallel. However, SD remains largely unexplored for Large Vision-Language Models (LVLMs), which extend LLMs to process both image and text prompts. To address this gap, we benchmark existing SD methods with small draft models on 11 datasets across diverse input scenarios and observe scenario-specific performance fluctuations. Motivated by these findings, we propose Test-time Adaptive Batched Ensemble Drafting (TABED), which dynamically ensembles multiple drafts obtained via batch inference by leveraging deviations from past ground truths available in the SD setting. The dynamic ensemble method achieves an average robust walltime speedup of 1.74x over autoregressive decoding and a 5% improvement over single drafting methods, while remaining training-free and keeping ensembling costs negligible through parameter sharing. With its plug-and-play compatibility, we further enhance TABED by integrating advanced verification and alternative drafting methods. Code and custom-trained models are available at https://github.com/furiosa-ai/TABED.
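The key idea, weighting each draft by its recent agreement with the verifier's accepted tokens (which the SD setting provides for free), can be sketched as follows. The weighting and voting rules are illustrative assumptions, not TABED's exact mechanism:

```python
from collections import defaultdict, deque

# Toy sketch of test-time adaptive ensemble drafting: each drafter's vote is
# weighted by its acceptance rate over a sliding window of past verifications.

class AdaptiveEnsemble:
    def __init__(self, drafters, window=50):
        self.history = {d: deque(maxlen=window) for d in drafters}

    def weight(self, d):
        h = self.history[d]
        return (sum(h) / len(h)) if h else 1.0   # past acceptance rate

    def propose(self, drafts):
        # drafts: {drafter_name: proposed_token} -> weighted majority vote
        votes = defaultdict(float)
        for d, tok in drafts.items():
            votes[tok] += self.weight(d)
        return max(votes, key=votes.get)

    def update(self, drafts, accepted_token):
        # After verification, record each drafter's agreement with the
        # accepted token (ground truth available in the SD setting).
        for d, tok in drafts.items():
            self.history[d].append(1.0 if tok == accepted_token else 0.0)

ens = AdaptiveEnsemble(["text_draft", "vision_draft"])
ens.update({"text_draft": "cat", "vision_draft": "dog"}, accepted_token="cat")
ens.update({"text_draft": "sat", "vision_draft": "ran"}, accepted_token="sat")
choice = ens.propose({"text_draft": "on", "vision_draft": "off"})
print(choice)  # → on
```

Here the text drafter has matched the verifier twice, so its vote dominates; a drafter that drifts out of agreement in a new input scenario is automatically down-weighted.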
Everything in Its Place Benchmarking Spatial Intelligence of Text-to-Image Models
Authors: Zengbin Wang, Xuecai Hu, Yong Wang, Feng Xiong, Man Zhang, Xiangxiang Chu
2026-01-28
Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail to handle complex spatial relationships, e.g., spatial perception, reasoning, or interaction. These critical aspects are largely overlooked by current benchmarks due to their short or information-sparse prompt design. In this paper, we introduce SpatialGenEval, a new benchmark designed to systematically evaluate the spatial intelligence of T2I models, covering two key aspects. (1) SpatialGenEval comprises 1,230 long, information-dense prompts across 25 real-world scenes. Each prompt integrates 10 spatial sub-domains and 10 corresponding multiple-choice question-answer pairs, ranging from object position and layout to occlusion and causality. Our extensive evaluation of 21 state-of-the-art models reveals that higher-order spatial reasoning remains a primary bottleneck. (2) To demonstrate that the utility of our information-dense design goes beyond evaluation, we also construct the SpatialT2I dataset. It contains 15,400 text-image pairs with rewritten prompts that ensure image consistency while preserving information density. Fine-tuning current foundation models (i.e., Stable Diffusion-XL, Uniworld-V1, OmniGen2) on it yields consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic spatial relations, highlighting a data-centric paradigm for achieving spatial intelligence in T2I models.
Window-Diffusion Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching
Authors: Fengrui Zuo, Zhiwei Ke, Yiming Liu, Wenqi Lou, Chao Wang, Xvehai Zhou
2026-01-28
Diffusion language models (DLMs) generate text through iterative denoising, but inference requires full-sequence attention at every iteration, resulting in substantial redundant computation on masked tokens. Block-wise diffusion can reduce this cost, yet it typically relies on retraining and constrained update orders, limiting its direct applicability to pretrained DLMs. Our token-level analysis reveals pronounced structural locality in DLM inference. Decoding is driven by a small set of prefix-localized active tokens; the influence of distant undecoded context diminishes rapidly, and decoded tokens exhibit stage-wise temporal stability, enabling reuse of intermediate representations except for a brief post-decoding transient. Motivated by these observations, we propose Window-Diffusion (source code: https://github.com/vhicrgit/Window-Diffusion), a window-based token pruning and caching method for inference. We maintain a local computation window that slides rightward as denoising progresses, and partition undecoded tokens into: (i) active tokens that are computed online, (ii) buffer tokens whose states are cached and periodically refreshed, and (iii) far-field tokens that are pruned outside the window. Computation is restricted to active and buffer tokens within the window, while far-field tokens are omitted at each stage. Experiments on LLaDA and Dream show that, under matched compute budgets, our method delivers substantial inference speedups while largely preserving generation performance.
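The three-way partition around a sliding window can be sketched directly. Window sizes and the refresh policy below are illustrative assumptions, not the paper's exact configuration:

```python
# Toy sketch of the windowed token partition: tokens left of the window are
# already decoded; inside the window they are active (computed every step)
# or buffered (cached, periodically refreshed); beyond it they are pruned.

def partition(seq_len, window_start, active_w=8, buffer_w=16):
    """Return a per-position role label as the window slides rightward."""
    roles = []
    for i in range(seq_len):
        if i < window_start:
            roles.append("decoded")       # denoised; states reusable
        elif i < window_start + active_w:
            roles.append("active")        # computed online each step
        elif i < window_start + active_w + buffer_w:
            roles.append("buffer")        # cached, periodically refreshed
        else:
            roles.append("far")           # far-field: pruned from attention
    return roles

roles = partition(seq_len=40, window_start=10)
print(roles[5], roles[12], roles[30], roles[39])  # decoded active buffer far
```

Attention at each denoising stage then runs only over the `active` and `buffer` positions, which is where the compute savings come from.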
Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning
Authors: Zeyu Xing, Xing Li, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan
2026-01-28
KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-cache-derived representations are shown to be sufficient for two key applications: (i) Chain-of-Embedding, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distill-Qwen-14B, reducing token generation with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference. Code: https://github.com/cmd2001/ICLR2026_KV-Embedding.
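One simple way to turn a KV cache into a fixed-size representation is to pool the cached key vectors of one layer. This is a minimal sketch under my own assumptions (mean pooling, final layer, a toy cache layout); the paper's derivation may differ:

```python
import numpy as np

# Sketch: reuse the KV cache as a sequence embedding by mean-pooling the
# final layer's cached keys, instead of recomputing or storing hidden states.

def kv_embedding(kv_cache):
    """kv_cache: (num_layers, 2, seq_len, head_dim); axis 1 is (K, V)."""
    keys_last = kv_cache[-1, 0]           # cached keys of the final layer
    return keys_last.mean(axis=0)         # (head_dim,) pooled representation

rng = np.random.default_rng(0)
cache = rng.normal(size=(4, 2, 32, 64))   # toy cache: 4 layers, 32 tokens
emb = kv_embedding(cache)
print(emb.shape)                          # (64,)
```

The point is that this vector is free: the cache already exists as a by-product of decoding, so downstream uses like Chain-of-Embedding scoring add no forward-pass cost.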
VersaQ-3D A Reconfigurable Accelerator Enabling Feed-Forward and Generalizable 3D Reconstruction via Versatile Quantization
Authors: Yipu Zhang, Jintao Cheng, Xingyu Liu, Zeyu Li, Carol Jingyi Li, Jin Wu, Lin Jiang, Yuan Xie, Jiang Xu, Wei Zhang
2026-01-28
The Visual Geometry Grounded Transformer (VGGT) enables strong feed-forward 3D reconstruction without per-scene optimization. However, its billion-parameter scale creates high memory and compute demands, hindering on-device deployment. Existing quantization methods fail on VGGT due to saturated activation channels and diverse 3D semantics, which cause unreliable calibration. Furthermore, VGGT presents hardware challenges regarding precision-sensitive nonlinear operators and memory-intensive global attention. To address this, we propose VersaQ-3D, an algorithm-architecture co-design framework. Algorithmically, we introduce the first calibration-free, scene-agnostic quantization scheme for VGGT down to 4-bit, leveraging orthogonal transforms to decorrelate features and suppress outliers. Architecturally, we design a reconfigurable accelerator supporting BF16, INT8, and INT4. A unified systolic datapath handles both linear and nonlinear operators, reducing latency by 60%, while two-stage recomputation-based tiling alleviates memory pressure for long-sequence attention. Evaluations show VersaQ-3D preserves 98-99% accuracy at W4A8. At W4A4, it outperforms prior methods by 1.61x-2.39x across diverse scenes. The accelerator delivers 5.2x-10.8x speedup over edge GPUs with low power, enabling efficient instant 3D reconstruction.
SuperInfer SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
Authors: Jiahuan Yu, Mingtao Hu, Zichao Lin, Minjia Zhang
2026-01-28
Large Language Model (LLM) inference faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with a tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 with various models and datasets show that SuperInfer improves TTFT SLO attainment by up to 74.7% while maintaining comparable TBT and throughput relative to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlock the full potential of Superchips for responsive LLM inference.
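The rotary-scheduling idea, letting requests take turns being GPU-resident instead of letting early arrivals block the queue, can be sketched with a round-robin quantum. Capacity and quantum values are illustrative assumptions, not RotaSched's policy:

```python
from collections import deque

# Toy sketch of rotary scheduling under memory oversubscription: only
# `capacity` requests are resident at once; each runs for a quantum of
# decode steps, then rotates out so queued requests are not HOL-blocked.

def rotary_schedule(requests, capacity=2, quantum=2):
    """requests: {name: remaining_tokens}. Returns the token-event order."""
    queue = deque(requests.items())
    timeline = []
    while queue:
        # Admit up to `capacity` resident requests from the head of the queue.
        resident = [queue.popleft() for _ in range(min(capacity, len(queue)))]
        for name, remaining in resident:
            steps = min(quantum, remaining)
            timeline.extend([name] * steps)
            if remaining - steps > 0:
                queue.append((name, remaining - steps))  # rotate back out
    return timeline

tl = rotary_schedule({"A": 3, "B": 1, "C": 2})
print(tl)  # ['A', 'A', 'B', 'C', 'C', 'A']
```

In the real system the rotation cost is KV-cache movement between GPU and CPU memory, which is why it pairs with a full-duplex NVLink-C2C transfer engine.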
Hallucination Begins Where Saliency Drops
Authors: Xiaofeng Zhang, Yuanchao Zhu, Chaochen Gu, Xiaosong Yuan, Qiyan Zhao, Jiawei Cao, Feilong Tang, Sinan Fan, Yaomin Shen, Chen Shen, Hao Tang
2026-01-28
Recent studies have examined attention dynamics in large vision-language models (LVLMs) to detect hallucinations. However, existing approaches remain limited in reliably distinguishing hallucinated from factually grounded outputs, as they rely solely on forward-pass attention patterns and neglect gradient-based signals that reveal how token influence propagates through the network. To bridge this gap, we introduce LVLMs-Saliency, a gradient-aware diagnostic framework that quantifies the visual grounding strength of each output token by fusing attention weights with their input gradients. Our analysis uncovers a decisive pattern: hallucinations frequently arise when preceding output tokens exhibit low saliency toward the prediction of the next token, signaling a breakdown in contextual memory retention. Leveraging this insight, we propose a dual-mechanism inference-time framework to mitigate hallucinations: (1) Saliency-Guided Rejection Sampling (SGRS), which dynamically filters candidate tokens during autoregressive decoding by rejecting those whose saliency falls below a context-adaptive threshold, thereby preventing coherence-breaking tokens from entering the output sequence; and (2) Local Coherence Reinforcement (LocoRE), a lightweight, plug-and-play module that strengthens attention from the current token to its most recent predecessors, actively counteracting the contextual forgetting behavior identified by LVLMs-Saliency. Extensive experiments across multiple LVLMs demonstrate that our method significantly reduces hallucination rates while preserving fluency and task performance, offering a robust and interpretable solution for enhancing model reliability. Code is available at: https://github.com/zhangbaijin/LVLMs-Saliency
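The attention-times-gradient saliency signal and the adaptive rejection threshold can be sketched as follows. The fusion rule and threshold constant are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

# Toy sketch of saliency-guided rejection: saliency fuses attention weights
# with their gradients; a candidate token is rejected when the saliency of
# its preceding context drops below a context-adaptive threshold.

def saliency(attn, grad):
    # Element-wise attention x gradient magnitude, summed over context tokens.
    return float(np.abs(attn * grad).sum())

def accept(attn, grad, recent_saliencies, k=0.5):
    s = saliency(attn, grad)
    thresh = k * float(np.mean(recent_saliencies))  # adapts to recent context
    return s >= thresh, s

rng = np.random.default_rng(0)
history = [1.2, 0.9, 1.1]                 # saliencies of recently kept tokens
ok, s = accept(rng.random(16), rng.random(16), history)
print(ok)
```

A rejected candidate would be resampled, so low-grounding ("saliency drop") tokens never enter the output sequence.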
StreamFusion Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs
Authors: Jiacheng Yang, Jun Wu, Yaoyao Ding, Zhiying Xu, Yida Wang, Gennady Pekhimenko
2026-01-28
Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for the network topologies of modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware, efficient DiT inference engine. StreamFusion incorporates three key innovations: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences, (2) Torus Attention, a novel SP technique that overlaps inter-machine all-to-all operations with computation, and (3) a one-sided communication implementation that minimizes GPU sender-receiver synchronization and computation overheads. Our experiments demonstrate that StreamFusion consistently outperforms the state-of-the-art approach.
Shallow-π Knowledge Distillation for Flow-based VLAs
Authors: Boseong Jeon, Yunho Choi, Taehan Kim
2026-01-28
The growing demand for real-time robotic deployment necessitates fast, on-device inference for vision-language-action (VLA) models. Within the VLA literature, efficiency has been studied extensively at the token level, e.g., visual token pruning. In contrast, systematic transformer layer reduction has received limited attention and, to the best of our knowledge, has not been explored for flow-based VLA models under knowledge distillation. In this work, we propose Shallow-π, a principled knowledge distillation framework that aggressively reduces the transformer depth of both the VLM backbone and the flow-based action head, compressing the model from 18 to 6 layers. Shallow-π achieves over two times faster inference with less than a one percent absolute drop in success rate on standard manipulation benchmarks, establishing state-of-the-art performance among reduced VLA models. Crucially, we validate our approach through industrial-scale real-world experiments on Jetson Orin and Jetson Thor across multiple robot platforms, including humanoid systems, in complex and dynamic manipulation scenarios.
Effect of initial Rayleigh mode on drop deformation and breakup under impulsive acceleration
Authors: Aditya Parik, Sandip Dighe, Tadd Truscott, Som Dutta
2026-01-28
One of the fundamental ways of representing a droplet shape is through its Rayleigh-mode decomposition, in which each mode corresponds to a distinct surface-energy content. The influence of these modes on free oscillation dynamics has been studied extensively; however, their role in droplet deformation, breakup, and fragmentation under impulsive acceleration remains largely unexplored. Here we systematically quantify how prescribed initial axisymmetric Rayleigh modes affect the deformation and breakup of an impulsively accelerated drop. Using experimentally validated, VOF-based multiphase direct numerical simulations, we isolate the coupled effects of finite-amplitude surface oscillation modes and the associated initial surface-energy state by initializing drops with well-defined modes (and phases) while conserving volume at finite amplitudes. We show that breakup is governed not simply by the initial drag of the imposed shape, but by the dynamic coupling between the free modal oscillations and the forced aerodynamic (or shear-driven) deformation: constructive superposition can strongly amplify deformation, whereas destructive superposition can stabilize the drop even under otherwise disruptive forcing. Across all systems studied, the outcome is controlled by how efficiently the external work is partitioned into recoverable oscillatory energy versus centre-of-mass translation and viscous dissipation, with viscosity and density ratio acting as key mediators that respectively damp modal interactions and restrict the time window for energy uptake.
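For reference, the standard form of the axisymmetric Rayleigh-mode decomposition invoked above writes the drop surface as a Legendre-polynomial expansion (this is the classical textbook form, not a reconstruction of the paper's notation):

```latex
r(\theta, t) = R_0 \Big[\, 1 + \sum_{n \ge 2} a_n(t)\, P_n(\cos\theta) \Big],
\qquad
\omega_n^2 = \frac{n\,(n-1)\,(n+2)\,\sigma}{\rho\, R_0^3},
```

where $R_0$ is the unperturbed radius, $P_n$ the Legendre polynomial of degree $n$, $a_n$ the (initialized) mode amplitudes, and $\omega_n$ the classical Rayleigh frequency of mode $n$ for an inviscid drop with surface tension $\sigma$ and liquid density $\rho$. Each mode's amplitude sets its surface-energy content, which is the initial condition the simulations above prescribe while conserving volume.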