2026-02-24
Table of Contents
- PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration
- Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content
- Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation
- Incremental Learning of Sparse Attention Patterns in Transformers
- How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders
- Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
- Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
- Whisper Courtside Edition: Enhancing ASR Performance Through LLM-Driven Context Generation
- NeuroWise: A Multi-Agent LLM "Glass-Box" System for Practicing Double-Empathy Communication with Autistic Partners
- WANSpec: Leveraging Global Compute Capacity for LLM Inference
- Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
- DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation
- Could Large Language Models work as Post-hoc Explainability Tools in Credit Risk Models?
- SceneTok: A Compressed, Diffusable Token Space for 3D Scenes
- ABD: Default Exception Abduction in Finite First-Order Worlds
- BiScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS
- HillInfer: Efficient Long-Context LLM Inference on the Edge with Hierarchical KV Eviction using SmartSSD
- Compact Hadamard Latent Codes for Efficient Spectral Rendering
- HIME: Mitigating Object Hallucinations in LVLMs via Hallucination Insensitivity Model Editing
- Spilled Energy in Large Language Models
- Luna-2: Scalable Single-Token Evaluation with Small Language Models
- RPU -- A Reasoning Processing Unit
- Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory
- SPQ: An Ensemble Technique for Large Language Model Compression
- FedZMG: Efficient Client-Side Optimization in Federated Learning
- PRISM: Parallel Reward Integration with Symmetry for MORL
- Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning
- RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
- SeedFlood: A Step Toward Scalable Decentralized Training of LLMs
- The Statistical Signature of LLMs
- Cut Less, Fold More: Model Compression through the Lens of Projection Geometry
- Predict to Skip: Linear Multistep Feature Forecasting for Efficient Diffusion Transformers
- Joint Training on AMD and NVIDIA GPUs
- Asynchronous Heavy-Tailed Optimization
- Turbo Connection: Reasoning as Information Flow from Higher to Lower Layers
- JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments
- Graph-Neural Multi-Agent Coordination for Distributed Access-Point Selection in Cell-Free Massive MIMO
- Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning
- MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance
- The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure
- Dual Length Codes for Lossless Compression of BFloat16
- CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild
- Hardware-Aware Design of a GNN-Based Hit Filtering Algorithm for the Belle II Level-1 Trigger
- Sink-Aware Pruning for Diffusion Language Models
- KLong: Training LLM Agent for Extremely Long-horizon Tasks
- Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models
- Small LLMs for Medical NLP: A Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian
- The CTI Echo Chamber: Fragmentation, Overlap, and Vendor Specificity in Twenty Years of Cyber Threat Reporting
- Preserving Historical Truth: Detecting Historical Revisionism in Large Language Models
- Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study
PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration
Authors: Chen Duan, Zhentao Guo, Pei Fu, Zining Wang, Kai Zhou, Pengfei Yan
2026-02-22
In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and exhibit adaptability across varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing and thus inherently lacks the positional reasoning required for precise visual tasks, such as text spotting and text grounding. Additionally, the extensive parameters of MLLMs necessitate substantial computational resources and large-scale data for effective training. Conversely, text spotting specialists achieve state-of-the-art coordinate predictions but lack semantic reasoning capabilities. This dichotomy motivates our key research question: can we synergize the efficiency of specialists with the contextual power of LLMs to create a positionally accurate MLLM? To overcome these challenges, we introduce PositionOCR, a parameter-efficient hybrid architecture that seamlessly integrates a text spotting model's positional strengths with an LLM's contextual reasoning. Comprising 131M trainable parameters, this framework demonstrates outstanding multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting, and consistently surpasses traditional MLLMs.
Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content
Authors: Simon Münker, Nils Schwager, Kai Kugler, Michael Heseltine, Achim Rettinger
2026-02-22
The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. While LLMs offer scalability and cost-efficiency, their "naive" application, where they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data, creating a dataset designed to evaluate the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the need for more sophisticated prompting techniques and specialized datasets to ensure that LLM-generated content accurately reflects the complex linguistic patterns of human communication, thereby improving the validity of computational social science studies.
Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation
Authors: Lunjie Zhu, Yushi Huang, Xingtong Ge, Yufei Xue, Zhening Liu, Yumeng Zhang, Zehong Lin, Jun Zhang
2026-02-22
Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal compression framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on the Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a 6× speedup while retaining up to 96.9% of reconstruction performance. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
Incremental Learning of Sparse Attention Patterns in Transformers
Authors: Oğuz Kaan Yüksel, Rodrigo Alvarez Lucendo, Nicolas Flammarion
2026-02-22
This paper introduces a high-order Markov chain task to investigate how transformers learn to integrate information from multiple past positions with varying statistical significance. We demonstrate that transformers learn this task incrementally: each stage is defined by the acquisition of specific information through sparse attention patterns. Notably, we identify a shift in learning dynamics from competitive, where heads converge on the most statistically dominant pattern, to cooperative, where heads specialize in distinct patterns. We model these dynamics using simplified differential equations that characterize the trajectory and prove stage-wise convergence results. Our analysis reveals that transformers ascend a complexity ladder by passing through simpler, misspecified hypothesis classes before reaching the full model class. We further show that early stopping acts as an implicit regularizer, biasing the model toward these simpler classes. These results provide a theoretical foundation for the emergence of staged learning and complex behaviors in transformers, offering insights into generalization for natural language processing and algorithmic reasoning.
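A toy version of such a high-order Markov chain task can be sketched in Python; the order, vocabulary size, and lag weights below are illustrative choices, not the paper's actual configuration:

```python
import random

def make_sequence(length, order=3, vocab=4, weights=(0.6, 0.3, 0.1), seed=0):
    """Toy high-order Markov source: each next symbol is driven by ONE of the
    last `order` positions, with lag 1 chosen most often (most statistically
    significant) and deeper lags progressively rarer."""
    rng = random.Random(seed)
    # One deterministic successor table per lag, so every lag carries
    # recoverable information about the next symbol.
    tables = [[rng.randrange(vocab) for _ in range(vocab)] for _ in range(order)]
    seq = [rng.randrange(vocab) for _ in range(order)]
    while len(seq) < length:
        lag = rng.choices(range(1, order + 1), weights=weights)[0]
        seq.append(tables[lag - 1][seq[-lag]])
    return seq

seq = make_sequence(64)
```

A model trained on such sequences must attend to several lags at once, with the payoff of each lag proportional to how often it drives the transition.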
How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders
Authors: Michael McCoubrey, Angelo Salatino, Francesco Osborne, Enrico Motta
2026-02-22
In recent years, there has been a growing use of generative AI, and large language models (LLMs) in particular, to support both the assessment and generation of scientific work. Although some studies have shown that LLMs can, to a certain extent, evaluate research according to perceived quality, our understanding of the internal mechanisms that enable this capability remains limited. This paper presents the first study that investigates how LLMs encode the concept of scientific quality through relevant monosemantic features extracted using sparse autoencoders. We derive such features under different experimental settings and assess their ability to serve as predictors across three tasks related to research quality: predicting citation count, journal SJR, and journal h-index. The results indicate that LLMs encode features associated with multiple dimensions of scientific quality. In particular, we identify four recurring types of features that capture key aspects of how research quality is represented: 1) features reflecting research methodologies; 2) features related to publication type, with literature reviews typically exhibiting higher impact; 3) features associated with high-impact research fields and technologies; and 4) features corresponding to specific scientific jargon. These findings represent an important step toward understanding how LLMs encapsulate concepts related to research quality.
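The feature-extraction step the study relies on can be illustrated with a toy sparse autoencoder encoder; the dimensions, random weights, and bias below are placeholders for a trained SAE, not the authors' setup:

```python
import numpy as np

def sae_features(x, W_enc, b_enc):
    """Feature activations from a (toy) sparse autoencoder: a ReLU over an
    overcomplete linear encoding of a model hidden state x."""
    return np.maximum(0.0, x @ W_enc + b_enc)

rng = np.random.default_rng(0)
d_model, d_feat = 8, 32              # overcomplete: more features than dims
x = rng.normal(size=d_model)         # stand-in for an LLM hidden state
W = rng.normal(size=(d_model, d_feat))
b = -0.5 * np.ones(d_feat)           # negative bias encourages sparsity
acts = sae_features(x, W, b)
active = int(np.count_nonzero(acts)) # only a fraction of features fire
```

Each nonzero entry of `acts` is a candidate monosemantic feature; the paper then uses such activations as predictors for citation count, SJR, and h-index.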
Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
Authors: Qinfeng Xiao, Guofeng Mei, Bo Yang, Liying Zhang, Jian Zhang, Kit-lun Yick
2026-02-22
Establishing dense correspondences between shapes is a crucial task in computer vision and graphics, yet prior approaches depend on near-isometric assumptions and homogeneous subject types (i.e., they only operate on human shapes). However, building semantic correspondences for cross-category objects remains challenging and has received relatively little attention. To achieve this, we propose UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes without restricting object categories. The key insight is to lift "coarse" semantic cues into "fine" correspondence, which is achieved through two stages. In the "coarse" stage, we perform class-agnostic 3D segmentation to obtain non-overlapping semantic parts and prompt multimodal large language models (MLLMs) to identify part names. Then, we employ pretrained vision-language models (VLMs) to extract text embeddings, enabling the construction of matched semantic parts. In the "fine" stage, we leverage these coarse correspondences to guide the learning of dense correspondences through a dedicated rank-based contrastive scheme. Thanks to class-agnostic segmentation, language guidance, and rank-based contrastive learning, our method is versatile across object categories and requires no predefined part proposals, enabling universal matching for inter-class and non-isometric shapes. Extensive experiments demonstrate that UniMatch consistently outperforms competing methods in various challenging scenarios.
Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Authors: Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng, Zhenkai Liang, Xiang Wang, Tat-Seng Chua
2026-02-22
Large vision-language models (LVLMs) have rapidly advanced across various domains, yet they still lag behind strong text-only large language models (LLMs) on tasks that require multi-step inference and compositional decision-making. Motivated by their shared transformer architectures, we investigate whether the two model families rely on common internal computation for such inference. At the neuron level, we uncover a surprisingly large overlap: more than half of the top-activated units during multi-step inference are shared between representative LLMs and LVLMs, revealing a modality-invariant inference subspace.
Through causal probing via activation amplification, we further show that these shared neurons encode consistent and interpretable concept-level effects, demonstrating their functional contribution to inference. Building on this insight, we propose Shared Neuron Low-Rank Fusion (SNRF), a parameter-efficient framework that transfers mature inference circuitry from LLMs to LVLMs. SNRF profiles cross-model activations to identify shared neurons, computes a low-rank approximation of inter-model weight differences, and injects these updates selectively within the shared-neuron subspace. This mechanism strengthens multimodal inference performance with minimal parameter changes and requires no large-scale multimodal fine-tuning.
Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities. Our results demonstrate that shared neurons form an interpretable bridge between LLMs and LVLMs, enabling low-cost transfer of inference ability into multimodal models. Our code is available at https://github.com/chenhangcuisg-code/Do-LLMs-VLMs-Share-Neurons.
Whisper Courtside Edition: Enhancing ASR Performance Through LLM-Driven Context Generation
Authors: Yonathan Ron, Shiri Gilboa, Tammuz Dubnov
2026-02-21
Domain-specific speech remains a persistent challenge for automatic speech recognition (ASR), even for state-of-the-art systems like OpenAI's Whisper. We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining. The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder. Evaluated on 421 NBA basketball commentary segments (a domain characterized by dense proper nouns and technical terminology), our best pipeline achieves a statistically significant 17.0% relative reduction in word error rate (WER; from 0.217 to 0.180, p<0.001). Improvements are observed in 40.1% of segments, with degradation in only 7.1%, substantially outperforming direct transcript post-editing. These results demonstrate that prompt-based augmentation can deliver scalable domain adaptation for ASR, offering a practical alternative to costly model fine-tuning.
NeuroWise: A Multi-Agent LLM "Glass-Box" System for Practicing Double-Empathy Communication with Autistic Partners
Authors: Albert Tang, Yifan Mo, Jie Li, Yue Su, Mengyuan Zhang, Sander L. Koole, Koen Hindriks, Jiahuan Pei
2026-02-21
The double empathy problem frames difficulties between neurodivergent and neurotypical individuals as arising from mutual misunderstanding, yet most interventions focus on autistic individuals. We present NeuroWise, a multi-agent LLM-based coaching system that supports neurotypical users through stress visualization, interpretation of internal experiences, and contextual guidance. In a between-subjects study (N=30), NeuroWise was rated as helpful by all participants and showed a significant condition-time effect on deficit-based attributions (p=0.02): NeuroWise users reduced deficit framing, while baseline users shifted toward blaming autistic "deficits" after difficult interactions. NeuroWise users also completed conversations more efficiently (37% fewer turns, p=0.03). These findings suggest that AI-based interpretation can support attributional change by helping users recognize communication challenges as mutual.
WANSpec: Leveraging Global Compute Capacity for LLM Inference
Authors: Noah Martin, Fahad Dogar
2026-02-21
Data centers capable of running large language models (LLMs) are spread across the globe. Some have high-end GPUs for running the most advanced models (100B+ parameters), while others are only suitable for smaller models (1B parameters). The most capable GPUs are under high demand thanks to the rapidly expanding applications of LLMs. Choosing the right location to run an LLM inference workload can therefore have consequences for request latency. In this work, we explore options to shift some aspects of inference to under-utilized data centers. We first observe the varying delays affecting inference in AWS services from different regions, demonstrating that load is not spread evenly. We then introduce WANSpec, which offloads part of LLM generation to under-utilized data centers. In doing so, WANSpec can mitigate capacity issues as well as effectively use on-site compute (e.g., at universities) to augment cloud providers. This is done with speculative decoding, a widely used technique to speed up auto-regressive decoding, by moving the draft model to the under-utilized compute resources. Our experiments in simulation and cloud deployments show that WANSpec can judiciously employ redundancy to avoid increases in latency while still reducing the forward passes of speculative decoding's draft model in high-demand data centers by over 50%.
Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Authors: Abhinaba Basu
2026-02-21
Personal AI agents incur substantial cost via repeated LLM calls. We show that existing caching methods fail: GPTCache achieves 37.9% accuracy on real benchmarks; APC achieves 0-12%. The root cause is optimizing for the wrong property: caching effectiveness requires key consistency and precision, not classification accuracy. We observe that cache-key evaluation reduces to clustering evaluation and apply V-measure decomposition to separate these on n=8,682 points across MASSIVE, BANKING77, CLINC150, and NyayaBench v2, our new 8,514-entry multilingual agentic dataset (528 intents, 20 W5H2 classes, 63 languages). We introduce W5H2, a structured intent decomposition framework. Using SetFit with 8 examples per class, W5H2 achieves 91.1%+/-1.7% on MASSIVE in ~2ms, versus 37.9% for GPTCache and 68.8% for a 20B-parameter LLM at 3,447ms. On NyayaBench v2 (20 classes), SetFit achieves 55.3%, with cross-lingual transfer across 30 languages. Our five-tier cascade handles 85% of interactions locally, projecting a 97.5% cost reduction. We provide risk-controlled selective prediction guarantees via RCPS with nine bound families.
DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation
Authors: Yangchen Zeng
2026-02-21
Recent generative recommendation frameworks have demonstrated remarkable scaling potential by reformulating item prediction as autoregressive Semantic ID (SID) generation. However, existing methods primarily rely on shallow behavioral signals, encoding items solely through surface-level textual features such as titles and descriptions. This reliance results in a critical Shallow Interest problem: the model fails to capture the latent, semantically rich interests underlying user interactions, limiting both personalization depth and recommendation interpretability. DeepInterestGR introduces three key innovations: (1) Multi-LLM Interest Mining (MLIM): we leverage multiple frontier LLMs along with their multi-modal variants to extract deep textual and visual interest representations through Chain-of-Thought prompting. (2) Reward-Labeled Deep Interest (RLDI): we employ a lightweight binary classifier to assign reward labels to mined interests, enabling effective supervision signals for reinforcement learning. (3) Interest-Enhanced Item Discretization (IEID): the curated deep interests are encoded into semantic embeddings and quantized into SID tokens via RQ-VAE. We adopt a two-stage training pipeline: supervised fine-tuning aligns the generative model with deep interest signals and collaborative filtering patterns, followed by reinforcement learning with GRPO optimized by our Interest-Aware Reward. Experiments on three Amazon Review benchmarks demonstrate that DeepInterestGR consistently outperforms state-of-the-art baselines across HR@K and NDCG@K metrics.
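The discretization step into SID tokens can be illustrated with a minimal residual-quantization sketch (random codebooks stand in for trained RQ-VAE codebooks):

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Greedy residual quantization (RQ): each codebook level encodes the
    residual left by the previous levels, yielding one SID token per level."""
    tokens, residual = [], x.copy()
    for cb in codebooks:                                   # cb: (codes, dim)
        idx = int(((residual - cb) ** 2).sum(axis=1).argmin())
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens, residual

rng = np.random.default_rng(0)
dim = 8
codebooks = [rng.normal(size=(16, dim)) for _ in range(3)]  # 3-level SID
x = rng.normal(size=dim)                                    # interest embedding
tokens, residual = residual_quantize(x, codebooks)
recon = sum(cb[t] for cb, t in zip(codebooks, tokens))      # decoded embedding
```

The token list is the item's Semantic ID: coarse levels capture broad interest clusters and later levels refine them, which is what makes autoregressive SID generation meaningful.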
Could Large Language Models work as Post-hoc Explainability Tools in Credit Risk Models?
Authors: Wenxi Geng, Dingyuan Liu, Liya Li, Yiqing Wang
2026-02-21
Post-hoc explainability is central to credit risk model governance, yet widely used tools such as coefficient-based attributions and SHapley Additive exPlanations (SHAP) often produce numerical outputs that are difficult to communicate to non-technical stakeholders. This paper investigates whether large language models (LLMs) can serve as post-hoc explainability tools for credit risk predictions through in-context learning, focusing on two roles: translators and autonomous explainers. Using a personal lending dataset from LendingClub, we evaluate three commercial LLMs: GPT-4-turbo, Claude Sonnet 4, and Gemini-2.0-Flash. Results provide strong evidence for the translator role. In contrast, autonomous explanations show low alignment with model-based attributions. Few-shot prompting improves feature alignment for logistic regression but does not consistently benefit XGBoost, suggesting that LLMs have limited capacity to recover non-linear, interaction-driven reasoning from prompt cues alone. Our findings position LLMs as effective narrative interfaces grounded in auditable model attributions, rather than as substitutes for post-hoc explainers in credit risk model governance. Practitioners should leverage LLMs to bridge the communication gap between complex model outputs and regulatory or business stakeholders, while preserving the rigor and traceability required by credit risk governance frameworks.
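The translator role can be sketched as prompt construction over precomputed attributions; the template, feature names, and SHAP values below are invented for illustration and are not the paper's setup:

```python
def translator_prompt(pred_prob, shap_items, top=3):
    """Turn model attributions into a prompt for an LLM 'translator':
    the LLM narrates auditable numbers, it does not invent the explanation."""
    ranked = sorted(shap_items, key=lambda kv: abs(kv[1]), reverse=True)[:top]
    lines = [f"- {name}: SHAP {val:+.3f}" for name, val in ranked]
    return (
        f"Predicted default probability: {pred_prob:.1%}.\n"
        "Explain, for a non-technical stakeholder, how these features "
        "drove the prediction:\n" + "\n".join(lines)
    )

p = translator_prompt(0.23, [("dti", 0.41), ("income", -0.10),
                             ("grade", 0.55), ("loan_amount", 0.02)])
```

Keeping the SHAP values inside the prompt is what grounds the narrative in the auditable attribution, the property the paper argues distinguishes the translator role from autonomous explanation.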
SceneTok: A Compressed, Diffusable Token Space for 3D Scenes
Authors: Mohammad Asim, Christopher Wewer, Jan Eric Lenssen
2026-02-21
We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation-invariant tokens disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by a lightweight rectified flow decoder. We show that the compression is 1-3 orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including ones deviating from the input trajectory, and we show that the decoder gracefully handles uncertainty. Finally, the highly compressed set of unstructured latent scene tokens enables simple and efficient scene generation in 5 seconds, achieving a much better quality-speed trade-off than previous paradigms.
ABD: Default Exception Abduction in Finite First-Order Worlds
Authors: Serafim Batzoglou
2026-02-21
We introduce ABD, a benchmark for default-exception abduction over finite first-order worlds. Given a background theory with an abnormality predicate and a set of relational structures, a model must output a first-order formula that defines exceptions, restoring satisfiability while keeping exceptions sparse. We formalize three observation regimes (closed-world, existential completion, universal completion) with exact SMT verification. Evaluating ten frontier LLMs on 600 instances, we find that the best models achieve high validity but parsimony gaps remain, and holdout evaluation reveals distinct generalization failure modes across regimes.
BiScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS
Authors: Omar Basit, Yunzhao Liu, Z. Jonny Kong, Y. Charlie Hu
2026-02-21
Prefill/decode disaggregation is increasingly adopted in LLM serving to improve the latency-throughput tradeoff and meet strict TTFT and TPOT SLOs. However, LLM inference remains energy-hungry: autoscaling alone is too coarse-grained to track fast workload fluctuations, and applying fine-grained DVFS under disaggregation is complicated by phase-asymmetric dynamics and coupling between provisioning and frequency control.
We present BiScale, a two-tier energy optimization framework for disaggregated LLM serving. BiScale jointly optimizes placement and DVFS across prefill and decode using predictive latency and power models. At coarse timescales, BiScale computes phase-aware placement and baseline frequencies that minimize energy while satisfying SLO constraints. At fine timescales, BiScale dynamically adapts GPU frequency per iteration using stage-specific control: model predictive control (MPC) for prefill, to account for queue evolution and future TTFT impact, and lightweight slack-aware adaptation for decode, to exploit its smoother, memory-bound dynamics. This hierarchical design enables coordinated control across timescales while preserving strict serving SLOs.
Evaluation on a 16x H100 cluster serving Llama 3.3 70B with production-style traces shows that BiScale meets TTFT/TPOT SLOs while reducing energy by up to 39% in prefill and 48% in decode relative to DistServe.
HillInfer: Efficient Long-Context LLM Inference on the Edge with Hierarchical KV Eviction using SmartSSD
Authors: He Sun, Li Li, Mingjun Xiao
2026-02-21
Deploying Large Language Models (LLMs) on edge devices such as PCs enables low-latency inference with strong privacy guarantees, but long-context inference is fundamentally constrained by limited memory and compute resources. Beyond model parameters, the KV cache becomes the dominant bottleneck due to its linear growth with context length. Although prior work exploits contextual sparsity to evict unimportant KV data, these approaches are largely designed for memory-rich platforms and incur prohibitive data-transfer overhead when applied to resource-constrained edge devices with external storage. In this paper, we propose HillInfer, an importance-aware long-context LLM inference framework for the edge that leverages SmartSSD-assisted hierarchical KV cache management. HillInfer jointly manages KV cache pools across the CPU and SmartSSD, and performs in-storage importance evaluation to reduce unnecessary data movement. Furthermore, we design an adaptive, prefetch-based pipeline that overlaps computation and KV data transfer across the GPU, CPU, and SmartSSD, minimizing end-to-end inference latency without sacrificing accuracy. We implement HillInfer on a PC with a commodity GPU, and experiments across multiple models and benchmarks demonstrate up to an 8.56× speedup over baselines while preserving model accuracy.
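Importance-based KV eviction can be sketched in a few lines; the monotone importance scores and budget below are toy values, and real systems score entries with signals such as accumulated attention weight:

```python
import numpy as np

def evict_kv(keys, values, scores, budget):
    """Keep only the `budget` most important KV-cache entries, preserving
    their original temporal order so positional structure survives."""
    keep = np.sort(np.argsort(scores)[-budget:])   # top-k, order-preserving
    return keys[keep], values[keep], keep

rng = np.random.default_rng(0)
T, d = 10, 4
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
scores = np.arange(T, dtype=float)        # toy: later tokens more important
K2, V2, kept = evict_kv(K, V, scores, budget=4)
```

HillInfer's contribution is where this scoring runs: evaluating importance in-storage on the SmartSSD means low-scoring entries never cross the storage bus at all.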
Compact Hadamard Latent Codes for Efficient Spectral Rendering
Authors: Jiaqi Yu, Dar'ya Guarnera, Giuseppe Claudio Guarnera
2026-02-21
Spectral rendering accurately reproduces wavelength-dependent appearance but is computationally expensive, as shading must be evaluated at many wavelength samples and scales roughly linearly with their number. It also requires spectral textures and lights throughout the rendering pipeline. We propose Hadamard spectral codes, a compact latent representation that enables spectral rendering using standard RGB rendering operations. Spectral images are approximated with a small number of RGB rendering passes, followed by a decoding step. Our key requirement is latent linearity: scaling and addition in spectral space correspond to scaling and addition of codes, and the element-wise product of spectra (for example, reflectance times illumination) is approximated by the element-wise product of their latent codes. We show that an exact low-dimensional algebra-preserving representation cannot exist for arbitrary spectra when the latent dimension k is smaller than the number of spectral samples n. We therefore introduce a learned non-negative linear encoder and decoder architecture that preserves scaling and addition exactly while encouraging approximate multiplicativity under the Hadamard product. With k = 6, we render k/3 = 2 RGB images per frame using an unmodified RGB renderer, reconstruct the latent image, and decode to high-resolution spectra, XYZ, or RGB. Experiments on 3D scenes demonstrate that k = 6 significantly reduces color error compared to RGB baselines while being substantially faster than naive n-sample spectral rendering. Using k = 9 provides higher-quality reference results. We further introduce a lightweight neural upsampling network that maps RGB assets directly to latent codes, enabling integration of legacy RGB content into the spectral pipeline while maintaining perceptually accurate colors in rendered images.
HIME: Mitigating Object Hallucinations in LVLMs via Hallucination Insensitivity Model Editing
Authors: Ahmed Akl, Abdelwahed Khamis, Ali Cheraghian, Zhe Wang, Sara Khalifa, Kewen Wang
2026-02-21
Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal understanding capabilities, yet they remain prone to object hallucination, where models describe non-existent objects or attribute incorrect factual information, raising serious concerns for reliable real-world deployment. While fine-tuning is a commonly adopted mitigation strategy, its high computational cost and practical difficulty motivate the need for training-free alternatives, among which model editing has recently emerged as a promising direction. However, indiscriminate editing risks disrupting the rich implicit knowledge encoded in pre-trained LVLMs, leading to a fundamental question: how much intervention is necessary at each layer to suppress hallucinations while preserving pre-trained knowledge? To address this question, we present a systematic analysis of LVLM decoders built on three widely used large language model backbones (Qwen, LLaMA, and Vicuna), revealing clear layer-wise differences in susceptibility to object hallucination. Building on these insights, we introduce the Hallucination Insensitivity Score (HIS), a principled metric that quantifies each layer's sensitivity to hallucination and provides guidance for targeted intervention. Leveraging HIS, we propose Hallucination Insensitivity Model Editing (HIME), a simple yet effective layer-adaptive weight-editing approach that selectively modifies latent features to suppress hallucinations while preserving pre-trained knowledge. Extensive experiments demonstrate that HIME reduces hallucinations by an average of 61.8% across open-ended generation benchmarks, including CHAIR, MME, and GPT-4V-aided evaluation, without introducing additional parameters, inference-time latency, or computational overhead.
Spilled Energy in Large Language Models
Authors: Adrian Robert Minut, Hazem Dewidar, Iacopo Masi
2026-02-21
We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference. This principled approach allows us to track "energy spills" during decoding, which we empirically show correlate with factual errors, biases, and failures. Similar to Orgad et al. (2025), our method localizes the exact answer token and subsequently tests for hallucinations. Crucially, however, we achieve this without requiring trained probe classifiers or activation ablations. Instead, we introduce two completely training-free metrics derived directly from output logits: spilled energy, which captures the discrepancy between energy values across consecutive generation steps that should theoretically match, and marginalized energy, which is measurable at a single step. Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalization. Notably, these results hold for both pretrained and instruction-tuned variants without introducing any training overhead.
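As a rough illustration of the energy-based reading (the function names and the exact definition of the spill below are ours, not the paper's): the softmax normalizer yields a free energy at each step, and the mismatch between the chosen token's energy and the next step's free energy can be tracked as a single training-free scalar.

```python
import numpy as np

def logsumexp(v):
    # Numerically stable log-sum-exp over a logit vector.
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def marginalized_energy(logits):
    # EBM reading of a softmax head: free energy F = -logsumexp(logits),
    # measurable at a single generation step.
    return -logsumexp(logits)

def spilled_energy(logits_t, logits_t1, token_id):
    # Discrepancy between two quantities that match for an exact
    # autoregressive factorization: the energy of the chosen token at
    # step t versus the free energy carried into step t+1.
    e_token = -logits_t[token_id]            # energy of the sampled token
    f_next = marginalized_energy(logits_t1)  # free energy at the next step
    return e_token - f_next
```

A large spill flags a step where consecutive distributions disagree more than the chain rule predicts, which is the kind of signal the paper correlates with hallucinations.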
Luna-2 Scalable Single-Token Evaluation with Small Language Models
Authors: Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel, Shuai Shao, Yash Sheth
2026-02-20
Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation. We present Luna-2, a novel architecture that leverages decoder-only small language models (SLMs) as a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g., toxicity, hallucination, tool selection quality) at accuracy on par with or higher than LLMAJ using frontier LLMs, while drastically reducing the cost and latency of computation. Each metric is implemented as a lightweight LoRA/PEFT head on top of a shared SLM backbone, enabling hundreds of specialized metrics to run concurrently on a single GPU, deployable locally next to AI systems in a privacy-preserving and latency-optimizing manner. Across content safety and hallucination benchmarks, Luna-2 matches the accuracy of state-of-the-art LLM-based evaluators while reducing inference cost by over 80x and latency by over 20x.
In this paper, we outline the model architecture and training methodology, and report real-world empirical results on accuracy, latency, and throughput. In production, Luna-2 protects 100M+ AI sessions and processes over 100B tokens per month for our customers, with eval cost savings of over $30M annually.
RPU -- A Reasoning Processing Unit
Authors: Matthew Adiletta, Gu-Yeon Wei, David Brooks
2026-02-20
Large language model (LLM) inference performance is increasingly bottlenecked by the memory wall. While GPUs continue to scale raw compute throughput, they struggle to deliver scalable performance for memory-bandwidth-bound workloads. This challenge is amplified by emerging reasoning LLM applications, where long output sequences, low arithmetic intensity, and tight latency constraints demand significantly higher memory bandwidth. As a result, system utilization drops and energy per inference rises, highlighting the need for an optimized system architecture for scalable memory bandwidth.
To address these challenges, we present the Reasoning Processing Unit (RPU), a chiplet-based architecture designed for the modern memory wall. RPU introduces: (1) a Capacity-Optimized High-Bandwidth Memory (HBM-CO) that trades capacity for lower energy and cost; (2) a scalable chiplet architecture featuring a bandwidth-first power and area provisioning design; and (3) a decoupled microarchitecture that separates memory, compute, and communication pipelines to sustain high bandwidth utilization. Simulation results show that RPU achieves up to 45.3x lower latency and 18.6x higher throughput than an H100 system at ISO-TDP on Llama3-405B.
Going Down Memory Lane Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory
Authors: Vatsal Agarwal, Saksham Suri, Matthew Gwilliam, Pulkit Kumar, Abhinav Shrivastava
2026-02-20
Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.
SPQ An Ensemble Technique for Large Language Model Compression
Authors: Jiamin Yao, Eren Gultepe
2026-02-20
This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD reduces attention projections into compact low-rank factors, and iii) 8-bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms individual methods (SVD-only, pruning-only, or quantization-only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K. Compared to strong baselines like GPTQ and SparseGPT, SPQ offers competitive perplexity and accuracy while using less memory (6.86 GB vs. 7.16 GB for GPTQ). Moreover, SPQ improves inference throughput over GPTQ, achieving up to a 1.9x speedup, which further enhances its practicality for real-world deployment. By combining layer-aware and complementary compression techniques, SPQ may enable practical deployment of LLMs in memory-constrained environments. Code is available at: https://github.com/JiaminYao/SPQ_LLM_Compression/
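The three components can be sketched on a single weight matrix with NumPy (a toy illustration; SPQ's variance-retained SVD and activation-based pruning criteria are more involved than the magnitude rule used here):

```python
import numpy as np

rng = np.random.default_rng(0)

def svd_truncate(W, rank):
    # Low-rank factorization for attention projections: W ~= A @ B.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]
    B = Vt[:rank]
    return A, B

def magnitude_prune(W, sparsity):
    # Activation-agnostic stand-in for pruning: zero the smallest weights.
    k = int(W.size * sparsity)
    thresh = np.sort(np.abs(W), axis=None)[k]
    return np.where(np.abs(W) >= thresh, W, 0.0)

def quantize_int8(W):
    # Symmetric post-training linear quantization to 8-bit integers.
    scale = np.abs(W).max() / 127.0
    q = np.round(W / scale).astype(np.int8)
    return q, scale

W = rng.normal(size=(64, 64))
A, B = svd_truncate(W, rank=16)          # attention projection -> low rank
W_sparse = magnitude_prune(W, 0.5)       # MLP weights -> sparse
q, scale = quantize_int8(W_sparse)       # all linear layers -> int8
W_hat = q.astype(np.float32) * scale     # dequantized reconstruction
```

Each step compresses a different axis (rank, support, precision), which is why the ensemble can beat any single method at a matched ratio.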
FedZMG Efficient Client-Side Optimization in Federated Learning
Authors: Fotios Zantalis, Evangelos Zervas, Grigorios Koulouras
2026-02-20
Federated Learning (FL) enables distributed model training on edge devices while preserving data privacy. However, clients tend to have non-Independent and Identically Distributed (non-IID) data, which often leads to client-drift, diminishing convergence speed and model performance. While adaptive optimizers have been proposed to mitigate these effects, they frequently introduce computational complexity or communication overhead unsuitable for resource-constrained IoT environments. This paper introduces Federated Zero Mean Gradients (FedZMG), a novel, parameter-free, client-side optimization algorithm designed to tackle client-drift by structurally regularizing the optimization space. Advancing the idea of Gradient Centralization, FedZMG projects local gradients onto a zero-mean hyperplane, effectively neutralizing the "intensity" or "bias" shifts inherent in heterogeneous data distributions without requiring additional communication or hyperparameter tuning. A theoretical analysis is provided, proving that FedZMG reduces the effective gradient variance and guarantees tighter convergence bounds compared to standard FedAvg. Extensive empirical evaluations on EMNIST, CIFAR100, and Shakespeare datasets demonstrate that FedZMG achieves better convergence speed and final validation accuracy compared to the baseline FedAvg and the adaptive optimizer FedAdam, particularly in highly non-IID settings.
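The core projection is one line. A minimal sketch (per-tensor centralization for brevity; Gradient Centralization variants center per column or per filter):

```python
import numpy as np

def zero_mean_project(grad):
    # Project a gradient tensor onto the zero-mean hyperplane:
    # g' = g - mean(g). This removes the component along the all-ones
    # direction (the "intensity"/"bias" shift) and cannot increase the norm.
    return grad - grad.mean()

def fedzmg_client_step(weights, grads, lr):
    # Plain SGD on centralized gradients: no extra state, no extra
    # communication, no new hyperparameters beyond the learning rate.
    return [w - lr * zero_mean_project(g) for w, g in zip(weights, grads)]
```

Because the projection is applied locally and changes nothing about what is transmitted, it composes with standard FedAvg aggregation unchanged.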
PRISM Parallel Reward Integration with Symmetry for MORL
Authors: Finn van der Knaap, Kejiang Qian, Zheng Xu, Fengxiang He
2026-02-20
This work studies heterogeneous Multi-Objective Reinforcement Learning (MORL), where objectives can differ sharply in temporal frequency. Such heterogeneity allows dense objectives to dominate learning, while long-horizon rewards receive weak credit assignment, leading to poor sample efficiency. We propose a Parallel Reward Integration with Symmetry (PRISM) algorithm that enforces reflectional symmetry as an inductive bias in aligning reward channels. PRISM introduces ReSymNet, a theory-motivated model that reconciles temporal-frequency mismatches across objectives, using residual blocks to learn a scaled opportunity value that accelerates exploration while preserving the optimal policy. We also propose SymReg, a reflectional equivariance regulariser that enforces agent mirroring and constrains policy search to a reflection-equivariant subspace. This restriction provably reduces hypothesis complexity and improves generalisation. Across MuJoCo benchmarks, PRISM consistently outperforms both a sparse-reward baseline and an oracle trained with full dense rewards, improving Pareto coverage and distributional balance: it achieves hypervolume gains exceeding 100\% over the baseline and up to 32\% over the oracle. The code is at \href{https://github.com/EVIEHub/PRISM}{https://github.com/EVIEHub/PRISM}.
Thinking by Subtraction Confidence-Driven Contrastive Decoding for LLM Reasoning
Authors: Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao jin, Bang Yang, Yuexian Zou
2026-02-20
Recent work on test-time scaling for large language model (LLM) reasoning typically assumes that allocating more inference-time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. Motivated by this observation, we propose Thinking by Subtraction, a confidence-driven contrastive decoding approach that improves reasoning reliability through targeted token-level intervention. Our method, Confidence-Driven Contrastive Decoding (CCD), detects low-confidence tokens during decoding and intervenes selectively at these positions. It constructs a contrastive reference by replacing high-confidence tokens with minimal placeholders, and refines predictions by subtracting this reference distribution at low-confidence locations. Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal decoding overhead. As a training-free method, CCD enhances reasoning reliability through targeted low-confidence intervention without computational redundancy. Our code will be made available at: https://github.com/bolo-web/CCD.
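A minimal sketch of one confidence-gated contrastive step (the threshold, the contrast weight alpha, and how the reference logits are built are placeholders here, not the paper's settings):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stability shift
    e = np.exp(z)
    return e / e.sum()

def ccd_step(logits, ref_logits, alpha=1.0, conf_threshold=0.5):
    # Intervene only where the model is uncertain: if the top-token
    # probability is below the threshold, subtract a reference
    # distribution (e.g. from a degraded context) in logit space.
    p = softmax(logits)
    if p.max() >= conf_threshold:
        return p                                # confident: leave untouched
    return softmax(logits - alpha * ref_logits)
```

The gate is what keeps overhead minimal: the contrastive pass is only consulted at the small set of low-confidence positions.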
RAT+ Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
Authors: Xiuying Wei, Caglar Gulcehre
2026-02-20
Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. However, we find a persistent failure mode: sparsifying a pretrained attention model to a dilated pattern leads to severe accuracy degradation. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once, then flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at D=16 and drops by about 2-3 points at D=64 on commonsense reasoning and LongBench tasks, respectively. Moreover, RAT+ outperforms attention when sparsifying to top-k block attention. We further scale to 2.6B parameters and 200B tokens and observe the same trend.
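The dilated pattern itself is easy to picture: each query attends only to every D-th earlier position, optionally plus a local window. A sketch of the visible index set (our illustration of the pattern, not RAT+ code):

```python
def dilated_kv_indices(i, D, window=0):
    # Keys visible to query position i under dilation D: every D-th
    # earlier position, plus an optional local window of recent tokens.
    # The attended KV set shrinks by roughly a factor of D.
    dilated = [j for j in range(i + 1) if (i - j) % D == 0]
    local = list(range(max(0, i - window), i + 1))
    return sorted(set(dilated) | set(local))
```

The paper's observation is that a densely pretrained model cannot simply be restricted to this set post hoc; RAT+'s recurrence carries the information the dropped positions would have provided.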
SeedFlood A Step Toward Scalable Decentralized Training of LLMs
Authors: Jihun Kim, Namhoon Lee
2026-02-20
This work presents a new approach to decentralized training, SeedFlood, designed to scale to large models across complex network topologies and achieve global consensus with minimal overhead. Traditional gossip-based methods suffer from communication costs that grow with model size, while information decay over network hops renders global consensus inefficient. SeedFlood departs from these practices by exploiting the seed-reconstructible structure of zeroth-order updates, making messages near-zero in size and allowing them to be flooded to every client in the network. This mechanism makes communication overhead negligible and independent of model size, removing the primary scalability bottleneck in decentralized training. Consequently, SeedFlood enables training in regimes previously considered impractical, such as billion-parameter models distributed across hundreds of clients. Our experiments on decentralized LLM fine-tuning demonstrate that SeedFlood consistently outperforms gossip-based baselines in both generalization performance and communication efficiency, and even achieves results comparable to first-order methods in large-scale settings.
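The seed-reconstructible trick can be sketched with a MeZO-style two-point zeroth-order estimate (a toy illustration; function names and the flooding protocol are ours): the message is a seed plus one scalar, so its size does not grow with the model.

```python
import numpy as np

def zo_update_message(model, loss_fn, seed, eps=1e-3):
    # Two-point zeroth-order gradient estimate along a random direction z.
    # z is fully determined by `seed`, so the update is reconstructible
    # from the pair (seed, g) alone -- a few bytes, regardless of model size.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(model.shape)
    g = (loss_fn(model + eps * z) - loss_fn(model - eps * z)) / (2 * eps)
    return seed, g

def apply_message(model, seed, g, lr=0.1):
    # Any receiver regenerates the identical perturbation from the seed
    # and applies the same step, enabling exact consensus after flooding.
    z = np.random.default_rng(seed).standard_normal(model.shape)
    return model - lr * g * z
```

Because every client regenerates the same direction from the seed, flooding the tiny (seed, scalar) pair reaches exact agreement without ever shipping parameters.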
The Statistical Signature of LLMs
Authors: Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli, Walter Quattrociocchi
2026-02-20
Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.
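The compression-based measure is essentially the following (zlib as one concrete lossless compressor; the paper's choice of compressor and preprocessing may differ):

```python
import zlib

def compressibility(text: str) -> float:
    # Fraction of bytes removed by lossless compression. More
    # statistically regular (recurrent, predictable) text compresses
    # further, so this acts as a model-agnostic regularity score.
    raw = text.encode("utf-8")
    comp = zlib.compress(raw, level=9)
    return 1.0 - len(comp) / len(raw)
```

Note that zlib adds a small fixed overhead, so very short strings can score near or below zero; the signal is meaningful on longer samples.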
Cut Less, Fold More Model Compression through the Lens of Projection Geometry
Authors: Olga Saukh, Dong Wang, Haris Šikić, Yun Cheng, Lothar Thiele
2026-02-20
Compressing neural networks without retraining is vital for deployment at scale. We study calibration-free compression through the lens of projection geometry: structured pruning is an axis-aligned projection, whereas model folding performs a low-rank projection via weight clustering. We formalize both as orthogonal operators and show that, within a rank distance of one, folding provably yields smaller parameter reconstruction error and, under mild smoothness assumptions, smaller functional perturbations than pruning. At scale, we evaluate >1000 checkpoints spanning ResNet18, PreActResNet18, ViT-B/32, and CLIP ViT-B/32 on CIFAR-10 and ImageNet-1K, covering diverse training hyperparameters (optimizers, learning rates, augmentations, regularization, sharpness-aware training), as well as multiple LLaMA-family 60M and 130M parameter models trained on C4. We show that folding typically achieves higher post-compression accuracy, with the largest gains at moderate-to-high compression ratios. The gap narrows and occasionally reverses at specific training setups. Our results position folding as a geometry-aware, calibration-free alternative to pruning that is often superior in practice and principled in theory.
Predict to Skip Linear Multistep Feature Forecasting for Efficient Diffusion Transformers
Authors: Hanshuai Cui, Zhiqing Tang, Qianli Ma, Zhi Yao, Weijia Jia
2026-02-20
Diffusion Transformers (DiT) have emerged as a widely adopted backbone for high-fidelity image and video generation, yet their iterative denoising process incurs high computational costs. Existing training-free methods rely on feature caching and reuse under the assumption of temporal stability. However, reusing features for multiple steps may lead to latent drift and visual degradation. We observe that model outputs evolve smoothly along much of the diffusion trajectory, enabling principled predictions rather than naive reuse. Based on this insight, we propose \textbf{PrediT}, a training-free acceleration framework that formulates feature prediction as a linear multistep problem. We employ classical linear multistep methods to forecast future model outputs from historical information, combined with a corrector that activates in high-dynamics regions to prevent error accumulation. A dynamic step modulation mechanism adaptively adjusts the prediction horizon by monitoring the feature change rate. Together, these components enable substantial acceleration while preserving generation fidelity. Extensive experiments validate that our method achieves substantial latency reductions across various DiT-based image and video generation models while incurring negligible quality degradation.
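The simplest instance of the idea is a two-step linear extrapolation with a dynamics gate (coefficients and threshold here are illustrative, not PrediT's exact scheme):

```python
import numpy as np

def multistep_predict(f_prev, f_curr):
    # Two-step linear extrapolation of a smoothly evolving feature:
    # f_next ~= 2*f_curr - f_prev (explicit linear multistep predictor),
    # replacing a full model call instead of naively reusing f_curr.
    return 2.0 * f_curr - f_prev

def predict_or_correct(f_prev, f_curr, change_rate, max_rate=0.1):
    # Corrector gate: in high-dynamics regions (large relative feature
    # change rate), signal that a real denoising step should run instead.
    if change_rate > max_rate:
        return None                     # fall back to the full DiT step
    return multistep_predict(f_prev, f_curr)
```

On a perfectly linear trajectory the predictor is exact, which is why it degrades far more gracefully than frozen-feature reuse over the smooth portions of the diffusion path.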
Joint Training on AMD and NVIDIA GPUs
Authors: Jon Hu, Thomas Jia, Jing Zhu, Zhendong Yu
2026-02-20
As large language models continue to scale, training demands on compute and system capacity grow rapidly, making single-vendor homogeneous clusters insufficient. This paper presents a technical solution for heterogeneous mixed training in AMD-NVIDIA environments. We first adopt a compatibility-oriented approach based on CPU-Forwarding Communication, with differentiated back-end selection across parallel groups and multi-NIC parallel data transfer. To achieve higher performance, we further propose a Device-Direct Communication approach, integrating a CPU-offloading P2P mechanism to enable direct cross-vendor GPU data transfer without host-memory staging. Experiments on LLaMA-8B and Qwen2-7B demonstrate that the proposed Device-Direct Communication approach achieves up to 98% of the throughput of an NVIDIA homogeneous system, while preserving training stability and correctness.
Asynchronous Heavy-Tailed Optimization
Authors: Junfei Sun, Dixi Yao, Xuchen Gong, Tahseen Rabbani, Manzil Zaheer, Tian Li
2026-02-20
Heavy-tailed stochastic gradient noise, commonly observed when training large models, can destabilize the optimization process. Recent works mainly focus on developing and understanding approaches to address heavy-tailed noise in the centralized or distributed, synchronous setting, leaving the interactions between such noise and asynchronous optimization underexplored. In this work, we investigate two communication schemes that handle stragglers with asynchronous updates in the presence of heavy-tailed gradient noise. We propose and theoretically analyze algorithmic modifications based on delay-aware learning rate scheduling and delay compensation to enhance the performance of asynchronous algorithms. Our convergence guarantees under heavy-tailed noise match the rate of the synchronous counterparts and improve delay tolerance compared with existing asynchronous approaches. Empirically, our approaches outperform prior synchronous and asynchronous methods in terms of accuracy/runtime trade-offs and are more robust to hyperparameters in both image and language tasks.
Turbo Connection Reasoning as Information Flow from Higher to Lower Layers
Authors: Mohan Tang, Sidi Lu
2026-02-20
Complex problems, whether in math, logic, or planning, are solved by humans through a sequence of steps where the result of one step informs the next. In this work, we adopt the perspective that the reasoning power of Transformers is fundamentally limited by a fixed maximum number of steps along any latent path of computation. To address this, we introduce Turbo Connection (TurboConn), a novel architecture that overcomes the fixed-depth constraint by routing multiple residual connections from the higher-layer hidden states of each token to the lower layers of subsequent tokens. Fine-tuning pre-trained LLMs with our method not only yields accuracy gains of 0.9% to over 10% on benchmarks like GSM8K, Parity, and multi-step arithmetic, but also demonstrates that the density of these backward connections is critical: our dense interaction significantly outperforms "sparse" alternatives that only pass a single hidden state or vector. Notably, TurboConn can be integrated into pre-trained LLMs to overcome task-specific plateaus: while a fine-tuned Qwen-3-1.7B achieves only 53.78% on Parity, adding our architectural modification enables the model to reach 100% accuracy, without retraining the full model from scratch or sophisticated curriculum learning. Our results provide strong empirical evidence that the depth of the computational path is a key factor in reasoning ability, also offering a new mechanism to enhance LLMs without significantly affecting generation latency.
JAEGER Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments
Authors: Zhan Liu, Changli Tang, Yuxin Wang, Zhiyuan Zhu, Youjun Chen, Yiwen Shao, Tianzi Wang, Lei Ke, Zengrui Jin, Chao Zhang
2026-02-20
Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, enabling joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints and datasets will be released upon acceptance.
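For context, the classical (non-neural) intensity vector that Neural IV learns to improve upon can be computed directly from the four first-order-ambisonics channels (a standard DSP sketch; channel scaling conventions such as SN3D vs. N3D are glossed over here):

```python
import numpy as np

def intensity_vector(w, x, y, z):
    # Classical acoustic intensity vector from FOA: the time-averaged
    # product of the omni channel W with the dipole channels (X, Y, Z).
    # Its direction points toward the dominant sound source.
    return np.array([np.mean(w * x), np.mean(w * y), np.mean(w * z)])

def doa_azimuth(w, x, y, z):
    # Horizontal-plane direction-of-arrival estimate, in radians.
    ix, iy, _ = intensity_vector(w, x, y, z)
    return np.arctan2(iy, ix)
```

This estimator degrades badly with overlapping sources and reverberation, which is the regime the learned Neural IV representation targets.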
Graph-Neural Multi-Agent Coordination for Distributed Access-Point Selection in Cell-Free Massive MIMO
Authors: Mohammad Zangooei, Lou Salaün, Chung Shue Chen, Raouf Boutaba
2026-02-20
Cell-free massive MIMO (CFmMIMO) systems require scalable and reliable distributed coordination mechanisms to operate under stringent communication and latency constraints. A central challenge is the Access Point Selection (APS) problem, which seeks to determine the subset of serving Access Points (APs) for each User Equipment (UE) that can satisfy UEs' Spectral Efficiency (SE) requirements while minimizing network power consumption. We introduce APS-GNN, a scalable distributed multi-agent learning framework that decomposes APS into agents operating at the granularity of individual AP-UE connections. Agents coordinate via local observation exchange over a novel Graph Neural Network (GNN) architecture and share parameters to reuse their knowledge and experience. APS-GNN adopts a constrained reinforcement learning approach to provide agents with explicit observability of APS' conflicting objectives, treating SE satisfaction as a cost and power reduction as a reward. Both signals are defined locally, facilitating effective credit assignment and scalable coordination in large networks. To further improve training stability and exploration efficiency, the policy is initialized via supervised imitation learning from a heuristic APS baseline. We develop a realistic CFmMIMO simulator and demonstrate that APS-GNN delivers the target SE while activating 50-70% fewer APs than heuristic and centralized Multi-agent Reinforcement Learning (MARL) baselines in different evaluation scenarios. Moreover, APS-GNN achieves one to two orders of magnitude lower inference latency than centralized MARL approaches due to its fully parallel and distributed execution. These results establish APS-GNN as a practical and scalable solution for APS in large-scale CFmMIMO networks.
Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning
Authors: Narjes Nourzad, Carlee Joe-Wong
2026-02-20
In environments with sparse or delayed rewards, reinforcement learning (RL) incurs high sample complexity due to the large number of interactions needed for learning. This limitation has motivated the use of large language models (LLMs) for subgoal discovery and trajectory guidance. While LLMs can support exploration, frequent reliance on LLM calls raises concerns about scalability and reliability. We address these challenges by constructing a memory graph that encodes subgoals and trajectories from both LLM guidance and the agent's own successful rollouts. From this graph, we derive a utility function that evaluates how closely the agent's trajectories align with prior successful strategies. This utility shapes the advantage function, providing the critic with additional guidance without altering the reward. Our method relies primarily on offline input and only occasional online queries, avoiding dependence on continuous LLM supervision. Preliminary experiments in benchmark environments show improved sample efficiency and faster early learning compared to baseline RL methods, with final returns comparable to methods that require frequent LLM interaction.
MIRA Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance
Authors: Narjes Nourzad, Carlee Joe-Wong
2026-02-20
Reinforcement learning (RL) agents often suffer from high sample complexity in sparse or delayed reward settings due to limited prior structure. Large language models (LLMs) can provide subgoal decompositions, plausible trajectories, and abstract priors that facilitate early learning. However, heavy reliance on LLM supervision introduces scalability constraints and dependence on potentially unreliable signals. We propose MIRA (Memory-Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early training. The graph stores decision-relevant information, including trajectory segments and subgoal structures, and is constructed from both the agent's high-return experiences and LLM outputs. This design amortizes LLM queries into a persistent memory rather than requiring continuous real-time supervision. From this memory graph, we derive a utility signal that softly adjusts advantage estimation to influence policy updates without modifying the underlying reward function. As training progresses, the agent's policy gradually surpasses the initial LLM-derived priors, and the utility term decays, preserving standard convergence guarantees. We provide theoretical analysis showing that utility-based shaping improves early-stage learning in sparse-reward environments. Empirically, MIRA outperforms RL baselines and achieves returns comparable to approaches that rely on frequent LLM supervision, while requiring substantially fewer online LLM queries. Project webpage: https://narjesno.github.io/MIRA/
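The utility-shaped advantage with a decaying coefficient can be sketched as follows (the schedule and functional form are illustrative, not the paper's exact choice):

```python
def shaped_advantage(adv, utility, step, decay=1e-3):
    # Soft, decaying adjustment of the advantage estimate:
    #   A'_t = A_t + lambda_t * U(s, a),  with  lambda_t -> 0,
    # so the reward function is untouched and standard policy-gradient
    # convergence behavior is recovered late in training.
    lam = 1.0 / (1.0 + decay * step)
    return adv + lam * utility
```

Because the shaping enters only through the advantage used by the policy update, the optimal policy of the underlying MDP is unchanged once the coefficient has decayed.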
The Geometry of Multi-Task Grokking Transverse Instability, Superposition, and Weight Decay Phase Structure
Authors: Yongzhong Xu
2026-02-19
Grokking -- the abrupt transition from memorization to generalization long after near-zero training loss -- has been studied mainly in single-task settings. We extend geometric analysis to multi-task modular arithmetic, training shared-trunk Transformers on dual-task (mod-add + mod-mul) and tri-task (mod-add + mod-mul + mod-sq) objectives across a systematic weight decay sweep. Five consistent phenomena emerge. (1) Staggered grokking order: multiplication generalizes first, followed by squaring, then addition, with consistent delays across seeds. (2) Universal integrability: optimization trajectories remain confined to an empirically invariant low-dimensional execution manifold; commutator defects orthogonal to this manifold reliably precede generalization. (3) Weight decay phase structure: grokking timescale, curvature depth, reconstruction threshold, and defect lead covary systematically with weight decay, revealing distinct dynamical regimes and a sharp no-decay failure mode. (4) Holographic incompressibility: final solutions occupy only 4--8 principal trajectory directions yet are distributed across full-rank weights and destroyed by minimal perturbations; SVD truncation, magnitude pruning, and uniform scaling all fail to preserve performance. (5) Transverse fragility and redundancy: removing less than 10% of orthogonal gradient components eliminates grokking, yet dual-task models exhibit partial recovery under extreme deletion, suggesting redundant center manifolds enabled by overparameterization. Together, these results support a dynamical picture in which multi-task grokking constructs a compact superposition subspace in parameter space, with weight decay acting as compression pressure and excess parameters supplying geometric redundancy in optimization pathways.
Dual Length Codes for Lossless Compression of BFloat16
Authors: Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer
2026-02-19
Training and serving Large Language Models (LLMs) rely heavily on parallelization and collective operations, which are frequently bottlenecked by network bandwidth. Lossless compression using, e.g., Huffman codes can alleviate the issue; however, Huffman codes suffer from slow, bit-sequential decoding and high hardware complexity due to deep tree traversals. Universal codes, e.g., Exponential-Golomb codes, are faster to decode but do not exploit symbol frequency distributions. To address these limitations, this paper introduces Dual Length Codes, a hybrid approach designed to balance compression efficiency with decoding speed. Analyzing BFloat16 tensors from the Gemma model, we observed that the top 8 most frequent symbols account for approximately 50% of the cumulative probability. These 8 symbols are assigned a short 4-bit code. The remaining 248 symbols are assigned a longer 9-bit code. The coding scheme uses a single prefix bit to distinguish between the two code lengths, and a small Look-Up Table with only 8 entries for encoding and decoding. The scheme achieves a compressibility of 18.6% compared to 21.3% for Huffman codes, but significantly speeds up decoding and simplifies the hardware.
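The scheme as described is fully specified: a '0' prefix plus a 3-bit index covers the 8 hot symbols (4 bits total), and a '1' prefix plus an 8-bit index covers the remaining 248 symbols (9 bits total). A software sketch of encode/decode (string-of-bits for clarity; the hardware uses small LUTs):

```python
def build_codes(symbols_by_freq):
    # Top 8 symbols: prefix '0' + 3-bit index  -> 4-bit codewords.
    # Remaining 248: prefix '1' + 8-bit index  -> 9-bit codewords.
    codes = {}
    for i, s in enumerate(symbols_by_freq[:8]):
        codes[s] = "0" + format(i, "03b")
    for i, s in enumerate(symbols_by_freq[8:]):
        codes[s] = "1" + format(i, "08b")
    return codes

def encode(data, codes):
    return "".join(codes[s] for s in data)

def decode(bits, symbols_by_freq):
    out, i = [], 0
    while i < len(bits):
        if bits[i] == "0":                              # short codeword
            out.append(symbols_by_freq[int(bits[i + 1:i + 4], 2)])
            i += 4
        else:                                           # long codeword
            out.append(symbols_by_freq[8 + int(bits[i + 1:i + 9], 2)])
            i += 9
    return out
```

Decoding needs only one prefix-bit test per symbol rather than a Huffman tree walk, which is the source of the hardware simplification the paper claims.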
CLUTCH Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild
Authors: Balamurugan Thambiraja, Omid Taheri, Radek Danecek, Giorgio Becherini, Gerard Pons-Moll, Justus Thies
2026-02-19
Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to "in-the-wild" settings. Further, contemporary models and their training schemes struggle to combine animation fidelity with text-motion alignment. To address this, we (1) introduce '3D Hands in the Wild' (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the decoder. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) and state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture motion in the wild, CLUTCH employs SHIFT, a part-modality decomposed VQ-VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage, where CLUTCH is co-supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling. Code, data and models will be released.
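At the core of any VQ-VAE tokenizer like SHIFT is a vector-quantization step. The sketch below shows that generic step only (not SHIFT's part-modality decomposition, which the abstract describes only at a high level): each latent vector is snapped to its nearest codebook entry, yielding a discrete motion token.

```python
def quantize(latents, codebook):
    """Snap each latent vector to its nearest codebook entry (L2 distance)."""
    ids, quantized = [], []
    for v in latents:
        d = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        k = min(range(len(codebook)), key=d.__getitem__)
        ids.append(k)                 # discrete token id fed to the language model
        quantized.append(codebook[k])
    return ids, quantized

# Four 2-D codebook entries; real hand-motion codebooks are far larger.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ids, _ = quantize([(0.9, 0.1), (0.1, 0.8)], codebook)
assert ids == [1, 2]
```

The resulting token ids are what a language model can then predict from text, which is why tokenization quality directly bounds generation fidelity.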
Hardware-Aware Design of a GNN-Based Hit Filtering Algorithm for the Belle II Level-1 Trigger
Authors: Greta Heine, Fabio Mayer, Marc Neu, Jürgen Becker, Torben Ferber
2026-02-19
The Belle~II experiment operates at high luminosity, where an increasing beam-induced background imposes stringent demands on the hardware Level-1 trigger system, which must operate under tight latency and bandwidth constraints. To achieve online data reduction within the Level-1 trigger system, we have developed a hit-filtering algorithm based on the lightweight Interaction Network architecture. In this work, we present a hardware-aware model-compression workflow for this hit-filtering algorithm, targeting deployment on FPGA devices within the Belle~II trigger system. The network is adapted to the detector and trigger conditions through model-size and graph-size reduction, low-precision (4-bit) fixed-point arithmetic, and unstructured pruning. We assess the resulting design using the total number of bit operations as a hardware-aware computational complexity metric. Using this metric, we identify a configuration that decreases this cost by more than two orders of magnitude relative to the full-precision reference implementation, while preserving performance close to the reference model in terms of hit efficiency and background rejection, as indicated by only a modest decrease in the AUC score from 97.4 to 96.8, evaluated on Belle~II collision data.
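A bit-operations (BOPs) style cost metric can be illustrated as follows. This is a common hardware-aware proxy, not necessarily the paper's exact accounting: the cost of a dense layer scales with its MAC count times the weight and activation bit widths, and unstructured pruning removes MACs outright.

```python
def layer_bops(n_in, n_out, w_bits, a_bits):
    """Bit operations for a dense layer: MAC count times both bit widths."""
    return n_in * n_out * w_bits * a_bits

def pruned_bops(n_in, n_out, w_bits, a_bits, sparsity):
    """Unstructured pruning removes a fraction `sparsity` of the MACs."""
    return int(layer_bops(n_in, n_out, w_bits, a_bits) * (1.0 - sparsity))

full = layer_bops(128, 128, 32, 32)       # full-precision reference layer
lean = pruned_bops(64, 64, 4, 4, 0.8)     # reduced size, 4-bit, 80% sparse
assert full / lean > 100                  # over two orders of magnitude cheaper
```

The example numbers are illustrative, but they show how size reduction, 4-bit quantization, and pruning multiply together to reach a >100x cost reduction.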
Sink-Aware Pruning for Diffusion Language Models
Authors: Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen
2026-02-19
Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention-sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
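The sink-instability observation can be sketched with a toy statistic (the paper's exact metric is not given in the abstract; this version just tracks how often the dominant sink position changes across denoising steps):

```python
def sink_positions(attn_per_step):
    """attn_per_step: one row per step; row[j] = total attention token j receives."""
    return [max(range(len(row)), key=row.__getitem__) for row in attn_per_step]

def sink_shift_rate(attn_per_step):
    """Fraction of consecutive steps at which the dominant sink moves."""
    pos = sink_positions(attn_per_step)
    shifts = sum(a != b for a, b in zip(pos, pos[1:]))
    return shifts / max(len(pos) - 1, 1)

# AR-like: the sink stays pinned at token 0.   DLM-like: the sink wanders.
ar = [[0.9, 0.05, 0.05]] * 4
dlm = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.7, 0.2, 0.1]]
assert sink_shift_rate(ar) == 0.0
assert sink_shift_rate(dlm) == 1.0
```

A high shift rate is the signal that a sink is transient, and hence a safe pruning target in a DLM even though it would be preserved under AR heuristics.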
KLong: Training LLM Agent for Extremely Long-horizon Tasks
Authors: Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi
2026-02-19
This paper introduces KLong, an open-source agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL scheme, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.
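Trajectory splitting with preserved early context and overlap can be sketched as below. This is a simplified sliding-window variant under stated assumptions (the `head`, `chunk`, and `overlap` parameters are hypothetical; the paper's exact progressive-truncation schedule is not specified in the abstract):

```python
def split_trajectory(steps, head=2, chunk=4, overlap=1):
    """Each sub-trajectory keeps the first `head` steps as shared early
    context; consecutive chunks share `overlap` later steps so the
    supervision signal stays continuous across splits."""
    early, rest = steps[:head], steps[head:]
    chunks, i = [], 0
    while i < len(rest):
        chunks.append(early + rest[i:i + chunk])
        if i + chunk >= len(rest):
            break
        i += chunk - overlap
    return chunks

traj = list(range(10))                       # a toy 10-step trajectory
parts = split_trajectory(traj)
assert all(p[:2] == [0, 1] for p in parts)   # early context preserved everywhere
assert parts[0][-1] == parts[1][2]           # adjacent chunks overlap by one step
```

Keeping the head in every chunk is what lets each split still "see" the original task statement, while the overlap stitches the splits into one coherent long-horizon trajectory.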
Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models
Authors: Wen-Tse Chen, Jiayu Chen, Fahim Tajwar, Hao Zhu, Xintong Duan, Ruslan Salakhutdinov, Jeff Schneider
2026-02-19
Learning from self-sampled data and environmental feedback remains a fundamental challenge in training self-evolving agents. Temporal credit assignment mitigates this issue by transforming sparse feedback into dense supervision signals. However, previous approaches typically depend on learning task-specific value functions for credit assignment, which suffer from poor sample efficiency and limited generalization. In this work, we propose to leverage pretrained knowledge from large language models (LLMs) to transform sparse rewards into dense training signals (i.e., the advantage function) through retrospective in-context learning (RICL). We further propose an online learning framework, RICOL, which iteratively refines the policy based on the credit assignment results from RICL. We empirically demonstrate that RICL can accurately estimate the advantage function with limited samples and effectively identify critical states in the environment for temporal credit assignment. Extended evaluation on four BabyAI scenarios shows that RICOL achieves convergent performance comparable to traditional online RL algorithms with significantly higher sample efficiency. Our findings highlight the potential of leveraging LLMs for temporal credit assignment, paving the way for more sample-efficient and generalizable RL paradigms.
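The sparse-to-dense transformation can be illustrated with a toy stand-in. Here a hypothetical `credits` vector plays the role of the LLM's retrospective in-context judgments (the real method queries an LLM; this is only the aggregation arithmetic): each step's dense signal is its share of credit times the terminal reward.

```python
def dense_advantages(credits, terminal_reward):
    """Distribute one sparse terminal reward over steps, weighted by credit."""
    total = sum(credits) or 1.0
    return [terminal_reward * c / total for c in credits]

# Hypothetical 4-step episode: the judge deems steps 2 and 3 critical.
credits = [0.0, 1.0, 3.0, 0.0]
adv = dense_advantages(credits, terminal_reward=1.0)
assert adv == [0.0, 0.25, 0.75, 0.0]
```

The point of the construction is that every step now carries a training signal, instead of only the final one, which is what improves sample efficiency.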
Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian
Authors: Pietro Ferrazzi, Mattia Franzin, Alberto Lavelli, Bernardo Magnini
2026-02-19
Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks spanning Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers a strong lower-resource alternative. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration, based on Qwen3-1.7B, achieving an average score 9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian hospital, and 175M words from various sources that we used for continual pre-training.
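Constraint decoding in this setting amounts to masking the model's output distribution to tokens the task schema allows. A generic sketch (a logits mask with a made-up label set, not the paper's implementation):

```python
import math

def constrained_argmax(logits, vocab, allowed):
    """Pick the highest-logit token among those the constraint permits."""
    masked = [l if t in allowed else -math.inf for t, l in zip(vocab, logits)]
    return vocab[max(range(len(vocab)), key=masked.__getitem__)]

# Hypothetical NER label vocabulary plus one free-text token.
vocab = ["DRUG", "DISEASE", "hello", "SYMPTOM"]
logits = [1.2, 0.4, 3.0, 0.9]          # the free-text token scores highest
label = constrained_argmax(logits, vocab, allowed={"DRUG", "DISEASE", "SYMPTOM"})
assert label == "DRUG"                 # the constraint forces a valid label
```

For small models this matters because it guarantees well-formed outputs even when the model's raw preference is an invalid continuation.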
The CTI Echo Chamber: Fragmentation, Overlap, and Vendor Specificity in Twenty Years of Cyber Threat Reporting
Authors: Manuel Suarez-Roman, Francesco Marciori, Mauro Conti, Juan Tapiador
2026-02-19
Despite the high volume of open-source Cyber Threat Intelligence (CTI), our understanding of long-term threat actor-victim dynamics remains fragmented due to the lack of structured datasets and inconsistent reporting standards. In this paper, we present a large-scale automated analysis of open-source CTI reports spanning two decades. We develop a high-precision, LLM-based pipeline to ingest and structure 13,308 reports, extracting key entities such as attributed threat actors, motivations, victims, reporting vendors, and technical indicators (IoCs and TTPs). Our analysis quantifies the evolution of CTI information density and specialization, characterizing patterns that relate specific threat actors to motivations and victim profiles. Furthermore, we perform a meta-analysis of the CTI industry itself. We identify a fragmented ecosystem of distinct silos where vendors demonstrate significant geographic and sectoral reporting biases. Our marginal coverage analysis reveals that intelligence overlap between vendors is typically low: while a few core providers may offer broad situational awareness, additional sources yield diminishing returns. Overall, our findings characterize the structural biases inherent in the CTI ecosystem, enabling practitioners and researchers to better evaluate the completeness of their intelligence sources.
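A marginal coverage analysis of this kind can be sketched as a greedy set-cover pass (an assumed procedure for illustration, not necessarily the paper's exact method): repeatedly add the vendor contributing the most not-yet-seen indicators and record the shrinking marginal gain.

```python
def marginal_coverage(vendor_iocs):
    """Greedy order of vendors by marginal new-indicator contribution."""
    seen, gains = set(), []
    remaining = dict(vendor_iocs)
    while remaining:
        best = max(remaining, key=lambda v: len(remaining[v] - seen))
        gains.append((best, len(remaining[best] - seen)))
        seen |= remaining.pop(best)
    return gains

# Toy ecosystem: one broad provider, two largely redundant ones.
vendors = {
    "A": {1, 2, 3, 4, 5},   # broad core provider
    "B": {4, 5, 6},         # mostly overlapping
    "C": {5},               # fully redundant
}
assert marginal_coverage(vendors) == [("A", 5), ("B", 1), ("C", 0)]
```

The rapidly collapsing gains (5, then 1, then 0) are exactly the "diminishing returns" pattern the analysis reports for additional CTI sources.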
Preserving Historical Truth: Detecting Historical Revisionism in Large Language Models
Authors: Francesco Ortu, Joeun Yook, Punya Syon Pandey, Keenan Samway, Bernhard Schölkopf, Alberto Cazzaniga, Rada Mihalcea, Zhijing Jin
2026-02-19
Large language models (LLMs) are increasingly used as sources of historical information, motivating the need for scalable audits of contested events and politically charged narratives in settings that mirror real user interactions. We introduce \texttt{HistoricalMisinfo}, a curated dataset of contested events from multiple countries, each paired with a factual reference narrative and a documented revisionist reference narrative. To approximate real-world usage, we instantiate each event in prompt scenarios that reflect common LLM settings (e.g., questions, textbooks, social posts, policy briefs). Using an LLM-as-a-judge protocol that compares model outputs to the two references, we evaluate LLMs varying across model architectures in two conditions: (i) neutral user prompts that ask for factually accurate information, and (ii) robustness prompts in which the user explicitly requests the revisionist version of the event. Under neutral prompts, models are generally closer to the factual references, though the resulting scores should be interpreted as reference-alignment signals rather than definitive evidence of human-interpretable revisionism. Robustness prompting yields a strong and consistent effect: when the user requests the revisionist narrative, all evaluated models show sharply higher revisionism scores, indicating limited resistance or self-correction. HistoricalMisinfo provides a practical foundation for benchmarking robustness to revisionist framing and for guiding future work on more precise automatic evaluation of contested historical claims, supporting the sustainable integration of AI systems within society. Our code is available at https://github.com/francescortu/PreservingHistoricalTruth.
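The two-reference comparison can be illustrated with a toy stand-in for the judge (the real protocol uses an LLM as the judge; here plain token overlap plays that role, purely to show the scoring shape):

```python
def jaccard(a, b):
    """Token-overlap similarity between two texts, in [0, 1]."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def alignment(output, factual_ref, revisionist_ref):
    """Score a model output against both references and pick the closer one."""
    f = jaccard(output, factual_ref)
    r = jaccard(output, revisionist_ref)
    return {"factual": f, "revisionist": r,
            "closer_to": "factual" if f >= r else "revisionist"}

res = alignment("the treaty was signed in 1919",
                "the treaty was signed in 1919 after the war",
                "the treaty was never signed")
assert res["closer_to"] == "factual"
```

The output of the protocol is thus a relative alignment signal between two fixed narratives, which is why the abstract cautions against reading the scores as direct evidence of revisionism.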
Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study
Authors: Dylan Bouchard, Mohit Singh Chauhan, Viren Bajaj, David Skarbrevik
2026-02-19
Uncertainty quantification has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find that (1) claim-response entailment consistently performs better than or on par with more complex claim-level scorers, (2) claim-level scoring generally yields better results than sentence-level scoring, and (3) uncertainty-aware decoding is highly effective for improving the factuality of long-form outputs. Our framework clarifies relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.
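The three-stage taxonomy above can be sketched end to end with a toy entailment check standing in for an NLI model (substring containment here is an illustrative assumption, not a real entailment method): decompose the response into claims, score each claim by consistency with sampled responses, then aggregate to a response-level score.

```python
def claim_scores(claims, sampled_responses, entails=lambda resp, c: c in resp):
    """Unit-level scoring: fraction of sampled responses 'entailing' each claim."""
    return [sum(entails(r, c) for r in sampled_responses) / len(sampled_responses)
            for c in claims]

def response_confidence(claims, sampled_responses):
    """Response-level aggregation: mean of the per-claim consistency scores."""
    scores = claim_scores(claims, sampled_responses)
    return sum(scores) / len(scores)

# Hypothetical decomposition of one response, checked against three samples.
claims = ["born in 1879", "won the Nobel Prize"]
samples = ["born in 1879 and won the Nobel Prize",
           "born in 1879",
           "born in 1880"]
assert claim_scores(claims, samples) == [2 / 3, 1 / 3]
assert abs(response_confidence(claims, samples) - 0.5) < 1e-9
```

Swapping the `entails` function and the aggregation rule is exactly how the taxonomy's families of black-box scorers differ from one another.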