2026-02-24


PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration

Authors: Chen Duan, Zhentao Guo, Pei Fu, Zining Wang, Kai Zhou, Pengfei Yan

2026-02-22

http://arxiv.org/abs/2602.19188v1

In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and adapt across varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing and thus inherently lacks the positional reasoning required for precise visual tasks such as text spotting and text grounding. Additionally, the extensive parameters of MLLMs necessitate substantial computational resources and large-scale data for effective training. Conversely, text spotting specialists achieve state-of-the-art coordinate predictions but lack semantic reasoning capabilities. This dichotomy motivates our key research question: can we synergize the efficiency of specialists with the contextual power of LLMs to create a positionally accurate MLLM? To overcome these challenges, we introduce PositionOCR, a parameter-efficient hybrid architecture that seamlessly integrates a text spotting model's positional strengths with an LLM's contextual reasoning. Comprising 131M trainable parameters, the framework demonstrates strong multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting, and consistently surpasses traditional MLLMs.

Next Reply Prediction: X Dataset Linguistic Discrepancies in Naively Generated Content

Authors: Simon Münker, Nils Schwager, Kai Kugler, Michael Heseltine, Achim Rettinger

2026-02-22

http://arxiv.org/abs/2602.19177v1

The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. While LLMs offer scalability and cost-efficiency, their "naive" application, where they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data, to create a dataset designed to evaluate the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the need for more sophisticated prompting techniques and specialized datasets to ensure that LLM-generated content accurately reflects the complex linguistic patterns of human communication, thereby improving the validity of computational social science studies.

Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation

Authors: Lunjie Zhu, Yushi Huang, Xingtong Ge, Yufei Xue, Zhening Liu, Yumeng Zhang, Zehong Lin, Jun Zhang

2026-02-22

http://arxiv.org/abs/2602.19161v1

Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion backbones become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal pruning framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a 6x speedup while maintaining up to 96.9% of the reconstruction performance. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
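
The paper does not spell out its independence-aware channel pruning; a generic greedy stand-in, ranking channels by how uncorrelated they are with already-kept ones, can sketch the idea (data and selection rule below are illustrative only):

```python
import numpy as np

def select_channels(acts, keep):
    """Greedy redundancy-aware channel selection: repeatedly keep the channel
    least correlated with those already kept. `acts` is (samples, channels).
    A generic stand-in for independence-aware pruning, not the paper's method."""
    C = np.corrcoef(acts.T)                      # channel-channel correlation
    kept = [int(np.argmax(acts.var(axis=0)))]    # start from highest-variance channel
    while len(kept) < keep:
        cand = [c for c in range(acts.shape[1]) if c not in kept]
        # pick the candidate whose max |corr| with kept channels is smallest
        best = min(cand, key=lambda c: max(abs(C[c, k]) for k in kept))
        kept.append(best)
    return sorted(kept)

# 8 channels where channels 4-7 nearly duplicate channels 0-3
rng = np.random.default_rng(0)
base = rng.normal(size=(256, 4))
acts = np.concatenate([base, base + 0.01 * rng.normal(size=(256, 4))], axis=1)
kept = select_channels(acts, keep=4)
```

On this synthetic input the selector keeps one channel from each near-duplicate pair, which is exactly the redundancy-mitigation behavior the abstract describes.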

Incremental Learning of Sparse Attention Patterns in Transformers

Authors: Oğuz Kaan Yüksel, Rodrigo Alvarez Lucendo, Nicolas Flammarion

2026-02-22

http://arxiv.org/abs/2602.19143v1

This paper introduces a high-order Markov chain task to investigate how transformers learn to integrate information from multiple past positions with varying statistical significance. We demonstrate that transformers learn this task incrementally: each stage is defined by the acquisition of specific information through sparse attention patterns. Notably, we identify a shift in learning dynamics from competitive, where heads converge on the most statistically dominant pattern, to cooperative, where heads specialize in distinct patterns. We model these dynamics using simplified differential equations that characterize the trajectory and prove stage-wise convergence results. Our analysis reveals that transformers ascend a complexity ladder by passing through simpler, misspecified hypothesis classes before reaching the full model class. We further show that early stopping acts as an implicit regularizer, biasing the model toward these simpler classes. These results provide a theoretical foundation for the emergence of staged learning and complex behaviors in transformers, offering insights into generalization for natural language processing and algorithmic reasoning.
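
A task of this flavor can be generated with a toy sampler where the next symbol is copied from one of several past positions, each chosen with a fixed probability that plays the role of its "statistical significance" (lags and weights below are illustrative, not the paper's construction):

```python
import random

def sample_sequence(length, weights, vocab=4, seed=0):
    """Toy high-order Markov task: the next symbol is copied from one of
    several past positions, chosen with fixed probabilities. A model must
    attend to all lags, not just the dominant one, to fit the data."""
    rng = random.Random(seed)
    lags = list(weights)                      # e.g. {1: 0.6, 3: 0.3, 5: 0.1}
    probs = [weights[l] for l in lags]
    max_lag = max(lags)
    seq = [rng.randrange(vocab) for _ in range(max_lag)]  # random prefix
    for _ in range(length - max_lag):
        lag = rng.choices(lags, probs)[0]     # pick which past position to copy
        seq.append(seq[-lag])
    return seq

seq = sample_sequence(50, {1: 0.6, 3: 0.3, 5: 0.1})
```

The competitive-to-cooperative shift the paper describes would correspond to heads first all locking onto lag 1 (the dominant pattern) before specializing across lags 3 and 5.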

How Do LLMs Encode Scientific Quality? An Empirical Study Using Monosemantic Features from Sparse Autoencoders

Authors: Michael McCoubrey, Angelo Salatino, Francesco Osborne, Enrico Motta

2026-02-22

http://arxiv.org/abs/2602.19115v1

In recent years, there has been a growing use of generative AI, and large language models (LLMs) in particular, to support both the assessment and generation of scientific work. Although some studies have shown that LLMs can, to a certain extent, evaluate research according to perceived quality, our understanding of the internal mechanisms that enable this capability remains limited. This paper presents the first study that investigates how LLMs encode the concept of scientific quality through relevant monosemantic features extracted using sparse autoencoders. We derive such features under different experimental settings and assess their ability to serve as predictors across three tasks related to research quality: predicting citation count, journal SJR, and journal h-index. The results indicate that LLMs encode features associated with multiple dimensions of scientific quality. In particular, we identify four recurring types of features that capture key aspects of how research quality is represented: 1) features reflecting research methodologies; 2) features related to publication type, with literature reviews typically exhibiting higher impact; 3) features associated with high-impact research fields and technologies; and 4) features corresponding to specific scientific jargon. These findings represent an important step toward understanding how LLMs encapsulate concepts related to research quality.
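
The "features as predictors" step can be sketched with a correlation ranking over a feature-activation matrix; the data here is synthetic and the ranking rule is a simple stand-in for the paper's evaluation:

```python
import numpy as np

def rank_features(F, y):
    """Rank candidate SAE features by |Pearson r| with a quality proxy
    (e.g. citation count). F is (papers, features); illustrative only."""
    Fc = F - F.mean(axis=0)
    yc = y - y.mean()
    r = (Fc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Fc**2).sum(axis=0) * (yc**2).sum()) + 1e-12)
    return np.argsort(-np.abs(r))

rng = np.random.default_rng(1)
F = rng.normal(size=(200, 10))                  # fake feature activations
y = 3.0 * F[:, 4] + 0.1 * rng.normal(size=200)  # feature 4 drives the proxy
order = rank_features(F, y)
```
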

Universal 3D Shape Matching via Coarse-to-Fine Language Guidance

Authors: Qinfeng Xiao, Guofeng Mei, Bo Yang, Liying Zhang, Jian Zhang, Kit-lun Yick

2026-02-22

http://arxiv.org/abs/2602.19112v1

Establishing dense correspondences between shapes is a crucial task in computer vision and graphics, yet prior approaches depend on near-isometric assumptions and homogeneous subject types (i.e., they only operate on human shapes). However, building semantic correspondences for cross-category objects remains challenging and has received relatively little attention. To achieve this, we propose UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes without restricting object categories. The key insight is to lift "coarse" semantic cues into "fine" correspondence, which is achieved through two stages. In the "coarse" stage, we perform class-agnostic 3D segmentation to obtain non-overlapping semantic parts and prompt multimodal large language models (MLLMs) to identify part names. Then, we employ pretrained vision language models (VLMs) to extract text embeddings, enabling the construction of matched semantic parts. In the "fine" stage, we leverage these coarse correspondences to guide the learning of dense correspondences through a dedicated rank-based contrastive scheme. Thanks to class-agnostic segmentation, language guidance, and rank-based contrastive learning, our method is versatile across object categories and requires no predefined part proposals, enabling universal matching for inter-class and non-isometric shapes. Extensive experiments demonstrate that UniMatch consistently outperforms competing methods in various challenging scenarios.

Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer

Authors: Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng, Zhenkai Liang, Xiang Wang, Tat-Seng Chua

2026-02-22

http://arxiv.org/abs/2602.19058v1

Large vision-language models (LVLMs) have rapidly advanced across various domains, yet they still lag behind strong text-only large language models (LLMs) on tasks that require multi-step inference and compositional decision-making. Motivated by their shared Transformer architectures, we investigate whether the two model families rely on common internal computation for such inference. At the neuron level, we uncover a surprisingly large overlap: more than half of the top-activated units during multi-step inference are shared between representative LLMs and LVLMs, revealing a modality-invariant inference subspace. Through causal probing via activation amplification, we further show that these shared neurons encode consistent and interpretable concept-level effects, demonstrating their functional contribution to inference. Building on this insight, we propose Shared Neuron Low-Rank Fusion (SNRF), a parameter-efficient framework that transfers mature inference circuitry from LLMs to LVLMs. SNRF profiles cross-model activations to identify shared neurons, computes a low-rank approximation of inter-model weight differences, and injects these updates selectively within the shared-neuron subspace. This mechanism strengthens multimodal inference performance with minimal parameter changes and requires no large-scale multimodal fine-tuning. Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities. Our results demonstrate that shared neurons form an interpretable bridge between LLMs and LVLMs, enabling low-cost transfer of inference ability into multimodal models. Our code is available at https://github.com/chenhangcuisg-code/Do-LLMs-VLMs-Share-Neurons.
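
The low-rank-difference injection step can be sketched in a few lines; restricting the edit to shared-neuron rows is our simplification of "within the shared-neuron subspace", and the matrices are toy:

```python
import numpy as np

def snrf_update(W_lvlm, W_llm, shared_rows, rank):
    """Sketch of SNRF-style transfer: take a low-rank approximation of the
    LLM-LVLM weight difference and apply it only on shared-neuron rows."""
    delta = W_llm - W_lvlm
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    low_rank = U[:, :rank] * s[:rank] @ Vt[:rank]     # best rank-r approx
    mask = np.zeros((W_lvlm.shape[0], 1))
    mask[shared_rows] = 1.0                           # edit shared neurons only
    return W_lvlm + mask * low_rank

rng = np.random.default_rng(0)
W_lvlm = rng.normal(size=(8, 8))
W_llm = W_lvlm.copy()
W_llm[2] += 1.0                                       # models differ on neuron 2
W_new = snrf_update(W_lvlm, W_llm, shared_rows=[2], rank=1)
```

Since the toy difference is itself rank 1 and confined to neuron 2, the update reproduces the LLM's row there while leaving every other row untouched.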

Whisper Courtside Edition: Enhancing ASR Performance Through LLM-Driven Context Generation

Authors: Yonathan Ron, Shiri Gilboa, Tammuz Dubnov

2026-02-21

http://arxiv.org/abs/2602.18966v1

Domain-specific speech remains a persistent challenge for automatic speech recognition (ASR), even for state-of-the-art systems like OpenAI's Whisper. We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining. The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder. Evaluated on 421 NBA basketball commentary segments (a domain characterized by dense proper nouns and technical terminology), our best pipeline achieves a statistically significant 17.0% relative reduction in word error rate (WER; from 0.217 to 0.180, p<0.001). Improvements are observed in 40.1% of segments, with degradation in only 7.1%, substantially outperforming direct transcript post-editing. These results demonstrate that prompt-based augmentation can deliver scalable domain adaptation for ASR, offering a practical alternative to costly model fine-tuning.
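
The prompt-generation stage can be sketched as lexicon matching over a first-pass transcript; the resulting string is the kind of compact context that could be fed to Whisper via its `initial_prompt` parameter. The lexicon and matching rule below are illustrative, not the paper's LLM agents:

```python
def build_context_prompt(transcript, lexicon, max_terms=12):
    """Collect lexicon terms that loosely appear in a first-pass transcript
    and emit a short domain prompt to condition a second decoding pass."""
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    hits = [term for term in lexicon
            if any(part.lower() in words for part in term.split())]
    return "NBA commentary. Key terms: " + ", ".join(hits[:max_terms])

lexicon = ["Giannis Antetokounmpo", "pick-and-roll", "Luka Doncic", "eurostep"]
prompt = build_context_prompt(
    "and luka drives past the defender with the eurostep", lexicon)
```
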

NeuroWise: A Multi-Agent LLM "Glass-Box" System for Practicing Double-Empathy Communication with Autistic Partners

Authors: Albert Tang, Yifan Mo, Jie Li, Yue Su, Mengyuan Zhang, Sander L. Koole, Koen Hindriks, Jiahuan Pei

2026-02-21

http://arxiv.org/abs/2602.18962v1

The double empathy problem frames communication difficulties between neurodivergent and neurotypical individuals as arising from mutual misunderstanding, yet most interventions focus on autistic individuals. We present NeuroWise, a multi-agent LLM-based coaching system that supports neurotypical users through stress visualization, interpretation of internal experiences, and contextual guidance. In a between-subjects study (N=30), NeuroWise was rated as helpful by all participants and showed a significant condition-time effect on deficit-based attributions (p=0.02): NeuroWise users reduced deficit framing, while baseline users shifted toward blaming autistic "deficits" after difficult interactions. NeuroWise users also completed conversations more efficiently (37% fewer turns, p=0.03). These findings suggest that AI-based interpretation can support attributional change by helping users recognize communication challenges as mutual.

WANSpec: Leveraging Global Compute Capacity for LLM Inference

Authors: Noah Martin, Fahad Dogar

2026-02-21

http://arxiv.org/abs/2602.18931v1

Data centers capable of running large language models (LLMs) are spread across the globe. Some have high-end GPUs for running the most advanced models (100B+ parameters), and others are only suitable for smaller models (1B parameters). The most capable GPUs are under high demand thanks to the rapidly expanding applications of LLMs. Choosing the right location to run an LLM inference workload can have consequences on the latency of requests due to these high demands. In this work, we explore options to shift some aspects of inference to the under-utilized data centers. We first observe the varying delays affecting inference in AWS services from different regions, demonstrating that load is not spread evenly. We then introduce WANSpec, which offloads part of LLM generation to the under-utilized data centers. In doing so, WANSpec can mitigate capacity issues as well as effectively use on-site compute (i.e., at universities) to augment cloud providers. This is done with speculative decoding, a widely used technique to speed up auto-regressive decoding, by moving the draft model to the under-utilized compute resources. Our experiments in simulation and cloud deployments show that WANSpec can judiciously employ redundancy to avoid increases in latency while still reducing the forward passes of speculative decoding's draft model in high-demand data centers by over 50%.
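
The accept/reject core of standard speculative decoding, which is what lets the draft model live somewhere else entirely, fits in a few lines (toy distributions; resampling on rejection is omitted for brevity):

```python
import random

def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """One round of speculative-decoding acceptance: each drafted token t is
    kept with probability min(1, p_target(t)/p_draft(t)); the first rejection
    ends the round. In WANSpec the draft model would run in a remote,
    under-utilized data center. Resampling after rejection is omitted here."""
    accepted = []
    for t in draft_tokens:
        if rng.random() < min(1.0, target_probs[t] / draft_probs[t]):
            accepted.append(t)
        else:
            break
    return accepted

rng = random.Random(0)
draft = {"a": 0.5, "b": 0.5}
target = {"a": 1.0, "b": 0.0}     # target model always prefers "a"
out = speculative_step(draft, target, ["a", "a", "b"], rng)   # -> ["a", "a"]
```
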

Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning

Authors: Abhinaba Basu

2026-02-21

http://arxiv.org/abs/2602.18922v1

Personal AI agents incur substantial cost via repeated LLM calls. We show that existing caching methods fail: GPTCache achieves 37.9% accuracy on real benchmarks; APC achieves 0-12%. The root cause is optimizing for the wrong property: cache effectiveness requires cache consistency and precision, not classification accuracy. We observe that cache-hit evaluation reduces to clustering evaluation and apply V-measure decomposition to separate these on n=8,682 points across MASSIVE, BANKING77, CLINC150, and NyayaBench v2, our new 8,514-entry multilingual agentic dataset (528 intents, 20 W5H2 classes, 63 languages). We introduce W5H2, a structured intent decomposition framework. Using SetFit with 8 examples per class, W5H2 achieves 91.1%+/-1.7% on MASSIVE in ~2ms, versus 37.9% for GPTCache and 68.8% for a 20B-parameter LLM at 3,447ms. On NyayaBench v2 (20 classes), SetFit achieves 55.3%, with cross-lingual transfer across 30 languages. Our five-tier cascade handles 85% of interactions locally, projecting a 97.5% cost reduction. We provide risk-controlled selective prediction guarantees via RCPS with nine bound families.
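
The V-measure decomposition the abstract leans on is the harmonic mean of homogeneity and completeness; it can be computed from scratch with two conditional entropies:

```python
import math
from collections import Counter

def v_measure(labels_true, labels_pred):
    """V-measure = harmonic mean of homogeneity (each predicted cluster holds
    one true class) and completeness (each true class lands in one cluster)."""
    def entropy(xs):
        n = len(xs)
        return -sum(c / n * math.log(c / n) for c in Counter(xs).values())
    def cond_entropy(a, b):              # H(a | b)
        n = len(a)
        return sum(len(grp := [av for av, x in zip(a, b) if x == bv]) / n
                   * entropy(grp) for bv in set(b))
    h_c, h_k = entropy(labels_true), entropy(labels_pred)
    hom = 1.0 if h_c == 0 else 1 - cond_entropy(labels_true, labels_pred) / h_c
    com = 1.0 if h_k == 0 else 1 - cond_entropy(labels_pred, labels_true) / h_k
    return 2 * hom * com / (hom + com) if hom + com else 0.0

v_perfect = v_measure([0, 0, 1, 1], [1, 1, 0, 0])   # relabeled but perfect
v_trivial = v_measure([0, 0, 1, 1], [0, 0, 0, 0])   # everything in one cluster
```

The second case shows why the decomposition matters: a single giant cluster is complete but not homogeneous, so its V-measure collapses to 0 even though a naive "accuracy" could look acceptable.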

DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation

Authors: Yangchen Zeng

2026-02-21

http://arxiv.org/abs/2602.18907v1

Recent generative recommendation frameworks have demonstrated remarkable scaling potential by reformulating item prediction as autoregressive Semantic ID (SID) generation. However, existing methods primarily rely on shallow behavioral signals, encoding items solely through surface-level textual features such as titles and descriptions. This reliance results in a critical Shallow Interest problem: the model fails to capture the latent, semantically rich interests underlying user interactions, limiting both personalization depth and recommendation interpretability. To address this, DeepInterestGR introduces three key innovations: (1) Multi-LLM Interest Mining (MLIM): we leverage multiple frontier LLMs along with their multi-modal variants to extract deep textual and visual interest representations through Chain-of-Thought prompting. (2) Reward-Labeled Deep Interest (RLDI): we employ a lightweight binary classifier to assign reward labels to mined interests, enabling effective supervision signals for reinforcement learning. (3) Interest-Enhanced Item Discretization (IEID): the curated deep interests are encoded into semantic embeddings and quantized into SID tokens via RQ-VAE. We adopt a two-stage training pipeline: supervised fine-tuning aligns the generative model with deep interest signals and collaborative filtering patterns, followed by reinforcement learning with GRPO optimized by our Interest-Aware Reward. Experiments on three Amazon Review benchmarks demonstrate that DeepInterestGR consistently outperforms state-of-the-art baselines across HR@K and NDCG@K metrics.
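
The residual-quantization step that turns an embedding into SID tokens can be sketched directly; the codebooks here are hand-made rather than learned as in RQ-VAE:

```python
import numpy as np

def rq_encode(x, codebooks):
    """Residual quantization sketch: at each level, snap the residual to the
    nearest codebook entry; the chosen indices form the Semantic ID (SID)."""
    sid, residual = [], x.astype(float)
    for cb in codebooks:
        d = ((cb - residual) ** 2).sum(axis=1)   # distance to each code
        idx = int(np.argmin(d))
        sid.append(idx)
        residual = residual - cb[idx]            # quantize what remains
    return sid, residual

cb0 = np.array([[0.0, 0.0], [1.0, 0.0]])         # coarse codebook
cb1 = np.array([[0.0, 0.0], [0.0, 0.5]])         # refinement codebook
sid, res = rq_encode(np.array([1.0, 0.5]), [cb0, cb1])   # sid == [1, 1]
```
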

Could Large Language Models work as Post-hoc Explainability Tools in Credit Risk Models?

Authors: Wenxi Geng, Dingyuan Liu, Liya Li, Yiqing Wang

2026-02-21

http://arxiv.org/abs/2602.18895v1

Post-hoc explainability is central to credit risk model governance, yet widely used tools such as coefficient-based attributions and SHapley Additive exPlanations (SHAP) often produce numerical outputs that are difficult to communicate to non-technical stakeholders. This paper investigates whether large language models (LLMs) can serve as post-hoc explainability tools for credit risk predictions through in-context learning, focusing on two roles: translators and autonomous explainers. Using a personal lending dataset from LendingClub, we evaluate three commercial LLMs: GPT-4-turbo, Claude Sonnet 4, and Gemini-2.0-Flash. Results provide strong evidence for the translator role. In contrast, autonomous explanations show low alignment with model-based attributions. Few-shot prompting improves feature alignment for logistic regression but does not consistently benefit XGBoost, suggesting that LLMs have limited capacity to recover non-linear, interaction-driven reasoning from prompt cues alone. Our findings position LLMs as effective narrative interfaces grounded in auditable model attributions, rather than as substitutes for post-hoc explainers in credit risk model governance. Practitioners should leverage LLMs to bridge the communication gap between complex model outputs and regulatory or business stakeholders, while preserving the rigor and traceability required by credit risk governance frameworks.
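
The translator role reduces to mapping attribution values into stakeholder-readable prose; a deterministic version of that mapping (feature names below are made up) looks like:

```python
def narrate(attributions, top=2):
    """Translator-role sketch: turn model-based attributions (e.g. SHAP
    values) into a sentence, keeping only the largest-magnitude features."""
    ranked = sorted(attributions.items(), key=lambda kv: -abs(kv[1]))[:top]
    parts = [f"{name} {'raised' if v > 0 else 'lowered'} the predicted risk"
             for name, v in ranked]
    return "; ".join(parts) + "."

text = narrate({"debt_to_income": 0.31, "credit_age": -0.12, "inquiries": 0.05})
```

An LLM in the translator role would produce a more fluent version of the same content, but, as the paper argues, it stays grounded in (and auditable against) the numeric attributions.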

SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

Authors: Mohammad Asim, Christopher Wewer, Jan Eric Lenssen

2026-02-21

http://arxiv.org/abs/2602.18882v1

We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation-invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by a lightweight rectified flow decoder. We show that the compression is 1-3 orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including ones deviating from the input trajectory, and we show that the decoder gracefully handles uncertainty. Finally, the highly compressed set of unstructured latent scene tokens enables simple and efficient scene generation in 5 seconds, achieving a much better quality-speed trade-off than previous paradigms.

ABD: Default-Exception Abduction in Finite First-Order Worlds

Authors: Serafim Batzoglou

2026-02-21

http://arxiv.org/abs/2602.18843v1

We introduce ABD, a benchmark for default-exception abduction over finite first-order worlds. Given a background theory with an abnormality predicate and a set of relational structures, a model must output a first-order formula that defines exceptions, restoring satisfiability while keeping exceptions sparse. We formalize three observation regimes (closed-world, existential completion, universal completion) with exact SMT verification. Evaluating ten frontier LLMs on 600 instances, we find that the best models achieve high validity but parsimony gaps remain, and holdout evaluation reveals distinct generalization failure modes across regimes.
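
Over a finite domain, verifying an abduced exception definition reduces to brute-force model checking; the classic "birds fly unless abnormal" default below is a toy illustration, not a benchmark instance (the benchmark itself uses exact SMT verification):

```python
from itertools import product

def holds(domain, formula):
    """Check a universally quantified formula over a finite domain by brute
    force; `formula` maps a variable assignment to bool."""
    arity = formula.__code__.co_argcount
    return all(formula(*xs) for xs in product(domain, repeat=arity))

# Toy theory: birds fly unless abnormal. With Ab(x) := penguin(x), the
# default "bird(x) and not Ab(x) -> flies(x)" is satisfiable again.
bird, penguin, flies = {0, 1}, {1}, {0}
ab = lambda x: x in penguin                       # the abduced exception
ok = holds([0, 1], lambda x: (x not in bird) or ab(x) or (x in flies))
```

Parsimony, the other axis the benchmark scores, would additionally penalize an exception predicate that marks more individuals abnormal than necessary.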

BiScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS

Authors: Omar Basit, Yunzhao Liu, Z. Jonny Kong, Y. Charlie Hu

2026-02-21

http://arxiv.org/abs/2602.18755v1

Prefill/decode disaggregation is increasingly adopted in LLM serving to improve the latency-throughput tradeoff and meet strict TTFT and TPOT SLOs. However, LLM inference remains energy-hungry: autoscaling alone is too coarse-grained to track fast workload fluctuations, and applying fine-grained DVFS under disaggregation is complicated by phase-asymmetric dynamics and coupling between provisioning and frequency control. We present BiScale, a two-tier energy optimization framework for disaggregated LLM serving. BiScale jointly optimizes placement and DVFS across prefill and decode using predictive latency and power models. At coarse timescales, BiScale computes phase-aware placement and baseline frequencies that minimize energy while satisfying SLO constraints. At fine timescales, BiScale dynamically adapts GPU frequency per iteration using stage-specific control: model predictive control (MPC) for prefill to account for queue evolution and future TTFT impact, and lightweight slack-aware adaptation for decode to exploit its smoother, memory-bound dynamics. This hierarchical design enables coordinated control across timescales while preserving strict latency SLOs. Evaluation on a 16x H100 cluster serving Llama 3.3 70B with production-style traces shows that BiScale meets TTFT/TPOT SLOs while reducing energy by up to 39% in prefill and 48% in decode relative to DistServe.
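
The slack-aware decode-side controller can be sketched as a simple hysteresis rule over TPOT slack; the frequency bins and thresholds below are illustrative, not BiScale's:

```python
def adapt_frequency(freqs, cur, tpot_ms, slo_ms, margin=0.1):
    """Slack-aware decode DVFS sketch: step the GPU down one frequency bin
    when measured TPOT has comfortable slack versus the SLO, and step up
    when it gets close to violating it."""
    i = freqs.index(cur)
    slack = (slo_ms - tpot_ms) / slo_ms
    if slack > 2 * margin and i > 0:
        return freqs[i - 1]          # plenty of slack: save energy
    if slack < margin and i < len(freqs) - 1:
        return freqs[i + 1]          # near SLO violation: speed up
    return cur

freqs = [900, 1200, 1500, 1800]      # MHz, hypothetical DVFS states
f_down = adapt_frequency(freqs, 1500, tpot_ms=20, slo_ms=50)   # big slack
f_up = adapt_frequency(freqs, 1500, tpot_ms=48, slo_ms=50)     # tight slack
```

Prefill needs the heavier MPC treatment precisely because this kind of memoryless rule ignores queue evolution and the TTFT impact of future arrivals.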

HillInfer: Efficient Long-Context LLM Inference on the Edge with Hierarchical KV Eviction using SmartSSD

Authors: He Sun, Li Li, Mingjun Xiao

2026-02-21

http://arxiv.org/abs/2602.18750v1

Deploying Large Language Models (LLMs) on edge devices such as PCs enables low-latency inference with strong privacy guarantees, but long-context inference is fundamentally constrained by limited memory and compute resources. Beyond model parameters, the KV cache becomes the dominant bottleneck due to its linear growth with context length. Although prior work exploits contextual sparsity to evict unimportant KV data, these approaches are largely designed for memory-rich platforms and incur prohibitive data transfer overhead when applied to resource-constrained edge devices with external storage. In this paper, we propose HillInfer, an importance-aware long-context LLM inference framework for the edge that leverages SmartSSD-assisted hierarchical KV cache management. HillInfer jointly manages KV cache pools across the CPU and SmartSSD, and performs in-storage importance evaluation to reduce unnecessary data movement. Furthermore, we design an adaptive, prefetch-based pipeline that overlaps computation and KV data transfer across GPU, CPU, and SmartSSD, minimizing end-to-end inference latency without sacrificing accuracy. We implement HillInfer on a PC with a commodity GPU, and experiments across multiple models and benchmarks demonstrate up to an 8.56x speedup over baselines while preserving model accuracy.
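
A common importance score for KV eviction is accumulated attention mass per cached position (as in H2O-style policies; HillInfer's exact in-storage scoring is not specified here):

```python
import numpy as np

def evict_kv(attn_history, keep):
    """Importance-aware KV eviction sketch: score each cached position by its
    accumulated attention mass over recent queries and keep the top-`keep`."""
    scores = attn_history.sum(axis=0)        # (queries, keys) -> per-key mass
    kept = np.argsort(-scores)[:keep]
    return np.sort(kept)

attn = np.array([[0.7, 0.1, 0.1, 0.1],       # toy attention rows over 4 keys
                 [0.6, 0.2, 0.1, 0.1],
                 [0.5, 0.1, 0.3, 0.1]])
kept = evict_kv(attn, keep=2)                # positions 0 and 2 survive
```

HillInfer's contribution is to evaluate such scores inside the SmartSSD, so that KV entries that would be evicted anyway never cross the storage bus.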

Compact Hadamard Latent Codes for Efficient Spectral Rendering

Authors: Jiaqi Yu, Dar'ya Guarnera, Giuseppe Claudio Guarnera

2026-02-21

http://arxiv.org/abs/2602.18741v1

Spectral rendering accurately reproduces wavelength-dependent appearance but is computationally expensive, as shading must be evaluated at many wavelength samples and scales roughly linearly with the number of samples. It also requires spectral textures and lights throughout the rendering pipeline. We propose Hadamard spectral codes, a compact latent representation that enables spectral rendering using standard RGB rendering operations. Spectral images are approximated with a small number of RGB rendering passes, followed by a decoding step. Our key requirement is latent linearity: scaling and addition in spectral space correspond to scaling and addition of codes, and the element-wise product of spectra (for example reflectance times illumination) is approximated by the element-wise product of their latent codes. We show that an exact low-dimensional algebra-preserving representation cannot exist for arbitrary spectra when the latent dimension k is smaller than the number of spectral samples n. We therefore introduce a learned non-negative linear encoder and decoder architecture that preserves scaling and addition exactly while encouraging approximate multiplicativity under the Hadamard product. With k = 6, we render k/3 = 2 RGB images per frame using an unmodified RGB renderer, reconstruct the latent image, and decode to high-resolution spectra or XYZ or RGB. Experiments on 3D scenes demonstrate that k = 6 significantly reduces color error compared to RGB baselines while being substantially faster than naive n-sample spectral rendering. Using k = 9 provides higher-quality reference results. We further introduce a lightweight neural upsampling network that maps RGB assets directly to latent codes, enabling integration of legacy RGB content into the spectral pipeline while maintaining perceptually accurate colors in rendered images.
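
The two algebraic properties split cleanly: linearity is exact for any linear encoder by construction, while multiplicativity can only be approximate when k < n. A quick numeric check with a random non-negative stand-in encoder (not the learned one) makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 31, 6                           # spectral samples -> latent dimension
E = np.abs(rng.normal(size=(k, n)))    # stand-in non-negative linear encoder

def encode(spectrum):
    return E @ spectrum

s1 = np.abs(rng.normal(size=n))        # toy reflectance spectrum
s2 = np.abs(rng.normal(size=n))        # toy illumination spectrum

# Linearity holds exactly: scaling and addition commute with encoding.
lin_ok = np.allclose(encode(2.0 * s1 + s2), 2.0 * encode(s1) + encode(s2))

# Multiplicativity (reflectance x illumination) generally does not, which is
# why the paper trains the encoder to make it approximately hold.
mult_gap = float(np.linalg.norm(encode(s1 * s2) - encode(s1) * encode(s2)))
```
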

HIME: Mitigating Object Hallucinations in LVLMs via Hallucination Insensitivity Model Editing

Authors: Ahmed Akl, Abdelwahed Khamis, Ali Cheraghian, Zhe Wang, Sara Khalifa, Kewen Wang

2026-02-21

http://arxiv.org/abs/2602.18711v1

Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal understanding capabilities, yet they remain prone to object hallucination, where models describe non-existent objects or attribute incorrect factual information, raising serious concerns for reliable real-world deployment. While fine-tuning is a commonly adopted mitigation strategy, its high computational cost and practical difficulty motivate the need for training-free alternatives, among which model editing has recently emerged as a promising direction. However, indiscriminate editing risks disrupting the rich implicit knowledge encoded in pre-trained LVLMs, leading to a fundamental question: how much intervention is necessary at each layer to suppress hallucinations while preserving pre-trained knowledge? To address this question, we present a systematic analysis of LVLM decoders built on three widely used large language model backbones (Qwen, LLaMA, and Vicuna), revealing clear layer-wise differences in susceptibility to object hallucination. Building on these insights, we introduce the Hallucination Insensitivity Score (HIS), a principled metric that quantifies each layer's sensitivity to hallucination and provides guidance for targeted intervention. Leveraging HIS, we propose Hallucination Insensitivity Model Editing (HIME), a simple yet effective layer-adaptive weight editing approach that selectively modifies latent features to suppress hallucinations while preserving pre-trained knowledge. Extensive experiments demonstrate that HIME reduces hallucinations by an average of 61.8% across open-ended generation benchmarks, including CHAIR, MME, and GPT-4V-aided evaluation, without introducing additional parameters, inference-time latency, or computational overhead.

Spilled Energy in Large Language Models

Authors: Adrian Robert Minut, Hazem Dewidar, Iacopo Masi

2026-02-21

http://arxiv.org/abs/2602.18671v1

We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference. This principled approach allows us to track "energy spills" during decoding, which we empirically show correlate with factual errors, biases, and failures. Similar to Orgad et al. (2025), our method localizes the exact answer token and subsequently tests for hallucinations. Crucially, however, we achieve this without requiring trained probe classifiers or activation ablations. Instead, we introduce two completely training-free metrics derived directly from output logits: spilled energy, which captures the discrepancy between energy values across consecutive generation steps that should theoretically match, and marginalized energy, which is measurable at a single step. Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalization. Notably, these results hold for both pretrained and instruction-tuned variants without introducing any training overhead.
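
Under the usual EBM reading of a softmax classifier, the free energy of a logit vector is the negative log-sum-exp; a spill score then compares energies at two steps that should agree (the pairing rule below is only a schematic of the paper's metric):

```python
import math

def energy(logits):
    """Free energy of a softmax classifier: E = -logsumexp(logits),
    computed with the standard max-shift for numerical stability."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(l - m) for l in logits)))

# A 'spilled energy'-style score: the discrepancy between energies at two
# generation steps that should theoretically match.
step_a = [5.0, 1.0, 0.0]
step_b = [2.0, 2.0, 2.0]
spill = abs(energy(step_a) - energy(step_b))
```

Because everything is computed from the logits the model already emits, the metric adds no training and no probing overhead, which is the paper's central selling point.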

Luna-2: Scalable Single-Token Evaluation with Small Language Models

Authors: Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel, Shuai Shao, Yash Sheth

2026-02-20

http://arxiv.org/abs/2602.18583v1

Real-time guardrails require evaluation that is accurate, cheap, and fast, yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation. We present Luna-2, a novel architecture that turns decoder-only small language models (SLMs) into a deterministic evaluation model that reliably computes complex task-specific LLMAJ metrics (e.g., toxicity, hallucination, tool selection quality) at accuracy on par with or higher than LLMAJ using frontier LLMs, while drastically reducing the cost and latency of computation. Each metric is implemented as a lightweight LoRA/PEFT head on top of a shared SLM backbone, enabling hundreds of specialized metrics to run concurrently on a single GPU, deployable locally next to AI systems in a privacy-preserving and latency-optimizing manner. Across content safety and hallucination benchmarks, Luna-2 matches the accuracy of state-of-the-art LLM-based evaluators while reducing inference cost by over 80x and latency by over 20x. In this paper, we outline the model architecture and training methodology, and report real-world empirical results on accuracy, latency, and throughput. In production, Luna-2 is protecting 100M+ AI sessions and processing over 100B tokens per month for our customers, with eval cost savings of over $30M annually.
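
The one-backbone-many-heads layout reduces to applying a per-metric low-rank update B @ A on top of a shared frozen weight; shapes and scales below are toy:

```python
import numpy as np

def lora_head(x, W, A, B):
    """One metric head: shared frozen weight W plus a per-metric low-rank
    update B @ A. Hundreds of (A, B) pairs can share a single backbone W."""
    return x @ (W + B @ A).T

rng = np.random.default_rng(0)
d, r = 16, 2                                     # hidden dim, LoRA rank
W = rng.normal(size=(d, d))                      # shared backbone projection
A = rng.normal(size=(r, d)) * 0.01               # metric-specific adapter
B = rng.normal(size=(d, r)) * 0.01
x = rng.normal(size=(1, d))
y_tox = lora_head(x, W, A, B)                    # e.g. a 'toxicity' head
```

Since each adapter pair stores only 2*d*r parameters instead of d*d, swapping metrics is cheap enough to keep many heads resident on one GPU.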

RPU -- A Reasoning Processing Unit

Authors: Matthew Adiletta, Gu-Yeon Wei, David Brooks

2026-02-20

http://arxiv.org/abs/2602.18568v1

Large language model (LLM) inference performance is increasingly bottlenecked by the memory wall. While GPUs continue to scale raw compute throughput, they struggle to deliver scalable performance for memory-bandwidth-bound workloads. This challenge is amplified by emerging reasoning LLM applications, where long output sequences, low arithmetic intensity, and tight latency constraints demand significantly higher memory bandwidth. As a result, system utilization drops and energy per inference rises, highlighting the need for an optimized system architecture for scalable memory bandwidth. To address these challenges we present the Reasoning Processing Unit (RPU), a chiplet-based architecture designed to address the challenges of the modern memory wall. RPU introduces: (1) a Capacity-Optimized High-Bandwidth Memory (HBM-CO) that trades capacity for lower energy and cost; (2) a scalable chiplet architecture featuring a bandwidth-first power and area provisioning design; and (3) a decoupled microarchitecture that separates memory, compute, and communication pipelines to sustain high bandwidth utilization. Simulation results show that RPU achieves up to 45.3x lower latency and 18.6x higher throughput than an H100 system at iso-TDP on Llama3-405B.

Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory

Authors: Vatsal Agarwal, Saksham Suri, Matthew Gwilliam, Pulkit Kumar, Abhinav Shrivastava

2026-02-20

http://arxiv.org/abs/2602.18434v1

Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.
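
The redundancy-reduction idea, dropping tokens that are near-duplicates of ones already kept, can be sketched with a greedy cosine-similarity filter. This is a hypothetical stand-in for the paper's adaptive selection strategy, not its actual algorithm.

```python
def cosine(u, v):
    """Cosine similarity between two dense token vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def select_tokens(tokens, sim_threshold=0.9):
    """Greedily keep a token only if it is dissimilar to everything already
    kept; redundant near-duplicates across frames are discarded. The
    threshold is an illustrative knob."""
    kept = []
    for t in tokens:
        if all(cosine(t, k) < sim_threshold for k in kept):
            kept.append(t)
    return kept

tokens = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]   # middle one is redundant
kept = select_tokens(tokens)
```

Keeping fewer but more diverse tokens per frame is what lets the token budget scale without the cache filling with repeats.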

SPQ: An Ensemble Technique for Large Language Model Compression

Authors: Jiamin Yao, Eren Gultepe

2026-02-20

http://arxiv.org/abs/2602.18420v1

This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD reduces attention projections into compact low-rank factors, and iii) 8-bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms the individual methods (SVD-only, pruning-only, or quantization-only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K. Compared to strong baselines like GPTQ and SparseGPT, SPQ offers competitive perplexity and accuracy while using less memory (6.86 GB vs. 7.16 GB for GPTQ). Moreover, SPQ improves inference throughput over GPTQ, achieving up to a 1.9x speedup, which further enhances its practicality for real-world deployment. SPQ's robust, layer-aware combination of complementary compression techniques may enable practical deployment of LLMs in memory-constrained environments. Code is available at: https://github.com/JiaminYao/SPQ_key_Compression/
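
The three stages can be sketched on toy matrices. Shapes, the retained-energy threshold, and the ordering below are illustrative choices of ours, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
W_attn = rng.normal(size=(64, 64))   # hypothetical attention projection
W_mlp = rng.normal(size=(64, 256))   # hypothetical MLP layer
X = rng.normal(size=(32, 64))        # calibration activations

# i) activation-based pruning: drop MLP neurons with the weakest mean output
act = np.abs(X @ W_mlp).mean(axis=0)
keep = np.argsort(act)[-128:]                 # keep the strongest half
W_mlp_pruned = W_mlp[:, keep]

# ii) variance-retained SVD: keep enough singular values for 90% energy
U, S, Vt = np.linalg.svd(W_attn, full_matrices=False)
energy = np.cumsum(S**2) / np.sum(S**2)
r = int(np.searchsorted(energy, 0.90)) + 1
A, B = U[:, :r] * S[:r], Vt[:r]               # W_attn is approximated by A @ B

# iii) post-training 8-bit quantization of the remaining linear weights
def quantize_int8(W):
    scale = np.abs(W).max() / 127.0
    q = np.round(W / scale).astype(np.int8)
    return q, scale

q, scale = quantize_int8(W_mlp_pruned)
W_deq = q.astype(np.float32) * scale          # dequantized view for inference
```

Each stage shrinks a different axis (neuron count, rank, bit width), which is why the ensemble can beat any single method at the same overall ratio.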

FedZMG: Efficient Client-Side Optimization in Federated Learning

Authors: Fotios Zantalis, Evangelos Zervas, Grigorios Koulouras

2026-02-20

http://arxiv.org/abs/2602.18384v1

Federated Learning (FL) enables distributed model training on edge devices while preserving data privacy. However, clients tend to have non-Independent and Identically Distributed (non-IID) data, which often leads to client drift, diminishing convergence speed and model performance. While adaptive optimizers have been proposed to mitigate these effects, they frequently introduce computational complexity or communication overhead unsuitable for resource-constrained IoT environments. This paper introduces Federated Zero Mean Gradients (FedZMG), a novel, parameter-free, client-side optimization algorithm designed to tackle client drift by structurally regularizing the optimization space. Advancing the idea of Gradient Centralization, FedZMG projects local gradients onto a zero-mean hyperplane, effectively neutralizing the "intensity" or "bias" shifts inherent in heterogeneous data distributions without requiring additional communication or hyperparameter tuning. A theoretical analysis is provided, proving that FedZMG reduces the effective gradient variance and guarantees tighter convergence bounds compared to standard FedAvg. Extensive empirical evaluations on EMNIST, CIFAR100, and Shakespeare datasets demonstrate that FedZMG achieves better convergence speed and final validation accuracy compared to the baseline FedAvg and the adaptive optimizer FedAdam, particularly in highly non-IID settings.
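
The core operation is a one-liner: subtracting the mean of the gradient components projects the gradient onto the zero-mean hyperplane. A minimal sketch:

```python
def zero_mean_project(grad):
    """Project a gradient onto the zero-mean hyperplane (components sum to 0),
    the Gradient Centralization-style step the abstract describes. Being an
    orthogonal projection, the result is the closest zero-mean vector to grad."""
    m = sum(grad) / len(grad)
    return [g - m for g in grad]

g = [3.0, 1.0, 2.0]
g_zm = zero_mean_project(g)   # the mean "intensity" shift is removed
```

Since the projection uses only the client's own gradient, it adds no communication and no hyperparameters, which is the point of the method.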

PRISM: Parallel Reward Integration with Symmetry for MORL

Authors: Finn van der Knaap, Kejiang Qian, Zheng Xu, Fengxiang He

2026-02-20

http://arxiv.org/abs/2602.18277v1

This work studies heterogeneous Multi-Objective Reinforcement Learning (MORL), where objectives can differ sharply in temporal frequency. Such heterogeneity allows dense objectives to dominate learning, while sparse long-horizon rewards receive weak credit assignment, leading to poor sample efficiency. We propose a Parallel Reward Integration with Symmetry (PRISM) algorithm that enforces reflectional symmetry as an inductive bias in aligning reward channels. PRISM introduces ReSymNet, a theory-motivated model that reconciles temporal-frequency mismatches across objectives, using residual blocks to learn a scaled opportunity value that accelerates exploration while preserving the optimal policy. We also propose SymReg, a reflectional equivariance regulariser that enforces agent mirroring and constrains policy search to a reflection-equivariant subspace. This restriction provably reduces hypothesis complexity and improves generalisation. Across MuJoCo benchmarks, PRISM consistently outperforms both a sparse-reward baseline and an oracle trained with full dense rewards, improving Pareto coverage and distributional balance: it achieves hypervolume gains exceeding 100% over the baseline and up to 32% over the oracle. The code is at https://github.com/EVIEHub/PRISM.

Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning

Authors: Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao Jin, Bang Yang, Yuexian Zou

2026-02-20

http://arxiv.org/abs/2602.18232v1

Recent work on test-time scaling for large language model (LLM) reasoning typically assumes that allocating more inference-time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. Motivated by this observation, we propose Thinking by Subtraction, a confidence-driven contrastive decoding approach that improves reasoning reliability through targeted token-level intervention. Our method, Confidence-Driven Contrastive Decoding (CCD), detects low-confidence tokens during decoding and intervenes selectively at these positions. It constructs a contrastive reference by replacing high-confidence tokens with minimal placeholders, and refines predictions by subtracting this reference distribution at low-confidence locations. Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal decoding overhead. As a training-free method, CCD enhances reasoning reliability through targeted low-confidence intervention without computational redundancy. Our code will be made available at: https://github.com/bolo-web/CCD.
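
The gating-plus-subtraction step can be sketched at the logit level. The confidence gate, `alpha`, and `tau` below are illustrative knobs of ours; the paper's construction of the reference distribution is more involved.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def contrastive_adjust(logits, ref_logits, alpha=1.0, tau=0.5):
    """If the step's max probability falls below tau (a low-confidence step),
    subtract the contrastive reference's logits; otherwise leave the step
    untouched. Hypothetical sketch of confidence-gated contrastive decoding."""
    conf = max(softmax(logits))
    if conf >= tau:
        return logits
    return [l - alpha * r for l, r in zip(logits, ref_logits)]

logits = [1.0, 1.1, 0.9]    # near-uniform: a low-confidence step
ref = [0.0, 1.0, 0.0]       # reference distribution favours the middle token
adjusted = contrastive_adjust(logits, ref)
```

High-confidence steps pass through unchanged, which is why the intervention adds almost no decoding overhead.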

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

Authors: Xiuying Wei, Caglar Gulcehre

2026-02-20

http://arxiv.org/abs/2602.18196v1

Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. However, we find a persistent failure mode: sparsifying a pretrained attention model to a dilated pattern leads to severe accuracy degradation. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once, then flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at dilation 16 and drops by about 2-3 points at dilation 64 on commonsense reasoning and LongBench tasks, respectively. Moreover, RAT+ outperforms attention when sparsified to top-k block attention. We further scale to 2.6B parameters and 200B tokens and observe the same trend.
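
The dilated pattern itself is easy to picture: with dilation D, a token attends only to every D-th earlier position, optionally plus a local window, so attended positions and KV reads shrink by roughly a factor of D. A sketch of the index set, with parameter names of our own choosing:

```python
def dilated_positions(i, D, window=0):
    """Causal dilated attention pattern: token i attends to every D-th earlier
    position, plus an optional local window of recent tokens. Returns the
    sorted set of attended positions."""
    strided = [j for j in range(i + 1) if (i - j) % D == 0]
    local = list(range(max(0, i - window), i + 1))
    return sorted(set(strided) | set(local))

# With dilation D=4, token 10 sees roughly a quarter of its prefix
pos = dilated_positions(10, D=4)
```

The failure mode the paper studies is what happens when a model pretrained with the full (dense) index set is suddenly restricted to this strided one at inference.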

SeedFlood: A Step Toward Scalable Decentralized Training of LLMs

Authors: Jihun Kim, Namhoon Lee

2026-02-20

http://arxiv.org/abs/2602.18181v1

This work presents a new approach to decentralized training, SeedFlood, designed to scale to large models across complex network topologies and achieve global consensus with minimal communication overhead. Traditional gossip-based methods suffer from message communication costs that grow with model size, while information decay over network hops renders global consensus inefficient. SeedFlood departs from these practices by exploiting the seed-reconstructible structure of zeroth-order updates, making messages near-zero in size and allowing them to be flooded to every client in the network. This mechanism makes communication overhead negligible and independent of model size, removing the primary scalability bottleneck in decentralized training. Consequently, SeedFlood enables training in regimes previously considered impractical, such as billion-parameter models distributed across hundreds of clients. Our experiments on decentralized LLM fine-tuning demonstrate that SeedFlood consistently outperforms gossip-based baselines in both generalization performance and communication efficiency, and even achieves results comparable to first-order methods at large scale.
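
The seed-reconstructible trick can be sketched as follows: instead of flooding a dense update, each message carries only a PRNG seed and a gradient scalar, and every receiver regenerates the same dense direction locally, in the spirit of seed-based zeroth-order methods. Names and the update rule below are our own illustrative choices.

```python
import random

def reconstruct_update(seed, grad_scalar, dim, lr=0.1):
    """Rebuild a zeroth-order update from a (seed, scalar) message. The
    perturbation direction z is regenerated from the shared seed, so the
    message size is independent of the model dimension."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return [-lr * grad_scalar * zi for zi in z]

# Sender and any flooded receiver derive the identical dense update
upd_a = reconstruct_update(seed=42, grad_scalar=0.5, dim=1000)
upd_b = reconstruct_update(seed=42, grad_scalar=0.5, dim=1000)
```

Because the message is two numbers regardless of `dim`, flooding it to every client costs essentially nothing, which is the scalability claim.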

The Statistical Signature of LLMs

Authors: Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli, Walter Quattrociocchi

2026-02-20

http://arxiv.org/abs/2602.18152v1

Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of online communication.
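
A minimal version of the measure: a lossless-compression ratio computed on surface text, where lower ratios indicate more recurrent statistical patterns. This is a generic proxy for the idea, not the paper's exact pipeline.

```python
import random
import zlib

def compression_ratio(text):
    """Lossless-compression ratio (compressed bytes / raw bytes): lower means
    more statistical regularity in the surface text."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)

recurrent = "the model said that the model said " * 40   # highly repetitive
rng = random.Random(0)
varied = "".join(chr(rng.randrange(33, 127)) for _ in range(1400))
```

Texts dominated by recurrent patterns compress far better than statistically varied ones, which is the separation the paper exploits.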

Cut Less, Fold More: Model Compression through the Lens of Projection Geometry

Authors: Olga Saukh, Dong Wang, Haris Šikić, Yun Cheng, Lothar Thiele

2026-02-20

http://arxiv.org/abs/2602.18116v1

Compressing neural networks without retraining is vital for deployment at scale. We study calibration-free compression through the lens of projection geometry: structured pruning is an axis-aligned projection, whereas model folding performs a low-rank projection via weight clustering. We formalize both as orthogonal operators and show that, within a rank distance of one, folding provably yields smaller parameter reconstruction error and, under mild smoothness assumptions, smaller functional perturbations than pruning. At scale, we evaluate >1000 checkpoints spanning ResNet18, PreActResNet18, ViT-B/32, and CLIP ViT-B/32 on CIFAR-10 and ImageNet-1K, covering diverse training hyperparameters (optimizers, learning rates, augmentations, regularization, sharpness-aware training), as well as multiple LLaMA-family 60M and 130M parameter models trained on C4. We show that folding typically achieves higher post-compression accuracy, with the largest gains at moderate-to-high compression ratios. The gap narrows, and occasionally reverses, at specific training setups. Our results position folding as a geometry-aware, calibration-free alternative to pruning that is often superior in practice and principled in theory.

Predict to Skip: Linear Multistep Feature Forecasting for Efficient Diffusion Transformers

Authors: Hanshuai Cui, Zhiqing Tang, Qianli Ma, Zhi Yao, Weijia Jia

2026-02-20

http://arxiv.org/abs/2602.18093v1

Diffusion Transformers (DiT) have emerged as a widely adopted backbone for high-fidelity image and video generation, yet their iterative denoising process incurs high computational costs. Existing training-free acceleration methods rely on feature caching and reuse under the assumption of temporal stability. However, reusing features for multiple steps may lead to latent drift and visual degradation. We observe that model outputs evolve smoothly along much of the diffusion trajectory, enabling principled predictions rather than naive reuse. Based on this insight, we propose PrediT, a training-free acceleration framework that formulates feature prediction as a linear multistep problem. We employ classical linear multistep methods to forecast future model outputs from historical information, combined with a corrector that activates in high-dynamics regions to prevent error accumulation. A dynamic step modulation mechanism adaptively adjusts the prediction horizon by monitoring the feature change rate. Together, these components enable substantial acceleration while preserving generation fidelity. Extensive experiments validate that our method achieves significant latency reduction across various DiT-based image and video generation models, while incurring negligible quality degradation.
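
A linear multistep predictor in its simplest two-step, scalar form, with a corrector that falls back to a real model evaluation when the feature changes too quickly. The scalar setting and the threshold are illustrative stand-ins for the paper's feature-level machinery.

```python
def predict_next(history, threshold=0.5):
    """Two-step linear extrapolation of a cached feature value. In a
    high-dynamics region (large recent change), return None to signal that
    the real model should be evaluated instead of the forecast."""
    f_prev, f_curr = history[-2], history[-1]
    if abs(f_curr - f_prev) > threshold:
        return None                   # corrector path: run the model
    return 2 * f_curr - f_prev        # predictor path: skip the model call

pred = predict_next([1.0, 1.2])       # smooth region: forecast, no model call
```

Each successful forecast replaces a full DiT forward pass, which is where the latency saving comes from; the corrector caps error accumulation.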

Joint Training on AMD and NVIDIA GPUs

Authors: Jon Hu, Thomas Jia, Jing Zhu, Zhendong Yu

2026-02-20

http://arxiv.org/abs/2602.18007v1

As large language models continue to scale, training demands on compute and system capacity grow rapidly, making single-vendor homogeneous clusters insufficient. This paper presents a technical solution for heterogeneous mixed training in AMD-NVIDIA environments. We first adopt a compatibility-oriented approach based on CPU-forwarding communication, with differentiated communication back-end selection across parallel groups and multi-NIC parallel data transfer. To achieve higher performance, we further propose a Device-Direct Communication approach, integrating a CPU-offloading P2P mechanism to enable direct cross-vendor GPU data transfer without host-memory staging. Experiments on LLaMA-8B and Qwen2-7B demonstrate that the proposed Device-Direct Communication approach achieves up to 98% of the throughput of an NVIDIA homogeneous system, while preserving training stability and correctness.

Asynchronous Heavy-Tailed Optimization

Authors: Junfei Sun, Dixi Yao, Xuchen Gong, Tahseen Rabbani, Manzil Zaheer, Tian Li

2026-02-20

http://arxiv.org/abs/2602.18002v1

Heavy-tailed stochastic gradient noise, commonly observed in large language models, can destabilize the optimization process. Recent works mainly focus on developing and understanding approaches to address heavy-tailed noise in the centralized or distributed, synchronous setting, leaving the interactions between such noise and asynchronous optimization underexplored. In this work, we investigate two schemes that handle stragglers with asynchronous updates in the presence of heavy-tailed gradient noise. We propose and theoretically analyze algorithmic modifications based on delay-aware learning rate scheduling and delay compensation to enhance the performance of asynchronous algorithms. Our convergence guarantees under heavy-tailed noise match the rate of the synchronous counterparts and improve delay tolerance compared with existing asynchronous approaches. Empirically, our approaches outperform prior synchronous and asynchronous methods in terms of accuracy/runtime trade-offs and are more robust to hyperparameters in both image and language tasks.
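
Delay-aware learning-rate scheduling can be sketched as damping the step size as a function of staleness: a gradient computed several steps ago gets a smaller step. The specific schedule below is a common illustrative choice, not necessarily the paper's rule.

```python
def delay_aware_lr(base_lr, delay, power=0.5):
    """Shrink the learning rate for stale asynchronous updates; delay is the
    number of steps between gradient computation and application. The
    polynomial decay exponent is an illustrative knob."""
    return base_lr / (1.0 + delay) ** power

# A fresh gradient takes the full step; a 3-step-stale one is halved
lr_fresh = delay_aware_lr(0.1, 0)
lr_stale = delay_aware_lr(0.1, 3)
```

Damping stale steps limits how much a straggler's outdated (and possibly heavy-tailed) gradient can perturb the current iterate.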

Turbo Connection: Reasoning as Information Flow from Higher to Lower Layers

Authors: Mohan Tang, Sidi Lu

2026-02-20

http://arxiv.org/abs/2602.17993v1

Complex problems, whether in math, logic, or planning, are solved by humans through a sequence of steps where the result of one step informs the next. In this work, we adopt the perspective that the reasoning power of Transformers is fundamentally limited by a fixed maximum number of steps along any latent path of computation. To address this, we introduce Turbo Connection (TurboConn), a novel architecture that overcomes the fixed-depth constraint by routing multiple residual connections from the higher-layer hidden states of each token to the lower layers of subsequent tokens. Fine-tuning pre-trained LLMs with our method not only yields accuracy gains of 0.9% to over 10% on benchmarks like GSM8K, Parity, and multi-step arithmetic, but also demonstrates that the density of these backward connections is critical: our dense interaction significantly outperforms sparser alternatives that only pass a single hidden state or vector. Notably, TurboConn can be integrated into pre-trained LLMs to overcome task-specific plateaus: while a fine-tuned Qwen-3-1.7B achieves only 53.78% on Parity, adding our architectural modification enables the model to reach 100% accuracy, all without retraining the full model from scratch or resorting to sophisticated curriculum learning. Our results provide strong empirical evidence that the depth of the computational path is a key factor in reasoning ability, and offer a new mechanism to enhance LLMs without significantly affecting generation latency.

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Authors: Zhan Liu, Changli Tang, Yuxin Wang, Zhiyuan Zhu, Youjun Chen, Yiwen Shao, Tianzi Wang, Lei Ke, Zengrui Jin, Chao Zhang

2026-02-20

http://arxiv.org/abs/2602.18527v1

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, enabling joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints and datasets will be released upon acceptance.

Graph-Neural Multi-Agent Coordination for Distributed Access-Point Selection in Cell-Free Massive MIMO

Authors: Mohammad Zangooei, Lou Salaün, Chung Shue Chen, Raouf Boutaba

2026-02-20

http://arxiv.org/abs/2602.17954v1

Cell-free massive MIMO (CFmMIMO) systems require scalable and reliable distributed coordination mechanisms to operate under stringent communication and latency constraints. A central challenge is the Access Point Selection (APS) problem, which seeks to determine the subset of serving Access Points (APs) for each User Equipment (UE) that can satisfy UEs' Spectral Efficiency (SE) requirements while minimizing network power consumption. We introduce APS-GNN, a scalable distributed multi-agent learning framework that decomposes APS into agents operating at the granularity of individual AP-UE connections. Agents coordinate via local observation exchange over a novel Graph Neural Network (GNN) architecture and share parameters to reuse their knowledge and experience. APS-GNN adopts a constrained reinforcement learning approach to provide agents with explicit observability of APS' conflicting objectives, treating SE satisfaction as a cost and power reduction as a reward. Both signals are defined locally, facilitating effective credit assignment and scalable coordination in large networks. To further improve training stability and exploration efficiency, the policy is initialized via supervised imitation learning from a heuristic APS baseline. We develop a realistic CFmMIMO simulator and demonstrate that APS-GNN delivers the target SE while activating 50-70% fewer APs than heuristic and centralized Multi-agent Reinforcement Learning (MARL) baselines in different evaluation scenarios. Moreover, APS-GNN achieves one to two orders of magnitude lower inference latency than centralized MARL approaches due to its fully parallel and distributed execution. These results establish APS-GNN as a practical and scalable solution for APS in large-scale CFmMIMO networks.

Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning

Authors: Narjes Nourzad, Carlee Joe-Wong

2026-02-20

http://arxiv.org/abs/2602.17931v1

In environments with sparse or delayed rewards, reinforcement learning (RL) incurs high sample complexity due to the large number of interactions needed for learning. This limitation has motivated the use of large language models (LLMs) for subgoal discovery and trajectory guidance. While LLMs can support exploration, frequent reliance on LLM calls raises concerns about scalability and reliability. We address these challenges by constructing a memory graph that encodes subgoals and trajectories from both LLM guidance and the agent's own successful rollouts. From this graph, we derive a utility function that evaluates how closely the agent's trajectories align with prior successful strategies. This utility shapes the advantage function, providing the critic with additional guidance without altering the reward. Our method relies primarily on offline input and only occasional online queries, avoiding dependence on continuous LLM supervision. Preliminary experiments in benchmark environments show improved sample efficiency and faster early learning compared to baseline RL methods, with final returns comparable to methods that require frequent LLM interaction.

MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance

Authors: Narjes Nourzad, Carlee Joe-Wong

2026-02-20

http://arxiv.org/abs/2602.17930v1

Reinforcement learning (RL) agents often suffer from high sample complexity in sparse- or delayed-reward settings due to limited prior structure. Large language models (LLMs) can provide subgoal decompositions, plausible trajectories, and abstract priors that facilitate early learning. However, heavy reliance on LLM supervision introduces scalability constraints and dependence on potentially unreliable signals. We propose MIRA (Memory-Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early training. The graph stores decision-relevant information, including trajectory segments and subgoal structures, and is constructed from both the agent's high-return experiences and LLM outputs. This design amortizes LLM queries into a persistent memory rather than requiring continuous real-time supervision. From this memory graph, we derive a utility signal that softly adjusts advantage estimation to influence policy updates without modifying the underlying reward function. As training progresses, the agent's policy gradually surpasses the initial LLM-derived priors, and the utility term decays, preserving standard convergence guarantees. We provide theoretical analysis showing that utility-based shaping improves early-stage learning in sparse-reward environments. Empirically, MIRA outperforms RL baselines and achieves returns comparable to approaches that rely on frequent LLM supervision, while requiring substantially fewer online LLM queries. Project webpage: https://narjesno.github.io/MIRA/

The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure

Authors: Yongzhong Xu

2026-02-19

http://arxiv.org/abs/2602.18523v1

Grokking -- the abrupt transition from memorization to generalization long after near-zero training loss -- has been studied mainly in single-task settings. We extend geometric analysis to multi-task modular arithmetic, training shared-trunk Transformers on dual-task (mod-add + mod-mul) and tri-task (mod-add + mod-mul + mod-sq) objectives across a systematic weight decay sweep. Five consistent phenomena emerge. (1) Staggered grokking order: multiplication generalizes first, followed by squaring, then addition, with consistent delays across seeds. (2) Universal integrability: optimization trajectories remain confined to an empirically invariant low-dimensional execution manifold; commutator defects orthogonal to this manifold reliably precede generalization. (3) Weight decay phase structure: grokking timescale, curvature depth, reconstruction threshold, and defect lead covary systematically with weight decay, revealing distinct dynamical regimes and a sharp no-decay failure mode. (4) Holographic incompressibility: final solutions occupy only 4-8 principal trajectory directions yet are distributed across full-rank weights and destroyed by minimal perturbations; SVD truncation, magnitude pruning, and uniform scaling all fail to preserve performance. (5) Transverse fragility and redundancy: removing less than 10% of orthogonal gradient components eliminates grokking, yet dual-task models exhibit partial recovery under extreme deletion, suggesting redundant center manifolds enabled by overparameterization. Together, these results support a dynamical picture in which multi-task grokking constructs a compact superposition subspace in parameter space, with weight decay acting as compression pressure and excess parameters supplying geometric redundancy in optimization pathways.

Dual Length Codes for Lossless Compression of BFloat16

Authors: Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer

2026-02-19

http://arxiv.org/abs/2602.17849v1

Training and serving Large Language Models (LLMs) relies heavily on parallelization and collective operations, which are frequently bottlenecked by network bandwidth. Lossless compression using, e.g., Huffman codes can alleviate the issue; however, Huffman codes suffer from slow, bit-sequential decoding and high hardware complexity due to deep tree traversals. Universal codes, e.g., Exponential-Golomb codes, are faster to decode but do not exploit symbol frequency distributions. To address these limitations, this paper introduces Dual Length Codes, a hybrid approach designed to balance compression efficiency with decoding speed. Analyzing BFloat16 tensors from the Gemma model, we observed that the top 8 most frequent symbols account for approximately 50% of the cumulative probability. These 8 symbols are assigned a short 4-bit code. The remaining 248 symbols are assigned a longer 9-bit code. The coding scheme uses a single prefix bit to distinguish between the two code lengths, and a small look-up table with only 8 entries for encoding and decoding. The scheme achieves a compressibility of 18.6%, compared to 21.3% for Huffman codes, but it significantly speeds up decoding and simplifies the hardware.
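
The scheme is concrete enough to sketch end-to-end: a 1-bit prefix selects between a 4-bit short code (3-bit index into the top-8 table) and a 9-bit long code. The top-8 symbol values below are placeholders, not the frequencies measured from Gemma, and for simplicity the long code carries the raw 8-bit symbol; bits are kept as a string rather than packed.

```python
TOP8 = [0x00, 0x80, 0x3F, 0xBF, 0x3E, 0xBE, 0x40, 0xC0]  # hypothetical table
INDEX = {s: i for i, s in enumerate(TOP8)}

def encode(symbols):
    """Prefix '0' + 3-bit table index (4 bits total) for a top-8 symbol,
    prefix '1' + the raw 8-bit symbol (9 bits total) otherwise."""
    bits = []
    for s in symbols:
        if s in INDEX:
            bits.append("0" + format(INDEX[s], "03b"))
        else:
            bits.append("1" + format(s, "08b"))
    return "".join(bits)

def decode(bits):
    """Read the prefix bit, then either a 3-bit index or an 8-bit symbol;
    no tree traversal, just a fixed-size read per symbol."""
    out, i = [], 0
    while i < len(bits):
        if bits[i] == "0":
            out.append(TOP8[int(bits[i + 1:i + 4], 2)]); i += 4
        else:
            out.append(int(bits[i + 1:i + 9], 2)); i += 9
    return out
```

The single prefix bit is what keeps decoding hardware-friendly: each symbol is resolved with one branch and one fixed-width read, unlike bit-sequential Huffman traversal.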

CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild

Authors: Balamurugan Thambiraja, Omid Taheri, Radek Danecek, Giorgio Becherini, Gerard Pons-Moll, Justus Thies

2026-02-19

http://arxiv.org/abs/2602.17770v1

Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to "in-the-wild" settings. Further, contemporary models and their training schemes struggle to capture animation fidelity with text-motion alignment. To address this, we (1) introduce '3D Hands in the Wild' (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) and state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture motion in the wild, CLUTCH employs SHIFT, a part-modality decomposed VQ-VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage, where CLUTCH is co-supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling. Code, data and models will be released.

Hardware-Aware Design of a GNN-Based Hit Filtering Algorithm for the Belle II Level-1 Trigger

Authors: Greta Heine, Fabio Mayer, Marc Neu, Jürgen Becker, Torben Ferber

2026-02-19

http://arxiv.org/abs/2602.17761v1

The Belle II experiment operates at high luminosity, where an increasing beam-induced background imposes stringent demands on the hardware Level-1 trigger system, which must operate under tight latency and bandwidth constraints. To achieve online data reduction within the Level-1 trigger system, we have developed a hit-filtering algorithm based on the lightweight Interaction Network architecture. In this work, we present a hardware-aware model-compression workflow for this hit-filtering algorithm targeting deployment on FPGA devices within the Belle II trigger system. The network is adapted to the detector and trigger conditions through model-size and graph-size reduction, low-precision (4-bit) fixed-point arithmetic, and unstructured pruning. We assess the resulting design using the total number of bit operations as a hardware-aware computational complexity metric. Using this metric, we identify a configuration that decreases this cost by more than two orders of magnitude relative to the full-precision reference implementation. This reduction is achieved while preserving performance close to the reference model in terms of hit efficiency and background rejection, as indicated by only a modest decrease in the AUC score from 97.4 to 96.8, evaluated on Belle II collision data.

Sink-Aware Pruning for Diffusion Language Models

Authors: Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen

2026-02-19

http://arxiv.org/abs/2602.17664v1

Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose a sink-aware pruning method that automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
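
The diagnostic above, measuring how the dominant sink position shifts across denoising timesteps, can be sketched as follows. The inputs are assumed per-step column-wise attention totals (how much attention each token receives at that step); this is an illustrative proxy, not the paper's exact variance statistic.

```python
def dominant_sink(attn_received):
    """Index of the token receiving the most total attention at one step."""
    return max(range(len(attn_received)), key=lambda i: attn_received[i])

def sink_shift_count(per_step_attention):
    """Count how often the dominant sink position changes across the
    generation trajectory; a high count suggests transient sinks that
    are safe to prune, unlike the stable anchors seen in AR models."""
    sinks = [dominant_sink(a) for a in per_step_attention]
    return sum(1 for a, b in zip(sinks, sinks[1:]) if a != b)
```

An AR-like trajectory where token 0 always dominates yields zero shifts, while a DLM-like trajectory with wandering sinks yields many, which is exactly the signal a sink-aware pruner can exploit.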

KLong: Training LLM Agent for Extremely Long-horizon Tasks

Authors: Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi

2026-02-19

http://arxiv.org/abs/2602.17547v1

This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains coherence between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL scheme, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.
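
One plausible reading of trajectory splitting, keep the early context intact and pair it with successive windows of later steps, can be sketched as below. The prefix length and window size are placeholder assumptions; the paper's exact splitting and overlap scheme is not specified here.

```python
def split_trajectory(steps, prefix_len, window):
    """Split a long trajectory into training samples that each keep the
    first `prefix_len` steps (early context) plus one sliding window of
    `window` later steps, so no sample exceeds the context budget."""
    prefix = steps[:prefix_len]
    samples = []
    for start in range(prefix_len, len(steps), window):
        samples.append(prefix + steps[start:start + window])
    return samples
```

Each resulting sample fits in a fixed context length while still seeing the trajectory's opening state, which is one way to keep sub-trajectories mutually coherent during SFT.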

Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models

Authors: Wen-Tse Chen, Jiayu Chen, Fahim Tajwar, Hao Zhu, Xintong Duan, Ruslan Salakhutdinov, Jeff Schneider

2026-02-19

http://arxiv.org/abs/2602.17497v1

Learning from self-sampled data and sparse environmental feedback remains a fundamental challenge in training self-evolving agents. Temporal credit assignment mitigates this issue by transforming sparse feedback into dense supervision signals. However, previous approaches typically depend on learning task-specific value functions for credit assignment, which suffer from poor sample efficiency and limited generalization. In this work, we propose to leverage pretrained knowledge from large language models (LLMs) to transform sparse rewards into dense training signals (i.e., the advantage function) through retrospective in-context learning (RICL). We further propose an online learning framework, RICOL, which iteratively refines the policy based on the credit assignment results from RICL. We empirically demonstrate that RICL can accurately estimate the advantage function with limited samples and effectively identify critical states in the environment for temporal credit assignment. Extended evaluation on four BabyAI scenarios shows that RICOL achieves convergent performance comparable to traditional online RL algorithms, with significantly higher sample efficiency. Our findings highlight the potential of leveraging LLMs for temporal credit assignment, paving the way for more sample-efficient and generalizable RL paradigms.
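
To make the target quantity concrete: the advantage function turns a sparse (mostly-zero) reward sequence into a per-step signal A_t = G_t - V(s_t). RICL asks an LLM to estimate this retrospectively from context; the numeric sketch below only shows what is being estimated, with supplied baseline values as an assumption.

```python
def dense_advantages(rewards, values, gamma=0.99):
    """Convert a sparse reward sequence into per-step advantages
    A_t = G_t - V(s_t), where G_t is the discounted return-to-go
    and `values` are baseline state-value estimates."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return [g - v for g, v in zip(returns, values)]
```

With a single terminal reward, every earlier step inherits a discounted share of it, which is the dense supervision that makes policy refinement sample-efficient.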

Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian

Authors: Pietro Ferrazzi, Mattia Franzin, Alberto Lavelli, Bernardo Magnini

2026-02-19

http://arxiv.org/abs/2602.17475v1

Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks spanning Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers a strong lower-resource alternative. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian hospital, and 175M words from various sources that we used for continual pre-training.
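
Constraint decoding, one of the inference-time strategies above, restricts each generation step to a valid output set. The toy function below shows the per-step mask for a flat allowed set; real constrained decoding (e.g., over a label schema or grammar) generalizes this, and the vocabulary and scores here are invented for illustration.

```python
def constrained_argmax(logits, vocab, allowed):
    """One constrained decoding step: return the highest-scoring token
    whose surface form is in `allowed`, ignoring all other tokens.
    Returns None if no candidate is permitted."""
    candidates = [(score, tok) for tok, score in zip(vocab, logits) if tok in allowed]
    if not candidates:
        return None
    return max(candidates)[1]
```

For classification-style clinical tasks this guarantees well-formed outputs (e.g., only valid entity labels), which is why it pairs well with few-shot prompting as a no-training alternative to fine-tuning.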

The CTI Echo Chamber: Fragmentation, Overlap, and Vendor Specificity in Twenty Years of Cyber Threat Reporting

Authors: Manuel Suarez-Roman, Francesco Marciori, Mauro Conti, Juan Tapiador

2026-02-19

http://arxiv.org/abs/2602.17458v1

Despite the high volume of open-source Cyber Threat Intelligence (CTI), our understanding of long-term threat actor-victim dynamics remains fragmented due to the lack of structured datasets and inconsistent reporting standards. In this paper, we present a large-scale automated analysis of open-source CTI reports spanning two decades. We develop a high-precision, LLM-based pipeline to ingest and structure 13,308 reports, extracting key entities such as attributed threat actors, motivations, victims, reporting vendors, and technical indicators (IoCs and TTPs). Our analysis quantifies the evolution of CTI information density and specialization, characterizing patterns that relate specific threat actors to motivations and victim profiles. Furthermore, we perform a meta-analysis of the CTI industry itself. We identify a fragmented ecosystem of distinct silos where vendors demonstrate significant geographic and sectoral reporting biases. Our marginal coverage analysis reveals that intelligence overlap between vendors is typically low: while a few core providers may offer broad situational awareness, additional sources yield diminishing returns. Overall, our findings characterize the structural biases inherent in the CTI ecosystem, enabling practitioners and researchers to better evaluate the completeness of their intelligence sources.
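
A marginal coverage analysis of the kind described can be computed greedily: rank vendors by how many new indicators each contributes beyond the sources already chosen. This sketch assumes each vendor is reduced to a set of indicator IDs, a simplification of the paper's actual methodology.

```python
def marginal_coverage(vendor_iocs):
    """Greedily order vendors by marginal gain, returning (vendor, new_count)
    pairs. Rapidly shrinking counts indicate low inter-vendor overlap and
    diminishing returns from adding more intelligence sources."""
    remaining = dict(vendor_iocs)
    seen, order = set(), []
    while remaining:
        vendor = max(remaining, key=lambda v: len(remaining[v] - seen))
        gain = len(remaining[vendor] - seen)
        if gain == 0:
            break  # every leftover vendor only repeats known indicators
        order.append((vendor, gain))
        seen |= remaining.pop(vendor)
    return order
```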

Preserving Historical Truth: Detecting Historical Revisionism in Large Language Models

Authors: Francesco Ortu, Joeun Yook, Punya Syon Pandey, Keenan Samway, Bernhard Schölkopf, Alberto Cazzaniga, Rada Mihalcea, Zhijing Jin

2026-02-19

http://arxiv.org/abs/2602.17433v2

Large language models (LLMs) are increasingly used as sources of historical information, motivating the need for scalable audits on contested events and politically charged narratives in settings that mirror real user interactions. We introduce HistoricalMisinfo, a curated dataset of contested events from various countries, each paired with a factual reference narrative and a documented revisionist reference narrative. To approximate real-world usage, we instantiate each event in prompt scenarios that reflect common LLM usage settings (e.g., questions, textbooks, social posts, policy briefs). Using an LLM-as-a-judge protocol that compares model outputs to the two references, we evaluate LLMs varying across model architectures in two conditions: (i) neutral user prompts that ask for factually accurate information, and (ii) robustness prompts in which the user explicitly requests the revisionist version of the event. Under neutral prompts, models are generally closer to factual references, though the resulting scores should be interpreted as reference-alignment signals rather than definitive evidence of human-interpretable revisionism. Robustness prompting yields a strong and consistent effect: when the user requests the revisionist narrative, all evaluated models show sharply higher revisionism scores, indicating limited resistance or self-correction. HistoricalMisinfo provides a practical foundation for benchmarking robustness to revisionist framing and for guiding future work on more precise automatic evaluation of contested historical claims, to ensure a sustainable integration of AI systems within society. Our code is available at https://github.com/francescortu/PreservingHistoricalTruth
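
The two-reference comparison can be illustrated with a crude lexical proxy: score an output by how much closer it sits to the factual reference than to the revisionist one. The paper uses an LLM judge rather than token overlap; this stand-in only shows the shape of the protocol.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def alignment_score(output, factual_ref, revisionist_ref):
    """Positive when the model output is lexically closer to the factual
    reference than to the revisionist one; negative otherwise. A real
    LLM-as-a-judge comparison replaces this lexical heuristic."""
    return jaccard(output, factual_ref) - jaccard(output, revisionist_ref)
```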

Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study

Authors: Dylan Bouchard, Mohit Singh Chauhan, Viren Bajaj, David Skarbrevik

2026-02-19

http://arxiv.org/abs/2602.17431v1

Uncertainty quantification has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find that (1) claim-response entailment consistently performs better than or on par with more complex claim-level scorers, (2) claim-level scoring generally yields better results than sentence-level scoring, and (3) uncertainty-aware decoding is highly effective for improving the factuality of long-form outputs. Our framework clarifies relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.
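
The third stage of the taxonomy, response-level aggregation, combines per-claim consistency scores into one uncertainty value for the whole output. The sketch below shows two simple aggregators as an assumption about what such a stage might look like; the paper's actual scorer families are more elaborate.

```python
def response_uncertainty(claim_scores, aggregation="mean"):
    """Aggregate per-claim consistency scores (1 = claim always supported
    across sampled responses, 0 = never) into a response-level uncertainty.
    'mean' averages claim confidence; 'min' is pessimistic, letting a
    single unsupported claim dominate."""
    if not claim_scores:
        return 0.0
    if aggregation == "mean":
        confidence = sum(claim_scores) / len(claim_scores)
    elif aggregation == "min":
        confidence = min(claim_scores)
    else:
        raise ValueError(f"unknown aggregation: {aggregation}")
    return 1.0 - confidence
```

The choice of aggregator matters: a mean hides isolated hallucinated claims that a min-style rule would flag, which is one of the design axes the taxonomy makes explicit.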