2026-02-27
Table of Contents
- Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems
- Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
- InnerQ Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
- Efficient Real-Time Adaptation of ROMs for Unsteady Flows Using Data Assimilation
- MetaOthello A Controlled Study of Multiple World Models in Transformers
- A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
- Modality Collapse as Mismatched Decoding Information-Theoretic Limits of Multimodal LLMs
- MaRI Accelerating Ranking Model Inference via Structural Re-parameterization in Large Scale Recommendation System
- Locally Adaptive Decay Surfaces for High-Speed Face and Landmark Detection with Event Cameras
- Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design
- MoDora Tree-Based Semi-Structured Document Analysis System
- LLMServingSim 2.0 A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure
- FactGuard Agentic Video Misinformation Detection via Reinforcement Learning
- ToProVAR Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization
- Rejection Mixing Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference
- TARAZ Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models
- Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching
- Optimizing SSD-Resident Graph Indexing for High-Throughput Vector Search
- Natural Language Declarative Prompting (NLD-P) A Modular Governance Method for Prompt Design Under Model Drift
- Dynamic Hierarchical Birkhoff-von Neumann Decomposition for All-to-All GPU Communication
- ProjFlow Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control
- Generative Recommendation for Large-Scale Advertising
- HulluEdit Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models
- Reinforcing Real-world Service Agents Balancing Utility and Cost in Task-oriented Dialogue
- U-Net-Based Generative Joint Source-Channel Coding for Wireless Image Transmission
- Accelerating LLM Pre-Training through Flat-Direction Dynamics Enhancement
- Denoising as Path Planning Training-Free Acceleration of Diffusion Models with DPCache
- Vectorizing the Trie Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
- SideQuest Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning
- FLYING SERVING On-the-Fly Parallelism Switching for Large Language Model Serving
- pQuant Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training
- Search-P1 Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
- GIFSplat Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
- Autoregressive Visual Decoding from EEG Signals
- Multilingual Safety Alignment Via Sparse Weight Editing
- SignVLA A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation
- Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns
- Beyond Dominant Patches Spatial Credit Redistribution For Grounded Vision-Language Models
- CCCL Node-Spanning GPU Collectives with CXL Memory Pooling
- Three-Dimensional Modified Klein--Gordon Oscillator in Standard and Generalized Doubly Special Relativity
- How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
- Decoder-based Sense Knowledge Distillation
- When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
- FlowCorrect Efficient Interactive Correction of Generative Flow Policies for Robotic Manipulation
- DHP Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism
- XStreamVGGT Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression
- Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
- Support Tokens, Stability Margins, and a New Foundation for Robust LLMs
- Send Less, Perceive More Masked Quantized Point Cloud Communication for Loss-Tolerant Collaborative Perception
- Multi-Layer Scheduling for MoE-Based LLM Reasoning
- AQR-HNSW Accelerating Approximate Nearest Neighbor Search via Density-aware Quantization and Multi-stage Re-ranking
- CADC Content Adaptive Diffusion-Based Generative Image Compression
- SEF-MAP Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction
- Duel-Evolve Reward-Free Test-Time Scaling via LLM Self-Preferences
- DualPath Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
- RAC Relation-Aware Cache Replacement for Large Language Models
Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems
Authors: Siyuan Liu, Jiahui Xu, Feng Jiang, Kuang Wang, Zefeng Zhao, Chu-Ren Huang, Jinghang Gu, Changqing Yin, Haizhou Li
2026-02-26
Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR--TTS pipelines follow a strictly sequential paradigm, requiring complete transcription and full reasoning before speech synthesis can begin, which results in high response latency. We propose the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency architecture that enables listen-while-thinking and speak-while-thinking. DDTSR is built upon three key mechanisms: (1) connective-guided small-large model synergy, where an auxiliary small model generates minimally committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel; (2) streaming-based cross-modal collaboration, which dynamically pipelines ASR, LLM inference, and TTS to advance the earliest speakable moment; and (3) curriculum-learning-based discourse continuity enhancement, which maintains coherence and logical consistency between early responses and subsequent reasoning outputs. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51% while preserving discourse quality. Further analysis shows that DDTSR functions as a plug-and-play module compatible with diverse LLM backbones, and remains robust across varying utterance lengths, indicating strong practicality and scalability for real-time spoken interaction.
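The dual-track idea — a fast auxiliary model emits a discourse connective immediately while the large model reasons in parallel — can be sketched with two threads. All names here (`dual_track_respond`, `connective_fn`, `reasoner_fn`) are illustrative stand-ins, not the paper's API:

```python
import threading
import queue
import time

def dual_track_respond(user_utterance, connective_fn, reasoner_fn):
    """Toy dual-track responder: the fast track speaks a connective right
    away while the slow track (main model) reasons concurrently."""
    out = queue.Queue()
    slow = threading.Thread(
        target=lambda: out.put(("answer", reasoner_fn(user_utterance))))
    slow.start()
    # Fast track: a minimally committal connective is speakable immediately.
    out.put(("connective", connective_fn(user_utterance)))
    slow.join()
    return [out.get() for _ in range(2)]

def connective_fn(utterance):
    return "Well,"                      # instant, minimally committal

def reasoner_fn(utterance):
    time.sleep(0.05)                    # stand-in for slow model reasoning
    return "the answer is 42."
```

Because the connective is enqueued microseconds after the slow thread starts, the listener hears speech long before reasoning completes, which is the latency win the abstract describes.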
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
Authors: Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao
2026-02-26
Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high pruning ratios. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
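Decay-based resizing of the screenshot history can be sketched as follows; the base resolution, decay factor `gamma`, and floor `min_scale` are illustrative values, not the paper's:

```python
def temporal_adaptive_sizes(history_len, base_hw=(1080, 1920),
                            gamma=0.5, min_scale=0.125):
    """Assign each historical screenshot a resolution that decays with age:
    the newest frame keeps full resolution, older frames shrink
    geometrically down to a floor (constants are hypothetical)."""
    sizes = []
    for age in range(history_len):          # age 0 = most recent frame
        scale = max(min_scale, gamma ** age)
        h, w = base_hw
        sizes.append((int(h * scale), int(w * scale)))
    return sizes
```

This mirrors the "fading memory" pattern: visual tokens spent on a frame fall off rapidly with its age in the trajectory.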
InnerQ Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Authors: Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross
2026-02-26
Reducing the hardware footprint of large language models (LLMs) during decoding is critical for efficient long-sequence generation. A key bottleneck is the key-value (KV) cache, whose size scales with sequence length and easily dominates the memory footprint of the model. Previous work proposed quantization methods focused on compressing the KV cache while maintaining its information. We introduce InnerQ, a hardware-aware KV-cache quantization scheme that lowers decoding latency without sacrificing accuracy. InnerQ applies group-wise quantization while grouping the KV matrices over their inner dimension. Unlike previous work that groups over the outer dimension, InnerQ aligns dequantization with the vector-matrix multiplication and enables scale-factor reuse across GPU compute units. This reduces memory accesses and accelerates dequantization, yielding speedups over both previous work and half-precision vector-matrix multiplication. To preserve fidelity under aggressive quantization, InnerQ incorporates (i) hybrid quantization, selecting symmetric or asymmetric quantization per group based on local statistics; (ii) high-precision windows for both the most recent tokens and the attention-sink tokens to mitigate outlier leakage; and (iii) per-channel normalization of the key cache, computed once during prefill and folded into the query to avoid runtime overhead. Our evaluation on Llama models shows that InnerQ maintains few-shot GSM8K performance comparable to non-quantized LLMs and surpasses prior quantization methods.
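Group-wise quantization along the inner (head) dimension — each contiguous channel group sharing one scale and zero-point — can be sketched in NumPy. This is an illustrative asymmetric variant, not InnerQ's exact kernel:

```python
import numpy as np

def quantize_inner_groups(K, group=64, bits=4):
    """Asymmetric group-wise quantization of a key matrix along its inner
    axis, so `group` contiguous channels share one scale/zero-point."""
    qmax = 2 ** bits - 1
    d = K.shape[-1]
    assert d % group == 0
    Kg = K.reshape(*K.shape[:-1], d // group, group)
    lo = Kg.min(axis=-1, keepdims=True)
    hi = Kg.max(axis=-1, keepdims=True)
    scale = (hi - lo) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # guard constant groups
    q = np.clip(np.round((Kg - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize_inner_groups(q, scale, lo, d):
    """Invert the mapping; in a real kernel this fuses with the matvec so
    one scale load serves a whole group."""
    Kg = q.astype(np.float32) * scale + lo
    return Kg.reshape(*q.shape[:-2], d)
```

Grouping over the inner dimension means the dequantization scales are reused across the dot-product reduction, which is the memory-access saving the abstract points to.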
Efficient Real-Time Adaptation of ROMs for Unsteady Flows Using Data Assimilation
Authors: Ismaël Zighed, Andrea Nóvoa, Luca Magri, Taraneh Sayadi
2026-02-26
We propose an efficient retraining strategy for a parameterized Reduced Order Model (ROM) that attains accuracy comparable to full retraining while requiring only a fraction of the computational time and relying solely on observations of the full system. The architecture employs an encode-process-decode structure: a Variational Autoencoder (VAE) to perform dimensionality reduction, and a transformer network to evolve the latent states and model the dynamics. The ROM is parameterized by an external control variable, the Reynolds number in the Navier-Stokes setting, with the transformer exploiting attention mechanisms to capture both temporal dependencies and parameter effects. The probabilistic VAE enables stochastic sampling of trajectory ensembles, providing predictive means and uncertainty quantification through the first two moments. After initial training on a limited set of dynamical regimes, the model is adapted to out-of-sample parameter regions using only observation data. Its probabilistic formulation naturally supports ensemble generation, which we employ within an ensemble Kalman filtering framework to assimilate data and reconstruct full-state trajectories from minimal observations. We further show that, for the dynamical system considered, the dominant source of error in out-of-sample forecasts stems from distortions of the latent manifold rather than changes in the latent dynamics. Consequently, retraining can be limited to the autoencoder, allowing for a lightweight, computationally efficient, real-time adaptation procedure with very little fine-tuning data.
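The assimilation step the abstract relies on is the standard stochastic ensemble Kalman filter (EnKF) analysis; a textbook sketch (not the paper's implementation) looks like this:

```python
import numpy as np

def enkf_update(X, y, H, R, rng):
    """Stochastic EnKF analysis step. X: (n_state, n_ens) forecast ensemble,
    y: observation vector, H: (n_obs, n_state) observation operator,
    R: observation-noise covariance."""
    n_ens = X.shape[1]
    A = X - X.mean(axis=1, keepdims=True)          # ensemble anomalies
    P = A @ A.T / (n_ens - 1)                      # sample covariance
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # Kalman gain
    # Perturb observations so the analysis ensemble keeps correct spread.
    Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, n_ens).T
    return X + K @ (Y - H @ X)                     # analysis ensemble
```

Here the trajectory ensembles sampled from the probabilistic VAE would play the role of `X`, letting sparse observations `y` pull the latent forecast back toward the true state.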
MetaOthello A Controlled Study of Multiple World Models in Transformers
Authors: Aviral Chawla, Galen Hall, Juniper Lovato
2026-02-26
Foundation models must handle multiple generative processes, yet mechanistic interpretability largely studies capabilities in isolation; it remains unclear how a single network organizes multiple, potentially conflicting "world models". Previous experiments on Othello-playing neural networks test world-model learning but focus on a single game with a single set of rules. We introduce MetaOthello, a controlled suite of Othello variants with shared syntax but different rules or tokenizations, and train small GPTs on mixed-variant data to study how multiple world models are organized in a shared representation space. We find that transformers trained on mixed-game data do not partition their capacity into isolated sub-models; instead, they converge on a mostly shared board-state representation that transfers causally across variants. Linear probes trained on one variant can intervene on another's internal state with effectiveness approaching that of matched probes. For isomorphic games with token remapping, representations are equivalent up to a single orthogonal rotation that generalizes across layers. When rules partially conflict, early layers maintain game-agnostic representations while a middle layer identifies game identity, and later layers specialize. MetaOthello offers a path toward understanding not just whether transformers learn world models, but how they organize many at once.
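Finding the single orthogonal rotation relating two variants' representations is the orthogonal Procrustes problem; a minimal sketch (activation matrices are stand-ins for probed hidden states):

```python
import numpy as np

def orthogonal_map(A, B):
    """Fit the orthogonal matrix R minimizing ||A R - B||_F (orthogonal
    Procrustes). A, B are (n_samples, d) activation matrices from the two
    game variants; R is the rotation relating their representations."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt
```

If representations of isomorphic variants really differ only by a rotation, the fitted `R` from one layer should also align other layers, which is the generalization the abstract reports.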
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Authors: Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall, Max Tegmark, Christian Schroeder de Witt, Mihaela van der Schaar, David Krueger
2026-02-26
Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, \textbf{decision-theoretic view of steganography}. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the \textbf{steganographic gap} -- a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
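One way to make the idea concrete, in the spirit of predictive $\mathcal{V}$-information (the notation is assumed for illustration; the paper's exact definitions may differ):

```latex
% Usable information of signal S for task T under agent class V with utility u
% (hypothetical notation, illustrating the decision-theoretic idea):
I_{\mathcal{V}}(S \to T)
  = \sup_{f \in \mathcal{V}} \mathbb{E}\!\left[ u\big(f(S), T\big) \right]
  - \sup_{f \in \mathcal{V}} \mathbb{E}\!\left[ u\big(f(\varnothing), T\big) \right]

% Steganographic gap: surplus usable information available to agents that can
% decode the hidden content (class V_key) over those that cannot (class V_0):
\Delta_{\mathrm{steg}}
  = I_{\mathcal{V}_{\mathrm{key}}}(S \to T) - I_{\mathcal{V}_{0}}(S \to T)
```

A strictly positive gap indicates that the signal carries content usable only by key-holding agents, i.e., steganography, without ever needing a reference distribution of clean signals.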
Modality Collapse as Mismatched Decoding Information-Theoretic Limits of Multimodal LLMs
Authors: Jayadev Billa
2026-02-26
Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every decoder layer (3--55% above chance in linear probes), yet removing 64--71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise.
We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder's scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the bottleneck is the decoder's scoring rule, not the encoder or projection. A LoRA intervention demonstrates the fix: training with an emotion objective improves emotion accessibility (7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible.
MaRI Accelerating Ranking Model Inference via Structural Re-parameterization in Large Scale Recommendation System
Authors: Yusheng Huang, Pengbo Xu, Shen Wang, Changxin Lao, Jiangxia Cao, Shuang Wen, Shuang Yang, Zhaojie Liu, Han Li, Kun Gai
2026-02-26
Ranking models, i.e., coarse-ranking and fine-ranking models, serve as core components in large-scale recommendation systems, responsible for scoring massive item candidates based on user preferences. To meet the stringent latency requirements of online serving, structural lightweighting or knowledge distillation techniques are commonly employed for ranking model acceleration. However, these approaches typically lead to a non-negligible drop in accuracy. Notably, the angle of lossless acceleration by optimizing feature fusion matrix multiplication, particularly through structural reparameterization, remains underexplored. In this paper, we propose MaRI, a novel Matrix Re-parameterized Inference framework, which serves as a complementary approach to existing techniques while accelerating ranking model inference without any accuracy loss. MaRI is motivated by the observation that user-side computation is redundant in feature fusion matrix multiplication, and we therefore adopt the philosophy of structural reparameterization to alleviate such redundancy.
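The redundancy is easy to see in code: when one user is scored against many items, the user-side block of the fused matmul is identical for every candidate. A minimal sketch of the reparameterization (illustrative shapes, not MaRI's production kernels):

```python
import numpy as np

def fused_score(u, items, W):
    """Naive fusion: concatenate user and item features, then project.
    Recomputes the user-side product for every candidate item."""
    return np.stack([np.concatenate([u, it]) @ W for it in items])

def reparam_score(u, items, W):
    """Re-parameterized form: split W into user/item blocks so the
    user-side product is computed once and shared across candidates."""
    du = u.shape[0]
    Wu, Wi = W[:du], W[du:]
    user_part = u @ Wu             # computed once per request
    return items @ Wi + user_part  # broadcast over the candidate set
```

Both forms are mathematically identical, so the acceleration is lossless; only the arithmetic is restructured.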
Locally Adaptive Decay Surfaces for High-Speed Face and Landmark Detection with Event Cameras
Authors: Paul Kielty, Timothy Hanley, Peter Corcoran
2026-02-26
Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms or globally-decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice creates a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion. We introduce Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics. Three strategies are explored, based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy. These adaptive schemes preserve detail in quiescent regions while reducing blur in regions of dense activity. Extensive experiments on public data show that LADS consistently improves both face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than either baseline, and at 240 Hz it mitigates the accuracy decline typically observed at higher frequencies, sustaining 2.44% normalized mean error for landmarks and 0.966 mAP50 in face detection. These high-frequency results even surpass the accuracy reported in prior works operating at 30 Hz, setting new benchmarks for event-based face analysis. Moreover, by preserving spatial structure at the representation stage, LADS supports the use of much lighter network architectures while still retaining real-time performance. These results highlight the importance of context-aware temporal integration for neuromorphic vision and point toward real-time, high-frequency human-computer interaction systems that exploit the unique advantages of event cameras.
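The event-rate variant can be sketched as a per-pixel exponential decay whose time constant shrinks where activity is dense. All constants (`tau_slow`, `tau_fast`, `k`) and the interpolation rule are hypothetical, chosen only to illustrate the adaptive mechanism:

```python
import numpy as np

def lads_event_rate(events, shape, t_now, tau_slow=0.1, tau_fast=0.01, k=5.0):
    """Event-rate flavored locally adaptive decay surface (illustrative):
    busy pixels decay fast (sharp edges under motion), quiet pixels decay
    slowly (structure preserved). `events` is a list of (t, x, y)."""
    count = np.zeros(shape)
    last_t = np.full(shape, -np.inf)
    for t, x, y in events:
        count[y, x] += 1
        last_t[y, x] = max(last_t[y, x], t)
    # Local decay constant interpolates from slow to fast with event rate.
    window = max(t_now - min(e[0] for e in events), 1e-9)
    rate = count / window
    tau = tau_slow + (tau_fast - tau_slow) * (1 - np.exp(-rate / k))
    surface = np.exp(-(t_now - last_t) / tau)
    surface[np.isinf(last_t)] = 0.0       # pixels that never fired
    return surface
```

Note the intended asymmetry: two pixels with equally recent events get different surface values if one has been far busier, since its faster decay suppresses motion blur.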
Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design
Authors: Zhuoliang Xie, Fei Liu, Zhenkun Wang, Qingfu Zhang
2026-02-26
The Capacitated Vehicle Routing Problem (CVRP), a fundamental combinatorial optimization challenge, focuses on optimizing fleet operations under vehicle capacity constraints. While extensively studied in operational research, the NP-hard nature of CVRP continues to pose significant computational challenges, particularly for large-scale instances. This study presents AILS-AHD (Adaptive Iterated Local Search with Automatic Heuristic Design), a novel approach that leverages Large Language Models (LLMs) to revolutionize CVRP solving. Our methodology integrates an evolutionary search framework with LLMs to dynamically generate and optimize ruin heuristics within the AILS method. Additionally, we introduce an LLM-based mechanism to enhance computational efficiency. Comprehensive experimental evaluations against state-of-the-art solvers, including AILS-II and HGS, demonstrate the superior performance of AILS-AHD across both moderate and large-scale instances. Notably, our approach establishes new best-known solutions for 8 out of 10 instances in the CVRPLib large-scale benchmark, underscoring the potential of LLM-driven heuristic design in advancing the field of vehicle routing optimization.
MoDora Tree-Based Semi-Structured Document Analysis System
Authors: Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He, Shihan Yu, Qianqian Xu, Bin Wang, Guoliang Li, Conghui He, Fan Wu
2026-02-26
Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts. These documents are widely observed across domains and account for a large portion of real-world data. However, existing methods struggle to support natural language question answering over these documents due to three main technical challenges: (1) The elements extracted by techniques like OCR are often fragmented and stripped of their original semantic context, making them inadequate for analysis. (2) Existing approaches lack effective representations to capture hierarchical structures within documents (e.g., associating tables with nested chapter titles) and to preserve layout-specific distinctions (e.g., differentiating sidebars from main content). (3) Answering questions often requires retrieving and aligning relevant information scattered across multiple regions or pages, such as linking a descriptive paragraph to table cells located elsewhere in the document.
To address these issues, we propose MoDora, an LLM-powered system for semi-structured document analysis. First, we adopt a local-alignment aggregation strategy to convert OCR-parsed elements into layout-aware components, and conduct type-specific information extraction for components with hierarchical titles or non-text elements. Second, we design the Component-Correlation Tree (CCTree) to hierarchically organize components, explicitly modeling inter-component relations and layout distinctions through a bottom-up cascade summarization process. Finally, we propose a question-type-aware retrieval strategy that supports (1) layout-based grid partitioning for location-based retrieval and (2) LLM-guided tree traversal for semantic-based retrieval. Experiments show MoDora outperforms baselines by 5.97%-61.07% in accuracy. The code is at https://github.com/weAIDB/MoDora.
LLMServingSim 2.0 A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure
Authors: Jaehong Cho, Hyunmin Choi, Guseul Heo, Jongse Park
2026-02-26
Large language model (LLM) serving infrastructures are undergoing a shift toward heterogeneity and disaggregation. Modern deployments increasingly integrate diverse accelerators and near-memory processing technologies, introducing significant hardware heterogeneity, while system software increasingly separates computation, memory, and model components across distributed resources to improve scalability and efficiency. As a result, serving performance is no longer determined by hardware or software choices in isolation, but by their runtime interaction through scheduling, data movement, and interconnect behavior. However, understanding these interactions remains challenging, as existing simulators lack the ability to jointly model heterogeneous hardware and disaggregated serving techniques within a unified, runtime-driven framework.
This paper presents LLMServingSim 2.0, a unified system-level simulator designed to make runtime-driven hardware-software interactions in heterogeneous and disaggregated serving infrastructures explicit and analyzable. LLMServingSim 2.0 embeds serving decisions and hardware behavior into a single runtime loop, enabling interaction-aware modeling of batching, routing, offloading, memory, and power. The simulator supports extensible integration of emerging accelerators and memory systems through profile-based modeling, while capturing dynamic serving behavior and system-level effects. We validate LLMServingSim 2.0 against real deployments, showing that it reproduces key performance, memory, and power metrics with an average error of 0.97%, while maintaining simulation times of around 10 minutes even for complex configurations. These results demonstrate that LLMServingSim 2.0 provides a practical bridge between hardware innovation and serving-system design, enabling systematic exploration and co-design for next-generation LLM serving infrastructures.
FactGuard Agentic Video Misinformation Detection via Reinforcement Learning
Authors: Zehao Li, Hongwei Yu, Hao Jiang, Qiang Sheng, Yilong Xu, Baolong Bi, Yang Li, Zhenlong Yuan, Yujun Cai, Zhaoqi Wang
2026-02-26
Multimodal large language models (MLLMs) have substantially advanced video misinformation detection through unified multimodal reasoning, but they often rely on fixed-depth inference and place excessive trust in internally generated assumptions, particularly in scenarios where critical evidence is sparse, fragmented, or requires external verification. To address these limitations, we propose FactGuard, an agentic framework for video misinformation detection that formulates verification as an iterative reasoning process built upon MLLMs. FactGuard explicitly assesses task ambiguity and selectively invokes external tools to acquire critical evidence, enabling progressive refinement of reasoning trajectories. To further strengthen this capability, we introduce a two-stage training strategy that combines domain-specific agentic supervised fine-tuning with decision-aware reinforcement learning to optimize tool usage and calibrate risk-sensitive decision making. Extensive experiments on FakeSV, FakeTT, and FakeVV demonstrate FactGuard's state-of-the-art performance and validate its excellent robustness and generalization capacity.
ToProVAR Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization
Authors: Jiayu Chen, Ruoyu Lin, Zihao Zheng, Jingxin Li, Maoliang Li, Guojie Luo, Xiang Chen
2026-02-26
Visual Autoregressive (VAR) models enhance generation quality but face a critical efficiency bottleneck in later stages. In this paper, we present a novel optimization framework for VAR models that fundamentally differs from prior approaches such as FastVAR and SkipVAR. Instead of relying on heuristic skipping strategies, our method leverages attention entropy to characterize the semantic projections across different dimensions of the model architecture. This enables precise identification of parameter dynamics under varying token granularity levels, semantic scopes, and generation scales. Building on this analysis, we further uncover patterns along three critical dimensions (token, layer, and scale) and propose a set of fine-grained optimization strategies tailored to these patterns. Extensive evaluation demonstrates that our approach achieves aggressive acceleration of the generation process while significantly preserving semantic fidelity and fine details, outperforming traditional methods in both efficiency and quality. Experiments on Infinity-2B and Infinity-8B models demonstrate that ToProVAR achieves up to 3.4x acceleration with minimal quality loss, effectively mitigating the issues found in prior work. Our code will be made publicly available.
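The attention-entropy signal the method builds on is just the Shannon entropy of attention rows; a minimal sketch of the statistic (how ToProVAR thresholds it is not specified here):

```python
import numpy as np

def attention_entropy(attn):
    """Mean Shannon entropy of attention rows. `attn` is (heads, q, k) with
    each row a probability distribution over keys. Low entropy = sharply
    focused heads; high entropy = diffuse attention."""
    p = np.clip(attn, 1e-12, 1.0)           # avoid log(0)
    return float((-p * np.log(p)).sum(axis=-1).mean())
```

Uniform attention over k keys gives the maximal entropy log k, while a one-hot row gives (numerically) zero, so profiling this quantity per token, layer, and scale separates diffuse regions, where pruning is cheap, from focused ones.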
Rejection Mixing Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference
Authors: Yushi Ye, Feng Hong, Huangjie Zheng, Xu Chen, Zhiyong Chen, Yanfeng Wang, Jiangchao Yao
2026-02-26
Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding. This stems from the ''combinatorial contradiction'' phenomenon, where parallel tokens form semantically inconsistent combinations. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependency. We propose ReMix (Rejection Mixing), a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state. This intermediate state allows a token's representation to be iteratively refined in a continuous space, resolving mutual conflicts with other tokens before collapsing into a final discrete sample. Furthermore, a rejection rule reverts uncertain representations from the continuous state back to the masked state for reprocessing, ensuring stability and preventing error propagation. ReMix thus mitigates combinatorial contradictions by enabling continuous-space refinement during discrete diffusion decoding. Extensive experiments demonstrate that ReMix, as a training-free method, achieves inference speedups without any quality degradation.
TARAZ Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models
Authors: Reihaneh Iranmanesh, Saeedeh Davoudi, Pasha Abrishamchian, Ophir Frieder, Nazli Goharian
2026-02-26
This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian's morphological complexity and semantic nuance. Our framework introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string matching. Through systematic evaluation of 15 state-of-the-art open- and closed-source models, we demonstrate that our hybrid evaluation improves scoring consistency by +10% compared to exact-match baselines by capturing meaning that surface-level methods cannot detect. We publicly release our evaluation framework, providing the first standardized benchmark for measuring cultural understanding in Persian and establishing a reproducible foundation for cross-cultural LLM evaluation research.
Accelerating Local LLMs on Resource-Constrained Edge Devices via Distributed Prompt Caching
Authors: Hiroki Matsutani, Naoki Matsuda, Naoto Sugiura
2026-02-26
Since local LLM inference on resource-constrained edge devices imposes a severe performance bottleneck, this paper proposes distributed prompt caching to enhance inference performance by cooperatively sharing intermediate processing states across multiple low-end edge devices. To fully utilize prompt similarity, our distributed caching mechanism also supports partial matching. As this approach introduces communication overhead associated with state sharing over a wireless network, we introduce a Bloom-filter-based data structure, referred to as a catalog, to determine whether a remote server possesses the desired internal states, thereby suppressing unnecessary communication. Experiments using the Gemma-3 270M model and the MMLU dataset on the Raspberry Pi Zero 2W platform demonstrate that the proposed approach reduces TTFT (Time to First Token) and TTLT (Time to Last Token) by 93.12% and 50.07% on average, respectively.
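A Bloom-filter catalog of this kind can be sketched in a few lines: before issuing a network request for a cached prompt state, a device tests the peer's catalog locally and skips requests that are guaranteed to miss. Parameters `m` and `k` and the class API are illustrative, not the paper's:

```python
import hashlib

class Catalog:
    """Toy Bloom-filter catalog of prompt prefixes a peer has cached.
    A negative answer is definitive; a positive answer may rarely be a
    false positive, costing only one wasted request."""
    def __init__(self, m=8192, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _hashes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, prompt_prefix):
        for h in self._hashes(prompt_prefix):
            self.bits |= 1 << h

    def maybe_has(self, prompt_prefix):
        return all(self.bits >> h & 1 for h in self._hashes(prompt_prefix))
```

Because the catalog is a fixed-size bit array, peers can exchange it cheaply over the wireless link and refresh it infrequently, which is what suppresses the unnecessary traffic.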
Optimizing SSD-Resident Graph Indexing for High-Throughput Vector Search
Authors: Weichen Zhao, Yuncheng Lu, Yao Tian, Hao Zhang, Jiehui Li, Minghao Zhao, Yakun Li, Weining Qian
2026-02-26
Graph-based approximate nearest neighbor search (ANNS) methods (e.g., HNSW) have become the de facto state of the art for their high precision and low latency. To scale beyond main memory, recent out-of-memory ANNS systems leverage SSDs to store large vector indexes. However, they still suffer from severe CPU underutilization and read amplification (i.e., storage stalls) caused by limited access locality during graph traversal. We present VeloANN, which mitigates storage stalls through a locality-aware data layout and a coroutine-based asynchronous runtime. VeloANN utilizes a hierarchical, affinity-based data placement scheme to co-locate related vectors within the same page, effectively reducing fragmentation and over-fetching. We further design a record-level buffer pool, where each record groups the neighbors of a vector; by persistently retaining hot records in memory, it eliminates excessive page swapping under constrained memory budgets. To minimize CPU scheduling overheads during disk I/O interruptions, VeloANN employs a coroutine-based asynchronous runtime for lightweight task scheduling. On top of this, it incorporates asynchronous prefetching and a beam-aware search strategy to prioritize cached data, ultimately improving overall search efficiency. Extensive experiments show that VeloANN outperforms state-of-the-art disk-based ANN systems by up to 5.8x in throughput and 3.25x in latency reduction, while achieving 0.92x the throughput of in-memory systems using only 10% of their memory footprint.
Natural Language Declarative Prompting (NLD-P) A Modular Governance Method for Prompt Design Under Model Drift
Authors: Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim
2026-02-26
The rapid evolution of large language models (LLMs) has transformed prompt engineering from a localized craft into a systems-level governance challenge. As models scale and update across generations, prompt behavior becomes sensitive to shifts in instruction-following policies, alignment regimes, and decoding strategies, a phenomenon we characterize as GPT-scale model drift. Under such conditions, surface-level formatting conventions and ad hoc refinement are insufficient to ensure stable, interpretable control. This paper reconceptualizes Natural Language Declarative Prompting (NLD-P) as a declarative governance method rather than a rigid field template. NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code. We define minimal compliance criteria, analyze model-dependent schema receptivity, and position NLD-P as an accessible governance framework for non-developer practitioners operating within evolving LLM ecosystems. Portions of drafting and editorial refinement employed a schema-bound LLM assistant configured under NLD-P. All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol. The paper concludes by outlining implications for declarative control under ongoing model evolution and identifying directions for future empirical validation.
Dynamic Hierarchical Birkhoff-von Neumann Decomposition for All-to-All GPU Communication
Authors: Yen-Chieh Wu, Cheng-Shang Chang, Duan-Shin Lee, H. Jonathan Chao
2026-02-26
All-to-all GPU communication is a critical bottleneck in large-scale training clusters, where completion time is constrained by per-port bandwidth and can be severely impacted by traffic skew across GPUs and network interface cards (NICs). This issue is amplified by the two-tier structure of modern GPU systems, which combine fast intra-server links with much slower inter-server networks. Motivated by recent system observations that highlight the importance of traffic reshaping and hierarchy awareness, we study all-to-all scheduling from an online switching and queueing-theoretic perspective.
We propose a dynamic hierarchical Birkhoff--von Neumann (BvN) decomposition framework tailored to two-tier GPU fabrics. At each frame boundary, traffic is first balanced within each server using simple local operations to mitigate micro-level GPU/NIC skew while preserving aggregate server-to-server demand. A hierarchical BvN decomposition is then applied at the server level and refined into GPU-level matchings, significantly reducing decomposition complexity relative to a flat GPU-level approach. By integrating this construction with the dynamic frame sizing (DFS) principle, we obtain an online scheduler with provable stability under admissible Poisson arrivals. Simulations demonstrate substantial reductions in mean frame length, particularly under server-localized hotspot traffic.
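The flat BvN building block underlying the hierarchical scheme can be sketched as follows: a doubly stochastic demand matrix is peeled into weighted permutation matrices, each permutation being one conflict-free all-to-all matching. This toy version brute-forces the supporting permutation, which only scales to tiny n; the paper's contribution is doing this hierarchically, which this sketch does not show.

```python
import itertools

def bvn_decompose(M, tol=1e-9):
    """Birkhoff-von Neumann: write a doubly stochastic matrix as a convex
    combination of permutation matrices. Each permutation is a
    conflict-free matching; its coefficient is the fraction of the frame
    spent on that matching."""
    n = len(M)
    M = [row[:] for row in M]          # work on a copy
    terms = []
    while True:
        # A permutation supported on the positive entries always exists
        # while mass remains (Birkhoff's theorem); brute force for demo n.
        perm = next((p for p in itertools.permutations(range(n))
                     if all(M[i][p[i]] > tol for i in range(n))), None)
        if perm is None:
            break
        coeff = min(M[i][perm[i]] for i in range(n))
        terms.append((coeff, perm))
        for i in range(n):
            M[i][perm[i]] -= coeff     # peel this matching off the demand
    return terms

# Doubly stochastic demand matrix (every row and column sums to 1).
M = [[0.50, 0.50, 0.00],
     [0.25, 0.25, 0.50],
     [0.25, 0.25, 0.50]]
terms = bvn_decompose(M)
assert abs(sum(c for c, _ in terms) - 1.0) < 1e-9   # coefficients sum to 1
```

Running the matchings back-to-back for time proportional to their coefficients serves the entire demand matrix without port conflicts.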
ProjFlow Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control
Authors: Akihisa Watanabe, Qing Yu, Edgar Simo-Serra, Kent Fujiwara
2026-02-26
Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce ProjFlow, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on representative applications, including motion inpainting and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.
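The role of the metric in constraint projection can be illustrated with a generic metric-weighted projection onto an affine set. The kinematics-aware metric itself is the paper's contribution; here it is replaced by an arbitrary diagonal matrix M, purely to show how the metric redistributes the correction.

```python
import numpy as np

def project_to_constraints(x, A, b, M):
    """Metric-weighted projection of x onto the affine set {z : A z = b}.

    With M = I this is the ordinary Euclidean projection; a non-trivial
    M spreads the correction toward the 'cheap' coordinates instead of
    moving only the constrained ones.
    """
    Minv_At = np.linalg.solve(M, A.T)
    correction = Minv_At @ np.linalg.solve(A @ Minv_At, A @ x - b)
    return x - correction

# Toy example: a 4-D "pose" with one linear constraint x0 + x1 = 1.
A = np.array([[1.0, 1.0, 0.0, 0.0]])
b = np.array([1.0])
x = np.array([0.2, 0.2, 0.5, 0.5])

x_euclid = project_to_constraints(x, A, b, np.eye(4))
M = np.diag([1.0, 10.0, 1.0, 1.0])   # make x1 "stiff": corrections avoid it
x_metric = project_to_constraints(x, A, b, M)

# Both projections satisfy the constraint exactly...
assert np.isclose(A @ x_euclid, b).all() and np.isclose(A @ x_metric, b).all()
# ...but under M most of the correction falls on the cheaper coordinate x0.
```

In the Euclidean case the correction splits evenly between x0 and x1; under M it lands almost entirely on x0, which is the mechanism a skeletal metric exploits to keep corrections natural.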
Generative Recommendation for Large-Scale Advertising
Authors: Ben Xue, Dan Liu, Lixiang Wang, Mingjie Sun, Peng Wang, Pengfei Zhang, Shaoyun Shi, Tianyu Xu, Yunhao Sha, Zhiqiang Liu, Bo Kong, Bo Wang, Hang Yang, Jieting Xue, Junhao Wang, Shengyu Wang, Shuping Hui, Wencai Ye, Xiao Lin, Yongzhi Li, Yuhang Chen, Zhihui Yin, Quan Chen, Shiyang Wen, Wenjin Wu, Han Li, Guorui Zhou, Changcheng Li, Peng Jiang
2026-02-26
Generative recommendation has recently attracted widespread attention in industry due to its potential for scaling and stronger model capacity. However, deploying real-time generative recommendation in large-scale advertising requires designs beyond large-language-model (LLM)-style training and inference recipes. We present a production-oriented generative recommender co-designed across architecture, learning, and serving, named GR4AD (Generative Recommendation for ADvertising). As for tokenization, GR4AD proposes UA-SID (Unified Advertisement Semantic ID) to capture complicated business information. Furthermore, GR4AD introduces LazyAR, a lazy autoregressive decoder that relaxes layer-wise dependencies for short, multi-candidate generation, preserving effectiveness while reducing inference cost, which facilitates scaling under fixed serving budgets. To align optimization with business value, GR4AD employs VSL (Value-Aware Supervised Learning) and proposes RSPO (Ranking-Guided Softmax Preference Optimization), a ranking-aware, list-wise reinforcement learning algorithm that optimizes value-based rewards under list-level metrics for continual online updates. For online inference, we further propose dynamic beam decoding, which adapts beam width across generation levels and online load to control compute. Large-scale online A/B tests show up to 4.2% ad revenue improvement over an existing DLRM-based stack, with consistent gains from both model scaling and inference-time scaling. GR4AD has been fully deployed in the Kuaishou advertising system, serving over 400 million users with high-throughput real-time inference.
HulluEdit Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models
Authors: Yangguang Lin, Quan Fang, Yufei Li, Jiachen Sun, Junyu Gao, Jitao Sang
2026-02-26
Object hallucination in Large Vision-Language Models (LVLMs) significantly hinders their reliable deployment. Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. Our core innovation is orthogonal subspace editing: we decompose the hidden states of the model into orthogonal subspaces - visual evidence, conflicting priors, and residual uncertainty - enabling selective suppression of hallucinatory patterns without interfering with visual grounding. This approach mathematically guarantees that edits applied to the prior subspace leave the visual component entirely unaffected. Extensive experiments show that HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR across diverse architectures, while preserving general capabilities on MME and maintaining efficient inference. Our method consistently outperforms contrastive decoding and static subspace editing baselines, offering a new pathway toward more trustworthy LVLMs.
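The orthogonality guarantee mentioned above follows directly from linear algebra and can be checked in a few lines. The subspace bases below are random stand-ins for the paper's learned visual/prior subspaces; only the mechanism is illustrated.

```python
import numpy as np

def subspace_edit(h, P_basis, alpha=0.3):
    """Shrink the 'prior' component of a hidden state h to a fraction
    alpha, leaving everything orthogonal to P_basis untouched. If the
    visual subspace is orthogonal to P_basis, the visual projection of
    h is preserved exactly."""
    p_component = P_basis @ (P_basis.T @ h)     # projection onto prior subspace
    return h - (1.0 - alpha) * p_component

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 4)))    # 8-dim states, orthonormal cols
V, P = Q[:, :2], Q[:, 2:4]                      # mutually orthogonal subspaces

h = rng.normal(size=8)
h_edit = subspace_edit(h, P, alpha=0.3)

# Guarantee: the visual projection is preserved exactly...
assert np.allclose(V.T @ h_edit, V.T @ h)
# ...while the prior component is shrunk by the chosen factor.
assert np.allclose(P.T @ h_edit, 0.3 * (P.T @ h))
```

Because the edit only subtracts vectors inside span(P), any direction orthogonal to P, visual evidence included, passes through unchanged, which is exactly the guarantee the abstract states.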
Reinforcing Real-world Service Agents Balancing Utility and Cost in Task-oriented Dialogue
Authors: Ning Gao, Wei Zhang, Yuqin Dai, Ling Shi, Ziyin Wang, Yujie Wang, Wei He, Jinpeng Wang, Chaozheng Wang
2026-02-26
The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore the Pareto boundary between user reward and global cost constraints. Extensive experiments on customized real business scenarios demonstrate that InteractCS-RL significantly outperforms other baselines across three evaluation dimensions. Further evaluation on tool-agent-user interaction benchmarks verifies InteractCS-RL's robustness across diverse domains.
U-Net-Based Generative Joint Source-Channel Coding for Wireless Image Transmission
Authors: Ming Ye, Kui Cai, Cunhua Pan, Zhen Mei, Wanting Yang, Chunguo Li
2026-02-26
Deep learning (DL)-based joint source-channel coding (JSCC) methods have achieved remarkable success in wireless image transmission. However, these methods either focus on conventional distortion metrics that do not necessarily yield high perceptual quality or incur high computational complexity. In this paper, we propose two DL-based JSCC (DeepJSCC) methods that leverage deep generative architectures for wireless image transmission. Specifically, we propose G-UNet-JSCC, a scheme comprising an encoder and a U-Net-based generator as the decoder. Its skip connections enable multi-scale feature fusion to improve both pixel-level fidelity and perceptual quality of reconstructed images by integrating low- and high-level features. To further enhance pixel-level fidelity, the encoder and the U-Net-based decoder are jointly optimized using a weighted sum of structural similarity and mean-squared error (MSE) losses. Building upon G-UNet-JSCC, we further develop a DeepJSCC method called cGAN-JSCC, where the decoder is enhanced through adversarial training. In this scheme, we retain the encoder of G-UNet-JSCC and adversarially train the decoder's generator against a patch-based discriminator. cGAN-JSCC employs a two-stage training procedure. The outer stage trains the encoder and the decoder end-to-end using an MSE loss, while the inner stage adversarially trains the decoder's generator and the discriminator by minimizing a joint loss combining adversarial and distortion losses. Simulation results demonstrate that the proposed methods achieve superior pixel-level fidelity and perceptual quality on both high- and low-resolution images. For low-resolution images, cGAN-JSCC achieves better reconstruction performance and greater robustness to channel variations than G-UNet-JSCC.
Accelerating LLM Pre-Training through Flat-Direction Dynamics Enhancement
Authors: Shuchen Zhu, Rizhen Hu, Mingze Wang, Mou Sun, Xue Wang, Kun Yuan, Zaiwen Wen
2026-02-26
Pre-training Large Language Models requires immense computational resources, making optimizer efficiency essential. The optimization landscape is highly anisotropic, with loss reduction driven predominantly by progress along flat directions. While matrix-based optimizers such as Muon and SOAP leverage fine-grained curvature information to outperform AdamW, their updates tend toward isotropy -- relatively conservative along flat directions yet potentially aggressive along sharp ones. To address this limitation, we first establish a unified Riemannian Ordinary Differential Equation (ODE) framework that elucidates how common adaptive algorithms operate synergistically: the preconditioner induces a Riemannian geometry that mitigates ill-conditioning, while momentum serves as a Riemannian damping term that promotes convergence. Guided by these insights, we propose LITE, a generalized strategy that enhances training dynamics by applying larger Hessian damping coefficients and learning rates along flat trajectories. Extensive experiments demonstrate that LITE significantly accelerates both Muon and SOAP across diverse architectures (Dense, MoE), parameter scales (130M--1.3B), datasets (C4, Pile), and learning-rate schedules (cosine, warmup-stable-decay). Theoretical analysis confirms that LITE facilitates faster convergence along flat directions in anisotropic landscapes, providing a principled approach to efficient LLM pre-training. The code is available at https://github.com/SHUCHENZHU/LITE.
Denoising as Path Planning Training-Free Acceleration of Diffusion Models with DPCache
Authors: Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, Pipei Huang
2026-02-26
Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior caching methods by 0.031 ImageReward at 4.87x speedup and even surpassing the full-step baseline by 0.028 ImageReward at 3.54x speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at https://github.com/argsss/DPCache.
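The key-timestep selection described above is a classic shortest-path dynamic program. The sketch below uses a synthetic quadratic cost matrix as a stand-in for the paper's calibrated Path-Aware Cost Tensor; all names are illustrative.

```python
def plan_key_timesteps(cost, num_keys):
    """Pick num_keys key timesteps among 0..T-1 (endpoints forced)
    minimizing total path cost, where cost[i][j] models the error of
    caching from key step i until the next key step j. O(T^2 * K) DP."""
    T = len(cost)
    INF = float("inf")
    best = [[INF] * (num_keys + 1) for _ in range(T)]
    parent = [[None] * (num_keys + 1) for _ in range(T)]
    best[0][1] = 0.0                         # step 0 is always a key step
    for j in range(1, T):
        for k in range(2, num_keys + 1):
            for i in range(j):
                if best[i][k - 1] + cost[i][j] < best[j][k]:
                    best[j][k] = best[i][k - 1] + cost[i][j]
                    parent[j][k] = i
    # Backtrack from the final timestep with exactly num_keys key steps.
    path, j, k = [T - 1], T - 1, num_keys
    while parent[j][k] is not None:
        j, k = parent[j][k], k - 1
        path.append(j)
    return best[T - 1][num_keys], path[::-1]

# Toy cost matrix for 5 timesteps: longer skips are superlinearly costly.
cost = [[(j - i) ** 2 if j > i else 0.0 for j in range(5)] for i in range(5)]
total, keys = plan_key_timesteps(cost, num_keys=3)
assert keys == [0, 2, 4]   # even spacing beats any uneven split here
assert total == 8.0        # 2^2 + 2^2, vs. e.g. 0->1->4 costing 1 + 9
```

With a quadratic cost the optimum is evenly spaced key steps; a real calibrated tensor would instead concentrate key steps where skipping is empirically most damaging.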
Vectorizing the Trie Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Authors: Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt, Raghunandan Keshavan, Shao-Chuan Wang, Xinyang Yi, Mingyan Gao, Onkar Dalal, Lichan Hong, Ed Chi, Ningren Han
2026-02-26
Generative retrieval has emerged as a powerful paradigm for LLM-based recommendation. However, industrial recommender systems often benefit from restricting the output space to a constrained subset of items based on business logic (e.g., enforcing content freshness or product category), which standard autoregressive decoding cannot natively support. Moreover, existing constrained decoding methods that make use of prefix trees (tries) incur severe latency penalties on hardware accelerators (TPUs/GPUs). In this work, we introduce STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding), an efficient and scalable constrained decoding technique designed specifically for high-throughput LLM-based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, we transform irregular tree traversals into fully vectorized sparse matrix operations, unlocking massive efficiency gains on hardware accelerators. We deploy STATIC on a large-scale industrial video recommendation platform serving billions of users. STATIC produces significant product metric impact with minimal latency overhead (0.033 ms per step and 0.25% of inference time), achieving a 948x speedup over a CPU trie implementation and a 47-1033x speedup over a hardware-accelerated binary-search baseline. Furthermore, the runtime overhead of STATIC remains extremely low across a wide range of practical configurations. To the best of our knowledge, STATIC enables the first production-scale deployment of strictly constrained generative retrieval. In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval. Our code is available at https://github.com/youtube/static-constraint-decoding.
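The CSR flattening idea can be sketched in miniature: the trie becomes three flat arrays, and both "which tokens are legal here" and "advance one token" become array slices instead of pointer chasing. This is an illustrative sketch only; the production system additionally batches such lookups as sparse matrix operations on TPU/GPU.

```python
import numpy as np

def build_csr_trie(sequences):
    """Flatten a token trie into CSR arrays. For node n, its legal next
    tokens are indices[indptr[n]:indptr[n+1]] and the matching child
    nodes are data[indptr[n]:indptr[n+1]]."""
    children = [{}]                          # node id -> {token id: child id}
    for seq in sequences:
        node = 0
        for tok in seq:
            if tok not in children[node]:
                children[node][tok] = len(children)
                children.append({})
            node = children[node][tok]
    indptr, indices, data = [0], [], []
    for edges in children:
        for tok in sorted(edges):
            indices.append(tok)
            data.append(edges[tok])
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)

def allowed_tokens(indptr, indices, node):
    return indices[indptr[node]:indptr[node + 1]]   # one slice, no traversal

def step(indptr, indices, data, node, tok):
    row = slice(indptr[node], indptr[node + 1])
    hit = np.nonzero(indices[row] == tok)[0]
    return int(data[row][hit[0]]) if hit.size else -1   # -1: disallowed token

# Constrained vocabulary of three valid two-token item ids.
indptr, indices, data = build_csr_trie([[5, 1], [5, 2], [7, 3]])
assert sorted(allowed_tokens(indptr, indices, 0).tolist()) == [5, 7]
node = step(indptr, indices, data, 0, 5)
assert sorted(allowed_tokens(indptr, indices, node).tolist()) == [1, 2]
```

At decode time, `allowed_tokens` supplies the mask applied to the logits, and `step` advances each beam's trie state after a token is emitted.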
SideQuest Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning
Authors: Sanjay Kariyappa, G. Edward Suh
2026-02-26
Long-running agentic tasks, such as deep research, require multi-hop reasoning over information distributed across multiple webpages and documents. In such tasks, the context is dominated by tokens from external retrieval, causing memory usage to grow rapidly and limiting serving performance. While several KV cache management techniques exist for long-context inputs, we find that existing heuristics fail to support multi-step reasoning models effectively. We address this challenge with SideQuest -- a novel approach that leverages the Large Reasoning Model (LRM) itself to perform KV cache management by reasoning about the usefulness of tokens in its context. To prevent the tokens associated with this management process from polluting the model's memory, we frame cache management as an auxiliary task executed in parallel to the main reasoning task. Our evaluations, using a model trained with just 215 samples, show that SideQuest reduces peak token usage by up to 65% on agentic tasks with minimal degradation in accuracy, outperforming heuristic-based KV cache management techniques.
FLYING SERVING On-the-Fly Parallelism Switching for Large Language Model Serving
Authors: Shouwei Gao, Junqi Yin, Feiyi Wang, Wenqian Dong
2026-02-26
Production LLM serving must simultaneously deliver high throughput, low latency, and sufficient context capacity under non-stationary traffic and mixed request requirements. Data parallelism (DP) maximizes throughput by running independent replicas, while tensor parallelism (TP) reduces per-request latency and pools memory for long-context inference. However, existing serving stacks typically commit to a static parallelism configuration at deployment; adapting to bursts, priorities, or long-context requests is often disruptive and slow. We present Flying Serving, a vLLM-based system that enables online DP-TP switching without restarting engine workers. Flying Serving makes reconfiguration practical by virtualizing the state that would otherwise force data movement: (i) a zero-copy Model Weights Manager that exposes TP shard views on demand, (ii) a KV Cache Adaptor that preserves request KV state across DP/TP layouts, (iii) an eagerly initialized Communicator Pool to amortize collective setup, and (iv) a deadlock-free scheduler that coordinates safe transitions under execution skew. Across three popular LLMs and realistic serving scenarios, Flying Serving improves performance under both high and low load while supporting latency- and memory-driven requests.
pQuant Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training
Authors: Wenzheng Zhang, Bingzheng Liu, Yang Hu, Xiaoying Bai, Wentao Zhang, Bin Cui
2026-02-26
Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub 2-bit), which can offer substantial advantages for edge deployment. However, existing methods still fail to achieve satisfactory accuracy and scalability. In this work, we identify a parameter democratization effect as a key bottleneck: the sensitivity of all parameters becomes homogenized, severely limiting expressivity. To address this, we propose pQuant, a method that decouples parameters by splitting linear layers into two specialized branches: a dominant 1-bit branch for efficient computation and a compact high-precision branch dedicated to preserving the most sensitive parameters. Through tailored feature scaling, we explicitly guide the model to allocate sensitive parameters to the high-precision branch. Furthermore, we extend this branch into multiple sparsely-activated experts, enabling efficient capacity scaling. Extensive experiments indicate that pQuant achieves state-of-the-art performance in extremely low-bit LLMs.
Search-P1 Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Authors: Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li, Linfang Shang, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang
2026-02-26
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency, where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
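The order-agnostic coverage idea can be reduced to a few lines. This is a deliberately simplified toy; the paper's soft scoring and dual-track reference alignment are richer, and the step names and weight below are illustrative assumptions.

```python
def path_reward(pred_steps, ref_steps, outcome_correct, w_outcome=0.5):
    """Order-agnostic step coverage plus outcome reward: a trajectory
    with a wrong final answer that still hits useful intermediate
    retrieval steps earns partial credit instead of a flat zero."""
    covered = len(set(pred_steps) & set(ref_steps))
    coverage = covered / len(ref_steps) if ref_steps else 0.0
    return w_outcome * float(outcome_correct) + (1 - w_outcome) * coverage

ref = ["lookup:director", "lookup:birthplace"]

# Wrong final answer, but one of two reference steps was reached:
assert path_reward(["lookup:director"], ref, outcome_correct=False) == 0.25
# Correct answer with full coverage earns the maximum reward:
assert path_reward(ref, ref, outcome_correct=True) == 1.0
```

The non-zero reward for the failed rollout is exactly the signal that a sparse outcome-only reward would discard.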
GIFSplat Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
Authors: Tianyu Chen, Wei Xiang, Kang Han, Yu Lu, Di Wu, Gaowen Liu, Ramana Rao Kompella
2026-02-26
Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially on out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm of existing feed-forward pipelines: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refines the current 3D scene using rendering evidence, achieving a favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with a generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.
Autoregressive Visual Decoding from EEG Signals
Authors: Sicheng Dai, Hongwang Xiao, Shan Yu, Qiwei Ye
2026-02-26
Electroencephalogram (EEG) signals have become a popular medium for decoding visual information due to their cost-effectiveness and high temporal resolution. However, current approaches face significant challenges in bridging the modality gap between EEG and image data. These methods typically rely on complex adaptation processes involving multiple stages, making it hard to maintain consistency and manage compounding errors. Furthermore, the computational overhead imposed by large-scale diffusion models limits their practicality in real-world brain-computer interface (BCI) applications. In this work, we present AVDE, a lightweight and efficient framework for visual decoding from EEG signals. First, we leverage LaBraM, a pre-trained EEG model, and fine-tune it via contrastive learning to align EEG and image representations. Second, we adopt an autoregressive generative framework based on a "next-scale prediction" strategy: images are encoded into multi-scale token maps using a pre-trained VQ-VAE, and a transformer is trained to autoregressively predict finer-scale tokens starting from EEG embeddings as the coarsest representation. This design enables coherent generation while preserving a direct connection between the input EEG signals and the reconstructed images. Experiments on two datasets show that AVDE outperforms previous state-of-the-art methods in both image retrieval and reconstruction tasks, while using only 10% of the parameters. In addition, visualization of intermediate outputs shows that the generative process of AVDE reflects the hierarchical nature of human visual perception. These results highlight the potential of autoregressive models as efficient and interpretable tools for practical BCI applications.
Multilingual Safety Alignment Via Sparse Weight Editing
Authors: Jiaming Liang, Zhaoxin Wang, Handing Wang
2026-02-26
Large Language Models (LLMs) exhibit significant safety disparities across languages, with low-resource languages (LRLs) often bypassing safety guardrails established for high-resource languages (HRLs) like English. Existing solutions, such as multilingual supervised fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), are computationally expensive and dependent on scarce multilingual safety data. In this work, we propose a novel, training-free alignment framework based on Sparse Weight Editing. Identifying that safety capabilities are localized within a sparse set of safety neurons, we formulate the cross-lingual alignment problem as a constrained linear transformation. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs, while preserving general utility via a null-space projection constraint. Extensive experiments across 8 languages and multiple model families (Llama-3, Qwen-2.5) demonstrate that our method substantially reduces Attack Success Rate (ASR) in LRLs with negligible impact on general reasoning capabilities, all achieved with a single, data-efficient calculation.
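The null-space constraint can be illustrated with a generic projector: restricting a weight edit to the null space of preserved activation directions guarantees those inputs' outputs are unchanged. The closed-form LRL-to-HRL mapping is the paper's contribution; this sketch only shows the preservation mechanism, with random stand-in matrices.

```python
import numpy as np

def nullspace_constrained_update(delta_w, K):
    """Project a weight edit onto the null space of preserved directions K
    (columns = activations whose outputs must not change). Since
    K @ pinv(K) @ K == K, the projected edit provably annihilates K."""
    P = K @ np.linalg.pinv(K)                 # projector onto span(K)
    return delta_w @ (np.eye(K.shape[0]) - P)

rng = np.random.default_rng(1)
d = 6
K = rng.normal(size=(d, 2))                   # two "utility" input directions
delta_w = rng.normal(size=(d, d))             # raw safety edit
delta_w_safe = nullspace_constrained_update(delta_w, K)

# The constrained edit changes nothing on preserved inputs:
assert np.allclose(delta_w_safe @ K, 0.0)
```

In practice K would be built from activations of benign general-purpose prompts, so the safety edit cannot degrade utility on those directions by construction.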
SignVLA A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation
Authors: Xinyu Tan, Ningwei Bai, Harry Gardener, Zhengyang Zhong, Luoyu Zhang, Liuhaichen Yang, Zhekai Duan, Monkgogi Galeitsiwe, Zezhi Tang
2026-02-26
We present, to our knowledge, the first sign language-driven Vision-Language-Action (VLA) framework for intuitive and inclusive human-robot interaction. Unlike conventional approaches that rely on gloss annotations as intermediate supervision, the proposed system adopts a gloss-free paradigm and directly maps visual sign gestures to semantic instructions. This design reduces annotation cost and avoids the information loss introduced by gloss representations, enabling more natural and scalable multimodal interaction.
In this work, we focus on a real-time alphabet-level finger-spelling interface that provides a robust and low-latency channel for robotic control. Compared with large-scale continuous sign language recognition, alphabet-level interaction offers improved reliability, interpretability, and deployment feasibility in safety-critical embodied environments. The proposed pipeline transforms continuous gesture streams into coherent language commands through geometric normalization, temporal smoothing, and lexical refinement, ensuring stable and consistent interaction.
Furthermore, the framework is designed to support future integration of LLM-based gloss-free sign language models, enabling scalable word-level and sentence-level semantic understanding. Experimental results demonstrate the effectiveness of the proposed system in grounding sign-derived instructions into precise robotic actions under diverse interaction scenarios. These results highlight the potential of the framework to advance accessible, scalable, and multimodal embodied intelligence.
Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns
Authors: Afshin Khadangi
2026-02-25
Continual learning is a core requirement for deployed language models, yet standard training and fine-tuning pipelines remain brittle under non-stationary data. Online updates often induce catastrophic forgetting, while methods that improve stability frequently increase latency, memory footprint, or dense computation in ways that do not scale well to long contexts. We introduce TRC (Thalamically Routed Cortical Columns), a decoder-only backbone that addresses continual learning at the architectural level. TRC combines sparse thalamic routing over cortical columns with mechanisms for modulation, prediction, memory, and feedback, together with a fast corrective pathway that supports rapid adaptation without destabilizing slower parameters. The resulting block is sparse and chunk-parallel, enabling efficient training and inference while preserving clean ablations of each subsystem. We instantiate a reproducible training and evaluation stack and a continual-learning harness that measures proxy forgetting under streaming domain shifts. Across language modeling and continual learning benchmarks, TRC improves the stability-plasticity tradeoff at comparable compute, enabling rapid on-stream adaptation while preserving previously acquired behavior.
Beyond Dominant Patches Spatial Credit Redistribution For Grounded Vision-Language Models
Authors: Niamul Hassan Samin, Md Arifur Rahman, Abdullah Ibne Hanif, Juena Ahmed Noshin, Md Ashikur Rahman
2026-02-25
Vision-language models (VLMs) frequently hallucinate objects absent from the input image. We trace this failure to spatial credit collapse: activation credit concentrating on visual patches in early transformer
layers, which suppresses contextual evidence and increases reliance on language priors. We introduce Spatial Credit Redistribution (SCR), a training-free inference-time intervention that redistributes hidden-state activation from high-attention source patches to their context, guided by low-entropy inputs. We evaluate six model families (Chameleon, LLaVA, and Qwen, including both Qwen-VL and Qwen2-VL) at scales of 7B, 13B, and 30B, on POPE and CHAIR benchmarks. SCR reduces hallucination by ~4.7-6.0 percentage points on POPE-Adversarial, cuts CHAIR-s by 3.7-5.2 percentage points (42-51 percent relative), and CHAIR-i by 2.7-4.4 percentage points (44-58 percent relative), and preserves CIDEr within 0.8 percentage points. Gains are largest for low-entropy inputs, consistent with the theoretical framework. SCR incurs only 43-56 ms overhead (small models: +43-46 ms; large models: +54-56 ms), roughly 3-6 times lower than OPERA and VCD and 1.3-1.7 times lower than OVCD (+72 ms), while Pareto-dominating all three on both hallucination rate and CIDEr, making it practical for real-time settings. A controlled ablation confirms that attention-guided source selection is essential: replacing it with uniform random selection reduces hallucination rate gains from ~4.7-6.0 percentage points to only ~2.6-3.4 percentage points, pointing to credit-collapse as the key driver.
CCCL Node-Spanning GPU Collectives with CXL Memory Pooling
Authors: Dong Xu, Han Meng, Xinyu Chen, Dengcheng Zhu, Wei Tang, Fei Liu, Liguang Xie, Wu Xiang, Rui Shi, Yue Li, Henry Hu, Hui Zhang, Jianping Jiang, Dong Li
2026-02-25
Large language model (LLM) training or inference across multiple nodes introduces significant pressure on GPU memory and interconnect bandwidth. The Compute Express Link (CXL) shared memory pool offers a scalable solution by enabling memory sharing across nodes, reducing over-provisioning and improving resource utilization. We propose CCCL, a collective communication library that leverages the CXL shared memory pool to support cross-node GPU operations without relying on traditional RDMA-based networking. Our design addresses the challenges of synchronization, data interleaving, and parallelization faced when using the CXL shared memory pool for collective communications. Evaluating on multiple nodes with a TITAN-II CXL switch and six Micron CZ120 memory cards, we show that CCCL achieves highly efficient collective operations across hosts, demonstrating CXL's potential for scalable, memory-centric GPU communication. Our evaluation demonstrates that CCCL achieves average performance improvements of 1.34x for AllGather, 1.84x for Broadcast, 1.94x for Gather, and 1.04x for Scatter, compared to the original RDMA-based implementation over 200 Gbps InfiniBand. In addition, an evaluation on an LLM-training case shows a 1.11x speedup compared with InfiniBand while also saving hardware production cost.
Three-Dimensional Modified Klein--Gordon Oscillator in Standard and Generalized Doubly Special Relativity
Authors: Abdelmalek Boumali, Nosratollah Jafari
2026-02-25
Doubly Special Relativity (DSR) augments special relativity by introducing, alongside the invariant speed of light c, a second observer-independent scale typically associated with the Planck regime. At the level of effective wave equations this principle manifests itself through deformed dispersion relations and energy-dependent spatial operators. Here we quantify such effects in a prototypical exactly solvable bound-state problem: the three-dimensional Klein--Gordon oscillator generated by a non-minimal momentum coupling that yields isotropic harmonic confinement while preserving rotational symmetry. We analyze two standard DSR realizations (Amelino--Camelia and Magueijo--Smolin, parametrized by an invariant energy scale) as well as a generalized DSR framework based on a first-order expansion in the Planck length. After stationary reduction and separation in spherical coordinates, the eigenfunctions retain the generalized-Laguerre and spherical-harmonic structure of the undeformed oscillator, whereas DSR deforms the algebraic condition that relates the principal oscillator number to the relativistic energy. Closed-form spectra are obtained for the standard DSR cases, and perturbative Planck-suppressed shifts are derived for the generalized model. In all realizations the deformation induces branch-dependent shifts of both positive- and negative-energy solutions, which increase with excitation and vanish smoothly in the undeformed limit. The main goal of this paper is to extract analytic spectra and Planck-suppressed shifts that enable a direct comparison between different DSR prescriptions in a fully three-dimensional setting.
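For concreteness, the Magueijo--Smolin realization deforms the mass-shell condition roughly as follows (a standard schematic form from the DSR literature; the paper's exact conventions and oscillator coupling term may differ):

```latex
% Magueijo--Smolin deformed dispersion (schematic, units with c = 1):
\frac{E^{2} - p^{2}}{\left(1 - E/E_{\kappa}\right)^{2}} = m^{2},
% which reduces to the usual mass shell E^{2} - p^{2} = m^{2}
% as the invariant energy scale E_{\kappa} \to \infty.
```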
How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Authors: Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu, Rui Sun, Zhiji Liu, Yue Xing, Jiliang Tang, Benoit Dumoulin
2026-02-25
Latent reasoning is a recently proposed paradigm that performs multi-step reasoning by generating intermediate steps in a continuous latent space rather than in textual space, enabling computation beyond discrete language tokens. Although numerous studies have focused on improving the performance of latent reasoning, its internal mechanisms remain only partially understood. In this work, we conduct a comprehensive analysis of latent reasoning methods to better understand the role and behavior of latent representations in the process. We identify two key issues across latent reasoning methods with different levels of supervision. First, we observe pervasive shortcut behavior, where models achieve high accuracy without relying on latent reasoning. Second, we examine the hypothesis that latent reasoning supports BFS-like exploration in latent space, and find that while latent representations can encode multiple possibilities, the reasoning process does not faithfully implement structured search but instead proceeds implicitly. Finally, our findings reveal a trade-off associated with supervision strength: stronger supervision mitigates shortcut behavior but restricts the ability of latent representations to maintain diverse hypotheses, whereas weaker supervision allows richer latent representations at the cost of increased shortcut behavior.
Decoder-based Sense Knowledge Distillation
Authors: Qitong Wang, Mohammed J. Zaki, Georgios Kollias, Vasileios Kalantzis
2026-02-25
Large language models (LLMs) learn contextual embeddings that capture rich semantic information, yet they often overlook structured lexical knowledge such as word senses and relationships. Prior work has shown that incorporating sense dictionaries can improve knowledge distillation for encoder models, but their application to decoders as generative models remains challenging. In this paper, we introduce Decoder-based Sense Knowledge Distillation (DSKD), a framework that integrates lexical resources into the training of decoder-style LLMs without requiring dictionary lookup at inference time. Extensive experiments on diverse benchmarks demonstrate that DSKD significantly enhances knowledge distillation performance for decoders, enabling generative models to inherit structured semantics while maintaining efficient training.
When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
Authors: Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang
2026-02-25
Large Language Models (LLMs) are increasingly used to ``professionalize'' workplace communication, often at the cost of linguistic identity. We introduce "Cultural Ghosting", the systematic erasure of linguistic markers unique to non-native English varieties during text processing. Through analysis of 22,350 LLM outputs generated from 1,490 culturally marked texts (Indian, Singaporean, and Nigerian English) processed by five models under three prompt conditions, we quantify this phenomenon using two novel metrics: Identity Erasure Rate (IER) and Semantic Preservation Score (SPS). Across all prompts, we find an overall IER of 10.26%, with model-level variation from 3.5% to 20.5% (5.9x range). Crucially, we identify a Semantic Preservation Paradox: models maintain high semantic similarity (mean SPS = 0.748) while systematically erasing cultural markers. Pragmatic markers (politeness conventions) are 1.9x more vulnerable than lexical markers (71.5% vs. 37.1% erasure). Our experiments demonstrate that explicit cultural-preservation prompts reduce erasure by 29% without sacrificing semantic quality.
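A minimal, surface-level version of the IER metric can be sketched as follows (a hypothetical helper; the paper's marker matching is presumably more robust than substring search):

```python
def identity_erasure_rate(source_markers, output_text):
    """Identity Erasure Rate for one text: fraction of cultural markers
    present in the source that no longer appear in the model output.
    Toy surface-match version; real marker detection would be fuzzier."""
    if not source_markers:
        return 0.0
    erased = [m for m in source_markers if m.lower() not in output_text.lower()]
    return len(erased) / len(source_markers)
```

For example, if a source contains the markers "lah" and "prepone" and only "prepone" survives rewriting, the per-text IER is 0.5; the corpus-level IER would average this over all texts.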
FlowCorrect Efficient Interactive Correction of Generative Flow Policies for Robotic Manipulation
Authors: Edgar Welte, Yitian Shi, Rosa Wolf, Maximillian Gilles, Rania Rayyes
2026-02-25
Generative manipulation policies can fail catastrophically under deployment-time distribution shift, yet many failures are near-misses: the robot reaches almost-correct poses and would succeed with a small corrective motion. We present FlowCorrect, a deployment-time correction framework that converts near-miss failures into successes using human nudges, without full policy retraining. During execution, a human provides brief corrective pose nudges via a lightweight VR interface. FlowCorrect uses these corrections to locally adapt the policy, improving actions without retraining the backbone while preserving the model's performance on previously learned scenarios. We evaluate on a real-world robot across three tabletop tasks: pick-and-place, pouring, and cup uprighting. With a low correction budget, FlowCorrect improves success on hard cases by 85% while preserving performance on previously solved scenarios. The results demonstrate that FlowCorrect learns from only a few demonstrations and enables fast, sample-efficient, incremental human-in-the-loop correction of generative visuomotor policies at deployment time in real-world robotics.
DHP Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism
Authors: Yifan Niu, Han Xiao, Dongyi Liu, Wei Zhou, Jia Li
2026-02-25
Scaling long-context capabilities is crucial for Multimodal Large Language Models (MLLMs). However, real-world multimodal datasets are extremely heterogeneous. Existing training frameworks predominantly rely on static parallelism strategies, which suffer from severe load imbalance, redundant communication, and suboptimal hardware utilization under data heterogeneity. In this work, we propose Dynamic Hybrid Parallelism (DHP), an efficient parallelism strategy that adaptively reconfigures communication groups and parallelism degrees during MLLM training. We generalize to non-power-of-two parallelism degrees and develop a polynomial-time algorithm that generates near-optimal parallelism strategies with only millisecond-level overhead per training batch. DHP maintains high hardware efficiency even under extreme data variability. Experimental results demonstrate that DHP significantly outperforms Megatron-LM and DeepSpeed, achieving up to a 1.36x speedup in training throughput while maintaining near-linear scaling efficiency across large-scale NPU clusters.
XStreamVGGT Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression
Authors: Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, Ngai Wong
2026-02-25
Learning-based 3D visual geometry models have advanced significantly with the advent of large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention to deliver robust and efficient streaming 3D reconstruction. However, it suffers from unbounded growth of the Key-Value (KV) cache due to the massive influx of vision tokens from multi-image and long-video inputs, leading to increased memory consumption and inference latency as input frames accumulate. This ultimately limits its scalability for long-horizon applications. To address this gap, we propose XStreamVGGT, a tuning-free approach that seamlessly integrates pruning and quantization to systematically compress the KV cache, enabling extremely memory-efficient streaming inference. Specifically, redundant KV entries generated from multi-frame inputs are first pruned to conform to a fixed cache memory budget, using an efficient token-importance identification mechanism that maintains full compatibility with high-performance attention kernels (e.g., FlashAttention). Additionally, leveraging the inherent distribution patterns of KV tensors, we apply dimension-adaptive quantization within the compression pipeline to further minimize memory overhead while preserving numerical accuracy. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42x and accelerating inference by 5.48x, enabling practical and scalable streaming 3D applications. The code is available at https://github.com/ywh187/XStreamVGGT/.
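The budgeted, importance-based pruning step can be illustrated with a toy sketch. Names are illustrative, and the importance scores are taken as given here, whereas XStreamVGGT derives them from attention statistics:

```python
import numpy as np

def prune_kv_cache(keys, values, importance, budget):
    """Toy KV-cache pruning: keep only the `budget` highest-importance
    tokens, preserving their temporal order so attention kernels see a
    contiguous, order-consistent cache."""
    if keys.shape[0] <= budget:
        return keys, values
    keep = np.sort(np.argsort(importance)[-budget:])  # top-budget, in order
    return keys[keep], values[keep]
```

Because the surviving entries stay in original order and form a dense array, this kind of pruning remains compatible with standard fused attention kernels.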
Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
Authors: Euisoo Jung, Byunghyun Kim, Hyunjin Kim, Seonghye Cho, Jae-Gil Lee
2026-02-25
Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Moreover, current diffusion methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achieve speedups proportional to the number of GPUs. We therefore propose a hybrid parallelism framework that combines a novel data-parallel strategy, condition-based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency and achieve high generation quality in conditional diffusion models. The key ideas are to (i) leverage the conditional and unconditional denoising paths as a new data-partitioning perspective and (ii) adaptively enable optimal pipeline parallelism according to the denoising discrepancy between these two paths. Our framework achieves substantial latency reductions on SDXL and SD3 using two NVIDIA RTX 3090 GPUs, while preserving image quality. This result confirms the generality of our approach across U-Net-based diffusion models and DiT-based flow-matching architectures. Our approach also outperforms existing methods in speedup under high-resolution synthesis settings. Code is available at https://github.com/kaist-dmlab/Hybridiff.
Support Tokens, Stability Margins, and a New Foundation for Robust LLMs
Authors: Deepak Agarwal, Dhyey Dharmendrakumar Mavani, Suyash Gupta, Karthik Sethuraman, Tejas Dharamsi
2026-02-25
Self-attention is usually described as a flexible, content-adaptive way to mix a token with information from its past. We re-interpret causal self-attention transformers, the backbone of modern foundation models, within a probabilistic framework, much as classical PCA is extended to probabilistic PCA. This re-formulation reveals a surprising and deeper structural insight: due to a change-of-variables phenomenon, a barrier constraint emerges on the self-attention parameters. This induces a highly structured geometry on the token space, providing theoretical insight into the dynamics of LLMs. It also reveals a boundary where attention becomes ill-conditioned, leading to a margin interpretation similar to classical support vector machines. Just as with support vectors, this naturally gives rise to the concept of support tokens.
Furthermore, we show that LLMs can be interpreted as a stochastic process over the power set of the token space, providing a rigorous probabilistic framework for sequence modeling. We propose a Bayesian framework and derive a MAP estimation objective that requires only a minimal modification to standard LLM training: the addition of a smooth log-barrier penalty to the usual cross-entropy loss. We demonstrate that this yields more robust models without sacrificing out-of-sample accuracy and that it is straightforward to incorporate in practice.
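The shape of such an objective, cross-entropy plus a smooth log-barrier that diverges at the constraint boundary, can be sketched as follows (the constraint values g_i, the weight lam, and all names are illustrative, not the paper's exact formulation):

```python
import numpy as np

def barrier_regularized_loss(ce_loss, constraint_vals, lam=1e-2, eps=1e-8):
    """MAP-style objective sketch: cross-entropy plus a smooth log-barrier
    that blows up as any constraint margin g_i -> 0+, keeping parameters
    away from the ill-conditioned boundary.
    `constraint_vals` are assumed positive margins g_i(theta) > 0."""
    g = np.asarray(constraint_vals, dtype=float)
    return ce_loss - lam * np.sum(np.log(g + eps))
```

The penalty is negligible for comfortably positive margins and grows without bound as a margin shrinks, which is exactly the mechanism that keeps training away from the ill-conditioned attention regime.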
Send Less, Perceive More Masked Quantized Point Cloud Communication for Loss-Tolerant Collaborative Perception
Authors: Sheng Xu, Enshu Wang, Hongfei Xue, Jian Teng, Bingyi Liu, Yi Zhu, Pu Wang, Libing Wu, Chunming Qiao
2026-02-25
Collaborative perception allows connected vehicles to overcome occlusions and limited viewpoints by sharing sensory information. However, existing approaches struggle to achieve high accuracy under strict bandwidth constraints and remain highly vulnerable to random transmission packet loss. We introduce QPoint2Comm, a quantized point-cloud communication framework that dramatically reduces bandwidth while preserving high-fidelity 3D information. Instead of transmitting intermediate features, QPoint2Comm directly communicates quantized point-cloud indices using a shared codebook, enabling efficient reconstruction with lower bandwidth than feature-based methods. To ensure robustness to packet loss, we employ a masked training strategy that simulates random packet loss, allowing the model to maintain strong performance even under severe transmission failures. In addition, a cascade attention fusion module is proposed to enhance multi-vehicle information integration. Extensive experiments on both simulated and real-world datasets demonstrate that QPoint2Comm sets a new state of the art in accuracy, communication efficiency, and resilience to packet loss.
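The index-based communication scheme can be sketched as nearest-codeword quantization over a shared codebook (a toy dense implementation; the actual system presumably uses a learned codebook and efficient nearest-neighbor search):

```python
import numpy as np

def encode_points(points, codebook):
    """Quantize each 3D point to the index of its nearest codeword.
    Only these small integer indices need to be transmitted."""
    d = np.linalg.norm(points[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

def decode_points(indices, codebook):
    """Receiver side: reconstruct the point cloud by codebook lookup."""
    return codebook[indices]
```

Masked training could then be simulated by randomly dropping a subset of transmitted indices before decoding, so the downstream model learns to tolerate missing packets.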
Multi-Layer Scheduling for MoE-Based LLM Reasoning
Authors: Yifan Sun, Gholamreza Haffar, Minxian Xu, Rajkumar Buyya, Adel N. Toosi
2026-02-25
Large Language Models (LLMs) have achieved remarkable success across a wide range of tasks, but serving them efficiently at scale remains a critical challenge due to their substantial computational and latency demands. While most existing inference frameworks rely on simple scheduling strategies such as First-Come-First-Serve (FCFS) at the engine level and Round-Robin (RR) at the scheduler or coordinator level, they often fail to fully utilize system resources and may suffer from issues such as head-of-line blocking and load imbalance. Recent advances in Mixture-of-Experts (MoE) models have also introduced new scheduling challenges arising from expert parallelism and routing complexity. This research proposes a multi-layer scheduling framework tailored for MoE-based LLMs. It targets scheduling at three levels: request-level, engine-level, and expert-level. At the request level, we explore algorithms such as Shortest-Job-First (SJF) and priority-aware aging to improve throughput and reduce latency. At the engine level, we design load-aware dispatching strategies that account for the current prefix token load, cache utilization, and user stickiness to achieve better resource matching. At the expert level, we focus on alleviating expert hotspots and strategically placing inter-layer expert dependencies to balance load and improve routing efficiency. Extensive experimental results from more than 100 experiments conducted under diverse workload distributions show that our approach consistently outperforms the state-of-the-art inference framework vLLM, achieving up to a 17.8% reduction in Time To First Token (TTFT) latency and a 13.3% reduction in Time-Per-Output-Token (TPOT) latency.
AQR-HNSW Accelerating Approximate Nearest Neighbor Search via Density-aware Quantization and Multi-stage Re-ranking
Authors: Ganap Ashit Tewary, Nrusinga Charan Gantayat, Jeff Zhang
2026-02-25
Approximate Nearest Neighbor (ANN) search has become fundamental to modern AI infrastructure, powering recommendation systems, search engines, and large language models across industry leaders from Google to OpenAI. Hierarchical Navigable Small World (HNSW) graphs have emerged as the dominant ANN algorithm, widely adopted in production systems due to their superior recall versus latency balance. However, as vector databases scale to billions of embeddings, HNSW faces critical bottlenecks: memory consumption expands, distance computation overhead dominates query latency, and it suffers from suboptimal performance on heterogeneous data distributions. This paper presents Adaptive Quantization and Rerank HNSW (AQR-HNSW), a novel framework that synergistically integrates three strategies to enhance HNSW scalability. AQR-HNSW introduces (1) density-aware adaptive quantization, achieving 4x compression while preserving distance relationships; (2) multi-stage re-ranking that reduces unnecessary computations by 35%; and (3) cache-optimized SIMD implementations delivering 16-64 operations per cycle across architectures. Evaluation on standard benchmarks demonstrates 2.5-3.3x higher queries per second (QPS) than state-of-the-art HNSW implementations while maintaining over 98% recall, with 75% memory reduction for the index graph and 5x faster index construction.
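The multi-stage re-ranking idea, cheap quantized distances to build a shortlist, then exact distances on that shortlist only, can be sketched as a flat toy version without the HNSW graph (all names here are illustrative):

```python
import numpy as np

def search_with_rerank(query, base, codes, centroids, shortlist=10, k=3):
    """Two-stage ANN sketch: score all candidates with cheap quantized
    distances (point -> assigned centroid), then re-rank only a small
    shortlist with exact full-precision distances."""
    approx = np.linalg.norm(centroids[codes] - query, axis=1)   # cheap pass
    cand = np.argsort(approx)[:shortlist]
    exact = np.linalg.norm(base[cand] - query, axis=1)          # exact pass
    return cand[np.argsort(exact)[:k]]
```

The exact-distance work scales with the shortlist size rather than the candidate-set size, which is where the claimed computation savings come from.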
CADC Content Adaptive Diffusion-Based Generative Image Compression
Authors: Xihua Sheng, Lingyu Zhu, Tianyu Zhang, Dong Liu, Shiqi Wang, Jing Wang
2026-02-25
Diffusion-based generative image compression has demonstrated remarkable potential for achieving realistic reconstruction at ultra-low bitrates. The key to unlocking this potential lies in making the entire compression process content-adaptive, ensuring that the encoder's representation and the decoder's generative prior are dynamically aligned with the semantic and structural characteristics of the input image. However, existing methods suffer from three critical limitations that prevent effective content adaptation. First, isotropic quantization applies a uniform quantization step, failing to adapt to the spatially varying complexity of image content and creating a misalignment with the diffusion model's noise-dependent prior. Second, the information concentration bottleneck -- arising from the dimensional mismatch between the high-dimensional noisy latent and the diffusion decoder's fixed input -- prevents the model from adaptively preserving essential semantic information in the primary channels. Third, existing textual conditioning strategies either incur significant textual bitrate overhead or rely on generic, content-agnostic textual prompts, thereby failing to provide adaptive semantic guidance efficiently. To overcome these limitations, we propose a content-adaptive diffusion-based image codec with three technical innovations: 1) an Uncertainty-Guided Adaptive Quantization method that learns spatial uncertainty maps to adaptively align quantization distortion with content characteristics; 2) an Auxiliary Decoder-Guided Information Concentration method that uses a lightweight auxiliary decoder to enforce content-aware information preservation in the primary latent channels; and 3) a Bitrate-Free Adaptive Textual Conditioning method that derives content-aware textual descriptions from the auxiliary reconstructed image, enabling semantic guidance without bitrate cost.
SEF-MAP Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction
Authors: Haoxiang Fu, Lingfeng Zhang, Hao Li, Ruibing Hu, Zhengrong Li, Guanjing Liu, Zimu Tan, Long Chen, Hangjun Ye, Xiaoshuai Hao
2026-02-25
High-definition (HD) maps are essential for autonomous driving, yet multi-modal fusion often suffers from inconsistency between camera and LiDAR modalities, leading to performance degradation under low-light conditions, occlusions, or sparse point clouds. To address this, we propose SEF-MAP, a Subspace-Expert Fusion framework for robust multimodal HD map prediction. The key idea is to explicitly disentangle BEV features into four semantic subspaces: LiDAR-private, Image-private, Shared, and Interaction. Each subspace is assigned a dedicated expert, thereby preserving modality-specific cues while capturing cross-modal consensus. To adaptively combine expert outputs, we introduce an uncertainty-aware gating mechanism at the BEV-cell level, where unreliable experts are down-weighted based on predictive variance, complemented by a usage-balance regularizer to prevent expert collapse. To enhance robustness in degraded conditions and promote role specialization, we further propose distribution-aware masking: during training, modality-drop scenarios are simulated using EMA-statistical surrogate features, and a specialization loss enforces distinct behaviors of private, shared, and interaction experts across complete and masked inputs. Experiments on nuScenes and Argoverse2 benchmarks demonstrate that SEF-MAP achieves state-of-the-art performance, surpassing prior methods by +4.2% and +4.8% in mAP, respectively. SEF-MAP provides a robust and effective solution for multi-modal HD map prediction under diverse and degraded conditions.
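The uncertainty-aware gating can be sketched as a variance-penalized softmax over experts (a minimal per-cell version with illustrative names; the paper's gate uses learned components):

```python
import numpy as np

def uncertainty_gate(expert_outputs, expert_vars, temperature=1.0):
    """Fuse expert outputs for one BEV cell with weights that down-weight
    high-variance (unreliable) experts: w_i = softmax(-var_i / T)."""
    v = np.asarray(expert_vars, dtype=float)
    logits = -v / temperature
    w = np.exp(logits - logits.max())      # numerically stable softmax
    w /= w.sum()
    fused = (w[:, None] * np.asarray(expert_outputs)).sum(axis=0)
    return fused, w
```

An expert whose predictive variance spikes (say, the image-private expert at night) is smoothly gated out instead of corrupting the fused BEV feature.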
Duel-Evolve Reward-Free Test-Time Scaling via LLM Self-Preferences
Authors: Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, David Blei
2026-02-25
Many applications seek to optimize outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too costly, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide allocation of the comparison budget toward plausible optima using Double Thompson Sampling, as well as selection of high-quality parents to generate improved candidates. We evaluate Duel-Evolve on MathBench, where it achieves 20 percentage points higher accuracy than existing methods and baselines, and on LiveCodeBench, where it improves over comparable iterative methods by over 12 percentage points. Notably, the method requires no reward model, no ground-truth labels during search, and no hand-crafted scoring function. Results show that pairwise self-preferences provide a strong optimization signal for test-time improvement over large, discrete output spaces.
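The aggregation step, estimating latent candidate quality from noisy pairwise preferences under a Bradley-Terry model with a Gaussian prior, can be sketched with plain gradient ascent (the paper's fuller Bayesian treatment and Double Thompson Sampling are not shown):

```python
import numpy as np

def bradley_terry_map(n, wins, prior=1.0, lr=0.1, steps=500):
    """MAP skill estimates from pairwise comparisons under a
    Bradley-Terry model, P(i beats j) = sigmoid(s_i - s_j),
    with a Gaussian prior on the latent scores s.
    `wins` is a list of (winner_index, loser_index) pairs."""
    s = np.zeros(n)
    for _ in range(steps):
        g = -prior * s                               # gradient of log-prior
        for w, l in wins:
            p = 1.0 / (1.0 + np.exp(s[w] - s[l]))    # 1 - sigmoid(s_w - s_l)
            g[w] += p                                # push winner up ...
            g[l] -= p                                # ... and loser down
        s += lr * g
    return s
```

Feeding in a handful of duels is enough to recover a total order; a Bayesian version would additionally carry posterior uncertainty to drive the comparison-budget allocation.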
DualPath Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
Authors: Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, Rilin Huang, Yixuan Tan, Wentao Zhang, Liyue Zhang, Shangyan Zhou, Yuxuan Liu, Shunfeng Zhou, Mingxing Zhang, Xin Jin, Panpan Huang
2026-02-25
The performance of multi-turn, agentic inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decode engines remain idle. This asymmetry severely constrains overall system throughput.
We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decode engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path -- which inherently avoids network congestion and interference with latency-critical model execution -- with a global scheduler that dynamically balances load across prefill and decode engines.
Our evaluation on three models with production agentic workloads demonstrates that DualPath improves offline inference throughput by up to 1.87x on our in-house inference system. It can also improve online serving throughput by an average factor of 1.96 without violating SLO.
RAC Relation-Aware Cache Replacement for Large Language Models
Authors: Yuchong Wu, Zihuan Xu, Wangze Ni, Peng Cheng, Lei Chen, Xuemin Lin, Heng Tao Shen, Kui Ren
2026-02-25
The scaling of Large Language Model (LLM) services faces significant cost and latency challenges, making effective caching under tight capacity crucial. Existing cache replacement policies, from heuristics to learning-based methods, predominantly rely on limited-window statistics such as recency and frequency. We show these signals are not robust for real-world LLM workloads, which exhibit long reuse distances and weak local recurrence.
To address these limitations, we propose Relation-Aware Cache (RAC), an online eviction strategy that leverages semantic relations among requests to guide eviction decisions. RAC synthesizes two relation-aware signals: (1) Topical Prevalence, which aggregates access evidence at the topic level to capture long-horizon reuse; and (2) Structural Importance, which leverages local intra-topic dependency structure to discriminate entries by their future reuse value. Extensive evaluations show that RAC maintains high effectiveness across diverse workloads, consistently surpassing state-of-the-art baselines by 20%-30% in cache hit ratio.
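A toy rendering of relation-aware eviction scoring (field names and the recency weight are illustrative; RAC's actual signals are computed online from topic-level statistics and intra-topic dependency structure):

```python
def rac_eviction_victim(entries, now):
    """Toy relation-aware eviction: score each cache entry by topic-level
    prevalence plus a structural-importance term, lightly discounted by
    staleness, and evict the lowest-scoring entry."""
    def score(e):
        return (e["topic_prevalence"] + e["structural_importance"]
                - 0.01 * (now - e["last_access"]))
    return min(entries, key=score)["key"]
```

Unlike LRU, an old entry belonging to a prevalent topic can outlive a recently touched but topically isolated one, which is the behavior the abstract argues long-reuse-distance workloads need.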