2025-10-17

Breadcrumbs Reasoning Memory-Efficient Reasoning with Compression Beacons
Invited Paper BitMedViT Ternary-Quantized Vision Transformer for Medical AI Assistants on the Edge
Don't Be Greedy, Just Relax! Pruning LLMs via Frank-Wolfe
How Sampling Affects the Detectability of Machine-written texts A Comprehensive Study
Adaptive Rescheduling in Prefill-Decode Disaggregated LLM Inference
Time Series Foundation Models Benchmarking Challenges and Requirements
NOSA Native and Offloadable Sparse Attention
DOLFIN Balancing Stability and Plasticity in Federated Continual Learning
Steer-MoE Efficient Audio-Language Alignment with a Mixture-of-Experts Steering Module
MedREK Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts
Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers
F-BFQ Flexible Block Floating-Point Quantization Accelerator for LLMs
Make an Offer They Can't Refuse Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment
Document Intelligence in the Era of Large Language Models A Survey
Taming the Fragility of KV Cache Eviction in LLM Inference
ChatR1 Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering
BanaServe Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
DSCD Large Language Model Detoxification with Self-Constrained Decoding
A Dimension-Keeping Semi-Tensor Product Framework for Compressed Sensing
Mirror Speculative Decoding Breaking the Serial Barrier in LLM Inference
Retrieval-in-the-Chain Bootstrapping Large Language Models for Generative Retrieval
NeuroRVQ Multi-Scale EEG Tokenization for Generative Large Brainwave Models
Neural Approximate Inverse Preconditioners
Computationally Efficient Neural Receivers via Axial Self-Attention
Pruning Cannot Hurt Robustness Certified Trade-offs in Reinforcement Learning
Gaussian Process Implicit Surfaces as Control Barrier Functions for Safe Robot Navigation
KVCOMM Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
What If Understanding Motion Through Sparse Interactions
CARVQ Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression
Enhanced Angle-Range Cluster Parameter Estimation in Full-Duplex ISAC Systems
Low Latency, High Bandwidth Streaming of Experimental Data with EJFAT
Teaching Language Models to Faithfully Express their Uncertainty
SMEC Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression
Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems
Probing Latent Knowledge Conflict for Faithful Retrieval-Augmented Generation
VideoLucy Deep Memory Backtracking for Long Video Understanding
PricingLogic Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks
Efficient Adaptive Transformer An Empirical Study and Reproducible Framework
An Empirical Study of Reducing AV1 Decoder Complexity and Energy Consumption via Encoder Parameter Tuning
CurriFlow Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion
Traveling Salesman-Based Token Ordering Improves Stability in Homomorphically Encrypted Language Models
CoLF Logic Programming as Infinitary Proof Exploration
Reinforced Preference Optimization for Recommendation
A Survey on Parallel Reasoning
FedLoDrop Federated LoRA with Dropout for Generalized LLM Fine-tuning
Compressibility Measures Complexity Minimum Description Length Meets Singular Learning Theory
GeoPipe a Geo-distributed LLM Training Framework with enhanced Pipeline Parallelism in a Lossless RDMA-enabled Datacenter Optical Transport Network
APCE Adaptive Progressive Context Expansion for Long Context Processing
Direct Multi-Token Decoding
FlexPipe Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
Topological Vibration Analysis of Elastic Lattices via Bloch Sphere Mapping
Indoor Localization using Compact, Telemetry-Agnostic, Transfer-Learning Enabled Decoder-Only Transformer
Variational Mixture of Graph Neural Experts for Alzheimer's Disease Biomarker Recognition in EEG Brain Networks
QeRL Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
Scaling Language-Centric Omnimodal Representation Learning
Diffusion Transformers with Representation Autoencoders
Hierarchical Qubit-Merging Transformer for Quantum Error Correction
Culturally-Aware Conversations A Framework & Benchmark for LLMs
Situat3DChange Situated 3D Change Understanding Dataset for Multimodal Large Language Model
ReLook Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
AndesVL Technical Report An Efficient Mobile-side Multimodal Large Language Model
From to Multidimensional Supervision of Reasoning Process for LLM Optimization
Multi-View Graph Feature Propagation for Privacy Preservation and Feature Sparsity
Efficient LLM Inference over Heterogeneous Edge Networks with Speculative Decoding
XQuant Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers
Discursive Circuits How Do Language Models Understand Discourse Relations?
Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs
Flow Matching-Based Autonomous Driving Planning with Advanced Interactive Behavior Modeling
Bit Allocation Transfer for Perceptual Quality Enhancement of VVC Intra Coding
Not All Bits Are Equal Scale-Dependent Memory Optimization Strategies for Reasoning Models
MC# Mixture Compressor for Mixture-of-Experts Large Models
KOTOX A Korean Toxic Dataset for Deobfuscation and Detoxification
The Social Cost of Intelligence Emergence, Propagation, and Amplification of Stereotypical Bias in Multi-Agent Systems
Redundancy as a Structural Information Principle for Learning and Generalization
AwareCompiler Agentic Context-Aware Compiler Optimization via a Synergistic Knowledge-Data Driven Framework
FastHMR Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding
Agentic RAG for Software Testing with Hybrid Vector-Graph and Multi-Agent Orchestration
A compressed code for memory discrimination
Review of Inference-Time Scaling Strategies Reasoning, Search and RAG
ADiP Adaptive Precision Systolic Array for Matrix Multiplication Acceleration
Preserving LLM Capabilities through Calibration Data Curation From Analysis to Optimization
Large Language Model-Empowered Channel Prediction and Predictive Beamforming for LEO Satellite Communications
BitMar Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
Self-Supervised Representation Learning with ID-Content Modality Alignment for Sequential Recommendation
The Hidden DNA of LLM-Generated JavaScript Structural Patterns Enable High-Accuracy Authorship Attribution
SASER Stego attacks on open-source LLMs
AnyBCQ Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs
When Images Speak Louder Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance
NIM Neuro-symbolic Ideographic Metalanguage for Inclusive Communication
RobotFleet An Open-Source Framework for Centralized Multi-Robot Task Planning
SP-MoE Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
Grounded AI for Code Review Resource-Efficient Large-Model Serving in Enterprise Pipelines
The Achilles' Heel of LLMs How Altering a Handful of Neurons Can Cripple Language Abilities
ISAAC Intelligent, Scalable, Agile, and Accelerated CPU Verification via LLM-aided FPGA Parallelism
BILLY Steering Large Language Models via Merging Persona Vectors for Creative Generation
A Unified Frequency Domain Decomposition Framework for Interpretable and Robust Time Series Forecasting
PermLLM Learnable Channel Permutation for NM Sparse Large Language Models
CacheClip Accelerating RAG with Effective KV Cache Reuse
Lighter-X An Efficient and Plug-and-play Strategy for Graph-based Recommendation through Decoupled Propagation
P-4DGS Predictive 4D Gaussian Splatting with 90 $\times$ Compression
Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization
Deliberative Dynamics and Value Alignment in LLM Debates
Universal Discrete-Domain Speech Enhancement
Conformal Sparsification for Bandwidth-Efficient Edge-Cloud Speculative Decoding
The Ethics Engine A Modular Pipeline for Accessible Psychometric Assessment of Large Language Models

Breadcrumbs Reasoning Memory-Efficient Reasoning with Compression Beacons

Authors: Giovanni Monea, Yair Feldman, Shankar Padmanabhan, Kianté Brantley, Yoav Artzi

2025-10-15

http://arxiv.org/abs/2510.13797v1

The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value (KV) cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method minimizes overhead over the conventional RL process, as it leverages RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.

Invited Paper BitMedViT Ternary-Quantized Vision Transformer for Medical AI Assistants on the Edge

Authors: Mikolaj Walczak, Uttej Kallakuri, Edward Humes, Xiaomin Lin, Tinoosh Mohsenin

2025-10-15

http://arxiv.org/abs/2510.13760v1

Vision Transformers (ViTs) have demonstrated strong capabilities in interpreting complex medical imaging data. However, their significant computational and memory demands pose challenges for deployment in real-time, resource-constrained mobile and wearable devices used in clinical environments. We introduce BiTMedViT, a new class of Edge ViTs as medical AI assistants that perform structured analysis of medical images directly on the edge. BiTMedViT utilizes ternary-quantized linear layers tailored for medical imaging and combines a training procedure with multi-query attention, preserving stability under ternary weights with low-precision activations. Furthermore, BiTMedViT employs task-aware distillation from a high-capacity teacher to recover accuracy lost due to extreme quantization. We also present a pipeline that maps the ternarized ViTs to a custom CUDA kernel for efficient memory bandwidth utilization and latency reduction on the Jetson Orin Nano. Finally, BiTMedViT achieves 86% diagnostic accuracy (89% SOTA) on MedMNIST across 12 datasets, while reducing model size by 43x, memory traffic by 39x, and enabling 16.8 ms inference at an energy efficiency up to 41x that of SOTA models at 183.62 GOPs/J on the Orin Nano. Our results demonstrate a practical and scientifically grounded route for extreme-precision medical imaging ViTs deployable on the edge, narrowing the gap between algorithmic advances and deployable clinical tools.
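
To make the idea of ternary weights concrete, the sketch below maps a weight vector to codes in {-1, 0, +1} with one shared scale. It uses a common absmean rule chosen purely for illustration; the abstract does not specify BiTMedViT's quantizer, and the names `ternarize` and `dequantize` are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Map weights to codes in {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling (scale = mean absolute weight). This is a
    generic rule used for illustration, not the scheme from the paper.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    scale = max(scale, eps)  # guard against an all-zero weight vector
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes."""
    return [c * scale for c in codes]


codes, scale = ternarize([0.9, -0.8, 0.05, 0.0, -1.2, 0.4])
approx = dequantize(codes, scale)
```

Each weight then needs under two bits of storage plus one shared scale per group, which is where the savings in model size and memory traffic come from.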

Don't Be Greedy, Just Relax! Pruning LLMs via Frank-Wolfe

Authors: Christophe Roux, Max Zimmer, Alexandre d'Aspremont, Sebastian Pokutta

2025-10-15

http://arxiv.org/abs/2510.13713v1

Pruning is a common technique to reduce the compute and storage requirements of Neural Networks. While conventional approaches typically retrain the model to recover pruning-induced performance degradation, state-of-the-art Large Language Model (LLM) pruning methods operate layer-wise, minimizing the per-layer pruning error on a small calibration dataset to avoid full retraining, which is considered computationally prohibitive for LLMs. However, finding the optimal pruning mask is a hard combinatorial problem and solving it to optimality is intractable. Existing methods hence rely on greedy heuristics that ignore the weight interactions in the pruning objective. In this work, we instead consider the convex relaxation of these combinatorial constraints and solve the resulting problem using the Frank-Wolfe (FW) algorithm. Our method drastically reduces the per-layer pruning error, outperforms strong baselines on state-of-the-art GPT architectures, and remains memory-efficient. We provide theoretical justification by showing that, combined with the convergence guarantees of the FW algorithm, we obtain an approximate solution to the original combinatorial problem upon rounding the relaxed solution to integrality.

How Sampling Affects the Detectability of Machine-written texts A Comprehensive Study

Authors: Matthieu Dubois, François Yvon, Pablo Piantanida

2025-10-15

http://arxiv.org/abs/2510.13681v1

As texts generated by Large Language Models (LLMs) are ever more common and often indistinguishable from human-written content, research on automatic text detection has attracted growing attention. Many recent detectors report near-perfect accuracy, often boasting AUROC scores above 99%. However, these claims typically assume fixed generation settings, leaving open the question of how robust such systems are to changes in decoding strategies. In this work, we systematically examine how sampling-based decoding impacts detectability, with a focus on how subtle variations in a model's (sub)word-level distribution affect detection performance. We find that even minor adjustments to decoding parameters - such as temperature, top-p, or nucleus sampling - can severely impair detector accuracy, with AUROC dropping from near-perfect levels to 1% in some settings. Our findings expose critical blind spots in current detection methods and emphasize the need for more comprehensive evaluation protocols. To facilitate future research, we release a large-scale dataset encompassing 37 decoding configurations, along with our code and evaluation framework: https://github.com/BaggerOfWords/Sampling-and-Detection
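
The decoding parameters named above reshape the next-token distribution before a token is drawn. The sketch below shows the standard definitions of temperature scaling and top-p (nucleus) filtering on a toy logit vector. It is a generic illustration, not code from the paper's repository, and the function names are hypothetical.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Turn logits into probabilities with a temperature-scaled softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalize over the kept tokens
    return filtered


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token index after temperature scaling and top-p filtering."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Lowering the temperature or `top_p` concentrates probability mass on fewer tokens. That changes the token-level statistics of the generated text, which is the kind of variation a detector would have to be robust to.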

Adaptive Rescheduling in Prefill-Decode Disaggregated LLM Inference

Authors: Zhibin Wang, Zetao Hong, Xue Li, Zibo Wang, Shipeng Li, Qingkai Meng, Qing Wang, Chengying Huan, Rong Gu, Sheng Zhong, Chen Tian

2025-10-15

http://arxiv.org/abs/2510.13668v1

Large Language Model (LLM) inference has emerged as a fundamental paradigm. In real-world scenarios, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as PD disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in SLO violations and OOM failures under evolving decode workloads. In this paper, we propose ARES, an adaptive decoding rescheduling system powered by length prediction to anticipate future workloads. Our core contributions include: (1) A lightweight and continuous LLM-native prediction method that leverages the LLM hidden state to model remaining generation length with high precision (reducing MAE by 49.42%) and low overhead (cutting predictor parameters by 93.28%); (2) A rescheduling solution in the decode phase with : A dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 TPOT by 74.77% and achieving up to 2.24 times higher goodput.

Time Series Foundation Models Benchmarking Challenges and Requirements

Authors: Marcel Meyer, Sascha Kaltenpoth, Kevin Zalipski, Oliver Müller

2025-10-15

http://arxiv.org/abs/2510.13654v1

Time Series Foundation Models (TSFMs) represent a new paradigm for time series forecasting, offering zero-shot forecasting capabilities without the need for domain-specific pre-training or fine-tuning. However, as with Large Language Models (LLMs), evaluating TSFMs is tricky: as training sets grow ever more extensive, it becomes increasingly challenging to ensure the integrity of benchmarking data. Our investigation of existing TSFM evaluation highlights multiple challenges, ranging from the representativeness of the benchmark datasets and the lack of spatiotemporal evaluation to risks of information leakage due to overlapping and obscure datasets, and the memorization of global patterns caused by external shocks like economic crises or pandemics. Our findings reveal widespread confusion regarding data partitions, risking inflated performance estimates and incorrect transfer of global knowledge to local time series. We argue for the development of robust evaluation methodologies to prevent pitfalls already observed in LLM and classical time series benchmarking, and call upon the research community to design new, principled approaches, such as evaluations on truly out-of-sample future data, to safeguard the integrity of TSFM assessment.

NOSA Native and Offloadable Sparse Attention

Authors: Yuxiang Huang, Chaojun Xiao, Xu Han, Zhiyuan Liu

2025-10-15

http://arxiv.org/abs/2510.13602v1

Trainable sparse attention has emerged as a promising solution to address the decoding efficiency bottleneck of LLMs in long-context processing, significantly saving memory accesses while minimally impacting task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache remains unreduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, thereby enabling KV cache offloading without altering the underlying attention computation. However, the inherent locality remains insufficient to achieve efficient offloading, as the transfer of selected KV pairs between the CPU and GPU continues to dominate the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation as used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput compared with the vanilla trainable sparse attention baseline (InfLLM-V2).

DOLFIN Balancing Stability and Plasticity in Federated Continual Learning

Authors: Omayma Moussadek, Riccardo Salami, Simone Calderara

2025-10-15

http://arxiv.org/abs/2510.13567v1

Federated continual learning (FCL) enables models to learn new tasks across multiple distributed clients while protecting privacy and without forgetting previously acquired knowledge. However, current methods face challenges balancing performance, privacy preservation, and communication efficiency. We introduce Distributed Online LoRA for Federated INcremental learning (DOLFIN), a novel approach combining Vision Transformers with low-rank adapters designed to efficiently and stably learn new tasks in federated environments. Our method leverages LoRA for minimal communication overhead and incorporates DualGradient Projection Memory (DualGPM) to prevent forgetting. Evaluated on CIFAR-100, ImageNet-R, ImageNet-A, and CUB-200 under two Dirichlet heterogeneity settings, DOLFIN consistently surpasses six strong baselines in final average accuracy while matching their memory footprint. Orthogonal low-rank adapters offer an effective and scalable solution for privacy-preserving continual learning in federated settings.

Steer-MoE Efficient Audio-Language Alignment with a Mixture-of-Experts Steering Module

Authors: Ruitao Feng, Bixi Zhang, Sheng Liang, Zheng Yuan

2025-10-15

http://arxiv.org/abs/2510.13558v1

Aligning pretrained audio encoders and Large Language Models (LLMs) offers a promising, parameter-efficient path to building powerful multimodal agents. However, existing methods often require costly full-model finetuning or rely on static adapters that may lack expressive power. Drawing inspiration from the Platonic Representation Hypothesis, we introduce SteerMoE, a novel and modular framework for audio-language alignment. SteerMoE freezes both the audio encoder and the LLM decoder, training only a lightweight steering module integrated within the encoder's layers. This module uses a Mixture-of-Experts (MoE) router to dynamically select and apply learned steering vectors, progressively transforming continuous audio representations into a space comprehensible to the LLM. By operating entirely in the continuous embedding space, our approach requires no modifications to the LLM's vocabulary and preserves its advanced reasoning and agentic capabilities. We demonstrate through experiments on ASR, audio understanding, and a qualitative function-calling task that SteerMoE achieves strong performance while remaining highly modular and computationally efficient, offering a robust new paradigm for developing sophisticated audio-language systems.

MedREK Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts

Authors: Shujun Xia, Haokun Lin, Yichen Wu, Yinan Zhou, Zixuan Li, Zhongwei Wan, Xingrun Xing, Yefeng Zheng, Xiang Li, Caifeng Shan, Zhenan Sun, Quanzheng Li

2025-10-15

http://arxiv.org/abs/2510.13500v1

LLMs hold great promise for healthcare applications, but the rapid evolution of medical knowledge and errors in training data often cause them to generate outdated or inaccurate information, limiting their applicability in high-stakes clinical practice. Model editing has emerged as a potential remedy without full retraining. While parameter-based editing often compromises locality and is thus ill-suited for the medical domain, retrieval-based editing offers a more viable alternative. However, it still faces two critical challenges: (1) representation overlap within the medical knowledge space often causes inaccurate retrieval and reduces editing accuracy; (2) existing methods are restricted to single-sample edits, while batch-editing remains largely unexplored despite its importance for real-world medical applications. To address these challenges, we first construct MedVersa, an enhanced benchmark with broader coverage of medical subjects, designed to evaluate both single and batch edits under strict locality constraints. We then propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Experimental results on various medical benchmarks demonstrate that our MedREK achieves superior performance across different core metrics and provides the first validated solution for batch-editing in medical LLMs. Our code and dataset are available at https://github.com/mylittleriver/MedREK.

Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers

Authors: Xin Zhao, Xiaojun Chen, Bingshan Liu, Haoyu Gao, Zhendong Zhao, Yilong Chen

2025-10-15

http://arxiv.org/abs/2510.13462v1

Large language models (LLMs) with Mixture-of-Experts (MoE) architectures achieve impressive performance and efficiency by dynamically routing inputs to specialized subnetworks, known as experts. However, this routing mechanism inherently exhibits task preferences due to expert specialization, introducing a new and underexplored vulnerability to backdoor attacks. In this work, we investigate the feasibility and effectiveness of injecting backdoors into MoE-based LLMs by exploiting their inherent expert routing preferences. We thus propose BadSwitch, a novel backdoor framework that integrates task-coupled dynamic trigger optimization with a sensitivity-guided Top-S expert tracing mechanism. Our approach jointly optimizes trigger embeddings during pretraining while identifying the S most sensitive experts, subsequently constraining the Top-K gating mechanism to these targeted experts. Unlike traditional backdoor attacks that rely on superficial data poisoning or model editing, BadSwitch primarily embeds malicious triggers into expert routing paths with strong task affinity, enabling precise and stealthy model manipulation. Through comprehensive evaluations across three prominent MoE architectures (Switch Transformer, QwenMoE, and DeepSeekMoE), we demonstrate that BadSwitch can efficiently hijack pre-trained models with up to 100% success rate (ASR) while maintaining the highest clean accuracy (ACC) among all baselines. Furthermore, BadSwitch exhibits strong resilience against both text-level and model-level defense mechanisms, achieving 94.07% ASR and 87.18% ACC on the AGNews dataset. Our analysis of expert activation patterns reveals fundamental insights into MoE vulnerabilities. We anticipate this work will expose security risks in MoE systems and contribute to advancing AI safety.

F-BFQ Flexible Block Floating-Point Quantization Accelerator for LLMs

Authors: Jude Haris, José Cano

2025-10-15

http://arxiv.org/abs/2510.13401v1

Large Language Models (LLMs) have become increasingly prominent for daily tasks, from improving sound-to-text translation to generating additional frames for the latest video games. With the help of LLM inference frameworks, such as llama.cpp, which support optimizations such as KV-caching and quantization, it is now easier than ever to deploy LLMs on edge devices. Quantization is fundamental to enable LLMs on resource-constrained edge devices, and llama.cpp utilizes block floating point (BFP) quantization to drastically reduce the bit width of weights and input tensors, the memory footprint, and the computational power required to run LLMs. LLMs are typically quantized with mixed BFP quantization across the model layers to reduce the loss of model accuracy due to quantization. Therefore, to efficiently accelerate across the layers of BFP-quantized LLMs, specialized accelerators need to support different BFP variants without reconfiguration. To address this issue, we propose a Flexible Block Floating-Point Quantization (F-BFQ) accelerator, which can dynamically switch between two BFP variants and perform matrix multiplication (MatMul) operations. Our initial F-BFQ accelerator design, deployed on the AMD Kria board, reduces inference time by 1.4x on average over the Arm NEON-based CPU execution across three BFP quantized LLMs while achieving 5.2 tokens per second (~3.9 words per second).
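
As a rough illustration of block floating point, the sketch below quantizes one block of values to signed integer mantissas that share a single power-of-two exponent. It is a textbook-style BFP sketch under assumed parameters (8-bit mantissas, one exponent per block). It does not reproduce llama.cpp's block formats or the two BFP variants supported by F-BFQ, and the function names are hypothetical.

```python
import math


def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block to integer mantissas sharing one exponent.

    Assumption: generic BFP with a shared power-of-two exponent per block.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0] * len(block), 0
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so every |x| < 2**e.
    _, shared_exp = math.frexp(max_abs)
    qmax = 2 ** (mantissa_bits - 1) - 1
    step = 2.0 ** (shared_exp - (mantissa_bits - 1))
    mantissas = [max(-qmax, min(qmax, round(x / step))) for x in block]
    return mantissas, shared_exp


def bfp_dequantize(mantissas, shared_exp, mantissa_bits=8):
    """Reconstruct approximate values from mantissas and the shared exponent."""
    step = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * step for m in mantissas]


block = [1.0, -0.5, 0.25, 0.1]
mantissas, shared_exp = bfp_quantize(block)
restored = bfp_dequantize(mantissas, shared_exp)
```

Storing one exponent per block instead of one per value is what shrinks the bit width. A variant with a different block size or mantissa width trades accuracy against footprint, which is why mixed-variant models need hardware that can switch between variants.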

Make an Offer They Can't Refuse Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment

Authors: Buwei He, Yang Liu, Zhaowei Zhang, Zixia Jia, Huijia Wu, Zhaofeng He, Zilong Zheng, Yipeng Kang

2025-10-15

http://arxiv.org/abs/2510.13387v2

Persuasion, a fundamental social capability for humans, remains a challenge for AI systems such as large language models (LLMs). Current studies often overlook the strategic use of information asymmetry in message design or rely on strong assumptions regarding pre-commitment. In this work, we explore the application of Bayesian Persuasion (BP) in natural language within single-turn dialogue settings, to enhance the strategic persuasion capabilities of LLMs. Our framework incorporates a commitment- mechanism, where the persuader explicitly outlines an information schema by narrating their potential types (e.g., honest or dishonest), thereby guiding the persuadee in performing the intended Bayesian belief update. We evaluate two variants of our approach: Semi-Formal-Natural-Language (SFNL) BP and Fully-Natural-Language (FNL) BP, benchmarking them against both naive and strong non-BP (NBP) baselines within a comprehensive evaluation framework. This framework covers a diverse set of persuadees (including LLM instances with varying prompts and fine-tuning, and human participants) across tasks ranging from specially designed persuasion scenarios to general everyday situations. Experimental results on LLM-based agents reveal three main findings: (1) LLMs guided by BP strategies consistently achieve higher persuasion success rates than NBP baselines; (2) SFNL exhibits greater credibility and logical coherence, while FNL shows stronger emotional resonance and robustness in naturalistic conversations; (3) with supervised fine-tuning, smaller models can attain BP performance comparable to that of larger models.

Document Intelligence in the Era of Large Language Models A Survey

Authors: Weishi Wang, Hengchang Hu, Zhijie Zhang, Zhaochen Li, Hongxin Shao, Daniel Dahlmeier

2025-10-15

http://arxiv.org/abs/2510.13366v1

Document AI (DAI) has emerged as a vital application area and has been significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advancements in understanding and generation. This survey provides a comprehensive overview of DAI's evolution, highlighting current research attempts and future prospects of LLMs in this field. We explore key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state-of-the-art in DAI and its implications for both academic and practical applications.

Taming the Fragility of KV Cache Eviction in LLM Inference

Authors: Yuan Feng, Haoyu Guo, JunLin Lv, S. Kevin Zhou, Xike Xie

2025-10-15

http://arxiv.org/abs/2510.13334v1

Large language models have revolutionized natural language processing, yet their deployment remains hampered by the substantial memory and runtime overhead of the Key-Value (KV) cache. To mitigate this, recent methods employ a scoring-aggregation framework to evict unimportant cache entries, based on the stability assumption, namely that a fixed subset of entries remains consistently important during generation. However, prior work has largely focused on refining importance indicators for scoring, while defaulting to mean aggregation out of trust in the stability assumption. In this work, we argue that this underlying assumption is inherently fragile, making mean aggregation highly vulnerable in extreme cases. To counter this, we propose a simple yet elegant defensive aggregation strategy: a two-step, linear-time approach that controls worst-case risk, thereby defending against extreme cases with negligible computational overhead. Embodying this strategy, we propose a novel cache eviction method, DefensiveKV, and its extension, Layer-DefensiveKV, which incorporates layer-wise budget allocation. Across seven task domains (18 datasets), our methods reduce generation quality loss by 2.3x and 4.3x respectively, versus the strongest baseline under a 20% cache size. These results set new performance benchmarks and pioneer a promising direction for optimizing cache eviction against underlying fragility through worst-case risk management. Our code is available at https://github.com/FFY0/Defensive.

ChatR1 Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering

Authors: Simon Lupart, Mohammad Aliannejadi, Evangelos Kanoulas

2025-10-15

http://arxiv.org/abs/2510.13312v1

We present ChatR1, a reasoning framework based on reinforcement learning (RL) for conversational question answering (CQA). Reasoning plays an important role in CQA, where user intent evolves across dialogue turns, and utterances are often underspecified, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Unlike static 'rewrite, retrieve, and generate' pipelines, ChatR1 interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through RL. To address the challenge of sparse and delayed rewards in RL, we propose an intent-aware reward that provides turn-level feedback by aligning retrieval and reasoning with evolving user goals. Our proposed ChatR1 demonstrates strong performance on both 3B and 7B model backbones, outperforming competitive models on five CQA datasets, measured by different metrics (F1, BERTScore, and LLM-as-judge). We include a diverse set of CQA datasets to cover topic shifts, evolving intents, mixed-initiative dialogues, and multi-document grounding, testing ChatR1's performance from various aspects. Ablation studies confirm the effectiveness of the intent-aware reward. Our analyses further reveal diverse reasoning trajectories and effective use of the search tool. ChatR1 also generalizes robustly across domains, demonstrating that RL-based reasoning enables more flexible and context-sensitive behavior than static CQA pipelines.

BanaServe Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure

Authors: Yiyuan He, Minxian Xu, Jingfeng Wu, Jianmin Hu, Chong Ma, Min Shen, Le Chen, Chengzhong Xu, Lin Qu, Kejiang Ye

2025-10-15

http://arxiv.org/abs/2510.13223v1

Large language models (LLMs) are increasingly deployed in AI infrastructure, driving the need for high throughput, resource efficient serving systems. Disaggregated LLM serving, which separates prompt prefill from auto-regressive decode, has emerged as a promising architecture by isolating their heterogeneous compute and memory demands. However, current disaggregated systems face three key limitations: (i) static resource allocation cannot adapt to highly dynamic workloads, causing over-provisioning that wastes resources or under-provisioning that violates service level objectives (SLOs); (ii) inherent load imbalance between prefill and decode stages, where prefill is compute-bound and decode is memory-bound, causes under-utilization in one tier while the other becomes a bottleneck; and (iii) prefix cache aware routing skews load distribution, as high cache hit rate nodes attract disproportionately more requests, further degrading balance and efficiency. To address these issues, we present BanaServe, a dynamic orchestration framework that continuously rebalances computational and memory resources across prefill and decode instances while eliminating hotspots induced by cache. BanaServe introduces layer level weight migration, attention level Key Value Cache (KV Cache) migration, and Global KV Cache Store sharing with layer wise overlapped transmission, enabling both coarse grained (layer level) and fine grained (attention level) load redistribution with minimal latency overhead. These mechanisms allow routers to perform purely load aware scheduling, unconstrained by cache placement. Compared to vLLM, BanaServe achieves 1.2x-3.9x higher throughput with 3.9%-78.4% lower total processing time, and outperforms DistServe by 1.1x-2.8x in throughput with 1.4%-70.1% latency reduction.

DSCD Large Language Model Detoxification with Self-Constrained Decoding

Authors: Ming Dong, Jinkui Zhang, Bolong Zheng, Xinhui Tu, Po Hu, Tingting He

2025-10-15

http://arxiv.org/abs/2510.13183v1

Detoxification in large language models (LLMs) remains a significant research challenge. Existing detoxification methods are all based on external constraints, which require additional resource overhead and lose generation fluency. This work proposes Detoxification with Self-Constrained Decoding (DSCD), a novel method for LLM detoxification without parameter fine-tuning. DSCD strengthens the inner next-token distribution of the safety layer while weakening that of hallucination and toxic layers during output generation. This effectively diminishes toxicity and enhances output safety. DSCD is lightweight, highly compatible, and plug-and-play, readily integrating with existing detoxification methods for further performance improvement. Extensive experiments on representative open-source LLMs and public datasets validate DSCD's effectiveness, demonstrating state-of-the-art (SOTA) performance in both detoxification and generation fluency, with superior efficiency compared to existing methods. These results highlight DSCD's potential as a practical and scalable solution for safer LLM deployments.

A Dimension-Keeping Semi-Tensor Product Framework for Compressed Sensing

Authors: Qi Qi, Abdelhamid Tayebi, Daizhan Cheng, Jun-e Feng

2025-10-15

http://arxiv.org/abs/2510.13180v1

In compressed sensing (CS), signals can be reconstructed from significantly fewer samples than required by the Nyquist-Shannon sampling theorem. While non-sparse signals can be sparsely represented in appropriate transformation domains, conventional CS frameworks rely on the incoherence of the measurement matrix columns to guarantee reconstruction performance. This paper proposes a novel method termed Dimension-Keeping Semi-Tensor Product Compressed Sensing (DK-STP-CS), which leverages intra-group correlations while maintaining inter-group incoherence to enhance the measurement matrix design. Specifically, the DK-STP algorithm is integrated into the design of the sensing matrix, enabling dimensionality reduction while preserving signal recovery capability. For image compression and reconstruction tasks, the proposed method achieves notable noise suppression and improves visual fidelity. Experimental results demonstrate that DK-STP-CS significantly outperforms traditional CS and STP-CS approaches, as evidenced by higher Peak Signal-to-Noise Ratio (PSNR) values between the reconstructed and original images. The robustness of DK-STP-CS is further validated under noisy conditions and varying sampling rates, highlighting its potential for practical applications in resource-constrained environments.

Mirror Speculative Decoding Breaking the Serial Barrier in LLM Inference

Authors: Nikhil Bhendawade, Kumari Nishu, Arnav Kundu, Chris Bartels, Minsik Cho, Irina Belousova

2025-10-15

http://arxiv.org/abs/2510.13161v1

Speculative decoding accelerates inference by using a draft model to look ahead, but gains are capped by the cost of autoregressive draft generation: increasing draft size elevates acceptance rates but introduces additional latency overhead, exacerbating the speed-accuracy tradeoff. Prior methods (Medusa, Hydra, EAGLE) partially reduce draft cost but either degrade acceptance or introduce overheads that limit scaling. We present Mirror Speculative Decoding (Mirror-SD), an inference algorithm that breaks the latency-acceptance tradeoff. Mirror-SD launches branch-complete rollouts from early-exit signals in parallel with the target model's suffix and explicitly maps computation across heterogeneous accelerators (GPU and NPU) to exploit cross-device parallelism. The draft speculates forward continuations for the target to verify, while the target simultaneously speculates correction paths for the draft, converting speculation into two complementary execution pipelines. To further cut draft latency without weakening acceptance semantics, we add speculative streaming so the draft emits multiple tokens per step. This dual strategy of parallel heterogeneous execution plus multi-token speculative streaming pushes speculative decoding toward its ideal regime of high acceptance with low overhead. On SpecBench with server-scale models from 14B to 66B parameters, Mirror-SD delivers consistent end-to-end gains, achieving 2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative improvement over the strongest baseline, EAGLE3.

Retrieval-in-the-Chain Bootstrapping Large Language Models for Generative Retrieval

Authors: Yingchen Zhang, Ruqing Zhang, Jiafeng Guo, Wenjun Peng, Sen Li, Fuyu Lv

2025-10-15

http://arxiv.org/abs/2510.13095v1

Generative retrieval (GR) is an emerging paradigm that leverages large language models (LLMs) to autoregressively generate document identifiers (docids) relevant to a given query. Prior works have focused on leveraging the generative capabilities of LLMs to improve GR, while overlooking that their reasoning capabilities could likewise help. This raises a key question: Can explicit reasoning benefit GR? To investigate, we first conduct a preliminary study where an LLM is prompted to generate free-form chain-of-thought (CoT) reasoning before performing constrained docid decoding. Although this method outperforms standard GR, the generated reasoning tends to be verbose and poorly aligned with the docid space. These limitations motivate the development of a reasoning mechanism better tailored to GR. Therefore, we propose Reason-for-Retrieval (R4R), a reasoning-augmented framework for GR that converts free-form CoT reasoning into a compact, structured format, and iteratively refines the reasoning during the retrieval process. R4R augments an existing GR method by leveraging a reasoning-capable LLM that has been instruction-tuned for GR. At inference time, R4R first uses the LLM to generate an initial structured reasoning; then the same LLM alternates between (i) constrained decoding with the chosen GR method to produce candidate docids and (ii) updating the reasoning based on retrieval results to improve the next round. R4R does not require additional models or training; instead, a single LLM serves as both the reasoning generator and the retriever. Extensive experiments on Natural Questions, MS MARCO, and a real-world item-search benchmark validate the effectiveness of R4R.

NeuroRVQ Multi-Scale EEG Tokenization for Generative Large Brainwave Models

Authors: Konstantinos Barmpas, Na Lee, Alexandros Koliousis, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Stefanos Zafeiriou

2025-10-15

http://arxiv.org/abs/2510.13068v1

Electroencephalography (EEG) captures neural activity across multiple temporal and spectral scales, yielding signals that are rich but complex for representation learning. Recently, EEG foundation models trained to predict masked signal-tokens have shown promise for learning generalizable representations. However, their performance is hindered by their signal tokenization modules. Existing neural tokenizers fail to preserve high-frequency dynamics, limiting their ability to reconstruct EEG signals with high fidelity. We introduce NeuroRVQ, a scalable Large Brainwave Model (LBM) centered on a codebook-based tokenizer. Our tokenizer integrates: (i) multi-scale feature extraction modules that capture the full frequency neural spectrum; (ii) hierarchical residual vector quantization (RVQ) codebooks for high-resolution encoding; and (iii) an EEG signal phase- and amplitude-aware loss function for efficient training. This design enables efficient EEG compression while supporting accurate reconstruction across all frequency bands, leading to robust generative masked modeling. Our empirical results demonstrate that NeuroRVQ achieves lower reconstruction error and outperforms existing LBMs on a variety of downstream tasks. More broadly, the NeuroRVQ tokenizer establishes a strong prior for codebook-based general-purpose brainwave models, enabling advances in neural decoding, generative modeling and multimodal biosignal integration.
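
As a rough illustration of the residual vector quantization idea named in the abstract above, the following minimal sketch encodes a vector with a stack of codebooks, each stage quantizing what the previous stage left over. It is generic RVQ, not the NeuroRVQ tokenizer: the function names, the two toy codebooks, and the 2-D input are all assumptions made for illustration, and real codebooks would be learned.

```python
from typing import List, Tuple

Vector = List[float]
Codebook = List[Vector]


def nearest(codebook: Codebook, v: Vector) -> int:
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], v)),
    )


def rvq_encode(x: Vector, codebooks: List[Codebook]) -> Tuple[List[int], Vector]:
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices, residual


def rvq_decode(indices: List[int], codebooks: List[Codebook]) -> Vector:
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Toy (hypothetical) codebooks: a coarse stage followed by a finer one.
coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]]

indices, residual = rvq_encode([1.2, 0.9], [coarse, fine])
decoded = rvq_decode(indices, [coarse, fine])
print(indices, decoded)
```

Each extra stage only needs to store one more index per vector, which is why stacking small codebooks can reach a fine resolution without a single very large codebook.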

Neural Approximate Inverse Preconditioners

Authors: Tianshi Xu, Rui Peng Li, Yuanzhe Xi

2025-10-14

http://arxiv.org/abs/2510.13034v1

In this paper, we propose a data-driven framework for constructing efficient approximate inverse preconditioners for elliptic partial differential equations (PDEs) by learning the Green's function of the underlying operator with neural networks (NNs). The training process integrates four key components: an adaptive multiscale neural architecture ($\alpha$MSNN) that captures hierarchical features across near-, middle-, and far-field regimes; the use of coarse-grid anchor data to ensure physical identifiability; a multi-$\varepsilon$ staged training protocol that progressively refines the Green's function representation across spatial scales; and an overlapping domain decomposition that enables local adaptation while maintaining global consistency. Once trained, the NN-approximated Green's function is directly compressed into either a hierarchical ($\mathcal{H}$-) matrix or a sparse matrix, using only the mesh geometry and the network output. This geometric construction achieves nearly linear complexity in both setup and application while preserving the spectral properties essential for effective preconditioning. Numerical experiments on challenging elliptic PDEs demonstrate that the resulting preconditioners consistently yield fast convergence and small iteration counts.

Computationally Efficient Neural Receivers via Axial Self-Attention

Authors: SaiKrishna Saketh Yellapragada, Atchutaram K. Kocharlakota, Mário Costa, Esa Ollila, Sergiy A. Vorobyov

2025-10-14

http://arxiv.org/abs/2510.12941v1

Deep learning-based neural receivers are redefining physical-layer signal processing for next-generation wireless systems. We propose an axial self-attention transformer neural receiver designed for applicability to 6G and beyond wireless systems, validated through 5G-compliant experimental configurations, that achieves state-of-the-art block error rate (BLER) performance with significantly improved computational efficiency. By factorizing attention operations along temporal and spectral axes, the proposed architecture reduces the quadratic complexity of conventional multi-head self-attention from $O((TF)^2)$ to $O(T^2F+TF^2)$, yielding substantially fewer total floating-point operations and attention matrix multiplications per transformer block compared to global self-attention. Relative to convolutional neural receiver baselines, the axial neural receiver achieves significantly lower computational cost with a fraction of the parameters. Experimental validation under 3GPP Clustered Delay Line (CDL) channels demonstrates consistent performance gains across varying mobility scenarios. Under non-line-of-sight CDL-C conditions, the axial neural receiver consistently outperforms all evaluated receiver architectures, including global self-attention, convolutional neural receivers, and traditional LS-LMMSE at 10% BLER with reduced computational complexity per inference. At stringent reliability targets of 1% BLER, the axial receiver maintains robust symbol detection at high user speeds, whereas the traditional LS-LMMSE receiver fails to converge, underscoring its suitability for ultra-reliable low-latency communication (URLLC) in dynamic 6G environments and beyond. These results establish the axial neural receiver as a structured, scalable, and efficient framework for AI-Native 6G RAN systems, enabling deployment in resource-constrained edge environments.
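
To make the complexity claim in the abstract above concrete, the short sketch below counts pairwise attention scores for global attention, $(TF)^2$, and for axial attention, $T^2F+TF^2$, on one grid. The grid size (14 time steps by 72 frequency bins) and the function names are assumptions chosen for illustration only; they are not taken from the paper, and the counts ignore heads, channels, and constant factors.

```python
def global_attention_pairs(T: int, F: int) -> int:
    """Score count when every one of the T*F grid cells attends to every cell."""
    return (T * F) ** 2


def axial_attention_pairs(T: int, F: int) -> int:
    """Score count when attention runs along time (T*T per frequency bin)
    and then along frequency (F*F per time step)."""
    return T * T * F + T * F * F


# Hypothetical grid: 14 time steps, 72 frequency bins.
T, F = 14, 72
global_pairs = global_attention_pairs(T, F)
axial_pairs = axial_attention_pairs(T, F)
ratio = global_pairs / axial_pairs  # algebraically equal to T*F / (T + F)

print(f"global: {global_pairs}, axial: {axial_pairs}, ratio: {ratio:.2f}")
```

Because the ratio simplifies to $TF/(T+F)$, the saving grows as either axis of the grid gets longer.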

Pruning Cannot Hurt Robustness Certified Trade-offs in Reinforcement Learning

Authors: James Pedley, Benjamin Etheridge, Stephen J. Roberts, Francesco Quinzan

2025-10-14

http://arxiv.org/abs/2510.12939v1

Reinforcement learning (RL) policies deployed in real-world environments must remain reliable under adversarial perturbations. At the same time, modern deep RL agents are heavily over-parameterized, raising costs and fragility concerns. While pruning has been shown to improve robustness in supervised learning, its role in adversarial RL remains poorly understood. We develop the first theoretical framework for certified robustness under pruning in state-adversarial Markov decision processes (SA-MDPs). For Gaussian and categorical policies with Lipschitz networks, we prove that element-wise pruning can only tighten certified robustness bounds; pruning never makes the policy less robust. Building on this, we derive a novel three-term regret decomposition that disentangles clean-task performance, pruning-induced performance loss, and robustness gains, exposing a fundamental performance-robustness frontier. Empirically, we evaluate magnitude and micro-pruning schedules on continuous-control benchmarks with strong policy-aware adversaries. Across tasks, pruning consistently uncovers reproducible "sweet spots" at moderate sparsity levels, where robustness improves substantially without harming - and sometimes even enhancing - clean performance. These results position pruning not merely as a compression tool but as a structural intervention for robust RL.

Gaussian Process Implicit Surfaces as Control Barrier Functions for Safe Robot Navigation

Authors: Mouhyemen Khan, Tatsuya Ibuki, Abhijit Chatterjee

2025-10-14

http://arxiv.org/abs/2510.12919v1

Level set methods underpin modern safety techniques such as control barrier functions (CBFs), while also serving as implicit surface representations for geometric shapes via distance fields. Inspired by these two paradigms, we propose a unified framework where the implicit surface itself acts as a CBF. We leverage Gaussian process (GP) implicit surfaces (GPIS) to represent the safety boundaries, using safety samples derived from sensor measurements to condition the GP. The GP posterior mean defines the implicit safety surface (safety belief), while the posterior variance provides a robust safety margin. Although GPs have favorable properties such as uncertainty estimation and analytical tractability, they scale cubically with data. To alleviate this issue, we develop a sparse solution called sparse Gaussian CBFs. To the best of our knowledge, GPIS have not been explicitly used to synthesize CBFs. We validate the approach on collision avoidance tasks in two settings: a simulated 7-DOF manipulator operating around the Stanford bunny, and a quadrotor navigating in 3D around a physical chair. In both cases, Gaussian CBFs (with and without sparsity) enable safe interaction and collision-free execution of trajectories that would otherwise intersect the objects.

KVCOMM Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

Authors: Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, Yiran Chen

2025-10-14

http://arxiv.org/abs/2510.12872v1

Multi-agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often suffer substantial overhead from repeated reprocessing of overlapping contexts across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context, including prior turns, must be reprocessed from scratch, leading to inefficient processing. While key-value (KV) caching is an effective solution for avoiding redundant computation in single-agent settings where prefixes remain unchanged, it cannot be directly reused in multi-agent scenarios due to diverging prefixes introduced by agent-specific context extensions. We identify that the core challenge lies in the offset variance of KV-caches across agents. To address this, we propose KVCOMM, a training-free framework that enables efficient prefilling in multi-agent inference by reusing KV-caches and aligning cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples, termed anchors, that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over 70% reuse rate across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, all without quality degradation. In particular, when each fully-connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens under a five-agent setting, KVCOMM achieves up to 7.8x speedup compared to the standard prefill pipeline, reducing TTFT from ~430 ms to ~55 ms.

What If Understanding Motion Through Sparse Interactions

Authors: Stefan Andreas Baumann, Nick Stracke, Timy Phan, Björn Ommer

2025-10-14

http://arxiv.org/abs/2510.12777v1

Understanding the dynamics of a physical scene involves reasoning about the diverse ways it can potentially change, especially as a result of local interactions. We present the Flow Poke Transformer (FPT), a novel framework for directly predicting the distribution of local motion, conditioned on interactions termed "pokes". Unlike traditional methods that typically only enable dense sampling of a single realization of scene dynamics, FPT provides an interpretable directly accessible representation of multi-modal scene motion, its dependency on physical interactions and the inherent uncertainties of scene dynamics. We also evaluate our model on several downstream tasks to enable comparisons with prior methods and highlight the flexibility of our approach. On dense face motion generation, our generic pre-trained model surpasses specialized baselines. FPT can be fine-tuned in strongly out-of-distribution tasks such as synthetic datasets to enable significant improvements over in-domain methods in articulated object motion estimation. Additionally, predicting explicit motion distributions directly enables our method to achieve competitive performance on tasks like moving part segmentation from pokes which further demonstrates the versatility of our FPT. Code and models are publicly available at https://compvis.github.io/flow-poke-.

CARVQ Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression

Authors: Dayin Gou, Sanghyun Byun, Nilesh Malpeddi, Gabrielle De Micheli, Prathamesh Vaste, Jacob Song, Woo Seong Chung

2025-10-14

http://arxiv.org/abs/2510.12721v1

Large Language Models (LLMs) typically rely on a large number of parameters for token embedding, leading to substantial storage requirements and memory footprints. In particular, LLMs deployed on edge devices are memory-bound, and reducing the memory footprint by compressing the embedding layer not only frees up the memory bandwidth but also speeds up inference. To address this, we introduce CARVQ, a novel post-training Corrective Adaptor combined with group Residual Vector Quantization. CARVQ relies on the composition of both linear and non-linear maps and mimics the original model embedding to compress to approximately 1.6 bits without requiring specialized hardware to support lower-bit storage. We test our method on pre-trained LLMs such as LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3.2-3B-Instruct, LLaMA-3.1-8B, Qwen2.5-7B, Qwen2.5-Math-7B and Phi-4, evaluating on common generative, discriminative, math and reasoning tasks. We show that in most cases, CARVQ can achieve lower average bitwidth-per-parameter while maintaining reasonable perplexity and accuracy compared to scalar quantization. Our contributions include a novel compression technique that is compatible with state-of-the-art methods and can be seamlessly integrated into any hardware supporting 4-bit memory to reduce the model's memory footprint in memory-constrained devices. This work demonstrates a crucial step toward the efficient deployment of LLMs on edge devices.

Enhanced Angle-Range Cluster Parameter Estimation in Full-Duplex ISAC Systems

Authors: Muhammad Talha, Besma Smida, David González G

2025-10-14

http://arxiv.org/abs/2510.12711v1

This work studies an integrated sensing and communication (ISAC) framework for targets that are spread both in the angle and range domains. We model each target using a cluster of rays parameterized by a specific density function, and propose a truncated Multiple Signal Classification (MUSIC) spread (TMS) algorithm to accurately estimate the parameters of the density function. Unlike the conventional MUSIC spread (CMS), TMS restricts the signal subspace rank based on the eigen decomposition of the received-signal autocorrelation. We also propose a discrete Fourier transform (DFT) based algorithm for estimating the distance and range spread of each target. Leveraging these estimates, we then develop a dynamic transmit beamforming algorithm that successfully illuminates multiple targets while also serving multiple downlink (DL) users. Simulation results demonstrate the superiority of our proposed algorithms over baseline schemes in both low and high signal-to-noise ratio (SNR) regimes as well as under a wide angular spread regime.

Low Latency, High Bandwidth Streaming of Experimental Data with EJFAT

Authors: Ilya Baldin, Michael Goodrich, Vardan Gyurjyan, Graham Heyes, Derek Howard, Yatish Kumar, David Lawrence, Brad Sawatzky, Stacey Sheldon, Carl Timmer

2025-10-14

http://arxiv.org/abs/2510.12597v1

Thomas Jefferson National Accelerator Facility (JLab) has partnered with Energy Sciences Network (ESnet) to define and implement an edge to compute cluster computational load balancing architecture. The ESnet-JLab FPGA Accelerated Transport (EJFAT) architecture focuses on FPGA acceleration to address compression, fragmentation, UDP packet destination redirection (Network Address Translation (NAT)), and decompression and reassembly. EJFAT seamlessly integrates edge and cluster computing to support direct processing of streamed experimental data. This will directly benefit the JLab science program as well as data centers of the future that require high throughput and low latency for both time-critical data acquisition systems and data center workflows. The EJFAT project will be presented along with how it is synergistic with other DOE activities such as an Integrated Research Infrastructure (IRI), and recent results using data sources at JLab, an EJFAT LB at ESnet, and computational cluster resources at Lawrence Berkeley National Laboratory (LBNL).

Teaching Language Models to Faithfully Express their Uncertainty

Authors: Bryan Eikema, Evgenia Ilia, José G. C. de Souza, Chrysoula Zerva, Wilker Aziz

2025-10-14

http://arxiv.org/abs/2510.12587v1

Large language models (LLMs) often miscommunicate their uncertainty: repeated queries can produce divergent answers, yet generated responses are typically unhedged or hedged in ways that do not reflect this variability. This conveys unfaithful information about the uncertain state of the LLMs' knowledge, creating a faithfulness gap that affects even strong LLMs. We introduce Faithful Uncertainty Tuning (FUT): a fine-tuning approach that teaches instruction-tuned LLMs to express uncertainty faithfully without altering their underlying answer distribution. We construct training data by augmenting model samples with uncertainty hedges (i.e. verbal cues such as 'possibly' or 'likely') aligned with sample consistency, requiring no supervision beyond the model and a set of prompts. We evaluate FUT on open-domain question answering (QA) across multiple models and datasets. Our results show that FUT substantially reduces the faithfulness gap, while preserving QA accuracy and introducing minimal semantic distribution shift. Further analyses demonstrate robustness across decoding strategies, choice of hedgers, and other forms of uncertainty expression (i.e. numerical). These findings establish FUT as a simple and effective way to teach LLMs to communicate uncertainty faithfully.

SMEC Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression

Authors: Biao Zhang, Lixin Chen, Tong Liu, Bo Zheng

2025-10-14

http://arxiv.org/abs/2510.12474v1

Large language models (LLMs) generate high-dimensional embeddings that capture rich semantic and syntactic information. However, high-dimensional embeddings exacerbate computational complexity and storage requirements, thereby hindering practical deployment. To address these challenges, we propose a novel training framework named Sequential Matryoshka Embedding Compression (SMEC). This framework introduces the Sequential Matryoshka Representation Learning (SMRL) method to mitigate gradient variance during training, the Adaptive Dimension Selection (ADS) module to reduce information degradation during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module to enhance unsupervised learning between high- and low-dimensional embeddings. Experiments on image, text, and multimodal datasets demonstrate that SMEC achieves significant dimensionality reduction while maintaining performance. For instance, on the BEIR dataset, our approach improves the performance of compressed LLM2Vec embeddings (256 dimensions) by 1.1 points and 2.7 points compared to the Matryoshka-Adaptor and Search-Adaptor models, respectively.
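
For readers unfamiliar with the Matryoshka setting that the abstract above builds on, the sketch below shows the plain baseline operation: keep the first k dimensions of an embedding and re-normalize so cosine similarity is still meaningful. This is not SMEC (whose ADS module selects dimensions adaptively rather than taking a fixed prefix); the function names and the toy 4-dimensional vectors are assumptions for illustration.

```python
import math
from typing import List


def truncate_and_normalize(embedding: List[float], k: int) -> List[float]:
    """Keep the first k dimensions and rescale the result to unit L2 norm."""
    prefix = embedding[:k]
    norm = math.sqrt(sum(v * v for v in prefix))
    if norm == 0.0:
        return list(prefix)  # nothing to rescale for an all-zero prefix
    return [v / norm for v in prefix]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


# Toy (hypothetical) 4-dimensional embeddings compressed to 2 dimensions.
query = truncate_and_normalize([3.0, 4.0, 12.0, 0.5], k=2)
doc = truncate_and_normalize([4.0, 3.0, -1.0, 2.0], k=2)
similarity = cosine(query, doc)
print(query, doc, similarity)
```

Prefix truncation only works well when training has pushed the most useful information into the leading dimensions, which is the property Matryoshka-style methods aim to learn.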

Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems

Authors: Jiaxin Gao, Chen Chen, Yanwen Jia, Xueluan Gong, Kwok-Yan Lam, Qian Wang

2025-10-14

http://arxiv.org/abs/2510.12462v1

Large Language Models (LLMs) are increasingly being used to autonomously evaluate the quality of content in communication systems, e.g., to assess responses in telecom customer support chatbots. However, the impartiality of these AI "judges" is not guaranteed, and any biases in their evaluation criteria could skew outcomes and undermine user trust. In this paper, we systematically investigate judgment biases in two LLM-as-a-judge models (i.e., GPT-Judge and JudgeLM) under the point-wise scoring setting, encompassing 11 types of biases that cover both implicit and explicit forms. We observe that state-of-the-art judges demonstrate robustness to biased inputs, generally assigning them lower scores than the corresponding clean samples. Providing a detailed scoring rubric further enhances this robustness. We further find that fine-tuning an LLM on high-scoring yet biased responses can significantly degrade its performance, highlighting the risk of training on biased data. We also find that the judged scores correlate with task difficulty: a challenging dataset like GPQA yields lower average scores, whereas an open-ended reasoning dataset (e.g., JudgeLM-val) sees higher average scores. Finally, we propose four potential mitigation strategies to ensure fair and reliable AI judging in practical communication scenarios.

Probing Latent Knowledge Conflict for Faithful Retrieval-Augmented Generation

Authors: Linfeng Gao, Baolong Bi, Zheng Yuan, Le Wang, Zerui Chen, Zhimin Wei, Shenghua Liu, Qinggang Zhang, Jinsong Su

2025-10-14

http://arxiv.org/abs/2510.12460v1

Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm to enhance the factuality of Large Language Models (LLMs). However, existing RAG systems often suffer from an unfaithfulness issue, where the model's response contradicts evidence from the retrieved context. Existing approaches to improving contextual faithfulness largely rely on external interventions, such as prompt engineering, decoding constraints, or reward-based fine-tuning. These works treat the LLM as a black box and overlook a crucial question: how does the LLM internally integrate retrieved evidence with its parametric memory, particularly under knowledge conflicts? To address this gap, we conduct a probing-based analysis of hidden-state representations in LLMs and observe three findings: knowledge integration occurs hierarchically, conflicts manifest as latent signals at the sentence level, and irrelevant context is often amplified when aligned with parametric knowledge. Building on these findings, we propose CLEAR (Conflict-Localized and Enhanced Attention for RAG), a framework that (i) decomposes context into fine-grained sentence-level knowledge, (ii) employs hidden-state probing to localize conflicting knowledge, and (iii) introduces conflict-aware fine-tuning to guide the model to accurately integrate retrieved evidence. Extensive experiments across three benchmarks demonstrate that CLEAR substantially improves both accuracy and contextual faithfulness, consistently outperforming strong baselines under diverse conflict conditions. The related resources are available at https://github.com/LinfengGao/CLEAR.

VideoLucy Deep Memory Backtracking for Long Video Understanding

Authors: Jialong Zuo, Yongtai Deng, Lingdong Kong, Jingkang Yang, Rui Jin, Yiwei Zhang, Nong Sang, Liang Pan, Ziwei Liu, Changxin Gao

2025-10-14

http://arxiv.org/abs/2510.12422v1

Recent studies have shown that agent-based systems leveraging large language models (LLMs) for key information retrieval and integration have emerged as a promising approach for long video understanding. However, these systems face two major challenges. First, they typically perform modeling and reasoning on individual frames, struggling to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding crucial information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human recollection process from coarse to fine, VideoLucy employs a hierarchical memory structure with progressive granularity. This structure explicitly defines the detail level and temporal scope of memory at different hierarchical depths. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model's ability to understand complex events that unfold over time and capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance even surpassing the latest proprietary models such as GPT-4o. Our code and dataset will be made publicly available at https://videolucy.github.io

PricingLogic Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks

Authors: Yunuo Liu, Dawei Zhu, Zena Al-Khalili, Dai Cheng, Yanjun Chen, Dietrich Klakow, Wei Zhang, Xiaoyu Shen

2025-10-14

http://arxiv.org/abs/2510.12409v1

We present PricingLogic, the first benchmark that probes whether Large Language Models (LLMs) can reliably automate tourism-related prices when multiple, overlapping fare rules apply. Travel agencies are eager to offload this error-prone task onto AI systems; however, deploying LLMs without verified reliability could result in significant financial losses and erode customer trust. PricingLogic comprises 300 natural-language questions based on booking requests derived from 42 real-world pricing policies, spanning two levels of difficulty: (i) basic customer-type pricing and (ii) bundled-tour calculations involving interacting discounts. Evaluations of a line of LLMs reveal a steep performance drop on the harder tier, exposing systematic failures in rule interpretation and arithmetic reasoning. These results highlight that, despite their general capabilities, today's LLMs remain unreliable in revenue-critical applications without further safeguards or domain adaptation. Our code and dataset are available at https://github.com/EIT-NLP/PricingLogic.

Efficient Adaptive Transformer An Empirical Study and Reproducible Framework

Authors: Jan Miller

2025-10-14

http://arxiv.org/abs/2510.12856v1

The Efficient Adaptive Transformer (EAT) framework unifies three adaptive efficiency techniques - progressive token pruning, sparse attention, and dynamic early exiting - into a single, reproducible architecture for input-adaptive inference. EAT provides an open-source benchmarking pipeline that automates data processing, timing, and ablation across GLUE tasks (SST-2, QQP, MNLI). Although this empirical study finds that combining these mechanisms can increase latency in shallow six-layer models, it demonstrates that EAT achieves slightly higher accuracy than the optimized DistilBERT baseline on SST-2, illustrating the potential of dynamic computation for latency-sensitive NLP. The main contribution is the open, end-to-end reproducible framework - complete with scripts, CSV logging, and analysis utilities - intended to serve as a community tool for further research on adaptive transformers.

An Empirical Study of Reducing AV1 Decoder Complexity and Energy Consumption via Encoder Parameter Tuning

Authors: Vibhoothi Vibhoothi, Julien Zouein, Shanker Shreejith, Jean-Baptiste Kempf, Anil Kokaram

2025-10-14

http://arxiv.org/abs/2510.12380v1

The widespread adoption of advanced video codecs such as AV1 is often hindered by their high complexity, posing a challenge for battery-constrained devices. While encoders can be configured to produce bitstreams that are decoder-friendly, estimating the complexity and energy overhead for a given video is non-trivial. In this study, we systematically analyse the impact of disabling various coding tools and adjusting coding parameters in two AV1 encoders, libaom-av1 and SVT-AV1. Using system-level energy measurement tools such as RAPL (Running Average Power Limit) and Intel SoC Watch (integrated with VTune profiler), we quantify the resulting trade-offs between complexity, energy consumption, and efficiency for a bitstream. Our results demonstrate that specific encoder configurations can substantially reduce complexity with minimal perceptual quality degradation. For libaom-av1, disabling CDEF, an in-loop filter, gives a mean reduction in cycles of 10%. For SVT-AV1, using the in-built fast-decode=2 preset achieves a more substantial 24% reduction in cycles. These findings provide strategies for content providers to lower the energy footprint of AV1 video streaming.

CurriFlow Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion

Authors: Jinzhou Lin, Jie Zhou, Wenhao Xu, Rongtao Xu, Changwei Wang, Shunpeng Chen, Kexue Fu, Yihua Shao, Li Guo, Shibiao Xu

2025-10-14

http://arxiv.org/abs/2510.12362v1

Semantic Scene Completion (SSC) aims to infer complete 3D geometry and semantics from monocular images, as a crucial capability for camera-based perception in autonomous driving. However, existing SSC methods relying on temporal stacking or depth projection often lack explicit motion reasoning and struggle with occlusions and noisy depth supervision. We propose CurriFlow, a novel semantic occupancy prediction framework that integrates optical flow-based temporal alignment with curriculum-guided depth fusion. CurriFlow employs a multi-level fusion strategy to align segmentation, visual, and depth features across frames using pre-trained optical flow, thereby improving temporal consistency and dynamic object understanding. To enhance geometric robustness, a curriculum learning mechanism progressively transitions from sparse yet accurate LiDAR depth to dense but noisy stereo depth during training, ensuring stable optimization and seamless adaptation to real-world deployment. Furthermore, semantic priors from the Segment Anything Model (SAM) provide category-agnostic supervision, strengthening voxel-level semantic learning and spatial consistency. Experiments on the SemanticKITTI benchmark demonstrate that CurriFlow achieves state-of-the-art performance with a mean IoU of 16.9, validating the effectiveness of our motion-guided and curriculum-aware design for camera-based 3D semantic scene completion.

Traveling Salesman-Based Token Ordering Improves Stability in Homomorphically Encrypted Language Models

Authors: Donghwan Rho, Sieun Seo, Hyewon Sung, Chohong Min, Ernest K. Ryu

2025-10-14

http://arxiv.org/abs/2510.12343v1

As users increasingly interact with large language models (LLMs) using private information, secure and encrypted communication becomes essential. Homomorphic encryption (HE) provides a principled solution by enabling computation directly on encrypted data. Although prior work has explored aspects of running LLMs under HE, the challenge of text generation, particularly next-token prediction, has received limited attention and remains a key obstacle to practical encrypted interaction. In this work, we propose a TSP-based token reordering strategy to address the difficulties of encrypted text generation, together with a post-processing step that further reduces approximation error. Theoretical analysis and experimental results demonstrate that our method prevents collapse, improves coherence in generated text, and preserves data privacy throughout. Overall, our contributions advance the feasibility of practical and privacy-preserving inference.

CoLF Logic Programming as Infinitary Proof Exploration

Authors: Zhibo Chen, Frank Pfenning

2025-10-14

http://arxiv.org/abs/2510.12302v1

Logical Frameworks such as Automath [de Bruijn, 1968] or LF [Harper et al., 1993] were originally conceived as metalanguages for the specification of foundationally uncommitted deductive systems, yielding generic proof checkers. Their high level of abstraction was soon exploited to also express algorithms over deductive systems such as theorem provers, type-checkers, evaluators, compilers, proof transformers, etc. in the paradigm of computation-as-proof-construction. This has been realized in languages such as $\lambda$Prolog [Miller et al., 1991] or Elf [Pfenning, 1991] based on backward chaining, and LolliMon [Lopez et al., 2005] or Celf [Schack-Nielsen and Schuermann, 2008], which integrated forward chaining. None of these early frameworks supported the direct expression of infinitary objects or proofs, which are available in the recently developed CoLF$^\omega$ [Chen, 2023]. In this work-in-progress report, we sketch an approach to computation-as-proof-construction over the first-order fragment of CoLF$^\omega$ (called CoLF$^\omega_1$) that already includes infinitary objects and proofs. A key idea is the interpretation of logic variables as channels and computation as concurrent message-passing. This is realized in a concrete compiler from CoLF$^\omega_1$ to Sax, a proof-theoretically inspired parallel programming language based on the proof-reduction in the semi-axiomatic sequent calculus [DeYoung et al., 2020].

Reinforced Preference Optimization for Recommendation

Authors: Junfei Tan, Yuxin Chen, An Zhang, Junguang Jiang, Bin Liu, Ziru Xu, Han Zhu, Jian Xu, Bo Zheng, Xiang Wang

2025-10-14

http://arxiv.org/abs/2510.12211v1

Recent breakthroughs in large language models (LLMs) have fundamentally shifted recommender systems from discriminative to generative paradigms, where user behavior modeling is achieved by generating target items conditioned on historical interactions. Yet current generative recommenders still suffer from two core limitations: the lack of high-quality negative modeling and the reliance on implicit rewards. Reinforcement learning with verifiable rewards (RLVR) offers a natural solution by enabling on-policy sampling of harder negatives and grounding optimization in explicit reward signals. However, applying RLVR to generative recommenders remains non-trivial. Its unique generation space often leads to invalid or repetitive items that undermine sampling efficiency, and ranking supervision is sparse since most items receive identical zero rewards. To address these challenges, we propose Reinforced Preference Optimization for Recommendation (ReRe), a reinforcement-based paradigm tailored to LLM-based recommenders, an important direction in generative recommendation. ReRe incorporates constrained beam search to improve sampling efficiency and diversify hard negatives, while augmenting rule-based accuracy rewards with auxiliary ranking rewards for finer-grained supervision. Extensive experiments on three real-world datasets demonstrate that ReRe consistently outperforms both traditional and LLM-based recommenders in ranking performance. Further analysis shows that ReRe not only enhances performance across both base and SFT-initialized models but also generalizes robustly across different backbone families and scales. Beyond empirical gains, we systematically investigate the design space of RLVR in recommendation across generation, sampling strategy, reward modeling, and optimization algorithm, offering insights for future research.

A Survey on Parallel Reasoning

Authors: Ziqi Wang, Boye Niu, Zipeng Gao, Zhi Zheng, Tong Xu, Linghui Meng, Zhongli Li, Jing Liu, Yilong Chen, Chen Zhu, Hua Wu, Haifeng Wang, Enhong Chen

2025-10-14

http://arxiv.org/abs/2510.12164v1

With the increasing capabilities of Large Language Models (LLMs), parallel reasoning has emerged as a new inference paradigm that enhances reasoning robustness by concurrently exploring multiple lines of thought before converging on a final answer. It has become a significant trend to explore parallel reasoning to overcome the fragility of standard sequential methods and improve practical performance. In this paper, we aim to survey and summarize the progress and challenges of parallel reasoning. We first present a formal definition of parallel reasoning and clarify its distinction from related concepts like Chain-of-Thought. Then, we organize and discuss advanced techniques based on a novel taxonomy, including non-interactive reasoning, interactive reasoning, and efficiency-focused strategies. Additionally, we explore various application scenarios, such as solving complex problems and enhancing the reliability of outputs. Finally, we highlight the core challenges of parallel reasoning and suggest potential directions for future research. We hope that our work can provide a useful roadmap for beginners and encourage more research on improving parallel reasoning methods. Related resources are available at https://github.com/PPPP-kaqiu/Awesome-Parallel-Reasoning.

FedLoDrop Federated LoRA with Dropout for Generalized LLM Fine-tuning

Authors: Sijing Xie, Dingzhu Wen, Changsheng You, Qimei Chen, Mehdi Bennis, Kaibin Huang

2025-10-14

http://arxiv.org/abs/2510.12078v1

Fine-tuning (FT) large language models (LLMs) is crucial for adapting general-purpose models to specific tasks, enhancing accuracy and relevance with minimal resources. To further enhance generalization ability while reducing training costs, this paper proposes Federated LoRA with Dropout (FedLoDrop), a new framework that applies dropout to the rows and columns of the trainable matrix in Federated LoRA. A generalization error bound and convergence analysis under regularization are obtained, which elucidate the fundamental trade-off between underfitting and overfitting. The error bound reveals that a higher dropout rate increases model sparsity, thereby lowering the upper bound of pointwise hypothesis stability (PHS). While this reduces the gap between empirical and generalization errors, it also incurs a higher empirical error, which, together with the gap, determines the overall generalization error. On the other hand, though dropout reduces costs, deploying FedLoDrop at the network edge still faces challenges due to limited network resources. To address this issue, an optimization problem is formulated to minimize the upper bound of the generalization error, by jointly optimizing the dropout rate and resource allocation subject to the latency and per-device energy consumption constraints. To solve this problem, a branch-and-bound (B\&B)-based method is proposed to obtain its globally optimal solution. Moreover, to reduce the high computational complexity of the B\&B-based method, a penalized successive convex approximation (P-SCA)-based algorithm is proposed to efficiently obtain its high-quality suboptimal solution. Finally, numerical results demonstrate the effectiveness of the proposed approach in mitigating overfitting and improving the generalization capability.
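
To make the row-and-column dropout idea concrete, here is a minimal sketch in plain Python. It is not the FedLoDrop implementation: the function name `row_col_dropout`, the use of independent Bernoulli masks with a single shared rate, and the absence of any rescaling of the kept entries are all assumptions made for illustration.

```python
import random


def row_col_dropout(matrix, rate, rng):
    """Zero out whole rows and whole columns of `matrix`.

    Hypothetical helper: each row and each column is dropped independently
    with probability `rate`. Kept entries are NOT rescaled (an assumption).
    Returns the masked matrix plus the row and column keep-masks.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    # rng.random() is in [0, 1), so rate=0 keeps everything and rate=1 drops everything.
    keep_row = [rng.random() >= rate for _ in range(rows)]
    keep_col = [rng.random() >= rate for _ in range(cols)]
    masked = [
        [matrix[i][j] if keep_row[i] and keep_col[j] else 0.0 for j in range(cols)]
        for i in range(rows)
    ]
    return masked, keep_row, keep_col


if __name__ == "__main__":
    weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    masked, keep_row, keep_col = row_col_dropout(weights, 0.5, random.Random(0))
    print(masked, keep_row, keep_col)
```

A higher `rate` zeroes more rows and columns, which is the sense in which the abstract says dropout increases model sparsity.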

Compressibility Measures Complexity Minimum Description Length Meets Singular Learning Theory

Authors: Einar Urdshals, Edmund Lau, Jesse Hoogland, Stan van Wingerden, Daniel Murfet

2025-10-14

http://arxiv.org/abs/2510.12077v1

We study neural network compressibility by using singular learning theory to extend the minimum description length (MDL) principle to singular models like neural networks. Through extensive experiments on the Pythia suite with quantization, factorization, and other compression techniques, we find that complexity estimates based on the local learning coefficient (LLC) are closely, and in some cases, linearly correlated with compressibility. Our results provide a path toward rigorously evaluating the limits of model compression.

GeoPipe a Geo-distributed LLM Training Framework with enhanced Pipeline Parallelism in a Lossless RDMA-enabled Datacenter Optical Transport Network

Authors: Jun Dai, Xiaorun Wang, Kexiong Fang, Zheng Yang, Yuefeng Ji, Jiawei Zhang

2025-10-14

http://arxiv.org/abs/2510.12064v1

The proliferation of Large Language Models (LLMs) with exponentially growing parameters is making cross-data center (DC) training an inevitable trend. However, viable strategies for extending single-DC training frameworks to multi-DC environments remain underdeveloped. We experimentally demonstrate, for the first time, a high-performance geo-distributed LLM training framework across multiple DCs interconnected by a lossless, remote direct memory access (RDMA) enabled Datacenter Optical Transport Network (DC-OTN). An enhanced pipeline parallelism scheme is implemented within the Ascend full-stack environment of Huawei, which effectively eliminates the impact of cross-DC communication overhead on training efficiency. The overlapped computation and cross-DC communication is achieved with constrained cross-DC bandwidth and High Bandwidth Memory (HBM), reducing computation bubble ratio by up to 78.91%.

APCE Adaptive Progressive Context Expansion for Long Context Processing

Authors: Baisub Lee, Sanghyun Byun, Mohanad Odema, Jung Guack, Jacob Song, Woo Seong Chung

2025-10-14

http://arxiv.org/abs/2510.12051v1

Deploying useful Long-Context Transformer Models (LCTMs) requires addressing two key challenges: (1) a growing memory footprint due to quadratic self-attention and linear KV-cache scaling in memory as sequence length increases; (2) the ContextRot phenomenon, where empirical evidence suggests that the transformer architecture's performance degrades with increasing context length. Given the shared dependency on the input, a natural question arises: Can we surgically select the most important input chunks for processing to synergistically (a) reduce the memory footprint, and (b) mitigate the ContextRot effects? In this paper, we answer this question in the affirmative for long-context summarization tasks. We propose APCE as a context-aware solution to select the most important input chunks through low-dimensional semantic similarity matching with the current query. By directly operating on the input, APCE decouples from strict dependency on underlying hardware or CUDA environments, promising a compatible solution scalable to different deployment systems. Our empirical evaluations have demonstrated superior or on-par summarization performance for APCE compared to the full dense baseline using a fraction (50%-70%) of the input sequence, resulting in KV-cache and self-attention memory efficiency improvements. We hope our findings inspire further research on context-aware efficiency solutions for LCTMs geared towards other relevant long-context tasks.
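
The chunk-selection step described above can be illustrated with a small sketch. This is not APCE's code: the function names, the use of cosine similarity as the matching score, and the toy two-dimensional embeddings are assumptions, since the abstract only states that chunks are chosen by low-dimensional semantic similarity to the current query.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_chunks(chunk_embeddings, query_embedding, keep_fraction=0.5):
    """Return indices of the best-matching chunks, in original document order.

    Hypothetical helper: keeps ceil(keep_fraction * n) chunks ranked by
    cosine similarity to the query embedding.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in (0, 1]")
    n = len(chunk_embeddings)
    k = max(1, math.ceil(keep_fraction * n))
    ranked = sorted(
        range(n),
        key=lambda i: cosine(chunk_embeddings[i], query_embedding),
        reverse=True,
    )
    # Re-sort the winners so the selected text keeps its original ordering.
    return sorted(ranked[:k])


if __name__ == "__main__":
    chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(select_chunks(chunks, [1.0, 0.0], keep_fraction=0.5))
```

Because the selection acts on the input before the model runs, a sketch like this needs no access to attention kernels or the cache layout, which matches the hardware-independence the abstract claims.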

Direct Multi-Token Decoding

Authors: Xuan Luo, Weizhi Wang, Xifeng Yan

2025-10-13

http://arxiv.org/abs/2510.11958v1

Decoder-only transformers have become the standard architecture for large language models (LLMs) due to their strong performance. Recent studies suggest that, in pre-trained LLMs, early, middle, and late layers may serve distinct roles: Early layers focus on understanding the input context, middle layers handle task-specific processing, and late layers convert abstract representations into output tokens. We hypothesize that once representations have been processed by the early and middle layers, the resulting hidden states may encapsulate sufficient information to support the generation of multiple tokens using only the late layers, eliminating the need to repeatedly traverse the early and middle layers. We refer to this inference paradigm as Direct Multi-Token Decoding (DMTD). Unlike speculative decoding, our method introduces no additional parameters, auxiliary routines, or post-generation verification. Despite being trained on a limited dataset, a fine-tuned DMTD Qwen3-4B model has already demonstrated promising results, achieving up to a 2x speedup with only minor performance loss. Moreover, as shown in our scaling analysis, its performance is expected to further improve with larger training datasets.

FlexPipe Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters

Authors: Yanying Lin, Shijie Peng, Chengzhi Lu, Chengzhong Xu, Kejiang Ye

2025-10-13

http://arxiv.org/abs/2510.11938v1

Serving Large Language Models (LLMs) in production faces significant challenges from highly variable request patterns and severe resource fragmentation in serverless clusters. Current systems rely on static pipeline configurations that struggle to adapt to dynamic workload conditions, leading to substantial inefficiencies. We present FlexPipe, a novel system that dynamically reconfigures pipeline architectures during runtime to address these fundamental limitations. FlexPipe decomposes models into fine-grained stages and intelligently adjusts pipeline granularity based on real-time request pattern analysis, implementing three key innovations: fine-grained model partitioning with preserved computational graph constraints, inflight pipeline refactoring with consistent transitions, and topology-aware resource allocation that navigates GPU fragmentation. Comprehensive evaluation on an 82-GPU cluster demonstrates that FlexPipe achieves up to 8.5x better resource efficiency while maintaining 38.3% lower latency compared to state-of-the-art systems, reducing GPU reservation requirements from 75% to 30% of peak capacity.

Topological Vibration Analysis of Elastic Lattices via Bloch Sphere Mapping

Authors: Kazi Tahsin Mahmood, M. Arif Hasan

2025-10-13

http://arxiv.org/abs/2510.11930v1

Mechanical lattices support topological wave phenomena governed by geometric phases. We develop a compact Hilbert space description for one-dimensional elastic chains, expressing intra-cell motion as a normalized superposition of orthogonal eigenstates and tracking complex amplitudes as trajectories on a Bloch sphere. For diatomic lattices, this framework makes inversion symmetry protection explicit: the relative phase between in-phase and out-of-phase modes is piecewise locked, and the Zak phase is quantized with band-dependent jumps at symmetry points. Extending the analysis to triatomic lattices shows that restoring inversion retains quantization, whereas breaking it dequantizes the geometric phase while leaving the spectral origin invariant. Viewing norm-preserving transformations of the modal coefficient pair as Bloch sphere rotations, we demonstrate classical analogues of single-qubit logic gates. A pi-phase rotation about a transverse axis swaps the modal poles, and a longitudinal-axis phase flip maps balanced superpositions to their conjugates. These gate-like operations are realized by controlled evolution across wavenumber space and can be driven or reprogrammed through spatiotemporal stiffness modulation. Introducing space-time modulation hybridizes carrier and sideband harmonics, producing continuous phase winding and open-path geometric phases accumulated along the Floquet trajectory. Across static and modulated regimes, the framework unifies algebraic and geometric viewpoints, remains robust to gauge and basis choices, and operates directly on amplitude-phase data. The results clarify how symmetry, modulation, and topology jointly govern dispersion, modal mixing, and phase accumulation, providing tools to analyze and design vibration and acoustic functionalities in engineered structures.
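
The mapping from a pair of complex modal amplitudes to a point on the Bloch sphere can be sketched with the standard two-level formulas. The code below is an illustration only: the function names, the axis and sign conventions, and the choice to model the two gate-like operations as a plain swap and a sign flip of the second amplitude are assumptions, not taken from the paper.

```python
import math


def bloch_vector(a, b):
    """Map amplitudes (a, b) to Bloch coordinates (x, y, z).

    Uses the textbook convention x = 2 Re(a* b), y = 2 Im(a* b),
    z = |a|^2 - |b|^2 after normalizing the pair to unit norm.
    """
    a, b = complex(a), complex(b)
    norm = math.sqrt(abs(a) ** 2 + abs(b) ** 2)
    if norm == 0.0:
        raise ValueError("amplitudes must not both be zero")
    a, b = a / norm, b / norm
    coherence = a.conjugate() * b
    return (2.0 * coherence.real, 2.0 * coherence.imag, abs(a) ** 2 - abs(b) ** 2)


def swap_modes(a, b):
    """X-like operation: a pi rotation about a transverse axis (up to a global phase)."""
    return b, a


def phase_flip(a, b):
    """Z-like operation: flips the relative phase between the two modes."""
    return a, -b


if __name__ == "__main__":
    print(bloch_vector(1, 0))               # a pure first mode sits at one pole
    print(bloch_vector(*swap_modes(1, 0)))  # the swap moves it to the opposite pole
    print(bloch_vector(*phase_flip(1, 1)))  # a balanced superposition moves along the equator
```

Both operations preserve the norm of the amplitude pair, so the mapped point stays on the unit sphere.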

Indoor Localization using Compact, Telemetry-Agnostic, Transfer-Learning Enabled Decoder-Only Transformer

Authors: Nayan Sanjay Bhatia, Pranay Kocheta, Russell Elliott, Harikrishna S. Kuttivelil, Katia Obraczka

2025-10-13

http://arxiv.org/abs/2510.11926v1

Indoor Wi-Fi positioning remains a challenging problem due to the high sensitivity of radio signals to environmental dynamics, channel propagation characteristics, and hardware heterogeneity. Conventional fingerprinting and model-based approaches typically require labor-intensive calibration and suffer rapid performance degradation when devices, channel or deployment conditions change. In this paper, we introduce Locaris, a decoder-only large language model (LLM) for indoor localization. Locaris treats each access point (AP) measurement as a token, enabling the ingestion of raw Wi-Fi telemetry without pre-processing. By fine-tuning its LLM on different Wi-Fi datasets, Locaris learns a lightweight and generalizable mapping from raw signals directly to device location. Our experimental study comparing Locaris with state-of-the-art methods consistently shows that Locaris matches or surpasses existing techniques for various types of telemetry. Our results demonstrate that compact LLMs can serve as calibration-free regression models for indoor localization, offering scalable and robust cross-environment performance in heterogeneous Wi-Fi deployments. Few-shot adaptation experiments, using only a handful of calibration points per device, further show that Locaris maintains high accuracy when applied to previously unseen devices and deployment scenarios. This yields sub-meter accuracy with just a few hundred samples and robust performance under missing APs, and supports any and all available telemetry. Our findings highlight the practical viability of Locaris for indoor positioning in real-world scenarios, particularly in large-scale deployments where extensive calibration is infeasible.

Variational Mixture of Graph Neural Experts for Alzheimer's Disease Biomarker Recognition in EEG Brain Networks

Authors: Jun-En Ding, Anna Zilverstand, Shihao Yang, Albert Chih-Chieh Yang, Feng Liu

2025-10-13

http://arxiv.org/abs/2510.11917v1

Dementia disorders such as Alzheimer's disease (AD) and frontotemporal dementia (FTD) exhibit overlapping electrophysiological signatures in EEG that challenge accurate diagnosis. Existing EEG-based methods are limited by full-band frequency analysis that hinders precise differentiation of dementia subtypes and severity stages. We propose a variational mixture of graph neural experts (VMoGE) that integrates frequency-specific biomarker identification with structured variational inference for enhanced dementia diagnosis and staging. VMoGE employs a multi-granularity transformer to extract multi-scale temporal patterns across four frequency bands, followed by a variational graph convolutional encoder using Gaussian Markov Random Field priors. Through structured variational inference and adaptive gating, VMoGE links neural specialization to physiologically meaningful EEG frequency bands. Evaluated on two diverse datasets for both subtype classification and severity staging, VMoGE achieves superior performance with AUC improvements of +4% to +10% over state-of-the-art methods. Moreover, VMoGE provides interpretable insights through expert weights that correlate with clinical indicators and spatial patterns aligned with neuropathological signatures, facilitating EEG biomarker discovery for comprehensive dementia diagnosis and monitoring.

QeRL Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs

Authors: Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen

2025-10-13

http://arxiv.org/abs/2510.11696v1

We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration, and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers over 1.5 times speedup in the rollout phase. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) in the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.

Scaling Language-Centric Omnimodal Representation Learning

Authors: Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong

2025-10-13

http://arxiv.org/abs/2510.11693v1

Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities evolves as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.

Diffusion Transformers with Representation Autoencoders

Authors: Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie

2025-10-13

http://arxiv.org/abs/2510.11690v1

Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.

Hierarchical Qubit-Merging Transformer for Quantum Error Correction

Authors: Seong-Joon Park, Hee-Youl Kwak, Yongjune Kim

2025-10-13

http://arxiv.org/abs/2510.11593v1

For reliable large-scale quantum computation, a quantum error correction (QEC) scheme must effectively resolve physical errors to protect logical information. Leveraging recent advances in deep learning, neural network-based decoders have emerged as a promising approach to enhance the reliability of QEC. We propose the Hierarchical Qubit-Merging Transformer (HQMT), a novel and general framework that explicitly leverages the structural graph of stabilizer codes to learn error correlations across multiple scales. Our architecture first computes attention locally on structurally related groups of stabilizers and then systematically merges these qubit-centric representations to build a global view of the error syndrome. The proposed HQMT achieves substantially lower logical error rates for surface codes by integrating a dedicated qubit-merging layer within the architecture. Across various code distances, HQMT significantly outperforms previous neural network-based QEC decoders as well as a powerful belief propagation with ordered statistics (BP+OSD) baseline. This hierarchical approach provides a scalable and effective framework for surface code decoding, advancing the realization of reliable quantum computing.

Culturally-Aware Conversations A Framework & Benchmark for LLMs

Authors: Shreya Havaldar, Sunny Rai, Young-Min Cho, Lyle Ungar

2025-10-13

http://arxiv.org/abs/2510.11563v1

Existing benchmarks that measure cultural adaptation in LLMs are misaligned with the actual challenges these models face when interacting with users from diverse cultural backgrounds. In this work, we introduce the first framework and benchmark designed to evaluate LLMs in realistic, multicultural conversational settings. Grounded in sociocultural theory, our framework formalizes how linguistic style - a key element of cultural communication - is shaped by situational, relational, and cultural context. We construct a benchmark dataset based on this framework, annotated by culturally diverse raters, and propose a new set of desiderata for cross-cultural evaluation in NLP: conversational framing, stylistic sensitivity, and subjective correctness. We evaluate today's top LLMs on our benchmark and show that these models struggle with cultural adaptation in a conversational setting.

Situat3DChange Situated 3D Change Understanding Dataset for Multimodal Large Language Model

Authors: Ruiping Liu, Junwei Zheng, Yufan Chen, Zirui Wang, Kunyu Peng, Kailun Yang, Jiaming Zhang, Marc Pollefeys, Rainer Stiefelhagen

2025-10-13

http://arxiv.org/abs/2510.11509v1

Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs.

ReLook Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding

Authors: Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, Bo Zhou

2025-10-13

http://arxiv.org/abs/2510.11498v1

While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate--diagnose--refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent uses the MLLM-in-the-loop both as a visual critic--scoring code with screenshots--and as a source of actionable, vision-grounded feedback; a strict zero-reward rule for invalid renders anchors renderability and prevents reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training-inference decoupling.
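
The acceptance rule described in this abstract (keep a revision only when it improves on the current best) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `forced_optimization`, `toy_score`, and the toy revision list are hypothetical stand-ins, the scoring function is not the paper's critic, and the loop is not the paper's training procedure.

```python
from typing import Callable, List, Tuple


def forced_optimization(
    initial: str,
    propose: Callable[[str], str],
    score: Callable[[str], float],
    max_rounds: int = 5,
) -> Tuple[str, List[float]]:
    """Keep a revision only if it scores strictly higher than the current best."""
    best, best_score = initial, score(initial)
    history = [best_score]
    for _ in range(max_rounds):
        candidate = propose(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # strict acceptance: reject ties and regressions
            best, best_score = candidate, candidate_score
        history.append(best_score)  # best-so-far score, so it never decreases
    return best, history


def toy_score(code: str) -> float:
    """Hypothetical critic: zero reward if the page would not render at all."""
    if "<html>" not in code:
        return 0.0
    return float(code.count("</"))  # otherwise reward each closing tag


if __name__ == "__main__":
    # Hypothetical proposer that replays a fixed list of revisions.
    revisions = iter(
        ["<html><div>", "<html><div></div>", "<div></div>", "<html><div></div></html>"]
    )
    best, history = forced_optimization(
        "<html>", lambda _current: next(revisions), toy_score, max_rounds=4
    )
    print(best)
    print(history)
```

Because rejected candidates leave the best-so-far unchanged, the recorded score sequence is non-decreasing by construction, which is the monotonicity property the abstract refers to.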

AndesVL Technical Report An Efficient Mobile-side Multimodal Large Language Model

Authors: Zhiwei Jin, Xiaohui Song, Nan Wang, Yafei Liu, Chao Li, Xin Li, Ruichen Wang, Zhihao Li, Qi Qi, Long Cheng, Dongze Hao, Quanlong Zheng, Yanhao Zhang, Haobo Ji, Jian Ma, Zhitong Zheng, Zhenyi Lin, Haolin Deng, Xin Zou, Xiaojie Yin, Ruilin Wang, Liankai Cai, Haijing Liu, Yuqing Qiu, Ke Chen, Zixian Li, Chi Xie, Huafei Li, Chenxing Li, Chuangchuang Wang, Kai Tang, Zhiguang Zhu, Kai Tang, Wenmei Gao, Rui Wang, Jun Wu, Chao Liu, Qin Xie, Chen Chen, Haonan Lu

2025-10-13

http://arxiv.org/abs/2510.11496v2

In recent years, while cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance with enormous model sizes reaching hundreds of billions of parameters, they significantly surpass the limitations in memory, power consumption, and computing capacity of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3's LLM and various visual encoders. We comprehensively outline the model architectures, training pipeline, and training data of AndesVL, which achieves first-tier performance across a wide range of open-source benchmarks, including fields such as text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks when compared with state-of-the-art models of a similar scale. Furthermore, we introduce a 1+N LoRA architecture alongside a Quantization-Aware LoRA Fine-Tuning (QALFT) framework to facilitate efficient task adaptation and model compression during mobile-side deployment of AndesVL. Moreover, utilizing our eviction algorithm -- O -- along with customized speculative decoding and compression strategies, we achieve a 6.7x peak speedup ratio, up to 30.9% memory reduction, and 1.8 bits-per-weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips. We release all models on https://huggingface.co/OPPOer.

From to Multidimensional Supervision of Reasoning Process for LLM Optimization

Authors: Beining Wang, Weihang Su, Hongtao Tian, Tao Yang, Yujia Zhou, Ting Yao, Qingyao Ai, Yiqun Liu

2025-10-13

http://arxiv.org/abs/2510.11457v1

Improving the multi-step reasoning ability of Large Language Models (LLMs) is a critical yet challenging task. The dominant paradigm, outcome-supervised reinforcement learning (RLVR), rewards only correct final answers, often propagating flawed reasoning and suffering from sparse reward signals. While process-level reward models (PRMs) provide denser, step-by-step feedback, they lack generalizability and interpretability, requiring task-specific segmentation of the reasoning process. To this end, we propose the Dimension-level Reward Model (DRM), a new supervision framework that bridges the gap between these two approaches. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions: Confidence for uncertainty calibration, Relevance for semantic alignment, and Coherence for logical consistency. Together, these dimensions capture aspects beyond final answer correctness and enable interpretable assessment without requiring ground truth answers. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs and enhances their reasoning ability. In particular, DRM-supervised training achieves consistent gains on both in-distribution and out-of-distribution open-domain tasks, including mathematics, question answering, code execution, and puzzles. Our findings demonstrate that multidimensional supervision of the reasoning process can improve the generalized reasoning ability of LLMs beyond the training distribution.

Multi-View Graph Feature Propagation for Privacy Preservation and Feature Sparsity

Authors: Etzion Harari, Moshe Unger

2025-10-13

http://arxiv.org/abs/2510.11347v1

Graph Neural Networks (GNNs) have demonstrated remarkable success in node classification tasks over relational data, yet their effectiveness often depends on the availability of complete node features. In many real-world scenarios, however, feature matrices are highly sparse or contain sensitive information, leading to degraded performance and increased privacy risks. Furthermore, direct exposure of information can result in unintended data leakage, enabling adversaries to infer sensitive information. To address these challenges, we propose a novel Multi-view Feature Propagation (MFP) framework that enhances node classification under feature sparsity while promoting privacy preservation. MFP extends traditional Feature Propagation (FP) by dividing the available features into multiple Gaussian-noised views, each propagating information independently through the graph topology. The aggregated representations yield expressive and robust node embeddings. This framework is novel in two respects: it introduces a mechanism that improves robustness under extreme sparsity, and it provides a principled way to balance utility with privacy. Extensive experiments conducted on graph datasets demonstrate that MFP outperforms state-of-the-art baselines in node classification while substantially reducing privacy leakage. Moreover, our analysis demonstrates that propagated outputs serve as alternative imputations rather than reconstructions of the original features, preserving utility without compromising privacy. A comprehensive sensitivity analysis further confirms the stability and practical applicability of MFP across diverse scenarios. Overall, MFP provides an effective and privacy-aware framework for graph learning in domains characterized by missing or sensitive features.
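
To make the idea of propagating several noised views concrete, here is a rough sketch for a single scalar feature per node. It is a simplification under stated assumptions, not the MFP method itself: each view is a Gaussian-noised copy of all known features (the abstract's division of features across views is not reproduced), propagation uses a plain neighbor average, and views are aggregated by a simple mean. The names `propagate` and `multi_view_propagate` are hypothetical.

```python
import random
from typing import List, Optional


def propagate(
    adj: List[List[int]], feats: List[Optional[float]], n_iters: int = 20
) -> List[float]:
    """Fill missing features by repeated neighbor averaging, clamping known values."""
    known = {i: v for i, v in enumerate(feats) if v is not None}
    x = [v if v is not None else 0.0 for v in feats]
    for _ in range(n_iters):
        new_x = [
            sum(x[j] for j in adj[i]) / len(adj[i]) if adj[i] else x[i]
            for i in range(len(x))
        ]
        for i, v in known.items():
            new_x[i] = v  # known features are reset after every step
        x = new_x
    return x


def multi_view_propagate(
    adj: List[List[int]],
    feats: List[Optional[float]],
    n_views: int = 4,
    sigma: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Propagate several Gaussian-noised views independently and average them."""
    rng = random.Random(seed)
    views = []
    for _ in range(n_views):
        noisy = [None if v is None else v + rng.gauss(0.0, sigma) for v in feats]
        views.append(propagate(adj, noisy))
    return [sum(column) / n_views for column in zip(*views)]


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 where node 1 has no observed feature.
    adjacency = [[1], [0, 2], [1]]
    features = [1.0, None, 3.0]
    print(multi_view_propagate(adjacency, features, sigma=0.1))
```

In this toy setup the output for observed nodes is a perturbed version of the original value rather than an exact copy, which loosely mirrors the abstract's point that propagated outputs act as imputations rather than reconstructions.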

Efficient LLM Inference over Heterogeneous Edge Networks with Speculative Decoding

Authors: Bingjie Zhu, Zhixiong Chen, Liqiang Zhao, Hyundong Shin, Arumugam Nallanathan

2025-10-13

http://arxiv.org/abs/2510.11331v1

Large language model (LLM) inference at the network edge is a promising paradigm that leverages distributed edge resources to run inference near users and enhance privacy. Existing edge-based inference systems typically adopt autoregressive decoding (AD), which only generates one token per forward pass. This iterative process, compounded by the limited computational resources of edge nodes, results in high latency and constrains the system's ability to support multiple users under growing demands. To address these challenges, we propose a speculative decoding (SD)-based framework that deploys small and large models across heterogeneous edge nodes to collaboratively deliver inference services. Specifically, the small model rapidly generates draft tokens that the large model verifies in parallel, enabling multi-token generation per forward pass and thus reducing latency. To improve resource utilization of edge nodes, we incorporate pipeline parallelism to overlap drafting and verification across multiple inference tasks. Based on this framework, we analyze and derive a comprehensive latency model incorporating both communication and inference latency. Then, we formulate a joint optimization problem for speculation length, task batching, and wireless resource allocation to minimize total latency. To address this problem, we derive the closed-form solutions for wireless resource allocation, and develop a dynamic programming algorithm for joint batching and speculation control strategies. Experimental results demonstrate that the proposed framework achieves lower latency compared to AD-based systems. In addition, the proposed joint optimization method delivers up to 44.9% latency reduction compared to benchmark schemes.
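
The draft-then-verify loop that the abstract describes can be sketched in a few lines. This is a toy, greedy-acceptance variant with hypothetical stand-in "models" (`draft_next`, `target_next`); it does not model the paper's edge deployment, pipelining, or latency optimization, and a real system would verify all draft positions in one batched forward pass of the large model rather than in a Python loop.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(
    prefix: List[Token], draft_next: NextToken, target_next: NextToken, k: int
) -> List[Token]:
    """Return the tokens committed by one draft-and-verify round (greedy variant)."""
    # 1) The small model drafts k tokens autoregressively.
    draft: List[Token] = []
    context = list(prefix)
    for _ in range(k):
        token = draft_next(context)
        draft.append(token)
        context.append(token)

    # 2) The large model checks each drafted position against its own choice.
    committed: List[Token] = []
    context = list(prefix)
    for token in draft:
        expected = target_next(context)
        if expected != token:
            committed.append(expected)  # first mismatch: take the target's token and stop
            return committed
        committed.append(token)
        context.append(token)

    # 3) All drafts accepted: the verification pass also yields one extra token.
    committed.append(target_next(context))
    return committed


def target_next(context: List[Token]) -> Token:
    """Hypothetical large model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def draft_next(context: List[Token]) -> Token:
    """Hypothetical small model: agrees with the target except after token 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([1], draft_next, target_next, k=4))
    print(speculative_step([5], draft_next, target_next, k=3))
```

With greedy acceptance, the committed tokens always match what the large model would have produced on its own, so each round yields between 1 and k + 1 tokens per verification pass without changing the output.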

XQuant Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression

Authors: Haoqi Yang, Yao Yao, Zuchao Li, Baoyuan Qi, Guoming Liu, Hai Zhao

2025-10-13

http://arxiv.org/abs/2510.11236v1

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, particularly due to KV cache growth during long-text understanding and generation, present significant challenges for deployment in resource-constrained environments. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key innovations: a computationally negligible data-free calibration method and cross-layer KV cache compression, enabling quantization to sub-1.4 bits. Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant outperforms state-of-the-art methods (e.g., KIVI-2bit and Asym-1.5bit) by achieving lower bit-width while maintaining superior performance, establishing a better trade-off between memory efficiency and model accuracy.

The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers

Authors: Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš

2025-10-13

http://arxiv.org/abs/2510.11218v1

Large language models (LLMs) can correctly answer "When was Einstein born?" yet fail to provide the same date when writing about Einstein's life, revealing a fundamental inconsistency in how models access factual knowledge across task complexities. While models display impressive accuracy on factual question-answering benchmarks, the reliability gap between simple and complex queries remains poorly understood, eroding their trustworthiness. In this work, we introduce Short-Long Form Alignment for Factual Question Answering (SLAQ), a controlled evaluation framework that compares LLMs' answers to the same factual questions asked (a) in isolation (short) vs. (b) integrated into complex queries (long). Looking at 16 LLMs across 600 queries, we find a systematic misalignment of answers to the corresponding short and long queries. We further uncover position-dependent accuracy loss and momentum effects where consecutive correct or incorrect answers create self-reinforcing patterns. Through mechanistic analysis, we find that aligned facts activate overlapping model internals, and that metrics based on mechanistic similarity can predict short-long answer alignment with up to 78% accuracy. Our work establishes factual consistency over query complexity as an important aspect of LLMs' trustworthiness and challenges current evaluation practices, which implicitly assume that good performance for simple factual queries implies reliability in more complex knowledge-seeking tasks too.

Discursive Circuits How Do Language Models Understand Discourse Relations?

Authors: Yisong Miao, Min-Yen Kan

2025-10-13

http://arxiv.org/abs/2510.11210v1

Which components in language models are responsible for discourse understanding? We hypothesize that sparse computational graphs, termed discursive circuits, control how models process discourse relations. Unlike simpler tasks, discourse relations involve longer spans and complex reasoning. To make circuit discovery feasible, we introduce a task called Completion under Discourse Relation (CuDR), where a model completes a discourse given a specified relation. To support this task, we construct a corpus of minimal contrastive pairs tailored for activation patching in circuit discovery. Experiments show that sparse circuits ($\approx 0.2\%$ of a full GPT-2 model) recover discourse understanding in the English PDTB-based CuDR task. These circuits generalize well to unseen discourse frameworks such as RST and SDRT. Further analysis shows lower layers capture linguistic features such as lexical semantics and coreference, while upper layers encode discourse-level abstractions. Feature utility is consistent across frameworks (e.g., coreference supports Expansion-like relations).

Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs

Authors: João Paulo Cardoso de Lima, Marc Dietrich, Jeronimo Castrillon, Asif Ali Khan

2025-10-13

http://arxiv.org/abs/2510.11192v1

Structured sparsity enables deploying large language models (LLMs) on resource-constrained systems. Approaches like dense-to-sparse fine-tuning are particularly compelling, achieving remarkable structured sparsity by reducing the model size by over 6.7x, while still maintaining acceptable accuracy. Despite this reduction, LLM inference, especially the decode stage being inherently memory-bound, is extremely expensive on conventional Von-Neumann architectures. Compute-in-memory (CIM) architectures mitigate this by performing computations directly in memory, and when paired with sparse LLMs, enable storing and computing the entire model in memory, eliminating the data movement on the off-chip bus and improving efficiency. Nonetheless, naively mapping sparse matrices onto CIM arrays leads to poor array utilization and diminished computational efficiency. In this paper, we present an automated framework with novel mapping and scheduling strategies to accelerate sparse LLM inference on CIM accelerators. By exploiting block-diagonal sparsity, our approach improves CIM array utilization by over 50%, achieving more than 4x reduction in both memory footprint and the number of required floating-point operations.

Flow Matching-Based Autonomous Driving Planning with Advanced Interactive Behavior Modeling

Authors: Tianyi Tan, Yinan Zheng, Ruiming Liang, Zexu Wang, Kexin Zheng, Jinliang Zheng, Jianxiong Li, Xianyuan Zhan, Jingjing Liu

2025-10-13

http://arxiv.org/abs/2510.11083v1

Modeling interactive driving behaviors in complex scenarios remains a fundamental challenge for autonomous driving planning. Learning-based approaches attempt to address this challenge with advanced generative models, removing the dependency on over-engineered architectures for representation fusion. However, brute-force implementation by simply stacking transformer blocks lacks a dedicated mechanism for modeling interactive behaviors that are common in real driving scenarios. The scarcity of interactive driving data further exacerbates this problem, leaving conventional imitation learning methods ill-equipped to capture high-value interactive behaviors. We propose Flow Planner, which tackles these problems through coordinated innovations in data modeling, model architecture, and learning scheme. Specifically, we first introduce fine-grained trajectory tokenization, which decomposes the trajectory into overlapping segments to decrease the complexity of whole trajectory modeling. With a carefully designed architecture, we achieve efficient temporal and spatial fusion of planning and scene information, to better capture interactive behaviors. In addition, the framework incorporates flow matching with classifier-free guidance for multi-modal behavior generation, which dynamically reweights agent interactions during inference to maintain coherent response strategies, providing a critical boost for interactive scenario understanding. Experimental results on the large-scale nuPlan dataset and challenging interactive interPlan dataset demonstrate that Flow Planner achieves state-of-the-art performance among learning-based approaches while effectively modeling interactive behaviors in complex driving scenarios.

Bit Allocation Transfer for Perceptual Quality Enhancement of VVC Intra Coding

Authors: Runyu Yang, Ivan V. Bajić

2025-10-13

http://arxiv.org/abs/2510.10970v2

Mainstream image and video coding standards -- including state-of-the-art codecs like H.266/VVC, AVS3, and AV1 -- adopt a block-based hybrid coding framework. While this framework facilitates straightforward optimization for Peak Signal-to-Noise Ratio (PSNR), it struggles to effectively optimize perceptually-aligned metrics such as Multi-Scale Structural Similarity (MS-SSIM). To address this challenge, this paper proposes a low-complexity method to enhance perceptual quality in VVC intra coding by transferring bit allocation knowledge from end-to-end image compression. We introduce a lightweight model trained with perceptual losses to generate a quantization step map. This map implicitly captures block-level perceptual importance, enabling efficient derivation of a QP map for VVC. Experiments on Kodak and CLIC datasets demonstrate significant advantages, both in execution time and perceptual metric performance, with more than 11% BD-rate reduction in terms of MS-SSIM. Our scheme provides an efficient, practical pathway for perceptual enhancement of traditional codecs.

Not All Bits Are Equal Scale-Dependent Memory Optimization Strategies for Reasoning Models

Authors: Junhyuck Kim, Ethan Ewer, Taehong Moon, Jongho Park, Dimitris Papailiopoulos

2025-10-13

http://arxiv.org/abs/2510.10964v1

While 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales, we show that this universal prescription fails for reasoning models, where the KV cache rather than model size can dominate memory. Through systematic experiments across 1,700 inference scenarios on AIME25 and GPQA-Diamond, we find a scale-dependent trade-off: models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to more weights rather than longer generation, while larger models achieve better accuracy by allocating memory to longer generations. This scale threshold also determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV quantization. Our findings show that memory optimization for LLMs cannot be scale-agnostic, while providing principled guidelines: for small reasoning models, prioritize model capacity over test-time compute, while for larger ones, maximize test-time compute. Our results suggest that optimizing reasoning models for deployment requires fundamentally different strategies from those established for non-reasoning models.
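
A back-of-the-envelope calculation shows how generation length can come to dominate memory for a reasoning model. The model configuration below (4B parameters, 36 layers, 8 KV heads, head dimension 128) is a hypothetical example chosen for illustration, not a configuration taken from the paper; the formulas are the standard weight-size and per-token key/value cache-size estimates.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Memory needed to store the model weights."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(
    n_layers: int, n_kv_heads: int, head_dim: int, n_tokens: int, bits_per_value: float
) -> float:
    """Memory for cached keys and values: 2 tensors per layer, one entry per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bits_per_value / 8


if __name__ == "__main__":
    # Hypothetical 4B-parameter model with 4-bit weights.
    weights = weight_bytes(4e9, 4)
    # Hypothetical long reasoning trace of 32,768 tokens with a 16-bit cache.
    cache = kv_cache_bytes(
        n_layers=36, n_kv_heads=8, head_dim=128, n_tokens=32768, bits_per_value=16
    )
    print(f"weights: {weights / 1e9:.2f} GB, cache: {cache / 1e9:.2f} GB")
```

Under these assumed numbers the cache for a single long generation is more than twice the size of the 4-bit weights, which is the kind of regime where spending memory on weights versus generation length becomes a real trade-off.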

MC# Mixture Compressor for Mixture-of-Experts Large Models

Authors: Wei Huang, Yue Liao, Yukang Chen, Jianhui Liu, Haoru Tan, Si Liu, Shiming Zhang, Shuicheng Yan, Xiaojuan Qi

2025-10-13

http://arxiv.org/abs/2510.10962v1

Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. However, preloading all experts into memory and activating multiple experts per input introduces significant computational and memory overhead, making the expert module a major contributor to model size and inference cost. To address this, we propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning by leveraging the significance of experts and tokens for aggressive compression of MoE-LLMs/VLMs. To reduce storage and loading costs, we introduce Pre-Loading Mixed-Precision Quantization (PMQ), which optimizes bit allocation via linear programming, balancing expert importance and quantization error for a Pareto-optimal trade-off between size and performance. To reduce runtime computation, Online Top-any Pruning (OTP) uses Gumbel-Softmax sampling to dynamically select a subset of experts per token, enabling fine-grained control over activation. By combining PMQ's static bit-width optimization with OTP's dynamic routing, MC# achieves extreme compression with minimal accuracy loss. On DeepSeek-VL2, MC# achieves a 6.2 times weight reduction at 2.57 average bits with only a 1.7% accuracy drop across five multimodal benchmarks. Additionally, OTP reduces expert activation over 20% with less than 1% performance degradation, demonstrating strong potential for efficient MoE-based model deployment.

KOTOX A Korean Toxic Dataset for Deobfuscation and Detoxification

Authors: Yejin Lee, Su-Hyeon Kim, Hyundong Jin, Dayoung Kim, Yeonsoo Kim, Yo-Sub Han

2025-10-13

http://arxiv.org/abs/2510.10961v1

Toxic content has become an increasingly critical social issue with the rapid expansion of online communication. While numerous studies have explored methods for detecting and detoxifying such content, most have focused primarily on English, leaving low-resource languages underrepresented. Consequently, Large Language Models (LLMs) often struggle to identify and neutralize toxic expressions in these languages. This challenge becomes even more pronounced when users employ obfuscation techniques to evade detection systems. Therefore, we propose KOTOX: Korean Toxic Dataset for deobfuscation and detoxification to address this issue. We categorize various obfuscation approaches based on linguistic characteristics of Korean and define a set of transformation rules grounded in real-world examples. Using these rules, we construct three dataset versions (easy, normal, and hard) representing different levels of obfuscation difficulty. This is the first dataset that simultaneously supports deobfuscation and detoxification for the Korean language. We expect it to facilitate better understanding and mitigation of obfuscated toxic content in LLMs for low-resource languages. Our code and data are available at https://github.com/leeyejin1231/KOTOX.

Authors: Thi-Nhung Nguyen, Linhao Luo, Thuy-Trang Vu, Dinh Phung

2025-10-13

http://arxiv.org/abs/2510.10943v1

Bias in large language models (LLMs) remains a persistent challenge, manifesting in stereotyping and unfair treatment across social groups. While prior research has primarily focused on individual models, the rise of multi-agent systems (MAS), where multiple LLMs collaborate and communicate, introduces new and largely unexplored dynamics in bias emergence and propagation. In this work, we present a comprehensive study of stereotypical bias in MAS, examining how internal specialization, underlying LLMs and inter-agent communication protocols influence bias robustness, propagation, and amplification. We simulate social contexts where agents represent different social groups and evaluate system behavior under various interaction and adversarial scenarios. Experiments on three bias benchmarks reveal that MAS are generally less robust than single-agent systems, with bias often emerging early through in-group favoritism. However, cooperative and debate-based communication can mitigate bias amplification, while more robust underlying LLMs improve overall system stability. Our findings highlight critical factors shaping fairness and resilience in multi-agent systems.

Redundancy as a Structural Information Principle for Learning and Generalization

Authors: Yuda Bi, Ying Zhu, Vince D Calhoun

2025-10-13

http://arxiv.org/abs/2510.10938v1

We present a theoretical framework that extends classical information theory to finite and structured systems by redefining redundancy as a fundamental property of information organization rather than inefficiency. In this framework, redundancy is expressed as a general family of informational divergences that unifies multiple classical measures, such as mutual information, chi-squared dependence, and spectral redundancy, under a single geometric principle. This reveals that these traditional quantities are not isolated heuristics but projections of a shared redundancy geometry. The theory further predicts that redundancy is bounded both above and below, giving rise to an optimal equilibrium that balances over-compression (loss of structure) and over-coupling (collapse). While classical communication theory favors minimal redundancy for transmission efficiency, finite and structured systems, such as those underlying real-world learning, achieve maximal stability and generalization near this equilibrium. Experiments with masked autoencoders are used to illustrate and verify this principle: the model exhibits a stable redundancy level where generalization peaks. Together, these results establish redundancy as a measurable and tunable quantity that bridges the asymptotic world of communication and the finite world of learning.

AwareCompiler Agentic Context-Aware Compiler Optimization via a Synergistic Knowledge-Data Driven Framework

Authors: Hongyu Lin, Haolin Pan, Haoran Luo, Yuchen Li, Kaichun Yao, Libo Zhang, Mingjie Xing, Yanjun Wu

2025-10-13

http://arxiv.org/abs/2510.11759v1

Compiler optimization is crucial for enhancing program performance by transforming the sequence of optimization passes while maintaining correctness. Despite the promising potential of large language model (LLM)-based agents for software optimization, automating compiler optimization remains challenging due to: (1) semantic misalignment between abstract program representations and concrete optimization passes, (2) inefficient interaction mechanisms between agents and compiler environments, and (3) reward sparsity from the extensive decision-making process within large optimization spaces. This paper introduces AwareCompiler, an agentic framework for compiler optimization that addresses these challenges through three key innovations: structured knowledge integration and dataset construction, knowledge-driven adaptive pass generation, and data-driven hybrid training pipeline. Experimental results on standard benchmarks demonstrate that AwareCompiler significantly outperforms existing baselines in both performance and efficiency, highlighting the effectiveness of our synergistic knowledge-data-driven approach. Our code is publicly available at https://github.com/LHY-24/AwareCompiler.

FastHMR Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding

Authors: Soroush Mehraban, Andrea Iaboni, Babak Taati

2025-10-13

http://arxiv.org/abs/2510.10868v1

Recent transformer-based models for 3D Human Mesh Recovery (HMR) have achieved strong performance but often suffer from high computational cost and complexity due to deep transformer architectures and redundant tokens. In this paper, we introduce two HMR-specific merging strategies: Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe). ECLM selectively merges transformer layers that have minimal impact on the Mean Per Joint Position Error (MPJPE), while Mask-ToMe focuses on merging background tokens that contribute little to the final prediction. To further address the potential performance drop caused by merging, we propose a diffusion-based decoder that incorporates temporal context and leverages pose priors learned from large-scale motion capture datasets. Experiments across multiple benchmarks demonstrate that our method achieves up to 2.3x speed-up while slightly improving performance over the baseline.

Agentic RAG for Software Testing with Hybrid Vector-Graph and Multi-Agent Orchestration

Authors: Mohanakrishnan Hariharan, Satish Arvapalli, Seshu Barma, Evangeline Sheela

2025-10-12

http://arxiv.org/abs/2510.10824v1

We present an approach to software testing automation using Agentic Retrieval-Augmented Generation (RAG) systems for Quality Engineering (QE) artifact creation. We combine autonomous AI agents with hybrid vector-graph knowledge systems to automate test plan, case, and QE metric generation. Our approach addresses traditional software testing limitations by leveraging LLMs such as Gemini and Mistral, multi-agent orchestration, and enhanced contextualization. The system achieves remarkable accuracy improvements from 65% to 94.8% while ensuring comprehensive document traceability throughout the quality engineering lifecycle. Experimental validation of enterprise Corporate Systems Engineering and SAP migration projects demonstrates an 85% reduction in testing timeline, an 85% improvement in test suite efficiency, and projected 35% cost savings, resulting in a 2-month acceleration of go-live.

A compressed code for memory discrimination

Authors: Dale Zhou, Sharon Mina Noh, Nora C Harhen, Nidhi V Banavar, C. Brock Kirwan, Michael A Yassa, Aaron M Bornstein

2025-10-12

http://arxiv.org/abs/2510.10791v1

The ability to discriminate similar visual stimuli is an important index of memory function. This ability is widely thought to be supported by expanding the dimensionality of relevant neural codes, such that neural representations for similar stimuli are maximally distinct, or "separated." An alternative hypothesis is that discrimination is supported by lossy compression of visual inputs, efficiently coding sensory information by discarding seemingly irrelevant details. A benefit of compression, relative to expansion, is that it allows individuals to retain fewer essential dimensions underlying stimulus variation -- a process linked to higher-order visual processing -- without hindering discrimination. Under this hypothesis, pattern separation is facilitated when more information from similar stimuli can be discarded, rather than preserved. We test the compression versus expansion hypotheses by predicting performance on the canonical mnemonic similarity task. We train neural networks to compress perceptual and semantic factors of stimuli, measuring lossiness using the mathematical framework underlying compression. Consistent with the compression hypothesis, and not the expansion hypothesis, greater lossiness predicts the ease and performance of lure discrimination, especially in deeper convolutional network layers that predict higher-order visual brain activity. We then confirm these predictions across two image sets, four behavioral datasets, and alternative lossiness metrics. Finally, using task fMRI, we identify signatures of lossy compression -- neural dimensionality reduction and information loss -- in higher-order visual regions V4 and IT and hippocampal DG/CA3 and CA1 linked to lure discrimination. These results suggest lossy compression supports mnemonic discrimination by discarding redundant and overlapping information.

Review of Inference-Time Scaling Strategies Reasoning, Search and RAG

Authors: Zhichao Wang, Cheng Wan, Dong Nie

2025-10-12

http://arxiv.org/abs/2510.10787v1

The performance gains of LLMs have historically been driven by scaling up model size and training data. However, the rapidly diminishing availability of high-quality training data is introducing a fundamental bottleneck, shifting the focus of research toward inference-time scaling. This paradigm uses additional computation at the time of deployment to substantially improve performance on downstream tasks without costly model re-training. This review systematically surveys the diverse techniques contributing to this new era of inference-time scaling, organizing the rapidly evolving field into two comprehensive perspectives: Output-focused and Input-focused methods. Output-focused techniques encompass complex, multi-step generation strategies, including reasoning (e.g., CoT, ToT, ReAct), various search and decoding methods (e.g., MCTS, beam search), training for long CoT (e.g., RLVR, GRPO), and model ensemble methods. Input-focused techniques are primarily categorized by few-shot and RAG, with RAG as the central focus. The RAG section is further detailed through a structured examination of query expansion, data, retrieval and reranker, generation methods, and multi-modal RAG.

ADiP Adaptive Precision Systolic Array for Matrix Multiplication Acceleration

Authors: Ahmed J. Abdelmaksoud, Cristian Sestito, Shiwei Wang, Themis Prodromakis

2025-10-12

http://arxiv.org/abs/2510.10623v1

Transformers are at the core of modern AI. They rely heavily on matrix multiplication and require efficient acceleration due to their substantial memory and computational requirements. Quantization plays a vital role in reducing memory usage, and can be exploited for computations by designing reconfigurable architectures that enhance matrix multiplication by dynamically adjusting the precision. This paper proposes ADiP, a novel adaptive-precision systolic array architecture designed for efficient matrix multiplication acceleration. The proposed architecture consists of NxN adaptive-precision processing elements (PEs) and shared accumulators. ADiP supports multiple computation modes, including symmetric single-matrix multiplication as well as asymmetric multi-matrix multiplication with a shared input matrix, thereby improving data-reuse and PE utilization. In addition, ADiP maximizes the computational density by adapting to different precisions, such as 8bitx8bit, 8bitx4bit, and 8bitx2bit. Analytical models are developed for ADiP architecture, including latency and throughput for versatile architecture configurations. A comprehensive hardware design space exploration is demonstrated using 22nm commercial technology, achieving up to a 4x higher computational throughput. Furthermore, ADiP is evaluated on different workloads from GPT-2 Medium, BERT Large, and BitNet-1.58B models, delivering latency improvement up to 53.6%, and energy improvement up to 24.4% for BitNet-1.58B MHA workloads. At a 64x64 size with 4096 PEs, ADiP achieves a peak throughput of 8.192 TOPS, 16.384 TOPS, and 32.768 TOPS for 8bitx8bit, 8bitx4bit, and 8bitx2bit operations, respectively.
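
The peak-throughput figures quoted above are consistent with a simple back-of-the-envelope model. The sketch below is only an illustration: the 1 GHz clock, the count of two operations per multiply-accumulate, and the number of MACs each PE completes per cycle in each precision mode are assumptions chosen to match the reported numbers, not values stated in the abstract.

```python
def peak_tops(num_pes, macs_per_pe_per_cycle, clock_hz, ops_per_mac=2):
    """Peak throughput in tera-operations per second for a PE array."""
    return num_pes * macs_per_pe_per_cycle * ops_per_mac * clock_hz / 1e12

# Assumed values (not given in the abstract): 1 GHz clock, and lower
# precisions packing proportionally more MACs into each PE per cycle.
ASSUMED_CLOCK_HZ = 1e9
ASSUMED_MACS_PER_CYCLE = {"8bitx8bit": 1, "8bitx4bit": 2, "8bitx2bit": 4}

NUM_PES = 64 * 64  # the 64x64 array with 4096 PEs from the abstract

throughput = {
    mode: peak_tops(NUM_PES, macs, ASSUMED_CLOCK_HZ)
    for mode, macs in ASSUMED_MACS_PER_CYCLE.items()
}

for mode, tops in throughput.items():
    print(f"{mode}: {tops:.3f} TOPS")
```

Under these assumptions, halving the weight precision doubles the work each PE does per cycle, which is one way to read the 2x steps between the three reported figures.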

Preserving LLM Capabilities through Calibration Data Curation From Analysis to Optimization

Authors: Bowei He, Lihao Yin, Huiling Zhen, Shuqi Liu, Han Wu, Xiaokun Zhang, Mingxuan Yuan, Chen Ma

2025-10-12

http://arxiv.org/abs/2510.10618v1

Post-training compression has been a widely employed approach to scale down large language models (LLMs) and facilitate efficient inference. In various proposed methods, including pruning and quantization, calibration data plays a vital role by informing the weight importance and activation dynamic ranges. However, how calibration data impacts the LLM capability after compression is less explored. The few existing works, though recognizing the significance of this study, only investigate the language modeling or commonsense reasoning performance degradation from limited angles, like the data sources or sample amounts. More systematic research is still needed to examine the impacts on different LLM capabilities in terms of compositional properties and domain correspondence of calibration data. In this work, we aim at bridging this gap and further analyze underlying influencing mechanisms from the activation pattern perspective. In particular, we explore the calibration data's impacts on high-level complex reasoning capabilities, like math problem solving and code generation. Delving into the underlying mechanism, we find that the representativeness and diversity in activation space more fundamentally determine the quality of calibration data. Finally, we propose a calibration data curation framework based on such observations and analysis, enhancing the performance of existing post-training compression methods on preserving critical LLM capabilities. Our code is available at https://github.com/BokwaiHo/COLA.git.

Large Language Model-Empowered Channel Prediction and Predictive Beamforming for LEO Satellite Communications

Authors: Zhixiong Chen, Hyundong Shin, Arumugam Nallanathan, Jonathon Chambers

2025-10-12

http://arxiv.org/abs/2510.10561v1

Accurate channel prediction and effective beamforming are essential for low Earth orbit (LEO) satellite communications to enhance system capacity and enable high-speed connectivity. Most existing channel prediction and predictive beamforming methods are limited by model generalization capabilities and struggle to adapt to time-varying wireless propagation environments. Inspired by the remarkable generalization and reasoning capabilities of large language models (LLMs), this work proposes an LLM-based channel prediction framework, namely CP, to forecast future channel state information (CSI) for LEO satellites based on historical CSI data. In the proposed CP, a dedicated CSI encoder is designed to map raw CSI data into the textual embedding space, effectively bridging the modality gap and enabling the LLM to perform reliable reasoning over CSI data. Additionally, a CSI decoder is introduced to simultaneously predict CSI for multiple future time slots, substantially reducing the computational burden and inference latency associated with the inherent autoregressive process of LLMs. Then, instead of training the LLM from scratch, we adopt a parameter-efficient fine-tuning strategy, i.e., LoRA, for CP, where the pretrained LLM remains frozen and trainable low-rank matrices are injected into each Transformer decoder layer to enable effective fine-tuning. Furthermore, we extend CP to directly generate beamforming strategies for future time slots based on historical CSI data, namely BF. This extended framework retains the same architecture as CP, while introducing a dedicated beamforming decoder to output beamforming strategies. Finally, extensive simulation results validate the effectiveness of the proposed approaches in channel prediction and predictive beamforming for LEO satellite communications.

BitMar Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices

Authors: Euhid Aman, Esteban Carlin, Hsing-Kuo Pao, Giovanni Beltrame, Ghaluh Indah Permata Sari, Yie-Tarng Chen

2025-10-12

http://arxiv.org/abs/2510.10560v1

Cross-attention transformers and other multimodal vision-language models excel at grounding and generation; however, their extensive, full-precision backbones make it challenging to deploy them on edge devices. Memory-augmented architectures enhance the utilization of past context; however, most works rarely pair them with aggressive edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that proposes an external human-like episodic memory for effective image-text generation on hardware with limited resources. BitMar utilizes 1.58-bit encoders, one for text (BitNet-style) and one for vision (DINOv2-based), to create compact embeddings that are combined and used to query a fixed-size key-value episodic memory. During vector retrieval, the BitNet decoder applies per-layer conditioning, which increases the contextual relevance of generated content. The decoder also employs attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. The combination of per-layer conditioning and sliding-window attention achieves a strong quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well-suited for edge deployment.

Self-Supervised Representation Learning with ID-Content Modality Alignment for Sequential Recommendation

Authors: Donglin Zhou, Weike Pan, Zhong Ming

2025-10-12

http://arxiv.org/abs/2510.10556v1

Sequential recommendation (SR) models often capture user preferences based on the historically interacted item IDs, which usually yields sub-optimal performance when the interaction history is limited. Content-based sequential recommendation has recently emerged as a promising direction that exploits items' textual and visual features to enhance preference learning. However, there are still three key challenges: (i) how to reduce the semantic gap between different content modality representations; (ii) how to jointly model user behavior preferences and content preferences; and (iii) how to design an effective training strategy to align ID representations and content representations. To address these challenges, we propose a novel model, self-supervised representation learning with ID-Content modality alignment, named SICSRec. Firstly, we propose an LLM-driven sample construction method and develop a supervised fine-tuning approach to align item-level modality representations. Secondly, we design a novel Transformer-based sequential model, where an ID-modality sequence encoder captures user behavior preferences, a content-modality sequence encoder learns user content preferences, and a mix-modality sequence decoder grasps the intrinsic relationship between these two types of preferences. Thirdly, we propose a two-step training strategy with a content-aware contrastive learning task to align modality representations and ID representations, which decouples the training process of content modality dependency and item collaborative dependency. Extensive experiments conducted on four public video streaming datasets demonstrate that SICSRec outperforms the state-of-the-art ID-modality sequential recommenders and content-modality sequential recommenders by 8.04% on NDCG@5 and 6.62% on NDCG@10 on average, respectively.

The Hidden DNA of LLM-Generated JavaScript Structural Patterns Enable High-Accuracy Authorship Attribution

Authors: Norbert Tihanyi, Bilel Cherif, Richard A. Dubniczky, Mohamed Amine Ferrag, Tamás Bisztray

2025-10-12

http://arxiv.org/abs/2510.10493v1

In this paper, we present the first large-scale study exploring whether JavaScript code generated by Large Language Models (LLMs) can reveal which model produced it, enabling reliable authorship attribution and model fingerprinting. With the rapid rise of AI-generated code, attribution is playing a critical role in detecting vulnerabilities, flagging malicious content, and ensuring accountability. While AI-vs-human detection usually treats AI as a single category, we show that individual LLMs leave unique stylistic signatures, even among models belonging to the same family or parameter size. To this end, we introduce LLM-NodeJS, a dataset of 50,000 Node.js back-end programs from 20 large language models. Each has four transformed variants, yielding 250,000 unique JavaScript samples and two additional representations (JSIR and AST) for diverse research applications. Using this dataset, we benchmark traditional machine learning classifiers against fine-tuned Transformer encoders and introduce CodeT5-JSA, a custom architecture derived from the 770M-parameter CodeT5 model with its decoder removed and a modified classification head. It achieves 95.8% accuracy on five-class attribution, 94.6% on ten-class, and 88.5% on twenty-class tasks, surpassing other tested models such as BERT, CodeBERT, and Longformer. We demonstrate that classifiers capture deeper stylistic regularities in program dataflow and structure, rather than relying on surface-level features. As a result, attribution remains effective even after mangling, comment removal, and heavy code transformations. To support open science and reproducibility, we release the LLM-NodeJS dataset, Google Colab training scripts, and all related materials on GitHub: https://github.com/-NodeJS-dataset.

SASER Stego attacks on open-source LLMs

Authors: Ming Tan, Wei Li, Hu Tao, Hailong Ma, Aodi Liu, Qian Chen, Zilong Wang

2025-10-12

http://arxiv.org/abs/2510.10486v1

Open-source large language models (LLMs) have demonstrated considerable dominance over proprietary LLMs in resolving neural processing tasks, thanks to their collaborative and sharing nature. Although full access to source codes, model parameters, and training data lays the groundwork for transparency, we argue that such a full-access manner is vulnerable to stego attacks, and their ill-effects are not fully understood. In this paper, we conduct a systematic formalization for stego attacks on open-source LLMs by enumerating all possible threat models associated with adversary objectives, knowledge, and capabilities. Therein, the threat posed by adversaries with internal knowledge, who inject payloads and triggers during the model sharing phase, is of practical interest. We go even further and propose the first stego attack on open-source LLMs, dubbed SASER, which wields impacts through identifying targeted parameters, embedding payloads, injecting triggers, and executing payloads sequentially. Particularly, SASER enhances the attack robustness against quantization-based local deployment by de-quantizing the embedded payloads. In addition, to achieve stealthiness, SASER devises the performance-aware importance metric to identify targeted parameters with the least degradation of model performance. Extensive experiments on LLaMA2-7B and ChatGLM3-6B, without quantization, show that the stealth rate of SASER outperforms existing stego attacks (for general DNNs) by up to 98.1%, while achieving the same attack success rate (ASR) of 100%. More importantly, SASER improves ASR on quantized models from 0 to 100% in all settings. We appeal for investigations on countermeasures against SASER in view of the significant attack effectiveness.

AnyBCQ Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs

Authors: Gunho Park, Jeongin Bae, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee

2025-10-12

http://arxiv.org/abs/2510.10467v1

The deployment of large language models (LLMs) is increasingly constrained by memory and latency bottlenecks, motivating the need for quantization techniques that flexibly balance accuracy and efficiency. Recent work has introduced multi-precision models, which enable inference at multiple precisions within a single model depending on runtime constraints. To support such flexibility, quantized weights are often stored as bit-planes, where hardware efficiency improves when the compute operates directly at the bit-plane level and activates only the precision required by each request. In this work, we present AnyBCQ, a hardware-friendly multi-precision extension of Binary-Coded Quantization (BCQ) that supports direct bit-plane operations. By representing weights as binary bit-planes with corresponding scale factors, AnyBCQ enables bit-plane-level computation and maps naturally to accelerator-friendly, bit-parallel arithmetic. Our progressive precision expansion mechanism incrementally refines scaling factors while reusing previously assigned binary codes, yielding monotonic improvements in accuracy as additional bits are enabled. We further co-design a specialized kernel that exploits the BCQ structure to support dynamic per-request precision selection with negligible overhead. Experiments on recent LLMs demonstrate that AnyBCQ significantly narrows the accuracy drop in the low-bit regime (e.g. 2-bit), remains competitive at higher precision, and achieves throughput gains of up to 3.0x over half precision and 1.2x over state-of-the-art multi-precision methods. By aligning algorithmic flexibility with hardware efficiency, AnyBCQ provides a practical foundation for multi-precision LLM deployment across diverse service-level objectives.
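
To make the bit-plane representation concrete, here is a minimal sketch of plain greedy binary-coded quantization, where a weight vector is approximated as a sum of scale factors times {-1, +1} planes. This is the textbook greedy residual scheme, not AnyBCQ's progressive precision expansion or its kernel; the function names and the random test vector are hypothetical.

```python
import random

def greedy_bcq(weights, num_bits):
    """Greedy BCQ: each step fits one +/-1 plane and one scale to the residual."""
    residual = list(weights)
    scales, planes = [], []
    for _ in range(num_bits):
        plane = [1 if r >= 0 else -1 for r in residual]
        # For a fixed sign plane, mean(|residual|) is the least-squares scale.
        scale = sum(abs(r) for r in residual) / len(residual)
        residual = [r - scale * b for r, b in zip(residual, plane)]
        scales.append(scale)
        planes.append(plane)
    return scales, planes

def reconstruct(scales, planes, bits):
    """Rebuild the weights using only the first `bits` planes."""
    n = len(planes[0])
    return [sum(scales[i] * planes[i][j] for i in range(bits)) for j in range(n)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(256)]
scales, planes = greedy_bcq(weights, num_bits=4)

# Error at 1, 2, 3, and 4 enabled bit-planes.
errors = [mse(weights, reconstruct(scales, planes, k)) for k in range(1, 5)]
print(errors)
```

Because earlier planes are reused unchanged when more bits are enabled, a single stored model can be read at several precisions, and in this greedy variant the reconstruction error never increases as planes are added.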

Authors: Jinjin Cao, Zhiyang Chen, Zijun Wang, Liyuan Ma, Weijian Luo, Guojun Qi

2025-10-12

http://arxiv.org/abs/2510.10466v1

Vision-Language Models (VLMs) have shown solid ability for multimodal understanding of both visual and language contexts. However, existing VLMs often face severe challenges of hallucinations, meaning that VLMs tend to generate responses that are only fluent in the language but irrelevant to images in previous contexts. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that addresses the hallucinations by leveraging the difference between the output distributions of the original model and the one with degraded visual-language attention. In practice, we adaptively mask the attention weight of the most influential image tokens in selected transformer layers to corrupt the visual-language perception as a concrete type of degradation. Such a degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the ability of VLMs. In experiment sections, we conduct comprehensive studies. All results demonstrate the superior advantages of CMG with neither additional conditions nor training costs. We also quantitatively show CMG can improve different VLMs' performance on hallucination-specific benchmarks and generalize effectively.
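
The idea of using the difference between an original and a degraded output distribution can be sketched with a generic contrastive combination of logits. The formula below, (1 + alpha) * original - alpha * degraded, is a common contrastive-decoding form and is an assumption here; the abstract does not give CMG's exact rule, and the tokens, logits, and `alpha` value are made up for illustration.

```python
def contrastive_logits(original, degraded, alpha=1.0):
    """Amplify what the original model sees beyond the degraded one.

    Assumed form: (1 + alpha) * original - alpha * degraded.
    """
    return [(1.0 + alpha) * o - alpha * d for o, d in zip(original, degraded)]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical vocabulary and logits for one decoding step.
tokens = ["dog", "cat", "the"]
original = [2.0, 2.2, 0.1]   # full model: language prior slightly favors "cat"
degraded = [1.0, 2.6, 0.1]   # image attention masked: the prior dominates

guided = contrastive_logits(original, degraded, alpha=1.0)

print(tokens[argmax(original)])  # choice without guidance
print(tokens[argmax(guided)])    # choice with guidance
```

In this toy case the token whose score drops when visual attention is degraded is the one that depends on the image, so subtracting the degraded logits shifts the choice toward it.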

NIM Neuro-symbolic Ideographic Metalanguage for Inclusive Communication

Authors: Prawaal Sharma, Poonam Goyal, Navneet Goyal, Vidisha Sharma

2025-10-12

http://arxiv.org/abs/2510.10459v1

Digital communication has become the cornerstone of modern interaction, enabling rapid, accessible, and interactive exchanges. However, individuals with lower academic literacy often face significant barriers, exacerbating the "digital divide". In this work, we introduce a novel, universal ideographic metalanguage designed as an innovative communication framework that transcends academic, linguistic, and cultural boundaries. Our approach leverages principles of Neuro-symbolic AI, combining neural-based large language models (LLMs) enriched with world knowledge and symbolic knowledge heuristics grounded in the linguistic theory of Natural Semantic Metalanguage (NSM). This enables the semantic decomposition of complex ideas into simpler, atomic concepts. Adopting a human-centric, collaborative methodology, we engaged over 200 semi-literate participants in defining the problem, selecting ideographs, and validating the system. With over 80% semantic comprehensibility, an accessible learning curve, and universal adaptability, our system effectively serves underprivileged populations with limited formal education.

RobotFleet An Open-Source Framework for Centralized Multi-Robot Task Planning

Authors: Rohan Gupta, Trevor Asbery, Zain Merchant, Abrar Anwar, Jesse Thomason

2025-10-12

http://arxiv.org/abs/2510.10379v1

Coordinating heterogeneous robot fleets to achieve multiple goals is challenging in multi-robot systems. We introduce an open-source and extensible framework for centralized multi-robot task planning and scheduling that leverages LLMs to enable fleets of heterogeneous robots to accomplish multiple tasks. RobotFleet provides abstractions for planning, scheduling, and execution across robots deployed as containerized services to simplify fleet scaling and management. The framework maintains a shared declarative world state and two-way communication for task execution and replanning. By modularizing each layer of the autonomy stack and using LLMs for open-world reasoning, RobotFleet lowers the barrier to building scalable multi-robot systems. The code can be found here: https://github.com/therohangupta/robot-fleet.

SP-MoE Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference

Authors: Liangkun Chen, Zijian Wen, Tian Wu, Xiaoxi Zhang, Chuan Wu

2025-10-11

http://arxiv.org/abs/2510.10302v1

The Mixture-of-Experts (MoE) architecture has been widely adopted in large language models (LLMs) to reduce computation cost through model sparsity. Employing speculative decoding (SD) can further accelerate MoE inference by drafting multiple tokens per step and verifying them in parallel. However, combining MoE with SD inflates GPU memory and aggravates CPU-GPU bandwidth contention during multi-token verification. Existing MoE offloading systems are SD-agnostic and do not address this bottleneck. We present SP-MoE, the first SD-aware expert-offloading and compute-communication pipelining framework. SP-MoE introduces: (1) speculative expert prefetching that exploits structural correspondence between the draft and target models to prefetch likely experts ahead of verification; (2) a cutoff-layer policy that bounds per-layer prefetch depth based on empirical profiles and an analytical latency model, guaranteeing just-in-time availability without overfetch; and (3) a pipelined runtime with asynchronous prefetch threads and batched I/O to hide loading latency. Extensive experiments demonstrate that SP-MoE achieves a 1.07-3.5x TPOT speedup over state-of-the-art methods across diverse datasets, environments, and MoE-based models.

Grounded AI for Code Review Resource-Efficient Large-Model Serving in Enterprise Pipelines

Authors: Sayan Mandal, Hua Jiang

2025-10-11

http://arxiv.org/abs/2510.10290v1

Automated code review adoption lags in compliance-heavy settings, where static analyzers produce high-volume, low-rationale outputs, and naive LLM use risks hallucination and cost overhead. We present a production system for grounded, PR-native review that pairs static-analysis findings with AST-guided context extraction and a single-GPU, on-demand serving stack (quantized open-weight model, multi-tier caching) to deliver concise explanations and remediation guidance. Evaluated on safety-oriented C/C++ standards, the approach achieves sub-minute median first-feedback (offline p50 build+LLM 59.8s) while maintaining competitive violation reduction and lower violation rates versus larger proprietary models. The architecture is decoupled: teams can adopt the grounding/prompting layer or the serving layer independently. A small internal survey (n=8) provides directional signals of reduced triage effort and moderate perceived grounding, with participants reporting fewer human review iterations. We outline operational lessons and limitations, emphasizing reproducibility, auditability, and pathways to broader standards and assisted patching.

The Achilles' Heel of LLMs How Altering a Handful of Neurons Can Cripple Language Abilities

Authors: Zixuan Qin, Kunlin Lyu, Qingchen Yu, Yifan Sun, Zhaoxin Fan

2025-10-11

http://arxiv.org/abs/2510.10238v1

Large Language Models (LLMs) have become foundational tools in natural language processing, powering a wide range of applications and research. Many studies have shown that LLMs share significant similarities with the human brain. Recent neuroscience research has found that a small subset of biological neurons in the human brain are crucial for core cognitive functions, which raises a fundamental question: do LLMs also contain a small subset of critical neurons? In this paper, we investigate this question by proposing a Perturbation-based Causal Identification of Critical Neurons method to systematically locate such critical neurons in LLMs. Our findings reveal three key insights: (1) LLMs contain ultra-sparse critical neuron sets. Disrupting these critical neurons can cause a 72B-parameter model with over 1.1 billion neurons to completely collapse, with perplexity increasing by up to 20 orders of magnitude; (2) These critical neurons are not uniformly distributed, but tend to concentrate in the outer layers, particularly within the MLP down_proj components; (3) Performance degradation exhibits sharp phase transitions, rather than a gradual decline, when these critical neurons are disrupted. Through comprehensive experiments across diverse model architectures and scales, we provide deeper analysis of these phenomena and their implications for LLM robustness and interpretability. These findings can offer guidance for developing more robust model architectures and improving deployment security in safety-critical applications.

ISAAC Intelligent, Scalable, Agile, and Accelerated CPU Verification via LLM-aided FPGA Parallelism

Authors: Jialin Sun, Yuchen Hu, Dean You, Yushu Du, Hui Wang, Xinwei Fang, Weiwei Shan, Nan Guan, Zhe Jiang

2025-10-11

http://arxiv.org/abs/2510.10225v1

Functional verification is a critical bottleneck in integrated circuit development, with CPU verification being especially time-intensive and labour-consuming. Industrial practice relies on differential testing for CPU verification, yet faces bottlenecks at nearly each stage of the framework pipeline: front-end stimulus generation lacks micro-architectural awareness, yielding low-quality and redundant tests that impede coverage closure and miss corner cases. Meanwhile, back-end simulation infrastructure, even with FPGA acceleration, often stalls on long-running tests and offers limited visibility, delaying feedback and prolonging the debugging cycle. Here, we present ISAAC, a full-stack, Large Language Model (LLM)-aided CPU verification framework with FPGA parallelism, from bug categorisation and stimulus generation to simulation infrastructure. To do so, we present a multi-agent stimulus engine in ISAAC's front-end, infused with micro-architectural knowledge and historical bug patterns, generating highly targeted tests that rapidly achieve coverage goals and capture elusive corner cases. In ISAAC's back-end, we introduce a lightweight forward-snapshot mechanism and a decoupled co-simulation architecture between the Instruction Set Simulator (ISS) and the Design Under Test (DUT), enabling a single ISS to drive multiple DUTs in parallel. By eliminating long-tail test bottlenecks and exploiting FPGA parallelism, the simulation throughput is significantly improved. As a demonstration, we used ISAAC to verify a mature CPU that has undergone multiple successful tape-outs. Results show up to 17,536x speed-up over software RTL simulation, while detecting several previously unknown bugs, two of which are reported in this paper.

BILLY Steering Large Language Models via Merging Persona Vectors for Creative Generation

Authors: Tsung-Min Pai, Jui-I Wang, Li-Chun Lu, Shao-Hua Sun, Hung-Yi Lee, Kai-Wei Chang

2025-10-11

http://arxiv.org/abs/2510.10157v1

Multi-LLM systems enhance the creativity of large language models by simulating human collective intelligence but suffer from significant drawbacks, such as high computational costs and inference latency. To address these limitations, we propose BILLY (BlendIng persona vectors for Large Language model creativitY), a training-free framework that captures the benefits of multi-LLM collaboration, i.e. inducing diverse perspectives and specialized expertise, within a single model. BILLY operates by extracting and blending multiple distinct persona vectors directly in the model's activation space. We steer the model's generation process with this merged vector during inference, enabling multi-perspective output without explicit multi-LLM communication. Our experiments across creativity-oriented benchmarks demonstrate that BILLY surpasses single model prompting and traditional multi-LLM approaches, while substantially reducing inference time and computational costs. Our analyses further reveal that distinct persona vectors can be blended to achieve both effective control over complementary aspects of generation and greater interpretability.

A Unified Frequency Domain Decomposition Framework for Interpretable and Robust Time Series Forecasting

Authors: Cheng He, Xijie Liang, Zengrong Zheng, Patrick P. C. Lee, Xu Huang, Zhaoyi Li, Hong Xie, Defu Lian, Enhong Chen

2025-10-11

http://arxiv.org/abs/2510.10145v1

Current approaches for time series forecasting, whether in the time or frequency domain, predominantly use deep learning models based on linear layers or transformers. They often encode time series data in a black-box manner and rely on trial-and-error optimization solely based on forecasting performance, leading to limited interpretability and theoretical understanding. Furthermore, the dynamics in data distribution over time and frequency domains pose a critical challenge to accurate forecasting. We propose FIRE, a unified frequency domain decomposition framework that provides a mathematical abstraction for diverse types of time series, so as to achieve interpretable and robust time series forecasting. FIRE introduces several key innovations: (i) independent modeling of amplitude and phase components, (ii) adaptive learning of weights of frequency basis components, (iii) a targeted loss function, and (iv) a novel training paradigm for data. Extensive experiments demonstrate that FIRE consistently outperforms state-of-the-art models on long-term forecasting benchmarks, achieving superior predictive performance and significantly enhancing interpretability of time series.

PermLLM Learnable Channel Permutation for NM Sparse Large Language Models

Authors: Lancheng Zou, Shuo Yin, Zehua Pei, Tsung-Yi Ho, Farzan Farnia, Bei Yu

2025-10-11

http://arxiv.org/abs/2510.10136v1

Channel permutation is a powerful technique for enhancing the accuracy of N:M sparse models by reordering the channels of weight matrices to prioritize the retention of important weights. However, traditional channel permutation methods rely on handcrafted quality metrics, which often fail to accurately capture the true impact of pruning on model performance. To address this limitation, we propose PermLLM, a novel post-training pruning framework that introduces learnable channel permutation (LCP) for N:M sparsity. LCP leverages Sinkhorn normalization to transform discrete permutation matrices into differentiable soft permutation matrices, enabling end-to-end optimization. Additionally, PermLLM incorporates an efficient block-wise channel permutation strategy, which significantly reduces the number of learnable parameters and computational complexity. PermLLM seamlessly integrates with existing one-shot pruning methods to adaptively optimize channel permutations, effectively mitigating pruning-induced errors. Extensive experiments on the LLaMA series, Qwen, and OPT models demonstrate that PermLLM achieves superior performance in optimizing N:M sparse models. The code is available at https://github.com/lanchengzou/Perm.
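
Sinkhorn normalization, which the abstract names as the relaxation behind learnable channel permutation, can be sketched in a few lines: exponentiate a score matrix and alternately normalize rows and columns until the result is approximately doubly stochastic. This is a generic illustration, not PermLLM's implementation; the score matrix, temperature `tau`, and iteration count are made-up values.

```python
import math

def sinkhorn(scores, tau=1.0, num_iters=50):
    """Turn a square score matrix into a soft (doubly stochastic) permutation."""
    n = len(scores)
    # Positive matrix from scores; smaller tau pushes it toward a hard permutation.
    p = [[math.exp(s / tau) for s in row] for row in scores]
    for _ in range(num_iters):
        # Row normalization: each row sums to 1.
        p = [[v / sum(row) for v in row] for row in p]
        # Column normalization: each column sums to 1.
        col_sums = [sum(p[i][j] for i in range(n)) for j in range(n)]
        p = [[p[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return p

# Hypothetical learnable scores for permuting 3 channels.
scores = [
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.4],
    [0.1, 0.3, 1.8],
]
soft_perm = sinkhorn(scores)

row_sums = [sum(row) for row in soft_perm]
col_sums = [sum(soft_perm[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums)
```

Every step here is differentiable with respect to the scores, which is what allows gradients to flow through the relaxed permutation; a hard permutation would still need to be extracted from the soft matrix afterwards.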

CacheClip Accelerating RAG with Effective KV Cache Reuse

Authors: Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu

2025-10-11

http://arxiv.org/abs/2510.10129v1

Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods like APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs (the target model for generation), enabling efficient identification of tokens critical for restoring inter-chunk attention, thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates three techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, where the auxiliary model is finetuned to improve selection accuracy, (2) shared prefixes to eliminate redundant attention sinks, and (3) a grouping strategy to maintain local coherence during partial KV cache updates. Experiments show CacheClip retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench, outperforming APE and CacheBlend by 25.2% and 35.1% on NIAH (with recomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 1.92x in prefill time, providing a practical solution to the efficiency-quality trade-off in RAG systems.

Lighter-X An Efficient and Plug-and-play Strategy for Graph-based Recommendation through Decoupled Propagation

Authors: Yanping Zheng, Zhewei Wei, Frank de Hoog, Xu Chen, Hongteng Xu, Yuhang Ye, Jiadeng Huang

2025-10-11

http://arxiv.org/abs/2510.10105v1

Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness in recommendation systems. However, conventional graph-based recommenders, such as LightGCN, require maintaining embeddings of size $d$ for each node, resulting in a parameter complexity of $\mathcal{O}(n \times d)$, where $n$ represents the total number of users and items. This scaling pattern poses significant challenges for deployment on the large-scale graphs encountered in real-world applications. To address this scalability limitation, we propose Lighter-X, an efficient and modular framework that can be seamlessly integrated with existing GNN-based recommender architectures. Our approach substantially reduces both parameter size and computational complexity while preserving the theoretical guarantees and empirical performance of the base models, thereby enabling practical deployment at scale. Specifically, we analyze the original structure of the base models and the inherent redundancy in their parameters, identifying opportunities for optimization. Based on this insight, we propose an efficient compression scheme for the adjacency structure and high-dimensional embedding matrices, achieving a parameter complexity of $\mathcal{O}(h \times d)$, where $h \ll n$. Furthermore, the model is optimized through a decoupled framework, reducing computational complexity during training and enhancing scalability. Extensive experiments demonstrate that Lighter-X achieves performance comparable to baseline models with significantly fewer parameters. In particular, on large-scale interaction graphs with millions of edges, we attain even better results than LightGCN with only 1% of its parameters.

P-4DGS Predictive 4D Gaussian Splatting with 90 $\times$ Compression

Authors: Henan Wang, Hanxin Zhu, Xinliang Gong, Tianyu He, Xin Li, Zhibo Chen

2025-10-11

http://arxiv.org/abs/2510.10030v1

3D Gaussian Splatting (3DGS) has garnered significant attention due to its superior scene representation fidelity and real-time rendering performance, especially for dynamic 3D scene reconstruction (i.e., 4D reconstruction). However, despite achieving promising results, most existing algorithms overlook the substantial temporal and spatial redundancies inherent in dynamic scenes, leading to prohibitive memory consumption. To address this, we propose P-4DGS, a novel dynamic 3DGS representation for compact 4D scene modeling. Inspired by the intra- and inter-frame prediction techniques commonly used in video compression, we first design a 3D anchor point-based spatial-temporal prediction module to fully exploit the spatial-temporal correlations across different 3D Gaussian primitives. Subsequently, we employ an adaptive quantization strategy combined with context-based entropy coding to further reduce the size of the 3D anchor points, thereby achieving enhanced compression efficiency. To evaluate the rate-distortion performance of P-4DGS in comparison with other dynamic 3DGS representations, we conduct extensive experiments on both synthetic and real-world datasets. Experimental results demonstrate that our approach achieves state-of-the-art reconstruction quality and the fastest rendering speed, with a remarkably low storage footprint (around 1MB on average), achieving up to 40$\times$ and 90$\times$ compression on synthetic and real-world scenes, respectively.

Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization

Authors: Yang Li, Ruichen Zhang, Yinqiu Liu, Guangyuan Liu, Dusit Niyato, Abbas Jamalipour, Xianbin Wang, Dong In Kim

2025-10-11

http://arxiv.org/abs/2510.10028v1

The rapid advancement of Low-Altitude Economy Networks (LAENets) has enabled a variety of applications, including aerial surveillance, environmental sensing, and semantic data collection. To support these scenarios, unmanned aerial vehicles (UAVs) equipped with onboard vision-language models (VLMs) offer a promising solution for real-time multimodal inference. However, ensuring both inference accuracy and efficiency remains a significant challenge due to limited onboard resources and dynamic network conditions. In this paper, we first propose a UAV-enabled LAENet system model that jointly captures UAV mobility, user-UAV communication, and the onboard visual question answering (VQA) pipeline. Based on this model, we formulate a mixed-integer non-convex optimization problem to minimize task latency and power consumption under user-specific accuracy constraints. To solve the problem, we design a hierarchical optimization framework composed of two parts: (i) an Alternating Resolution and Power Optimization (ARPO) algorithm for resource allocation under accuracy constraints, and (ii) a Large Language Model-augmented Reinforcement Learning Approach (LLaRA) for adaptive UAV trajectory optimization. The large language model (LLM) serves as an expert that refines the reward design of reinforcement learning in an offline fashion, introducing no additional latency in real-time decision-making. Numerical results demonstrate the efficacy of our proposed framework in improving inference performance and efficiency under dynamic LAENet conditions.

Deliberative Dynamics and Value Alignment in LLM Debates

Authors: Pratik S. Sachdeva, Tom van Nuenen

2025-10-11

http://arxiv.org/abs/2510.10002v1

As large language models (LLMs) are increasingly deployed in sensitive everyday contexts - offering personal advice, mental health support, and moral guidance - understanding their elicited values in navigating complex moral reasoning is essential. Most evaluations study this sociotechnical alignment through single-turn prompts, but it is unclear if these findings extend to multi-turn settings where values emerge through dialogue, revision, and consensus. We address this gap using debate to examine deliberative dynamics and value alignment in multi-turn settings by prompting subsets of three models (GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash) to collectively assign blame in 1,000 everyday dilemmas from Reddit's "Am I the Asshole" community. We use both synchronous (parallel responses) and round-robin (sequential responses) formats to test order effects and verdict revision. Our findings show striking behavioral differences. In the synchronous setting, GPT showed strong inertia (0.6-3.1% revision rates) while Claude and Gemini were far more flexible (28-41%). Value patterns also diverged: GPT emphasized personal autonomy and direct communication, while Claude and Gemini prioritized empathetic dialogue. Certain values proved especially effective at driving verdict changes. We further find that deliberation format had a strong impact on model behavior: GPT and Gemini stood out as highly conforming relative to Claude, with their verdict behavior strongly shaped by order effects. These results show how deliberation format and model-specific behaviors shape moral reasoning in multi-turn interactions, underscoring that sociotechnical alignment depends on how systems structure dialogue as much as on their outputs.

Universal Discrete-Domain Speech Enhancement

Authors: Fei Liu, Yang Ai, Ye-Xin Lu, Rui-Chen Zheng, Hui-Peng Du, Zhen-Hua Ling

2025-10-11

http://arxiv.org/abs/2510.09974v1

In real-world scenarios, speech signals are inevitably corrupted by various types of interference, making speech enhancement (SE) a critical task for robust speech processing. However, most existing SE methods only handle a limited range of distortions, such as additive noise, reverberation, or band limitation, while the study of SE under multiple simultaneous distortions remains limited. This gap affects the generalization and practical usability of SE methods in real-world environments. To address this gap, this paper proposes a novel Universal Discrete-domain SE model called UDSE. Unlike regression-based SE models that directly predict the clean speech waveform or continuous features, UDSE redefines SE as a discrete-domain classification task, instead predicting the clean discrete tokens quantized by the residual vector quantizer (RVQ) of a pre-trained neural speech codec. Specifically, UDSE first extracts global features from the degraded speech. Guided by these global features, the clean token prediction for each vector quantizer (VQ) follows the rules of RVQ, where the prediction of each VQ relies on the results of the preceding ones. Finally, the predicted clean tokens from all VQs are decoded to reconstruct the clean speech waveform. During training, the UDSE model employs a teacher-forcing strategy and is optimized with a cross-entropy loss. Experimental results confirm that the proposed UDSE model can effectively enhance speech degraded by various conventional and unconventional distortions, e.g., additive noise, reverberation, band limitation, clipping, phase distortion, and compression distortion, as well as their combinations. These results demonstrate the superior universality and practicality of UDSE compared to advanced regression-based SE methods.
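
To make the RVQ dependency structure concrete, here is a minimal sketch of residual vector quantization (RVQ) itself, not of the UDSE model. The codebooks, the vector, and the function names are hypothetical and chosen only for illustration. Each stage quantizes the residual left by the previous stages, which is why the token at one stage depends on the tokens chosen before it.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(code):
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Return one token per stage; each stage quantizes the running residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen codeword; the next stage sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct the vector by summing the chosen codeword of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy two-stage codebooks (hypothetical values): coarse stage, then fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
tokens = rvq_encode([1.2, 0.8], codebooks)
print(tokens, rvq_decode(tokens, codebooks))
```

In a codec, a frame is represented by the list of per-stage token indices; a token-prediction model would output such indices rather than continuous features.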

Conformal Sparsification for Bandwidth-Efficient Edge-Cloud Speculative Decoding

Authors: Payel Bhattacharjee, Fengwei Tian, Meiyu Zhong, Guangyi Zhang, Osvaldo Simeone, Ravi Tandon

2025-10-11

http://arxiv.org/abs/2510.09942v1

Edge-cloud speculative decoding (SD) accelerates inference by having a cloud-based large language model (LLM) verify draft tokens generated by a resource-constrained small language model (SLM) at the edge. A central bottleneck is the limited bandwidth of the edge-cloud link, which necessitates efficient compression of draft token distributions. We first derive an information-theoretic bound that decomposes the token rejection rate into contributions from SLM-LLM distribution mismatch and from quantization distortion. Guided by this analysis, we propose the Sparse Quantize-and-Sample SD (SQS-SD) framework, which exploits distributional sparsity through structured sparsification and lattice-based quantization. Within this framework, K-SQS applies fixed top-K truncation, while C-SQS adaptively adjusts the retained token set via online conformal prediction to ensure bounded deviation from the dense distribution. Empirical results confirm that both approaches improve end-to-end latency and rejection rates in complementary operating regimes.
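
The fixed top-K truncation step can be sketched as follows. This is an illustrative example only, not the paper's SQS-SD implementation: it omits the lattice quantization and conformal components, and the function names and the toy distribution are assumptions. It keeps the K most probable draft tokens, renormalizes them, and measures the deviation from the dense distribution as total variation distance, which for this truncate-and-renormalize scheme equals the discarded probability mass.

```python
def top_k_sparsify(probs, k):
    """Keep the k largest probabilities and renormalize them.

    Returns (sparse, mass): sparse maps token index -> renormalized
    probability, and mass is the probability retained before renormalizing.
    """
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}, mass


def total_variation(probs, sparse):
    """Total variation distance between the dense and the sparse distribution."""
    return 0.5 * sum(abs(p - sparse.get(i, 0.0)) for i, p in enumerate(probs))


# Toy draft distribution over a five-token vocabulary (made-up numbers).
draft = [0.5, 0.2, 0.15, 0.1, 0.05]
sparse, mass = top_k_sparsify(draft, k=2)
print(sparse, mass, total_variation(draft, sparse))
```

Only the K retained indices and their probabilities would need to cross the link, so a smaller K saves bandwidth but increases the deviation from the dense distribution.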

The Ethics Engine A Modular Pipeline for Accessible Psychometric Assessment of Large Language Models

Authors: Jake Van Clief, Constantine Kyritsopoulos

2025-10-11

http://arxiv.org/abs/2510.11742v1

As Large Language Models (LLMs) increasingly mediate human communication and decision-making, understanding their value expression becomes critical for research across disciplines. This work presents the Ethics Engine, a modular Python pipeline that transforms psychometric assessment of LLMs from a technically complex endeavor into an accessible research tool. The pipeline demonstrates how thoughtful infrastructure design can expand participation in AI research, enabling investigators across cognitive science, political psychology, education, and other fields to study value expression in language models. Recent adoption by University of Edinburgh researchers studying authoritarianism validates its research utility, processing over 10,000 AI responses across multiple models and contexts. We argue that such tools fundamentally change the landscape of AI research by lowering technical barriers while maintaining scientific rigor. As LLMs increasingly serve as cognitive infrastructure, their embedded values shape millions of daily interactions. Without systematic measurement of these value expressions, we deploy systems whose moral influence remains uncharted. The Ethics Engine enables the rigorous assessment necessary for informed governance of these influential technologies.