2025-10-17
Table of Contents
- Breadcrumbs Reasoning Memory-Efficient Reasoning with Compression Beacons
- Invited Paper BitMedViT Ternary-Quantized Vision Transformer for Medical AI Assistants on the Edge
- Don't Be Greedy, Just Relax! Pruning LLMs via Frank-Wolfe
- How Sampling Affects the Detectability of Machine-written texts A Comprehensive Study
- Adaptive Rescheduling in Prefill-Decode Disaggregated LLM Inference
- Time Series Foundation Models Benchmarking Challenges and Requirements
- NOSA Native and Offloadable Sparse Attention
- DOLFIN Balancing Stability and Plasticity in Federated Continual Learning
- Steer-MoE Efficient Audio-Language Alignment with a Mixture-of-Experts Steering Module
- MedREK Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts
- Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers
- F-BFQ Flexible Block Floating-Point Quantization Accelerator for LLMs
- Make an Offer They Can't Refuse Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment
- Document Intelligence in the Era of Large Language Models A Survey
- Taming the Fragility of KV Cache Eviction in LLM Inference
- ChatR1 Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering
- BanaServe Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
- DSCD Large Language Model Detoxification with Self-Constrained Decoding
- A Dimension-Keeping Semi-Tensor Product Framework for Compressed Sensing
- Mirror Speculative Decoding Breaking the Serial Barrier in LLM Inference
- Retrieval-in-the-Chain Bootstrapping Large Language Models for Generative Retrieval
- NeuroRVQ Multi-Scale EEG Tokenization for Generative Large Brainwave Models
- Neural Approximate Inverse Preconditioners
- Computationally Efficient Neural Receivers via Axial Self-Attention
- Pruning Cannot Hurt Robustness Certified Trade-offs in Reinforcement Learning
- Gaussian Process Implicit Surfaces as Control Barrier Functions for Safe Robot Navigation
- KVCOMM Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
- What If Understanding Motion Through Sparse Interactions
- CARVQ Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression
- Enhanced Angle-Range Cluster Parameter Estimation in Full-Duplex ISAC Systems
- Low Latency, High Bandwidth Streaming of Experimental Data with EJFAT
- Teaching Language Models to Faithfully Express their Uncertainty
- SMEC Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression
- Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems
- Probing Latent Knowledge Conflict for Faithful Retrieval-Augmented Generation
- VideoLucy Deep Memory Backtracking for Long Video Understanding
- PricingLogic Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks
- Efficient Adaptive Transformer An Empirical Study and Reproducible Framework
- An Empirical Study of Reducing AV1 Decoder Complexity and Energy Consumption via Encoder Parameter Tuning
- CurriFlow Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion
- Traveling Salesman-Based Token Ordering Improves Stability in Homomorphically Encrypted Language Models
- CoLF Logic Programming as Infinitary Proof Exploration
- Reinforced Preference Optimization for Recommendation
- A Survey on Parallel Reasoning
- FedLoDrop Federated LoRA with Dropout for Generalized LLM Fine-tuning
- Compressibility Measures Complexity Minimum Description Length Meets Singular Learning Theory
- GeoPipe a Geo-distributed LLM Training Framework with enhanced Pipeline Parallelism in a Lossless RDMA-enabled Datacenter Optical Transport Network
- APCE Adaptive Progressive Context Expansion for Long Context Processing
- Direct Multi-Token Decoding
- FlexPipe Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
- Topological Vibration Analysis of Elastic Lattices via Bloch Sphere Mapping
- Indoor Localization using Compact, Telemetry-Agnostic, Transfer-Learning Enabled Decoder-Only Transformer
- Variational Mixture of Graph Neural Experts for Alzheimer's Disease Biomarker Recognition in EEG Brain Networks
- QeRL Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
- Scaling Language-Centric Omnimodal Representation Learning
- Diffusion Transformers with Representation Autoencoders
- Hierarchical Qubit-Merging Transformer for Quantum Error Correction
- Culturally-Aware Conversations A Framework & Benchmark for LLMs
- Situat3DChange Situated 3D Change Understanding Dataset for Multimodal Large Language Model
- ReLook Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
- AndesVL Technical Report An Efficient Mobile-side Multimodal Large Language Model
- From to Multidimensional Supervision of Reasoning Process for LLM Optimization
- Multi-View Graph Feature Propagation for Privacy Preservation and Feature Sparsity
- Efficient LLM Inference over Heterogeneous Edge Networks with Speculative Decoding
- XQuant Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
- The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers
- Discursive Circuits How Do Language Models Understand Discourse Relations?
- Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs
- Flow Matching-Based Autonomous Driving Planning with Advanced Interactive Behavior Modeling
- Bit Allocation Transfer for Perceptual Quality Enhancement of VVC Intra Coding
- Not All Bits Are Equal Scale-Dependent Memory Optimization Strategies for Reasoning Models
- MC# Mixture Compressor for Mixture-of-Experts Large Models
- KOTOX A Korean Toxic Dataset for Deobfuscation and Detoxification
- The Social Cost of Intelligence Emergence, Propagation, and Amplification of Stereotypical Bias in Multi-Agent Systems
- Redundancy as a Structural Information Principle for Learning and Generalization
- AwareCompiler Agentic Context-Aware Compiler Optimization via a Synergistic Knowledge-Data Driven Framework
- FastHMR Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding
- Agentic RAG for Software Testing with Hybrid Vector-Graph and Multi-Agent Orchestration
- A compressed code for memory discrimination
- Review of Inference-Time Scaling Strategies Reasoning, Search and RAG
- ADiP Adaptive Precision Systolic Array for Matrix Multiplication Acceleration
- Preserving LLM Capabilities through Calibration Data Curation From Analysis to Optimization
- Large Language Model-Empowered Channel Prediction and Predictive Beamforming for LEO Satellite Communications
- BitMar Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
- Self-Supervised Representation Learning with ID-Content Modality Alignment for Sequential Recommendation
- The Hidden DNA of LLM-Generated JavaScript Structural Patterns Enable High-Accuracy Authorship Attribution
- SASER Stego attacks on open-source LLMs
- AnyBCQ Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs
- When Images Speak Louder Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance
- NIM Neuro-symbolic Ideographic Metalanguage for Inclusive Communication
- RobotFleet An Open-Source Framework for Centralized Multi-Robot Task Planning
- SP-MoE Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
- Grounded AI for Code Review Resource-Efficient Large-Model Serving in Enterprise Pipelines
- The Achilles' Heel of LLMs How Altering a Handful of Neurons Can Cripple Language Abilities
- ISAAC Intelligent, Scalable, Agile, and Accelerated CPU Verification via LLM-aided FPGA Parallelism
- BILLY Steering Large Language Models via Merging Persona Vectors for Creative Generation
- A Unified Frequency Domain Decomposition Framework for Interpretable and Robust Time Series Forecasting
- PermLLM Learnable Channel Permutation for N:M Sparse Large Language Models
- CacheClip Accelerating RAG with Effective KV Cache Reuse
- Lighter-X An Efficient and Plug-and-play Strategy for Graph-based Recommendation through Decoupled Propagation
- P-4DGS Predictive 4D Gaussian Splatting with 90x Compression
- Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization
- Deliberative Dynamics and Value Alignment in LLM Debates
- Universal Discrete-Domain Speech Enhancement
- Conformal Sparsification for Bandwidth-Efficient Edge-Cloud Speculative Decoding
- The Ethics Engine A Modular Pipeline for Accessible Psychometric Assessment of Large Language Models
Breadcrumbs Reasoning Memory-Efficient Reasoning with Compression Beacons
Authors: Giovanni Monea, Yair Feldman, Shankar Padmanabhan, Kianté Brantley, Yoav Artzi
2025-10-15
The scalability of large language models for long-context reasoning is
severely constrained by the linear growth of their Transformer key-value cache,
which incurs significant memory and computational costs. We posit that as a
model generates reasoning tokens, the informational value of past generated
tokens diminishes, creating an opportunity for compression. In this work, we
propose to periodically compress the generation KV cache with a learned,
special-purpose token and evict compressed entries. We train the model to
perform this compression via a modified joint distillation and reinforcement
learning (RL) framework. Our training method minimizes overhead over the
conventional RL process, as it leverages RL outputs for distillation.
Empirically, our method achieves a superior memory-accuracy Pareto frontier
compared to both the model without cache compression and training-free
compression techniques.
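The periodic compress-and-evict idea can be illustrated with simple cache bookkeeping. This is a minimal sketch, not the authors' implementation: the function name `cache_sizes`, the fixed `window` length, and the assumption that one beacon entry replaces each full window are all hypothetical choices made for illustration.

```python
def cache_sizes(num_tokens, window):
    """Track KV cache length when each full window of generated tokens
    is summarized by one beacon entry and then evicted (assumed policy)."""
    beacons = 0   # compressed entries kept permanently
    pending = 0   # uncompressed entries of the current window
    sizes = []
    for _ in range(num_tokens):
        pending += 1
        if pending == window:
            beacons += 1  # one beacon stands in for the whole window
            pending = 0   # the window's own entries are evicted
        sizes.append(beacons + pending)
    return sizes


if __name__ == "__main__":
    print(cache_sizes(8, 4))  # [1, 2, 3, 1, 2, 3, 4, 2]
```

Under these assumptions the cache grows by roughly one entry per window instead of one entry per token, which is the source of the memory savings the abstract describes.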
Invited Paper BitMedViT Ternary-Quantized Vision Transformer for Medical AI Assistants on the Edge
Authors: Mikolaj Walczak, Uttej Kallakuri, Edward Humes, Xiaomin Lin, Tinoosh Mohsenin
2025-10-15
Vision Transformers (ViTs) have demonstrated strong capabilities in
interpreting complex medical imaging data. However, their significant
computational and memory demands pose challenges for deployment in real-time,
resource-constrained mobile and wearable devices used in clinical environments.
We introduce BiTMedViT, a new class of Edge ViTs that serve as medical AI
assistants and perform structured analysis of medical images directly on the
edge. BiTMedViT utilizes ternary-quantized linear layers tailored for medical
imaging and combines a training procedure with multi-query attention,
preserving stability under ternary weights with low-precision activations.
Furthermore, BiTMedViT employs task-aware distillation from a high-capacity
teacher to recover accuracy lost due to extreme quantization. We also
present a pipeline that maps the ternarized ViTs to a custom CUDA kernel for
efficient memory bandwidth utilization and latency reduction on the Jetson Orin
Nano. BiTMedViT achieves 86% diagnostic accuracy (versus 89% for SOTA) on
MedMNIST across 12 datasets, while reducing model size by 43x, memory traffic
by 39x, and enabling 16.8 ms inference at an energy efficiency up to 41x that
of SOTA models at 183.62 GOPs/J on the Orin Nano. Our results demonstrate a
practical and scientifically grounded route for extreme-precision medical
imaging ViTs deployable on the edge, narrowing the gap between algorithmic
advances and deployable clinical tools.
Don't Be Greedy, Just Relax! Pruning LLMs via Frank-Wolfe
Authors: Christophe Roux, Max Zimmer, Alexandre d'Aspremont, Sebastian Pokutta
2025-10-15
Pruning is a common technique to reduce the compute and storage requirements
of Neural Networks. While conventional approaches typically retrain the model
to recover pruning-induced performance degradation, state-of-the-art Large
Language Model (LLM) pruning methods operate layer-wise, minimizing the
per-layer pruning error on a small calibration dataset to avoid full
retraining, which is considered computationally prohibitive for LLMs. However,
finding the optimal pruning mask is a hard combinatorial problem and solving it
to optimality is intractable. Existing methods hence rely on greedy heuristics
that ignore the weight interactions in the pruning objective. In this work, we
instead consider the convex relaxation of these combinatorial constraints and
solve the resulting problem using the Frank-Wolfe (FW) algorithm. Our method
drastically reduces the per-layer pruning error, outperforms strong baselines
on state-of-the-art GPT architectures, and remains memory-efficient. We provide
theoretical justification by showing that, combined with the convergence
guarantees of the FW algorithm, we obtain an approximate solution to the
original combinatorial problem upon rounding the relaxed solution to
integrality.
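The relax-then-round idea can be sketched for a single weight row. This is an illustrative sketch under stated assumptions, not the authors' code: the objective f(m) = ||X((1 - m) * w)||^2, the budget constraint sum(m) <= k, the step size 2/(t+2), and the top-k rounding are assumptions chosen to show a standard Frank-Wolfe loop, and the names `fw_prune_row` and `pruning_error` are hypothetical.

```python
def fw_prune_row(X, w, k, steps=200):
    """Frank-Wolfe on a relaxed pruning mask for one weight row.

    Minimizes f(m) = ||X((1 - m) * w)||^2 over {m in [0,1]^d, sum(m) <= k},
    then rounds m to a binary mask that keeps the k largest entries.
    """
    n, d = len(X), len(w)
    # Gram matrix X^T X: this is where weight interactions enter the objective.
    G = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(d)]
         for i in range(d)]
    m = [0.0] * d  # feasible start: every weight pruned
    for t in range(steps):
        resid = [(1.0 - m[j]) * w[j] for j in range(d)]
        grad = [-2.0 * w[i] * sum(G[i][j] * resid[j] for j in range(d))
                for i in range(d)]
        # Linear minimization oracle: a vertex with ones on at most k
        # coordinates, chosen where the gradient is most negative.
        order = sorted(range(d), key=lambda i: grad[i])
        s = [0.0] * d
        for i in order[:k]:
            if grad[i] < 0.0:
                s[i] = 1.0
        gamma = 2.0 / (t + 2.0)  # classic open-loop FW step size
        m = [(1.0 - gamma) * m[i] + gamma * s[i] for i in range(d)]
    keep = set(sorted(range(d), key=lambda i: -m[i])[:k])
    return [1 if i in keep else 0 for i in range(d)]


def pruning_error(X, w, mask):
    """Squared output error caused by zeroing the weights where mask == 0."""
    return sum(
        sum(X[r][j] * (1 - mask[j]) * w[j] for j in range(len(w))) ** 2
        for r in range(len(X))
    )


if __name__ == "__main__":
    X = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    w = [3.0, 1.0, 2.0, 0.5]
    mask = fw_prune_row(X, w, k=2)
    print(mask, pruning_error(X, w, mask))
```

With uncorrelated inputs, as in this toy example, the result coincides with magnitude pruning; the relaxation only differs from a greedy choice when the Gram matrix has off-diagonal structure.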
How Sampling Affects the Detectability of Machine-written texts A Comprehensive Study
Authors: Matthieu Dubois, François Yvon, Pablo Piantanida
2025-10-15
As texts generated by Large Language Models (LLMs) are ever more common and
often indistinguishable from human-written content, research on automatic text
detection has attracted growing attention. Many recent detectors report
near-perfect accuracy, often boasting AUROC scores above 99%. However, these
claims typically assume fixed generation settings, leaving open the question of
how robust such systems are to changes in decoding strategies. In this work, we
systematically examine how sampling-based decoding impacts detectability, with
a focus on how subtle variations in a model's (sub)word-level distribution
affect detection performance. We find that even minor adjustments to decoding
parameters, such as temperature or top-p (nucleus) sampling, can severely
impair detector accuracy, with AUROC dropping from near-perfect levels to 1%
in some settings. Our findings expose critical blind spots in current detection
methods and emphasize the need for more comprehensive evaluation protocols. To
facilitate future research, we release a large-scale dataset encompassing 37
decoding configurations, along with our code and evaluation framework:
https://github.com/BaggerOfWords/Sampling-and-Detection
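To make the two decoding parameters named above concrete, here is a small sketch of how temperature and top-p reshape a next-token distribution. It is a generic illustration, not code from the linked repository; the function names `softmax_with_temperature` and `top_p_filter` are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most likely tokens whose total mass
    reaches p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


if __name__ == "__main__":
    print(softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5))
    print(top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.7))
```

Lower temperatures concentrate mass on the most likely token, while top-p truncates the tail. Both change the (sub)word-level distribution whose effect on detectability the study measures.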
Adaptive Rescheduling in Prefill-Decode Disaggregated LLM Inference
Authors: Zhibin Wang, Zetao Hong, Xue Li, Zibo Wang, Shipeng Li, Qingkai Meng, Qing Wang, Chengying Huan, Rong Gu, Sheng Zhong, Chen Tian
2025-10-15
Large Language Model (LLM) inference has emerged as a fundamental paradigm.
In real-world scenarios, variations in output length cause severe workload
imbalance in the decode phase, particularly for long-output reasoning tasks.
Existing systems, such as PD disaggregation architectures, rely on static
prefill-to-decode scheduling, which often results in SLO violations and OOM
failures under evolving decode workloads.
In this paper, we propose ARES, an adaptive decoding rescheduling system
powered by length prediction to anticipate future workloads. Our core
contributions include: (1) a lightweight and continuous LLM-native prediction
method that leverages the LLM hidden state to model remaining generation length
with high precision (reducing MAE by 49.42%) and low overhead (cutting
predictor parameters by 93.28%); (2) a rescheduling solution in the decode
phase: a dynamic balancing mechanism that integrates current and predicted
workloads, reducing P99 TPOT by 74.77% and achieving up to 2.24 times higher
goodput.
Time Series Foundation Models Benchmarking Challenges and Requirements
Authors: Marcel Meyer, Sascha Kaltenpoth, Kevin Zalipski, Oliver Müller
2025-10-15
Time Series Foundation Models (TSFMs) represent a new paradigm for time
series forecasting, offering zero-shot forecasting capabilities without the
need for domain-specific pre-training or fine-tuning. However, as with Large
Language Models (LLMs), evaluating TSFMs is tricky: with ever more extensive
training sets, it becomes more and more challenging to ensure the integrity of
benchmarking data. Our investigation of existing TSFM evaluation highlights
multiple challenges, ranging from the representativeness of the benchmark
datasets and the lack of spatiotemporal evaluation to risks of information
leakage due to overlapping and obscure datasets, and the memorization of global
patterns caused by external shocks like economic crises or pandemics. Our
findings reveal widespread confusion regarding data partitions, risking
inflated performance estimates and incorrect transfer of global knowledge to
local time series. We argue for the development of robust evaluation
methodologies to prevent pitfalls already observed in LLM and classical time
series benchmarking, and call upon the research community to design new,
principled approaches, such as evaluations on truly out-of-sample future data,
to safeguard the integrity of TSFM assessment.
NOSA Native and Offloadable Sparse Attention
Authors: Yuxiang Huang, Chaojun Xiao, Xu Han, Zhiyuan Liu
2025-10-15
Trainable sparse attention has emerged as a promising solution to address the
decoding efficiency bottleneck of LLMs in long-context processing,
significantly saving memory accesses while minimally impacting task
performance. However, existing sparse attention methods leave a crucial
limitation unresolved: the size of the key-value (KV) cache remains unreduced,
which constrains on-GPU batch sizes and throttles decoding throughput,
especially in large-scale batched inference. In this paper, we show that
trainable sparse attention naturally exhibits strong locality in token
selection across adjacent decoding steps, thereby enabling KV cache offloading
without altering the underlying attention computation. However, the inherent
locality remains insufficient to achieve efficient offloading, as the transfer
of selected KV pairs between the CPU and GPU continues to dominate the overall
decoding cost. Building on this insight, we present NOSA, a trainable sparse
attention framework designed to natively support KV cache offloading. NOSA
introduces explicit locality constraints by decomposing token selection into
query-aware and query-agnostic components, thereby reducing KV transfers while
preserving the same attention computation as used during training. We pretrain
a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that
it preserves near-lossless performance while achieving up to a 2.3x improvement
in decoding throughput compared with the vanilla trainable sparse attention
baseline (InfLLM-V2).
DOLFIN Balancing Stability and Plasticity in Federated Continual Learning
Authors: Omayma Moussadek, Riccardo Salami, Simone Calderara
2025-10-15
Federated continual learning (FCL) enables models to learn new tasks across
multiple distributed clients while protecting privacy and without forgetting
previously acquired knowledge. However, current methods face challenges
balancing performance, privacy preservation, and efficiency. We
introduce DOLFIN (Distributed Online LoRA for Federated INcremental learning),
a novel approach combining Vision Transformers with low-rank adapters
designed to efficiently and stably learn new tasks in federated environments.
Our method leverages LoRA for minimal communication overhead and incorporates
DualGradient Projection Memory (DualGPM) to prevent forgetting. Evaluated on
CIFAR-100, ImageNet-R, ImageNet-A, and CUB-200 under two Dirichlet
heterogeneity settings, DOLFIN consistently surpasses six strong baselines in
final average accuracy while matching their memory footprint. Orthogonal
low-rank adapters offer an effective and scalable solution for
privacy-preserving continual learning in federated settings.
Steer-MoE Efficient Audio-Language Alignment with a Mixture-of-Experts Steering Module
Authors: Ruitao Feng, Bixi Zhang, Sheng Liang, Zheng Yuan
2025-10-15
Aligning pretrained audio encoders and Large Language Models (LLMs) offers a
promising, parameter-efficient path to building powerful multimodal agents.
However, existing methods often require costly full-model finetuning or rely on
static adapters that may lack expressive power. Drawing inspiration from the
Platonic Representation Hypothesis, we introduce SteerMoE, a novel and modular
framework for audio-language alignment. SteerMoE freezes both the audio encoder
and the LLM decoder, training only a lightweight steering module integrated
within the encoder's layers. This module uses a Mixture-of-Experts (MoE) router
to dynamically select and apply learned steering vectors, progressively
transforming continuous audio representations into a space comprehensible to
the LLM. By operating entirely in the continuous embedding space, our approach
requires no modifications to the LLM's vocabulary and preserves its advanced
reasoning and agentic capabilities. We demonstrate through experiments on ASR,
audio understanding, and a qualitative function-calling task that SteerMoE
achieves strong performance while remaining highly modular and computationally
efficient, offering a robust new paradigm for developing sophisticated
audio-language systems.
MedREK Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts
Authors: Shujun Xia, Haokun Lin, Yichen Wu, Yinan Zhou, Zixuan Li, Zhongwei Wan, Xingrun Xing, Yefeng Zheng, Xiang Li, Caifeng Shan, Zhenan Sun, Quanzheng Li
2025-10-15
LLMs hold great promise for healthcare applications, but the rapid evolution
of medical knowledge and errors in training data often cause them to generate
outdated or inaccurate information, limiting their applicability in high-stakes
clinical practice. Model editing has emerged as a potential remedy without full
retraining. While parameter-based editing often compromises locality and is
thus ill-suited for the medical domain, retrieval-based editing offers a more
viable alternative. However, it still faces two critical challenges: (1)
representation overlap within the medical knowledge space often causes
inaccurate retrieval and reduces editing accuracy; (2) existing methods are
restricted to single-sample edits, while batch-editing remains largely
unexplored despite its importance for real-world medical applications. To
address these challenges, we first construct MedVersa, an enhanced
benchmark with broader coverage of medical subjects, designed to evaluate both
single and batch edits under strict locality constraints. We then propose
MedREK, a retrieval-based editing framework that integrates a shared query-key
module for precise matching with an attention-based prompt encoder for
informative guidance. Experimental results on various medical benchmarks
demonstrate that our MedREK achieves superior performance across different core
metrics and provides the first validated solution for batch-editing in medical
LLMs. Our code and dataset are available at
https://github.com/mylittleriver/MedREK.
Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers
Authors: Xin Zhao, Xiaojun Chen, Bingshan Liu, Haoyu Gao, Zhendong Zhao, Yilong Chen
2025-10-15
Large language models (LLMs) with Mixture-of-Experts (MoE) architectures
achieve impressive performance and efficiency by dynamically routing inputs to
specialized subnetworks, known as experts. However, this dynamic routing
mechanism inherently exhibits task preferences due to expert specialization,
introducing a new and underexplored vulnerability to backdoor attacks. In this
work, we investigate the feasibility and effectiveness of injecting backdoors
into MoE-based LLMs by exploiting their inherent expert routing preferences. We
thus propose BadSwitch, a novel backdoor framework that integrates task-coupled
dynamic trigger optimization with a sensitivity-guided Top-S expert tracing
mechanism. Our approach jointly optimizes trigger embeddings during pretraining
while identifying the S most sensitive experts, subsequently constraining the Top-K
gating mechanism to these targeted experts. Unlike traditional backdoor attacks
that rely on superficial data poisoning or model editing, BadSwitch primarily
embeds malicious triggers into expert routing paths with strong task affinity,
enabling precise and stealthy model manipulation. Through comprehensive
evaluations across three prominent MoE architectures (Switch Transformer,
QwenMoE, and DeepSeekMoE), we demonstrate that BadSwitch can efficiently hijack
pre-trained models with up to 100% attack success rate (ASR) while maintaining the
highest clean accuracy (ACC) among all baselines. Furthermore, BadSwitch
exhibits strong resilience against both text-level and model-level defense
mechanisms, achieving 94.07% ASR and 87.18% ACC on the AGNews dataset. Our
analysis of expert activation patterns reveals fundamental insights into MoE
vulnerabilities. We anticipate this work will expose security risks in MoE
systems and contribute to advancing AI safety.
F-BFQ Flexible Block Floating-Point Quantization Accelerator for LLMs
Authors: Jude Haris, José Cano
2025-10-15
Large Language Models (LLMs) have become increasingly prominent for daily
tasks, from improving sound-to-text translation to generating additional frames
for the latest video games. With the help of LLM inference frameworks, such as
llama.cpp, which support optimizations such as KV-caching and quantization, it
is now easier than ever to deploy LLMs on edge devices. Quantization is
fundamental to enable LLMs on resource-constrained edge devices, and llama.cpp
utilizes block floating point (BFP) quantization to drastically reduce the bit
width of weights and input tensors, the memory footprint, and the computational
power required to run LLMs. LLMs are typically quantized with mixed BFP
quantization across the model layers to reduce the loss of model accuracy due
to quantization. Therefore, to efficiently accelerate across the layers of
BFP-quantized LLMs, specialized accelerators need to support different BFP
variants without reconfiguration. To address this issue, we propose a Flexible
Block Floating-Point Quantization (F-BFQ) accelerator, which can dynamically
switch between two BFP quantization variants and perform matrix multiplication
(MatMul) operations. Our initial F-BFQ accelerator design, deployed on the AMD
Kria board, reduces inference time by 1.4x on average over the Arm NEON-based
CPU execution across three BFP-quantized LLMs while achieving 5.2 tokens per
second (~3.9 words per second).
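The core idea of block floating point, one shared exponent per block with small integer mantissas per element, can be sketched as follows. This is a generic textbook-style illustration and an assumption on our part: it does not reproduce the specific llama.cpp block formats or the accelerator's datapath, and the names `bfp_quantize` and `bfp_dequantize` and the 8-bit mantissa default are hypothetical.

```python
import math


def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of floats to one shared exponent plus signed
    integer mantissas (generic BFP, not a specific llama.cpp format)."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = f * 2**exp with 0.5 <= f < 1.
    exp = math.frexp(max_abs)[1]
    scale = 2.0 ** (exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    # Round to the nearest step and clamp to the signed mantissa range.
    mant = [max(-limit, min(limit, round(v / scale))) for v in block]
    return exp, mant


def bfp_dequantize(exp, mant, mantissa_bits=8):
    """Reconstruct approximate floats from the shared exponent and mantissas."""
    scale = 2.0 ** (exp - (mantissa_bits - 1))
    return [q * scale for q in mant]


if __name__ == "__main__":
    exp, mant = bfp_quantize([0.5, -0.25, 0.125, 1.0])
    print(exp, mant)                  # 1 [32, -16, 8, 64]
    print(bfp_dequantize(exp, mant))  # [0.5, -0.25, 0.125, 1.0]
```

Because every element in a block shares one exponent, a matrix multiplication over the block reduces to integer multiply-accumulates followed by a single scaling, which is what makes the format attractive for hardware.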
Make an Offer They Can't Refuse Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment
Authors: Buwei He, Yang Liu, Zhaowei Zhang, Zixia Jia, Huijia Wu, Zhaofeng He, Zilong Zheng, Yipeng Kang
2025-10-15
Persuasion, a fundamental social capability for humans, remains a challenge
for AI systems such as large language models (LLMs). Current studies often
overlook the strategic use of information asymmetry in message design or rely
on strong assumptions regarding pre-commitment. In this work, we explore the
application of Bayesian Persuasion (BP) in natural language within single-turn
dialogue settings, to enhance the strategic persuasion capabilities of LLMs.
Our framework incorporates a commitment-communication mechanism, where the
persuader explicitly outlines an information schema by narrating their
potential types (e.g., honest or dishonest), thereby guiding the persuadee in
performing the intended Bayesian belief update. We evaluate two variants of our
approach: Semi-Formal-Natural-Language (SFNL) BP and Fully-Natural-Language
(FNL) BP, benchmarking them against both naive and strong non-BP (NBP)
baselines within a comprehensive evaluation framework. This framework covers a
diverse set of persuadees -- including LLM instances with varying prompts and
fine-tuning and human participants -- across tasks ranging from specially
designed persuasion scenarios to general everyday situations. Experimental
results on LLM-based agents reveal three main findings: (1) LLMs guided by BP
strategies consistently achieve higher persuasion success rates than NBP
baselines; (2) SFNL exhibits greater credibility and logical coherence, while
FNL shows stronger emotional resonance and robustness in naturalistic
conversations; (3) with supervised fine-tuning, smaller models can attain BP
performance comparable to that of larger models.
Document Intelligence in the Era of Large Language Models A Survey
Authors: Weishi Wang, Hengchang Hu, Zhijie Zhang, Zhaochen Li, Hongxin Shao, Daniel Dahlmeier
2025-10-15
Document AI (DAI) has emerged as a vital application area, and has been
significantly transformed by the advent of large language models (LLMs). While
earlier approaches relied on encoder-decoder architectures, decoder-only LLMs
have revolutionized DAI, bringing remarkable advancements in understanding and
generation. This survey provides a comprehensive overview of DAI's evolution,
highlighting current research attempts and future prospects of LLMs in this
field. We explore key advancements and challenges in multimodal, multilingual,
and retrieval-augmented DAI, while also suggesting future research directions,
including agent-based approaches and document-specific foundation models. This
paper aims to provide a structured analysis of the state-of-the-art in DAI and
its implications for both academic and practical applications.
Taming the Fragility of KV Cache Eviction in LLM Inference
Authors: Yuan Feng, Haoyu Guo, JunLin Lv, S. Kevin Zhou, Xike Xie
2025-10-15
Large language models have revolutionized natural language processing, yet
their deployment remains hampered by the substantial memory and runtime
overhead of the model's Key-Value cache. To mitigate this, recent methods
employ a scoring-aggregation framework to evict unimportant cache entries,
based on the stability assumption that a fixed subset of entries remains
consistently important during generation. However, prior work has largely
focused on refining importance indicators for scoring, while defaulting to mean
aggregation out of implicit trust in the stability assumption. In this work,
we argue that this underlying assumption is inherently fragile, making mean
aggregation highly vulnerable in extreme cases. To counter this, we propose a
simple yet elegant defensive aggregation strategy: a two-step, linear-time
approach that controls worst-case risk, thereby defending against extreme cases
with negligible computational overhead. Embodying this strategy, we propose a
novel cache eviction method, DefensiveKV, and its extension, Layer-DefensiveKV,
which incorporates layer-wise budget allocation. Across seven task domains (18
datasets), our methods reduce generation quality loss by 2.3x and 4.3x
respectively, versus the strongest baseline under a 20% cache size. These
results set new performance benchmarks and pioneer a promising direction for
optimizing cache eviction against underlying fragility through worst-case risk
management. Our code is available at https://github.com/FFY0/Defensive
.
ChatR1 Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering
Authors: Simon Lupart, Mohammad Aliannejadi, Evangelos Kanoulas
2025-10-15
We present ChatR1, a reasoning framework based on reinforcement learning (RL)
for conversational question answering (CQA). Reasoning plays an important role
in CQA, where user intent evolves across dialogue turns, and utterances are
often underspecified, requiring contextual interpretation, query reformulation,
and dynamic coordination between retrieval and generation. Unlike static
"rewrite, retrieve, and generate" pipelines, ChatR1 interleaves search and
reasoning across turns, enabling exploratory and adaptive behaviors learned
through RL. To address the challenge of sparse and delayed rewards in RL, we
propose an intent-aware reward that provides turn-level feedback by aligning
retrieval and reasoning with evolving user goals. Our proposed ChatR1
demonstrates strong performance on both 3B and 7B model backbones,
outperforming competitive models on five CQA datasets, measured by different
metrics (F1, BERTScore, and LLM-as-judge). We include a diverse set of CQA
datasets to cover topic shifts, evolving intents, mixed-initiative dialogues,
and multi-document grounding, testing ChatR1's performance from various
aspects. Ablation studies confirm the effectiveness of the intent-aware reward.
Our analyses further reveal diverse reasoning trajectories and effective use of
the search tool. ChatR1 also generalizes robustly across domains, demonstrating
that RL-based reasoning enables more flexible and context-sensitive behavior
than static CQA pipelines.
BanaServe Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
Authors: Yiyuan He, Minxian Xu, Jingfeng Wu, Jianmin Hu, Chong Ma, Min Shen, Le Chen, Chengzhong Xu, Lin Qu, Kejiang Ye
2025-10-15
Large language models (LLMs) are increasingly deployed in AI infrastructure,
driving the need for high throughput, resource efficient serving systems.
Disaggregated LLM serving, which separates prompt prefill from auto-regressive
decode, has emerged as a promising architecture by isolating their
heterogeneous compute and memory demands. However, current disaggregated
systems face three key limitations: (i) static resource allocation cannot adapt
to highly dynamic workloads, causing over-provisioning that wastes resources or
under-provisioning that violates service level objectives (SLOs); (ii) inherent
load imbalance between prefill and decode stages, where prefill is
compute-bound and decode is memory-bound, causes under-utilization in one tier
while the other becomes a bottleneck; and (iii) prefix cache aware routing
skews load distribution, as high cache hit rate nodes attract
disproportionately more requests, further degrading balance and efficiency. To
address these issues, we present BanaServe, a dynamic orchestration framework
that continuously rebalances computational and memory resources across prefill
and decode instances while eliminating hotspots induced by cache. BanaServe
introduces layer level weight migration, attention level Key Value Cache (KV
Cache) migration, and Global KV Cache Store sharing with layer wise overlapped
transmission, enabling both coarse grained (layer level) and fine grained
(attention level) load redistribution with minimal latency overhead. These
mechanisms allow routers to perform purely load aware scheduling, unconstrained
by cache placement. Compared to vLLM, BanaServe achieves 1.2x-3.9x higher
throughput with 3.9%-78.4% lower total processing time, and outperforms
DistServe by 1.1x-2.8x in throughput with 1.4%-70.1% latency reduction.
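As a rough illustration of what "purely load aware scheduling" can look like
once cache placement no longer constrains the router, the Python sketch below
sends each request to the least-loaded instance. The function names, instance
labels, and scalar request costs are hypothetical and are not taken from
BanaServe.

```python
def pick_least_loaded(loads):
    """Return the instance with the smallest current load (ties broken by name)."""
    return min(loads, key=lambda name: (loads[name], name))


def route_requests(costs, instances):
    """Assign each request cost to the least-loaded instance; return final loads."""
    loads = {name: 0.0 for name in instances}
    for cost in costs:
        target = pick_least_loaded(loads)
        loads[target] += cost
    return loads


# Hypothetical workload: four requests with different costs, two instances.
final_loads = route_requests([3.0, 1.0, 2.0, 2.0], ["decode-0", "decode-1"])
```

A production router would use measured queue depth or memory pressure rather
than a single scalar cost; the sketch only shows the selection rule.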
DSCD Large Language Model Detoxification with Self-Constrained Decoding
Authors: Ming Dong, Jinkui Zhang, Bolong Zheng, Xinhui Tu, Po Hu, Tingting He
2025-10-15
Detoxification in large language models (LLMs) remains a significant research
challenge. Existing LLM detoxification methods are all based on external
constraints, which require additional resource overhead and lose generation
fluency. This work proposes Detoxification with Self-Constrained Decoding
(DSCD), a novel method for LLM detoxification without parameter fine-tuning.
DSCD strengthens the inner next-token distribution of the safety layer while
weakening that of hallucination and toxic layers during output generation. This
effectively diminishes toxicity and enhances output safety. DSCD is
lightweight, highly compatible, and plug-and-play, readily
integrating with existing detoxification methods for further performance
improvement. Extensive experiments on representative open-source LLMs and
public datasets validate DSCD's effectiveness, demonstrating state-of-the-art
(SOTA) performance in both detoxification and generation fluency, with superior
efficiency compared to existing methods. These results highlight DSCD's
potential as a practical and scalable solution for safer LLM deployments.
A Dimension-Keeping Semi-Tensor Product Framework for Compressed Sensing
Authors: Qi Qi, Abdelhamid Tayebi, Daizhan Cheng, Jun-e Feng
2025-10-15
In compressed sensing (CS), signals can be reconstructed from
significantly fewer samples than required by the Nyquist-Shannon sampling
theorem. While non-sparse signals can be sparsely represented in appropriate
transformation domains, conventional CS frameworks rely on the incoherence of
the measurement matrix columns to guarantee reconstruction performance. This
paper proposes a novel method termed Dimension-Keeping Semi-Tensor Product
Compressed Sensing (DK-STP-CS), which leverages intra-group correlations while
maintaining inter-group incoherence to enhance the measurement matrix design.
Specifically, the DK-STP algorithm is integrated into the design of the sensing
matrix, enabling dimensionality reduction while preserving signal recovery
capability. For image compression and reconstruction tasks, the proposed method
achieves notable noise suppression and improves visual fidelity. Experimental
results demonstrate that DK-STP-CS significantly outperforms traditional CS and
STP-CS approaches, as evidenced by higher Peak Signal-to-Noise Ratio (PSNR)
values between the reconstructed and original images. The robustness of
DK-STP-CS is further validated under noisy conditions and varying sampling
rates, highlighting its potential for practical applications in
resource-constrained environments.
Mirror Speculative Decoding Breaking the Serial Barrier in LLM Inference
Authors: Nikhil Bhendawade, Kumari Nishu, Arnav Kundu, Chris Bartels, Minsik Cho, Irina Belousova
2025-10-15
Speculative decoding accelerates LLM inference by using a draft model to look
ahead, but gains are capped by the cost of autoregressive draft generation:
increasing draft size elevates acceptance rates but introduces additional
latency overhead exacerbating the speed-accuracy tradeoff. Prior methods
(Medusa, Hydra, EAGLE) partially reduce draft cost but either degrade
acceptance or introduce overheads that limit scaling. We present Mirror
Speculative Decoding (Mirror-SD), an inference algorithm that breaks the
latency-acceptance tradeoff. Mirror-SD launches branch-complete rollouts from
early-exit signals in parallel with the target model's suffix and explicitly
maps computation across heterogeneous accelerators (GPU and NPU) to exploit
cross-device parallelism. The draft speculates forward continuations for the
target to verify, while the target simultaneously speculates correction paths
for the draft, converting speculation into two complementary execution
pipelines. To further cut draft latency without weakening acceptance semantics,
we add speculative streaming so the draft emits multiple tokens per step. This
dual strategy of parallel heterogeneous execution plus multi-token speculative
streaming pushes speculative decoding toward its ideal regime of high
acceptance with low overhead. On SpecBench with server-scale models from 14B to
66B parameters, Mirror-SD delivers consistent end-to-end gains, achieving
2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative
improvement over the strongest baseline, EAGLE3.
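For readers unfamiliar with the baseline being improved, the Python sketch
below shows a generic greedy draft-and-verify loop. It is not Mirror-SD: there
is no parallel heterogeneous execution or speculative streaming, verification
is written as sequential calls rather than one batched pass, and the toy
`target_next` and `draft_next` functions are hypothetical stand-ins for models.

```python
def speculative_decode(target_next, draft_next, prompt, num_new, draft_len=4):
    """Greedy speculative decoding: output always matches the target's greedy decode."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # The draft model proposes a short continuation autoregressively.
        proposal = []
        for _ in range(min(draft_len, goal - len(tokens))):
            proposal.append(draft_next(tokens + proposal))
        # The target checks each proposed token; on the first mismatch it
        # substitutes its own token and the rest of the draft is discarded.
        for proposed in proposal:
            expected = target_next(tokens)
            tokens.append(expected)
            if expected != proposed:
                break
    return tokens


def target_next(context):
    """Toy target model: count upward modulo 10."""
    return (context[-1] + 1) % 10


def draft_next(context):
    """Toy draft model: agrees with the target except right after token 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


output = speculative_decode(target_next, draft_next, prompt=[0], num_new=8)
```

Longer drafts raise the chance of a long accepted run but cost more draft
steps per round, which is the tradeoff the abstract describes.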
Retrieval-in-the-Chain Bootstrapping Large Language Models for Generative Retrieval
Authors: Yingchen Zhang, Ruqing Zhang, Jiafeng Guo, Wenjun Peng, Sen Li, Fuyu Lv
2025-10-15
Generative retrieval (GR) is an emerging paradigm that leverages large
language models (LLMs) to autoregressively generate document identifiers
(docids) relevant to a given query. Prior works have focused on leveraging the
generative capabilities of LLMs to improve GR, while overlooking that their
reasoning capabilities could likewise help. This raises a key question: Can
explicit reasoning benefit GR? To investigate, we first conduct a preliminary
study where an LLM is prompted to generate free-form chain-of-thought (CoT)
reasoning before performing constrained docid decoding. Although this method
outperforms standard GR, the generated reasoning tends to be verbose and poorly
aligned with the docid space. These limitations motivate the development of a
reasoning mechanism better tailored to GR.
Therefore, we propose Reason-for-Retrieval (R4R), a reasoning-augmented
framework for GR that converts free-form CoT reasoning into a compact,
structured format, and iteratively refines the reasoning during the retrieval
process. R4R augments an existing GR method by leveraging a reasoning-capable
LLM that has been instruction-tuned for GR. At inference time, R4R first uses
the LLM to generate an initial structured reasoning; then the same LLM
alternates between (i) constrained decoding with the chosen GR method to
produce candidate docids and (ii) updating the reasoning based on retrieval
results to improve the next round. R4R does not require additional models or
training, and instead a single LLM serves as both the reasoning generator and
the retriever. Extensive experiments on Natural Questions, MS MARCO, and a
real-world item-search benchmark validate the effectiveness of R4R.
NeuroRVQ Multi-Scale EEG Tokenization for Generative Large Brainwave Models
Authors: Konstantinos Barmpas, Na Lee, Alexandros Koliousis, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Stefanos Zafeiriou
2025-10-15
Electroencephalography (EEG) captures neural activity across multiple
temporal and spectral scales, yielding signals that are rich but complex for
representation learning. Recently, EEG foundation models trained to predict
masked signal-tokens have shown promise for learning generalizable
representations. However, their performance is hindered by their signal
tokenization modules. Existing neural tokenizers fail to preserve
high-frequency dynamics, limiting their ability to reconstruct EEG signals with
high fidelity. We introduce NeuroRVQ, a scalable Large Brainwave Model (LBM)
centered on a codebook-based tokenizer. Our tokenizer integrates: (i)
multi-scale feature extraction modules that capture the full frequency neural
spectrum; (ii) hierarchical residual vector quantization (RVQ) codebooks for
high-resolution encoding; and, (iii) an EEG signal phase- and amplitude-aware
loss function for efficient training. This design enables efficient EEG
compression while supporting accurate reconstruction across all frequency
bands, leading to robust generative masked modeling. Our empirical results
demonstrate that NeuroRVQ achieves lower reconstruction error and outperforms
existing LBMs on a variety of downstream tasks. More broadly, NeuroRVQ
tokenizer establishes a strong prior for codebook-based general-purpose
brainwave models, enabling advances in neural decoding, generative modeling and
multimodal biosignal integration.
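To make the residual vector quantization idea concrete, here is a minimal
Python sketch of generic RVQ encoding and decoding: each stage quantizes what
the previous stages left over. The two tiny codebooks and the input vector are
made-up values for illustration; NeuroRVQ's multi-scale feature extractors and
phase- and amplitude-aware loss are not modeled.

```python
def nearest_index(codebook, vector):
    """Index of the codeword with the smallest squared L2 distance to `vector`."""
    def squared_distance(code):
        return sum((c - v) ** 2 for c, v in zip(code, vector))
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize `vector` stage by stage; each stage encodes the remaining residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_index(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    reconstruction = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        reconstruction = [r + c for r, c in zip(reconstruction, codebook[idx])]
    return reconstruction


# Hypothetical two-stage hierarchy: a coarse codebook, then a finer one.
coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]]
codes = rvq_encode([1.2, 1.0], [coarse, fine])
reconstructed = rvq_decode(codes, [coarse, fine])
```

Adding stages shrinks the residual further, which is why a hierarchy of small
codebooks can reach finer resolution than one codebook of the same total size.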
Neural Approximate Inverse Preconditioners
Authors: Tianshi Xu, Rui Peng Li, Yuanzhe Xi
2025-10-14
In this paper, we propose a data-driven framework for constructing efficient
approximate inverse preconditioners for elliptic partial differential equations
(PDEs) by learning the Green's function of the underlying operator with neural
networks (NNs). The training process integrates four key components: an
adaptive multiscale neural architecture (MSNN) that captures
hierarchical features across near-, middle-, and far-field regimes; the use of
coarse-grid anchor data to ensure physical identifiability; a
multi- staged training protocol that progressively refines the
Green's function representation across spatial scales; and an overlapping
domain decomposition that enables local adaptation while maintaining global
consistency. Once trained, the NN-approximated Green's function is directly
compressed into either a hierarchical (H-) matrix or a sparse
matrix, using only the mesh geometry and the network output. This geometric
construction achieves nearly linear complexity in both setup and application
while preserving the spectral properties essential for effective
preconditioning. Numerical experiments on challenging elliptic PDEs demonstrate
that the resulting preconditioners consistently yield fast convergence and
small iteration counts.
Computationally Efficient Neural Receivers via Axial Self-Attention
Authors: SaiKrishna Saketh Yellapragada, Atchutaram K. Kocharlakota, Mário Costa, Esa Ollila, Sergiy A. Vorobyov
2025-10-14
Deep learning-based neural receivers are redefining physical-layer signal
processing for next-generation wireless systems. We propose an axial
self-attention neural receiver designed for applicability to 6G and
beyond wireless systems, validated through 5G-compliant experimental
configurations, that achieves state-of-the-art block error rate (BLER)
performance with significantly improved computational efficiency. By
factorizing attention operations along temporal and spectral axes, the proposed
architecture reduces the quadratic complexity of conventional multi-head
self-attention from to , yielding substantially fewer
total floating-point operations and attention matrix multiplications per
block compared to global self-attention. Relative to convolutional
neural receiver baselines, the axial neural receiver achieves significantly
lower computational cost with a fraction of the parameters. Experimental
validation under 3GPP Clustered Delay Line (CDL) channels demonstrates
consistent performance gains across varying mobility scenarios. Under
non-line-of-sight CDL-C conditions, the axial neural receiver consistently
outperforms all evaluated receiver architectures, including global
self-attention, convolutional neural receivers, and traditional LS-LMMSE at
10% BLER with reduced computational complexity per inference. At stringent
reliability targets of 1% BLER, the axial receiver maintains robust symbol
detection at high user speeds, whereas the traditional LS-LMMSE receiver fails
to converge, underscoring its suitability for ultra-reliable low-latency
communication (URLLC) in dynamic 6G environments and beyond. These results
establish the axial neural receiver as a structured, scalable, and efficient
framework for AI-Native 6G RAN systems, enabling deployment in
resource-constrained edge environments.
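The saving from factorizing attention along two axes can be checked with a
short count. The Python sketch below uses the standard axial-attention
accounting for a grid of T time symbols by F subcarriers; the grid size (14 by
128) is a hypothetical example and the paper's own complexity expressions may
be stated differently.

```python
def attention_score_entries(num_symbols, num_subcarriers):
    """Count attention-score entries per head for one T x F resource grid."""
    t, f = num_symbols, num_subcarriers
    # Global self-attention: all T*F grid elements attend to each other.
    global_entries = (t * f) ** 2
    # Axial attention: F independent length-T attentions (temporal axis)
    # plus T independent length-F attentions (spectral axis).
    axial_entries = f * t * t + t * f * f
    return {
        "global": global_entries,
        "axial": axial_entries,
        "reduction": global_entries / axial_entries,
    }


# Hypothetical grid: 14 OFDM symbols by 128 subcarriers.
costs = attention_score_entries(num_symbols=14, num_subcarriers=128)
```

For this grid the global score matrix has 3,211,264 entries against 254,464
for the axial variant, a reduction factor of T*F/(T+F), roughly 12.6.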
Pruning Cannot Hurt Robustness Certified Trade-offs in Reinforcement Learning
Authors: James Pedley, Benjamin Etheridge, Stephen J. Roberts, Francesco Quinzan
2025-10-14
Reinforcement learning (RL) policies deployed in real-world environments must
remain reliable under adversarial perturbations. At the same time, modern deep
RL agents are heavily over-parameterized, raising costs and fragility concerns.
While pruning has been shown to improve robustness in supervised learning, its
role in adversarial RL remains poorly understood. We develop the first
theoretical framework for certified robustness under pruning in
state-adversarial Markov decision processes (SA-MDPs). For Gaussian and
categorical policies with Lipschitz networks, we prove that element-wise
pruning can only tighten certified robustness bounds; pruning never makes the
policy less robust. Building on this, we derive a novel three-term regret
decomposition that disentangles clean-task performance, pruning-induced
performance loss, and robustness gains, exposing a fundamental
performance-robustness frontier. Empirically, we evaluate magnitude and
micro-pruning schedules on continuous-control benchmarks with strong
policy-aware adversaries. Across tasks, pruning consistently uncovers
reproducible "sweet spots" at moderate sparsity levels, where robustness
improves substantially without harming, and sometimes even enhancing, clean
performance. These results position pruning not merely as a compression tool
but as a structural intervention for robust RL.
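A small Python sketch of element-wise magnitude pruning follows, together with
one simple quantity that zeroing entries can never increase: the induced
infinity norm (maximum absolute row sum) of a weight matrix. This norm is
chosen here only as an easy-to-verify illustration and is an assumption; it is
not claimed to be the bound used in the paper's proofs, and the weight values
are invented.

```python
def magnitude_prune(weights, sparsity):
    """Zero out roughly the `sparsity` fraction of entries with smallest |w|."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    magnitudes = sorted(abs(w) for row in weights for w in row)
    num_pruned = int(sparsity * len(magnitudes))
    if num_pruned == 0:
        return [list(row) for row in weights]
    threshold = magnitudes[num_pruned - 1]
    # Entries tied with the threshold are pruned as well.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def max_abs_row_sum(weights):
    """Induced infinity norm of a matrix given as a list of rows."""
    return max(sum(abs(w) for w in row) for row in weights)


# Hypothetical 2 x 2 layer, pruned to 50% sparsity.
layer = [[0.5, -0.1], [0.05, -2.0]]
pruned = magnitude_prune(layer, sparsity=0.5)
```

Here `pruned` keeps only the two largest-magnitude weights, and its maximum
absolute row sum (2.0) is no larger than the original's (2.05).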
Gaussian Process Implicit Surfaces as Control Barrier Functions for Safe Robot Navigation
Authors: Mouhyemen Khan, Tatsuya Ibuki, Abhijit Chatterjee
2025-10-14
Level set methods underpin modern safety techniques such as control barrier
functions (CBFs), while also serving as implicit surface representations for
geometric shapes via distance fields. Inspired by these two paradigms, we
propose a unified framework where the implicit surface itself acts as a CBF. We
leverage Gaussian process (GP) implicit surface (GPIS) to represent the safety
boundaries, using safety samples which are derived from sensor measurements to
condition the GP. The GP posterior mean defines the implicit safety surface
(safety belief), while the posterior variance provides a robust safety margin.
Although GPs have favorable properties such as uncertainty estimation and
analytical tractability, they scale cubically with data. To alleviate this
issue, we develop a sparse solution called sparse Gaussian CBFs. To the best of
our knowledge, GPIS have not been explicitly used to synthesize CBFs. We
validate the approach on collision avoidance tasks in two settings: a simulated
7-DOF manipulator operating around the Stanford bunny, and a quadrotor
navigating in 3D around a physical chair. In both cases, Gaussian CBFs (with
and without sparsity) enable safe interaction and collision-free execution of
trajectories that would otherwise intersect the objects.
KVCOMM Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
Authors: Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, Yiran Chen
2025-10-14
Multi-agent large language model (LLM) systems are increasingly adopted for
complex language processing tasks that require communication and coordination
among agents. However, these systems often suffer substantial overhead from
repeated reprocessing of overlapping contexts across agents. In typical
pipelines, once an agent receives a message from its predecessor, the full
context, including prior turns, must be reprocessed from scratch, leading to
inefficient processing. While key-value (KV) caching is an effective solution
for avoiding redundant computation in single-agent settings where prefixes
remain unchanged, it cannot be directly reused in multi-agent scenarios due to
diverging prefixes introduced by agent-specific context extensions. We identify
that the core challenge lies in the offset variance of KV-caches across agents.
To address this, we propose KVCOMM, a training-free framework that enables
efficient prefilling in multi-agent inference by reusing KV-caches and aligning
cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM
estimates and adjusts KV-caches for shared content by referencing a pool of
cached examples, termed anchors, that store observed cache deviations under
varying prefixes. The anchor pool is maintained and updated online, allowing
dynamic adaptation to distinct user requests and context structures. KVCOMM
achieves over 70% reuse rate across diverse multi-agent workloads, including
retrieval-augmented generation, math reasoning, and collaborative coding tasks,
all without quality degradation. Particularly, when each fully-connected agent
receives 1K input tokens with 512 prefix tokens and 512 output tokens under a
five-agent setting, KVCOMM achieves up to 7.8x speedup compared to the standard
pipeline, reducing TTFT from ~430 ms to ~55 ms.
What If Understanding Motion Through Sparse Interactions
Authors: Stefan Andreas Baumann, Nick Stracke, Timy Phan, Björn Ommer
2025-10-14
Understanding the dynamics of a physical scene involves reasoning about the
diverse ways it can potentially change, especially as a result of local
interactions. We present the Flow Poke Transformer (FPT), a novel framework for
directly predicting the distribution of local motion, conditioned on
interactions termed "pokes". Unlike traditional methods that typically only
enable dense sampling of a single realization of scene dynamics, FPT provides
an interpretable, directly accessible representation of multi-modal scene
motion, its dependency on physical interactions and the inherent uncertainties
of scene dynamics. We also evaluate our model on several downstream tasks to
enable comparisons with prior methods and highlight the flexibility of our
approach. On dense face motion generation, our generic pre-trained model
surpasses specialized baselines. FPT can be fine-tuned in strongly
out-of-distribution tasks such as synthetic datasets to enable significant
improvements over in-domain methods in articulated object motion estimation.
Additionally, predicting explicit motion distributions directly enables our
method to achieve competitive performance on tasks like moving part
segmentation from pokes which further demonstrates the versatility of our FPT.
Code and models are publicly available at
https://compvis.github.io/flow-poke-
.
CARVQ Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression
Authors: Dayin Gou, Sanghyun Byun, Nilesh Malpeddi, Gabrielle De Micheli, Prathamesh Vaste, Jacob Song, Woo Seong Chung
2025-10-14
Large Language Models (LLMs) typically rely on a large number of parameters
for token embedding, leading to substantial storage requirements and memory
footprints. In particular, LLMs deployed on edge devices are memory-bound, and
reducing the memory footprint by compressing the embedding layer not only frees
up the memory bandwidth but also speeds up inference. To address this, we
introduce CARVQ, a novel post-training Corrective Adaptor combined with group
Residual Vector Quantization. CARVQ relies on the composition of both linear
and non-linear maps and mimics the original model embedding to compress to
approximately 1.6 bits without requiring specialized hardware to support
lower-bit storage. We test our method on pre-trained LLMs such as LLaMA-3.2-1B,
LLaMA-3.2-3B, LLaMA-3.2-3B-Instruct, LLaMA-3.1-8B, Qwen2.5-7B, Qwen2.5-Math-7B
and Phi-4, evaluating on common generative, discriminative, math and reasoning
tasks. We show that in most cases, CARVQ can achieve lower average
bitwidth-per-parameter while maintaining reasonable perplexity and accuracy
compared to scalar quantization. Our contributions include a novel compression
technique that is compatible with state-of-the-art quantization
methods and can be seamlessly integrated into any hardware supporting 4-bit
memory to reduce the model's memory footprint in memory-constrained devices.
This work demonstrates a crucial step toward the efficient deployment of LLMs
on edge devices.
Enhanced Angle-Range Cluster Parameter Estimation in Full-Duplex ISAC Systems
Authors: Muhammad Talha, Besma Smida, David González G
2025-10-14
This work studies an integrated sensing and communication (ISAC) framework
for targets that are spread both in the angle and range domains. We model each
target using a cluster of rays parameterized by a specific density function,
and propose a truncated Multiple Signal Classification (MUSIC) spread (TMS)
algorithm to accurately estimate the parameters of the density function. Unlike
the conventional MUSIC spread (CMS), TMS restricts the signal subspace rank
based on the eigen decomposition of the received-signal autocorrelation. We
also propose a discrete Fourier transform (DFT) based algorithm for estimating
the distance and range spread of each target. Leveraging these estimates, we
then develop a dynamic transmit beamforming algorithm that successfully
illuminates multiple targets while also serving multiple downlink (DL) users.
Simulation results demonstrate the superiority of our proposed algorithms over
baseline schemes in both low and high signal-to-noise ratio (SNR) regimes as
well as under a wide angular spread regime.
Low Latency, High Bandwidth Streaming of Experimental Data with EJFAT
Authors: Ilya Baldin, Michael Goodrich, Vardan Gyurjyan, Graham Heyes, Derek Howard, Yatish Kumar, David Lawrence, Brad Sawatzky, Stacey Sheldon, Carl Timmer
2025-10-14
Thomas Jefferson National Accelerator Facility (JLab) has partnered with
Energy Sciences Network (ESnet) to define and implement an edge to compute
cluster computational load balancing architecture. The ESnet-JLab
FPGA Accelerated Transport (EJFAT) architecture focuses on FPGA acceleration to
address compression, fragmentation, UDP packet destination redirection (Network
Address Translation (NAT)) and decompression and reassembly.
EJFAT seamlessly integrates edge and cluster computing to support direct
processing of streamed experimental data. This will directly benefit the JLab
science program as well as data centers of the future that require high
throughput and low latency for both time-critical data acquisition systems and
data center workflows.
The EJFAT project will be presented along with how it is synergistic with
other DOE activities such as an Integrated Research Infrastructure (IRI), and
recent results using data sources at JLab, an EJFAT LB at ESnet, and
computational cluster resources at Lawrence Berkeley National Laboratory
(LBNL).
Teaching Language Models to Faithfully Express their Uncertainty
Authors: Bryan Eikema, Evgenia Ilia, José G. C. de Souza, Chrysoula Zerva, Wilker Aziz
2025-10-14
Large language models (LLMs) often miscommunicate their uncertainty: repeated
queries can produce divergent answers, yet generated responses are typically
unhedged or hedged in ways that do not reflect this variability. This conveys
unfaithful information about the uncertain state of the LLMs' knowledge,
creating a faithfulness gap that affects even strong LLMs. We introduce
Faithful Uncertainty Tuning (FUT): a fine-tuning approach that teaches
instruction-tuned LLMs to express uncertainty faithfully without altering their
underlying answer distribution. We construct training data by augmenting model
samples with uncertainty hedges (i.e. verbal cues such as 'possibly' or
'likely') aligned with sample consistency, requiring no supervision beyond the
model and a set of prompts. We evaluate FUT on open-domain question answering
(QA) across multiple models and datasets. Our results show that FUT
substantially reduces the faithfulness gap, while preserving QA accuracy and
introducing minimal semantic distribution shift. Further analyses demonstrate
robustness across decoding strategies, choice of hedgers, and other forms of
uncertainty expression (i.e. numerical). These findings establish FUT as a
simple and effective way to teach LLMs to communicate uncertainty faithfully.
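The data-construction idea, attaching a hedge that matches how often the model
gives the same answer across samples, can be sketched in a few lines of
Python. The exact-match agreement measure, the 0.9 and 0.5 cutoffs, and the
hedge wording are illustrative assumptions, not the settings used by FUT.

```python
def sample_consistency(answer, samples):
    """Fraction of sampled answers that exactly match `answer`."""
    return sum(sample == answer for sample in samples) / len(samples)


def add_hedge(answer, samples):
    """Prefix `answer` with a verbal hedge chosen from its sample consistency."""
    score = sample_consistency(answer, samples)
    if score >= 0.9:  # assumed cutoff: near-unanimous, leave unhedged
        return answer
    if score >= 0.5:  # assumed cutoff: majority answer
        return f"Likely {answer}"
    return f"Possibly {answer}"


# Hypothetical pool of ten sampled answers to the same question.
samples = ["Paris"] * 7 + ["Lyon"] * 3
hedged = [add_hedge(answer, samples) for answer in ("Paris", "Lyon")]
```

With seven of ten samples agreeing, "Paris" becomes "Likely Paris", while the
minority answer becomes "Possibly Lyon". A realistic pipeline would compare
answers semantically rather than by string equality.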
SMEC Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression
Authors: Biao Zhang, Lixin Chen, Tong Liu, Bo Zheng
2025-10-14
Large language models (LLMs) generate high-dimensional embeddings that
capture rich semantic and syntactic information. However, high-dimensional
embeddings exacerbate computational complexity and storage requirements,
thereby hindering practical deployment. To address these challenges, we propose
a novel training framework named Sequential Matryoshka Embedding Compression
(SMEC). This framework introduces the Sequential Matryoshka Representation
Learning (SMRL) method to mitigate gradient variance during training, the
Adaptive Dimension Selection (ADS) module to reduce information degradation
during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module
to enhance unsupervised learning between high- and low-dimensional embeddings.
Experiments on image, text, and multimodal datasets demonstrate that SMEC
achieves significant dimensionality reduction while maintaining performance.
For instance, on the BEIR dataset, our approach improves the performance of
compressed LLM2Vec embeddings (256 dimensions) by 1.1 points and 2.7 points
compared to the Matryoshka-Adaptor and Search-Adaptor models, respectively.
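For context, the basic Matryoshka-style operation that such methods build on is
to keep only the leading dimensions of an embedding and renormalize. The Python
sketch below shows that baseline step only; it does not implement SMRL, ADS, or
S-XBM, and the four-dimensional vectors are invented for illustration.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale to unit L2 norm."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical query and document embeddings, compressed from 4 to 2 dimensions.
query = [3.0, 4.0, 12.0, 0.5]
document = [3.0, 4.0, -1.0, 2.0]
small_query = truncate_and_normalize(query, 2)
small_document = truncate_and_normalize(document, 2)
similarity_small = cosine(small_query, small_document)
similarity_full = cosine(query, document)
```

In this toy case the two vectors agree on the leading coordinates, so the
truncated similarity is higher than the full one; how much retrieval quality
survives truncation in practice depends on how the embeddings were trained.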
Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems
Authors: Jiaxin Gao, Chen Chen, Yanwen Jia, Xueluan Gong, Kwok-Yan Lam, Qian Wang
2025-10-14
Large Language Models (LLMs) are increasingly being used to autonomously
evaluate the quality of content in communication systems, e.g., to assess
responses in telecom customer support chatbots. However, the impartiality of
these AI "judges" is not guaranteed, and any biases in their evaluation
criteria could skew outcomes and undermine user trust. In this paper, we
systematically investigate judgment biases in two LLM-as-a-judge models (i.e.,
GPT-Judge and JudgeLM) under the point-wise scoring setting, encompassing 11
types of biases that cover both implicit and explicit forms. We observed that
state-of-the-art LLM judges demonstrate robustness to biased inputs, generally
assigning them lower scores than the corresponding clean samples. Providing a
detailed scoring rubric further enhances this robustness. We further found that
fine-tuning an LLM on high-scoring yet biased responses can significantly
degrade its performance, highlighting the risk of training on biased data. We
also discovered that the judged scores correlate with task difficulty: a
challenging dataset like GPQA yields lower average scores, whereas an
open-ended reasoning dataset (e.g., JudgeLM-val) sees higher average scores.
Finally, we proposed four potential mitigation strategies to ensure fair and
reliable AI judging in practical communication scenarios.
Probing Latent Knowledge Conflict for Faithful Retrieval-Augmented Generation
Authors: Linfeng Gao, Baolong Bi, Zheng Yuan, Le Wang, Zerui Chen, Zhimin Wei, Shenghua Liu, Qinggang Zhang, Jinsong Su
2025-10-14
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm to
enhance the factuality of Large Language Models (LLMs). However, existing RAG
systems often suffer from an unfaithfulness issue, where the model's response
contradicts evidence from the retrieved context. Existing approaches to
improving contextual faithfulness largely rely on external interventions, such
as prompt engineering, decoding constraints, or reward-based fine-tuning. These
works treat the LLM as a black box and overlook a crucial question: how does
the LLM internally integrate retrieved evidence with its parametric memory,
particularly under knowledge conflicts? To address this gap, we conduct a
probing-based analysis of hidden-state representations in LLMs and observe
three findings: knowledge integration occurs hierarchically, conflicts manifest
as latent signals at the sentence level, and irrelevant context is often
amplified when aligned with parametric knowledge. Building on these findings,
we propose CLEAR (Conflict-Localized and Enhanced Attention for RAG), a
framework that (i) decomposes context into fine-grained sentence-level
knowledge, (ii) employs hidden-state probing to localize conflicting knowledge,
and (iii) introduces conflict-aware fine-tuning to guide the model to
accurately integrate retrieved evidence. Extensive experiments across three
benchmarks demonstrate that CLEAR substantially improves both accuracy and
contextual faithfulness, consistently outperforming strong baselines under
diverse conflict conditions. The related resources are available at
https://github.com/LinfengGao/CLEAR.
VideoLucy Deep Memory Backtracking for Long Video Understanding
Authors: Jialong Zuo, Yongtai Deng, Lingdong Kong, Jingkang Yang, Rui Jin, Yiwei Zhang, Nong Sang, Liang Pan, Ziwei Liu, Changxin Gao
2025-10-14
Recent studies have shown that agent-based systems leveraging large language
models (LLMs) for key information retrieval and integration have emerged as a
promising approach for long video understanding. However, these systems face
two major challenges. First, they typically perform modeling and reasoning on
individual frames, struggling to capture the temporal context of consecutive
frames. Second, to reduce the cost of dense frame-level captioning, they adopt
sparse frame sampling, which risks discarding crucial information. To overcome
these limitations, we propose VideoLucy, a deep memory backtracking framework
for long video understanding. Inspired by the human recollection process from
coarse to fine, VideoLucy employs a hierarchical memory structure with
progressive granularity. This structure explicitly defines the detail level and
temporal scope of memory at different hierarchical depths. Through an
agent-based iterative backtracking mechanism, VideoLucy systematically mines
video-wide, question-relevant deep memories until sufficient information is
gathered to provide a confident answer. This design enables effective temporal
understanding of consecutive frames while preserving critical details. In
addition, we introduce EgoMem, a new benchmark for long video understanding.
EgoMem is designed to comprehensively evaluate a model's ability to understand
complex events that unfold over time and capture fine-grained details in
extremely long videos. Extensive experiments demonstrate the superiority of
VideoLucy. Built on open-source models, VideoLucy significantly outperforms
state-of-the-art methods on multiple long video understanding benchmarks,
achieving performance even surpassing the latest proprietary models such as
GPT-4o. Our code and dataset will be made publicly available at
https://videolucy.github.io
PricingLogic Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks
Authors: Yunuo Liu, Dawei Zhu, Zena Al-Khalili, Dai Cheng, Yanjun Chen, Dietrich Klakow, Wei Zhang, Xiaoyu Shen
2025-10-14
We present PricingLogic, the first benchmark that probes whether Large
Language Models (LLMs) can reliably automate tourism-related prices when
multiple, overlapping fare rules apply. Travel agencies are eager to offload
this error-prone task onto AI systems; however, deploying LLMs without verified
reliability could result in significant financial losses and erode customer
trust. PricingLogic comprises 300 natural-language questions based on booking
requests derived from 42 real-world pricing policies, spanning two levels of
difficulty: (i) basic customer-type pricing and (ii) bundled-tour calculations
involving interacting discounts. Evaluations of a range of LLMs reveal a steep
performance drop on the harder tier, exposing systematic failures in rule
interpretation and arithmetic reasoning. These results highlight that, despite
their general capabilities, today's LLMs remain unreliable in revenue-critical
applications without further safeguards or domain adaptation. Our code and
dataset are available at https://github.com/EIT-NLP/PricingLogic.
Efficient Adaptive Transformer An Empirical Study and Reproducible Framework
Authors: Jan Miller
2025-10-14
The Efficient Adaptive Transformer (EAT) framework unifies three adaptive
efficiency techniques - progressive token pruning, sparse attention, and
dynamic early exiting - into a single, reproducible architecture for
input-adaptive inference. EAT provides an open-source benchmarking pipeline
that automates data processing, timing, and ablation across GLUE tasks (SST-2,
QQP, MNLI). Although this empirical study finds that combining these mechanisms
can increase latency in shallow six-layer models, it demonstrates that EAT
achieves slightly higher accuracy than the optimized DistilBERT baseline on
SST-2, illustrating the potential of dynamic computation for latency-sensitive
NLP. The main contribution is the open, end-to-end reproducible framework -
complete with scripts, CSV logging, and analysis utilities - intended to serve
as a community tool for further research on adaptive transformers.
An Empirical Study of Reducing AV1 Decoder Complexity and Energy Consumption via Encoder Parameter Tuning
Authors: Vibhoothi Vibhoothi, Julien Zouein, Shanker Shreejith, Jean-Baptiste Kempf, Anil Kokaram
2025-10-14
The widespread adoption of advanced video codecs such as AV1 is often
hindered by their high complexity, posing a challenge for
battery-constrained devices. While encoders can be configured to produce
bitstreams that are decoder-friendly, estimating the decoding complexity and
energy overhead for a given video is non-trivial. In this study, we
systematically analyse the impact of disabling various coding tools and
adjusting coding parameters in two AV1 encoders, libaom-av1 and SVT-AV1. Using
system-level energy measurement tools like RAPL (Running Average Power Limit)
and Intel SoC Watch (integrated with VTune profiler), we quantify the resulting
trade-offs between decoding complexity, energy consumption, and compression
efficiency for decoding a bitstream. Our results demonstrate that specific
encoder configurations can substantially reduce decoding complexity with
minimal perceptual quality degradation. For libaom-av1, disabling CDEF, an
in-loop filter, gives a mean reduction in decoding cycles of 10%. For
SVT-AV1, using the in-built fast-decode=2 preset achieves a more substantial
24% reduction in decoding cycles. These findings provide strategies for content
providers to lower the energy footprint of AV1 video streaming.
CurriFlow Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion
Authors: Jinzhou Lin, Jie Zhou, Wenhao Xu, Rongtao Xu, Changwei Wang, Shunpeng Chen, Kexue Fu, Yihua Shao, Li Guo, Shibiao Xu
2025-10-14
Semantic Scene Completion (SSC) aims to infer complete 3D geometry and
semantics from monocular images, as a crucial capability for
camera-based perception in autonomous driving. However, existing SSC methods
relying on temporal stacking or depth projection often lack explicit motion
reasoning and struggle with occlusions and noisy depth supervision. We propose
CurriFlow, a novel semantic occupancy prediction framework that integrates
optical flow-based temporal alignment with curriculum-guided depth fusion.
CurriFlow employs a multi-level fusion strategy to align segmentation, visual,
and depth features across frames using pre-trained optical flow, thereby
improving temporal consistency and dynamic object understanding. To enhance
geometric robustness, a curriculum learning mechanism progressively transitions
from sparse yet accurate LiDAR depth to dense but noisy stereo depth during
training, ensuring stable optimization and seamless adaptation to real-world
deployment. Furthermore, semantic priors from the Segment Anything Model (SAM)
provide category-agnostic supervision, strengthening voxel-level semantic
learning and spatial consistency. Experiments on the SemanticKITTI benchmark
demonstrate that CurriFlow achieves state-of-the-art performance with a mean
IoU of 16.9, validating the effectiveness of our motion-guided and
curriculum-aware design for camera-based 3D semantic scene completion.
Traveling Salesman-Based Token Ordering Improves Stability in Homomorphically Encrypted Language Models
Authors: Donghwan Rho, Sieun Seo, Hyewon Sung, Chohong Min, Ernest K. Ryu
2025-10-14
As users increasingly interact with large language models (LLMs) using
private information, secure and encrypted communication becomes essential.
Homomorphic encryption (HE) provides a principled solution by enabling
computation directly on encrypted data. Although prior work has explored
aspects of running LLMs under HE, the challenge of text generation,
particularly next-token prediction, has received limited attention and remains
a key obstacle to practical encrypted interaction. In this work, we propose a
TSP-based token reordering strategy to address the difficulties of encrypted
text generation, together with a post-processing step that further reduces
approximation error. Theoretical analysis and experimental results demonstrate
that our method prevents collapse, improves coherence in generated text, and
preserves data privacy throughout. Overall, our contributions advance the
feasibility of practical and privacy-preserving LLM inference.
CoLF Logic Programming as Infinitary Proof Exploration
Authors: Zhibo Chen, Frank Pfenning
2025-10-14
Logical Frameworks such as Automath [de Bruijn, 1968] or LF [Harper et al.,
1993] were originally conceived as metalanguages for the specification of
foundationally uncommitted deductive systems, yielding generic proof checkers.
Their high level of abstraction was soon exploited to also express algorithms
over deductive systems such as theorem provers, type-checkers, evaluators,
compilers, proof transformers, etc. in the paradigm of
computation-as-proof-construction. This has been realized in languages such as
λ-Prolog [Miller et al., 1991] or Elf [Pfenning, 1991] based on
backward chaining, and LolliMon [Lopez et al., 2005] or Celf [Schack-Nielsen
and Schuermann, 2008], which integrated forward chaining. None of these early
frameworks supported the direct expression of infinitary objects or proofs,
which are available in the recently developed CoLF^ω [Chen, 2023]. In
this work-in-progress report, we sketch an approach to
computation-as-proof-construction over the first-order fragment of
CoLF^ω (called CoLF^ω_1) that already includes infinitary
objects and proofs. A key idea is the interpretation of logic variables as
channels and computation as concurrent message-passing. This is
realized in a concrete compiler from CoLF^ω_1 to Sax, a
proof-theoretically inspired parallel programming language based on the
proof-reduction in the semi-axiomatic sequent calculus [DeYoung et al., 2020].
Reinforced Preference Optimization for Recommendation
Authors: Junfei Tan, Yuxin Chen, An Zhang, Junguang Jiang, Bin Liu, Ziru Xu, Han Zhu, Jian Xu, Bo Zheng, Xiang Wang
2025-10-14
Recent breakthroughs in large language models (LLMs) have fundamentally
shifted recommender systems from discriminative to generative paradigms, where
user behavior modeling is achieved by generating target items conditioned on
historical interactions. Yet current generative recommenders still suffer from
two core limitations: the lack of high-quality negative modeling and the
reliance on implicit rewards. Reinforcement learning with verifiable rewards
(RLVR) offers a natural solution by enabling on-policy sampling of harder
negatives and grounding optimization in explicit reward signals. However,
applying RLVR to generative recommenders remains non-trivial. Its unique
generation space often leads to invalid or repetitive items that undermine
sampling efficiency, and ranking supervision is sparse since most items receive
identical zero rewards. To address these challenges, we propose Reinforced
Preference Optimization for Recommendation (ReRe), a reinforcement-based
paradigm tailored to LLM-based recommenders, an important direction in
generative recommendation. ReRe incorporates constrained beam search to improve
sampling efficiency and diversify hard negatives, while augmenting rule-based
accuracy rewards with auxiliary ranking rewards for finer-grained supervision.
Extensive experiments on three real-world datasets demonstrate that ReRe
consistently outperforms both traditional and LLM-based recommenders in ranking
performance. Further analysis shows that ReRe not only enhances performance
across both base and SFT-initialized models but also generalizes robustly
across different backbone families and scales. Beyond empirical gains, we
systematically investigate the design space of RLVR in recommendation across
generation, sampling strategy, reward modeling, and optimization algorithm,
offering insights for future research.
A Survey on Parallel Reasoning
Authors: Ziqi Wang, Boye Niu, Zipeng Gao, Zhi Zheng, Tong Xu, Linghui Meng, Zhongli Li, Jing Liu, Yilong Chen, Chen Zhu, Hua Wu, Haifeng Wang, Enhong Chen
2025-10-14
With the increasing capabilities of Large Language Models (LLMs), parallel
reasoning has emerged as a new inference paradigm that enhances reasoning
robustness by concurrently exploring multiple lines of thought before
converging on a final answer. It has become a significant trend to explore
parallel reasoning to overcome the fragility of standard sequential methods and
improve practical performance. In this paper, we aim to survey and summarize
the progress and challenges of parallel reasoning. We first present a formal
definition of parallel reasoning and clarify its distinction from related
concepts like Chain-of-Thought. Then, we organize and discuss advanced
techniques based on a novel taxonomy, including non-interactive reasoning,
interactive reasoning, and efficiency-focused decoding strategies.
Additionally, we explore various application scenarios, such as solving complex
problems and enhancing the reliability of LLM outputs. Finally, we highlight the
core challenges of parallel reasoning and suggest potential directions for
future research. We hope that our work can provide a useful roadmap for
beginners and encourage more research on improving parallel reasoning methods.
Related resources are available at
https://github.com/PPPP-kaqiu/Awesome-Parallel-Reasoning.
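
As a rough illustration of the "explore several lines of thought, then converge on a final answer" pattern described in this abstract, the Python sketch below aggregates final answers by majority vote. Majority voting is only one simple convergence rule, chosen here for illustration; the function name `majority_answer` and the sample answers are hypothetical, and the survey's taxonomy is considerably broader.

```python
from collections import Counter


def majority_answer(candidate_answers):
    """Return the most frequent final answer among parallel reasoning paths.

    Ties are broken in favour of the answer that was seen first.
    """
    if not candidate_answers:
        raise ValueError("need at least one candidate answer")
    counts = Counter(candidate_answers)
    # most_common keeps first-encountered order among equal counts
    answer, _ = counts.most_common(1)[0]
    return answer


# Hypothetical final answers produced by five independent reasoning paths.
paths = ["42", "41", "42", "42", "17"]
final = majority_answer(paths)
```

In this toy setup each string stands in for the final answer of one independently sampled reasoning path; a real system would also need to normalise answers before counting them.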
FedLoDrop Federated LoRA with Dropout for Generalized LLM Fine-tuning
Authors: Sijing Xie, Dingzhu Wen, Changsheng You, Qimei Chen, Mehdi Bennis, Kaibin Huang
2025-10-14
Fine-tuning (FT) large language models (LLMs) is crucial for adapting
general-purpose models to specific tasks, enhancing accuracy and relevance with
minimal resources. To further enhance generalization ability while reducing
training costs, this paper proposes Federated LoRA with Dropout (FedLoDrop), a
new framework that applies dropout to the rows and columns of the trainable
matrix in Federated LoRA. A generalization error bound and convergence analysis
under sparsity regularization are obtained, which elucidate the fundamental
trade-off between underfitting and overfitting. The error bound reveals that a
higher dropout rate increases model sparsity, thereby lowering the upper bound
of pointwise hypothesis stability (PHS). While this reduces the gap between
empirical and generalization errors, it also incurs a higher empirical error,
which, together with the gap, determines the overall generalization error. On
the other hand, though dropout reduces communication costs, deploying FedLoDrop
at the network edge still faces challenges due to limited network resources. To
address this issue, an optimization problem is formulated to minimize the upper
bound of the generalization error, by jointly optimizing the dropout rate and
resource allocation subject to the latency and per-device energy consumption
constraints. To solve this problem, a branch-and-bound (B&B)-based method is
proposed to obtain its globally optimal solution. Moreover, to reduce the high
computational complexity of the B&B-based method, a penalized successive
convex approximation (P-SCA)-based algorithm is proposed to efficiently obtain
its high-quality suboptimal solution. Finally, numerical results demonstrate
the effectiveness of the proposed approach in mitigating overfitting and
improving the generalization capability.
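
To make the idea of applying dropout to the rows and columns of a trainable matrix more concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the authors' implementation: the function name `row_col_dropout`, the use of independent Bernoulli masks, and the absence of any rescaling of the kept entries are all choices made here for simplicity.

```python
import random


def row_col_dropout(matrix, drop_rate, rng):
    """Zero out whole rows and whole columns of `matrix` at random.

    Each row and each column is dropped independently with probability
    `drop_rate` (an assumption; the paper may sample masks differently).
    Returns the masked matrix plus the row and column keep-masks.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    keep_row = [rng.random() >= drop_rate for _ in range(n_rows)]
    keep_col = [rng.random() >= drop_rate for _ in range(n_cols)]
    masked = [
        [matrix[i][j] if keep_row[i] and keep_col[j] else 0.0
         for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return masked, keep_row, keep_col


# Hypothetical 3x4 trainable matrix standing in for a LoRA factor.
weights = [[1.0, 2.0, 3.0, 4.0],
           [5.0, 6.0, 7.0, 8.0],
           [9.0, 10.0, 11.0, 12.0]]
masked, keep_row, keep_col = row_col_dropout(weights, 0.5, random.Random(0))
```

Because entire rows and columns are zeroed, only the surviving sub-matrix would need to be trained and exchanged, which is consistent with the abstract's remark that dropout lowers costs while raising model sparsity.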
Compressibility Measures Complexity Minimum Description Length Meets Singular Learning Theory
Authors: Einar Urdshals, Edmund Lau, Jesse Hoogland, Stan van Wingerden, Daniel Murfet
2025-10-14
We study neural network compressibility by using singular learning theory to
extend the minimum description length (MDL) principle to singular models like
neural networks. Through extensive experiments on the Pythia suite with
quantization, factorization, and other compression techniques, we find that
complexity estimates based on the local learning coefficient (LLC) are closely,
and in some cases, linearly correlated with compressibility. Our results
provide a path toward rigorously evaluating the limits of model compression.
GeoPipe a Geo-distributed LLM Training Framework with enhanced Pipeline Parallelism in a Lossless RDMA-enabled Datacenter Optical Transport Network
Authors: Jun Dai, Xiaorun Wang, Kexiong Fang, Zheng Yang, Yuefeng Ji, Jiawei Zhang
2025-10-14
The proliferation of Large Language Models (LLMs) with exponentially growing
parameters is making cross-data center (DC) training an inevitable trend.
However, viable strategies for extending single-DC training frameworks to
multi-DC environments remain underdeveloped. We experimentally demonstrate, for
the first time, a high-performance geo-distributed LLM training framework
across multiple DCs interconnected by a lossless, remote direct memory access
(RDMA) enabled Datacenter Optical Transport Network (DC-OTN). An enhanced
pipeline parallelism scheme is implemented within the Ascend full-stack
environment of Huawei, which effectively eliminates the impact of cross-DC
communication overhead on training efficiency. Overlapping of computation and
cross-DC communication is achieved under constrained cross-DC bandwidth and High
Bandwidth Memory (HBM), reducing the computation bubble ratio by up to 78.91%.
APCE Adaptive Progressive Context Expansion for Long Context Processing
Authors: Baisub Lee, Sanghyun Byun, Mohanad Odema, Jung Guack, Jacob Song, Woo Seong Chung
2025-10-14
Deploying useful Long-Context Transformer Models (LCTMs) requires addressing
two key challenges: (1) a growing memory footprint due to quadratic
self-attention and linear KV-cache scaling in memory as sequence length
increases; (2) the ContextRot phenomenon, where empirical evidence suggests that
transformer architectures' performance degrades with increasing context length.
Given the shared dependency on the input, a natural question arises: Can we
surgically select the most important input chunks for processing to
synergistically (a) reduce the memory footprint, and (b) mitigate the
ContextRot effects? In this paper, we answer this question in the affirmative
for long-context summarization tasks. We propose APCE as a context-aware
solution to select the most important input chunks through low-dimensional
semantic similarity matching with the current query. By directly operating on
the input, APCE decouples from strict dependency on underlying hardware or CUDA
environments, promising a compatible solution scalable to different deployment
systems. Our empirical evaluations have demonstrated superior or on-par
summarization performance for APCE compared to the full dense baseline using a
fraction (50%-70%) of the input sequence, resulting in KV-cache and
self-attention memory efficiency improvements. We hope our findings inspire
further research on context-aware efficiency solutions for LCTMs geared towards
other relevant long-context tasks.
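
The chunk-selection step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are given as plain lists of floats, cosine similarity is used as the matching score, and the names `cosine`, `select_chunks`, and `keep_ratio` are hypothetical rather than taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_chunks(chunk_embeddings, query_embedding, keep_ratio=0.5):
    """Return indices of the best-matching chunks, in document order."""
    k = max(1, math.ceil(keep_ratio * len(chunk_embeddings)))
    scores = [cosine(e, query_embedding) for e in chunk_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the kept indices so the selected chunks keep their original order.
    return sorted(ranked[:k])


# Hypothetical 2-D embeddings for four input chunks and one query.
chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
selected = select_chunks(chunks, query, keep_ratio=0.5)
```

Only the selected chunks would then be passed to the model, so both attention and cache memory scale with the kept fraction of the input rather than with its full length.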
Direct Multi-Token Decoding
Authors: Xuan Luo, Weizhi Wang, Xifeng Yan
2025-10-13
Decoder-only transformers have become the standard architecture for large
language models (LLMs) due to their strong performance. Recent studies suggest
that, in pre-trained LLMs, early, middle, and late layers may serve distinct
roles: Early layers focus on understanding the input context, middle layers
handle task-specific processing, and late layers convert abstract
representations into output tokens. We hypothesize that once representations
have been processed by the early and middle layers, the resulting hidden states
may encapsulate sufficient information to support the generation of multiple
tokens using only the late layers, eliminating the need to repeatedly traverse
the early and middle layers. We refer to this inference paradigm as Direct
Multi-Token Decoding (DMTD). Unlike speculative decoding, our method introduces
no additional parameters, auxiliary routines, or post-generation verification.
Despite being trained on a limited dataset, a fine-tuned DMTD Qwen3-4B model
has already demonstrated promising results, achieving up to a 2x speedup with
only minor performance loss. Moreover, as shown in our scaling analysis, its
performance is expected to further improve with larger training datasets.
FlexPipe Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
Authors: Yanying Lin, Shijie Peng, Chengzhi Lu, Chengzhong Xu, Kejiang Ye
2025-10-13
Serving Large Language Models (LLMs) in production faces significant
challenges from highly variable request patterns and severe resource
fragmentation in serverless clusters. Current systems rely on static pipeline
configurations that struggle to adapt to dynamic workload conditions, leading
to substantial inefficiencies. We present FlexPipe, a novel system that
dynamically reconfigures pipeline architectures during runtime to address these
fundamental limitations. FlexPipe decomposes models into fine-grained stages
and intelligently adjusts pipeline granularity based on real-time request
pattern analysis, implementing three key innovations: fine-grained model
partitioning with preserved computational graph constraints, inflight pipeline
refactoring with consistent cache transitions, and topology-aware resource
allocation that navigates GPU fragmentation. Comprehensive evaluation on an
82-GPU cluster demonstrates that FlexPipe achieves up to 8.5x better resource
efficiency while maintaining 38.3% lower latency compared to state-of-the-art
systems, reducing GPU reservation requirements from 75% to 30% of peak
capacity.
Topological Vibration Analysis of Elastic Lattices via Bloch Sphere Mapping
Authors: Kazi Tahsin Mahmood, M. Arif Hasan
2025-10-13
Mechanical lattices support topological wave phenomena governed by geometric
phases. We develop a compact Hilbert space description for one-dimensional
elastic chains, expressing intra-cell motion as a normalized superposition of
orthogonal eigenstates and tracking complex amplitudes as trajectories on a
Bloch sphere. For diatomic lattices, this framework makes inversion symmetry
protection explicit: the relative phase between in-phase and out-of-phase modes
is piecewise locked, and the Zak phase is quantized with band-dependent jumps
at symmetry points. Extending the analysis to triatomic lattices shows that
restoring inversion retains quantization, whereas breaking it dequantizes the
geometric phase while leaving the spectral origin invariant. Viewing
norm-preserving transformations of the modal coefficient pair as Bloch sphere
rotations, we demonstrate classical analogues of single-qubit logic gates. A
pi-phase rotation about a transverse axis swaps the modal poles, and a
longitudinal-axis phase flip maps balanced superpositions to their conjugates.
These gate-like operations are realized by controlled evolution across
wavenumber space and can be driven or reprogrammed through spatiotemporal
stiffness modulation. Introducing space-time modulation hybridizes carrier and
sideband harmonics, producing continuous phase winding and open-path geometric
phases accumulated along the Floquet trajectory. Across static and modulated
regimes, the framework unifies algebraic and geometric viewpoints, remains
robust to gauge and basis choices, and operates directly on amplitude-phase
data. The results clarify how symmetry, modulation, and topology jointly govern
dispersion, modal mixing, and phase accumulation, providing tools to analyze
and design vibration and acoustic functionalities in engineered structures.
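
For readers unfamiliar with the mapping, the sketch below shows how a pair of complex modal amplitudes can be placed on a Bloch sphere using the standard parametrisation (cos(θ/2), e^{iφ} sin(θ/2)), and how a pole swap and a phase flip act on that pair. It uses the textbook single-qubit conventions rather than anything specific to the paper; the function names are hypothetical.

```python
import cmath
import math


def bloch_angles(a, b):
    """Map amplitudes (a, b) to Bloch angles (theta, phi).

    theta is the polar angle (0 at the first mode, pi at the second) and
    phi is the relative phase of b with respect to a, in [0, 2*pi).
    """
    norm = math.sqrt(abs(a) ** 2 + abs(b) ** 2)
    a, b = a / norm, b / norm
    theta = 2.0 * math.atan2(abs(b), abs(a))
    if abs(a) == 0 or abs(b) == 0:
        return theta, 0.0  # relative phase is undefined at the poles
    phi = (cmath.phase(b) - cmath.phase(a)) % (2.0 * math.pi)
    return theta, phi


def swap_poles(a, b):
    """Pi rotation about a transverse axis (X-like gate, up to global phase)."""
    return b, a


def phase_flip(a, b):
    """Pi rotation about the longitudinal axis (Z-like gate)."""
    return a, -b


# A balanced superposition with relative phase pi/2.
a0, b0 = 1 / math.sqrt(2), 1j / math.sqrt(2)
theta0, phi0 = bloch_angles(a0, b0)
theta1, phi1 = bloch_angles(*phase_flip(a0, b0))
```

The phase flip shifts the relative phase by π; for the balanced state with φ = π/2 chosen here, that is the same as taking the complex conjugate of the amplitude pair.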
Indoor Localization using Compact, Telemetry-Agnostic, Transfer-Learning Enabled Decoder-Only Transformer
Authors: Nayan Sanjay Bhatia, Pranay Kocheta, Russell Elliott, Harikrishna S. Kuttivelil, Katia Obraczka
2025-10-13
Indoor Wi-Fi positioning remains a challenging problem due to the high
sensitivity of radio signals to environmental dynamics, channel propagation
characteristics, and hardware heterogeneity. Conventional fingerprinting and
model-based approaches typically require labor-intensive calibration and suffer
rapid performance degradation when devices, channel or deployment conditions
change. In this paper, we introduce Locaris, a decoder-only large language
model (LLM) for indoor localization. Locaris treats each access point (AP)
measurement as a token, enabling the ingestion of raw Wi-Fi telemetry without
pre-processing. By fine-tuning its LLM on different Wi-Fi datasets, Locaris
learns a lightweight and generalizable mapping from raw signals directly to
device location. Our experimental study comparing Locaris with state-of-the-art
methods consistently shows that Locaris matches or surpasses existing
techniques for various types of telemetry. Our results demonstrate that compact
LLMs can serve as calibration-free regression models for indoor localization,
offering scalable and robust cross-environment performance in heterogeneous
Wi-Fi deployments. Few-shot adaptation experiments, using only a handful of
calibration points per device, further show that Locaris maintains high
accuracy when applied to previously unseen devices and deployment scenarios.
This yields sub-meter accuracy with just a few hundred samples, robust
performance under missing APs, and support for any and all available telemetry. Our
findings highlight the practical viability of Locaris for indoor positioning in
real-world scenarios, particularly in large-scale deployments where
extensive calibration is infeasible.
Variational Mixture of Graph Neural Experts for Alzheimer's Disease Biomarker Recognition in EEG Brain Networks
Authors: Jun-En Ding, Anna Zilverstand, Shihao Yang, Albert Chih-Chieh Yang, Feng Liu
2025-10-13
Dementia disorders such as Alzheimer's disease (AD) and frontotemporal
dementia (FTD) exhibit overlapping electrophysiological signatures in EEG that
challenge accurate diagnosis. Existing EEG-based methods are limited by
full-band frequency analysis that hinders precise differentiation of dementia
subtypes and severity stages. We propose a variational mixture of graph neural
experts (VMoGE) that integrates frequency-specific biomarker identification
with structured variational inference for enhanced dementia diagnosis and
staging. VMoGE employs a multi-granularity transformer to extract multi-scale
temporal patterns across four frequency bands, followed by a variational graph
convolutional encoder using Gaussian Markov Random Field priors. Through
structured variational inference and adaptive gating, VMoGE links neural
specialization to physiologically meaningful EEG frequency bands. Evaluated on
two diverse datasets for both subtype classification and severity staging,
VMoGE achieves superior performance with AUC improvements of +4% to +10% over
state-of-the-art methods. Moreover, VMoGE provides interpretable insights
through expert weights that correlate with clinical indicators and spatial
patterns aligned with neuropathological signatures, facilitating EEG biomarker
discovery for comprehensive dementia diagnosis and monitoring.
QeRL Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
Authors: Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen
2025-10-13
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for
large language models (LLMs). While RL is essential for LLMs' reasoning
capabilities, it is resource-intensive, requiring substantial GPU memory and
long rollout durations. QeRL addresses these issues by combining NVFP4
quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL
while reducing memory overhead. Beyond efficiency, our findings show that
quantization noise increases policy entropy, enhancing exploration, and
enabling the discovery of better strategies during RL. To further optimize
exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism,
which dynamically adjusts noise during training. Experiments demonstrate that
QeRL delivers over 1.5 times speedup in the rollout phase. Moreover, this is
the first framework to enable RL training of a 32B LLM on a single H100 80GB
GPU, while delivering overall speedups for RL training. It also achieves faster
reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while
matching the performance of full-parameter fine-tuning on mathematical
benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) in the 7B model. These
results establish QeRL as an efficient and effective framework for RL training
in LLMs.
Scaling Language-Centric Omnimodal Representation Learning
Authors: Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong
2025-10-13
Recent multimodal embedding approaches leveraging multimodal large language
models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising
results, yet the underlying reasons behind their superiority remain
underexplored. This work argues that a crucial advantage of MLLM-based
approaches stems from implicit cross-modal alignment achieved during generative
pretraining, where the language decoder learns to exploit multimodal signals
within a shared representation space for generating unimodal outputs. Through
analysis of anisotropy and kernel similarity structure, we empirically confirm
that latent alignment emerges within MLLM representations, allowing CL to serve
as a lightweight refinement stage. Leveraging this insight, we propose a
Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive
experiments across diverse backbones and benchmarks demonstrate its
effectiveness, achieving state-of-the-art performance across modalities.
Furthermore, we identify a Generation-Representation Scaling Law (GRSL),
showing that the representational capabilities gained through contrastive
refinement scale positively with the MLLM's generative capabilities. This
suggests that improving generative abilities evolves as an effective paradigm
for enhancing representation quality. We provide a theoretical explanation of
GRSL, which formally links the MLLM's generative quality to the upper bound on
its representation performance, and validate it on a challenging, low-resource
visual-document retrieval task, showing that continual generative pretraining
before CL can further enhance the potential of a model's embedding
capabilities. Codes, models, and resources are available at
https://github.com/LCO-Embedding/LCO-Embedding.
Diffusion Transformers with Representation Autoencoders
Authors: Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie
2025-10-13
Latent generative modeling, where a pretrained autoencoder maps pixels into a
latent space for the diffusion process, has become the standard strategy for
Diffusion Transformers (DiT); however, the autoencoder component has barely
evolved. Most DiTs continue to rely on the original VAE encoder, which
introduces several limitations: outdated backbones that compromise
architectural simplicity, low-dimensional latent spaces that restrict
information capacity, and weak representations that result from purely
reconstruction-based training and ultimately limit generative quality. In this
work, we explore replacing the VAE with pretrained representation encoders
(e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term
Representation Autoencoders (RAEs). These models provide both high-quality
reconstructions and semantically rich latent spaces, while allowing for a
scalable transformer-based architecture. Since these latent spaces are
typically high-dimensional, a key challenge is enabling diffusion transformers
to operate effectively within them. We analyze the sources of this difficulty,
propose theoretically motivated solutions, and validate them empirically. Our
approach achieves faster convergence without auxiliary representation alignment
losses. Using a DiT variant equipped with a lightweight, wide DDT head, we
achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no
guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers
clear advantages and should be the new default for diffusion transformer
training.
Hierarchical Qubit-Merging Transformer for Quantum Error Correction
Authors: Seong-Joon Park, Hee-Youl Kwak, Yongjune Kim
2025-10-13
For reliable large-scale quantum computation, a quantum error correction
(QEC) scheme must effectively resolve physical errors to protect logical
information. Leveraging recent advances in deep learning, neural network-based
decoders have emerged as a promising approach to enhance the reliability of
QEC. We propose the Hierarchical Qubit-Merging Transformer (HQMT), a novel and
general decoding framework that explicitly leverages the structural graph of
stabilizer codes to learn error correlations across multiple scales. Our
architecture first computes attention locally on structurally related groups of
stabilizers and then systematically merges these qubit-centric representations
to build a global view of the error syndrome. The proposed HQMT achieves
substantially lower logical error rates for surface codes by integrating a
dedicated qubit-merging layer within the transformer architecture. Across
various code distances, HQMT significantly outperforms previous neural
network-based QEC decoders as well as a powerful belief propagation with
ordered statistics decoding (BP+OSD) baseline. This hierarchical approach
provides a scalable and effective framework for surface code decoding,
advancing the realization of reliable quantum computing.
Culturally-Aware Conversations A Framework & Benchmark for LLMs
Authors: Shreya Havaldar, Sunny Rai, Young-Min Cho, Lyle Ungar
2025-10-13
Existing benchmarks that measure cultural adaptation in LLMs are misaligned
with the actual challenges these models face when interacting with users from
diverse cultural backgrounds. In this work, we introduce the first framework
and benchmark designed to evaluate LLMs in realistic, multicultural
conversational settings. Grounded in sociocultural theory, our framework
formalizes how linguistic style - a key element of cultural communication - is
shaped by situational, relational, and cultural context. We construct a
benchmark dataset based on this framework, annotated by culturally diverse
raters, and propose a new set of desiderata for cross-cultural evaluation in
NLP: conversational framing, stylistic sensitivity, and subjective correctness.
We evaluate today's top LLMs on our benchmark and show that these models
struggle with cultural adaptation in a conversational setting.
Situat3DChange Situated 3D Change Understanding Dataset for Multimodal Large Language Model
Authors: Ruiping Liu, Junwei Zheng, Yufan Chen, Zirui Wang, Kunyu Peng, Kailun Yang, Jiaming Zhang, Marc Pollefeys, Rainer Stiefelhagen
2025-10-13
Physical environments and circumstances are fundamentally dynamic, yet
current 3D datasets and evaluation benchmarks tend to concentrate on either
dynamic scenarios or dynamic situations in isolation, resulting in incomplete
comprehension. To overcome these constraints, we introduce Situat3DChange, an
extensive dataset supporting three situation-aware change understanding tasks
following the perception-action model: 121K question-answer pairs, 36K change
descriptions for perception tasks, and 17K rearrangement instructions for the
action task. To construct this large-scale dataset, Situat3DChange leverages
11K human observations of environmental changes to establish shared mental
models and shared situational awareness for human-AI collaboration. These
observations, enriched with egocentric and allocentric perspectives as well as
categorical and coordinate spatial relations, are integrated using an LLM to
support understanding of situated changes. To address the challenge of
comparing pairs of point clouds from the same scene with minor changes, we
propose SCReasoner, an efficient 3D MLLM approach that enables effective point
cloud comparison with minimal parameter overhead and no additional tokens
required for the language decoder. Comprehensive evaluation on Situat3DChange
tasks highlights both the progress and limitations of MLLMs in dynamic scene
and situation understanding. Additional experiments on data scaling and
cross-domain transfer demonstrate the task-agnostic effectiveness of using
Situat3DChange as a training dataset for MLLMs.
ReLook Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
Authors: Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, Bo Zhou
2025-10-13
While Large Language Models (LLMs) excel at algorithmic code generation, they
struggle with front-end development, where correctness is judged on rendered
pixels and interaction. We present ReLook, an agentic, vision-grounded
reinforcement learning framework that empowers an agent to close a robust
generate--diagnose--refine loop by invoking a multimodal LLM (MLLM) as a tool.
During training, the agent uses the MLLM-in-the-loop both as a visual
critic--scoring code with screenshots--and as a source of actionable,
vision-grounded feedback; a strict zero-reward rule for invalid renders anchors
renderability and prevents reward hacking. To prevent behavioral collapse, we
introduce Forced Optimization, a strict acceptance rule that admits only
improving revisions, yielding monotonically better trajectories. At inference,
we decouple the critic and run a lightweight, critic-free self-edit cycle,
keeping latency comparable to base decoding while retaining most of the gains.
Across three widely used benchmarks, ReLook consistently outperforms strong
baselines in vision-grounded front-end code generation, highlighting the
benefits of agentic perception, visual rewards, and training-inference
decoupling.
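
The acceptance rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the `propose_revision` and `critic_score` callables, and the convention that an invalid render is reported as `None` are all hypothetical, chosen only to show how a strict "accept only improvements" rule combined with a zero reward for invalid renders yields a monotonically improving trajectory.

```python
from typing import Callable, List, Optional, Tuple


def refine_with_forced_optimization(
    initial_code: str,
    propose_revision: Callable[[str], str],
    critic_score: Callable[[str], Optional[float]],
    max_steps: int = 4,
) -> Tuple[str, List[float]]:
    """Keep a revision only if it strictly improves the critic score."""

    def reward(code: str) -> float:
        score = critic_score(code)
        # Zero-reward rule (assumed encoding): an invalid render is None.
        return 0.0 if score is None else score

    best_code = initial_code
    best_score = reward(initial_code)
    accepted_scores = [best_score]
    for _ in range(max_steps):
        candidate = propose_revision(best_code)
        candidate_score = reward(candidate)
        if candidate_score > best_score:  # strict acceptance rule
            best_code, best_score = candidate, candidate_score
            accepted_scores.append(best_score)
    return best_code, accepted_scores
```

Because rejected candidates never replace the current best, the list of accepted scores is strictly increasing by construction, which is the property the abstract refers to as monotonically better trajectories.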
AndesVL Technical Report An Efficient Mobile-side Multimodal Large Language Model
Authors: Zhiwei Jin, Xiaohui Song, Nan Wang, Yafei Liu, Chao Li, Xin Li, Ruichen Wang, Zhihao Li, Qi Qi, Long Cheng, Dongze Hao, Quanlong Zheng, Yanhao Zhang, Haobo Ji, Jian Ma, Zhitong Zheng, Zhenyi Lin, Haolin Deng, Xin Zou, Xiaojie Yin, Ruilin Wang, Liankai Cai, Haijing Liu, Yuqing Qiu, Ke Chen, Zixian Li, Chi Xie, Huafei Li, Chenxing Li, Chuangchuang Wang, Kai Tang, Zhiguang Zhu, Kai Tang, Wenmei Gao, Rui Wang, Jun Wu, Chao Liu, Qin Xie, Chen Chen, Haonan Lu
2025-10-13
In recent years, cloud-based MLLMs such as QwenVL, InternVL, GPT-4o,
Gemini, and Claude Sonnet have demonstrated outstanding performance with
enormous model sizes reaching hundreds of billions of parameters, but they
far exceed the memory, power consumption, and computing capacity of edge
devices such as mobile phones. This paper introduces
AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on
Qwen3's LLM and various visual encoders. We comprehensively outline the model
architectures, training pipeline, and training data of AndesVL, which achieves
first-tier performance across a wide range of open-source benchmarks, including
fields such as text-rich image understanding, reasoning and math, multi-image
comprehension, general VQA, hallucination mitigation, multilingual
understanding, and GUI-related tasks when compared with state-of-the-art models
of a similar scale. Furthermore, we introduce a 1+N LoRA architecture alongside
a Quantization-Aware LoRA Fine-Tuning (QALFT) framework to facilitate efficient
task adaptation and model compression during mobile-side deployment of AndesVL.
Moreover, utilizing our cache eviction algorithm -- OKV -- along with
customized speculative decoding and compression strategies, we achieve a 6.7x
peak decoding speedup ratio, up to 30.9% memory reduction, and 1.8
bits-per-weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips. We
release all models on https://huggingface.co/OPPOer.
From to Multidimensional Supervision of Reasoning Process for LLM Optimization
Authors: Beining Wang, Weihang Su, Hongtao Tian, Tao Yang, Yujia Zhou, Ting Yao, Qingyao Ai, Yiqun Liu
2025-10-13
Improving the multi-step reasoning ability of Large Language Models (LLMs) is
a critical yet challenging task. The dominant paradigm, outcome-supervised
reinforcement learning (RLVR), rewards only correct final answers, often
propagating flawed reasoning and suffering from sparse reward signals. While
process-level reward models (PRMs) provide denser, step-by-step feedback, they
lack generalizability and interpretability, requiring task-specific
segmentation of the reasoning process. To this end, we propose the
Dimension-level Reward Model (DRM), a new supervision framework that bridges
the gap between these two approaches. DRM evaluates the quality of a reasoning
process along three fundamental, complementary, and interpretable dimensions:
Confidence for uncertainty calibration, Relevance for semantic alignment, and
Coherence for logical consistency. Together, these dimensions capture aspects
beyond final answer correctness and enable interpretable assessment without
requiring ground truth answers. Experimental results show that DRM provides
effective supervision signals, guides the optimization of LLMs and enhances
their reasoning ability. In particular, DRM-supervised training achieves
consistent gains on both in-distribution and out-of-distribution open-domain
tasks, including mathematics, question answering, code execution, and puzzles.
Our findings demonstrate that multidimensional supervision of the reasoning
process can improve the generalized reasoning ability of LLMs beyond the
training distribution.
Multi-View Graph Feature Propagation for Privacy Preservation and Feature Sparsity
Authors: Etzion Harari, Moshe Unger
2025-10-13
Graph Neural Networks (GNNs) have demonstrated remarkable success in node
classification tasks over relational data, yet their effectiveness often
depends on the availability of complete node features. In many real-world
scenarios, however, feature matrices are highly sparse or contain sensitive
information, leading to degraded performance and increased privacy risks.
Furthermore, direct exposure of information can result in unintended data
leakage, enabling adversaries to infer sensitive information. To address these
challenges, we propose a novel Multi-view Feature Propagation (MFP) framework
that enhances node classification under feature sparsity while promoting
privacy preservation. MFP extends traditional Feature Propagation (FP) by
dividing the available features into multiple Gaussian-noised views, each
propagating information independently through the graph topology. The
aggregated representations yield expressive and robust node embeddings. This
framework is novel in two respects: it introduces a mechanism that improves
robustness under extreme sparsity, and it provides a principled way to balance
utility with privacy. Extensive experiments conducted on graph datasets
demonstrate that MFP outperforms state-of-the-art baselines in node
classification while substantially reducing privacy leakage. Moreover, our
analysis demonstrates that propagated outputs serve as alternative imputations
rather than reconstructions of the original features, preserving utility
without compromising privacy. A comprehensive sensitivity analysis further
confirms the stability and practical applicability of MFP across diverse
scenarios. Overall, MFP provides an effective and privacy-aware framework for
graph learning in domains characterized by missing or sensitive features.
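
A rough sketch of the idea in this abstract, under stated assumptions, may help: each view perturbs the observed features with Gaussian noise, runs feature propagation over the graph with observed nodes clamped to their noised values, and the views are then averaged. The scalar features, mean-over-neighbors diffusion, averaging as the aggregator, and every name below (`propagate`, `multi_view_propagate`, `sigma`, `num_views`) are illustrative choices, not details taken from the paper, which may construct its views differently.

```python
import random
from typing import Dict, List, Optional


def propagate(
    adj: Dict[int, List[int]], feats: List[Optional[float]], iters: int = 50
) -> List[float]:
    """Feature propagation: diffuse over neighbors, clamp known values each step."""
    x = [0.0 if f is None else f for f in feats]
    for _ in range(iters):
        nxt = []
        for node in range(len(x)):
            nbrs = adj[node]
            nxt.append(sum(x[n] for n in nbrs) / len(nbrs) if nbrs else x[node])
        # Reset observed nodes to their (possibly noised) input value.
        x = [nxt[i] if feats[i] is None else feats[i] for i in range(len(x))]
    return x


def multi_view_propagate(
    adj: Dict[int, List[int]],
    feats: List[Optional[float]],
    num_views: int = 8,
    sigma: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Average several independently propagated, Gaussian-noised views."""
    rng = random.Random(seed)
    views = []
    for _ in range(num_views):
        noised = [None if f is None else f + rng.gauss(0.0, sigma) for f in feats]
        views.append(propagate(adj, noised))
    return [sum(v[i] for v in views) / num_views for i in range(len(feats))]
```

On a three-node path graph with the middle feature missing, for example `multi_view_propagate({0: [1], 1: [0, 2], 2: [1]}, [1.0, None, 3.0])`, the missing value is imputed near the mean of its neighbors, while the returned values for observed nodes are noised averages rather than the raw inputs.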
Efficient LLM Inference over Heterogeneous Edge Networks with Speculative Decoding
Authors: Bingjie Zhu, Zhixiong Chen, Liqiang Zhao, Hyundong Shin, Arumugam Nallanathan
2025-10-13
Large language model (LLM) inference at the network edge is a promising
paradigm that leverages distributed edge resources to run inference
near users and enhance privacy. Existing edge-based LLM inference systems
typically adopt autoregressive decoding (AD), which only generates one token
per forward pass. This iterative process, compounded by the limited
computational resources of edge nodes, results in high latency and
constrains the system's ability to support multiple users under growing
demands. To address these challenges, we propose a speculative decoding
(SD)-based framework that deploys small and large models across
heterogeneous edge nodes to collaboratively deliver inference services.
Specifically, the small model rapidly generates draft tokens that the large
model verifies in parallel, enabling multi-token generation per forward pass
and thus reducing latency. To improve resource utilization of edge
nodes, we incorporate pipeline parallelism to overlap drafting and verification
across multiple inference tasks. Based on this framework, we analyze and derive
a comprehensive latency model incorporating both communication and inference
latency. Then, we formulate a joint optimization problem for speculation
length, task batching, and wireless communication resource allocation to
minimize total latency. To address this problem, we derive the
closed-form solutions for wireless communication resource allocation, and
develop a dynamic programming algorithm for joint batching and speculation
control strategies. Experimental results demonstrate that the proposed
framework achieves lower latency compared to AD-based systems.
In addition, the proposed joint optimization method delivers up to 44.9% latency
reduction compared to benchmark schemes.
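
The draft-and-verify loop at the heart of speculative decoding can be sketched as follows. This is a simplified greedy variant for illustration only, not the paper's system: the models are plain callables that return the next token for a prefix, the verification loop stands in for what would be a single batched forward pass of the large model, and the bonus token usually emitted after a fully accepted draft is omitted. All names (`speculative_decode`, `spec_len`, and so on) are hypothetical.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def autoregressive_decode(target: NextToken, prompt: List[Token], num_new: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target(out))
    return out


def speculative_decode(
    draft: NextToken, target: NextToken, prompt: List[Token],
    num_new: int, spec_len: int = 4,
) -> List[Token]:
    out = list(prompt)
    goal = len(prompt) + num_new
    while len(out) < goal:
        # 1) The small model drafts spec_len tokens autoregressively.
        drafted: List[Token] = []
        for _ in range(spec_len):
            drafted.append(draft(out + drafted))
        # 2) The large model checks each drafted position
        #    (a real system would do this in one parallel pass).
        accepted: List[Token] = []
        for i in range(spec_len):
            expected = target(out + drafted[:i])
            if drafted[i] == expected:
                accepted.append(drafted[i])
            else:
                accepted.append(expected)  # correct the first mismatch, then stop
                break
        out.extend(accepted)
    return out[:goal]
```

With greedy verification the output matches plain autoregressive decoding with the large model; the potential gain is that several tokens can be committed per verification round whenever the draft model agrees with the target. The speculation length `spec_len` is the quantity the abstract optimizes jointly with batching and resource allocation.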
XQuant Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
Authors: Haoqi Yang, Yao Yao, Zuchao Li, Baoyuan Qi, Guoming Liu, Hai Zhao
2025-10-13
Large Language Models (LLMs) have demonstrated remarkable capabilities across
diverse natural language processing tasks. However, their extensive memory
requirements, particularly due to KV cache growth during long-text
understanding and generation, present significant challenges for deployment in
resource-constrained environments. Quantization has emerged as a promising
solution to reduce memory consumption while preserving historical information.
We propose XQuant, a training-free and plug-and-play framework that achieves
ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key
innovations: a computationally negligible data-free calibration method and
cross-layer KV cache compression, enabling further compression to sub-1.4 bits.
Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant
outperforms state-of-the-art methods (e.g., KIVI-2bit and AsymKV-1.5bit) by
achieving lower bit-width while maintaining superior performance, establishing
a better trade-off between memory efficiency and model accuracy.
The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers
Authors: Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
2025-10-13
Large language models (LLMs) can correctly answer "When was Einstein born?"
yet fail to provide the same date when writing about Einstein's life, revealing
a fundamental inconsistency in how models access factual knowledge across task
complexities. While models display impressive accuracy on factual
question-answering benchmarks, the reliability gap between simple and complex
queries remains poorly understood, eroding their trustworthiness. In this work,
we introduce Short-Long Form Alignment for Factual Question Answering (SLAQ), a
controlled evaluation framework that compares LLMs' answers to the same factual
questions asked (a) in isolation (short) vs. (b) integrated into complex
queries (long). Looking at 16 LLMs across 600 queries, we find a systematic
misalignment of answers to the corresponding short and long queries. We further
uncover position-dependent accuracy loss and momentum effects where consecutive
correct or incorrect answers create self-reinforcing patterns. Through
mechanistic analysis, we find that aligned facts activate overlapping model
internals, and that metrics based on mechanistic similarity can predict
short-long answer alignment with up to 78% accuracy. Our work establishes
factual consistency over query complexity as an important aspect of LLMs'
trustworthiness and challenges current evaluation practices, which implicitly
assume that good performance for simple factual queries implies reliability in
more complex knowledge-seeking tasks too.
Discursive Circuits How Do Language Models Understand Discourse Relations?
Authors: Yisong Miao, Min-Yen Kan
2025-10-13
Which components in language models are responsible for discourse
understanding? We hypothesize that sparse computational graphs, termed as
discursive circuits, control how models process discourse relations. Unlike
simpler tasks, discourse relations involve longer spans and complex reasoning.
To make circuit discovery feasible, we introduce a task called Completion under
Discourse Relation (CuDR), where a model completes a discourse given a
specified relation. To support this task, we construct a corpus of minimal
contrastive pairs tailored for activation patching in circuit discovery.
Experiments show that sparse circuits (a fraction of a full GPT-2 model)
recover discourse understanding in the English PDTB-based CuDR task. These
circuits generalize well to unseen discourse frameworks such as RST and SDRT.
Further analysis shows lower layers capture linguistic features such as lexical
semantics and coreference, while upper layers encode discourse-level
abstractions. Feature utility is consistent across frameworks (e.g.,
coreference supports Expansion-like relations).
Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs
Authors: João Paulo Cardoso de Lima, Marc Dietrich, Jeronimo Castrillon, Asif Ali Khan
2025-10-13
Structured sparsity enables deploying large language models (LLMs) on
resource-constrained systems. Approaches like dense-to-sparse fine-tuning are
particularly compelling, achieving remarkable structured sparsity by reducing
the model size by over 6.7x, while still maintaining acceptable accuracy.
Despite this reduction, LLM inference, especially the decode stage being
inherently memory-bound, is extremely expensive on conventional Von-Neumann
architectures. Compute-in-memory (CIM) architectures mitigate this by
performing computations directly in memory, and when paired with sparse LLMs,
enable storing and computing the entire model in memory, eliminating the data
movement on the off-chip bus and improving efficiency. Nonetheless, naively
mapping sparse matrices onto CIM arrays leads to poor array utilization and
diminished computational efficiency. In this paper, we present an automated
framework with novel mapping and scheduling strategies to accelerate sparse LLM
inference on CIM accelerators. By exploiting block-diagonal sparsity, our
approach improves CIM array utilization by over 50%, achieving more than 4x
reduction in both memory footprint and the number of required floating-point
operations.
Flow Matching-Based Autonomous Driving Planning with Advanced Interactive Behavior Modeling
Authors: Tianyi Tan, Yinan Zheng, Ruiming Liang, Zexu Wang, Kexin Zheng, Jinliang Zheng, Jianxiong Li, Xianyuan Zhan, Jingjing Liu
2025-10-13
Modeling interactive driving behaviors in complex scenarios remains a
fundamental challenge for autonomous driving planning. Learning-based
approaches attempt to address this challenge with advanced generative models,
removing the dependency on over-engineered architectures for representation
fusion. However, brute-force implementation by simply stacking transformer
blocks lacks a dedicated mechanism for modeling interactive behaviors that are
common in real driving scenarios. The scarcity of interactive driving data
further exacerbates this problem, leaving conventional imitation learning
methods ill-equipped to capture high-value interactive behaviors. We propose
Flow Planner, which tackles these problems through coordinated innovations in
data modeling, model architecture, and learning scheme. Specifically, we first
introduce fine-grained trajectory tokenization, which decomposes the trajectory
into overlapping segments to decrease the complexity of whole trajectory
modeling. With a carefully designed architecture, we achieve efficient
temporal and spatial fusion of planning and scene information, to better
capture interactive behaviors. In addition, the framework incorporates flow
matching with classifier-free guidance for multi-modal behavior generation,
which dynamically reweights agent interactions during inference to maintain
coherent response strategies, providing a critical boost for interactive
scenario understanding. Experimental results on the large-scale nuPlan dataset
and challenging interactive interPlan dataset demonstrate that Flow Planner
achieves state-of-the-art performance among learning-based approaches while
effectively modeling interactive behaviors in complex driving scenarios.
Bit Allocation Transfer for Perceptual Quality Enhancement of VVC Intra Coding
Authors: Runyu Yang, Ivan V. Bajić
2025-10-13
Mainstream image and video coding standards -- including state-of-the-art
codecs like H.266/VVC, AVS3, and AV1 -- adopt a block-based hybrid coding
framework. While this framework facilitates straightforward optimization for
Peak Signal-to-Noise Ratio (PSNR), it struggles to effectively optimize
perceptually-aligned metrics such as Multi-Scale Structural Similarity
(MS-SSIM). To address this challenge, this paper proposes a low-complexity
method to enhance perceptual quality in VVC intra coding by transferring bit
allocation knowledge from end-to-end image compression. We introduce a
lightweight model trained with perceptual losses to generate a quantization
step map. This map implicitly captures block-level perceptual importance,
enabling efficient derivation of a QP map for VVC. Experiments on Kodak and
CLIC datasets demonstrate significant advantages, both in execution time and
perceptual metric performance, with more than 11% BD-rate reduction in terms of
MS-SSIM. Our scheme provides an efficient, practical pathway for perceptual
enhancement of traditional codecs.
Not All Bits Are Equal Scale-Dependent Memory Optimization Strategies for Reasoning Models
Authors: Junhyuck Kim, Ethan Ewer, Taehong Moon, Jongho Park, Dimitris Papailiopoulos
2025-10-13
While 4-bit quantization has emerged as a memory-optimal choice for
non-reasoning models and zero-shot tasks across scales, we show that this
universal prescription fails for reasoning models, where the KV cache rather
than model size can dominate memory. Through systematic experiments across
1,700 inference scenarios on AIME25 and GPQA-Diamond, we find a scale-dependent
trade-off: models with an effective size below 8-bit 4B parameters achieve
better accuracy by allocating memory to more weights rather than longer
generation, while larger models achieve better accuracy by allocating memory to
longer generations. This scale threshold also determines when parallel scaling
becomes memory-efficient and whether KV cache eviction outperforms KV
quantization. Our findings show that memory optimization for LLMs cannot be
scale-agnostic, while providing principled guidelines: for small reasoning
models, prioritize model capacity over test-time compute, while for larger
ones, maximize test-time compute. Our results suggest that optimizing reasoning
models for deployment requires fundamentally different strategies from those
established for non-reasoning models.
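
To make the weights-versus-cache trade-off concrete, here is a back-of-the-envelope memory estimate. It is a sketch under assumptions: the formulas are the standard ones for dense weights and a key/value cache, and the example configuration (36 layers, 8 KV heads, head dimension 128) is hypothetical rather than taken from the paper.

```python
def weight_bytes(num_params: float, weight_bits: int) -> float:
    """Memory for model weights at a given bit-width."""
    return num_params * weight_bits / 8


def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    num_tokens: int,
    kv_bits: int,
    batch: int = 1,
) -> float:
    """Memory for the KV cache; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * batch * kv_bits / 8


# Hypothetical 4B-parameter model with 4-bit weights and a 16-bit KV cache.
weights = weight_bytes(4e9, 4)
cache = kv_cache_bytes(36, 8, 128, 32768, 16)
print(f"weights: {weights / 1e9:.2f} GB, KV cache: {cache / 1e9:.2f} GB")
```

Under these assumed numbers, a single 32K-token generation needs roughly 4.8 GB of cache against 2 GB of weights, which shows how the cache rather than the model size can dominate memory for long reasoning traces. Parallel sampling multiplies the cache term through `batch`.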
MC# Mixture Compressor for Mixture-of-Experts Large Models
Authors: Wei Huang, Yue Liao, Yukang Chen, Jianhui Liu, Haoru Tan, Si Liu, Shiming Zhang, Shuicheng Yan, Xiaojuan Qi
2025-10-13
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and
vision-language models (VLMs) by increasing capacity through sparse activation.
However, preloading all experts into memory and activating multiple experts per
input introduces significant computational and memory overhead, making the
expert module a major contributor to model size and inference cost. To address
this, we propose MC# (Mixture-Compressor-sharp), a framework that combines
static quantization and dynamic expert pruning by leveraging the significance
of experts and tokens for aggressive compression of MoE-LLMs/VLMs. To reduce
storage and loading costs, we introduce Pre-Loading Mixed-Precision
Quantization (PMQ), which optimizes bit allocation via linear programming,
balancing expert importance and quantization error for a Pareto-optimal
trade-off between size and performance. To reduce runtime computation, Online
Top-any Pruning (OTP) uses Gumbel-Softmax sampling to dynamically select a
subset of experts per token, enabling fine-grained control over activation. By
combining PMQ's static bit-width optimization with OTP's dynamic routing, MC#
achieves extreme compression with minimal accuracy loss. On DeepSeek-VL2, MC#
achieves a 6.2 times weight reduction at 2.57 average bits with only a 1.7%
accuracy drop across five multimodal benchmarks. Additionally, OTP reduces
expert activation over 20% with less than 1% performance degradation,
demonstrating strong potential for efficient MoE-based model deployment.
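
As a quick sanity check on the reported figures, the relation between average bit-width and weight reduction can be sketched as below. The 16-bit baseline is an assumption (the abstract does not state it), and the calculation ignores non-expert weights and quantization metadata; the helper names and the sample bit assignment are hypothetical.

```python
from typing import List


def average_bits(expert_bits: List[int]) -> float:
    """Mean bit-width over equally sized experts."""
    return sum(expert_bits) / len(expert_bits)


def weight_reduction(avg_bits: float, baseline_bits: float = 16.0) -> float:
    """Compression ratio relative to an assumed 16-bit baseline."""
    return baseline_bits / avg_bits


# Mixed-precision assignment for four hypothetical experts.
print(average_bits([1, 2, 3, 2]))
# Ratio implied by the 2.57 average bits reported in the abstract.
print(round(weight_reduction(2.57), 2))
```

Under the 16-bit assumption, 2.57 average bits corresponds to a ratio of about 6.2, which is consistent with the 6.2 times weight reduction quoted above.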
KOTOX A Korean Toxic Dataset for Deobfuscation and Detoxification
Authors: Yejin Lee, Su-Hyeon Kim, Hyundong Jin, Dayoung Kim, Yeonsoo Kim, Yo-Sub Han
2025-10-13
Toxic content has become an increasingly critical social issue with the rapid
expansion of online communication. While numerous studies have explored methods for
detecting and detoxifying such content, most have focused primarily on English,
leaving low-resource languages underrepresented. Consequently, Large Language
Models (LLMs) often struggle to identify and neutralize toxic expressions in
these languages. This challenge becomes even more pronounced when users employ
obfuscation techniques to evade detection systems. Therefore, we propose
KOTOX: Korean Toxic Dataset for deobfuscation and detoxification to
address this issue. We categorize various obfuscation approaches based on
linguistic characteristics of Korean and define a set of transformation rules
grounded in real-world examples. Using these rules, we construct three dataset
versions (easy, normal, and hard) representing different levels of obfuscation
difficulty. This is the first dataset that simultaneously supports
deobfuscation and detoxification for the Korean language. We expect it to
facilitate better understanding and mitigation of obfuscated toxic content in
LLMs for low-resource languages. Our code and data are available at
https://github.com/leeyejin1231/KOTOX.
The Social Cost of Intelligence Emergence, Propagation, and Amplification of Stereotypical Bias in Multi-Agent Systems
Authors: Thi-Nhung Nguyen, Linhao Luo, Thuy-Trang Vu, Dinh Phung
2025-10-13
Bias in large language models (LLMs) remains a persistent challenge,
manifesting in stereotyping and unfair treatment across social groups. While
prior research has primarily focused on individual models, the rise of
multi-agent systems (MAS), where multiple LLMs collaborate and communicate,
introduces new and largely unexplored dynamics in bias emergence and
propagation. In this work, we present a comprehensive study of stereotypical
bias in MAS, examining how internal specialization, underlying LLMs and
inter-agent communication protocols influence bias robustness, propagation, and
amplification. We simulate social contexts where agents represent different
social groups and evaluate system behavior under various interaction and
adversarial scenarios. Experiments on three bias benchmarks reveal that MAS are
generally less robust than single-agent systems, with bias often emerging early
through in-group favoritism. However, cooperative and debate-based communication
can mitigate bias amplification, while more robust underlying
LLMs improve overall system stability. Our findings highlight critical factors
shaping fairness and resilience in multi-agent LLM systems.
Redundancy as a Structural Information Principle for Learning and Generalization
Authors: Yuda Bi, Ying Zhu, Vince D Calhoun
2025-10-13
We present a theoretical framework that extends classical information theory
to finite and structured systems by redefining redundancy as a fundamental
property of information organization rather than inefficiency. In this
framework, redundancy is expressed as a general family of informational
divergences that unifies multiple classical measures, such as mutual
information, chi-squared dependence, and spectral redundancy, under a single
geometric principle. This reveals that these traditional quantities are not
isolated heuristics but projections of a shared redundancy geometry. The theory
further predicts that redundancy is bounded both above and below, giving rise
to an optimal equilibrium that balances over-compression (loss of structure)
and over-coupling (collapse). While classical communication theory favors
minimal redundancy for transmission efficiency, finite and structured systems,
such as those underlying real-world learning, achieve maximal stability and
generalization near this equilibrium. Experiments with masked autoencoders are
used to illustrate and verify this principle: the model exhibits a stable
redundancy level where generalization peaks. Together, these results establish
redundancy as a measurable and tunable quantity that bridges the asymptotic
world of communication and the finite world of learning.
AwareCompiler Agentic Context-Aware Compiler Optimization via a Synergistic Knowledge-Data Driven Framework
Authors: Hongyu Lin, Haolin Pan, Haoran Luo, Yuchen Li, Kaichun Yao, Libo Zhang, Mingjie Xing, Yanjun Wu
2025-10-13
Compiler optimization is crucial for enhancing program performance by
transforming the sequence of optimization passes while maintaining correctness.
Despite the promising potential of large language model (LLM)-based agents for
software optimization, automating compiler optimization remains challenging due
to: (1) semantic misalignment between abstract program representations and
concrete optimization passes, (2) inefficient interaction mechanisms between
agents and compiler environments, and (3) reward sparsity from the extensive
decision-making process within large optimization spaces. This paper introduces
AwareCompiler, an agentic framework for compiler optimization that
addresses these challenges through three key innovations: structured knowledge
integration and dataset construction, knowledge-driven adaptive pass
generation, and a data-driven hybrid training pipeline. Experimental results on
standard benchmarks demonstrate that AwareCompiler significantly outperforms
existing baselines in both performance and efficiency, highlighting the
effectiveness of our synergistic knowledge-data-driven approach. Our code is
publicly available at https://github.com/LHY-24/AwareCompiler.
FastHMR Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding
Authors: Soroush Mehraban, Andrea Iaboni, Babak Taati
2025-10-13
Recent transformer-based models for 3D Human Mesh Recovery (HMR) have
achieved strong performance but often suffer from high computational cost and
complexity due to deep transformer architectures and redundant tokens. In this
paper, we introduce two HMR-specific merging strategies: Error-Constrained
Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe). ECLM
selectively merges transformer layers that have minimal impact on the Mean Per
Joint Position Error (MPJPE), while Mask-ToMe focuses on merging background
tokens that contribute little to the final prediction. To further address the
potential performance drop caused by merging, we propose a diffusion-based
decoder that incorporates temporal context and leverages pose priors learned
from large-scale motion capture datasets. Experiments across multiple
benchmarks demonstrate that our method achieves up to 2.3x speed-up while
slightly improving performance over the baseline.
Agentic RAG for Software Testing with Hybrid Vector-Graph and Multi-Agent Orchestration
Authors: Mohanakrishnan Hariharan, Satish Arvapalli, Seshu Barma, Evangeline Sheela
2025-10-12
We present an approach to software testing automation using Agentic
Retrieval-Augmented Generation (RAG) systems for Quality Engineering (QE)
artifact creation. We combine autonomous AI agents with hybrid vector-graph
knowledge systems to automate test plan, case, and QE metric generation. Our
approach addresses traditional software testing limitations by leveraging LLMs
such as Gemini and Mistral, multi-agent orchestration, and enhanced
contextualization. The system achieves remarkable accuracy improvements from
65% to 94.8% while ensuring comprehensive document traceability throughout the
quality engineering lifecycle. Experimental validation of enterprise Corporate
Systems Engineering and SAP migration projects demonstrates an 85% reduction in
testing timeline, an 85% improvement in test suite efficiency, and projected
35% cost savings, resulting in a 2-month acceleration of go-live.
A compressed code for memory discrimination
Authors: Dale Zhou, Sharon Mina Noh, Nora C Harhen, Nidhi V Banavar, C. Brock Kirwan, Michael A Yassa, Aaron M Bornstein
2025-10-12
The ability to discriminate similar visual stimuli is an important index of
memory function. This ability is widely thought to be supported by expanding
the dimensionality of relevant neural codes, such that neural representations
for similar stimuli are maximally distinct, or ``separated.'' An alternative
hypothesis is that discrimination is supported by lossy compression of visual
inputs, efficiently coding sensory information by discarding seemingly
irrelevant details. A benefit of compression, relative to expansion, is that it
allows individuals to retain fewer essential dimensions underlying stimulus
variation -- a process linked to higher-order visual processing -- without
hindering discrimination. Under this hypothesis, pattern separation is
facilitated when more information from similar stimuli can be discarded, rather
than preserved. We test the compression versus expansion hypotheses by
predicting performance on the canonical mnemonic similarity task. We train
neural networks to compress perceptual and semantic factors of stimuli,
measuring lossiness using the mathematical framework underlying compression.
Consistent with the compression hypothesis, and not the expansion hypothesis,
greater lossiness predicts the ease and performance of lure discrimination,
especially in deeper convolutional network layers that predict higher-order
visual brain activity. We then confirm these predictions across two image sets,
four behavioral datasets, and alternative lossiness metrics. Finally, using
task fMRI, we identify signatures of lossy compression -- neural dimensionality
reduction and information loss -- in higher-order visual regions V4 and IT and
hippocampal DG/CA3 and CA1 linked to lure discrimination. These results suggest
lossy compression supports mnemonic discrimination by discarding redundant and
overlapping information.
Review of Inference-Time Scaling Strategies Reasoning, Search and RAG
Authors: Zhichao Wang, Cheng Wan, Dong Nie
2025-10-12
The performance gains of LLMs have historically been driven by scaling up
model size and training data. However, the rapidly diminishing availability of
high-quality training data is introducing a fundamental bottleneck, shifting
the focus of research toward inference-time scaling. This paradigm uses
additional computation at the time of deployment to substantially improve
performance on downstream tasks without costly model re-training. This review
systematically surveys the diverse techniques contributing to this new era of
inference-time scaling, organizing the rapidly evolving field into two
comprehensive perspectives: Output-focused and Input-focused methods.
Output-focused techniques encompass complex, multi-step generation strategies,
including reasoning (e.g., CoT, ToT, ReAct), various search and decoding
methods (e.g., MCTS, beam search), training for long CoT (e.g., RLVR, GRPO),
and model ensemble methods. Input-focused techniques are primarily categorized
by few-shot and RAG, with RAG as the central focus. The RAG section is further
detailed through a structured examination of query expansion, data, retrieval
and reranker, LLM generation methods, and multi-modal RAG.
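As a small illustration of the output-focused idea (spending more compute at inference to get a better output), the sketch below runs beam search over a toy next-token model. The `LOGP` table and the `beam_search` helper are hypothetical and exist only for this example; they are not taken from the survey.

```python
import math

# Hypothetical toy next-token model: log-probabilities conditioned on the previous token.
LOGP = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"a": math.log(0.5), "b": math.log(0.5)},
    "b": {"a": math.log(0.1), "b": math.log(0.9)},
}


def beam_search(beam_width, length):
    """Return (log_prob, tokens) of the best sequence found with the given beam width."""
    beams = [(0.0, ["<s>"])]
    for _ in range(length):
        candidates = []
        for score, seq in beams:
            # Expand every kept hypothesis by every possible next token.
            for tok, lp in LOGP[seq[-1]].items():
                candidates.append((score + lp, seq + [tok]))
        # Keep only the beam_width highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


greedy_score, greedy_seq = beam_search(beam_width=1, length=2)
wide_score, wide_seq = beam_search(beam_width=2, length=2)
print(math.exp(greedy_score), greedy_seq)
print(math.exp(wide_score), wide_seq)
```

With a beam width of 1 (greedy decoding) the search commits to the locally best first token, while a width of 2 keeps the alternative alive and ends up with a higher-probability sequence, at the cost of scoring more candidates per step.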
ADiP Adaptive Precision Systolic Array for Matrix Multiplication Acceleration
Authors: Ahmed J. Abdelmaksoud, Cristian Sestito, Shiwei Wang, Themis Prodromakis
2025-10-12
Transformers are at the core of modern AI. They rely heavily on
matrix multiplication and require efficient acceleration due to their
substantial memory and computational requirements. Quantization plays a vital
role in reducing memory usage, and can be exploited for computations by
designing reconfigurable architectures that enhance matrix multiplication by
dynamically adjusting the precision. This paper proposes ADiP, a novel
adaptive-precision systolic array architecture designed for efficient matrix
multiplication acceleration. The proposed architecture consists of NxN
adaptive-precision processing elements (PEs) and shared accumulators. ADiP
supports multiple computation modes, including symmetric single-matrix
multiplication as well as asymmetric multi-matrix multiplication with a shared
input matrix, thereby improving data-reuse and PE utilization. In addition,
ADiP maximizes the computational density by adapting to different precisions,
such as 8bitx8bit, 8bitx4bit, and 8bitx2bit. Analytical models are developed
for ADiP architecture, including latency and throughput for versatile
architecture configurations. A comprehensive hardware design space exploration
is demonstrated using 22nm commercial technology, achieving up to a 4x higher
computational throughput. Furthermore, ADiP is evaluated on different
workloads from GPT-2 Medium, BERT Large, and BitNet-1.58B models,
delivering latency improvement up to 53.6%, and energy improvement up to 24.4%
for BitNet-1.58B MHA workloads. At a 64x64 size with 4096 PEs, ADiP achieves a
peak throughput of 8.192 TOPS, 16.384 TOPS, and 32.768 TOPS for 8bitx8bit,
8bitx4bit, and 8bitx2bit operations, respectively.
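The peak-throughput figures can be sanity-checked with simple arithmetic. The sketch below assumes a 1 GHz clock, one 8bitx8bit multiply-accumulate (counted as 2 operations) per PE per cycle, and 2x and 4x as many MACs per PE per cycle in the 8bitx4bit and 8bitx2bit modes. None of these assumptions is stated in the abstract; they are simply one set of parameters that reproduces the quoted numbers.

```python
def peak_tops(array_dim, clock_hz, macs_per_pe_per_cycle):
    """Peak throughput in TOPS for an array_dim x array_dim systolic array."""
    ops_per_mac = 2  # one multiply plus one accumulate
    num_pes = array_dim * array_dim
    return num_pes * macs_per_pe_per_cycle * ops_per_mac * clock_hz / 1e12


ASSUMED_CLOCK_HZ = 1e9  # assumption: not given in the abstract

# Assumed MACs per PE per cycle for each precision mode.
for mode, macs in [("8bitx8bit", 1), ("8bitx4bit", 2), ("8bitx2bit", 4)]:
    print(mode, peak_tops(64, ASSUMED_CLOCK_HZ, macs), "TOPS")
```

Under these assumptions a 64x64 array (4096 PEs) gives 8.192, 16.384, and 32.768 TOPS, matching the abstract, and the 4x ratio between the 8bitx2bit and 8bitx8bit modes is consistent with the reported "up to 4x" throughput gain.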
Preserving LLM Capabilities through Calibration Data Curation From Analysis to Optimization
Authors: Bowei He, Lihao Yin, Huiling Zhen, Shuqi Liu, Han Wu, Xiaokun Zhang, Mingxuan Yuan, Chen Ma
2025-10-12
Post-training compression has been a widely employed approach to scale down
large language models (LLMs) and facilitate efficient inference. In various
proposed compression methods, including pruning and quantization, calibration
data plays a vital role by informing the weight importance and activation
dynamic ranges. However, how calibration data impacts the LLM capability after
compression is less explored. The few existing works, though recognizing the
significance of this study, only investigate the language modeling or
commonsense reasoning performance degradation from limited angles, like the
data sources or sample amounts. More systematic research is still needed to
examine the impacts on different LLM capabilities in terms of compositional
properties and domain correspondence of calibration data. In this work, we aim
at bridging this gap and further analyze underlying influencing mechanisms from
the activation pattern perspective. In particular, we explore the calibration
data's impacts on high-level complex reasoning capabilities, like math problem
solving and code generation. Delving into the underlying mechanism, we find
that the representativeness and diversity in activation space more
fundamentally determine the quality of calibration data. Finally, we propose a
calibration data curation framework based on such observations and analysis,
enhancing the performance of existing post-training compression methods on
preserving critical LLM capabilities. Our code is provided at
https://github.com/BokwaiHo/COLA.git.
Large Language Model-Empowered Channel Prediction and Predictive Beamforming for LEO Satellite Communications
Authors: Zhixiong Chen, Hyundong Shin, Arumugam Nallanathan, Jonathon Chambers
2025-10-12
Accurate channel prediction and effective beamforming are essential for low
Earth orbit (LEO) satellite communications to enhance system capacity and
enable high-speed connectivity. Most existing channel prediction and predictive
beamforming methods are limited by model generalization capabilities and
struggle to adapt to time-varying wireless propagation environments. Inspired
by the remarkable generalization and reasoning capabilities of large language
models (LLMs), this work proposes an LLM-based channel prediction framework,
namely CPLLM, to forecast future channel state information (CSI) for LEO
satellites based on historical CSI data. In the proposed CPLLM, a dedicated CSI
encoder is designed to map raw CSI data into the textual embedding space,
effectively bridging the modality gap and enabling the LLM to perform reliable
reasoning over CSI data. Additionally, a CSI decoder is introduced to
simultaneously predict CSI for multiple future time slots, substantially
reducing the computational burden and inference latency associated with the
inherent autoregressive decoding process of LLMs. Then, instead of training the
LLM from scratch, we adopt a parameter-efficient fine-tuning strategy, i.e.,
LoRA, for CPLLM, where the pretrained LLM remains frozen and trainable low-rank
matrices are injected into each Transformer decoder layer to enable effective
fine-tuning. Furthermore, we extend CPLLM to directly generate beamforming
strategies for future time slots based on historical CSI data, namely BFLLM.
This extended framework retains the same architecture as CPLLM, while
introducing a dedicated beamforming decoder to output beamforming strategies.
Finally, extensive simulation results validate the effectiveness of the
proposed approaches in channel prediction and predictive beamforming for LEO
satellite communications.
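The LoRA strategy described above (frozen pretrained weights plus trainable low-rank matrices) can be sketched as follows. This is a generic NumPy illustration of the LoRA idea, not the authors' implementation; the class name `LoRALinear`, the rank, the scaling factor, and the initialization are assumptions.

```python
import numpy as np


class LoRALinear:
    """Linear layer y = x W^T with a trainable low-rank update scale * B A added to W."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # pretrained, kept frozen
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                   # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        frozen_out = x @ self.weight.T
        lora_out = (x @ self.A.T) @ self.B.T
        return frozen_out + self.scale * lora_out

    def trainable_params(self):
        return self.A.size + self.B.size


W = np.random.default_rng(1).normal(size=(64, 64))
layer = LoRALinear(W, rank=4)
x = np.ones((2, 64))
print(layer(x).shape, layer.trainable_params(), W.size)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `A` and `B` (512 values here, versus 4096 in `W`) would be updated during fine-tuning.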
BitMar Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
Authors: Euhid Aman, Esteban Carlin, Hsing-Kuo Pao, Giovanni Beltrame, Ghaluh Indah Permata Sari, Yie-Tarng Chen
2025-10-12
Cross-attention transformers and other multimodal vision-language models
excel at grounding and generation; however, their extensive, full-precision
backbones make it challenging to deploy them on edge devices. Memory-augmented
architectures enhance the utilization of past context; however, most works
rarely pair them with aggressive edge-oriented quantization. We introduce
BitMar, a quantized multimodal transformer that proposes an external human-like
episodic memory for effective image-text generation on hardware with limited
resources. BitMar utilizes 1.58-bit encoders, one for text (BitNet-style) and
one for vision (DINOv2-based), to create compact embeddings that are combined
and used to query a fixed-size key-value episodic memory. During vector
retrieval, the BitNet decoder applies per-layer conditioning, which increases
the contextual relevance of generated content. The decoder also employs
attention sinks with a sliding-window mechanism to process long or streaming
inputs under tight memory budgets. The combination of per-layer conditioning
and sliding-window attention achieves a strong quality-speed trade-off,
delivering competitive captioning and multimodal understanding at low latency
with a small model footprint. These characteristics make BitMar well-suited for
edge deployment.
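To make the "1.58-bit" figure concrete: a weight restricted to the three values {-1, 0, +1} carries log2(3) ≈ 1.58 bits. The sketch below shows absmean ternary quantization in the style commonly associated with BitNet b1.58. Whether BitMar uses exactly this scheme is an assumption, since the abstract does not spell it out, and `ternary_quantize` is a hypothetical helper name.

```python
import numpy as np


def ternary_quantize(w, eps=1e-8):
    """Quantize weights to {-1, 0, +1} using a single absmean scale per tensor."""
    scale = np.mean(np.abs(w)) + eps
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map ternary codes back to approximate floating-point weights."""
    return q.astype(np.float32) * scale


w = np.array([[0.9, -0.05], [-1.2, 0.4]])
q, scale = ternary_quantize(w)
print(q)
print(dequantize(q, scale))
```

Small weights collapse to 0 and large ones saturate at ±1, so each stored value needs only one of three codes plus one shared scale per tensor.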
Self-Supervised Representation Learning with ID-Content Modality Alignment for Sequential Recommendation
Authors: Donglin Zhou, Weike Pan, Zhong Ming
2025-10-12
Sequential recommendation (SR) models often capture user preferences based on
the historically interacted item IDs, which usually obtain sub-optimal
performance when the interaction history is limited. Content-based sequential
recommendation has recently emerged as a promising direction that exploits
items' textual and visual features to enhance preference learning. However,
there are still three key challenges: (i) how to reduce the semantic gap
between different content modality representations; (ii) how to jointly model
user behavior preferences and content preferences; and (iii) how to design an
effective training strategy to align ID representations and content
representations. To address these challenges, we propose a novel model,
self-supervised representation learning with ID-Content modality alignment,
named SICSRec. Firstly, we propose an LLM-driven sample construction method and
develop a supervised fine-tuning approach to align item-level modality
representations. Secondly, we design a novel Transformer-based sequential
model, where an ID-modality sequence encoder captures user behavior
preferences, a content-modality sequence encoder learns user content
preferences, and a mix-modality sequence decoder grasps the intrinsic
relationship between these two types of preferences. Thirdly, we propose a
two-step training strategy with a content-aware contrastive learning task to
align modality representations and ID representations, which decouples the
training process of content modality dependency and item collaborative
dependency. Extensive experiments conducted on four public video streaming
datasets demonstrate that SICSRec outperforms the state-of-the-art ID-modality
sequential recommenders and content-modality sequential recommenders by 8.04%
on NDCG@5 and 6.62% on NDCG@10 on average, respectively.
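For readers unfamiliar with the reported metric, the sketch below computes NDCG@k with binary relevance, the form typically used when each test sequence has one held-out next item. The function name and the toy item IDs are illustrative only and do not come from the paper.

```python
import math


def ndcg_at_k(ranked_items, relevant, k):
    """NDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so rank 1 gets 1 / log2(2) = 1
        for pos, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# The held-out item "b" is ranked second, so NDCG@5 = 1 / log2(3).
print(ndcg_at_k(["a", "b", "c"], {"b"}, k=5))
```

The logarithmic discount rewards placing the true next item near the top of the list, which is why NDCG@5 and NDCG@10 can differ for the same model.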
The Hidden DNA of LLM-Generated JavaScript Structural Patterns Enable High-Accuracy Authorship Attribution
Authors: Norbert Tihanyi, Bilel Cherif, Richard A. Dubniczky, Mohamed Amine Ferrag, Tamás Bisztray
2025-10-12
In this paper, we present the first large-scale study exploring whether
JavaScript code generated by Large Language Models (LLMs) can reveal which
model produced it, enabling reliable authorship attribution and model
fingerprinting. With the rapid rise of AI-generated code, attribution is
playing a critical role in detecting vulnerabilities, flagging malicious
content, and ensuring accountability. While AI-vs-human detection usually
treats AI as a single category, we show that individual LLMs leave unique
stylistic signatures, even among models belonging to the same family or
parameter size. To this end, we introduce LLM-NodeJS, a dataset of 50,000
Node.js back-end programs from 20 large language models. Each has four
transformed variants, yielding 250,000 unique JavaScript samples and two
additional representations (JSIR and AST) for diverse research applications.
Using this dataset, we benchmark traditional machine learning classifiers
against fine-tuned Transformer encoders and introduce CodeT5-JSA, a custom
architecture derived from the 770M-parameter CodeT5 model with its decoder
removed and a modified classification head. It achieves 95.8% accuracy on
five-class attribution, 94.6% on ten-class, and 88.5% on twenty-class tasks,
surpassing other tested models such as BERT, CodeBERT, and Longformer. We
demonstrate that classifiers capture deeper stylistic regularities in program
dataflow and structure, rather than relying on surface-level features. As a
result, attribution remains effective even after mangling, comment removal, and
heavy code transformations. To support open science and reproducibility, we
release the LLM-NodeJS dataset, Google Colab training scripts, and all related
materials on GitHub: https://github.com/
-NodeJS-dataset.
SASER Stego attacks on open-source LLMs
Authors: Ming Tan, Wei Li, Hu Tao, Hailong Ma, Aodi Liu, Qian Chen, Zilong Wang
2025-10-12
Open-source large language models (LLMs) have demonstrated considerable
dominance over proprietary LLMs in resolving neural processing tasks, thanks to
their collaborative and sharing nature. Although full access to source codes,
model parameters, and training data lays the groundwork for transparency, we
argue that such a full-access manner is vulnerable to stego attacks, and their
ill-effects are not fully understood. In this paper, we conduct a systematic
formalization of stego attacks on open-source LLMs by enumerating all possible
threat models associated with adversary objectives, knowledge, and
capabilities. Therein, the threat posed by adversaries with internal knowledge,
who inject payloads and triggers during the model sharing phase, is of
practical interest. We go even further and propose the first stego attack on
open-source LLMs, dubbed SASER, which wields impacts through identifying
targeted parameters, embedding payloads, injecting triggers, and executing
payloads sequentially. Particularly, SASER enhances the attack robustness
against quantization-based local deployment by de-quantizing the embedded
payloads. In addition, to achieve stealthiness, SASER devises the
performance-aware importance metric to identify targeted parameters with the
least degradation of model performance. Extensive experiments on LLaMA2-7B and
ChatGLM3-6B, without quantization, show that the stealth rate of SASER
outperforms existing stego attacks (for general DNNs) by up to 98.1%, while
achieving the same attack success rate (ASR) of 100%. More importantly, SASER
improves ASR on quantized models from 0 to 100% in all settings. We appeal for
investigations on countermeasures against SASER in view of the significant
attack effectiveness.
AnyBCQ Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs
Authors: Gunho Park, Jeongin Bae, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee
2025-10-12
The deployment of large language models (LLMs) is increasingly constrained by
memory and latency bottlenecks, motivating the need for quantization techniques
that flexibly balance accuracy and efficiency. Recent work has introduced
multi-precision models, which enable inference at multiple precisions within a
single model depending on runtime constraints. To support such flexibility,
quantized weights are often stored as bit-planes, where hardware efficiency
improves when the compute operates directly at the bit-plane level and
activates only the precision required by each request. In this work, we present
AnyBCQ, a hardware-friendly multi-precision extension of Binary-Coded
Quantization (BCQ) that supports direct bit-plane operations. By representing
weights as binary bit-planes with corresponding scale factors, AnyBCQ enables
bit-plane-level computation and maps naturally to accelerator-friendly,
bit-parallel arithmetic. Our progressive precision expansion mechanism
incrementally refines scaling factors while reusing previously assigned binary
codes, yielding monotonic improvements in accuracy as additional bits are
enabled. We further co-design a specialized kernel that exploits the BCQ
structure to support dynamic per-request precision selection with negligible
overhead. Experiments on recent LLMs demonstrate that AnyBCQ significantly
narrows the accuracy drop in the low-bit regime (e.g. 2-bit), remains
competitive at higher precision, and achieves throughput gains of up to 3.0x
over half precision and 1.2x over state-of-the-art multi-precision methods. By
aligning algorithmic flexibility with hardware efficiency, AnyBCQ provides a
practical foundation for multi-precision LLM deployment across diverse
service-level objectives.
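The bit-plane representation can be illustrated with plain greedy binary-coded quantization, where a weight vector is approximated as a sum of scaled {-1, +1} planes. This is a generic textbook-style sketch, not AnyBCQ itself: the abstract says AnyBCQ refines the scale factors as bits are added, whereas this sketch keeps the greedy scales fixed. The helper names are hypothetical.

```python
import numpy as np


def greedy_bcq(w, n_bits):
    """Approximate w as sum_i scales[i] * codes[i], with codes[i] in {-1, +1}."""
    residual = w.astype(np.float64).copy()
    scales, codes = [], []
    for _ in range(n_bits):
        b = np.where(residual >= 0, 1.0, -1.0)  # next binary bit-plane
        alpha = np.mean(np.abs(residual))       # least-squares scale for this plane
        scales.append(alpha)
        codes.append(b)
        residual -= alpha * b
    return np.array(scales), np.stack(codes)


def reconstruct(scales, codes, n_bits):
    """Rebuild weights using only the first n_bits bit-planes."""
    return np.tensordot(scales[:n_bits], codes[:n_bits], axes=1)


w = np.random.default_rng(0).normal(size=256)
scales, codes = greedy_bcq(w, n_bits=4)
for bits in range(1, 5):
    err = np.mean((w - reconstruct(scales, codes, bits)) ** 2)
    print(bits, "bit-planes, MSE:", err)
```

Serving a lower-precision request only requires reading the first few planes of the same stored model, and in this greedy scheme the reconstruction error cannot increase as more planes are enabled.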
When Images Speak Louder Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance
Authors: Jinjin Cao, Zhiyang Chen, Zijun Wang, Liyuan Ma, Weijian Luo, Guojun Qi
2025-10-12
Vision-Language Models (VLMs) have shown solid ability for multimodal
understanding of both visual and language contexts. However, existing VLMs
often face severe challenges of hallucinations, meaning that VLMs tend to
generate responses that are only fluent in the language but irrelevant to
images in previous contexts. To address this issue, we analyze how language
bias contributes to hallucinations and then introduce Cross-Modal
Guidance (CMG), a training-free decoding method that addresses the
hallucinations by leveraging the difference between the output distributions of
the original model and the one with degraded visual-language attention. In
practice, we adaptively mask the attention weight of the most influential image
tokens in selected transformer layers to corrupt the visual-language perception
as a concrete type of degradation. Such a degradation-induced decoding
emphasizes the perception of visual contexts and therefore significantly
reduces language bias without harming the ability of VLMs. In the experiment
sections, we conduct comprehensive studies. All results demonstrate the
superior advantages of CMG with neither additional conditions nor training
costs. We also quantitatively show that CMG can improve different VLMs' performance
on hallucination-specific benchmarks and generalize effectively.
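The idea of using the difference between two output distributions can be sketched with a generic contrastive-guidance rule on logits. The abstract does not give CMG's exact formula, so the linear extrapolation below, the `alpha` weight, and the toy logits are all assumptions borrowed from common contrastive decoding practice.

```python
import numpy as np


def guided_logits(logits_full, logits_degraded, alpha=1.0):
    """Push the output away from what the visually degraded model would predict."""
    return (1.0 + alpha) * logits_full - alpha * logits_degraded


vocab = ["cat", "dog", "the"]
# Toy values: the degraded model (weakened image attention) still likes "the",
# which suggests that preference comes from language priors, not from the image.
logits_full = np.array([2.0, 1.0, 2.2])
logits_degraded = np.array([0.5, 1.0, 2.4])

print(vocab[int(np.argmax(logits_full))])
print(vocab[int(np.argmax(guided_logits(logits_full, logits_degraded)))])
```

Tokens whose score drops when visual attention is degraded are the ones that depend on the image, so amplifying that gap shifts the choice from the language-prior token to the visually grounded one.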
NIM Neuro-symbolic Ideographic Metalanguage for Inclusive Communication
Authors: Prawaal Sharma, Poonam Goyal, Navneet Goyal, Vidisha Sharma
2025-10-12
Digital communication has become the cornerstone of modern interaction,
enabling rapid, accessible, and interactive exchanges. However, individuals
with lower academic literacy often face significant barriers, exacerbating the
"digital divide". In this work, we introduce a novel, universal ideographic
metalanguage designed as an innovative communication framework that transcends
academic, linguistic, and cultural boundaries. Our approach leverages
principles of Neuro-symbolic AI, combining neural-based large language models
(LLMs) enriched with world knowledge and symbolic knowledge heuristics grounded
in the linguistic theory of Natural Semantic Metalanguage (NSM). This enables
the semantic decomposition of complex ideas into simpler, atomic concepts.
Adopting a human-centric, collaborative methodology, we engaged over 200
semi-literate participants in defining the problem, selecting ideographs, and
validating the system. With over 80% semantic comprehensibility, an accessible
learning curve, and universal adaptability, our system effectively serves
underprivileged populations with limited formal education.
RobotFleet An Open-Source Framework for Centralized Multi-Robot Task Planning
Authors: Rohan Gupta, Trevor Asbery, Zain Merchant, Abrar Anwar, Jesse Thomason
2025-10-12
Coordinating heterogeneous robot fleets to achieve multiple goals is
challenging in multi-robot systems. We introduce an open-source and extensible
framework for centralized multi-robot task planning and scheduling that
leverages LLMs to enable fleets of heterogeneous robots to accomplish multiple
tasks. RobotFleet provides abstractions for planning, scheduling, and execution
across robots deployed as containerized services to simplify fleet scaling and
management. The framework maintains a shared declarative world state and
two-way communication for task execution and replanning. By modularizing each
layer of the autonomy stack and using LLMs for open-world reasoning, RobotFleet
lowers the barrier to building scalable multi-robot systems. The code can be
found here: https://github.com/therohangupta/robot-fleet.
SP-MoE Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
Authors: Liangkun Chen, Zijian Wen, Tian Wu, Xiaoxi Zhang, Chuan Wu
2025-10-11
The Mixture-of-Experts (MoE) architecture has been widely adopted in large
language models (LLMs) to reduce computation cost through model sparsity.
Employing speculative decoding (SD) can further accelerate MoE inference by
drafting multiple tokens per step and verifying them in parallel. However,
combining MoE with SD inflates GPU memory and aggravates CPU-GPU bandwidth
contention during multi-token verification. Existing MoE offloading systems are
SD-agnostic and do not address this bottleneck. We present SP-MoE, the first
SD-aware expert-offloading and compute-communication pipelining framework.
SP-MoE introduces: (1) speculative expert prefetching that exploits structural
correspondence between the draft and target models to prefetch likely experts
ahead of verification; (2) a cutoff-layer policy that bounds per-layer prefetch
depth based on empirical profiles and an analytical latency model, guaranteeing
just-in-time availability without overfetch; and (3) a pipelined runtime with
asynchronous prefetch threads and batched I/O to hide loading latency.
Extensive experiments demonstrate that SP-MoE achieves a 1.07-3.5 times TPOT
speedup over state-of-the-art methods across diverse datasets, environments,
and MoE-based models.
Grounded AI for Code Review Resource-Efficient Large-Model Serving in Enterprise Pipelines
Authors: Sayan Mandal, Hua Jiang
2025-10-11
Automated code review adoption lags in compliance-heavy settings, where
static analyzers produce high-volume, low-rationale outputs, and naive LLM use
risks hallucination and cost overhead. We present a production system
for grounded, PR-native review that pairs static-analysis findings with
AST-guided context extraction and a single-GPU, on-demand serving stack
(quantized open-weight model, multi-tier caching) to deliver concise
explanations and remediation guidance. Evaluated on safety-oriented C/C++
standards, the approach achieves sub-minute median first-feedback (offline p50
build+LLM 59.8s) while maintaining competitive violation reduction and lower
violation rates versus larger proprietary models. The architecture is
decoupled: teams can adopt the grounding/prompting layer or the serving layer
independently. A small internal survey (n=8) provides directional signals of
reduced triage effort and moderate perceived grounding, with participants
reporting fewer human review iterations. We outline operational lessons and
limitations, emphasizing reproducibility, auditability, and pathways to broader
standards and assisted patching.
The Achilles' Heel of LLMs How Altering a Handful of Neurons Can Cripple Language Abilities
Authors: Zixuan Qin, Kunlin Lyu, Qingchen Yu, Yifan Sun, Zhaoxin Fan
2025-10-11
Large Language Models (LLMs) have become foundational tools in natural
language processing, powering a wide range of applications and research. Many
studies have shown that LLMs share significant similarities with the human
brain. Recent neuroscience research has found that a small subset of biological
neurons in the human brain are crucial for core cognitive functions, which
raises a fundamental question: do LLMs also contain a small subset of critical
neurons? In this paper, we investigate this question by proposing a
Perturbation-based Causal Identification of Critical Neurons method to
systematically locate such critical neurons in LLMs. Our findings reveal three
key insights: (1) LLMs contain ultra-sparse critical neuron sets. Disrupting
these critical neurons can cause a 72B-parameter model with over 1.1 billion
neurons to completely collapse, with perplexity increasing by up to 20 orders
of magnitude; (2) These critical neurons are not uniformly distributed, but
tend to concentrate in the outer layers, particularly within the MLP down_proj
components; (3) Performance degradation exhibits sharp phase transitions,
rather than a gradual decline, when these critical neurons are disrupted.
Through comprehensive experiments across diverse model architectures and
scales, we provide deeper analysis of these phenomena and their implications
for LLM robustness and interpretability. These findings can offer guidance for
developing more robust model architectures and improving deployment security in
safety-critical applications.
ISAAC Intelligent, Scalable, Agile, and Accelerated CPU Verification via LLM-aided FPGA Parallelism
Authors: Jialin Sun, Yuchen Hu, Dean You, Yushu Du, Hui Wang, Xinwei Fang, Weiwei Shan, Nan Guan, Zhe Jiang
2025-10-11
Functional verification is a critical bottleneck in integrated circuit
development, with CPU verification being especially time-intensive and
labour-consuming. Industrial practice relies on differential testing for CPU
verification, yet faces bottlenecks at nearly each stage of the framework
pipeline: front-end stimulus generation lacks micro-architectural awareness,
yielding low-quality and redundant tests that impede coverage closure and miss
corner cases. Meanwhile, back-end simulation infrastructure, even with FPGA
acceleration, often stalls on long-running tests and offers limited visibility,
delaying feedback and prolonging the debugging cycle. Here, we present ISAAC, a
full-stack, Large Language Model (LLM)-aided CPU verification framework with
FPGA parallelism, from bug categorisation and stimulus generation to simulation
infrastructure. To do so, we present a multi-agent stimulus engine in ISAAC's
front-end, infused with micro-architectural knowledge and historical bug
patterns, generating highly targeted tests that rapidly achieve coverage goals
and capture elusive corner cases. In ISAAC's back-end, we introduce a
lightweight forward-snapshot mechanism and a decoupled co-simulation
architecture between the Instruction Set Simulator (ISS) and the Design Under
Test (DUT), enabling a single ISS to drive multiple DUTs in parallel. By
eliminating long-tail test bottlenecks and exploiting FPGA parallelism, the
simulation throughput is significantly improved. As a demonstration, we used
ISAAC to verify a mature CPU that has undergone multiple successful tape-outs.
Results show up to 17,536x speed-up over software RTL simulation, while
detecting several previously unknown bugs, two of which are reported in this
paper.
BILLY Steering Large Language Models via Merging Persona Vectors for Creative Generation
Authors: Tsung-Min Pai, Jui-I Wang, Li-Chun Lu, Shao-Hua Sun, Hung-Yi Lee, Kai-Wei Chang
2025-10-11
Multi-LLM systems enhance the creativity of large language models by
simulating human collective intelligence but suffer from significant drawbacks,
such as high computational costs and inference latency. To address these
limitations, we propose BILLY (BlendIng persona vectors for Large Language
model creativitY), a training-free framework that captures the benefits of
multi-LLM collaboration, i.e. inducing diverse perspectives and specialized
expertise, within a single model. BILLY operates by extracting and blending
multiple distinct persona vectors directly in the model's activation space. We
steer the model's generation process with this merged vector during inference,
enabling multi-perspective output without explicit multi-LLM communication. Our
experiments across creativity-oriented benchmarks demonstrate that BILLY
surpasses single model prompting and traditional multi-LLM approaches, while
substantially reducing inference time and computational costs. Our analyses
further reveal that distinct persona vectors can be blended to achieve both
effective control over complementary aspects of generation and greater
interpretability.
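Blending persona vectors and steering with the merged vector can be sketched as follows. The difference-of-means extraction, the weighted-average blend, and the additive steering rule are common activation-steering conventions assumed here for illustration; the abstract only states that vectors are extracted, blended in activation space, and used to steer generation.

```python
import numpy as np


def persona_vector(acts_with_persona, acts_without_persona):
    """Difference of mean hidden states with and without the persona prompt."""
    return acts_with_persona.mean(axis=0) - acts_without_persona.mean(axis=0)


def blend(vectors, weights):
    """Weighted average of several persona vectors."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, np.stack(vectors), axis=1) if False else np.tensordot(w, np.stack(vectors), axes=1)


def steer(hidden, vector, strength):
    """Add the merged vector to a hidden state at inference time."""
    return hidden + strength * vector


rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))  # toy activations without any persona
poet = persona_vector(base + 1.0, base)
engineer = persona_vector(base - 0.5, base)
merged = blend([poet, engineer], weights=[1.0, 1.0])
print(steer(np.zeros(16), merged, strength=2.0)[:4])
```

A single forward pass with the merged vector stands in for several separately prompted models, which is where the claimed savings in inference time would come from.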
A Unified Frequency Domain Decomposition Framework for Interpretable and Robust Time Series Forecasting
Authors: Cheng He, Xijie Liang, Zengrong Zheng, Patrick P. C. Lee, Xu Huang, Zhaoyi Li, Hong Xie, Defu Lian, Enhong Chen
2025-10-11
Current approaches for time series forecasting, whether in the time or
frequency domain, predominantly use deep learning models based on linear layers
or transformers. They often encode time series data in a black-box manner and
rely on trial-and-error optimization solely based on forecasting performance,
leading to limited interpretability and theoretical understanding. Furthermore,
the dynamics in data distribution over time and frequency domains pose a
critical challenge to accurate forecasting. We propose FIRE, a unified
frequency domain decomposition framework that provides a mathematical
abstraction for diverse types of time series, so as to achieve interpretable
and robust time series forecasting. FIRE introduces several key innovations:
(i) independent modeling of amplitude and phase components, (ii) adaptive
learning of weights of frequency basis components, (iii) a targeted loss
function, and (iv) a novel training paradigm for sparse data. Extensive
experiments demonstrate that FIRE consistently outperforms state-of-the-art
models on long-term forecasting benchmarks, achieving superior predictive
performance and significantly enhancing interpretability of time series
PermLLM Learnable Channel Permutation for NM Sparse Large Language Models
Authors: Lancheng Zou, Shuo Yin, Zehua Pei, Tsung-Yi Ho, Farzan Farnia, Bei Yu
2025-10-11
Channel permutation is a powerful technique for enhancing the accuracy of N:M
sparse models by reordering the channels of weight matrices to prioritize the
retention of important weights. However, traditional channel permutation
methods rely on handcrafted quality metrics, which often fail to accurately
capture the true impact of pruning on model performance. To address this
limitation, we propose PermLLM, a novel post-training pruning framework that
introduces learnable channel permutation (LCP) for N:M sparsity. LCP leverages
Sinkhorn normalization to transform discrete permutation matrices into
differentiable soft permutation matrices, enabling end-to-end optimization.
Additionally, PermLLM incorporates an efficient block-wise channel permutation
strategy, which significantly reduces the number of learnable parameters and
computational complexity. PermLLM seamlessly integrates with existing one-shot
pruning methods to adaptively optimize channel permutations, effectively
mitigating pruning-induced errors. Extensive experiments on the LLaMA series,
Qwen, and OPT models demonstrate that PermLLM achieves superior performance in
optimizing N:M sparse models. The code is available at
https://github.com/lanchengzou/Perm
.
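The role of Sinkhorn normalization can be sketched in a few lines: a matrix of unconstrained scores is alternately normalized over rows and columns until it becomes approximately doubly stochastic, which serves as a differentiable relaxation of a permutation matrix. This is a generic NumPy illustration; the temperature `tau`, the iteration count, and the function names are assumptions, and the step that recovers a hard permutation is not shown.

```python
import numpy as np


def _logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis, keeping dimensions."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))


def sinkhorn(scores, n_iters=100, tau=1.0):
    """Turn a square score matrix into an approximately doubly stochastic matrix."""
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - _logsumexp(log_p, axis=1)  # make each row sum to 1
        log_p = log_p - _logsumexp(log_p, axis=0)  # make each column sum to 1
    return np.exp(log_p)


scores = np.random.default_rng(0).normal(size=(4, 4))
soft_perm = sinkhorn(scores)
print(soft_perm.sum(axis=0))
print(soft_perm.sum(axis=1))
```

Every operation in the loop is differentiable, so gradients can flow back to `scores`; lowering `tau` pushes the result closer to a hard permutation.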
CacheClip Accelerating RAG with Effective KV Cache Reuse
Authors: Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu
2025-10-11
Retrieval-Augmented Generation (RAG) systems suffer from severe
time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing
KV cache reuse methods face a fundamental trade-off: prefix caching requires
identical prefixes that rarely occur in RAG scenarios, while direct
precomputation sacrifices quality due to missing inter-chunk attention and
repeated attention sinks. Recent methods like APE and CacheBlend partially
address these issues but remain inadequate for robust RAG applications. This
paper presents CacheClip, a novel framework that achieves both fast TTFT and
high generation quality. Our key insight is that small auxiliary LLMs exhibit
similar last-layer attention distributions to primary LLMs (the target model
for generation), enabling efficient identification of tokens critical for
restoring inter-chunk attention, thereby significantly improving response
quality on cross-chunk reasoning tasks. CacheClip integrates three techniques:
(1) auxiliary-model-guided token selection for selective KV cache
recomputation, where the auxiliary model is finetuned to improve selection
accuracy, (2) shared prefixes to eliminate redundant attention sinks, and (3)
grouping strategy to maintain local coherence during partial KV cache updates.
Experiments show CacheClip retains up to 94.8% and 85.0% of full-attention
performance on NIAH and LongBench, outperforming APE and CacheBlend by 25.2%
and 35.1% on NIAH (with recomp% = 20%). Meanwhile, CacheClip accelerates
inference by up to 1.92x in prefill time, providing a practical solution to the
efficiency-quality trade-off in RAG systems.
Lighter-X An Efficient and Plug-and-play Strategy for Graph-based Recommendation through Decoupled Propagation
Authors: Yanping Zheng, Zhewei Wei, Frank de Hoog, Xu Chen, Hongteng Xu, Yuhang Ye, Jiadeng Huang
2025-10-11
Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness in
recommendation systems. However, conventional graph-based recommenders, such as
LightGCN, require maintaining embeddings of size $d$ for each node, resulting
in a parameter complexity of $\mathcal{O}(n \times d)$, where $n$ represents
the total number of users and items. This scaling pattern poses significant
challenges for deployment on large-scale graphs encountered in real-world
applications. To address this scalability limitation, we propose
Lighter-X, an efficient and modular framework that can be seamlessly
integrated with existing GNN-based recommender architectures. Our approach
substantially reduces both parameter size and computational complexity while
pre the theoretical guarantees and empirical performance of the base
models, thereby enabling practical deployment at scale. Specifically, we
analyze the original structure and inherent redundancy in their parameters,
identifying opportunities for optimization. Based on this insight, we propose
an efficient
scheme for the
adjacency structure and
high-dimensional embedding matrices, achieving a parameter complexity of
, where . Furthermore, the model is optimized
through a decoupled framework, reducing computational complexity during the
training process and enhancing scalability. Extensive experiments demonstrate
that Lighter-X achieves comparable performance to baseline models with
significantly fewer parameters. In particular, on large-scale interaction
graphs with millions of edges, we are able to attain even better results with
only 1\% of the parameter over LightGCN.
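To make the scaling argument concrete, the short sketch below compares the size of a full per-node embedding table with that of a much smaller table. All sizes are hypothetical values chosen for illustration (with the smaller row count picked so that the ratio comes out at 1%); they are not figures reported by the authors.

```python
def embedding_params(rows: int, dim: int) -> int:
    """Number of scalars stored in a rows x dim embedding table."""
    return rows * dim


# Hypothetical sizes, for illustration only.
n = 1_000_000  # total number of users and items
d = 64         # embedding dimension
h = 10_000     # size of the compressed representation, much smaller than n

full = embedding_params(n, d)     # one embedding per node
compact = embedding_params(h, d)  # compressed table
ratio = compact / full

print(full, compact, ratio)  # 64000000 640000 0.01
```

Because the embedding dimension is the same in both tables, the saving depends only on how much smaller the compressed row count is than the number of nodes.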
P-4DGS Predictive 4D Gaussian Splatting with 90x Compression
Authors: Henan Wang, Hanxin Zhu, Xinliang Gong, Tianyu He, Xin Li, Zhibo Chen
2025-10-11
3D Gaussian Splatting (3DGS) has garnered significant attention due to its
superior scene representation fidelity and real-time rendering performance,
especially for dynamic 3D scene reconstruction (i.e., 4D
reconstruction). However, despite achieving promising results, most existing
algorithms overlook the substantial temporal and spatial redundancies inherent
in dynamic scenes, leading to prohibitive memory consumption. To address this,
we propose P-4DGS, a novel dynamic 3DGS representation for compact 4D scene
modeling. Inspired by intra- and inter-frame prediction techniques commonly
used in video compression, we first design a 3D anchor point-based
spatial-temporal prediction module to fully exploit the spatial-temporal
correlations across different 3D Gaussian primitives. Subsequently, we employ
an adaptive quantization strategy combined with context-based entropy coding to
further reduce the size of the 3D anchor points, thereby achieving enhanced
compression efficiency. To evaluate the rate-distortion performance of our
proposed P-4DGS in comparison with other dynamic 3DGS representations, we
conduct extensive experiments on both synthetic and real-world datasets.
Experimental results demonstrate that our approach achieves state-of-the-art
reconstruction quality and the fastest rendering speed, with a remarkably low
storage footprint (around 1MB on average), achieving up to 40x and 90x
compression on synthetic and real-world scenes, respectively.
Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization
Authors: Yang Li, Ruichen Zhang, Yinqiu Liu, Guangyuan Liu, Dusit Niyato, Abbas Jamalipour, Xianbin Wang, Dong In Kim
2025-10-11
The rapid advancement of Low-Altitude Economy Networks (LAENets) has enabled
a variety of applications, including aerial surveillance, environmental
sensing, and semantic data collection. To support these scenarios, unmanned
aerial vehicles (UAVs) equipped with onboard vision-language models (VLMs)
offer a promising solution for real-time multimodal inference. However,
ensuring both inference accuracy and efficiency remains a
significant challenge due to limited onboard resources and dynamic network
conditions. In this paper, we first propose a UAV-enabled LAENet system model
that jointly captures UAV mobility, user-UAV communication, and the onboard
visual question answering (VQA) pipeline. Based on this model, we formulate a
mixed-integer non-convex optimization problem to minimize task latency and
power consumption under user-specific accuracy constraints. To solve the
problem, we design a hierarchical optimization framework composed of two parts:
(i) an Alternating Resolution and Power Optimization (ARPO) algorithm for
resource allocation under accuracy constraints, and (ii) a Large Language
Model-augmented Reinforcement Learning Approach (LLaRA) for adaptive UAV
trajectory optimization. The large language model (LLM) serves as an expert in
refining reward design of reinforcement learning in an offline fashion,
introducing no additional latency in real-time decision-making. Numerical
results demonstrate the efficacy of our proposed framework in improving
inference performance and communication efficiency under dynamic LAENet
conditions.
Deliberative Dynamics and Value Alignment in LLM Debates
Authors: Pratik S. Sachdeva, Tom van Nuenen
2025-10-11
As large language models (LLMs) are increasingly deployed in sensitive
everyday contexts - offering personal advice, mental health support, and moral
guidance - understanding their elicited values in navigating complex moral
reasoning is essential. Most evaluations study this sociotechnical alignment
through single-turn prompts, but it is unclear if these findings extend to
multi-turn settings where values emerge through dialogue, revision, and
consensus. We address this gap using LLM debate to examine deliberative
dynamics and value alignment in multi-turn settings by prompting subsets of
three models (GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash) to collectively
assign blame in 1,000 everyday dilemmas from Reddit's "Am I the Asshole"
community. We use both synchronous (parallel responses) and round-robin
(sequential responses) formats to test order effects and verdict revision. Our
findings show striking behavioral differences. In the synchronous setting, GPT
showed strong inertia (0.6-3.1% revision rates) while Claude and Gemini were
far more flexible (28-41%). Value patterns also diverged: GPT emphasized
personal autonomy and direct communication, while Claude and Gemini prioritized
empathetic dialogue. Certain values proved especially effective at driving
verdict changes. We further find that deliberation format had a strong impact
on model behavior: GPT and Gemini stood out as highly conforming relative to
Claude, with their verdict behavior strongly shaped by order effects. These
results show how deliberation format and model-specific behaviors shape moral
reasoning in multi-turn interactions, underscoring that sociotechnical
alignment depends on how systems structure dialogue as much as on their
outputs.
Universal Discrete-Domain Speech Enhancement
Authors: Fei Liu, Yang Ai, Ye-Xin Lu, Rui-Chen Zheng, Hui-Peng Du, Zhen-Hua Ling
2025-10-11
In real-world scenarios, speech signals are inevitably corrupted by various
types of interference, making speech enhancement (SE) a critical task for
robust speech processing. However, most existing SE methods only handle a
limited range of distortions, such as additive noise, reverberation, or band
limitation, while the study of SE under multiple simultaneous distortions
remains limited. This gap affects the generalization and practical usability of
SE methods in real-world environments. To address this gap, this paper proposes
a novel Universal Discrete-domain SE model called UDSE. Unlike regression-based
SE models that directly predict the clean speech waveform or continuous features,
UDSE redefines SE as a discrete-domain classification task, instead predicting
the clean discrete tokens quantized by the residual vector quantizer (RVQ) of a
pre-trained neural speech codec. Specifically, UDSE first extracts global
features from the degraded speech. Guided by these global features, the clean
token prediction for each VQ follows the rules of RVQ, where the prediction of
each VQ relies on the results of the preceding ones. Finally, the predicted
clean tokens from all VQs are decoded to reconstruct the clean speech waveform.
During training, the UDSE model employs a teacher-forcing strategy, and is
optimized with cross-entropy loss. Experimental results confirm that the
proposed UDSE model can effectively enhance speech degraded by various
conventional and unconventional distortions, e.g., additive noise,
reverberation, band limitation, clipping, phase distortion, and compression
distortion, as well as their combinations. These results demonstrate the
superior universality and practicality of UDSE compared to advanced
regression-based SE methods.
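The dependence of each VQ stage on the preceding ones follows from how residual vector quantization works: every stage quantizes what the earlier stages left unexplained. The sketch below shows plain RVQ encoding and decoding with tiny hand-written codebooks. It is an illustration of the RVQ rule only, not of UDSE's predictor or of any particular codec; the codebook values and function names are made up.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codeword closest to x in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens: Sequence[int],
               codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Reconstruct the vector by summing the selected codeword of each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Two illustrative stages: a coarse codebook and a finer one (made-up values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
tokens = rvq_encode([1.3, 0.9], codebooks)
recon = rvq_decode(tokens, codebooks)
print(tokens, recon)  # [1, 1] [1.25, 1.0]
```

Changing the first-stage token changes the residual seen by the second stage, which is why a token predictor for later stages needs the earlier results.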
Conformal Sparsification for Bandwidth-Efficient Edge-Cloud Speculative Decoding
Authors: Payel Bhattacharjee, Fengwei Tian, Meiyu Zhong, Guangyi Zhang, Osvaldo Simeone, Ravi Tandon
2025-10-11
Edge-cloud speculative decoding (SD) accelerates inference by having a
cloud-based large language model (LLM) verify draft tokens generated by
a resource-constrained small language model (SLM) at the edge. A central
bottleneck is the limited bandwidth of the edge-cloud link, which necessitates
efficient compression of draft token distributions. We first derive an
information-theoretic bound that decomposes the token rejection rate into
contributions from SLM-LLM distribution mismatch and from quantization
distortion. Guided by this analysis, we propose the Sparse Quantize-and-Sample
SD (SQS-SD) framework, which exploits distributional sparsity through
structured sparsification and lattice-based quantization. Within this
framework, K-SQS applies fixed top-K truncation, while C-SQS adaptively adjusts
the retained token set via online conformal prediction to ensure bounded
deviation from the dense distribution. Empirical results confirm that both
approaches improve end-to-end latency and rejection rates in complementary
operating regimes.
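Fixed top-K truncation, the sparsification step attributed to K-SQS above, can be sketched in a few lines. The example keeps the K most probable draft tokens and renormalizes them; the lattice-based quantization and the conformal adaptation used by C-SQS are not modeled, and the function name and probabilities are illustrative assumptions rather than the authors' code.

```python
from typing import Dict, Sequence


def top_k_sparsify(probs: Sequence[float], k: int) -> Dict[int, float]:
    """Keep the k most probable tokens and renormalize their probabilities."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}


# Made-up draft distribution over a five-token vocabulary.
draft = [0.5, 0.25, 0.125, 0.0625, 0.0625]
sparse = top_k_sparsify(draft, k=2)
dropped_mass = 1.0 - sum(draft[i] for i in sparse)

print(sorted(sparse))  # [0, 1]
print(dropped_mass)    # 0.25
```

Only the retained token ids and their probabilities would need to cross the edge-cloud link, and the dropped mass indicates how far the sparse distribution deviates from the dense one.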
The Ethics Engine A Modular Pipeline for Accessible Psychometric Assessment of Large Language Models
Authors: Jake Van Clief, Constantine Kyritsopoulos
2025-10-11
As Large Language Models increasingly mediate human communication and
decision-making, understanding their value expression becomes critical for
research across disciplines. This work presents the Ethics Engine, a modular
Python pipeline that transforms psychometric assessment of LLMs from a
technically complex endeavor into an accessible research tool. The pipeline
demonstrates how thoughtful infrastructure design can expand participation in
AI research, enabling investigators across cognitive science, political
psychology, education, and other fields to study value expression in language
models. Recent adoption by University of Edinburgh researchers studying
authoritarianism validates its research utility, processing over 10,000 AI
responses across multiple models and contexts. We argue that such tools
fundamentally change the landscape of AI research by lowering technical
barriers while maintaining scientific rigor. As LLMs increasingly serve as
cognitive infrastructure, their embedded values shape millions of daily
interactions. Without systematic measurement of these value expressions, we
deploy systems whose moral influence remains uncharted. The Ethics Engine
enables the rigorous assessment necessary for informed governance of these
influential technologies.