2025-09-28
Table of Contents
- Quantized Visual Geometry Grounded Transformer
- Nova Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization
- Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
- Semantic Edge-Cloud Communication for Real-Time Urban Traffic Surveillance with ViT and LLMs over Mobile Networks
- Explaining Fine Tuned LLMs via Counterfactuals A Knowledge Graph Driven Framework
- Tree Search for LLM Agent Reinforcement Learning
- A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
- Who's Laughing Now? An Overview of Computational Humour Generation and Explanation
- GRPO is Secretly a Process Reward Model
- CAD-Tokenizer Towards Text-based CAD Prototyping via Modality-Specific Tokenization
- UniSS Unified Expressive Speech-to-Speech Translation with Your Voice
- Acoustic-based Gender Differentiation in Speech-aware Language Models
- TyphoonMLA A Mixed Naive-Absorb MLA Kernel For Shared Prefix
- KeyWorld Key Frame Reasoning Enables Effective and Efficient World Models
- Binary Autoencoder for Mechanistic Interpretability of Large Language Models
- Toward Robust and Efficient ML-Based GPU Caching for Modern Inference
- MemLens Uncovering Memorization in LLMs with Activation Trajectories
- Punching Above Precision Small Quantized Model Distillation with Learnable Regularizer
- SPADE Structured Pruning and Adaptive Distillation for Efficient LLM-TTS
- Towards Atoms of Large Language Models
- Recidivism and Peer Influence with LLM Text Embeddings in Low Security Correctional Facilities
- Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning
- CHOIR A Chatbot-mediated Organizational Memory Leveraging Communication in University Research Labs
- MARS toward more efficient multi-agent collaboration for LLM reasoning
- Shared Neural Space Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision
- Seedream 4.0 Toward Next-generation Multimodal Image Generation
- Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing
- SIM-CoT Supervised Implicit Chain-of-Thought
- Multimodal Representation-disentangled Information Bottleneck for Multimodal Recommendation
- Q-Palette Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
- From Text to Talk Audio-Language Model Needs Non-Autoregressive Joint Training
- Projective Kolmogorov Arnold Neural Networks (P-KANs) Entropy-Driven Functional Space Discovery for Interpretable Machine Learning
- Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks
- MeshMosaic Scaling Artist Mesh Generation via Local-to-Global Assembly
- RAD Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis
- FastEagle Cascaded Drafting for Accelerating Speculative Decoding
- Exploration with Foundation Models Capabilities, Limitations, and Hybrid Approaches
- Future Policy Aware Preference Learning for Mathematical Reasoning
- Structuring Collective Action with LLM-Guided Evolution From Ill-Structured Problems to Executable Heuristics
- CollaPipe Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks
- BurstEngine an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens
- MMedFD A Real-world Healthcare Benchmark for Multi-turn Full-Duplex Automatic Speech Recognition
- Gyges Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference
- Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling
- Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation
- Transformer Modeling for Both Scalability and Performance in Multivariate Time Series
- CompLLM Compression for Long Context Q&A
- Online Process Reward Leanring for Agentic Reinforcement Learning
- Reading Images Like Texts Sequential Image Understanding in Vision-Language Models
- BiGraspFormer End-to-End Bimanual Grasp Transformer
- Clapping Removing Per-sample Storage for Pipeline Parallel Distributed Optimization with Communication Compression
- HD-PPT Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS
- Benchmarking PDF Accessibility Evaluation A Dataset and Framework for Assessing Automated and LLM-Based Approaches for Accessibility Testing
- Confidential LLM Inference Performance and Cost Across CPU and GPU TEEs
- FlashGMM Fast Gaussian Mixture Entropy Model for Learned Image Compression
- Bi-VLM Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models
- HyperCool Reducing Encoding Cost in Overfitted Codecs with Hypernetworks
- PIE Perception and Interaction Enhanced End-to-End Motion Planning for Autonomous Driving
- FlexSED Towards Open-Vocabulary Sound Event Detection
- OmniFed A Modular Framework for Configurable Federated Learning from Edge to HPC
- LLMZ+ Contextual Prompt Whitelist Principles for Agentic LLMs
- Individualized non-uniform quantization for vector search
- LAWCAT Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling
- NormGenesis Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery
- Efficient Particle Acceleration in 2.5-Dimensional, Hybrid-Kinetic Simulations of Decaying, Supersonic, Plasma Turbulence
- Chiplet-Based RISC-V SoC with Modular AI Acceleration
- Speculate Deep and Accurate Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding
- Exploiting Tree Structure for Credit Assignment in RL Training of LLMs
- Evaluating Large Language Models for Detecting Antisemitism
- Spiffy Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
- GraDeT-HTR A Resource-Efficient Bengali Handwritten Text Recognition System utilizing Grapheme-based Tokenizer and Decoder-only Transformer
- TempSamp-R1 Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
- RadEval A framework for radiology text evaluation
- Through the Lens of Human-Human Collaboration A Configurable Research Platform for Exploring Human-Agent Collaboration
- Visual Detector Compression via Location-Aware Discriminant Analysis
- Benchmarking Humans and Machines on Complex Multilingual Speech Understanding Tasks
- Expert-as-a-Service Towards Efficient, Scalable, and Robust Large-scale MoE Serving
- Towards General Computer Control with Hierarchical Agents and Multi-Level Action Spaces
- ConfClip Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs
- When TableQA Meets Noise A Dual Denoising Framework for Complex Questions and Large-scale Tables
- Mechanistic Interpretability with SAEs Probing Religion, Violence, and Geography in Large Language Models
- Evict3R Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
- Bilateral Distribution Compression Reducing Both Data Size and Dimensionality
- Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs
- 4DGCPro Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
- CorefInst Leveraging LLMs for Multilingual Coreference Resolution
- Privacy in Action Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents
- Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks
- QWHA Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
- DINVMark A Deep Invertible Network for Video Watermarking
- Interpreting vision transformers via residual replacement model
- EpiCache Episodic KV Cache Management for Long Conversational Question Answering
- Multimodal Health Risk Prediction System for Chronic Diseases via Vision-Language Fusion and Large Language Models
- Asteria Semantic-Aware Cross-Region Caching for Agentic LLM Tool Access
- Cronus Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill
- Compact representation of transonic airfoil buffet flows with observable-augmented machine learning
- Rational Multi-Modal Transformers for TCR-pMHC Prediction
- Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection
- DeepASA An Object-Oriented One-for-All Network for Auditory Scene Analysis
- MoEs Are Stronger than You Think Hyper-Parallel Inference Scaling with RoE
- SignalLLM A General-Purpose LLM Agent Framework for Automated Signal Processing
- MAST Multi-Agent Spatial Transformer for Learning to Collaborate
- Attention Consistency for LLMs Explanation
- Analyzing Memory Effects in Large Language Models through the lens of Cognitive Psychology
Quantized Visual Geometry Grounded Transformer
Authors: Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
2025-09-25
Learning-based 3D reconstruction models, represented by Visual Geometry
Grounded Transformers (VGGTs), have made remarkable progress with the use of
large-scale transformers. Their prohibitive computational and memory costs
severely hinder real-world deployment. Post-Training Quantization (PTQ) has
become a common practice for compressing and accelerating models. However, we
empirically observe that PTQ faces unique obstacles when compressing
billion-scale VGGTs: the data-independent special tokens induce heavy-tailed
activation distributions, while the multi-view nature of 3D data makes
calibration sample selection highly unstable. This paper proposes the first
quantization framework for VGGTs, namely QuantVGGT, which relies mainly on two
technical contributions. First, we introduce Dual-Smoothed Fine-Grained
Quantization, which integrates pre-global Hadamard rotation and post-local
channel smoothing to mitigate heavy-tailed distributions and inter-channel
variance robustly. Second, we design Noise-Filtered Diverse Sampling, which
filters outliers via deep-layer statistics and constructs frame-aware diverse
calibration clusters to ensure stable quantization ranges. Comprehensive
experiments demonstrate that QuantVGGT achieves state-of-the-art results
across different benchmarks and bit-widths, surpassing the previous
state-of-the-art generic quantization method by a large margin. We highlight
that our 4-bit QuantVGGT can deliver a 3.7× memory reduction and 2.5×
acceleration in real-hardware inference, while maintaining
reconstruction accuracy above 98% of its full-precision counterpart. This
demonstrates the vast advantages and practicality of QuantVGGT in
resource-constrained scenarios. Our code is released at
https://github.com/wlfeng0509/QuantVGGT.
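
The idea behind rotating activations before quantization can be illustrated with a small sketch. It assumes a Sylvester-construction Hadamard matrix and a toy activation vector with one outlier; the names `hadamard` and `rotate` are hypothetical, and this is not QuantVGGT's actual pipeline, which also applies local channel smoothing.

```python
import math

def hadamard(n):
    """Orthonormal Hadamard matrix of size n (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = [[1.0]]
    while len(h) < n:
        # Sylvester step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

def rotate(h, v):
    """Multiply matrix h by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in h]

# Toy heavy-tailed activation: small values plus one large outlier.
x = [0.1, -0.2, 0.15, 0.05, -0.1, 0.2, -0.05, 8.0]
w = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0, 0.1]

H = hadamard(len(x))
x_rot, w_rot = rotate(H, x), rotate(H, w)

peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in x_rot)
dot_before = sum(a * b for a, b in zip(x, w))
dot_after = sum(a * b for a, b in zip(x_rot, w_rot))

print(peak_before, peak_after)   # the outlier is spread across all channels
print(dot_before, dot_after)     # the inner product is unchanged by the rotation
```

Because the rotation is orthonormal, rotating both operands leaves their inner product unchanged, while the smaller peak value allows a finer quantization step at the same bit-width.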
Nova Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization
Authors: Yuhang Xu, Shengzhong Liu, Dong Zhang, Bingheng Yan, Fan Wu, Guihai Chen
2025-09-25
This paper presents Nova, a real-time scheduling framework for serving
agentic vision-language models (VLMs) on a single GPU with balanced per-request
latency and overall request-processing throughput. Our design begins by enabling
effective pipelining across the vision encode, LLM prefill, and LLM decode stages
of VLMs, by exploiting their heterogeneous resource demands during execution
and incorporating elastic GPU spatial partitioning among stages to maximally
utilize the compute and memory resources. Building on this, we introduce a
real-time scheduling algorithm that adaptively calibrates resource allocation
among stages based on a Pareto-optimal analysis of the latency-throughput
trade-off, allowing the system to sustain responsiveness and resource
efficiency under dynamic request loads. To further alleviate GPU memory
pressure, we design a lightweight weight offloading strategy for vision
encoders that preserves inference efficiency with minimized memory overhead.
Extensive evaluations on both synthetic and real-world agent workloads
demonstrate that Nova consistently outperforms the state-of-the-art baselines,
improving the maximum latency by up to 23.3%, while keeping competitive
throughput.
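
A Pareto-optimal analysis of the latency-throughput trade-off can be sketched as follows. The candidate GPU splits and their measurements are invented for illustration, and `pareto_front` and `pick` are hypothetical helpers rather than Nova's scheduler.

```python
def pareto_front(configs):
    """Keep configs that no other config beats on both latency and throughput."""
    front = []
    for c in configs:
        dominated = any(
            o["latency_ms"] <= c["latency_ms"]
            and o["throughput_rps"] >= c["throughput_rps"]
            and (o["latency_ms"] < c["latency_ms"]
                 or o["throughput_rps"] > c["throughput_rps"])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["latency_ms"])

def pick(front, latency_budget_ms):
    """Highest-throughput config within the budget, else the lowest-latency one."""
    feasible = [c for c in front if c["latency_ms"] <= latency_budget_ms]
    if feasible:
        return max(feasible, key=lambda c: c["throughput_rps"])
    return front[0]

# Hypothetical GPU shares for (vision encode, prefill, decode) with made-up numbers.
candidates = [
    {"split": (0.2, 0.4, 0.4), "latency_ms": 120, "throughput_rps": 9.0},
    {"split": (0.3, 0.4, 0.3), "latency_ms": 100, "throughput_rps": 8.0},
    {"split": (0.4, 0.3, 0.3), "latency_ms": 110, "throughput_rps": 7.5},
    {"split": (0.2, 0.5, 0.3), "latency_ms": 150, "throughput_rps": 10.0},
    {"split": (0.5, 0.3, 0.2), "latency_ms": 90, "throughput_rps": 6.0},
]

front = pareto_front(candidates)
chosen = pick(front, latency_budget_ms=125)
print([c["split"] for c in front], chosen["split"])
```

A real-time scheduler would re-run a selection like `pick` as the request load changes, moving along the front rather than re-evaluating every possible split.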
Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
Authors: Shiju Wang, Yujie Wang, Ao Sun, Fangcheng Fu, Zijian Zhu, Bin Cui, Xu Han, Kaisheng Ma
2025-09-25
Long context training is crucial for LLM's context extension. Existing
schemes, such as sequence parallelism, incur substantial communication
overhead. Pipeline parallelism (PP) reduces this cost, but its effectiveness
hinges on partitioning granularity. Batch-level PP, which divides input samples,
exhibits high memory consumption in long-context scenarios, whereas token-level
PP, which splits sequences into slices, alleviates memory overhead but may incur
hardware under-utilization. This trade-off motivates adaptively selecting PP
granularity to match resource and workload characteristics. Moreover, the sequence
length distribution of real-world datasets is skewed, posing a
challenge to PP's workload balance and efficient scheduling. Current static PP
scheduling methods overlook the variance of sequence length, leading to
suboptimal performance. In this paper, we propose Elastic Pipeline Parallelism
(EPP) that orchestrates token-level PP and batch-level PP to adapt to resource
and workload heterogeneity. We build InfiniPipe, a distributed training system
that unleashes the potential of EPP via (1) a resource-aware and
workload-balanced sequence processor that splits long sequences and packs short
ones; and (2) a co-optimization methodology that jointly optimizes pipeline
schedule and gradient checkpointing via a mechanism named stage-aware
chunk-level adaptive checkpointing. Comprehensive experiments demonstrate that
InfiniPipe achieves a 1.69x speedup over state-of-the-art systems.
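
The "split long sequences and pack short ones" step can be sketched as below. This assumes a single fixed token capacity per micro-batch and uses first-fit-decreasing packing, which is a stand-in choice; InfiniPipe's resource-aware processor is not described in enough detail here to reproduce, and `split_and_pack` is a hypothetical name.

```python
def split_and_pack(seq_lens, capacity):
    """Split sequences longer than `capacity` into slices; pack the rest into bins.

    Returns (slices, bins): slices are (seq_id, start, length) tuples, and each
    bin is a dict with the token count used and the ids of the packed sequences.
    """
    slices, short = [], []
    for seq_id, n in enumerate(seq_lens):
        if n > capacity:
            for start in range(0, n, capacity):
                slices.append((seq_id, start, min(capacity, n - start)))
        else:
            short.append((seq_id, n))

    bins = []
    # First-fit decreasing: place each sequence in the first bin with room.
    for seq_id, n in sorted(short, key=lambda item: -item[1]):
        for b in bins:
            if b["used"] + n <= capacity:
                b["items"].append(seq_id)
                b["used"] += n
                break
        else:
            bins.append({"used": n, "items": [seq_id]})
    return slices, bins

# Skewed length distribution: one very long sequence and several short ones.
seq_lens = [10000, 300, 1200, 4096, 700, 2500]
slices, bins = split_and_pack(seq_lens, capacity=4096)
print(slices)
print(bins)
```

Slices of a long sequence correspond to token-level PP units, while packed bins correspond to batch-level units, so both granularities can coexist in one schedule.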
Semantic Edge-Cloud Communication for Real-Time Urban Traffic Surveillance with ViT and LLMs over Mobile Networks
Authors: Murat Arda Onsu, Poonam Lohan, Burak Kantarci, Aisha Syed, Matthew Andrews, Sean Kennedy
2025-09-25
Real-time urban traffic surveillance is vital for Intelligent Transportation
Systems (ITS) to ensure road safety, optimize traffic flow, track vehicle
trajectories, and prevent collisions in smart cities. Deploying edge cameras
across urban environments is a standard practice for monitoring road
conditions. However, integrating these with intelligent models requires a
robust understanding of dynamic traffic scenarios and a responsive interface
for user interaction. Although multimodal Large Language Models (LLMs) can
interpret traffic images and generate informative responses, their deployment
on edge devices is infeasible due to high computational demands. Therefore,
inference must occur on the cloud, necessitating visual data transmission from
edge to cloud, a process hindered by limited bandwidth, leading to potential
delays that compromise real-time performance. To address this challenge, we
propose a semantic communication framework that significantly reduces
transmission overhead. Our method involves detecting Regions of Interest (RoIs)
using YOLOv11, cropping relevant image segments, and converting them into
compact embedding vectors using a Vision Transformer (ViT). These embeddings
are then transmitted to the cloud, where an image decoder reconstructs the
cropped images. The reconstructed images are processed by a multimodal LLM to
generate traffic condition descriptions. This approach achieves a 99.9%
reduction in data transmission size while maintaining an LLM response accuracy
of 89% for reconstructed cropped images, compared to 93% accuracy with original
cropped images. Our results demonstrate the efficiency and practicality of ViT
and LLM-assisted edge-cloud semantic communication for real-time traffic
surveillance.
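
A back-of-the-envelope calculation shows how sending embeddings instead of pixels can reach a reduction of this order. The frame size, embedding width, and number of RoIs below are assumptions chosen for illustration, not values reported by the authors.

```python
def transmission_reduction(raw_bytes, sent_bytes):
    """Fraction of bytes saved by sending `sent_bytes` instead of `raw_bytes`."""
    return 1.0 - sent_bytes / raw_bytes

# Assumed values, chosen only for illustration:
frame_bytes = 1920 * 1080 * 3   # one uncompressed 1080p RGB frame
embedding_bytes = 768 * 4       # one 768-dimensional float32 ViT embedding
num_rois = 2                    # regions of interest sent for this frame

saved = transmission_reduction(frame_bytes, num_rois * embedding_bytes)
print(f"{saved:.4%}")
```

The result depends heavily on the baseline: against a compressed video stream the saving would be smaller than against raw frames.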
Explaining Fine Tuned LLMs via Counterfactuals A Knowledge Graph Driven Framework
Authors: Yucheng Wang, Ziyang Chen, Md Faisal Kabir
2025-09-25
The widespread adoption of Low-Rank Adaptation (LoRA) has enabled large
language models (LLMs) to acquire domain-specific knowledge with remarkable
efficiency. However, understanding how such a fine-tuning mechanism alters a
model's structural reasoning and semantic behavior remains an open challenge.
This work introduces a novel framework that explains fine-tuned LLMs via
counterfactuals grounded in knowledge graphs. Specifically, we construct
BioToolKG, a domain-specific heterogeneous knowledge graph in bioinformatics
tools, and design a counterfactual-based fine-tuned LLMs explainer
(CFFTLLMExplainer) that learns soft masks over graph nodes and edges to
generate minimal structural perturbations that induce maximum semantic
divergence. Our method jointly optimizes structural sparsity and semantic
divergence while enforcing interpretability-preserving constraints such as
entropy regularization and edge smoothness. We apply this framework to a
fine-tuned LLaMA-based LLM and reveal that counterfactual masking exposes the
model's structural dependencies and aligns with LoRA-induced parameter shifts.
This work provides new insights into the internal mechanisms of fine-tuned LLMs
and highlights counterfactual graphs as a potential tool for interpretable AI.
Tree Search for LLM Agent Reinforcement Learning
Authors: Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
2025-09-25
Recent advances in reinforcement learning (RL) have significantly enhanced
the agentic capabilities of large language models (LLMs). In long-term and
multi-turn agent tasks, existing approaches driven solely by outcome rewards
often suffer from the problem of sparse supervision. To address the challenge,
we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped
agent RL method based on tree search, where each tree node represents the
complete agent interaction step. By sharing common prefixes, the tree search
sampling increases the number of rollouts achievable within a fixed budget of
tokens or tool calls. Moreover, we find that the tree-structured trajectory
naturally allows the construction of step-wise process supervised signals even
using only the outcome reward. Based on this, Tree-GRPO estimates the grouped
relative advantages both on intra-tree and inter-tree levels. Through
theoretical analysis, we demonstrate that the objective of intra-tree level
group relative policy optimization is equivalent to that of step-level direct
preference learning. Experiments across 11 datasets and 3 types of QA tasks
demonstrate the superiority of the proposed tree-based RL over the chain-based
RL method.
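
Group-relative advantages at the intra-tree and inter-tree levels can be sketched as follows. The abstract does not say how the two levels are combined, so summing them is an assumption, and `tree_advantages` is a hypothetical helper that works on outcome rewards only.

```python
from statistics import mean, pstdev

def group_relative(rewards, eps=1e-6):
    """Normalize rewards by the mean and standard deviation of their group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def tree_advantages(trees):
    """trees: one list of leaf outcome rewards per search tree.

    Returns per-leaf advantages as intra-tree plus inter-tree terms
    (the sum is an assumption made for this sketch).
    """
    flat = [r for tree in trees for r in tree]
    inter = group_relative(flat)          # group = all leaves of all trees
    result, offset = [], 0
    for tree in trees:
        intra = group_relative(tree)      # group = leaves of this tree only
        chunk = inter[offset:offset + len(tree)]
        result.append([a + b for a, b in zip(intra, chunk)])
        offset += len(tree)
    return result

# Two trees for the same prompt; 1 = task solved, 0 = failed.
adv = tree_advantages([[1, 0, 0, 1], [0, 0, 0, 1]])
print(adv)
```

In this toy case the single success in the second tree receives a larger advantage than either success in the first tree, because it stands out more within its own group.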
A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
Authors: Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen
2025-09-25
Multi-Hop Question Answering (MHQA) requires integrating dispersed,
interdependent evidence through sequential reasoning under noise. This task is
challenging for LLMs as they have a finite per-pass output capacity, beyond
which the integration of task-relevant evidence proves unreliable.
Consequently, the single-pass reasoning paradigm is inherently vulnerable to
this capacity overflow. To formalize this bottleneck, our analysis establishes
a Fano-style accuracy upper bound, defining a theoretical performance ceiling
for single-pass LLMs. This bound reveals that accuracy inevitably collapses
once task complexity exceeds model capacity, providing general principles for
capacity-aware representation and structuring of MHQA in LLMs. Building on
these principles, we introduce a proof-of-concept multi-call framework for
MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware
task decomposition with active pruning of prior reasoning traces, keeping the
information load within the single-pass limit. It further achieves robustness
by a dependency-explicit workflow that enables precise control over the
reasoning path. We construct a stringent and noise-rich benchmark to validate
our theory and framework. Experimental results show that model behavior aligns
with our predicted capacity curves while InfoQA achieves consistent performance
improvements. We hope our work inspires more LLM multi-step reasoning methods:
https://github.com/KaiyangWan/InfoQA.
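
The flavor of a Fano-style ceiling can be seen from the textbook Fano inequality, sketched below. This assumes the answer is uniform over a finite candidate set and that a single pass conveys at most `info_bits` bits about it; the paper's own bound may take a different form, and `fano_accuracy_upper_bound` is a hypothetical name.

```python
import math

def fano_accuracy_upper_bound(info_bits, num_candidates):
    """Classical Fano bound on accuracy for a uniform answer over M candidates.

    From H(X|Y) <= 1 + P_e * log2(M) and H(X|Y) = log2(M) - I(X;Y), it follows
    that accuracy = 1 - P_e <= (I(X;Y) + 1) / log2(M).
    """
    if num_candidates < 2:
        return 1.0
    return min(1.0, (info_bits + 1.0) / math.log2(num_candidates))

# Assumed capacity of 10 bits per pass, for illustration only.
easy = fano_accuracy_upper_bound(10, 2 ** 8)    # task fits within capacity
hard = fano_accuracy_upper_bound(10, 2 ** 20)   # task exceeds capacity
print(easy, hard)
```

Once the answer space needs more bits than a pass can carry, the ceiling drops below one no matter how the pass is used, which is the motivation for splitting the task across several calls.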
Who's Laughing Now? An Overview of Computational Humour Generation and Explanation
Authors: Tyler Loakman, William Thorne, Chenghua Lin
2025-09-25
The creation and perception of humour is a fundamental human trait,
positioning its computational understanding as one of the most challenging
tasks in natural language processing (NLP). As an abstract, creative, and
frequently context-dependent construct, humour requires extensive reasoning to
understand and create, making it a pertinent task for assessing the
common-sense knowledge and reasoning abilities of modern large language models
(LLMs). In this work, we survey the landscape of computational humour as it
pertains to the generative tasks of creation and explanation. We observe that,
despite the task of understanding humour bearing all the hallmarks of a
foundational NLP task, work on generating and explaining humour beyond puns
remains sparse, while state-of-the-art models continue to fall short of human
capabilities. We bookend our literature survey by motivating the importance of
computational humour processing as a subdiscipline of NLP and presenting an
extensive discussion of future directions for research in the area that takes
into account the subjective and ethically ambiguous nature of humour.
GRPO is Secretly a Process Reward Model
Authors: Michael Sullivan
2025-09-25
We prove theoretically that the GRPO RL algorithm induces a non-trivial
process reward model (PRM), under certain assumptions regarding within-group
overlap of token sequences across completions. We then show empirically that
these assumptions are met under real-world conditions: GRPO does in fact induce
a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a
flaw in the GRPO objective: non-uniformly distributed process steps hinder both
exploration and exploitation (under different conditions). We propose a simple
modification to the algorithm to mitigate this defect (λ-GRPO), and
show that LLMs trained with λ-GRPO achieve higher validation accuracy
and performance on downstream reasoning tasks, and reach peak performance more
rapidly, than LLMs trained with standard GRPO. Our results call into question
the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is
possible to instead leverage the hidden, built-in PRM structure within the
vanilla GRPO algorithm to boost model performance with a negligible impact on
training time and cost.
CAD-Tokenizer Towards Text-based CAD Prototyping via Modality-Specific Tokenization
Authors: Ruiyu Wang, Shizhao Sun, Weijian Ma, Jiang Bian
2025-09-25
Computer-Aided Design (CAD) is a foundational component of industrial
prototyping, where models are defined not by raw coordinates but by
construction sequences such as sketches and extrusions. This sequential
structure enables both efficient prototype initialization and subsequent
editing. Text-guided CAD prototyping, which unifies Text-to-CAD generation and
CAD editing, has the potential to streamline the entire design pipeline.
However, prior work has not explored this setting, largely because standard
large language model (LLM) tokenizers decompose CAD sequences into
natural-language word pieces, failing to capture primitive-level CAD semantics
and hindering attention modules from modeling geometric structure. We
conjecture that a multimodal tokenization strategy, aligned with CAD's
primitive and structural nature, can provide more effective representations. To
this end, we propose CAD-Tokenizer, a framework that represents CAD data with
modality-specific tokens using a sequence-based VQ-VAE with primitive-level
pooling and constrained decoding. This design produces compact, primitive-aware
representations that align with CAD's structural nature. Applied to unified
text-guided CAD prototyping, CAD-Tokenizer significantly improves instruction
following and generation quality, achieving better quantitative and qualitative
performance over both general-purpose LLMs and task-specific baselines.
UniSS Unified Expressive Speech-to-Speech Translation with Your Voice
Authors: Sitong Cheng, Weizhen Bian, Xinsheng Wang, Ruibin Yuan, Jianyi Chen, Shunshun Yin, Yike Guo, Wei Xue
2025-09-25
The ultimate goal of expressive speech-to-speech translation (S2ST) is to
accurately translate spoken content while preserving the speaker identity and
emotional style. However, progress in this field is largely hindered by three
key challenges: the scarcity of paired speech data that retains expressive
styles, the complexity of multi-stage processing pipelines, and the limited
transfer of translation capabilities from large language models (LLMs). In this
work, we address these challenges by introducing UniSS, a novel single-stage
framework for expressive S2ST. Our approach features carefully designed speech
semantic and style modeling, enabling seamless integration with existing
text-based LLM frameworks to develop a unified text-speech language model. To
transfer translation capabilities from text to speech, we propose a cross-modal
chain-of-thought prompting process that progressively aligns audio semantics
with text and ensures style preservation in the decoded results. Furthermore,
we construct and release a large-scale, high-quality expressive S2ST dataset,
UniST, comprising 44.8k hours of data. Experimental results show that UniSS
significantly outperforms previous methods in translation fidelity and speech
quality while preserving voice, emotion, and duration consistency. Our work
establishes a simpler and more effective paradigm for building the next
generation of expressive S2ST systems. Audio samples are available at
https://cmots.github.io/uniss-demo.
Acoustic-based Gender Differentiation in Speech-aware Language Models
Authors: Junhyuk Choi, Jihwan Seol, Nayeon Kim, Chanhee Cho, EunBin Cho, Bugeun Kim
2025-09-25
Speech-aware Language Models (SpeechLMs) have fundamentally transformed
human-AI interaction by enabling voice-based communication, yet they may
exhibit acoustic-based gender differentiation where identical questions lead to
different responses based on the speaker's gender. This paper proposes a new
dataset that enables systematic analysis of this phenomenon, containing 9,208
speech samples across three categories: Gender-Independent,
Gender-Stereotypical, and Gender-Dependent. We further evaluated the LLaMA-Omni
series and discovered a paradoxical pattern: while overall responses seem
identical regardless of gender, the pattern is far from unbiased.
Specifically, in Gender-Stereotypical questions, all models consistently
exhibited male-oriented responses; meanwhile, in Gender-Dependent questions
where gender differentiation would be contextually appropriate, models
exhibited responses independent of gender instead. We also confirm that this
pattern results neither from neutral options nor from the perceived gender of a voice.
When we allow neutral responses, models tend to respond neutrally in
Gender-Dependent questions as well. The paradoxical pattern also persists when we apply
gender neutralization methods to speech. Through comparison between SpeechLMs
and their corresponding backbone LLMs, we confirm that these paradoxical patterns
primarily stem from Whisper speech encoders, which generate male-oriented
acoustic tokens. These findings reveal that current SpeechLMs may not
successfully remove gender biases, even though they prioritize general fairness
principles over contextual appropriateness, highlighting the need for more
sophisticated techniques to utilize gender information properly in speech
technology.
TyphoonMLA A Mixed Naive-Absorb MLA Kernel For Shared Prefix
Authors: Ahmet Caner Yüzügüler, Ahmet Çelik, Jiawei Zhuang, Lukas Cavigelli
2025-09-25
Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in
state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel
formulation, MLA allows two functionally equivalent but computationally
distinct kernel implementations: naive and absorb. While the naive kernels
(e.g., FlashAttention) are typically preferred in training and prefill for
their computational efficiency, existing decoding kernels (e.g., FlashMLA) rely
on the absorb method to minimize HBM bandwidth usage. However, the
compute-bound nature of the absorb implementations prohibits performance
benefits from data reuse opportunities in attention calculations, such as
shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that
combines naive and absorb formulations to harness the strengths of both.
TyphoonMLA effectively leverages the shared prefix by applying the naive
formulation to the compute-bound parts of attention calculations, while
reducing the bandwidth requirements for non-shared parts by using the absorb
formulation. As a result, TyphoonMLA improves the throughput of attention
calculations in MLA architectures by up to 3x and 3.24x on NPU and GPUs, with
only a 3% overhead in HBM size.
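
The functional equivalence of the naive and absorb formulations comes down to associativity of matrix products, which the sketch below checks for one head's attention scores. It assumes a toy up-projection matrix `W_uk` and tiny latent vectors, and it ignores the rotary-embedding part of MLA; it is not the TyphoonMLA kernel.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy sizes: head dimension 2, latent dimension 3 (assumed for illustration).
W_uk = [[1.0, 0.0, 2.0],
        [0.5, -1.0, 0.0]]          # maps a cached latent to a per-head key
q = [2.0, -1.0]                    # one query vector for this head
latents = [[1.0, 2.0, 3.0],
           [0.0, 1.0, -1.0]]       # compressed cache entries, one per token

# Naive: expand every latent into a key, then score against the query.
naive = [dot(q, matvec(W_uk, c)) for c in latents]

# Absorb: fold W_uk into the query once, then score directly against latents.
q_absorbed = matvec(transpose(W_uk), q)
absorb = [dot(q_absorbed, c) for c in latents]

print(naive, absorb)
```

Both paths give the same scores; they differ in where the work and memory traffic land, which is why a hybrid can apply one formulation to the shared prefix and the other to the non-shared part.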
KeyWorld Key Frame Reasoning Enables Effective and Efficient World Models
Authors: Sibo Li, Qianyue Hao, Yu Shang, Yong Li
2025-09-25
Robotic world models are a promising paradigm for forecasting future
environment states, yet their inference speed and the physical plausibility of
generated trajectories remain critical bottlenecks, limiting their real-world
applications. This stems from the redundancy of the prevailing frame-to-frame
generation approach, where the model conducts costly computation on similar
frames, as well as neglecting the semantic importance of key transitions. To
address this inefficiency, we propose KeyWorld, a framework that improves
text-conditioned robotic world models by concentrating transformer computation
on a few semantic key frames while employing a lightweight convolutional model
to fill the intermediate frames. Specifically, KeyWorld first identifies
significant transitions by iteratively simplifying the robot's motion
trajectories, obtaining the ground truth key frames. Then, a DiT model is
trained to reason and generate these physically meaningful key frames from
textual task descriptions. Finally, a lightweight interpolator efficiently
reconstructs the full video by inpainting all intermediate frames. Evaluations
on the LIBERO benchmark demonstrate that KeyWorld achieves a 5.68×
acceleration compared to the frame-to-frame generation baseline, and focusing
on the motion-aware key frames further contributes to the physical validity of
the generated videos, especially on complex tasks. Our approach highlights a
practical path toward deploying world models in real-time robotic control and
other domains requiring both efficient and effective world models. Code is
released at https://anonymous.4open.science/r/Keyworld-E43D.
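
Selecting key frames by iteratively simplifying a motion trajectory can be sketched with the Douglas-Peucker algorithm. The abstract does not name the simplification method, so this is a stand-in; the 2D positions, the tolerance, and the `key_frames` helper are all assumptions.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b (2D)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def key_frames(traj, tol):
    """Indices of frames kept by Douglas-Peucker simplification of `traj`."""
    def simplify(lo, hi):
        worst, worst_idx = 0.0, None
        for i in range(lo + 1, hi):
            d = point_segment_distance(traj[i], traj[lo], traj[hi])
            if d > worst:
                worst, worst_idx = d, i
        if worst_idx is None or worst <= tol:
            return [lo, hi]                     # segment is straight enough
        left = simplify(lo, worst_idx)
        return left[:-1] + simplify(worst_idx, hi)
    return simplify(0, len(traj) - 1)

# Toy end-effector path: move right, then turn and move up.
trajectory = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
keys = key_frames(trajectory, tol=0.1)
print(keys)
```

On this path only the start, the turning point, and the end survive; the frames in between are the ones a lightweight interpolator would be asked to fill.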
Binary Autoencoder for Mechanistic Interpretability of Large Language Models
Authors: Hakaze Cho, Haolin Yang, Brian M. Kurkoski, Naoya Inoue
2025-09-25
Existing works are dedicated to untangling atomized numerical components
(features) from the hidden states of Large Language Models (LLMs) for
interpreting their mechanism. However, they typically rely on autoencoders
constrained by some implicit training-time regularization on single training
instances (e.g., normalization, top-k function, etc.), without an
explicit guarantee of global sparsity among instances, causing a large amount
of dense (simultaneously inactive) features, harming the feature sparsity and
atomization. In this paper, we propose a novel autoencoder variant that
enforces minimal entropy on minibatches of hidden activations, thereby
promoting feature independence and sparsity across instances. For efficient
entropy calculation, we discretize the hidden activations to 1-bit via a step
function and apply gradient estimation to enable backpropagation. We therefore
term it the Binary Autoencoder (BAE) and empirically demonstrate two major
applications: (1) Feature set entropy calculation. Entropy can be reliably
estimated on binary hidden activations, which we empirically evaluate and
leverage to characterize the inference dynamics of
s and In-context
Learning. (2) Feature untangling. Similar to typical methods, BAE can extract
atomized features from
's hidden states. To robustly evaluate such feature
extraction capability, we refine traditional feature-interpretation methods to
avoid unreliable handling of numerical tokens, and show that BAE avoids dense
features while producing the largest number of interpretable ones among
baselines, which confirms the effectiveness of BAE
as a feature
extractor.
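As a rough illustration of why 1-bit activations make entropy easy to estimate, the sketch below binarizes a small minibatch with a step function and computes a per-feature Bernoulli entropy from firing rates. The function names, the zero threshold, and the choice of a marginal per-feature entropy are assumptions made for this example; the abstract does not specify the exact entropy objective or the gradient estimator BAE uses.

```python
import math

def binarize(activations, threshold=0.0):
    """Step function: map each real-valued activation to a single bit.
    The threshold of 0.0 is an assumption for this sketch."""
    return [[1 if a > threshold else 0 for a in row] for row in activations]

def bernoulli_entropy(p):
    """Entropy in bits of a binary variable that fires with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def feature_entropies(binary_batch):
    """Per-feature entropy estimated from firing rates over a minibatch."""
    n = len(binary_batch)
    width = len(binary_batch[0])
    rates = [sum(row[j] for row in binary_batch) / n for j in range(width)]
    return [bernoulli_entropy(r) for r in rates]

# Hypothetical minibatch: 4 instances, 3 hidden features.
batch = [[0.5, -1.0, 2.0], [-0.5, -1.0, 2.0], [0.3, -2.0, 1.0], [-0.1, -3.0, 3.0]]
entropies = feature_entropies(binarize(batch))
```

In this toy batch the first feature fires on half of the instances and carries one bit, while the always-off and always-on features carry none.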
Toward Robust and Efficient ML-Based GPU Caching for Modern Inference
Authors: Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng
2025-09-25
In modern GPU inference, cache efficiency remains a major bottleneck. In
recommendation models, embedding hit rates largely determine throughput, while
in large language models, KV-cache misses substantially increase
time-to-first-token (TTFT). Heuristic policies such as LRU often
struggle under structured access patterns. Learning-based approaches are
promising, but in practice face two major limitations: they degrade sharply
when predictions are inaccurate, or they gain little even with accurate
predictions due to conservative designs. Some also incur high overhead, further
limiting practicality.
We present LCR, a practical framework for learning-based GPU caching
that delivers performance gains while ensuring robustness and efficiency. Its
core algorithm, LARU, enhances LRU with machine-learned
predictions and dynamically adapts to prediction accuracy through online error
estimation. When predictions are accurate, LARU achieves near-optimal
performance. With inaccurate predictions, it degrades gracefully to
near-LRU performance. With LCR, we bridge the gap between
empirical progress and theoretical advances in learning-based caching.
Experiments show that LCR delivers consistent gains under realistic
conditions. In DLRM and LLM scenarios, it improves throughput by up to 24.2%
and reduces P99 TTFT by up to 28.3%, outperforming widely used inference
systems. Even under poor predictions, its performance remains stable,
demonstrating practical robustness.
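The general idea of a prediction-aware cache that falls back to LRU can be sketched as follows. This is not the LARU algorithm, whose details the abstract does not give: the class name, the predicted-next-access input, the exponential moving average of prediction misses, and the thresholds are all hypothetical choices made only to show the "trust predictions when they are accurate, otherwise behave like LRU" pattern.

```python
from collections import OrderedDict

class PredictionAwareLRU:
    """Toy cache: evicts by predicted reuse while predictions look reliable,
    and by recency (plain LRU) once the estimated error rate is too high."""

    def __init__(self, capacity, trust_threshold=0.5, decay=0.8, tolerance=2):
        self.capacity = capacity
        self.trust_threshold = trust_threshold  # max error rate to trust predictions
        self.decay = decay                      # smoothing factor of the error estimate
        self.tolerance = tolerance              # allowed gap between predicted and real reuse
        self.clock = 0
        self.entries = OrderedDict()            # key -> predicted next access time
        self.error_rate = 0.0

    def access(self, key, predicted_next):
        """Record an access; returns True on a hit, False on a miss."""
        self.clock += 1
        hit = key in self.entries
        if hit:
            predicted = self.entries.pop(key)
            missed = abs(predicted - self.clock) > self.tolerance
            # Online error estimation: moving average of prediction misses.
            self.error_rate = (self.decay * self.error_rate
                               + (1.0 - self.decay) * (1.0 if missed else 0.0))
        elif len(self.entries) >= self.capacity:
            self._evict()
        self.entries[key] = predicted_next      # re-inserted as most recently used
        return hit

    def _evict(self):
        if self.error_rate <= self.trust_threshold:
            # Predictions trusted: evict the key predicted to be reused last.
            victim = max(self.entries, key=self.entries.get)
        else:
            # Predictions distrusted: evict the least recently used key.
            victim = next(iter(self.entries))
        del self.entries[victim]
```

With a reliable predictor the eviction rule resembles a furthest-next-use policy; once the error estimate crosses the threshold the cache reduces to ordinary LRU, so a bad predictor cannot make it much worse than the heuristic it replaces.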
MemLens Uncovering Memorization in LLMs with Activation Trajectories
Authors: Zirui He, Haiyan Zhao, Ali Payani, Mengnan Du
2025-09-25
Large language models (LLMs) are commonly evaluated on challenging benchmarks
such as AIME and Math500, which are susceptible to contamination and risk of
being memorized. Existing detection methods, which primarily rely on
surface-level lexical overlap and perplexity, demonstrate low generalization
and degrade significantly when encountering implicitly contaminated data. In
this paper, we propose MemLens (An Activation Lens for Memorization Detection)
to detect memorization by analyzing the probability trajectories of numeric
tokens during generation. Our method reveals that contaminated samples exhibit
"shortcut" behaviors, locking onto an answer with high confidence in the
model's early layers, whereas clean samples show more gradual evidence
accumulation across the model's full depth. We observe that contaminated and
clean samples exhibit distinct and well-separated reasoning trajectories. To
further validate this, we inject carefully designed samples into the model
through LoRA fine-tuning and observe the same trajectory patterns as in
naturally contaminated data. These results provide strong evidence that MemLens
captures genuine signals of memorization rather than spurious correlations.
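One simple way to summarize such a trajectory is the relative depth at which the answer probability locks in. The sketch below is a hypothetical summary statistic, not the MemLens detector: the function name, the 0.9 threshold, and the made-up per-layer probabilities are assumptions, and how those probabilities are read out from intermediate layers is left outside the example.

```python
def lock_in_depth(layer_probs, threshold=0.9):
    """Relative depth (0.0 = first layer, 1.0 = last layer) at which the
    answer-token probability rises above `threshold` and stays there.
    Returns None if the trajectory never locks in."""
    n = len(layer_probs)
    for i in range(n):
        if all(p >= threshold for p in layer_probs[i:]):
            return i / (n - 1) if n > 1 else 0.0
    return None

# Made-up trajectories for illustration only.
early_lock = [0.10, 0.95, 0.97, 0.99, 0.99]   # confident from an early layer
gradual = [0.05, 0.10, 0.30, 0.60, 0.95]      # confidence builds across depth
```

A small value corresponds to the early, high-confidence "shortcut" pattern described for contaminated samples, and a value near 1.0 to gradual evidence accumulation.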
Punching Above Precision Small Quantized Model Distillation with Learnable Regularizer
Authors: Abdur Rehman, S M A Sharif, Md Abdur Rahaman, Mohamed Jismy Aashik Rasool, Seongwan Kim, Jaeho Lee
2025-09-25
Quantization-aware training (QAT) combined with knowledge distillation (KD)
is a promising strategy for compressing Artificial Intelligence (AI) models for
deployment on resource-constrained hardware. However, existing QAT-KD methods
often struggle to balance task-specific (TS) and distillation losses due to
heterogeneous gradient magnitudes, especially under quantization. We
propose Game of Regularizer (GoR), a novel learnable regularization method that
adaptively balances TS and KD objectives using only two trainable parameters
for dynamic loss weighting. GoR reduces conflict between supervision signals,
improves convergence, and boosts the performance of small quantized models
(SQMs). Experiments on image classification, object detection (OD), and large
language model (LLM) compression show that GoR consistently outperforms
state-of-the-art QAT-KD methods. On low-power edge devices, it delivers faster
inference while maintaining full-precision accuracy. We also introduce
QAT-EKD-GoR, an ensemble distillation framework that uses multiple
heterogeneous teacher models. Under optimal conditions, the proposed EKD-GoR
can outperform full-precision models, providing a robust solution for
real-world deployment.
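The abstract does not state GoR's weighting formula. As a stand-in, the sketch below uses a well-known two-parameter scheme (uncertainty-style weighting, where each loss is scaled by exp(-s) and s is added as a penalty) to show how two trainable scalars can rebalance losses of very different magnitude. The function names, learning rate, and fixed loss values are assumptions for illustration.

```python
import math

def weighted_loss(ts_loss, kd_loss, s_ts, s_kd):
    """Combined objective with two trainable scalars s_ts and s_kd.
    Stand-in formula, not GoR's: exp(-s) * loss + s for each term."""
    return (math.exp(-s_ts) * ts_loss + s_ts
            + math.exp(-s_kd) * kd_loss + s_kd)

def weight_grads(ts_loss, kd_loss, s_ts, s_kd):
    """Analytic gradients of weighted_loss with respect to s_ts and s_kd."""
    return (1.0 - math.exp(-s_ts) * ts_loss,
            1.0 - math.exp(-s_kd) * kd_loss)

def fit_weights(ts_loss, kd_loss, steps=2000, lr=0.05):
    """Gradient descent on the two scalars while the losses are held fixed."""
    s_ts = s_kd = 0.0
    for _ in range(steps):
        g_ts, g_kd = weight_grads(ts_loss, kd_loss, s_ts, s_kd)
        s_ts -= lr * g_ts
        s_kd -= lr * g_kd
    return s_ts, s_kd

# Hypothetical loss magnitudes that differ by a factor of 16.
s_ts, s_kd = fit_weights(ts_loss=4.0, kd_loss=0.25)
```

For fixed losses the scalars converge to the log of each loss, so the effective weights exp(-s) shrink the large term and enlarge the small one until both contribute equally.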
SPADE Structured Pruning and Adaptive Distillation for Efficient LLM-TTS
Authors: Tan Dat Nguyen, Jaehun Kim, Ji-Hoon Kim, Shukjae Choi, Youshin Lim, Joon Son Chung
2025-09-25
The goal of this paper is to introduce SPADE, a framework for Structured
Pruning and Adaptive Distillation for Efficient Large Language Model-based
text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability
and zero-shot generalization, but their large parameter counts and high latency
limit real-world deployment. SPADE addresses this by combining (i) a pruning
step guided by a word-error-rate-based layer importance index to remove
non-essential Transformer layers, with (ii) multi-level knowledge distillation
to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves
near-parity perceptual quality while halving Transformer depth, reducing VRAM
usage by up to 20%, and achieving up to 1.7x faster real-time factor with less
than 5% of the original training data. These results show that compact LLM-TTS
models can maintain naturalness and speaker similarity while enabling practical
real-time speech generation. Audio samples are available at
https://mm.kaist.ac.kr/projects/SPADE/.
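A WER-based layer importance index could look roughly like the sketch below: score each layer by how much word error rate rises when that layer is removed, then keep the highest-scoring half. The exact index used by SPADE is not given in the abstract, so the scoring rule, the function names, and the WER numbers here are hypothetical.

```python
def layer_importance(baseline_wer, wer_without_layer):
    """Importance of each layer = WER increase observed when it is removed."""
    return [wer - baseline_wer for wer in wer_without_layer]

def layers_to_keep(importance, keep):
    """Indices of the `keep` most important layers, in original order."""
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:keep])

# Hypothetical measurements for a 4-layer model with baseline WER 0.05.
wer_without = [0.30, 0.06, 0.05, 0.20]
scores = layer_importance(0.05, wer_without)
kept = layers_to_keep(scores, keep=2)
```

Here removing layers 1 and 2 barely changes WER, so halving the depth keeps layers 0 and 3; the distillation stage described above would then be applied to the pruned model.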
Towards Atoms of Large Language Models
Authors: Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
2025-09-25
The fundamental units of internal representations in large language models
(LLMs) remain undefined, limiting further understanding of their mechanisms.
Neurons or features are often regarded as such units, yet neurons suffer from
polysemy, while features face concerns of unreliable reconstruction and
instability. To address this issue, we propose the Atoms Theory, which defines
such units as atoms. We introduce the atomic inner product (AIP) to correct
representation shifting, formally define atoms, and prove the conditions that
atoms satisfy the Restricted Isometry Property (RIP), ensuring stable
representations over the atom set and linking to compressed sensing. Under
stronger conditions, we further establish the uniqueness and exact
recoverability of the sparse representations, and provide guarantees that
single-layer sparse autoencoders (SAEs) with threshold activations can reliably
identify the atoms. To validate the Atoms Theory, we train threshold-activated
SAEs on Gemma2-2B, Gemma2-9B, and Llama3.1-8B, achieving 99.9%
reconstruction across layers on average, and more than 99.8% of atoms satisfy
the uniqueness condition, compared to 0.5% for neurons and 68.2% for features,
showing that atoms more faithfully capture intrinsic representations of LLMs.
Scaling experiments further reveal the link between SAE size and recovery
capacity. Overall, this work systematically introduces and validates Atoms
Theory of LLMs, providing a theoretical framework for understanding internal
representations and a foundation for mechanistic interpretability. Code
available at https://github.com/ChenhuiHu/towards_atoms.
Recidivism and Peer Influence with LLM Text Embeddings in Low Security Correctional Facilities
Authors: Shanjukta Nath, Jiwon Hong, Jae Ho Chang, Keith Warren, Subhadeep Paul
2025-09-25
We find AI embeddings obtained using a pre-trained transformer-based Large
Language Model (LLM) of 80,000-120,000 written affirmations and correction
exchanges among residents in low-security correctional facilities to be highly
predictive of recidivism. The prediction accuracy is 30% higher with embedding
vectors than with only pre-entry covariates. However, since the text embedding
vectors are high-dimensional, we perform Zero-Shot classification of these
texts to a low-dimensional vector of user-defined classes to aid interpretation
while retaining the predictive power. To shed light on the social dynamics
inside the correctional facilities, we estimate peer effects in these
LLM-generated numerical representations of language with a multivariate peer
effect model, adjusting for network endogeneity. We develop new methodology and
theory for peer effect estimation that accommodate sparse networks,
multivariate latent variables, and correlated multivariate outcomes. With these
new methods, we find significant peer effects in language usage for interaction
and feedback.
Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning
Authors: Hanjiang Hu, Changliu Liu, Na Li, Yebin Wang
2025-09-24
Large Language Models (LLMs) have demonstrated remarkable capabilities in
knowledge acquisition, reasoning, and tool use, making them promising
candidates for autonomous agent applications. However, training LLM agents for
complex multi-turn task planning faces significant challenges, including
episode-wise rewards, credit assignment across long horizons, and the
computational overhead of reinforcement learning in multi-turn interaction
settings. To this end, this paper introduces a novel approach that transforms
multi-turn task planning into single-turn task reasoning problems, enabling
efficient policy optimization through Group Relative Policy Optimization (GRPO)
with dense and verifiable reward from expert trajectories. Our theoretical
analysis shows that GRPO improvement on single-turn task reasoning results in
higher multi-turn success probability under the minimal number of turns, as
well as generalization to subtasks with shorter horizons. Experimental evaluation on
the complex task planning benchmark demonstrates that our 1.5B parameter model
trained with single-turn GRPO achieves superior performance compared to larger
baseline models up to 14B parameters, with success rates of 70% for
long-horizon planning tasks with over 30 steps. We also theoretically and
empirically validate strong cross-task generalizability: models trained on
complex tasks can successfully complete all simpler subtasks.
CHOIR A Chatbot-mediated Organizational Memory Leveraging Communication in University Research Labs
Authors: Sangwook Lee, Adnan Abbas, Yan Chen, Young-Ho Kim, Sang Won Lee
2025-09-24
University research labs often rely on chat-based platforms for communication
and project management, where valuable knowledge surfaces but is easily lost in
message streams. Documentation can preserve knowledge, but it requires ongoing
maintenance and is challenging to navigate. Drawing on formative interviews
that revealed organizational memory challenges in labs, we designed CHOIR, an
LLM-based chatbot that supports organizational memory through four key
functions: document-grounded Q&A, Q&A sharing for follow-up discussion,
knowledge extraction from conversations, and AI-assisted document updates. We
deployed CHOIR in four research labs for one month (n=21), where the lab
members asked 107 questions and lab directors updated documents 38 times in the
organizational memory. Our findings reveal a privacy-awareness tension:
questions were asked privately, limiting directors' visibility into
documentation gaps. Students often avoided contribution due to challenges in
generalizing personal experiences into universal documentation. We contribute
design implications for privacy-preserving awareness and supporting
context-specific knowledge documentation.
MARS toward more efficient multi-agent collaboration for LLM reasoning
Authors: Xiao Wang, Jia Wang, Yijie Wang, Pengtao Dang, Sha Cao, Chi Zhang
2025-09-24
Large language models (LLMs) have achieved impressive results in natural
language understanding, yet their reasoning capabilities remain limited when
operating as single agents. Multi-Agent Debate (MAD) has been proposed to
address this limitation by enabling collaborative reasoning among multiple
models in a round-table debate manner. While effective, MAD introduces
substantial computational overhead due to the number of agents involved and the
frequent communication required. In this paper, we propose MARS (Multi-Agent
Review System), a role-based collaboration framework inspired by the review
process. In MARS, an author agent generates an initial solution, reviewer
agents provide decisions and comments independently, and a meta-reviewer
integrates the feedback to make the final decision and guide further revision.
This design enhances reasoning quality while avoiding costly
reviewer-to-reviewer interactions, thereby controlling token consumption and
inference time. We compared MARS with both MAD and other state-of-the-art
reasoning strategies across multiple benchmarks. Extensive experiments with
different LLMs show that MARS matches the accuracy of MAD while reducing both
token usage and inference time by approximately 50%. Code is available at
https://github.com/xwang97/MARS.
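The author, reviewer, and meta-reviewer roles described above can be sketched as a small control loop. This is only an outline under assumptions: the function name, the callable signatures, and the round limit are hypothetical, and in a real system each callable would wrap an LLM call rather than the toy functions shown in the usage lines.

```python
def run_mars(question, author, reviewers, meta_reviewer, max_rounds=3):
    """Role-based review loop.

    author(question, guidance) -> solution (guidance is None on the first draft)
    reviewer(question, solution) -> review, produced independently of other reviewers
    meta_reviewer(question, solution, reviews) -> (accepted, guidance)
    """
    solution = author(question, None)
    for _ in range(max_rounds):
        # Reviewers never see each other's output, so there is no
        # reviewer-to-reviewer exchange to pay for.
        reviews = [review(question, solution) for review in reviewers]
        accepted, guidance = meta_reviewer(question, solution, reviews)
        if accepted:
            break
        solution = author(question, guidance)
    return solution

# Toy stand-ins for the three roles (illustration only).
def toy_author(question, guidance):
    return "4" if guidance else "5"

def toy_reviewer(question, solution):
    return ("accept", "") if solution == "4" else ("reject", "recheck the sum")

def toy_meta(question, solution, reviews):
    rejects = [comment for decision, comment in reviews if decision == "reject"]
    return len(rejects) * 2 < len(reviews), "; ".join(rejects)

answer = run_mars("2+2?", toy_author, [toy_reviewer] * 3, toy_meta)
```

In this sketch each round costs one author call, one call per reviewer, and one meta-reviewer call, so the number of calls grows linearly with the number of reviewers.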
Shared Neural Space Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision
Authors: Jing Li, Oskar Bartosz, Chengyu Wang, Michal Wnuczynski, Dilshan Godaliyadda, Michael Polley
2025-09-24
The majority of AI models in imaging and vision are customized to perform a
specific high-precision task. However, this strategy is inefficient for
applications with a series of modular tasks, since each requires a mapping into
a disparate latent domain. To address this inefficiency, we propose a
universal Neural Space (NS), where an encoder-decoder framework pre-computes
features across vision and imaging tasks. Our encoder learns
transformation-aware, generalizable representations, which enable multiple
downstream AI modules to share the same feature space. This architecture
reduces redundancy, improves generalization across domain shift, and
establishes a foundation for efficient multi-task vision pipelines.
Furthermore, as opposed to larger backbones, our backbone is lightweight and
CNN-based, allowing for wider deployment across hardware. We further
demonstrate that imaging and vision modules, such as demosaicing, denoising,
depth estimation and semantic segmentation can be performed efficiently in
the NS.
Seedream 4.0 Toward Next-generation Multimodal Image Generation
Authors: Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu
2025-09-24
We introduce Seedream 4.0, an efficient and high-performance multimodal image
generation system that unifies text-to-image (T2I) synthesis, image editing,
and multi-image composition within a single framework. We develop a highly
efficient diffusion transformer with a powerful VAE which also can reduce the
number of image tokens considerably. This allows for efficient training of our
model, and enables it to quickly generate native high-resolution images (e.g.,
1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning
diverse taxonomies and knowledge-centric concepts. Comprehensive data
collection across hundreds of vertical scenarios, coupled with optimized
strategies, ensures stable and large-scale training, with strong
generalization. By incorporating a carefully fine-tuned VLM model, we perform
multi-modal post-training for training both T2I and image editing tasks
jointly. For inference acceleration, we integrate adversarial distillation,
distribution matching, and quantization, as well as speculative decoding. It
achieves an inference time of up to 1.8 seconds for generating a 2K image
(without an LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream
4.0 can achieve state-of-the-art results on both T2I and multimodal image
editing. In particular, it demonstrates exceptional multimodal capabilities in
complex tasks, including precise image editing and in-context reasoning, and
also allows for multi-image reference, and can generate multiple output images.
This extends traditional T2I systems into a more interactive and
multidimensional creative tool, pushing the boundary of generative AI for both
creativity and professional applications. Seedream 4.0 is now accessible on
https://www.volcengine.com/experience/ark?launch=seedream.
Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing
Authors: Xinnan Dai, Chung-Hsiang Lo, Kai Guo, Shenglai Zeng, Dongsheng Luo, Jiliang Tang
2025-09-24
Transformer-based LLMs demonstrate strong performance on graph reasoning
tasks, yet their internal mechanisms remain underexplored. To uncover these
reasoning process mechanisms in a fundamental and unified view, we set up
basic decoder-only transformers and explain them using the circuit-tracer
framework. Through this lens, we visualize reasoning traces and identify two
core mechanisms in graph reasoning: token merging and structural memorization,
which underlie both path reasoning and substructure extraction tasks. We
further quantify these behaviors and analyze how they are influenced by graph
density and model size. Our study provides a unified interpretability framework
for understanding structural reasoning in decoder-only Transformers.
SIM-CoT Supervised Implicit Chain-of-Thought
Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin
2025-09-24
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative
to explicit CoT reasoning in Large Language Models (LLMs), but a persistent
performance gap has limited their adoption. We identify a core latent
instability issue when scaling the computational budget of implicit CoT: as the
number of reasoning tokens increases, training often becomes unstable and
collapses. Our analysis shows that this instability arises from latent
representations becoming homogeneous and losing semantic diversity, caused by
insufficient step-level supervision in current implicit CoT methods. To address
this, we propose SIM-CoT, a plug-and-play training module that introduces
step-level supervision to stabilize and enrich the latent reasoning space.
SIM-CoT employs an auxiliary decoder during training to align each implicit
token with its corresponding explicit reasoning step, ensuring latent states
capture distinct and meaningful information. The auxiliary decoder is removed
at inference, preserving the efficiency of implicit CoT with no added overhead.
It also provides interpretability by projecting each latent token onto an
explicit reasoning vocabulary, enabling per-step visualization and diagnosis.
SIM-CoT significantly improves both in-domain accuracy and out-of-domain
stability of implicit CoT methods, boosting Coconut by +8.2% on GPT-2 and CODI
by +3.0% on LLaMA-3.1 8B. It further surpasses the explicit CoT baseline on
GPT-2 by 2.1% with 2.3x greater token efficiency, while closing the
performance gap on larger models like LLaMA-3.1 8B. Code:
https://github.com/InternLM/SIM-CoT
Multimodal Representation-disentangled Information Bottleneck for Multimodal Recommendation
Authors: Hui Wang, Jinghui Qin, Wushao Wen, Qingling Li, Shanshan Zhong, Zhongzhan Huang
2025-09-24
Multimodal data has significantly advanced recommendation systems by
integrating diverse information sources to model user preferences and item
characteristics. However, these systems often struggle with redundant and
irrelevant information, which can degrade performance. Most existing methods
either fuse multimodal information directly or use rigid architectural
separation for disentanglement, failing to adequately filter noise and model
the complex interplay between modalities. To address these challenges, we
propose a novel framework, the Multimodal Representation-disentangled
Information Bottleneck (MRdIB). Concretely, we first employ a Multimodal
Information Bottleneck to compress the input representations, effectively
filtering out task-irrelevant noise while preserving rich semantic information.
Then, we decompose the information based on its relationship with the
recommendation target into unique, redundant, and synergistic components. We
achieve this decomposition with a series of constraints: a unique information
learning objective to preserve modality-unique signals, a redundant information
learning objective to minimize overlap, and a synergistic information learning
objective to capture emergent information. By optimizing these objectives,
MRdIB guides a model to learn more powerful and disentangled representations.
Extensive experiments on several competitive models and three benchmark
datasets demonstrate the effectiveness and versatility of our MRdIB in
enhancing multimodal recommendation.
Q-Palette Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
Authors: Deokjae Lee, Hyun Oh Song
2025-09-24
We study weight-only post-training quantization (PTQ), which quantizes the
weights of a large language model (LLM) without retraining, using little or no
calibration data. Weight-only PTQ is crucial for reducing the memory footprint
and latency of LLM inference, especially in memory-bound, small-batch inference
scenarios, such as personalized inference on edge devices. Despite its
importance, irregular weight distributions with heavy-tailed outliers in LLMs
complicate quantization, recently motivating rotation-based methods that
transform weights into near-Gaussian distributions, which are more regular with
fewer outliers, thereby reducing quantization error. In this work, we first
derive the information-theoretically optimal bit allocation for Gaussianized
weights under given bit budgets, revealing that fine-grained fractional-bit
quantizers approaching the Gaussian distortion-rate bound are essential to
achieve near-optimal quantization performance. To bridge this theoretical
insight and practical implementation, we introduce Q-Palette, a versatile
collection of fractional-bit quantizers that range from trellis-coded
quantizers offering near-optimal distortion to simpler vector and scalar
quantizers optimized for faster inference, all efficiently implemented with
optimized CUDA kernels across various bitwidths. Furthermore, leveraging
Q-Palette as a foundational component, we propose a novel mixed-scheme
quantization framework, jointly optimizing quantizer choices and layer fusion
decisions given resource constraints. The code is available at
https://github.com/snu-mllab/Q-Palette.
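To see why fractional bit widths arise, consider the textbook high-rate allocation for independent Gaussian sources, where a source with variance v quantized at R bits has distortion v * 2^(-2R). The sketch below is that classical result only, not Q-Palette's allocation procedure: it ignores the non-negativity constraint on bits and any per-layer sensitivity terms, and the variances are made up.

```python
import math

def gaussian_distortion(variance, bits):
    """Gaussian distortion-rate bound: D(R) = variance * 2^(-2R)."""
    return variance * 2.0 ** (-2.0 * bits)

def allocate_bits(variances, avg_bits):
    """High-rate optimal allocation under an average-bit budget: each group
    gets avg_bits plus half the log2 ratio of its variance to the geometric
    mean variance, which equalizes distortion across groups."""
    log_geo_mean = sum(math.log2(v) for v in variances) / len(variances)
    return [avg_bits + 0.5 * (math.log2(v) - log_geo_mean) for v in variances]

# Hypothetical weight-group variances.
bits_a = allocate_bits([4.0, 1.0, 0.25], avg_bits=3.0)
bits_b = allocate_bits([2.0, 1.0], avg_bits=3.0)
```

The first example happens to land on whole numbers of bits, while the second asks for 3.25 and 2.75 bits, which integer-only quantizers cannot provide.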
From Text to Talk Audio-Language Model Needs Non-Autoregressive Joint Training
Authors: Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo, Zitao Liu
2025-09-24
Recent advances in large language models (LLMs) have attracted significant
interest in extending their capabilities to multimodal scenarios, particularly
for speech-to-speech conversational systems. However, existing multimodal
models handling interleaved audio and text rely on autoregressive methods,
overlooking that text depends on target-target relations whereas audio depends
mainly on source-target relations. In this work, we propose Text-to-Talk (TtT),
a unified audio-text framework that integrates autoregressive (AR) text
generation with non-autoregressive (NAR) audio diffusion in a single
Transformer. By leveraging the any-order autoregressive property of absorbing
discrete diffusion, our approach provides a unified training objective for text
and audio. To support this hybrid generation paradigm, we design a
modality-aware attention mechanism that enforces causal decoding for text while
allowing bidirectional modeling within audio spans, and further introduce three
training strategies that reduce train-test discrepancies. During inference, TtT
employs block-wise diffusion to synthesize audio in parallel while flexibly
handling variable-length outputs. Extensive experiments across Audio-QA and ASR
tasks demonstrate the effectiveness of our approach, with detailed ablation
studies validating each proposed component. We will open-source our models,
data and code to facilitate future research in this direction.
Projective Kolmogorov Arnold Neural Networks (P-KANs) Entropy-Driven Functional Space Discovery for Interpretable Machine Learning
Authors: Alastair Poole, Stig McArthur, Saravan Kumar
2025-09-24
Kolmogorov-Arnold Networks (KANs) relocate learnable nonlinearities from
nodes to edges, demonstrating remarkable capabilities in scientific machine
learning and interpretable modeling. However, current KAN implementations
suffer from fundamental inefficiencies due to redundancy in high-dimensional
spline parameter spaces, where numerous distinct parameterisations yield
functionally equivalent behaviors. This redundancy manifests as a "nuisance
space" in the model's Jacobian, leading to susceptibility to overfitting and
poor generalization. We introduce Projective Kolmogorov-Arnold Networks
(P-KANs), a novel training framework that guides edge function discovery
towards interpretable functional representations through entropy-minimisation
techniques from signal analysis and dictionary learning. Rather than
constraining functions to predetermined spaces, our approach maintains spline
space flexibility while introducing "gravitational" terms that encourage
convergence towards optimal functional representations. Our key insight
recognizes that optimal representations can be identified through entropy
analysis of projection coefficients, compressing edge functions to
lower-parameter projective spaces (Fourier, Chebyshev, Bessel). P-KANs
demonstrate superior performance across multiple domains, achieving up to 80%
parameter reduction while maintaining representational capacity, significantly
improved robustness to noise compared to standard KANs, and successful
application to industrial automated fiber placement prediction. Our approach
enables automatic discovery of mixed functional representations where different
edges converge to different optimal spaces, providing both compression benefits
and enhanced interpretability for scientific machine learning applications.
Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks
Authors: Vani Kanjirangat, Tanja Samardžić, Ljiljana Dolamic, Fabio Rinaldi
2025-09-24
Dialectal data are characterized by linguistic variation that appears small
to humans but has a significant impact on the performance of models. This
dialect gap has been related to various factors (e.g., data size, economic and
social factors) whose impact, however, turns out to be inconsistent. In this
work, we investigate factors impacting the model performance more directly: we
correlate Tokenization Parity (TP) and Information Parity (IP), as measures of
representational biases in pre-trained multilingual models, with the downstream
performance. We compare state-of-the-art decoder-only LLMs with encoder-based
models across three tasks: dialect classification, topic classification, and
extractive question answering, controlling for varying scripts (Latin vs.
non-Latin) and resource availability (high vs. low). Our analysis reveals that
TP is a better predictor of the performance on tasks reliant on syntactic and
morphological cues (e.g., extractive QA), while IP better predicts performance
in semantic tasks (e.g., topic classification). Complementary analyses,
including tokenizer behavior, vocabulary coverage, and qualitative insights,
reveal that the language support claims of LLMs may often mask deeper
mismatches at the script or token level.
MeshMosaic Scaling Artist Mesh Generation via Local-to-Global Assembly
Authors: Rui Xu, Tianyang Xue, Qiujie Dong, Le Wan, Zhe Zhu, Peng Li, Zhiyang Dou, Cheng Lin, Shiqing Xin, Yuan Liu, Wenping Wang, Taku Komura
2025-09-24
Scaling artist-designed meshes to high triangle numbers remains challenging
for autoregressive generative models. Existing quantization-based methods suffer
from long-sequence bottlenecks and limited quantization resolution, primarily
due to the large number of tokens required and constrained quantization
granularity. These issues prevent faithful reproduction of fine geometric
details and structured density patterns. We introduce MeshMosaic, a novel
local-to-global framework for artist mesh generation that scales to over 100K
triangles--substantially surpassing prior methods, which typically handle only
around 8K faces. MeshMosaic first segments shapes into patches, generating each
patch autoregressively and leveraging shared boundary conditions to promote
coherence, symmetry, and seamless connectivity between neighboring regions.
This strategy enhances scalability to high-resolution meshes by quantizing
patches individually, resulting in more symmetrical and organized mesh density
and structure. Extensive experiments across multiple public datasets
demonstrate that MeshMosaic significantly outperforms state-of-the-art methods
in both geometric fidelity and user preference, supporting superior detail
representation and practical mesh generation for real-world applications.
RAD Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis
Authors: Haolin Li, Tianjie Dai, Zhe Chen, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang
2025-09-24
Clinical diagnosis is a highly specialized discipline requiring both domain
expertise and strict adherence to rigorous guidelines. While current AI-driven
medical research predominantly focuses on knowledge graphs or natural text
pretraining paradigms to incorporate medical knowledge, these approaches
primarily rely on implicitly encoded knowledge within model parameters,
neglecting task-specific knowledge required by diverse downstream tasks. To
address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a
novel framework that explicitly injects external knowledge into multimodal
models directly on downstream tasks. Specifically, RAD operates through three
key mechanisms: retrieval and refinement of disease-centered knowledge from
multiple medical sources, a guideline-enhanced contrastive loss that constrains
the latent distance between multi-modal features and guideline knowledge, and
the dual decoder that employs guidelines as queries to steer
cross-modal fusion, aligning the models with clinical diagnostic workflows from
guideline acquisition to feature extraction and decision-making. Moreover,
recognizing the lack of quantitative evaluation of interpretability for
multimodal diagnostic models, we introduce a set of criteria to assess the
interpretability from both image and text perspectives. Extensive evaluations
across four datasets with different anatomies demonstrate RAD's
generalizability, achieving state-of-the-art performance. Furthermore, RAD
enables the model to concentrate more precisely on abnormal regions and
critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code
is available at https://github.com/tdlhl/RAD.
FastEagle Cascaded Drafting for Accelerating Speculative Decoding
Authors: Haiduo Huang, Jiangcheng Song, Wenzhe Zhao, Pengju Ren
2025-09-24
Speculative decoding accelerates generation by drafting candidates and
verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still
require N sequential passes to propose N tokens. We present FastEagle, a
non-autoregressive cascaded drafter that emits an entire draft in a single
forward pass. FastEagle replaces temporal steps with a lightweight layer
cascade and trains with layer-wise supervision to mitigate error accumulation.
Coupled with a constrained draft tree that preserves lossless verification
cost, FastEagle delivers substantial wall-clock speedups over strong
autoregressive drafters while maintaining competitive acceptance behavior.
Across multiple LLMs (Vicuna-13B, LLaMA-Instruct 3.x, and
DeepSeek-R1-Distill-LLaMA) and tasks (MT-Bench, HumanEval, GSM8K, CNN/DM,
Alpaca), FastEagle consistently outperforms EAGLE-3 in speedup under both
greedy and stochastic decoding, with comparable average acceptance lengths.
These results indicate that removing sequential dependencies in drafting is a
practical path toward lossless LLM inference acceleration.
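As a rough illustration of the draft-then-verify loop this abstract builds on, the sketch below shows greedy verification only. It is not FastEagle's cascaded drafter or its constrained draft tree. The names `verify_greedy` and `target_next` are hypothetical, and a real system would score all draft positions in one parallel target pass instead of looping.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Accept the longest draft prefix that matches the target's greedy choice.

    Always returns at least one token: either the target's correction at the
    first mismatch, or a bonus token after a fully accepted draft.
    """
    context = list(prefix)
    accepted: List[int] = []
    for token in draft:
        expected = target_next(context)  # target's greedy token at this position
        if token != expected:
            accepted.append(expected)  # replace the rejected draft token
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # whole draft accepted: bonus token
    return accepted


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for a target model: predicts (last token + 1) mod 10."""
    return (context[-1] + 1) % 10
```

Because every emitted token equals what the target would have chosen greedily, the output is identical to plain greedy decoding with the target; the drafter only affects how many tokens are confirmed per verification step.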
Exploration with Foundation Models Capabilities, Limitations, and Hybrid Approaches
Authors: Remo Sasso, Michelangelo Conserva, Dominik Jeurissen, Paulo Rauber
2025-09-24
Exploration in reinforcement learning (RL) remains challenging, particularly
in sparse-reward settings. While foundation models possess strong semantic
priors, their capabilities as zero-shot exploration agents in classic RL
benchmarks are not well understood. We benchmark LLMs and VLMs on multi-armed
bandits, Gridworlds, and sparse-reward Atari to test zero-shot exploration. Our
investigation reveals a key limitation: while VLMs can infer high-level
objectives from visual input, they consistently fail at precise low-level
control, a "knowing-doing gap". To analyze a potential bridge for this gap,
we investigate a simple on-policy hybrid framework in a controlled, best-case
scenario. Our results in this idealized setting show that VLM guidance can
significantly improve early-stage sample efficiency, providing a clear analysis
of the potential and constraints of using foundation models to guide
exploration rather than for end-to-end control.
Future Policy Aware Preference Learning for Mathematical Reasoning
Authors: Minjae Oh, Yunho Choi, Dongmin Choi, Yohan Jo
2025-09-24
Preference learning methods such as Direct Preference Optimization (DPO) have
become standard for Large Language Model (LLM) post-training, yet they are
often ineffective for mathematical reasoning. A key challenge is the large
token overlap between preferred and dispreferred trajectories; lowering the
probability of dispreferred trajectories also reduces the probability of shared
useful tokens, leading to over-penalization and overall performance collapse.
As a mitigation, existing algorithms include the probability of a trajectory
under the current policy as a regularization term, which decreases the effect
of the gradient when the probability is low. However, by the time this effect
takes hold, useful tokens may have already been over-penalized as the model has
begun to degrade. To address this, we propose Future Policy Aware (FPA)
preference learning, which replaces the current policy with a future policy in
the regularization term. This future policy is estimated via lightweight,
logit-space extrapolation from a reference model toward the current model. FPA
enables safer training by preemptively regularizing potentially problematic
gradients. We apply FPA to DPO, RPO, and SimPER and evaluate them on the MATH
and GSM8K benchmarks. FPA yields consistent performance gains, with the largest
improvements observed with SimPER, achieving gains of up to 5.75%. We
demonstrate that FPA provides proactive regularization while preserving the
probability of shared, useful mathematical tokens, and enables longer,
degradation-free training with negligible computational overhead. We will
release our code publicly upon publication.
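The logit-space extrapolation mentioned in this abstract can be pictured with a small sketch. The linear form `current + lam * (current - reference)` and the step size `lam` are assumptions made for illustration; the abstract does not give the exact formula, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def future_policy_logits(
    ref_logits: Sequence[float],
    cur_logits: Sequence[float],
    lam: float = 1.0,
) -> List[float]:
    """Extrapolate from the reference logits through the current logits.

    lam = 0 returns the current policy; larger lam looks further along the
    direction the policy has already moved (assumed linear extrapolation).
    """
    return [c + lam * (c - r) for r, c in zip(ref_logits, cur_logits)]
```

In this toy setting, a token whose logit has already dropped relative to the reference gets an even lower probability under the extrapolated policy, so a regularizer that depends on that probability would respond earlier than one based on the current policy.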
Structuring Collective Action with LLM-Guided Evolution From Ill-Structured Problems to Executable Heuristics
Authors: Kevin Bradley Dsouza, Graham Alexander Watt, Yuri Leonenko, Juan Moreno-Cruz
2025-09-24
Collective action problems, which require aligning individual incentives with
collective goals, are classic examples of Ill-Structured Problems (ISPs). For
an individual agent, the causal links between local actions and global outcomes
are unclear, stakeholder objectives often conflict, and no single, clear
algorithm can bridge micro-level choices with macro-level welfare. We present
ECHO-MIMIC, a computational framework that converts this global complexity into
a tractable, Well-Structured Problem (WSP) for each agent by discovering
compact, executable heuristics and persuasive rationales. The framework
operates in two stages: ECHO (Evolutionary Crafting of Heuristics from
Outcomes) evolves snippets of Python code that encode candidate behavioral
policies, while MIMIC (Mechanism Inference & Messaging for
Individual-to-Collective Alignment) evolves companion natural language messages
that motivate agents to adopt those policies. Both phases employ a
large-language-model-driven evolutionary search: the LLM proposes diverse and
context-aware code or text variants, while population-level selection retains
those that maximize collective performance in a simulated environment. We
demonstrate this framework on a canonical ISP in agricultural landscape
management, where local farming decisions impact global ecological
connectivity. Results show that ECHO-MIMIC discovers high-performing heuristics
compared to baselines and crafts tailored messages that successfully align
simulated farmer behavior with landscape-level ecological goals. By coupling
algorithmic rule discovery with tailored communication, ECHO-MIMIC transforms
the cognitive burden of collective action into a simple set of agent-level
instructions, making previously ill-structured problems solvable in practice
and opening a new path toward scalable, adaptive policy design.
CollaPipe Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks
Authors: Jiewei Chen, Xiumei Deng, Zehui Xiong, Shaoyong Guo, Xuesong Qiu, Ping Wang, Dusit Niyato
2025-09-24
The increasing demand for intelligent mobile applications has made
multi-agent collaboration with Transformer-based large language models (LLMs)
essential in mobile edge computing (MEC) networks. However, training LLMs in
such environments remains challenging due to heavy computation, high end-to-end
latency, and limited model generalization. We introduce CollaPipe, a hybrid
distributed learning framework that integrates collaborative pipeline
parallelism with federated aggregation to support self-evolving intelligent
networks. In CollaPipe, the encoder part is adaptively partitioned into
variable-sized segments and deployed across mobile devices for
pipeline-parallel training, while the decoder is deployed on edge servers to
handle generative tasks. The global model is then updated via federated
aggregation. To enhance training efficiency, we formulate a joint optimization
problem that adaptively allocates model segments, micro-batches, bandwidth, and
transmission power. We derive and use a closed-form convergence bound to design
a Dynamic Segment Scheduling and Resource Allocation (DSSDA) algorithm based
on Lyapunov optimization, ensuring system stability under long-term
constraints. Extensive experiments on downstream tasks with Transformer and
BERT models show that CollaPipe improves computation efficiency by up to
15.09%, reduces end-to-end latency by at least 48.98%, and cuts single device
memory usage by more than half, enabling online learning in heterogeneous and
dynamic communication environments.
BurstEngine an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens
Authors: Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun
2025-09-24
Existing methods for training LLMs on long-sequence data, such as Tensor
Parallelism and Context Parallelism, exhibit low Model FLOPs Utilization as
sequence lengths and number of GPUs increase, especially when sequence lengths
exceed 1M tokens. To address these challenges, we propose BurstEngine, an
efficient framework designed to train LLMs on long-sequence data. BurstEngine
introduces BurstAttention, an optimized distributed attention with lower
communication cost than RingAttention. BurstAttention leverages topology-aware
ring communication to fully utilize network bandwidth and incorporates
fine-grained communication-computation overlap. Furthermore, BurstEngine
introduces sequence-level selective checkpointing and fuses the language
modeling head with the loss function to reduce memory cost. Additionally,
BurstEngine introduces workload balance optimization for various types of
attention masking. By integrating these optimizations, BurstEngine achieves a
speedup with much lower memory overhead than the state-of-the-art
baselines when training LLMs on extremely long sequences of over 1M tokens. We
have made our code publicly available on GitHub:
https://github.com/thunlp/BurstEngine.
MMedFD A Real-world Healthcare Benchmark for Multi-turn Full-Duplex Automatic Speech Recognition
Authors: Hongzhao Chen, XiaoYang Wang, Jing Lan, Hexiao Ding, Yufeng Jiang, MingHui Yang, DanHui Xu, Jun Luo, Nga-Chun Ng, Gerald W. Y. Cheng, Yunlin Mao, Jung Sun Yoo
2025-09-24
Automatic speech recognition (ASR) in clinical dialogue demands robustness to
full-duplex interaction, speaker overlap, and low-latency constraints, yet open
benchmarks remain scarce. We present MMedFD, the first real-world Chinese
healthcare ASR corpus designed for multi-turn, full-duplex settings. Captured
from a deployed AI assistant, the dataset comprises 5,805 annotated sessions
with synchronized user and mixed-channel views, RTTM/CTM timing, and role
labels. We introduce a model-agnostic pipeline for streaming segmentation,
speaker attribution, and dialogue memory, and fine-tune Whisper-small on
role-concatenated audio for long-context recognition. ASR evaluation includes
WER, CER, and HC-WER, which measures concept-level accuracy across healthcare
settings. LLM-generated responses are assessed using rubric-based and pairwise
protocols. MMedFD establishes a reproducible framework for benchmarking
streaming ASR and end-to-end duplex agents in healthcare deployment. The
dataset and related resources are publicly available at
https://github.com/Kinetics-JOJO/MMedFD
Gyges Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference
Authors: Haoyu Chen, Xue Li, Kun Qian, Yu Guan, Jin Zhao, Xin Wang
2025-09-24
Efficiently processing the dynamics of requests, especially the context
length variance, is important in Large Language Model (LLM) serving scenarios.
However, there is an intrinsic trade-off: while leveraging parallelism
strategies, such as Tensor Parallelism (TP), can coordinate multiple GPUs to
accommodate larger context lengths, it inevitably results in degraded overall
throughput. In this paper, we propose Cross-Instance Parallelism Transformation
(Gyges), which adaptively adjusts the parallelism strategies of running
instances to align with the dynamics of incoming requests. We design (1) a
page-friendly, header-centric layout to accelerate KV cache transformations;
(2) dedicated weight padding to accelerate model weight transformations; and
(3) a transformation-aware scheduler to cooperatively schedule requests and
parallelism transformations, optimizing the overall performance. Evaluations
using real-world traces show that Gyges improves throughput by 1.75x-6.57x
compared to state-of-the-art solutions.
Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling
Authors: Youpeng Zhao, Jinpeng LV, Di Wu, Jun Wang, Christopher Gooley
2025-09-23
Test-time scaling (TTS) has recently emerged as a promising direction to
exploit the hidden reasoning capabilities of pre-trained large language models
(LLMs). However, existing scaling methods narrowly focus on the compute-optimal
Pareto-frontier, ignoring the simple fact that compute-optimal is not always
system-optimal. In this work, we propose a system-driven perspective on TTS,
analyzing how reasoning models scale against practical metrics, such as latency
and cost-per-token. By evaluating the impact of popular optimizations such as
tensor parallelism and speculative decoding, our preliminary analysis reveals
the limitations of current methods and calls for a paradigm shift toward
holistic, system-aware evaluations that capture the true essence of scaling
laws at inference time.
Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation
Authors: Roy Fejgin, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Ryan Langman, Jaehyeon Kim, Subhankar Ghosh, Shehzeen Hussain, Jason Li
2025-09-23
Speech generation models based on large language models (LLMs) typically
operate on discrete acoustic codes, which differ fundamentally from text tokens
due to their multicodebook structure. At each timestep, models must predict N
codebook entries jointly, introducing dependencies that challenge simple
parallel prediction approaches. Parallel prediction assumes independence among
codebooks, yielding efficient decoding but often at the cost of reduced
fidelity. To address this, hierarchical strategies employ a local transformer
(LT) to refine predictions and capture intra-timestep dependencies. In this
work, we systematically investigate two LT architectures: an autoregressive
transformer that generates codebooks sequentially, and a MaskGIT-based
transformer that performs iterative masked prediction. Both designs further
enable frame stacking, where the primary transformer predicts multiple frames
jointly, and the LT decodes their codebooks, offering improvements in speed
without compromising perceptual quality. Through extensive analysis, we
characterize the tradeoffs between parallel and iterative sampling strategies
across different throughput and quality regimes. Finally, we propose practical
guidelines for selecting decoding strategies based on deployment priorities
such as computational efficiency and synthesis fidelity.
Transformer Modeling for Both Scalability and Performance in Multivariate Time Series
Authors: Hunjae Lee, Corey Clark
2025-09-23
Variable count is among the main scalability bottlenecks for Transformer
modeling in multivariate time series (MTS) data. On top of this, a growing
consensus in the field points to indiscriminate inter-variable mixing as a
potential source of noise-accumulation and performance degradation. This is
likely exacerbated by sparsity of informative signals characteristic of many
MTS systems coupled with representational misalignment stemming from
indiscriminate information mixing between (heterogeneous) variables. While
scalability and performance are often seen as competing interests in
Transformer design, we show that both can be improved simultaneously in MTS by
strategically constraining the representational capacity of inter-variable
mixing. Our proposed method, Transformer with Delegate Token Attention
(DELTAformer), constrains inter-variable modeling through what we call delegate
tokens which are then used to perform full, unconstrained, inter-temporal
modeling. Delegate tokens act as an implicit regularizer that forces the model
to be highly selective about what inter-variable information is allowed to
propagate through the network. Our results show that DELTAformer scales
linearly with variable-count while actually outperforming standard
Transformers, achieving state-of-the-art performance across benchmarks and
baselines. In addition, DELTAformer can focus on relevant signals better than
standard Transformers in noisy MTS environments and overall exhibits superior
noise-resilience. Overall, results across various experiments confirm that by
aligning our model design to leverage domain-specific challenges in MTS to our
advantage, DELTAformer can simultaneously achieve linear scaling while actually
improving its performance against standard, quadratic Transformers.
CompLLM Compression for Long Context Q&A
Authors: Gabriele Berton, Jayakrishnan Unnikrishnan, Son Tran, Mubarak Shah
2025-09-23
Large Language Models (LLMs) face significant computational challenges when
processing long contexts due to the quadratic complexity of self-attention.
While soft context compression methods, which map input text to smaller latent
representations, have shown promise, their real-world adoption is limited.
Existing techniques typically compress the context as a single unit, which
leads to quadratic compression complexity and an inability to reuse
computations across queries with overlapping contexts. In this work, we
introduce CompLLM, a soft compression technique designed for practical
deployment. Instead of processing the context holistically, CompLLM divides it
into segments and compresses each one independently. This simple design choice
yields three critical properties: efficiency, as the compression step scales
linearly with the context length; scalability, enabling models trained on short
sequences (e.g., 1k tokens) to generalize to contexts of 100k tokens; and
reusability, allowing compressed segments to be cached and reused across
different queries. Our experiments show that with a 2x compression rate, at
high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x
and reduces the KV cache size by 50%. Furthermore, CompLLM achieves performance
comparable to that obtained with the uncompressed context, and even surpasses
it on very long sequences, demonstrating its effectiveness and practical
utility.
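The segment-wise design described above can be sketched in a few lines. This is only an illustration of the control flow (split, compress each segment independently, cache by segment content); the averaging "compressor" is a placeholder and not CompLLM's learned model, and `SegmentCompressor` with its parameters is a hypothetical name.

```python
from typing import Dict, List, Sequence, Tuple

Segment = Tuple[int, ...]


class SegmentCompressor:
    """Compress a token sequence segment by segment, reusing cached segments."""

    def __init__(self, segment_len: int = 4, rate: int = 2) -> None:
        self.segment_len = segment_len
        self.rate = rate  # compression rate: `rate` inputs become one output
        self.cache: Dict[Segment, List[float]] = {}
        self.compress_calls = 0  # counts cache misses only

    def _compress_segment(self, segment: Segment) -> List[float]:
        # Placeholder compressor (assumption): average each group of `rate` tokens.
        self.compress_calls += 1
        groups = [segment[i:i + self.rate] for i in range(0, len(segment), self.rate)]
        return [sum(g) / len(g) for g in groups]

    def compress(self, tokens: Sequence[int]) -> List[float]:
        out: List[float] = []
        for start in range(0, len(tokens), self.segment_len):
            segment = tuple(tokens[start:start + self.segment_len])
            if segment not in self.cache:
                self.cache[segment] = self._compress_segment(segment)
            out.extend(self.cache[segment])
        return out
```

Since each segment is handled on its own, the work grows linearly with the number of segments, and a second query that shares segments with an earlier context only pays for the segments it has not seen.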
Online Process Reward Learning for Agentic Reinforcement Learning
Authors: Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, Jianbin Jiao
2025-09-23
Large language models (LLMs) are increasingly trained with reinforcement
learning (RL) as autonomous agents that reason and act over long horizons in
interactive environments. However, sparse and sometimes unverifiable rewards
make temporal credit assignment extremely challenging. Recent work attempts to
integrate process supervision into agent learning but suffers from biased
annotation, reward hacking, high variance from overly fine-grained signals, or
failures when state overlap is rare. We therefore introduce Online Process
Reward Learning (OPRL), a general credit-assignment strategy for agentic RL
that integrates seamlessly with standard on-policy algorithms without relying
on additional rollouts or explicit step labels. In OPRL, we optimize an
implicit process reward model (PRM) alternately with the agent's policy to
transform trajectory preferences into implicit step rewards through a
trajectory-based DPO objective. These step rewards are then used to compute
step-level advantages, which are combined with episode-level advantages from
outcome rewards for policy update, creating a self-reinforcing loop.
Theoretical findings guarantee that the learned step rewards are consistent
with trajectory preferences and act as potential-based shaping rewards,
providing bounded gradients to stabilize training. Empirically, we evaluate
OPRL on three distinct agent benchmarks, including WebShop and VisualSokoban, as
well as open-ended social interactions with unverifiable rewards in SOTOPIA.
Crucially, OPRL shows superior performance over frontier LLMs and strong RL
baselines across domains, achieving state-of-the-art results with higher
sample-efficiency and lower variance during training. Further analysis also
demonstrates the efficient exploration by OPRL using fewer actions,
underscoring its potential for agentic learning in real-world scenarios.
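One way to picture the combination of step-level and episode-level advantages is the sketch below. It assumes the standard DPO-style implicit reward (a scaled log-probability ratio between the process reward model and a reference policy) and a simple mean-centering of step rewards; neither detail is stated in the abstract, and all function names and the `step_weight` parameter are hypothetical.

```python
from typing import List, Sequence


def implicit_step_rewards(
    logp_prm: Sequence[float],
    logp_ref: Sequence[float],
    beta: float = 0.1,
) -> List[float]:
    """Per-step implicit reward, assumed to be beta * (log pi_prm - log pi_ref)."""
    return [beta * (p - r) for p, r in zip(logp_prm, logp_ref)]


def combined_advantages(
    step_rewards: Sequence[float],
    episode_advantage: float,
    step_weight: float = 1.0,
) -> List[float]:
    """Add a mean-centered step-level term to a shared episode-level advantage.

    Mean-centering is an illustrative choice of step-level baseline, not
    necessarily the one used by OPRL.
    """
    if not step_rewards:
        return []
    baseline = sum(step_rewards) / len(step_rewards)
    return [episode_advantage + step_weight * (r - baseline) for r in step_rewards]
```

Under this assumed form, the step rewards of a trajectory sum to its implicit trajectory-level reward, which is what lets a preference between whole trajectories be turned into per-step credit.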
Reading Images Like Texts Sequential Image Understanding in Vision-Language Models
Authors: Yueyan Li, Chenggong Zhao, Zeyuan Zang, Caixia Yuan, Xiaojie Wang
2025-09-23
Vision-Language Models (VLMs) have demonstrated remarkable performance across
a variety of real-world tasks. However, existing VLMs typically process visual
information by serializing images, a method that diverges significantly from
the parallel nature of human vision. Moreover, their opaque internal mechanisms
hinder both deeper understanding and architectural innovation. Inspired by the
dual-stream hypothesis of human vision, which distinguishes the "what" and
"where" pathways, we deconstruct the visual processing in VLMs into object
recognition and spatial perception for separate study. For object recognition,
we convert images into text token maps and find that the model's perception of
image content unfolds as a two-stage process from shallow to deep layers,
beginning with attribute recognition and culminating in semantic
disambiguation. For spatial perception, we theoretically derive and empirically
verify the geometric structure underlying the positional representation in
VLMs. Based on these findings, we introduce an instruction-agnostic token
compression algorithm based on a plug-and-play visual decoder to improve
decoding efficiency, and a RoPE scaling technique to enhance spatial reasoning.
Through rigorous experiments, our work validates these analyses, offering a
deeper understanding of VLM internals and providing clear principles for
designing more capable future architectures.
BiGraspFormer End-to-End Bimanual Grasp Transformer
Authors: Kangmin Kim, Seunghyeok Back, Geonhyup Lee, Sangbeom Lee, Sangjun Noh, Kyoobin Lee
2025-09-23
Bimanual grasping is essential for robots to handle large and complex
objects. However, existing methods either focus solely on single-arm grasping
or employ separate grasp generation and bimanual evaluation stages, leading to
coordination problems including collision risks and unbalanced force
distribution. To address these limitations, we propose BiGraspFormer, a unified
end-to-end framework that directly generates coordinated bimanual
grasps from object point clouds. Our key idea is the Single-Guided Bimanual
(SGB) strategy, which first generates diverse single grasp candidates using a
transformer decoder, then leverages their learned features through specialized
attention mechanisms to jointly predict bimanual poses and quality scores. This
conditioning strategy reduces the complexity of the 12-DoF search space while
ensuring coordinated bimanual manipulation. Comprehensive simulation
experiments and real-world validation demonstrate that BiGraspFormer
consistently outperforms existing methods while maintaining efficient inference
speed (<0.05s), confirming the effectiveness of our framework. Code and
supplementary materials are available at https://sites.google.com/bigraspformer
Clapping Removing Per-sample Storage for Pipeline Parallel Distributed Optimization with Communication Compression
Authors: Boao Kong, Xu Huang, Yuqi Xu, Yixuan Liang, Bin Wang, Kun Yuan
2025-09-23
Pipeline-parallel distributed optimization is essential for large-scale
machine learning but is challenged by significant communication overhead from
transmitting high-dimensional activations and gradients between workers.
Existing approaches often depend on impractical unbiased gradient assumptions
or incur sample-size memory overhead. This paper introduces Clapping, a
Communication compression algorithm with LAzy samPling for Pipeline-parallel
learnING. Clapping adopts a lazy sampling strategy that reuses data samples
across steps, breaking the sample-wise memory barrier and supporting convergence
in few-epoch or online training regimes. Clapping comprises two variants,
Clapping-FC and Clapping-FU, both of which achieve convergence without the
unbiased gradient assumption, effectively addressing compression error
propagation in
multi-worker settings. Numerical experiments validate the performance of
Clapping across different learning tasks.
HD-PPT Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS
Authors: Sihang Nie, Xiaofen Xing, Jingyuan Xing, Baiji Liu, Xiangmin Xu
2025-09-23
Large Language Model (LLM)-based Text-to-Speech (TTS) models have already
reached a high degree of naturalness. However, precise control of TTS
inference is still challenging. Although instruction-based Text-to-Speech
(Instruct-TTS) models are proposed, these models still lack fine-grained
control due to the modality gap between single-level text instructions and
multilevel speech tokens. To address this limitation, we propose HD-PPT, a
framework that transforms speech synthesis into a structured, hierarchical
task. To enable fine-grained control, we introduce a novel speech codec to
extract distinct prompt-preference and content-preference tokens from the
complex speech tokens, supervised by automatic speech recognition (ASR) and
cross-lingual audio-text pre-training (CLAP) objectives. To bridge the modality
gap of these tokens, we propose a hierarchical decoding strategy, where the LLM
generates tokens in a structured order: first semantic, then fine-grained
style, and finally complete acoustic representation. Extensive experiments
demonstrate that this hierarchical paradigm significantly improves instruction
adherence and achieves state-of-the-art naturalness, validating our approach
for precise and controllable speech synthesis. Audio samples are available at
https://xxh333.github.io/.
Benchmarking PDF Accessibility Evaluation A Dataset and Framework for Assessing Automated and LLM-Based Approaches for Accessibility Testing
Authors: Anukriti Kumar, Tanushree Padath, Lucy Lu Wang
2025-09-23
PDFs remain the dominant format for scholarly communication, despite
significant accessibility challenges for blind and low-vision users. While
various tools attempt to evaluate PDF accessibility, there is no standardized
methodology to evaluate how different accessibility assessment approaches
perform. Our work addresses this critical gap by introducing a novel benchmark
dataset of scholarly PDFs with expert-validated accessibility annotations
across seven criteria (alternative text quality, logical reading order,
semantic tagging, table structure, functional hyperlinks, color contrast, and
font readability), and a four-category evaluation framework with standardized
labels (Passed, Failed, Not Present, Cannot Tell) to systematically assess
accessibility evaluation approaches. Using our evaluation framework, we explore
whether large language models (LLMs) are capable of supporting automated
accessibility evaluation. We benchmark five LLMs, which demonstrate varying
capabilities in correctly assessing different accessibility criteria, with
GPT-4-Turbo achieving the highest overall accuracy (0.85). However, all models
struggled in correctly categorizing documents with Not Present and Cannot Tell
accessibility labels, particularly for alt text quality assessment. Our
qualitative comparison with standard automated checkers reveals complementary
strengths: rule-based tools excel at technical verification, while LLMs better
evaluate semantic appropriateness and contextual relevance. Based on our
findings, we propose a hybrid approach that would combine automated checkers,
LLM evaluation, and human assessment as a future strategy for PDF accessibility
evaluation.
Confidential LLM Inference Performance and Cost Across CPU and GPU TEEs
Authors: Marcin Chrapek, Marcin Copik, Etienne Mettaz, Torsten Hoefler
2025-09-23
Large Language Models (LLMs) are increasingly deployed on converged Cloud and
High-Performance Computing (HPC) infrastructure. However, as LLMs handle
confidential inputs and are fine-tuned on costly, proprietary datasets, their
heightened security requirements slow adoption in privacy-sensitive sectors
such as healthcare and finance. We investigate methods to address this gap and
propose Trusted Execution Environments (TEEs) as a solution for securing
end-to-end LLM inference. We validate their practicality by evaluating these
compute-intensive workloads entirely within CPU and GPU TEEs. On the CPU side,
we conduct an in-depth study running full Llama2 inference pipelines (7B, 13B,
70B) inside Intel's TDX and SGX, accelerated by Advanced Matrix Extensions
(AMX). We derive 12 insights, including that across various data types, batch
sizes, and input lengths, CPU TEEs impose under 10% throughput and 20% latency
overheads, further reduced by AMX. We run LLM inference on NVIDIA H100
Confidential Compute GPUs, contextualizing our CPU findings and observing
throughput penalties of 4-8% that diminish as batch and input sizes grow. By
comparing performance, cost, and security trade-offs, we show how CPU TEEs can
be more cost-effective or secure than their GPU counterparts. To our knowledge,
our work is the first to comprehensively demonstrate the performance and
practicality of modern TEEs across both CPUs and GPUs for enabling confidential
LLMs (cLLMs).
FlashGMM Fast Gaussian Mixture Entropy Model for Learned Image Compression
Authors: Shimon Murai, Fangzheng Lin, Jiro Katto
2025-09-23
High-performance learned image codecs require flexible
probability models to fit latent representations. Gaussian Mixture Models
(GMMs) were proposed to satisfy this demand, but suffer from a significant
runtime performance bottleneck due to the large Cumulative Distribution
Function (CDF) tables that must be built for rANS coding. This paper introduces
a fast coding algorithm that entirely eliminates this bottleneck. By leveraging
the CDF's monotonic property, our decoder performs a dynamic binary search to
find the correct symbol, eliminating the need for costly table construction and
lookup. Aided by SIMD optimizations and numerical approximations, our approach
accelerates the GMM entropy coding process by up to approximately 90x without
compromising rate-distortion performance, significantly improving the
practicality of GMM-based codecs. The implementation will be made publicly
available at https://github.com/tokkiwa/FlashGMM.
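The idea of replacing a prebuilt CDF table with a search over a monotonic CDF can be illustrated as follows. This is a floating-point sketch under simplifying assumptions: a real rANS decoder works with quantized integer frequencies, and the function names, the half-integer bin edges, and the symbol range are illustrative choices rather than the FlashGMM implementation.

```python
import math
from typing import Sequence


def gmm_cdf(x: float, weights: Sequence[float], means: Sequence[float],
            scales: Sequence[float]) -> float:
    """CDF of a Gaussian mixture, evaluated on demand (no table)."""
    return sum(
        w * 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
        for w, m, s in zip(weights, means, scales)
    )


def find_symbol(u: float, weights: Sequence[float], means: Sequence[float],
                scales: Sequence[float], lo: int = -64, hi: int = 64) -> int:
    """Return the smallest integer s in [lo, hi] with CDF(s + 0.5) > u.

    Correct because the CDF is non-decreasing; needs O(log(hi - lo)) CDF
    evaluations instead of one table entry per symbol. Falls back to `hi`
    if no symbol in range satisfies the condition.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if gmm_cdf(mid + 0.5, weights, means, scales) > u:
            hi = mid  # answer is mid or to its left
        else:
            lo = mid + 1  # answer is strictly to the right
    return lo
```

For a 129-symbol range the search touches at most 8 CDF values per symbol, which conveys why skipping table construction can pay off when the mixture parameters change for every latent element.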
Bi-VLM Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models
Authors: Xijun Wang, Junyun Huang, Rayyan Abdalla, Chengyuan Zhang, Ruiqi Xian, Dinesh Manocha
2025-09-23
We address the critical gap between the computational demands of
vision-language models and the possible ultra-low-bit weight precision
(bitwidth bits) we can use for higher efficiency. Our work is motivated
by the substantial computational cost and memory requirements of VLMs, which
restrict their applicability in hardware-constrained environments. We propose
Bi-VLM, which separates model weights non-uniformly based on the Gaussian
quantiles. Our formulation groups the model weights into outlier (salient) and
multiple inlier (unsalient) subsets, ensuring that each subset contains a
proportion of weights corresponding to its quantile in the distribution. We
propose a saliency-aware hybrid quantization algorithm and use it to quantize
weights by imposing different constraints on the scaler and binary matrices
based on the saliency metric and compression objective. We have evaluated our
approach on different VLMs. For the language model part of the VLM, our Bi-VLM
outperforms the SOTA by 3%-47% on the visual question answering task in terms
of four different benchmarks and three different models. For the overall VLM,
our Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the
quantized models and observe a 90%-99% redundancy of image tokens
in the quantized models. This helps us to further prune the visual tokens to
improve efficiency.
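A minimal sketch of quantile-based weight grouping, assuming the weights are roughly Gaussian: weights are standardized, the two tails holding `tail_mass` of the probability form the outlier group, and the remaining central mass is split into equal-probability inlier bands. The group count, `tail_mass`, and the sign-and-scale binarization are illustrative assumptions, not Bi-VLM's exact saliency-aware procedure.

```python
from statistics import NormalDist, fmean, pstdev
from typing import List, Sequence, Tuple


def split_by_gaussian_quantiles(
    weights: Sequence[float],
    tail_mass: float = 0.1,
    n_inlier_groups: int = 2,
) -> List[List[float]]:
    """Return inlier groups (innermost first) followed by the outlier group.

    Assumes the weights are not all identical (non-zero standard deviation).
    """
    mu, sigma = fmean(weights), pstdev(weights)
    central = 1.0 - tail_mass
    std_normal = NormalDist()
    # |z| edges so each inlier band carries central / n_inlier_groups mass.
    edges = [
        std_normal.inv_cdf(0.5 + central * k / (2 * n_inlier_groups))
        for k in range(1, n_inlier_groups + 1)
    ]
    groups: List[List[float]] = [[] for _ in range(n_inlier_groups + 1)]
    for w in weights:
        z = abs(w - mu) / sigma
        idx = next((i for i, e in enumerate(edges) if z <= e), n_inlier_groups)
        groups[idx].append(w)
    return groups


def binarize(group: Sequence[float]) -> Tuple[float, List[int]]:
    """Sign-and-scale binarization: one scale (mean |w|) and a +/-1 per weight."""
    if not group:
        return 0.0, []
    scale = fmean(abs(w) for w in group)
    return scale, [1 if w >= 0 else -1 for w in group]
```

Giving each group its own scale keeps a few large-magnitude weights from inflating the scale used for the many small ones, which is the usual motivation for treating outliers separately at very low bit widths.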
HyperCool Reducing Encoding Cost in Overfitted Codecs with Hypernetworks
Authors: Pep Borrell-Tatché, Till Aczel, Théo Ladune, Roger Wattenhofer
2025-09-23
Overfitted image codecs like Cool-chic achieve strong compression by
tailoring lightweight models to individual images, but their encoding is slow
and computationally expensive. To accelerate encoding, Non-Overfitted (N-O)
Cool-chic replaces the per-image optimization with a learned inference model,
trading compression performance for encoding speed. We introduce HyperCool, a
hypernetwork architecture that mitigates this trade-off. Building upon the N-O
Cool-chic framework, HyperCool generates content-adaptive parameters for a
Cool-chic decoder in a single forward pass, tailoring the decoder to the input
image without requiring per-image fine-tuning. Our method achieves a 4.9% rate
reduction over N-O Cool-chic with minimal computational overhead. Furthermore,
the output of our hypernetwork provides a strong initialization for further
optimization, reducing the number of steps needed to approach fully overfitted
model performance. With fine-tuning, HEVC-level compression is achieved with
60.4% of the encoding cost of the fully overfitted Cool-chic. This work
proposes a practical method to accelerate encoding in overfitted image codecs,
improving their viability in scenarios with tight compute budgets.
PIE Perception and Interaction Enhanced End-to-End Motion Planning for Autonomous Driving
Authors: Chengran Yuan, Zijian Lu, Zhanqi Zhang, Yimin Zhao, Zefan Huang, Shuo Sun, Jiawei Sun, Jiahui Li, Christina Dao Wen Lee, Dongen Li, Marcelo H. Ang Jr
2025-09-23
End-to-end motion planning is promising for simplifying complex autonomous
driving pipelines. However, challenges such as scene understanding and
effective prediction for decision-making continue to present substantial
obstacles to its large-scale deployment. In this paper, we present PIE, a
pioneering framework that integrates advanced perception, reasoning, and
intention modeling to dynamically capture interactions between the ego vehicle
and surrounding agents. It incorporates a bidirectional Mamba fusion that
addresses data losses in multimodal fusion of camera and LiDAR
inputs, alongside a novel reasoning-enhanced decoder integrating Mamba and
Mixture-of-Experts to facilitate scene-compliant anchor selection and optimize
adaptive trajectory inference. PIE adopts an action-motion interaction module
to effectively utilize state predictions of surrounding agents to refine ego
planning. The proposed framework is thoroughly validated on the NAVSIM
benchmark. PIE, without using any ensemble and data augmentation techniques,
achieves an 88.9 PDM score and 85.6 EPDM score, surpassing the performance of
prior state-of-the-art methods. Comprehensive quantitative and qualitative
analyses demonstrate that PIE is capable of reliably generating feasible and
high-quality ego trajectories.
FlexSED Towards Open-Vocabulary Sound Event Detection
Authors: Jiarui Hai, Helin Wang, Weizhe Guo, Mounya Elhilali
2025-09-23
Despite recent progress in large-scale sound event detection (SED) systems
capable of handling hundreds of sound classes, existing multi-class
classification frameworks remain fundamentally limited. They cannot process
free-text sound queries, which enable more flexible and user-friendly
interaction, and they lack zero-shot capabilities and offer poor few-shot
adaptability. Although text-query-based separation methods have been explored,
they primarily focus on source separation and are ill-suited for SED tasks that
require precise temporal localization and efficient detection across large and
diverse sound vocabularies. In this paper, we propose FlexSED, an
open-vocabulary sound event detection system. FlexSED builds on a pretrained
audio SSL model and the CLAP text encoder, introducing an encoder-decoder
composition and an adaptive fusion strategy to enable effective continuous
training from pretrained weights. To ensure robust supervision, it also employs
large language models (LLMs) to assist in event query selection during
training, addressing challenges related to missing labels. As a result, FlexSED
achieves superior performance compared to vanilla SED models on
AudioSet-Strong, while demonstrating strong zero-shot and few-shot
capabilities. We release the code and pretrained models to support future
research and applications based on FlexSED.
OmniFed A Modular Framework for Configurable Federated Learning from Edge to HPC
Authors: Sahil Tyagi, Andrei Cozma, Olivera Kotevska, Feiyi Wang
2025-09-23
Federated Learning (FL) is critical for edge and High Performance Computing
(HPC) where data is not centralized and privacy is crucial. We present OmniFed,
a modular framework designed around decoupling and clear separation of concerns
for configuration, orchestration, communication, and training logic. Its
architecture supports configuration-driven prototyping and code-level
override-what-you-need customization. We also support different topologies,
mixed communication protocols within a single deployment, and popular training
algorithms. It also offers optional privacy mechanisms including Differential
Privacy (DP), Homomorphic Encryption (HE), and Secure Aggregation (SA), as well
as compression strategies. These capabilities are exposed through well-defined
extension points, allowing users to customize topology and orchestration,
learning logic, and privacy/compression plugins, all while preserving the
integrity of the core system. We evaluate multiple models and algorithms to
measure various performance metrics. By unifying topology configuration,
mixed-protocol communication, and pluggable modules in one stack, OmniFed
streamlines FL deployment across heterogeneous environments. Github repository
is available at https://github.com/at-aaims/OmniFed.
LLMZ+ Contextual Prompt Whitelist Principles for Agentic LLMs
Authors: Tom Pawelek, Raj Patel, Charlotte Crowell, Noorbakhsh Amiri, Sudip Mittal, Shahram Rahimi, Andy Perkins
2025-09-23
Compared to traditional models, agentic AI represents a highly valuable
target for potential attackers as they possess privileged access to data
sources and API tools, which are traditionally not incorporated into classical
agents. Unlike a typical software application residing in a Demilitarized Zone
(DMZ), agentic LLMs consciously rely on nondeterministic behavior of the AI
(only defining a final goal, leaving the path selection to the LLM). This
characteristic introduces substantial security risk to both operational
security and information security. The most common existing defense mechanisms
rely on detecting malicious intent and preventing it from reaching the LLM
agent, thus protecting against jailbreak attacks such as prompt injection. In
this paper, we present an alternative approach, LLMZ+, which moves beyond
traditional detection-based approaches by implementing prompt whitelisting.
Through this method, only contextually appropriate and safe messages are
permitted to interact with the agentic LLM. By leveraging the specificity of
context, LLMZ+ guarantees that all exchanges between external users and the LLM
conform to predefined use cases and operational boundaries. Our approach
streamlines the security framework, enhances its long-term resilience, and
reduces the resources required for sustaining LLM information security. Our
empirical evaluation demonstrates that LLMZ+ provides strong resilience against
the most common jailbreak prompts. At the same time, legitimate business
communications are not disrupted, and authorized traffic flows seamlessly
between users and the agentic LLM. We measure the effectiveness of our approach
using false positive and false negative rates, both of which can be reduced to
0 in our experimental setting.
Individualized non-uniform quantization for vector search
Authors: Mariano Tepper, Ted Willke
2025-09-22
Embedding vectors are widely used for representing unstructured data and
searching through it for semantically similar items. However, the large size of
these vectors, due to their high-dimensionality, creates problems for modern
vector search techniques: retrieving large vectors from memory/storage is
expensive and their footprint is costly. In this work, we present NVQ
(non-uniform vector quantization), a new vector quantization technique that is
computationally and spatially efficient in the high-fidelity regime. The core
idea in NVQ is to use novel parsimonious and computationally efficient
nonlinearities for building non-uniform vector quantizers. Critically, these
quantizers are individually learned for each indexed vector. Our
experimental results show that NVQ exhibits improved accuracy compared to the
state of the art with a minimal computational cost.
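As a rough illustration of what "individually learned non-uniform quantizers"
can mean, the sketch below fits one quantizer per vector. It is only a sketch
under stated assumptions: the power-law warp, the small grid of candidate
exponents, and every function name are hypothetical stand-ins, since the
abstract does not specify the nonlinearities NVQ actually uses.

```python
from typing import Dict, List, Sequence


def warp(u: float, p: float) -> float:
    """Hypothetical power-law nonlinearity on [-1, 1] (a stand-in, not NVQ's)."""
    return (abs(u) ** p) * (1.0 if u >= 0 else -1.0)


def unwarp(w: float, p: float) -> float:
    """Inverse of warp for the same exponent p."""
    return (abs(w) ** (1.0 / p)) * (1.0 if w >= 0 else -1.0)


def reconstruct(codes: Sequence[int], scale: float, p: float, bits: int) -> List[float]:
    """Map integer codes back to floats: uniform grid, then inverse warp."""
    levels = (1 << bits) - 1
    return [unwarp(-1.0 + 2.0 * c / levels, p) * scale for c in codes]


def encode_vector(
    vec: Sequence[float],
    bits: int = 4,
    exponents: Sequence[float] = (0.5, 0.75, 1.0, 1.5, 2.0),
) -> Dict[str, object]:
    """Pick, for this one vector, the exponent with the lowest squared error.

    Quantization is uniform in the warped domain, which makes it non-uniform
    in the original domain. p = 1.0 recovers plain uniform quantization.
    """
    levels = (1 << bits) - 1
    scale = max(abs(v) for v in vec) or 1.0  # avoid dividing by zero
    best = None
    for p in exponents:
        codes = [round((warp(v / scale, p) + 1.0) / 2.0 * levels) for v in vec]
        recon = reconstruct(codes, scale, p, bits)
        err = sum((a - b) ** 2 for a, b in zip(vec, recon))
        if best is None or err < best["err"]:
            best = {"p": p, "scale": scale, "codes": codes, "err": err}
    return best


# A vector with one large entry and several small ones.
vec = [0.02, -0.05, 0.9, 0.01, -0.03, 0.04]
enc = encode_vector(vec)
uniform = encode_vector(vec, exponents=(1.0,))
```

Because the uniform case (p = 1.0) is among the candidates, the per-vector
choice can never be worse than uniform quantization at the same bit width; the
price is storing one extra parameter per vector.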
LAWCAT Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling
Authors: Zeyu Liu, Souvik Kundu, Lianghao Jiang, Anni Li, Srikanth Ronanki, Sravan Bodapati, Gourav Datta, Peter A. Beerel
2025-09-22
Although Transformer architectures have achieved state-of-the-art performance
across diverse domains, their quadratic computational complexity with respect
to sequence length remains a significant bottleneck, particularly for
latency-sensitive long-context applications. While recent linear-complexity
alternatives are increasingly powerful, effectively training them from scratch
is still resource-intensive. To overcome these limitations, we propose LAWCAT
(Linear Attention with Convolution Across Time), a novel linearization
framework designed to efficiently transfer the capabilities of pre-trained
Transformers into a performant linear attention architecture. LAWCAT integrates
causal Conv1D layers to enhance local dependency modeling and employs
normalized gated linear attention to improve generalization across varying
context lengths. Our comprehensive evaluations demonstrate that distilling
Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval
accuracy up to 22K tokens, significantly extending its effective context
window. Similarly, the Llama3.2-1B LAWCAT variant achieves competitive
performance on S-NIAH 1&2&3 tasks (1K-8K context length) and the BABILong
benchmark (QA2&QA3, 0K-16K context length), requiring less than 0.1% pre-training
tokens compared with pre-training models. Furthermore, LAWCAT exhibits faster
speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT
thus provides an efficient pathway to high-performance, long-context linear
models suitable for edge deployment, reducing reliance on extensive
long-sequence training data and computational resources.
NormGenesis Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery
Authors: Minki Hong, Jangho Choi, Jihie Kim
2025-09-22
Social norms govern culturally appropriate behavior in communication,
enabling dialogue systems to produce responses that are not only coherent but
also socially acceptable. We present NormGenesis, a multicultural framework for
generating and annotating socially grounded dialogues across English, Chinese,
and Korean. To model the dynamics of social interaction beyond static norm
classification, we propose a novel dialogue type, Violation-to-Resolution
(V2R), which models the progression of conversations following norm violations
through recognition and socially appropriate repair. To improve pragmatic
consistency in underrepresented languages, we implement an exemplar-based
iterative refinement early in the dialogue synthesis process. This design
introduces alignment with linguistic, emotional, and sociocultural expectations
before full dialogue generation begins. Using this framework, we construct a
dataset of 10,800 multi-turn dialogues annotated at the turn level for norm
adherence, speaker intent, and emotional response. Human and LLM-based
evaluations demonstrate that NormGenesis significantly outperforms existing
datasets in refinement quality, dialogue naturalness, and generalization
performance. We show that models trained on our V2R-augmented data exhibit
improved pragmatic competence in ethically sensitive contexts. Our work
establishes a new benchmark for culturally adaptive dialogue modeling and
provides a scalable methodology for norm-aware generation across linguistically
and culturally diverse languages.
Efficient Particle Acceleration in 2.5-Dimensional, Hybrid-Kinetic Simulations of Decaying, Supersonic, Plasma Turbulence
Authors: Keyan Gootkin, Colby Haggerty, Damiano Caprioli, Zachary Davis
2025-09-22
Collisionless, turbulent plasmas surround the Earth, from the magnetosphere
to the intergalactic medium, and the fluctuations within them affect nearly
every field in the space sciences, from space weather forecasts to theories of
galaxy formation. Where turbulent motions become supersonic, their interactions
can lead to the formation of shocks, which are known to efficiently energize
ions to cosmic-ray energies. We present 2.5-dimensional, hybrid-kinetic
simulations of decaying, supersonic, non-relativistic turbulence in a
collisionless plasma using the code dHybridR. Turbulence within these
simulations is highly compressible; after accounting for this by
taking the omni-directional power-spectrum of the density weighted velocity
field, we find turbulent spectra with power-law slopes of for low Mach numbers, in the inertial range, and for high Mach numbers. Ions embedded in the highly supersonic simulations
are accelerated to non-thermal energies at efficiencies similar to those seen
in shocks, despite being in a non-relativistic regime and lacking the large
scale structure of a shock. We observe that particles are accelerated into a
power-law spectrum, with a slope of in (non-relativistic)
energy. We compare these results to those obtained from the theory and
simulations of diffusive shock acceleration, and discuss the astrophysical
implications of this theoretical work.
Chiplet-Based RISC-V SoC with Modular AI Acceleration
Authors: P. Ramkumar, S. S. Bharadwaj
2025-09-22
Achieving high performance, energy efficiency, and cost-effectiveness while
maintaining architectural flexibility is a critical challenge in the
development and deployment of edge AI devices. Monolithic SoC designs struggle
with this complex balance mainly due to low manufacturing yields (below 16%) at
advanced 360 mm^2 process nodes. This paper presents a novel chiplet-based
RISC-V SoC architecture that addresses these limitations through modular AI
acceleration and intelligent system-level optimization. Our proposed design
integrates 4 different key innovations in a 30mm x 30mm silicon interposer:
adaptive cross-chiplet Dynamic Voltage and Frequency Scaling (DVFS); AI-aware
Universal Chiplet Interconnect Express (UCIe) protocol extensions featuring
streaming flow control units and compression-aware transfers; distributed
cryptographic security across heterogeneous chiplets; and intelligent
sensor-driven load migration. The proposed architecture integrates a 7nm RISC-V
CPU chiplet with dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3 memory
stacks, and dedicated power management controllers. Experimental results across
industry standard benchmarks like MobileNetV2, ResNet-50 and real-time video
processing demonstrate significant performance improvements. The AI-optimized
configuration achieves ~14.7% latency reduction, 17.3% throughput improvement,
and 16.2% power reduction compared to previous basic chiplet implementations.
These improvements collectively translate to a 40.1% efficiency gain
corresponding to ~3.5 mJ per MobileNetV2 inference (860 mW/244 images/s), while
maintaining sub-5ms real-time capability across all experimented workloads.
These performance upgrades demonstrate that modular chiplet designs can achieve
near-monolithic computational density while enabling cost efficiency,
scalability and upgradeability, crucial for next-generation edge AI device
applications.
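The energy figure quoted in this abstract follows directly from the stated
power and throughput, since milliwatts divided by inferences per second gives
millijoules per inference. The short check below reproduces it. The function
name is illustrative, and the per-image time is only meaningful under the
assumption (not stated in the abstract) that images are processed one at a
time.

```python
def energy_per_inference_mj(power_mw: float, throughput_per_s: float) -> float:
    """mW / (inferences per second) = mJ per inference."""
    return power_mw / throughput_per_s


# Figures quoted in the abstract for MobileNetV2.
power_mw = 860.0
images_per_s = 244.0

energy_mj = energy_per_inference_mj(power_mw, images_per_s)

# Assumption: strictly serial processing, so time per image is 1 / throughput.
ms_per_image = 1000.0 / images_per_s

print(f"{energy_mj:.2f} mJ per inference")  # 3.52 mJ per inference
print(f"{ms_per_image:.1f} ms per image")   # 4.1 ms per image
```

The result rounds to the "~3.5 mJ" quoted above, and the serial per-image time
is consistent with the sub-5ms claim.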
Speculate Deep and Accurate Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding
Authors: Pei-Shuo Wang, Jian-Jia Chen, Chun-Che Yang, Chi-Chih Chang, Ning-Chi Huang, Mohamed S. Abdelfattah, Kai-Chiang Wu
2025-09-22
The immense model sizes of large language models (LLMs) challenge deployment
on memory-limited consumer GPUs. Although model quantization and parameter
offloading are common strategies to address memory limitations, quantization
can degrade quality, and offloading maintains quality but suffers from slow
inference. Speculative decoding presents a promising avenue to accelerate
parameter offloading, utilizing a fast draft model to propose multiple draft
tokens, which are then verified by the target LLM in parallel with a single
forward pass. This method reduces the time-consuming data transfers in forward
passes that involve offloaded weight transfers. Existing methods often rely on
pretrained weights of the same family, but require additional training to align
with custom-trained models. Moreover, approaches that involve draft model
training usually yield only modest speedups. This limitation arises from
insufficient alignment with the target model, preventing higher token
acceptance lengths. To address these challenges and achieve greater speedups,
we propose SubSpec, a plug-and-play method to accelerate parameter offloading
that is lossless and training-free. SubSpec constructs a highly aligned draft
model by generating quantized substitute layers from offloaded target LLM
portions. Additionally, our method shares the remaining GPU-resident layers
and the KV-Cache, further reducing memory overhead and enhancing alignment.
SubSpec achieves a high average acceptance length, delivering 9.1x speedup for
Qwen2.5 7B on MT-Bench (8GB VRAM limit) and an average of 12.5x speedup for
Qwen2.5 32B on popular generation benchmarks (24GB VRAM limit).
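The draft-then-verify loop that this abstract builds on can be sketched with
toy models. This is a minimal illustration of greedy speculative decoding in
general, not of SubSpec itself: `target_next` and `draft_next` are hypothetical
stand-ins for the offloaded target model and the draft model, and the
quantized substitute layers and shared KV-Cache are not modeled. In a real
system the target scores all drafted positions in one forward pass; here that
pass is simulated and merely counted.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def target_next(prefix: List[Token]) -> Token:
    """Hypothetical expensive model: deterministic greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[Token]) -> Token:
    """Hypothetical cheap model: agrees with the target most of the time."""
    t = target_next(prefix)
    return t if len(prefix) % 4 != 3 else (t + 1) % 50


def greedy_generate(prompt: List[Token], n_new: int, target: NextFn) -> List[Token]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_generate(
    prompt: List[Token], n_new: int, k: int, draft: NextFn, target: NextFn
) -> Tuple[List[Token], int]:
    """Draft k tokens, keep the longest prefix the target agrees with."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        target_passes += 1  # one (simulated) parallel verification pass
        accepted: List[Token] = []
        for i in range(k + 1):
            expected = target(out + accepted)
            accepted.append(expected)  # always the target's own token
            if i == k or proposal[i] != expected:
                break  # bonus token after full acceptance, or a correction
        out.extend(accepted)
    return out[: len(prompt) + n_new], target_passes


prompt = [1, 2, 3]
baseline = greedy_generate(prompt, 12, target_next)
spec_out, passes = speculative_generate(prompt, 12, 4, draft_next, target_next)
```

Every appended token is the target's own greedy choice, so the output matches
target-only decoding exactly, which is the sense in which the method is
lossless. The benefit is that the number of target passes drops below the
number of generated tokens whenever the draft is well aligned.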
Exploiting Tree Structure for Credit Assignment in RL Training of LLMs
Authors: Hieu Tran, Zonghai Yao, Hong Yu
2025-09-22
Reinforcement learning improves reasoning, yet
delayed reward over
long sequences makes token-level credit assignment the key bottleneck. We study
the verifiable-reward setting, where the final answer is checkable and multiple
responses can be drawn per prompt. Reasoning tasks in math and medical QA align
with this setup, where only a few decision tokens significantly impact the
outcome. PPO offers token-level advantages with a learned value model, but it
is complex to train both the actor and critic models simultaneously, and it is
not easily generalizable, as the token-level values from the critic model can
make training prone to overfitting. GRPO is critic-free and supports verifiable
rewards, but spreads a single sequence-level return across tokens and ignores
branching. We introduce Prefix-to-Tree (P2T), a simple procedure that
converts a group of responses into a prefix tree and computes
nonparametric prefix values by aggregating descendant outcomes.
Built on P2T, we propose TEMPO (Tree-Estimated Mean Prefix Value for Policy
Optimization), a critic-free algorithm that augments the group-relative
outcome signal of GRPO with branch-gated temporal-difference corrections
derived from the tree.
At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO
reduces to GRPO; at branching tokens, it supplies precise token-level credit
without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B,
TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and
out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and
reaches higher validation accuracy with roughly the same wall-clock time.
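The prefix-tree idea in this abstract can be sketched in a few lines. This is
a minimal sketch under stated assumptions, not the authors' implementation:
the class and function names are hypothetical, prefix values are plain means
of descendant rewards, and the way the TD term is added to the
group-normalized outcome (unit weight, no extra scaling) is an assumption, as
the abstract does not give the exact formula.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple

Response = Tuple[Sequence[str], float]  # (tokens, verifiable reward)


class Node:
    """One prefix in the tree; value is the mean reward of its descendants."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.total = 0.0
        self.count = 0

    @property
    def value(self) -> float:
        return self.total / self.count


def build_prefix_tree(responses: Sequence[Response]) -> Node:
    root = Node()
    for tokens, reward in responses:
        node = root
        node.total += reward
        node.count += 1
        for tok in tokens:
            node = node.children.setdefault(tok, Node())
            node.total += reward
            node.count += 1
    return root


def td_corrections(responses: Sequence[Response]) -> List[List[float]]:
    """Per-token V(child) - V(parent), gated to tokens where the tree branches."""
    root = build_prefix_tree(responses)
    all_terms = []
    for tokens, _ in responses:
        node, terms = root, []
        for tok in tokens:
            child = node.children[tok]
            branching = len(node.children) > 1
            terms.append(child.value - node.value if branching else 0.0)
            node = child
        all_terms.append(terms)
    return all_terms


def token_advantages(responses: Sequence[Response], eps: float = 1e-8) -> List[List[float]]:
    """Group-normalized outcome plus TD term (additive combination is assumed)."""
    rewards = [r for _, r in responses]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [
        [(reward - mu) / (sigma + eps) + td for td in terms]
        for (_, reward), terms in zip(responses, td_corrections(responses))
    ]


# Four sampled responses to one prompt; only the first is correct.
group = [
    (["a", "b", "c"], 1.0),
    (["a", "b", "d"], 0.0),
    (["a", "e", "f"], 0.0),
    (["a", "e", "g"], 0.0),
]
td = td_corrections(group)
adv = token_advantages(group)
```

In this toy group the shared first token gets no correction, while the tokens
at the two branch points of the correct response receive positive credit
(0.25 and 0.5). Tokens on the wrong side of a branch are penalized, matching
the description that the method reduces to GRPO wherever the tree does not
branch.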
Evaluating Large Language Models for Detecting Antisemitism
Authors: Jay Patel, Hrudayangam Mehta, Jeremy Blackburn
2025-09-22
Detecting hateful content is a challenging and important problem. Automated
tools, like machine-learning models, can help, but they require continuous
training to adapt to the ever-changing landscape of social media. In this work,
we evaluate eight open-source LLMs' capability to detect antisemitic content,
specifically leveraging in-context definition as a policy guideline. We explore
various prompting techniques and design a new CoT-like prompt, Guided-CoT.
Guided-CoT handles the in-context policy well, increasing performance across
all evaluated models, regardless of decoding configuration, model sizes, or
reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5.
Additionally, we examine LLM errors and introduce metrics to quantify semantic
divergence in model-generated rationales, revealing notable differences and
paradoxical behaviors among LLMs. Our experiments highlight the differences
observed across LLMs' utility, explainability, and reliability.
Spiffy Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
Authors: Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli
2025-09-22
Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to
autoregressive LLMs (AR-LLMs) with the potential to operate at significantly
higher token generation rates. However, currently available open-source dLLMs
often generate at much lower rates, typically decoding only a single token at
every denoising timestep in order to maximize output quality. We present
Spiffy, a speculative decoding algorithm that accelerates dLLM inference while
provably preserving the model's output distribution. This work addresses the
unique challenges involved in applying ideas from speculative decoding of
AR-LLMs to the dLLM setting. Spiffy proposes draft states by leveraging the
dLLM's distribution itself in an auto-speculative manner. This approach is
efficient and effective, and
eliminates the overheads of training and running an independent draft model. To
structure the candidate draft states, we propose a novel directed draft graph
which is uniquely designed to take advantage of the bidirectional, block-wise
nature of dLLM generation and can be verified in parallel by the dLLM. To
further optimize the structure of these draft graphs, we introduce an
efficient, offline calibration algorithm that procedurally determines
high-quality graph configurations. These optimized draft graphs, enabling
increased acceptance rates, lead to a significant boost in the overall speedup
achieved by the system. Crucially, Spiffy is also complementary to other recent
innovations in improving dLLM generation speeds such as KV-caching and
multi-token unmasking. We demonstrate that when combined with such parallel
decoding algorithms, Spiffy is able to effectively multiply the benefits of
these methods leading to total speedups of up to .
GraDeT-HTR A Resource-Efficient Bengali Handwritten Text Recognition System utilizing Grapheme-based Tokenizer and Decoder-only Transformer
Authors: Md. Mahmudul Hasan, Ahmed Nesar Tahsin Choudhury, Mahmudul Hasan, Md. Mosaddek Khan
2025-09-22
Despite Bengali being the sixth most spoken language in the world,
handwritten text recognition (HTR) systems for Bengali remain severely
underdeveloped. The complexity of Bengali script--featuring conjuncts,
diacritics, and highly variable handwriting styles--combined with a scarcity of
annotated datasets makes this task particularly challenging. We present
GraDeT-HTR, a resource-efficient Bengali handwritten text recognition system
based on a Grapheme-aware Decoder-only Transformer architecture. To address the
unique challenges of Bengali script, we augment the performance of a
decoder-only Transformer by integrating a grapheme-based tokenizer and
demonstrate that it significantly improves recognition accuracy compared to
conventional subword tokenizers. Our model is pretrained on large-scale
synthetic data and fine-tuned on real human-annotated samples, achieving
state-of-the-art performance on multiple benchmark datasets.
TempSamp-R1 Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
Authors: Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng
2025-09-22
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework
designed to improve the effectiveness of adapting multimodal large language
models (MLLMs) to video temporal grounding tasks. We reveal that existing
reinforcement learning methods, such as Group Relative Policy Optimization
(GRPO), rely on on-policy sampling for policy updates. However, in tasks with
large temporal search spaces, this strategy becomes both inefficient and
limited in performance, as it often fails to identify temporally accurate
solutions. To address this limitation, TempSamp-R1 leverages ground-truth
annotations as off-policy supervision to provide temporally precise guidance,
effectively compensating for the sparsity and misalignment in on-policy
solutions. To further stabilize training and reduce variance in reward-based
updates, TempSamp-R1 provides a non-linear soft advantage computation method
that dynamically reshapes the reward feedback via an asymmetric transformation.
By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1
optimizes a single unified model to support both CoT and non-CoT inference
modes, enabling efficient handling of queries with varying reasoning
complexity. Experimental results demonstrate that TempSamp-R1 outperforms
GRPO-based baselines, establishing new state-of-the-art performance on
benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions
(R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover,
TempSamp-R1 shows robust few-shot generalization capabilities under limited
data. Code: https://github.com/HVision-NKU/TempSamp-R1
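For readers unfamiliar with the R1@0.7 and R1@0.5 figures above: in video
temporal grounding these are commonly defined as recall at rank 1 under a
temporal IoU threshold, i.e. the fraction of queries whose top predicted
segment overlaps the ground-truth segment with IoU at or above the threshold.
The sketch below shows that common definition; the function names are
illustrative and this is not the paper's evaluation script.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: Sequence[Segment], gts: Sequence[Segment], threshold: float) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Two toy queries: one well-localized prediction, one poor one.
preds = [(2.0, 10.0), (0.0, 5.0)]
gts = [(3.0, 10.0), (4.0, 10.0)]
score = recall_at_1(preds, gts, threshold=0.7)  # IoUs are 0.875 and 0.1
```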
RadEval A framework for radiology text evaluation
Authors: Justin Xu, Xi Zhang, Javid Abderezaei, Julie Bauml, Roger Boodoo, Fatemeh Haghighi, Ali Ganjizadeh, Eric Brattain, Dave Van Veen, Zaiqiao Meng, David Eyre, Jean-Benoit Delbrouck
2025-09-22
We introduce RadEval, a unified, open-source framework for evaluating
radiology texts. RadEval consolidates a diverse range of metrics, from classic
n-gram (BLEU, ROUGE) and contextual measures (BERTScore) to clinical
concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT,
TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and
standardize implementations, extend GREEN to support multiple imaging
modalities with a more lightweight model, and pretrain a domain-specific
radiology encoder, demonstrating strong zero-shot retrieval performance. We
also release a richly annotated expert dataset with over 450 clinically
significant error labels and show how different metrics correlate with
radiologist judgment. Finally, RadEval provides statistical testing tools and
baseline model evaluations across multiple publicly available datasets,
facilitating reproducibility and robust benchmarking in radiology report
generation.
Through the Lens of Human-Human Collaboration A Configurable Research Platform for Exploring Human-Agent Collaboration
Authors: Bingsheng Yao, Jiaju Chen, Chaoran Chen, April Wang, Toby Jia-jun Li, Dakuo Wang
2025-09-22
Intelligent systems have traditionally been designed as tools rather than
collaborators, often lacking critical characteristics that collaboration
partnerships require. Recent advances in large language model (LLM) agents open
new opportunities for human-LLM-agent collaboration by enabling natural
communication and various social and cognitive behaviors. Yet it remains
unclear whether principles of computer-mediated collaboration established in
HCI and CSCW persist, change, or fail when humans collaborate with LLM agents.
To support systematic investigations of these questions, we introduce an open
and configurable research platform for HCI researchers. The platform's modular
design allows seamless adaptation of classic CSCW experiments and manipulation
of theory-grounded interaction controls. We demonstrate the platform's
effectiveness and usability through two case studies: (1) re-implementing the
classic human-human-collaboration task Shape Factory as a between-subject
human-agent-collaboration experiment with 16 participants, and (2) a
participatory cognitive walkthrough with five HCI researchers to refine
workflows and interfaces for experiment setup and analysis.
Visual Detector Compression via Location-Aware Discriminant Analysis
Authors: Qizhen Lan, Jung Im Choi, Qing Tian
2025-09-22
Deep neural networks are powerful, yet their high complexity greatly limits
their potential to be deployed on billions of resource-constrained edge
devices. Pruning is a crucial network compression technique, yet most existing
methods focus on classification models, with limited attention to detection.
Even among those addressing detection, there is a lack of utilization of
essential localization information. Also, many
methods passively rely
on pre-trained models, in which useful and useless components are intertwined,
making it difficult to remove the latter without harming the former at the
neuron/filter level. To address the above issues, in this paper, we propose a
proactive detection-discriminants-based network compression approach for deep
visual detectors, which alternates between two steps: (1) maximizing and
compressing detection-related discriminants and aligning them with a subset of
neurons/filters immediately before the detection head, and (2) tracing the
detection-related discriminating power across the layers and discarding
features of lower importance. Object location information is exploited in both
steps. Extensive experiments, employing four advanced detection models and four
state-of-the-art competing methods on the KITTI and COCO datasets, highlight
the superiority of our approach. Remarkably, our compressed models can even
beat the original base models with a substantial reduction in complexity.
Benchmarking Humans and Machines on Complex Multilingual Speech Understanding Tasks
Authors: Sai Samrat Kankanala, Ram Chandra, Sriram Ganapathy
2025-09-22
Auditory attention and selective phase-locking are central to human speech
understanding in complex acoustic scenes and cocktail party settings, yet these
capabilities in multilingual subjects remain poorly understood. While machine
understanding of natural speech has advanced in recent years, questions persist
about comprehension of overlapped and mixed-channel speech. We propose a
systematic paradigm for studying humans and machines in speech
question-answering tasks in multilingual settings with clean and mixed-channel
speech. For human listeners, selective attention to a target speaker was
significantly better in their native language (L1) than in their second
language (L2). For machine listening, speech-based large language models (LLMs)
match or exceed human performance in clean, single-speaker conditions but often
struggle to selectively attend in two-speaker settings. These results reveal a
key divergence: humans rely on attentional cues that are more streamlined in
their native language, whereas LLMs default to parallel information extraction
which exceeds human skills.
Expert-as-a-Service Towards Efficient, Scalable, and Robust Large-scale MoE Serving
Authors: Ziming Liu, Boyu Tian, Guoteng Wang, Zhen Jiang, Peng Sun, Zhenhua Han, Tian Tang, Xiaohe Hu, Yanmin Jia, Yan Zhang, He Liu, Mingjun Zhang, Yiqi Zhang, Qiaoling Chen, Shenggan Cheng, Mingyu Gao, Yang You, Siyuan Feng
2025-09-22
Mixture-of-Experts (MoE) models challenge serving infrastructures with
dynamic, sparse expert utilization, causing instability on conventional systems
designed for dense architectures. We propose EaaS, a novel serving system to
enable efficient, scalable, and robust MoE deployment. Our system disaggregates
MoE modules into independent, stateless services. This design enables
fine-grained resource scaling and provides inherent fault tolerance by
decoupling compute units. The architecture is powered by a high-performance,
CPU-free peer-to-peer communication library that ensures minimal overhead and
high throughput. Experiments confirm EaaS's scalability and efficiency,
achieving performance comparable to monolithic systems while providing robust
fault tolerance and strong scalability. EaaS incurs less than a 2% throughput
reduction under simulated hardware failures that would otherwise halt
monolithic architectures. It further saves up to 37.5% of computing resources
through dynamic fine-grained adaptation to serving traffic, demonstrating
strong resilience for large-scale MoE deployment in production.
Towards General Computer Control with Hierarchical Agents and Multi-Level Action Spaces
Authors: Zihan Dong, Xinyu Fan, Zixiang Tang, Yunqing Li
2025-09-22
Controlling desktop applications via software remains a fundamental yet
under-served problem. Existing multi-modal large language models (MLLMs) ingest
screenshots and task instructions to generate keystrokes and mouse events, but
they suffer from prohibitive inference latency, poor sample efficiency on
long-horizon sparse-reward tasks, and infeasible on-device deployment. We
introduce a lightweight hierarchical reinforcement learning framework,
ComputerAgent, that formulates OS control as a two-level option process
(manager and subpolicy), employs a triple-modal state encoder (screenshot, task
ID, numeric state) to handle visual and contextual diversity, integrates
meta-actions with an early-stop mechanism to reduce wasted interactions, and
uses a compact vision backbone plus small policy networks for on-device
inference (15M parameters). On a suite of 135 real-world desktop tasks,
ComputerAgent attains 92.1% success on simple tasks (<8 steps) and 58.8% on
hard tasks (>=8 steps), matching or exceeding 200B-parameter MLLM baselines on
simple scenarios while reducing model size by over four orders of magnitude and
halving inference time. These results demonstrate that hierarchical RL offers a
practical, scalable alternative to monolithic M
-based automation for
computer control.
ConfClip Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs
Authors: Bonan Zhang, Zhongqi Chen, Bowen Song, Qinya Li, Fan Wu, Guihai Chen
2025-09-22
Reinforcement learning (RL) has become a standard paradigm for refining large
language models (LLMs) beyond pre-training and instruction tuning. A prominent
line of work is RL with verifiable rewards (RLVR), which leverages
automatically verifiable outcomes (e.g., correctness or executability) to
generate reward signals. While efficient, this framework faces two key
limitations: First, its binary feedback is too sparse to capture the quality of
the reasoning process. Second, its coarse-grained rewards potentially lead to
vanishing gradients. Inspired by observations from human learning, we introduce
an RL technique that integrates verifiable outcomes with the model's own
confidence estimates. This joint design enriches the reward signal, providing
finer-grained feedback and implicitly supervising the reasoning process.
Experimental results demonstrate that our proposed method enhances RL
performance across multiple datasets and reduces token consumption during
inference, while incurring negligible additional training cost. Moreover, it
can be used as a plug-in module to enhance other state-of-the-art RL methods.
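The abstract does not give the reward formula, so the following is only a minimal sketch of one way a binary verifiable outcome could be combined with a confidence estimate and then clipped. The geometric-mean confidence, the sign convention, and the clipping bounds `low` and `high` are all assumptions made for illustration, not the authors' definition.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric-mean token probability of a sampled answer, in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def confidence_weighted_reward(is_correct, token_logprobs, low=0.2, high=1.0):
    """Hypothetical reward: verifiable outcome scaled by clipped confidence.

    Correct answers earn a positive reward and incorrect ones a negative
    reward; the magnitude is the model's confidence clipped to [low, high],
    so the signal is graded rather than purely binary and never vanishes.
    """
    confidence = sequence_confidence(token_logprobs)
    magnitude = min(max(confidence, low), high)
    return magnitude if is_correct else -magnitude
```

Under these assumptions, a confidently wrong answer is penalised more heavily than a hesitant wrong one, while the lower clip keeps every sample contributing some gradient.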
When TableQA Meets Noise A Dual Denoising Framework for Complex Questions and Large-scale Tables
Authors: Shenghao Ye, Yu Guo, Dong Jin, Yikai Shen, Yunpeng Hou, Shuangwu Chen, Jian Yang, Xiaofeng Jiang
2025-09-22
Table question answering (TableQA) is a fundamental task in natural language
processing (NLP). The strong reasoning capabilities of large language models
(LLMs) have brought significant advances in this field. However, as real-world
applications involve increasingly complex questions and larger tables,
substantial noisy data is introduced, which severely degrades reasoning
performance. To address this challenge, we focus on improving two core
capabilities: Relevance Filtering, which identifies and retains information
truly relevant to reasoning, and Table Pruning, which reduces table size while
preserving essential content. Based on these principles, we propose EnoTab, a
dual denoising framework for complex questions and large-scale tables.
Specifically, we first perform Evidence-based Question Denoising by decomposing
the question into minimal semantic units and filtering out those irrelevant to
answer reasoning based on consistency and usability criteria. Then, we propose
Evidence Tree-guided Table Denoising, which constructs an explicit and
transparent table pruning path to remove irrelevant data step by step. At each
step, we observe the intermediate state of the table and apply a
post-order node rollback mechanism to handle abnormal table states, ultimately
producing a highly reliable sub-table for final answer reasoning. Finally,
extensive experiments show that EnoTab achieves outstanding performance on
TableQA tasks with complex questions and large-scale tables, confirming its
effectiveness.
Mechanistic Interpretability with SAEs Probing Religion, Violence, and Geography in Large Language Models
Authors: Katharina Simbeck, Mariam Mahran
2025-09-22
Despite growing research on bias in large language models (LLMs), most work
has focused on gender and race, with little attention to religious identity.
This paper explores how religion is internally represented in LLMs and how it
intersects with concepts of violence and geography. Using mechanistic
interpretability and Sparse Autoencoders (SAEs) via the Neuronpedia API, we
analyze latent feature activations across five models. We measure overlap
between religion- and violence-related prompts and probe semantic patterns in
activation contexts. While all five religions show comparable internal
cohesion, Islam is more frequently linked to features associated with violent
language. In contrast, geographic associations largely reflect real-world
religious demographics, revealing how models embed both factual distributions
and cultural stereotypes. These findings highlight the value of structural
analysis in auditing not just outputs but also internal representations that
shape model behavior.
Evict3R Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
Authors: Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
2025-09-22
Streaming visual transformers like StreamVGGT achieve strong 3D perception
but suffer from unbounded growth of key-value (KV) memory, which limits
scalability. We propose a training-free, inference-time token eviction policy
that bounds memory by discarding redundant tokens while keeping the most
informative ones. Our method uses significantly less memory with little to no
drop in accuracy: on 7-Scenes with long sequences it reduces peak memory from
18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under
strict memory budgets, eviction enables denser frame sampling, which improves
reconstruction accuracy compared to the baseline. Experiments across video
depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and
camera pose estimation (Sintel, TUM-dynamics) show that our approach closely
matches StreamVGGT at a fraction of the memory and makes long-horizon streaming
inference more practical.
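A budgeted eviction step of the kind described above can be sketched as follows. How tokens are scored is not stated in the abstract, so the per-token `score` values here are placeholders, and `evict_tokens` and `BoundedCache` are hypothetical names used only for illustration.

```python
def evict_tokens(scores, budget):
    """Return the indices (in original order) of the `budget` best tokens."""
    if budget >= len(scores):
        return list(range(len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


class BoundedCache:
    """Toy streaming cache that never holds more than `budget` tokens
    after each frame has been processed."""

    def __init__(self, budget):
        self.budget = budget
        self.tokens = []  # list of (token_id, score) pairs

    def append_frame(self, frame_tokens):
        # Add the new frame's tokens, then evict the lowest-scoring ones.
        self.tokens.extend(frame_tokens)
        keep = evict_tokens([score for _, score in self.tokens], self.budget)
        self.tokens = [self.tokens[i] for i in keep]
```

Because eviction runs after every frame, memory stays bounded by the budget plus one frame regardless of how long the stream runs.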
Bilateral Distribution Compression Reducing Both Data Size and Dimensionality
Authors: Dominic Broadbent, Nick Whiteley, Robert Allison, Tom Lovett
2025-09-22
Existing distribution compression methods reduce dataset size by minimising
the Maximum Mean Discrepancy (MMD) between original and compressed sets, but
modern datasets are often large in both sample size and dimensionality. We
propose Bilateral Distribution Compression (BDC), a two-stage framework that
compresses along both axes while preserving the underlying distribution, with
overall linear time and memory complexity in dataset size and dimension.
Central to BDC is the Decoded MMD (DMMD), which quantifies the discrepancy
between the original data and a compressed set decoded from a low-dimensional
latent space. BDC proceeds by (i) learning a low-dimensional projection using
the Reconstruction MMD (RMMD), and (ii) optimising a latent compressed set with
the Encoded MMD (EMMD). We show that this procedure minimises the DMMD,
guaranteeing that the compressed set faithfully represents the original
distribution. Experiments show that across a variety of scenarios BDC can
achieve comparable or superior performance to ambient-space compression at
substantially lower cost.
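For readers unfamiliar with the quantity being minimised, here is a minimal sketch of the standard (biased, V-statistic) estimate of squared MMD between two point sets. The Gaussian kernel and the `bandwidth` value are assumptions; the DMMD, RMMD, and EMMD variants named in the abstract are not reproduced here.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mean_kernel(A, B, bandwidth=1.0):
    """Average kernel value over all pairs drawn from A and B."""
    total = sum(gaussian_kernel(a, b, bandwidth) for a in A for b in B)
    return total / (len(A) * len(B))


def mmd_squared(X, Y, bandwidth=1.0):
    """Biased estimate of MMD^2 between samples X and Y."""
    return (
        mean_kernel(X, X, bandwidth)
        + mean_kernel(Y, Y, bandwidth)
        - 2.0 * mean_kernel(X, Y, bandwidth)
    )
```

A compressed set is a good summary when `mmd_squared(original, compressed)` is small. Note that this direct evaluation is quadratic in the number of points, unlike the linear-time procedure the abstract reports.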
Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs
Authors: Xing Chen, Rong Shi, Lu Zhao, Lingbin Wang, Xiao Jin, Yueqiang Chen, Hongfeng Sun
2025-09-22
LLM-based applications are widely used across industries, but as model size
grows, building an efficient large language model (LLM) inference system has
become an urgent problem for service providers. Inference consists of two
stages with different characteristics, Prefill and Decode, which interfere with
each other when run together. To address this, researchers have proposed P-D
disaggregated inference frameworks. However, current research targets
homogeneous GPUs and lacks deployment solutions based on business scenarios.
Compared with homogeneous GPUs, building inference systems from heterogeneous
GPUs can improve resource utilization and reduce costs. Using GPUs from
different vendors can likewise cut costs and improve resource utilization while
reducing dependence on a single vendor. Therefore, a P-D disaggregated
inference system based on heterogeneous GPUs is designed, with a
heterogeneous-compatible transmission module that addresses data compatibility
issues between heterogeneous GPUs. Then, a joint optimization algorithm for
parallel strategy and instance number allocation is proposed to obtain
deployment solutions. Finally, experimental results show that the P-D
disaggregated inference system handles hybrid inference across heterogeneous
GPUs from different vendors well, and that the joint optimization algorithm
obtains the optimal
deployment solution.
4DGCPro Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
Authors: Zihan Zheng, Zhenlong Wu, Houqiang Zhong, Yuan Tian, Ning Cao, Lan Xu, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Wenjun Zhang
2025-09-22
Achieving seamless viewing of high-fidelity volumetric video, comparable to
2D video experiences, remains an open challenge. Existing volumetric video
compression methods either lack the flexibility to adjust quality and bitrate
within a single model for efficient streaming across diverse networks and
devices, or struggle with real-time decoding and rendering on lightweight
mobile platforms. To address these challenges, we introduce 4DGCPro, a novel
hierarchical 4D Gaussian compression framework that facilitates real-time
mobile decoding and high-quality rendering via progressive volumetric video
streaming in a single bitstream. Specifically, we propose a
perceptually-weighted and compression-friendly hierarchical 4D Gaussian
representation with motion-aware adaptive grouping to reduce temporal
redundancy, preserve coherence, and enable scalable multi-level detail
streaming. Furthermore, we present an end-to-end entropy-optimized training
scheme, which incorporates layer-wise rate-distortion (RD) supervision and
attribute-specific entropy modeling for efficient bitstream generation.
Extensive experiments show that 4DGCPro enables flexible quality and multiple
bitrates within a single model, achieving real-time decoding and rendering on
mobile devices while outperforming existing methods in RD performance across
multiple datasets. Project Page: https://mediax-sjtu.github.io/4DGCPro
CorefInst Leveraging LLMs for Multilingual Coreference Resolution
Authors: Tuğba Pamay Arslan, Emircan Erol, Gülşen Eryiğit
2025-09-22
Coreference Resolution (CR) is a crucial yet challenging task in natural
language understanding, often constrained by task-specific architectures and
encoder-based language models that demand extensive training and lack
adaptability. This study introduces the first multilingual CR methodology which
leverages decoder-only LLMs to handle both overt and zero mentions. The article
explores how to model the CR task for LLMs via five different instruction sets
using a controlled inference method. The approach is evaluated across three
LLMs: Llama 3.1, Gemma 2, and Mistral 0.3. The results indicate that LLMs, when
instruction-tuned with a suitable instruction set, can surpass state-of-the-art
task-specific architectures. Specifically, our best model, a fully fine-tuned
Llama 3.1 for multilingual CR, outperforms the leading multilingual CR model
(i.e., Corpipe 24 single stage variant) by 2 pp on average across all languages
in the CorefUD v1.2 dataset collection.
Privacy in Action Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents
Authors: Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan
2025-09-22
The increasing autonomy of LLM agents in handling sensitive communications,
accelerated by Model Context Protocol (MCP) and Agent-to-Agent (A2A)
frameworks, creates urgent privacy challenges. While recent work reveals
significant gaps between LLMs' privacy Q&A performance and their agent
behavior, existing benchmarks remain limited to static, simplified scenarios.
We present PrivacyChecker, a model-agnostic, contextual integrity based
mitigation approach that effectively reduces privacy leakage from 36.08% to
7.30% on DeepSeek-R1 and from 33.06% to 8.32% on GPT-4o, all while preserving
task helpfulness. We also introduce PrivacyLens-Live, transforming static
benchmarks into dynamic MCP and A2A environments that reveal substantially
higher privacy risks in practice. Our modular mitigation approach integrates
seamlessly into agent protocols through three deployment strategies, providing
practical privacy protection for the emerging agentic ecosystem. Our data and
code will be made available at https://aka.ms/privacy_in_action.
Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks
Authors: Chaodong Tong, Qi Zhang, Lei Jiang, Yanbing Liu, Nannan Sun, Wei Li
2025-09-22
Reliable question answering with large language models (LLMs) is challenged
by hallucinations, fluent but factually incorrect outputs arising from
epistemic uncertainty. Existing entropy-based semantic-level uncertainty
estimation methods are limited by sampling noise and unstable clustering of
variable-length answers. We propose Semantic Reformulation Entropy (SRE), which
improves uncertainty estimation in two ways. First, input-side semantic
reformulations produce faithful paraphrases, expand the estimation space, and
reduce biases from superficial decoder tendencies. Second, progressive,
energy-based hybrid clustering stabilizes semantic grouping. Experiments on
SQuAD and TriviaQA show that SRE outperforms strong baselines, providing more
robust and generalizable hallucination detection. These results demonstrate
that combining input diversification with multi-signal clustering substantially
enhances semantic-level uncertainty estimation.
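The entropy underlying this family of methods can be illustrated with a short sketch: sampled answers are grouped into meaning clusters, and the entropy of the cluster distribution serves as the uncertainty score. The clustering itself (here supplied as precomputed cluster IDs) and the pooling of samples across paraphrased questions are simplifying assumptions, not the paper's energy-based hybrid clustering.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over clusters."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def pooled_semantic_entropy(clusters_per_reformulation):
    """Pool cluster IDs of answers sampled under several paraphrases of the
    same question, then compute one entropy over the pooled samples."""
    pooled = [cid for answers in clusters_per_reformulation for cid in answers]
    return semantic_entropy(pooled)
```

An entropy near zero means every sample fell into the same meaning cluster; higher values indicate disagreement, which is treated as a hallucination signal.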
QWHA Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
Authors: Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim
2025-09-22
The demand for efficient deployment of large language models (LLMs) has
driven interest in quantization, which reduces inference cost, and
parameter-efficient fine-tuning (PEFT), which lowers training overhead. This
motivated the development of quantization-aware PEFT to produce accurate yet
efficient quantized models. In this setting, reducing quantization error prior
to fine-tuning is crucial for achieving high model accuracy. However, existing
methods that rely on low-rank adaptation suffer from limited representational
capacity. Recent Fourier-related transform (FT)-based adapters offer greater
representational power than low-rank adapters, but their direct integration
into quantized models often results in ineffective error reduction and
increased computational overhead. To overcome these limitations, we propose
QWHA, a method that integrates FT-based adapters into quantized models by
employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together
with a novel adapter initialization scheme incorporating adaptive parameter
selection and value refinement. We demonstrate that QWHA effectively mitigates
quantization errors while facilitating fine-tuning, and that its design
substantially reduces computational cost. Experimental results show that QWHA
consistently outperforms baselines in quantization accuracy and
achieves significant training speedups over existing FT-based adapters. The
code is available at https://github.com/vantaa89/qwha.
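Part of the appeal of the Walsh-Hadamard Transform as a kernel is that it needs only additions and subtractions. The sketch below is the textbook fast WHT (unnormalised, natural ordering) and is not taken from the QWHA repository; how the adapter actually applies the transform is not shown here.

```python
def fwht(values):
    """Unnormalised fast Walsh-Hadamard transform of a length-2^k sequence.

    Runs in O(n log n) using only additions and subtractions.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    out = list(values)
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    return out
```

Applying `fwht` twice returns the input scaled by its length, so the inverse transform is the same routine followed by a division by `n`.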
DINVMark A Deep Invertible Network for Video Watermarking
Authors: Jianbin Ji, Dawen Xu, Li Dong, Lin Yang, Songhan He
2025-09-22
With the widespread use of video, video watermarking has become increasingly
crucial for copyright protection and content authentication. However, video
watermarking still faces numerous challenges. For example, existing methods
typically have shortcomings in terms of watermarking capacity and robustness,
and there is a lack of a specialized noise layer for High Efficiency Video
Coding (HEVC) compression. To address these issues, this paper introduces a Deep
Invertible Network for Video watermarking (DINVMark) and designs a noise layer
to simulate HEVC compression. This approach not only increases watermarking
capacity but also enhances robustness. DINVMark employs an Invertible Neural
Network (INN), where the encoder and decoder share the same network structure
for both watermark embedding and extraction. This shared architecture ensures
close coupling between the encoder and decoder, thereby improving the accuracy
of the watermark extraction process. Experimental results demonstrate that the
proposed scheme significantly enhances watermark robustness, preserves video
quality, and substantially increases watermark embedding capacity.
Interpreting vision transformers via residual replacement model
Authors: Jinyeong Kim, Junhyeok Kim, Yumin Shim, Joohyeok Kim, Sunyoung Jung, Seong Jae Hwang
2025-09-22
How do vision transformers (ViTs) represent and process the world? This paper
addresses this long-standing question through the first systematic analysis of
6.6K features across all layers, extracted via sparse autoencoders, and by
introducing the residual replacement model, which replaces ViT computations
with interpretable features in the residual stream. Our analysis reveals not
only a feature evolution from low-level patterns to high-level semantics, but
also how ViTs encode curves and spatial positions through specialized feature
types. The residual replacement model scalably produces a faithful yet
parsimonious circuit for human-scale interpretability by significantly
simplifying the original computations. As a result, this framework enables
intuitive understanding of ViT mechanisms. Finally, we demonstrate the utility
of our framework in debiasing spurious correlations.
EpiCache Episodic KV Cache Management for Long Conversational Question Answering
Authors: Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho
2025-09-22
Modern large language models (LLMs) extend context lengths to up to millions
of tokens, enabling AI assistants to generate coherent and personalized
responses grounded in long conversational histories. This ability, however,
hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue
length and quickly becomes the bottleneck in resource-constrained environments.
An active line of research for reducing this memory bottleneck is KV cache
compression, which seeks to limit KV cache size while preserving accuracy. Yet
existing methods face two major limitations: (i) evicting the KV cache after
full-context prefill causes unbounded peak memory, and (ii) query-dependent
eviction narrows the cache to a single query, leading to failure cases in
multi-turn conversations. We introduce EpiCache, a training-free KV cache
management framework for long conversational question answering (LongConvQA)
under fixed memory budgets. EpiCache bounds cache growth through block-wise
prefill and preserves topic-relevant context via episodic KV compression, which
clusters conversation history into coherent episodes and applies
episode-specific KV cache eviction. We further design an adaptive layer-wise
budget allocation strategy that measures each layer's sensitivity to eviction
and distributes the memory budget across layers accordingly. Across three
LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent
baselines, sustains near-full KV accuracy under 4-6x compression, and reduces
latency and memory by up to 2.4x and 3.5x, thereby enabling efficient
multi-turn interaction under strict resource constraints.
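The peak-memory argument above can be made concrete with a small sketch. It compares the number of cached tokens at the worst moment when eviction happens only after processing the whole context versus after every block, and converts token counts to bytes with the usual KV-cache size formula. The model dimensions used in the usage note are illustrative assumptions, not values from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes for keys plus values across all layers (hence the factor of 2)."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def peak_tokens_full_context(total_tokens):
    """Evicting only after the full context is processed: peak equals the context."""
    return total_tokens


def peak_tokens_blockwise(total_tokens, block_size, budget):
    """Process the context block by block, evicting down to `budget` after each."""
    cached = 0
    peak = 0
    for start in range(0, total_tokens, block_size):
        cached += min(block_size, total_tokens - start)
        peak = max(peak, cached)
        cached = min(cached, budget)  # eviction step
    return peak
```

With a hypothetical 32-layer model using 8 KV heads of dimension 128 in 16-bit precision, a 10,000-token history peaks at 10,000 cached tokens in the first scheme but at only 2,560 (budget 2,048 plus one 512-token block) in the second.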
Multimodal Health Risk Prediction System for Chronic Diseases via Vision-Language Fusion and Large Language Models
Authors: Dingxin Lu, Shurui Wu, Xinyi Huang
2025-09-22
With the rising global burden of chronic diseases and the multimodal and
heterogeneous clinical data (medical imaging, free-text recordings, wearable
sensor streams, etc.), there is an urgent need for a unified multimodal AI
framework that can proactively predict individual health risks. We propose
VL-RiskFormer, a hierarchical stacked visual-language multimodal Transformer
with a large language model (LLM) inference head embedded in its top layer. The
system builds on the dual-stream architecture of existing visual-linguistic
models (e.g., PaLM-E, LLaVA) with four key innovations: (i) pre-training with
cross-modal comparison and fine-grained alignment of radiological images,
fundus maps, and wearable device photos with corresponding clinical narratives
using momentum update encoders and debiased InfoNCE losses; (ii) a time fusion
block that integrates irregular visit sequences into the causal Transformer
decoder through adaptive time interval position coding; (iii) a disease
ontology map adapter that injects ICD-10 codes into visual and textual channels
in layers and infers comorbid patterns with the help of a graph attention
mechanism. On the MIMIC-IV longitudinal cohort, VL-RiskFormer achieved an
average AUROC of 0.90 with an expected calibration error of 2.7 percent.
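Since the result is reported as an expected calibration error (ECE), a minimal sketch of one common binned formulation for binary risk prediction may help. Equal-width binning over the predicted probability and the choice of ten bins are assumptions; the abstract does not say which variant was used.

```python
def expected_calibration_error(probabilities, labels, num_bins=10):
    """Binned ECE for binary outcomes.

    `probabilities` are predicted probabilities of the positive outcome and
    `labels` are the observed 0/1 outcomes. Each bin contributes the gap
    between its mean prediction and its observed positive rate, weighted by
    the fraction of samples in the bin.
    """
    total = len(probabilities)
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probabilities, labels):
        index = min(int(p * num_bins), num_bins - 1)
        bins[index].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prediction = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(observed_rate - mean_prediction)
    return ece
```

An ECE of 2.7 percent corresponds to a return value of about 0.027 from this function.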
Asteria Semantic-Aware Cross-Region Caching for Agentic LLM Tool Access
Authors: Chaoyi Ruan, Chao Bi, Kaiwen Zheng, Ziji Shi, Xinyi Wan, Jialin Li
2025-09-22
Large Language Model (LLM) agents tackle data-intensive tasks such as deep
research and code generation. However, their effectiveness depends on frequent
interactions with knowledge sources across remote clouds or regions. Such
interactions can create non-trivial latency and cost bottlenecks. Existing
caching solutions focus on exact-match queries, limiting their effectiveness
for semantic knowledge reuse.
To address this challenge, we introduce Asteria, a novel cross-region
knowledge caching architecture for LLM agents. At its core are two
abstractions: Semantic Element (SE) and Semantic Retrieval Index (Sine). A
semantic element captures the semantic embedding representation of an LLM query
together with performance-aware metadata such as latency, cost, and staticity.
Sine then provides two-stage retrieval: a vector similarity index with semantic
embedding for fast candidate selection and a lightweight LLM-powered semantic
judger for precise validation. Atop these primitives, Asteria builds a new cache
interface that includes a new semantic-aware cache hit definition, a
cost-efficient eviction policy, and proactive prefetching. To reduce overhead,
Asteria co-locates the small LLM judger with the main LLM using adaptive
scheduling and resource sharing. Our evaluation demonstrates that Asteria
delivers substantial performance improvements without compromising correctness.
On representative search workloads, Asteria achieves up to a 3.6x
increase in throughput by maintaining cache hit rates of over 85%, while
preserving accuracy virtually identical to non-cached baselines. Asteria also
improves throughput for complex coding tasks by 20%, showcasing its versatility
across diverse agentic workloads.
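The two-stage lookup described above (similarity search for candidates, then a judge for validation) can be sketched as below. This is not Asteria's interface: `SemanticCache`, the linear scan standing in for a vector index, the `threshold` and `top_k` values, and the caller-supplied `judge` callable standing in for the small judger model are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class SemanticCache:
    def __init__(self, judge, threshold=0.8, top_k=3):
        self.judge = judge          # callable(new_query, cached_query) -> bool
        self.threshold = threshold  # minimum similarity to be a candidate
        self.top_k = top_k          # candidates passed to the judge
        self.entries = []           # list of (embedding, query, answer)

    def put(self, embedding, query, answer):
        self.entries.append((embedding, query, answer))

    def get(self, embedding, query):
        # Stage 1: cheap candidate selection by embedding similarity.
        scored = [(cosine(embedding, e), q, a) for e, q, a in self.entries]
        candidates = [s for s in scored if s[0] >= self.threshold]
        candidates.sort(key=lambda s: s[0], reverse=True)
        # Stage 2: precise validation; only a judged match counts as a hit.
        for _, cached_query, answer in candidates[: self.top_k]:
            if self.judge(query, cached_query):
                return answer
        return None
```

Keeping the judge as the final gate is what lets a similarity threshold be permissive without returning answers to merely similar-looking queries.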
Cronus Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill
Authors: Yunzhao Liu, Qiang Xu, Y. Charlie Hu
2025-09-22
Efficient LLM inference is critical for real-world applications, especially
within heterogeneous GPU clusters commonly found in organizations and
on-premise datacenters as GPU architecture rapidly evolves. Current
disaggregated prefill strategies, which separate the prefill and decode stages
of LLM inference across different GPUs, often suffer from suboptimal
performance due to imbalances between GPU capabilities and workload demands. On
the other hand, extending conventional data parallelism (DP) and pipeline
parallelism (PP) to heterogeneous setups incurs high inference latencies. To address
these challenges, we introduce Cronus, a novel LLM inference system designed to
dynamically balance workloads across heterogeneous GPUs using partially
disaggregated prefill. Cronus partitions each prefill stage and executes its
initial portion on the low-end GPU, while overlapping the remaining prefill and
decode stages of earlier requests on the high-end GPU. Extensive evaluations
across various high-end and low-end GPU combinations demonstrate that Cronus
significantly improves the throughput over disaggregated prefill. It also
reduces TTFT P99 and TBT P99 significantly over DP and PP while maintaining
similar or better throughput.
Compact representation of transonic airfoil buffet flows with observable-augmented machine learning
Authors: Kai Fukami, Yuta Iwatani, Soju Maejima, Hiroyuki Asada, Soshi Kawai
2025-09-22
Transonic buffet presents time-dependent aerodynamic characteristics
associated with shock, turbulent boundary layer, and their interactions.
Despite strong nonlinearities and a large degree of freedom, there exists a
dominant dynamic pattern of a buffet cycle, suggesting the low dimensionality
of transonic buffet phenomena. This study seeks a low-dimensional
representation of transonic airfoil buffet at a high Reynolds number with
machine learning. Wall-modeled large-eddy simulations of flow over the OAT15A
supercritical airfoil at two Mach numbers, and 0.730,
respectively producing non-buffet and buffet conditions, at a chord-based
Reynolds number of are performed to generate the present
datasets. We find that the low-dimensional nature of transonic airfoil buffet
can be extracted as a sole three-dimensional latent representation through
lift-augmented autoencoder compression. The current low-order representation
not only describes the shock movement but also captures the moment when the
separation occurs near the trailing edge in a low-order manner. We further show
that it is possible to perform sensor-based reconstruction through the present
low-dimensional expression while identifying the sensitivity with respect to
aerodynamic responses. The present model trained at is
lastly evaluated at the level of a real aircraft operation of , exhibiting that the phase dynamics of lift is reasonably estimated from
sensors. The current study may provide a foundation toward data-driven
real-time analysis of transonic buffet conditions under aircraft operation.
Rational Multi-Modal Transformers for TCR-pMHC Prediction
Authors: Jiarui Li, Zixiang Yin, Zhengming Ding, Samuel J. Landry, Ramgopal R. Mettu
2025-09-22
T cell receptor (TCR) recognition of peptide-MHC (pMHC) complexes is
fundamental to adaptive immunity and central to the development of T cell-based
immunotherapies. While transformer-based models have shown promise in
predicting TCR-pMHC interactions, most lack a systematic and explainable
approach to architecture design. We present an approach that uses a new
post-hoc explainability method to inform the construction of a novel
encoder-decoder transformer model. By identifying the most informative
combinations of TCR and epitope sequence inputs, we optimize cross-attention
strategies, incorporate auxiliary training objectives, and introduce a novel
early-stopping criterion based on explanation quality. Our framework achieves
state-of-the-art predictive performance while simultaneously improving
explainability, robustness, and generalization. This work establishes a
principled, explanation-driven strategy for modeling TCR-pMHC binding and
offers mechanistic insights into sequence-level binding behavior through the
lens of deep learning.
Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection
Authors: Jun Seo Kim, Hyemi Kim, Woo Joo Oh, Hongjin Cho, Hochul Lee, Hye Hyeon Kim
2025-09-22
Cognitive distortions have been closely linked to mental health disorders,
yet their automatic detection remained challenging due to contextual ambiguity,
co-occurrence, and semantic overlap. We proposed a novel framework that
combines Large Language Models (LLMs) with a Multiple-Instance Learning (MIL)
architecture to enhance interpretability and expression-level reasoning. Each
utterance was decomposed into Emotion, Logic, and Behavior (ELB) components,
which were processed by LLMs to infer multiple distortion instances, each with
a predicted type, expression, and model-assigned salience score. These
instances were integrated via a Multi-View Gated Attention mechanism for final
classification. Experiments on Korean (KoACD) and English (Therapist QA)
datasets demonstrate that incorporating ELB and LLM-inferred salience scores
improves classification performance, especially for distortions with high
interpretive ambiguity. Our results suggested a psychologically grounded and
generalizable approach for fine-grained reasoning in mental health NLP.
DeepASA An Object-Oriented One-for-All Network for Auditory Scene Analysis
Authors: Dongheon Lee, Younghoo Kwon, Jung-Woo Choi
2025-09-21
We propose DeepASA, a one-for-all model for auditory scene analysis that
performs multi-input multi-output (MIMO) source separation, dereverberation,
sound event detection (SED), audio classification, and direction-of-arrival
estimation (DoAE) within a unified framework. DeepASA is designed for complex
auditory scenes where multiple, often similar, sound sources overlap in time
and move dynamically in space. To achieve robust and consistent inference
across tasks, we introduce an object-oriented processing (OOP) strategy. This
approach encapsulates diverse auditory features into object-centric
representations and refines them through a chain-of-inference (CoI) mechanism.
The pipeline comprises a dynamic temporal kernel-based feature extractor, a
transformer-based aggregator, and an object separator that yields per-object
features. These features feed into multiple task-specific decoders. Our
object-centric representations naturally resolve the parameter association
ambiguity inherent in traditional track-wise processing. However, early-stage
object separation can lead to failure in downstream ASA tasks. To address this,
we implement temporal coherence matching (TCM) within the chain-of-inference,
enabling multi-task fusion and iterative refinement of object features using
estimated auditory parameters. We evaluate DeepASA on representative spatial
audio benchmark datasets, including ASA2, MC-FUSS, and STARSS23. Experimental
results show that our model achieves state-of-the-art performance across all
evaluated tasks, demonstrating its effectiveness in both source separation and
auditory parameter estimation under diverse spatial auditory scenes.
MoEs Are Stronger than You Think Hyper-Parallel Inference Scaling with RoE
Authors: Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho
2025-09-21
The generation quality of large language models (LLMs) is often improved by
utilizing inference-time sequence-level scaling methods (e.g.,
Chain-of-Thought). We introduce hyper-parallel scaling, a complementary
framework that improves prediction quality at the token level. Hyper-parallel
scaling computes and aggregates multiple output proposals for a single token
from the model. We implement this concept in Mixture-of-Experts (MoE) models,
which we refer to as Roster of Experts (RoE). RoE is a training-free inference
algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects
controlled stochasticity into the expert routing mechanism, enabling it to
sample multiple diverse experts for each token and aggregate their outputs for
a more accurate final prediction. To overcome the computational cost, we
introduce an efficient batching strategy and a specialized KV-caching mechanism
that minimizes compute and memory overhead. For example, RoE enables a 7B MoE
model to match the performance of a 10.5B MoE model while using 30% less
compute for inference. These gains are achieved without any fine-tuning of
model parameters.
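The idea of sampling several routings for one token and aggregating the results can be sketched as follows. The abstract does not specify the noise distribution or the aggregation rule, so Gumbel noise on the router logits, uniform averaging, and the names `noisy_top_k` and `roster_forward` are assumptions made for illustration; experts are plain Python callables rather than network layers.

```python
import math
import random


def noisy_top_k(router_logits, k, temperature, rng):
    """Pick k experts after perturbing router logits with Gumbel noise.

    temperature = 0 recovers ordinary deterministic top-k routing.
    """
    noisy = []
    for logit in router_logits:
        u = max(rng.random(), 1e-12)          # avoid log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append(logit + temperature * gumbel)
    ranked = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)
    return ranked[:k]


def roster_forward(x, experts, router_logits, k=1, num_samples=4,
                   temperature=1.0, seed=0):
    """Average the outputs of several stochastic routings for one token."""
    rng = random.Random(seed)
    proposals = []
    for _ in range(num_samples):
        chosen = noisy_top_k(router_logits, k, temperature, rng)
        proposals.append(sum(experts[i](x) for i in chosen) / k)
    return sum(proposals) / num_samples
```

Each sample is an independent forward pass through a differently routed model, which is why the abstract pairs the method with batching and caching to contain the extra compute.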
SignalLLM A General-Purpose LLM Agent Framework for Automated Signal Processing
Authors: Junlong Ke, Qiying Hu, Shenghai Yuan, Yuecong Xu, Jianfei Yang
2025-09-21
Modern signal processing (SP) pipelines, whether model-based or data-driven,
often constrained by complex and fragmented workflow, rely heavily on expert
knowledge and manual engineering, and struggle with adaptability and
generalization under limited data. In contrast, Large Language Models (s)
offer strong reasoning capabilities, broad general-purpose knowledge,
in-context learning, and cross-modal transfer abilities, positioning them as
powerful tools for automating and generalizing SP workflows. Motivated by these
potentials, we introduce Signal
, the first general-purpose
-based agent
framework for general SP tasks. Unlike prior
-based SP approaches that are
limited to narrow applications or tricky prompting, Signal
introduces a
principled, modular architecture. It decomposes high-level SP goals into
structured subtasks via in-context learning and domain-specific retrieval,
followed by hierarchical planning through adaptive retrieval-augmented
generation (RAG) and refinement; these subtasks are then executed through
prompt-based reasoning, cross-modal reasoning, code synthesis, model
invocation, or data-driven LLM-assisted modeling. Its generalizable design
enables the flexible selection of problem solving strategies across different
signal modalities, task types, and data conditions. We demonstrate the
versatility and effectiveness of SignalLLM through five representative tasks in
communication and sensing, such as radar target detection, human activity
recognition, and text compression. Experimental results show superior
performance over traditional and existing LLM-based methods, particularly in
few-shot and zero-shot settings.
MAST Multi-Agent Spatial Transformer for Learning to Collaborate
Authors: Damian Owerko, Frederic Vatnsdal, Saurav Agarwal, Vijay Kumar, Alejandro Ribeiro
2025-09-21
This article presents a novel multi-agent spatial transformer (MAST) for
learning communication policies in large-scale decentralized and collaborative
multi-robot systems (DC-MRS). Challenges in collaboration in DC-MRS arise from:
(i) partially observable states, as robots perceive only their local
surroundings, (ii) limited communication range with no central server, and
(iii) independent
execution of actions. The robots need to optimize a common task-specific
objective, which, under the restricted setting, must be done using a
communication policy that exhibits the desired collaborative behavior. The
proposed MAST is a decentralized transformer architecture that learns
communication policies to compute abstract information to be shared with other
agents and processes the received information with the robot's own
observations. The MAST extends the standard transformer with new positional
encoding strategies and attention operations that employ windowing to limit the
receptive field for MRS. These are designed for local computation,
shift-equivariance, and permutation equivariance, making it a promising
approach for DC-MRS. We demonstrate the efficacy of MAST on decentralized
assignment and navigation (DAN) and decentralized coverage control. Efficiently
trained using imitation learning in a centralized setting, the decentralized
MAST policy is robust to communication delays, scales to large teams, and
performs better than the baselines and other learning-based approaches.
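
The windowing and equivariance properties mentioned above can be sketched with a toy attention routine. This is not the MAST architecture: features are scalars, there are no learned projections, and the score `f_i * f_j - distance` is an assumption chosen only so that attention depends on relative positions. The function name `windowed_attention` and the `radius` parameter are hypothetical.

```python
import math

def windowed_attention(positions, features, radius):
    """Each agent attends only to agents within `radius` of itself.

    Scores use relative distance, never absolute coordinates, so shifting
    every agent by the same offset leaves the output unchanged.
    """
    out = []
    for i, (xi, yi) in enumerate(positions):
        scores, values = [], []
        for j, (xj, yj) in enumerate(positions):
            dist = math.hypot(xi - xj, yi - yj)
            if dist <= radius:  # window limits the receptive field
                scores.append(features[i] * features[j] - dist)
                values.append(features[j])
        m = max(scores)  # an agent always sees itself, so scores is non-empty
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append(sum(e / total * v for e, v in zip(exps, values)))
    return out

positions = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
features = [1.0, 2.0, 3.0]
out = windowed_attention(positions, features, radius=2.0)

# Translate the whole team; the result should not change.
shifted = [(x + 10.0, y - 5.0) for x, y in positions]
out_shifted = windowed_attention(shifted, features, radius=2.0)
print(out, out_shifted)
```

The third agent lies outside every other agent's window, so it attends only to itself and its output equals its own feature.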
Attention Consistency for LLMs Explanation
Authors: Tian Lan, Jinyuan Xu, Xue He, Jenq-Neng Hwang, Lei Li
2025-09-21
Understanding the decision-making processes of large language models (LLMs)
is essential for their trustworthy development and deployment. However, current
interpretability methods often face challenges such as low resolution and high
computational cost. To address these limitations, we propose the
Multi-Layer Attention Consistency Score (MACS), a novel, lightweight,
and easily deployable heuristic for estimating the importance of input tokens
in decoder-based models. MACS measures contributions of input tokens based on
the consistency of maximal attention. Empirical evaluations demonstrate that
MACS achieves a favorable trade-off between interpretability quality and
computational efficiency, showing faithfulness comparable to complex techniques
with a 22% decrease in VRAM usage and a 30% reduction in latency.
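
The abstract does not give the MACS formula, so the following is only one hypothetical reading of "consistency of maximal attention", not the authors' method. It assumes that a token's per-layer signal is the largest attention weight it receives from any query, and that consistency means this signal stays high in every layer (taken here as the minimum across layers). The function names are invented for illustration.

```python
def max_attention_per_token(layer_attn):
    """layer_attn[q][k] is the attention from query q to key k.

    Returns, for each key token, the largest attention it receives.
    """
    n_keys = len(layer_attn[0])
    return [max(row[k] for row in layer_attn) for k in range(n_keys)]

def consistency_score(attn_by_layer):
    """Hypothetical score: a token matters if it is strongly attended in every layer."""
    per_layer = [max_attention_per_token(a) for a in attn_by_layer]
    n_tokens = len(per_layer[0])
    return [min(layer[k] for layer in per_layer) for k in range(n_tokens)]

# Two toy layers over three tokens (each row sums to 1).
layer_1 = [[0.7, 0.2, 0.1],
           [0.6, 0.3, 0.1],
           [0.5, 0.1, 0.4]]
layer_2 = [[0.8, 0.1, 0.1],
           [0.1, 0.8, 0.1],
           [0.6, 0.3, 0.1]]

scores = consistency_score([layer_1, layer_2])
print(scores)  # [0.7, 0.3, 0.1]
```

In this toy example the first token scores highest because it is strongly attended in both layers, while the second token is penalized for being weakly attended in the first layer.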
Analyzing Memory Effects in Large Language Models through the lens of Cognitive Psychology
Authors: Zhaoyang Cao, Lael Schooler, Reza Zafarani
2025-09-21
Memory, a fundamental component of human cognition, exhibits adaptive yet
fallible characteristics as illustrated by Schacter's memory "sins". These
cognitive phenomena have been studied extensively in psychology and
neuroscience, but the extent to which artificial systems, specifically Large
Language Models (LLMs), emulate these cognitive phenomena remains
underexplored. This study uses human memory research as a lens for
understanding LLMs and systematically investigates human memory effects in
state-of-the-art LLMs using paradigms drawn from psychological research. We
evaluate seven key memory phenomena, comparing human behavior to LLM
performance. Both people and models remember less when overloaded with
information (list length effect) and remember better with repeated exposure
(list strength effect). They also show similar difficulties when retrieving
overlapping information, where storing too many similar facts leads to
confusion (fan effect). Like humans, LLMs are susceptible to falsely
"remembering" words that were never shown but are related to others (false
memories), and they can apply prior learning to new, related situations
(cross-domain generalization). However, LLMs differ in two key ways: they are
less influenced by the order in which information is presented (positional
bias) and more robust when processing random or meaningless material (nonsense
effect). These results reveal both alignments and divergences in how LLMs and
humans reconstruct memory. The findings help clarify how memory-like behavior
in LLMs echoes core features of human cognition, while also highlighting the
architectural differences that lead to distinct patterns of error and success.