2025-09-28

Quantized Visual Geometry Grounded Transformer
Nova Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization
Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
Semantic Edge-Cloud Communication for Real-Time Urban Traffic Surveillance with ViT and LLMs over Mobile Networks
Explaining Fine Tuned LLMs via Counterfactuals A Knowledge Graph Driven Framework
Tree Search for LLM Agent Reinforcement Learning
A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
Who's Laughing Now? An Overview of Computational Humour Generation and Explanation
GRPO is Secretly a Process Reward Model
CAD-Tokenizer Towards Text-based CAD Prototyping via Modality-Specific Tokenization
UniSS Unified Expressive Speech-to-Speech Translation with Your Voice
Acoustic-based Gender Differentiation in Speech-aware Language Models
TyphoonMLA A Mixed Naive-Absorb MLA Kernel For Shared Prefix
KeyWorld Key Frame Reasoning Enables Effective and Efficient World Models
Binary Autoencoder for Mechanistic Interpretability of Large Language Models
Toward Robust and Efficient ML-Based GPU Caching for Modern Inference
MemLens Uncovering Memorization in LLMs with Activation Trajectories
Punching Above Precision Small Quantized Model Distillation with Learnable Regularizer
SPADE Structured Pruning and Adaptive Distillation for Efficient LLM-TTS
Towards Atoms of Large Language Models
Recidivism and Peer Influence with LLM Text Embeddings in Low Security Correctional Facilities
Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning
CHOIR A Chatbot-mediated Organizational Memory Leveraging Communication in University Research Labs
MARS toward more efficient multi-agent collaboration for LLM reasoning
Shared Neural Space Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision
Seedream 4.0 Toward Next-generation Multimodal Image Generation
Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing
SIM-CoT Supervised Implicit Chain-of-Thought
Multimodal Representation-disentangled Information Bottleneck for Multimodal Recommendation
Q-Palette Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
From Text to Talk Audio-Language Model Needs Non-Autoregressive Joint Training
Projective Kolmogorov Arnold Neural Networks (P-KANs) Entropy-Driven Functional Space Discovery for Interpretable Machine Learning
Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks
MeshMosaic Scaling Artist Mesh Generation via Local-to-Global Assembly
RAD Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis
FastEagle Cascaded Drafting for Accelerating Speculative Decoding
Exploration with Foundation Models Capabilities, Limitations, and Hybrid Approaches
Future Policy Aware Preference Learning for Mathematical Reasoning
Structuring Collective Action with LLM-Guided Evolution From Ill-Structured Problems to Executable Heuristics
CollaPipe Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks
BurstEngine an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens
MMedFD A Real-world Healthcare Benchmark for Multi-turn Full-Duplex Automatic Speech Recognition
Gyges Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference
Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling
Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation
Transformer Modeling for Both Scalability and Performance in Multivariate Time Series
CompLLM Compression for Long Context Q&A
Online Process Reward Leanring for Agentic Reinforcement Learning
Reading Images Like Texts Sequential Image Understanding in Vision-Language Models
BiGraspFormer End-to-End Bimanual Grasp Transformer
Clapping Removing Per-sample Storage for Pipeline Parallel Distributed Optimization with Communication Compression
HD-PPT Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS
Benchmarking PDF Accessibility Evaluation A Dataset and Framework for Assessing Automated and LLM-Based Approaches for Accessibility Testing
Confidential LLM Inference Performance and Cost Across CPU and GPU TEEs
FlashGMM Fast Gaussian Mixture Entropy Model for Learned Image Compression
Bi-VLM Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models
HyperCool Reducing Encoding Cost in Overfitted Codecs with Hypernetworks
PIE Perception and Interaction Enhanced End-to-End Motion Planning for Autonomous Driving
FlexSED Towards Open-Vocabulary Sound Event Detection
OmniFed A Modular Framework for Configurable Federated Learning from Edge to HPC
LLMZ+ Contextual Prompt Whitelist Principles for Agentic LLMs
Individualized non-uniform quantization for vector search
LAWCAT Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling
NormGenesis Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery
Efficient Particle Acceleration in 2.5-Dimensional, Hybrid-Kinetic Simulations of Decaying, Supersonic, Plasma Turbulence
Chiplet-Based RISC-V SoC with Modular AI Acceleration
Speculate Deep and Accurate Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding
Exploiting Tree Structure for Credit Assignment in RL Training of LLMs
Evaluating Large Language Models for Detecting Antisemitism
Spiffy Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
GraDeT-HTR A Resource-Efficient Bengali Handwritten Text Recognition System utilizing Grapheme-based Tokenizer and Decoder-only Transformer
TempSamp-R1 Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
RadEval A framework for radiology text evaluation
Through the Lens of Human-Human Collaboration A Configurable Research Platform for Exploring Human-Agent Collaboration
Visual Detector Compression via Location-Aware Discriminant Analysis
Benchmarking Humans and Machines on Complex Multilingual Speech Understanding Tasks
Expert-as-a-Service Towards Efficient, Scalable, and Robust Large-scale MoE Serving
Towards General Computer Control with Hierarchical Agents and Multi-Level Action Spaces
ConfClip Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs
When TableQA Meets Noise A Dual Denoising Framework for Complex Questions and Large-scale Tables
Mechanistic Interpretability with SAEs Probing Religion, Violence, and Geography in Large Language Models
Evict3R Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
Bilateral Distribution Compression Reducing Both Data Size and Dimensionality
Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs
4DGCPro Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
CorefInst Leveraging LLMs for Multilingual Coreference Resolution
Privacy in Action Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents
Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks
QWHA Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
DINVMark A Deep Invertible Network for Video Watermarking
Interpreting vision transformers via residual replacement model
EpiCache Episodic KV Cache Management for Long Conversational Question Answering
Multimodal Health Risk Prediction System for Chronic Diseases via Vision-Language Fusion and Large Language Models
Asteria Semantic-Aware Cross-Region Caching for Agentic LLM Tool Access
Cronus Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill
Compact representation of transonic airfoil buffet flows with observable-augmented machine learning
Rational Multi-Modal Transformers for TCR-pMHC Prediction
Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection
DeepASA An Object-Oriented One-for-All Network for Auditory Scene Analysis
MoEs Are Stronger than You Think Hyper-Parallel Inference Scaling with RoE
SignalLLM A General-Purpose LLM Agent Framework for Automated Signal Processing
MAST Multi-Agent Spatial Transformer for Learning to Collaborate
Attention Consistency for LLMs Explanation
Analyzing Memory Effects in Large Language Models through the lens of Cognitive Psychology

Quantized Visual Geometry Grounded Transformer

Authors: Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu

2025-09-25

http://arxiv.org/abs/2509.21302v1

Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. Their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first quantization framework for VGGTs, namely QuantVGGT. It relies mainly on two technical contributions. First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT can deliver a 3.7 $\times$ memory reduction and a 2.5 $\times$ acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98\% of its full-precision counterpart. This demonstrates the advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.
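
The role of a Hadamard rotation before quantization can be illustrated with a minimal sketch. It is not QuantVGGT's Dual-Smoothed Fine-Grained Quantization; it only shows the general idea that an orthonormal Hadamard rotation spreads a single outlier across all channels, shrinking the peak value a quantizer has to cover while leaving the vector's energy unchanged. The helper names (`hadamard`, `rotate`) and the example vector are assumptions made for illustration.

```python
import math


def hadamard(n):
    """Return an n x n orthonormal Hadamard matrix (Sylvester construction).

    n must be a power of two. The result is symmetric and is its own inverse.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # Sylvester step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def rotate(h, x):
    """Multiply matrix h by vector x."""
    return [sum(hij * xj for hij, xj in zip(row, x)) for row in h]


# Hypothetical activation vector with one heavy-tailed outlier (8.0).
x = [0.1, -0.2, 0.15, 8.0, -0.05, 0.1, 0.2, -0.1]
H = hadamard(len(x))
x_rot = rotate(H, x)

peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in x_rot)
print(f"peak before rotation: {peak_before:.3f}")
print(f"peak after rotation:  {peak_after:.3f}")
```

Because the rotation is orthonormal, applying `rotate(H, ...)` a second time recovers the original vector, so the rotation itself adds no error; any error comes only from the quantizer applied in the rotated space.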

Nova Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization

Authors: Yuhang Xu, Shengzhong Liu, Dong Zhang, Bingheng Yan, Fan Wu, Guihai Chen

2025-09-25

http://arxiv.org/abs/2509.21301v1

This paper presents Nova, a real-time scheduling framework for serving agentic vision-language models (VLMs) on a single GPU with balanced per-request latency and overall request-processing throughput. Our design begins by enabling effective pipelining across the vision encode, LLM prefill, and LLM decode stages of VLMs, by exploiting their heterogeneous resource demands during execution and incorporating elastic GPU spatial partitioning among stages to maximally utilize the compute and memory resources. Building on this, we introduce a real-time scheduling algorithm that adaptively calibrates resource allocation among stages based on a Pareto-optimal analysis of the latency-throughput trade-off, allowing the system to sustain responsiveness and resource efficiency under dynamic request loads. To further alleviate GPU memory pressure, we design a lightweight weight offloading strategy for vision encoders that preserves inference efficiency with minimized memory overhead. Extensive evaluations on both synthetic and real-world agent workloads demonstrate that Nova consistently outperforms the state-of-the-art baselines, improving the maximum latency by up to 23.3%, while keeping competitive throughput.

Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training

Authors: Shiju Wang, Yujie Wang, Ao Sun, Fangcheng Fu, Zijian Zhu, Bin Cui, Xu Han, Kaisheng Ma

2025-09-25

http://arxiv.org/abs/2509.21275v1

Long context training is crucial for LLM's context extension. Existing schemes, such as sequence parallelism, incur substantial communication overhead. Pipeline parallelism (PP) reduces this cost, but its effectiveness hinges on partitioning granularity. Batch-level PP, which divides input samples, exhibits high memory consumption in long-context scenarios, whereas token-level PP, which splits sequences into slices, alleviates memory overhead but may incur hardware under-utilization. This trade-off motivates adaptively selecting PP granularity to match resource and workload characteristics. Moreover, the sequence length distribution of real-world datasets is skewed, posing a challenge to PP's workload balance and efficient scheduling. Current static PP scheduling methods overlook the variance of sequence length, leading to suboptimal performance. In this paper, we propose Elastic Pipeline Parallelism (EPP), which orchestrates token-level PP and batch-level PP to adapt to resource and workload heterogeneity. We build InfiniPipe, a distributed training system that unleashes the potential of EPP via (1) a resource-aware and workload-balanced sequence processor that splits long sequences and packs short ones; and (2) a co-optimization methodology that jointly optimizes pipeline schedule and gradient checkpointing via a mechanism named stage-aware chunk-level adaptive checkpointing. Comprehensive experiments demonstrate that InfiniPipe achieves a 1.69x speedup over state-of-the-art systems.
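
The phrase "splits long sequences and packs short ones" can be made concrete with a small sketch. This is not InfiniPipe's sequence processor, which is described as resource-aware and workload-balanced; it is a plain greedy stand-in that assumes a single hypothetical token budget (`capacity`), slices any sequence longer than that budget, and packs the remaining sequences with first-fit decreasing.

```python
def split_and_pack(lengths, capacity):
    """Split long sequences into slices and pack short ones into bins.

    lengths:  list of sequence lengths in tokens (index = sequence id)
    capacity: hypothetical per-slice / per-bin token budget
    Returns (slices, bins) where slices is a list of (seq_id, start, end)
    and bins is a list of lists of sequence ids.
    """
    slices, short = [], []
    for seq_id, n in enumerate(lengths):
        if n > capacity:
            # Token-level treatment: cut the sequence into capacity-sized slices.
            for start in range(0, n, capacity):
                slices.append((seq_id, start, min(start + capacity, n)))
        else:
            short.append((n, seq_id))

    # Batch-level treatment: first-fit decreasing packing of short sequences.
    bins, loads = [], []
    for n, seq_id in sorted(short, reverse=True):
        for i, load in enumerate(loads):
            if load + n <= capacity:
                bins[i].append(seq_id)
                loads[i] += n
                break
        else:
            bins.append([seq_id])
            loads.append(n)
    return slices, bins


slices, bins = split_and_pack([10, 3, 4, 2, 5], capacity=8)
print(slices)
print(bins)
```

A real system would choose the budget per pipeline stage from memory and compute constraints rather than using one fixed number, which is the part this sketch deliberately leaves out.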

Semantic Edge-Cloud Communication for Real-Time Urban Traffic Surveillance with ViT and LLMs over Mobile Networks

Authors: Murat Arda Onsu, Poonam Lohan, Burak Kantarci, Aisha Syed, Matthew Andrews, Sean Kennedy

2025-09-25

http://arxiv.org/abs/2509.21259v1

Real-time urban traffic surveillance is vital for Intelligent Transportation Systems (ITS) to ensure road safety, optimize traffic flow, track vehicle trajectories, and prevent collisions in smart cities. Deploying edge cameras across urban environments is a standard practice for monitoring road conditions. However, integrating these with intelligent models requires a robust understanding of dynamic traffic scenarios and a responsive interface for user interaction. Although multimodal Large Language Models (LLMs) can interpret traffic images and generate informative responses, their deployment on edge devices is infeasible due to high computational demands. Therefore, LLM inference must occur on the cloud, necessitating visual data transmission from edge to cloud, a process hindered by limited bandwidth, leading to potential delays that compromise real-time performance. To address this challenge, we propose a semantic communication framework that significantly reduces transmission overhead. Our method involves detecting Regions of Interest (RoIs) using YOLOv11, cropping relevant image segments, and converting them into compact embedding vectors using a Vision Transformer (ViT). These embeddings are then transmitted to the cloud, where an image decoder reconstructs the cropped images. The reconstructed images are processed by a multimodal LLM to generate traffic condition descriptions. This approach achieves a 99.9% reduction in data transmission size while maintaining an LLM response accuracy of 89% for reconstructed cropped images, compared to 93% accuracy with original cropped images. Our results demonstrate the efficiency and practicality of ViT and LLM-assisted edge-cloud semantic communication for real-time traffic surveillance.

Explaining Fine Tuned LLMs via Counterfactuals A Knowledge Graph Driven Framework

Authors: Yucheng Wang, Ziyang Chen, Md Faisal Kabir

2025-09-25

http://arxiv.org/abs/2509.21241v1

The widespread adoption of Low-Rank Adaptation (LoRA) has enabled large language models (LLMs) to acquire domain-specific knowledge with remarkable efficiency. However, understanding how such a fine-tuning mechanism alters a model's structural reasoning and semantic behavior remains an open challenge. This work introduces a novel framework that explains fine-tuned LLMs via counterfactuals grounded in knowledge graphs. Specifically, we construct BioToolKG, a domain-specific heterogeneous knowledge graph of bioinformatics tools, and design a counterfactual-based fine-tuned LLMs explainer (CFFTExplainer) that learns soft masks over graph nodes and edges to generate minimal structural perturbations that induce maximum semantic divergence. Our method jointly optimizes structural sparsity and semantic divergence while enforcing interpretability-preserving constraints such as entropy regularization and edge smoothness. We apply this framework to a fine-tuned LLaMA-based LLM and reveal that counterfactual masking exposes the model's structural dependencies and aligns with LoRA-induced parameter shifts. This work provides new insights into the internal mechanisms of fine-tuned LLMs and highlights counterfactual graphs as a potential tool for interpretable AI.

Tree Search for LLM Agent Reinforcement Learning

Authors: Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu

2025-09-25

http://arxiv.org/abs/2509.21240v1

Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address this challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervision signals even when using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages at both the intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.

A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA

Authors: Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen

2025-09-25

http://arxiv.org/abs/2509.21199v1

Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness through a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods: https://github.com/KaiyangWan/InfoQA.
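
For orientation, the classical Fano inequality that such a bound is styled after can be written as follows. This is the textbook statement, not the bound derived in the paper; the symbols are assumptions chosen here: $X$ is the correct answer drawn from a finite set $\mathcal{X}$, $Y$ is everything the model conditions on, $P_e$ is the error probability, and $H_b$ is the binary entropy in bits.

```latex
% Classical Fano inequality (entropies in bits)
H(X \mid Y) \;\le\; H_b(P_e) + P_e \log_2\bigl(|\mathcal{X}| - 1\bigr)
           \;\le\; 1 + P_e \log_2 |\mathcal{X}|

% Rearranged into an accuracy ceiling
\mathrm{Acc} \;=\; 1 - P_e \;\le\; 1 - \frac{H(X \mid Y) - 1}{\log_2 |\mathcal{X}|}
```

Read this way, the more uncertainty about the answer remains after a single pass (larger $H(X \mid Y)$), the lower the best achievable accuracy, which matches the qualitative claim that accuracy collapses once task complexity exceeds capacity.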

Who's Laughing Now? An Overview of Computational Humour Generation and Explanation

Authors: Tyler Loakman, William Thorne, Chenghua Lin

2025-09-25

http://arxiv.org/abs/2509.21175v1

The creation and perception of humour is a fundamental human trait, positioning its computational understanding as one of the most challenging tasks in natural language processing (NLP). As an abstract, creative, and frequently context-dependent construct, humour requires extensive reasoning to understand and create, making it a pertinent task for assessing the common-sense knowledge and reasoning abilities of modern large language models (LLMs). In this work, we survey the landscape of computational humour as it pertains to the generative tasks of creation and explanation. We observe that, despite the task of understanding humour bearing all the hallmarks of a foundational NLP task, work on generating and explaining humour beyond puns remains sparse, while state-of-the-art models continue to fall short of human capabilities. We bookend our literature survey by motivating the importance of computational humour processing as a subdiscipline of NLP and presenting an extensive discussion of future directions for research in the area that takes into account the subjective and ethically ambiguous nature of humour.

GRPO is Secretly a Process Reward Model

Authors: Michael Sullivan

2025-09-25

http://arxiv.org/abs/2509.21154v1

We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs trained with $\lambda$-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks, and reach peak performance more rapidly, than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.
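
As background for the abstract above, the group-relative advantage used by standard GRPO can be sketched in a few lines. This shows only the commonly described outcome-level normalization (reward minus group mean, divided by group standard deviation); it is not the paper's $\lambda$-GRPO modification. The function name, the `eps` stabilizer, and the choice of population standard deviation are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within one group of sampled completions.

    rewards: one scalar reward per completion for the same prompt.
    Returns one advantage per completion; in standard GRPO every token of a
    completion shares that completion's advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions: two correct, two incorrect.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every completion in a group receives the same reward, all advantages are zero and the group contributes no learning signal.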

CAD-Tokenizer Towards Text-based CAD Prototyping via Modality-Specific Tokenization

Authors: Ruiyu Wang, Shizhao Sun, Weijian Ma, Jiang Bian

2025-09-25

http://arxiv.org/abs/2509.21150v1

Computer-Aided Design (CAD) is a foundational component of industrial prototyping, where models are defined not by raw coordinates but by construction sequences such as sketches and extrusions. This sequential structure enables both efficient prototype initialization and subsequent editing. Text-guided CAD prototyping, which unifies Text-to-CAD generation and CAD editing, has the potential to streamline the entire design pipeline. However, prior work has not explored this setting, largely because standard large language model (LLM) tokenizers decompose CAD sequences into natural-language word pieces, failing to capture primitive-level CAD semantics and hindering attention modules from modeling geometric structure. We conjecture that a multimodal tokenization strategy, aligned with CAD's primitive and structural nature, can provide more effective representations. To this end, we propose CAD-Tokenizer, a framework that represents CAD data with modality-specific tokens using a sequence-based VQ-VAE with primitive-level pooling and constrained decoding. This design produces compact, primitive-aware representations that align with CAD's structural nature. Applied to unified text-guided CAD prototyping, CAD-Tokenizer significantly improves instruction following and generation quality, achieving better quantitative and qualitative performance over both general-purpose LLMs and task-specific baselines.

UniSS Unified Expressive Speech-to-Speech Translation with Your Voice

Authors: Sitong Cheng, Weizhen Bian, Xinsheng Wang, Ruibin Yuan, Jianyi Chen, Shunshun Yin, Yike Guo, Wei Xue

2025-09-25

http://arxiv.org/abs/2509.21144v1

The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). In this work, we address these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. Our approach features carefully designed speech semantic and style modeling, enabling seamless integration with existing text-based frameworks to develop a unified text-speech language model. To transfer translation capabilities from text to speech, we propose a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. Furthermore, we construct and release a large-scale, high-quality expressive S2ST dataset, UniST, comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. Our work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at https://cmots.github.io/uniss-demo.

Acoustic-based Gender Differentiation in Speech-aware Language Models

Authors: Junhyuk Choi, Jihwan Seol, Nayeon Kim, Chanhee Cho, EunBin Cho, Bugeun Kim

2025-09-25

http://arxiv.org/abs/2509.21125v1

Speech-aware Language Models (SpeechLMs) have fundamentally transformed human-AI interaction by enabling voice-based communication, yet they may exhibit acoustic-based gender differentiation, where identical questions lead to different responses based on the speaker's gender. This paper proposes a new dataset that enables systematic analysis of this phenomenon, containing 9,208 speech samples across three categories: Gender-Independent, Gender-Stereotypical, and Gender-Dependent. We further evaluated the LLaMA-Omni series and discovered a paradoxical pattern: while overall responses seem identical regardless of gender, the pattern is far from unbiased. Specifically, in Gender-Stereotypical questions, all models consistently exhibited male-oriented responses; meanwhile, in Gender-Dependent questions, where gender differentiation would be contextually appropriate, models instead exhibited responses independent of gender. We also confirm that this pattern does not result from neutral options or from the perceived gender of a voice. When we allow neutral responses, models tend to respond neutrally in Gender-Dependent questions as well. The paradoxical pattern still persists when we apply gender neutralization methods to speech. Through comparison between SpeechLMs and their corresponding backbone LLMs, we confirmed that these paradoxical patterns primarily stem from Whisper speech encoders, which generate male-oriented acoustic tokens. These findings reveal that current SpeechLMs may not successfully remove gender biases even though they prioritize general fairness principles over contextual appropriateness, highlighting the need for more sophisticated techniques to utilize gender information properly in speech technology.

TyphoonMLA A Mixed Naive-Absorb MLA Kernel For Shared Prefix

Authors: Ahmet Caner Yüzügüler, Ahmet Çelik, Jiawei Zhuang, Lukas Cavigelli

2025-09-25

http://arxiv.org/abs/2509.21081v1

Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel formulation, MLA allows two functionally equivalent but computationally distinct kernel implementations: naive and absorb. While the naive kernels (e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decoding kernels (e.g., FlashMLA) rely on the absorb method to minimize HBM bandwidth usage. However, the compute-bound nature of the absorb implementations prohibits performance benefits from data reuse opportunities in attention calculations, such as shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that combines naive and absorb formulations to harness the strengths of both. TyphoonMLA effectively leverages the shared prefix by applying the naive formulation to the compute-bound parts of attention calculations, while reducing the bandwidth requirements for non-shared parts by using the absorb formulation. As a result, TyphoonMLA improves the throughput of attention calculations in MLA architectures by up to 3x and 3.24x on NPUs and GPUs, respectively, with only a 3% overhead in HBM size.

KeyWorld Key Frame Reasoning Enables Effective and Efficient World Models

Authors: Sibo Li, Qianyue Hao, Yu Shang, Yong Li

2025-09-25

http://arxiv.org/abs/2509.21027v1

Robotic world models are a promising paradigm for forecasting future environment states, yet their inference speed and the physical plausibility of generated trajectories remain critical bottlenecks, limiting their real-world applications. These issues stem from the redundancy of the prevailing frame-to-frame generation approach, in which the model spends costly computation on similar frames, and from its neglect of the semantic importance of key transitions. To address this inefficiency, we propose KeyWorld, a framework that improves text-conditioned robotic world models by concentrating transformer computation on a few semantic key frames while employing a lightweight convolutional model to fill in the intermediate frames. Specifically, KeyWorld first identifies significant transitions by iteratively simplifying the robot's motion trajectories, obtaining the ground truth key frames. Then, a DiT model is trained to reason about and generate these physically meaningful key frames from textual task descriptions. Finally, a lightweight interpolator efficiently reconstructs the full video by inpainting all intermediate frames. Evaluations on the LIBERO benchmark demonstrate that KeyWorld achieves a 5.68 $\times$ acceleration compared to the frame-to-frame generation baseline, and focusing on the motion-aware key frames further contributes to the physical validity of the generated videos, especially on complex tasks. Our approach highlights a practical path toward deploying world models in real-time robotic control and other domains requiring both efficient and effective world models. Code is released at https://anonymous.4open.science/r/Keyworld-E43D.

Binary Autoencoder for Mechanistic Interpretability of Large Language Models

Authors: Hakaze Cho, Haolin Yang, Brian M. Kurkoski, Naoya Inoue

2025-09-25

http://arxiv.org/abs/2509.20997v1

Existing works are dedicated to untangling atomized numerical components (features) from the hidden states of Large Language Models (LLMs) for interpreting their mechanism. However, they typically rely on autoencoders constrained by some implicit training-time regularization on single training instances (i.e., $L_1$ normalization, top-k function, etc.), without an explicit guarantee of global sparsity among instances, causing a large number of dense (simultaneously inactive) features and harming feature sparsity and atomization. In this paper, we propose a novel autoencoder variant that enforces minimal entropy on minibatches of hidden activations, thereby promoting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1-bit via a step function and apply gradient estimation to enable backpropagation. We term the result the Binary Autoencoder (BAE) and empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which we empirically evaluate and leverage to characterize the inference dynamics of LLMs and In-context Learning. (2) Feature untangling. Similar to typical methods, BAE can extract atomized features from an LLM's hidden states. To robustly evaluate such feature extraction capability, we refine traditional feature-interpretation methods to avoid unreliable handling of numerical tokens, and show that BAE avoids dense features while producing the largest number of interpretable ones among baselines, which confirms the effectiveness of BAE as a feature extractor.
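
The idea of estimating entropy on 1-bit activations can be sketched as follows. This is not the BAE training objective or the paper's estimator; it only shows why binarization makes entropy easy to compute: after a step function, each feature is a Bernoulli variable whose entropy follows from its firing rate over a minibatch. The function names, the zero threshold, and the per-feature (independent) treatment are assumptions made for illustration.

```python
import math


def binarize(batch, threshold=0.0):
    """Step function: map each hidden activation to 1 if above threshold, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in batch]


def bernoulli_entropy(p):
    """Entropy in bits of a Bernoulli variable with firing probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def feature_entropies(codes):
    """Per-feature entropy estimated from firing rates over a minibatch of binary codes."""
    n = len(codes)
    return [bernoulli_entropy(sum(col) / n) for col in zip(*codes)]


# Hypothetical minibatch: 4 instances, 3 hidden features.
batch = [
    [0.5, -1.0, 2.0],
    [-0.3, -2.0, 1.0],
    [0.7, -1.0, 3.0],
    [-0.1, -0.5, 0.2],
]
print(feature_entropies(binarize(batch)))
```

A feature that fires on every instance or on none carries zero entropy under this estimate, while one that fires on half of the instances carries the maximum of one bit.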

Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

Authors: Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng

2025-09-25

http://arxiv.org/abs/2509.20979v1

In modern GPU inference, cache efficiency remains a major bottleneck. In recommendation models, embedding hit rates largely determine throughput, while in large language models, cache misses substantially increase time-to-first-token (TTFT). Heuristic policies such as \textsc{LRU} often struggle under structured access patterns. Learning-based approaches are promising, but in practice face two major limitations: they degrade sharply when predictions are inaccurate, or they gain little even with accurate predictions due to conservative designs. Some also incur high overhead, further limiting practicality. We present \textsc{LCR}, a practical framework for learning-based GPU caching that delivers performance gains while ensuring robustness and efficiency. Its core algorithm, \textsc{LARU}, enhances \textsc{LRU} with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation. When predictions are accurate, \textsc{LARU} achieves near-optimal performance. With inaccurate predictions, it degrades gracefully to near-\textsc{LRU} performance. With \textsc{LCR}, we bridge the gap between empirical progress and theoretical advances in learning-based caching. Experiments show that \textsc{LCR} delivers consistent gains under realistic conditions. In DLRM and LLM scenarios, it improves throughput by up to 24.2\% and reduces P99 TTFT by up to 28.3\%, outperforming widely used inference systems. Even under poor predictions, its performance remains stable, demonstrating practical robustness.

MemLens Uncovering Memorization in LLMs with Activation Trajectories

Authors: Zirui He, Haiyan Zhao, Ali Payani, Mengnan Du

2025-09-25

http://arxiv.org/abs/2509.20909v1

Large language models (LLMs) are commonly evaluated on challenging benchmarks such as AIME and Math500, which are susceptible to contamination and risk of being memorized. Existing detection methods, which primarily rely on surface-level lexical overlap and perplexity, demonstrate low generalization and degrade significantly when encountering implicitly contaminated data. In this paper, we propose MemLens (An Activation Lens for Memorization Detection) to detect memorization by analyzing the probability trajectories of numeric tokens during generation. Our method reveals that contaminated samples exhibit ``shortcut'' behaviors, locking onto an answer with high confidence in the model's early layers, whereas clean samples show more gradual evidence accumulation across the model's full depth. We observe that contaminated and clean samples exhibit distinct and well-separated reasoning trajectories. To further validate this, we inject carefully designed samples into the model through LoRA fine-tuning and observe the same trajectory patterns as in naturally contaminated data. These results provide strong evidence that MemLens captures genuine signals of memorization rather than spurious correlations.

Punching Above Precision Small Quantized Model Distillation with Learnable Regularizer

Authors: Abdur Rehman, S M A Sharif, Md Abdur Rahaman, Mohamed Jismy Aashik Rasool, Seongwan Kim, Jaeho Lee

2025-09-25

http://arxiv.org/abs/2509.20854v1

Quantization-aware training (QAT) combined with knowledge distillation (KD) is a promising strategy for compressing Artificial Intelligence (AI) models for deployment on resource-constrained hardware. However, existing QAT-KD methods often struggle to balance task-specific (TS) and distillation losses due to heterogeneous gradient magnitudes, especially under quantization. We propose Game of Regularizer (GoR), a novel learnable regularization method that adaptively balances TS and KD objectives using only two trainable parameters for dynamic loss weighting. GoR reduces conflict between supervision signals, improves convergence, and boosts the performance of small quantized models (SQMs). Experiments on image classification, object detection (OD), and large language models (LLMs) show that GoR consistently outperforms state-of-the-art QAT-KD methods. On low-power edge devices, it delivers faster inference while maintaining full-precision accuracy. We also introduce QAT-EKD-GoR, an ensemble distillation framework that uses multiple heterogeneous teacher models. Under optimal conditions, the proposed EKD-GoR can outperform full-precision models, providing a robust solution for real-world deployment.

SPADE Structured Pruning and Adaptive Distillation for Efficient LLM-TTS

Authors: Tan Dat Nguyen, Jaehun Kim, Ji-Hoon Kim, Shukjae Choi, Youshin Lim, Joon Son Chung

2025-09-25

http://arxiv.org/abs/2509.20802v1

The goal of this paper is to introduce SPADE, a framework for Structured Pruning and Adaptive Distillation for Efficient Large Language Model-based text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability and zero-shot generalization, but their large parameter counts and high latency limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate-based layer importance index to remove non-essential Transformer layers, with (ii) multi-level knowledge distillation to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, and achieving up to 1.7x faster real-time factor with less than 5% of the original training data. These results show that compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation. Audio samples are available at https://mm.kaist.ac.kr/projects/SPADE/.

Towards Atoms of Large Language Models

Authors: Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

2025-09-25

http://arxiv.org/abs/2509.20784v1

The fundamental units of internal representations in large language models (LLMs) remain undefined, limiting further understanding of their mechanisms. Neurons or features are often regarded as such units, yet neurons suffer from polysemy, while features face concerns of unreliable reconstruction and instability. To address this issue, we propose the Atoms Theory, which defines such units as atoms. We introduce the atomic inner product (AIP) to correct representation shifting, formally define atoms, and prove the conditions under which atoms satisfy the Restricted Isometry Property (RIP), ensuring stable representations over the atom set and linking to compressed sensing. Under stronger conditions, we further establish the uniqueness and exact $\ell_1$ recoverability of the representations, and provide guarantees that single-layer sparse autoencoders (SAEs) with threshold activations can reliably identify the atoms. To validate the Atoms Theory, we train threshold-activated SAEs on Gemma2-2B, Gemma2-9B, and Llama3.1-8B, achieving 99.9% reconstruction across layers on average, and more than 99.8% of atoms satisfy the uniqueness condition, compared to 0.5% for neurons and 68.2% for features, showing that atoms more faithfully capture intrinsic representations of LLMs. Scaling experiments further reveal the link between SAE size and recovery capacity. Overall, this work systematically introduces and validates the Atoms Theory of LLMs, providing a theoretical framework for understanding internal representations and a foundation for mechanistic interpretability. Code available at https://github.com/ChenhuiHu/towards_atoms.

Recidivism and Peer Influence with LLM Text Embeddings in Low Security Correctional Facilities

Authors: Shanjukta Nath, Jiwon Hong, Jae Ho Chang, Keith Warren, Subhadeep Paul

2025-09-25

http://arxiv.org/abs/2509.20634v1

We find AI embeddings obtained using a pre-trained transformer-based Large Language Model (LLM) of 80,000-120,000 written affirmations and correction exchanges among residents in low-security correctional facilities to be highly predictive of recidivism. The prediction accuracy is 30\% higher with embedding vectors than with only pre-entry covariates. However, since the text embedding vectors are high-dimensional, we perform Zero-Shot classification of these texts to a low-dimensional vector of user-defined classes to aid interpretation while retaining the predictive power. To shed light on the social dynamics inside the correctional facilities, we estimate peer effects in these LLM-generated numerical representations of language with a multivariate peer effect model, adjusting for network endogeneity. We develop new methodology and theory for peer effect estimation that accommodate networks, multivariate latent variables, and correlated multivariate outcomes. With these new methods, we find significant peer effects in language usage for interaction and feedback.

Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning

Authors: Hanjiang Hu, Changliu Liu, Na Li, Yebin Wang

2025-09-24

http://arxiv.org/abs/2509.20616v1

Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training agents for complex multi-turn task planning faces significant challenges, including episode-wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi-turn interaction settings. To this end, this paper introduces a novel approach that transforms multi-turn task planning into single-turn task reasoning problems, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense and verifiable reward from expert trajectories. Our theoretical analysis shows that GRPO improvement on single-turn task reasoning results in higher multi-turn success probability under the minimal turns, as well as the generalization to subtasks with shorter horizons. Experimental evaluation on the complex task planning benchmark demonstrates that our 1.5B parameter model trained with single-turn GRPO achieves superior performance compared to larger baseline models up to 14B parameters, with success rates of 70% for long-horizon planning tasks with over 30 steps. We also theoretically and empirically validate the strong cross-task generalizability that the models trained on complex tasks can lead to the successful completion of all simpler subtasks.
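
For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization under common assumptions (population standard deviation, a small `eps` for stability); it is not taken from this paper, and the function name is hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each reward against the mean and spread of its own group,
    # so no separate value network is needed to form a baseline.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Usage: four sampled single-turn responses with verifiable 0/1 rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive a positive advantage and the rest a negative one; when every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.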

CHOIR A Chatbot-mediated Organizational Memory Leveraging Communication in University Research Labs

Authors: Sangwook Lee, Adnan Abbas, Yan Chen, Young-Ho Kim, Sang Won Lee

2025-09-24

http://arxiv.org/abs/2509.20512v1

University research labs often rely on chat-based platforms for communication and project management, where valuable knowledge surfaces but is easily lost in message streams. Documentation can preserve knowledge, but it requires ongoing maintenance and is challenging to navigate. Drawing on formative interviews that revealed organizational memory challenges in labs, we designed CHOIR, an LLM-based chatbot that supports organizational memory through four key functions: document-grounded Q&A, Q&A sharing for follow-up discussion, knowledge extraction from conversations, and AI-assisted document updates. We deployed CHOIR in four research labs for one month (n=21), where the lab members asked 107 questions and lab directors updated documents 38 times in the organizational memory. Our findings reveal a privacy-awareness tension: questions were asked privately, limiting directors' visibility into documentation gaps. Students often avoided contribution due to challenges in generalizing personal experiences into universal documentation. We contribute design implications for privacy-preserving awareness and supporting context-specific knowledge documentation.

MARS toward more efficient multi-agent collaboration for LLM reasoning

Authors: Xiao Wang, Jia Wang, Yijie Wang, Pengtao Dang, Sha Cao, Chi Zhang

2025-09-24

http://arxiv.org/abs/2509.20502v1

Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process. In MARS, an author agent generates an initial solution, reviewer agents provide decisions and comments independently, and a meta-reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compared MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50\%. Code is available at https://github.com/xwang97/MARS.

Shared Neural Space Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision

Authors: Jing Li, Oskar Bartosz, Chengyu Wang, Michal Wnuczynski, Dilshan Godaliyadda, Michael Polley

2025-09-24

http://arxiv.org/abs/2509.20481v1

The majority of AI models in imaging and vision are customized to perform a specific high-precision task. However, this strategy is inefficient for applications with a series of modular tasks, since each requires a mapping into a disparate latent domain. To address this inefficiency, we propose a universal Neural Space (NS), where an encoder-decoder framework pre-computes features across vision and imaging tasks. Our encoder learns transformation-aware, generalizable representations, which enable multiple downstream AI modules to share the same feature space. This architecture reduces redundancy, improves generalization across domain shift, and establishes a foundation for efficient multi-task vision pipelines. Furthermore, as opposed to larger backbones, our backbone is lightweight and CNN-based, allowing for wider deployment across hardware. We further demonstrate that imaging and vision modules, such as demosaicing, denoising, depth estimation and semantic segmentation, can be performed efficiently in the NS.

Seedream 4.0 Toward Next-generation Multimodal Image Generation

Authors: Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu

2025-09-24

http://arxiv.org/abs/2509.20427v1

We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE which also can reduce the number of image tokens considerably. This allows for efficient training of our model, and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM, we perform multi-modal post-training for training both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, and also allows for multi-image reference, and can generate multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on https://www.volcengine.com/experience/ark?launch=seedream.

Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing

Authors: Xinnan Dai, Chung-Hsiang Lo, Kai Guo, Shenglai Zeng, Dongsheng Luo, Jiliang Tang

2025-09-24

http://arxiv.org/abs/2509.20336v1

Transformer-based LLMs demonstrate strong performance on graph reasoning tasks, yet their internal mechanisms remain underexplored. To uncover these reasoning process mechanisms in a fundamental and unified view, we set the basic decoder-only Transformers and explain them using the circuit-tracer framework. Through this lens, we visualize reasoning traces and identify two core mechanisms in graph reasoning: token merging and structural memorization, which underlie both path reasoning and substructure extraction tasks. We further quantify these behaviors and analyze how they are influenced by graph density and model size. Our study provides a unified interpretability framework for understanding structural reasoning in decoder-only Transformers.

SIM-CoT Supervised Implicit Chain-of-Thought

Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin

2025-09-24

http://arxiv.org/abs/2509.20317v2

Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited their adoption. We identify a core latent instability issue when scaling the computational budget of implicit CoT: as the number of reasoning tokens increases, training often becomes unstable and collapses. Our analysis shows that this instability arises from latent representations becoming homogeneous and losing semantic diversity, caused by insufficient step-level supervision in current implicit CoT methods. To address this, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring latent states capture distinct and meaningful information. The auxiliary decoder is removed at inference, preserving the efficiency of implicit CoT with no added overhead. It also provides interpretability by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization and diagnosis. SIM-CoT significantly improves both in-domain accuracy and out-of-domain stability of implicit CoT methods, boosting Coconut by +8.2\% on GPT-2 and CODI by +3.0\% on LLaMA-3.1 8B. It further surpasses the explicit CoT baseline on GPT-2 by 2.1\% with 2.3$\times$ greater token efficiency, while closing the performance gap on larger models like LLaMA-3.1 8B. Code: https://github.com/InternLM/SIM-CoT

Multimodal Representation-disentangled Information Bottleneck for Multimodal Recommendation

Authors: Hui Wang, Jinghui Qin, Wushao Wen, Qingling Li, Shanshan Zhong, Zhongzhan Huang

2025-09-24

http://arxiv.org/abs/2509.20225v1

Multimodal data has significantly advanced recommendation systems by integrating diverse information sources to model user preferences and item characteristics. However, these systems often struggle with redundant and irrelevant information, which can degrade performance. Most existing methods either fuse multimodal information directly or use rigid architectural separation for disentanglement, failing to adequately filter noise and model the complex interplay between modalities. To address these challenges, we propose a novel framework, the Multimodal Representation-disentangled Information Bottleneck (MRdIB). Concretely, we first employ a Multimodal Information Bottleneck to compress the input representations, effectively filtering out task-irrelevant noise while preserving rich semantic information. Then, we decompose the information based on its relationship with the recommendation target into unique, redundant, and synergistic components. We achieve this decomposition with a series of constraints: a unique information learning objective to preserve modality-unique signals, a redundant information learning objective to minimize overlap, and a synergistic information learning objective to capture emergent information. By optimizing these objectives, MRdIB guides a model to learn more powerful and disentangled representations. Extensive experiments on several competitive models and three benchmark datasets demonstrate the effectiveness and versatility of our MRdIB in enhancing multimodal recommendation.

Q-Palette Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment

Authors: Deokjae Lee, Hyun Oh Song

2025-09-24

http://arxiv.org/abs/2509.20214v1

We study weight-only post-training quantization (PTQ), which quantizes the weights of a large language model (LLM) without retraining, using little or no calibration data. Weight-only PTQ is crucial for reducing the memory footprint and latency of LLM inference, especially in memory-bound, small-batch inference scenarios, such as personalized inference on edge devices. Despite its importance, irregular weight distributions with heavy-tailed outliers in LLMs complicate quantization, recently motivating rotation-based methods that transform weights into near-Gaussian distributions, which are more regular with fewer outliers, thereby reducing quantization error. In this work, we first derive the information-theoretically optimal bit allocation for Gaussianized weights under given bit budgets, revealing that fine-grained fractional-bit quantizers approaching the Gaussian distortion-rate bound are essential to achieve near-optimal quantization performance. To bridge this theoretical insight and practical implementation, we introduce Q-Palette, a versatile collection of fractional-bit quantizers that range from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Furthermore, leveraging Q-Palette as a foundational component, we propose a novel mixed-scheme quantization framework, jointly optimizing quantizer choices and layer fusion decisions given resource constraints. The code is available at https://github.com/snu-mllab/Q-Palette.

From Text to Talk Audio-Language Model Needs Non-Autoregressive Joint Training

Authors: Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo, Zitao Liu

2025-09-24

http://arxiv.org/abs/2509.20072v2

Recent advances in large language models (LLMs) have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-to-speech conversational systems. However, existing multimodal models handling interleaved audio and text rely on autoregressive methods, overlooking that text depends on target-target relations whereas audio depends mainly on source-target relations. In this work, we propose Text-to-Talk (TtT), a unified audio-text framework that integrates autoregressive (AR) text generation with non-autoregressive (NAR) audio diffusion in a single Transformer. By leveraging the any-order autoregressive property of absorbing discrete diffusion, our approach provides a unified training objective for text and audio. To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text while allowing bidirectional modeling within audio spans, and further introduce three training strategies that reduce train-test discrepancies. During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs. Extensive experiments across Audio-QA and ASR tasks demonstrate the effectiveness of our approach, with detailed ablation studies validating each proposed component. We will open-source our models, data and code to facilitate future research in this direction.

Projective Kolmogorov Arnold Neural Networks (P-KANs) Entropy-Driven Functional Space Discovery for Interpretable Machine Learning

Authors: Alastair Poole, Stig McArthur, Saravan Kumar

2025-09-24

http://arxiv.org/abs/2509.20049v1

Kolmogorov-Arnold Networks (KANs) relocate learnable nonlinearities from nodes to edges, demonstrating remarkable capabilities in scientific machine learning and interpretable modeling. However, current KAN implementations suffer from fundamental inefficiencies due to redundancy in high-dimensional spline parameter spaces, where numerous distinct parameterisations yield functionally equivalent behaviors. This redundancy manifests as a "nuisance space" in the model's Jacobian, leading to susceptibility to overfitting and poor generalization. We introduce Projective Kolmogorov-Arnold Networks (P-KANs), a novel training framework that guides edge function discovery towards interpretable functional representations through entropy-minimisation techniques from signal analysis and dictionary learning. Rather than constraining functions to predetermined spaces, our approach maintains spline space flexibility while introducing "gravitational" terms that encourage convergence towards optimal functional representations. Our key insight recognizes that optimal representations can be identified through entropy analysis of projection coefficients, compressing edge functions to lower-parameter projective spaces (Fourier, Chebyshev, Bessel). P-KANs demonstrate superior performance across multiple domains, achieving up to 80% parameter reduction while maintaining representational capacity, significantly improved robustness to noise compared to standard KANs, and successful application to industrial automated fiber placement prediction. Our approach enables automatic discovery of mixed functional representations where different edges converge to different optimal spaces, providing both benefits and enhanced interpretability for scientific machine learning applications.

Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks

Authors: Vani Kanjirangat, Tanja Samardžić, Ljiljana Dolamic, Fabio Rinaldi

2025-09-24

http://arxiv.org/abs/2509.20045v1

Dialectal data are characterized by linguistic variation that appears small to humans but has a significant impact on the performance of models. This dialect gap has been related to various factors (e.g., data size, economic and social factors) whose impact, however, turns out to be inconsistent. In this work, we investigate factors impacting the model performance more directly: we correlate Tokenization Parity (TP) and Information Parity (IP), as measures of representational biases in pre-trained multilingual models, with the downstream performance. We compare state-of-the-art decoder-only LLMs with encoder-based models across three tasks: dialect classification, topic classification, and extractive question answering, controlling for varying scripts (Latin vs. non-Latin) and resource availability (high vs. low). Our analysis reveals that TP is a better predictor of the performance on tasks reliant on syntactic and morphological cues (e.g., extractive QA), while IP better predicts performance in semantic tasks (e.g., topic classification). Complementary analyses, including tokenizer behavior, vocabulary coverage, and qualitative insights, reveal that the language support claims of LLMs often might mask deeper mismatches at the script or token level.

MeshMosaic Scaling Artist Mesh Generation via Local-to-Global Assembly

Authors: Rui Xu, Tianyang Xue, Qiujie Dong, Le Wan, Zhe Zhu, Peng Li, Zhiyang Dou, Cheng Lin, Shiqing Xin, Yuan Liu, Wenping Wang, Taku Komura

2025-09-24

http://arxiv.org/abs/2509.19995v1

Scaling artist-designed meshes to high triangle numbers remains challenging for autoregressive generative models. Existing quantization-based methods suffer from long-sequence bottlenecks and limited resolution, primarily due to the large number of tokens required and constrained quantization granularity. These issues prevent faithful reproduction of fine geometric details and structured density patterns. We introduce MeshMosaic, a novel local-to-global framework for artist mesh generation that scales to over 100K triangles--substantially surpassing prior methods, which typically handle only around 8K faces. MeshMosaic first segments shapes into patches, generating each patch autoregressively and leveraging shared boundary conditions to promote coherence, symmetry, and seamless connectivity between neighboring regions. This strategy enhances scalability to high-resolution meshes by quantizing patches individually, resulting in more symmetrical and organized mesh density and structure. Extensive experiments across multiple public datasets demonstrate that MeshMosaic significantly outperforms state-of-the-art methods in both geometric fidelity and user preference, supporting superior detail representation and practical mesh generation for real-world applications.

Authors: Haolin Li, Tianjie Dai, Zhe Chen, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang

2025-09-24

http://arxiv.org/abs/2509.19980v1

Clinical diagnosis is a highly specialized discipline requiring both domain expertise and strict adherence to rigorous guidelines. While current AI-driven medical research predominantly focuses on knowledge graphs or natural text pretraining paradigms to incorporate medical knowledge, these approaches primarily rely on implicitly encoded knowledge within model parameters, neglecting task-specific knowledge required by diverse downstream tasks. To address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a novel framework that explicitly injects external knowledge into multimodal models directly on downstream tasks. Specifically, RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss that constrains the latent distance between multi-modal features and guideline knowledge, and a dual decoder that employs guidelines as queries to steer cross-modal fusion, aligning the models with clinical diagnostic workflows from guideline acquisition to feature extraction and decision-making. Moreover, recognizing the lack of quantitative evaluation of interpretability for multimodal diagnostic models, we introduce a set of criteria to assess the interpretability from both image and text perspectives. Extensive evaluations across four datasets with different anatomies demonstrate RAD's generalizability, achieving state-of-the-art performance. Furthermore, RAD enables the model to concentrate more precisely on abnormal regions and critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code is available at https://github.com/tdlhl/RAD.

FastEagle Cascaded Drafting for Accelerating Speculative Decoding

Authors: Haiduo Huang, Jiangcheng Song, Wenzhe Zhao, Pengju Ren

2025-09-24

http://arxiv.org/abs/2509.20416v1

Speculative decoding accelerates generation by drafting candidates and verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still require N sequential passes to propose N tokens. We present FastEagle, a non-autoregressive cascaded drafter that emits an entire draft in a single forward pass. FastEagle replaces temporal steps with a lightweight layer cascade and trains with layer-wise supervision to mitigate error accumulation. Coupled with a constrained draft tree that preserves lossless verification cost, FastEagle delivers substantial wall-clock speedups over strong autoregressive drafters while maintaining competitive acceptance behavior. Across multiple LLMs (Vicuna-13B, LLaMA-Instruct 3.x, and DeepSeek-R1-Distill-LLaMA) and tasks (MT-Bench, HumanEval, GSM8K, CNN/DM, Alpaca), FastEagle consistently outperforms EAGLE-3 in speedup under both greedy and stochastic decoding, with comparable average acceptance lengths. These results indicate that removing sequential dependencies in drafting is a practical path toward lossless LLM inference acceleration.
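
The draft-then-verify loop mentioned above can be illustrated for the simplest case: a single linear draft under greedy decoding. This is a generic sketch rather than FastEagle's tree-based verifier. It assumes the target model has already been run once over the draft, so that `target_greedy_tokens[i]` (a hypothetical input) holds the target's own choice after the prefix plus the first `i` draft tokens.

```python
def verify_greedy(draft_tokens, target_greedy_tokens):
    # target_greedy_tokens has len(draft_tokens) + 1 entries, one per draft
    # position plus one bonus token that follows the full draft.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok == target_greedy_tokens[i]:
            accepted.append(tok)  # draft agrees with the target: keep it
        else:
            # First disagreement: emit the target's token and stop.
            accepted.append(target_greedy_tokens[i])
            return accepted
    # Every draft token matched, so the bonus token comes for free.
    accepted.append(target_greedy_tokens[len(draft_tokens)])
    return accepted

# Usage: the third draft token is rejected and replaced by the target's token.
out = verify_greedy([5, 7, 9], [5, 7, 2, 4])
```

Because every emitted token is one the target model would have chosen anyway, the output matches plain greedy decoding; the speedup comes from accepting several tokens per target pass.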

Exploration with Foundation Models Capabilities, Limitations, and Hybrid Approaches

Authors: Remo Sasso, Michelangelo Conserva, Dominik Jeurissen, Paulo Rauber

2025-09-24

http://arxiv.org/abs/2509.19924v1

Exploration in reinforcement learning (RL) remains challenging, particularly in sparse-reward settings. While foundation models possess strong semantic priors, their capabilities as zero-shot exploration agents in classic RL benchmarks are not well understood. We benchmark LLMs and VLMs on multi-armed bandits, Gridworlds, and sparse-reward Atari to test zero-shot exploration. Our investigation reveals a key limitation: while VLMs can infer high-level objectives from visual input, they consistently fail at precise low-level control: the "knowing-doing gap". To analyze a potential bridge for this gap, we investigate a simple on-policy hybrid framework in a controlled, best-case scenario. Our results in this idealized setting show that VLM guidance can significantly improve early-stage sample efficiency, providing a clear analysis of the potential and constraints of using foundation models to guide exploration rather than for end-to-end control.

Future Policy Aware Preference Learning for Mathematical Reasoning

Authors: Minjae Oh, Yunho Choi, Dongmin Choi, Yohan Jo

2025-09-24

http://arxiv.org/abs/2509.19893v1

Preference learning methods such as Direct Preference Optimization (DPO) have become standard for Large Language Model (LLM) post-training, yet they are often ineffective for mathematical reasoning. A key challenge is the large token overlap between preferred and dispreferred trajectories; lowering the probability of dispreferred trajectories also reduces the probability of shared useful tokens, leading to over-penalization and overall performance collapse. As a mitigation, existing algorithms include the probability of a trajectory under the current policy as a regularization term, which decreases the effect of the gradient when the probability is low. However, by the time this effect takes hold, useful tokens may have already been over-penalized as the model has begun to degrade. To address this, we propose Future Policy Aware (FPA) preference learning, which replaces the current policy with a future policy in the regularization term. This future policy is estimated via lightweight, logit-space extrapolation from a reference model toward the current model. FPA enables safer training by preemptively regularizing potentially problematic gradients. We apply FPA to DPO, RPO, and SimPER and evaluate them on the MATH and GSM8K benchmarks. FPA yields consistent performance gains, with the largest improvements observed with SimPER, achieving gains of up to 5.75%. We demonstrate that FPA provides proactive regularization while preserving the probability of shared, useful mathematical tokens, and enables longer, degradation-free training with negligible computational overhead. We will release our code publicly upon publication.
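The logit-space extrapolation mentioned above can be illustrated with a small sketch. The abstract does not give the exact rule, so the linear form and the step size `lam` below are assumptions made for illustration, not the authors' formulation.

```python
import math
from typing import List


def extrapolate_logits(ref: List[float], cur: List[float], lam: float) -> List[float]:
    """Continue the reference -> current movement by an extra fraction `lam`.

    lam = 0 returns the current logits unchanged; lam > 0 overshoots in the
    direction training has been moving (assumed linear extrapolation).
    """
    return [c + lam * (c - r) for r, c in zip(ref, cur)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy two-token vocabulary: training has pushed token 1 down relative to the reference.
ref_logits = [0.0, 0.0]
cur_logits = [1.0, -1.0]
future_logits = extrapolate_logits(ref_logits, cur_logits, lam=1.0)
p_cur = softmax(cur_logits)[1]        # probability of token 1 under the current policy
p_future = softmax(future_logits)[1]  # lower still under the extrapolated policy
```

Because the extrapolated policy already assigns the suppressed token a lower probability than the current policy does, a regularizer that weakens gradients at low probability would take effect earlier when it reads from the extrapolated policy.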

Structuring Collective Action with LLM-Guided Evolution From Ill-Structured Problems to Executable Heuristics

Authors: Kevin Bradley Dsouza, Graham Alexander Watt, Yuri Leonenko, Juan Moreno-Cruz

2025-09-24

http://arxiv.org/abs/2509.20412v1

Collective action problems, which require aligning individual incentives with collective goals, are classic examples of Ill-Structured Problems (ISPs). For an individual agent, the causal links between local actions and global outcomes are unclear, stakeholder objectives often conflict, and no single, clear algorithm can bridge micro-level choices with macro-level welfare. We present ECHO-MIMIC, a computational framework that converts this global complexity into a tractable, Well-Structured Problem (WSP) for each agent by discovering compact, executable heuristics and persuasive rationales. The framework operates in two stages: ECHO (Evolutionary Crafting of Heuristics from Outcomes) evolves snippets of Python code that encode candidate behavioral policies, while MIMIC (Mechanism Inference & Messaging for Individual-to-Collective Alignment) evolves companion natural language messages that motivate agents to adopt those policies. Both phases employ a large-language-model-driven evolutionary search: the LLM proposes diverse and context-aware code or text variants, while population-level selection retains those that maximize collective performance in a simulated environment. We demonstrate this framework on a canonical ISP in agricultural landscape management, where local farming decisions impact global ecological connectivity. Results show that ECHO-MIMIC discovers high-performing heuristics compared to baselines and crafts tailored messages that successfully align simulated farmer behavior with landscape-level ecological goals. By coupling algorithmic rule discovery with tailored communication, ECHO-MIMIC transforms the cognitive burden of collective action into a simple set of agent-level instructions, making previously ill-structured problems solvable in practice and opening a new path toward scalable, adaptive policy design.

CollaPipe Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks

Authors: Jiewei Chen, Xiumei Deng, Zehui Xiong, Shaoyong Guo, Xuesong Qiu, Ping Wang, Dusit Niyato

2025-09-24

http://arxiv.org/abs/2509.19855v1

The increasing demand for intelligent mobile applications has made multi-agent collaboration with Transformer-based large language models (LLMs) essential in mobile edge computing (MEC) networks. However, training LLMs in such environments remains challenging due to heavy computation, high end-to-end latency, and limited model generalization. We introduce CollaPipe, a hybrid distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving intelligent networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. We then perform the global model update via federated aggregation. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power. We derive and use a closed-form convergence bound to design a Dynamic Segment Scheduling and Resource Allocation (DSSDA) algorithm based on Lyapunov optimization, ensuring system stability under long-term constraints. Extensive experiments on downstream tasks with Transformer and BERT models show that CollaPipe improves computation efficiency by up to 15.09%, reduces end-to-end latency by at least 48.98%, and cuts single device memory usage by more than half, enabling online learning in heterogeneous and dynamic environments.

BurstEngine an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens

Authors: Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun

2025-09-24

http://arxiv.org/abs/2509.19836v1

Existing methods for training LLMs on long-sequence data, such as Tensor Parallelism and Context Parallelism, exhibit low Model FLOPs Utilization as sequence lengths and number of GPUs increase, especially when sequence lengths exceed 1M tokens. To address these challenges, we propose BurstEngine, an efficient framework designed to train LLMs on long-sequence data. BurstEngine introduces BurstAttention, an optimized distributed attention with lower communication cost than RingAttention. BurstAttention leverages topology-aware ring communication to fully utilize network bandwidth and incorporates fine-grained communication-computation overlap. Furthermore, BurstEngine introduces sequence-level selective checkpointing and fuses the language modeling head with the loss function to reduce memory cost. Additionally, BurstEngine introduces workload balance optimization for various types of attention masking. By integrating these optimizations, BurstEngine achieves a $1.2\times$ speedup with much lower memory overhead than the state-of-the-art baselines when training LLMs on extremely long sequences of over 1M tokens. We have made our code publicly available on GitHub: https://github.com/thunlp/BurstEngine.

MMedFD A Real-world Healthcare Benchmark for Multi-turn Full-Duplex Automatic Speech Recognition

Authors: Hongzhao Chen, XiaoYang Wang, Jing Lan, Hexiao Ding, Yufeng Jiang, MingHui Yang, DanHui Xu, Jun Luo, Nga-Chun Ng, Gerald W. Y. Cheng, Yunlin Mao, Jung Sun Yoo

2025-09-24

http://arxiv.org/abs/2509.19817v1

Automatic speech recognition (ASR) in clinical dialogue demands robustness to full-duplex interaction, speaker overlap, and low-latency constraints, yet open benchmarks remain scarce. We present MMedFD, the first real-world Chinese healthcare ASR corpus designed for multi-turn, full-duplex settings. Captured from a deployed AI assistant, the dataset comprises 5,805 annotated sessions with synchronized user and mixed-channel views, RTTM/CTM timing, and role labels. We introduce a model-agnostic pipeline for streaming segmentation, speaker attribution, and dialogue memory, and fine-tune Whisper-small on role-concatenated audio for long-context recognition. ASR evaluation includes WER, CER, and HC-WER, which measures concept-level accuracy across healthcare settings. LLM-generated responses are assessed using rubric-based and pairwise protocols. MMedFD establishes a reproducible framework for benchmarking streaming ASR and end-to-end duplex agents in healthcare deployment. The dataset and related resources are publicly available at https://github.com/Kinetics-JOJO/MMedFD

Gyges Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference

Authors: Haoyu Chen, Xue Li, Kun Qian, Yu Guan, Jin Zhao, Xin Wang

2025-09-24

http://arxiv.org/abs/2509.19729v1

Efficiently processing the dynamics of requests, especially the context length variance, is important in Large Language Model (LLM) serving scenarios. However, there is an intrinsic trade-off: while leveraging parallelism strategies, such as Tensor Parallelism (TP), can coordinate multiple GPUs to accommodate larger context lengths, it inevitably results in degraded overall throughput. In this paper, we propose Cross-Instance Parallelism Transformation (Gyges), which adaptively adjusts the parallelism strategies of running instances to align with the dynamics of incoming requests. We design (1) a page-friendly, header-centric layout to accelerate KV cache transformations; (2) dedicated weight padding to accelerate model weight transformations; and (3) a transformation-aware scheduler to cooperatively schedule requests and parallelism transformations, optimizing the overall performance. Evaluations using real-world traces show that Gyges improves throughput by 1.75x-6.57x compared to state-of-the-art solutions.

Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling

Authors: Youpeng Zhao, Jinpeng LV, Di Wu, Jun Wang, Christopher Gooley

2025-09-23

http://arxiv.org/abs/2509.19645v1

Test-time scaling (TTS) has recently emerged as a promising direction to exploit the hidden reasoning capabilities of pre-trained large language models (LLMs). However, existing scaling methods narrowly focus on the compute-optimal Pareto-frontier, ignoring the simple fact that compute-optimal is not always system-optimal. In this work, we propose a system-driven perspective on TTS, analyzing how reasoning models scale against practical metrics, such as latency and cost-per-token. By evaluating the impact of popular optimizations such as tensor parallelism and speculative decoding, our preliminary analysis reveals the limitations of current methods and calls for a paradigm shift toward holistic, system-aware evaluations that capture the true essence of scaling laws at inference time.

Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation

Authors: Roy Fejgin, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Ryan Langman, Jaehyeon Kim, Subhankar Ghosh, Shehzeen Hussain, Jason Li

2025-09-23

http://arxiv.org/abs/2509.19592v1

Speech generation models based on large language models (LLMs) typically operate on discrete acoustic codes, which differ fundamentally from text tokens due to their multicodebook structure. At each timestep, models must predict N codebook entries jointly, introducing dependencies that challenge simple parallel prediction approaches. Parallel prediction assumes independence among codebooks, yielding efficient decoding but often at the cost of reduced fidelity. To address this, hierarchical strategies employ a local transformer (LT) to refine predictions and capture intra-timestep dependencies. In this work, we systematically investigate two LT architectures: an autoregressive transformer that generates codebooks sequentially, and a MaskGIT-based transformer that performs iterative masked prediction. Both designs further enable frame stacking, where the primary transformer predicts multiple frames jointly, and the LT decodes their codebooks, offering improvements in speed without compromising perceptual quality. Through extensive analysis, we characterize the tradeoffs between parallel and iterative sampling strategies across different throughput and quality regimes. Finally, we propose practical guidelines for selecting decoding strategies based on deployment priorities such as computational efficiency and synthesis fidelity.

Transformer Modeling for Both Scalability and Performance in Multivariate Time Series

Authors: Hunjae Lee, Corey Clark

2025-09-23

http://arxiv.org/abs/2509.19471v1

Variable count is among the main scalability bottlenecks for transformer modeling in multivariate time series (MTS) data. On top of this, a growing consensus in the field points to indiscriminate inter-variable mixing as a potential source of noise-accumulation and performance degradation. This is likely exacerbated by the sparsity of informative signals characteristic of many MTS systems, coupled with representational misalignment stemming from indiscriminate information mixing between (heterogeneous) variables. While scalability and performance are often seen as competing interests in transformer design, we show that both can be improved simultaneously in MTS by strategically constraining the representational capacity of inter-variable mixing. Our proposed method, Transformer with Delegate Token Attention (DELTAformer), constrains inter-variable modeling through what we call delegate tokens, which are then used to perform full, unconstrained, inter-temporal modeling. Delegate tokens act as an implicit regularizer that forces the model to be highly selective about what inter-variable information is allowed to propagate through the network. Our results show that DELTAformer scales linearly with variable-count while actually outperforming standard transformers, achieving state-of-the-art performance across benchmarks and baselines. In addition, DELTAformer can focus on relevant signals better than standard transformers in noisy MTS environments and overall exhibits superior noise-resilience. Overall, results across various experiments confirm that by aligning our model design to leverage domain-specific challenges in MTS to our advantage, DELTAformer can simultaneously achieve linear scaling while actually improving its performance against standard, quadratic transformers.

CompLLM Compression for Long Context Q&A

Authors: Gabriele Berton, Jayakrishnan Unnikrishnan, Son Tran, Mubarak Shah

2025-09-23

http://arxiv.org/abs/2509.19228v1

Large Language Models (LLMs) face significant computational challenges when processing long contexts due to the quadratic complexity of self-attention. While soft context compression methods, which map input text to smaller latent representations, have shown promise, their real-world adoption is limited. Existing techniques typically compress the context as a single unit, which leads to quadratic compression complexity and an inability to reuse computations across queries with overlapping contexts. In this work, we introduce CompLLM, a soft compression technique designed for practical deployment. Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently. This simple design choice yields three critical properties: efficiency, as the compression step scales linearly with the context length; scalability, enabling models trained on short sequences (e.g., 1k tokens) to generalize to contexts of 100k tokens; and reusability, allowing compressed segments to be cached and reused across different queries. Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%. Furthermore, CompLLM achieves performance comparable to that obtained with the uncompressed context, and even surpasses it on very long sequences, demonstrating its effectiveness and practical utility.
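The segment-wise design described above can be sketched as follows. This only illustrates independent per-segment compression with reuse, under stated assumptions: `toy_compress` is a hypothetical placeholder that averages adjacent token ids to mimic a 2x rate, standing in for a learned compressor, and the class and function names are not from the paper.

```python
from typing import Callable, Dict, List, Tuple

Segment = Tuple[int, ...]


def split_segments(tokens: List[int], seg_len: int) -> List[Segment]:
    """Cut the context into fixed-length segments (the last one may be shorter)."""
    return [tuple(tokens[i:i + seg_len]) for i in range(0, len(tokens), seg_len)]


class SegmentCompressor:
    """Compress each segment independently and cache the result for reuse."""

    def __init__(self, compress_fn: Callable[[Segment], List[float]], seg_len: int):
        self.compress_fn = compress_fn
        self.seg_len = seg_len
        self.cache: Dict[Segment, List[float]] = {}
        self.calls = 0  # number of segments actually compressed (cache misses)

    def compress(self, tokens: List[int]) -> List[float]:
        out: List[float] = []
        for seg in split_segments(tokens, self.seg_len):
            if seg not in self.cache:
                self.cache[seg] = self.compress_fn(seg)
                self.calls += 1
            out.extend(self.cache[seg])
        return out


def toy_compress(seg: Segment) -> List[float]:
    """Hypothetical 2x compressor: average each adjacent pair of token ids."""
    pairs = [seg[i:i + 2] for i in range(0, len(seg), 2)]
    return [sum(p) / len(p) for p in pairs]


compressor = SegmentCompressor(toy_compress, seg_len=4)
first = compressor.compress([1, 2, 3, 4, 5, 6, 7, 8])
second = compressor.compress([1, 2, 3, 4, 9, 10, 11, 12])  # first segment is reused
```

Each segment costs the same regardless of total context length, so the work grows linearly with the number of segments. In this sketch, reuse only occurs when shared content falls on the same segment boundaries.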

Online Process Reward Learning for Agentic Reinforcement Learning

Authors: Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, Jianbin Jiao

2025-09-23

http://arxiv.org/abs/2509.19199v2

Large language models (LLMs) are increasingly trained with reinforcement learning (RL) as autonomous agents that reason and act over long horizons in interactive environments. However, sparse and sometimes unverifiable rewards make temporal credit assignment extremely challenging. Recent work attempts to integrate process supervision into agent learning but suffers from biased annotation, reward hacking, high variance from overly fine-grained signals, or failures when state overlap is rare. We therefore introduce Online Process Reward Learning (OPRL), a general credit-assignment strategy for agentic RL that integrates seamlessly with standard on-policy algorithms without relying on additional rollouts or explicit step labels. In OPRL, we optimize an implicit process reward model (PRM) alternately with the agent's policy to transform trajectory preferences into implicit step rewards through a trajectory-based DPO objective. These step rewards are then used to compute step-level advantages, which are combined with episode-level advantages from outcome rewards for policy update, creating a self-reinforcing loop. Theoretical findings guarantee that the learned step rewards are consistent with trajectory preferences and act as potential-based shaping rewards, providing bounded gradients to stabilize training. Empirically, we evaluate OPRL on three distinct agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA. Crucially, OPRL shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample-efficiency and lower variance during training. Further analysis also demonstrates the efficient exploration by OPRL using fewer actions, underscoring its potential for agentic learning in real-world scenarios.

Reading Images Like Texts Sequential Image Understanding in Vision-Language Models

Authors: Yueyan Li, Chenggong Zhao, Zeyuan Zang, Caixia Yuan, Xiaojie Wang

2025-09-23

http://arxiv.org/abs/2509.19191v1

Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. However, existing VLMs typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. Moreover, their opaque internal mechanisms hinder both deeper understanding and architectural innovation. Inspired by the dual-stream hypothesis of human vision, which distinguishes the "what" and "where" pathways, we deconstruct the visual processing in VLMs into object recognition and spatial perception for separate study. For object recognition, we convert images into text token maps and find that the model's perception of image content unfolds as a two-stage process from shallow to deep layers, beginning with attribute recognition and culminating in semantic disambiguation. For spatial perception, we theoretically derive and empirically verify the geometric structure underlying the positional representation in VLMs. Based on these findings, we introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency, and a RoPE scaling technique to enhance spatial reasoning. Through rigorous experiments, our work validates these analyses, offering a deeper understanding of VLM internals and providing clear principles for designing more capable future architectures.

BiGraspFormer End-to-End Bimanual Grasp Transformer

Authors: Kangmin Kim, Seunghyeok Back, Geonhyup Lee, Sangbeom Lee, Sangjun Noh, Kyoobin Lee

2025-09-23

http://arxiv.org/abs/2509.19142v1

Bimanual grasping is essential for robots to handle large and complex objects. However, existing methods either focus solely on single-arm grasping or employ separate grasp generation and bimanual evaluation stages, leading to coordination problems including collision risks and unbalanced force distribution. To address these limitations, we propose BiGraspFormer, a unified end-to-end transformer framework that directly generates coordinated bimanual grasps from object point clouds. Our key idea is the Single-Guided Bimanual (SGB) strategy, which first generates diverse single grasp candidates using a transformer decoder, then leverages their learned features through specialized attention mechanisms to jointly predict bimanual poses and quality scores. This conditioning strategy reduces the complexity of the 12-DoF search space while ensuring coordinated bimanual manipulation. Comprehensive simulation experiments and real-world validation demonstrate that BiGraspFormer consistently outperforms existing methods while maintaining efficient inference speed (<0.05s), confirming the effectiveness of our framework. Code and supplementary materials are available at https://sites.google.com/bigraspformer

Clapping Removing Per-sample Storage for Pipeline Parallel Distributed Optimization with Communication Compression

Authors: Boao Kong, Xu Huang, Yuqi Xu, Yixuan Liang, Bin Wang, Kun Yuan

2025-09-23

http://arxiv.org/abs/2509.19029v1

Pipeline-parallel distributed optimization is essential for large-scale machine learning but is challenged by significant communication overhead from transmitting high-dimensional activations and gradients between workers. Existing approaches often depend on impractical unbiased gradient assumptions or incur sample-size memory overhead. This paper introduces Clapping, a Communication compression algorithm with LAzy samPling for Pipeline-parallel learnING. Clapping adopts a lazy sampling strategy that reuses data samples across steps, breaking the sample-wise memory barrier and supporting convergence in few-epoch or online training regimes. Clapping comprises two variants, Clapping-FC and Clapping-FU, both of which achieve convergence without the unbiased gradient assumption, effectively addressing compression error propagation in multi-worker settings. Numerical experiments validate the performance of Clapping across different learning tasks.

HD-PPT Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS

Authors: Sihang Nie, Xiaofen Xing, Jingyuan Xing, Baiji Liu, Xiangmin Xu

2025-09-23

http://arxiv.org/abs/2509.19001v1

Large Language Model (LLM)-based Text-to-Speech (TTS) models have already reached a high degree of naturalness. However, precise control of TTS inference is still challenging. Although instruction-based Text-to-Speech (Instruct-TTS) models have been proposed, these models still lack fine-grained control due to the modality gap between single-level text instructions and multilevel speech tokens. To address this limitation, we propose HD-PPT, a framework that transforms speech synthesis into a structured, hierarchical task. To enable fine-grained control, we introduce a novel speech codec to extract distinct prompt-preference and content-preference tokens from the complex speech tokens, supervised by automatic speech recognition (ASR) and cross-lingual audio-text pre-training (CLAP) objectives. To bridge the modality gap of these tokens, we propose a hierarchical decoding strategy, where the LLM generates tokens in a structured order: first semantic, then fine-grained style, and finally complete acoustic representation. Extensive experiments demonstrate that this hierarchical paradigm significantly improves instruction adherence and achieves state-of-the-art naturalness, validating our approach for precise and controllable speech synthesis. Audio samples are available at https://xxh333.github.io/.

Benchmarking PDF Accessibility Evaluation A Dataset and Framework for Assessing Automated and LLM-Based Approaches for Accessibility Testing

Authors: Anukriti Kumar, Tanushree Padath, Lucy Lu Wang

2025-09-23

http://arxiv.org/abs/2509.18965v1

PDFs remain the dominant format for scholarly communication, despite significant accessibility challenges for blind and low-vision users. While various tools attempt to evaluate PDF accessibility, there is no standardized methodology to evaluate how different accessibility assessment approaches perform. Our work addresses this critical gap by introducing a novel benchmark dataset of scholarly PDFs with expert-validated accessibility annotations across seven criteria (alternative text quality, logical reading order, semantic tagging, table structure, functional hyperlinks, color contrast, and font readability), and a four-category evaluation framework with standardized labels (Passed, Failed, Not Present, Cannot Tell) to systematically assess accessibility evaluation approaches. Using our evaluation framework, we explore whether large language models (LLMs) are capable of supporting automated accessibility evaluation. We benchmark five LLMs, which demonstrate varying capabilities in correctly assessing different accessibility criteria, with GPT-4-Turbo achieving the highest overall accuracy (0.85). However, all models struggled in correctly categorizing documents with Not Present and Cannot Tell accessibility labels, particularly for alt text quality assessment. Our qualitative comparison with standard automated checkers reveals complementary strengths: rule-based tools excel at technical verification, while LLMs better evaluate semantic appropriateness and contextual relevance. Based on our findings, we propose a hybrid approach that would combine automated checkers, LLM evaluation, and human assessment as a future strategy for PDF accessibility evaluation.

Confidential LLM Inference Performance and Cost Across CPU and GPU TEEs

Authors: Marcin Chrapek, Marcin Copik, Etienne Mettaz, Torsten Hoefler

2025-09-23

http://arxiv.org/abs/2509.18886v1

Large Language Models (LLMs) are increasingly deployed on converged Cloud and High-Performance Computing (HPC) infrastructure. However, as LLMs handle confidential inputs and are fine-tuned on costly, proprietary datasets, their heightened security requirements slow adoption in privacy-sensitive sectors such as healthcare and finance. We investigate methods to address this gap and propose Trusted Execution Environments (TEEs) as a solution for securing end-to-end LLM inference. We validate their practicality by evaluating these compute-intensive workloads entirely within CPU and GPU TEEs. On the CPU side, we conduct an in-depth study running full Llama2 inference pipelines (7B, 13B, 70B) inside Intel's TDX and SGX, accelerated by Advanced Matrix Extensions (AMX). We derive 12 insights, including that across various data types, batch sizes, and input lengths, CPU TEEs impose under 10% throughput and 20% latency overheads, further reduced by AMX. We run LLM inference on NVIDIA H100 Confidential Compute GPUs, contextualizing our CPU findings and observing throughput penalties of 4-8% that diminish as batch and input sizes grow. By comparing performance, cost, and security trade-offs, we show how CPU TEEs can be more cost-effective or secure than their GPU counterparts. To our knowledge, our work is the first to comprehensively demonstrate the performance and practicality of modern TEEs across both CPUs and GPUs for enabling confidential LLMs (cLLMs).

FlashGMM Fast Gaussian Mixture Entropy Model for Learned Image Compression

Authors: Shimon Murai, Fangzheng Lin, Jiro Katto

2025-09-23

http://arxiv.org/abs/2509.18815v1

High-performance learned image codecs require flexible probability models to fit latent representations. Gaussian Mixture Models (GMMs) were proposed to satisfy this demand, but suffer from a significant runtime performance bottleneck due to the large Cumulative Distribution Function (CDF) tables that must be built for rANS coding. This paper introduces a fast coding algorithm that entirely eliminates this bottleneck. By leveraging the CDF's monotonic property, our decoder performs a dynamic binary search to find the correct symbol, eliminating the need for costly table construction and lookup. Aided by SIMD optimizations and numerical approximations, our approach accelerates the GMM entropy coding process by up to approximately 90x without compromising rate-distortion performance, significantly improving the practicality of GMM-based codecs. The implementation will be made publicly available at https://github.com/tokkiwa/FlashGMM.
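The idea of locating a symbol by binary search over a monotonic CDF, rather than through a precomputed table, can be sketched as below. This is a floating-point illustration only: a real rANS coder works with integer cumulative frequencies at fixed precision, and the function names, symbol range, and half-integer bin edges here are assumptions, not the FlashGMM implementation.

```python
import math
from typing import Sequence


def gmm_cdf(x: float, weights: Sequence[float], means: Sequence[float],
            scales: Sequence[float]) -> float:
    """CDF of a Gaussian mixture evaluated at x."""
    return sum(
        w * 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
        for w, m, s in zip(weights, means, scales)
    )


def find_symbol(u: float, weights: Sequence[float], means: Sequence[float],
                scales: Sequence[float], lo: int = -128, hi: int = 127) -> int:
    """Return the smallest integer k in [lo, hi] with CDF(k + 0.5) > u.

    Symbol k is assumed to own the interval [CDF(k - 0.5), CDF(k + 0.5)).
    Because the CDF is non-decreasing, the predicate flips from False to True
    exactly once, so binary search needs about log2(hi - lo) CDF evaluations
    and no table. Values of u beyond the range are clamped to hi.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if gmm_cdf(mid + 0.5, weights, means, scales) > u:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For the default 256-symbol range this takes 8 CDF evaluations per symbol, compared with evaluating the CDF at every bin edge to build a table first.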

Bi-VLM Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models

Authors: Xijun Wang, Junyun Huang, Rayyan Abdalla, Chengyuan Zhang, Ruiqi Xian, Dinesh Manocha

2025-09-23

http://arxiv.org/abs/2509.18763v1

We address the critical gap between the computational demands of vision-language models and the possible ultra-low-bit weight precision (bitwidth $\leq2$ bits) we can use for higher efficiency. Our work is motivated by the substantial computational cost and memory requirements of VLMs, which restrict their applicability in hardware-constrained environments. We propose Bi-VLM, which separates model weights non-uniformly based on the Gaussian quantiles. Our formulation groups the model weights into outlier (salient) and multiple inlier (unsalient) subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile in the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and compression objective. We have evaluated our approach on different VLMs. For the language model part of the VLM, our Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task in terms of four different benchmarks and three different models. For the overall VLM, our Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the quantized models and observe that 90%-99% of image tokens are redundant in the quantized models. This helps us to further prune the visual tokens to improve efficiency.
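One way to read the quantile-based grouping above is sketched here. The abstract does not specify the exact rule, so everything in this block is an assumption for illustration: weights are standardized against a fitted Gaussian, the inlier groups each receive an equal share of central probability mass, and the remaining two-sided tail of mass `outlier_frac` is treated as the outlier (salient) subset.

```python
from statistics import NormalDist, fmean, pstdev
from typing import List, Sequence


def quantile_thresholds(outlier_frac: float, n_inlier_groups: int) -> List[float]:
    """Cut points on |z| so each inlier group has equal central Gaussian mass."""
    std_normal = NormalDist()
    inlier_mass = 1.0 - outlier_frac
    cuts = []
    for j in range(1, n_inlier_groups + 1):
        central_mass = inlier_mass * j / n_inlier_groups
        # |z| below this value holds `central_mass` of a standard normal.
        cuts.append(std_normal.inv_cdf(0.5 + central_mass / 2.0))
    return cuts


def assign_groups(weights: Sequence[float], outlier_frac: float = 0.1,
                  n_inlier_groups: int = 2) -> List[int]:
    """Label each weight 0..n_inlier_groups-1 (inliers) or n_inlier_groups (outlier).

    Assumes the weights are not all identical, so the standard deviation is non-zero.
    """
    mu = fmean(weights)
    sigma = pstdev(weights)
    cuts = quantile_thresholds(outlier_frac, n_inlier_groups)
    groups = []
    for w in weights:
        z = abs(w - mu) / sigma
        groups.append(sum(z > c for c in cuts))  # number of cut points exceeded
    return groups


example_groups = assign_groups([-4.0, -1.5, -0.1, 0.0, 0.1, 1.5, 4.0])
```

In this toy example the two extreme weights land in the outlier group and the rest split across two inlier groups. Each group could then be quantized with its own scale, which is the motivation for separating them.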

HyperCool Reducing Encoding Cost in Overfitted Codecs with Hypernetworks

Authors: Pep Borrell-Tatché, Till Aczel, Théo Ladune, Roger Wattenhofer

2025-09-23

http://arxiv.org/abs/2509.18748v1

Overfitted image codecs like Cool-chic achieve strong compression by tailoring lightweight models to individual images, but their encoding is slow and computationally expensive. To accelerate encoding, Non-Overfitted (N-O) Cool-chic replaces the per-image optimization with a learned inference model, trading compression performance for encoding speed. We introduce HyperCool, a hypernetwork architecture that mitigates this trade-off. Building upon the N-O Cool-chic framework, HyperCool generates content-adaptive parameters for a Cool-chic decoder in a single forward pass, tailoring the decoder to the input image without requiring per-image fine-tuning. Our method achieves a 4.9% rate reduction over N-O Cool-chic with minimal computational overhead. Furthermore, the output of our hypernetwork provides a strong initialization for further optimization, reducing the number of steps needed to approach fully overfitted model performance. With fine-tuning, HEVC-level compression is achieved with 60.4% of the encoding cost of the fully overfitted Cool-chic. This work proposes a practical method to accelerate encoding in overfitted image codecs, improving their viability in scenarios with tight compute budgets.

PIE Perception and Interaction Enhanced End-to-End Motion Planning for Autonomous Driving

Authors: Chengran Yuan, Zijian Lu, Zhanqi Zhang, Yimin Zhao, Zefan Huang, Shuo Sun, Jiawei Sun, Jiahui Li, Christina Dao Wen Lee, Dongen Li, Marcelo H. Ang Jr

2025-09-23

http://arxiv.org/abs/2509.18609v1

End-to-end motion planning is promising for simplifying complex autonomous driving pipelines. However, challenges such as scene understanding and effective prediction for decision-making continue to present substantial obstacles to its large-scale deployment. In this paper, we present PIE, a pioneering framework that integrates advanced perception, reasoning, and intention modeling to dynamically capture interactions between the ego vehicle and surrounding agents. It incorporates a bidirectional Mamba fusion that addresses data losses in multimodal fusion of camera and LiDAR inputs, alongside a novel reasoning-enhanced decoder integrating Mamba and Mixture-of-Experts to facilitate scene-compliant anchor selection and optimize adaptive trajectory inference. PIE adopts an action-motion interaction module to effectively utilize state predictions of surrounding agents to refine ego planning. The proposed framework is thoroughly validated on the NAVSIM benchmark. PIE, without using any ensemble or data augmentation techniques, achieves an 88.9 PDM score and 85.6 EPDM score, surpassing the performance of prior state-of-the-art methods. Comprehensive quantitative and qualitative analyses demonstrate that PIE is capable of reliably generating feasible and high-quality ego trajectories.

FlexSED Towards Open-Vocabulary Sound Event Detection

Authors: Jiarui Hai, Helin Wang, Weizhe Guo, Mounya Elhilali

2025-09-23

http://arxiv.org/abs/2509.18606v1

Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-friendly interaction, and they lack zero-shot capabilities and offer poor few-shot adaptability. Although text-query-based separation methods have been explored, they primarily focus on source separation and are ill-suited for SED tasks that require precise temporal localization and efficient detection across large and diverse sound vocabularies. In this paper, we propose FlexSED, an open-vocabulary sound event detection system. FlexSED builds on a pretrained audio SSL model and the CLAP text encoder, introducing an encoder-decoder composition and an adaptive fusion strategy to enable effective continuous training from pretrained weights. To ensure robust supervision, it also employs large language models (LLMs) to assist in event query selection during training, addressing challenges related to missing labels. As a result, FlexSED achieves superior performance compared to vanilla SED models on AudioSet-Strong, while demonstrating strong zero-shot and few-shot capabilities. We release the code and pretrained models to support future research and applications based on FlexSED.

OmniFed A Modular Framework for Configurable Federated Learning from Edge to HPC

Authors: Sahil Tyagi, Andrei Cozma, Olivera Kotevska, Feiyi Wang

2025-09-23

http://arxiv.org/abs/2509.19396v1

Federated Learning (FL) is critical for edge and High Performance Computing (HPC) where data is not centralized and privacy is crucial. We present OmniFed, a modular framework designed around decoupling and clear separation of concerns for configuration, orchestration, communication, and training logic. Its architecture supports configuration-driven prototyping and code-level override-what-you-need customization. We also support different topologies, mixed communication protocols within a single deployment, and popular training algorithms. It also offers optional privacy mechanisms including Differential Privacy (DP), Homomorphic Encryption (HE), and Secure Aggregation (SA), as well as compression strategies. These capabilities are exposed through well-defined extension points, allowing users to customize topology and orchestration, learning logic, and privacy/compression plugins, all while preserving the integrity of the core system. We evaluate multiple models and algorithms to measure various performance metrics. By unifying topology configuration, mixed-protocol communication, and pluggable modules in one stack, OmniFed streamlines FL deployment across heterogeneous environments. The GitHub repository is available at https://github.com/at-aaims/OmniFed.

LLMZ+ Contextual Prompt Whitelist Principles for Agentic LLMs

Authors: Tom Pawelek, Raj Patel, Charlotte Crowell, Noorbakhsh Amiri, Sudip Mittal, Shahram Rahimi, Andy Perkins

2025-09-23

http://arxiv.org/abs/2509.18557v1

Compared to traditional models, agentic AI systems represent a highly valuable target for potential attackers because they possess privileged access to data sources and API tools, which are traditionally not incorporated into classical agents. Unlike a typical software application residing in a Demilitarized Zone (DMZ), agentic LLMs consciously rely on nondeterministic behavior of the AI (only defining a final goal, leaving the path selection to the LLM). This characteristic introduces substantial risk to both operational security and information security. The most common existing defense mechanisms rely on detecting malicious intent and preventing it from reaching the agent, thus protecting against jailbreak attacks such as prompt injection. In this paper, we present an alternative approach, LLMZ+, which moves beyond traditional detection-based approaches by implementing prompt whitelisting. Through this method, only contextually appropriate and safe messages are permitted to interact with the agentic LLM. By leveraging the specificity of context, LLMZ+ guarantees that all exchanges between external users and the LLM conform to predefined use cases and operational boundaries. Our approach streamlines the security framework, enhances its long-term resilience, and reduces the resources required for sustaining information security. Our empirical evaluation demonstrates that LLMZ+ provides strong resilience against the most common jailbreak prompts. At the same time, legitimate business workflows are not disrupted, and authorized traffic flows seamlessly between users and the agentic LLM. We measure the effectiveness of our approach using false positive and false negative rates, both of which can be reduced to 0 in our experimental setting.

Individualized non-uniform quantization for vector search

Authors: Mariano Tepper, Ted Willke

2025-09-22

http://arxiv.org/abs/2509.18471v1

Embedding vectors are widely used for representing unstructured data and searching through it for semantically similar items. However, the large size of these vectors, due to their high dimensionality, creates problems for modern vector search techniques: retrieving large vectors from memory/storage is expensive and their footprint is costly. In this work, we present NVQ (non-uniform vector quantization), a new vector compression technique that is computationally and spatially efficient in the high-fidelity regime. The core idea in NVQ is to use novel parsimonious and computationally efficient nonlinearities for building non-uniform vector quantizers. Critically, these quantizers are individually learned for each indexed vector. Our experimental results show that NVQ exhibits improved accuracy compared to the state of the art with a minimal computational cost.

LAWCAT Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling

Authors: Zeyu Liu, Souvik Kundu, Lianghao Jiang, Anni Li, Srikanth Ronanki, Sravan Bodapati, Gourav Datta, Peter A. Beerel

2025-09-22

http://arxiv.org/abs/2509.18467v1

Although Transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a significant bottleneck, particularly for latency-sensitive long-context applications. While recent linear-complexity alternatives are increasingly powerful, effectively training them from scratch is still resource-intensive. To overcome these limitations, we propose LAWCAT (Linear Attention with Convolution Across Time), a novel linearization framework designed to efficiently transfer the capabilities of pre-trained Transformers into a performant linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and employs normalized gated linear attention to improve generalization across varying context lengths. Our comprehensive evaluations demonstrate that distilling Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval accuracy up to 22K tokens, significantly extending its effective context window. Similarly, the Llama3.2-1B LAWCAT variant achieves competitive performance on S-NIAH 1&2&3 tasks (1K-8K context length) and the BABILong benchmark (QA2&QA3, 0K-16K context length), requiring less than 0.1% of the pre-training tokens used by models pre-trained from scratch. Furthermore, LAWCAT exhibits faster speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT thus provides an efficient pathway to high-performance, long-context linear models suitable for edge deployment, reducing reliance on extensive long-sequence training data and computational resources.

Authors: Minki Hong, Jangho Choi, Jihie Kim

2025-09-22

http://arxiv.org/abs/2509.18395v1

Social norms govern culturally appropriate behavior in communication, enabling dialogue systems to produce responses that are not only coherent but also socially acceptable. We present NormGenesis, a multicultural framework for generating and annotating socially grounded dialogues across English, Chinese, and Korean. To model the dynamics of social interaction beyond static norm classification, we propose a novel dialogue type, Violation-to-Resolution (V2R), which models the progression of conversations following norm violations through recognition and socially appropriate repair. To improve pragmatic consistency in underrepresented languages, we implement an exemplar-based iterative refinement early in the dialogue synthesis process. This design introduces alignment with linguistic, emotional, and sociocultural expectations before full dialogue generation begins. Using this framework, we construct a dataset of 10,800 multi-turn dialogues annotated at the turn level for norm adherence, speaker intent, and emotional response. Human and LLM-based evaluations demonstrate that NormGenesis significantly outperforms existing datasets in refinement quality, dialogue naturalness, and generalization performance. We show that models trained on our V2R-augmented data exhibit improved pragmatic competence in ethically sensitive contexts. Our work establishes a new benchmark for culturally adaptive dialogue modeling and provides a scalable methodology for norm-aware generation across linguistically and culturally diverse languages.

Efficient Particle Acceleration in 2.5-Dimensional, Hybrid-Kinetic Simulations of Decaying, Supersonic, Plasma Turbulence

Authors: Keyan Gootkin, Colby Haggerty, Damiano Caprioli, Zachary Davis

2025-09-22

http://arxiv.org/abs/2509.18374v1

Collisionless, turbulent plasmas surround the Earth, from the magnetosphere to the intergalactic medium, and the fluctuations within them affect nearly every field in the space sciences, from space weather forecasts to theories of galaxy formation. Where turbulent motions become supersonic, their interactions can lead to the formation of shocks, which are known to efficiently energize ions to cosmic-ray energies. We present 2.5-dimensional, hybrid-kinetic simulations of decaying, supersonic, non-relativistic turbulence in a collisionless plasma using the code dHybridR. Turbulence within these simulations is highly compressible; after accounting for this by taking the omni-directional power spectrum of the density-weighted velocity field, we find turbulent spectra with power-law slopes of $\alpha \approx -\frac{5}{3}$ for low Mach numbers, in the inertial range, and $\alpha \approx -2$ for high Mach numbers. Ions embedded in the highly supersonic simulations are accelerated to non-thermal energies at efficiencies similar to those seen in shocks, despite being in a non-relativistic regime and lacking the large-scale structure of a shock. We observe that particles are accelerated into a power-law spectrum, with a slope of $q \approx 2.5$ in (non-relativistic) energy. We compare these results to those obtained from the theory and simulations of diffusive shock acceleration, and discuss the astrophysical implications of this theoretical work.

Chiplet-Based RISC-V SoC with Modular AI Acceleration

Authors: P. Ramkumar, S. S. Bharadwaj

2025-09-22

http://arxiv.org/abs/2509.18355v1

Achieving high performance, energy efficiency, and cost-effectiveness while maintaining architectural flexibility is a critical challenge in the development and deployment of edge AI devices. Monolithic SoC designs struggle with this complex balance mainly due to low manufacturing yields (below 16%) at advanced 360 mm^2 process nodes. This paper presents a novel chiplet-based RISC-V SoC architecture that addresses these limitations through modular AI acceleration and intelligent system-level optimization. Our proposed design integrates four key innovations in a 30mm x 30mm silicon interposer: adaptive cross-chiplet Dynamic Voltage and Frequency Scaling (DVFS); AI-aware Universal Chiplet Interconnect Express (UCIe) protocol extensions featuring streaming flow control units and compression-aware transfers; distributed cryptographic security across heterogeneous chiplets; and intelligent sensor-driven load migration. The proposed architecture integrates a 7nm RISC-V CPU chiplet with dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3 memory stacks, and dedicated power management controllers. Experimental results across industry-standard benchmarks like MobileNetV2, ResNet-50 and real-time video processing demonstrate significant performance improvements. The AI-optimized configuration achieves ~14.7% latency reduction, 17.3% throughput improvement, and 16.2% power reduction compared to previous basic chiplet implementations. These improvements collectively translate to a 40.1% efficiency gain corresponding to ~3.5 mJ per MobileNetV2 inference (860 mW/244 images/s), while maintaining sub-5ms real-time capability across all experimented workloads. These performance upgrades demonstrate that modular chiplet designs can achieve near-monolithic computational density while enabling cost efficiency, scalability and upgradeability, crucial for next-generation edge AI device applications.
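The per-inference energy figure quoted above follows directly from the stated power and throughput. As a quick check of the arithmetic (the symbols $E$, $P$, and $R$ are introduced here for illustration only):

```latex
E_{\text{inference}} = \frac{P}{R}
  = \frac{0.860\ \text{W}}{244\ \text{images/s}}
  \approx 3.52 \times 10^{-3}\ \text{J per image}
  \approx 3.5\ \text{mJ}
```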

Speculate Deep and Accurate Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding

Authors: Pei-Shuo Wang, Jian-Jia Chen, Chun-Che Yang, Chi-Chih Chang, Ning-Chi Huang, Mohamed S. Abdelfattah, Kai-Chiang Wu

2025-09-22

http://arxiv.org/abs/2509.18344v1

The immense model sizes of large language models (LLMs) challenge deployment on memory-limited consumer GPUs. Although model compression and parameter offloading are common strategies to address memory limitations, compression can degrade quality, and offloading maintains quality but suffers from slow inference. Speculative decoding presents a promising avenue to accelerate parameter offloading, utilizing a fast draft model to propose multiple draft tokens, which are then verified by the target LLM in parallel with a single forward pass. This method reduces the time-consuming data transfers in forward passes that involve offloaded weight transfers. Existing methods often rely on pretrained weights of the same family, but require additional training to align with custom-trained models. Moreover, approaches that involve draft model training usually yield only modest speedups. This limitation arises from insufficient alignment with the target model, preventing higher token acceptance lengths. To address these challenges and achieve greater speedups, we propose SubSpec, a plug-and-play method to accelerate parameter offloading that is lossless and training-free. SubSpec constructs a highly aligned draft model by generating quantized substitute layers from offloaded target LLM portions. Additionally, our method shares the remaining GPU-resident layers and the KV-Cache, further reducing memory overhead and enhancing alignment. SubSpec achieves a high average acceptance length, delivering 9.1x speedup for Qwen2.5 7B on MT-Bench (8GB VRAM limit) and an average of 12.5x speedup for Qwen2.5 32B on popular generation benchmarks (24GB VRAM limit).
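The draft-then-verify loop that this abstract builds on can be sketched generically. This is a minimal illustration of greedy speculative decoding, not SubSpec itself: the toy `target` and `draft` functions, the window size `k`, and all names are hypothetical, and a real system would verify all drafted positions in one batched forward pass rather than a Python loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token

def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive greedy decoding, used as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1) The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token matches: keep it
            else:
                accepted.append(expected)  # mismatch: take the target's token and stop
                break
        out.extend(accepted)
    return out

# Toy deterministic "models" (assumptions for the demo only).
def target(ctx: List[int]) -> int:
    return (31 * sum(ctx) + len(ctx)) % 50

def draft(ctx: List[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 5.
    t = target(ctx)
    return t if len(ctx) % 5 else (t + 1) % 50

prompt = [3, 1, 4]
reference = greedy_decode(target, prompt, max_new=20)
fast = speculative_decode(target, draft, prompt, max_new=20)
print(fast == reference)
```

Under greedy decoding the speculative output is identical to the target's own output regardless of draft quality; a better-aligned draft only increases how many tokens are accepted per verification step, which is where the speedup comes from.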

Exploiting Tree Structure for Credit Assignment in RL Training of LLMs

Authors: Hieu Tran, Zonghai Yao, Hong Yu

2025-09-22

http://arxiv.org/abs/2509.18314v1

Reinforcement learning improves LLM reasoning, yet delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA align with this setup, where only a few decision tokens significantly impact the outcome. PPO offers token-level advantages with a learned value model, but it is complex to train both the actor and critic models simultaneously, and it is not easily generalizable, as the token-level values from the critic model can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but spreads a single sequence-level return across tokens and ignores branching. We introduce Prefix-to-Tree (P2T), a simple procedure that converts a group of responses into a prefix tree and computes nonparametric prefix values $V(s)$ by aggregating descendant outcomes. Built on P2T, we propose TEMPO (Tree-Estimated Mean Prefix Value for Policy Optimization), a critic-free algorithm that augments the group-relative outcome signal of GRPO with branch-gated temporal-difference corrections derived from the tree. At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO reduces to GRPO; at branching tokens, it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy with roughly the same wall-clock time.
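The prefix-tree idea can be made concrete with a small sketch. It assumes that $V(s)$ is the mean outcome of all sampled responses sharing the prefix $s$ and that the TD term for a token is the difference between consecutive prefix values; the abstract does not give the exact formulas or how the term is combined with the GRPO advantage, so the function names and these definitions are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Prefix = Tuple[str, ...]

def prefix_values(responses: Sequence[Sequence[str]],
                  rewards: Sequence[float]) -> Dict[Prefix, float]:
    """V(prefix) = mean reward of every response that starts with that prefix."""
    totals: Dict[Prefix, float] = defaultdict(float)
    counts: Dict[Prefix, int] = defaultdict(int)
    for tokens, reward in zip(responses, rewards):
        for t in range(len(tokens) + 1):  # include the empty (root) prefix
            prefix = tuple(tokens[:t])
            totals[prefix] += reward
            counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}

def td_terms(tokens: Sequence[str], values: Dict[Prefix, float]) -> List[float]:
    """Assumed per-token TD term: V(prefix including token) - V(prefix before token)."""
    return [values[tuple(tokens[:t + 1])] - values[tuple(tokens[:t])]
            for t in range(len(tokens))]

# Three sampled responses for one prompt with verifiable 0/1 rewards (toy data).
responses = [["x", "=", "2"], ["x", "=", "3"], ["y", "=", "2"]]
rewards = [1.0, 0.0, 0.0]
values = prefix_values(responses, rewards)
td_first = td_terms(responses[0], values)
print(td_first)
```

In this toy group the token "=" after "x" is not a branch point, so its TD term is exactly zero, while the first token (choosing "x" over "y") and the last token (choosing "2" over "3") receive nonzero credit. The terms telescope, so their sum equals the response's reward minus the root value.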

Evaluating Large Language Models for Detecting Antisemitism

Authors: Jay Patel, Hrudayangam Mehta, Jeremy Blackburn

2025-09-22

http://arxiv.org/abs/2509.18293v1

Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' capability to detect antisemitic content, specifically leveraging in-context definition as a policy guideline. We explore various prompting techniques and design a new CoT-like prompt, Guided-CoT. Guided-CoT handles the in-context policy well, increasing performance across all evaluated models, regardless of decoding configuration, model sizes, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs' utility, explainability, and reliability.

Spiffy Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding

Authors: Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli

2025-09-22

http://arxiv.org/abs/2509.18085v1

Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often generate at much lower rates, typically decoding only a single token at every denoising timestep in order to maximize output quality. We present Spiffy, a speculative decoding algorithm that accelerates dLLM inference by $2.8{-}3.1\times$ while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes draft states by leveraging the dLLM's distribution itself in an auto-speculative manner. This approach is efficient and effective, and eliminates the overheads of training and running an independent draft model. To structure the candidate draft states, we propose a novel directed draft graph which is uniquely designed to take advantage of the bidirectional, block-wise nature of dLLM generation and can be verified in parallel by the dLLM. To further optimize the structure of these draft graphs, we introduce an efficient, offline calibration algorithm that procedurally determines high-quality graph configurations. These optimized draft graphs, enabling increased acceptance rates, lead to a significant boost in the overall speedup achieved by the system. Crucially, Spiffy is also complementary to other recent innovations in improving dLLM generation speeds such as KV-caching and multi-token unmasking. We demonstrate that when combined with such parallel decoding algorithms, Spiffy is able to effectively multiply the benefits of these methods, leading to total speedups of up to $7.9\times$.

GraDeT-HTR A Resource-Efficient Bengali Handwritten Text Recognition System utilizing Grapheme-based Tokenizer and Decoder-only Transformer

Authors: Md. Mahmudul Hasan, Ahmed Nesar Tahsin Choudhury, Mahmudul Hasan, Md. Mosaddek Khan

2025-09-22

http://arxiv.org/abs/2509.18081v1

Despite Bengali being the sixth most spoken language in the world, handwritten text recognition (HTR) systems for Bengali remain severely underdeveloped. The complexity of Bengali script--featuring conjuncts, diacritics, and highly variable handwriting styles--combined with a scarcity of annotated datasets makes this task particularly challenging. We present GraDeT-HTR, a resource-efficient Bengali handwritten text recognition system based on a Grapheme-aware Decoder-only Transformer architecture. To address the unique challenges of Bengali script, we augment the performance of a decoder-only Transformer by integrating a grapheme-based tokenizer and demonstrate that it significantly improves recognition accuracy compared to conventional subword tokenizers. Our model is pretrained on large-scale synthetic data and fine-tuned on real human-annotated samples, achieving state-of-the-art performance on multiple benchmark datasets.

TempSamp-R1 Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Authors: Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng

2025-09-22

http://arxiv.org/abs/2509.18056v2

This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1

RadEval A framework for radiology text evaluation

Authors: Justin Xu, Xi Zhang, Javid Abderezaei, Julie Bauml, Roger Boodoo, Fatemeh Haghighi, Ali Ganjizadeh, Eric Brattain, Dave Van Veen, Zaiqiao Meng, David Eyre, Jean-Benoit Delbrouck

2025-09-22

http://arxiv.org/abs/2509.18030v1

We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder, demonstrating strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.

Through the Lens of Human-Human Collaboration A Configurable Research Platform for Exploring Human-Agent Collaboration

Authors: Bingsheng Yao, Jiaju Chen, Chaoran Chen, April Wang, Toby Jia-jun Li, Dakuo Wang

2025-09-22

http://arxiv.org/abs/2509.18008v1

Intelligent systems have traditionally been designed as tools rather than collaborators, often lacking critical characteristics that collaboration partnerships require. Recent advances in large language model (LLM) agents open new opportunities for human-agent collaboration by enabling natural communication and various social and cognitive behaviors. Yet it remains unclear whether principles of computer-mediated collaboration established in HCI and CSCW persist, change, or fail when humans collaborate with LLM agents. To support systematic investigations of these questions, we introduce an open and configurable research platform for HCI researchers. The platform's modular design allows seamless adaptation of classic CSCW experiments and manipulation of theory-grounded interaction controls. We demonstrate the platform's effectiveness and usability through two case studies: (1) re-implementing the classic human-human-collaboration task Shape Factory as a between-subject human-agent-collaboration experiment with 16 participants, and (2) a participatory cognitive walkthrough with five HCI researchers to refine workflows and interfaces for experiment setup and analysis.

Visual Detector Compression via Location-Aware Discriminant Analysis

Authors: Qizhen Lan, Jung Im Choi, Qing Tian

2025-09-22

http://arxiv.org/abs/2509.17968v1

Deep neural networks are powerful, yet their high complexity greatly limits their potential to be deployed on billions of resource-constrained edge devices. Pruning is a crucial network compression technique, yet most existing methods focus on classification models, with limited attention to detection. Even among those addressing detection, there is a lack of utilization of essential localization information. Also, many pruning methods passively rely on pre-trained models, in which useful and useless components are intertwined, making it difficult to remove the latter without harming the former at the neuron/filter level. To address the above issues, in this paper, we propose a proactive detection-discriminants-based network compression approach for deep visual detectors, which alternates between two steps: (1) maximizing and compressing detection-related discriminants and aligning them with a subset of neurons/filters immediately before the detection head, and (2) tracing the detection-related discriminating power across the layers and discarding features of lower importance. Object location information is exploited in both steps. Extensive experiments, employing four advanced detection models and four state-of-the-art competing methods on the KITTI and COCO datasets, highlight the superiority of our approach. Remarkably, our compressed models can even beat the original base models with a substantial reduction in complexity.

Benchmarking Humans and Machines on Complex Multilingual Speech Understanding Tasks

Authors: Sai Samrat Kankanala, Ram Chandra, Sriram Ganapathy

2025-09-22

http://arxiv.org/abs/2509.17965v1

Auditory attention and selective phase-locking are central to human speech understanding in complex acoustic scenes and cocktail party settings, yet these capabilities in multilingual subjects remain poorly understood. While machine understanding of natural speech has advanced in recent years, questions persist about comprehension of overlapped and mixed-channel speech. We propose a systematic paradigm for studying humans and machines in speech question-answering tasks in multilingual settings with clean and mixed-channel speech. For human listeners, selective attention to a target speaker was significantly better in their native language (L1) than in their second language (L2). For machine listening, speech-based large language models (LLMs) match or exceed human performance in clean, single-speaker conditions but often struggle to selectively attend in two-speaker settings. These results reveal a key divergence: humans rely on attentional cues that are more streamlined in their native language, whereas LLMs default to parallel information extraction, which exceeds human skills.

Expert-as-a-Service Towards Efficient, Scalable, and Robust Large-scale MoE Serving

Authors: Ziming Liu, Boyu Tian, Guoteng Wang, Zhen Jiang, Peng Sun, Zhenhua Han, Tian Tang, Xiaohe Hu, Yanmin Jia, Yan Zhang, He Liu, Mingjun Zhang, Yiqi Zhang, Qiaoling Chen, Shenggan Cheng, Mingyu Gao, Yang You, Siyuan Feng

2025-09-22

http://arxiv.org/abs/2509.17863v1

Mixture-of-Experts (MoE) models challenge serving infrastructures with dynamic, sparse expert utilization, causing instability on conventional systems designed for dense architectures. We propose EaaS, a novel serving system to enable efficient, scalable, and robust MoE deployment. Our system disaggregates MoE modules into independent, stateless services. This design enables fine-grained resource scaling and provides inherent fault tolerance by decoupling compute units. The architecture is powered by a high-performance, CPU-free peer-to-peer communication library that ensures minimal overhead and high throughput. Experiments confirm EaaS's scalability and efficiency, achieving performance comparable to monolithic systems while providing robust fault tolerance and strong scalability. EaaS incurs less than a 2% throughput reduction under simulated hardware failures that would otherwise halt monolithic architectures. It further saves up to 37.5% of computing resources through dynamic fine-grained adaptation to serving traffic, demonstrating strong resilience for large-scale MoE deployment in production.

Towards General Computer Control with Hierarchical Agents and Multi-Level Action Spaces

Authors: Zihan Dong, Xinyu Fan, Zixiang Tang, Yunqing Li

2025-09-22

http://arxiv.org/abs/2509.18230v1

Controlling desktop applications via software remains a fundamental yet under-served problem. Existing multi-modal large language models (MLLMs) ingest screenshots and task instructions to generate keystrokes and mouse events, but they suffer from prohibitive inference latency, poor sample efficiency on long-horizon sparse-reward tasks, and infeasible on-device deployment. We introduce a lightweight hierarchical reinforcement learning framework, ComputerAgent, that formulates OS control as a two-level option process (manager and subpolicy), employs a triple-modal state encoder (screenshot, task ID, numeric state) to handle visual and contextual diversity, integrates meta-actions with an early-stop mechanism to reduce wasted interactions, and uses a compact vision backbone plus small policy networks for on-device inference (15M parameters). On a suite of 135 real-world desktop tasks, ComputerAgent attains 92.1% success on simple tasks (<8 steps) and 58.8% on hard tasks (>=8 steps), matching or exceeding 200B-parameter MLLM baselines on simple scenarios while reducing model size by over four orders of magnitude and halving inference time. These results demonstrate that hierarchical RL offers a practical, scalable alternative to monolithic MLLM-based automation for computer control.

ConfClip Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs

Authors: Bonan Zhang, Zhongqi Chen, Bowen Song, Qinya Li, Fan Wu, Guihai Chen

2025-09-22

http://arxiv.org/abs/2509.17730v1

Reinforcement learning (RL) has become a standard paradigm for refining large language models (LLMs) beyond pre-training and instruction tuning. A prominent line of work is RL with verifiable rewards (RLVR), which leverages automatically verifiable outcomes (e.g., correctness or executability) to generate reward signals. While efficient, this framework faces two key limitations. First, its binary feedback is too sparse to capture the quality of the reasoning process. Second, its coarse-grained rewards can lead to vanishing gradients. Inspired by observations from human learning, we introduce an RL technique that integrates verifiable outcomes with the model's own confidence estimates. This joint design enriches the reward signal, providing finer-grained feedback and implicitly supervising the reasoning process. Experimental results demonstrate that our proposed method enhances RL performance across multiple datasets and reduces token consumption during inference, while incurring negligible additional training cost. Moreover, it can be used as a plug-in module to enhance other state-of-the-art RL methods.
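
The idea of combining a binary verifiable outcome with a confidence estimate can be sketched roughly as follows. This is only an illustration: the function name, the use of the geometric-mean token probability as the confidence proxy, and the clipping bounds `lo` and `hi` are assumptions made here, not the formulation from the paper.

```python
import math


def confidence_weighted_reward(correct, token_logprobs, lo=0.2, hi=1.0):
    """Hypothetical reward: a signed outcome scaled by a clipped confidence."""
    # Assumed confidence proxy: geometric-mean probability of the answer tokens.
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    # Clipping keeps the reward magnitude inside [lo, hi] so it never vanishes.
    clipped = min(max(confidence, lo), hi)
    # The verifiable outcome only decides the sign of the reward.
    outcome = 1.0 if correct else -1.0
    return outcome * clipped
```

Under these assumptions, a confidently correct answer earns a reward near +1, a confidently wrong one near -1, and low-confidence answers are pulled toward the lower clipping bound rather than toward zero.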

When TableQA Meets Noise A Dual Denoising Framework for Complex Questions and Large-scale Tables

Authors: Shenghao Ye, Yu Guo, Dong Jin, Yikai Shen, Yunpeng Hou, Shuangwu Chen, Jian Yang, Xiaofeng Jiang

2025-09-22

http://arxiv.org/abs/2509.17680v1

Table question answering (TableQA) is a fundamental task in natural language processing (NLP). The strong reasoning capabilities of large language models (LLMs) have brought significant advances in this field. However, as real-world applications involve increasingly complex questions and larger tables, substantial noisy data is introduced, which severely degrades reasoning performance. To address this challenge, we focus on improving two core capabilities: Relevance Filtering, which identifies and retains information truly relevant to reasoning, and Table Pruning, which reduces table size while preserving essential content. Based on these principles, we propose EnoTab, a dual denoising framework for complex questions and large-scale tables. Specifically, we first perform Evidence-based Question Denoising by decomposing the question into minimal semantic units and filtering out those irrelevant to answer reasoning based on consistency and usability criteria. Then, we propose Evidence Tree-guided Table Denoising, which constructs an explicit and transparent table pruning path to remove irrelevant data step by step. At each pruning step, we observe the intermediate state of the table and apply a post-order node rollback mechanism to handle abnormal table states, ultimately producing a highly reliable sub-table for final answer reasoning. Finally, extensive experiments show that EnoTab achieves outstanding performance on TableQA tasks with complex questions and large-scale tables, confirming its effectiveness.

Mechanistic Interpretability with SAEs Probing Religion, Violence, and Geography in Large Language Models

Authors: Katharina Simbeck, Mariam Mahran

2025-09-22

http://arxiv.org/abs/2509.17665v1

Despite growing research on bias in large language models (LLMs), most work has focused on gender and race, with little attention to religious identity. This paper explores how religion is internally represented in LLMs and how it intersects with concepts of violence and geography. Using mechanistic interpretability and Sparse Autoencoders (SAEs) via the Neuronpedia API, we analyze latent feature activations across five models. We measure overlap between religion- and violence-related prompts and probe semantic patterns in activation contexts. While all five religions show comparable internal cohesion, Islam is more frequently linked to features associated with violent language. In contrast, geographic associations largely reflect real-world religious demographics, revealing how models embed both factual distributions and cultural stereotypes. These findings highlight the value of structural analysis in auditing not just outputs but also internal representations that shape model behavior.

Evict3R Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers

Authors: Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi

2025-09-22

http://arxiv.org/abs/2509.17650v1

Streaming visual transformers like StreamVGGT achieve strong 3D perception but suffer from unbounded growth of key-value (KV) memory, which limits scalability. We propose a training-free, inference-time token eviction policy that bounds memory by discarding redundant tokens while keeping the most informative ones. Our method uses significantly less memory with little to no drop in accuracy: on 7-Scenes with long sequences it reduces peak memory from 18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under strict memory budgets, eviction enables denser frame sampling, which improves reconstruction accuracy compared to the baseline. Experiments across video depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and camera pose estimation (Sintel, TUM-dynamics) show that our approach closely matches StreamVGGT at a fraction of the memory and makes long-horizon streaming inference more practical.
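
A budget-bounded eviction step of the kind described above might look like the sketch below. It assumes each cached token already has an importance score (for example, accumulated attention mass); how the paper actually scores tokens is not stated in the abstract, and the names `evict_tokens`, `scores`, and `budget` are illustrative.

```python
def evict_tokens(cache, scores, budget):
    """Keep at most `budget` cache entries with the highest scores, in original order."""
    if len(cache) != len(scores):
        raise ValueError("cache and scores must have the same length")
    if len(cache) <= budget:
        return list(cache)
    # Rank token positions by importance, highest first.
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the surviving tokens stay in sequence.
    keep = sorted(ranked[:budget])
    return [cache[i] for i in keep]
```

Calling this after each new frame is appended would keep the cache length at or below `budget`, which is what bounds memory regardless of stream length.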

Bilateral Distribution Compression Reducing Both Data Size and Dimensionality

Authors: Dominic Broadbent, Nick Whiteley, Robert Allison, Tom Lovett

2025-09-22

http://arxiv.org/abs/2509.17543v3

Existing distribution compression methods reduce dataset size by minimising the Maximum Mean Discrepancy (MMD) between original and compressed sets, but modern datasets are often large in both sample size and dimensionality. We propose Bilateral Distribution Compression (BDC), a two-stage framework that compresses along both axes while preserving the underlying distribution, with overall linear time and memory complexity in dataset size and dimension. Central to BDC is the Decoded MMD (DMMD), which quantifies the discrepancy between the original data and a compressed set decoded from a low-dimensional latent space. BDC proceeds by (i) learning a low-dimensional projection using the Reconstruction MMD (RMMD), and (ii) optimising a latent compressed set with the Encoded MMD (EMMD). We show that this procedure minimises the DMMD, guaranteeing that the compressed set faithfully represents the original distribution. Experiments show that across a variety of scenarios BDC can achieve comparable or superior performance to ambient-space compression at substantially lower cost.
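
For readers unfamiliar with the MMD objective mentioned above, the following is a minimal sketch of the standard biased estimator of squared MMD with a Gaussian kernel. It illustrates only the plain MMD between two point sets, not the DMMD, RMMD, or EMMD variants; the kernel choice and the `bandwidth` parameter are assumptions for illustration.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2 between point sets xs and ys."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

A compressed set that represents the data well yields a value near zero. Note that this naive estimator is quadratic in the number of points, whereas the abstract reports linear overall complexity for BDC.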

Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs

Authors: Xing Chen, Rong Shi, Lu Zhao, Lingbin Wang, Xiao Jin, Yueqiang Chen, Hongfeng Sun

2025-09-22

http://arxiv.org/abs/2509.17542v1

LLM-based applications are widely used across industries, but as model sizes grow, building an efficient large language model (LLM) inference system has become an urgent problem for service providers. Inference is divided into two stages with different characteristics, Prefill and Decode, and the two stages interfere with each other during inference. To address this, researchers have proposed P-D disaggregated inference frameworks. However, current research targets homogeneous GPUs and lacks deployment solutions based on business scenarios. Compared with homogeneous GPUs, building inference systems from heterogeneous GPUs can improve resource utilization and reduce costs. Using GPUs from different vendors can further reduce costs, improve resource utilization, and lessen dependence on a single vendor. Therefore, a P-D disaggregated inference system based on heterogeneous GPUs is designed, with a heterogeneous-compatible transmission module that addresses data compatibility issues across heterogeneous GPUs. Then, a joint optimization algorithm for parallel strategy and instance number allocation is proposed to obtain deployment solutions. Finally, experimental results show that the P-D disaggregated inference system effectively solves the hybrid inference problem of heterogeneous GPUs from different vendors, and that the joint optimization algorithm obtains the optimal deployment solution.

4DGCPro Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming

Authors: Zihan Zheng, Zhenlong Wu, Houqiang Zhong, Yuan Tian, Ning Cao, Lan Xu, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Wenjun Zhang

2025-09-22

http://arxiv.org/abs/2509.17513v1

Achieving seamless viewing of high-fidelity volumetric video, comparable to 2D video experiences, remains an open challenge. Existing volumetric video compression methods either lack the flexibility to adjust quality and bitrate within a single model for efficient streaming across diverse networks and devices, or struggle with real-time decoding and rendering on lightweight mobile platforms. To address these challenges, we introduce 4DGCPro, a novel hierarchical 4D Gaussian compression framework that facilitates real-time mobile decoding and high-quality rendering via progressive volumetric video streaming in a single bitstream. Specifically, we propose a perceptually-weighted and compression-friendly hierarchical 4D Gaussian representation with motion-aware adaptive grouping to reduce temporal redundancy, preserve coherence, and enable scalable multi-level detail streaming. Furthermore, we present an end-to-end entropy-optimized training scheme, which incorporates layer-wise rate-distortion (RD) supervision and attribute-specific entropy modeling for efficient bitstream generation. Extensive experiments show that 4DGCPro enables flexible quality and multiple bitrates within a single model, achieving real-time decoding and rendering on mobile devices while outperforming existing methods in RD performance across multiple datasets. Project Page: https://mediax-sjtu.github.io/4DGCPro

CorefInst Leveraging LLMs for Multilingual Coreference Resolution

Authors: Tuğba Pamay Arslan, Emircan Erol, Gülşen Eryiğit

2025-09-22

http://arxiv.org/abs/2509.17505v1

Coreference Resolution (CR) is a crucial yet challenging task in natural language understanding, often constrained by task-specific architectures and encoder-based language models that demand extensive training and lack adaptability. This study introduces the first multilingual CR methodology that leverages decoder-only LLMs to handle both overt and zero mentions. The article explores how to model the CR task for LLMs via five different instruction sets using a controlled inference method. The approach is evaluated across three LLMs: Llama 3.1, Gemma 2, and Mistral 0.3. The results indicate that LLMs, when instruction-tuned with a suitable instruction set, can surpass state-of-the-art task-specific architectures. Specifically, our best model, a fully fine-tuned Llama 3.1 for multilingual CR, outperforms the leading multilingual CR model (i.e., Corpipe 24 single stage variant) by 2 pp on average across all languages in the CorefUD v1.2 dataset collection.

Privacy in Action Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents

Authors: Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan

2025-09-22

http://arxiv.org/abs/2509.17488v1

The increasing autonomy of LLM agents in handling sensitive communications, accelerated by Model Context Protocol (MCP) and Agent-to-Agent (A2A) frameworks, creates urgent privacy challenges. While recent work reveals significant gaps between LLMs' privacy Q&A performance and their agent behavior, existing benchmarks remain limited to static, simplified scenarios. We present PrivacyChecker, a model-agnostic, contextual-integrity-based mitigation approach that effectively reduces privacy leakage from 36.08% to 7.30% on DeepSeek-R1 and from 33.06% to 8.32% on GPT-4o, all while preserving task helpfulness. We also introduce PrivacyLens-Live, transforming static benchmarks into dynamic MCP and A2A environments that reveal substantially higher privacy risks in practice. Our modular mitigation approach integrates seamlessly into agent protocols through three deployment strategies, providing practical privacy protection for the emerging agentic ecosystem. Our data and code will be made available at https://aka.ms/privacy_in_action.

Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks

Authors: Chaodong Tong, Qi Zhang, Lei Jiang, Yanbing Liu, Nannan Sun, Wei Li

2025-09-22

http://arxiv.org/abs/2509.17445v1

Reliable question answering with large language models (LLMs) is challenged by hallucinations, fluent but factually incorrect outputs arising from epistemic uncertainty. Existing entropy-based semantic-level uncertainty estimation methods are limited by sampling noise and unstable clustering of variable-length answers. We propose Semantic Reformulation Entropy (SRE), which improves uncertainty estimation in two ways. First, input-side semantic reformulations produce faithful paraphrases, expand the estimation space, and reduce biases from superficial decoder tendencies. Second, progressive, energy-based hybrid clustering stabilizes semantic grouping. Experiments on SQuAD and TriviaQA show that SRE outperforms strong baselines, providing more robust and generalizable hallucination detection. These results demonstrate that combining input diversification with multi-signal clustering substantially enhances semantic-level uncertainty estimation.

QWHA Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models

Authors: Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim

2025-09-22

http://arxiv.org/abs/2509.17428v2

The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.

DINVMark A Deep Invertible Network for Video Watermarking

Authors: Jianbin Ji, Dawen Xu, Li Dong, Lin Yang, Songhan He

2025-09-22

http://arxiv.org/abs/2509.17416v1

With the widespread use of video, video watermarking has become increasingly crucial for copyright protection and content authentication. However, video watermarking still faces numerous challenges. For example, existing methods typically have shortcomings in watermarking capacity and robustness, and there is no specialized noise layer for High Efficiency Video Coding (HEVC) compression. To address these issues, this paper introduces a Deep Invertible Network for Video watermarking (DINVMark) and designs a noise layer to simulate HEVC compression. This approach not only increases watermarking capacity but also enhances robustness. DINVMark employs an Invertible Neural Network (INN), where the encoder and decoder share the same network structure for both watermark embedding and extraction. This shared architecture ensures close coupling between the encoder and decoder, thereby improving the accuracy of the watermark extraction process. Experimental results demonstrate that the proposed scheme significantly enhances watermark robustness, preserves video quality, and substantially increases watermark embedding capacity.

Interpreting vision transformers via residual replacement model

Authors: Jinyeong Kim, Junhyeok Kim, Yumin Shim, Joohyeok Kim, Sunyoung Jung, Seong Jae Hwang

2025-09-22

http://arxiv.org/abs/2509.17401v1

How do vision transformers (ViTs) represent and process the world? This paper addresses this long-standing question through the first systematic analysis of 6.6K features across all layers, extracted via sparse autoencoders, and by introducing the residual replacement model, which replaces ViT computations with interpretable features in the residual stream. Our analysis reveals not only a feature evolution from low-level patterns to high-level semantics, but also how ViTs encode curves and spatial positions through specialized feature types. The residual replacement model scalably produces a faithful yet parsimonious circuit for human-scale interpretability by significantly simplifying the original computations. As a result, this framework enables intuitive understanding of ViT mechanisms. Finally, we demonstrate the utility of our framework in debiasing spurious correlations.

EpiCache Episodic KV Cache Management for Long Conversational Question Answering

Authors: Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho

2025-09-22

http://arxiv.org/abs/2509.17396v2

Modern large language models (LLMs) extend context lengths to up to millions of tokens, enabling AI assistants to generate coherent and personalized responses grounded in long conversational histories. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly becomes the bottleneck in resource-constrained environments. An active line of research for reducing this memory bottleneck is KV cache compression, which seeks to limit KV cache size while preserving accuracy. Yet existing methods face two major limitations: (i) evicting the KV cache after full-context prefill causes unbounded peak memory, and (ii) query-dependent eviction narrows the cache to a single query, leading to failure cases in multi-turn conversations. We introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction. We further design an adaptive layer-wise budget allocation strategy that measures each layer's sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent baselines, sustains near-full accuracy under 4-6x compression, and reduces latency and memory by up to 2.4x and 3.5x, thereby enabling efficient multi-turn interaction under strict resource constraints.

Multimodal Health Risk Prediction System for Chronic Diseases via Vision-Language Fusion and Large Language Models

Authors: Dingxin Lu, Shurui Wu, Xinyi Huang

2025-09-22

http://arxiv.org/abs/2509.18221v1

With the rising global burden of chronic diseases and the multimodal and heterogeneous clinical data (medical imaging, free-text recordings, wearable sensor streams, etc.), there is an urgent need for a unified multimodal AI framework that can proactively predict individual health risks. We propose VL-RiskFormer, a hierarchical stacked visual-language multimodal Transformer with a large language model (LLM) inference head embedded in its top layer. The system builds on the dual-stream architecture of existing visual-linguistic models (e.g., PaLM-E, LLaVA) with four key innovations: (i) pre-training with cross-modal comparison and fine-grained alignment of radiological images, fundus maps, and wearable device photos with corresponding clinical narratives using momentum update encoders and debiased InfoNCE losses; (ii) a time fusion block that integrates irregular visit sequences into the causal Transformer decoder through adaptive time interval position coding; (iii) a disease ontology map adapter that injects ICD-10 codes into visual and textual channels in layers and infers comorbid patterns with the help of a graph attention mechanism. On the MIMIC-IV longitudinal cohort, VL-RiskFormer achieved an average AUROC of 0.90 with an expected calibration error of 2.7 percent.

Asteria Semantic-Aware Cross-Region Caching for Agentic LLM Tool Access

Authors: Chaoyi Ruan, Chao Bi, Kaiwen Zheng, Ziji Shi, Xinyi Wan, Jialin Li

2025-09-22

http://arxiv.org/abs/2509.17360v1

Large Language Model (LLM) agents tackle data-intensive tasks such as deep research and code generation. However, their effectiveness depends on frequent interactions with knowledge sources across remote clouds or regions. Such interactions can create non-trivial latency and cost bottlenecks. Existing caching solutions focus on exact-match queries, limiting their effectiveness for semantic knowledge reuse. To address this challenge, we introduce Asteria, a novel cross-region knowledge caching architecture for LLM agents. At its core are two abstractions: Semantic Element (SE) and Semantic Retrieval Index (Sine). A semantic element captures the semantic embedding representation of an LLM query together with performance-aware metadata such as latency, cost, and staticity. Sine then provides two-stage retrieval: a vector similarity index with semantic embedding for fast candidate selection and a lightweight LLM-powered semantic judger for precise validation. Atop these primitives, Asteria builds a new cache interface that includes a new semantic-aware cache hit definition, a cost-efficient eviction policy, and proactive prefetching. To reduce overhead, Asteria co-locates the small LLM judger with the main LLM using adaptive scheduling and resource sharing. Our evaluation demonstrates that Asteria delivers substantial performance improvements without compromising correctness. On representative search workloads, Asteria achieves up to a 3.6$\times$ increase in throughput by maintaining cache hit rates of over 85%, while preserving accuracy virtually identical to non-cached baselines. Asteria also improves throughput for complex coding tasks by 20%, showcasing its versatility across diverse agentic workloads.
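
The two-stage retrieval described above (vector similarity for candidate selection, then a judger for validation) can be sketched as follows. This is not Asteria's interface: the entry layout (`vec`, `text`, `result`), the `top_k` and `min_sim` parameters, and the `judge` callable standing in for the LLM judger are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_lookup(query_text, query_vec, cache, judge, top_k=3, min_sim=0.8):
    """Return a cached result if a candidate passes both stages, else None."""
    # Stage 1: cheap vector similarity narrows the cache to a few candidates.
    scored = [(cosine(query_vec, entry["vec"]), entry) for entry in cache]
    candidates = sorted(
        (pair for pair in scored if pair[0] >= min_sim),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]
    # Stage 2: the judge confirms that the cached query really matches.
    for _, entry in candidates:
        if judge(query_text, entry["text"]):
            return entry["result"]
    return None
```

Separating the stages means the expensive judge runs on at most `top_k` entries per lookup, while the similarity threshold alone never produces a hit without validation.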

Cronus Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill

Authors: Yunzhao Liu, Qiang Xu, Y. Charlie Hu

2025-09-22

http://arxiv.org/abs/2509.17357v1

Efficient LLM inference is critical for real-world applications, especially within heterogeneous GPU clusters commonly found in organizations and on-premise datacenters as GPU architecture rapidly evolves. Current disaggregated prefill strategies, which separate the prefill and decode stages of inference across different GPUs, often suffer from suboptimal performance due to imbalances between GPU capabilities and workload demands. On the other hand, extending conventional data parallelism and pipeline parallelism to heterogeneous setups incurs high inference latencies. To address these challenges, we introduce Cronus, a novel LLM inference system designed to dynamically balance workloads across heterogeneous GPUs using partially disaggregated prefill. Cronus partitions each prefill stage and executes its initial portion on the low-end GPU, while overlapping the remaining prefill and decode stages of earlier requests on the high-end GPU. Extensive evaluations across various high-end and low-end GPU combinations demonstrate that Cronus significantly improves the throughput over disaggregated prefill. It also reduces TTFT P99 and TBT P99 significantly over DP and PP while maintaining similar or better throughput.

Compact representation of transonic airfoil buffet flows with observable-augmented machine learning

Authors: Kai Fukami, Yuta Iwatani, Soju Maejima, Hiroyuki Asada, Soshi Kawai

2025-09-22

http://arxiv.org/abs/2509.17306v1

Transonic buffet presents time-dependent aerodynamic characteristics associated with shock, turbulent boundary layer, and their interactions. Despite strong nonlinearities and a large number of degrees of freedom, there exists a dominant dynamic pattern of a buffet cycle, suggesting the low dimensionality of transonic buffet phenomena. This study seeks a low-dimensional representation of transonic airfoil buffet at a high Reynolds number with machine learning. Wall-modeled large-eddy simulations of flow over the OAT15A supercritical airfoil at two Mach numbers, $M_\infty = 0.715$ and 0.730, respectively producing non-buffet and buffet conditions, at a chord-based Reynolds number of $Re = 3\times 10^6$ are performed to generate the present datasets. We find that the low-dimensional nature of transonic airfoil buffet can be extracted as a sole three-dimensional latent representation through lift-augmented autoencoder compression. The current low-order representation not only describes the shock movement but also captures the moment when the separation occurs near the trailing edge in a low-order manner. We further show that it is possible to perform sensor-based reconstruction through the present low-dimensional expression while identifying the sensitivity with respect to aerodynamic responses. The present model trained at $Re = 3\times 10^6$ is lastly evaluated at the level of a real aircraft operation of $Re = 3\times 10^7$, showing that the phase dynamics of lift is reasonably estimated from sensors. The current study may provide a foundation toward data-driven real-time analysis of transonic buffet conditions under aircraft operation.

Authors: Jiarui Li, Zixiang Yin, Zhengming Ding, Samuel J. Landry, Ramgopal R. Mettu

2025-09-22

http://arxiv.org/abs/2509.17305v1

T cell receptor (TCR) recognition of peptide-MHC (pMHC) complexes is fundamental to adaptive immunity and central to the development of T cell-based immunotherapies. While transformer-based models have shown promise in predicting TCR-pMHC interactions, most lack a systematic and explainable approach to architecture design. We present an approach that uses a new post-hoc explainability method to inform the construction of a novel encoder-decoder model. By identifying the most informative combinations of TCR and epitope sequence inputs, we optimize cross-attention strategies, incorporate auxiliary training objectives, and introduce a novel early-stopping criterion based on explanation quality. Our framework achieves state-of-the-art predictive performance while simultaneously improving explainability, robustness, and generalization. This work establishes a principled, explanation-driven strategy for modeling TCR-pMHC binding and offers mechanistic insights into sequence-level binding behavior through the lens of deep learning.

Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection

Authors: Jun Seo Kim, Hyemi Kim, Woo Joo Oh, Hongjin Cho, Hochul Lee, Hye Hyeon Kim

2025-09-22

http://arxiv.org/abs/2509.17292v1

Cognitive distortions have been closely linked to mental health disorders, yet their automatic detection remains challenging due to contextual ambiguity, co-occurrence, and semantic overlap. We propose a novel framework that combines Large Language Models (LLMs) with a Multiple-Instance Learning (MIL) architecture to enhance interpretability and expression-level reasoning. Each utterance is decomposed into Emotion, Logic, and Behavior (ELB) components, which are processed by LLMs to infer multiple distortion instances, each with a predicted type, expression, and model-assigned salience score. These instances are integrated via a Multi-View Gated Attention mechanism for final classification. Experiments on Korean (KoACD) and English (Therapist QA) datasets demonstrate that incorporating ELB and LLM-inferred salience scores improves classification performance, especially for distortions with high interpretive ambiguity. Our results suggest a psychologically grounded and generalizable approach for fine-grained reasoning in mental health NLP.

DeepASA An Object-Oriented One-for-All Network for Auditory Scene Analysis

Authors: Dongheon Lee, Younghoo Kwon, Jung-Woo Choi

2025-09-21

http://arxiv.org/abs/2509.17247v1

We propose DeepASA, a one-for-all model for auditory scene analysis that performs multi-input multi-output (MIMO) source separation, dereverberation, sound event detection (SED), audio classification, and direction-of-arrival estimation (DoAE) within a unified framework. DeepASA is designed for complex auditory scenes where multiple, often similar, sound sources overlap in time and move dynamically in space. To achieve robust and consistent inference across tasks, we introduce an object-oriented processing (OOP) strategy. This approach encapsulates diverse auditory features into object-centric representations and refines them through a chain-of-inference (CoI) mechanism. The pipeline comprises a dynamic temporal kernel-based feature extractor, a transformer-based aggregator, and an object separator that yields per-object features. These features feed into multiple task-specific decoders. Our object-centric representations naturally resolve the parameter association ambiguity inherent in traditional track-wise processing. However, early-stage object separation can lead to failure in downstream ASA tasks. To address this, we implement temporal coherence matching (TCM) within the chain-of-inference, enabling multi-task fusion and iterative refinement of object features using estimated auditory parameters. We evaluate DeepASA on representative spatial audio benchmark datasets, including ASA2, MC-FUSS, and STARSS23. Experimental results show that our model achieves state-of-the-art performance across all evaluated tasks, demonstrating its effectiveness in both source separation and auditory parameter estimation under diverse spatial auditory scenes.

MoEs Are Stronger than You Think Hyper-Parallel Inference Scaling with RoE

Authors: Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho

2025-09-21

http://arxiv.org/abs/2509.17238v1

The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, which we refer to as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final prediction. To overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.
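
The routing idea above, perturbing the router so that repeated passes pick different experts and then aggregating their outputs, can be illustrated with a toy sketch. Everything specific here is an assumption: Gumbel noise as the source of stochasticity, top-1 routing, plain averaging as the aggregator, and the names `roe_token_output`, `num_samples`, and `temperature`. The abstract does not specify any of these choices.

```python
import math
import random


def gumbel(rng):
    """Draw one standard Gumbel sample; the floor avoids log(0)."""
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def roe_token_output(router_logits, experts, x, num_samples=4, temperature=0.5, seed=0):
    """Average the outputs of experts chosen by noisy top-1 routing for one token."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(num_samples):
        # Noise perturbs the logits so repeated draws can select different experts.
        noisy = [logit + temperature * gumbel(rng) for logit in router_logits]
        chosen = max(range(len(experts)), key=lambda i: noisy[i])
        outputs.append(experts[chosen](x))
    # Aggregate the per-sample proposals into a single prediction.
    return sum(outputs) / len(outputs)
```

With `temperature=0` the sketch reduces to ordinary deterministic top-1 routing, which makes the noise scale the single knob controlling how diverse the sampled experts are.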

SignalLLM A General-Purpose LLM Agent Framework for Automated Signal Processing

Authors: Junlong Ke, Qiying Hu, Shenghai Yuan, Yuecong Xu, Jianfei Yang

2025-09-21

http://arxiv.org/abs/2509.17197v1

Modern signal processing (SP) pipelines, whether model-based or data-driven, are often constrained by complex and fragmented workflows, rely heavily on expert knowledge and manual engineering, and struggle with adaptability and generalization under limited data. In contrast, Large Language Models (LLMs) offer strong reasoning capabilities, broad general-purpose knowledge, in-context learning, and cross-modal transfer abilities, positioning them as powerful tools for automating and generalizing SP workflows. Motivated by these potentials, we introduce SignalLLM, the first general-purpose LLM-based agent framework for general SP tasks. Unlike prior LLM-based SP approaches that are limited to narrow applications or tricky prompting, SignalLLM introduces a principled, modular architecture. It decomposes high-level SP goals into structured subtasks via in-context learning and domain-specific retrieval, followed by hierarchical planning through adaptive retrieval-augmented generation (RAG) and refinement; these subtasks are then executed through prompt-based reasoning, cross-modal reasoning, code synthesis, model invocation, or data-driven LLM-assisted modeling. Its generalizable design enables the flexible selection of problem-solving strategies across different signal modalities, task types, and data conditions. We demonstrate the versatility and effectiveness of SignalLLM through five representative tasks in communication and sensing, such as radar target detection, human activity recognition, and text compression. Experimental results show superior performance over traditional and existing LLM-based methods, particularly in few-shot and zero-shot settings.

MAST Multi-Agent Spatial Transformer for Learning to Collaborate

Authors: Damian Owerko, Frederic Vatnsdal, Saurav Agarwal, Vijay Kumar, Alejandro Ribeiro

2025-09-21

http://arxiv.org/abs/2509.17195v1

This article presents a novel multi-agent spatial transformer (MAST) for learning communication policies in large-scale decentralized and collaborative multi-robot systems (DC-MRS). Challenges in collaboration in DC-MRS arise from: (i) partially observable states, as robots perceive only their local surroundings, (ii) limited communication range with no central server, and (iii) independent execution of actions. The robots need to optimize a common task-specific objective, which, under the restricted setting, must be done using a communication policy that exhibits the desired collaborative behavior. The proposed MAST is a decentralized transformer architecture that learns communication policies to compute abstract information to be shared with other agents and processes the received information with the robot's own observations. The MAST extends the standard transformer with new positional encoding strategies and attention operations that employ windowing to limit the receptive field for MRS. These are designed for local computation, shift-equivariance, and permutation equivariance, making it a promising approach for DC-MRS. We demonstrate the efficacy of MAST on decentralized assignment and navigation (DAN) and decentralized coverage control. Efficiently trained using imitation learning in a centralized setting, the decentralized MAST policy is robust to communication delays, scales to large teams, and performs better than the baselines and other learning-based approaches.
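As a rough illustration of attention with a windowed receptive field, the sketch below lets each agent attend only to agents within a fixed radius of its own position. The radius-based window, the scaled dot-product scoring, and the function name `windowed_attention` are assumptions for illustration; the abstract does not specify how MAST defines its window or its positional encodings.

```python
import math

def windowed_attention(positions, queries, keys, values, radius):
    """Each agent attends only to agents within `radius` of its position.

    positions: list of coordinate tuples, one per agent
    queries, keys, values: lists of equal-length feature vectors, one per agent
    """
    dim = len(values[0])
    outputs = []
    for p_i, q in zip(positions, queries):
        # The window always contains the agent itself (distance 0).
        window = [j for j, p_j in enumerate(positions)
                  if math.dist(p_i, p_j) <= radius]
        logits = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(len(q))
                  for j in window]
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]
        norm = sum(weights)
        outputs.append([sum(w / norm * values[j][d] for w, j in zip(weights, window))
                        for d in range(dim)])
    return outputs
```

Because the window depends only on relative distances, translating all agents by the same offset leaves the outputs unchanged, and reordering the agents reorders the outputs in the same way.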

Attention Consistency for LLMs Explanation

Authors: Tian Lan, Jinyuan Xu, Xue He, Jenq-Neng Hwang, Lei Li

2025-09-21

http://arxiv.org/abs/2509.17178v1

Understanding the decision-making processes of large language models (LLMs) is essential for their trustworthy development and deployment. However, current interpretability methods often face challenges such as low resolution and high computational cost. To address these limitations, we propose the Multi-Layer Attention Consistency Score (MACS), a novel, lightweight, and easily deployable heuristic for estimating the importance of input tokens in decoder-based models. MACS measures contributions of input tokens based on the consistency of maximal attention. Empirical evaluations demonstrate that MACS achieves a favorable trade-off between interpretability quality and computational efficiency, showing faithfulness comparable to complex techniques with a 22% decrease in VRAM usage and 30% reduction in latency.
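One possible reading of "consistency of maximal attention" is sketched below: take the maximum attention weight over heads in each layer, then multiply across layers so that only tokens attended to strongly in every layer keep a high score. The product aggregation, the normalization, and the name `consistency_scores` are assumptions for illustration; the abstract does not give the exact MACS formula.

```python
def consistency_scores(attn):
    """Score input tokens by how consistently they receive high attention.

    attn[l][h][t] is the attention weight from the final query position
    to input token t in layer l, head h.
    """
    n_tokens = len(attn[0][0])
    scores = [1.0] * n_tokens
    for layer in attn:
        # Maximal attention over heads for each token in this layer.
        layer_max = [max(head[t] for head in layer) for t in range(n_tokens)]
        # Multiplying across layers rewards tokens that stay salient throughout.
        scores = [s * m for s, m in zip(scores, layer_max)]
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores
```

A heuristic of this kind needs only the attention maps from a single forward pass, which fits the abstract's description of it as lightweight.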

Analyzing Memory Effects in Large Language Models through the lens of Cognitive Psychology

Authors: Zhaoyang Cao, Lael Schooler, Reza Zafarani

2025-09-21

http://arxiv.org/abs/2509.17138v1

Memory, a fundamental component of human cognition, exhibits adaptive yet fallible characteristics, as illustrated by Schacter's memory "sins". These cognitive phenomena have been studied extensively in psychology and neuroscience, but the extent to which artificial systems, specifically Large Language Models (LLMs), emulate them remains underexplored. This study uses human memory research as a lens for understanding LLMs and systematically investigates human memory effects in state-of-the-art LLMs using paradigms drawn from psychological research. We evaluate seven key memory phenomena, comparing human behavior to LLM performance. Both people and models remember less when overloaded with information (list length effect) and remember better with repeated exposure (list strength effect). They also show similar difficulties when retrieving overlapping information, where storing too many similar facts leads to confusion (fan effect). Like humans, LLMs are susceptible to falsely "remembering" words that were never shown but are related to others (false memories), and they can apply prior learning to new, related situations (cross-domain generalization). However, LLMs differ in two key ways: they are less influenced by the order in which information is presented (positional bias) and more robust when processing random or meaningless material (nonsense effect). These results reveal both alignments and divergences in how LLMs and humans reconstruct memory. The findings help clarify how memory-like behavior in LLMs echoes core features of human cognition, while also highlighting the architectural differences that lead to distinct patterns of error and success.