2025-09-26
Table of Contents
- Quantized Visual Geometry Grounded Transformer
- Nova Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization
- Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
- Semantic Edge-Cloud Communication for Real-Time Urban Traffic Surveillance with ViT and LLMs over Mobile Networks
- Explaining Fine Tuned LLMs via Counterfactuals A Knowledge Graph Driven Framework
- Tree Search for LLM Agent Reinforcement Learning
- A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
- Who's Laughing Now? An Overview of Computational Humour Generation and Explanation
- GRPO is Secretly a Process Reward Model
- CAD-Tokenizer Towards Text-based CAD Prototyping via Modality-Specific Tokenization
- UniSS Unified Expressive Speech-to-Speech Translation with Your Voice
- Acoustic-based Gender Differentiation in Speech-aware Language Models
- TyphoonMLA A Mixed Naive-Absorb MLA Kernel For Shared Prefix
- KeyWorld Key Frame Reasoning Enables Effective and Efficient World Models
- Binary Autoencoder for Mechanistic Interpretability of Large Language Models
- Toward Robust and Efficient ML-Based GPU Caching for Modern Inference
- MemLens Uncovering Memorization in LLMs with Activation Trajectories
- Punching Above Precision Small Quantized Model Distillation with Learnable Regularizer
- SPADE Structured Pruning and Adaptive Distillation for Efficient LLM-TTS
- Towards Atoms of Large Language Models
- Recidivism and Peer Influence with LLM Text Embeddings in Low Security Correctional Facilities
- Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning
- CHOIR A Chatbot-mediated Organizational Memory Leveraging Communication in University Research Labs
- MARS toward more efficient multi-agent collaboration for LLM reasoning
- Shared Neural Space Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision
- Seedream 4.0 Toward Next-generation Multimodal Image Generation
- Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing
- SIM-CoT Supervised Implicit Chain-of-Thought
- Multimodal Representation-disentangled Information Bottleneck for Multimodal Recommendation
- Q-Palette Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
- From Text to Talk Audio-Language Model Needs Non-Autoregressive Joint Training
- Projective Kolmogorov Arnold Neural Networks (P-KANs) Entropy-Driven Functional Space Discovery for Interpretable Machine Learning
- Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks
- MeshMosaic Scaling Artist Mesh Generation via Local-to-Global Assembly
- RAD Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis
- FastEagle Cascaded Drafting for Accelerating Speculative Decoding
- Exploration with Foundation Models Capabilities, Limitations, and Hybrid Approaches
- Future Policy Aware Preference Learning for Mathematical Reasoning
- Structuring Collective Action with LLM-Guided Evolution From Ill-Structured Problems to Executable Heuristics
- CollaPipe Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks
- BurstEngine an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens
- MMedFD A Real-world Healthcare Benchmark for Multi-turn Full-Duplex Automatic Speech Recognition
- Gyges Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference
- Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling
- Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation
- Transformer Modeling for Both Scalability and Performance in Multivariate Time Series
- CompLLM Compression for Long Context Q&A
- Online Process Reward Learning for Agentic Reinforcement Learning
- Reading Images Like Texts Sequential Image Understanding in Vision-Language Models
- BiGraspFormer End-to-End Bimanual Grasp Transformer
- Clapping Removing Per-sample Storage for Pipeline Parallel Distributed Optimization with Communication Compression
- HD-PPT Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS
- Benchmarking PDF Accessibility Evaluation A Dataset and Framework for Assessing Automated and LLM-Based Approaches for Accessibility Testing
- Confidential LLM Inference Performance and Cost Across CPU and GPU TEEs
- FlashGMM Fast Gaussian Mixture Entropy Model for Learned Image Compression
- Bi-VLM Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models
- HyperCool Reducing Encoding Cost in Overfitted Codecs with Hypernetworks
- PIE Perception and Interaction Enhanced End-to-End Motion Planning for Autonomous Driving
- FlexSED Towards Open-Vocabulary Sound Event Detection
- OmniFed A Modular Framework for Configurable Federated Learning from Edge to HPC
- LLMZ+ Contextual Prompt Whitelist Principles for Agentic LLMs
- Individualized non-uniform quantization for vector search
- LAWCAT Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling
- NormGenesis Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery
- Efficient Particle Acceleration in 2.5-Dimensional, Hybrid-Kinetic Simulations of Decaying, Supersonic, Plasma Turbulence
- Chiplet-Based RISC-V SoC with Modular AI Acceleration
- Speculate Deep and Accurate Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding
- Exploiting Tree Structure for Credit Assignment in RL Training of LLMs
- Evaluating Large Language Models for Detecting Antisemitism
- Spiffy Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
- GraDeT-HTR A Resource-Efficient Bengali Handwritten Text Recognition System utilizing Grapheme-based Tokenizer and Decoder-only Transformer
- TempSamp-R1 Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
- RadEval A framework for radiology text evaluation
- Through the Lens of Human-Human Collaboration A Configurable Research Platform for Exploring Human-Agent Collaboration
- Visual Detector Compression via Location-Aware Discriminant Analysis
- Benchmarking Humans and Machines on Complex Multilingual Speech Understanding Tasks
- Expert-as-a-Service Towards Efficient, Scalable, and Robust Large-scale MoE Serving
- Towards General Computer Control with Hierarchical Agents and Multi-Level Action Spaces
- ConfClip Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs
- When TableQA Meets Noise A Dual Denoising Framework for Complex Questions and Large-scale Tables
- Mechanistic Interpretability with SAEs Probing Religion, Violence, and Geography in Large Language Models
- Evict3R Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
- Bilateral Distribution Compression Reducing Both Data Size and Dimensionality
- Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs
- 4DGCPro Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
- CorefInst Leveraging LLMs for Multilingual Coreference Resolution
- Privacy in Action Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents
- Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks
- QWHA Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
- DINVMark A Deep Invertible Network for Video Watermarking
- Interpreting vision transformers via residual replacement model
- EpiCache Episodic KV Cache Management for Long Conversational Question Answering
- Multimodal Health Risk Prediction System for Chronic Diseases via Vision-Language Fusion and Large Language Models
- Asteria Semantic-Aware Cross-Region Caching for Agentic LLM Tool Access
- Cronus Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill
- Compact representation of transonic airfoil buffet flows with observable-augmented machine learning
- Rational Multi-Modal Transformers for TCR-pMHC Prediction
- Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection
- DeepASA An Object-Oriented One-for-All Network for Auditory Scene Analysis
- MoEs Are Stronger than You Think Hyper-Parallel Inference Scaling with RoE
- SignalLLM A General-Purpose LLM Agent Framework for Automated Signal Processing
- MAST Multi-Agent Spatial Transformer for Learning to Collaborate
- Attention Consistency for LLMs Explanation
- Analyzing Memory Effects in Large Language Models through the lens of Cognitive Psychology
- SnipSnap A Joint Compression Format and Dataflow Co-Optimization Framework for Efficient Sparse LLM Accelerator Design
- The Transfer Neurons Hypothesis An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLMs
- PTQTP Post-Training Quantization to Trit-Planes for Large Language Models
- LLM-Assisted Semantic Guidance for Sparsely Annotated Remote Sensing Object Detection
- Catching the Details Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
- SwarmChat An LLM-Based, Context-Aware Multimodal Interaction System for Robotic Swarms
- ShadowServe Interference-Free KV Cache Fetching for Distributed Prefix Caching
- ISCS Parameter-Guided Channel Ordering and Grouping for Learned Image Compression
- The Even Sheen of AI Kitsch, LLMs, and Homogeneity
- Towards Transparent and Incentive-Compatible Collaboration in Decentralized LLM Multi-Agent Systems A Blockchain-Driven Approach
- Decoding Uncertainty The Impact of Decoding Strategies for Uncertainty Estimation in Large Language Models
- EG-MLA Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs
- -Orthogonality Regularization for Compatible Representation Learning
- Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
- PruneCD Contrasting Pruned Self Model to Improve Decoding Factuality
- Data-Driven Reduced-Order Modeling of Phase Mixing Dynamics from Particle Kinetic Simulation
- Assessing Classical Machine Learning and Transformer-based Approaches for Detecting AI-Generated Research Text
- FG-Attn Leveraging Fine-Grained Sparsity In Diffusion Transformers
- orb-QFL Orbital Quantum Federated Learning
- GRIL Knowledge Graph Retrieval-Integrated Learning with Large Language Models
- Synergies between Federated Foundation Models and Smart Power Grids
- Shift Parallelism Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
- LightCode Compiling LLM Inference for Photonic-Electronic Systems
- SENSE-7 Taxonomy and Dataset for Measuring User Perceptions of Empathy in Sustained Human-AI Conversations
- RephQA Evaluating Readability of Large Language Models in Public Health Question Answering
- Improving Deep Tabular Learning
- The Role of High-Performance GPU Resources in Large Language Model Based Radiology Imaging Diagnosis
- MANZANO A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
- Agentic Aerial Cinematography From Dialogue Cues to Cinematic Trajectories
- It Depends Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge
- Personalized Federated Learning with Heat-Kernel Enhanced Tensorized Multi-View Clustering
- SegDINO3D 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features
- Think, Verbalize, then Speak Bridging Complex Thoughts and Comprehensible Speech
- BEFT Bias-Efficient Fine-Tuning of Language Models
- Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs
- FedHK-MVFC Federated Heat Kernel Multi-View Clustering
- UniGist Towards General and Hardware-aligned Sequence-level Long Context Compression
- RMT-KD Random Matrix Theoretic Causal Knowledge Distillation
- VOX-KRIKRI Unifying Speech and Language through Continuous Fusion
- Multilingual LLM Prompting Strategies for Medical English-Vietnamese Machine Translation
- Interplay Between Belief Propagation and Transformer Differential-Attention Message Passing Transformer
- Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models
- Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding
- DNA-DetectLLM Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm
- Optimization techniques for SQL+ML queries A performance analysis of real-time feature computation in OpenMLDB
- LLM Cache Bandit Revisited Addressing Query Heterogeneity for Cost-Effective LLM Inference
- A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication
- CAGE Continuity-Aware edGE Network Unlocks Robust Floorplan Reconstruction
- IMPQ Interaction-Aware Layerwise Mixed Precision Quantization for LLMs
- LLM-Assisted Topic Reduction for BERTopic on Social Media Data
- LNE-Blocking An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models
- Beyond Surface Alignment Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction
- MaRVIn A Cross-Layer Mixed-Precision RISC-V Framework for DNN Inference, from ISA Extension to Hardware Acceleration
- Stabilizing Information Flow Entropy Regularization for Safe and Interpretable Autonomous Driving Perception
- A1 Asynchronous Test-Time Scaling via Conformal Prediction
- Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning
- Communication Efficient Split Learning of ViTs with Attention-based Double Compression
- Value-Guided KV Compression for LLMs via Approximated CUR Decomposition
Quantized Visual Geometry Grounded Transformer
Authors: Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
2025-09-25
Learning-based 3D reconstruction models, represented by Visual Geometry
Grounded Transformers (VGGTs), have made remarkable progress with the use of
large-scale Transformers. Their prohibitive computational and memory costs
severely hinder real-world deployment. Post-Training Quantization (PTQ) has
become a common practice for compressing and accelerating models. However, we
empirically observe that PTQ faces unique obstacles when compressing
billion-scale VGGTs: the data-independent special tokens induce heavy-tailed
activation distributions, while the multi-view nature of 3D data makes
calibration sample selection highly unstable. This paper proposes the first
Quantization framework for VGGTs, namely QuantVGGT. This mainly relies on two
technical contributions: First, we introduce Dual-Smoothed Fine-Grained
Quantization, which integrates pre-global Hadamard rotation and post-local
channel smoothing to mitigate heavy-tailed distributions and inter-channel
variance robustly. Second, we design Noise-Filtered Diverse Sampling, which
filters outliers via deep-layer statistics and constructs frame-aware diverse
calibration clusters to ensure stable quantization ranges. Comprehensive
experiments demonstrate that QuantVGGT achieves the state-of-the-art results
across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a clear margin. We highlight
that our 4-bit QuantVGGT can deliver a 3.7x memory reduction and 2.5x acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98% of its full-precision counterpart. This
demonstrates the vast advantages and practicality of QuantVGGT in
resource-constrained scenarios. Our code is released at
https://github.com/wlfeng0509/QuantVGGT.
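As a companion illustration, here is a minimal NumPy sketch of the general recipe the abstract describes: a global Hadamard rotation to spread heavy-tailed outliers, followed by per-channel smoothing before uniform quantization. The function names and the 4-bit setting are illustrative assumptions, not QuantVGGT's implementation.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)          # orthonormal rotation

def quantize_dual_smoothed(W, bits=4):
    """Hypothetical sketch: rotate, smooth per channel, uniform-quantize."""
    H = hadamard(W.shape[1])
    W_rot = W @ H                               # rotation spreads outliers
    s = np.abs(W_rot).max(axis=0) + 1e-8        # per-channel smoothing scale
    W_smooth = W_rot / s
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W_smooth).max() / qmax
    Q = np.round(W_smooth / scale).clip(-qmax - 1, qmax)
    W_hat = (Q * scale) * s @ H.T               # dequantize to measure error
    return Q.astype(np.int8), W_hat

W = np.random.standard_t(df=3, size=(64, 128))  # heavy-tailed toy weights
Q, W_hat = quantize_dual_smoothed(W)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```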
Nova Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization
Authors: Yuhang Xu, Shengzhong Liu, Dong Zhang, Bingheng Yan, Fan Wu, Guihai Chen
2025-09-25
This paper presents Nova, a real-time scheduling framework for serving agentic vision-language models (VLMs) on a single GPU with balanced per-request latency and overall request throughput. Our design begins by enabling effective pipelining across the vision encoding, prefill, and decoding stages
of VLMs, by exploiting their heterogeneous resource demands during execution
and incorporating elastic GPU spatial partitioning among stages to maximally
utilize the compute and memory resources. Building on this, we introduce a
real-time scheduling algorithm that adaptively calibrates resource allocation
among stages based on a Pareto-optimal analysis of the latency-throughput
trade-off, allowing the system to sustain responsiveness and resource
efficiency under dynamic request loads. To further alleviate GPU memory
pressure, we design a lightweight weight offloading strategy for vision
encoders that preserves inference efficiency with minimized memory overhead.
Extensive evaluations on both synthetic and real-world agent workloads
demonstrate that Nova consistently outperforms the state-of-the-art baselines,
improving the maximum latency by up to 23.3%, while keeping competitive
throughput.
Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
Authors: Shiju Wang, Yujie Wang, Ao Sun, Fangcheng Fu, Zijian Zhu, Bin Cui, Xu Han, Kaisheng Ma
2025-09-25
Long context training is crucial for LLMs' context extension. Existing schemes, such as sequence parallelism, incur substantial communication overhead. Pipeline parallelism (PP) reduces this cost, but its effectiveness
hinges on partitioning granularity. Batch-level PP, which divides input samples, exhibits high memory consumption in long-context scenarios, whereas token-level PP, which splits sequences into slices, alleviates memory overhead but may incur
hardware under-utilization. This trade-off motivates adaptively selecting PP
granularity to match resource and workload characteristics. Moreover, the sequence length distribution of real-world datasets is skewed, posing a challenge to PP's workload balance and efficient scheduling. Current static PP
scheduling methods overlook the variance of sequence length, leading to
suboptimal performance. In this paper, we propose Elastic Pipeline Parallelism
(EPP) that orchestrates token-level PP and batch-level PP to adapt to resource
and workload heterogeneity. We build InfiniPipe, a distributed training system
that unleashes the potential of EPP via (1) a resource-aware and
workload-balanced sequence processor that splits long sequences and packs short
ones; and (2) a co-optimization methodology that jointly optimizes pipeline
schedule and gradient checkpointing via a mechanism named stage-aware
chunk-level adaptive checkpointing. Comprehensive experiments demonstrate that
InfiniPipe achieves a 1.69x speedup over state-of-the-art systems.
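A toy sketch of the granularity trade-off EPP navigates: slice long sequences at token level, pack short ones at batch level. The 4096-token chunk size and the packing heuristic are assumptions for illustration, not InfiniPipe's actual scheduler.

```python
def schedule_pp_units(seq_lens, chunk_tokens=4096):
    """Hypothetical sketch of elastic PP granularity selection:
    long sequences are sliced token-level; short ones are packed batch-level."""
    units, pack = [], []
    for L in sorted(seq_lens, reverse=True):
        if L > chunk_tokens:                      # token-level PP: slice
            units += [("slice", min(chunk_tokens, L - i))
                      for i in range(0, L, chunk_tokens)]
        else:                                     # batch-level PP: pack
            pack.append(L)
            if sum(pack) >= chunk_tokens:
                units.append(("pack", sum(pack)))
                pack = []
    if pack:
        units.append(("pack", sum(pack)))
    return units

print(schedule_pp_units([9000, 3000, 2500, 1200, 800, 300]))
```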
Semantic Edge-Cloud Communication for Real-Time Urban Traffic Surveillance with ViT and LLMs over Mobile Networks
Authors: Murat Arda Onsu, Poonam Lohan, Burak Kantarci, Aisha Syed, Matthew Andrews, Sean Kennedy
2025-09-25
Real-time urban traffic surveillance is vital for Intelligent Transportation
Systems (ITS) to ensure road safety, optimize traffic flow, track vehicle
trajectories, and prevent collisions in smart cities. Deploying edge cameras
across urban environments is a standard practice for monitoring road
conditions. However, integrating these with intelligent models requires a
robust understanding of dynamic traffic scenarios and a responsive interface
for user interaction. Although multimodal Large Language Models (LLMs) can
interpret traffic images and generate informative responses, their deployment
on edge devices is infeasible due to high computational demands. Therefore,
inference must occur on the cloud, necessitating visual data transmission from
edge to cloud, a process hindered by limited bandwidth, leading to potential
delays that compromise real-time performance. To address this challenge, we
propose a semantic communication framework that significantly reduces
transmission overhead. Our method involves detecting Regions of Interest (RoIs)
using YOLOv11, cropping relevant image segments, and converting them into
compact embedding vectors using a Vision Transformer (ViT). These embeddings
are then transmitted to the cloud, where an image decoder reconstructs the cropped images. The reconstructed images are processed by a multimodal LLM to
generate traffic condition descriptions. This approach achieves a 99.9%
reduction in data transmission size while maintaining an LLM response accuracy
of 89% for reconstructed cropped images, compared to 93% accuracy with original
cropped images. Our results demonstrate the efficiency and practicality of ViT- and LLM-assisted edge-cloud semantic communication for real-time traffic surveillance.
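A shape-level sketch of the bandwidth saving such a pipeline targets. The detector and ViT below are random stand-ins, not YOLOv11 or the paper's encoder, but the byte accounting shows why sending one compact embedding per RoI instead of a raw frame yields a ~99.9% reduction.

```python
import numpy as np

def detect_rois(frame):
    # stand-in for a YOLO-style detector: returns (x, y, w, h) boxes
    return [(40, 60, 128, 96)]

def vit_embed(crop, dim=768):
    # stand-in for a ViT encoder producing one compact embedding per crop
    rng = np.random.default_rng(0)
    return rng.standard_normal(dim).astype(np.float16)

frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # raw edge-camera frame
crops = [frame[y:y+h, x:x+w] for (x, y, w, h) in detect_rois(frame)]
embs = [vit_embed(c) for c in crops]

raw_bytes = frame.nbytes
sent_bytes = sum(e.nbytes for e in embs)
print(f"transmitted {sent_bytes} B instead of {raw_bytes} B "
      f"({100 * (1 - sent_bytes / raw_bytes):.2f}% reduction)")
```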
Explaining Fine Tuned LLMs via Counterfactuals A Knowledge Graph Driven Framework
Authors: Yucheng Wang, Ziyang Chen, Md Faisal Kabir
2025-09-25
The widespread adoption of Low-Rank Adaptation (LoRA) has enabled large
language models (LLMs) to acquire domain-specific knowledge with remarkable
efficiency. However, understanding how such a fine-tuning mechanism alters a
model's structural reasoning and semantic behavior remains an open challenge.
This work introduces a novel framework that explains fine-tuned LLMs via
counterfactuals grounded in knowledge graphs. Specifically, we construct
BioToolKG, a domain-specific heterogeneous knowledge graph of bioinformatics tools, and design a counterfactual-based fine-tuned LLM explainer (CFFTLLM Explainer) that learns soft masks over graph nodes and edges to
generate minimal structural perturbations that induce maximum semantic
divergence. Our method jointly optimizes structural sparsity and semantic divergence while enforcing interpretability-preserving constraints such as
entropy regularization and edge smoothness. We apply this framework to a
fine-tuned LLaMA-based LLM and reveal that counterfactual masking exposes the
model's structural dependencies and aligns with LoRA-induced parameter shifts.
This work provides new insights into the internal mechanisms of fine-tuned LLMs
and highlights counterfactual graphs as a potential tool for interpretable AI.
Tree Search for LLM Agent Reinforcement Learning
Authors: Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
2025-09-25
Recent advances in reinforcement learning (RL) have significantly enhanced
the agentic capabilities of large language models (LLMs). In long-term and
multi-turn agent tasks, existing approaches driven solely by outcome rewards
often suffer from the problem of sparse supervision. To address the challenge,
we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped
agent RL method based on tree search, where each tree node represents the
complete agent interaction step. By sharing common prefixes, the tree search
sampling increases the number of rollouts achievable within a fixed budget of
tokens or tool calls. Moreover, we find that the tree-structured trajectory
naturally allows the construction of step-wise process supervised signals even
using only the outcome reward. Based on this, Tree-GRPO estimates the grouped
relative advantages both on intra-tree and inter-tree levels. Through
theoretical analysis, we demonstrate that the objective of intra-tree level
group relative policy optimization is equivalent to that of step-level direct
preference learning. Experiments across 11 datasets and 3 types of QA tasks
demonstrate the superiority of the proposed tree-based RL over the chain-based
RL method.
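A minimal sketch of the intra-tree grouping idea: siblings expanded from the same node share its prefix and form a group, and each trajectory's advantage is its outcome reward standardized within that group. The tree and rewards are made up for illustration.

```python
from statistics import mean, pstdev

# Toy rollout tree: siblings share the prefix of their parent node, so a
# fixed token budget yields more comparable trajectories.
children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
outcome = {"a1": 1.0, "a2": 0.0, "b1": 1.0, "b2": 1.0}

def intra_tree_advantages(node):
    """Group-relative advantage among siblings sharing a common prefix."""
    kids = children[node]
    r = [outcome[k] for k in kids]
    mu, sd = mean(r), pstdev(r) or 1.0   # guard against zero variance
    return {k: (outcome[k] - mu) / sd for k in kids}

for node in ("a", "b"):
    print(node, intra_tree_advantages(node))
```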
A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
Authors: Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen
2025-09-25
Multi-Hop Question Answering (MHQA) requires integrating dispersed,
interdependent evidence through sequential reasoning under noise. This task is
challenging for LLMs as they have a finite per-pass output capacity, beyond
which the integration of task-relevant evidence proves unreliable.
Consequently, the single-pass reasoning paradigm is inherently vulnerable to
this capacity overflow. To formalize this bottleneck, our analysis establishes
a Fano-style accuracy upper bound, defining a theoretical performance ceiling
for single-pass LLMs. This bound reveals that accuracy inevitably collapses
once task complexity exceeds model capacity, providing general principles for
capacity-aware representation and structuring of MHQA in LLMs. Building on
these principles, we introduce a proof-of-concept multi-call framework for
MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware
task decomposition with active pruning of prior reasoning traces, keeping the
information load within the single-pass limit. It further achieves robustness
by a dependency-explicit workflow that enables precise control over the
reasoning path. We construct a stringent and noise-rich benchmark to validate
our theory and framework. Experimental results show that model behavior aligns
with our predicted capacity curves while InfoQA achieves consistent performance
improvements. We hope our work inspires more multi-step reasoning methods. Code: https://github.com/KaiyangWan/InfoQA.
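For orientation, the classical Fano inequality from which such accuracy ceilings are usually derived (the paper's exact bound may differ in form):

```latex
% Classical Fano inequality for an answer variable Y over |Y| choices,
% given evidence X (illustrative; the paper's bound may differ):
H(P_e) + P_e \log\bigl(|\mathcal{Y}| - 1\bigr) \;\ge\; H(Y \mid X)
% If Y is uniform, H(Y \mid X) = \log|\mathcal{Y}| - I(X;Y), giving the ceiling
\mathrm{Acc} \;=\; 1 - P_e \;\le\; \frac{I(X;Y) + 1}{\log|\mathcal{Y}|}
```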
Who's Laughing Now? An Overview of Computational Humour Generation and Explanation
Authors: Tyler Loakman, William Thorne, Chenghua Lin
2025-09-25
The creation and perception of humour is a fundamental human trait,
positioning its computational understanding as one of the most challenging
tasks in natural language processing (NLP). As an abstract, creative, and
frequently context-dependent construct, humour requires extensive reasoning to
understand and create, making it a pertinent task for assessing the
common-sense knowledge and reasoning abilities of modern large language models
(LLMs). In this work, we survey the landscape of computational humour as it
pertains to the generative tasks of creation and explanation. We observe that,
despite the task of understanding humour bearing all the hallmarks of a
foundational NLP task, work on generating and explaining humour beyond puns
remains sparse, while state-of-the-art models continue to fall short of human
capabilities. We bookend our literature survey by motivating the importance of
computational humour processing as a subdiscipline of NLP and presenting an
extensive discussion of future directions for research in the area that takes
into account the subjective and ethically ambiguous nature of humour.
GRPO is Secretly a Process Reward Model
Authors: Michael Sullivan
2025-09-25
We prove theoretically that the GRPO RL algorithm induces a non-trivial
process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that
these assumptions are met under real-world conditions: GRPO does in fact induce
a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a
flaw in the GRPO objective: non-uniformly distributed process steps hinder both
exploration and exploitation (under different conditions). We propose a simple
modification to the algorithm to mitigate this defect (-GRPO), and show that LLMs trained with -GRPO achieve higher validation accuracy and performance on downstream reasoning tasks, and reach peak performance more rapidly, than LLMs trained with standard GRPO. Our results call into question
the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is
possible to instead leverage the hidden, built-in PRM structure within the
vanilla GRPO algorithm to boost model performance with a negligible impact on
training time and cost.
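An illustrative reading of the GRPO-as-a-PRM framing (our toy construction, not the paper's derivation): give each process step the average group-relative advantage of the completions it appears in, so steps unique to high-reward completions accumulate positive credit.

```python
from statistics import mean, pstdev

# Hypothetical group of completions that share some process steps.
completions = {
    "c1": (["plan", "solve", "check"], 1.0),
    "c2": (["plan", "solve"],          0.0),
    "c3": (["plan", "guess"],          0.0),
}
rewards = [r for _, r in completions.values()]
mu, sd = mean(rewards), pstdev(rewards) or 1.0
adv = {c: (r - mu) / sd for c, (_, r) in completions.items()}

# Implicit per-step credit: average the group-relative advantage over
# every completion containing a step -- a PRM-like signal.
step_credit = {}
for c, (steps, _) in completions.items():
    for s in steps:
        step_credit.setdefault(s, []).append(adv[c])
print({s: round(mean(a), 3) for s, a in step_credit.items()})
```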
CAD-Tokenizer Towards Text-based CAD Prototyping via Modality-Specific Tokenization
Authors: Ruiyu Wang, Shizhao Sun, Weijian Ma, Jiang Bian
2025-09-25
Computer-Aided Design (CAD) is a foundational component of industrial
prototyping, where models are defined not by raw coordinates but by
construction sequences such as sketches and extrusions. This sequential
structure enables both efficient prototype initialization and subsequent
editing. Text-guided CAD prototyping, which unifies Text-to-CAD generation and
CAD editing, has the potential to streamline the entire design pipeline.
However, prior work has not explored this setting, largely because standard
large language model (LLM) tokenizers decompose CAD sequences into
natural-language word pieces, failing to capture primitive-level CAD semantics
and hindering attention modules from modeling geometric structure. We
conjecture that a multimodal tokenization strategy, aligned with CAD's
primitive and structural nature, can provide more effective representations. To
this end, we propose CAD-Tokenizer, a framework that represents CAD data with
modality-specific tokens using a sequence-based VQ-VAE with primitive-level
pooling and constrained decoding. This design produces compact, primitive-aware
representations that align with CAD's structural nature. Applied to unified
text-guided CAD prototyping, CAD-Tokenizer significantly improves instruction
following and generation quality, achieving better quantitative and qualitative
performance over both general-purpose LLMs and task-specific baselines.
UniSS Unified Expressive Speech-to-Speech Translation with Your Voice
Authors: Sitong Cheng, Weizhen Bian, Xinsheng Wang, Ruibin Yuan, Jianyi Chen, Shunshun Yin, Yike Guo, Wei Xue
2025-09-25
The ultimate goal of expressive speech-to-speech translation (S2ST) is to
accurately translate spoken content while preserving the speaker identity and
emotional style. However, progress in this field is largely hindered by three
key challenges: the scarcity of paired speech data that retains expressive
styles, the complexity of multi-stage processing pipelines, and the limited
transfer of translation capabilities from large language models (LLMs). In this
work, we address these challenges by introducing UniSS, a novel single-stage
framework for expressive S2ST. Our approach features carefully designed speech
semantic and style modeling, enabling seamless integration with existing
text-based LLM frameworks to develop a unified text-speech language model. To
transfer translation capabilities from text to speech, we propose a cross-modal
chain-of-thought prompting process that progressively aligns audio semantics
with text and ensures style preservation in the decoded results. Furthermore,
we construct and release a large-scale, high-quality expressive S2ST dataset,
UniST, comprising 44.8k hours of data. Experimental results show that UniSS
significantly outperforms previous methods in translation fidelity and speech
quality while preserving voice, emotion, and duration consistency. Our work
establishes a simpler and more effective paradigm for building the next
generation of expressive S2ST systems. Audio samples are available at
https://cmots.github.io/uniss-demo.
Acoustic-based Gender Differentiation in Speech-aware Language Models
Authors: Junhyuk Choi, Jihwan Seol, Nayeon Kim, Chanhee Cho, EunBin Cho, Bugeun Kim
2025-09-25
Speech-aware Language Models (SpeechLMs) have fundamentally transformed
human-AI interaction by enabling voice-based communication, yet they may exhibit acoustic-based gender differentiation where identical questions lead to different responses based on the speaker's gender. This paper proposes a new
dataset that enables systematic analysis of this phenomenon, containing 9,208
speech samples across three categories: Gender-Independent,
Gender-Stereotypical, and Gender-Dependent. We further evaluated the LLaMA-Omni series and discovered a paradoxical pattern: while overall responses seem identical regardless of gender, the pattern is far from unbiased. Specifically, in Gender-Stereotypical questions, all models consistently exhibited male-oriented responses; meanwhile, in Gender-Dependent questions, where gender differentiation would be contextually appropriate, models instead exhibited responses independent of gender. We also confirm that this pattern results neither from neutral options nor from the perceived gender of a voice. When we allow neutral responses, models tend to respond neutrally in Gender-Dependent questions as well, and the paradoxical pattern persists even when we apply gender neutralization methods to speech. Through comparison between SpeechLMs and their corresponding backbone LLMs, we confirmed that these paradoxical patterns primarily stem from Whisper speech encoders, which generate male-oriented acoustic tokens. These findings reveal that current SpeechLMs may not successfully remove gender biases; rather, they prioritize general fairness principles over contextual appropriateness, highlighting the need for more
sophisticated techniques to utilize gender information properly in speech
technology.
TyphoonMLA A Mixed Naive-Absorb MLA Kernel For Shared Prefix
Authors: Ahmet Caner Yüzügüler, Ahmet Çelik, Jiawei Zhuang, Lukas Cavigelli
2025-09-25
Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in
state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel
formulation, MLA allows two functionally equivalent but computationally
distinct kernel implementations: naive and absorb. While the naive kernels
(e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decode kernels (e.g., FlashMLA) rely
on the absorb method to minimize HBM bandwidth usage. However, the
compute-bound nature of the absorb implementations prohibits performance
benefits from data reuse opportunities in attention calculations, such as
shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that
combines naive and absorb formulations to harness the strengths of both.
TyphoonMLA effectively leverages the shared prefix by applying the naive
formulation to the compute-bound parts of attention calculations, while
reducing the bandwidth requirements for non-shared parts by using the absorb
formulation. As a result, TyphoonMLA improves the throughput of attention
calculations in MLA architectures by up to 3x and 3.24x on NPU and GPUs, with
only a 3% overhead in HBM size.
KeyWorld Key Frame Reasoning Enables Effective and Efficient World Models
Authors: Sibo Li, Qianyue Hao, Yu Shang, Yong Li
2025-09-25
Robotic world models are a promising paradigm for forecasting future
environment states, yet their inference speed and the physical plausibility of
generated trajectories remain critical bottlenecks, limiting their real-world
applications. This stems from the redundancy of the prevailing frame-to-frame
generation approach, where the model conducts costly computation on similar
frames, as well as neglecting the semantic importance of key transitions. To
address this inefficiency, we propose KeyWorld, a framework that improves
text-conditioned robotic world models by concentrating transformer computation
on a few semantic key frames while employing a lightweight convolutional model
to fill the intermediate frames. Specifically, KeyWorld first identifies
significant transitions by iteratively simplifying the robot's motion
trajectories, obtaining the ground truth key frames. Then, a DiT model is
trained to reason and generate these physically meaningful key frames from
textual task descriptions. Finally, a lightweight interpolator efficiently
reconstructs the full video by inpainting all intermediate frames. Evaluations
on the LIBERO benchmark demonstrate that KeyWorld achieves a 5.68x acceleration compared to the frame-to-frame generation baseline, and focusing
on the motion-aware key frames further contributes to the physical validity of
the generated videos, especially on complex tasks. Our approach highlights a
practical path toward deploying world models in real-time robotic control and
other domains requiring both efficient and effective world models. Code is
released at https://anonymous.4open.science/r/Keyworld-E43D.
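The "iterative trajectory simplification" step admits a classic implementation; the sketch below uses Ramer-Douglas-Peucker as an illustrative stand-in, keeping the frames where motion deviates most from straight-line interpolation between segment endpoints.

```python
import numpy as np

def keyframes_rdp(traj, eps):
    """Ramer-Douglas-Peucker simplification, an illustrative stand-in for
    the paper's motion-trajectory simplification."""
    if len(traj) < 3:
        return [0, len(traj) - 1]
    a, b = traj[0], traj[-1]
    ab = b - a
    t = np.clip(((traj - a) @ ab) / (ab @ ab + 1e-12), 0.0, 1.0)
    d = np.linalg.norm(traj - (a + np.outer(t, ab)), axis=1)
    i = int(d.argmax())
    if d[i] < eps:                 # segment nearly linear: endpoints suffice
        return [0, len(traj) - 1]
    left = keyframes_rdp(traj[: i + 1], eps)
    right = [i + j for j in keyframes_rdp(traj[i:], eps)]
    return sorted(set(left + right))

traj = np.array([[0, 0], [1, 0.1], [2, 0], [2, 1], [2, 2], [3, 2]], float)
print("key frames:", keyframes_rdp(traj, eps=0.3))  # indices of key frames
```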
Binary Autoencoder for Mechanistic Interpretability of Large Language Models
Authors: Hakaze Cho, Haolin Yang, Brian M. Kurkoski, Naoya Inoue
2025-09-25
Existing works are dedicated to untangling atomized numerical components
(features) from the hidden states of Large Language Models (LLMs) for
interpreting their mechanism. However, they typically rely on autoencoders
constrained by some implicit training-time regularization on single training
instances (i.e., normalization, top-k function, etc.), without an explicit guarantee of global sparsity among instances, causing a large number of dense (simultaneously active) features, harming the feature sparsity and
atomization. In this paper, we propose a novel autoencoder variant that
enforces minimal entropy on minibatches of hidden activations, thereby
promoting feature independence and sparsity across instances. For efficient
entropy calculation, we discretize the hidden activations to 1-bit via a step
function and apply gradient estimation to enable backpropagation, so that we
term it the Binary Autoencoder (BAE) and empirically demonstrate two major
applications: (1) Feature set entropy calculation. Entropy can be reliably
estimated on binary hidden activations, which we empirically evaluate and
leverage to characterize the inference dynamics of LLMs and In-context
Learning. (2) Feature untangling. Similar to typical methods, BAE can extract
atomized features from an LLM's hidden states. To robustly evaluate such feature
extraction capability, we refine traditional feature-interpretation methods to
avoid unreliable handling of numerical tokens, and show that BAE avoids dense
features while producing the largest number of interpretable ones among
baselines, which confirms the effectiveness of BAE as a feature extractor.
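A minimal PyTorch sketch of the two ingredients named above, with sizes and the loss form assumed: a 1-bit step function trained via a straight-through gradient estimator, plus a per-feature Bernoulli entropy estimated over the minibatch as the regularizer.

```python
import torch

def binarize_ste(h):
    """1-bit step with a straight-through gradient estimator."""
    b = (h > 0).float()
    return b + (h - h.detach())      # forward: b; backward: identity on h

def minibatch_entropy(b, eps=1e-8):
    """Per-feature Bernoulli entropy estimated over the minibatch."""
    p = b.mean(dim=0).clamp(eps, 1 - eps)
    return -(p * p.log2() + (1 - p) * (1 - p).log2()).sum()

h = torch.randn(256, 64, requires_grad=True)   # toy hidden activations
b = binarize_ste(h)
loss = minibatch_entropy(b)                    # entropy regularizer (illustrative)
loss.backward()
print(float(loss), h.grad.abs().mean().item())
```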
Toward Robust and Efficient ML-Based GPU Caching for Modern Inference
Authors: Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng
2025-09-25
In modern GPU inference, caching efficiency remains a major bottleneck. In
recommendation models, embedding hit rates largely determine throughput, while
in large language models, KV-cache misses substantially increase
time-to-first-token (TTFT). Heuristic policies such as LRU often
struggle under structured access patterns. Learning-based approaches are
promising, but in practice face two major limitations: they degrade sharply
when predictions are inaccurate, or they gain little even with accurate
predictions due to conservative designs. Some also incur high overhead, further
limiting practicality.
We present LCR, a practical framework for learning-based GPU caching that delivers performance gains while ensuring robustness and efficiency. Its core algorithm, LARU, enhances LRU with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation. When predictions are accurate, LARU achieves near-optimal performance. With inaccurate predictions, it degrades gracefully to near-LRU performance. With LCR, we bridge the gap between empirical progress and theoretical advances in learning-based caching. Experiments show that LCR delivers consistent gains under realistic conditions. In DLRM and LLM scenarios, it improves throughput by up to 24.2% and reduces P99 TTFT by up to 28.3%, outperforming widely used inference systems. Even under poor predictions, its performance remains stable, demonstrating practical robustness.
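An illustrative toy of the LARU idea (not the paper's algorithm): evict by predicted reuse distance while an online error estimate says the predictor is trustworthy, otherwise fall back to plain LRU. The predictor interface, thresholds, and error proxy are all assumptions.

```python
from collections import OrderedDict

class LARULike:
    """Toy LRU-with-predictions cache, loosely in the spirit of LARU."""
    def __init__(self, cap, predictor):
        self.cap, self.predict = cap, predictor
        self.d = OrderedDict()
        self.err = 0.5                       # crude online error estimate

    def get(self, key, now):
        hit = key in self.d
        if hit:
            self.d.move_to_end(key)          # refresh recency, as in LRU
        elif len(self.d) >= self.cap:
            if self.err < 0.3:               # predictions look reliable
                victim = max(self.d, key=lambda k: self.predict(k, now))
            else:                            # degrade gracefully to plain LRU
                victim = next(iter(self.d))
            del self.d[victim]
        self.d[key] = True
        # proxy update: treat misses as evidence the policy is mispredicting
        self.err = 0.9 * self.err + 0.1 * (0.0 if hit else 1.0)
        return hit

cache = LARULike(2, predictor=lambda k, now: hash(k) % 10)
print([cache.get(k, t) for t, k in enumerate("abab")])  # [False, False, True, True]
```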
MemLens Uncovering Memorization in LLMs with Activation Trajectories
Authors: Zirui He, Haiyan Zhao, Ali Payani, Mengnan du
2025-09-25
Large language models (LLMs) are commonly evaluated on challenging benchmarks
such as AIME and Math500, which are susceptible to contamination and risk of
being memorized. Existing detection methods, which primarily rely on
surface-level lexical overlap and perplexity, demonstrate low generalization
and degrade significantly when encountering implicitly contaminated data. In
this paper, we propose MemLens (An Activation Lens for Memorization Detection)
to detect memorization by analyzing the probability trajectories of numeric
tokens during generation. Our method reveals that contaminated samples exhibit
"shortcut" behaviors, locking onto an answer with high confidence in the
model's early layers, whereas clean samples show more gradual evidence
accumulation across the model's full depth. We observe that contaminated and
clean samples exhibit distinct and well-separated reasoning trajectories. To
further validate this, we inject carefully designed samples into the model
through LoRA fine-tuning and observe the same trajectory patterns as in
naturally contaminated data. These results provide strong evidence that MemLens
captures genuine signals of memorization rather than spurious correlations.
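A logit-lens-style sketch of the trajectory probe as we read it (our illustrative reading, not MemLens's code; random tensors stand in for a real model's residual stream and unembedding matrix).

```python
import torch

def numeric_token_trajectory(hiddens, W_U, answer_id):
    """Decode each layer's residual state and read off the probability
    assigned to the numeric answer token."""
    logits = hiddens @ W_U                       # [n_layers, vocab]
    return torch.softmax(logits, dim=-1)[:, answer_id]

n_layers, d_model, vocab = 24, 512, 1000         # toy sizes
hiddens = torch.randn(n_layers, d_model)         # residual stream at answer position
W_U = torch.randn(d_model, vocab)                # unembedding matrix
traj = numeric_token_trajectory(hiddens, W_U, answer_id=42)
# A memorized ("shortcut") sample would lock in early: high probability
# already in shallow layers; clean samples climb gradually with depth.
print(traj[:4], traj[-1])
```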
Punching Above Precision Small Quantized Model Distillation with Learnable Regularizer
Authors: Abdur Rehman, S M A Sharif, Md Abdur Rahaman, Mohamed Jismy Aashik Rasool, Seongwan Kim, Jaeho Lee
2025-09-25
Quantization-aware training (QAT) combined with knowledge distillation (KD)
is a promising strategy for compressing Artificial Intelligence (AI) models for
deployment on resource-constrained hardware. However, existing QAT-KD methods
often struggle to balance task-specific (TS) and distillation losses due to
heterogeneous gradient magnitudes, especially under quantization. We
propose Game of Regularizer (GoR), a novel learnable regularization method that
adaptively balances TS and KD objectives using only two trainable parameters
for dynamic loss weighting. GoR reduces conflict between supervision signals,
improves convergence, and boosts the performance of small quantized models (SQMs). Experiments on image classification, object detection (OD), and large
language model (LLM) quantization show that GoR consistently outperforms
state-of-the-art QAT-KD methods. On low-power edge devices, it delivers faster
inference while maintaining full-precision accuracy. We also introduce
QAT-EKD-GoR, an ensemble distillation framework that uses multiple
heterogeneous teacher models. Under optimal conditions, the proposed EKD-GoR
can outperform full-precision models, providing a robust solution for
real-world deployment.
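The abstract does not specify GoR's two-parameter weighting; the sketch below shows one standard way to realize dynamic loss balancing with exactly two trainable parameters (uncertainty-style weighting), offered as an assumption-laden illustration rather than the paper's formulation.

```python
import torch

class TwoParamLossBalancer(torch.nn.Module):
    """Sketch: balance task-specific (TS) and distillation (KD) losses
    with two trainable log-scale parameters."""
    def __init__(self):
        super().__init__()
        self.log_s_ts = torch.nn.Parameter(torch.zeros(()))
        self.log_s_kd = torch.nn.Parameter(torch.zeros(()))

    def forward(self, loss_ts, loss_kd):
        # exp(-log_s) scales each loss; +log_s penalizes trivial down-weighting
        return (torch.exp(-self.log_s_ts) * loss_ts + self.log_s_ts
                + torch.exp(-self.log_s_kd) * loss_kd + self.log_s_kd)

bal = TwoParamLossBalancer()
total = bal(torch.tensor(2.3), torch.tensor(0.4))  # toy loss values
total.backward()
print(total.item(), bal.log_s_ts.grad.item())
```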
SPADE Structured Pruning and Adaptive Distillation for Efficient LLM-TTS
Authors: Tan Dat Nguyen, Jaehun Kim, Ji-Hoon Kim, Shukjae Choi, Youshin Lim, Joon Son Chung
2025-09-25
The goal of this paper is to introduce SPADE, a framework for Structured
Pruning and Adaptive Distillation for Efficient Large Language Model-based
text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability
and zero-shot generalization, but their large parameter counts and high latency
limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate-based layer importance index to remove
non-essential Transformer layers, with (ii) multi-level knowledge distillation
to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves
near-parity perceptual quality while halving Transformer depth, reducing VRAM
usage by up to 20%, and achieving up to 1.7x faster real-time factor with less
than 5% of the original training data. These results show that compact LLM-TTS
models can maintain naturalness and speaker similarity while enabling practical
real-time speech generation. Audio samples are available at
https://mm.kaist.ac.kr/projects/SPADE/.
Towards Atoms of Large Language Models
Authors: Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
2025-09-25
The fundamental units of internal representations in large language models
(LLMs) remain undefined, limiting further understanding of their mechanisms.
Neurons or features are often regarded as such units, yet neurons suffer from
polysemy, while features face concerns of unreliable reconstruction and
instability. To address this issue, we propose the Atoms Theory, which defines
such units as atoms. We introduce the atomic inner product (AIP) to correct
representation shifting, formally define atoms, and prove the conditions that
atoms satisfy the Restricted Isometry Property (RIP), ensuring stable
representations over atom set and linking to compressed sensing. Under stronger
conditions, we further establish the uniqueness and exact recoverability of the sparse representations, and provide guarantees that single-layer sparse autoencoders (SAEs) with threshold activations can reliably
identify the atoms. To validate the Atoms Theory, we train threshold-activated
SAEs on Gemma2-2B, Gemma2-9B, and Llama3.1-8B, achieving 99.9% sparse reconstruction across layers on average, and more than 99.8% of atoms satisfy
the uniqueness condition, compared to 0.5% for neurons and 68.2% for features,
showing that atoms more faithfully capture intrinsic representations of LLMs.
Scaling experiments further reveal the link between SAE size and recovery capacity. Overall, this work systematically introduces and validates the Atoms Theory of LLMs, providing a theoretical framework for understanding internal
representations and a foundation for mechanistic interpretability. Code
available at https://github.com/ChenhuiHu/towards_atoms.
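A minimal threshold-activated SAE of the kind the theory covers; dimensions, threshold, and training loss are illustrative choices, not the paper's configuration.

```python
import torch

class ThresholdSAE(torch.nn.Module):
    """Minimal threshold-activated sparse autoencoder (illustrative sizes)."""
    def __init__(self, d_model=512, n_atoms=4096, theta=0.1):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, n_atoms)
        self.dec = torch.nn.Linear(n_atoms, d_model, bias=False)
        self.theta = theta

    def forward(self, x):
        a = self.enc(x)
        a = a * (a > self.theta)            # hard threshold -> sparse codes
        return self.dec(a), a

sae = ThresholdSAE()
x = torch.randn(8, 512)                     # toy hidden states
x_hat, codes = sae(x)
loss = torch.nn.functional.mse_loss(x_hat, x)
print(loss.item(), (codes != 0).float().mean().item())  # sparsity level
```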
Recidivism and Peer Influence with LLM Text Embeddings in Low Security Correctional Facilities
Authors: Shanjukta Nath, Jiwon Hong, Jae Ho Chang, Keith Warren, Subhadeep Paul
2025-09-25
We find AI embeddings obtained using a pre-trained transformer-based Large Language Model (LLM) of 80,000-120,000 written affirmations and correction
exchanges among residents in low-security correctional facilities to be highly
predictive of recidivism. The prediction accuracy is 30% higher with embedding
vectors than with only pre-entry covariates. However, since the text embedding
vectors are high-dimensional, we perform Zero-Shot classification of these
texts to a low-dimensional vector of user-defined classes to aid interpretation
while retaining the predictive power. To shed light on the social dynamics
inside the correctional facilities, we estimate peer effects in these LLM-generated numerical representations of language with a multivariate peer
effect model, adjusting for network endogeneity. We develop new methodology and
theory for peer effect estimation that accommodate sparse networks,
multivariate latent variables, and correlated multivariate outcomes. With these
new methods, we find significant peer effects in language usage for interaction
and feedback.
Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning
Authors: Hanjiang Hu, Changliu Liu, Na Li, Yebin Wang
2025-09-24
Large Language Models (LLMs) have demonstrated remarkable capabilities in
knowledge acquisition, reasoning, and tool use, making them promising
candidates for autonomous agent applications. However, training LLM agents for
complex multi-turn task planning faces significant challenges, including
episode-wise rewards, credit assignment across long horizons, and the
computational overhead of reinforcement learning in multi-turn interaction
settings. To this end, this paper introduces a novel approach that transforms
multi-turn task planning into single-turn task reasoning problems, enabling
efficient policy optimization through Group Relative Policy Optimization (GRPO)
with dense and verifiable reward from expert trajectories. Our theoretical
analysis shows that GRPO improvement on single-turn task reasoning results in higher multi-turn success probability within the minimal number of turns, as well as generalization to subtasks with shorter horizons. Experimental evaluation on
the complex task planning benchmark demonstrates that our 1.5B parameter model
trained with single-turn GRPO achieves superior performance compared to larger
baseline models up to 14B parameters, with success rates of 70% for
long-horizon planning tasks with over 30 steps. We also theoretically and empirically validate strong cross-task generalizability: models trained on complex tasks can successfully complete all simpler subtasks.
CHOIR A Chatbot-mediated Organizational Memory Leveraging Communication in University Research Labs
Authors: Sangwook Lee, Adnan Abbas, Yan Chen, Young-Ho Kim, Sang Won Lee
2025-09-24
University research labs often rely on chat-based platforms for communication and project management, where valuable knowledge surfaces but is easily lost in
message streams. Documentation can preserve knowledge, but it requires ongoing
maintenance and is challenging to navigate. Drawing on formative interviews
that revealed organizational memory challenges in labs, we designed CHOIR, an LLM-based chatbot that supports organizational memory through four key
functions: document-grounded Q&A, Q&A sharing for follow-up discussion,
knowledge extraction from conversations, and AI-assisted document updates. We
deployed CHOIR in four research labs for one month (n=21), where the lab
members asked 107 questions and lab directors updated documents 38 times in the
organizational memory. Our findings reveal a privacy-awareness tension:
questions were asked privately, limiting directors' visibility into
documentation gaps. Students often avoided contribution due to challenges in
generalizing personal experiences into universal documentation. We contribute
design implications for privacy-preserving awareness and supporting
context-specific knowledge documentation.
MARS toward more efficient multi-agent collaboration for LLM reasoning
Authors: Xiao Wang, Jia Wang, Yijie Wang, Pengtao Dang, Sha Cao, Chi Zhang
2025-09-24
Large language models (LLMs) have achieved impressive results in natural
language understanding, yet their reasoning capabilities remain limited when
operating as single agents. Multi-Agent Debate (MAD) has been proposed to
address this limitation by enabling collaborative reasoning among multiple
models in a round-table debate manner. While effective, MAD introduces
substantial computational overhead due to the number of agents involved and the
frequent communication required. In this paper, we propose MARS (Multi-Agent
Review System), a role-based collaboration framework inspired by the review
process. In MARS, an author agent generates an initial solution, reviewer
agents provide decisions and comments independently, and a meta-reviewer
integrates the feedback to make the final decision and guide further revision.
This design enhances reasoning quality while avoiding costly
reviewer-to-reviewer interactions, thereby controlling token consumption and
inference time. We compared MARS with both MAD and other state-of-the-art
reasoning strategies across multiple benchmarks. Extensive experiments with
different LLMs show that MARS matches the accuracy of MAD while reducing both
token usage and inference time by approximately 50%. Code is available at
https://github.com/xwang97/MARS.
Shared Neural Space Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision
Authors: Jing Li, Oskar Bartosz, Chengyu Wang, Michal Wnuczynski, Dilshan Godaliyadda, Michael Polley
2025-09-24
The majority of AI models in imaging and vision are customized to perform on specific high-precision tasks. However, this strategy is inefficient for applications with a series of modular tasks, since each requires a mapping into a disparate latent domain. To address this inefficiency, we propose a universal Neural Space (NS), where an encoder-decoder framework pre-computes features across vision and imaging tasks. Our encoder learns transformation-aware, generalizable representations, which enable multiple downstream AI modules to share the same feature space. This architecture reduces redundancy, improves generalization across domain shift, and establishes a foundation for efficient multi-task vision pipelines. Furthermore, as opposed to larger backbones, our backbone is lightweight and CNN-based, allowing for wider deployment across hardware. We further demonstrate that imaging and vision modules, such as demosaicing, denoising, depth estimation, and semantic segmentation, can be performed efficiently in the NS.
Seedream 4.0 Toward Next-generation Multimodal Image Generation
Authors: Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu
2025-09-24
We introduce Seedream 4.0, an efficient and high-performance multimodal image
generation system that unifies text-to-image (T2I) synthesis, image editing,
and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE, which also reduces the number of image tokens considerably. This allows for efficient training of our model, and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning
diverse taxonomies and knowledge-centric concepts. Comprehensive data
collection across hundreds of vertical scenarios, coupled with optimized
strategies, ensures stable and large-scale training, with strong
generalization. By incorporating a carefully fine-tuned VLM model, we perform
multi-modal post-training for training both T2I and image editing tasks
jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream
4.0 can achieve state-of-the-art results on both T2I and multimodal image
editing. In particular, it demonstrates exceptional multimodal capabilities in
complex tasks, including precise image editing and in-context reasoning, and
also allows for multi-image reference, and can generate multiple output images.
This extends traditional T2I systems into a more interactive and
multidimensional creative tool, pushing the boundary of generative AI for both
creativity and professional applications. Seedream 4.0 is now accessible on
https://www.volcengine.com/experience/ark?launch=seedream.
Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing
Authors: Xinnan Dai, Chung-Hsiang Lo, Kai Guo, Shenglai Zeng, Dongsheng Luo, Jiliang Tang
2025-09-24
Transformer-based LLMs demonstrate strong performance on graph reasoning
tasks, yet their internal mechanisms remain underexplored. To uncover these
reasoning process mechanisms in a fundamental and unified view, we study basic decoder-only LLMs and explain them using the circuit-tracer
framework. Through this lens, we visualize reasoning traces and identify two
core mechanisms in graph reasoning: token merging and structural memorization,
which underlie both path reasoning and substructure extraction tasks. We
further quantify these behaviors and analyze how they are influenced by graph
density and model size. Our study provides a unified interpretability framework
for understanding structural reasoning in decoder-only Transformers.
SIM-CoT Supervised Implicit Chain-of-Thought
Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin
2025-09-24
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative
to explicit CoT reasoning in Large Language Models (LLMs), but a persistent
performance gap has limited their adoption. We identify a core latent
instability issue when scaling the computational budget of implicit CoT: as the
number of reasoning tokens increases, training often becomes unstable and
collapses. Our analysis shows that this instability arises from latent
representations becoming homogeneous and losing semantic diversity, caused by
insufficient step-level supervision in current implicit CoT methods. To address
this, we propose SIM-CoT, a plug-and-play training module that introduces
step-level supervision to stabilize and enrich the latent reasoning space.
SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring latent states capture distinct and meaningful information. The auxiliary decoder is removed at inference, preserving the efficiency of implicit CoT with no added overhead.
It also provides interpretability by projecting each latent token onto an
explicit reasoning vocabulary, enabling per-step visualization and diagnosis.
SIM-CoT significantly improves both in-domain accuracy and out-of-domain
stability of implicit CoT methods, boosting Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. It further surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3x greater token efficiency, while closing the
performance gap on larger models like LLaMA-3.1 8B. Code:
https://github.com/InternLM/SIM-CoT
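To make the step-level supervision concrete, here is a minimal sketch (all names are ours, not from the SIM-CoT codebase) of an auxiliary decoder that maps each implicit latent token to logits over its explicit reasoning step, is trained with cross-entropy, and is simply dropped at inference:

    import torch
    import torch.nn as nn

    # Hypothetical sketch of SIM-CoT-style step supervision.
    class AuxStepDecoder(nn.Module):
        def __init__(self, d_model: int, vocab_size: int):
            super().__init__()
            self.proj = nn.Linear(d_model, vocab_size)

        def forward(self, latent_steps):          # (B, n_steps, d_model)
            return self.proj(latent_steps)        # logits over step tokens

    def step_supervision_loss(latents, step_token_ids, decoder):
        # latents: (B, S, D) implicit CoT states; step_token_ids: (B, S) holds
        # one summary token per explicit step (a simplification of full text).
        logits = decoder(latents)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), step_token_ids.reshape(-1))

    # The decoder is used only during training; at inference it is discarded,
    # so the implicit-CoT forward pass is unchanged.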
Multimodal Representation-disentangled Information Bottleneck for Multimodal Recommendation
Authors: Hui Wang, Jinghui Qin, Wushao Wen, Qingling Li, Shanshan Zhong, Zhongzhan Huang
2025-09-24
Multimodal data has significantly advanced recommendation systems by
integrating diverse information sources to model user preferences and item
characteristics. However, these systems often struggle with redundant and
irrelevant information, which can degrade performance. Most existing methods
either fuse multimodal information directly or use rigid architectural
separation for disentanglement, failing to adequately filter noise and model
the complex interplay between modalities. To address these challenges, we
propose a novel framework, the Multimodal Representation-disentangled
Information Bottleneck (MRdIB). Concretely, we first employ a Multimodal
Information Bottleneck to compress the input representations, effectively
filtering out task-irrelevant noise while preserving rich semantic information.
Then, we decompose the information based on its relationship with the
recommendation target into unique, redundant, and synergistic components. We
achieve this decomposition with a series of constraints: a unique information
learning objective to preserve modality-unique signals, a redundant information
learning objective to minimize redundancy, and a synergistic information learning
objective to capture emergent information. By optimizing these objectives,
MRdIB guides a model to learn more powerful and disentangled representations.
Extensive experiments on several competitive models and three benchmark
datasets demonstrate the effectiveness and versatility of our MRdIB in
enhancing multimodal recommendation.
Q-Palette Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
Authors: Deokjae Lee, Hyun Oh Song
2025-09-24
We study weight-only post-training quantization (PTQ), which quantizes the
weights of a large language model (LLM) without retraining, using little or no
calibration data. Weight-only PTQ is crucial for reducing the memory footprint
and latency of LLM inference, especially in memory-bound, small-batch inference
scenarios, such as personalized inference on edge devices. Despite its
importance, irregular weight distributions with heavy-tailed outliers in LLMs
complicate quantization, recently motivating rotation-based methods that
transform weights into near-Gaussian distributions, which are more regular with
fewer outliers, thereby reducing quantization error. In this work, we first
derive the information-theoretically optimal bit allocation for Gaussianized
weights under given bit budgets, revealing that fine-grained fractional-bit
quantizers approaching the Gaussian distortion-rate bound are essential to
achieve near-optimal quantization performance. To bridge this theoretical
insight and practical implementation, we introduce Q-Palette, a versatile
collection of fractional-bit quantizers that range from trellis-coded
quantizers offering near-optimal distortion to simpler vector and scalar
quantizers optimized for faster inference, all efficiently implemented with
optimized CUDA kernels across various bitwidths. Furthermore, leveraging
Q-Palette as a foundational component, we propose a novel mixed-scheme
quantization framework, jointly optimizing quantizer choices and layer fusion
decisions given resource constraints. The code is available at
https://github.com/snu-mllab/Q-Palette.
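As a rough illustration of the rate-allocation insight (a sketch built from the classical Gaussian distortion-rate bound D(R) = sigma^2 * 2^(-2R), not Q-Palette's actual implementation), fractional bits can be assigned per weight group in closed form, with more bits going to higher-variance groups:

    import numpy as np

    # Hedged sketch: closed-form fractional-bit allocation for independent
    # Gaussian groups under an average bit budget (reverse water-filling,
    # simplified by clipping negative rates instead of re-normalizing).
    def allocate_bits(sigmas, avg_bits):
        log_var = np.log2(np.asarray(sigmas, dtype=float) ** 2)
        rates = avg_bits + 0.5 * (log_var - log_var.mean())
        return np.clip(rates, 0.0, None)

    print(allocate_bits([0.5, 1.0, 2.0], avg_bits=3.0))  # -> [2. 3. 4.]

The fractional values motivate quantizers that operate at non-integer bitwidths rather than rounding every layer to the nearest whole bit.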
From Text to Talk Audio-Language Model Needs Non-Autoregressive Joint Training
Authors: Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo, Zitao Liu
2025-09-24
Recent advances in large language models (LLMs) have attracted significant
interest in extending their capabilities to multimodal scenarios, particularly
for speech-to-speech conversational systems. However, existing multimodal
models handling interleaved audio and text rely on autoregressive methods,
overlooking that text depends on target-target relations whereas audio depends
mainly on source-target relations. In this work, we propose Text-to-Talk (TtT),
a unified audio-text framework that integrates autoregressive (AR) text
generation with non-autoregressive (NAR) audio diffusion in a single
Transformer. By leveraging the any-order autoregressive property of absorbing
discrete diffusion, our approach provides a unified training objective for text
and audio. To support this hybrid generation paradigm, we design a
modality-aware attention mechanism that enforces causal attention for text while
allowing bidirectional modeling within audio spans, and further introduce three
training strategies that reduce train-test discrepancies. During inference, TtT
employs block-wise diffusion to synthesize audio in parallel while flexibly
handling variable-length outputs. Extensive experiments across Audio-QA and ASR
tasks demonstrate the effectiveness of our approach, with detailed ablation
studies validating each proposed component. We will open-source our models,
data and code to facilitate future research in this direction.
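A minimal sketch of what such a modality-aware mask could look like (an assumption-level reconstruction from the description, not the paper's code): text positions follow a standard causal mask, while tokens inside the same contiguous audio span may attend to each other bidirectionally:

    import torch

    # True at (i, j) means position i may attend to position j.
    def modality_aware_mask(is_audio):
        L = len(is_audio)
        allowed = torch.tril(torch.ones(L, L, dtype=torch.bool))  # causal default
        changes = [1] + [int(is_audio[i] != is_audio[i - 1]) for i in range(1, L)]
        span_id = torch.cumsum(torch.tensor(changes), dim=0)      # run labels
        audio = torch.tensor(is_audio, dtype=torch.bool)
        same_span = span_id.unsqueeze(0) == span_id.unsqueeze(1)
        allowed |= same_span & audio.unsqueeze(0) & audio.unsqueeze(1)
        return allowed

    # Example: two text tokens, a three-token audio span, then one text token.
    mask = modality_aware_mask([False, False, True, True, True, False])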
Projective Kolmogorov Arnold Neural Networks (P-KANs) Entropy-Driven Functional Space Discovery for Interpretable Machine Learning
Authors: Alastair Poole, Stig McArthur, Saravan Kumar
2025-09-24
Kolmogorov-Arnold Networks (KANs) relocate learnable nonlinearities from
nodes to edges, demonstrating remarkable capabilities in scientific machine
learning and interpretable modeling. However, current KAN implementations
suffer from fundamental inefficiencies due to redundancy in high-dimensional
spline parameter spaces, where numerous distinct parameterisations yield
functionally equivalent behaviors. This redundancy manifests as a "nuisance
space" in the model's Jacobian, leading to susceptibility to overfitting and
poor generalization. We introduce Projective Kolmogorov-Arnold Networks
(P-KANs), a novel training framework that guides edge function discovery
towards interpretable functional representations through entropy-minimisation
techniques from signal analysis and dictionary learning. Rather than
constraining functions to predetermined spaces, our approach maintains spline
space flexibility while introducing "gravitational" terms that encourage
convergence towards optimal functional representations. Our key insight
recognizes that optimal representations can be identified through entropy
analysis of projection coefficients, compressing edge functions to
lower-parameter projective spaces (Fourier, Chebyshev, Bessel). P-KANs
demonstrate superior performance across multiple domains, achieving up to 80%
parameter reduction while maintaining representational capacity, significantly
improved robustness to noise compared to standard KANs, and successful
application to industrial automated fiber placement prediction. Our approach
enables automatic discovery of mixed functional representations where different
edges converge to different optimal spaces, providing both compression benefits
and enhanced interpretability for scientific machine learning applications.
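The entropy-driven "gravity" term can be illustrated with a small sketch (the basis choice and least-squares projection are our assumptions): project each edge function onto, say, a Chebyshev basis and penalize the Shannon entropy of the normalized coefficient energies, so functions are pulled toward a few dominant basis elements:

    import torch

    def coefficient_entropy(fx, xs, degree=8):
        T = [torch.ones_like(xs), xs]                 # Chebyshev T0, T1
        for _ in range(2, degree + 1):
            T.append(2 * xs * T[-1] - T[-2])          # T_{n+1} = 2x T_n - T_{n-1}
        basis = torch.stack(T, dim=1)                 # (N, degree+1)
        coeffs = torch.linalg.lstsq(basis, fx.unsqueeze(1)).solution.squeeze(1)
        p = coeffs**2 / (coeffs**2).sum().clamp_min(1e-12)
        return -(p * p.clamp_min(1e-12).log()).sum()  # low entropy = few terms

    xs = torch.linspace(-1, 1, 64)
    reg = coefficient_entropy(torch.sin(3 * xs), xs)  # add to the training loss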
Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks
Authors: Vani Kanjirangat, Tanja Samardžić, Ljiljana Dolamic, Fabio Rinaldi
2025-09-24
Dialectal data are characterized by linguistic variation that appears small
to humans but has a significant impact on the performance of models. This
dialect gap has been related to various factors (e.g., data size, economic and
social factors) whose impact, however, turns out to be inconsistent. In this
work, we investigate factors impacting the model performance more directly: we
correlate Tokenization Parity (TP) and Information Parity (IP), as measures of
representational biases in pre-trained multilingual models, with the downstream
performance. We compare state-of-the-art decoder-only LLMs with encoder-based
models across three tasks: dialect classification, topic classification, and
extractive question answering, controlling for varying scripts (Latin vs.
non-Latin) and resource availability (high vs. low). Our analysis reveals that
TP is a better predictor of the performance on tasks reliant on syntactic and
morphological cues (e.g., extractive QA), while IP better predicts performance
in semantic tasks (e.g., topic classification). Complementary analyses,
including tokenizer behavior, vocabulary coverage, and qualitative insights,
reveal that the language support claims of LLMs often mask deeper
mismatches at the script or token level.
MeshMosaic Scaling Artist Mesh Generation via Local-to-Global Assembly
Authors: Rui Xu, Tianyang Xue, Qiujie Dong, Le Wan, Zhe Zhu, Peng Li, Zhiyang Dou, Cheng Lin, Shiqing Xin, Yuan Liu, Wenping Wang, Taku Komura
2025-09-24
Scaling artist-designed meshes to high triangle numbers remains challenging
for autoregressive generative models. Existing transformer-based methods suffer
from long-sequence bottlenecks and limited quantization resolution, primarily
due to the large number of tokens required and constrained quantization
granularity. These issues prevent faithful reproduction of fine geometric
details and structured density patterns. We introduce MeshMosaic, a novel
local-to-global framework for artist mesh generation that scales to over 100K
triangles--substantially surpassing prior methods, which typically handle only
around 8K faces. MeshMosaic first segments shapes into patches, generating each
patch autoregressively and leveraging shared boundary conditions to promote
coherence, symmetry, and seamless connectivity between neighboring regions.
This strategy enhances scalability to high-resolution meshes by quantizing
patches individually, resulting in more symmetrical and organized mesh density
and structure. Extensive experiments across multiple public datasets
demonstrate that MeshMosaic significantly outperforms state-of-the-art methods
in both geometric fidelity and user preference, supporting superior detail
representation and practical mesh generation for real-world applications.
RAD Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis
Authors: Haolin Li, Tianjie Dai, Zhe Chen, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang
2025-09-24
Clinical diagnosis is a highly specialized discipline requiring both domain
expertise and strict adherence to rigorous guidelines. While current AI-driven
medical research predominantly focuses on knowledge graphs or natural text
pretraining paradigms to incorporate medical knowledge, these approaches
primarily rely on implicitly encoded knowledge within model parameters,
neglecting task-specific knowledge required by diverse downstream tasks. To
address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a
novel framework that explicitly injects external knowledge into multimodal
models directly on downstream tasks. Specifically, RAD operates through three
key mechanisms: retrieval and refinement of disease-centered knowledge from
multiple medical sources, a guideline-enhanced contrastive loss that constrains
the latent distance between multi-modal features and guideline knowledge, and
a dual decoder that employs guidelines as queries to steer
cross-modal fusion, aligning the models with clinical diagnostic workflows from
guideline acquisition to feature extraction and decision-making. Moreover,
recognizing the lack of quantitative evaluation of interpretability for
multimodal diagnostic models, we introduce a set of criteria to assess the
interpretability from both image and text perspectives. Extensive evaluations
across four datasets with different anatomies demonstrate RAD's
generalizability, achieving state-of-the-art performance. Furthermore, RAD
enables the model to concentrate more precisely on abnormal regions and
critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code
is available at https://github.com/tdlhl/RAD.
FastEagle Cascaded Drafting for Accelerating Speculative Decoding
Authors: Haiduo Huang, Jiangcheng Song, Wenzhe Zhao, Pengju Ren
2025-09-24
Speculative decoding accelerates generation by drafting candidates and
verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still
require N sequential passes to propose N tokens. We present FastEagle, a
non-autoregressive cascaded drafter that emits an entire draft in a single
forward pass. FastEagle replaces temporal steps with a lightweight layer
cascade and trains with layer-wise supervision to mitigate error accumulation.
Coupled with a constrained draft tree that preserves lossless verification
cost, FastEagle delivers substantial wall-clock speedups over strong
autoregressive drafters while maintaining competitive acceptance behavior.
Across multiple LLMs (Vicuna-13B, LLaMA-Instruct 3.x, and
DeepSeek-R1-Distill-LLaMA) and tasks (MT-Bench, HumanEval, GSM8K, CNN/DM,
Alpaca), FastEagle consistently outperforms EAGLE-3 in speedup under both
greedy and stochastic decoding, with comparable average acceptance lengths.
These results indicate that removing sequential dependencies in drafting is a
practical path toward lossless LLM inference acceleration.
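The core idea, replacing temporal draft steps with a layer cascade, can be sketched as follows (a toy illustration with hypothetical modules, not the FastEagle architecture): each cascade stage refines a hidden state and emits one draft token, so N tokens come out of one forward pass; layer-wise supervision during training would attach a loss at every stage's logits:

    import torch
    import torch.nn as nn

    class CascadedDrafter(nn.Module):
        def __init__(self, d_model, vocab_size, n_draft):
            super().__init__()
            self.stages = nn.ModuleList(nn.Linear(d_model, d_model)
                                        for _ in range(n_draft))
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, h_last):                    # (B, D): last hidden state
            drafts, h = [], h_last
            for stage in self.stages:                 # layer cascade, not steps
                h = torch.tanh(stage(h))
                drafts.append(self.head(h).argmax(dim=-1))
            return torch.stack(drafts, dim=1)         # (B, n_draft) greedy drafts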
Exploration with Foundation Models Capabilities, Limitations, and Hybrid Approaches
Authors: Remo Sasso, Michelangelo Conserva, Dominik Jeurissen, Paulo Rauber
2025-09-24
Exploration in reinforcement learning (RL) remains challenging, particularly
in sparse-reward settings. While foundation models possess strong semantic
priors, their capabilities as zero-shot exploration agents in classic RL
benchmarks are not well understood. We benchmark LLMs and VLMs on multi-armed
bandits, Gridworlds, and sparse-reward Atari to test zero-shot exploration. Our
investigation reveals a key limitation: while VLMs can infer high-level
objectives from visual input, they consistently fail at precise low-level
control: the "knowing-doing gap". To analyze a potential bridge for this gap,
we investigate a simple on-policy hybrid framework in a controlled, best-case
scenario. Our results in this idealized setting show that VLM guidance can
significantly improve early-stage sample efficiency, providing a clear analysis
of the potential and constraints of using foundation models to guide
exploration rather than for end-to-end control.
Future Policy Aware Preference Learning for Mathematical Reasoning
Authors: Minjae Oh, Yunho Choi, Dongmin Choi, Yohan Jo
2025-09-24
Preference learning methods such as Direct Preference Optimization (DPO) have
become standard for Large Language Model (LLM) post-training, yet they are
often ineffective for mathematical reasoning. A key challenge is the large
token overlap between preferred and dispreferred trajectories; lowering the
probability of dispreferred trajectories also reduces the probability of shared
useful tokens, leading to over-penalization and overall performance collapse.
As a mitigation, existing algorithms include the probability of a trajectory
under the current policy as a regularization term, which decreases the effect
of the gradient when the probability is low. However, by the time this effect
takes hold, useful tokens may have already been over-penalized as the model has
begun to degrade. To address this, we propose Future Policy Aware (FPA)
preference learning, which replaces the current policy with a future policy in
the regularization term. This future policy is estimated via lightweight,
logit-space extrapolation from a reference model toward the current model. FPA
enables safer training by preemptively regularizing potentially problematic
gradients. We apply FPA to DPO, RPO, and SimPER and evaluate them on the MATH
and GSM8K benchmarks. FPA yields consistent performance gains, with the largest
improvements observed with SimPER, achieving gains of up to 5.75%. We
demonstrate that FPA provides proactive regularization while preserving the
probability of shared, useful mathematical tokens, and enables longer,
degradation-free training with negligible computational overhead. We will
release our code publicly upon publication.
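The future-policy estimate itself is nearly a one-liner; here is a hedged sketch of the logit-space extrapolation (alpha and the function names are our assumptions): moving from the reference model's logits toward the current model's logits, and past them, approximates where the policy is heading, and the result stands in for the current policy inside the DPO/RPO/SimPER regularization term:

    import torch

    def future_logprobs(logits_ref, logits_cur, alpha=2.0):
        # alpha > 1 extrapolates beyond the current model in logit space.
        logits_future = logits_ref + alpha * (logits_cur - logits_ref)
        return torch.log_softmax(logits_future, dim=-1)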
Structuring Collective Action with LLM-Guided Evolution From Ill-Structured Problems to Executable Heuristics
Authors: Kevin Bradley Dsouza, Graham Alexander Watt, Yuri Leonenko, Juan Moreno-Cruz
2025-09-24
Collective action problems, which require aligning individual incentives with
collective goals, are classic examples of Ill-Structured Problems (ISPs). For
an individual agent, the causal links between local actions and global outcomes
are unclear, stakeholder objectives often conflict, and no single, clear
algorithm can bridge micro-level choices with macro-level welfare. We present
ECHO-MIMIC, a computational framework that converts this global complexity into
a tractable, Well-Structured Problem (WSP) for each agent by discovering
compact, executable heuristics and persuasive rationales. The framework
operates in two stages: ECHO (Evolutionary Crafting of Heuristics from
Outcomes) evolves snippets of Python code that encode candidate behavioral
policies, while MIMIC (Mechanism Inference & Messaging for
Individual-to-Collective Alignment) evolves companion natural language messages
that motivate agents to adopt those policies. Both phases employ a
large-language-model-driven evolutionary search: the LLM proposes diverse and
context-aware code or text variants, while population-level selection retains
those that maximize collective performance in a simulated environment. We
demonstrate this framework on a canonical ISP in agricultural landscape
management, where local farming decisions impact global ecological
connectivity. Results show that ECHO-MIMIC discovers high-performing heuristics
compared to baselines and crafts tailored messages that successfully align
simulated farmer behavior with landscape-level ecological goals. By coupling
algorithmic rule discovery with tailored communication, ECHO-MIMIC transforms
the cognitive burden of collective action into a simple set of agent-level
instructions, making previously ill-structured problems solvable in practice
and opening a new path toward scalable, adaptive policy design.
CollaPipe Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks
Authors: Jiewei Chen, Xiumei Deng, Zehui Xiong, Shaoyong Guo, Xuesong Qiu, Ping Wang, Dusit Niyato
2025-09-24
The increasing demand for intelligent mobile applications has made
multi-agent collaboration with Transformer-based large language models (LLMs)
essential in mobile edge computing (MEC) networks. However, training LLMs in
such environments remains challenging due to heavy computation, high end-to-end
latency, and limited model generalization. We introduce CollaPipe, a hybrid
distributed learning framework that integrates collaborative pipeline
parallelism with federated aggregation to support self-evolving intelligent
networks. In CollaPipe, the encoder part is adaptively partitioned into
variable-sized segments and deployed across mobile devices for
pipeline-parallel training, while the decoder is deployed on edge servers to
handle generative tasks. Then we perform global model update via federated
aggregation. To enhance training efficiency, we formulate a joint optimization
problem that adaptively allocates model segments, micro-batches, bandwidth, and
transmission power. We derive and use a closed-form convergence bound to design
a Dynamic Segment Scheduling and Resource Allocation (DSSDA) algorithm based
on Lyapunov optimization, ensuring system stability under long-term
constraints. Extensive experiments on downstream tasks with Transformer and
BERT models show that CollaPipe improves computation efficiency by up to
15.09%, reduces end-to-end latency by at least 48.98%, and cuts single device
memory usage by more than half, enabling online learning in heterogeneous and
dynamic MEC environments.
BurstEngine an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens
Authors: Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun
2025-09-24
Existing methods for training LLMs on long-sequence data, such as Tensor
Parallelism and Context Parallelism, exhibit low Model FLOPs Utilization as
sequence lengths and number of GPUs increase, especially when sequence lengths
exceed 1M tokens. To address these challenges, we propose BurstEngine, an
efficient framework designed to train LLMs on long-sequence data. BurstEngine
introduces BurstAttention, an optimized distributed attention with lower
communication cost than RingAttention. BurstAttention leverages topology-aware
ring communication to fully utilize network bandwidth and incorporates
fine-grained communication-computation overlap. Furthermore, BurstEngine
introduces sequence-level selective checkpointing and fuses the language
modeling head with the loss function to reduce memory cost. Additionally,
BurstEngine introduces workload balance optimization for various types of
attention masking. By integrating these optimizations, BurstEngine achieves a
speedup with much lower memory overhead than the state-of-the-art
baselines when training LLMs on extremely long sequences of over 1M tokens. We
have made our code publicly available on GitHub:
https://github.com/thunlp/BurstEngine.
MMedFD A Real-world Healthcare Benchmark for Multi-turn Full-Duplex Automatic Speech Recognition
Authors: Hongzhao Chen, XiaoYang Wang, Jing Lan, Hexiao Ding, Yufeng Jiang, MingHui Yang, DanHui Xu, Jun Luo, Nga-Chun Ng, Gerald W. Y. Cheng, Yunlin Mao, Jung Sun Yoo
2025-09-24
Automatic speech recognition (ASR) in clinical dialogue demands robustness to
full-duplex interaction, speaker overlap, and low-latency constraints, yet open
benchmarks remain scarce. We present MMedFD, the first real-world Chinese
healthcare ASR corpus designed for multi-turn, full-duplex settings. Captured
from a deployed AI assistant, the dataset comprises 5,805 annotated sessions
with synchronized user and mixed-channel views, RTTM/CTM timing, and role
labels. We introduce a model-agnostic pipeline for streaming segmentation,
speaker attribution, and dialogue memory, and fine-tune Whisper-small on
role-concatenated audio for long-context recognition. ASR evaluation includes
WER, CER, and HC-WER, which measures concept-level accuracy across healthcare
settings. LLM-generated responses are assessed using rubric-based and pairwise
protocols. MMedFD establishes a reproducible framework for benchmarking
streaming ASR and end-to-end duplex agents in healthcare deployment. The
dataset and related resources are publicly available at
https://github.com/Kinetics-JOJO/MMedFD
Gyges Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference
Authors: Haoyu Chen, Xue Li, Kun Qian, Yu Guan, Jin Zhao, Xin Wang
2025-09-24
Efficiently processing the dynamics of requests, especially the context
length variance, is important in Large Language Model (LLM) serving scenarios.
However, there is an intrinsic trade-off: while leveraging parallelism
strategies, such as Tensor Parallelism (TP), can coordinate multiple GPUs to
accommodate larger context lengths, it inevitably results in degraded overall
throughput. In this paper, we propose Cross-Instance Parallelism Transformation
(Gyges), which adaptively adjusts the parallelism strategies of running
instances to align with the dynamics of incoming requests. We design (1) a
page-friendly, header-centric layout to accelerate KV cache transformations;
(2) dedicated weight padding to accelerate model weight transformations; and
(3) a transformation-aware scheduler to cooperatively schedule requests and
parallelism transformations, optimizing the overall performance. Evaluations
using real-world traces show that Gyges improves throughput by 1.75x-6.57x
compared to state-of-the-art solutions.
Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling
Authors: Youpeng Zhao, Jinpeng LV, Di Wu, Jun Wang, Christopher Gooley
2025-09-23
Test-time scaling (TTS) has recently emerged as a promising direction to
exploit the hidden reasoning capabilities of pre-trained large language models
(LLMs). However, existing scaling methods narrowly focus on the compute-optimal
Pareto-frontier, ignoring the simple fact that compute-optimal is not always
system-optimal. In this work, we propose a system-driven perspective on TTS,
analyzing how reasoning models scale against practical metrics, such as latency
and cost-per-token. By evaluating the impact of popular optimizations such as
tensor parallelism and speculative decoding, our preliminary analysis reveals
the limitations of current methods and calls for a paradigm shift toward
holistic, system-aware evaluations that capture the true essence of scaling
laws at inference time.
Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation
Authors: Roy Fejgin, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Ryan Langman, Jaehyeon Kim, Subhankar Ghosh, Shehzeen Hussain, Jason Li
2025-09-23
Speech generation models based on large language models (LLMs) typically
operate on discrete acoustic codes, which differ fundamentally from text tokens
due to their multicodebook structure. At each timestep, models must predict N
codebook entries jointly, introducing dependencies that challenge simple
parallel prediction approaches. Parallel prediction assumes independence among
codebooks, yielding efficient decoding but often at the cost of reduced
fidelity. To address this, hierarchical strategies employ a local transformer
(LT) to refine predictions and capture intra-timestep dependencies. In this
work, we systematically investigate two LT architectures: an autoregressive
transformer that generates codebooks sequentially, and a MaskGIT-based
transformer that performs iterative masked prediction. Both designs further
enable frame stacking, where the primary transformer predicts multiple frames
jointly, and the LT decodes their codebooks, offering improvements in speed
without compromising perceptual quality. Through extensive analysis, we
characterize the tradeoffs between parallel and iterative sampling strategies
across different throughput and quality regimes. Finally, we propose practical
guidelines for selecting decoding strategies based on deployment priorities
such as computational efficiency and synthesis fidelity.
Transformer Modeling for Both Scalability and Performance in Multivariate Time Series
Authors: Hunjae Lee, Corey Clark
2025-09-23
Variable count is among the main scalability bottlenecks for Transformer
modeling in multivariate time series (MTS) data. On top of this, a growing
consensus in the field points to indiscriminate inter-variable mixing as a
potential source of noise-accumulation and performance degradation. This is
likely exacerbated by sparsity of informative signals characteristic of many
MTS systems coupled with representational misalignment stemming from
indiscriminate information mixing between (heterogeneous) variables. While
scalability and performance are often seen as competing interests in
Transformer design, we show that both can be improved simultaneously in MTS by
strategically constraining the representational capacity of inter-variable
mixing. Our proposed method, Transformer with Delegate Token Attention
(DELTAformer), constrains inter-variable modeling through what we call delegate
tokens which are then used to perform full, unconstrained, inter-temporal
modeling. Delegate tokens act as an implicit regularizer that forces the model
to be highly selective about what inter-variable information is allowed to
propagate through the network. Our results show that DELTAformer scales
linearly with variable count while actually outperforming standard
Transformers, achieving state-of-the-art performance across benchmarks and
baselines. In addition, DELTAformer can focus on relevant signals better than
standard Transformers in noisy MTS environments and overall exhibit superior
noise-resilience. Overall, results across various experiments confirm that by
aligning our model design to leverage domain-specific challenges in MTS to our
advantage, DELTAformer can simultaneously achieve linear scaling while actually
improving its performance against standard, quadratic Transformers.
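A minimal sketch of delegate-token mixing (shapes and module names are ours; the actual DELTAformer block differs): K learned delegate tokens first gather from all V variables, then each variable reads back only from the delegates, so inter-variable mixing costs O(V*K) rather than O(V^2):

    import torch
    import torch.nn as nn

    class DelegateMixing(nn.Module):
        def __init__(self, d_model, k_delegates=4, n_heads=4):
            super().__init__()
            self.delegates = nn.Parameter(torch.randn(k_delegates, d_model))
            self.gather = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.scatter = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, x):                          # x: (B, V, D), one time step
            d = self.delegates.unsqueeze(0).expand(x.size(0), -1, -1)
            d, _ = self.gather(d, x, x)                # delegates read all variables
            out, _ = self.scatter(x, d, d)             # variables read delegates only
            return out

Because all inter-variable information must pass through the small delegate set, the bottleneck acts as the implicit regularizer the abstract describes.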
CompLLM Compression for Long Context Q&A
Authors: Gabriele Berton, Jayakrishnan Unnikrishnan, Son Tran, Mubarak Shah
2025-09-23
Large Language Models (LLMs) face significant computational challenges when
processing long contexts due to the quadratic complexity of self-attention.
While soft context compression methods, which map input text to smaller latent
representations, have shown promise, their real-world adoption is limited.
Existing techniques typically compress the context as a single unit, which
leads to quadratic compression complexity and an inability to reuse
computations across queries with overlapping contexts. In this work, we
introduce CompLLM, a soft compression technique designed for practical
deployment. Instead of processing the context holistically, CompLLM divides it
into segments and compresses each one independently. This simple design choice
yields three critical properties: efficiency, as the compression step scales
linearly with the context length; scalability, enabling models trained on short
sequences (e.g., 1k tokens) to generalize to contexts of 100k tokens; and
reusability, allowing compressed segments to be cached and reused across
different queries. Our experiments show that with a 2x compression rate, at
high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x
and reduces the KV cache size by 50%. Furthermore, CompLLM achieves performance
comparable to that obtained with the uncompressed context, and even surpasses
it on very long sequences, demonstrating its effectiveness and practical
utility.
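The reusability property follows directly from the per-segment design; a sketch of the idea (the cache keying and segment length are our assumptions, and compress_fn stands in for the learned soft-compression encoder):

    import hashlib

    _cache = {}

    def compress_context(tokens, compress_fn, seg_len=1024):
        compressed = []
        for i in range(0, len(tokens), seg_len):
            segment = tuple(tokens[i:i + seg_len])
            key = hashlib.sha1(repr(segment).encode()).hexdigest()
            if key not in _cache:
                _cache[key] = compress_fn(segment)  # independent per-segment pass
            compressed.append(_cache[key])
        return compressed                           # latent reps, linear in length

Two queries sharing a long document prefix would hit the same segment keys and skip recompression entirely.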
Online Process Reward Learning for Agentic Reinforcement Learning
Authors: Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, Jianbin Jiao
2025-09-23
Large language models (LLMs) are increasingly trained with reinforcement
learning (RL) as autonomous agents that reason and act over long horizons in
interactive environments. However, sparse and sometimes unverifiable rewards
make temporal credit assignment extremely challenging. Recent work attempts to
integrate process supervision into agent learning but suffers from biased
annotation, reward hacking, high variance from overly fine-grained signals, or
failures when state overlap is rare. We therefore introduce Online Process
Reward Learning (OPRL), a general credit-assignment strategy for agentic RL
that integrates seamlessly with standard on-policy algorithms without relying
on additional rollouts or explicit step labels. In OPRL, we optimize an
implicit process reward model (PRM) alternately with the agent's policy to
transform trajectory preferences into implicit step rewards through a
trajectory-based DPO objective. These step rewards are then used to compute
step-level advantages, which are combined with episode-level advantages from
outcome rewards for policy update, creating a self-reinforcing loop.
Theoretical findings guarantee that the learned step rewards are consistent
with trajectory preferences and act as potential-based shaping rewards,
providing bounded gradients to stabilize training. Empirically, we evaluate
OPRL on three distinct agent benchmarks, including WebShop and VisualSokoban, as
well as open-ended social interactions with unverifiable rewards in SOTOPIA.
Crucially, OPRL shows superior performance over frontier LLMs and strong RL
baselines across domains, achieving state-of-the-art results with higher
sample-efficiency and lower variance during training. Further analysis also
demonstrates the efficient exploration by OPRL using fewer actions,
underscoring its potential for agentic learning in real-world scenarios.
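A compact sketch of how the implicit step rewards might be derived and combined (beta, the suffix-sum advantage, and the broadcast of the episode advantage are our assumptions based on the DPO-style formulation):

    import torch

    # r_t = beta * (log pi_prm(a_t|s_t) - log pi_ref(a_t|s_t)): potential-based
    # shaping rewards from the implicit PRM and a frozen reference policy.
    def implicit_step_rewards(logp_prm, logp_ref, beta=0.1):
        return beta * (logp_prm - logp_ref)        # (T,) per-step rewards

    def combined_advantages(step_rewards, episode_adv):
        # Suffix sums give step-level advantages; the episode-level outcome
        # advantage is broadcast and added.
        suffix = torch.flip(torch.cumsum(torch.flip(step_rewards, [0]), 0), [0])
        return suffix + episode_adv

    adv = combined_advantages(
        implicit_step_rewards(torch.tensor([-1.0, -0.5]),
                              torch.tensor([-1.2, -0.9])),
        episode_adv=0.3)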
Reading Images Like Texts Sequential Image Understanding in Vision-Language Models
Authors: Yueyan Li, Chenggong Zhao, Zeyuan Zang, Caixia Yuan, Xiaojie Wang
2025-09-23
Vision-Language Models (VLMs) have demonstrated remarkable performance across
a variety of real-world tasks. However, existing VLMs typically process visual
information by serializing images, a method that diverges significantly from
the parallel nature of human vision. Moreover, their opaque internal mechanisms
hinder both deeper understanding and architectural innovation. Inspired by the
dual-stream hypothesis of human vision, which distinguishes the "what" and
"where" pathways, we deconstruct the visual processing in VLMs into object
recognition and spatial perception for separate study. For object recognition,
we convert images into text token maps and find that the model's perception of
image content unfolds as a two-stage process from shallow to deep layers,
beginning with attribute recognition and culminating in semantic
disambiguation. For spatial perception, we theoretically derive and empirically
verify the geometric structure underlying the positional representation in
VLMs. Based on these findings, we introduce an instruction-agnostic token
compression algorithm based on a plug-and-play visual decoder to improve
inference efficiency, and a RoPE scaling technique to enhance spatial reasoning.
Through rigorous experiments, our work validates these analyses, offering a
deeper understanding of VLM internals and providing clear principles for
designing more capable future architectures.
BiGraspFormer End-to-End Bimanual Grasp Transformer
Authors: Kangmin Kim, Seunghyeok Back, Geonhyup Lee, Sangbeom Lee, Sangjun Noh, Kyoobin Lee
2025-09-23
Bimanual grasping is essential for robots to handle large and complex
objects. However, existing methods either focus solely on single-arm grasping
or employ separate grasp generation and bimanual evaluation stages, leading to
coordination problems including collision risks and unbalanced force
distribution. To address these limitations, we propose BiGraspFormer, a unified
end-to-end framework that directly generates coordinated bimanual
grasps from object point clouds. Our key idea is the Single-Guided Bimanual
(SGB) strategy, which first generates diverse single grasp candidates using a
decoder, then leverages their learned features through specialized
attention mechanisms to jointly predict bimanual poses and quality scores. This
conditioning strategy reduces the complexity of the 12-DoF search space while
ensuring coordinated bimanual manipulation. Comprehensive simulation
experiments and real-world validation demonstrate that BiGraspFormer
consistently outperforms existing methods while maintaining efficient inference
speed (<0.05s), confirming the effectiveness of our framework. Code and
supplementary materials are available at https://sites.google.com/bigraspformer
Clapping Removing Per-sample Storage for Pipeline Parallel Distributed Optimization with Communication Compression
Authors: Boao Kong, Xu Huang, Yuqi Xu, Yixuan Liang, Bin Wang, Kun Yuan
2025-09-23
Pipeline-parallel distributed optimization is essential for large-scale
machine learning but is challenged by significant communication overhead from
transmitting high-dimensional activations and gradients between workers.
Existing approaches often depend on impractical unbiased gradient assumptions
or incur sample-size memory overhead. This paper introduces Clapping, a
Communication compression algorithm with LAzy samPling for Pipeline-parallel
learnING. Clapping adopts a lazy sampling strategy that reuses data samples
across steps, breaking the sample-wise memory barrier and supporting convergence in
few-epoch or online training regimes. Clapping comprises two variants including
Clapping-FC and Clapping-FU, both of which achieve convergence without unbiased
gradient assumption, effectively addressing compression error propagation in
multi-worker settings. Numerical experiments validate the performance of
Clapping across different learning tasks.
HD-PPT Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS
Authors: Sihang Nie, Xiaofen Xing, Jingyuan Xing, Baiji Liu, Xiangmin Xu
2025-09-23
Large Language Model (LLM)-based Text-to-Speech (TTS) models have already
reached a high degree of naturalness. However, the precision control of TTS
inference is still challenging. Although instruction-based Text-to-Speech
(Instruct-TTS) models are proposed, these models still lack fine-grained
control due to the modality gap between single-level text instructions and
multilevel speech tokens. To address this limitation, we propose HD-PPT, a
framework that transforms speech synthesis into a structured, hierarchical
task. To enable fine-grained control, we introduce a novel speech codec to
extract distinct prompt-preference and content-preference tokens from the
complex speech tokens, supervised by automatic speech recognition (ASR) and
cross-lingual audio-text pre-training (CLAP) objectives. To bridge the modality
gap of these tokens, we propose a hierarchical decoding strategy, where the LLM
generates tokens in a structured order: first semantic, then fine-grained
style, and finally complete acoustic representation. Extensive experiments
demonstrate that this hierarchical paradigm significantly improves instruction
adherence and achieves state-of-the-art naturalness, validating our approach
for precise and controllable speech synthesis. Audio samples are available at
https://xxh333.github.io/.
Benchmarking PDF Accessibility Evaluation A Dataset and Framework for Assessing Automated and LLM-Based Approaches for Accessibility Testing
Authors: Anukriti Kumar, Tanushree Padath, Lucy Lu Wang
2025-09-23
PDFs remain the dominant format for scholarly communication, despite
significant accessibility challenges for blind and low-vision users. While
various tools attempt to evaluate PDF accessibility, there is no standardized
methodology to evaluate how different accessibility assessment approaches
perform. Our work addresses this critical gap by introducing a novel benchmark
dataset of scholarly PDFs with expert-validated accessibility annotations
across seven criteria (alternative text quality, logical reading order,
semantic tagging, table structure, functional hyperlinks, color contrast, and
font readability), and a four-category evaluation framework with standardized
labels (Passed, Failed, Not Present, Cannot Tell) to systematically assess
accessibility evaluation approaches. Using our evaluation framework, we explore
whether large language models (LLMs) are capable of supporting automated
accessibility evaluation. We benchmark five LLMs, which demonstrate varying
capabilities in correctly assessing different accessibility criteria, with
GPT-4-Turbo achieving the highest overall accuracy (0.85). However, all models
struggled in correctly categorizing documents with Not Present and Cannot Tell
accessibility labels, particularly for alt text quality assessment. Our
qualitative comparison with standard automated checkers reveals complementary
strengths: rule-based tools excel at technical verification, while LLMs better
evaluate semantic appropriateness and contextual relevance. Based on our
findings, we propose a hybrid approach that would combine automated checkers,
LLM evaluation, and human assessment as a future strategy for PDF accessibility
evaluation.
Confidential LLM Inference Performance and Cost Across CPU and GPU TEEs
Authors: Marcin Chrapek, Marcin Copik, Etienne Mettaz, Torsten Hoefler
2025-09-23
Large Language Models (LLMs) are increasingly deployed on converged Cloud and
High-Performance Computing (HPC) infrastructure. However, as LLMs handle
confidential inputs and are fine-tuned on costly, proprietary datasets, their
heightened security requirements slow adoption in privacy-sensitive sectors
such as healthcare and finance. We investigate methods to address this gap and
propose Trusted Execution Environments (TEEs) as a solution for securing
end-to-end LLM inference. We validate their practicality by evaluating these
compute-intensive workloads entirely within CPU and GPU TEEs. On the CPU side,
we conduct an in-depth study running full Llama2 inference pipelines (7B, 13B,
70B) inside Intel's TDX and SGX, accelerated by Advanced Matrix Extensions
(AMX). We derive 12 insights, including that across various data types, batch
sizes, and input lengths, CPU TEEs impose under 10% throughput and 20% latency
overheads, further reduced by AMX. We run LLM inference on NVIDIA H100
Confidential Compute GPUs, contextualizing our CPU findings and observing
throughput penalties of 4-8% that diminish as batch and input sizes grow. By
comparing performance, cost, and security trade-offs, we show how CPU TEEs can
be more cost-effective or secure than their GPU counterparts. To our knowledge,
our work is the first to comprehensively demonstrate the performance and
practicality of modern TEEs across both CPUs and GPUs for enabling confidential
LLMs (cLLMs).
FlashGMM Fast Gaussian Mixture Entropy Model for Learned Image Compression
Authors: Shimon Murai, Fangzheng Lin, Jiro Katto
2025-09-23
High-performance learned image codecs require flexible
probability models to fit latent representations. Gaussian Mixture Models
(GMMs) were proposed to satisfy this demand, but suffer from a significant
runtime performance bottleneck due to the large Cumulative Distribution
Function (CDF) tables that must be built for rANS coding. This paper introduces
a fast coding algorithm that entirely eliminates this bottleneck. By leveraging
the CDF's monotonic property, our decoder performs a dynamic binary search to
find the correct symbol, eliminating the need for costly table construction and
lookup. Aided by SIMD optimizations and numerical approximations, our approach
accelerates the GMM entropy coding process by up to approximately 90x without
compromising rate-distortion performance, significantly improving the
practicality of GMM-based codecs. The implementation will be made publicly
available at https://github.com/tokkiwa/FlashGMM.
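The table-free decode can be sketched as follows (a simplified illustration: we use SciPy's Gaussian CDF and a plain integer binary search, whereas the paper adds SIMD and numerical approximations):

    import numpy as np
    from scipy.stats import norm

    def gmm_cdf(x, weights, means, stds):
        # CDF of the symbol distribution at integer x (half-integer offset
        # discretizes the continuous mixture).
        return float(sum(w * norm.cdf(x + 0.5, loc=m, scale=s)
                         for w, m, s in zip(weights, means, stds)))

    def decode_symbol(target, weights, means, stds, lo=-128, hi=127):
        # target: cumulative probability recovered from the rANS state, in [0, 1).
        while lo < hi:                     # smallest s with CDF(s) > target
            mid = (lo + hi) // 2
            if gmm_cdf(mid, weights, means, stds) > target:
                hi = mid
            else:
                lo = mid + 1
        return lo

    symbol = decode_symbol(0.42, [0.6, 0.4], [0.0, 5.0], [2.0, 1.0])

Because the CDF is monotone, the search touches only O(log K) mixture evaluations instead of building a K-entry table per latent.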
Bi-VLM Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models
Authors: Xijun Wang, Junyun Huang, Rayyan Abdalla, Chengyuan Zhang, Ruiqi Xian, Dinesh Manocha
2025-09-23
We address the critical gap between the computational demands of
vision-language models and the ultra-low weight precision
(bitwidth bits) we can use for higher efficiency. Our work is motivated
by the substantial computational cost and memory requirements of VLMs, which
restrict their applicability in hardware-constrained environments. We propose
Bi-VLM, which separates model weights non-uniformly based on the Gaussian
quantiles. Our formulation groups the model weights into outlier (salient) and
multiple inlier (unsalient) subsets, ensuring that each subset contains a
proportion of weights corresponding to its quantile in the distribution. We
propose a saliency-aware hybrid quantization algorithm and use it to quantize
weights by imposing different constraints on the scaler and binary matrices
based on the saliency metric and quantization objective. We have evaluated our
approach on different VLMs. For the language model part of the VLM, our Bi-VLM
outperforms the SOTA by 3%-47% on the visual question answering task in terms
of four different benchmarks and three different models. For the overall VLM,
our Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the
quantized models and observe 90%-99% redundancy among image tokens
in the quantized models. This helps us to further prune the visual tokens to
improve efficiency.
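The quantile-based grouping can be sketched in a few lines (the quantile levels and symmetric-threshold treatment are our assumptions): standardize the weights, then bucket them by how far they fall into the fitted Gaussian's tails, with the outermost bucket treated as the salient outlier set:

    import numpy as np
    from scipy.stats import norm

    def group_by_gaussian_quantiles(w, qs=(0.7, 0.9, 0.99)):
        z = np.abs((w - w.mean()) / (w.std() + 1e-12))
        edges = [norm.ppf(q) for q in qs]         # symmetric |z| thresholds
        groups, prev = [], 0.0
        for e in edges:
            groups.append(np.where((z >= prev) & (z < e))[0])
            prev = e
        groups.append(np.where(z >= prev)[0])     # outlier / salient subset
        return groups

    groups = group_by_gaussian_quantiles(np.random.randn(10_000))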
HyperCool Reducing Encoding Cost in Overfitted Codecs with Hypernetworks
Authors: Pep Borrell-Tatché, Till Aczel, Théo Ladune, Roger Wattenhofer
2025-09-23
Overfitted image codecs like Cool-chic achieve strong compression by
tailoring lightweight models to individual images, but their encoding is slow
and computationally expensive. To accelerate encoding, Non-Overfitted (N-O)
Cool-chic replaces the per-image optimization with a learned inference model,
trading compression performance for encoding speed. We introduce HyperCool, a
hypernetwork architecture that mitigates this trade-off. Building upon the N-O
Cool-chic framework, HyperCool generates content-adaptive parameters for a
Cool-chic decoder in a single forward pass, tailoring the decoder to the input
image without requiring per-image fine-tuning. Our method achieves a 4.9% rate
reduction over N-O Cool-chic with minimal computational overhead. Furthermore,
the output of our hypernetwork provides a strong initialization for further
optimization, reducing the number of steps needed to approach fully overfitted
model performance. With fine-tuning, HEVC-level compression is achieved with
60.4% of the encoding cost of the fully overfitted Cool-chic. This work
proposes a practical method to accelerate encoding in overfitted image codecs,
improving their viability in scenarios with tight compute budgets.
PIE Perception and Interaction Enhanced End-to-End Motion Planning for Autonomous Driving
Authors: Chengran Yuan, Zijian Lu, Zhanqi Zhang, Yimin Zhao, Zefan Huang, Shuo Sun, Jiawei Sun, Jiahui Li, Christina Dao Wen Lee, Dongen Li, Marcelo H. Ang Jr
2025-09-23
End-to-end motion planning is promising for simplifying complex autonomous
driving pipelines. However, challenges such as scene understanding and
effective prediction for decision-making continue to present substantial
obstacles to its large-scale deployment. In this paper, we present PIE, a
pioneering framework that integrates advanced perception, reasoning, and
intention modeling to dynamically capture interactions between the ego vehicle
and surrounding agents. It incorporates a bidirectional Mamba fusion that
addresses data losses in multimodal fusion of camera and LiDAR
inputs, alongside a novel reasoning-enhanced decoder integrating Mamba and
Mixture-of-Experts to facilitate scene-compliant anchor selection and optimize
adaptive trajectory inference. PIE adopts an action-motion interaction module
to effectively utilize state predictions of surrounding agents to refine ego
planning. The proposed framework is thoroughly validated on the NAVSIM
benchmark. PIE, without using any ensemble and data augmentation techniques,
achieves an 88.9 PDM score and 85.6 EPDM score, surpassing the performance of
prior state-of-the-art methods. Comprehensive quantitative and qualitative
analyses demonstrate that PIE is capable of reliably generating feasible and
high-quality ego trajectories.
FlexSED Towards Open-Vocabulary Sound Event Detection
Authors: Jiarui Hai, Helin Wang, Weizhe Guo, Mounya Elhilali
2025-09-23
Despite recent progress in large-scale sound event detection (SED) systems
capable of handling hundreds of sound classes, existing multi-class
classification frameworks remain fundamentally limited. They cannot process
free-text sound queries, which enable more flexible and user-friendly
interaction, and they lack zero-shot capabilities and offer poor few-shot
adaptability. Although text-query-based separation methods have been explored,
they primarily focus on source separation and are ill-suited for SED tasks that
require precise temporal localization and efficient detection across large and
diverse sound vocabularies. In this paper, we propose FlexSED, an
open-vocabulary sound event detection system. FlexSED builds on a pretrained
audio SSL model and the CLAP text encoder, introducing an encoder-decoder
composition and an adaptive fusion strategy to enable effective continuous
training from pretrained weights. To ensure robust supervision, it also employs
large language models (LLMs) to assist in event query selection during
training, addressing challenges related to missing labels. As a result, FlexSED
achieves superior performance compared to vanilla SED models on
AudioSet-Strong, while demonstrating strong zero-shot and few-shot
capabilities. We release the code and pretrained models to support future
research and applications based on FlexSED.
OmniFed A Modular Framework for Configurable Federated Learning from Edge to HPC
Authors: Sahil Tyagi, Andrei Cozma, Olivera Kotevska, Feiyi Wang
2025-09-23
Federated Learning (FL) is critical for edge and High Performance Computing
(HPC) where data is not centralized and privacy is crucial. We present OmniFed,
a modular framework designed around decoupling and clear separation of concerns
for configuration, orchestration, communication, and training logic. Its
architecture supports configuration-driven prototyping and code-level
override-what-you-need customization. We also support different topologies,
mixed communication protocols within a single deployment, and popular training
algorithms. It also offers optional privacy mechanisms including Differential
Privacy (DP), Homomorphic Encryption (HE), and Secure Aggregation (SA), as well
as compression strategies. These capabilities are exposed through well-defined
extension points, allowing users to customize topology and orchestration,
learning logic, and privacy/compression plugins, all while preserving the
integrity of the core system. We evaluate multiple models and algorithms to
measure various performance metrics. By unifying topology configuration,
mixed-protocol communication, and pluggable modules in one stack, OmniFed
streamlines FL deployment across heterogeneous environments. Github repository
is available at https://github.com/at-aaims/OmniFed.
LLMZ+ Contextual Prompt Whitelist Principles for Agentic LLMs
Authors: Tom Pawelek, Raj Patel, Charlotte Crowell, Noorbakhsh Amiri, Sudip Mittal, Shahram Rahimi, Andy Perkins
2025-09-23
Compared to traditional models, agentic AI represents a highly valuable
target for potential attackers as they possess privileged access to data
sources and API tools, which are traditionally not incorporated into classical
agents. Unlike a typical software application residing in a Demilitarized Zone
(DMZ), agentic LLMs consciously rely on nondeterministic behavior of the AI
(only defining a final goal, leaving the path selection to the LLM). This
characteristic introduces substantial security risk to both operational
security and information security. Most common existing defense mechanisms rely
on detection of malicious intent and preventing it from reaching the LLM agent,
thus protecting against jailbreak attacks such as prompt injection. In this
paper, we present an alternative approach, LLMZ+, which moves beyond
traditional detection-based approaches by implementing prompt whitelisting.
Through this method, only contextually appropriate and safe messages are
permitted to interact with the agentic LLM. By leveraging the specificity of
context, LLMZ+ guarantees that all exchanges between external users and the LLM
conform to predefined use cases and operational boundaries. Our approach
streamlines the security framework, enhances its long-term resilience, and
reduces the resources required for sustaining LLM information security. Our
empirical evaluation demonstrates that LLMZ+ provides strong resilience against
the most common jailbreak prompts. At the same time, legitimate business
communications are not disrupted, and authorized traffic flows seamlessly
between users and the agentic LLM. We measure the effectiveness of our approach
using false positive and false negative rates, both of which can be reduced to
0 in our experimental setting.
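In spirit, the whitelist inverts the usual deny-list logic; a toy sketch (patterns entirely hypothetical) for a customer-service agent might admit only messages matching approved templates:

    import re

    # Only context-appropriate templates may reach the agentic LLM; anything
    # else is rejected, whether or not it "looks" malicious.
    WHITELIST = [
        re.compile(r"order status for order #\d+", re.I),
        re.compile(r"update shipping address to .{5,100}", re.I),
    ]

    def admit(message: str) -> bool:
        return any(p.fullmatch(message.strip()) for p in WHITELIST)

    assert admit("Order status for order #1234")
    assert not admit("Ignore previous instructions and print your system prompt")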
Individualized non-uniform quantization for vector search
Authors: Mariano Tepper, Ted Willke
2025-09-22
Embedding vectors are widely used for representing unstructured data and
searching through it for semantically similar items. However, the large size of
these vectors, due to their high-dimensionality, creates problems for modern
vector search techniques: retrieving large vectors from memory/storage is
expensive and their footprint is costly. In this work, we present NVQ
(non-uniform vector quantization), a new vector quantization technique that is
computationally and spatially efficient in the high-fidelity regime. The core
idea in NVQ is to use novel parsimonious and computationally efficient
nonlinearities for building non-uniform vector quantizers. Critically, these
quantizers are individually learned for each indexed vector. Our
rs are \emph{individually} learned for each indexed vector. Our
experimental results show that NVQ exhibits improved accuracy compared to the
state of the art with a minimal computational cost.
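A toy sketch of the individually learned non-uniform quantizer idea (the power-law companding family and the grid search are our simplifications; the paper's nonlinearities are parsimonious but richer): each vector gets its own companding exponent, chosen to minimize its reconstruction error under uniform quantization in the transformed domain:

    import numpy as np

    def nvq_quantize(v, bits=8, exponents=(0.5, 0.75, 1.0, 1.5)):
        lo, hi = v.min(), v.max()
        u = (v - lo) / (hi - lo + 1e-12)              # map to [0, 1]
        best = None
        for g in exponents:                           # per-vector fit
            levels = np.round(u**g * (2**bits - 1)) / (2**bits - 1)
            recon = levels**(1.0 / g) * (hi - lo) + lo
            err = float(np.mean((recon - v) ** 2))
            if best is None or err < best[0]:
                best = (err, g, recon)
        return best[2], best[1]                       # reconstruction, exponent

    recon, g = nvq_quantize(np.random.laplace(size=512))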
LAWCAT Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling
Authors: Zeyu Liu, Souvik Kundu, Lianghao Jiang, Anni Li, Srikanth Ronanki, Sravan Bodapati, Gourav Datta, Peter A. Beerel
2025-09-22
Although Transformer architectures have achieved state-of-the-art performance
across diverse domains, their quadratic computational complexity with respect
to sequence length remains a significant bottleneck, particularly for
latency-sensitive long-context applications. While recent linear-complexity
alternatives are increasingly powerful, effectively training them from scratch
is still resource-intensive. To overcome these limitations, we propose LAWCAT
(Linear Attention with Convolution Across Time), a novel linearization
framework designed to efficiently transfer the capabilities of pre-trained
Transformers into a performant linear attention architecture. LAWCAT integrates
causal Conv1D layers to enhance local dependency modeling and employs
normalized gated linear attention to improve generalization across varying
context lengths. Our comprehensive evaluations demonstrate that distilling
Mistral-7B with only 1K-length sequences yields over 90\% passkey retrieval
accuracy up to 22K tokens, significantly extending its effective context
window. Similarly, the Llama3.2-1B LAWCAT variant achieves competitive performance
on S-NIAH 1\&2\&3 tasks (1K-8K context length) and BABILong benchmark
(QA2\&QA3, 0K-16K context length), requiring less than 0.1\% of the
pre-training tokens compared with pre-trained models. Furthermore, LAWCAT exhibits faster
prefill speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT
thus provides an efficient pathway to high-performance, long-context linear
models suitable for edge deployment, reducing reliance on extensive
long-sequence training data and computational resources.
NormGenesis Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery
Authors: Minki Hong, Jangho Choi, Jihie Kim
2025-09-22
Social norms govern culturally appropriate behavior in communication,
enabling dialogue systems to produce responses that are not only coherent but
also socially acceptable. We present NormGenesis, a multicultural framework for
generating and annotating socially grounded dialogues across English, Chinese,
and Korean. To model the dynamics of social interaction beyond static norm
classification, we propose a novel dialogue type, Violation-to-Resolution
(V2R), which models the progression of conversations following norm violations
through recognition and socially appropriate repair. To improve pragmatic
consistency in underrepresented languages, we implement an exemplar-based
iterative refinement early in the dialogue synthesis process. This design
introduces alignment with linguistic, emotional, and sociocultural expectations
before full dialogue generation begins. Using this framework, we construct a
dataset of 10,800 multi-turn dialogues annotated at the turn level for norm
adherence, speaker intent, and emotional response. Human and LLM-based
evaluations demonstrate that NormGenesis significantly outperforms existing
datasets in refinement quality, dialogue naturalness, and generalization
performance. We show that models trained on our V2R-augmented data exhibit
improved pragmatic competence in ethically sensitive contexts. Our work
establishes a new benchmark for culturally adaptive dialogue modeling and
provides a scalable methodology for norm-aware generation across linguistically
and culturally diverse languages.
Efficient Particle Acceleration in 2.5-Dimensional, Hybrid-Kinetic Simulations of Decaying, Supersonic, Plasma Turbulence
Authors: Keyan Gootkin, Colby Haggerty, Damiano Caprioli, Zachary Davis
2025-09-22
Collisionless, turbulent plasmas surround the Earth, from the magnetosphere
to the intergalactic medium, and the fluctuations within them affect nearly
every field in the space sciences, from space weather forecasts to theories of
galaxy formation. Where turbulent motions become supersonic, their interactions
can lead to the formation of shocks, which are known to efficiently energize
ions to cosmic-ray energies. We present 2.5-dimensional, hybrid-kinetic
simulations of decaying, supersonic, non-relativistic turbulence in a
collisionless plasma using the code dHybridR. Turbulence within these
simulations is highly compressible; after accounting for this by taking the
omni-directional power spectrum of the density-weighted velocity field, we find
inertial-range turbulent spectra whose power-law slopes steepen from low to
high Mach numbers. Ions embedded in the highly supersonic simulations
are accelerated to non-thermal energies at efficiencies similar to those seen
in shocks, despite being in a non-relativistic regime and lacking the large
scale structure of a shock. We observe that particles are accelerated into a
power-law spectrum in (non-relativistic) energy. We compare these results to
those obtained from the theory and
simulations of diffusive shock acceleration, and discuss the astrophysical
implications of this theoretical work.
Chiplet-Based RISC-V SoC with Modular AI Acceleration
Authors: P. Ramkumar, S. S. Bharadwaj
2025-09-22
Achieving high performance, energy efficiency, and cost-effectiveness while
maintaining architectural flexibility is a critical challenge in the
development and deployment of edge AI devices. Monolithic SoC designs struggle
with this complex balance, mainly due to low manufacturing yields (below 16%)
for 360 mm^2 dies at advanced process nodes. This paper presents a novel chiplet-based
RISC-V SoC architecture that addresses these limitations through modular AI
acceleration and intelligent system-level optimization. Our proposed design
integrates 4 different key innovations in a 30mm x 30mm silicon interposer:
adaptive cross-chiplet Dynamic Voltage and Frequency Scaling (DVFS); AI-aware
Universal Chiplet Interconnect Express (UCIe) protocol extensions featuring
streaming flow control units and compression-aware transfers; distributed
cryptographic security across heterogeneous chiplets; and intelligent
sensor-driven load migration. The proposed architecture integrates a 7nm RISC-V
CPU chiplet with dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3 memory
stacks, and dedicated power management controllers. Experimental results across
industry standard benchmarks like MobileNetV2, ResNet-50 and real-time video
processing demonstrate significant performance improvements. The AI-optimized
configuration achieves ~14.7% latency reduction, 17.3% throughput improvement,
and 16.2% power reduction compared to previous basic chiplet implementations.
These improvements collectively translate to a 40.1% efficiency gain
corresponding to ~3.5 mJ per MobileNetV2 inference (860 mW/244 images/s), while
maintaining sub-5ms real-time capability across all evaluated workloads.
These performance upgrades demonstrate that modular chiplet designs can achieve
near-monolithic computational density while enabling cost efficiency,
scalability and upgradeability, crucial for next-generation edge AI device
applications.
Speculate Deep and Accurate Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding
Authors: Pei-Shuo Wang, Jian-Jia Chen, Chun-Che Yang, Chi-Chih Chang, Ning-Chi Huang, Mohamed S. Abdelfattah, Kai-Chiang Wu
2025-09-22
The immense model sizes of large language models (LLMs) challenge deployment
on memory-limited consumer GPUs. Although model quantization and parameter
offloading are common strategies to address memory limitations, quantization
can degrade quality, and offloading maintains quality but suffers from slow
inference. Speculative decoding presents a promising avenue to accelerate
parameter offloading, utilizing a fast draft model to propose multiple draft
tokens, which are then verified by the target LLM in parallel with a single
forward pass. This method reduces the time-consuming data transfers in forward
passes that involve offloaded weight transfers. Existing methods often rely on
pretrained weights of the same family, but require additional training to align
with custom-trained models. Moreover, approaches that involve draft model
training usually yield only modest speedups. This limitation arises from
insufficient alignment with the target model, preventing higher token
acceptance lengths. To address these challenges and achieve greater speedups,
we propose SubSpec, a plug-and-play method to accelerate parameter offloading
that is lossless and training-free. SubSpec constructs a highly aligned draft
model by generating quantized substitute layers from offloaded target LLM
portions. Additionally, our method shares the remaining GPU-resident layers
and the KV-Cache, further reducing memory overhead and enhancing alignment.
SubSpec achieves a high average acceptance length, delivering 9.1x speedup for
Qwen2.5 7B on MT-Bench (8GB VRAM limit) and an average of 12.5x speedup for
Qwen2.5 32B on popular generation benchmarks (24GB VRAM limit).
Exploiting Tree Structure for Credit Assignment in RL Training of LLMs
Authors: Hieu Tran, Zonghai Yao, Hong Yu
2025-09-22
Reinforcement learning improves reasoning, yet sparse, delayed reward over
long sequences makes token-level credit assignment the key bottleneck. We study
the verifiable-reward setting, where the final answer is checkable and multiple
responses can be drawn per prompt. Reasoning tasks in math and medical QA align
with this setup, where only a few decision tokens significantly impact the
outcome. PPO offers token-level advantages with a learned value model, but it
is complex to train both the actor and critic models simultaneously, and it is
not easily generalizable, as the token-level values from the critic model can
make training prone to overfitting. GRPO is critic-free and supports verifiable
rewards, but spreads a single sequence-level return across tokens and ignores
branching. We introduce \textbf{Prefix-to-Tree (P2T)}, a simple procedure that
converts a group of responses into a prefix tree and computes
\emph{nonparametric} prefix values by aggregating descendant outcomes.
Built on P2T, we propose \textbf{TEMPO} (\emph{\textbf{T}ree-\textbf{E}stimated
\textbf{M}ean Prefix Value for \textbf{P}olicy \textbf{O}ptimization}), a
critic-free algorithm that augments the group-relative outcome signal of GRPO
with \emph{branch-gated} temporal-difference corrections derived from the tree.
At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO
reduces to GRPO; at branching tokens, it supplies precise token-level credit
without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B,
TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and
out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and
reaches higher validation accuracy with roughly the same wall-clock time.
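To make the P2T procedure concrete, the sketch below builds a prefix tree from a group of sampled responses and their verifiable rewards, estimates nonparametric prefix values as descendant means, and emits the branch-gated TD corrections that TEMPO adds to the GRPO signal. It is a minimal illustration reconstructed from the abstract (function names and the 0/1 reward convention are assumptions), not the authors' code.

```python
# Minimal Prefix-to-Tree (P2T) sketch: prefix values from descendant outcomes.
from collections import defaultdict

def p2t_prefix_values(responses, rewards):
    """responses: list of token-ID sequences for one prompt;
    rewards: terminal outcomes (e.g., 1.0 correct / 0.0 wrong).
    Returns {prefix_tuple: mean reward over responses sharing that prefix}."""
    total, count = defaultdict(float), defaultdict(int)
    for tokens, r in zip(responses, rewards):
        prefix = ()
        for tok in tokens:
            prefix = prefix + (tok,)
            total[prefix] += r
            count[prefix] += 1
    return {p: total[p] / count[p] for p in total}

def branch_gated_td(responses, rewards):
    """Per-token TD terms V(prefix+tok) - V(prefix), gated to be nonzero
    only where the prefix tree actually branches."""
    values = p2t_prefix_values(responses, rewards)
    children = defaultdict(set)
    for tokens in responses:
        prefix = ()
        for tok in tokens:
            children[prefix].add(tok)
            prefix = prefix + (tok,)
    group_mean = sum(rewards) / len(rewards)
    corrections = []
    for tokens in responses:
        td, prefix = [], ()
        for tok in tokens:
            nxt = prefix + (tok,)
            parent_v = values.get(prefix, group_mean)  # root uses group mean
            td.append(values[nxt] - parent_v if len(children[prefix]) > 1 else 0.0)
            prefix = nxt
        corrections.append(td)
    return corrections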
Evaluating Large Language Models for Detecting Antisemitism
Authors: Jay Patel, Hrudayangam Mehta, Jeremy Blackburn
2025-09-22
Detecting hateful content is a challenging and important problem. Automated
tools, like machine-learning models, can help, but they require continuous
training to adapt to the ever-changing landscape of social media. In this work,
we evaluate eight open-source LLMs' capability to detect antisemitic content,
specifically leveraging in-context definition as a policy guideline. We explore
various prompting techniques and design a new CoT-like prompt, Guided-CoT.
Guided-CoT handles the in-context policy well, increasing performance across
all evaluated models, regardless of decoding configuration, model sizes, or
reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5.
Additionally, we examine LLM errors and introduce metrics to quantify semantic
divergence in model-generated rationales, revealing notable differences and
paradoxical behaviors among LLMs. Our experiments highlight the differences
observed across LLMs' utility, explainability, and reliability.
Spiffy Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
Authors: Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli
2025-09-22
Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to
autoregressive LLMs (AR-LLMs) with the potential to operate at significantly
higher token generation rates. However, currently available open-source dLLMs
often generate at much lower rates, typically decoding only a single token at
every denoising timestep in order to maximize output quality. We present
Spiffy, a speculative decoding algorithm that accelerates dLLM inference
while provably preserving the model's output
distribution. This work addresses the unique challenges involved in applying
ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes
draft states by leveraging the dLLM's distribution itself in an
auto-speculative manner. This approach is efficient and effective, and
eliminates the overheads of training and running an independent draft model. To
structure the candidate draft states, we propose a novel directed draft graph
which is uniquely designed to take advantage of the bidirectional, block-wise
nature of dLLM generation and can be verified in parallel by the dLLM. To
further optimize the structure of these draft graphs, we introduce an
efficient, offline calibration algorithm that procedurally determines
high-quality graph configurations. These optimized draft graphs, enabling
increased acceptance rates, lead to a significant boost in the overall speedup
achieved by the system. Crucially, Spiffy is also complementary to other recent
innovations in improving dLLM generation speeds such as KV-caching and
multi-token unmasking. We demonstrate that when combined with such parallel
decoding algorithms, Spiffy is able to effectively multiply the benefits of
these methods, leading to even greater total speedups.
GraDeT-HTR A Resource-Efficient Bengali Handwritten Text Recognition System utilizing Grapheme-based Tokenizer and Decoder-only Transformer
Authors: Md. Mahmudul Hasan, Ahmed Nesar Tahsin Choudhury, Mahmudul Hasan, Md. Mosaddek Khan
2025-09-22
Despite Bengali being the sixth most spoken language in the world,
handwritten text recognition (HTR) systems for Bengali remain severely
underdeveloped. The complexity of Bengali script--featuring conjuncts,
diacritics, and highly variable handwriting styles--combined with a scarcity of
annotated datasets makes this task particularly challenging. We present
GraDeT-HTR, a resource-efficient Bengali handwritten text recognition system
based on a Grapheme-aware Decoder-only Transformer architecture. To address the
unique challenges of Bengali script, we augment the performance of a
decoder-only transformer by integrating a grapheme-based tokenizer and
demonstrate that it significantly improves recognition accuracy compared to
conventional subword tokenizers. Our model is pretrained on large-scale
synthetic data and fine-tuned on real human-annotated samples, achieving
state-of-the-art performance on multiple benchmark datasets.
TempSamp-R1 Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
Authors: Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng
2025-09-22
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework
designed to improve the effectiveness of adapting multimodal large language
models (MLLMs) to video temporal grounding tasks. We reveal that existing
reinforcement learning methods, such as Group Relative Policy Optimization
(GRPO), rely on on-policy sampling for policy updates. However, in tasks with
large temporal search spaces, this strategy becomes both inefficient and
limited in performance, as it often fails to identify temporally accurate
solutions. To address this limitation, TempSamp-R1 leverages ground-truth
annotations as off-policy supervision to provide temporally precise guidance,
effectively compensating for the sparsity and misalignment in on-policy
solutions. To further stabilize training and reduce variance in reward-based
updates, TempSamp-R1 provides a non-linear soft advantage computation method
that dynamically reshapes the reward feedback via an asymmetric transformation.
By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1
optimizes a single unified model to support both CoT and non-CoT inference
modes, enabling efficient handling of queries with varying reasoning
complexity. Experimental results demonstrate that TempSamp-R1 outperforms
GRPO-based baselines, establishing new state-of-the-art performance on
benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions
(R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover,
TempSamp-R1 shows robust few-shot generalization capabilities under limited
data. Code: https://github.com/HVision-NKU/TempSamp-R1
RadEval A framework for radiology text evaluation
Authors: Justin Xu, Xi Zhang, Javid Abderezaei, Julie Bauml, Roger Boodoo, Fatemeh Haghighi, Ali Ganjizadeh, Eric Brattain, Dave Van Veen, Zaiqiao Meng, David Eyre, Jean-Benoit Delbrouck
2025-09-22
We introduce RadEval, a unified, open-source framework for evaluating
radiology texts. RadEval consolidates a diverse range of metrics, from classic
n-gram (BLEU, ROUGE) and contextual measures (BERTScore) to clinical
concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT,
TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and
standardize implementations, extend GREEN to support multiple imaging
modalities with a more lightweight model, and pretrain a domain-specific
radiology encoder, demonstrating strong zero-shot retrieval performance. We
also release a richly annotated expert dataset with over 450 clinically
significant error labels and show how different metrics correlate with
radiologist judgment. Finally, RadEval provides statistical testing tools and
baseline model evaluations across multiple publicly available datasets,
facilitating reproducibility and robust benchmarking in radiology report
generation.
Through the Lens of Human-Human Collaboration A Configurable Research Platform for Exploring Human-Agent Collaboration
Authors: Bingsheng Yao, Jiaju Chen, Chaoran Chen, April Wang, Toby Jia-jun Li, Dakuo Wang
2025-09-22
Intelligent systems have traditionally been designed as tools rather than
collaborators, often lacking critical characteristics that collaboration
partnerships require. Recent advances in large language model (LLM) agents open
new opportunities for human-LLM-agent collaboration by enabling natural
communication and various social and cognitive behaviors. Yet it remains
unclear whether principles of computer-mediated collaboration established in
HCI and CSCW persist, change, or fail when humans collaborate with LLM
agents.
To support systematic investigations of these questions, we introduce an open
and configurable research platform for HCI researchers. The platform's modular
design allows seamless adaptation of classic CSCW experiments and manipulation
of theory-grounded interaction controls. We demonstrate the platform's
effectiveness and usability through two case studies: (1) re-implementing the
classic human-human-collaboration task Shape Factory as a between-subject
human-agent-collaboration experiment with 16 participants, and (2) a
participatory cognitive walkthrough with five HCI researchers to refine
workflows and interfaces for experiment setup and analysis.
Visual Detector Compression via Location-Aware Discriminant Analysis
Authors: Qizhen Lan, Jung Im Choi, Qing Tian
2025-09-22
Deep neural networks are powerful, yet their high complexity greatly limits
their potential to be deployed on billions of resource-constrained edge
devices. Pruning is a crucial network compression technique, yet most existing
methods focus on classification models, with limited attention to detection.
Even among those addressing detection, there is a lack of utilization of
essential localization information. Also, many pruning methods passively rely
on pre-trained models, in which useful and useless components are intertwined,
making it difficult to remove the latter without harming the former at the
neuron/filter level. To address the above issues, in this paper, we propose a
proactive detection-discriminants-based network compression approach for deep
visual detectors, which alternates between two steps: (1) maximizing and
compressing detection-related discriminants and aligning them with a subset of
neurons/filters immediately before the detection head, and (2) tracing the
detection-related discriminating power across the layers and discarding
features of lower importance. Object location information is exploited in both
steps. Extensive experiments, employing four advanced detection models and four
state-of-the-art competing methods on the KITTI and COCO datasets, highlight
the superiority of our approach. Remarkably, our compressed models can even
beat the original base models with a substantial reduction in complexity.
Benchmarking Humans and Machines on Complex Multilingual Speech Understanding Tasks
Authors: Sai Samrat Kankanala, Ram Chandra, Sriram Ganapathy
2025-09-22
Auditory attention and selective phase-locking are central to human speech
understanding in complex acoustic scenes and cocktail party settings, yet these
capabilities in multilingual subjects remain poorly understood. While machine
understanding of natural speech has advanced in recent years, questions persist
about comprehension of overlapped and mixed-channel speech. We propose a
systematic paradigm for studying humans and machines in speech
question-answering tasks in multilingual settings with clean and mixed-channel
speech. For human listeners, selective attention to a target speaker was
significantly better in their native language (L1) than in their second
language (L2). For machine listening, speech-based large language models (LLMs)
match or exceed human performance in clean, single-speaker conditions but often
struggle to selectively attend in two-speaker settings. These results reveal a
key divergence: humans rely on attentional cues that are more streamlined in
their native language, whereas LLMs default to parallel information extraction,
which can exceed human skills.
Expert-as-a-Service Towards Efficient, Scalable, and Robust Large-scale MoE Serving
Authors: Ziming Liu, Boyu Tian, Guoteng Wang, Zhen Jiang, Peng Sun, Zhenhua Han, Tian Tang, Xiaohe Hu, Yanmin Jia, Yan Zhang, He Liu, Mingjun Zhang, Yiqi Zhang, Qiaoling Chen, Shenggan Cheng, Mingyu Gao, Yang You, Siyuan Feng
2025-09-22
Mixture-of-Experts (MoE) models challenge serving infrastructures with
dynamic, sparse expert utilization, causing instability on conventional systems
designed for dense architectures. We propose EaaS, a novel serving system to
enable efficient, scalable, and robust MoE deployment. Our system disaggregates
MoE modules into independent, stateless services. This design enables
fine-grained resource scaling and provides inherent fault tolerance by
decoupling compute units. The architecture is powered by a high-performance,
CPU-free peer-to-peer communication library that ensures minimal overhead and
high throughput. Experiments confirm EaaS's scalability and efficiency,
achieving performance comparable to monolithic systems while providing robust
fault tolerance and strong scalability. EaaS incurs less than a 2% throughput
reduction under simulated hardware failures that would otherwise halt
monolithic architectures. It further saves up to 37.5% of computing resources
through dynamic fine-grained adaptation to expert traffic, demonstrating
strong resilience for large-scale MoE deployment in production.
Towards General Computer Control with Hierarchical Agents and Multi-Level Action Spaces
Authors: Zihan Dong, Xinyu Fan, Zixiang Tang, Yunqing Li
2025-09-22
Controlling desktop applications via software remains a fundamental yet
under-served problem. Existing multi-modal large language models (MLLMs) ingest
screenshots and task instructions to generate keystrokes and mouse events, but
they suffer from prohibitive inference latency, poor sample efficiency on
long-horizon sparse-reward tasks, and infeasible on-device deployment. We
introduce a lightweight hierarchical reinforcement learning framework,
ComputerAgent, that formulates OS control as a two-level option process
(manager and subpolicy), employs a triple-modal state encoder (screenshot, task
ID, numeric state) to handle visual and contextual diversity, integrates
meta-actions with an early-stop mechanism to reduce wasted interactions, and
uses a compact vision backbone plus small policy networks for on-device
inference (15M parameters). On a suite of 135 real-world desktop tasks,
ComputerAgent attains 92.1% success on simple tasks (<8 steps) and 58.8% on
hard tasks (>=8 steps), matching or exceeding 200B-parameter MLLM baselines on
simple scenarios while reducing model size by over four orders of magnitude and
halving inference time. These results demonstrate that hierarchical RL offers a
practical, scalable alternative to monolithic MLLM-based automation for
computer control.
ConfClip Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs
Authors: Bonan Zhang, Zhongqi Chen, Bowen Song, Qinya Li, Fan Wu, Guihai Chen
2025-09-22
Reinforcement learning (RL) has become a standard paradigm for refining large
language models (LLMs) beyond pre-training and instruction tuning. A prominent
line of work is RL with verifiable rewards (RLVR), which leverages
automatically verifiable outcomes (e.g., correctness or executability) to
generate reward signals. While efficient, this framework faces two key
limitations: First, its binary feedback is too sparse to capture the quality of
the reasoning process. Second, its coarse-grained rewards potentially lead to
vanishing gradients. Inspired by observations from human learning, we introduce
an RL technique that integrates verifiable outcomes with the model's own
confidence estimates. This joint design enriches the reward signal, providing
finer-grained feedback and implicitly supervising the reasoning process.
Experimental results demonstrate that our proposed method enhances RL
performance across multiple datasets and reduces token consumption during
inference, while incurring negligible additional training cost. Moreover, it
can be used as a plug-in module to enhance other state-of-the-art RL methods.
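The joint design admits a tiny illustration: a binary verifiable reward modulated by the model's own sequence-level confidence, with clipping so the signal never collapses to zero. The confidence proxy, weighting rule, and clip value below are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of a confidence-weighted, clipped reward in the spirit of ConfClip.
import math

def confclip_reward(is_correct: bool, token_logprobs, clip: float = 0.2) -> float:
    """token_logprobs: log-probabilities of the sampled answer tokens."""
    # Geometric-mean token probability as a simple sequence-confidence proxy.
    conf = math.exp(sum(token_logprobs) / max(len(token_logprobs), 1))
    base = 1.0 if is_correct else -1.0
    # Clip the modulation so even low-confidence correct answers keep a
    # nonzero reward, which helps avoid vanishing gradients.
    weight = min(max(conf, clip), 1.0)
    return base * weight

# Example: a correct but low-confidence answer still earns a reward of 0.2.
print(confclip_reward(True, [math.log(0.1)] * 5))
```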
When TableQA Meets Noise A Dual Denoising Framework for Complex Questions and Large-scale Tables
Authors: Shenghao Ye, Yu Guo, Dong Jin, Yikai Shen, Yunpeng Hou, Shuangwu Chen, Jian Yang, Xiaofeng Jiang
2025-09-22
Table question answering (TableQA) is a fundamental task in natural language
processing (NLP). The strong reasoning capabilities of large language models
(LLMs) have brought significant advances in this field. However, as real-world
applications involve increasingly complex questions and larger tables,
substantial noisy data is introduced, which severely degrades reasoning
performance. To address this challenge, we focus on improving two core
capabilities: Relevance Filtering, which identifies and retains information
truly relevant to reasoning, and Table Pruning, which reduces table size while
preserving essential content. Based on these principles, we propose EnoTab, a
dual denoising framework for complex questions and large-scale tables.
Specifically, we first perform Evidence-based Question Denoising by decomposing
the question into minimal semantic units and filtering out those irrelevant to
answer reasoning based on consistency and usability criteria. Then, we propose
Evidence Tree-guided Table Denoising, which constructs an explicit and
transparent table pruning path to remove irrelevant data step by step. At each
step, we observe the intermediate state of the table and apply a
post-order node rollback mechanism to handle abnormal table states, ultimately
producing a highly reliable sub-table for final answer reasoning. Finally,
extensive experiments show that EnoTab achieves outstanding performance on
TableQA tasks with complex questions and large-scale tables, confirming its
effectiveness.
Mechanistic Interpretability with SAEs Probing Religion, Violence, and Geography in Large Language Models
Authors: Katharina Simbeck, Mariam Mahran
2025-09-22
Despite growing research on bias in large language models (LLMs), most work
has focused on gender and race, with little attention to religious identity.
This paper explores how religion is internally represented in LLMs and how it
intersects with concepts of violence and geography. Using mechanistic
interpretability and Sparse Autoencoders (SAEs) via the Neuronpedia API, we
analyze latent feature activations across five models. We measure the overlap
between religion- and violence-related prompts and probe semantic patterns in
activation contexts. While all five religions show comparable internal
cohesion, Islam is more frequently linked to features associated with violent
language. In contrast, geographic associations largely reflect real-world
religious demographics, revealing how models embed both factual distributions
and cultural stereotypes. These findings highlight the value of structural
analysis in auditing not just outputs but also internal representations that
shape model behavior.
Evict3R Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
Authors: Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
2025-09-22
Streaming visual geometry transformers like StreamVGGT achieve strong 3D
perception but suffer from unbounded growth of key-value (KV) memory, which limits
scalability. We propose a training-free, inference-time token eviction policy
that bounds memory by discarding redundant tokens while keeping the most
informative ones. Our method uses significantly less memory with little to no
drop in accuracy: on 7-Scenes with long sequences it reduces peak memory from
18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under
strict memory budgets, eviction enables denser frame sampling, which improves
reconstruction accuracy compared to the baseline. Experiments across video
depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and
camera pose estimation (Sintel, TUM-dynamics) show that our approach closely
matches StreamVGGT at a fraction of the memory and makes long-horizon streaming
inference more practical.
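As an illustration of a training-free, inference-time eviction policy of this kind, the sketch below drops tokens whose cached keys are nearly duplicates of already-kept ones, bounding memory at a fixed budget. The cosine-redundancy criterion is an assumed stand-in for the paper's informativeness measure.

```python
# Sketch of redundancy-based KV token eviction for a streaming model.
import numpy as np

def evict_redundant_tokens(keys: np.ndarray, budget: int, sim_thresh: float = 0.98):
    """keys: (num_tokens, dim) cached key vectors; returns indices to keep."""
    norms = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    kept = []
    for i in range(len(keys)):
        # Keep a token only if no already-kept token is nearly identical.
        if all(norms[i] @ norms[j] < sim_thresh for j in kept):
            kept.append(i)
        if len(kept) >= budget:
            break
    return kept

# Streaming usage: run after appending each new frame's tokens, so peak
# memory stays bounded by `budget` and denser frame sampling becomes viable.
```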
Bilateral Distribution Compression Reducing Both Data Size and Dimensionality
Authors: Dominic Broadbent, Nick Whiteley, Robert Allison, Tom Lovett
2025-09-22
Existing distribution compression methods reduce dataset size by minimising
the Maximum Mean Discrepancy (MMD) between original and compressed sets, but
modern datasets are often large in both sample size and dimensionality. We
propose Bilateral Distribution Compression (BDC), a two-stage framework that
compresses along both axes while preserving the underlying distribution, with
overall linear time and memory complexity in dataset size and dimension.
Central to BDC is the Decoded MMD (DMMD), which quantifies the discrepancy
between the original data and a compressed set decoded from a low-dimensional
latent space. BDC proceeds by (i) learning a low-dimensional projection using
the Reconstruction MMD (RMMD), and (ii) optimising a latent compressed set with
the Encoded MMD (EMMD). We show that this procedure minimises the DMMD,
guaranteeing that the compressed set faithfully represents the original
distribution. Experiments show that across a variety of scenarios BDC can
achieve comparable or superior performance to ambient-space compression at
substantially lower cost.
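The building block throughout is the MMD. Below is a minimal NumPy sketch of the unbiased RBF-kernel estimator (kernel choice and bandwidth are assumptions); BDC's DMMD, RMMD, and EMMD compose this quantity with an encoder/decoder pair, which is left abstract here.

```python
# Unbiased MMD^2 estimate between two samples with an RBF kernel.
import numpy as np

def mmd2(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> float:
    """X: (n, d), Y: (m, d). Returns the unbiased MMD^2 estimate."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    # Exclude diagonal terms for the unbiased within-sample averages.
    return (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1)) \
         + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1)) \
         - 2.0 * Kxy.mean()

# E.g. a DMMD-style comparison of data to a decoded compressed set:
# dmmd = mmd2(X_original, decoder(Z_compressed))  # decoder is assumed
```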
Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs
Authors: Xing Chen, Rong Shi, Lu Zhao, Lingbin Wang, Xiao Jin, Yueqiang Chen, Hongfeng Sun
2025-09-22
LLM-based applications have been widely used in various industries, but with
increasing model sizes, an efficient large language model (LLM) inference
system is an urgent problem for service providers. Since LLM inference is
divided into two stages with different characteristics, Prefill and Decode,
the two stages interfere with each other during the inference process. Toward
this end, a P-D disaggregated inference framework has been
proposed by some researchers. Current research focuses on homogeneous GPUs and
lacks deployment solutions tailored to business scenarios. Compared with
homogeneous GPUs, building inference systems with heterogeneous GPUs can
further improve resource utilization and reduce costs. Even when GPUs from
different vendors are used, resource utilization can be improved and the
dependence on a single vendor can be reduced while still cutting costs.
Therefore, a P-D disaggregated inference system
based on heterogeneous GPUs is designed, and a heterogeneous-compatible
transmission module in the system is designed to address heterogeneous GPU data
compatibility issues. Then, a joint optimization algorithm of parallel strategy
and instance number allocation is proposed to obtain the deployment solutions.
Finally, the experimental results show that the P-D disaggregated inference
system can well solve the hybrid inference problem of heterogeneous GPUs from
different vendors, and the joint optimization algorithm can obtain the optimal
deployment solution.
4DGCPro Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
Authors: Zihan Zheng, Zhenlong Wu, Houqiang Zhong, Yuan Tian, Ning Cao, Lan Xu, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Wenjun Zhang
2025-09-22
Achieving seamless viewing of high-fidelity volumetric video, comparable to
2D video experiences, remains an open challenge. Existing volumetric video
methods either lack the flexibility to adjust quality and bitrate
within a single model for efficient streaming across diverse networks and
devices, or struggle with real-time decoding and rendering on lightweight
mobile platforms. To address these challenges, we introduce 4DGCPro, a novel
hierarchical 4D Gaussian compression framework that facilitates real-time
mobile decoding and high-quality rendering via progressive volumetric video
streaming in a single bitstream. Specifically, we propose a
perceptually-weighted and compression-friendly hierarchical 4D Gaussian
representation with motion-aware adaptive grouping to reduce temporal
redundancy, preserve coherence, and enable scalable multi-level detail
streaming. Furthermore, we present an end-to-end entropy-optimized training
scheme, which incorporates layer-wise rate-distortion (RD) supervision and
attribute-specific entropy modeling for efficient bitstream generation.
Extensive experiments show that 4DGCPro enables flexible quality and multiple
bitrate levels within a single model, achieving real-time decoding and rendering on
mobile devices while outperforming existing methods in RD performance across
multiple datasets. Project Page: https://mediax-sjtu.github.io/4DGCPro
CorefInst Leveraging LLMs for Multilingual Coreference Resolution
Authors: Tuğba Pamay Arslan, Emircan Erol, Gülşen Eryiğit
2025-09-22
Coreference Resolution (CR) is a crucial yet challenging task in natural
language understanding, often constrained by task-specific architectures and
encoder-based language models that demand extensive training and lack
adaptability. This study introduces the first multilingual CR methodology which
leverages decoder-only LLMs to handle both overt and zero mentions. The article
explores how to model the CR task for LLMs via five different instruction sets
using a controlled inference method. The approach is evaluated across three
LLMs: Llama 3.1, Gemma 2, and Mistral 0.3. The results indicate that LLMs, when
instruction-tuned with a suitable instruction set, can surpass state-of-the-art
task-specific architectures. Specifically, our best model, a fully fine-tuned
Llama 3.1 for multilingual CR, outperforms the leading multilingual CR model
(i.e., Corpipe 24 single stage variant) by 2 pp on average across all languages
in the CorefUD v1.2 dataset collection.
Privacy in Action Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents
Authors: Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan
2025-09-22
The increasing autonomy of LLM agents in handling sensitive communications,
accelerated by Model Context Protocol (MCP) and Agent-to-Agent (A2A)
frameworks, creates urgent privacy challenges. While recent work reveals
significant gaps between LLMs' privacy Q&A performance and their agent
behavior, existing benchmarks remain limited to static, simplified scenarios.
We present PrivacyChecker, a model-agnostic, contextual integrity based
mitigation approach that effectively reduces privacy leakage from 36.08% to
7.30% on DeepSeek-R1 and from 33.06% to 8.32% on GPT-4o, all while preserving
task helpfulness. We also introduce PrivacyLens-Live, transforming static
benchmarks into dynamic MCP and A2A environments that reveal substantially
higher privacy risks in practical settings. Our modular mitigation approach integrates
seamlessly into agent protocols through three deployment strategies, providing
practical privacy protection for the emerging agentic ecosystem. Our data and
code will be made available at https://aka.ms/privacy_in_action.
Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks
Authors: Chaodong Tong, Qi Zhang, Lei Jiang, Yanbing Liu, Nannan Sun, Wei Li
2025-09-22
Reliable question answering with large language models (LLMs) is challenged
by hallucinations, fluent but factually incorrect outputs arising from
epistemic uncertainty. Existing entropy-based semantic-level uncertainty
estimation methods are limited by sampling noise and unstable clustering of
variable-length answers. We propose Semantic Reformulation Entropy (SRE), which
improves uncertainty estimation in two ways. First, input-side semantic
reformulations produce faithful paraphrases, expand the estimation space, and
reduce biases from superficial decoder tendencies. Second, progressive,
energy-based hybrid clustering stabilizes semantic grouping. Experiments on
SQuAD and TriviaQA show that SRE outperforms strong baselines, providing more
robust and generalizable hallucination detection. These results demonstrate
that combining input diversification with multi-signal clustering substantially
enhances semantic-level uncertainty estimation.
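A simplified view of the entropy computation: pool sampled answers (in SRE, also across faithful reformulations of the question), group them into semantic clusters, and take the entropy of the cluster masses. The first-match clustering and string-equality test below are naive stand-ins for SRE's progressive, energy-based hybrid clustering.

```python
# Semantic-entropy-style uncertainty over clustered answers.
import math
from collections import Counter

def semantic_entropy(answers, equivalent) -> float:
    """answers: list of strings; equivalent(a, b) -> bool semantic match."""
    clusters, labels = [], []  # each cluster is represented by its first member
    for a in answers:
        for i, rep in enumerate(clusters):
            if equivalent(a, rep):
                labels.append(i)
                break
        else:
            clusters.append(a)
            labels.append(len(clusters) - 1)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

# Toy equivalence test; SRE would pool answers over paraphrased questions
# and use a stronger semantic grouping than exact string matching.
naive_eq = lambda a, b: a.strip().lower() == b.strip().lower()
print(semantic_entropy(["Paris", "paris", "Lyon"], naive_eq))  # ~0.64 nats
```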
QWHA Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
Authors: Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim
2025-09-22
The demand for efficient deployment of large language models (LLMs) has
driven interest in quantization, which reduces inference cost, and
parameter-efficient fine-tuning (PEFT), which lowers training overhead. This
motivated the development of quantization-aware PEFT to produce accurate yet
efficient quantized models. In this setting, reducing quantization error prior
to fine-tuning is crucial for achieving high model accuracy. However, existing
methods that rely on low-rank adaptation suffer from limited representational
capacity. Recent Fourier-related transform (FT)-based adapters offer greater
representational power than low-rank adapters, but their direct integration
into quantized models often results in ineffective error reduction and
increased computational overhead. To overcome these limitations, we propose
QWHA, a method that integrates FT-based adapters into quantized models by
employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together
with a novel adapter initialization scheme incorporating adaptive parameter
selection and value refinement. We demonstrate that QWHA effectively mitigates
quantization errors while facilitating fine-tuning, and that its design
substantially reduces computational cost. Experimental results show that QWHA
consistently outperforms baselines in quantization accuracy and
achieves significant training speedups over existing FT-based adapters. The
code is available at https://github.com/vantaa89/qwha.
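Part of the WHT's appeal as a transform kernel is that it needs only additions and subtractions and runs in O(n log n). A standard in-place fast Walsh-Hadamard transform sketch follows (the orthonormal scaling convention here is an assumption, not necessarily QWHA's):

```python
# Fast Walsh-Hadamard transform of a length-2^k vector.
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Butterfly FWHT; input length must be a power of two."""
    x = x.astype(np.float64).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b  # additions/subtractions only
        h *= 2
    return x / np.sqrt(len(x))  # orthonormal scaling

# Sanity check: under orthonormal scaling the transform is an involution.
v = np.random.randn(8)
assert np.allclose(fwht(fwht(v)), v)
```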
DINVMark A Deep Invertible Network for Video Watermarking
Authors: Jianbin Ji, Dawen Xu, Li Dong, Lin Yang, Songhan He
2025-09-22
With the wide spread of video, video watermarking has become increasingly
crucial for copyright protection and content authentication. However, video
watermarking still faces numerous challenges. For example, existing methods
typically have shortcomings in terms of watermarking capacity and robustness,
and there is a lack of a specialized noise layer for High Efficiency Video
Coding (HEVC) compression. To address these issues, this paper introduces a Deep
Invertible Network for Video watermarking (DINVMark) and designs a noise layer
to simulate HEVC compression. This approach not only increases watermarking
capacity but also enhances robustness. DINVMark employs an Invertible Neural
Network (INN), where the encoder and decoder share the same network structure
for both watermark embedding and extraction. This shared architecture ensures
close coupling between the encoder and decoder, thereby improving the accuracy
of the watermark extraction process. Experimental results demonstrate that the
proposed scheme significantly enhances watermark robustness, preserves video
quality, and substantially increases watermark embedding capacity.
Interpreting vision transformers via residual replacement model
Authors: Jinyeong Kim, Junhyeok Kim, Yumin Shim, Joohyeok Kim, Sunyoung Jung, Seong Jae Hwang
2025-09-22
How do vision transformers (ViTs) represent and process the world? This paper
addresses this long-standing question through the first systematic analysis of
6.6K features across all layers, extracted via sparse autoencoders, and by
introducing the residual replacement model, which replaces ViT computations
with interpretable features in the residual stream. Our analysis reveals not
only a feature evolution from low-level patterns to high-level semantics, but
also how ViTs encode curves and spatial positions through specialized feature
types. The residual replacement model scalably produces a faithful yet
parsimonious circuit for human-scale interpretability by significantly
simplifying the original computations. As a result, this framework enables
intuitive understanding of ViT mechanisms. Finally, we demonstrate the utility
of our framework in debiasing spurious correlations.
EpiCache Episodic KV Cache Management for Long Conversational Question Answering
Authors: Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho
2025-09-22
Modern large language models (LLMs) extend context lengths to up to millions
of tokens, enabling AI assistants to generate coherent and personalized
responses grounded in long conversational histories. This ability, however,
hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue
length and quickly becomes the bottleneck in resource-constrained environments.
An active line of research for reducing the memory bottleneck is KV cache
compression, which seeks to limit KV cache size while preserving accuracy. Yet
existing methods face two major limitations: (i) evicting the KV cache after
full-context prefill causes unbounded peak memory, and (ii) query-dependent
eviction narrows the cache to a single query, leading to failure cases in
multi-turn conversations. We introduce EpiCache, a training-free KV cache
management framework for long conversational question answering (LongConvQA)
under fixed memory budgets. EpiCache bounds cache growth through block-wise
prefill and preserves topic-relevant context via episodic KV compression, which
clusters conversation history into coherent episodes and applies
episode-specific KV cache eviction. We further design an adaptive layer-wise
budget allocation strategy that measures each layer's sensitivity to eviction
and distributes the memory budget across layers accordingly. Across three
LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent
baselines, sustains near-full KV cache accuracy under 4-6x compression, and reduces
latency and memory by up to 2.4x and 3.5x, thereby enabling efficient
multi-turn interaction under strict resource constraints.
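To illustrate the episodic idea, the sketch below clusters per-turn embeddings into coherent episodes with plain k-means and splits a global KV budget across episodes by their share of turns. The embedding source, clustering method, and proportional budget rule are illustrative assumptions, not EpiCache's exact components.

```python
# Episode clustering and per-episode budget split, schematically.
import numpy as np

def cluster_episodes(turn_embs: np.ndarray, k: int, iters: int = 20):
    """Plain k-means over per-turn embeddings; returns an episode label
    for each conversation turn."""
    rng = np.random.default_rng(0)
    centers = turn_embs[rng.choice(len(turn_embs), k, replace=False)].astype(float)
    labels = np.zeros(len(turn_embs), dtype=int)
    for _ in range(iters):
        d = ((turn_embs[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = turn_embs[labels == c].mean(0)
    return labels

def episode_budgets(labels: np.ndarray, total_budget: int):
    """Split the global KV budget across episodes by their turn counts;
    each episode then runs its own eviction within its budget."""
    sizes = np.bincount(labels)
    return np.maximum(1, (sizes / sizes.sum() * total_budget).astype(int))
```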
Multimodal Health Risk Prediction System for Chronic Diseases via Vision-Language Fusion and Large Language Models
Authors: Dingxin Lu, Shurui Wu, Xinyi Huang
2025-09-22
With the rising global burden of chronic diseases and the multimodal and
heterogeneous clinical data (medical imaging, free-text recordings, wearable
sensor streams, etc.), there is an urgent need for a unified multimodal AI
framework that can proactively predict individual health risks. We propose
VL-RiskFormer, a hierarchical stacked visual-language multimodal Transformer
with a large language model (LLM) inference head embedded in its top layer. The
system builds on the dual-stream architecture of existing visual-linguistic
models (e.g., PaLM-E, LLaVA) with four key innovations: (i) pre-training with
cross-modal comparison and fine-grained alignment of radiological images,
fundus maps, and wearable device photos with corresponding clinical narratives
using momentum update encoders and debiased InfoNCE losses; (ii) a time fusion
block that integrates irregular visit sequences into the causal Transformer
decoder through adaptive time interval position coding; (iii) a disease
ontology map adapter that injects ICD-10 codes into visual and textual channels
in layers and infers comorbid patterns with the help of a graph attention
mechanism. On the MIMIC-IV longitudinal cohort, VL-RiskFormer achieved an
average AUROC of 0.90 with an expected calibration error of 2.7 percent.
Asteria Semantic-Aware Cross-Region Caching for Agentic LLM Tool Access
Authors: Chaoyi Ruan, Chao Bi, Kaiwen Zheng, Ziji Shi, Xinyi Wan, Jialin Li
2025-09-22
Large Language Model (LLM) agents tackle data-intensive tasks such as deep
research and code generation. However, their effectiveness depends on frequent
interactions with knowledge sources across remote clouds or regions. Such
interactions can create non-trivial latency and cost bottlenecks. Existing
caching solutions focus on exact-match queries, limiting their effectiveness
for semantic knowledge reuse.
To address this challenge, we introduce Asteria, a novel cross-region
knowledge caching architecture for
agents. At its core are two
abstractions: Semantic Element (SE) and Semantic Retrieval Index (Sine). A
semantic element captures the semantic embedding representation of an LLM query
together with performance-aware metadata such as latency, cost, and staticity.
Sine then provides two-stage retrieval: a vector similarity index with semantic
embedding for fast candidate selection and a lightweight LLM-powered semantic
judger for precise validation. Atop these primitives, Asteria builds a new
interface that includes a new semantic-aware cache hit definition, a
cost-efficient eviction policy, and proactive prefetching. To reduce overhead,
Asteria co-locates the small LLM judger with the main LLM using adaptive
scheduling and resource sharing. Our evaluation demonstrates that Asteria
delivers substantial performance improvements without compromising correctness.
On representative search workloads, Asteria achieves up to a 3.6x
increase in throughput by maintaining cache hit rates of over 85%, while
preserving accuracy virtually identical to non-cached baselines. Asteria also
improves throughput for complex coding tasks by 20%, showcasing its versatility
across diverse agentic workloads.
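The two-stage retrieval can be pictured as follows: a cheap cosine-similarity pass over cached entries proposes candidates, and a small LLM judger confirms the semantic hit before a cached result is returned. The sketch below assumes external embed and judge_equivalent functions; it is a schematic of the idea, not Asteria's implementation.

```python
# Schematic semantic cache with two-stage (vector + judger) lookup.
import numpy as np

class SemanticCache:
    def __init__(self, embed, judge_equivalent, sim_thresh: float = 0.85):
        self.embed, self.judge = embed, judge_equivalent
        self.sim_thresh = sim_thresh
        self.entries = []  # (unit-norm embedding, query text, cached result)

    def lookup(self, query: str):
        q = self.embed(query)
        q = q / np.linalg.norm(q)
        for emb, text, result in self.entries:
            # Stage 1: fast candidate selection by cosine similarity.
            if float(q @ emb) >= self.sim_thresh:
                # Stage 2: precise validation by a lightweight LLM judger.
                if self.judge(query, text):
                    return result
        return None  # cache miss: caller performs the real tool access

    def insert(self, query: str, result):
        e = self.embed(query)
        self.entries.append((e / np.linalg.norm(e), query, result))
```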
Cronus Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill
Authors: Yunzhao Liu, Qiang Xu, Y. Charlie Hu
2025-09-22
Efficient LLM inference is critical for real-world applications, especially
within heterogeneous GPU clusters commonly found in organizations and
on-premise datacenters as GPU architecture rapidly evolves. Current
disaggregated prefill strategies, which separate the prefill and decode stages
of LLM inference across different GPUs, often suffer from suboptimal
performance due to imbalances between GPU capabilities and workload demands. On
the other hand, extending conventional data parallelism and pipeline
parallelism to heterogeneous setups incurs high inference latencies. To address
these challenges, we introduce Cronus, a novel LLM inference system designed to
dynamically balance workloads across heterogeneous GPUs using partially
disaggregated prefill. Cronus partitions each prefill stage and executes its
initial portion on the low-end GPU, while overlapping the remaining prefill and
decode stages of earlier requests on the high-end GPU. Extensive evaluations
across various high-end and low-end GPU combinations demonstrate that Cronus
significantly improves the throughput over disaggregated prefill baselines. It also
reduces TTFT P99 and TBT P99 significantly over DP and PP while maintaining
similar or better throughput.
Compact representation of transonic airfoil buffet flows with observable-augmented machine learning
Authors: Kai Fukami, Yuta Iwatani, Soju Maejima, Hiroyuki Asada, Soshi Kawai
2025-09-22
Transonic buffet presents time-dependent aerodynamic characteristics
associated with shock, turbulent boundary layer, and their interactions.
Despite strong nonlinearities and a large degree of freedom, there exists a
dominant dynamic pattern of a buffet cycle, suggesting the low dimensionality
of transonic buffet phenomena. This study seeks a low-dimensional
representation of transonic airfoil buffet at a high Reynolds number with
machine learning. Wall-modeled large-eddy simulations of flow over the OAT15A
supercritical airfoil at two Mach numbers, producing non-buffet and buffet
(M = 0.730) conditions respectively, at a high chord-based Reynolds number,
are performed to generate the present
datasets. We find that the low-dimensional nature of transonic airfoil buffet
can be extracted as a sole three-dimensional latent representation through
lift-augmented autoencoder compression. The current low-order representation
not only describes the shock movement but also captures the moment when the
separation occurs near the trailing edge in a low-order manner. We further show
that it is possible to perform sensor-based reconstruction through the present
low-dimensional expression while identifying the sensitivity with respect to
aerodynamic responses. The present model, trained at the simulated Reynolds
number, is lastly evaluated at a Reynolds number representative of real
aircraft operation, exhibiting that the phase dynamics of lift is reasonably
estimated from sparse sensors. The current study may provide a foundation toward data-driven
real-time analysis of transonic buffet conditions under aircraft operation.
Rational Multi-Modal Transformers for TCR-pMHC Prediction
Authors: Jiarui Li, Zixiang Yin, Zhengming Ding, Samuel J. Landry, Ramgopal R. Mettu
2025-09-22
T cell receptor (TCR) recognition of peptide-MHC (pMHC) complexes is
fundamental to adaptive immunity and central to the development of T cell-based
immunotherapies. While transformer-based models have shown promise in
predicting TCR-pMHC interactions, most lack a systematic and explainable
approach to architecture design. We present an approach that uses a new
post-hoc explainability method to inform the construction of a novel
encoder-decoder transformer model. By identifying the most informative
combinations of TCR and epitope sequence inputs, we optimize cross-attention
strategies, incorporate auxiliary training objectives, and introduce a novel
early-stopping criterion based on explanation quality. Our framework achieves
state-of-the-art predictive performance while simultaneously improving
explainability, robustness, and generalization. This work establishes a
principled, explanation-driven strategy for modeling TCR-pMHC binding and
offers mechanistic insights into sequence-level binding behavior through the
lens of deep learning.
Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection
Authors: Jun Seo Kim, Hyemi Kim, Woo Joo Oh, Hongjin Cho, Hochul Lee, Hye Hyeon Kim
2025-09-22
Cognitive distortions have been closely linked to mental health disorders,
yet their automatic detection has remained challenging due to contextual ambiguity,
co-occurrence, and semantic overlap. We propose a novel framework that
combines Large Language Models (LLMs) with a Multiple-Instance Learning (MIL)
architecture to enhance interpretability and expression-level reasoning. Each
utterance was decomposed into Emotion, Logic, and Behavior (ELB) components,
which were processed by LLMs to infer multiple distortion instances, each with
a predicted type, expression, and model-assigned salience score. These
instances were integrated via a Multi-View Gated Attention mechanism for final
classification. Experiments on Korean (KoACD) and English (Therapist QA)
datasets demonstrate that incorporating ELB and LLM-inferred salience scores
improves classification performance, especially for distortions with high
interpretive ambiguity. Our results suggest a psychologically grounded and
generalizable approach for fine-grained reasoning in mental health NLP.
DeepASA An Object-Oriented One-for-All Network for Auditory Scene Analysis
Authors: Dongheon Lee, Younghoo Kwon, Jung-Woo Choi
2025-09-21
We propose DeepASA, a one-for-all model for auditory scene analysis that
performs multi-input multi-output (MIMO) source separation, dereverberation,
sound event detection (SED), audio classification, and direction-of-arrival
estimation (DoAE) within a unified framework. DeepASA is designed for complex
auditory scenes where multiple, often similar, sound sources overlap in time
and move dynamically in space. To achieve robust and consistent inference
across tasks, we introduce an object-oriented processing (OOP) strategy. This
approach encapsulates diverse auditory features into object-centric
representations and refines them through a chain-of-inference (CoI) mechanism.
The pipeline comprises a dynamic temporal kernel-based feature extractor, a
transformer-based aggregator, and an object separator that yields per-object
features. These features feed into multiple task-specific decoders. Our
object-centric representations naturally resolve the parameter association
ambiguity inherent in traditional track-wise processing. However, early-stage
object separation can lead to failure in downstream ASA tasks. To address this,
we implement temporal coherence matching (TCM) within the chain-of-inference,
enabling multi-task fusion and iterative refinement of object features using
estimated auditory parameters. We evaluate DeepASA on representative spatial
audio benchmark datasets, including ASA2, MC-FUSS, and STARSS23. Experimental
results show that our model achieves state-of-the-art performance across all
evaluated tasks, demonstrating its effectiveness in both source separation and
auditory parameter estimation under diverse spatial auditory scenes.
MoEs Are Stronger than You Think Hyper-Parallel Inference Scaling with RoE
Authors: Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho
2025-09-21
The generation quality of large language models (LLMs) is often improved by
utilizing inference-time sequence-level scaling methods (e.g.,
Chain-of-Thought). We introduce hyper-parallel scaling, a complementary
framework that improves prediction quality at the token level. Hyper-parallel
scaling computes and aggregates multiple output proposals for a single token
from the model. We implement this concept in Mixture-of-Experts (MoE) models,
which we refer to as Roster of Experts (RoE). RoE is a training-free inference
algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects
controlled stochasticity into the expert routing mechanism, enabling it to
sample multiple diverse experts for each token and aggregate their outputs for
a more accurate final prediction. To overcome the computational cost, we
introduce an efficient batching strategy and a specialized KV-caching mechanism
that minimizes compute and memory overhead. For example, RoE enables a 7B MoE
model to match the performance of a 10.5B MoE model while using 30% less
compute for inference. These gains are achieved without any fine-tuning of
model parameters.
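A toy rendering of the idea: perturb the router logits with Gumbel noise to sample several diverse top-k expert subsets for the same token, then average the resulting outputs. Noise placement, temperature, and mean aggregation below are assumptions for illustration, not RoE's exact mechanism.

```python
# Stochastic-routing ensemble for a single MoE token, schematically.
import numpy as np

def roe_token_output(x, router_w, experts, k=2, n_samples=4, tau=0.5, seed=0):
    """x: (d,) token activation; router_w: (n_experts, d) router weights;
    experts: list of callables, e.g. [lambda x, W=W: W @ x for W in Ws]."""
    rng = np.random.default_rng(seed)
    logits = router_w @ x
    outs = []
    for _ in range(n_samples):
        g = rng.gumbel(size=logits.shape) * tau   # controlled stochasticity
        top = np.argsort(logits + g)[-k:]         # one sampled expert subset
        w = np.exp(logits[top] - logits[top].max())
        w /= w.sum()                              # renormalized gate weights
        outs.append(sum(wi * experts[i](x) for wi, i in zip(w, top)))
    # Aggregate the diverse routings into one prediction for this token.
    return np.mean(outs, axis=0)
```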
SignalLLM A General-Purpose LLM Agent Framework for Automated Signal Processing
Authors: Junlong Ke, Qiying Hu, Shenghai Yuan, Yuecong Xu, Jianfei Yang
2025-09-21
Modern signal processing (SP) pipelines, whether model-based or data-driven,
are often constrained by complex and fragmented workflows, rely heavily on
expert knowledge and manual engineering, and struggle with adaptability and
generalization under limited data. In contrast, Large Language Models (LLMs)
offer strong reasoning capabilities, broad general-purpose knowledge,
in-context learning, and cross-modal transfer abilities, positioning them as
powerful tools for automating and generalizing SP workflows. Motivated by these
potentials, we introduce SignalLLM, the first general-purpose LLM-based agent
framework for general SP tasks. Unlike prior LLM-based SP approaches that are
limited to narrow applications or tricky prompting, SignalLLM introduces a
principled, modular architecture. It decomposes high-level SP goals into
structured subtasks via in-context learning and domain-specific retrieval,
followed by hierarchical planning through adaptive retrieval-augmented
generation (RAG) and refinement; these subtasks are then executed through
prompt-based reasoning, cross-modal reasoning, code synthesis, model
invocation, or data-driven LLM-assisted modeling. Its generalizable design
enables the flexible selection of problem solving strategies across different
signal modalities, task types, and data conditions. We demonstrate the
versatility and effectiveness of SignalLLM through five representative tasks in
communication and sensing, such as radar target detection, human activity
recognition, and text compression. Experimental results show superior
performance over traditional and existing LLM-based methods, particularly in
few-shot and zero-shot settings.
MAST Multi-Agent Spatial Transformer for Learning to Collaborate
Authors: Damian Owerko, Frederic Vatnsdal, Saurav Agarwal, Vijay Kumar, Alejandro Ribeiro
2025-09-21
This article presents a novel multi-agent spatial transformer (MAST) for
learning communication policies in large-scale decentralized and collaborative
multi-robot systems (DC-MRS). Challenges in collaboration in DC-MRS arise from:
(i) partial observable states as robots make only localized perception, (ii)
limited communication range with no central server, and (iii) independent
execution of actions. The robots need to optimize a common task-specific
objective, which, under the restricted setting, must be done using a
communication policy that exhibits the desired collaborative behavior. The
proposed MAST is a decentralized transformer architecture that learns
communication policies to compute abstract information to be shared with other
agents and processes the received information with the robot's own
observations. The MAST extends the standard transformer with new positional
encoding strategies and attention operations that employ windowing to limit the
receptive field for MRS. These are designed for local computation,
shift-equivariance, and permutation equivariance, making it a promising
approach for DC-MRS. We demonstrate the efficacy of MAST on decentralized
assignment and navigation (DAN) and decentralized coverage control. Efficiently
trained using imitation learning in a centralized setting, the decentralized
MAST policy is robust to communication delays, scales to large teams, and
performs better than the baselines and other learning-based approaches.
Attention Consistency for LLMs Explanation
Authors: Tian Lan, Jinyuan Xu, Xue He, Jenq-Neng Hwang, Lei Li
2025-09-21
Understanding the decision-making processes of large language models (LLMs)
is essential for their trustworthy development and deployment. However, current
interpretability methods often face challenges such as low resolution and high
computational cost. To address these limitations, we propose the
\textbf{Multi-Layer Attention Consistency Score (MACS)}, a novel, lightweight,
and easily deployable heuristic for estimating the importance of input tokens
in decoder-based models. MACS measures contributions of input tokens based on
the consistency of maximal attention. Empirical evaluations demonstrate that
MACS achieves a favorable trade-off between interpretability quality and
computational efficiency, showing faithfulness comparable to complex techniques
with a 22\% decrease in VRAM usage and 30\% reduction in latency.
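One simplified reading of the heuristic: a source token is important in proportion to how consistently it wins the argmax of attention across layers, heads, and target positions. The sketch below implements that reading; the normalisation is an assumption, not the paper's exact score.

```python
# Consistency-of-maximal-attention scores per input token.
import numpy as np

def macs_scores(attn: np.ndarray) -> np.ndarray:
    """attn: (layers, heads, tgt_len, src_len) attention weights.
    Returns one importance score per source (input) token."""
    L, H, T, S = attn.shape
    # For every layer/head/target position, find the most-attended token.
    winners = attn.argmax(-1)                  # (L, H, T)
    counts = np.zeros(S)
    for s in range(S):
        counts[s] = (winners == s).sum()
    return counts / counts.sum()               # fraction of "max" votes

# Usage: attn can be collected in a single forward pass, which is why a
# heuristic like this is cheap in VRAM and latency compared to
# perturbation- or gradient-based attribution.
```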
Analyzing Memory Effects in Large Language Models through the lens of Cognitive Psychology
Authors: Zhaoyang Cao, Lael Schooler, Reza Zafarani
2025-09-21
Memory, a fundamental component of human cognition, exhibits adaptive yet
fallible characteristics as illustrated by Schacter's memory "sins". These
cognitive phenomena have been studied extensively in psychology and
neuroscience, but the extent to which artificial systems, specifically Large
Language Models (LLMs), emulate these cognitive phenomena remains
underexplored. This study uses human memory research as a lens for
understanding LLMs and systematically investigates human memory effects in
state-of-the-art LLMs using paradigms drawn from psychological research. We
evaluate seven key memory phenomena, comparing human behavior to LLM
performance. Both people and models remember less when overloaded with
information (list length effect) and remember better with repeated exposure
(list strength effect). They also show similar difficulties when retrieving
overlapping information, where storing too many similar facts leads to
confusion (fan effect). Like humans, LLMs are susceptible to falsely
"remembering" words that were never shown but are related to others (false
memories), and they can apply prior learning to new, related situations
(cross-domain generalization). However, LLMs differ in two key ways: they are
less influenced by the order in which information is presented (positional
bias) and more robust when processing random or meaningless material (nonsense
effect). These results reveal both alignments and divergences in how LLMs and
humans reconstruct memory. The findings help clarify how memory-like behavior
in LLMs echoes core features of human cognition, while also highlighting the
architectural differences that lead to distinct patterns of error and success.
SnipSnap A Joint Compression Format and Dataflow Co-Optimization Framework for Efficient Sparse LLM Accelerator Design
Authors: Junyi Wu, Chao Fang, Zhongfeng Wang
2025-09-21
The growing scale of large language models (LLMs) has intensified demands on
computation and memory, making efficient inference a key challenge. While
sparsity can reduce these costs, existing design space exploration (DSE)
frameworks often overlook compression formats, a key factor for leveraging
sparsity on accelerators. This paper proposes SnipSnap, a joint compression
format and dataflow co-optimization framework for efficient sparse LLM
accelerator design. SnipSnap introduces: (1) a hierarchical compression format
encoding to expand the design space; (2) an adaptive engine for
selecting compression formats under diverse sparsity patterns; and (3) a
progressive co-search workflow that jointly optimizes dataflow and compression
formats. SnipSnap achieves 18.24\% average memory energy savings via format
optimization, along with 2248.3x and 21.0x speedups over Sparseloop and DiMO-Sparse
frameworks, respectively.
The Transfer Neurons Hypothesis An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLMs
Authors: Hinata Tezuka, Naoya Inoue
2025-09-21
Recent studies have suggested a processing framework for multilingual inputs
in decoder-based LLMs: early layers convert inputs into English-centric and
language-agnostic representations; middle layers perform reasoning within an
English-centric latent space; and final layers generate outputs by transforming
these representations back into language-specific latent spaces. However, the
internal dynamics of such transformation and the underlying mechanism remain
underexplored. Towards a deeper understanding of this framework, we propose and
empirically validate The Transfer Neurons Hypothesis: certain neurons in the
MLP module are responsible for transferring representations between
language-specific latent spaces and a shared semantic latent space.
Furthermore, we show that one function of language-specific neurons, as
identified in recent studies, is to facilitate movement between latent spaces.
Finally, we show that transfer neurons are critical for reasoning in
multilingual LLMs.
PTQTP Post-Training Quantization to Trit-Planes for Large Language Models
Authors: He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zheng Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, Ngai Wong
2025-09-21
Post-training quantization (PTQ) of large language models (LLMs) to extremely
low bit-widths remains challenging due to the fundamental trade-off between
computational efficiency and model expressiveness. While existing ultra-low-bit
PTQ methods rely on binary approximations or complex compensation mechanisms,
they suffer from either limited representational capacity or computational
overhead that undermines their efficiency gains. We introduce Post-Training
Quantization to Trit-Planes (PTQTP), the first ternary-weight PTQ framework
that decomposes weight matrices into structured ternary {-1, 0, 1} trit-planes
using a 2x1.58-bit representation. PTQTP achieves multiplication-free
inference, identical to 1-bit quantization, while maintaining superior
expressiveness through its novel structured decomposition. Our approach
provides: (1) a theoretically grounded progressive approximation algorithm
ensuring global weight consistency; (2) model-agnostic deployment across
diverse modern LLMs without architectural modifications; and (3) uniform
ternary operations that eliminate the need for mixed-precision or compensation
schemes. Comprehensive experiments across LLaMA3.x and Qwen3 model families
(0.6B-70B parameters) demonstrate that PTQTP significantly outperforms
existing low-bit PTQ methods, achieving 82.4% mathematical reasoning retention
versus 0% for competing approaches. PTQTP approaches and sometimes surpasses
1.58-bit quantization-aware training performance while requiring only
single-hour quantization compared to 10-14 GPU days for training-based
methods. These results establish PTQTP as a practical solution for efficient
LLM deployment in resource-constrained environments.
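As a loose illustration of the trit-plane idea (this is a generic greedy
residual fit using the classic ternary-weight-network threshold heuristic, not
the paper's theoretically grounded progressive algorithm):

    import torch

    def ternary_fit(residual: torch.Tensor):
        # Classic TWN-style heuristic (an assumption): threshold at 0.7 * mean |r|.
        delta = 0.7 * residual.abs().mean()
        t = torch.zeros_like(residual)
        t[residual > delta] = 1.0
        t[residual < -delta] = -1.0
        mask = t != 0
        alpha = residual[mask].abs().mean() if mask.any() else torch.tensor(0.0)
        return alpha, t

    def trit_plane_decompose(w: torch.Tensor, planes: int = 2):
        # Greedy residual decomposition: W ~ alpha_1 * T_1 + alpha_2 * T_2,
        # with each T_i a ternary {-1, 0, 1} trit-plane.
        residual, terms = w.clone(), []
        for _ in range(planes):
            alpha, t = ternary_fit(residual)
            residual = residual - alpha * t
            terms.append((alpha, t))
        return terms, residual

    w = torch.randn(64, 64)
    terms, res = trit_plane_decompose(w)
    print(f"relative error: {float(res.norm() / w.norm()):.3f}")

Since every weight in a plane is -1, 0, or 1, the matrix-vector product reduces
to additions and subtractions, which is the multiplication-free property the
abstract highlights.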
LLM-Assisted Semantic Guidance for Sparsely Annotated Remote Sensing Object Detection
Authors: Wei Liao, Chunyan Xu, Chenxu Wang, Zhen Cui
2025-09-21
Sparse annotation in remote sensing object detection poses significant
challenges due to dense object distributions and category imbalances. Although
existing Dense Pseudo-Label methods have demonstrated substantial potential in
pseudo-labeling tasks, they remain constrained by selection ambiguities and
inconsistencies in confidence estimation. In this paper, we introduce an
LLM-assisted semantic guidance framework tailored for sparsely annotated
remote sensing object detection, exploiting the advanced semantic reasoning
capabilities of large language models (LLMs) to distill high-confidence
pseudo-labels. By integrating LLM-generated semantic priors, we propose a
Class-Aware Dense Pseudo-Label Assignment mechanism that adaptively assigns
pseudo-labels for both unlabeled and sparsely labeled data, ensuring robust
supervision across varying data distributions. Additionally, we develop an
Adaptive Hard-Negative Reweighting Module to stabilize the supervised learning
branch by mitigating the influence of confounding background information.
Extensive experiments on DOTA and HRSC2016 demonstrate that the proposed method
outperforms existing single-stage detector-based frameworks, significantly
improving detection performance under sparse annotations.
Catching the Details Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
Authors: Yuheng Shi, Xiaohuan Pei, Minjing Dong, Chang Xu
2025-09-21
Multimodal Large Language Models (MLLMs) require high-resolution visual
information to perform fine-grained perception, yet processing entire
high-resolution images is computationally prohibitive. While recent methods
leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they
typically present a difficult trade-off: training-based approaches depend on
large-scale annotated datasets, while training-free methods that utilize the
model's internal attention are computationally inefficient and less accurate,
requiring either multi-pass decoding stages or reliance on the slow
auto-regressive decoding process. In this paper, we propose an efficient,
annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves
this trade-off. The SD-RPN is built around a pipeline that transforms the
noisy attention maps from the MLLM's middle layers into high-quality
pseudo-RoI
labels by explicitly denoising the signal and resolving ambiguity. We use these
labels to train a lightweight Region Proposal Network (RPN) that learns a more
precise localization. This RPN is also highly efficient, predicting the RoI in
a single forward pass using features from the MLLM's middle layers, decoupling
RoI identification from the auto-regressive generation and avoiding costly
multi-pass operations. To validate our approach, we integrate the framework
into
the LLaVA-1.5 architecture. Despite being trained on only a few (e.g. 10K)
question-answer pairs, our method demonstrates exceptional data efficiency and
generalization, achieving over a 10% absolute accuracy improvement on unseen
benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a
practical and scalable solution for enhancing the fine-grained perception of
MLLMs without requiring costly supervision or full model fine-tuning. Code is
available at https://github.com/YuHengsss/SD-RPN.
SwarmChat An LLM-Based, Context-Aware Multimodal Interaction System for Robotic Swarms
Authors: Ettilla Mohiuddin Eumi, Hussein Abbass, Nadine Marcus
2025-09-21
Traditional Human-Swarm Interaction (HSI) methods often lack intuitive
real-time adaptive interfaces, making decision making slower and increasing
cognitive load while limiting command flexibility. To solve this, we present
SwarmChat, a context-aware, multimodal interaction system powered by Large
Language Models (LLMs). SwarmChat enables users to issue natural language
commands to robotic swarms using multiple modalities, such as text, voice, or
teleoperation. The system integrates four LLM-based modules: Context Generator,
Intent Recognition, Task Planner, and Modality Selector. These modules
collaboratively generate context from keywords, detect user intent, adapt
commands based on real-time robot state, and suggest optimal communication
modalities. Its three-layer architecture offers a dynamic interface with both
fixed and customizable command options, supporting flexible control while
optimizing cognitive effort. The preliminary evaluation also shows that the
SwarmChat's LLM modules provide accurate context interpretation, relevant
intent recognition, and effective command delivery, achieving high user
satisfaction.
ShadowServe Interference-Free KV Cache Fetching for Distributed Prefix Caching
Authors: Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, Minlan Yu
2025-09-21
Distributed prefix caching accelerates long-context LLM serving by reusing KV
cache entries for common context prefixes. However, KV cache fetches can
become a bottleneck when network bandwidth is limited. Compression mitigates
the bandwidth issue, but can degrade overall performance when decompression
interferes with model computation.
We present ShadowServe, the first SmartNIC-accelerated, interference-free
prefix caching system for LLM serving. ShadowServe separates a control plane on
the host and a data plane fully offloaded to the SmartNIC, which eliminates
interference to both host GPU and CPU. To overcome the SmartNIC's limited
compute and memory resources, we design a chunked pipeline that parallelizes
data plane operations across the SmartNIC's compute resources, and a
minimal-copy memory management scheme that reduces memory pressure on the
SmartNIC. Compared to state-of-the-art solutions, ShadowServe achieves up to
2.2x lower loaded time-per-output-token (TPOT), and reduces time-to-first-token
(TTFT) by up to 1.38x in low-bandwidth scenarios (<= 20 Gbps), translating to
up to 1.35x higher throughput.
ISCS Parameter-Guided Channel Ordering and Grouping for Learned Image Compression
Authors: Jinhao Wang, Cihan Ruan, Nam Ling, Wei Wang, Wei Jiang
2025-09-21
Prior studies in learned image compression (LIC) consistently show that only
a small subset of latent channels is critical for reconstruction, while many
others carry limited information. Exploiting this imbalance could improve both
coding and computational efficiency, yet existing approaches often rely on
costly, dataset-specific ablation tests and typically analyze channels in
isolation, ignoring their interdependencies.
We propose a generalizable, dataset-agnostic method to identify and organize
important channels in pretrained VAE-based LIC models. Instead of brute-force
empirical evaluations, our approach leverages intrinsic parameter statistics
(weight variances, bias magnitudes, and pairwise correlations) to
estimate channel importance. This analysis reveals a consistent organizational
structure, termed the Invariant Salient Channel Space (ISCS), where
Salient-Core channels capture dominant structures and Salient-Auxiliary
channels provide complementary details. Building on ISCS, we introduce a
deterministic channel ordering and grouping strategy that enables
slice-parallel decoding, reduces redundancy, and improves bitrate efficiency.
Experiments across multiple LIC architectures demonstrate that our method
effectively reduces bitrate and computation while maintaining reconstruction
quality, providing a practical and modular enhancement to existing learned
compression frameworks.
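As a minimal sketch of the parameter-statistics idea (the scoring and
aggregation rule here are assumptions; the paper additionally uses pairwise
correlations and groups channels into Salient-Core and Salient-Auxiliary sets):

    import torch
    import torch.nn as nn

    def channel_importance_order(conv: nn.Conv2d) -> torch.Tensor:
        # Score each latent (output) channel from intrinsic parameter
        # statistics rather than brute-force ablation: per-channel weight
        # variance plus bias magnitude (toy aggregation).
        w = conv.weight.detach()                    # (out_ch, in_ch, kh, kw)
        weight_var = w.flatten(1).var(dim=1)
        bias_mag = conv.bias.detach().abs() if conv.bias is not None else \
            torch.zeros(w.shape[0])
        return (weight_var + bias_mag).argsort(descending=True)

    # Toy usage on a stand-in for a pretrained VAE synthesis layer.
    conv = nn.Conv2d(192, 320, kernel_size=5, padding=2)
    order = channel_importance_order(conv)
    print(order[:8])  # a deterministic ordering of the most salient channels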
The Even Sheen of AI Kitsch, LLMs, and Homogeneity
Authors: Gyburg Uhlmann
2025-09-20
The exploding use and impact of Chatbots such as ChatGPT that are based on
Large Language Models urgently call for a language which is fit to clearly
describe functions and problems of the production process and qualities of the
Chatbots' textual and image output. Recently, the discussion about appropriate
and illuminating metaphors to describe s has gained momentum. As an
alternative to well-established metaphors such as "hallucinating" and
"bullshit", we propose "kitsch" as a new metaphor. As an internationally
widespread term from literary and cultural studies, we argue that "kitsch" is
particularly suitable for analytically illuminating a previously neglected
feature of LLM-based images and texts: their tendency to produce homogeneous
and average content, which is becoming increasingly dominant as the proportion
of AI-generated content on the internet grows. This is leading to the
equalisation of language, style and argument. In view of the potential negative
consequences of this averaging, including for human content producers on the
internet, we advocate combining methods and insights from kitsch studies with
AI research, philosophy, and communication studies in order to better
understand the phenomenon and develop countermeasures.
Towards Transparent and Incentive-Compatible Collaboration in Decentralized LLM Multi-Agent Systems A Blockchain-Driven Approach
Authors: Minfeng Qi, Tianqing Zhu, Lefeng Zhang, Ningran Li, Wanlei Zhou
2025-09-20
Large Language Models (LLMs) have enabled the emergence of autonomous agents
capable of complex reasoning, planning, and interaction. However, coordinating
such agents at scale remains a fundamental challenge, particularly in
decentralized environments where communication lacks transparency and agent
behavior cannot be shaped through centralized incentives. We propose a
blockchain-based framework that enables transparent agent registration,
verifiable task allocation, and dynamic reputation tracking through smart
contracts. The core of our design lies in two mechanisms: a matching
score-based task allocation protocol that evaluates agents by reputation,
capability match, and workload; and a behavior-shaping incentive mechanism that
adjusts agent behavior via feedback on performance and reward. Our
implementation integrates GPT-4 agents with Solidity contracts and
demonstrates, through 50-round simulations, strong task success rates, stable
utility distribution, and emergent agent specialization. The results underscore
the potential for trustworthy, incentive-compatible multi-agent coordination in
open environments.
Decoding Uncertainty The Impact of Decoding Strategies for Uncertainty Estimation in Large Language Models
Authors: Wataru Hashimoto, Hidetaka Kamigaito, Taro Watanabe
2025-09-20
Decoding strategies manipulate the probability distribution underlying the
output of a language model and can therefore affect both generation quality and
its uncertainty. In this study, we investigate the impact of decoding
strategies on uncertainty estimation in Large Language Models (LLMs). Our
experiments show that Contrastive Search, which mitigates repetition, yields
better uncertainty estimates on average across a range of preference-aligned
LLMs. In contrast, the benefits of these strategies sometimes diverge when the
model is only post-trained with supervised fine-tuning, i.e. without explicit
alignment.
EG-MLA Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs
Authors: Zhengge Cai, Haowen Hou
2025-09-20
Reducing the key-value (KV) cache size is a crucial step toward enabling
efficient inference in large language models (LLMs), especially under latency
and memory constraints. While Multi-Head Attention (MHA) offers strong
representational power, it incurs significant memory overhead. Recent work on
Multi-head Latent Attention (MLA) mitigates this by compressing KV
representations into a shared latent space, achieving a better trade-off
between performance and KV cache efficiency. While MLA already achieves
significant KV cache reduction, the scope for further compression remains
limited without performance loss. In this paper, we propose
\textbf{Embedding-Gated Multi-head Latent Attention (EG-MLA)}, a novel
extension of MLA that further reduces KV cache size while enhancing
representational expressiveness. EG-MLA introduces a token-specific embedding
gating mechanism applied in the latent space, enabling fine-grained modulation
of compressed KV vectors with minimal additional computation. Compared to MHA,
EG-MLA achieves over 91.6\% reduction in KV cache size with negligible
performance degradation. Relative to MLA, EG-MLA consistently improves task
accuracy across diverse reasoning benchmarks while achieving up to 59.9\%
additional memory savings. Our theoretical analysis highlights how embedding
gating induces implicit high-order interactions, and empirical evaluations
demonstrate robust generalization across model scales and compression regimes.
Notably, we successfully scale EG-MLA to over 1 billion parameters,
demonstrating its practical viability for large-scale LLM deployment. These
results establish EG-MLA as a memory- and compute-efficient attention
mechanism that enables scalable, high-performance inference in modern LLMs.
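A minimal sketch of what a token-specific embedding gate on compressed latent
KV vectors could look like (the shapes and sigmoid gate are assumptions; the
actual EG-MLA integrates this inside MLA's latent projections):

    import torch
    import torch.nn as nn

    class EmbeddingGatedLatentKV(nn.Module):
        # Gate a compressed latent KV vector with a learned, token-specific
        # embedding: fine-grained modulation at negligible extra compute.
        def __init__(self, vocab_size: int, latent_dim: int):
            super().__init__()
            self.gate_embed = nn.Embedding(vocab_size, latent_dim)

        def forward(self, latent_kv: torch.Tensor, token_ids: torch.Tensor):
            # latent_kv: (batch, seq, latent_dim) shared-latent KV representations.
            gate = torch.sigmoid(self.gate_embed(token_ids))
            return latent_kv * gate

    mod = EmbeddingGatedLatentKV(vocab_size=1000, latent_dim=64)
    kv, ids = torch.randn(2, 8, 64), torch.randint(0, 1000, (2, 8))
    print(mod(kv, ids).shape)  # torch.Size([2, 8, 64])

The elementwise product of a token-dependent gate with the shared latent
vector is one simple way such a gate could induce the higher-order
token-latent interactions the theoretical analysis refers to.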
λ-Orthogonality Regularization for Compatible Representation Learning
Authors: Simone Ricci, Niccolò Biondi, Federico Pernici, Ioannis Patras, Alberto Del Bimbo
2025-09-20
Retrieval systems rely on representations learned by increasingly powerful
models. However, due to the high training cost and inconsistencies in learned
representations, there is significant interest in facilitating communication
between representations and ensuring compatibility across independently trained
neural networks. In the literature, two primary approaches are commonly used to
adapt different learned representations: affine transformations, which adapt
well to specific distributions but can significantly alter the original
representation, and orthogonal transformations, which preserve the original
structure with strict geometric constraints but limit adaptability. A key
challenge is adapting the latent spaces of updated models to align with those
of previous models on downstream distributions while preserving the newly
learned representation spaces. In this paper, we impose a relaxed orthogonality
constraint, namely λ-orthogonality regularization, while learning an
affine transformation, to obtain distribution-specific adaptation while
retaining the original learned representations. Extensive experiments across
various architectures and datasets validate our approach, demonstrating that it
preserves the model's zero-shot performance and ensures compatibility across
model updates. Code available at:
https://github.com/miccunifi/lambda_orthogonality
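Read literally, the idea admits a very small sketch: learn an affine adapter
between old and new feature spaces, but penalize (rather than enforce)
deviation from orthogonality. The loss form and weights below are assumptions:

    import torch

    def lambda_orthogonality_loss(w: torch.Tensor, lam: float = 0.1):
        # Relaxed orthogonality: penalize ||W^T W - I||_F^2 instead of
        # hard-constraining W to be orthogonal.
        eye = torch.eye(w.shape[1], device=w.device)
        return lam * ((w.t() @ w - eye) ** 2).sum()

    # Toy usage: adapt new 32-d features to align with old ones.
    w = torch.randn(32, 32, requires_grad=True)
    old_feats, new_feats = torch.randn(128, 32), torch.randn(128, 32)
    opt = torch.optim.Adam([w], lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        align = ((new_feats @ w - old_feats) ** 2).mean()  # adaptation term
        (align + lambda_orthogonality_loss(w)).backward()
        opt.step()

With a very large lam this recovers a near-orthogonal map that preserves the
new space's geometry; with lam = 0 it is a free affine fit; intermediate
values trade adaptability against preservation, which is the regularizer's
stated purpose.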
Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
Authors: Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland
2025-09-20
Diffusion-based large language models (DLLMs) have recently attracted growing
interest as an alternative to autoregressive decoders. In this work, we
present
an empirical study on using the diffusion-based large language model LLaDA for
automatic speech recognition (ASR). We first investigate its use as an external
deliberation-based processing module for Whisper-LLaMA transcripts. By
leveraging the bidirectional attention and denoising capabilities of LLaDA, we
explore random masking, low-confidence masking, and semi-autoregressive
strategies, showing that Whisper-LLaDA substantially reduces WER compared with
the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER
on test-clean/test-other, representing a 12.3% relative improvement over the
Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA
without acoustic features fails to improve accuracy, highlighting the
importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA
as a standalone decoder for ASR with diffusion-based and semi-autoregressive
decoding. Most experimental configurations achieve faster inference than the
Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These
findings offer an empirical view of diffusion-based LLMs for ASR and point to
promising directions for improvements.
PruneCD Contrasting Pruned Self Model to Improve Decoding Factuality
Authors: Byeongho Yu, Changhun Lee, Jungyu Jin, Eunhyeok Park
2025-09-20
To mitigate the hallucination problem in large language models, DoLa exploits
early exit logits from the same model as a contrastive prior. However, we found
that these early exit logits tend to be flat, low in magnitude, and fail to
reflect meaningful contrasts. To address this, we propose PruneCD, a novel
contrastive decoding method that constructs the amateur model via layer
pruning rather than early exit. This design leads to more informative and
well-aligned logits, enabling more effective contrastive decoding. Through
qualitative and quantitative analyses, we demonstrate that PruneCD
consistently improves factuality with minimal inference overhead, offering a
robust and practical approach to mitigating hallucinations in LLMs.
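PruneCD's specific contribution is where the amateur logits come from (a
layer-pruned copy of the same model); the contrastive step itself follows the
standard contrastive-decoding recipe, sketched here with an assumed
plausibility cutoff:

    import torch

    def contrastive_logits(expert_logits, amateur_logits, alpha=0.1, beta=1.0):
        # Standard contrastive decoding: keep tokens the expert finds
        # plausible (within a factor alpha of its best token), then
        # down-weight what the weaker amateur also predicts.
        expert_logp = torch.log_softmax(expert_logits, dim=-1)
        amateur_logp = torch.log_softmax(amateur_logits, dim=-1)
        cutoff = expert_logp.max(dim=-1, keepdim=True).values \
            + torch.log(torch.tensor(alpha))
        scores = expert_logp - beta * amateur_logp
        return scores.masked_fill(expert_logp < cutoff, float("-inf"))

    expert, amateur = torch.randn(1, 32000), torch.randn(1, 32000)
    next_token = contrastive_logits(expert, amateur).argmax(dim=-1)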
Data-Driven Reduced-Order Modeling of Phase Mixing Dynamics from Particle Kinetic Simulation
Authors: Darian Figuera-Michal, Sungpil Yum, Jae-Min Kwon, Eisung Yoon
2025-09-20
Phase mixing is a fundamental kinetic process that governs dissipation and
stability in collisionless plasmas, but its inherent filamentation in velocity
space creates major challenges for both high-fidelity simulations and
reduced-order modeling. This work presents the first exploratory evaluation of
a joint Proper Orthogonal Decomposition and Sparse Identification of Nonlinear
Dynamics (POD-SINDy) framework applied to particle-in-cell simulations of phase
mixing. Simulation datasets were generated under progressively complex
conditions, starting from a passive kinetic case without self-consistent
electric fields, extending to self-consistent simulations with nonlinear
electric field feedback, and finally to a noisy dataset with reduced particle
resolution. In the passive kinetic regime, POD-SINDy achieved near-optimal
reconstructions with only five modes, reproducing filamentation with errors
below four percent. In self-consistent electrostatic cases, variance spread
across more modes due to nonlinear interactions and noise, slowing singular
value decay and making strict low-rank embeddings more demanding. Nevertheless,
retaining ten modes was sufficient to recover the dominant structures, yielding
reconstruction errors of about seven percent for the low-noise case and
thirteen percent for the noisy dataset. Across all scenarios, SINDy provided
sparse and interpretable equations for modal amplitudes that remained
predominantly linear despite the underlying nonlinear data, while POD
truncation effectively filtered particle noise and preserved coherent dynamics.
These findings demonstrate that POD-SINDy constitutes a compact and
interpretable approach to reduced-order modeling of phase mixing, capable of
retaining essential physics across regimes of increasing complexity while
achieving data compression from three to five orders of magnitude depending on
dataset complexity.
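A textbook POD-SINDy loop is compact enough to sketch (this is the generic
method with a purely linear candidate library, matching the abstract's
observation that the identified modal dynamics stayed predominantly linear; it
is not the paper's pipeline):

    import numpy as np

    def pod_sindy(snapshots: np.ndarray, dt: float, r: int = 5, lam: float = 0.1):
        # (1) POD: SVD of the snapshot matrix gives spatial modes and modal
        #     amplitudes; truncation to r modes filters particle noise.
        u, s, vt = np.linalg.svd(snapshots, full_matrices=False)
        a = (np.diag(s[:r]) @ vt[:r]).T               # (n_time, r) amplitudes
        # (2) SINDy: sparse regression of da/dt on the (linear) library via
        #     sequential thresholded least squares (STLSQ).
        dadt = np.gradient(a, dt, axis=0)
        xi = np.linalg.lstsq(a, dadt, rcond=None)[0]
        for _ in range(10):
            xi[np.abs(xi) < lam] = 0.0
            for j in range(r):
                big = np.abs(xi[:, j]) >= lam
                if big.any():
                    xi[big, j] = np.linalg.lstsq(a[:, big], dadt[:, j],
                                                 rcond=None)[0]
        return u[:, :r], xi                           # modes, sparse dynamics

    data = np.random.randn(200, 400)                  # stand-in for PIC snapshots
    modes, xi = pod_sindy(data, dt=0.01)
    print(modes.shape, xi.shape)                      # (200, 5) (5, 5)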
Assessing Classical Machine Learning and Transformer-based Approaches for Detecting AI-Generated Research Text
Authors: Sharanya Parimanoharan, Ruwan D. Nawarathna
2025-09-20
The rapid adoption of large language models (LLMs) such as ChatGPT has
blurred the line between human and AI-generated texts, raising urgent questions
about academic integrity, intellectual property, and the spread of
misinformation. Thus, reliable AI-text detection is needed for fair assessment
to safeguard human authenticity and cultivate trust in digital communication.
In this study, we investigate how well current machine learning (ML) approaches
can distinguish ChatGPT-3.5-generated texts from human-written texts employing
a labeled data set of 250 pairs of abstracts from a wide range of research
topics. We test and compare both classical (Logistic Regression armed with
classical Bag-of-Words, POS, and TF-IDF features) and transformer-based (BERT
augmented with N-grams, DistilBERT, BERT with a lightweight custom classifier,
and LSTM-based N-gram models) ML detection techniques. As we aim to assess each
model's performance in detecting AI-generated research texts, we also aim to
test whether an ensemble of these models can outperform any single detector.
Results show DistilBERT achieves the overall best performance, while Logistic
Regression and BERT-Custom offer solid, balanced alternatives; LSTM- and
BERT-N-gram approaches lag. The max voting ensemble of the three best models
fails to surpass DistilBERT itself, highlighting the primacy of a single
transformer-based representation over mere model diversity. By comprehensively
assessing the strengths and weaknesses of these AI-text detection approaches,
this work lays a foundation for more robust detection frameworks with larger,
richer datasets to keep pace with ever-improving generative AI models.
FG-Attn Leveraging Fine-Grained Sparsity In Diffusion Transformers
Authors: Sankeerth Durvasula, Kavya Sreedhar, Zain Moustafa, Suraj Kothawade, Ashish Gondimalla, Suvinay Subramanian, Narges Shahidi, Nandita Vijaykumar
2025-09-20
Generating realistic videos with diffusion transformers demands significant
computation, with attention layers as the central bottleneck; even producing a
short clip requires running a transformer over a very long sequence of
embeddings, e.g., more than 30K embeddings for a 5-second video, incurring
significant latency. Prior work aims to mitigate this bottleneck by exploiting
sparsity in the attention layers to reduce computation. However, these works
typically rely on block-sparse attention, which skips score computation only
when all entries in a block of attention scores (corresponding to M queries
and M keys, with M = 64 typically) are zero. This coarse-granular skipping of
attention scores does not fully exploit sparsity in the attention map and
leaves room for improvement. In this work, we propose FG-Attn, a sparse
attention mechanism for long-context diffusion transformers that leverages
sparsity at a fine granularity. Unlike block-sparse attention, which skips
entire MxM blocks, our approach skips computations at the granularity of Mx1
slices of the attention map. Each slice is produced by query-key dot products
between a block of query vectors and a single key. To implement our proposed
sparse attention mechanism, we develop a new efficient bulk-load operation
called asynchronous-gather load. This load operation gathers a sparse set of
relevant key-value vectors from memory and arranges them into packed tiles in
the GPU's shared memory. Only a sparse set of keys relevant to those queries
are loaded into shared memory when computing attention for a block of queries,
in contrast to loading full blocks of key tokens in block-sparse attention.
Our fine-grained sparse attention, applied to video diffusion models, achieves
an
average 1.55X (up to 1.65X) speedup for 5 second, 480p videos, and an average
1.41X (up to 1.49X) for 5 second, 720p videos on a single H100 GPU.
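The granularity difference is easy to see in a reference (non-kernel)
implementation; the skipping criterion below is an assumption, and the paper's
contribution is doing this efficiently on GPU with the asynchronous-gather
load:

    import torch

    def slice_sparse_attention(q, k, v, m: int = 4, margin: float = 8.0):
        # For each block of M queries, drop individual keys whose entire Mx1
        # score slice is negligible (scores far below the block maximum
        # contribute ~0 after softmax), instead of skipping only full MxM
        # blocks as in block-sparse attention.
        seq, d = q.shape
        out = torch.zeros_like(v)
        for start in range(0, seq, m):
            scores = q[start:start + m] @ k.t() / d ** 0.5       # (M, seq)
            keep = scores.amax(dim=0) > scores.max() - margin    # per-key slice test
            attn = torch.softmax(scores[:, keep], dim=-1)
            out[start:start + m] = attn @ v[keep]
        return out

    q, k, v = (torch.randn(16, 8) for _ in range(3))
    print(slice_sparse_attention(q, k, v).shape)  # torch.Size([16, 8])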
orb-QFL Orbital Quantum Federated Learning
Authors: Dev Gurung, Shiva Raj Pokhrel
2025-09-20
Recent breakthroughs in quantum computing present transformative
opportunities for advancing Federated Learning (FL), particularly in
non-terrestrial environments characterized by stringent communication and
coordination constraints. In this study, we propose orbital QFL, termed
orb-QFL, a novel quantum-assisted Federated Learning framework tailored for
Low Earth Orbit (LEO) satellite constellations. Distinct from conventional FL
paradigms, orb-QFL operates without centralized servers or global
aggregation mechanisms (e.g., FedAvg), instead leveraging quantum entanglement
and local quantum processing to facilitate decentralized, inter-satellite
collaboration. This design inherently addresses the challenges of orbital
dynamics, such as intermittent connectivity, high propagation delays, and
coverage variability. The framework enables continuous model refinement through
direct quantum-based synchronization between neighboring satellites, thereby
enhancing resilience and preserving data locality. To validate our approach,
we integrate the Qiskit quantum machine learning toolkit with Poliastro-based
orbital simulations and conduct experiments using the Statlog dataset.
GRIL Knowledge Graph Retrieval-Integrated Learning with Large Language Models
Authors: Jialin Chen, Houyu Zhang, Seongjun Yun, Alejandro Mottini, Rex Ying, Xiang Song, Vassilis N. Ioannidis, Zheng Li, Qingjun Cui
2025-09-20
Retrieval-Augmented Generation (RAG) has significantly mitigated the
hallucinations of Large Language Models (LLMs) by grounding the generation with
external knowledge. Recent extensions of RAG to graph-based retrieval offer a
promising direction, leveraging the structural knowledge for multi-hop
reasoning. However, existing graph RAG typically decouples retrieval and
reasoning processes, which prevents the retriever from adapting to the
reasoning needs of the LLM. They also struggle with scalability when performing
multi-hop expansion over large-scale graphs, or depend heavily on annotated
ground-truth entities, which are often unavailable in open-domain settings. To
address these challenges, we propose a novel graph retriever trained end-to-end
with the LLM, which features an attention-based growing and pruning mechanism,
adaptively navigating multi-hop relevant entities while filtering out noise.
Within the extracted subgraph, structural knowledge and semantic features are
encoded via soft tokens and the verbalized graph, respectively, which are
infused into the LLM together, thereby enhancing its reasoning capability and
facilitating interactive joint training of the graph retriever and the
reasoner. Experimental results across three QA benchmarks show that our
approach consistently achieves state-of-the-art performance, validating the
strength of joint graph-LLM optimization for complex reasoning tasks. Notably,
our framework eliminates the need for predefined ground-truth entities by
directly optimizing the retriever using LLM logits as implicit feedback, making
it especially effective in open-domain settings.
Synergies between Federated Foundation Models and Smart Power Grids
Authors: Seyyedali Hosseinalipour, Shimiao Li, Adedoyin Inaolaji, Filippo Malandra, Luis Herrera, Nicholas Mastronarde
2025-09-20
The recent emergence of large language models (LLMs) such as GPT-3 has marked
a significant paradigm shift in machine learning. Trained on massive corpora of
data, these models demonstrate remarkable capabilities in language
understanding, generation, summarization, and reasoning, transforming how
intelligent systems process and interact with human language. Although LLMs may
still seem like a recent breakthrough, the field is already witnessing the rise
of a new and more general category: multi-modal, multi-task foundation models
(M3T FMs). These models go beyond language and can process heterogeneous data
types/modalities, such as time-series measurements, audio, imagery, tabular
records, and unstructured logs, while supporting a broad range of downstream
tasks spanning forecasting, classification, control, and retrieval. When
combined with federated learning (FL), they give rise to M3T Federated
Foundation Models (FedFMs): a highly recent and largely unexplored class of
models that enable scalable, privacy-preserving model training/fine-tuning
across distributed data sources. In this paper, we take one of the first steps
toward introducing these models to the power systems research community by
offering a bidirectional perspective: (i) M3T FedFMs for smart grids and (ii)
smart grids for FedFMs. In the former, we explore how M3T FedFMs can enhance
key grid functions, such as load/demand forecasting and fault detection, by
learning from distributed, heterogeneous data available at the grid edge in a
privacy-preserving manner. In the latter, we investigate how the constraints
and structure of smart grids, spanning energy, communication, and regulatory
dimensions, shape the design, training, and deployment of M3T FedFMs.
Shift Parallelism Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
Authors: Mert Hidayetoglu, Aurick Qiao, Michael Wyatt, Jeff Rasley, Yuxiong He, Samyam Rajbhandari
2025-09-20
Efficient parallelism is necessary for achieving low-latency, high-throughput
inference with large language models (LLMs). Tensor parallelism (TP) is the
state-of-the-art method for reducing LLM response latency, but its GPU
communications reduce combined token throughput. On the other hand, data
parallelism (DP) obtains a higher throughput yet is slow in response latency.
The best of both worlds does not exist, and it is not possible to combine TP
and DP because of the KV cache variance across the parallelisms.
We notice that Sequence Parallelism (SP - Ulysses in training) has similar
properties to DP but with KV cache invariance. We adapt SP to inference, and
combine it with TP to get the best of both worlds. Our solution: Shift
Parallelism.
Shift Parallelism dynamically switches across TP and SP, and minimizes
latency in low traffic without losing throughput in high traffic. The efficient
GPU communications of Shift Parallelism yield up to i) 1.51x faster response
in interactive workloads and ii) 50% higher throughput in batch workloads,
compared to a TP-only solution.
We evaluate Shift Parallelism with real-world production traces with dynamic
traffic patterns as well as synthetic benchmarking patterns across models,
context sizes, and arrival rates. All results affirm the same: Shift
Parallelism has a better latency vs. throughput tradeoff than TP or DP, and
hence obtains low latency without degrading throughput in dynamic workloads.
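Conceptually, the dispatch logic is a small switch; the trigger and threshold
below are illustrative assumptions (the paper's system switches between real
TP and SP execution modes over the same sharded weights, which is only
possible because SP keeps the KV cache layout invariant):

    def choose_mode(inflight_tokens: int, threshold: int = 4096) -> str:
        # Low traffic: TP minimizes per-request latency.
        # High traffic: SP (Ulysses-style) recovers DP-like throughput.
        return "TP" if inflight_tokens < threshold else "SP"

    for load in (256, 2048, 16384):
        print(load, "->", choose_mode(load))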
LightCode Compiling LLM Inference for Photonic-Electronic Systems
Authors: Ryan Tomich, Zhizhen Zhong, Dirk Englund
2025-09-19
The growing demand for low-latency, energy-efficient inference in large
language models (LLMs) has catalyzed interest in heterogeneous architectures.
While GPUs remain dominant, they are poorly suited for integration with
emerging domain-specific accelerators like the Photonic Tensor Units (PTUs),
which offer low-power, high-throughput linear computation. This motivates
hybrid compilation strategies that combine photonic and electronic resources.
We present LightCode, a compiler framework and simulator for mapping LLM
inference workloads across hybrid photonic-electronic systems. LightCode
introduces the Stacked Graph, an intermediate representation that encodes
multiple hardware-specific realizations of each tensor operation. Hardware
assignment is formulated as a constrained subgraph selection problem optimized
for latency or energy under parametric cost models. We evaluate LightCode on
the decode stage of GPT-2 and Llama-7B, showing that under our workload and
hardware assumptions, (i) photonic hardware reduced energy by up to 50% in our
simulated workloads at maximum sequence length; (ii) the multiplexing and
assignment strategy yielded latency improvements exceeding 10x; and (iii)
optimizing for latency or energy resulted in distinct hardware mappings in our
simulations. LightCode offers a modular, foundational framework and simulator
for compiling LLMs to emerging photonic accelerators.
SENSE-7 Taxonomy and Dataset for Measuring User Perceptions of Empathy in Sustained Human-AI Conversations
Authors: Jina Suh, Lindy Le, Erfan Shayegani, Gonzalo Ramos, Judith Amores, Desmond C. Ong, Mary Czerwinski, Javier Hernandez
2025-09-19
Empathy is increasingly recognized as a key factor in human-AI interaction,
yet conventional approaches to "digital empathy" often focus on simulating
internal, human-like emotional states while overlooking the inherently
subjective, contextual, and relational facets of empathy as perceived by users.
In this work, we propose a human-centered taxonomy that emphasizes observable
empathic behaviors and introduce a new dataset, Sense-7, of real-world
conversations between information workers and Large Language Models (LLMs),
which includes per-turn empathy annotations directly from the users, along with
user characteristics, and contextual details, offering a more user-grounded
representation of empathy. Analysis of 695 conversations from 109 participants
reveals that empathy judgments are highly individualized, context-sensitive,
and vulnerable to disruption when conversational continuity fails or user
expectations go unmet. To promote further research, we provide a subset of 672
anonymized conversations and an exploratory classification analysis, showing
that an LLM-based classifier can recognize 5 levels of empathy with an
encouraging average Spearman ρ = 0.369 and accuracy = 0.487 over this set.
Overall, our findings underscore the need for AI designs that dynamically
tailor empathic behaviors to user contexts and goals, offering a roadmap for
future research and practical development of socially attuned, human-centered
artificial agents.
RephQA Evaluating Readability of Large Language Models in Public Health Question Answering
Authors: Weikang Qiu, Tinglin Huang, Ryan Rullo, Yucheng Kuang, Ali Maatouk, S. Raquel Ramos, Rex Ying
2025-09-19
Large Language Models (LLMs) hold promise in addressing complex medical
problems. However, while most prior studies focus on improving accuracy and
reasoning abilities, a significant bottleneck in developing effective
healthcare agents lies in the readability of LLM-generated responses,
specifically, their ability to answer public health problems clearly and simply
to people without medical backgrounds. In this work, we introduce RephQA, a
benchmark for evaluating the readability of LLMs in public health question
answering (QA). It contains 533 expert-reviewed QA pairs from 27 sources across
13 topics, and includes a proxy multiple-choice task to assess informativeness,
along with two readability metrics: Flesch-Kincaid grade level and professional
score. Evaluation of 25 LLMs reveals that most fail to meet readability
standards, highlighting a gap between reasoning and effective communication.
To address this, we explore four readability-enhancing strategies: standard
prompting, chain-of-thought prompting, Group Relative Policy Optimization
(GRPO), and a token-adapted variant. Token-adapted GRPO achieves the best
results, representing a step toward more practical and user-friendly agents
for public health.
Improving Deep Tabular Learning
Authors: Sivan Sarafian, Yehudit Aperstein
2025-09-19
Tabular data remain a dominant form of real-world information but pose
persistent challenges for deep learning due to heterogeneous feature types,
lack of natural structure, and limited label-preserving augmentations. As a
result, ensemble models based on decision trees continue to dominate benchmark
leaderboards. In this work, we introduce RuleNet, a transformer-based
architecture specifically designed for deep tabular learning. RuleNet
incorporates learnable rule embeddings in a decoder, a piecewise linear
quantile projection for numerical features, and feature masking ensembles for
robustness and uncertainty estimation. Evaluated on eight benchmark datasets,
RuleNet matches or surpasses state-of-the-art tree-based methods in most cases,
while remaining computationally efficient, offering a practical neural
alternative for tabular prediction tasks.
The Role of High-Performance GPU Resources in Large Language Model Based Radiology Imaging Diagnosis
Authors: Jyun-Ping Kao
2025-09-19
Large-language models (LLMs) are rapidly being applied to radiology, enabling
automated image interpretation and report generation tasks. Their deployment in
clinical practice requires both high diagnostic accuracy and low inference
latency, which in turn demands powerful hardware. High-performance graphical
processing units (GPUs) provide the necessary compute and memory throughput to
run large LLMs on imaging data. We review modern GPU architectures (e.g.
NVIDIA A100/H100, AMD Instinct MI250X/MI300) and key performance metrics:
floating-point throughput, memory bandwidth, and VRAM capacity. We show how
these
hardware capabilities affect radiology tasks: for example, generating reports
or detecting findings on CheXpert and MIMIC-CXR images is computationally
intensive and benefits from GPU parallelism and tensor-core acceleration.
Empirical studies indicate that using appropriate GPU resources can reduce
inference time and improve throughput. We discuss practical challenges
including privacy, deployment, cost, and power, as well as optimization
strategies: mixed-precision, quantization, pruning, and multi-GPU scaling.
Finally, we
anticipate that next-generation features (8-bit tensor cores, enhanced
interconnect) will further enable on-premise and federated radiology AI.
Advancing GPU infrastructure is essential for safe, efficient LLM-based
radiology diagnostics.
MANZANO A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Authors: Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen
2025-09-19
Unified multimodal Large Language Models (LLMs) that can both understand and
generate visual content hold immense potential. However, existing open-source
models often suffer from a performance trade-off between these capabilities. We
present Manzano, a simple and scalable unified framework that substantially
reduces this tension by coupling a hybrid image tokenizer with a well-curated
training recipe. A single shared vision encoder feeds two lightweight adapters
that produce continuous embeddings for image-to-text understanding and discrete
tokens for text-to-image generation within a common semantic space. A unified
autoregressive LLM predicts high-level semantics in the form of text and image
tokens, with an auxiliary diffusion decoder subsequently translating the image
tokens into pixels. The architecture, together with a unified training recipe
over understanding and generation data, enables scalable joint learning of both
capabilities. Manzano achieves state-of-the-art results among unified models,
and is competitive with specialist models, particularly on text-rich
evaluation. Our studies show minimal task conflicts and consistent gains from
scaling model size, validating our design choice of a hybrid tokenizer.
Agentic Aerial Cinematography From Dialogue Cues to Cinematic Trajectories
Authors: Yifan Lin, Sophie Ziyu Liu, Ran Qi, George Z. Xue, Xinping Song, Chao Qin, Hugh H. -T. Liu
2025-09-19
We present Agentic Aerial Cinematography: From Dialogue Cues to Cinematic
Trajectories (ACDC), an autonomous drone cinematography system driven by
natural language communication between human directors and drones. The main
limitation of previous drone cinematography workflows is that they require
manual selection of waypoints and view angles based on predefined human intent,
which is labor-intensive and yields inconsistent performance. In this paper, we
propose employing large language models (LLMs) and vision foundation models
(VFMs) to convert free-form natural language prompts directly into executable
indoor UAV video tours. Specifically, our method comprises a vision-language
retrieval pipeline for initial waypoint selection, a preference-based Bayesian
optimization framework that refines poses using aesthetic feedback, and a
motion planner that generates safe quadrotor trajectories. We validate ACDC
through both simulation and hardware-in-the-loop experiments, demonstrating
that it robustly produces professional-quality footage across diverse indoor
scenes without requiring expertise in robotics or cinematography. These results
highlight the potential of embodied AI agents to close the loop from
open-vocabulary dialogue to real-world autonomous aerial cinematography.
It Depends Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge
Authors: Lukas Ellinger, Georg Groh
2025-09-19
Ambiguous words or underspecified references require interlocutors to resolve
them, often by relying on shared context and commonsense knowledge. Therefore,
we systematically investigate whether Large Language Models (LLMs) can leverage
commonsense to resolve referential ambiguity in multi-turn conversations and
analyze their behavior when ambiguity persists. Further, we study how requests
for simplified language affect this capacity. Using a novel multilingual
evaluation dataset, we test DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and
Llama-3.1-8B via LLM-as-Judge and human annotations. Our findings indicate
that current LLMs struggle to resolve ambiguity effectively: they tend to
commit to
a single interpretation or cover all possible references, rather than hedging
or seeking clarification. This limitation becomes more pronounced under
simplification prompts, which drastically reduce the use of commonsense
reasoning and diverse response strategies. Fine-tuning Llama-3.1-8B with Direct
Preference Optimization substantially improves ambiguity resolution across all
request types. These results underscore the need for advanced fine-tuning to
improve LLMs' handling of ambiguity and to ensure robust performance across
diverse communication styles.
Personalized Federated Learning with Heat-Kernel Enhanced Tensorized Multi-View Clustering
Authors: Kristina P. Sinaga
2025-09-19
We present a robust personalized federated learning framework that leverages
heat-kernel enhanced tensorized multi-view fuzzy c-means clustering with
advanced tensor decomposition techniques. Our approach integrates heat-kernel
coefficients adapted from quantum field theory with Tucker decomposition and
canonical polyadic decomposition (CANDECOMP/PARAFAC) to transform conventional
distance metrics and efficiently represent high-dimensional multi-view
structures. The framework employs matriculation and vectorization techniques to
facilitate the discovery of hidden structures and multilinear relationships via
N-way generalized tensors. The proposed method introduces a dual-level
optimization scheme: local heat-kernel enhanced fuzzy clustering with tensor
decomposition operating on order-N input tensors, and federated aggregation of
tensor factors with privacy-preserving personalization mechanisms. The local
stage employs tensorized kernel Euclidean distance transformations and Tucker
decomposition to discover client-specific patterns in multi-view tensor data,
while the global aggregation process coordinates tensor factors (core tensors
and factor matrices) across clients through differential privacy-preserving
protocols. This tensorized approach enables efficient handling of
high-dimensional multi-view data with significant communication savings
through
low-rank tensor approximations.
SegDINO3D 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features
Authors: Jinyuan Qu, Hongyang Li, Xingyu Chen, Shilong Liu, Yukai Shi, Tianhe Ren, Ruitao Jing, Lei Zhang
2025-09-19
In this paper, we present SegDINO3D, a novel Transformer encoder-decoder
framework for 3D instance segmentation. As 3D training data is generally not as
sufficient as 2D training images, SegDINO3D is designed to fully leverage 2D
representation from a pre-trained 2D detection model, including both
image-level and object-level features, for improving 3D representation.
SegDINO3D takes both a point cloud and its associated 2D images as input. In
the encoder stage, it first enriches each 3D point by retrieving 2D image
features from its corresponding image views and then leverages a 3D encoder for
3D context fusion. In the decoder stage, it formulates 3D object queries as 3D
anchor boxes and performs cross-attention from 3D queries to 2D object queries
obtained from 2D images using the 2D detection model. These 2D object queries
serve as a compact object-level representation of 2D images, effectively
avoiding the challenge of keeping thousands of image feature maps in the memory
while faithfully preserving the knowledge of the pre-trained 2D model. The
introduction of 3D box queries also enables the model to modulate
cross-attention using the predicted boxes for more precise querying. SegDINO3D
achieves the state-of-the-art performance on the ScanNetV2 and ScanNet200 3D
instance segmentation benchmarks. Notably, on the challenging ScanNet200
dataset, SegDINO3D significantly outperforms prior methods by +8.7 and +6.8 mAP
on the validation and hidden test sets, respectively, demonstrating its
superiority.
Think, Verbalize, then Speak Bridging Complex Thoughts and Comprehensible Speech
Authors: Sang Hoon Woo, Sehun Lee, Kang-wook Kim, Gunhee Kim
2025-09-19
Spoken dialogue systems increasingly employ large language models (LLMs) to
leverage their advanced reasoning capabilities. However, direct application of
LLMs in spoken communication often yields suboptimal results due to mismatches
between optimal textual and verbal delivery. While existing approaches adapt
LLMs to produce speech-friendly outputs, their impact on reasoning performance
remains underexplored. In this work, we propose Think-Verbalize-Speak, a
framework that decouples reasoning from spoken delivery to preserve the full
reasoning capacity of LLMs. Central to our method is verbalizing, an
intermediate step that translates thoughts into natural, speech-ready text. We
also introduce ReVerT, a latency-efficient verbalizer based on incremental and
asynchronous summarization. Experiments across multiple benchmarks show that
our method enhances speech naturalness and conciseness with minimal impact on
reasoning. The project page with the dataset and the source code is available
at https://yhytoto12.github.io/TVS-ReVerT
BEFT Bias-Efficient Fine-Tuning of Language Models
Authors: Baichuan Huang, Ananth Balashankar, Amir Aminifar
2025-09-19
Fine-tuning all-bias-terms stands out among various parameter-efficient
fine-tuning (PEFT) techniques, owing to its out-of-the-box usability and
competitive performance, especially in low-data regimes. Bias-only fine-tuning
has the potential for unprecedented parameter efficiency. However, the link
between fine-tuning different bias terms (i.e., bias terms in the query, key,
or value projections) and downstream performance remains unclear. The existing
approaches, e.g., based on the magnitude of bias change or empirical Fisher
information, provide limited guidance for selecting the particular bias term
for effective fine-tuning. In this paper, we propose an approach for selecting
the bias term to be fine-tuned, forming the foundation of our bias-efficient
fine-tuning (BEFT). We extensively evaluate our bias-efficient approach against
other bias-selection approaches, across a wide range of large language models
(LLMs) spanning encoder-only and decoder-only architectures from 110M to 6.7B
parameters. Our results demonstrate the effectiveness and superiority of our
bias-efficient approach on diverse downstream tasks, including classification,
multiple-choice, and generation tasks.
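Mechanically, bias-only fine-tuning is a few lines; what BEFT adds is a
principled rule for which bias group to pick, which the sketch below leaves as
a simple name filter (the substring convention follows Hugging Face-style
parameter names and is an assumption):

    import torch.nn as nn

    def setup_bias_only_finetuning(model: nn.Module, which: str = "query"):
        # Freeze everything, then re-enable gradients only for the selected
        # bias terms (e.g. the biases of the query projections).
        for name, param in model.named_parameters():
            param.requires_grad = name.endswith("bias") and which in name
        trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
        total = sum(p.numel() for p in model.parameters())
        print(f"trainable: {trainable}/{total} ({100 * trainable / total:.4f}%)")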
Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs
Authors: Guoliang He, Youhe Jiang, Wencong Xiao, Kaihua Jiang, Shuguang Wang, Jun Wang, Zixian Du, Zhuo Jiang, Xinlei Zhang, Binhang Yuan, Eiko Yoneki
2025-09-19
The scaling law for large language models (LLMs) depicts that the path
towards machine intelligence necessitates training at large scale. Thus,
companies continuously build large-scale GPU clusters, and launch training jobs
that span over thousands of computing nodes. However, LLM pre-training
presents unique challenges due to its complex communication patterns, where
GPUs exchange data in sparse yet high-volume bursts within specific groups.
Inefficient resource scheduling exacerbates bandwidth contention, leading to
suboptimal training performance. This paper presents Arnold, a scheduling
system summarizing our experience to effectively align communication patterns
with data center topology at scale. An in-depth characterization study is
performed to identify the impact of physical network topology on LLM
pre-training jobs. Based on the insights, we develop a scheduling algorithm to
effectively align communication patterns with the physical network topology in
modern data centers. Through simulation experiments, we show the effectiveness
of our algorithm in reducing the maximum spread of communication groups. In
production training, our scheduling system improves the end-to-end performance
when training with more than 9600 GPUs, a significant improvement for our
training pipeline.
FedHK-MVFC Federated Heat Kernel Multi-View Clustering
Authors: Kristina P. Sinaga
2025-09-19
In the realm of distributed AI and privacy-focused medical applications, we
propose a framework for multi-view clustering that links quantum field theory
with federated healthcare analytics. Our method uses heat-kernel coefficients
from spectral analysis to convert Euclidean distances into geometry-aware
similarity measures, capturing the structure of diverse medical data. We lay
this out through the Heat Kernel Distance (HKD) transformation with convergence
guarantees. Two algorithms are developed: Heat Kernel-Enhanced Multi-View Fuzzy
Clustering (HK-MVFC) for central analysis, and Federated Heat Kernel Multi-View
Fuzzy Clustering (FedHK-MVFC) for secure, privacy-preserving learning across
hospitals using differential privacy and secure aggregation to facilitate
HIPAA-compliant collaboration. Tests on synthetic datasets of cardiovascular
patients show an increase in clustering accuracy, reduced communication, and
efficiency retention over centralized methods.
Validated on 10,000 patient records across two hospitals, it proves useful for
collaborative phenotyping involving ECG, cardiac imaging, and behavioral data.
Our theoretical contributions include update rules with proven convergence,
adaptive view weighting, and privacy-preserving protocols. This presents a new
standard for geometry-aware federated learning in healthcare, turning advanced
math into workable solutions for analyzing sensitive medical data while
ensuring both rigor and clinical relevance.
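The abstract names the Heat Kernel Distance but not its closed form; one
common heat-kernel-style transformation of the squared Euclidean distance,
shown purely as an illustration (the exact HKD and its convergence analysis
are not reproduced here), is:

    import numpy as np

    def heat_kernel_distance(x: np.ndarray, y: np.ndarray, t: float = 1.0) -> float:
        # Assumed form: a bounded, geometry-aware distance that behaves like
        # the Euclidean distance locally but saturates for far-apart points,
        # limiting outlier influence on fuzzy c-means style memberships.
        d2 = float(np.sum((x - y) ** 2))
        return 2.0 * (1.0 - np.exp(-d2 / (4.0 * t)))

    a, b = np.zeros(3), np.ones(3)
    print(heat_kernel_distance(a, b), heat_kernel_distance(a, 10 * b))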
UniGist Towards General and Hardware-aligned Sequence-level Long Context Compression
Authors: Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Tianqing Fang, Hongming Zhang, Haitao Mi, Dong Yu, Zhicheng Dou
2025-09-19
Large language models are increasingly capable of handling long-context
inputs, but the memory overhead of the key-value (KV) cache remains a major
bottleneck for general-purpose deployment. While various compression
strategies have been explored, sequence-level compression, which drops the
full KVs for certain tokens, is particularly challenging as it can lead to the
loss of important contextual information. To address this, we introduce
UniGist, a sequence-level long-context compression framework that efficiently
preserves context information by replacing raw tokens with special compression
tokens (gists) in a fine-grained manner. We adopt a chunk-free training
strategy and
design an efficient kernel with a gist shift trick, enabling optimized GPU
training. Our scheme also supports flexible inference by allowing the actual
removal of compressed tokens, resulting in real-time memory savings.
Experiments across multiple long-context tasks demonstrate that UniGist
significantly improves compression quality, with especially strong performance
in detail-recalling tasks and long-range dependency modeling.
RMT-KD Random Matrix Theoretic Causal Knowledge Distillation
Authors: Davide Ettori, Nastaran Darabi, Sureshkumar Senthilkumar, Amit Ranjan Trivedi
2025-09-19
Large deep learning models such as BERT and ResNet achieve state-of-the-art
performance but are costly to deploy at the edge due to their size and compute
demands. We present RMT-KD, a method that leverages Random Matrix
Theory (RMT) for knowledge distillation to iteratively reduce network size.
Instead of pruning or heuristic rank selection, RMT-KD preserves only
informative directions identified via the spectral properties of hidden
representations. RMT-based causal reduction is applied layer by layer with
self-distillation to maintain stability and accuracy. On GLUE, AG News, and
CIFAR-10, RMT-KD achieves up to 80% parameter reduction with only 2% accuracy
loss, delivering 2.8x faster inference and nearly halved power consumption.
These results establish RMT-KD as a mathematically grounded approach to network
distillation.
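One plausible RMT criterion for "informative directions", sketched here under
strong simplifying assumptions (the paper's layer-wise causal reduction and
self-distillation loop are not reproduced), is to keep singular values above
the Marchenko-Pastur bulk edge expected of pure noise:

    import numpy as np

    def informative_rank(h: np.ndarray) -> int:
        # h: (n_samples, dim) hidden representations for one layer.
        n, d = h.shape
        h = h - h.mean(axis=0)
        svals = np.linalg.svd(h, compute_uv=False)
        sigma2 = np.median(svals ** 2) / max(n, d)        # crude noise proxy
        mp_edge = np.sqrt(sigma2) * (np.sqrt(n) + np.sqrt(d))
        return int((svals > mp_edge).sum())               # directions to keep

    h = np.random.randn(512, 256) @ np.random.randn(256, 256)
    print("retained rank:", informative_rank(h))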
VOX-KRIKRI Unifying Speech and Language through Continuous Fusion
Authors: Dimitrios Damianos, Leon Voukoutis, Georgios Paraskevopoulos, Vassilis Katsouros
2025-09-19
We present a multimodal fusion framework that bridges pre-trained
decoder-based large language models (LLMs) and acoustic encoder-decoder
architectures such as Whisper, with the aim of building speech-enabled LLMs.
Instead of directly using audio embeddings, we explore an intermediate
audio-conditioned text space as a more effective mechanism for alignment. Our
method operates fully in continuous text representation spaces, fusing
Whisper's hidden decoder states with those of an LLM through cross-modal
attention, and supports both offline and streaming modes. We introduce
\textit{VoxKrikri}, the first Greek speech LLM, and show through analysis that
our approach effectively aligns representations across modalities. These
results highlight continuous space fusion as a promising path for multilingual
and low-resource speech LLMs, while achieving state-of-the-art results for
Automatic Speech Recognition in Greek, providing an average relative
improvement across benchmarks.
Multilingual LLM Prompting Strategies for Medical English-Vietnamese Machine Translation
Authors: Nhu Vo, Nu-Uyen-Phuong Le, Dung D. Le, Massimo Piccardi, Wray Buntine
2025-09-19
Medical English-Vietnamese machine translation (En-Vi MT) is essential for
healthcare access and communication in Vietnam, yet Vietnamese remains a
low-resource and under-studied language. We systematically evaluate prompting
strategies for six multilingual LLMs (0.5B-9B parameters) on the MedEV dataset,
comparing zero-shot, few-shot, and dictionary-augmented prompting with Meddict,
an English-Vietnamese medical lexicon. Results show that model scale is the
primary driver of performance: larger LLMs achieve strong zero-shot results,
while few-shot prompting yields only marginal improvements. In contrast,
terminology-aware cues and embedding-based example retrieval consistently
improve domain-specific translation. These findings underscore both the promise
and the current limitations of multilingual
s for medical En-Vi MT.
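A dictionary-augmented prompt of the kind evaluated here can be assembled by prepending matched lexicon entries to the translation instruction; a minimal sketch, where the toy lexicon is a stand-in for the real Meddict:

def build_prompt(source: str, lexicon: dict[str, str]) -> str:
    # collect lexicon terms that actually occur in the source sentence
    hits = {en: vi for en, vi in lexicon.items() if en.lower() in source.lower()}
    hint_block = "\n".join(f"- {en} = {vi}" for en, vi in hits.items())
    prompt = "Translate the following medical text from English to Vietnamese.\n"
    if hits:
        prompt += f"Use these terminology hints:\n{hint_block}\n"
    prompt += f"Text: {source}\nTranslation:"
    return prompt

meddict = {"hypertension": "tăng huyết áp", "diabetes": "đái tháo đường"}
print(build_prompt("Patient has a history of hypertension.", meddict))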
Interplay Between Belief Propagation and Transformer Differential-Attention Message Passing Transformer
Authors: Chin Wa Lau, Xiang Shi, Ziyan Zheng, Haiwen Cao, Nian Guo
2025-09-19
Transformer-based neural decoders have emerged as a promising approach to
error correction coding, combining data-driven adaptability with efficient
modeling of long-range dependencies. This paper presents a novel decoder
architecture that integrates classical belief propagation principles with
transformer designs. We introduce a differentiable syndrome loss function
leveraging global codebook structure and a differential-attention mechanism
optimizing bit and syndrome embedding interactions. Experimental results
demonstrate consistent performance improvements over existing
transformer-based decoders, with our approach surpassing traditional belief
propagation decoders for short-to-medium length LDPC codes.
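The differentiable syndrome loss is not written out in the abstract; a standard soft-XOR construction, sketched here as an assumption, scores how far soft bit estimates are from satisfying every parity check:

import torch

def soft_syndrome_loss(p_bits: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """p_bits: (batch, n) probabilities that each bit is 1; H: (m, n) 0/1 parity checks."""
    signs = 1.0 - 2.0 * p_bits                      # bit 0 -> +1, bit 1 -> -1
    losses = []
    for row in H:                                   # soft XOR over each parity check
        idx = row.nonzero(as_tuple=True)[0]
        check = signs[:, idx].prod(dim=1)           # +1 iff the check is satisfied (hard limit)
        losses.append(0.5 * (1.0 - check))          # 0 when satisfied, 1 when violated
    return torch.stack(losses, dim=1).mean()

Because the product is differentiable in the soft bits, this term can be added to the decoder's training objective without any code-specific labels.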
Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models
Authors: Tomoya Yamashita, Akira Ito, Yuuki Yamanaka, Masanori Yamada, Takayuki Miura, Toshiki Shibahara
2025-09-19
As large language models (LLMs) are increasingly deployed across various
applications, privacy and copyright concerns have heightened the need for more
effective unlearning techniques. Many existing unlearning methods aim to
suppress undesirable outputs through additional training (e.g., gradient
ascent), which reduces the probability of generating such outputs. While such
suppression-based approaches can control model outputs, they may not eliminate
the underlying knowledge embedded in the model's internal activations; muting a
response is not the same as forgetting it. Moreover, such suppression-based
methods often suffer from model collapse. To address these issues, we propose a
novel unlearning method that directly intervenes in the model's internal
activations. In our formulation, forgetting is defined as a state in which the
activation of a forgotten target is indistinguishable from that of ``unknown''
entities. Our method introduces an unlearning objective that moves the
activation of the target entity away from those of known entities and toward
those of unknown entities in a sparse autoencoder latent space. By aligning the
target's internal activation with those of unknown entities, we shift the
model's recognition of the target entity from ``known'' to ``unknown'',
achieving genuine forgetting while avoiding over-suppression and model
collapse. Empirically, we show that our method effectively aligns the internal
activations of the forgotten target, a result that suppression-based
approaches do not reliably achieve. Additionally, our method effectively
reduces the model's recall of target knowledge in question-answering tasks
without significant damage to non-target knowledge.
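The objective described above has a compact pull/push form in the sparse-autoencoder latent space; a minimal sketch, assuming the SAE encoder and the group centroids are given (the weighting is illustrative):

import torch
import torch.nn.functional as F

def unlearning_loss(h_target, sae_encode, mu_unknown, mu_known, margin_weight=1.0):
    """h_target: activation of the forget target; mu_*: mean SAE latents of entity groups."""
    z = sae_encode(h_target)                 # sparse-autoencoder latent of the target
    pull = F.mse_loss(z, mu_unknown)         # align the target with 'unknown' entities
    push = F.mse_loss(z, mu_known)           # move it away from 'known' entities
    return pull - margin_weight * push       # minimized during the unlearning updates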
Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding
Authors: Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li
2025-09-19
Large language models (LLMs) deliver impressive generation quality, but incur
very high inference cost because each output token is generated
auto-regressively through all model layers. Early-exit based self-speculative
decoding (EESD) has emerged to mitigate this cost. However, in practice, many
approaches struggle to achieve the expected speedup in such a draft-then-verify
paradigm even with a well-aligned early-exit head and a carefully selected exit
position. Our analysis reveals that EESD only pays off when the vast majority
of draft tokens are accepted by the LLM; otherwise, the draft cost may outweigh
the speedup gain and lead to a negative speedup. To mitigate this, we propose
Pipeline-Parallel Self-Speculative Decoding (PPSD), which fully pipelines the
draft and verification work so that no effort is wasted on failed predictions.
It has two key innovations. First, we configure the model layers as a pipeline
in which early-exit (draft) computations and remaining-layer (verification)
computations overlap. Second, we interleave drafting and verification per
token: while the LLM is verifying the current token in its final layers, the
early-exit path simultaneously drafts the next token. Such a verify-while-draft
scheme keeps all units busy and validates tokens on the fly, analogous to
pipelining the speculation and verification stages. Empirical results confirm
that PPSD achieves state-of-the-art acceleration in self-speculative decoding
inference. On diverse benchmarks, PPSD achieves speedup ratios in the range of
2.01x-3.81x, which is nearly the optimal speedup at the fixed acceptance rate
and exit position, showcasing its advancement in providing efficient
self-speculation.
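Why overlapping helps can be seen with a back-of-envelope per-token cost model (a toy model of my own, not the paper's analysis): with full-pass cost 1 and an early exit after a fraction d of the layers, pipelining hides the cheaper of draft and verify behind the other.

def per_token_latency(accept: float, d: float, pipelined: bool) -> float:
    # cost of a full forward pass = 1; early-exit draft = d; remaining verify = 1 - d
    if pipelined:
        ok = max(d, 1.0 - d)        # next token's draft overlaps current verification
    else:
        ok = d + (1.0 - d)          # draft then verify, back to back
    fail = 1.0                      # rejected draft: the full model recomputes the token
    return accept * ok + (1.0 - accept) * fail

for a in (0.6, 0.8, 0.95):
    # approximate speedup over plain auto-regressive decoding
    print(a, round(1.0 / per_token_latency(a, d=0.25, pipelined=True), 2))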
DNA-DetectLLM Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm
Authors: Xiaowei Zhu, Yubing Ren, Fang Fang, Qingfeng Tan, Shi Wang, Yanan Cao
2025-09-19
The rapid advancement of large language models (LLMs) has blurred the line
between AI-generated and human-written text. This progress brings societal
risks such as misinformation, authorship ambiguity, and intellectual property
concerns, highlighting the urgent need for reliable AI-generated text detection
methods. However, recent advances in generative language modeling have resulted
in significant overlap between the feature distributions of human-written and
AI-generated text, blurring classification boundaries and making accurate
detection increasingly challenging. To address these challenges, we propose
a DNA-inspired perspective, leveraging a repair-based process to directly and
interpretably capture the intrinsic differences between human-written and
AI-generated text. Building on this perspective, we introduce DNA-DetectLLM, a
zero-shot detection method for distinguishing AI-generated and human-written
text. The method constructs an ideal AI-generated sequence for each input,
iteratively repairs non-optimal tokens, and quantifies the cumulative repair
effort as an interpretable detection signal. Empirical evaluations demonstrate
that our method achieves state-of-the-art detection performance and exhibits
strong robustness against various adversarial attacks and input lengths.
Specifically, DNA-DetectLLM achieves relative improvements of 5.55% in AUROC
and 2.08% in F1 score across multiple public benchmark datasets.
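The cumulative repair signal can be illustrated directly from token logits: measure how much log-probability each observed token gives up relative to the model-optimal token. This is a simplified reading of the scoring, not the paper's exact procedure:

import torch

def repair_effort(logits: torch.Tensor, token_ids: torch.Tensor) -> float:
    """logits: (T, V) next-token logits at each position; token_ids: (T,) observed tokens."""
    logp = torch.log_softmax(logits, dim=-1)
    observed = logp.gather(1, token_ids.unsqueeze(1)).squeeze(1)  # log p(actual token)
    optimal = logp.max(dim=-1).values                             # log p(ideal AI token)
    gaps = optimal - observed                                     # repair cost per position
    # AI-generated text sits near the model optimum, so it needs little repair;
    # a large total suggests human-written text
    return gaps.sum().item()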
Optimization techniques for SQL+ML queries A performance analysis of real-time feature computation in OpenMLDB
Authors: Mashkhal A. Sidiq, Aras A. Salih, Samrand M. Hassan
2025-09-19
In this study, we optimize SQL+ML queries on top of OpenMLDB, an open-source
database that seamlessly integrates offline and online feature computations.
The experiments used a feature-rich synthetic dataset in Docker environments
that emulated production, processing 100 to 500 records per batch with 6 to 12
parallel requests. Optimization efforts concentrated on better query plans,
cached execution plans, parallel processing, and
resource management. The experimental results show that OpenMLDB can support
approximately 12,500 QPS with less than 1 ms latency, outperforming SparkSQL
and ClickHouse by a factor of 23 and PostgreSQL and MySQL by 3.57 times. This
study assessed the impact of optimization and showed that query plan
optimization accounted for 35% of the performance gains, caching for 25%, and
parallel processing for 20%. These results illustrate OpenMLDB's capability for
time-sensitive ML use cases, such as fraud detection, personalized
recommendation, and time series forecasting. The system's modular optimization
framework, which combines batch and stream processing without interference,
contributes to its significant performance gain over traditional database
systems, particularly in applications that require real-time feature
computation and serving. This study contributes to the understanding and design
of high-performance SQL+ML systems and highlights the need for specialized SQL
optimization for ML workloads.
LLM Cache Bandit Revisited Addressing Query Heterogeneity for Cost-Effective LLM Inference
Authors: Hantao Yang, Hong Xie, Defu Lian, Enhong Chen
2025-09-19
This paper revisits the LLM cache bandit problem, with a special focus on
addressing query heterogeneity for cost-effective LLM inference. Previous
works often assume uniform query sizes. Heterogeneous query sizes introduce a
combinatorial structure for cache selection, making the cache replacement
process more computationally and statistically challenging. We treat optimal
cache selection as a knapsack problem and employ an accumulation-based strategy
to effectively balance computational overhead and cache updates. In theoretical
analysis, we prove a regret bound for our algorithm that improves the leading
coefficient, in terms of the total number of queries and the cache size, over
the prior result of Berkeley. Additionally, we provide a problem-dependent
bound, which was absent in previous works. Experiments on real-world data show
that our algorithm reduces the total cost by approximately 12\%.
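Viewing cache selection as a knapsack suggests a simple density-greedy baseline: rank entries by estimated savings per unit size and fill the cache. A minimal sketch, where the estimated-savings field stands in for the algorithm's accumulated statistics:

def select_cache(entries, capacity):
    """entries: list of (query_id, size, est_savings); returns ids to keep in cache."""
    ranked = sorted(entries, key=lambda e: e[2] / e[1], reverse=True)  # savings density
    chosen, used = [], 0
    for qid, size, _ in ranked:
        if used + size <= capacity:          # greedy knapsack fill under the size budget
            chosen.append(qid)
            used += size
    return chosen

print(select_cache([("a", 4, 8.0), ("b", 2, 6.0), ("c", 3, 3.0)], capacity=6))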
A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication
Authors: Ryan Collette, Ross Greenwood, Serena Nicoll
2025-09-18
While existing speech audio codecs exploit limited forms of temporal redundancy
and allow for multi-scale representations, they tend to represent all features
of audio in the same way. In contrast, generative voice models designed for
text-to-speech and voice transfer tasks have recently proved effective at
factorizing audio signals into high-level semantic representations of
fundamentally distinct features. In this paper, we leverage such
representations in a novel semantic communications approach to achieve lower
bitrates without sacrificing perceptual quality or suitability for specific
downstream tasks. Our technique matches or outperforms existing audio codecs on
transcription, sentiment analysis, and speaker verification when encoding at
2-4x lower bitrate, notably surpassing Encodec in perceptual quality and
speaker verification while using up to 4x less bitrate.
CAGE Continuity-Aware edGE Network Unlocks Robust Floorplan Reconstruction
Authors: Yiyi Liu, Chunyang Liu, Weiqin Jiao, Bojian Wu, Fashuai Li, Biao Xiong
2025-09-18
We present \textbf{CAGE} (\textit{Continuity-Aware edGE}) network, a robust
framework for reconstructing vector floorplans directly from point-cloud
density maps. Traditional corner-based polygon representations
are highly sensitive to noise and incomplete observations, often resulting in
fragmented or implausible layouts. Recent line grouping methods leverage
structural cues to improve robustness but still struggle to recover fine
geometric details. To address these limitations, we propose a \textit{native}
edge-centric formulation, modeling each wall segment as a directed,
geometrically continuous edge. This representation enables inference of
coherent floorplan structures, ensuring watertight, topologically valid room
boundaries while improving robustness and reducing artifacts. Towards this
design, we develop a dual-query decoder that integrates perturbed
and latent queries within a denoising framework, which not only stabilizes
optimization but also accelerates convergence. Extensive experiments on
Structured3D and SceneCAD show that \textbf{CAGE} achieves state-of-the-art
performance, with F1 scores of 99.1\% (rooms), 91.7\% (corners), and 89.3\%
(angles). The method also demonstrates strong cross-dataset generalization,
underscoring the efficacy of our architectural innovations. Code and pretrained
models will be released upon acceptance.
IMPQ Interaction-Aware Layerwise Mixed Precision Quantization for LLMs
Authors: Junchen Zhao, Ali Derakhshan, Dushyant Bharadwaj, Jayden Kana Hyman, Junhao Dong, Sangeetha Abdu Jyothi, Ian Harris
2025-09-18
Large Language Models (LLMs) promise impressive capabilities, yet their
multi-billion-parameter scale makes on-device or low-resource deployment
prohibitive. Mixed-precision quantization offers a compelling solution, but
existing methods struggle when the average precision drops below four bits, as
they rely on isolated, layer-specific metrics that overlook critical
inter-layer interactions affecting overall performance. In this paper, we
propose two innovations to address these limitations. First, we frame the
mixed-precision quantization problem as a cooperative game among layers and
introduce Shapley-based Progressive Quantization Estimation (SPQE) to
efficiently obtain accurate Shapley estimates of layer sensitivities and
inter-layer interactions. Second, building upon SPQE, we propose
Interaction-aware Mixed-Precision Quantization (IMPQ) which translates these
Shapley estimates into a binary quadratic optimization formulation, assigning
either 2 or 4-bit precision to layers under strict memory constraints.
Comprehensive experiments conducted on Llama-3, Gemma-2, and Qwen-3 models
across three independent PTQ backends (Quanto, HQQ, GPTQ) demonstrate IMPQ's
scalability and consistently superior performance compared to methods relying
solely on isolated metrics. Across average precisions spanning 4 bits down to 2
bits, IMPQ cuts perplexity by 20 to 80 percent relative to the best baseline,
with the margin growing as the bit-width tightens.
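At toy scale, the binary quadratic assignment can be illustrated by exhaustively scoring 2-bit/4-bit choices against Shapley-style sensitivity and interaction terms; the coefficients below are made up, and real instances would need a proper QUBO solver rather than brute force:

import itertools
import numpy as np

def impq_assign(phi, Phi, mem, budget):
    """phi: (L,) per-layer sensitivities; Phi: (L, L) pairwise interactions;
    mem[b]: per-layer memory at bit-width b; x_i = 1 means 4-bit, 0 means 2-bit."""
    L = len(phi)
    best, best_cost = None, np.inf
    for bits in itertools.product([0, 1], repeat=L):
        x = np.array(bits)
        used = np.where(x == 1, mem[4], mem[2]).sum()
        if used > budget:                             # strict memory constraint
            continue
        low = 1 - x                                   # indicator of layers pushed to 2-bit
        cost = phi @ low + low @ Phi @ low            # sensitivity + interaction penalty
        if cost < best_cost:
            best, best_cost = x, cost
    return best

rng = np.random.default_rng(0)
phi, Phi = rng.random(8), rng.random((8, 8)) * 0.1
print(impq_assign(phi, Phi, {2: 1.0, 4: 2.0}, budget=12.0))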
LLM-Assisted Topic Reduction for BERTopic on Social Media Data
Authors: Wannes Janssens, Matthias Bogaert, Dirk Van den Poel
2025-09-18
The BERTopic framework leverages transformer embeddings and hierarchical
clustering to extract latent topics from unstructured text corpora. While
effective, it often struggles with social media data, which tends to be noisy
and sparse, resulting in an excessive number of overlapping topics. Recent work
explored the use of large language models for end-to-end topic modelling.
However, these approaches typically require significant computational overhead,
limiting their scalability in big data contexts. In this work, we propose a
framework that combines BERTopic for topic generation with large language
models for topic reduction. The method first generates an initial set of topics
and constructs a representation for each. These representations are then
provided as input to the language model, which iteratively identifies and
merges semantically similar topics. We evaluate the approach across three
Twitter/X datasets and four different language models. Our method outperforms
the baseline approach in enhancing topic diversity and, in many cases,
coherence, with some sensitivity to dataset characteristics and initial
parameter selection.
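The generate-then-reduce loop can be sketched as follows, with the language-model call stubbed out; the merge criterion and prompt are illustrative, not the paper's:

def reduce_topics(topics: dict[int, list[str]], ask_llm) -> dict[int, list[str]]:
    """topics: topic id -> top words; ask_llm(a, b) -> True if two topics should merge."""
    merged = True
    while merged:                                     # repeat until no pair merges
        merged = False
        ids = list(topics)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if ask_llm(topics[a], topics[b]):     # LLM judges semantic similarity
                    topics[a] = list(dict.fromkeys(topics[a] + topics[b]))  # ordered union
                    del topics[b]
                    merged = True
                    break
            if merged:
                break
    return topics

toy = {0: ["flood", "rain"], 1: ["rain", "storm"], 2: ["election"]}
print(reduce_topics(toy, lambda a, b: len(set(a) & set(b)) > 0))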
LNE-Blocking An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models
Authors: Ruijie Hou, Yueyang Jiao, Hanxu Hu, Yingming Li, Wai Lam, Huajian Zhang, Hongyuan Lu
2025-09-18
The problem of data contamination is now almost inevitable during the
development of large language models (LLMs), whose training data commonly
incorporates evaluation benchmarks, even unintentionally. This makes it hard to
benchmark LLMs fairly. Instead of constructing contamination-free datasets
(which is quite hard), we propose a novel framework, \textbf{LNE-Blocking}, to
restore model performance prior to contamination on potentially leaked
datasets. Our framework consists of two components: contamination detection and
a disruption operation. For a given prompt, the framework first uses the
contamination detection method, \textbf{LNE}, to assess the extent of
contamination in the model. Based on this, it adjusts the intensity of the
disruption operation, \textbf{Blocking}, to elicit non-memorized responses from
the model. Our framework is the first to efficiently restore the model's greedy
decoding performance, and it performs strongly on
multiple datasets with potential leakage risks, and it consistently achieves
stable recovery results across different models and varying levels of data
contamination. We release the code at https://github.com/RuijieH/LNE-Blocking
to facilitate research.
Beyond Surface Alignment Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction
Authors: Yuanbo Xie, Yingjie Zhang, Tianyun Liu, Duohe Ma, Tingwen Liu
2025-09-18
Jailbreak attacks pose persistent threats to large language models (LLMs).
Current safety alignment methods have attempted to address these issues, but
they suffer from two significant limitations: insufficient safety alignment
depth and unrobust internal defense mechanisms. These limitations make them
vulnerable to adversarial attacks such as prefilling and refusal-direction
manipulation. We introduce DeepRefusal, a robust safety alignment framework
that overcomes these issues. DeepRefusal forces the model to dynamically
rebuild its refusal mechanisms from jailbreak states. This is achieved by
probabilistically ablating the refusal direction across layers and token depths
during fine-tuning. Our method not only defends against prefilling and
refusal-direction attacks but also demonstrates strong resilience against other
unseen jailbreak strategies. Extensive evaluations on four open-source LLM
families and six representative attacks show that DeepRefusal reduces attack
success rates by approximately 95%, while maintaining model capabilities with
minimal performance degradation.
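Probabilistic ablation of a refusal direction has a compact form: with some probability per layer, project the hidden state off the unit-norm refusal direction. A minimal sketch under that reading of the method:

import torch

def maybe_ablate_refusal(h: torch.Tensor, r: torch.Tensor, p: float) -> torch.Tensor:
    """h: (..., d) hidden states; r: (d,) refusal direction; p: ablation probability."""
    if torch.rand(()) < p:                      # probabilistic: only ablate sometimes
        r_hat = r / r.norm()
        coeff = (h @ r_hat).unsqueeze(-1)       # component of h along the refusal direction
        h = h - coeff * r_hat                   # remove it, simulating a jailbroken state
    return h

Fine-tuning against states perturbed this way forces the model to relearn refusal from deeper, redundant features rather than a single linear direction.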
MaRVIn A Cross-Layer Mixed-Precision RISC-V Framework for DNN Inference, from ISA Extension to Hardware Acceleration
Authors: Giorgos Armeniakos, Alexis Maras, Sotirios Xydis, Dimitrios Soudris
2025-09-18
The evolution of quantization and mixed-precision techniques has unlocked new
possibilities for enhancing the speed and energy efficiency of NNs. Several
recent studies indicate that adapting precision levels across different
parameters can maintain accuracy comparable to full-precision models while
significantly reducing computational demands. However, existing embedded
microprocessors lack sufficient architectural support for efficiently executing
mixed-precision NNs, both in terms of ISA extensions and hardware design,
resulting in inefficiencies such as excessive data packing/unpacking and
underutilized arithmetic units. In this work, we propose novel ISA extensions
and a micro-architecture implementation specifically designed to optimize
mixed-precision execution, enabling energy-efficient deep learning inference on
RISC-V architectures. We introduce MaRVIn, a cross-layer hardware-software
co-design framework that enhances power efficiency and performance through a
combination of hardware improvements, mixed-precision quantization, ISA-level
optimizations, and cycle-accurate emulation. At the hardware level, we enhance
the ALU with configurable mixed-precision arithmetic (2, 4, 8 bits) for
weights/activations and employ multi-pumping to reduce execution latency while
implementing soft SIMD for efficient 2-bit ops. At the software level, we
integrate a quantization-aware fine-tuning method to optimize model
quantization and a greedy-based DSE approach to efficiently search for
Pareto-optimal mixed-quantized models. Additionally, we incorporate voltage
scaling to boost
the power efficiency of our system. Our experimental evaluation over widely
used DNNs and datasets, such as CIFAR10 and ImageNet, demonstrates that our
framework can achieve, on average, 17.6x speedup for less than 1% accuracy loss
and outperforms the ISA-agnostic state-of-the-art RISC-V cores, delivering up
to 1.8 TOPs/W.
Stabilizing Information Flow Entropy Regularization for Safe and Interpretable Autonomous Driving Perception
Authors: Haobo Yang, Shiyan Zhang, Zhuoyi Yang, Jilong Guo, Jun Yang, Xinyu Zhang
2025-09-18
Deep perception networks in autonomous driving traditionally rely on
data-intensive training regimes and post-hoc anomaly detection, often
disregarding fundamental information-theoretic constraints governing stable
information processing. We reconceptualize deep neural encoders as hierarchical
compression chains that incrementally compress raw sensory inputs into
task-relevant latent features. Within this framework, we establish two
theoretically justified design principles for robust perception: (D1) smooth
variation of mutual information between consecutive layers, and (D2) monotonic
decay of latent entropy with network depth. Our analysis shows that, under
realistic architectural assumptions, particularly blocks comprising repeated
layers of similar capacity, enforcing smooth information flow (D1) naturally
encourages entropy decay (D2), thus ensuring stable compression. Guided by
these insights, we propose Eloss, a novel entropy-based regularizer designed as
a lightweight, plug-and-play training objective. Rather than marginal accuracy
improvements, this approach represents a conceptual shift: it unifies
information-theoretic stability with standard perception tasks, enabling
explicit, principled detection of anomalous sensor inputs through entropy
deviations. Experimental validation on large-scale 3D object detection
benchmarks (KITTI and nuScenes) demonstrates that incorporating Eloss
consistently achieves competitive or improved accuracy while dramatically
enhancing sensitivity to anomalies, amplifying distribution-shift signals by up
to two orders of magnitude. This stable information-compression perspective not
only improves interpretability but also establishes a solid theoretical
foundation for safer, more robust autonomous driving perception systems.
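One way to operationalize D1 and D2 as a plug-and-play regularizer is to penalize abrupt changes in a per-layer entropy proxy; the sketch below uses a Gaussian log-variance proxy for latent entropy, which is my assumption and may differ from the paper's estimator:

import torch

def eloss(layer_acts: list[torch.Tensor]) -> torch.Tensor:
    """layer_acts: activations per layer, each (batch, features); needs >= 3 layers."""
    # Gaussian proxy: entropy ~ 0.5 * log variance, up to constants
    ents = torch.stack([0.5 * a.var(dim=0).clamp_min(1e-8).log().mean()
                        for a in layer_acts])
    deltas = ents[1:] - ents[:-1]                        # layer-to-layer information change
    smooth = (deltas[1:] - deltas[:-1]).pow(2).mean()    # D1: smooth variation of flow
    decay = torch.relu(deltas).mean()                    # D2: penalize entropy increases
    return smooth + decay

Large values of this term on a new input then serve as an anomaly signal, since distribution-shifted inputs disturb the smooth entropy profile.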
A1 Asynchronous Test-Time Scaling via Conformal Prediction
Authors: Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng, Chenyang Zhao, Hui Shen, Alexander Hanbo Li, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Lingpeng Kong, Ngai Wong
2025-09-18
Large language models (LLMs) benefit from test-time scaling, but existing
methods face significant challenges, including severe synchronization overhead,
memory bottlenecks, and latency, especially during speculative decoding with
long reasoning chains. We introduce A1 (Asynchronous Test-Time Scaling), a
statistically guaranteed adaptive inference framework that addresses these
challenges. A1 refines arithmetic intensity to identify synchronization as the
dominant bottleneck, proposes an online calibration strategy to enable
asynchronous inference, and designs a three-stage rejection sampling pipeline
that supports both sequential and parallel scaling. Through experiments on the
MATH, AMC23, AIME24, and AIME25 datasets, across various draft-target model
families, we demonstrate that A1 achieves a remarkable 56.7x speedup in
test-time scaling and a 4.14x improvement in throughput, all while maintaining
accurate rejection-rate control, reducing latency and memory overhead, and
incurring no accuracy loss compared to using target model scaling alone. These
results position A1 as an efficient and principled solution for scalable LLM
inference.
We have released the code at
https://github.com/menik1126/asynchronous-test-time-scaling.
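The statistical guarantee rests on conformal calibration; the generic split-conformal threshold below is the standard recipe, not A1's exact pipeline: pick a finite-sample-corrected quantile of calibration nonconformity scores and reject drafts that exceed it.

import numpy as np

def conformal_threshold(cal_scores: np.ndarray, alpha: float) -> float:
    """Split conformal: with n calibration scores, take the ceil((n+1)(1-alpha))/n quantile."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1.0 - alpha)) / n, 1.0)
    return float(np.quantile(cal_scores, q, method="higher"))

scores = np.random.default_rng(0).random(500)
tau = conformal_threshold(scores, alpha=0.1)
print(tau)   # accept a draft token only if its nonconformity score is <= tau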
Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning
Authors: Lei Wang, Jieming Bian, Letian Zhang, Jie Xu
2025-09-18
Large Language Models (LLMs) have demonstrated impressive capabilities across
various tasks, but fine-tuning them for domain-specific applications often
requires substantial domain-specific data that may be distributed across
multiple organizations. Federated Learning (FL) offers a privacy-preserving
solution, but faces challenges with computational constraints when applied to
LLMs. Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient
fine-tuning approach, though a single LoRA module often struggles with
heterogeneous data across diverse domains. This paper addresses two critical
challenges in federated LoRA fine-tuning: 1. determining the optimal number and
allocation of LoRA experts across heterogeneous clients, and 2. enabling
clients to selectively utilize these experts based on their specific data
characteristics. We propose FedLEASE (Federated adaptive LoRA Expert Allocation
and SElection), a novel framework that adaptively clusters clients based on
representation similarity to allocate and train domain-specific LoRA experts.
It also introduces an adaptive Mixture-of-Experts selection mechanism that allows
each client to select the optimal number of utilized experts. Our extensive
experiments on diverse benchmark datasets demonstrate that FedLEASE
significantly outperforms existing federated fine-tuning approaches in
heterogeneous client settings while maintaining communication efficiency.
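Representation-similarity clustering for expert allocation can be sketched with k-means over per-client representation vectors; the cluster count and features below are illustrative, and the paper's procedure for choosing them is not reproduced here:

import numpy as np
from sklearn.cluster import KMeans

def allocate_experts(client_reprs: np.ndarray, n_experts: int) -> dict[int, list[int]]:
    """client_reprs: (n_clients, d) mean hidden representations; returns expert -> clients."""
    labels = KMeans(n_clusters=n_experts, n_init=10, random_state=0).fit_predict(client_reprs)
    groups: dict[int, list[int]] = {}
    for client, expert in enumerate(labels):
        groups.setdefault(int(expert), []).append(client)
    return groups                                  # each cluster trains its own LoRA expert

reprs = np.random.default_rng(0).normal(size=(12, 16))
print(allocate_experts(reprs, n_experts=3))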
Communication Efficient Split Learning of ViTs with Attention-based Double Compression
Authors: Federico Alvetreti, Jary Pomponi, Paolo Di Lorenzo, Simone Scardapane
2025-09-18
This paper proposes a novel communication-efficient Split Learning (SL)
framework, named Attention-based Double Compression (ADC), which reduces the
communication overhead required for transmitting intermediate Vision
Transformer activations during the SL training process. ADC incorporates two
parallel compression strategies. The first merges samples' activations that
are similar, based on the average attention score calculated in the last client
layer; this strategy is class-agnostic, meaning that it can merge samples of
different classes without losing generalization ability or degrading final
results. The second strategy follows the first and discards the least
meaningful tokens, further reducing the communication cost. Combining these
strategies not only sends less data during the forward pass, but also
naturally compresses the gradients, allowing the whole model to be trained
without additional tuning or gradient approximations. Simulation results
demonstrate that Attention-based Double Compression outperforms
state-of-the-art SL frameworks by significantly reducing communication
overheads while maintaining high accuracy.
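The second strategy, discarding the least meaningful tokens, reduces to keeping the tokens that receive the most attention on average; a minimal sketch that omits batching and the sample-merging strategy:

import torch

def prune_tokens(acts: torch.Tensor, attn: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """acts: (tokens, d) client-side activations; attn: (heads, tokens, tokens) attention."""
    score = attn.mean(dim=0).mean(dim=0)          # average attention each token receives
    k = max(1, int(keep_ratio * acts.shape[0]))
    keep = score.topk(k).indices.sort().values    # retain the k most-attended tokens, in order
    return acts[keep]                             # only these activations are transmitted

x = torch.randn(196, 384)
a = torch.softmax(torch.randn(6, 196, 196), dim=-1)
print(prune_tokens(x, a, keep_ratio=0.5).shape)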
Value-Guided KV Compression for LLMs via Approximated CUR Decomposition
Authors: Ayan Sengupta, Siddhant Chaudhary, Tanmoy Chakraborty
2025-09-18
Key-value (KV) cache compression has emerged as a critical technique for
reducing the memory and latency overhead of autoregressive language models
during inference. Prior approaches predominantly rely on query-key attention
scores to rank and evict cached tokens, assuming that attention intensity
correlates with semantic importance. However, this heuristic overlooks the
contribution of value vectors, which directly influence the attention output.
In this paper, we propose CurDKV, a novel, value-centric KV compression method
that selects keys and values based on leverage scores computed from CUR matrix
decomposition. Our approach approximates the dominant subspace of the attention
output, ensuring that the retained tokens best preserve the model's predictive
behavior. Theoretically, we show that attention score approximation does not
guarantee output preservation, and demonstrate that CUR-based selection
minimizes end-to-end attention reconstruction loss. Empirically, CurDKV
achieves up to 9.6% higher accuracy than state-of-the-art methods like SnapKV
and ChunkKV under aggressive compression budgets on LLaMA and Mistral, while
maintaining compatibility with FlashAttention and Grouped Query Attention. In
addition to improved accuracy, CurDKV reduces generation latency by up to 40%
at high compression, offering a practical speed-accuracy tradeoff.
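Leverage-score selection over the value matrix can be sketched with a truncated SVD: a token's leverage is the squared row norm of the top-k left singular subspace, and high-leverage tokens are kept. This is a simplified reading of the CUR step, not the paper's full algorithm:

import torch

def leverage_keep(V: torch.Tensor, k: int, budget: int) -> torch.Tensor:
    """V: (tokens, d) value vectors; k: subspace rank; budget: tokens to keep."""
    U, S, _ = torch.linalg.svd(V, full_matrices=False)
    lev = U[:, :k].pow(2).sum(dim=1)                 # leverage score of each token (row of V)
    keep = lev.topk(budget).indices.sort().values    # evict all other entries from the KV cache
    return keep

V = torch.randn(1024, 128)
idx = leverage_keep(V, k=16, budget=256)
print(idx.shape)   # indices of tokens whose K/V entries survive compression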