2025-09-26
Table of Contents
- Quantized Visual Geometry Grounded Transformer
- Nova Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization
- Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
- Semantic Edge-Cloud Communication for Real-Time Urban Traffic Surveillance with ViT and LLMs over Mobile Networks
- Explaining Fine Tuned LLMs via Counterfactuals A Knowledge Graph Driven Framework
- Tree Search for LLM Agent Reinforcement Learning
- A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
- Who's Laughing Now? An Overview of Computational Humour Generation and Explanation
- GRPO is Secretly a Process Reward Model
- CAD-Tokenizer Towards Text-based CAD Prototyping via Modality-Specific Tokenization
- UniSS Unified Expressive Speech-to-Speech Translation with Your Voice
- Acoustic-based Gender Differentiation in Speech-aware Language Models
- TyphoonMLA A Mixed Naive-Absorb MLA Kernel For Shared Prefix
- KeyWorld Key Frame Reasoning Enables Effective and Efficient World Models
- Binary Autoencoder for Mechanistic Interpretability of Large Language Models
- Toward Robust and Efficient ML-Based GPU Caching for Modern Inference
- MemLens Uncovering Memorization in LLMs with Activation Trajectories
- Punching Above Precision Small Quantized Model Distillation with Learnable Regularizer
- SPADE Structured Pruning and Adaptive Distillation for Efficient LLM-TTS
- Towards Atoms of Large Language Models
- Recidivism and Peer Influence with LLM Text Embeddings in Low Security Correctional Facilities
- Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning
- CHOIR A Chatbot-mediated Organizational Memory Leveraging Communication in University Research Labs
- MARS toward more efficient multi-agent collaboration for LLM reasoning
- Shared Neural Space Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision
- Seedream 4.0 Toward Next-generation Multimodal Image Generation
- Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing
- SIM-CoT Supervised Implicit Chain-of-Thought
- Multimodal Representation-disentangled Information Bottleneck for Multimodal Recommendation
- Q-Palette Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
- From Text to Talk Audio-Language Model Needs Non-Autoregressive Joint Training
- Projective Kolmogorov Arnold Neural Networks (P-KANs) Entropy-Driven Functional Space Discovery for Interpretable Machine Learning
- Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks
- MeshMosaic Scaling Artist Mesh Generation via Local-to-Global Assembly
- RAD Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis
- FastEagle Cascaded Drafting for Accelerating Speculative Decoding
- Exploration with Foundation Models Capabilities, Limitations, and Hybrid Approaches
- Future Policy Aware Preference Learning for Mathematical Reasoning
- Structuring Collective Action with LLM-Guided Evolution From Ill-Structured Problems to Executable Heuristics
- CollaPipe Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks
- BurstEngine an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens
- MMedFD A Real-world Healthcare Benchmark for Multi-turn Full-Duplex Automatic Speech Recognition
- Gyges Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference
- Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling
- Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation
- Transformer Modeling for Both Scalability and Performance in Multivariate Time Series
- CompLLM Compression for Long Context Q&A
- Online Process Reward Learning for Agentic Reinforcement Learning
- Reading Images Like Texts Sequential Image Understanding in Vision-Language Models
- BiGraspFormer End-to-End Bimanual Grasp Transformer
- Clapping Removing Per-sample Storage for Pipeline Parallel Distributed Optimization with Communication Compression
- HD-PPT Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS
- Benchmarking PDF Accessibility Evaluation A Dataset and Framework for Assessing Automated and LLM-Based Approaches for Accessibility Testing
- Confidential LLM Inference Performance and Cost Across CPU and GPU TEEs
- FlashGMM Fast Gaussian Mixture Entropy Model for Learned Image Compression
- Bi-VLM Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models
- HyperCool Reducing Encoding Cost in Overfitted Codecs with Hypernetworks
- PIE Perception and Interaction Enhanced End-to-End Motion Planning for Autonomous Driving
- FlexSED Towards Open-Vocabulary Sound Event Detection
- OmniFed A Modular Framework for Configurable Federated Learning from Edge to HPC
- LLMZ+ Contextual Prompt Whitelist Principles for Agentic LLMs
- Individualized non-uniform quantization for vector search
- LAWCAT Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling
- NormGenesis Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery
- Efficient Particle Acceleration in 2.5-Dimensional, Hybrid-Kinetic Simulations of Decaying, Supersonic, Plasma Turbulence
- Chiplet-Based RISC-V SoC with Modular AI Acceleration
- Speculate Deep and Accurate Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding
- Exploiting Tree Structure for Credit Assignment in RL Training of LLMs
- Evaluating Large Language Models for Detecting Antisemitism
- Spiffy Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
- GraDeT-HTR A Resource-Efficient Bengali Handwritten Text Recognition System utilizing Grapheme-based Tokenizer and Decoder-only Transformer
- TempSamp-R1 Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
- RadEval A framework for radiology text evaluation
- Through the Lens of Human-Human Collaboration A Configurable Research Platform for Exploring Human-Agent Collaboration
- Visual Detector Compression via Location-Aware Discriminant Analysis
- Benchmarking Humans and Machines on Complex Multilingual Speech Understanding Tasks
- Expert-as-a-Service Towards Efficient, Scalable, and Robust Large-scale MoE Serving
- Towards General Computer Control with Hierarchical Agents and Multi-Level Action Spaces
- ConfClip Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs
- When TableQA Meets Noise A Dual Denoising Framework for Complex Questions and Large-scale Tables
- Mechanistic Interpretability with SAEs Probing Religion, Violence, and Geography in Large Language Models
- Evict3R Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
- Bilateral Distribution Compression Reducing Both Data Size and Dimensionality
- Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs
- 4DGCPro Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
- CorefInst Leveraging LLMs for Multilingual Coreference Resolution
- Privacy in Action Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents
- Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks
- QWHA Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
- DINVMark A Deep Invertible Network for Video Watermarking
- Interpreting vision transformers via residual replacement model
- EpiCache Episodic KV Cache Management for Long Conversational Question Answering
- Multimodal Health Risk Prediction System for Chronic Diseases via Vision-Language Fusion and Large Language Models
- Asteria Semantic-Aware Cross-Region Caching for Agentic LLM Tool Access
- Cronus Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill
- Compact representation of transonic airfoil buffet flows with observable-augmented machine learning
- Rational Multi-Modal Transformers for TCR-pMHC Prediction
- Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection
- DeepASA An Object-Oriented One-for-All Network for Auditory Scene Analysis
- MoEs Are Stronger than You Think Hyper-Parallel Inference Scaling with RoE
- SignalLLM A General-Purpose LLM Agent Framework for Automated Signal Processing
- MAST Multi-Agent Spatial Transformer for Learning to Collaborate
- Attention Consistency for LLMs Explanation
- Analyzing Memory Effects in Large Language Models through the lens of Cognitive Psychology
- SnipSnap A Joint Compression Format and Dataflow Co-Optimization Framework for Efficient Sparse LLM Accelerator Design
- The Transfer Neurons Hypothesis An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLMs
- PTQTP Post-Training Quantization to Trit-Planes for Large Language Models
- LLM-Assisted Semantic Guidance for Sparsely Annotated Remote Sensing Object Detection
- Catching the Details Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
- SwarmChat An LLM-Based, Context-Aware Multimodal Interaction System for Robotic Swarms
- ShadowServe Interference-Free KV Cache Fetching for Distributed Prefix Caching
- ISCS Parameter-Guided Channel Ordering and Grouping for Learned Image Compression
- The Even Sheen of AI Kitsch, LLMs, and Homogeneity
- Towards Transparent and Incentive-Compatible Collaboration in Decentralized LLM Multi-Agent Systems A Blockchain-Driven Approach
- Decoding Uncertainty The Impact of Decoding Strategies for Uncertainty Estimation in Large Language Models
- EG-MLA Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs
- -Orthogonality Regularization for Compatible Representation Learning
- Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
- PruneCD Contrasting Pruned Self Model to Improve Decoding Factuality
- Data-Driven Reduced-Order Modeling of Phase Mixing Dynamics from Particle Kinetic Simulation
- Assessing Classical Machine Learning and Transformer-based Approaches for Detecting AI-Generated Research Text
- FG-Attn Leveraging Fine-Grained Sparsity In Diffusion Transformers
- orb-QFL Orbital Quantum Federated Learning
- GRIL Knowledge Graph Retrieval-Integrated Learning with Large Language Models
- Synergies between Federated Foundation Models and Smart Power Grids
- Shift Parallelism Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
- LightCode Compiling LLM Inference for Photonic-Electronic Systems
- SENSE-7 Taxonomy and Dataset for Measuring User Perceptions of Empathy in Sustained Human-AI Conversations
- RephQA Evaluating Readability of Large Language Models in Public Health Question Answering
- Improving Deep Tabular Learning
- The Role of High-Performance GPU Resources in Large Language Model Based Radiology Imaging Diagnosis
- MANZANO A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
- Agentic Aerial Cinematography From Dialogue Cues to Cinematic Trajectories
- It Depends Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge
- Personalized Federated Learning with Heat-Kernel Enhanced Tensorized Multi-View Clustering
- SegDINO3D 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features
- Think, Verbalize, then Speak Bridging Complex Thoughts and Comprehensible Speech
- BEFT Bias-Efficient Fine-Tuning of Language Models
- Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs
- FedHK-MVFC Federated Heat Kernel Multi-View Clustering
- UniGist Towards General and Hardware-aligned Sequence-level Long Context Compression
- RMT-KD Random Matrix Theoretic Causal Knowledge Distillation
- VOX-KRIKRI Unifying Speech and Language through Continuous Fusion
- Multilingual LLM Prompting Strategies for Medical English-Vietnamese Machine Translation
- Interplay Between Belief Propagation and Transformer Differential-Attention Message Passing Transformer
- Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models
- Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding
- DNA-DetectLLM Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm
- Optimization techniques for SQL+ML queries A performance analysis of real-time feature computation in OpenMLDB
- LLM Cache Bandit Revisited Addressing Query Heterogeneity for Cost-Effective LLM Inference
- A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication
- CAGE Continuity-Aware edGE Network Unlocks Robust Floorplan Reconstruction
- IMPQ Interaction-Aware Layerwise Mixed Precision Quantization for LLMs
- LLM-Assisted Topic Reduction for BERTopic on Social Media Data
- LNE-Blocking An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models
- Beyond Surface Alignment Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction
- MaRVIn A Cross-Layer Mixed-Precision RISC-V Framework for DNN Inference, from ISA Extension to Hardware Acceleration
- Stabilizing Information Flow Entropy Regularization for Safe and Interpretable Autonomous Driving Perception
- A1 Asynchronous Test-Time Scaling via Conformal Prediction
- Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning
- Communication Efficient Split Learning of ViTs with Attention-based Double Compression
- Value-Guided KV Compression for LLMs via Approximated CUR Decomposition
Quantized Visual Geometry Grounded Transformer
Authors: Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
2025-09-25
Learning-based 3D reconstruction models, represented by Visual Geometry
Grounded Transformers (VGGTs), have made remarkable progress with the use of
large-scale Transformers. Their prohibitive computational and memory costs
severely hinder real-world deployment. Post-Training Quantization (PTQ) has
become a common practice for compressing and accelerating models. However, we
empirically observe that PTQ faces unique obstacles when compressing
billion-scale VGGTs: the data-independent special tokens induce heavy-tailed
activation distributions, while the multi-view nature of 3D data makes
calibration sample selection highly unstable. This paper proposes the first
Quantization framework for VGGTs, namely QuantVGGT. This mainly relies on two
technical contributions: First, we introduce Dual-Smoothed Fine-Grained
Quantization, which integrates pre-global Hadamard rotation and post-local
channel smoothing to mitigate heavy-tailed distributions and inter-channel
variance robustly. Second, we design Noise-Filtered Diverse Sampling, which
filters outliers via deep-layer statistics and constructs frame-aware diverse
calibration clusters to ensure stable quantization ranges. Comprehensive
experiments demonstrate that QuantVGGT achieves the state-of-the-art results
across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a clear margin. We highlight
that our 4-bit QuantVGGT can deliver a 3.7x memory reduction and 2.5x acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98% of its full-precision counterpart. This
demonstrates the vast advantages and practicality of QuantVGGT in
resource-constrained scenarios. Our code is released at
https://github.com/wlfeng0509/QuantVGGT.
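As a companion illustration, here is a minimal NumPy sketch of the general recipe the abstract describes: a global Hadamard rotation to spread heavy-tailed outliers, followed by per-channel smoothing before uniform quantization. The function names and the 4-bit setting are illustrative assumptions, not QuantVGGT's implementation.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)          # orthonormal rotation

def quantize_dual_smoothed(W, bits=4):
    """Hypothetical sketch: rotate, smooth per channel, uniform-quantize."""
    H = hadamard(W.shape[1])
    W_rot = W @ H                               # rotation spreads outliers
    s = np.abs(W_rot).max(axis=0) + 1e-8        # per-channel smoothing scale
    W_smooth = W_rot / s
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W_smooth).max() / qmax
    Q = np.round(W_smooth / scale).clip(-qmax - 1, qmax)
    W_hat = (Q * scale) * s @ H.T               # dequantize to measure error
    return Q.astype(np.int8), W_hat

W = np.random.standard_t(df=3, size=(64, 128))  # heavy-tailed toy weights
Q, W_hat = quantize_dual_smoothed(W)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```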
Nova Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization
Authors: Yuhang Xu, Shengzhong Liu, Dong Zhang, Bingheng Yan, Fan Wu, Guihai Chen
2025-09-25
This paper presents Nova, a real-time scheduling framework for serving agentic vision-language models (VLMs) on a single GPU with balanced per-request latency and overall request throughput. Our design begins by enabling effective pipelining across the vision encoding, prefill, and decoding stages
of VLMs, by exploiting their heterogeneous resource demands during execution
and incorporating elastic GPU spatial partitioning among stages to maximally
utilize the compute and memory resources. Building on this, we introduce a
real-time scheduling algorithm that adaptively calibrates resource allocation
among stages based on a Pareto-optimal analysis of the latency-throughput
trade-off, allowing the system to sustain responsiveness and resource
efficiency under dynamic request loads. To further alleviate GPU memory
pressure, we design a lightweight weight offloading strategy for vision
encoders that preserves inference efficiency with minimized memory overhead.
Extensive evaluations on both synthetic and real-world agent workloads
demonstrate that Nova consistently outperforms the state-of-the-art baselines,
improving the maximum latency by up to 23.3%, while keeping competitive
throughput.
Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
Authors: Shiju Wang, Yujie Wang, Ao Sun, Fangcheng Fu, Zijian Zhu, Bin Cui, Xu Han, Kaisheng Ma
2025-09-25
Long context training is crucial for LLMs' context extension. Existing schemes, such as sequence parallelism, incur substantial communication overhead. Pipeline parallelism (PP) reduces this cost, but its effectiveness
hinges on partitioning granularity. Batch-level PP, which divides input samples, exhibits high memory consumption in long-context scenarios, whereas token-level PP, which splits sequences into slices, alleviates memory overhead but may incur
hardware under-utilization. This trade-off motivates adaptively selecting PP
granularity to match resource and workload characteristics. Moreover, the sequence length distribution of real-world datasets is skewed, posing a challenge to PP's workload balance and efficient scheduling. Current static PP
scheduling methods overlook the variance of sequence length, leading to
suboptimal performance. In this paper, we propose Elastic Pipeline Parallelism
(EPP) that orchestrates token-level PP and batch-level PP to adapt to resource
and workload heterogeneity. We build InfiniPipe, a distributed training system
that unleashes the potential of EPP via (1) a resource-aware and
workload-balanced sequence processor that splits long sequences and packs short
ones; and (2) a co-optimization methodology that jointly optimizes pipeline
schedule and gradient checkpointing via a mechanism named stage-aware
chunk-level adaptive checkpointing. Comprehensive experiments demonstrate that
InfiniPipe achieves a 1.69x speedup over state-of-the-art systems.
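A toy sketch of the granularity trade-off EPP navigates: slice long sequences at token level, pack short ones at batch level. The 4096-token chunk size and the packing heuristic are assumptions for illustration, not InfiniPipe's actual scheduler.

```python
def schedule_pp_units(seq_lens, chunk_tokens=4096):
    """Hypothetical sketch of elastic PP granularity selection:
    long sequences are sliced token-level; short ones are packed batch-level."""
    units, pack = [], []
    for L in sorted(seq_lens, reverse=True):
        if L > chunk_tokens:                      # token-level PP: slice
            units += [("slice", min(chunk_tokens, L - i))
                      for i in range(0, L, chunk_tokens)]
        else:                                     # batch-level PP: pack
            pack.append(L)
            if sum(pack) >= chunk_tokens:
                units.append(("pack", sum(pack)))
                pack = []
    if pack:
        units.append(("pack", sum(pack)))
    return units

print(schedule_pp_units([9000, 3000, 2500, 1200, 800, 300]))
```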
Semantic Edge-Cloud Communication for Real-Time Urban Traffic Surveillance with ViT and LLMs over Mobile Networks
Authors: Murat Arda Onsu, Poonam Lohan, Burak Kantarci, Aisha Syed, Matthew Andrews, Sean Kennedy
2025-09-25
Real-time urban traffic surveillance is vital for Intelligent Transportation
Systems (ITS) to ensure road safety, optimize traffic flow, track vehicle
trajectories, and prevent collisions in smart cities. Deploying edge cameras
across urban environments is a standard practice for monitoring road
conditions. However, integrating these with intelligent models requires a
robust understanding of dynamic traffic scenarios and a responsive interface
for user interaction. Although multimodal Large Language Models (LLMs) can
interpret traffic images and generate informative responses, their deployment
on edge devices is infeasible due to high computational demands. Therefore,
inference must occur on the cloud, necessitating visual data transmission from
edge to cloud, a process hindered by limited bandwidth, leading to potential
delays that compromise real-time performance. To address this challenge, we
propose a semantic communication framework that significantly reduces
transmission overhead. Our method involves detecting Regions of Interest (RoIs)
using YOLOv11, cropping relevant image segments, and converting them into
compact embedding vectors using a Vision Transformer (ViT). These embeddings
are then transmitted to the cloud, where an image decoder reconstructs the cropped images. The reconstructed images are processed by a multimodal LLM to
generate traffic condition descriptions. This approach achieves a 99.9%
reduction in data transmission size while maintaining an LLM response accuracy
of 89% for reconstructed cropped images, compared to 93% accuracy with original
cropped images. Our results demonstrate the efficiency and practicality of ViT- and LLM-assisted edge-cloud semantic communication for real-time traffic surveillance.
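A shape-level sketch of the bandwidth saving such a pipeline targets. The detector and ViT below are random stand-ins, not YOLOv11 or the paper's encoder, but the byte accounting shows why sending one compact embedding per RoI instead of a raw frame yields a ~99.9% reduction.

```python
import numpy as np

def detect_rois(frame):
    # stand-in for a YOLO-style detector: returns (x, y, w, h) boxes
    return [(40, 60, 128, 96)]

def vit_embed(crop, dim=768):
    # stand-in for a ViT encoder producing one compact embedding per crop
    rng = np.random.default_rng(0)
    return rng.standard_normal(dim).astype(np.float16)

frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # raw edge-camera frame
crops = [frame[y:y+h, x:x+w] for (x, y, w, h) in detect_rois(frame)]
embs = [vit_embed(c) for c in crops]

raw_bytes = frame.nbytes
sent_bytes = sum(e.nbytes for e in embs)
print(f"transmitted {sent_bytes} B instead of {raw_bytes} B "
      f"({100 * (1 - sent_bytes / raw_bytes):.2f}% reduction)")
```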
Explaining Fine Tuned LLMs via Counterfactuals A Knowledge Graph Driven Framework
Authors: Yucheng Wang, Ziyang Chen, Md Faisal Kabir
2025-09-25
The widespread adoption of Low-Rank Adaptation (LoRA) has enabled large
language models (LLMs) to acquire domain-specific knowledge with remarkable
efficiency. However, understanding how such a fine-tuning mechanism alters a
model's structural reasoning and semantic behavior remains an open challenge.
This work introduces a novel framework that explains fine-tuned LLMs via
counterfactuals grounded in knowledge graphs. Specifically, we construct
BioToolKG, a domain-specific heterogeneous knowledge graph of bioinformatics tools, and design a counterfactual-based fine-tuned LLM explainer (CFFTLLM Explainer) that learns soft masks over graph nodes and edges to
generate minimal structural perturbations that induce maximum semantic
divergence. Our method jointly optimizes structural sparsity and semantic divergence while enforcing interpretability-preserving constraints such as
entropy regularization and edge smoothness. We apply this framework to a
fine-tuned LLaMA-based LLM and reveal that counterfactual masking exposes the
model's structural dependencies and aligns with LoRA-induced parameter shifts.
This work provides new insights into the internal mechanisms of fine-tuned LLMs
and highlights counterfactual graphs as a potential tool for interpretable AI.
Tree Search for LLM Agent Reinforcement Learning
Authors: Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
2025-09-25
Recent advances in reinforcement learning (RL) have significantly enhanced
the agentic capabilities of large language models (LLMs). In long-term and
multi-turn agent tasks, existing approaches driven solely by outcome rewards
often suffer from the problem of sparse supervision. To address the challenge,
we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped
agent RL method based on tree search, where each tree node represents the
complete agent interaction step. By sharing common prefixes, the tree search
sampling increases the number of rollouts achievable within a fixed budget of
tokens or tool calls. Moreover, we find that the tree-structured trajectory
naturally allows the construction of step-wise process supervised signals even
using only the outcome reward. Based on this, Tree-GRPO estimates the grouped
relative advantages both on intra-tree and inter-tree levels. Through
theoretical analysis, we demonstrate that the objective of intra-tree level
group relative policy optimization is equivalent to that of step-level direct
preference learning. Experiments across 11 datasets and 3 types of QA tasks
demonstrate the superiority of the proposed tree-based RL over the chain-based
RL method.
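A minimal sketch of the intra-tree grouping idea: siblings expanded from the same node share its prefix and form a group, and each trajectory's advantage is its outcome reward standardized within that group. The tree and rewards are made up for illustration.

```python
from statistics import mean, pstdev

# Toy rollout tree: siblings share the prefix of their parent node, so a
# fixed token budget yields more comparable trajectories.
children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
outcome = {"a1": 1.0, "a2": 0.0, "b1": 1.0, "b2": 1.0}

def intra_tree_advantages(node):
    """Group-relative advantage among siblings sharing a common prefix."""
    kids = children[node]
    r = [outcome[k] for k in kids]
    mu, sd = mean(r), pstdev(r) or 1.0   # guard against zero variance
    return {k: (outcome[k] - mu) / sd for k in kids}

for node in ("a", "b"):
    print(node, intra_tree_advantages(node))
```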
A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
Authors: Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen
2025-09-25
Multi-Hop Question Answering (MHQA) requires integrating dispersed,
interdependent evidence through sequential reasoning under noise. This task is
challenging for LLMs as they have a finite per-pass output capacity, beyond
which the integration of task-relevant evidence proves unreliable.
Consequently, the single-pass reasoning paradigm is inherently vulnerable to
this capacity overflow. To formalize this bottleneck, our analysis establishes
a Fano-style accuracy upper bound, defining a theoretical performance ceiling
for single-pass LLMs. This bound reveals that accuracy inevitably collapses
once task complexity exceeds model capacity, providing general principles for
capacity-aware representation and structuring of MHQA in LLMs. Building on
these principles, we introduce a proof-of-concept multi-call framework for
MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware
task decomposition with active pruning of prior reasoning traces, keeping the
information load within the single-pass limit. It further achieves robustness
by a dependency-explicit workflow that enables precise control over the
reasoning path. We construct a stringent and noise-rich benchmark to validate
our theory and framework. Experimental results show that model behavior aligns
with our predicted capacity curves while InfoQA achieves consistent performance
improvements. We hope our work inspires more multi-step reasoning methods. Code: https://github.com/KaiyangWan/InfoQA.
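For orientation, the classical Fano inequality from which such accuracy ceilings are usually derived (the paper's exact bound may differ in form):

```latex
% Classical Fano inequality for an answer variable Y over |Y| choices,
% given evidence X (illustrative; the paper's bound may differ):
H(P_e) + P_e \log\bigl(|\mathcal{Y}| - 1\bigr) \;\ge\; H(Y \mid X)
% If Y is uniform, H(Y \mid X) = \log|\mathcal{Y}| - I(X;Y), giving the ceiling
\mathrm{Acc} \;=\; 1 - P_e \;\le\; \frac{I(X;Y) + 1}{\log|\mathcal{Y}|}
```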
Who's Laughing Now? An Overview of Computational Humour Generation and Explanation
Authors: Tyler Loakman, William Thorne, Chenghua Lin
2025-09-25
The creation and perception of humour is a fundamental human trait,
positioning its computational understanding as one of the most challenging
tasks in natural language processing (NLP). As an abstract, creative, and
frequently context-dependent construct, humour requires extensive reasoning to
understand and create, making it a pertinent task for assessing the
common-sense knowledge and reasoning abilities of modern large language models
(LLMs). In this work, we survey the landscape of computational humour as it
pertains to the generative tasks of creation and explanation. We observe that,
despite the task of understanding humour bearing all the hallmarks of a
foundational NLP task, work on generating and explaining humour beyond puns
remains sparse, while state-of-the-art models continue to fall short of human
capabilities. We bookend our literature survey by motivating the importance of
computational humour processing as a subdiscipline of NLP and presenting an
extensive discussion of future directions for research in the area that takes
into account the subjective and ethically ambiguous nature of humour.
GRPO is Secretly a Process Reward Model
Authors: Michael Sullivan
2025-09-25
We prove theoretically that the GRPO RL algorithm induces a non-trivial
process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that
these assumptions are met under real-world conditions: GRPO does in fact induce
a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a
flaw in the GRPO objective: non-uniformly distributed process steps hinder both
exploration and exploitation (under different conditions). We propose a simple
modification to the algorithm to mitigate this defect (-GRPO), and show that LLMs trained with -GRPO achieve higher validation accuracy and performance on downstream reasoning tasks, and reach peak performance more rapidly, than LLMs trained with standard GRPO. Our results call into question
the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is
possible to instead leverage the hidden, built-in PRM structure within the
vanilla GRPO algorithm to boost model performance with a negligible impact on
training time and cost.
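An illustrative reading of the GRPO-as-a-PRM framing (our toy construction, not the paper's derivation): give each process step the average group-relative advantage of the completions it appears in, so steps unique to high-reward completions accumulate positive credit.

```python
from statistics import mean, pstdev

# Hypothetical group of completions that share some process steps.
completions = {
    "c1": (["plan", "solve", "check"], 1.0),
    "c2": (["plan", "solve"],          0.0),
    "c3": (["plan", "guess"],          0.0),
}
rewards = [r for _, r in completions.values()]
mu, sd = mean(rewards), pstdev(rewards) or 1.0
adv = {c: (r - mu) / sd for c, (_, r) in completions.items()}

# Implicit per-step credit: average the group-relative advantage over
# every completion containing a step -- a PRM-like signal.
step_credit = {}
for c, (steps, _) in completions.items():
    for s in steps:
        step_credit.setdefault(s, []).append(adv[c])
print({s: round(mean(a), 3) for s, a in step_credit.items()})
```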
CAD-Tokenizer Towards Text-based CAD Prototyping via Modality-Specific Tokenization
Authors: Ruiyu Wang, Shizhao Sun, Weijian Ma, Jiang Bian
2025-09-25
Computer-Aided Design (CAD) is a foundational component of industrial
prototyping, where models are defined not by raw coordinates but by
construction sequences such as sketches and extrusions. This sequential
structure enables both efficient prototype initialization and subsequent
editing. Text-guided CAD prototyping, which unifies Text-to-CAD generation and
CAD editing, has the potential to streamline the entire design pipeline.
However, prior work has not explored this setting, largely because standard
large language model (LLM) tokenizers decompose CAD sequences into
natural-language word pieces, failing to capture primitive-level CAD semantics
and hindering attention modules from modeling geometric structure. We
conjecture that a multimodal tokenization strategy, aligned with CAD's
primitive and structural nature, can provide more effective representations. To
this end, we propose CAD-Tokenizer, a framework that represents CAD data with
modality-specific tokens using a sequence-based VQ-VAE with primitive-level
pooling and constrained decoding. This design produces compact, primitive-aware
representations that align with CAD's structural nature. Applied to unified
text-guided CAD prototyping, CAD-Tokenizer significantly improves instruction
following and generation quality, achieving better quantitative and qualitative
performance over both general-purpose LLMs and task-specific baselines.
UniSS Unified Expressive Speech-to-Speech Translation with Your Voice
Authors: Sitong Cheng, Weizhen Bian, Xinsheng Wang, Ruibin Yuan, Jianyi Chen, Shunshun Yin, Yike Guo, Wei Xue
2025-09-25
The ultimate goal of expressive speech-to-speech translation (S2ST) is to
accurately translate spoken content while preserving the speaker identity and
emotional style. However, progress in this field is largely hindered by three
key challenges: the scarcity of paired speech data that retains expressive
styles, the complexity of multi-stage processing pipelines, and the limited
transfer of translation capabilities from large language models (LLMs). In this
work, we address these challenges by introducing UniSS, a novel single-stage
framework for expressive S2ST. Our approach features carefully designed speech
semantic and style modeling, enabling seamless integration with existing
text-based LLM frameworks to develop a unified text-speech language model. To
transfer translation capabilities from text to speech, we propose a cross-modal
chain-of-thought prompting process that progressively aligns audio semantics
with text and ensures style preservation in the decoded results. Furthermore,
we construct and release a large-scale, high-quality expressive S2ST dataset,
UniST, comprising 44.8k hours of data. Experimental results show that UniSS
significantly outperforms previous methods in translation fidelity and speech
quality while preserving voice, emotion, and duration consistency. Our work
establishes a simpler and more effective paradigm for building the next
generation of expressive S2ST systems. Audio samples are available at
https://cmots.github.io/uniss-demo.
Acoustic-based Gender Differentiation in Speech-aware Language Models
Authors: Junhyuk Choi, Jihwan Seol, Nayeon Kim, Chanhee Cho, EunBin Cho, Bugeun Kim
2025-09-25
Speech-aware Language Models (SpeechLMs) have fundamentally transformed
human-AI interaction by enabling voice-based communication, yet they may exhibit acoustic-based gender differentiation where identical questions lead to different responses based on the speaker's gender. This paper proposes a new
dataset that enables systematic analysis of this phenomenon, containing 9,208
speech samples across three categories: Gender-Independent,
Gender-Stereotypical, and Gender-Dependent. We further evaluated the LLaMA-Omni series and discovered a paradoxical pattern: while overall responses seem identical regardless of gender, the pattern is far from unbiased. Specifically, in Gender-Stereotypical questions, all models consistently exhibited male-oriented responses; meanwhile, in Gender-Dependent questions, where gender differentiation would be contextually appropriate, models instead exhibited responses independent of gender. We also confirm that this pattern results neither from neutral options nor from the perceived gender of a voice. When we allow neutral responses, models tend to respond neutrally in Gender-Dependent questions as well, and the paradoxical pattern persists even when we apply gender neutralization methods to speech. Through comparison between SpeechLMs and their corresponding backbone LLMs, we confirmed that these paradoxical patterns primarily stem from Whisper speech encoders, which generate male-oriented acoustic tokens. These findings reveal that current SpeechLMs may not successfully remove gender biases; rather, they prioritize general fairness principles over contextual appropriateness, highlighting the need for more
sophisticated techniques to utilize gender information properly in speech
technology.
TyphoonMLA A Mixed Naive-Absorb MLA Kernel For Shared Prefix
Authors: Ahmet Caner Yüzügüler, Ahmet Çelik, Jiawei Zhuang, Lukas Cavigelli
2025-09-25
Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in
state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel
formulation, MLA allows two functionally equivalent but computationally
distinct kernel implementations: naive and absorb. While the naive kernels
(e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decode kernels (e.g., FlashMLA) rely
on the absorb method to minimize HBM bandwidth usage. However, the
compute-bound nature of the absorb implementations prohibits performance
benefits from data reuse opportunities in attention calculations, such as
shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that
combines naive and absorb formulations to harness the strengths of both.
TyphoonMLA effectively leverages the shared prefix by applying the naive
formulation to the compute-bound parts of attention calculations, while
reducing the bandwidth requirements for non-shared parts by using the absorb
formulation. As a result, TyphoonMLA improves the throughput of attention
calculations in MLA architectures by up to 3x and 3.24x on NPU and GPUs, with
only a 3% overhead in HBM size.
KeyWorld Key Frame Reasoning Enables Effective and Efficient World Models
Authors: Sibo Li, Qianyue Hao, Yu Shang, Yong Li
2025-09-25
Robotic world models are a promising paradigm for forecasting future
environment states, yet their inference speed and the physical plausibility of
generated trajectories remain critical bottlenecks, limiting their real-world
applications. This stems from the redundancy of the prevailing frame-to-frame
generation approach, where the model conducts costly computation on similar
frames, as well as neglecting the semantic importance of key transitions. To
address this inefficiency, we propose KeyWorld, a framework that improves
text-conditioned robotic world models by concentrating transformer computation
on a few semantic key frames while employing a lightweight convolutional model
to fill the intermediate frames. Specifically, KeyWorld first identifies
significant transitions by iteratively simplifying the robot's motion
trajectories, obtaining the ground truth key frames. Then, a DiT model is
trained to reason and generate these physically meaningful key frames from
textual task descriptions. Finally, a lightweight interpolator efficiently
reconstructs the full video by inpainting all intermediate frames. Evaluations
on the LIBERO benchmark demonstrate that KeyWorld achieves a 5.68x acceleration compared to the frame-to-frame generation baseline, and focusing
on the motion-aware key frames further contributes to the physical validity of
the generated videos, especially on complex tasks. Our approach highlights a
practical path toward deploying world models in real-time robotic control and
other domains requiring both efficient and effective world models. Code is
released at https://anonymous.4open.science/r/Keyworld-E43D.
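The "iterative trajectory simplification" step admits a classic implementation; the sketch below uses Ramer-Douglas-Peucker as an illustrative stand-in, keeping the frames where motion deviates most from straight-line interpolation between segment endpoints.

```python
import numpy as np

def keyframes_rdp(traj, eps):
    """Ramer-Douglas-Peucker simplification, an illustrative stand-in for
    the paper's motion-trajectory simplification."""
    if len(traj) < 3:
        return [0, len(traj) - 1]
    a, b = traj[0], traj[-1]
    ab = b - a
    t = np.clip(((traj - a) @ ab) / (ab @ ab + 1e-12), 0.0, 1.0)
    d = np.linalg.norm(traj - (a + np.outer(t, ab)), axis=1)
    i = int(d.argmax())
    if d[i] < eps:                 # segment nearly linear: endpoints suffice
        return [0, len(traj) - 1]
    left = keyframes_rdp(traj[: i + 1], eps)
    right = [i + j for j in keyframes_rdp(traj[i:], eps)]
    return sorted(set(left + right))

traj = np.array([[0, 0], [1, 0.1], [2, 0], [2, 1], [2, 2], [3, 2]], float)
print("key frames:", keyframes_rdp(traj, eps=0.3))  # indices of key frames
```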
Binary Autoencoder for Mechanistic Interpretability of Large Language Models
Authors: Hakaze Cho, Haolin Yang, Brian M. Kurkoski, Naoya Inoue
2025-09-25
Existing works are dedicated to untangling atomized numerical components
(features) from the hidden states of Large Language Models (LLMs) for
interpreting their mechanism. However, they typically rely on autoencoders
constrained by some implicit training-time regularization on single training
instances (i.e., normalization, top-k function, etc.), without an explicit guarantee of global sparsity among instances, causing a large number of dense (simultaneously active) features, harming the feature sparsity and
atomization. In this paper, we propose a novel autoencoder variant that
enforces minimal entropy on minibatches of hidden activations, thereby
promoting feature independence and sparsity across instances. For efficient
entropy calculation, we discretize the hidden activations to 1-bit via a step
function and apply gradient estimation to enable backpropagation, so that we
term it the Binary Autoencoder (BAE) and empirically demonstrate two major
applications: (1) Feature set entropy calculation. Entropy can be reliably
estimated on binary hidden activations, which we empirically evaluate and
leverage to characterize the inference dynamics of LLMs and In-context
Learning. (2) Feature untangling. Similar to typical methods, BAE can extract
atomized features from an LLM's hidden states. To robustly evaluate such feature
extraction capability, we refine traditional feature-interpretation methods to
avoid unreliable handling of numerical tokens, and show that BAE avoids dense
features while producing the largest number of interpretable ones among
baselines, which confirms the effectiveness of BAE as a feature extractor.
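A minimal PyTorch sketch of the two ingredients named above, with sizes and the loss form assumed: a 1-bit step function trained via a straight-through gradient estimator, plus a per-feature Bernoulli entropy estimated over the minibatch as the regularizer.

```python
import torch

def binarize_ste(h):
    """1-bit step with a straight-through gradient estimator."""
    b = (h > 0).float()
    return b + (h - h.detach())      # forward: b; backward: identity on h

def minibatch_entropy(b, eps=1e-8):
    """Per-feature Bernoulli entropy estimated over the minibatch."""
    p = b.mean(dim=0).clamp(eps, 1 - eps)
    return -(p * p.log2() + (1 - p) * (1 - p).log2()).sum()

h = torch.randn(256, 64, requires_grad=True)   # toy hidden activations
b = binarize_ste(h)
loss = minibatch_entropy(b)                    # entropy regularizer (illustrative)
loss.backward()
print(float(loss), h.grad.abs().mean().item())
```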
Toward Robust and Efficient ML-Based GPU Caching for Modern Inference
Authors: Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng
2025-09-25
In modern GPU inference, caching efficiency remains a major bottleneck. In
recommendation models, embedding hit rates largely determine throughput, while
in large language models, KV-cache misses substantially increase
time-to-first-token (TTFT). Heuristic policies such as LRU often
struggle under structured access patterns. Learning-based approaches are
promising, but in practice face two major limitations: they degrade sharply
when predictions are inaccurate, or they gain little even with accurate
predictions due to conservative designs. Some also incur high overhead, further
limiting practicality.
We present LCR, a practical framework for learning-based GPU caching that delivers performance gains while ensuring robustness and efficiency. Its core algorithm, LARU, enhances LRU with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation. When predictions are accurate, LARU achieves near-optimal performance. With inaccurate predictions, it degrades gracefully to near-LRU performance. With LCR, we bridge the gap between empirical progress and theoretical advances in learning-based caching. Experiments show that LCR delivers consistent gains under realistic conditions. In DLRM and LLM scenarios, it improves throughput by up to 24.2% and reduces P99 TTFT by up to 28.3%, outperforming widely used inference systems. Even under poor predictions, its performance remains stable, demonstrating practical robustness.
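An illustrative toy of the LARU idea (not the paper's algorithm): evict by predicted reuse distance while an online error estimate says the predictor is trustworthy, otherwise fall back to plain LRU. The predictor interface, thresholds, and error proxy are all assumptions.

```python
from collections import OrderedDict

class LARULike:
    """Toy LRU-with-predictions cache, loosely in the spirit of LARU."""
    def __init__(self, cap, predictor):
        self.cap, self.predict = cap, predictor
        self.d = OrderedDict()
        self.err = 0.5                       # crude online error estimate

    def get(self, key, now):
        hit = key in self.d
        if hit:
            self.d.move_to_end(key)          # refresh recency, as in LRU
        elif len(self.d) >= self.cap:
            if self.err < 0.3:               # predictions look reliable
                victim = max(self.d, key=lambda k: self.predict(k, now))
            else:                            # degrade gracefully to plain LRU
                victim = next(iter(self.d))
            del self.d[victim]
        self.d[key] = True
        # proxy update: treat misses as evidence the policy is mispredicting
        self.err = 0.9 * self.err + 0.1 * (0.0 if hit else 1.0)
        return hit

cache = LARULike(2, predictor=lambda k, now: hash(k) % 10)
print([cache.get(k, t) for t, k in enumerate("abab")])  # [False, False, True, True]
```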
MemLens Uncovering Memorization in LLMs with Activation Trajectories
Authors: Zirui He, Haiyan Zhao, Ali Payani, Mengnan du
2025-09-25
Large language models (LLMs) are commonly evaluated on challenging benchmarks
such as AIME and Math500, which are susceptible to contamination and risk of
being memorized. Existing detection methods, which primarily rely on
surface-level lexical overlap and perplexity, demonstrate low generalization
and degrade significantly when encountering implicitly contaminated data. In
this paper, we propose MemLens (An Activation Lens for Memorization Detection)
to detect memorization by analyzing the probability trajectories of numeric
tokens during generation. Our method reveals that contaminated samples exhibit
"shortcut" behaviors, locking onto an answer with high confidence in the
model's early layers, whereas clean samples show more gradual evidence
accumulation across the model's full depth. We observe that contaminated and
clean samples exhibit distinct and well-separated reasoning trajectories. To
further validate this, we inject carefully designed samples into the model
through LoRA fine-tuning and observe the same trajectory patterns as in
naturally contaminated data. These results provide strong evidence that MemLens
captures genuine signals of memorization rather than spurious correlations.
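A logit-lens-style sketch of the trajectory probe as we read it (our illustrative reading, not MemLens's code; random tensors stand in for a real model's residual stream and unembedding matrix).

```python
import torch

def numeric_token_trajectory(hiddens, W_U, answer_id):
    """Decode each layer's residual state and read off the probability
    assigned to the numeric answer token."""
    logits = hiddens @ W_U                       # [n_layers, vocab]
    return torch.softmax(logits, dim=-1)[:, answer_id]

n_layers, d_model, vocab = 24, 512, 1000         # toy sizes
hiddens = torch.randn(n_layers, d_model)         # residual stream at answer position
W_U = torch.randn(d_model, vocab)                # unembedding matrix
traj = numeric_token_trajectory(hiddens, W_U, answer_id=42)
# A memorized ("shortcut") sample would lock in early: high probability
# already in shallow layers; clean samples climb gradually with depth.
print(traj[:4], traj[-1])
```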
Punching Above Precision Small Quantized Model Distillation with Learnable Regularizer
Authors: Abdur Rehman, S M A Sharif, Md Abdur Rahaman, Mohamed Jismy Aashik Rasool, Seongwan Kim, Jaeho Lee
2025-09-25
Quantization-aware training (QAT) combined with knowledge distillation (KD)
is a promising strategy for compressing Artificial Intelligence (AI) models for
deployment on resource-constrained hardware. However, existing QAT-KD methods
often struggle to balance task-specific (TS) and distillation losses due to
heterogeneous gradient magnitudes, especially under quantization. We
propose Game of Regularizer (GoR), a novel learnable regularization method that
adaptively balances TS and KD objectives using only two trainable parameters
for dynamic loss weighting. GoR reduces conflict between supervision signals,
improves convergence, and boosts the performance of small quantized models (SQMs). Experiments on image classification, object detection (OD), and large
language model (LLM) quantization show that GoR consistently outperforms
state-of-the-art QAT-KD methods. On low-power edge devices, it delivers faster
inference while maintaining full-precision accuracy. We also introduce
QAT-EKD-GoR, an ensemble distillation framework that uses multiple
heterogeneous teacher models. Under optimal conditions, the proposed EKD-GoR
can outperform full-precision models, providing a robust solution for
real-world deployment.
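The abstract does not specify GoR's two-parameter weighting; the sketch below shows one standard way to realize dynamic loss balancing with exactly two trainable parameters (uncertainty-style weighting), offered as an assumption-laden illustration rather than the paper's formulation.

```python
import torch

class TwoParamLossBalancer(torch.nn.Module):
    """Sketch: balance task-specific (TS) and distillation (KD) losses
    with two trainable log-scale parameters."""
    def __init__(self):
        super().__init__()
        self.log_s_ts = torch.nn.Parameter(torch.zeros(()))
        self.log_s_kd = torch.nn.Parameter(torch.zeros(()))

    def forward(self, loss_ts, loss_kd):
        # exp(-log_s) scales each loss; +log_s penalizes trivial down-weighting
        return (torch.exp(-self.log_s_ts) * loss_ts + self.log_s_ts
                + torch.exp(-self.log_s_kd) * loss_kd + self.log_s_kd)

bal = TwoParamLossBalancer()
total = bal(torch.tensor(2.3), torch.tensor(0.4))  # toy loss values
total.backward()
print(total.item(), bal.log_s_ts.grad.item())
```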
SPADE Structured Pruning and Adaptive Distillation for Efficient LLM-TTS
Authors: Tan Dat Nguyen, Jaehun Kim, Ji-Hoon Kim, Shukjae Choi, Youshin Lim, Joon Son Chung
2025-09-25
The goal of this paper is to introduce SPADE, a framework for Structured
Pruning and Adaptive Distillation for Efficient Large Language Model-based
text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability
and zero-shot generalization, but their large parameter counts and high latency
limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate-based layer importance index to remove
non-essential Transformer layers, with (ii) multi-level knowledge distillation
to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves
near-parity perceptual quality while halving Transformer depth, reducing VRAM
usage by up to 20%, and achieving up to 1.7x faster real-time factor with less
than 5% of the original training data. These results show that compact LLM-TTS
models can maintain naturalness and speaker similarity while enabling practical
real-time speech generation. Audio samples are available at
https://mm.kaist.ac.kr/projects/SPADE/.
Towards Atoms of Large Language Models
Authors: Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
2025-09-25
The fundamental units of internal representations in large language models
(LLMs) remain undefined, limiting further understanding of their mechanisms.
Neurons or features are often regarded as such units, yet neurons suffer from
polysemy, while features face concerns of unreliable reconstruction and
instability. To address this issue, we propose the Atoms Theory, which defines
such units as atoms. We introduce the atomic inner product (AIP) to correct
representation shifting, formally define atoms, and prove the conditions that
atoms satisfy the Restricted Isometry Property (RIP), ensuring stable
representations over atom set and linking to compressed sensing. Under stronger
conditions, we further establish the uniqueness and exact recoverability of the sparse representations, and provide guarantees that single-layer sparse autoencoders (SAEs) with threshold activations can reliably
identify the atoms. To validate the Atoms Theory, we train threshold-activated
SAEs on Gemma2-2B, Gemma2-9B, and Llama3.1-8B, achieving 99.9% sparse reconstruction across layers on average, and more than 99.8% of atoms satisfy
the uniqueness condition, compared to 0.5% for neurons and 68.2% for features,
showing that atoms more faithfully capture intrinsic representations of LLMs.
Scaling experiments further reveal the link between SAE size and recovery capacity. Overall, this work systematically introduces and validates the Atoms Theory of LLMs, providing a theoretical framework for understanding internal
representations and a foundation for mechanistic interpretability. Code
available at https://github.com/ChenhuiHu/towards_atoms.
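A minimal threshold-activated SAE of the kind the theory covers; dimensions, threshold, and training loss are illustrative choices, not the paper's configuration.

```python
import torch

class ThresholdSAE(torch.nn.Module):
    """Minimal threshold-activated sparse autoencoder (illustrative sizes)."""
    def __init__(self, d_model=512, n_atoms=4096, theta=0.1):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, n_atoms)
        self.dec = torch.nn.Linear(n_atoms, d_model, bias=False)
        self.theta = theta

    def forward(self, x):
        a = self.enc(x)
        a = a * (a > self.theta)            # hard threshold -> sparse codes
        return self.dec(a), a

sae = ThresholdSAE()
x = torch.randn(8, 512)                     # toy hidden states
x_hat, codes = sae(x)
loss = torch.nn.functional.mse_loss(x_hat, x)
print(loss.item(), (codes != 0).float().mean().item())  # sparsity level
```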
Recidivism and Peer Influence with LLM Text Embeddings in Low Security Correctional Facilities
Authors: Shanjukta Nath, Jiwon Hong, Jae Ho Chang, Keith Warren, Subhadeep Paul
2025-09-25
We find AI embeddings obtained using a pre-trained transformer-based Large Language Model (LLM) of 80,000-120,000 written affirmations and correction
exchanges among residents in low-security correctional facilities to be highly
predictive of recidivism. The prediction accuracy is 30% higher with embedding
vectors than with only pre-entry covariates. However, since the text embedding
vectors are high-dimensional, we perform Zero-Shot classification of these
texts to a low-dimensional vector of user-defined classes to aid interpretation
while retaining the predictive power. To shed light on the social dynamics
inside the correctional facilities, we estimate peer effects in these LLM-generated numerical representations of language with a multivariate peer
effect model, adjusting for network endogeneity. We develop new methodology and
theory for peer effect estimation that accommodate sparse networks,
multivariate latent variables, and correlated multivariate outcomes. With these
new methods, we find significant peer effects in language usage for interaction
and feedback.
Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning
Authors: Hanjiang Hu, Changliu Liu, Na Li, Yebin Wang
2025-09-24
Large Language Models (LLMs) have demonstrated remarkable capabilities in
knowledge acquisition, reasoning, and tool use, making them promising
candidates for autonomous agent applications. However, training LLM agents for
complex multi-turn task planning faces significant challenges, including
episode-wise rewards, credit assignment across long horizons, and the
computational overhead of reinforcement learning in multi-turn interaction
settings. To this end, this paper introduces a novel approach that transforms
multi-turn task planning into single-turn task reasoning problems, enabling
efficient policy optimization through Group Relative Policy Optimization (GRPO)
with dense and verifiable reward from expert trajectories. Our theoretical
analysis shows that GRPO improvement on single-turn task reasoning results in higher multi-turn success probability within the minimal number of turns, as well as generalization to subtasks with shorter horizons. Experimental evaluation on
the complex task planning benchmark demonstrates that our 1.5B parameter model
trained with single-turn GRPO achieves superior performance compared to larger
baseline models up to 14B parameters, with success rates of 70% for
long-horizon planning tasks with over 30 steps. We also theoretically and empirically validate strong cross-task generalizability: models trained on complex tasks can successfully complete all simpler subtasks.
CHOIR A Chatbot-mediated Organizational Memory Leveraging Communication in University Research Labs
Authors: Sangwook Lee, Adnan Abbas, Yan Chen, Young-Ho Kim, Sang Won Lee
2025-09-24
University research labs often rely on chat-based platforms for communication and project management, where valuable knowledge surfaces but is easily lost in
message streams. Documentation can preserve knowledge, but it requires ongoing
maintenance and is challenging to navigate. Drawing on formative interviews
that revealed organizational memory challenges in labs, we designed CHOIR, an LLM-based chatbot that supports organizational memory through four key
functions: document-grounded Q&A, Q&A sharing for follow-up discussion,
knowledge extraction from conversations, and AI-assisted document updates. We
deployed CHOIR in four research labs for one month (n=21), where the lab
members asked 107 questions and lab directors updated documents 38 times in the
organizational memory. Our findings reveal a privacy-awareness tension:
questions were asked privately, limiting directors' visibility into
documentation gaps. Students often avoided contribution due to challenges in
generalizing personal experiences into universal documentation. We contribute
design implications for privacy-preserving awareness and supporting
context-specific knowledge documentation.
MARS toward more efficient multi-agent collaboration for LLM reasoning
Authors: Xiao Wang, Jia Wang, Yijie Wang, Pengtao Dang, Sha Cao, Chi Zhang
2025-09-24
Large language models (LLMs) have achieved impressive results in natural
language understanding, yet their reasoning capabilities remain limited when
operating as single agents. Multi-Agent Debate (MAD) has been proposed to
address this limitation by enabling collaborative reasoning among multiple
models in a round-table debate manner. While effective, MAD introduces
substantial computational overhead due to the number of agents involved and the
frequent communication required. In this paper, we propose MARS (Multi-Agent
Review System), a role-based collaboration framework inspired by the review
process. In MARS, an author agent generates an initial solution, reviewer
agents provide decisions and comments independently, and a meta-reviewer
integrates the feedback to make the final decision and guide further revision.
This design enhances reasoning quality while avoiding costly
reviewer-to-reviewer interactions, thereby controlling token consumption and
inference time. We compared MARS with both MAD and other state-of-the-art
reasoning strategies across multiple benchmarks. Extensive experiments with
different LLMs show that MARS matches the accuracy of MAD while reducing both
token usage and inference time by approximately 50%. Code is available at
https://github.com/xwang97/MARS.
Shared Neural Space Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision
Authors: Jing Li, Oskar Bartosz, Chengyu Wang, Michal Wnuczynski, Dilshan Godaliyadda, Michael Polley
2025-09-24
The majority of AI models in imaging and vision are customized to perform on specific high-precision tasks. However, this strategy is inefficient for applications with a series of modular tasks, since each requires a mapping into a disparate latent domain. To address this inefficiency, we propose a universal Neural Space (NS), where an encoder-decoder framework pre-computes features across vision and imaging tasks. Our encoder learns transformation-aware, generalizable representations, which enable multiple downstream AI modules to share the same feature space. This architecture reduces redundancy, improves generalization across domain shift, and establishes a foundation for efficient multi-task vision pipelines. Furthermore, as opposed to larger backbones, our backbone is lightweight and CNN-based, allowing for wider deployment across hardware. We further demonstrate that imaging and vision modules, such as demosaicing, denoising, depth estimation, and semantic segmentation, can be performed efficiently in the NS.
Seedream 4.0 Toward Next-generation Multimodal Image Generation
Authors: Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu
2025-09-24
We introduce Seedream 4.0, an efficient and high-performance multimodal image
generation system that unifies text-to-image (T2I) synthesis, image editing,
and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE, which also reduces the number of image tokens considerably. This allows for efficient training of our model, and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning
diverse taxonomies and knowledge-centric concepts. Comprehensive data
collection across hundreds of vertical scenarios, coupled with optimized
strategies, ensures stable and large-scale training, with strong
generalization. By incorporating a carefully fine-tuned VLM model, we perform
multi-modal post-training for training both T2I and image editing tasks
jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream
4.0 can achieve state-of-the-art results on both T2I and multimodal image
editing. In particular, it demonstrates exceptional multimodal capabilities in
complex tasks, including precise image editing and in-context reasoning, and
also allows for multi-image reference, and can generate multiple output images.
This extends traditional T2I systems into a more interactive and
multidimensional creative tool, pushing the boundary of generative AI for both
creativity and professional applications. Seedream 4.0 is now accessible on
https://www.volcengine.com/experience/ark?launch=seedream.
Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing
Authors: Xinnan Dai, Chung-Hsiang Lo, Kai Guo, Shenglai Zeng, Dongsheng Luo, Jiliang Tang
2025-09-24
Transformer-based LLMs demonstrate strong performance on graph reasoning
tasks, yet their internal mechanisms remain underexplored. To uncover these
reasoning process mechanisms in a fundamental and unified view, we study basic decoder-only LLMs and explain them using the circuit-tracer
framework. Through this lens, we visualize reasoning traces and identify two
core mechanisms in graph reasoning: token merging and structural memorization,
which underlie both path reasoning and substructure extraction tasks. We
further quantify these behaviors and analyze how they are influenced by graph
density and model size. Our study provides a unified interpretability framework
for understanding structural reasoning in decoder-only Transformers.
SIM-CoT Supervised Implicit Chain-of-Thought
Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin
2025-09-24
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative
to explicit CoT reasoning in Large Language Models (LLMs), but a persistent
performance gap has limited their adoption. We identify a core latent
instability issue when scaling the computational budget of implicit CoT: as the
number of reasoning tokens increases, training often becomes unstable and
collapses. Our analysis shows that this instability arises from latent
representations becoming homogeneous and losing semantic diversity, caused by
insufficient step-level supervision in current implicit CoT methods. To address
this, we propose SIM-CoT, a plug-and-play training module that introduces
step-level supervision to stabilize and enrich the latent reasoning space.
SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring latent states capture distinct and meaningful information. The auxiliary decoder is removed at inference, preserving the efficiency of implicit CoT with no added overhead.
It also provides interpretability by projecting each latent token onto an
explicit reasoning vocabulary, enabling per-step visualization and diagnosis.
SIM-CoT significantly improves both in-domain accuracy and out-of-domain
stability of implicit CoT methods, boosting Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. It further surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3x greater token efficiency, while closing the
performance gap on larger models like LLaMA-3.1 8B. Code:
https://github.com/InternLM/SIM-CoT
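To make the step-level supervision concrete, here is a minimal sketch (all names are ours, not from the SIM-CoT codebase) of an auxiliary decoder that maps each implicit latent token to logits over its explicit reasoning step, is trained with cross-entropy, and is simply dropped at inference:

    import torch
    import torch.nn as nn

    # Hypothetical sketch of SIM-CoT-style step supervision.
    class AuxStepDecoder(nn.Module):
        def __init__(self, d_model: int, vocab_size: int):
            super().__init__()
            self.proj = nn.Linear(d_model, vocab_size)

        def forward(self, latent_steps):          # (B, n_steps, d_model)
            return self.proj(latent_steps)        # logits over step tokens

    def step_supervision_loss(latents, step_token_ids, decoder):
        # latents: (B, S, D) implicit CoT states; step_token_ids: (B, S) holds
        # one summary token per explicit step (a simplification of full text).
        logits = decoder(latents)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), step_token_ids.reshape(-1))

    # The decoder is used only during training; at inference it is discarded,
    # so the implicit-CoT forward pass is unchanged.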
Multimodal Representation-disentangled Information Bottleneck for Multimodal Recommendation
Authors: Hui Wang, Jinghui Qin, Wushao Wen, Qingling Li, Shanshan Zhong, Zhongzhan Huang
2025-09-24
Multimodal data has significantly advanced recommendation systems by
integrating diverse information sources to model user preferences and item
characteristics. However, these systems often struggle with redundant and
irrelevant information, which can degrade performance. Most existing methods
either fuse multimodal information directly or use rigid architectural
separation for disentanglement, failing to adequately filter noise and model
the complex interplay between modalities. To address these challenges, we
propose a novel framework, the Multimodal Representation-disentangled
Information Bottleneck (MRdIB). Concretely, we first employ a Multimodal
Information Bottleneck to compress the input representations, effectively
filtering out task-irrelevant noise while preserving rich semantic information.
Then, we decompose the information based on its relationship with the
recommendation target into unique, redundant, and synergistic components. We
achieve this decomposition with a series of constraints: a unique information
learning objective to preserve modality-unique signals, a redundant information
learning objective to minimize redundancy, and a synergistic information learning
objective to capture emergent information. By optimizing these objectives,
MRdIB guides a model to learn more powerful and disentangled representations.
Extensive experiments on several competitive models and three benchmark
datasets demonstrate the effectiveness and versatility of our MRdIB in
enhancing multimodal recommendation.
Q-Palette Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
Authors: Deokjae Lee, Hyun Oh Song
2025-09-24
We study weight-only post-training quantization (PTQ), which quantizes the
weights of a large language model (LLM) without retraining, using little or no
calibration data. Weight-only PTQ is crucial for reducing the memory footprint
and latency of LLM inference, especially in memory-bound, small-batch inference
scenarios, such as personalized inference on edge devices. Despite its
importance, irregular weight distributions with heavy-tailed outliers in LLMs
complicate quantization, recently motivating rotation-based methods that
transform weights into near-Gaussian distributions, which are more regular with
fewer outliers, thereby reducing quantization error. In this work, we first
derive the information-theoretically optimal bit allocation for Gaussianized
weights under given bit budgets, revealing that fine-grained fractional-bit
quantizers approaching the Gaussian distortion-rate bound are essential to
achieve near-optimal quantization performance. To bridge this theoretical
insight and practical implementation, we introduce Q-Palette, a versatile
collection of fractional-bit quantizers that range from trellis-coded
quantizers offering near-optimal distortion to simpler vector and scalar
quantizers optimized for faster inference, all efficiently implemented with
optimized CUDA kernels across various bitwidths. Furthermore, leveraging
Q-Palette as a foundational component, we propose a novel mixed-scheme
quantization framework, jointly optimizing quantizer choices and layer fusion
decisions given resource constraints. The code is available at
https://github.com/snu-mllab/Q-Palette.
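As a rough illustration of the rate-allocation insight (a sketch built from the classical Gaussian distortion-rate bound D(R) = sigma^2 * 2^(-2R), not Q-Palette's actual implementation), fractional bits can be assigned per weight group in closed form, with more bits going to higher-variance groups:

    import numpy as np

    # Hedged sketch: closed-form fractional-bit allocation for independent
    # Gaussian groups under an average bit budget (reverse water-filling,
    # simplified by clipping negative rates instead of re-normalizing).
    def allocate_bits(sigmas, avg_bits):
        log_var = np.log2(np.asarray(sigmas, dtype=float) ** 2)
        rates = avg_bits + 0.5 * (log_var - log_var.mean())
        return np.clip(rates, 0.0, None)

    print(allocate_bits([0.5, 1.0, 2.0], avg_bits=3.0))  # -> [2. 3. 4.]

The fractional values motivate quantizers that operate at non-integer bitwidths rather than rounding every layer to the nearest whole bit.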
From Text to Talk Audio-Language Model Needs Non-Autoregressive Joint Training
Authors: Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo, Zitao Liu
2025-09-24
Recent advances in large language models (LLMs) have attracted significant
interest in extending their capabilities to multimodal scenarios, particularly
for speech-to-speech conversational systems. However, existing multimodal
models handling interleaved audio and text rely on autoregressive methods,
overlooking that text depends on target-target relations whereas audio depends
mainly on source-target relations. In this work, we propose Text-to-Talk (TtT),
a unified audio-text framework that integrates autoregressive (AR) text
generation with non-autoregressive (NAR) audio diffusion in a single
Transformer. By leveraging the any-order autoregressive property of absorbing
discrete diffusion, our approach provides a unified training objective for text
and audio. To support this hybrid generation paradigm, we design a
modality-aware attention mechanism that enforces causal attention for text while
allowing bidirectional modeling within audio spans, and further introduce three
training strategies that reduce train-test discrepancies. During inference, TtT
employs block-wise diffusion to synthesize audio in parallel while flexibly
handling variable-length outputs. Extensive experiments across Audio-QA and ASR
tasks demonstrate the effectiveness of our approach, with detailed ablation
studies validating each proposed component. We will open-source our models,
data and code to facilitate future research in this direction.
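A minimal sketch of what such a modality-aware mask could look like (an assumption-level reconstruction from the description, not the paper's code): text positions follow a standard causal mask, while tokens inside the same contiguous audio span may attend to each other bidirectionally:

    import torch

    # True at (i, j) means position i may attend to position j.
    def modality_aware_mask(is_audio):
        L = len(is_audio)
        allowed = torch.tril(torch.ones(L, L, dtype=torch.bool))  # causal default
        changes = [1] + [int(is_audio[i] != is_audio[i - 1]) for i in range(1, L)]
        span_id = torch.cumsum(torch.tensor(changes), dim=0)      # run labels
        audio = torch.tensor(is_audio, dtype=torch.bool)
        same_span = span_id.unsqueeze(0) == span_id.unsqueeze(1)
        allowed |= same_span & audio.unsqueeze(0) & audio.unsqueeze(1)
        return allowed

    # Example: two text tokens, a three-token audio span, then one text token.
    mask = modality_aware_mask([False, False, True, True, True, False])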
Projective Kolmogorov Arnold Neural Networks (P-KANs) Entropy-Driven Functional Space Discovery for Interpretable Machine Learning
Authors: Alastair Poole, Stig McArthur, Saravan Kumar
2025-09-24
Kolmogorov-Arnold Networks (KANs) relocate learnable nonlinearities from
nodes to edges, demonstrating remarkable capabilities in scientific machine
learning and interpretable modeling. However, current KAN implementations
suffer from fundamental inefficiencies due to redundancy in high-dimensional
spline parameter spaces, where numerous distinct parameterisations yield
functionally equivalent behaviors. This redundancy manifests as a "nuisance
space" in the model's Jacobian, leading to susceptibility to overfitting and
poor generalization. We introduce Projective Kolmogorov-Arnold Networks
(P-KANs), a novel training framework that guides edge function discovery
towards interpretable functional representations through entropy-minimisation
techniques from signal analysis and dictionary learning. Rather than
constraining functions to predetermined spaces, our approach maintains spline
space flexibility while introducing "gravitational" terms that encourage
convergence towards optimal functional representations. Our key insight
recognizes that optimal representations can be identified through entropy
analysis of projection coefficients, compressing edge functions to
lower-parameter projective spaces (Fourier, Chebyshev, Bessel). P-KANs
demonstrate superior performance across multiple domains, achieving up to 80%
parameter reduction while maintaining representational capacity, significantly
improved robustness to noise compared to standard KANs, and successful
application to industrial automated fiber placement prediction. Our approach
enables automatic discovery of mixed functional representations where different
edges converge to different optimal spaces, providing both compression benefits
and enhanced interpretability for scientific machine learning applications.
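The entropy-driven "gravity" term can be illustrated with a small sketch (the basis choice and least-squares projection are our assumptions): project each edge function onto, say, a Chebyshev basis and penalize the Shannon entropy of the normalized coefficient energies, so functions are pulled toward a few dominant basis elements:

    import torch

    def coefficient_entropy(fx, xs, degree=8):
        T = [torch.ones_like(xs), xs]                 # Chebyshev T0, T1
        for _ in range(2, degree + 1):
            T.append(2 * xs * T[-1] - T[-2])          # T_{n+1} = 2x T_n - T_{n-1}
        basis = torch.stack(T, dim=1)                 # (N, degree+1)
        coeffs = torch.linalg.lstsq(basis, fx.unsqueeze(1)).solution.squeeze(1)
        p = coeffs**2 / (coeffs**2).sum().clamp_min(1e-12)
        return -(p * p.clamp_min(1e-12).log()).sum()  # low entropy = few terms

    xs = torch.linspace(-1, 1, 64)
    reg = coefficient_entropy(torch.sin(3 * xs), xs)  # add to the training loss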
Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks
Authors: Vani Kanjirangat, Tanja Samardžić, Ljiljana Dolamic, Fabio Rinaldi
2025-09-24
Dialectal data are characterized by linguistic variation that appears small
to humans but has a significant impact on the performance of models. This
dialect gap has been related to various factors (e.g., data size, economic and
social factors) whose impact, however, turns out to be inconsistent. In this
work, we investigate factors impacting the model performance more directly: we
correlate Tokenization Parity (TP) and Information Parity (IP), as measures of
representational biases in pre-trained multilingual models, with the downstream
performance. We compare state-of-the-art decoder-only LLMs with encoder-based
models across three tasks: dialect classification, topic classification, and
extractive question answering, controlling for varying scripts (Latin vs.
non-Latin) and resource availability (high vs. low). Our analysis reveals that
TP is a better predictor of the performance on tasks reliant on syntactic and
morphological cues (e.g., extractive QA), while IP better predicts performance
in semantic tasks (e.g., topic classification). Complementary analyses,
including tokenizer behavior, vocabulary coverage, and qualitative insights,
reveal that the language support claims of LLMs often mask deeper
mismatches at the script or token level.
MeshMosaic Scaling Artist Mesh Generation via Local-to-Global Assembly
Authors: Rui Xu, Tianyang Xue, Qiujie Dong, Le Wan, Zhe Zhu, Peng Li, Zhiyang Dou, Cheng Lin, Shiqing Xin, Yuan Liu, Wenping Wang, Taku Komura
2025-09-24
Scaling artist-designed meshes to high triangle numbers remains challenging
for autoregressive generative models. Existing transformer-based methods suffer
from long-sequence bottlenecks and limited quantization resolution, primarily
due to the large number of tokens required and constrained quantization
granularity. These issues prevent faithful reproduction of fine geometric
details and structured density patterns. We introduce MeshMosaic, a novel
local-to-global framework for artist mesh generation that scales to over 100K
triangles--substantially surpassing prior methods, which typically handle only
around 8K faces. MeshMosaic first segments shapes into patches, generating each
patch autoregressively and leveraging shared boundary conditions to promote
coherence, symmetry, and seamless connectivity between neighboring regions.
This strategy enhances scalability to high-resolution meshes by quantizing
patches individually, resulting in more symmetrical and organized mesh density
and structure. Extensive experiments across multiple public datasets
demonstrate that MeshMosaic significantly outperforms state-of-the-art methods
in both geometric fidelity and user preference, supporting superior detail
representation and practical mesh generation for real-world applications.
RAD Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis
Authors: Haolin Li, Tianjie Dai, Zhe Chen, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang
2025-09-24
Clinical diagnosis is a highly specialized discipline requiring both domain
expertise and strict adherence to rigorous guidelines. While current AI-driven
medical research predominantly focuses on knowledge graphs or natural text
pretraining paradigms to incorporate medical knowledge, these approaches
primarily rely on implicitly encoded knowledge within model parameters,
neglecting task-specific knowledge required by diverse downstream tasks. To
address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a
novel framework that explicitly injects external knowledge into multimodal
models directly on downstream tasks. Specifically, RAD operates through three
key mechanisms: retrieval and refinement of disease-centered knowledge from
multiple medical sources, a guideline-enhanced contrastive loss that constrains
the latent distance between multi-modal features and guideline knowledge, and
a dual decoder that employs guidelines as queries to steer
cross-modal fusion, aligning the models with clinical diagnostic workflows from
guideline acquisition to feature extraction and decision-making. Moreover,
recognizing the lack of quantitative evaluation of interpretability for
multimodal diagnostic models, we introduce a set of criteria to assess the
interpretability from both image and text perspectives. Extensive evaluations
across four datasets with different anatomies demonstrate RAD's
generalizability, achieving state-of-the-art performance. Furthermore, RAD
enables the model to concentrate more precisely on abnormal regions and
critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code
is available at https://github.com/tdlhl/RAD.
FastEagle Cascaded Drafting for Accelerating Speculative Decoding
Authors: Haiduo Huang, Jiangcheng Song, Wenzhe Zhao, Pengju Ren
2025-09-24
Speculative decoding accelerates generation by drafting candidates and
verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still
require N sequential passes to propose N tokens. We present FastEagle, a
non-autoregressive cascaded drafter that emits an entire draft in a single
forward pass. FastEagle replaces temporal steps with a lightweight layer
cascade and trains with layer-wise supervision to mitigate error accumulation.
Coupled with a constrained draft tree that preserves lossless verification
cost, FastEagle delivers substantial wall-clock speedups over strong
autoregressive drafters while maintaining competitive acceptance behavior.
Across multiple LLMs (Vicuna-13B, LLaMA-Instruct 3.x, and
DeepSeek-R1-Distill-LLaMA) and tasks (MT-Bench, HumanEval, GSM8K, CNN/DM,
Alpaca), FastEagle consistently outperforms EAGLE-3 in speedup under both
greedy and stochastic decoding, with comparable average acceptance lengths.
These results indicate that removing sequential dependencies in drafting is a
practical path toward lossless LLM inference acceleration.
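The core idea, replacing temporal draft steps with a layer cascade, can be sketched as follows (a toy illustration with hypothetical modules, not the FastEagle architecture): each cascade stage refines a hidden state and emits one draft token, so N tokens come out of one forward pass; layer-wise supervision during training would attach a loss at every stage's logits:

    import torch
    import torch.nn as nn

    class CascadedDrafter(nn.Module):
        def __init__(self, d_model, vocab_size, n_draft):
            super().__init__()
            self.stages = nn.ModuleList(nn.Linear(d_model, d_model)
                                        for _ in range(n_draft))
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, h_last):                    # (B, D): last hidden state
            drafts, h = [], h_last
            for stage in self.stages:                 # layer cascade, not steps
                h = torch.tanh(stage(h))
                drafts.append(self.head(h).argmax(dim=-1))
            return torch.stack(drafts, dim=1)         # (B, n_draft) greedy drafts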
Exploration with Foundation Models Capabilities, Limitations, and Hybrid Approaches
Authors: Remo Sasso, Michelangelo Conserva, Dominik Jeurissen, Paulo Rauber
2025-09-24
Exploration in reinforcement learning (RL) remains challenging, particularly
in sparse-reward settings. While foundation models possess strong semantic
priors, their capabilities as zero-shot exploration agents in classic RL
benchmarks are not well understood. We benchmark LLMs and VLMs on multi-armed
bandits, Gridworlds, and sparse-reward Atari to test zero-shot exploration. Our
investigation reveals a key limitation: while VLMs can infer high-level
objectives from visual input, they consistently fail at precise low-level
control: the "knowing-doing gap". To analyze a potential bridge for this gap,
we investigate a simple on-policy hybrid framework in a controlled, best-case
scenario. Our results in this idealized setting show that VLM guidance can
significantly improve early-stage sample efficiency, providing a clear analysis
of the potential and constraints of using foundation models to guide
exploration rather than for end-to-end control.
Future Policy Aware Preference Learning for Mathematical Reasoning
Authors: Minjae Oh, Yunho Choi, Dongmin Choi, Yohan Jo
2025-09-24
Preference learning methods such as Direct Preference Optimization (DPO) have
become standard for Large Language Model (LLM) post-training, yet they are
often ineffective for mathematical reasoning. A key challenge is the large
token overlap between preferred and dispreferred trajectories; lowering the
probability of dispreferred trajectories also reduces the probability of shared
useful tokens, leading to over-penalization and overall performance collapse.
As a mitigation, existing algorithms include the probability of a trajectory
under the current policy as a regularization term, which decreases the effect
of the gradient when the probability is low. However, by the time this effect
takes hold, useful tokens may have already been over-penalized as the model has
begun to degrade. To address this, we propose Future Policy Aware (FPA)
preference learning, which replaces the current policy with a future policy in
the regularization term. This future policy is estimated via lightweight,
logit-space extrapolation from a reference model toward the current model. FPA
enables safer training by preemptively regularizing potentially problematic
gradients. We apply FPA to DPO, RPO, and SimPER and evaluate them on the MATH
and GSM8K benchmarks. FPA yields consistent performance gains, with the largest
improvements observed with SimPER, achieving gains of up to 5.75%. We
demonstrate that FPA provides proactive regularization while preserving the
probability of shared, useful mathematical tokens, and enables longer,
degradation-free training with negligible computational overhead. We will
release our code publicly upon publication.
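The future-policy estimate itself is nearly a one-liner; here is a hedged sketch of the logit-space extrapolation (alpha and the function names are our assumptions): moving from the reference model's logits toward the current model's logits, and past them, approximates where the policy is heading, and the result stands in for the current policy inside the DPO/RPO/SimPER regularization term:

    import torch

    def future_logprobs(logits_ref, logits_cur, alpha=2.0):
        # alpha > 1 extrapolates beyond the current model in logit space.
        logits_future = logits_ref + alpha * (logits_cur - logits_ref)
        return torch.log_softmax(logits_future, dim=-1)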
Structuring Collective Action with LLM-Guided Evolution From Ill-Structured Problems to Executable Heuristics
Authors: Kevin Bradley Dsouza, Graham Alexander Watt, Yuri Leonenko, Juan Moreno-Cruz
2025-09-24
Collective action problems, which require aligning individual incentives with
collective goals, are classic examples of Ill-Structured Problems (ISPs). For
an individual agent, the causal links between local actions and global outcomes
are unclear, stakeholder objectives often conflict, and no single, clear
algorithm can bridge micro-level choices with macro-level welfare. We present
ECHO-MIMIC, a computational framework that converts this global complexity into
a tractable, Well-Structured Problem (WSP) for each agent by discovering
compact, executable heuristics and persuasive rationales. The framework
operates in two stages: ECHO (Evolutionary Crafting of Heuristics from
Outcomes) evolves snippets of Python code that encode candidate behavioral
policies, while MIMIC (Mechanism Inference & Messaging for
Individual-to-Collective Alignment) evolves companion natural language messages
that motivate agents to adopt those policies. Both phases employ a
large-language-model-driven evolutionary search: the LLM proposes diverse and
context-aware code or text variants, while population-level selection retains
those that maximize collective performance in a simulated environment. We
demonstrate this framework on a canonical ISP in agricultural landscape
management, where local farming decisions impact global ecological
connectivity. Results show that ECHO-MIMIC discovers high-performing heuristics
compared to baselines and crafts tailored messages that successfully align
simulated farmer behavior with landscape-level ecological goals. By coupling
algorithmic rule discovery with tailored communication, ECHO-MIMIC transforms
the cognitive burden of collective action into a simple set of agent-level
instructions, making previously ill-structured problems solvable in practice
and opening a new path toward scalable, adaptive policy design.
CollaPipe Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks
Authors: Jiewei Chen, Xiumei Deng, Zehui Xiong, Shaoyong Guo, Xuesong Qiu, Ping Wang, Dusit Niyato
2025-09-24
The increasing demand for intelligent mobile applications has made
multi-agent collaboration with Transformer-based large language models (LLMs)
essential in mobile edge computing (MEC) networks. However, training LLMs in
such environments remains challenging due to heavy computation, high end-to-end
latency, and limited model generalization. We introduce CollaPipe, a hybrid
distributed learning framework that integrates collaborative pipeline
parallelism with federated aggregation to support self-evolving intelligent
networks. In CollaPipe, the encoder part is adaptively partitioned into
variable-sized segments and deployed across mobile devices for
pipeline-parallel training, while the decoder is deployed on edge servers to
handle generative tasks. Then we perform global model update via federated
aggregation. To enhance training efficiency, we formulate a joint optimization
problem that adaptively allocates model segments, micro-batches, bandwidth, and
transmission power. We derive and use a closed-form convergence bound to design
a Dynamic Segment Scheduling and Resource Allocation (DSSDA) algorithm based
on Lyapunov optimization, ensuring system stability under long-term
constraints. Extensive experiments on downstream tasks with Transformer and
BERT models show that CollaPipe improves computation efficiency by up to
15.09%, reduces end-to-end latency by at least 48.98%, and cuts single device
memory usage by more than half, enabling online learning in heterogeneous and
dynamic MEC environments.
BurstEngine an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens
Authors: Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun
2025-09-24
Existing methods for training LLMs on long-sequence data, such as Tensor
Parallelism and Context Parallelism, exhibit low Model FLOPs Utilization as
sequence lengths and number of GPUs increase, especially when sequence lengths
exceed 1M tokens. To address these challenges, we propose BurstEngine, an
efficient framework designed to train LLMs on long-sequence data. BurstEngine
introduces BurstAttention, an optimized distributed attention with lower
communication cost than RingAttention. BurstAttention leverages topology-aware
ring communication to fully utilize network bandwidth and incorporates
fine-grained communication-computation overlap. Furthermore, BurstEngine
introduces sequence-level selective checkpointing and fuses the language
modeling head with the loss function to reduce memory cost. Additionally,
BurstEngine introduces workload balance optimization for various types of
attention masking. By integrating these optimizations, BurstEngine achieves a
speedup with much lower memory overhead than the state-of-the-art
baselines when training LLMs on extremely long sequences of over 1M tokens. We
have made our code publicly available on GitHub:
https://github.com/thunlp/BurstEngine.
MMedFD A Real-world Healthcare Benchmark for Multi-turn Full-Duplex Automatic Speech Recognition
Authors: Hongzhao Chen, XiaoYang Wang, Jing Lan, Hexiao Ding, Yufeng Jiang, MingHui Yang, DanHui Xu, Jun Luo, Nga-Chun Ng, Gerald W. Y. Cheng, Yunlin Mao, Jung Sun Yoo
2025-09-24
Automatic speech recognition (ASR) in clinical dialogue demands robustness to
full-duplex interaction, speaker overlap, and low-latency constraints, yet open
benchmarks remain scarce. We present MMedFD, the first real-world Chinese
healthcare ASR corpus designed for multi-turn, full-duplex settings. Captured
from a deployed AI assistant, the dataset comprises 5,805 annotated sessions
with synchronized user and mixed-channel views, RTTM/CTM timing, and role
labels. We introduce a model-agnostic pipeline for streaming segmentation,
speaker attribution, and dialogue memory, and fine-tune Whisper-small on
role-concatenated audio for long-context recognition. ASR evaluation includes
WER, CER, and HC-WER, which measures concept-level accuracy across healthcare
settings. LLM-generated responses are assessed using rubric-based and pairwise
protocols. MMedFD establishes a reproducible framework for benchmarking
streaming ASR and end-to-end duplex agents in healthcare deployment. The
dataset and related resources are publicly available at
https://github.com/Kinetics-JOJO/MMedFD
Gyges Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference
Authors: Haoyu Chen, Xue Li, Kun Qian, Yu Guan, Jin Zhao, Xin Wang
2025-09-24
Efficiently processing the dynamics of requests, especially the context
length variance, is important in Large Language Model (LLM) serving scenarios.
However, there is an intrinsic trade-off: while leveraging parallelism
strategies, such as Tensor Parallelism (TP), can coordinate multiple GPUs to
accommodate larger context lengths, it inevitably results in degraded overall
throughput. In this paper, we propose Cross-Instance Parallelism Transformation
(Gyges), which adaptively adjusts the parallelism strategies of running
instances to align with the dynamics of incoming requests. We design (1) a
page-friendly, header-centric layout to accelerate KV cache transformations;
(2) dedicated weight padding to accelerate model weight transformations; and
(3) a transformation-aware scheduler to cooperatively schedule requests and
parallelism transformations, optimizing the overall performance. Evaluations
using real-world traces show that Gyges improves throughput by 1.75x-6.57x
compared to state-of-the-art solutions.
Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling
Authors: Youpeng Zhao, Jinpeng LV, Di Wu, Jun Wang, Christopher Gooley
2025-09-23
Test-time scaling (TTS) has recently emerged as a promising direction to
exploit the hidden reasoning capabilities of pre-trained large language models
(LLMs). However, existing scaling methods narrowly focus on the compute-optimal
Pareto-frontier, ignoring the simple fact that compute-optimal is not always
system-optimal. In this work, we propose a system-driven perspective on TTS,
analyzing how reasoning models scale against practical metrics, such as latency
and cost-per-token. By evaluating the impact of popular optimizations such as
tensor parallelism and speculative decoding, our preliminary analysis reveals
the limitations of current methods and calls for a paradigm shift toward
holistic, system-aware evaluations that capture the true essence of scaling
laws at inference time.
Frame-Stacked Local Transformers For Efficient Multi-Codebook Speech Generation
Authors: Roy Fejgin, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Ryan Langman, Jaehyeon Kim, Subhankar Ghosh, Shehzeen Hussain, Jason Li
2025-09-23
Speech generation models based on large language models (LLMs) typically
operate on discrete acoustic codes, which differ fundamentally from text tokens
due to their multicodebook structure. At each timestep, models must predict N
codebook entries jointly, introducing dependencies that challenge simple
parallel prediction approaches. Parallel prediction assumes independence among
codebooks, yielding efficient decoding but often at the cost of reduced
fidelity. To address this, hierarchical strategies employ a local transformer
(LT) to refine predictions and capture intra-timestep dependencies. In this
work, we systematically investigate two LT architectures: an autoregressive
transformer that generates codebooks sequentially, and a MaskGIT-based
transformer that performs iterative masked prediction. Both designs further
enable frame stacking, where the primary transformer predicts multiple frames
jointly, and the LT decodes their codebooks, offering improvements in speed
without compromising perceptual quality. Through extensive analysis, we
characterize the tradeoffs between parallel and iterative sampling strategies
across different throughput and quality regimes. Finally, we propose practical
guidelines for selecting decoding strategies based on deployment priorities
such as computational efficiency and synthesis fidelity.
Transformer Modeling for Both Scalability and Performance in Multivariate Time Series
Authors: Hunjae Lee, Corey Clark
2025-09-23
Variable count is among the main scalability bottlenecks for Transformer
modeling in multivariate time series (MTS) data. On top of this, a growing
consensus in the field points to indiscriminate inter-variable mixing as a
potential source of noise-accumulation and performance degradation. This is
likely exacerbated by sparsity of informative signals characteristic of many
MTS systems coupled with representational misalignment stemming from
indiscriminate information mixing between (heterogeneous) variables. While
scalability and performance are often seen as competing interests in
Transformer design, we show that both can be improved simultaneously in MTS by
strategically constraining the representational capacity of inter-variable
mixing. Our proposed method, Transformer with Delegate Token Attention
(DELTAformer), constrains inter-variable modeling through what we call delegate
tokens which are then used to perform full, unconstrained, inter-temporal
modeling. Delegate tokens act as an implicit regularizer that forces the model
to be highly selective about what inter-variable information is allowed to
propagate through the network. Our results show that DELTAformer scales
linearly with variable count while actually outperforming standard
Transformers, achieving state-of-the-art performance across benchmarks and
baselines. In addition, DELTAformer can focus on relevant signals better than
standard Transformers in noisy MTS environments and overall exhibit superior
noise-resilience. Overall, results across various experiments confirm that by
aligning our model design to leverage domain-specific challenges in MTS to our
advantage, DELTAformer can simultaneously achieve linear scaling while actually
improving its performance against standard, quadratic Transformers.
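A minimal sketch of delegate-token mixing (shapes and module names are ours; the actual DELTAformer block differs): K learned delegate tokens first gather from all V variables, then each variable reads back only from the delegates, so inter-variable mixing costs O(V*K) rather than O(V^2):

    import torch
    import torch.nn as nn

    class DelegateMixing(nn.Module):
        def __init__(self, d_model, k_delegates=4, n_heads=4):
            super().__init__()
            self.delegates = nn.Parameter(torch.randn(k_delegates, d_model))
            self.gather = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.scatter = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, x):                          # x: (B, V, D), one time step
            d = self.delegates.unsqueeze(0).expand(x.size(0), -1, -1)
            d, _ = self.gather(d, x, x)                # delegates read all variables
            out, _ = self.scatter(x, d, d)             # variables read delegates only
            return out

Because all inter-variable information must pass through the small delegate set, the bottleneck acts as the implicit regularizer the abstract describes.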
CompLLM Compression for Long Context Q&A
Authors: Gabriele Berton, Jayakrishnan Unnikrishnan, Son Tran, Mubarak Shah
2025-09-23
Large Language Models (LLMs) face significant computational challenges when
processing long contexts due to the quadratic complexity of self-attention.
While soft context compression methods, which map input text to smaller latent
representations, have shown promise, their real-world adoption is limited.
Existing techniques typically compress the context as a single unit, which
leads to quadratic compression complexity and an inability to reuse
computations across queries with overlapping contexts. In this work, we
introduce CompLLM, a soft compression technique designed for practical
deployment. Instead of processing the context holistically, CompLLM divides it
into segments and compresses each one independently. This simple design choice
yields three critical properties: efficiency, as the compression step scales
linearly with the context length; scalability, enabling models trained on short
sequences (e.g., 1k tokens) to generalize to contexts of 100k tokens; and
reusability, allowing compressed segments to be cached and reused across
different queries. Our experiments show that with a 2x compression rate, at
high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x
and reduces the KV cache size by 50%. Furthermore, CompLLM achieves performance
comparable to that obtained with the uncompressed context, and even surpasses
it on very long sequences, demonstrating its effectiveness and practical
utility.
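The reusability property follows directly from the per-segment design; a sketch of the idea (the cache keying and segment length are our assumptions, and compress_fn stands in for the learned soft-compression encoder):

    import hashlib

    _cache = {}

    def compress_context(tokens, compress_fn, seg_len=1024):
        compressed = []
        for i in range(0, len(tokens), seg_len):
            segment = tuple(tokens[i:i + seg_len])
            key = hashlib.sha1(repr(segment).encode()).hexdigest()
            if key not in _cache:
                _cache[key] = compress_fn(segment)  # independent per-segment pass
            compressed.append(_cache[key])
        return compressed                           # latent reps, linear in length

Two queries sharing a long document prefix would hit the same segment keys and skip recompression entirely.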
Online Process Reward Learning for Agentic Reinforcement Learning
Authors: Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, Jianbin Jiao
2025-09-23
Large language models (LLMs) are increasingly trained with reinforcement
learning (RL) as autonomous agents that reason and act over long horizons in
interactive environments. However, sparse and sometimes unverifiable rewards
make temporal credit assignment extremely challenging. Recent work attempts to
integrate process supervision into agent learning but suffers from biased
annotation, reward hacking, high variance from overly fine-grained signals, or
failures when state overlap is rare. We therefore introduce Online Process
Reward Learning (OPRL), a general credit-assignment strategy for agentic RL
that integrates seamlessly with standard on-policy algorithms without relying
on additional rollouts or explicit step labels. In OPRL, we optimize an
implicit process reward model (PRM) alternately with the agent's policy to
transform trajectory preferences into implicit step rewards through a
trajectory-based DPO objective. These step rewards are then used to compute
step-level advantages, which are combined with episode-level advantages from
outcome rewards for policy update, creating a self-reinforcing loop.
Theoretical findings guarantee that the learned step rewards are consistent
with trajectory preferences and act as potential-based shaping rewards,
providing bounded gradients to stabilize training. Empirically, we evaluate
OPRL on three distinct agent benchmarks, including WebShop and VisualSokoban, as
well as open-ended social interactions with unverifiable rewards in SOTOPIA.
Crucially, OPRL shows superior performance over frontier LLMs and strong RL
baselines across domains, achieving state-of-the-art results with higher
sample-efficiency and lower variance during training. Further analysis also
demonstrates the efficient exploration by OPRL using fewer actions,
underscoring its potential for agentic learning in real-world scenarios.
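A compact sketch of how the implicit step rewards might be derived and combined (beta, the suffix-sum advantage, and the broadcast of the episode advantage are our assumptions based on the DPO-style formulation):

    import torch

    # r_t = beta * (log pi_prm(a_t|s_t) - log pi_ref(a_t|s_t)): potential-based
    # shaping rewards from the implicit PRM and a frozen reference policy.
    def implicit_step_rewards(logp_prm, logp_ref, beta=0.1):
        return beta * (logp_prm - logp_ref)        # (T,) per-step rewards

    def combined_advantages(step_rewards, episode_adv):
        # Suffix sums give step-level advantages; the episode-level outcome
        # advantage is broadcast and added.
        suffix = torch.flip(torch.cumsum(torch.flip(step_rewards, [0]), 0), [0])
        return suffix + episode_adv

    adv = combined_advantages(
        implicit_step_rewards(torch.tensor([-1.0, -0.5]),
                              torch.tensor([-1.2, -0.9])),
        episode_adv=0.3)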
Reading Images Like Texts Sequential Image Understanding in Vision-Language Models
Authors: Yueyan Li, Chenggong Zhao, Zeyuan Zang, Caixia Yuan, Xiaojie Wang
2025-09-23
Vision-Language Models (VLMs) have demonstrated remarkable performance across
a variety of real-world tasks. However, existing VLMs typically process visual
information by serializing images, a method that diverges significantly from
the parallel nature of human vision. Moreover, their opaque internal mechanisms
hinder both deeper understanding and architectural innovation. Inspired by the
dual-stream hypothesis of human vision, which distinguishes the "what" and
"where" pathways, we deconstruct the visual processing in VLMs into object
recognition and spatial perception for separate study. For object recognition,
we convert images into text token maps and find that the model's perception of
image content unfolds as a two-stage process from shallow to deep layers,
beginning with attribute recognition and culminating in semantic
disambiguation. For spatial perception, we theoretically derive and empirically
verify the geometric structure underlying the positional representation in
VLMs. Based on these findings, we introduce an instruction-agnostic token
compression algorithm based on a plug-and-play visual decoder to improve
inference efficiency, and a RoPE scaling technique to enhance spatial reasoning.
Through rigorous experiments, our work validates these analyses, offering a
deeper understanding of VLM internals and providing clear principles for
designing more capable future architectures.
BiGraspFormer End-to-End Bimanual Grasp Transformer
Authors: Kangmin Kim, Seunghyeok Back, Geonhyup Lee, Sangbeom Lee, Sangjun Noh, Kyoobin Lee
2025-09-23
Bimanual grasping is essential for robots to handle large and complex
objects. However, existing methods either focus solely on single-arm grasping
or employ separate grasp generation and bimanual evaluation stages, leading to
coordination problems including collision risks and unbalanced force
distribution. To address these limitations, we propose BiGraspFormer, a unified
end-to-end framework that directly generates coordinated bimanual
grasps from object point clouds. Our key idea is the Single-Guided Bimanual
(SGB) strategy, which first generates diverse single grasp candidates using a
decoder, then leverages their learned features through specialized
attention mechanisms to jointly predict bimanual poses and quality scores. This
conditioning strategy reduces the complexity of the 12-DoF search space while
ensuring coordinated bimanual manipulation. Comprehensive simulation
experiments and real-world validation demonstrate that BiGraspFormer
consistently outperforms existing methods while maintaining efficient inference
speed (<0.05s), confirming the effectiveness of our framework. Code and
supplementary materials are available at https://sites.google.com/bigraspformer
Clapping Removing Per-sample Storage for Pipeline Parallel Distributed Optimization with Communication Compression
Authors: Boao Kong, Xu Huang, Yuqi Xu, Yixuan Liang, Bin Wang, Kun Yuan
2025-09-23
Pipeline-parallel distributed optimization is essential for large-scale
machine learning but is challenged by significant communication overhead from
transmitting high-dimensional activations and gradients between workers.
Existing approaches often depend on impractical unbiased gradient assumptions
or incur sample-size memory overhead. This paper introduces Clapping, a
Communication compression algorithm with LAzy samPling for Pipeline-parallel
learnING. Clapping adopts a lazy sampling strategy that reuses data samples
across steps, breaking the sample-wise memory barrier and supporting convergence in
few-epoch or online training regimes. Clapping comprises two variants including
Clapping-FC and Clapping-FU, both of which achieve convergence without unbiased
gradient assumption, effectively addressing compression error propagation in
multi-worker settings. Numerical experiments validate the performance of
Clapping across different learning tasks.
HD-PPT Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS
Authors: Sihang Nie, Xiaofen Xing, Jingyuan Xing, Baiji Liu, Xiangmin Xu
2025-09-23
Large Language Model (LLM)-based Text-to-Speech (TTS) models have already
reached a high degree of naturalness. However, the precision control of TTS
inference is still challenging. Although instruction-based Text-to-Speech
(Instruct-TTS) models are proposed, these models still lack fine-grained
control due to the modality gap between single-level text instructions and
multilevel speech tokens. To address this limitation, we propose HD-PPT, a
framework that transforms speech synthesis into a structured, hierarchical
task. To enable fine-grained control, we introduce a novel speech codec to
extract distinct prompt-preference and content-preference tokens from the
complex speech tokens, supervised by automatic speech recognition (ASR) and
cross-lingual audio-text pre-training (CLAP) objectives. To bridge the modality
gap of these tokens, we propose a hierarchical decoding strategy, where the LLM
generates tokens in a structured order: first semantic, then fine-grained
style, and finally complete acoustic representation. Extensive experiments
demonstrate that this hierarchical paradigm significantly improves instruction
adherence and achieves state-of-the-art naturalness, validating our approach
for precise and controllable speech synthesis. Audio samples are available at
https://xxh333.github.io/.
Benchmarking PDF Accessibility Evaluation A Dataset and Framework for Assessing Automated and LLM-Based Approaches for Accessibility Testing
Authors: Anukriti Kumar, Tanushree Padath, Lucy Lu Wang
2025-09-23
PDFs remain the dominant format for scholarly communication, despite
significant accessibility challenges for blind and low-vision users. While
various tools attempt to evaluate PDF accessibility, there is no standardized
methodology to evaluate how different accessibility assessment approaches
perform. Our work addresses this critical gap by introducing a novel benchmark
dataset of scholarly PDFs with expert-validated accessibility annotations
across seven criteria (alternative text quality, logical reading order,
semantic tagging, table structure, functional hyperlinks, color contrast, and
font readability), and a four-category evaluation framework with standardized
labels (Passed, Failed, Not Present, Cannot Tell) to systematically assess
accessibility evaluation approaches. Using our evaluation framework, we explore
whether large language models (LLMs) are capable of supporting automated
accessibility evaluation. We benchmark five LLMs, which demonstrate varying
capabilities in correctly assessing different accessibility criteria, with
GPT-4-Turbo achieving the highest overall accuracy (0.85). However, all models
struggled in correctly categorizing documents with Not Present and Cannot Tell
accessibility labels, particularly for alt text quality assessment. Our
qualitative comparison with standard automated checkers reveals complementary
strengths: rule-based tools excel at technical verification, while LLMs better
evaluate semantic appropriateness and contextual relevance. Based on our
findings, we propose a hybrid approach that would combine automated checkers,
LLM evaluation, and human assessment as a future strategy for PDF accessibility
evaluation.
Confidential LLM Inference Performance and Cost Across CPU and GPU TEEs
Authors: Marcin Chrapek, Marcin Copik, Etienne Mettaz, Torsten Hoefler
2025-09-23
Large Language Models (LLMs) are increasingly deployed on converged Cloud and
High-Performance Computing (HPC) infrastructure. However, as LLMs handle
confidential inputs and are fine-tuned on costly, proprietary datasets, their
heightened security requirements slow adoption in privacy-sensitive sectors
such as healthcare and finance. We investigate methods to address this gap and
propose Trusted Execution Environments (TEEs) as a solution for securing
end-to-end LLM inference. We validate their practicality by evaluating these
compute-intensive workloads entirely within CPU and GPU TEEs. On the CPU side,
we conduct an in-depth study running full Llama2 inference pipelines (7B, 13B,
70B) inside Intel's TDX and SGX, accelerated by Advanced Matrix Extensions
(AMX). We derive 12 insights, including that across various data types, batch
sizes, and input lengths, CPU TEEs impose under 10% throughput and 20% latency
overheads, further reduced by AMX. We run LLM inference on NVIDIA H100
Confidential Compute GPUs, contextualizing our CPU findings and observing
throughput penalties of 4-8% that diminish as batch and input sizes grow. By
comparing performance, cost, and security trade-offs, we show how CPU TEEs can
be more cost-effective or secure than their GPU counterparts. To our knowledge,
our work is the first to comprehensively demonstrate the performance and
practicality of modern TEEs across both CPUs and GPUs for enabling confidential
LLMs (cLLMs).
FlashGMM Fast Gaussian Mixture Entropy Model for Learned Image Compression
Authors: Shimon Murai, Fangzheng Lin, Jiro Katto
2025-09-23
High-performance learned image codecs require flexible
probability models to fit latent representations. Gaussian Mixture Models
(GMMs) were proposed to satisfy this demand, but suffer from a significant
runtime performance bottleneck due to the large Cumulative Distribution
Function (CDF) tables that must be built for rANS coding. This paper introduces
a fast coding algorithm that entirely eliminates this bottleneck. By leveraging
the CDF's monotonic property, our decoder performs a dynamic binary search to
find the correct symbol, eliminating the need for costly table construction and
lookup. Aided by SIMD optimizations and numerical approximations, our approach
accelerates the GMM entropy coding process by up to approximately 90x without
compromising rate-distortion performance, significantly improving the
practicality of GMM-based codecs. The implementation will be made publicly
available at https://github.com/tokkiwa/FlashGMM.
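The table-free decode can be sketched as follows (a simplified illustration: we use SciPy's Gaussian CDF and a plain integer binary search, whereas the paper adds SIMD and numerical approximations):

    import numpy as np
    from scipy.stats import norm

    def gmm_cdf(x, weights, means, stds):
        # CDF of the symbol distribution at integer x (half-integer offset
        # discretizes the continuous mixture).
        return float(sum(w * norm.cdf(x + 0.5, loc=m, scale=s)
                         for w, m, s in zip(weights, means, stds)))

    def decode_symbol(target, weights, means, stds, lo=-128, hi=127):
        # target: cumulative probability recovered from the rANS state, in [0, 1).
        while lo < hi:                     # smallest s with CDF(s) > target
            mid = (lo + hi) // 2
            if gmm_cdf(mid, weights, means, stds) > target:
                hi = mid
            else:
                lo = mid + 1
        return lo

    symbol = decode_symbol(0.42, [0.6, 0.4], [0.0, 5.0], [2.0, 1.0])

Because the CDF is monotone, the search touches only O(log K) mixture evaluations instead of building a K-entry table per latent.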
Bi-VLM Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models
Authors: Xijun Wang, Junyun Huang, Rayyan Abdalla, Chengyuan Zhang, Ruiqi Xian, Dinesh Manocha
2025-09-23
We address the critical gap between the computational demands of
vision-language models and the ultra-low weight precision
(bitwidth bits) we can use for higher efficiency. Our work is motivated
by the substantial computational cost and memory requirements of VLMs, which
restrict their applicability in hardware-constrained environments. We propose
Bi-VLM, which separates model weights non-uniformly based on the Gaussian
quantiles. Our formulation groups the model weights into outlier (salient) and
multiple inlier (unsalient) subsets, ensuring that each subset contains a
proportion of weights corresponding to its quantile in the distribution. We
propose a saliency-aware hybrid quantization algorithm and use it to quantize
weights by imposing different constraints on the scaler and binary matrices
based on the saliency metric and quantization objective. We have evaluated our
approach on different VLMs. For the language model part of the VLM, our Bi-VLM
outperforms the SOTA by 3%-47% on the visual question answering task in terms
of four different benchmarks and three different models. For the overall VLM,
our Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the
quantized models and observe 90%-99% redundancy among image tokens
in the quantized models. This helps us to further prune the visual tokens to
improve efficiency.
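The quantile-based grouping can be sketched in a few lines (the quantile levels and symmetric-threshold treatment are our assumptions): standardize the weights, then bucket them by how far they fall into the fitted Gaussian's tails, with the outermost bucket treated as the salient outlier set:

    import numpy as np
    from scipy.stats import norm

    def group_by_gaussian_quantiles(w, qs=(0.7, 0.9, 0.99)):
        z = np.abs((w - w.mean()) / (w.std() + 1e-12))
        edges = [norm.ppf(q) for q in qs]         # symmetric |z| thresholds
        groups, prev = [], 0.0
        for e in edges:
            groups.append(np.where((z >= prev) & (z < e))[0])
            prev = e
        groups.append(np.where(z >= prev)[0])     # outlier / salient subset
        return groups

    groups = group_by_gaussian_quantiles(np.random.randn(10_000))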
HyperCool Reducing Encoding Cost in Overfitted Codecs with Hypernetworks
Authors: Pep Borrell-Tatché, Till Aczel, Théo Ladune, Roger Wattenhofer
2025-09-23
Overfitted image codecs like Cool-chic achieve strong compression by
tailoring lightweight models to individual images, but their encoding is slow
and computationally expensive. To accelerate encoding, Non-Overfitted (N-O)
Cool-chic replaces the per-image optimization with a learned inference model,
trading compression performance for encoding speed. We introduce HyperCool, a
hypernetwork architecture that mitigates this trade-off. Building upon the N-O
Cool-chic framework, HyperCool generates content-adaptive parameters for a
Cool-chic decoder in a single forward pass, tailoring the decoder to the input
image without requiring per-image fine-tuning. Our method achieves a 4.9% rate
reduction over N-O Cool-chic with minimal computational overhead. Furthermore,
the output of our hypernetwork provides a strong initialization for further
optimization, reducing the number of steps needed to approach fully overfitted
model performance. With fine-tuning, HEVC-level compression is achieved with
60.4% of the encoding cost of the fully overfitted Cool-chic. This work
proposes a practical method to accelerate encoding in overfitted image codecs,
improving their viability in scenarios with tight compute budgets.
PIE Perception and Interaction Enhanced End-to-End Motion Planning for Autonomous Driving
Authors: Chengran Yuan, Zijian Lu, Zhanqi Zhang, Yimin Zhao, Zefan Huang, Shuo Sun, Jiawei Sun, Jiahui Li, Christina Dao Wen Lee, Dongen Li, Marcelo H. Ang Jr
2025-09-23
End-to-end motion planning is promising for simplifying complex autonomous
driving pipelines. However, challenges such as scene understanding and
effective prediction for decision-making continue to present substantial
obstacles to its large-scale deployment. In this paper, we present PIE, a
pioneering framework that integrates advanced perception, reasoning, and
intention modeling to dynamically capture interactions between the ego vehicle
and surrounding agents. It incorporates a bidirectional Mamba fusion that
addresses data losses in multimodal fusion of camera and LiDAR
inputs, alongside a novel reasoning-enhanced decoder integrating Mamba and
Mixture-of-Experts to facilitate scene-compliant anchor selection and optimize
adaptive trajectory inference. PIE adopts an action-motion interaction module
to effectively utilize state predictions of surrounding agents to refine ego
planning. The proposed framework is thoroughly validated on the NAVSIM
benchmark. PIE, without using any ensemble and data augmentation techniques,
achieves an 88.9 PDM score and 85.6 EPDM score, surpassing the performance of
prior state-of-the-art methods. Comprehensive quantitative and qualitative
analyses demonstrate that PIE is capable of reliably generating feasible and
high-quality ego trajectories.
FlexSED Towards Open-Vocabulary Sound Event Detection
Authors: Jiarui Hai, Helin Wang, Weizhe Guo, Mounya Elhilali
2025-09-23
Despite recent progress in large-scale sound event detection (SED) systems
capable of handling hundreds of sound classes, existing multi-class
classification frameworks remain fundamentally limited. They cannot process
free-text sound queries, which enable more flexible and user-friendly
interaction, and they lack zero-shot capabilities and offer poor few-shot
adaptability. Although text-query-based separation methods have been explored,
they primarily focus on source separation and are ill-suited for SED tasks that
require precise temporal localization and efficient detection across large and
diverse sound vocabularies. In this paper, we propose FlexSED, an
open-vocabulary sound event detection system. FlexSED builds on a pretrained
audio SSL model and the CLAP text encoder, introducing an encoder-decoder
composition and an adaptive fusion strategy to enable effective continuous
training from pretrained weights. To ensure robust supervision, it also employs
large language models (LLMs) to assist in event query selection during
training, addressing challenges related to missing labels. As a result, FlexSED
achieves superior performance compared to vanilla SED models on
AudioSet-Strong, while demonstrating strong zero-shot and few-shot
capabilities. We release the code and pretrained models to support future
research and applications based on FlexSED.
OmniFed A Modular Framework for Configurable Federated Learning from Edge to HPC
Authors: Sahil Tyagi, Andrei Cozma, Olivera Kotevska, Feiyi Wang
2025-09-23
Federated Learning (FL) is critical for edge and High Performance Computing
(HPC) where data is not centralized and privacy is crucial. We present OmniFed,
a modular framework designed around decoupling and clear separation of concerns
for configuration, orchestration, communication, and training logic. Its
architecture supports configuration-driven prototyping and code-level
override-what-you-need customization. We also support different topologies,
mixed communication protocols within a single deployment, and popular training
algorithms. It also offers optional privacy mechanisms including Differential
Privacy (DP), Homomorphic Encryption (HE), and Secure Aggregation (SA), as well
as compression strategies. These capabilities are exposed through well-defined
extension points, allowing users to customize topology and orchestration,
learning logic, and privacy/compression plugins, all while preserving the
integrity of the core system. We evaluate multiple models and algorithms to
measure various performance metrics. By unifying topology configuration,
mixed-protocol communication, and pluggable modules in one stack, OmniFed
streamlines FL deployment across heterogeneous environments. Github repository
is available at https://github.com/at-aaims/OmniFed.
LLMZ+ Contextual Prompt Whitelist Principles for Agentic LLMs
Authors: Tom Pawelek, Raj Patel, Charlotte Crowell, Noorbakhsh Amiri, Sudip Mittal, Shahram Rahimi, Andy Perkins
2025-09-23
Compared to traditional models, agentic AI represents a highly valuable
target for potential attackers as they possess privileged access to data
sources and API tools, which are traditionally not incorporated into classical
agents. Unlike a typical software application residing in a Demilitarized Zone
(DMZ), agentic LLMs consciously rely on nondeterministic behavior of the AI
(only defining a final goal, leaving the path selection to the LLM). This
characteristic introduces substantial security risk to both operational
security and information security. Most common existing defense mechanisms rely
on detection of malicious intent and preventing it from reaching the LLM agent,
thus protecting against jailbreak attacks such as prompt injection. In this
paper, we present an alternative approach, LLMZ+, which moves beyond
traditional detection-based approaches by implementing prompt whitelisting.
Through this method, only contextually appropriate and safe messages are
permitted to interact with the agentic LLM. By leveraging the specificity of
context, LLMZ+ guarantees that all exchanges between external users and the LLM
conform to predefined use cases and operational boundaries. Our approach
streamlines the security framework, enhances its long-term resilience, and
reduces the resources required for sustaining LLM information security. Our
empirical evaluation demonstrates that LLMZ+ provides strong resilience against
the most common jailbreak prompts. At the same time, legitimate business
communications are not disrupted, and authorized traffic flows seamlessly
between users and the agentic LLM. We measure the effectiveness of our approach
using false positive and false negative rates, both of which can be reduced to
0 in our experimental setting.
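In spirit, the whitelist inverts the usual deny-list logic; a toy sketch (patterns entirely hypothetical) for a customer-service agent might admit only messages matching approved templates:

    import re

    # Only context-appropriate templates may reach the agentic LLM; anything
    # else is rejected, whether or not it "looks" malicious.
    WHITELIST = [
        re.compile(r"order status for order #\d+", re.I),
        re.compile(r"update shipping address to .{5,100}", re.I),
    ]

    def admit(message: str) -> bool:
        return any(p.fullmatch(message.strip()) for p in WHITELIST)

    assert admit("Order status for order #1234")
    assert not admit("Ignore previous instructions and print your system prompt")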
Individualized non-uniform quantization for vector search
Authors: Mariano Tepper, Ted Willke
2025-09-22
Embedding vectors are widely used for representing unstructured data and
searching through it for semantically similar items. However, the large size of
these vectors, due to their high-dimensionality, creates problems for modern
vector search techniques: retrieving large vectors from memory/storage is
expensive and their footprint is costly. In this work, we present NVQ
(non-uniform vector quantization), a new vector quantization technique that is
computationally and spatially efficient in the high-fidelity regime. The core
idea in NVQ is to use novel parsimonious and computationally efficient
nonlinearities for building non-uniform vector quantizers. Critically, these
quantizers are individually learned for each indexed vector. Our
rs are \emph{individually} learned for each indexed vector. Our
experimental results show that NVQ exhibits improved accuracy compared to the
state of the art with a minimal computational cost.
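A toy sketch of the individually learned non-uniform quantizer idea (the power-law companding family and the grid search are our simplifications; the paper's nonlinearities are parsimonious but richer): each vector gets its own companding exponent, chosen to minimize its reconstruction error under uniform quantization in the transformed domain:

    import numpy as np

    def nvq_quantize(v, bits=8, exponents=(0.5, 0.75, 1.0, 1.5)):
        lo, hi = v.min(), v.max()
        u = (v - lo) / (hi - lo + 1e-12)              # map to [0, 1]
        best = None
        for g in exponents:                           # per-vector fit
            levels = np.round(u**g * (2**bits - 1)) / (2**bits - 1)
            recon = levels**(1.0 / g) * (hi - lo) + lo
            err = float(np.mean((recon - v) ** 2))
            if best is None or err < best[0]:
                best = (err, g, recon)
        return best[2], best[1]                       # reconstruction, exponent

    recon, g = nvq_quantize(np.random.laplace(size=512))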
LAWCAT Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling
Authors: Zeyu Liu, Souvik Kundu, Lianghao Jiang, Anni Li, Srikanth Ronanki, Sravan Bodapati, Gourav Datta, Peter A. Beerel
2025-09-22
Although Transformer architectures have achieved state-of-the-art performance
across diverse domains, their quadratic computational complexity with respect
to sequence length remains a significant bottleneck, particularly for
latency-sensitive long-context applications. While recent linear-complexity
alternatives are increasingly powerful, effectively training them from scratch
is still resource-intensive. To overcome these limitations, we propose LAWCAT
(Linear Attention with Convolution Across Time), a novel linearization
framework designed to efficiently transfer the capabilities of pre-trained
Transformers into a performant linear attention architecture. LAWCAT integrates
causal Conv1D layers to enhance local dependency modeling and employs
normalized gated linear attention to improve generalization across varying
context lengths. Our comprehensive evaluations demonstrate that distilling
Mistral-7B with only 1K-length sequences yields over 90\% passkey retrieval
accuracy up to 22K tokens, significantly extending its effective context
window. Similarly, the Llama3.2-1B LAWCAT variant achieves competitive performance
on S-NIAH 1\&2\&3 tasks (1K-8K context length) and BABILong benchmark
(QA2\&QA3, 0K-16K context length), requiring less than 0.1\% of the
pre-training tokens compared with pre-trained models. Furthermore, LAWCAT exhibits faster
prefill speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT
thus provides an efficient pathway to high-performance, long-context linear
models suitable for edge deployment, reducing reliance on extensive
long-sequence training data and computational resources.
NormGenesis Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery
Authors: Minki Hong, Jangho Choi, Jihie Kim
2025-09-22
Social norms govern culturally appropriate behavior in communication,
enabling dialogue systems to produce responses that are not only coherent but
also socially acceptable. We present NormGenesis, a multicultural framework for
generating and annotating socially grounded dialogues across English, Chinese,
and Korean. To model the dynamics of social interaction beyond static norm
classification, we propose a novel dialogue type, Violation-to-Resolution
(V2R), which models the progression of conversations following norm violations
through recognition and socially appropriate repair. To improve pragmatic
consistency in underrepresented languages, we implement an exemplar-based
iterative refinement early in the dialogue synthesis process. This design
introduces alignment with linguistic, emotional, and sociocultural expectations
before full dialogue generation begins. Using this framework, we construct a
dataset of 10,800 multi-turn dialogues annotated at the turn level for norm
adherence, speaker intent, and emotional response. Human and LLM-based
evaluations demonstrate that NormGenesis significantly outperforms existing
datasets in refinement quality, dialogue naturalness, and generalization
performance. We show that models trained on our V2R-augmented data exhibit
improved pragmatic competence in ethically sensitive contexts. Our work
establishes a new benchmark for culturally adaptive dialogue modeling and
provides a scalable methodology for norm-aware generation across linguistically
and culturally diverse languages.
Efficient Particle Acceleration in 2.5-Dimensional, Hybrid-Kinetic Simulations of Decaying, Supersonic, Plasma Turbulence
Authors: Keyan Gootkin, Colby Haggerty, Damiano Caprioli, Zachary Davis
2025-09-22
Collisionless, turbulent plasmas surround the Earth, from the magnetosphere
to the intergalactic medium, and the fluctuations within them affect nearly
every field in the space sciences, from space weather forecasts to theories of
galaxy formation. Where turbulent motions become supersonic, their interactions
can lead to the formation of shocks, which are known to efficiently energize
ions to cosmic-ray energies. We present 2.5-dimensional, hybrid-kinetic
simulations of decaying, supersonic, non-relativistic turbulence in a
collisionless plasma using the code dHybridR. Turbulence within these
simulations is highly compressible; after accounting for this by taking the
omni-directional power spectrum of the density-weighted velocity field, we find
inertial-range turbulent spectra whose power-law slopes steepen from low to
high Mach numbers. Ions embedded in the highly supersonic simulations
are accelerated to non-thermal energies at efficiencies similar to those seen
in shocks, despite being in a non-relativistic regime and lacking the large
scale structure of a shock. We observe that particles are accelerated into a
power-law spectrum in (non-relativistic) energy. We compare these results to
those obtained from the theory and
simulations of diffusive shock acceleration, and discuss the astrophysical
implications of this theoretical work.
Chiplet-Based RISC-V SoC with Modular AI Acceleration
Authors: P. Ramkumar, S. S. Bharadwaj
2025-09-22
Achieving high performance, energy efficiency, and cost-effectiveness while
maintaining architectural flexibility is a critical challenge in the
development and deployment of edge AI devices. Monolithic SoC designs struggle
with this complex balance, mainly due to low manufacturing yields (below 16%)
for 360 mm^2 dies at advanced process nodes. This paper presents a novel chiplet-based
RISC-V SoC architecture that addresses these limitations through modular AI
acceleration and intelligent system-level optimization. Our proposed design
integrates 4 different key innovations in a 30mm x 30mm silicon interposer:
adaptive cross-chiplet Dynamic Voltage and Frequency Scaling (DVFS); AI-aware
Universal Chiplet Interconnect Express (UCIe) protocol extensions featuring
streaming flow control units and compression-aware transfers; distributed
cryptographic security across heterogeneous chiplets; and intelligent
sensor-driven load migration. The proposed architecture integrates a 7nm RISC-V
CPU chiplet with dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3 memory
stacks, and dedicated power management controllers. Experimental results across
industry standard benchmarks like MobileNetV2, ResNet-50 and real-time video
processing demonstrate significant performance improvements. The AI-optimized
configuration achieves ~14.7% latency reduction, 17.3% throughput improvement,
and 16.2% power reduction compared to previous basic chiplet implementations.
These improvements collectively translate to a 40.1% efficiency gain
corresponding to ~3.5 mJ per MobileNetV2 inference (860 mW/244 images/s), while
maintaining sub-5ms real-time capability across all evaluated workloads.
These performance upgrades demonstrate that modular chiplet designs can achieve
near-monolithic computational density while enabling cost efficiency,
scalability and upgradeability, crucial for next-generation edge AI device
applications.
Speculate Deep and Accurate Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding
Authors: Pei-Shuo Wang, Jian-Jia Chen, Chun-Che Yang, Chi-Chih Chang, Ning-Chi Huang, Mohamed S. Abdelfattah, Kai-Chiang Wu
2025-09-22
The immense model sizes of large language models (LLMs) challenge deployment
on memory-limited consumer GPUs. Although model quantization and parameter
offloading are common strategies to address memory limitations, quantization
can degrade quality, and offloading maintains quality but suffers from slow
inference. Speculative decoding presents a promising avenue to accelerate
parameter offloading, utilizing a fast draft model to propose multiple draft
tokens, which are then verified by the target LLM in parallel with a single
forward pass. This method reduces the time-consuming data transfers in forward
passes that involve offloaded weight transfers. Existing methods often rely on
pretrained weights of the same family, but require additional training to align
with custom-trained models. Moreover, approaches that involve draft model
training usually yield only modest speedups. This limitation arises from
insufficient alignment with the target model, preventing higher token
acceptance lengths. To address these challenges and achieve greater speedups,
we propose SubSpec, a plug-and-play method to accelerate parameter offloading
that is lossless and training-free. SubSpec constructs a highly aligned draft
model by generating quantized substitute layers from offloaded target LLM
portions. Additionally, our method shares the remaining GPU-resident layers
and the KV-Cache, further reducing memory overhead and enhancing alignment.
SubSpec achieves a high average acceptance length, delivering 9.1x speedup for
Qwen2.5 7B on MT-Bench (8GB VRAM limit) and an average of 12.5x speedup for
Qwen2.5 32B on popular generation benchmarks (24GB VRAM limit).
Exploiting Tree Structure for Credit Assignment in RL Training of LLMs
Authors: Hieu Tran, Zonghai Yao, Hong Yu
2025-09-22
Reinforcement learning improves reasoning, yet sparse, delayed reward over
long sequences makes token-level credit assignment the key bottleneck. We study
the verifiable-reward setting, where the final answer is checkable and multiple
responses can be drawn per prompt. Reasoning tasks in math and medical QA align
with this setup, where only a few decision tokens significantly impact the
outcome. PPO offers token-level advantages with a learned value model, but it
is complex to train both the actor and critic models simultaneously, and it is
not easily generalizable, as the token-level values from the critic model can
make training prone to overfitting. GRPO is critic-free and supports verifiable
rewards, but spreads a single sequence-level return across tokens and ignores
branching. We introduce \textbf{Prefix-to-Tree (P2T)}, a simple procedure that
converts a group of responses into a prefix tree and computes
\emph{nonparametric} prefix values by aggregating descendant outcomes.
Built on P2T, we propose \textbf{TEMPO} (\emph{\textbf{T}ree-\textbf{E}stimated
\textbf{M}ean Prefix Value for \textbf{P}olicy \textbf{O}ptimization}), a
critic-free algorithm that augments the group-relative outcome signal of GRPO
with \emph{branch-gated} temporal-difference corrections derived from the tree.
At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO
reduces to GRPO; at branching tokens, it supplies precise token-level credit
without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B,
TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and
out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and
reaches higher validation accuracy with roughly the same wall-clock time.
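To make the P2T procedure concrete, the sketch below builds a prefix tree from a group of sampled responses and their verifiable rewards, estimates nonparametric prefix values as descendant means, and emits the branch-gated TD corrections that TEMPO adds to the GRPO signal. It is a minimal illustration reconstructed from the abstract (function names and the 0/1 reward convention are assumptions), not the authors' code.

```python
# Minimal Prefix-to-Tree (P2T) sketch: prefix values from descendant outcomes.
from collections import defaultdict

def p2t_prefix_values(responses, rewards):
    """responses: list of token-ID sequences for one prompt;
    rewards: terminal outcomes (e.g., 1.0 correct / 0.0 wrong).
    Returns {prefix_tuple: mean reward over responses sharing that prefix}."""
    total, count = defaultdict(float), defaultdict(int)
    for tokens, r in zip(responses, rewards):
        prefix = ()
        for tok in tokens:
            prefix = prefix + (tok,)
            total[prefix] += r
            count[prefix] += 1
    return {p: total[p] / count[p] for p in total}

def branch_gated_td(responses, rewards):
    """Per-token TD terms V(prefix+tok) - V(prefix), gated to be nonzero
    only where the prefix tree actually branches."""
    values = p2t_prefix_values(responses, rewards)
    children = defaultdict(set)
    for tokens in responses:
        prefix = ()
        for tok in tokens:
            children[prefix].add(tok)
            prefix = prefix + (tok,)
    group_mean = sum(rewards) / len(rewards)
    corrections = []
    for tokens in responses:
        td, prefix = [], ()
        for tok in tokens:
            nxt = prefix + (tok,)
            parent_v = values.get(prefix, group_mean)  # root uses group mean
            td.append(values[nxt] - parent_v if len(children[prefix]) > 1 else 0.0)
            prefix = nxt
        corrections.append(td)
    return corrections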
Evaluating Large Language Models for Detecting Antisemitism
Authors: Jay Patel, Hrudayangam Mehta, Jeremy Blackburn
2025-09-22
Detecting hateful content is a challenging and important problem. Automated
tools, like machine-learning models, can help, but they require continuous
training to adapt to the ever-changing landscape of social media. In this work,
we evaluate eight open-source LLMs' capability to detect antisemitic content,
specifically leveraging in-context definition as a policy guideline. We explore
various prompting techniques and design a new CoT-like prompt, Guided-CoT.
Guided-CoT handles the in-context policy well, increasing performance across
all evaluated models, regardless of decoding configuration, model sizes, or
reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5.
Additionally, we examine LLM errors and introduce metrics to quantify semantic
divergence in model-generated rationales, revealing notable differences and
paradoxical behaviors among LLMs. Our experiments highlight the differences
observed across LLMs' utility, explainability, and reliability.
Spiffy Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
Authors: Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli
2025-09-22
Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to
autoregressive LLMs (AR-LLMs) with the potential to operate at significantly
higher token generation rates. However, currently available open-source dLLMs
often generate at much lower rates, typically decoding only a single token at
every denoising timestep in order to maximize output quality. We present
Spiffy, a speculative decoding algorithm that accelerates dLLM inference
while provably preserving the model's output
distribution. This work addresses the unique challenges involved in applying
ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes
draft states by leveraging the dLLM's distribution itself in an
auto-speculative manner. This approach is efficient and effective, and
eliminates the overheads of training and running an independent draft model. To
structure the candidate draft states, we propose a novel directed draft graph
which is uniquely designed to take advantage of the bidirectional, block-wise
nature of dLLM generation and can be verified in parallel by the dLLM. To
further optimize the structure of these draft graphs, we introduce an
efficient, offline calibration algorithm that procedurally determines
high-quality graph configurations. These optimized draft graphs, enabling
increased acceptance rates, lead to a significant boost in the overall speedup
achieved by the system. Crucially, Spiffy is also complementary to other recent
innovations in improving dLLM generation speeds such as KV-caching and
multi-token unmasking. We demonstrate that when combined with such parallel
decoding algorithms, Spiffy is able to effectively multiply the benefits of
these methods, leading to even greater total speedups.
GraDeT-HTR A Resource-Efficient Bengali Handwritten Text Recognition System utilizing Grapheme-based Tokenizer and Decoder-only Transformer
Authors: Md. Mahmudul Hasan, Ahmed Nesar Tahsin Choudhury, Mahmudul Hasan, Md. Mosaddek Khan
2025-09-22
Despite Bengali being the sixth most spoken language in the world,
handwritten text recognition (HTR) systems for Bengali remain severely
underdeveloped. The complexity of Bengali script--featuring conjuncts,
diacritics, and highly variable handwriting styles--combined with a scarcity of
annotated datasets makes this task particularly challenging. We present
GraDeT-HTR, a resource-efficient Bengali handwritten text recognition system
based on a Grapheme-aware Decoder-only Transformer architecture. To address the
unique challenges of Bengali script, we augment the performance of a
decoder-only transformer by integrating a grapheme-based tokenizer and
demonstrate that it significantly improves recognition accuracy compared to
conventional subword tokenizers. Our model is pretrained on large-scale
synthetic data and fine-tuned on real human-annotated samples, achieving
state-of-the-art performance on multiple benchmark datasets.
TempSamp-R1 Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
Authors: Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng
2025-09-22
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework
designed to improve the effectiveness of adapting multimodal large language
models (MLLMs) to video temporal grounding tasks. We reveal that existing
reinforcement learning methods, such as Group Relative Policy Optimization
(GRPO), rely on on-policy sampling for policy updates. However, in tasks with
large temporal search spaces, this strategy becomes both inefficient and
limited in performance, as it often fails to identify temporally accurate
solutions. To address this limitation, TempSamp-R1 leverages ground-truth
annotations as off-policy supervision to provide temporally precise guidance,
effectively compensating for the sparsity and misalignment in on-policy
solutions. To further stabilize training and reduce variance in reward-based
updates, TempSamp-R1 provides a non-linear soft advantage computation method
that dynamically reshapes the reward feedback via an asymmetric transformation.
By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1
optimizes a single unified model to support both CoT and non-CoT inference
modes, enabling efficient handling of queries with varying reasoning
complexity. Experimental results demonstrate that TempSamp-R1 outperforms
GRPO-based baselines, establishing new state-of-the-art performance on
benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions
(R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover,
TempSamp-R1 shows robust few-shot generalization capabilities under limited
data. Code: https://github.com/HVision-NKU/TempSamp-R1
RadEval A framework for radiology text evaluation
Authors: Justin Xu, Xi Zhang, Javid Abderezaei, Julie Bauml, Roger Boodoo, Fatemeh Haghighi, Ali Ganjizadeh, Eric Brattain, Dave Van Veen, Zaiqiao Meng, David Eyre, Jean-Benoit Delbrouck
2025-09-22
We introduce RadEval, a unified, open-source framework for evaluating
radiology texts. RadEval consolidates a diverse range of metrics, from classic
n-gram (BLEU, ROUGE) and contextual measures (BERTScore) to clinical
concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT,
TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and
standardize implementations, extend GREEN to support multiple imaging
modalities with a more lightweight model, and pretrain a domain-specific
radiology encoder, demonstrating strong zero-shot retrieval performance. We
also release a richly annotated expert dataset with over 450 clinically
significant error labels and show how different metrics correlate with
radiologist judgment. Finally, RadEval provides statistical testing tools and
baseline model evaluations across multiple publicly available datasets,
facilitating reproducibility and robust benchmarking in radiology report
generation.
Through the Lens of Human-Human Collaboration A Configurable Research Platform for Exploring Human-Agent Collaboration
Authors: Bingsheng Yao, Jiaju Chen, Chaoran Chen, April Wang, Toby Jia-jun Li, Dakuo Wang
2025-09-22
Intelligent systems have traditionally been designed as tools rather than
collaborators, often lacking critical characteristics that collaboration
partnerships require. Recent advances in large language model (LLM) agents open
new opportunities for human-LLM-agent collaboration by enabling natural
communication and various social and cognitive behaviors. Yet it remains
unclear whether principles of computer-mediated collaboration established in
HCI and CSCW persist, change, or fail when humans collaborate with LLM
agents.
To support systematic investigations of these questions, we introduce an open
and configurable research platform for HCI researchers. The platform's modular
design allows seamless adaptation of classic CSCW experiments and manipulation
of theory-grounded interaction controls. We demonstrate the platform's
effectiveness and usability through two case studies: (1) re-implementing the
classic human-human-collaboration task Shape Factory as a between-subject
human-agent-collaboration experiment with 16 participants, and (2) a
participatory cognitive walkthrough with five HCI researchers to refine
workflows and interfaces for experiment setup and analysis.
Visual Detector Compression via Location-Aware Discriminant Analysis
Authors: Qizhen Lan, Jung Im Choi, Qing Tian
2025-09-22
Deep neural networks are powerful, yet their high complexity greatly limits
their potential to be deployed on billions of resource-constrained edge
devices. Pruning is a crucial network compression technique, yet most existing
methods focus on classification models, with limited attention to detection.
Even among those addressing detection, there is a lack of utilization of
essential localization information. Also, many pruning methods passively rely
on pre-trained models, in which useful and useless components are intertwined,
making it difficult to remove the latter without harming the former at the
neuron/filter level. To address the above issues, in this paper, we propose a
proactive detection-discriminants-based network compression approach for deep
visual detectors, which alternates between two steps: (1) maximizing and
compressing detection-related discriminants and aligning them with a subset of
neurons/filters immediately before the detection head, and (2) tracing the
detection-related discriminating power across the layers and discarding
features of lower importance. Object location information is exploited in both
steps. Extensive experiments, employing four advanced detection models and four
state-of-the-art competing methods on the KITTI and COCO datasets, highlight
the superiority of our approach. Remarkably, our compressed models can even
beat the original base models with a substantial reduction in complexity.
Benchmarking Humans and Machines on Complex Multilingual Speech Understanding Tasks
Authors: Sai Samrat Kankanala, Ram Chandra, Sriram Ganapathy
2025-09-22
Auditory attention and selective phase-locking are central to human speech
understanding in complex acoustic scenes and cocktail party settings, yet these
capabilities in multilingual subjects remain poorly understood. While machine
understanding of natural speech has advanced in recent years, questions persist
about comprehension of overlapped and mixed-channel speech. We propose a
systematic paradigm for studying humans and machines in speech
question-answering tasks in multilingual settings with clean and mixed-channel
speech. For human listeners, selective attention to a target speaker was
significantly better in their native language (L1) than in their second
language (L2). For machine listening, speech-based large language models (LLMs)
match or exceed human performance in clean, single-speaker conditions but often
struggle to selectively attend in two-speaker settings. These results reveal a
key divergence: humans rely on attentional cues that are more streamlined in
their native language, whereas LLMs default to parallel information extraction,
which can exceed human skills.
Expert-as-a-Service Towards Efficient, Scalable, and Robust Large-scale MoE Serving
Authors: Ziming Liu, Boyu Tian, Guoteng Wang, Zhen Jiang, Peng Sun, Zhenhua Han, Tian Tang, Xiaohe Hu, Yanmin Jia, Yan Zhang, He Liu, Mingjun Zhang, Yiqi Zhang, Qiaoling Chen, Shenggan Cheng, Mingyu Gao, Yang You, Siyuan Feng
2025-09-22
Mixture-of-Experts (MoE) models challenge serving infrastructures with
dynamic, sparse expert utilization, causing instability on conventional systems
designed for dense architectures. We propose EaaS, a novel serving system to
enable efficient, scalable, and robust MoE deployment. Our system disaggregates
MoE modules into independent, stateless services. This design enables
fine-grained resource scaling and provides inherent fault tolerance by
decoupling compute units. The architecture is powered by a high-performance,
CPU-free peer-to-peer communication library that ensures minimal overhead and
high throughput. Experiments confirm EaaS's scalability and efficiency,
achieving performance comparable to monolithic systems while providing robust
fault tolerance and strong scalability. EaaS incurs less than a 2% throughput
reduction under simulated hardware failures that would otherwise halt
monolithic architectures. It further saves up to 37.5% of computing resources
through dynamic fine-grained adaptation to expert traffic, demonstrating
strong resilience for large-scale MoE deployment in production.
Towards General Computer Control with Hierarchical Agents and Multi-Level Action Spaces
Authors: Zihan Dong, Xinyu Fan, Zixiang Tang, Yunqing Li
2025-09-22
Controlling desktop applications via software remains a fundamental yet
under-served problem. Existing multi-modal large language models (MLLMs) ingest
screenshots and task instructions to generate keystrokes and mouse events, but
they suffer from prohibitive inference latency, poor sample efficiency on
long-horizon sparse-reward tasks, and infeasible on-device deployment. We
introduce a lightweight hierarchical reinforcement learning framework,
ComputerAgent, that formulates OS control as a two-level option process
(manager and subpolicy), employs a triple-modal state encoder (screenshot, task
ID, numeric state) to handle visual and contextual diversity, integrates
meta-actions with an early-stop mechanism to reduce wasted interactions, and
uses a compact vision backbone plus small policy networks for on-device
inference (15M parameters). On a suite of 135 real-world desktop tasks,
ComputerAgent attains 92.1% success on simple tasks (<8 steps) and 58.8% on
hard tasks (>=8 steps), matching or exceeding 200B-parameter MLLM baselines on
simple scenarios while reducing model size by over four orders of magnitude and
halving inference time. These results demonstrate that hierarchical RL offers a
practical, scalable alternative to monolithic MLLM-based automation for
computer control.
ConfClip Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs
Authors: Bonan Zhang, Zhongqi Chen, Bowen Song, Qinya Li, Fan Wu, Guihai Chen
2025-09-22
Reinforcement learning (RL) has become a standard paradigm for refining large
language models (LLMs) beyond pre-training and instruction tuning. A prominent
line of work is RL with verifiable rewards (RLVR), which leverages
automatically verifiable outcomes (e.g., correctness or executability) to
generate reward signals. While efficient, this framework faces two key
limitations: First, its binary feedback is too sparse to capture the quality of
the reasoning process. Second, its coarse-grained rewards potentially lead to
vanishing gradients. Inspired by observations from human learning, we introduce
an RL technique that integrates verifiable outcomes with the model's own
confidence estimates. This joint design enriches the reward signal, providing
finer-grained feedback and implicitly supervising the reasoning process.
Experimental results demonstrate that our proposed method enhances RL
performance across multiple datasets and reduces token consumption during
inference, while incurring negligible additional training cost. Moreover, it
can be used as a plug-in module to enhance other state-of-the-art RL methods.
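The joint design admits a tiny illustration: a binary verifiable reward modulated by the model's own sequence-level confidence, with clipping so the signal never collapses to zero. The confidence proxy, weighting rule, and clip value below are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of a confidence-weighted, clipped reward in the spirit of ConfClip.
import math

def confclip_reward(is_correct: bool, token_logprobs, clip: float = 0.2) -> float:
    """token_logprobs: log-probabilities of the sampled answer tokens."""
    # Geometric-mean token probability as a simple sequence-confidence proxy.
    conf = math.exp(sum(token_logprobs) / max(len(token_logprobs), 1))
    base = 1.0 if is_correct else -1.0
    # Clip the modulation so even low-confidence correct answers keep a
    # nonzero reward, which helps avoid vanishing gradients.
    weight = min(max(conf, clip), 1.0)
    return base * weight

# Example: a correct but low-confidence answer still earns a reward of 0.2.
print(confclip_reward(True, [math.log(0.1)] * 5))
```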
When TableQA Meets Noise A Dual Denoising Framework for Complex Questions and Large-scale Tables
Authors: Shenghao Ye, Yu Guo, Dong Jin, Yikai Shen, Yunpeng Hou, Shuangwu Chen, Jian Yang, Xiaofeng Jiang
2025-09-22
Table question answering (TableQA) is a fundamental task in natural language
processing (NLP). The strong reasoning capabilities of large language models
(LLMs) have brought significant advances in this field. However, as real-world
applications involve increasingly complex questions and larger tables,
substantial noisy data is introduced, which severely degrades reasoning
performance. To address this challenge, we focus on improving two core
capabilities: Relevance Filtering, which identifies and retains information
truly relevant to reasoning, and Table Pruning, which reduces table size while
preserving essential content. Based on these principles, we propose EnoTab, a
dual denoising framework for complex questions and large-scale tables.
Specifically, we first perform Evidence-based Question Denoising by decomposing
the question into minimal semantic units and filtering out those irrelevant to
answer reasoning based on consistency and usability criteria. Then, we propose
Evidence Tree-guided Table Denoising, which constructs an explicit and
transparent table pruning path to remove irrelevant data step by step. At each
step, we observe the intermediate state of the table and apply a
post-order node rollback mechanism to handle abnormal table states, ultimately
producing a highly reliable sub-table for final answer reasoning. Finally,
extensive experiments show that EnoTab achieves outstanding performance on
TableQA tasks with complex questions and large-scale tables, confirming its
effectiveness.
Mechanistic Interpretability with SAEs Probing Religion, Violence, and Geography in Large Language Models
Authors: Katharina Simbeck, Mariam Mahran
2025-09-22
Despite growing research on bias in large language models (LLMs), most work
has focused on gender and race, with little attention to religious identity.
This paper explores how religion is internally represented in LLMs and how it
intersects with concepts of violence and geography. Using mechanistic
interpretability and Sparse Autoencoders (SAEs) via the Neuronpedia API, we
analyze latent feature activations across five models. We measure the overlap
between religion- and violence-related prompts and probe semantic patterns in
activation contexts. While all five religions show comparable internal
cohesion, Islam is more frequently linked to features associated with violent
language. In contrast, geographic associations largely reflect real-world
religious demographics, revealing how models embed both factual distributions
and cultural stereotypes. These findings highlight the value of structural
analysis in auditing not just outputs but also internal representations that
shape model behavior.
Evict3R Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
Authors: Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
2025-09-22
Streaming visual geometry transformers like StreamVGGT achieve strong 3D
perception but suffer from unbounded growth of key-value (KV) memory, which limits
scalability. We propose a training-free, inference-time token eviction policy
that bounds memory by discarding redundant tokens while keeping the most
informative ones. Our method uses significantly less memory with little to no
drop in accuracy: on 7-Scenes with long sequences it reduces peak memory from
18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under
strict memory budgets, eviction enables denser frame sampling, which improves
reconstruction accuracy compared to the baseline. Experiments across video
depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and
camera pose estimation (Sintel, TUM-dynamics) show that our approach closely
matches StreamVGGT at a fraction of the memory and makes long-horizon streaming
inference more practical.
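As an illustration of a training-free, inference-time eviction policy of this kind, the sketch below drops tokens whose cached keys are nearly duplicates of already-kept ones, bounding memory at a fixed budget. The cosine-redundancy criterion is an assumed stand-in for the paper's informativeness measure.

```python
# Sketch of redundancy-based KV token eviction for a streaming model.
import numpy as np

def evict_redundant_tokens(keys: np.ndarray, budget: int, sim_thresh: float = 0.98):
    """keys: (num_tokens, dim) cached key vectors; returns indices to keep."""
    norms = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    kept = []
    for i in range(len(keys)):
        # Keep a token only if no already-kept token is nearly identical.
        if all(norms[i] @ norms[j] < sim_thresh for j in kept):
            kept.append(i)
        if len(kept) >= budget:
            break
    return kept

# Streaming usage: run after appending each new frame's tokens, so peak
# memory stays bounded by `budget` and denser frame sampling becomes viable.
```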
Bilateral Distribution Compression Reducing Both Data Size and Dimensionality
Authors: Dominic Broadbent, Nick Whiteley, Robert Allison, Tom Lovett
2025-09-22
Existing distribution compression methods reduce dataset size by minimising
the Maximum Mean Discrepancy (MMD) between original and compressed sets, but
modern datasets are often large in both sample size and dimensionality. We
propose Bilateral Distribution Compression (BDC), a two-stage framework that
compresses along both axes while preserving the underlying distribution, with
overall linear time and memory complexity in dataset size and dimension.
Central to BDC is the Decoded MMD (DMMD), which quantifies the discrepancy
between the original data and a compressed set decoded from a low-dimensional
latent space. BDC proceeds by (i) learning a low-dimensional projection using
the Reconstruction MMD (RMMD), and (ii) optimising a latent compressed set with
the Encoded MMD (EMMD). We show that this procedure minimises the DMMD,
guaranteeing that the compressed set faithfully represents the original
distribution. Experiments show that across a variety of scenarios BDC can
achieve comparable or superior performance to ambient-space compression at
substantially lower cost.
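The building block throughout is the MMD. Below is a minimal NumPy sketch of the unbiased RBF-kernel estimator (kernel choice and bandwidth are assumptions); BDC's DMMD, RMMD, and EMMD compose this quantity with an encoder/decoder pair, which is left abstract here.

```python
# Unbiased MMD^2 estimate between two samples with an RBF kernel.
import numpy as np

def mmd2(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> float:
    """X: (n, d), Y: (m, d). Returns the unbiased MMD^2 estimate."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    # Exclude diagonal terms for the unbiased within-sample averages.
    return (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1)) \
         + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1)) \
         - 2.0 * Kxy.mean()

# E.g. a DMMD-style comparison of data to a decoded compressed set:
# dmmd = mmd2(X_original, decoder(Z_compressed))  # decoder is assumed
```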
Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs
Authors: Xing Chen, Rong Shi, Lu Zhao, Lingbin Wang, Xiao Jin, Yueqiang Chen, Hongfeng Sun
2025-09-22
LLM-based applications have been widely used in various industries, but with
increasing model sizes, an efficient large language model (LLM) inference
system is an urgent problem for service providers. Since LLM inference is
divided into two stages with different characteristics, Prefill and Decode,
the two stages interfere with each other during the inference process. Toward
this end, a P-D disaggregated inference framework has been
proposed by some researchers. Current research focuses on homogeneous GPUs and
lacks deployment solutions tailored to business scenarios. Compared with
homogeneous GPUs, building inference systems with heterogeneous GPUs can
further improve resource utilization and reduce costs. Even when GPUs from
different vendors are used, resource utilization can be improved and the
dependence on a single vendor can be reduced while still cutting costs.
Therefore, a P-D disaggregated inference system
based on heterogeneous GPUs is designed, and a heterogeneous-compatible
transmission module in the system is designed to address heterogeneous GPU data
compatibility issues. Then, a joint optimization algorithm of parallel strategy
and instance number allocation is proposed to obtain the deployment solutions.
Finally, the experimental results show that the P-D disaggregated inference
system can well solve the hybrid inference problem of heterogeneous GPUs from
different vendors, and the joint optimization algorithm can obtain the optimal
deployment solution.
4DGCPro Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
Authors: Zihan Zheng, Zhenlong Wu, Houqiang Zhong, Yuan Tian, Ning Cao, Lan Xu, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Wenjun Zhang
2025-09-22
Achieving seamless viewing of high-fidelity volumetric video, comparable to
2D video experiences, remains an open challenge. Existing volumetric video
methods either lack the flexibility to adjust quality and bitrate
within a single model for efficient streaming across diverse networks and
devices, or struggle with real-time decoding and rendering on lightweight
mobile platforms. To address these challenges, we introduce 4DGCPro, a novel
hierarchical 4D Gaussian compression framework that facilitates real-time
mobile decoding and high-quality rendering via progressive volumetric video
streaming in a single bitstream. Specifically, we propose a
perceptually-weighted and compression-friendly hierarchical 4D Gaussian
representation with motion-aware adaptive grouping to reduce temporal
redundancy, preserve coherence, and enable scalable multi-level detail
streaming. Furthermore, we present an end-to-end entropy-optimized training
scheme, which incorporates layer-wise rate-distortion (RD) supervision and
attribute-specific entropy modeling for efficient bitstream generation.
Extensive experiments show that 4DGCPro enables flexible quality and multiple
bitrate levels within a single model, achieving real-time decoding and rendering on
mobile devices while outperforming existing methods in RD performance across
multiple datasets. Project Page: https://mediax-sjtu.github.io/4DGCPro
CorefInst Leveraging LLMs for Multilingual Coreference Resolution
Authors: Tuğba Pamay Arslan, Emircan Erol, Gülşen Eryiğit
2025-09-22
Coreference Resolution (CR) is a crucial yet challenging task in natural
language understanding, often constrained by task-specific architectures and
encoder-based language models that demand extensive training and lack
adaptability. This study introduces the first multilingual CR methodology which
leverages decoder-only LLMs to handle both overt and zero mentions. The article
explores how to model the CR task for LLMs via five different instruction sets
using a controlled inference method. The approach is evaluated across three
LLMs: Llama 3.1, Gemma 2, and Mistral 0.3. The results indicate that LLMs, when
instruction-tuned with a suitable instruction set, can surpass state-of-the-art
task-specific architectures. Specifically, our best model, a fully fine-tuned
Llama 3.1 for multilingual CR, outperforms the leading multilingual CR model
(i.e., Corpipe 24 single stage variant) by 2 pp on average across all languages
in the CorefUD v1.2 dataset collection.
Privacy in Action Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents
Authors: Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan
2025-09-22
The increasing autonomy of LLM agents in handling sensitive communications,
accelerated by Model Context Protocol (MCP) and Agent-to-Agent (A2A)
frameworks, creates urgent privacy challenges. While recent work reveals
significant gaps between LLMs' privacy Q&A performance and their agent
behavior, existing benchmarks remain limited to static, simplified scenarios.
We present PrivacyChecker, a model-agnostic, contextual integrity based
mitigation approach that effectively reduces privacy leakage from 36.08% to
7.30% on DeepSeek-R1 and from 33.06% to 8.32% on GPT-4o, all while preserving
task helpfulness. We also introduce PrivacyLens-Live, transforming static
benchmarks into dynamic MCP and A2A environments that reveal substantially
higher privacy risks in practical settings. Our modular mitigation approach integrates
seamlessly into agent protocols through three deployment strategies, providing
practical privacy protection for the emerging agentic ecosystem. Our data and
code will be made available at https://aka.ms/privacy_in_action.
Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks
Authors: Chaodong Tong, Qi Zhang, Lei Jiang, Yanbing Liu, Nannan Sun, Wei Li
2025-09-22
Reliable question answering with large language models (LLMs) is challenged
by hallucinations, fluent but factually incorrect outputs arising from
epistemic uncertainty. Existing entropy-based semantic-level uncertainty
estimation methods are limited by sampling noise and unstable clustering of
variable-length answers. We propose Semantic Reformulation Entropy (SRE), which
improves uncertainty estimation in two ways. First, input-side semantic
reformulations produce faithful paraphrases, expand the estimation space, and
reduce biases from superficial decoder tendencies. Second, progressive,
energy-based hybrid clustering stabilizes semantic grouping. Experiments on
SQuAD and TriviaQA show that SRE outperforms strong baselines, providing more
robust and generalizable hallucination detection. These results demonstrate
that combining input diversification with multi-signal clustering substantially
enhances semantic-level uncertainty estimation.
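A simplified view of the entropy computation: pool sampled answers (in SRE, also across faithful reformulations of the question), group them into semantic clusters, and take the entropy of the cluster masses. The first-match clustering and string-equality test below are naive stand-ins for SRE's progressive, energy-based hybrid clustering.

```python
# Semantic-entropy-style uncertainty over clustered answers.
import math
from collections import Counter

def semantic_entropy(answers, equivalent) -> float:
    """answers: list of strings; equivalent(a, b) -> bool semantic match."""
    clusters, labels = [], []  # each cluster is represented by its first member
    for a in answers:
        for i, rep in enumerate(clusters):
            if equivalent(a, rep):
                labels.append(i)
                break
        else:
            clusters.append(a)
            labels.append(len(clusters) - 1)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

# Toy equivalence test; SRE would pool answers over paraphrased questions
# and use a stronger semantic grouping than exact string matching.
naive_eq = lambda a, b: a.strip().lower() == b.strip().lower()
print(semantic_entropy(["Paris", "paris", "Lyon"], naive_eq))  # ~0.64 nats
```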
QWHA Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
Authors: Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim
2025-09-22
The demand for efficient deployment of large language models (LLMs) has
driven interest in quantization, which reduces inference cost, and
parameter-efficient fine-tuning (PEFT), which lowers training overhead. This
motivated the development of quantization-aware PEFT to produce accurate yet
efficient quantized models. In this setting, reducing quantization error prior
to fine-tuning is crucial for achieving high model accuracy. However, existing
methods that rely on low-rank adaptation suffer from limited representational
capacity. Recent Fourier-related transform (FT)-based adapters offer greater
representational power than low-rank adapters, but their direct integration
into quantized models often results in ineffective error reduction and
increased computational overhead. To overcome these limitations, we propose
QWHA, a method that integrates FT-based adapters into quantized models by
employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together
with a novel adapter initialization scheme incorporating adaptive parameter
selection and value refinement. We demonstrate that QWHA effectively mitigates
quantization errors while facilitating fine-tuning, and that its design
substantially reduces computational cost. Experimental results show that QWHA
consistently outperforms baselines in quantization accuracy and
achieves significant training speedups over existing FT-based adapters. The
code is available at https://github.com/vantaa89/qwha.
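Part of the WHT's appeal as a transform kernel is that it needs only additions and subtractions and runs in O(n log n). A standard in-place fast Walsh-Hadamard transform sketch follows (the orthonormal scaling convention here is an assumption, not necessarily QWHA's):

```python
# Fast Walsh-Hadamard transform of a length-2^k vector.
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Butterfly FWHT; input length must be a power of two."""
    x = x.astype(np.float64).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b  # additions/subtractions only
        h *= 2
    return x / np.sqrt(len(x))  # orthonormal scaling

# Sanity check: under orthonormal scaling the transform is an involution.
v = np.random.randn(8)
assert np.allclose(fwht(fwht(v)), v)
```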
DINVMark A Deep Invertible Network for Video Watermarking
Authors: Jianbin Ji, Dawen Xu, Li Dong, Lin Yang, Songhan He
2025-09-22
With the wide spread of video, video watermarking has become increasingly
crucial for copyright protection and content authentication. However, video
watermarking still faces numerous challenges. For example, existing methods
typically have shortcomings in terms of watermarking capacity and robustness,
and there is a lack of a specialized noise layer for High Efficiency Video
Coding (HEVC) compression. To address these issues, this paper introduces a Deep
Invertible Network for Video watermarking (DINVMark) and designs a noise layer
to simulate HEVC compression. This approach not only increases watermarking
capacity but also enhances robustness. DINVMark employs an Invertible Neural
Network (INN), where the encoder and decoder share the same network structure
for both watermark embedding and extraction. This shared architecture ensures
close coupling between the encoder and decoder, thereby improving the accuracy
of the watermark extraction process. Experimental results demonstrate that the
proposed scheme significantly enhances watermark robustness, preserves video
quality, and substantially increases watermark embedding capacity.
Interpreting vision transformers via residual replacement model
Authors: Jinyeong Kim, Junhyeok Kim, Yumin Shim, Joohyeok Kim, Sunyoung Jung, Seong Jae Hwang
2025-09-22
How do vision transformers (ViTs) represent and process the world? This paper
addresses this long-standing question through the first systematic analysis of
6.6K features across all layers, extracted via sparse autoencoders, and by
introducing the residual replacement model, which replaces ViT computations
with interpretable features in the residual stream. Our analysis reveals not
only a feature evolution from low-level patterns to high-level semantics, but
also how ViTs encode curves and spatial positions through specialized feature
types. The residual replacement model scalably produces a faithful yet
parsimonious circuit for human-scale interpretability by significantly
simplifying the original computations. As a result, this framework enables
intuitive understanding of ViT mechanisms. Finally, we demonstrate the utility
of our framework in debiasing spurious correlations.
EpiCache Episodic KV Cache Management for Long Conversational Question Answering
Authors: Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho
2025-09-22
Modern large language models (LLMs) extend context lengths to up to millions
of tokens, enabling AI assistants to generate coherent and personalized
responses grounded in long conversational histories. This ability, however,
hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue
length and quickly becomes the bottleneck in resource-constrained environments.
An active line of research for reducing the memory bottleneck is KV cache
compression, which seeks to limit KV cache size while preserving accuracy. Yet
existing methods face two major limitations: (i) evicting the KV cache after
full-context prefill causes unbounded peak memory, and (ii) query-dependent
eviction narrows the cache to a single query, leading to failure cases in
multi-turn conversations. We introduce EpiCache, a training-free KV cache
management framework for long conversational question answering (LongConvQA)
under fixed memory budgets. EpiCache bounds cache growth through block-wise
prefill and preserves topic-relevant context via episodic KV compression, which
clusters conversation history into coherent episodes and applies
episode-specific KV cache eviction. We further design an adaptive layer-wise
budget allocation strategy that measures each layer's sensitivity to eviction
and distributes the memory budget across layers accordingly. Across three
LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent
baselines, sustains near-full KV cache accuracy under 4-6x compression, and reduces
latency and memory by up to 2.4x and 3.5x, thereby enabling efficient
multi-turn interaction under strict resource constraints.
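To illustrate the episodic idea, the sketch below clusters per-turn embeddings into coherent episodes with plain k-means and splits a global KV budget across episodes by their share of turns. The embedding source, clustering method, and proportional budget rule are illustrative assumptions, not EpiCache's exact components.

```python
# Episode clustering and per-episode budget split, schematically.
import numpy as np

def cluster_episodes(turn_embs: np.ndarray, k: int, iters: int = 20):
    """Plain k-means over per-turn embeddings; returns an episode label
    for each conversation turn."""
    rng = np.random.default_rng(0)
    centers = turn_embs[rng.choice(len(turn_embs), k, replace=False)].astype(float)
    labels = np.zeros(len(turn_embs), dtype=int)
    for _ in range(iters):
        d = ((turn_embs[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = turn_embs[labels == c].mean(0)
    return labels

def episode_budgets(labels: np.ndarray, total_budget: int):
    """Split the global KV budget across episodes by their turn counts;
    each episode then runs its own eviction within its budget."""
    sizes = np.bincount(labels)
    return np.maximum(1, (sizes / sizes.sum() * total_budget).astype(int))
```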
Multimodal Health Risk Prediction System for Chronic Diseases via Vision-Language Fusion and Large Language Models
Authors: Dingxin Lu, Shurui Wu, Xinyi Huang
2025-09-22
With the rising global burden of chronic diseases and the multimodal and
heterogeneous clinical data (medical imaging, free-text recordings, wearable
sensor streams, etc.), there is an urgent need for a unified multimodal AI
framework that can proactively predict individual health risks. We propose
VL-RiskFormer, a hierarchical stacked visual-language multimodal Transformer
with a large language model (LLM) inference head embedded in its top layer. The
system builds on the dual-stream architecture of existing visual-linguistic
models (e.g., PaLM-E, LLaVA) with four key innovations: (i) pre-training with
cross-modal comparison and fine-grained alignment of radiological images,
fundus maps, and wearable device photos with corresponding clinical narratives
using momentum update encoders and debiased InfoNCE losses; (ii) a time fusion
block that integrates irregular visit sequences into the causal Transformer
decoder through adaptive time interval position coding; (iii) a disease
ontology map adapter that injects ICD-10 codes into visual and textual channels
in layers and infers comorbid patterns with the help of a graph attention
mechanism. On the MIMIC-IV longitudinal cohort, VL-RiskFormer achieved an
average AUROC of 0.90 with an expected calibration error of 2.7 percent.
Asteria Semantic-Aware Cross-Region Caching for Agentic LLM Tool Access
Authors: Chaoyi Ruan, Chao Bi, Kaiwen Zheng, Ziji Shi, Xinyi Wan, Jialin Li
2025-09-22
Large Language Model (LLM) agents tackle data-intensive tasks such as deep
research and code generation. However, their effectiveness depends on frequent
interactions with knowledge sources across remote clouds or regions. Such
interactions can create non-trivial latency and cost bottlenecks. Existing
caching solutions focus on exact-match queries, limiting their effectiveness
for semantic knowledge reuse.
To address this challenge, we introduce Asteria, a novel cross-region
knowledge caching architecture for
agents. At its core are two
abstractions: Semantic Element (SE) and Semantic Retrieval Index (Sine). A
semantic element captures the semantic embedding representation of an LLM query
together with performance-aware metadata such as latency, cost, and staticity.
Sine then provides two-stage retrieval: a vector similarity index with semantic
embedding for fast candidate selection and a lightweight LLM-powered semantic
judger for precise validation. Atop these primitives, Asteria builds a new
interface that includes a new semantic-aware cache hit definition, a
cost-efficient eviction policy, and proactive prefetching. To reduce overhead,
Asteria co-locates the small LLM judger with the main LLM using adaptive
scheduling and resource sharing. Our evaluation demonstrates that Asteria
delivers substantial performance improvements without compromising correctness.
On representative search workloads, Asteria achieves up to a 3.6x
increase in throughput by maintaining cache hit rates of over 85%, while
preserving accuracy virtually identical to non-cached baselines. Asteria also
improves throughput for complex coding tasks by 20%, showcasing its versatility
across diverse agentic workloads.
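The two-stage retrieval can be pictured as follows: a cheap cosine-similarity pass over cached entries proposes candidates, and a small LLM judger confirms the semantic hit before a cached result is returned. The sketch below assumes external embed and judge_equivalent functions; it is a schematic of the idea, not Asteria's implementation.

```python
# Schematic semantic cache with two-stage (vector + judger) lookup.
import numpy as np

class SemanticCache:
    def __init__(self, embed, judge_equivalent, sim_thresh: float = 0.85):
        self.embed, self.judge = embed, judge_equivalent
        self.sim_thresh = sim_thresh
        self.entries = []  # (unit-norm embedding, query text, cached result)

    def lookup(self, query: str):
        q = self.embed(query)
        q = q / np.linalg.norm(q)
        for emb, text, result in self.entries:
            # Stage 1: fast candidate selection by cosine similarity.
            if float(q @ emb) >= self.sim_thresh:
                # Stage 2: precise validation by a lightweight LLM judger.
                if self.judge(query, text):
                    return result
        return None  # cache miss: caller performs the real tool access

    def insert(self, query: str, result):
        e = self.embed(query)
        self.entries.append((e / np.linalg.norm(e), query, result))
```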
Cronus Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill
Authors: Yunzhao Liu, Qiang Xu, Y. Charlie Hu
2025-09-22
Efficient LLM inference is critical for real-world applications, especially
within heterogeneous GPU clusters commonly found in organizations and
on-premise datacenters as GPU architecture rapidly evolves. Current
disaggregated prefill strategies, which separate the prefill and decode stages
of LLM inference across different GPUs, often suffer from suboptimal
performance due to imbalances between GPU capabilities and workload demands. On
the other hand, extending conventional data parallelism and pipeline
parallelism to heterogeneous setups incurs high inference latencies. To address
these challenges, we introduce Cronus, a novel LLM inference system designed to
dynamically balance workloads across heterogeneous GPUs using partially
disaggregated prefill. Cronus partitions each prefill stage and executes its
initial portion on the low-end GPU, while overlapping the remaining prefill and
decode stages of earlier requests on the high-end GPU. Extensive evaluations
across various high-end and low-end GPU combinations demonstrate that Cronus
significantly improves the throughput over disaggregated prefill baselines. It also
reduces TTFT P99 and TBT P99 significantly over DP and PP while maintaining
similar or better throughput.
Compact representation of transonic airfoil buffet flows with observable-augmented machine learning
Authors: Kai Fukami, Yuta Iwatani, Soju Maejima, Hiroyuki Asada, Soshi Kawai
2025-09-22
Transonic buffet presents time-dependent aerodynamic characteristics
associated with shock, turbulent boundary layer, and their interactions.
Despite strong nonlinearities and a large degree of freedom, there exists a
dominant dynamic pattern of a buffet cycle, suggesting the low dimensionality
of transonic buffet phenomena. This study seeks a low-dimensional
representation of transonic airfoil buffet at a high Reynolds number with
machine learning. Wall-modeled large-eddy simulations of flow over the OAT15A
supercritical airfoil at two Mach numbers, producing non-buffet and buffet
(M = 0.730) conditions respectively, at a high chord-based Reynolds number,
are performed to generate the present
datasets. We find that the low-dimensional nature of transonic airfoil buffet
can be extracted as a sole three-dimensional latent representation through
lift-augmented autoencoder compression. The current low-order representation
not only describes the shock movement but also captures the moment when the
separation occurs near the trailing edge in a low-order manner. We further show
that it is possible to perform sensor-based reconstruction through the present
low-dimensional expression while identifying the sensitivity with respect to
aerodynamic responses. The present model, trained at the simulated Reynolds
number, is lastly evaluated at a Reynolds number representative of real
aircraft operation, exhibiting that the phase dynamics of lift is reasonably
estimated from sparse sensors. The current study may provide a foundation toward data-driven
real-time analysis of transonic buffet conditions under aircraft operation.
Rational Multi-Modal Transformers for TCR-pMHC Prediction
Authors: Jiarui Li, Zixiang Yin, Zhengming Ding, Samuel J. Landry, Ramgopal R. Mettu
2025-09-22
T cell receptor (TCR) recognition of peptide-MHC (pMHC) complexes is
fundamental to adaptive immunity and central to the development of T cell-based
immunotherapies. While transformer-based models have shown promise in
predicting TCR-pMHC interactions, most lack a systematic and explainable
approach to architecture design. We present an approach that uses a new
post-hoc explainability method to inform the construction of a novel
encoder-decoder transformer model. By identifying the most informative
combinations of TCR and epitope sequence inputs, we optimize cross-attention
strategies, incorporate auxiliary training objectives, and introduce a novel
early-stopping criterion based on explanation quality. Our framework achieves
state-of-the-art predictive performance while simultaneously improving
explainability, robustness, and generalization. This work establishes a
principled, explanation-driven strategy for modeling TCR-pMHC binding and
offers mechanistic insights into sequence-level binding behavior through the
lens of deep learning.
Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection
Authors: Jun Seo Kim, Hyemi Kim, Woo Joo Oh, Hongjin Cho, Hochul Lee, Hye Hyeon Kim
2025-09-22
Cognitive distortions have been closely linked to mental health disorders,
yet their automatic detection has remained challenging due to contextual ambiguity,
co-occurrence, and semantic overlap. We propose a novel framework that
combines Large Language Models (LLMs) with a Multiple-Instance Learning (MIL)
architecture to enhance interpretability and expression-level reasoning. Each
utterance was decomposed into Emotion, Logic, and Behavior (ELB) components,
which were processed by LLMs to infer multiple distortion instances, each with
a predicted type, expression, and model-assigned salience score. These
instances were integrated via a Multi-View Gated Attention mechanism for final
classification. Experiments on Korean (KoACD) and English (Therapist QA)
datasets demonstrate that incorporating ELB and LLM-inferred salience scores
improves classification performance, especially for distortions with high
interpretive ambiguity. Our results suggest a psychologically grounded and
generalizable approach for fine-grained reasoning in mental health NLP.
DeepASA An Object-Oriented One-for-All Network for Auditory Scene Analysis
Authors: Dongheon Lee, Younghoo Kwon, Jung-Woo Choi
2025-09-21
We propose DeepASA, a one-for-all model for auditory scene analysis that
performs multi-input multi-output (MIMO) source separation, dereverberation,
sound event detection (SED), audio classification, and direction-of-arrival
estimation (DoAE) within a unified framework. DeepASA is designed for complex
auditory scenes where multiple, often similar, sound sources overlap in time
and move dynamically in space. To achieve robust and consistent inference
across tasks, we introduce an object-oriented processing (OOP) strategy. This
approach encapsulates diverse auditory features into object-centric
representations and refines them through a chain-of-inference (CoI) mechanism.
The pipeline comprises a dynamic temporal kernel-based feature extractor, a
transformer-based aggregator, and an object separator that yields per-object
features. These features feed into multiple task-specific decoders. Our
object-centric representations naturally resolve the parameter association
ambiguity inherent in traditional track-wise processing. However, early-stage
object separation can lead to failure in downstream ASA tasks. To address this,
we implement temporal coherence matching (TCM) within the chain-of-inference,
enabling multi-task fusion and iterative refinement of object features using
estimated auditory parameters. We evaluate DeepASA on representative spatial
audio benchmark datasets, including ASA2, MC-FUSS, and STARSS23. Experimental
results show that our model achieves state-of-the-art performance across all
evaluated tasks, demonstrating its effectiveness in both source separation and
auditory parameter estimation under diverse spatial auditory scenes.
MoEs Are Stronger than You Think Hyper-Parallel Inference Scaling with RoE
Authors: Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho
2025-09-21
The generation quality of large language models (LLMs) is often improved by
utilizing inference-time sequence-level scaling methods (e.g.,
Chain-of-Thought). We introduce hyper-parallel scaling, a complementary
framework that improves prediction quality at the token level. Hyper-parallel
scaling computes and aggregates multiple output proposals for a single token
from the model. We implement this concept in Mixture-of-Experts (MoE) models,
which we refer to as Roster of Experts (RoE). RoE is a training-free inference
algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects
controlled stochasticity into the expert routing mechanism, enabling it to
sample multiple diverse experts for each token and aggregate their outputs for
a more accurate final prediction. To overcome the computational cost, we
introduce an efficient batching strategy and a specialized KV-caching mechanism
that minimizes compute and memory overhead. For example, RoE enables a 7B MoE
model to match the performance of a 10.5B MoE model while using 30% less
compute for inference. These gains are achieved without any fine-tuning of
model parameters.
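A toy rendering of the idea: perturb the router logits with Gumbel noise to sample several diverse top-k expert subsets for the same token, then average the resulting outputs. Noise placement, temperature, and mean aggregation below are assumptions for illustration, not RoE's exact mechanism.

```python
# Stochastic-routing ensemble for a single MoE token, schematically.
import numpy as np

def roe_token_output(x, router_w, experts, k=2, n_samples=4, tau=0.5, seed=0):
    """x: (d,) token activation; router_w: (n_experts, d) router weights;
    experts: list of callables, e.g. [lambda x, W=W: W @ x for W in Ws]."""
    rng = np.random.default_rng(seed)
    logits = router_w @ x
    outs = []
    for _ in range(n_samples):
        g = rng.gumbel(size=logits.shape) * tau   # controlled stochasticity
        top = np.argsort(logits + g)[-k:]         # one sampled expert subset
        w = np.exp(logits[top] - logits[top].max())
        w /= w.sum()                              # renormalized gate weights
        outs.append(sum(wi * experts[i](x) for wi, i in zip(w, top)))
    # Aggregate the diverse routings into one prediction for this token.
    return np.mean(outs, axis=0)
```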
SignalLLM A General-Purpose LLM Agent Framework for Automated Signal Processing
Authors: Junlong Ke, Qiying Hu, Shenghai Yuan, Yuecong Xu, Jianfei Yang
2025-09-21
Modern signal processing (SP) pipelines, whether model-based or data-driven,
are often constrained by complex and fragmented workflows, rely heavily on
expert knowledge and manual engineering, and struggle with adaptability and
generalization under limited data. In contrast, Large Language Models (LLMs)
offer strong reasoning capabilities, broad general-purpose knowledge,
in-context learning, and cross-modal transfer abilities, positioning them as
powerful tools for automating and generalizing SP workflows. Motivated by these
potentials, we introduce SignalLLM, the first general-purpose LLM-based agent
framework for general SP tasks. Unlike prior LLM-based SP approaches that are
limited to narrow applications or tricky prompting, SignalLLM introduces a
principled, modular architecture. It decomposes high-level SP goals into
structured subtasks via in-context learning and domain-specific retrieval,
followed by hierarchical planning through adaptive retrieval-augmented
generation (RAG) and refinement; these subtasks are then executed through
prompt-based reasoning, cross-modal reasoning, code synthesis, model
invocation, or data-driven LLM-assisted modeling. Its generalizable design
enables the flexible selection of problem solving strategies across different
signal modalities, task types, and data conditions. We demonstrate the
versatility and effectiveness of SignalLLM through five representative tasks in
communication and sensing, such as radar target detection, human activity
recognition, and text compression. Experimental results show superior
performance over traditional and existing LLM-based methods, particularly in
few-shot and zero-shot settings.
MAST Multi-Agent Spatial Transformer for Learning to Collaborate
Authors: Damian Owerko, Frederic Vatnsdal, Saurav Agarwal, Vijay Kumar, Alejandro Ribeiro
2025-09-21
This article presents a novel multi-agent spatial transformer (MAST) for
learning communication policies in large-scale decentralized and collaborative
multi-robot systems (DC-MRS). Challenges in collaboration in DC-MRS arise from:
(i) partial observable states as robots make only localized perception, (ii)
limited communication range with no central server, and (iii) independent
execution of actions. The robots need to optimize a common task-specific
objective, which, under the restricted setting, must be done using a
communication policy that exhibits the desired collaborative behavior. The
proposed MAST is a decentralized transformer architecture that learns
communication policies to compute abstract information to be shared with other
agents and processes the received information with the robot's own
observations. The MAST extends the standard transformer with new positional
encoding strategies and attention operations that employ windowing to limit the
receptive field for MRS. These are designed for local computation,
shift-equivariance, and permutation equivariance, making it a promising
approach for DC-MRS. We demonstrate the efficacy of MAST on decentralized
assignment and navigation (DAN) and decentralized coverage control. Efficiently
trained using imitation learning in a centralized setting, the decentralized
MAST policy is robust to communication delays, scales to large teams, and
performs better than the baselines and other learning-based approaches.
Attention Consistency for LLMs Explanation
Authors: Tian Lan, Jinyuan Xu, Xue He, Jenq-Neng Hwang, Lei Li
2025-09-21
Understanding the decision-making processes of large language models (LLMs)
is essential for their trustworthy development and deployment. However, current
interpretability methods often face challenges such as low resolution and high
computational cost. To address these limitations, we propose the
\textbf{Multi-Layer Attention Consistency Score (MACS)}, a novel, lightweight,
and easily deployable heuristic for estimating the importance of input tokens
in decoder-based models. MACS measures contributions of input tokens based on
the consistency of maximal attention. Empirical evaluations demonstrate that
MACS achieves a favorable trade-off between interpretability quality and
computational efficiency, showing faithfulness comparable to complex techniques
with a 22\% decrease in VRAM usage and 30\% reduction in latency.
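One simplified reading of the heuristic: a source token is important in proportion to how consistently it wins the argmax of attention across layers, heads, and target positions. The sketch below implements that reading; the normalisation is an assumption, not the paper's exact score.

```python
# Consistency-of-maximal-attention scores per input token.
import numpy as np

def macs_scores(attn: np.ndarray) -> np.ndarray:
    """attn: (layers, heads, tgt_len, src_len) attention weights.
    Returns one importance score per source (input) token."""
    L, H, T, S = attn.shape
    # For every layer/head/target position, find the most-attended token.
    winners = attn.argmax(-1)                  # (L, H, T)
    counts = np.zeros(S)
    for s in range(S):
        counts[s] = (winners == s).sum()
    return counts / counts.sum()               # fraction of "max" votes

# Usage: attn can be collected in a single forward pass, which is why a
# heuristic like this is cheap in VRAM and latency compared to
# perturbation- or gradient-based attribution.
```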
Analyzing Memory Effects in Large Language Models through the lens of Cognitive Psychology
Authors: Zhaoyang Cao, Lael Schooler, Reza Zafarani
2025-09-21
Memory, a fundamental component of human cognition, exhibits adaptive yet
fallible characteristics as illustrated by Schacter's memory "sins". These
cognitive phenomena have been studied extensively in psychology and
neuroscience, but the extent to which artificial systems, specifically Large
Language Models (LLMs), emulate these cognitive phenomena remains
underexplored. This study uses human memory research as a lens for
understanding LLMs and systematically investigates human memory effects in
state-of-the-art LLMs using paradigms drawn from psychological research. We
evaluate seven key memory phenomena, comparing human behavior to LLM
performance. Both people and models remember less when overloaded with
information (list length effect) and remember better with repeated exposure
(list strength effect). They also show similar difficulties when retrieving
overlapping information, where storing too many similar facts leads to
confusion (fan effect). Like humans, LLMs are susceptible to falsely
"remembering" words that were never shown but are related to others (false
memories), and they can apply prior learning to new, related situations
(cross-domain generalization). However, LLMs differ in two key ways: they are
less influenced by the order in which information is presented (positional
bias) and more robust when processing random or meaningless material (nonsense
effect). These results reveal both alignments and divergences in how LLMs and
humans reconstruct memory. The findings help clarify how memory-like behavior
in LLMs echoes core features of human cognition, while also highlighting the
architectural differences that lead to distinct patterns of error and success.
SnipSnap A Joint Compression Format and Dataflow Co-Optimization Framework for Efficient Sparse LLM Accelerator Design
Authors: Junyi Wu, Chao Fang, Zhongfeng Wang
2025-09-21
The growing scale of large language models (LLMs) has intensified demands on
computation and memory, making efficient inference a key challenge. While
sparsity can reduce these costs, existing design space exploration (DSE)
frameworks often overlook compression formats, a key factor for leveraging
sparsity on accelerators. This paper proposes SnipSnap, a joint compression
format and dataflow co-optimization framework for efficient sparse LLM
accelerator design. SnipSnap introduces: (1) a hierarchical compression format
encoding to expand the design space; (2) an adaptive engine for
selecting compression formats under diverse sparsity patterns; and (3) a
progressive co-search workflow that jointly optimizes dataflow and compression
formats. SnipSnap achieves 18.24\% average memory energy savings via format
optimization, along with 2248.3x and 21.0x speedups over Sparseloop and DiMO-Sparse
frameworks, respectively.
The Transfer Neurons Hypothesis An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLMs
Authors: Hinata Tezuka, Naoya Inoue
2025-09-21
Recent studies have suggested a processing framework for multilingual inputs
in decoder-based LLMs: early layers convert inputs into English-centric and
language-agnostic representations; middle layers perform reasoning within an
English-centric latent space; and final layers generate outputs by transforming
these representations back into language-specific latent spaces. However, the
internal dynamics of such transformation and the underlying mechanism remain
underexplored. Towards a deeper understanding of this framework, we propose and
empirically validate The Transfer Neurons Hypothesis: certain neurons in the
MLP module are responsible for transferring representations between
language-specific latent spaces and a shared semantic latent space.
Furthermore, we show that one function of language-specific neurons, as
identified in recent studies, is to facilitate movement between latent spaces.
Finally, we show that transfer neurons are critical for reasoning in
multilingual LLMs.
PTQTP Post-Training Quantization to Trit-Planes for Large Language Models
Authors: He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zheng Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, Ngai Wong
2025-09-21
Post-training quantization (PTQ) of large language models (LLMs) to extremely
low bit-widths remains challenging due to the fundamental trade-off between
computational efficiency and model expressiveness. While existing ultra-low-bit
PTQ methods rely on binary approximations or complex compensation mechanisms,
they suffer from either limited representational capacity or computational
overhead that undermines their efficiency gains. We introduce Post-Training
Quantization to Trit-Planes (PTQTP), the first ternary-weight PTQ framework
that decomposes weight matrices into structured ternary {-1, 0, 1} trit-planes
using a 2x1.58-bit representation. PTQTP achieves multiplication-free
inference, identical to 1-bit quantization, while maintaining superior
expressiveness through its novel structured decomposition. Our approach
provides: (1) a theoretically grounded progressive approximation algorithm
ensuring global weight consistency; (2) model-agnostic deployment across
diverse modern LLMs without architectural modifications; and (3) uniform
ternary operations that eliminate the need for mixed-precision or compensation
schemes. Comprehensive experiments across LLaMA3.x and Qwen3 model families
(0.6B-70B parameters) demonstrate that PTQTP significantly outperforms
existing low-bit PTQ methods, achieving 82.4% mathematical reasoning retention
versus 0% for competing approaches. PTQTP approaches and sometimes surpasses
1.58-bit quantization-aware training performance while requiring only
single-hour quantization compared to 10-14 GPU days for training-based
methods. These results establish PTQTP as a practical solution for efficient
LLM deployment in resource-constrained environments.
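As a loose illustration of the trit-plane idea (this is a generic greedy
residual fit using the classic ternary-weight-network threshold heuristic, not
the paper's theoretically grounded progressive algorithm):

    import torch

    def ternary_fit(residual: torch.Tensor):
        # Classic TWN-style heuristic (an assumption): threshold at 0.7 * mean |r|.
        delta = 0.7 * residual.abs().mean()
        t = torch.zeros_like(residual)
        t[residual > delta] = 1.0
        t[residual < -delta] = -1.0
        mask = t != 0
        alpha = residual[mask].abs().mean() if mask.any() else torch.tensor(0.0)
        return alpha, t

    def trit_plane_decompose(w: torch.Tensor, planes: int = 2):
        # Greedy residual decomposition: W ~ alpha_1 * T_1 + alpha_2 * T_2,
        # with each T_i a ternary {-1, 0, 1} trit-plane.
        residual, terms = w.clone(), []
        for _ in range(planes):
            alpha, t = ternary_fit(residual)
            residual = residual - alpha * t
            terms.append((alpha, t))
        return terms, residual

    w = torch.randn(64, 64)
    terms, res = trit_plane_decompose(w)
    print(f"relative error: {float(res.norm() / w.norm()):.3f}")

Since every weight in a plane is -1, 0, or 1, the matrix-vector product reduces
to additions and subtractions, which is the multiplication-free property the
abstract highlights.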
LLM-Assisted Semantic Guidance for Sparsely Annotated Remote Sensing Object Detection
Authors: Wei Liao, Chunyan Xu, Chenxu Wang, Zhen Cui
2025-09-21
Sparse annotation in remote sensing object detection poses significant
challenges due to dense object distributions and category imbalances. Although
existing Dense Pseudo-Label methods have demonstrated substantial potential in
pseudo-labeling tasks, they remain constrained by selection ambiguities and
inconsistencies in confidence estimation. In this paper, we introduce an
LLM-assisted semantic guidance framework tailored for sparsely annotated
remote sensing object detection, exploiting the advanced semantic reasoning
capabilities of large language models (LLMs) to distill high-confidence
pseudo-labels. By integrating LLM-generated semantic priors, we propose a
Class-Aware Dense Pseudo-Label Assignment mechanism that adaptively assigns
pseudo-labels for both unlabeled and sparsely labeled data, ensuring robust
supervision across varying data distributions. Additionally, we develop an
Adaptive Hard-Negative Reweighting Module to stabilize the supervised learning
branch by mitigating the influence of confounding background information.
Extensive experiments on DOTA and HRSC2016 demonstrate that the proposed method
outperforms existing single-stage detector-based frameworks, significantly
improving detection performance under sparse annotations.
Catching the Details Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
Authors: Yuheng Shi, Xiaohuan Pei, Minjing Dong, Chang Xu
2025-09-21
Multimodal Large Language Models (MLLMs) require high-resolution visual
information to perform fine-grained perception, yet processing entire
high-resolution images is computationally prohibitive. While recent methods
leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they
typically present a difficult trade-off: training-based approaches depend on
large-scale annotated datasets, while training-free methods that utilize the
model's internal attention are computationally inefficient and less accurate,
requiring either multi-pass decoding stages or reliance on the slow
auto-regressive decoding process. In this paper, we propose an efficient,
annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves
this trade-off. The SD-RPN is built around a pipeline that transforms the
noisy attention maps from the MLLM's middle layers into high-quality
pseudo-RoI
labels by explicitly denoising the signal and resolving ambiguity. We use these
labels to train a lightweight Region Proposal Network (RPN) that learns a more
precise localization. This RPN is also highly efficient, predicting the RoI in
a single forward pass using features from the MLLM's middle layers, decoupling
RoI identification from the auto-regressive generation and avoiding costly
multi-pass operations. To validate our approach, we integrate the framework
into
the LLaVA-1.5 architecture. Despite being trained on only a few (e.g. 10K)
question-answer pairs, our method demonstrates exceptional data efficiency and
generalization, achieving over a 10% absolute accuracy improvement on unseen
benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a
practical and scalable solution for enhancing the fine-grained perception of
MLLMs without requiring costly supervision or full model fine-tuning. Code is
available at https://github.com/YuHengsss/SD-RPN.
SwarmChat An LLM-Based, Context-Aware Multimodal Interaction System for Robotic Swarms
Authors: Ettilla Mohiuddin Eumi, Hussein Abbass, Nadine Marcus
2025-09-21
Traditional Human-Swarm Interaction (HSI) methods often lack intuitive
real-time adaptive interfaces, making decision making slower and increasing
cognitive load while limiting command flexibility. To solve this, we present
SwarmChat, a context-aware, multimodal interaction system powered by Large
Language Models (LLMs). SwarmChat enables users to issue natural language
commands to robotic swarms using multiple modalities, such as text, voice, or
teleoperation. The system integrates four LLM-based modules: Context Generator,
Intent Recognition, Task Planner, and Modality Selector. These modules
collaboratively generate context from keywords, detect user intent, adapt
commands based on real-time robot state, and suggest optimal communication
modalities. Its three-layer architecture offers a dynamic interface with both
fixed and customizable command options, supporting flexible control while
optimizing cognitive effort. The preliminary evaluation also shows that the
SwarmChat's LLM modules provide accurate context interpretation, relevant
intent recognition, and effective command delivery, achieving high user
satisfaction.
ShadowServe Interference-Free KV Cache Fetching for Distributed Prefix Caching
Authors: Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, Minlan Yu
2025-09-21
Distributed prefix caching accelerates long-context LLM serving by reusing KV
cache entries for common context prefixes. However, KV cache fetches can
become a bottleneck when network bandwidth is limited. Compression mitigates
the bandwidth issue, but can degrade overall performance when decompression
interferes with model computation.
We present ShadowServe, the first SmartNIC-accelerated, interference-free
prefix caching system for LLM serving. ShadowServe separates a control plane on
the host and a data plane fully offloaded to the SmartNIC, which eliminates
interference to both host GPU and CPU. To overcome the SmartNIC's limited
compute and memory resources, we design a chunked pipeline that parallelizes
data plane operations across the SmartNIC's compute resources, and a
minimal-copy memory management scheme that reduces memory pressure on the
SmartNIC. Compared to state-of-the-art solutions, ShadowServe achieves up to
2.2x lower loaded time-per-output-token (TPOT), and reduces time-to-first-token
(TTFT) by up to 1.38x in low-bandwidth scenarios (<= 20 Gbps), translating to
up to 1.35x higher throughput.
ISCS Parameter-Guided Channel Ordering and Grouping for Learned Image Compression
Authors: Jinhao Wang, Cihan Ruan, Nam Ling, Wei Wang, Wei Jiang
2025-09-21
Prior studies in learned image compression (LIC) consistently show that only
a small subset of latent channels is critical for reconstruction, while many
others carry limited information. Exploiting this imbalance could improve both
coding and computational efficiency, yet existing approaches often rely on
costly, dataset-specific ablation tests and typically analyze channels in
isolation, ignoring their interdependencies.
We propose a generalizable, dataset-agnostic method to identify and organize
important channels in pretrained VAE-based LIC models. Instead of brute-force
empirical evaluations, our approach leverages intrinsic parameter statistics
(weight variances, bias magnitudes, and pairwise correlations) to
estimate channel importance. This analysis reveals a consistent organizational
structure, termed the Invariant Salient Channel Space (ISCS), where
Salient-Core channels capture dominant structures and Salient-Auxiliary
channels provide complementary details. Building on ISCS, we introduce a
deterministic channel ordering and grouping strategy that enables
slice-parallel decoding, reduces redundancy, and improves bitrate efficiency.
Experiments across multiple LIC architectures demonstrate that our method
effectively reduces bitrate and computation while maintaining reconstruction
quality, providing a practical and modular enhancement to existing learned
compression frameworks.
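As a minimal sketch of the parameter-statistics idea (the scoring and
aggregation rule here are assumptions; the paper additionally uses pairwise
correlations and groups channels into Salient-Core and Salient-Auxiliary sets):

    import torch
    import torch.nn as nn

    def channel_importance_order(conv: nn.Conv2d) -> torch.Tensor:
        # Score each latent (output) channel from intrinsic parameter
        # statistics rather than brute-force ablation: per-channel weight
        # variance plus bias magnitude (toy aggregation).
        w = conv.weight.detach()                    # (out_ch, in_ch, kh, kw)
        weight_var = w.flatten(1).var(dim=1)
        bias_mag = conv.bias.detach().abs() if conv.bias is not None else \
            torch.zeros(w.shape[0])
        return (weight_var + bias_mag).argsort(descending=True)

    # Toy usage on a stand-in for a pretrained VAE synthesis layer.
    conv = nn.Conv2d(192, 320, kernel_size=5, padding=2)
    order = channel_importance_order(conv)
    print(order[:8])  # a deterministic ordering of the most salient channels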
The Even Sheen of AI Kitsch, LLMs, and Homogeneity
Authors: Gyburg Uhlmann
2025-09-20
The exploding use and impact of Chatbots such as ChatGPT that are based on
Large Language Models urgently call for a language which is fit to clearly
describe functions and problems of the production process and qualities of the
Chatbots' textual and image output. Recently, the discussion about appropriate
and illuminating metaphors to describe s has gained momentum. As an
alternative to well-established metaphors such as "hallucinating" and
"bullshit", we propose "kitsch" as a new metaphor. As an internationally
widespread term from literary and cultural studies, we argue that "kitsch" is
particularly suitable for analytically illuminating a previously neglected
feature of LLM-based images and texts: their tendency to produce homogeneous
and average content, which is becoming increasingly dominant as the proportion
of AI-generated content on the internet grows. This is leading to the
equalisation of language, style and argument. In view of the potential negative
consequences of this averaging, including for human content producers on the
internet, we advocate combining methods and insights from kitsch studies with
AI research, philosophy, and communication studies in order to better
understand the phenomenon and develop countermeasures.
Towards Transparent and Incentive-Compatible Collaboration in Decentralized LLM Multi-Agent Systems A Blockchain-Driven Approach
Authors: Minfeng Qi, Tianqing Zhu, Lefeng Zhang, Ningran Li, Wanlei Zhou
2025-09-20
Large Language Models (LLMs) have enabled the emergence of autonomous agents
capable of complex reasoning, planning, and interaction. However, coordinating
such agents at scale remains a fundamental challenge, particularly in
decentralized environments where communication lacks transparency and agent
behavior cannot be shaped through centralized incentives. We propose a
blockchain-based framework that enables transparent agent registration,
verifiable task allocation, and dynamic reputation tracking through smart
contracts. The core of our design lies in two mechanisms: a matching
score-based task allocation protocol that evaluates agents by reputation,
capability match, and workload; and a behavior-shaping incentive mechanism that
adjusts agent behavior via feedback on performance and reward. Our
implementation integrates GPT-4 agents with Solidity contracts and
demonstrates, through 50-round simulations, strong task success rates, stable
utility distribution, and emergent agent specialization. The results underscore
the potential for trustworthy, incentive-compatible multi-agent coordination in
open environments.
Decoding Uncertainty The Impact of Decoding Strategies for Uncertainty Estimation in Large Language Models
Authors: Wataru Hashimoto, Hidetaka Kamigaito, Taro Watanabe
2025-09-20
Decoding strategies manipulate the probability distribution underlying the
output of a language model and can therefore affect both generation quality and
its uncertainty. In this study, we investigate the impact of decoding
strategies on uncertainty estimation in Large Language Models (LLMs). Our
experiments show that Contrastive Search, which mitigates repetition, yields
better uncertainty estimates on average across a range of preference-aligned
LLMs. In contrast, the benefits of these strategies sometimes diverge when the
model is only post-trained with supervised fine-tuning, i.e. without explicit
alignment.
EG-MLA Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs
Authors: Zhengge Cai, Haowen Hou
2025-09-20
Reducing the key-value (KV) cache size is a crucial step toward enabling
efficient inference in large language models (LLMs), especially under latency
and memory constraints. While Multi-Head Attention (MHA) offers strong
representational power, it incurs significant memory overhead. Recent work on
Multi-head Latent Attention (MLA) mitigates this by compressing KV
representations into a shared latent space, achieving a better trade-off
between performance and KV cache efficiency. While MLA already achieves
significant KV cache reduction, the scope for further compression remains
limited without performance loss. In this paper, we propose
\textbf{Embedding-Gated Multi-head Latent Attention (EG-MLA)}, a novel
extension of MLA that further reduces KV cache size while enhancing
representational expressiveness. EG-MLA introduces a token-specific embedding
gating mechanism applied in the latent space, enabling fine-grained modulation
of compressed KV vectors with minimal additional computation. Compared to MHA,
EG-MLA achieves over 91.6\% reduction in KV cache size with negligible
performance degradation. Relative to MLA, EG-MLA consistently improves task
accuracy across diverse reasoning benchmarks while achieving up to 59.9\%
additional memory savings. Our theoretical analysis highlights how embedding
gating induces implicit high-order interactions, and empirical evaluations
demonstrate robust generalization across model scales and compression regimes.
Notably, we successfully scale EG-MLA to over 1 billion parameters,
demonstrating its practical viability for large-scale LLM deployment. These
results establish EG-MLA as a memory- and compute-efficient attention
mechanism that enables scalable, high-performance inference in modern LLMs.
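A minimal sketch of what a token-specific embedding gate on compressed latent
KV vectors could look like (the shapes and sigmoid gate are assumptions; the
actual EG-MLA integrates this inside MLA's latent projections):

    import torch
    import torch.nn as nn

    class EmbeddingGatedLatentKV(nn.Module):
        # Gate a compressed latent KV vector with a learned, token-specific
        # embedding: fine-grained modulation at negligible extra compute.
        def __init__(self, vocab_size: int, latent_dim: int):
            super().__init__()
            self.gate_embed = nn.Embedding(vocab_size, latent_dim)

        def forward(self, latent_kv: torch.Tensor, token_ids: torch.Tensor):
            # latent_kv: (batch, seq, latent_dim) shared-latent KV representations.
            gate = torch.sigmoid(self.gate_embed(token_ids))
            return latent_kv * gate

    mod = EmbeddingGatedLatentKV(vocab_size=1000, latent_dim=64)
    kv, ids = torch.randn(2, 8, 64), torch.randint(0, 1000, (2, 8))
    print(mod(kv, ids).shape)  # torch.Size([2, 8, 64])

The elementwise product of a token-dependent gate with the shared latent
vector is one simple way such a gate could induce the higher-order
token-latent interactions the theoretical analysis refers to.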
λ-Orthogonality Regularization for Compatible Representation Learning
Authors: Simone Ricci, Niccolò Biondi, Federico Pernici, Ioannis Patras, Alberto Del Bimbo
2025-09-20
Retrieval systems rely on representations learned by increasingly powerful
models. However, due to the high training cost and inconsistencies in learned
representations, there is significant interest in facilitating communication
between representations and ensuring compatibility across independently trained
neural networks. In the literature, two primary approaches are commonly used to
adapt different learned representations: affine transformations, which adapt
well to specific distributions but can significantly alter the original
representation, and orthogonal transformations, which preserve the original
structure with strict geometric constraints but limit adaptability. A key
challenge is adapting the latent spaces of updated models to align with those
of previous models on downstream distributions while preserving the newly
learned representation spaces. In this paper, we impose a relaxed orthogonality
constraint, namely λ-orthogonality regularization, while learning an
affine transformation, to obtain distribution-specific adaptation while
retaining the original learned representations. Extensive experiments across
various architectures and datasets validate our approach, demonstrating that it
preserves the model's zero-shot performance and ensures compatibility across
model updates. Code available at:
https://github.com/miccunifi/lambda_orthogonality
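Read literally, the idea admits a very small sketch: learn an affine adapter
between old and new feature spaces, but penalize (rather than enforce)
deviation from orthogonality. The loss form and weights below are assumptions:

    import torch

    def lambda_orthogonality_loss(w: torch.Tensor, lam: float = 0.1):
        # Relaxed orthogonality: penalize ||W^T W - I||_F^2 instead of
        # hard-constraining W to be orthogonal.
        eye = torch.eye(w.shape[1], device=w.device)
        return lam * ((w.t() @ w - eye) ** 2).sum()

    # Toy usage: adapt new 32-d features to align with old ones.
    w = torch.randn(32, 32, requires_grad=True)
    old_feats, new_feats = torch.randn(128, 32), torch.randn(128, 32)
    opt = torch.optim.Adam([w], lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        align = ((new_feats @ w - old_feats) ** 2).mean()  # adaptation term
        (align + lambda_orthogonality_loss(w)).backward()
        opt.step()

With a very large lam this recovers a near-orthogonal map that preserves the
new space's geometry; with lam = 0 it is a free affine fit; intermediate
values trade adaptability against preservation, which is the regularizer's
stated purpose.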
Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
Authors: Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland
2025-09-20
Diffusion-based large language models (DLLMs) have recently attracted growing
interest as an alternative to autoregressive decoders. In this work, we
present
an empirical study on using the diffusion-based large language model LLaDA for
automatic speech recognition (ASR). We first investigate its use as an external
deliberation-based processing module for Whisper-LLaMA transcripts. By
leveraging the bidirectional attention and denoising capabilities of LLaDA, we
explore random masking, low-confidence masking, and semi-autoregressive
strategies, showing that Whisper-LLaDA substantially reduces WER compared with
the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER
on test-clean/test-other, representing a 12.3% relative improvement over the
Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA
without acoustic features fails to improve accuracy, highlighting the
importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA
as a standalone decoder for ASR with diffusion-based and semi-autoregressive
decoding. Most experimental configurations achieve faster inference than the
Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These
findings offer an empirical view of diffusion-based LLMs for ASR and point to
promising directions for improvements.
PruneCD Contrasting Pruned Self Model to Improve Decoding Factuality
Authors: Byeongho Yu, Changhun Lee, Jungyu Jin, Eunhyeok Park
2025-09-20
To mitigate the hallucination problem in large language models, DoLa exploits
early exit logits from the same model as a contrastive prior. However, we found
that these early exit logits tend to be flat, low in magnitude, and fail to
reflect meaningful contrasts. To address this, we propose PruneCD, a novel
contrastive decoding method that constructs the amateur model via layer
pruning rather than early exit. This design leads to more informative and
well-aligned logits, enabling more effective contrastive decoding. Through
qualitative and quantitative analyses, we demonstrate that PruneCD
consistently improves factuality with minimal inference overhead, offering a
robust and practical approach to mitigating hallucinations in LLMs.
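PruneCD's specific contribution is where the amateur logits come from (a
layer-pruned copy of the same model); the contrastive step itself follows the
standard contrastive-decoding recipe, sketched here with an assumed
plausibility cutoff:

    import torch

    def contrastive_logits(expert_logits, amateur_logits, alpha=0.1, beta=1.0):
        # Standard contrastive decoding: keep tokens the expert finds
        # plausible (within a factor alpha of its best token), then
        # down-weight what the weaker amateur also predicts.
        expert_logp = torch.log_softmax(expert_logits, dim=-1)
        amateur_logp = torch.log_softmax(amateur_logits, dim=-1)
        cutoff = expert_logp.max(dim=-1, keepdim=True).values \
            + torch.log(torch.tensor(alpha))
        scores = expert_logp - beta * amateur_logp
        return scores.masked_fill(expert_logp < cutoff, float("-inf"))

    expert, amateur = torch.randn(1, 32000), torch.randn(1, 32000)
    next_token = contrastive_logits(expert, amateur).argmax(dim=-1)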
Data-Driven Reduced-Order Modeling of Phase Mixing Dynamics from Particle Kinetic Simulation
Authors: Darian Figuera-Michal, Sungpil Yum, Jae-Min Kwon, Eisung Yoon
2025-09-20
Phase mixing is a fundamental kinetic process that governs dissipation and
stability in collisionless plasmas, but its inherent filamentation in velocity
space creates major challenges for both high-fidelity simulations and
reduced-order modeling. This work presents the first exploratory evaluation of
a joint Proper Orthogonal Decomposition and Sparse Identification of Nonlinear
Dynamics (POD-SINDy) framework applied to particle-in-cell simulations of phase
mixing. Simulation datasets were generated under progressively complex
conditions, starting from a passive kinetic case without self-consistent
electric fields, extending to self-consistent simulations with nonlinear
electric field feedback, and finally to a noisy dataset with reduced particle
resolution. In the passive kinetic regime, POD-SINDy achieved near-optimal
reconstructions with only five modes, reproducing filamentation with errors
below four percent. In self-consistent electrostatic cases, variance spread
across more modes due to nonlinear interactions and noise, slowing singular
value decay and making strict low-rank embeddings more demanding. Nevertheless,
retaining ten modes was sufficient to recover the dominant structures, yielding
reconstruction errors of about seven percent for the low-noise case and
thirteen percent for the noisy dataset. Across all scenarios, SINDy provided
sparse and interpretable equations for modal amplitudes that remained
predominantly linear despite the underlying nonlinear data, while POD
truncation effectively filtered particle noise and preserved coherent dynamics.
These findings demonstrate that POD-SINDy constitutes a compact and
interpretable approach to reduced-order modeling of phase mixing, capable of
retaining essential physics across regimes of increasing complexity while
achieving data compression from three to five orders of magnitude depending on
dataset complexity.
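A textbook POD-SINDy loop is compact enough to sketch (this is the generic
method with a purely linear candidate library, matching the abstract's
observation that the identified modal dynamics stayed predominantly linear; it
is not the paper's pipeline):

    import numpy as np

    def pod_sindy(snapshots: np.ndarray, dt: float, r: int = 5, lam: float = 0.1):
        # (1) POD: SVD of the snapshot matrix gives spatial modes and modal
        #     amplitudes; truncation to r modes filters particle noise.
        u, s, vt = np.linalg.svd(snapshots, full_matrices=False)
        a = (np.diag(s[:r]) @ vt[:r]).T               # (n_time, r) amplitudes
        # (2) SINDy: sparse regression of da/dt on the (linear) library via
        #     sequential thresholded least squares (STLSQ).
        dadt = np.gradient(a, dt, axis=0)
        xi = np.linalg.lstsq(a, dadt, rcond=None)[0]
        for _ in range(10):
            xi[np.abs(xi) < lam] = 0.0
            for j in range(r):
                big = np.abs(xi[:, j]) >= lam
                if big.any():
                    xi[big, j] = np.linalg.lstsq(a[:, big], dadt[:, j],
                                                 rcond=None)[0]
        return u[:, :r], xi                           # modes, sparse dynamics

    data = np.random.randn(200, 400)                  # stand-in for PIC snapshots
    modes, xi = pod_sindy(data, dt=0.01)
    print(modes.shape, xi.shape)                      # (200, 5) (5, 5)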
Assessing Classical Machine Learning and Transformer-based Approaches for Detecting AI-Generated Research Text
Authors: Sharanya Parimanoharan, Ruwan D. Nawarathna
2025-09-20
The rapid adoption of large language models (LLMs) such as ChatGPT has
blurred the line between human and AI-generated texts, raising urgent questions
about academic integrity, intellectual property, and the spread of
misinformation. Thus, reliable AI-text detection is needed for fair assessment
to safeguard human authenticity and cultivate trust in digital communication.
In this study, we investigate how well current machine learning (ML) approaches
can distinguish ChatGPT-3.5-generated texts from human-written texts employing
a labeled data set of 250 pairs of abstracts from a wide range of research
topics. We test and compare both classical (Logistic Regression armed with
classical Bag-of-Words, POS, and TF-IDF features) and transformer-based (BERT
augmented with N-grams, DistilBERT, BERT with a lightweight custom classifier,
and LSTM-based N-gram models) ML detection techniques. As we aim to assess each
model's performance in detecting AI-generated research texts, we also aim to
test whether an ensemble of these models can outperform any single detector.
Results show DistilBERT achieves the overall best performance, while Logistic
Regression and BERT-Custom offer solid, balanced alternatives; LSTM- and
BERT-N-gram approaches lag. The max voting ensemble of the three best models
fails to surpass DistilBERT itself, highlighting the primacy of a single
transformer-based representation over mere model diversity. By comprehensively
assessing the strengths and weaknesses of these AI-text detection approaches,
this work lays a foundation for more robust detection frameworks with larger,
richer datasets to keep pace with ever-improving generative AI models.
FG-Attn Leveraging Fine-Grained Sparsity In Diffusion Transformers
Authors: Sankeerth Durvasula, Kavya Sreedhar, Zain Moustafa, Suraj Kothawade, Ashish Gondimalla, Suvinay Subramanian, Narges Shahidi, Nandita Vijaykumar
2025-09-20
Generating realistic videos with diffusion transformers demands significant
computation, with attention layers as the central bottleneck; even producing a
short clip requires running a transformer over a very long sequence of
embeddings, e.g., more than 30K embeddings for a 5-second video, incurring
significant latency. Prior work aims to mitigate this bottleneck by exploiting
sparsity in the attention layers to reduce computation. However, these works
typically rely on block-sparse attention, which skips score computation only
when all entries in a block of attention scores (corresponding to M queries
and M keys, with M = 64 typically) are zero. This coarse-granular skipping of
attention scores does not fully exploit sparsity in the attention map and
leaves room for improvement. In this work, we propose FG-Attn, a sparse
attention mechanism for long-context diffusion transformers that leverages
sparsity at a fine granularity. Unlike block-sparse attention, which skips
entire MxM blocks, our approach skips computations at the granularity of Mx1
slices of the attention map. Each slice is produced by query-key dot products
between a block of query vectors and a single key. To implement our proposed
sparse attention mechanism, we develop a new efficient bulk-load operation
called asynchronous-gather load. This load operation gathers a sparse set of
relevant key-value vectors from memory and arranges them into packed tiles in
the GPU's shared memory. Only a sparse set of keys relevant to those queries
are loaded into shared memory when computing attention for a block of queries,
in contrast to loading full blocks of key tokens in block-sparse attention.
Our fine-grained sparse attention, applied to video diffusion models, achieves
an
average 1.55X (up to 1.65X) speedup for 5 second, 480p videos, and an average
1.41X (up to 1.49X) for 5 second, 720p videos on a single H100 GPU.
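The granularity difference is easy to see in a reference (non-kernel)
implementation; the skipping criterion below is an assumption, and the paper's
contribution is doing this efficiently on GPU with the asynchronous-gather
load:

    import torch

    def slice_sparse_attention(q, k, v, m: int = 4, margin: float = 8.0):
        # For each block of M queries, drop individual keys whose entire Mx1
        # score slice is negligible (scores far below the block maximum
        # contribute ~0 after softmax), instead of skipping only full MxM
        # blocks as in block-sparse attention.
        seq, d = q.shape
        out = torch.zeros_like(v)
        for start in range(0, seq, m):
            scores = q[start:start + m] @ k.t() / d ** 0.5       # (M, seq)
            keep = scores.amax(dim=0) > scores.max() - margin    # per-key slice test
            attn = torch.softmax(scores[:, keep], dim=-1)
            out[start:start + m] = attn @ v[keep]
        return out

    q, k, v = (torch.randn(16, 8) for _ in range(3))
    print(slice_sparse_attention(q, k, v).shape)  # torch.Size([16, 8])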
orb-QFL Orbital Quantum Federated Learning
Authors: Dev Gurung, Shiva Raj Pokhrel
2025-09-20
Recent breakthroughs in quantum computing present transformative
opportunities for advancing Federated Learning (FL), particularly in
non-terrestrial environments characterized by stringent communication and
coordination constraints. In this study, we propose orbital QFL, termed
orb-QFL, a novel quantum-assisted Federated Learning framework tailored for
Low Earth Orbit (LEO) satellite constellations. Distinct from conventional FL
paradigms, orb-QFL operates without centralized servers or global
aggregation mechanisms (e.g., FedAvg), instead leveraging quantum entanglement
and local quantum processing to facilitate decentralized, inter-satellite
collaboration. This design inherently addresses the challenges of orbital
dynamics, such as intermittent connectivity, high propagation delays, and
coverage variability. The framework enables continuous model refinement through
direct quantum-based synchronization between neighboring satellites, thereby
enhancing resilience and preserving data locality. To validate our approach,
we integrate the Qiskit quantum machine learning toolkit with Poliastro-based
orbital simulations and conduct experiments using the Statlog dataset.
GRIL Knowledge Graph Retrieval-Integrated Learning with Large Language Models
Authors: Jialin Chen, Houyu Zhang, Seongjun Yun, Alejandro Mottini, Rex Ying, Xiang Song, Vassilis N. Ioannidis, Zheng Li, Qingjun Cui
2025-09-20
Retrieval-Augmented Generation (RAG) has significantly mitigated the
hallucinations of Large Language Models (LLMs) by grounding the generation with
external knowledge. Recent extensions of RAG to graph-based retrieval offer a
promising direction, leveraging the structural knowledge for multi-hop
reasoning. However, existing graph RAG typically decouples retrieval and
reasoning processes, which prevents the retriever from adapting to the
reasoning needs of the LLM. They also struggle with scalability when performing
multi-hop expansion over large-scale graphs, or depend heavily on annotated
ground-truth entities, which are often unavailable in open-domain settings. To
address these challenges, we propose a novel graph retriever trained end-to-end
with the LLM, which features an attention-based growing and pruning mechanism,
adaptively navigating multi-hop relevant entities while filtering out noise.
Within the extracted subgraph, structural knowledge and semantic features are
encoded via soft tokens and the verbalized graph, respectively, which are
infused into the LLM together, thereby enhancing its reasoning capability and
facilitating interactive joint training of the graph retriever and the
reasoner. Experimental results across three QA benchmarks show that our
approach consistently achieves state-of-the-art performance, validating the
strength of joint graph-LLM optimization for complex reasoning tasks. Notably,
our framework eliminates the need for predefined ground-truth entities by
directly optimizing the retriever using LLM logits as implicit feedback, making
it especially effective in open-domain settings.
Synergies between Federated Foundation Models and Smart Power Grids
Authors: Seyyedali Hosseinalipour, Shimiao Li, Adedoyin Inaolaji, Filippo Malandra, Luis Herrera, Nicholas Mastronarde
2025-09-20
The recent emergence of large language models (LLMs) such as GPT-3 has marked
a significant paradigm shift in machine learning. Trained on massive corpora of
data, these models demonstrate remarkable capabilities in language
understanding, generation, summarization, and reasoning, transforming how
intelligent systems process and interact with human language. Although LLMs may
still seem like a recent breakthrough, the field is already witnessing the rise
of a new and more general category: multi-modal, multi-task foundation models
(M3T FMs). These models go beyond language and can process heterogeneous data
types/modalities, such as time-series measurements, audio, imagery, tabular
records, and unstructured logs, while supporting a broad range of downstream
tasks spanning forecasting, classification, control, and retrieval. When
combined with federated learning (FL), they give rise to M3T Federated
Foundation Models (FedFMs): a highly recent and largely unexplored class of
models that enable scalable, privacy-preserving model training/fine-tuning
across distributed data sources. In this paper, we take one of the first steps
toward introducing these models to the power systems research community by
offering a bidirectional perspective: (i) M3T FedFMs for smart grids and (ii)
smart grids for FedFMs. In the former, we explore how M3T FedFMs can enhance
key grid functions, such as load/demand forecasting and fault detection, by
learning from distributed, heterogeneous data available at the grid edge in a
privacy-preserving manner. In the latter, we investigate how the constraints
and structure of smart grids, spanning energy, communication, and regulatory
dimensions, shape the design, training, and deployment of M3T FedFMs.
Shift Parallelism Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
Authors: Mert Hidayetoglu, Aurick Qiao, Michael Wyatt, Jeff Rasley, Yuxiong He, Samyam Rajbhandari
2025-09-20
Efficient parallelism is necessary for achieving low-latency, high-throughput
inference with large language models (LLMs). Tensor parallelism (TP) is the
state-of-the-art method for reducing LLM response latency, but its GPU
communications reduce combined token throughput. On the other hand, data
parallelism (DP) obtains a higher throughput yet is slow in response latency.
The best of both worlds does not exist, and it is not possible to combine TP
and DP because of the KV cache variance across the parallelisms.
We notice that Sequence Parallelism (SP - Ulysses in training) has similar
properties to DP but with KV cache invariance. We adapt SP to inference, and
combine it with TP to get the best of both worlds. Our solution: Shift
Parallelism.
Shift Parallelism dynamically switches across TP and SP, and minimizes
latency in low traffic without losing throughput in high traffic. The efficient
GPU communications of Shift Parallelism yield up to i) 1.51x faster response
in interactive workloads and ii) 50% higher throughput in batch workloads,
compared to a TP-only solution.
We evaluate Shift Parallelism with real-world production traces with dynamic
traffic patterns as well as synthetic benchmarking patterns across models,
context sizes, and arrival rates. All results affirm the same: Shift
Parallelism has a better latency vs. throughput tradeoff than TP or DP, and
hence obtains low latency without degrading throughput in dynamic workloads.
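Conceptually, the dispatch logic is a small switch; the trigger and threshold
below are illustrative assumptions (the paper's system switches between real
TP and SP execution modes over the same sharded weights, which is only
possible because SP keeps the KV cache layout invariant):

    def choose_mode(inflight_tokens: int, threshold: int = 4096) -> str:
        # Low traffic: TP minimizes per-request latency.
        # High traffic: SP (Ulysses-style) recovers DP-like throughput.
        return "TP" if inflight_tokens < threshold else "SP"

    for load in (256, 2048, 16384):
        print(load, "->", choose_mode(load))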
LightCode Compiling LLM Inference for Photonic-Electronic Systems
Authors: Ryan Tomich, Zhizhen Zhong, Dirk Englund
2025-09-19
The growing demand for low-latency, energy-efficient inference in large
language models (LLMs) has catalyzed interest in heterogeneous architectures.
While GPUs remain dominant, they are poorly suited for integration with
emerging domain-specific accelerators like the Photonic Tensor Units (PTUs),
which offer low-power, high-throughput linear computation. This motivates
hybrid compilation strategies that combine photonic and electronic resources.
We present LightCode, a compiler framework and simulator for mapping LLM
inference workloads across hybrid photonic-electronic systems. LightCode
introduces the Stacked Graph, an intermediate representation that encodes
multiple hardware-specific realizations of each tensor operation. Hardware
assignment is formulated as a constrained subgraph selection problem optimized
for latency or energy under parametric cost models. We evaluate LightCode on
the decode stage of GPT-2 and Llama-7B, showing that under our workload and
hardware assumptions, (i) photonic hardware reduced energy by up to 50% in our
simulated workloads at maximum sequence length; (ii) the multiplexing and
assignment strategy yielded latency improvements exceeding 10x; and (iii)
optimizing for latency or energy resulted in distinct hardware mappings in our
simulations. LightCode offers a modular, foundational framework and simulator
for compiling LLMs to emerging photonic accelerators.
SENSE-7 Taxonomy and Dataset for Measuring User Perceptions of Empathy in Sustained Human-AI Conversations
Authors: Jina Suh, Lindy Le, Erfan Shayegani, Gonzalo Ramos, Judith Amores, Desmond C. Ong, Mary Czerwinski, Javier Hernandez
2025-09-19
Empathy is increasingly recognized as a key factor in human-AI interaction,
yet conventional approaches to "digital empathy" often focus on simulating
internal, human-like emotional states while overlooking the inherently
subjective, contextual, and relational facets of empathy as perceived by users.
In this work, we propose a human-centered taxonomy that emphasizes observable
empathic behaviors and introduce a new dataset, Sense-7, of real-world
conversations between information workers and Large Language Models (LLMs),
which includes per-turn empathy annotations directly from the users, along with
user characteristics, and contextual details, offering a more user-grounded
representation of empathy. Analysis of 695 conversations from 109 participants
reveals that empathy judgments are highly individualized, context-sensitive,
and vulnerable to disruption when conversational continuity fails or user
expectations go unmet. To promote further research, we provide a subset of 672
anonymized conversations and an exploratory classification analysis, showing
that an LLM-based classifier can recognize 5 levels of empathy with an
encouraging average Spearman ρ = 0.369 and accuracy = 0.487 over this set.
Overall, our findings underscore the need for AI designs that dynamically
tailor empathic behaviors to user contexts and goals, offering a roadmap for
future research and practical development of socially attuned, human-centered
artificial agents.
RephQA Evaluating Readability of Large Language Models in Public Health Question Answering
Authors: Weikang Qiu, Tinglin Huang, Ryan Rullo, Yucheng Kuang, Ali Maatouk, S. Raquel Ramos, Rex Ying
2025-09-19
Large Language Models (LLMs) hold promise in addressing complex medical
problems. However, while most prior studies focus on improving accuracy and
reasoning abilities, a significant bottleneck in developing effective
healthcare agents lies in the readability of LLM-generated responses,
specifically, their ability to answer public health problems clearly and simply
to people without medical backgrounds. In this work, we introduce RephQA, a
benchmark for evaluating the readability of LLMs in public health question
answering (QA). It contains 533 expert-reviewed QA pairs from 27 sources across
13 topics, and includes a proxy multiple-choice task to assess informativeness,
along with two readability metrics: Flesch-Kincaid grade level and professional
score. Evaluation of 25 LLMs reveals that most fail to meet readability
standards, highlighting a gap between reasoning and effective communication.
To address this, we explore four readability-enhancing strategies: standard
prompting, chain-of-thought prompting, Group Relative Policy Optimization
(GRPO), and a token-adapted variant. Token-adapted GRPO achieves the best
results, representing a step toward more practical and user-friendly agents
for public health.
Improving Deep Tabular Learning
Authors: Sivan Sarafian, Yehudit Aperstein
2025-09-19
Tabular data remain a dominant form of real-world information but pose
persistent challenges for deep learning due to heterogeneous feature types,
lack of natural structure, and limited label-preserving augmentations. As a
result, ensemble models based on decision trees continue to dominate benchmark
leaderboards. In this work, we introduce RuleNet, a transformer-based
architecture specifically designed for deep tabular learning. RuleNet
incorporates learnable rule embeddings in a decoder, a piecewise linear
quantile projection for numerical features, and feature masking ensembles for
robustness and uncertainty estimation. Evaluated on eight benchmark datasets,
RuleNet matches or surpasses state-of-the-art tree-based methods in most cases,
while remaining computationally efficient, offering a practical neural
alternative for tabular prediction tasks.
The Role of High-Performance GPU Resources in Large Language Model Based Radiology Imaging Diagnosis
Authors: Jyun-Ping Kao
2025-09-19
Large-language models (LLMs) are rapidly being applied to radiology, enabling
automated image interpretation and report generation tasks. Their deployment in
clinical practice requires both high diagnostic accuracy and low inference
latency, which in turn demands powerful hardware. High-performance graphical
processing units (GPUs) provide the necessary compute and memory throughput to
run large LLMs on imaging data. We review modern GPU architectures (e.g.
NVIDIA A100/H100, AMD Instinct MI250X/MI300) and key performance metrics:
floating-point throughput, memory bandwidth, and VRAM capacity. We show how
these
hardware capabilities affect radiology tasks: for example, generating reports
or detecting findings on CheXpert and MIMIC-CXR images is computationally
intensive and benefits from GPU parallelism and tensor-core acceleration.
Empirical studies indicate that using appropriate GPU resources can reduce
inference time and improve throughput. We discuss practical challenges
including privacy, deployment, cost, and power, as well as optimization
strategies: mixed-precision, quantization, pruning, and multi-GPU scaling.
Finally, we
anticipate that next-generation features (8-bit tensor cores, enhanced
interconnect) will further enable on-premise and federated radiology AI.
Advancing GPU infrastructure is essential for safe, efficient LLM-based
radiology diagnostics.
MANZANO A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Authors: Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen
2025-09-19
Unified multimodal Large Language Models (LLMs) that can both understand and
generate visual content hold immense potential. However, existing open-source
models often suffer from a performance trade-off between these capabilities. We
present Manzano, a simple and scalable unified framework that substantially
reduces this tension by coupling a hybrid image tokenizer with a well-curated
training recipe. A single shared vision encoder feeds two lightweight adapters
that produce continuous embeddings for image-to-text understanding and discrete
tokens for text-to-image generation within a common semantic space. A unified
autoregressive LLM predicts high-level semantics in the form of text and image
tokens, with an auxiliary diffusion decoder subsequently translating the image
tokens into pixels. The architecture, together with a unified training recipe
over understanding and generation data, enables scalable joint learning of both
capabilities. Manzano achieves state-of-the-art results among unified models,
and is competitive with specialist models, particularly on text-rich
evaluation. Our studies show minimal task conflicts and consistent gains from
scaling model size, validating our design choice of a hybrid tokenizer.
Agentic Aerial Cinematography From Dialogue Cues to Cinematic Trajectories
Authors: Yifan Lin, Sophie Ziyu Liu, Ran Qi, George Z. Xue, Xinping Song, Chao Qin, Hugh H. -T. Liu
2025-09-19
We present Agentic Aerial Cinematography: From Dialogue Cues to Cinematic
Trajectories (ACDC), an autonomous drone cinematography system driven by
natural language communication between human directors and drones. The main
limitation of previous drone cinematography workflows is that they require
manual selection of waypoints and view angles based on predefined human intent,
which is labor-intensive and yields inconsistent performance. In this paper, we
propose employing large language models (LLMs) and vision foundation models
(VFMs) to convert free-form natural language prompts directly into executable
indoor UAV video tours. Specifically, our method comprises a vision-language
retrieval pipeline for initial waypoint selection, a preference-based Bayesian
optimization framework that refines poses using aesthetic feedback, and a
motion planner that generates safe quadrotor trajectories. We validate ACDC
through both simulation and hardware-in-the-loop experiments, demonstrating
that it robustly produces professional-quality footage across diverse indoor
scenes without requiring expertise in robotics or cinematography. These results
highlight the potential of embodied AI agents to close the loop from
open-vocabulary dialogue to real-world autonomous aerial cinematography.
It Depends Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge
Authors: Lukas Ellinger, Georg Groh
2025-09-19
Ambiguous words or underspecified references require interlocutors to resolve
them, often by relying on shared context and commonsense knowledge. Therefore,
we systematically investigate whether Large Language Models (LLMs) can leverage
commonsense to resolve referential ambiguity in multi-turn conversations and
analyze their behavior when ambiguity persists. Further, we study how requests
for simplified language affect this capacity. Using a novel multilingual
evaluation dataset, we test DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and
Llama-3.1-8B via LLM-as-Judge and human annotations. Our findings indicate
that current LLMs struggle to resolve ambiguity effectively: they tend to
commit to
a single interpretation or cover all possible references, rather than hedging
or seeking clarification. This limitation becomes more pronounced under
simplification prompts, which drastically reduce the use of commonsense
reasoning and diverse response strategies. Fine-tuning Llama-3.1-8B with Direct
Preference Optimization substantially improves ambiguity resolution across all
request types. These results underscore the need for advanced fine-tuning to
improve LLMs' handling of ambiguity and to ensure robust performance across
diverse communication styles.
Personalized Federated Learning with Heat-Kernel Enhanced Tensorized Multi-View Clustering
Authors: Kristina P. Sinaga
2025-09-19
We present a robust personalized federated learning framework that leverages
heat-kernel enhanced tensorized multi-view fuzzy c-means clustering with
advanced tensor decomposition techniques. Our approach integrates heat-kernel
coefficients adapted from quantum field theory with Tucker decomposition and
canonical polyadic decomposition (CANDECOMP/PARAFAC) to transform conventional
distance metrics and efficiently represent high-dimensional multi-view
structures. The framework employs matriculation and vectorization techniques to
facilitate the discovery of hidden structures and multilinear relationships via
N-way generalized tensors. The proposed method introduces a dual-level
optimization scheme: local heat-kernel enhanced fuzzy clustering with tensor
decomposition operating on order-N input tensors, and federated aggregation of
tensor factors with privacy-preserving personalization mechanisms. The local
stage employs tensorized kernel Euclidean distance transformations and Tucker
decomposition to discover client-specific patterns in multi-view tensor data,
while the global aggregation process coordinates tensor factors (core tensors
and factor matrices) across clients through differential privacy-preserving
protocols. This tensorized approach enables efficient handling of
high-dimensional multi-view data with significant communication savings
through
low-rank tensor approximations.
SegDINO3D 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features
Authors: Jinyuan Qu, Hongyang Li, Xingyu Chen, Shilong Liu, Yukai Shi, Tianhe Ren, Ruitao Jing, Lei Zhang
2025-09-19
In this paper, we present SegDINO3D, a novel Transformer encoder-decoder
framework for 3D instance segmentation. As 3D training data is generally not as
sufficient as 2D training images, SegDINO3D is designed to fully leverage 2D
representation from a pre-trained 2D detection model, including both
image-level and object-level features, for improving 3D representation.
SegDINO3D takes both a point cloud and its associated 2D images as input. In
the encoder stage, it first enriches each 3D point by retrieving 2D image
features from its corresponding image views and then leverages a 3D encoder for
3D context fusion. In the decoder stage, it formulates 3D object queries as 3D
anchor boxes and performs cross-attention from 3D queries to 2D object queries
obtained from 2D images using the 2D detection model. These 2D object queries
serve as a compact object-level representation of 2D images, effectively
avoiding the challenge of keeping thousands of image feature maps in the memory
while faithfully preserving the knowledge of the pre-trained 2D model. The
introduction of 3D box queries also enables the model to modulate
cross-attention using the predicted boxes for more precise querying. SegDINO3D
achieves the state-of-the-art performance on the ScanNetV2 and ScanNet200 3D
instance segmentation benchmarks. Notably, on the challenging ScanNet200
dataset, SegDINO3D significantly outperforms prior methods by +8.7 and +6.8 mAP
on the validation and hidden test sets, respectively, demonstrating its
superiority.
Think, Verbalize, then Speak Bridging Complex Thoughts and Comprehensible Speech
Authors: Sang Hoon Woo, Sehun Lee, Kang-wook Kim, Gunhee Kim
2025-09-19
Spoken dialogue systems increasingly employ large language models (LLMs) to
leverage their advanced reasoning capabilities. However, direct application of
LLMs in spoken communication often yields suboptimal results due to mismatches
between optimal textual and verbal delivery. While existing approaches adapt
LLMs to produce speech-friendly outputs, their impact on reasoning performance
remains underexplored. In this work, we propose Think-Verbalize-Speak, a
framework that decouples reasoning from spoken delivery to preserve the full
reasoning capacity of LLMs. Central to our method is verbalizing, an
intermediate step that translates thoughts into natural, speech-ready text. We
also introduce ReVerT, a latency-efficient verbalizer based on incremental and
asynchronous summarization. Experiments across multiple benchmarks show that
our method enhances speech naturalness and conciseness with minimal impact on
reasoning. The project page with the dataset and the source code is available
at https://yhytoto12.github.io/TVS-ReVerT
BEFT Bias-Efficient Fine-Tuning of Language Models
Authors: Baichuan Huang, Ananth Balashankar, Amir Aminifar
2025-09-19
Fine-tuning all-bias-terms stands out among various parameter-efficient
fine-tuning (PEFT) techniques, owing to its out-of-the-box usability and
competitive performance, especially in low-data regimes. Bias-only fine-tuning
has the potential for unprecedented parameter efficiency. However, the link
between fine-tuning different bias terms (i.e., bias terms in the query, key,
or value projections) and downstream performance remains unclear. The existing
approaches, e.g., based on the magnitude of bias change or empirical Fisher
information, provide limited guidance for selecting the particular bias term
for effective fine-tuning. In this paper, we propose an approach for selecting
the bias term to be fine-tuned, forming the foundation of our bias-efficient
fine-tuning (BEFT). We extensively evaluate our bias-efficient approach against
other bias-selection approaches, across a wide range of large language models
(LLMs) spanning encoder-only and decoder-only architectures from 110M to 6.7B
parameters. Our results demonstrate the effectiveness and superiority of our
bias-efficient approach on diverse downstream tasks, including classification,
multiple-choice, and generation tasks.
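Mechanically, bias-only fine-tuning is a few lines; what BEFT adds is a
principled rule for which bias group to pick, which the sketch below leaves as
a simple name filter (the substring convention follows Hugging Face-style
parameter names and is an assumption):

    import torch.nn as nn

    def setup_bias_only_finetuning(model: nn.Module, which: str = "query"):
        # Freeze everything, then re-enable gradients only for the selected
        # bias terms (e.g. the biases of the query projections).
        for name, param in model.named_parameters():
            param.requires_grad = name.endswith("bias") and which in name
        trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
        total = sum(p.numel() for p in model.parameters())
        print(f"trainable: {trainable}/{total} ({100 * trainable / total:.4f}%)")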
Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs
Authors: Guoliang He, Youhe Jiang, Wencong Xiao, Kaihua Jiang, Shuguang Wang, Jun Wang, Zixian Du, Zhuo Jiang, Xinlei Zhang, Binhang Yuan, Eiko Yoneki
2025-09-19
The scaling law for large language models (LLMs) depicts that the path
towards machine intelligence necessitates training at large scale. Thus,
companies continuously build large-scale GPU clusters, and launch training jobs
that span over thousands of computing nodes. However, LLM pre-training
presents unique challenges due to its complex communication patterns, where
GPUs exchange data in sparse yet high-volume bursts within specific groups.
Inefficient resource scheduling exacerbates bandwidth contention, leading to
suboptimal training performance. This paper presents Arnold, a scheduling
system summarizing our experience to effectively align communication patterns
with data center topology at scale. An in-depth characterization study is
performed to identify the impact of physical network topology on LLM
pre-training jobs. Based on the insights, we develop a scheduling algorithm to
effectively align communication patterns with the physical network topology in
modern data centers. Through simulation experiments, we show the effectiveness
of our algorithm in reducing the maximum spread of communication groups. In
production training, our scheduling system improves the end-to-end performance
when training with more than 9600 GPUs, a significant improvement for our
training pipeline.
FedHK-MVFC Federated Heat Kernel Multi-View Clustering
Authors: Kristina P. Sinaga
2025-09-19
In the realm of distributed AI and privacy-focused medical applications, we
propose a framework for multi-view clustering that links quantum field theory
with federated healthcare analytics. Our method uses heat-kernel coefficients
from spectral analysis to convert Euclidean distances into geometry-aware
similarity measures, capturing the structure of diverse medical data. We lay
this out through the Heat Kernel Distance (HKD) transformation with convergence
guarantees. Two algorithms are developed: Heat Kernel-Enhanced Multi-View Fuzzy
Clustering (HK-MVFC) for central analysis, and Federated Heat Kernel Multi-View
Fuzzy Clustering (FedHK-MVFC) for secure, privacy-preserving learning across
hospitals using differential privacy and secure aggregation to facilitate
HIPAA-compliant collaboration. Tests on synthetic datasets of cardiovascular
patients show an increase in clustering accuracy, reduced communication, and
efficiency retention over centralized methods.
Validated on 10,000 patient records across two hospitals, it proves useful for
collaborative phenotyping involving ECG, cardiac imaging, and behavioral data.
Our theoretical contributions include update rules with proven convergence,
adaptive view weighting, and privacy-preserving protocols. This presents a new
standard for geometry-aware federated learning in healthcare, turning advanced
math into workable solutions for analyzing sensitive medical data while
ensuring both rigor and clinical relevance.
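The abstract names the Heat Kernel Distance but not its closed form; one
common heat-kernel-style transformation of the squared Euclidean distance,
shown purely as an illustration (the exact HKD and its convergence analysis
are not reproduced here), is:

    import numpy as np

    def heat_kernel_distance(x: np.ndarray, y: np.ndarray, t: float = 1.0) -> float:
        # Assumed form: a bounded, geometry-aware distance that behaves like
        # the Euclidean distance locally but saturates for far-apart points,
        # limiting outlier influence on fuzzy c-means style memberships.
        d2 = float(np.sum((x - y) ** 2))
        return 2.0 * (1.0 - np.exp(-d2 / (4.0 * t)))

    a, b = np.zeros(3), np.ones(3)
    print(heat_kernel_distance(a, b), heat_kernel_distance(a, 10 * b))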
UniGist Towards General and Hardware-aligned Sequence-level Long Context Compression
Authors: Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Tianqing Fang, Hongming Zhang, Haitao Mi, Dong Yu, Zhicheng Dou
2025-09-19
Large language models are increasingly capable of handling long-context
inputs, but the memory overhead of the key-value (KV) cache remains a major
bottleneck for general-purpose deployment. While various compression
strategies have been explored, sequence-level compression, which drops the
full KVs for certain tokens, is particularly challenging as it can lead to the
loss of important contextual information. To address this, we introduce
UniGist, a sequence-level long-context compression framework that efficiently
preserves context information by replacing raw tokens with special compression
tokens (gists) in a fine-grained manner. We adopt a chunk-free training
strategy and
design an efficient kernel with a gist shift trick, enabling optimized GPU
training. Our scheme also supports flexible inference by allowing the actual
removal of compressed tokens, resulting in real-time memory savings.
Experiments across multiple long-context tasks demonstrate that UniGist
significantly improves compression quality, with especially strong performance
in detail-recalling tasks and long-range dependency modeling.
RMT-KD Random Matrix Theoretic Causal Knowledge Distillation
Authors: Davide Ettori, Nastaran Darabi, Sureshkumar Senthilkumar, Amit Ranjan Trivedi
2025-09-19
Large deep learning models such as BERT and ResNet achieve state-of-the-art
performance but are costly to deploy at the edge due to their size and compute
demands. We present RMT-KD, a method that leverages Random Matrix
Theory (RMT) for knowledge distillation to iteratively reduce network size.
Instead of pruning or heuristic rank selection, RMT-KD preserves only
informative directions identified via the spectral properties of hidden
representations. RMT-based causal reduction is applied layer by layer with
self-distillation to maintain stability and accuracy. On GLUE, AG News, and
CIFAR-10, RMT-KD achieves up to 80% parameter reduction with only 2% accuracy
loss, delivering 2.8x faster inference and nearly halved power consumption.
These results establish RMT-KD as a mathematically grounded approach to network
distillation.
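One plausible RMT criterion for "informative directions", sketched here under
strong simplifying assumptions (the paper's layer-wise causal reduction and
self-distillation loop are not reproduced), is to keep singular values above
the Marchenko-Pastur bulk edge expected of pure noise:

    import numpy as np

    def informative_rank(h: np.ndarray) -> int:
        # h: (n_samples, dim) hidden representations for one layer.
        n, d = h.shape
        h = h - h.mean(axis=0)
        svals = np.linalg.svd(h, compute_uv=False)
        sigma2 = np.median(svals ** 2) / max(n, d)        # crude noise proxy
        mp_edge = np.sqrt(sigma2) * (np.sqrt(n) + np.sqrt(d))
        return int((svals > mp_edge).sum())               # directions to keep

    h = np.random.randn(512, 256) @ np.random.randn(256, 256)
    print("retained rank:", informative_rank(h))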
VOX-KRIKRI Unifying Speech and Language through Continuous Fusion
Authors: Dimitrios Damianos, Leon Voukoutis, Georgios Paraskevopoulos, Vassilis Katsouros
2025-09-19
We present a multimodal fusion framework that bridges pre-trained
decoder-based large language models (LLMs) and acoustic encoder-decoder
architectures such as Whisper, with the aim of building speech-enabled LLMs.
Instead of directly using audio embeddings, we explore an intermediate
audio-conditioned text space as a more effective mechanism for alignment. Our
method operates fully in continuous text representation spaces, fusing
Whisper's hidden decoder states with those of an LLM through cross-modal
attention, and supports both offline and streaming modes. We introduce
\textit{VoxKrikri}, the first Greek speech LLM, and show through analysis that
our approach effectively aligns representations across modalities. These
results highlight continuous space fusion as a promising path for multilingual
and low-resource speech LLMs, while achieving state-of-the-art results for
Automatic Speech Recognition in Greek, providing an average relative
improvement across benchmarks.
Multilingual LLM Prompting Strategies for Medical English-Vietnamese Machine Translation
Authors: Nhu Vo, Nu-Uyen-Phuong Le, Dung D. Le, Massimo Piccardi, Wray Buntine
2025-09-19
Medical English-Vietnamese machine translation (En-Vi MT) is essential for
healthcare access and communication in Vietnam, yet Vietnamese remains a
low-resource and under-studied language. We systematically evaluate prompting
strategies for six multilingual LLMs (0.5B-9B parameters) on the MedEV dataset,
comparing zero-shot, few-shot, and dictionary-augmented prompting with Meddict,
an English-Vietnamese medical lexicon. Results show that model scale is the
primary driver of performance: larger LLMs achieve strong zero-shot results,
while few-shot prompting yields only marginal improvements. In contrast,
terminology-aware cues and embedding-based example retrieval consistently
improve domain-specific translation. These findings underscore both the promise
and the current limitations of multilingual
s for medical En-Vi MT.
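A dictionary-augmented prompt of the kind evaluated here can be assembled by prepending matched lexicon entries to the translation instruction; a minimal sketch, where the toy lexicon is a stand-in for the real Meddict:

def build_prompt(source: str, lexicon: dict[str, str]) -> str:
    # collect lexicon terms that actually occur in the source sentence
    hits = {en: vi for en, vi in lexicon.items() if en.lower() in source.lower()}
    hint_block = "\n".join(f"- {en} = {vi}" for en, vi in hits.items())
    prompt = "Translate the following medical text from English to Vietnamese.\n"
    if hits:
        prompt += f"Use these terminology hints:\n{hint_block}\n"
    prompt += f"Text: {source}\nTranslation:"
    return prompt

meddict = {"hypertension": "tăng huyết áp", "diabetes": "đái tháo đường"}
print(build_prompt("Patient has a history of hypertension.", meddict))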
Interplay Between Belief Propagation and Transformer Differential-Attention Message Passing Transformer
Authors: Chin Wa Lau, Xiang Shi, Ziyan Zheng, Haiwen Cao, Nian Guo
2025-09-19
Transformer-based neural decoders have emerged as a promising approach to
error correction coding, combining data-driven adaptability with efficient
modeling of long-range dependencies. This paper presents a novel decoder
architecture that integrates classical belief propagation principles with
transformer designs. We introduce a differentiable syndrome loss function
leveraging global codebook structure and a differential-attention mechanism
optimizing bit and syndrome embedding interactions. Experimental results
demonstrate consistent performance improvements over existing
transformer-based decoders, with our approach surpassing traditional belief
propagation decoders for short-to-medium length LDPC codes.
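The differentiable syndrome loss is not written out in the abstract; a standard soft-XOR construction, sketched here as an assumption, scores how far soft bit estimates are from satisfying every parity check:

import torch

def soft_syndrome_loss(p_bits: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """p_bits: (batch, n) probabilities that each bit is 1; H: (m, n) 0/1 parity checks."""
    signs = 1.0 - 2.0 * p_bits                      # bit 0 -> +1, bit 1 -> -1
    losses = []
    for row in H:                                   # soft XOR over each parity check
        idx = row.nonzero(as_tuple=True)[0]
        check = signs[:, idx].prod(dim=1)           # +1 iff the check is satisfied (hard limit)
        losses.append(0.5 * (1.0 - check))          # 0 when satisfied, 1 when violated
    return torch.stack(losses, dim=1).mean()

Because the product is differentiable in the soft bits, this term can be added to the decoder's training objective without any code-specific labels.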
Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models
Authors: Tomoya Yamashita, Akira Ito, Yuuki Yamanaka, Masanori Yamada, Takayuki Miura, Toshiki Shibahara
2025-09-19
As large language models (LLMs) are increasingly deployed across various
applications, privacy and copyright concerns have heightened the need for more
effective unlearning techniques. Many existing unlearning methods aim to
suppress undesirable outputs through additional training (e.g., gradient
ascent), which reduces the probability of generating such outputs. While such
suppression-based approaches can control model outputs, they may not eliminate
the underlying knowledge embedded in the model's internal activations; muting a
response is not the same as forgetting it. Moreover, such suppression-based
methods often suffer from model collapse. To address these issues, we propose a
novel unlearning method that directly intervenes in the model's internal
activations. In our formulation, forgetting is defined as a state in which the
activation of a forgotten target is indistinguishable from that of ``unknown''
entities. Our method introduces an unlearning objective that moves the
activation of the target entity away from those of known entities and toward
those of unknown entities in a sparse autoencoder latent space. By aligning the
target's internal activation with those of unknown entities, we shift the
model's recognition of the target entity from ``known'' to ``unknown'',
achieving genuine forgetting while avoiding over-suppression and model
collapse. Empirically, we show that our method effectively aligns the internal
activations of the forgotten target, a result that suppression-based
approaches do not reliably achieve. Additionally, our method effectively
reduces the model's recall of target knowledge in question-answering tasks
without significant damage to non-target knowledge.
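The objective described above has a compact pull/push form in the sparse-autoencoder latent space; a minimal sketch, assuming the SAE encoder and the group centroids are given (the weighting is illustrative):

import torch
import torch.nn.functional as F

def unlearning_loss(h_target, sae_encode, mu_unknown, mu_known, margin_weight=1.0):
    """h_target: activation of the forget target; mu_*: mean SAE latents of entity groups."""
    z = sae_encode(h_target)                 # sparse-autoencoder latent of the target
    pull = F.mse_loss(z, mu_unknown)         # align the target with 'unknown' entities
    push = F.mse_loss(z, mu_known)           # move it away from 'known' entities
    return pull - margin_weight * push       # minimized during the unlearning updates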
Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding
Authors: Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li
2025-09-19
Large language models (LLMs) deliver impressive generation quality, but incur
very high inference cost because each output token is generated
auto-regressively through all model layers. Early-exit based self-speculative
decoding (EESD) has emerged to mitigate this cost. However, in practice, many
approaches struggle to achieve the expected speedup in such a draft-then-verify
paradigm even with a well-aligned early-exit head and a carefully selected exit
position. Our analysis reveals that EESD only pays off when the vast majority
of draft tokens are accepted by the LLM; otherwise, the draft cost may outweigh
the speedup gain and lead to a negative speedup. To mitigate this, we propose
Pipeline-Parallel Self-Speculative Decoding (PPSD), which fully pipelines the
draft and verification work so that no effort is wasted on failed predictions.
It has two key innovations. First, we configure the model layers as a pipeline
in which early-exit (draft) computations and remaining-layer (verification)
computations overlap. Second, we interleave drafting and verification per
token: while the LLM is verifying the current token in its final layers, the
early-exit path simultaneously drafts the next token. Such a verify-while-draft
scheme keeps all units busy and validates tokens on the fly, analogous to
pipelining the speculation and verification stages. Empirical results confirm
that PPSD achieves state-of-the-art acceleration in self-speculative decoding
inference. On diverse benchmarks, PPSD achieves speedup ratios in the range of
2.01x-3.81x, which is nearly the optimal speedup at the fixed acceptance rate
and exit position, showcasing its advancement in providing efficient
self-speculation.
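Why overlapping helps can be seen with a back-of-envelope per-token cost model (a toy model of my own, not the paper's analysis): with full-pass cost 1 and an early exit after a fraction d of the layers, pipelining hides the cheaper of draft and verify behind the other.

def per_token_latency(accept: float, d: float, pipelined: bool) -> float:
    # cost of a full forward pass = 1; early-exit draft = d; remaining verify = 1 - d
    if pipelined:
        ok = max(d, 1.0 - d)        # next token's draft overlaps current verification
    else:
        ok = d + (1.0 - d)          # draft then verify, back to back
    fail = 1.0                      # rejected draft: the full model recomputes the token
    return accept * ok + (1.0 - accept) * fail

for a in (0.6, 0.8, 0.95):
    # approximate speedup over plain auto-regressive decoding
    print(a, round(1.0 / per_token_latency(a, d=0.25, pipelined=True), 2))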
DNA-DetectLLM Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm
Authors: Xiaowei Zhu, Yubing Ren, Fang Fang, Qingfeng Tan, Shi Wang, Yanan Cao
2025-09-19
The rapid advancement of large language models (LLMs) has blurred the line
between AI-generated and human-written text. This progress brings societal
risks such as misinformation, authorship ambiguity, and intellectual property
concerns, highlighting the urgent need for reliable AI-generated text detection
methods. However, recent advances in generative language modeling have resulted
in significant overlap between the feature distributions of human-written and
AI-generated text, blurring classification boundaries and making accurate
detection increasingly challenging. To address these challenges, we propose
a DNA-inspired perspective, leveraging a repair-based process to directly and
interpretably capture the intrinsic differences between human-written and
AI-generated text. Building on this perspective, we introduce DNA-DetectLLM, a
zero-shot detection method for distinguishing AI-generated and human-written
text. The method constructs an ideal AI-generated sequence for each input,
iteratively repairs non-optimal tokens, and quantifies the cumulative repair
effort as an interpretable detection signal. Empirical evaluations demonstrate
that our method achieves state-of-the-art detection performance and exhibits
strong robustness against various adversarial attacks and input lengths.
Specifically, DNA-DetectLLM achieves relative improvements of 5.55% in AUROC
and 2.08% in F1 score across multiple public benchmark datasets.
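The cumulative repair signal can be illustrated directly from token logits: measure how much log-probability each observed token gives up relative to the model-optimal token. This is a simplified reading of the scoring, not the paper's exact procedure:

import torch

def repair_effort(logits: torch.Tensor, token_ids: torch.Tensor) -> float:
    """logits: (T, V) next-token logits at each position; token_ids: (T,) observed tokens."""
    logp = torch.log_softmax(logits, dim=-1)
    observed = logp.gather(1, token_ids.unsqueeze(1)).squeeze(1)  # log p(actual token)
    optimal = logp.max(dim=-1).values                             # log p(ideal AI token)
    gaps = optimal - observed                                     # repair cost per position
    # AI-generated text sits near the model optimum, so it needs little repair;
    # a large total suggests human-written text
    return gaps.sum().item()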
Optimization techniques for SQL+ML queries A performance analysis of real-time feature computation in OpenMLDB
Authors: Mashkhal A. Sidiq, Aras A. Salih, Samrand M. Hassan
2025-09-19
In this study, we optimize SQL+ML queries on top of OpenMLDB, an open-source
database that seamlessly integrates offline and online feature computations.
The experiments used a feature-rich synthetic dataset in Docker environments
that emulated production, processing 100 to 500 records per batch with 6 to 12
parallel requests. Optimization efforts concentrated on better query plans,
cached execution plans, parallel processing, and
resource management. The experimental results show that OpenMLDB can support
approximately 12,500 QPS with less than 1 ms latency, outperforming SparkSQL
and ClickHouse by a factor of 23 and PostgreSQL and MySQL by 3.57 times. This
study assessed the impact of optimization and showed that query plan
optimization accounted for 35% of the performance gains, caching for 25%, and
parallel processing for 20%. These results illustrate OpenMLDB's capability for
time-sensitive ML use cases, such as fraud detection, personalized
recommendation, and time series forecasting. The system's modular optimization
framework, which combines batch and stream processing without interference,
contributes to its significant performance gain over traditional database
systems, particularly in applications that require real-time feature
computation and serving. This study contributes to the understanding and design
of high-performance SQL+ML systems and highlights the need for specialized SQL
optimization for ML workloads.
LLM Cache Bandit Revisited Addressing Query Heterogeneity for Cost-Effective LLM Inference
Authors: Hantao Yang, Hong Xie, Defu Lian, Enhong Chen
2025-09-19
This paper revisits the LLM cache bandit problem, with a special focus on
addressing query heterogeneity for cost-effective LLM inference. Previous
works often assume uniform query sizes. Heterogeneous query sizes introduce a
combinatorial structure for cache selection, making the cache replacement
process more computationally and statistically challenging. We treat optimal
cache selection as a knapsack problem and employ an accumulation-based strategy
to effectively balance computational overhead and cache updates. In theoretical
analysis, we prove a regret bound for our algorithm that improves the leading
coefficient, in terms of the total number of queries and the cache size, over
the prior result of Berkeley. Additionally, we provide a problem-dependent
bound, which was absent in previous works. Experiments on real-world data show
that our algorithm reduces the total cost by approximately 12\%.
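Viewing cache selection as a knapsack suggests a simple density-greedy baseline: rank entries by estimated savings per unit size and fill the cache. A minimal sketch, where the estimated-savings field stands in for the algorithm's accumulated statistics:

def select_cache(entries, capacity):
    """entries: list of (query_id, size, est_savings); returns ids to keep in cache."""
    ranked = sorted(entries, key=lambda e: e[2] / e[1], reverse=True)  # savings density
    chosen, used = [], 0
    for qid, size, _ in ranked:
        if used + size <= capacity:          # greedy knapsack fill under the size budget
            chosen.append(qid)
            used += size
    return chosen

print(select_cache([("a", 4, 8.0), ("b", 2, 6.0), ("c", 3, 3.0)], capacity=6))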
A Novel Semantic Compression Approach for Ultra-low Bandwidth Voice Communication
Authors: Ryan Collette, Ross Greenwood, Serena Nicoll
2025-09-18
While existing speech audio codecs exploit limited forms of temporal redundancy
and allow for multi-scale representations, they tend to represent all features
of audio in the same way. In contrast, generative voice models designed for
text-to-speech and voice transfer tasks have recently proved effective at
factorizing audio signals into high-level semantic representations of
fundamentally distinct features. In this paper, we leverage such
representations in a novel semantic communications approach to achieve lower
bitrates without sacrificing perceptual quality or suitability for specific
downstream tasks. Our technique matches or outperforms existing audio codecs on
transcription, sentiment analysis, and speaker verification when encoding at
2-4x lower bitrate, notably surpassing Encodec in perceptual quality and
speaker verification while using up to 4x less bitrate.
CAGE Continuity-Aware edGE Network Unlocks Robust Floorplan Reconstruction
Authors: Yiyi Liu, Chunyang Liu, Weiqin Jiao, Bojian Wu, Fashuai Li, Biao Xiong
2025-09-18
We present \textbf{CAGE} (\textit{Continuity-Aware edGE}) network, a robust
framework for reconstructing vector floorplans directly from point-cloud
density maps. Traditional corner-based polygon representations
are highly sensitive to noise and incomplete observations, often resulting in
fragmented or implausible layouts. Recent line grouping methods leverage
structural cues to improve robustness but still struggle to recover fine
geometric details. To address these limitations, we propose a \textit{native}
edge-centric formulation, modeling each wall segment as a directed,
geometrically continuous edge. This representation enables inference of
coherent floorplan structures, ensuring watertight, topologically valid room
boundaries while improving robustness and reducing artifacts. Towards this
design, we develop a dual-query decoder that integrates perturbed
and latent queries within a denoising framework, which not only stabilizes
optimization but also accelerates convergence. Extensive experiments on
Structured3D and SceneCAD show that \textbf{CAGE} achieves state-of-the-art
performance, with F1 scores of 99.1\% (rooms), 91.7\% (corners), and 89.3\%
(angles). The method also demonstrates strong cross-dataset generalization,
underscoring the efficacy of our architectural innovations. Code and pretrained
models will be released upon acceptance.
IMPQ Interaction-Aware Layerwise Mixed Precision Quantization for LLMs
Authors: Junchen Zhao, Ali Derakhshan, Dushyant Bharadwaj, Jayden Kana Hyman, Junhao Dong, Sangeetha Abdu Jyothi, Ian Harris
2025-09-18
Large Language Models (LLMs) promise impressive capabilities, yet their
multi-billion-parameter scale makes on-device or low-resource deployment
prohibitive. Mixed-precision quantization offers a compelling solution, but
existing methods struggle when the average precision drops below four bits, as
they rely on isolated, layer-specific metrics that overlook critical
inter-layer interactions affecting overall performance. In this paper, we
propose two innovations to address these limitations. First, we frame the
mixed-precision quantization problem as a cooperative game among layers and
introduce Shapley-based Progressive Quantization Estimation (SPQE) to
efficiently obtain accurate Shapley estimates of layer sensitivities and
inter-layer interactions. Second, building upon SPQE, we propose
Interaction-aware Mixed-Precision Quantization (IMPQ) which translates these
Shapley estimates into a binary quadratic optimization formulation, assigning
either 2 or 4-bit precision to layers under strict memory constraints.
Comprehensive experiments conducted on Llama-3, Gemma-2, and Qwen-3 models
across three independent PTQ backends (Quanto, HQQ, GPTQ) demonstrate IMPQ's
scalability and consistently superior performance compared to methods relying
solely on isolated metrics. Across average precisions spanning 4 bits down to 2
bits, IMPQ cuts perplexity by 20 to 80 percent relative to the best baseline,
with the margin growing as the bit-width tightens.
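At toy scale, the binary quadratic assignment can be illustrated by exhaustively scoring 2-bit/4-bit choices against Shapley-style sensitivity and interaction terms; the coefficients below are made up, and real instances would need a proper QUBO solver rather than brute force:

import itertools
import numpy as np

def impq_assign(phi, Phi, mem, budget):
    """phi: (L,) per-layer sensitivities; Phi: (L, L) pairwise interactions;
    mem[b]: per-layer memory at bit-width b; x_i = 1 means 4-bit, 0 means 2-bit."""
    L = len(phi)
    best, best_cost = None, np.inf
    for bits in itertools.product([0, 1], repeat=L):
        x = np.array(bits)
        used = np.where(x == 1, mem[4], mem[2]).sum()
        if used > budget:                             # strict memory constraint
            continue
        low = 1 - x                                   # indicator of layers pushed to 2-bit
        cost = phi @ low + low @ Phi @ low            # sensitivity + interaction penalty
        if cost < best_cost:
            best, best_cost = x, cost
    return best

rng = np.random.default_rng(0)
phi, Phi = rng.random(8), rng.random((8, 8)) * 0.1
print(impq_assign(phi, Phi, {2: 1.0, 4: 2.0}, budget=12.0))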
LLM-Assisted Topic Reduction for BERTopic on Social Media Data
Authors: Wannes Janssens, Matthias Bogaert, Dirk Van den Poel
2025-09-18
The BERTopic framework leverages transformer embeddings and hierarchical
clustering to extract latent topics from unstructured text corpora. While
effective, it often struggles with social media data, which tends to be noisy
and sparse, resulting in an excessive number of overlapping topics. Recent work
explored the use of large language models for end-to-end topic modelling.
However, these approaches typically require significant computational overhead,
limiting their scalability in big data contexts. In this work, we propose a
framework that combines BERTopic for topic generation with large language
models for topic reduction. The method first generates an initial set of topics
and constructs a representation for each. These representations are then
provided as input to the language model, which iteratively identifies and
merges semantically similar topics. We evaluate the approach across three
Twitter/X datasets and four different language models. Our method outperforms
the baseline approach in enhancing topic diversity and, in many cases,
coherence, with some sensitivity to dataset characteristics and initial
parameter selection.
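The generate-then-reduce loop can be sketched as follows, with the language-model call stubbed out; the merge criterion and prompt are illustrative, not the paper's:

def reduce_topics(topics: dict[int, list[str]], ask_llm) -> dict[int, list[str]]:
    """topics: topic id -> top words; ask_llm(a, b) -> True if two topics should merge."""
    merged = True
    while merged:                                     # repeat until no pair merges
        merged = False
        ids = list(topics)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if ask_llm(topics[a], topics[b]):     # LLM judges semantic similarity
                    topics[a] = list(dict.fromkeys(topics[a] + topics[b]))  # ordered union
                    del topics[b]
                    merged = True
                    break
            if merged:
                break
    return topics

toy = {0: ["flood", "rain"], 1: ["rain", "storm"], 2: ["election"]}
print(reduce_topics(toy, lambda a, b: len(set(a) & set(b)) > 0))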
LNE-Blocking An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models
Authors: Ruijie Hou, Yueyang Jiao, Hanxu Hu, Yingming Li, Wai Lam, Huajian Zhang, Hongyuan Lu
2025-09-18
The problem of data contamination is now almost inevitable during the
development of large language models (LLMs), whose training data commonly
incorporates evaluation benchmarks, even unintentionally. This makes it hard to
benchmark LLMs fairly. Instead of constructing contamination-free datasets
(which is quite hard), we propose a novel framework, \textbf{LNE-Blocking}, to
restore model performance prior to contamination on potentially leaked
datasets. Our framework consists of two components: contamination detection and
a disruption operation. For a given prompt, the framework first uses the
contamination detection method, \textbf{LNE}, to assess the extent of
contamination in the model. Based on this, it adjusts the intensity of the
disruption operation, \textbf{Blocking}, to elicit non-memorized responses from
the model. Our framework is the first to efficiently restore the model's greedy
decoding performance, and it performs strongly on
multiple datasets with potential leakage risks, and it consistently achieves
stable recovery results across different models and varying levels of data
contamination. We release the code at https://github.com/RuijieH/LNE-Blocking
to facilitate research.
Beyond Surface Alignment Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction
Authors: Yuanbo Xie, Yingjie Zhang, Tianyun Liu, Duohe Ma, Tingwen Liu
2025-09-18
Jailbreak attacks pose persistent threats to large language models (LLMs).
Current safety alignment methods have attempted to address these issues, but
they suffer from two significant limitations: insufficient safety alignment
depth and unrobust internal defense mechanisms. These limitations make them
vulnerable to adversarial attacks such as prefilling and refusal-direction
manipulation. We introduce DeepRefusal, a robust safety alignment framework
that overcomes these issues. DeepRefusal forces the model to dynamically
rebuild its refusal mechanisms from jailbreak states. This is achieved by
probabilistically ablating the refusal direction across layers and token depths
during fine-tuning. Our method not only defends against prefilling and
refusal-direction attacks but also demonstrates strong resilience against other
unseen jailbreak strategies. Extensive evaluations on four open-source LLM
families and six representative attacks show that DeepRefusal reduces attack
success rates by approximately 95%, while maintaining model capabilities with
minimal performance degradation.
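Probabilistic ablation of a refusal direction has a compact form: with some probability per layer, project the hidden state off the unit-norm refusal direction. A minimal sketch under that reading of the method:

import torch

def maybe_ablate_refusal(h: torch.Tensor, r: torch.Tensor, p: float) -> torch.Tensor:
    """h: (..., d) hidden states; r: (d,) refusal direction; p: ablation probability."""
    if torch.rand(()) < p:                      # probabilistic: only ablate sometimes
        r_hat = r / r.norm()
        coeff = (h @ r_hat).unsqueeze(-1)       # component of h along the refusal direction
        h = h - coeff * r_hat                   # remove it, simulating a jailbroken state
    return h

Fine-tuning against states perturbed this way forces the model to relearn refusal from deeper, redundant features rather than a single linear direction.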
MaRVIn A Cross-Layer Mixed-Precision RISC-V Framework for DNN Inference, from ISA Extension to Hardware Acceleration
Authors: Giorgos Armeniakos, Alexis Maras, Sotirios Xydis, Dimitrios Soudris
2025-09-18
The evolution of quantization and mixed-precision techniques has unlocked new
possibilities for enhancing the speed and energy efficiency of NNs. Several
recent studies indicate that adapting precision levels across different
parameters can maintain accuracy comparable to full-precision models while
significantly reducing computational demands. However, existing embedded
microprocessors lack sufficient architectural support for efficiently executing
mixed-precision NNs, both in terms of ISA extensions and hardware design,
resulting in inefficiencies such as excessive data packing/unpacking and
underutilized arithmetic units. In this work, we propose novel ISA extensions
and a micro-architecture implementation specifically designed to optimize
mixed-precision execution, enabling energy-efficient deep learning inference on
RISC-V architectures. We introduce MaRVIn, a cross-layer hardware-software
co-design framework that enhances power efficiency and performance through a
combination of hardware improvements, mixed-precision quantization, ISA-level
optimizations, and cycle-accurate emulation. At the hardware level, we enhance
the ALU with configurable mixed-precision arithmetic (2, 4, 8 bits) for
weights/activations and employ multi-pumping to reduce execution latency while
implementing soft SIMD for efficient 2-bit ops. At the software level, we
integrate a quantization-aware fine-tuning method to optimize model
quantization and a greedy-based DSE approach to efficiently search for
Pareto-optimal mixed-quantized models. Additionally, we incorporate voltage
scaling to boost
the power efficiency of our system. Our experimental evaluation over widely
used DNNs and datasets, such as CIFAR10 and ImageNet, demonstrates that our
framework can achieve, on average, 17.6x speedup for less than 1% accuracy loss
and outperforms the ISA-agnostic state-of-the-art RISC-V cores, delivering up
to 1.8 TOPs/W.
Stabilizing Information Flow Entropy Regularization for Safe and Interpretable Autonomous Driving Perception
Authors: Haobo Yang, Shiyan Zhang, Zhuoyi Yang, Jilong Guo, Jun Yang, Xinyu Zhang
2025-09-18
Deep perception networks in autonomous driving traditionally rely on
data-intensive training regimes and post-hoc anomaly detection, often
disregarding fundamental information-theoretic constraints governing stable
information processing. We reconceptualize deep neural encoders as hierarchical
compression chains that incrementally compress raw sensory inputs into
task-relevant latent features. Within this framework, we establish two
theoretically justified design principles for robust perception: (D1) smooth
variation of mutual information between consecutive layers, and (D2) monotonic
decay of latent entropy with network depth. Our analysis shows that, under
realistic architectural assumptions, particularly blocks comprising repeated
layers of similar capacity, enforcing smooth information flow (D1) naturally
encourages entropy decay (D2), thus ensuring stable compression. Guided by
these insights, we propose Eloss, a novel entropy-based regularizer designed as
a lightweight, plug-and-play training objective. Rather than marginal accuracy
improvements, this approach represents a conceptual shift: it unifies
information-theoretic stability with standard perception tasks, enabling
explicit, principled detection of anomalous sensor inputs through entropy
deviations. Experimental validation on large-scale 3D object detection
benchmarks (KITTI and nuScenes) demonstrates that incorporating Eloss
consistently achieves competitive or improved accuracy while dramatically
enhancing sensitivity to anomalies, amplifying distribution-shift signals by up
to two orders of magnitude. This stable information-compression perspective not
only improves interpretability but also establishes a solid theoretical
foundation for safer, more robust autonomous driving perception systems.
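One way to operationalize D1 and D2 as a plug-and-play regularizer is to penalize abrupt changes in a per-layer entropy proxy; the sketch below uses a Gaussian log-variance proxy for latent entropy, which is my assumption and may differ from the paper's estimator:

import torch

def eloss(layer_acts: list[torch.Tensor]) -> torch.Tensor:
    """layer_acts: activations per layer, each (batch, features); needs >= 3 layers."""
    # Gaussian proxy: entropy ~ 0.5 * log variance, up to constants
    ents = torch.stack([0.5 * a.var(dim=0).clamp_min(1e-8).log().mean()
                        for a in layer_acts])
    deltas = ents[1:] - ents[:-1]                        # layer-to-layer information change
    smooth = (deltas[1:] - deltas[:-1]).pow(2).mean()    # D1: smooth variation of flow
    decay = torch.relu(deltas).mean()                    # D2: penalize entropy increases
    return smooth + decay

Large values of this term on a new input then serve as an anomaly signal, since distribution-shifted inputs disturb the smooth entropy profile.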
A1 Asynchronous Test-Time Scaling via Conformal Prediction
Authors: Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng, Chenyang Zhao, Hui Shen, Alexander Hanbo Li, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Lingpeng Kong, Ngai Wong
2025-09-18
Large language models (LLMs) benefit from test-time scaling, but existing
methods face significant challenges, including severe synchronization overhead,
memory bottlenecks, and latency, especially during speculative decoding with
long reasoning chains. We introduce A1 (Asynchronous Test-Time Scaling), a
statistically guaranteed adaptive inference framework that addresses these
challenges. A1 refines arithmetic intensity to identify synchronization as the
dominant bottleneck, proposes an online calibration strategy to enable
asynchronous inference, and designs a three-stage rejection sampling pipeline
that supports both sequential and parallel scaling. Through experiments on the
MATH, AMC23, AIME24, and AIME25 datasets, across various draft-target model
families, we demonstrate that A1 achieves a remarkable 56.7x speedup in
test-time scaling and a 4.14x improvement in throughput, all while maintaining
accurate rejection-rate control, reducing latency and memory overhead, and
incurring no accuracy loss compared to using target model scaling alone. These
results position A1 as an efficient and principled solution for scalable LLM
inference.
We have released the code at
https://github.com/menik1126/asynchronous-test-time-scaling.
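The statistical guarantee rests on conformal calibration; the generic split-conformal threshold below is the standard recipe, not A1's exact pipeline: pick a finite-sample-corrected quantile of calibration nonconformity scores and reject drafts that exceed it.

import numpy as np

def conformal_threshold(cal_scores: np.ndarray, alpha: float) -> float:
    """Split conformal: with n calibration scores, take the ceil((n+1)(1-alpha))/n quantile."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1.0 - alpha)) / n, 1.0)
    return float(np.quantile(cal_scores, q, method="higher"))

scores = np.random.default_rng(0).random(500)
tau = conformal_threshold(scores, alpha=0.1)
print(tau)   # accept a draft token only if its nonconformity score is <= tau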
Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning
Authors: Lei Wang, Jieming Bian, Letian Zhang, Jie Xu
2025-09-18
Large Language Models (LLMs) have demonstrated impressive capabilities across
various tasks, but fine-tuning them for domain-specific applications often
requires substantial domain-specific data that may be distributed across
multiple organizations. Federated Learning (FL) offers a privacy-preserving
solution, but faces challenges with computational constraints when applied to
LLMs. Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient
fine-tuning approach, though a single LoRA module often struggles with
heterogeneous data across diverse domains. This paper addresses two critical
challenges in federated LoRA fine-tuning: 1. determining the optimal number and
allocation of LoRA experts across heterogeneous clients, and 2. enabling
clients to selectively utilize these experts based on their specific data
characteristics. We propose FedLEASE (Federated adaptive LoRA Expert Allocation
and SElection), a novel framework that adaptively clusters clients based on
representation similarity to allocate and train domain-specific LoRA experts.
It also introduces an adaptive Mixture-of-Experts selection mechanism that allows
each client to select the optimal number of utilized experts. Our extensive
experiments on diverse benchmark datasets demonstrate that FedLEASE
significantly outperforms existing federated fine-tuning approaches in
heterogeneous client settings while maintaining communication efficiency.
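Representation-similarity clustering for expert allocation can be sketched with k-means over per-client representation vectors; the cluster count and features below are illustrative, and the paper's procedure for choosing them is not reproduced here:

import numpy as np
from sklearn.cluster import KMeans

def allocate_experts(client_reprs: np.ndarray, n_experts: int) -> dict[int, list[int]]:
    """client_reprs: (n_clients, d) mean hidden representations; returns expert -> clients."""
    labels = KMeans(n_clusters=n_experts, n_init=10, random_state=0).fit_predict(client_reprs)
    groups: dict[int, list[int]] = {}
    for client, expert in enumerate(labels):
        groups.setdefault(int(expert), []).append(client)
    return groups                                  # each cluster trains its own LoRA expert

reprs = np.random.default_rng(0).normal(size=(12, 16))
print(allocate_experts(reprs, n_experts=3))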
Communication Efficient Split Learning of ViTs with Attention-based Double Compression
Authors: Federico Alvetreti, Jary Pomponi, Paolo Di Lorenzo, Simone Scardapane
2025-09-18
This paper proposes a novel communication-efficient Split Learning (SL)
framework, named Attention-based Double Compression (ADC), which reduces the
communication overhead required for transmitting intermediate Vision
Transformer activations during the SL training process. ADC incorporates two
parallel compression strategies. The first merges samples' activations that
are similar, based on the average attention score calculated in the last client
layer; this strategy is class-agnostic, meaning that it can merge samples of
different classes without losing generalization ability or degrading final
results. The second strategy follows the first and discards the least
meaningful tokens, further reducing the communication cost. Combining these
strategies not only sends less data during the forward pass, but also
naturally compresses the gradients, allowing the whole model to be trained
without additional tuning or gradient approximations. Simulation results
demonstrate that Attention-based Double Compression outperforms
state-of-the-art SL frameworks by significantly reducing communication
overheads while maintaining high accuracy.
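The second strategy, discarding the least meaningful tokens, reduces to keeping the tokens that receive the most attention on average; a minimal sketch that omits batching and the sample-merging strategy:

import torch

def prune_tokens(acts: torch.Tensor, attn: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """acts: (tokens, d) client-side activations; attn: (heads, tokens, tokens) attention."""
    score = attn.mean(dim=0).mean(dim=0)          # average attention each token receives
    k = max(1, int(keep_ratio * acts.shape[0]))
    keep = score.topk(k).indices.sort().values    # retain the k most-attended tokens, in order
    return acts[keep]                             # only these activations are transmitted

x = torch.randn(196, 384)
a = torch.softmax(torch.randn(6, 196, 196), dim=-1)
print(prune_tokens(x, a, keep_ratio=0.5).shape)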
Value-Guided KV Compression for LLMs via Approximated CUR Decomposition
Authors: Ayan Sengupta, Siddhant Chaudhary, Tanmoy Chakraborty
2025-09-18
Key-value (KV) cache compression has emerged as a critical technique for
reducing the memory and latency overhead of autoregressive language models
during inference. Prior approaches predominantly rely on query-key attention
scores to rank and evict cached tokens, assuming that attention intensity
correlates with semantic importance. However, this heuristic overlooks the
contribution of value vectors, which directly influence the attention output.
In this paper, we propose CurDKV, a novel, value-centric KV compression method
that selects keys and values based on leverage scores computed from CUR matrix
decomposition. Our approach approximates the dominant subspace of the attention
output, ensuring that the retained tokens best preserve the model's predictive
behavior. Theoretically, we show that attention score approximation does not
guarantee output preservation, and demonstrate that CUR-based selection
minimizes end-to-end attention reconstruction loss. Empirically, CurDKV
achieves up to 9.6% higher accuracy than state-of-the-art methods like SnapKV
and ChunkKV under aggressive compression budgets on LLaMA and Mistral, while
maintaining compatibility with FlashAttention and Grouped Query Attention. In
addition to improved accuracy, CurDKV reduces generation latency by up to 40%
at high compression, offering a practical speed-accuracy tradeoff.
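Leverage-score selection over the value matrix can be sketched with a truncated SVD: a token's leverage is the squared row norm of the top-k left singular subspace, and high-leverage tokens are kept. This is a simplified reading of the CUR step, not the paper's full algorithm:

import torch

def leverage_keep(V: torch.Tensor, k: int, budget: int) -> torch.Tensor:
    """V: (tokens, d) value vectors; k: subspace rank; budget: tokens to keep."""
    U, S, _ = torch.linalg.svd(V, full_matrices=False)
    lev = U[:, :k].pow(2).sum(dim=1)                 # leverage score of each token (row of V)
    keep = lev.topk(budget).indices.sort().values    # evict all other entries from the KV cache
    return keep

V = torch.randn(1024, 128)
idx = leverage_keep(V, k=16, budget=256)
print(idx.shape)   # indices of tokens whose K/V entries survive compression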