2025-11-28

DSD A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO
Tidal forces around the Letelier-Alencar cloud of strings black hole
Mechanistic Interpretability for Transformer-based Time Series Classification
E-M3RF An Equivariant Multimodal 3D Re-assembly Framework
Prune4Web DOM Tree Pruning Programming for Web Agent
Anomaly Detection with Adaptive and Aggressive Rejection for Contaminated Training Data
Discovery and recovery of crystalline materials with property-conditioned transformers
Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale
Scenes as Tokens Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding
Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models
MLPMoE Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts
Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning
RAVQ-HoloNet Rate-Adaptive Vector-Quantized Hologram Compression
Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI
A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving
Towards Audio Token Compression in Large Audio Language Models
TrafficLens Multi-Camera Traffic Video Analysis Using LLMs
Emergence and Localisation of Semantic Role Circuits in LLMs
Latent Collaboration in Multi-Agent Systems
Quantum-Resistant Authentication Scheme for RFID Systems Using Lattice-Based Cryptography
DiFR Inference Verification Despite Nondeterminism
DP-MicroAdam Private and Frugal Algorithm for Training and Fine-tuning
Object-Centric Vision Token Pruning for Vision Language Models
The Case for Intent-Based Query Rewriting
FREE Uncertainty-Aware Autoregression for Parallel Diffusion Transformers
Scaling LLM Speculative Decoding Non-Autoregressive Forecasting in Large-Batch Scenarios
IrisNet Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection
Beyond Components Singular Vector-Based Interpretability of Transformer Circuits
Communication-Efficient Learning for Satellite Constellations
Interactive AI NPCs Powered by LLMs Technical Report for the CPDC Challenge 2025
Beluga A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management
SKEL-CF Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery
IDAP++ Advancing Divergence-Based Pruning via Filter-Level and Layer-Level Optimization
SSA Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
MTA A Merge-then-Adapt Framework for Personalized Large Language Model
FLaTEC Frequency-Disentangled Latent Triplanes for Efficient Compression of LiDAR Point Clouds
Adaptive Knowledge Transfer for Cross-Disciplinary Cold-Start Knowledge Tracing
DeeAD Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving
M $^3$ Prune Hierarchical Communication Graph Pruning for Efficient Multi-Modal Multi-Agent Retrieval-Augmented Generation
ParaBlock Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models
AI/ML based Joint Source and Channel Coding for HARQ-ACK Payload

DSD A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving

Authors: Fengze Yu, Leshu Li, Brad McDanel, Saiqian Zhang

2025-11-26

http://arxiv.org/abs/2511.21669v1

Large language model () inference often suffers from high latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative (SD) techniques accelerate token generation but remain confined to single-node execution. We propose DSD, a distributed speculative framework that extends SD to multi-device deployments through coordinated draft-target execution. Given the lack of prior work on simulating this paradigm, we first introduce DSD-Sim, a discrete-event simulator that captures network, batching, and scheduling dynamics. Building on insights from DSD-Sim, we further design an Adaptive Window Control (AWC) policy that dynamically adjusts speculation window size to optimize throughput. Experiments across diverse workloads show that DSD achieves up to 1.1x speedup and 9.7% higher throughput over existing SD baselines, enabling agile and scalable across edge and cloud.

Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO

Authors: Daniel R. Jiang, Jalaj Bhandari, Yukai Yang, Rémi Munos, Tyler Lu

2025-11-26

http://arxiv.org/abs/2511.21638v1

Optimizing large language models (s) for multi-turn conversational outcomes remains a significant challenge, especially in goal-oriented settings like AI marketing or sales agents who facilitate transactions via messaging platforms. The difficulty stems from , long-horizon rewards and the discrepancy between response-level planning and token-level generation. In this technical note, we propose a formal reduction of the multi-turn RL problem into a sequence of single-turn RLHF-style problems. This is achieved by setting a learned multi-turn Q-function as the reward model for the single-turn problem. We demonstrate and prove a key insight: solving this single-turn RL problem with standard token-level PPO is equivalent to a policy improvement step within the multi-turn problem. This insight naturally leads to Iterative PPO, a batch online policy iteration algorithm that alternates between fitting Q-functions from logged conversation trajectories and improving the policy. A major practical advantage is that Iterative PPO directly leverages stable, off-the-shelf single-turn RLHF tools, making it straightforward to implement. Our method occupies a middle ground between fully online and fully offline approaches, retaining the adaptability of online updates while gaining the stability benefits of offline training.

Tidal forces around the Letelier-Alencar cloud of strings black hole

Authors: Marcos V. de S. Silva, T. M. Crispim, R. R. Landim, Gonzalo Olmo, Diego Sáez-Chillón Gómez

2025-11-26

http://arxiv.org/abs/2511.21604v1

In this work, we investigate relativistic tidal forces around a black hole sourced by a cloud of strings, described by the generalized Letelier-Alencar solution. We first review the original Letelier spacetime and its recent generalization, computing the Kretschmann scalar, and showing that the generalized model exhibits a stronger curvature divergence at $r \to 0$ than both Letelier and Schwarzschild cases. We then analyze geodesic motion in this background. For massless particles, we focus on circular photon orbits, while for massive particles, we consider both radial infall and circular motion. We find that the radii of the photon sphere and of the innermost stable circular orbit increase with the cloud of strings parameter $g_s$ and decrease with the length scale $l_s$ , and that circular orbits cease to exist in certain regions of the parameter space. For radial motion, we compute the radial and the corresponding tidal forces. In this case, we show that an inversion between stretching and may occur, although this regime is typically hidden inside the event horizon. Finally, we study tidal forces for observers in circular motion, showing that the cloud of strings modifies the Keplerian frequency and the tidal force profile even at large distances, and that in this case there is no sign change of the tidal components.

Mechanistic Interpretability for Transformer-based Time Series Classification

Authors: Matīss Kalnāre, Sofoklis Kitharidis, Thomas Bäck, Niki van Stein

2025-11-26

http://arxiv.org/abs/2511.21514v1

Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing explainability methods often focus on input-output attributions, leaving the internal mechanisms largely opaque. This paper addresses this gap by adapting various Mechanistic Interpretability techniques; activation patching, attention saliency, and autoencoders, from NLP to architectures designed explicitly for time series classification. We systematically probe the internal causal roles of individual attention heads and timesteps, revealing causal structures within these models. Through experimentation on a benchmark time series dataset, we construct causal graphs illustrating how information propagates internally, highlighting key attention heads and temporal positions driving correct classifications. Additionally, we demonstrate the potential of autoencoders for uncovering interpretable latent features. Our findings provide both methodological contributions to interpretability and novel insights into the functional mechanics underlying performance in time series classification tasks.

E-M3RF An Equivariant Multimodal 3D Re-assembly Framework

Authors: Adeela Islam, Stefano Fiorini, Manuel Lecha, Theodore Tsesmelis, Stuart James, Pietro Morerio, Alessio Del Bue

2025-11-26

http://arxiv.org/abs/2511.21422v1

3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been challenged by deep learning methods rather than classical optimization. While learning approaches have shown promising results, most still rely primarily on geometric features to assemble a whole from its parts. As a result, methods struggle when geometry alone is insufficient or ambiguous, for example, for small, eroded, or symmetric fragments. Additionally, solutions do not impose physical constraints that explicitly prevent ping assemblies. To address these limitations, we introduce E-M3RF, an equivariant multimodal 3D reassembly framework that takes as input the point clouds, containing both point positions and colors of fractured fragments, and predicts the transformations required to reassemble them using SE(3) flow matching. Each fragment is represented by both geometric and color features: i) 3D point positions are encoded as rotationconsistent geometric features using a rotation-equivariant encoder, ii) the colors at each 3D point are encoded with a . The two feature sets are then combined to form a multimodal representation. We experimented on four datasets: two synthetic datasets, Breaking Bad and Fantastic Breaks, and two real-world cultural heritage datasets, RePAIR and Presious, demonstrating that E-M3RF on the RePAIR dataset reduces rotation error by 23.1% and translation error by 13.2%, while Chamfer Distance decreases by 18.4% compared to competing methods.

Prune4Web DOM Tree Pruning Programming for Web Agent

Authors: Jiayuan Zhang, Kaiquan Chen, Zhihao Lu, Enshen Zhou, Qian Yu, Jing Zhang

2025-11-26

http://arxiv.org/abs/2511.21398v1

Web automation employs intelligent agents to execute high-level tasks by mimicking human interactions with web interfaces. Despite the capabilities of recent Large Language Model ()-based web agents, navigating complex, real-world webpages efficiently remains a significant hurdle due to the prohibitively large size of Document Object Model (DOM) structures, often ranging from 10,000 to 100,000 tokens. Existing strategies typically rely on crude DOM truncation -- risking the loss of critical information -- or employ inefficient heuristics and separate ranking models, failing to achieve an optimal balance between precision and scalability. To address these challenges, we introduce Prune4Web, a novel paradigm that shifts DOM processing from resource-intensive reading to efficient programmatic . Central to our approach is DOM Tree Pruning Programming, where an generates executable Python scoring scripts to dynamically filter DOM elements based on semantic cues from decomposed sub-tasks. This mechanism eliminates the need for s to ingest raw, massive DOMs, instead delegating traversal and scoring to lightweight, interpretable programs. This methodology achieves a 25x to 50x reduction in candidate elements for grounding, thereby facilitating precise action localization while mitigating attention dilution. Furthermore, we propose a specialized data annotation pipeline and a two-turn dialogue training strategy that jointly optimizes the Planner, Programmatic Filter, and Grounder within a unified framework. Extensive experiments demonstrate state-of-the-art performance. Notably, on our low-level grounding task, Prune4Web dramatically improves accuracy from 46.8% to 88.28%, underscoring its efficacy in real-world web automation.

Anomaly Detection with Adaptive and Aggressive Rejection for Contaminated Training Data

Authors: Jungi Lee, Jungkwon Kim, Chi Zhang, Kwangsun Yoo, Seok-Joo Byun

2025-11-26

http://arxiv.org/abs/2511.21378v1

Handling contaminated data poses a critical challenge in anomaly detection, as traditional models assume training on purely normal data. Conventional methods mitigate contamination by relying on fixed contamination ratios, but discrepancies between assumed and actual ratios can severely degrade performance, especially in noisy environments where normal and abnormal data distributions . To address these limitations, we propose Adaptive and Aggressive Rejection (AAR), a novel method that dynamically excludes anomalies using a modified z-score and Gaussian mixture model-based thresholds. AAR effectively balances the trade-off between pre normal data and excluding anomalies by integrating hard and soft rejection strategies. Extensive experiments on two image datasets and thirty tabular datasets demonstrate that AAR outperforms the state-of-the-art method by 0.041 AUROC. By providing a scalable and reliable solution, AAR enhances robustness against contaminated datasets, paving the way for broader real-world applications in domains such as security and healthcare.

Discovery and recovery of crystalline materials with property-conditioned transformers

Authors: Cyprien Bone, Matthew Walker, Kuangdai Leng, Luis M. Antunes, Ricardo Grau-Crespo, Amil Aligayev, Javier Dominguez, Keith T. Butler

2025-11-26

http://arxiv.org/abs/2511.21299v1

Generative models have recently shown great promise for accelerating the design and discovery of new functional materials. Conditional generation enhances this capacity by allowing inverse design, where specific desired properties can be requested during the generation process. However, conditioning of -based approaches, in particular, is constrained by discrete tokenisation schemes and the risk of catastrophic forgetting during fine-tuning. This work introduces Crysta-π (property injection), a conditional autoregressive framework that integrates continuous property representations directly into the 's attention mechanism. Two architectures, Property-Key-Value (P) Prefix attention and P Residual attention, are presented. These methods bypass inefficient sequence-level tokenisation and preserve foundational knowledge from unsupervised pre-training on Crystallographic Information Files (CIFs) as textual input. We establish the efficacy of these mechanisms through systematic robustness studies and evaluate the framework's versatility across two distinct tasks. First, for structure recovery, the model processes high-dimensional, heterogeneous X-ray diffraction patterns, achieving structural accuracy competitive with specialised models and demonstrating applications to experimental structure recovery and polymorph differentiation. Second, for materials discovery, the model is fine-tuned on a specialised photovoltaic dataset to generate novel, stable candidates validated by Density Functional Theory (DFT). It implicitly learns to target optimal band gap regions for high photovoltaic efficiency, demonstrating a capability to map complex structure-property relationships. Crysta-π provides a unified, flexible, and computationally efficient framework for inverse materials design.

Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale

Authors: Yicheng Zhong, Peiji Yang, Zhisheng Wang

2025-11-26

http://arxiv.org/abs/2511.21270v1

Recent advances in Large Language Models (s) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS s have emerged as compact and streamable architectures that jointly model semantic and acoustic integration. However, despite their efficiency, these models often exhibit unstable prosody, speaker drift, and degraded naturalness. To address these issues, we propose a multi-reward Group Relative Policy Optimization (GRPO) framework that directly optimizes the token generation policy of single-codebook TTS s. Beyond standard intelligibility and speaker similarity objectives, our design integrates three rule-based rewards: a length penalty for duration consistency, an entropy regularization reward for stability, and an -annotated prosody alignment reward that explicitly supervises rhythm. In this prosody reward, an external reasoning predicts multiple plausible pause structures via in-context learning, providing a human-preference-aligned supervisory signal for GRPO training. To assess universality, we further attach a flow-matching (FM) r on top of the GRPO-optimized AR backbone and observe consistent additional gains, indicating that our reinforcement optimization enhances the intrinsic AR policy. We further conduct a scalability analysis across data sizes and model scales, revealing that the proposed method consistently enhances prosodic stability, speaker similarity, and overall speech naturalness in single-codebook TTS s.

Scenes as Tokens Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding

Authors: Yutao Tang, Cheng Zhao, Gaurav Mittal, Rohith Kukkala, Rama Chellappa, Cheng Peng, Mei Chen

2025-11-26

http://arxiv.org/abs/2511.21191v1

Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, pre both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask , unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.

Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models

Authors: Changlin Li, Jiawei Zhang, Zeyi Shi, Zongxin Yang, Zhihui Li, Xiaojun Chang

2025-11-26

http://arxiv.org/abs/2511.21122v1

Large-scale vision generative models, including diffusion and flow models, have demonstrated remarkable performance in visual generation tasks. However, transferring these pre-trained models to downstream tasks often results in significant parameter redundancy. In this paper, we propose EntPruner, an entropy-guided automatic progressive framework for diffusion and flow models. First, we introduce entropy-guided , a block-level importance assessment strategy specifically designed for generative models. Unlike discriminative models, generative models require pre the diversity and condition-fidelity of the output distribution. As the importance of each module can vary significantly across downstream tasks, EntPruner prioritizes of less important blocks using data-dependent Conditional Entropy Deviation (CED) as a guiding metric. CED quantifies how much the distribution diverges from the learned conditional data distribution after removing a block. Second, we propose a zero-shot adaptive framework to automatically determine when and how much to prune during training. This dynamic strategy avoids the pitfalls of one-shot , mitigating mode collapse, and pre model performance. Extensive experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22 $\times$ inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.

MLPMoE Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts

Authors: Ivan Novikov

2025-11-26

http://arxiv.org/abs/2511.21089v1

Large Language Models (s) are predominantly deployed as dense s, where every parameter in every feed-forward block is activated for every token. While architecturally simple, this is computationally inefficient, since inference costs scale linearly with parameter count. Recent upcycling methods such as MoEfication, CMoE, ToMoE, and MoORE reveal that much of the useful computation lives in , semi-modular substructures inside dense feed-forward networks, but these approaches typically rely on clustering, activation profiling, singular value decomposition, or custom routing that requires calibration data. This paper introduces MLPMoE (MLP Mixture-of-Experts), a training-free, deterministic transformation that restructures the dense MLP in blocks into a static, high-cardinality mixture of experts. The transformation uses simple tensor slicing and summation, reinterpreting the algebra of tensor parallelism as a topological conversion rather than a distributed training pattern. We further introduce Fractal Fade (differential branch ) and Compensated Pruning (variance-pre branch reduction) as lightweight mechanisms for structured . On Qwen2.5-0.5B-Instruct and DeepSeek-R1-Distill-Llama-8B, the zero-shot MLPMoE transform changes a proxy perplexity metric by less than 0.05 percent while keeping the parameter count effectively constant. On the 8B model, differential removes about 20 percent of MLP parameters while keeping perplexity within about 2 percent of the dense baseline. The method operates entirely post hoc on existing checkpoints and does not require gradients, calibration sets, or router training. Code is available at https://gist.github.com/iwallarm/fc2ef1eddf226ca7814f9e5e2ae9bad1

Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning

Authors: Zhenchao Tang, Fang Wang, Haohuai He, Jiale Zhou, Tianxu Lv, Jun Zhu, Shouzhi Chen, Minghao Yang, Yu Wang, Jiayang Wu, Yidong Song, Jianhua Yao

2025-11-26

http://arxiv.org/abs/2511.21075v1

Effective post-training is essential to align Large Language Models (s) with specialized biomedical knowledge to accelerate life science research. However, current approaches face significant limitations. First, biomedical reasoning involves intricate mechanisms often represented by textual data. Standard Supervised Fine-Tuning (SFT) tends to overfit to surface-level instruction patterns without effectively internalizing this fragmented scientific knowledge. Second, Reinforcement Learning (RL) is impractical for this domain, as defining meaningful rewards often necessitates prohibitive experimental validation (e.g., wet-lab verification of drug responses), rendering real-time feedback unfeasible. We propose Balanced Fine-Tuning (BFT), an efficient post-training method designed to learn complex reasoning from data without external reward signals. BFT operates through a two-layer weighting mechanism: 1. At the token level, it scales loss via prediction probabilities to stabilize gradients and prevent overfitting; 2. At the sample level, it uses "minimum group confidence" to adaptively enhance the learning of hard samples. Experiments demonstrate that BFT significantly outperforms SFT. In medical tasks, it enables s to acquire knowledge that SFT misses. In biological tasks, BFT-based s surpass GeneAgent (an accurate agent for biology analysis) in biological process reasoning. Moreover, the text embeddings generated by BFT can be directly applied to downstream tasks, such as gene interaction and single-cell perturbation response prediction. These results indicate that BFT facilitates broad applications of s in biomedical research.

RAVQ-HoloNet Rate-Adaptive Vector-Quantized Hologram Compression

Authors: Shima Rafiei, Zahra Nabizadeh Shahr Babak, Shadrokh Samavi, Shahram Shirani

2025-11-26

http://arxiv.org/abs/2511.21035v1

Holography offers significant potential for AR/VR applications, yet its adoption is limited by the high demands of data . Existing deep learning approaches generally lack rate adaptivity within a single network. We present RAVQ-HoloNet, a rate-adaptive vector framework that achieves high-fidelity reconstructions at low and ultra-low bit rates, outperforming current state-of-the-art methods. In low bit, our method exceeds by -33.91% in BD-Rate and achieves a BD-PSNR of 1.02 dB from the best existing method demonstrated by the rate-distortion curve.

Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI

Authors: Al Amin, Kamrul Hasan, Liang Hong, Sharif Ullah

2025-11-26

http://arxiv.org/abs/2511.20983v1

Collaborative machine learning across healthcare institutions promises improved diagnostic accuracy by leveraging diverse datasets, yet privacy regulations such as HIPAA prohibit direct patient data sharing. While federated learning (FL) enables decentralized training without raw data exchange, recent studies show that model gradients in conventional FL remain vulnerable to reconstruction attacks, potentially exposing sensitive medical information. This paper presents a privacy-pre federated learning framework combining Vision Transformers (ViT) with homomorphic encryption (HE) for secure multi-institutional histopathology classification. The approach leverages the ViT CLS token as a compact 768-dimensional feature representation for secure aggregation, encrypting these tokens using CKKS homomorphic encryption before transmission to the server. We demonstrate that encrypting CLS tokens achieves a 30-fold reduction compared to gradient encryption while maintaining strong privacy guarantees. Through evaluation on a three-client federated setup for lung cancer histopathology classification, we show that gradients are highly susceptible to model inversion attacks (PSNR: 52.26 dB, SSIM: 0.999, NMI: 0.741), enabling near-perfect image reconstruction. In contrast, the proposed CLS-protected HE approach prevents such attacks while enabling encrypted inference directly on ciphertexts, requiring only 326 KB of encrypted data transmission per aggregation round. The framework achieves 96.12 percent global classification accuracy in the unencrypted domain and 90.02 percent in the encrypted domain.

A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving

Authors: Junhan Liao, Minxian Xu, Wanyi Zheng, Yan Wang, Kejiang Ye, Rajkumar Buyya, Chengzhong Xu

2025-11-26

http://arxiv.org/abs/2511.20982v1

To meet strict Service-Level Objectives (SLOs),contemporary Large Language Models (s) decouple the and stages and place them on separate GPUs to mitigate the distinct bottlenecks inherent to each phase. However, the heterogeneity of workloads causes producerconsumer imbalance between the two instance types in such d architecture. To address this problem, we propose DOPD (Dynamic Optimal Prefill/Decoding), a dynamic inference system that adjusts instance allocations to achieve an optimal -to- (P/D) ratio based on real-time load monitoring. Combined with an appropriate request-scheduling policy, DOPD effectively resolves imbalances between and instances and mitigates resource allocation mismatches due to mixed-length requests under high concurrency. Experimental evaluations show that, compared with v and DistServe (representative aggregation-based and disaggregationbased approaches), DOPD improves overall system goodput by up to 1.5X, decreases P90 time-to-first-token (TTFT) by up to 67.5%, and decreases P90 time-per-output-token (TPOT) by up to 22.8%. Furthermore, our dynamic P/D adjustment technique performs proactive reconfiguration based on historical load, achieving over 99% SLOs attainment while using less additional resources.

Towards Audio Token Compression in Large Audio Language Models

Authors: Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

2025-11-26

http://arxiv.org/abs/2511.20973v1

Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices. In this paper, we explore techniques such as unsupervised segmentation, uniform average pooling, etc., to reduce the number of audio tokens generated by the LALM's audio encoder but before they are consumed by the r. To mitigate potential performance degradation introduced by the compressed representations, we employ low-rank adapters to finetune the model. We evaluate our proposed models on two tasks, automatic speech recognition and speech-to-speech translation tasks, that are dependent on effectively uncovering the underlying lexical content of the input signal and study the effect of downsampling on these tasks. Experimental results show that compressed LALMs can achieve performance closer to frame-level LALMs while reducing the input audio token count upto three times before the backbone.

TrafficLens Multi-Camera Traffic Video Analysis Using LLMs

Authors: Md Adnan Arefeen, Biplob Debnath, Srimat Chakradhar

2025-11-26

http://arxiv.org/abs/2511.20965v1

Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (s) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing ping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.

Emergence and Localisation of Semantic Role Circuits in LLMs

Authors: Nura Aljaafari, Danilo S. Carvalho, André Freitas

2025-11-25

http://arxiv.org/abs/2511.20910v1

Despite displaying semantic competence, large language models' internal mechanisms that ground abstract semantic structure remain insufficiently characterised. We propose a method integrating role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study how s implement semantic roles. Our analysis uncovers: (i) highly concentrated circuits (89-94% attribution within 28 nodes); (ii) gradual structural refinement rather than phase transitions, with larger models sometimes bypassing localised circuits; and (iii) moderate cross-scale conservation (24-59% component ) alongside high spectral similarity. These findings suggest that s form compact, causally isolated mechanisms for abstract semantic structure, and these mechanisms exhibit partial transfer across scales and architectures.

Latent Collaboration in Multi-Agent Systems

Authors: Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang

2025-11-25

http://arxiv.org/abs/2511.20639v1

Multi-agent systems (MAS) extend large language models (s) from independent single-model reasoning to coordinative system-level intelligence. While existing agents depend on text-based mediation for reasoning and , we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.

Quantum-Resistant Authentication Scheme for RFID Systems Using Lattice-Based Cryptography

Authors: Vaibhav Kumar, Kaiwalya Joshi, Bhavya Dixit, Gaurav S. Kasbekar

2025-11-25

http://arxiv.org/abs/2511.20630v1

We propose a novel quantum-resistant mutual authentication scheme for radio-frequency identification (RFID) systems. Our scheme uses lattice-based cryptography and, in particular, achieves quantum-resistance by leveraging the hardness of the inhomogeneous short integer solution (ISIS) problem. In contrast to prior work, which assumes that the reader-server channel is secure, our scheme is secure even when both the reader-server and tag-reader channels are insecure. Our proposed protocol provides robust security against man-in-the-middle (MITM), replay, impersonation, and reflection attacks, while also ensuring unforgeability and pre anonymity. We present a detailed security analysis, including semi-formal analysis and formal verification using the Automated Validation of Internet Security Protocols and Applications (AVISPA) tool. In addition, we analyze the storage, computation, and costs of the proposed protocol and compare its security properties with those of existing protocols, demonstrating that our scheme offers strong security guarantees. To the best of our knowledge, this paper is the first quantum-resistant authentication protocol for RFID systems that comprehensively addresses the insecurity of both the reader-server and tag-reader channels.

DiFR Inference Verification Despite Nondeterminism

Authors: Adam Karvonen, Daniel Reuter, Roy Rinberg, Luke Marks, Adrià Garriga-Alonso, Keri Warr

2025-11-25

http://arxiv.org/abs/2511.20621v1

As demand for inference grows, it is becoming increasingly important that providers and their customers can verify that inference processes are performed correctly, without errors or tampering. However, re-running the same inference process twice often leads to different results due to benign numerical noise, making it difficult to distinguish legitimate variation from actual problems. To address this problem, we introduce Token-DiFR (Token-Divergence-From-Reference), a method for verifying inference outputs by comparing generated tokens against predictions made by a trusted reference implementation conditioned on the same random seed. Sampling seed synchronization tightly constrains valid outputs, leaving providers minimal room to deviate from correct inference, which allows output tokens themselves to serve as auditable evidence of correctness at zero additional cost to the provider. Token-DiFR reliably identifies sampling errors, simulated bugs, and model , detecting 4-bit with AUC $>$ 0.999 within 300 output tokens. For applications requiring sample-efficient forward-pass verification, we additionally introduce Activation-DiFR, a scheme that uses random orthogonal projections to compress activations into compact fingerprints for subsequent verification. Activation-DiFR detects 4-bit with AUC $>$ 0.999 using just 2 output tokens, while reducing overhead by 25-75% relative to existing methods. We release an open-source integration with v to accelerate practical deployment of verifiable inference.

DP-MicroAdam Private and Frugal Algorithm for Training and Fine-tuning

Authors: Mihaela Hudişteanu, Edwige Cyffers, Nikita P. Kalinin

2025-11-25

http://arxiv.org/abs/2511.20509v1

Adaptive optimizers are the de facto standard in non-private training as they often enable faster convergence and improved performance. In contrast, differentially private (DP) training is still predominantly performed with DP-SGD, typically requiring extensive compute and hyperparameter tuning. We propose DP-MicroAdam, a memory-efficient and -aware adaptive DP optimizer. We prove that DP-MicroAdam converges in stochastic non-convex optimization at the optimal $\mathcal{O}(1/\sqrt{T})$ rate, up to privacy-dependent constants. Empirically, DP-MicroAdam outperforms existing adaptive DP optimizers and achieves competitive or superior accuracy compared to DP-SGD across a range of benchmarks, including CIFAR-10, large-scale ImageNet training, and private fine-tuning of pretrained s. These results demonstrate that adaptive optimization can improve both performance and stability under differential privacy.

Object-Centric Vision Token Pruning for Vision Language Models

Authors: Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen

2025-11-25

http://arxiv.org/abs/2511.20439v1

In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-pre VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs, without fine-tuning of any models on any datasets. It is gauranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across any vision ratios, i.e., inference efficiency, our OC-VTP consistently helps mainstream VLMs to preserve the highest inference accuracy. Our also demonstrates interesting interpretability. Our codes are available at https://github.com/GarryLarry010131/OC-VTP.

The Case for Intent-Based Query Rewriting

Authors: Gianna Lisa Nicolai, Patrick Hansert, Sebastian Michel

2025-11-25

http://arxiv.org/abs/2511.20419v1

With this work, we describe the concept of intent-based query rewriting and present a first viable solution. The aim is to allow rewrites to alter the structure and syntactic outcome of an original query while keeping the obtainable insights intact. This drastically differs from traditional query rewriting, which typically aims to decrease query evaluation time by using strict equivalence rules and optimization heuristics on the query plan. Rewriting queries to queries that only provide a similar insight but otherwise can be entirely different can remedy inaccessible original data tables due to access control, privacy, or expensive data access regarding monetary cost or remote access. In this paper, we put forward INQURE, a system designed for INtent-based QUery REwriting. It uses access to a large language model () for the query understanding and human-like derivation of alternate queries. Around the , INQURE employs upfront table filtering and subsequent candidate rewrite and ranking. We report on the results of an evaluation using a benchmark set of over 900 database table schemas and discuss the pros and cons of alternate approaches regarding runtime and quality of the rewrites of a user study.

FREE Uncertainty-Aware Autoregression for Parallel Diffusion Transformers

Authors: Xinwan Wen, Bowen Li, Jiajun Luo, Ye Li, Zhi Wang

2025-11-25

http://arxiv.org/abs/2511.20390v1

Diffusion Transformers (DiTs) achieve state-of-the-art generation quality but require long sequential denoising trajectories, leading to high inference latency. Recent speculative inference methods enable lossless parallel sampling in U-Net-based diffusion models via a drafter-verifier scheme, but their is limited on DiTs due to insufficient draft accuracy during verification. To address this limitation, we analyze the DiTs' feature dynamics and find the features of the final layer (top-block) exhibit strong temporal consistency and rich semantic abstraction. Based on this insight, we propose FREE, a novel framework that employs a lightweight drafter to perform feature-level autoregression with parallel verification, guaranteeing lossless with theoretical and empirical support. Meanwhile, prediction variance (uncertainty) of DiTs naturally increases in later denoising steps, reducing acceptance rates under speculative sampling. To mitigate this effect, we further introduce an uncertainty-guided relaxation strategy, forming FREE (relax), which dynamically adjusts the acceptance probability in response to uncertainty levels. Experiments on ImageNet- $512^2$ show that FREE achieves up to $1.86 \times$ , and FREE (relax) further reaches $2.25 \times$ speedup while maintaining high perceptual and quantitative fidelity in generation quality.

Scaling LLM Speculative Decoding Non-Autoregressive Forecasting in Large-Batch Scenarios

Authors: Luohe Shi, Zuchao Li, Lefei Zhang, Baoyuan Qi, Guoming Liu, Hai Zhao

2025-11-25

http://arxiv.org/abs/2511.20340v1

Speculative accelerates inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative methods typically assume a considerable amount of available computing power, then generate a complex and massive draft tree using a small autoregressive language model to improve overall prediction accuracy. However, methods like batching have been widely applied in mainstream model inference systems as a superior alternative to speculative , as they compress the available idle computing power. Therefore, performing speculative with low verification resources and low scheduling costs has become an important research problem. We believe that more capable models that allow for parallel generation on draft sequences are what we truly need. Recognizing the fundamental nature of draft models to only generate sequences of limited length, we propose SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer combines the autoregressive model's ability to extract information from the entire input sequence with the parallel generation benefits of non-autoregressive models. This design eliminates the reliance on large prefix trees and achieves consistent , even in large-batch scenarios. Through lossless speculative experiments across models of various scales, we demonstrate that SpecFormer sets a new standard for scaling inference with lower training demands and reduced computational costs.

IrisNet Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection

Authors: Xuelin Qian, Jiaming Lu, Zixuan Wang, Wenxuan Wang, Zhongling Huang, Dingwen Zhang, Junwei Han

2025-11-25

http://arxiv.org/abs/2511.20319v1

Infrared Small Target Detection (IRSTD) faces significant challenges due to low signal-to-noise ratios, complex backgrounds, and the absence of discernible target features. While deep learning-based encoder-r frameworks have advanced the field, their static pattern learning suffers from pattern drift across diverse scenarios (\emph{e.g.}, day/night variations, sky/maritime/ground domains), limiting robustness. To address this, we propose IrisNet, a novel meta-learned framework that dynamically adapts detection strategies to the input infrared image status. Our approach establishes a dynamic mapping between infrared image features and entire r parameters via an image-to-r . More concretely, we represent the parameterized r as a structured 2D tensor pre hierarchical layer correlations and enable the to model inter-layer dependencies through self-attention while generating adaptive patterns via cross-attention. To further enhance the perception ability of infrared images, we integrate high-frequency components to supplement target-position and scene-edge information. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets demonstrate the superiority of our IrisNet, achieving state-of-the-art performance.

Beyond Components Singular Vector-Based Interpretability of Transformer Circuits

Authors: Areeb Ahmad, Abhinav Joshi, Ashutosh Modi

2025-11-25

http://arxiv.org/abs/2511.20273v1

Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron layers (MLPs) (the building blocks of a architecture) as indivisible units, overlooking possibilities of functional substructure learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate our perspective on widely used standard tasks like Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple ping subfunctions aligned with distinct singular directions. Nodes in a computational graph, that are previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.

Communication-Efficient Learning for Satellite Constellations

Authors: Ruxandra-Stefania Tudose, Moritz H. W. Grüss, Grace Ra Kim, Karl H. Johansson, Nicola Bastianello

2025-11-25

http://arxiv.org/abs/2511.20220v1

Satellite constellations in low-Earth orbit are now widespread, enabling positioning, Earth imaging, and s. In this paper we address the solution of learning problems using these satellite constellations. In particular, we focus on a federated approach, where satellites collect and locally process data, with the ground station aggregating local models. We focus on designing a novel, -efficient algorithm that still yields accurate trained models. To this end, we employ several mechanisms to reduce the number of s with the ground station (local training) and their size (). We then propose an error feedback mechanism that enhances accuracy, which yields, as a byproduct, an algorithm-agnostic error feedback scheme that can be more broadly applied. We analyze the convergence of the resulting algorithm, and compare it with the state of the art through simulations in a realistic space scenario, showcasing superior performance.

Interactive AI NPCs Powered by LLMs Technical Report for the CPDC Challenge 2025

Authors: Yitian Huang, Yuxuan Lei, Jianxun Lian, Hao Liao

2025-11-25

http://arxiv.org/abs/2511.20200v1

This report presents the solution and results of our team MSRA_SC in the Commonsense Persona-Grounded Dialogue Challenge (CPDC 2025). We propose a simple yet effective framework that unifies improvements across both GPU Track and API Track. Our method centers on two key components. First, Context Engineering applies dynamic tool and persona clipping for input , combined with post-processing techniques such as parameter normalization and function merging. Together with manually refined prompts, this design improves tool call stability, execution reliability, and role-playing guidance. Second, in the GPU Track, we further adopt GRPO training, replacing supervised fine-tuning with reinforcement learning directly optimized by reward signals. This mitigates small-sample overfitting and significantly enhances task-oriented dialogue performance. In the final evaluation, our team ranks 1st in Task 2 API, 2nd in Task 1 API, and 3rd in both Task 3 API and GPU track, demonstrating the effectiveness of our approach. Our code is publicly available at https://gitlab.aicrowd.com/nikoo_yu/cpdc-2025-winning-solution

Beluga A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management

Authors: Xinjun Yang, Qingda Hu, Junru Li, Feifei Li, Yuqi Zhou, Yicong Zhu, Qiuru Lin, Jian Dai, Yang Kong, Jiayu Zhang, Guoqiang Xu, Qiang Liu

2025-11-25

http://arxiv.org/abs/2511.20172v1

The rapid increase in model sizes and the growing demand for long-context inference have made memory a critical bottleneck in GPU-accelerated systems. Although high-bandwidth memory (HBM) on GPUs offers fast access, its limited capacity necessitates reliance on host memory (CPU DRAM) to support larger working sets such as the Cache. However, the maximum DRAM capacity is constrained by the limited number of memory channels per CPU socket. To overcome this limitation, current systems often adopt RDMA-based d memory pools, which introduce significant challenges including high access latency, complex protocols, and synchronization overhead. Fortunately, the emerging CXL technology introduces new opportunities in Cache design. In this paper, we propose Beluga, a novel memory architecture that enables GPUs and CPUs to access a shared, large-scale memory pool through CXL switches. By supporting native load/store access semantics over the CXL fabric, our design delivers near-local memory latency, while reducing programming complexity and minimizing synchronization overhead. We conduct a systematic characterization of a commercial CXL switch-based memory pool and propose a set of design guidelines. Based on Beluga, we design and implement Beluga-Cache, a system tailored for managing the large-scale Cache in inference. Beluga-Cache achieves an 89.6% reduction in Time-To-First-Token (TTFT) and 7.35x throughput improvement in the v inference engine compared to RDMA-based solutions. To the best of our knowledge, Beluga is the first system that enables GPUs to directly access large-scale memory pools through CXL switches, marking a significant step toward low-latency, shared access to vast memory resources by GPUs.

SKEL-CF Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery

Authors: Da Li, Jiping Jin, Xuanlong Yu, Wei Liu, Xiaodong Cun, Kai Chen, Rui Fan, Jiangang Kong, Xi Shen

2025-11-25

http://arxiv.org/abs/2511.20157v2

Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a -based encoder-r architecture, where the encoder predicts coarse camera and SKEL parameters, and the r progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. Our implementation is available on the project page: https://pokerman8.github.io/SKEL-CF/.

IDAP++ Advancing Divergence-Based Pruning via Filter-Level and Layer-Level Optimization

Authors: Aleksei Samarin, Artem Nazarenko, Egor Kotenko, Valentin Malykh, Alexander Savelev, Aleksei Toropov

2025-11-25

http://arxiv.org/abs/2511.20141v1

This paper presents a novel approach to neural network that addresses redundancy at both the filter and architectural levels through a unified framework grounded in information flow analysis. Building on the concept of tensor flow divergence, which quantifies how information is transformed across network layers, we develop a two-stage optimization process. The first stage employs iterative divergence-aware to identify and remove redundant filters while pre critical information pathways. The second stage extends this principle to higher-level architecture optimization by analyzing layer-wise contributions to information propagation and selectively eliminating entire layers that demonstrate minimal impact on network performance. The proposed method naturally adapts to diverse architectures, including convolutional networks, s, and hybrid designs, providing a consistent metric for comparing the structural importance across different layer types. Experimental validation across multiple modern architectures and datasets reveals that this combined approach achieves substantial model while maintaining competitive accuracy. The presented approach achieves parameter reduction results that are globally comparable to those of state-of-the-art solutions and outperforms them across a wide range of modern neural network architectures, from convolutional models to s. The results demonstrate how flow divergence serves as an effective guiding principle for both filter-level and layer-level optimization, offering practical benefits for deployment in resource-constrained environments.

SSA Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

Authors: Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, Di Yin, Xing Sun

2025-11-25

http://arxiv.org/abs/2511.20102v1

The quadratic complexity of full attention limits efficient long-context processing in large language models (s). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native -attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging -attention outputs to align with their full-attention counterparts, thereby promoting stronger . As a result, SSA achieves state-of-the-art performance under both and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native -attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.

MTA A Merge-then-Adapt Framework for Personalized Large Language Model

Authors: Xiaopeng Li, Yuanjin Zheng, Wanyu Wang, wenlin zhang, Pengyue Jia, Yiqi Wang, Maolin Wang, Xuetao Wei, Xiangyu Zhao

2025-11-25

http://arxiv.org/abs/2511.20072v2

Personalized Large Language Models (Ps) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for Ps. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.

FLaTEC Frequency-Disentangled Latent Triplanes for Efficient Compression of LiDAR Point Clouds

Authors: Xiaoge Zhang, Zijie Wu, Mingtao Feng, Zichen Geng, Mehwish Nasim, Saeed Anwar, Ajmal Mian

2025-11-25

http://arxiv.org/abs/2511.20065v1

Point cloud methods jointly optimize bitrates and reconstruction distortion. However, balancing ratio and reconstruction quality is difficult because low-frequency and high-frequency components contribute differently at the same resolution. To address this, we propose FLaTEC, a frequency-aware model that enables the of a full scan with high ratios. Our approach introduces a frequency-aware mechanism that decouples low-frequency structures and high-frequency textures, while hybridizing latent triplanes as a compact proxy for point cloud. Specifically, we convert voxelized embeddings into triplane representations to reduce , computational cost, and storage requirements. We then devise a frequency-disentangling technique that extracts compact low-frequency content while collecting high-frequency details across scales. The decoupled low-frequency and high-frequency components are stored in binary format. During , full-spectrum signals are progressively recovered via a modulation block. Additionally, to compensate for the loss of 3D correlation, we introduce an efficient frequency-based attention mechanism that fosters local connectivity and outputs arbitrary resolution points. Our method achieves state-of-the-art rate-distortion performance and outperforms the standard codecs by 78\% and 94\% in BD-rate on both SemanticKITTI and Ford datasets.

Adaptive Knowledge Transfer for Cross-Disciplinary Cold-Start Knowledge Tracing

Authors: Yulong Deng, Zheng Guan, Min He, Xue Wang, Jie Liu, Zheng Li

2025-11-25

http://arxiv.org/abs/2511.20009v1

Cross-Disciplinary Cold-start Knowledge Tracing (CDCKT) faces a critical challenge: insufficient student interaction data in the target discipline prevents effective knowledge state modeling and performance prediction. Existing cross-disciplinary methods rely on ping entities between disciplines for knowledge transfer through simple mapping functions, but suffer from two key limitations: (1) ping entities are scarce in real-world scenarios, and (2) simple mappings inadequately capture cross-disciplinary knowledge complexity. To overcome these challenges, we propose Mixed of Experts and Adversarial Generative Network-based Cross-disciplinary Cold-start Knowledge Tracing Framework. Our approach consists of three key components: First, we pre-train a source discipline model and cluster student knowledge states into K categories. Second, these cluster attributes guide a mixture-of-experts network through a gating mechanism, as a cross-domain mapping bridge. Third, an adversarial discriminator enforces feature separation by pulling same-attribute student features closer while pushing different-attribute features apart, effectively mitigating small-sample limitations. We validate our method's effectiveness across 20 extreme cross-disciplinary cold-start scenarios.

DeeAD Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

Authors: Haibo HU, Lianming Huang, Nan Guan, Chun Jason Xue

2025-11-25

http://arxiv.org/abs/2511.20720v1

Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% -layer and 29% latency reduction, while pre planning quality and safety.

Authors: Weizi Shao, Taolin Zhang, Zijie Zhou, Chen Chen, Chengyu Wang, Xiaofeng He

2025-11-25

http://arxiv.org/abs/2511.19969v1

Recent advancements in multi-modal retrieval-augmented generation (mRAG), which enhance multi-modal large language models (Ms) with external knowledge, have demonstrated that the collective intelligence of multiple agents can significantly outperform a single model through effective . Despite impressive performance, existing multi-agent systems inherently incur substantial token overhead and increased computational costs, posing challenges for large-scale deployment. To address these issues, we propose a novel Multi-Modal Multi-agent hierarchical graph PRUNING framework, termed M $^3$ Prune. Our framework eliminates redundant edges across different modalities, achieving an optimal balance between task performance and token overhead. Specifically, M $^3$ Prune first applies intra-modal graph sparsification to textual and visual modalities, identifying the edges most critical for solving the task. Subsequently, we construct a dynamic topology using these key edges for inter-modal graph sparsification. Finally, we progressively prune redundant edges to obtain a more efficient and hierarchical topology. Extensive experiments on both general and domain-specific mRAG benchmarks demonstrate that our method consistently outperforms both single-agent and robust multi-agent mRAG systems while significantly reducing token consumption.

ParaBlock Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models

Authors: Yujia Wang, Yuanpu Cao, Jinghui Chen

2025-11-25

http://arxiv.org/abs/2511.19959v1

Federated learning (FL) has been extensively studied as a privacy-pre training paradigm. Recently, federated block coordinate descent scheme has become a popular option in training large-scale models, as it allows clients to train only a subset of the model locally instead of the entire model. However, in the era of large language models (s), even a single block can contain a significant number of parameters, posing substantial latency, particularly for resource-constrained clients. To address this challenge in federated training/fine-tuning s, we propose ParaBlock, a novel approach that establishes two parallel threads for and computation to enhance efficiency. We theoretically prove that the proposed ParaBlock achieves the same convergence rate as the standard federated block coordinate descent methods. Empirical evaluations on fine-tuning s on general instruction following and mathematical reasoning confirm that ParaBlock not only maintains strong performance but also significantly improves efficiency.

AI/ML based Joint Source and Channel Coding for HARQ-ACK Payload

Authors: Akash Doshi, Pinar Sen, Kirill Ivanov, Wei Yang, June Namgoong, Runxin Wang, Rachel Wang, Taesang Yoo, Jing Jiang, Tingfang Ji

2025-11-25

http://arxiv.org/abs/2511.19943v1

Channel coding from 2G to 5G has assumed the inputs bits at the physical layer to be uniformly distributed. However, hybrid automatic repeat request acknowledgement (HARQ-ACK) bits transmitted in the uplink are inherently non-uniformly distributed. For such sources, significant performance gains could be obtained by employing joint source channel coding, aided by deep learning-based techniques. In this paper, we learn a -based encoder using a novel "free-lunch" training algorithm and propose per-codeword power shaping to exploit the source prior at the encoder whilst being robust to small changes in the HARQ-ACK distribution. Furthermore, any HARQ-ACK r has to achieve a low negative acknowledgement (NACK) error rate to avoid radio link failures resulting from multiple NACK errors. We develop an extension of the Neyman-Pearson test to a coded bit system with multiple information bits to achieve Unequal Error Protection of NACK over ACK bits at the r. Finally, we apply the proposed encoder and r designs to a 5G New Radio (NR) compliant uplink setup under a fading channel, describing the optimal receiver design and a low complexity coherent approximation to it. Our results demonstrate 3-6 dB reduction in the average transmit power required to achieve the target error rates compared to the NR baseline, while also achieving a 2-3 dB reduction in the maximum transmit power, thus providing for significant coverage gains and power savings.