2025-11-28
Table of Contents
- DSD A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
- Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO
- Tidal forces around the Letelier-Alencar cloud of strings black hole
- Mechanistic Interpretability for Transformer-based Time Series Classification
- E-M3RF An Equivariant Multimodal 3D Re-assembly Framework
- Prune4Web DOM Tree Pruning Programming for Web Agent
- Anomaly Detection with Adaptive and Aggressive Rejection for Contaminated Training Data
- Discovery and recovery of crystalline materials with property-conditioned transformers
- Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale
- Scenes as Tokens Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding
- Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models
- MLPMoE Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts
- Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning
- RAVQ-HoloNet Rate-Adaptive Vector-Quantized Hologram Compression
- Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI
- A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving
- Towards Audio Token Compression in Large Audio Language Models
- TrafficLens Multi-Camera Traffic Video Analysis Using LLMs
- Emergence and Localisation of Semantic Role Circuits in LLMs
- Latent Collaboration in Multi-Agent Systems
- Quantum-Resistant Authentication Scheme for RFID Systems Using Lattice-Based Cryptography
- DiFR Inference Verification Despite Nondeterminism
- DP-MicroAdam Private and Frugal Algorithm for Training and Fine-tuning
- Object-Centric Vision Token Pruning for Vision Language Models
- The Case for Intent-Based Query Rewriting
- FREE Uncertainty-Aware Autoregression for Parallel Diffusion Transformers
- Scaling LLM Speculative Decoding Non-Autoregressive Forecasting in Large-Batch Scenarios
- IrisNet Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection
- Beyond Components Singular Vector-Based Interpretability of Transformer Circuits
- Communication-Efficient Learning for Satellite Constellations
- Interactive AI NPCs Powered by LLMs Technical Report for the CPDC Challenge 2025
- Beluga A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management
- SKEL-CF Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery
- IDAP++ Advancing Divergence-Based Pruning via Filter-Level and Layer-Level Optimization
- SSA Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
- MTA A Merge-then-Adapt Framework for Personalized Large Language Model
- FLaTEC Frequency-Disentangled Latent Triplanes for Efficient Compression of LiDAR Point Clouds
- Adaptive Knowledge Transfer for Cross-Disciplinary Cold-Start Knowledge Tracing
- DeeAD Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving
- MPrune Hierarchical Communication Graph Pruning for Efficient Multi-Modal Multi-Agent Retrieval-Augmented Generation
- ParaBlock Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models
- AI/ML based Joint Source and Channel Coding for HARQ-ACK Payload
DSD A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
Authors: Fengze Yu, Leshu Li, Brad McDanel, Saiqian Zhang
2025-11-26
Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain confined to single-node execution. We propose DSD, a distributed speculative decoding framework that extends SD to multi-device deployments through coordinated draft-target execution. Given the lack of prior work on simulating this paradigm, we first introduce DSD-Sim, a discrete-event simulator that captures network, batching, and scheduling dynamics. Building on insights from DSD-Sim, we further design an Adaptive Window Control (AWC) policy that dynamically adjusts speculation window size to optimize throughput. Experiments across diverse workloads show that DSD achieves up to 1.1x speedup and 9.7% higher throughput over existing SD baselines, enabling agile and scalable LLM serving across edge and cloud.
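To make the idea of adjusting the speculation window concrete, here is a minimal sketch. It is not the AWC policy from the paper, whose details the abstract does not give. It assumes each drafted token is accepted independently with probability `alpha` (the usual simplification in speculative decoding analyses) and picks the window that maximizes expected tokens per millisecond. The function names and the cost parameters `draft_ms`, `target_ms`, and `rtt_ms` are hypothetical.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification round for a window of gamma
    drafted tokens, assuming i.i.d. per-token acceptance probability alpha."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def round_time_ms(gamma: int, draft_ms: float, target_ms: float, rtt_ms: float) -> float:
    """Assumed cost model: gamma sequential draft steps on the edge device,
    one target verification pass in the cloud, and one network round trip."""
    return gamma * draft_ms + target_ms + rtt_ms


def best_window(alpha: float, draft_ms: float, target_ms: float,
                rtt_ms: float, max_gamma: int = 16) -> int:
    """Pick the window size with the highest expected tokens per millisecond."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: expected_tokens(alpha, g)
        / round_time_ms(g, draft_ms, target_ms, rtt_ms),
    )


if __name__ == "__main__":
    # A well-aligned draft model (high alpha) justifies a longer window.
    print(best_window(alpha=0.9, draft_ms=1.0, target_ms=20.0, rtt_ms=50.0))
    print(best_window(alpha=0.3, draft_ms=1.0, target_ms=20.0, rtt_ms=50.0))
```

Under this toy cost model, a longer network round trip or a higher acceptance rate both push the preferred window upward, which is the kind of trade-off an adaptive policy would have to track at run time.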
Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO
Authors: Daniel R. Jiang, Jalaj Bhandari, Yukai Yang, Rémi Munos, Tyler Lu
2025-11-26
Optimizing large language models (LLMs) for multi-turn conversational outcomes remains a significant challenge, especially in goal-oriented settings like AI marketing or sales agents who facilitate transactions via messaging platforms. The difficulty stems from sparse, long-horizon rewards and the discrepancy between response-level planning and token-level generation. In this technical note, we propose a formal reduction of the multi-turn RL problem into a sequence of single-turn RLHF-style problems. This is achieved by setting a learned multi-turn Q-function as the reward model for the single-turn problem. We demonstrate and prove a key insight: solving this single-turn RL problem with standard token-level PPO is equivalent to a policy improvement step within the multi-turn problem. This insight naturally leads to Iterative PPO, a batch online policy iteration algorithm that alternates between fitting Q-functions from logged conversation trajectories and improving the policy. A major practical advantage is that Iterative PPO directly leverages stable, off-the-shelf single-turn RLHF tools, making it straightforward to implement. Our method occupies a middle ground between fully online and fully offline approaches, retaining the adaptability of online updates while gaining the stability benefits of offline training.
Tidal forces around the Letelier-Alencar cloud of strings black hole
Authors: Marcos V. de S. Silva, T. M. Crispim, R. R. Landim, Gonzalo Olmo, Diego Sáez-Chillón Gómez
2025-11-26
In this work, we investigate relativistic tidal forces around a black hole sourced by a cloud of strings, described by the generalized Letelier-Alencar solution. We first review the original Letelier spacetime and its recent generalization, computing the Kretschmann scalar, and showing that the generalized model exhibits a stronger curvature divergence at than both the Letelier and Schwarzschild cases. We then analyze geodesic motion in this background. For massless particles, we focus on circular photon orbits, while for massive particles, we consider both radial infall and circular motion. We find that the radii of the photon sphere and of the innermost stable circular orbit increase with the cloud of strings parameter and decrease with the length scale , and that circular orbits cease to exist in certain regions of the parameter space. For radial motion, we compute the radial acceleration and the corresponding tidal forces. In this case, we show that an inversion between stretching and compression may occur, although this regime is typically hidden inside the event horizon. Finally, we study tidal forces for observers in circular motion, showing that the cloud of strings modifies the Keplerian frequency and the tidal force profile even at large distances, and that in this case there is no sign change of the tidal components.
Mechanistic Interpretability for Transformer-based Time Series Classification
Authors: Matīss Kalnāre, Sofoklis Kitharidis, Thomas Bäck, Niki van Stein
2025-11-26
Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing explainability methods often focus on input-output attributions, leaving the internal mechanisms largely opaque. This paper addresses this gap by adapting various Mechanistic Interpretability techniques (activation patching, attention saliency, and sparse autoencoders) from NLP to transformer architectures designed explicitly for time series classification. We systematically probe the internal causal roles of individual attention heads and timesteps, revealing causal structures within these models. Through experimentation on a benchmark time series dataset, we construct causal graphs illustrating how information propagates internally, highlighting key attention heads and temporal positions driving correct classifications. Additionally, we demonstrate the potential of sparse autoencoders for uncovering interpretable latent features. Our findings provide both methodological contributions to transformer interpretability and novel insights into the functional mechanics underlying transformer performance in time series classification tasks.
E-M3RF An Equivariant Multimodal 3D Re-assembly Framework
Authors: Adeela Islam, Stefano Fiorini, Manuel Lecha, Theodore Tsesmelis, Stuart James, Pietro Morerio, Alessio Del Bue
2025-11-26
3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been challenged by deep learning methods rather than classical optimization. While learning approaches have shown promising results, most still rely primarily on geometric features to assemble a whole from its parts. As a result, methods struggle when geometry alone is insufficient or ambiguous, for example, for small, eroded, or symmetric fragments. Additionally, solutions do not impose physical constraints that explicitly prevent overlapping assemblies. To address these limitations, we introduce E-M3RF, an equivariant multimodal 3D reassembly framework that takes as input the point clouds, containing both point positions and colors of fractured fragments, and predicts the transformations required to reassemble them using SE(3) flow matching. Each fragment is represented by both geometric and color features: i) 3D point positions are encoded as rotation-consistent geometric features using a rotation-equivariant encoder, ii) the colors at each 3D point are encoded with a transformer. The two feature sets are then combined to form a multimodal representation. We experimented on four datasets: two synthetic datasets, Breaking Bad and Fantastic Breaks, and two real-world cultural heritage datasets, RePAIR and Presious, demonstrating that E-M3RF on the RePAIR dataset reduces rotation error by 23.1% and translation error by 13.2%, while Chamfer Distance decreases by 18.4% compared to competing methods.
Prune4Web DOM Tree Pruning Programming for Web Agent
Authors: Jiayuan Zhang, Kaiquan Chen, Zhihao Lu, Enshen Zhou, Qian Yu, Jing Zhang
2025-11-26
Web automation employs intelligent agents to execute high-level tasks by mimicking human interactions with web interfaces. Despite the capabilities of recent Large Language Model (LLM)-based web agents, navigating complex, real-world webpages efficiently remains a significant hurdle due to the prohibitively large size of Document Object Model (DOM) structures, often ranging from 10,000 to 100,000 tokens. Existing strategies typically rely on crude DOM truncation -- risking the loss of critical information -- or employ inefficient heuristics and separate ranking models, failing to achieve an optimal balance between precision and scalability. To address these challenges, we introduce Prune4Web, a novel paradigm that shifts DOM processing from resource-intensive LLM reading to efficient programmatic pruning. Central to our approach is DOM Tree Pruning Programming, where an LLM generates executable Python scoring scripts to dynamically filter DOM elements based on semantic cues from decomposed sub-tasks. This mechanism eliminates the need for LLMs to ingest raw, massive DOMs, instead delegating traversal and scoring to lightweight, interpretable programs. This methodology achieves a 25x to 50x reduction in candidate elements for grounding, thereby facilitating precise action localization while mitigating attention dilution. Furthermore, we propose a specialized data annotation pipeline and a two-turn dialogue training strategy that jointly optimizes the Planner, Programmatic Filter, and Grounder within a unified framework. Extensive experiments demonstrate state-of-the-art performance. Notably, on our low-level grounding task, Prune4Web dramatically improves accuracy from 46.8% to 88.28%, underscoring its efficacy in real-world web automation.
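As a rough illustration of what a generated scoring script might look like, the sketch below scores flattened DOM elements against keywords taken from a sub-task and keeps only the top candidates. Everything here is hypothetical: the element representation (dicts with `tag`, `text`, `attrs`), the weights, and the function names are assumptions for illustration, not the scripts or interfaces used by Prune4Web.

```python
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


def score_element(element: dict, keywords: list[str]) -> float:
    """Score one DOM element: keyword hits in visible text count more than
    hits in attribute values; interactive tags get a small bonus."""
    text = element.get("text", "").lower()
    attrs = " ".join(str(v) for v in element.get("attrs", {}).values()).lower()
    score = 0.0
    for keyword in keywords:
        keyword = keyword.lower()
        if keyword in text:
            score += 2.0
        if keyword in attrs:
            score += 1.0
    if element.get("tag") in INTERACTIVE_TAGS:
        score += 0.5
    return score


def prune_dom(elements: list[dict], keywords: list[str], top_k: int = 3) -> list[dict]:
    """Keep the top_k highest-scoring elements (score > 0), in document order."""
    scored = [(score_element(el, keywords), i) for i, el in enumerate(elements)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    kept = sorted(i for score, i in ranked[:top_k] if score > 0)
    return [elements[i] for i in kept]


if __name__ == "__main__":
    page = [
        {"tag": "div", "text": "Welcome to the store", "attrs": {}},
        {"tag": "button", "text": "Add to cart", "attrs": {"id": "add-cart"}},
        {"tag": "a", "text": "Help", "attrs": {"href": "/help"}},
    ]
    print(prune_dom(page, ["cart"], top_k=1))
```

The point of the design is that the grounding model only sees the short list returned by `prune_dom`, not the full DOM.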
Anomaly Detection with Adaptive and Aggressive Rejection for Contaminated Training Data
Authors: Jungi Lee, Jungkwon Kim, Chi Zhang, Kwangsun Yoo, Seok-Joo Byun
2025-11-26
Handling contaminated data poses a critical challenge in anomaly detection, as traditional models assume training on purely normal data. Conventional methods mitigate contamination by relying on fixed contamination ratios, but discrepancies between assumed and actual ratios can severely degrade performance, especially in noisy environments where normal and abnormal data distributions overlap. To address these limitations, we propose Adaptive and Aggressive Rejection (AAR), a novel method that dynamically excludes anomalies using a modified z-score and Gaussian mixture model-based thresholds. AAR effectively balances the trade-off between preserving normal data and excluding anomalies by integrating hard and soft rejection strategies. Extensive experiments on two image datasets and thirty tabular datasets demonstrate that AAR outperforms the state-of-the-art method by 0.041 AUROC. By providing a scalable and reliable solution, AAR enhances robustness against contaminated datasets, paving the way for broader real-world applications in domains such as security and healthcare.
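The modified z-score mentioned above is commonly defined from the median and the median absolute deviation (MAD), which makes it less sensitive to the very outliers it is meant to flag. The sketch below shows only that ingredient, applied as a hard rejection rule on per-sample anomaly scores. The 0.6745 constant and the 3.5 cut-off are the conventional textbook values, not parameters taken from the paper, and the Gaussian-mixture thresholds and soft rejection of AAR are not modelled here.

```python
from statistics import median


def modified_z_scores(scores: list[float]) -> list[float]:
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    if mad == 0:
        # All deviations collapse to zero; no sample can be singled out.
        return [0.0 for _ in scores]
    return [0.6745 * (s - med) / mad for s in scores]


def hard_reject(scores: list[float], threshold: float = 3.5) -> list[int]:
    """Return the indices of samples kept for training, i.e. those whose
    modified z-score does not exceed the threshold."""
    z = modified_z_scores(scores)
    return [i for i, zi in enumerate(z) if zi <= threshold]


if __name__ == "__main__":
    anomaly_scores = [0.10, 0.12, 0.11, 0.13, 0.09, 0.95]
    print(hard_reject(anomaly_scores))  # the last sample is rejected
```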
Discovery and recovery of crystalline materials with property-conditioned transformers
Authors: Cyprien Bone, Matthew Walker, Kuangdai Leng, Luis M. Antunes, Ricardo Grau-Crespo, Amil Aligayev, Javier Dominguez, Keith T. Butler
2025-11-26
Generative models have recently shown great promise for accelerating the design and discovery of new functional materials. Conditional generation enhances this capacity by allowing inverse design, where specific desired properties can be requested during the generation process. However, conditioning of transformer-based approaches, in particular, is constrained by discrete tokenisation schemes and the risk of catastrophic forgetting during fine-tuning. This work introduces CrystaLLM-π (property injection), a conditional autoregressive framework that integrates continuous property representations directly into the transformer's attention mechanism. Two architectures, Property-Key-Value (PKV) Prefix attention and PKV Residual attention, are presented. These methods bypass inefficient sequence-level tokenisation and preserve foundational knowledge from unsupervised pre-training on Crystallographic Information Files (CIFs) as textual input. We establish the efficacy of these mechanisms through systematic robustness studies and evaluate the framework's versatility across two distinct tasks. First, for structure recovery, the model processes high-dimensional, heterogeneous X-ray diffraction patterns, achieving structural accuracy competitive with specialised models and demonstrating applications to experimental structure recovery and polymorph differentiation. Second, for materials discovery, the model is fine-tuned on a specialised photovoltaic dataset to generate novel, stable candidates validated by Density Functional Theory (DFT). It implicitly learns to target optimal band gap regions for high photovoltaic efficiency, demonstrating a capability to map complex structure-property relationships. CrystaLLM-π provides a unified, flexible, and computationally efficient framework for inverse materials design.
Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale
Authors: Yicheng Zhong, Peiji Yang, Zhisheng Wang
2025-11-26
Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectures that jointly model semantic and acoustic integration. However, despite their efficiency, these models often exhibit unstable prosody, speaker drift, and degraded naturalness. To address these issues, we propose a multi-reward Group Relative Policy Optimization (GRPO) framework that directly optimizes the token generation policy of single-codebook TTS LLMs. Beyond standard intelligibility and speaker similarity objectives, our design integrates three rule-based rewards: a length penalty for duration consistency, an entropy regularization reward for decoding stability, and an LLM-annotated prosody alignment reward that explicitly supervises rhythm. In this prosody reward, an external reasoning LLM predicts multiple plausible pause structures via in-context learning, providing a human-preference-aligned supervisory signal for GRPO training. To assess universality, we further attach a flow-matching (FM) decoder on top of the GRPO-optimized AR backbone and observe consistent additional gains, indicating that our reinforcement optimization enhances the intrinsic AR policy. We further conduct a scalability analysis across data sizes and model scales, revealing that the proposed method consistently enhances prosodic stability, speaker similarity, and overall speech naturalness in single-codebook TTS LLMs.
Scenes as Tokens Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding
Authors: Yutao Tang, Cheng Zhao, Gaurav Mittal, Rohith Kukkala, Rama Chellappa, Cheng Peng, Mei Chen
2025-11-26
Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.
Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models
Authors: Changlin Li, Jiawei Zhang, Zeyi Shi, Zongxin Yang, Zhihui Li, Xiaojun Chang
2025-11-26
Large-scale vision generative models, including diffusion and flow models, have demonstrated remarkable performance in visual generation tasks. However, transferring these pre-trained models to downstream tasks often results in significant parameter redundancy. In this paper, we propose EntPruner, an entropy-guided automatic progressive pruning framework for diffusion and flow models. First, we introduce entropy-guided pruning, a block-level importance assessment strategy specifically designed for generative models. Unlike discriminative models, generative models require preserving the diversity and condition-fidelity of the output distribution. As the importance of each module can vary significantly across downstream tasks, EntPruner prioritizes pruning of less important blocks using data-dependent Conditional Entropy Deviation (CED) as a guiding metric. CED quantifies how much the distribution diverges from the learned conditional data distribution after removing a block. Second, we propose a zero-shot adaptive pruning framework to automatically determine when and how much to prune during training. This dynamic strategy avoids the pitfalls of one-shot pruning, mitigating mode collapse and preserving model performance. Extensive experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22x inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.
MLPMoE Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts
Authors: Ivan Novikov
2025-11-26
Large Language Models (LLMs) are predominantly deployed as dense transformers, where every parameter in every feed-forward block is activated for every token. While architecturally simple, this is computationally inefficient, since inference costs scale linearly with parameter count. Recent upcycling methods such as MoEfication, CMoE, ToMoE, and MoORE reveal that much of the useful computation lives in sparse, semi-modular substructures inside dense feed-forward networks, but these approaches typically rely on clustering, activation profiling, singular value decomposition, or custom routing that requires calibration data. This paper introduces MLPMoE (MLP Mixture-of-Experts), a training-free, deterministic transformation that restructures the dense MLP in transformer blocks into a static, high-cardinality mixture of experts. The transformation uses simple tensor slicing and summation, reinterpreting the algebra of tensor parallelism as a topological conversion rather than a distributed training pattern. We further introduce Fractal Fade (differential branch sparsity) and Compensated Pruning (variance-preserving branch reduction) as lightweight mechanisms for structured sparsity. On Qwen2.5-0.5B-Instruct and DeepSeek-R1-Distill-Llama-8B, the zero-shot MLPMoE transform changes a proxy perplexity metric by less than 0.05 percent while keeping the parameter count effectively constant. On the 8B model, differential sparsity removes about 20 percent of MLP parameters while keeping perplexity within about 2 percent of the dense baseline. The method operates entirely post hoc on existing checkpoints and does not require gradients, calibration sets, or router training. Code is available at https://gist.github.com/iwallarm/fc2ef1eddf226ca7814f9e5e2ae9bad1
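The "tensor slicing and summation" idea rests on a simple identity: because the activation is applied element-wise, slicing the hidden dimension of a two-layer MLP into groups and summing the per-group outputs reproduces the dense output exactly. The sketch below checks this on a toy example in pure Python. It uses a plain ReLU MLP without biases or gating as a simplifying assumption, so it illustrates the algebra only and is not the released MLPMoE code; all function names are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dense_mlp(W_up, W_down, x):
    """y = W_down @ relu(W_up @ x)."""
    return matvec(W_down, relu(matvec(W_up, x)))


def split_into_experts(W_up, W_down, num_experts):
    """Slice the hidden dimension into equal groups: each expert takes a block
    of rows of W_up and the matching block of columns of W_down."""
    hidden = len(W_up)
    assert hidden % num_experts == 0, "hidden size must divide evenly"
    size = hidden // num_experts
    experts = []
    for e in range(num_experts):
        lo, hi = e * size, (e + 1) * size
        experts.append((W_up[lo:hi], [row[lo:hi] for row in W_down]))
    return experts


def static_moe(experts, x):
    """Run every expert (no router) and sum their outputs."""
    outputs = [dense_mlp(up, down, x) for up, down in experts]
    return [sum(vals) for vals in zip(*outputs)]


if __name__ == "__main__":
    W_up = [[1.0, 0.0], [0.0, -1.0], [1.0, 1.0], [-1.0, 2.0]]   # hidden=4, in=2
    W_down = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, -1.0]]      # out=2, hidden=4
    x = [1.0, 2.0]
    print(dense_mlp(W_up, W_down, x))
    print(static_moe(split_into_experts(W_up, W_down, 2), x))
```

With all experts active the two outputs match; parameter savings only appear once some branches are dropped or down-weighted, which is where mechanisms such as those named in the abstract would come in.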
Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning
Authors: Zhenchao Tang, Fang Wang, Haohuai He, Jiale Zhou, Tianxu Lv, Jun Zhu, Shouzhi Chen, Minghao Yang, Yu Wang, Jiayang Wu, Yidong Song, Jianhua Yao
2025-11-26
Effective post-training is essential to align Large Language Models (LLMs) with specialized biomedical knowledge to accelerate life science research. However, current approaches face significant limitations. First, biomedical reasoning involves intricate mechanisms often represented by sparse textual data. Standard Supervised Fine-Tuning (SFT) tends to overfit to surface-level instruction patterns without effectively internalizing this fragmented scientific knowledge. Second, Reinforcement Learning (RL) is impractical for this domain, as defining meaningful rewards often necessitates prohibitive experimental validation (e.g., wet-lab verification of drug responses), rendering real-time feedback unfeasible. We propose Balanced Fine-Tuning (BFT), an efficient post-training method designed to learn complex reasoning from sparse data without external reward signals. BFT operates through a two-layer weighting mechanism: 1. At the token level, it scales loss via prediction probabilities to stabilize gradients and prevent overfitting; 2. At the sample level, it uses "minimum group confidence" to adaptively enhance the learning of hard samples. Experiments demonstrate that BFT significantly outperforms SFT. In medical tasks, it enables LLMs to acquire knowledge that SFT misses. In biological tasks, BFT-based LLMs surpass GeneAgent (an accurate agent for biology analysis) in biological process reasoning. Moreover, the text embeddings generated by BFT can be directly applied to downstream tasks, such as gene interaction and single-cell perturbation response prediction. These results indicate that BFT facilitates broad applications of LLMs in biomedical research.
RAVQ-HoloNet Rate-Adaptive Vector-Quantized Hologram Compression
Authors: Shima Rafiei, Zahra Nabizadeh Shahr Babak, Shadrokh Samavi, Shahram Shirani
2025-11-26
Holography offers significant potential for AR/VR applications, yet its adoption is limited by the high demands of data compression. Existing deep learning approaches generally lack rate adaptivity within a single network. We present RAVQ-HoloNet, a rate-adaptive vector quantization framework that achieves high-fidelity reconstructions at low and ultra-low bit rates, outperforming current state-of-the-art methods. At low bit rates, our method improves on the best existing method by -33.91% in BD-Rate and by 1.02 dB in BD-PSNR, as demonstrated by the rate-distortion curve.
Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI
Authors: Al Amin, Kamrul Hasan, Liang Hong, Sharif Ullah
2025-11-26
Collaborative machine learning across healthcare institutions promises improved diagnostic accuracy by leveraging diverse datasets, yet privacy regulations such as HIPAA prohibit direct patient data sharing. While federated learning (FL) enables decentralized training without raw data exchange, recent studies show that model gradients in conventional FL remain vulnerable to reconstruction attacks, potentially exposing sensitive medical information. This paper presents a privacy-preserving federated learning framework combining Vision Transformers (ViT) with homomorphic encryption (HE) for secure multi-institutional histopathology classification. The approach leverages the ViT CLS token as a compact 768-dimensional feature representation for secure aggregation, encrypting these tokens using CKKS homomorphic encryption before transmission to the server. We demonstrate that encrypting CLS tokens achieves a 30-fold communication reduction compared to gradient encryption while maintaining strong privacy guarantees. Through evaluation on a three-client federated setup for lung cancer histopathology classification, we show that gradients are highly susceptible to model inversion attacks (PSNR: 52.26 dB, SSIM: 0.999, NMI: 0.741), enabling near-perfect image reconstruction. In contrast, the proposed CLS-protected HE approach prevents such attacks while enabling encrypted inference directly on ciphertexts, requiring only 326 KB of encrypted data transmission per aggregation round. The framework achieves 96.12 percent global classification accuracy in the unencrypted domain and 90.02 percent in the encrypted domain.
A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving
Authors: Junhan Liao, Minxian Xu, Wanyi Zheng, Yan Wang, Kejiang Ye, Rajkumar Buyya, Chengzhong Xu
2025-11-26
To meet strict Service-Level Objectives (SLOs), contemporary Large Language Models (LLMs) decouple the prefill and decoding stages and place them on separate GPUs to mitigate the distinct bottlenecks inherent to each phase. However, the heterogeneity of LLM workloads causes producer-consumer imbalance between the two instance types in such a disaggregated architecture. To address this problem, we propose DOPD (Dynamic Optimal Prefill/Decoding), a dynamic LLM inference system that adjusts instance allocations to achieve an optimal prefill-to-decoding (P/D) ratio based on real-time load monitoring. Combined with an appropriate request-scheduling policy, DOPD effectively resolves imbalances between prefill and decoding instances and mitigates resource allocation mismatches due to mixed-length requests under high concurrency. Experimental evaluations show that, compared with vLLM and DistServe (representative aggregation-based and disaggregation-based approaches), DOPD improves overall system goodput by up to 1.5X, decreases P90 time-to-first-token (TTFT) by up to 67.5%, and decreases P90 time-per-output-token (TPOT) by up to 22.8%. Furthermore, our dynamic P/D adjustment technique performs proactive reconfiguration based on historical load, achieving over 99% SLO attainment while using fewer additional resources.
Towards Audio Token Compression in Large Audio Language Models
Authors: Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
2025-11-26
Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices.
In this paper, we explore techniques such as unsupervised segmentation and uniform average pooling to reduce the number of audio tokens after they are generated by the LALM's audio encoder but before they are consumed by the LLM decoder. To mitigate potential performance degradation introduced by the compressed representations, we employ low-rank adapters to finetune the model. We evaluate our proposed models on two tasks, automatic speech recognition and speech-to-speech translation, that depend on effectively uncovering the underlying lexical content of the input signal, and we study the effect of downsampling on these tasks. Experimental results show that compressed LALMs can achieve performance close to frame-level LALMs while reducing the input audio token count by up to three times before the LLM backbone.
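Uniform average pooling, one of the techniques named above, is easy to picture: consecutive encoder frames are grouped into fixed-size windows and each window is replaced by its mean vector. The following sketch shows that operation on plain Python lists. The function name, the handling of the final partial window, and the toy dimensions are assumptions for illustration and are not taken from the paper.

```python
def average_pool(frames: list[list[float]], factor: int) -> list[list[float]]:
    """Average every `factor` consecutive frame embeddings into one token.
    A trailing partial window is averaged over the frames it contains."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    pooled = []
    for start in range(0, len(frames), factor):
        window = frames[start:start + factor]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled


if __name__ == "__main__":
    # Four 2-dimensional encoder frames pooled with factor 3 -> two tokens.
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(average_pool(frames, factor=3))
```

With a factor of 3, the sequence passed to the language model backbone is roughly one third as long, which matters because attention cost grows quadratically with sequence length.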
TrafficLens Multi-Camera Traffic Video Analysis Using LLMs
Authors: Md Adnan Arefeen, Biplob Debnath, Srimat Chakradhar
2025-11-26
Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to while maintaining information accuracy.
Emergence and Localisation of Semantic Role Circuits in LLMs
Authors: Nura Aljaafari, Danilo S. Carvalho, André Freitas
2025-11-25
Despite displaying semantic competence, large language models' internal mechanisms that ground abstract semantic structure remain insufficiently characterised. We propose a method integrating role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study how LLMs implement semantic roles. Our analysis uncovers: (i) highly concentrated circuits (89-94% attribution within 28 nodes); (ii) gradual structural refinement rather than phase transitions, with larger models sometimes bypassing localised circuits; and (iii) moderate cross-scale conservation (24-59% component overlap) alongside high spectral similarity. These findings suggest that LLMs form compact, causally isolated mechanisms for abstract semantic structure, and these mechanisms exhibit partial transfer across scales and architectures.
Latent Collaboration in Multi-Agent Systems
Authors: Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang
2025-11-25
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.
Quantum-Resistant Authentication Scheme for RFID Systems Using Lattice-Based Cryptography
Authors: Vaibhav Kumar, Kaiwalya Joshi, Bhavya Dixit, Gaurav S. Kasbekar
2025-11-25
We propose a novel quantum-resistant mutual authentication scheme for radio-frequency identification (RFID) systems. Our scheme uses lattice-based cryptography and, in particular, achieves quantum-resistance by leveraging the hardness of the inhomogeneous short integer solution (ISIS) problem. In contrast to prior work, which assumes that the reader-server channel is secure, our scheme is secure even when both the reader-server and tag-reader communication channels are insecure. Our proposed protocol provides robust security against man-in-the-middle (MITM), replay, impersonation, and reflection attacks, while also ensuring unforgeability and preserving anonymity. We present a detailed security analysis, including semi-formal analysis and formal verification using the Automated Validation of Internet Security Protocols and Applications (AVISPA) tool. In addition, we analyze the storage, computation, and communication costs of the proposed protocol and compare its security properties with those of existing protocols, demonstrating that our scheme offers strong security guarantees. To the best of our knowledge, this paper is the first quantum-resistant authentication protocol for RFID systems that comprehensively addresses the insecurity of both the reader-server and tag-reader communication channels.
DiFR Inference Verification Despite Nondeterminism
Authors: Adam Karvonen, Daniel Reuter, Roy Rinberg, Luke Marks, Adrià Garriga-Alonso, Keri Warr
2025-11-25
As demand for inference grows, it is becoming increasingly important that providers and their customers can verify that inference processes are performed correctly, without errors or tampering. However, re-running the same inference process twice often leads to different results due to benign numerical noise, making it difficult to distinguish legitimate variation from actual problems. To address this problem, we introduce Token-DiFR (Token-Divergence-From-Reference), a method for verifying inference outputs by comparing generated tokens against predictions made by a trusted reference implementation conditioned on the same random seed. Sampling seed synchronization tightly constrains valid outputs, leaving providers minimal room to deviate from correct inference, which allows output tokens themselves to serve as auditable evidence of correctness at zero additional cost to the provider. Token-DiFR reliably identifies sampling errors, simulated bugs, and model quantization, detecting 4-bit quantization with AUC 0.999 within 300 output tokens. For applications requiring sample-efficient forward-pass verification, we additionally introduce Activation-DiFR, a scheme that uses random orthogonal projections to compress activations into compact fingerprints for subsequent verification. Activation-DiFR detects 4-bit quantization with AUC 0.999 using just 2 output tokens, while reducing communication overhead by 25-75% relative to existing methods. We release an open-source integration with vLLM to accelerate practical deployment of verifiable inference.
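The seed-synchronized check described in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "models" below are hypothetical deterministic functions, the rounding step is only a loose stand-in for quantization, and the mismatch rate is an assumed scoring rule chosen for simplicity. The idea shown is that when the sampler's randomness depends only on the shared seed and the position, an honest provider's tokens can be reproduced exactly by the reference, while a provider running altered logits diverges.

```python
import math
import random

VOCAB = 8  # toy vocabulary size (assumption)

def reference_logits(context):
    """Hypothetical stand-in for the trusted model's forward pass."""
    s = sum(context) + 1
    return [2.0 * math.sin(1.7 * s * (t + 1)) for t in range(VOCAB)]

def rounded_logits(context):
    """Simulated tampering: coarsely rounded logits, loosely mimicking quantization."""
    return [float(round(x)) for x in reference_logits(context)]

def sample(logits, seed, step):
    """Inverse-CDF sampling whose randomness depends only on (seed, step)."""
    u = random.Random(seed * 1_000_003 + step).random()
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    acc = 0.0
    for token, w in enumerate(weights):
        acc += w / total
        if u < acc:
            return token
    return VOCAB - 1

def generate(logits_fn, prompt, seed, n_tokens):
    """What a provider would do: sample n_tokens with the agreed seed."""
    context = list(prompt)
    for step in range(n_tokens):
        context.append(sample(logits_fn(context), seed, step))
    return context[len(prompt):]

def mismatch_rate(prompt, claimed, seed):
    """Verifier: re-sample every position with the reference model on the claimed prefix."""
    context = list(prompt)
    mismatches = 0
    for step, token in enumerate(claimed):
        if sample(reference_logits(context), seed, step) != token:
            mismatches += 1
        context.append(token)
    return mismatches / len(claimed)

prompt, seed = [1, 2, 3], 42
honest_rate = mismatch_rate(prompt, generate(reference_logits, prompt, seed, 300), seed)
tampered_rate = mismatch_rate(prompt, generate(rounded_logits, prompt, seed, 300), seed)
print(honest_rate, tampered_rate)
```

In this toy setting the honest run reproduces exactly because there is no numerical noise; a real verifier would need a tolerance for the benign nondeterminism the abstract mentions.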
DP-MicroAdam Private and Frugal Algorithm for Training and Fine-tuning
Authors: Mihaela Hudişteanu, Edwige Cyffers, Nikita P. Kalinin
2025-11-25
Adaptive optimizers are the de facto standard in non-private training as they often enable faster convergence and improved performance. In contrast, differentially private (DP) training is still predominantly performed with DP-SGD, typically requiring extensive compute and hyperparameter tuning. We propose DP-MicroAdam, a memory-efficient and sparsity-aware adaptive DP optimizer. We prove that DP-MicroAdam converges in stochastic non-convex optimization at the optimal rate, up to privacy-dependent constants. Empirically, DP-MicroAdam outperforms existing adaptive DP optimizers and achieves competitive or superior accuracy compared to DP-SGD across a range of benchmarks, including CIFAR-10, large-scale ImageNet training, and private fine-tuning of pretrained transformers. These results demonstrate that adaptive optimization can improve both performance and stability under differential privacy.
Object-Centric Vision Token Pruning for Vision Language Models
Authors: Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen
2025-11-25
In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs, without fine-tuning of any models on any datasets. It is guaranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across any vision pruning ratios, i.e., inference efficiency, our OC-VTP consistently helps mainstream VLMs to preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our codes are available at https://github.com/GarryLarry010131/OC-VTP.
The Case for Intent-Based Query Rewriting
Authors: Gianna Lisa Nicolai, Patrick Hansert, Sebastian Michel
2025-11-25
With this work, we describe the concept of intent-based query rewriting and present a first viable solution. The aim is to allow rewrites to alter the structure and syntactic outcome of an original query while keeping the obtainable insights intact. This drastically differs from traditional query rewriting, which typically aims to decrease query evaluation time by using strict equivalence rules and optimization heuristics on the query plan. Rewriting queries to queries that only provide a similar insight but otherwise can be entirely different can remedy inaccessible original data tables due to access control, privacy, or expensive data access regarding monetary cost or remote access. In this paper, we put forward INQURE, a system designed for INtent-based QUery REwriting. It uses access to a large language model (LLM) for the query understanding and human-like derivation of alternate queries. Around the LLM, INQURE employs upfront table filtering and subsequent candidate rewrite pruning and ranking. We report on the results of an evaluation using a benchmark set of over 900 database table schemas and discuss the pros and cons of alternate approaches regarding runtime and quality of the rewrites of a user study.
FREE Uncertainty-Aware Autoregression for Parallel Diffusion Transformers
Authors: Xinwan Wen, Bowen Li, Jiajun Luo, Ye Li, Zhi Wang
2025-11-25
Diffusion Transformers (DiTs) achieve state-of-the-art generation quality but require long sequential denoising trajectories, leading to high inference latency. Recent speculative inference methods enable lossless parallel sampling in U-Net-based diffusion models via a drafter-verifier scheme, but their acceleration is limited on DiTs due to insufficient draft accuracy during verification. To address this limitation, we analyze the DiTs' feature dynamics and find the features of the final transformer layer (top-block) exhibit strong temporal consistency and rich semantic abstraction. Based on this insight, we propose FREE, a novel framework that employs a lightweight drafter to perform feature-level autoregression with parallel verification, guaranteeing lossless acceleration with theoretical and empirical support. Meanwhile, prediction variance (uncertainty) of DiTs naturally increases in later denoising steps, reducing acceptance rates under speculative sampling. To mitigate this effect, we further introduce an uncertainty-guided relaxation strategy, forming FREE (relax), which dynamically adjusts the acceptance probability in response to uncertainty levels. Experiments on ImageNet show that FREE accelerates sampling, and FREE (relax) reaches a further speedup while maintaining high perceptual and quantitative fidelity in generation quality.
Scaling LLM Speculative Decoding Non-Autoregressive Forecasting in Large-Batch Scenarios
Authors: Luohe Shi, Zuchao Li, Lefei Zhang, Baoyuan Qi, Guoming Liu, Hai Zhao
2025-11-25
Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing power, then generate a complex and massive draft tree using a small autoregressive language model to improve overall prediction accuracy. However, methods like batching have been widely applied in mainstream model inference systems as a superior alternative to speculative decoding, as they compress the available idle computing power. Therefore, performing speculative decoding with low verification resources and low scheduling costs has become an important research problem. We believe that more capable models that allow for parallel generation on draft sequences are what we truly need. Recognizing the fundamental nature of draft models to only generate sequences of limited length, we propose SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer combines the autoregressive model's ability to extract information from the entire input sequence with the parallel generation benefits of non-autoregressive models. This design eliminates the reliance on large prefix trees and achieves consistent acceleration, even in large-batch scenarios. Through lossless speculative decoding experiments across models of various scales, we demonstrate that SpecFormer sets a new standard for scaling LLM inference with lower training demands and reduced computational costs.
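For readers unfamiliar with the draft-then-verify loop that lossless speculative decoding builds on, here is a minimal greedy sketch. It does not reproduce SpecFormer's architecture: `target_next` and `draft_next` are hypothetical toy functions, the draft is a simple chain rather than a tree or a non-autoregressive block, and the verification loop stands in for what would be a single batched forward pass of the target model.

```python
def target_next(context):
    """Hypothetical target model: greedy next token as a deterministic function."""
    return (3 * context[-1] + len(context)) % 11

def draft_next(context):
    """Hypothetical cheap draft model that disagrees with the target at every 4th position."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 11

def autoregressive(prompt, n_tokens):
    """Baseline: one target call per generated token."""
    context = list(prompt)
    for _ in range(n_tokens):
        context.append(target_next(context))
    return context[len(prompt):]

def speculative(prompt, n_tokens, k=4):
    """Greedy speculative decoding: accept the longest draft prefix the target agrees with."""
    context = list(prompt)
    verify_passes = 0
    while len(context) - len(prompt) < n_tokens:
        # 1. Draft k tokens with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(context + draft))
        # 2. Verify all drafted positions (one batched target pass in a real system).
        verify_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(context + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)  # replace the first wrong token and stop
                break
            accepted.append(draft[i])
        else:
            accepted.append(target_next(context + draft))  # bonus token when all k match
        context.extend(accepted)
    return context[len(prompt):len(prompt) + n_tokens], verify_passes

tokens, passes = speculative([1, 2], 20)
print(tokens == autoregressive([1, 2], 20), passes)
```

Because every accepted token equals the target's own greedy choice for that prefix, the output matches plain autoregressive decoding while needing fewer verification passes than tokens.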
IrisNet Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection
Authors: Xuelin Qian, Jiaming Lu, Zixuan Wang, Wenxuan Wang, Zhongling Huang, Dingwen Zhang, Junwei Han
2025-11-25
Infrared Small Target Detection (IRSTD) faces significant challenges due to low signal-to-noise ratios, complex backgrounds, and the absence of discernible target features. While deep learning-based encoder-decoder frameworks have advanced the field, their static pattern learning suffers from pattern drift across diverse scenarios (e.g., day/night variations, sky/maritime/ground domains), limiting robustness. To address this, we propose IrisNet, a novel meta-learned framework that dynamically adapts detection strategies to the input infrared image status. Our approach establishes a dynamic mapping between infrared image features and entire decoder parameters via an image-to-decoder transformer. More concretely, we represent the parameterized decoder as a structured 2D tensor preserving hierarchical layer correlations and enable the transformer to model inter-layer dependencies through self-attention while generating adaptive decoding patterns via cross-attention. To further enhance the perception ability of infrared images, we integrate high-frequency components to supplement target-position and scene-edge information. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets demonstrate the superiority of our IrisNet, achieving state-of-the-art performance.
Beyond Components Singular Vector-Based Interpretability of Transformer Circuits
Authors: Areeb Ahmad, Abhinav Joshi, Ashutosh Modi
2025-11-25
Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron layers (MLPs) (the building blocks of a transformer architecture) as indivisible units, overlooking possibilities of functional substructure learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate our perspective on widely used standard tasks like Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple overlapping subfunctions aligned with distinct singular directions. Nodes in a computational graph that were previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that transformer computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.
Communication-Efficient Learning for Satellite Constellations
Authors: Ruxandra-Stefania Tudose, Moritz H. W. Grüss, Grace Ra Kim, Karl H. Johansson, Nicola Bastianello
2025-11-25
Satellite constellations in low-Earth orbit are now widespread, enabling positioning, Earth imaging, and communications. In this paper we address the solution of learning problems using these satellite constellations. In particular, we focus on a federated approach, where satellites collect and locally process data, with the ground station aggregating local models. We focus on designing a novel, communication-efficient algorithm that still yields accurate trained models. To this end, we employ several mechanisms to reduce the number of communications with the ground station (local training) and their size (compression). We then propose an error feedback mechanism that enhances accuracy, which yields, as a byproduct, an algorithm-agnostic error feedback scheme that can be more broadly applied. We analyze the convergence of the resulting algorithm, and compare it with the state of the art through simulations in a realistic space scenario, showcasing superior performance.
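The combination of compression and error feedback mentioned in this abstract can be sketched generically. The abstract does not specify the compressor or the exact feedback rule, so the top-k sparsifier and the `ErrorFeedback` class below are assumptions chosen for illustration: whatever the compressor drops is stored locally and added back before the next transmission, so no part of an update is lost permanently.

```python
def top_k(vector, k):
    """Keep the k largest-magnitude entries and zero the rest (an assumed compressor)."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]

class ErrorFeedback:
    """Hypothetical per-satellite state holding what compression has discarded so far."""

    def __init__(self, dim):
        self.residual = [0.0] * dim

    def compress(self, update, k):
        # Add back the previously discarded mass, compress, and store the new remainder.
        corrected = [u + r for u, r in zip(update, self.residual)]
        message = top_k(corrected, k)
        self.residual = [c - m for c, m in zip(corrected, message)]
        return message

# Three local updates, each sent to the ground station as a single nonzero entry.
updates = [[0.5, -2.0, 0.1, 1.0], [0.4, 0.3, -0.2, 0.1], [0.6, -0.1, 0.2, -0.3]]
ef = ErrorFeedback(dim=4)
received = [0.0] * 4
messages = []
for update in updates:
    message = ef.compress(update, k=1)
    messages.append(message)
    received = [a + b for a, b in zip(received, message)]

true_sum = [sum(column) for column in zip(*updates)]
recovered = [a + r for a, r in zip(received, ef.residual)]
print(received, ef.residual, true_sum)
```

The bookkeeping invariant is that the transmitted total plus the stored residual equals the sum of the uncompressed updates, which is what lets the aggregate catch up over later rounds.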
Interactive AI NPCs Powered by LLMs Technical Report for the CPDC Challenge 2025
Authors: Yitian Huang, Yuxuan Lei, Jianxun Lian, Hao Liao
2025-11-25
This report presents the solution and results of our team MSRA_SC in the Commonsense Persona-Grounded Dialogue Challenge (CPDC 2025). We propose a simple yet effective framework that unifies improvements across both GPU Track and API Track. Our method centers on two key components. First, Context Engineering applies dynamic tool and persona clipping for input compression, combined with post-processing techniques such as parameter normalization and function merging. Together with manually refined prompts, this design improves tool call stability, execution reliability, and role-playing guidance. Second, in the GPU Track, we further adopt GRPO training, replacing supervised fine-tuning with reinforcement learning directly optimized by reward signals. This mitigates small-sample overfitting and significantly enhances task-oriented dialogue performance. In the final evaluation, our team ranks 1st in Task 2 API, 2nd in Task 1 API, and 3rd in both Task 3 API and GPU track, demonstrating the effectiveness of our approach. Our code is publicly available at https://gitlab.aicrowd.com/nikoo_yu/cpdc-2025-winning-solution
Beluga A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management
Authors: Xinjun Yang, Qingda Hu, Junru Li, Feifei Li, Yuqi Zhou, Yicong Zhu, Qiuru Lin, Jian Dai, Yang Kong, Jiayu Zhang, Guoqiang Xu, Qiang Liu
2025-11-25
The rapid increase in model sizes and the growing demand for long-context inference have made memory a critical bottleneck in GPU-accelerated serving systems. Although high-bandwidth memory (HBM) on GPUs offers fast access, its limited capacity necessitates reliance on host memory (CPU DRAM) to support larger working sets such as the KVCache. However, the maximum DRAM capacity is constrained by the limited number of memory channels per CPU socket. To overcome this limitation, current systems often adopt RDMA-based disaggregated memory pools, which introduce significant challenges including high access latency, complex communication protocols, and synchronization overhead. Fortunately, the emerging CXL technology introduces new opportunities in KVCache design. In this paper, we propose Beluga, a novel memory architecture that enables GPUs and CPUs to access a shared, large-scale memory pool through CXL switches. By supporting native load/store access semantics over the CXL fabric, our design delivers near-local memory latency, while reducing programming complexity and minimizing synchronization overhead. We conduct a systematic characterization of a commercial CXL switch-based memory pool and propose a set of design guidelines. Based on Beluga, we design and implement Beluga-KVCache, a system tailored for managing the large-scale KVCache in LLM inference. Beluga-KVCache achieves an 89.6% reduction in Time-To-First-Token (TTFT) and 7.35x throughput improvement in the vLLM inference engine compared to RDMA-based solutions. To the best of our knowledge, Beluga is the first system that enables GPUs to directly access large-scale memory pools through CXL switches, marking a significant step toward low-latency, shared access to vast memory resources by GPUs.
SKEL-CF Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery
Authors: Da Li, Jiping Jin, Xuanlong Yu, Wei Liu, Xiaodong Cun, Kai Chen, Rui Fan, Jiangang Kong, Xi Shen
2025-11-25
Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. Our implementation is available on the project page: https://pokerman8.github.io/SKEL-CF/.
IDAP++ Advancing Divergence-Based Pruning via Filter-Level and Layer-Level Optimization
Authors: Aleksei Samarin, Artem Nazarenko, Egor Kotenko, Valentin Malykh, Alexander Savelev, Aleksei Toropov
2025-11-25
This paper presents a novel approach to neural network compression that addresses redundancy at both the filter and architectural levels through a unified framework grounded in information flow analysis. Building on the concept of tensor flow divergence, which quantifies how information is transformed across network layers, we develop a two-stage optimization process. The first stage employs iterative divergence-aware pruning to identify and remove redundant filters while preserving critical information pathways. The second stage extends this principle to higher-level architecture optimization by analyzing layer-wise contributions to information propagation and selectively eliminating entire layers that demonstrate minimal impact on network performance. The proposed method naturally adapts to diverse architectures, including convolutional networks, transformers, and hybrid designs, providing a consistent metric for comparing the structural importance across different layer types. Experimental validation across multiple modern architectures and datasets reveals that this combined approach achieves substantial model compression while maintaining competitive accuracy. The presented approach achieves parameter reduction results that are globally comparable to those of state-of-the-art solutions and outperforms them across a wide range of modern neural network architectures, from convolutional models to transformers. The results demonstrate how flow divergence serves as an effective guiding principle for both filter-level and layer-level optimization, offering practical benefits for deployment in resource-constrained environments.
SSA Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Authors: Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, Di Yin, Xing Sun
2025-11-25
The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.
MTA A Merge-then-Adapt Framework for Personalized Large Language Model
Authors: Xiaopeng Li, Yuanjin Zheng, Wanyu Wang, wenlin zhang, Pengyue Jia, Yiqi Wang, Maolin Wang, Xuetao Wei, Xiangyu Zhao
2025-11-25
Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with sparse data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for PLLMs. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.
FLaTEC Frequency-Disentangled Latent Triplanes for Efficient Compression of LiDAR Point Clouds
Authors: Xiaoge Zhang, Zijie Wu, Mingtao Feng, Zichen Geng, Mehwish Nasim, Saeed Anwar, Ajmal Mian
2025-11-25
Point cloud compression methods jointly optimize bitrates and reconstruction distortion. However, balancing compression ratio and reconstruction quality is difficult because low-frequency and high-frequency components contribute differently at the same resolution. To address this, we propose FLaTEC, a frequency-aware compression model that enables the compression of a full scan with high compression ratios. Our approach introduces a frequency-aware mechanism that decouples low-frequency structures and high-frequency textures, while hybridizing latent triplanes as a compact proxy for point cloud. Specifically, we convert voxelized embeddings into triplane representations to reduce sparsity, computational cost, and storage requirements. We then devise a frequency-disentangling technique that extracts compact low-frequency content while collecting high-frequency details across scales. The decoupled low-frequency and high-frequency components are stored in binary format. During decoding, full-spectrum signals are progressively recovered via a modulation block. Additionally, to compensate for the loss of 3D correlation, we introduce an efficient frequency-based attention mechanism that fosters local connectivity and outputs arbitrary resolution points. Our method achieves state-of-the-art rate-distortion performance and outperforms the standard codecs by 78% and 94% in BD-rate on both SemanticKITTI and Ford datasets.
Adaptive Knowledge Transfer for Cross-Disciplinary Cold-Start Knowledge Tracing
Authors: Yulong Deng, Zheng Guan, Min He, Xue Wang, Jie Liu, Zheng Li
2025-11-25
Cross-Disciplinary Cold-start Knowledge Tracing (CDCKT) faces a critical challenge: insufficient student interaction data in the target discipline prevents effective knowledge state modeling and performance prediction. Existing cross-disciplinary methods rely on overlapping entities between disciplines for knowledge transfer through simple mapping functions, but suffer from two key limitations: (1) overlapping entities are scarce in real-world scenarios, and (2) simple mappings inadequately capture cross-disciplinary knowledge complexity. To overcome these challenges, we propose Mixed of Experts and Adversarial Generative Network-based Cross-disciplinary Cold-start Knowledge Tracing Framework. Our approach consists of three key components: First, we pre-train a source discipline model and cluster student knowledge states into K categories. Second, these cluster attributes guide a mixture-of-experts network through a gating mechanism, serving as a cross-domain mapping bridge. Third, an adversarial discriminator enforces feature separation by pulling same-attribute student features closer while pushing different-attribute features apart, effectively mitigating small-sample limitations. We validate our method's effectiveness across 20 extreme cross-disciplinary cold-start scenarios.
DeeAD Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving
Authors: Haibo HU, Lianming Huang, Nan Guan, Chun Jason Xue
2025-11-25
Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.
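The action-guided exit rule in this abstract (stop once an intermediate trajectory lies within 2 m of a lightweight planning prior) can be sketched as follows. This is an illustrative reading, not DeeAD's code: the per-layer trajectories are synthetic, the deviation metric (largest point-wise Euclidean distance) is an assumption, and the multi-hop layer-skipping controller is omitted.

```python
import math

def max_deviation(trajectory, prior):
    """Largest point-wise Euclidean distance (metres) between two 2D trajectories."""
    return max(math.dist(p, q) for p, q in zip(trajectory, prior))

def early_exit_plan(layer_trajectories, prior, tolerance_m=2.0):
    """Return (layer_index, trajectory) for the first layer within the tolerance.

    Falls back to the final layer when no intermediate trajectory qualifies.
    """
    for layer, trajectory in enumerate(layer_trajectories):
        if max_deviation(trajectory, prior) < tolerance_m:
            return layer, trajectory
    last = len(layer_trajectories) - 1
    return last, layer_trajectories[last]

# Synthetic example: a straight-line prior, and per-layer predictions whose
# lateral offset from it halves with every layer (8 m, 4 m, 2 m, 1 m, ...).
prior = [(float(t), 0.0) for t in range(5)]
layer_trajectories = [
    [(x, y + 8.0 / 2 ** layer) for x, y in prior] for layer in range(8)
]

exit_layer, plan = early_exit_plan(layer_trajectories, prior)
print(exit_layer, max_deviation(plan, prior))
```

With these synthetic offsets the rule exits at layer index 3, the first layer whose deviation is strictly below the 2 m tolerance, so the remaining four layers are never evaluated.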
MPrune Hierarchical Communication Graph Pruning for Efficient Multi-Modal Multi-Agent Retrieval-Augmented Generation
Authors: Weizi Shao, Taolin Zhang, Zijie Zhou, Chen Chen, Chengyu Wang, Xiaofeng He
2025-11-25
Recent advancements in multi-modal retrieval-augmented generation (mRAG), which enhance multi-modal large language models (MLLMs) with external knowledge, have demonstrated that the collective intelligence of multiple agents can significantly outperform a single model through effective communication. Despite impressive performance, existing multi-agent systems inherently incur substantial token overhead and increased computational costs, posing challenges for large-scale deployment. To address these issues, we propose a novel Multi-Modal Multi-agent hierarchical communication graph PRUNING framework, termed MPrune. Our framework eliminates redundant edges across different modalities, achieving an optimal balance between task performance and token overhead. Specifically, MPrune first applies intra-modal graph sparsification to textual and visual modalities, identifying the edges most critical for solving the task. Subsequently, we construct a dynamic communication topology using these key edges for inter-modal graph sparsification. Finally, we progressively prune redundant edges to obtain a more efficient and hierarchical topology. Extensive experiments on both general and domain-specific mRAG benchmarks demonstrate that our method consistently outperforms both single-agent and robust multi-agent mRAG systems while significantly reducing token consumption.
ParaBlock Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models
Authors: Yujia Wang, Yuanpu Cao, Jinghui Chen
2025-11-25
Federated learning (FL) has been extensively studied as a privacy-preserving training paradigm. Recently, the federated block coordinate descent scheme has become a popular option for training large-scale models, as it allows clients to train only a subset of the model locally instead of the entire model. However, in the era of large language models (LLMs), even a single block can contain a significant number of parameters, posing substantial communication latency, particularly for resource-constrained clients. To address this challenge in federated training/fine-tuning of LLMs, we propose ParaBlock, a novel approach that establishes two parallel threads for communication and computation to enhance communication efficiency. We theoretically prove that the proposed ParaBlock achieves the same convergence rate as the standard federated block coordinate descent methods. Empirical evaluations on fine-tuning LLMs on general instruction following and mathematical reasoning confirm that ParaBlock not only maintains strong performance but also significantly improves communication efficiency.
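The overlap of communication and computation mentioned in this abstract can be sketched with a generic two-thread pipeline. This shows only the scheduling pattern, not ParaBlock itself: `train_block` and `send_update` are hypothetical stand-ins that sleep instead of training or uploading, and the sketch ignores how the real method handles updates computed on a stale model.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def train_block(block_id, compute_s):
    """Stand-in for local training on one model block (hypothetical)."""
    time.sleep(compute_s)
    return f"update-{block_id}"

def send_update(update, comm_s):
    """Stand-in for uploading one block update to the server (hypothetical)."""
    time.sleep(comm_s)
    return update

def sequential_rounds(blocks, compute_s, comm_s):
    """Baseline: each block is trained, then uploaded, before the next starts."""
    return [send_update(train_block(b, compute_s), comm_s) for b in blocks]

def overlapped_rounds(blocks, compute_s, comm_s):
    """Train block k on the main thread while block k-1 uploads on a second thread."""
    sent, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as comm_thread:
        for b in blocks:
            update = train_block(b, compute_s)  # runs while the previous upload is in flight
            if pending is not None:
                sent.append(pending.result())   # wait for the previous upload to finish
            pending = comm_thread.submit(send_update, update, comm_s)
        if pending is not None:
            sent.append(pending.result())
    return sent

if __name__ == "__main__":
    print(overlapped_rounds([0, 1, 2], compute_s=0.01, comm_s=0.01))
```

With n blocks, compute time c, and upload time m, the baseline takes about n(c + m), while the overlapped schedule takes about c + (n - 1)·max(c, m) + m, so the gain is largest when c and m are comparable.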
AI/ML based Joint Source and Channel Coding for HARQ-ACK Payload
Authors: Akash Doshi, Pinar Sen, Kirill Ivanov, Wei Yang, June Namgoong, Runxin Wang, Rachel Wang, Taesang Yoo, Jing Jiang, Tingfang Ji
2025-11-25
Channel coding from 2G to 5G has assumed the input bits at the physical layer to be uniformly distributed. However, hybrid automatic repeat request acknowledgement (HARQ-ACK) bits transmitted in the uplink are inherently non-uniformly distributed. For such sources, significant performance gains could be obtained by employing joint source channel coding, aided by deep learning-based techniques. In this paper, we learn a transformer-based encoder using a novel "free-lunch" training algorithm and propose per-codeword power shaping to exploit the source prior at the encoder whilst being robust to small changes in the HARQ-ACK distribution. Furthermore, any HARQ-ACK decoder has to achieve a low negative acknowledgement (NACK) error rate to avoid radio link failures resulting from multiple NACK errors. We develop an extension of the Neyman-Pearson test to a coded bit system with multiple information bits to achieve Unequal Error Protection of NACK over ACK bits at the decoder. Finally, we apply the proposed encoder and decoder designs to a 5G New Radio (NR) compliant uplink setup under a fading channel, describing the optimal receiver design and a low complexity coherent approximation to it. Our results demonstrate a 3-6 dB reduction in the average transmit power required to achieve the target error rates compared to the NR baseline, while also achieving a 2-3 dB reduction in the maximum transmit power, thus providing for significant coverage gains and power savings.
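For readers unfamiliar with the Neyman-Pearson idea behind unequal error protection, here is a minimal sketch of the classical single-bit case only, not the paper's extension to coded multi-bit payloads. The setup is an assumption chosen for illustration: one uncoded bit sent as ACK = +1 or NACK = -1 over an AWGN channel with noise standard deviation `sigma`, and a receiver that decides ACK when the received sample exceeds a threshold `tau`. The threshold is chosen so that the NACK-to-ACK error rate equals a target `alpha`.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # zero mean, unit variance

def np_threshold(alpha, sigma):
    """Threshold tau with P(decide ACK | NACK sent) == alpha.

    Under NACK the sample is -1 + noise, so P(y > tau) = Q((tau + 1) / sigma).
    """
    return -1.0 + sigma * STD_NORMAL.inv_cdf(1.0 - alpha)

def error_rates(tau, sigma):
    """Return (NACK-to-ACK, ACK-to-NACK) error probabilities for threshold tau."""
    nack_to_ack = 1.0 - STD_NORMAL.cdf((tau + 1.0) / sigma)
    ack_to_nack = STD_NORMAL.cdf((tau - 1.0) / sigma)
    return nack_to_ack, ack_to_nack

if __name__ == "__main__":
    sigma = 0.5
    tau = np_threshold(alpha=1e-3, sigma=sigma)
    print("threshold:", tau)
    print("protected (NACK->ACK, ACK->NACK):", error_rates(tau, sigma))
    print("symmetric (NACK->ACK, ACK->NACK):", error_rates(0.0, sigma))
```

Moving the threshold toward the ACK symbol lowers the NACK-to-ACK error rate at the cost of a higher ACK-to-NACK rate, which is the asymmetry that protecting NACK over ACK relies on.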