2025-12-05
Table of Contents
- Splannequin Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting
- Deep Forcing Training-Free Long Video Generation with Deep Sink and Participative Compression
- A Randomized Scheduling Framework for Privacy-Preserving Multi-robot Rendezvous given Prior Information
- The position and resolvability of blended point sources
- Isolating chirality-breaking SMEFT operators with Drell-Yan angular analysis
- Generative Neural Video Compression via Video Diffusion Prior
- A dynamic memory assignment strategy for dilation-based ICP algorithm on embedded GPUs
- Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models
- Learning Causality for Longitudinal Data
- Efficient Generative Transformer Operators For Million-Point PDEs
- FASTer Toward Efficient Autoregressive Vision Language Action Modeling via neural Action Tokenization
- Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens
- Tokenizing Buildings A Transformer for Layout Synthesis
- ASTRIDE A Security Threat Modeling Platform for Agentic-AI Applications
- RLHFSpec Breaking the Efficiency Bottleneck in RLHF Training via Adaptive Drafting
- SignRoundV2 Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs
- M3-TTS Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis
- Large Speech Model Enabled Semantic Communication
- Topology Matters Measuring Memory Leakage in Multi-Agent LLMs
- SEASON Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
- Natural Language Actor-Critic Scalable Off-Policy Learning in Language Space
- SPLICE Part-Level 3D Shape Editing from Local Semantic Extraction to Global Neural Mixing
- MD-SNN Membrane Potential-aware Distillation on Quantized Spiking Neural Network
- Solving LLM Repetition Problem in Production A Comprehensive Study of Multiple Solutions
- Learning to Orchestrate Agents in Natural Language with the Conductor
- Universal Quantum Interconnects via Phase-Coherent Four-Wave Mixing
- Learning Single-Image Super-Resolution in the JPEG Compressed Domain
- Look Around and Pay Attention Multi-camera Point Tracking Reimagined with Transformers
- Constructing Low-Redundancy Codes via Distributed Graph Coloring
- Asymmetric excitation of left- vs right-handed photons in accelerating waveguides
- PosterCopilot Toward Layout Reasoning and Controllable Editing for Professional Graphic Design
- PSA Pyramid Sparse Attention for Efficient Video Understanding and Generation
- Ultra-lightweight Neural Video Representation Compression
- Teaching Old Tokenizers New Words Efficient Tokenizer Adaptation for Pre-trained Models
- An Information Theory of Finite Abstractions and their Fundamental Scalability Limits
- Technical Report on Text Dataset Distillation
- OD-MoE On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
- UniMo Unifying 2D Video and 3D Human Motion with an Autoregressive Framework
- Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
- Training and Evaluation of Guideline-Based Medical Reasoning in LLMs
- Transmit Weights, Not Features Orthogonal-Basis Aided Wireless Point-Cloud Transmission
- CCN Decentralized Cross-Chain Channel Networks Supporting Secure and Privacy-Preserving Multi-Hop Interactions
- Different types of syntactic agreement recruit the same units within large language models
- ConvRot Rotation-Based Plug-and-Play 4-bit Quantization for Diffusion Transformers
- Towards Privacy-Preserving Range Queries with Secure Learned Spatial Index over Encrypted Data
- SELF A Robust Singular Value and Eigenvalue Approach for LLM Fingerprinting
- KVNAND Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing
- Observation-driven correction of numerical weather prediction for marine winds
- A Preliminary Study on the Promises and Challenges of Native Top-k Sparse Attention
- AsymPuzl An Asymmetric Puzzle for multi-agent cooperation
- Think Before You Drive World Model-Inspired Multimodal Grounding for Autonomous Vehicles
- A fast stochastic interacting particle-field method for 3D parabolic-parabolic Chemotaxis systems numerical algorithms and error analysis
- Quantum Encrypted Control of Networked Systems
- TokenScale Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity
- UniQL Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
- Rethinking Security in Semantic Communication Latent Manipulation as a New Threat
Splannequin Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting
Authors: Hao-Jen Chien, Yi-Chuan Huang, Chung-Ho Wu, Wei-Lun Chao, Yu-Lun Liu
2025-12-04
Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model's time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: https://chien90190.github.io/splannequin/
Deep Forcing Training-Free Long Video Generation with Deep Sink and Participative Compression
Authors: Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim
2025-12-04
Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that require no fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
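As a rough illustration of the Deep Sink idea described above, the following Python sketch selects which cached token positions stay in a fixed-size attention window: half are reserved for persistent sink tokens and half for the most recent tokens. The function name `deep_sink_window` is hypothetical, and treating the earliest tokens as the sinks is an assumption borrowed from StreamingLLM-style attention sinks. The temporal RoPE re-alignment and Participative Compression are not modeled here.

```python
def deep_sink_window(num_cached, window_size):
    """Return the cached token indices kept in the attention window.

    Assumption: the first half of the window holds the earliest tokens
    (persistent sinks) and the second half holds the most recent tokens.
    """
    if num_cached <= window_size:
        # Everything still fits, so nothing is evicted yet.
        return list(range(num_cached))
    num_sink = window_size // 2
    num_recent = window_size - num_sink
    sinks = list(range(num_sink))
    recent = list(range(num_cached - num_recent, num_cached))
    return sinks + recent


if __name__ == "__main__":
    print(deep_sink_window(num_cached=10, window_size=4))
```

With 10 cached tokens and a window of 4, the sketch keeps the two earliest and the two latest positions and drops everything in between.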
A Randomized Scheduling Framework for Privacy-Preserving Multi-robot Rendezvous given Prior Information
Authors: Le Liu, Yu Kawano, Ming Cao
2025-12-04
Privacy has become a critical concern in modern multi-robot systems, driven by both ethical considerations and operational constraints. As a result, growing attention has been directed toward privacy-preserving coordination in dynamical multi-robot systems. This work introduces a randomized scheduling mechanism for privacy-preserving robot rendezvous. The proposed approach achieves improved privacy even at lower communication rates, where privacy is quantified via pointwise maximal leakage. We show that lower transmission rates provide stronger privacy guarantees and prove that rendezvous is still achieved under the randomized scheduling mechanism. Numerical simulations are provided to demonstrate the effectiveness of the method.
The position and resolvability of blended point sources
Authors: Zephyr Penoyre
2025-12-04
In this work we derive analytic expressions and numerical recipes for finding the effective observed position of sources close enough on sky that their Point Spread Functions (PSF), modelled as Gaussian profiles, overlap. In particular we derive these for an elongated PSF, with a long and short axis, such as we would see from an instrument with a rectangular or elliptical mirror (relevant, for example, for the Gaia mission). We show that in this case the problem can be reduced to a one dimensional brightness profile with extrema along the line connecting the two sources, with an effective PSF width that depends on the relative orientation of the PSF and its degree of elongation. The problem can then be expressed in units of this effective width to be a function of the relative separation and light ratio alone (thus reducing to a rescaling of the un-elongated case). We derive the minimum light ratio, for a given separation and effective width, above which two sources will be resolved. We map out numerical procedures for finding the positions of these extrema across all possible cases. Finally we derive the positional offset and deviance associated with observing a fixed pair of blended sources from a variety of orientations, showing that this can be a significant source of excess noise.
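The one-dimensional picture above can be illustrated with a small numerical sketch. It assumes two Gaussian profiles of equal width `sigma` along the line joining the sources, with the primary at 0 and the secondary at `separation`, scaled by `light_ratio`. The function names and the grid-based peak count are illustrative choices, not the paper's analytic procedure.

```python
import math


def blended_profile(x, separation, light_ratio, sigma=1.0):
    """Brightness at x for a primary at 0 and a secondary at `separation`."""
    primary = math.exp(-0.5 * (x / sigma) ** 2)
    secondary = light_ratio * math.exp(-0.5 * ((x - separation) / sigma) ** 2)
    return primary + secondary


def count_peaks(separation, light_ratio, sigma=1.0, samples=4001):
    """Count local maxima of the blended profile on a uniform grid."""
    lo = -5.0 * sigma
    hi = separation + 5.0 * sigma
    step = (hi - lo) / (samples - 1)
    values = [
        blended_profile(lo + i * step, separation, light_ratio, sigma)
        for i in range(samples)
    ]
    peaks = 0
    for i in range(1, samples - 1):
        # Strict rise on the left, non-strict fall on the right,
        # so a two-sample plateau is counted once.
        if values[i] > values[i - 1] and values[i] >= values[i + 1]:
            peaks += 1
    return peaks


if __name__ == "__main__":
    for sep, ratio in [(1.0, 1.0), (3.0, 1.0), (3.0, 0.01)]:
        print(sep, ratio, count_peaks(sep, ratio))
```

For equal brightness, two maxima appear only once the separation exceeds twice the width, so the sketch reports one peak at a separation of 1 and two at a separation of 3. A sufficiently faint secondary at the same separation merges back into a single peak, which is the light-ratio threshold the abstract refers to.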
Isolating chirality-breaking SMEFT operators with Drell-Yan angular analysis
Authors: Samuele Grossi, Xu Li, Lorenzo Rolla, Riccardo Torre
2025-12-04
We present a comprehensive strategy to isolate the effect of a class of chirality-breaking interactions in the Standard Model Effective Field Theory (SMEFT) by exploiting Drell-Yan angular analysis and the violation of the Lam-Tung relation. Unlike most SMEFT interpretations of Drell-Yan measurements, dominated by growing-with-energy effects generated by the interference of SMEFT-induced and SM amplitudes, this method isolates operators that contribute only quadratically in the Wilson coefficients, allowing for an independent probe of non-interfering operators. Denoting with the electroweak vev, with the center-of-mass energy, and with the scale of new physics, the non-interfering contributions to the amplitude generated by the chirality-breaking operators can be proportional to or . We argue that these two classes can be further distinguished by analyzing the angular observables of the lepton pair in the transverse momentum and in the invariant mass distribution of the lepton pair. We therefore present an analysis of the lepton-pair angular observables in both these distributions. Based on a precise estimate of the Standard Model contribution to the relevant observables for the process up to , we present realistic projections for the sensitivity of the LHC with fb and for the HL-LHC with ab to chirality-breaking interactions, demonstrating that angular observables provide an independent and clean handle on SMEFT effects, especially in regions where the Standard Model contribution is naturally suppressed thanks to the Lam-Tung relation. This analysis becomes crucial to go beyond single-parameter global fits, since it helps to break degeneracies with chirality-preserving operators and to disentangle overlapping directions in the EFT parameter space.
Generative Neural Video Compression via Video Diffusion Prior
Authors: Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, Siwei Ma
2025-12-04
We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.
A dynamic memory assignment strategy for dilation-based ICP algorithm on embedded GPUs
Authors: Qiong Chang, Weimin Wang, Junpei Zhong, Jun Miyazaki
2025-12-04
This paper proposes a memory-efficient optimization strategy for the high-performance point cloud registration algorithm VANICP, enabling lightweight execution on embedded GPUs with constrained hardware resources. VANICP is a recently published framework that significantly improves the computational efficiency of point-cloud-based applications. By transforming the global nearest neighbor search (NNS) into a localized process through a dilation-based information propagation mechanism, VANICP greatly reduces the computational complexity of the NNS. However, its original implementation demands a considerable amount of memory, which restricts its deployment in resource-constrained environments such as embedded systems. To address this issue, we propose a GPU-oriented dynamic memory assignment strategy that optimizes the memory usage of the dilation operation. Furthermore, based on this strategy, we construct an enhanced version of the VANICP framework that achieves over 97% reduction in memory consumption while preserving the original performance. Source code is published on: https://github.com/changqiong/VANICP4Em.git.
Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models
Authors: NaHyeon Park, Namin An, Kunhee Kim, Soyeon Yoon, Jiahao Huo, Hyunjung Shim
2025-12-04
Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024-prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.
Learning Causality for Longitudinal Data
Authors: Mouad EL Bouchattaoui
2025-12-04
This thesis develops methods for causal inference and causal representation learning (CRL) in high-dimensional, time-varying data.
The first contribution introduces the Causal Dynamic Variational Autoencoder (CDVAE), a model for estimating Individual Treatment Effects (ITEs) by capturing unobserved heterogeneity in treatment response driven by latent risk factors that affect only outcomes. CDVAE comes with theoretical guarantees on valid latent adjustment and generalization bounds for ITE error. Experiments on synthetic and real datasets show that CDVAE outperforms baselines, and that state-of-the-art models greatly improve when augmented with its latent substitutes, approaching oracle performance without access to true adjustment variables.
The second contribution proposes an efficient framework for long-term counterfactual regression based on RNNs enhanced with Contrastive Predictive Coding (CPC) and InfoMax. It captures long-range dependencies under time-varying confounding while avoiding the computational cost of transformers, achieving state-of-the-art results and introducing CPC into causal inference.
The third contribution advances CRL by addressing how latent causes manifest in observed variables. We introduce a model-agnostic interpretability layer based on the geometry of the decoder Jacobian. A sparse self-expression prior induces modular, possibly overlapping groups of observed features aligned with shared latent influences. We provide recovery guarantees in both disjoint and overlapping settings and show that meaningful latent-to-observed structure can be recovered without anchor features or single-parent assumptions. Scalable Jacobian-based regularization techniques are also developed.
Efficient Generative Transformer Operators For Million-Point PDEs
Authors: Armand Kassaï Koupaï, Lise Le Boudec, Patrick Gallinari
2025-12-04
We introduce ECHO, a transformer-operator framework for generating million-point PDE trajectories. While existing neural operators (NOs) have shown promise for solving partial differential equations, they remain limited in practice due to poor scalability on dense grids, error accumulation during dynamic unrolling, and task-specific design. ECHO addresses these challenges through three key innovations. (i) It employs a hierarchical convolutional encode-decode architecture that achieves a 100x spatio-temporal compression while preserving fidelity on mesh points. (ii) It incorporates a training and adaptation strategy that enables high-resolution PDE solution generation from sparse input grids. (iii) It adopts a generative modeling paradigm that learns complete trajectory segments, mitigating long-horizon error drift. The training strategy decouples representation learning from downstream task supervision, allowing the model to tackle multiple tasks such as trajectory generation, forward and inverse problems, and interpolation. The generative model further supports both conditional and unconditional generation. We demonstrate state-of-the-art performance on million-point simulations across diverse PDE systems featuring complex geometries, high-frequency dynamics, and long-term horizons.
FASTer Toward Efficient Autoregressive Vision Language Action Modeling via neural Action Tokenization
Authors: Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, Liangtao Zheng, Tao Jiang, Jingjing Gong, Xipeng Qiu, Hang Zhao
2025-12-04
Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens
Authors: Ziran Qin, Youru Lv, Mingbao Lin, Zeren Zhang, Chanfan Gan, Tieyuan Chen, Weiyao Lin
2025-12-04
Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient AR image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 on LlamaGen-XL and Janus-Pro-1B, while retaining only 1/6 of the KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 of the KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.
Tokenizing Buildings A Transformer for Layout Synthesis
Authors: Manuel Ladron de Guevara, Jinmo Rhee, Ardavan Bidgoli, Vaidas Razgaitis, Michael Bergin
2025-12-04
We introduce Small Building Model (SBM), a Transformer-based architecture for layout synthesis in Building Information Modeling (BIM) scenes. We address the question of how to tokenize buildings by unifying heterogeneous feature sets of architectural elements into sequences while preserving compositional structure. Such feature sets are represented as a sparse attribute-feature matrix that captures room properties. We then design a unified embedding module that learns joint representations of categorical and possibly correlated continuous feature groups. Lastly, we train a single Transformer backbone in two modes: an encoder-only pathway that yields high-fidelity room embeddings, and an encoder-decoder pipeline for autoregressive prediction of room entities, referred to as Data-Driven Entity Prediction (DDEP). Experiments across retrieval and generative layout synthesis show that SBM learns compact room embeddings that reliably cluster by type and topology, enabling strong semantic retrieval. In DDEP mode, SBM produces functionally sound layouts, with fewer collisions and boundary violations and improved navigability.
ASTRIDE A Security Threat Modeling Platform for Agentic-AI Applications
Authors: Eranga Bandara, Amin Hass, Ross Gore, Sachin Shetty, Ravi Mukkamala, Safdar H. Bouk, Xueping Liang, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan
2025-12-04
AI agent-based systems are becoming increasingly integral to modern software architectures, enabling autonomous decision-making, dynamic task execution, and multimodal interactions through large language models (LLMs). However, these systems introduce novel and evolving security challenges, including prompt injection attacks, context poisoning, model manipulation, and opaque agent-to-agent communication, that are not effectively captured by traditional threat modeling frameworks. In this paper, we introduce ASTRIDE, an automated threat modeling platform purpose-built for AI agent-based systems. ASTRIDE extends the classical STRIDE framework by introducing a new threat category, A for AI Agent-Specific Attacks, which encompasses emerging vulnerabilities such as prompt injection, unsafe tool invocation, and reasoning subversion, unique to agent-based applications. To automate threat modeling, ASTRIDE combines a consortium of fine-tuned vision-language models (VLMs) with the OpenAI-gpt-oss reasoning LLM to perform end-to-end analysis directly from visual agent architecture diagrams, such as data flow diagrams (DFDs). LLM agents orchestrate the end-to-end threat modeling automation process by coordinating interactions between the VLM consortium and the reasoning LLM. Our evaluations demonstrate that ASTRIDE provides accurate, scalable, and explainable threat modeling for next-generation intelligent systems. To the best of our knowledge, ASTRIDE is the first framework to both extend STRIDE with AI-specific threats and integrate fine-tuned VLMs with a reasoning LLM to fully automate diagram-driven threat modeling in AI agent-based applications.
RLHFSpec Breaking the Efficiency Bottleneck in RLHF Training via Adaptive Drafting
Authors: Siqi Wang, Hailong Yang, Junjie Zhu, Xuezhu Wang, Yufan Xu, Depei Qian
2025-12-04
Reinforcement Learning from Human Feedback (RLHF) is an important fine-tuning technique for large language models (LLMs) and comprises three stages: generation, inference, and training. The generation stage generates samples that are then used to infer learnable experiences for training. We observe that the generation stage is the bottleneck of the entire execution process and consider it a key point for optimization. Specifically, we make the first attempt to integrate speculative decoding into the RLHF generation stage and propose RLHFSpec, an RLHF system that accelerates generation execution with adaptive speculative decoding and sample reallocation. To fully exploit the performance potential provided by speculative decoding, especially when dealing with the dynamic workload of the generation stage, RLHFSpec proposes a workload-aware drafting strategy selection mechanism, which selects the near-optimal strategy by jointly considering the verification cost and the number of accepted tokens. Moreover, RLHFSpec also proposes sample reallocation to fully utilize the GPU resources, and optimizes it with an efficient sample migration mechanism. The experimental results show that RLHFSpec achieves higher throughput in the generation stage compared to state-of-the-art works. Moreover, due to the effective alleviation of the generation bottleneck, RLHFSpec also shows significant performance speedup in the entire RLHF execution.
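The trade-off named above, verification cost against accepted tokens, can be sketched as a simple selection rule. The abstract does not give the actual criterion, so the scoring used here (expected accepted tokens per unit of verification cost) is an assumption, and `DraftStrategy` and `select_strategy` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class DraftStrategy:
    name: str
    expected_accepted_tokens: float  # estimated tokens accepted per verification step
    verification_cost: float         # estimated cost of one verification step, e.g. in ms


def select_strategy(strategies):
    """Pick the strategy with the best accepted-tokens-per-cost ratio.

    Assumed scoring rule for illustration only; it is not taken from the paper.
    """
    return max(
        strategies,
        key=lambda s: s.expected_accepted_tokens / s.verification_cost,
    )


if __name__ == "__main__":
    candidates = [
        DraftStrategy("short-draft", expected_accepted_tokens=2.0, verification_cost=1.0),
        DraftStrategy("long-draft", expected_accepted_tokens=3.0, verification_cost=2.0),
    ]
    print(select_strategy(candidates).name)
```

In a workload-aware setting the two estimates would be refreshed as the batch composition changes, so the preferred strategy can shift during generation.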
SignRoundV2 Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs
Authors: Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen
2025-12-04
Extreme low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2 bits and even 4 bits (e.g., FP4). We present SignRoundV2, a post-training quantization framework that is highly effective even without mixed-precision. SignRoundV2 introduces (1) a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and (2) a lightweight pre-tuning search for quantization scales to improve extremely low-bit quantization. These components allow SignRoundV2 to close the gap with full-precision models. Extensive experiments indicate that our method sustains competitive accuracy for LLMs, achieving production-grade performance with about 1 percent variance at 4-5 bits and strong results even at 2 bits. The implementation is available at https://github.com/intel/auto-round.
M3-TTS Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis
Authors: Xiaopeng Wang, Chunyu Qiang, Ruibo Fu, Zhengqi Wen, Xuefei Liu, Yukun Liu, Yuzhe Liang, Kang Yin, Yuankun Xie, Heng Xie, Chenxing Li, Chen Zhang, Changsheng Li
2025-12-04
Non-autoregressive (NAR) text-to-speech synthesis relies on length alignment between text sequences and audio representations, constraining naturalness and expressiveness. Existing methods depend on duration modeling or pseudo-alignment strategies that severely limit naturalness and computational efficiency. We propose M3-TTS, a concise and efficient NAR TTS paradigm based on the multi-modal diffusion transformer (MM-DiT) architecture. M3-TTS employs joint diffusion transformer layers for cross-modal alignment, achieving stable monotonic alignment between variable-length text-speech sequences without pseudo-alignment requirements. Single diffusion transformer layers further enhance acoustic detail modeling. The framework integrates a mel-vae codec that provides a 3x training speedup. Experimental results on Seed-TTS and AISHELL-3 benchmarks demonstrate that M3-TTS achieves state-of-the-art NAR performance with the lowest word error rates (1.36% English, 1.31% Chinese) while maintaining competitive naturalness scores. Code and demos will be available at https://wwwwxp.github.io/M3-TTS.
Large Speech Model Enabled Semantic Communication
Authors: Yun Tian, Zhijin Qin, Guocheng Lv, Ye Jin, Kaibin Huang, Zhu Han
2025-12-04
Existing speech semantic communication systems, mainly based on Joint Source-Channel Coding (JSCC) architectures, have demonstrated impressive performance, but their effectiveness remains limited by model structures specifically designed for particular tasks and datasets. Recent advances indicate that generative large models pre-trained on massive datasets exhibit exceptional performance across diverse downstream tasks with minimal fine-tuning. To exploit the rich semantic knowledge embedded in large models and enable adaptive transmission over lossy channels, we propose a Large Speech Model enabled Semantic Communication (LargeSC) system. Simultaneously achieving adaptive compression and robust transmission over lossy channels remains challenging, requiring trade-offs among compression efficiency, speech quality, and latency. In this work, we employ Mimi as the speech codec, converting speech into discrete tokens compatible with existing network architectures. We propose an adaptive controller module that enables adaptive transmission and in-band Unequal Error Protection (UEP), dynamically adjusting to both speech content and packet loss probability under bandwidth constraints. Additionally, we employ Low-Rank Adaptation (LoRA) to finetune the Moshi foundation model for generative recovery of lost speech tokens. Simulation results show that the proposed system supports bandwidths ranging from 550 bps to 2.06 kbps, outperforms conventional baselines in speech quality under high packet loss rates, and achieves an end-to-end latency of approximately 460 ms, thereby demonstrating its potential for real-time deployment.
Topology Matters Measuring Memory Leakage in Multi-Agent LLMs
Authors: Jinbo Liu, Defu Cao, Yifei Wei, Tianyao Su, Yuan Liang, Yushun Dong, Yue Zhao, Xiyang Hu
2025-12-04
Graph topology is a fundamental determinant of memory leakage in multi-agent systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a framework that measures how network structure shapes leakage. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two-phase protocol: Engram (seeding private information into a target agent's memory) and Resonance (multi-round interaction where an attacker attempts extraction). Over up to 10 interaction rounds, we quantify leakage as the fraction of ground-truth PII recovered from attacking agent outputs via exact matching. We systematically evaluate six common network topologies (fully connected, ring, chain, binary tree, star, and star-ring), varying agent counts, attacker-target placements, and base models. Our findings reveal consistent patterns: fully connected graphs exhibit maximum leakage while chains provide strongest protection; shorter attacker-target graph distance and higher target centrality significantly increase vulnerability; leakage rises sharply in early rounds before plateauing; model choice shifts absolute leakage rates but preserves topology rankings; temporal/locational PII attributes leak more readily than identity credentials or regulated identifiers. These results provide the first systematic mapping from architectural choices to measurable privacy risk, yielding actionable guidance: prefer sparse or hierarchical connectivity, maximize attacker-target separation, limit node degree and network radius, avoid shortcuts bypassing hubs, and implement topology-aware access controls.
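As a rough illustration of the two quantities this abstract relies on, the sketch below builds a few of the named topologies, measures attacker-target graph distance, and scores leakage as the fraction of ground-truth PII strings found verbatim in attacker outputs. The function names, the substring interpretation of "exact matching", and the sample strings are assumptions made for illustration, not the MAMA implementation.

```python
from collections import deque

def build_topology(kind, n):
    """Return an undirected adjacency dict for agents labeled 0..n-1 (illustrative)."""
    adj = {i: set() for i in range(n)}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    if kind == "chain":
        for i in range(n - 1):
            link(i, i + 1)
    elif kind == "ring":
        for i in range(n):
            link(i, (i + 1) % n)
    elif kind == "star":
        for i in range(1, n):
            link(0, i)  # agent 0 is the hub
    elif kind == "fully_connected":
        for i in range(n):
            for j in range(i + 1, n):
                link(i, j)
    else:
        raise ValueError(f"unknown topology: {kind}")
    return adj

def graph_distance(adj, src, dst):
    """Shortest hop count between two agents via breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None  # unreachable

def leakage(ground_truth_pii, attacker_outputs):
    """Fraction of ground-truth PII strings appearing verbatim in attacker outputs.

    Assumption: 'exact matching' is treated here as substring containment.
    """
    if not ground_truth_pii:
        return 0.0
    text = "\n".join(attacker_outputs)
    recovered = sum(1 for item in ground_truth_pii if item in text)
    return recovered / len(ground_truth_pii)
```

With five agents, the attacker at node 0 and the target at node 4 are four hops apart in a chain but only one hop apart in a fully connected graph, which matches the direction of the reported distance effect.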
SEASON Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
Authors: Chang-Hsun Wu, Kai-Po Chang, Yu-Yang Sheng, Hung-Kai Chung, Kuei-Chun Wang, Yu-Chiang Frank Wang
2025-12-04
Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g. object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
Natural Language Actor-Critic Scalable Off-Policy Learning in Language Space
Authors: Joey Hong, Kang Liu, Zhan Ling, Jiecao Chen, Sergey Levine
2025-12-04
Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.
SPLICE Part-Level 3D Shape Editing from Local Semantic Extraction to Global Neural Mixing
Authors: Jin Zhou, Hongliang Yang, Pengfei Xu, Hui Huang
2025-12-04
Neural implicit representations of 3D shapes have shown great potential in 3D shape editing due to their ability to model high-level semantics and continuous geometric representations. However, existing methods often suffer from limited editability, lack of part-level control, and unnatural results when modifying or rearranging shape parts. In this work, we present SPLICE, a novel part-level neural implicit representation of 3D shapes that enables intuitive, structure-aware, and high-fidelity shape editing. By encoding each shape part independently and positioning them using parameterized Gaussian ellipsoids, SPLICE effectively isolates part-specific features while discarding global context that may hinder flexible manipulation. A global attention-based decoder is then employed to integrate parts coherently, further enhanced by an attention-guiding filtering mechanism that prevents information leakage across symmetric or adjacent components. Through this architecture, SPLICE supports various part-level editing operations, including translation, rotation, scaling, deletion, duplication, and cross-shape part mixing. These operations enable users to flexibly explore design variations while preserving semantic consistency and maintaining structural plausibility. Extensive experiments demonstrate that SPLICE outperforms existing approaches both qualitatively and quantitatively across a diverse set of shape-editing tasks.
MD-SNN Membrane Potential-aware Distillation on Quantized Spiking Neural Network
Authors: Donghyun Lee, Abhishek Moitra, Youngeun Kim, Ruokai Yin, Priyadarshini Panda
2025-12-04
Spiking Neural Networks (SNNs) offer a promising and energy-efficient alternative to conventional neural networks, thanks to their binary activation. However, they face challenges regarding memory and computation overhead due to complex spatio-temporal dynamics and the necessity for multiple backpropagation computations across timesteps during training. To mitigate this overhead, compression techniques such as quantization are applied to SNNs. Yet, naively applying quantization to SNNs introduces a mismatch in membrane potential, a crucial factor for the firing of spikes, resulting in accuracy degradation. In this paper, we introduce Membrane-aware Distillation on Quantized Spiking Neural Network (MD-SNN), which leverages membrane potential to mitigate discrepancies after weight, membrane potential, and batch normalization quantization. To our knowledge, this study represents the first application of membrane potential knowledge distillation in SNNs. We validate our approach on various datasets, including CIFAR10, CIFAR100, N-Caltech101, and TinyImageNet, demonstrating its effectiveness for both static and dynamic data scenarios. Furthermore, for hardware efficiency, we evaluate MD-SNN with the SpikeSim platform, finding that MD-SNNs achieve 14.85X lower energy-delay-area product (EDAP), 2.64X higher TOPS/W, and 6.19X higher TOPS/mm2 compared to floating point SNNs at iso-accuracy on the N-Caltech101 dataset.
Solving LLM Repetition Problem in Production A Comprehensive Study of Multiple Solutions
Authors: Weiwei Wang, Weijie Zou, Jiyong Min
2025-12-04
The repetition problem, where Large Language Models (LLMs) continuously generate repetitive content without proper termination, poses a critical challenge in production deployments, causing severe performance degradation and system stalling. This paper presents a comprehensive investigation and multiple practical solutions for the repetition problem encountered in real-world batch code interpretation tasks.
We identify three distinct repetition patterns: (1) business rule generation repetition, (2) method call relationship analysis repetition, and (3) PlantUML diagram syntax generation repetition. Through rigorous theoretical analysis based on Markov models, we establish that the root cause lies in greedy decoding's inability to escape repetitive loops, exacerbated by self-reinforcement effects.
Our comprehensive experimental evaluation demonstrates three viable solutions: (1) Beam Search decoding with early_stopping=True serves as a universal post-hoc mechanism that effectively resolves all three repetition patterns; (2) the presence_penalty hyperparameter provides an effective solution specifically for BadCase 1; and (3) Direct Preference Optimization (DPO) fine-tuning offers a universal model-level solution for all three BadCases.
The primary value of this work lies in combining first-hand production experience with extensive experimental validation. Our main contributions include systematic theoretical analysis of repetition mechanisms, comprehensive evaluation of multiple solutions with task-specific applicability mapping, identification of early_stopping as the critical parameter for Beam Search effectiveness, and practical production-ready solutions validated in real deployment environments.
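To make the greedy-decoding loop and the presence_penalty remedy concrete, here is a minimal sketch on a toy first-order Markov model. The three-token vocabulary, the logit table, and the helper names are hypothetical, and the penalty follows the common convention of subtracting a flat amount from the logit of any token already generated; this is not the configuration studied in the paper.

```python
EOS = 2  # hypothetical end-of-sequence token id

def apply_presence_penalty(logits, generated_ids, presence_penalty):
    """Subtract a flat penalty from every token that has already been generated."""
    seen = set(generated_ids)
    return [x - presence_penalty if i in seen else x for i, x in enumerate(logits)]

def greedy_decode(next_logits, start, max_new_tokens, presence_penalty=0.0):
    """Pick the arg-max token at each step, stopping at EOS or the token budget."""
    out = [start]
    for _ in range(max_new_tokens):
        logits = apply_presence_penalty(next_logits(out[-1]), out, presence_penalty)
        token = max(range(len(logits)), key=logits.__getitem__)
        out.append(token)
        if token == EOS:
            break
    return out

# Toy model: token 0 prefers token 1 and token 1 prefers token 0, so greedy
# decoding bounces between them and never reaches EOS.
TOY_LOGITS = {0: [0.0, 2.0, 1.5], 1: [2.0, 0.0, 1.5], 2: [0.0, 0.0, 0.0]}

def toy_model(prev_token):
    return TOY_LOGITS[prev_token]

looping = greedy_decode(toy_model, start=0, max_new_tokens=6)
escaped = greedy_decode(toy_model, start=0, max_new_tokens=6, presence_penalty=1.0)
```

In this toy setting the unpenalized run exhausts its token budget alternating between tokens 0 and 1, while a penalty of 1.0 lowers both already-seen tokens enough for EOS to win on the second step.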
Learning to Orchestrate Agents in Natural Language with the Conductor
Authors: Stefan Nielsen, Edoardo Cetin, Peter Schwendeman, Qi Sun, Jinglue Xu, Yujin Tang
2025-12-04
Powerful large language models (LLMs) from different providers have been expensively trained and finetuned to specialize across varying domains. In this work, we introduce a new kind of Conductor model trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs. Our Conductor learns not only to design targeted communication topologies for effective agent-to-agent collaboration, but also to prompt engineer focused instructions to the LLMs to maximally leverage their individual capabilities. We show that, by learning optimal coordination strategies over pools of powerful worker LLMs, a 7B Conductor achieves significant performance gains beyond any individual worker, attaining state-of-the-art results in challenging reasoning benchmarks, such as LiveCodeBench and GPQA. By training with randomized agent pools, our Conductor effectively adapts to arbitrary sets of open- and closed-source agents, meeting any user requirements. Furthermore, allowing the Conductor to select itself as a worker gives rise to recursive topologies, elevating performance with a new form of dynamic test-time scaling through online iterative adaptation. More broadly, ours is among the early works demonstrating that language model coordination can be unlocked through RL, where powerful coordination strategies emerge naturally in LLMs through pure end-to-end reward maximization.
Universal Quantum Interconnects via Phase-Coherent Four-Wave Mixing
Authors: Hao Zhang, Yang Xu, Linshan Sun, Wei Cui, Robert W. Boyd, Sergio Carbajo
2025-12-03
Quantum transduction, which enables the coherent conversion of quantum information between disparate physical platforms, is a cornerstone for realizing scalable and interoperable quantum networks. Among various approaches, parametric frequency mixing processes such as four-wave mixing (FWM) offer a promising pathway toward efficient and low-noise transduction. In this work, we demonstrate the feasibility of coherent quantum state transfer by indirectly verifying high-fidelity wavefunction's phase mapping (>99%) from the input field to the generated output field wave. Using a gas-filled hollow-core capillary fiber, we systematically investigate spectral phase evolution across a broad range, including infrared (IR) to ultraviolet (UV) transitions, as well as conversions from telecom-band (1550 nm) to visible (516 nm) and deep-UV (308 nm) wavelengths. Our results reveal that strong phase coherence can be maintained throughout these diverse conversion regimes. Because quantum properties such as coherence and entanglement are intrinsically encoded in both the amplitude and phase of a photonic wavefunction, preserving spectral phase is essential for faithful quantum information transfer. We further show that efficient and phase-preserving transduction can be achieved by tuning system parameters, offering valuable insights into nonlinear coupling dynamics. These findings establish a promising foundation for advancing FWM-based quantum transduction schemes and open new avenues for integrating heterogeneous quantum systems across wide spectral domains within future quantum communication networks.
Learning Single-Image Super-Resolution in the JPEG Compressed Domain
Authors: Sruthi Srinivasan, Elham Shakibapour, Rajy Rawther, Mehdi Saeedi
2025-12-03
Deep learning models have grown increasingly complex, with input data sizes scaling accordingly. Despite substantial advances in specialized deep learning hardware, data loading continues to be a major bottleneck that limits training and inference speed. To address this challenge, we propose training models directly on encoded JPEG features, reducing the computational overhead associated with full JPEG decoding and significantly improving data loading efficiency. While prior works have focused on recognition tasks, we investigate the effectiveness of this approach for the restoration task of single-image super-resolution (SISR). We present a lightweight super-resolution pipeline that operates on JPEG discrete cosine transform (DCT) coefficients in the frequency domain. Our pipeline achieves a 2.6x speedup in data loading and a 2.5x speedup in training, while preserving visual quality comparable to standard SISR approaches.
Look Around and Pay Attention Multi-camera Point Tracking Reimagined with Transformers
Authors: Bishoy Galoaa, Xiangyu Bai, Shayda Moezzi, Utsav Nandi, Sai Siddhartha Vivek Dhir Rangoju, Somaieh Amraee, Sarah Ostadabbas
2025-12-03
This paper presents LAPA (Look Around and Pay Attention), a novel end-to-end transformer-based architecture for multi-camera point tracking that integrates appearance-based matching with geometric constraints. Traditional pipelines decouple detection, association, and tracking, leading to error propagation and temporal inconsistency in challenging scenarios. LAPA addresses these limitations by leveraging attention mechanisms to jointly reason across views and time, establishing soft correspondences through a cross-view attention mechanism enhanced with geometric priors. Instead of relying on classical triangulation, we construct 3D point representations via attention-weighted aggregation, inherently accommodating uncertainty and partial observations. Temporal consistency is further maintained through a transformer decoder that models long-range dependencies, preserving identities through extended occlusions. Extensive experiments on challenging datasets, including our newly created multi-camera (MC) versions of TAPVid-3D panoptic and PointOdyssey, demonstrate that our unified approach significantly outperforms existing methods, achieving 37.5% APD on TAPVid-3D-MC and 90.3% APD on PointOdyssey-MC, particularly excelling in scenarios with complex motions and occlusions. Code is available at https://github.com/ostadabbas/Look-Around-and-Pay-Attention-LAPA-
Constructing Low-Redundancy Codes via Distributed Graph Coloring
Authors: Yuting Li, Ryan Gabrys, Farzad Farnoud
2025-12-03
We present a general framework for constructing error-correcting codes using distributed graph coloring under the LOCAL model. Building on the correspondence between independent sets in the confusion graph and valid codes, we show that the color of a single vertex - consistent with a global proper coloring - can be computed in polynomial time using a modified version of Linial's coloring algorithm, leading to efficient encoding and decoding. Our results include: i) uniquely decodable code constructions for a constant number of errors of any type with redundancy twice the Gilbert-Varshamov bound; ii) list-decodable codes via a proposed extension of graph coloring, namely, hypergraph labeling; iii) an incremental synchronization scheme with reduced average-case communication when the edit distance is not precisely known; and iv) the first asymptotically optimal codes (up to a factor of 8) for correcting bursts of unbounded-length edits. Compared to syndrome compression, our approach is more flexible and generalizable, does not rely on a good base code, and achieves improved redundancy across a range of parameters.
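The correspondence between independent sets in the confusion graph and valid codes can be checked on a small case. The sketch below is an illustration under stated assumptions: it handles only substitution errors on short binary words and uses a plain centralized greedy coloring, not the modified Linial algorithm or the LOCAL-model construction described in the abstract. All function names are hypothetical.

```python
from itertools import product

def hamming(a, b):
    """Number of positions where two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def confusion_graph(n, t=1):
    """Binary words of length n; two words are adjacent if they can be confused
    under up to t substitutions, i.e. their Hamming distance is at most 2t."""
    words = ["".join(bits) for bits in product("01", repeat=n)]
    return {w: [v for v in words if v != w and hamming(w, v) <= 2 * t] for w in words}

def greedy_coloring(adj):
    """Assign each vertex the smallest color unused by its already-colored neighbors."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def color_classes(color):
    """Group vertices by color; each group is an independent set, hence a code."""
    classes = {}
    for v, c in color.items():
        classes.setdefault(c, []).append(v)
    return classes

adj = confusion_graph(5, t=1)
classes = color_classes(greedy_coloring(adj))
```

Every color class here is a single-substitution-correcting code, because two words of the same color are never adjacent and so differ in at least three positions. The color itself plays the role of the redundancy: knowing a word's color narrows it to one code.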
Asymmetric excitation of left- vs right-handed photons in accelerating waveguides
Authors: Adrian del Rio
2025-12-03
The electromagnetic duality symmetry of Maxwell's equations in vacuum implies that the circular polarization of classical electromagnetic waves is conserved. In quantum field theory, the normal-ordered operator represents the difference between the number operators of right- and left-handed photons. Previous studies have shown that its expectation value is not conserved for observers propagating in a gravitational field. Here, we show that this Noether symmetry can also be realized in empty waveguides with duality-preserving boundary conditions, and we quantize the source-free Maxwell theory inside a long, cylindrical waveguide undergoing both linear and rotational acceleration from rest. In the vacuum associated to inertial observers, we find that the expectation value fails to be conserved for observers co-moving with the waveguide. In particular, frame-dragging effects induce a spectral asymmetry between the right- and left-handed field modes at late times. As a consequence, accelerated detectors co-moving with the rotating waveguide can detect photon-pair excitations from the quantum vacuum, exhibiting an imbalance between opposite helicity modes. This is a relativistic quantum effect, which shows that the classical conservation law associated with duality symmetry is broken in the quantum theory even in flat spacetime, provided we work with non-inertial systems. Our analysis provides a concrete proof of concept for testing this effect in analogue gravity platforms.
PosterCopilot Toward Layout Reasoning and Controllable Editing for Professional Graphic Design
Authors: Jiazhe Wei, Ken Li, Tianyu Lao, Haofan Wang, Liang Wang, Caifeng Shan, Chenyang Si
2025-12-03
Graphic design forms the cornerstone of modern visual communication, serving as a vital medium for promoting cultural and commercial events. Recent advances have explored automating this process using Large Multimodal Models (LMMs), yet existing methods often produce geometrically inaccurate layouts and lack the iterative, layer-specific editing required in professional workflows. To address these limitations, we present PosterCopilot, a framework that advances layout reasoning and controllable editing for professional graphic design. Specifically, we introduce a progressive three-stage training strategy that equips LMMs with geometric understanding and aesthetic reasoning for layout design, consisting of Perturbed Supervised Fine-Tuning, Reinforcement Learning for Visual-Reality Alignment, and Reinforcement Learning from Aesthetic Feedback. Furthermore, we develop a complete workflow that couples the trained LMM-based design model with generative models, enabling layer-controllable, iterative editing for precise element refinement while maintaining global visual consistency. Extensive experiments demonstrate that PosterCopilot achieves geometrically accurate and aesthetically superior layouts, offering unprecedented controllability for professional iterative design.
PSA Pyramid Sparse Attention for Efficient Video Understanding and Generation
Authors: Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang
2025-12-03
Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To mitigate this gap, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or achieving comparable performance over existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: http://ziplab.co/PSA
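The idea of giving important key-value blocks a low pooling level and less important ones a higher level can be sketched in a few lines. Everything below is an assumption for illustration: the rank-based level assignment, the mean pooling with factor 2^level, and the function names are hypothetical stand-ins, not the PSA kernel or its actual allocation rule.

```python
def mean_pool(vectors, factor):
    """Average consecutive groups of `factor` vectors into one vector each."""
    pooled = []
    for start in range(0, len(vectors), factor):
        group = vectors[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled

def assign_levels(scores, counts):
    """Rank blocks by importance score; the top counts[0] blocks get level 0,
    the next counts[1] get level 1, and so on. Leftover blocks get the
    coarsest level, len(counts)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    levels = [len(counts)] * len(scores)
    pos = 0
    for level, k in enumerate(counts):
        for i in order[pos:pos + k]:
            levels[i] = level
        pos += k
    return levels

def pyramid_blocks(blocks, scores, counts):
    """Pool each block by 2**level, so level 0 keeps the block at full resolution."""
    levels = assign_levels(scores, counts)
    return [mean_pool(b, 2 ** lvl) for b, lvl in zip(blocks, levels)], levels

# Four blocks of four 1-D keys each, with hypothetical per-block importance scores.
blocks = [[[float(4 * b + i)] for i in range(4)] for b in range(4)]
pooled, levels = pyramid_blocks(blocks, scores=[0.1, 0.9, 0.5, 0.2], counts=[1, 1])
```

In this toy case the query attends to 8 pooled entries instead of 16 keys, yet every block still contributes something, which is the contrast with a binary keep-or-drop mask.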
Ultra-lightweight Neural Video Representation Compression
Authors: Ho Man Kwan, Tianhao Peng, Ge Gao, Fan Zhang, Mike Nilsson, Andrew Gower, David Bull
2025-12-03
Recent works have demonstrated the viability of utilizing over-fitted implicit neural representations (INRs) as alternatives to autoencoder-based models for neural video compression. Among these INR-based video codecs, Neural Video Representation Compression (NVRC) was the first to adopt a fully end-to-end compression framework that compresses INRs, achieving state-of-the-art performance. Moreover, some recently proposed lightweight INRs have shown comparable performance to their baseline codecs with computational complexity lower than 10kMACs/pixel. In this work, we extend NVRC toward lightweight representations, and propose NVRC-Lite, which incorporates two key changes. Firstly, we integrate multi-scale feature grids into our lightweight neural representation, and the use of higher resolution grids significantly improves the performance of INRs at low complexity. Secondly, we address the issue that existing INRs typically leverage autoregressive models for entropy coding: these are effective but impractical due to their slow coding speed. In this work, we propose an octree-based context model for entropy coding high-dimensional feature grids, which accelerates the entropy coding module of the model. Our experimental results demonstrate that NVRC-Lite outperforms C3, one of the best lightweight INR-based video codecs, with up to 21.03% and 23.06% BD-rate savings when measured in PSNR and MS-SSIM, respectively, while achieving 8.4x encoding and 2.5x decoding speedup. The implementation of NVRC-Lite will be made available.
Teaching Old Tokenizers New Words Efficient Tokenizer Adaptation for Pre-trained Models
Authors: Taido Purason, Pavel Chizhov, Ivan P. Yamshchikov, Mark Fishel
2025-12-03
Tokenizer adaptation plays an important role in transferring pre-trained language models to new domains or languages. In this work, we address two complementary aspects of this process: vocabulary extension and pruning. The common approach to extension trains a new tokenizer on domain-specific text and appends the tokens that do not overlap with the existing vocabulary, which often results in many tokens that are unreachable or never used. We propose continued BPE training, which adapts a pre-trained tokenizer by continuing the BPE merge learning process on new data. Experiments across multiple languages and model families show that this approach improves tokenization efficiency and leads to better utilization of added vocabulary. We also introduce leaf-based vocabulary pruning, which removes redundant tokens while preserving model quality. Together, these methods provide practical tools for controlled vocabulary modification, which we release as an open-source package.
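Continuing the BPE merge learning process on new data can be sketched as follows: replay the existing ordered merge list on the new corpus, then keep learning merges from the resulting symbol pairs. This is a simplified character-level illustration with hypothetical function names and a whitespace pre-tokenizer; it is not the authors' released package.

```python
from collections import Counter

def apply_merges(word, merges):
    """Tokenize a word by replaying an ordered list of (left, right) merges."""
    symbols = list(word)
    for a, b in merges:
        i, merged = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

def continue_bpe(merges, corpus, num_new_merges):
    """Extend an existing merge list by learning further merges on new text."""
    merges = list(merges)
    word_freq = Counter(corpus.split())
    for _ in range(num_new_merges):
        pairs = Counter()
        for word, freq in word_freq.items():
            symbols = apply_merges(word, merges)
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the lexicographically smallest pair.
        best = max(sorted(pairs), key=pairs.__getitem__)
        merges.append(best)
    return merges

pretrained = [("l", "o"), ("lo", "w")]
extended = continue_bpe(pretrained, "lower lower lowest slower", num_new_merges=2)
```

Because each new merge is built from symbols the tokenizer already produces on the new data, every added token is reachable by construction, which is the property that appending tokens from a separately trained tokenizer does not guarantee.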
An Information Theory of Finite Abstractions and their Fundamental Scalability Limits
Authors: Giannis Delimpaltadakis, Gabriel Gleizer
2025-12-03
Finite abstractions are discrete approximations of dynamical systems, such that the set of abstraction trajectories contains, in a formal sense, all system trajectories. There is a consensus that abstractions suffer from the curse of dimensionality: for the same "accuracy" (how closely the abstraction represents the system), the abstraction size scales poorly with system dimensions. And, yet, after decades of research on abstractions, there are no formal results concerning their accuracy-size tradeoff. In this work, we derive a statistical, quantitative theory of abstractions' accuracy-size tradeoff and uncover fundamental limits on their scalability, through rate-distortion theory -- the branch of information theory studying lossy compression. Abstractions are viewed as encoder-decoder pairs, encoding trajectories of dynamical systems in a higher-dimensional ambient space. Rate represents abstraction size, while distortion describes abstraction accuracy, defined as the spatial average deviation between abstract trajectories and system ones. We obtain a fundamental lower bound on the minimum abstraction distortion, given the system dynamics and a threshold on abstraction size. The bound depends on the complexity of the dynamics, through generalized entropy. We demonstrate the bound's tightness on certain dynamical systems. Finally, we showcase how the developed theory can be employed to construct optimal abstractions, in terms of the size-accuracy tradeoff, through an example on a chaotic system.
Technical Report on Text Dataset Distillation
Authors: Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Victor Zacarias, Edson Bollis, Lucas Pellicer, Rosimeire Pereira Costa, Anna Helena Reali Costa, Artur Jordao
2025-12-03
In the vision domain, dataset distillation arises as a technique to condense a large dataset into a smaller synthetic one that exhibits a similar result in the training process. While image data presents an extensive literature of distillation methods, text dataset distillation has fewer works in comparison. Text dataset distillation initially grew as an adaptation of efforts from the vision universe; as the particularities of the modality became clear obstacles, it rose into a separate branch of research. Several milestones mark the development of this area, such as the introduction of methods that use transformer models, the generation of discrete synthetic text, and the scaling to decoder-only models with over 1B parameters. Despite major advances in modern approaches, the field remains in a maturing phase, with room for improvement on benchmarking standardization, approaches to overcome the discrete nature of text, handling complex tasks, and providing explicit examples of real-world applications. In this report, we review past and recent advances in dataset distillation for text, highlighting different distillation strategies, key contributions, and general challenges.
OD-MoE On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
Authors: Liujianfu Wang, Yuyang Du, Yuchen Pan, Soung Chang Liew, Jiacheng Liu, Kexin Chen
2025-12-03
Mixture-of-Experts (MoE), while offering significant advantages as a Large Language Model (LLM) architecture, faces substantial challenges when deployed on low-cost edge devices with tight memory constraints. Expert offloading mitigates this issue by storing expert parameters in CPU memory and caching a subset of popular experts in GPU memory. Although this approach improves GPU memory utilization by caching only the likely-used experts, the GPU memory reserved for expert caching is underutilized compared with dense LLMs. This paper presents OD-MoE, a distributed MoE inference framework that obviates the need for expert caches via fully on-demand expert loading. OD-MoE is built upon two key mechanisms: 1) parallelizing expert loading and expert computation across distributed edge nodes, and 2) an ultra-accurate emulative predictor that forecasts expert activations multiple layers ahead while expert computation is ongoing. With these innovations, OD-MoE dynamically loads each target expert to one of the distributed nodes just-in-time before its activation and promptly evicts it afterward, freeing GPU memory for subsequent experts. We comprehensively benchmark OD-MoE against state-of-the-art MoE offloading systems on a ten-node testbed. Experimental results show that: 1) OD-MoE achieves 99.94% expert activation prediction accuracy, substantially surpassing all existing methods; and 2) OD-MoE delivers approximately 75% of the decoding speed of a fully GPU-cached MoE deployment while using only 1/3 of the GPU memory. More importantly, by eliminating the need for expert caches, OD-MoE enables MoE inference on edge nodes with less-than-1GB GPU memory, paving the way for practical MoE deployment of low-cost IoT devices at the edge in the LLM era.
UniMo Unifying 2D Video and 3D Human Motion with an Autoregressive Framework
Authors: Youxin Pang, Yong Zhang, Ruizhi Shao, Xiang Deng, Feng Gao, Xu Xiaoming, Xiaoming Wei, Yebin Liu
2025-12-03
We propose UniMo, an innovative autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework, enabling simultaneous generation and understanding of these two modalities for the first time. Current methods predominantly focus on generating one modality given another as the condition or integrating either of them with other modalities such as text and audio. Unifying 2D videos and 3D motions for simultaneous optimization and generation remains largely unexplored, presenting significant challenges due to their substantial structural and distributional differences. Inspired by the LLM's ability to unify different modalities, our method models videos and 3D motions as a unified token sequence, utilizing separate embedding layers to mitigate distribution gaps. Additionally, we devise a sequence modeling strategy that integrates two distinct tasks within a single framework, proving the effectiveness of unified modeling. Moreover, to efficiently align with visual tokens and preserve 3D spatial information, we design a novel 3D motion tokenizer with a temporal expansion strategy, using a single VQ-VAE to produce quantized motion tokens. It features multiple expert decoders that handle body shapes, translation, global orientation, and body poses for reliable 3D motion reconstruction. Extensive experiments demonstrate that our method simultaneously generates corresponding videos and motions while performing accurate motion capture. This work taps into the capacity of LLMs to fuse diverse data types, paving the way for integrating human-centric information into existing models and potentially enabling multimodal, controllable joint modeling of humans, objects, and scenes.
Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Authors: Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li, Siran Yang, Yunlong Xu, Jiaheng Liu, Yongchi Zhao, Jiamang Wang, Yuchi Xu, Wenbo Su, Bo Zheng
2025-12-03
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although Cross-layer KV Cache sharing (e.g., YOCO, CLA) offers a path to mitigate the KV Cache bottleneck, it typically underperforms within-layer methods like GQA. To understand the root cause, we investigate the information flow of keys and values of the top layers. Our preliminary analysis reveals a clear distribution: values are predominantly derived from the bottom layer, while keys draw more information from both bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach, where top-layer KV caches are directly derived from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed method reduces cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder, establishing it as a memory-efficient, high-performance architectural alternative.
Training and Evaluation of Guideline-Based Medical Reasoning in LLMs
Authors: Michael Staniek, Artem Sokolov, Stefan Riezler
2025-12-03
Machine learning for early prediction in medicine has recently shown breakthrough performance; however, the focus on improving prediction accuracy has led to a neglect of faithful explanations that are required to gain the trust of medical practitioners. The goal of this paper is to teach LLMs to follow medical consensus guidelines step-by-step in their reasoning and prediction process. Since consensus guidelines are ubiquitous in medicine, instantiations of verbalized medical inference rules to electronic health records provide data for fine-tuning LLMs to learn consensus rules and possible exceptions thereof for many medical areas. Consensus rules also enable an automatic evaluation of the model's inference process regarding its derivation correctness (evaluating correct and faithful deduction of a conclusion from given premises) and value correctness (comparing predicted values against real-world measurements). We exemplify our work using the complex Sepsis-3 consensus definition. Our experiments show that small fine-tuned models outperform one-shot learning of considerably larger LLMs that are prompted with the explicit definition and models that are trained on medical texts including consensus definitions. Since fine-tuning on verbalized rule instantiations of a specific medical area yields nearly perfect derivation correctness for rules (and exceptions) on unseen patient data in that area, the bottleneck for early prediction is not out-of-distribution generalization, but the orthogonal problem of generalization into the future by forecasting sparsely and irregularly sampled clinical variables. We show that the latter results can be improved by integrating the output representations of a time series forecasting model with the LLM in a multimodal setup.
Transmit Weights, Not Features Orthogonal-Basis Aided Wireless Point-Cloud Transmission
Authors: Junlin Chang, Yubo Han, Hnag Yue, John S Thompson, Rongke Liu
2025-12-03
The widespread adoption of depth sensors has substantially lowered the barrier to point-cloud acquisition. This letter proposes a semantic wireless transmission framework for three-dimensional (3D) point clouds built on Deep Joint Source-Channel Coding (DeepJSCC). Instead of sending raw features, the transmitter predicts combination weights over a receiver-side semantic orthogonal feature pool, enabling compact representations and robust reconstruction. A folding-based decoder deforms a 2D grid into 3D, enforcing manifold continuity while preserving geometric fidelity. Trained with Chamfer Distance (CD) and an orthogonality regularizer, the system is evaluated on ModelNet40 across varying Signal-to-Noise Ratios (SNRs) and bandwidths. Results show performance on par with SEmantic Point cloud Transmission (SEPT) at high bandwidth and clear gains in bandwidth-constrained regimes, with consistent improvements in both Peak Signal-to-Noise Ratio (PSNR) and CD. Ablation experiments confirm the benefits of orthogonalization and the folding prior.
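The two training terms named in this abstract, Chamfer Distance and an orthogonality regularizer, can be sketched in a few lines of pure Python. This is only an illustration: the squared-distance, mean-reduced form of CD and the Gram-matrix penalty are common choices assumed here, not details taken from the letter, and the function names are hypothetical.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two point sets (lists of 3-tuples).

    Assumed variant: squared Euclidean distance, averaged over each set.
    """
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_sided(src, dst):
        # Mean squared distance from each source point to its nearest target.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(a, b) + one_sided(b, a)


def orthogonality_penalty(basis):
    """Squared Frobenius norm of (B B^T - I) for a list of basis vectors."""
    penalty = 0.0
    for i, u in enumerate(basis):
        for j, v in enumerate(basis):
            dot = sum(x * y for x, y in zip(u, v))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty


# Toy example: the reconstruction misplaces one of two points by one unit.
source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
recon = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(source, recon)
```

The penalty is zero exactly when the basis vectors are orthonormal, which is the property a receiver-side orthogonal feature pool would be regularized toward under this assumption.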
CCN Decentralized Cross-Chain Channel Networks Supporting Secure and Privacy-Preserving Multi-Hop Interactions
Authors: Minghui Xu, Yihao Guo, Yanqiang Zhang, Zhiguang Shan, Guangyong Shang, Zhen Ma, Bin Xiao, Xiuzhen Cheng
2025-12-03
Cross-chain technology enables interoperability among otherwise isolated blockchains, supporting interactions across heterogeneous networks. Similar to how multi-hop communication became fundamental in the evolution of the Internet, the demand for multi-hop cross-chain interactions is gaining increasing attention. However, this growing demand introduces new security and privacy challenges. On the security side, multi-hop interactions depend on the availability of multiple participating nodes. If any node becomes temporarily offline during execution, the protocol may fail to complete correctly, leading to settlement failure or fund loss. On the privacy side, the need for on-chain transparency to validate intermediate states may unintentionally leak linkable information, compromising the unlinkability of user interactions. In this paper, we propose the Cross-Chain Channel Network (CCN), a decentralized network designed to support secure and privacy-preserving multi-hop cross-chain transactions. Through experimental evaluation, we identify two critical types of offline failures, referred to as active and passive offline cases, which have not been adequately addressed by existing solutions. To mitigate these issues, we introduce R-HTLC, a core protocol within CCN. R-HTLC incorporates an hourglass mechanism and a multi-path refund strategy to ensure settlement correctness even when some nodes go offline during execution. Importantly, CCN not only addresses correctness under offline conditions but also maintains unlinkability in such adversarial settings. To this end, CCN leverages zero-knowledge proofs and off-chain coordination, ensuring that interaction relationships remain indistinguishable even when certain nodes are temporarily offline.
Different types of syntactic agreement recruit the same units within large language models
Authors: Daria Kryvosheieva, Andrea de Varda, Evelina Fedorenko, Greta Tuckute
2025-12-03
Large language models (LLMs) can reliably distinguish grammatical from ungrammatical sentences, but how grammatical knowledge is represented within the models remains an open question. We investigate whether different syntactic phenomena recruit shared or distinct components in LLMs. Using a functional localization approach inspired by cognitive neuroscience, we identify the LLM units most responsive to 67 English syntactic phenomena in seven open-weight models. These units are consistently recruited across sentences containing the phenomena and causally support the models' syntactic performance. Critically, different types of syntactic agreement (e.g., subject-verb, anaphor, determiner-noun) recruit overlapping sets of units, suggesting that agreement constitutes a meaningful functional category for LLMs. This pattern holds in English, Russian, and Chinese; and further, in a cross-lingual analysis of 57 diverse languages, structurally more similar languages share more units for subject-verb agreement. Taken together, these findings reveal that syntactic agreement, a critical marker of syntactic dependencies, constitutes a meaningful category within LLMs' representational spaces.
ConvRot Rotation-Based Plug-and-Play 4-bit Quantization for Diffusion Transformers
Authors: Feice Huang, Zuliang Han, Xing Zhou, Yihuang Chen, Lifei Zhu, Haoqian Wang
2025-12-03
Diffusion transformers have demonstrated strong capabilities in generating high-quality images. However, as model size increases, the growing memory footprint and inference latency pose significant challenges for practical deployment. Recent studies in large language models (LLMs) show that rotation-based techniques can smooth outliers and enable 4-bit quantization, but these approaches often incur substantial overhead and struggle with row-wise outliers in diffusion transformers. To address these challenges, we propose ConvRot, a group-wise rotation-based quantization method that leverages regular Hadamard transform (RHT) to suppress both row-wise and column-wise outliers while reducing complexity from quadratic to linear. Building on this, we design ConvLinear4bit, a plug-and-play module that integrates rotation, quantization, GEMM, and dequantization, enabling W4A4 inference without retraining and preserving visual quality. Experiments on FLUX.1-dev demonstrate a 2.26x speedup and 4.05x memory reduction while maintaining image fidelity. To our knowledge, this is the first application of rotation-based quantization for plug-and-play W4A4 inference in diffusion transformers.
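Why a Hadamard rotation helps 4-bit quantization can be seen on a single group of values. The sketch below is a generic illustration under stated assumptions, not ConvRot's kernel: it uses a plain normalized Walsh-Hadamard transform on one group of eight values and a symmetric scale-only INT4 quantizer, and all function names are hypothetical.

```python
import math


def hadamard_rotate(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = math.sqrt(n)
    return [x / scale for x in v]


def quantize_int4(v):
    """Symmetric 4-bit quantization: integer codes in [-7, 7] plus one scale.

    Assumes the group is not all zeros.
    """
    scale = max(abs(x) for x in v) / 7.0
    codes = [max(-7, min(7, round(x / scale))) for x in v]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def squared_error(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


# One group with a single large outlier among small values.
group = [16.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]

# Direct quantization: the outlier sets the scale, so small values round to zero.
direct = dequantize(*quantize_int4(group))
err_direct = squared_error(group, direct)

# Rotate, quantize, dequantize, rotate back.
# The normalized transform is symmetric and orthogonal, so it is its own inverse.
rotated = hadamard_rotate(group)
restored = hadamard_rotate(dequantize(*quantize_int4(rotated)))
err_rotated = squared_error(group, restored)
```

The rotation spreads the outlier's energy across the whole group, so the quantization scale shrinks and the small entries are no longer rounded away. In a real W4A4 pipeline the inverse rotation would typically be folded into the weights rather than applied explicitly as it is here.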
Towards Privacy-Preserving Range Queries with Secure Learned Spatial Index over Encrypted Data
Authors: Zuan Wang, Juntao Lu, Jiazhuang Wu, Youliang Tian, Wei Song, Qiuxian Li, Duo Zhang
2025-12-03
With the growing reliance on cloud services for large-scale data management, preserving the security and privacy of outsourced datasets has become increasingly critical. While encrypting data and queries can prevent direct content exposure, recent research reveals that adversaries can still infer sensitive information via access pattern and search path analysis. However, existing solutions that offer strong access pattern privacy often incur substantial performance overhead. In this paper, we propose a novel privacy-preserving range query scheme over encrypted datasets, offering strong security guarantees while maintaining high efficiency. To achieve this, we develop secure learned spatial index (SLS-INDEX), a secure learned index that integrates the Paillier cryptosystem with a hierarchical prediction architecture and noise-injected buckets, enabling data-aware query pruning in the encrypted domain. To further obfuscate query execution paths, SLS-INDEX-based Range Queries (SLRQ) employs a permutation-based secure bucket prediction protocol. Additionally, we introduce a secure point extraction protocol that generates candidate results to reduce the overhead of secure computation. We provide formal security analysis under realistic leakage functions and implement a prototype to evaluate its practical performance. Extensive experiments on both real-world and synthetic datasets demonstrate that SLRQ significantly outperforms existing solutions in query efficiency while ensuring dataset, query, result, and access pattern privacy.
SELF A Robust Singular Value and Eigenvalue Approach for LLM Fingerprinting
Authors: Hanxiu Zhang, Yue Zheng
2025-12-03
The protection of Intellectual Property (IP) in Large Language Models (LLMs) represents a critical challenge in contemporary AI research. While fingerprinting techniques have emerged as a fundamental mechanism for detecting unauthorized model usage, existing methods -- whether behavior-based or structural -- suffer from vulnerabilities such as false claim attacks or susceptibility to weight manipulations. To overcome these limitations, we propose SELF, a novel intrinsic weight-based fingerprinting scheme that eliminates dependency on input and inherently resists false claims. SELF achieves robust IP protection through two key innovations: 1) unique, scalable and transformation-invariant fingerprint extraction via singular value and eigenvalue decomposition of LLM attention weights, and 2) effective neural network-based fingerprint similarity comparison based on few-shot learning and data augmentation. Experimental results demonstrate SELF maintains high IP infringement detection accuracy while showing strong robustness against various downstream modifications, including quantization, pruning, and fine-tuning attacks. Our code is available at https://github.com/HanxiuZhang/SELF_v2.
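One reason singular values are a natural candidate for a transformation-invariant fingerprint is that they do not change when a weight matrix is permuted or multiplied by an orthogonal matrix. The toy sketch below checks this on a 2x2 matrix using a closed-form formula; it illustrates the invariance property only and is not SELF's extraction pipeline, whose details the abstract does not give. All names and values are hypothetical.

```python
import math


def singular_values_2x2(m):
    """Singular values of a 2x2 matrix, from the eigenvalues of M^T M."""
    (a, b), (c, d) = m
    trace = a * a + b * b + c * c + d * d  # trace of M^T M
    det = a * d - b * c                    # det(M^T M) equals det(M) squared
    disc = math.sqrt(max(trace * trace - 4.0 * det * det, 0.0))
    largest = math.sqrt((trace + disc) / 2.0)
    smallest = math.sqrt(max((trace - disc) / 2.0, 0.0))
    return (largest, smallest)


def matmul_2x2(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


weights = [[3.0, 1.0], [0.5, 2.0]]
permutation = [[0.0, 1.0], [1.0, 0.0]]          # swaps the two rows
theta = 0.7
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]  # an orthogonal matrix

original = singular_values_2x2(weights)
permuted = singular_values_2x2(matmul_2x2(permutation, weights))
rotated = singular_values_2x2(matmul_2x2(weights, rotation))
```

Both transformed matrices have entirely different entries from `weights`, yet their singular values agree with the original up to floating-point error.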
KVNAND Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing
Authors: Lishuo Deng, Shaojie Xu, Jinwu Chen, Changwei Yan, Jiajie Wang, Zhe Jiang, Weiwei Shan
2025-12-03
Deploying large language models (LLMs) on edge devices enables personalized agents with strong privacy and low cost. However, with tens to hundreds of billions of parameters, single-batch autoregressive inference suffers from extremely low arithmetic intensity, creating severe weight-loading and bandwidth pressures on resource-constrained platforms. Recent in-flash computing (IFC) solutions alleviate this bottleneck by co-locating weight-related linear computations in the decode phase with flash, yet still rely on DRAM for the key-value (KV) cache. As context length grows, the KV cache can exceed model weights in size, imposing prohibitive DRAM cost and capacity requirements. Attempts to offload the KV cache to flash suffer from severe performance penalties.
We propose KVNAND, the first DRAM-free, IFC-based architecture that stores both model weights and KV cache entirely in compute-enabled 3D NAND flash. KVNAND addresses the fundamental performance challenges of flash under intensive KV cache access by leveraging IFC for all memory-bound operations to reduce data transfer overhead, introducing head-group parallelism to boost throughput, and employing page-level KV cache mapping to align token access patterns with flash organization. In addition, we propose a design space exploration framework that evaluates discrete and compact KVNAND variants to balance weight and KV placement, automatically identifying the optimal design trade-off. These techniques mitigate latency, energy, and reliability concerns, turning flash into a practical medium for long-context KV storage. Evaluations on MHA 7B and GQA 70B LLMs show that KVNAND achieves 1.98x/1.94x/2.05x geomean speedup at 128/1K/10K-token contexts compared to DRAM-equipped IFC designs and addresses out-of-memory failures at 100K context length.
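The claim that the KV cache can outgrow the model weights follows from simple arithmetic. The sketch below uses an assumed 7B MHA-style configuration (32 layers, 32 KV heads, head dimension 128, FP16 storage); these numbers are illustrative assumptions and are not taken from the paper.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor 2 accounts for storing both a key and a value per head per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed configuration (hypothetical, for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128
WEIGHT_BYTES = 7_000_000_000 * 2  # 7B parameters stored in FP16

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
# Largest context whose KV cache still fits within the weight footprint.
break_even_tokens = WEIGHT_BYTES // per_token

for context in (128, 1_000, 10_000, 100_000):
    ratio = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM) / WEIGHT_BYTES
    print(f"{context:>7} tokens: KV cache is {ratio:.2f}x the weight size")
```

Under these assumptions each token costs 512 KiB of cache, so the cache overtakes the weights at a few tens of thousands of tokens. A GQA model with fewer KV heads would shift that crossover to a proportionally longer context.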
Observation-driven correction of numerical weather prediction for marine winds
Authors: Matteo Peduto, Qidong Yang, Jonathan Giezendanner, Devis Tuia, Sherrie Wang
2025-12-03
Accurate marine wind forecasts are essential for safe navigation, ship routing, and energy operations, yet they remain challenging because observations over the ocean are sparse, heterogeneous, and temporally variable. We reformulate wind forecasting as observation-informed correction of a global numerical weather prediction (NWP) model. Rather than forecasting winds directly, we learn local correction patterns by assimilating the latest in-situ observations to adjust the Global Forecast System (GFS) output. We propose a transformer-based deep learning architecture that (i) handles irregular and time-varying observation sets through masking and set-based attention mechanisms, (ii) conditions predictions on recent observation-forecast pairs via cross-attention, and (iii) employs cyclical time embeddings and coordinate-aware location representations to enable single-pass inference at arbitrary spatial coordinates. We evaluate our model over the Atlantic Ocean using observations from the International Comprehensive Ocean-Atmosphere Data Set (ICOADS) as reference. The model reduces GFS 10-meter wind RMSE at all lead times up to 48 hours, achieving 45% improvement at 1-hour lead time and 13% improvement at 48-hour lead time. Spatial analyses reveal the most persistent improvements along coastlines and shipping routes, where observations are most abundant. The tokenized architecture naturally accommodates heterogeneous observing platforms (ships, buoys, tide gauges, and coastal stations) and produces both site-specific predictions and basin-scale gridded products in a single forward pass. These results demonstrate a practical, low-latency post-processing approach that complements NWP by learning to correct systematic forecast errors.
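Cyclical time embeddings, mentioned in item (iii) of this abstract, are commonly built by mapping a periodic quantity onto the unit circle. The sketch below shows that standard sine/cosine construction; whether the paper uses exactly this form is an assumption, and the function names are hypothetical.

```python
import math


def cyclical_embedding(value, period):
    """Map a periodic quantity (e.g. hour of day) to a point on the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return (math.sin(angle), math.cos(angle))


def embedding_distance(u, v):
    """Euclidean distance between two embeddings."""
    return math.dist(u, v)


late = cyclical_embedding(23, 24)     # 23:00
midnight = cyclical_embedding(0, 24)  # 00:00
noon = cyclical_embedding(12, 24)     # 12:00
```

With a raw hour feature, 23:00 and 00:00 would be 23 units apart. On the circle they are neighbors, while noon sits at the opposite point from midnight.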
A Preliminary Study on the Promises and Challenges of Native Top-k Sparse Attention
Authors: Di Xiu, Hongyin Tang, Bolin Rong, Lizhi Yan, Jingang Wang, Yifan Lu, Xunliang Cai
2025-12-03
Large Language Models (LLMs) are increasingly prevalent in the field of long-context modeling; however, their inference computational costs have become a critical bottleneck hindering the advancement of tasks such as agents and multimodal applications. This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of the Top-k Attention mechanism during both the decoding and training phases. First, we validate the effectiveness of exact Top-k Decoding through extensive experimentation. Experiments demonstrate that retaining only the pivotal Keys with the highest similarity to the Query as the context window during the decoding stage achieves performance comparable to, or even surpassing, full attention on downstream tasks such as HELMET and LongBench v2. Second, we further explore the native Top-k Attention training strategy. Experiments confirm that ensuring the consistency between training and inference regarding Top-k Attention operations facilitates the further unlocking of Top-k Decoding's potential, thereby significantly enhancing model performance. Furthermore, considering the high computational complexity of exact Top-k Attention, we investigate the impact of approximate Top-k algorithm precision on downstream tasks. Our research confirms a positive correlation between downstream task performance and approximation fidelity, and we provide statistical evaluations of the Lightning Indexer's precision within the DeepSeek-V3.2-Exp model. Finally, this report provides a theoretical interpretation from the perspective of Entropy. Experimental observations indicate that models subjected to Top-k Attention SFT exhibit a distinct phenomenon of entropy reduction in downstream tasks, which validates the hypothesis that low-entropy states are better adapted to Top-k Decoding.
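Exact Top-k decoding, as described in this abstract, keeps only the keys most similar to the current query and runs softmax attention over that subset. The sketch below is a minimal single-query version in pure Python; scaled dot-product similarity and the function signature are assumptions made for illustration.

```python
import math


def attention(query, keys, values, top_k=None):
    """Single-query attention; if top_k is set, only the best-scoring keys are kept."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    kept = list(range(len(keys)))
    if top_k is not None:
        # Exact Top-k: rank every key by its score and keep the top_k best.
        kept = sorted(kept, key=lambda i: scores[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the kept positions.
    peak = max(scores[i] for i in kept)
    weights = [math.exp(scores[i] - peak) for i in kept]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, kept)) / total
            for d in range(dim)]


query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0], [3.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [2.0, 2.0]]

full = attention(query, keys, values)
sparse = attention(query, keys, values, top_k=2)
top1 = attention(query, keys, values, top_k=1)
```

Ranking all keys still costs time linear in the context length, which is why the abstract also considers approximate Top-k selection. This sketch covers only the exact variant.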
AsymPuzl An Asymmetric Puzzle for multi-agent cooperation
Authors: Xavier Cadet, Edward Koh, Peter Chin
2025-12-03
Large Language Model (LLM) agents are increasingly studied in multi-turn, multi-agent scenarios, yet most existing setups emphasize open-ended role-play rather than controlled evaluation. We introduce AsymPuzl, a minimal but expressive two-agent puzzle environment designed to isolate communication under information asymmetry. Each agent observes complementary but incomplete views of a symbolic puzzle and must exchange messages to solve it cooperatively. Using a diverse set of current-generation and open-source LLMs, we show that (i) strong models such as GPT-5 and Claude-4.0 reliably converge across puzzle sizes on the solution by sharing complete information in two turns, (ii) weaker models often ignore partner messages or over-correct their hypotheses, and (iii) feedback design is non-trivial: simple self-feedback improves success rates, while detailed joint feedback can hurt performance. These findings show that even in simple cooperative tasks, LLM communication strategies diverge and depend on the granularity of feedback signals. AsymPuzl thus provides a testbed for probing the limits of multi-turn cooperation and opens avenues for studying coordination mechanisms.
Think Before You Drive World Model-Inspired Multimodal Grounding for Autonomous Vehicles
Authors: Haicheng Liao, Huanming Shen, Bonan Wang, Yongkang Li, Yihong Tang, Chengyue Wang, Dingyi Zhuang, Kehua Chen, Hai Yang, Chengzhong Xu, Zhenning Li
2025-12-03
Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.
A fast stochastic interacting particle-field method for 3D parabolic-parabolic Chemotaxis systems numerical algorithms and error analysis
Authors: Jingyuan Hu, Zhongjian Wang, Jack Xin, Zhiwen Zhang
2025-12-03
In this paper, we develop a novel numerical framework, the stochastic interacting particle-field method with particle-in-cell (SIPF-PIC), for the efficient simulation of the three-dimensional (3D) parabolic-parabolic Keller-Segel (KS) systems. The SIPF-PIC method integrates Lagrangian particle dynamics with spectral field solvers, by leveraging localized particle-grid interpolations and fast Fourier transform (FFT) techniques. For particles and Fourier modes per spatial dimension, the SIPF-PIC method achieves a computational complexity of per time step, a significant improvement over the original SIPF method (proposed in \cite{SIPF1}), which has a complexity of , while preserving numerical accuracy. Moreover, we establish a rigorous error analysis, proving that the discretization errors are of order . Finally, we present numerical experiments to validate the theoretical convergence rates and demonstrate the computational efficiency of our new method. Notably, these experiments also show that the method captures complex blowup dynamics beyond single-point collapse, including ring-type singularities, where mass dynamically concentrates into evolving annular structures.
Quantum Encrypted Control of Networked Systems
Authors: Zihao Ren, Daniel Quevedo, Salah Sukkarieh, Guodong Shi
2025-12-03
Encrypted control has been extensively studied to ensure the confidentiality of system states and control inputs for networked control systems. This paper presents a computationally efficient encrypted control framework for networked systems enabled by quantum communication. A quantum channel between sensors and actuators is used to generate identical secret keys, whose security is further enhanced through quantum key distribution. These keys enable lightweight encryption and decryption while preserving confidentiality and control accuracy. We develop a novel encryption-decryption architecture for state-feedback control of linear systems based on quantum keys, and characterize the impact of quantum state errors on closed-loop stability. In particular, we establish the existence of a critical threshold on intrinsic quantum noise below which stability is guaranteed. In contrast to classical encrypted control schemes, which may collapse under a single key-bit error, the proposed quantum encrypted control exhibits strong robustness to key imperfections. We further adopt
quantization techniques to address the scenarios with limited communication bits in practical situations, and implement privacy protection for quantum keys based on a stochastic quantizer. These results demonstrate that integrating quantum technologies into control systems in a nontrivial and principled manner, even at their current level of maturity, can yield substantial performance gains in reducing computational complexity and improving resilience to key errors while ensuring security against multiple eavesdropping sources.
TokenScale Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity
Authors: Ruiqi Lai, Hongrui Liu, Chengzhi Lu, Zonghao Liu, Siyu Cao, Siyang Shao, Yixin Zhang, Luo Mai, Dmitrii Ustiugov
2025-12-03
The architectural shift to prefill/decode (PD) disaggregation in LLM serving improves resource utilization but struggles with the bursty nature of modern workloads. Existing autoscaling policies, often retrofitted from monolithic systems like those in AIBrix and DistServe, rely on lagging indicators such as GPU utilization or coarse-grained request counts. This results in slow reactions to load spikes, leading to significant Time-to-First-Token (TTFT) and Time-Per-Output-Token (TPOT) SLO violations and costly over-provisioning. We introduce TokenScale, an autoscaling framework that resolves this performance mismatch through two innovations. First, we propose Token Velocity, a novel metric that unifies the prefill, network, and decode stages by quantifying their rate of work. As a leading indicator of system backpressure, it enables proactive scaling. Second, Convertible Decoders allow decoder GPUs to dynamically execute prefill tasks during traffic spikes, creating a rapid-response buffer that absorbs bursts and eliminates the initialization latency of new prefillers. Our evaluation on a GPU cluster with production traces shows TokenScale improves SLO attainment from 50-88% to 80-96% and reduces costs by 4-14% over state-of-the-art systems, including DistServe, BlitzScale, and AIBrix. By uniting a predictive metric with a flexible system design, TokenScale significantly boosts the performance and efficiency of disaggregated LLM serving infrastructure.
UniQL Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
Authors: Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
2025-12-03
Deploying large language models (LLMs) on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the current device workload, adding to the uncertainty of model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our framework performs weight-sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device configurable pruning rates up to 35%. Our experiments show that quantized and pruned models achieve a memory reduction of 4x-5.7x and a token-throughput improvement of 2.7x-3.4x, maintaining accuracy within 5% of the original models at 15% pruning across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models are available at: https://github.com/enyac-group/UniQL.
Rethinking Security in Semantic Communication Latent Manipulation as a New Threat
Authors: Zhiyuan Xi, Kun Zhu
2025-12-03
Deep learning-based semantic communication (SemCom) has emerged as a promising paradigm for next-generation wireless networks, offering superior transmission efficiency by extracting and conveying task-relevant semantic latent representations rather than raw data. However, the openness of the wireless medium and the intrinsic vulnerability of semantic latent representations expose such systems to previously unrecognized security risks. In this paper, we uncover a fundamental latent-space vulnerability that enables a Man-in-the-Middle (MitM) attacker to covertly manipulate the transmitted semantics while preserving the statistical properties of the transmitted latent representations. We first present a Diffusion-based Re-encoding Attack (DiR), wherein the attacker employs a diffusion model to synthesize an attacker-designed semantic variant, and re-encodes it into a valid latent representation compatible with the SemCom decoder. Beyond this model-dependent pathway, we further propose a model-agnostic and training-free Test-Time Adaptation Latent Manipulation attack (TTA-LM), in which the attacker perturbs and steers the intercepted latent representation toward an attacker-specified semantic target by leveraging the gradient of a target loss function. In contrast to diffusion-based manipulation, TTA-LM does not rely on any generative model and does not impose modality-specific or task-specific assumptions, thereby enabling efficient and broadly applicable latent-space tampering across diverse SemCom architectures. Extensive experiments on representative semantic communication architectures demonstrate that both attacks can significantly alter the decoded semantics while preserving natural latent-space distributions, making the attacks covert and difficult to detect.