2025-11-21
Table of Contents
- Taming the Long-Tail Efficient Reasoning RL Training with Adaptive Drafter
- Nemotron Elastic Towards Efficient Many-in-One Reasoning LLMs
- Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation
- Erase to Retain Low Rank Adaptation Guided Selective Unlearning in Medical Segmentation Networks
- An Exterior-Embedding Neural Operator Framework for Preserving Conservation Laws
- Optimizing Federated Learning in the Era of LLMs Message Quantization and Streaming
- ARK Answer-Centric Retriever Tuning via KG-augmented Curriculum Learning
- ChangeDINO DINOv3-Driven Building Change Detection in Optical Remote Sensing Imagery
- Distributed Agent Reasoning Across Independent Systems With Strict Data Locality
- "To Survive, I Must Defect" Jailbreaking LLMs via the Game-Theory Scenarios
- SeSE A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs
- Q-MLLM Vector Quantization for Robust Multimodal Large Language Model Security
- Fast LLM Post-training via Decoupled and Best-of-N Speculation
- Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
- On 10x Better Scalability KV Stores Scale Up KV Cache
- Decoupling Complexity from Scale in Latent Diffusion Model
- Pathlet Variational Auto-Encoder for Robust Trajectory Generation
- AMS-KV Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers
- Global Resolution Optimal Multi-Draft Speculative Sampling via Convex Minimization
- Mini Amusement Parks (MAPs) A Testbed for Modelling Business Decisions
- Joint Semantic-Channel Coding and Modulation for Token Communications
- MoDES Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
- FlashMesh Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
- Game Master LLM Task-Based Role-Playing for Natural Slang Learning
- Learning to Expand Images for Efficient Visual Autoregressive Modeling
- FairEnergy Contribution-Based Fairness meets Energy Efficiency in Federated Learning
- Breaking Expert Knowledge Limits Self-Pruning for Large Language Models
- IPTQ-ViT Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers
- Fidelity-Preserving Quantum Encoding for Quantum Neural Networks
- Quant-Trim in Practice Improved Cross-Platform Low-Bit Deployment on Edge NPUs
- SNAP Low-Latency Test-Time Adaptation with Sparse Updates
- Graph Query Networks for Object Detection with Automotive Radar
- Context Cascade Compression Exploring the Upper Limits of Text Compression
- Efficient Transformer-Integrated Deep Neural Architectures for Robust EEG Decoding of Complex Visual Imagery
- Unveiling Intrinsic Dimension of Texts from Academic Abstract to Creative Story
- Trustworthy GenAI over 6G Integrated Applications and Security Frameworks
- Masked Auto-Regressive Variational Acceleration Fast Inference Makes Practical Reinforcement Learning
- Fail fast techniques to probe rare events in quantum error correction
- A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
- Knowledge-Informed Automatic Feature Extraction via Collaborative Large Language Model Agents
- WiCo-PG Wireless Channel Foundation Model for Pathloss Map Generation via Synesthesia of Machines
- Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference
- Mathematical Analysis of Hallucination Dynamics in Large Language Models Uncertainty Quantification, Advanced Decoding, and Principled Mitigation
Taming the Long-Tail Efficient Reasoning RL Training with Adaptive Drafter
Authors: Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han
2025-11-20
The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable speculative decoding (SD) strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves the model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
Nemotron Elastic Towards Efficient Many-in-One Reasoning LLMs
Authors: Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
2025-11-20
Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par with or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows a many-in-one reasoning model whose deployment memory stays constant regardless of the number of models in the family.
Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation
Authors: Md. Samiul Alim, Sharjil Khan, Amrijit Biswas, Fuad Rahman, Shafin Rahman, Nabeel Mohammed
2025-11-20
Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.
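The one-shot global pruning step described above can be illustrated with a minimal sketch. It assumes a first-order importance score |w * g|, where g stands in for the gradient of a combined task-plus-distillation loss; the abstract does not give the exact score, so the formula, the function name `global_one_shot_prune`, and the toy numbers are all hypothetical.

```python
def global_one_shot_prune(weights, grads, sparsity):
    """Prune the globally least important weights in a single pass.

    weights, grads: dict mapping layer name -> list of floats (same lengths).
    grads is assumed to come from a teacher-informed loss (task + KD).
    sparsity: fraction of all weights to remove, ranked across every layer.
    """
    scored = []
    for name, ws in weights.items():
        for i, w in enumerate(ws):
            # Assumed importance: magnitude of weight times gradient.
            scored.append((abs(w * grads[name][i]), name, i))
    scored.sort(key=lambda item: item[0])

    n_prune = int(sparsity * len(scored))
    masks = {name: [1] * len(ws) for name, ws in weights.items()}
    for _, name, i in scored[:n_prune]:
        masks[name][i] = 0  # masked weights stay at zero during retraining

    pruned = {
        name: [w * m for w, m in zip(ws, masks[name])]
        for name, ws in weights.items()
    }
    return pruned, masks


# Toy example with two "layers"; values are illustrative only.
weights = {"conv1": [0.5, -0.1, 0.3], "fc": [0.05, -0.8, 0.2]}
grads = {"conv1": [0.2, 0.9, 0.01], "fc": [0.1, 0.3, 0.4]}
pruned, masks = global_one_shot_prune(weights, grads, sparsity=0.5)
```

Because the ranking is global rather than per layer, layers with uniformly low scores can end up sparser than others; the returned masks are what a sparsity-aware retraining loop would reapply after each update so pruned connections are not reactivated.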
Erase to Retain Low Rank Adaptation Guided Selective Unlearning in Medical Segmentation Networks
Authors: Nirjhor Datta, Md. Golam Rabiul Alam
2025-11-20
The ability to selectively remove knowledge from medical segmentation networks is increasingly important for privacy compliance, ethical deployment, and continual dataset revision. We introduce Erase to Retain, a controllable unlearning framework for medical image segmentation that achieves targeted forgetting without full retraining. Our method uses a teacher-student distillation paradigm with Low-Rank Adaptation (LoRA) constrained subspace updates, enabling the student network to erase lesion-specific or class-specific representations in low-rank decoder spaces while preserving global anatomical understanding. During the strong unlearning phase, LoRA modules are adversarially optimized to contradict the teacher's confident predictions on a designated forget subset, enforcing semantic removal. This is followed by a gentle restoration phase that recovers generalization on retained data through head-only supervised refinement.
For ISIC segmentation, the student reduces forget-set IoU from 0.875 to 0.509 while maintaining competitive performance on the retain and validation splits (0.647 to 0.677 IoU). On the cross-domain CHASE dataset, Erase to Retain consistently lowers forget-set IoU while preserving utility on retain and validation sets. For ISIC classification, our method decreases accuracy on the forget subset from 87.0 percent to 64.1 percent while improving retain accuracy from 83.9 percent to 90.6 percent.
These results demonstrate that LoRA-based subspace unlearning provides a practical pathway toward responsible, controllable, and reversible unlearning in medical image analysis, enabling models to forget sensitive samples or structures while preserving performance where it matters most.
An Exterior-Embedding Neural Operator Framework for Preserving Conservation Laws
Authors: Huanshuo Dong, Hong Wang, Hao Wu, Zhiwei Zhuang, Xuanze Yang, Ruiqi Shu, Yuan Gao, Xiaomeng Huang
2025-11-20
Neural operators have demonstrated considerable effectiveness in accelerating the solution of time-dependent partial differential equations (PDEs) by directly learning governing physical laws from data. However, for PDEs governed by conservation laws (e.g., conservation of mass, energy, or matter), existing neural operators fail to satisfy conservation properties, which leads to degraded model performance and limited generalizability. Moreover, we observe that distinct PDE problems generally require different optimal neural network architectures. This finding underscores the inherent limitations of specialized models in generalizing across diverse problem domains.
To address these limitations, we propose the Exterior-Embedded Conservation Framework (ECF), a universal conserving framework that can be integrated with various data-driven neural operators to enforce conservation laws strictly in predictions. The framework consists of two key components: a conservation quantity encoder that extracts conserved quantities from input data, and a conservation quantity decoder that adjusts the neural operator's predictions using these quantities to ensure strict conservation compliance in the final output. Since our architecture enforces conservation laws, we theoretically prove that it enhances model performance. To validate the performance of our method, we conduct experiments on multiple conservation-law-constrained PDE scenarios, including adiabatic systems, shallow water equations, and the Allen-Cahn problem. These experiments demonstrate that our method effectively improves model accuracy while strictly enforcing conservation laws in the predictions.
Optimizing Federated Learning in the Era of LLMs Message Quantization and Streaming
Authors: Ziyue Xu, Zhihong Zhang, Holger R. Roth, Chester Chen, Yan Cheng, Andrew Feng
2025-11-20
Federated Learning (FL) offers a promising solution for training machine learning models across distributed data sources while preserving data privacy. However, FL faces critical challenges related to communication overhead and local resource constraints, especially in the era of Large Language Models (LLMs) with billions of parameters. The sheer size of these models exacerbates both memory and communication constraints, making efficient transmission and processing essential for practical deployment. NVIDIA FLARE, an open-source SDK for federated learning, addresses these challenges by introducing advanced communication capabilities. Building upon existing solutions for large object streaming, we enhance FL workflows for LLMs through two key techniques: message quantization and container/file streaming. Quantization reduces message size, while streaming enables efficient memory management, improving scalability and integration with existing workflows. These advancements significantly enhance the robustness and efficiency of FL with LLMs, ensuring better performance in real-world federated learning scenarios.
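To make the message-quantization idea concrete, here is a minimal sketch of symmetric 8-bit quantization applied to a flat list of weights before transmission. It is a generic illustration, not NVIDIA FLARE's API; the function names `quantize_int8` and `dequantize_int8` and the per-message scale scheme are assumptions.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats on the receiving side."""
    return [q * scale for q in quantized]


# Toy "model update"; the sender ships (quantized, scale) instead of floats.
values = [0.12, -1.5, 0.75, 0.0]
quantized, scale = quantize_int8(values)
restored = dequantize_int8(quantized, scale)
max_error = max(abs(v - r) for v, r in zip(values, restored))
```

Each value then needs one byte instead of four (for float32), at the cost of a rounding error bounded by half the scale; training itself can still run at full precision on each client.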
ARK Answer-Centric Retriever Tuning via KG-augmented Curriculum Learning
Authors: Jiawei Zhou, Hang Ding, Haiyun Jiang
2025-11-20
Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for knowledge-intensive tasks, yet its effectiveness in long-context scenarios is often bottlenecked by the retriever's inability to distinguish sparse yet crucial evidence. Standard retrievers, optimized for query-document similarity, frequently fail to align with the downstream goal of generating a precise answer. To bridge this gap, we propose a novel fine-tuning framework that optimizes the retriever for Answer Alignment. Specifically, we first identify high-quality positive chunks by evaluating their sufficiency to generate the correct answer. We then employ a curriculum-based contrastive learning scheme to fine-tune the retriever. This curriculum leverages LLM-constructed Knowledge Graphs (KGs) to generate augmented queries, which in turn mine progressively challenging hard negatives. This process trains the retriever to distinguish the answer-sufficient positive chunks from these nuanced distractors, enhancing its generalization. Extensive experiments on 10 datasets from the Ultradomain and LongBench benchmarks demonstrate that our fine-tuned retriever achieves state-of-the-art performance, improving 14.5% over the base model without substantial architectural modifications and maintaining strong efficiency for long-context RAG. Our work presents a robust and effective methodology for building truly answer-centric retrievers.
ChangeDINO DINOv3-Driven Building Change Detection in Optical Remote Sensing Imagery
Authors: Ching-Heng Cheng, Chih-Chung Hsu
2025-11-20
Remote sensing change detection (RSCD) aims to identify surface changes from co-registered bi-temporal images. However, many deep learning-based RSCD methods rely solely on change-map annotations and underuse the semantic information in non-changing regions, which limits robustness under illumination variation, off-nadir views, and scarce labels. This article introduces ChangeDINO, an end-to-end multiscale Siamese framework for optical building change detection. The model fuses a lightweight backbone stream with features transferred from a frozen DINOv3, yielding semantic- and context-rich pyramids even on small datasets. A spatial-spectral differential transformer decoder then exploits multi-scale absolute differences as change priors to highlight true building changes and suppress irrelevant responses. Finally, a learnable morphology module refines the upsampled logits to recover clean boundaries. Experiments on four public benchmarks show that ChangeDINO consistently outperforms recent state-of-the-art methods in IoU and F1, and ablation studies confirm the effectiveness of each component. The source code is available at https://github.com/chingheng0808/ChangeDINO.
Distributed Agent Reasoning Across Independent Systems With Strict Data Locality
Authors: Daniel Vaughan, Kateřina Vaughan
2025-11-20
This paper presents a proof-of-concept demonstration of agent-to-agent communication across distributed systems, using only natural-language messages and without shared identifiers, structured schemas, or centralised data exchange. The prototype explores how multiple organisations (represented here as a Clinic, Insurer, and Specialist Network) can cooperate securely via pseudonymised case tokens, local data lookups, and controlled operational boundaries.
The system uses Orpius as the underlying platform for multi-agent orchestration, tool execution, and privacy-preserving communication. All agents communicate through OperationRelay calls, exchanging concise natural-language summaries. Each agent operates on its own data (such as synthetic clinic records, insurance enrolment tables, and clinical guidance extracts), and none receives or reconstructs patient identity. The Clinic computes an HMAC-based pseudonymous token, the Insurer evaluates coverage rules and consults the Specialist agent, and the Specialist returns an appropriateness recommendation.
The goal of this prototype is intentionally limited: to demonstrate feasibility, not to provide a clinically validated, production-ready system. No clinician review was conducted, and no evaluation beyond basic functional runs was performed. The work highlights architectural patterns, privacy considerations, and communication flows that enable distributed reasoning among specialised agents while keeping data local to each organisation. We conclude by outlining opportunities for more rigorous evaluation and future research in decentralised multi-agent systems.
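As an illustration of the HMAC-based pseudonymous token the Clinic agent computes, the sketch below derives a stable, non-reversible case token from a local patient identifier. The choice of SHA-256, the key value, the identifier format, and the function name `pseudonymous_token` are assumptions; the abstract does not specify them.

```python
import hashlib
import hmac


def pseudonymous_token(secret_key: bytes, patient_id: str) -> str:
    """Return a hex token that is stable for a patient but reveals no identity.

    Only parties holding secret_key can recompute the token from patient_id;
    other organisations see an opaque string they can use as a case reference.
    """
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key and identifiers, for demonstration only.
clinic_key = b"demo-key-kept-inside-the-clinic"
token_a = pseudonymous_token(clinic_key, "patient-001")
token_a_again = pseudonymous_token(clinic_key, "patient-001")
token_b = pseudonymous_token(clinic_key, "patient-002")
```

The same patient always maps to the same token, so downstream agents can correlate messages about one case, while the key never leaves the organisation that holds the underlying records.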
"To Survive, I Must Defect" Jailbreaking LLMs via the Game-Theory Scenarios
Authors: Zhen Sun, Zongmin Zhang, Deqi Liang, Han Sun, Yule Liu, Yun Shen, Xiangshan Gao, Yilong Yang, Shuai Liu, Yutao Yue, Xinlei He
2025-11-20
As LLMs become more common, non-expert users can pose risks, prompting extensive research into jailbreak attacks. However, most existing black-box jailbreak attacks rely on hand-crafted heuristics or narrow search spaces, which limit scalability. In contrast to prior attacks, we propose Game-Theory Attack (GTA), a scalable black-box jailbreak framework. Concretely, we formalize the attacker's interaction against safety-aligned LLMs as a finite-horizon, early-stoppable sequential stochastic game, and reparameterize the LLM's randomized outputs via quantal response. Building on this, we introduce a behavioral conjecture, the "template-over-safety flip": by reshaping the LLM's effective objective through game-theoretic scenarios, the original safety preference may shift to maximizing scenario payoffs within the template, which weakens safety constraints in specific contexts. We validate this mechanism with classical games such as the disclosure variant of the Prisoner's Dilemma, and we further introduce an Attacker Agent that adaptively escalates pressure to increase the ASR. Experiments across multiple protocols and datasets show that GTA achieves over 95% ASR on LLMs such as Deepseek-R1, while maintaining efficiency. Ablations over components, decoding, multilingual settings, and the Agent's core model confirm effectiveness and generalization. Moreover, scenario scaling studies further establish scalability. GTA also attains high ASR on other game-theoretic scenarios, and one-shot LLM-generated variants that keep the model mechanism fixed while varying background achieve comparable ASR. Paired with a Harmful-Words Detection Agent that performs word-level insertions, GTA maintains high ASR while lowering detection under prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications and reports a longitudinal safety monitoring of popular HuggingFace LLMs.
SeSE A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs
Authors: Xingtao Zhao, Hao Peng, Dingli Su, Xianghua Zeng, Chunyang Liu, Jinzhi Liao, Philip S. Yu
2025-11-20
Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinating falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation, where existing methods often rely on heuristic sample-and-count techniques, we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recently proposed KLE.
Q-MLLM Vector Quantization for Robust Multimodal Large Language Model Security
Authors: Wei Zhao, Zhe Li, Yige Li, Jun Sun
2025-11-20
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in cross-modal understanding, but remain vulnerable to adversarial attacks through visual inputs despite robust textual safety mechanisms. These vulnerabilities arise from two core weaknesses: the continuous nature of visual representations, which allows for gradient-based attacks, and the inadequate transfer of text-based safety mechanisms to visual content. We introduce Q-MLLM, a novel architecture that integrates two-level vector quantization to create a discrete bottleneck against adversarial attacks while preserving multimodal reasoning capabilities. By discretizing visual representations at both pixel-patch and semantic levels, Q-MLLM blocks attack pathways and bridges the cross-modal safety alignment gap. Our two-stage training methodology ensures robust learning while maintaining model utility. Experiments demonstrate that Q-MLLM achieves a significantly better defense success rate against both jailbreak attacks and toxic image attacks than existing approaches. Notably, Q-MLLM achieves a perfect defense success rate (100%) against jailbreak attacks except in one arguable case, while maintaining competitive performance on multiple utility benchmarks with minimal inference overhead. This work establishes vector quantization as an effective defense mechanism for secure multimodal AI systems without requiring expensive safety-specific fine-tuning or detection overhead. Code is available at https://github.com/Amadeuszhao/QM
.
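The discrete-bottleneck intuition behind vector quantization can be shown with a minimal nearest-codebook lookup: a continuous feature is replaced by the index of its closest code vector, so small perturbations that stay within the same region produce an identical discrete output. The codebook values and the function name `quantize_vector` are hypothetical and unrelated to the released Q-MLLM code.

```python
def quantize_vector(vec, codebook):
    """Return (index, code) of the codebook entry nearest to vec (squared L2)."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(vec, code))

    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]


# Toy 2-D codebook; a real model would learn many high-dimensional codes.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
clean_feature = [0.9, 0.1]
perturbed_feature = [0.95, 0.15]  # small adversarial-style shift

clean_index, clean_code = quantize_vector(clean_feature, codebook)
perturbed_index, _ = quantize_vector(perturbed_feature, codebook)
```

Since the lookup is piecewise constant, its gradient with respect to the input is zero almost everywhere, which is the property that makes gradient-based attacks on the visual pathway harder.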
Fast LLM Post-training via Decoupled and Best-of-N Speculation
Authors: Rongxin Cheng, Kai Zhou, Xingda Wei, Siyuan Liu, Mingcong Han, Mingjing Ai, Yeju Zhou, Baoquan Zhong, Wencong Xiao, Xin Liu, Rong Chen, Haibo Chen
2025-11-20
Rollout dominates the training time in large language model (LLM) post-training, where the trained model is used to generate tokens given a batch of prompts. SpecActor achieves fast rollout with speculative decoding that deploys a fast path (e.g., a smaller model) to accelerate the unparallelizable generation, while correctness is guaranteed by fast parallel verification of the outputs with the original model. SpecActor addresses two foundational challenges in speculative rollout with (1) a dynamic decoupled speculation execution method that maximizes GPU computational efficiency to realize speedup for large-batch execution, a configuration common in training but unfriendly to speculative execution, and (2) a dynamic Best-of-N speculation method that selects and combines different drafting methods according to the rollout progress. It substantially improves the speculation accuracy even when the best drafting method is unknown a priori, without requiring extra computation resources. SpecActor is 1.3-1.7x faster than common post-training baselines, and 1.3-1.5x faster than naively adopting speculative decoding for rollout.
Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
Authors: Jian Ma, Qirong Peng, Xujie Zhu, Peixing Xie, Chen Chen, Haonan Lu
2025-11-20
Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments. The open-source code and checkpoints for PPCL can be found at the following link: https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning.
On 10x Better Scalability KV Stores Scale Up KV Cache
Authors: Weiping Yu, Ye Jiarui, He Mengke, Junfeng Liu, Siqiang Luo
2025-11-20
Large language models (LLMs) rely on Key-Value (KV) cache to reduce time-to-first-token (TTFT) latency, but existing disk-based KV cache systems using file-per-object layouts suffer from severe scalability bottlenecks due to file system metadata overhead, I/O inefficiency, and poor spatial locality. This paper presents SGLANG-LSM, a database-inspired system that leverages Log-Structured Merge-tree (LSM-tree) architectures for scalable KV cache management. SGLANG-LSM implements a layered system design with three coordinated components: (1) a prefix-preserving storage engine that maintains token sequence locality while efficiently storing large KV cache tensors through key-value separation, (2) an adaptive controller that dynamically optimizes LSM-tree configurations based on shifting workload characteristics, and (3) runtime services including batch operations and automatic resource management for production deployment. Evaluation on large-scale dynamic workloads demonstrates that SGLANG-LSM significantly improves cache hits by up to 143% and reduces TTFT by up to 24% compared to state-of-the-art systems, representing the first systematic application of database storage architectures to large-scale KV cache management.
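One way to picture a prefix-preserving key layout is sketched below: token IDs are encoded as fixed-width big-endian bytes, so the lexicographic byte order used by a sorted store keeps sequences that share a prefix next to each other. This is an illustrative assumption, not SGLANG-LSM's actual key format; the 4-byte width and the function name `encode_key` are hypothetical.

```python
def encode_key(token_ids):
    """Encode a token sequence as bytes whose sort order follows the tokens.

    Fixed-width big-endian integers make byte-wise comparison agree with
    element-wise comparison of the token IDs, and a shorter sequence sorts
    immediately before any longer sequence that extends it.
    """
    return b"".join(t.to_bytes(4, "big") for t in token_ids)


# Toy prompts as token-ID tuples, inserted in arbitrary order.
sequences = [(7,), (1, 2, 9), (1, 2, 3, 4), (1, 2, 3)]
keys = {encode_key(seq): seq for seq in sequences}
ordered = [keys[k] for k in sorted(keys)]
```

With keys laid out this way, a range scan starting at the encoded prefix visits every cached entry that extends it, which is the kind of spatial locality a file-per-object layout does not provide.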
Decoupling Complexity from Scale in Latent Diffusion Model
Authors: Tianxiong Zhong, Xingye Tian, Xuebo Wang, Boyuan Jiang, Xin Tao, Pengfei Wan
2025-11-20
Existing latent diffusion models typically couple scale with content complexity, using more latent tokens to represent higher-resolution images or higher-frame rate videos. However, the latent capacity required to represent visual data primarily depends on content complexity, with scale only as an upper bound. Motivated by this observation, we propose DCS-LDM, a novel paradigm for visual generation that decouples information complexity from scale. DCS-LDM constructs a hierarchical, scale-independent latent space that models sample complexity through multi-level tokens and supports decoding to arbitrary resolutions and frame rates within a fixed latent representation. This latent space enables DCS-LDM to achieve a flexible computation-quality tradeoff. Furthermore, by decomposing structural and detailed information across levels, DCS-LDM supports a progressive coarse-to-fine generation paradigm. Experimental results show that DCS-LDM delivers performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities.
Pathlet Variational Auto-Encoder for Robust Trajectory Generation
Authors: Yuanbo Tang, Yan Tang, Zixuan Zhang, Zihui Zhao, Yang Li
2025-11-20
Trajectory generation has recently drawn growing interest in privacy-preserving urban mobility studies and location-based service applications. Although many studies have used deep learning or generative AI methods to model trajectories and have achieved promising results, the robustness and interpretability of such models are largely unexplored. This limits the application of trajectory generation algorithms on noisy real-world data and their trustworthiness in downstream tasks. To address this issue, we exploit the regular structure in urban trajectories and propose a deep generative model based on the pathlet representation, which encodes trajectories with binary vectors associated with a learned dictionary of trajectory segments. Specifically, we introduce a probabilistic graphical model to describe the trajectory generation process, which includes a Variational Autoencoder (VAE) component and a linear decoder component. During training, the model can simultaneously learn the latent embedding of pathlet representations and the pathlet dictionary that captures mobility patterns in the trajectory dataset. The conditional version of our model can also be used to generate customized trajectories based on temporal and spatial constraints.
Our model can effectively learn the data distribution even from noisy data, achieving relative improvements over strong baselines on two real-world trajectory datasets. Moreover, the generated trajectories can be conveniently utilized for multiple downstream tasks, including trajectory prediction and data denoising. Lastly, the framework design offers a significant efficiency advantage, saving both time and GPU memory compared to previous approaches.
AMS-KV Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers
Authors: Boxun Xu, Yu Wang, Zihu Wang, Peng Li
2025-11-20
Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale based VAR transformers remains largely unexplored. A major bottleneck is the excessive KV memory growth with the increasing number of scales, which severely limits scalability. Our systematic investigation reveals that: (1) attending to tokens from local scales significantly contributes to generation quality; (2) allocating a small amount of memory for the coarsest scales, termed condensed scales, stabilizes multi-scale image generation; (3) strong KV similarity across finer scales is predominantly observed in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on these observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency by identifying cache-demanding layers through inter-scale similarity analysis. Compared to the vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.
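As a rough illustration of the retention idea described above (keep the coarsest "condensed" scales plus the most recent "local" scales), the sketch below selects which earlier scales stay cached. The function name and the `n_condensed` / `n_local` parameters are hypothetical and not taken from the paper, and the layer-wise similarity analysis is not modeled.

```python
def scales_to_keep(current_scale, n_condensed=2, n_local=2):
    """Return sorted indices of earlier scales whose KV entries stay cached.

    Assumption: scales are indexed 0 (coarsest) .. current_scale - 1.
    """
    previous = list(range(current_scale))
    condensed = set(previous[:n_condensed])               # coarsest scales
    local = set(previous[-n_local:]) if n_local > 0 else set()  # most recent scales
    return sorted(condensed | local)
```

With the default settings, generating scale 8 would retain scales 0, 1, 6, and 7 and drop the intermediate ones.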
Global Resolution Optimal Multi-Draft Speculative Sampling via Convex Minimization
Authors: Rahul Krishna Thomas, Arka Pal
2025-11-19
Speculative sampling reduces the latency of autoregressive decoding for target model LLMs without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample this token. To improve acceptance and decoding efficiency, recent work has explored the multi-draft extension, where at each step $n$ draft tokens are generated, and the verification criterion is a distribution conditioned on these. When this criterion maximizes the probability of accepting some draft token, it is called the optimal transport (OT). However, finding the OT is difficult, as it is the solution of a linear program (OTLP) in over $V^n$ variables, with $V$ being the vocabulary size. Two recent theoretical works have reframed the OTLP in terms of importance sampling or subset selection. In this work, we prove that these formulations are equivalent to an exponentially large relaxed OTLP, so it remains infeasible to solve. Then, we reverse engineer subset selection to formulate the OTLP as a max-flow problem. With a novel application of polymatroid theory, we reduce the exponentially large OTLP to a convex optimization problem in at most $V$ variables. This allows us to devise an algorithm for optimal $n$-draft speculative sampling when the $n$ tokens are chosen i.i.d. from a single draft model, which can be tuned to arbitrary accuracy. Finally, we measure acceptance rates and algorithm runtimes for various $n$ and top-$k$ draft sampling settings. Our findings give the first multi-draft algorithm with 90% acceptance and under 100 ms of overhead per generated token with negligible deviation from the target model distribution.
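For orientation, the single-draft accept-or-resample rule that the multi-draft setting generalizes can be sketched as follows. This is the standard single-draft verification step, not the paper's multi-draft algorithm; `p` (target distribution) and `q` (draft distribution) are toy probability lists, and all function names are illustrative.

```python
import random

def acceptance_probability(p, q):
    """Chance that a token drafted from q is accepted: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def residual_distribution(p, q):
    """Distribution used after a rejection: normalize(max(p - q, 0))."""
    r = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(r)
    if total == 0.0:          # p == q, so a rejection never happens; fall back to p
        return list(p)
    return [ri / total for ri in r]

def speculative_step(p, q, rng=random):
    """Draw one token that is distributed according to p, using a draft from q."""
    draft = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft          # draft accepted
    # Draft rejected: resample from the residual distribution.
    return rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]
```

The acceptance probability equals one minus the total variation distance between `p` and `q`, which is why closer draft models are accepted more often.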
Mini Amusement Parks (MAPs) A Testbed for Modelling Business Decisions
Authors: Stéphane Aroca-Ouellette, Ian Berlot-Attwell, Panagiotis Lymperopoulos, Abhiramon Rajasekharan, Tongqi Zhu, Herin Kang, Kaheer Suleman, Sam Pasupalak
2025-11-19
Despite rapid progress in artificial intelligence, current systems struggle with the interconnected challenges that define real-world decision making. Practical domains, such as business management, require optimizing an open-ended and multi-faceted objective, actively learning environment dynamics from experience, planning over long horizons in stochastic settings, and reasoning over spatial information. Yet existing human-AI benchmarks isolate subsets of these capabilities, limiting our ability to assess holistic decision-making competence. We introduce Mini Amusement Parks (MAPs), an amusement-park simulator designed to evaluate an agent's ability to model its environment, anticipate long-term consequences under uncertainty, and strategically operate a complex business. We provide human baselines and a comprehensive evaluation of state-of-the-art LLM agents, finding that humans outperform these systems by 6.5x on easy mode and 9.8x on medium mode. Our analysis reveals persistent weaknesses in long-horizon optimization, sample-efficient learning, spatial reasoning, and world modelling. By unifying these challenges within a single environment, MAPs offers a new foundation for benchmarking agents capable of adaptable decision making. Code: https://github.com/Skyfall-Research/MAPs
Joint Semantic-Channel Coding and Modulation for Token Communications
Authors: Jingkai Ying, Zhijin Qin, Yulong Feng, Liejun Wang, Xiaoming Tao
2025-11-19
In recent years, the Transformer architecture has achieved outstanding performance across a wide range of tasks and modalities. The token is the unified input and output representation in Transformer-based models and has become a fundamental information unit. In this work, we consider the problem of token communication, studying how to transmit tokens efficiently and reliably. Point cloud, a prevailing three-dimensional format which exhibits a more complex spatial structure compared to image or video, is chosen to be the information source. We utilize the set abstraction method to obtain point tokens. Subsequently, to get a more informative and transmission-friendly representation based on tokens, we propose a joint semantic-channel coding and modulation (JSCCM) scheme for the token encoder, mapping point tokens to standard digital constellation points (modulated tokens). Specifically, the JSCCM consists of two parallel Point Transformer-based encoders and a differential modulator which combines the Gumbel-softmax and soft quantization methods. Besides, the rate allocator and channel adapter are developed, facilitating adaptive generation of high-quality modulated tokens conditioned on both semantic information and channel conditions. Extensive simulations demonstrate that the proposed method outperforms both joint semantic-channel coding and traditional separate coding, achieving over 1 dB gain in reconstruction and more than 6x compression ratio in modulated symbols.
MoDES Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Authors: Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong, Jinyang Guo, Xianglong Liu, Jun Zhang
2025-11-19
Mixture-of-Experts (MoE) multimodal large language models (MLLMs) excel at vision-language tasks, but they suffer from high computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16x and the decoding time by 1.26x.
FlashMesh Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
Authors: Tingrui Shen, Yiheng Zhang, Chen Tang, Chuan Ping, Zixing Zhao, Le Wan, Yuwang Wang, Ronggang Wang, Shengfeng He
2025-11-19
Autoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in interactive and large-scale applications. We present FlashMesh, a fast and high-fidelity mesh generation framework that rethinks autoregressive decoding through a predict-correct-verify paradigm. The key insight is that mesh tokens exhibit strong structural and geometric correlations that enable confident multi-token speculation. FlashMesh leverages this by introducing a speculative decoding scheme tailored to the commonly used hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels. Extensive experiments show that FlashMesh achieves up to a 2x speedup over standard autoregressive models while also improving generation fidelity. Our results demonstrate that structural priors in mesh data can be systematically harnessed to accelerate and enhance autoregressive generation.
Game Master LLM Task-Based Role-Playing for Natural Slang Learning
Authors: Amir Tahmasbi, Milad Esrafilian, Judson Wright, Sooyeon Jeong, Aniket Bera
2025-11-19
Natural and idiomatic expressions are essential for fluent, everyday communication, yet many second-language learners struggle to acquire and spontaneously use casual slang despite strong formal proficiency. To address this gap, we designed and evaluated an LLM-powered, task-based role-playing game in which a GPT-4o-based Game Master guides learners through an immersive, three-phase spoken narrative. After selecting five unfamiliar slang phrases to practice, participants engage in open-ended dialogue with non-player characters; the Game Master naturally incorporates the target phrases in rich semantic contexts (implicit input enhancement) while a dedicated Practice Box provides real-time explicit tracking and encouragement. Post-session, learners receive multi-level formative feedback analyzing the entire interaction.
We evaluated the system in a between-subjects study with 14 international graduate students, randomly assigned to either the RPG condition or a control condition consisting of a traditional AI-led virtual classroom. Results from an immediate post-test show that the RPG group achieved greater gains in both comprehension of the target phrases and their accurate, contextual use in sentences. Quantitative analysis of in-activity word-usage frequency, combined with qualitative survey responses, further indicates that the game-based approach provided more practice opportunities and higher perceived engagement, resulting in a more natural learning experience. These findings highlight the potential of narrative-driven LLM interactions in vocabulary acquisition.
Learning to Expand Images for Efficient Visual Autoregressive Modeling
Authors: Ruiqing Yang, Kaixin Zhang, Zheng Zhang, Shan You, Tao Huang
2025-11-19
Autoregressive models have recently shown great promise in visual generation by leveraging discrete token sequences akin to language modeling. However, existing approaches often suffer from inefficiency, either due to token-by-token decoding or the complexity of multi-scale representations. In this work, we introduce Expanding Autoregressive Representation (EAR), a novel generation paradigm that emulates the human visual system's center-outward perception pattern. EAR unfolds image tokens in a spiral order from the center and progressively expands outward, preserving spatial continuity and enabling efficient parallel decoding. To further enhance flexibility and speed, we propose a length-adaptive decoding strategy that dynamically adjusts the number of tokens predicted at each step. This biologically inspired design not only reduces computational cost but also improves generation quality by aligning the generation order with perceptual relevance. Extensive experiments on ImageNet demonstrate that EAR achieves state-of-the-art trade-offs between fidelity and efficiency on single-scale autoregressive models, setting a new direction for scalable and cognitively aligned autoregressive image generation.
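To make the "spiral order from the center" concrete, here is one possible center-outward ordering of an n x n token grid. The abstract does not specify EAR's exact traversal or how tokens are grouped for parallel decoding, so the turn direction, the starting cell for even n, and the function name are all assumptions.

```python
def spiral_order(n):
    """Return the (row, col) cells of an n x n grid (n >= 1) in a center-outward spiral."""
    r = c = (n - 1) // 2                       # assumed starting cell
    order = [(r, c)]
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up
    step, d = 1, 0
    while len(order) < n * n:
        for _ in range(2):                     # each step length is used for two turns
            dr, dc = directions[d % 4]
            for _ in range(step):
                r, c = r + dr, c + dc
                if 0 <= r < n and 0 <= c < n:  # skip cells outside the grid
                    order.append((r, c))
            d += 1
        step += 1
    return order
```

Cells that appear close together in this ordering are also spatially adjacent, which is the continuity property the abstract attributes to spiral unfolding.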
FairEnergy Contribution-Based Fairness meets Energy Efficiency in Federated Learning
Authors: Ouiame Marnissi, Hajar EL Hammouti, El Houcine Bergou
2025-11-19
Federated learning (FL) enables collaborative model training across distributed devices while preserving data privacy. However, balancing energy efficiency and fair participation while ensuring high model accuracy remains challenging in wireless edge systems due to heterogeneous resources, unequal client contributions, and limited communication capacity. To address these challenges, we propose FairEnergy, a fairness-aware energy minimization framework that integrates a contribution score capturing both the magnitude of updates and their compression ratio into the joint optimization of device selection, bandwidth allocation, and compression level. The resulting mixed-integer non-convex problem is solved by relaxing binary selection variables and applying Lagrangian decomposition to handle global bandwidth coupling, followed by per-device subproblem optimization. Experiments on non-IID data show that FairEnergy achieves higher accuracy while reducing energy consumption by up to 79% compared to baseline strategies.
Breaking Expert Knowledge Limits Self-Pruning for Large Language Models
Authors: Haidong Kang, Lihong Lin, Enneng Yang, Hongning Dai, Hao Wang
2025-11-19
Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, but their massive size hinders real-world deployment. Existing pruning methods (e.g., Wanda) tailored for LLMs rely heavily on manually designed pruning algorithms, thereby incurring huge labor costs and requiring expert knowledge. Furthermore, we are the first to identify the serious outlier value issue behind dramatic performance degradation under high pruning ratios that are caused by uniform sparsity, raising an additional concern about how to design adaptive sparsity ideal for LLMs. Can LLMs prune by themselves? In this work, we give an affirmative answer by proposing a novel pruning method called AutoPrune, which first overcomes expert knowledge limits by leveraging LLMs to design optimal pruning algorithms for themselves automatically without any expert knowledge. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling us to generate pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in insights into the outlier value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA) to overcome it, mitigating performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLM benchmarks, demonstrating the superiority of AutoPrune, which consistently outperforms state-of-the-art competitors. The code is available at: https://anonymous.4open.science/r/AutoPrune.
IPTQ-ViT Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers
Authors: Gihwan Kim, Jemin Lee, Hyungshin Kim
2025-11-19
Previous Quantization-Aware Training (QAT) methods for vision transformers rely on expensive retraining to recover accuracy loss in non-linear layer quantization, limiting their use in resource-constrained environments. In contrast, existing Post-Training Quantization (PTQ) methods either partially quantize non-linear functions or adjust activation distributions to maintain accuracy but fail to achieve fully integer-only inference. In this paper, we introduce IPTQ-ViT, a novel PTQ framework for fully integer-only vision transformers without retraining. We present approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. In addition, we propose a unified metric integrating quantization sensitivity, perturbation, and computational cost to select the optimal approximation function per activation layer. IPTQ-ViT outperforms previous PTQ methods, achieving up to 6.44%p (avg. 1.78%p) top-1 accuracy improvement for image classification and 1.0 mAP for object detection. IPTQ-ViT outperforms partial floating-point PTQ methods under W8A8 and W4A8, and achieves accuracy and latency comparable to integer-only QAT methods. We plan to release our code at https://github.com/gihwan-kim/IPTQ-ViT.git.
Fidelity-Preserving Quantum Encoding for Quantum Neural Networks
Authors: Yuhu Lu, Jinjing Shi
2025-11-19
Efficiently encoding classical visual data into quantum states is essential for realizing practical quantum neural networks (QNNs). However, existing encoding schemes often discard spatial and semantic information when adapting high-dimensional images to the limited qubits of Noisy Intermediate-Scale Quantum (NISQ) devices. We propose a Fidelity-Preserving Quantum Encoding (FPQE) framework that performs near-lossless data compression and quantum encoding. FPQE employs a convolutional encoder-decoder to learn compact multi-channel representations capable of reconstructing the original data with high fidelity, which are then mapped into quantum states through amplitude encoding. Experimental results show that FPQE performs comparably to conventional methods on simple datasets such as MNIST, while achieving clear improvements on more complex ones, outperforming PCA and
based encodings by up to 10.2% accuracy on Cifar-10. The performance gain grows with data complexity, demonstrating FPQE's ability to preserve high-level structural information across diverse visual domains. By maintaining fidelity during classical-to-quantum transformation, FPQE establishes a scalable and hardware-efficient foundation for high-quality quantum representation learning.
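The amplitude-encoding step mentioned above can be illustrated classically: a feature vector is zero-padded to a power-of-two length and L2-normalized so that its entries are valid state amplitudes. This is a generic sketch of amplitude encoding, not FPQE's implementation; the function name and the zero-padding choice are assumptions.

```python
import math

def amplitude_encode(values):
    """Return (amplitudes, n_qubits) for a real feature vector.

    The vector is zero-padded to length 2**n_qubits and scaled to unit L2 norm,
    so the squared amplitudes sum to 1.
    """
    if not any(values):
        raise ValueError("cannot encode an all-zero vector")
    n_qubits = max(1, (len(values) - 1).bit_length())   # smallest n with 2**n >= len
    padded = list(values) + [0.0] * (2 ** n_qubits - len(values))
    norm = math.sqrt(sum(v * v for v in padded))
    return [v / norm for v in padded], n_qubits
```

Because n qubits hold 2**n amplitudes, a compact learned representation of length d needs only about log2(d) qubits, which is why the size of the representation matters on NISQ hardware.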
Quant-Trim in Practice Improved Cross-Platform Low-Bit Deployment on Edge NPUs
Authors: Rayen Dhahri, Steffen Urban
2025-11-19
Specialized edge accelerators rely on low-bit quantization, but vendor compilers differ in scaling, clipping, and kernel support, often as black boxes. The same floating-point (FP) checkpoint can therefore yield inconsistent accuracy across backends, forcing practitioners to tweak flags or refactor models to vendor-friendly operator subsets. We introduce Quant-Trim, a training-phase method that produces a hardware-neutral checkpoint robust to backend and precision choices. It combines progressive fake quantization to align training with the deployed integer grid and reverse pruning to tame outlier-driven scale inflation while preserving learnability. Quant-Trim is agnostic to quantization schemes (symmetric/asymmetric, per-tensor/per-channel, INT8/INT4) and requires no vendor-specific graph changes. Across models and tasks, it narrows the FP-to-low-bit gap, reduces dependence on compiler heuristics/calibration, and avoids per-backend retraining. We report accuracy and edge metrics (latency, throughput, energy per inference, and cost) under static/dynamic activation scaling and varying operator coverage.
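As background for the terms "fake quantization" and "outlier-driven scale inflation", the sketch below shows plain symmetric per-tensor fake quantization: values are snapped to the integer grid and mapped straight back to floats. It is a generic textbook form, not Quant-Trim's progressive scheme, and the function name is illustrative.

```python
def fake_quantize(values, num_bits=8):
    """Quantize-dequantize a list of floats with one symmetric per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1             # 127 for INT8
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)                    # nothing to scale
    scale = max_abs / qmax                     # one scale for the whole tensor
    # Round to the nearest integer level, clamp to the grid, and map back to floats.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]
```

A single large outlier inflates `scale`, so small values collapse to zero: in `fake_quantize([0.01, 100.0])` the first entry rounds to level 0. This is the kind of effect that taming outliers is meant to reduce.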
SNAP Low-Latency Test-Time Adaptation with Sparse Updates
Authors: Hyeongheon Cha, Dong Min Kim, Hye Won Chung, Taesik Gong, Sung-Ju Lee
2025-11-19
Test-Time Adaptation (TTA) adjusts models using unlabeled test data to handle dynamic distribution shifts. However, existing methods rely on frequent adaptation and high computational cost, making them unsuitable for resource-constrained edge environments. To address this, we propose SNAP, a sparse TTA framework that reduces adaptation frequency and data usage while preserving accuracy. SNAP maintains competitive accuracy even when adapting based on only 1% of the incoming data stream, demonstrating its robustness under infrequent updates. Our method introduces two key components: (i) Class and Domain Representative Memory (CnDRM), which identifies and stores a small set of samples that are representative of both class and domain characteristics to support efficient adaptation with limited data; and (ii) Inference-only Batch-aware Memory Normalization (IoBMN), which dynamically adjusts normalization statistics at inference time by leveraging these representative samples, enabling efficient alignment to shifting target domains. Integrated with five state-of-the-art TTA algorithms, SNAP reduces latency by up to 93.12%, while keeping the accuracy drop below 3.3%, even across adaptation rates ranging from 1% to 50%. This demonstrates its strong potential for practical use on edge devices serving latency-sensitive applications. The source code is available at https://github.com/chahh9808/SNAP.
Graph Query Networks for Object Detection with Automotive Radar
Authors: Loveneet Saini, Hasan Tercan, Tobias Meisen
2025-11-19
Object detection with 3D radar is essential for 360-degree automotive perception, but radar's long wavelengths produce sparse and irregular reflections that challenge traditional grid and sequence-based convolutional and transformer detectors. This paper introduces Graph Query Networks (GQN), an attention-based framework that models objects sensed by radar as graphs, to extract individualized relational and contextual features. GQN employs a novel concept of graph queries to dynamically attend over the bird's-eye view (BEV) space, constructing object-specific graphs processed by two novel modules: EdgeFocus for relational reasoning and DeepContext Pooling for contextual aggregation. On the NuScenes dataset, GQN improves relative mAP by up to +53%, including a +8.2% gain over the strongest prior radar method, while reducing peak graph construction overhead by 80% with moderate FLOPs cost.
Context Cascade Compression Exploring the Upper Limits of Text Compression
Authors: Fanfan Liu, Haibo Qiu
2025-11-19
Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression (C3) to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 Compression demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Codes and model weights are publicly accessible at https://github.com/liufanfanlff/C3-Context-Cascade-Compression
Efficient Transformer-Integrated Deep Neural Architectures for Robust EEG Decoding of Complex Visual Imagery
Authors: Byoung-Hee Kwon
2025-11-19
This study introduces a pioneering approach in brain-computer interface (BCI) technology, featuring our novel concept of complex visual imagery for non-invasive electroencephalography (EEG)-based communication. Complex visual imagery, as proposed in our work, involves the user engaging in the mental visualization of complex upper limb movements. This innovative approach significantly enhances the BCI system, facilitating the extension of its applications to more sophisticated tasks such as EEG-based robotic arm control. By leveraging this advanced form of visual imagery, our study opens new horizons for intricate and intuitive mind-controlled interfaces. We developed an advanced deep learning architecture that integrates functional connectivity metrics with a convolutional neural network-image transformer. This framework is adept at decoding subtle user intentions, addressing the spatial variability in complex visual tasks, and effectively translating these into precise commands for robotic arm control. Our comprehensive offline and pseudo-online evaluations demonstrate the framework's efficacy in real-time applications, including the nuanced control of robotic arms. The robustness of our approach is further validated through leave-one-subject-out cross-validation, marking a significant step towards versatile, subject-independent BCI applications. This research highlights the transformative impact of advanced visual imagery and deep learning in enhancing the usability and adaptability of BCI systems, particularly in robotic arm manipulation.
Unveiling Intrinsic Dimension of Texts from Academic Abstract to Creative Story
Authors: Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya
2025-11-19
Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
Trustworthy GenAI over 6G Integrated Applications and Security Frameworks
Authors: Bui Duc Son, Trinh Van Chien, Dong In Kim
2025-11-19
The integration of generative artificial intelligence (GenAI) into 6G networks promises substantial performance gains while simultaneously exposing novel security vulnerabilities rooted in multimodal data processing and autonomous reasoning. This article presents a unified perspective on cross-domain vulnerabilities that arise across integrated sensing and communication (ISAC), federated learning (FL), digital twins (DTs), diffusion models (DMs), and large telecommunication models (LTMs). We highlight emerging adversarial agents such as compromised DTs and LTMs that can manipulate both the physical and cognitive layers of 6G systems. To address these risks, we propose an adaptive evolutionary defense (AED) concept that continuously co-evolves with attacks through GenAI-driven simulation and feedback, combining physical-layer protection, secure learning pipelines, and cognitive-layer resilience. A case study using an LLM-based port prediction model for fluid-antenna systems demonstrates the susceptibility of GenAI modules to adversarial perturbations and the effectiveness of the proposed defense concept. Finally, we summarize open challenges and future research directions toward building trustworthy, quantum-resilient, and adaptive GenAI-enabled 6G networks.
Masked Auto-Regressive Variational Acceleration Fast Inference Makes Practical Reinforcement Learning
Authors: Yuxuan Gu, Weimin Bai, Yifei Wang, Weijian Luo, He Sun
2025-11-19
Masked auto-regressive diffusion models (MAR) benefit from the expressive modeling ability of diffusion models and the flexibility of masked auto-regressive ordering. However, vanilla MAR suffers from slow inference due to its hierarchical inference mechanism: an outer AR unmasking loop and an inner diffusion denoising chain. Such a decoupled structure not only harms generation efficiency but also hinders the practical use of MAR for reinforcement learning (RL), an increasingly critical paradigm for generative model post-training. To address this fundamental issue, we introduce MARVAL (Masked Auto-regressive Variational Acceleration), a distillation-based framework that compresses the diffusion chain into a single AR generation step while preserving the flexible auto-regressive unmasking order. Such a distillation with MARVAL not only yields substantial inference acceleration but, crucially, makes RL post-training with verifiable rewards practical, resulting in scalable yet human-preferred fast generative models. Our contributions are twofold: (1) a novel score-based variational objective for distilling masked auto-regressive diffusion models into a single generation step without sacrificing sample quality; and (2) an efficient RL framework for masked auto-regressive models via MARVAL-RL. On ImageNet 256x256, MARVAL-Huge achieves an FID of 2.00 with more than 30 times speedup compared with MAR-diffusion, and MARVAL-RL yields consistent improvements in CLIP and image-reward scores on ImageNet datasets with entity names. In conclusion, MARVAL demonstrates the first practical path to distillation and RL of masked auto-regressive diffusion models, enabling fast sampling and better preference alignments.
Fail fast techniques to probe rare events in quantum error correction
Authors: Michael E. Beverland, Malcolm Carroll, Andrew W. Cross, Theodore J. Yoder
2025-11-19
The ultimate goal of quantum error correction is to create logical qubits with very low error rates (e.g. 1e-12) and assemble them into large-scale quantum computers capable of performing many (e.g. billions) of logical gates on many (e.g. thousands) of logical qubits. However, it is necessarily difficult to directly assess the performance of such high-quality logical qubits using standard Monte Carlo sampling because logical failure events become very rare. Building on existing approaches to this problem, we develop three complementary techniques to characterize the rare-event regime for general quantum low-density parity-check (qLDPC) codes under circuit noise. (I) We propose a well-motivated, low-parameter ansatz for the failure spectrum (the fraction of fault sets of each size that fail) that empirically fits all the QEC systems we studied and predicts logical error rates at all physical error rates. (II) We find min-weight logical operators of syndrome measurement circuits and exactly compute the number of min-weight failing configurations. (III) We generalize the splitting method to qLDPC codes using multi-seeded Metropolis sampling to improve convergence for systems with many inequivalent logical operators. We apply these tools to distance-6, -12, and -18 bivariate bicycle codes under circuit noise, observing strong low-error-rate performance with the recently proposed Relay decoder but also considerable scope for further improvement.
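To see why a failure spectrum "predicts logical error rates at all physical error rates", consider a simplified noise model, assumed here only for illustration, in which each of N fault locations fails independently with probability p. If f(w) is the fraction of weight-w fault sets that cause a logical failure, the logical error rate is the sum over w of f(w) * C(N, w) * p^w * (1 - p)^(N - w). The sketch below evaluates that sum; the function and argument names are illustrative, and the paper's ansatz for f is not reproduced.

```python
from math import comb

def logical_error_rate(failure_fraction, n_locations, p):
    """Logical error rate under i.i.d. faults, given a failure spectrum.

    failure_fraction(w) is the fraction of weight-w fault sets that fail.
    """
    total = 0.0
    for w in range(n_locations + 1):
        # Probability that exactly w of the n_locations faults occur.
        weight_prob = comb(n_locations, w) * p**w * (1 - p) ** (n_locations - w)
        total += failure_fraction(w) * weight_prob
    return total
```

Once f is known (or fitted), the same function can be evaluated at any p, including error rates far too low to probe by direct Monte Carlo sampling.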
A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
Authors: Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu
2025-11-19
Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs incur significant computational overhead during inference due to the full-sequence attention computation in each denoising step. Pioneering studies attempt to resolve this issue from a modality-agnostic perspective via key-value cache optimization or efficient sampling, but most of them overlook modality-specific visual token redundancy. In this work, we conduct a comprehensive study on how visual token redundancy evolves with different dMLLM architectures and tasks and how visual token pruning affects dMLLM responses and efficiency. Specifically, our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. In addition, we validate that visual token pruning introduces non-negligible information loss in dMLLMs and only from-scratch dMLLMs can recover the lost information progressively during late denoising steps. Furthermore, our study shows that layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs. Overall, this work offers a new perspective on efficiency optimization for dMLLMs, greatly advancing their applicability across various multimodal understanding tasks.
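As a rough illustration of what late-step visual token pruning could look like, the sketch below keeps all visual tokens during early denoising steps and then retains only the highest-scoring ones. The importance scores, the schedule parameters, and both function names are assumptions made for this example; the abstract does not specify how tokens are scored or when pruning starts.

```python
def keep_ratio_at(step, total_steps, late_start=0.5, final_ratio=0.5):
    """Hypothetical late-step schedule: no pruning until late_start, then final_ratio."""
    return 1.0 if step < late_start * total_steps else final_ratio

def prune_visual_tokens(tokens, scores, keep_ratio):
    """Keep the top-scoring fraction of tokens, preserving their original order."""
    k = max(1, round(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(top)  # restore positional order
    return [tokens[i] for i in keep], keep

tokens = ["v0", "v1", "v2", "v3"]
scores = [0.1, 0.9, 0.3, 0.8]  # e.g. attention mass received (assumed)
ratio = keep_ratio_at(step=8, total_steps=10)
kept, kept_idx = prune_visual_tokens(tokens, scores, ratio)
```

Because attention cost grows with sequence length, dropping visual tokens in the later steps reduces the per-step cost while the early steps still see the full image.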
Knowledge-Informed Automatic Feature Extraction via Collaborative Large Language Model Agents
Authors: Henrik Bradland, Morten Goodwin, Vladimir I. Zadorozhny, Per-Arne Andersen
2025-11-19
The performance of machine learning models on tabular data is critically dependent on high-quality feature engineering. While Large Language Models (LLMs) have shown promise in automating feature extraction (AutoFE), existing methods are often limited by monolithic LLM architectures, simplistic quantitative feedback, and a failure to systematically integrate external domain knowledge. This paper introduces Rogue One, a novel, LLM-based multi-agent framework for knowledge-informed automatic feature extraction. Rogue One operationalizes a decentralized system of three specialized agents (Scientist, Extractor, and Tester) that collaborate iteratively to discover, generate, and validate predictive features. Crucially, the framework moves beyond primitive accuracy scores by introducing a rich, qualitative feedback mechanism and a "flooding-pruning" strategy, allowing it to dynamically balance feature exploration and exploitation. By actively incorporating external knowledge via an integrated retrieval-augmented (RAG) system, Rogue One generates features that are not only statistically powerful but also semantically meaningful and interpretable. We demonstrate that Rogue One significantly outperforms state-of-the-art methods on a comprehensive suite of 19 classification and 9 regression datasets. Furthermore, we show qualitatively that the system surfaces novel, testable hypotheses, such as identifying a new potential biomarker in the myocardial dataset, underscoring its utility as a tool for scientific discovery.
WiCo-PG Wireless Channel Foundation Model for Pathloss Map Generation via Synesthesia of Machines
Authors: Mingran Sun, Lu Bai, Ziwei Huang, Xuesong Cai, Xiang Cheng, Jianjun Wu
2025-11-19
A wireless channel foundation model for pathloss map generation (WiCo-PG) via Synesthesia of Machines (SoM) is developed for the first time. Considering sixth-generation (6G) uncrewed aerial vehicle (UAV)-to-ground (U2G) scenarios, a new multi-modal sensing-communication dataset is constructed for WiCo-PG pre-training, including multiple U2G scenarios, diverse flight altitudes, and diverse frequency bands. Based on the constructed dataset, the proposed WiCo-PG enables cross-modal pathloss map generation by leveraging RGB images from different scenarios and flight altitudes. In WiCo-PG, a novel network architecture designed for cross-modal pathloss map generation based on dual vector quantized generative adversarial networks (VQGANs) and Transformer is proposed. Furthermore, a novel frequency-guided shared-routed mixture of experts (S-R MoE) architecture is designed for cross-modal pathloss map generation. Simulation results demonstrate that the proposed WiCo-PG achieves improved pathloss map generation accuracy through pre-training with a normalized mean squared error (NMSE) of 0.012, outperforming the large language model (LLM)-based scheme, i.e., LLM4PG, and the conventional deep learning-based scheme by more than 6.98 dB. The enhanced generality of the proposed WiCo-PG can further outperform LLM4PG by at least 1.37 dB using 2.7% of the samples in few-shot generalization.
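For readers less familiar with the metric, the sketch below computes NMSE in its usual form (squared error normalized by the energy of the ground truth) and converts a linear value to decibels. It assumes the reported dB gaps are ratios of linear NMSE values, which the abstract does not state explicitly; the function names are illustrative.

```python
import math

def nmse(pred, target):
    """Normalized mean squared error: sum((pred - target)^2) / sum(target^2)."""
    num = sum((p - t) ** 2 for p, t in zip(pred, target))
    den = sum(t ** 2 for t in target)
    return num / den

def to_db(x):
    """Convert a linear power-like ratio to decibels."""
    return 10 * math.log10(x)

nmse_db = to_db(0.012)            # the reported NMSE expressed in dB
gap_linear = 10 ** (6.98 / 10)    # linear ratio implied by a 6.98 dB gap
```

Under that assumption, an NMSE of 0.012 is about -19.2 dB, and a 6.98 dB advantage corresponds to roughly a fivefold reduction in linear NMSE.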
Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference
Authors: Kexin Chu, Dawei Xiang, Zixu Shen, Yiwei Yang, Zecheng Liu, Wei Zhang
2025-11-19
Mixture-of-Experts (MoE) models scale capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, causing accuracy loss under aggressive compression. We therefore present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets.
Across Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.
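The idea of a hotness-aware precision controller can be sketched as follows. This is a minimal illustration under assumptions not found in the abstract: hotness is tracked as an exponential moving average of per-expert activation counts, only two precisions exist (16-bit and 4-bit), and promotion is greedy under a byte budget. All names and parameters are hypothetical and do not describe DynaExq's actual implementation.

```python
def update_hotness(hotness, activations, decay=0.9):
    """Exponential moving average of per-expert activation counts (assumed statistic)."""
    return {e: decay * h + (1 - decay) * activations.get(e, 0)
            for e, h in hotness.items()}

def assign_bits(hotness, budget_bytes, params_per_expert, high_bits=16, low_bits=4):
    """Greedily promote the hottest experts to high precision within the budget."""
    bits = {e: low_bits for e in hotness}
    used = len(hotness) * params_per_expert * low_bits // 8
    upgrade_cost = params_per_expert * (high_bits - low_bits) // 8
    for e in sorted(hotness, key=hotness.get, reverse=True):
        if used + upgrade_cost > budget_bytes:
            break  # every upgrade costs the same, so no later expert fits either
        bits[e] = high_bits
        used += upgrade_cost
    return bits

hotness = {"e0": 5.0, "e1": 1.0, "e2": 3.0}
bits = assign_bits(hotness, budget_bytes=4500, params_per_expert=1000)
```

In this toy setting the two hottest experts are promoted and the coldest stays at low precision; re-running `assign_bits` after each `update_hotness` call lets the assignment follow shifting activation patterns.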
Mathematical Analysis of Hallucination Dynamics in Large Language Models Uncertainty Quantification, Advanced Decoding, and Principled Mitigation
Authors: Moses Kiprono
2025-11-19
Large Language Models (LLMs) are powerful linguistic engines but remain susceptible to hallucinations: plausible-sounding outputs that are factually incorrect or unsupported. In this work, we present a mathematically grounded framework to understand, measure, and mitigate these hallucinations. Drawing on probabilistic modeling, information theory, trigonometric signal analysis, and Bayesian uncertainty estimation, we analyze how errors compound autoregressively, propose refined uncertainty metrics, including semantic and phase-aware variants, and develop principled mitigation strategies such as contrastive decoding, retrieval-augmented grounding, factual alignment, and abstention. This unified lens connects recent advances in calibration, retrieval, and alignment to support safer and more reliable LLMs.
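Two of the ingredients mentioned above, autoregressive error compounding and uncertainty-based abstention, can be illustrated with a simple sketch. It relies on textbook simplifications that are assumptions here, not the paper's model: each token errs independently with probability eps, and uncertainty is measured by the Shannon entropy of the next-token distribution. The function names and threshold are hypothetical.

```python
import math

def sequence_error_prob(eps, n):
    """Probability of at least one error in n tokens, assuming independent errors."""
    return 1 - (1 - eps) ** n

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_abstain(probs, threshold):
    """Abstain when predictive entropy exceeds a chosen threshold."""
    return entropy(probs) > threshold

risk = sequence_error_prob(0.01, 100)           # about 0.63 for a 1% per-token error
abstain = should_abstain([0.25] * 4, threshold=1.0)
```

Even a 1% per-token error rate gives roughly a 63% chance of at least one error over 100 tokens under the independence assumption, which is why sequence-level uncertainty estimates and abstention rules matter for long generations.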