2025-12-31
Table of Contents
- Toward Trustworthy Agentic AI A Multimodal Framework for Preventing Prompt Injection Attacks
- TV-RAG A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding
- Replay Failures as Successes Sample-Efficient Reinforcement Learning for Instruction Following
- Mobile-Efficient Speech Emotion Recognition Using DistilHuBERT A Cross-Corpus Validation Study
- C2PO Diagnosing and Disentangling Bias Shortcuts in LLMs
- AKG kernel Agent A Multi-Agent Framework for Cross-Platform Kernel Synthesis
- Bridging Cognitive Gap Hierarchical Description Learning for Artistic Image Aesthetics Assessment
- On Signal Peak Power Constraint of Over-the-Air Federated Learning
- SpatialMosaic A Multiview VLM Dataset for Partial Visibility
- Splitwise Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL
- Agentic AI-Enhanced Semantic Communications Foundations, Architecture, and Applications
- Contour Information Aware 2D Gaussian Splatting for Image Representation
- RS-Prune Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models
- FairGFL Privacy-Preserving Fairness-Aware Federated Learning with Overlapping Subgraphs
- Not too long do read Evaluating LLM-generated extreme scientific summaries
- Energy and Memory-Efficient Federated Learning With Ordered Layer Freezing
- SPIRAL Symbolic LLM Planning via Grounded and Reflective Search
- Understanding EFL Learners' Code-Switching and Teachers' Pedagogical Approaches in LLM-Supported Speaking Practice
- Taming the Tail Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning
- Accelerating Language Model Workflows with Prompt Choreography
- Viability and Performance of a Private LLM Server for SMBs A Benchmark Analysis of Qwen3-30B on Consumer-Grade Hardware
- Improving Generalization in LLM Structured Pruning via Function-Aware Neuron Grouping
- Evolution of Buffer Management in Database Systems From Classical Algorithms to Machine Learning and Disaggregated Memory
- Argus Token Aware Distributed LLM Inference Optimization
- JavisGPT A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
- Hash Grid Feature Pruning
- 3D Scene Change Modeling With Consistent Multi-View Aggregation
- MoR Mixture Of Representations For Mixed-Precision Training
- Parallel Diffusion Solver via Residual Dirichlet Policy Optimization
- Text-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis
- WeDLM Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference
- GHaLIB A Multilingual Framework for Hope Speech Detection in Low-Resource Languages
- Modality Inflation Energy Characterization and Optimization Opportunities for MLLM Inference
- Fragile Knowledge, Robust Instruction-Following The Width Pruning Dichotomy in Llama-3.2
- Scaling Unverifiable Rewards A Case Study on Visual Insights
- Communication Compression for Distributed Learning with Aggregate and Server-Guided Feedback
- Learning Multi-Modal Mobility Dynamics for Generalized Next Location Recommendation
- RollArt Scaling Agentic RL Training via Disaggregated Infrastructure
- CoDS Collaborative Perception via Digital Semantic Communication
- Decomposing Task Vectors for Refined Model Editing
- Role-Based Fault Tolerance System for LLM RL Post-Training
- Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing
- MEGA-PCC A Mamba-based Efficient Approach for Joint Geometry and Attribute Point Cloud Compression
- Differentiable Inverse Modeling with Physics-Constrained Latent Diffusion for Heterogeneous Subsurface Parameter Fields
- Nightjar Dynamic Adaptive Speculative Decoding for Large Language Models Serving
- OxygenREC An Instruction-Following Generative Framework for E-commerce Recommendation
- Charge-Informed Quantum Error Correction
- Yume-1.5 A Text-Controlled Interactive World Generation Model
- Prefill vs. Decode Bottlenecks SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling
- Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models
- Robust Federated Fine-Tuning in Heterogeneous Networks with Unreliable Connections An Aggregation View
- Direction Finding with Sparse Arrays Based on Variable Window Size Spatial Smoothing
- Look Closer! An Adversarial Parametric Editing Framework for Hallucination Mitigation in VLMs
- BLEST Blazingly Efficient BFS using Tensor Cores
- GQ-VAE A gated quantized VAE for learning variable length tokens
- Accelerate Speculative Decoding with Sparse Computation in Verification
- LLMBoost Make Large Language Models Stronger with Boosting
- MMCTOP A Multimodal Textualization and Mixture-of-Experts Framework for Clinical Trial Outcome Prediction
- TimeBill Time-Budgeted Inference for Large Language Models
- Fast Inference of Visual Autoregressive Model with Adjacency-Adaptive Dynamical Draft Trees
- LIMEAccelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices
Toward Trustworthy Agentic AI A Multimodal Framework for Preventing Prompt Injection Attacks
Authors: Toqeer Ali Syed, Mishal Ateeq Almutairi, Mahmoud Abdel Moaty
2025-12-29
Powerful autonomous systems, which reason, plan, and converse using and between numerous tools and agents, are made possible by Large Language Models (s), Vision-Language Models (VLMs), and new agentic AI systems, like LangChain and GraphChain. Nevertheless, this agentic environment increases the probability of the occurrence of multimodal prompt injection (PI) attacks, in which concealed or malicious instructions carried in text, pictures, metadata, or agent-to-agent messages may spread throughout the graph and lead to unintended behavior, a breach of policy, or corruption of state. In order to mitigate these risks, this paper suggests a Cross-Agent Multimodal Provenanc- Aware Defense Framework whereby all the prompts, either user-generated or produced by upstream agents, are sanitized and all the outputs generated by an
are verified independently before being sent to downstream nodes. This framework contains a Text sanitizer agent, visual sanitizer agent, and output validator agent all coordinated by a provenance ledger, which keeps metadata of modality, source, and trust level throughout the entire agent network. This architecture makes sure that agent-to-agent
abides by clear trust frames such such that injected instructions are not propagated down LangChain or GraphChain-style-workflows. The experimental assessments show that multimodal injection detection accuracy is significantly enhanced, and the cross-agent trust leakage is minimized, as well as, agentic execution pathways become stable. The framework, which expands the concept of provenance tracking and validation to the multi-agent orchestration, enhances the establishment of secure, understandable and reliable agentic AI systems.
TV-RAG A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding
Authors: Zongsheng Cao, Yangfan He, Anran Liu, Feng Chen, Zepeng Wang, Jun Xie
2025-12-29
Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical , ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: \emph{(i)} a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and \emph{(ii)} an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing redundancy while pre
representativeness. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning. The resulting system offers a lightweight, budget-friendly upgrade path and consistently surpasses most leading baselines across established long-video benchmarks such as Video-MME, MLVU, and LongVideoBench, confirming the effectiveness of our model. The code can be found at https://github.com/AI-Researcher-Team/TV-RAG.
Replay Failures as Successes Sample-Efficient Reinforcement Learning for Instruction Following
Authors: Kongcheng Zhang, Qi Yao, Shunyu Liu, Wenjian Zhang, Min Cen, Yang Zhou, Wenkai Fang, Yiru Zhao, Baisheng Lai, Mingli Song
2025-12-29
Reinforcement Learning (RL) has shown promise for aligning Large Language Models (s) to follow instructions with various constraints. Despite the encouraging results, RL improvement inevitably relies on sampling successful, high-quality responses; however, the initial model often struggles to generate responses that satisfy all constraints due to its limited capabilities, yielding
or indistinguishable rewards that impede learning. In this work, we propose Hindsight instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction following tasks, which employs a select-then-rewrite strategy to replay failed attempts as successes based on the constraints that have been satisfied in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction- and response-level to enable efficient optimization using only a binary reward signal. Extensive experiments demonstrate that the proposed HiR yields promising results across different instruction following tasks, while requiring less computational budget. Our code and dataset is available at https://github.com/sastpg/HIR.
Mobile-Efficient Speech Emotion Recognition Using DistilHuBERT A Cross-Corpus Validation Study
Authors: Saifelden M. Ismail
2025-12-29
Speech Emotion Recognition (SER) has significant potential for mobile applications, yet deployment remains constrained by the computational demands of state-of-the-art architectures. This paper presents a mobile-efficient SER system based on DistilHuBERT, a distilled and 8-bit
d
that achieves 92% parameter reduction compared to full-scale Wav2Vec 2.0 models while maintaining competitive accuracy. We conduct a rigorous 5-fold Leave-One-Session-Out (LOSO) cross-validation on the IEMOCAP dataset to ensure speaker independence, augmented with cross-corpus training on CREMA-D to enhance generalization. Cross-corpus training with CREMA-D yields a 1.2% improvement in Weighted Accuracy, a 1.4% gain in Macro F1-score, and a 32% reduction in cross-fold variance, with the Neutral class showing the most substantial benefit at 5.4% F1-score improvement. Our approach achieves an Unweighted Accuracy of 61.4% with a
d model footprint of only 23 MB, representing approximately 91% of full-scale baseline performance. Cross-corpus evaluation on RAVDESS reveals that the theatrical nature of acted emotions causes predictions to cluster by arousal level rather than valence: happiness is systematically confused with anger due to acoustic saturation in high-energy expressions. Despite this theatricality effect reducing overall RAVDESS accuracy to 43.29%, the model maintains robust arousal detection with 97% recall for anger and 64% for sadness. These findings establish a Pareto-optimal tradeoff between model size and accuracy, enabling practical affect recognition on resource-constrained mobile devices.
C2PO Diagnosing and Disentangling Bias Shortcuts in LLMs
Authors: Xuan Feng, Bo An, Tianlong Gu, Liang Chang, Fengrui Hao, Peipeng Yu, Shuai Zhao
2025-12-29
Bias in Large Language Models (s) poses significant risks to trustworthiness, manifesting primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical
or position preferences). However, prior paradigms typically address these in isolation, often mitigating one at the expense of exacerbating the other. To address this, we conduct a systematic exploration of these reasoning failures and identify a primary inducement: the latent spurious feature correlations within the input that drive these erroneous reasoning shortcuts. Driven by these findings, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework designed to tackle these specific failures by simultaneously discovering and suppressing these correlations directly within the optimization process. Specifically, C2PO leverages causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference update mechanism to dynamically evaluate logit-level contributions and suppress shortcut features. Extensive experiments across multiple benchmarks covering stereotypical bias (BBQ, Unqover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K) demonstrate that C2PO effectively mitigates stereotypical and structural biases while pre
robust general reasoning capabilities.
AKG kernel Agent A Multi-Agent Framework for Cross-Platform Kernel Synthesis
Authors: Jinye Du, Quan Yuan, Zuyao Zhang, Yanzhi Yi, Jiahui Hu, Wangyi Chen, Yiyang Zhu, Qishui Zheng, Wenxiang Zou, Xiangyu Chang, Zuohe Zheng, Zichun Ye, Chao Liu, Shanni Li, Renwei Zhang, Yiping Deng, Xinwei Hu, Xuefeng Jin, Jie Zhao
2025-12-29
Modern AI models demand high-performance computation kernels. The growing complexity of s, multimodal architectures, and recommendation systems, combined with techniques like
and
, creates significant computational challenges. Moreover, frequent hardware updates and diverse chip architectures further complicate this landscape, requiring tailored kernel implementations for each platform. However, manual optimization cannot keep pace with these demands, creating a critical bottleneck in AI system development. Recent advances in
code generation capabilities have opened new possibilities for automating kernel development. In this work, we propose AKG kernel agent (AI-driven Kernel Generator), a multi-agent system that automates kernel generation, migration, and performance tuning. AKG kernel agent is designed to support multiple domain-specific languages (DSLs), including Triton, TileLang, CPP, and CUDA-C, enabling it to target different hardware backends while maintaining correctness and portability. The system's modular design allows rapid integration of new DSLs and hardware targets. When evaluated on KernelBench using Triton DSL across GPU and NPU backends, AKG kernel agent achieves an average speedup of 1.46 over PyTorch Eager baselines implementations, demonstrating its effectiveness in accelerating kernel development for modern AI workloads.
Bridging Cognitive Gap Hierarchical Description Learning for Artistic Image Aesthetics Assessment
Authors: Henglin Liu, Nisha Huang, Chang Liu, Jiangpeng Yan, Huijuan Huang, Jixuan Ying, Tong-Yee Lee, Pengfei Wan, Xiangyang Ji
2025-12-29
The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing dataset overly focuses on visual perception and neglects deeper dimensions due to the expensive manual annotation; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoder, while multimodal methods represented by contrastive learning struggle to effectively process long-form textual descriptions. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset, generated via an iterative pipeline without heavy annotation costs and easy to scale. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images which not only couples isolated aesthetic dimensions through joint description generation, but also better models long-text semantics with the help of
rs. Besides, theoretical analysis confirms this symbiosis: RAD's semantic adequacy (data) and generation paradigm (model) collectively minimize prediction entropy, providing mathematical grounding for the framework. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both code and dataset to support future research.
On Signal Peak Power Constraint of Over-the-Air Federated Learning
Authors: Lorenz Bielefeld, Paul Zheng, Oner Hanay, Yao Zhu, Yulin Hu, Anke Schmeink
2025-12-29
Federated learning (FL) has been considered a promising privacy pre distributed edge learning framework. Over-the-air computation (AirComp) technique leveraging analog transmission enables the aggregation of local updates directly over-the-air by exploiting the superposition properties of wireless multiple-access channel, thereby drastically reducing the
bottleneck issues of FL compared with digital transmission schemes. This work points out that existing AirComp-FL overlooks a key practical constraint, the instantaneous peak-power constraints imposed by the non-linearity of radiofrequency power amplifiers. We present and analyze the effect of the classic method to deal with this issue, amplitude clipping combined with filtering. We investigate the effect of instantaneous peak-power constraints in AirComp-FL for both single-carrier and multi-carrier orthogonal frequency-division multiplexing (OFDM) systems. We highlight the specificity of AirComp-FL: the samples depend on the gradient value distribution, leading to a higher peak-to-average power ratio (PAPR) than that observed for uniformly distributed signals. Simulation results demonstrate that, in practical settings, the instantaneous transmit power regularly exceeds the power-amplifier limit; however, by applying clipping and filtering, the FL performance can be degraded. The degradation becomes pronounced especially in multi-carrier OFDM systems due to the in-band distortions caused by clipping and filtering.
SpatialMosaic A Multiview VLM Dataset for Partial Visibility
Authors: Kanghee Lee, Injae Lee, Minseok Kwak, Kwonyoung Ryu, Jungi Hong, Jaesik Park
2025-12-29
The rapid progress of Multimodal Large Language Models (Ms) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. However, existing approaches often rely on pre-constructed 3D representations or off-the-shelf reconstruction pipelines, which constrain scalability and real-world applicability. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling Vision-Language Models (VLMs) to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-
conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios, consisting of 1M QA pairs across 6 tasks. In addition, we present SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning. Extensive experiments demonstrate that our proposed dataset and VQA tasks effectively enhance spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and diverse QA pairs. Code and dataset will be available soon.
Splitwise Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL
Authors: Abolfazl Younesi, Abbas Shabrang Maryan, Elyas Oustad, Zahra Najafabadi Samani, Mohsen Ansari, Thomas Fahringer
2025-12-29
Deploying large language models (s) on edge devices is challenging due to their limited memory and power resources. Cloud-only inference reduces device burden but introduces high latency and cost. Static edge-cloud partitions optimize a single metric and struggle when bandwidth fluctuates. We propose Splitwise, a novel Lyapunov-assisted deep reinforcement learning (DRL) framework for fine-grained, adaptive partitioning of
s across edge and cloud environments. Splitwise decomposes
layers into attention heads and feed-forward sub-blocks, exposing more partition choices than layer-wise schemes. A hierarchical DRL policy, guided by Lyapunov optimization, jointly minimizes latency, energy consumption, and accuracy degradation while guaranteeing queue stability under stochastic workloads and variable network bandwidth. Splitwise also guarantees robustness via partition checkpoints with exponential backoff recovery in case of
failures. Experiments on Jetson Orin NX, Galaxy S23, and Raspberry Pi 5 with GPT-2 (1.5B), LLaMA-7B, and LLaMA-13B show that Splitwise reduces end-to-end latency by 1.4x-2.8x and cuts energy consumption by up to 41% compared with existing partitioners. It lowers the 95th-percentile latency by 53-61% relative to cloud-only execution, while maintaining accuracy and modest memory requirements.
Agentic AI-Enhanced Semantic Communications Foundations, Architecture, and Applications
Authors: Haixiao Gao, Mengying Sun, Ruichen Zhang, Yanhan Wang, Xiaodong Xu, Nan Ma, Dusit Niyato, Ping Zhang
2025-12-29
Semantic s (SemCom), as one of the key technologies for 6G, is shifting networks from bit transmission to semantic information exchange. On this basis, introducing agentic artificial intelligence (AI) with perception, memory, reasoning, and action capabilities provides a practicable path to intelligent
s. This paper provides a systematic exposition of how agentic AI empowers SemCom from the perspectives of research foundations, system architecture, and application scenarios. We first provide a comprehensive review of existing studies by agent types, covering embedded agents, large language model (
)/large vision model (LVM) agents, and reinforcement learning (RL) agents. Additionally, we propose a unified agentic AI-enhanced SemCom framework covering the application layer, the semantic layer, and the cloud-edge collaboration layer, forming a closed loop from intent to encoding to transmission to
to action to evaluation. We also present several typical scenarios, including multi-vehicle collaborative perception, multi-robot cooperative rescue, and agentic operations for intellicise (intelligent and concise) networks. Furthermore, we introduce an agentic knowledge base (KB)-based joint source-channel coding case study, AKB-JSCC, where the source KB and channel KB are built by
/LVM agents and RL agents, respectively. Experimental results show that AKB-JSCC achieves higher information reconstruction quality under different channel conditions. Finally, we discuss future evolution and research directions, providing a reference for portable, verifiable, and controllable research and deployment of agentic SemCom.
Contour Information Aware 2D Gaussian Splatting for Image Representation
Authors: Masaya Takabe, Hiroshi Watanabe, Sujun Hong, Tomohiro Ikai, Zheming Fan, Ryo Ishimoto, Kakeru Sugimoto, Ruri Imichi
2025-12-29
Image representation is a fundamental task in computer vision. Recently, Gaussian Splatting has emerged as an efficient representation framework, and its extension to 2D image representation enables lightweight, yet expressive modeling of visual content. While recent 2D Gaussian Splatting (2DGS) approaches provide compact storage and real-time , they often produce blurry or indistinct boundaries when the number of Gaussians is small due to the lack of contour awareness. In this work, we propose a Contour Information-Aware 2D Gaussian Splatting framework that incorporates object segmentation priors into Gaussian-based image representation. By constraining each Gaussian to a specific segmentation region during rasterization, our method prevents cross-boundary blending and preserves edge structures under high
. We also introduce a warm-up scheme to stabilize training and improve convergence. Experiments on synthetic color charts and the DAVIS dataset demonstrate that our approach achieves higher reconstruction quality around object edges compared to existing 2DGS methods. The improvement is particularly evident in scenarios with very few Gaussians, while our method still maintains fast rendering and low memory usage.
RS-Prune Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models
Authors: Fan Wei, Runmin Dong, Yushan Lai, Yixiang Yang, Zhaoyang Luo, Jinxiao Zhang, Miao Yang, Shuai Yuan, Jiyao Zhao, Bin Luo, Haohuan Fu
2025-12-29
Diffusion-based remote sensing (RS) generative foundation models are cruial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generation modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data approach that quickly select a high-quality subset under high
ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high
ratios while pre
overall diversity and representativeness. Experiments show that, even after
85\% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data
paradigm offers practical guidance for developing RS generative foundation models.
FairGFL Privacy-Preserving Fairness-Aware Federated Learning with Overlapping Subgraphs
Authors: Zihao Zhou, Shusen Yang, Fangyuan Zhao, Xuebin Ren
2025-12-29
Graph federated learning enables the collaborative extraction of high-order information from distributed subgraphs while pre the privacy of raw data. However, graph data often exhibits
among different clients. Previous research has demonstrated certain benefits of
ping data in mitigating data heterogeneity. However, the negative effects have not been explored, particularly in cases where the
s are imbalanced across clients. In this paper, we uncover the unfairness issue arising from imbalanced
ping subgraphs through both empirical observations and theoretical reasoning. To address this issue, we propose FairGFL (FAIRness-aware subGraph Federated Learning), a novel algorithm that enhances cross-client fairness while maintaining model utility in a privacy-pre
manner. Specifically, FairGFL incorporates an interpretable weighted aggregation approach to enhance fairness across clients, leveraging privacy-pre
estimation of their
ping ratios. Furthermore, FairGFL improves the tradeoff between model utility and fairness by integrating a carefully crafted regularizer into the federated composite loss function. Through extensive experiments on four benchmark graph datasets, we demonstrate that FairGFL outperforms four representative baseline algorithms in terms of both model utility and fairness.
Not too long do read Evaluating LLM-generated extreme scientific summaries
Authors: Zhuoqi Lyu, Qing Ke
2025-12-29
High-quality scientific extreme summary (TLDR) facilitates effective science . How do large language models (
s) perform in generating them? How are
-generated summaries different from those written by human experts? However, the lack of a comprehensive, high-quality scientific TLDR dataset hinders both the development and evaluation of
s' summarization ability. To address these, we propose a novel dataset, BiomedTLDR, containing a large sample of researcher-authored summaries from scientific papers, which leverages the common practice of including authors' comments alongside bibliography items. We then test popular open-weight
s for generating TLDRs based on abstracts. Our analysis reveals that, although some of them successfully produce humanoid summaries,
s generally exhibit a greater affinity for the original text's lexical choices and rhetorical structures, hence tend to be more extractive rather than abstractive in general, compared to humans. Our code and datasets are available at https://github.com/netknowledge/
_summarization (Lyu and Ke, 2025).
Energy and Memory-Efficient Federated Learning With Ordered Layer Freezing
Authors: Ziru Niu, Hai Dong, A. K. Qin, Tao Gu, Pengcheng Zhang
2025-12-29
Federated Learning (FL) has emerged as a privacy-pre paradigm for training machine learning models across distributed edge devices in the Internet of Things (IoT). By keeping data local and coordinating model training through a central server, FL effectively addresses privacy concerns and reduces
overhead. However, the limited computational power, memory, and bandwidth of IoT edge devices pose significant challenges to the efficiency and scalability of FL, especially when training deep neural networks. Various FL frameworks have been proposed to reduce computation and
overheads through dropout or layer freezing. However, these approaches often sacrifice accuracy or neglect memory constraints. To this end, in this work, we introduce Federated Learning with Ordered Layer Freezing (FedOLF). FedOLF consistently freezes layers in a predefined order before training, significantly mitigating computation and memory requirements. To further reduce
and energy costs, we incorporate Tensor Operation Approximation (TOA), a lightweight alternative to conventional
that better preserves model accuracy. Experimental results demonstrate that over non-iid data, FedOLF achieves at least 0.3%, 6.4%, 5.81%, 4.4%, 6.27% and 1.29% higher accuracy than existing works respectively on EMNIST (with CNN), CIFAR-10 (with AlexNet), CIFAR-100 (with ResNet20 and ResNet44), and CINIC-10 (with ResNet20 and ResNet44), along with higher energy efficiency and lower memory footprint.
SPIRAL Symbolic LLM Planning via Grounded and Reflective Search
Authors: Yifan Zhang, Giridhar Ganapavarapu, Srideepika Jayaraman, Bhavna Agrawal, Dhaval Patel, Achille Fokoue
2025-12-29
Large Language Models (s) often falter at complex planning tasks that require exploration and self-correction, as their linear reasoning process struggles to recover from early mistakes. While search algorithms like Monte Carlo Tree Search (MCTS) can explore alternatives, they are often ineffective when guided by
rewards and fail to leverage the rich semantic capabilities of
s. We introduce SPIRAL (Symbolic
Planning via Grounded and Reflective Search), a novel framework that embeds a cognitive architecture of three specialized
agents into an MCTS loop. SPIRAL's key contribution is its integrated planning pipeline where a Planner proposes creative next steps, a Simulator grounds the search by predicting realistic outcomes, and a Critic provides dense reward signals through reflection. This synergy transforms MCTS from a brute-force search into a guided, self-correcting reasoning process. On the DailyLifeAPIs and HuggingFace datasets, SPIRAL consistently outperforms the default Chain-of-Thought planning method and other state-of-the-art agents. More importantly, it substantially surpasses other state-of-the-art agents; for example, SPIRAL achieves 83.6% overall accuracy on DailyLifeAPIs, an improvement of over 16 percentage points against the next-best search framework, while also demonstrating superior token efficiency. Our work demonstrates that structuring
reasoning as a guided, reflective, and grounded search process yields more robust and efficient autonomous planners. The source code, full appendices, and all experimental data are available for reproducibility at the official project repository.
Understanding EFL Learners' Code-Switching and Teachers' Pedagogical Approaches in LLM-Supported Speaking Practice
Authors: Junyeong Park, Jieun Han, Yeon Su Park, Youngbin Lee, Suin Kim, Juho Kim, Alice Oh, So-Yeon Ahn
2025-12-29
For English as a Foreign Language (EFL) learners, code-switching (CSW), or alternating between their native language and the target language (English), can lower anxiety and ease barriers. Large language models (
s), with their multilingual abilities, offer new opportunities to support CSW in speaking practice. Yet, the pedagogical design of
-based tutors remains underexplored. To this end, we conducted a six-week study of
-mediated speaking practice with 20 Korean EFL learners, alongside a qualitative study with nine English teachers who designed and refined responses to learner CSW. Findings show that learners used CSW not only to bridge lexical gaps but also to express cultural and emotional nuance, prompting teachers to employ selective interventions and dynamic scaffolding strategies. We conclude with design implications for bilingual
-powered tutors that leverage teachers' expertise to transform CSW into meaningful learning opportunities.
Taming the Tail Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning
Authors: Yingru Li, Jiawei Xu, Jiacai Liu, Yuxuan Tong, Ziniu Li, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang
2025-12-28
Reinforcement learning for large language models (s) faces a fundamental tension: high-throughput inference engines and numerically-precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. We prove this mismatch has an asymmetric effect: the bound on log-probability mismatch scales as where is the token probability. For high-probability tokens, this bound vanishes, contributing negligibly to sequence-level mismatch. For low-probability tokens in the tail, the bound remains large, and moreover, when sampled, these tokens exhibit systematically biased mismatches that accumulate over sequences, destabilizing gradient estimation. Rather than applying post-hoc corrections, we propose constraining the RL objective to a dynamically-pruned ``safe'' vocabulary that excludes the extreme tail. By
such tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias. Empirically, our method achieves stable training; theoretically, we bound the optimization bias introduced by vocabulary
.
Accelerating Language Model Workflows with Prompt Choreography
Authors: TJ Bai, Jason Eisner
2025-12-28
Large language models are increasingly deployed in multi-agent workflows. We introduce Prompt Choreography, a framework that efficiently executes workflows by maintaining a dynamic, global
. Each
call can attend to an arbitrary, reordered subset of previously encoded messages. Parallel calls are supported. Though caching messages' encodings sometimes gives different results from re-encoding them in a new context, we show in diverse settings that fine-tuning the
to work with the
can help it mimic the original results. Prompt Choreography significantly reduces per-message latency (2.0--6.2 faster time-to-first-token) and achieves substantial end-to-end speedups (2.2) in some workflows dominated by redundant computation.
Viability and Performance of a Private LLM Server for SMBs A Benchmark Analysis of Qwen3-30B on Consumer-Grade Hardware
Authors: Alex Khalil, Guillaume Heilles, Maria Parraga, Simon Heilles
2025-12-28
The proliferation of Large Language Models (s) has been accompanied by a reliance on cloud-based, proprietary systems, raising significant concerns regarding data privacy, operational sovereignty, and escalating costs. This paper investigates the feasibility of deploying a high-performance, private
inference server at a cost accessible to Small and Medium Businesses (SMBs). We present a comprehensive benchmarking analysis of a locally hosted,
d 30-billion parameter Mixture-of-Experts (MoE) model based on Qwen3, running on a consumer-grade server equipped with a next-generation NVIDIA GPU. Unlike cloud-based offerings, which are expensive and complex to integrate, our approach provides an affordable and private solution for SMBs. We evaluate two dimensions: the model's intrinsic capabilities and the server's performance under load. Model performance is benchmarked against academic and industry standards to quantify reasoning and knowledge relative to cloud services. Concurrently, we measure server efficiency through latency, tokens per second, and time to first token, analyzing scalability under increasing concurrent users. Our findings demonstrate that a carefully configured on-premises setup with emerging consumer hardware and a
d open-source model can achieve performance comparable to cloud-based services, offering SMBs a viable pathway to deploy powerful
s without prohibitive costs or privacy compromises.
Improving Generalization in LLM Structured Pruning via Function-Aware Neuron Grouping
Authors: Tao Yu, Yongqi An, Kuan Zhu, Guibo Zhu, Ming Tang, Jinqiao Wang
2025-12-28
Large Language Models (s) demonstrate impressive performance across natural language tasks but incur substantial computational and storage costs due to their scale. Post-training structured
offers an efficient solution. However, when few-shot calibration sets fail to adequately reflect the pretraining data distribution, existing methods exhibit limited generalization to downstream tasks. To address this issue, we propose Function-Aware Neuron Grouping (FANG), a post-training
framework that alleviates calibration bias by identifying and pre
neurons critical to specific function. FANG groups neurons with similar function based on the type of semantic context they process and prunes each group independently. During importance estimation within each group, tokens that strongly correlate with the functional role of the neuron group are given higher weighting. Additionally, FANG also preserves neurons that contribute across multiple context types. To achieve a better trade-off between
and performance, it allocates
to each block adaptively based on its functional complexity. Experiments show that FANG improves downstream accuracy while pre
language modeling performance. It achieves the state-of-the-art (SOTA) results when combined with FLAP and OBC, two representative
methods. Specifically, FANG outperforms FLAP and OBC by 1.5%--8.5% in average accuracy under 30% and 40%
.
Evolution of Buffer Management in Database Systems From Classical Algorithms to Machine Learning and Disaggregated Memory
Authors: Prudhvi Gadupudi, Suman Saha
2025-12-28
Buffer management remains a critical component of database and operating system performance, as the primary mechanism for bridging the persistent latency gap between CPU processing speeds and storage access times. This paper provides a comprehensive survey of buffer management evolution spanning four decades of research. We systematically analyze the progression from foundational algorithms like LRU-K, 2Q, LIRS, and ARC to contemporary machine learning-augmented policies and
d memory architectures. Our survey examines the historical OS-DBMS architectural divergence, production system implementations in PostgreSQL, Oracle, and Linux, and emerging trends including eBPF-based kernel extensibility, NVM-aware tiering strategies, and RDMA-enabled memory
. Through analysis of over 50 seminal papers from leading conferences (SIGMOD, VLDB, OSDI, FAST), we identify key architectural patterns, performance trade-offs, and open research challenges. We conclude by outlining a research direction that integrates machine learning with kernel extensibility mechanisms to enable adaptive, cross-layer buffer management for heterogeneous memory hierarchies in modern database systems.
Argus Token Aware Distributed LLM Inference Optimization
Authors: Panlong Wu, Yifei Zhong, Danyang Chen, Ting Wang, Fangxin Wang
2025-12-28
Large Language Models (s) are rapidly being integrated into real-world applications, yet their autoregressive architectures introduce significant inference time variability, especially when deployed across heterogeneous edge-cloud systems. Existing solutions largely neglect the dynamic, stochastic, and heterogeneous nature of such environments, often ignoring the impact of variable output token lengths and device diversity. In this work, we present Argus, the first token-aware distributed edge-cloud
inference framework that conducts efficient task offloading. Argus features a Length-Aware Semantics (LAS) module, which predicts output token lengths for incoming prompts using a fine-tuned language model with token-length-sensitive feature modulation, enabling precise estimation. Building on this, our Lyapunov-guided Offloading Optimization (LOO) module formulates long-term Quality-of-Experience optimization that explicitly considers both
ing and
costs. We introduce a novel Iterative Offloading Algorithm with Damping and Congestion Control (IODCC) to effectively solve the resulting integer nonlinear programming problem under time-varying constraints. Extensive theoretical and empirical evaluations demonstrate that Argus achieves robust performance and superior efficiency in highly dynamic, heterogeneous settings.
JavisGPT A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
Authors: Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao, Daoan Zhang, Wei Zhang, Sheng Jin, Sicheng Yu, Geng Zhan, Jiayi Ji, Fan Zhou, Liang Zheng, Shuicheng Yan, Hao Fei, Tat-Seng Chua
2025-12-28
This paper presents JavisGPT, the first unified multimodal large language model (M) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-
-
r architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that span diverse and multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing M
s, particularly in complex and temporally synchronized settings.
Hash Grid Feature Pruning
Authors: Yangzhi Ma, Bojun Liu, Jie Li, Li Li, Dong Liu
2025-12-28
Hash grids are widely used to learn an implicit neural field for Gaussian splatting, either as part of the entropy model or for inter-frame prediction. However, due to the irregular and non-uniform distribution of Gaussian splats in 3D space, numerous
regions exist, rendering many features in the hash grid invalid. This leads to redundant storage and transmission overhead. In this work, we propose a hash grid feature
method that identifies and prunes invalid features based on the coordinates of the input Gaussian splats, so that only the valid features are encoded. This approach reduces the storage size of the hash grid without compromising model performance, leading to improved rate-distortion performance. Following the Common Test Conditions (CTC) defined by the standardization committee, our method achieves an average bitrate reduction of 8% compared to the baseline approach.
3D Scene Change Modeling With Consistent Multi-View Aggregation
Authors: Zirui Zhou, Junfeng Ni, Shujie Zhang, Yixin Chen, Siyuan Huang
2025-12-28
Change detection plays a vital role in scene monitoring, exploration, and continual reconstruction. Existing 3D change detection methods often exhibit spatial inconsistency in the detected changes and fail to explicitly separate pre- and post-change states. To address these limitations, we propose SCaR-3D, a novel 3D scene change detection framework that identifies object-level changes from a dense-view pre-change image sequence and -view post-change images. Our approach consists of a signed-distance-based 2D differencing module followed by multi-view aggregation with voting and
, leveraging the consistent nature of 3DGS to robustly separate pre- and post-change states. We further develop a continual scene reconstruction strategy that selectively updates dynamic regions while pre
the unchanged areas. We also contribute CCS3D, a challenging synthetic dataset that allows flexible combinations of 3D change types to support controlled evaluations. Extensive experiments demonstrate that our method achieves both high accuracy and efficiency, outperforming existing methods.
MoR Mixture Of Representations For Mixed-Precision Training
Authors: Bor-Yiing Su, Peter Dykas, Mike Chrzanowski, Jatin Chhugani
2025-12-28
Mixed-precision training is a crucial technique for scaling deep learning models, but successful mixedprecision training requires identifying and applying the right combination of training methods. This paper presents our preliminary study on Mixture-of-Representations (MoR), a novel, per-tensor and sub-tensor level framework that dynamically analyzes a tensor's numerical properties to select between a variety of different representations. Based on the framework, we have proposed and experimented concrete algorithms that choose dynamically between FP8 and BF16 representations for both per-tensor and sub-tensor level granularities. Our universal approach is designed to preserve model quality across various
partition strategies and datasets. Our initial findings show that this approach can achieve state-of-the-art results with 98.38% of tensors
d to the FP8 format. This work highlights the potential of dynamic, property-aware
while pre
model quality. We believe this approach can generally improve the robustness of low precision training, as demonstrated by achieving FP8 accuracies that are on par with existing approaches without the need for fine-grain partitioning, or can be used in combination with other training methods to improve the leverage of even lower precision number formats such as NVFP4.
Parallel Diffusion Solver via Residual Dirichlet Policy Optimization
Authors: Ruoyu Wang, Ziyu Li, Beier Zhu, Liangyu Yuan, Hanwang Zhang, Xun Yang, Xiaojun Chang, Chi Zhang
2025-12-28
Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based methods often face significant image quality degradation under a low-latency budget, primarily due to accumulated truncation errors arising from the inability to capture high-curvature trajectory segments. In this paper, we propose the Ensemble Parallel Direction solver (dubbed as EPD-Solver), a novel ODE solver that mitigates these errors by incorporating multiple parallel gradient evaluations in each step. Motivated by the geometric insight that sampling trajectories are largely confined to a low-dimensional manifold, EPD-Solver leverages the Mean Value Theorem for vector-valued functions to approximate the integral solution more accurately. Importantly, since the additional gradient computations are independent, they can be fully parallelized, pre
low-latency sampling nature. We introduce a two-stage optimization framework. Initially, EPD-Solver optimizes a small set of learnable parameters via a distillation-based approach. We further propose a parameter-efficient Reinforcement Learning (RL) fine-tuning scheme that reformulates the solver as a stochastic Dirichlet policy. Unlike traditional methods that fine-tune the massive backbone, our RL approach operates strictly within the low-dimensional solver space, effectively mitigating reward hacking while enhancing performance in complex text-to-image (T2I) generation tasks. In addition, our method is flexible and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers.
Text-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis
Authors: Dongning Rao, Yunbiao Zeng, Zhihua Jiang, Jujian Lv
2025-12-28
Human-interaction-involved applications underscore the need for Multi-modal Sentiment Analysis (MSA). Although many approaches have been proposed to address the subtle emotions in different modalities, the power of explanations and temporal alignments is still underexplored. Thus, this paper proposes the Text-routed mixture-of-Experts model with eXplanation and Temporal alignment for MSA (TEXT). TEXT first augments explanations for MSA via Multi-modal Large Language Models (M
), and then novelly aligns the epresentations of audio and video through a temporality-oriented neural network block. TEXT aligns different modalities with explanations and facilitates a new text-routed
mixture-of-experts with gate fusion. Our temporal alignment block merges the benefits of Mamba and temporal cross-attention. As a result, TEXT achieves the best performance cross four datasets among all tested models, including three recently proposed approaches and three M
s. TEXT wins on at least four metrics out of all six metrics. For example, TEXT decreases the mean absolute error to 0.353 on the CH-SIMS dataset, which signifies a 13.5% decrement compared with recently proposed approaches.
WeDLM Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference
Authors: Aiwei Liu, Minghua He, Shaoxun Zeng, Sijun Zhang, Linhao Zhang, Chuhan Wu, Wei Jia, Yuan Liu, Xiao Zhou, Jie Zhou
2025-12-28
Autoregressive (AR) generation is the standard paradigm for Large Language Models (
s), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (D
s) offer parallel
by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., v
). A key reason is that many D
s rely on bidirectional attention, which breaks standard prefix
caching and forces repeated contextualization, undermining efficiency. We propose WeDLM, a diffusion
framework built entirely on standard causal attention to make parallel generation prefix-
friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while pre
their logical positions. Building on this property, we introduce a streaming
procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by v
under matched deployment settings, demonstrating that diffusion-style
can outperform an optimized AR engine in practice.
GHaLIB A Multilingual Framework for Hope Speech Detection in Low-Resource Languages
Authors: Ahmed Abdullah, Sana Fatima, Haroon Mahmood
2025-12-27
Hope speech has been relatively underrepresented in Natural Language Processing (NLP). Current studies are largely focused on English, which has resulted in a lack of resources for low-resource languages such as Urdu. As a result, the creation of tools that facilitate positive online remains limited. Although
-based architectures have proven to be effective in detecting hate and offensive speech, little has been done to apply them to hope speech or, more generally, to test them across a variety of linguistic settings. This paper presents a multilingual framework for hope speech detection with a focus on Urdu. Using pretrained
models such as XLM-RoBERTa, mBERT, EuroBERT, and UrduBERT, we apply simple preprocessing and train classifiers for improved results. Evaluations on the PolyHope-M 2025 benchmark demonstrate strong performance, achieving F1-scores of 95.2% for Urdu binary classification and 65.2% for Urdu multi-class classification, with similarly competitive results in Spanish, German, and English. These results highlight the possibility of implementing existing multilingual models in low-resource environments, thus making it easier to identify hope speech and helping to build a more constructive digital discourse.
Modality Inflation Energy Characterization and Optimization Opportunities for MLLM Inference
Authors: Mona Moghadampanah, Adib Rezaei Shahmirzadi, Farhana Amin, Dimitrios S. Nikolopoulos
2025-12-27
Multimodal large language models (Ms) are built on text-only
s by incorporating additional modalities, enabling multimodal understanding and a broader range of applications. However, these additions introduce a previously unexplored energy trade-off across modalities that remains poorly understood, as most prior work focuses on text-only models. In this paper, we examine modality inflation, a key source of inefficiency in which multimodal inputs increase inference workloads through extra encoding stages and expanded token sequences. We provide the first detailed, stage-level analysis of energy consumption in M
inference by breaking the pipeline into vision encoding,
, and
stages. Using four representative M
s evaluated on NVIDIA A100 GPU, we quantify the additional energy required for multimodal inference compared to text-only baselines, ob
overheads ranging from 17% to 94% across models for identical inputs. Our results show that energy bottlenecks differ widely across model architectures, stemming either from compute-heavy vision encoders or from the downstream impact of large visual token sequences during
. By examining GPU power traces, we further uncover substantial GPU underutilization during multimodal execution and show that input complexity leads to markedly different energy scaling behaviors across models. Finally, we demonstrate that stage-wise dynamic voltage and frequency scaling (DVFS) is an effective optimization, allowing energy savings with only modest performance impact. Together, these findings offer practical insights and concrete guidance for designing more energy-efficient multimodal
systems.
Fragile Knowledge, Robust Instruction-Following The Width Pruning Dichotomy in Llama-3.2
Authors: Pere Martra
2025-12-27
Structured width of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following capabilities improve substantially (+46% to +75% in IFEval for Llama-3.2-1B and 3B models), and multi-step reasoning remains robust (MUSR). This pattern challenges the prevailing assumption that
induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely
as a
metric. We provide the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = -0.864, p = 0.012 in Llama-3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA-MC2): as knowledge degrades, the model's ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW-guided width
acts as a selective filter, reducing parametric knowledge while pre
or enhancing behavioral alignment. Additionally, we quantify context-dependent efficiency trade-offs: pruned configurations achieve up to 23% reduction in energy consumption (J/token) but incur penalties in single-request latency, whereas batch processing workloads benefit uniformly.
Scaling Unverifiable Rewards A Case Study on Visual Insights
Authors: Shuyu Gan, James Mooney, Pan Hao, Renxiang Wang, Mingyi Hong, Qianwen Wang, Dongyeop Kang
2025-12-27
Large Language Model () agents can increasingly automate complex reasoning through Test-Time Scaling (TTS), iterative refinement guided by reward signals. However, many real-world tasks involve multi-stage pipeline whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to accumulate error over stages. We propose Selective TTS, a process-based refinement framework that scales inference across different stages in multi-agent pipeline, instead of repeated refinement over time by prior work. By distributing compute across stages and
low-quality branches early using process-specific judges, Selective TTS mitigates the judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end-to-end multi-agent pipeline for generating visually insightful charts and report of given dataset, and design a reliable
-based judge model, aligned with human experts (Kendall's τ=0.55). Our proposed selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance. We hope our findings serve as the first step toward to scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation.
Communication Compression for Distributed Learning with Aggregate and Server-Guided Feedback
Authors: Tomas Ortega, Chun-Yin Huang, Xiaoxiao Li, Hamid Jafarkhani
2025-12-27
Distributed learning, particularly Federated Learning (FL), faces a significant bottleneck in the cost, particularly the uplink transmission of client-to-server updates, which is often constrained by asymmetric bandwidth limits at the edge. Biased
techniques are effective in practice, but require error feedback mechanisms to provide theoretical guarantees and to ensure convergence when
is aggressive. Standard error feedback, however, relies on client-specific control variates, which violates user privacy and is incompatible with stateless clients common in large-scale FL. This paper proposes two novel frameworks that enable biased
without client-side state or control variates. The first, Compressed Aggregate Feedback (CAFe), uses the globally aggregated update from the previous round as a shared control variate for all clients. The second, Server-Guided Compressed Aggregate Feedback (CAFe-S), extends this idea to scenarios where the server possesses a small private dataset; it generates a server-guided candidate update to be used as a more accurate predictor. We consider Distributed Gradient Descent (DGD) as a representative algorithm and analytically prove CAFe's superiority to Distributed Compressed Gradient Descent (DCGD) with biased
in the non-convex regime with bounded gradient dissimilarity. We further prove that CAFe-S converges to a stationary point, with a rate that improves as the server's data become more representative. Experimental results in FL scenarios validate the superiority of our approaches over existing
schemes.
Learning Multi-Modal Mobility Dynamics for Generalized Next Location Recommendation
Authors: Junshu Dai, Yu Wang, Tongya Zheng, Wei Ji, Qinghong Guo, Ji Cao, Jie Song, Canghong Jin, Mingli Song
2025-12-27
The precise prediction of human mobility has produced significant socioeconomic impacts, such as location recommendations and evacuation suggestions. However, existing methods suffer from limited generalization capability: unimodal approaches are constrained by data and inherent biases, while multi-modal methods struggle to effectively capture mobility dynamics caused by the semantic gap between static multi-modal representation and spatial-temporal dynamics. Therefore, we leverage multi-modal spatial-temporal knowledge to characterize mobility dynamics for the location recommendation task, dubbed as \textbf{M}ulti-\textbf{M}odal \textbf{Mob}ility (\textbf{M}\textbf{ob}). First, we construct a unified spatial-temporal relational graph (STRG) for multi-modal representation, by leveraging the functional semantics and spatial-temporal knowledge captured by the large language models (
s)-enhanced spatial-temporal knowledge graph (STKG). Second, we design a gating mechanism to fuse spatial-temporal graph representations of different modalities, and propose an STKG-guided cross-modal alignment to inject spatial-temporal dynamic knowledge into the static image modality. Extensive experiments on six public datasets show that our proposed method not only achieves consistent improvements in normal scenarios but also exhibits significant generalization ability in abnormal scenarios.
RollArt Scaling Agentic RL Training via Disaggregated Infrastructure
Authors: Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang
2025-12-27
Agentic Reinforcement Learning (RL) enables Large Language Models (s) to perform autonomous decision-making and long-term planning. Unlike standard
post-training, agentic RL workloads are highly heterogeneous, combining compute-intensive
phases, bandwidth-bound
, and stateful, CPU-heavy environment simulations. We argue that efficient agentic RL training requires
d infrastructure to leverage specialized, best-fit hardware. However, naive
introduces substantial synchronization overhead and resource underutilization due to the complex dependencies between stages.
We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on
d infrastructure. RollArc is built on three core principles: (1) hardware-affinity workload mapping, which routes compute-bound and bandwidth-bound tasks to bestfit GPU devices, (2) fine-grained asynchrony, which manages execution at the trajectory level to mitigate resource bubbles, and (3) statefulness-aware computation, which offloads stateless components (e.g., reward models) to serverless infrastructure for elastic scaling. Our results demonstrate that RollArc effectively improves training throughput and achieves 1.35-2.05 end-to-end training time reduction compared to monolithic and synchronous baselines. We also evaluate RollArc by training a hundreds-of-billions-parameter MoE model for Qoder product on an Alibaba cluster with more than 3,000 GPUs, further demonstrating RollArc scalability and robustness. The code is available at https://github.com/alibaba/ROLL.
CoDS Collaborative Perception via Digital Semantic Communication
Authors: Jipeng Gan, Le Liang, Hua Zhang, Chongtao Guo, Shi Jin
2025-12-27
Semantic has been introduced into collaborative perception systems for autonomous driving, offering a promising approach to enhancing data transmission efficiency and robustness. Despite its potential, existing semantic
approaches predominantly rely on analog transmission models, rendering these systems fundamentally incompatible with the digital architecture of modern vehicle-to-everything (V2X) networks and posing a significant barrier to real-world deployment. To bridge this critical gap, we propose CoDS, a novel collaborative perception framework based on digital semantic
, designed to realize semantic-level transmission efficiency within practical digital
systems. Specifically, we develop a semantic
codec that extracts and compresses task-oriented semantic features while pre
downstream perception accuracy. Building on this, we propose a novel semantic analog-to-digital converter that converts these continuous semantic features into a discrete bitstream, ensuring integration with existing digital
pipelines. Furthermore, we develop an uncertainty-aware network (UAN) that assesses the reliability of each received feature and discards those corrupted by
failures, thereby mitigating the cliff effect of conventional channel coding schemes under low signal-to-noise ratio (SNR) conditions. Extensive experiments demonstrate that CoDS significantly outperforms existing semantic
and traditional digital
schemes, achieving state-of-the-art perception performance while ensuring compatibility with practical digital V2X systems.
Decomposing Task Vectors for Refined Model Editing
Authors: Hamed Damirchi, Ehsan Abbasnejad, Zhen Zhang, Javen Shi
2025-12-27
Large pre-trained models have transformed machine learning, yet adapting these models effectively to exhibit precise, concept-specific behaviors remains a significant challenge. Task vectors, defined as the difference between fine-tuned and pre-trained model parameters, provide a mechanism for steering neural networks toward desired behaviors. This has given rise to large repositories dedicated to task vectors tailored for specific behaviors. The arithmetic operation of these task vectors allows for the seamless combination of desired behaviors without the need for large datasets. However, these vectors often contain ping concepts that can interfere with each other during arithmetic operations, leading to unpredictable outcomes. We propose a principled decomposition method that separates each task vector into two components: one capturing shared knowledge across multiple task vectors, and another isolating information unique to each specific task. By identifying invariant subspaces across projections, our approach enables more precise control over concept manipulation without unintended amplification or diminution of other behaviors. We demonstrate the effectiveness of our decomposition method across three domains: improving multi-task merging in image classification by 5% using shared components as additional task vectors, enabling clean style mixing in diffusion models without generation degradation by mixing only the unique components, and achieving 47% toxicity reduction in language models while pre
performance on general knowledge tasks by negating the toxic information isolated to the unique component. Our approach provides a new framework for understanding and controlling task vector arithmetic, addressing fundamental limitations in model editing operations.
Role-Based Fault Tolerance System for LLM RL Post-Training
Authors: Zhenqian Chen, Baoquan Zhong, Xiang Li, Qing Dai, Xinkui Zhao, Miao Ye, Ren Cheng, Lufei Zhang, Jianwei Yin
2025-12-27
RL post-training for s has been widely scaled to enhance reasoning and tool-using capabilities. However, RL post-training interleaves training and inference workloads, exposing the system to faults from both sides. Existing fault tolerance frameworks for
s target either training or inference, leaving the optimization potential in the asynchronous execution unexplored for RL. Our key insight is role-based fault isolation so the failure in one machine does not affect the others. We treat trainer, rollout, and other management roles in RL training as distinct distributed sub-tasks. Instead of restarting the entire RL task in ByteRobust, we recover only the failed role and reconnect it to living ones, thereby eliminating the full-restart overhead including rollout replay and initialization delay.
We present RobustRL, the first comprehensive robust system to handle GPU machine errors for RL post-training Effective Training Time Ratio improvement. (1) \textit{Detect}. We implement role-aware monitoring to distinguish actual failures from role-specific behaviors to avoid the false positive and delayed detection. (2) \textit{Restart}. For trainers, we implement a non-disruptive recovery where rollouts persist state and continue trajectory generation, while the trainer is rapidly restored via rollout warm standbys. For rollout, we perform isolated machine replacement without interrupting the RL task. (3) \textit{Reconnect}. We replace static collective
with dynamic, UCX-based (Unified Communication X) point-to-point
, enabling immediate weight synchronization between recovered roles. In an RL training task on a 256-GPU cluster with Qwen3-8B-Math workload under 10\% failure injection frequency, RobustRL can achieve an ETTR of over 80\% compared with the 60\% in ByteRobust and achieves 8.4\%-17.4\% faster in end-to-end training time.
Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing
Authors: Sukhyun Jeong, Yong-Hoon Choi
2025-12-27
Text-based 3D motion generation aims to automatically synthesize diverse motions from natural-language descriptions to extend user creativity, whereas motion editing modifies an existing motion sequence in response to text while pre its overall structure. Pose-code-based frameworks such as CoMo map quantifiable pose attributes into discrete pose codes that support interpretable motion control, but their frame-wise representation struggles to capture subtle temporal dynamics and high-frequency details, often degrading reconstruction fidelity and local controllability. To address this limitation, we introduce pose-guided residual refinement for motion (PGRM), a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector
(RVQ). A pose-guided RVQ tokenizer decomposes motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variations. Residual dropout further discourages over-reliance on residuals, pre
the semantic alignment and editability of the pose codes. On top of this tokenizer, a base Transformer autoregressively predicts pose codes from text, and a refine Transformer predicts residual codes conditioned on text, pose codes, and
stage. Experiments on HumanML3D and KIT-ML show that PGRM improves Fréchet inception distance and reconstruction metrics for both generation and editing compared with CoMo and recent diffusion- and tokenization-based baselines, while user studies confirm that it enables intuitive, structure-pre
motion edits.
MEGA-PCC A Mamba-based Efficient Approach for Joint Geometry and Attribute Point Cloud Compression
Authors: Kai-Hsiang Hsieh, Monyneath Yim, Wen-Hsiao Peng, Jui-Chiu Chiang
2025-12-27
Joint of point cloud geometry and attributes is essential for efficient 3D data representation. Existing methods often rely on post-hoc recoloring procedures and manually tuned bitrate allocation between geometry and attribute bitstreams in inference, which hinders end-to-end optimization and increases system complexity. To overcome these limitations, we propose MEGA-PCC, a fully end-to-end, learning-based framework featuring two specialized models for joint
. The main
model employs a shared encoder that encodes both geometry and attribute information into a unified latent representation, followed by dual
rs that sequentially reconstruct geometry and then attributes. Complementing this, the Mamba-based Entropy Model (MEM) enhances entropy coding by capturing spatial and channel-wise correlations to improve probability estimation. Both models are built on the Mamba architecture to effectively model long-range dependencies and rich contextual features. By eliminating the need for recoloring and heuristic bitrate tuning, MEGA-PCC enables data-driven bitrate allocation during training and simplifies the overall pipeline. Extensive experiments demonstrate that MEGA-PCC achieves superior rate-distortion performance and runtime efficiency compared to both traditional and learning-based baselines, offering a powerful solution for AI-driven point cloud
.
Differentiable Inverse Modeling with Physics-Constrained Latent Diffusion for Heterogeneous Subsurface Parameter Fields
Authors: Zihan Lin, QiZhi He
2025-12-27
We present a latent diffusion-based differentiable inversion method (LD-DIM) for PDE-constrained inverse problems involving high-dimensional spatially distributed coefficients. LD-DIM couples a pretrained latent diffusion prior with an end-to-end differentiable numerical solver to reconstruct unknown heterogeneous parameter fields in a low-dimensional nonlinear manifold, improving numerical conditioning and enabling stable gradient-based optimization under observations. The proposed framework integrates a latent diffusion model (LDM), trained in a compact latent space, with a differentiable finite-volume discretization of the forward PDE. Sensitivities are propagated through the discretization using adjoint-based gradients combined with reverse-mode automatic differentiation. Inversion is performed directly in latent space, which implicitly suppresses ill-conditioned degrees of freedom while pre
dominant structural modes, including sharp material interfaces. The effectiveness of LD-DIM is demonstrated using a representative inverse problem for flow in porous media, where heterogeneous conductivity fields are reconstructed from spatially
hydraulic head measurements. Numerical experiments assess convergence behavior and reconstruction quality for both Gaussian random fields and bimaterial coefficient distributions. The results show that LD-DIM achieves consistently improved numerical stability and reconstruction accuracy of both parameter fields and corresponding PDE solutions compared with physics-informed neural networks (PINNs) and physics-embedded variational autoencoder (VAE) baselines, while maintaining sharp discontinuities and reducing sensitivity to initialization.
Nightjar Dynamic Adaptive Speculative Decoding for Large Language Models Serving
Authors: Rui Li, Zhaoning Zhang, Libo Zhang, Huaimin Wang, Xiang Fu, Zhiquan Lai
2025-12-27
Speculative (SD) accelerates
inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Current SD implementations use a fixed speculative length, failing to adapt to dynamic request rates and creating a significant performance bottleneck in real-world
scenarios. To overcome this, we propose Nightjar, a novel learning-based algorithm for adaptive speculative inference that adjusts to request load by dynamically selecting the optimal speculative length for different batch sizes and even disabling speculative
when it provides no benefit. Experiments show that Nightjar achieves up to 14.8% higher throughput and 20.2% lower latency compared to standard speculative
, demonstrating robust efficiency for real-time
.
OxygenREC An Instruction-Following Generative Framework for E-commerce Recommendation
Authors: Xuegang Hao, Ming Zhang, Alex Li, Xiangyu Qian, Zhi Ma, Yanlong Zang, Shijie Yang, Zhongxuan Han, Xiaolong Ma, Jinguang Liu, Zhen Li, Zhida Jiang, Shusheng Wang, Ning Tang, Yanchen Qiao, Chenxiang Yang, Chen Sun, Jincheng Yuan, Chunhua Peng, Heng Hu, Peijun Yang, Baopeng Yuan, Caiyun Qiu, Zhaolong Xing, Haofei Yuan, Haipeng Zhang, Yuzhang Guo, Weijie Ding, Jiahua Gao, Hao Huang, Zhen Chen, Tongxuan Liu, Pinghua Gong
2025-12-26
Traditional recommendation systems suffer from inconsistency in multi-stage optimization objectives. Generative Recommendation (GR) mitigates them through an end-to-end framework; however, existing methods still rely on matching mechanisms based on inductive patterns. Although responsive, they lack the ability to uncover complex user intents that require deductive reasoning based on world knowledge. Meanwhile, s show strong deep reasoning capabilities, but their latency and computational costs remain challenging for industrial applications. More critically, there are performance bottlenecks in multi-scenario scalability: as shown in Figure 1, existing solutions require independent training and deployment for each scenario, leading to low resource utilization and high maintenance costs-a challenge unaddressed in GR literature. To address these, we present OxygenREC, an industrial recommendation system that leverages Fast-Slow Thinking to deliver deep reasoning with strict latency and multi-scenario requirements of real-world environments. First, we adopt a Fast-Slow Thinking architecture. Slow thinking uses a near-line
pipeline to synthesize Contextual Reasoning Instructions, while fast thinking employs a high-efficiency encoder--
r backbone for real-time generation. Second, to ensure reasoning instructions effectively enhance recommendation generation, we introduce a semantic alignment mechanism with Instruction-Guided Retrieval (IGR) to filter intent-relevant historical behaviors and use a Query-to-Item (Q2I) loss for instruction-item consistency. Finally, to resolve multi-scenario scalability, we transform scenario information into controllable instructions, using unified reward mapping and Soft Adaptive Group Clip Policy Optimization (SA-GCPO) to align policies with diverse business objectives, realizing a train-once-deploy-everywhere paradigm.
Charge-Informed Quantum Error Correction
Authors: Vlad Temkin, Zack Weinstein, Ruihua Fan, Daniel Podolsky, Ehud Altman
2025-12-26
We investigate the statistical physics of quantum error correction in symmetry-enriched topological quantum memories. Starting from a phenomenological error model of charge-con noise, we study the optimal
r assuming the local charges of each anyon can be measured. The error threshold of the optimal
r corresponds to a continuous phase transition in a disordered two-dimensional integer loop model on the Nishimori line. Using an effective replica field theory analysis and Monte Carlo numerics, we show that the optimal
transition exhibits Berezinskii-Kosterlitz-Thouless universality with a modified universal jump in winding number variance. We further generalize the model beyond the Nishimori line, which defines a large class of suboptimal
rs. At low nonzero temperatures and strong disorder, we find numerical evidence of a disorder-dominated loop-glass phase which corresponds to a "confidently incorrect"
r. The zero-temperature limit defines the minimum-cost flow
r, which serves as the analog of minimum-weight perfect matching in topological codes. Both the optimal and minimum-cost flow
rs are shown to dramatically outperform the charge-agnostic optimal
r in symmetry-enriched topological codes.
Yume-1.5 A Text-Controlled Interactive World Generation Model
Authors: Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang
2025-12-26
Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance and lack text-controlled generation capabilities. To address these challenges, we propose \method, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. \method achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context with linear attention; (2) a real-time streaming
strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.
Prefill vs. Decode Bottlenecks SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling
Authors: Hannah Atmer, Yuan Yao, Thiemo Voigt, Stefanos Kaxiras
2025-12-26
Energy consumption dictates the cost and environmental impact of deploying Large Language Models. This paper investigates the impact of on-chip SRAM size and operating frequency on the energy efficiency and performance of inference, focusing on the distinct behaviors of the compute-bound
and memory-bound
phases. Our simulation methodology combines OpenRAM for energy modeling,
Compass for latency simulation, and ScaleSIM for systolic array operational intensity. Our findings show that total energy use is predominantly determined by SRAM size in both phases, with larger buffers significantly increasing static energy due to leakage, which is not offset by corresponding latency benefits. We quantitatively explore the memory-bandwidth bottleneck, demonstrating that while high operating frequencies reduce
latency, their positive impact on memory-bound
latency is capped by the external memory bandwidth. Counter-intuitively, high compute frequency can reduce total energy by reducing execution time and consequently decreasing static energy consumption more than the resulting dynamic power increase. We identify an optimal hardware configuration for the simulated workload: high operating frequencies (1200MHz-1400MHz) and a small local buffer size of 32KB to 64KB. This combination achieves the best energy-delay product, balancing low latency with high energy efficiency. Furthermore, we demonstrate how memory bandwidth acts as a performance ceiling, and that increasing compute frequency only yields performance gains up to the point where the workload becomes memory-bound. This analysis provides concrete architectural insights for designing energy-efficient
accelerators, especially for datacenters aiming to minimize their energy overhead.
Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models
Authors: Zongmin Zhang, Zhen Sun, Yifan Liao, Wenhan Dong, Xinlei He, Xingshuo Han, Shengmin Xu, Xinyi Huang
2025-12-26
Prompt-driven Video Segmentation Foundation Models (VSFMs) such as SAM2 are increasingly deployed in applications like autonomous driving and digital pathology, raising concerns about backdoor threats. Surprisingly, we find that directly transferring classic backdoor attacks (e.g., BadNet) to VSFMs is almost ineffective, with ASR below 5\%. To understand this, we study encoder gradients and attention maps and observe that conventional training keeps gradients for clean and triggered samples largely aligned, while attention still focuses on the true object, preventing the encoder from learning a distinct trigger-related representation. To address this challenge, we propose BadVSFM, the first backdoor framework tailored to prompt-driven VSFMs. BadVSFM uses a two-stage strategy: (1) steer the image encoder so triggered frames map to a designated target embedding while clean frames remain aligned with a clean reference encoder; (2) train the mask r so that, across prompt types, triggered frame-prompt pairs produce a shared target mask, while clean outputs stay close to a reference
r. Extensive experiments on two datasets and five VSFMs show that BadVSFM achieves strong, controllable backdoor effects under diverse triggers and prompts while pre
clean segmentation quality. Ablations over losses, stages, targets, trigger settings, and poisoning rates demonstrate robustness to reasonable hyperparameter changes and confirm the necessity of the two-stage design. Finally, gradient-conflict analysis and attention visualizations show that BadVSFM separates triggered and clean representations and shifts attention to trigger regions, while four representative defenses remain largely ineffective, revealing an underexplored vulnerability in current VSFMs.
Robust Federated Fine-Tuning in Heterogeneous Networks with Unreliable Connections An Aggregation View
Authors: Yanmeng Wang, Zhiwen Dai, Shuai Wang, Jian Zhou, Fu Xiao, Tony Q. S. Quek, Tsung-Hui Chang
2025-12-26
Federated Fine-Tuning (FFT) has attracted growing interest as it leverages both server- and client-side data to enhance global model generalization while pre privacy, and significantly reduces the computational burden on edge devices by avoiding training from scratch. Despite these advantages, FFT performance is often degraded by unreliable server-client connections and heterogeneous client data distributions. Most existing methods assume homogeneous network conditions or require prior knowledge of connection failures. However, these assumptions are impractical in real-world networks characterized by diverse
standards (e.g., wired, Wi-Fi, 4G, and 5G) and heterogeneous failure patterns. To address these limitations, we propose FedAuto, a novel FFT framework that mitigates the combined effects of connection failures and data heterogeneity via adaptive aggregation. FedAuto operates without prior knowledge of network conditions or modifications to existing infrastructure, enabling seamless plug-and-play deployment. Moreover, we establish a rigorous convergence guarantee for FedAuto. By adopting a novel per-round aggregation perspective, our analysis removes the need for assumptions on connection failures probabilities or client selection strategies commonly imposed in prior work, and guarantees convergence of FedAuto for each individual realization, providing a stronger theoretical assurance. Extensive experiments demonstrate that FedAuto consistently outperforms state-of-the-art baselines under diverse connection failure scenarios for both full-parameter and partial-parameter fine-tuning (e.g., LoRA), and even surpasses strategies that rely on complex
resource optimization.
Direction Finding with Sparse Arrays Based on Variable Window Size Spatial Smoothing
Authors: Wesley S. Leite, Rodrigo C. de Lamare, Yuriy Zakharov, Wei Liu, Martin Haardt
2025-12-26
In this work, we introduce a variable window size (VWS) spatial smoothing framework that enhances coarray-based direction of arrival (DOA) estimation for linear arrays. By compressing the smoothing aperture, the proposed VWS Coarray MUSIC (VWS-CA-MUSIC) and VWS Coarray root-MUSIC (VWS-CA-rMUSIC) algorithms replace part of the perturbed rank-one outer products in the smoothed coarray data with unperturbed low-rank additional terms, increasing the separation between signal and noise subspaces, while pre
the signal subspace span. We also derive the bounds that guarantees identifiability, by limiting the values that can be assumed by the
parameter. Simulations with
geometries reveal significant performance improvements and complexity savings relative to the fixed-window coarray MUSIC method.
Look Closer! An Adversarial Parametric Editing Framework for Hallucination Mitigation in VLMs
Authors: Jiayu Hu, Beibei Li, Jiangwei Xia, Yanjun Qin, Bing Ji, Zhongshi He
2025-12-26
While Vision-Language Models (VLMs) have garnered increasing attention in the AI community due to their promising practical applications, they exhibit persistent hallucination issues, generating outputs misaligned with visual inputs. Recent studies attribute these hallucinations to VLMs' over-reliance on linguistic priors and insufficient visual feature integration, proposing heuristic calibration strategies to mitigate them. However, the non-trainable nature of these strategies inherently limits their optimization potential. To this end, we propose an adversarial parametric editing framework for Hallucination mitigation in VLMs, which follows an \textbf{A}ctivate-\textbf{L}ocate-\textbf{E}dit \textbf{A}dversarially paradigm. Specifically, we first construct an activation dataset that comprises grounded responses (positive samples attentively anchored in visual features) and hallucinatory responses (negative samples reflecting
prior bias and internal knowledge artifacts). Next, we identify critical hallucination-prone parameter clusters by analyzing differential hidden states of response pairs. Then, these clusters are fine-tuned using prompts injected with adversarial tuned prefixes that are optimized to maximize visual neglect, thereby forcing the model to prioritize visual evidence over inherent parametric biases. Evaluations on both generative and discriminative VLM tasks demonstrate the significant effectiveness of ALEAHallu in alleviating hallucinations. Our code is available at https://github.com/hujiayu1223/ALEAHallu.
BLEST Blazingly Efficient BFS using Tensor Cores
Authors: Deniz Elbek, Kamer Kaya
2025-12-26
Breadth-First Search (BFS) is a fundamental graph kernel that underpins a wide range of applications. While modern GPUs provide specialised Matrix-Multiply-Accumulate (MMA) units, e.g., Tensor Cores (TC), with extremely high throughput, they target dense operations, making it non-trivial to exploit them for irregular, unstructured graph computations. In particular, fully utilising them for a BFS requires an efficient mapping of the edge operations onto TCs while avoiding redundancy, load imbalance, and synchronisation. We present BLEST, a TC-accelerated framework that reformulates the pull-based BFS pipeline around a bitmap-oriented structure and a carefully engineered execution layout. BLEST introduces Binarised Virtual Slice Sets (BVSS) to enforce warp-level load balancing and to eliminate frontier-oblivious work assignment. To improve both memory efficiency and update locality across diverse graphs, we apply two complementary graph reordering strategies: a -oriented ordering for social-like graphs and a bandwidth-reducing ordering for non-social graphs. At the compute level, we develop a batched SpMSpV multiplication pattern that uses the bitwise TC tiles to handle dot products without wasting output entries, thereby reducing the number of required MMA calls. Finally, BLEST combines kernel fusion with a lazy vertex update scheme to reduce host-side synchronisation, mitigate atomic overheads, and improve
locality. Experiments show that BLEST delivers, on average, , and speedup over BerryBees, Gunrock, and GSWITCH, respectively, across a broad set of real-world graphs.
GQ-VAE A gated quantized VAE for learning variable length tokens
Authors: Theo Datta, Kayla Huang, Sham Kakade, David Brandfonbrener
2025-12-26
While most frontier models still use deterministic frequency-based tokenization algorithms such as byte-pair encoding (BPE), there has been significant recent work to design learned neural tokenizers. However, these schemes generally add to underlying language model complexity and force large changes to architecture, making them hard to implement at large scales. To overcome these challenges, we propose the gated d variational autoencoder (GQ-VAE), a novel architecture that can be independently pre-trained to serve as a drop-in replacement for existing tokenizers. The key innovation of the architecture is to learn to encode variable-length discrete tokens. GQ-VAE improves
and language modeling performance over a standard VQ-VAE tokenizer, and approaches the
rate and language modeling performance of BPE. Interestingly, if we use BPE with a smaller vocabulary, such that the
is equivalent between GQ-VAE and BPE, we find that GQ-VAE improves downstream language model learning. We conclude with a discussion of several exciting avenues for future work. Code can be found at https://github.com/Theo-Datta-115/gq-vae.
Accelerate Speculative Decoding with Sparse Computation in Verification
Authors: Jikai Wang, Jianchao Tan, Yuxuan Hu, Jiayu Qin, Yerui Sun, Yuchen Xie, Xunliang Cai, Juntao Li, Min Zhang
2025-12-26
Speculative accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. However, the verification stage often becomes the dominant computational bottleneck, especially for long-context inputs and mixture-of-experts (MoE) models. Existing sparsification methods are designed primarily for standard token-by-token autoregressive
to remove substantial computational redundancy in
s. This work systematically adopts different
methods on the verification stage of the speculative
and identifies structured redundancy across multiple dimensions. Based on these observations, we propose a
verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost. The framework further incorporates an inter-draft token and inter-layer retrieval reuse strategy to further reduce redundant computation without introducing additional training. Extensive experiments across summarization, question answering, and mathematical reasoning datasets demonstrate that the proposed methods achieve favorable efficiency-accuracy trade-offs, while maintaining stable acceptance length.
LLMBoost Make Large Language Models Stronger with Boosting
Authors: Zehao Chen, Tianxiang Ai, Yifei Li, Gongxun Li, Yuyang Wei, Wang Zhou, Guanghui Li, Bin Yu, Zhijun Chen, Hailong Sun, Fuzhen Zhuang, Jianxin Li, Deqing Wang, Yikun Ban
2025-12-26
Ensemble learning of s has emerged as a promising alternative to enhance performance, but existing approaches typically treat models as black boxes, combining the inputs or final outputs while overlooking the rich internal representations and interactions across models.In this work, we introduce
Boost, a novel ensemble fine-tuning framework that breaks this barrier by explicitly leveraging intermediate states of
s. Inspired by the boosting paradigm,
Boost incorporates three key innovations. First, a cross-model attention mechanism enables successor models to access and fuse hidden states from predecessors, facilitating hierarchical error correction and knowledge transfer. Second, a chain training paradigm progressively fine-tunes connected models with an error-suppression objective, ensuring that each model rectifies the mispredictions of its predecessor with minimal additional computation. Third, a near-parallel inference paradigm design pipelines hidden states across models layer by layer, achieving inference efficiency approaching single-model
. We further establish the theoretical foundations of
Boost, proving that sequential integration guarantees monotonic improvements under bounded correction assumptions. Extensive experiments on commonsense reasoning and arithmetic reasoning tasks demonstrate that
Boost consistently boosts accuracy while reducing inference latency.
MMCTOP A Multimodal Textualization and Mixture-of-Experts Framework for Clinical Trial Outcome Prediction
Authors: Carolina Aparício, Qi Shi, Bo Wen, Tesfaye Yadete, Qiwei Han
2025-12-26
Addressing the challenge of multimodal data fusion in high-dimensional biomedical informatics, we propose MMCTOP, a MultiModal Clinical-Trial Outcome Prediction framework that integrates heterogeneous biomedical signals spanning (i) molecular structure representations, (ii) protocol metadata and long-form eligibility narratives, and (iii) disease ontologies. MMCTOP couples schema-guided textualization and input-fidelity validation with modality-aware representation learning, in which domain-specific encoders generate aligned embeddings that are fused by a backbone augmented with a drug-disease-conditioned
Mixture-of-Experts (SMoE). This design explicitly supports specialization across therapeutic and design subspaces while maintaining scalable computation through top-k routing. MMCTOP achieves consistent improvements in precision, F1, and AUC over unimodal and multimodal baselines on benchmark datasets, and ablations show that schema-guided textualization and selective expert routing contribute materially to performance and stability. We additionally apply temperature scaling to obtain calibrated probabilities, ensuring reliable risk estimation for downstream decision support. Overall, MMCTOP advances multimodal trial modeling by combining controlled narrative normalization, context-conditioned expert fusion, and operational safeguards aimed at auditability and reproducibility in biomedical informatics.
TimeBill Time-Budgeted Inference for Large Language Models
Authors: Qi Fan, An Zou, Yehan Ma
2025-12-26
Large Language Models (s) are increasingly deployed in time-critical systems, such as robotics, autonomous driving, embodied intelligence, and industrial automation, where generating accurate responses within a given time budget is crucial for decision-making, control, or safety-critical tasks. However, the auto-regressive generation process of
s makes it challenging to model and estimate the end-to-end execution time. Furthermore, existing efficient inference methods based on a fixed key-value (
)
eviction ratio struggle to adapt to varying tasks with diverse time budgets, where an improper eviction ratio may lead to incomplete inference or a drop in response performance. In this paper, we propose TimeBill, a novel time-budgeted inference framework for
s that balances the inference efficiency and response performance. To be more specific, we propose a fine-grained response length predictor (RLP) and an execution time estimator (ETE) to accurately predict the end-to-end execution time of
s. Following this, we develop a time-budgeted efficient inference approach that adaptively adjusts the
eviction ratio based on execution time prediction and the given time budget. Finally, through extensive experiments, we demonstrate the advantages of TimeBill in improving task completion rate and maintaining response performance under various overrun strategies.
Fast Inference of Visual Autoregressive Model with Adjacency-Adaptive Dynamical Draft Trees
Authors: Haodong Lei, Hongsong Wang, Xin Geng, Liang Wang, Pan Zhou
2025-12-26
Autoregressive (AR) image models achieve diffusion-level quality but suffer from sequential inference, requiring approximately 2,000 steps for a 576x576 image. Speculative with draft trees accelerates
s yet underperforms on visual AR models due to spatially varying token prediction difficulty. We identify a key obstacle in applying speculative
to visual AR models: inconsistent acceptance rates across draft trees due to varying prediction difficulties in different image regions. We propose Adjacency-Adaptive Dynamical Draft Trees (ADT-Tree), an adjacency-adaptive dynamic draft tree that dynamically adjusts draft tree depth and width by leveraging adjacent token states and prior acceptance rates. ADT-Tree initializes via horizontal adjacency, then refines depth/width via bisectional adaptation, yielding deeper trees in simple regions and wider trees in complex ones. The empirical evaluations on MS-COCO 2017 and PartiPrompts demonstrate that ADT-Tree achieves speedups of 3.13xand 3.05x, respectively. Moreover, it integrates seamlessly with relaxed sampling methods such as LANTERN, enabling further
. Code is available at https://github.com/Haodong-Lei-Ray/ADT-Tree.
LIMEAccelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices
Authors: Mingyu Sun, Xiao Zhang, Shen Qu, Yan Li, Mengbai Xiao, Yuan Yuan, Dongxiao Yu
2025-12-26
Large language models (s) have emerged as a powerful foundation for intelligent reasoning and decision-making, demonstrating substantial impact across a wide range of domains and applications. However, their massive parameter scales and substantial resource demands pose critical challenges for efficient inference on edge devices. These devices are inherently constrained by limited computational power and memory capacity, while bandwidth bottlenecks at the network edge further restrict distributed deployment and real-time responsiveness. Although existing research has explored lightweight optimization techniques to mitigate memory limitations, such approaches often incur significant degradation in model accuracy and performance. To address these challenges, we propose LIME, a collaborative system that enables lossless inference for large models across multiple memory-constrained edge devices under limited network bandwidth. LIME employs an interleaved pipeline parallelism in conjunction with model offloading to dynamically balance computation and
. Furthermore, a fine-grained offline allocation scheduler and online memory adaptation strategy are introduced to enhance the device's computing and storage resources while minimizing inference latency. Extensive experiments demonstrate that LIME, deployed on four heterogeneous Nvidia Jetson edge devices for LLaMA3.3-70B-Instruct model inference, achieves 1.7 and 3.7 speedups over state-of-the-art baselines under sporadic and bursty request patterns respectively, without compromising model accuracy.