2026-01-16

Detecting Winning Arguments with Large Language Models and Persuasion Strategies
PACEvolve Enabling Long-Horizon Progress-Aware Consistent Evolution
Supergravity with Lagrange Multiplier Fields in 2 + 1 Dimensions
Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing
Communication-Efficient Federated Learning by Exploiting Spatio-Temporal Correlations of Gradients
Energy-Efficient Probabilistic Semantic Communication Over Visible Light Networks With Rate Splitting
Placement Delivery Array for Cache-Aided MIMO Systems
An effective interactive brain cytoarchitectonic parcellation framework using pretrained foundation model
TF3-RO-50M Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction
Toward Ultra-Long-Horizon Agentic Science Cognitive Accumulation for Machine Learning Engineering
LatentRefusal Latent-Signal Refusal for Unanswerable Text-to-SQL Queries
Online identification of nonlinear time-varying systems with uncertain information
Global Context Compression with Interleaved Vision-Text Transformation
Towards Efficient Low-rate Image Compression with Frequency-aware Diffusion Prior Refinement
Joint Bayesian inference of Earth's magnetic field and core surface flow on millennial timescales
Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning
In-Context Source and Channel Coding
STEAMROLLER A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter
LOOKAT Lookup-Optimized Key-Attention for Memory-Efficient Transformers
TopoDIM One-shot Topology Generation of Diverse Interaction Modes for Multi-Agent Systems
Sparse-RL Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts
Privacy Enhanced PEFT Tensor Train Decomposition Improves Privacy Utility Tradeoffs under DP-SGD
Towards Native Intelligence 6G-LLM Trained with Reinforcement Learning from NDT Feedback
Learning to Decode in Parallel Self-Coordinating Neural Network for Real-Time Quantum Error Correction
Advancing Model Refinement Muon-Optimized Distillation and Quantization for LLM Deployment
MedRedFlag Investigating how LLMs Redirect Misconceptions in Real-World Health Communication
LLM-Based Agentic Systems for Software Engineering Challenges and Opportunities
ShortCoder Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation
Empathy Applicability Modeling for General Health Queries
LLMs can Compress LLMs Adaptive Pruning by Agents
Parallaxes, Proper Motions, and Near-Infrared Photometry for 173 L and T Dwarfs From The US Naval Observatory Infrared Astrometry Program
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
OpenVoxel Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats
Private LLM Inference on Consumer Blackwell GPUs A Practical Guide for Cost-Effective Local Deployment in SMEs
Engineering Compressed Matrix Multiplication with the Fast Walsh-Hadamard Transform
Analysis of the Maximum Prediction Gain of Short-Term Prediction on Sustained Speech
SC-MAS Constructing Cost-Efficient Multi-Agent Systems with Edge-Level Heterogeneous Collaboration
Spectral Complex Autoencoder Pruning A Fidelity-Guided Criterion for Extreme Structured Channel Compression
See More, Store Less Memory-Efficient Resolution for Video Moment Retrieval
Range-Doppler-Acceleration Estimation for Fast-Moving and Accelerating Targets
Multi-Modal LLM based Image Captioning in ICT Bridging the Gap Between General and Industry Domain
Cluster Workload Allocation Semantic Soft Affinity Using Natural Language Processing
STaR Sensitive Trajectory Regulation for Unlearning in Large Reasoning Models
Coordinated Pandemic Control with Large Language Model Agents as Policymaking Assistants
BrainSegNet A Novel Framework for Whole-Brain MRI Parcellation Enhanced by Large Models
A Theoretical Framework for Rate-Distortion Limits in Learned Image Compression
DSA-Tokenizer Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion
$D^2Prune$ Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness
Data-Driven Exploration and Insights into Temperature-Dependent Phonons in Inorganic Materials
AviationLMM A Large Multimodal Foundation Model for Civil Aviation
Hidden States as Early Signals Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling
Exploring Reliable Spatiotemporal Dependencies for Efficient Visual Tracking
Depth-Wise Representation Development Under Blockwise Self-Supervised Learning for Video Vision Transformers
Layer-Parallel Training for Transformers
Universal Latent Homeomorphic Manifolds Cross-Domain Representation Learning via Homeomorphism Verification
Synthetic Data for Veterinary EHR De-identification Benefits, Limits, and Safety Trade-offs Under Fixed Compute

Detecting Winning Arguments with Large Language Models and Persuasion Strategies

Authors: Tiziano Labruna, Arkadiusz Modzelewski, Giorgio Satta, Giovanni Da San Martino

2026-01-15

http://arxiv.org/abs/2601.10660v1

Detecting persuasion in argumentative text is a challenging task with important implications for understanding human communication. This work investigates the role of persuasion strategies - such as Attack on reputation, Distraction, and Manipulative wording - in determining the persuasiveness of a text. We conduct experiments on three annotated argument datasets: Winning Arguments (built from the Change My View subreddit), Anthropic/Persuasion, and Persuasion for Good. Our approach leverages large language models (LLMs) with a Multi-Strategy Persuasion Scoring approach that guides reasoning over six persuasion strategies. Results show that strategy-guided reasoning improves the prediction of persuasiveness. To better understand the influence of content, we organize the Winning Arguments dataset into broad discussion topics and analyze performance across them. We publicly release this topic-annotated version of the dataset to facilitate future research. Overall, our methodology demonstrates the value of structured, strategy-aware prompting for enhancing interpretability and robustness in argument quality assessment.

PACEvolve Enabling Long-Horizon Progress-Aware Consistent Evolution

Authors: Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, Chi Wang, Ed H. Chi, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang

2026-01-15

http://arxiv.org/abs/2601.10657v1

Large Language Models (LLMs) have emerged as powerful operators for evolutionary search, yet the design of efficient search scaffolds remains ad hoc. While promising, current LLM-in-the-loop systems lack a systematic approach to managing the evolutionary process. We identify three distinct failure modes: Context Pollution, where experiment history biases future candidate generation; Mode Collapse, where agents stagnate in local minima due to poor exploration-exploitation balance; and Weak Collaboration, where rigid crossover strategies fail to leverage parallel search trajectories effectively. We introduce Progress-Aware Consistent Evolution (PACEvolve), a framework designed to robustly govern the agent's context and search dynamics, to address these challenges. PACEvolve combines hierarchical context management (HCM) with pruning to address context pollution; momentum-based backtracking (MBB) to escape local minima; and a self-adaptive sampling policy that unifies backtracking and crossover for dynamic search coordination (CE), allowing agents to balance internal refinement with cross-trajectory collaboration. We demonstrate that PACEvolve provides a systematic path to consistent, long-horizon self-improvement, achieving state-of-the-art results on LLM-SR and KernelBench, while discovering solutions surpassing the record on Modded NanoGPT.

Supergravity with Lagrange Multiplier Fields in 2 + 1 Dimensions

Authors: D. G. C. McKeon, F. T. Brandt, J. Frenkel, S. Martins-Filho

2026-01-15

http://arxiv.org/abs/2601.10593v1

We examine the first-order Einstein-Cartan (EC) action in 2+1 dimensions, including a cosmological term and its supersymmetric extension. In this setting the spin connection can be expressed as an axial vector, yielding an action that is bilinear in the quantum fields and allows quantization without background fields. We identify the complete set of first-class constraints and derive the associated gauge transformations, which differ from the standard diffeomorphism and local Lorentz invariances. Using the closed gauge algebra, we construct the Faddeev-Popov-Nielsen path integral and show how a Lagrange multiplier field can be introduced to remove higher-loop contributions while preserving unitarity and gauge invariance.

Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing

Authors: Yinzhi Zhao, Ming Wang, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang

2026-01-15

http://arxiv.org/abs/2601.10543v1

Large language models (LLMs) have achieved impressive performance across natural language tasks and are increasingly deployed in real-world applications. Despite extensive safety alignment efforts, recent studies show that such alignment is often shallow and remains vulnerable to jailbreak attacks. Existing defense mechanisms, including decoding-based constraints and post-hoc content detectors, struggle against sophisticated jailbreaks, often intervening robust detection or excessively degrading model utility. In this work, we examine the decoding process of LLMs and make a key observation: even when successfully jailbroken, models internally exhibit latent safety-related signals during generation. However, these signals are overridden by the model's drive for fluent continuation, preventing timely self-correction or refusal. Building on this observation, we propose a simple yet effective approach that explicitly surfaces and leverages these latent safety signals for early detection of unsafe content during decoding. Experiments across diverse jailbreak attacks demonstrate that our approach significantly enhances safety, while maintaining low over-refusal rates on benign inputs and preserving response quality. Our results suggest that activating intrinsic safety-awareness during decoding offers a promising and complementary direction for defending against jailbreak attacks. Code is available at: https://github.com/zyz13590/SafeProbing.

Communication-Efficient Federated Learning by Exploiting Spatio-Temporal Correlations of Gradients

Authors: Shenlong Zheng, Zhen Zhang, Yuhui Deng, Geyong Min, Lin Cui

2026-01-15

http://arxiv.org/abs/2601.10491v1

Communication overhead is a critical challenge in federated learning, particularly in bandwidth-constrained networks. Although many methods have been proposed to reduce communication overhead, most focus solely on compressing individual gradients, overlooking the temporal correlations among them. Prior studies have shown that gradients exhibit spatial correlations, typically reflected in low-rank structures. Through empirical analysis, we further observe a strong temporal correlation between client gradients across adjacent rounds. Based on these observations, we propose GradESTC, a compression technique that exploits both spatial and temporal gradient correlations. GradESTC exploits spatial correlations to decompose each full gradient into a compact set of basis vectors and corresponding combination coefficients. By exploiting temporal correlations, only a small portion of the basis vectors need to be dynamically updated in each round. GradESTC significantly reduces communication overhead by transmitting lightweight combination coefficients and a limited number of updated basis vectors instead of the full gradients. Extensive experiments show that, upon reaching a target accuracy level near convergence, GradESTC reduces uplink communication by an average of 39.79% compared to the strongest baseline, while maintaining comparable convergence speed and final accuracy to uncompressed FedAvg. By effectively leveraging spatio-temporal gradient structures, GradESTC offers a practical and scalable solution for communication-efficient federated learning.
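
As a rough illustration of why sending coefficients plus a few refreshed basis vectors is cheaper than sending the full gradient, the Python sketch below counts the floats uploaded per round. It assumes, hypothetically, that a layer gradient is reshaped to an m x n matrix and approximated by k basis vectors of length m together with a k x n coefficient matrix. The function name and the example sizes are illustrative and are not taken from the paper.

```python
def uplink_floats(m, n, k, updated):
    """Floats sent per round: (full gradient, basis-plus-coefficients scheme).

    Assumed layout (hypothetical, not from the paper): an m x n gradient is
    approximated by k basis vectors of length m and a k x n coefficient matrix.
    Only `updated` basis vectors are retransmitted in a given round.
    """
    full = m * n
    compressed = k * n + updated * m
    return full, compressed


# Illustrative sizes only.
full, compressed = uplink_floats(m=512, n=512, k=16, updated=2)
print(full, compressed)
```

With these illustrative sizes the per-round payload drops from 262144 to 9216 floats. The real saving depends on the rank and on how many basis vectors actually change between rounds.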

Energy-Efficient Probabilistic Semantic Communication Over Visible Light Networks With Rate Splitting

Authors: Zhouxiang Zhao, Zhaohui Yang, Mingzhe Chen, Chen Zhu, Xin Tong, Zhaoyang Zhang

2026-01-15

http://arxiv.org/abs/2601.10452v1

Visible light communication (VLC) is emerging as a key technology for future wireless systems due to its unique physical-layer advantages over traditional radio-frequency (RF)-based systems. However, its integration with higher-layer techniques, such as semantic communication, remains underexplored. This paper investigates the energy efficiency maximization problem in a resource-constrained VLC-based probabilistic semantic communication (PSCom) system. In the considered model, light-emitting diode (LED) transmitters perform semantic compression to reduce data size, which incurs additional computation overhead. The compressed semantic information is transmitted to the users for semantic inference using a shared knowledge base that requires periodic updates to ensure synchronization. In the PSCom system, the knowledge base is represented by probabilistic graphs. To enable simultaneous transmission of both knowledge and information data, rate splitting multiple access (RSMA) is employed. The optimization problem focuses on maximizing energy efficiency by jointly optimizing transmit beamforming, direct current (DC) bias, common rate allocation, and semantic compression ratio, while accounting for both communication and computation costs. To solve this problem, an alternating optimization algorithm based on successive convex approximation (SCA) and the Dinkelbach method is developed. Simulation results demonstrate the effectiveness of the proposed approach.
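
The Dinkelbach method mentioned above turns a ratio objective such as energy efficiency into a sequence of subtractive problems. A minimal Python sketch on a toy problem follows. The rate and power functions, the circuit-power constant, and the grid of candidate powers are assumptions made for illustration; the paper's actual problem also involves beamforming, DC bias, and SCA steps that are not shown here.

```python
import math


def dinkelbach(candidates, f, g, tol=1e-9, max_iter=100):
    """Maximize f(x) / g(x) over a finite candidate set, assuming g(x) > 0."""
    lam = 0.0
    best = candidates[0]
    for _ in range(max_iter):
        # Subtractive subproblem: maximize f(x) - lam * g(x).
        best = max(candidates, key=lambda x: f(x) - lam * g(x))
        if abs(f(best) - lam * g(best)) < tol:
            break  # lam is (numerically) the optimal ratio
        lam = f(best) / g(best)
    return best, lam


# Toy energy-efficiency problem (all values hypothetical).
CIRCUIT_POWER = 1.0
powers = [0.1 * k for k in range(1, 101)]


def rate(p):
    """Illustrative achievable rate for transmit power p."""
    return math.log2(1.0 + p)


def total_power(p):
    """Illustrative total consumed power: transmit plus circuit power."""
    return p + CIRCUIT_POWER


best_power, best_efficiency = dinkelbach(powers, rate, total_power)
```

On a finite candidate set the loop stops after a few iterations, and the returned ratio matches a brute-force search over the same grid.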

Placement Delivery Array for Cache-Aided MIMO Systems

Authors: Yifei Huang, Kai Wan, Minquan Cheng, Jinyan Wang, Giuseppe Caire

2026-01-15

http://arxiv.org/abs/2601.10422v1

We consider a $(G,L,K,M,N)$ cache-aided multiple-input multiple-output (MIMO) network, where a server equipped with $L$ antennas and a library of $N$ equal-size files communicates with $K$ users, each equipped with $G$ antennas and a cache of size $M$ files, over a wireless interference channel. Each user requests an arbitrary file from the library. The goal is to design coded caching schemes that simultaneously achieve the maximum sum degrees of freedom (sum-DoF) and low subpacketization. In this paper, we first introduce a unified combinatorial structure, termed the MIMO placement delivery array (MIMO-PDA), which characterizes uncoded placement and one-shot zero-forcing delivery. By analyzing the combinatorial properties of MIMO-PDAs, we derive a sum-DoF upper bound of $\min\{KG, Gt+G\lceil L/G \rceil\}$, where $t=KM/N$, which coincides with the optimal DoF characterization in prior work by Tehrani et al. Based on this upper bound, we present two novel constructions of MIMO-PDAs that achieve the maximum sum-DoF. The first construction achieves linear subpacketization under stringent parameter constraints, while the second achieves ordered exponential subpacketization under substantially milder constraints. Theoretical analysis and numerical comparisons demonstrate that the second construction exponentially reduces subpacketization compared to existing schemes while preserving the maximum sum-DoF.
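
The sum-DoF upper bound stated above can be evaluated directly. The short Python helper below is only a sketch of that formula; the function name and the example parameters are illustrative and do not come from the paper.

```python
def sum_dof_upper_bound(G, L, K, M, N):
    """Evaluate min{K*G, G*t + G*ceil(L/G)} with t = K*M/N."""
    t = K * M / N
    ceil_l_over_g = -(-L // G)  # integer ceiling of L / G
    return min(K * G, G * t + G * ceil_l_over_g)


# Example (hypothetical parameters): 10 users with 2 antennas each,
# a 4-antenna server, and caches holding 2 of 10 files, so t = 2.
print(sum_dof_upper_bound(G=2, L=4, K=10, M=2, N=10))
```

For these parameters the second term, 2*2 + 2*2 = 8, is smaller than KG = 20, so the bound is 8.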

An effective interactive brain cytoarchitectonic parcellation framework using pretrained foundation model

Authors: Shiqi Zhang, Fang Xu, Pengcheng Zhou

2026-01-15

http://arxiv.org/abs/2601.10412v1

Cytoarchitectonic mapping provides anatomically grounded parcellations of brain structure and forms a foundation for integrative, multi-modal neuroscience analyses. These parcellations are defined based on the shape, density, and spatial arrangement of neuronal cell bodies observed in histological imaging. Recent works have demonstrated the potential of using deep learning models toward fully automatic segmentation of cytoarchitectonic areas in large-scale datasets, but performance is mainly constrained by the scarcity of training labels and the variability of staining and imaging conditions. To address these challenges, we propose an interactive cytoarchitectonic parcellation framework that leverages the strong transferability of the DINOv3 vision encoder. Our framework combines (i) multi-layer DINOv3 feature fusion, (ii) a lightweight segmentation decoder, and (iii) real-time user-guided training from sparse scribbles. This design enables rapid human-in-the-loop refinement while maintaining high segmentation accuracy. Compared with training an nnU-Net from scratch, transfer learning with DINOv3 yields markedly improved performance. We also show that features extracted by DINOv3 exhibit clear anatomical correspondence and demonstrate the method's practical utility for brain region segmentation using sparse labels. These results highlight the potential of foundation-model-driven interactive segmentation for scalable and efficient cytoarchitectonic mapping.

TF3-RO-50M Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction

Authors: Mihai Dan Nadas, Laura Diosan, Andreea Tomescu, Andrei Piscoran

2026-01-15

http://arxiv.org/abs/2601.10410v1

Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end pipeline that unifies tokenizer design, preprocessing, pretraining, compression, evaluation, and large-scale synthetic data generation in a reproducible framework. Building on TF1, a three-million-story English fable dataset, and TF2, which extends TF1 through high-quality Romanian translations, we introduce TF3-RO, a Romanian-centric language modeling pipeline spanning tokenizer training, from-scratch model development, and Romanian-native dataset generation. TF3-RO constructs Romanian-specific BPE and Unigram tokenizers from a linguistically informed corpus to mitigate token inflation induced by Romanian morphology. Using long-sequence packed training, we pretrain a 51.65M-parameter LLaMA-style Transformer entirely from scratch. The model is subsequently optimized through quantization, structured pruning, and logit-based knowledge distillation, yielding a compact 26.45M-parameter student model with tied embeddings and strong deployment characteristics. Using this distilled model, TF3-RO generates three million Romanian-native synthetic fables via a controlled combinatorial prompting framework. Across all stages, the pipeline integrates a comprehensive evaluation suite combining intrinsic metrics, Romanian agreement probes, entity coherence, rule-based grammar checking, and LLM-based assessment. TF3-RO provides a reproducible and linguistically grounded framework for training compact Romanian language models and producing large-scale synthetic narrative corpora.

Toward Ultra-Long-Horizon Agentic Science Cognitive Accumulation for Machine Learning Engineering

Authors: Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Jiaao Chen, Hanrui Wang, Wei-Chen Wang, Yuzhi Zhang, Linfeng Zhang, Weinan E, Di Jin, Siheng Chen

2026-01-15

http://arxiv.org/abs/2601.10402v1

The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE), which is a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI's MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.

LatentRefusal Latent-Signal Refusal for Unanswerable Text-to-SQL Queries

Authors: Xuancheng Ren, Shijing Hu, Zhihui Lu, Jiangqi Huang, Qiang Duan

2026-01-15

http://arxiv.org/abs/2601.10398v1

In LLM-based text-to-SQL systems, unanswerable and underspecified user queries may generate not only incorrect text but also executable programs that yield misleading results or violate safety constraints, posing a major barrier to safe deployment. Existing refusal strategies for such queries either rely on output-level instruction following, which is brittle due to model hallucinations, or estimate output uncertainty, which adds complexity and overhead. To address this challenge, we formalize safe refusal in text-to-SQL systems as an answerability-gating problem and propose LatentRefusal, a latent-signal refusal mechanism that predicts query answerability from intermediate hidden activations of a large language model. We introduce the Tri-Residual Gated Encoder, a lightweight probing architecture, to suppress schema noise and amplify sparse, localized cues of question-schema mismatch that indicate unanswerability. Extensive empirical evaluations across diverse ambiguous and unanswerable settings, together with ablation studies and interpretability analyses, demonstrate the effectiveness of the proposed approach and show that LatentRefusal provides an attachable and efficient safety layer for text-to-SQL systems. Across four benchmarks, LatentRefusal improves average F1 to 88.5 percent on both backbones while adding approximately 2 milliseconds of probe overhead.

Online identification of nonlinear time-varying systems with uncertain information

Authors: He Ren, Gaowei Yan, Hang Liu, Lifeng Cao, Zhijun Zhao, Gang Dang

2026-01-15

http://arxiv.org/abs/2601.10379v1

Digital twins (DTs), as the core enablers for real-time monitoring and predictive maintenance of complex cyber-physical systems, impose critical requirements on their virtual models: high predictive accuracy, strong interpretability, and online adaptive capability. However, existing techniques struggle to meet these demands simultaneously: Bayesian methods excel in uncertainty quantification but lack model interpretability, while interpretable symbolic identification methods (e.g., SINDy) are constrained by their offline, batch-processing nature, which makes real-time updates challenging. To bridge this semantic and computational gap, this paper proposes a novel Bayesian Regression-based Symbolic Learning (BRSL) framework. The framework formulates online symbolic discovery as a unified probabilistic state-space model. By incorporating sparse horseshoe priors, model selection is transformed into a Bayesian inference task, enabling simultaneous system identification and uncertainty quantification. Furthermore, we derive an online recursive algorithm with a forgetting factor and establish precise recursive conditions that guarantee the well-posedness of the posterior distribution. These conditions also function as real-time monitors for data utility, enhancing algorithmic robustness. Additionally, a rigorous convergence analysis is provided, demonstrating the convergence of parameter estimates under persistent excitation conditions. Case studies validate the effectiveness of the proposed framework in achieving interpretable, probabilistic prediction and online learning.
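
To make the idea of an online recursive update with a forgetting factor concrete, here is a minimal Python sketch of classical recursive least squares (RLS) over a small library of candidate terms. This is a simpler stand-in, not the BRSL algorithm: it has no horseshoe prior and no uncertainty quantification, and the candidate library, the toy system, and all constants are assumptions chosen for illustration.

```python
import math


def rls_update(theta, P, phi, y, lam=0.98):
    """One recursive least-squares step with forgetting factor lam.

    theta: current coefficient estimates, P: symmetric covariance-like matrix,
    phi: candidate-term values for this sample, y: observed output.
    """
    n = len(theta)
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(phi[i] * P_phi[i] for i in range(n))
    gain = [v / denom for v in P_phi]
    err = y - sum(theta[i] * phi[i] for i in range(n))
    theta = [theta[i] + gain[i] * err for i in range(n)]
    # P <- (P - gain * phi^T P) / lam, using the symmetry of P.
    P = [[(P[i][j] - gain[i] * P_phi[j]) / lam for j in range(n)]
         for i in range(n)]
    return theta, P


# Toy system (hypothetical): y = 2*x - 0.5*x^2, library [x, x^2, 1].
theta = [0.0, 0.0, 0.0]
P = [[1e3 if i == j else 0.0 for j in range(3)] for i in range(3)]
for k in range(200):
    x = 2.0 * math.sin(0.3 * k)
    phi = [x, x * x, 1.0]
    y = 2.0 * x - 0.5 * x * x
    theta, P = rls_update(theta, P, phi, y)
```

On this noiseless toy data the estimates approach the true coefficients (2, -0.5, 0). The forgetting factor down-weights old samples, which is what lets such an estimator track time-varying parameters.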

Global Context Compression with Interleaved Vision-Text Transformation

Authors: Dian Jiao, Jiaxin Duan, Shuai Zhao, Jiabing Leng, Yiran Zhang, Feng Huang

2026-01-15

http://arxiv.org/abs/2601.10378v1

Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This motivates earlier works that render the Transformer's input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratically increased Attention computations. However, this partial compression fails to save computational or memory costs at token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting from curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments using VIST2 families scaled from 0.6B to 8B to explore the training recipe and hyperparameters. With a 4$\times$ compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3$\times$ speedup in first-token generation, 77% reduction in memory usage, and 74% reduction in FLOPS. Our codes and datasets will be public to support further studies.

Towards Efficient Low-rate Image Compression with Frequency-aware Diffusion Prior Refinement

Authors: Yichong Xia, Yimin Zhou, Jinpeng Wang, Bin Chen

2026-01-15

http://arxiv.org/abs/2601.10373v1

Recent advancements in diffusion-based generative priors have enabled visually plausible image compression at extremely low bit rates. However, existing approaches suffer from slow sampling processes and suboptimal bit allocation due to fragmented training paradigms. In this work, we propose Accelerate Diffusion-based Image Compression via Consistency Prior Refinement (DiffCR), a novel compression framework for efficient and high-fidelity image reconstruction. At the heart of DiffCR is a Frequency-aware Skip Estimation (FaSE) module that refines the $ε$-prediction prior from a pre-trained latent diffusion model and aligns it with compressed latents at different timesteps via Frequency Decoupling Attention (FDA). Furthermore, a lightweight consistency estimator enables fast two-step decoding by preserving the semantic trajectory of diffusion sampling. Without updating the backbone diffusion model, DiffCR achieves substantial bitrate savings (27.2% BD-rate (LPIPS) and 65.1% BD-rate (PSNR)) and over $10\times$ speed-up compared to SOTA diffusion-based compression baselines.

Joint Bayesian inference of Earth's magnetic field and core surface flow on millennial timescales

Authors: Andreas Nilsson, Neil Suttie, Marie Troyano, Nicolas Gillet, Julien Aubert, Anders Irbäck

2026-01-15

http://arxiv.org/abs/2601.10344v1

Understanding Earth's core dynamics over millennial timescales requires models that jointly describe the evolution of the geomagnetic field and core surface flow, while accommodating the sparse, irregular, and uncertain nature of archaeomagnetic and palaeomagnetic data. We present a new Bayesian core field and core flow modelling framework that utilises archaeo/palaeomagnetic data directly, combining a reduced stochastic representation of core surface dynamics derived from numerical geodynamo statistics with a probabilistic treatment of observational and chronological uncertainties. A key innovation is an efficient discrete marginalisation of age uncertainties, which avoids the convergence difficulties associated with co-estimating ages in high-dimensional Hamiltonian Monte Carlo inversions. The framework aims to reconstruct the coupled evolution of the geomagnetic field and core surface flow over the past 9000 years while preserving dynamical correlations implied by the prior geodynamo time series. Tests using synthetic data generated from an Earth-like geodynamo demonstrate that the method reliably recovers large-scale geomagnetic field variations and key aspects of core dynamics, including long-term westward drift and the evolution of planetary-scale eccentric gyres. These results show that, when combined with physically informed priors, archaeo/palaeomagnetic data can constrain millennial-scale core flow, paving the way for reconstructions based on real data.

Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning

Authors: Xin Guan, Zijian Li, Shen Huang, Pengjun Xie, Jingren Zhou, Jiuxin Cao

2026-01-15

http://arxiv.org/abs/2601.10306v1

While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by the sparsity of outcome rewards. This limitation fails to penalize ungrounded "lucky guesses," leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. Comprehensive evaluations across eight benchmarks demonstrate that EAPO significantly enhances long-context reasoning performance compared to SOTA baselines.

In-Context Source and Channel Coding

Authors: Ziqiong Wang, Tianqi Ren, Rongpeng Li, Zhifeng Zhao, Honggang Zhang

2026-01-15

http://arxiv.org/abs/2601.10267v1

Separate Source-Channel Coding (SSCC) remains attractive for text transmission due to its modularity and compatibility with mature entropy coders and powerful channel codes. However, SSCC often suffers from a pronounced cliff effect in low Signal-to-Noise Ratio (SNR) regimes, where residual bit errors after channel decoding can catastrophically break lossless source decoding, especially for Arithmetic Coding (AC) driven by Large Language Models (LLMs). This paper proposes a receiver-side In-Context Decoding (ICD) framework that enhances SSCC robustness without modifying the transmitter. ICD leverages an Error Correction Code Transformer (ECCT) to obtain bit-wise reliability for the decoded information bits. Based on the context-consistent bitstream, ICD constructs a confidence-ranked candidate pool via reliability-guided bit flipping, samples a compact yet diverse subset of candidates, and applies an LLM-based arithmetic decoder to obtain both reconstructions and sequence-level log-likelihoods. A reliability-likelihood fusion rule then selects the final output. We further provide theoretical guarantees on the stability and convergence of the proposed sampling procedure. Extensive experiments over Additive White Gaussian Noise (AWGN) and Rayleigh fading channels demonstrate consistent gains compared with conventional SSCC baselines and representative Joint Source-Channel Coding (JSCC) schemes.
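
One way to picture a confidence-ranked candidate pool built by reliability-guided bit flipping is the Python sketch below. It is an assumption-laden illustration in the spirit of Chase-style decoding, not the paper's procedure: the function name, the choice to flip only the `num_weak` least reliable positions, and the penalty (sum of reliabilities of flipped bits) are all hypothetical.

```python
from itertools import combinations


def candidate_pool(bits, reliability, num_weak=3, max_flips=2):
    """Return (penalty, candidate) pairs, most plausible first.

    Only the num_weak least reliable positions may be flipped, at most
    max_flips at a time. The penalty is the summed reliability of the
    flipped bits, so flipping trusted bits costs more (hypothetical rule).
    """
    weak = sorted(range(len(bits)), key=lambda i: reliability[i])[:num_weak]
    pool = []
    for r in range(max_flips + 1):
        for flips in combinations(weak, r):
            cand = list(bits)
            for i in flips:
                cand[i] ^= 1  # flip the selected bit
            penalty = sum(reliability[i] for i in flips)
            pool.append((penalty, cand))
    pool.sort(key=lambda item: item[0])
    return pool


# Hypothetical hard decisions and per-bit reliabilities.
pool = candidate_pool([1, 0, 1, 1], [0.9, 0.2, 0.6, 0.1],
                      num_weak=2, max_flips=2)
```

In a full pipeline each candidate would then be passed to the source decoder and rescored; that step is omitted here because the abstract does not specify the fusion rule.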

STEAMROLLER A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter

Authors: Ziqi Xu, Yi Liu, Yuekang Li, Ling Shi, Kailong Wang, Yongxin Zhao

2026-01-15

http://arxiv.org/abs/2601.10223v1

People who stutter (PWS) face systemic exclusion in today's voice-driven society, where access to voice assistants, authentication systems, and remote work tools increasingly depends on fluent speech. Current automatic speech recognition (ASR) systems, trained predominantly on fluent speech, fail to serve millions of PWS worldwide. We present STEAMROLLER, a real-time system that transforms stuttered speech into fluent output through a novel multi-stage, multi-agent AI pipeline. Our approach addresses three critical technical challenges: (1) the difficulty of direct speech-to-speech conversion for disfluent input, (2) semantic distortions introduced during ASR transcription of stuttered speech, and (3) latency constraints for real-time communication. STEAMROLLER employs a three-stage architecture comprising ASR transcription, multi-agent text repair, and speech synthesis, where our core innovation lies in a collaborative multi-agent framework that iteratively refines transcripts while preserving semantic intent. Experiments on the FluencyBank dataset and a user study demonstrate clear word error rate (WER) reduction and strong user satisfaction. Beyond immediate accessibility benefits, fine-tuning ASR on STEAMROLLER-repaired speech yields additional WER improvements, creating a pathway toward inclusive AI ecosystems.

LOOKAT Lookup-Optimized Key-Attention for Memory-Efficient Transformers

Authors: Aryan Karmore

2026-01-15

http://arxiv.org/abs/2601.10155v1

Compressing the KV cache is a required step to deploy large language models on edge devices. Current methods compress storage but fail to reduce bandwidth, as attention calculation requires dequantizing keys from INT4/INT8 to FP16 before use. We observe that attention scoring is mathematically equivalent to inner-product similarity search, so techniques from vector databases can be applied to compress the KV cache better. We propose LOOKAT, which applies product quantization and asymmetric distance computation to the transformer architecture by decomposing key vectors into subspaces, learning codebooks, and computing attention tables via lookup tables. This transforms attention from memory-bound to compute-bound. LOOKAT achieves 64$\times$ compression at 95.7% output fidelity and 32$\times$ compression at 95.0% fidelity when tested on GPT-2. LOOKAT requires no architecture changes or training while maintaining rank correlation $\rho > 0.95$. Theoretical analysis confirms that rank correlation degrades as $O(d_k/mK)$, with guarantees validated across sequence lengths up to 1024 tokens.
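The lookup-table idea in the LOOKAT abstract can be illustrated with a minimal pure-Python sketch of product quantization with asymmetric distance computation. This is not the authors' implementation: the function names, the toy dimensions, and the random codebooks (standing in for learned ones) are all assumptions made for illustration. The point it shows is that once keys are stored as codeword indices, a query-key score becomes a sum of table lookups, and the key vector is never dequantized.

```python
import random

def split(vec, m):
    """Split a vector into m equal-length contiguous subvectors (len(vec) % m == 0)."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def encode_key(key, codebooks):
    """Store a key as one codeword index per subspace (nearest codeword, squared L2)."""
    codes = []
    for sub, book in zip(split(key, len(codebooks)), codebooks):
        dists = [sum((s - c) ** 2 for s, c in zip(sub, word)) for word in book]
        codes.append(dists.index(min(dists)))
    return codes

def build_tables(query, codebooks):
    """Per subspace, precompute the query subvector's dot product with every codeword."""
    return [[dot(q_sub, word) for word in book]
            for q_sub, book in zip(split(query, len(codebooks)), codebooks)]

def approx_score(codes, tables):
    """Approximate q.k as a sum of m table lookups; the key is never reconstructed."""
    return sum(table[c] for table, c in zip(tables, codes))

# Toy setup (hypothetical sizes): d_k = 8, m = 4 subspaces, K = 4 codewords each.
random.seed(0)
d_k, m, K = 8, 4, 4
codebooks = [[[random.gauss(0, 1) for _ in range(d_k // m)] for _ in range(K)]
             for _ in range(m)]
query = [random.gauss(0, 1) for _ in range(d_k)]
key = [random.gauss(0, 1) for _ in range(d_k)]

codes = encode_key(key, codebooks)      # m small integers instead of d_k floats
tables = build_tables(query, codebooks)  # built once per query, reused for all keys
print("exact:", dot(query, key), "approx:", approx_score(codes, tables))
```

The tables are built once per query and reused for every cached key, which is where the bandwidth saving would come from. In practice the codebooks would be learned (for example with k-means) rather than drawn at random, so the approximation in this toy run is deliberately crude.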

TopoDIM One-shot Topology Generation of Diverse Interaction Modes for Multi-Agent Systems

Authors: Rui Sun, Jie Ding, Chenghua Gong, Tianjun Gu, Yihang Jiang, Juyuan Zhang, Liming Pan, Linyuan Lü

2026-01-15

http://arxiv.org/abs/2601.10120v1

Optimizing topology in LLM-based multi-agent systems is critical for enabling collective intelligence. Existing methods mainly rely on spatio-temporal interaction paradigms, where the sequential execution of multi-round dialogues incurs high latency and computation. Motivated by recent insights that evaluation and debate mechanisms can improve problem-solving in multi-agent systems, we propose TopoDIM, a framework for one-shot Topology generation with Diverse Interaction Modes. Designed for decentralized execution to enhance adaptability and privacy, TopoDIM enables agents to autonomously construct heterogeneous communication without iterative coordination, achieving token efficiency and improved task performance. Experiments demonstrate that TopoDIM reduces total token consumption by 46.41% while improving average performance by 1.50% over state-of-the-art methods. Moreover, the framework exhibits strong adaptability in organizing communication among heterogeneous agents. Code is available at: https://anonymous.4open.science/r/TopoDIM-8D35/

Sparse-RL Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts

Authors: Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang, Jinbo Su, Mengshu Sun, Lei Liang, Jing Zhang

2026-01-15

http://arxiv.org/abs/2601.10079v1

Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). However, the substantial memory overhead of storing Key-Value (KV) caches during long-horizon rollouts acts as a critical bottleneck, often prohibiting efficient training on limited hardware. While existing KV compression techniques offer a remedy for inference, directly applying them to RL training induces a severe policy mismatch, leading to catastrophic performance collapse. To address this, we introduce Sparse-RL, which empowers stable RL training under sparse rollouts. We show that instability arises from a fundamental policy mismatch among the dense old policy, the sampler policy, and the learner policy. To mitigate this issue, Sparse-RL incorporates Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression-induced information loss. Experimental results show that Sparse-RL reduces rollout overhead compared to dense baselines while preserving performance. Furthermore, Sparse-RL inherently implements sparsity-aware training, significantly enhancing model robustness during inference deployment.

Privacy Enhanced PEFT Tensor Train Decomposition Improves Privacy Utility Tradeoffs under DP-SGD

Authors: Pradip Kunwar, Minh Vu, Maanak Gupta, Manish Bhattarai

2026-01-15

http://arxiv.org/abs/2601.10045v1

Fine-tuning large language models on sensitive data poses significant privacy risks, as membership inference attacks can reveal whether individual records were used during training. While Differential Privacy (DP) provides formal protection, applying DP to conventional Parameter-Efficient Fine-Tuning (PEFT) methods such as Low-Rank Adaptation (LoRA) often incurs substantial utility loss. In this work, we show that a more structurally constrained PEFT architecture, Tensor Train Low-Rank Adaptation (TTLoRA), can improve the privacy-utility tradeoff by shrinking the effective parameter space while preserving expressivity. To this end, we develop TTLoRA-DP, a differentially private training framework for TTLoRA. Specifically, we extend the ghost clipping algorithm to Tensor Train cores via cached contraction states, enabling efficient Differentially Private Stochastic Gradient Descent (DP-SGD) with exact per-example gradient norm computation without materializing full per-example gradients. Experiments on GPT-2 fine-tuning over the Enron and Penn Treebank datasets show that TTLoRA-DP consistently strengthens privacy protection relative to LoRA-DP while maintaining comparable or better downstream utility. Moreover, TTLoRA exhibits lower membership leakage even without DP training, using substantially smaller adapters and requiring on average 7.6X fewer parameters than LoRA. Overall, our results demonstrate that TTLoRA offers a practical path to improving the privacy-utility tradeoff in parameter-efficient language model adaptation.

Towards Native Intelligence 6G-LLM Trained with Reinforcement Learning from NDT Feedback

Authors: Zhuoran Xiao, Tao Tao, Chenhui Ye, Yunbo Hu, Yijia Feng, Tianyu Jiao, Liyu Cai

2026-01-15

http://arxiv.org/abs/2601.09992v1

Owing to its comprehensive understanding of upper-layer application requirements and the capabilities of practical systems, the 6G-LLM (6G domain large language model) offers a promising pathway toward realizing network native intelligence. Serving as the system orchestrator, the 6G-LLM drives a paradigm shift that fundamentally departs from existing rule-based approaches, which primarily rely on modular, experience-driven optimization. By contrast, the 6G-LLM substantially enhances network flexibility and adaptability. Nevertheless, current efforts to construct 6G-LLMs are constrained by their reliance on large-scale, meticulously curated, human-authored corpora, which are impractical to obtain in real-world scenarios. Moreover, purely offline-trained models lack the capacity for continual self-improvement, limiting their ability to adapt to the highly dynamic requirements of wireless environments. To overcome these limitations, we propose a novel training paradigm termed RLDTF (Reinforcement Learning from Digital Twin Feedback) for 6G-LLMs. This framework leverages network digital twins to generate reward signals based on orchestration outcomes, while employing reinforcement learning to guide the model toward optimal decision-making dynamically. Furthermore, we introduce a weighted token mechanism to improve output accuracy. Comprehensive experimental results demonstrate that our proposed framework significantly outperforms state-of-the-art baselines in orchestration accuracy and solution optimality.

Learning to Decode in Parallel Self-Coordinating Neural Network for Real-Time Quantum Error Correction

Authors: Kai Zhang, Zhengzhong Yi, Shaojun Guo, Linghang Kong, Situ Wang, Xiaoyu Zhan, Tan He, Weiping Lin, Tao Jiang, Dongxin Gao, Yiming Zhang, Fangming Liu, Fang Zhang, Zhengfeng Ji, Fusheng Chen, Jianxin Chen

2026-01-14

http://arxiv.org/abs/2601.09921v1

Fast, reliable decoders are pivotal components for enabling fault-tolerant quantum computation (FTQC). Neural network decoders like AlphaQubit have demonstrated potential, achieving higher accuracy than traditional human-designed decoding algorithms. However, existing implementations of neural network decoders lack the parallelism required to decode the syndrome stream generated by a superconducting logical qubit in real time. Moreover, integrating AlphaQubit with sliding window-based parallel schemes presents non-trivial challenges: AlphaQubit is trained solely to output a single bit corresponding to the global logical correction for an entire memory experiment, rather than local physical corrections that can be easily integrated. We address this issue by training a recurrent, transformer-based neural network specifically tailored for parallel window decoding. While it still outputs a single bit, we derive training labels from a consistent set of local corrections and train on various types of windows simultaneously. This approach enables the network to self-coordinate across neighboring windows, facilitating high-accuracy parallel decoding of arbitrarily long memory experiments. As a result, we overcome the throughput bottleneck that previously precluded the use of AlphaQubit-type decoders in FTQC. Our work presents the first scalable, neural-network-based parallel framework that simultaneously achieves SOTA accuracy and the stringent throughput required for real-time quantum error correction. Using an end-to-end experimental workflow, we benchmark our decoder on the Zuchongzhi 3.2 superconducting quantum processor on surface codes with distances up to 7, demonstrating its superior accuracy. Moreover, we demonstrate that, using our approach, a single TPU v6e is capable of decoding surface codes with distances up to 25 within 1 μs per round.

Authors: Jacob Sander, Brian Jalaian, Venkat R. Dasari

2026-01-14

http://arxiv.org/abs/2601.09865v1

Large Language Models (LLMs) enable advanced natural language processing but face deployment challenges on resource-constrained edge devices due to high computational, memory, and energy demands. Optimizing these models requires addressing three key challenges: acquiring task-specific data, fine-tuning for performance, and compressing models to accelerate inference while reducing resource demands. We propose an integrated framework combining GPTQ-based quantization, low-rank adaptation (LoRA), and a specialized data distillation process to significantly reduce model size and complexity while preserving or enhancing task-specific performance. By leveraging data distillation, knowledge distillation via Kullback-Leibler divergence, Bayesian hyperparameter optimization, and the Muon optimizer, our pipeline achieves up to 2x memory compression (e.g., reducing a 6GB model to 3GB) and enables efficient inference for specialized tasks. Empirical results demonstrate superior performance on standard benchmarks compared to GPTQ quantization alone, with the Muon optimizer notably enhancing fine-tuned models' resistance to accuracy decay during quantization.

MedRedFlag Investigating how LLMs Redirect Misconceptions in Real-World Health Communication

Authors: Sraavya Sambara, Yuan Pu, Ayman Ali, Vishala Mishra, Lionel Wong, Monica Agrawal

2026-01-14

http://arxiv.org/abs/2601.09853v1

Real-world health questions from patients often unintentionally embed false assumptions or premises. In such cases, safe medical communication typically involves redirection: addressing the implicit misconception and then responding to the underlying patient context, rather than the original question. While large language models (LLMs) are increasingly being used by lay users for medical advice, they have not yet been tested for this crucial competency. Therefore, in this work, we investigate how LLMs react to false premises embedded within real-world health questions. We develop a semi-automated pipeline to curate MedRedFlag, a dataset of 1100+ questions sourced from Reddit that require redirection. We then systematically compare responses from state-of-the-art LLMs to those from clinicians. Our analysis reveals that LLMs often fail to redirect problematic questions, even when the problematic premise is detected, and provide answers that could lead to suboptimal medical decision making. Our benchmark and results reveal a novel and substantial gap in how LLMs perform under the conditions of real-world health communication, highlighting critical safety concerns for patient-facing medical AI systems. Code and dataset are available at https://github.com/srsambara-1/MedRedFlag.

LLM-Based Agentic Systems for Software Engineering Challenges and Opportunities

Authors: Yongjian Tang, Thomas Runkler

2026-01-14

http://arxiv.org/abs/2601.09822v1

Despite recent advancements in Large Language Models (LLMs), complex Software Engineering (SE) tasks require more collaborative and specialized approaches. This concept paper systematically reviews the emerging paradigm of LLM-based multi-agent systems, examining their applications across the Software Development Life Cycle (SDLC), from requirements engineering and code generation to static code checking, testing, and debugging. We delve into a wide range of topics such as language model selection, SE evaluation benchmarks, state-of-the-art agentic frameworks and protocols. Furthermore, we identify key challenges and outline future research opportunities, with a focus on multi-agent orchestration, human-agent coordination, computational cost optimization, and effective data collection. This work aims to provide researchers and practitioners with valuable insights into the current forefront landscape of agentic systems within the software engineering domain.

ShortCoder Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation

Authors: Sicong Liu, Yanxian Huang, Mingwei Liu, Jiachi Chen, Ensheng Shi, Yuchi Ma, Hongyu Zhang, Yin Zhang, Yanlin Wang

2026-01-14

http://arxiv.org/abs/2601.09703v1

Code generation tasks aim to automate the conversion of user requirements into executable code, significantly reducing manual development efforts and enhancing software productivity. The emergence of large language models (LLMs) has significantly advanced code generation, though their efficiency is still impacted by certain inherent architectural constraints. Each token generation necessitates a complete inference pass, requiring persistent retention of contextual information in memory and escalating resource consumption. While existing research prioritizes inference-phase optimizations such as prompt compression and model quantization, the generation phase remains underexplored. To tackle these challenges, we propose a knowledge-infused framework named ShortCoder, which optimizes code generation efficiency while preserving semantic equivalence and readability. In particular, we introduce: (1) ten syntax-level simplification rules for Python, derived from AST-preserving transformations, achieving 18.1% token reduction without functional compromise; (2) a hybrid data synthesis pipeline integrating rule-based rewriting with LLM-guided refinement, producing ShorterCodeBench, a corpus of validated tuples of original code and simplified code with semantic consistency; (3) a fine-tuning strategy that injects conciseness awareness into the base LLMs. Extensive experimental results demonstrate that ShortCoder consistently outperforms state-of-the-art methods on HumanEval, achieving an improvement of 18.1%-37.8% in generation efficiency over previous methods while ensuring the performance of code generation.

Empathy Applicability Modeling for General Health Queries

Authors: Shan Randhawa, Agha Ali Raza, Kentaro Toyama, Julie Hui, Mustafa Naseem

2026-01-14

http://arxiv.org/abs/2601.09696v1

LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and enables supporting empathetic communication in asynchronous healthcare.

LLMs can Compress LLMs Adaptive Pruning by Agents

Authors: Sai Varun Kodathala, Rakesh Vunnam

2026-01-14

http://arxiv.org/abs/2601.09694v1

As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an agent equipped with self-reflection capabilities, enabling it to learn from previous outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.
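The sensitivity profile described in this abstract (a Wanda-style weight-activation metric combined with gradient importance, normalized as z-scores) might look roughly like the sketch below. The abstract does not give the exact formulas, so everything here is an assumption: the per-layer aggregation by mean, the equal mixing weight `alpha`, and all function names are hypothetical. The only borrowed fact is that Wanda scores a weight by its magnitude times the norm of the corresponding input activation.

```python
from statistics import mean, pstdev

def wanda_score(weights, act_norms):
    """Mean of |w_ij| * ||x_j|| over a layer's weight matrix (rows are outputs).

    act_norms[j] is the norm of input feature j over a calibration batch.
    Aggregating by the mean is an assumption made for this sketch.
    """
    products = [abs(w) * norm for row in weights for w, norm in zip(row, act_norms)]
    return mean(products)

def zscores(values):
    """Standardize per-layer scores so different layers and models share one scale.

    Raises ZeroDivisionError if all values are identical (zero spread).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def combined_sensitivity(wanda, grads, alpha=0.5):
    """Blend the two standardized signals; alpha is a hypothetical mixing weight."""
    return [alpha * w + (1 - alpha) * g
            for w, g in zip(zscores(wanda), zscores(grads))]

# Hypothetical per-layer statistics for a four-layer model.
wanda_per_layer = [0.80, 0.35, 0.50, 0.95]
grad_per_layer = [1.2e-3, 0.4e-3, 0.9e-3, 1.5e-3]

profile = combined_sensitivity(wanda_per_layer, grad_per_layer)
least_sensitive = min(range(len(profile)), key=profile.__getitem__)
print("candidate layer to prune first:", least_sensitive)
```

In the described system these numbers would be handed to the agent as context rather than thresholded directly; the sketch only covers how such a profile could be computed.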

Parallaxes, Proper Motions, and Near-Infrared Photometry for 173 L and T Dwarfs From The US Naval Observatory Infrared Astrometry Program

Authors: Frederick J. Vrba, Adam C. Schneider, Jeffrey A. Munn, Arne A. Henden, Christain B. Luginbuhl, Conard C. Dahn, Harry H. Guetter, Blaise J. Canzian, Trudy M. Tilleman, Scott E. Dahm, Stephen J. Williams, Justice E. Bruursema, J. Davy Kirkpatrick, Adam J. Burgasser

2026-01-14

http://arxiv.org/abs/2601.09671v1

We present near-infrared parallax and proper motion astrometry for 74 L-dwarfs and 99 T-dwarfs, as single objects or in binary systems, obtained with the ASTROCAM astrometric imager on the USNO, Flagstaff Station 1.55-m telescope over two observing periods. For all 173 objects the median number of observational epochs was 62 with a median time frame of 5.25 years, resulting in median uncertainties of $\sigma(\pi_{abs})$ = 1.51 mas, $\sigma(\mu_{abs})$ = 1.02 mas yr$^{-1}$, and $\sigma(V_{\rm tan})$ = 1.01 km s$^{-1}$. Our observations provide the first parallax/proper motion results for 16 objects and the highest precision parallaxes/proper motions for an additional 116 objects. A serendipitous overlap of 40 objects with Gaia DR3 astrometry allows direct comparison and confirmation of our results, along with an investigation on the effects of resolved binarity on astrometric results. We also provide a uniform set of $J$-, $H$-, and $K_{S}$-band photometry in the UKIRT/MKO system, most of it being from new observations. We use these results to examine objects included in this study of special-interest populations, consisting of binaries, wide companions, young objects, subdwarfs, and brown dwarf spectral standards.

Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park

2026-01-14

http://arxiv.org/abs/2601.09667v2

Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.

OpenVoxel Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

Authors: Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, Cheng Sun

2026-01-14

http://arxiv.org/abs/2601.09575v1

We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel successfully builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open.

Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats

Authors: Manyi Zhang, Ji-Fu Li, Zhongao Sun, Haoli Bai, Hui-Ling Zhen, Zhenhua Dong, Xianzhi Yu

2026-01-14

http://arxiv.org/abs/2601.09555v1

Microscaling Floating-Point (FP) has emerged as a promising low-precision format for large language models (LLMs). Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization, while their applicability and behavior under FP formats remain largely unexplored. To address this gap, this work conducts a systematic investigation of PTQ under FP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families. The key findings include: 1) FP8 consistently achieves near-lossless performance, while FP4 introduces substantial accuracy degradation and remains challenging; 2) PTQ effectiveness under FP depends strongly on format compatibility, with some algorithmic paradigms being consistently more effective than others; 3) PTQ performance exhibits highly consistent trends across model families and modalities; in particular, sensitivity is dominated by the language model rather than the vision encoder in multimodal LLMs; 4) the scaling factor of quantization is a critical error source in FP4, and a simple pre-scale optimization strategy can significantly mitigate its impact. Together, these results provide practical guidance on adapting existing PTQ methods to FP quantization.

Private LLM Inference on Consumer Blackwell GPUs A Practical Guide for Cost-Effective Local Deployment in SMEs

Authors: Jonathan Knoop, Hendrik Holtmann

2026-01-14

http://arxiv.org/abs/2601.09527v1

SMEs increasingly seek alternatives to cloud LLM APIs, which raise data privacy concerns. Dedicated cloud GPU instances offer improved privacy but with limited guarantees and ongoing costs, while professional on-premise hardware (A100, H100) remains prohibitively expensive. We present a systematic evaluation of NVIDIA's Blackwell consumer GPUs (RTX 5060 Ti, 5070 Ti, 5090) for production inference, benchmarking four open-weight models (Qwen3-8B, Gemma3-12B, Gemma3-27B, GPT-OSS-20B) across 79 configurations spanning quantization formats (BF16, W4A16, NVFP4, FP4), context lengths (8k-64k), and three workloads: RAG, multi-LoRA agentic serving, and high-concurrency APIs. The RTX 5090 delivers 3.5-4.6x higher throughput than the 5060 Ti with 21x lower latency for RAG, but budget GPUs achieve the highest throughput-per-dollar for API workloads with sub-second latency. NVFP4 provides 1.6x throughput over BF16 with 41% energy reduction and only 2-4% quality loss. Self-hosted inference costs $0.001-0.04 per million tokens (electricity only), which is 40-200x cheaper than budget-tier cloud APIs, with hardware breaking even in under four months at moderate volume (30M tokens/day). Our results show that consumer GPUs can reliably replace cloud inference for most SME workloads, except latency-critical long-context RAG, where high-end GPUs remain essential. We provide deployment guidance and release all benchmark data for reproducible SME-scale deployments.
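The break-even claim above is simple arithmetic, sketched here so readers can plug in their own figures. The hardware price and cloud price in the example are hypothetical placeholders, not values from the paper; only the 30M tokens/day volume and the order of magnitude of the electricity cost come from the abstract.

```python
def break_even_days(hardware_cost, cloud_price, local_price, mtokens_per_day):
    """Days until purchased hardware pays for itself versus a cloud API.

    hardware_cost   : one-off purchase price in dollars
    cloud_price     : cloud API price in dollars per million tokens
    local_price     : self-hosted running cost in dollars per million tokens
    mtokens_per_day : daily volume in millions of tokens
    """
    daily_saving = (cloud_price - local_price) * mtokens_per_day
    if daily_saving <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return hardware_cost / daily_saving

# Hypothetical inputs: a $2000 GPU workstation, a $0.60/M-token cloud API,
# $0.01/M-token electricity, and the abstract's 30M tokens/day volume.
days = break_even_days(2000.0, 0.60, 0.01, 30.0)
print(f"break-even after about {days:.0f} days")
```

The sketch ignores depreciation, maintenance, and idle power, all of which would lengthen the payback period.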

Engineering Compressed Matrix Multiplication with the Fast Walsh-Hadamard Transform

Authors: Joel Andersson, Matti Karppa

2026-01-14

http://arxiv.org/abs/2601.09477v1

We present an implementation of Pagh's compressed matrix multiplication algorithm, a randomized algorithm that constructs sketches of matrices to compute an unbiased estimate of their product. By leveraging fast polynomial multiplication via the FFT, the algorithm achieves high performance when the product matrix is sparse or contains only a small number of entries with magnitudes significantly larger than the rest. We show empirically that the algorithm is practical and can outperform state-of-the-art DGEMM implementations when the product matrix has few nonzero entries or is otherwise dominated by a small subset of elements with large magnitude. As a minor theoretical contribution, we replace the FFT with the Fast Walsh-Hadamard Transform (FWHT) in sketched multiplication, preserving all correctness and variance guarantees of the original algorithm. Experiments with our carefully engineered multithreaded CPU implementation for dense double-precision matrices on 64-core CPU nodes, across a range of synthetic benchmarks exhibiting variable sparsity patterns, show that the FWHT variant is up to 4 times faster than the FFT-based version. Under favorable sparsity and magnitude patterns in the product matrix, our FWHT-based implementation achieves a speedup of up to 40 over DGEMM from Intel MKL, with low probability of error in the estimates. Our implementation is released as free software and comes with NumPy-compatible Python bindings.
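For readers unfamiliar with the transform named above, here is a minimal pure-Python sketch of the unnormalized Fast Walsh-Hadamard Transform, plus the property that makes it a candidate replacement for the FFT: it turns XOR-indexed (dyadic) convolution into pointwise multiplication, just as the FFT does for circular convolution. This is a textbook butterfly, not the paper's engineered multithreaded implementation, and how the authors wire it into the sketching step is not described in the abstract.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform in O(n log n).

    len(values) must be a power of two. Uses only additions and
    subtractions, with no complex arithmetic or twiddle factors.
    """
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x, y = a[i], a[i + h]
                a[i], a[i + h] = x + y, x - y
        h *= 2
    return a

def xor_convolve(a, b):
    """c[k] = sum of a[i] * b[j] over all i, j with i XOR j == k (integer inputs)."""
    n = len(a)
    fa, fb = fwht(a), fwht(b)
    product = [x * y for x, y in zip(fa, fb)]
    # The transform is its own inverse up to a factor of n.
    return [v // n for v in fwht(product)]

print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))
print(xor_convolve([1, 2, 0, 0], [0, 1, 0, 3]))
```

Because the butterflies involve only real additions and subtractions, the FWHT is cheaper per element than a complex FFT, which is consistent with the speedup reported in the abstract.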

Analysis of the Maximum Prediction Gain of Short-Term Prediction on Sustained Speech

Authors: Reemt Hinrichs, Muhamad Fadli Damara, Stephan Preihs, Jörn Ostermann

2026-01-14

http://arxiv.org/abs/2601.09461v1

Signal prediction is widely used in, e.g., economic forecasting, echo cancellation, and data compression, particularly in predictive coding of speech and music. Predictive coding algorithms reduce the bit-rate required for data transmission or storage by signal prediction. The prediction gain is a classic measure in applied signal coding of the quality of a predictor, as it links the mean-squared prediction error to the signal-to-quantization-noise ratio of predictive coders. To evaluate predictor models, knowledge about the maximum achievable prediction gain independent of a predictor model is desirable. In this manuscript, Nadaraya-Watson kernel-regression (NWKR) and an information theoretic upper bound are applied to analyze the upper bound of the prediction gain on a newly recorded dataset of sustained speech/phonemes. It was found that for unvoiced speech a linear predictor always achieves the maximum prediction gain within at most 0.3 dB. On voiced speech, the optimum one-tap predictor was found to be linear, but starting with two taps, the maximum achievable prediction gain was found to be about 2 dB to 6 dB above the prediction gain of the linear predictor. Significant differences between speakers/subjects were observed. The created dataset as well as the code can be obtained for research purposes upon request.
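As a concrete reference point for the quantity studied here, the sketch below computes the prediction gain (signal power over prediction-error power, in dB) for a least-squares one-tap linear predictor. It assumes a zero-mean signal and uses a synthetic first-order autoregressive process rather than speech; the function names and the test signal are illustrative choices, not taken from the paper.

```python
import math
import random

def one_tap_coefficient(x):
    """Least-squares coefficient a minimizing the sum of (x[n] - a * x[n-1])^2."""
    num = sum(x[n] * x[n - 1] for n in range(1, len(x)))
    den = sum(x[n - 1] ** 2 for n in range(1, len(x)))
    return num / den

def prediction_gain_db(signal, error):
    """10 * log10 of mean signal power over mean prediction-error power.

    Assumes both sequences are zero-mean, so power equals variance.
    """
    p_sig = sum(s * s for s in signal) / len(signal)
    p_err = sum(e * e for e in error) / len(error)
    return 10.0 * math.log10(p_sig / p_err)

# Synthetic AR(1) process x[n] = 0.9 * x[n-1] + w[n] with unit-variance noise.
random.seed(1)
x = [0.0]
for _ in range(20000):
    x.append(0.9 * x[-1] + random.gauss(0.0, 1.0))

a = one_tap_coefficient(x)
residual = [x[n] - a * x[n - 1] for n in range(1, len(x))]
print("estimated coefficient:", a)
print("prediction gain (dB):", prediction_gain_db(x[1:], residual))
```

For this process the theoretical gain of the optimal one-tap predictor is 10 log10(1 / (1 - 0.9^2)), roughly 7.2 dB, and a linear predictor is already optimal because the process is linear and Gaussian. The paper's question is how much a nonlinear predictor could add on real speech, which this sketch does not address.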

SC-MAS Constructing Cost-Efficient Multi-Agent Systems with Edge-Level Heterogeneous Collaboration

Authors: Di Zhao, Longhui Ma, Siwei Wang, Miao Wang, Yi Kong

2026-01-14

http://arxiv.org/abs/2601.09434v1

Large Language Model (LLM)-based Multi-Agent Systems (MAS) enhance complex problem solving through multi-agent collaboration, but often incur substantially higher costs than single-agent systems. Recent MAS routing methods aim to balance performance and overhead by dynamically selecting agent roles and language models. However, these approaches typically rely on a homogeneous collaboration mode, where all agents follow the same interaction pattern, limiting collaboration flexibility across different roles. Motivated by Social Capital Theory, which emphasizes that different roles benefit from distinct forms of collaboration, we propose SC-MAS, a framework for constructing heterogeneous and cost-efficient multi-agent systems. SC-MAS models MAS as directed graphs, where edges explicitly represent pairwise collaboration strategies, allowing different agent pairs to interact through tailored communication patterns. Given an input query, a unified controller progressively constructs an executable MAS by selecting task-relevant agent roles, assigning edge-level collaboration strategies, and allocating appropriate backbones to individual agents. Experiments on multiple benchmarks demonstrate the effectiveness of SC-MAS. In particular, SC-MAS improves accuracy by 3.35% on MMLU while reducing inference cost by 15.38%, and achieves a 3.53% accuracy gain with a 12.13% cost reduction on MBPP. These results validate the feasibility of SC-MAS and highlight the effectiveness of heterogeneous collaboration in multi-agent systems.

Spectral Complex Autoencoder Pruning A Fidelity-Guided Criterion for Extreme Structured Channel Compression

Authors: Wei Liu, Xing Deng, Haijian Shao, Yingtao Jiang

2026-01-14

http://arxiv.org/abs/2601.09352v1

We propose Spectral Complex Autoencoder Pruning (SCAP), a reconstruction-based criterion that measures functional redundancy at the level of individual output channels. For each convolutional layer, we construct a complex interaction field by pairing the full multi-channel input activation as the real part with a single output-channel activation (spatially aligned and broadcast across input channels) as the imaginary part. We transform this complex field to the frequency domain and train a low-capacity autoencoder to reconstruct normalized spectra. Channels whose spectra are reconstructed with high fidelity are interpreted as lying close to a low-dimensional manifold captured by the autoencoder and are therefore more compressible; conversely, channels with low fidelity are retained as they encode information that cannot be compactly represented by the learned manifold. This yields an importance score (optionally fused with the filter L1 norm) that supports simple threshold-based pruning and produces a structurally consistent pruned network. On VGG16 trained on CIFAR-10, at a fixed threshold of 0.6, we obtain 90.11% FLOP reduction and 96.30% parameter reduction with an absolute Top-1 accuracy drop of 1.67% from a 93.44% baseline after fine-tuning, demonstrating that spectral reconstruction fidelity of complex interaction fields is an effective proxy for channel-level redundancy under aggressive compression.

See More, Store Less Memory-Efficient Resolution for Video Moment Retrieval

Authors: Mingyu Jeon, Sungjin Han, Jinkwon Hwang, Minchol Kwon, Jonghee Kim, Junyeong Kim

2026-01-14

http://arxiv.org/abs/2601.09350v1

Recent advances in Multimodal Large Language Models (MLLMs) have improved image recognition and reasoning, but video-related tasks remain challenging due to memory constraints from dense frame processing. Existing Video Moment Retrieval (VMR) methodologies rely on frame sampling, risking potential information loss, especially in lengthy videos. We propose SMORE (See MORE, store less), a framework that enhances memory efficiency while maintaining high information resolution. SMORE (1) uses query-guided captions to encode semantics aligned with user intent, (2) applies query-aware importance modulation to highlight relevant segments, and (3) adaptively compresses frames to preserve key content while reducing redundancy. This enables efficient video understanding without exceeding memory budgets. Experimental validation reveals that SMORE achieves state-of-the-art performance on QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks.

Range-Doppler-Acceleration Estimation for Fast-Moving and Accelerating Targets

Authors: Nadav Neuberger, Simon Kollecker, Martin Kaeske

2026-01-14

http://arxiv.org/abs/2601.09317v1

A central aspect of every pulsed radar signal processor is the target's Range-Doppler estimation within a Coherent Processing Interval. Conventional methods typically rely on simplifying assumptions, such as linear target motion, narrowband operation, or constant velocity, to enable fast computation. However, these assumptions break down in scenarios involving quadratic range-time behavior, high radial velocities or accelerations, or wideband signals, leading to undesired effects such as intra-pulse Doppler shift/stretch and target migration across Range-Doppler cells. This paper presents a generalized waveform-independent Range-Doppler approach that compensates for these effects while maintaining minimal Signal-to-Noise-Ratio loss and practical computational efficiency. The performance limits of the proposed method are analyzed and expressed through a unified metric that depends on both scene and system parameters. Comparison with other approaches is presented, showing their estimation bias and performance degradation.

Authors: Lianying Chao, Haoran Cai, Xubin Li, Kai Zhang, Sijie Wu, Rui Xu

2026-01-14

http://arxiv.org/abs/2601.09298v1

In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, the knowledge is not only hidden in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but do not have image captioning ability. Multi-modal LLMs (MLLMs) can understand images, but they do not have sufficient domain knowledge. To address the above issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) in ICT, and constructs a standard evaluation system to validate the performance of DICModel. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used for the first-stage supervised fine-tuning (SFT) of DICModel. Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT of DICModel. Finally, experts and LLMs jointly synthesize about 1.5K visual question answering data for the instruction-based SFT. Experimental results indicate that our DICModel with only 7B parameters performs better than other state-of-the-art models with 32B parameters. Compared to the SOTA models with 7B and 32B parameters, our DICModel increases the BLEU metric by approximately 56.8% and 20.8%, respectively. On the objective questions constructed by ICT domain experts, our DICModel outperforms Qwen2.5-VL 32B by 1% in terms of accuracy rate. In summary, this work can efficiently and accurately extract the logical text from images, which is expected to promote the development of multimodal models in the ICT domain.

Cluster Workload Allocation Semantic Soft Affinity Using Natural Language Processing

Authors: Leszek Sliwko, Jolanta Mizeria-Pietraszko

2026-01-14

http://arxiv.org/abs/2601.09282v1

Cluster workload allocation often requires complex configurations, creating a usability gap. This paper introduces a semantic, intent-driven scheduling paradigm for cluster systems using Natural Language Processing. The system employs a Large Language Model (LLM) integrated via a Kubernetes scheduler extender to interpret natural language allocation hint annotations for soft affinity preferences. A prototype featuring a cluster state and an intent analyzer (using AWS Bedrock) was developed. Empirical evaluation demonstrated high parsing accuracy (>95% Subset Accuracy on an evaluation ground-truth dataset) for top-tier models like Amazon Nova Pro/Premier and Mistral Pixtral Large, significantly outperforming a baseline engine. Scheduling quality tests across six scenarios showed the prototype achieved superior or equivalent placement compared to standard Kubernetes configurations, particularly excelling in complex and quantitative scenarios and handling conflicting soft preferences. The results validate using LLMs for accessible scheduling but highlight limitations like synchronous latency, suggesting asynchronous processing for production readiness. This work confirms the viability of semantic soft affinity for simplifying workload orchestration.

STaR Sensitive Trajectory Regulation for Unlearning in Large Reasoning Models

Authors: Jingjing Zhou, Gaoxiang Cong, Li Su, Liang Li

2026-01-14

http://arxiv.org/abs/2601.09281v1

Large Reasoning Models (LRMs) have advanced automated multi-step reasoning, but their ability to generate complex Chain-of-Thought (CoT) trajectories introduces severe privacy risks, as sensitive information may be deeply embedded throughout the reasoning process. Existing Large Language Model (LLM) unlearning approaches, which typically focus on modifying only final answers, are insufficient for LRMs, as they fail to remove sensitive content from intermediate steps, leading to persistent privacy leakage and degraded security. To address these challenges, we propose Sensitive Trajectory Regulation (STaR), a parameter-free, inference-time unlearning framework that achieves robust privacy protection throughout the reasoning process. Specifically, we first identify sensitive content via semantic-aware detection. Then, we inject global safety constraints through a secure prompt prefix. Next, we perform trajectory-aware suppression to dynamically block sensitive content across the entire reasoning chain. Finally, we apply token-level adaptive filtering to prevent both exact and paraphrased sensitive tokens during generation. Furthermore, to overcome the inadequacies of existing evaluation protocols, we introduce two metrics: Multi-Decoding Consistency Assessment (MCS), which measures the consistency of unlearning across diverse decoding strategies, and Multi-Granularity Membership Inference Attack (MIA) Evaluation, which quantifies privacy protection at both answer and reasoning-chain levels. Experiments on the R-TOFU benchmark demonstrate that STaR achieves comprehensive and stable unlearning with minimal utility loss, setting a new standard for privacy-preserving reasoning in LRMs.

Coordinated Pandemic Control with Large Language Model Agents as Policymaking Assistants

Authors: Ziyi Shi, Xusen Guo, Hongliang Lu, Mingxing Peng, Haotian Wang, Zheng Zhu, Zhenning Li, Yuxuan Liang, Xinhu Zheng, Hai Yang

2026-01-14

http://arxiv.org/abs/2601.09264v1

Effective pandemic control requires timely and coordinated policymaking across administrative regions that are intrinsically interdependent. However, human-driven responses are often fragmented and reactive, with policies formulated in isolation and adjusted only after outbreaks escalate, undermining proactive intervention and global pandemic mitigation. To address this challenge, here we propose a large language model (LLM) multi-agent policymaking framework that supports coordinated and proactive pandemic control across regions. Within our framework, each administrative region is assigned an agent as an AI policymaking assistant. The agent reasons over region-specific epidemiological dynamics while communicating with other agents to account for cross-regional interdependencies. By integrating real-world data, a pandemic evolution simulator, and structured inter-agent communication, our framework enables agents to jointly explore counterfactual intervention scenarios and synthesize coordinated policy decisions through a closed-loop simulation process. We validate the proposed framework using state-level COVID-19 data from the United States between April and December 2020, together with real-world mobility records and observed policy interventions. Compared with real-world pandemic outcomes, our approach reduces cumulative infections and deaths by up to 63.7% and 40.1%, respectively, at the individual state level, and by 39.0% and 27.0%, respectively, when aggregated across states. These results demonstrate that multi-agent systems can enable more effective pandemic control with coordinated policymaking...

BrainSegNet A Novel Framework for Whole-Brain MRI Parcellation Enhanced by Large Models

Authors: Yucheng Li, Xiaofan Wang, Junyi Wang, Yijie Li, Xi Zhu, Mubai Du, Dian Sheng, Wei Zhang, Fan Zhang

2026-01-14

http://arxiv.org/abs/2601.09263v1

Whole-brain parcellation from MRI is a critical yet challenging task due to the complexity of subdividing the brain into numerous small, irregularly shaped regions. Traditionally, template-registration methods were used, but recent advances have shifted to deep learning for faster workflows. While large models like the Segment Anything Model (SAM) offer transferable feature representations, they are not tailored for the high precision required in brain parcellation. To address this, we propose BrainSegNet, a novel framework that adapts SAM for accurate whole-brain parcellation into 95 regions. We enhance SAM by integrating U-Net skip connections and specialized modules into its encoder and decoder, enabling fine-grained anatomical precision. Key components include a hybrid encoder combining U-Net skip connections with SAM's transformer blocks, a multi-scale attention decoder with pyramid pooling for varying-sized structures, and a boundary refinement module to sharpen edges. Experimental results on the Human Connectome Project (HCP) dataset demonstrate that BrainSegNet outperforms several state-of-the-art methods, achieving higher accuracy and robustness in complex, multi-label parcellation.

A Theoretical Framework for Rate-Distortion Limits in Learned Image Compression

Authors: Changshuo Wang, Zijian Liang, Kai Niu, Ping Zhang

2026-01-14

http://arxiv.org/abs/2601.09254v1

We present a novel systematic theoretical framework to analyze the rate-distortion (R-D) limits of learned image compression. While recent neural codecs have achieved remarkable empirical results, their distance from the information-theoretic limit remains unclear. Our work addresses this gap by decomposing the R-D performance loss into three key components: variance estimation, quantization strategy, and context modeling. First, we derive the optimal latent variance as the second moment under a Gaussian assumption, providing a principled alternative to hyperprior-based estimation. Second, we quantify the gap between uniform quantization and the Gaussian test channel derived from the reverse water-filling theorem. Third, we extend our framework to include context modeling, and demonstrate that accurate mean prediction yields substantial entropy reduction. Unlike prior R-D estimators, our method provides a structurally interpretable perspective that aligns with real compression modules and enables fine-grained analysis. Through joint simulation and end-to-end training, we derive a tight and actionable approximation of the theoretical R-D limits, offering new insights into the design of more efficient learned compression systems.
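
The reverse water-filling theorem mentioned in this abstract can be sketched numerically. The snippet below is a textbook version for independent Gaussian components under a total squared-error budget, not the paper's framework. The function name, the bisection search, and the example variances are assumptions made for illustration.

```python
import math


def reverse_water_filling(variances, total_distortion, iterations=100):
    """Return (theta, rate_bits) for independent Gaussian components.

    Each component with variance v receives distortion min(theta, v); the
    water level theta is chosen so the distortions sum to total_distortion.
    """
    if not 0.0 < total_distortion <= sum(variances):
        raise ValueError("total_distortion must lie in (0, sum(variances)]")
    low, high = 0.0, max(variances)
    for _ in range(iterations):
        theta = (low + high) / 2.0
        achieved = sum(min(theta, v) for v in variances)
        if achieved < total_distortion:
            low = theta   # water level too low, raise it
        else:
            high = theta  # water level high enough, lower it
    theta = (low + high) / 2.0
    # Components with variance at or below theta are not coded (zero rate).
    rate_bits = sum(0.5 * math.log2(v / theta) for v in variances if v > theta)
    return theta, rate_bits


# Hypothetical latent variances, for illustration only.
theta_a, rate_a = reverse_water_filling([4.0, 1.0], total_distortion=1.0)
theta_b, rate_b = reverse_water_filling([4.0, 1.0], total_distortion=3.0)
```

In the second call the water level exceeds the smaller variance, so that component is dropped and only the larger one contributes rate.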

DSA-Tokenizer Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

Authors: Hanlin Zhang, Daxin Tan, Dehua Tao, Xiao Chen, Haochen Tan, Yunhe Li, Yuchen Cao, Jianping Wang, Linqi Song

2026-01-14

http://arxiv.org/abs/2601.09239v2

Speech tokenizers serve as the cornerstone of discrete Speech Large Language Models (Speech LLMs). Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. To eliminate rigid length constraints between the two sequences, we introduce a hierarchical Flow-Matching decoder that further improves speech generation quality. Furthermore, we employ a joint reconstruction-recombination training strategy to enforce this separation. DSA-Tokenizer enables high-fidelity reconstruction and flexible recombination through robust disentanglement, facilitating controllable generation in speech LLMs. Our analysis highlights disentangled tokenization as a pivotal paradigm for future speech modeling. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/. The code and model will be made publicly available after the paper has been accepted.

$D^2Prune$ Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness

Authors: Lang Xiong, Ning Liu, Ao Ren, Yuheng Bai, Haining Fang, BinYan Zhang, Zhe Jiang, Yujuan Tan, Duo Liu

2026-01-14

http://arxiv.org/abs/2601.09176v1

Large language models (LLMs) face significant deployment challenges due to their massive computational demands. While pruning offers a promising solution, existing methods suffer from two critical limitations: (1) They neglect activation distribution shifts between calibration data and test data, resulting in inaccurate error estimations; (2) They overlook the long-tail distribution characteristics of activations in the attention module. To address these limitations, this paper proposes a novel pruning method, $D^2Prune$. First, we propose a dual Taylor expansion-based method that jointly models weight and activation perturbations for precise error estimation, leading to precise pruning mask selection and weight updating and facilitating error minimization during pruning. Second, we propose an attention-aware dynamic update strategy that preserves the long-tail attention pattern by jointly minimizing the KL divergence of attention distributions and the reconstruction error. Extensive experiments show that $D^2Prune$ consistently outperforms SOTA methods across various LLMs (e.g., OPT-125M, LLaMA2/3, and Qwen3). Moreover, the dynamic attention update mechanism also generalizes well to ViT-based vision models like DeiT, achieving superior accuracy on ImageNet-1K.

Data-Driven Exploration and Insights into Temperature-Dependent Phonons in Inorganic Materials

Authors: Huiju Lee, Zhi Li, Jiangang He, Yi Xia

2026-01-14

http://arxiv.org/abs/2601.09123v1

Phonons, quantized vibrations of the atomic lattice, are fundamental to understanding thermal transport, structural stability, and phase behavior in crystalline solids. Despite advances in computational materials science, most predictions of vibrational properties in large materials databases rely on the harmonic approximation and overlook crucial temperature-dependent anharmonic effects. Here, we present a scalable computational framework that combines machine learning interatomic potentials, anharmonic lattice dynamics, and high-throughput calculations to investigate temperature-dependent phonons across thousands of materials. By fine-tuning the universal M3GNet interatomic potential using high-quality phonon data, we improve phonon prediction accuracy by a factor of four while preserving computational efficiency. Integrating this refined model into a high-throughput implementation of the stochastic self-consistent harmonic approximation, we compute temperature-dependent phonons for 4,669 inorganic compounds. Our analysis identifies systematic elemental and structural trends governing anharmonic phonon renormalization, with particularly strong manifestations in alkali metals, perovskite-derived frameworks, and related systems. Machine learning models trained on this dataset identify key atomic-scale features driving strong anharmonicity, including weak bonding, large atomic radii, and specific coordination motifs. First-principles validation confirms that anharmonic effects can dramatically alter lattice thermal conductivity by factors of two to four in some materials. This work establishes a robust and efficient data-driven approach for predicting finite-temperature phonon behavior, offering new pathways for the design and discovery of materials with tailored thermal and vibrational properties.

AviationLMM A Large Multimodal Foundation Model for Civil Aviation

Authors: Wenbin Li, Jingling Wu, Xiaoyong Lin, Jing Chen, Cong Chen

2026-01-14

http://arxiv.org/abs/2601.09105v1

Civil aviation is a cornerstone of global transportation and commerce, and ensuring its safety, efficiency and customer satisfaction is paramount. Yet conventional Artificial Intelligence (AI) solutions in aviation remain siloed and narrow, focusing on isolated tasks or single modalities. They struggle to integrate heterogeneous data such as voice communications, radar tracks, sensor streams and textual reports, which limits situational awareness, adaptability, and real-time decision support. This paper introduces the vision of AviationLMM, a Large Multimodal foundation Model for civil aviation, designed to unify the heterogeneous data streams of civil aviation and enable understanding, reasoning, generation and agentic applications. First, we identify the gaps between existing AI solutions and requirements. Second, we describe the model architecture, which ingests multimodal inputs such as air-ground voice, surveillance, on-board telemetry, video and structured texts, performs cross-modal alignment and fusion, and produces flexible outputs ranging from situation summaries and risk alerts to predictive diagnostics and multimodal incident reconstructions. To fully realize this vision, we identify key research opportunities to address, including data acquisition, alignment and fusion, pretraining, reasoning, trustworthiness, privacy, robustness to missing modalities, and synthetic scenario generation. By articulating the design and challenges of AviationLMM, we aim to advance civil aviation foundation models and catalyze coordinated research efforts toward an integrated, trustworthy and privacy-preserving aviation AI ecosystem.

Hidden States as Early Signals Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling

Authors: Zhixiang Liang, Beichen Huang, Zheng Wang, Minjia Zhang

2026-01-14

http://arxiv.org/abs/2601.09093v1

Large Language Models (LLMs) can enhance reasoning capabilities through test-time scaling by generating multiple traces. However, the combination of lengthy reasoning traces with multiple sampling introduces substantial computation and high end-to-end latency. Prior work on accelerating this process has relied on similarity-based or confidence-based pruning, but these signals do not reliably indicate trace quality. To address these limitations, we propose STEP: Step-level Trace Evaluation and Pruning, a novel pruning framework that evaluates reasoning steps using hidden states and dynamically prunes unpromising traces during generation. We train a lightweight step scorer to estimate trace quality, and design a GPU memory-aware pruning strategy that triggers pruning as the GPU memory is saturated by the KV cache to reduce end-to-end latency. Experiments across challenging reasoning benchmarks demonstrate that STEP reduces end-to-end inference latency by 45%-70% on average compared to self-consistency while also improving reasoning accuracy. Our code is released at: https://github.com/Supercomputing-System-AI-Lab/STEP

Exploring Reliable Spatiotemporal Dependencies for Efficient Visual Tracking

Authors: Junze Shi, Yang Yu, Jian Shi, Haibo Luo

2026-01-14

http://arxiv.org/abs/2601.09078v1

Recent advances in transformer-based lightweight object tracking have established new standards across benchmarks, leveraging the global receptive field and powerful feature extraction capabilities of attention mechanisms. Despite these achievements, existing methods universally employ sparse sampling during training--utilizing only one template and one search image per sequence--which fails to comprehensively explore spatiotemporal information in videos. This limitation constrains performance and causes the gap between lightweight and high-performance trackers. To bridge this divide while maintaining real-time efficiency, we propose STDTrack, a framework that pioneers the integration of reliable spatiotemporal dependencies into lightweight trackers. Our approach implements dense video sampling to maximize spatiotemporal information utilization. We introduce a temporally propagating spatiotemporal token to guide per-frame feature extraction. To ensure comprehensive target state representation, we design the Multi-frame Information Fusion Module (MFIFM), which augments current dependencies using historical context. The MFIFM operates on features stored in our constructed Spatiotemporal Token Maintainer (STM), where a quality-based update mechanism ensures information reliability. Considering the scale variation among tracking targets, we develop a multi-scale prediction head to dynamically adapt to objects of different sizes. Extensive experiments demonstrate state-of-the-art results across six benchmarks. Notably, on GOT-10k, STDTrack rivals certain high-performance non-real-time trackers (e.g., MixFormer) while operating at 192 FPS (GPU) and 41 FPS (CPU).

Depth-Wise Representation Development Under Blockwise Self-Supervised Learning for Video Vision Transformers

Authors: Jonas Römer, Timo Dickscheid

2026-01-14

http://arxiv.org/abs/2601.09040v1

End-to-end backpropagation couples all layers through a global error signal, enabling coordinated learning but requiring long-range credit assignment. Motivated by recent progress in blockwise self-supervised learning (BWSSL), we ask whether masked video transformers can be trained without end-to-end backpropagation. Applying BWSSL to masked video modeling remains relatively underexplored and must handle spatiotemporal context and long-range temporal structure. More broadly, analyses that compare BWSSL and end-to-end training in terms of learning dynamics and depth-wise representation development remain sparse. We apply blockwise learning to a masked autoencoding video vision transformer by partitioning the encoder into blocks, each of which is optimized with a local masked reconstruction loss. Across model sizes and partition granularities, training converges and yields representations close to matched end-to-end baselines under linear-probe and retrieval proxies. In order to compare intermediate representations, we analyze depth-wise decodability, inter-block similarity, and patch-level diagnostics. Blockwise training exposes higher-level structure earlier, while later blocks saturate and operate in a more geometry-preserving regime. It can also induce token-level shifts consistent with stronger early mixing that pooled metrics can miss. These findings point to late-block saturation and interface formation as contributors to the remaining gap.

Layer-Parallel Training for Transformers

Authors: Shuai Jiang, Marc Salvado, Eric C. Cyr, Alena Kopaničáková, Rolf Krause, Jacob B. Schroder

2026-01-13

http://arxiv.org/abs/2601.09026v1

We present a new training methodology for transformers using a multilevel, layer-parallel approach. Through a neural ODE formulation of transformers, our application of a multilevel parallel-in-time algorithm for the forward and backpropagation phases of training achieves parallel acceleration over the layer dimension. This dramatically enhances parallel scalability as the network depth increases, which is particularly useful for increasingly large foundational models. However, achieving this introduces errors that cause systematic bias in the gradients, which in turn reduces convergence when closer to the minima. We develop an algorithm to detect this critical transition and either switch to serial training or systematically increase the accuracy of layer-parallel training. Results, including BERT, GPT2, ViT, and machine translation architectures, demonstrate parallel acceleration as well as accuracy commensurate with serial pre-training while fine-tuning is unaffected.

Universal Latent Homeomorphic Manifolds Cross-Domain Representation Learning via Homeomorphism Verification

Authors: Tong Wu, Tayab Uddin Wara, Daniel Hernandez, Sidong Lei

2026-01-13

http://arxiv.org/abs/2601.09025v1

We present the Universal Latent Homeomorphic Manifold (ULHM), a framework that unifies semantic representations (e.g., human descriptions, diagnostic labels) and observation-driven machine representations (e.g., pixel intensities, sensor readings) into a single latent structure. Despite originating from fundamentally different pathways, both modalities capture the same underlying reality. We establish homeomorphism, a continuous bijection preserving topological structure, as the mathematical criterion for determining when latent manifolds induced by different semantic-observation pairs can be rigorously unified. This criterion provides theoretical guarantees for three critical applications: (1) semantic-guided sparse recovery from incomplete observations, (2) cross-domain transfer learning with verified structural compatibility, and (3) zero-shot compositional learning via valid transfer from semantic to observation space. Our framework learns continuous manifold-to-manifold transformations through conditional variational inference, avoiding brittle point-to-point mappings. We develop practical verification algorithms, including trust, continuity, and Wasserstein distance metrics, that empirically validate homeomorphic structure from finite samples. Experiments demonstrate: (1) sparse image recovery from 5% of CelebA pixels and MNIST digit reconstruction at multiple sparsity levels, (2) cross-domain classifier transfer achieving 86.73% accuracy from MNIST to Fashion-MNIST without retraining, and (3) zero-shot classification on unseen classes achieving 89.47% on MNIST, 84.70% on Fashion-MNIST, and 78.76% on CIFAR-10. Critically, the homeomorphism criterion correctly rejects incompatible datasets, preventing invalid unification and providing a feasible path to principled decomposition of general foundation models into verified domain-specific components.

Synthetic Data for Veterinary EHR De-identification Benefits, Limits, and Safety Trade-offs Under Fixed Compute

Authors: David Brundage

2026-01-13

http://arxiv.org/abs/2601.09756v1

Veterinary electronic health records (vEHRs) contain privacy-sensitive identifiers that limit secondary use. While PetEVAL provides a benchmark for veterinary de-identification, the domain remains low-resource. This study evaluates whether large language model (LLM)-generated synthetic narratives improve de-identification safety under distinct training regimes, emphasizing (i) synthetic augmentation and (ii) fixed-budget substitution. We conducted a controlled simulation using a PetEVAL-derived corpus (3,750 holdout/1,249 train). We generated 10,382 synthetic notes using a privacy-preserving "template-only" regime where identifiers were removed prior to prompting. Three backbones (PetBERT, VetBERT, Bio_ClinicalBERT) were trained under varying mixtures. Evaluation prioritized document-level leakage rate (the fraction of documents with at least one missed identifier) as the primary safety outcome. Results show that under fixed-sample substitution, replacing real notes with synthetic ones monotonically increased leakage, indicating synthetic data cannot safely replace real supervision. Under compute-matched training, moderate synthetic mixing matched real-only performance, but high synthetic dominance degraded utility. Conversely, epoch-scaled augmentation improved performance: PetBERT span- F1 increased from 0.831 to 0.850 +/- 0.014, and leakage decreased from 6.32% to 4.02% +/- 0.19%. However, these gains largely reflect increased training exposure rather than intrinsic synthetic data quality. Corpus diagnostics revealed systematic synthetic-real mismatches in note length and label distribution that align with persistent leakage. We conclude that synthetic augmentation is effective for expanding exposure but is complementary, not substitutive, for safety-critical veterinary de-identification.
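
The document-level leakage rate defined in this abstract (the fraction of documents with at least one missed identifier) can be sketched as follows. The span representation as `(start, end)` tuples and the exact-match criterion for a "missed" identifier are assumptions for illustration, as are the function name and the toy data. The study's own matching rules may differ.

```python
def document_leakage_rate(gold_spans_per_doc, predicted_spans_per_doc):
    """Fraction of documents in which at least one gold identifier span was missed.

    Assumption: an identifier counts as caught only if the exact (start, end)
    span appears among the predictions for the same document.
    """
    if len(gold_spans_per_doc) != len(predicted_spans_per_doc):
        raise ValueError("gold and predicted lists must cover the same documents")
    if not gold_spans_per_doc:
        raise ValueError("at least one document is required")
    leaked_documents = 0
    for gold, predicted in zip(gold_spans_per_doc, predicted_spans_per_doc):
        predicted_set = set(predicted)
        if any(span not in predicted_set for span in gold):
            leaked_documents += 1
    return leaked_documents / len(gold_spans_per_doc)


# Hypothetical toy corpus: four notes, one of which has a missed identifier.
gold = [[(0, 4)], [(2, 5), (9, 12)], [], [(1, 3)]]
predicted = [[(0, 4)], [(2, 5)], [], [(1, 3), (7, 8)]]
leakage = document_leakage_rate(gold, predicted)
```

Unlike span-level F1, this metric does not penalize the spurious prediction in the last note, and a single missed span marks the whole document as leaked.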