2026-01-23

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
Learning to Discover at Test Time
CONTEX-T Contextual Privacy Exploitation via Transformer Spectral Analysis for IoT Device Fingerprinting
Pay (Cross) Attention to the Melody Curriculum Masking for Single-Encoder Melodic Harmonization
Low-altitude Multi-UAV-assisted Data Collection and Semantic Forwarding for Post-Disaster Relief
SAMTok Representing Any Mask with Two Words
Sawtooth Wavefront Reordering Enhanced CuTile FlashAttention on NVIDIA GB10
Blind Identification of Channel Codes A Subspace-Coding Approach
TinySense Effective CSI Compression for Scalable and Accurate Wi-Fi Sensing
Determination of the longitude difference between Baghdad and Khwarezm using a lunar eclipse (the method of Abu Rayhan al-Biruni and Abu al-Wafa al-Buzjani)
Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models
LL-GaussianImage Efficient Image Representation for Zero-shot Low-Light Enhancement with 2D Gaussian Splatting
CoNRec Context-Discerning Negative Recommendation with LLMs
FlexLLM Composable HLS Library for Flexible Hybrid LLM Accelerator Design
Persona Switch Mixing Distinct Perspectives in Decoding Time
TempoNet Learning Realistic Communication and Timing Patterns for Network Traffic Simulation
Event-VStream Event-Driven Real-Time Understanding for Long Video Streams
Distributed Multichannel Active Noise Control with Asynchronous Communication
Scaling-Based Quantization of Spacetime Microstructure
Robust Tool Use via Fission-GRPO Learning to Recover from Execution Errors
Tensor-based phase difference estimation on time series analysis
ToxiTwitch Toward Emote-Aware Hybrid Moderation for Live Streaming Platforms
DeepASMR LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice
YuFeng-XGuard A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models
Amalgamated CHIRP and OFDM for ISAC
From Generation to Collaboration Using LLMs to Edit for Empathy in Healthcare
LLM or Human? Perceptions of Trust and Information Quality in Research Summaries
DS@GT at TREC TOT 2025 Bridging Vague Recollection with Fusion Retrieval and Learned Reranking
MARS Unleashing the Power of Speculative Decoding via Margin-Aware Verification
Martingale Foresight Sampling A Principled Approach to Inference-Time LLM Decoding
Exploring Implicit Perspectives on Autism in Large Language Models Through Multi-Agent Simulations
Domain-Specific Knowledge Graphs in RAG-Enhanced Healthcare LLMs
DuFal Dual-Frequency-Aware Learning for High-Fidelity Extremely Sparse-view CBCT Reconstruction
Beyond Prompting Efficient and Robust Contextual Biasing for Speech LLMs via Logit-Space Integration (LOGIC)
FedUMM A General Framework for Federated Learning with Unified Multimodal Models
Towards Understanding Best Practices for Quantization of Vision-Language Models
Lightweight LLMs for Network Attack Detection in IoT Networks
Aligned Stable Inpainting Mitigating Unwanted Object Insertion and Preserving Color Consistency
Deaf and Hard of Hearing Access to Intelligent Personal Assistants Comparison of Voice-Based Options with an LLM-Powered Touch Interface
The Flexibility Trap Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
DeepFedNAS A Unified Framework for Principled, Hardware-Aware, and Predictor-Free Federated Neural Architecture Search
Field-Space Autoencoder for Scalable Climate Emulators
LoRAP Low-Rank Aggregation Prompting for Quantized Graph Neural Networks Training
SpooFL Spoofing Federated Learning
Game-Theoretic Lens on LLM-based Multi-Agent Systems
Parallel Collaborative ADMM Privacy Computing and Adaptive GPU Acceleration for Distributed Edge Networks
Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers
Multi-Behavior Sequential Modeling with Transition-Aware Graph Attention Network for E-Commerce Recommendation
Deep Learning assisted Port-Cycling based Channel Sounding for Precoder Estimation in Massive MIMO Arrays
CorpusQA A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning
Rank-one Riemannian Subspace Descent for Nonlinear Matrix Equations
What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study
Multi-Task Transformer for Explainable Speech Deepfake Detection via Formant Modeling
POTR Post-Training 3DGS Compression
FunCineForge A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes
Render-of-Thought Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
ARISE -- Adaptive Refinement and Iterative Scenario Engineering
Optimizing FaaS Platforms for MCP-enabled Agentic Workflows
ARFT-Transformer Modeling Metric Dependencies for Cross-Project Aging-Related Bug Prediction
HERMES KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
Ramping-aware Enhanced Flexibility Aggregation of Distributed Generation with Energy Storage in Power Distribution Networks
IB-GRPO Aligning LLM-based Learning Path Recommendation with Educational Objectives via Indicator-Based Group Relative Policy Optimization
FARE Fast-Slow Agentic Robotic Exploration
INFA-Guard Mitigating Malicious Propagation via Infection-Aware Safeguarding in LLM-Based Multi-Agent Systems
Communication-Efficient Federated Risk Difference Estimation for Time-to-Event Clinical Outcomes
QMC Efficient SLM Edge Inference via Outlier-Aware Quantization and Emergent Memories Co-Design
On the Runway Cascade of Transformers for Language Modeling

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Authors: Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie

2026-01-22

http://arxiv.org/abs/2601.16208v1

Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE rs on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented offer negligible benefits at scale Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.

Learning to Discover at Test Time

Authors: Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun

2026-01-22

http://arxiv.org/abs/2601.16175v1

How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen . We perform reinforcement learning at test time, so the can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.

CONTEX-T Contextual Privacy Exploitation via Transformer Spectral Analysis for IoT Device Fingerprinting

Authors: Nazmul Islam, Mohammad Zulkernine

2026-01-22

http://arxiv.org/abs/2601.16160v1

The rapid expansion of internet of things (IoT) devices have created a pervasive ecosystem where encrypted wireless s serve as the primary privacy and security protection mechanism. While encryption effectively protects message content, packet metadata and statistics inadvertently expose device identities and user contexts. Various studies have exploited raw packet statistics and their visual representations for device fingerprinting and identification. However, these approaches remain confined to the spatial domain with limited feature representation. Therefore, this paper presents CONTEX-T, a novel framework that exploits contextual privacy vulnerabilities using spectral representation of encrypted wireless traffic for IoT device characterization. The experiments show that spectral analysis provides new and rich feature representation for covert reconnaissance attacks, revealing a complex and expanding threat landscape that would require robust countermeasures for IoT security management. CONTEXT-T first transforms raw packet length sequences into time-frequency spectral representations and then utilizes -based spectral analysis for the device identification. We systematically evaluated multiple spectral representation techniques and -based models across encrypted traffic samples from various IoT devices. CONTEXT-T effectively exploited privacy vulnerabilities and achieved device classification accuracy exceeding 99% across all devices while remaining completely passive and undetectable.

Pay (Cross) Attention to the Melody Curriculum Masking for Single-Encoder Melodic Harmonization

Authors: Maximos Kaliakatsos-Papakostas, Dimos Makris, Konstantinos Soiledis, Konstantinos-Theodoros Tsamis, Vassilis Katsouros, Emilios Cambouropoulos

2026-01-22

http://arxiv.org/abs/2601.16150v1

Melodic harmonization, the task of generating harmonic accompaniments for a given melody, remains a central challenge in computational music generation. Recent single encoder approaches have framed harmonization as a masked sequence modeling problem, but existing training curricula inspired by discrete diffusion often result in weak (cross) attention between melody and harmony. This leads to limited exploitation of melodic cues, particularly in out-of-domain contexts. In this work, we introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences during training to strengthen melody-harmony interactions. We systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics, with particularly strong gains in out-of-domain evaluations where harmonic adaptability to novel melodic queues is crucial. We further find that quarter-note , intertwining of bar tokens, and pitch-class melody representations are advantageous in the FF setting. Our findings highlight the importance of training curricula in enabling effective melody conditioning and suggest that full-to-full unmasking offers a robust strategy for single encoder harmonization.

Low-altitude Multi-UAV-assisted Data Collection and Semantic Forwarding for Post-Disaster Relief

Authors: Xiaoya Zheng, Geng Sun, Jiahui Li, Jiacheng Wang, Weijie Yuan, Qingqing Wu, Dusit Niyato, Abbas Jamalipour

2026-01-22

http://arxiv.org/abs/2601.16146v1

The low-altitude economy (LAE) is an emerging economic paradigm which fosters integrated development across multiple fields. As a pivotal component of the LAE, low-altitude uncrewed aerial vehicles (UAVs) can restore by as aerial relays between the post-disaster areas and remote base stations (BSs). However, conventional approaches face challenges from vulnerable long-distance links between the UAVs and remote BSs, and data bottlenecks arising from massive data volumes and limited onboard UAV resources. In this work, we investigate a low-altitude multi-UAV-assisted data collection and semantic forwarding network, in which multiple UAVs collect data from ground users, form clusters, perform intra-cluster data aggregation with semantic extraction, and then cooperate as virtual antenna array (VAAs) to transmit the extracted semantic information to a remote BS via collaborative beamforming (CB). We formulate a data collection and semantic forwarding multi-objective optimization problem (DCSFMOP) that jointly maximizes both the user and semantic transmission rates while minimizing UAV energy consumption. The formulated DCSFMOP is a mixed-integer nonlinear programming (MINLP) problem that is inherently NP-hard and characterized by dynamically varying decision variable dimensionality. To address these challenges, we propose a large language model-enabled alternating optimization approach (-AOA), which effectively handles the complex search space and variable dimensionality by optimizing different subsets of decision variables through tailored optimization strategies. Simulation results demonstrate that -AOA outperforms AOA by approximately 26.8\% and 22.9\% in transmission rate and semantic rate, respectively.

SAMTok Representing Any Mask with Two Words

Authors: Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li

2026-01-22

http://arxiv.org/abs/2601.16093v1

Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal s (Ms) remain difficult to scale due to complex region-level encoders, specialized segmentation rs, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base Ms (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector r to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping Ms with strong pixel-wise capabilities. Our code and models are available.

Sawtooth Wavefront Reordering Enhanced CuTile FlashAttention on NVIDIA GB10

Authors: Yifan Zhu, Yekai Pan, Chen Ding

2026-01-22

http://arxiv.org/abs/2601.16032v1

High-performance attention kernels are essential for Large Language Models. This paper presents analysis of CuTile-based Flash Attention memory behavior and a technique to improve its performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 miss. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, ob 50\% or greater reduction in L2 misses and up to 60\% increase in throughput on GB10.

Authors: Pramod Singh, Prasad Krishnan, Arti Yardi

2026-01-22

http://arxiv.org/abs/2601.15903v1

The problem of blind identification of channel codes at a receiver involves identifying a code chosen by a transmitter from a known code-family, by ob the transmitted codewords through the channel. Most existing approaches for code-identification are contingent upon the codes in the family having some special structure, and are often computationally expensive otherwise. Further, rigorous analytical guarantees on the performance of these existing techniques are largely absent. This work presents a new method for code-identification on the binary symmetric channel (BSC), inspired by the framework of subspace codes for operator channels, carefully combining principles of hamming-metric and subspace-metric . We refer to this method as the minimum denoised subspace discrepancy r. We present theoretical guarantees for code-identification using this r, for bounded-weight errors, and also present a bound on the probability of error when used on the BSC. Simulations demonstrate the improved performance of our r for random linear codes beyond existing general-purpose techniques, across most channel conditions and even with a limited number of received vectors.

TinySense Effective CSI Compression for Scalable and Accurate Wi-Fi Sensing

Authors: Toan Gian, Dung T. Tran, Viet Quoc Pham, Francesco Restuccia, Van-Dinh Nguyen

2026-01-22

http://arxiv.org/abs/2601.15838v1

With the growing demand for device-free and privacy-pre sensing solutions, Wi-Fi sensing has emerged as a promising approach for human pose estimation (HPE). However, existing methods often process vast amounts of channel state information (CSI) data directly, ultimately straining networking resources. This paper introduces TinySense, an efficient framework that enhances the scalability of Wi-Fi-based human sensing. Our approach is based on a new vector -based generative adversarial network (VQGAN). Specifically, by leveraging a VQGAN-learned codebook, TinySense significantly reduces CSI data while maintaining the accuracy required for reliable HPE. To optimize , we employ the K-means algorithm to dynamically adjust bitrates to cluster a large-scale pre-trained codebook into smaller subsets. Furthermore, a Transformer model is incorporated to mitigate bitrate loss, enhancing robustness in unreliable networking conditions. We prototype TinySense on an experimental testbed using Jetson Nano and Raspberry Pi to measure latency and network resource use. Extensive results demonstrate that TinySense significantly outperforms state-of-the-art schemes, achieving up to 1.5x higher HPE accuracy score (PCK20) under the same rate. It also reduces latency and networking overhead, respectively, by up to 5x and 2.5x. The code repository is available online at here.

Determination of the longitude difference between Baghdad and Khwarezm using a lunar eclipse (the method of Abu Rayhan al-Biruni and Abu al-Wafa al-Buzjani)

Authors: Rizoi Bakhromzod

2026-01-22

http://arxiv.org/abs/2601.15837v1

This paper examines how, in the tenth century, medieval Iranian scholars Abu Rayhan al-Biruni and Abu al-Wafa al-Buzjani determined the difference in geographical longitude between the cities of Baghdad and Khwarezm through simultaneous observation of a lunar eclipse. Brief academic biographies of these scholars are presented, with emphasis on their contributions to mathematics and astronomy. The study discusses the importance of determining geographical coordinates - especially longitude - in the science of the 10th-11th centuries, provides an overview of the methods of coordinate determination available at the time, and highlights the problem of synchronizing remote observations prior to the advent of electronic . Particular attention is devoted to a detailed analysis of the method based on ob a lunar eclipse to simultaneously measure longitude differences: the necessary conditions and organization of the experiment, the instruments employed, the mathematical calculations, and error estimates are described. The longitude difference obtained by al-Biruni and al-Buzjani is compared with modern values. The conclusion discusses the scientific significance of this method for the history of science and astronomy.

Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models

Authors: Fengheng Chu, Jiahao Chen, Yuhong Wang, Jun Wang, Zhihui Fu, Shouling Ji, Songze Li

2026-01-22

http://arxiv.org/abs/2601.15801v1

While Large Language Models (s) are aligned to mitigate risks, their safety guardrails remain fragile against jailbreak attacks. This reveals limited understanding of components governing safety. Existing methods rely on local, greedy attribution that assumes independent component contributions. However, they overlook the cooperative interactions between different components in s, such as attention heads, which jointly contribute to safety mechanisms. We propose \textbf{G}lobal \textbf{O}ptimization for \textbf{S}afety \textbf{V}ector Extraction (GOSV), a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously. We employ two complementary activation repatching strategies: Harmful Patching and Zero Ablation. These strategies identify two spatially distinct sets of safety vectors with consistently low , termed Malicious Injection Vectors and Safety Suppression Vectors, demonstrating that aligned s maintain separate functional pathways for safety purposes. Through systematic analyses, we find that complete safety breakdown occurs when approximately 30\% of total heads are repatched across all models. Building on these insights, we develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching. Our attack substantially outperforms existing white-box attacks across all test models, providing strong evidence for the effectiveness of the proposed GOSV framework on safety interpretability.

LL-GaussianImage Efficient Image Representation for Zero-shot Low-Light Enhancement with 2D Gaussian Splatting

Authors: Yuhan Chen, Wenxuan Yu, Guofa Li, Yijun Xu, Ying Fang, Yicui Shi, Long Cao, Wenbo Chu, Keqiang Li

2026-01-22

http://arxiv.org/abs/2601.15772v1

2D Gaussian Splatting (2DGS) is an emerging explicit scene representation method with significant potential for image due to high fidelity and high ratios. However, existing low-light enhancement algorithms operate predominantly within the pixel domain. Processing 2DGS-compressed images necessitates a cumbersome de-enhancement-re pipeline, which compromises efficiency and introduces secondary degradation. To address these limitations, we propose LL-GaussianImage, the first zero-shot unsupervised framework designed for low-light enhancement directly within the 2DGS compressed representation domain. Three primary advantages are offered by this framework. First, a semantic-guided Mixture-of-Experts enhancement framework is designed. Dynamic adaptive transformations are applied to the attribute space of 2DGS using rendered images as guidance to enable -as-enhancement without full de to a pixel grid. Second, a multi-objective collaborative loss function system is established to strictly constrain smoothness and fidelity during enhancement, suppressing artifacts while improving visual quality. Third, a two-stage optimization process is utilized to achieve reconstruction-as-enhancement. The accuracy of the base representation is ensured through single-scale reconstruction and network robustness is enhanced. High-quality enhancement of low-light images is achieved while high ratios are maintained. The feasibility and superiority of the paradigm for direct processing within the compressed representation domain are validated through experimental results.

CoNRec Context-Discerning Negative Recommendation with LLMs

Authors: Xinda Chen, Jiawei Wu, Yishuang Liu, Jialin Zhu, Shuwen Xiao, Junjun Zheng, Xiangheng Kong, Yuning Jiang

2026-01-22

http://arxiv.org/abs/2601.15721v1

Understanding what users like is relatively straightforward; understanding what users dislike, however, remains a challenging and underexplored problem. Research into users' negative preferences has gained increasing importance in modern recommendation systems. Numerous platforms have introduced explicit negative feedback mechanisms and leverage such signals to refine their recommendation models. Beyond traditional business metrics, user experience-driven metrics, such as negative feedback rates, have become critical indicators for evaluating system performance. However, most existing approaches primarily use negative feedback as an auxiliary signal to enhance positive recommendations, paying little attention to directly modeling negative interests, which can be highly valuable in offline applications. Moreover, due to the inherent of negative feedback data, models often suffer from context understanding biases induced by positive feedback dominance. To address these challenges, we propose the first large language model framework for negative feedback modeling with special designed context-discerning modules. We use semantic ID Representation to replace text-based item descriptions and introduce an item-level alignment task that enhances the 's understanding of the semantic context behind negative feedback. Furthermore, we design a Progressive GRPO training paradigm that enables the model to dynamically balance the positive and negative behavioral context utilization. Besides, our investigation further reveals a fundamental misalignment between the conventional next-negative-item prediction objective and users' true negative preferences, which is heavily influenced by the system's recommendation order. To mitigate this, we propose a novel reward function and evaluation metric grounded in multi-day future negative feedback and their collaborative signals.

FlexLLM Composable HLS Library for Flexible Hybrid LLM Accelerator Design

Authors: Jiahao Zhang, Zifan He, Nicholas Fraser, Michaela Blott, Yizhou Sun, Jason Cong

2026-01-22

http://arxiv.org/abs/2601.15710v1

We present Flex, a composable High-Level Synthesis (HLS) library for rapid development of domain-specific accelerators. Flex exposes key architectural degrees of freedom for stage-customized inference, enabling hybrid designs that tailor temporal reuse and spatial dataflow differently for and , and provides a comprehensive suite to support accurate deployment. Using Flex, we build a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. The system includes: (1) a stage-customized accelerator with hardware-efficient (12.68 WikiText-2 PPL) surpassing SpinQuant baseline, and (2) a Hierarchical Memory Transformer (HMT) plug-in for efficient long-context processing. On the AMD U280 FPGA at 16nm, the accelerator achieves 1.29 $\times$ end-to-end speedup, 1.64 $\times$ higher throughput, and 3.14 $\times$ better energy efficiency than an NVIDIA A100 GPU (7nm) running BF16 inference; projected results on the V80 FPGA at 7nm reach 4.71 $\times$ , 6.55 $\times$ , and 4.13 $\times$ , respectively. In long-context scenarios, integrating the HMT plug-in reduces latency by 23.23 $\times$ and extends the context window by 64 $\times$ , delivering 1.10 $\times$ /4.86 $\times$ lower end-to-end latency and 5.21 $\times$ /6.27 $\times$ higher energy efficiency on the U280/V80 compared to the A100 baseline. Flex thus bridges algorithmic innovation in inference and high-performance accelerators with minimal manual effort.

Persona Switch Mixing Distinct Perspectives in Decoding Time

Authors: Junseok Kim, Nakyeong Yang, Kyomin Jung

2026-01-22

http://arxiv.org/abs/2601.15708v1

Role-play prompting is known to steer the behavior of language models by injecting a persona into the prompt, improving their zero-shot reasoning capabilities. However, such improvements are inconsistent across different tasks or instances. This inconsistency suggests that zero-shot and role-play prompting may offer complementary strengths rather than one being universally superior. Building on this insight, we propose Persona Switch, a novel method that dynamically combines the benefits of both prompting strategies. Our method proceeds step-by-step, selecting the better output between zero-shot and role-play prompting at each step by comparing their output confidence, as measured by the logit gap. Experiments with widely-used s demonstrate that Persona Switch consistently outperforms competitive baselines, achieving up to 5.13% accuracy improvement. Furthermore, we show that output confidence serves as an informative measure for selecting the more reliable output.

TempoNet Learning Realistic Communication and Timing Patterns for Network Traffic Simulation

Authors: Kristen Moore, Diksha Goel, Cody James Christopher, Zhen Wang, Minjune Kim, Ahmed Ibrahim, Ahmad Mohsin, Seyit Camtepe

2026-01-22

http://arxiv.org/abs/2601.15663v1

Realistic network traffic simulation is critical for evaluating intrusion detection systems, stress-testing network protocols, and constructing high-fidelity environments for cybersecurity training. While attack traffic can often be layered into training environments using red-teaming or replay methods, generating authentic benign background traffic remains a core challenge -- particularly in simulating the complex temporal and dynamics of real-world networks. This paper introduces TempoNet, a novel generative model that combines multi-task learning with multi-mark temporal point processes to jointly model inter-arrival times and all packet- and flow-header fields. TempoNet captures fine-grained timing patterns and higher-order correlations such as host-pair behavior and seasonal trends, addressing key limitations of GAN-, -, and Bayesian-based methods that fail to reproduce structured temporal variation. TempoNet produces temporally consistent, high-fidelity traces, validated on real-world datasets. Furthermore, we show that intrusion detection models trained on TempoNet-generated background traffic perform comparably to those trained on real data, validating its utility for real-world security applications.

Event-VStream Event-Driven Real-Time Understanding for Long Video Streams

Authors: Zhenghui Guo, Yuanbin Man, Junyuan Sheng, Bowen Lin, Ahmed Ahmed, Bo Jiang, Boyuan Zhang, Miao Yin, Sian Jin, Omprakash Gnawal, Chengming Zhang

2026-01-22

http://arxiv.org/abs/2601.15655v1

Real-time understanding of long video streams remains challenging for multimodal large language models (VLMs) due to redundant frame processing and rapid forgetting of past context. Existing streaming systems rely on fixed-interval or , which either produce repetitive outputs or discard crucial temporal information. We introduce Event-VStream, an event-aware framework that represents continuous video as a sequence of discrete, semantically coherent events. Our system detects meaningful state transitions by integrating motion, semantic, and predictive cues, and triggers language generation only at those boundaries. Each event embedding is consolidated into a persistent memory bank, enabling long-horizon reasoning while maintaining low latency. Across OVOBench-Realtime, and long-form Ego4D evaluations, Event-VStream achieves competitive performance. It improves over a Video-Online-8B baseline by +10.4 points on OVOBench-Realtime, achieves performance close to Flash-VStream-7B despite using only a general-purpose LLaMA-3-8B text backbone, and maintains around 70% GPT-5 win rate on 2-hour Ego4D streams.

Distributed Multichannel Active Noise Control with Asynchronous Communication

Authors: Junwei Ji, Dongyuan Shi, Boxiang Wang, Ziyi Yang, Haowen Li, Woon-Seng Gan

2026-01-22

http://arxiv.org/abs/2601.15653v1

Distributed multichannel active noise control (DMCANC) offers effective noise reduction across large spatial areas by distributing the computational load of centralized control to multiple low-cost nodes. Conventional DMCANC methods, however, typically assume synchronous and require frequent data exchange, resulting in high overhead. To enhance efficiency and adaptability, this work proposes an asynchronous strategy where each node executes a weight-constrained filtered-x LMS (WCFxLMS) algorithm and independently requests only when its local noise reduction performance degrades. Upon request, other nodes transmit the weight difference between their local control filter and the center point in WCFxLMS, which are then integrated to update both the control filter and the center point. This design enables nodes to operate asynchronously while pre cooperative behavior. Simulation results demonstrate that the proposed asynchronous DMCANC (ACDMCANC) system maintains effective noise reduction with significantly reduced load, offering improved scalability for heterogeneous networks.

Scaling-Based Quantization of Spacetime Microstructure

Authors: Weihu Ma, Yu-Gang Ma

2026-01-22

http://arxiv.org/abs/2601.15649v1

Planck-scale physics challenges the classical smooth-spacetime picture by introducing quantum fluctuations that imply a nontrivial spacetime microstructure. We present a framework that encodes these fluctuations by promoting local scale factors, rather than the metric tensor, to fundamental dynamical variables while pre general covariance. The construction employs a two-tiered hierarchy of scale manifolds, comprising a first-order manifold of scale coordinates and a second-order manifold of fluctuation amplitude coordinates. On the first-order manifold, we formulate differential geometry, field equations, and a canonical procedure. The theory yields a geometric renormalization-group flow for scale variables and implies spacetime discreteness at the microscopic level. By constructing a quadratic action and performing spectral decomposition with a stabilizing potential, we obtain discrete modal degrees of freedom d as harmonic oscillators. The framework proposes a microscopic description for zero-point energy of spacetime and explores implications for vacuum energy and ultraviolet regularization, suggesting a potential dynamical mechanism that could ameliorate the cosmological constant problem. Main results include a generalized uncertainty relation with scale-dependent coefficients, locally scaled Klein-Gordon and Dirac equations, geodesic equations for scale spacetime, and a microscopic area operator whose state counting is consistent with the Bekenstein-Hawking entropy. This work develops a scale-based procedure, providing a foundation for further mathematical analysis and phenomenological tests of spacetime .

Robust Tool Use via Fission-GRPO Learning to Recover from Execution Errors

Authors: Zhiwei Zhang, Fei Zhao, Rui Wang, Zezhong Wang, Bin Liang, Jiakang Wang, Yao Hu, Shaosheng Cao, Kam-Fai Wong

2026-01-22

http://arxiv.org/abs/2601.15625v1

Large language models (s) can call tools effectively, yet they remain brittle in multi-turn execution: following a tool call error, smaller models often degenerate into repetitive invalid re-invocations, failing to interpret error feedback and self-correct. This brittleness hinders reliable real-world deployment, where the execution errors are inherently inevitable during tool interaction procedures. We identify a key limitation of current approaches: standard reinforcement learning (RL) treats errors as negative rewards, providing no guidance on how to recover, while pre-collected synthetic error-correction datasets suffer from distribution mismatch with the model's on-policy error modes. To bridge this gap, we propose Fission-GRPO, a framework that converts execution errors into corrective supervision within the RL training loop. Our core mechanism fissions each failed trajectory into a new training instance by augmenting it with diagnostic feedback from a finetuned Error Simulator, then resampling recovery rollouts on-policy. This enables the model to learn from the precise errors it makes during exploration, rather than from static, pre-collected error cases. On the BFCL v4 Multi-Turn, Fission-GRPO improves the error recovery rate of Qwen3-8B by 5.7% absolute, crucially, yielding a 4% overall accuracy gain (42.75% to 46.75%) over GRPO and outperforming specialized tool-use agents.

Tensor-based phase difference estimation on time series analysis

Authors: Shu Kanno, Kenji Sugisaki, Rei Sakuma, Jumpei Kato, Hajime Nakamura, Naoki Yamamoto

2026-01-22

http://arxiv.org/abs/2601.15616v1

We propose a phase-difference estimation algorithm based on the tensor-network circuit , leveraging time-evolution data to pursue scalability and higher accuracy on a quantum phase estimation (QPE)-type algorithm. Using tensor networks, we construct circuits composed solely of nearest-neighbor gates and extract time-evolution data by four-type circuit measurements. In addition, to enhance the accuracy of time-evolution and state-preparation circuits, we propose techniques based on algorithmic error mitigation and on iterative circuit optimization combined with merging into matrix product states, respectively. Verifications using a noiseless simulator for the 8-qubit one-dimensional Hubbard model using an ancilla qubit show that the proposed algorithm achieves accuracies with 0.4--4.7\% error from a true energy gap on an appropriate time-step size, and that accuracy improvements due to the algorithmic error mitigation are observed. We also confirm the enhancement of the with matrix product states through iterative optimization. Finally, the proposed algorithm is demonstrated on IBM Heron devices with Q-CTRL error suppression for 8-, 36-, and 52-qubit models using more than 5,000 2-qubit gates. These largest-scale demonstrations for the QPE-type algorithm represent significant progress not only toward practical applications of near-term quantum computing but also toward preparation for the era of error-corrected quantum devices.

ToxiTwitch Toward Emote-Aware Hybrid Moderation for Live Streaming Platforms

Authors: Baktash Ansari, Shiza Ali, Elias Martin, Maryna Sivachenko, Afra Mashhadi

2026-01-22

http://arxiv.org/abs/2601.15605v1

The rapid growth of live-streaming platforms such as Twitch has introduced complex challenges in moderating toxic behavior. Traditional moderation approaches, such as human annotation and keyword-based filtering, have demonstrated utility, but human moderators on Twitch constantly struggle to scale effectively in the fast-paced, high-volume, and context-rich chat environment of the platform while also facing harassment themselves. Recent advances in large language models (s), such as DeepSeek-R1-Distill and Llama-3-8B-Instruct, offer new opportunities for toxicity detection, especially in understanding nuanced, multimodal involving emotes. In this work, we present an exploratory comparison of toxicity detection approaches tailored to Twitch. Our analysis reveals that incorporating emotes improves the detection of toxic behavior. To this end, we introduce ToxiTwitch, a hybrid model that combines -generated embeddings of text and emotes with traditional machine learning classifiers, including Random Forest and SVM. In our case study, the proposed hybrid approach reaches up to 80 percent accuracy under channel-specific training (with 13 percent improvement over BERT and F1-score of 76 percent). This work is an exploratory study intended to surface challenges and limits of emote-aware toxicity detection on Twitch.

DeepASMR LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice

Authors: Leying Zhang, Tingxiao Zhou, Haiyang Sun, Mengxiao Bi, Yanmin Qian

2026-01-22

http://arxiv.org/abs/2601.15596v1

While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR), a specialized, low-intensity speech style essential for relaxation. The inherent challenges include ASMR's subtle, often unvoiced characteristics and the demand for zero-shot speaker adaptation. In this paper, we introduce DeepASMR, the first framework designed for zero-shot ASMR generation. We demonstrate that a single short snippet of a speaker's ordinary, read-style speech is sufficient to synthesize high-fidelity ASMR in their voice, eliminating the need for whispered training data from the target speaker. Methodologically, we first identify that discrete speech tokens provide a soft factorization of ASMR style from speaker timbre. Leveraging this insight, we propose a two-stage pipeline incorporating a Large Language Model () for content-style encoding and a flow-matching acoustic r for timbre reconstruction. Furthermore, we contribute DeepASMR-DB, a comprehensive 670-hour English-Chinese multi-speaker ASMR speech corpus, and introduce a novel evaluation protocol integrating objective metrics, human listening tests, -based scoring and unvoiced speech analysis. Extensive experiments confirm that DeepASMR achieves state-of-the-art naturalness and style fidelity in ASMR generation for anyone of any voice, while maintaining competitive performance on normal speech synthesis.

YuFeng-XGuard A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models

Authors: Junyu Lin, Meizhen Liu, Xiufeng Huang, Jinfeng Li, Haiwen Hong, Xiaohan Yuan, Yuefeng Chen, Longtao Huang, Hui Xue, Ranjie Duan, Zhikai Chen, Yuchuan Fu, Defeng Li, Lingyao Gao, Yitong Yang

2026-01-22

http://arxiv.org/abs/2601.15588v1

As large language models (s) are increasingly deployed in real-world applications, safety guardrails are required to go beyond coarse-grained filtering and support fine-grained, interpretable, and adaptable risk assessment. However, existing solutions often rely on rapid classification schemes or post-hoc rules, resulting in limited transparency, inflexible policies, or prohibitive inference costs. To this end, we present YuFeng-XGuard, a reasoning-centric guardrail model family designed to perform multi-dimensional risk perception for interactions. Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process. This formulation enables safety decisions that are both actionable and interpretable. To balance decision latency and explanatory depth, we adopt a tiered inference paradigm that performs an initial risk decision based on the first d token, while pre ondemand explanatory reasoning when required. In addition, we introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining. Extensive experiments on a diverse set of public safety benchmarks demonstrate that YuFeng-XGuard achieves stateof-the-art performance while maintaining strong efficiency-efficacy trade-offs. We release YuFeng-XGuard as an open model family, including both a full-capacity variant and a lightweight version, to support a wide range of deployment scenarios.

Amalgamated CHIRP and OFDM for ISAC

Authors: Pankaj Kumar, Mohammed El-Hajjar, Ibrahim A. Hemadeh, Yasser Mestrah, Suraj Srivastava, Aditya K. Jagannatham, Lajos Hanzo

2026-01-22

http://arxiv.org/abs/2601.15584v1

Integrated Sensing and Communication (ISAC) requires the development of a waveform capable of efficiently supporting both and sensing functionalities. This paper proposes a novel waveform that combines the benefits of both the orthogonal frequency division multiplexing (OFDM) and the chirp waveforms to improve both the and sensing performance within an ISAC framework. Hence, a new architecture is proposed that utilizes the conventional framework while leveraging the parameters sensed at the receiver (Rx) for enhancing the performance. We demonstrate that the affine addition of OFDM and chirp signals results in a near constant-envelope OFDM waveform, which effectively reduces the peak-to-average power ratio (PAPR), a key limitation of traditional OFDM systems. Using the OFDM framework for sensing in the conventional fashion requires the allocation of some resources for sensing, which in turn reduces performance. As a remedy, the proposed affine amalgam facilitates sensing through the chirp waveform without consuming resources, thereby pre efficiency. Furthermore, a novel technique of integrating the chirp signal into the OFDM framework at the slot-level is proposed to enhance the accuracy of range estimation. The results show that the OFDM signal incorporated with chirp has better autocorrelation properties, improved root mean square error (RMSE) of range and velocity, and lower PAPR. Finally, we characterize the trade-off between s and sensing performance.

From Generation to Collaboration Using LLMs to Edit for Empathy in Healthcare

Authors: Man Luo, Bahareh Harandizadeh, Amara Tariq, Halim Abbas, Umar Ghaffar, Christopher J Warren, Segun O. Kolade, Haidar M. Abdul-Muhsin

2026-01-22

http://arxiv.org/abs/2601.15558v1

Clinical empathy is essential for patient care, but physicians need continually balance emotional warmth with factual precision under the cognitive and emotional constraints of clinical practice. This study investigates how large language models (s) can function as empathy editors, refining physicians' written responses to enhance empathetic tone while pre underlying medical information. More importantly, we introduce novel quantitative metrics, an Empathy Ranking Score and a MedFactChecking Score to systematically assess both emotional and factual quality of the responses. Experimental results show that edited responses significantly increase perceived empathy while pre factual accuracy compared with fully generated outputs. These findings suggest that using s as editorial assistants, rather than autonomous generators, offers a safer, more effective pathway to empathetic and trustworthy AI-assisted healthcare .

LLM or Human? Perceptions of Trust and Information Quality in Research Summaries

Authors: Nil-Jana Akpinar, Sandeep Avula, CJ Lee, Brandon Dang, Kaza Razat, Vanessa Murdock

2026-01-22

http://arxiv.org/abs/2601.15556v1

Large Language Models (s) are increasingly used to generate and edit scientific abstracts, yet their integration into academic writing raises questions about trust, quality, and disclosure. Despite growing adoption, little is known about how readers perceive -generated summaries and how these perceptions influence evaluations of scientific work. This paper presents a mixed-methods survey experiment investigating whether readers with ML expertise can distinguish between human- and -generated abstracts, how actual and perceived involvement affects judgments of quality and trustworthiness, and what orientations readers adopt toward AI-assisted writing. Our findings show that participants struggle to reliably identify -generated content, yet their beliefs about involvement significantly shape their evaluations. Notably, abstracts edited by s are rated more favorably than those written solely by humans or s. We also identify three distinct reader orientations toward -assisted writing, offering insights into evolving norms and informing policy around disclosure and acceptable use in scientific .

DS@GT at TREC TOT 2025 Bridging Vague Recollection with Fusion Retrieval and Learned Reranking

Authors: Wenxin Zhou, Ritesh Mehta, Anthony Miyaguchi

2026-01-21

http://arxiv.org/abs/2601.15518v1

We develop a two-stage retrieval system that combines multiple complementary retrieval methods with a learned reranker and -based reranking, to address the TREC Tip-of-the-Tongue (ToT) task. In the first stage, we employ hybrid retrieval that merges -based retrieval, (BM25), and dense (BGE-M3) retrieval methods. We also introduce topic-aware multi-index dense retrieval that partitions the Wikipedia corpus into 24 topical domains. In the second stage, we evaluate both a trained LambdaMART reranker and -based reranking. To support model training, we generate 5000 synthetic ToT queries using s. Our best system achieves recall of 0.66 and NDCG@1000 of 0.41 on the test set by combining hybrid retrieval with Gemini-2.5-flash reranking, demonstrating the effectiveness of fusion retrieval.

MARS Unleashing the Power of Speculative Decoding via Margin-Aware Verification

Authors: Jingwei Song, Xinyu Wang, Hanbin Wang, Xiaoxuan Lei, Bill Shi, Shixin Han, Eric Yang, Xiao-Wen Chang, Lynn Ai

2026-01-21

http://arxiv.org/abs/2601.15498v1

Speculative Decoding (SD) accelerates autoregressive large language model () inference by decoupling generation and verification. While recent methods improve draft quality by tightly coupling the drafter with the target model, the verification mechanism itself remains largely unchanged, relying on strict token-level rejection sampling. In practice, modern s frequently operate in low-margin regimes where the target model exhibits weak preference among top candidates. In such cases, rejecting plausible runner-up tokens yields negligible information gain while incurring substantial rollback cost, leading to a fundamental inefficiency in verification. We propose Margin-Aware Speculative Verification, a training-free and domain-agnostic verification strategy that adapts to the target model's local decisiveness. Our method conditions verification on decision stability measured directly from the target logits and relaxes rejection only when strict verification provides minimal benefit. Importantly, the approach modifies only the verification rule and is fully compatible with existing target-coupled speculative frameworks. Extensive experiments across model scales ranging from 8B to 235B demonstrate that our method delivers consistent and significant inference speedups over state-of-the-art baselines while pre generation quality across diverse benchmarks.

Martingale Foresight Sampling A Principled Approach to Inference-Time LLM Decoding

Authors: Huayu Li, ZhengXiao He, Siyuan Tian, Jinghao Wen, Ao Li

2026-01-21

http://arxiv.org/abs/2601.15482v1

Standard autoregressive in large language models (s) is inherently short-sighted, often failing to find globally optimal reasoning paths due to its token-by-token generation process. While inference-time strategies like foresight sampling attempt to mitigate this by simulating future steps, they typically rely on ad-hoc heuristics for valuing paths and the search space. This paper introduces Martingale Foresight Sampling (MFS), a principled framework that reformulates as a problem of identifying an optimal stochastic process. By modeling the quality of a reasoning path as a stochastic process, we leverage Martingale theory to design a theoretically-grounded algorithm. Our approach replaces heuristic mechanisms with principles from probability theory: step valuation is derived from the Doob Decomposition Theorem to measure a path's predictable advantage, path selection uses Optional Stopping Theory for principled of suboptimal candidates, and an adaptive stopping rule based on the Martingale Convergence Theorem terminates exploration once a path's quality has provably converged. Experiments on six reasoning benchmarks demonstrate that MFS surpasses state-of-the-art methods in accuracy while significantly improving computational efficiency. Code will be released at https://github.com/miraclehetech/EACL2026-Martingale-Foresight-Sampling.

Exploring Implicit Perspectives on Autism in Large Language Models Through Multi-Agent Simulations

Authors: Sohyeon Park, Jesus Armando Beltran, Aehong Min, Anamara Ritt-Olson, Gillian R. Hayes

2026-01-21

http://arxiv.org/abs/2601.15437v1

Large Language Models (s) like ChatGPT offer potential support for autistic people, but this potential requires understanding the implicit perspectives these models might carry, including their biases and assumptions about autism. Moving beyond single-agent prompting, we utilized -based multi-agent systems to investigate complex social scenarios involving autistic and non-autistic agents. In our study, agents engaged in group-task conversations and answered structured interview questions, which we analyzed to examine ChatGPT's biases and how it conceptualizes autism. We found that ChatGPT assumes autistic people are socially dependent, which may affect how it interacts with autistic users or conveys information about autism. To address these challenges, we propose adopting the double empathy problem, which reframes breakdowns as a mutual challenge. We describe how future s could address the biases we observed and improve interactions involving autistic people by incorporating the double empathy problem into their design.

Domain-Specific Knowledge Graphs in RAG-Enhanced Healthcare LLMs

Authors: Sydney Anuyah, Mehedi Mahmud Kaushik, Hao Dai, Rakesh Shiradkar, Arjan Durresi, Sunandan Chakraborty

2026-01-21

http://arxiv.org/abs/2601.15429v1

Large Language Models (s) generate fluent answers but can struggle with trustworthy, domain-specific reasoning. We evaluate whether domain knowledge graphs (KGs) improve Retrieval-Augmented Generation (RAG) for healthcare by constructing three PubMed-derived graphs: $\mathbb{G}_1$ (T2DM), $\mathbb{G}_2$ (Alzheimer's disease), and $\mathbb{G}_3$ (AD+T2DM). We design two probes: Probe 1 targets merged AD T2DM knowledge, while Probe 2 targets the intersection of $\mathbb{G}_1$ and $\mathbb{G}_2$ . Seven instruction-tuned s are tested across retrieval sources {No-RAG, $\mathbb{G}_1$ , $\mathbb{G}_2$ , $\mathbb{G}_1$ + $\mathbb{G}_2$ , $\mathbb{G}_3$ , $\mathbb{G}_1$ + $\mathbb{G}_2$ + $\mathbb{G}_3$ } and three temperatures. Results show that scope alignment between probe and KG is decisive: precise, scope-matched retrieval (notably $\mathbb{G}_2$ ) yields the most consistent gains, whereas indiscriminate graph unions often introduce distractors that reduce accuracy. Larger models frequently match or exceed KG-RAG with a No-RAG baseline on Probe 1, indicating strong parametric priors, whereas smaller/mid-sized models benefit more from well-scoped retrieval. Temperature plays a secondary role; higher values rarely help. We conclude that precision-first, scope-matched KG-RAG is preferable to breadth-first unions, and we outline practical guidelines for graph selection, model sizing, and retrieval/reranking. Code and Data available here - https://github.com/sydneyanuyah/RAGComparison

DuFal Dual-Frequency-Aware Learning for High-Fidelity Extremely Sparse-view CBCT Reconstruction

Authors: Cuong Tran Van, Trong-Thang Pham, Ngoc-Son Nguyen, Duy Minh Ho Nguyen, Ngan Le

2026-01-21

http://arxiv.org/abs/2601.15416v1

Sparse-view Cone-Beam Computed Tomography reconstruction from limited X-ray projections remains a challenging problem in medical imaging due to the inherent undersampling of fine-grained anatomical details, which correspond to high-frequency components. Conventional CNN-based methods often struggle to recover these fine structures, as they are typically biased toward learning low-frequency information. To address this challenge, this paper presents DuFal (Dual-Frequency-Aware Learning), a novel framework that integrates frequency-domain and spatial-domain processing via a dual-path architecture. The core innovation lies in our High-Local Factorized Fourier Neural Operator, which comprises two complementary branches: a Global High-Frequency Enhanced Fourier Neural Operator that captures global frequency patterns and a Local High-Frequency Enhanced Fourier Neural Operator that processes spatially partitioned patches to preserve spatial locality that might be lost in global frequency analysis. To improve efficiency, we design a Spectral-Channel Factorization scheme that reduces the Fourier Neural Operator parameter count. We also design a Cross-Attention Frequency Fusion module to integrate spatial and frequency features effectively. The fused features are then d through a Feature Decoder to produce projection representations, which are subsequently processed through an Intensity Field Decoding pipeline to reconstruct a final Computed Tomography volume. Experimental results on the LUNA16 and ToothFairy datasets demonstrate that DuFal significantly outperforms existing state-of-the-art methods in pre high-frequency anatomical features, particularly under extremely -view settings.

Beyond Prompting Efficient and Robust Contextual Biasing for Speech LLMs via Logit-Space Integration (LOGIC)

Authors: Peidong Wang

2026-01-21

http://arxiv.org/abs/2601.15397v1

The rapid emergence of new entities -- driven by cultural shifts, evolving trends, and personalized user data -- poses a significant challenge for existing Speech Large Language Models (Speech s). While these models excel at general conversational tasks, their static training knowledge limits their ability to recognize domain-specific terms such as contact names, playlists, or technical jargon. Existing solutions primarily rely on prompting, which suffers from poor scalability: as the entity list grows, prompting encounters context window limitations, increased inference latency, and the "lost-in-the-middle" phenomenon. An alternative approach, Generative Error Correction (GEC), attempts to rewrite transcripts via post-processing but frequently suffers from "over-correction", introducing hallucinations of entities that were never spoken. In this work, we introduce LOGIC (Logit-Space Integration for Contextual Biasing), an efficient and robust framework that operates directly in the layer. Unlike prompting, LOGIC decouples context injection from input processing, ensuring constant-time complexity relative to prompt length. Extensive experiments using the Phi-4-MM model across 11 multilingual locales demonstrate that LOGIC achieves an average 9% relative reduction in Entity WER with a negligible 0.30% increase in False Alarm Rate.

FedUMM A General Framework for Federated Learning with Unified Multimodal Models

Authors: Zhaolong Su, Leheng Zhao, Xiaoying Wu, Ziyue Xu, Jindong Wang

2026-01-21

http://arxiv.org/abs/2601.15390v1

Unified multimodal models (UMMs) are emerging as strong foundation models that can do both generation and understanding tasks in a single architecture. However, they are typically trained in centralized settings where all training and downstream datasets are gathered in a central server, limiting the deployment in privacy-sensitive and geographically distributed scenarios. In this paper, we present FedUMM, a general federated learning framework for UMMs under non-IID multimodal data with low cost. Built on NVIDIA FLARE, FedUMM instantiates federation for a BLIP3o backbone via parameter-efficient fine-tuning: clients train lightweight LoRA adapters while freezing the foundation models, and the server aggregates only adapter updates. We evaluate on VQA v2 and the GenEval compositional generation benchmarks under Dirichlet-controlled heterogeneity with up to 16 clients. Results show slight degradation as client count and heterogeneity increase, while remaining competitive with centralized training. We further analyze computation-- trade-offs and demonstrate that adapter-only federation reduces per-round by over an order of magnitude compared to full fine-tuning, enabling practical federated UMM training. This work provides empirical experience for future research on privacy-pre federated unified multimodal models.

Towards Understanding Best Practices for Quantization of Vision-Language Models

Authors: Gautom Das, Vincent La, Ethan Lau, Abhinav Shrivastava, Matthew Gwilliam

2026-01-21

http://arxiv.org/abs/2601.15287v1

Large language models (s) deliver impressive results for a variety of tasks, but state-of-the-art systems require fast GPUs with large amounts of memory. To reduce both the memory and latency of these systems, practitioners their learned parameters, typically at half precision. A growing body of research focuses on pre the model performance with more aggressive bit widths, and some work has been done to apply these strategies to other models, like vision s. In our study we investigate how a variety of methods, including state-of-the-art GPTQ and AWQ, can be applied effectively to multimodal pipelines comprised of vision models, language models, and their connectors. We address how performance on captioning, retrieval, and question answering can be affected by bit width, method, and which portion of the pipeline the is used for. Results reveal that ViT and exhibit comparable importance in model performance, despite significant differences in parameter size, and that lower-bit of the achieves high accuracy at reduced bits per weight (bpw). These findings provide practical insights for efficient deployment of Ms and highlight the value of exploration for understanding component sensitivities in multimodal models. Our code is available at https://github.com/gautomdas/mmq.

Lightweight LLMs for Network Attack Detection in IoT Networks

Authors: Piyumi Bhagya Sudasinghe, Kushan Sudheera Kalupahana Liyanage, Harsha S. Gardiyawasam Pussewalage

2026-01-21

http://arxiv.org/abs/2601.15269v1

The rapid growth of Internet of Things (IoT) devices has increased the scale and diversity of cyberattacks, exposing limitations in traditional intrusion detection systems. Classical machine learning (ML) models such as Random Forest and Support Vector Machine perform well on known attacks but require retraining to detect unseen or zero-day threats. This study investigates lightweight r-only Large Language Models (s) for IoT attack detection by integrating structured-to-text conversion, Quantized Low-Rank Adaptation (QLoRA) fine-tuning, and Retrieval-Augmented Generation (RAG). Network traffic features are transformed into compact natural-language prompts, enabling efficient adaptation under constrained hardware. Experiments on the CICIoT2023 dataset show that a QLoRA-tuned LLaMA-1B model achieves an F1-score of 0.7124, comparable to the Random Forest (RF) baseline (0.7159) for known attacks. With RAG, the system attains 42.63% accuracy on unseen attack types without additional training, demonstrating practical zero-shot capability. These results highlight the potential of retrieval-enhanced lightweight s as adaptable and resource-efficient solutions for next-generation IoT intrusion detection.

Aligned Stable Inpainting Mitigating Unwanted Object Insertion and Preserving Color Consistency

Authors: Yikai Wang, Junqiu Yu, Chenjie Cao, Xiangyang Xue, Yanwei Fu

2026-01-21

http://arxiv.org/abs/2601.15368v1

Generative image inpainting can produce realistic, high-fidelity results even with large, irregular masks. However, existing methods still face key issues that make inpainted images look unnatural. In this paper, we identify two main problems: (1) Unwanted object insertion: generative models may hallucinate arbitrary objects in the masked region that do not match the surrounding context. (2) Color inconsistency: inpainted regions often exhibit noticeable color shifts, leading to smeared textures and degraded image quality. We analyze the underlying causes of these issues and propose efficient post-hoc solutions for pre-trained inpainting models. Specifically, we introduce the principled framework of Aligned Stable inpainting with UnKnown Areas prior (ASUKA). To reduce unwanted object insertion, we use reconstruction-based priors to guide the generative model, suppressing hallucinated objects while pre generative flexibility. To address color inconsistency, we design a specialized VAE r that formulates latent-to-image as a local harmonization task. This design significantly reduces color shifts and produces more color-consistent results. We implement ASUKA on two representative inpainting architectures: a U-Net-based model and a DiT-based model. We analyze and propose lightweight injection strategies that minimize interference with the model's original generation capacity while ensuring the mitigation of the two issues. We evaluate ASUKA using the Places2 dataset and MISATO, our proposed diverse benchmark. Experiments show that ASUKA effectively suppresses object hallucination and improves color consistency, outperforming standard diffusion, rectified flow models, and other inpainting methods. Dataset, models and codes will be released in github.

Deaf and Hard of Hearing Access to Intelligent Personal Assistants Comparison of Voice-Based Options with an LLM-Powered Touch Interface

Authors: Paige S. DeVries, Michaela Okosi, Ming Li, Nora Dunphy, Gidey Gezae, Dante Conway, Abraham Glasser, Raja Kushalnagar, Christian Vogler

2026-01-21

http://arxiv.org/abs/2601.15209v2

We investigate intelligent personal assistants (IPAs) accessibility for deaf and hard of hearing (DHH) people who can use their voice in everyday . The inability of IPAs to understand diverse accents including deaf speech renders them largely inaccessible to non-signing and speaking DHH individuals. Using an Echo Show, we compare the usability of natural language input via spoken English; with Alexa's automatic speech recognition and a Wizard-of-Oz setting with a trained facilitator re-speaking commands against that of a large language model ()-assisted touch interface in a mixed-methods study. The touch method was navigated through an -powered "task prompter," which integrated the user's history and smart environment to suggest contextually-appropriate commands. Quantitative results showed no significant differences across both spoken English conditions vs -assisted touch. Qualitative results showed variability in opinions on the usability of each method. Ultimately, it will be necessary to have robust deaf-accented speech recognized natively by IPAs.

The Flexibility Trap Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

Authors: Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang

2026-01-21

http://arxiv.org/abs/2601.15165v1

Diffusion Large Language Models (ds) break the rigid left-to-right constraint of traditional s, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of ds. In this paper, we reveal a counter-intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of ds. We find that ds tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for ds, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to pre this flexibility. We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel ability of ds. Project page: https://nzl-thu.github.io/the-flexibility-trap

DeepFedNAS A Unified Framework for Principled, Hardware-Aware, and Predictor-Free Federated Neural Architecture Search

Authors: Bostan Khan, Masoud Daneshtalab

2026-01-21

http://arxiv.org/abs/2601.15127v1

Federated Neural Architecture Search (FedNAS) aims to automate model design for privacy-pre Federated Learning (FL) but currently faces two critical bottlenecks: unguided supernet training that yields suboptimal models, and costly multi-hour pipelines for post-training subnet discovery. We introduce DeepFedNAS, a novel, two-phase framework underpinned by a principled, multi-objective fitness function that synthesizes mathematical network design with architectural heuristics. Enabled by a re-engineered supernet, DeepFedNAS introduces Federated Pareto Optimal Supernet Training, which leverages a pre-computed Pareto-optimal of high-fitness architectures as an intelligent curriculum to optimize shared supernet weights. Subsequently, its Predictor-Free Search Method eliminates the need for costly accuracy surrogates by utilizing this fitness function as a direct, zero-cost proxy for accuracy, enabling on-demand subnet discovery in mere seconds. DeepFedNAS achieves state-of-the-art accuracy (e.g., up to 1.21% absolute improvement on CIFAR-100), superior parameter and efficiency, and a substantial ~61x speedup in total post-training search pipeline time. By reducing the pipeline from over 20 hours to approximately 20 minutes (including initial generation) and enabling 20-second individual subnet searches, DeepFedNAS makes hardware-aware FL deployments instantaneous and practical. The complete source code and experimental scripts are available at: https://github.com/bostankhan6/DeepFedNAS

Field-Space Autoencoder for Scalable Climate Emulators

Authors: Johannes Meuer, Maximilian Witte, Étiénne Plésiat, Thomas Ludwig, Christopher Kadow

2026-01-21

http://arxiv.org/abs/2601.15102v1

Kilometer-scale Earth system models are essential for capturing local climate change. However, these models are computationally expensive and produce petabyte-scale outputs, which limits their utility for applications such as probabilistic risk assessment. Here, we present the Field-Space Autoencoder, a scalable climate emulation framework based on a spherical model that overcomes these challenges. By utilizing Field-Space Attention, the model efficiently operates on native climate model output and therefore avoids geometric distortions caused by forcing spherical data onto Euclidean grids. This approach preserves physical structures significantly better than convolutional baselines. By producing a structured compressed field, it serves as a good baseline for downstream generative emulation. In addition, the model can perform zero-shot super-resolution that maps low-resolution large ensembles and scarce high-resolution data into a shared representation. We train a generative diffusion model on these compressed fields. The model can simultaneously learn internal variability from abundant low-resolution data and fine-scale physics from high-resolution data. Our work bridges the gap between the high volume of low-resolution ensemble statistics and the scarcity of high-resolution physical detail.

LoRAP Low-Rank Aggregation Prompting for Quantized Graph Neural Networks Training

Authors: Chenyu Liu, Haige Li, Luca Rossi

2026-01-21

http://arxiv.org/abs/2601.15079v1

Graph Neural Networks (GNNs) are neural networks that aim to process graph data, capturing the relationships and interactions between nodes using the message-passing mechanism. GNN has emerged as a promising approach for reducing model size and accelerating inference in resource-constrained environments. Compared to in s, quantizing graph features is more emphasized in GNNs. Inspired by the above, we propose to leverage prompt learning, which manipulates the input data, to improve the performance of -aware training (QAT) for GNNs. To mitigate the issue that prompting the node features alone can only make part of the d aggregation result optimal, we introduce Low-Rank Aggregation Prompting (LoRAP), which injects lightweight, input-dependent prompts into each aggregated feature to optimize the results of d aggregations. Extensive evaluations on 4 leading QAT frameworks over 9 graph datasets demonstrate that LoRAP consistently enhances the performance of d GNNs while introducing a minimal computational overhead.

SpooFL Spoofing Federated Learning

Authors: Isaac Baglin, Xiatian Zhu, Simon Hadfield

2026-01-21

http://arxiv.org/abs/2601.15055v1

Traditional defenses against Deep Leakage (DL) attacks in Federated Learning (FL) primarily focus on obfuscation, introducing noise, transformations or encryption to degrade an attacker's ability to reconstruct private data. While effective to some extent, these methods often still leak high-level information such as class distributions or feature representations, and are frequently broken by increasingly powerful denoising attacks. We propose a fundamentally different perspective on FL defense: framing it as a spoofing problem.We introduce SpooFL (Figure 1), a spoofing-based defense that deceives attackers into believing they have recovered the true training data, while actually providing convincing but entirely synthetic samples from an unrelated task. Unlike prior synthetic-data defenses that share classes or distributions with the private data and thus still leak semantic information, SpooFL uses a state-of-the-art generative model trained on an external dataset with no class . As a result, attackers are misled into recovering plausible yet completely irrelevant samples, preventing meaningful data leakage while pre FL training integrity. We implement the first example of such a spoofing defense, and evaluate our method against state-of-the-art DL defenses and demonstrate that it successfully misdirects attackers without compromising model performance significantly.

Game-Theoretic Lens on LLM-based Multi-Agent Systems

Authors: Jianing Hao, Han Ding, Yuanjian Xu, Tianze Sun, Ran Chen, Wanbo Zhang, Guang Zhang, Siguang Li

2026-01-21

http://arxiv.org/abs/2601.15047v1

Large language models (s) have demonstrated strong reasoning, planning, and abilities, enabling them to operate as autonomous agents in open environments. While single-agent systems remain limited in adaptability and coordination, recent progress has shifted attention toward multi-agent systems (MAS) composed of interacting s that pursue cooperative, competitive, or mixed objectives. This emerging paradigm provides a powerful testbed for studying social dynamics and strategic behaviors among intelligent agents. However, current research remains fragmented and lacks a unifying theoretical foundation. To address this gap, we present a comprehensive survey of -based multi-agent systems through a game-theoretic lens. By organizing existing studies around the four key elements of game theory: players, strategies, payoffs, and information, we establish a systematic framework for understanding, comparing, and guiding future research on the design and analysis of -based MAS.

Parallel Collaborative ADMM Privacy Computing and Adaptive GPU Acceleration for Distributed Edge Networks

Authors: Mengchun Xia, Zhicheng Dong, Donghong Cai, Fang Fang, Lisheng Fan, Pingzhi Fan

2026-01-21

http://arxiv.org/abs/2601.14980v1

Distributed computing has been widely applied in distributed edge networks for reducing the processing burden of high-dimensional data centralization, where a high-dimensional computational task is decomposed into multiple low-dimensional collaborative processing tasks or multiple edge nodes use distributed data to train a global model. However, the computing power of a single-edge node is limited, and collaborative computing will cause information leakage and excessive overhead. In this paper, we design a parallel collaborative distributed alternating direction method of multipliers (ADMM) and propose a three-phase parallel collaborative ADMM privacy computing (3P-ADMM-PC2) algorithm for distributed computing in edge networks, where the Paillier homomorphic encryption is utilized to protect data privacy during interactions. Especially, a method is introduced, which maps the real numbers to a positive integer interval without affecting the homomorphic operations. To address the architectural mismatch between large-integer and Graphics Processing Unit (GPU) computing, we transform high-bitwidth computations into width matrix and vector operations. Thus the GPU can be utilized to implement parallel encryption and decryption computations with long keys. Finally, a GPU-accelerated 3P-ADMM-PC2 is proposed to optimize the collaborative computing tasks. Meanwhile, large-scale computational tasks are conducted in network topologies with varying numbers of edge nodes. Experimental results demonstrate that the proposed 3P-ADMM-PC2 has excellent mean square error performance, which is close to that of distributed ADMM without privacy-pre. Compared to centralized ADMM and distributed ADMM implemented with Central Processing Unit (CPU) computation, the proposed scheme demonstrates a significant speedup ratio.

Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers

Authors: Xinyu Peng, Han Li, Yuyang Huang, Ziyang Zheng, Yaoming Wang, Xin Chen, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong

2026-01-21

http://arxiv.org/abs/2601.14959v1

Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named \textbf{L}ocal \textbf{D}iffusion \textbf{F}orcing for \textbf{V}ideo \textbf{F}rame \textbf{I}nterpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion that models the entire video sequence to ensure long-range temporal coherence. To mitigate error accumulation inherent in auto-regressive generation, we introduce a novel skip-concatenate sampling strategy that effectively maintains temporal stability. Furthermore, LDF-VFI incorporates , local attention and tiled VAE encoding, a combination that not only enables efficient processing of long sequences but also allows generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining. An enhanced conditional VAE r, which leverages multi-scale features from the input video, further improves reconstruction fidelity. Empirically, LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per-frame quality and temporal consistency, especially in scenes with large motion. The source code is available at https://github.com/xypeng9903/LDF-VFI.

Multi-Behavior Sequential Modeling with Transition-Aware Graph Attention Network for E-Commerce Recommendation

Authors: Hanqi Jin, Gaoming Yang, Zhangming Chan, Yapeng Yuan, Longbin Li, Fei Sun, Yeqiu Yang, Jian Wu, Yuning Jiang, Bo Zheng

2026-01-21

http://arxiv.org/abs/2601.14955v1

User interactions on e-commerce platforms are inherently diverse, involving behaviors such as clicking, favoriting, adding to cart, and purchasing. The transitions between these behaviors offer valuable insights into user-item interactions, as a key signal for understanding evolving preferences. Consequently, there is growing interest in leveraging multi-behavior data to better capture user intent. Recent studies have explored sequential modeling of multi-behavior data, many relying on -based architectures with polynomial time complexity. While effective, these approaches often incur high computational costs, limiting their applicability in large-scale industrial systems with long user sequences. To address this challenge, we propose the Transition-Aware Graph Attention Network (TGA), a linear-complexity approach for modeling multi-behavior transitions. Unlike traditional s that treat all behavior pairs equally, TGA constructs a structured graph by identifying informative transitions from three perspectives: (a) item-level transitions, (b) category-level transitions, and (c) neighbor-level transitions. Built upon the structured graph, TGA employs a transition-aware graph Attention mechanism that jointly models user-item interactions and behavior transition types, enabling more accurate capture of sequential patterns while maintaining computational efficiency. Experiments show that TGA outperforms all state-of-the-art models while significantly reducing computational cost. Notably, TGA has been deployed in a large-scale industrial production environment, where it leads to impressive improvements in key business metrics.

Deep Learning assisted Port-Cycling based Channel Sounding for Precoder Estimation in Massive MIMO Arrays

Authors: Advaith Arun, Shiv Shankar, Dhivagar Baskaran, Klutto Milleth, Bhaskar Ramamurthi

2026-01-21

http://arxiv.org/abs/2601.14953v1

Future wireless systems are expected to employ a substantially larger number of transmit ports for channel state information (CSI) estimation compared to current specifications. Although scaling ports improves spectral efficiency, it also increases the resource overhead to transmit reference signals across the time-frequency grid, ultimately reducing achievable data throughput. In this work, we propose an deep learning (DL)-based CSI reconstruction framework that serves as an enabler for reliable CSI acquisition in future 6G systems. The proposed solution involves designing a port-cycling mechanism that sequentially sounds different portions of CSI ports across time, thereby lowering the overhead while pre channel observability. The proposed CSI Adaptive Network (CsiAdaNet) model exploits the resulting measurements and captures both spatial and temporal correlations to accurately reconstruct the full-port CSI. The simulation results show that our method achieves overhead reduction while maintaining high CSI reconstruction accuracy.

CorpusQA A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning

Authors: Zhiyuan Lu, Chenliang Li, Yingcheng Shi, Weizhou Shen, Ming Yan, Fei Huang

2026-01-21

http://arxiv.org/abs/2601.14952v1

While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single long texts or rely on a " retrieval" assumption-that answers can be derived from a few relevant chunks. This assumption fails for true corpus-level analysis, where evidence is highly dispersed across hundreds of documents and answers require global integration, comparison, and statistical aggregation. To address this critical gap, we introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework. By decoupling reasoning from textual representation, this framework creates complex, computation-intensive queries with programmatically guaranteed ground-truth answers, challenging systems to perform holistic reasoning over vast, unstructured text without relying on fallible human annotation. We further demonstrate the utility of our framework beyond evaluation, showing that fine-tuning on our synthesized data effectively enhances an 's general long-context reasoning capabilities. Extensive experiments reveal that even state-of-the-art long-context s struggle as input length increases, and standard retrieval-augmented generation systems collapse entirely. Our findings indicate that memory-augmented agentic architectures offer a more robust alternative, suggesting a critical shift is needed from simply extending context windows to developing advanced architectures for global information synthesis.

Rank-one Riemannian Subspace Descent for Nonlinear Matrix Equations

Authors: Yogesh Darmwal, Ketan Rajawat

2026-01-21

http://arxiv.org/abs/2601.14933v1

We propose a rank-one Riemannian subspace descent algorithm for computing symmetric positive definite (SPD) solutions to nonlinear matrix equations arising in control theory, dynamic programming, and stochastic filtering. For solution matrices of size $n\times n$ , standard approaches for dense matrix equations typically incur $\mathcal{O}(n^3)$ cost per-iteration, while the efficient $\mathcal{O}(n^2)$ methods either rely on or low-rank solutions, or have iteration counts that scale poorly. The proposed method entails updating along the dominant eigen-component of a transformed Riemannian gradient, identified using at most $\mathcal{O}(\log(n))$ power iterations. The update structure also enables exact step-size selection in many cases at minimal additional cost. For objectives defined as compositions of standard matrix operations, each iteration can be implemented using only matrix--vector products, yielding $\mathcal{O}(n^2)$ arithmetic cost. We prove an $\mathcal{O}(n)$ iteration bound under standard smoothness assumptions, with improved bounds under geodesic strong convexity. Numerical experiments on large-scale CARE, DARE, and other nonlinear matrix equations show that the proposed algorithm solves instances (up to $n=10{,}000$ in our tests) for which the compared solvers, including MATLAB's \texttt{icare}, structure-pre doubling, and subspace-descent baselines fail to return a solution. These results demonstrate that rank-one manifold updates provide a practical approach for high-dimensional and dense SPD-constrained matrix equations. MATLAB code implementation is publicly available on GitHub : \href{https://github.com/yogeshd-iitk/nonlinear_matrix_equation_R1RSD}{\textcolor{blue}{https://github.com/yogeshd-iitk/nonlinear_matrix _equation_R1RSD}}

What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study

Authors: Keyu Lv, Manyi Zhang, Xiaobo Xia, Jingchen Ni, Shannan Yan, Xianzhi Yu, Lu Hou, Chun Yuan, Haoli Bai

2026-01-21

http://arxiv.org/abs/2601.14888v1

Reasoning models excel at complex tasks such as coding and mathematics, yet their inference is often slow and token-inefficient. To improve the inference efficiency, post-training (PTQ) usually comes with the cost of large accuracy drops, especially for reasoning tasks under settings. In this study, we present a systematic empirical study of -aware training (QAT) for reasoning models. Our key findings include: (1) Knowledge distillation is a robust objective for reasoning models trained via either supervised fine-tuning or reinforcement learning; (2) PTQ provides a strong initialization for QAT, improving accuracy while reducing training cost; (3) Reinforcement learning remains feasible for d models given a viable cold start and yields additional gains; and (4) Aligning the PTQ calibration domain with the QAT training domain accelerates convergence and often improves the final accuracy. Finally, we consolidate these findings into an optimized workflow (Reasoning-QAT), and show that it consistently outperforms state-of-the-art PTQ methods across multiple backbones and reasoning datasets. For instance, on Qwen3-0.6B, it surpasses GPTQ by 44.53% on MATH-500 and consistently recovers performance in the 2-bit regime.

Multi-Task Transformer for Explainable Speech Deepfake Detection via Formant Modeling

Authors: Viola Negroni, Luca Cuccovillo, Paolo Bestagini, Patrick Aichroth, Stefano Tubaro

2026-01-21

http://arxiv.org/abs/2601.14850v2

In this work, we introduce a multi-task for speech deepfake detection, capable of predicting formant trajectories and voicing patterns over time, ultimately classifying speech as real or fake, and highlighting whether its decisions rely more on voiced or unvoiced regions. Building on a prior speaker-formant architecture, we streamline the model with an improved input segmentation strategy, redesign the process, and integrate built-in explainability. Compared to the baseline, our model requires fewer parameters, trains faster, and provides better interpretability, without sacrificing prediction performance.

POTR Post-Training 3DGS Compression

Authors: Bert Ramlot, Martijn Courteaux, Peter Lambert, Glenn Van Wallendael

2026-01-21

http://arxiv.org/abs/2601.14821v1

3D Gaussian Splatting (3DGS) has recently emerged as a promising contender to Neural Radiance Fields (NeRF) in 3D scene reconstruction and real-time novel view synthesis. 3DGS outperforms NeRF in training and inference speed but has substantially higher storage requirements. To remedy this downside, we propose POTR, a post-training 3DGS codec built on two novel techniques. First, POTR introduces a novel approach that uses a modified 3DGS rasterizer to efficiently calculate every splat's individual removal effect simultaneously. This technique results in 2-4x fewer splats than other post-training techniques and as a result also significantly accelerates inference with experiments demonstrating 1.5-2x faster inference than other compressed models. Second, we propose a novel method to recompute lighting coefficients, significantly reducing their entropy without using any form of training. Our fast and highly parallel approach especially increases AC lighting coefficient , with experiments demonstrating increases from 70% to 97%, with minimal loss in quality. Finally, we extend POTR with a simple fine-tuning scheme to further enhance , inference, and rate-distortion performance. Experiments demonstrate that POTR, even without fine-tuning, consistently outperforms all other post-training techniques in both rate-distortion performance and inference speed.

FunCineForge A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes

Authors: Jiaxuan Liu, Yang Xiang, Han Zhao, Xiangang Li, Zhenhua Ling

2026-01-21

http://arxiv.org/abs/2601.14777v1

Movie dubbing is the task of synthesizing speech from scripts conditioned on video scenes, requiring accurate lip sync, faithful timbre transfer, and proper modeling of character identity and emotion. However, existing methods face two major limitations: (1) high-quality multimodal dubbing datasets are limited in scale, suffer from high word error rates, contain annotations, rely on costly manual labeling, and are restricted to monologue scenes, all of which hinder effective model training; (2) existing dubbing models rely solely on the lip region to learn audio-visual alignment, which limits their applicability to complex live-action cinematic scenes, and exhibit suboptimal performance in lip sync, speech quality, and emotional expressiveness. To address these issues, we propose FunCineForge, which comprises an end-to-end production pipeline for large-scale dubbing datasets and an M-based dubbing model designed for diverse cinematic scenes. Using the pipeline, we construct the first Chinese television dubbing dataset with rich annotations, and demonstrate the high quality of these data. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following. Code and demos are available at https://anonymous.4open.science/w/FunCineForge.

Render-of-Thought Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Authors: Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei

2026-01-21

http://arxiv.org/abs/2601.14750v2

Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (s). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token and substantial inference compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT

Authors: Konstantin Poddubnyy, Igor Vozniak, Nils Lipp, Ivan Burmistrov, Davit Hovhannisyan, Christian Mueller, Philipp Slusallek

2026-01-21

http://arxiv.org/abs/2601.14743v1

The effectiveness of collision-free trajectory planners depends on the quality and diversity of training data, especially for rare scenarios. A widely used approach to improve dataset diversity involves generating realistic synthetic traffic scenarios. However, producing such scenarios remains difficult due to the precision required when scripting them manually or generating them in a single pass. Natural language offers a flexible way to describe scenarios, but existing text-to-simulation pipelines often rely on static snippet retrieval, limited grammar, single-pass , or lack robust executability checks. Moreover, they depend heavily on constrained prompting with minimal post-processing. To address these limitations, we introduce ARISE - Adaptive Refinement and Iterative Scenario Engineering, a multi-stage tool that converts natural language prompts into executable Scenic scripts through iterative -guided refinement. After each generation, ARISE tests script executability in simulation software, feeding structured diagnostics back to the until both syntactic and functional requirements are met. This process significantly reduces the need for manual intervention. Through extensive evaluation, ARISE outperforms the baseline in generating semantically accurate and executable traffic scenarios with greater reliability and robustness.

Optimizing FaaS Platforms for MCP-enabled Agentic Workflows

Authors: Varad Kulkarni, Vaibhav Jha, Nikhil Reddy, Yogesh Simmhan

2026-01-21

http://arxiv.org/abs/2601.14735v1

Agentic workflows that use autonomous AI Agents powered by Large Language Models (s) and Model Context Protocol (MCP) servers is rapidly rising. This introduces challenges in scalable cloud deployment and state management. Traditional hosting on Virtual Machines (VMs) is resource-intensive and lacks elasticity. Functions-as-a-Service (FaaS) platforms offer modularity, autoscaling and cost efficiency but are inherently stateless. In this paper, we present the FAME, a FaaS-based architecture for orchestrating MCP-enabled agentic workflows. FAME decomposes agentic patterns such as ReAct into composable agents: Planner, Actor and Evaluator, that are each a FaaS function built using LangGraph and are orchestrated as a FaaS workflow. This enables modular composition as AWS Step Functions and avoids function timeouts seen for monolithic agentic workflows. To address context persistence across user requests in a conversation, FAME automates agent memory persistence and injection using DynamoDB. It also optimizes MCP server deployment through AWS Lambda wrappers, s tool outputs in S3 and proposes function fusion strategies. We evaluate FAME on two representative applications, on research paper summarization and log analytics, under diverse memory and caching configurations. Results show up to 13x latency reduction, 88% fewer input tokens and 66% in cost savings, along with improved workflow completion rates. This demonstrates the viability of serverless platforms for hosting complex, multi-agent AI workflows at scale.

Authors: Shuning Ge, Fangyun Qin, Xiaohui Wan, Yang Liu, Qian Dai, Zheng Zheng

2026-01-21

http://arxiv.org/abs/2601.14731v1

Software systems that run for long periods often suffer from software aging, which is typically caused by Aging-Related Bugs (ARBs). To mitigate the risk of ARBs early in the development phase, ARB prediction has been introduced into software aging research. However, due to the difficulty of collecting ARBs, within-project ARB prediction faces the challenge of data scarcity, leading to the proposal of cross-project ARB prediction. This task faces two major challenges: 1) domain adaptation issue caused by distribution difference between source and target projects; and 2) severe class imbalance between ARB-prone and ARB-free samples. Although various methods have been proposed for cross-project ARB prediction, existing approaches treat the input metrics independently and often neglect the rich inter-metric dependencies, which can lead to ping information and misjudgment of metric importance, potentially affecting the model's performance. Moreover, they typically use cross-entropy as the loss function during training, which cannot distinguish the difficulty of sample classification. To overcome these limitations, we propose ARFT-Transformer, a -based cross-project ARB prediction framework that introduces a metric-level multi-head attention mechanism to capture metric interactions and incorporates Focal Loss function to effectively handle class imbalance. Experiments conducted on three large-scale open-source projects demonstrate that ARFT-Transformer on average outperforms state-of-the-art cross-project ARB prediction methods in both single-source and multi-source cases, achieving up to a 29.54% and 19.92% improvement in Balance metric.

HERMES KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Authors: Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, Xipeng Qiu

2026-01-21

http://arxiv.org/abs/2601.14724v1

Recent advancements in Multimodal Large Language Models (Ms) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs, remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact , enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions, which achieves 10 $\times$ faster TTFT compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.

Ramping-aware Enhanced Flexibility Aggregation of Distributed Generation with Energy Storage in Power Distribution Networks

Authors: Hyeongon Park, Daniel K. Molzahn, Rahul K. Gupta

2026-01-21

http://arxiv.org/abs/2601.14689v1

Power distribution networks are increasingly hosting controllable and flexible distributed energy resources (DERs) that, when aggregated, can provide ancillary support to transmission systems. However, existing aggregation schemes often ignore the ramping constraints of these DERs, which can render them impractical in real deployments. This work proposes a ramping-aware flexibility aggregation scheme, computed at the transmission-distribution boundary, that explicitly accounts for DER ramp limits and yields flexibility envelopes that are provably disaggregable. To further enhance the attainable flexibility region, we introduce a novel pre-ramping strategy, which proactively adjusts resource operating points to enlarge the aggregated flexibility envelope while pre both network feasibility and guarantees. The proposed method demonstrates a 5.2% to 19.2% improvement in flexibility relative to the baseline model, depending on system conditions. We validate the scheme on an IEEE-33 bus distribution system and provide formal proofs showing that both aggregation strategies are disaggregable for all feasible trajectories within the aggregate flexibility envelope.

IB-GRPO Aligning LLM-based Learning Path Recommendation with Educational Objectives via Indicator-Based Group Relative Policy Optimization

Authors: Shuai Wang, Yaoming Yang, Bingdong Li, Hao Hao, Aimin Zhou

2026-01-21

http://arxiv.org/abs/2601.14686v1

Learning Path Recommendation (LPR) aims to generate personalized sequences of learning items that maximize long-term learning effect while respecting pedagogical principles and operational constraints. Although large language models (s) offer rich semantic understanding for free-form recommendation, applying them to long-horizon LPR is challenging due to (i) misalignment with pedagogical objectives such as the Zone of Proximal Development (ZPD) under , delayed feedback, (ii) scarce and costly expert demonstrations, and (iii) multi-objective interactions among learning effect, difficulty scheduling, length controllability, and trajectory diversity. To address these issues, we propose IB-GRPO (Indicator-Based Group Relative Policy Optimization), an indicator-guided alignment approach for -based LPR. To mitigate data scarcity, we construct hybrid expert demonstrations via Genetic Algorithm search and teacher RL agents and warm-start the with supervised fine-tuning. Building on this warm-start, we design a within-session ZPD alignment score for difficulty scheduling. IB-GRPO then uses the $I_{ε+}$ dominance indicator to compute group-relative advantages over multiple objectives, avoiding manual scalarization and improving Pareto trade-offs. Experiments on ASSIST09 and Junyi using the KES simulator with a Qwen2.5-7B backbone show consistent improvements over representative RL and baselines.

FARE Fast-Slow Agentic Robotic Exploration

Authors: Shuhao Liao, Xuxin Lv, Jeric Lew, Shizhe Zhang, Jingsong Liang, Peizhuo Li, Yuhong Cao, Wenjun Wu, Guillaume Sartoretti

2026-01-21

http://arxiv.org/abs/2601.14681v1

This work advances autonomous robot exploration by integrating agent-level semantic reasoning with fast local control. We introduce FARE, a hierarchical autonomous exploration framework that integrates a large language model () for global reasoning with a reinforcement learning (RL) policy for local decision making. FARE follows a fast-slow thinking paradigm. The slow-thinking module interprets a concise textual description of the unknown environment and synthesizes an agent-level exploration strategy, which is then grounded into a sequence of global waypoints through a topological graph. To further improve reasoning efficiency, this module employs a modularity-based mechanism that reduces redundant graph structures. The fast-thinking RL module executes exploration by reacting to local observations while being guided by the -generated global waypoints. The RL policy is additionally shaped by a reward term that encourages adherence to the global waypoints, enabling coherent and robust closed-loop behavior. This architecture decouples semantic reasoning from geometric decision, allowing each module to operate in its appropriate temporal and spatial scale. In challenging simulated environments, our results show that FARE achieves substantial improvements in exploration efficiency over state-of-the-art baselines. We further deploy FARE on hardware and validate it in complex, large scale $200m\times130m$ building environment.

INFA-Guard Mitigating Malicious Propagation via Infection-Aware Safeguarding in LLM-Based Multi-Agent Systems

Authors: Yijin Zhou, Xiaoya Lu, Dongrui Liu, Junchi Yan, Jing Shao

2026-01-21

http://arxiv.org/abs/2601.14667v1

The rapid advancement of Large Language Model ()-based Multi-Agent Systems (MAS) has introduced significant security vulnerabilities, where malicious influence can propagate virally through inter-agent . Conventional safeguards often rely on a binary paradigm that strictly distinguishes between benign and attack agents, failing to account for infected agents i.e., benign entities converted by attack agents. In this paper, we propose Infection-Aware Guard, INFA-Guard, a novel defense framework that explicitly identifies and addresses infected agents as a distinct threat category. By leveraging infection-aware detection and topological constraints, INFA-Guard accurately localizes attack sources and infected ranges. During remediation, INFA-Guard replaces attackers and rehabilitates infected ones, avoiding malicious propagation while pre topological integrity. Extensive experiments demonstrate that INFA-Guard achieves state-of-the-art performance, reducing the Attack Success Rate (ASR) by an average of 33%, while exhibiting cross-model robustness, superior topological generalization, and high cost-effectiveness.

Communication-Efficient Federated Risk Difference Estimation for Time-to-Event Clinical Outcomes

Authors: Ziwen Wang, Siqi Li, Marcus Eng Hock Ong, Nan Liu

2026-01-21

http://arxiv.org/abs/2601.14609v1

Privacy-pre model co-training in medical research is often hindered by server-dependent architectures incompatible with protected hospital data systems and by the predominant focus on relative effect measures (hazard ratios) which lack clinical interpretability for absolute survival risk assessment. We propose FedRD, a -efficient framework for federated risk difference estimation in distributed survival data. Unlike typical federated learning frameworks (e.g., FedAvg) that require persistent server connections and extensive iterative , FedRD is server-independent with minimal : one round of summary statistics exchange for the stratified model and three rounds for the unstratified model. Crucially, FedRD provides valid confidence intervals and hypothesis testing--capabilities absent in FedAvg-based frameworks. We provide theoretical guarantees by establishing the asymptotic properties of FedRD and prove that FedRD (unstratified) is asymptotically equivalent to pooled individual-level analysis. Simulation studies and real-world clinical applications across different countries demonstrate that FedRD outperforms local and federated baselines in both estimation accuracy and prediction performance, providing an architecturally feasible solution for absolute risk assessment in privacy-restricted, multi-site clinical studies.

QMC Efficient SLM Edge Inference via Outlier-Aware Quantization and Emergent Memories Co-Design

Authors: Nilesh Prasad Pandey, Jangseon Park, Onat Gungor, Flavio Ponzina, Tajana Rosing

2026-01-21

http://arxiv.org/abs/2601.14549v1

Deploying Small Language Models (SLMs) on edge platforms is critical for real-time, privacy-sensitive generative AI, yet constrained by memory, latency, and energy budgets. Quantization reduces model size and cost but suffers from device noise in emerging non-volatile memories, while conventional memory hierarchies further limit efficiency. SRAM provides fast access but has low density, DRAM must simultaneously accommodate static weights and dynamic s, which creates bandwidth contention, and Flash, although dense, is primarily used for initialization and remains inactive during inference. These limitations highlight the need for hybrid memory organizations tailored to inference. We propose Outlier-aware Quantization with Memory Co-design (QMC), a retraining-free with a novel heterogeneous memory architecture. QMC identifies inlier and outlier weights in SLMs, storing inlier weights in compact multi-level Resistive-RAM (ReRAM) while pre critical outliers in high-precision on-chip Magnetoresistive-RAM (MRAM), mitigating noise-induced degradation. On language modeling and reasoning benchmarks, QMC outperforms and matches state-of-the-art methods using advanced algorithms and hybrid data formats, while achieving greater under both algorithm-only evaluation and realistic deployment settings. Specifically, compared against SoTA methods on the latest edge AI platform, QMC reduces memory usage by 6.3x-7.3x, external data transfers by 7.6x, energy by 11.7x, and latency by 12.5x when compared to FP16, establishing QMC as a scalable, deployment-ready co-design for efficient on-device inference.

On the Runway Cascade of Transformers for Language Modeling

Authors: Hunjae Lee, Corey Clark

2026-01-20

http://arxiv.org/abs/2601.14522v1

In r-only (causal) s, the computation graph created by causal masking routes information through both direct-path attention and indirect paths formed by intermediate tokens. We denote these indirect paths between token pairs as their runways. We argue that certain failure modes of causal s as observed by a growing body of recent works are likely exacerbated by a misalignment between these two information propagation modes. We formalize runway cascade as a phenomenon whereby this misalignment results in redundancies and irrelevant information cascading to token representations despite adequately learned attention patterns. As a solution, we propose runway-aware rewiring as a more explicit way of incorporating runway context directly into each token's direct-path attention. This mechanism re-wires the attention pattern for each token based on a summary of its runway landscape, enabling awareness of accumulating representational influences and allowing for more balanced information propagation. Our proposed methodology introduces no additional parameters and can seamlessly be integrated into standard attention mechanism. Empirically, our rewired results in steady improvements in general language modeling as well as noticeably stronger information retrieval and extrapolation abilities compared to standard s.