2025-09-05
Table of Contents
- PagedEviction Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference
- Psychologically Enhanced AI Agents
- Real Time FPGA Based Transformers & VLMs for Vision Tasks SOTA Designs and Optimizations
- MultiWikiQA A Reading Comprehension Benchmark in 300+ Languages
- Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue
- Meta-Policy Reflexion Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent
- MTQA Matrix of Thought for Enhanced Reasoning in Complex Question Answering
- A Multidimensional AI-powered Framework for Analyzing Tourist Perception in Historic Urban Quarters A Case Study in Shanghai
- Learning Neural Decoding with Parallelism and Self-Coordination for Quantum Error Correction
- SAMVAD A Multi-Agent System for Simulating Judicial Deliberation Dynamics in India
- RAGuard A Novel Approach for in-context Safe Retrieval Augmented Generation for LLMs
- Efficient Item ID Generation for Large-Scale LLM-based Recommendation
- OneCAT Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
- On Entropy Control in LLM-RL Algorithms
- Continuous Saudi Sign Language Recognition A Vision Transformer Approach
- Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing
- Adaptive KV-Cache Compression without Manually Setting Budget
- Handwriting Imagery EEG Classification based on Convolutional Neural Networks
- Binary Quantization For LLMs Through Dynamic Grouping
- FlashRecovery Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs
- Mycroft Tracing Dependencies in Collective Communication Towards Reliable LLM Training
- QNPU Quantum Network Processor Unit for Quantum Supercomputers
- The Transparent Earth A Multimodal Foundation Model for the Earth's Subsurface
- LExI Layer-Adaptive Active Experts for Efficient MoE Model Inference
- Planning with Reasoning using Vision Language World Model
- Lighting the Way for BRIGHT Reproducible Baselines with Anserini, Pyserini, and RankLLM
- LLM-Enhanced Space-Air-Ground-Sea Integrated Networks
- Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
- MoPEQ Mixture of Mixed Precision Quantized Experts
- Top-H Decoding Adapting the Creativity and Coherence with Bounded Entropy in Text Generation
- HydroGAT Distributed Heterogeneous Graph Attention Transformer for Spatiotemporal Flood Prediction
- MLP-Offload Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall
- An Efficient and Adaptive Watermark Detection System with Tile-based Error Correction
- Cache Management for Mixture-of-Experts LLMs -- extended version
- Upcycling Candidate Tokens of Large Language Models for Query Expansion
- AudioCodecBench A Comprehensive Benchmark for Audio Codec Evaluation
- FPGA-Based RoCEv2-RDMA Readout Electronics for the CTAO-LST Advanced Camera
- Dual-end Fluid Antennas For Robust Anti-jamming in Low-altitude Air-ground Communications
- Avoidance Decoding for Diverse Multi-Branch Story Generation
- AMBEDKAR-A Multi-level Bias Elimination through a Decoding Approach with Knowledge Augmentation for Robust Constitutional Alignment of Language Models
- FlexNGIA 2.0 Redesigning the Internet with Agentic AI -- Protocols, Services, and Traffic Engineering Designed, Deployed, and Managed by AI
- Batch Query Processing and Optimization for Agentic Workflows
- Reentrant superconductivity and superconductor-to-insulator transition in a naturally occurring Josephson junction array tuned by RF power
- Loop Quantum Vector-Tensor Gravity and Its Spherically Symmetric Model
- FireRedTTS-2 Towards Long Conversational Speech Generation for Podcast and Chatbot
- Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs
- mFARM Towards Multi-Faceted Fairness Assessment based on HARMs in Clinical Decision Support
- AHAMask Reliable Task Specification for Large Audio Language Models without Instructions
- Communication-Aware Knowledge Distillation for Federated LLM Fine-Tuning over Wireless Networks
- Preconditioned Regularized Wasserstein Proximal Sampling
- Q-Sched Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling
PagedEviction Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference
Authors: Krishna Teja Chitty-Venkata, Jie Ye, Xian-He Sun, Anthony Kougkas, Murali Emani, Venkatram Vishwanath, Bogdan Nicolae
2025-09-04
KV caching significantly improves the efficiency of Large Language Model (LLM) inference by storing attention states from previously processed tokens, enabling faster generation of subsequent tokens. However, as sequence length increases, the KV cache quickly becomes a major memory bottleneck. To address this, we propose PagedEviction, a novel fine-grained, structured eviction strategy that enhances the memory efficiency of vLLM's PagedAttention. Unlike existing approaches that rely on attention-based token importance or evict tokens across different vLLM pages, PagedEviction introduces an efficient block-wise eviction algorithm tailored for paged memory layouts. Our method integrates seamlessly with PagedAttention without requiring any modifications to its CUDA attention kernels. We evaluate PagedEviction across Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models on the LongBench benchmark suite, demonstrating improved memory usage with better accuracy than baselines on long-context tasks.
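To make the block-wise idea concrete, the following is a minimal sketch of structured eviction over a paged KV cache; the block size, the per-token scoring proxy, and all names here are illustrative assumptions, not the paper's actual algorithm or its vLLM integration.

    import numpy as np

    BLOCK_SIZE = 16  # tokens per KV-cache page (vLLM-style), assumed for illustration

    def blocks_to_keep(key_norms: np.ndarray, max_blocks: int) -> list:
        """Keep the max_blocks most important pages; evict the rest wholesale.

        key_norms: a per-token importance proxy (here: magnitude of each key).
        Returns indices of retained blocks in temporal order.
        """
        n_tokens = len(key_norms)
        n_blocks = (n_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE
        scores = [key_norms[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE].mean()
                  for b in range(n_blocks)]  # one score per page, not per token
        # Structured eviction: drop whole pages with the lowest scores, so the
        # paged layout stays intact and no attention-kernel changes are needed.
        return sorted(np.argsort(scores)[-max_blocks:])

    key_norms = np.abs(np.random.randn(80))  # toy scores for an 80-token sequence
    print(blocks_to_keep(key_norms, max_blocks=3))

Evicting at page granularity rather than per token is what lets such a scheme coexist with PagedAttention's memory layout unchanged.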
Psychologically Enhanced AI Agents
Authors: Maciej Besta, Shriram Chandran, Robert Gerstenberger, Mathis Lindner, Marcin Chrapek, Sebastian Hermann Martschat, Taraneh Ghandi, Patrick Iff, Hubert Niewiadomski, Piotr Nyczyk, Jürgen Müller, Torsten Hoefler
2025-09-04
We introduce MBTI-in-Thoughts, a framework for enhancing the effectiveness of Large Language Model (LLM) agents through psychologically grounded personality
conditioning. Drawing on the Myers-Briggs Type Indicator (MBTI), our method
primes agents with distinct personality archetypes via prompt engineering,
enabling control over behavior along two foundational axes of human psychology,
cognition and affect. We show that such personality priming yields consistent,
interpretable behavioral biases across diverse tasks: emotionally expressive
agents excel in narrative generation, while analytically primed agents adopt
more stable strategies in game-theoretic settings. Our framework supports experimenting with structured multi-agent communication protocols and reveals that self-reflection prior to interaction improves cooperation and reasoning
quality. To ensure trait persistence, we integrate the official 16Personalities
test for automated verification. While our focus is on MBTI, we show that our
approach generalizes seamlessly to other psychological frameworks such as Big
Five, HEXACO, or Enneagram. By bridging psychological theory and LLM behavior design, we establish a foundation for psychologically enhanced AI agents without any fine-tuning.
Real Time FPGA Based Transformers & VLMs for Vision Tasks SOTA Designs and Optimizations
Authors: Safa Mohammed Sali, Mahmoud Meribout, Ashiyana Abdul Majeed
2025-09-04
Transformers and vision-language models (VLMs) have emerged as dominant
architectures in computer vision and multimodal AI, offering state-of-the-art
performance in tasks such as image classification, object detection, visual
question answering, and caption generation. However, their high computational
complexity, large memory footprints, and irregular data access patterns present
significant challenges for deployment in latency- and power-constrained
environments. Field-programmable gate arrays (FPGAs) provide an attractive
hardware platform for such workloads due to their reconfigurability,
fine-grained parallelism, and potential for energy-efficient acceleration. This
paper presents a comprehensive review of design trade-offs, optimization
strategies, and implementation challenges for FPGA-based inference of
s and VLMs. We examine critical factors such as device-class
selection, memory subsystem constraints, dataflow orchestration,
strategies,
exploitation, and toolchain choices, alongside
modality-specific issues unique to VLMs, including heterogeneous compute
balancing and cross-attention memory management. Additionally, we discuss
emerging trends in hardware-algorithm co-design, highlighting innovations in
attention mechanisms, compression, and modular overlays to improve efficiency
and adaptability. Practical issues such as runtime flexibility, verification
overhead, and the absence of standardized FPGA multimodal benchmarks are also
considered. Finally, we outline future directions toward scalable, portable,
and reconfigurable FPGA solutions that adapt to evolving model architectures
while sustaining high utilization and predictable performance. This synthesis
offers both a technical foundation and a forward-looking perspective to help
bridge the gap between advanced multimodal AI models and efficient FPGA
deployment.
MultiWikiQA A Reading Comprehension Benchmark in 300+ Languages
Authors: Dan Saattrup Smart
2025-09-04
We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which
covers 306 languages. The context data comes from Wikipedia articles, with
questions generated by an LLM and the answers appearing verbatim in the
Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency
of the generated questions across 30 of the languages, providing evidence that
the questions are of good quality. We evaluate 6 different language models,
both decoder and encoder models of varying sizes, showing that the benchmark is
sufficiently difficult and that there is a large performance discrepancy
amongst the languages. The dataset and survey evaluations are freely available.
Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue
Authors: Keara Schaaij, Roel Boumans, Tibor Bosse, Iris Hendrickx
2025-09-04
Lexical alignment, where speakers start to use similar words across
conversation, is known to contribute to successful communication. However, its
implementation in conversational agents remains underexplored, particularly
considering the recent advancements in large language models (LLMs). As a first
step towards enabling lexical alignment in human-agent dialogue, this study
draws on strategies for personalising conversational agents and investigates
the construction of stable, personalised lexical profiles as a basis for
lexical alignment. Specifically, we varied the amounts of transcribed spoken
data used for construction as well as the number of items included in the
profiles per part-of-speech (POS) category and evaluated profile performance
across time using recall, coverage, and cosine similarity metrics. Smaller, more compact profiles, built from 10 min of transcribed speech and containing 5 items each for adjectives and conjunctions and 10 items each for adverbs, nouns, pronouns, and verbs, offered the best balance of performance and data efficiency. In conclusion, this study offers
practical insights into constructing stable, personalised lexical profiles,
taking into account minimal data requirements, serving as a foundational step
toward lexical alignment strategies in conversational agents.
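As a rough illustration of such a capped, per-POS profile in code (the caps follow the best-performing configuration above; tokenization and POS tagging are assumed to happen upstream, and all names are ours):

    from collections import Counter

    CAPS = {"ADJ": 5, "CCONJ": 5, "ADV": 10, "NOUN": 10, "PRON": 10, "VERB": 10}

    def build_profile(tagged_tokens):
        """tagged_tokens: iterable of (word, pos) pairs from ~10 min of speech."""
        counts = {pos: Counter() for pos in CAPS}
        for word, pos in tagged_tokens:
            if pos in counts:
                counts[pos][word.lower()] += 1
        # Keep only the most frequent items per POS category, up to its cap.
        return {pos: [w for w, _ in c.most_common(CAPS[pos])]
                for pos, c in counts.items()}

    tokens = [("really", "ADV"), ("nice", "ADJ"), ("dog", "NOUN"),
              ("dog", "NOUN"), ("walk", "VERB"), ("and", "CCONJ")]
    print(build_profile(tokens))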
Meta-Policy Reflexion Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent
Authors: Chunlong Wu, Zhibo Qu
2025-09-04
Large language model (LLM) agents achieve impressive single-task performance
but commonly exhibit repeated failures, inefficient exploration, and limited
cross-task adaptability. Existing reflective strategies (e.g., Reflexion,
ReAct) improve per-episode behavior but typically produce ephemeral,
task-specific traces that are not reused across tasks. Reinforcement-learning
based alternatives can produce transferable policies but require substantial
parameter updates and compute. In this work we introduce Meta-Policy Reflexion (MPR): a hybrid framework that consolidates LLM-generated reflections into a structured, predicate-like Meta-Policy Memory (MPM) and applies that memory at inference time through two complementary mechanisms: soft memory-guided decoding and hard rule admissibility checks (HAC). MPR (i) externalizes reusable
corrective knowledge without model weight updates, (ii) enforces domain
constraints to reduce unsafe or invalid actions, and (iii) retains the
adaptability of language-based reflection. We formalize the MPM representation, present algorithms for update and decoding, and validate the approach in a text-based agent environment following the experimental protocol described in the provided implementation (AlfWorld-based). Empirical results reported in the
supplied material indicate consistent gains in execution accuracy and
robustness when compared to Reflexion baselines; rule admissibility further
improves stability. We analyze mechanisms that explain these gains, discuss
scalability and failure modes, and outline future directions for multimodal and multi-agent extensions.
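A toy sketch of what a hard rule admissibility check (HAC) could look like; the rule format and matching logic are our assumptions for illustration, not the paper's exact design:

    from dataclasses import dataclass

    @dataclass
    class Rule:
        condition: str  # substring of the observation that triggers the rule
        forbidden: str  # substring of the action that the rule forbids

    # Reusable, predicate-like memory consolidated from past reflections.
    META_POLICY_MEMORY = [
        Rule(condition="drawer is locked", forbidden="open drawer"),
    ]

    def admissible(observation: str, action: str) -> bool:
        """Return False if any stored rule forbids this action in this state."""
        for rule in META_POLICY_MEMORY:
            if rule.condition in observation and rule.forbidden in action:
                return False
        return True

    obs = "You see a desk. The drawer is locked."
    for act in ["open drawer 1", "take key from desk"]:
        print(act, "->", "allowed" if admissible(obs, act) else "vetoed")

The point of the hard check is that unsafe or invalid actions are vetoed without any model weight update, complementing the softer memory-guided decoding.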
MTQA Matrix of Thought for Enhanced Reasoning in Complex Question Answering
Authors: Fengxiao Tang, Yufeng Li, Zongzong Wu, Ming Zhao
2025-09-04
Complex Question Answering (QA) is a fundamental and challenging task in NLP.
While large language models (LLMs) exhibit impressive performance in QA, they
suffer from significant performance degradation when facing complex and
abstract QA tasks due to insufficient reasoning capabilities. Works such as
Chain-of-Thought (CoT) and Tree-of-Thought (ToT) aim to enhance LLMs' reasoning
abilities, but they face issues such as in-layer redundancy in tree structures
and single paths in chain structures. Although some studies utilize
Retrieval-Augmented Generation (RAG) methods to assist LLMs in reasoning, the
challenge of effectively utilizing large amounts of information involving
multiple entities and hops remains critical. To address this, we propose the
Matrix of Thought (MoT), a novel and efficient LLM thought structure. MoT explores the problem in both horizontal and vertical dimensions through the "column-cell communication" mechanism, enabling LLMs to actively engage in multi-strategy and deep-level thinking, reducing redundancy within the column
cells and enhancing reasoning capabilities. Furthermore, we develop a
fact-correction mechanism by constructing knowledge units from retrieved
knowledge graph triples and raw text to enhance the initial knowledge for
reasoning and correct erroneous answers. This leads to the development of an
efficient and accurate QA framework (MTQA). Experimental results show that our
framework outperforms state-of-the-art methods on four widely-used datasets in
terms of F1 and EM scores, with reasoning time only 14.4% of that of the baseline
methods, demonstrating both its efficiency and accuracy. The code for this
framework is available at https://github.com/lyfiter/mtqa.
A Multidimensional AI-powered Framework for Analyzing Tourist Perception in Historic Urban Quarters A Case Study in Shanghai
Authors: Kaizhen Tan, Yufan Wu, Yuxuan Liu, Haoran Zeng
2025-09-04
Historic urban quarters play a vital role in preserving cultural heritage while serving as vibrant spaces for tourism and everyday life. Understanding
how tourists perceive these environments is essential for sustainable,
human-centered urban planning. This study proposes a multidimensional
AI-powered framework for analyzing tourist perception in historic urban
quarters using multimodal data from social media. Applied to twelve historic
quarters in central Shanghai, the framework integrates focal point extraction,
color theme analysis, and sentiment mining. Visual focus areas are identified
from tourist-shared photos using a fine-tuned semantic segmentation model. To
assess aesthetic preferences, dominant colors are extracted using a clustering
method, and their spatial distribution across quarters is analyzed. Color
themes are further compared between social media photos and real-world street
views, revealing notable shifts. This divergence highlights potential gaps
between visual expectations and the built environment, reflecting both
stylistic preferences and perceptual bias. Tourist reviews are evaluated
through a hybrid sentiment analysis approach combining a rule-based method and
a multi-task BERT model. Satisfaction is assessed across four dimensions:
tourist activities, built environment, service facilities, and business
formats. The results reveal spatial variations in aesthetic appeal and
emotional response. Rather than focusing on a single technical innovation, this framework offers an integrated, data-driven approach to decoding tourist
perception and contributes to informed decision-making in tourism, heritage
conservation, and the design of aesthetically engaging public spaces.
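As a rough sketch of the clustering-based dominant-color step described above (the study does not specify the algorithm or parameters here, so k-means over RGB pixels is an assumption):

    import numpy as np
    from sklearn.cluster import KMeans

    def dominant_colors(image_rgb: np.ndarray, k: int = 5):
        """image_rgb: (H, W, 3) uint8 array. Returns k color themes with weights."""
        pixels = image_rgb.reshape(-1, 3).astype(np.float32)
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
        # Weight of each theme = fraction of pixels assigned to its cluster.
        weights = np.bincount(km.labels_, minlength=k) / len(pixels)
        order = np.argsort(weights)[::-1]  # most prevalent theme first
        return km.cluster_centers_[order].astype(np.uint8), weights[order]

    toy = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
    colors, weights = dominant_colors(toy, k=3)
    print(colors, weights.round(3))

Comparing the theme vectors extracted from tourist photos against those from street-view imagery is one way to quantify the divergence the framework reports.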
Learning Neural Decoding with Parallelism and Self-Coordination for Quantum Error Correction
Authors: Kai Zhang, Situ Wang, Linghang Kong, Fang Zhang, Zhengfeng Ji, Jianxin Chen
2025-09-04
Fast, reliable decoders are pivotal components for enabling fault-tolerant quantum computation. Neural network decoders like AlphaQubit have demonstrated significant potential, achieving higher accuracy than traditional human-designed decoding algorithms. However, existing implementations of neural network decoders lack the parallelism required to decode the syndrome stream generated by a superconducting logical qubit in real time. Moreover, integrating AlphaQubit with sliding window-based parallel decoding schemes presents non-trivial challenges: AlphaQubit is trained solely to output a single bit corresponding to the global logical correction for an entire memory experiment, rather than local physical corrections that can be easily integrated.
We address this issue by training a recurrent, transformer-based neural network specifically tailored for sliding-window decoding. While our network still outputs a single bit per window, we derive training labels from a consistent set of local corrections and train on various types of decoding windows simultaneously. This approach enables the network to self-coordinate across neighboring windows, facilitating high-accuracy parallel decoding of arbitrarily long memory experiments. As a result, we resolve the throughput limitation that previously prohibited the application of AlphaQubit-type decoders in fault-tolerant quantum computation.
SAMVAD A Multi-Agent System for Simulating Judicial Deliberation Dynamics in India
Authors: Prathamesh Devadiga, Omkaar Jayadev Shetty, Pooja Agarwal
2025-09-04
Understanding the complexities of judicial deliberation is crucial for
assessing the efficacy and fairness of a justice system. However, empirical
studies of judicial panels are constrained by significant ethical and practical
barriers. This paper introduces SAMVAD, an innovative Multi-Agent System (MAS)
designed to simulate the deliberation process within the framework of the
Indian justice system.
Our system comprises agents representing key judicial roles: a Judge, a
Prosecution Counsel, a Defense Counsel, and multiple Adjudicators (simulating a
judicial bench), all powered by large language models (LLMs). A primary
contribution of this work is the integration of Retrieval-Augmented Generation
(RAG), grounded in a domain-specific knowledge base of landmark Indian legal
documents, including the Indian Penal Code and the Constitution of India. This
RAG functionality enables the Judge and Counsel agents to generate legally
sound instructions and arguments, complete with source citations, thereby
enhancing both the fidelity and transparency of the simulation.
The Adjudicator agents engage in iterative deliberation rounds, processing
case facts, legal instructions, and arguments to reach a consensus-based
verdict. We detail the system architecture, agent communication protocols, the
RAG pipeline, the simulation workflow, and a comprehensive evaluation plan
designed to assess performance, deliberation quality, and outcome consistency.
This work provides a configurable and explainable MAS platform for exploring
legal reasoning and group decision-making dynamics in judicial simulations,
specifically tailored to the Indian legal context and augmented with verifiable
legal grounding via RAG.
RAGuard A Novel Approach for in-context Safe Retrieval Augmented Generation for LLMs
Authors: Connor Walker, Koorosh Aslansefat, Mohammad Naveed Akram, Yiannis Papadopoulos
2025-09-03
Accuracy and safety are paramount in Offshore Wind (OSW) maintenance, yet
conventional Large Language Models (LLMs) often fail when confronted with highly specialised or unexpected scenarios. We introduce RAGuard, an enhanced Retrieval-Augmented Generation (RAG) framework that explicitly integrates safety-critical documents alongside technical manuals. By issuing parallel
queries to two indices and allocating separate retrieval budgets for knowledge
and safety, RAGuard guarantees both technical depth and safety coverage. We
further develop a SafetyClamp extension that fetches a larger candidate pool,
"hard-clamping" exact slot guarantees to safety. We evaluate across
(BM25), dense (Dense Passage Retrieval) and hybrid retrieval paradigms,
measuring Technical Recall@K and Safety Recall@K. Both proposed extensions of
RAG show an increase in Safety Recall@K from almost 0% in RAG to more than 50% in RAGuard, while maintaining Technical Recall above 60%. These results
demonstrate that RAGuard and SafetyClamp have the potential to establish a new
standard for integrating safety assurance into LLM-powered decision support in
critical maintenance contexts.
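A schematic sketch of the dual-index, separate-budget retrieval and the SafetyClamp slot guarantee; the retriever internals are stubbed and the budget values are illustrative assumptions:

    def retrieve(index, query, k):
        """Stub retriever: return the top-k passages from one index
        (BM25, dense, or hybrid in the paper's experiments)."""
        return index[:k]

    def raguard(tech_index, safety_index, query, k_tech=6, k_safety=4):
        # Parallel queries with separate retrieval budgets, so safety
        # documents cannot be crowded out by technical matches.
        return (retrieve(tech_index, query, k_tech)
                + retrieve(safety_index, query, k_safety))

    def safety_clamp(tech_index, safety_index, query, k=10, safety_slots=4):
        # Fetch a larger candidate pool, then hard-clamp exact slots to safety.
        safety = retrieve(safety_index, query, 2 * safety_slots)[:safety_slots]
        tech = retrieve(tech_index, query, 2 * k)[:k - len(safety)]
        return safety + tech

    tech = [f"manual-{i}" for i in range(20)]
    safe = [f"safety-{i}" for i in range(20)]
    print(raguard(tech, safe, "blade bearing inspection"))
    print(safety_clamp(tech, safe, "blade bearing inspection"))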
Efficient Item ID Generation for Large-Scale LLM-based Recommendation
Authors: Anushya Subbiah, Vikram Aggarwal, James Pine, Steffen Rendle, Krishna Sayana, Kun Su
2025-09-03
Integrating product catalogs and user behavior into LLMs can enhance
recommendations with broad world knowledge, but the scale of real-world item
catalogs, often containing millions of discrete item identifiers (Item IDs),
poses a significant challenge. This contrasts with the smaller, tokenized text vocabularies typically used in LLMs. The predominant view within the LLM-based recommendation literature is that it is infeasible to treat item IDs as first-class citizens in the LLM, and that instead some sort of tokenization of an item into multiple tokens is required. However, this creates a key practical bottleneck in serving these models for real-time low-latency applications.
Our paper challenges this predominant practice and integrates item IDs as first-class citizens into the LLM. We provide simple, yet highly effective, novel training and inference modifications that enable single-token representations of items and single-step decoding. Our method shows
improvements in recommendation quality (Recall and NDCG) over existing
techniques on the Amazon shopping datasets while significantly improving
inference efficiency by 5x-14x. Our work offers an efficiency perspective
distinct from that of other popular approaches within LLM-based recommendation, potentially inspiring further research and opening up a new direction for integrating IDs into LLMs. Our code is available here:
https://drive.google.com/file/d/1cUMj37rV0Z1bCWMdhQ6i4q4eTRQLURtC
OneCAT Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
Authors: Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, Hongkai Xiong
2025-09-03
We introduce OneCAT, a unified multimodal model that seamlessly integrates
understanding, generation, and editing within a novel, pure decoder-only architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViT) or a vision tokenizer
during inference, leading to significant efficiency gains, especially for
high-resolution inputs. This is achieved through a modality-specific
Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR)
objective, which also natively supports dynamic resolutions. Furthermore, we
pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based
methods while maintaining state-of-the-art performance. Our findings
demonstrate the powerful potential of pure autoregressive modeling as a
sufficient and elegant foundation for unified multimodal intelligence. As a
result, OneCAT sets a new performance standard, outperforming existing
open-source unified multimodal models across benchmarks for multimodal
generation, editing, and understanding.
On Entropy Control in LLM-RL Algorithms
Authors: Han Shen
2025-09-03
For RL algorithms, appropriate entropy control is crucial to their
effectiveness. To control the policy entropy, a commonly used method is entropy
regularization, which is adopted in various popular RL algorithms including
PPO, SAC and A3C. Although entropy regularization has conventionally proved effective in robotics and games RL, studies have found that it gives weak to no gains in LLM-RL training. In this work, we study the issues of the entropy bonus in the LLM-RL setting. Specifically, we first argue that conventional entropy regularization suffers from the LLM's extremely large response space and the sparsity of the optimal outputs. As a remedy, we propose AEnt, an entropy
control method that utilizes a new clamped entropy bonus with an automatically
adjusted coefficient. The clamped entropy is evaluated with the re-normalized
policy defined on certain smaller token space, which encourages exploration
within a more compact response set. In addition, the algorithm automatically
adjusts entropy coefficient according to the clamped entropy value, effectively
controlling the entropy-induced bias while leveraging the entropy's benefits.
AEnt is tested in math-reasoning tasks under different base models and
datasets, and it is observed that AEnt outperforms the baselines consistently
across multiple benchmarks.
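A toy sketch in the spirit of the clamped entropy bonus: entropy is computed on the policy re-normalized over a smaller token set, clamped, and its coefficient adapted toward a target. The top-k restriction, clamp value, and adjustment rule are all illustrative assumptions, not AEnt's exact formulas.

    import torch

    def clamped_entropy(logits: torch.Tensor, top_k: int = 64,
                        clamp_max: float = 2.0) -> torch.Tensor:
        """Entropy of the policy re-normalized over its top-k tokens, clamped."""
        top_logits, _ = torch.topk(logits, k=top_k, dim=-1)
        probs = torch.softmax(top_logits, dim=-1)          # re-normalized policy
        ent = -(probs * torch.log(probs + 1e-9)).sum(-1)   # per-position entropy
        return ent.clamp(max=clamp_max).mean()

    def update_coef(coef: float, ent_value: float, target: float = 1.0,
                    lr: float = 0.01) -> float:
        # Raise the bonus when entropy falls below target, lower it when above,
        # keeping the entropy-induced bias under control automatically.
        return max(0.0, coef + lr * (target - ent_value))

    logits = torch.randn(8, 32000)             # (batch, vocab) toy logits
    ent = clamped_entropy(logits)
    coef = update_coef(0.01, ent.item())
    entropy_bonus = coef * ent                 # added to the RL objective
    print(float(ent), coef)

Restricting the entropy to a compact token set is what keeps the bonus from rewarding exploration over the model's enormous, mostly-irrelevant response space.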
Continuous Saudi Sign Language Recognition A Vision Transformer Approach
Authors: Soukeina Elhassen, Lama Al Khuzayem, Areej Alhothali, Ohoud Alzamzami, Nahed Alowaidi
2025-09-03
Sign language (SL) is an essential form of communication for hearing-impaired
and deaf people, enabling engagement within the broader society. Despite its
significance, limited public awareness of SL often leads to inequitable access
to educational and professional opportunities, thereby contributing to social
exclusion, particularly in Saudi Arabia, where over 84,000 individuals depend on Saudi Sign Language (SSL) as their primary form of communication. Although certain technological approaches have helped to improve communication for individuals with hearing impairments, there continues to be an urgent
individuals with hearing impairments, there continues to be an urgent
requirement for more precise and dependable translation techniques, especially
for Arabic sign language variants like SSL. Most state-of-the-art solutions
have primarily focused on non-Arabic sign languages, resulting in a
considerable absence of resources dedicated to Arabic sign language,
specifically SSL. The complexity of the Arabic language and the prevalence of isolated sign language datasets that concentrate on individual words instead of continuous sentences contribute to this issue. To address this gap and take an important step in developing SSL resources, we introduce the first continuous Saudi Sign Language dataset, called KAU-CSSL, focusing on complete sentences to facilitate further research and enable sophisticated recognition systems for SSL recognition and translation.
Additionally, we propose a transformer-based model, utilizing a pretrained ResNet-18 for spatial feature extraction and a Transformer Encoder with Bidirectional LSTM for temporal dependencies, achieving 99.02% accuracy in signer-dependent mode and 77.71% accuracy in signer-independent mode. This development paves the way not only for improving communication tools for the SSL community but also for making a substantial contribution to the wider field of sign language research.
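A condensed sketch of the described architecture (per-frame ResNet-18 features, a Transformer encoder plus bidirectional LSTM over time, and a sentence-level head); hidden sizes, layer counts, and input resolution are illustrative assumptions, not the paper's settings.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class CSSLRecognizer(nn.Module):
        def __init__(self, num_classes: int, d_model: int = 512):
            super().__init__()
            # The paper uses an ImageNet-pretrained ResNet-18; weights=None
            # keeps this sketch runnable offline.
            backbone = resnet18(weights=None)
            self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
            enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                                   batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
            self.lstm = nn.LSTM(d_model, d_model // 2, batch_first=True,
                                bidirectional=True)
            self.head = nn.Linear(d_model, num_classes)

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, time, 3, H, W) video clips
            b, t = frames.shape[:2]
            feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (b*t, 512)
            feats = feats.view(b, t, -1)
            feats = self.encoder(feats)          # temporal self-attention
            feats, _ = self.lstm(feats)          # bidirectional recurrence
            return self.head(feats.mean(dim=1))  # sentence-level logits

    model = CSSLRecognizer(num_classes=50)
    print(model(torch.randn(2, 8, 3, 112, 112)).shape)  # torch.Size([2, 50])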
Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing
Authors: Rui Xie, Asad Ul Haq, Linsen Ma, Yunhua Fang, Zirak Burzin Engineer, Liu Liu, Tong Zhang
2025-09-03
Large language model (LLM) inference is bottlenecked by the limited bandwidth
of CXL-based memory used for capacity expansion. We introduce CXL-NDP, a
transparent near-data processing architecture that amplifies effective CXL
bandwidth without requiring changes to the CXL.mem interface or AI models.
CXL-NDP integrates a precision-scalable bit-plane layout for dynamic quantization with transparent lossless compression of weights and KV caches directly within the CXL device. In end-to-end serving, CXL-NDP improves throughput by 43%, extends the maximum context length by 87%, and reduces the KV cache footprint by 46.9% without accuracy loss. Hardware synthesis confirms
its practicality with a modest silicon footprint, lowering the barrier for
adopting efficient, scalable CXL-based memory in generative AI infrastructure.
Adaptive KV-Cache Compression without Manually Setting Budget
Authors: Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang
2025-09-03
Large language model (LLM) inference relies heavily on KV-caches to accelerate autoregressive decoding, but the resulting memory footprint grows rapidly with sequence length, posing significant efficiency challenges. Current KV-cache compression methods suffer from a Procrustes' bed problem: they force diverse workloads into fixed compression ratios, leading to suboptimal resource allocation and inference performance. To this end, we present GVote, an adaptive KV-cache compression scheme that eliminates manual budget specification while achieving superior accuracy-efficiency trade-offs. GVote operates on the principle that the important keys are the aggregation of keys required by future queries. The method predicts future query attention demands by Monte-Carlo-style sampling of potential queries and aggregates the selected keys to determine the optimal KV-cache budget without manual specification. Experimental evaluation demonstrates GVote's effectiveness across multiple benchmarks, including GSM8K, RULER and LongBench. Compared to baselines, GVote achieves a 2x memory reduction while maintaining higher or comparable accuracy.
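A toy sketch of the voting idea: sample plausible future queries, let each vote for the keys it attends to most, and keep the union, so the budget emerges from the votes rather than being set by hand. The Gaussian sampling distribution and per-query top-k rule are illustrative assumptions.

    import torch

    def gvote_keep_set(keys: torch.Tensor, recent_queries: torch.Tensor,
                       n_samples: int = 16, top_k: int = 8) -> torch.Tensor:
        """keys: (T, d) cached keys; recent_queries: (m, d) observed queries."""
        mu = recent_queries.mean(0)
        std = recent_queries.std(0) + 1e-6
        keep = torch.zeros(keys.shape[0], dtype=torch.bool)
        for _ in range(n_samples):
            q = mu + std * torch.randn_like(mu)       # Monte-Carlo future query
            scores = keys @ q                         # attention logits for q
            keep[torch.topk(scores, k=top_k).indices] = True  # this query's vote
        return keep  # union of votes = adaptive, budget-free keep set

    keys = torch.randn(128, 64)
    queries = torch.randn(4, 64)
    mask = gvote_keep_set(keys, queries)
    print(int(mask.sum()), "of", len(mask), "keys retained")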
Handwriting Imagery EEG Classification based on Convolutional Neural Networks
Authors: Hao Yang, Guang Ouyang
2025-09-03
Handwriting imagery has emerged as a promising paradigm for brain-computer
interfaces (BCIs) aimed at translating brain activity into text output.
Compared with invasively recorded electroencephalography (EEG), non-invasive
recording offers a more practical and feasible approach to capturing brain
signals for BCI. This study explores the limit of decoding non-invasive EEG associated with handwriting imagery into English letters using deep neural
networks. To this end, five participants were instructed to imagine writing the
26 English letters with their EEG being recorded from the scalp. A measurement
of EEG similarity across letters was conducted to investigate letter-specific
patterns in the dataset. Subsequently, four convolutional neural network (CNN)
models were trained for EEG classification. Descriptively, the EEG data clearly exhibited letter-specific patterns, serving as a proof-of-concept for EEG-to-text translation. Against a chance-level accuracy of 3.85%, the CNN classifiers trained on each participant reached a ceiling of around 20%. This study marks the first attempt to decode
non-invasive EEG associated
with handwriting imagery. Although the achieved accuracy is not sufficient for
a usable brain-to-text BCI, the model's performance is noteworthy in revealing
the potential for translating non-invasively recorded brain signals into text
outputs and establishing a baseline for future research.
Binary Quantization For LLMs Through Dynamic Grouping
Authors: Xinzhe Zheng, Zhen-Qun Yang, Haoran Xie, S. Joe Qin, Arlene Chen, Fangzhen Lin
2025-09-03
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of Natural Language Processing (NLP) tasks, but require substantial memory and computational resources. Binary quantization, which compresses model weights from 16-bit Brain Float to 1-bit representations in {-1, 1}, offers significant reductions in storage and inference costs. However, such aggressive quantization often leads to notable performance degradation compared to more conservative 4-bit quantization methods. In this research, we propose a novel optimization objective tailored for binary quantization, along with three algorithms designed to realize it effectively. Our method enhances blocked quantization by dynamically identifying optimal unstructured sub-matrices through adaptive grouping strategies. Experimental results demonstrate that our approach achieves an average bit length of just 1.007 bits, while maintaining high model quality. Specifically, our quantized LLaMA 3.2 3B model attains a perplexity of 8.23, remarkably close to the original 7.81, and far surpasses the previous SOTA BiLLM, which attains a perplexity of 123.90. Furthermore, our method is competitive with SOTA 4-bit approaches such as GPTQ in both performance and efficiency. The compression process is highly efficient, requiring only 14 seconds to quantize the full LLaMA 3.2 3B weights on a single CPU core, with the entire process completing in under 100 minutes and exhibiting embarrassingly parallel properties.
Code - https://github.com/johnnyzheng0636/WGM_bi_quan
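For intuition, a minimal sketch of 1-bit weight quantization with per-group scales: each group is stored as sign bits plus one scale (the mean absolute value minimizes L2 error for sign quantization). The dynamic grouping search is the paper's contribution and is not reproduced here; fixed-size groups are an assumption.

    import numpy as np

    def binarize(weights: np.ndarray, group_size: int = 128):
        w = weights.reshape(-1, group_size)
        scales = np.abs(w).mean(axis=1, keepdims=True)  # one fp scale per group
        signs = np.where(w >= 0, 1.0, -1.0)             # the 1-bit payload
        return signs, scales

    def dequantize(signs, scales):
        return signs * scales

    w = np.random.randn(1024).astype(np.float32)
    signs, scales = binarize(w)
    err = np.mean((dequantize(signs, scales).ravel() - w) ** 2)
    print(f"reconstruction RMSE: {np.sqrt(err):.4f}")
    # Average storage with fixed groups: 1 bit/weight + 16 bits per 128 weights.
    print("bits/weight:", 1 + 16 / 128)  # 1.125; dynamic grouping aims lower
    # (the paper reports ~1.007 bits on average).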
FlashRecovery Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs
Authors: Haijun Zhang, Jinxiang Wang, Zhenhua Yu, Yanyong Zhang, Xuejie Ji, Kaining Mao, Jun Zhang, Yaqing Zhang, Ting Wu, Fei Jie, Xiemin Huang, Zhifang Cai, Junhua Cheng, Shuwei Wang, Wei Li, Xiaoming Bao, Hua Xu, Shixiong Zhao, Jun Li, Hongwei Sun, Ziyang Zhang, Yi Xiong, Chunsheng Li
2025-09-03
Large language models (LLMs) have made a profound impact across various
fields due to their advanced capabilities. However, training these models at
unprecedented scales requires extensive AI accelerator clusters and
sophisticated parallelism strategies, which pose significant challenges in
maintaining system reliability over prolonged training periods. A major concern
is the substantial loss of training time caused by inevitable hardware and
software failures. To address these challenges, we present FlashRecovery, a
fast and low-cost failure recovery system comprising three core modules: (1)
Active and real-time failure detection. This module performs continuous
training state monitoring, enabling immediate identification of hardware and
software failures within seconds, thus ensuring rapid incident response; (2)
Scale-independent task restart. By employing different recovery strategies for
normal and faulty nodes, combined with an optimized communication group reconstruction protocol, our approach ensures that the recovery time remains
nearly constant, regardless of cluster scale; (3) Checkpoint-free recovery
within one step. Our novel recovery mechanism enables single-step restoration,
completely eliminating dependence on traditional checkpointing methods and
their associated overhead. Collectively, these innovations enable FlashRecovery
to achieve optimal Recovery Time Objective (RTO) and Recovery Point Objective
(RPO), substantially improving the reliability and efficiency of long-duration
training. Experimental results demonstrate that FlashRecovery can achieve training restoration on a training cluster with 4,800 devices in 150
seconds. We also verify that the time required for failure recovery is nearly
consistent for different scales of training tasks.
Mycroft Tracing Dependencies in Collective Communication Towards Reliable LLM Training
Authors: Yangtao Deng, Lei Zhang, Qinlong Wang, Xiaoyun Zhi, Xinlei Zhang, Zhuo Jiang, Haohan Xu, Lei Wang, Zuquan Song, Gaohong Liu, Yang Bai, Shuguang Wang, Wencong Xiao, Jianxi Ye, Minlan Yu, Hong Xu
2025-09-03
Reliability is essential for ensuring efficiency in LLM training. However, many real-world reliability issues remain difficult to resolve, resulting in wasted resources and degraded model performance. Unfortunately, today's collective communication libraries operate as black boxes, hiding critical information needed for effective root cause analysis. We propose Mycroft, a lightweight distributed tracing and root cause analysis system designed to address previously hidden reliability issues in collective communication. Mycroft's key idea is to trace collective communication states and leverage internal control and data dependencies to resolve reliability problems in LLM training. Mycroft has been deployed at ByteDance for over six months to debug collective communication related issues at runtime. It detected anomalies
within 15 seconds in 90% of cases and identified the root cause within 20
seconds in 60% of cases. We also conducted extensive fault injection
experiments to demonstrate Mycroft's capability and efficiency.
QNPU Quantum Network Processor Unit for Quantum Supercomputers
Authors: Peiyi Li, Chenxu Liu, Ji Liu, Huiyang Zhou, Ang Li
2025-09-02
As quantum computing progresses, the need for scalable solutions to address
large-scale computational problems has become critical. Quantum supercomputers
are the next upcoming frontier by enabling multiple quantum processors to
collaborate effectively to solve large-scale computational problems. The
emergence of quantum supercomputers necessitates an efficient interface to
manage the quantum communication protocols between quantum processors. In this paper, we propose the Quantum Network Processing Unit (QNPU), which enables quantum applications to efficiently scale beyond the capacity of individual quantum processors, serving as a critical building block for future quantum supercomputers. The QNPU works alongside the Quantum Processing Unit (QPU) in our decoupled processing units architecture, where the QPU handles local quantum operations while the QNPU manages quantum communication between nodes. We design a comprehensive instruction set architecture (ISA) for the QNPU with high-level communication protocol abstractions, implemented via micro-operations that manage EPR resources, quantum operations, and classical communication. To facilitate programming, we introduce DistQASM, which extends OpenQASM with distributed quantum operations. We then propose a microarchitecture featuring both scalar and superscalar QNPU designs to enhance performance for communication-intensive quantum workloads. Finally, we evaluate the performance of our proposed QNPU design with distributed quantum workloads and demonstrate that the QNPU significantly improves the efficiency of communication between quantum nodes, paving the way for quantum supercomputing.
The Transparent Earth A Multimodal Foundation Model for the Earth's Subsurface
Authors: Arnab Mazumder, Javier E. Santos, Noah Hobbs, Mohamed Mehana, Daniel O'Malley
2025-09-02
We present the Transparent Earth, a transformer-based architecture for reconstructing subsurface properties from heterogeneous datasets that vary in sparsity, resolution, and modality, where each modality represents a distinct
type of observation (e.g., stress angle, mantle temperature, tectonic plate
type). The model incorporates positional encodings of observations together
with modality encodings, derived from a text embedding model applied to a
description of each modality. This design enables the model to scale to an
arbitrary number of modalities, making it straightforward to add new ones not
considered in the initial design. We currently include eight modalities
spanning directional angles, categorical classes, and continuous properties
such as temperature and thickness. These capabilities support in-context
learning, enabling the model to generate predictions either with no inputs or
with an arbitrary number of additional observations from any subset of
modalities. On validation data, this reduces errors in predicting stress angle
by more than a factor of three. The proposed architecture is scalable and
demonstrates improved performance with increased parameters. Together, these
advances make the Transparent Earth an initial foundation model for the Earth's
subsurface that ultimately aims to predict any subsurface property anywhere on
Earth.
LExI Layer-Adaptive Active Experts for Efficient MoE Model Inference
Authors: Krishna Teja Chitty-Venkata, Sandeep Madireddy, Murali Emani, Venkatram Vishwanath
2025-09-02
Mixture-of-Experts (MoE) models scale efficiently by activating only a subset of experts per token, offering a computationally sparse alternative to dense architectures. While prior post-training optimizations, such as inter- and intra-expert pruning, reduce memory usage, they provide limited gains in inference-time compute efficiency. Moreover, existing MoE architectures typically activate a fixed number of experts uniformly across all layers, resulting in redundant computation and suboptimal performance. In this work, we first demonstrate that MoE pruning strategies improve only the memory footprint but do not significantly improve inference performance on GPUs using optimized serving frameworks such as vLLM. To address this, we introduce LExI, a data-free optimization technique that determines the optimal number of active experts per layer in a pretrained MoE model. LExI leverages only the model weights to estimate the relative importance of each layer and adaptively assigns the number of active experts per layer accordingly. Experiments on state-of-the-art language and vision MoE benchmarks demonstrate that LExI significantly outperforms traditional MoE pruning approaches in terms of inference efficiency with negligible accuracy loss. For example, using LExI, Qwen1.5-MoE achieves the same throughput on an Nvidia H100 GPU with 10% better accuracy than traditional expert pruning.
Planning with Reasoning using Vision Language World Model
Authors: Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, Pascale Fung
2025-09-02
Effective planning requires strong world models, but high-level world models
that can understand and reason about actions with semantic and temporal
abstraction remain largely underdeveloped. We introduce the Vision Language
World Model (VLWM), a foundation model trained for language-based world
modeling on natural videos. Given visual observations, the VLWM first infers overall goal achievements and then predicts a trajectory composed of interleaved actions and world state changes. These targets are extracted by
iterative Self-Refine conditioned on compressed future observations
represented by a Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitate reactive system-1 plan decoding and reflective system-2 planning via cost minimization. The cost evaluates the
semantic distance between the hypothetical future states given by VLWM
roll-outs and the expected goal state, and is measured by a critic model that
we trained in a self-supervised manner. The VLWM achieves state-of-the-art
Visual Planning for Assistance (VPA) performance on both benchmark evaluations
and our proposed PlannerArena human evaluations, where system-2 improves the
Elo score by +27% over system-1. The VLWM also outperforms strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
Lighting the Way for BRIGHT Reproducible Baselines with Anserini, Pyserini, and RankLLM
Authors: Yijun Ge, Sahel Sharifymoghaddam, Jimmy Lin
2025-09-02
The BRIGHT benchmark is a dataset consisting of reasoning-intensive queries
over diverse domains. We explore retrieval results on BRIGHT using a range of
retrieval techniques, including sparse, dense, and fusion methods, and
establish reproducible baselines. We then apply listwise reranking with large
language models (LLMs) to further investigate the impact of reranking on reasoning-intensive queries. These baselines are integrated into the popular retrieval and reranking toolkits Anserini, Pyserini, and RankLLM, with
two-click reproducibility that makes them easy to build upon and convenient for
further development. While attempting to reproduce the results reported in the
original BRIGHT paper, we find that the provided BM25 scores differ notably
from those that we obtain using Anserini and Pyserini. We discover that this
difference is due to BRIGHT's implementation of BM25, which applies BM25 on the
query rather than using the standard bag-of-words approach, as in Anserini, to
construct query vectors. This difference has become increasingly relevant due
to the rise of longer queries, with BRIGHT's lengthy reasoning-intensive
queries being a prime example, and further accentuated by the increasing usage
of retrieval-augmented generation, where LLM prompts can grow to be much longer than "traditional" search engine queries. Our observation suggests that it
may be time to reconsider BM25 approaches going forward in order to better
accommodate emerging applications. To facilitate this, we integrate query-side
BM25 into both Anserini and Pyserini.
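A small sketch of the distinction found here: standard BM25 uses raw bag-of-words counts as query-term weights, while a query-side variant applies BM25-style term saturation to the query itself, which matters for long reasoning-intensive queries. The formulas are the usual BM25; the corpus statistics below are stubbed for illustration.

    import math
    from collections import Counter

    K1, B = 0.9, 0.4  # common Lucene-style defaults, assumed

    def idf(term, n_docs, df):
        return math.log(1 + (n_docs - df.get(term, 0) + 0.5) / (df.get(term, 0) + 0.5))

    def term_weight(tf, length, avg_len):
        return tf * (K1 + 1) / (tf + K1 * (1 - B + B * length / avg_len))

    def score(query_tokens, doc_tf, doc_len, avg_len, n_docs, df, query_side=False):
        q_tf = Counter(query_tokens)
        s = 0.0
        for term, qtf in q_tf.items():
            if term not in doc_tf:
                continue
            # Standard: query weight is the raw count. Query-side: saturate it too.
            q_w = term_weight(qtf, len(query_tokens), avg_len) if query_side else qtf
            s += idf(term, n_docs, df) * q_w * term_weight(doc_tf[term], doc_len, avg_len)
        return s

    doc = Counter("bm25 ranks documents by term frequency".split())
    long_query = "why does bm25 bm25 bm25 term frequency saturation matter".split()
    stats = (doc, sum(doc.values()), 10.0, 1000,
             {"bm25": 200, "term": 300, "frequency": 250})
    print(score(long_query, *stats), score(long_query, *stats, query_side=True))

With raw counts, repeated query terms scale the score linearly; the query-side variant saturates them, which is one plausible reading of why the two implementations diverge on long queries.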
LLM-Enhanced Space-Air-Ground-Sea Integrated Networks
Authors: Halvin Yang, Sangarapillai Lambotharan, Mahsa Derakhshani, Lajos Hanzo
2025-09-02
The space-air-ground-sea integrated networking (SAGSIN) concept promises
seamless global multimedia connectivity, yet two obstacles still limit its
practical deployment. Firstly, high-velocity satellites, aerial relays and
sea-surface platforms suffer from obsolete channel state information (CSI),
undermining feedback-based adaptation. Secondly, data-rate disparity across the
protocol stack is extreme: terabit optical links in space coexist with kilobit
acoustic under-water links. This article shows that a single large language
model (LLM) backbone, trained jointly on radio, optical and acoustic traces,
can provide a unified, data-driven adaptation layer that addresses both rapid
CSI ageing and severe bandwidth disparity across the SAGSIN protocol stack.
Explicitly, an LLM-based long-range channel predictor forecasts the strongest
delay-Doppler components several coherence intervals ahead, facilitating
near-capacity reception despite violent channel fluctuations. Furthermore, our
LLM-based semantic encoder turns raw sensor payloads into task-oriented tokens. This substantially reduces the SNR required for high-fidelity image delivery in a coastal underwater link, circumventing the data rate limitation via semantic communications. The inclusion of these tools creates a medium-agnostic adaptation
layer that spans radio, optical and acoustic channels. We conclude with
promising open research directions in on-device model compression, multimodal
fidelity control, cross-layer resource orchestration and trustworthy operation,
charting a path from laboratory prototypes to field deployment.
Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
Authors: Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang
2025-09-02
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have
empowered large language models (LLMs) to tackle challenging reasoning tasks
such as mathematics and programming. RLVR leverages verifiable outcome rewards
to guide policy optimization, enabling LLMs to progressively improve output
quality in a grounded and reliable manner. Despite its promise, the RLVR
paradigm poses significant challenges, as existing methods often suffer from
sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address these challenges, we propose PACS, a novel RLVR framework that achieves imPlicit Actor Critic coupling via a Supervised learning framework. By
treating the outcome reward as a predictable label, we reformulate the RLVR
problem into a supervised learning task over a score function parameterized by
the policy model and optimized using cross-entropy loss. A detailed gradient
analysis shows that this supervised formulation inherently recovers the
classical policy gradient update while implicitly coupling actor and critic
roles, yielding more stable and efficient training. Benchmarking on challenging
mathematical reasoning tasks, PACS outperforms strong RLVR baselines, such as
PPO and GRPO, achieving superior reasoning performance. For instance, PACS
achieves 59.78% at pass@256 on AIME 2025, representing improvements of 13.32 and 14.36 points over PPO and GRPO, respectively. This simple yet powerful framework offers a promising avenue for LLM post-training with verifiable rewards. Our code and
data are available as open source at https://github.com/ritzz-ai/PACS.
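A schematic sketch of the reformulation: the verifiable outcome reward r in {0, 1} becomes a supervision label for a score computed from the policy itself, trained with cross-entropy. Using the mean token log-probability of the sampled answer as the score function is an illustrative assumption, not necessarily the paper's exact parameterization.

    import torch
    import torch.nn.functional as F

    def pacs_style_loss(token_logits: torch.Tensor, token_ids: torch.Tensor,
                        reward: torch.Tensor) -> torch.Tensor:
        """token_logits: (T, V) policy logits; token_ids: (T,) sampled answer;
        reward: scalar 0/1 verifiable outcome (e.g., math answer correct)."""
        logprobs = F.log_softmax(token_logits, dim=-1)
        chosen = logprobs.gather(1, token_ids.unsqueeze(1)).squeeze(1)
        score = chosen.mean()  # score function parameterized by the policy
        # Supervised objective: predict the outcome label from the policy's
        # score. Its gradient resembles a policy-gradient update in which the
        # sigmoid of the score acts as an implicit, learned critic/baseline.
        return F.binary_cross_entropy_with_logits(score, reward)

    logits = torch.randn(12, 32000, requires_grad=True)
    ids = torch.randint(0, 32000, (12,))
    loss = pacs_style_loss(logits, ids, torch.tensor(1.0))
    loss.backward()
    print(float(loss))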
MoPEQ Mixture of Mixed Precision Quantized Experts
Authors: Krishna Teja Chitty-Venkata, Jie Ye, Murali Emani
2025-09-02
Large Language and Vision Models using a Mixture-of-Experts (MoE)
architecture pose significant challenges for deployment due to their
computational and memory demands. Mixed Precision Quantization assigns
different precisions to different layers of an LLM/VLM based on layer sensitivity and importance within the model. In this work, we propose a Post-Training Quantization algorithm, MoPEQ, that assigns optimal bit width to each
expert. Our method balances accuracy and model size by analyzing each expert's
sensitivity using Hessian trace approximation instead of relying on the
activation frequency of the expert. This per-expert granularity approach
clusters similar experts to maintain model performance while reducing memory
requirements. The experimental results on VLMEvalKit benchmark datasets using
State-of-the-art VLMs Deepseek-VL2 -tiny, -small, -base, and MolmoE models
demonstrate that our mixed-precision quantized MoEs achieve competitive
accuracy with substantial improvements in memory footprint compared to
uniform-precision baseline methods. We perform a comprehensive study to analyze
the impact of expert activation frequency and sensitivity using Hessian trace
approximation at both layer-wise and model-wide expert precision allocation of
2, 3, and 4 bits to provide a thorough understanding of mixed-precision quantization of VLM-MoEs.
Top-H Decoding Adapting the Creativity and Coherence with Bounded Entropy in Text Generation
Authors: Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram
2025-09-02
Large language models (LLMs), despite their impressive performance across a
wide range of tasks, often struggle to balance two competing objectives in
open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including
temperature scaling, top-$p$ (nucleus) sampling, and min-$p$ sampling, aim
to manage this trade-off. However, they exhibit limitations, particularly in
the effective incorporation of the confidence of the model into the
corresponding sampling strategy. For example, min-$p$ sampling relies on a
single top token as a heuristic for confidence, eventually underutilizing the
information of the probability distribution. Toward effective incorporation of
the confidence of the model, in this paper, we present top-H decoding. We
first establish the theoretical foundation of the interplay between creativity
and coherence in truncated sampling by formulating an entropy-constrained
minimum divergence problem. We then prove this minimization problem to be
equivalent to an entropy-constrained mass maximization (ECMM) problem,
which is NP-hard. Finally, we present top-H decoding, a computationally
efficient greedy algorithm to solve the ECMM problem. Extensive empirical
evaluations demonstrate that top-H outperforms the state-of-the-art (SoTA)
alternative of min-$p$ sampling by up to 25.63% on creative writing
benchmarks, while maintaining robustness on question-answering datasets such as
GPQA, GSM8K, and MT-Bench. Additionally, an LLM-as-judge evaluation confirms
that top-H indeed produces coherent outputs even at higher temperatures, where
creativity is especially critical. In summary, top-H advances SoTA in
open-ended text generation and can be easily integrated into creative writing
applications. The code is available at
https://github.com/ErfanBaghaei/Top-H-Decoding.
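A toy sketch of a greedy entropy-bounded truncation in the spirit of top-H: grow the candidate set in descending probability order (maximizing kept mass) and stop once the renormalized entropy of the kept set would exceed a budget tied to the full distribution's entropy. The exact form of the bound here is an assumption.

    import numpy as np

    def entropy(p: np.ndarray) -> float:
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    def top_h(probs: np.ndarray, alpha: float = 0.8) -> np.ndarray:
        """Return indices of kept tokens; alpha scales the entropy budget."""
        budget = alpha * entropy(probs)          # bounded-entropy constraint
        order = np.argsort(probs)[::-1]          # greedy: most probable first
        kept = []
        for idx in order:
            trial = probs[kept + [idx]] / probs[kept + [idx]].sum()
            if kept and entropy(trial) > budget:
                break                            # one more token breaks the budget
            kept.append(idx)
        return np.array(kept)

    probs = np.array([0.4, 0.25, 0.15, 0.1, 0.05, 0.05])
    keep = top_h(probs)
    print(keep, probs[keep] / probs[keep].sum())  # truncated, renormalized policy

On this toy distribution the greedy loop keeps the top three tokens: adding the fourth would push the renormalized entropy past the budget.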
HydroGAT Distributed Heterogeneous Graph Attention Transformer for Spatiotemporal Flood Prediction
Authors: Aishwarya Sarkar, Autrin Hakimi, Xiaoqiong Chen, Hai Huang, Chaoqun Lu, Ibrahim Demir, Ali Jannesari
2025-09-02
Accurate flood forecasting remains a challenge for water-resource management,
as it demands modeling of local, time-varying runoff drivers (e.g.,
rainfall-induced peaks, baseflow trends) and complex spatial interactions
across a river network. Traditional data-driven approaches, such as
convolutional networks and sequence-based models, ignore topological
information about the region. Graph Neural Networks (GNNs) propagate
information exactly along the river network, which is ideal for learning
hydrological routing. However, state-of-the-art GNN-based flood prediction
models collapse pixels to coarse catchment polygons as the cost of training
explodes with graph size and higher resolution. Furthermore, most existing
methods treat spatial and temporal dependencies separately, either applying
GNNs solely on spatial graphs or transformers purely on temporal sequences,
thus failing to simultaneously capture spatiotemporal interactions critical for
accurate flood prediction. We introduce a heterogeneous basin graph where every
land and river pixel is a node connected by physical hydrological flow
directions and inter-catchment relationships. We propose HydroGAT, a
spatiotemporal network that adaptively learns local temporal importance and the
most influential upstream locations. Evaluated in two Midwestern US basins and
across five baseline architectures, our model achieves higher NSE (up to 0.97),
improved KGE (up to 0.96), and low bias (PBIAS within 5%) in hourly discharge prediction, while offering interpretable attention maps that reveal sparse, structured intercatchment influences. To support high-resolution
basin-scale training, we develop a distributed data-parallel pipeline that
scales efficiently up to 64 NVIDIA A100 GPUs on NERSC Perlmutter supercomputer,
demonstrating up to 15x speedup across machines. Our code is available at
https://github.com/swapp-lab/HydroGAT.
MLP-Offload Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall
Authors: Avinash Maurya, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae
2025-09-02
Training LLMs larger than the aggregated memory of multiple GPUs is increasingly necessary due to the faster growth of LLM sizes compared to GPU memory. To this end, multi-tier host memory or disk offloading techniques have been proposed by the state of the art. Despite advanced asynchronous multi-tier read/write strategies, such offloading strategies result in significant I/O overheads in the critical path of training, resulting in slower iterations. To address this, we propose MLP-Offload, a novel multi-level, multi-path offloading engine specifically designed for optimizing LLM training on resource-constrained
setups by mitigating I/O bottlenecks. We make several key observations that
drive the design of MLP-Offload, such as I/O overheads during the update
dominate the iteration time; I/O bandwidth of the third-level remote storage
tier remains unutilized; and, contention due to concurrent offloading amplifies
I/O bottlenecks. Driven by these insights, we design and implement MLP-Offload
to offload the optimizer states across multiple tiers in a cache-efficient and
concurrency-controlled fashion to mitigate I/O bottlenecks during the backward
and update phases. Evaluations on models up to 280B parameters show that MLP-Offload achieves 2.5x faster iterations compared to state-of-the-art LLM training runtimes.
An Efficient and Adaptive Watermark Detection System with Tile-based Error Correction
Authors: Xinrui Zhong, Xinze Feng, Jingwei Zuo, Fanjiang Ye, Yi Mu, Junfeng Guo, Heng Huang, Myungjin Lee, Yuke Wang
2025-09-02
Efficient and reliable detection of generated images is critical for the
responsible deployment of generative models. Existing approaches primarily
focus on improving detection accuracy and robustness under various image
transformations and adversarial manipulations, yet they largely overlook the
efficiency challenges of watermark detection across large-scale image
collections. To address this gap, we propose QRMark, an efficient and adaptive
end-to-end method for detecting embedded image watermarks. The core idea of
QRMark is to combine QR Code inspired error correction with tailored tiling
techniques to improve detection efficiency while preserving accuracy and robustness. At the algorithmic level, QRMark employs a Reed-Solomon error
correction mechanism to mitigate the accuracy degradation introduced by tiling.
At the system level, QRMark implements a resource-aware stream allocation
policy that adaptively assigns more streams to GPU-intensive stages of the
detection pipeline. It further employs a tile-based workload interleaving
strategy to overlap data-loading overhead with computation and schedules
kernels across stages to maximize efficiency. End-to-end evaluations show that
QRMark achieves an average 2.43x inference speedup over the sequential
baseline.
Cache Management for Mixture-of-Experts LLMs -- extended version
Authors: Spyros Angelopoulos, Loris Marchal, Adrien Obrecht, Bertrand Simon
2025-09-02
Large language models (LLMs) have demonstrated remarkable capabilities across
a variety of tasks. One of the main challenges towards the successful
deployment of LLMs is memory management, since they typically involve billions
of parameters. To this end, architectures based on Mixture-of-Experts (MoE)
have been proposed, which aim to reduce the number of parameters that are
activated when producing a token. This raises the equally critical issue of
efficiently managing the limited cache of the system, in that frequently used
experts should be stored in the fast cache rather than in the slower secondary
memory.
In this work, we introduce and study a new paging problem that models expert
management optimization. Our formulation captures both the layered architecture
of LLMs and the requirement that experts are cached efficiently. We first
present lower bounds on the competitive ratio of both deterministic and
randomized algorithms, which show that under mild assumptions, LRU-like
policies have good theoretical competitive performance. We then propose a
layer-based extension of LRU that is tailored to the problem at hand.
Extensive simulations on both synthetic datasets and actual traces of MoE
usage show that our algorithm outperforms policies designed for the classic
paging problem, such as standard LRU.
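A layer-aware LRU variant in the spirit described above can be sketched as follows; this is a minimal illustration, not the authors' exact policy:

```python
# Layer-aware LRU expert cache: entries are keyed by (layer, expert) and
# evictions prefer experts outside the layer currently executing, since those
# will not be needed again until the next token's forward pass.
from collections import OrderedDict

class LayerLRU:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()          # (layer, expert) -> weights handle

    def access(self, layer: int, expert: int, load_fn, active_layer: int):
        key = (layer, expert)
        if key in self.cache:
            self.cache.move_to_end(key)     # LRU hit: refresh recency
            return self.cache[key]
        if len(self.cache) >= self.capacity:
            # evict the least-recently-used entry NOT in the active layer, if any
            victim = next((k for k in self.cache if k[0] != active_layer),
                          next(iter(self.cache)))
            self.cache.pop(victim)
        self.cache[key] = load_fn(layer, expert)   # fetch from secondary memory
        return self.cache[key]
```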
Upcycling Candidate Tokens of Large Language Models for Query Expansion
Authors: Jinseok Kim, Sukmin Cho, Soyeong Jeong, Sangyeop Kim, Sungzoon Cho
2025-09-02
Query Expansion (QE) improves retrieval performance by enriching queries with
related terms. Recently, Large Language Models (LLMs) have been used for QE,
but existing methods face a trade-off: generating diverse terms boosts
performance but increases computational cost. To address this challenge, we
propose Candidate Token Query Expansion (CTQE), which extracts diverse and
relevant terms from a single LLM decoding pass by leveraging unselected
candidate tokens. These tokens, though not part of the final output, are
conditioned on the full query and capture useful information. By aggregating
them, CTQE achieves both relevance and diversity without extra inference,
reducing overhead and latency. Experiments show that CTQE delivers strong
retrieval performance with significantly lower cost, outperforming or
comparable to more expensive methods. Code is available at:
https://github.com/bluejeans8/CTQE
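The core mechanism, harvesting unselected candidate tokens from one decoding pass, can be sketched as below; the function names and the simple top-k frequency aggregation are illustrative assumptions, not CTQE's actual implementation:

```python
# Collect high-probability alternatives to each emitted token from a single
# generation pass, then rank them by frequency as candidate expansion terms.
import torch
from collections import Counter

def collect_candidates(step_logits, chosen_ids, k=5):
    """step_logits: [steps, vocab] tensor; chosen_ids: [steps] emitted tokens."""
    counts = Counter()
    for logits, chosen in zip(step_logits, chosen_ids):
        topk = torch.topk(logits, k).indices.tolist()
        counts.update(t for t in topk if t != int(chosen))  # skip the emitted token
    return counts

# Expansion terms = the most frequent alternatives, decoded to text with the
# model's tokenizer (omitted here); no extra forward pass is required.
```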
AudioCodecBench A Comprehensive Benchmark for Audio Codec Evaluation
Authors: Lu Wang, Hao Chen, Siyu Wu, Zhiyue Wu, Hao Zhou, Chengfeng Zhang, Ting Wang, Haodi Zhang
2025-09-02
Multimodal Large Language Models (MLLMs) have been widely applied in speech
and music. This tendency has led to a focus on audio tokenization for Large
Models (LMs). Unlike semantic-only text tokens, audio tokens must both capture
global semantic content and preserve fine-grained acoustic details. Moreover,
they provide a discrete representation of speech and music that can be
effectively integrated into MLLMs. However, existing research lacks suitable
definitions of semantic tokens and acoustic tokens. In addition, the evaluation
of different codecs typically concentrates on specific domains or tasks, such
as reconstruction or Automatic Speech Recognition (ASR) task, which prevents
fair and comprehensive comparisons. To address these problems, this paper
provides suitable definitions for semantic and acoustic tokens and introduces a
systematic evaluation framework. This framework allows for a comprehensive
assessment of codecs' capabilities across four dimensions: audio reconstruction
metrics, codebook index (ID) stability, decoder-only perplexity, and
performance on downstream probe tasks. Our results support the provided
definitions and reveal correlations among reconstruction metrics, codebook ID
stability, downstream probe tasks, and perplexity.
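Of the four dimensions, codebook ID stability is the easiest to illustrate. One plausible reading, sketched below, is the fraction of token IDs that survive a small input perturbation; the benchmark's exact metric definition may differ:

```python
# Codebook ID stability as frame-wise agreement between the token IDs of a
# clean signal and a slightly perturbed version of the same signal.
import numpy as np

def id_stability(ids_clean: np.ndarray, ids_perturbed: np.ndarray) -> float:
    """Both arrays hold codebook indices for the same audio, aligned per frame."""
    n = min(len(ids_clean), len(ids_perturbed))
    return float(np.mean(ids_clean[:n] == ids_perturbed[:n]))
```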
FPGA-Based RoCEv2-RDMA Readout Electronics for the CTAO-LST Advanced Camera
Authors: F. Marini, M. Bellato, A. Bergnoli, D. Corti, A. Griggio, R. Isocrate, L. Modenese, M. Toffano, C. Arcaro, F. Di Pierro, M. Mariotti, M. Mi, P. Wang
2025-09-02
The largest telescope type of CTAO (Cherenkov Telescope Array Observatory),
the LST (Large-Sized Telescope), is being installed at the northern site of the
Cherenkov Telescope Array (CTA) at the Observatorio del Roque de los Muchachos
on the Canary island of La Palma. These telescopes aim to capture the
lowest-energy gamma rays of the observatory. The readout electronics
architecture proposed here, a proof of concept for the LST advanced camera
upgrade,
relies on a custom high-channel count fast sampling hardware digitizer board
acting as a Front-End. The design includes a versatile pre-amplification stage
and high-speed serial links for streaming JESD204C-compliant data at rates
approaching 12 Gb/s per lane. The data get transferred to Back-End electronics
for a first data-processing and trigger before being transmitted to
event-building servers through 10 Gb/s Ethernet links. The performance of the
link is exploited by implementing RDMA in hardware, thanks to a RoCEv2 core
written in Bluespec SystemVerilog, enabling data to be transferred directly to
processing units without CPU intervention. Hardware
design and characterization of the Front End board are reported, as well as a
detailed description and tests of the Back End RDMA firmware.
Dual-end Fluid Antennas For Robust Anti-jamming in Low-altitude Air-ground Communications
Authors: Yifan Guo, Junshan Luo, Fanggang Wang, Haiyang Ding, Shilian Wang, Zhenhai Xu
2025-09-02
This paper addresses the challenge of co-channel interference and intentional
jamming in low-altitude air-ground communications. Since conventional
fixed-position antenna (FPA) systems lack spatial adaptability to dynamically
balance signal enhancement against interference suppression, we propose a
transformative fluid antenna system (FAS)-assisted heterogeneous dual-layer
transmission architecture. Specifically, a terrestrial base station with FPA
serves ground users, while a low-altitude base station equipped with FAS
communicates with the aerial user, also equipped with FAS, under the attack
of a malicious jammer. We formulate a worst-case achievable rate maximization
problem for the aerial user subject to constraints including quality-of-service
for
terrestrial users, imperfect jamming directions, minimum antenna separation,
etc. To address the non-convex problem, we propose a fractional
programming-block coordinate descent algorithm that alternately optimizes the
transmit precoders, receive combiner, and antenna positions at both transceiver
sides. A convex hull-based approach and a geometric boundary method are used
to
handle the jamming uncertainty and antenna placement constraints in confined
spatial regions, respectively. Extensive simulations validate significant
performance gains. The FAS achieves up to 56% higher data rates than FPA under
equivalent power constraints. Strategic antenna repositioning demonstrably
enhances signal quality while suppressing interference, maintaining robustness
across diverse jammer channel uncertainties.
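Schematically, the worst-case rate maximization described above has a max-min shape of the following kind; the notation here is assumed for illustration and is not the paper's exact formulation:

```latex
% Illustrative max-min robust design (assumed notation): precoders w, receive
% combiner v, antenna positions p, jamming-channel uncertainty set U,
% feasible placement region R, minimum QoS rate R_min, antenna spacing d_min.
\max_{\mathbf{w},\,\mathbf{v},\,\mathbf{p}\in\mathcal{R}}\;
  \min_{\Delta\in\mathcal{U}}\; R_{\text{air}}(\mathbf{w},\mathbf{v},\mathbf{p};\Delta)
\quad\text{s.t.}\quad
  R_{\text{ground},u}\ge R_{\min}\ \forall u,\qquad
  \|\mathbf{p}_i-\mathbf{p}_j\|\ge d_{\min}\ \forall i\ne j .
```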
Avoidance Decoding for Diverse Multi-Branch Story Generation
Authors: Kyeongman Park, Nakyeong Yang, Kyomin Jung
2025-09-02
Large Language Models (LLMs) often generate repetitive and monotonous
outputs, especially in tasks like story generation, due to limited creative
diversity when given the same input prompt. To address this challenge, we
propose a novel decoding strategy, Avoidance Decoding, that modifies token
logits by penalizing similarity to previously generated outputs, thereby
encouraging more diverse multi-branch stories. This penalty adaptively balances
two similarity measures: (1) Concept-level Similarity Penalty, which is
prioritized in early stages to diversify initial story concepts, and (2)
Narrative-level Similarity Penalty, which is increasingly emphasized later to
ensure natural yet diverse plot development. Notably, our method achieves up to
2.6 times higher output diversity and reduces repetition by an average of 30%
compared to strong baselines, while effectively mitigating text degeneration.
Furthermore, we reveal that our method activates a broader range of neurons,
demonstrating that it leverages the model's intrinsic creativity.
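A minimal sketch of the stage-dependent penalty follows, assuming precomputed per-token similarity scores; the weights and linear schedule are illustrative, not the authors' exact configuration:

```python
# Similarity-penalized logits with stage-dependent weights: early steps
# emphasize a concept-level penalty, later steps a narrative-level one.
import torch

def avoidance_logits(logits, concept_sim, narrative_sim, step, total_steps,
                     alpha=2.0, beta=2.0):
    """logits, concept_sim, narrative_sim: [vocab] tensors; step in [0, total_steps)."""
    progress = step / max(total_steps - 1, 1)
    w_concept = alpha * (1.0 - progress)      # prioritized early
    w_narrative = beta * progress             # emphasized later
    return logits - w_concept * concept_sim - w_narrative * narrative_sim

# Sampling then proceeds from softmax(avoidance_logits(...)) as usual.
```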
AMBEDKAR-A Multi-level Bias Elimination through a Decoding Approach with Knowledge Augmentation for Robust Constitutional Alignment of Language Models
Authors: Snehasis Mukhopadhyay, Aryan Kasat, Shivam Dubey, Rahul Karthikeyan, Dhruv Sood, Vinija Jain, Aman Chadha, Amitava Das
2025-09-02
Large Language Models (LLMs) can inadvertently reflect societal biases
present in their training data, leading to harmful or prejudiced outputs. In
the Indian context, our empirical evaluations across a suite of models reveal
that biases around caste and religion are particularly salient. Yet, most
existing mitigation strategies are Western-centric and fail to address these
local nuances. We propose AMBEDKAR, a framework inspired by the egalitarian
vision of Dr B. R. Ambedkar, architect of the Indian Constitution, to guide
outputs toward fairness, neutrality, and inclusion in line with Articles 14 to
17. Our approach introduces a Constitution-Aware Decoding Layer, guided by the
AI Constitution of India and applied only at inference time, without any
parameter updates to the base model. We incorporate a speculative decoding
algorithm that proactively reduces casteist and communal bias during
generation. This mitigation layer operates directly within the decoding
process, avoiding changes to model internals and lowering the computational and
infrastructural costs associated with retraining. We reinterpret speculative
decoding not merely as an efficiency tool but as a mechanism for fairness. In
this framework, a Small Language Model (SLM) acts as a potentially biased
generator, while a constitutionally guided Large Language Model (LLM) serves
as the verifier. Rather than accelerating generation, the LLM enforces
bias-robust
trajectories in the SLM outputs. This inversion of roles gives rise to a
fairness-by-speculation paradigm. Our approach yields an absolute reduction in
bias of up to 26.41 percent compared to the baseline. Our source code,
datasets, and
results are available at https://anonymous.4open.science/r/AMBEDKAR-983B/
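The inverted speculative loop can be sketched as follows; `slm_propose`, `llm_accepts`, and `llm_next_token` are hypothetical interfaces standing in for the SLM generator and the constitutionally guided LLM verifier:

```python
# Role-inverted speculative decoding: the small model drafts tokens and the
# large model verifies them for bias-robustness rather than for speedup.
def fairness_speculative_decode(prompt_tokens, slm_propose, llm_accepts,
                                llm_next_token, max_new_tokens=128, draft_len=4):
    """prompt_tokens: list of token ids; the three callables wrap the SLM/LLM."""
    tokens = list(prompt_tokens)
    target = len(tokens) + max_new_tokens
    while len(tokens) < target:
        for tok in slm_propose(tokens, n=draft_len):   # potentially biased drafts
            if llm_accepts(tokens, tok):               # verifier checks the draft
                tokens.append(tok)
            else:
                tokens.append(llm_next_token(tokens))  # verifier overrides it
                break                                  # resynchronize drafting
            if len(tokens) >= target:
                break
    return tokens
```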
FlexNGIA 2.0 Redesigning the Internet with Agentic AI -- Protocols, Services, and Traffic Engineering Designed, Deployed, and Managed by AI
Authors: Mohamed Faten Zhani, Younes Korbi, Yamen Mkadem
2025-09-02
The escalating demands of immersive applications, alongside advances in
network softwarization and AI-driven cognition and generative reasoning, create
a pivotal opportunity to rethink and reshape the future Internet. In this
context, we introduce FlexNGIA 2.0, an Agentic AI-driven Internet architecture
that leverages LLM-based AI agents to autonomously
orchestrate, configure, and evolve the network. These agents can, at runtime,
perceive, reason, and coordinate among themselves to dynamically design,
implement, deploy, and adapt communication protocols, Service Function Chains
(SFCs),
network functions, resource allocation strategies, congestion control, and
traffic engineering schemes, thereby ensuring optimal performance, reliability,
and efficiency under evolving conditions.
The paper first outlines the overall architecture of FlexNGIA 2.0 and its
constituent LLM-based AI agents. For each agent, we detail its design,
implementation, inputs and outputs, prompt structures, interactions with tools
and other agents, followed by preliminary proof-of-concept experiments
demonstrating its operation and potential. The results clearly highlight the
ability of these LLM-based AI agents to automate the design, implementation,
deployment, and performance evaluation of transport protocols, service function
chains, network functions, congestion control
schemes, and resource allocation strategies.
FlexNGIA 2.0 paves the way for a new class of Agentic AI-Driven networks,
where fully cognitive, self-evolving AI agents can autonomously design,
implement, adapt and optimize the network's protocols, algorithms, and
behaviors to efficiently operate across complex, dynamic, and heterogeneous
environments. To bring this vision to reality, we also identify key research
challenges toward achieving fully autonomous, adaptive, and agentic AI-driven
networks.
Batch Query Processing and Optimization for Agentic Workflows
Authors: Junyi Shen, Noppanat Wadlom, Yao Lu
2025-09-02
Large Language Models (LLMs) in agentic workflows combine multi-step
reasoning, tool use, and collaboration across multiple specialized agents.
Existing LLM serving engines optimize individual calls in isolation, while
multi-agent frameworks focus on orchestration without system-level performance
planning. As a result, repeated prompts, overlapping contexts, and concurrent
executions create substantial redundancy and poor GPU utilization, especially
in batch analytics scenarios. We introduce Halo, a system that brings batch
query processing and optimization into agentic LLM workflows. Halo represents
each workflow as a structured query plan DAG and constructs a consolidated
graph for batched queries that exposes shared computation. Guided by a cost
model that jointly considers prefill and decoding costs, KV-cache reuse, and
GPU placement, Halo performs plan-level optimization to minimize redundant
execution. Its runtime integrates adaptive batching, KV-cache sharing and
migration, along with compute-communication overlap to maximize hardware
efficiency. Evaluation across six benchmarks shows that Halo achieves up to
18.6x speedup for batch inference and 4.7x throughput improvement under online
serving, scaling to workloads of tens of thousands of queries and complex
graphs. These gains are achieved without compromising output quality. By
unifying query optimization with LLM serving, Halo enables efficient agentic
workflows in data analytics and decision-making applications.
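The plan-consolidation step can be illustrated by hashing DAG nodes so that identical sub-computations (for example, a shared prompt prefix or repeated tool call) are executed exactly once; the node encoding below is an assumption for illustration, not Halo's actual representation:

```python
# Consolidate batched query plans: nodes with identical operators and
# identical (already-consolidated) inputs collapse to a single shared node.
import hashlib

def node_key(op: str, input_keys: tuple) -> str:
    h = hashlib.sha256()
    h.update(op.encode())
    for k in input_keys:                  # keys of already-consolidated children
        h.update(k.encode())
    return h.hexdigest()

def consolidate(plans):
    """plans: list of DAGs, each a list of (op, child_indices) in topological order."""
    merged, per_plan_keys = {}, []
    for plan in plans:
        keys = []
        for op, children in plan:
            k = node_key(op, tuple(keys[c] for c in children))
            merged.setdefault(k, (op, tuple(keys[c] for c in children)))  # dedup
            keys.append(k)
        per_plan_keys.append(keys)
    return merged, per_plan_keys          # shared nodes now appear exactly once
```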
Reentrant superconductivity and superconductor-to-insulator transition in a naturally occurring Josephson junction array tuned by RF power
Authors: S. Avraham, S. Sankar, S. Sandik, A. Burshtein, M. Goldstein, E. Sela, Y. Dagan
2025-09-02
Superconductivity, characterized by dissipationless current flow with flux
expulsion or quantization, is usually muted when the magnetic field or the
temperature is sufficiently high. However, in rare instances, superconductivity
can reappear upon increasing the temperature or magnetic field, a phenomenon
known as reentrant superconductivity. It usually emerges from competing orders
in strongly correlated materials. Here we demonstrate reentrant
superconductivity as a function of both temperature and magnetic field, tuned
by radio frequency (RF) power in a relatively simple system: granular aluminum
(grAl), which exhibits the properties of a naturally occurring Josephson
junction array. At low temperatures, giant Shapiro steps emerge, exhibiting
characteristics of a single Josephson junction. Coherent phase locking across
the array's multiple junctions amplifies the quantized voltage, enabling
tunability at radio frequencies, as observed in artificially designed Josephson
arrays. We show that our system can be tuned from a coherent superconducting
(stiff-phase) to an insulating (phase-fluctuating) state using RF power. We
propose that the RF power modulates the Josephson coupling energy, E_J.
Remarkably, at elevated temperatures, the screening of the electron charge
suppresses the charging energy, causing superconductivity to reappear. This
many-body effect cannot be described within a single-junction framework. Our
system can therefore be tuned to observe both the single-junction regime and
many-body correlation effects, serving as a quantum simulator for complex
phenomena in condensed matter physics.
Loop Quantum Vector-Tensor Gravity and Its Spherically Symmetric Model
Authors: Shengzhi Li, Yongge Ma
2025-09-02
The Hamiltonian analysis of the vector-tensor theory of gravity is
performed. The resulting geometrical dynamics is reformulated into the
connection dynamics, with the real SU(2)-connection as one of the
configuration variables. This formulation allows us to extend the loop
quantization scheme of general relativity to the vector-tensor theory, thereby
rigorously constructing its quantum kinematical framework. The scalar
constraint is promoted to a well-defined operator in the vertex Hilbert space,
to represent quantum dynamics. Moreover, the spherically symmetric model of the
vector-tensor theory is obtained by the symmetric reduction. Following the
general deparametrization strategy for theories with diffeomorphism invariance,
the spherically symmetric model can be fully deparametrized in terms of the
degrees of freedom of the vector field. The corresponding reduced phase space
quantization is carried out. The physical Hamiltonian generating the relative
evolution is promoted to a well-defined operator on the physical Hilbert space.
FireRedTTS-2 Towards Long Conversational Speech Generation for Podcast and Chatbot
Authors: Kun Xie, Feiyu Shen, Junjie Li, Fenglong Xie, Xu Tang, Yao Hu
2025-09-02
Current dialogue generation approaches typically require the complete
dialogue text before synthesis and produce a single, inseparable speech
containing all voices, making them unsuitable for interactive chat; moreover,
they suffer from unstable synthesis, inaccurate speaker transitions, and
incoherent prosody. In this work, we present FireRedTTS-2, a long-form
streaming TTS system for multi-speaker dialogue generation, delivering stable,
natural speech with reliable speaker switching and context-aware prosody. A new
12.5Hz streaming speech tokenizer accelerates training and inference, extends
maximum dialogue length, encodes richer semantics to stabilize text-to-token
modeling and supports high-fidelity streaming generation for real-time
applications. We adopt a text-speech interleaved format, concatenating
speaker-labeled text with aligned speech tokens in chronological order, and
model it with a dual-transformer architecture: a large decoder-only
transformer predicts tokens at the first layer, and a smaller one completes
subsequent layers.
Experimental results show that FireRedTTS-2 integrates seamlessly with chat
frameworks and, with minimal fine-tuning, produces emotionally expressive
speech guided by implicit contextual cues. In podcast generation, it surpasses
existing systems including MoonCast, Zipvoice-Dialogue, and MOSS-TTSD in
objective intelligibility, speaker-turn reliability, and perceived naturalness
with context-consistent prosody. Our demos are available at
https://fireredteam.github.io/demos/firered_tts_2.
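The text-speech interleaved format can be sketched as a flat token sequence; the tag names and token IDs below are invented for illustration and are not FireRedTTS-2's actual vocabulary:

```python
# Build an interleaved training sequence: speaker-labeled text chunks alternate
# with their aligned speech tokens in chronological dialogue order.
def interleave(turns):
    """turns: list of (speaker, text_tokens, speech_tokens) in dialogue order."""
    seq = []
    for speaker, text_toks, speech_toks in turns:
        seq += [f"<spk:{speaker}>"] + text_toks    # who says it, in text
        seq += ["<audio>"] + speech_toks           # then the aligned speech tokens
    return seq

seq = interleave([("S1", ["hi", "there"], [101, 57, 9]),
                  ("S2", ["hello"], [88, 12])])
```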
Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs
Authors: Yuhao Wang, Junwei Pan, Xinhang Li, Maolin Wang, Yuan Wang, Yue Liu, Dapeng Liu, Jie Jiang, Xiangyu Zhao
2025-09-02
Sequential recommendation (SR) aims to capture users' dynamic interests and
sequential patterns based on their historical interactions. Recently, the
powerful capabilities of large language models (LLMs) have driven their
adoption in SR. However, we identify two critical challenges in existing
LLM-based SR methods: 1) embedding collapse when incorporating pre-trained
collaborative embeddings and 2) catastrophic forgetting of quantized embeddings
when utilizing semantic IDs. These issues dampen model scalability and lead to
suboptimal recommendation performance. Therefore, based on LLMs such as
Llama3-8B-instruct, we introduce a novel SR framework named MME-SID, which
integrates multimodal embeddings and quantized embeddings to mitigate embedding
collapse. Additionally, we propose a Multimodal Residual Quantized Variational
Autoencoder (MM-RQ-VAE) with maximum mean discrepancy as the reconstruction
loss and contrastive learning for alignment, which effectively preserve
intra-modal distance information and capture inter-modal correlations,
respectively. To further alleviate catastrophic forgetting, we initialize the
model with the trained multimodal code embeddings. Finally, we fine-tune the
LLM efficiently using LoRA in a multimodal frequency-aware fusion manner.
Extensive experiments on three public datasets validate the superior
performance of MME-SID thanks to its capability to mitigate embedding collapse
and catastrophic forgetting. The implementation code and datasets are publicly
available for reproduction:
https://github.com/Applied-Machine-Learning-Lab/MME-SID.
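For concreteness, here is a minimal RBF-kernel maximum mean discrepancy, in the spirit of MM-RQ-VAE's use of MMD as a reconstruction loss; the kernel choice and bandwidth are assumptions, not the paper's configuration:

```python
# RBF-kernel MMD between reconstructed and target embedding batches; it
# matches distributions (preserving intra-modal distance structure) rather
# than individual points.
import torch

def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """x: [n, d] reconstructed embeddings, y: [m, d] target embeddings (float)."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)          # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```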
mFARM Towards Multi-Faceted Fairness Assessment based on HARMs in Clinical Decision Support
Authors: Shreyash Adappanavar, Krithi Shailya, Gokul S Krishnan, Sriraam Natarajan, Balaraman Ravindran
2025-09-02
The deployment of Large Language Models (LLMs) in high-stakes medical
settings poses a critical AI alignment challenge, as models can inherit and
amplify societal biases, leading to significant disparities. Existing fairness
evaluation methods fall short in these contexts as they typically use
simplistic metrics that overlook the multi-dimensional nature of medical harms.
This also promotes models that are fair only because they are clinically inert,
defaulting to safe but potentially inaccurate outputs. To address this gap, our
contributions are mainly two-fold: first, we construct two large-scale,
controlled benchmarks (ED-Triage and Opioid Analgesic Recommendation) from
MIMIC-IV, comprising over 50,000 prompts with twelve race x gender variants and
three context tiers. Second, we propose a multi-metric framework,
Multi-faceted Fairness Assessment based on hARMs (mFARM), to audit fairness
along three distinct dimensions of disparity (Allocational, Stability, and
Latent) and aggregate them into an overall mFARM score. We also present an
aggregated
Fairness-Accuracy Balance (FAB) score to benchmark and observe trade-offs
between fairness and prediction accuracy. We empirically evaluate four
open-source LLMs (Mistral-7B, BioMistral-7B, Qwen-2.5-7B, Bio-LLaMA3-8B) and
their finetuned versions under quantization and context variations. Our
findings showcase that the proposed metrics capture subtle biases more
effectively under various settings. We find that most models maintain robust
performance in terms of mFARM score across varying levels of quantization but
deteriorate significantly when the context is reduced. Our benchmarks and
evaluation code are publicly released to enhance research in aligned AI for
healthcare.
AHAMask Reliable Task Specification for Large Audio Language Models without Instructions
Authors: Yiwei Guo, Bohan Li, Hankun Wang, Zhihan Li, Shuai Wang, Xie Chen, Kai Yu
2025-09-01
Although current large audio language models (LALMs) extend text large
language models (LLMs) with generic acoustic understanding abilities, they
usually suffer from instruction sensitivity, where different instructions of
the same intention can yield drastically different outcomes. In this work, we
propose AHAMask, where we simply mask some of the attention heads in the
decoder-only backbone of LALMs to trigger specific acoustic task
functionalities without instructions. These masks are efficiently obtained by
training on an LALM, with the number of trainable parameters equal to the
attention head count in its decoder backbone. We show by experiments that
applying
such selective attention head masks achieves comparable or even better
performance than using instructions, either on single or composite tasks.
Besides achieving reliable acoustic task specification for LALMs, this also
reveals that LALMs exhibit certain "functional pathways" in their attention
heads.
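Masking attention heads in a decoder-only backbone can be sketched as a per-head gate applied before the output projection; this is a schematic of the mechanism, not the AHAMask training recipe:

```python
# Per-head gating: multiply each attention head's output by a {0,1} mask so
# masked heads contribute nothing to the layer's output projection.
import torch

def masked_attention_output(head_outputs: torch.Tensor,
                            head_mask: torch.Tensor) -> torch.Tensor:
    """head_outputs: [batch, heads, seq, head_dim]; head_mask: [heads] in {0,1}."""
    return head_outputs * head_mask.view(1, -1, 1, 1)   # zero out masked heads

# Training could optimize a relaxed mask (e.g., sigmoid of learnable logits)
# with the backbone frozen, then threshold it to {0,1} at inference.
```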
Communication-Aware Knowledge Distillation for Federated LLM Fine-Tuning over Wireless Networks
Authors: Xinlu Zhang, Na Yan, Yang Su, Yansha Deng, Toktam Mahmoodi
2025-09-01
Federated learning (FL) for large language models (LLMs) offers a
privacy-preserving scheme, enabling clients to collaboratively fine-tune
locally deployed LLMs or smaller language models (SLMs) without exchanging raw
data. While parameter-sharing methods in traditional FL solve a number of
technical challenges, they still incur high communication overhead and struggle
with adapting to heterogeneous model architectures. Federated distillation, a
framework for mutual knowledge transfer via shared logits, typically offers
lower communication overhead than parameter-sharing methods. However,
transmitting logits from LLMs remains challenging for bandwidth-limited clients
due to their high dimensionality. In this work, we focus on
communication-efficient federated distillation. To achieve this, we first
propose an adaptive Top-k logit selection mechanism, dynamically sparsifying
logits according to real-time communication conditions. Then, to tackle the
dimensional inconsistency introduced by the adaptive sparsification, we design
an adaptive logits aggregation scheme, effectively alleviating the artificial
and uninformative inputs introduced by conventional zero-padding methods.
Finally, to enhance the distillation effect, we incorporate a LoRA-adapted
hidden-layer projection from the LLM into the distillation loss, further
reducing communication overhead while providing richer representations.
Experimental results demonstrate that our scheme achieves superior performance
compared to baseline methods while effectively reducing communication overhead
by approximately 50%.
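Adaptive Top-k logit sparsification can be sketched as below; the mapping from channel budget to k is an assumption for illustration, not the paper's exact mechanism:

```python
# Transmit only the k largest logits (as index-value pairs) instead of the full
# vocabulary-sized vector, with k adapted to the available channel budget.
import torch

def sparsify_logits(logits: torch.Tensor, bandwidth_budget: float,
                    k_min: int = 8, k_max: int = 256):
    """logits: [vocab]; bandwidth_budget in [0, 1] from channel estimation."""
    k = int(k_min + bandwidth_budget * (k_max - k_min))   # adapt k to the channel
    vals, idx = torch.topk(logits, k)
    return idx, vals          # (index, value) pairs to transmit
```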
Preconditioned Regularized Wasserstein Proximal Sampling
Authors: Hong Ye Tan, Stanley Osher, Wuchen Li
2025-09-01
We consider sampling from a Gibbs distribution by evolving finitely many
particles. We propose a preconditioned version of a recently proposed
noise-free sampling method, governed by approximating the score function with
the numerically tractable score of a regularized Wasserstein proximal operator.
This is derived by a Cole--Hopf transformation on coupled anisotropic heat
equations, yielding a kernel formulation for the preconditioned regularized
Wasserstein proximal. The diffusion component of the proposed method is also
interpreted as a modified self-attention block, as in transformer
architectures. For quadratic potentials, we provide a discrete-time
non-asymptotic convergence analysis and explicitly characterize the bias, which
is dependent on the regularization and independent of the step size.
Experiments demonstrate particle-level stability on examples ranging from
log-concave and non-log-concave toy problems to Bayesian total-variation
regularized image deconvolution, as well as competitive or better performance
on non-convex Bayesian neural network training when utilizing variable
preconditioning matrices.
Q-Sched Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling
Authors: Natalia Frumkin, Diana Marculescu
2025-09-01
Text-to-image diffusion models are computationally intensive, often requiring
dozens of forward passes through large backbones. For instance,
Stable Diffusion XL generates high-quality images with 50 evaluations of a
2.6B-parameter model, an expensive process even for a single batch. Few-step
diffusion models reduce this cost to 2-8 denoising steps but still depend on
large, uncompressed U-Net or diffusion transformer backbones, which are often
too costly for full-precision inference without datacenter GPUs. These
requirements also limit existing post-training quantization methods that rely
on full-precision calibration. We introduce Q-Sched, a new paradigm for
post-training quantization that modifies the diffusion model scheduler rather
than model weights. By adjusting the few-step sampling trajectory, Q-Sched
achieves full-precision accuracy with a 4x reduction in model size. To learn
quantization-aware pre-conditioning coefficients, we propose the JAQ loss,
which combines text-image compatibility with an image quality metric for
fine-grained optimization. JAQ is reference-free and requires only a handful of
calibration prompts, avoiding full-precision inference during calibration.
Q-Sched delivers substantial gains: a 15.5% FID improvement over the FP16
4-step Latent Consistency Model and a 16.6% improvement over the FP16 8-step
Phased Consistency Model, showing that quantization and few-step distillation
are complementary for high-fidelity generation. A large-scale user study with
more than 80,000 annotations further confirms Q-Sched's effectiveness on both
FLUX.1[schnell] and SDXL-Turbo.
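Scheduler-side pre-conditioning can be sketched as a learned per-step coefficient applied inside a few-step sampler update; the update rule below is a simplified assumption for illustration, not Q-Sched's exact scheduler:

```python
# One schematic few-step sampler update where a learned coefficient c_t
# rescales the quantized model's noise prediction; model weights stay intact.
import torch

def qsched_step(x_t: torch.Tensor, eps_pred: torch.Tensor,
                sigma_t: float, sigma_next: float, c_t: float) -> torch.Tensor:
    """c_t is a learned quantization-aware pre-conditioning coefficient."""
    x0_hat = x_t - c_t * sigma_t * eps_pred    # coefficient corrects quantized eps
    return x0_hat + sigma_next * eps_pred      # re-noise toward the next level
```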