2025-09-05
Table of Contents
- PagedEviction Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference
- Psychologically Enhanced AI Agents
- Real Time FPGA Based Transformers & VLMs for Vision Tasks SOTA Designs and Optimizations
- MultiWikiQA A Reading Comprehension Benchmark in 300+ Languages
- Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue
- Meta-Policy Reflexion Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent
- MTQA Matrix of Thought for Enhanced Reasoning in Complex Question Answering
- A Multidimensional AI-powered Framework for Analyzing Tourist Perception in Historic Urban Quarters A Case Study in Shanghai
- Learning Neural Decoding with Parallelism and Self-Coordination for Quantum Error Correction
- SAMVAD A Multi-Agent System for Simulating Judicial Deliberation Dynamics in India
- RAGuard A Novel Approach for in-context Safe Retrieval Augmented Generation for LLMs
- Efficient Item ID Generation for Large-Scale LLM-based Recommendation
- OneCAT Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
- On Entropy Control in LLM-RL Algorithms
- Continuous Saudi Sign Language Recognition A Vision Transformer Approach
- Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing
- Adaptive KV-Cache Compression without Manually Setting Budget
- Handwriting Imagery EEG Classification based on Convolutional Neural Networks
- Binary Quantization For LLMs Through Dynamic Grouping
- FlashRecovery Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs
- Mycroft Tracing Dependencies in Collective Communication Towards Reliable LLM Training
- QNPU Quantum Network Processor Unit for Quantum Supercomputers
- The Transparent Earth A Multimodal Foundation Model for the Earth's Subsurface
- LExI Layer-Adaptive Active Experts for Efficient MoE Model Inference
- Planning with Reasoning using Vision Language World Model
- Lighting the Way for BRIGHT Reproducible Baselines with Anserini, Pyserini, and RankLLM
- LLM-Enhanced Space-Air-Ground-Sea Integrated Networks
- Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
- MoPEQ Mixture of Mixed Precision Quantized Experts
- Top-H Decoding Adapting the Creativity and Coherence with Bounded Entropy in Text Generation
- HydroGAT Distributed Heterogeneous Graph Attention Transformer for Spatiotemporal Flood Prediction
- MLP-Offload Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall
- An Efficient and Adaptive Watermark Detection System with Tile-based Error Correction
- Cache Management for Mixture-of-Experts LLMs -- extended version
- Upcycling Candidate Tokens of Large Language Models for Query Expansion
- AudioCodecBench A Comprehensive Benchmark for Audio Codec Evaluation
- FPGA-Based RoCEv2-RDMA Readout Electronics for the CTAO-LST Advanced Camera
- Dual-end Fluid Antennas For Robust Anti-jamming in Low-altitude Air-ground Communications
- Avoidance Decoding for Diverse Multi-Branch Story Generation
- AMBEDKAR-A Multi-level Bias Elimination through a Decoding Approach with Knowledge Augmentation for Robust Constitutional Alignment of Language Models
- FlexNGIA 2.0 Redesigning the Internet with Agentic AI -- Protocols, Services, and Traffic Engineering Designed, Deployed, and Managed by AI
- Batch Query Processing and Optimization for Agentic Workflows
- Reentrant superconductivity and superconductor-to-insulator transition in a naturally occurring Josephson junction array tuned by RF power
- Loop Quantum Vector-Tensor Gravity and Its Spherically Symmetric Model
- FireRedTTS-2 Towards Long Conversational Speech Generation for Podcast and Chatbot
- Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs
- mFARM Towards Multi-Faceted Fairness Assessment based on HARMs in Clinical Decision Support
- AHAMask Reliable Task Specification for Large Audio Language Models without Instructions
- Communication-Aware Knowledge Distillation for Federated LLM Fine-Tuning over Wireless Networks
- Preconditioned Regularized Wasserstein Proximal Sampling
- Q-Sched Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling
PagedEviction Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference
Authors: Krishna Teja Chitty-Venkata, Jie Ye, Xian-He Sun, Anthony Kougkas, Murali Emani, Venkatram Vishwanath, Bogdan Nicolae
2025-09-04
KV caching significantly improves the efficiency of Large Language Model (LLM) inference by storing attention states from previously processed tokens, enabling faster generation of subsequent tokens. However, as sequence length increases, the KV cache quickly becomes a major memory bottleneck. To address this, we propose PagedEviction, a novel fine-grained, structured eviction strategy that enhances the memory efficiency of vLLM's PagedAttention. Unlike existing approaches that rely on attention-based token importance or evict tokens across different vLLM pages, PagedEviction introduces an efficient block-wise eviction algorithm tailored for paged memory layouts. Our method integrates seamlessly with PagedAttention without requiring any modifications to its CUDA attention kernels. We evaluate PagedEviction across Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models on the LongBench benchmark suite, demonstrating improved memory usage with better accuracy than baselines on long-context tasks.
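To make the block-wise idea concrete, the following is a minimal sketch of structured eviction over a paged KV cache; the block size, the per-token scoring proxy, and all names here are illustrative assumptions, not the paper's actual algorithm or its vLLM integration.

    import numpy as np

    BLOCK_SIZE = 16  # tokens per KV-cache page (vLLM-style), assumed for illustration

    def blocks_to_keep(key_norms: np.ndarray, max_blocks: int) -> list:
        """Keep the max_blocks most important pages; evict the rest wholesale.

        key_norms: a per-token importance proxy (here: magnitude of each key).
        Returns indices of retained blocks in temporal order.
        """
        n_tokens = len(key_norms)
        n_blocks = (n_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE
        scores = [key_norms[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE].mean()
                  for b in range(n_blocks)]  # one score per page, not per token
        # Structured eviction: drop whole pages with the lowest scores, so the
        # paged layout stays intact and no attention-kernel changes are needed.
        return sorted(np.argsort(scores)[-max_blocks:])

    key_norms = np.abs(np.random.randn(80))  # toy scores for an 80-token sequence
    print(blocks_to_keep(key_norms, max_blocks=3))

Evicting at page granularity rather than per token is what lets such a scheme coexist with PagedAttention's memory layout unchanged.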
Psychologically Enhanced AI Agents
Authors: Maciej Besta, Shriram Chandran, Robert Gerstenberger, Mathis Lindner, Marcin Chrapek, Sebastian Hermann Martschat, Taraneh Ghandi, Patrick Iff, Hubert Niewiadomski, Piotr Nyczyk, Jürgen Müller, Torsten Hoefler
2025-09-04
We introduce MBTI-in-Thoughts, a framework for enhancing the effectiveness of Large Language Model (LLM) agents through psychologically grounded personality
conditioning. Drawing on the Myers-Briggs Type Indicator (MBTI), our method
primes agents with distinct personality archetypes via prompt engineering,
enabling control over behavior along two foundational axes of human psychology,
cognition and affect. We show that such personality priming yields consistent,
interpretable behavioral biases across diverse tasks: emotionally expressive
agents excel in narrative generation, while analytically primed agents adopt
more stable strategies in game-theoretic settings. Our framework supports experimenting with structured multi-agent communication protocols and reveals that self-reflection prior to interaction improves cooperation and reasoning
quality. To ensure trait persistence, we integrate the official 16Personalities
test for automated verification. While our focus is on MBTI, we show that our
approach generalizes seamlessly to other psychological frameworks such as Big
Five, HEXACO, or Enneagram. By bridging psychological theory and LLM behavior design, we establish a foundation for psychologically enhanced AI agents without any fine-tuning.
Real Time FPGA Based Transformers & VLMs for Vision Tasks SOTA Designs and Optimizations
Authors: Safa Mohammed Sali, Mahmoud Meribout, Ashiyana Abdul Majeed
2025-09-04
Transformers and vision-language models (VLMs) have emerged as dominant
architectures in computer vision and multimodal AI, offering state-of-the-art
performance in tasks such as image classification, object detection, visual
question answering, and caption generation. However, their high computational
complexity, large memory footprints, and irregular data access patterns present
significant challenges for deployment in latency- and power-constrained
environments. Field-programmable gate arrays (FPGAs) provide an attractive
hardware platform for such workloads due to their reconfigurability,
fine-grained parallelism, and potential for energy-efficient acceleration. This
paper presents a comprehensive review of design trade-offs, optimization
strategies, and implementation challenges for FPGA-based inference of
s and VLMs. We examine critical factors such as device-class
selection, memory subsystem constraints, dataflow orchestration,
strategies,
exploitation, and toolchain choices, alongside
modality-specific issues unique to VLMs, including heterogeneous compute
balancing and cross-attention memory management. Additionally, we discuss
emerging trends in hardware-algorithm co-design, highlighting innovations in
attention mechanisms, compression, and modular overlays to improve efficiency
and adaptability. Practical issues such as runtime flexibility, verification
overhead, and the absence of standardized FPGA multimodal benchmarks are also
considered. Finally, we outline future directions toward scalable, portable,
and reconfigurable FPGA solutions that adapt to evolving model architectures
while sustaining high utilization and predictable performance. This synthesis
offers both a technical foundation and a forward-looking perspective to help
bridge the gap between advanced multimodal AI models and efficient FPGA
deployment.
MultiWikiQA A Reading Comprehension Benchmark in 300+ Languages
Authors: Dan Saattrup Smart
2025-09-04
We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which
covers 306 languages. The context data comes from Wikipedia articles, with
questions generated by an LLM and the answers appearing verbatim in the
Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency
of the generated questions across 30 of the languages, providing evidence that
the questions are of good quality. We evaluate 6 different language models,
both decoder and encoder models of varying sizes, showing that the benchmark is
sufficiently difficult and that there is a large performance discrepancy
amongst the languages. The dataset and survey evaluations are freely available.
Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue
Authors: Keara Schaaij, Roel Boumans, Tibor Bosse, Iris Hendrickx
2025-09-04
Lexical alignment, where speakers start to use similar words across
conversation, is known to contribute to successful communication. However, its
implementation in conversational agents remains underexplored, particularly
considering the recent advancements in large language models (LLMs). As a first
step towards enabling lexical alignment in human-agent dialogue, this study
draws on strategies for personalising conversational agents and investigates
the construction of stable, personalised lexical profiles as a basis for
lexical alignment. Specifically, we varied the amounts of transcribed spoken
data used for construction as well as the number of items included in the
profiles per part-of-speech (POS) category and evaluated profile performance
across time using recall, coverage, and cosine similarity metrics. Smaller, more compact profiles, built from 10 min of transcribed speech and containing 5 items each for adjectives and conjunctions and 10 items each for adverbs, nouns, pronouns, and verbs, offered the best balance of performance and data efficiency. In conclusion, this study offers
practical insights into constructing stable, personalised lexical profiles,
taking into account minimal data requirements, serving as a foundational step
toward lexical alignment strategies in conversational agents.
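As a rough illustration of such a capped, per-POS profile in code (the caps follow the best-performing configuration above; tokenization and POS tagging are assumed to happen upstream, and all names are ours):

    from collections import Counter

    CAPS = {"ADJ": 5, "CCONJ": 5, "ADV": 10, "NOUN": 10, "PRON": 10, "VERB": 10}

    def build_profile(tagged_tokens):
        """tagged_tokens: iterable of (word, pos) pairs from ~10 min of speech."""
        counts = {pos: Counter() for pos in CAPS}
        for word, pos in tagged_tokens:
            if pos in counts:
                counts[pos][word.lower()] += 1
        # Keep only the most frequent items per POS category, up to its cap.
        return {pos: [w for w, _ in c.most_common(CAPS[pos])]
                for pos, c in counts.items()}

    tokens = [("really", "ADV"), ("nice", "ADJ"), ("dog", "NOUN"),
              ("dog", "NOUN"), ("walk", "VERB"), ("and", "CCONJ")]
    print(build_profile(tokens))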
Meta-Policy Reflexion Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent
Authors: Chunlong Wu, Zhibo Qu
2025-09-04
Large language model (LLM) agents achieve impressive single-task performance
but commonly exhibit repeated failures, inefficient exploration, and limited
cross-task adaptability. Existing reflective strategies (e.g., Reflexion,
ReAct) improve per-episode behavior but typically produce ephemeral,
task-specific traces that are not reused across tasks. Reinforcement-learning
based alternatives can produce transferable policies but require substantial
parameter updates and compute. In this work we introduce Meta-Policy Reflexion (MPR): a hybrid framework that consolidates LLM-generated reflections into a structured, predicate-like Meta-Policy Memory (MPM) and applies that memory at inference time through two complementary mechanisms: soft memory-guided decoding and hard rule admissibility checks (HAC). MPR (i) externalizes reusable
corrective knowledge without model weight updates, (ii) enforces domain
constraints to reduce unsafe or invalid actions, and (iii) retains the
adaptability of language-based reflection. We formalize the MPM representation, present algorithms for update and decoding, and validate the approach in a text-based agent environment following the experimental protocol described in the provided implementation (AlfWorld-based). Empirical results reported in the
supplied material indicate consistent gains in execution accuracy and
robustness when compared to Reflexion baselines; rule admissibility further
improves stability. We analyze mechanisms that explain these gains, discuss
scalability and failure modes, and outline future directions for multimodal and multi-agent extensions.
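A toy sketch of what a hard rule admissibility check (HAC) could look like; the rule format and matching logic are our assumptions for illustration, not the paper's exact design:

    from dataclasses import dataclass

    @dataclass
    class Rule:
        condition: str  # substring of the observation that triggers the rule
        forbidden: str  # substring of the action that the rule forbids

    # Reusable, predicate-like memory consolidated from past reflections.
    META_POLICY_MEMORY = [
        Rule(condition="drawer is locked", forbidden="open drawer"),
    ]

    def admissible(observation: str, action: str) -> bool:
        """Return False if any stored rule forbids this action in this state."""
        for rule in META_POLICY_MEMORY:
            if rule.condition in observation and rule.forbidden in action:
                return False
        return True

    obs = "You see a desk. The drawer is locked."
    for act in ["open drawer 1", "take key from desk"]:
        print(act, "->", "allowed" if admissible(obs, act) else "vetoed")

The point of the hard check is that unsafe or invalid actions are vetoed without any model weight update, complementing the softer memory-guided decoding.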
MTQA Matrix of Thought for Enhanced Reasoning in Complex Question Answering
Authors: Fengxiao Tang, Yufeng Li, Zongzong Wu, Ming Zhao
2025-09-04
Complex Question Answering (QA) is a fundamental and challenging task in NLP.
While large language models (LLMs) exhibit impressive performance in QA, they
suffer from significant performance degradation when facing complex and
abstract QA tasks due to insufficient reasoning capabilities. Works such as
Chain-of-Thought (CoT) and Tree-of-Thought (ToT) aim to enhance LLMs' reasoning
abilities, but they face issues such as in-layer redundancy in tree structures
and single paths in chain structures. Although some studies utilize
Retrieval-Augmented Generation (RAG) methods to assist LLMs in reasoning, the
challenge of effectively utilizing large amounts of information involving
multiple entities and hops remains critical. To address this, we propose the
Matrix of Thought (MoT), a novel and efficient LLM thought structure. MoT explores the problem in both horizontal and vertical dimensions through the "column-cell communication" mechanism, enabling LLMs to actively engage in multi-strategy and deep-level thinking, reducing redundancy within the column
cells and enhancing reasoning capabilities. Furthermore, we develop a
fact-correction mechanism by constructing knowledge units from retrieved
knowledge graph triples and raw text to enhance the initial knowledge for
reasoning and correct erroneous answers. This leads to the development of an
efficient and accurate QA framework (MTQA). Experimental results show that our
framework outperforms state-of-the-art methods on four widely-used datasets in
terms of F1 and EM scores, with reasoning time only 14.4% of that of the baseline
methods, demonstrating both its efficiency and accuracy. The code for this
framework is available at https://github.com/lyfiter/mtqa.
A Multidimensional AI-powered Framework for Analyzing Tourist Perception in Historic Urban Quarters A Case Study in Shanghai
Authors: Kaizhen Tan, Yufan Wu, Yuxuan Liu, Haoran Zeng
2025-09-04
Historic urban quarters play a vital role in preserving cultural heritage while serving as vibrant spaces for tourism and everyday life. Understanding
how tourists perceive these environments is essential for sustainable,
human-centered urban planning. This study proposes a multidimensional
AI-powered framework for analyzing tourist perception in historic urban
quarters using multimodal data from social media. Applied to twelve historic
quarters in central Shanghai, the framework integrates focal point extraction,
color theme analysis, and sentiment mining. Visual focus areas are identified
from tourist-shared photos using a fine-tuned semantic segmentation model. To
assess aesthetic preferences, dominant colors are extracted using a clustering
method, and their spatial distribution across quarters is analyzed. Color
themes are further compared between social media photos and real-world street
views, revealing notable shifts. This divergence highlights potential gaps
between visual expectations and the built environment, reflecting both
stylistic preferences and perceptual bias. Tourist reviews are evaluated
through a hybrid sentiment analysis approach combining a rule-based method and
a multi-task BERT model. Satisfaction is assessed across four dimensions:
tourist activities, built environment, service facilities, and business
formats. The results reveal spatial variations in aesthetic appeal and
emotional response. Rather than focusing on a single technical innovation, this framework offers an integrated, data-driven approach to decoding tourist
perception and contributes to informed decision-making in tourism, heritage
conservation, and the design of aesthetically engaging public spaces.
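As a rough sketch of the clustering-based dominant-color step described above (the study does not specify the algorithm or parameters here, so k-means over RGB pixels is an assumption):

    import numpy as np
    from sklearn.cluster import KMeans

    def dominant_colors(image_rgb: np.ndarray, k: int = 5):
        """image_rgb: (H, W, 3) uint8 array. Returns k color themes with weights."""
        pixels = image_rgb.reshape(-1, 3).astype(np.float32)
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
        # Weight of each theme = fraction of pixels assigned to its cluster.
        weights = np.bincount(km.labels_, minlength=k) / len(pixels)
        order = np.argsort(weights)[::-1]  # most prevalent theme first
        return km.cluster_centers_[order].astype(np.uint8), weights[order]

    toy = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
    colors, weights = dominant_colors(toy, k=3)
    print(colors, weights.round(3))

Comparing the theme vectors extracted from tourist photos against those from street-view imagery is one way to quantify the divergence the framework reports.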
Learning Neural Decoding with Parallelism and Self-Coordination for Quantum Error Correction
Authors: Kai Zhang, Situ Wang, Linghang Kong, Fang Zhang, Zhengfeng Ji, Jianxin Chen
2025-09-04
Fast, reliable decoders are pivotal components for enabling fault-tolerant quantum computation. Neural network decoders like AlphaQubit have demonstrated significant potential, achieving higher accuracy than traditional human-designed decoding algorithms. However, existing implementations of neural network decoders lack the parallelism required to decode the syndrome stream generated by a superconducting logical qubit in real time. Moreover, integrating AlphaQubit with sliding window-based parallel decoding schemes presents non-trivial challenges: AlphaQubit is trained solely to output a single bit corresponding to the global logical correction for an entire memory experiment, rather than local physical corrections that can be easily integrated.
We address this issue by training a recurrent, transformer-based neural network specifically tailored for sliding-window decoding. While our network still outputs a single bit per window, we derive training labels from a consistent set of local corrections and train on various types of decoding windows simultaneously. This approach enables the network to self-coordinate across neighboring windows, facilitating high-accuracy parallel decoding of arbitrarily long memory experiments. As a result, we resolve the throughput limitation that previously prohibited the application of AlphaQubit-type decoders in fault-tolerant quantum computation.
SAMVAD A Multi-Agent System for Simulating Judicial Deliberation Dynamics in India
Authors: Prathamesh Devadiga, Omkaar Jayadev Shetty, Pooja Agarwal
2025-09-04
Understanding the complexities of judicial deliberation is crucial for
assessing the efficacy and fairness of a justice system. However, empirical
studies of judicial panels are constrained by significant ethical and practical
barriers. This paper introduces SAMVAD, an innovative Multi-Agent System (MAS)
designed to simulate the deliberation process within the framework of the
Indian justice system.
Our system comprises agents representing key judicial roles: a Judge, a
Prosecution Counsel, a Defense Counsel, and multiple Adjudicators (simulating a
judicial bench), all powered by large language models (LLMs). A primary
contribution of this work is the integration of Retrieval-Augmented Generation
(RAG), grounded in a domain-specific knowledge base of landmark Indian legal
documents, including the Indian Penal Code and the Constitution of India. This
RAG functionality enables the Judge and Counsel agents to generate legally
sound instructions and arguments, complete with source citations, thereby
enhancing both the fidelity and transparency of the simulation.
The Adjudicator agents engage in iterative deliberation rounds, processing
case facts, legal instructions, and arguments to reach a consensus-based
verdict. We detail the system architecture, agent communication protocols, the
RAG pipeline, the simulation workflow, and a comprehensive evaluation plan
designed to assess performance, deliberation quality, and outcome consistency.
This work provides a configurable and explainable MAS platform for exploring
legal reasoning and group decision-making dynamics in judicial simulations,
specifically tailored to the Indian legal context and augmented with verifiable
legal grounding via RAG.
RAGuard A Novel Approach for in-context Safe Retrieval Augmented Generation for LLMs
Authors: Connor Walker, Koorosh Aslansefat, Mohammad Naveed Akram, Yiannis Papadopoulos
2025-09-03
Accuracy and safety are paramount in Offshore Wind (OSW) maintenance, yet
conventional Large Language Models (LLMs) often fail when confronted with highly specialised or unexpected scenarios. We introduce RAGuard, an enhanced Retrieval-Augmented Generation (RAG) framework that explicitly integrates safety-critical documents alongside technical manuals. By issuing parallel
queries to two indices and allocating separate retrieval budgets for knowledge
and safety, RAGuard guarantees both technical depth and safety coverage. We
further develop a SafetyClamp extension that fetches a larger candidate pool,
"hard-clamping" exact slot guarantees to safety. We evaluate across
(BM25), dense (Dense Passage Retrieval) and hybrid retrieval paradigms,
measuring Technical Recall@K and Safety Recall@K. Both proposed extensions of
RAG show an increase in Safety Recall@K from almost 0% in RAG to more than 50% in RAGuard, while maintaining Technical Recall above 60%. These results
demonstrate that RAGuard and SafetyClamp have the potential to establish a new
standard for integrating safety assurance into LLM-powered decision support in
critical maintenance contexts.
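A schematic sketch of the dual-index, separate-budget retrieval and the SafetyClamp slot guarantee; the retriever internals are stubbed and the budget values are illustrative assumptions:

    def retrieve(index, query, k):
        """Stub retriever: return the top-k passages from one index
        (BM25, dense, or hybrid in the paper's experiments)."""
        return index[:k]

    def raguard(tech_index, safety_index, query, k_tech=6, k_safety=4):
        # Parallel queries with separate retrieval budgets, so safety
        # documents cannot be crowded out by technical matches.
        return (retrieve(tech_index, query, k_tech)
                + retrieve(safety_index, query, k_safety))

    def safety_clamp(tech_index, safety_index, query, k=10, safety_slots=4):
        # Fetch a larger candidate pool, then hard-clamp exact slots to safety.
        safety = retrieve(safety_index, query, 2 * safety_slots)[:safety_slots]
        tech = retrieve(tech_index, query, 2 * k)[:k - len(safety)]
        return safety + tech

    tech = [f"manual-{i}" for i in range(20)]
    safe = [f"safety-{i}" for i in range(20)]
    print(raguard(tech, safe, "blade bearing inspection"))
    print(safety_clamp(tech, safe, "blade bearing inspection"))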
Efficient Item ID Generation for Large-Scale LLM-based Recommendation
Authors: Anushya Subbiah, Vikram Aggarwal, James Pine, Steffen Rendle, Krishna Sayana, Kun Su
2025-09-03
Integrating product catalogs and user behavior into LLMs can enhance
recommendations with broad world knowledge, but the scale of real-world item
catalogs, often containing millions of discrete item identifiers (Item IDs),
poses a significant challenge. This contrasts with the smaller, tokenized text vocabularies typically used in LLMs. The predominant view within the LLM-based recommendation literature is that it is infeasible to treat item IDs as first-class citizens in the LLM, and that instead some sort of tokenization of an item into multiple tokens is required. However, this creates a key practical bottleneck in serving these models for real-time low-latency applications.
Our paper challenges this predominant practice and integrates item IDs as first-class citizens into the LLM. We provide simple, yet highly effective, novel training and inference modifications that enable single-token representations of items and single-step decoding. Our method shows
improvements in recommendation quality (Recall and NDCG) over existing
techniques on the Amazon shopping datasets while significantly improving
inference efficiency by 5x-14x. Our work offers an efficiency perspective
distinct from that of other popular approaches within LLM-based recommendation, potentially inspiring further research and opening up a new direction for integrating IDs into LLMs. Our code is available here:
https://drive.google.com/file/d/1cUMj37rV0Z1bCWMdhQ6i4q4eTRQLURtC
OneCAT Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
Authors: Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, Hongkai Xiong
2025-09-03
We introduce OneCAT, a unified multimodal model that seamlessly integrates
understanding, generation, and editing within a novel, pure decoder-only architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViT) or a vision tokenizer
during inference, leading to significant efficiency gains, especially for
high-resolution inputs. This is achieved through a modality-specific
Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR)
objective, which also natively supports dynamic resolutions. Furthermore, we
pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based
methods while maintaining state-of-the-art performance. Our findings
demonstrate the powerful potential of pure autoregressive modeling as a
sufficient and elegant foundation for unified multimodal intelligence. As a
result, OneCAT sets a new performance standard, outperforming existing
open-source unified multimodal models across benchmarks for multimodal
generation, editing, and understanding.
On Entropy Control in LLM-RL Algorithms
Authors: Han Shen
2025-09-03
For RL algorithms, appropriate entropy control is crucial to their
effectiveness. To control the policy entropy, a commonly used method is entropy
regularization, which is adopted in various popular RL algorithms including
PPO, SAC and A3C. Although entropy regularization has conventionally proved effective in robotics and games RL, studies have found that it gives weak to no gains in LLM-RL training. In this work, we study the issues of the entropy bonus in the LLM-RL setting. Specifically, we first argue that conventional entropy regularization suffers from the LLM's extremely large response space and the sparsity of the optimal outputs. As a remedy, we propose AEnt, an entropy
control method that utilizes a new clamped entropy bonus with an automatically
adjusted coefficient. The clamped entropy is evaluated with the re-normalized
policy defined on certain smaller token space, which encourages exploration
within a more compact response set. In addition, the algorithm automatically
adjusts entropy coefficient according to the clamped entropy value, effectively
controlling the entropy-induced bias while leveraging the entropy's benefits.
AEnt is tested in math-reasoning tasks under different base models and
datasets, and it is observed that AEnt outperforms the baselines consistently
across multiple benchmarks.
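A toy sketch in the spirit of the clamped entropy bonus: entropy is computed on the policy re-normalized over a smaller token set, clamped, and its coefficient adapted toward a target. The top-k restriction, clamp value, and adjustment rule are all illustrative assumptions, not AEnt's exact formulas.

    import torch

    def clamped_entropy(logits: torch.Tensor, top_k: int = 64,
                        clamp_max: float = 2.0) -> torch.Tensor:
        """Entropy of the policy re-normalized over its top-k tokens, clamped."""
        top_logits, _ = torch.topk(logits, k=top_k, dim=-1)
        probs = torch.softmax(top_logits, dim=-1)          # re-normalized policy
        ent = -(probs * torch.log(probs + 1e-9)).sum(-1)   # per-position entropy
        return ent.clamp(max=clamp_max).mean()

    def update_coef(coef: float, ent_value: float, target: float = 1.0,
                    lr: float = 0.01) -> float:
        # Raise the bonus when entropy falls below target, lower it when above,
        # keeping the entropy-induced bias under control automatically.
        return max(0.0, coef + lr * (target - ent_value))

    logits = torch.randn(8, 32000)             # (batch, vocab) toy logits
    ent = clamped_entropy(logits)
    coef = update_coef(0.01, ent.item())
    entropy_bonus = coef * ent                 # added to the RL objective
    print(float(ent), coef)

Restricting the entropy to a compact token set is what keeps the bonus from rewarding exploration over the model's enormous, mostly-irrelevant response space.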
Continuous Saudi Sign Language Recognition A Vision Transformer Approach
Authors: Soukeina Elhassen, Lama Al Khuzayem, Areej Alhothali, Ohoud Alzamzami, Nahed Alowaidi
2025-09-03
Sign language (SL) is an essential form of communication for hearing-impaired
and deaf people, enabling engagement within the broader society. Despite its
significance, limited public awareness of SL often leads to inequitable access
to educational and professional opportunities, thereby contributing to social
exclusion, particularly in Saudi Arabia, where over 84,000 individuals depend on Saudi Sign Language (SSL) as their primary form of communication. Although certain technological approaches have helped to improve communication for individuals with hearing impairments, there continues to be an urgent
individuals with hearing impairments, there continues to be an urgent
requirement for more precise and dependable translation techniques, especially
for Arabic sign language variants like SSL. Most state-of-the-art solutions
have primarily focused on non-Arabic sign languages, resulting in a
considerable absence of resources dedicated to Arabic sign language,
specifically SSL. The complexity of the Arabic language and the prevalence of isolated sign language datasets that concentrate on individual words instead of continuous sentences contribute to this issue. To address this gap and take an important step in developing SSL resources, we introduce the first continuous Saudi Sign Language dataset, called KAU-CSSL, focusing on complete sentences to facilitate further research and enable sophisticated recognition systems for SSL recognition and translation.
Additionally, we propose a transformer-based model, utilizing a pretrained ResNet-18 for spatial feature extraction and a Transformer Encoder with Bidirectional LSTM for temporal dependencies, achieving 99.02% accuracy in signer-dependent mode and 77.71% accuracy in signer-independent mode. This development paves the way not only for improving communication tools for the SSL community but also for making a substantial contribution to the wider field of sign language research.
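A condensed sketch of the described architecture (per-frame ResNet-18 features, a Transformer encoder plus bidirectional LSTM over time, and a sentence-level head); hidden sizes, layer counts, and input resolution are illustrative assumptions, not the paper's settings.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class CSSLRecognizer(nn.Module):
        def __init__(self, num_classes: int, d_model: int = 512):
            super().__init__()
            # The paper uses an ImageNet-pretrained ResNet-18; weights=None
            # keeps this sketch runnable offline.
            backbone = resnet18(weights=None)
            self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
            enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                                   batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
            self.lstm = nn.LSTM(d_model, d_model // 2, batch_first=True,
                                bidirectional=True)
            self.head = nn.Linear(d_model, num_classes)

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, time, 3, H, W) video clips
            b, t = frames.shape[:2]
            feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (b*t, 512)
            feats = feats.view(b, t, -1)
            feats = self.encoder(feats)          # temporal self-attention
            feats, _ = self.lstm(feats)          # bidirectional recurrence
            return self.head(feats.mean(dim=1))  # sentence-level logits

    model = CSSLRecognizer(num_classes=50)
    print(model(torch.randn(2, 8, 3, 112, 112)).shape)  # torch.Size([2, 50])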
Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing
Authors: Rui Xie, Asad Ul Haq, Linsen Ma, Yunhua Fang, Zirak Burzin Engineer, Liu Liu, Tong Zhang
2025-09-03
Large language model (LLM) inference is bottlenecked by the limited bandwidth
of CXL-based memory used for capacity expansion. We introduce CXL-NDP, a
transparent near-data processing architecture that amplifies effective CXL
bandwidth without requiring changes to the CXL.mem interface or AI models.
CXL-NDP integrates a precision-scalable bit-plane layout for dynamic quantization with transparent lossless compression of weights and KV caches directly within the CXL device. In end-to-end serving, CXL-NDP improves throughput by 43%, extends the maximum context length by 87%, and reduces the KV cache footprint by 46.9% without accuracy loss. Hardware synthesis confirms
its practicality with a modest silicon footprint, lowering the barrier for
adopting efficient, scalable CXL-based memory in generative AI infrastructure.
Adaptive KV-Cache Compression without Manually Setting Budget
Authors: Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang
2025-09-03
Large language model (LLM) inference relies heavily on KV-caches to accelerate autoregressive decoding, but the resulting memory footprint grows rapidly with sequence length, posing significant efficiency challenges. Current KV-cache compression methods suffer from a Procrustes' bed problem: they force diverse workloads into fixed compression ratios, leading to suboptimal resource allocation and inference performance. To this end, we present GVote, an adaptive KV-cache compression scheme that eliminates manual budget specification while achieving superior accuracy-efficiency trade-offs. GVote operates on the principle that the important keys are the aggregation of keys required by future queries. The method predicts future query attention demands by Monte-Carlo-style sampling of potential queries and aggregates the selected keys to determine the optimal KV-cache budget without manual specification. Experimental evaluation demonstrates GVote's effectiveness across multiple benchmarks, including GSM8K, RULER and LongBench. Compared to baselines, GVote achieves a 2x memory reduction while maintaining higher or comparable accuracy.
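A toy sketch of the voting idea: sample plausible future queries, let each vote for the keys it attends to most, and keep the union, so the budget emerges from the votes rather than being set by hand. The Gaussian sampling distribution and per-query top-k rule are illustrative assumptions.

    import torch

    def gvote_keep_set(keys: torch.Tensor, recent_queries: torch.Tensor,
                       n_samples: int = 16, top_k: int = 8) -> torch.Tensor:
        """keys: (T, d) cached keys; recent_queries: (m, d) observed queries."""
        mu = recent_queries.mean(0)
        std = recent_queries.std(0) + 1e-6
        keep = torch.zeros(keys.shape[0], dtype=torch.bool)
        for _ in range(n_samples):
            q = mu + std * torch.randn_like(mu)       # Monte-Carlo future query
            scores = keys @ q                         # attention logits for q
            keep[torch.topk(scores, k=top_k).indices] = True  # this query's vote
        return keep  # union of votes = adaptive, budget-free keep set

    keys = torch.randn(128, 64)
    queries = torch.randn(4, 64)
    mask = gvote_keep_set(keys, queries)
    print(int(mask.sum()), "of", len(mask), "keys retained")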
Handwriting Imagery EEG Classification based on Convolutional Neural Networks
Authors: Hao Yang, Guang Ouyang
2025-09-03
Handwriting imagery has emerged as a promising paradigm for brain-computer
interfaces (BCIs) aimed at translating brain activity into text output.
Compared with invasively recorded electroencephalography (EEG), non-invasive
recording offers a more practical and feasible approach to capturing brain
signals for BCI. This study explores the limit of decoding non-invasive EEG associated with handwriting imagery into English letters using deep neural
networks. To this end, five participants were instructed to imagine writing the
26 English letters with their EEG being recorded from the scalp. A measurement
of EEG similarity across letters was conducted to investigate letter-specific
patterns in the dataset. Subsequently, four convolutional neural network (CNN)
models were trained for EEG classification. Descriptively, the EEG data clearly exhibited letter-specific patterns, serving as a proof-of-concept for EEG-to-text translation. Against a chance-level accuracy of 3.85%, the CNN classifiers trained on each participant reached a ceiling of around 20%. This study marks the first attempt to decode
non-invasive EEG associated
with handwriting imagery. Although the achieved accuracy is not sufficient for
a usable brain-to-text BCI, the model's performance is noteworthy in revealing
the potential for translating non-invasively recorded brain signals into text
outputs and establishing a baseline for future research.
Binary Quantization For LLMs Through Dynamic Grouping
Authors: Xinzhe Zheng, Zhen-Qun Yang, Haoran Xie, S. Joe Qin, Arlene Chen, Fangzhen Lin
2025-09-03
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of Natural Language Processing (NLP) tasks, but require substantial memory and computational resources. Binary quantization, which compresses model weights from 16-bit Brain Float to 1-bit representations in {-1, 1}, offers significant reductions in storage and inference costs. However, such aggressive quantization often leads to notable performance degradation compared to more conservative 4-bit quantization methods. In this research, we propose a novel optimization objective tailored for binary quantization, along with three algorithms designed to realize it effectively. Our method enhances blocked quantization by dynamically identifying optimal unstructured sub-matrices through adaptive grouping strategies. Experimental results demonstrate that our approach achieves an average bit length of just 1.007 bits, while maintaining high model quality. Specifically, our quantized LLaMA 3.2 3B model attains a perplexity of 8.23, remarkably close to the original 7.81, and far surpasses the previous SOTA BiLLM, which attains a perplexity of 123.90. Furthermore, our method is competitive with SOTA 4-bit approaches such as GPTQ in both performance and efficiency. The compression process is highly efficient, requiring only 14 seconds to quantize the full LLaMA 3.2 3B weights on a single CPU core, with the entire process completing in under 100 minutes and exhibiting embarrassingly parallel properties.
Code - https://github.com/johnnyzheng0636/WGM_bi_quan
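For intuition, a minimal sketch of 1-bit weight quantization with per-group scales: each group is stored as sign bits plus one scale (the mean absolute value minimizes L2 error for sign quantization). The dynamic grouping search is the paper's contribution and is not reproduced here; fixed-size groups are an assumption.

    import numpy as np

    def binarize(weights: np.ndarray, group_size: int = 128):
        w = weights.reshape(-1, group_size)
        scales = np.abs(w).mean(axis=1, keepdims=True)  # one fp scale per group
        signs = np.where(w >= 0, 1.0, -1.0)             # the 1-bit payload
        return signs, scales

    def dequantize(signs, scales):
        return signs * scales

    w = np.random.randn(1024).astype(np.float32)
    signs, scales = binarize(w)
    err = np.mean((dequantize(signs, scales).ravel() - w) ** 2)
    print(f"reconstruction RMSE: {np.sqrt(err):.4f}")
    # Average storage with fixed groups: 1 bit/weight + 16 bits per 128 weights.
    print("bits/weight:", 1 + 16 / 128)  # 1.125; dynamic grouping aims lower
    # (the paper reports ~1.007 bits on average).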
FlashRecovery Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs
Authors: Haijun Zhang, Jinxiang Wang, Zhenhua Yu, Yanyong Zhang, Xuejie Ji, Kaining Mao, Jun Zhang, Yaqing Zhang, Ting Wu, Fei Jie, Xiemin Huang, Zhifang Cai, Junhua Cheng, Shuwei Wang, Wei Li, Xiaoming Bao, Hua Xu, Shixiong Zhao, Jun Li, Hongwei Sun, Ziyang Zhang, Yi Xiong, Chunsheng Li
2025-09-03
Large language models (LLMs) have made a profound impact across various
fields due to their advanced capabilities. However, training these models at
unprecedented scales requires extensive AI accelerator clusters and
sophisticated parallelism strategies, which pose significant challenges in
maintaining system reliability over prolonged training periods. A major concern
is the substantial loss of training time caused by inevitable hardware and
software failures. To address these challenges, we present FlashRecovery, a
fast and low-cost failure recovery system comprising three core modules: (1)
Active and real-time failure detection. This module performs continuous
training state monitoring, enabling immediate identification of hardware and
software failures within seconds, thus ensuring rapid incident response; (2)
Scale-independent task restart. By employing different recovery strategies for
normal and faulty nodes, combined with an optimized communication group reconstruction protocol, our approach ensures that the recovery time remains
nearly constant, regardless of cluster scale; (3) Checkpoint-free recovery
within one step. Our novel recovery mechanism enables single-step restoration,
completely eliminating dependence on traditional checkpointing methods and
their associated overhead. Collectively, these innovations enable FlashRecovery
to achieve optimal Recovery Time Objective (RTO) and Recovery Point Objective
(RPO), substantially improving the reliability and efficiency of long-duration
training. Experimental results demonstrate that FlashRecovery can achieve training restoration on a training cluster with 4,800 devices in 150
seconds. We also verify that the time required for failure recovery is nearly
consistent for different scales of training tasks.
Mycroft Tracing Dependencies in Collective Communication Towards Reliable LLM Training
Authors: Yangtao Deng, Lei Zhang, Qinlong Wang, Xiaoyun Zhi, Xinlei Zhang, Zhuo Jiang, Haohan Xu, Lei Wang, Zuquan Song, Gaohong Liu, Yang Bai, Shuguang Wang, Wencong Xiao, Jianxi Ye, Minlan Yu, Hong Xu
2025-09-03
Reliability is essential for ensuring efficiency in LLM training. However, many real-world reliability issues remain difficult to resolve, resulting in wasted resources and degraded model performance. Unfortunately, today's collective communication libraries operate as black boxes, hiding critical information needed for effective root cause analysis. We propose Mycroft, a lightweight distributed tracing and root cause analysis system designed to address previously hidden reliability issues in collective communication. Mycroft's key idea is to trace collective communication states and leverage internal control and data dependencies to resolve reliability problems in LLM training. Mycroft has been deployed at ByteDance for over six months to debug collective communication related issues at runtime. It detected anomalies
within 15 seconds in 90% of cases and identified the root cause within 20
seconds in 60% of cases. We also conducted extensive fault injection
experiments to demonstrate Mycroft's capability and efficiency.
QNPU Quantum Network Processor Unit for Quantum Supercomputers
Authors: Peiyi Li, Chenxu Liu, Ji Liu, Huiyang Zhou, Ang Li
2025-09-02
As quantum computing progresses, the need for scalable solutions to address
large-scale computational problems has become critical. Quantum supercomputers
are the next upcoming frontier by enabling multiple quantum processors to
collaborate effectively to solve large-scale computational problems. The
emergence of quantum supercomputers necessitates an efficient interface to
manage the quantum communication protocols between quantum processors. In this paper, we propose the Quantum Network Processing Unit (QNPU), which enables quantum applications to efficiently scale beyond the capacity of individual quantum processors, serving as a critical building block for future quantum supercomputers. The QNPU works alongside the Quantum Processing Unit (QPU) in our decoupled processing units architecture, where the QPU handles local quantum operations while the QNPU manages quantum communication between nodes. We design a comprehensive instruction set architecture (ISA) for the QNPU with high-level communication protocol abstractions, implemented via micro-operations that manage EPR resources, quantum operations, and classical communication. To facilitate programming, we introduce DistQASM, which extends OpenQASM with distributed quantum operations. We then propose a microarchitecture featuring both scalar and superscalar QNPU designs to enhance performance for communication-intensive quantum workloads. Finally, we evaluate the performance of our proposed QNPU design with distributed quantum workloads and demonstrate that the QNPU significantly improves the efficiency of communication between quantum nodes, paving the way for quantum supercomputing.
The Transparent Earth A Multimodal Foundation Model for the Earth's Subsurface
Authors: Arnab Mazumder, Javier E. Santos, Noah Hobbs, Mohamed Mehana, Daniel O'Malley
2025-09-02
We present the Transparent Earth, a transformer-based architecture for reconstructing subsurface properties from heterogeneous datasets that vary in sparsity, resolution, and modality, where each modality represents a distinct
type of observation (e.g., stress angle, mantle temperature, tectonic plate
type). The model incorporates positional encodings of observations together
with modality encodings, derived from a text embedding model applied to a
description of each modality. This design enables the model to scale to an
arbitrary number of modalities, making it straightforward to add new ones not
considered in the initial design. We currently include eight modalities
spanning directional angles, categorical classes, and continuous properties
such as temperature and thickness. These capabilities support in-context
learning, enabling the model to generate predictions either with no inputs or
with an arbitrary number of additional observations from any subset of
modalities. On validation data, this reduces errors in predicting stress angle
by more than a factor of three. The proposed architecture is scalable and
demonstrates improved performance with increased parameters. Together, these
advances make the Transparent Earth an initial foundation model for the Earth's
subsurface that ultimately aims to predict any subsurface property anywhere on
Earth.
LExI Layer-Adaptive Active Experts for Efficient MoE Model Inference
Authors: Krishna Teja Chitty-Venkata, Sandeep Madireddy, Murali Emani, Venkatram Vishwanath
2025-09-02
Mixture-of-Experts (MoE) models scale efficiently by activating only a subset of experts per token, offering a computationally sparse alternative to dense architectures. While prior post-training optimizations, such as inter- and intra-expert pruning, reduce memory usage, they provide limited gains in inference-time compute efficiency. Moreover, existing MoE architectures typically activate a fixed number of experts uniformly across all layers, resulting in redundant computation and suboptimal performance. In this work, we first demonstrate that MoE pruning strategies improve only the memory footprint but do not significantly improve inference performance on GPUs using optimized serving frameworks such as vLLM. To address this, we introduce LExI, a data-free optimization technique that determines the optimal number of active experts per layer in a pretrained MoE model. LExI leverages only the model weights to estimate the relative importance of each layer and adaptively assigns the number of active experts per layer accordingly. Experiments on state-of-the-art language and vision MoE benchmarks demonstrate that LExI significantly outperforms traditional MoE pruning approaches in terms of inference efficiency with negligible accuracy loss. For example, using LExI, Qwen1.5-MoE achieves the same throughput on an Nvidia H100 GPU with 10% better accuracy than traditional expert pruning.
Planning with Reasoning using Vision Language World Model
Authors: Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, Pascale Fung
2025-09-02
Effective planning requires strong world models, but high-level world models
that can understand and reason about actions with semantic and temporal
abstraction remain largely underdeveloped. We introduce the Vision Language
World Model (VLWM), a foundation model trained for language-based world
modeling on natural videos. Given visual observations, the VLWM first infers overall goal achievements and then predicts a trajectory composed of interleaved actions and world state changes. These targets are extracted by
iterative Self-Refine conditioned on compressed future observations
represented by a Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitate reactive system-1 plan decoding and reflective system-2 planning via cost minimization. The cost evaluates the
semantic distance between the hypothetical future states given by VLWM
roll-outs and the expected goal state, and is measured by a critic model that
we trained in a self-supervised manner. The VLWM achieves state-of-the-art
Visual Planning for Assistance (VPA) performance on both benchmark evaluations
and our proposed PlannerArena human evaluations, where system-2 improves the
Elo score by +27% over system-1. The VLWM also outperforms strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
Lighting the Way for BRIGHT Reproducible Baselines with Anserini, Pyserini, and RankLLM
Authors: Yijun Ge, Sahel Sharifymoghaddam, Jimmy Lin
2025-09-02
The BRIGHT benchmark is a dataset consisting of reasoning-intensive queries
over diverse domains. We explore retrieval results on BRIGHT using a range of
retrieval techniques, including sparse, dense, and fusion methods, and
establish reproducible baselines. We then apply listwise reranking with large
language models (LLMs) to further investigate the impact of reranking on reasoning-intensive queries. These baselines are integrated into the popular retrieval and reranking toolkits Anserini, Pyserini, and RankLLM, with
two-click reproducibility that makes them easy to build upon and convenient for
further development. While attempting to reproduce the results reported in the
original BRIGHT paper, we find that the provided BM25 scores differ notably
from those that we obtain using Anserini and Pyserini. We discover that this
difference is due to BRIGHT's implementation of BM25, which applies BM25 on the
query rather than using the standard bag-of-words approach, as in Anserini, to
construct query vectors. This difference has become increasingly relevant due
to the rise of longer queries, with BRIGHT's lengthy reasoning-intensive
queries being a prime example, and further accentuated by the increasing usage
of retrieval-augmented generation, where LLM prompts can grow to be much longer than "traditional" search engine queries. Our observation suggests that it
may be time to reconsider BM25 approaches going forward in order to better
accommodate emerging applications. To facilitate this, we integrate query-side
BM25 into both Anserini and Pyserini.
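A small sketch of the distinction found here: standard BM25 uses raw bag-of-words counts as query-term weights, while a query-side variant applies BM25-style term saturation to the query itself, which matters for long reasoning-intensive queries. The formulas are the usual BM25; the corpus statistics below are stubbed for illustration.

    import math
    from collections import Counter

    K1, B = 0.9, 0.4  # common Lucene-style defaults, assumed

    def idf(term, n_docs, df):
        return math.log(1 + (n_docs - df.get(term, 0) + 0.5) / (df.get(term, 0) + 0.5))

    def term_weight(tf, length, avg_len):
        return tf * (K1 + 1) / (tf + K1 * (1 - B + B * length / avg_len))

    def score(query_tokens, doc_tf, doc_len, avg_len, n_docs, df, query_side=False):
        q_tf = Counter(query_tokens)
        s = 0.0
        for term, qtf in q_tf.items():
            if term not in doc_tf:
                continue
            # Standard: query weight is the raw count. Query-side: saturate it too.
            q_w = term_weight(qtf, len(query_tokens), avg_len) if query_side else qtf
            s += idf(term, n_docs, df) * q_w * term_weight(doc_tf[term], doc_len, avg_len)
        return s

    doc = Counter("bm25 ranks documents by term frequency".split())
    long_query = "why does bm25 bm25 bm25 term frequency saturation matter".split()
    stats = (doc, sum(doc.values()), 10.0, 1000,
             {"bm25": 200, "term": 300, "frequency": 250})
    print(score(long_query, *stats), score(long_query, *stats, query_side=True))

With raw counts, repeated query terms scale the score linearly; the query-side variant saturates them, which is one plausible reading of why the two implementations diverge on long queries.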
LLM-Enhanced Space-Air-Ground-Sea Integrated Networks
Authors: Halvin Yang, Sangarapillai Lambotharan, Mahsa Derakhshani, Lajos Hanzo
2025-09-02
The space-air-ground-sea integrated networking (SAGSIN) concept promises
seamless global multimedia connectivity, yet two obstacles still limit its
practical deployment. Firstly, high-velocity satellites, aerial relays and
sea-surface platforms suffer from obsolete channel state information (CSI),
undermining feedback-based adaptation. Secondly, data-rate disparity across the
protocol stack is extreme: terabit optical links in space coexist with kilobit
acoustic under-water links. This article shows that a single large language
model (LLM) backbone, trained jointly on radio, optical and acoustic traces,
can provide a unified, data-driven adaptation layer that addresses both rapid
CSI ageing and severe bandwidth disparity across the SAGSIN protocol stack.
Explicitly, an LLM-based long-range channel predictor forecasts the strongest
delay-Doppler components several coherence intervals ahead, facilitating
near-capacity reception despite violent channel fluctuations. Furthermore, our
LLM-based semantic encoder turns raw sensor payloads into task-oriented tokens. This substantially reduces the SNR required for high-fidelity image delivery in a coastal underwater link, circumventing the data rate limitation via semantic communications. The inclusion of these tools creates a medium-agnostic adaptation
layer that spans radio, optical and acoustic channels. We conclude with
promising open research directions in on-device model compression, multimodal
fidelity control, cross-layer resource orchestration and trustworthy operation,
charting a path from laboratory prototypes to field deployment.
Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
Authors: Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang
2025-09-02
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have
empowered large language models (LLMs) to tackle challenging reasoning tasks
such as mathematics and programming. RLVR leverages verifiable outcome rewards
to guide policy optimization, enabling LLMs to progressively improve output
quality in a grounded and reliable manner. Despite its promise, the RLVR
paradigm poses significant challenges, as existing methods often suffer from
sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address these challenges, we propose PACS, a novel RLVR framework that achieves imPlicit Actor Critic coupling via a Supervised learning framework. By
treating the outcome reward as a predictable label, we reformulate the RLVR
problem into a supervised learning task over a score function parameterized by
the policy model and optimized using cross-entropy loss. A detailed gradient
analysis shows that this supervised formulation inherently recovers the
classical policy gradient update while implicitly coupling actor and critic
roles, yielding more stable and efficient training. Benchmarking on challenging
mathematical reasoning tasks, PACS outperforms strong RLVR baselines, such as
PPO and GRPO, achieving superior reasoning performance. For instance, PACS
achieves 59.78% at pass@256 on AIME 2025, representing improvements of 13.32 and 14.36 points over PPO and GRPO, respectively. This simple yet powerful framework offers a promising avenue for LLM post-training with verifiable rewards. Our code and
data are available as open source at https://github.com/ritzz-ai/PACS.
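A schematic sketch of the reformulation: the verifiable outcome reward r in {0, 1} becomes a supervision label for a score computed from the policy itself, trained with cross-entropy. Using the mean token log-probability of the sampled answer as the score function is an illustrative assumption, not necessarily the paper's exact parameterization.

    import torch
    import torch.nn.functional as F

    def pacs_style_loss(token_logits: torch.Tensor, token_ids: torch.Tensor,
                        reward: torch.Tensor) -> torch.Tensor:
        """token_logits: (T, V) policy logits; token_ids: (T,) sampled answer;
        reward: scalar 0/1 verifiable outcome (e.g., math answer correct)."""
        logprobs = F.log_softmax(token_logits, dim=-1)
        chosen = logprobs.gather(1, token_ids.unsqueeze(1)).squeeze(1)
        score = chosen.mean()  # score function parameterized by the policy
        # Supervised objective: predict the outcome label from the policy's
        # score. Its gradient resembles a policy-gradient update in which the
        # sigmoid of the score acts as an implicit, learned critic/baseline.
        return F.binary_cross_entropy_with_logits(score, reward)

    logits = torch.randn(12, 32000, requires_grad=True)
    ids = torch.randint(0, 32000, (12,))
    loss = pacs_style_loss(logits, ids, torch.tensor(1.0))
    loss.backward()
    print(float(loss))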
MoPEQ Mixture of Mixed Precision Quantized Experts
Authors: Krishna Teja Chitty-Venkata, Jie Ye, Murali Emani
2025-09-02
Large Language and Vision Models using a Mixture-of-Experts (MoE)
architecture pose significant challenges for deployment due to their
computational and memory demands. Mixed Precision Quantization assigns
different precisions to different layers of an LLM/VLM based on layer sensitivity and importance within the model. In this work, we propose a Post-Training Quantization algorithm, MoPEQ, that assigns optimal bit width to each
expert. Our method balances accuracy and model size by analyzing each expert's
sensitivity using Hessian trace approximation instead of relying on the
activation frequency of the expert. This per-expert granularity approach
clusters similar experts to maintain model performance while reducing memory
requirements. The experimental results on VLMEvalKit benchmark datasets using
State-of-the-art VLMs Deepseek-VL2 -tiny, -small, -base, and MolmoE models
demonstrate that our mixed-precision quantized MoEs achieve competitive
accuracy with substantial improvements in memory footprint compared to
uniform-precision baseline methods. We perform a comprehensive study to analyze
the impact of expert activation frequency and sensitivity using Hessian trace
approximation at both layer-wise and model-wide expert precision allocation of
2, 3, and 4 bits to provide a thorough understanding of mixed-precision quantization of VLM-MoEs.
Top-H Decoding Adapting the Creativity and Coherence with Bounded Entropy in Text Generation
Authors: Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram
2025-09-02
Large language models (LLMs), despite their impressive performance across a
wide range of tasks, often struggle to balance two competing objectives in
open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including
temperature scaling, top-$p$ (nucleus) sampling, and min-$p$ sampling, aim
to manage this trade-off. However, they exhibit limitations, particularly in
the effective incorporation of the confidence of the model into the
corresponding sampling strategy. For example, min-$p$ sampling relies on a
single top token as a heuristic for confidence, eventually underutilizing the
information of the probability distribution. Toward effective incorporation of
the confidence of the model, in this paper, we present top-H decoding. We
first establish the theoretical foundation of the interplay between creativity
and coherence in truncated sampling by formulating an entropy-constrained
minimum divergence problem. We then prove this minimization problem to be
equivalent to an entropy-constrained mass maximization (ECMM) problem,
which is NP-hard. Finally, we present top-H decoding, a computationally
efficient greedy algorithm to solve the ECMM problem. Extensive empirical
evaluations demonstrate that top-H outperforms the state-of-the-art (SoTA)
alternative of min-$p$ sampling by up to 25.63% on creative writing
benchmarks, while maintaining robustness on question-answering datasets such as
GPQA, GSM8K, and MT-Bench. Additionally, an LLM-as-judge evaluation confirms
that top-H indeed produces coherent outputs even at higher temperatures, where
creativity is especially critical. In summary, top-H advances SoTA in
open-ended text generation and can be easily integrated into creative writing
applications. The code is available at
https://github.com/ErfanBaghaei/Top-H-Decoding.
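A toy sketch of a greedy entropy-bounded truncation in the spirit of top-H: grow the candidate set in descending probability order (maximizing kept mass) and stop once the renormalized entropy of the kept set would exceed a budget tied to the full distribution's entropy. The exact form of the bound here is an assumption.

    import numpy as np

    def entropy(p: np.ndarray) -> float:
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    def top_h(probs: np.ndarray, alpha: float = 0.8) -> np.ndarray:
        """Return indices of kept tokens; alpha scales the entropy budget."""
        budget = alpha * entropy(probs)          # bounded-entropy constraint
        order = np.argsort(probs)[::-1]          # greedy: most probable first
        kept = []
        for idx in order:
            trial = probs[kept + [idx]] / probs[kept + [idx]].sum()
            if kept and entropy(trial) > budget:
                break                            # one more token breaks the budget
            kept.append(idx)
        return np.array(kept)

    probs = np.array([0.4, 0.25, 0.15, 0.1, 0.05, 0.05])
    keep = top_h(probs)
    print(keep, probs[keep] / probs[keep].sum())  # truncated, renormalized policy

On this toy distribution the greedy loop keeps the top three tokens: adding the fourth would push the renormalized entropy past the budget.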
HydroGAT Distributed Heterogeneous Graph Attention Transformer for Spatiotemporal Flood Prediction
Authors: Aishwarya Sarkar, Autrin Hakimi, Xiaoqiong Chen, Hai Huang, Chaoqun Lu, Ibrahim Demir, Ali Jannesari
2025-09-02
Accurate flood forecasting remains a challenge for water-resource management,
as it demands modeling of local, time-varying runoff drivers (e.g.,
rainfall-induced peaks, baseflow trends) and complex spatial interactions
across a river network. Traditional data-driven approaches, such as
convolutional networks and sequence-based models, ignore topological
information about the region. Graph Neural Networks (GNNs) propagate
information exactly along the river network, which is ideal for learning
hydrological routing. However, state-of-the-art GNN-based flood prediction
models collapse pixels to coarse catchment polygons as the cost of training
explodes with graph size and higher resolution. Furthermore, most existing
methods treat spatial and temporal dependencies separately, either applying
GNNs solely on spatial graphs or transformers purely on temporal sequences,
thus failing to simultaneously capture spatiotemporal interactions critical for
accurate flood prediction. We introduce a heterogeneous basin graph where every
land and river pixel is a node connected by physical hydrological flow
directions and inter-catchment relationships. We propose HydroGAT, a
spatiotemporal network that adaptively learns local temporal importance and the
most influential upstream locations. Evaluated in two Midwestern US basins and
across five baseline architectures, our model achieves higher NSE (up to 0.97),
improved KGE (up to 0.96), and low bias (PBIAS within 5%) in hourly discharge prediction, while offering interpretable attention maps that reveal sparse, structured intercatchment influences. To support high-resolution
basin-scale training, we develop a distributed data-parallel pipeline that
scales efficiently up to 64 NVIDIA A100 GPUs on NERSC Perlmutter supercomputer,
demonstrating up to 15x speedup across machines. Our code is available at
https://github.com/swapp-lab/HydroGAT.
MLP-Offload Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall
Authors: Avinash Maurya, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae
2025-09-02
Training LLMs larger than the aggregated memory of multiple GPUs is increasingly necessary due to the faster growth of LLM sizes compared to GPU memory. To this end, multi-tier host memory or disk offloading techniques have been proposed by the state of the art. Despite advanced asynchronous multi-tier read/write strategies, such offloading strategies result in significant I/O overheads in the critical path of training, resulting in slower iterations. To address this, we propose MLP-Offload, a novel multi-level, multi-path offloading engine specifically designed for optimizing LLM training on resource-constrained
setups by mitigating I/O bottlenecks. We make several key observations that
drive the design of MLP-Offload, such as I/O overheads during the update
dominate the iteration time; I/O bandwidth of the third-level remote storage
tier remains unutilized; and, contention due to concurrent offloading amplifies
I/O bottlenecks. Driven by these insights, we design and implement MLP-Offload
to offload the optimizer states across multiple tiers in a cache-efficient and
concurrency-controlled fashion to mitigate I/O bottlenecks during the backward
and update phases. Evaluations on models up to 280B parameters show that MLP-Offload achieves 2.5x faster iterations compared to state-of-the-art LLM training runtimes.
An Efficient and Adaptive Watermark Detection System with Tile-based Error Correction
Authors: Xinrui Zhong, Xinze Feng, Jingwei Zuo, Fanjiang Ye, Yi Mu, Junfeng Guo, Heng Huang, Myungjin Lee, Yuke Wang
2025-09-02
Efficient and reliable detection of generated images is critical for the
responsible deployment of generative models. Existing approaches primarily
focus on improving detection accuracy and robustness under various image
transformations and adversarial manipulations, yet they largely overlook the
efficiency challenges of watermark detection across large-scale image
collections. To address this gap, we propose QRMark, an efficient and adaptive
end-to-end method for detecting embedded image watermarks. The core idea of
QRMark is to combine QR Code inspired error correction with tailored tiling
techniques to improve detection efficiency while preserving accuracy and robustness. At the algorithmic level, QRMark employs a Reed-Solomon error
correction mechanism to mitigate the accuracy degradation introduced by tiling.
At the system level, QRMark implements a resource-aware stream allocation
policy that adaptively assigns more streams to GPU-intensive stages of the
detection pipeline. It further employs a tile-based workload interleaving
strategy to overlap data-loading overhead with computation and schedules
kernels across stages to maximize efficiency. End-to-end evaluations show that
QRMark achieves an average 2.43x inference speedup over the sequential
baseline.
Cache Management for Mixture-of-Experts LLMs -- extended version
Authors: Spyros Angelopoulos, Loris Marchal, Adrien Obrecht, Bertrand Simon
2025-09-02
Large language models (LLMs) have demonstrated remarkable capabilities across
a variety of tasks. One of the main challenges towards the successful
deployment of LLMs is memory management, since they typically involve billions
of parameters. To this end, architectures based on Mixture-of-Experts (MoE)
have been proposed, which aim to reduce the number of parameters that are
activated when producing a token. This raises the equally critical issue of
efficiently managing the limited cache of the system, in that frequently used
experts should be stored in the fast cache rather than in the slower secondary
memory.
In this work, we introduce and study a new paging problem that models expert
management optimization. Our formulation captures both the layered architecture
of LLMs and the requirement that experts are cached efficiently. We first
present lower bounds on the competitive ratio of both deterministic and
randomized algorithms, which show that under mild assumptions, LRU-like
policies have good theoretical competitive performance. We then propose a
layer-based extension of LRU that is tailored to the problem at hand.
Extensive simulations on both synthetic datasets and actual traces of MoE
usage show that our algorithm outperforms policies designed for the classic
paging problem, such as standard LRU.
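A layer-aware LRU variant in the spirit described above can be sketched as follows; this is a minimal illustration, not the authors' exact policy:

```python
# Layer-aware LRU expert cache: entries are keyed by (layer, expert) and
# evictions prefer experts outside the layer currently executing, since those
# will not be needed again until the next token's forward pass.
from collections import OrderedDict

class LayerLRU:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()          # (layer, expert) -> weights handle

    def access(self, layer: int, expert: int, load_fn, active_layer: int):
        key = (layer, expert)
        if key in self.cache:
            self.cache.move_to_end(key)     # LRU hit: refresh recency
            return self.cache[key]
        if len(self.cache) >= self.capacity:
            # evict the least-recently-used entry NOT in the active layer, if any
            victim = next((k for k in self.cache if k[0] != active_layer),
                          next(iter(self.cache)))
            self.cache.pop(victim)
        self.cache[key] = load_fn(layer, expert)   # fetch from secondary memory
        return self.cache[key]
```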
Upcycling Candidate Tokens of Large Language Models for Query Expansion
Authors: Jinseok Kim, Sukmin Cho, Soyeong Jeong, Sangyeop Kim, Sungzoon Cho
2025-09-02
Query Expansion (QE) improves retrieval performance by enriching queries with
related terms. Recently, Large Language Models (LLMs) have been used for QE,
but existing methods face a trade-off: generating diverse terms boosts
performance but increases computational cost. To address this challenge, we
propose Candidate Token Query Expansion (CTQE), which extracts diverse and
relevant terms from a single LLM decoding pass by leveraging unselected
candidate tokens. These tokens, though not part of the final output, are
conditioned on the full query and capture useful information. By aggregating
them, CTQE achieves both relevance and diversity without extra inference,
reducing overhead and latency. Experiments show that CTQE delivers strong
retrieval performance with significantly lower cost, outperforming or
comparable to more expensive methods. Code is available at:
https://github.com/bluejeans8/CTQE
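The core mechanism, harvesting unselected candidate tokens from one decoding pass, can be sketched as below; the function names and the simple top-k frequency aggregation are illustrative assumptions, not CTQE's actual implementation:

```python
# Collect high-probability alternatives to each emitted token from a single
# generation pass, then rank them by frequency as candidate expansion terms.
import torch
from collections import Counter

def collect_candidates(step_logits, chosen_ids, k=5):
    """step_logits: [steps, vocab] tensor; chosen_ids: [steps] emitted tokens."""
    counts = Counter()
    for logits, chosen in zip(step_logits, chosen_ids):
        topk = torch.topk(logits, k).indices.tolist()
        counts.update(t for t in topk if t != int(chosen))  # skip the emitted token
    return counts

# Expansion terms = the most frequent alternatives, decoded to text with the
# model's tokenizer (omitted here); no extra forward pass is required.
```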
AudioCodecBench A Comprehensive Benchmark for Audio Codec Evaluation
Authors: Lu Wang, Hao Chen, Siyu Wu, Zhiyue Wu, Hao Zhou, Chengfeng Zhang, Ting Wang, Haodi Zhang
2025-09-02
Multimodal Large Language Models (MLLMs) have been widely applied in speech
and music. This tendency has led to a focus on audio tokenization for Large
Models (LMs). Unlike semantic-only text tokens, audio tokens must both capture
global semantic content and preserve fine-grained acoustic details. Moreover,
they provide a discrete representation of speech and music that can be
effectively integrated into MLLMs. However, existing research lacks suitable
definitions of semantic tokens and acoustic tokens. In addition, the evaluation
of different codecs typically concentrates on specific domains or tasks, such
as reconstruction or Automatic Speech Recognition (ASR) task, which prevents
fair and comprehensive comparisons. To address these problems, this paper
provides suitable definitions for semantic and acoustic tokens and introduces a
systematic evaluation framework. This framework allows for a comprehensive
assessment of codecs' capabilities across four dimensions: audio reconstruction
metrics, codebook index (ID) stability, decoder-only perplexity, and
performance on downstream probe tasks. Our results support the provided
definitions and reveal correlations among reconstruction metrics, codebook ID
stability, downstream probe tasks, and perplexity.
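Of the four dimensions, codebook ID stability is the easiest to illustrate. One plausible reading, sketched below, is the fraction of token IDs that survive a small input perturbation; the benchmark's exact metric definition may differ:

```python
# Codebook ID stability as frame-wise agreement between the token IDs of a
# clean signal and a slightly perturbed version of the same signal.
import numpy as np

def id_stability(ids_clean: np.ndarray, ids_perturbed: np.ndarray) -> float:
    """Both arrays hold codebook indices for the same audio, aligned per frame."""
    n = min(len(ids_clean), len(ids_perturbed))
    return float(np.mean(ids_clean[:n] == ids_perturbed[:n]))
```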
FPGA-Based RoCEv2-RDMA Readout Electronics for the CTAO-LST Advanced Camera
Authors: F. Marini, M. Bellato, A. Bergnoli, D. Corti, A. Griggio, R. Isocrate, L. Modenese, M. Toffano, C. Arcaro, F. Di Pierro, M. Mariotti, M. Mi, P. Wang
2025-09-02
The largest telescope type of CTAO (Cherenkov Telescope Array Observatory),
the LST (Large-Sized Telescope), is being installed at the northern site of the
Cherenkov Telescope Array (CTA) at the Observatorio del Roque de los Muchachos
on the Canary island of La Palma. These telescopes aim to capture the
lowest-energy gamma rays of the observatory. The readout electronics
architecture proposed here, a proof of concept for the LST advanced camera
upgrade,
relies on a custom high-channel count fast sampling hardware digitizer board
acting as a Front-End. The design includes a versatile pre-amplification stage
and high-speed serial links for streaming JESD204C-compliant data at rates
approaching 12 Gb/s per lane. The data get transferred to Back-End electronics
for a first data-processing and trigger before being transmitted to
event-building servers through 10 Gb/s Ethernet links. The performance of the
link is exploited by implementing RDMA in hardware, thanks to a RoCEv2 core
written in Bluespec SystemVerilog, enabling data to be transferred directly to
processing units without CPU intervention. Hardware
design and characterization of the Front End board are reported, as well as a
detailed description and tests of the Back End RDMA firmware.
Dual-end Fluid Antennas For Robust Anti-jamming in Low-altitude Air-ground Communications
Authors: Yifan Guo, Junshan Luo, Fanggang Wang, Haiyang Ding, Shilian Wang, Zhenhai Xu
2025-09-02
This paper addresses the challenge of co-channel interference and intentional
jamming in low-altitude air-ground communications. Since conventional
fixed-position antenna (FPA) systems lack spatial adaptability to dynamically
balance signal enhancement against interference suppression, we propose a
transformative fluid antenna system (FAS)-assisted heterogeneous dual-layer
transmission architecture. Specifically, a terrestrial base station with FPA
serves ground users, while a low-altitude base station equipped with FAS
communicates with the aerial user, also equipped with FAS, under the attack
of a malicious jammer. We formulate a worst-case achievable rate maximization
problem for the aerial user subject to constraints including quality-of-service
for
terrestrial users, imperfect jamming directions, minimum antenna separation,
etc. To address the non-convex problem, we propose a fractional
programming-block coordinate descent algorithm that alternately optimizes the
transmit precoders, receive combiner, and antenna positions at both transceiver
sides. A convex hull-based approach and a geometric boundary method are used
to
handle the jamming uncertainty and antenna placement constraints in confined
spatial regions, respectively. Extensive simulations validate significant
performance gains. The FAS achieves up to 56% higher data rates than FPA under
equivalent power constraints. Strategic antenna repositioning demonstrably
enhances signal quality while suppressing interference, maintaining robustness
across diverse jammer channel uncertainties.
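Schematically, the worst-case rate maximization described above has a max-min shape of the following kind; the notation here is assumed for illustration and is not the paper's exact formulation:

```latex
% Illustrative max-min robust design (assumed notation): precoders w, receive
% combiner v, antenna positions p, jamming-channel uncertainty set U,
% feasible placement region R, minimum QoS rate R_min, antenna spacing d_min.
\max_{\mathbf{w},\,\mathbf{v},\,\mathbf{p}\in\mathcal{R}}\;
  \min_{\Delta\in\mathcal{U}}\; R_{\text{air}}(\mathbf{w},\mathbf{v},\mathbf{p};\Delta)
\quad\text{s.t.}\quad
  R_{\text{ground},u}\ge R_{\min}\ \forall u,\qquad
  \|\mathbf{p}_i-\mathbf{p}_j\|\ge d_{\min}\ \forall i\ne j .
```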
Avoidance Decoding for Diverse Multi-Branch Story Generation
Authors: Kyeongman Park, Nakyeong Yang, Kyomin Jung
2025-09-02
Large Language Models (LLMs) often generate repetitive and monotonous
outputs, especially in tasks like story generation, due to limited creative
diversity when given the same input prompt. To address this challenge, we
propose a novel decoding strategy, Avoidance Decoding, that modifies token
logits by penalizing similarity to previously generated outputs, thereby
encouraging more diverse multi-branch stories. This penalty adaptively balances
two similarity measures: (1) Concept-level Similarity Penalty, which is
prioritized in early stages to diversify initial story concepts, and (2)
Narrative-level Similarity Penalty, which is increasingly emphasized later to
ensure natural yet diverse plot development. Notably, our method achieves up to
2.6 times higher output diversity and reduces repetition by an average of 30%
compared to strong baselines, while effectively mitigating text degeneration.
Furthermore, we reveal that our method activates a broader range of neurons,
demonstrating that it leverages the model's intrinsic creativity.
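A minimal sketch of the stage-dependent penalty follows, assuming precomputed per-token similarity scores; the weights and linear schedule are illustrative, not the authors' exact configuration:

```python
# Similarity-penalized logits with stage-dependent weights: early steps
# emphasize a concept-level penalty, later steps a narrative-level one.
import torch

def avoidance_logits(logits, concept_sim, narrative_sim, step, total_steps,
                     alpha=2.0, beta=2.0):
    """logits, concept_sim, narrative_sim: [vocab] tensors; step in [0, total_steps)."""
    progress = step / max(total_steps - 1, 1)
    w_concept = alpha * (1.0 - progress)      # prioritized early
    w_narrative = beta * progress             # emphasized later
    return logits - w_concept * concept_sim - w_narrative * narrative_sim

# Sampling then proceeds from softmax(avoidance_logits(...)) as usual.
```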
AMBEDKAR-A Multi-level Bias Elimination through a Decoding Approach with Knowledge Augmentation for Robust Constitutional Alignment of Language Models
Authors: Snehasis Mukhopadhyay, Aryan Kasat, Shivam Dubey, Rahul Karthikeyan, Dhruv Sood, Vinija Jain, Aman Chadha, Amitava Das
2025-09-02
Large Language Models (LLMs) can inadvertently reflect societal biases
present in their training data, leading to harmful or prejudiced outputs. In
the Indian context, our empirical evaluations across a suite of models reveal
that biases around caste and religion are particularly salient. Yet, most
existing mitigation strategies are Western-centric and fail to address these
local nuances. We propose AMBEDKAR, a framework inspired by the egalitarian
vision of Dr B. R. Ambedkar, architect of the Indian Constitution, to guide
outputs toward fairness, neutrality, and inclusion in line with Articles 14 to
17. Our approach introduces a Constitution-Aware Decoding Layer, guided by the
AI Constitution of India and applied only at inference time, without any
parameter updates to the base model. We incorporate a speculative decoding
algorithm that proactively reduces casteist and communal bias during
generation. This mitigation layer operates directly within the decoding
process, avoiding changes to model internals and lowering the computational and
infrastructural costs associated with retraining. We reinterpret speculative
decoding not merely as an efficiency tool but as a mechanism for fairness. In
this framework, a Small Language Model (SLM) acts as a potentially biased
generator, while a constitutionally guided Large Language Model (LLM) serves
as the verifier. Rather than accelerating generation, the LLM enforces
bias-robust
trajectories in the SLM outputs. This inversion of roles gives rise to a
fairness-by-speculation paradigm. Our approach yields an absolute reduction in
bias of up to 26.41 percent compared to the baseline. Our source code,
datasets, and
results are available at https://anonymous.4open.science/r/AMBEDKAR-983B/
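The inverted speculative loop can be sketched as follows; `slm_propose`, `llm_accepts`, and `llm_next_token` are hypothetical interfaces standing in for the SLM generator and the constitutionally guided LLM verifier:

```python
# Role-inverted speculative decoding: the small model drafts tokens and the
# large model verifies them for bias-robustness rather than for speedup.
def fairness_speculative_decode(prompt_tokens, slm_propose, llm_accepts,
                                llm_next_token, max_new_tokens=128, draft_len=4):
    """prompt_tokens: list of token ids; the three callables wrap the SLM/LLM."""
    tokens = list(prompt_tokens)
    target = len(tokens) + max_new_tokens
    while len(tokens) < target:
        for tok in slm_propose(tokens, n=draft_len):   # potentially biased drafts
            if llm_accepts(tokens, tok):               # verifier checks the draft
                tokens.append(tok)
            else:
                tokens.append(llm_next_token(tokens))  # verifier overrides it
                break                                  # resynchronize drafting
            if len(tokens) >= target:
                break
    return tokens
```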
FlexNGIA 2.0 Redesigning the Internet with Agentic AI -- Protocols, Services, and Traffic Engineering Designed, Deployed, and Managed by AI
Authors: Mohamed Faten Zhani, Younes Korbi, Yamen Mkadem
2025-09-02
The escalating demands of immersive applications, alongside advances in
network softwarization and AI-driven cognition and generative reasoning, create
a pivotal opportunity to rethink and reshape the future Internet. In this
context, we introduce FlexNGIA 2.0, an Agentic AI-driven Internet architecture
that leverages LLM-based AI agents to autonomously
orchestrate, configure, and evolve the network. These agents can, at runtime,
perceive, reason, and coordinate among themselves to dynamically design,
implement, deploy, and adapt communication protocols, Service Function Chains
(SFCs),
network functions, resource allocation strategies, congestion control, and
traffic engineering schemes, thereby ensuring optimal performance, reliability,
and efficiency under evolving conditions.
The paper first outlines the overall architecture of FlexNGIA 2.0 and its
constituent LLM-based AI agents. For each agent, we detail its design,
implementation, inputs and outputs, prompt structures, interactions with tools
and other agents, followed by preliminary proof-of-concept experiments
demonstrating its operation and potential. The results clearly highlight the
ability of these LLM-based AI agents to automate the design, implementation,
deployment, and performance evaluation of transport protocols, service function
chains, network functions, congestion control
schemes, and resource allocation strategies.
FlexNGIA 2.0 paves the way for a new class of Agentic AI-Driven networks,
where fully cognitive, self-evolving AI agents can autonomously design,
implement, adapt and optimize the network's protocols, algorithms, and
behaviors to efficiently operate across complex, dynamic, and heterogeneous
environments. To bring this vision to reality, we also identify key research
challenges toward achieving fully autonomous, adaptive, and agentic AI-driven
networks.
Batch Query Processing and Optimization for Agentic Workflows
Authors: Junyi Shen, Noppanat Wadlom, Yao Lu
2025-09-02
Large Language Models (LLMs) in agentic workflows combine multi-step
reasoning, tool use, and collaboration across multiple specialized agents.
Existing LLM serving engines optimize individual calls in isolation, while
multi-agent frameworks focus on orchestration without system-level performance
planning. As a result, repeated prompts, overlapping contexts, and concurrent
executions create substantial redundancy and poor GPU utilization, especially
in batch analytics scenarios. We introduce Halo, a system that brings batch
query processing and optimization into agentic LLM workflows. Halo represents
each workflow as a structured query plan DAG and constructs a consolidated
graph for batched queries that exposes shared computation. Guided by a cost
model that jointly considers prefill and decoding costs, KV-cache reuse, and
GPU placement, Halo performs plan-level optimization to minimize redundant
execution. Its runtime integrates adaptive batching, KV-cache sharing and
migration, along with compute-communication overlap to maximize hardware
efficiency. Evaluation across six benchmarks shows that Halo achieves up to
18.6x speedup for batch inference and 4.7x throughput improvement under online
serving, scaling to workloads of tens of thousands of queries and complex
graphs. These gains are achieved without compromising output quality. By
unifying query optimization with LLM serving, Halo enables efficient agentic
workflows in data analytics and decision-making applications.
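The plan-consolidation step can be illustrated by hashing DAG nodes so that identical sub-computations (for example, a shared prompt prefix or repeated tool call) are executed exactly once; the node encoding below is an assumption for illustration, not Halo's actual representation:

```python
# Consolidate batched query plans: nodes with identical operators and
# identical (already-consolidated) inputs collapse to a single shared node.
import hashlib

def node_key(op: str, input_keys: tuple) -> str:
    h = hashlib.sha256()
    h.update(op.encode())
    for k in input_keys:                  # keys of already-consolidated children
        h.update(k.encode())
    return h.hexdigest()

def consolidate(plans):
    """plans: list of DAGs, each a list of (op, child_indices) in topological order."""
    merged, per_plan_keys = {}, []
    for plan in plans:
        keys = []
        for op, children in plan:
            k = node_key(op, tuple(keys[c] for c in children))
            merged.setdefault(k, (op, tuple(keys[c] for c in children)))  # dedup
            keys.append(k)
        per_plan_keys.append(keys)
    return merged, per_plan_keys          # shared nodes now appear exactly once
```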
Reentrant superconductivity and superconductor-to-insulator transition in a naturally occurring Josephson junction array tuned by RF power
Authors: S. Avraham, S. Sankar, S. Sandik, A. Burshtein, M. Goldstein, E. Sela, Y. Dagan
2025-09-02
Superconductivity, characterized by dissipationless current flow with flux
expulsion or quantization, is usually muted when the magnetic field or the
temperature is sufficiently high. However, in rare instances, superconductivity
can reappear upon increasing the temperature or magnetic field, a phenomenon
known as reentrant superconductivity. It usually emerges from competing orders
in strongly correlated materials. Here we demonstrate reentrant
superconductivity as a function of both temperature and magnetic field, tuned
by radio frequency (RF) power in a relatively simple system: granular aluminum
(grAl), which exhibits the properties of a naturally occurring Josephson
junction array. At low temperatures, giant Shapiro steps emerge, exhibiting
characteristics of a single Josephson junction. Coherent phase locking across
the array's multiple junctions amplifies the quantized voltage, enabling
tunability at radio frequencies, as observed in artificially designed Josephson
arrays. We show that our system can be tuned from a coherent superconducting
(stiff-phase) to an insulating (phase-fluctuating) state using RF power. We
propose that the RF power modulates the Josephson coupling energy, E_J.
Remarkably, at elevated temperatures, the screening of the electron charge
suppresses the charging energy, causing superconductivity to reappear. This
many-body effect cannot be described within a single-junction framework. Our
system can therefore be tuned to observe both the single-junction regime and
many-body correlation effects, serving as a quantum simulator for complex
phenomena in condensed matter physics.
Loop Quantum Vector-Tensor Gravity and Its Spherically Symmetric Model
Authors: Shengzhi Li, Yongge Ma
2025-09-02
The Hamiltonian analysis of the vector-tensor theory of gravity is
performed. The resulting geometrical dynamics is reformulated into the
connection dynamics, with the real SU(2)-connection as one of the
configuration variables. This formulation allows us to extend the loop
quantization scheme of general relativity to the vector-tensor theory, thereby
rigorously constructing its quantum kinematical framework. The scalar
constraint is promoted to a well-defined operator in the vertex Hilbert space,
to represent quantum dynamics. Moreover, the spherically symmetric model of the
vector-tensor theory is obtained by the symmetric reduction. Following the
general deparametrization strategy for theories with diffeomorphism invariance,
the spherically symmetric model can be fully deparametrized in terms of the
degrees of freedom of the vector field. The corresponding reduced phase space
quantization is carried out. The physical Hamiltonian generating the relative
evolution is promoted to a well-defined operator on the physical Hilbert space.
FireRedTTS-2 Towards Long Conversational Speech Generation for Podcast and Chatbot
Authors: Kun Xie, Feiyu Shen, Junjie Li, Fenglong Xie, Xu Tang, Yao Hu
2025-09-02
Current dialogue generation approaches typically require the complete
dialogue text before synthesis and produce a single, inseparable speech
containing all voices, making them unsuitable for interactive chat; moreover,
they suffer from unstable synthesis, inaccurate speaker transitions, and
incoherent prosody. In this work, we present FireRedTTS-2, a long-form
streaming TTS system for multi-speaker dialogue generation, delivering stable,
natural speech with reliable speaker switching and context-aware prosody. A new
12.5Hz streaming speech tokenizer accelerates training and inference, extends
maximum dialogue length, encodes richer semantics to stabilize text-to-token
modeling and supports high-fidelity streaming generation for real-time
applications. We adopt a text-speech interleaved format, concatenating
speaker-labeled text with aligned speech tokens in chronological order, and
model it with a dual-transformer architecture: a large decoder-only
transformer predicts tokens at the first layer, and a smaller one completes
subsequent layers.
Experimental results show that FireRedTTS-2 integrates seamlessly with chat
frameworks and, with minimal fine-tuning, produces emotionally expressive
speech guided by implicit contextual cues. In podcast generation, it surpasses
existing systems including MoonCast, Zipvoice-Dialogue, and MOSS-TTSD in
objective intelligibility, speaker-turn reliability, and perceived naturalness
with context-consistent prosody. Our demos are available at
https://fireredteam.github.io/demos/firered_tts_2.
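The text-speech interleaved format can be sketched as a flat token sequence; the tag names and token IDs below are invented for illustration and are not FireRedTTS-2's actual vocabulary:

```python
# Build an interleaved training sequence: speaker-labeled text chunks alternate
# with their aligned speech tokens in chronological dialogue order.
def interleave(turns):
    """turns: list of (speaker, text_tokens, speech_tokens) in dialogue order."""
    seq = []
    for speaker, text_toks, speech_toks in turns:
        seq += [f"<spk:{speaker}>"] + text_toks    # who says it, in text
        seq += ["<audio>"] + speech_toks           # then the aligned speech tokens
    return seq

seq = interleave([("S1", ["hi", "there"], [101, 57, 9]),
                  ("S2", ["hello"], [88, 12])])
```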
Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs
Authors: Yuhao Wang, Junwei Pan, Xinhang Li, Maolin Wang, Yuan Wang, Yue Liu, Dapeng Liu, Jie Jiang, Xiangyu Zhao
2025-09-02
Sequential recommendation (SR) aims to capture users' dynamic interests and
sequential patterns based on their historical interactions. Recently, the
powerful capabilities of large language models (LLMs) have driven their
adoption in SR. However, we identify two critical challenges in existing
LLM-based SR methods: 1) embedding collapse when incorporating pre-trained
collaborative embeddings and 2) catastrophic forgetting of quantized embeddings
when utilizing semantic IDs. These issues dampen model scalability and lead to
suboptimal recommendation performance. Therefore, based on LLMs such as
Llama3-8B-instruct, we introduce a novel SR framework named MME-SID, which
integrates multimodal embeddings and quantized embeddings to mitigate embedding
collapse. Additionally, we propose a Multimodal Residual Quantized Variational
Autoencoder (MM-RQ-VAE) with maximum mean discrepancy as the reconstruction
loss and contrastive learning for alignment, which effectively preserve
intra-modal distance information and capture inter-modal correlations,
respectively. To further alleviate catastrophic forgetting, we initialize the
model with the trained multimodal code embeddings. Finally, we fine-tune the
LLM efficiently using LoRA in a multimodal frequency-aware fusion manner.
Extensive experiments on three public datasets validate the superior
performance of MME-SID thanks to its capability to mitigate embedding collapse
and catastrophic forgetting. The implementation code and datasets are publicly
available for reproduction:
https://github.com/Applied-Machine-Learning-Lab/MME-SID.
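For concreteness, here is a minimal RBF-kernel maximum mean discrepancy, in the spirit of MM-RQ-VAE's use of MMD as a reconstruction loss; the kernel choice and bandwidth are assumptions, not the paper's configuration:

```python
# RBF-kernel MMD between reconstructed and target embedding batches; it
# matches distributions (preserving intra-modal distance structure) rather
# than individual points.
import torch

def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """x: [n, d] reconstructed embeddings, y: [m, d] target embeddings (float)."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)          # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```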
mFARM Towards Multi-Faceted Fairness Assessment based on HARMs in Clinical Decision Support
Authors: Shreyash Adappanavar, Krithi Shailya, Gokul S Krishnan, Sriraam Natarajan, Balaraman Ravindran
2025-09-02
The deployment of Large Language Models (LLMs) in high-stakes medical
settings poses a critical AI alignment challenge, as models can inherit and
amplify societal biases, leading to significant disparities. Existing fairness
evaluation methods fall short in these contexts as they typically use
simplistic metrics that overlook the multi-dimensional nature of medical harms.
This also promotes models that are fair only because they are clinically inert,
defaulting to safe but potentially inaccurate outputs. To address this gap, our
contributions are mainly two-fold: first, we construct two large-scale,
controlled benchmarks (ED-Triage and Opioid Analgesic Recommendation) from
MIMIC-IV, comprising over 50,000 prompts with twelve race x gender variants and
three context tiers. Second, we propose a multi-metric framework,
Multi-faceted Fairness Assessment based on hARMs (mFARM), to audit fairness
along three distinct dimensions of disparity (Allocational, Stability, and
Latent) and aggregate them into an overall mFARM score. We also present an
aggregated
Fairness-Accuracy Balance (FAB) score to benchmark and observe trade-offs
between fairness and prediction accuracy. We empirically evaluate four
open-source LLMs (Mistral-7B, BioMistral-7B, Qwen-2.5-7B, Bio-LLaMA3-8B) and
their finetuned versions under quantization and context variations. Our
findings showcase that the proposed metrics capture subtle biases more
effectively under various settings. We find that most models maintain robust
performance in terms of mFARM score across varying levels of quantization but
deteriorate significantly when the context is reduced. Our benchmarks and
evaluation code are publicly released to enhance research in aligned AI for
healthcare.
AHAMask Reliable Task Specification for Large Audio Language Models without Instructions
Authors: Yiwei Guo, Bohan Li, Hankun Wang, Zhihan Li, Shuai Wang, Xie Chen, Kai Yu
2025-09-01
Although current large audio language models (LALMs) extend text large
language models (LLMs) with generic acoustic understanding abilities, they
usually suffer from instruction sensitivity, where different instructions of
the same intention can yield drastically different outcomes. In this work, we
propose AHAMask, where we simply mask some of the attention heads in the
decoder-only backbone of LALMs to trigger specific acoustic task
functionalities without instructions. These masks are efficiently obtained by
training on an LALM, with the number of trainable parameters equal to the
attention head count in its decoder backbone. We show by experiments that
applying
such selective attention head masks achieves comparable or even better
performance than using instructions, either on single or composite tasks.
Besides achieving reliable acoustic task specification for LALMs, this also
reveals that LALMs exhibit certain "functional pathways" in their attention
heads.
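Masking attention heads in a decoder-only backbone can be sketched as a per-head gate applied before the output projection; this is a schematic of the mechanism, not the AHAMask training recipe:

```python
# Per-head gating: multiply each attention head's output by a {0,1} mask so
# masked heads contribute nothing to the layer's output projection.
import torch

def masked_attention_output(head_outputs: torch.Tensor,
                            head_mask: torch.Tensor) -> torch.Tensor:
    """head_outputs: [batch, heads, seq, head_dim]; head_mask: [heads] in {0,1}."""
    return head_outputs * head_mask.view(1, -1, 1, 1)   # zero out masked heads

# Training could optimize a relaxed mask (e.g., sigmoid of learnable logits)
# with the backbone frozen, then threshold it to {0,1} at inference.
```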
Communication-Aware Knowledge Distillation for Federated LLM Fine-Tuning over Wireless Networks
Authors: Xinlu Zhang, Na Yan, Yang Su, Yansha Deng, Toktam Mahmoodi
2025-09-01
Federated learning (FL) for large language models (LLMs) offers a
privacy-preserving scheme, enabling clients to collaboratively fine-tune
locally deployed LLMs or smaller language models (SLMs) without exchanging raw
data. While parameter-sharing methods in traditional FL solve a number of
technical challenges, they still incur high communication overhead and struggle
with adapting to heterogeneous model architectures. Federated distillation, a
framework for mutual knowledge transfer via shared logits, typically offers
lower communication overhead than parameter-sharing methods. However,
transmitting logits from LLMs remains challenging for bandwidth-limited clients
due to their high dimensionality. In this work, we focus on
communication-efficient federated distillation. To achieve this, we first
propose an adaptive Top-k logit selection mechanism, dynamically sparsifying
logits according to real-time communication conditions. Then, to tackle the
dimensional inconsistency introduced by the adaptive sparsification, we design
an adaptive logits aggregation scheme, effectively alleviating the artificial
and uninformative inputs introduced by conventional zero-padding methods.
Finally, to enhance the distillation effect, we incorporate a LoRA-adapted
hidden-layer projection from the LLM into the distillation loss, further
reducing communication overhead while providing richer representations.
Experimental results demonstrate that our scheme achieves superior performance
compared to baseline methods while effectively reducing communication overhead
by approximately 50%.
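Adaptive Top-k logit sparsification can be sketched as below; the mapping from channel budget to k is an assumption for illustration, not the paper's exact mechanism:

```python
# Transmit only the k largest logits (as index-value pairs) instead of the full
# vocabulary-sized vector, with k adapted to the available channel budget.
import torch

def sparsify_logits(logits: torch.Tensor, bandwidth_budget: float,
                    k_min: int = 8, k_max: int = 256):
    """logits: [vocab]; bandwidth_budget in [0, 1] from channel estimation."""
    k = int(k_min + bandwidth_budget * (k_max - k_min))   # adapt k to the channel
    vals, idx = torch.topk(logits, k)
    return idx, vals          # (index, value) pairs to transmit
```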
Preconditioned Regularized Wasserstein Proximal Sampling
Authors: Hong Ye Tan, Stanley Osher, Wuchen Li
2025-09-01
We consider sampling from a Gibbs distribution by evolving finitely many
particles. We propose a preconditioned version of a recently proposed
noise-free sampling method, governed by approximating the score function with
the numerically tractable score of a regularized Wasserstein proximal operator.
This is derived by a Cole--Hopf transformation on coupled anisotropic heat
equations, yielding a kernel formulation for the preconditioned regularized
Wasserstein proximal. The diffusion component of the proposed method is also
interpreted as a modified self-attention block, as in transformer
architectures. For quadratic potentials, we provide a discrete-time
non-asymptotic convergence analysis and explicitly characterize the bias, which
is dependent on the regularization and independent of the step size.
Experiments demonstrate particle-level stability on examples ranging from
log-concave and non-log-concave toy problems to Bayesian total-variation
regularized image deconvolution, as well as competitive or better performance
on non-convex Bayesian neural network training when utilizing variable
preconditioning matrices.
Q-Sched Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling
Authors: Natalia Frumkin, Diana Marculescu
2025-09-01
Text-to-image diffusion models are computationally intensive, often requiring
dozens of forward passes through large backbones. For instance,
Stable Diffusion XL generates high-quality images with 50 evaluations of a
2.6B-parameter model, an expensive process even for a single batch. Few-step
diffusion models reduce this cost to 2-8 denoising steps but still depend on
large, uncompressed U-Net or diffusion transformer backbones, which are often
too costly for full-precision inference without datacenter GPUs. These
requirements also limit existing post-training quantization methods that rely
on full-precision calibration. We introduce Q-Sched, a new paradigm for
post-training quantization that modifies the diffusion model scheduler rather
than model weights. By adjusting the few-step sampling trajectory, Q-Sched
achieves full-precision accuracy with a 4x reduction in model size. To learn
quantization-aware pre-conditioning coefficients, we propose the JAQ loss,
which combines text-image compatibility with an image quality metric for
fine-grained optimization. JAQ is reference-free and requires only a handful of
calibration prompts, avoiding full-precision inference during calibration.
Q-Sched delivers substantial gains: a 15.5% FID improvement over the FP16
4-step Latent Consistency Model and a 16.6% improvement over the FP16 8-step
Phased Consistency Model, showing that quantization and few-step distillation
are complementary for high-fidelity generation. A large-scale user study with
more than 80,000 annotations further confirms Q-Sched's effectiveness on both
FLUX.1[schnell] and SDXL-Turbo.
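Scheduler-side pre-conditioning can be sketched as a learned per-step coefficient applied inside a few-step sampler update; the update rule below is a simplified assumption for illustration, not Q-Sched's exact scheduler:

```python
# One schematic few-step sampler update where a learned coefficient c_t
# rescales the quantized model's noise prediction; model weights stay intact.
import torch

def qsched_step(x_t: torch.Tensor, eps_pred: torch.Tensor,
                sigma_t: float, sigma_next: float, c_t: float) -> torch.Tensor:
    """c_t is a learned quantization-aware pre-conditioning coefficient."""
    x0_hat = x_t - c_t * sigma_t * eps_pred    # coefficient corrects quantized eps
    return x0_hat + sigma_next * eps_pred      # re-noise toward the next level
```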