2025-10-24
Table of Contents
- ToolDreamer Instilling LLM Reasoning Into Tool Retrievers
- GaLLoP Gradient-based Sparse Learning on Low-Magnitude Parameters
- CommonSense Efficient Set Intersection (SetX) Protocol Based on Compressed Sensing
- Dictionary learning methods for brain activity mapping with MEG data
- Are Large Language Models Sensitive to the Motives Behind Communication?
- CircuitGuard Mitigating LLM Memorization in RTL Code Generation Against IP Leakage
- CoSense-LLM Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
- Overlap-weighted orthogonal meta-learner for treatment effect estimation over time
- ELUTQ Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices
- MSC-Bench A Rigorous Benchmark for Multi-Server Tool Orchestration
- Monitoring LLM-based Multi-Agent Systems Against Corruptions via Node Evaluation
- MoE-Prism Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
- Multi-code rate Task-Oriented Communication for Multi-Edge Cooperative Inference
- LAPRAD LLM-Assisted PRotocol Attack Discovery
- RLBoost Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs
- Tibetan Language and AI A Comprehensive Survey of Resources, Methods and Challenges
- An Efficient Calibration Framework for Volatility Derivatives under Rough Volatility with Jumps
- From Memorization to Generalization Fine-Tuning Large Language Models for Biomedical Term-to-Identifier Normalization
- CLiVR Conversational Learning System in Virtual Reality with AI-Powered Patients
- An Encoder-Decoder Foundation Chemical Language Model for Generative Polymer Design
- Dimensionality Reduction for Remote Sensing Data Analysis A Systematic Review of Methods and Applications
- MTraining Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
- mSQUID Model-Based Learned Modulo Recovery at Low Sampling Rates
- SSD Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation
- Fetch.ai An Architecture for Modern Multi-Agent Systems
- Reasoning Language Model Inference Serving Unveiled An Empirical Study
- Binary Quadratic Quantization Beyond First-Order Quantization for Real-Valued Matrix Compression
- C-SWAP Explainability-Aware Structured Pruning for Efficient Neural Networks Compression
- Tokencake A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications
- EfficientNav Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval
- LLMs as Sparse Retrievers A Framework for First-Stage Product Search
- From Quarter to All Accelerating Speculative LLM Decoding via Floating-Point Exponent Remapping and Parameter Sharing
- The Attribution Story of WhisperGate An Academic Perspective
- LAFA Agentic LLM-Driven Federated Analytics over Decentralized Data Sources
- DART A Structured Dataset of Regulatory Drug Documents in Italian for Clinical NLP
- CircuitSeer Mining High-Quality Data by Probing Mathematical Reasoning Circuits in LLMs
- Adamas Hadamard Sparse Attention for Efficient Long-Context Inference
- How2Compress Scalable and Efficient Edge Video Analytics via Adaptive Granular Video Compression
- MENTOR A Reinforcement Learning Framework for Model Enhancement via Teacher-Optimized Rewards in Small Models
- S2AP Score-space Sharpness Minimization for Adversarial Pruning
- Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers
- Learning Human-Object Interaction as Groups
- Text or Pixels? It Takes Half On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
- StreamingTOM Streaming Token Compression for Efficient Video Understanding
- Learning from the Best, Differently A Diversity-Driven Rethinking on Data Selection
- DeepSeek-OCR Contexts Optical Compression
- Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
- Extracting Rule-based Descriptions of Attention Features in Transformers
- Any-Depth Alignment Unlocking Innate Safety Alignment of LLMs to Any-Depth
- MEG-GPT A transformer-based foundation model for magnetoencephalography data
- CompactPrompt A Unified Pipeline for Prompt Data Compression in LLM Workflows
- OPTAGENT Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning
- From Local to Global Revisiting Structured Pruning Paradigms for Large Language Models
- Glyph Scaling Context Windows via Visual-Text Compression
- Beyond More Context Retrieval Diversity Boosts Multi-Turn Intent Understanding
- ZACH-ViT A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification
- Language Confusion Gate Language-Aware Decoding Through Model Self-Distillation
- TabR1 Taming GRPO for tabular reasoning LLMs
- M2H Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception
- Localist LLMs with Recruitment Learning
- Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems
- StreamingThinker Large Language Models Can Think While Reading
- DSEBench A Test Collection for Explainable Dataset Search with Examples
- CosmoCore Affective Dream-Replay Reinforcement Learning for Code Generation
- ZSPAPrune Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models
- When AI companions become witty Can human brain recognize AI-generated irony?
- ParaVul A Parallel Large Language Model and Retrieval-Augmented Framework for Smart Contract Vulnerability Detection
- Can Transformer Memory Be Corrupted? Investigating Cache-Side Vulnerabilities in Large Language Models
- Enrich and Detect Video Temporal Grounding with Multimodal LLMs
- UniGTE Unified Graph-Text Encoding for Zero-Shot Generalization across Graph Tasks and Domains
- ArmFormer Lightweight Transformer Architecture for Real-Time Multi-Class Weapon Segmentation and Classification
- Neuronal Group Communication for Efficient Neural Representation
- Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads
- Mixed-Precision Quantization for Language Models Techniques and Prospects
- 3D-GSRD 3D Molecular Graph Auto-Encoder with Selective Re-mask Decoding
- EMRRG Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation
- L-MoE End-to-End Training of a Lightweight Mixture of Low-Rank Adaptation Experts
- ELMM Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion
- An Efficient Semantic Segmentation Decoder for In-Car or Distributed Applications
- Long-Context Attention Benchmark From Kernel Efficiency to Distributed Context Parallelism
- U-Codec Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation
- Count Counts Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards
- VisionSelector End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
- SHIELD Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense
- Human-Aligned Code Readability Assessment with Large Language Models
- Ripple Effect Protocol Coordinating Agent Populations
- Language over Content Tracing Cultural Understanding in Multilingual Large Language Models
- Hybrid CNN-Transformer Based Sparse Channel Prediction for High-Mobility OTFS Systems
- HGC-Avatar Hierarchical Gaussian Compression for Streamable Dynamic 3D Avatars
- FrugalPrompt Reducing Contextual Overhead in Large Language Models via Token Attribution
- Learning to Optimize Edge Robotics A Fast Integrated Perception-Motion-Communication Approach
- FourierCompress Layer-Aware Spectral Activation Compression for Efficient and Accurate Collaborative LLM Inference
- Longwave-transparent low-emissivity material
- Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with Prior
- Does GenAI Rewrite How We Write? An Empirical Study on Two-Million Preprints
- What Limits Agentic Systems Efficiency?
- One-Bit Quantization for Random Features Models
- SentinelNet Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection
ToolDreamer Instilling LLM Reasoning Into Tool Retrievers
Authors: Saptarshi Sengupta, Zhengyu Zhou, Jun Araki, Xingbo Wang, Bingqing Wang, Suhang Wang, Zhe Feng
2025-10-22
Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM's context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval, as user requests are often poorly aligned with the language of TDs. To remedy the issue, we propose ToolDreamer, a framework to condition retriever models to fetch tools based on hypothetical (synthetic) TDs generated using an LLM, i.e., descriptions of tools that the LLM feels would be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TDs. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, thus showcasing its flexibility. Through our proposed framework, our aim is to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.
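The hypothetical-TD idea resembles query-side generation for retrieval and can be sketched as follows. Here `dream_tool_description` is a hard-coded stand-in for the LLM call, and the bag-of-words embedding is a stand-in for a real sparse or dense retriever; both are illustrative assumptions, not the paper's components.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words embedding; a real system would use a trained retriever.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dream_tool_description(query):
    # Placeholder for the LLM that writes a hypothetical tool description (TD)
    # for the query; hard-coded here purely for illustration.
    return "tool that converts an amount of money between currencies"

def retrieve(query, tool_descriptions, k=1):
    # Rank real TDs against the hypothetical TD instead of the raw query,
    # so matching happens in the language space of tool descriptions.
    hypo = embed(dream_tool_description(query))
    ranked = sorted(tool_descriptions,
                    key=lambda td: cosine(embed(td), hypo), reverse=True)
    return ranked[:k]

tds = [
    "tool that converts an amount of money between currencies using live rates",
    "tool that reports current weather conditions for a city",
]
print(retrieve("how many euros is 100 dollars?", tds))
```

The raw query ("euros", "dollars") shares almost no vocabulary with the currency tool's TD, but the dreamed TD does, which is the alignment gap the framework targets.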
GaLLoP Gradient-based Sparse Learning on Low-Magnitude Parameters
Authors: Anand Choudhary, Yasser Sulaiman, Lukas Mauch, Ghouthi Boukli Hacene, Fabien Cardinaux, Antoine Bosselut
2025-10-22
Sparse fine-tuning techniques adapt LLMs to downstream tasks by tuning only a subset of model parameters. However, the effectiveness of sparse adaptation depends on optimally selecting the model parameters to be fine-tuned. In this work, we introduce a novel sparse fine-tuning technique named GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters, which
fine-tunes only those model parameters which have the largest gradient
magnitudes on downstream tasks and the smallest pre-trained magnitudes,
intuitively prioritizing parameters that are highly task-relevant, but
minimally disruptive to pre-trained knowledge. Our experimentation with LLaMA3
8B and Gemma 2B as base models shows that GaLLoP consistently improves or
matches the in-distribution as well as out-of-distribution performance obtained
via the usage of other leading parameter-efficient fine-tuning techniques,
including LoRA, DoRA, and SAFT. Our analysis demonstrates that GaLLoP mitigates
catastrophic forgetting and memorization of task data, as important pre-trained
parameters remain unchanged, and stabilizes performance relative to other
fine-tuning techniques, robustly generalizing across most random seeds.
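The selection rule can be sketched as a mask over parameters. The abstract states the two criteria (largest gradient magnitude, smallest pre-trained magnitude) but not how they are combined; the rank-sum score below is an assumption for illustration.

```python
import numpy as np

def gallop_mask(weights, grads, sparsity=0.9):
    """Select the fraction (1 - sparsity) of parameters to fine-tune: those
    ranking high on gradient magnitude AND low on pre-trained magnitude.
    The rank-sum combination of the two criteria is an assumption; the
    abstract only names the criteria themselves."""
    k = int(round((1 - sparsity) * weights.size))
    # argsort(argsort(x)) yields the rank of each entry in ascending order.
    grad_rank = np.argsort(np.argsort(np.abs(grads)))      # larger |grad| -> larger rank
    small_rank = np.argsort(np.argsort(-np.abs(weights)))  # smaller |weight| -> larger rank
    score = grad_rank + small_rank
    mask = np.zeros(weights.size, dtype=bool)
    mask[np.argsort(-score)[:k]] = True                    # tune only the top-k scored
    return mask
```

During fine-tuning, gradient updates would be applied only where the mask is true, leaving the remaining pre-trained parameters untouched (which is what the abstract credits for reduced forgetting).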
CommonSense Efficient Set Intersection (SetX) Protocol Based on Compressed Sensing
Authors: Jingfan Meng, Tianji Yang, Jun Xu
2025-10-22
In the set reconciliation (\textsf{SetR}) problem, two parties Alice and Bob, holding sets $A$ and $B$, communicate to learn the symmetric difference $A \triangle B$. In this work, we study a related but under-explored problem: set intersection (\textsf{SetX})~\cite{Ozisik2019}, where both parties learn $A \cap B$ instead. However, existing solutions typically reuse \textsf{SetR} protocols due to the absence of dedicated \textsf{SetX} protocols and the misconception that \textsf{SetR} and \textsf{SetX} have comparable costs. Observing that \textsf{SetX} is fundamentally cheaper than \textsf{SetR}, we developed a multi-round \textsf{SetX} protocol that outperforms the information-theoretic lower bound of the \textsf{SetR} problem. In our \textsf{SetX} protocol, Alice sends Bob a compressed sensing (CS) sketch of $A$ to help Bob identify his unique elements (those in $B \setminus A$). This solves the \textsf{SetX} problem if the sketch decodes fully. Otherwise, Bob sends a CS sketch of the residue (a set of elements he cannot decode) back to Alice for her to identify her unique elements (those in $A \setminus B$). As such, Alice and Bob communicate back and forth, updating the estimated intersection and repeating until both parties agree on $A \cap B$. On real-world datasets, experiments show that our protocol reduces the communication cost by 8 to 10 times compared to the IBLT-based protocol.
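Setting the CS machinery aside, the set identity the protocol exploits is simple: once one party knows its own unique elements, the intersection follows locally, so communication needs to scale with the difference rather than with the sets themselves.

```python
A = {1, 2, 3, 4, 5}
B = {3, 4, 5, 6, 7}

# In the protocol, Bob recovers his unique elements B \ A by decoding
# Alice's CS sketch against his own set; here we compute it directly.
uniq_B = B - A                # {6, 7}

# The intersection then follows without either full set being transmitted.
intersection = B - uniq_B     # {3, 4, 5}
assert intersection == A & B
```

Since `uniq_B` has only |B \ A| elements, a sketch sized to the (small) difference suffices, which is the intuition behind SetX being cheaper than SetR.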
Dictionary learning methods for brain activity mapping with MEG data
Authors: Daniela Calvetti, Erkki Somersalo
2025-10-22
A central goal in many brain studies is the identification of those brain
regions that are activated during an observation window that may correspond to
a motor task, a stimulus, or simply a resting state. While functional MRI is
currently the most commonly employed modality for such tasks, methods based on the electromagnetic activity of the brain are valuable alternatives because of their excellent time resolution and the fact that the measured signals are directly related to brain activation and not to a secondary effect such as the
hemodynamic response. In this work we focus on the MEG modality, investigating
the performance of a recently proposed Bayesian dictionary learning (BDL)
algorithm for brain region identification. The partitioning of the source space
into the 148 regions of interest (ROI) corresponding to parcellation of the
Destrieux atlas provides a natural determination of the subdictionaries
necessary for the BDL algorithm. We design a simulation protocol where a small
randomly selected patch in each ROI is activated, the MEG signal is computed
and the inverse problem of active brain region identification is solved using
the BDL algorithm. The BDL algorithm consists of two phases: the first comprising dictionary compression and Bayesian error analysis, and the second performing dictionary coding with a deflated dictionary built on the output of the first phase, both steps relying on Bayesian sparsity-promoting computations. For assessing the performance, we give a probabilistic
interpretation of the confusion matrix, and consider different impurity
measures for a multi-class classifier.
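For intuition, a greedy sparse-coding pass over ROI subdictionaries looks like the following. This is a generic matching-pursuit stand-in, not the Bayesian BDL algorithm itself; the subdictionary structure (one block of atoms per ROI) is the part taken from the abstract.

```python
import numpy as np

def roi_matching_pursuit(signal, subdicts, n_iter=1):
    """Greedy stand-in for sparsity-promoting ROI identification: repeatedly
    pick the ROI subdictionary whose atoms best explain the residual, fit it,
    and deflate. The actual BDL algorithm is Bayesian, not this greedy rule."""
    residual = signal.astype(float).copy()
    active = []
    for _ in range(n_iter):
        # Score each ROI by its best-matching atom's correlation with the residual.
        scores = {roi: np.max(np.abs(D.T @ residual)) for roi, D in subdicts.items()}
        roi = max(scores, key=scores.get)
        active.append(roi)
        D = subdicts[roi]
        # Least-squares fit of this ROI's atoms, then deflate the residual.
        coef, *_ = np.linalg.lstsq(D, residual, rcond=None)
        residual = residual - D @ coef
    return active
```

With 148 Destrieux ROIs, `subdicts` would hold 148 blocks of lead-field-derived atoms; the returned list names the ROIs judged active.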
Are Large Language Models Sensitive to the Motives Behind Communication?
Authors: Addison J. Wu, Ryan Liu, Kerem Oktar, Theodore R. Sumers, Thomas L. Griffiths
2025-10-22
Human communication is motivated: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans' intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self-serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source -- for instance, weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs' behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human-like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents' information ecosystems. In these settings, we find that LLMs' inferences do not track the rational models' predictions nearly as closely -- partly due to additional information that distracts them from vigilance-relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real-world settings will require further improvements to these models.
CircuitGuard Mitigating LLM Memorization in RTL Code Generation Against IP Leakage
Authors: Nowfel Mashnoor, Mohammad Akyash, Hadi Kamali, Kimia Azar
2025-10-22
Large Language Models (LLMs) have achieved remarkable success in generative tasks, including register-transfer level (RTL) hardware synthesis. However, their tendency to memorize training data poses critical risks when proprietary or security-sensitive designs are unintentionally exposed during inference. While prior work has examined memorization in natural language, RTL introduces unique challenges: in RTL, structurally different implementations (e.g., behavioral vs. gate-level descriptions) can realize the same hardware, leading to intellectual property (IP) leakage (full or partial) even without verbatim copying. Conversely, even small syntactic variations (e.g., operator precedence or blocking vs. non-blocking assignments) can drastically alter circuit behavior, making correctness preservation especially challenging. In this work, we systematically study memorization in RTL code generation and propose CircuitGuard, a defense strategy that balances leakage reduction with correctness preservation. CircuitGuard (1) introduces a novel RTL-aware similarity metric that captures both structural and functional equivalence beyond surface-level overlap, and (2) develops an activation-level steering method that identifies and attenuates the components most responsible for memorization. Our empirical evaluation demonstrates that CircuitGuard identifies (and isolates) 275 memorization-critical features across layers 18-28 of the Llama 3.1-8B model, achieving up to 80% reduction in semantic similarity to proprietary patterns while maintaining generation quality. CircuitGuard further shows 78-85% cross-domain transfer effectiveness, enabling robust memorization mitigation across circuit categories without retraining.
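A generic activation-steering step of the kind described (the abstract does not give CircuitGuard's exact update rule, so the scaling form below is an assumption) removes part of each activation's component along identified memorization-critical directions:

```python
import numpy as np

def attenuate_features(activations, feature_dirs, alpha=0.8):
    """Scale down the component of an activation vector along each identified
    memorization-critical direction. alpha=1 removes the component entirely;
    the specific update rule is an illustrative assumption."""
    h = activations.astype(float).copy()
    for v in feature_dirs:
        v = v / np.linalg.norm(v)        # unit feature direction
        h = h - alpha * (h @ v) * v      # subtract alpha of the projection onto v
    return h
```

In a real deployment this would be applied to hidden states of the targeted layers (18-28 in the paper's Llama 3.1-8B study) during generation.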
CoSense-LLM Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
Authors: Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe
2025-10-22
We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site-specific policies and notes; (iii) PromptRouter, a cost- and uncertainty-aware policy that selects edge-only generation, edge-plus-retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention-style kernels, speculative decoding, and quantized LoRA adapters, and supports on-device personalization and federated updates under non-IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service-level objectives: it sustains sub-second (p95) end-to-end latency on edge-dominant paths, reduces inter-tier token and bandwidth costs by preferring local retrieval-grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and accelerators lower energy per decision. The results support an edge-first design that treats semantics, privacy, and predictable latency as co-equal goals for large model deployments in interference-prone environments.
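The three-way PromptRouter decision can be sketched as a threshold policy. The thresholds and inputs below are hypothetical; the paper's policy is cost- and uncertainty-aware but its exact form is not given in the abstract.

```python
def route(uncertainty, local_cost_ms, slo_ms, u_low=0.2, u_high=0.6):
    """Toy three-way routing decision. `uncertainty` is a calibrated score in
    [0, 1]; u_low/u_high are hypothetical thresholds, and the SLO check on the
    edge path is likewise an illustrative assumption."""
    if uncertainty < u_low and local_cost_ms <= slo_ms:
        return "edge-only"            # confident and fast enough locally
    if uncertainty < u_high:
        return "edge+retrieval"       # ground locally via Edge-RAG first
    return "cloud-escalation"         # too uncertain: escalate compact codes
```

The useful property of such a policy is that escalation (the expensive tier) is reserved for high-uncertainty cases, which is how the system trades bandwidth cost against answer quality.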
Overlap-weighted orthogonal meta-learner for treatment effect estimation over time
Authors: Konstantin Hess, Dennis Frauen, Mihaela van der Schaar, Stefan Feuerriegel
2025-10-22
Estimating heterogeneous treatment effects (HTEs) in time-varying settings is particularly challenging, as the probability of observing certain treatment sequences decreases exponentially with longer prediction horizons. Thus, the observed data contain little support for many plausible treatment sequences, which creates severe overlap problems. Existing meta-learners for the time-varying setting typically assume adequate treatment overlap, and thus suffer from exploding estimation variance when the overlap is low. To address this problem, we introduce a novel overlap-weighted orthogonal (WO) meta-learner for estimating HTEs that targets regions in the observed data with high probability of receiving the interventional treatment sequences. This offers a fully data-driven approach through which our WO-learner can counteract instabilities as in existing meta-learners and thus obtain more reliable HTE estimates. Methodologically, we develop a novel Neyman-orthogonal population risk function that minimizes the overlap-weighted oracle risk. We show that our WO-learner has the favorable property of Neyman-orthogonality, meaning that it is robust against misspecification in the nuisance functions. Further, our WO-learner is fully model-agnostic and can be applied to any machine learning model. Through extensive experiments with both transformer and LSTM backbones, we demonstrate the benefits of our novel WO-learner.
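For intuition, in the static one-step setting overlap weights take the standard form (the paper's time-varying construction generalizes this; the formula below is the textbook version, not taken from the abstract):

```latex
w(x) = \pi(x)\bigl(1 - \pi(x)\bigr), \qquad \pi(x) = \Pr(A = 1 \mid X = x)
```

The weight peaks where $\pi(x) = 1/2$ and vanishes as $\pi(x) \to 0$ or $1$, so regions with little chance of receiving (or not receiving) treatment contribute little to the risk, which is exactly what tames the exploding variance under low overlap.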
ELUTQ Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices
Authors: Xin Nie, Liang Dong, HaiCheng Zhang, JiaWang Xiao, G. Sun
2025-10-22
The deployment of Large Language Models (LLMs) on CPU-based edge devices is crucial for enabling on-device intelligence and expanding AI accessibility. However, it remains challenging due to limited memory and computational resources. During edge inference, memory usage and latency are the primary bottlenecks. Although weight quantization can effectively reduce memory consumption, existing hardware-friendly approaches often rely on uniform quantization, which poorly fits weight distributions and incurs high dequantization overhead at low bit widths. To address these limitations, we propose ELUTQ, an efficient quantization framework introducing a novel quantization format, Hierarchical Linear Quantization (HLQ). HLQ better captures the statistical characteristics of weights without increasing the computational cost of bit-serial LUT-based GEMM operations, thereby eliminating dequantization overhead. It is orthogonal to existing quantization algorithms and can be seamlessly integrated into various quantization pipelines. For efficient on-device deployment, ELUTQ provides optimized CPU kernels for end-to-end inference. Experiments show that for LLaMA3-8B, HLQ reduces perplexity by about 8% at 3-bit and 85% at 2-bit precision under post-training quantization, completing quantization within one hour. With efficient finetuning, HLQ further improves 2-bit performance within two hours. In terms of inference efficiency, our 2-bit LLaMA2-7B achieves over 25 tokens/s on an Apple M2 chip (4 threads, batch size = 1).
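For reference, the uniform affine scheme that the abstract says fits LLM weight distributions poorly at low bit widths looks like this (HLQ itself is not specified in the abstract, so only the baseline is shown):

```python
import numpy as np

def uniform_quant(w, bits=3):
    """Uniform affine quantization: map the weight range onto 2**bits evenly
    spaced levels. A single (scale, offset) pair per tensor is what makes it
    hardware-friendly but a poor fit for bell-shaped weight distributions."""
    qmax = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.round((w - lo) / scale).astype(np.int64)  # stored integer codes
    return q, scale, lo

def dequant(q, scale, lo):
    # The dequantization step whose overhead LUT-based GEMM avoids.
    return q * scale + lo
```

Because the levels are evenly spaced, outliers stretch the range and waste levels where few weights live, which is the mismatch a non-uniform format like HLQ is designed to address.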
MSC-Bench A Rigorous Benchmark for Multi-Server Tool Orchestration
Authors: Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, Yi-Tien Tsai
2025-10-22
We introduce MSC-Bench, a large-scale benchmark for evaluating multi-hop,
end-to-end tool orchestration by agents in a hierarchical Model-Context
Protocol (MCP) ecosystem. Existing benchmarks often evaluate tools in
isolation, ignoring challenges such as functional overlap and cross-server orchestration, leading to overly optimistic assessments. MSC-Bench addresses these gaps by constructing ground truth through 'equal function sets', allowing objective metrics such as F1 score and reducing the dependency on LLM-as-a-judge evaluation. Organized as a five-level curriculum, it
systematically tests agent capabilities from single-tool orchestration to
complex cross-server planning, and robustness to out-of-scope requests.
Experiments reveal that rigid hierarchies can hinder performance without
co-designed strategies, and even state-of-the-art agents exhibit systemic
weaknesses in robustness. MSC-Bench provides a diagnostic framework to expose
these limitations and guide the development of more capable and efficient
tool-using agents. The benchmark and resources are publicly available at
https://github.com/snooow1029/MSC_Bench.
Monitoring LLM-based Multi-Agent Systems Against Corruptions via Node Evaluation
Authors: Chengcan Wu, Zhixin Zhang, Mingqian Xu, Zeming Wei, Meng Sun
2025-10-22
Large Language Model (LLM)-based Multi-Agent Systems (MAS) have become a popular paradigm of AI applications. However, trustworthiness issues in MAS remain a critical concern. Unlike challenges in single-agent systems, MAS involve more complex communication processes, making them susceptible to corruption attacks. To mitigate this issue, several defense mechanisms have been developed based on the graph representation of MAS, where agents represent nodes and communications form edges. Nevertheless, these methods predominantly focus on static graph defense, attempting to either detect attacks in a fixed graph structure or optimize a static topology with certain defensive capabilities. To address this limitation, we propose a dynamic defense paradigm for MAS graph structures, which continuously monitors communication within the MAS graph, then dynamically adjusts the graph topology, accurately disrupts malicious communications, and effectively defends against evolving and diverse dynamic attacks. Experimental results in increasingly complex and dynamic MAS environments demonstrate that our method significantly outperforms existing MAS defense mechanisms, contributing an effective guardrail for their trustworthy applications. Our code is available at https://github.com/ChengcanWu/Monitoring-LLM-Based-Multi-Agent-Systems.
MoE-Prism Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
Authors: Xinfeng Xia, Jiacheng Liu, Xiaofeng Hou, Peng Tang, Mingxuan Zhang, Wenfeng Wang, Chao Li
2025-10-22
Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI,
achieve high quality by sparsely activating parameters. However, their reliance
on routing between a few monolithic experts via a top-k mechanism creates a
"quality cliff", offering only a few coarse-grained operating points. This
inflexibility forces a difficult trade-off between cost and quality, preventing
adaptation to diverse Service Level Objectives (SLOs) and leading to
significant resource over-provisioning.
This paper introduces MoE-Prism, a model-system co-design that transforms
rigid MoE models into elastic services. Our methodology is divided into two
phases. First, an \emph{Offline Refactoring Engine} systematically deconstructs
monolithic experts into fine-grained "sub-experts." This engine employs a
partitioning optimization solver that uses a metaheuristic-based approach to
group neurons, pre
functional locality without requiring retraining.
Second, an \emph{Online Scheduling Engine} leverages this new elasticity
through QoS-aware scheduling. It implements specialized policies to solve
complex system problems, including maximizing throughput in cloud deployments
and managing latency-optimized offloading for memory-constrained devices. Our
evaluation across three different MoE models shows that MoE-Prism provides over 4 times more distinct, stable operating points than the baseline. This allows an AI service to dynamically improve throughput by up to 19.9% under a strict latency budget or reduce latency by up to 10.36% under limited resources.
MoE-Prism provides the critical "control knob" to bridge the model-system gap,
enabling the next generation of adaptive, efficient, and QoS-aware AI services.
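The "quality cliff" comes from standard top-k gating: only whole experts can be switched on or off, so the compute/quality trade-off moves in coarse jumps. A minimal sketch of that baseline gate (not MoE-Prism's refactored sub-expert routing):

```python
import numpy as np

def topk_gate(logits, k):
    """Standard top-k MoE gating: keep the k largest router logits, softmax
    over just those, and zero the rest. With a handful of monolithic experts,
    each increment of k is a large, coarse step in compute and quality."""
    idx = np.argsort(-logits)[:k]              # indices of the top-k experts
    gates = np.zeros_like(logits)
    ex = np.exp(logits[idx] - logits[idx].max())  # stable softmax over top-k
    gates[idx] = ex / ex.sum()
    return gates
```

Splitting each expert into many sub-experts multiplies the number of distinct k values the scheduler can choose from, which is the elasticity MoE-Prism exposes to its online scheduling engine.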
Multi-code rate Task-Oriented Communication for Multi-Edge Cooperative Inference
Authors: Dongwon Kim, Jiwan Seo, Joonhyuk Kang
2025-10-22
The integration of artificial intelligence (AI) with the internet of things (IoT) enables task-oriented communication for multi-edge cooperative inference systems, where edge devices transmit extracted features of local sensory data to an edge server to perform AI-driven tasks. However, privacy concerns and limited communication bandwidth pose fundamental challenges, since simultaneous transmission of extracted features with a single fixed compression ratio from all devices leads to severe inefficiency in communication resource utilization. To address this challenge, we propose a framework that dynamically adjusts the code rate in feature extraction based on its importance to the downstream inference task by adopting a rate-adaptive quantization (RAQ) scheme. Furthermore, to select the code rate for each edge device under a limited bandwidth constraint, a dynamic programming (DP) approach is leveraged to allocate the code rate across discrete code rate options. Experiments on multi-view datasets demonstrate that the proposed frameworks significantly outperform frameworks using fixed-rate quantization, achieving a favorable balance between communication efficiency and inference performance under limited bandwidth conditions.
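The per-device rate selection is a grouped-knapsack-style DP: each device must pick exactly one discrete code-rate option, and the total bandwidth must stay within budget. The formulation below is illustrative (the paper's exact DP, utility measure, and units are not given in the abstract).

```python
def allocate_rates(utilities, costs, budget):
    """utilities[d][r]: inference benefit of giving device d its r-th code-rate
    option; costs[d][r]: integer bandwidth units that option consumes.
    Returns (chosen option index per device, total utility). Assumes at least
    one feasible allocation exists."""
    n = len(utilities)
    NEG = float("-inf")
    best = [NEG] * (budget + 1)
    best[0] = 0.0
    choice = [[None] * (budget + 1) for _ in range(n)]
    for d in range(n):                       # one DP layer per device
        new = [NEG] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == NEG:
                continue
            for r, (u, c) in enumerate(zip(utilities[d], costs[d])):
                nb = b + c
                if nb <= budget and best[b] + u > new[nb]:
                    new[nb] = best[b] + u
                    choice[d][nb] = (b, r)   # remember predecessor and option
        best = new
    b = max(range(budget + 1), key=lambda i: best[i])
    total = best[b]
    rates = [0] * n
    for d in range(n - 1, -1, -1):           # backtrack the chosen options
        prev_b, r = choice[d][b]
        rates[d] = r
        b = prev_b
    return rates, total
```

For example, with two devices, two rate options each, and a budget of 4 units, the DP picks the cheap option for one device so the other can afford its high-utility rate.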
LAPRAD LLM-Assisted PRotocol Attack Discovery
Authors: R. Can Aygun, Yehuda Afek, Anat Bremler-Barr, Leonard Kleinrock
2025-10-22
With the goal of improving the security of Internet protocols, we seek
faster, semi-automatic methods to discover new vulnerabilities in protocols
such as DNS, BGP, and others. To this end, we introduce the LLM-Assisted Protocol Attack Discovery (LAPRAD) methodology, enabling security researchers with some DNS knowledge to efficiently uncover vulnerabilities that would otherwise be hard to detect.
LAPRAD follows a three-stage process. In the first, we consult an LLM (GPT-o1) that has been trained on a broad corpus of DNS-related sources and
previous DDoS attacks to identify potential exploits. In the second stage, a
different LLM automatically constructs the corresponding attack configurations
using the ReACT approach implemented via LangChain (DNS zone file generation).
Finally, in the third stage, we validate the attack's functionality and
effectiveness.
Using LAPRAD, we uncovered three new DDoS attacks on the DNS protocol and
rediscovered two recently reported ones that were not included in the LLM's training data. The first new attack employs a bait-and-switch technique to
trick resolvers into caching large, bogus DNSSEC RRSIGs, reducing their
capacity to as little as 6%. The second exploits large DNSSEC encryption
algorithms (RSA-4096) with multiple keys, thereby bypassing a recently
implemented default RRSet limit. The third leverages ANY-type responses to
produce a similar effect.
These variations of a cache-flushing DDoS attack, called SigCacheFlush,
circumvent existing patches, severely degrade resolver query capacity, and
impact the latest versions of major DNS resolver implementations.
RLBoost Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs
Authors: Yongji Wu, Xueshen Liu, Haizhong Zheng, Juncheng Gu, Beidi Chen, Z. Morley Mao, Arvind Krishnamurthy, Ion Stoica
2025-10-22
Reinforcement learning (RL) has become essential for unlocking advanced
reasoning capabilities in large language models (LLMs). RL workflows involve
interleaving rollout and training stages with fundamentally different resource
requirements. Rollout typically dominates overall execution time, yet scales
efficiently through multiple independent instances. In contrast, training
requires tightly-coupled GPUs with full-mesh communication. Existing RL frameworks fall into two categories: co-located and disaggregated architectures. Co-located ones fail to address this resource tension by forcing
architectures. Co-located ones fail to address this resource tension by forcing
both stages to share the same GPUs. Disaggregated architectures, without
modifications of well-established RL algorithms, suffer from resource
under-utilization. Meanwhile, preemptible GPU resources, i.e., spot instances
on public clouds and spare capacity in production clusters, present significant
cost-saving opportunities for accelerating RL workflows, if efficiently
harvested for rollout.
In this paper, we present RLBoost, a systematic solution for cost-efficient
RL training that harvests preemptible GPU resources. Our key insight is that
rollout's stateless and embarrassingly parallel nature aligns perfectly with
preemptible and often fragmented resources. To efficiently utilize these
resources despite frequent and unpredictable availability changes, RLBoost
adopts a hybrid architecture with three key techniques: (1) adaptive rollout
offload to dynamically adjust workloads on the reserved (on-demand) cluster,
(2) pull-based weight transfer that quickly provisions newly available
instances, and (3) token-level response collection and migration for efficient
preemption handling and continuous load balancing. Extensive experiments show
RLBoost increases training throughput by 1.51x-1.97x while improving cost
efficiency by 28%-49% compared to using only on-demand GPU resources.
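The token-level response collection step can be pictured with a toy sketch (hypothetical code, not RLBoost's implementation): a preempted rollout hands its partial token sequence to a surviving instance, which resumes generation rather than restarting from scratch.

```python
# Toy sketch of token-level response collection and migration. The
# `step_fn` below is a hypothetical stand-in for one decoding step.
def generate(prompt, resume_tokens, budget, step_fn):
    """Generate `budget` new tokens, resuming from `resume_tokens`."""
    tokens = list(resume_tokens)
    for _ in range(budget):
        tokens.append(step_fn(prompt, tokens))
    return tokens

def rollout_with_preemption(prompt, total_tokens, preempt_after, step_fn):
    """A spot instance is preempted mid-rollout; the partial response is
    collected token-by-token and migrated to a surviving instance."""
    partial = generate(prompt, [], preempt_after, step_fn)
    return generate(prompt, partial, total_tokens - len(partial), step_fn)

# With a deterministic step function, the migrated rollout matches an
# uninterrupted one, so no generated work is lost on preemption.
step = lambda prompt, toks: len(toks)
assert rollout_with_preemption("p", 8, 3, step) == generate("p", [], 8, step)
```

Because rollout is stateless, the only state that must move is the token list itself, which is what makes harvesting preemptible capacity cheap.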
Tibetan Language and AI A Comprehensive Survey of Resources, Methods and Challenges
Authors: Cheng Huang, Nyima Tashi, Fan Gao, Yutong Liu, Jiahao Li, Hao Tian, Siyang Jiang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Jin Zhang, Xiao Feng, Hao Wang, Jie Tang, Guojie Tang, Xiangxiang Wang, Jia Zhang, Tsengdar Lee, Yongbin Yu
2025-10-22
Tibetan, one of the major low-resource languages in Asia, presents unique
linguistic and sociocultural characteristics that pose both challenges and
opportunities for AI research. Despite increasing interest in developing AI
systems for underrepresented languages, Tibetan has received limited attention
due to a lack of accessible data resources, standardized benchmarks, and
dedicated tools. This paper provides a comprehensive survey of the current
state of the Tibetan language in the AI domain, covering textual and speech
data resources, NLP tasks, machine translation, speech recognition, and recent
developments in LLMs. We systematically categorize existing datasets and tools,
evaluate methods used across different tasks, and compare performance where
possible. We also identify persistent bottlenecks such as data scarcity,
orthographic variation, and the lack of unified evaluation metrics.
Additionally, we discuss the potential of cross-lingual transfer, multi-modal
learning, and community-driven resource creation. This survey aims to serve as
a foundational reference for future work on Tibetan AI research and encourages
collaborative efforts to build an inclusive and sustainable AI ecosystem for
low-resource languages.
An Efficient Calibration Framework for Volatility Derivatives under Rough Volatility with Jumps
Authors: Keyuan Wu, Tenghan Zhong, Yuxuan Ouyang
2025-10-21
We present a fast and robust calibration method for stochastic volatility
models that admit Fourier-analytic transform-based pricing via characteristic
functions. The design is structure-preserving: we keep the original pricing
transform and (i) split the pricing formula into data-independent integrals
and a market-dependent remainder; (ii) precompute those data-independent
integrals with GPU acceleration; and (iii) approximate only the remaining,
market-dependent pricing map with a small neural network. We instantiate the
workflow on a rough volatility model with tempered-stable jumps tailored to
power-type volatility derivatives and calibrate it to VIX options with a
global-to-local search. We verify that a pure-jump rough volatility model
adequately captures the VIX dynamics, consistent with prior empirical findings,
and demonstrate that our calibration method achieves high accuracy and speed.
From Memorization to Generalization Fine-Tuning Large Language Models for Biomedical Term-to-Identifier Normalization
Authors: Suswitha Pericharla, Daniel B. Hier, Tayo Obafemi-Ajayi
2025-10-21
Effective biomedical data integration depends on automated term
normalization, the mapping of natural language biomedical terms to standardized
identifiers. This linking of terms to identifiers is essential for semantic
interoperability. Large language models (LLMs) show promise for this task but
perform unevenly across terminologies. We evaluated both memorization
(training-term performance) and generalization (validation-term performance)
across multiple biomedical ontologies. Fine-tuning Llama 3.1 8B revealed marked
differences by terminology. GO mappings showed strong memorization gains (up to
77% improvement in term-to-identifier accuracy), whereas HPO showed minimal
improvement. Generalization occurred only for protein-gene (GENE) mappings
(13.9% gain), while fine-tuning for HPO and GO yielded negligible transfer.
Baseline accuracy varied by model scale, with GPT-4o outperforming both Llama
variants for all terminologies. Embedding analyses showed tight semantic
alignment between gene symbols and protein names but weak alignment between
terms and identifiers for GO or HPO, consistent with limited lexicalization.
Fine-tuning success depended on two interacting factors: identifier popularity
and lexicalization. Popular identifiers were more likely encountered during
pretraining, enhancing memorization. Lexicalized identifiers, such as gene
symbols, enabled semantic generalization. By contrast, arbitrary identifiers in
GO and HPO constrained models to rote learning. These findings provide a
predictive framework for when fine-tuning enhances factual recall versus when
it fails due to unpopular or non-lexicalized identifiers.
CLiVR Conversational Learning System in Virtual Reality with AI-Powered Patients
Authors: Akilan Amithasagaran, Sagnik Dakshit, Bhavani Suryadevara, Lindsey Stockton
2025-10-21
Simulations constitute a fundamental component of medical and nursing
education and traditionally employ standardized patients (SP) and high-fidelity
manikins to develop clinical reasoning and skills. However, these
methods require substantial resources, limiting accessibility and scalability.
In this study, we introduce CLiVR, a Conversational Learning system in Virtual
Reality that integrates large language models (LLMs), speech processing, and 3D
avatars to simulate realistic doctor-patient interactions. Developed in Unity
and deployed on the Meta Quest 3 platform, CLiVR enables trainees to engage in
natural dialogue with virtual patients. Each simulation is dynamically
generated from a syndrome-symptom database and enhanced with sentiment analysis
to provide feedback on tone. Through an expert user study
involving medical school faculty (n=13), we assessed usability, realism, and
perceived educational impact. Results demonstrated strong user acceptance, high
confidence in educational potential, and valuable feedback for improvement.
CLiVR offers a scalable, immersive supplement to SP-based training.
An Encoder-Decoder Foundation Chemical Language Model for Generative Polymer Design
Authors: Harikrishna Sahu, Wei Xiong, Anagha Savit, Shivank S Shukla, Rampi Ramprasad
2025-10-21
Traditional machine learning has advanced polymer discovery, yet direct
generation of chemically valid and synthesizable polymers without exhaustive
enumeration remains a challenge. Here we present polyT5, an encoder-decoder
chemical language model based on the T5 architecture, trained to understand and
generate polymer structures. polyT5 enables both property prediction and the
targeted generation of polymers conditioned on desired property values. We
demonstrate its utility for dielectric polymer design, seeking candidates with
dielectric constant >3, bandgap >4 eV, and glass transition temperature >400 K,
alongside melt-processability and solubility requirements. From over 20,000
generated promising candidates, one was experimentally synthesized and
validated, showing strong agreement with predictions. To further enhance
usability, we integrated polyT5 within an agentic AI framework that couples it
with a general-purpose LLM, allowing natural language interaction for property
prediction and generative design. Together, these advances establish a
versatile and accessible framework for accelerated polymer discovery.
Dimensionality Reduction for Remote Sensing Data Analysis A Systematic Review of Methods and Applications
Authors: Nathan Mankovich, Kai-Hendrik Cohrs, Homer Durand, Vasileios Sitokonstantinou, Tristan Williams, Gustau Camps-Valls
2025-10-21
Earth observation involves collecting, analyzing, and processing an
ever-growing mass of data. Automatically harvesting information is crucial for
addressing significant societal, economic, and environmental challenges,
ranging from environmental monitoring to urban planning and disaster
management. However, the high dimensionality of these data poses challenges in
terms of inefficiency and the curse of dimensionality, which limits
the effectiveness of machine learning models. Dimensionality reduction (DR)
techniques, specifically feature extraction, address these challenges by
preserving essential data properties while reducing complexity and enhancing
tasks such as data compression, cleaning, fusion, visualization, anomaly
detection, and prediction. This review provides a handbook for leveraging DR
across the remote sensing (RS) data value chain and identifies opportunities
for under-explored DR algorithms and their application in future research.
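As a minimal illustration of DR by feature extraction (a generic sketch, not a specific method from the review), PCA via SVD reduces synthetic 100-band "pixels" to a handful of informative features:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a hyperspectral cube: 1000 pixels x 100 bands,
# with the true signal living in a 5-dimensional subspace plus noise.
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 100))
X = latent @ mixing + 0.05 * rng.normal(size=(1000, 100))

# PCA via SVD of the centered data: keep the leading components,
# reducing 100 bands to 5 features per pixel.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = (S ** 2) / (S ** 2).sum()
k = 5
X_reduced = Xc @ Vt[:k].T   # (1000, 5) feature matrix
```

On data with low intrinsic dimension like this, a few components capture nearly all the variance, which is exactly the property DR methods exploit on remote sensing imagery.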
MTraining Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
Authors: Wenxuan Li, Chengruidong Zhang, Huiqiang Jiang, Yucheng Li, Yuqing Yang, Lili Qiu
2025-10-21
The adoption of long context windows has become a standard feature in Large
Language Models (LLMs), as extended contexts significantly enhance their
capacity for complex reasoning and broaden their applicability across diverse
scenarios. Dynamic sparse attention is a promising approach for reducing the
computational cost of long-context LLMs. However, efficiently training LLMs
with dynamic sparse attention on ultra-long contexts, especially in distributed
settings, remains a significant challenge, due in large part to worker- and
step-level imbalance. This paper introduces MTraining, a novel distributed
methodology leveraging dynamic sparse attention to enable efficient training
for LLMs with ultra-long contexts. Specifically, MTraining integrates three key
components: a dynamic sparse training pattern, balanced sparse ring attention,
and hierarchical sparse ring attention. These components are designed to
synergistically address the computational imbalance and communication overheads
inherent in dynamic sparse attention mechanisms during the training of models
with extensive context lengths. We demonstrate the efficacy of MTraining by
training Qwen2.5-3B, successfully expanding its context window from 32K to 512K
tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite
of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A
Haystack, reveal that MTraining achieves up to a 6x higher training throughput
while preserving model accuracy. Our code is available at
https://github.com/microsoft/MInference/tree/main/MTraining.
mSQUID Model-Based Learned Modulo Recovery at Low Sampling Rates
Authors: Yhonatan Kvich, Rotem Arie, Hana Hasan, Shaik Basheeruddin Shah, Yonina C. Eldar
2025-10-21
Modulo sampling enables acquisition of signals with unlimited dynamic range
by folding the input into a bounded interval prior to sampling, thus
eliminating the risk of signal clipping and preserving information without
requiring high-resolution ADCs. While this enables low-cost hardware, the
nonlinear distortion introduced by folding presents recovery challenges,
particularly under noise and quantization. We propose a model-based deep
unfolding network tailored to this setting, combining the interpretability of
classical compressed sensing (CS) solvers with the flexibility of learning. A
key innovation is a soft-quantization module that encodes the modulo prior by
guiding the solution toward discrete multiples of the folding range in a
differentiable and learnable way. Our method, modulo soft-quantized unfolded
iterative decoder (mSQUID), achieves superior reconstruction performance at low
sampling rates under additive Gaussian noise. We further demonstrate its
utility in a challenging case where signals with vastly different amplitudes
and disjoint frequency bands are acquired simultaneously and folded. In this
scenario, classical sampling often struggles due to weak signal distortion or
strong signal clipping, while our approach is able to recover the input
signals. Our method also offers significantly reduced runtimes, making it
suitable for real-time, resource-limited systems.
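The modulo prior can be sketched in a few lines (an illustrative toy; `soft_multiple` is a hypothetical stand-in for the paper's learnable soft module): folding wraps samples into a bounded interval, and recovery must place the unfolding residual on a discrete multiple of the folding range.

```python
import numpy as np

LAM = 1.0  # folding range: samples are wrapped into [-LAM, LAM)

def fold(x, lam=LAM):
    """Modulo (wrap) x into [-lam, lam), as a modulo ADC would."""
    return (x + lam) % (2 * lam) - lam

def soft_multiple(z, lam=LAM, tau=0.1):
    """Differentiable surrogate that nudges z toward the nearest
    multiple of 2*lam (the discrete set the modulo prior lives on).
    tau -> 0 recovers hard rounding; a learnable temperature in spirit."""
    k = np.floor(z / (2 * lam) + 0.5)   # nearest multiple index
    hard = 2 * lam * k
    return (1 - tau) * hard + tau * z   # soft, differentiable blend

# Recovery for one sample: the unknown is the integer fold count; in the
# hard limit the module snaps the residual x - fold(x) onto 2*lam*k.
x = 3.7
y = fold(x)                             # observed folded sample: -0.3
residual = soft_multiple(x - y, tau=0.0)
assert abs((y + residual) - x) < 1e-9   # unfolding recovers the input
```

In the actual network, modules of this kind sit inside unrolled solver iterations, so the fold-count decision is learned jointly with the CS-style reconstruction.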
SSD Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation
Authors: Siyong Jian, Huan Wang
2025-10-21
Autoregressive image generation models like Janus-Pro produce high-quality
images, but at the significant cost of high memory and ever-growing
computational demands due to the large number of visual tokens. While KV cache
compression has been extensively studied in language modeling, it still remains
largely unexplored in the image generation domain. In this work, we begin by
identifying a distinct and prominent attention phenomenon, which we term
spatial locality and emergent semantic sink. To leverage this key insight, we
introduce a novel KV cache compression framework. Specifically, we compress the
KV cache for all visual tokens by adaptively decoupling attention heads into
two separate types: for spatial-locality heads, our method maintains a short
recent token window; for semantic-sink heads, it strategically preserves a
compact set of highly-attended tokens. Our extensive experiments demonstrate
that the proposed method achieves a 5x reduction in memory usage and a
notable 6.6x speedup in overall throughput with only minimal visual
quality loss, thereby enabling highly efficient native autoregressive image
generation on resource-constrained hardware.
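The decoupling idea can be sketched as follows (hypothetical code, not the paper's implementation): each head keeps either a short recent window or a compact set of highly attended "sink" tokens, so the retained cache differs per head type.

```python
import numpy as np

def compress_kv(attn, keys, head_type, window=4, sink_k=4):
    """Keep a per-head subset of cached tokens.
    attn: (T,) average attention mass each cached token receives.
    keys: (T, d) cached key vectors for one head."""
    T = len(attn)
    if head_type == "spatial":   # locality heads: recent window only
        idx = np.arange(max(0, T - window), T)
    else:                        # semantic-sink heads: top-k attended tokens
        idx = np.sort(np.argsort(attn)[-sink_k:])
    return idx, keys[idx]

rng = np.random.default_rng(1)
keys = rng.normal(size=(16, 8))   # 16 cached tokens, head dim 8
attn = rng.random(16)
idx_s, _ = compress_kv(attn, keys, "spatial")
idx_k, _ = compress_kv(attn, keys, "semantic")
assert list(idx_s) == [12, 13, 14, 15]   # most recent tokens kept
assert len(idx_k) == 4                   # compact sink set kept
```

Either way each head retains a small constant-size cache, which is where the memory reduction comes from.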
Fetch.ai An Architecture for Modern Multi-Agent Systems
Authors: Michael J. Wooldridge, Attila Bagoly, Jonathan J. Ward, Emanuele La Malfa, Gabriel Paludo Licks
2025-10-21
Recent surges in LLM-driven intelligent systems largely overlook decades of
foundational multi-agent systems (MAS) research, resulting in frameworks with
critical limitations such as centralization and inadequate trust and
communication protocols. This paper introduces the Fetch.ai architecture, an
industrial-strength platform designed to bridge this gap by facilitating the
integration of classical MAS principles with modern AI capabilities. We present
a novel, multi-layered solution built on a decentralized foundation of on-chain
blockchain services for verifiable identity, discovery, and transactions. This
is complemented by a comprehensive development framework for creating secure,
interoperable agents, a cloud-based platform for deployment, and an intelligent
orchestration layer where an agent-native LLM translates high-level human goals
into complex, multi-agent workflows. We demonstrate the deployed nature of this
system through a decentralized logistics use case where autonomous agents
dynamically discover, negotiate, and transact with one another securely.
Ultimately, the Fetch.ai stack provides a principled architecture for moving
beyond current agent implementations towards open, collaborative, and
economically sustainable multi-agent ecosystems.
Reasoning Language Model Inference Serving Unveiled An Empirical Study
Authors: Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang, Yuhan Chen, Shaohuai Shi, Xiaowen Chu
2025-10-21
The reasoning large language model (RLLM) has proven competitive with general
LLMs in solving complex reasoning tasks such as mathematics and coding.
However, the serving performance and behavior of RLLMs remain unexplored,
which may undermine the deployment and utilization of RLLMs in real-world
scenarios. To close this gap, in this paper we conduct a comprehensive study
of RLLM serving. We first perform a pilot study comparing the serving
performance of RLLMs and traditional LLMs and reveal several distinct
differences in serving behavior: (1)
significant memory usage and fluctuations; (2) straggler requests; (3) adaptive
running time; (4) domain preference. Then we further investigate whether
existing inference optimization techniques are valid for RLLMs. Our main
takeaways are that model quantization methods and speculative decoding can
improve service system efficiency with a small compromise to RLLM accuracy,
while prefix caching may even degrade accuracy or serving performance for
small RLLMs. Lastly, we conduct an evaluation under a real-world workload
modeled by a Gamma distribution to verify our findings. Empirical results of
this real-world workload evaluation across different datasets align with our
main findings regarding RLLM serving. We hope our work can provide the
research community and industry with insights to advance RLLM inference
serving.
Binary Quadratic Quantization Beyond First-Order Quantization for Real-Valued Matrix Compression
Authors: Kyo Kuroki, Yasuyuki Okoshi, Thiem Van Chu, Kazushi Kawamura, Masato Motomura
2025-10-21
This paper proposes a novel matrix compression method, Binary Quadratic
Quantization (BQQ). In contrast to conventional first-order quantization
approaches, such as uniform quantization and binary coding quantization, that
approximate real-valued matrices via linear combinations of binary bases, BQQ
leverages the expressive power of binary quadratic expressions while
maintaining an extremely compact data format. We validate our approach with two
experiments: a matrix compression benchmark and post-training quantization
(PTQ) on pretrained Vision Transformer-based models. Experimental results
demonstrate that BQQ consistently achieves a superior trade-off between memory
efficiency and reconstruction error than conventional methods for compressing
diverse matrix data. It also delivers strong PTQ performance, even though we
neither target state-of-the-art PTQ accuracy under tight memory constraints nor
rely on PTQ-specific binary matrix optimization. For example, our proposed
method outperforms the state-of-the-art PTQ method by up to 2.2% and 59.1% on
the ImageNet dataset under the calibration-based and data-free scenarios,
respectively, with a memory footprint equivalent to 2 bits. These findings highlight
the surprising effectiveness of binary quadratic expressions for efficient
matrix approximation and neural network compression.
C-SWAP Explainability-Aware Structured Pruning for Efficient Neural Networks Compression
Authors: Baptiste Bauvin, Loïc Baret, Ola Ahmad
2025-10-21
Neural network compression has gained increasing attention in recent years,
particularly in computer vision applications, where the need for model
reduction is crucial for overcoming deployment constraints. Pruning is a widely
used technique that induces sparsity in model structures, e.g., weights,
neurons, and layers, reducing size and inference costs. Structured pruning is
especially important as it allows for the removal of entire structures, which
further accelerates inference time and reduces memory overhead. However, it can
be computationally expensive, requiring iterative retraining and optimization.
To overcome this problem, recent methods consider a one-shot setting, which
applies pruning directly at post-training. Unfortunately, they often lead to a
considerable drop in performance. In this paper, we focus on this issue by
proposing a novel one-shot pruning framework that relies on explainable deep
learning. First, we introduce a causal-aware pruning approach that leverages
cause-effect relations between model predictions and structures in a
progressive pruning process. This allows us to efficiently reduce the size of the
network, ensuring that the removed structures do not deter the performance of
the model. Then, through experiments conducted on convolutional neural network
and vision transformer baselines, pre-trained on classification tasks, we
demonstrate that our method consistently achieves substantial reductions in
model size, with minimal impact on performance, and without the need for
fine-tuning. Overall, our approach outperforms its counterparts, offering the
best trade-off. Our code is available on GitHub.
Tokencake A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications
Authors: Zhuohang Bian, Feiyang Wu, Teng Ma, Youwei Zhuo
2025-10-21
Large Language Models (LLMs) are increasingly deployed in complex multi-agent
applications that use external function calls. This workload creates severe
performance challenges for the KV Cache: space contention leads to the eviction
of critical agents' KV caches, and time underutilization leaves the KV caches
of agents stalled on long-running tool calls idling in GPU memory. We present
Tokencake, a KV-Cache-centric serving framework that co-optimizes scheduling and memory
management with an agent-aware design. Tokencake's Space Scheduler uses dynamic
memory partitioning to shield critical agents from contention, while its Time
Scheduler employs a proactive offload and predictive upload mechanism to
repurpose GPU memory during function call stalls. Our evaluation on
representative multi-agent benchmarks shows that Tokencake can reduce
end-to-end latency by over 47.06% and improve effective GPU memory utilization
by up to 16.9% compared to vLLM.
EfficientNav Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval
Authors: Zebin Yang, Sunjian Zheng, Tong Xie, Tianshi Xu, Bo Yu, Fan Wang, Jie Tang, Shaoshan Liu, Meng Li
2025-10-21
Object-goal navigation (ObjNav) tasks an agent with navigating to the
location of a specific object in an unseen environment. Embodied agents
equipped with large language models (LLMs) and online constructed navigation
maps can perform ObjNav in a zero-shot manner. However, existing agents heavily
rely on giant LLMs on the cloud, e.g., GPT-4, while directly switching to small
LLMs, e.g., LLaMA3.2-11b, suffers from significant success rate drops due to
limited model capacity for understanding complex navigation maps, which
prevents deploying ObjNav on local devices. At the same time, the long prompt
introduced by the navigation map description will cause high planning latency
on local devices. In this paper, we propose EfficientNav to enable on-device
efficient LLM-based zero-shot ObjNav. To help the smaller LLMs better
understand the environment, we propose semantics-aware memory retrieval to
prune redundant information in navigation maps. To reduce planning latency, we
propose discrete memory caching and attention-based memory clustering to
efficiently save and re-use the KV cache. Extensive experimental results
demonstrate that EfficientNav achieves 11.1% improvement in success rate on
HM3D benchmark over GPT-4-based baselines, and demonstrates 6.7x real-time
latency reduction and 4.7x end-to-end latency reduction over GPT-4 planner. Our
code will be released soon.
LLMs as Sparse Retrievers A Framework for First-Stage Product Search
Authors: Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Sen Li, Wenjun Peng, Fuyu Lv, Xueqi Cheng
2025-10-21
Product search is a crucial component of modern e-commerce platforms, with
billions of user queries every day. In product search systems, first-stage
retrieval should achieve high recall while ensuring efficient online
deployment. Sparse retrieval is particularly attractive in this context due to
its interpretability and storage efficiency. However, sparse retrieval methods
suffer from severe vocabulary mismatch issues, leading to suboptimal
performance in product search scenarios. With their potential for semantic
analysis, large language models (LLMs) offer a promising avenue for mitigating
vocabulary mismatch issues and thereby improving retrieval quality. Directly
applying LLMs to sparse retrieval in product search exposes two key
challenges: (1) Queries and product titles are typically short and highly
susceptible to LLM-induced hallucinations, such as generating irrelevant
expansion terms or underweighting critical literal terms like brand names and
model numbers; (2) The large vocabulary space of LLMs leads to difficulty in
initializing training effectively, making it challenging to learn meaningful
representations in such ultra-high-dimensional spaces. To address these
challenges, we propose PROSPER, a framework for PROduct search leveraging LLMs
as SParsE Retrievers. PROSPER incorporates: (1) a literal residual network that
alleviates hallucination in lexical expansion by reinforcing underweighted
literal terms through a residual compensation mechanism; and (2) a lexical
focusing window that facilitates effective training initialization via a
coarse-to-fine sparsification strategy. Extensive offline and online
experiments show that PROSPER significantly outperforms sparse retrieval
baselines and achieves
recall performance comparable to advanced dense retrievers, while also
achieving revenue increments online.
From Quarter to All Accelerating Speculative LLM Decoding via Floating-Point Exponent Remapping and Parameter Sharing
Authors: Yushu Zhao, Yubin Qin, Yang Wang, Xiaolong Yang, Huiming Han, Shaojun Wei, Yang Hu, Shouyi Yin
2025-10-21
Large language models achieve impressive performance across diverse tasks but
exhibit high inference latency due to their large parameter sizes. While
quantization reduces model size, it often leads to performance degradation
compared to the full model. Speculative decoding remains lossless but typically
incurs extra overheads. We propose SPEQ, an algorithm-hardware co-designed
speculative decoding method that uses part of the full-model weight bits to
form a quantized draft model, thereby eliminating additional training or
storage overhead. A reconfigurable processing element array enables efficient
execution of both the draft and verification passes. Experimental results
across 15 LLMs and tasks demonstrate that SPEQ achieves speedups of 2.07x,
1.53x, and 1.45x over FP16, Olive, and Tender, respectively.
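SPEQ's actual scheme remaps floating-point exponents; as a simplified integer analogue (an assumption for illustration, not the paper's method), the top bits of stored int8 weights can double as a coarser draft model with zero extra storage:

```python
import numpy as np

def int8_quantize(w):
    """Symmetric int8 quantization of a float tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def draft_view(q):
    """Reuse the top 4 bits of the stored int8 weights as a coarser
    4-bit 'draft' tensor -- no separate draft weights are stored."""
    return ((q.astype(np.int16) >> 4) << 4).astype(np.int8)

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
q, scale = int8_quantize(w)
d = draft_view(q)
# The draft is the same tensor at lower precision: it differs from the
# full weights by less than one low-order step (16 int8 units).
assert np.abs(q.astype(np.int16) - d.astype(np.int16)).max() < 16
```

Because the draft is a bit-level view of the full weights, speculative drafts come almost for free, and the full-precision model then verifies the drafted tokens losslessly.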
The Attribution Story of WhisperGate An Academic Perspective
Authors: Oleksandr Adamov, Anders Carlsson
2025-10-21
This paper explores the challenges of cyberattack attribution, specifically
APTs, applying the case study approach for the WhisperGate cyber operation of
January 2022 executed by the Russian military intelligence service (GRU) and
targeting Ukrainian government entities. The study provides a detailed review
of the threat actor identifiers and taxonomies used by leading cybersecurity
vendors, focusing on the evolving attribution from Microsoft, ESET, and
CrowdStrike researchers. Once the attribution to Ember Bear (GRU Unit 29155) is
established through technical and intelligence reports, we use both traditional
machine learning classifiers and a large language model (ChatGPT) to analyze
the indicators of compromise (IoCs), tactics, and techniques to statistically
and semantically attribute the WhisperGate attack. Our findings reveal
overlapping indicators with the Sandworm group (GRU Unit 74455) but also strong
evidence pointing to Ember Bear, especially when the LLM is fine-tuned or
contextually augmented with additional intelligence. This shows how AI/GenAI
models, with proper fine-tuning, are capable of helping solve the attribution
challenge.
LAFA Agentic LLM-Driven Federated Analytics over Decentralized Data Sources
Authors: Haichao Ji, Zibo Wang, Yifei Zhu, Meng han, Dan Wang, Zhu Han
2025-10-21
Large Language Models (LLMs) have shown great promise in automating data
analytics tasks by interpreting natural language queries and generating
multi-operation execution plans. However, existing LLM-agent-based analytics
frameworks operate under the assumption of centralized data access, offering
little to no privacy protection. In contrast, federated analytics (FA) enables
privacy-preserving computation across distributed data sources, but lacks
support for natural language input and requires structured, machine-readable
queries. In this work, we present LAFA, the first system that integrates
LLM-agent-based data analytics with FA. LAFA introduces a hierarchical
multi-agent architecture that accepts natural language queries and transforms
them into optimized, executable FA workflows. A coarse-grained planner first
decomposes complex queries into sub-queries, while a fine-grained planner maps
each subquery into a Directed Acyclic Graph of FA operations using prior
structural knowledge. To improve execution efficiency, an optimizer agent
rewrites and merges multiple DAGs, eliminating redundant operations and
minimizing computational and communication overhead. Our experiments
demonstrate that LAFA consistently outperforms baseline prompting strategies by
achieving higher execution plan success rates and reducing resource-intensive
FA operations by a substantial margin. This work establishes a practical
foundation for privacy-preserving, LLM-driven analytics that supports natural
language input in the FA setting.
DART A Structured Dataset of Regulatory Drug Documents in Italian for Clinical NLP
Authors: Mariano Barone, Antonio Laudante, Giuseppe Riccio, Antonio Romano, Marco Postiglione, Vincenzo Moscato
2025-10-21
The extraction of pharmacological knowledge from regulatory documents has
become a key focus in biomedical natural language processing, with applications
ranging from adverse event monitoring to AI-assisted clinical decision support.
However, research in this field has predominantly relied on English-language
corpora such as DrugBank, leaving a significant gap in resources tailored to
other healthcare systems. To address this limitation, we introduce DART (Drug
Annotation from Regulatory Texts), the first structured corpus of Italian
Summaries of Product Characteristics derived from the official repository of
the Italian Medicines Agency (AIFA). The dataset was built through a
reproducible pipeline encompassing web-scale document retrieval, semantic
segmentation of regulatory sections, and clinical summarization using a
few-shot-tuned large language model with low-temperature decoding. DART
provides structured information on key pharmacological domains such as
indications, adverse drug reactions, and drug-drug interactions. To validate
its utility, we implemented an LLM-based drug interaction checker that
leverages the dataset to infer clinically meaningful interactions. Experimental
results show that instruction-tuned LLMs can accurately infer potential
interactions and their clinical implications when grounded in the structured
textual fields of DART. We publicly release our code on GitHub:
https://github.com/PRAISELab-PicusLab/DART.
CircuitSeer Mining High-Quality Data by Probing Mathematical Reasoning Circuits in LLMs
Authors: Shaobo Wang, Yongliang Miao, Yuancheng Liu, Qianli Ma, Ning Liao, Linfeng Zhang
2025-10-21
Large language models (LLMs) have demonstrated impressive reasoning
capabilities, but scaling their performance often relies on massive reasoning
datasets that are computationally expensive to train on. Existing data
selection methods aim to curate smaller, high-quality subsets but often rely on
costly external models or opaque heuristics. In this work, we shift the focus
from external heuristics to the model's internal mechanisms. We find that
complex reasoning tasks consistently activate a sparse, specialized subset of
attention heads, forming core reasoning circuits. Building on this insight, we
propose CircuitSeer, a novel data selection method that quantifies the
reasoning complexity of data by measuring its influence on these crucial
circuits. Extensive experiments on 4 models and 9 datasets demonstrate
CircuitSeer's superiority. Notably, fine-tuning Qwen2.5-Math-7B on just 10% of
data selected by our method achieves a 1.4-point gain in average Pass@1 over
training on the full dataset, highlighting its efficiency and effectiveness.
Adamas Hadamard Sparse Attention for Efficient Long-Context Inference
Authors: Siyuan Yan, Guo-Qing Jiang, Yuchen Zhang, Xiaoxing Ma, Ran Zhu, Chun Cao, Jingwei Xu
2025-10-21
Large language models (LLMs) now support context windows of hundreds of
thousands to millions of tokens, enabling applications such as long-document
summarization, large-scale code synthesis, multi-document question answering
and persistent multi-turn dialogue. However, such extended contexts exacerbate
the quadratic cost of self-attention, leading to severe latency in
autoregressive decoding. Existing sparse attention methods alleviate these
costs but rely on heuristic patterns that struggle to recall critical key-value
(KV) pairs for each query, resulting in accuracy degradation. We introduce
Adamas, a lightweight yet highly accurate sparse attention mechanism designed
for long-context inference. Adamas applies the Hadamard transform,
bucketization and 2-bit quantization to produce compact representations, and
leverages Manhattan-distance estimation for efficient top-k selections.
Experiments show that Adamas matches the accuracy of full attention with only a
64-token budget, achieves near-lossless performance at 128, and supports up to
8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering
up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences.
Remarkably, Adamas attains comparable or even lower perplexity than full
attention, underscoring its effectiveness in maintaining accuracy under
aggressive sparsity.
How2Compress Scalable and Efficient Edge Video Analytics via Adaptive Granular Video Compression
Authors: Yuheng Wu, Thanh-Tung Nguyen, Lucas Liebe, Quang Tau, Pablo Espinosa Campos, Jinghan Cheng, Dongman Lee
2025-10-21
With the rapid proliferation of the Internet of Things, video analytics has
become a cornerstone application in wireless multimedia sensor networks. To
support such applications under bandwidth constraints, learning-based adaptive
quantization for video streaming has demonstrated strong potential in
reducing bitrate while maintaining analytical accuracy. However, existing
frameworks often fail to fully exploit the fine-grained quality control enabled
by modern block-based video codecs, leaving significant compression efficiency
untapped.
In this paper, we present How2Compress, a simple yet effective framework
designed to enhance video compression efficiency through precise, fine-grained
quality control at the macroblock level. How2Compress is a plug-and-play module
and can be seamlessly integrated into any existing edge video analytics
pipelines. We implement How2Compress on the H.264 codec and evaluate its
performance across diverse real-world scenarios. Experimental results show that
How2Compress achieves substantial bitrate savings and outperforms baselines
without compromising accuracy, demonstrating its
practical effectiveness and efficiency. Code is available at
https://github.com/wyhallenwu/how2compress and a reproducible docker image at
https://hub.docker.com/r/wuyuheng/how2compress.
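A minimal sketch of macroblock-level quality control: map a per-macroblock saliency score to an H.264-style quantization parameter (QP), spending bits on salient regions. The QP range and the linear mapping below are assumptions for illustration, not How2Compress's learned policy.

```python
import numpy as np

def qp_map(saliency, qp_min=22, qp_max=40):
    # Normalize saliency to [0, 1], then assign lower QP (higher quality)
    # to salient macroblocks and higher QP (fewer bits) to background.
    s = (saliency - saliency.min()) / (np.ptp(saliency) + 1e-12)
    return np.round(qp_max - s * (qp_max - qp_min)).astype(int)

saliency = np.array([[0.0, 1.0],
                     [0.5, 0.25]])   # e.g. from a lightweight detector
qp = qp_map(saliency)                # one QP per macroblock
```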
MENTOR A Reinforcement Learning Framework for Model Enhancement via Teacher-Optimized Rewards in Small Models
Authors: ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim
2025-10-21
Distilling the tool-using capabilities of large language models (LLMs) into
smaller, more efficient small language models (SLMs) is a key challenge for
their practical application. The predominant approach, supervised fine-tuning
(SFT), suffers from poor generalization as it trains models to imitate a static
set of teacher trajectories rather than learn a robust methodology. While
reinforcement learning (RL) offers an alternative, the standard RL using sparse
rewards fails to effectively guide SLMs, causing them to struggle with
inefficient exploration and adopt suboptimal strategies. To address these
distinct challenges, we propose MENTOR, a framework that synergistically
combines RL with teacher-guided distillation. Instead of simple imitation,
MENTOR employs an RL-based process to learn a more generalizable policy through
exploration. In addition, to solve the problem of reward sparsity, it uses a
teacher's reference trajectory to construct a dense, composite teacher-guided
reward that provides fine-grained guidance. Extensive experiments demonstrate
that MENTOR significantly improves the cross-domain generalization and
strategic competence of SLMs compared to both SFT and standard sparse-reward RL
baselines.
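One way to realize a dense, teacher-guided composite reward is to mix the sparse task outcome with step overlap against the teacher's reference trajectory; the overlap proxy and the `alpha` weighting here are illustrative, not MENTOR's exact formulation.

```python
def teacher_guided_reward(student_steps, teacher_steps, task_success,
                          alpha=0.5):
    # Dense term: fraction of the teacher's reference steps the student
    # reproduced (a rough stand-in for fine-grained trajectory guidance).
    overlap = len(set(student_steps) & set(teacher_steps))
    dense = overlap / max(len(teacher_steps), 1)
    # Sparse term: the usual binary outcome reward.
    sparse = 1.0 if task_success else 0.0
    # Composite reward blends both signals.
    return alpha * sparse + (1 - alpha) * dense
```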
S2AP Score-space Sharpness Minimization for Adversarial Pruning
Authors: Giorgio Piras, Qi Zhao, Fabio Brau, Maura Pintor, Christian Wressnegger, Battista Biggio
2025-10-21
Adversarial pruning methods have emerged as a powerful tool for compressing
neural networks while preserving robustness against adversarial attacks. These
methods typically follow a three-step pipeline: (i) pretrain a robust model,
(ii) select a binary mask for weight pruning, and (iii) fine-tune the pruned
model. To select the binary mask, these methods minimize a robust loss by
assigning an importance score to each weight, and then keep the weights with
the highest scores. However, this score-space optimization can lead to sharp
local minima in the robust loss landscape and, in turn, to an unstable mask
selection, reducing the robustness of adversarial pruning methods. To overcome
this issue, we propose a novel plug-in method for adversarial pruning, termed
Score-space Sharpness-aware Adversarial Pruning (S2AP). Through our method, we
introduce the concept of score-space sharpness minimization, which operates
during the mask search by perturbing importance scores and minimizing the
corresponding robust loss. Extensive experiments across various datasets,
models, and sparsity levels demonstrate that S2AP effectively minimizes
sharpness in score space, stabilizing the mask selection, and ultimately
improving the robustness of adversarial pruning methods.
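Score-space sharpness minimization can be sketched as a SAM-style update applied to the importance scores rather than the weights: perturb the scores in the worst-case direction, then descend the gradient evaluated at the perturbed point. The toy quadratic loss below stands in for the robust loss.

```python
import numpy as np

def topk_mask(scores, k):
    # Binary mask keeping the k highest-scoring weights.
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[::-1][:k]] = 1.0
    return mask

def sharpness_aware_step(scores, grad_fn, rho=0.05, lr=0.1):
    # SAM in score space: ascend to a worst-case perturbation of the
    # scores, then update with the gradient taken there.
    g = grad_fn(scores)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_perturbed = grad_fn(scores + eps)
    return scores - lr * g_perturbed

# Toy stand-in for the robust loss: quadratic bowl around `target`.
target = np.array([1.0, -2.0, 3.0, 0.5])
grad_fn = lambda s: 2.0 * (s - target)

scores = np.zeros(4)
for _ in range(100):
    scores = sharpness_aware_step(scores, grad_fn)
```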
Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers
Authors: Firas Gabetni, Giuseppe Curci, Andrea Pilzer, Subhankar Roy, Elisa Ricci, Gianni Franchi
2025-10-21
Uncertainty quantification (UQ) is essential for deploying deep neural
networks in safety-critical settings. Although methods like Deep Ensembles
achieve strong UQ performance, their high computational and memory costs hinder
scalability to large models. We introduce Hydra Ensembles, an efficient
transformer-based ensemble that prunes attention heads to create diverse
members and merges them via a new multi-head attention with grouped
fully-connected layers. This yields a compact model with inference speed close
to a single network, matching or surpassing Deep Ensembles in UQ performance
without retraining from scratch. We also provide an in-depth analysis of
pruning, showing that naive approaches can harm calibration, whereas Hydra
Ensembles preserves robust uncertainty. Experiments on image and text
classification tasks, with various architectures, show consistent gains over
Deep Ensembles. Remarkably, in zero-shot classification on ImageNet-1k, our
approach surpasses state of the art methods, even without requiring additional
training.
Learning Human-Object Interaction as Groups
Authors: Jiajun Hong, Jianan Wei, Wenguan Wang
2025-10-21
Human-Object Interaction Detection (HOI-DET) aims to localize human-object
pairs and identify their interactive relationships. To aggregate contextual
cues, existing methods typically propagate information across all detected
entities via self-attention mechanisms, or establish message passing between
humans and objects with bipartite graphs. However, they primarily focus on
pairwise relationships, overlooking that interactions in real-world scenarios
often emerge from collective behaviors (multiple humans and objects engaging in
joint activities). In light of this, we revisit relation modeling from a group
view and propose GroupHOI, a framework that propagates contextual information
in terms of geometric proximity and semantic similarity. To exploit the
geometric proximity, humans and objects are grouped into distinct clusters
using a learnable proximity estimator based on spatial features derived from
bounding boxes. In each group, a soft correspondence is computed via
self-attention to aggregate and dispatch contextual cues. To incorporate the
semantic similarity, we enhance the vanilla transformer-based interaction
decoder with local contextual cues from HO-pair features. Extensive experiments
on HICO-DET and V-COCO benchmarks demonstrate the superiority of GroupHOI over
the state-of-the-art methods. It also exhibits leading performance on the more
challenging Nonverbal Interaction Detection (NVI-DET) task, which involves
varied forms of higher-order interactions within groups.
Text or Pixels? It Takes Half On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
Authors: Yanhong Li, Zixuan Lan, Jiawei Zhou
2025-10-21
Large language models (LLMs) and their multimodal variants can now process
visual inputs, including images of text. This raises an intriguing question:
can we compress textual inputs by feeding them as images to reduce token usage
while preserving performance? In this paper, we show that visual text
representations are a practical and surprisingly effective form of input
compression for decoder LLMs. We exploit the idea of rendering long text inputs
as a single image and provide it directly to the model. This leads to a
dramatically reduced number of decoder tokens required, offering a new form of
input compression. Through experiments on two distinct benchmarks, RULER
(long-context retrieval) and CNN/DailyMail (document summarization) we
demonstrate that this text-as-image method yields substantial token savings
(often nearly half) without degrading task performance.
StreamingTOM Streaming Token Compression for Efficient Video Understanding
Authors: Xueyi Chen, Keda Tao, Kele Shao, Huan Wang
2025-10-21
Unlike offline processing, streaming video vision-language models face two
fundamental constraints: causality and accumulation. Causality prevents access
to future frames that offline methods exploit, while accumulation causes tokens
to grow unbounded, creating efficiency bottlenecks. However, existing
approaches only regulate the post-LLM kv-cache, leaving costly pre-LLM prefill
unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage
framework that addresses both pre-LLM and post-LLM bottlenecks with predictable
latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects
tokens based on adjacent-frame changes and token saliency, drastically reducing
per-frame prefill cost by processing only a compact subset of visual tokens per
frame instead of all visual tokens. Online Quantized Memory stores tokens in
4-bit format, retrieves relevant groups on demand, and dequantizes them,
keeping the active kv-cache bounded regardless of stream length. Experiments
demonstrate that our method achieves substantial kv-cache compression, lower
peak memory, and faster TTFT compared to the prior SOTA.
StreamingTOM maintains state-of-the-art accuracy among training-free methods
on both offline benchmarks and RVS.
These results highlight the practical benefits of our two-stage approach for
efficient streaming video understanding with bounded growth.
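The Causal Temporal Reduction stage can be sketched as ranking each frame's tokens by a mix of adjacent-frame change and per-token saliency under a fixed budget; the 50/50 weighting below is an illustrative assumption, not the paper's tuned scoring.

```python
import numpy as np

def select_frame_tokens(prev_tokens, curr_tokens, saliency, budget):
    # Adjacent-frame change: how much each token moved since last frame.
    change = np.linalg.norm(curr_tokens - prev_tokens, axis=1)
    # Blend normalized change with saliency (illustrative equal weights).
    score = 0.5 * change / (change.max() + 1e-12) + 0.5 * saliency
    # Fixed per-frame budget: keep only the top-scoring tokens.
    keep = np.argsort(score)[::-1][:budget]
    return np.sort(keep)

prev = np.zeros((6, 4))
curr = np.zeros((6, 4))
curr[0] = 10.0                                      # token 0 changed a lot
saliency = np.array([0.1, 0.9, 0.0, 0.0, 0.0, 0.0]) # token 1 is salient
kept = select_frame_tokens(prev, curr, saliency, budget=2)
```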
Learning from the Best, Differently A Diversity-Driven Rethinking on Data Selection
Authors: Hongyi He, Xiao Liu, Zhenghao Lin, Mingni Tang, Yi Cheng, Jintao Wang, Wenjie Li, Peng Cheng, Yeyun Gong
2025-10-21
High-quality pre-training data is crucial for large language models, where
quality captures factual reliability and semantic value, and diversity ensures
broad coverage and distributional heterogeneity. Existing approaches typically
rely on single or multiple-dimensional score-based selection. However, directly
selecting top-scored data often degrades performance, and sampling from a
broader range is required to recover results. The above non-monotonicity
between dataset scores and downstream benchmark results reveals a fundamental
bias: score-based methods collapse correlated dimensions, causing top-scored
data to appear high-quality while systematically overlooking diversity. We
argue that ensuring diversity requires decomposing correlated metrics into
orthogonal feature dimensions, from which the top-scored data can be directly
selected. Therefore, we proposed the Orthogonal Diversity-Aware Selection
(ODiS) algorithm, which preserves both quality and diversity during data
selection. First, ODiS evaluates data from multiple dimensions, covering
language quality, knowledge quality, and comprehension difficulty. The
multi-dimensional scores are then decorrelated via Principal Component Analysis
(PCA), yielding orthogonal evaluation dimensions. For each dimension, a
Roberta-based scorer is trained to regress the data onto PCA-projected scores,
enabling scalable inference on large corpora. Finally, ODiS constructs the
training dataset by selecting top-scored data within each orthogonal dimension,
thereby ensuring both quality and diversity. Empirical results show that
ODiS-selected data exhibit less than 2% inter-dimension correlation, confirming
orthogonality between dimensions. More importantly, models trained with
ODiS-selected data significantly outperform other baselines on downstream
benchmarks, highlighting the necessity of orthogonal, diversity-aware data
selection for LLMs.
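The decorrelate-then-select step can be sketched with PCA via SVD; ODiS additionally trains RoBERTa-based scorers to regress onto the projected dimensions, which this toy version skips.

```python
import numpy as np

def orthogonal_select(scores, per_dim):
    # Center the multi-dimensional quality scores, then decorrelate them
    # with PCA (via SVD of the centered score matrix).
    X = scores - scores.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt.T                      # PCA-projected, orthogonal scores
    # Take the top-scoring items along each orthogonal dimension.
    chosen = set()
    for d in range(proj.shape[1]):
        top = np.argsort(proj[:, d])[::-1][:per_dim]
        chosen.update(int(i) for i in top)
    return sorted(chosen)

rng = np.random.default_rng(2)
scores = rng.random((50, 3))    # 50 examples x 3 quality dimensions
idx = orthogonal_select(scores, per_dim=5)
```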
DeepSeek-OCR Contexts Optical Compression
Authors: Haoran Wei, Yaofeng Sun, Yukun Li
2025-10-21
We present DeepSeek-OCR as an initial investigation into the feasibility of
compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two
components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically,
DeepEncoder serves as the core engine, designed to maintain low activations
under high-resolution input while achieving high compression ratios to ensure
an optimal and manageable number of vision tokens. Experiments show that when
the number of text tokens is within 10 times that of vision tokens (i.e., a
compression ratio < 10x), the model can achieve decoding (OCR) precision of
97%. Even at a compression ratio of 20x, the OCR accuracy still remains at
about 60%. This shows considerable promise for research areas such as
historical long-context compression and memory forgetting mechanisms in LLMs.
Beyond this, DeepSeek-OCR also demonstrates high practical value. On
OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision
tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while
utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can
generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a
single A100-40G). Codes and model weights are publicly accessible at
http://github.com/deepseek-ai/DeepSeek-OCR.
Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
Authors: Yoshinari Fujinuma
2025-10-21
Large Language Models (LLMs) are commonly used as evaluators in various
applications, but the reliability of the outcomes remains a challenge. One such
challenge is using LLMs-as-judges for direct assessment, i.e., assigning scores
from a specified range without any references. We first show that this
challenge stems from LLM judge outputs being associated with score range bias,
i.e., LLM judge outputs are highly sensitive to pre-defined score ranges,
preventing the search for optimal score ranges. We also show that similar
biases exist among models from the same family. We then mitigate this bias
through contrastive decoding, achieving up to 11.3% relative improvement on
average in Spearman correlation with human judgments across different score
ranges.
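Contrastive decoding in this setting can be sketched as subtracting an "amateur" model's log-probabilities from the expert judge's, damping biases the two share; the `beta` weight and the toy logits are illustrative.

```python
import numpy as np

def contrastive_logits(expert_logits, amateur_logits, beta=1.0):
    # Log-softmax both distributions, then score tokens by the gap
    # between expert and amateur log-probabilities.
    e = expert_logits - np.logaddexp.reduce(expert_logits)
    a = amateur_logits - np.logaddexp.reduce(amateur_logits)
    return e - beta * a

expert = np.array([3.0, 2.0, 0.0])
amateur = np.array([3.0, 0.0, 0.0])   # shares the expert's bias to token 0
scores = contrastive_logits(expert, amateur)
```

Here the shared preference for token 0 cancels out, so the contrastive score favors token 1, which only the expert supports.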
Extracting Rule-based Descriptions of Attention Features in Transformers
Authors: Dan Friedman, Adithya Bhaskar, Alexander Wettig, Danqi Chen
2025-10-20
Mechanistic interpretability strives to explain model behavior in terms of
bottom-up primitives. The leading paradigm is to express hidden states as a
linear combination of basis vectors, called features. However, this only
identifies which text sequences (exemplars) activate which features; the actual
interpretation of features requires subjective inspection of these exemplars.
This paper advocates for a different solution: rule-based descriptions that
match token patterns in the input and correspondingly increase or decrease the
likelihood of specific output tokens. Specifically, we extract rule-based
descriptions of SAE features trained on the outputs of attention layers. While
prior work treats the attention layers as an opaque box, we describe how it may
naturally be expressed in terms of interactions between input and output
features, of which we study three types: (1) skip-gram rules of the form
"[Canadian city]... speaks --> English", (2) absence rules of the form
"[Montreal]... speaks -/-> English," and (3) counting rules that toggle only
when the count of a word exceeds a certain value or the count of another word.
Absence and counting rules are not readily discovered by inspection of
exemplars, where manual and automatic descriptions often identify misleading or
incomplete explanations. We then describe a simple approach to extract these
types of rules automatically from a transformer, and apply it to GPT-2 small.
We find that a majority of features may be described well with around 100
skip-gram rules, though absence rules are abundant even as early as the first
layer (in over a fourth of features). We also isolate a few examples of
counting rules. This paper lays the groundwork for future research into
rule-based descriptions of features by defining them, showing how they may be
extracted, and providing a preliminary taxonomy of some of the behaviors they
represent.
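Skip-gram and absence rules of the kinds described can be represented as (trigger, target, weight) triples, where a positive weight raises the target token's likelihood and a negative weight suppresses it. This toy matcher illustrates the rule format only, not the paper's extraction procedure.

```python
def apply_rules(tokens, rules):
    # Accumulate likelihood adjustments for output tokens whose trigger
    # appears in the input; absence-style rules carry negative weights.
    adjustments = {}
    present = set(tokens)
    for trigger, target, weight in rules:
        if trigger in present:
            adjustments[target] = adjustments.get(target, 0.0) + weight
    return adjustments

rules = [
    ("Montreal", "English", -2.0),   # absence rule: suppress "English"
    ("Toronto", "English", +1.5),    # skip-gram rule: promote "English"
]
```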
Any-Depth Alignment Unlocking Innate Safety Alignment of LLMs to Any-Depth
Authors: Jiawei Zhang, Andrew Estornell, David D. Baek, Bo Li, Xiaojun Xu
2025-10-20
Large Language Models (LLMs) exhibit strong but shallow alignment: they
directly refuse harmful queries when a refusal is expected at the very start of
an assistant turn, yet this protection collapses once a harmful continuation is
underway (either through adversarial attacks or via harmful assistant-prefill
attacks). This raises a fundamental question: can the innate shallow alignment
in LLMs be unlocked to ensure safety at arbitrary generation
depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an
effective inference-time defense with negligible overhead. ADA is built based
on our observation that alignment is concentrated in the assistant header
tokens through repeated use in shallow-refusal training, and these tokens
possess the model's strong alignment priors. By reintroducing these tokens
mid-stream, ADA induces the model to reassess harmfulness and recover refusals
at any point in generation. Across diverse open-source model families (Llama,
Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety
performance without requiring any changes to the base model's parameters. It
secures a near-100% refusal rate against challenging adversarial prefill
attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces
the average success rate of prominent adversarial prompt attacks (such as GCG,
AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving
utility on benign tasks with minimal over-refusal. ADA maintains this
resilience even after the base model undergoes subsequent instruction tuning
(benign or adversarial).
MEG-GPT A transformer-based foundation model for magnetoencephalography data
Authors: Rukuang Huang, Sungjun Cho, Chetan Gohil, Oiwi Parker Jones, Mark Woolrich
2025-10-20
Modelling the complex spatiotemporal patterns of large-scale brain dynamics
is crucial for neuroscience, but traditional methods fail to capture the rich
structure in modalities such as magnetoencephalography (MEG). Recent advances
in deep learning have enabled significant progress in other domains, such as
language and vision, by using foundation models at scale. Here, we introduce
MEG-GPT, a transformer-based foundation model that uses time-attention and next
time-point prediction. To facilitate this, we also introduce a novel
data-driven tokeniser for continuous MEG data, which preserves the high
temporal resolution of continuous MEG signals without lossy transformations. We
trained MEG-GPT on tokenised brain region time-courses extracted from a
large-scale MEG dataset (N=612, eyes-closed rest, Cam-CAN data), and show that
the learnt model can generate data with realistic spatio-spectral properties,
including transient events and population variability. Critically, it performs
well in downstream decoding tasks, showing improved zero-shot generalisation
across sessions (improving accuracy from 0.54 to 0.59) and subjects (improving
accuracy from 0.41 to 0.49) compared to baseline methods. Furthermore, we show
the model can be
efficiently fine-tuned on a smaller labelled dataset to boost performance in
cross-subject decoding scenarios. This work establishes a powerful foundation
model for electrophysiological data, paving the way for applications in
computational neuroscience and neural decoding.
CompactPrompt A Unified Pipeline for Prompt Data Compression in LLM Workflows
Authors: Joong Ho Choi, Jiayang Zhao, Jeel Shah, Ritvika Sonawane, Vedant Singh, Avani Appalla, Will Flanagan, Filipe Condessa
2025-10-20
Large Language Models (LLMs) deliver powerful reasoning and generation
capabilities but incur substantial run-time costs when operating in agentic
workflows that chain together lengthy prompts and process rich data streams. We
introduce CompactPrompt, an end-to-end pipeline that merges hard prompt
compression with lightweight file-level data compression. CompactPrompt first
prunes low-information tokens from prompts using self-information scoring and
dependency-based phrase grouping. In parallel, it applies n-gram abbreviation
to recurrent textual patterns in attached documents and uniform quantization to
numerical columns, yielding compact yet semantically faithful representations.
Integrated into standard LLM agents, CompactPrompt reduces total token usage
and inference cost by up to 60% on benchmark datasets like TAT-QA and FinQA,
while preserving output quality (less than a 5% accuracy drop for
Claude-3.5-Sonnet and GPT-4.1-Mini). CompactPrompt helps visualize real-time
compression decisions and quantify cost-performance trade-offs, laying the
groundwork for leaner generative AI pipelines.
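The self-information pruning step can be sketched as follows; estimating token probabilities from prompt-internal frequencies is a rough stand-in for the model-based scoring a CompactPrompt-style pipeline would use.

```python
import math
from collections import Counter

def prune_low_information(tokens, keep_ratio=0.6):
    # Self-information I(t) = -log2 p(t), with p(t) estimated from the
    # prompt itself (illustrative proxy for a language-model estimate).
    counts = Counter(tokens)
    total = len(tokens)
    info = [-math.log2(counts[t] / total) for t in tokens]
    # Keep the most informative fraction of tokens, preserving order.
    k = max(1, int(len(tokens) * keep_ratio))
    threshold = sorted(info, reverse=True)[k - 1]
    return [t for t, i in zip(tokens, info) if i >= threshold][:k]

tokens = ["the", "the", "the", "cat", "sat"]
kept = prune_low_information(tokens, keep_ratio=0.4)
```

Frequent filler tokens carry low self-information under this estimate and are pruned first, while rare content words survive.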
OPTAGENT Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning
Authors: Zhenyu Bi, Meng Lu, Yang Li, Swastik Roy, Weijie Guan, Morteza Ziyadi, Xuan Wang
2025-10-20
Large Language Models (LLMs) have shown remarkable reasoning capabilities in
mathematical and scientific tasks. To enhance complex reasoning, multi-agent
systems have been proposed to harness the collective intelligence of LLM
agents. However, existing collaboration structures are either predefined or
rely on majority voting or round-table debates, which can suppress correct but
less dominant agent contributions. Recent approaches model multi-agent systems
as graph networks but optimize purely for agent performance, neglecting the
quality of interactions. We hypothesize that effective agent collaboration is
crucial for multi-agent reasoning and that debating quality plays a significant
role. To address this, we propose OPTAGENT, a multi-agent verbal reinforcement
learning algorithm that dynamically constructs and refines multi-agent
collaboration structures. Our method defines action spaces and a feedback
mechanism that evaluates collaboration robustness and coherence throughout the
debate. The final decision is achieved through a majority vote over all the
agents. We assess OPTAGENT on various reasoning tasks, including mathematical
reasoning, creative writing, scientific reasoning, and numerical sorting.
Results demonstrate that our approach significantly outperforms single-agent
prompting methods and state-of-the-art multi-agent frameworks on diverse tasks.
From Local to Global Revisiting Structured Pruning Paradigms for Large Language Models
Authors: Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Minwoo Lee, Shu-ping Yeh, Evgeny Stupachenko, Hao Feng, Li Yang
2025-10-20
Structured pruning is a practical approach to deploying large language models
(LLMs) efficiently, as it yields compact, hardware-friendly architectures.
However, the dominant local paradigm is task-agnostic: by optimizing layer-wise
reconstruction rather than task objectives, it tends to preserve perplexity or
generic zero-shot behavior but fails to capitalize on modest task-specific
calibration signals, often yielding limited downstream gains. We revisit global
structured pruning and present GISP (Global Iterative Structured Pruning), a
post-training method that removes attention heads and MLP channels using
first-order, loss-based importance weights aggregated at the structure level
with block-wise normalization. An iterative schedule, rather than one-shot
pruning, stabilizes accuracy at higher sparsity and mitigates perplexity
collapse without requiring intermediate fine-tuning; the pruning trajectory
also forms nested subnetworks that support a "prune-once, deploy-many"
workflow. Furthermore, because importance is defined by a model-level loss,
GISP naturally supports task-specific objectives; we instantiate perplexity for
language modeling and a margin-based objective for decision-style tasks.
Extensive experiments show that across Llama2-7B/13B, Llama3-8B, and
Mistral-0.3-7B, GISP consistently lowers WikiText-2 perplexity and improves
downstream accuracy, with especially strong gains at 40-50% sparsity; on
DeepSeek-R1-Distill-Llama-3-8B with GSM8K, task-aligned calibration
substantially boosts exact-match accuracy.
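First-order, loss-based importance with block-wise normalization can be sketched per attention head as the summed |weight x gradient| over the head's parameters; the toy head slices below are illustrative.

```python
import numpy as np

def head_importance(weights, grads, head_slices):
    # First-order importance: sum |w * dL/dw| over each head's parameters,
    # then normalize within the block.
    per_head = np.array([np.abs(weights[s] * grads[s]).sum()
                         for s in head_slices])
    return per_head / (per_head.sum() + 1e-12)

def prune_heads(importance, ratio):
    # Remove the lowest-importance fraction; return surviving head indices.
    n_prune = int(len(importance) * ratio)
    order = np.argsort(importance)
    return sorted(int(i) for i in order[n_prune:])

w = np.ones(12)                                    # toy flattened weights
g = np.array([0.0] * 4 + [1.0] * 4 + [2.0] * 4)    # per-parameter gradients
slices = [slice(0, 4), slice(4, 8), slice(8, 12)]  # three heads
imp = head_importance(w, g, slices)
kept = prune_heads(imp, ratio=1 / 3)
```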
Glyph Scaling Context Windows via Visual-Text Compression
Authors: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang
2025-10-20
Large language models (LLMs) increasingly rely on long-context modeling for
tasks such as document understanding, code analysis, and multi-step reasoning.
However, scaling context windows to the million-token level brings prohibitive
computational and memory costs, limiting the practicality of long-context LLMs.
In this work, we take a different perspective-visual context scaling-to tackle
this challenge. Instead of extending token-based sequences, we propose Glyph, a
framework that renders long texts into images and processes them with
vision-language models (VLMs). This approach substantially compresses textual
input while preserving semantic information, and we further design an
LLM-driven genetic search to identify optimal visual rendering configurations
for balancing accuracy and compression. Through extensive experiments, we
demonstrate that our method achieves 3-4x token compression while maintaining
accuracy comparable to leading LLMs such as Qwen3-8B on various long-context
benchmarks. This compression also leads to around 4x faster prefilling and
decoding, and approximately 2x faster SFT training. Furthermore, under extreme
compression, a 128K-context VLM could scale to handle 1M-token-level text
tasks. In addition, the rendered text data benefits real-world multimodal
tasks, such as document understanding. Our code and model are released at
https://github.com/thu-coai/Glyph.
Beyond More Context Retrieval Diversity Boosts Multi-Turn Intent Understanding
Authors: Zhiming Lin
2025-10-20
Multi-turn intent understanding is central to task-oriented chatbots, yet
real deployments face tight token budgets and noisy contexts, and most
retrieval pipelines emphasize relevance while overlooking set-level diversity
and confounds such as more context or exemplar order. We ask whether retrieval
diversity, rather than longer prompts, systematically improves intent
understanding under fixed budgets. We present a diversity-aware retrieval
framework that selects in-context exemplars to balance intent coverage and
linguistic variety, and integrates this selection with standard decoders;
the evaluation enforces budget-matched prompts and randomized positions, and
includes sensitivity analyses over exemplar count, diversity strength, and
backbone size. On MultiWOZ 2.4 and SGD, the approach achieves strong gains in
Joint Goal Accuracy under equal token budgets, surpassing strong LLM/DST
baselines, with consistent improvements across K from 4 to 7 and moderate
latency. Overall, the study isolates and validates the impact of content
diversity in retrieval and offers a simple, deployable selection principle for
building accurate, budget-constrained multi-turn intent systems.
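One concrete way to balance relevance against set-level diversity is greedy maximal-marginal-relevance (MMR) selection over candidate exemplars; the paper's selector may differ, so treat this as a sketch of the principle.

```python
def select_diverse_exemplars(sim_to_query, pairwise_sim, k, lam=0.5):
    # Start with the most relevant exemplar, then greedily add items that
    # trade off query relevance against redundancy with those chosen.
    n = len(sim_to_query)
    chosen = [max(range(n), key=lambda i: sim_to_query[i])]
    while len(chosen) < k:
        best, best_score = None, float("-inf")
        for i in range(n):
            if i in chosen:
                continue
            redundancy = max(pairwise_sim[i][j] for j in chosen)
            score = lam * sim_to_query[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

sim_to_query = [0.9, 0.85, 0.2, 0.8]      # items 0 and 1 are near-duplicates
pairwise = [[1.0, 0.95, 0.1, 0.1],
            [0.95, 1.0, 0.1, 0.1],
            [0.1, 0.1, 1.0, 0.1],
            [0.1, 0.1, 0.1, 1.0]]
picked = select_diverse_exemplars(sim_to_query, pairwise, k=2)
```

Note how the redundant item 1 loses to the less relevant but more diverse item 3 under an equal token budget.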
ZACH-ViT A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification
Authors: Athanasios Angelakis, Amne Mousa, Micah L. A. Heldeweg, Laurens A. Biesheuvel, Mark A. Haaksma, Jasper M. Smit, Pieter R. Tuinman, Paul W. G. Elbers
2025-10-20
Differentiating cardiogenic pulmonary oedema (CPE) from non-cardiogenic and
structurally normal lungs in lung ultrasound (LUS) videos remains challenging
due to the high visual variability of non-cardiogenic inflammatory patterns
(NCIP/ARDS-like), interstitial lung disease, and healthy lungs. This
heterogeneity complicates automated classification, as overlapping B-lines and
pleural artefacts are common. We introduce ZACH-ViT (Zero-token Adaptive
Compact Hierarchical Vision Transformer), a 0.25 M-parameter Vision Transformer
variant that removes both positional embeddings and the [CLS] token, making it
fully permutation-invariant and suitable for unordered medical image data. To
enhance generalization, we propose ShuffleStrides Data Augmentation (SSDA),
which permutes probe-view sequences and frame orders while preserving
anatomical validity. ZACH-ViT was evaluated on 380 LUS videos from 95
critically ill patients against nine state-of-the-art baselines. Despite the
heterogeneity of the non-cardiogenic group, ZACH-ViT achieved the highest
validation and test ROC-AUC (0.80 and 0.79) with balanced sensitivity (0.60)
and specificity (0.91), while all competing models collapsed to trivial
classification. It trains 1.35x faster than Minimal ViT (0.62M parameters) with
2.5x fewer parameters, supporting real-time clinical deployment. These results
show that aligning architectural design with data structure can outperform
scale in small-data medical imaging.
Language Confusion Gate Language-Aware Decoding Through Model Self-Distillation
Authors: Collin Zhang, Fei Huang, Chenhan Yuan, Junyang Lin
2025-10-20
Large language models (LLMs) often experience language confusion, which is
the unintended mixing of languages during text generation. Current solutions to
this problem either necessitate model retraining or cannot differentiate
between harmful confusion and acceptable code-switching. This paper introduces
the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters
tokens during decoding without altering the base LLM. The LCG is trained using
norm-adjusted self-distillation to predict appropriate language families and
apply masking only when needed. Our method is based on the findings that
language confusion is infrequent, correct-language tokens are usually among the
top predictions, and output token embedding norms are larger for high-resource
languages, which biases sampling. When evaluated across various models,
including Qwen3, GPT-OSS, Gemma3, and Llama3.1, LCG decreases language confusion
significantly, often by an order of magnitude, without negatively impacting
task performance. Code is available at
https://github.com/collinzrj/language_confusion_gate.
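The gating idea can be sketched as follows: inspect the top candidate tokens, and mask disallowed language families only when confusion is detected, so the distribution is untouched in the common case. The language-family lookup here is a stand-in for the distilled predictor.

```python
def gate_logits(logits, token_lang, allowed_langs, top_n=2):
    # Check only the top-n candidates: confusion is infrequent, and the
    # correct-language token is usually among the top predictions.
    top = sorted(logits, key=logits.get, reverse=True)[:top_n]
    if all(token_lang.get(t) in allowed_langs for t in top):
        return logits                       # no confusion: pass through
    # Confusion detected: mask tokens outside the allowed families.
    return {t: (v if token_lang.get(t) in allowed_langs else float("-inf"))
            for t, v in logits.items()}

logits = {"hello": 2.0, "你好": 1.9, "world": 0.5}
token_lang = {"hello": "latin", "你好": "cjk", "world": "latin"}
gated = gate_logits(logits, token_lang, allowed_langs={"latin"})
```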
TabR1 Taming GRPO for tabular reasoning LLMs
Authors: Pengxiang Cai, Zihao Gao, Jintai Chen
2025-10-20
Tabular prediction has traditionally relied on gradient-boosted decision
trees and specialized deep learning models, which excel within tasks but
provide limited interpretability and weak transfer across tables. Reasoning
large language models (LLMs) promise cross-task adaptability with transparent
reasoning traces, yet their potential has not been fully realized for tabular
data. This paper presents TabR1, the first reasoning LLM for tabular prediction
with multi-step reasoning. At its core is Permutation Relative Policy
Optimization (PRPO), a simple yet efficient reinforcement learning method that
encodes column-permutation invariance as a structural prior. By constructing
multiple label-preserving permutations per sample and estimating advantages
both within and across permutations, PRPO transforms sparse rewards into dense
learning signals and improves generalization. With limited supervision, PRPO
activates the reasoning ability of LLMs for tabular prediction, enhancing
few-shot and zero-shot performance as well as interpretability. Comprehensive
experiments demonstrate that TabR1 achieves performance comparable to strong
baselines under full-supervision fine-tuning. In the zero-shot setting, TabR1
approaches the performance of strong baselines under the 32-shot setting.
Moreover, TabR1 (8B) substantially outperforms much larger LLMs across various
tasks, achieving up to 53.17% improvement over DeepSeek-R1 (685B).
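The within- and across-permutation advantage estimation in PRPO can be sketched as group-relative baselining of a reward matrix; the equal weighting of the two terms is an illustrative assumption, not the paper's exact estimator.

```python
import numpy as np

def prpo_advantages(rewards):
    # rewards: shape (n_permutations, n_samples). Baseline each reward
    # against the mean within its permutation and across permutations,
    # turning sparse outcomes into dense relative learning signals.
    within = rewards - rewards.mean(axis=1, keepdims=True)
    across = rewards - rewards.mean(axis=0, keepdims=True)
    return 0.5 * (within + across)     # illustrative equal weighting

rewards = np.array([[1.0, 0.0],
                    [0.0, 1.0]])       # 2 permutations x 2 samples
adv = prpo_advantages(rewards)
```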
M2H Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception
Authors: U. V. B. L Udugama, George Vosselman, Francesco Nex
2025-10-20
Deploying real-time spatial perception on edge devices requires efficient
multi-task models that leverage complementary task information while minimizing
computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel
multi-task learning framework designed for semantic segmentation and depth,
edge, and surface normal estimation from a single monocular image. Unlike
conventional approaches that rely on independent single-task models or shared
encoder-decoder architectures, M2H introduces a Window-Based Cross-Task
Attention Module that enables structured feature exchange while preserving
task-specific details, improving prediction consistency across tasks. Built on
a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time
deployment and serves as the foundation for monocular spatial perception
systems supporting 3D scene graph construction in dynamic environments.
Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task
models on NYUDv2, surpasses single-task depth and semantic baselines on
Hypersim, and achieves superior performance on the Cityscapes dataset, all
while maintaining computational efficiency on laptop hardware. Beyond
benchmarks, M2H is validated on real-world data, demonstrating its practicality
in spatial perception tasks.
Localist LLMs with Recruitment Learning
Authors: Joachim Diederich
2025-10-20
We present a novel framework for training large language models with
continuously adjustable internal representations that span the full spectrum
from localist (interpretable, rule-based) to distributed (generalizable,
efficient) encodings. The key innovations are (1) a locality dial, a tunable
parameter that dynamically controls the degree of localization during both
training and inference without requiring model retraining, (2) an
information-theoretic recruitment mechanism that adaptively allocates semantic
blocks as needed, eliminating the requirement for complete domain knowledge at
initialization, and (3) a hierarchical recruitment framework that extends
capacity allocation to entire specialized LLMs, enabling multi-granularity
architectural adaptation. This is achieved through group sparsity penalties on
attention mechanisms, information-theoretic anchor design, dynamic rule
injection, and principled recruitment criteria based on penalized likelihood
with explicit units. We provide rigorous mathematical results establishing
explicit threshold conditions under which attention provably concentrates on
semantically relevant blocks at stationary points, with exact bounds on
attention entropy and pointer fidelity. The hierarchical recruitment mechanism
provides convergence guarantees at both the block level (fine-grained,
within-LLM) and the LLM level (coarse-grained, cross-domain), ensuring the
system discovers semantic partitions that balance model complexity against data
encoding efficiency. This framework enables practitioners to continuously
interpolate between interpretable and high-performance modes while adapting
architectural capacity at multiple granularities, supporting applications in
regulated domains requiring both transparency and capability.
Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems
Authors: Rishi Jha, Harold Triedman, Justin Wagle, Vitaly Shmatikov
2025-10-20
Control-flow hijacking attacks manipulate orchestration mechanisms in
multi-agent systems into performing unsafe actions that compromise the system
and exfiltrate sensitive information. Recently proposed defenses, such as
LlamaFirewall, rely on alignment checks of inter-agent messages to ensure
that all agent invocations are "related to" and "likely to further" the
original objective.
We start by demonstrating control-flow hijacking attacks that evade these
defenses even if alignment checks are performed by advanced LLMs. We argue that
the safety and functionality objectives of multi-agent systems fundamentally
conflict with each other. This conflict is exacerbated by the brittle
definitions of "alignment" and the checkers' incomplete visibility into the
execution context.
We then propose, implement, and evaluate ControlValve, a new defense inspired
by the principles of control-flow integrity and least privilege. ControlValve
(1) generates permitted control-flow graphs for multi-agent systems, and (2)
enforces that all executions comply with these graphs, along with contextual
rules (generated in a zero-shot manner) for each agent invocation.
StreamingThinker Large Language Models Can Think While Reading
Authors: Junlong Tong, Yingqi Fan, Anhao Zhao, Yunpu Ma, Xiaoyu Shen
2025-10-20
Large language models (LLMs) have demonstrated remarkable capabilities in
chain-of-thought (CoT) reasoning. However, the current reasoning paradigm
initiates thinking only after the entire input is available, which introduces
unnecessary latency and weakens attention to earlier information in dynamic
scenarios. Inspired by the human cognitive process of thinking while reading,
we first design a \textit{\textbf{streaming thinking}} paradigm for LLMs, where
reasoning unfolds in the order of the input and further adjusts its depth once
reading is complete. We instantiate this paradigm with
\textit{StreamingThinker}, a framework that enables LLMs to think while reading
through the integration of streaming CoT generation, streaming-constraint
training, and streaming parallel inference. Specifically, StreamingThinker
employs streaming reasoning units with quality control for CoT generation,
enforces order-preserving reasoning through streaming attention masks and
position encoding, and leverages parallel KV caches that decouple input
encoding from reasoning generation, thereby ensuring alignment and enabling
true concurrency. We evaluate StreamingThinker on the Qwen3 model family across
math reasoning, logical reasoning, and context-based QA reasoning tasks.
Experimental results show that StreamingThinker preserves performance
comparable to batch thinking, while yielding an 80\% reduction in token waiting
before the onset of reasoning and a more than 60\% reduction in time-level
latency for producing the final answer, demonstrating the effectiveness of the
streaming paradigm for LLM reasoning. Code will be released at
\href{https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker}{this
repository.}
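The order-preserving constraint can be pictured as a block-causal attention mask over input and thinking tokens. The sketch below is a toy construction under assumed sizes and a fixed one-unit-per-chunk rule, not the paper's exact mask:

```python
import numpy as np

def streaming_mask(n_input, n_think, chunk=2):
    """Toy order-preserving attention mask: thinking token j may attend
    only to the input tokens already read when its reasoning unit starts
    (one unit per `chunk` input tokens), plus earlier thinking tokens.
    Sizes and the chunking rule are illustrative assumptions.
    """
    m = np.zeros((n_think, n_input + n_think), dtype=bool)
    for j in range(n_think):
        visible_inputs = min(n_input, (j + 1) * chunk)
        m[j, :visible_inputs] = True              # inputs read so far
        m[j, n_input:n_input + j + 1] = True      # causal over thoughts
    return m
```

Because each row only ever widens its visible-input prefix, reasoning can start as soon as the first chunk arrives instead of waiting for the full input.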
DSEBench A Test Collection for Explainable Dataset Search with Examples
Authors: Qing Shi, Jing He, Qiaosheng Chen, Gong Cheng
2025-10-20
Dataset search has been an established information retrieval task. Current
paradigms either retrieve datasets that are relevant to a keyword query or find
datasets that are similar to an input target dataset. To allow for their
combined specification of information needs, in this article, we investigate
the more generalized task of Dataset Search with Examples (DSE) and further
extend it to Explainable DSE that requires identifying the metadata and content
fields of a dataset that indicate its relevance to the query and similarity to
the target datasets. To facilitate this research, we construct DSEBench, a test
collection that provides high-quality dataset- and field-level annotations to
enable the evaluation of explainable DSE. We also employ a large language model
to generate numerous annotations to be used for training. We establish
extensive baselines on DSEBench by adapting and evaluating a variety of sparse,
dense, and LLM-based retrieval, reranking, and explanation methods.
CosmoCore Affective Dream-Replay Reinforcement Learning for Code Generation
Authors: Santhosh Kumar Ravindran
2025-10-20
We introduce CosmoCore, a neuroscience-inspired reinforcement learning (RL)
architecture that integrates affective signals to enhance code generation in
large language models (LLMs). Motivated by human and animal learning, where
embarrassment from mistakes drives rapid correction (as observed in training a
puppy to avoid repeating errors after a single scolding), CosmoCore tags code
generation trajectories with valence and surprise using a lightweight
multi-layer perceptron (MLP). High-negative-valence (cringe) episodes, such as
buggy code outputs, are prioritized in a Dream Queue for five-fold replay
during off-policy updates, while low-surprise successes are pruned to prevent
overconfidence and buffer bloat. Evaluated on code generation benchmarks such
as HumanEval and BigCodeBench, alongside simulations with a custom data
pipeline environment, CosmoCore reduces hallucinated code (e.g., syntax errors
or logical bugs) by 48\% and accelerates self-correction by 45\%. Local
experiments using Hugging Face models in a PySpark environment validate these
gains, with code snippets provided for replication. Ablations confirm that
valence tagging boosts curiosity in exploration and that pruning mitigates
inefficiency.
This framework extends RL from human feedback (RLHF) for more emotionally aware
code assistants, with applications in IDEs and data pipelines. Code and the
custom mini-world simulation are released.
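The replay policy sketched in the abstract (five-fold replay of high-negative-valence episodes, pruning of low-surprise successes) might look like this toy buffer; the thresholds, the 5x factor's trigger, and all field names are assumptions:

```python
import heapq
import itertools

class DreamQueue:
    """Toy affective replay buffer: "cringe" (strongly negative valence)
    episodes are enqueued five-fold for prioritized replay, while
    low-surprise successes are dropped to avoid buffer bloat.
    """
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal valence

    def add(self, trajectory, valence, surprise):
        if valence > 0 and surprise < 0.1:
            return                          # prune low-surprise successes
        copies = 5 if valence < -0.5 else 1  # five-fold "cringe" replay
        for _ in range(copies):
            heapq.heappush(self._heap, (valence, next(self._counter), trajectory))

    def sample(self):
        # Most negative valence is replayed first
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

A min-heap keyed on valence makes the most embarrassing episode the next one replayed, which is the prioritization the abstract describes.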
ZSPAPrune Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models
Authors: Pu Zhang, Yuwei Li, Xingyuan Xian, Guoming Tang
2025-10-20
As the capabilities of Vision-Language Models (VLMs) advance, they can
process increasingly large inputs, which, unlike in LLMs, generates significant
visual token redundancy and leads to prohibitive inference costs. While many
methods aim to reduce these costs by pruning visual tokens, existing
approaches, whether based on attention or diversity, typically neglect the
guidance of the text prompt and thus fail to prioritize task relevance. In this
work, we propose a novel zero-shot method that reframes the problem by
introducing a prompt-aware perspective, explicitly modeling visual token
pruning as a balance between task relevance and information diversity. Our
hierarchical approach first selects a core set of task-relevant visual tokens
and then supplements them with diversity tokens to preserve broader context.
Experiments across multiple models and benchmarks show that our method achieves
performance that matches or surpasses the state-of-the-art with only minimal
accuracy loss, even when pruning up to 90\% of the tokens. Furthermore, these
gains are accompanied by significant reductions in GPU memory footprint and
inference latency.
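The two-stage relevance-then-diversity selection can be sketched as follows; the cosine scoring, the 70/30 core-versus-diversity split, and all names are illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np

def prompt_aware_prune(visual_tokens, prompt_vec, keep, core_frac=0.7):
    """Toy zero-shot prompt-aware pruning: keep a core set of tokens most
    similar to the prompt embedding, then add diversity tokens that are
    least similar (on average) to the already-kept set.
    """
    tokens = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    p = prompt_vec / np.linalg.norm(prompt_vec)
    relevance = tokens @ p                       # cosine similarity to prompt
    n_core = max(1, int(keep * core_frac))
    order = np.argsort(-relevance)
    kept = list(order[:n_core])                  # stage 1: task relevance
    rest = [i for i in order[n_core:]]
    while len(kept) < keep and rest:             # stage 2: diversity
        sims = tokens[rest] @ tokens[kept].T     # similarity to kept set
        idx = int(np.argmin(sims.mean(axis=1)))  # most dissimilar wins
        kept.append(rest.pop(idx))
    return sorted(kept)
```

Note the method needs no training: it only reuses token embeddings and the prompt embedding already present at inference time, which is what makes it zero-shot.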
When AI companions become witty Can human brain recognize AI-generated irony?
Authors: Xiaohui Rao, Hanlin Wu, Zhenguang G. Cai
2025-10-20
As Large Language Models (LLMs) are increasingly deployed as social agents
and trained to produce humor and irony, a question emerges: when encountering
witty AI remarks, do people interpret these as intentional irony or
mere computational output? This study investigates whether people adopt the
intentional stance, attributing mental states to explain behavior, toward AI
during irony comprehension. Irony provides an ideal paradigm because it
during irony comprehension. Irony provides an ideal paradigm because it
requires distinguishing intentional contradictions from unintended errors
through effortful semantic reanalysis. We compared behavioral and neural
responses to ironic statements from AI versus human sources using established
ERP components: P200 reflecting early incongruity detection and P600 indexing
cognitive efforts in reinterpreting incongruity as deliberate irony. Results
demonstrate that people do not fully adopt the intentional stance toward
AI-generated irony. Behaviorally, participants attributed incongruity to
deliberate irony for both sources, though significantly less for AI
than for human sources, showing a greater tendency to interpret AI
incongruities as
computational errors. Neural data revealed attenuated P200 and P600 effects for
AI-generated irony, suggesting reduced effortful detection and reanalysis
consistent with diminished attribution of communicative intent. Notably, people
who perceived AI as more sincere showed larger P200 and P600 effects for
AI-generated irony, suggesting that intentional stance adoption is calibrated
by specific mental models of artificial agents. These findings reveal that
source attribution shapes neural processing of social-communicative phenomena.
Despite current LLMs' linguistic sophistication, achieving genuine social
agency requires more than linguistic competence: it necessitates a shift in how
humans perceive and attribute intentionality to artificial agents.
ParaVul A Parallel Large Language Model and Retrieval-Augmented Framework for Smart Contract Vulnerability Detection
Authors: Tenghui Huang, Jinbo Wen, Jiawen Kang, Siyong Chen, Zhengtao Li, Tao Zhang, Dongning Liu, Jiacheng Wang, Chengjun Cai, Yinqiu Liu, Dusit Niyato
2025-10-20
Smart contracts play a significant role in automating blockchain services.
Nevertheless, vulnerabilities in smart contracts pose serious threats to
blockchain security. Currently, traditional detection methods primarily rely on
static analysis and formal verification, which can result in high
false-positive rates and poor scalability. Large Language Models (LLMs) have
recently made significant progress in smart contract vulnerability detection.
However, they still face challenges such as high inference costs and
substantial computational overhead. In this paper, we propose ParaVul, a
parallel LLM and retrieval-augmented framework to improve the reliability and
accuracy of smart contract vulnerability detection. Specifically, we first
develop Sparse Low-Rank Adaptation (SLoRA) for LLM fine-tuning. SLoRA
introduces sparsification by incorporating a sparse matrix into quantized
LoRA-based LLMs, thereby reducing computational overhead and resource
requirements while enhancing their ability to understand vulnerability-related
issues. We then construct a vulnerability contract dataset and develop a hybrid
Retrieval-Augmented Generation (RAG) system that integrates dense retrieval
with Best Matching 25 (BM25), assisting in verifying the results generated by
the LLM. Furthermore, we propose a meta-learning model to fuse the outputs of
the RAG system and the LLM, thereby generating the final detection results.
After completing vulnerability detection, we design chain-of-thought prompts to
guide LLMs to generate comprehensive vulnerability detection reports.
Simulation results demonstrate the superiority of ParaVul, especially in terms
of F1 scores, achieving 0.9398 for single-label detection and 0.9330 for
multi-label detection.
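Hybrid dense-plus-BM25 retrieval is commonly implemented as a score fusion over the two rankers. The sketch below shows one standard recipe (min-max normalization with a mixing weight); it is an illustrative assumption, not ParaVul's exact fusion, which the abstract says uses a learned meta-model:

```python
def hybrid_scores(dense, bm25, alpha=0.5):
    """Toy hybrid retrieval: min-max normalize each ranker's score dict,
    then mix with weight `alpha` (dense) vs. 1-alpha (BM25). The 0.5
    default is an assumption, not the paper's setting.
    """
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0           # guard against constant scores
        return {k: (v - lo) / span for k, v in scores.items()}

    nd, nb = norm(dense), norm(bm25)
    docs = set(dense) | set(bm25)
    # Documents missing from one ranker get 0 from that ranker
    return {k: alpha * nd.get(k, 0.0) + (1 - alpha) * nb.get(k, 0.0)
            for k in docs}
```

Normalizing before mixing matters because raw BM25 scores and dense cosine similarities live on very different scales.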
Can Transformer Memory Be Corrupted? Investigating Cache-Side Vulnerabilities in Large Language Models
Authors: Elias Hossain, Swayamjit Saha, Somshubhra Roy, Ravi Prasad
2025-10-20
Even when prompts and parameters are secured, language models
remain vulnerable because their key-value (KV) cache during inference
constitutes an overlooked attack surface. This paper introduces Malicious Token
Injection (MTI), a modular framework that systematically perturbs cached key
vectors at selected layers and timesteps through controlled magnitude and
frequency, using additive Gaussian noise, zeroing, and orthogonal rotations. A
theoretical analysis quantifies how these perturbations propagate through
attention, linking logit deviations to the Frobenius norm of the corruption and
to softmax Lipschitz dynamics. Empirical results show that MTI significantly
alters next-token distributions and downstream task performance across GPT-2
and LLaMA-2/7B, and destabilizes retrieval-augmented and agentic
reasoning pipelines. These findings identify cache integrity as a critical yet
underexplored vulnerability in current LLM deployments, positioning cache
corruption as a reproducible and theoretically grounded threat model for future
robustness and security research.
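The additive-Gaussian variant of such a cache perturbation is easy to picture. The sketch below assumes a toy cache layout of per-layer `(keys, values)` arrays; the layout and names are illustrative, not any real framework's KV-cache API:

```python
import numpy as np

def inject_noise(kv_cache, layer, t, sigma=0.1, rng=None):
    """Toy Malicious Token Injection: add Gaussian noise of scale sigma
    to the cached key vector of one layer at one timestep.

    kv_cache: dict mapping layer index -> (keys, values), each a (T, d)
    array. Returns a new cache; the input cache is left unmodified.
    """
    rng = rng or np.random.default_rng(0)
    keys, values = kv_cache[layer]
    keys = keys.copy()                                   # don't mutate input
    keys[t] += sigma * rng.standard_normal(keys.shape[-1])
    return {**kv_cache, layer: (keys, values)}
```

Because attention logits are inner products against these cached keys, a perturbation of Frobenius norm proportional to `sigma` propagates directly into logit deviations, which is the quantity the paper's analysis bounds.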
Enrich and Detect Video Temporal Grounding with Multimodal LLMs
Authors: Shraman Pramanick, Effrosyni Mavroudi, Yale Song, Rama Chellappa, Lorenzo Torresani, Triantafyllos Afouras
2025-10-19
We introduce ED-VTG, a method for fine-grained video temporal grounding
utilizing multi-modal large language models. Our approach harnesses the
capabilities of multimodal LLMs to jointly process text and video in order to
effectively localize natural language queries in videos through a two-stage
process. Rather than being directly grounded, language queries are initially
transformed into enriched sentences that incorporate missing details and cues
to aid grounding. In the second stage, these enriched queries are grounded
using a lightweight decoder, which specializes in predicting accurate
boundaries conditioned on contextualized representations of the enriched
queries. To mitigate noise and reduce the impact of hallucinations, our model
is trained with a multiple-instance-learning objective that dynamically selects
the optimal version of the query for each training sample. We demonstrate
state-of-the-art results across various benchmarks in temporal video grounding
and paragraph grounding settings. Experiments reveal that our method
significantly outperforms all previously proposed LLM-based temporal grounding
approaches and is either superior or comparable to specialized models, while
maintaining a clear advantage against them in zero-shot evaluation scenarios.
UniGTE Unified Graph-Text Encoding for Zero-Shot Generalization across Graph Tasks and Domains
Authors: Duo Wang, Yuan Zuo, Guangyue Lu, Junjie Wu
2025-10-19
Generalizing to unseen graph tasks without task-specific supervision is
challenging: conventional graph neural networks are typically tied to a fixed
label space, while large language models (LLMs) struggle to capture graph
structure. We introduce UniGTE, an instruction-tuned encoder-decoder framework
that unifies structural and semantic reasoning. The encoder augments a
pretrained autoregressive LLM with learnable alignment tokens and a
structure-aware graph-text attention mechanism, enabling it to attend jointly
to a tokenized graph and a natural-language task prompt while remaining
permutation-invariant to node order. This yields compact, task-aware graph
representations. Conditioned solely on these representations, a frozen
decoder predicts and reconstructs: it outputs the task answer and
simultaneously paraphrases the input graph in natural language. The
reconstruction objective regularizes the encoder to preserve structural cues.
UniGTE is instruction-tuned on five datasets spanning node-level, edge-level,
and graph-level tasks across diverse domains, yet requires no fine-tuning at
inference. It achieves new state-of-the-art zero-shot results on node
classification, link prediction, graph classification, and graph regression
under cross-task and cross-domain settings, demonstrating that tight
integration of graph structure with LLM semantics enables robust, transferable
graph reasoning.
ArmFormer Lightweight Transformer Architecture for Real-Time Multi-Class Weapon Segmentation and Classification
Authors: Akhila Kambhatla, Taminul Islam, Khaled R Ahmed
2025-10-19
The escalating threat of weapon-related violence necessitates automated
detection systems capable of pixel-level precision for accurate threat
assessment in real-time security applications. Traditional weapon detection
approaches rely on object detection frameworks that provide only coarse
bounding box localizations, lacking the fine-grained segmentation required for
comprehensive threat analysis. Furthermore, existing semantic segmentation
models either sacrifice accuracy for computational efficiency or require
excessive computational resources incompatible with edge deployment scenarios.
This paper presents ArmFormer, a lightweight transformer-based semantic
segmentation framework that strategically integrates the Convolutional Block
Attention Module (CBAM) with a MixVisionTransformer architecture to achieve
superior accuracy while maintaining computational efficiency suitable for
resource-constrained edge devices. Our approach combines a CBAM-enhanced
encoder backbone with an attention-integrated hamburger decoder to enable
multi-class
weapon segmentation across five categories: handgun, rifle, knife, revolver,
and human. Comprehensive experiments demonstrate that ArmFormer achieves
state-of-the-art performance with 80.64% mIoU and 89.13% mFscore while
maintaining real-time inference at 82.26 FPS. With only 4.886G FLOPs and 3.66M
parameters, ArmFormer outperforms heavyweight models requiring up to 48x more
computation, establishing it as the optimal solution for deployment on portable
security cameras, surveillance drones, and embedded AI accelerators in
distributed security infrastructure.
Neuronal Group Communication for Efficient Neural representation
Authors: Zhengqi Pei, Qingming Huang, Shuhui Wang
2025-10-19
The ever-increasing scale of modern neural networks has brought unprecedented
performance alongside daunting challenges in efficiency and interpretability.
This paper addresses the core question of how to build large neural systems
that learn efficient, modular, and interpretable representations. We propose
Neuronal Group Communication (NGC), a theory-driven framework that reimagines a
neural network as a dynamical system of interacting neuronal groups rather than
a monolithic collection of neural weights. Instead of treating each weight as
an independent trainable parameter, NGC treats weights as transient
interactions between embedding-like neuronal states, with neural computation
unfolding through iterative communication among groups of neurons. This
low-rank, modular representation yields compact models: groups of neurons
exchange low-dimensional signals, enabling intra-group specialization and
inter-group information sharing while dramatically reducing redundant
parameters. By drawing on dynamical systems theory, we introduce a neuronal
stability metric (analogous to Lyapunov stability) that quantifies the
contraction of neuron activations toward stable patterns during sequence
processing. Using this metric, we reveal that emergent reasoning capabilities
correspond to an external driving force or ``potential'', which nudges the
neural dynamics away from trivial trajectories while preserving stability.
Empirically, we instantiate NGC in large language models (LLMs) and demonstrate
improved performance on complex reasoning benchmarks under moderate
compression. NGC consistently outperforms standard low-rank approximations and
cross-layer basis-sharing methods at comparable compression rates. We conclude
by discussing the broader implications of NGC, including how structured
neuronal group dynamics might relate to generalization in high-dimensional
learning systems.
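The core idea of weights as interactions between low-dimensional neuronal-group states can be sketched as a factored linear map; the shapes and names below are illustrative assumptions, not the paper's parameterization:

```python
import numpy as np

def ngc_linear(x, e_out, e_in):
    """Toy NGC-style layer: instead of a dense (d_out, d_in) weight, the
    weight is the interaction of low-dimensional neuronal states,
    W = e_out @ e_in.T, so groups exchange k-dimensional signals.

    x: (batch, d_in) activations; e_in: (d_in, k); e_out: (d_out, k).
    Parameter count drops from d_out*d_in to (d_out + d_in)*k.
    """
    # Compute (x @ e_in) first: a k-dimensional "message" per example,
    # then expand through the output-group states.
    return (x @ e_in) @ e_out.T
```

Ordering the multiplications this way never materializes the full weight matrix, which is exactly where the redundancy reduction comes from.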
Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads
Authors: Zhoutong Wu, Yuan Zhang, Yiming Dong, Chenheng Zhang, Cong Fang, Kun Yuan, Zhouchen Lin
2025-10-19
Transformer models have driven breakthroughs across various language tasks
through their strong capability to learn rich contextual representations.
Scaling them to improve representation, however, often demands substantial
memory and compute costs, such as the Key-Value (KV) cache used during
auto-regressive decoding. Skip connections offer a promising way to improve
representation without bloating resource usage, yet most prior works either
improve expressivity while leaving KV cache costs unchanged, or reduce memory
at the cost of weaker representation. In this work, we propose SkipV1Former, a
Transformer variant that uses skip connections from the first layer's Value
heads to strengthen model representation and reduce the KV cache. Specifically,
from the second block onward, each layer reuses half of its Value heads from
the very first layer, while computing the other half as usual, cutting Value
projections and the V cache by nearly 50\%. Theoretically, we show that routing
uncompressed first-layer Values into deeper layers restores information lost to
compression and accelerates the model's implicit mesa-optimization, a key
pattern of Transformers in auto-regressive tasks. Empirically, across different
model scales, SkipV1Former delivers consistent reductions of approximately 25\%
in KV cache while improving perplexity relative to standard Multi-Head
Attention (MHA) Transformers and some advanced variants. Moreover, we propose a
recipe for uptraining existing MHA Transformer checkpoints to SkipV1Former with
only 10-15\% additional compute. Finally, SkipV1Former can be seamlessly
combined with advanced methods such as Group-Query Attention and Multi-Latent
Attention to achieve further KV cache savings and performance improvements.
When combined with YOCO, it cuts the KV cache size by nearly 50\% while still
improving performance.
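The half-head Value reuse can be sketched as follows; shapes and names are illustrative assumptions. Only the freshly projected half needs a per-layer projection and cache entry, which is where the roughly 50\% V-cache saving comes from:

```python
import numpy as np

def layer_values(x, w_v_half, first_layer_values, n_heads):
    """Toy SkipV1Former value computation for one deep layer: half the
    heads reuse the first layer's Value outputs via a skip connection;
    the other half are projected from x as usual.

    x: (T, d) hidden states; w_v_half: (d, d//2) projection for the
    fresh half; first_layer_values: (T, d) Value outputs of layer 1.
    """
    d = x.shape[-1]
    half = n_heads // 2
    head_dim = d // n_heads
    fresh = x @ w_v_half                               # only half projected
    reused = first_layer_values[:, :half * head_dim]   # skip from layer 1
    return np.concatenate([reused, fresh], axis=-1)
```

Since `reused` is identical across all deep layers, it is stored once (in the first layer's cache) rather than per layer, while each layer caches only its `fresh` half.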
Mixed-Precision Quantization for Language Models Techniques and Prospects
Authors: Mariam Rakka, Marios Fournarakis, Olga Krestinskaya, Jinane Bazzi, Khaled N. Salama, Fadi Kurdahi, Ahmed M. Eltawil, Mohammed E. Fouda
2025-10-19
The rapid scaling of language models (LMs) has resulted in unprecedented
computational, memory, and energy requirements, making their training and
deployment increasingly unsustainable. Quantization has emerged as an essential
technique to reduce model size, alleviate memory bottlenecks, and
accelerate inference. However, while uniform quantization (e.g., INT8,
INT4) provides significant efficiency gains, it can degrade accuracy in
sensitive components of transformer-based LMs. Mixed-precision quantization
offers a promising alternative by selectively allocating precision across
layers or within tensors to balance efficiency and accuracy. This survey
provides a comprehensive overview of Mixed-Precision Quantization (MPQ)
frameworks for pre-trained LMs (MPQ-PLMs). We first review quantization
fundamentals, including uniform and non-uniform quantizers, quantization
granularity, and methods widely used in post-training quantization. We then
categorize and compare recent MPQ-PLM frameworks according to their bit
allocation strategies and precision configurations across weights,
activations, and key-value (KV) caches. A comparative analysis highlights
differences in perplexity, zero-shot task performance, and deployment
trade-offs. Furthermore, we contrast MPQ-PLMs with earlier mixed-precision
quantization methods for deep neural networks, identifying strategies that
transfer and those that face challenges in the LM setting. Finally, we
summarize open issues and future directions, including hardware-aware design,
activation quantization, and scalable optimization methods for
billion-parameter models. By consolidating recent advances, this work serves
as a reference for understanding the current landscape and research prospects
of mixed-precision quantization for large-scale language models.
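As a concrete instance of the uniform quantizer the survey reviews, here is a symmetric per-tensor version; real mixed-precision frameworks vary the bit-width and granularity per layer, channel, or group rather than using one setting for the whole model:

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantizer: map float weights to signed `bits`-bit
    integers with a single per-tensor scale (per-channel or per-group
    scaling is the usual refinement in practice).
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.abs(w).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8 if bits <= 8 else np.int32), scale

def dequantize(q, scale):
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale
```

The round-trip error is bounded by half a quantization step (`scale / 2`), which shrinks as bit-width grows; mixed-precision methods exploit this by spending more bits only where that error hurts accuracy.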
3D-GSRD 3D Molecular Graph Auto-Encoder with Selective Re-mask Decoding
Authors: Chang Wu, Zhiyuan Liu, Wen Shu, Liang Wang, Yanchen Luo, Wenqiang Lei, Yatao Bian, Junfeng Fang, Xiang Wang
2025-10-19
Masked graph modeling (MGM) is a promising approach for molecular
representation learning (MRL). However, extending the success of re-mask
decoding from 2D to 3D MGM is non-trivial, primarily due to two conflicting
challenges: avoiding 2D structure leakage to the decoder, while still providing
sufficient 2D context for reconstructing re-masked atoms. To address these
challenges, we propose 3D-GSRD: a 3D Molecular Graph Auto-Encoder with
Selective Re-mask Decoding. The core innovation of 3D-GSRD lies in its
Selective Re-mask Decoding (SRD), which re-masks only 3D-relevant information
from encoder representations while preserving the 2D graph structures. SRD
is synergistically integrated with a 3D Relational-Transformer (3D-ReTrans)
encoder alongside a structure-independent decoder. We show that SRD,
combined with the structure-independent decoder, enhances the encoder's role in
MRL. Extensive experiments show that 3D-GSRD achieves strong downstream
performance, setting a new state-of-the-art on 7 out of 8 targets in the widely
used MD17 molecular property prediction benchmark. The code is released at
https://github.com/WuChang0124/3D-GSRD.
EMRRG Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation
Authors: Mingzheng Zhang, Jinfeng Gao, Dan Xu, Jiangrui Yu, Yuhan Qiao, Lan Chen, Jin Tang, Xiao Wang
2025-10-19
X-ray image-based medical report generation (MRG) is a pivotal area in
artificial intelligence that can significantly reduce diagnostic burdens for
clinicians and patient wait times. Existing MRG models predominantly rely on
Large Language Models (LLMs) to improve report generation, with limited
exploration of pre-trained vision foundation models or advanced fine-tuning
techniques. Mainstream frameworks either avoid fine-tuning or utilize
simplistic methods like LoRA, often neglecting the potential of enhancing
cross-attention mechanisms. Additionally, while Transformer-based models
dominate vision-language tasks, non-Transformer architectures, such as the
Mamba network, remain underexplored for medical report generation, presenting a
promising avenue for future research. In this paper, we propose EMRRG, a novel
X-ray report generation framework that fine-tunes pre-trained Mamba networks
using parameter-efficient methods. Specifically, X-ray images are divided into
patches, tokenized, and processed by an SSM-based vision backbone for feature
extraction, with Partial LoRA yielding optimal performance. An LLM with a
hybrid decoder generates the medical report, enabling end-to-end training and
achieving strong results on benchmark datasets. Extensive experiments on three
widely used benchmark datasets fully validated the effectiveness of our
proposed strategies for the X-ray MRG. The source code of this paper will be
released on https://github.com/Event-AHU/Medical_Image_Analysis.
L-MoE End-to-End Training of a Lightweight Mixture of Low-Rank Adaptation Experts
Authors: Shihao Ji, Zihui Song
2025-10-19
The Mixture of Experts (MoE) architecture enables the scaling of Large
Language Models (LLMs) to trillions of parameters by activating a sparse
subset of weights for each input, maintaining constant computational cost
during inference. Concurrently, Low-Rank Adaptation (LoRA) has emerged as a
dominant technique for parameter-efficient fine-tuning of LLMs on specialized
tasks. In
this work, we unify these two paradigms into a novel, end-to-end trainable
framework named L-MoE: a Lightweight Mixture of LoRA Experts. L-MoE redefines
MoE experts not as dense feed-forward networks, but as a collection of
task-specialized, low-rank adapters. A lightweight gating network, trained
jointly with the experts, learns to dynamically compose these LoRA adapters by
computing a weighted average of their parameters for each input token. This
composition is fully differentiable, allowing gradients from a standard
auto-regressive language modeling objective to flow back through the entire
architecture, simultaneously refining both the expert adapters and the routing
strategy. This approach creates a highly parameter-efficient MoE model that is
modular by design, allows for dynamic skill composition, and is trainable from
end-to-end. We present the formal mathematical framework for L-MoE, detailing
the differentiable routing mechanism and the joint optimization objective,
thereby providing a new path toward building more efficient, scalable, and
specialized language models.
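The differentiable composition can be sketched as a softmax gate that averages expert LoRA factors per token; the names, shapes, and single-layer setting below are illustrative assumptions:

```python
import numpy as np

def lmoe_delta(x_token, gate_w, loras):
    """Toy L-MoE composition for one linear layer: a softmax gate mixes
    expert LoRA parameters (A_i, B_i) by weighted-averaging them per
    token, giving one low-rank update B @ A for the frozen base weight.

    x_token: (d_in,); gate_w: (n_experts, d_in);
    loras: list of (A_i, B_i) with A_i: (r, d_in), B_i: (d_out, r).
    """
    logits = gate_w @ x_token
    g = np.exp(logits - logits.max())
    g = g / g.sum()                                   # softmax routing weights
    A = sum(w * a for w, (a, _) in zip(g, loras))     # averaged (r, d_in)
    B = sum(w * b for w, (_, b) in zip(g, loras))     # averaged (d_out, r)
    return B @ A                                      # low-rank delta-W
```

Because every operation here (softmax, weighted sums, matmul) is differentiable, a language-modeling loss can update both the experts and the gate jointly, which is the end-to-end property the abstract emphasizes.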
ELMM Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion
Authors: Wei Huang, Peining Li, Meiyu Liang, Xu Hou, Junping Du, Yingxia Shao, Guanhua Ye, Wu Liu, Kangkang Lu, Yang Yu
2025-10-19
Multimodal Knowledge Graphs (MKGs) extend traditional knowledge graphs by
incorporating visual and textual modalities, enabling richer and more
expressive entity representations. However, existing MKGs often suffer from
incompleteness, which hinders their effectiveness in downstream tasks.
Therefore, the multimodal knowledge graph completion (MKGC) task is receiving
increasing attention. While large language models (LLMs) have shown promise for
knowledge graph completion (KGC), their application to the multimodal setting
remains underexplored. Moreover, applying Multimodal Large Language Models
(MLLMs) to the task of MKGC introduces significant challenges: (1) the large
number of image tokens per entity leads to semantic noise and modality
conflicts, and (2) the high computational cost of processing large token
inputs. To address these issues, we propose Efficient Lightweight Multimodal
Large Language Models (ELMM) for MKGC. ELMM proposes a Multi-view Visual Token
Compressor (MVTC) based on multi-head attention mechanism, which adaptively
compresses image tokens from both textual and visual views, thereby effectively
reducing redundancy while retaining necessary information and avoiding modality
conflicts. Additionally, we design an attention pruning strategy to remove
redundant attention layers from MLLMs, thereby significantly reducing the
inference cost. We further introduce a linear projection to compensate for the
performance degradation caused by pruning. Extensive experiments on the benchmarks
FB15k-237-IMG and WN18-IMG demonstrate that ELMM achieves state-of-the-art
performance while substantially improving computational efficiency,
establishing a new paradigm for multimodal knowledge graph completion.
An Efficient Semantic Segmentation Decoder for In-Car or Distributed Applications
Authors: Danish Nazir, Gowtham Sai Inti, Timo Bartels, Jan Piewek, Thorsten Bagdonat, Tim Fingscheidt
2025-10-19
Modern automotive systems leverage deep neural networks (DNNs) for semantic
segmentation and operate in two key application areas: (1) In-car, where the
DNN solely operates in the vehicle without strict constraints on the data rate.
(2) Distributed, where one DNN part operates in the vehicle and the other part
typically on a large-scale cloud platform with a particular constraint on
transmission bitrate efficiency. Typically, both applications share an image
and source encoder, while each uses distinct (joint) source and task decoders.
Prior work utilized convolutional neural networks for joint source and task
decoding but did not investigate transformer-based alternatives such as
SegDeformer, which offer superior performance at the cost of higher
computational complexity. In this work, we propose joint feature and task
decoding for SegDeformer, thereby enabling lower computational complexity in
both in-car and distributed applications, despite SegDeformer's computational
demands. This improves scalability in the cloud while reducing in-car
computational complexity. For the in-car application, we increased the frames
per second (fps) by up to a factor of ( fps to fps) on
Cityscapes and by up to a factor of ( fps to fps) on
ADE20K, while being on par w.r.t.\ the mean intersection over union (mIoU) of
the transformer-based baseline that does not compress by a source codec. For the
distributed application, we achieve state-of-the-art (SOTA) over a wide range
of bitrates on the mIoU metric, while using only \% (\%) of cloud
DNN parameters used in previous SOTA, reported on ADE20K (Cityscapes).
Long-Context Attention Benchmark From Kernel Efficiency to Distributed Context Parallelism
Authors: Tao Bu, Qiangang Wang, Bowen Zeng, Hanwen Sun, Yunpeng Huang, Chun Cao, Jingwei Xu
2025-10-19
Transformer-based large language models (LLMs) have achieved remarkable
success, yet their standard attention mechanism incurs quadratic computation
and memory costs with respect to sequence length, posing a major bottleneck for
long-context training. Prior work tackles this challenge along two directions:
(1) kernel-level optimizations, which accelerate dense and sparse attention
operators; and (2) module-level strategies, often referred to as distributed
attention or context parallel training, which scale attention across multiple
devices. However, systematic evaluation still remains limited: operator-level
comparisons are often incomplete, while context parallel strategies are
typically framework-specific, with unclear performance analysis across
contexts. To address these gaps, we propose a unified benchmark that integrates
representative attention kernels and context parallel mechanisms with a modular
and extensible interface for evaluation. The benchmark evaluates methods along
two critical dimensions: (1) attention mask patterns, which strongly affect
efficiency, scalability, and usability, and (2) sequence length and distributed
scale, which determine performance under extreme long-context training. Through
comprehensive experiments on a cluster of up to 96 GPUs, our benchmark
enables reproducible comparisons, highlights method-specific trade-offs, and
provides practical guidance for designing and deploying attention mechanisms in
long-context LLM training.
U-Codec Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation
Authors: Xusheng Yang, Long Zhou, Wenfu Wang, Kai Hu, Shulin Feng, Chenxing Li, Meng Yu, Dong Yu, Yuexian Zou
2025-10-19
We propose \textbf{U-Codec}, an \textbf{U}ltra low frame-rate neural speech
\textbf{Codec} that achieves high-fidelity reconstruction and fast speech
generation at an extremely low frame-rate of 5Hz (5 frames per second). Since
extreme compression at 5Hz typically leads to severe loss of intelligibility and
spectral detail, we introduce a Transformer-based inter-frame long-term
dependency module and systematically explore residual vector quantization (RVQ)
depth and codebook size to identify optimal configurations. Moreover, we apply
U-Codec to a large language model (LLM)-based auto-regressive TTS model, which
leverages global and local hierarchical architecture to effectively capture
dependencies across multi-layer tokens. We extend LLM-based TTS from 3-layer
RVQ at 50Hz to 32-layer RVQ at 5Hz. Experimental results demonstrate that
U-Codec improves LLM-based TTS inference speed by around 3x over
high-frame-rate codecs while maintaining similarity and naturalness. These
results validate the feasibility of using highly compressed 5Hz discrete tokens
for fast and high-fidelity speech synthesis.
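The residual vector quantization the codec builds on can be sketched in a few lines (a toy illustration; the codebook size, vector dimension, and 4-stage depth below are arbitrary choices, not the paper's configuration):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage quantizes the residual
    left over by the previous stage."""
    residual = x.astype(float)
    indices = []
    for cb in codebooks:                        # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))             # nearest code at this stage
        indices.append(idx)
        residual = residual - cb[idx]           # pass residual downstream
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code from every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]   # toy 4-layer RVQ
x = rng.normal(size=8)
idx = rvq_encode(x, codebooks)
x_hat = rvq_decode(idx, codebooks)
```

Each stage encodes what the previous stages failed to capture, which is why deepening the RVQ stack (here to 32 layers at 5Hz) can trade codebook depth for frame rate.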
Count Counts Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards
Authors: Xuan Zhang, Ruixiao Li, Zhijian Zhou, Long Li, Yulei Qin, Ke Li, Xing Sun, Xiaoyu Tan, Chao Qu, Yuan Qi
2025-10-18
Reinforcement Learning (RL) has become a compelling way to strengthen the
multi-step reasoning ability of Large Language Models (LLMs). However,
prevalent RL paradigms still lean on outcome-based rewards and limited
exploration, which often drives LLMs toward repetitive and suboptimal reasoning
patterns. In this paper, we study the central question of how to design
exploration for LLM reasoning and introduce MERCI (Motivating Exploration in
LLM Reasoning with Count-based Intrinsic Rewards), a novel RL algorithm that
augments policy optimization with a principled intrinsic reward. Building on
the idea of count-based exploration, MERCI leverages a lightweight Coin
Flipping Network (CFN) to estimate pseudo-counts and epistemic
uncertainty over reasoning trajectories, and converts them into an intrinsic
reward that values novelty while preserving the learning signal from task
rewards. We integrate MERCI into advanced RL frameworks such as Group
Relative Policy Optimization (GRPO). Experiments on complex reasoning
benchmarks demonstrate that MERCI encourages richer and more varied chains of
thought, significantly improves performance over strong baselines, and helps
the policy escape local routines to discover better solutions. It indicates
that our targeted intrinsic motivation can make exploration reliable for
language model reasoning.
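The count-based bonus idea can be sketched as follows (a toy version: the paper's Coin Flipping Network estimates pseudo-counts over continuous trajectory representations, whereas this sketch uses an exact hash-based counter, and the `beta` scale is an arbitrary assumption):

```python
import math
from collections import defaultdict

class CountBasedBonus:
    """Toy pseudo-count bonus: novel trajectories earn a larger intrinsic
    reward, and repeat visits earn progressively less."""
    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)
        self.beta = beta

    def intrinsic_reward(self, trajectory_key):
        self.counts[trajectory_key] += 1
        n = self.counts[trajectory_key]
        return self.beta / math.sqrt(n)     # decays with visit count

def total_reward(task_reward, bonus, key):
    # the intrinsic term augments, but never replaces, the task signal
    return task_reward + bonus.intrinsic_reward(key)

bonus = CountBasedBonus(beta=0.1)
r1 = total_reward(1.0, bonus, "chain-of-thought-A")   # first visit: 1.0 + 0.1
r2 = total_reward(1.0, bonus, "chain-of-thought-A")   # repeat: smaller bonus
```

The decay schedule (here 1/sqrt(n)) is what pushes the policy away from repetitive reasoning routines without drowning out the outcome-based reward.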
VisionSelector End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
Authors: Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, Zheng-Jun Zha
2025-10-18
Multimodal Large Language Models (MLLMs) encounter significant computational
and memory bottlenecks from the massive number of visual tokens generated by
high-resolution images or multi-image inputs. Previous token compression
techniques are often constrained by heuristic rules that risk discarding
critical information. They may suffer from biases, such as attention sinks,
that lead to sharp performance drops under aggressive compression ratios. To
address these limitations, we recast token compression as a lightweight
plug-and-play framework that turns token selection into an end-to-end
learnable decision process. Specifically, we propose VisionSelector, a scorer
module decoupled from the MLLM backbone that incorporates a differentiable
Top-K mechanism and a curriculum annealing strategy to bridge the
training-inference gap, enabling efficient and adaptive token selection at
arbitrary compression rates. Remarkably lightweight with only 12.85M trainable
parameters, VisionSelector generalizes across various compression
rates and adaptively identifies critical tokens. This leads to
superior performance across all compression budgets, evidenced by preserving
100% accuracy on MME with a 30% retention budget, outperforming prior methods by
12.14% at a 10% retention budget, and doubling inference speed. Our code is
available at https://github.com/JulietChoo/VisionSelector .
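The differentiable Top-K idea can be illustrated with a simple relaxation (a sketch only; VisionSelector's actual gating mechanism and annealing schedule are not specified in the abstract, and the sigmoid gate below is an assumed stand-in):

```python
import numpy as np

def soft_topk_mask(scores, k, temperature):
    """Differentiable relaxation of Top-K selection: a sigmoid gate centred
    midway between the k-th and (k+1)-th largest scores (assumes
    k < len(scores)). A high temperature gives a soft, trainable mask;
    annealing the temperature toward zero recovers hard Top-K at inference."""
    srt = np.sort(scores)
    threshold = (srt[-k] + srt[-k - 1]) / 2.0
    return 1.0 / (1.0 + np.exp(-(scores - threshold) / temperature))

scores = np.array([2.0, -1.0, 0.5, 3.0, -0.5])   # per-token importance scores
soft = soft_topk_mask(scores, k=2, temperature=1.0)    # smooth gate, training
hard = soft_topk_mask(scores, k=2, temperature=0.01)   # near-binary, annealed
```

Annealing from the soft to the near-binary regime over training is one way to bridge the training-inference gap the abstract mentions: gradients flow through the smooth gate early on, while the final mask behaves like a hard token selection.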
SHIELD Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense
Authors: Yiyang Huang, Liang Shi, Yitian Zhang, Yi Xu, Yun Fu
2025-10-18
Large Vision-Language Models (LVLMs) excel in diverse cross-modal tasks.
However, object hallucination, where models produce plausible but inaccurate
object descriptions, remains a significant challenge. In contrast to previous
work focusing on LLM components, this paper is the first to trace LVLM
hallucinations to visual encoders and identifies three key issues: statistical
bias, inherent bias, and vulnerability. To address these challenges, we propose
SHIELD, a training-free framework that mitigates hallucinations through three
strategies: re-weighting visual tokens to reduce statistical bias, introducing
noise-derived tokens to counter inherent bias, and applying adversarial attacks
with contrastive decoding to address vulnerability. Experiments demonstrate
that SHIELD effectively mitigates object hallucinations across diverse
benchmarks and LVLM families. Moreover, SHIELD achieves strong performance on
the general LVLM benchmark, highlighting its broad applicability. Code will be
released.
Human-Aligned Code Readability Assessment with Large Language Models
Authors: Wendkûuni C. Ouédraogo, Yinghua Li, Xueqi Dang, Pawel Borsukiewicz, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé
2025-10-18
Code readability is crucial for software comprehension and maintenance, yet
difficult to assess at scale. Traditional static metrics often fail to capture
the subjective, context-sensitive nature of human judgments. Large Language
Models (LLMs) offer a scalable alternative, but their behavior as readability
evaluators remains underexplored. We introduce CoReEval, the first large-scale
benchmark for evaluating LLM-based code readability assessment, comprising over
1.4 million model-snippet-prompt evaluations across 10 state-of-the-art LLMs.
The benchmark spans 3 programming languages (Java, Python, CUDA), 2 code types
(functional code and unit tests), 4 prompting strategies (ZSL, FSL, CoT, ToT),
9 decoding settings, and developer-guided prompts tailored to junior and senior
personas. We compare LLM outputs against human annotations and a validated
static model, analyzing numerical alignment (MAE, Pearson's, Spearman's) and
justification quality (sentiment, aspect coverage, semantic clustering). Our
findings show that developer-guided prompting grounded in human-defined
readability dimensions improves alignment in structured contexts, enhances
explanation quality, and enables lightweight personalization through persona
framing. However, increased score variability highlights trade-offs between
alignment, stability, and interpretability. CoReEval provides a robust
foundation for prompt engineering, model alignment studies, and human in the
loop evaluation, with applications in education, onboarding, and CI/CD
pipelines where LLMs can serve as explainable, adaptable reviewers.
Ripple Effect Protocol Coordinating Agent Populations
Authors: Ayush Chopra, Aman Sharma, Feroz Ahmad, Luca Muscariello, Vijoy Pandey, Ramesh Raskar
2025-10-18
Modern AI agents can exchange messages using protocols such as A2A and ACP,
yet these mechanisms emphasize communication over coordination. As agent
populations grow, this limitation produces brittle collective behavior, where
individually smart agents converge on poor group outcomes. We introduce the
Ripple Effect Protocol (REP), a coordination protocol in which agents share not
only their decisions but also lightweight sensitivities - signals expressing
how their choices would change if key environmental variables shifted. These
sensitivities ripple through local networks, enabling groups to align faster
and more stably than with agent-centric communication alone. We formalize REP's
protocol specification, separating required message schemas from optional
aggregation rules, and evaluate it across scenarios with varying incentives and
network topologies. Benchmarks across three domains: (i) supply chain cascades
(Beer Game), (ii) preference aggregation in social networks (Movie Scheduling),
and (iii) sustainable resource allocation (Fishbanks) show that REP improves
coordination accuracy and efficiency over A2A by 41 to 100%, while flexibly
handling multimodal sensitivity signals from LLMs. By making coordination a
protocol-level capability, REP provides scalable infrastructure for the
emerging Internet of Agents.
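A minimal sketch of sensitivity sharing in this spirit (the message fields, aggregation rule, and all names below are illustrative assumptions, not the REP specification):

```python
from dataclasses import dataclass, field

@dataclass
class RippleMessage:
    """An agent shares its decision plus how that decision would shift if
    key environment variables changed (a local sensitivity, d(decision)/d(var))."""
    agent_id: str
    decision: float                                     # e.g. an order quantity
    sensitivities: dict = field(default_factory=dict)   # variable -> slope

def adjust_decision(own, neighbors, variable, observed_shift):
    """Shift one's own decision using neighbors' declared sensitivities to a
    single environment variable, rather than reacting to their decisions alone."""
    slopes = [m.sensitivities.get(variable, 0.0) for m in neighbors]
    avg_slope = sum(slopes) / max(len(slopes), 1)
    return own.decision + avg_slope * observed_shift

me = RippleMessage("a0", decision=10.0, sensitivities={"demand": 0.5})
peers = [RippleMessage("a1", 8.0, {"demand": 0.4}),
         RippleMessage("a2", 12.0, {"demand": 0.8})]
new_decision = adjust_decision(me, peers, "demand", observed_shift=5.0)
# the average peer slope (0.6) ripples the decision from 10.0 toward ~13.0
```

The point of the design is that the sensitivity field carries counterfactual information a bare decision cannot, letting a shift propagate through local networks in one hop per round.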
Language over Content Tracing Cultural Understanding in Multilingual Large Language Models
Authors: Seungho Cho, Changgeon Ko, Eui Jun Hwang, Junmyeong Lee, Huije Lee, Jong C. Park
2025-10-18
Large language models (LLMs) are increasingly used across diverse cultural
contexts, making accurate cultural understanding essential. Prior evaluations
have mostly focused on output-level performance, obscuring the factors that
drive differences in responses, while studies using circuit analysis have
covered few languages and rarely focused on culture. In this work, we trace
LLMs' internal cultural understanding mechanisms by measuring activation path
overlaps when answering semantically equivalent questions under two conditions:
varying the target country while fixing the question language, and varying the
question language while fixing the country. We also use same-language country
pairs to disentangle language from cultural aspects. Results show that internal
paths overlap more for same-language, cross-country questions than for
cross-language, same-country questions, indicating strong language-specific
patterns. Notably, the South Korea-North Korea pair exhibits low overlap and
high variability, showing that linguistic similarity does not guarantee aligned
internal representation.
Hybrid CNN-Transformer Based Sparse Channel Prediction for High-Mobility OTFS Systems
Authors: Zhaowei Guan, Wenkun Wen, Peiran Wu, Chen Wang, Minghua Xia
2025-10-18
High-mobility scenarios in next-generation wireless networks, such as those
involving vehicular communications, require ultra-reliable and low-latency
communications (URLLC). However, rapidly time-varying channels pose significant
challenges to traditional OFDM-based systems due to the Doppler effect and
channel aging. Orthogonal time frequency space (OTFS) modulation offers
resilience by representing channels in the quasi-static delay-Doppler (DD)
domain. This letter proposes a novel channel prediction framework for OTFS
systems using a hybrid convolutional neural network and Transformer
(CNN-Transformer) architecture. The CNN extracts compact features that exploit
the DD-domain sparsity of the channel matrices, while the Transformer models
temporal dependencies with causal masking for consistency. Simulation
experiments under extreme \si{km/h} mobility conditions demonstrate that
the proposed method outperforms state-of-the-art baselines, reducing the root
mean square error and mean absolute error by and ,
respectively. These results demonstrate the effectiveness of DD-domain
representations and the proposed model in accurately predicting channels in
high-mobility scenarios, thereby supporting the stringent URLLC requirements in
future wireless systems.
HGC-Avatar Hierarchical Gaussian Compression for Streamable Dynamic 3D Avatars
Authors: Haocheng Tang, Ruoke Yan, Xinhui Yin, Qi Zhang, Xinfeng Zhang, Siwei Ma, Wen Gao, Chuanmin Jia
2025-10-18
Recent advances in 3D Gaussian Splatting (3DGS) have enabled fast,
photorealistic rendering of dynamic 3D scenes, showing strong potential in
immersive applications. However, in digital human encoding and transmission,
the compression methods based on general 3DGS representations are limited by
the lack of human priors, resulting in suboptimal bitrate efficiency and
reconstruction quality at the decoder side, which hinders their application in
streamable 3D avatar systems. We propose HGC-Avatar, a novel Hierarchical
Gaussian Compression framework designed for efficient transmission and
high-quality rendering of dynamic avatars. Our method disentangles the Gaussian
representation into a structural layer, which maps poses to Gaussians via a
StyleUNet-based generator, and a motion layer, which leverages the SMPL-X model
to represent temporal pose variations compactly and semantically. This
hierarchical design supports layer-wise compression, progressive decoding, and
controllable rendering from diverse pose inputs such as video sequences or
text. Since people are most concerned with facial realism, we incorporate a
facial attention mechanism during StyleUNet training to preserve identity and
expression details under compression rate constraints. Experimental results
demonstrate that HGC-Avatar provides a streamable solution for rapid 3D avatar
rendering, while significantly outperforming prior methods in both visual
quality and compression efficiency.
FrugalPrompt Reducing Contextual Overhead in Large Language Models via Token Attribution
Authors: Syed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran, Mohammad Ali Moni
2025-10-18
Large language models (LLMs) owe much of their stellar performance to
expansive input contexts, yet such verbosity inflates monetary costs, carbon
footprint, and inference-time latency. Much of this overhead manifests from the
redundant low-utility tokens present in typical prompts, as only a fraction of
tokens typically carries the majority of the semantic weight. We address this
inefficiency by introducing FrugalPrompt, a novel prompt compression framework
for LLMs, which retains only the most semantically significant tokens.
Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX,
we assign salience scores to every token in an input sequence, rank them to
preserve the top-k% tokens in their original order, and obtain a
frugalized prompt. We evaluate the approach across four NLP tasks: Sentiment
Analysis, Commonsense QA, Summarization, and Mathematical Reasoning, using a
suite of frontier LLMs. For the first three tasks, a 20% prompt reduction
incurs only a marginal loss in task performance, demonstrating that
contemporary LLMs can reconstruct elided context from high-salience cues. In
contrast, performance on mathematical reasoning deteriorates sharply,
reflecting a stronger dependence on complete token continuity. Further analysis
with bottom-k% and random-k% tokens reveals asymmetric performance patterns
that may suggest potential task contamination effects, wherein models may
resort to shallow memorized patterns from pretraining exposure for conventional
NLP tasks. We posit that our work contributes to a more nuanced understanding
of LLM behavior in performance-efficiency trade-offs, and delineate the
boundary between tasks tolerant to contextual compression and those requiring
exhaustive context. Our source code and models are available at:
https://github.com/Starscream-11813/Frugal-ICL.
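The frugalization step reduces to a rank-and-filter over salience scores (a sketch; real salience scores would come from an attribution method such as GlobEnc or DecompX rather than being supplied by hand as here):

```python
def frugalize(tokens, salience, keep_ratio=0.8):
    """Keep the top keep_ratio fraction of tokens by salience score,
    preserving their original order in the prompt."""
    k = max(1, int(len(tokens) * keep_ratio))
    # rank token positions by salience, take the k most salient
    top = sorted(range(len(tokens)), key=lambda i: salience[i], reverse=True)[:k]
    keep = sorted(top)                       # restore original ordering
    return [tokens[i] for i in keep]

tokens = ["The", "quick", "brown", "fox", "jumps",
          "over", "the", "lazy", "dog", "."]
salience = [0.1, 0.7, 0.6, 0.9, 0.8, 0.3, 0.1, 0.5, 0.9, 0.2]
print(frugalize(tokens, salience, keep_ratio=0.8))
```

Restoring the original order after ranking is the important detail: the retained high-salience cues must stay in sequence for the model to reconstruct the elided context.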
Learning to Optimize Edge Robotics A Fast Integrated Perception-Motion-Communication Approach
Authors: Dan Guo, Xibin Jin, Shuai Wang, Zhigang Wen, Miaowen Wen, Chengzhong Xu
2025-10-18
Edge robotics involves frequent exchanges of large-volume multi-modal data.
Existing methods ignore the interdependency between robotic functionalities and
communication conditions, leading to excessive communication overhead. This
paper revolutionizes edge robotics systems through integrated perception,
motion, and communication (IPMC). As such, robots can dynamically adapt their
communication strategies (i.e., compression ratio, transmission frequency,
transmit power) by leveraging the knowledge of robotic perception and motion
dynamics, thus reducing the need for excessive sensor data uploads.
Furthermore, by leveraging the learning to optimize (LTO) paradigm, an
imitation learning neural network is designed and implemented, which reduces
the computational complexity by over 10x compared to state-of-the-art
optimization solvers. Experiments demonstrate the superiority of the proposed
IPMC and the real-time execution capability of LTO.
FourierCompress Layer-Aware Spectral Activation Compression for Efficient and Accurate Collaborative LLM Inference
Authors: Jian Ma, Xinchen Lyu, Jun Jiang, Longhao Zou, Chenshan Ren, Qimei Cui, Xiaofeng Tao
2025-10-18
Collaborative large language model (LLM) inference enables real-time,
privacy-preserving AI services on resource-constrained edge devices by
partitioning computational workloads between client devices and edge servers.
However, this paradigm is severely hindered by communication bottlenecks caused
by the transmission of high-dimensional intermediate activations, exacerbated
by the autoregressive decoding structure of LLMs, where bandwidth consumption
scales linearly with output length. Existing activation compression methods
struggle to simultaneously achieve high compression ratios, low reconstruction
error, and computational efficiency. This paper proposes FourierCompress, a
novel, layer-aware activation compression framework that exploits the
frequency-domain sparsity of LLM activations. We rigorously demonstrate that
activations from the first Transformer layer exhibit strong smoothness and
energy concentration in the low-frequency domain, making them highly amenable
to near-lossless compression via the Fast Fourier Transform (FFT).
FourierCompress transforms activations into the frequency domain, retains only
a compact block of low-frequency coefficients, and reconstructs the signal at
the server using conjugate symmetry, enabling seamless hardware acceleration on
DSPs and FPGAs. Extensive experiments on Llama 3 and Qwen2.5 models across 10
commonsense reasoning datasets demonstrate that FourierCompress preserves
performance remarkably close to the uncompressed baseline, outperforming Top-k,
QR, and SVD. FourierCompress bridges the gap between communication efficiency
(an average 7.6x reduction in activation size), near-lossless inference (less
than 0.3% average accuracy loss), and significantly faster compression
(achieving over 32x reduction in compression time compared to Top-k via
hardware acceleration) for edge-device LLM inference.
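The compress/reconstruct cycle can be sketched with NumPy's real FFT (an illustration of the general technique on a synthetic smooth signal, not the paper's implementation):

```python
import numpy as np

def fourier_compress(activation, keep):
    """Keep only the lowest `keep` rFFT coefficients of a smooth activation
    vector; everything above is discarded."""
    coeffs = np.fft.rfft(activation)
    return coeffs[:keep]                     # compact low-frequency block

def fourier_decompress(low_coeffs, length):
    """Zero-pad the truncated spectrum and invert; irfft enforces the
    conjugate symmetry of a real signal, so only half-spectrum is needed."""
    full = np.zeros(length // 2 + 1, dtype=complex)
    full[:len(low_coeffs)] = low_coeffs
    return np.fft.irfft(full, n=length)

# a smooth, low-frequency-dominated signal, standing in for the first-layer
# activations the paper analyzes
t = np.linspace(0, 1, 256, endpoint=False)
x = np.sin(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 5 * t)
x_hat = fourier_decompress(fourier_compress(x, keep=16), length=256)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

When energy concentrates in the low-frequency bins, a 16-of-129 coefficient budget reconstructs this signal near-losslessly; the same energy-concentration argument underlies the paper's claims for first-layer activations.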
Longwave-transparent low-emissivity material
Authors: Yue Zhang, Longnan Li, Junyan Dai, Xiaowen Zhang, Qunyan Zhou, Naiqin Yi, Ruizhe Jian, Fei Zhu, Xiaopeng Li, Mengke Sun, Jiazheng Wu, Xinfeng Li, Xiangtong Kong, Ziai Liu, Yinwei Li, Qiang Cheng, Yiming Zhu, Tie Jun Cui, Wei Li
2025-10-18
Low-emissivity (low-e) materials are crucial for conserving thermal energy in
buildings, cold chain logistics and transportation by minimizing unwanted
radiative heat loss or gain. However, their metallic nature intrinsically
causes severe longwave attenuation, hindering their broad applications. Here,
we introduce, for the first time, an all-dielectric longwave-transparent
low-emissivity material with ultra-broadband, high transmittance spanning
9 orders of magnitude, from terahertz to kilohertz frequencies. This
meter-scale material not only achieves energy savings of up to 41.1% over commercial
white paint and 10.2% over traditional low-e materials, but also unlocks
various fundamentally new capabilities including high-speed wireless
communication in energy-efficient buildings, wireless energy transfer with
radiative thermal insulation, as well as non-invasive terahertz security
screening and radio frequency identification in cold chain logistics. Our
approach represents a new photonic solution towards carbon neutrality and smart
city development, paving the way for a more sustainable and interconnected
future.
Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with Prior
Authors: Fuqun Han, Stanley Osher, Wuchen Li
2025-10-18
In this work, we propose a sparse transformer architecture that incorporates
prior information about the underlying data distribution directly into the
structure of the neural network. The design of the model is
motivated by a special optimal transport problem, namely the regularized
Wasserstein proximal operator, which admits a closed-form solution and turns
out to be a special representation of transformer architectures. Compared with
classical flow-based models, the proposed approach improves the convexity
properties of the optimization problem and promotes sparsity in the generated
samples. Through both theoretical analysis and numerical experiments, including
applications in generative modeling and Bayesian inverse problems, we
demonstrate that the proposed architecture achieves higher accuracy and faster
convergence to the target distribution than classical neural ODE-based methods.
Does GenAI Rewrite How We Write? An Empirical Study on Two-Million Preprints
Authors: Minfeng Qi, Zhongmin Cao, Qin Wang, Ningran Li, Tianqing Zhu
2025-10-18
Preprint repositories have become central infrastructures for scholarly
communication. Their expansion transforms how research is circulated and
evaluated before journal publication. Generative large language models (LLMs)
introduce a further potential disruption by altering how manuscripts are
written. While speculation abounds, systematic evidence of whether and how LLMs
reshape scientific publishing remains limited.
This paper addresses the gap through a large-scale analysis of more than 2.1
million preprints spanning 2016--2025 (115 months) across four major
repositories (i.e., arXiv, bioRxiv, medRxiv, SocArXiv). We introduce a
multi-level analytical framework that integrates interrupted time-series
models, collaboration and productivity metrics, linguistic profiling, and topic
modeling to assess changes in volume, authorship, style, and disciplinary
orientation. Our findings reveal that LLMs have accelerated submission and
revision cycles, modestly increased linguistic complexity, and
disproportionately expanded AI-related topics, while computationally intensive
fields benefit more than others. These results show that LLMs act less as
universal disruptors than as selective catalysts, amplifying existing strengths
and widening disciplinary divides. By documenting these dynamics, the paper
provides the first empirical foundation for evaluating the influence of
generative AI on academic publishing and highlights the need for governance
frameworks that preserve trust, fairness, and accountability in an AI-enabled
research ecosystem.
What Limits Agentic Systems Efficiency?
Authors: Song Bian, Minghao Yan, Anand Jayarajan, Gennady Pekhimenko, Shivaram Venkataraman
2025-10-18
Large Language Models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have
demonstrated strong reasoning capabilities. To further enhance these
capabilities, recent agentic systems, such as Deep Research, incorporate web
interactions into LLM reasoning to mitigate uncertainties and reduce potential
errors. However, existing research predominantly focuses on reasoning
performance, often neglecting the efficiency of agentic systems. In this work,
we present a comprehensive empirical study that identifies efficiency
bottlenecks in web-interactive agentic systems. We decompose end-to-end latency
into two primary components: LLM API latency and web environment latency. We
conduct a comprehensive empirical study across 15 models and 5 providers to
demonstrate high variability in API-based agentic systems. We observe that web
environment latency can contribute as much as 53.7% to the overall latency in a
web-based agentic system. To improve latency, we propose SpecCache, a caching
framework augmented with speculative execution that can reduce web environment
overhead. Extensive evaluations on two standard benchmarks show that our
approach improves the cache hit rate by up to 58x compared to a random caching
strategy, while reducing web environment overhead by up to 3.2x, without
degrading agentic system performance.
One-Bit Quantization for Random Features Models
Authors: Danil Akhtiamov, Reza Ghane, Babak Hassibi
2025-10-17
Recent advances in neural networks have led to significant computational and
memory demands, spurring interest in one-bit weight quantization to enable
efficient inference on resource-constrained devices. However, the theoretical
underpinnings of such quantization remain poorly understood. We address this gap
by analyzing one-bit quantization in the Random Features model, a simplified
framework that corresponds to neural networks with random representations. We
prove that, asymptotically, quantizing weights of all layers except the last
incurs no loss in generalization error, compared to the full precision random
features model. Our findings offer theoretical insights into neural network
quantization. We also demonstrate empirically that one-bit quantization leads to
significant inference speed ups for the Random Features models even on a laptop
GPU, confirming the practical benefits of our work. Additionally, we provide an
asymptotically precise characterization of the generalization error for Random
Features with an arbitrary number of layers. To the best of our knowledge, our
analysis yields more general results than all previous works in the related
literature.
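A minimal random-features experiment in this spirit (a sketch; the sign-and-rescale rule below is a common heuristic assumption, not the paper's construction, and the dimensions are arbitrary):

```python
import numpy as np

def random_features_predict(X, W, a, quantize=False):
    """Random features model: ReLU(X W) a, with a fixed random hidden layer W
    and a readout vector a. With quantize=True, each hidden weight is replaced
    by its sign (one bit), rescaled so every column keeps its original norm."""
    if quantize:
        col_norms = np.linalg.norm(W, axis=0)            # per-feature scale
        W = np.sign(W) * (col_norms / np.sqrt(W.shape[0]))
    return np.maximum(X @ W, 0.0) @ a

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 32))               # inputs
W = rng.normal(size=(32, 512))               # random hidden weights (quantized)
a = rng.normal(size=512) / np.sqrt(512)      # readout layer kept full precision
y_full = random_features_predict(X, W, a)
y_1bit = random_features_predict(X, W, a, quantize=True)
```

Quantizing only the random layer while leaving the last (trained) layer in full precision mirrors the regime the paper analyzes: the two predictions stay strongly correlated even though each hidden weight carries a single bit.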
SentinelNet Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection
Authors: Yang Feng, Xudong Pan
2025-10-17
Malicious agents pose significant threats to the reliability and
decision-making capabilities of Multi-Agent Systems (MAS) powered by Large
Language Models (LLMs). Existing defenses often fall short due to reactive
designs or centralized architectures which may introduce single points of
failure. To address these challenges, we propose SentinelNet, the first
decentralized framework for proactively detecting and mitigating malicious
behaviors in multi-agent collaboration. SentinelNet equips each agent with a
credit-based detector trained via contrastive learning on augmented adversarial
debate trajectories, enabling autonomous evaluation of message credibility and
dynamic neighbor ranking via bottom-k elimination to suppress malicious
communications. To overcome the scarcity of attack data, it generates
adversarial trajectories simulating diverse threats, ensuring robust training.
Experiments on MAS benchmarks show SentinelNet achieves near-perfect detection
of malicious agents, close to 100% within two debate rounds, and recovers 95%
of system accuracy from compromised baselines. By exhibiting strong
generalizability across domains and attack patterns, SentinelNet establishes a
novel paradigm for safeguarding collaborative MAS.
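Credit-based bottom-k elimination can be sketched as follows (function names and the credit-update rule are illustrative assumptions; the paper's detector is a trained contrastive model, not the fixed blend used here):

```python
def update_credit(credits, agent, message_score, lr=0.5):
    """Blend the detector's per-message credibility score into the agent's
    running credit (exponential moving average as a stand-in)."""
    credits[agent] = (1 - lr) * credits.get(agent, 0.5) + lr * message_score
    return credits

def rank_and_eliminate(credits, k):
    """Dynamic neighbor ranking with bottom-k elimination: drop the k agents
    with the lowest credit this round (assumes 1 <= k < len(credits))."""
    ranked = sorted(credits, key=credits.get, reverse=True)
    return ranked[:-k], ranked[-k:]          # (survivors, eliminated)

credits = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.75}
credits = update_credit(credits, "c", message_score=0.1)  # low-credibility msg
survivors, dropped = rank_and_eliminate(credits, k=1)
# agent "c" is suppressed for this debate round; a, b, d keep participating
```

Because every agent runs this locally over its own neighbors, there is no central arbiter to compromise, which is the decentralization argument the abstract makes.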