2025-10-24
Table of Contents
- ToolDreamer Instilling LLM Reasoning Into Tool Retrievers
- GaLLoP Gradient-based Sparse Learning on Low-Magnitude Parameters
- CommonSense Efficient Set Intersection (SetX) Protocol Based on Compressed Sensing
- Dictionary learning methods for brain activity mapping with MEG data
- Are Large Language Models Sensitive to the Motives Behind Communication?
- CircuitGuard Mitigating LLM Memorization in RTL Code Generation Against IP Leakage
- CoSense-LLM Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
- Overlap-weighted orthogonal meta-learner for treatment effect estimation over time
- ELUTQ Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices
- MSC-Bench A Rigorous Benchmark for Multi-Server Tool Orchestration
- Monitoring LLM-based Multi-Agent Systems Against Corruptions via Node Evaluation
- MoE-Prism Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
- Multi-code rate Task-Oriented Communication for Multi-Edge Cooperative Inference
- LAPRAD LLM-Assisted PRotocol Attack Discovery
- RLBoost Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs
- Tibetan Language and AI A Comprehensive Survey of Resources, Methods and Challenges
- An Efficient Calibration Framework for Volatility Derivatives under Rough Volatility with Jumps
- From Memorization to Generalization Fine-Tuning Large Language Models for Biomedical Term-to-Identifier Normalization
- CLiVR Conversational Learning System in Virtual Reality with AI-Powered Patients
- An Encoder-Decoder Foundation Chemical Language Model for Generative Polymer Design
- Dimensionality Reduction for Remote Sensing Data Analysis A Systematic Review of Methods and Applications
- MTraining Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
- mSQUID Model-Based Learned Modulo Recovery at Low Sampling Rates
- SSD Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation
- Fetch.ai An Architecture for Modern Multi-Agent Systems
- Reasoning Language Model Inference Serving Unveiled An Empirical Study
- Binary Quadratic Quantization Beyond First-Order Quantization for Real-Valued Matrix Compression
- C-SWAP Explainability-Aware Structured Pruning for Efficient Neural Networks Compression
- Tokencake A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications
- EfficientNav Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval
- LLMs as Sparse Retrievers A Framework for First-Stage Product Search
- From Quarter to All Accelerating Speculative LLM Decoding via Floating-Point Exponent Remapping and Parameter Sharing
- The Attribution Story of WhisperGate An Academic Perspective
- LAFA Agentic LLM-Driven Federated Analytics over Decentralized Data Sources
- DART A Structured Dataset of Regulatory Drug Documents in Italian for Clinical NLP
- CircuitSeer Mining High-Quality Data by Probing Mathematical Reasoning Circuits in LLMs
- Adamas Hadamard Sparse Attention for Efficient Long-Context Inference
- How2Compress Scalable and Efficient Edge Video Analytics via Adaptive Granular Video Compression
- MENTOR A Reinforcement Learning Framework for Model Enhancement via Teacher-Optimized Rewards in Small Models
- S2AP Score-space Sharpness Minimization for Adversarial Pruning
- Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers
- Learning Human-Object Interaction as Groups
- Text or Pixels? It Takes Half On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
- StreamingTOM Streaming Token Compression for Efficient Video Understanding
- Learning from the Best, Differently A Diversity-Driven Rethinking on Data Selection
- DeepSeek-OCR Contexts Optical Compression
- Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
- Extracting Rule-based Descriptions of Attention Features in Transformers
- Any-Depth Alignment Unlocking Innate Safety Alignment of LLMs to Any-Depth
- MEG-GPT A transformer-based foundation model for magnetoencephalography data
- CompactPrompt A Unified Pipeline for Prompt Data Compression in LLM Workflows
- OPTAGENT Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning
- From Local to Global Revisiting Structured Pruning Paradigms for Large Language Models
- Glyph Scaling Context Windows via Visual-Text Compression
- Beyond More Context Retrieval Diversity Boosts Multi-Turn Intent Understanding
- ZACH-ViT A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification
- Language Confusion Gate Language-Aware Decoding Through Model Self-Distillation
- TabR1 Taming GRPO for tabular reasoning LLMs
- M2H Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception
- Localist LLMs with Recruitment Learning
- Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems
- StreamingThinker Large Language Models Can Think While Reading
- DSEBench A Test Collection for Explainable Dataset Search with Examples
- CosmoCore Affective Dream-Replay Reinforcement Learning for Code Generation
- ZSPAPrune Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models
- When AI companions become witty Can human brain recognize AI-generated irony?
- ParaVul A Parallel Large Language Model and Retrieval-Augmented Framework for Smart Contract Vulnerability Detection
- Can Transformer Memory Be Corrupted? Investigating Cache-Side Vulnerabilities in Large Language Models
- Enrich and Detect Video Temporal Grounding with Multimodal LLMs
- UniGTE Unified Graph-Text Encoding for Zero-Shot Generalization across Graph Tasks and Domains
- ArmFormer Lightweight Transformer Architecture for Real-Time Multi-Class Weapon Segmentation and Classification
- Neuronal Group Communication for Efficient Neural Representation
- Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads
- Mixed-Precision Quantization for Language Models Techniques and Prospects
- 3D-GSRD 3D Molecular Graph Auto-Encoder with Selective Re-mask Decoding
- EMRRG Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation
- L-MoE End-to-End Training of a Lightweight Mixture of Low-Rank Adaptation Experts
- ELMM Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion
- An Efficient Semantic Segmentation Decoder for In-Car or Distributed Applications
- Long-Context Attention Benchmark From Kernel Efficiency to Distributed Context Parallelism
- U-Codec Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation
- Count Counts Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards
- VisionSelector End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
- SHIELD Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense
- Human-Aligned Code Readability Assessment with Large Language Models
- Ripple Effect Protocol Coordinating Agent Populations
- Language over Content Tracing Cultural Understanding in Multilingual Large Language Models
- Hybrid CNN-Transformer Based Sparse Channel Prediction for High-Mobility OTFS Systems
- HGC-Avatar Hierarchical Gaussian Compression for Streamable Dynamic 3D Avatars
- FrugalPrompt Reducing Contextual Overhead in Large Language Models via Token Attribution
- Learning to Optimize Edge Robotics A Fast Integrated Perception-Motion-Communication Approach
- FourierCompress Layer-Aware Spectral Activation Compression for Efficient and Accurate Collaborative LLM Inference
- Longwave-transparent low-emissivity material
- Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with Prior
- Does GenAI Rewrite How We Write? An Empirical Study on Two-Million Preprints
- What Limits Agentic Systems Efficiency?
- One-Bit Quantization for Random Features Models
- SentinelNet Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection
ToolDreamer Instilling LLM Reasoning Into Tool Retrievers
Authors: Saptarshi Sengupta, Zhengyu Zhou, Jun Araki, Xingbo Wang, Bingqing Wang, Suhang Wang, Zhe Feng
2025-10-22
Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM's context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval, as user requests are often poorly aligned with the language of TDs. To remedy the issue, we propose ToolDreamer, a framework to condition retriever models to fetch tools based on hypothetical (synthetic) TDs generated using an LLM, i.e., descriptions of tools that the LLM feels would be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TDs. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, thus showcasing its flexibility. Through our proposed framework, our aim is to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.
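The hypothetical-TD idea resembles query-side generation for retrieval and can be sketched as follows. Here `dream_tool_description` is a hard-coded stand-in for the LLM call, and the bag-of-words embedding is a stand-in for a real sparse or dense retriever; both are illustrative assumptions, not the paper's components.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words embedding; a real system would use a trained retriever.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dream_tool_description(query):
    # Placeholder for the LLM that writes a hypothetical tool description (TD)
    # for the query; hard-coded here purely for illustration.
    return "tool that converts an amount of money between currencies"

def retrieve(query, tool_descriptions, k=1):
    # Rank real TDs against the hypothetical TD instead of the raw query,
    # so matching happens in the language space of tool descriptions.
    hypo = embed(dream_tool_description(query))
    ranked = sorted(tool_descriptions,
                    key=lambda td: cosine(embed(td), hypo), reverse=True)
    return ranked[:k]

tds = [
    "tool that converts an amount of money between currencies using live rates",
    "tool that reports current weather conditions for a city",
]
print(retrieve("how many euros is 100 dollars?", tds))
```

The raw query ("euros", "dollars") shares almost no vocabulary with the currency tool's TD, but the dreamed TD does, which is the alignment gap the framework targets.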
GaLLoP Gradient-based Sparse Learning on Low-Magnitude Parameters
Authors: Anand Choudhary, Yasser Sulaiman, Lukas Mauch, Ghouthi Boukli Hacene, Fabien Cardinaux, Antoine Bosselut
2025-10-22
Sparse fine-tuning techniques adapt LLMs to downstream tasks by tuning only a subset of model parameters. However, the effectiveness of sparse adaptation depends on optimally selecting the model parameters to be fine-tuned. In this work, we introduce a novel sparse fine-tuning technique named GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters, which
fine-tunes only those model parameters which have the largest gradient
magnitudes on downstream tasks and the smallest pre-trained magnitudes,
intuitively prioritizing parameters that are highly task-relevant, but
minimally disruptive to pre-trained knowledge. Our experimentation with LLaMA3
8B and Gemma 2B as base models shows that GaLLoP consistently improves or
matches the in-distribution as well as out-of-distribution performance obtained
via the usage of other leading parameter-efficient fine-tuning techniques,
including LoRA, DoRA, and SAFT. Our analysis demonstrates that GaLLoP mitigates
catastrophic forgetting and memorization of task data, as important pre-trained
parameters remain unchanged, and stabilizes performance relative to other
fine-tuning techniques, robustly generalizing across most random seeds.
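The selection rule can be sketched as a mask over parameters. The abstract states the two criteria (largest gradient magnitude, smallest pre-trained magnitude) but not how they are combined; the rank-sum score below is an assumption for illustration.

```python
import numpy as np

def gallop_mask(weights, grads, sparsity=0.9):
    """Select the fraction (1 - sparsity) of parameters to fine-tune: those
    ranking high on gradient magnitude AND low on pre-trained magnitude.
    The rank-sum combination of the two criteria is an assumption; the
    abstract only names the criteria themselves."""
    k = int(round((1 - sparsity) * weights.size))
    # argsort(argsort(x)) yields the rank of each entry in ascending order.
    grad_rank = np.argsort(np.argsort(np.abs(grads)))      # larger |grad| -> larger rank
    small_rank = np.argsort(np.argsort(-np.abs(weights)))  # smaller |weight| -> larger rank
    score = grad_rank + small_rank
    mask = np.zeros(weights.size, dtype=bool)
    mask[np.argsort(-score)[:k]] = True                    # tune only the top-k scored
    return mask
```

During fine-tuning, gradient updates would be applied only where the mask is true, leaving the remaining pre-trained parameters untouched (which is what the abstract credits for reduced forgetting).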
CommonSense Efficient Set Intersection (SetX) Protocol Based on Compressed Sensing
Authors: Jingfan Meng, Tianji Yang, Jun Xu
2025-10-22
In the set reconciliation (\textsf{SetR}) problem, two parties Alice and Bob, holding sets $A$ and $B$, communicate to learn the symmetric difference $A \triangle B$. In this work, we study a related but under-explored problem: set intersection (\textsf{SetX})~\cite{Ozisik2019}, where both parties learn $A \cap B$ instead. However, existing solutions typically reuse \textsf{SetR} protocols due to the absence of dedicated \textsf{SetX} protocols and the misconception that \textsf{SetR} and \textsf{SetX} have comparable costs. Observing that \textsf{SetX} is fundamentally cheaper than \textsf{SetR}, we developed a multi-round \textsf{SetX} protocol that outperforms the information-theoretic lower bound of the \textsf{SetR} problem. In our \textsf{SetX} protocol, Alice sends Bob a compressed sensing (CS) sketch of $A$ to help Bob identify his unique elements (those in $B \setminus A$). This solves the \textsf{SetX} problem if the sketch decodes fully. Otherwise, Bob sends a CS sketch of the residue (a set of elements he cannot decode) back to Alice for her to identify her unique elements (those in $A \setminus B$). As such, Alice and Bob communicate back and forth, updating the estimated intersection and repeating until both parties agree on $A \cap B$. On real-world datasets, experiments show that our protocol reduces the communication cost by 8 to 10 times compared to the IBLT-based protocol.
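Setting the CS machinery aside, the set identity the protocol exploits is simple: once one party knows its own unique elements, the intersection follows locally, so communication needs to scale with the difference rather than with the sets themselves.

```python
A = {1, 2, 3, 4, 5}
B = {3, 4, 5, 6, 7}

# In the protocol, Bob recovers his unique elements B \ A by decoding
# Alice's CS sketch against his own set; here we compute it directly.
uniq_B = B - A                # {6, 7}

# The intersection then follows without either full set being transmitted.
intersection = B - uniq_B     # {3, 4, 5}
assert intersection == A & B
```

Since `uniq_B` has only |B \ A| elements, a sketch sized to the (small) difference suffices, which is the intuition behind SetX being cheaper than SetR.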
Dictionary learning methods for brain activity mapping with MEG data
Authors: Daniela Calvetti, Erkki Somersalo
2025-10-22
A central goal in many brain studies is the identification of those brain
regions that are activated during an observation window that may correspond to
a motor task, a stimulus, or simply a resting state. While functional MRI is
currently the most commonly employed modality for such tasks, methods based on the electromagnetic activity of the brain are valuable alternatives because of their excellent time resolution and the fact that the measured signals are directly related to brain activation and not to a secondary effect such as the
hemodynamic response. In this work we focus on the MEG modality, investigating
the performance of a recently proposed Bayesian dictionary learning (BDL)
algorithm for brain region identification. The partitioning of the source space
into the 148 regions of interest (ROI) corresponding to parcellation of the
Destrieux atlas provides a natural determination of the subdictionaries
necessary for the BDL algorithm. We design a simulation protocol where a small
randomly selected patch in each ROI is activated, the MEG signal is computed
and the inverse problem of active brain region identification is solved using
the BDL algorithm. The BDL algorithm consists of two phases: the first comprising dictionary compression and Bayesian error analysis, and the second performing dictionary coding with a deflated dictionary built on the output of the first phase, both steps relying on Bayesian sparsity-promoting computations. For assessing the performance, we give a probabilistic
interpretation of the confusion matrix, and consider different impurity
measures for a multi-class classifier.
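For intuition, a greedy sparse-coding pass over ROI subdictionaries looks like the following. This is a generic matching-pursuit stand-in, not the Bayesian BDL algorithm itself; the subdictionary structure (one block of atoms per ROI) is the part taken from the abstract.

```python
import numpy as np

def roi_matching_pursuit(signal, subdicts, n_iter=1):
    """Greedy stand-in for sparsity-promoting ROI identification: repeatedly
    pick the ROI subdictionary whose atoms best explain the residual, fit it,
    and deflate. The actual BDL algorithm is Bayesian, not this greedy rule."""
    residual = signal.astype(float).copy()
    active = []
    for _ in range(n_iter):
        # Score each ROI by its best-matching atom's correlation with the residual.
        scores = {roi: np.max(np.abs(D.T @ residual)) for roi, D in subdicts.items()}
        roi = max(scores, key=scores.get)
        active.append(roi)
        D = subdicts[roi]
        # Least-squares fit of this ROI's atoms, then deflate the residual.
        coef, *_ = np.linalg.lstsq(D, residual, rcond=None)
        residual = residual - D @ coef
    return active
```

With 148 Destrieux ROIs, `subdicts` would hold 148 blocks of lead-field-derived atoms; the returned list names the ROIs judged active.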
Are Large Language Models Sensitive to the Motives Behind Communication?
Authors: Addison J. Wu, Ryan Liu, Kerem Oktar, Theodore R. Sumers, Thomas L. Griffiths
2025-10-22
Human communication is motivated: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans' intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self-serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source -- for instance, weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs' behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human-like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents' information ecosystems. In these settings, we find that LLMs' inferences do not track the rational models' predictions nearly as closely -- partly due to additional information that distracts them from vigilance-relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real-world settings will require further improvements to these models.
CircuitGuard Mitigating LLM Memorization in RTL Code Generation Against IP Leakage
Authors: Nowfel Mashnoor, Mohammad Akyash, Hadi Kamali, Kimia Azar
2025-10-22
Large Language Models (LLMs) have achieved remarkable success in generative tasks, including register-transfer level (RTL) hardware synthesis. However, their tendency to memorize training data poses critical risks when proprietary or security-sensitive designs are unintentionally exposed during inference. While prior work has examined memorization in natural language, RTL introduces unique challenges: in RTL, structurally different implementations (e.g., behavioral vs. gate-level descriptions) can realize the same hardware, leading to intellectual property (IP) leakage (full or partial) even without verbatim copying. Conversely, even small syntactic variations (e.g., operator precedence or blocking vs. non-blocking assignments) can drastically alter circuit behavior, making correctness preservation especially challenging. In this work, we systematically study memorization in RTL code generation and propose CircuitGuard, a defense strategy that balances leakage reduction with correctness preservation. CircuitGuard (1) introduces a novel RTL-aware similarity metric that captures both structural and functional equivalence beyond surface-level overlap, and (2) develops an activation-level steering method that identifies and attenuates the components most responsible for memorization. Our empirical evaluation demonstrates that CircuitGuard identifies (and isolates) 275 memorization-critical features across layers 18-28 of the Llama 3.1-8B model, achieving up to 80% reduction in semantic similarity to proprietary patterns while maintaining generation quality. CircuitGuard further shows 78-85% cross-domain transfer effectiveness, enabling robust memorization mitigation across circuit categories without retraining.
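A generic activation-steering step of the kind described (the abstract does not give CircuitGuard's exact update rule, so the scaling form below is an assumption) removes part of each activation's component along identified memorization-critical directions:

```python
import numpy as np

def attenuate_features(activations, feature_dirs, alpha=0.8):
    """Scale down the component of an activation vector along each identified
    memorization-critical direction. alpha=1 removes the component entirely;
    the specific update rule is an illustrative assumption."""
    h = activations.astype(float).copy()
    for v in feature_dirs:
        v = v / np.linalg.norm(v)        # unit feature direction
        h = h - alpha * (h @ v) * v      # subtract alpha of the projection onto v
    return h
```

In a real deployment this would be applied to hidden states of the targeted layers (18-28 in the paper's Llama 3.1-8B study) during generation.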
CoSense-LLM Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
Authors: Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe
2025-10-22
We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site-specific policies and notes; (iii) PromptRouter, a cost- and uncertainty-aware policy that selects edge-only generation, edge-plus-retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention-style kernels, speculative decoding, and quantized LoRA adapters, and supports on-device personalization and federated updates under non-IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service-level objectives: it sustains sub-second (p95) end-to-end latency on edge-dominant paths, reduces inter-tier token and bandwidth costs by preferring local retrieval-grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and accelerators lower energy per decision. The results support an edge-first design that treats semantics, privacy, and predictable latency as co-equal goals for large model deployments in interference-prone environments.
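The three-way PromptRouter decision can be sketched as a threshold policy. The thresholds and inputs below are hypothetical; the paper's policy is cost- and uncertainty-aware but its exact form is not given in the abstract.

```python
def route(uncertainty, local_cost_ms, slo_ms, u_low=0.2, u_high=0.6):
    """Toy three-way routing decision. `uncertainty` is a calibrated score in
    [0, 1]; u_low/u_high are hypothetical thresholds, and the SLO check on the
    edge path is likewise an illustrative assumption."""
    if uncertainty < u_low and local_cost_ms <= slo_ms:
        return "edge-only"            # confident and fast enough locally
    if uncertainty < u_high:
        return "edge+retrieval"       # ground locally via Edge-RAG first
    return "cloud-escalation"         # too uncertain: escalate compact codes
```

The useful property of such a policy is that escalation (the expensive tier) is reserved for high-uncertainty cases, which is how the system trades bandwidth cost against answer quality.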
Overlap-weighted orthogonal meta-learner for treatment effect estimation over time
Authors: Konstantin Hess, Dennis Frauen, Mihaela van der Schaar, Stefan Feuerriegel
2025-10-22
Estimating heterogeneous treatment effects (HTEs) in time-varying settings is particularly challenging, as the probability of observing certain treatment sequences decreases exponentially with longer prediction horizons. Thus, the observed data contain little support for many plausible treatment sequences, which creates severe overlap problems. Existing meta-learners for the time-varying setting typically assume adequate treatment overlap, and thus suffer from exploding estimation variance when the overlap is low. To address this problem, we introduce a novel overlap-weighted orthogonal (WO) meta-learner for estimating HTEs that targets regions in the observed data with high probability of receiving the interventional treatment sequences. This offers a fully data-driven approach through which our WO-learner can counteract instabilities as in existing meta-learners and thus obtain more reliable HTE estimates. Methodologically, we develop a novel Neyman-orthogonal population risk function that minimizes the overlap-weighted oracle risk. We show that our WO-learner has the favorable property of Neyman-orthogonality, meaning that it is robust against misspecification in the nuisance functions. Further, our WO-learner is fully model-agnostic and can be applied to any machine learning model. Through extensive experiments with both transformer and LSTM backbones, we demonstrate the benefits of our novel WO-learner.
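For intuition, in the static one-step setting overlap weights take the standard form (the paper's time-varying construction generalizes this; the formula below is the textbook version, not taken from the abstract):

```latex
w(x) = \pi(x)\bigl(1 - \pi(x)\bigr), \qquad \pi(x) = \Pr(A = 1 \mid X = x)
```

The weight peaks where $\pi(x) = 1/2$ and vanishes as $\pi(x) \to 0$ or $1$, so regions with little chance of receiving (or not receiving) treatment contribute little to the risk, which is exactly what tames the exploding variance under low overlap.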
ELUTQ Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices
Authors: Xin Nie, Liang Dong, HaiCheng Zhang, JiaWang Xiao, G. Sun
2025-10-22
The deployment of Large Language Models (LLMs) on CPU-based edge devices is crucial for enabling on-device intelligence and expanding AI accessibility. However, it remains challenging due to limited memory and computational resources. During edge inference, memory usage and latency are the primary bottlenecks. Although weight quantization can effectively reduce memory consumption, existing hardware-friendly approaches often rely on uniform quantization, which poorly fits weight distributions and incurs high dequantization overhead at low bit widths. To address these limitations, we propose ELUTQ, an efficient quantization framework introducing a novel quantization format, Hierarchical Linear Quantization (HLQ). HLQ better captures the statistical characteristics of weights without increasing the computational cost of bit-serial LUT-based GEMM operations, thereby eliminating dequantization overhead. It is orthogonal to existing quantization algorithms and can be seamlessly integrated into various quantization pipelines. For efficient on-device deployment, ELUTQ provides optimized CPU kernels for end-to-end inference. Experiments show that for LLaMA3-8B, HLQ reduces perplexity by about 8% at 3-bit and 85% at 2-bit precision under post-training quantization, completing quantization within one hour. With efficient finetuning, HLQ further improves 2-bit performance within two hours. In terms of inference efficiency, our 2-bit LLaMA2-7B achieves over 25 tokens/s on an Apple M2 chip (4 threads, batch size = 1).
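For reference, the uniform affine scheme that the abstract says fits LLM weight distributions poorly at low bit widths looks like this (HLQ itself is not specified in the abstract, so only the baseline is shown):

```python
import numpy as np

def uniform_quant(w, bits=3):
    """Uniform affine quantization: map the weight range onto 2**bits evenly
    spaced levels. A single (scale, offset) pair per tensor is what makes it
    hardware-friendly but a poor fit for bell-shaped weight distributions."""
    qmax = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.round((w - lo) / scale).astype(np.int64)  # stored integer codes
    return q, scale, lo

def dequant(q, scale, lo):
    # The dequantization step whose overhead LUT-based GEMM avoids.
    return q * scale + lo
```

Because the levels are evenly spaced, outliers stretch the range and waste levels where few weights live, which is the mismatch a non-uniform format like HLQ is designed to address.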
MSC-Bench A Rigorous Benchmark for Multi-Server Tool Orchestration
Authors: Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, Yi-Tien Tsai
2025-10-22
We introduce MSC-Bench, a large-scale benchmark for evaluating multi-hop,
end-to-end tool orchestration by agents in a hierarchical Model-Context
Protocol (MCP) ecosystem. Existing benchmarks often evaluate tools in
isolation, ignoring challenges such as functional overlap and cross-server orchestration, leading to overly optimistic assessments. MSC-Bench addresses these gaps by constructing ground truth through 'equal function sets', allowing objective metrics such as F1 score and reducing the dependency on LLM-as-a-judge evaluation. Organized as a five-level curriculum, it
systematically tests agent capabilities from single-tool orchestration to
complex cross-server planning, and robustness to out-of-scope requests.
Experiments reveal that rigid hierarchies can hinder performance without
co-designed strategies, and even state-of-the-art agents exhibit systemic
weaknesses in robustness. MSC-Bench provides a diagnostic framework to expose
these limitations and guide the development of more capable and efficient
tool-using agents. The benchmark and resources are publicly available at
https://github.com/snooow1029/MSC_Bench.
Monitoring LLM-based Multi-Agent Systems Against Corruptions via Node Evaluation
Authors: Chengcan Wu, Zhixin Zhang, Mingqian Xu, Zeming Wei, Meng Sun
2025-10-22
Large Language Model (LLM)-based Multi-Agent Systems (MAS) have become a popular paradigm of AI applications. However, trustworthiness issues in MAS remain a critical concern. Unlike challenges in single-agent systems, MAS involve more complex communication processes, making them susceptible to corruption attacks. To mitigate this issue, several defense mechanisms have been developed based on the graph representation of MAS, where agents represent nodes and communications form edges. Nevertheless, these methods predominantly focus on static graph defense, attempting to either detect attacks in a fixed graph structure or optimize a static topology with certain defensive capabilities. To address this limitation, we propose a dynamic defense paradigm for MAS graph structures, which continuously monitors communication within the MAS graph, then dynamically adjusts the graph topology, accurately disrupts malicious communications, and effectively defends against evolving and diverse dynamic attacks. Experimental results in increasingly complex and dynamic MAS environments demonstrate that our method significantly outperforms existing MAS defense mechanisms, contributing an effective guardrail for their trustworthy applications. Our code is available at https://github.com/ChengcanWu/Monitoring-LLM-Based-Multi-Agent-Systems.
MoE-Prism Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
Authors: Xinfeng Xia, Jiacheng Liu, Xiaofeng Hou, Peng Tang, Mingxuan Zhang, Wenfeng Wang, Chao Li
2025-10-22
Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI,
achieve high quality by sparsely activating parameters. However, their reliance
on routing between a few monolithic experts via a top-k mechanism creates a
"quality cliff", offering only a few coarse-grained operating points. This
inflexibility forces a difficult trade-off between cost and quality, preventing
adaptation to diverse Service Level Objectives (SLOs) and leading to
significant resource over-provisioning.
This paper introduces MoE-Prism, a model-system co-design that transforms
rigid MoE models into elastic services. Our methodology is divided into two
phases. First, an \emph{Offline Refactoring Engine} systematically deconstructs
monolithic experts into fine-grained "sub-experts." This engine employs a
partitioning optimization solver that uses a metaheuristic-based approach to
group neurons, pre
functional locality without requiring retraining.
Second, an \emph{Online Scheduling Engine} leverages this new elasticity
through QoS-aware scheduling. It implements specialized policies to solve
complex system problems, including maximizing throughput in cloud deployments
and managing latency-optimized offloading for memory-constrained devices. Our
evaluation across three different MoE models shows that MoE-Prism provides over 4 times more distinct, stable operating points than the baseline. This allows an AI service to dynamically improve throughput by up to 19.9% under a strict latency budget or reduce latency by up to 10.36% under limited resources.
MoE-Prism provides the critical "control knob" to bridge the model-system gap,
enabling the next generation of adaptive, efficient, and QoS-aware AI services.
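The "quality cliff" comes from standard top-k gating: only whole experts can be switched on or off, so the compute/quality trade-off moves in coarse jumps. A minimal sketch of that baseline gate (not MoE-Prism's refactored sub-expert routing):

```python
import numpy as np

def topk_gate(logits, k):
    """Standard top-k MoE gating: keep the k largest router logits, softmax
    over just those, and zero the rest. With a handful of monolithic experts,
    each increment of k is a large, coarse step in compute and quality."""
    idx = np.argsort(-logits)[:k]              # indices of the top-k experts
    gates = np.zeros_like(logits)
    ex = np.exp(logits[idx] - logits[idx].max())  # stable softmax over top-k
    gates[idx] = ex / ex.sum()
    return gates
```

Splitting each expert into many sub-experts multiplies the number of distinct k values the scheduler can choose from, which is the elasticity MoE-Prism exposes to its online scheduling engine.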
Multi-code rate Task-Oriented Communication for Multi-Edge Cooperative Inference
Authors: Dongwon Kim, Jiwan Seo, Joonhyuk Kang
2025-10-22
The integration of artificial intelligence (AI) with the internet of things (IoT) enables task-oriented communication for multi-edge cooperative inference systems, where edge devices transmit extracted features of local sensory data to an edge server to perform AI-driven tasks. However, privacy concerns and limited communication bandwidth pose fundamental challenges, since simultaneous transmission of extracted features with a single fixed compression ratio from all devices leads to severe inefficiency in communication resource utilization. To address this challenge, we propose a framework that dynamically adjusts the code rate in feature extraction based on its importance to the downstream inference task by adopting a rate-adaptive quantization (RAQ) scheme. Furthermore, to select the code rate for each edge device under a limited bandwidth constraint, a dynamic programming (DP) approach is leveraged to allocate the code rate across discrete code rate options. Experiments on multi-view datasets demonstrate that the proposed frameworks significantly outperform frameworks using fixed-rate quantization, achieving a favorable balance between communication efficiency and inference performance under limited bandwidth conditions.
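The per-device rate selection is a grouped-knapsack-style DP: each device must pick exactly one discrete code-rate option, and the total bandwidth must stay within budget. The formulation below is illustrative (the paper's exact DP, utility measure, and units are not given in the abstract).

```python
def allocate_rates(utilities, costs, budget):
    """utilities[d][r]: inference benefit of giving device d its r-th code-rate
    option; costs[d][r]: integer bandwidth units that option consumes.
    Returns (chosen option index per device, total utility). Assumes at least
    one feasible allocation exists."""
    n = len(utilities)
    NEG = float("-inf")
    best = [NEG] * (budget + 1)
    best[0] = 0.0
    choice = [[None] * (budget + 1) for _ in range(n)]
    for d in range(n):                       # one DP layer per device
        new = [NEG] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == NEG:
                continue
            for r, (u, c) in enumerate(zip(utilities[d], costs[d])):
                nb = b + c
                if nb <= budget and best[b] + u > new[nb]:
                    new[nb] = best[b] + u
                    choice[d][nb] = (b, r)   # remember predecessor and option
        best = new
    b = max(range(budget + 1), key=lambda i: best[i])
    total = best[b]
    rates = [0] * n
    for d in range(n - 1, -1, -1):           # backtrack the chosen options
        prev_b, r = choice[d][b]
        rates[d] = r
        b = prev_b
    return rates, total
```

For example, with two devices, two rate options each, and a budget of 4 units, the DP picks the cheap option for one device so the other can afford its high-utility rate.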
LAPRAD LLM-Assisted PRotocol Attack Discovery
Authors: R. Can Aygun, Yehuda Afek, Anat Bremler-Barr, Leonard Kleinrock
2025-10-22
With the goal of improving the security of Internet protocols, we seek
faster, semi-automatic methods to discover new vulnerabilities in protocols
such as DNS, BGP, and others. To this end, we introduce the LLM-Assisted Protocol Attack Discovery (LAPRAD) methodology, enabling security researchers with some DNS knowledge to efficiently uncover vulnerabilities that would otherwise be hard to detect.
LAPRAD follows a three-stage process. In the first, we consult an LLM (GPT-o1) that has been trained on a broad corpus of DNS-related sources and
previous DDoS attacks to identify potential exploits. In the second stage, a
different LLM automatically constructs the corresponding attack configurations
using the ReACT approach implemented via LangChain (DNS zone file generation).
Finally, in the third stage, we validate the attack's functionality and
effectiveness.
Using LAPRAD, we uncovered three new DDoS attacks on the DNS protocol and
rediscovered two recently reported ones that were not included in the LLM's training data. The first new attack employs a bait-and-switch technique to
trick resolvers into caching large, bogus DNSSEC RRSIGs, reducing their
capacity to as little as 6%. The second exploits large DNSSEC encryption
algorithms (RSA-4096) with multiple keys, thereby bypassing a recently
implemented default RRSet limit. The third leverages ANY-type responses to
produce a similar effect.
These variations of a cache-flushing DDoS attack, called SigCacheFlush,
circumvent existing patches, severely degrade resolver query capacity, and
impact the latest versions of major DNS resolver implementations.
RLBoost Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs
Authors: Yongji Wu, Xueshen Liu, Haizhong Zheng, Juncheng Gu, Beidi Chen, Z. Morley Mao, Arvind Krishnamurthy, Ion Stoica
2025-10-22
Reinforcement learning (RL) has become essential for unlocking advanced
reasoning capabilities in large language models (LLMs). RL workflows involve
interleaving rollout and training stages with fundamentally different resource
requirements. Rollout typically dominates overall execution time, yet scales
efficiently through multiple independent instances. In contrast, training
requires tightly-coupled GPUs with full-mesh communication. Existing RL frameworks fall into two categories: co-located and disaggregated architectures. Co-located ones fail to address this resource tension by forcing
architectures. Co-located ones fail to address this resource tension by forcing
both stages to share the same GPUs. Disaggregated architectures, without
modifications of well-established RL algorithms, suffer from resource
under-utilization. Meanwhile, preemptible GPU resources, i.e., spot instances
on public clouds and spare capacity in production clusters, present significant
cost-saving opportunities for accelerating RL workflows, if efficiently
harvested for rollout.
In this paper, we present RLBoost, a systematic solution for cost-efficient
RL training that harvests preemptible GPU resources. Our key insight is that
rollout's stateless and embarrassingly parallel nature aligns perfectly with
preemptible and often fragmented resources. To efficiently utilize these
resources despite frequent and unpredictable availability changes, RLBoost
adopts a hybrid architecture with three key techniques: (1) adaptive rollout
offload to dynamically adjust workloads on the reserved (on-demand) cluster,
(2) pull-based weight transfer that quickly provisions newly available
instances, and (3) token-level response collection and migration for efficient
preemption handling and continuous load balancing. Extensive experiments show
RLBoost increases training throughput by 1.51x-1.97x while improving cost
efficiency by 28%-49% compared to using only on-demand GPU resources.
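The token-level response collection step can be pictured with a toy sketch (hypothetical code, not RLBoost's implementation): a preempted rollout hands its partial token sequence to a surviving instance, which resumes generation rather than restarting from scratch.

```python
# Toy sketch of token-level response collection and migration. The
# `step_fn` below is a hypothetical stand-in for one decoding step.
def generate(prompt, resume_tokens, budget, step_fn):
    """Generate `budget` new tokens, resuming from `resume_tokens`."""
    tokens = list(resume_tokens)
    for _ in range(budget):
        tokens.append(step_fn(prompt, tokens))
    return tokens

def rollout_with_preemption(prompt, total_tokens, preempt_after, step_fn):
    """A spot instance is preempted mid-rollout; the partial response is
    collected token-by-token and migrated to a surviving instance."""
    partial = generate(prompt, [], preempt_after, step_fn)
    return generate(prompt, partial, total_tokens - len(partial), step_fn)

# With a deterministic step function, the migrated rollout matches an
# uninterrupted one, so no generated work is lost on preemption.
step = lambda prompt, toks: len(toks)
assert rollout_with_preemption("p", 8, 3, step) == generate("p", [], 8, step)
```

Because rollout is stateless, the only state that must move is the token list itself, which is what makes harvesting preemptible capacity cheap.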
Tibetan Language and AI A Comprehensive Survey of Resources, Methods and Challenges
Authors: Cheng Huang, Nyima Tashi, Fan Gao, Yutong Liu, Jiahao Li, Hao Tian, Siyang Jiang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Jin Zhang, Xiao Feng, Hao Wang, Jie Tang, Guojie Tang, Xiangxiang Wang, Jia Zhang, Tsengdar Lee, Yongbin Yu
2025-10-22
Tibetan, one of the major low-resource languages in Asia, presents unique
linguistic and sociocultural characteristics that pose both challenges and
opportunities for AI research. Despite increasing interest in developing AI
systems for underrepresented languages, Tibetan has received limited attention
due to a lack of accessible data resources, standardized benchmarks, and
dedicated tools. This paper provides a comprehensive survey of the current
state of the Tibetan language in the AI domain, covering textual and speech
data resources, NLP tasks, machine translation, speech recognition, and recent
developments in LLMs. We systematically categorize existing datasets and tools,
evaluate methods used across different tasks, and compare performance where
possible. We also identify persistent bottlenecks such as data scarcity,
orthographic variation, and the lack of unified evaluation metrics.
Additionally, we discuss the potential of cross-lingual transfer, multi-modal
learning, and community-driven resource creation. This survey aims to serve as
a foundational reference for future work on Tibetan AI research and encourages
collaborative efforts to build an inclusive and sustainable AI ecosystem for
low-resource languages.
An Efficient Calibration Framework for Volatility Derivatives under Rough Volatility with Jumps
Authors: Keyuan Wu, Tenghan Zhong, Yuxuan Ouyang
2025-10-21
We present a fast and robust calibration method for stochastic volatility
models that admit Fourier-analytic transform-based pricing via characteristic
functions. The design is structure-preserving: we keep the original pricing
transform and (i) split the pricing formula into data-independent integrals
and a market-dependent remainder; (ii) precompute those data-independent
integrals with GPU acceleration; and (iii) approximate only the remaining,
market-dependent pricing map with a small neural network. We instantiate the
workflow on a rough volatility model with tempered-stable jumps tailored to
power-type volatility derivatives and calibrate it to VIX options with a
global-to-local search. We verify that a pure-jump rough volatility model
adequately captures the VIX dynamics, consistent with prior empirical findings,
and demonstrate that our calibration method achieves high accuracy and speed.
From Memorization to Generalization Fine-Tuning Large Language Models for Biomedical Term-to-Identifier Normalization
Authors: Suswitha Pericharla, Daniel B. Hier, Tayo Obafemi-Ajayi
2025-10-21
Effective biomedical data integration depends on automated term
normalization, the mapping of natural language biomedical terms to standardized
identifiers. This linking of terms to identifiers is essential for semantic
interoperability. Large language models (LLMs) show promise for this task but
perform unevenly across terminologies. We evaluated both memorization
(training-term performance) and generalization (validation-term performance)
across multiple biomedical ontologies. Fine-tuning Llama 3.1 8B revealed marked
differences by terminology. GO mappings showed strong memorization gains (up to
77% improvement in term-to-identifier accuracy), whereas HPO showed minimal
improvement. Generalization occurred only for protein-gene (GENE) mappings
(13.9% gain), while fine-tuning for HPO and GO yielded negligible transfer.
Baseline accuracy varied by model scale, with GPT-4o outperforming both Llama
variants for all terminologies. Embedding analyses showed tight semantic
alignment between gene symbols and protein names but weak alignment between
terms and identifiers for GO or HPO, consistent with limited lexicalization.
Fine-tuning success depended on two interacting factors: identifier popularity
and lexicalization. Popular identifiers were more likely encountered during
pretraining, enhancing memorization. Lexicalized identifiers, such as gene
symbols, enabled semantic generalization. By contrast, arbitrary identifiers in
GO and HPO constrained models to rote learning. These findings provide a
predictive framework for when fine-tuning enhances factual recall versus when
it fails due to unpopular or non-lexicalized identifiers.
CLiVR Conversational Learning System in Virtual Reality with AI-Powered Patients
Authors: Akilan Amithasagaran, Sagnik Dakshit, Bhavani Suryadevara, Lindsey Stockton
2025-10-21
Simulations constitute a fundamental component of medical and nursing
education and traditionally employ standardized patients (SP) and high-fidelity
manikins to develop clinical reasoning and skills. However, these
methods require substantial resources, limiting accessibility and scalability.
In this study, we introduce CLiVR, a Conversational Learning system in Virtual
Reality that integrates large language models (LLMs), speech processing, and 3D
avatars to simulate realistic doctor-patient interactions. Developed in Unity
and deployed on the Meta Quest 3 platform, CLiVR enables trainees to engage in
natural dialogue with virtual patients. Each simulation is dynamically
generated from a syndrome-symptom database and enhanced with sentiment analysis
to provide feedback on tone. Through an expert user study
involving medical school faculty (n=13), we assessed usability, realism, and
perceived educational impact. Results demonstrated strong user acceptance, high
confidence in educational potential, and valuable feedback for improvement.
CLiVR offers a scalable, immersive supplement to SP-based training.
An Encoder-Decoder Foundation Chemical Language Model for Generative Polymer Design
Authors: Harikrishna Sahu, Wei Xiong, Anagha Savit, Shivank S Shukla, Rampi Ramprasad
2025-10-21
Traditional machine learning has advanced polymer discovery, yet direct
generation of chemically valid and synthesizable polymers without exhaustive
enumeration remains a challenge. Here we present polyT5, an encoder-decoder
chemical language model based on the T5 architecture, trained to understand and
generate polymer structures. polyT5 enables both property prediction and the
targeted generation of polymers conditioned on desired property values. We
demonstrate its utility for dielectric polymer design, seeking candidates with
dielectric constant >3, bandgap >4 eV, and glass transition temperature >400 K,
alongside melt-processability and solubility requirements. From over 20,000
generated promising candidates, one was experimentally synthesized and
validated, showing strong agreement with predictions. To further enhance
usability, we integrated polyT5 within an agentic AI framework that couples it
with a general-purpose LLM, allowing natural language interaction for property
prediction and generative design. Together, these advances establish a
versatile and accessible framework for accelerated polymer discovery.
Dimensionality Reduction for Remote Sensing Data Analysis A Systematic Review of Methods and Applications
Authors: Nathan Mankovich, Kai-Hendrik Cohrs, Homer Durand, Vasileios Sitokonstantinou, Tristan Williams, Gustau Camps-Valls
2025-10-21
Earth observation involves collecting, analyzing, and processing an
ever-growing mass of data. Automatically harvesting information is crucial for
addressing significant societal, economic, and environmental challenges,
ranging from environmental monitoring to urban planning and disaster
management. However, the high dimensionality of these data poses challenges in
terms of inefficiency and the curse of dimensionality, which limits
the effectiveness of machine learning models. Dimensionality reduction (DR)
techniques, specifically feature extraction, address these challenges by
preserving essential data properties while reducing complexity and enhancing
tasks such as data compression, cleaning, fusion, visualization, anomaly
detection, and prediction. This review provides a handbook for leveraging DR
across the remote sensing (RS) data value chain and identifies opportunities
for under-explored DR algorithms and their application in future research.
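As a minimal illustration of DR by feature extraction (a generic sketch, not a specific method from the review), PCA via SVD reduces synthetic 100-band "pixels" to a handful of informative features:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a hyperspectral cube: 1000 pixels x 100 bands,
# with the true signal living in a 5-dimensional subspace plus noise.
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 100))
X = latent @ mixing + 0.05 * rng.normal(size=(1000, 100))

# PCA via SVD of the centered data: keep the leading components,
# reducing 100 bands to 5 features per pixel.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = (S ** 2) / (S ** 2).sum()
k = 5
X_reduced = Xc @ Vt[:k].T   # (1000, 5) feature matrix
```

On data with low intrinsic dimension like this, a few components capture nearly all the variance, which is exactly the property DR methods exploit on remote sensing imagery.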
MTraining Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
Authors: Wenxuan Li, Chengruidong Zhang, Huiqiang Jiang, Yucheng Li, Yuqing Yang, Lili Qiu
2025-10-21
The adoption of long context windows has become a standard feature in Large
Language Models (LLMs), as extended contexts significantly enhance their
capacity for complex reasoning and broaden their applicability across diverse
scenarios. Dynamic sparse attention is a promising approach for reducing the
computational cost of long-context LLMs. However, efficiently training LLMs
with dynamic sparse attention on ultra-long contexts, especially in distributed
settings, remains a significant challenge, due in large part to worker- and
step-level imbalance. This paper introduces MTraining, a novel distributed
methodology leveraging dynamic sparse attention to enable efficient training
for LLMs with ultra-long contexts. Specifically, MTraining integrates three key
components: a dynamic sparse training pattern, balanced sparse ring attention,
and hierarchical sparse ring attention. These components are designed to
synergistically address the computational imbalance and communication overheads
inherent in dynamic sparse attention mechanisms during the training of models
with extensive context lengths. We demonstrate the efficacy of MTraining by
training Qwen2.5-3B, successfully expanding its context window from 32K to 512K
tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite
of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A
Haystack, reveal that MTraining achieves up to a 6x higher training throughput
while preserving model accuracy. Our code is available at
https://github.com/microsoft/MInference/tree/main/MTraining.
mSQUID Model-Based Learned Modulo Recovery at Low Sampling Rates
Authors: Yhonatan Kvich, Rotem Arie, Hana Hasan, Shaik Basheeruddin Shah, Yonina C. Eldar
2025-10-21
Modulo sampling enables acquisition of signals with unlimited dynamic range
by folding the input into a bounded interval prior to sampling, thus
eliminating the risk of signal clipping and preserving information without
requiring high-resolution ADCs. While this enables low-cost hardware, the
nonlinear distortion introduced by folding presents recovery challenges,
particularly under noise and quantization. We propose a model-based deep
unfolding network tailored to this setting, combining the interpretability of
classical compressed sensing (CS) solvers with the flexibility of learning. A
key innovation is a soft-quantization module that encodes the modulo prior by
guiding the solution toward discrete multiples of the folding range in a
differentiable and learnable way. Our method, modulo soft-quantized unfolded
iterative decoder (mSQUID), achieves superior reconstruction performance at low
sampling rates under additive Gaussian noise. We further demonstrate its
utility in a challenging case where signals with vastly different amplitudes
and disjoint frequency bands are acquired simultaneously and folded. In this
scenario, classical sampling often struggles due to weak signal distortion or
strong signal clipping, while our approach is able to recover the input
signals. Our method also offers significantly reduced runtimes, making it
suitable for real-time, resource-limited systems.
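The modulo prior can be sketched in a few lines (an illustrative toy; `soft_multiple` is a hypothetical stand-in for the paper's learnable soft module): folding wraps samples into a bounded interval, and recovery must place the unfolding residual on a discrete multiple of the folding range.

```python
import numpy as np

LAM = 1.0  # folding range: samples are wrapped into [-LAM, LAM)

def fold(x, lam=LAM):
    """Modulo (wrap) x into [-lam, lam), as a modulo ADC would."""
    return (x + lam) % (2 * lam) - lam

def soft_multiple(z, lam=LAM, tau=0.1):
    """Differentiable surrogate that nudges z toward the nearest
    multiple of 2*lam (the discrete set the modulo prior lives on).
    tau -> 0 recovers hard rounding; a learnable temperature in spirit."""
    k = np.floor(z / (2 * lam) + 0.5)   # nearest multiple index
    hard = 2 * lam * k
    return (1 - tau) * hard + tau * z   # soft, differentiable blend

# Recovery for one sample: the unknown is the integer fold count; in the
# hard limit the module snaps the residual x - fold(x) onto 2*lam*k.
x = 3.7
y = fold(x)                             # observed folded sample: -0.3
residual = soft_multiple(x - y, tau=0.0)
assert abs((y + residual) - x) < 1e-9   # unfolding recovers the input
```

In the actual network, modules of this kind sit inside unrolled solver iterations, so the fold-count decision is learned jointly with the CS-style reconstruction.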
SSD Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation
Authors: Siyong Jian, Huan Wang
2025-10-21
Autoregressive image generation models like Janus-Pro produce high-quality
images, but at the significant cost of high memory and ever-growing
computational demands due to the large number of visual tokens. While KV cache
compression has been extensively studied in language modeling, it still remains
largely unexplored in the image generation domain. In this work, we begin by
identifying a distinct and prominent attention phenomenon, which we term
spatial locality and emergent semantic sink. To leverage this key insight, we
introduce a novel KV cache compression framework. Specifically, we compress the
KV cache for all visual tokens by adaptively decoupling attention heads into
two separate types: for spatial-locality heads, our method maintains a short
recent token window; for semantic-sink heads, it strategically preserves a
compact set of highly-attended tokens. Our extensive experiments demonstrate
that the proposed method achieves a 5x reduction in memory usage and a
notable 6.6x speedup in overall throughput with only minimal visual
quality loss, thereby enabling highly efficient native autoregressive image
generation on resource-constrained hardware.
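The decoupling idea can be sketched as follows (hypothetical code, not the paper's implementation): each head keeps either a short recent window or a compact set of highly attended "sink" tokens, so the retained cache differs per head type.

```python
import numpy as np

def compress_kv(attn, keys, head_type, window=4, sink_k=4):
    """Keep a per-head subset of cached tokens.
    attn: (T,) average attention mass each cached token receives.
    keys: (T, d) cached key vectors for one head."""
    T = len(attn)
    if head_type == "spatial":   # locality heads: recent window only
        idx = np.arange(max(0, T - window), T)
    else:                        # semantic-sink heads: top-k attended tokens
        idx = np.sort(np.argsort(attn)[-sink_k:])
    return idx, keys[idx]

rng = np.random.default_rng(1)
keys = rng.normal(size=(16, 8))   # 16 cached tokens, head dim 8
attn = rng.random(16)
idx_s, _ = compress_kv(attn, keys, "spatial")
idx_k, _ = compress_kv(attn, keys, "semantic")
assert list(idx_s) == [12, 13, 14, 15]   # most recent tokens kept
assert len(idx_k) == 4                   # compact sink set kept
```

Either way each head retains a small constant-size cache, which is where the memory reduction comes from.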
Fetch.ai An Architecture for Modern Multi-Agent Systems
Authors: Michael J. Wooldridge, Attila Bagoly, Jonathan J. Ward, Emanuele La Malfa, Gabriel Paludo Licks
2025-10-21
Recent surges in LLM-driven intelligent systems largely overlook decades of
foundational multi-agent systems (MAS) research, resulting in frameworks with
critical limitations such as centralization and inadequate trust and
communication protocols. This paper introduces the Fetch.ai architecture, an
industrial-strength platform designed to bridge this gap by facilitating the
integration of classical MAS principles with modern AI capabilities. We present
a novel, multi-layered solution built on a decentralized foundation of on-chain
blockchain services for verifiable identity, discovery, and transactions. This
is complemented by a comprehensive development framework for creating secure,
interoperable agents, a cloud-based platform for deployment, and an intelligent
orchestration layer where an agent-native LLM translates high-level human goals
into complex, multi-agent workflows. We demonstrate the deployed nature of this
system through a decentralized logistics use case where autonomous agents
dynamically discover, negotiate, and transact with one another securely.
Ultimately, the Fetch.ai stack provides a principled architecture for moving
beyond current agent implementations towards open, collaborative, and
economically sustainable multi-agent ecosystems.
Reasoning Language Model Inference Serving Unveiled An Empirical Study
Authors: Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang, Yuhan Chen, Shaohuai Shi, Xiaowen Chu
2025-10-21
The reasoning large language model (RLLM) has proven competitive with general
LLMs in solving complex reasoning tasks such as mathematics and coding.
However, the serving performance and behavior of RLLMs remain unexplored,
which may undermine the deployment and utilization of RLLMs in real-world
scenarios. To close this gap, in this paper we conduct a comprehensive study
of RLLM serving. We first perform a pilot study comparing the serving
performance of RLLMs and traditional LLMs and reveal several distinct
differences in serving behavior: (1)
significant memory usage and fluctuations; (2) straggler requests; (3) adaptive
running time; (4) domain preference. Then we further investigate whether
existing inference optimization techniques are valid for RLLMs. Our main
takeaways are that model quantization methods and speculative decoding can
improve service system efficiency with a small compromise to RLLM accuracy,
while prefix caching may even degrade accuracy or serving performance for
small RLLMs. Lastly, we conduct an evaluation under a real-world workload
modeled by a Gamma distribution to verify our findings. Empirical results of
this real-world workload evaluation across different datasets align with our
main findings regarding RLLM serving. We hope our work can provide the
research community and industry with insights to advance RLLM inference
serving.
Binary Quadratic Quantization Beyond First-Order Quantization for Real-Valued Matrix Compression
Authors: Kyo Kuroki, Yasuyuki Okoshi, Thiem Van Chu, Kazushi Kawamura, Masato Motomura
2025-10-21
This paper proposes a novel matrix compression method, Binary Quadratic
Quantization (BQQ). In contrast to conventional first-order quantization
approaches, such as uniform quantization and binary coding quantization, that
approximate real-valued matrices via linear combinations of binary bases, BQQ
leverages the expressive power of binary quadratic expressions while
maintaining an extremely compact data format. We validate our approach with two
experiments: a matrix compression benchmark and post-training quantization
(PTQ) on pretrained Vision Transformer-based models. Experimental results
demonstrate that BQQ consistently achieves a superior trade-off between memory
efficiency and reconstruction error than conventional methods for compressing
diverse matrix data. It also delivers strong PTQ performance, even though we
neither target state-of-the-art PTQ accuracy under tight memory constraints nor
rely on PTQ-specific binary matrix optimization. For example, our proposed
method outperforms the state-of-the-art PTQ method by up to 2.2% and 59.1% on
the ImageNet dataset under the calibration-based and data-free scenarios,
respectively, with a memory footprint equivalent to 2 bits. These findings highlight
the surprising effectiveness of binary quadratic expressions for efficient
matrix approximation and neural network compression.
C-SWAP Explainability-Aware Structured Pruning for Efficient Neural Networks Compression
Authors: Baptiste Bauvin, Loïc Baret, Ola Ahmad
2025-10-21
Neural network compression has gained increasing attention in recent years,
particularly in computer vision applications, where the need for model
reduction is crucial for overcoming deployment constraints. Pruning is a widely
used technique that induces sparsity in model structures, e.g., weights,
neurons, and layers, reducing size and inference costs. Structured pruning is
especially important as it allows for the removal of entire structures, which
further accelerates inference time and reduces memory overhead. However, it can
be computationally expensive, requiring iterative retraining and optimization.
To overcome this problem, recent methods consider a one-shot setting, which
applies pruning directly at post-training. Unfortunately, they often lead to a
considerable drop in performance. In this paper, we focus on this issue by
proposing a novel one-shot pruning framework that relies on explainable deep
learning. First, we introduce a causal-aware pruning approach that leverages
cause-effect relations between model predictions and structures in a
progressive pruning process. This allows us to efficiently reduce the size of the
network, ensuring that the removed structures do not deter the performance of
the model. Then, through experiments conducted on convolutional neural network
and vision transformer baselines, pre-trained on classification tasks, we
demonstrate that our method consistently achieves substantial reductions in
model size, with minimal impact on performance, and without the need for
fine-tuning. Overall, our approach outperforms its counterparts, offering the
best trade-off. Our code is available on GitHub.
Tokencake A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications
Authors: Zhuohang Bian, Feiyang Wu, Teng Ma, Youwei Zhuo
2025-10-21
Large Language Models (LLMs) are increasingly deployed in complex multi-agent
applications that use external function calls. This workload creates severe
performance challenges for the KV Cache: space contention leads to the eviction
of critical agents' KV caches, and time underutilization leaves the KV caches
of agents stalled on long-running tool calls idling in GPU memory. We present
Tokencake, a KV-Cache-centric serving framework that co-optimizes scheduling and memory
management with an agent-aware design. Tokencake's Space Scheduler uses dynamic
memory partitioning to shield critical agents from contention, while its Time
Scheduler employs a proactive offload and predictive upload mechanism to
repurpose GPU memory during function call stalls. Our evaluation on
representative multi-agent benchmarks shows that Tokencake can reduce
end-to-end latency by over 47.06% and improve effective GPU memory utilization
by up to 16.9% compared to vLLM.
EfficientNav Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval
Authors: Zebin Yang, Sunjian Zheng, Tong Xie, Tianshi Xu, Bo Yu, Fan Wang, Jie Tang, Shaoshan Liu, Meng Li
2025-10-21
Object-goal navigation (ObjNav) tasks an agent with navigating to the
location of a specific object in an unseen environment. Embodied agents
equipped with large language models (LLMs) and online constructed navigation
maps can perform ObjNav in a zero-shot manner. However, existing agents heavily
rely on giant LLMs on the cloud, e.g., GPT-4, while directly switching to small
LLMs, e.g., LLaMA3.2-11b, suffers from significant success rate drops due to
limited model capacity for understanding complex navigation maps, which
prevents deploying ObjNav on local devices. At the same time, the long prompt
introduced by the navigation map description will cause high planning latency
on local devices. In this paper, we propose EfficientNav to enable on-device
efficient LLM-based zero-shot ObjNav. To help the smaller LLMs better
understand the environment, we propose semantics-aware memory retrieval to
prune redundant information in navigation maps. To reduce planning latency, we
propose discrete memory caching and attention-based memory clustering to
efficiently save and re-use the KV cache. Extensive experimental results
demonstrate that EfficientNav achieves 11.1% improvement in success rate on
HM3D benchmark over GPT-4-based baselines, and demonstrates 6.7x real-time
latency reduction and 4.7x end-to-end latency reduction over GPT-4 planner. Our
code will be released soon.
LLMs as Sparse Retrievers A Framework for First-Stage Product Search
Authors: Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Sen Li, Wenjun Peng, Fuyu Lv, Xueqi Cheng
2025-10-21
Product search is a crucial component of modern e-commerce platforms, with
billions of user queries every day. In product search systems, first-stage
retrieval should achieve high recall while ensuring efficient online
deployment. Sparse retrieval is particularly attractive in this context due to
its interpretability and storage efficiency. However, sparse retrieval methods
suffer from severe vocabulary mismatch issues, leading to suboptimal
performance in product search scenarios. With their potential for semantic
analysis, large language models (LLMs) offer a promising avenue for mitigating
vocabulary mismatch issues and thereby improving retrieval quality. Directly
applying LLMs to sparse retrieval in product search exposes two key
challenges: (1) Queries and product titles are typically short and highly
susceptible to LLM-induced hallucinations, such as generating irrelevant
expansion terms or underweighting critical literal terms like brand names and
model numbers; (2) The large vocabulary space of LLMs leads to difficulty in
initializing training effectively, making it challenging to learn meaningful
representations in such ultra-high-dimensional spaces. To address these
challenges, we propose PROSPER, a framework for PROduct search leveraging LLMs
as SParsE Retrievers. PROSPER incorporates: (1) a literal residual network that
alleviates hallucination in lexical expansion by reinforcing underweighted
literal terms through a residual compensation mechanism; and (2) a lexical
focusing window that facilitates effective training initialization via a
coarse-to-fine sparsification strategy. Extensive offline and online
experiments show that PROSPER significantly outperforms sparse retrieval
baselines and achieves
recall performance comparable to advanced dense retrievers, while also
achieving revenue increments online.
From Quarter to All Accelerating Speculative LLM Decoding via Floating-Point Exponent Remapping and Parameter Sharing
Authors: Yushu Zhao, Yubin Qin, Yang Wang, Xiaolong Yang, Huiming Han, Shaojun Wei, Yang Hu, Shouyi Yin
2025-10-21
Large language models achieve impressive performance across diverse tasks but
exhibit high inference latency due to their large parameter sizes. While
quantization reduces model size, it often leads to performance degradation
compared to the full model. Speculative decoding remains lossless but typically
incurs extra overheads. We propose SPEQ, an algorithm-hardware co-designed
speculative decoding method that uses part of the full-model weight bits to
form a quantized draft model, thereby eliminating additional training or
storage overhead. A reconfigurable processing element array enables efficient
execution of both the draft and verification passes. Experimental results
across 15 LLMs and tasks demonstrate that SPEQ achieves speedups of 2.07x,
1.53x, and 1.45x over FP16, Olive, and Tender, respectively.
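SPEQ's actual scheme remaps floating-point exponents; as a simplified integer analogue (an assumption for illustration, not the paper's method), the top bits of stored int8 weights can double as a coarser draft model with zero extra storage:

```python
import numpy as np

def int8_quantize(w):
    """Symmetric int8 quantization of a float tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def draft_view(q):
    """Reuse the top 4 bits of the stored int8 weights as a coarser
    4-bit 'draft' tensor -- no separate draft weights are stored."""
    return ((q.astype(np.int16) >> 4) << 4).astype(np.int8)

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
q, scale = int8_quantize(w)
d = draft_view(q)
# The draft is the same tensor at lower precision: it differs from the
# full weights by less than one low-order step (16 int8 units).
assert np.abs(q.astype(np.int16) - d.astype(np.int16)).max() < 16
```

Because the draft is a bit-level view of the full weights, speculative drafts come almost for free, and the full-precision model then verifies the drafted tokens losslessly.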
The Attribution Story of WhisperGate An Academic Perspective
Authors: Oleksandr Adamov, Anders Carlsson
2025-10-21
This paper explores the challenges of cyberattack attribution, specifically
APTs, applying the case study approach for the WhisperGate cyber operation of
January 2022 executed by the Russian military intelligence service (GRU) and
targeting Ukrainian government entities. The study provides a detailed review
of the threat actor identifiers and taxonomies used by leading cybersecurity
vendors, focusing on the evolving attribution from Microsoft, ESET, and
CrowdStrike researchers. Once the attribution to Ember Bear (GRU Unit 29155) is
established through technical and intelligence reports, we use both traditional
machine learning classifiers and a large language model (ChatGPT) to analyze
the indicators of compromise (IoCs), tactics, and techniques to statistically
and semantically attribute the WhisperGate attack. Our findings reveal
overlapping indicators with the Sandworm group (GRU Unit 74455) but also strong
evidence pointing to Ember Bear, especially when the LLM is fine-tuned or
contextually augmented with additional intelligence. This shows how AI/GenAI
models, with proper fine-tuning, are capable of helping solve the attribution
challenge.
LAFA Agentic LLM-Driven Federated Analytics over Decentralized Data Sources
Authors: Haichao Ji, Zibo Wang, Yifei Zhu, Meng han, Dan Wang, Zhu Han
2025-10-21
Large Language Models (LLMs) have shown great promise in automating data
analytics tasks by interpreting natural language queries and generating
multi-operation execution plans. However, existing LLM-agent-based analytics
frameworks operate under the assumption of centralized data access, offering
little to no privacy protection. In contrast, federated analytics (FA) enables
privacy-preserving computation across distributed data sources, but lacks
support for natural language input and requires structured, machine-readable
queries. In this work, we present LAFA, the first system that integrates
LLM-agent-based data analytics with FA. LAFA introduces a hierarchical
multi-agent architecture that accepts natural language queries and transforms
them into optimized, executable FA workflows. A coarse-grained planner first
decomposes complex queries into sub-queries, while a fine-grained planner maps
each subquery into a Directed Acyclic Graph of FA operations using prior
structural knowledge. To improve execution efficiency, an optimizer agent
rewrites and merges multiple DAGs, eliminating redundant operations and
minimizing computational and communication overhead. Our experiments
demonstrate that LAFA consistently outperforms baseline prompting strategies by
achieving higher execution plan success rates and reducing resource-intensive
FA operations by a substantial margin. This work establishes a practical
foundation for privacy-preserving, LLM-driven analytics that supports natural
language input in the FA setting.
DART A Structured Dataset of Regulatory Drug Documents in Italian for Clinical NLP
Authors: Mariano Barone, Antonio Laudante, Giuseppe Riccio, Antonio Romano, Marco Postiglione, Vincenzo Moscato
2025-10-21
The extraction of pharmacological knowledge from regulatory documents has
become a key focus in biomedical natural language processing, with applications
ranging from adverse event monitoring to AI-assisted clinical decision support.
However, research in this field has predominantly relied on English-language
corpora such as DrugBank, leaving a significant gap in resources tailored to
other healthcare systems. To address this limitation, we introduce DART (Drug
Annotation from Regulatory Texts), the first structured corpus of Italian
Summaries of Product Characteristics derived from the official repository of
the Italian Medicines Agency (AIFA). The dataset was built through a
reproducible pipeline encompassing web-scale document retrieval, semantic
segmentation of regulatory sections, and clinical summarization using a
few-shot-tuned large language model with low-temperature decoding. DART
provides structured information on key pharmacological domains such as
indications, adverse drug reactions, and drug-drug interactions. To validate
its utility, we implemented an LLM-based drug interaction checker that
leverages the dataset to infer clinically meaningful interactions. Experimental
results show that instruction-tuned LLMs can accurately infer potential
interactions and their clinical implications when grounded in the structured
textual fields of DART. We publicly release our code on GitHub:
https://github.com/PRAISELab-PicusLab/DART.
CircuitSeer Mining High-Quality Data by Probing Mathematical Reasoning Circuits in LLMs
Authors: Shaobo Wang, Yongliang Miao, Yuancheng Liu, Qianli Ma, Ning Liao, Linfeng Zhang
2025-10-21
Large language models (LLMs) have demonstrated impressive reasoning
capabilities, but scaling their performance often relies on massive reasoning
datasets that are computationally expensive to train on. Existing data
selection methods aim to curate smaller, high-quality subsets but often rely on
costly external models or opaque heuristics. In this work, we shift the focus
from external heuristics to the model's internal mechanisms. We find that
complex reasoning tasks consistently activate a sparse, specialized subset of
attention heads, forming core reasoning circuits. Building on this insight, we
propose CircuitSeer, a novel data selection method that quantifies the
reasoning complexity of data by measuring its influence on these crucial
circuits. Extensive experiments on 4 models and 9 datasets demonstrate
CircuitSeer's superiority. Notably, fine-tuning Qwen2.5-Math-7B on just 10% of
data selected by our method achieves a 1.4-point gain in average Pass@1 over
training on the full dataset, highlighting its efficiency and effectiveness.
Adamas Hadamard Sparse Attention for Efficient Long-Context Inference
Authors: Siyuan Yan, Guo-Qing Jiang, Yuchen Zhang, Xiaoxing Ma, Ran Zhu, Chun Cao, Jingwei Xu
2025-10-21
Large language models (LLMs) now support context windows of hundreds of
thousands to millions of tokens, enabling applications such as long-document
summarization, large-scale code synthesis, multi-document question answering
and persistent multi-turn dialogue. However, such extended contexts exacerbate
the quadratic cost of self-attention, leading to severe latency in
autoregressive decoding. Existing sparse attention methods alleviate these
costs but rely on heuristic patterns that struggle to recall critical key-value
(KV) pairs for each query, resulting in accuracy degradation. We introduce
Adamas, a lightweight yet highly accurate sparse attention mechanism designed
for long-context inference. Adamas applies the Hadamard transform,
bucketization and 2-bit quantization to produce compact representations, and
leverages Manhattan-distance estimation for efficient top-k selections.
Experiments show that Adamas matches the accuracy of full attention with only a
64-token budget, achieves near-lossless performance at 128, and supports up to
8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering
up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences.
Remarkably, Adamas attains comparable or even lower perplexity than full
attention, underscoring its effectiveness in maintaining accuracy under
aggressive sparsity.
How2Compress Scalable and Efficient Edge Video Analytics via Adaptive Granular Video Compression
Authors: Yuheng Wu, Thanh-Tung Nguyen, Lucas Liebe, Quang Tau, Pablo Espinosa Campos, Jinghan Cheng, Dongman Lee
2025-10-21
With the rapid proliferation of the Internet of Things, video analytics has
become a cornerstone application in wireless multimedia sensor networks. To
support such applications under bandwidth constraints, learning-based adaptive
quantization for video streaming has demonstrated strong potential in
reducing bitrate while maintaining analytical accuracy. However, existing
frameworks often fail to fully exploit the fine-grained quality control enabled
by modern block-based video codecs, leaving significant compression efficiency
untapped.
In this paper, we present How2Compress, a simple yet effective framework
designed to enhance video compression efficiency through precise, fine-grained
quality control at the macroblock level. How2Compress is a plug-and-play module
and can be seamlessly integrated into any existing edge video analytics
pipelines. We implement How2Compress on the H.264 codec and evaluate its
performance across diverse real-world scenarios. Experimental results show that
How2Compress achieves substantial bitrate savings and outperforms baselines
without compromising accuracy, demonstrating its
practical effectiveness and efficiency. Code is available at
https://github.com/wyhallenwu/how2compress and a reproducible docker image at
https://hub.docker.com/r/wuyuheng/how2compress.
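A minimal sketch of macroblock-level quality control: map a per-macroblock saliency score to an H.264-style quantization parameter (QP), spending bits on salient regions. The QP range and the linear mapping below are assumptions for illustration, not How2Compress's learned policy.

```python
import numpy as np

def qp_map(saliency, qp_min=22, qp_max=40):
    # Normalize saliency to [0, 1], then assign lower QP (higher quality)
    # to salient macroblocks and higher QP (fewer bits) to background.
    s = (saliency - saliency.min()) / (np.ptp(saliency) + 1e-12)
    return np.round(qp_max - s * (qp_max - qp_min)).astype(int)

saliency = np.array([[0.0, 1.0],
                     [0.5, 0.25]])   # e.g. from a lightweight detector
qp = qp_map(saliency)                # one QP per macroblock
```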
MENTOR A Reinforcement Learning Framework for Model Enhancement via Teacher-Optimized Rewards in Small Models
Authors: ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim
2025-10-21
Distilling the tool-using capabilities of large language models (LLMs) into
smaller, more efficient small language models (SLMs) is a key challenge for
their practical application. The predominant approach, supervised fine-tuning
(SFT), suffers from poor generalization as it trains models to imitate a static
set of teacher trajectories rather than learn a robust methodology. While
reinforcement learning (RL) offers an alternative, the standard RL using sparse
rewards fails to effectively guide SLMs, causing them to struggle with
inefficient exploration and adopt suboptimal strategies. To address these
distinct challenges, we propose MENTOR, a framework that synergistically
combines RL with teacher-guided distillation. Instead of simple imitation,
MENTOR employs an RL-based process to learn a more generalizable policy through
exploration. In addition, to solve the problem of reward sparsity, it uses a
teacher's reference trajectory to construct a dense, composite teacher-guided
reward that provides fine-grained guidance. Extensive experiments demonstrate
that MENTOR significantly improves the cross-domain generalization and
strategic competence of SLMs compared to both SFT and standard sparse-reward RL
baselines.
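One way to realize a dense, teacher-guided composite reward is to mix the sparse task outcome with step overlap against the teacher's reference trajectory; the overlap proxy and the `alpha` weighting here are illustrative, not MENTOR's exact formulation.

```python
def teacher_guided_reward(student_steps, teacher_steps, task_success,
                          alpha=0.5):
    # Dense term: fraction of the teacher's reference steps the student
    # reproduced (a rough stand-in for fine-grained trajectory guidance).
    overlap = len(set(student_steps) & set(teacher_steps))
    dense = overlap / max(len(teacher_steps), 1)
    # Sparse term: the usual binary outcome reward.
    sparse = 1.0 if task_success else 0.0
    # Composite reward blends both signals.
    return alpha * sparse + (1 - alpha) * dense
```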
S2AP Score-space Sharpness Minimization for Adversarial Pruning
Authors: Giorgio Piras, Qi Zhao, Fabio Brau, Maura Pintor, Christian Wressnegger, Battista Biggio
2025-10-21
Adversarial pruning methods have emerged as a powerful tool for compressing
neural networks while preserving robustness against adversarial attacks. These
methods typically follow a three-step pipeline: (i) pretrain a robust model,
(ii) select a binary mask for weight pruning, and (iii) fine-tune the pruned
model. To select the binary mask, these methods minimize a robust loss by
assigning an importance score to each weight, and then keep the weights with
the highest scores. However, this score-space optimization can lead to sharp
local minima in the robust loss landscape and, in turn, to an unstable mask
selection, reducing the robustness of adversarial pruning methods. To overcome
this issue, we propose a novel plug-in method for adversarial pruning, termed
Score-space Sharpness-aware Adversarial Pruning (S2AP). Through our method, we
introduce the concept of score-space sharpness minimization, which operates
during the mask search by perturbing importance scores and minimizing the
corresponding robust loss. Extensive experiments across various datasets,
models, and sparsity levels demonstrate that S2AP effectively minimizes
sharpness in score space, stabilizing the mask selection, and ultimately
improving the robustness of adversarial pruning methods.
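Score-space sharpness minimization can be sketched as a SAM-style update applied to the importance scores rather than the weights: perturb the scores in the worst-case direction, then descend the gradient evaluated at the perturbed point. The toy quadratic loss below stands in for the robust loss.

```python
import numpy as np

def topk_mask(scores, k):
    # Binary mask keeping the k highest-scoring weights.
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[::-1][:k]] = 1.0
    return mask

def sharpness_aware_step(scores, grad_fn, rho=0.05, lr=0.1):
    # SAM in score space: ascend to a worst-case perturbation of the
    # scores, then update with the gradient taken there.
    g = grad_fn(scores)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_perturbed = grad_fn(scores + eps)
    return scores - lr * g_perturbed

# Toy stand-in for the robust loss: quadratic bowl around `target`.
target = np.array([1.0, -2.0, 3.0, 0.5])
grad_fn = lambda s: 2.0 * (s - target)

scores = np.zeros(4)
for _ in range(100):
    scores = sharpness_aware_step(scores, grad_fn)
```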
Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers
Authors: Firas Gabetni, Giuseppe Curci, Andrea Pilzer, Subhankar Roy, Elisa Ricci, Gianni Franchi
2025-10-21
Uncertainty quantification (UQ) is essential for deploying deep neural
networks in safety-critical settings. Although methods like Deep Ensembles
achieve strong UQ performance, their high computational and memory costs hinder
scalability to large models. We introduce Hydra Ensembles, an efficient
transformer-based ensemble that prunes attention heads to create diverse
members and merges them via a new multi-head attention with grouped
fully-connected layers. This yields a compact model with inference speed close
to a single network, matching or surpassing Deep Ensembles in UQ performance
without retraining from scratch. We also provide an in-depth analysis of
pruning, showing that naive approaches can harm calibration, whereas Hydra
Ensembles preserves robust uncertainty. Experiments on image and text
classification tasks, with various architectures, show consistent gains over
Deep Ensembles. Remarkably, in zero-shot classification on ImageNet-1k, our
approach surpasses state of the art methods, even without requiring additional
training.
Learning Human-Object Interaction as Groups
Authors: Jiajun Hong, Jianan Wei, Wenguan Wang
2025-10-21
Human-Object Interaction Detection (HOI-DET) aims to localize human-object
pairs and identify their interactive relationships. To aggregate contextual
cues, existing methods typically propagate information across all detected
entities via self-attention mechanisms, or establish message passing between
humans and objects with bipartite graphs. However, they primarily focus on
pairwise relationships, overlooking that interactions in real-world scenarios
often emerge from collective behaviors (multiple humans and objects engaging in
joint activities). In light of this, we revisit relation modeling from a group
view and propose GroupHOI, a framework that propagates contextual information
in terms of geometric proximity and semantic similarity. To exploit the
geometric proximity, humans and objects are grouped into distinct clusters
using a learnable proximity estimator based on spatial features derived from
bounding boxes. In each group, a soft correspondence is computed via
self-attention to aggregate and dispatch contextual cues. To incorporate the
semantic similarity, we enhance the vanilla transformer-based interaction
decoder with local contextual cues from HO-pair features. Extensive experiments
on HICO-DET and V-COCO benchmarks demonstrate the superiority of GroupHOI over
the state-of-the-art methods. It also exhibits leading performance on the more
challenging Nonverbal Interaction Detection (NVI-DET) task, which involves
varied forms of higher-order interactions within groups.
Text or Pixels? It Takes Half On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
Authors: Yanhong Li, Zixuan Lan, Jiawei Zhou
2025-10-21
Large language models (LLMs) and their multimodal variants can now process
visual inputs, including images of text. This raises an intriguing question:
can we compress textual inputs by feeding them as images to reduce token usage
while preserving performance? In this paper, we show that visual text
representations are a practical and surprisingly effective form of input
compression for decoder LLMs. We exploit the idea of rendering long text inputs
as a single image and provide it directly to the model. This leads to a
dramatically reduced number of decoder tokens required, offering a new form of
input compression. Through experiments on two distinct benchmarks, RULER
(long-context retrieval) and CNN/DailyMail (document summarization) we
demonstrate that this text-as-image method yields substantial token savings
(often nearly half) without degrading task performance.
StreamingTOM Streaming Token Compression for Efficient Video Understanding
Authors: Xueyi Chen, Keda Tao, Kele Shao, Huan Wang
2025-10-21
Unlike offline processing, streaming video vision-language models face two
fundamental constraints: causality and accumulation. Causality prevents access
to future frames that offline methods exploit, while accumulation causes tokens
to grow unbounded, creating efficiency bottlenecks. However, existing
approaches only regulate the post-LLM kv-cache, leaving costly pre-LLM prefill
unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage
framework that addresses both pre-LLM and post-LLM bottlenecks with predictable
latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects
tokens based on adjacent-frame changes and token saliency, drastically reducing
per-frame prefill cost by processing only a compact subset of visual tokens per
frame instead of all visual tokens. Online Quantized Memory stores tokens in
4-bit format, retrieves relevant groups on demand, and dequantizes them,
keeping the active kv-cache bounded regardless of stream length. Experiments
demonstrate that our method achieves substantial kv-cache compression, lower
peak memory, and faster TTFT compared to the prior SOTA.
StreamingTOM maintains state-of-the-art accuracy among training-free methods
on both offline benchmarks and RVS.
These results highlight the practical benefits of our two-stage approach for
efficient streaming video understanding with bounded growth.
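The Causal Temporal Reduction stage can be sketched as ranking each frame's tokens by a mix of adjacent-frame change and per-token saliency under a fixed budget; the 50/50 weighting below is an illustrative assumption, not the paper's tuned scoring.

```python
import numpy as np

def select_frame_tokens(prev_tokens, curr_tokens, saliency, budget):
    # Adjacent-frame change: how much each token moved since last frame.
    change = np.linalg.norm(curr_tokens - prev_tokens, axis=1)
    # Blend normalized change with saliency (illustrative equal weights).
    score = 0.5 * change / (change.max() + 1e-12) + 0.5 * saliency
    # Fixed per-frame budget: keep only the top-scoring tokens.
    keep = np.argsort(score)[::-1][:budget]
    return np.sort(keep)

prev = np.zeros((6, 4))
curr = np.zeros((6, 4))
curr[0] = 10.0                                      # token 0 changed a lot
saliency = np.array([0.1, 0.9, 0.0, 0.0, 0.0, 0.0]) # token 1 is salient
kept = select_frame_tokens(prev, curr, saliency, budget=2)
```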
Learning from the Best, Differently A Diversity-Driven Rethinking on Data Selection
Authors: Hongyi He, Xiao Liu, Zhenghao Lin, Mingni Tang, Yi Cheng, Jintao Wang, Wenjie Li, Peng Cheng, Yeyun Gong
2025-10-21
High-quality pre-training data is crucial for large language models, where
quality captures factual reliability and semantic value, and diversity ensures
broad coverage and distributional heterogeneity. Existing approaches typically
rely on single or multiple-dimensional score-based selection. However, directly
selecting top-scored data often degrades performance, and sampling from a
broader range is required to recover results. The above non-monotonicity
between dataset scores and downstream benchmark results reveals a fundamental
bias: score-based methods collapse correlated dimensions, causing top-scored
data to appear high-quality while systematically overlooking diversity. We
argue that ensuring diversity requires decomposing correlated metrics into
orthogonal feature dimensions, from which the top-scored data can be directly
selected. Therefore, we proposed the Orthogonal Diversity-Aware Selection
(ODiS) algorithm, which preserves both quality and diversity during data
selection. First, ODiS evaluates data from multiple dimensions, covering
language quality, knowledge quality, and comprehension difficulty. The
multi-dimensional scores are then decorrelated via Principal Component Analysis
(PCA), yielding orthogonal evaluation dimensions. For each dimension, a
Roberta-based scorer is trained to regress the data onto PCA-projected scores,
enabling scalable inference on large corpora. Finally, ODiS constructs the
training dataset by selecting top-scored data within each orthogonal dimension,
thereby ensuring both quality and diversity. Empirical results show that
ODiS-selected data exhibit less than 2% inter-dimension correlation, confirming
orthogonality between dimensions. More importantly, models trained with
ODiS-selected data significantly outperform other baselines on downstream
benchmarks, highlighting the necessity of orthogonal, diversity-aware data
selection for LLMs.
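The decorrelate-then-select step can be sketched with PCA via SVD; ODiS additionally trains RoBERTa-based scorers to regress onto the projected dimensions, which this toy version skips.

```python
import numpy as np

def orthogonal_select(scores, per_dim):
    # Center the multi-dimensional quality scores, then decorrelate them
    # with PCA (via SVD of the centered score matrix).
    X = scores - scores.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt.T                      # PCA-projected, orthogonal scores
    # Take the top-scoring items along each orthogonal dimension.
    chosen = set()
    for d in range(proj.shape[1]):
        top = np.argsort(proj[:, d])[::-1][:per_dim]
        chosen.update(int(i) for i in top)
    return sorted(chosen)

rng = np.random.default_rng(2)
scores = rng.random((50, 3))    # 50 examples x 3 quality dimensions
idx = orthogonal_select(scores, per_dim=5)
```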
DeepSeek-OCR Contexts Optical Compression
Authors: Haoran Wei, Yaofeng Sun, Yukun Li
2025-10-21
We present DeepSeek-OCR as an initial investigation into the feasibility of
compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two
components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically,
DeepEncoder serves as the core engine, designed to maintain low activations
under high-resolution input while achieving high compression ratios to ensure
an optimal and manageable number of vision tokens. Experiments show that when
the number of text tokens is within 10 times that of vision tokens (i.e., a
compression ratio < 10x), the model can achieve decoding (OCR) precision of
97%. Even at a compression ratio of 20x, the OCR accuracy still remains at
about 60%. This shows considerable promise for research areas such as
historical long-context compression and memory forgetting mechanisms in LLMs.
Beyond this, DeepSeek-OCR also demonstrates high practical value. On
OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision
tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while
utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can
generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a
single A100-40G). Codes and model weights are publicly accessible at
http://github.com/deepseek-ai/DeepSeek-OCR.
Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
Authors: Yoshinari Fujinuma
2025-10-21
Large Language Models (LLMs) are commonly used as evaluators in various
applications, but the reliability of the outcomes remains a challenge. One such
challenge is using LLMs-as-judges for direct assessment, i.e., assigning scores
from a specified range without any references. We first show that this
challenge stems from LLM judge outputs being associated with score range bias,
i.e., LLM judge outputs are highly sensitive to pre-defined score ranges,
preventing the search for optimal score ranges. We also show that similar
biases exist among models from the same family. We then mitigate this bias
through contrastive decoding, achieving up to 11.3% relative improvement on
average in Spearman correlation with human judgments across different score
ranges.
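Contrastive decoding in this setting can be sketched as subtracting an "amateur" model's log-probabilities from the expert judge's, damping biases the two share; the `beta` weight and the toy logits are illustrative.

```python
import numpy as np

def contrastive_logits(expert_logits, amateur_logits, beta=1.0):
    # Log-softmax both distributions, then score tokens by the gap
    # between expert and amateur log-probabilities.
    e = expert_logits - np.logaddexp.reduce(expert_logits)
    a = amateur_logits - np.logaddexp.reduce(amateur_logits)
    return e - beta * a

expert = np.array([3.0, 2.0, 0.0])
amateur = np.array([3.0, 0.0, 0.0])   # shares the expert's bias to token 0
scores = contrastive_logits(expert, amateur)
```

Here the shared preference for token 0 cancels out, so the contrastive score favors token 1, which only the expert supports.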
Extracting Rule-based Descriptions of Attention Features in Transformers
Authors: Dan Friedman, Adithya Bhaskar, Alexander Wettig, Danqi Chen
2025-10-20
Mechanistic interpretability strives to explain model behavior in terms of
bottom-up primitives. The leading paradigm is to express hidden states as a
linear combination of basis vectors, called features. However, this only
identifies which text sequences (exemplars) activate which features; the actual
interpretation of features requires subjective inspection of these exemplars.
This paper advocates for a different solution: rule-based descriptions that
match token patterns in the input and correspondingly increase or decrease the
likelihood of specific output tokens. Specifically, we extract rule-based
descriptions of SAE features trained on the outputs of attention layers. While
prior work treats the attention layers as an opaque box, we describe how it may
naturally be expressed in terms of interactions between input and output
features, of which we study three types: (1) skip-gram rules of the form
"[Canadian city]... speaks --> English", (2) absence rules of the form
"[Montreal]... speaks -/-> English," and (3) counting rules that toggle only
when the count of a word exceeds a certain value or the count of another word.
Absence and counting rules are not readily discovered by inspection of
exemplars, where manual and automatic descriptions often identify misleading or
incomplete explanations. We then describe a simple approach to extract these
types of rules automatically from a transformer, and apply it to GPT-2 small.
We find that a majority of features may be described well with around 100
skip-gram rules, though absence rules are abundant even as early as the first
layer (in over a fourth of features). We also isolate a few examples of
counting rules. This paper lays the groundwork for future research into
rule-based descriptions of features by defining them, showing how they may be
extracted, and providing a preliminary taxonomy of some of the behaviors they
represent.
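Skip-gram and absence rules of the kinds described can be represented as (trigger, target, weight) triples, where a positive weight raises the target token's likelihood and a negative weight suppresses it. This toy matcher illustrates the rule format only, not the paper's extraction procedure.

```python
def apply_rules(tokens, rules):
    # Accumulate likelihood adjustments for output tokens whose trigger
    # appears in the input; absence-style rules carry negative weights.
    adjustments = {}
    present = set(tokens)
    for trigger, target, weight in rules:
        if trigger in present:
            adjustments[target] = adjustments.get(target, 0.0) + weight
    return adjustments

rules = [
    ("Montreal", "English", -2.0),   # absence rule: suppress "English"
    ("Toronto", "English", +1.5),    # skip-gram rule: promote "English"
]
```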
Any-Depth Alignment Unlocking Innate Safety Alignment of LLMs to Any-Depth
Authors: Jiawei Zhang, Andrew Estornell, David D. Baek, Bo Li, Xiaojun Xu
2025-10-20
Large Language Models (LLMs) exhibit strong but shallow alignment: they
directly refuse harmful queries when a refusal is expected at the very start of
an assistant turn, yet this protection collapses once a harmful continuation is
underway (either through adversarial attacks or via harmful assistant-prefill
attacks). This raises a fundamental question: can the innate shallow alignment
in LLMs be unlocked to ensure safety at arbitrary generation
depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an
effective inference-time defense with negligible overhead. ADA is built based
on our observation that alignment is concentrated in the assistant header
tokens through repeated use in shallow-refusal training, and these tokens
possess the model's strong alignment priors. By reintroducing these tokens
mid-stream, ADA induces the model to reassess harmfulness and recover refusals
at any point in generation. Across diverse open-source model families (Llama,
Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety
performance without requiring any changes to the base model's parameters. It
secures a near-100% refusal rate against challenging adversarial prefill
attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces
the average success rate of prominent adversarial prompt attacks (such as GCG,
AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving
utility on benign tasks with minimal over-refusal. ADA maintains this
resilience even after the base model undergoes subsequent instruction tuning
(benign or adversarial).
MEG-GPT A transformer-based foundation model for magnetoencephalography data
Authors: Rukuang Huang, Sungjun Cho, Chetan Gohil, Oiwi Parker Jones, Mark Woolrich
2025-10-20
Modelling the complex spatiotemporal patterns of large-scale brain dynamics
is crucial for neuroscience, but traditional methods fail to capture the rich
structure in modalities such as magnetoencephalography (MEG). Recent advances
in deep learning have enabled significant progress in other domains, such as
language and vision, by using foundation models at scale. Here, we introduce
MEG-GPT, a transformer-based foundation model that uses time-attention and next
time-point prediction. To facilitate this, we also introduce a novel
data-driven tokeniser for continuous MEG data, which preserves the high
temporal resolution of continuous MEG signals without lossy transformations. We
trained MEG-GPT on tokenised brain region time-courses extracted from a
large-scale MEG dataset (N=612, eyes-closed rest, Cam-CAN data), and show that
the learnt model can generate data with realistic spatio-spectral properties,
including transient events and population variability. Critically, it performs
well in downstream decoding tasks, showing improved zero-shot generalisation
across sessions (improving accuracy from 0.54 to 0.59) and subjects (improving
accuracy from 0.41 to 0.49) compared to baseline methods. Furthermore, we show
the model can be
efficiently fine-tuned on a smaller labelled dataset to boost performance in
cross-subject decoding scenarios. This work establishes a powerful foundation
model for electrophysiological data, paving the way for applications in
computational neuroscience and neural decoding.
CompactPrompt A Unified Pipeline for Prompt Data Compression in LLM Workflows
Authors: Joong Ho Choi, Jiayang Zhao, Jeel Shah, Ritvika Sonawane, Vedant Singh, Avani Appalla, Will Flanagan, Filipe Condessa
2025-10-20
Large Language Models (LLMs) deliver powerful reasoning and generation
capabilities but incur substantial run-time costs when operating in agentic
workflows that chain together lengthy prompts and process rich data streams. We
introduce CompactPrompt, an end-to-end pipeline that merges hard prompt
compression with lightweight file-level data compression. CompactPrompt first
prunes low-information tokens from prompts using self-information scoring and
dependency-based phrase grouping. In parallel, it applies n-gram abbreviation
to recurrent textual patterns in attached documents and uniform quantization to
numerical columns, yielding compact yet semantically faithful representations.
Integrated into standard LLM agents, CompactPrompt reduces total token usage
and inference cost by up to 60% on benchmark datasets like TAT-QA and FinQA,
while preserving output quality (less than a 5% accuracy drop for
Claude-3.5-Sonnet and GPT-4.1-Mini). CompactPrompt helps visualize real-time
compression decisions and quantify cost-performance trade-offs, laying the
groundwork for leaner generative AI pipelines.
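The self-information pruning step can be sketched as follows; estimating token probabilities from prompt-internal frequencies is a rough stand-in for the model-based scoring a CompactPrompt-style pipeline would use.

```python
import math
from collections import Counter

def prune_low_information(tokens, keep_ratio=0.6):
    # Self-information I(t) = -log2 p(t), with p(t) estimated from the
    # prompt itself (illustrative proxy for a language-model estimate).
    counts = Counter(tokens)
    total = len(tokens)
    info = [-math.log2(counts[t] / total) for t in tokens]
    # Keep the most informative fraction of tokens, preserving order.
    k = max(1, int(len(tokens) * keep_ratio))
    threshold = sorted(info, reverse=True)[k - 1]
    return [t for t, i in zip(tokens, info) if i >= threshold][:k]

tokens = ["the", "the", "the", "cat", "sat"]
kept = prune_low_information(tokens, keep_ratio=0.4)
```

Frequent filler tokens carry low self-information under this estimate and are pruned first, while rare content words survive.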
OPTAGENT Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning
Authors: Zhenyu Bi, Meng Lu, Yang Li, Swastik Roy, Weijie Guan, Morteza Ziyadi, Xuan Wang
2025-10-20
Large Language Models (LLMs) have shown remarkable reasoning capabilities in
mathematical and scientific tasks. To enhance complex reasoning, multi-agent
systems have been proposed to harness the collective intelligence of LLM
agents. However, existing collaboration structures are either predefined or
rely on majority voting or round-table debates, which can suppress correct but
less dominant agent contributions. Recent approaches model multi-agent systems
as graph networks but optimize purely for agent performance, neglecting the
quality of interactions. We hypothesize that effective agent collaboration is
crucial for multi-agent reasoning and that debating quality plays a significant
role. To address this, we propose OPTAGENT, a multi-agent verbal reinforcement
learning algorithm that dynamically constructs and refines multi-agent
collaboration structures. Our method defines action spaces and a feedback
mechanism that evaluates collaboration robustness and coherence throughout the
debate. The final decision is achieved through a majority vote over all the
agents. We assess OPTAGENT on various reasoning tasks, including mathematical
reasoning, creative writing, scientific reasoning, and numerical sorting.
Results demonstrate that our approach significantly outperforms single-agent
prompting methods and state-of-the-art multi-agent frameworks on diverse tasks.
From Local to Global Revisiting Structured Pruning Paradigms for Large Language Models
Authors: Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Minwoo Lee, Shu-ping Yeh, Evgeny Stupachenko, Hao Feng, Li Yang
2025-10-20
Structured pruning is a practical approach to deploying large language models
(LLMs) efficiently, as it yields compact, hardware-friendly architectures.
However, the dominant local paradigm is task-agnostic: by optimizing layer-wise
reconstruction rather than task objectives, it tends to preserve perplexity or
generic zero-shot behavior but fails to capitalize on modest task-specific
calibration signals, often yielding limited downstream gains. We revisit global
structured pruning and present GISP (Global Iterative Structured Pruning), a
post-training method that removes attention heads and MLP channels using
first-order, loss-based importance weights aggregated at the structure level
with block-wise normalization. An iterative schedule, rather than one-shot
pruning, stabilizes accuracy at higher sparsity and mitigates perplexity
collapse without requiring intermediate fine-tuning; the pruning trajectory
also forms nested subnetworks that support a "prune-once, deploy-many"
workflow. Furthermore, because importance is defined by a model-level loss,
GISP naturally supports task-specific objectives; we instantiate perplexity for
language modeling and a margin-based objective for decision-style tasks.
Extensive experiments show that across Llama2-7B/13B, Llama3-8B, and
Mistral-0.3-7B, GISP consistently lowers WikiText-2 perplexity and improves
downstream accuracy, with especially strong gains at 40-50% sparsity; on
DeepSeek-R1-Distill-Llama-3-8B with GSM8K, task-aligned calibration
substantially boosts exact-match accuracy.
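First-order, loss-based importance with block-wise normalization can be sketched per attention head as the summed |weight x gradient| over the head's parameters; the toy head slices below are illustrative.

```python
import numpy as np

def head_importance(weights, grads, head_slices):
    # First-order importance: sum |w * dL/dw| over each head's parameters,
    # then normalize within the block.
    per_head = np.array([np.abs(weights[s] * grads[s]).sum()
                         for s in head_slices])
    return per_head / (per_head.sum() + 1e-12)

def prune_heads(importance, ratio):
    # Remove the lowest-importance fraction; return surviving head indices.
    n_prune = int(len(importance) * ratio)
    order = np.argsort(importance)
    return sorted(int(i) for i in order[n_prune:])

w = np.ones(12)                                    # toy flattened weights
g = np.array([0.0] * 4 + [1.0] * 4 + [2.0] * 4)    # per-parameter gradients
slices = [slice(0, 4), slice(4, 8), slice(8, 12)]  # three heads
imp = head_importance(w, g, slices)
kept = prune_heads(imp, ratio=1 / 3)
```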
Glyph Scaling Context Windows via Visual-Text Compression
Authors: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang
2025-10-20
Large language models (LLMs) increasingly rely on long-context modeling for
tasks such as document understanding, code analysis, and multi-step reasoning.
However, scaling context windows to the million-token level brings prohibitive
computational and memory costs, limiting the practicality of long-context LLMs.
In this work, we take a different perspective-visual context scaling-to tackle
this challenge. Instead of extending token-based sequences, we propose Glyph, a
framework that renders long texts into images and processes them with
vision-language models (VLMs). This approach substantially compresses textual
input while preserving semantic information, and we further design an
LLM-driven genetic search to identify optimal visual rendering configurations
for balancing accuracy and compression. Through extensive experiments, we
demonstrate that our method achieves 3-4x token compression while maintaining
accuracy comparable to leading LLMs such as Qwen3-8B on various long-context
benchmarks. This compression also leads to around 4x faster prefilling and
decoding, and approximately 2x faster SFT training. Furthermore, under extreme
compression, a 128K-context VLM could scale to handle 1M-token-level text
tasks. In addition, the rendered text data benefits real-world multimodal
tasks, such as document understanding. Our code and model are released at
https://github.com/thu-coai/Glyph.
Beyond More Context Retrieval Diversity Boosts Multi-Turn Intent Understanding
Authors: Zhiming Lin
2025-10-20
Multi-turn intent understanding is central to task-oriented chatbots, yet
real deployments face tight token budgets and noisy contexts, and most
retrieval pipelines emphasize relevance while overlooking set-level diversity
and confounds such as more context or exemplar order. We ask whether retrieval
diversity, rather than longer prompts, systematically improves intent
understanding under fixed budgets. We present a diversity-aware retrieval
framework that selects in-context exemplars to balance intent coverage and
linguistic variety, and integrates this selection with standard decoders;
the evaluation enforces budget-matched prompts and randomized positions, and
includes sensitivity analyses over exemplar count, diversity strength, and
backbone size. On MultiWOZ 2.4 and SGD, the approach achieves strong gains in
Joint Goal Accuracy under equal token budgets, surpassing strong LLM/DST
baselines, with consistent improvements across K from 4 to 7 and moderate
latency. Overall, the study isolates and validates the impact of content
diversity in retrieval and offers a simple, deployable selection principle for
building accurate, budget-constrained multi-turn intent systems.
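One concrete way to balance relevance against set-level diversity is greedy maximal-marginal-relevance (MMR) selection over candidate exemplars; the paper's selector may differ, so treat this as a sketch of the principle.

```python
def select_diverse_exemplars(sim_to_query, pairwise_sim, k, lam=0.5):
    # Start with the most relevant exemplar, then greedily add items that
    # trade off query relevance against redundancy with those chosen.
    n = len(sim_to_query)
    chosen = [max(range(n), key=lambda i: sim_to_query[i])]
    while len(chosen) < k:
        best, best_score = None, float("-inf")
        for i in range(n):
            if i in chosen:
                continue
            redundancy = max(pairwise_sim[i][j] for j in chosen)
            score = lam * sim_to_query[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

sim_to_query = [0.9, 0.85, 0.2, 0.8]      # items 0 and 1 are near-duplicates
pairwise = [[1.0, 0.95, 0.1, 0.1],
            [0.95, 1.0, 0.1, 0.1],
            [0.1, 0.1, 1.0, 0.1],
            [0.1, 0.1, 0.1, 1.0]]
picked = select_diverse_exemplars(sim_to_query, pairwise, k=2)
```

Note how the redundant item 1 loses to the less relevant but more diverse item 3 under an equal token budget.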
ZACH-ViT A Zero-Token Vision Transformer with ShuffleStrides Data Augmentation for Robust Lung Ultrasound Classification
Authors: Athanasios Angelakis, Amne Mousa, Micah L. A. Heldeweg, Laurens A. Biesheuvel, Mark A. Haaksma, Jasper M. Smit, Pieter R. Tuinman, Paul W. G. Elbers
2025-10-20
Differentiating cardiogenic pulmonary oedema (CPE) from non-cardiogenic and
structurally normal lungs in lung ultrasound (LUS) videos remains challenging
due to the high visual variability of non-cardiogenic inflammatory patterns
(NCIP/ARDS-like), interstitial lung disease, and healthy lungs. This
heterogeneity complicates automated classification, as overlapping B-lines and
pleural artefacts are common. We introduce ZACH-ViT (Zero-token Adaptive
Compact Hierarchical Vision Transformer), a 0.25 M-parameter Vision Transformer
variant that removes both positional embeddings and the [CLS] token, making it
fully permutation-invariant and suitable for unordered medical image data. To
enhance generalization, we propose ShuffleStrides Data Augmentation (SSDA),
which permutes probe-view sequences and frame orders while preserving
anatomical validity. ZACH-ViT was evaluated on 380 LUS videos from 95
critically ill patients against nine state-of-the-art baselines. Despite the
heterogeneity of the non-cardiogenic group, ZACH-ViT achieved the highest
validation and test ROC-AUC (0.80 and 0.79) with balanced sensitivity (0.60)
and specificity (0.91), while all competing models collapsed to trivial
classification. It trains 1.35x faster than Minimal ViT (0.62M parameters) with
2.5x fewer parameters, supporting real-time clinical deployment. These results
show that aligning architectural design with data structure can outperform
scale in small-data medical imaging.
Language Confusion Gate Language-Aware Decoding Through Model Self-Distillation
Authors: Collin Zhang, Fei Huang, Chenhan Yuan, Junyang Lin
2025-10-20
Large language models (LLMs) often experience language confusion, which is
the unintended mixing of languages during text generation. Current solutions to
this problem either necessitate model retraining or cannot differentiate
between harmful confusion and acceptable code-switching. This paper introduces
the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters
tokens during decoding without altering the base LLM. The LCG is trained using
norm-adjusted self-distillation to predict appropriate language families and
apply masking only when needed. Our method is based on the findings that
language confusion is infrequent, correct-language tokens are usually among the
top predictions, and output token embedding norms are larger for high-resource
languages, which biases sampling. When evaluated across various models,
including Qwen3, GPT-OSS, Gemma3, and Llama3.1, LCG decreases language confusion
significantly, often by an order of magnitude, without negatively impacting
task performance. Code is available at
https://github.com/collinzrj/language_confusion_gate.
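The gating idea can be sketched as follows: inspect the top candidate tokens, and mask disallowed language families only when confusion is detected, so the distribution is untouched in the common case. The language-family lookup here is a stand-in for the distilled predictor.

```python
def gate_logits(logits, token_lang, allowed_langs, top_n=2):
    # Check only the top-n candidates: confusion is infrequent, and the
    # correct-language token is usually among the top predictions.
    top = sorted(logits, key=logits.get, reverse=True)[:top_n]
    if all(token_lang.get(t) in allowed_langs for t in top):
        return logits                       # no confusion: pass through
    # Confusion detected: mask tokens outside the allowed families.
    return {t: (v if token_lang.get(t) in allowed_langs else float("-inf"))
            for t, v in logits.items()}

logits = {"hello": 2.0, "你好": 1.9, "world": 0.5}
token_lang = {"hello": "latin", "你好": "cjk", "world": "latin"}
gated = gate_logits(logits, token_lang, allowed_langs={"latin"})
```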
TabR1 Taming GRPO for tabular reasoning LLMs
Authors: Pengxiang Cai, Zihao Gao, Jintai Chen
2025-10-20
Tabular prediction has traditionally relied on gradient-boosted decision
trees and specialized deep learning models, which excel within tasks but
provide limited interpretability and weak transfer across tables. Reasoning
large language models (LLMs) promise cross-task adaptability with transparent
reasoning traces, yet their potential has not been fully realized for tabular
data. This paper presents TabR1, the first reasoning LLM for tabular prediction
with multi-step reasoning. At its core is Permutation Relative Policy
Optimization (PRPO), a simple yet efficient reinforcement learning method that
encodes column-permutation invariance as a structural prior. By constructing
multiple label-preserving permutations per sample and estimating advantages
both within and across permutations, PRPO transforms sparse rewards into dense
learning signals and improves generalization. With limited supervision, PRPO
activates the reasoning ability of LLMs for tabular prediction, enhancing
few-shot and zero-shot performance as well as interpretability. Comprehensive
experiments demonstrate that TabR1 achieves performance comparable to strong
baselines under full-supervision fine-tuning. In the zero-shot setting, TabR1
approaches the performance of strong baselines under the 32-shot setting.
Moreover, TabR1 (8B) substantially outperforms much larger LLMs across various
tasks, achieving up to 53.17% improvement over DeepSeek-R1 (685B).
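The within- and across-permutation advantage estimation in PRPO can be sketched as group-relative baselining of a reward matrix; the equal weighting of the two terms is an illustrative assumption, not the paper's exact estimator.

```python
import numpy as np

def prpo_advantages(rewards):
    # rewards: shape (n_permutations, n_samples). Baseline each reward
    # against the mean within its permutation and across permutations,
    # turning sparse outcomes into dense relative learning signals.
    within = rewards - rewards.mean(axis=1, keepdims=True)
    across = rewards - rewards.mean(axis=0, keepdims=True)
    return 0.5 * (within + across)     # illustrative equal weighting

rewards = np.array([[1.0, 0.0],
                    [0.0, 1.0]])       # 2 permutations x 2 samples
adv = prpo_advantages(rewards)
```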
M2H Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception
Authors: U. V. B. L Udugama, George Vosselman, Francesco Nex
2025-10-20
Deploying real-time spatial perception on edge devices requires efficient
multi-task models that leverage complementary task information while minimizing
computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel
multi-task learning framework designed for semantic segmentation and depth,
edge, and surface normal estimation from a single monocular image. Unlike
conventional approaches that rely on independent single-task models or shared
encoder-decoder architectures, M2H introduces a Window-Based Cross-Task
Attention Module that enables structured feature exchange while preserving
task-specific details, improving prediction consistency across tasks. Built on
a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time
deployment and serves as the foundation for monocular spatial perception
systems supporting 3D scene graph construction in dynamic environments.
Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task
models on NYUDv2, surpasses single-task depth and semantic baselines on
Hypersim, and achieves superior performance on the Cityscapes dataset, all
while maintaining computational efficiency on laptop hardware. Beyond
benchmarks, M2H is validated on real-world data, demonstrating its practicality
in spatial perception tasks.
Localist LLMs with Recruitment Learning
Authors: Joachim Diederich
2025-10-20
We present a novel framework for training large language models with
continuously adjustable internal representations that span the full spectrum
from localist (interpretable, rule-based) to distributed (generalizable,
efficient) encodings. The key innovations are (1) a locality dial, a tunable
parameter that dynamically controls the degree of localization during both
training and inference without requiring model retraining, (2) an
information-theoretic recruitment mechanism that adaptively allocates semantic
blocks as needed, eliminating the requirement for complete domain knowledge at
initialization, and (3) a hierarchical recruitment framework that extends
capacity allocation to entire specialized LLMs, enabling multi-granularity
architectural adaptation. This is achieved through group sparsity penalties on
attention mechanisms, information-theoretic anchor design, dynamic rule
injection, and principled recruitment criteria based on penalized likelihood
with explicit units. We provide rigorous mathematical results establishing
explicit threshold conditions under which attention provably concentrates on
semantically relevant blocks at stationary points, with exact bounds on
attention entropy and pointer fidelity. The hierarchical recruitment mechanism
provides convergence guarantees at both the block level (fine-grained,
within-LLM) and the LLM level (coarse-grained, cross-domain), ensuring the
system discovers semantic partitions that balance model complexity against data
encoding efficiency. This framework enables practitioners to continuously
interpolate between interpretable and high-performance modes while adapting
architectural capacity at multiple granularities, supporting applications in
regulated domains requiring both transparency and capability.
Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems
Authors: Rishi Jha, Harold Triedman, Justin Wagle, Vitaly Shmatikov
2025-10-20
Control-flow hijacking attacks manipulate orchestration mechanisms in
multi-agent systems into performing unsafe actions that compromise the system
and exfiltrate sensitive information. Recently proposed defenses, such as
LlamaFirewall, rely on alignment checks of inter-agent messages to ensure
that all agent invocations are "related to" and "likely to further" the
original objective.
We start by demonstrating control-flow hijacking attacks that evade these
defenses even if alignment checks are performed by advanced LLMs. We argue that
the safety and functionality objectives of multi-agent systems fundamentally
conflict with each other. This conflict is exacerbated by the brittle
definitions of "alignment" and the checkers' incomplete visibility into the
execution context.
We then propose, implement, and evaluate ControlValve, a new defense inspired
by the principles of control-flow integrity and least privilege. ControlValve
(1) generates permitted control-flow graphs for multi-agent systems, and (2)
enforces that all executions comply with these graphs, along with contextual
rules (generated in a zero-shot manner) for each agent invocation.
StreamingThinker Large Language Models Can Think While Reading
Authors: Junlong Tong, Yingqi Fan, Anhao Zhao, Yunpu Ma, Xiaoyu Shen
2025-10-20
Large language models (LLMs) have demonstrated remarkable capabilities in
chain-of-thought (CoT) reasoning. However, the current reasoning paradigm
initiates thinking only after the entire input is available, which introduces
unnecessary latency and weakens attention to earlier information in dynamic
scenarios. Inspired by the human cognitive process of thinking while reading,
we first design a \textit{\textbf{streaming thinking}} paradigm for LLMs, where
reasoning unfolds in the order of the input and further adjusts its depth once
reading is complete. We instantiate this paradigm with
\textit{StreamingThinker}, a framework that enables LLMs to think while reading
through the integration of streaming CoT generation, streaming-constraint
training, and streaming parallel inference. Specifically, StreamingThinker
employs streaming reasoning units with quality control for CoT generation,
enforces order-preserving reasoning through streaming attention masks and
position encoding, and leverages parallel KV caches that decouple input
encoding from reasoning generation, thereby ensuring alignment and enabling
true concurrency. We evaluate StreamingThinker on the Qwen3 model family across
math reasoning, logical reasoning, and context-based QA reasoning tasks.
Experimental results show that StreamingThinker preserves performance
comparable to batch thinking, while yielding an 80\% reduction in token waiting
before the onset of reasoning and a more than 60\% reduction in time-level
latency for producing the final answer, demonstrating the effectiveness of the
streaming paradigm for LLM reasoning. Code will be released at
\href{https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker}{this
repository.}
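The order-preserving constraint can be pictured as a block-causal attention mask over input and thinking tokens. The sketch below is a toy construction under assumed sizes and a fixed one-unit-per-chunk rule, not the paper's exact mask:

```python
import numpy as np

def streaming_mask(n_input, n_think, chunk=2):
    """Toy order-preserving attention mask: thinking token j may attend
    only to the input tokens already read when its reasoning unit starts
    (one unit per `chunk` input tokens), plus earlier thinking tokens.
    Sizes and the chunking rule are illustrative assumptions.
    """
    m = np.zeros((n_think, n_input + n_think), dtype=bool)
    for j in range(n_think):
        visible_inputs = min(n_input, (j + 1) * chunk)
        m[j, :visible_inputs] = True              # inputs read so far
        m[j, n_input:n_input + j + 1] = True      # causal over thoughts
    return m
```

Because each row only ever widens its visible-input prefix, reasoning can start as soon as the first chunk arrives instead of waiting for the full input.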
DSEBench A Test Collection for Explainable Dataset Search with Examples
Authors: Qing Shi, Jing He, Qiaosheng Chen, Gong Cheng
2025-10-20
Dataset search has been an established information retrieval task. Current
paradigms either retrieve datasets that are relevant to a keyword query or find
datasets that are similar to an input target dataset. To allow for their
combined specification of information needs, in this article, we investigate
the more generalized task of Dataset Search with Examples (DSE) and further
extend it to Explainable DSE that requires identifying the metadata and content
fields of a dataset that indicate its relevance to the query and similarity to
the target datasets. To facilitate this research, we construct DSEBench, a test
collection that provides high-quality dataset- and field-level annotations to
enable the evaluation of explainable DSE. We also employ a large language model
to generate numerous annotations to be used for training. We establish
extensive baselines on DSEBench by adapting and evaluating a variety of sparse,
dense, and LLM-based retrieval, reranking, and explanation methods.
CosmoCore Affective Dream-Replay Reinforcement Learning for Code Generation
Authors: Santhosh Kumar Ravindran
2025-10-20
We introduce CosmoCore, a neuroscience-inspired reinforcement learning (RL)
architecture that integrates affective signals to enhance code generation in
large language models (LLMs). Motivated by human and animal learning, where
embarrassment from mistakes drives rapid correction (as observed in training a
puppy to avoid repeating errors after a single scolding), CosmoCore tags code
generation trajectories with valence and surprise using a lightweight
multi-layer perceptron (MLP). High-negative-valence (cringe) episodes, such as
buggy code outputs, are prioritized in a Dream Queue for five-fold replay
during off-policy updates, while low-surprise successes are pruned to prevent
overconfidence and buffer bloat. Evaluated on code generation benchmarks such
as HumanEval and BigCodeBench, alongside simulations with a custom data
pipeline environment, CosmoCore reduces hallucinated code (e.g., syntax errors
or logical bugs) by 48\% and accelerates self-correction by 45\%. Local
experiments using Hugging Face models in a PySpark environment validate these
gains, with code snippets provided for replication. Ablations confirm that
valence tagging boosts curiosity in exploration and that pruning mitigates
inefficiency.
This framework extends RL from human feedback (RLHF) for more emotionally aware
code assistants, with applications in IDEs and data pipelines. Code and the
custom mini-world simulation are released.
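The replay policy sketched in the abstract (five-fold replay of high-negative-valence episodes, pruning of low-surprise successes) might look like this toy buffer; the thresholds, the 5x factor's trigger, and all field names are assumptions:

```python
import heapq
import itertools

class DreamQueue:
    """Toy affective replay buffer: "cringe" (strongly negative valence)
    episodes are enqueued five-fold for prioritized replay, while
    low-surprise successes are dropped to avoid buffer bloat.
    """
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal valence

    def add(self, trajectory, valence, surprise):
        if valence > 0 and surprise < 0.1:
            return                          # prune low-surprise successes
        copies = 5 if valence < -0.5 else 1  # five-fold "cringe" replay
        for _ in range(copies):
            heapq.heappush(self._heap, (valence, next(self._counter), trajectory))

    def sample(self):
        # Most negative valence is replayed first
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

A min-heap keyed on valence makes the most embarrassing episode the next one replayed, which is the prioritization the abstract describes.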
ZSPAPrune Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models
Authors: Pu Zhang, Yuwei Li, Xingyuan Xian, Guoming Tang
2025-10-20
As the capabilities of Vision-Language Models (VLMs) advance, they can
process increasingly large inputs, which, unlike in LLMs, generates significant
visual token redundancy and leads to prohibitive inference costs. While many
methods aim to reduce these costs by pruning visual tokens, existing
approaches, whether based on attention or diversity, typically neglect the
guidance of the text prompt and thus fail to prioritize task relevance. In this
work, we propose a novel zero-shot method that reframes the problem by
introducing a prompt-aware perspective, explicitly modeling visual token
pruning as a balance between task relevance and information diversity. Our
hierarchical approach first selects a core set of task-relevant visual tokens
and then supplements them with diversity tokens to preserve broader context.
Experiments across multiple models and benchmarks show that our method achieves
performance that matches or surpasses the state-of-the-art with only minimal
accuracy loss, even when pruning up to 90\% of the tokens. Furthermore, these
gains are accompanied by significant reductions in GPU memory footprint and
inference latency.
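The two-stage relevance-then-diversity selection can be sketched as follows; the cosine scoring, the 70/30 core-versus-diversity split, and all names are illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np

def prompt_aware_prune(visual_tokens, prompt_vec, keep, core_frac=0.7):
    """Toy zero-shot prompt-aware pruning: keep a core set of tokens most
    similar to the prompt embedding, then add diversity tokens that are
    least similar (on average) to the already-kept set.
    """
    tokens = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    p = prompt_vec / np.linalg.norm(prompt_vec)
    relevance = tokens @ p                       # cosine similarity to prompt
    n_core = max(1, int(keep * core_frac))
    order = np.argsort(-relevance)
    kept = list(order[:n_core])                  # stage 1: task relevance
    rest = [i for i in order[n_core:]]
    while len(kept) < keep and rest:             # stage 2: diversity
        sims = tokens[rest] @ tokens[kept].T     # similarity to kept set
        idx = int(np.argmin(sims.mean(axis=1)))  # most dissimilar wins
        kept.append(rest.pop(idx))
    return sorted(kept)
```

Note the method needs no training: it only reuses token embeddings and the prompt embedding already present at inference time, which is what makes it zero-shot.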
When AI companions become witty Can human brain recognize AI-generated irony?
Authors: Xiaohui Rao, Hanlin Wu, Zhenguang G. Cai
2025-10-20
As Large Language Models (LLMs) are increasingly deployed as social agents
and trained to produce humor and irony, a question emerges: when encountering
witty AI remarks, do people interpret these as intentional irony or
mere computational output? This study investigates whether people adopt the
intentional stance, attributing mental states to explain behavior, toward AI
during irony comprehension. Irony provides an ideal paradigm because it
during irony comprehension. Irony provides an ideal paradigm because it
requires distinguishing intentional contradictions from unintended errors
through effortful semantic reanalysis. We compared behavioral and neural
responses to ironic statements from AI versus human sources using established
ERP components: P200 reflecting early incongruity detection and P600 indexing
cognitive efforts in reinterpreting incongruity as deliberate irony. Results
demonstrate that people do not fully adopt the intentional stance toward
AI-generated irony. Behaviorally, participants attributed incongruity to
deliberate irony for both sources, though significantly less for AI
than for human sources, showing a greater tendency to interpret AI
incongruities as
computational errors. Neural data revealed attenuated P200 and P600 effects for
AI-generated irony, suggesting reduced effortful detection and reanalysis
consistent with diminished attribution of communicative intent. Notably, people
who perceived AI as more sincere showed larger P200 and P600 effects for
AI-generated irony, suggesting that intentional stance adoption is calibrated
by specific mental models of artificial agents. These findings reveal that
source attribution shapes neural processing of social-communicative phenomena.
Despite current LLMs' linguistic sophistication, achieving genuine social
agency requires more than linguistic competence: it necessitates a shift in how
humans perceive and attribute intentionality to artificial agents.
ParaVul A Parallel Large Language Model and Retrieval-Augmented Framework for Smart Contract Vulnerability Detection
Authors: Tenghui Huang, Jinbo Wen, Jiawen Kang, Siyong Chen, Zhengtao Li, Tao Zhang, Dongning Liu, Jiacheng Wang, Chengjun Cai, Yinqiu Liu, Dusit Niyato
2025-10-20
Smart contracts play a significant role in automating blockchain services.
Nevertheless, vulnerabilities in smart contracts pose serious threats to
blockchain security. Currently, traditional detection methods primarily rely on
static analysis and formal verification, which can result in high
false-positive rates and poor scalability. Large Language Models (LLMs) have
recently made significant progress in smart contract vulnerability detection.
However, they still face challenges such as high inference costs and
substantial computational overhead. In this paper, we propose ParaVul, a
parallel LLM and retrieval-augmented framework to improve the reliability and
accuracy of smart contract vulnerability detection. Specifically, we first
develop Sparse Low-Rank Adaptation (SLoRA) for LLM fine-tuning. SLoRA
introduces sparsification by incorporating a sparse matrix into quantized
LoRA-based LLMs, thereby reducing computational overhead and resource
requirements while enhancing their ability to understand vulnerability-related
issues. We then construct a vulnerability contract dataset and develop a hybrid
Retrieval-Augmented Generation (RAG) system that integrates dense retrieval
with Best Matching 25 (BM25), assisting in verifying the results generated by
the LLM. Furthermore, we propose a meta-learning model to fuse the outputs of
the RAG system and the LLM, thereby generating the final detection results.
After completing vulnerability detection, we design chain-of-thought prompts to
guide LLMs to generate comprehensive vulnerability detection reports.
Simulation results demonstrate the superiority of ParaVul, especially in terms
of F1 scores, achieving 0.9398 for single-label detection and 0.9330 for
multi-label detection.
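Hybrid dense-plus-BM25 retrieval is commonly implemented as a score fusion over the two rankers. The sketch below shows one standard recipe (min-max normalization with a mixing weight); it is an illustrative assumption, not ParaVul's exact fusion, which the abstract says uses a learned meta-model:

```python
def hybrid_scores(dense, bm25, alpha=0.5):
    """Toy hybrid retrieval: min-max normalize each ranker's score dict,
    then mix with weight `alpha` (dense) vs. 1-alpha (BM25). The 0.5
    default is an assumption, not the paper's setting.
    """
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0           # guard against constant scores
        return {k: (v - lo) / span for k, v in scores.items()}

    nd, nb = norm(dense), norm(bm25)
    docs = set(dense) | set(bm25)
    # Documents missing from one ranker get 0 from that ranker
    return {k: alpha * nd.get(k, 0.0) + (1 - alpha) * nb.get(k, 0.0)
            for k in docs}
```

Normalizing before mixing matters because raw BM25 scores and dense cosine similarities live on very different scales.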
Can Transformer Memory Be Corrupted? Investigating Cache-Side Vulnerabilities in Large Language Models
Authors: Elias Hossain, Swayamjit Saha, Somshubhra Roy, Ravi Prasad
2025-10-20
Even when prompts and parameters are secured, language models
remain vulnerable because their key-value (KV) cache during inference
constitutes an overlooked attack surface. This paper introduces Malicious Token
Injection (MTI), a modular framework that systematically perturbs cached key
vectors at selected layers and timesteps through controlled magnitude and
frequency, using additive Gaussian noise, zeroing, and orthogonal rotations. A
theoretical analysis quantifies how these perturbations propagate through
attention, linking logit deviations to the Frobenius norm of the corruption and
to softmax Lipschitz dynamics. Empirical results show that MTI significantly
alters next-token distributions and downstream task performance across GPT-2
and LLaMA-2/7B, and destabilizes retrieval-augmented and agentic
reasoning pipelines. These findings identify cache integrity as a critical yet
underexplored vulnerability in current LLM deployments, positioning cache
corruption as a reproducible and theoretically grounded threat model for future
robustness and security research.
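The additive-Gaussian variant of such a cache perturbation is easy to picture. The sketch below assumes a toy cache layout of per-layer `(keys, values)` arrays; the layout and names are illustrative, not any real framework's KV-cache API:

```python
import numpy as np

def inject_noise(kv_cache, layer, t, sigma=0.1, rng=None):
    """Toy Malicious Token Injection: add Gaussian noise of scale sigma
    to the cached key vector of one layer at one timestep.

    kv_cache: dict mapping layer index -> (keys, values), each a (T, d)
    array. Returns a new cache; the input cache is left unmodified.
    """
    rng = rng or np.random.default_rng(0)
    keys, values = kv_cache[layer]
    keys = keys.copy()                                   # don't mutate input
    keys[t] += sigma * rng.standard_normal(keys.shape[-1])
    return {**kv_cache, layer: (keys, values)}
```

Because attention logits are inner products against these cached keys, a perturbation of Frobenius norm proportional to `sigma` propagates directly into logit deviations, which is the quantity the paper's analysis bounds.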
Enrich and Detect Video Temporal Grounding with Multimodal LLMs
Authors: Shraman Pramanick, Effrosyni Mavroudi, Yale Song, Rama Chellappa, Lorenzo Torresani, Triantafyllos Afouras
2025-10-19
We introduce ED-VTG, a method for fine-grained video temporal grounding
utilizing multi-modal large language models. Our approach harnesses the
capabilities of multimodal LLMs to jointly process text and video in order to
effectively localize natural language queries in videos through a two-stage
process. Rather than being directly grounded, language queries are initially
transformed into enriched sentences that incorporate missing details and cues
to aid grounding. In the second stage, these enriched queries are grounded
using a lightweight decoder, which specializes in predicting accurate
boundaries conditioned on contextualized representations of the enriched
queries. To mitigate noise and reduce the impact of hallucinations, our model
is trained with a multiple-instance-learning objective that dynamically selects
the optimal version of the query for each training sample. We demonstrate
state-of-the-art results across various benchmarks in temporal video grounding
and paragraph grounding settings. Experiments reveal that our method
significantly outperforms all previously proposed LLM-based temporal grounding
approaches and is either superior or comparable to specialized models, while
maintaining a clear advantage against them in zero-shot evaluation scenarios.
UniGTE Unified Graph-Text Encoding for Zero-Shot Generalization across Graph Tasks and Domains
Authors: Duo Wang, Yuan Zuo, Guangyue Lu, Junjie Wu
2025-10-19
Generalizing to unseen graph tasks without task-specific supervision is
challenging: conventional graph neural networks are typically tied to a fixed
label space, while large language models (LLMs) struggle to capture graph
structure. We introduce UniGTE, an instruction-tuned encoder-decoder framework
that unifies structural and semantic reasoning. The encoder augments a
pretrained autoregressive LLM with learnable alignment tokens and a
structure-aware graph-text attention mechanism, enabling it to attend jointly
to a tokenized graph and a natural-language task prompt while remaining
permutation-invariant to node order. This yields compact, task-aware graph
representations. Conditioned solely on these representations, a frozen
decoder predicts and reconstructs: it outputs the task answer and
simultaneously paraphrases the input graph in natural language. The
reconstruction objective regularizes the encoder to preserve structural cues.
UniGTE is instruction-tuned on five datasets spanning node-level, edge-level,
and graph-level tasks across diverse domains, yet requires no fine-tuning at
inference. It achieves new state-of-the-art zero-shot results on node
classification, link prediction, graph classification, and graph regression
under cross-task and cross-domain settings, demonstrating that tight
integration of graph structure with LLM semantics enables robust, transferable
graph reasoning.
ArmFormer Lightweight Transformer Architecture for Real-Time Multi-Class Weapon Segmentation and Classification
Authors: Akhila Kambhatla, Taminul Islam, Khaled R Ahmed
2025-10-19
The escalating threat of weapon-related violence necessitates automated
detection systems capable of pixel-level precision for accurate threat
assessment in real-time security applications. Traditional weapon detection
approaches rely on object detection frameworks that provide only coarse
bounding box localizations, lacking the fine-grained segmentation required for
comprehensive threat analysis. Furthermore, existing semantic segmentation
models either sacrifice accuracy for computational efficiency or require
excessive computational resources incompatible with edge deployment scenarios.
This paper presents ArmFormer, a lightweight transformer-based semantic
segmentation framework that strategically integrates the Convolutional Block
Attention Module (CBAM) with a MixVisionTransformer architecture to achieve
superior accuracy while maintaining computational efficiency suitable for
resource-constrained edge devices. Our approach combines a CBAM-enhanced
encoder backbone with an attention-integrated hamburger decoder to enable
multi-class
weapon segmentation across five categories: handgun, rifle, knife, revolver,
and human. Comprehensive experiments demonstrate that ArmFormer achieves
state-of-the-art performance with 80.64% mIoU and 89.13% mFscore while
maintaining real-time inference at 82.26 FPS. With only 4.886G FLOPs and 3.66M
parameters, ArmFormer outperforms heavyweight models requiring up to 48x more
computation, establishing it as the optimal solution for deployment on portable
security cameras, surveillance drones, and embedded AI accelerators in
distributed security infrastructure.
Neuronal Group Communication for Efficient Neural representation
Authors: Zhengqi Pei, Qingming Huang, Shuhui Wang
2025-10-19
The ever-increasing scale of modern neural networks has brought unprecedented
performance alongside daunting challenges in efficiency and interpretability.
This paper addresses the core question of how to build large neural systems
that learn efficient, modular, and interpretable representations. We propose
Neuronal Group Communication (NGC), a theory-driven framework that reimagines a
neural network as a dynamical system of interacting neuronal groups rather than
a monolithic collection of neural weights. Instead of treating each weight as
an independent trainable parameter, NGC treats weights as transient
interactions between embedding-like neuronal states, with neural computation
unfolding through iterative communication among groups of neurons. This
low-rank, modular representation yields compact models: groups of neurons
exchange low-dimensional signals, enabling intra-group specialization and
inter-group information sharing while dramatically reducing redundant
parameters. By drawing on dynamical systems theory, we introduce a neuronal
stability metric (analogous to Lyapunov stability) that quantifies the
contraction of neuron activations toward stable patterns during sequence
processing. Using this metric, we reveal that emergent reasoning capabilities
correspond to an external driving force or ``potential'', which nudges the
neural dynamics away from trivial trajectories while preserving stability.
Empirically, we instantiate NGC in large language models (LLMs) and demonstrate
improved performance on complex reasoning benchmarks under moderate
compression. NGC consistently outperforms standard low-rank approximations and
cross-layer basis-sharing methods at comparable compression rates. We conclude
by discussing the broader implications of NGC, including how structured
neuronal group dynamics might relate to generalization in high-dimensional
learning systems.
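The core idea of weights as interactions between low-dimensional neuronal-group states can be sketched as a factored linear map; the shapes and names below are illustrative assumptions, not the paper's parameterization:

```python
import numpy as np

def ngc_linear(x, e_out, e_in):
    """Toy NGC-style layer: instead of a dense (d_out, d_in) weight, the
    weight is the interaction of low-dimensional neuronal states,
    W = e_out @ e_in.T, so groups exchange k-dimensional signals.

    x: (batch, d_in) activations; e_in: (d_in, k); e_out: (d_out, k).
    Parameter count drops from d_out*d_in to (d_out + d_in)*k.
    """
    # Compute (x @ e_in) first: a k-dimensional "message" per example,
    # then expand through the output-group states.
    return (x @ e_in) @ e_out.T
```

Ordering the multiplications this way never materializes the full weight matrix, which is exactly where the redundancy reduction comes from.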
Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads
Authors: Zhoutong Wu, Yuan Zhang, Yiming Dong, Chenheng Zhang, Cong Fang, Kun Yuan, Zhouchen Lin
2025-10-19
Transformer models have driven breakthroughs across various language tasks
through their strong capability to learn rich contextual representations.
Scaling them to improve representation, however, often demands substantial
memory and compute costs, such as the Key-Value (KV) cache used during
auto-regressive decoding. Skip connections offer a promising way to improve
representation without bloating resource usage, yet most prior works either
improve expressivity while leaving KV cache costs unchanged, or reduce memory
at the cost of weaker representation. In this work, we propose SkipV1Former, a
Transformer variant that uses skip connections from the first layer's Value
heads to strengthen model representation and reduce the KV cache. Specifically,
from the second block onward, each layer reuses half of its Value heads from
the very first layer, while computing the other half as usual, cutting Value
projections and the V cache by nearly 50\%. Theoretically, we show that routing
uncompressed first-layer Values into deeper layers restores information lost to
compression and accelerates the model's implicit mesa-optimization, a key
pattern of Transformers in auto-regressive tasks. Empirically, across different
model scales, SkipV1Former delivers consistent reductions of approximately 25\%
in KV cache while improving perplexity relative to standard Multi-Head
Attention (MHA) Transformers and some advanced variants. Moreover, we propose a
recipe for uptraining existing MHA Transformer checkpoints to SkipV1Former with
only 10-15\% additional compute. Finally, SkipV1Former can be seamlessly
combined with advanced methods such as Group-Query Attention and Multi-Latent
Attention to achieve further KV cache savings and performance improvements.
When combined with YOCO, it cuts the KV cache size by nearly 50\% while still
improving performance.
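The half-head Value reuse can be sketched as follows; shapes and names are illustrative assumptions. Only the freshly projected half needs a per-layer projection and cache entry, which is where the roughly 50\% V-cache saving comes from:

```python
import numpy as np

def layer_values(x, w_v_half, first_layer_values, n_heads):
    """Toy SkipV1Former value computation for one deep layer: half the
    heads reuse the first layer's Value outputs via a skip connection;
    the other half are projected from x as usual.

    x: (T, d) hidden states; w_v_half: (d, d//2) projection for the
    fresh half; first_layer_values: (T, d) Value outputs of layer 1.
    """
    d = x.shape[-1]
    half = n_heads // 2
    head_dim = d // n_heads
    fresh = x @ w_v_half                               # only half projected
    reused = first_layer_values[:, :half * head_dim]   # skip from layer 1
    return np.concatenate([reused, fresh], axis=-1)
```

Since `reused` is identical across all deep layers, it is stored once (in the first layer's cache) rather than per layer, while each layer caches only its `fresh` half.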
Mixed-Precision Quantization for Language Models Techniques and Prospects
Authors: Mariam Rakka, Marios Fournarakis, Olga Krestinskaya, Jinane Bazzi, Khaled N. Salama, Fadi Kurdahi, Ahmed M. Eltawil, Mohammed E. Fouda
2025-10-19
The rapid scaling of language models (LMs) has resulted in unprecedented
computational, memory, and energy requirements, making their training and
deployment increasingly unsustainable. Quantization has emerged as an essential
technique to reduce model size, alleviate memory bottlenecks, and
accelerate inference. However, while uniform quantization (e.g., INT8,
INT4) provides significant efficiency gains, it can degrade accuracy in
sensitive components of transformer-based LMs. Mixed-precision quantization
offers a promising alternative by selectively allocating precision across
layers or within tensors to balance efficiency and accuracy. This survey
provides a comprehensive overview of Mixed-Precision Quantization (MPQ)
frameworks for pre-trained LMs (MPQ-PLMs). We first review quantization
fundamentals, including uniform and non-uniform quantizers, quantization
granularity, and methods widely used in post-training quantization. We then
categorize and compare recent MPQ-PLM frameworks according to their bit
allocation strategies and precision configurations across weights,
activations, and key-value (KV) caches. A comparative analysis highlights
differences in perplexity, zero-shot task performance, and deployment
trade-offs. Furthermore, we contrast MPQ-PLMs with earlier mixed-precision
quantization methods for deep neural networks, identifying strategies that
transfer and those that face challenges in the LM setting. Finally, we
summarize open issues and future directions, including hardware-aware design,
activation quantization, and scalable optimization methods for
billion-parameter models. By consolidating recent advances, this work serves
as a reference for understanding the current landscape and research prospects
of mixed-precision quantization for large-scale language models.
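As a concrete instance of the uniform quantizer the survey reviews, here is a symmetric per-tensor version; real mixed-precision frameworks vary the bit-width and granularity per layer, channel, or group rather than using one setting for the whole model:

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantizer: map float weights to signed `bits`-bit
    integers with a single per-tensor scale (per-channel or per-group
    scaling is the usual refinement in practice).
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.abs(w).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8 if bits <= 8 else np.int32), scale

def dequantize(q, scale):
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale
```

The round-trip error is bounded by half a quantization step (`scale / 2`), which shrinks as bit-width grows; mixed-precision methods exploit this by spending more bits only where that error hurts accuracy.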
3D-GSRD 3D Molecular Graph Auto-Encoder with Selective Re-mask Decoding
Authors: Chang Wu, Zhiyuan Liu, Wen Shu, Liang Wang, Yanchen Luo, Wenqiang Lei, Yatao Bian, Junfeng Fang, Xiang Wang
2025-10-19
Masked graph modeling (MGM) is a promising approach for molecular
representation learning (MRL). However, extending the success of re-mask
decoding from 2D to 3D MGM is non-trivial, primarily due to two conflicting
challenges: avoiding 2D structure leakage to the decoder, while still providing
sufficient 2D context for reconstructing re-masked atoms. To address these
challenges, we propose 3D-GSRD: a 3D Molecular Graph Auto-Encoder with
Selective Re-mask Decoding. The core innovation of 3D-GSRD lies in its
Selective Re-mask Decoding (SRD), which re-masks only 3D-relevant information
from encoder representations while preserving the 2D graph structures. SRD
is synergistically integrated with a 3D Relational-Transformer (3D-ReTrans)
encoder alongside a structure-independent decoder. We show that SRD,
combined with the structure-independent decoder, enhances the encoder's role in
MRL. Extensive experiments show that 3D-GSRD achieves strong downstream
performance, setting a new state-of-the-art on 7 out of 8 targets in the widely
used MD17 molecular property prediction benchmark. The code is released at
https://github.com/WuChang0124/3D-GSRD.
EMRRG Efficient Fine-Tuning Pre-trained X-ray Mamba Networks for Radiology Report Generation
Authors: Mingzheng Zhang, Jinfeng Gao, Dan Xu, Jiangrui Yu, Yuhan Qiao, Lan Chen, Jin Tang, Xiao Wang
2025-10-19
X-ray image-based medical report generation (MRG) is a pivotal area in
artificial intelligence that can significantly reduce diagnostic burdens for
clinicians and patient wait times. Existing MRG models predominantly rely on
Large Language Models (LLMs) to improve report generation, with limited
exploration of pre-trained vision foundation models or advanced fine-tuning
techniques. Mainstream frameworks either avoid fine-tuning or utilize
simplistic methods like LoRA, often neglecting the potential of enhancing
cross-attention mechanisms. Additionally, while Transformer-based models
dominate vision-language tasks, non-Transformer architectures, such as the
Mamba network, remain underexplored for medical report generation, presenting a
promising avenue for future research. In this paper, we propose EMRRG, a novel
X-ray report generation framework that fine-tunes pre-trained Mamba networks
using parameter-efficient methods. Specifically, X-ray images are divided into
patches, tokenized, and processed by an SSM-based vision backbone for feature
extraction, with Partial LoRA yielding optimal performance. An LLM with a
hybrid decoder generates the medical report, enabling end-to-end training and
achieving strong results on benchmark datasets. Extensive experiments on three
widely used benchmark datasets fully validated the effectiveness of our
proposed strategies for the X-ray MRG. The source code of this paper will be
released on https://github.com/Event-AHU/Medical_Image_Analysis.
L-MoE End-to-End Training of a Lightweight Mixture of Low-Rank Adaptation Experts
Authors: Shihao Ji, Zihui Song
2025-10-19
The Mixture of Experts (MoE) architecture enables the scaling of Large
Language Models (LLMs) to trillions of parameters by activating a sparse
subset of weights for each input, maintaining constant computational cost
during inference. Concurrently, Low-Rank Adaptation (LoRA) has emerged as a
dominant technique for parameter-efficient fine-tuning of LLMs on specialized
tasks. In
this work, we unify these two paradigms into a novel, end-to-end trainable
framework named L-MoE: a Lightweight Mixture of LoRA Experts. L-MoE redefines
MoE experts not as dense feed-forward networks, but as a collection of
task-specialized, low-rank adapters. A lightweight gating network, trained
jointly with the experts, learns to dynamically compose these LoRA adapters by
computing a weighted average of their parameters for each input token. This
composition is fully differentiable, allowing gradients from a standard
auto-regressive language modeling objective to flow back through the entire
architecture, simultaneously refining both the expert adapters and the routing
strategy. This approach creates a highly parameter-efficient MoE model that is
modular by design, allows for dynamic skill composition, and is trainable from
end-to-end. We present the formal mathematical framework for L-MoE, detailing
the differentiable routing mechanism and the joint optimization objective,
thereby providing a new path toward building more efficient, scalable, and
specialized language models.
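The differentiable composition can be sketched as a softmax gate that averages expert LoRA factors per token; the names, shapes, and single-layer setting below are illustrative assumptions:

```python
import numpy as np

def lmoe_delta(x_token, gate_w, loras):
    """Toy L-MoE composition for one linear layer: a softmax gate mixes
    expert LoRA parameters (A_i, B_i) by weighted-averaging them per
    token, giving one low-rank update B @ A for the frozen base weight.

    x_token: (d_in,); gate_w: (n_experts, d_in);
    loras: list of (A_i, B_i) with A_i: (r, d_in), B_i: (d_out, r).
    """
    logits = gate_w @ x_token
    g = np.exp(logits - logits.max())
    g = g / g.sum()                                   # softmax routing weights
    A = sum(w * a for w, (a, _) in zip(g, loras))     # averaged (r, d_in)
    B = sum(w * b for w, (_, b) in zip(g, loras))     # averaged (d_out, r)
    return B @ A                                      # low-rank delta-W
```

Because every operation here (softmax, weighted sums, matmul) is differentiable, a language-modeling loss can update both the experts and the gate jointly, which is the end-to-end property the abstract emphasizes.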
ELMM Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion
Authors: Wei Huang, Peining Li, Meiyu Liang, Xu Hou, Junping Du, Yingxia Shao, Guanhua Ye, Wu Liu, Kangkang Lu, Yang Yu
2025-10-19
Multimodal Knowledge Graphs (MKGs) extend traditional knowledge graphs by
incorporating visual and textual modalities, enabling richer and more
expressive entity representations. However, existing MKGs often suffer from
incompleteness, which hinders their effectiveness in downstream tasks.
Therefore, the multimodal knowledge graph completion (MKGC) task is receiving
increasing attention. While large language models (LLMs) have shown promise for
knowledge graph completion (KGC), their application to the multimodal setting
remains underexplored. Moreover, applying Multimodal Large Language Models
(MLLMs) to the task of MKGC introduces significant challenges: (1) the large
number of image tokens per entity leads to semantic noise and modality
conflicts, and (2) the high computational cost of processing large token
inputs. To address these issues, we propose Efficient Lightweight Multimodal
Large Language Models (ELMM) for MKGC. ELMM proposes a Multi-view Visual Token
Compressor (MVTC) based on multi-head attention mechanism, which adaptively
compresses image tokens from both textual and visual views, thereby effectively
reducing redundancy while retaining necessary information and avoiding modality
conflicts. Additionally, we design an attention pruning strategy to remove
redundant attention layers from MLLMs, thereby significantly reducing the
inference cost. We further introduce a linear projection to compensate for the
performance degradation caused by pruning. Extensive experiments on the benchmarks
FB15k-237-IMG and WN18-IMG demonstrate that ELMM achieves state-of-the-art
performance while substantially improving computational efficiency,
establishing a new paradigm for multimodal knowledge graph completion.
An Efficient Semantic Segmentation Decoder for In-Car or Distributed Applications
Authors: Danish Nazir, Gowtham Sai Inti, Timo Bartels, Jan Piewek, Thorsten Bagdonat, Tim Fingscheidt
2025-10-19
Modern automotive systems leverage deep neural networks (DNNs) for semantic
segmentation and operate in two key application areas: (1) In-car, where the
DNN solely operates in the vehicle without strict constraints on the data rate.
(2) Distributed, where one DNN part operates in the vehicle and the other part
typically on a large-scale cloud platform with a particular constraint on
transmission bitrate efficiency. Typically, both applications share an image
and source encoder, while each uses distinct (joint) source and task decoders.
Prior work utilized convolutional neural networks for joint source and task
decoding but did not investigate transformer-based alternatives such as
SegDeformer, which offer superior performance at the cost of higher
computational complexity. In this work, we propose joint feature and task
decoding for SegDeformer, thereby enabling lower computational complexity in
both in-car and distributed applications, despite SegDeformer's computational
demands. This improves scalability in the cloud while reducing in-car
computational complexity. For the in-car application, we increased the frames
per second (fps) by up to a factor of ( fps to fps) on
Cityscapes and by up to a factor of ( fps to fps) on
ADE20K, while being on par w.r.t.\ the mean intersection over union (mIoU) of
the transformer-based baseline that does not compress by a source codec. For the
distributed application, we achieve state-of-the-art (SOTA) over a wide range
of bitrates on the mIoU metric, while using only \% (\%) of cloud
DNN parameters used in previous SOTA, reported on ADE20K (Cityscapes).
Long-Context Attention Benchmark From Kernel Efficiency to Distributed Context Parallelism
Authors: Tao Bu, Qiangang Wang, Bowen Zeng, Hanwen Sun, Yunpeng Huang, Chun Cao, Jingwei Xu
2025-10-19
Transformer-based large language models (LLMs) have achieved remarkable
success, yet their standard attention mechanism incurs quadratic computation
and memory costs with respect to sequence length, posing a major bottleneck for
long-context training. Prior work tackles this challenge along two directions:
(1) kernel-level optimizations, which accelerate dense and sparse attention
operators; and (2) module-level strategies, often referred to as distributed
attention or context parallel training, which scale attention across multiple
devices. However, systematic evaluation still remains limited: operator-level
comparisons are often incomplete, while context parallel strategies are
typically framework-specific, with unclear performance analysis across
contexts. To address these gaps, we propose a unified benchmark that integrates
representative attention kernels and context parallel mechanisms with a modular
and extensible interface for evaluation. The benchmark evaluates methods along
two critical dimensions: (1) attention mask patterns, which strongly affect
efficiency, scalability, and usability, and (2) sequence length and distributed
scale, which determine performance under extreme long-context training. Through
comprehensive experiments on a cluster of up to 96 GPUs, our benchmark
enables reproducible comparisons, highlights method-specific trade-offs, and
provides practical guidance for designing and deploying attention mechanisms in
long-context LLM training.
U-Codec Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation
Authors: Xusheng Yang, Long Zhou, Wenfu Wang, Kai Hu, Shulin Feng, Chenxing Li, Meng Yu, Dong Yu, Yuexian Zou
2025-10-19
We propose \textbf{U-Codec}, an \textbf{U}ltra low frame-rate neural speech
\textbf{Codec} that achieves high-fidelity reconstruction and fast speech
generation at an extremely low frame-rate of 5Hz (5 frames per second). Since
extreme compression at 5Hz typically leads to severe loss of intelligibility and
spectral detail, we introduce a Transformer-based inter-frame long-term
dependency module and systematically explore residual vector quantization (RVQ)
depth and codebook size to identify optimal configurations. Moreover, we apply
U-Codec to a large language model (LLM)-based auto-regressive TTS model, which
leverages global and local hierarchical architecture to effectively capture
dependencies across multi-layer tokens. We extend LLM-based TTS from 3-layer
RVQ at 50Hz to 32-layer RVQ at 5Hz. Experimental results demonstrate that
U-Codec improves LLM-based TTS inference speed by around 3x over
high-frame-rate codecs while maintaining similarity and naturalness. These
results validate the feasibility of using highly compressed 5Hz discrete tokens
for fast and high-fidelity speech synthesis.
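The residual vector quantization the codec builds on can be sketched in a few lines (a toy illustration; the codebook size, vector dimension, and 4-stage depth below are arbitrary choices, not the paper's configuration):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage quantizes the residual
    left over by the previous stage."""
    residual = x.astype(float)
    indices = []
    for cb in codebooks:                        # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))             # nearest code at this stage
        indices.append(idx)
        residual = residual - cb[idx]           # pass residual downstream
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code from every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]   # toy 4-layer RVQ
x = rng.normal(size=8)
idx = rvq_encode(x, codebooks)
x_hat = rvq_decode(idx, codebooks)
```

Each stage encodes what the previous stages failed to capture, which is why deepening the RVQ stack (here to 32 layers at 5Hz) can trade codebook depth for frame rate.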
Count Counts Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards
Authors: Xuan Zhang, Ruixiao Li, Zhijian Zhou, Long Li, Yulei Qin, Ke Li, Xing Sun, Xiaoyu Tan, Chao Qu, Yuan Qi
2025-10-18
Reinforcement Learning (RL) has become a compelling way to strengthen the
multi-step reasoning ability of Large Language Models (LLMs). However,
prevalent RL paradigms still lean on outcome-based rewards and limited
exploration, which often drives LLMs toward repetitive and suboptimal reasoning
patterns. In this paper, we study the central question of how to design
exploration for LLM reasoning and introduce MERCI (Motivating Exploration in
LLM Reasoning with Count-based Intrinsic Rewards), a novel RL algorithm that
augments policy optimization with a principled intrinsic reward. Building on
the idea of count-based exploration, MERCI leverages a lightweight Coin
Flipping Network (CFN) to estimate pseudo-counts and epistemic
uncertainty over reasoning trajectories, and converts them into an intrinsic
reward that values novelty while preserving the learning signal from task
rewards. We integrate MERCI into advanced RL frameworks such as Group
Relative Policy Optimization (GRPO). Experiments on complex reasoning
benchmarks demonstrate that MERCI encourages richer and more varied chains of
thought, significantly improves performance over strong baselines, and helps
the policy escape local routines to discover better solutions. It indicates
that our targeted intrinsic motivation can make exploration reliable for
language model reasoning.
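The count-based bonus idea can be sketched as follows (a toy version: the paper's Coin Flipping Network estimates pseudo-counts over continuous trajectory representations, whereas this sketch uses an exact hash-based counter, and the `beta` scale is an arbitrary assumption):

```python
import math
from collections import defaultdict

class CountBasedBonus:
    """Toy pseudo-count bonus: novel trajectories earn a larger intrinsic
    reward, and repeat visits earn progressively less."""
    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)
        self.beta = beta

    def intrinsic_reward(self, trajectory_key):
        self.counts[trajectory_key] += 1
        n = self.counts[trajectory_key]
        return self.beta / math.sqrt(n)     # decays with visit count

def total_reward(task_reward, bonus, key):
    # the intrinsic term augments, but never replaces, the task signal
    return task_reward + bonus.intrinsic_reward(key)

bonus = CountBasedBonus(beta=0.1)
r1 = total_reward(1.0, bonus, "chain-of-thought-A")   # first visit: 1.0 + 0.1
r2 = total_reward(1.0, bonus, "chain-of-thought-A")   # repeat: smaller bonus
```

The decay schedule (here 1/sqrt(n)) is what pushes the policy away from repetitive reasoning routines without drowning out the outcome-based reward.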
VisionSelector End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
Authors: Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, Zheng-Jun Zha
2025-10-18
Multimodal Large Language Models (MLLMs) encounter significant computational
and memory bottlenecks from the massive number of visual tokens generated by
high-resolution images or multi-image inputs. Previous token compression
techniques are often constrained by heuristic rules that risk discarding
critical information. They may suffer from biases, such as attention sinks,
that lead to sharp performance drops under aggressive compression ratios. To
address these limitations, we recast token compression as a lightweight
plug-and-play framework that turns token selection into an end-to-end
learnable decision process. Specifically, we propose VisionSelector, a scorer
module decoupled from the MLLM backbone that incorporates a differentiable
Top-K mechanism and a curriculum annealing strategy to bridge the
training-inference gap, enabling efficient and adaptive token selection at
arbitrary compression rates. Remarkably lightweight with only 12.85M trainable
parameters, VisionSelector generalizes across various compression
rates and adaptively identifies critical tokens. This leads to
superior performance across all compression budgets, evidenced by preserving
100% accuracy on MME with a 30% retention budget, outperforming prior methods by
12.14% at a 10% retention budget, and doubling inference speed. Our code is
available at https://github.com/JulietChoo/VisionSelector .
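The differentiable Top-K idea can be illustrated with a simple relaxation (a sketch only; VisionSelector's actual gating mechanism and annealing schedule are not specified in the abstract, and the sigmoid gate below is an assumed stand-in):

```python
import numpy as np

def soft_topk_mask(scores, k, temperature):
    """Differentiable relaxation of Top-K selection: a sigmoid gate centred
    midway between the k-th and (k+1)-th largest scores (assumes
    k < len(scores)). A high temperature gives a soft, trainable mask;
    annealing the temperature toward zero recovers hard Top-K at inference."""
    srt = np.sort(scores)
    threshold = (srt[-k] + srt[-k - 1]) / 2.0
    return 1.0 / (1.0 + np.exp(-(scores - threshold) / temperature))

scores = np.array([2.0, -1.0, 0.5, 3.0, -0.5])   # per-token importance scores
soft = soft_topk_mask(scores, k=2, temperature=1.0)    # smooth gate, training
hard = soft_topk_mask(scores, k=2, temperature=0.01)   # near-binary, annealed
```

Annealing from the soft to the near-binary regime over training is one way to bridge the training-inference gap the abstract mentions: gradients flow through the smooth gate early on, while the final mask behaves like a hard token selection.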
SHIELD Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense
Authors: Yiyang Huang, Liang Shi, Yitian Zhang, Yi Xu, Yun Fu
2025-10-18
Large Vision-Language Models (LVLMs) excel in diverse cross-modal tasks.
However, object hallucination, where models produce plausible but inaccurate
object descriptions, remains a significant challenge. In contrast to previous
work focusing on LLM components, this paper is the first to trace LVLM
hallucinations to visual encoders and identifies three key issues: statistical
bias, inherent bias, and vulnerability. To address these challenges, we propose
SHIELD, a training-free framework that mitigates hallucinations through three
strategies: re-weighting visual tokens to reduce statistical bias, introducing
noise-derived tokens to counter inherent bias, and applying adversarial attacks
with contrastive decoding to address vulnerability. Experiments demonstrate
that SHIELD effectively mitigates object hallucinations across diverse
benchmarks and LVLM families. Moreover, SHIELD achieves strong performance on
the general LVLM benchmark, highlighting its broad applicability. Code will be
released.
Human-Aligned Code Readability Assessment with Large Language Models
Authors: Wendkûuni C. Ouédraogo, Yinghua Li, Xueqi Dang, Pawel Borsukiewicz, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé
2025-10-18
Code readability is crucial for software comprehension and maintenance, yet
difficult to assess at scale. Traditional static metrics often fail to capture
the subjective, context-sensitive nature of human judgments. Large Language
Models (LLMs) offer a scalable alternative, but their behavior as readability
evaluators remains underexplored. We introduce CoReEval, the first large-scale
benchmark for evaluating LLM-based code readability assessment, comprising over
1.4 million model-snippet-prompt evaluations across 10 state-of-the-art LLMs.
The benchmark spans 3 programming languages (Java, Python, CUDA), 2 code types
(functional code and unit tests), 4 prompting strategies (ZSL, FSL, CoT, ToT),
9 decoding settings, and developer-guided prompts tailored to junior and senior
personas. We compare LLM outputs against human annotations and a validated
static model, analyzing numerical alignment (MAE, Pearson's, Spearman's) and
justification quality (sentiment, aspect coverage, semantic clustering). Our
findings show that developer-guided prompting grounded in human-defined
readability dimensions improves alignment in structured contexts, enhances
explanation quality, and enables lightweight personalization through persona
framing. However, increased score variability highlights trade-offs between
alignment, stability, and interpretability. CoReEval provides a robust
foundation for prompt engineering, model alignment studies, and human in the
loop evaluation, with applications in education, onboarding, and CI/CD
pipelines where LLMs can serve as explainable, adaptable reviewers.
Ripple Effect Protocol Coordinating Agent Populations
Authors: Ayush Chopra, Aman Sharma, Feroz Ahmad, Luca Muscariello, Vijoy Pandey, Ramesh Raskar
2025-10-18
Modern AI agents can exchange messages using protocols such as A2A and ACP,
yet these mechanisms emphasize communication over coordination. As agent
populations grow, this limitation produces brittle collective behavior, where
individually smart agents converge on poor group outcomes. We introduce the
Ripple Effect Protocol (REP), a coordination protocol in which agents share not
only their decisions but also lightweight sensitivities - signals expressing
how their choices would change if key environmental variables shifted. These
sensitivities ripple through local networks, enabling groups to align faster
and more stably than with agent-centric communication alone. We formalize REP's
protocol specification, separating required message schemas from optional
aggregation rules, and evaluate it across scenarios with varying incentives and
network topologies. Benchmarks across three domains: (i) supply chain cascades
(Beer Game), (ii) preference aggregation in social networks (Movie Scheduling),
and (iii) sustainable resource allocation (Fishbanks) show that REP improves
coordination accuracy and efficiency over A2A by 41 to 100%, while flexibly
handling multimodal sensitivity signals from LLMs. By making coordination a
protocol-level capability, REP provides scalable infrastructure for the
emerging Internet of Agents.
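A minimal sketch of sensitivity sharing in this spirit (the message fields, aggregation rule, and all names below are illustrative assumptions, not the REP specification):

```python
from dataclasses import dataclass, field

@dataclass
class RippleMessage:
    """An agent shares its decision plus how that decision would shift if
    key environment variables changed (a local sensitivity, d(decision)/d(var))."""
    agent_id: str
    decision: float                                     # e.g. an order quantity
    sensitivities: dict = field(default_factory=dict)   # variable -> slope

def adjust_decision(own, neighbors, variable, observed_shift):
    """Shift one's own decision using neighbors' declared sensitivities to a
    single environment variable, rather than reacting to their decisions alone."""
    slopes = [m.sensitivities.get(variable, 0.0) for m in neighbors]
    avg_slope = sum(slopes) / max(len(slopes), 1)
    return own.decision + avg_slope * observed_shift

me = RippleMessage("a0", decision=10.0, sensitivities={"demand": 0.5})
peers = [RippleMessage("a1", 8.0, {"demand": 0.4}),
         RippleMessage("a2", 12.0, {"demand": 0.8})]
new_decision = adjust_decision(me, peers, "demand", observed_shift=5.0)
# the average peer slope (0.6) ripples the decision from 10.0 toward ~13.0
```

The point of the design is that the sensitivity field carries counterfactual information a bare decision cannot, letting a shift propagate through local networks in one hop per round.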
Language over Content Tracing Cultural Understanding in Multilingual Large Language Models
Authors: Seungho Cho, Changgeon Ko, Eui Jun Hwang, Junmyeong Lee, Huije Lee, Jong C. Park
2025-10-18
Large language models (LLMs) are increasingly used across diverse cultural
contexts, making accurate cultural understanding essential. Prior evaluations
have mostly focused on output-level performance, obscuring the factors that
drive differences in responses, while studies using circuit analysis have
covered few languages and rarely focused on culture. In this work, we trace
LLMs' internal cultural understanding mechanisms by measuring activation path
overlaps when answering semantically equivalent questions under two conditions:
varying the target country while fixing the question language, and varying the
question language while fixing the country. We also use same-language country
pairs to disentangle language from cultural aspects. Results show that internal
paths overlap more for same-language, cross-country questions than for
cross-language, same-country questions, indicating strong language-specific
patterns. Notably, the South Korea-North Korea pair exhibits low overlap and
high variability, showing that linguistic similarity does not guarantee aligned
internal representation.
Hybrid CNN-Transformer Based Sparse Channel Prediction for High-Mobility OTFS Systems
Authors: Zhaowei Guan, Wenkun Wen, Peiran Wu, Chen Wang, Minghua Xia
2025-10-18
High-mobility scenarios in next-generation wireless networks, such as those
involving vehicular communications, require ultra-reliable and low-latency
communications (URLLC). However, rapidly time-varying channels pose significant
challenges to traditional OFDM-based systems due to the Doppler effect and
channel aging. Orthogonal time frequency space (OTFS) modulation offers
resilience by representing channels in the quasi-static delay-Doppler (DD)
domain. This letter proposes a novel channel prediction framework for OTFS
systems using a hybrid convolutional neural network and Transformer
(CNN-Transformer) architecture. The CNN extracts compact features that exploit
the DD-domain sparsity of the channel matrices, while the Transformer models
temporal dependencies with causal masking for consistency. Simulation
experiments under extreme \si{km/h} mobility conditions demonstrate that
the proposed method outperforms state-of-the-art baselines, reducing the root
mean square error and mean absolute error by and ,
respectively. These results demonstrate the effectiveness of DD-domain
representations and the proposed model in accurately predicting channels in
high-mobility scenarios, thereby supporting the stringent URLLC requirements in
future wireless systems.
HGC-Avatar Hierarchical Gaussian Compression for Streamable Dynamic 3D Avatars
Authors: Haocheng Tang, Ruoke Yan, Xinhui Yin, Qi Zhang, Xinfeng Zhang, Siwei Ma, Wen Gao, Chuanmin Jia
2025-10-18
Recent advances in 3D Gaussian Splatting (3DGS) have enabled fast,
photorealistic rendering of dynamic 3D scenes, showing strong potential in
immersive applications. However, in digital human encoding and transmission,
the compression methods based on general 3DGS representations are limited by
the lack of human priors, resulting in suboptimal bitrate efficiency and
reconstruction quality at the decoder side, which hinders their application in
streamable 3D avatar systems. We propose HGC-Avatar, a novel Hierarchical
Gaussian Compression framework designed for efficient transmission and
high-quality rendering of dynamic avatars. Our method disentangles the Gaussian
representation into a structural layer, which maps poses to Gaussians via a
StyleUNet-based generator, and a motion layer, which leverages the SMPL-X model
to represent temporal pose variations compactly and semantically. This
hierarchical design supports layer-wise compression, progressive decoding, and
controllable rendering from diverse pose inputs such as video sequences or
text. Since people are most concerned with facial realism, we incorporate a
facial attention mechanism during StyleUNet training to preserve identity and
expression details under compression rate constraints. Experimental results
demonstrate that HGC-Avatar provides a streamable solution for rapid 3D avatar
rendering, while significantly outperforming prior methods in both visual
quality and compression efficiency.
FrugalPrompt Reducing Contextual Overhead in Large Language Models via Token Attribution
Authors: Syed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran, Mohammad Ali Moni
2025-10-18
Large language models (LLMs) owe much of their stellar performance to
expansive input contexts, yet such verbosity inflates monetary costs, carbon
footprint, and inference-time latency. Much of this overhead manifests from the
redundant low-utility tokens present in typical prompts, as only a fraction of
tokens typically carries the majority of the semantic weight. We address this
inefficiency by introducing FrugalPrompt, a novel prompt compression framework
for LLMs, which retains only the most semantically significant tokens.
Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX,
we assign salience scores to every token in an input sequence, rank them to
preserve the top-k% tokens in their original order, and obtain a
frugalized prompt. We evaluate the approach across four NLP tasks: Sentiment
Analysis, Commonsense QA, Summarization, and Mathematical Reasoning, using a
suite of frontier LLMs. For the first three tasks, a 20% prompt reduction
incurs only a marginal loss in task performance, demonstrating that
contemporary LLMs can reconstruct elided context from high-salience cues. In
contrast, performance on mathematical reasoning deteriorates sharply,
reflecting a stronger dependence on complete token continuity. Further analysis
with bottom-k% and random-k% tokens reveals asymmetric performance patterns
that may suggest potential task contamination effects, wherein models may
resort to shallow memorized patterns from pretraining exposure for conventional
NLP tasks. We posit that our work contributes to a more nuanced understanding
of LLM behavior in performance-efficiency trade-offs, and delineate the
boundary between tasks tolerant to contextual compression and those requiring
exhaustive context. Our source code and models are available at:
https://github.com/Starscream-11813/Frugal-ICL.
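The frugalization step reduces to a rank-and-filter over salience scores (a sketch; real salience scores would come from an attribution method such as GlobEnc or DecompX rather than being supplied by hand as here):

```python
def frugalize(tokens, salience, keep_ratio=0.8):
    """Keep the top keep_ratio fraction of tokens by salience score,
    preserving their original order in the prompt."""
    k = max(1, int(len(tokens) * keep_ratio))
    # rank token positions by salience, take the k most salient
    top = sorted(range(len(tokens)), key=lambda i: salience[i], reverse=True)[:k]
    keep = sorted(top)                       # restore original ordering
    return [tokens[i] for i in keep]

tokens = ["The", "quick", "brown", "fox", "jumps",
          "over", "the", "lazy", "dog", "."]
salience = [0.1, 0.7, 0.6, 0.9, 0.8, 0.3, 0.1, 0.5, 0.9, 0.2]
print(frugalize(tokens, salience, keep_ratio=0.8))
```

Restoring the original order after ranking is the important detail: the retained high-salience cues must stay in sequence for the model to reconstruct the elided context.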
Learning to Optimize Edge Robotics A Fast Integrated Perception-Motion-Communication Approach
Authors: Dan Guo, Xibin Jin, Shuai Wang, Zhigang Wen, Miaowen Wen, Chengzhong Xu
2025-10-18
Edge robotics involves frequent exchanges of large-volume multi-modal data.
Existing methods ignore the interdependency between robotic functionalities and
communication conditions, leading to excessive communication overhead. This
paper revolutionizes edge robotics systems through integrated perception,
motion, and communication (IPMC). As such, robots can dynamically adapt their
communication strategies (i.e., compression ratio, transmission frequency,
transmit power) by leveraging the knowledge of robotic perception and motion
dynamics, thus reducing the need for excessive sensor data uploads.
Furthermore, by leveraging the learning to optimize (LTO) paradigm, an
imitation learning neural network is designed and implemented, which reduces
the computational complexity by over 10x compared to state-of-the-art
optimization solvers. Experiments demonstrate the superiority of the proposed
IPMC and the real-time execution capability of LTO.
FourierCompress Layer-Aware Spectral Activation Compression for Efficient and Accurate Collaborative LLM Inference
Authors: Jian Ma, Xinchen Lyu, Jun Jiang, Longhao Zou, Chenshan Ren, Qimei Cui, Xiaofeng Tao
2025-10-18
Collaborative large language model (LLM) inference enables real-time,
privacy-preserving AI services on resource-constrained edge devices by
partitioning computational workloads between client devices and edge servers.
However, this paradigm is severely hindered by communication bottlenecks caused
by the transmission of high-dimensional intermediate activations, exacerbated
by the autoregressive decoding structure of LLMs, where bandwidth consumption
scales linearly with output length. Existing activation compression methods
struggle to simultaneously achieve high compression ratios, low reconstruction
error, and computational efficiency. This paper proposes FourierCompress, a
novel, layer-aware activation compression framework that exploits the
frequency-domain sparsity of LLM activations. We rigorously demonstrate that
activations from the first Transformer layer exhibit strong smoothness and
energy concentration in the low-frequency domain, making them highly amenable
to near-lossless compression via the Fast Fourier Transform (FFT).
FourierCompress transforms activations into the frequency domain, retains only
a compact block of low-frequency coefficients, and reconstructs the signal at
the server using conjugate symmetry, enabling seamless hardware acceleration on
DSPs and FPGAs. Extensive experiments on Llama 3 and Qwen2.5 models across 10
commonsense reasoning datasets demonstrate that FourierCompress preserves
performance remarkably close to the uncompressed baseline, outperforming Top-k,
QR, and SVD. FourierCompress bridges the gap between communication efficiency
(an average 7.6x reduction in activation size), near-lossless inference (less
than 0.3% average accuracy loss), and significantly faster compression
(achieving over 32x reduction in compression time compared to Top-k via
hardware acceleration) for edge-device LLM inference.
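The compress/reconstruct cycle can be sketched with NumPy's real FFT (an illustration of the general technique on a synthetic smooth signal, not the paper's implementation):

```python
import numpy as np

def fourier_compress(activation, keep):
    """Keep only the lowest `keep` rFFT coefficients of a smooth activation
    vector; everything above is discarded."""
    coeffs = np.fft.rfft(activation)
    return coeffs[:keep]                     # compact low-frequency block

def fourier_decompress(low_coeffs, length):
    """Zero-pad the truncated spectrum and invert; irfft enforces the
    conjugate symmetry of a real signal, so only half-spectrum is needed."""
    full = np.zeros(length // 2 + 1, dtype=complex)
    full[:len(low_coeffs)] = low_coeffs
    return np.fft.irfft(full, n=length)

# a smooth, low-frequency-dominated signal, standing in for the first-layer
# activations the paper analyzes
t = np.linspace(0, 1, 256, endpoint=False)
x = np.sin(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 5 * t)
x_hat = fourier_decompress(fourier_compress(x, keep=16), length=256)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

When energy concentrates in the low-frequency bins, a 16-of-129 coefficient budget reconstructs this signal near-losslessly; the same energy-concentration argument underlies the paper's claims for first-layer activations.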
Longwave-transparent low-emissivity material
Authors: Yue Zhang, Longnan Li, Junyan Dai, Xiaowen Zhang, Qunyan Zhou, Naiqin Yi, Ruizhe Jian, Fei Zhu, Xiaopeng Li, Mengke Sun, Jiazheng Wu, Xinfeng Li, Xiangtong Kong, Ziai Liu, Yinwei Li, Qiang Cheng, Yiming Zhu, Tie Jun Cui, Wei Li
2025-10-18
Low-emissivity (low-e) materials are crucial for conserving thermal energy in
buildings, cold chain logistics and transportation by minimizing unwanted
radiative heat loss or gain. However, their metallic nature intrinsically
causes severe longwave attenuation, hindering their broad applications. Here,
we introduce, for the first time, an all-dielectric longwave-transparent
low-emissivity material with ultra-broadband, high transmittance spanning
9 orders of magnitude, from terahertz to kilohertz frequencies. This
meter-scale material not only achieves energy savings of up to 41.1% over commercial
white paint and 10.2% over traditional low-e materials, but also unlocks
various fundamentally new capabilities including high-speed wireless
communication in energy-efficient buildings, wireless energy transfer with
radiative thermal insulation, as well as non-invasive terahertz security
screening and radio frequency identification in cold chain logistics. Our
approach represents a new photonic solution towards carbon neutrality and smart
city development, paving the way for a more sustainable and interconnected
future.
Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with Prior
Authors: Fuqun Han, Stanley Osher, Wuchen Li
2025-10-18
In this work, we propose a sparse transformer architecture that incorporates
prior information about the underlying data distribution directly into the
structure of the neural network. The design of the model is
motivated by a special optimal transport problem, namely the regularized
Wasserstein proximal operator, which admits a closed-form solution and turns
out to be a special representation of transformer architectures. Compared with
classical flow-based models, the proposed approach improves the convexity
properties of the optimization problem and promotes sparsity in the generated
samples. Through both theoretical analysis and numerical experiments, including
applications in generative modeling and Bayesian inverse problems, we
demonstrate that the proposed architecture achieves higher accuracy and faster
convergence to the target distribution than classical neural ODE-based methods.
Does GenAI Rewrite How We Write? An Empirical Study on Two-Million Preprints
Authors: Minfeng Qi, Zhongmin Cao, Qin Wang, Ningran Li, Tianqing Zhu
2025-10-18
Preprint repositories have become central infrastructures for scholarly
communication. Their expansion transforms how research is circulated and
evaluated before journal publication. Generative large language models (LLMs)
introduce a further potential disruption by altering how manuscripts are
written. While speculation abounds, systematic evidence of whether and how LLMs
reshape scientific publishing remains limited.
This paper addresses the gap through a large-scale analysis of more than 2.1
million preprints spanning 2016--2025 (115 months) across four major
repositories (i.e., arXiv, bioRxiv, medRxiv, SocArXiv). We introduce a
multi-level analytical framework that integrates interrupted time-series
models, collaboration and productivity metrics, linguistic profiling, and topic
modeling to assess changes in volume, authorship, style, and disciplinary
orientation. Our findings reveal that LLMs have accelerated submission and
revision cycles, modestly increased linguistic complexity, and
disproportionately expanded AI-related topics, while computationally intensive
fields benefit more than others. These results show that LLMs act less as
universal disruptors than as selective catalysts, amplifying existing strengths
and widening disciplinary divides. By documenting these dynamics, the paper
provides the first empirical foundation for evaluating the influence of
generative AI on academic publishing and highlights the need for governance
frameworks that preserve trust, fairness, and accountability in an AI-enabled
research ecosystem.
What Limits Agentic Systems Efficiency?
Authors: Song Bian, Minghao Yan, Anand Jayarajan, Gennady Pekhimenko, Shivaram Venkataraman
2025-10-18
Large Language Models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have
demonstrated strong reasoning capabilities. To further enhance these
capabilities, recent agentic systems, such as Deep Research, incorporate web
interactions into LLM reasoning to mitigate uncertainties and reduce potential
errors. However, existing research predominantly focuses on reasoning
performance, often neglecting the efficiency of agentic systems. In this work,
we present a comprehensive empirical study that identifies efficiency
bottlenecks in web-interactive agentic systems. We decompose end-to-end latency
into two primary components: LLM API latency and web environment latency. We
conduct a comprehensive empirical study across 15 models and 5 providers to
demonstrate high variability in API-based agentic systems. We observe that web
environment latency can contribute as much as 53.7% to the overall latency in a
web-based agentic system. To improve latency, we propose SpecCache, a caching
framework augmented with speculative execution that can reduce web environment
overhead. Extensive evaluations on two standard benchmarks show that our
approach improves the cache hit rate by up to 58x compared to a random caching
strategy, while reducing web environment overhead by up to 3.2x, without
degrading agentic system performance.
One-Bit Quantization for Random Features Models
Authors: Danil Akhtiamov, Reza Ghane, Babak Hassibi
2025-10-17
Recent advances in neural networks have led to significant computational and
memory demands, spurring interest in one-bit weight quantization to enable
efficient inference on resource-constrained devices. However, the theoretical
underpinnings of such quantization remain poorly understood. We address this gap
by analyzing one-bit quantization in the Random Features model, a simplified
framework that corresponds to neural networks with random representations. We
prove that, asymptotically, quantizing weights of all layers except the last
incurs no loss in generalization error, compared to the full precision random
features model. Our findings offer theoretical insights into neural network
quantization. We also demonstrate empirically that one-bit quantization leads to
significant inference speed ups for the Random Features models even on a laptop
GPU, confirming the practical benefits of our work. Additionally, we provide an
asymptotically precise characterization of the generalization error for Random
Features with an arbitrary number of layers. To the best of our knowledge, our
analysis yields more general results than all previous works in the related
literature.
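A minimal random-features experiment in this spirit (a sketch; the sign-and-rescale rule below is a common heuristic assumption, not the paper's construction, and the dimensions are arbitrary):

```python
import numpy as np

def random_features_predict(X, W, a, quantize=False):
    """Random features model: ReLU(X W) a, with a fixed random hidden layer W
    and a readout vector a. With quantize=True, each hidden weight is replaced
    by its sign (one bit), rescaled so every column keeps its original norm."""
    if quantize:
        col_norms = np.linalg.norm(W, axis=0)            # per-feature scale
        W = np.sign(W) * (col_norms / np.sqrt(W.shape[0]))
    return np.maximum(X @ W, 0.0) @ a

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 32))               # inputs
W = rng.normal(size=(32, 512))               # random hidden weights (quantized)
a = rng.normal(size=512) / np.sqrt(512)      # readout layer kept full precision
y_full = random_features_predict(X, W, a)
y_1bit = random_features_predict(X, W, a, quantize=True)
```

Quantizing only the random layer while leaving the last (trained) layer in full precision mirrors the regime the paper analyzes: the two predictions stay strongly correlated even though each hidden weight carries a single bit.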
SentinelNet Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection
Authors: Yang Feng, Xudong Pan
2025-10-17
Malicious agents pose significant threats to the reliability and
decision-making capabilities of Multi-Agent Systems (MAS) powered by Large
Language Models (LLMs). Existing defenses often fall short due to reactive
designs or centralized architectures which may introduce single points of
failure. To address these challenges, we propose SentinelNet, the first
decentralized framework for proactively detecting and mitigating malicious
behaviors in multi-agent collaboration. SentinelNet equips each agent with a
credit-based detector trained via contrastive learning on augmented adversarial
debate trajectories, enabling autonomous evaluation of message credibility and
dynamic neighbor ranking via bottom-k elimination to suppress malicious
communications. To overcome the scarcity of attack data, it generates
adversarial trajectories simulating diverse threats, ensuring robust training.
Experiments on MAS benchmarks show SentinelNet achieves near-perfect detection
of malicious agents, close to 100% within two debate rounds, and recovers 95%
of system accuracy from compromised baselines. By exhibiting strong
generalizability across domains and attack patterns, SentinelNet establishes a
novel paradigm for safeguarding collaborative MAS.
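Credit-based bottom-k elimination can be sketched as follows (function names and the credit-update rule are illustrative assumptions; the paper's detector is a trained contrastive model, not the fixed blend used here):

```python
def update_credit(credits, agent, message_score, lr=0.5):
    """Blend the detector's per-message credibility score into the agent's
    running credit (exponential moving average as a stand-in)."""
    credits[agent] = (1 - lr) * credits.get(agent, 0.5) + lr * message_score
    return credits

def rank_and_eliminate(credits, k):
    """Dynamic neighbor ranking with bottom-k elimination: drop the k agents
    with the lowest credit this round (assumes 1 <= k < len(credits))."""
    ranked = sorted(credits, key=credits.get, reverse=True)
    return ranked[:-k], ranked[-k:]          # (survivors, eliminated)

credits = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.75}
credits = update_credit(credits, "c", message_score=0.1)  # low-credibility msg
survivors, dropped = rank_and_eliminate(credits, k=1)
# agent "c" is suppressed for this debate round; a, b, d keep participating
```

Because every agent runs this locally over its own neighbors, there is no central arbiter to compromise, which is the decentralization argument the abstract makes.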