2025-09-15

Inpainting-Guided Policy Optimization for Diffusion Large Language Models
Dropping Experts, Recombining Neurons Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs
MCBP A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness
Characterizing the Efficiency of Distributed Training A Power, Performance, and Thermal Perspective
Efficient Learned Image Compression Through Knowledge Distillation
Compute Only 16 Tokens in One Timestep Accelerating Diffusion Transformers with Cluster-Driven Feature Caching
OpenCSP A Deep Learning Framework for Crystal Structure Prediction from Ambient to High Pressure
SignClip Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion
A Symmetry-Integrated Approach to Surface Code Decoding
FedBiF Communication-Efficient Federated Learning via Bits Freezing
Perfect quantum state transfer via state restoring and ancilla measurement
Semantic Rate-Distortion Theory with Applications
Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge
LLMs as Agentic Cooperative Players in Multiplayer UNO
Latency and Token-Aware Test-Time Compute
Towards an AI-based knowledge assistant for goat farmers based on Retrieval-Augmented Generation
CoDiCodec Unifying Continuous and Discrete Compressed Representations of Audio
ButterflyQuant Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
LAVa Layer-wise KV Cache Eviction with Dynamic Budget Allocation
Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates
TrEnv Transparently Share Serverless Execution Environments Across Different Functions and Nodes
Combating the Memory Walls Optimization Pathways for Long-Context Agentic LLM Inference
ENSI Efficient Non-Interactive Secure Inference for Large Language Models
HD-MoE Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing
DiTReducio A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration
Efficient Transformer-Based Piano Transcription With Sparse Attention Mechanisms
From scratch to silver Creating trustworthy training data for patent-SDG classification using Large Language Models
Harnessing Uncertainty Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
Medverse A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement
CCF A Context Compression Framework for Efficient Long-Sequence Language Modeling
GmSLM Generative Marmoset Spoken Language Modeling
AI Reasoning for Wireless Communications and Networking A Survey and Perspectives
Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication
DP-FedLoRA Privacy-Enhanced Federated Fine-Tuning for On-Device Large Language Models
Towards Confidential and Efficient LLM Inference with Dual Privacy Protection
SQAP-VLA A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models
Instructional Prompt Optimization for Few-Shot LLM-Based Recommendations on Cold-Start Users
VoxelFormer Parameter-Efficient Multi-Subject Visual Decoding from fMRI
CSI Compression Beyond Latents End-to-End Hybrid Attention-CNN Networks with Entropy Regularization
CrowdQuery Density-Guided Query Module for Enhanced 2D and 3D Detection in Crowded Scenes
ChemBOMAS Accelerated BO in Chemistry with LLM-Enhanced Multi-Agent System
Compressing CNN models for resource-constrained systems by channel and layer pruning
Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching
Time-Dependent Modeling of the Sub-Hour Spectral Evolution During the 2013 Outburst of Mrk 421
Deep Unrolling of Sparsity-Induced RDO for 3D Point Cloud Attribute Coding
BitROM Weight Reload-Free CiROM Architecture Towards Billion-Parameter 1.58-bit LLM Inference
Two Sides of the Same Optimization Coin Model Degradation and Representation Collapse in Graph Foundation Models
Efficient Decoding Methods for Language Models on Encrypted Data
Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video
Persistent-DPO A novel loss function and hybrid learning for generative quantum eigensolver
Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism
Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing
EvolKV Evolutionary KV Cache Compression for LLM Inference
Towards Knowledge-Aware Document Systems Modeling Semantic Coverage Relations via Answerability Detection
RTR A Transformer-Based Lossless Crossover with Perfect Phase Alignment
Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning
Strategies for Improving Communication Efficiency in Distributed and Federated Learning Compression, Local Training, and Personalization
Sketched Gaussian Mechanism for Private Federated Learning
XML Prompting as Grammar-Constrained Interaction Fixed-Point Semantics, Convergence Guarantees, and Human-AI Protocols
OCTANE -- Optimal Control for Tensor-based Autoencoder Network Emergence Explicit Case
SCA-LLM Spectral-Attentive Channel Prediction with Large Language Models in MIMO-OFDM
Tensor-Train Operator Inference
Feature Space Analysis by Guided Diffusion Model
Biased Tales Cultural and Topic Bias in Generating Children's Stories
A Robot That Listens Enhancing Self-Disclosure and Engagement Through Sentiment-based Backchannels and Active Listening
Are Humans as Brittle as Large Language Models?
Query Expansion in the Age of Pre-trained and Large Language Models A Comprehensive Survey
SEEC Segmentation-Assisted Multi-Entropy Models for Learned Lossless Image Compression
Unleashing the True Potential of LLMs A Feedback-Triggered Self-Correction with Long-Term Multipath Decoding
Collaborative Exploration with a Marsupial Ground-Aerial Robot Team through Task-Driven Map Compression
Topology-Aware Optimization of Gaussian Primitives for Human-Centric Volumetric Videos
MaLei at MultiClinSUM Summarisation of Clinical Documents using Perspective-Aware Iterative Self-Prompting with LLMs
PanoLAM Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image
PatchSeeker Mapping NVD Records to their Vulnerability-fixing Commits with LLM Generated Commits and Embeddings
Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data
Multi-view-guided Passage Reranking with Large Language Models
DuoServe-MoE Dual-Phase Expert Prefetch and Cache Scheduling for Efficient MoE LLM Inference
PersonaFuse A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions
Explaining How Quantization Disparately Skews a Model
Neurocognitive Modeling for Text Generation Deep Learning Architecture for EEG Data
DischargeSim A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge
Faster VGGT with Block-Sparse Global Attention
H$_{2}$OT Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers
Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data
From Noise to Narrative Tracing the Origins of Hallucinations in Transformers
Barlow-Swin Toward a novel siamese-based segmentation architecture using Swin-Transformers
COMPACT Common-token Optimized Model Pruning Across Channels and Tokens
Guided Decoding and Its Critical Role in Retrieval-Augmented Generation
HAVE Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models
Reasoning-enhanced Query Understanding through Decomposition and Interpretation
SLiNT Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion
Synesthesia of Machines (SoM)-Aided LiDAR Point Cloud Transmission for Collaborative Perception
Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
HyFedRAG A Federated Retrieval-Augmented Generation Framework for Heterogeneous and Privacy-Sensitive Data
Tree of Agents Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning
NeuroDeX Unlocking Diverse Support in Decompiling Deep Neural Network Executables
Mask-GCG Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?
A Geometric Multigrid-Accelerated Compact Gas-Kinetic Scheme for Fast Convergence in High-Speed Flows on GPUs
Ban&Pick Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs
Towards scalable organ level 3D plant segmentation Bridging the data algorithm computing gap
Text4Seg++ Advancing Image Segmentation via Generative Language Modeling
LoaQ Layer-wise Output Approximation Quantization
RecMind LLM-Enhanced Graph Neural Networks for Personalized Consumer Recommendations
FineServe Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
Understanding the Influence of Synthetic Data for Text Embedders
Home-made Diffusion Model from Scratch to Hatch
1 bit is all we need binary normalized neural networks
A Unified Framework for Cultural Heritage Data Historicity and Migration The ARGUS Approach
Micro-Expression Recognition via Fine-Grained Dynamic Perception
MEGS$^{2}$ Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning
Application Space and the Rate-Distortion-Complexity Analysis of Neural Video CODECs
Physics-Guided Diffusion Transformer with Spherical Harmonic Posterior Sampling for High-Fidelity Angular Super-Resolution in Diffusion MRI
Beyond "I'm Sorry, I Can't" Dissecting Large Language Model Refusal
Chatbot To Help Patients Understand Their Health
time2time Causal Intervention in Hidden States to Simulate Rare Events in Time Series Foundation Models
LM-Searcher Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding
Cross-Service Threat Intelligence in LLM Services using Privacy-Preserving Fingerprints
Icon$^{2}$ Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation
ProfilingAgent Profiling-Guided Agentic Reasoning for Adaptive Model Optimization
Sensitivity-Aware Post-Training Quantization for Deep Neural Networks
TreeGPT Pure TreeFFN Encoder-Decoder Architecture for Structured Reasoning Without Attention Mechanisms
veScale Consistent and Efficient Tensor Programming with Eager-Mode SPMD
Dynamic Sensitivity Filter Pruning using Multi-Agent Reinforcement Learning For DCNN's
Crosscoding Through Time Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining
Recomposer Event-roll-guided generative audio editing
Exploring Autoregressive Vision Foundation Models for Image Compression
KVCompose Efficient Structured KV Cache Compression with Composite Tokens
FLOWER Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies
Ground-Aware Octree-A* Hybrid Path Planning for Memory-Efficient 3D Navigation of Ground Vehicles
PLaMo 2 Technical Report
OSC Cognitive Orchestration through Dynamic Knowledge Alignment in Multi-Agent LLM Collaboration
Broadband Simultaneous Beam Steering and Compressing Device Based on Subwavelength Protrusion Metallic Tunnels
VoltanaLLM Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
AI-Driven Fronthaul Link Compression in Wireless Communication Systems Review and Method Design
Personality as a Probe for LLM Evaluation Method Trade-offs and Downstream Effects
Decoders Laugh as Loud as Encoders
A Study of Large Language Models for Patient Information Extraction Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning
ODKE+ Ontology-Guided Open-Domain Knowledge Extraction with LLMs
First demonstration of coherent radiation imaging for bunch-by-bunch longitudinal compression monitoring
DarkStream real-time speech anonymization with low latency
AraHalluEval A Fine-grained Hallucination Evaluation Framework for Arabic LLMs
Scaling Environments for Organoid Intelligence with LLM-Automated Design and Plasticity-Based Evaluation
Schema Inference for Tabular Data Repositories Using Large Language Models
Communication-Efficient Collaborative LLM Inference via Distributed Speculative Decoding
PagedEviction Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference
Psychologically Enhanced AI Agents
Cross-Layer Attention Probing for Fine-Grained Hallucination Detection
Integrating Pruning with Quantization for Efficient Deep Neural Networks Compression
Real Time FPGA Based Transformers & VLMs for Vision Tasks SOTA Designs and Optimizations
MultiWikiQA A Reading Comprehension Benchmark in 300+ Languages
Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue
Meta-Policy Reflexion Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent
LMVC An End-to-End Learned Multiview Video Coding Framework
MTQA Matrix of Thought for Enhanced Reasoning in Complex Question Answering

Inpainting-Guided Policy Optimization for Diffusion Large Language Models

Authors: Siyan Zhao, Mengchen Liu, Jing Huang, Miao Liu, Chenyu Wang, Bo Liu, Yuandong Tian, Guan Pang, Sean Bell, Aditya Grover, Feiyu Chen

2025-09-12

http://arxiv.org/abs/2509.10396v1

Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity: their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across three mathematical benchmarks (GSM8K, Math500, and AMC), achieving new state-of-the-art results for full-attention masked dLLMs.
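
The zero-advantage problem mentioned for GRPO can be illustrated with a small sketch. It assumes the common group-relative normalization, (reward - group mean) / (group std + eps), and binary rewards; the function name and the example reward values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every sampled completion fails, so all rewards are equal and all advantages are 0.
failed_group = [0.0, 0.0, 0.0, 0.0]
failed_adv = group_advantages(failed_group)

# Hypothetical: one rollout is regenerated with part of a ground-truth trace
# inpainted as a hint and now earns reward 1, so the group has a usable signal.
guided_group = [0.0, 0.0, 0.0, 1.0]
guided_adv = group_advantages(guided_group)

print(failed_adv)
print(guided_adv)
```

With identical rewards the group carries no learning signal; a single successful, hint-guided rollout is enough to give the successful sample a positive advantage and the failed ones a negative advantage.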

Dropping Experts, Recombining Neurons Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs

Authors: Yixiao Zhou, Ziyu Zhao, Dongzhou Cheng, Zhiliang Wu, Jie Gui, Yi Yang, Fei Wu, Yu Cheng, Hehe Fan

2025-09-12

http://arxiv.org/abs/2509.10377v1

Sparse Mixture-of-Experts (SMoE) architectures are widely used in large language models (LLMs) due to their computational efficiency. However, although only a few experts are activated for each token, SMoE still requires loading all expert parameters, leading to high memory usage and deployment challenges. Previous work has tried to reduce the overhead by pruning and merging experts, but has primarily focused on expert-level operations, leaving neuron-level structure underexplored. We propose DERN (Dropping Experts, Recombining Neurons), a task-agnostic and retraining-free framework for expert pruning and reconstruction. We observe that experts are often misaligned and contain semantic conflicts at the neuron level, which poses challenges for direct merging. To solve this, DERN works in three steps: it first prunes redundant experts using router statistics; then it decomposes them into neuron-level expert segments, assigning each segment to its most compatible retained expert; and finally, it merges segments within each retained expert to build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE models show that DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity, without extra training. It also greatly reduces the number of experts and memory usage, making SMoE LLMs easier to deploy in practice.
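
The first of the three steps, dropping experts based on router statistics, might look roughly like the sketch below. The abstract does not say which statistic is used; routing frequency is an assumption made here for illustration, and `select_experts` and `routing_log` are hypothetical names.

```python
from collections import Counter

def select_experts(routing_log, num_experts, keep):
    """Keep the `keep` most frequently routed experts; ties go to the lower index.

    routing_log: expert indices chosen by the router over a calibration set.
    Frequency as the ranking statistic is an assumption, not the paper's rule.
    """
    counts = Counter(routing_log)
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:keep]), sorted(ranked[keep:])

# Hypothetical routing decisions for a layer with 4 experts.
routing_log = [0, 2, 2, 1, 2, 0, 3, 2, 0, 2]
kept, dropped = select_experts(routing_log, num_experts=4, keep=2)
print(kept, dropped)
```

The dropped experts would then be the ones decomposed into neuron-level segments and reassigned to the retained experts, which this sketch does not attempt.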

MCBP A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness

Authors: Huizheng Wang, Zichuan Wang, Zhiheng Yue, Yousheng Long, Taiquan Wei, Jianxun Yang, Yang Wang, Chao Li, Shaojun Wei, Yang Hu, Shouyi Yin

2025-09-12

http://arxiv.org/abs/2509.10372v1

Large language models (LLMs) face significant inference latency due to inefficiencies in GEMM operations, weight access, and KV cache access, especially in real-time scenarios. This highlights the need for a versatile compute-memory efficient accelerator. Unfortunately, existing Transformer accelerators struggle to address both aspects simultaneously, as they focus on value-level processing, missing fine-grained opportunities to optimize computation and memory collaboratively. This paper introduces MCBP, a bit-grained compute-memory efficient algorithm-hardware co-design that leverages bit-slice (BS) enabled repetitiveness and sparsity to accelerate LLM inference. MCBP features three key innovations: 1) BS-repetitiveness-enabled computation reduction (BRCR), which eliminates redundant GEMM computations by leveraging redundancy hidden among BS vectors; 2) BS-sparsity-enabled two-state coding (BSTC), which reduces weight access by exploiting the significant sparsity in high-order bit-slice weights; 3) Bit-grained progressive prediction (BGPP), which reduces KV cache access by leveraging early-termination-based bit-grained prediction. These techniques, supported by custom accelerator designs, effectively alleviate the burden in GEMM, weight access, and KV cache access. Extensive experiments on 26 benchmarks show that MCBP achieves a 9.43x speedup and 31.1x higher energy efficiency than an Nvidia A100 GPU. Compared to SOTA Transformer accelerators, MCBP achieves 35x, 5.2x and 3.2x energy savings over Spatten, FACT and SOFA, respectively.
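
To make the bit-slice view concrete, the sketch below splits small unsigned integer weights into bit planes and measures how many entries of each plane are zero. It is only an illustration of why high-order slices tend to be mostly zero for small-magnitude values; it is not MCBP's coding scheme, and the weight values and helper names are made up.

```python
def bit_slices(values, bits=8):
    """Return bit planes, least significant first: slices[b][i] is bit b of values[i]."""
    return [[(v >> b) & 1 for v in values] for b in range(bits)]

def reconstruct(slices):
    """Rebuild the integers by summing each bit plane shifted to its position."""
    n = len(slices[0])
    return [sum(slices[b][i] << b for b in range(len(slices))) for i in range(n)]

def zero_fraction(plane):
    """Fraction of zero entries in one bit plane."""
    return plane.count(0) / len(plane)

# Illustrative unsigned 8-bit magnitudes: mostly small, with one large outlier.
weights = [3, 5, 1, 12, 2, 7, 0, 130]
slices = bit_slices(weights)
sparsity_per_slice = [zero_fraction(p) for p in slices]

print(sparsity_per_slice)
```

In this toy example the lowest slice is half zeros, while slices 4 to 6 are entirely zero and slice 7 has a single non-zero entry, which is the kind of structure a slice-aware encoding could exploit.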

Characterizing the Efficiency of Distributed Training A Power, Performance, and Thermal Perspective

Authors: Seokjin Go, Joongun Park, Spandan More, Hanjiang Wu, Irene Wang, Aaron Jezghani, Tushar Krishna, Divya Mahajan

2025-09-12

http://arxiv.org/abs/2509.10371v1

The rapid scaling of Large Language Models (LLMs) has pushed training workloads far beyond the limits of single-node analysis, demanding a deeper understanding of how these models behave across large-scale, multi-GPU systems. In this paper, we present a comprehensive characterization of LLM training across diverse real-world workloads and hardware platforms, including NVIDIA H100/H200 and AMD MI250 GPUs. We analyze dense and sparse models under various parallelism strategies (tensor, pipeline, data, and expert) and evaluate their effects on hardware utilization, power consumption, and thermal behavior. We further evaluate the effectiveness of optimizations such as activation recomputation and compute-communication overlap. Our findings show that performance is not determined solely by scaling hardware capacity. Scale-up systems with fewer, higher-memory GPUs can outperform scale-out systems in communication-bound regimes, but only under carefully tuned configurations; in other cases, scale-out deployments achieve superior throughput. We also show that certain parallelism combinations, such as tensor with pipeline, lead to bandwidth underutilization due to inefficient data chunking, while increasing microbatch sizes beyond a certain point induces bursty execution and peak power excursions that worsen thermal throttling. These insights reveal how training performance is shaped by complex interactions between hardware, system topology, and model execution. We conclude by offering recommendations for system and hardware design to improve the scalability and reliability of future LLM systems and workloads. The source code of this project is available at https://github.com/sitar-lab/Char-PPT.

Efficient Learned Image Compression Through Knowledge Distillation

Authors: Fabien Allemand, Attilio Fiandrotti, Sumanta Chaudhuri, Alaa Eddine Mazouz

2025-09-12

http://arxiv.org/abs/2509.10366v1

Learned image compression sits at the intersection of machine learning and image processing. With advances in deep learning, neural network-based compression methods have emerged. In this process, an encoder maps the image to a low-dimensional latent space, which is then quantized, entropy-coded into a binary bitstream, and transmitted to the receiver. At the receiver end, the bitstream is entropy-decoded, and a decoder reconstructs an approximation of the original image. Recent research suggests that these models consistently outperform conventional codecs. However, they require significant processing power, making them unsuitable for real-time use on resource-constrained platforms, which hinders their deployment in mainstream applications. This study aims to reduce the resource requirements of neural networks used for image compression by leveraging knowledge distillation, a training paradigm where smaller neural networks, partially trained on the outputs of larger, more complex models, can achieve better performance than when trained independently. Our work demonstrates that knowledge distillation can be effectively applied to image compression tasks: i) across various architecture sizes, ii) to achieve different image quality/bit rate tradeoffs, and iii) to save processing and energy resources. This approach introduces new settings and hyperparameters, and future research could explore the impact of different teacher models, as well as alternative loss functions. Knowledge distillation could also be extended to transformer-based models. The code is publicly available at: https://github.com/FABallemand/PRIM.

Compute Only 16 Tokens in One Timestep Accelerating Diffusion Transformers with Cluster-Driven Feature Caching

Authors: Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, Linfeng Zhang

2025-09-12

http://arxiv.org/abs/2509.10312v1

Diffusion transformers have gained significant attention in recent years for their ability to generate high-quality images and videos, yet still suffer from a huge computational cost due to their iterative denoising process. Recently, feature caching has been introduced to accelerate diffusion transformers by caching the feature computation in previous timesteps and reusing it in the following timesteps, which leverages the temporal similarity of diffusion models while ignoring the similarity in the spatial dimension. In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective to previous feature caching. Specifically, ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster, and propagates its information to all the other tokens, which reduces the number of tokens by over 90%. Extensive experiments on DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation. Moreover, it can be directly applied to any diffusion transformer without any training. For instance, ClusCa achieves a 4.96x speedup on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/Shenyi-Z/Cache4Diffusion.
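
The idea of computing one token per spatial cluster and sharing the result can be sketched as follows. This is a simplified illustration under stated assumptions: tokens are assigned to the nearest of a few fixed centroids, and the representative's output is copied unchanged to the other cluster members. How ClusCa actually clusters tokens and propagates information is not specified in the abstract, and all names here are hypothetical.

```python
def cluster_tokens(tokens, centroids):
    """Assign each token (a list of floats) to its nearest centroid by squared distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda c: sqdist(t, centroids[c]))
            for t in tokens]

def compute_with_clusters(tokens, labels, expensive_fn):
    """Run expensive_fn once per cluster (on the first member seen) and reuse its output."""
    outputs = [None] * len(tokens)
    cache = {}   # cluster label -> output of the representative token
    calls = 0
    for i, label in enumerate(labels):
        if label not in cache:
            cache[label] = expensive_fn(tokens[i])
            calls += 1
        outputs[i] = cache[label]
    return outputs, calls

# Six toy 2-D tokens forming two obvious spatial groups.
tokens = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9], [0.05, 0.05], [4.9, 5.0]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
labels = cluster_tokens(tokens, centroids)

# Stand-in for a costly transformer block.
outputs, calls = compute_with_clusters(tokens, labels, lambda t: [2 * x for x in t])
print(labels, calls)
```

Here six tokens need only two evaluations of the expensive function, at the cost of treating tokens in the same cluster as interchangeable.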

OpenCSP A Deep Learning Framework for Crystal Structure Prediction from Ambient to High Pressure

Authors: Yinan Wang, Xiaoyang Wang, Zhenyu Wang, Jing Wu, Jian Lv, Han Wang

2025-09-12

http://arxiv.org/abs/2509.10293v1

High-pressure crystal structure prediction (CSP) underpins advances in condensed matter physics, planetary science, and materials discovery. Yet, most large atomistic models are trained on near-ambient, equilibrium data, leading to degraded stress accuracy at tens to hundreds of gigapascals and sparse coverage of pressure-stabilized stoichiometries and dense coordination motifs. Here, we introduce OpenCSP, a machine learning framework for CSP tasks spanning ambient to high-pressure conditions. This framework comprises an open-source pressure-resolved dataset alongside a suite of publicly available atomistic models that are jointly optimized for accuracy in energy, force, and stress predictions. The dataset is constructed via randomized high-pressure sampling and iteratively refined through an uncertainty-guided concurrent learning strategy, which enriches underrepresented compression regimes while suppressing redundant DFT labeling. Despite employing a training corpus one to two orders of magnitude smaller than those of leading large models, OpenCSP achieves comparable or superior performance in high-pressure enthalpy ranking and stability prediction. Across benchmark CSP tasks spanning a wide pressure window, our models match or surpass MACE-MPA-0, MatterSim v1 5M, and GRACE-2L-OAM, with the largest gains observed at elevated pressures. These results demonstrate that targeted, pressure-aware data acquisition coupled with scalable architectures enables data-efficient, high-fidelity CSP, paving the way for autonomous materials discovery under ambient and extreme conditions.

SignClip Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion

Authors: Wenfang Wu, Tingting Yuan, Yupeng Li, Daling Wang, Xiaoming Fu

2025-09-12

http://arxiv.org/abs/2509.10266v1

Sign language translation (SLT) aims to translate natural language from sign language videos, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches mainly focus on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. In addition, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T, in the gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71, and ROUGE from 46.57 to 48.38.

A Symmetry-Integrated Approach to Surface Code Decoding

Authors: Hoshitaro Ohnishi, Hideo Mukai

2025-09-12

http://arxiv.org/abs/2509.10164v1

Quantum error correction, which utilizes logical qubits that are encoded as redundant multiple physical qubits to find and correct errors in physical qubits, is indispensable for practical quantum computing. Surface code is considered to be a promising encoding method with a high error threshold that is defined by stabilizer generators. However, previous methods have suffered from the problem that the decoder acquires solely the error probability distribution because of the non-uniqueness of correct prediction obtained from the input. To circumvent this problem, we propose a technique to reoptimize the decoder model by approximating syndrome measurements with a continuous function that is mathematically interpolated by a neural network. We evaluated the improvement in accuracy of a multilayer-perceptron-based decoder for code distances of 5 and 7, as well as for decoders based on convolutional and recurrent neural networks and transformers for a code distance of 5. In all cases, the reoptimized decoder gave better accuracy than the original models, demonstrating the universal effectiveness of the proposed method, which is independent of code distance or network architecture. These results suggest that re-framing the problem of surface code decoding into a regression problem that can be tackled by deep learning is a useful strategy.

FedBiF Communication-Efficient Federated Learning via Bits Freezing

Authors: Shiwei Li, Qunwei Li, Haozhao Wang, Ruixuan Li, Jianbin Lin, Wenliang Zhong

2025-09-12

http://arxiv.org/abs/2509.10161v1

Federated learning (FL) is an emerging distributed machine learning paradigm that enables collaborative model training without sharing local data. Despite its advantages, FL suffers from substantial communication overhead, which can affect training efficiency. Recent efforts have mitigated this issue by quantizing model updates to reduce communication costs. However, most existing methods apply quantization only after local training, introducing quantization errors into the trained parameters and potentially degrading model accuracy. In this paper, we propose Federated Bit Freezing (FedBiF), a novel FL framework that directly learns quantized model parameters during local training. In each communication round, the server first quantizes the model parameters and transmits them to the clients. FedBiF then allows each client to update only a single bit of the multi-bit parameter representation, freezing the remaining bits. This bit-by-bit update strategy reduces each parameter update to one bit while maintaining high precision in parameter representation. Extensive experiments are conducted on five widely used datasets under both IID and Non-IID settings. The results demonstrate that FedBiF not only achieves superior communication compression but also promotes sparsity in the resulting models. Notably, FedBiF attains accuracy comparable to FedAvg, even when using only 1 bit-per-parameter (bpp) for uplink and 3 bpp for downlink communication. The code is available at https://github.com/Leopold1423/fedbif-tpds25.
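
The single-bit update can be pictured with integer codes standing in for the multi-bit parameter representation. The sketch below is a toy illustration, not the FedBiF training procedure: it assumes each client already knows which code it would prefer (`desired_codes`, a hypothetical stand-in for the result of local training) and shows that only the active bit is uploaded and changed, while all other bits stay frozen.

```python
def set_bit(code, bit, value):
    """Return code with the given bit position set to value (0 or 1); other bits unchanged."""
    return (code & ~(1 << bit)) | (value << bit)

def client_upload(desired_codes, active_bit):
    """One bit per parameter: the client's preferred value for the active bit only."""
    return [(d >> active_bit) & 1 for d in desired_codes]

def server_apply(codes, active_bit, uploaded_bits):
    """Write the uploaded bits into the active position; every other bit stays frozen."""
    return [set_bit(c, active_bit, b) for c, b in zip(codes, uploaded_bits)]

# Hypothetical 4-bit codes for three parameters, with bit 2 active this round.
codes = [0b1010, 0b0111, 0b0000]
desired_codes = [0b1110, 0b0011, 0b0101]
active_bit = 2

uploaded = client_upload(desired_codes, active_bit)
new_codes = server_apply(codes, active_bit, uploaded)
print(uploaded, [format(c, "04b") for c in new_codes])
```

The third parameter shows the freezing effect: the client would also like to change bit 0, but only bit 2 moves in this round, so the code becomes 0100 rather than 0101.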

Perfect quantum state transfer via state restoring and ancilla measurement

Authors: E. B. Fel'dman, J. Wu, A. I. Zenchuk

2025-09-12

http://arxiv.org/abs/2509.10100v1

We propose a protocol for perfect state transfer (PST) of an arbitrary pure quantum state along a spin-1/2 chain governed by a Hamiltonian preserving the excitation number in the system. We show that the $k$-excitation pure sender's state can be restored at the receiver using only local transformations over the qubits of the extended receiver. The restored state appears in superposition with other states, which form garbage. This garbage can be easily removed by including an ancilla whose state labels the garbage, and then measuring the ancilla state with the desired output. The resulting state of the receiver coincides with the initial sender's state up to an unimportant common phase factor. Then, to transfer an arbitrary pure state of some system $S_0$, we encode this state into the $k$-excitation state of the sender, transfer and restore it, and finally decode the restored $k$-excitation state of the receiver into the state of another subsystem $R_0$. After labeling and removing the garbage by measuring the state of the ancillae, we complete the algorithm for PST.

Semantic Rate-Distortion Theory with Applications

Authors: Yi-Qun Zhao, Zhi-Ming Ma, Geoffrey Ye Li, Shuai Yuan, Tong Ye, Chuan Zhou

2025-09-12

http://arxiv.org/abs/2509.10061v1

Artificial intelligence (AI) is ushering in a new era for communication. As a result, the establishment of a semantic communication framework is being put on the agenda. Based on a realistic semantic communication model, this paper develops a rate-distortion framework for semantic compression. Unlike existing works, which primarily focus on decoder-side estimation of intrinsic meaning and ignore its inherent issues, such as ambiguity and polysemy, we exploit a constraint of conditional semantic probability distortion to effectively capture the essential features of practical semantic exchanges in an AI-assisted communication system. With the help of the methods in rate-distortion-perception theory, we establish a theorem specifying the minimum achievable rate under this semantic constraint and a traditional symbolic constraint and obtain its closed-form limit for a particular semantic scenario. The experiments in this paper show that bounding conditional semantic probability distortion can effectively improve both semantic transmission accuracy and bit-rate efficiency. Our framework bridges information theory and AI, enabling potential applications in bandwidth-efficient semantic-aware networks, enhanced transceiver understanding, and optimized semantic transmission for AI-driven systems.

Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge

Authors: Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis, Sami Muhaidat

2025-09-12

http://arxiv.org/abs/2509.09955v1

Large-scale transformers are central to modern semantic communication, yet their high computational and communication costs hinder deployment on resource-constrained edge devices. This paper introduces a training-free framework for adaptive token merging, a novel mechanism that compresses transformer representations at runtime by selectively merging semantically redundant tokens under per-layer similarity thresholds. Unlike prior fixed-ratio reduction, our approach couples merging directly to input redundancy, enabling data-dependent adaptation that balances efficiency and task relevance without retraining. We cast the discovery of merging strategies as a multi-objective optimization problem and leverage Bayesian optimization to obtain Pareto-optimal trade-offs between accuracy, inference cost, and communication cost. On ImageNet classification, we match the accuracy of the unmodified transformer with 30% fewer floating-point operations per second and under 20% of the original communication cost, while for visual question answering our method achieves performance competitive with the full LLaVA model at less than one-third of the compute and one-tenth of the bandwidth. Finally, we show that our adaptive merging is robust across varying channel conditions and provides inherent privacy benefits, substantially degrading the efficacy of model inversion attacks. Our framework provides a practical and versatile solution for deploying powerful transformer models in resource-limited edge intelligence scenarios.
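
To make the idea of threshold-driven merging concrete, here is a minimal sketch. It is not the authors' algorithm: the greedy left-to-right grouping, the use of cosine similarity against a running group mean, and the names `cosine` and `merge_tokens` are all assumptions chosen for illustration. It only shows how a similarity threshold makes the number of surviving tokens depend on the redundancy of the input.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def merge_tokens(tokens, threshold):
    """Greedy merge (hypothetical scheme): fold each token into the previous
    group when its cosine similarity to that group's mean is >= threshold."""
    groups = []  # each entry is (sum_vector, count)
    for tok in tokens:
        if groups:
            total, count = groups[-1]
            mean = [x / count for x in total]
            if cosine(mean, tok) >= threshold:
                groups[-1] = ([x + y for x, y in zip(total, tok)], count + 1)
                continue
        groups.append((list(tok), 1))
    # Represent every group by the mean of its members.
    return [[x / count for x in total] for total, count in groups]


tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(len(merge_tokens(tokens, threshold=0.9)))  # redundant pair is merged
```

A lower threshold merges more aggressively, so in a scheme like this the per-layer thresholds would be the knobs that a multi-objective search tunes.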

LLMs as Agentic Cooperative Players in Multiplayer UNO

Authors: Yago Romano Matinez, Jesse Roberts

2025-09-11

http://arxiv.org/abs/2509.09867v1

LLMs promise to assist humans -- not just by answering questions, but by offering useful guidance across a wide range of tasks. But how far does that assistance go? Can a large language model based agent actually help someone accomplish their goal as an active participant? We test this question by engaging an LLM in UNO, a turn-based card game, asking it not to win but instead to help another player do so. We built a tool that allows decoder-only LLMs to participate as agents within the RLCard game environment. These models receive full game-state information and respond using simple text prompts under two distinct prompting strategies. We evaluate models ranging from small (1B parameters) to large (70B parameters) and explore how model scale impacts performance. We find that while all models were able to outperform a random baseline when playing UNO, few were able to significantly aid another player.

Latency and Token-Aware Test-Time Compute

Authors: Jenny Y. Huang, Mehul Damani, Yousef El-Kurdi, Ramon Astudillo, Wei Sun

2025-09-11

http://arxiv.org/abs/2509.09864v1

Inference-time scaling has emerged as a powerful way to improve large language model (LLM) performance by generating multiple candidate responses and selecting among them. However, existing work on dynamic allocation for test-time compute typically considers only parallel generation methods such as best-of-N, overlooking incremental decoding methods like beam search, and has largely ignored latency, focusing only on token usage. We formulate inference-time scaling as a problem of dynamic compute allocation and method selection, where the system must decide which strategy to apply and how much compute to allocate on a per-query basis. Our framework explicitly incorporates both token cost and wall-clock latency, the latter being critical for user experience and particularly for agentic workflows where models must issue multiple queries efficiently. Experiments on reasoning benchmarks show that our approach consistently outperforms static strategies, achieving favorable accuracy-cost trade-offs while remaining practical for deployment.

Towards an AI-based knowledge assistant for goat farmers based on Retrieval-Augmented Generation

Authors: Nana Han, Dong Liu, Tomas Norton

2025-09-11

http://arxiv.org/abs/2509.09848v1

Large language models (LLMs) are increasingly being recognised as valuable knowledge tools in many industries. However, their application in livestock farming remains limited, being constrained by several factors, not least the availability, diversity and complexity of knowledge sources. This study introduces an intelligent knowledge assistant system designed to support health management in farmed goats. Leveraging Retrieval-Augmented Generation (RAG), two structured knowledge processing methods, table textualization and decision-tree textualization, were proposed to enhance LLMs' understanding of heterogeneous data formats. Based on these methods, a domain-specific goat farming knowledge base was established to improve the LLM's capacity for cross-scenario generalization. The knowledge base spans five key domains: Disease Prevention and Treatment, Nutrition Management, Rearing Management, Goat Milk Management, and Basic Farming Knowledge. Additionally, an online search module is integrated to enable real-time retrieval of up-to-date information. To evaluate system performance, six ablation experiments were conducted to examine the contribution of each component. The results demonstrated that the heterogeneous knowledge fusion method achieved the best results, with mean accuracies of 87.90% on the validation set and 84.22% on the test set. Across the text-based, table-based, and decision-tree-based Q&A tasks, accuracy consistently exceeded 85%, validating the effectiveness of structured knowledge fusion within a modular design. Error analysis identified omission as the predominant error category, highlighting opportunities to further improve retrieval coverage and context integration. In conclusion, the results highlight the robustness and reliability of the proposed system for practical applications in goat farming.

CoDiCodec Unifying Continuous and Discrete Compressed Representations of Audio

Authors: Marco Pasini, Stefan Lattner, George Fazekas

2025-09-11

http://arxiv.org/abs/2509.09836v1

Efficiently representing audio signals in a compressed latent space is critical for latent generative modelling. However, existing autoencoders often force a choice between continuous embeddings and discrete tokens. Furthermore, achieving high compression ratios while maintaining audio fidelity remains a challenge. We introduce CoDiCodec, a novel audio autoencoder that overcomes these limitations both by efficiently encoding global features via summary embeddings and by producing compressed continuous embeddings at ~11 Hz and discrete tokens at a rate of 2.38 kbps from the same trained model, offering unprecedented flexibility for different downstream generative tasks. This is achieved through Finite Scalar Quantization (FSQ) and a novel FSQ-dropout technique, and does not require additional loss terms beyond the single consistency loss used for end-to-end training. CoDiCodec supports both autoregressive decoding and a novel parallel decoding strategy, with the latter achieving superior audio quality and faster decoding. CoDiCodec outperforms existing continuous and discrete autoencoders at similar bitrates in terms of reconstruction audio quality. Our work enables a unified approach to audio compression, bridging the gap between continuous and discrete generative modelling paradigms.

ButterflyQuant Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms

Authors: Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang

2025-09-11

http://arxiv.org/abs/2509.09679v1

Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms--Hadamard matrices achieving optimal worst-case coherence $\mu = 1/\sqrt{n}$--that cannot adapt to specific weight distributions. We identify that different layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. We propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$ entries that are non-differentiable and prohibit gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU--a negligible one-time cost. On LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 22.1 for QuaRot.
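
The invariance $\mathbf{y} = (\mathbf{WQ}^T)(\mathbf{Qx})$ and the parameter count $\frac{n \log n}{2}$ can be checked numerically with a small sketch. This is an illustration under assumptions, not the paper's implementation: the pairing pattern of the butterfly stages, the base-2 logarithm, the random angles (nothing is learned here), and all function names are hypothetical choices.

```python
import math
import random


def butterfly_apply(x, angles):
    """Apply log2(n) stages of Givens rotations to the vector x.

    angles[s][p] is the angle for pair p at stage s. Each stage rotates
    n/2 disjoint coordinate pairs, so the overall map is orthogonal.
    """
    n = len(x)
    y = list(x)
    stride = 1
    for stage in angles:
        out = list(y)
        p = 0
        for block in range(0, n, 2 * stride):
            for i in range(block, block + stride):
                j = i + stride
                c, s = math.cos(stage[p]), math.sin(stage[p])
                out[i] = c * y[i] - s * y[j]
                out[j] = s * y[i] + c * y[j]
                p += 1
        y = out
        stride *= 2
    return y


def butterfly_matrix(angles, n):
    """Materialise Q column by column by transforming each basis vector."""
    cols = [butterfly_apply([1.0 if k == i else 0.0 for k in range(n)], angles)
            for i in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


rng = random.Random(0)
n = 8  # must be a power of two in this sketch
angles = [[rng.uniform(-math.pi, math.pi) for _ in range(n // 2)]
          for _ in range(int(math.log2(n)))]
Q = butterfly_matrix(angles, n)
W = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]
x = [[rng.gauss(0, 1)] for _ in range(n)]  # column vector

y_direct = matmul(W, x)
y_rotated = matmul(matmul(W, transpose(Q)), matmul(Q, x))
max_err = max(abs(a[0] - b[0]) for a, b in zip(y_direct, y_rotated))
print("angles:", sum(len(s) for s in angles), "max |y - y'|:", max_err)
```

Because every stage is a product of plane rotations, orthogonality holds for any angle values, which is what allows the angles to be optimised freely.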

LAVa Layer-wise KV Cache Eviction with Dynamic Budget Allocation

Authors: Yiqun Shen, Song Yuan, Zhengze Zhang, Xiaoliang Wang, Daxin Jiang, Nguyen Cam-Tu

2025-09-11

http://arxiv.org/abs/2509.09754v1

KV cache is commonly used to accelerate LLM inference with long contexts, yet its high memory demand drives the need for cache compression. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. To address this limitation, we introduce a unified framework for cache compression by minimizing information loss in Transformer residual streams. Building on it, we analyze the layer attention output loss and derive a new metric to compare cache entries across heads, enabling layer-wise compression with dynamic head budgets. Additionally, by contrasting cross-layer information, we also achieve dynamic layer budgets. LAVa is the first unified strategy for cache eviction and dynamic budget allocation that, unlike prior methods, does not rely on training or the combination of multiple strategies. Experiments with benchmarks (LongBench, Needle-In-A-Haystack, Ruler, and InfiniteBench) demonstrate its superiority. Moreover, our experiments reveal a new insight: dynamic layer budgets are crucial for generation tasks (e.g., code completion), while dynamic head budgets play a key role in extraction tasks (e.g., extractive QA). As a fully dynamic compression method, LAVa consistently maintains top performance across task types. Our code is available at https://github.com/MGDDestiny/Lava.

Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates

Authors: Harry Julian, Rachel Beeson, Lohith Konathala, Johanna Ulin, Jiameng Gao

2025-09-11

http://arxiv.org/abs/2509.09550v2

Neural Audio Codecs (NACs) have become increasingly adopted in speech processing tasks due to their excellent rate-distortion performance and compatibility with Large Language Models (LLMs) as discrete feature representations for audio generation. While most existing codecs rely on Residual Vector Quantization (RVQ), Finite Scalar Quantization (FSQ) has recently emerged as a compelling alternative that simplifies training and natively supports single codebooks. We introduce NeuCodec, an FSQ-based NAC, and show that FSQ encodes baked-in redundancy that produces an encoding that is robust when transmitted through noisy channels. First, through an encoder distillation experiment, we show that two different encoders can learn to encode identical audio into vastly different code sequences whilst maintaining comparable reconstruction quality with the same quantizer and decoder. Second, we demonstrate that FSQ has vastly superior bit-level perturbation robustness by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.

TrEnv Transparently Share Serverless Execution Environments Across Different Functions and Nodes

Authors: Jialiang Huang, Teng Ma, Zheng Liu, Sixing Lin, Kang Chen, Jinlei Jiang, Xia Liao, Yingdi Shan, Yongwei Wu, Ning Zhang, Mengting Lu, Tao Ma, Haifeng Gong, Mingxing Zhang

2025-09-11

http://arxiv.org/abs/2509.09525v1

Serverless computing provides dynamic scalability, but its infrastructure overhead becomes a bottleneck for emerging workloads such as LLM agents, which exhibit unpredictable invocation patterns and variable resource demands. Our analysis shows that for these agents, the cost of running on serverless platforms can reach up to 70% of the cost of LLM API calls. This finding motivates the need for a more efficient, high-density serverless platform. We present TrEnv, a co-designed serverless platform that supports both container- and VM-based environments, optimized for the unique demands of LLM agents. TrEnv reduces startup latency and memory usage through repurposable sandboxes and memory templates, which enable fast reuse and restoration of execution environments. To further reduce overhead in VM-based agent workloads, TrEnv leverages browser sharing and a page cache bypassing mechanism. Evaluations show that TrEnv reduces P99 latency by up to 7x and memory usage by 48% in container-based settings, and achieves up to 58% lower P99 latency and 61% memory savings for VM-based agents compared to state-of-the-art systems like E2B.

Combating the Memory Walls Optimization Pathways for Long-Context Agentic LLM Inference

Authors: Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, Jeffrey T. H. Wong, Zhiwen Mo, Cheng Zhang, Przemyslaw Forys, Wayne Luk, Hongxiang Fan, Jianyi Cheng, Timothy M. Jones, Rika Antonova, Robert Mullins, Aaron Zhao

2025-09-11

http://arxiv.org/abs/2509.09505v1

LLMs now form the backbone of AI agents for a diverse array of applications, including tool use, command-line agents, and web or computer use agents. These agentic LLM inference tasks are fundamentally different from chatbot-focused inference -- they often have much larger context lengths to capture complex, prolonged inputs, such as entire webpage DOMs or complicated tool call trajectories. This, in turn, generates significant off-chip memory traffic for the underlying hardware at the inference stage and causes the workload to be constrained by two memory walls, namely the bandwidth and capacity memory walls, preventing the on-chip compute units from achieving high utilization. In this paper, we introduce PLENA, a hardware-software co-designed system that applies three core optimization pathways to tackle these challenges. PLENA includes an efficient hardware implementation of compute and memory units supporting an asymmetric quantization scheme. PLENA also features a novel flattened systolic array architecture that has native support for FlashAttention to tackle these memory walls in the scenario of inference for long-context LLMs. Additionally, PLENA is developed with a complete stack, including a custom ISA, a compiler, a cycle-emulated simulator, and an automated design space exploration flow. The simulated results show that PLENA achieves up to 8.5x higher utilization than existing accelerators, and delivers 2.24x higher throughput than the A100 GPU and 3.85x higher throughput than the TPU v6e, under the same multiplier count and memory settings. The full PLENA system will also be open-sourced.

ENSI Efficient Non-Interactive Secure Inference for Large Language Models

Authors: Zhiyu He, Maojiang Wang, Xinwen Gao, Yuchuan Luo, Lin Liu, Shaojing Fu

2025-09-11

http://arxiv.org/abs/2509.09424v1

Secure inference enables privacy-preserving machine learning by leveraging cryptographic protocols that support computations on sensitive user data without exposing it. However, integrating cryptographic protocols with large language models (LLMs) presents significant challenges, as the inherent complexity of these protocols, together with LLMs' massive parameter scale and sophisticated architectures, severely limits practical usability. In this work, we propose ENSI, a novel non-interactive secure inference framework for LLMs, based on the principle of co-designing the cryptographic protocols and the LLM architecture. ENSI employs an optimized encoding strategy that seamlessly integrates the CKKS scheme with a lightweight LLM variant, BitNet, significantly reducing the computational complexity of encrypted matrix multiplications. In response to the prohibitive computational demands of softmax under homomorphic encryption (HE), we pioneer the integration of the sigmoid attention mechanism with HE as a seamless, retraining-free alternative. Furthermore, by embedding the bootstrapping operation within the RMSNorm process, we efficiently refresh ciphertexts while markedly decreasing the frequency of costly bootstrapping invocations. Experimental evaluations demonstrate that ENSI achieves approximately an 8x acceleration in matrix multiplications and a 2.6x speedup in softmax inference on CPU compared to state-of-the-art methods, with the proportion of bootstrapping reduced to just 1%.

HD-MoE Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing

Authors: Haochen Huang, Shuzhang Zhong, Zhe Zhang, Shuangchen Li, Dimin Niu, Hongzhong Zheng, Runsheng Wang, Meng Li

2025-09-11

http://arxiv.org/abs/2509.09420v1

Large Language Models (LLMs) with Mixture-of-Expert (MoE) architectures achieve superior model performance with reduced computation costs, but at the cost of high memory capacity and bandwidth requirements. Near-Memory Processing (NMP) accelerators that stack memory directly on the compute through hybrid bonding have demonstrated high bandwidth with high energy efficiency, becoming a promising architecture for MoE models. However, as NMP accelerators comprise distributed memory and computation, how the MoE computation is mapped directly determines the inference efficiency. Existing parallel mapping strategies, including Tensor Parallelism (TP) and Expert Parallelism (EP), suffer from either high communication costs or unbalanced computation utilization, leading to inferior efficiency. The dynamic routing mechanism of MoE LLMs further aggravates the efficiency challenges. Therefore, in this paper, we propose HD-MoE to automatically optimize the MoE parallel computation across an NMP accelerator. HD-MoE features an offline automatic hybrid parallel mapping algorithm and an online dynamic scheduling strategy to reduce the communication costs while maximizing the computation utilization. With extensive experimental results, we demonstrate that HD-MoE achieves a speedup ranging from 1.1x to 1.8x over TP, 1.1x to 1.5x over EP, and 1.0x to 1.4x over the baseline Hybrid TP-EP with Compute-Balanced parallelism strategies.

DiTReducio A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration

Authors: Yanru Huo, Ziyue Jiang, Zuoli Tang, Qingyang Hong, Zhou Zhao

2025-09-11

http://arxiv.org/abs/2509.09748v1

While Diffusion Transformers (DiT) have advanced non-autoregressive (NAR) speech synthesis, their high computational demands remain a limitation. Existing acceleration approaches for DiT-based text-to-speech (TTS) models mainly focus on reducing sampling steps through distillation techniques, yet they remain constrained by training costs. We introduce DiTReducio, a training-free acceleration framework that compresses computations in DiT-based TTS models via progressive calibration. We propose two compression methods, Temporal Skipping and Branch Skipping, to eliminate redundant computations during inference. Moreover, based on two characteristic attention patterns identified within DiT layers, we devise a pattern-guided strategy to selectively apply the compression methods. Our method allows flexible modulation between generation quality and computational efficiency through adjustable compression thresholds. Experimental evaluations conducted on F5-TTS and MegaTTS 3 demonstrate that DiTReducio achieves a 75.4% reduction in FLOPs and improves the Real-Time Factor (RTF) by 37.1%, while preserving generation quality.

Efficient Transformer-Based Piano Transcription With Sparse Attention Mechanisms

Authors: Weixing Wei, Kazuyoshi Yoshii

2025-09-11

http://arxiv.org/abs/2509.09318v1

This paper investigates automatic piano transcription based on computationally efficient yet high-performing variants of the Transformer that can capture longer-term dependencies over the whole musical piece. Recently, transformer-based sequence-to-sequence models have demonstrated excellent performance in piano transcription. These models, however, fail to deal with the whole piece at once due to the quadratic complexity of the self-attention mechanism, and music signals are thus typically processed in a sliding-window manner in practice. To overcome this limitation, we propose an efficient architecture with sparse attention mechanisms. Specifically, we introduce sliding-window self-attention mechanisms for both the encoder and decoder, and a hybrid global-local cross-attention mechanism that attends to various spans according to the MIDI token types. We also use a hierarchical pooling strategy between the encoder and decoder to further reduce computational load. Our experiments on the MAESTRO dataset showed that the proposed model achieved a significant reduction in computational cost and memory usage, accelerating inference speed, while maintaining transcription performance comparable to the full-attention baseline. This allows for training with longer audio contexts on the same hardware, demonstrating the viability of sparse attention for building efficient and high-performance piano transcription systems. The code is available at https://github.com/WX-Wei/efficient-seq2seq-piano-trans.
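
The cost argument behind sliding-window self-attention can be illustrated with a short sketch. It is a generic, non-causal version written for clarity, not the code from the linked repository: the symmetric window, the single attention head, and the function name are assumptions. Each query attends to at most `2 * window + 1` keys, so the work grows linearly with sequence length instead of quadratically.

```python
import math


def sliding_window_attention(q, k, v, window):
    """Scaled dot-product attention where query i sees keys j with |i - j| <= window."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(lo, hi)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * v[j][c] for w, j in zip(weights, range(lo, hi))) / z
                    for c in range(len(v[0]))])
    return out


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
local = sliding_window_attention(q, k, v, window=1)
full = sliding_window_attention(q, k, v, window=len(q))  # window covers everything
print(local[0], full[0])
```

Setting the window to the sequence length recovers ordinary full attention, which is a convenient sanity check for any windowed implementation.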

From scratch to silver Creating trustworthy training data for patent-SDG classification using Large Language Models

Authors: Grazia Sveva Ascione, Nicolò Tamagnone

2025-09-11

http://arxiv.org/abs/2509.09303v1

Classifying patents by their relevance to the UN Sustainable Development Goals (SDGs) is crucial for tracking how innovation addresses global challenges. However, the absence of a large, labeled dataset limits the use of supervised learning. Existing methods, such as keyword searches, transfer learning, and citation-based heuristics, lack scalability and generalizability. This paper frames patent-to-SDG classification as a weak supervision problem, using citations from patents to SDG-tagged scientific publications (NPL citations) as a noisy initial signal. To address its sparsity and noise, we develop a composite labeling function (LF) that uses large language models (LLMs) to extract structured concepts, namely functions, solutions, and applications, from patents and SDG papers based on a patent ontology. Cross-domain similarity scores are computed and combined using a rank-based retrieval approach. The LF is calibrated via a custom positive-only loss that aligns with known NPL-SDG links without penalizing discovery of new SDG associations. The result is a silver-standard, soft multi-label dataset mapping patents to SDGs, enabling the training of effective multi-label regression models. We validate our approach through two complementary strategies: (1) internal validation against held-out NPL-based labels, where our method outperforms several baselines including transformer-based models and zero-shot LLMs; and (2) external validation using network modularity in patent citation, co-inventor, and co-applicant graphs, where our labels reveal greater thematic, cognitive, and organizational coherence than traditional technological classifications. These results show that weak supervision and semantic alignment can enhance SDG classification at scale.

Harnessing Uncertainty Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents

Authors: Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang

2025-09-11

http://arxiv.org/abs/2509.09265v1

In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge: sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficiently small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. The project page is at https://empgseed-seed.github.io/

Medverse A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement

Authors: Jiesi Hu, Jianfeng Cao, Yanwu Yang, Chenfei Ye, Yixuan Zhang, Hanyang Peng, Ting Ma

2025-09-11

http://arxiv.org/abs/2509.09232v1

In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present Medverse, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available. Our model is publicly available at https://github.com/jiesihu/Medverse.

CCF A Context Compression Framework for Efficient Long-Sequence Language Modeling

Authors: Wenhao Li, Bangcheng Sun, Weihao Ye, Tianyi Zhang, Daohai Yu, Fei Chao, Rongrong Ji

2025-09-11

http://arxiv.org/abs/2509.09199v1

Scaling language models to longer contexts is essential for capturing rich dependencies across extended discourse. However, naive context extension imposes significant computational and memory burdens, often resulting in inefficiencies during both training and inference. In this work, we propose CCF, a novel context compression framework designed to enable efficient long-context modeling by learning hierarchical latent representations that preserve global semantics while aggressively reducing input redundancy. CCF integrates segment-wise semantic aggregation with key-value memory encoding, forming compact representations that support accurate reconstruction and long-range understanding. To further enhance scalability, we introduce a training-efficient optimization strategy that couples incremental segment decoding with reservoir sampling, substantially reducing memory overhead without degrading performance. Empirical results on multiple long-context language modeling benchmarks demonstrate that CCF achieves competitive perplexity under high compression ratios, and significantly improves throughput and memory efficiency compared to existing approaches. These findings highlight the potential of structured compression for scalable and effective long-context language modeling.
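
Reservoir sampling, which the abstract names as part of the training strategy, keeps a fixed-size uniform sample from a stream of unknown length in constant memory. The sketch below is the textbook procedure (Algorithm R), not CCF's training code; treating the stream items as segment indices and the name `reservoir_sample` are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items drawn uniformly without replacement from an iterable,
    using O(k) memory regardless of the stream length."""
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream):
        if t < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, t)   # inclusive on both ends
            if j < k:               # happens with probability k / (t + 1)
                reservoir[j] = item
    return reservoir


# Hypothetical use: keep 4 of 1000 segment indices while streaming over them.
kept = reservoir_sample(range(1000), k=4, rng=random.Random(0))
print(sorted(kept))
```

The memory bound is what matters in this setting: only `k` items are ever held, however long the context stream grows.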

GmSLM Generative Marmoset Spoken Language Modeling

Authors: Talia Sternberg, Michael London, David Omer, Yossi Adi

2025-09-11

http://arxiv.org/abs/2509.09198v1

Marmoset monkeys exhibit complex vocal communication, challenging the view that nonhuman primates' vocal communication is entirely innate, and show features similar to human speech, such as vocal labeling of others and turn-taking. Studying their vocal communication offers a unique opportunity to link it with brain activity, especially given the difficulty of accessing the human brain in speech and language research. Since Marmosets communicate primarily through vocalizations, applying standard LLM approaches is not straightforward. We introduce Generative Marmoset Spoken Language Modeling (GmSLM), an optimized spoken language model pipeline for Marmoset vocal communication. We designed novel zero-shot evaluation metrics using unsupervised in-the-wild data, alongside weakly labeled conversational data, to assess GmSLM and demonstrate its advantage over a basic human-speech-based baseline. GmSLM-generated vocalizations closely matched real resynthesized samples acoustically and performed well on downstream tasks. Despite being fully unsupervised, GmSLM effectively distinguishes real from artificial conversations, may support further investigations of the neural basis of vocal communication, and provides a practical framework linking vocalization and brain activity. We believe GmSLM stands to benefit future work in neuroscience, bioacoustics, and evolutionary biology. Samples are provided under: pages.cs.huji.ac.il/adiyoss-lab/GmSLM.

AI Reasoning for Wireless Communications and Networking A Survey and Perspectives

Authors: Haoxiang Luo, Yu Yan, Yanhui Bian, Wenjiao Feng, Ruichen Zhang, Yinqiu Liu, Jiacheng Wang, Gang Sun, Dusit Niyato, Hongfang Yu, Abbas Jamalipour, Shiwen Mao

2025-09-11

http://arxiv.org/abs/2509.09193v1

Artificial Intelligence (AI) techniques play a pivotal role in optimizing wireless communication networks. However, traditional deep learning approaches often act as closed boxes, lacking the structured reasoning abilities needed to tackle complex, multi-step decision problems. This survey provides a comprehensive review and outlook of reasoning-enabled AI in wireless communication networks, with a focus on Large Language Models (LLMs) and other advanced reasoning paradigms. In particular, LLM-based agents can combine reasoning with long-term planning, memory, tool utilization, and autonomous cross-layer control to dynamically optimize network operations with minimal human intervention. We begin by outlining the evolution of intelligent wireless networking and the limitations of conventional AI methods. We then introduce emerging AI reasoning techniques. Furthermore, we establish a classification system applicable to wireless network tasks. We also present a layer-by-layer examination for AI reasoning, covering the physical, data link, network, transport, and application layers. For each part, we identify key challenges and illustrate how AI reasoning methods can improve AI-based wireless communication performance. Finally, we discuss key research directions for AI reasoning toward future wireless communication networks. By combining insights from both communications and AI, this survey aims to chart a path for integrating reasoning techniques into next-generation wireless networks.

Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication

Authors: Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis

2025-09-11

http://arxiv.org/abs/2509.09168v1

Large-scale transformer models have emerged as a powerful tool for semantic communication systems, enabling edge devices to extract rich representations for robust inference across noisy wireless channels. However, their substantial computational demands remain a major barrier to practical deployment in resource-constrained 6G networks. In this paper, we present a training-free framework for adaptive token merging in pretrained vision transformers to jointly reduce inference time and transmission resource usage. We formulate the selection of per-layer merging proportions as a multi-objective optimization problem to balance accuracy and computational cost. We employ Gaussian process-based Bayesian optimization to construct a Pareto frontier of optimal configurations, enabling flexible runtime adaptation to dynamic application requirements and channel conditions. Extensive experiments demonstrate that our method consistently outperforms other baselines and achieves significant reductions in floating-point operations while maintaining competitive accuracy across a wide range of signal-to-noise ratio (SNR) conditions. Additional results highlight the effectiveness of adaptive policies that adjust merging aggressiveness in response to channel quality, providing a practical mechanism to trade off latency and semantic fidelity on demand. These findings establish a scalable and efficient approach for deploying transformer-based semantic communication in future edge intelligence systems.
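The Pareto-frontier idea in this abstract can be illustrated with a small sketch. It shows only the final filtering and runtime selection step, not the Gaussian process-based Bayesian optimization that proposes configurations. The `(cost, accuracy)` tuples, the function names `pareto_front` and `pick_config`, and all numbers are hypothetical and do not come from the paper.

```python
from typing import List, Optional, Tuple

Config = Tuple[float, float]  # (compute cost, accuracy); hypothetical units


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep configurations that no other configuration dominates.

    A configuration dominates another if it is no worse on both objectives
    (lower cost, higher accuracy) and strictly better on at least one.
    """
    front = []
    for i, (cost_i, acc_i) in enumerate(configs):
        dominated = any(
            cost_j <= cost_i and acc_j >= acc_i and (cost_j < cost_i or acc_j > acc_i)
            for j, (cost_j, acc_j) in enumerate(configs)
            if j != i
        )
        if not dominated:
            front.append((cost_i, acc_i))
    return sorted(front)


def pick_config(front: List[Config], max_cost: float) -> Optional[Config]:
    """At runtime, pick the most accurate frontier point within a cost budget."""
    feasible = [c for c in front if c[0] <= max_cost]
    return max(feasible, key=lambda c: c[1]) if feasible else None


# Made-up candidate merging configurations, for illustration only.
candidates = [(1.0, 0.70), (2.0, 0.80), (2.5, 0.78), (3.0, 0.85)]
front = pareto_front(candidates)
print(front)
print(pick_config(front, max_cost=2.6))
```

Under this reading, a tighter budget (for example, under poor channel conditions) simply moves the selection to a cheaper point on the same precomputed frontier.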

DP-FedLoRA Privacy-Enhanced Federated Fine-Tuning for On-Device Large Language Models

Authors: Honghui Xu, Shiva Shrestha, Wei Chen, Zhiyuan Li, Zhipeng Cai

2025-09-11

http://arxiv.org/abs/2509.09097v1

As on-device large language model (LLM) systems become increasingly prevalent, federated fine-tuning enables advanced language understanding and generation directly on edge devices; however, it also involves processing sensitive, user-specific data, raising significant privacy concerns within the federated learning framework. To address these challenges, we propose DP-FedLoRA, a privacy-enhanced federated fine-tuning framework that integrates LoRA-based adaptation with differential privacy in a communication-efficient setting. Each client locally clips and perturbs its LoRA matrices using Gaussian noise to satisfy ($\epsilon$, $\delta$)-differential privacy. We further provide a theoretical analysis demonstrating the unbiased nature of the updates and deriving bounds on the variance introduced by noise, offering practical guidance for privacy-budget calibration. Experimental results across mainstream benchmarks show that DP-FedLoRA delivers competitive performance while offering strong privacy guarantees, paving the way for scalable and privacy-preserving deployment in on-device environments.
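The clip-and-perturb step described above can be sketched as follows. This is a generic Gaussian-mechanism illustration, not the paper's implementation: the function name `clip_and_perturb`, the parameters `clip_norm` and `noise_multiplier`, and the choice of Frobenius-norm clipping are assumptions. Calibrating `noise_multiplier` to a target ($\epsilon$, $\delta$) budget is not shown.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def clip_and_perturb(
    update: Matrix, clip_norm: float, noise_multiplier: float, rng: random.Random
) -> Matrix:
    """Clip a matrix to a maximum Frobenius norm, then add Gaussian noise.

    Assumption: the noise standard deviation is noise_multiplier * clip_norm,
    as in common Gaussian-mechanism setups.
    """
    frobenius = math.sqrt(sum(x * x for row in update for x in row))
    # Scale down only when the norm exceeds the clipping bound.
    scale = min(1.0, clip_norm / frobenius) if frobenius > 0 else 1.0
    std = noise_multiplier * clip_norm
    return [[x * scale + rng.gauss(0.0, std) for x in row] for row in update]


# Hypothetical 2x2 LoRA factor on one client.
rng = random.Random(0)
lora_a = [[3.0, 4.0], [0.0, 0.0]]  # Frobenius norm 5
noisy = clip_and_perturb(lora_a, clip_norm=1.0, noise_multiplier=0.5, rng=rng)
print(noisy)
```

Clipping bounds each client's contribution, which is what lets the added noise be calibrated independently of the raw update magnitude.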

Towards Confidential and Efficient LLM Inference with Dual Privacy Protection

Authors: Honglan Yu, Yibin Wang, Feifei Dai, Dong Liu, Haihui Fan, Xiaoyan Gu

2025-09-11

http://arxiv.org/abs/2509.09091v1

CPU-based trusted execution environments (TEEs) and differential privacy (DP) have gained wide applications for private inference. Due to high inference latency in TEEs, researchers use partition-based approaches that offload linear model components to GPUs. However, dense nonlinear layers of large language models (LLMs) result in significant communication overhead between TEEs and GPUs. DP-based approaches apply random noise to protect data privacy, but this compromises LLM performance and semantic understanding. To overcome the above drawbacks, this paper proposes CMIF, a Confidential and efficient Model Inference Framework. CMIF confidentially deploys the embedding layer in the client-side TEE and subsequent layers on GPU servers. Meanwhile, it optimizes the Report-Noisy-Max mechanism to protect sensitive inputs with a slight decrease in model performance. Extensive experiments on Llama-series models demonstrate that CMIF reduces additional inference overhead in TEEs while preserving user data privacy.
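For readers unfamiliar with Report-Noisy-Max, the textbook form of the mechanism can be sketched as below. This is the standard mechanism only, not CMIF's optimized variant, whose details the abstract does not give. The function names and the `scale` parameter are illustrative, and choosing `scale` from a sensitivity and privacy budget is left out.

```python
import random
from typing import Sequence


def laplace(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def report_noisy_max(scores: Sequence[float], scale: float, rng: random.Random) -> int:
    """Add independent Laplace noise to each score and return the winning index.

    Only the index is released; the noisy scores themselves are discarded.
    """
    noisy = [s + laplace(scale, rng) for s in scores]
    return max(range(len(noisy)), key=lambda i: noisy[i])


# Hypothetical candidate scores (e.g. similarity of substitute tokens).
rng = random.Random(0)
print(report_noisy_max([1.0, 5.0, 2.0], scale=1.0, rng=rng))
```

A larger `scale` gives stronger privacy but makes it more likely that a lower-scoring candidate is reported.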

SQAP-VLA A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models

Authors: Hengyu Fang, Yijiang Liu, Yuan Du, Li Du, Huanrui Yang

2025-09-11

http://arxiv.org/abs/2509.09090v1

Vision-Language-Action (VLA) models exhibit unprecedented capabilities for embodied intelligence. However, their extensive computational and memory costs hinder their practical deployment. Existing VLA compression and acceleration approaches conduct quantization or token pruning in an ad-hoc manner but fail to enable both for a holistic efficiency improvement due to an observed incompatibility. This work introduces SQAP-VLA, the first structured, training-free VLA inference acceleration framework that simultaneously enables state-of-the-art quantization and token pruning. We overcome the incompatibility by co-designing the quantization and token pruning pipeline, where we propose new quantization-aware token pruning criteria that work on an aggressively quantized model while improving the quantizer design to enhance pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields significant gains in computational efficiency and inference speed while successfully preserving core model performance, achieving a $1.93\times$ speedup and up to a 4.5% average success rate enhancement compared to the original model.

Instructional Prompt Optimization for Few-Shot LLM-Based Recommendations on Cold-Start Users

Authors: Haowei Yang, Yushang Zhao, Sitao Min, Bo Su, Chao Yao, Wei Xu

2025-09-11

http://arxiv.org/abs/2509.09066v1

The cold-start user issue compromises the effectiveness of recommender systems by limiting access to historical behavioral information. Optimizing instructional prompts for a few-shot large language model (LLM) used in recommender tasks is an effective pipeline for this setting. We introduce a context-conditioned prompt formulation method $P(u, D_s) \rightarrow \hat{R}$, where $u$ is a cold-start user profile, $D_s$ is a curated support set, and $\hat{R}$ is the predicted ranked list of items. Based on systematic experimentation with transformer-based autoregressive LLMs (BioGPT, LLaMA-2, GPT-4), we provide empirical evidence that optimal exemplar injection and instruction structuring can significantly improve the precision@k and NDCG scores of such models in low-data settings. The pipeline uses token-level alignments and embedding space regularization to achieve greater semantic fidelity. Our findings show that prompt composition is not merely syntactic but also functional, as it directly controls attention scales and decoder conduct during inference. This paper shows that prompt-based adaptation may be considered one of the ways to address cold-start recommendation issues in LLM-based pipelines.

VoxelFormer Parameter-Efficient Multi-Subject Visual Decoding from fMRI

Authors: Chenqian Le, Yilin Zhao, Nikasadat Emami, Kushagra Yadav, Xujin "Chris" Liu, Xupeng Chen, Yao Wang

2025-09-10

http://arxiv.org/abs/2509.09015v1

Recent advances in fMRI-based visual decoding have enabled compelling reconstructions of perceived images. However, most approaches rely on subject-specific training, limiting scalability and practical deployment. We introduce VoxelFormer, a lightweight transformer architecture that enables multi-subject training for visual decoding from fMRI. VoxelFormer integrates a Token Merging Transformer (ToMer) for efficient voxel compression and a query-driven Q-Former that produces fixed-size neural representations aligned with the CLIP image embedding space. Evaluated on the 7T Natural Scenes Dataset, VoxelFormer achieves competitive retrieval performance on subjects included during training with significantly fewer parameters than existing methods. These results highlight token merging and query-based transformers as promising strategies for parameter-efficient neural decoding.

CSI Compression Beyond Latents End-to-End Hybrid Attention-CNN Networks with Entropy Regularization

Authors: Maryam Ansarifard, Mostafa Rahmani, Mohit K. Sharma, Kishor C. Joshi, George Exarchakos, Alister Burr

2025-09-10

http://arxiv.org/abs/2509.08776v1

Massive MIMO systems rely on accurate Channel State Information (CSI) feedback to enable high-gain beam-forming. However, the feedback overhead scales linearly with the number of antennas, presenting a major bottleneck. While recent deep learning methods have improved CSI compression, most overlook the impact of quantization and entropy coding, limiting their practical deployability. In this work, we propose an end-to-end CSI compression framework that integrates a Spatial Correlation-Guided Attention Mechanism with quantization and entropy-aware training. Our model effectively exploits the spatial correlation among the antennas, thereby learning compact, entropy-optimized latent representations for efficient coding. This reduces the required feedback bitrates without sacrificing reconstruction accuracy, thereby yielding a superior rate-distortion trade-off. Experiments show that our method surpasses existing end-to-end CSI compression schemes, exceeding benchmark performance by an average of 21.5% on indoor datasets and 18.9% on outdoor datasets. The proposed framework results in a practical and efficient CSI feedback scheme.

CrowdQuery Density-Guided Query Module for Enhanced 2D and 3D Detection in Crowded Scenes

Authors: Marius Dähling, Sebastian Krebs, J. Marius Zöllner

2025-09-10

http://arxiv.org/abs/2509.08738v1

This paper introduces a novel method for end-to-end crowd detection that leverages object density information to enhance existing transformer-based detectors. We present CrowdQuery (CQ), whose core component is our CQ module that predicts and subsequently embeds an object density map. The embedded density information is then systematically integrated into the decoder. Existing density map definitions typically depend on head positions or object-based spatial statistics. Our method extends these definitions to include individual bounding box dimensions. By incorporating density information into object queries, our method utilizes density-guided queries to improve detection in crowded scenes. CQ is universally applicable to both 2D and 3D detection without requiring additional data. Consequently, we are the first to design a method that effectively bridges 2D and 3D detection in crowded environments. We demonstrate the integration of CQ into both a general 2D and 3D transformer-based object detector, introducing the architectures CQ2D and CQ3D. CQ is not limited to the specific models we selected. Experiments on the STCrowd dataset for both 2D and 3D domains show significant performance improvements compared to the base models, outperforming most state-of-the-art methods. When integrated into a state-of-the-art crowd detector, CQ can further improve performance on the challenging CrowdHuman dataset, demonstrating its generalizability. The code is released at https://github.com/mdaehl/CrowdQuery.

ChemBOMAS Accelerated BO in Chemistry with LLM-Enhanced Multi-Agent System

Authors: Dong Han, Zhehong Ai, Pengxiang Cai, Shuzhou Sun, Shanya Lu, Jianpeng Chen, Ben Gao, Lingli Ge, Weida Wang, Xiangxin Zhou, Xihui Liu, Mao Su, Wanli Ouyang, Lei Bai, Dongzhan Zhou, Tao XU, Yuqiang Li, Shufei Zhang

2025-09-10

http://arxiv.org/abs/2509.08736v1

The efficiency of Bayesian optimization (BO) in chemistry is often hindered by sparse experimental data and complex reaction mechanisms. To overcome these limitations, we introduce ChemBOMAS, a new framework named LLM-Enhanced Multi-Agent System for accelerating BO in chemistry. ChemBOMAS's optimization process is enhanced by LLMs and synergistically employs two strategies: knowledge-driven coarse-grained optimization and data-driven fine-grained optimization. First, in the knowledge-driven coarse-grained optimization stage, LLMs intelligently decompose the vast search space by reasoning over existing chemical knowledge to identify promising candidate regions. Subsequently, in the data-driven fine-grained optimization stage, LLMs enhance the BO process within these candidate regions by generating pseudo-data points, thereby improving data utilization efficiency and accelerating convergence. Benchmark evaluations further confirm that ChemBOMAS significantly enhances optimization effectiveness and efficiency compared to various BO algorithms. Importantly, the practical utility of ChemBOMAS was validated through wet-lab experiments conducted under pharmaceutical industry protocols, targeting conditional optimization for a previously unreported and challenging chemical reaction. In the wet-lab experiment, ChemBOMAS achieved an optimal objective value of 96%, substantially higher than the 15% achieved by domain experts. This real-world success, together with strong performance on benchmark evaluations, highlights ChemBOMAS as a powerful tool to accelerate chemical discovery.

Compressing CNN models for resource-constrained systems by channel and layer pruning

Authors: Ahmed Sadaqa, Di Liu

2025-09-10

http://arxiv.org/abs/2509.08714v1

Convolutional Neural Networks (CNNs) have achieved significant breakthroughs in various fields. However, these advancements have led to a substantial increase in the complexity and size of these networks. This poses a challenge when deploying large and complex networks on edge devices. Consequently, model compression has emerged as a research field aimed at reducing the size and complexity of CNNs. One prominent technique in model compression is model pruning. This paper presents a new pruning technique that combines both channel and layer pruning in what is called a "hybrid pruning framework". Inspired by EfficientNet, a renowned CNN architecture known for scaling up networks from both channel and layer perspectives, this hybrid approach applies the same principles but in reverse, where it scales down the network through pruning. Experiments on the hybrid approach demonstrated a notable decrease in the overall complexity of the model, with only a minimal reduction in accuracy compared to the baseline model. This complexity reduction translates into reduced latency when deploying the pruned models on an NVIDIA Jetson TX2 embedded AI device.

Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching

Authors: Siratish Sakpiboonchit

2025-09-10

http://arxiv.org/abs/2509.08696v1

This paper presents a method to accelerate the inference process of diffusion transformer (DiT)-based text-to-speech (TTS) models by applying a selective caching mechanism to transformer layers. Specifically, I integrate SmoothCache into the F5-TTS architecture, focusing on caching outputs of self-attention and feed-forward network layers to reduce redundant computations during the denoising process. A calibration phase is introduced to analyze L1 relative errors between timesteps, guiding the selection of cache schedules that minimize quality degradation. To address the problem of inter-layer dependency, a unified caching schedule is adopted, applying the cache pattern derived from self-attention layers to both layer types. Experiments on LibriSpeech-PC and Seed-TTS datasets evaluate various cache thresholds and denoising step configurations. Results show that caching at higher denoising steps reduces inference time without compromising output quality, whereas caching at lower steps can negatively impact synthesis quality similarly to reducing the total number of denoising steps. Objective and subjective metrics confirm the effectiveness of SmoothCache in maintaining performance while improving computational efficiency. Comparisons between cached inference and reduced-step inference further highlight the benefits of selective caching, especially under high-step configurations. This work demonstrates that transformer layer caching is a practical solution for optimizing diffusion transformer-based TTS models without requiring architectural changes or retraining. Example inference results can be heard at https://siratish.github.io/F5-TTS_SmoothCache/ .
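The calibration idea above (measure L1 relative error between timesteps, then decide where cached outputs can be reused) can be sketched as follows. This is a simplified illustration, not SmoothCache's actual procedure: the exact error normalization, the comparison against the last computed output, and the names `l1_relative_error` and `caching_schedule` are assumptions.

```python
from typing import List, Sequence


def l1_relative_error(prev: Sequence[float], curr: Sequence[float]) -> float:
    """One plausible L1 relative error: sum|curr - prev| / sum|curr|."""
    numerator = sum(abs(c - p) for p, c in zip(prev, curr))
    denominator = sum(abs(c) for c in curr)
    return numerator / denominator if denominator > 0 else float("inf")


def caching_schedule(outputs: List[Sequence[float]], threshold: float) -> List[bool]:
    """Return schedule[t] = True when step t may reuse the cached layer output.

    `outputs` holds one layer's calibration outputs per denoising step.
    A step reuses the cache when its output is within `threshold` of the
    most recently computed (non-cached) output.
    """
    schedule = [False]  # the first step is always computed
    last_computed = outputs[0]
    for t in range(1, len(outputs)):
        reuse = l1_relative_error(last_computed, outputs[t]) < threshold
        schedule.append(reuse)
        if not reuse:
            last_computed = outputs[t]
    return schedule


# Made-up calibration outputs for four denoising steps.
calibration = [[1.0, 1.0], [1.01, 1.0], [2.0, 2.0], [2.0, 2.02]]
print(caching_schedule(calibration, threshold=0.05))
```

A higher threshold reuses more steps and saves more compute, at the cost of larger deviation from the uncached outputs.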

Time-Dependent Modeling of the Sub-Hour Spectral Evolution During the 2013 Outburst of Mrk 421

Authors: MAGIC Collaboration, K. Abe, S. Abe, J. Abhir, A. Abhishek, A. Aguasca-Cabot, I. Agudo, T. Aniello, S. Ansoldi, L. A. Antonelli, A. Arbet Engels, C. Arcaro, T. T. H. Arnesen, A. Babić, C. Bakshi, U. Barres de Almeida, J. A. Barrio, L. Barrios-Jiménez, I. Batković, J. Baxter, J. Becerra González, W. Bednarek, E. Bernardini, J. Bernete, A. Berti, C. Bigongiari, A. Biland, O. Blanch, G. Bonnoli, Ž Bošnjak, E. Bronzini, I. Burelli, A. Campoy-Ordaz, A. Carosi, R. Carosi, M. Carretero-Castrillo, A. J. Castro-Tirado, D. Cerasole, G. Ceribella, Y. Chai, A. Cifuentes, J. L. Contreras, J. Cortina, S. Covino, F. D'Ammando, P. Da Vela, F. Dazzi, A. De Angelis, B. De Lotto, R. de Menezes, J. Delgado, C. Delgado Mendez, F. Di Pierro, R. Di Tria, L. Di Venere, A. Dinesh, D. Dominis Prester, A. Donini, D. Dorner, M. Doro, L. Eisenberger, D. Elsaesser, J. Escudero, L. Fariña, L. Foffano, L. Font, S. Fröse, Y. Fukazawa, R. J. García López, S. García Soto, M. Garczarczyk, S. Gasparyan, J. G. Giesbrecht Paiva, N. Giglietto, F. Giordano, P. Gliwny, T. Gradetzke, R. Grau, D. Green, J. G. Green, P. Günther, A. Hahn, T. Hassan, L. Heckmann, J. Herrera Llorente, D. Hrupec, D. Israyelyan, J. Jahanvi, I. Jiménez Martínez, J. Jiménez Quiles, J. Jormanainen, S. Kankkunen, T. Kayanoki, J. Konrad, P. M. Kouch, G. Koziol, H. Kubo, J. Kushida, M. Laínez, A. Lamastra, E. Lindfors, S. Lombardi, F. Longo, M. López-Moya, A. López-Oramas, S. Loporchio, L. Lulić, E. Lyard, P. Majumdar, M. Makariev, M. Mallamaci, G. Maneva, M. Manganaro, S. Mangano, K. Mannheim, S. Marchesi, M. Mariotti, M. Martínez, P. Maruševec, S. Menchiari, J. Méndez Gallego, S. Menon, D. Miceli, J. M. Miranda, R. Mirzoyan, M. Molero González, E. Molina, H. A. Mondal, A. Moralejo, C. Nanci, A. Negro, V. Neustroev, L. Nickel, M. Nievas Rosillo, C. Nigro, L. Nikolić, S. Nozaki, A. Okumura, J. Otero-Santos, S. Paiano, D. Paneque, R. Paoletti, J. M. Paredes, M. Peresano, M. Persic, M. Pihet, G. Pirola, F. Podobnik, P. G. Prada Moroni, E. Prandini, W. Rhode, M. Ribó, J. Rico, A. Roy, N. Sahakyan, F. G. Saturni, K. Schmitz, F. Schmuckermaier, T. Schweizer, A. Sciaccaluga, G. Silvestri, A. Simongini, J. Sitarek, V. Sliusar, D. Sobczynska, A. Stamerra, J. Strišković, D. Strom, M. Strzys, Y. Suda, H. Tajima, R. Takeishi, F. Tavecchio, T. Terzić, M. Teshima, A. Tutone, S. Ubach, J. van Scherpenberg, M. Vazquez Acosta, S. Ventura, G. Verna, I. Viale, A. Vigliano, C. F. Vigorito, E. Visentin, V. Vitale, I. Vovk, R. Walter, F. Wersig, M. Will, T. Yamamoto, P. K. H. Yeung, M. Petropoulou, M. Polkas

2025-09-10

http://arxiv.org/abs/2509.08686v1

In April 2013, the TeV blazar Markarian 421 underwent one of its most powerful emission outbursts to date. An extensive multi-instrument campaign featuring MAGIC, VERITAS, and NuSTAR provided comprehensive very-high-energy (VHE; $E > 100$ GeV) and X-ray coverage over nine consecutive days. In this work, we perform a detailed spectral analysis of the X-ray and VHE emissions on sub-hour timescales throughout the flare. We identify several clockwise spectral hysteresis loops in the X-rays, revealing a spectral evolution more complex than a simple harder-when-brighter trend. The VHE spectrum extends beyond 10 TeV, and its temporal evolution closely mirrors the behavior in the X-rays. We report the first evidence of VHE spectral hysteresis occurring simultaneously with the X-ray loops. To interpret these findings, we apply a time-dependent leptonic model to 240 broadband spectral energy distributions (SEDs) binned on a 15-minute scale, allowing us to self-consistently track the particle distribution's history. Our modeling shows that the majority of the sub-hour flux and spectral variations are driven by changes in the luminosity and slope of the injected electron distribution. The required variations in the electron slope are difficult to reconcile with magnetic reconnection but are consistent with a shock-acceleration scenario where the shock compression ratio evolves by a factor of $\sim 2$. The model also points to a relatively stable magnetic field and emitting region size, favoring a scenario where the emission originates from a stationary feature in the jet, such as a recollimation shock. However, this scenario requires a jet Lorentz factor that significantly exceeds values from VLBI measurements to account for the high minimum electron energy implied by the lack of variability in the optical band.

Deep Unrolling of Sparsity-Induced RDO for 3D Point Cloud Attribute Coding

Authors: Tam Thuc Do, Philip A. Chou, Gene Cheung

2025-09-10

http://arxiv.org/abs/2509.08685v1

Given encoded 3D point cloud geometry available at the decoder, we study the problem of lossy attribute compression in a multi-resolution B-spline projection framework. A target continuous 3D attribute function is first projected onto a sequence of nested subspaces $\mathcal{F}^{(p)}_{l_0} \subseteq \cdots \subseteq \mathcal{F}^{(p)}_{L}$, where $\mathcal{F}^{(p)}_{l}$ is a family of functions spanned by a B-spline basis function of order $p$ at a chosen scale and its integer shifts. The projected low-pass coefficients $F_l^*$ are computed by variable-complexity unrolling of a rate-distortion (RD) optimization algorithm into a feed-forward network, where the rate term is the sparsity-promoting $\ell_1$-norm. Thus, the projection operation is end-to-end differentiable. For a chosen coarse-to-fine predictor, the coefficients are then adjusted to account for the prediction from a lower resolution to a higher resolution, which is also optimized in a data-driven manner.

BitROM Weight Reload-Free CiROM Architecture Towards Billion-Parameter 1.58-bit LLM Inference

Authors: Wenlun Zhang, Xinyu Li, Shimpei Ando, Kentaro Yoshioka

2025-09-10

http://arxiv.org/abs/2509.08542v1

Compute-in-Read-Only-Memory (CiROM) accelerators offer outstanding energy efficiency for CNNs by eliminating runtime weight updates. However, their scalability to Large Language Models (LLMs) is fundamentally constrained by their vast parameter sizes. Notably, LLaMA-7B, the smallest model in the LLaMA series, demands more than 1,000 cm$^2$ of silicon area even in advanced CMOS nodes. This paper presents BitROM, the first CiROM-based accelerator that overcomes this limitation through co-design with BitNet's 1.58-bit quantization model, enabling practical and efficient LLM inference at the edge. BitROM introduces three key innovations: 1) a novel Bidirectional ROM Array that stores two ternary weights per transistor; 2) a Tri-Mode Local Accumulator optimized for ternary-weight computations; and 3) an integrated Decode-Refresh (DR) eDRAM that supports on-die KV-cache management, significantly reducing external memory access during decoding. In addition, BitROM integrates LoRA-based adapters to enable efficient transfer learning across various downstream tasks. Evaluated in 65nm CMOS, BitROM achieves 20.8 TOPS/W and a bit density of 4,967 kB/mm$^2$, offering a 10x improvement in area efficiency over prior digital CiROM designs. Moreover, the DR eDRAM contributes to a 43.6% reduction in external DRAM access, further enhancing deployment efficiency for LLMs in edge applications.

Two Sides of the Same Optimization Coin Model Degradation and Representation Collapse in Graph Foundation Models

Authors: Xunkai Li, Daohan Su, Sicheng Liu, Ru Zhang, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang

2025-09-10

http://arxiv.org/abs/2509.08401v2

Graph foundation models (GFMs), inspired by the success of LLMs, are designed to learn the optimal embedding from multi-domain TAGs for the downstream cross-task generalization capability. During our investigation, graph VQ-MAE stands out among the increasingly diverse landscape of GFM architectures. This is attributed to its ability to jointly encode topology and textual attributes from multiple domains into discrete embedding spaces with clear semantic boundaries. Despite its potential, domain generalization conflicts cause imperceptible pitfalls. In this paper, we instantiate two of them, and they are just like two sides of the same GFM optimization coin - Side 1 Model Degradation: The encoder and codebook fail to capture the diversity of inputs; Side 2 Representation Collapse: The hidden embedding and codebook vector fail to preserve semantic separability due to constraints from narrow representation subspaces. These two pitfalls (sides) collectively impair the decoder and generate low-quality reconstructed supervision, causing the GFM optimization dilemma during pre-training (coin). Through empirical investigation, we attribute the above challenges to Information Bottleneck and Regularization Deficit. To address them, we propose MoT (Mixture-of-Tinkers) - (1) Information Tinker for Two Pitfalls, which utilizes an edge-wise semantic fusion strategy and a mixture-of-codebooks with domain-aware routing to improve information capacity. (2) Regularization Tinker for Optimization Coin, which utilizes two additional regularizations to further improve gradient supervision in our proposed Information Tinker. Notably, as a flexible architecture, MoT adheres to the scaling laws of GFM, offering a controllable model scale. Compared to SOTA baselines, experiments on 22 datasets across 6 domains demonstrate that MoT achieves significant improvements in supervised, few-shot, and zero-shot scenarios.

Efficient Decoding Methods for Language Models on Encrypted Data

Authors: Matan Avitan, Moran Baruch, Nir Drucker, Itamar Zimerman, Yoav Goldberg

2025-09-10

http://arxiv.org/abs/2509.08383v1

Large language models (LLMs) power modern AI applications, but processing sensitive data on untrusted servers raises privacy concerns. Homomorphic encryption (HE) enables computation on encrypted data for secure inference. However, neural text generation requires decoding methods like argmax and sampling, which are non-polynomial and thus computationally expensive under encryption, creating a significant performance bottleneck. We introduce cutmax, an HE-friendly argmax algorithm that reduces ciphertext operations compared to prior methods, enabling practical greedy decoding under encryption. We also propose the first HE-compatible nucleus (top-p) sampling method, leveraging cutmax for efficient stochastic decoding with provable privacy guarantees. Both techniques are polynomial, supporting efficient inference in privacy-preserving settings. Moreover, their differentiability facilitates gradient-based sequence-level optimization as a polynomial alternative to straight-through estimators. We further provide strong theoretical guarantees for cutmax, proving it converges globally to a unique two-level fixed point, independent of the input values beyond the identity of the maximizer, which explains its rapid convergence in just a few iterations. Evaluations on realistic LLM outputs show latency reductions of 24x-35x over baselines, advancing secure text generation.
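As background for the decoding step discussed above, here is a plaintext reference sketch of nucleus (top-p) sampling. It is deliberately not HE-compatible and is not the paper's cutmax algorithm: the sorting and comparisons it relies on are exactly the non-polynomial operations the abstract says are expensive under encryption. The function names are illustrative.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs: Sequence[float], top_p: float) -> List[int]:
    """Smallest set of token ids, most probable first, with mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for token_id in order:
        kept.append(token_id)
        mass += probs[token_id]
        if mass >= top_p:
            break
    return kept


def sample_top_p(logits: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample a token id from the renormalized nucleus."""
    probs = softmax(logits)
    kept = nucleus(probs, top_p)
    # random.choices renormalizes the weights internally.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


rng = random.Random(0)
print(nucleus([0.5, 0.3, 0.15, 0.05], top_p=0.75))
print(sample_top_p([2.0, 1.0, 0.1, -1.0], top_p=0.9, rng=rng))
```

With a very small `top_p` the nucleus shrinks to the single most probable token, so the procedure reduces to greedy (argmax) decoding.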

Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video

Authors: Xiao Li, Qi Chen, Xiulian Peng, Kai Yu, Xie Chen, Yan Lu

2025-09-10

http://arxiv.org/abs/2509.08376v1

We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our proposed method is a self-supervised pipeline with fewer assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates a low-bitrate vector quantization as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we also show that our method can generalize to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.

Persistent-DPO A novel loss function and hybrid learning for generative quantum eigensolver

Authors: Junya Nakamura, Shinichiro Sanji

2025-09-10

http://arxiv.org/abs/2509.08351v1

We study the generative quantum eigensolver (GQE) \cite{nakaji2024generative}, which trains a classical generative model to produce quantum circuits with desired properties such as describing molecular ground states. We introduce two methods to improve GQE. First, we identify a limitation of direct preference optimization (DPO) when used as the loss function in GQE, and propose Persistent-DPO (P-DPO) as a solution to this limitation. Second, as a method to improve the online learning during the training phase of GQE, we introduce a hybrid approach that combines online and offline learning. Using a transformer decoder implementation of GQE, we evaluate our methods through ground state search experiments on the $\mathrm{BeH_2}$ molecule and observe that P-DPO achieves lower energies than DPO. The hybrid approach further improves convergence and final energy values, particularly with P-DPO.

Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism

Authors: Jiaming Yan, Jianchun Liu, Hongli Xu, Liusheng Huang

2025-09-10

http://arxiv.org/abs/2509.08342v1

Mixture-of-Experts (MoE) has emerged as a promising architecture for modern large language models (LLMs). However, massive parameters impose heavy GPU memory (i.e., VRAM) demands, hindering the widespread adoption of MoE LLMs. Offloading the expert parameters to CPU RAM offers an effective way to alleviate the VRAM requirements for MoE inference. Existing approaches typically cache a small subset of experts in VRAM and dynamically prefetch experts from RAM during inference, leading to significant degradation in inference speed due to the poor cache hit rate and substantial expert loading latency. In this work, we propose MoEpic, an efficient MoE inference system with a novel expert split mechanism. Specifically, each expert is vertically divided into two segments: top and bottom. MoEpic caches the top segment of hot experts, so that more experts will be stored under the limited VRAM budget, thereby improving the cache hit rate. During each layer's inference, MoEpic predicts and prefetches the activated experts for the next layer. Since the top segments of cached experts are exempt from fetching, the loading time is reduced, which allows efficient transfer-computation overlap. Nevertheless, the performance of MoEpic critically depends on the cache configuration (i.e., each layer's VRAM budget and expert split ratio). To this end, we propose a divide-and-conquer algorithm based on fixed-point iteration for adaptive cache configuration. Extensive experiments on popular MoE LLMs demonstrate that MoEpic can save about half of the GPU cost, while lowering the inference latency by about 37.51%-65.73% compared to the baselines.

Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing

Authors: Lukas Toral, Teddy Lazebnik

2025-09-10

http://arxiv.org/abs/2509.08329v1

Reinforcement Learning (RL) algorithms often require long training to become useful, especially in complex environments with sparse rewards. While techniques like reward shaping and curriculum learning exist to accelerate training, these are often extremely specific and require the developer's professionalism and dedicated expertise in the problem's domain. Tackling this challenge, in this study, we explore the effectiveness of pre-trained Large Language Models (LLMs) as tutors in a student-teacher architecture with RL algorithms, hypothesizing that LLM-generated guidance allows for faster convergence. In particular, we explore the effectiveness of reusing the LLM's advice on the RL's convergence dynamics. Through an extensive empirical examination, which included 54 configurations, varying the RL algorithm (DQN, PPO, A2C), LLM tutor (Llama, Vicuna, DeepSeek), and environment (Blackjack, Snake, Connect Four), our results demonstrate that LLM tutoring significantly accelerates RL convergence while maintaining comparable optimal performance. Furthermore, the advice reuse mechanism shows a further improvement in training duration but also results in less stable convergence dynamics. Our findings suggest that LLM tutoring generally improves convergence, and its effectiveness is sensitive to the specific task, RL algorithm, and LLM model combination.

EvolKV Evolutionary KV Cache Compression for LLM Inference

Authors: Bohan Yu, Yekun Chai

2025-09-10

http://arxiv.org/abs/2509.08315v1

Existing key-value (KV) cache compression methods typically rely on heuristics, such as uniform cache allocation across layers or static eviction policies; however, they ignore the critical interplays among layer-specific feature patterns and task performance, which can lead to degraded generalization. In this paper, we propose EvolKV, an adaptive framework for layer-wise, task-driven KV cache compression that jointly optimizes the memory efficiency and task performance. By reformulating allocation as a multi-objective optimization problem, EvolKV leverages evolutionary search to dynamically configure layer budgets while directly maximizing downstream performance. Extensive experiments on 11 tasks demonstrate that our approach outperforms all baseline methods across a wide range of budgets on long-context tasks and surpasses heuristic baselines by up to 7 percentage points on GSM8K. Notably, EvolKV achieves superior performance over the full KV cache setting on code completion while utilizing only 1.5% of the original budget, suggesting the untapped potential in learned strategies for budget allocation.

Towards Knowledge-Aware Document Systems Modeling Semantic Coverage Relations via Answerability Detection

Authors: Yehudit Aperstein, Alon Gottlib, Gal Benita, Alexander Apartsin

2025-09-10

http://arxiv.org/abs/2509.08304v1

Understanding how information is shared across documents, regardless of the format in which it is expressed, is critical for tasks such as information retrieval, summarization, and content alignment. In this work, we introduce a novel framework for modelling Semantic Coverage Relations (SCR), which classifies document pairs based on how their informational content aligns. We define three core relation types: equivalence, where both texts convey the same information using different textual forms or styles; inclusion, where one document fully contains the information of another and adds more; and semantic overlap, where each document presents partially overlapping content. To capture these relations, we adopt a question answering (QA)-based approach, using the answerability of shared questions across documents as an indicator of semantic coverage. We construct a synthetic dataset derived from the SQuAD corpus by paraphrasing source passages and selectively omitting information, enabling precise control over content overlap. This dataset allows us to benchmark generative language models and train transformer-based classifiers for SCR prediction. Our findings demonstrate that discriminative models significantly outperform generative approaches, with the RoBERTa-base model achieving the highest accuracy of 61.4% and the Random Forest-based model showing the best balance with a macro-F1 score of 52.9%. The results show that QA provides an effective lens for assessing semantic relations across stylistically diverse texts, offering insights into the capacity of current models to reason about information beyond surface similarity. The dataset and code developed in this study are publicly available to support reproducibility.

RTR A Transformer-Based Lossless Crossover with Perfect Phase Alignment

Authors: Xiangying Li, Jiankuan Li, Yong Tang

2025-09-10

http://arxiv.org/abs/2509.08272v1

This paper proposes a transformer-based lossless crossover method, termed Resonant Transformer Router (RTR), which achieves frequency separation while ensuring perfect phase alignment between low-frequency (LF) and high-frequency (HF) channels at the crossover frequency. The core property of RTR is that its frequency responses satisfy a linear complementary relation $H_{LF}(f) + H_{HF}(f) = 1$, so that the original signal can be perfectly reconstructed by linear summation of the two channels. Theoretical derivation and circuit simulations demonstrate that RTR provides superior energy efficiency, phase consistency, and robustness against component tolerances. Compared with conventional LC crossovers and digital FIR/IIR filters, RTR offers a low-loss, low-latency hardware-assisted filtering solution suitable for high-fidelity audio and communication front-ends. The core theory behind this paper's work, lossless crossover, is based on a Chinese patent [CN116318117A] developed from the previous research of one of the authors, Jianluan Li. We provide a comprehensive experimental validation of this theory and propose a new extension.
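
The complementary relation alone is enough to guarantee reconstruction by summation, which a few lines of Python can confirm. The sketch below uses a textbook first-order low-pass response as a stand-in and defines the high-pass branch as its complement. It is not the RTR circuit: the function names are hypothetical, and this simple pair does not have RTR's phase alignment (its two outputs are 90 degrees apart at every frequency).

```python
def h_lf(f: float, fc: float) -> complex:
    """First-order low-pass response with corner frequency fc (stand-in, not RTR)."""
    return 1.0 / (1.0 + 1j * f / fc)


def h_hf(f: float, fc: float) -> complex:
    """High-pass branch defined as the complement, so H_LF(f) + H_HF(f) = 1."""
    return 1.0 - h_lf(f, fc)


def reconstruct(lf_out: complex, hf_out: complex) -> complex:
    """Linear summation of the two channel outputs."""
    return lf_out + hf_out


if __name__ == "__main__":
    fc = 1000.0              # hypothetical crossover frequency in Hz
    x = 0.3 + 0.4j           # input phasor at the test frequency
    for f in (10.0, 1000.0, 20000.0):
        y = reconstruct(h_lf(f, fc) * x, h_hf(f, fc) * x)
        print(f, abs(y - x))  # reconstruction error, zero up to rounding
```

Any pair of responses that sums to one reconstructs the input exactly. What distinguishes a particular design is how each branch behaves on its own (roll-off, loss, phase), which this sketch does not model.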

Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning

Authors: Wei Huang, Anda Cheng, Yinggui Wang

2025-09-10

http://arxiv.org/abs/2509.08255v1

Recent advancements in large language models (LLMs) have shown impressive capabilities in various downstream tasks but typically face Catastrophic Forgetting (CF) during fine-tuning. In this paper, we propose the Forgetting-Aware Pruning Metric (FAPM), a novel pruning-based approach to balance CF and downstream task performance. Our investigation reveals that the degree to which task vectors (i.e., the subtraction of pre-trained weights from the weights fine-tuned on downstream tasks) overlap with pre-trained model parameters is a critical factor for CF. Based on this finding, FAPM employs the ratio of the task vector to pre-trained model parameters as a metric to quantify CF, integrating this measure into the pruning criteria. Importantly, FAPM does not necessitate modifications to the training process or model architecture, nor does it require any auxiliary data. We conducted extensive experiments across eight datasets, covering natural language inference, General Q&A, Medical Q&A, Math Q&A, reading comprehension, and cloze tests. The results demonstrate that FAPM limits CF to just 0.25% while maintaining 99.67% accuracy on downstream tasks. We provide the code to reproduce our results.

Strategies for Improving Communication Efficiency in Distributed and Federated Learning Compression, Local Training, and Personalization

Authors: Kai Yi

2025-09-10

http://arxiv.org/abs/2509.08233v1

Distributed and federated learning are essential paradigms for training models across decentralized data sources while preserving privacy, yet communication overhead remains a major bottleneck. This dissertation explores strategies to improve communication efficiency, focusing on model compression, local training, and personalization. We establish a unified framework for biased and unbiased compression operators with convergence guarantees, then propose adaptive local training strategies that incorporate personalization to accelerate convergence and mitigate client drift. In particular, Scafflix balances global and personalized objectives, achieving superior performance under both IID and non-IID settings. We further introduce privacy-preserving pruning frameworks that optimize sparsity while minimizing communication costs, with Cohort-Squeeze leveraging hierarchical aggregation to reduce cross-device overhead. Finally, SymWanda, a symmetric post-training pruning method, enhances robustness under high sparsity and maintains accuracy without retraining. Extensive experiments on benchmarks and large-scale language models demonstrate favorable trade-offs among accuracy, convergence, and communication, offering theoretical and practical insights for scalable, efficient distributed learning.

Sketched Gaussian Mechanism for Private Federated Learning

Authors: Qiaobo Li, Zhijie Chen, Arindam Banerjee

2025-09-09

http://arxiv.org/abs/2509.08195v1

Communication cost and privacy are two major considerations in federated learning (FL). For communication cost, gradient compression by sketching the clients' transmitted model updates is often used for reducing per-round communication. For privacy, the Gaussian mechanism (GM), which consists of clipping updates and adding Gaussian noise, is commonly used to guarantee client-level differential privacy. Existing literature on private FL analyzes privacy of sketching and GM in an isolated manner, illustrating that sketching provides privacy determined by the sketching dimension and that GM has to supply any additional desired privacy. In this paper, we introduce the Sketched Gaussian Mechanism (SGM), which directly combines sketching and the Gaussian mechanism for privacy. Using Rényi-DP tools, we present a joint analysis of SGM's overall privacy guarantee, which is significantly more flexible and sharper compared to isolated analysis of sketching and GM privacy. In particular, we prove that the privacy level of SGM for a fixed noise magnitude is proportional to $1/\sqrt{b}$, where $b$ is the sketching dimension, indicating that (for moderate $b$) SGM can provide much stronger privacy guarantees than the original GM under the same noise budget. We demonstrate the application of SGM to FL with either gradient descent or adaptive server optimizers, and establish theoretical results on optimization convergence, which exhibit only a logarithmic dependence on the number of parameters $d$. Experimental results confirm that at the same privacy level, SGM-based FL is at least competitive with non-sketching private FL variants and outperforms them in some settings. Moreover, using adaptive optimization at the server improves empirical performance while maintaining the privacy guarantees.
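
The ingredients named in the abstract (clipping, a sketch of dimension $b$, and Gaussian noise) can be combined in a few lines. The sketch below is a minimal illustration under stated assumptions and is not the authors' algorithm: the order of operations (clip, then project, then add noise), the Gaussian sketching matrix with entries of variance $1/b$, and all function and parameter names are hypothetical choices made for the example.

```python
import math
import random


def clip_l2(update, clip_norm):
    """Scale the update so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [x * scale for x in update]


def sketched_gaussian(update, b, clip_norm, sigma, sketch_seed=0, noise_rng=None):
    """Clip a d-dimensional update, project it to b dimensions with a random
    Gaussian sketch, then add N(0, sigma^2) noise to each coordinate.
    The ordering and the sketch distribution are assumptions of this example."""
    noise_rng = noise_rng or random.Random()
    sketch_rng = random.Random(sketch_seed)  # shared seed so all clients use one sketch
    clipped = clip_l2(update, clip_norm)
    out = []
    for _ in range(b):
        row = [sketch_rng.gauss(0.0, 1.0 / math.sqrt(b)) for _ in clipped]
        projection = sum(r * x for r, x in zip(row, clipped))
        out.append(projection + noise_rng.gauss(0.0, sigma))
    return out


if __name__ == "__main__":
    update = [3.0, 4.0, 0.0, 0.0]  # toy client update with d = 4
    print(sketched_gaussian(update, b=2, clip_norm=1.0, sigma=0.1))
```

Only `b` numbers leave the client instead of `d`, which is where the communication saving comes from. How much privacy the combination provides is the subject of the paper's joint analysis and is not something this toy code establishes.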

XML Prompting as Grammar-Constrained Interaction Fixed-Point Semantics, Convergence Guarantees, and Human-AI Protocols

Authors: Faruk Alpay, Taylan Alpay

2025-09-09

http://arxiv.org/abs/2509.08182v1

Structured prompting with XML tags has emerged as an effective way to steer large language models (LLMs) toward parseable, schema-adherent outputs in real-world systems. We develop a logic-first treatment of XML prompting that unifies (i) grammar-constrained decoding, (ii) fixed-point semantics over lattices of hierarchical prompts, and (iii) convergent human-AI interaction loops. We formalize a complete lattice of XML trees under a refinement order and prove that monotone prompt-to-prompt operators admit least fixed points (Knaster-Tarski) that characterize steady-state protocols; under a task-aware contraction metric on trees, we further prove Banach-style convergence of iterative guidance. We instantiate these results with context-free grammars (CFGs) for XML schemas and show how constrained decoding guarantees well-formedness while preserving task performance. A set of multi-layer human-AI interaction recipes demonstrates practical deployment patterns, including multi-pass "plan $\to$ verify $\to$ revise" routines and agentic tool use. We provide mathematically complete proofs and tie our framework to recent advances in grammar-aligned decoding, chain-of-verification, and programmatic prompting.
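
The least-fixed-point idea can be illustrated on a much smaller lattice than the one in the paper. The sketch below replaces XML trees under refinement with sets of tag names ordered by inclusion, and iterates a monotone operator from the bottom element until it stops changing. The tag names, the `REQUIRES` table, and both function names are hypothetical and chosen only for this example.

```python
def least_fixed_point(step, bottom=frozenset(), max_iters=1000):
    """Iterate a monotone operator from the bottom element until it stabilizes.
    On a finite lattice this reaches the least fixed point."""
    current = bottom
    for _ in range(max_iters):
        nxt = step(current)
        if nxt == current:
            return current
        current = nxt
    raise RuntimeError("no fixed point reached within max_iters")


# Hypothetical schema: each tag lists the tags that must accompany it.
REQUIRES = {
    "answer": {"plan", "verify"},
    "verify": {"evidence"},
    "plan": set(),
    "evidence": set(),
}


def refine(tags):
    """Monotone prompt-to-prompt operator: always require <answer>, then add
    whatever the tags already present require. It only ever adds tags."""
    out = set(tags) | {"answer"}
    for tag in tags:
        out |= REQUIRES.get(tag, set())
    return frozenset(out)


if __name__ == "__main__":
    print(sorted(least_fixed_point(refine)))
```

Starting from the empty prompt, the iteration adds `answer`, then `plan` and `verify`, then `evidence`, and then stops. The resulting set is the smallest prompt structure that is stable under the operator, a toy analogue of the steady-state protocols the abstract describes.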

OCTANE -- Optimal Control for Tensor-based Autoencoder Network Emergence Explicit Case

Authors: Ratna Khatri, Anthony Kolshorn, Colin Olson, Harbir Antil

2025-09-09

http://arxiv.org/abs/2509.08169v1

This paper presents a novel, mathematically rigorous framework for autoencoder-type deep neural networks that combines optimal control theory and low-rank tensor methods to yield memory-efficient training and automated architecture discovery. The learning task is formulated as an optimization problem constrained by differential equations representing the encoder and decoder components of the network, and the corresponding optimality conditions are derived via a Lagrangian approach. Efficient memory compression is enabled by approximating differential equation solutions on low-rank tensor manifolds using an adaptive explicit integration scheme. These concepts are combined to form OCTANE (Optimal Control for Tensor-based Autoencoder Network Emergence) -- a unified training framework that yields compact autoencoder architectures, reduces memory usage, and enables effective learning, even with limited training data. The framework's utility is illustrated with application to image denoising and deblurring tasks, and recommendations regarding governing hyperparameters are provided.

SCA-LLM Spectral-Attentive Channel Prediction with Large Language Models in MIMO-OFDM

Authors: Ke He, Le He, Lisheng Fan, Xianfu Lei, Thang X. Vu, George K. Karagiannidis, Symeon Chatzinotas

2025-09-09

http://arxiv.org/abs/2509.08139v1

In recent years, the success of large language models (LLMs) has inspired growing interest in exploring their potential applications in wireless communications, especially for channel prediction tasks. However, directly applying LLMs to channel prediction faces a domain mismatch issue stemming from their text-based pre-training. To mitigate this, the "adapter + LLM" paradigm has emerged, where an adapter is designed to bridge the domain gap between the channel state information (CSI) data and LLMs. While showing initial success, existing adapters may not fully exploit the potential of this paradigm. To address this limitation, this work provides a key insight that learning representations from the spectral components of CSI features can more effectively help bridge the domain gap. Accordingly, we propose a spectral-attentive framework, named SCA-LLM, for channel prediction in multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) systems. Specifically, its novel adapter can capture finer spectral details and better adapt the LLM for channel prediction than previous methods. Extensive simulations show that SCA-LLM achieves state-of-the-art prediction performance and strong generalization, yielding up to $-2.4~\text{dB}$ normalized mean squared error (NMSE) advantage over the previous LLM-based method. Ablation studies further confirm the superiority of SCA-LLM in mitigating domain mismatch.

Tensor-Train Operator Inference

Authors: Engin Danis, Duc Truong, Kim Ø. Rasmussen, Boian S. Alexandrov

2025-09-09

http://arxiv.org/abs/2509.08071v1

In this study, we present a tensor-train framework for nonintrusive operator inference aimed at learning discrete operators and using them to predict solutions of physical governing equations. Our framework comprises three approaches: full-order tensor-train operator inference, full-order quantized tensor-train operator inference, and reduced-order tensor-train operator inference. In each case, snapshot data is represented in tensor-train format, either through compression or cross interpolation, enabling the efficient handling of extremely large datasets with significantly reduced computational effort compared to standard methods. The effectiveness of each approach is demonstrated through numerical experiments related to Computational Fluid Dynamics and benchmarked against the standard reduced-order operator inference method, highlighting the advantages of the tensor-train representations in both accuracy and scalability.

Feature Space Analysis by Guided Diffusion Model

Authors: Kimiaki Shirahama, Miki Yanobu, Kaduki Yamashita, Miho Ohsaki

2025-09-09

http://arxiv.org/abs/2509.07936v1

One of the key issues in Deep Neural Networks (DNNs) is the black-box nature of their internal feature extraction process. Targeting vision-related domains, this paper focuses on analysing the feature space of a DNN by proposing a decoder that can generate images whose features are guaranteed to closely match a user-specified feature. Owing to this guarantee, which is missing from past studies, our decoder allows us to evidence which of various attributes in an image are encoded into a feature by the DNN, by generating images whose features are in proximity to that feature. Our decoder is implemented as a guided diffusion model that guides the reverse image generation of a pre-trained diffusion model to minimise the Euclidean distance between the feature of a clean image estimated at each step and the user-specified feature. One practical advantage of our decoder is that it can analyse feature spaces of different DNNs with no additional training and run on a single COTS GPU. The experimental results targeting CLIP's image encoder, ResNet-50 and vision transformer demonstrate that images generated by our decoder have features remarkably similar to the user-specified ones and reveal valuable insights into these DNNs' feature spaces.

Biased Tales Cultural and Topic Bias in Generating Children's Stories

Authors: Donya Rooein, Vilém Zouhar, Debora Nozza, Dirk Hovy

2025-09-09

http://arxiv.org/abs/2509.07908v1

Stories play a pivotal role in human communication, shaping beliefs and morals, particularly in children. As parents increasingly rely on large language models (LLMs) to craft bedtime stories, the presence of cultural and gender stereotypes in these narratives raises significant concerns. To address this issue, we present Biased Tales, a comprehensive dataset designed to analyze how biases influence protagonists' attributes and story elements in LLM-generated stories. Our analysis uncovers striking disparities. When the protagonist is described as a girl (as compared to a boy), appearance-related attributes increase by 55.26%. Stories featuring non-Western children disproportionately emphasize cultural heritage, tradition, and family themes far more than those for Western children. Our findings highlight the role of sociocultural bias in making creative AI use more equitable and diverse.

A Robot That Listens Enhancing Self-Disclosure and Engagement Through Sentiment-based Backchannels and Active Listening

Authors: Hieu Tran, Go-Eum Cha, Sooyeon Jeong

2025-09-09

http://arxiv.org/abs/2509.07873v1

As social robots get more deeply integrated into our everyday lives, they will be expected to engage in meaningful conversations and exhibit socio-emotionally intelligent listening behaviors when interacting with people. Active listening and backchanneling could be one way to enhance robots' communicative capabilities and enhance their effectiveness in eliciting deeper self-disclosure, providing a sense of empathy, and forming positive rapport and relationships with people. Thus, we developed an LLM-powered social robot that can exhibit contextually appropriate sentiment-based backchanneling and active listening behaviors (active listening+backchanneling) and compared its efficacy in eliciting people's self-disclosure to robots that do not exhibit any of these listening behaviors (control) and a robot that only exhibits backchanneling behavior (backchanneling-only). Through our experimental study with sixty-five participants, we found the participants who conversed with the active listening robot perceived the interactions more positively, exhibited the highest self-disclosures, and reported the strongest sense of being listened to. The results of our study suggest that the implementation of active listening behaviors in social robots has the potential to improve human-robot communication and could further contribute to the building of deeper human-robot relationships and rapport.

Are Humans as Brittle as Large Language Models?

Authors: Jiahui Li, Sean Papay, Roman Klinger

2025-09-09

http://arxiv.org/abs/2509.07869v1

The output of large language models (LLMs) is unstable, due both to non-determinism of the decoding process and to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to instruction changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects human annotation variances. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs and identical instruction modifications for human annotators, focusing on the question of whether humans are similarly sensitive to prompt perturbations. To study this, we prompt both humans and LLMs for a set of text classification tasks conditioned on prompt variations. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.

Query Expansion in the Age of Pre-trained and Large Language Models A Comprehensive Survey

Authors: Minghan Li, Xinxuan Lv, Junjie Zou, Tongna Chen, Chao Zhang, Suchao An, Ercong Nie, Guodong Zhou

2025-09-09

http://arxiv.org/abs/2509.07794v1

Modern information retrieval (IR) must bridge short, ambiguous queries and ever more diverse, rapidly evolving corpora. Query Expansion (QE) remains a key mechanism for mitigating vocabulary mismatch, but the design space has shifted markedly with pre-trained language models (PLMs) and large language models (LLMs). This survey synthesizes the field from three angles: (i) a four-dimensional framework of query expansion - from the point of injection (explicit vs. implicit QE), through grounding and interaction (knowledge bases, model-internal capabilities, multi-turn retrieval) and learning alignment, to knowledge graph-based argumentation; (ii) a model-centric taxonomy spanning encoder-only, encoder-decoder, decoder-only, instruction-tuned, and domain/multilingual variants, highlighting their characteristic affordances for QE (contextual disambiguation, controllable generation, zero-/few-shot reasoning); and (iii) practice-oriented guidance on where and how neural QE helps in first-stage retrieval, multi-query fusion, re-ranking, and retrieval-augmented generation (RAG). We compare traditional query expansion with PLM/LLM-based methods across seven key aspects, and we map applications across web search, biomedicine, e-commerce, open-domain QA/RAG, conversational and code search, and cross-lingual settings. The review distills design grounding and interaction, alignment/distillation (SFT/PEFT/DPO), and KG constraints - as robust remedies to topic drift and hallucination. We conclude with an agenda on quality control, cost-aware invocation, domain/temporal adaptation, evaluation beyond end-task metrics, and fairness/privacy. Collectively, these insights provide a principled blueprint for selecting and combining QE techniques under real-world constraints.

SEEC Segmentation-Assisted Multi-Entropy Models for Learned Lossless Image Compression

Authors: Chunhang Zheng, Zichang Ren, Dou Li

2025-09-09

http://arxiv.org/abs/2509.07704v1

Recently, learned image compression has attracted considerable attention due to its superior performance over traditional methods. However, most existing approaches employ a single entropy model to estimate the probability distribution of pixel values across the entire image, which limits their ability to capture the diverse statistical characteristics of different semantic regions. To overcome this limitation, we propose Segmentation-Assisted Multi-Entropy Models for Lossless Image Compression (SEEC). Our framework utilizes semantic segmentation to guide the selection and adaptation of multiple entropy models, enabling more accurate probability distribution estimation for distinct semantic regions. Specifically, SEEC first extracts image features and then applies semantic segmentation to identify different regions, each assigned a specialized entropy model to better capture its unique statistical properties. Finally, a multi-channel discrete logistic mixture likelihood is employed to model the pixel value distributions effectively. Experimental results on benchmark datasets demonstrate that SEEC achieves state-of-the-art compression ratios while introducing only minimal encoding and decoding latency. With superior performance, the proposed model also supports Regions of Interest (ROIs) coding conditioned on the provided segmentation mask. Our code is available at https://github.com/chunbaobao/SEEC.

Unleashing the True Potential of LLMs A Feedback-Triggered Self-Correction with Long-Term Multipath Decoding

Authors: Jipeng Li, Zeyu Gao, Yubin Qi, Hande Dong, Weijian Chen, Qiang Lin

2025-09-09

http://arxiv.org/abs/2509.07676v1

Large Language Models (LLMs) have achieved remarkable performance across diverse tasks, yet their susceptibility to generating incorrect content during inference remains a critical unsolved challenge. While self-correction methods offer potential solutions, their effectiveness is hindered by two inherent limitations: (1) the absence of reliable guidance signals for error localization, and (2) the restricted reasoning depth imposed by conventional next-token decoding paradigms. To address these issues, we propose Feedback-Triggered Regeneration (FTR), a novel framework that synergizes user feedback with enhanced decoding dynamics. Specifically, FTR activates response regeneration only upon receiving negative user feedback, thereby circumventing error propagation from faulty self-assessment while preserving originally correct outputs. Furthermore, we introduce Long-Term Multipath (LTM) decoding, which enables systematic exploration of multiple reasoning trajectories through delayed sequence evaluation, effectively overcoming the myopic decision-making characteristic of standard next-token prediction. Extensive experiments on mathematical reasoning and code generation benchmarks demonstrate that our framework achieves consistent and significant improvements over state-of-the-art prompt-based self-correction methods.

Collaborative Exploration with a Marsupial Ground-Aerial Robot Team through Task-Driven Map Compression

Authors: Angelos Zacharia, Mihir Dharmadhikari, Kostas Alexis

2025-09-09

http://arxiv.org/abs/2509.07655v1

Efficient exploration of unknown environments is crucial for autonomous robots, especially in confined and large-scale scenarios with limited communication. To address this challenge, we propose a collaborative exploration framework for a marsupial ground-aerial robot team that leverages the complementary capabilities of both platforms. The framework employs a graph-based path planning algorithm to guide exploration and deploy the aerial robot in areas where its expected gain significantly exceeds that of the ground robot, such as large open spaces or regions inaccessible to the ground platform, thereby maximizing coverage and efficiency. To facilitate large-scale spatial information sharing, we introduce a bandwidth-efficient, task-driven map compression strategy. This method enables each robot to reconstruct resolution-specific volumetric maps while preserving exploration-critical details, even at high compression rates. By selectively compressing and sharing key data, communication overhead is minimized, ensuring effective map integration for collaborative path planning. Simulation and real-world experiments validate the proposed approach, demonstrating its effectiveness in improving exploration efficiency while significantly reducing data transmission.

Topology-Aware Optimization of Gaussian Primitives for Human-Centric Volumetric Videos

Authors: Yuheng Jiang, Chengcheng Guo, Yize Wu, Yu Hong, Shengkun Zhu, Zhehao Shen, Yingliang Zhang, Shaohui Jiao, Zhuo Su, Lan Xu, Marc Habermann, Christian Theobalt

2025-09-09

http://arxiv.org/abs/2509.07653v1

Volumetric video is emerging as a key medium for digitizing the dynamic physical world, creating virtual environments with six degrees of freedom to deliver immersive user experiences. However, robustly modeling general dynamic scenes, especially those involving topological changes, while maintaining long-term tracking remains a fundamental challenge. In this paper, we present TaoGS, a novel topology-aware dynamic Gaussian representation that disentangles motion and appearance to support both long-range tracking and topological adaptation. We represent scene motion with a sparse set of motion Gaussians, which are continuously updated by a spatio-temporal tracker and photometric cues that detect structural variations across frames. To capture fine-grained texture, each motion Gaussian anchors and dynamically activates a set of local appearance Gaussians, which are non-rigidly warped to the current frame to provide strong initialization and significantly reduce training time. This activation mechanism enables efficient modeling of detailed textures and maintains temporal coherence, allowing high-fidelity rendering even under challenging scenarios such as changing clothes. To enable seamless integration into codec-based volumetric formats, we introduce a global Gaussian Lookup Table that records the lifespan of each Gaussian and organizes attributes into a lifespan-aware 2D layout. This structure aligns naturally with standard video codecs and supports up to 40× compression. TaoGS provides a unified, adaptive solution for scalable volumetric video under topological variation, capturing moments where "elegance in motion" and "Power in Stillness", delivering immersive experiences that harmonize with the physical world.

MaLei at MultiClinSUM Summarisation of Clinical Documents using Perspective-Aware Iterative Self-Prompting with LLMs

Authors: Libo Ren, Yee Man Ng, Lifeng Han

2025-09-09

http://arxiv.org/abs/2509.07622v1

Efficient communication between patients and clinicians plays an important role in shared decision-making. However, clinical reports are often lengthy and filled with clinical jargon, making it difficult for domain experts to identify important aspects in the document efficiently. This paper presents the methodology we applied in the MultiClinSUM shared task for summarising clinical case documents. We used an Iterative Self-Prompting technique on large language models (LLMs) by asking LLMs to generate task-specific prompts and refine them via example-based few-shot learning. Furthermore, we used lexical and embedding space metrics, ROUGE and BERT-score, to guide the model fine-tuning with epochs. Our submission using perspective-aware ISP on GPT-4 and GPT-4o achieved ROUGE scores (46.53, 24.68, 30.77) and BERTscores (87.84, 83.25, 85.46) for (P, R, F1) from the official evaluation on 3,396 clinical case reports from various specialties extracted from open journals. The high BERTscore indicates that the model produced semantically equivalent output summaries compared to the references, even though the overlap at the exact lexicon level is lower, as reflected in the lower ROUGE scores. This work sheds some light on how perspective-aware ISP (PA-ISP) can be deployed for clinical report summarisation and support better communication between patients and clinicians.

PanoLAM Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image

Authors: Peng Li, Yisheng He, Yingdong Hu, Yuan Dong, Weihao Yuan, Yuan Liu, Zilong Dong, Yike Guo

2025-09-09

http://arxiv.org/abs/2509.07552v1

We present a feed-forward framework for Gaussian full-head synthesis from a single unposed image. Unlike previous work that relies on time-consuming GAN inversion and test-time optimization, our framework can reconstruct the Gaussian full-head model given a single unposed image in a single forward pass. This enables fast reconstruction and rendering during inference. To mitigate the lack of large-scale 3D head assets, we propose a large-scale synthetic dataset from trained 3D GANs and train our framework using only synthetic data. For efficient high-fidelity generation, we introduce a coarse-to-fine Gaussian head generation pipeline, where sparse points from the FLAME model interact with the image features by transformer blocks for feature extraction and coarse shape reconstruction, which are then densified for high-fidelity reconstruction. To fully leverage the prior knowledge residing in pretrained 3D GANs for effective reconstruction, we propose a dual-branch framework that effectively aggregates the structured spherical triplane feature and unstructured point-based features for more effective Gaussian head reconstruction. Experimental results show the effectiveness of our framework compared with existing work.

PatchSeeker Mapping NVD Records to their Vulnerability-fixing Commits with LLM Generated Commits and Embeddings

Authors: Huu Hung Nguyen, Anh Tuan Nguyen, Thanh Le-Cong, Yikun Li, Han Wei Ang, Yide Yin, Frank Liauw, Shar Lwin Khin, Ouh Eng Lieh, Ting Zhang, David Lo

2025-09-09

http://arxiv.org/abs/2509.07540v1

Software vulnerabilities pose serious risks to modern software ecosystems. While the National Vulnerability Database (NVD) is the authoritative source for cataloging these vulnerabilities, it often lacks explicit links to the corresponding Vulnerability-Fixing Commits (VFCs). VFCs encode precise code changes, enabling vulnerability localization, patch analysis, and dataset construction. Automatically mapping NVD records to their true VFCs is therefore critical. Existing approaches have limitations as they rely on sparse, often noisy commit messages and fail to capture the deep semantics in the vulnerability descriptions. To address this gap, we introduce PatchSeeker, a novel method that leverages large language models to create rich semantic links between vulnerability descriptions and their VFCs. PatchSeeker generates embeddings from NVD descriptions and enhances commit messages by synthesizing detailed summaries for those that are short or uninformative. These generated messages act as a semantic bridge, effectively closing the information gap between natural language reports and low-level code changes. PatchSeeker achieves 59.3% higher MRR and 27.9% higher Recall@10 than the best-performing baseline, Prospector, on the benchmark dataset. The extended evaluation on recent CVEs further confirms PatchSeeker's effectiveness. An ablation study shows that both the commit message generation method and the selection of backbone LLMs make a positive contribution to PatchSeeker. We also discuss limitations and open challenges to guide future work.

Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data

Authors: Gokul Karthik Kumar, Rishabh Saraf, Ludovick Lepauloux, Abdul Muneer, Billel Mokeddem, Hakim Hacid

2025-09-09

http://arxiv.org/abs/2509.07526v1

Large language models (LLMs) have transformed NLP, yet their integration with audio remains underexplored -- despite audio's centrality to human communication. We introduce Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably small amount of public audio data -- less than 30K hours (5K unique) -- Falcon3-Audio-7B matches the best reported performance among open-weight models on the MMAU benchmark, with a score of 64.14, matching R1-AQA, while distinguishing itself through superior data and parameter efficiency, single-stage training, and transparency. Notably, our smallest 1B model remains competitive with larger open models ranging from 2B to 13B parameters. Through extensive ablations, we find that common complexities -- such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors -- are not required for strong performance, even compared to models trained on over 500K hours of data.

Multi-view-guided Passage Reranking with Large Language Models

Authors: Jeongwoo Na, Jun Kwon, Eunseong Choi, Jongwuk Lee

2025-09-09

http://arxiv.org/abs/2509.07485v1

Recent advances in large language models (LLMs) have shown impressive performance in passage reranking tasks. Despite their success, LLM-based methods still face challenges in efficiency and sensitivity to external biases. (1) Existing models rely mostly on autoregressive generation and sliding window strategies to rank passages, which incur heavy computational overhead as the number of passages increases. (2) External biases, such as position or selection bias, hinder the model's ability to accurately represent passages and increase input-order sensitivity. To address these limitations, we introduce a novel passage reranking model, called Multi-View-guided Passage Reranking (MVP). MVP is a non-generative LLM-based reranking method that encodes query-passage information into diverse view embeddings without being influenced by external biases. For each view, it combines query-aware passage embeddings to produce a distinct anchor vector, which is then used to directly compute relevance scores in a single step. In addition, it employs an orthogonal loss to make the views more distinctive. Extensive experiments demonstrate that MVP, with just 220M parameters, matches the performance of much larger 7B-scale fine-tuned models while achieving a 100x reduction in inference latency. Notably, the 3B-parameter variant of MVP achieves state-of-the-art performance on both in-domain and out-of-domain benchmarks. The source code is available at: https://github.com/bulbna/MVP
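
The abstract does not spell out the orthogonal loss, so the following is only a sketch of one common way to penalize overlapping view vectors: sum the squared cosine similarity over every pair of distinct views. The function name and the exact form of the penalty are assumptions, and the input vectors are assumed to be non-zero.

```python
import math


def orthogonality_penalty(views):
    """Sum of squared cosine similarities between distinct view vectors.

    Returns 0.0 when all views are mutually orthogonal (assumed loss form,
    not necessarily the one used by MVP).
    """
    unit = []
    for vector in views:
        norm = math.sqrt(sum(x * x for x in vector))
        unit.append([x / norm for x in vector])
    penalty = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cosine = sum(a * b for a, b in zip(unit[i], unit[j]))
            penalty += cosine * cosine
    return penalty
```

Minimizing such a term pushes the views apart, which matches the stated goal of making them more distinctive.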

DuoServe-MoE Dual-Phase Expert Prefetch and Cache Scheduling for Efficient MoE LLM Inference

Authors: Yuning Zhang, Grant Pinkert, Nan Yang, Yanli Li, Dong Yuan

2025-09-09

http://arxiv.org/abs/2509.07379v1

Large Language Models (LLMs) have demonstrated impressive performance across a wide range of deep learning tasks. Mixture of Experts (MoE) further enhances their capabilities by increasing model width through sparsely activated expert branches, which keeps inference computation efficient. However, the large number of expert weights introduces significant GPU memory pressure, especially in resource-constrained environments such as single-GPU servers. More importantly, MoE inference consists of two fundamentally different stages: a prefill stage where most experts are activated densely, and a decode stage where only a few experts are triggered sparsely. Treating these stages with a uniform scheduling strategy often leads to suboptimal latency and memory usage. To address this, we propose DuoServe-MoE, an inference system that explicitly separates prefill and decode stages and applies tailored expert scheduling strategies to each. In the prefill stage, DuoServe-MoE uses a two-stream CUDA pipeline that overlaps expert weight prefetching with the computation of non-MoE layers, limiting expert residency in GPU memory. In the decode stage, a lightweight layer-level predictor trained offline from activation traces is used to prefetch only the most likely activated experts, without requiring any changes to the model. Experiments on 4-bit Mixtral-8x7B and 8x22B models show that DuoServe-MoE improves end-to-end latency by 1.42 to 7.54 times while keeping peak memory usage at only 15 percent of the full model size.
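
The decode-stage idea of prefetching only the most likely experts can be illustrated with a small selection routine. This is a sketch under stated assumptions: the predictor is represented by a plain list of per-expert scores, and `experts_to_prefetch`, `resident`, and `budget` are hypothetical names, not part of DuoServe-MoE.

```python
def experts_to_prefetch(predicted_scores, resident, budget):
    """Pick the `budget` highest-scoring experts for the next layer and
    return only those that are not already resident in GPU memory."""
    ranked = sorted(
        range(len(predicted_scores)),
        key=lambda expert: predicted_scores[expert],
        reverse=True,
    )
    wanted = ranked[:budget]
    return [expert for expert in wanted if expert not in resident]
```

For example, with scores `[0.1, 0.6, 0.05, 0.25]`, expert 1 already resident, and a budget of 2, only expert 3 needs to be loaded.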

PersonaFuse A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions

Authors: Yixuan Tang, Yi Yang, Ahmed Abbasi

2025-09-09

http://arxiv.org/abs/2509.07370v2

Recent advancements in Large Language Models (LLMs) demonstrate remarkable capabilities across various fields. These developments have led to more direct communication between humans and LLMs in various situations, such as social companionship and psychological support. However, LLMs often exhibit limitations in emotional perception and social competence during real-world conversations. These limitations partly originate from their inability to adapt their communication style and emotional expression to different social and task contexts. In this work, we introduce PersonaFuse, a novel LLM post-training framework that enables LLMs to adapt and express different personalities for varying situations. Inspired by Trait Activation Theory and the Big Five personality model, PersonaFuse employs a Mixture-of-Expert architecture that combines persona adapters with a dynamic routing network, enabling contextual trait expression. Experimental results show that PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence. Importantly, these gains are achieved without sacrificing general reasoning ability or model safety, which remain common limitations of direct prompting and supervised fine-tuning approaches. PersonaFuse also delivers consistent improvements in downstream human-centered applications, such as mental health counseling and review-based customer service. Finally, human preference evaluations against leading LLMs, including GPT-4o and DeepSeek, demonstrate that PersonaFuse achieves competitive response quality despite its comparatively smaller model size. These findings demonstrate that PersonaFuse offers a theoretically grounded and practical approach for developing social-emotional enhanced LLMs, marking a significant advancement toward more human-centric AI systems.

Explaining How Quantization Disparately Skews a Model

Authors: Abhimanyu Bellam, Jung-Eun Kim

2025-09-08

http://arxiv.org/abs/2509.07222v1

Post Training Quantization (PTQ) is widely adopted due to its high compression capacity and speed with minimal impact on accuracy. However, we observed that disparate impacts are exacerbated by quantization, especially for minority groups. Our analysis explains that in the course of quantization there is a chain of factors attributed to a disparate impact across groups during forward and backward passes. We explore how the changes in weights and activations induced by quantization cause cascaded impacts in the network, resulting in logits with lower variance, increased loss, and compromised group accuracies. We extend our study to verify the influence of these impacts on group gradient norms and eigenvalues of the Hessian matrix, providing insights into the state of the network from an optimization point of view. To mitigate these effects, we propose integrating mixed precision Quantization Aware Training (QAT) with dataset sampling methods and weighted loss functions, therefore providing fair deployment of quantized neural networks.
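
To make "changes in weights induced by quantization" concrete, here is a minimal sketch of symmetric uniform quantization followed by dequantization. It is a generic textbook scheme assumed for illustration, not the specific PTQ method studied in the paper; `fake_quantize` is a hypothetical name, and `bits` is assumed to be at least 2.

```python
def fake_quantize(values, bits):
    """Round each value to a signed integer grid and map it back to floats.

    The per-value error is at most half a quantization step.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed values
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)             # nothing to scale
    scale = peak / qmax                 # size of one quantization step
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out
```

The difference between the input and the returned values is the perturbation that then propagates through later layers.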

Neurocognitive Modeling for Text Generation Deep Learning Architecture for EEG Data

Authors: Khushiyant

2025-09-08

http://arxiv.org/abs/2509.07202v1

Text generating capabilities have undergone a substantial transformation with the introduction of large language models (LLMs). Electroencephalography (EEG)-based text production is still difficult, though, because it requires a lot of data and processing power. This paper introduces a new method that combines the use of the Gemma 2B LLM with a classifier-LLM architecture to incorporate a Recurrent Neural Network (RNN) encoder. Our approach drastically lowers the amount of data and compute power needed while achieving performance close to that of cutting-edge methods. Notably, compared to current methodologies, our methodology delivers an overall performance improvement of 10%. The suggested architecture demonstrates the possibility of effective transfer learning for EEG-based text production, remaining strong and functional even in the face of data limits. This work highlights the potential of integrating LLMs with EEG decoding to improve assistive technologies and improve independence and communication for those with severe motor limitations. Our method pushes the limits of present capabilities and opens new paths for research and application in brain-computer interfaces by efficiently using the strengths of pre-trained language models. This makes EEG-based text production more accessible and efficient.

DischargeSim A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge

Authors: Zonghai Yao, Michael Sun, Won Seok Jang, Sunjae Kwon, Soie Kwon, Hong Yu

2025-09-08

http://arxiv.org/abs/2509.07188v2

Discharge communication is a critical yet underexplored component of patient care, where the goal shifts from diagnosis to education. While recent large language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they fail to evaluate models' ability to support patients after the visit. We introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability to act as personalized discharge educators. DischargeSim simulates post-visit, multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles (e.g., health literacy, education, emotion). Interactions are structured across six clinically grounded discharge topics and assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge evaluation, (2) personalized document generation including free-text summaries and structured AHRQ checklists, and (3) patient comprehension through a downstream multiple-choice exam. Experiments across 18 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Notably, model size does not always yield better education outcomes, highlighting trade-offs in strategy use and content prioritization. DischargeSim offers a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support.

Faster VGGT with Block-Sparse Global Attention

Authors: Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, Bastian Leibe

2025-09-08

http://arxiv.org/abs/2509.07120v1

Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT and $\pi^3$ have achieved impressive results with simple architectures, yet they face an inherent runtime bottleneck, due to the quadratic complexity of the global attention layers, that limits the scalability to large image sets. In this paper, we empirically analyze the global attention matrix of these models and observe that probability mass concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches. Motivated by the structured attention and inspired by recent advancement in large language models, we propose a replacement for the dense global attention operation based on highly optimized block-sparse kernels, yielding up to $4\times$ faster inference with comparable task performance. Our retrofit requires no retraining of the backbone, extends to both VGGT and $\pi^3$, and supports large image collections. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate the effectiveness of our approach.
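
The core of block-sparse attention is deciding which blocks of the attention matrix to compute at all. The sketch below shows one simple selection rule, assumed for illustration only: max-pool the scores inside each block and keep the top `keep` column blocks per row block. The abstract does not say how the paper selects blocks, and a real kernel would estimate block scores cheaply rather than from the full matrix. The matrix size is assumed to be divisible by `block`.

```python
def select_blocks(scores, block, keep):
    """Return the set of (row_block, col_block) pairs worth computing.

    scores: square list-of-lists of attention logits (illustrative input).
    """
    n_blocks = len(scores) // block
    kept = set()
    for rb in range(n_blocks):
        pooled = []
        for cb in range(n_blocks):
            values = [
                scores[i][j]
                for i in range(rb * block, (rb + 1) * block)
                for j in range(cb * block, (cb + 1) * block)
            ]
            pooled.append((max(values), cb))   # max-pool within the block
        pooled.sort(reverse=True)
        for _, cb in pooled[:keep]:
            kept.add((rb, cb))
    return kept
```

With `keep` much smaller than the number of blocks, the work per row block no longer grows with the full sequence length, which is where the speedup comes from.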

H$_{2}$OT Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers

Authors: Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Shijian Lu, Nicu Sebe

2025-09-08

http://arxiv.org/abs/2509.06956v1

Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_{2}$OT), for efficient transformer-based 3D human pose estimation from videos. H$_{2}$OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H$_{2}$OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method. Code and models are available at https://github.com/NationalGAILab/HoT.
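
The prune-then-recover pattern can be sketched with the simplest possible strategies: uniform frame sampling for pruning and linear interpolation for recovery. Both are stand-ins chosen for illustration; the paper's TPM selects tokens dynamically and its TRM is a learned module. Tokens are scalars here for brevity, and `keep` is assumed to satisfy 2 <= keep <= length.

```python
def prune_uniform(length, keep):
    """Indices of `keep` evenly spaced frames, always including both ends."""
    return [round(i * (length - 1) / (keep - 1)) for i in range(keep)]


def recover_linear(tokens, indices, length):
    """Expand kept tokens back to `length` frames by linear interpolation."""
    out = []
    segment = 0
    for t in range(length):
        # Advance to the pair of kept frames that brackets frame t.
        while segment < len(indices) - 2 and t > indices[segment + 1]:
            segment += 1
        left, right = indices[segment], indices[segment + 1]
        weight = (t - left) / (right - left)
        out.append((1 - weight) * tokens[segment] + weight * tokens[segment + 1])
    return out
```

The intermediate blocks would then operate on `keep` tokens instead of `length`, and only the final output is expanded back to full temporal resolution.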

Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models

Authors: Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, Mengdi Wang

2025-09-08

http://arxiv.org/abs/2509.06949v1

We propose TraceRL, a trajectory-aware reinforcement learning framework for diffusion language models (DLMs) that incorporates the preferred inference trajectory into post-training, and is applicable across different architectures. Equipped with a diffusion-based value model that enhances training stability, we demonstrate improved reasoning performance on complex math and coding tasks. Besides, it can also be applied to adapt block-specific models to larger blocks, which improves sampling flexibility. Employing TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than 7B-scale AR models, TraDo-4B-Instruct still consistently outperforms them across complex math reasoning tasks. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks. Through curriculum learning, we also derive the first long-CoT DLM, outperforming Qwen2.5-7B-Instruct on MATH500 with an 18.1% relative accuracy gain. To facilitate reproducible research and practical applications, we release a comprehensive open-source framework for building, training, and deploying diffusion LLMs across diverse architectures. The framework integrates accelerated KV-cache techniques and inference engines for both inference and reinforcement learning, and includes implementations of various supervised fine-tuning and RL methods for mathematics, coding, and general tasks. Code and Models: https://github.com/Gen-Verse/d-RL

Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data

Authors: Nithin Gopalakrishnan Nair, Srinivas Kaza, Xuan Luo, Vishal M. Patel, Stephen Lombardi, Jungyeon Park

2025-09-08

http://arxiv.org/abs/2509.06950v1

Large transformer-based models have made significant progress in generalizable novel view synthesis (NVS) from sparse input views, generating novel viewpoints without the need for test-time optimization. However, these models are constrained by the limited diversity of publicly available scene datasets, making most real-world (in-the-wild) scenes out-of-distribution. To overcome this, we incorporate synthetic training data generated from diffusion models, which improves generalization across unseen domains. While synthetic data offers scalability, we identify artifacts introduced during data generation as a key bottleneck affecting reconstruction quality. To address this, we propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. This refinement not only improves reconstruction quality over standard transformers but also enables scalable training with synthetic data. As a result, our method outperforms existing models on both in-dataset and cross-dataset evaluations, achieving state-of-the-art results across multiple benchmarks while significantly reducing computational costs. Project page: https://scaling3dnvs.github.io/

From Noise to Narrative Tracing the Origins of Hallucinations in Transformers

Authors: Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok

2025-09-08

http://arxiv.org/abs/2509.06938v1

As generative AI systems become competent and democratized in science, business, and government, deeper insight into their failure modes now poses an acute need. The occasional volatility in their behavior, such as the propensity of models to hallucinate, impedes trust and adoption of emerging AI solutions in high-stakes areas. In the present work, we establish how and when hallucinations arise in pre-trained models through concept representations captured by autoencoders, under scenarios with experimentally controlled uncertainty in the input space. Our systematic experiments reveal that the number of semantic concepts used by the model grows as the input information becomes increasingly unstructured. In the face of growing uncertainty in the input space, the model becomes prone to activate coherent yet input-insensitive semantic features, leading to hallucinated output. At its extreme, for pure-noise inputs, we identify a wide variety of robustly triggered and meaningful concepts in the intermediate activations of pre-trained models, whose functional integrity we confirm through targeted steering. We also show that hallucinations in the output of a model can be reliably predicted from the concept patterns embedded in layer activations. This collection of insights on internal processing mechanics has immediate consequences for aligning AI models with human values, AI safety, opening the attack surface for potential adversarial attacks, and providing a basis for automatic quantification of a model's hallucination risk.

Barlow-Swin Toward a novel siamese-based segmentation architecture using Swin-Transformers

Authors: Morteza Kiani Haftlang, Mohammadhossein Malmir, Foroutan Parand, Umberto Michelucci, Safouane El Ghazouali

2025-09-08

http://arxiv.org/abs/2509.06885v1

Medical image segmentation is a critical task in clinical workflows, particularly for the detection and delineation of pathological regions. While convolutional architectures like U-Net have become standard for such tasks, their limited receptive field restricts global context modeling. Recent efforts integrating transformers have addressed this, but often result in deep, computationally expensive models unsuitable for real-time use. In this work, we present a novel end-to-end lightweight architecture designed specifically for real-time binary medical image segmentation. Our model combines a Swin Transformer-like encoder with a U-Net-like decoder, connected via skip pathways to preserve spatial detail while capturing contextual information. Unlike existing designs such as Swin Transformer or U-Net, our architecture is significantly shallower and competitively efficient. To improve the encoder's ability to learn meaningful features without relying on large amounts of labeled data, we first train it using Barlow Twins, a self-supervised learning method that helps the model focus on important patterns by reducing unnecessary repetition in the learned features. After this pretraining, we fine-tune the entire model for our specific task. Experiments on benchmark binary segmentation tasks demonstrate that our model achieves competitive accuracy with substantially reduced parameter count and faster inference, positioning it as a practical alternative for deployment in real-time and resource-limited clinical environments. The code for our method is available in the GitHub repository: https://github.com/mkianih/Barlow-Swin.

COMPACT Common-token Optimized Model Pruning Across Channels and Tokens

Authors: Eugene Kwek, Wenpeng Yin

2025-09-08

http://arxiv.org/abs/2509.06836v1

Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/unembedding and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT enjoys merits of both depth and width pruning, such as: deployment-friendliness (keeps a standard transformer architecture), scale-adaptivity (trade off vocab vs. FFN pruning), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.
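
The phrase "common-token-weighted activations" can be read as scoring each FFN channel by its activation magnitude, weighted by how frequent the triggering token is. The sketch below follows that reading; the scoring formula, the function names, and the data layout are all assumptions for illustration, and the paper's exact criterion may differ.

```python
def channel_importance(activations, token_ids, token_freq):
    """Frequency-weighted sum of |activation| per FFN channel.

    activations[t][c]: activation of channel c for the t-th calibration token.
    token_freq: mapping from token id to its relative frequency (hypothetical).
    """
    importance = [0.0] * len(activations[0])
    for acts, token in zip(activations, token_ids):
        weight = token_freq.get(token, 0.0)   # tokens pruned from the vocabulary weigh 0
        for channel, value in enumerate(acts):
            importance[channel] += weight * abs(value)
    return importance


def keep_top(scores, n_keep):
    """Indices of the n_keep highest-scoring channels, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])
```

Under this weighting, a channel that fires strongly only on rare tokens ranks below one that fires moderately on common tokens, which is the intended alignment with the post-pruning token distribution.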

Guided Decoding and Its Critical Role in Retrieval-Augmented Generation

Authors: Özgür Uğur, Musa Yılmaz, Esra Şavirdi, Özay Ezerceli, Mahmut El Huseyni, Selva Taş, Reyhan Bayraktar

2025-09-08

http://arxiv.org/abs/2509.06631v1

The integration of Large Language Models (LLMs) into various applications has driven the need for structured and reliable responses. A key challenge in Retrieval-Augmented Generation (RAG) systems is ensuring that outputs align with expected formats while minimizing hallucinations. This study examines the role of guided decoding in RAG systems, comparing three methods, Outlines, XGrammar, and LM Format Enforcer, across different multi-turn prompting setups (0-turn, 1-turn, and 2-turn). By evaluating success rates, hallucination rates, and output quality, we provide insights into their performance and applicability. Our findings reveal how multi-turn interactions influence guided decoding, uncovering unexpected performance variations that can inform method selection for specific use cases. This work advances the understanding of structured output generation in RAG systems, offering both theoretical insights and practical guidance for LLM deployment.

HAVE Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models

Authors: Xin Tong, Zhi Lin, Jingya Wang, Bo Jin

2025-09-08

http://arxiv.org/abs/2509.06596v1

Large Language Models (LLMs) often produce hallucinations in retrieval-augmented or long-context generation, even when relevant evidence is present. This stems from two issues: head importance is treated as input-agnostic, and raw attention weights poorly reflect each token's true contribution. We present HAVE (Head-Adaptive Gating and ValuE Calibration), a parameter-free decoding framework that directly addresses both challenges. HAVE introduces head-adaptive gating, which performs instance-level soft reweighing of attention heads, and value calibration, which augments attention with the magnitude of value vectors to approximate write-back contribution. Together, these modules construct token-level evidence aligned with model updates and fuse it with the LM distribution through a lightweight uncertainty-scaled policy. HAVE requires no finetuning and operates in a single forward pass, making it efficient and broadly applicable. Experiments across multiple QA benchmarks and LLM families demonstrate that HAVE consistently reduces hallucinations and outperforms strong baselines, including DAGCD, with modest overhead. The framework is transparent, reproducible, and readily integrates with off-the-shelf LLMs, advancing trustworthy generation in real-world settings.
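
Value calibration, as described, scales each attention weight by the magnitude of the corresponding value vector. A minimal single-head sketch follows; the renormalization step and the function name are assumptions, and the head-adaptive gating and the fusion with the LM distribution are omitted.

```python
import math


def calibrated_evidence(attention, values):
    """Per-token evidence proportional to attention weight times ||value||.

    attention: weights over context tokens for one head.
    values: the value vector of each context token.
    """
    raw = [
        weight * math.sqrt(sum(x * x for x in vector))
        for weight, vector in zip(attention, values)
    ]
    total = sum(raw)
    if total == 0:
        return [0.0] * len(raw)
    return [r / total for r in raw]
```

Two tokens with equal attention weight can therefore receive very different evidence scores when one of them carries a near-zero value vector.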

Reasoning-enhanced Query Understanding through Decomposition and Interpretation

Authors: Yunfei Zhong, Jun Yang, Yixing Fan, Jiafeng Guo, Lixin Su, Maarten de Rijke, Ruqing Zhang, Dawei Yin, Xueqi Cheng

2025-09-08

http://arxiv.org/abs/2509.06544v1

Accurate inference of user intent is crucial for enhancing document retrieval in modern search engines. While large language models (LLMs) have made significant strides in this area, their effectiveness has predominantly been assessed with short, keyword-based queries. As AI-driven search evolves, long-form queries with intricate intents are becoming more prevalent, yet they remain underexplored in the context of LLM-based query understanding (QU). To bridge this gap, we introduce ReDI: a Reasoning-enhanced approach for query understanding through Decomposition and Interpretation. ReDI leverages the reasoning and comprehension capabilities of LLMs in a three-stage pipeline: (i) it breaks down complex queries into targeted sub-queries to accurately capture user intent; (ii) it enriches each sub-query with detailed semantic interpretations to improve the query-document matching; and (iii) it independently retrieves documents for each sub-query and employs a fusion strategy to aggregate the results for the final ranking. We compiled a large-scale dataset of real-world complex queries from a major search engine and distilled the query understanding capabilities of teacher models into smaller models for practical application. Experiments on BRIGHT and BEIR demonstrate that ReDI consistently surpasses strong baselines in both sparse and dense retrieval paradigms, affirming its effectiveness.
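
Stage (iii) aggregates per-sub-query result lists into one ranking. The abstract does not name the fusion strategy, so the sketch below substitutes reciprocal rank fusion (RRF), a widely used rank-aggregation method, purely as an illustration. The constant `k=60` is the conventional default for RRF, not a value from the paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked document lists into one ranking.

    Each document scores 1 / (k + rank) in every list that contains it.
    Ties are broken by document id for determinism.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc: (-scores[doc], doc))
```

A document retrieved by several sub-queries accumulates score from each list, so it tends to outrank one that appears high in only a single list.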

SLiNT Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion

Authors: Mengxue Yang, Chun Yang, Jiaqi Zhu, Jiafan Li, Jingqi Zhang, Yuyang Li, Ying Li

2025-09-08

http://arxiv.org/abs/2509.06531v1

Link prediction in knowledge graphs requires integrating structural information and semantic context to infer missing entities. While large language models offer strong generative reasoning capabilities, their limited exploitation of structural signals often results in structural sparsity and semantic ambiguity, especially under incomplete or zero-shot settings. To address these challenges, we propose SLiNT (Structure-aware Language model with Injection and coNtrastive Training), a modular framework that injects knowledge-graph-derived structural context into a frozen LLM backbone with lightweight LoRA-based adaptation for robust link prediction. Specifically, Structure-Guided Neighborhood Enhancement (SGNE) retrieves pseudo-neighbors to enrich sparse entities and mitigate missing context; Dynamic Hard Contrastive Learning (DHCL) introduces fine-grained supervision by interpolating hard positives and negatives to resolve entity-level ambiguity; and Gradient-Decoupled Dual Injection (GDDI) performs token-level structure-aware intervention while preserving the core LLM parameters. Experiments on WN18RR and FB15k-237 show that SLiNT achieves superior or competitive performance compared with both embedding-based and generation-based baselines, demonstrating the effectiveness of structure-aware representation learning for scalable knowledge graph completion.

Synesthesia of Machines (SoM)-Aided LiDAR Point Cloud Transmission for Collaborative Perception

Authors: Ensong Liu, Rongqing Zhang, Xiang Cheng, Jian Tang

2025-09-08

http://arxiv.org/abs/2509.06506v1

Collaborative perception enables more accurate and comprehensive scene understanding by learning how to share information between agents, with LiDAR point clouds providing essential precise spatial data. Due to the substantial data volume generated by LiDAR sensors, efficient point cloud transmission is essential for low-latency multi-agent collaboration. In this work, we propose an efficient, robust and applicable LiDAR point cloud transmission system via the Synesthesia of Machines (SoM), termed LiDAR Point Cloud Feature Transmission (LPC-FT), to support collaborative perception among multiple agents. Specifically, we employ a density-preserving deep point cloud compression method that encodes the complete point cloud into a downsampled efficient representation. To mitigate the effects of the wireless channel, we design a channel encoder module based on self-attention to enhance LiDAR point cloud features and a feature fusion module based on cross-attention to integrate features from transceivers. Furthermore, we utilize the nonlinear activation layer and transfer learning to improve the training of deep neural networks in the presence of digital channel noise. Experimental results demonstrate that the proposed LPC-FT is more robust and effective than traditional octree-based compression followed by channel coding, and outperforms state-of-the-art deep learning-based compression techniques and existing semantic communication methods, reducing the Chamfer Distance by 30% and improving the PSNR by 1.9 dB on average. Owing to its superior reconstruction performance and robustness against channel variations, LPC-FT is expected to support collaborative perception tasks.

Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

Authors: Ran Xin, Zeyu Zheng, Yanchen Nie, Kun Yuan, Xia Xiao

2025-09-08

http://arxiv.org/abs/2509.06493v1

The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces BFS-Prover-V2, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of the LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. BFS-Prover-V2 achieves 95.08% and 41.4% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.

HyFedRAG A Federated Retrieval-Augmented Generation Framework for Heterogeneous and Privacy-Sensitive Data

Authors: Cheng Qian, Hainan Zhang, Yongxin Tong, Hong-Wei Zheng, Zhiming Zheng

2025-09-08

http://arxiv.org/abs/2509.06444v1

Centralized RAG pipelines struggle with heterogeneous and privacy-sensitive data, especially in distributed healthcare settings where patient data spans SQL, knowledge graphs, and clinical notes. Clinicians face difficulties retrieving rare disease cases due to privacy constraints and the limitations of traditional cloud-based RAG systems in handling diverse formats and edge devices. To address this, we introduce HyFedRAG, a unified and efficient Federated RAG framework tailored for Hybrid data modalities. By leveraging an edge-cloud collaborative mechanism, HyFedRAG enables RAG to operate across diverse data sources while preserving data privacy. Our key contributions are: (1) We design an edge-cloud collaborative RAG framework built on Flower, which supports querying structured SQL data, semi-structured knowledge graphs, and unstructured documents. The edge-side LLMs convert diverse data into standardized privacy-preserving representations, and the server-side LLMs integrate them for global reasoning and generation. (2) We integrate lightweight local retrievers with privacy-aware LLMs and provide three anonymization tools that enable each client to produce semantically rich, de-identified summaries for global inference across devices. (3) To optimize response latency and reduce redundant computation, we design a three-tier caching strategy consisting of local cache, intermediate representation cache, and cloud inference cache. Experimental results on PMC-Patients demonstrate that HyFedRAG outperforms existing baselines in terms of retrieval quality, generation consistency, and system efficiency. Our framework offers a scalable and privacy-compliant solution for RAG over structural-heterogeneous data, unlocking the potential of LLMs in sensitive and diverse data environments.

Tree of Agents Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning

Authors: Song Yu, Xiaofei Xu, Ke Deng, Li Li, Lin Tian

2025-09-08

http://arxiv.org/abs/2509.06436v1

Large language models (LLMs) face persistent challenges when handling long-context tasks, most notably the "lost in the middle" issue, where information located in the middle of a long input tends to be underutilized. Some existing methods that reduce the input risk discarding key information, while others that extend context windows often lead to attention dispersion. To address these limitations, we propose Tree of Agents (TOA), a multi-agent reasoning framework that segments the input into chunks processed by independent agents. Each agent generates its local cognition, then agents dynamically exchange information for collaborative reasoning along tree-structured paths. TOA enables agents to probe different reasoning orders for multi-perspective understanding, effectively mitigating position bias and reducing hallucinations. To improve processing efficiency, we incorporate prefix-hash caching and adaptive strategies, achieving significant performance improvements with comparable API overhead. Experiments show that TOA, powered by compact LLaMA3.1-8B, significantly outperforms multiple baselines and demonstrates comparable performance to the latest and much larger commercial models, such as Gemini1.5-pro, on various long-context tasks. Code is available at https://github.com/Aireduce952/Tree-of-Agents.
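The abstract names prefix-hash caching without describing it further. One generic reading, sketched below, is to memoize an agent's output under a hash of the ordered chunk prefix it has read, so that reasoning paths sharing a prefix reuse earlier work. The class `PrefixHashCache`, the `fake_agent` stand-in, and the chunk-id representation are all assumptions made for this illustration, not taken from the TOA code.

```python
import hashlib


class PrefixHashCache:
    """Memoize results keyed by a hash of the ordered chunk prefix (hypothetical helper)."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def key(chunk_ids):
        # Join ids with a separator that cannot appear in the decimal ids themselves.
        joined = "\x1f".join(str(c) for c in chunk_ids)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def get_or_compute(self, chunk_ids, compute):
        k = self.key(chunk_ids)
        if k in self._store:
            self.hits += 1
        else:
            self._store[k] = compute(chunk_ids)
        return self._store[k]


calls = []


def fake_agent(chunk_ids):
    # Stand-in for an expensive model call; records what was actually computed.
    calls.append(tuple(chunk_ids))
    return "summary of " + ",".join(map(str, chunk_ids))


cache = PrefixHashCache()
# Two reasoning paths that share the prefix (0, 1).
for path in [(0, 1, 2), (0, 1, 3)]:
    for end in range(1, len(path) + 1):
        cache.get_or_compute(path[:end], fake_agent)
```

Here the second path reuses the cached results for `(0,)` and `(0, 1)` and only computes `(0, 1, 3)`.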

NeuroDeX Unlocking Diverse Support in Decompiling Deep Neural Network Executables

Authors: Yilin Li, Guozhu Meng, Mingyang Sun, Yanzhong Wang, Kun Sun, Hailong Chang, Yuekang Li

2025-09-08

http://arxiv.org/abs/2509.06402v1

On-device deep learning models are in wide real-world demand. Deep learning compilers efficiently compile models into executables for deployment on edge devices, but these executables may face the threat of reverse engineering. Previous studies have attempted to decompile DNN executables, but they face challenges in handling compilation optimizations and analyzing quantized compiled models. In this paper, we present NeuroDeX to unlock diverse support in decompiling DNN executables. NeuroDeX leverages the semantic understanding capabilities of LLMs along with dynamic analysis to accurately and efficiently perform operator type recognition, operator attribute recovery, and model reconstruction. NeuroDeX can recover DNN executables into high-level models across compilation optimizations, different architectures, and quantized compiled models. We conduct experiments on 96 DNN executables across 12 common DNN models. Extensive experimental results demonstrate that NeuroDeX can decompile non-quantized executables into nearly identical high-level models. NeuroDeX can recover functionally similar high-level models for quantized executables, achieving an average top-1 accuracy of 72%. NeuroDeX offers a more comprehensive and effective solution compared to previous DNN executable decompilers.

Mask-GCG Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Authors: Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang

2025-09-08

http://arxiv.org/abs/2509.06350v1

Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.

A Geometric Multigrid-Accelerated Compact Gas-Kinetic Scheme for Fast Convergence in High-Speed Flows on GPUs

Authors: Hongyu Liu, Xing Ji, Yuan Fu, Kun Xu

2025-09-08

http://arxiv.org/abs/2509.06347v1

Implicit methods and GPU parallelization are two distinct yet powerful strategies for accelerating high-order CFD algorithms. However, few studies have successfully integrated both approaches within high-speed flow solvers. The core challenge lies in preserving the robustness of implicit algorithms in the presence of strong discontinuities, while simultaneously enabling massive thread parallelism under the constraints of limited GPU memory. To address this, we propose a GPU-optimized, geometric multigrid-accelerated, high-order compact gas kinetic scheme (CGKS) that incorporates three key innovations: (1) a multi-color lower-upper symmetric Gauss-Seidel scheme that eliminates thread conflicts and preserves memory efficiency, as an implicit smoother on coarse grids; (2) a discontinuity-adaptive relaxation technique and a multigrid prolongation process, based on a discontinuous feedback factor, which dynamically stabilize shock regions without compromising convergence in smooth zones; and (3) a three-layer V-cycle geometric parallel multigrid strategy specifically tailored for unstructured meshes. Extensive tests on multi-dimensional subsonic to hypersonic flows demonstrate that our GPU-based high-performance solver achieves one to two orders of magnitude faster convergence compared to previous explicit solvers. More importantly, it preserves the shock-capturing robustness of the explicit CGKS and exhibits strong scalability on GPU architectures. This work presents a unified framework that synergistically leverages implicit and GPU optimization for high-speed flow simulations, effectively overcoming traditional trade-offs between parallelism, memory constraints, and numerical stability in high-order methods.

Ban&Pick Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs

Authors: Yuanteng Chen, Peisong Wang, Yuantian Shao, Jian Cheng

2025-09-08

http://arxiv.org/abs/2509.06346v1

Sparse Mixture-of-Experts (MoE) has become a key architecture for scaling large language models (LLMs) efficiently. Recent fine-grained MoE designs introduce hundreds of experts per layer, with multiple experts activated per token, enabling stronger specialization. However, during pre-training, routers are optimized mainly for stability and robustness: they converge prematurely and enforce balanced usage, limiting the full potential of model performance and efficiency. In this work, we uncover two overlooked issues: (i) a few highly influential experts are underutilized due to premature and balanced routing decisions; and (ii) enforcing a fixed number of active experts per token introduces substantial redundancy. Instead of retraining models or redesigning MoE architectures, we introduce Ban&Pick, a post-training, plug-and-play strategy for smarter MoE routing. Pick discovers and reinforces key experts, a small group with outsized impact on performance, leading to notable accuracy gains across domains. Ban complements this by dynamically pruning redundant experts based on layer and token sensitivity, delivering faster inference with minimal accuracy loss. Experiments on fine-grained MoE-LLMs (DeepSeek, Qwen3) across math, code, and general reasoning benchmarks demonstrate that Ban&Pick delivers free performance gains and inference acceleration without retraining or architectural changes. For instance, on Qwen3-30B-A3B, it improves accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under vLLM.

Towards scalable organ level 3D plant segmentation Bridging the data algorithm computing gap

Authors: Ruiming Du, Guangxun Zhai, Tian Qiu, Yu Jiang

2025-09-08

http://arxiv.org/abs/2509.06329v1

The precise characterization of plant morphology provides valuable insights into plant environment interactions and genetic evolution. A key technology for extracting this information is 3D segmentation, which delineates individual plant organs from complex point clouds. Despite significant progress in general 3D computer vision domains, the adoption of 3D segmentation for plant phenotyping remains limited by three major challenges: i) the scarcity of large-scale annotated datasets, ii) technical difficulties in adapting advanced deep neural networks to plant point clouds, and iii) the lack of standardized benchmarks and evaluation protocols tailored to plant science. This review systematically addresses these barriers by: i) providing an overview of existing 3D plant datasets in the context of general 3D segmentation domains, ii) systematically summarizing deep learning-based methods for point cloud semantic and instance segmentation, iii) introducing Plant Segmentation Studio (PSS), an open-source framework for reproducible benchmarking, and iv) conducting extensive quantitative experiments to evaluate representative networks and sim-to-real learning strategies. Our findings highlight the efficacy of convolutional backbones and -based instance segmentation, while also emphasizing the complementary role of modeling-based and augmentation-based synthetic data generation for sim-to-real learning in reducing annotation demands. In general, this study bridges the gap between algorithmic advances and practical deployment, providing immediate tools for researchers and a roadmap for developing data-efficient and generalizable deep learning solutions in 3D plant phenotyping. Data and code are available at https://github.com/perrydoremi/PlantSegStudio.

Text4Seg++ Advancing Image Segmentation via Generative Language Modeling

Authors: Mengcheng Lan, Chaofeng Chen, Jiaxing Xu, Zongrui Li, Yiping Ke, Xudong Jiang, Yingchen Yu, Yunqing Zhao, Song Bai

2025-09-08

http://arxiv.org/abs/2509.06321v1

Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remains a significant challenge. In this work, we propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. We first introduce image-wise semantic descriptors, a patch-aligned textual representation of segmentation masks that integrates naturally into the language modeling pipeline. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$, without compromising performance. Building upon this, our initial framework Text4Seg achieves strong segmentation performance across a wide range of vision tasks. To further improve granularity and compactness, we propose box-wise semantic descriptors, which localize regions of interest using bounding boxes and represent region masks via structured mask tokens called semantic bricks. This leads to our refined model, Text4Seg++, which formulates segmentation as a next-brick prediction task, combining precision, scalability, and generative efficiency. Comprehensive experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models across diverse benchmarks without any task-specific fine-tuning, while remaining compatible with existing MLLM backbones. Our work highlights the effectiveness, scalability, and generalizability of text-driven image segmentation within the MLLM framework.
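As a rough illustration of the row-wise run-length idea, the sketch below compresses a grid of per-patch labels one row at a time and decodes it back. The `label*count` token syntax, the `|` and newline separators, and the function names `rrle_encode` and `rrle_decode` are assumptions made for this example and may differ from the format used in the paper.

```python
from itertools import groupby


def rrle_encode(grid):
    """Encode each row of patch labels as runs of `label*count`, one row per line."""
    rows = []
    for row in grid:
        runs = [f"{label}*{len(list(group))}" for label, group in groupby(row)]
        rows.append("|".join(runs))
    return "\n".join(rows)


def rrle_decode(text):
    """Invert rrle_encode; assumes labels contain neither '|' nor newlines."""
    grid = []
    for line in text.split("\n"):
        row = []
        for run in line.split("|"):
            label, count = run.rsplit("*", 1)
            row.extend([label] * int(count))
        grid.append(row)
    return grid


# A toy 3x4 patch grid with made-up labels.
grid = [
    ["sky", "sky", "sky", "sky"],
    ["sky", "dog", "dog", "sky"],
    ["grass", "dog", "dog", "grass"],
]
encoded = rrle_encode(grid)
```

Runs never cross a row boundary, which keeps each line of the descriptor aligned with one row of image patches.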

LoaQ Layer-wise Output Approximation Quantization

Authors: Li Lin, Xiaojun Wan

2025-09-08

http://arxiv.org/abs/2509.06297v1

A natural and intuitive idea in model quantization is to approximate each component's quantized output to match its original. Layer-wise post-training quantization (PTQ), though based on this idea, adopts a strictly local view and can achieve, at best, only activation-aware approximations of weights. As a result, it often leads to insufficient approximations and practical deviations from this guiding intuition. Recent work has achieved a more accurate approximation of linear-layer outputs within the framework of layer-wise PTQ, but such refinements remain inadequate for achieving alignment with the full model output. Based on a deeper understanding of the structural characteristics of mainstream LLMs, we propose LoaQ, an output-approximation method for layer-wise PTQ that explicitly targets output-level consistency. It better aligns with this intuition and admits a simple closed-form solution, making it orthogonal to existing techniques and readily integrable into existing pipelines. Experiments on the LLaMA and Qwen model families demonstrate that LoaQ performs effectively in both weight-only and weight-activation joint quantization. By integrating seamlessly with existing strategies, it further enhances overall quality and shows strong potential to advance the frontier of post-training quantization.

RecMind LLM-Enhanced Graph Neural Networks for Personalized Consumer Recommendations

Authors: Chang Xue, Youwei Lu, Chen Yang, Jinming Xing

2025-09-08

http://arxiv.org/abs/2509.06286v1

Personalization is a core capability across consumer technologies (streaming, shopping, wearables, and voice), yet it remains challenged by sparse interactions, fast content churn, and heterogeneous textual signals. We present RecMind, an LLM-enhanced graph recommender that treats the language model as a preference prior rather than a monolithic ranker. A frozen LLM equipped with lightweight adapters produces text-conditioned user/item embeddings from titles, attributes, and reviews; a LightGCN backbone learns collaborative embeddings from the user-item graph. We align the two views with a symmetric contrastive objective and fuse them via intra-layer gating, allowing language to dominate in cold/long-tail regimes and graph structure to stabilize rankings elsewhere. On Yelp and Amazon-Electronics, RecMind attains the best results on all eight reported metrics, with relative improvements up to +4.53% (Recall@40) and +4.01% (NDCG@40) over strong baselines. Ablations confirm both the necessity of cross-view alignment and the advantage of gating over late fusion and LLM-only variants.

FineServe Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving

Authors: Kyungmin Bin, Seungbeom Choi, Jimyoung Son, Jieun Choi, Daseul Bae, Daehyeon Baek, Kihyo Moon, Minsung Jang, Hyojung Lee

2025-09-08

http://arxiv.org/abs/2509.06261v1

Recent advances in Post-Training Quantization (PTQ) techniques have significantly increased demand for quantized large language models (LLMs), enabling higher throughput and substantially reduced memory usage with minimal accuracy loss. Quantized models address memory constraints in LLMs and enhance GPU resource utilization through efficient GPU sharing. However, quantized models have smaller KV block sizes than non-quantized models, causing limited memory efficiency due to memory fragmentation. Also, distinct resource usage patterns between quantized and non-quantized models require efficient scheduling to maximize throughput. To address these challenges, we propose FineServe, an inference serving framework for mixed-precision LLMs. FineServe's key contributions include: (1) KV Slab, a precision-aware adaptive memory management technique that dynamically allocates KV cache based on model quantization characteristics, significantly reducing GPU memory fragmentation, and (2) a two-level scheduling framework comprising a global scheduler that places models on GPUs based on request rates, latency SLOs, and memory constraints and efficiency, and a local scheduler that adaptively adjusts batch sizes according to real-time request fluctuations. Experimental results demonstrate that FineServe achieves up to 2.2x higher SLO attainment and 1.8x higher token generation throughput compared to state-of-the-art GPU sharing systems.

Understanding the Influence of Synthetic Data for Text Embedders

Authors: Jacob Mitchell Springer, Vaibhav Adlakha, Siva Reddy, Aditi Raghunathan, Marius Mosbach

2025-09-07

http://arxiv.org/abs/2509.06184v1

Recent progress in developing general-purpose text embedders has been driven by training on ever-growing corpora of synthetic LLM-generated data. Nonetheless, no publicly available synthetic dataset exists, posing a barrier to studying its role for generalization. To address this issue, we first reproduce and publicly release the synthetic data proposed by Wang et al. (Mistral-E5). Our synthetic data is high quality and leads to consistent improvements in performance. Next, we critically examine where exactly synthetic data improves model generalization. Our analysis reveals that benefits from synthetic data are sparse and highly localized to individual datasets. Moreover, we observe trade-offs between the performance on different categories: data that benefits one task degrades performance on another. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders and challenge the notion that training on synthetic data leads to more robust embedding models across tasks.

Home-made Diffusion Model from Scratch to Hatch

Authors: Shih-Ying Yeh

2025-09-07

http://arxiv.org/abs/2509.06068v1

We introduce Home-made Diffusion Model (HDM), an efficient yet powerful text-to-image diffusion model optimized for training (and inference) on consumer-grade hardware. HDM achieves competitive 1024x1024 generation quality while maintaining a remarkably low training cost of $535-620 using four RTX5090 GPUs, representing a significant reduction in computational requirements compared to traditional approaches. Our key contributions include: (1) Cross-U-Transformer (XUT), a novel U-shape transformer that employs cross-attention for skip connections, providing superior feature integration that leads to remarkable compositional consistency; (2) a comprehensive training recipe that incorporates TREAD, a novel shifted square crop strategy for efficient arbitrary aspect-ratio training, and progressive resolution scaling; and (3) an empirical demonstration that smaller models (343M parameters) with carefully crafted architectures can achieve high-quality results and emergent capabilities, such as intuitive camera control. Our work provides an alternative paradigm of scaling, demonstrating a viable path toward democratizing high-quality text-to-image generation for individual researchers and smaller organizations with limited computational resources.

1 bit is all we need binary normalized neural networks

Authors: Eduardo Lobo Lustoda Cabral, Paulo Pirozelli, Larissa Driemeier

2025-09-07

http://arxiv.org/abs/2509.07025v1

The increasing size of large neural network models, specifically language models and foundational image models, poses deployment challenges, prompting efforts to reduce memory requirements and enhance computational efficiency. These efforts are critical to ensure practical deployment and effective utilization of these models across various applications. In this work, a novel type of neural network layer and model is developed that uses only single-bit parameters. In this novel type of model, all parameters of all layers, including kernel weights and biases, only have values equal to zero or one. These models use layers called binary normalized layers. Binary normalized layers can be of any type, such as fully connected, convolutional, or attention, and they consist of slight variations of the corresponding conventional layers. To show the effectiveness of the binary normalized layers, two different models are configured: one to solve a multiclass image classification problem and a language decoder to predict the next token of a sequence. The image classification model has convolutional and fully connected layers, and the language model is composed of transformer blocks with multi-head attention. The results show that models with binary normalized layers achieve almost the same results as equivalent models with real 32-bit parameters. The binary normalized layers make it possible to develop models that use 32 times less memory than current models and have equivalent performance. Besides, the binary normalized layers can be easily implemented on current computers using 1-bit arrays, and do not require the development of dedicated electronic hardware. This novel type of layer opens a new era for large neural network models with reduced memory requirements that can be deployed using simple and cheap hardware, such as mobile devices or CPUs alone.
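The "32 times less memory" figure follows from storing one bit per parameter instead of a 32-bit value. The sketch below only illustrates that storage arithmetic by packing 0/1 parameters eight to a byte. The helpers `pack_bits` and `unpack_bits` and the bit order are assumptions for this example, and the sketch says nothing about how the paper's layers are trained or normalized.

```python
def pack_bits(bits):
    """Pack a list of 0/1 parameters into bytes, 8 parameters per byte."""
    packed = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        if b:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_bits(packed, n):
    """Recover the first n parameters from the packed bytes."""
    return [(packed[i // 8] >> (i % 8)) & 1 for i in range(n)]


weights = [1, 0, 1, 1, 0, 0, 1, 0] * 4   # 32 made-up binary parameters
packed = pack_bits(weights)
float32_bytes = 4 * len(weights)          # storage for the same count of 32-bit values
ratio = float32_bytes / len(packed)       # memory ratio between the two encodings
```

For these 32 parameters the packed form takes 4 bytes against 128 bytes for 32-bit values, which gives the ratio of 32.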

A Unified Framework for Cultural Heritage Data Historicity and Migration The ARGUS Approach

Authors: Lingxiao Kong, Apostolos Sarris, Miltiadis Polidorou, Victor Klingenberg, Vasilis Sevetlidis, Vasilis Arampatzakis, George Pavlidis, Cong Yang, Zeyd Boukhers

2025-09-07

http://arxiv.org/abs/2509.06044v1

Cultural heritage preservation faces significant challenges in managing diverse, multi-source, and multi-scale data for effective monitoring and conservation. This paper documents a comprehensive data historicity and migration framework implemented within the ARGUS project, which addresses the complexities of processing heterogeneous cultural heritage data. We describe a systematic data processing pipeline encompassing standardization, enrichment, integration, visualization, ingestion, and publication strategies. The framework transforms raw, disparate datasets into standardized formats compliant with FAIR principles. It enhances datasets through established imputation techniques, ensures interoperability through database integration, and improves querying capabilities through LLM-powered natural language processing. This approach has been applied across five European pilot sites with varying preservation challenges, demonstrating its adaptability to diverse cultural heritage contexts. The implementation results show improved data accessibility, enhanced analytical capabilities, and more effective decision-making for conservation efforts.

Micro-Expression Recognition via Fine-Grained Dynamic Perception

Authors: Zhiwen Shao, Yifan Cheng, Fan Zhang, Xuehuai Shi, Canlin Li, Lizhuang Ma, Dit-yan Yeung

2025-09-07

http://arxiv.org/abs/2509.06015v1

Facial micro-expression recognition (MER) is a challenging task, due to the transience, subtlety, and dynamics of micro-expressions (MEs). Most existing methods resort to hand-crafted features or deep networks, in which the former often additionally requires key frames, and the latter suffers from small-scale and low-diversity training data. In this paper, we develop a novel fine-grained dynamic perception (FDP) framework for MER. We propose to rank frame-level features of a sequence of raw frames in chronological order, in which the rank process encodes the dynamic information of both ME appearances and motions. Specifically, a novel local-global feature-aware transformer is proposed for frame representation learning. A rank scorer is further adopted to calculate rank scores of each frame-level feature. Afterwards, the rank features from the rank scorer are pooled in the temporal dimension to capture the dynamic representation. Finally, the dynamic representation is shared by a MER module and a dynamic image construction module, in which the former predicts the ME category, and the latter uses an encoder-decoder structure to construct the dynamic image. The design of the dynamic image construction task is beneficial for capturing subtle facial actions associated with MEs and alleviating the data scarcity issue. Extensive experiments show that our method (i) significantly outperforms the state-of-the-art MER methods, and (ii) works well for dynamic image construction. Particularly, our FDP improves by 4.05%, 2.50%, 7.71%, and 2.11% over the previous best results in terms of F1-score on the CASME II, SAMM, CAS(ME)^2, and CAS(ME)^3 datasets, respectively. The code is available at https://github.com/CYF-cuber/FDP.

MEGS$^{2}$ Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning

Authors: Jiarui Chen, Yikeng Chen, Yingshuang Zou, Ye Huang, Peng Wang, Yuan Liu, Yujing Sun, Wenping Wang

2025-09-07

http://arxiv.org/abs/2509.07021v1

3D Gaussian Splatting (3DGS) has emerged as a dominant novel-view synthesis technique, but its high memory consumption severely limits its applicability on edge devices. A growing number of 3DGS compression methods have been proposed to make 3DGS more efficient, yet most only focus on storage compression and fail to address the critical bottleneck of rendering memory. To address this problem, we introduce MEGS$^{2}$, a novel memory-efficient framework that tackles this challenge by jointly optimizing two key factors: the total primitive number and the parameters per primitive, achieving unprecedented memory compression. Specifically, we replace the memory-intensive spherical harmonics with lightweight arbitrarily-oriented spherical Gaussian lobes as our color representations. More importantly, we propose a unified soft pruning framework that models primitive-number and lobe-number pruning as a single constrained optimization problem. Experiments show that MEGS$^{2}$ achieves a 50% static VRAM reduction and a 40% rendering VRAM reduction compared to existing methods, while maintaining comparable rendering quality.

Application Space and the Rate-Distortion-Complexity Analysis of Neural Video CODECs

Authors: Ricardo L. de Queiroz, Diogo C. Garcia, Yi-Hsin Chen, Ruhan Conceição, Wen-Hsiao Peng, Luciano V. Agostini

2025-09-07

http://arxiv.org/abs/2509.05929v1

We study the decision-making process for choosing video systems through a rate-distortion-complexity (RDC) analysis. We discuss the 2D Bjontegaard delta (BD) metric and formulate generalizations in an attempt to extend its notions to the 3D RDC volume. We follow that discussion with another one on the computation of metrics in the RDC volume, and on how to define and measure the cost of a coder-decoder (codec) pair, where the codec is characterized by a cloud of points in the RDC space. We use a Lagrangian cost $D + \lambda R + \gamma C$, such that choosing the best video codec among a number of candidates for an application demands selecting appropriate $(\lambda, \gamma)$ values. Thus, we argue that an application may be associated with a $(\lambda, \gamma)$ point in the application space. An example streaming application is given as a case study to set a particular point in the $(\lambda, \gamma)$ plane. The result is that we can compare Lagrangian costs in an RDC volume for different codecs for a given application. Furthermore, we can span the plane and compare codecs for the entire application space filled with different $(\lambda, \gamma)$ choices. We then compare several state-of-the-art neural video codecs using the proposed metrics. Results are informative and surprising. We find that, within our RDC computation constraints, only four neural video codecs come out as the best suited for any application, depending on where its desirable $(\lambda, \gamma)$ lies.
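To make the role of the $(\lambda, \gamma)$ point concrete, the sketch below scores each codec by the Lagrangian cost $D + \lambda R + \gamma C$ and picks the cheapest one for a given application. The abstract does not say how a codec's cloud of points is reduced to a single cost, so taking the minimum over its operating points is an assumption of this example. The codec names and all numbers are made up.

```python
def rdc_cost(point, lam, gamma):
    """Lagrangian cost D + lam*R + gamma*C for one (R, D, C) operating point."""
    rate, distortion, complexity = point
    return distortion + lam * rate + gamma * complexity


def best_codec(codecs, lam, gamma):
    """Return (name, cost) of the codec whose cheapest operating point has minimal cost."""
    costs = {
        name: min(rdc_cost(p, lam, gamma) for p in points)
        for name, points in codecs.items()
    }
    name = min(costs, key=costs.get)
    return name, costs[name]


# Made-up (rate, distortion, complexity) clouds, for illustration only.
codecs = {
    "codec_a": [(1.0, 4.0, 10.0), (2.0, 2.5, 10.0)],   # cheap to run, higher distortion
    "codec_b": [(1.0, 3.0, 100.0), (2.0, 1.5, 100.0)],  # costly to run, lower distortion
}

ignore_complexity = best_codec(codecs, lam=1.0, gamma=0.0)
penalize_complexity = best_codec(codecs, lam=1.0, gamma=0.1)
```

With `gamma=0.0` the lower-distortion `codec_b` is selected, while `gamma=0.1` makes its complexity expensive enough that `codec_a` is selected, which is the sense in which an application corresponds to a point in the $(\lambda, \gamma)$ plane.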

Physics-Guided Diffusion Transformer with Spherical Harmonic Posterior Sampling for High-Fidelity Angular Super-Resolution in Diffusion MRI

Authors: Mu Nan, Taohui Xiao, Ruoyou Wu, Shoujun Yu, Ye Li, Hairong Zheng, Shanshan Wang

2025-09-07

http://arxiv.org/abs/2509.07020v1

Diffusion MRI (dMRI) angular super-resolution (ASR) aims to reconstruct high-angular-resolution (HAR) signals from limited low-angular-resolution (LAR) data without prolonging scan time. However, existing methods are limited in recovering fine-grained angular details or preserving high fidelity due to inadequate modeling of q-space geometry and insufficient incorporation of physical constraints. In this paper, we introduce a Physics-Guided Diffusion Transformer (PGDiT) designed to explore physical priors throughout both training and inference stages. During training, a Q-space Geometry-Aware Module (QGAM) with b-vector modulation and random angular masking facilitates direction-aware representation learning, enabling the network to generate directionally consistent reconstructions with fine angular details from sparse and noisy data. In inference, a two-stage Spherical Harmonics-Guided Posterior Sampling (SHPS) enforces alignment with the acquired data, followed by heat-diffusion-based SH regularization to ensure physically plausible reconstructions. This coarse-to-fine refinement strategy mitigates oversmoothing and artifacts commonly observed in purely data-driven or generative models. Extensive experiments on general ASR tasks and two downstream applications, Diffusion Tensor Imaging (DTI) and Neurite Orientation Dispersion and Density Imaging (NODDI), demonstrate that PGDiT outperforms existing deep learning models in detail recovery and data fidelity. Our approach presents a novel generative ASR framework that offers high-fidelity HAR dMRI reconstructions, with potential applications in neuroscience and clinical research.

Beyond I'm Sorry, I Can't Dissecting Large Language Model Refusal

Authors: Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee

2025-09-07

http://arxiv.org/abs/2509.09708v1

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.

Chatbot To Help Patients Understand Their Health

Authors: Won Seok Jang, Hieu Tran, Manav Mistry, SaiKiran Gandluri, Yifan Zhang, Sharmin Sultana, Sunjae Kown, Yuan Zhang, Zonghai Yao, Hong Yu

2025-09-06

http://arxiv.org/abs/2509.05818v1

Patients must possess the knowledge necessary to actively participate in their care. We present NoteAid-Chatbot, a conversational AI that promotes patient understanding via a novel 'learning as conversation' framework, built on a multi-agent large language model (LLM) and reinforcement learning (RL) setup without human-labeled data. NoteAid-Chatbot was built on a lightweight LLaMA 3.2 3B model trained in two stages: initial supervised fine-tuning on conversational data synthetically generated using medical conversation strategies, followed by RL with rewards derived from patient understanding assessments in simulated hospital discharge scenarios. Our evaluation, which includes comprehensive human-aligned assessments and case studies, demonstrates that NoteAid-Chatbot exhibits key emergent behaviors critical for patient education, such as clarity, relevance, and structured dialogue, even though it received no explicit supervision for these attributes. Our results show that even simple Proximal Policy Optimization (PPO)-based reward modeling can successfully train lightweight, domain-specific chatbots to handle multi-turn interactions, incorporate diverse educational strategies, and meet nuanced objectives. Our Turing test demonstrates that NoteAid-Chatbot surpasses non-expert humans. Although our current focus is on healthcare, the framework we present illustrates the feasibility and promise of applying low-cost, PPO-based RL to realistic, open-ended conversational domains, broadening the applicability of RL-based alignment methods.

time2time Causal Intervention in Hidden States to Simulate Rare Events in Time Series Foundation Models

Authors: Debdeep Sanyal, Aaryan Nagpal, Dhruv Kumar, Murari Mandal, Saurabh Deshpande

2025-09-06

http://arxiv.org/abs/2509.05801v1

While Transformer-based foundation models excel at forecasting routine patterns, two questions remain: do they internalize semantic concepts such as market regimes, or merely fit curves? And can their internal representations be leveraged to simulate rare, high-stakes events such as market crashes? To investigate this, we introduce activation transplantation, a causal intervention that manipulates hidden states by imposing the statistical moments of one event (e.g., a historical crash) onto another (e.g., a calm period) during the forward pass. This procedure deterministically steers forecasts: injecting crash semantics induces downturn predictions, while injecting calm semantics suppresses crashes and restores stability. Beyond binary control, we find that models encode a graded notion of event severity, with the latent vector norm directly correlating with the magnitude of systemic shocks. Validated across two architecturally distinct TSFMs, Toto (decoder only) and Chronos (encoder-decoder), our results demonstrate that steerable, semantically grounded representations are a robust property of large time series Transformers. Our findings provide evidence for a latent concept space that governs model predictions, shifting interpretability from post-hoc attribution to direct causal intervention, and enabling semantic "what-if" analysis for strategic stress-testing.
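
The abstract does not say which moments are transplanted or at which layers, so the following is only a minimal sketch of the general idea under stated assumptions: a hidden state is treated as a flat list of floats, and only the first two moments (mean and standard deviation) of a "donor" activation are imposed on a "target" activation. The function name `transplant_moments` and the toy vectors are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def transplant_moments(target, donor, eps=1e-8):
    """Re-standardize `target` so it carries the mean and std of `donor`.

    Assumption: only the first two moments are matched; the paper may use
    a different set of moments or a per-layer procedure.
    """
    mu_t, sd_t = mean(target), pstdev(target)
    mu_d, sd_d = mean(donor), pstdev(donor)
    # Normalize the target, then rescale and shift to the donor statistics.
    return [(x - mu_t) / (sd_t + eps) * sd_d + mu_d for x in target]


# Toy "hidden states": a calm period and a crash period (made-up numbers).
calm = [0.1, -0.2, 0.05, 0.0, 0.15]
crash = [-3.0, -1.0, -4.5, -2.0, -0.5]
steered = transplant_moments(calm, crash)
```

After the call, `steered` keeps the relative shape of `calm` but has the mean and spread of `crash`, which is the sense in which one event's statistics are imposed on another.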

LM-Searcher Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding

Authors: Yuxuan Hu, Jihao Liu, Ke Wang, Jinliang Zhen, Weikang Shi, Manyuan Zhang, Qi Dou, Rui Liu, Aojun Zhou, Hongsheng Li

2025-09-06

http://arxiv.org/abs/2509.05657v2

Recent progress in Large Language Models (LLMs) has opened new avenues for solving complex optimization problems, including Neural Architecture Search (NAS). However, existing LLM-driven NAS approaches rely heavily on prompt engineering and domain-specific tuning, limiting their practicality and scalability across diverse tasks. In this work, we propose LM-Searcher, a novel framework that leverages LLMs for cross-domain neural architecture optimization without the need for extensive domain-specific adaptation. Central to our approach is NCode, a universal numerical string representation for neural architectures, which enables cross-domain architecture encoding and search. We also reformulate the NAS problem as a ranking task, training LLMs to select high-performing architectures from candidate pools using instruction-tuning samples derived from a novel pruning-based subspace sampling strategy. Our curated dataset, encompassing a wide range of architecture-performance pairs, encourages robust and transferable learning. Comprehensive experiments demonstrate that LM-Searcher achieves competitive performance in both in-domain (e.g., CNNs for image classification) and out-of-domain (e.g., LoRA configurations for segmentation and generation) tasks, establishing a new paradigm for flexible and generalizable LLM-based architecture search. The datasets and models will be released at https://github.com/Ashone3/LM-Searcher.

Cross-Service Threat Intelligence in LLM Services using Privacy-Preserving Fingerprints

Authors: Waris Gill, Natalie Isak, Matthew Dressman

2025-09-06

http://arxiv.org/abs/2509.05608v1

The widespread deployment of LLMs across enterprise services has created a critical security blind spot. Organizations operate multiple LLM services handling billions of queries daily, yet regulatory compliance boundaries prevent these services from sharing threat intelligence about prompt injection attacks, the top security risk for LLMs. When an attack is detected in one service, the same threat may persist undetected in others for months, as privacy regulations prohibit sharing user prompts across compliance boundaries. We present BinaryShield, the first privacy-preserving threat intelligence system that enables secure sharing of attack fingerprints across compliance boundaries. BinaryShield transforms suspicious prompts through a unique pipeline combining PII redaction, semantic embedding, binary quantization, and a randomized response mechanism to potentially generate non-invertible fingerprints that preserve attack patterns while providing privacy. Our evaluations demonstrate that BinaryShield achieves an F1-score of 0.94, significantly outperforming SimHash (0.77), the privacy-preserving baseline, while achieving 64x storage reduction and 38x faster similarity search compared to dense embeddings.
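
The last two pipeline stages named in the abstract (binary quantization and randomized response) can be illustrated with a small sketch. This is not BinaryShield's implementation: the sign-based quantizer, the per-bit flip probability, and the Hamming-agreement similarity are assumptions chosen for illustration, and the PII redaction and embedding stages are replaced by a random stand-in vector.

```python
import random


def binarize(embedding):
    """Sign-quantize a real-valued embedding into a list of 0/1 bits."""
    return [1 if x > 0 else 0 for x in embedding]


def randomized_response(bits, flip_prob, rng):
    """Flip each bit independently with probability `flip_prob`."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]


def hamming_similarity(a, b):
    """Fraction of positions at which two equal-length bit lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
# Stand-in for a semantic embedding of a redacted prompt (hypothetical data).
embedding = [rng.gauss(0.0, 1.0) for _ in range(256)]
clean = binarize(embedding)
noisy = randomized_response(clean, flip_prob=0.1, rng=rng)
similarity = hamming_similarity(clean, noisy)
```

With a small flip probability the noisy fingerprint still agrees with the clean one on most bits, so near-duplicate attacks remain matchable while any individual bit is deniable.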

Icon $^{2}$ Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation

Authors: Qiyuan Chen, Hongsen Huang, Qian Shao, Jiahe Chen, Jintai Chen, Hongxia Xu, Renjie Hua, Ren Chuan, Jian Wu

2025-09-06

http://arxiv.org/abs/2509.05605v1

Large Language Models (LLMs) require high-quality preference datasets to align with human preferences. However, conventional methods for constructing such datasets face significant challenges: reliance on pre-collected instructions often leads to distribution mismatches with target models, while the need for sampling multiple stochastic responses introduces substantial computational overhead. In this work, we explore a paradigm shift by leveraging inherent regulation of LLMs' representation space for efficient and tailored preference dataset construction, named Icon $^{2}$ . Specifically, it first extracts layer-wise direction vectors to encode sophisticated human preferences and then uses these vectors to filter self-synthesized instructions based on their inherent consistency. During decoding, bidirectional inherent control is applied to steer token representations, enabling the precise generation of response pairs with clear alignment distinctions. Experimental results demonstrate significant improvements in both alignment and efficiency. Llama3-8B and Qwen2-7B achieve an average win rate improvement of 13.89% on AlpacaEval 2.0 and 13.45% on Arena-Hard, while reducing computational costs by up to 48.1%.

ProfilingAgent Profiling-Guided Agentic Reasoning for Adaptive Model Optimization

Authors: Sadegh Jafari, Aishwarya Sarkar, Mohiuddin Bilwal, Ali Jannesari

2025-09-06

http://arxiv.org/abs/2509.05584v1

Foundation models face growing compute and memory bottlenecks, hindering deployment on resource-limited platforms. While compression techniques such as pruning and quantization are widely used, most rely on uniform heuristics that ignore architectural and runtime heterogeneity. Profiling tools expose per-layer latency, memory, and compute cost, yet are rarely integrated into automated pipelines. We propose ProfilingAgent, a profiling-guided, agentic approach that uses large language models (LLMs) to automate compression via structured pruning and post-training dynamic quantization. Our modular multi-agent system reasons over static metrics (MACs, parameter counts) and dynamic signals (latency, memory) to design architecture-specific strategies. Unlike heuristic baselines, ProfilingAgent tailors layer-wise decisions to bottlenecks. Experiments on ImageNet-1K, CIFAR-10, and CIFAR-100 with ResNet-101, ViT-B/16, Swin-B, and DeiT-B/16 show that pruning maintains competitive or improved accuracy (about 1% drop on ImageNet-1K, +2% gains for ViT-B/16 on smaller datasets), while quantization achieves up to 74% memory savings with <0.5% accuracy loss. Our quantization also yields consistent inference speedups of up to 1.74x. Comparative studies with GPT-4o and GPT-4-Turbo highlight the importance of LLM reasoning quality for iterative pruning. These results establish agentic systems as scalable solutions for profiling-guided model optimization.

Sensitivity-Aware Post-Training Quantization for Deep Neural Networks

Authors: Zekang Zheng, Haokun Li, Yaofo Chen, Mingkui Tan, Qing Du

2025-09-06

http://arxiv.org/abs/2509.05576v1

Model quantization reduces neural network parameter precision to achieve compression, but often compromises accuracy. Existing post-training quantization (PTQ) methods employ iterative parameter updates to preserve accuracy under high compression ratios, incurring significant computational complexity and resource overhead, which limits applicability in resource-constrained edge computing and real-time inference scenarios. This paper proposes an efficient PTQ method guided by parameter sensitivity analysis. The approach prioritizes quantization of high-sensitivity parameters, leveraging unquantized low-sensitivity parameters to compensate for quantization errors, thereby mitigating accuracy degradation. Furthermore, by exploiting column-wise clustering of parameter sensitivity, the method introduces a row-parallel quantization framework with a globally shared inverse Hessian matrix update mechanism, reducing computational complexity by an order of magnitude. Experimental results on ResNet-50 and YOLOv5s demonstrate a 20-200-fold quantization speedup over the Optimal Brain Quantization baseline, with mean accuracy loss below 0.3%, confirming the method's efficacy in balancing efficiency and accuracy.

TreeGPT Pure TreeFFN Encoder-Decoder Architecture for Structured Reasoning Without Attention Mechanisms

Authors: Zixi Li

2025-09-06

http://arxiv.org/abs/2509.05550v2

We present TreeGPT, an attention-free neural architecture that explores the potential of pure TreeFFN encoder-decoder design for structured reasoning tasks. Unlike traditional approaches that rely on attention mechanisms, TreeGPT employs bidirectional TreeFFN components that process sequences through adjacent connections in parallel, aiming to achieve computational efficiency while maintaining reasoning capabilities. Our approach centers on a TreeFFN Encoder-Decoder mechanism: $\text{Encoder TreeFFN (L} \rightarrow \text{R)} + \text{Decoder TreeFFN (R} \leftarrow \text{L)} \rightarrow \text{Parallel Processing}$ where the encoder processes left-to-right dependencies while the decoder handles right-to-left patterns, both using simple neighbor-to-neighbor connections. This design eliminates attention computation while maintaining sequence modeling capabilities. We evaluate our approach on the ARC Prize 2025 dataset, where TreeGPT achieves 99% validation accuracy using 3.16M parameters. The model converges within 1500 training steps and demonstrates 100% token-level accuracy on selected evaluation samples. Our preliminary results suggest that for certain structured reasoning tasks, specialized TreeFFN architectures may offer advantages over attention-based approaches. While these findings are encouraging, we acknowledge that further investigation across diverse tasks and datasets would be valuable to establish the broader applicability of attention-free designs.

veScale Consistent and Efficient Tensor Programming with Eager-Mode SPMD

Authors: Youjie Li, Cheng Wan, Zhiqi Lin, Hongyu Zhu, Jiacheng Yang, Ziang Song, Xinyi Di, Jiawei Wu, Huiyao Shu, Wenlei Bao, Yanghua Peng, Haibin Lin, Li-Wen Chang

2025-09-05

http://arxiv.org/abs/2509.07003v1

Large Language Models (LLMs) have scaled rapidly in size and complexity, requiring increasingly intricate parallelism for distributed training, such as 3D parallelism. This sophistication motivates a shift toward simpler, more debuggable programming paradigms like Single Program Multiple Data (SPMD). However, SPMD in eager execution introduces two key challenges: ensuring consistency with single-device execution and achieving high performance at scale. In this paper, we introduce veScale, an eager-mode training system that fully embraces the SPMD paradigm to democratize distributed tensor programming. veScale addresses the prevalent issue of inconsistent results in systems like PyTorch by introducing a novel algorithm for distributed Random Number Generation (RNG) that is compatible with arbitrary sharded operators. veScale also significantly boosts training performance by reducing the overhead of PyTorch primitives and improving efficiency. Evaluations show that veScale delivers up to 2.2x speedup over state-of-the-art training systems such as TorchTitan and cuts code complexity by 78.4%, while preserving single-device-equivalent results.

Dynamic Sensitivity Filter Pruning using Multi-Agent Reinforcement Learning For DCNN's

Authors: Iftekhar Haider Chowdhury, Zaed Ikbal Syed, Ahmed Faizul Haque Dhrubo, Mohammad Abdul Qayum

2025-09-05

http://arxiv.org/abs/2509.05446v1

Deep Convolutional Neural Networks have achieved state-of-the-art performance across various computer vision tasks; however, their practical deployment is limited by computational and memory overhead. This paper introduces Differential Sensitivity Fusion Pruning, a novel single-shot filter pruning framework that focuses on evaluating the stability and redundancy of filter importance scores across multiple criteria. Differential Sensitivity Fusion Pruning computes a differential sensitivity score for each filter by fusing the discrepancies among gradient-based sensitivity, first-order Taylor expansion, and KL divergence of activation distributions. An exponential scaling mechanism is applied to emphasize filters with inconsistent importance across metrics, identifying candidates that are structurally unstable or less critical to model performance. Unlike iterative or reinforcement-learning-based strategies, Differential Sensitivity Fusion Pruning is efficient and deterministic, requiring only a single forward-backward pass for scoring and pruning. Extensive experiments across pruning rates from 50 to 70 percent demonstrate that Differential Sensitivity Fusion Pruning significantly reduces model complexity, achieving over 80 percent reduction in floating-point operations (FLOPs) while maintaining high accuracy. For instance, at 70 percent pruning, our approach retains up to 98.23 percent of baseline accuracy, surpassing traditional heuristics in both compression and generalization. The proposed method presents an effective solution for scalable and adaptive compression of Deep Convolutional Neural Networks, paving the way for efficient deployment on edge and mobile platforms.

Crosscoding Through Time Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Authors: Deniz Bayazit, Aaron Mueller, Antoine Bosselut

2025-09-05

http://arxiv.org/abs/2509.05291v1

Large language models (LLMs) learn non-trivial abstractions during pretraining, like detecting irregular plural noun subjects. However, it is not well understood when and how specific linguistic abilities emerge, as traditional evaluation methods such as benchmarking fail to reveal how models acquire concepts and capabilities. To bridge this gap and better understand model training at the concept level, we use sparse crosscoders to discover and align features across model checkpoints. Using this approach, we track the evolution of linguistic features during pretraining. We train crosscoders between open-sourced checkpoint triplets with significant performance and representation shifts, and introduce a novel metric, Relative Indirect Effects (RelIE), to trace training stages at which individual features become causally important for task performance. We show that crosscoders can detect feature emergence, maintenance, and discontinuation during pretraining. Our approach is architecture-agnostic and scalable, offering a promising path toward more interpretable and fine-grained analysis of representation learning throughout pretraining.

Recomposer Event-roll-guided generative audio editing

Authors: Daniel P. W. Ellis, Eduardo Fonseca, Ron J. Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R. Hershey, Aren Jansen, R. Channing Moore, Manoj Plakal

2025-09-05

http://arxiv.org/abs/2509.05256v1

Editing complex real-world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes, able to delete, insert, and enhance individual sound events based on textual edit descriptions (e.g., "enhance Door") and a graphical representation of the event timing derived from an "event roll" transcription. We present an encoder-decoder Transformer working on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real-world backgrounds. Evaluation reveals the importance of each part of the edit descriptions: action, class, and timing. Our work demonstrates that "recomposition" is an important and practical application.

Exploring Autoregressive Vision Foundation Models for Image Compression

Authors: Huu-Tai Phung, Yu-Hsiang Lin, Yen-Kuan Ho, Wen-Hsiao Peng

2025-09-05

http://arxiv.org/abs/2509.05169v1

This work presents the first attempt to repurpose vision foundation models (VFMs) as image codecs, aiming to explore their generation capability for low-rate image compression. VFMs are widely employed in both conditional and unconditional generation scenarios across diverse downstream tasks, e.g., physical AI applications. Many VFMs employ an encoder-decoder architecture similar to that of end-to-end learned image codecs and learn an autoregressive (AR) model to perform next-token prediction. To enable compression, we repurpose the AR model in the VFM for entropy coding the next token based on previously coded tokens. This approach deviates from early semantic compression efforts that rely solely on conditional generation for reconstructing input images. Extensive experiments and analysis are conducted to compare VFM-based codecs to current SOTA codecs optimized for distortion or perceptual quality. Notably, certain pre-trained, general-purpose VFMs demonstrate superior perceptual quality at extremely low bitrates compared to specialized learned image codecs. This finding paves the way for a promising research direction that leverages VFMs for low-rate, semantically rich image compression.

KVCompose Efficient Structured KV Cache Compression with Composite Tokens

Authors: Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed

2025-09-05

http://arxiv.org/abs/2509.05165v1

Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding; however, cache size grows linearly with context length and model depth, becoming a major bottleneck in long-context inference. Prior KV cache compression methods either enforce rigid heuristics, disrupt tensor layouts with per-attention-head variability, or require specialized compute kernels. We propose a simple, yet effective, KV cache compression framework based on attention-guided, layer-adaptive composite tokens. Our method aggregates attention scores to estimate token importance, selects head-specific tokens independently, and aligns them into composite tokens that respect the uniform cache structure required by existing inference engines. A global allocation mechanism further adapts retention budgets across layers, assigning more capacity to layers with informative tokens. This approach achieves significant memory reduction while preserving accuracy, consistently outperforming prior structured and semi-structured methods. Crucially, our approach remains fully compatible with standard inference pipelines, offering a practical and scalable solution for efficient long-context LLM deployment.
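
One way to read the "composite token" idea is sketched below. This is an interpretation under assumptions, not the paper's algorithm: each head keeps its own top-`budget` tokens by aggregated attention score, and slot j of every head is then grouped into one composite entry so that all heads end up with the same cache length. The function name, the positional ordering inside each head, and the toy scores are hypothetical; the layer-wise budget allocation is omitted.

```python
def select_composite_tokens(importance, budget):
    """Pick `budget` tokens per head and align them into composite slots.

    importance[h][t] is the aggregated attention score of token t in head h.
    Returns composite[j][h]: the token index kept by head h in slot j.
    """
    per_head = []
    for scores in importance:
        ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
        # Keep the top-`budget` tokens, restored to positional order (assumption).
        per_head.append(sorted(ranked[:budget]))
    # Transpose: slot j across all heads forms one composite token.
    return [list(slot) for slot in zip(*per_head)]


importance = [
    [0.9, 0.1, 0.5, 0.2],  # head 0 favors tokens 0 and 2
    [0.1, 0.8, 0.3, 0.7],  # head 1 favors tokens 1 and 3
]
composite = select_composite_tokens(importance, budget=2)
```

Here the two heads retain different source tokens, yet the result has a uniform shape of two slots by two heads, which is the property that keeps a dense cache layout usable.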

FLOWER Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies

Authors: Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, Rudolf Lioutikov

2025-09-05

http://arxiv.org/abs/2509.04996v1

Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to $50\%$ of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by $20\%$ through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers performance competitive with bigger VLAs across $190$ tasks spanning ten simulation and real-world benchmarks and demonstrates robustness across diverse robotic embodiments. In addition, FLOWER achieves a new SoTA of 4.53 on the CALVIN ABC benchmark. Demos, code and pretrained weights are available at https://intuitive-robots.github.io/flower_vla/.

Authors: Byeong-Il Ham, Hyun-Bin Kim, Kyung-Soo Kim

2025-09-05

http://arxiv.org/abs/2509.04950v1

In this paper, we propose a 3D path planning method that integrates the A* algorithm with the octree structure. Unmanned Ground Vehicles (UGVs) and legged robots have been extensively studied, enabling locomotion across a variety of terrains. Advances in mobility have enabled obstacles to be regarded not only as hindrances to be avoided, but also as navigational aids when beneficial. A modified 3D A* algorithm generates an optimal path by leveraging obstacles during the planning process. By incorporating a height-based penalty into the cost function, the algorithm enables the use of traversable obstacles to aid locomotion while avoiding those that are impassable, resulting in more efficient and realistic path generation. The octree-based 3D grid map achieves compression by merging high-resolution nodes into larger blocks, especially in obstacle-free or sparsely populated areas. This reduces the number of nodes explored by the A* algorithm, thereby improving computational efficiency and memory usage, and supporting real-time path planning in practical environments. Benchmark results demonstrate that the use of the octree structure ensures an optimal path while significantly reducing memory usage and computation time.
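
The height-based penalty can be illustrated on a simplified 2.5D height map instead of the paper's octree-backed 3D grid. The cost model below (unit step cost plus `penalty` times the height change, with steps taller than `max_step` treated as impassable) is an assumption made for this sketch, as are the function name and parameter values; the abstract does not give the actual cost function.

```python
import heapq


def astar_height(heights, start, goal, max_step=1, penalty=0.5):
    """A* on a 4-connected height map with a height-change penalty.

    Returns (path, cost), or (None, inf) if the goal is unreachable.
    """
    rows, cols = len(heights), len(heights[0])

    def heuristic(cell):
        # Manhattan distance is admissible because every step costs >= 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0.0, start)]
    best = {start: 0.0}
    parent = {start: None}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1], g
        if g > best[cell]:
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            dh = abs(heights[nr][nc] - heights[r][c])
            if dh > max_step:
                continue  # too tall to climb: treated as impassable
            ng = g + 1.0 + penalty * dh
            if ng < best.get((nr, nc), float("inf")):
                best[(nr, nc)] = ng
                parent[(nr, nc)] = cell
                heapq.heappush(open_heap, (ng + heuristic((nr, nc)), ng, (nr, nc)))
    return None, float("inf")


heights = [
    [0, 1, 0],
    [0, 3, 0],
    [0, 0, 0],
]
path, cost = astar_height(heights, start=(0, 0), goal=(0, 2))
```

In this toy map the planner steps over the low obstacle of height 1 rather than detouring around the tall block of height 3, which mirrors the abstract's distinction between traversable and impassable obstacles.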

PLaMo 2 Technical Report

Authors: Preferred Networks, Kaizaburo Chubachi, Yasuhiro Fujita, Shinichi Hemmi, Yuta Hirokawa, Toshiki Kataoka, Goro Kobayashi, Kenichi Maehashi, Calvin Metzger, Hiroaki Mikami, Shogo Murai, Daisuke Nishino, Kento Nozawa, Shintarou Okada, Daisuke Okanohara, Shunta Saito, Shotaro Sano, Shuji Suzuki, Daisuke Tanaka, Avinash Ummadisingu, Hanqin Wang, Sixue Wang, Tianqi Xu

2025-09-05

http://arxiv.org/abs/2509.04897v1

In this report, we introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture that transitions to full attention via continual pre-training to support 32K token contexts. Training leverages extensive synthetic corpora to overcome data scarcity, while computational efficiency is achieved through weight reuse and structured pruning. This efficient pruning methodology produces an 8B model that achieves performance comparable to our previous 100B model. Post-training further refines the models using a pipeline of supervised fine-tuning (SFT) and direct preference optimization (DPO), enhanced by synthetic Japanese instruction data and model merging techniques. Optimized for inference using vLLM and quantization with minimal accuracy loss, the PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.

OSC Cognitive Orchestration through Dynamic Knowledge Alignment in Multi-Agent LLM Collaboration

Authors: Jusheng Zhang, Yijia Fan, Kaitong Cai, Xiaofei Sun, Keze Wang

2025-09-05

http://arxiv.org/abs/2509.04876v1

This paper introduces OSC (Orchestrating Cognitive Synergy), a knowledge-aware adaptive collaboration framework designed to enhance cognitive synergy in multi-agent systems with large language models. While prior work has advanced agent selection and result aggregation, efficient linguistic interactions for deep collaboration among expert agents remain a critical bottleneck. OSC addresses this gap as a pivotal intermediate layer between selection and aggregation, introducing Collaborator Knowledge Models (CKM) to enable each agent to dynamically perceive its collaborators' cognitive states. Through real-time cognitive gap analysis, agents adaptively adjust communication behaviors, including content focus, detail level, and expression style, using learned strategies. Experiments on complex reasoning and problem-solving benchmarks demonstrate that OSC significantly improves task performance and communication efficiency, transforming "parallel-working individuals" into a "deeply collaborative cognitive team." This framework not only optimizes multi-agent collaboration but also offers new insights into LLM agent interaction behaviors.

Broadband Simultaneous Beam Steering and Compressing Device Based on Subwavelength Protrusion Metallic Tunnels

Authors: Dongguo Zhang, Fei Sun, Qin Liao, Yichao Liu, Donguk Nam

2025-09-05

http://arxiv.org/abs/2509.04856v1

Beam steering and beamwidth compressing play a role in steering the beam and narrowing its half-power beamwidth, respectively, which are both widely applied in extending the effective operational range of 6G communications, IoT devices, and antenna systems. However, research on wave manipulation devices capable of simultaneously achieving both functionalities remains limited, despite their great potential for system miniaturization and functional integration. In this study, we design and realize a broadband device capable of simultaneously steering and compressing TM-polarized EM waves using subwavelength protrusion metallic tunnels. The underlying physical mechanisms are quantitatively explained through wave optics and optical surface transformation, indicating that the size ratio between the incident and output surfaces governs both the steering angle and the compression ratio. Numerical simulations demonstrate its outstanding performance, achieving a maximum steering angle of 40° and a compression ratio of 0.4 across 3 to 12 GHz, with averaged energy transmittance above 80%. The experiments further validate its effectiveness by measuring the magnetic field distributions of the output beam at various frequencies. The excellent beam steering and compressing effects make the proposed device highly promising for next-generation multifunctional wave manipulation in advanced communication systems.

VoltanaLLM Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving

Authors: Jiahuan Yu, Aryan Taneja, Junfeng Lin, Minjia Zhang

2025-09-05

http://arxiv.org/abs/2509.04827v1

Modern Large Language Model (LLM) serving systems increasingly support interactive applications, like real-time chat assistants, code generation tools, and agentic workflows. However, the soaring energy cost of LLM inference presents a growing challenge for sustainable and cost-effective deployment. This paper introduces VoltanaLLM, a system for SLO-aware, energy-efficient LLM serving, built from a control theory perspective. VoltanaLLM co-designs frequency scaling and request routing in emerging prefill/decode disaggregated architectures, leveraging their decoupled execution to enable fine-grained phase-specific control. It consists of a feedback-driven frequency controller that dynamically adapts GPU frequency for prefill and decode phases, and a state-space router that explores routing decisions across frequency-scaled instances to minimize energy under latency constraints. We implement VoltanaLLM in SGLang and evaluate its performance over multiple state-of-the-art LLMs and real-world datasets. The results demonstrate that VoltanaLLM achieves up to 36.3% energy savings while maintaining a near-perfect SLO attainment rate, paving the way for sustainable and intelligent LLM serving.

AI-Driven Fronthaul Link Compression in Wireless Communication Systems Review and Method Design

Authors: Keqin Zhang

2025-09-05

http://arxiv.org/abs/2509.04805v1

Modern fronthaul links in wireless systems must transport high-dimensional signals under stringent bandwidth and latency constraints, which makes compression indispensable. Traditional strategies such as compressed sensing, scalar quantization, and fixed-codec pipelines often rely on restrictive priors, degrade sharply at high compression ratios, and are hard to tune across channels and deployments. Recent progress in Artificial Intelligence (AI) has brought end-to-end learned transforms, vector and hierarchical quantization, and learned entropy models that better exploit the structure of Channel State Information (CSI), precoding matrices, I/Q samples, and LLRs. This paper first surveys AI-driven compression techniques and then provides a focused analysis of two representative high-compression routes: CSI feedback with end-to-end learning and Resource Block (RB) granularity precoding optimization combined with compression. Building on these insights, we propose a fronthaul compression strategy tailored to cell-free architectures. The design targets high compression with controlled performance loss, supports RB-level rate adaptation, and enables low-latency inference suitable for centralized cooperative transmission in next-generation networks.

Personality as a Probe for LLM Evaluation Method Trade-offs and Downstream Effects

Authors: Gunmay Handa, Zekun Wu, Adriano Koshiyama, Philip Treleaven

2025-09-05

http://arxiv.org/abs/2509.04794v1

Personality manipulation in large language models (LLMs) is increasingly applied in customer service and agentic scenarios, yet its mechanisms and trade-offs remain unclear. We present a systematic study of personality control using the Big Five traits, comparing in-context learning (ICL), parameter-efficient fine-tuning (PEFT), and mechanistic steering (MS). Our contributions are fourfold. First, we construct a contrastive dataset with balanced high/low trait responses, enabling effective steering vector computation and fair cross-method evaluation. Second, we introduce a unified evaluation framework based on within-run $\Delta$ analysis that disentangles reasoning capability, agent performance, and demographic bias across MMLU, GAIA, and BBQ benchmarks. Third, we develop trait purification techniques to separate openness from conscientiousness, addressing representational overlap in trait encoding. Fourth, we propose a three-level stability framework that quantifies method-, trait-, and combination-level robustness, offering practical guidance under deployment constraints. Experiments on Gemma-2-2B-IT and LLaMA-3-8B-Instruct reveal clear trade-offs: ICL achieves strong alignment with minimal capability loss, PEFT delivers the highest alignment at the cost of degraded task performance, and MS provides lightweight runtime control with competitive effectiveness. Trait-level analysis shows openness as uniquely challenging, agreeableness as most resistant to ICL, and personality encoding consolidating around intermediate layers. Taken together, these results establish personality manipulation as a multi-level probe into behavioral representation, linking surface conditioning, parameter encoding, and activation-level steering, and positioning mechanistic steering as a lightweight alternative to fine-tuning for both deployment and interpretability.

Decoders Laugh as Loud as Encoders

Authors: Eli Borodach, Raj Dandekar, Rajat Dandekar, Sreedath Panat

2025-09-05

http://arxiv.org/abs/2509.04779v1

From the dawn of the computer, Alan Turing dreamed of a robot that could communicate using language as a human being does. The recent advances in the field of Large Language Models (LLMs) shocked the scientific community: a single model can be applied to various natural language processing (NLP) tasks, and its output is sometimes even better than most human skills. Models such as GPT, Claude, Grok, etc. have left their mark on the scientific community. However, it is unclear how much these models understand what they produce, especially in a nuanced theme such as humor. The question of whether computers understand humor is still open (among the decoders, the latest to be checked was GPT-2). We address this issue in this paper and show that a fine-tuned decoder (GPT-4o) performed as well (mean F1-macro score of 0.85) as the best fine-tuned encoder (RoBERTa, with a mean F1-score of 0.86).

A Study of Large Language Models for Patient Information Extraction Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning

Authors: Cheng Peng, Xinyu Dong, Mengxian Lyu, Daniel Paredes, Yaoyun Zhang, Yonghui Wu

2025-09-05

http://arxiv.org/abs/2509.04753v1

Natural language processing (NLP) is a key technology to extract important patient information from clinical narratives to support healthcare applications. The rapid development of large language models (LLMs) has revolutionized many NLP tasks in the clinical domain, yet their optimal use in patient information extraction tasks requires further exploration. This study examines LLMs' effectiveness in patient information extraction, focusing on LLM architectures, fine-tuning strategies, and multi-task instruction tuning techniques for developing robust and generalizable patient information extraction systems. This study aims to explore key concepts of using LLMs for clinical concept and relation extraction tasks, including: (1) encoder-only or decoder-only LLMs, (2) prompt-based parameter-efficient fine-tuning (PEFT) algorithms, and (3) multi-task instruction tuning on few-shot learning performance. We benchmarked a suite of LLMs, including encoder-based LLMs (BERT, GatorTron) and decoder-based LLMs (GatorTronGPT, Llama 3.1, GatorTronLlama), across five datasets. We compared traditional full-size fine-tuning and prompt-based PEFT. We explored a multi-task instruction tuning framework that combines both tasks across four datasets to evaluate the zero-shot and few-shot learning performance using the leave-one-dataset-out strategy.

ODKE+ Ontology-Guided Open-Domain Knowledge Extraction with LLMs

Authors: Samira Khorshidi, Azadeh Nikfarjam, Suprita Shankar, Yisi Sang, Yash Govind, Hyun Jang, Ali Kasgari, Alexis McClimans, Mohamed Soliman, Vishnu Konda, Ahmed Fakhry, Xiaoguang Qi

2025-09-04

http://arxiv.org/abs/2509.04696v1

Knowledge graphs (KGs) are foundational to many AI applications, but maintaining their freshness and completeness remains costly. We present ODKE+, a production-grade system that automatically extracts and ingests millions of open-domain facts from web sources with high precision. ODKE+ combines modular components into a scalable pipeline: (1) the Extraction Initiator detects missing or stale facts, (2) the Evidence Retriever collects supporting documents, (3) hybrid Knowledge Extractors apply both pattern-based rules and ontology-guided prompting for large language models (LLMs), (4) a lightweight Grounder validates extracted facts using a second LLM, and (5) the Corroborator ranks and normalizes candidate facts for ingestion. ODKE+ dynamically generates ontology snippets tailored to each entity type to align extractions with schema constraints, enabling scalable, type-consistent fact extraction across 195 predicates. The system supports batch and streaming modes, processing over 9 million Wikipedia pages and ingesting 19 million high-confidence facts with 98.8% precision. ODKE+ significantly improves coverage over traditional methods, achieving up to 48% overlap with third-party KGs and reducing update lag by 50 days on average. Our deployment demonstrates that LLM-based extraction, grounded in ontological structure and verification workflows, can deliver trustworthy, production-scale knowledge ingestion with broad real-world applicability. A recording of the system demonstration is included with the submission and is also available at https://youtu.be/UcnE3_GsTWs.

First demonstration of coherent radiation imaging for bunch-by-bunch longitudinal compression monitoring

Authors: Joseph Wolfenden, Ana Guisao-Betancur, Carsten Welsch, Billy Kyle, Thomas Pacey, Erik Mansten, Sara Thorin, Mathias Brandin

2025-09-04

http://arxiv.org/abs/2509.04689v1

Longitudinal bunch profile monitoring is a crucial diagnostic requirement in most accelerator facilities. This is particularly true in modern free-electron lasers and novel schemes, where bunch lengths are often <100 fs and standard instrumentation is invasive or lacks the required resolution. This paper proposes a new monitoring method in this challenging parameter space. Initial proof of principle results for relative monitoring via broadband imaging of coherent THz radiation are presented. The technique can utilize more conventional intensity monitoring or novel spatial distribution variation. Both techniques have been demonstrated using both invasive and non-invasive coherent radiation sources. These results pave the way for a future non-invasive longitudinal bunch profile monitor.

DarkStream real-time speech anonymization with low latency

Authors: Waris Quamer, Ricardo Gutierrez-Osuna

2025-09-04

http://arxiv.org/abs/2509.04667v1

We propose DarkStream, a streaming speech synthesis model for real-time speaker anonymization. To improve content encoding under strict latency constraints, DarkStream combines a causal waveform encoder, a short lookahead buffer, and transformer-based contextual layers. To further reduce inference time, the model generates waveforms directly via a neural vocoder, thus removing intermediate mel-spectrogram conversions. Finally, DarkStream anonymizes speaker identity by injecting a GAN-generated pseudo-speaker embedding into linguistic features from the content encoder. Evaluations show our model achieves strong anonymization, yielding close to 50% speaker verification EER (near-chance performance) on the lazy-informed attack scenario, while maintaining acceptable linguistic intelligibility (WER within 9%). By balancing low latency, robust privacy, and minimal intelligibility degradation, DarkStream provides a practical solution for privacy-preserving real-time speech communication.

AraHalluEval A Fine-grained Hallucination Evaluation Framework for Arabic LLMs

Authors: Aisha Alansari, Hamzah Luqman

2025-09-04

http://arxiv.org/abs/2509.04656v2

Recent extensive research on hallucination in large language models (LLMs) has mainly focused on the English language. Despite the growing number of multilingual and Arabic-specific LLMs, evaluating LLMs' hallucination in the Arabic context remains relatively underexplored. The knowledge gap is particularly pressing given Arabic's widespread use across many regions and its importance in global communication and media. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. This study evaluates a total of 12 LLMs, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. To assess the factual consistency and faithfulness of LLMs' outputs, we developed a fine-grained hallucination evaluation framework consisting of 12 fine-grained hallucination indicators that represent the varying characteristics of each task. The results reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently demonstrates lower hallucination rates than multilingual models and performance comparable to that of reasoning-based models. The code is available at: https://github.com/aishaalansari57/AraHalluEval

Scaling Environments for Organoid Intelligence with LLM-Automated Design and Plasticity-Based Evaluation

Authors: Brennen Hill

2025-09-04

http://arxiv.org/abs/2509.04633v1

As the complexity of artificial agents increases, the design of environments that can effectively shape their behavior and capabilities has become a critical research frontier. We propose a framework that extends this principle to a novel class of agents: biological neural networks in the form of neural organoids. This paper introduces three scalable, closed-loop virtual environments designed to train organoid-based biological agents and probe the underlying mechanisms of learning, such as long-term potentiation (LTP) and long-term depression (LTD). We detail the design of three distinct task environments with increasing complexity: (1) a conditional avoidance task, (2) a one-dimensional predator-prey scenario, and (3) a replication of the classic Pong game. For each environment, we formalize the state and action spaces, the sensory encoding and motor decoding mechanisms, and the feedback protocols based on predictable (reward) and unpredictable (punishment) stimulation. Furthermore, we propose a novel meta-learning approach where a Large Language Model (LLM) is used to automate the generation and optimization of experimental protocols, scaling the process of environment and curriculum design. Finally, we outline a multi-modal approach for evaluating learning by measuring synaptic plasticity at electrophysiological, cellular, and molecular levels. This work bridges the gap between computational neuroscience and agent-based AI, offering a unique platform for studying embodiment, learning, and intelligence in a controlled biological substrate.

Schema Inference for Tabular Data Repositories Using Large Language Models

Authors: Zhenyu Wu, Jiaoyan Chen, Norman W. Paton

2025-09-04

http://arxiv.org/abs/2509.04632v1

Minimally curated tabular data often contain representational inconsistencies across heterogeneous sources, and are accompanied by sparse metadata. Working with such data is intimidating. While prior work has advanced dataset discovery and exploration, schema inference remains difficult when metadata are limited. We present SI-LLM (Schema Inference using Large Language Models), which infers a concise conceptual schema for tabular data using only column headers and cell values. The inferred schema comprises hierarchical entity types, attributes, and inter-type relationships. In extensive evaluation on two datasets from web tables and open data, SI-LLM achieves promising end-to-end results, as well as better or comparable results to state-of-the-art methods at each step. All source code, full prompts, and datasets of SI-LLM are available at https://github.com/PierreWoL/SI.

Communication-Efficient Collaborative LLM Inference via Distributed Speculative Decoding

Authors: Ce Zheng, Tingting Yang

2025-09-04

http://arxiv.org/abs/2509.04576v1

Speculative decoding is an emerging technique that accelerates large language model (LLM) inference by allowing a smaller draft model to predict multiple tokens in advance, which are then verified or corrected by a larger target model. In AI-native radio access networks (AI-RAN), this paradigm is well-suited for collaborative inference between resource-constrained end devices and more capable edge servers or base stations (BSs). However, existing distributed speculative decoding requires transmitting the full vocabulary probability distribution from the draft model on the device to the target model at the BS, which leads to prohibitive uplink overhead. To address this issue, we propose a Top-K Sparse Logits Transmission (TK-SLT) scheme, where the draft model transmits only the top-K token raw probabilities and the corresponding token indices instead of the entire distribution. This approach significantly reduces bandwidth consumption while maintaining inference performance. We further derive an analytical expression for the optimal draft length that maximizes inference throughput, and provide a theoretical analysis of the achievable speedup ratio under TK-SLT. Experimental results validate both the efficiency and effectiveness of the proposed method.
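
The top-K idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 16-bit probability width, and the 32000-token vocabulary are assumptions chosen only to show how sending (index, raw probability) pairs shrinks the uplink payload compared with the full distribution.

```python
import math

def top_k_sparsify(probs, k):
    """Return the k most likely token indices with their raw (unnormalized) probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in order]

def payload_bits(vocab_size, k, prob_bits=16):
    """Uplink payload in bits: full distribution vs. top-k (index, probability) pairs.

    prob_bits is a hypothetical per-probability width, not a value from the paper.
    """
    index_bits = math.ceil(math.log2(vocab_size))  # bits needed to address one token
    full = vocab_size * prob_bits
    sparse = k * (index_bits + prob_bits)
    return full, sparse

# Toy draft-model distribution over a 6-token vocabulary.
draft_probs = [0.02, 0.50, 0.03, 0.30, 0.05, 0.10]
sparse = top_k_sparsify(draft_probs, k=3)

# Payload comparison for an assumed 32000-token vocabulary and K = 8.
full_bits, sparse_bits = payload_bits(vocab_size=32000, k=8)
```

Under these assumed sizes the full distribution costs 512000 bits per drafted token while the top-8 message costs 248 bits; how the target model handles the truncated mass during verification is left to the paper.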

PagedEviction Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference

Authors: Krishna Teja Chitty-Venkata, Jie Ye, Xian-He Sun, Anthony Kougkas, Murali Emani, Venkatram Vishwanath, Bogdan Nicolae

2025-09-04

http://arxiv.org/abs/2509.04377v1

KV caching significantly improves the efficiency of Large Language Model (LLM) inference by storing attention states from previously processed tokens, enabling faster generation of subsequent tokens. However, as sequence length increases, the KV cache quickly becomes a major memory bottleneck. To address this, we propose PagedEviction, a novel fine-grained, structured KV cache pruning strategy that enhances the memory efficiency of vLLM's PagedAttention. Unlike existing approaches that rely on attention-based token importance or evict tokens across different vLLM pages, PagedEviction introduces an efficient block-wise eviction algorithm tailored for paged memory layouts. Our method integrates seamlessly with PagedAttention without requiring any modifications to its CUDA attention kernels. We evaluate PagedEviction across Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models on the LongBench benchmark suite, demonstrating improved memory usage with better accuracy than baselines on long context tasks.

Psychologically Enhanced AI Agents

Authors: Maciej Besta, Shriram Chandran, Robert Gerstenberger, Mathis Lindner, Marcin Chrapek, Sebastian Hermann Martschat, Taraneh Ghandi, Patrick Iff, Hubert Niewiadomski, Piotr Nyczyk, Jürgen Müller, Torsten Hoefler

2025-09-04

http://arxiv.org/abs/2509.04343v1

We introduce MBTI-in-Thoughts, a framework for enhancing the effectiveness of Large Language Model (LLM) agents through psychologically grounded personality conditioning. Drawing on the Myers-Briggs Type Indicator (MBTI), our method primes agents with distinct personality archetypes via prompt engineering, enabling control over behavior along two foundational axes of human psychology, cognition and affect. We show that such personality priming yields consistent, interpretable behavioral biases across diverse tasks: emotionally expressive agents excel in narrative generation, while analytically primed agents adopt more stable strategies in game-theoretic settings. Our framework supports experimenting with structured multi-agent communication protocols and reveals that self-reflection prior to interaction improves cooperation and reasoning quality. To ensure trait persistence, we integrate the official 16Personalities test for automated verification. While our focus is on MBTI, we show that our approach generalizes seamlessly to other psychological frameworks such as Big Five, HEXACO, or Enneagram. By bridging psychological theory and LLM behavior design, we establish a foundation for psychologically enhanced AI agents without any fine-tuning.

Cross-Layer Attention Probing for Fine-Grained Hallucination Detection

Authors: Malavika Suresh, Rahaf Aljundi, Ikechukwu Nkisi-Orji, Nirmalie Wiratunga

2025-09-04

http://arxiv.org/abs/2509.09700v1

With the large-scale adoption of Large Language Models (LLMs) in various applications, there is a growing reliability concern due to their tendency to generate inaccurate text, i.e. hallucinations. In this work, we propose Cross-Layer Attention Probing (CLAP), a novel activation probing technique for hallucination detection, which processes the LLM activations across the entire residual stream as a joint sequence. Our empirical evaluations using five LLMs and three tasks show that CLAP improves hallucination detection compared to baselines on both greedy decoded responses and responses sampled at higher temperatures, thus enabling fine-grained detection, i.e. the ability to disambiguate hallucinations and non-hallucinations among different sampled responses to a given prompt. This allows us to propose a detect-then-mitigate strategy using CLAP to reduce hallucinations and improve LLM reliability compared to direct mitigation approaches. Finally, we show that CLAP maintains high reliability even when applied out-of-distribution.

Integrating Pruning with Quantization for Efficient Deep Neural Networks Compression

Authors: Sara Makenali, Babak Rokh, Ali Azarpeyvand

2025-09-04

http://arxiv.org/abs/2509.04244v1

Deep Neural Networks (DNNs) have achieved significant advances in a wide range of applications. However, their deployment on resource-constrained devices remains a challenge due to the large number of layers and parameters, which result in considerable computational and memory demands. To address this issue, pruning and quantization are two widely used techniques, commonly applied individually in most studies to reduce model size and enhance processing speed. Nevertheless, combining these two techniques can yield even greater benefits. Effectively integrating pruning and quantization to harness their complementary advantages poses a challenging task, primarily due to their potential impact on model accuracy and the complexity of jointly optimizing both processes. In this paper, we propose two approaches that integrate similarity-based filter pruning with Adaptive Power-of-Two (APoT) quantization to achieve higher compression efficiency while preserving model accuracy. In the first approach, pruning and quantization are applied simultaneously during training. In the second approach, pruning is performed first to remove less important parameters, followed by quantization of the pruned model using low-bit representations. Experimental results demonstrate that our proposed approaches achieve effective model compression with minimal accuracy degradation, making them well-suited for deployment on devices with limited computational resources.
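
The second, prune-then-quantize approach can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's method: cosine similarity stands in for the similarity criterion, and a plain single-term power-of-two quantizer stands in for APoT (which combines several power-of-two terms). All function names and the exponent range are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two flattened filters."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def prune_most_similar(filters, n_remove):
    """Repeatedly drop the second filter of the most similar remaining pair.

    Returns the indices of the filters that are kept.
    """
    kept = list(range(len(filters)))
    for _ in range(n_remove):
        if len(kept) < 2:
            break
        pairs = [(i, j) for pos, i in enumerate(kept) for j in kept[pos + 1:]]
        _, drop = max(pairs, key=lambda p: cosine(filters[p[0]], filters[p[1]]))
        kept.remove(drop)
    return kept

def quantize_pot(w, min_exp=-4, max_exp=0):
    """Round a weight to a signed power of two (nearest exponent in the log domain).

    Weights whose rounded exponent falls below min_exp are set to zero.
    """
    if w == 0:
        return 0.0
    exp = round(math.log2(abs(w)))
    if exp < min_exp:
        return 0.0
    return math.copysign(2.0 ** min(exp, max_exp), w)

# Three toy filters; the first two point in almost the same direction.
filters = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
kept = prune_most_similar(filters, n_remove=1)
quantized = [[quantize_pot(w) for w in filters[i]] for i in kept]
```

In this toy run the redundant second filter is removed and the surviving weights already sit on the power-of-two grid; the first approach in the abstract would instead interleave both steps inside the training loop.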

Real Time FPGA Based Transformers & VLMs for Vision Tasks SOTA Designs and Optimizations

Authors: Safa Mohammed Sali, Mahmoud Meribout, Ashiyana Abdul Majeed

2025-09-04

http://arxiv.org/abs/2509.04162v1

Transformers and vision-language models (VLMs) have emerged as dominant architectures in computer vision and multimodal AI, offering state-of-the-art performance in tasks such as image classification, object detection, visual question answering, and caption generation. However, their high computational complexity, large memory footprints, and irregular data access patterns present significant challenges for deployment in latency- and power-constrained environments. Field-programmable gate arrays (FPGAs) provide an attractive hardware platform for such workloads due to their reconfigurability, fine-grained parallelism, and potential for energy-efficient acceleration. This paper presents a comprehensive review of design trade-offs, optimization strategies, and implementation challenges for FPGA-based inference of transformers and VLMs. We examine critical factors such as device-class selection, memory subsystem constraints, dataflow orchestration, quantization strategies, sparsity exploitation, and toolchain choices, alongside modality-specific issues unique to VLMs, including heterogeneous compute balancing and cross-attention memory management. Additionally, we discuss emerging trends in hardware-algorithm co-design, highlighting innovations in attention mechanisms, compression, and modular overlays to improve efficiency and adaptability. Practical issues such as runtime flexibility, verification overhead, and the absence of standardized FPGA multimodal benchmarks are also considered. Finally, we outline future directions toward scalable, portable, and reconfigurable FPGA solutions that adapt to evolving model architectures while sustaining high utilization and predictable performance. This synthesis offers both a technical foundation and a forward-looking perspective to help bridge the gap between advanced multimodal AI models and efficient FPGA deployment.

MultiWikiQA A Reading Comprehension Benchmark in 300+ Languages

Authors: Dan Saattrup Smart

2025-09-04

http://arxiv.org/abs/2509.04111v2

We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages. The context data comes from Wikipedia articles, with questions generated by an LLM and the answers appearing verbatim in the Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency of the generated questions across 30 of the languages, providing evidence that the questions are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages. The dataset and survey evaluations are freely available.

Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue

Authors: Keara Schaaij, Roel Boumans, Tibor Bosse, Iris Hendrickx

2025-09-04

http://arxiv.org/abs/2509.04104v1

Lexical alignment, where speakers start to use similar words across conversation, is known to contribute to successful communication. However, its implementation in conversational agents remains underexplored, particularly considering the recent advancements in large language models (LLMs). As a first step towards enabling lexical alignment in human-agent dialogue, this study draws on strategies for personalising conversational agents and investigates the construction of stable, personalised lexical profiles as a basis for lexical alignment. Specifically, we varied the amount of transcribed spoken data used for construction as well as the number of items included in the profiles per part-of-speech (POS) category, and evaluated profile performance across time using recall, coverage, and cosine similarity metrics. Smaller and more compact profiles, created after 10 min of transcribed speech and containing 5 items for adjectives, 5 items for conjunctions, and 10 items each for adverbs, nouns, pronouns, and verbs, offered the best balance between performance and data efficiency. In conclusion, this study offers practical insights into constructing stable, personalised lexical profiles, taking into account minimal data requirements, as a foundational step toward lexical alignment strategies in conversational agents.
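
A compact profile of the kind described above might be built as follows. This is a sketch under assumptions: the abstract does not say how items are selected or how recall is defined, so frequency-based selection and the recall formula here are illustrative choices, the POS tag names are hypothetical, and the input is assumed to be already POS-tagged.

```python
from collections import Counter, defaultdict

# Per-category profile sizes taken from the best-performing setup in the abstract.
PROFILE_SIZES = {"ADJ": 5, "CONJ": 5, "ADV": 10, "NOUN": 10, "PRON": 10, "VERB": 10}

def build_profile(tagged_tokens, sizes=PROFILE_SIZES):
    """Keep the most frequent lowercased words per POS category (assumed criterion)."""
    counts = defaultdict(Counter)
    for word, pos in tagged_tokens:
        if pos in sizes:
            counts[pos][word.lower()] += 1
    return {pos: [w for w, _ in counts[pos].most_common(n)] for pos, n in sizes.items()}

def profile_recall(profile, later_tagged_tokens):
    """Fraction of profile items that reappear in later speech (assumed definition)."""
    later = {(w.lower(), p) for w, p in later_tagged_tokens}
    items = [(w, p) for p, words in profile.items() for w in words]
    if not items:
        return 0.0
    return sum(item in later for item in items) / len(items)

# Toy example: an early transcript builds the profile, a later one evaluates it.
early = [("really", "ADV"), ("really", "ADV"), ("nice", "ADJ"), ("walk", "VERB"), ("I", "PRON")]
later = [("really", "ADV"), ("I", "PRON"), ("run", "VERB")]
profile = build_profile(early)
recall = profile_recall(profile, later)
```

Here two of the four profile items recur in the later transcript, giving a recall of 0.5 under this assumed definition.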

Meta-Policy Reflexion Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent

Authors: Chunlong Wu, Ye Luo, Zhibo Qu, Min Wang

2025-09-04

http://arxiv.org/abs/2509.03990v2

Large language model (LLM) agents achieve impressive single-task performance but commonly exhibit repeated failures, inefficient exploration, and limited cross-task adaptability. Existing reflective strategies (e.g., Reflexion, ReAct) improve per-episode behavior but typically produce ephemeral, task-specific traces that are not reused across tasks. Reinforcement-learning based alternatives can produce transferable policies but require substantial parameter updates and compute. In this work, we introduce Meta-Policy Reflexion (MPR): a hybrid framework that consolidates LLM-generated reflections into a structured, predicate-like Meta-Policy Memory (MPM) and applies that memory at inference time through two complementary mechanisms: soft memory-guided decoding and hard rule admissibility checks (HAC). MPR (i) externalizes reusable corrective knowledge without model weight updates, (ii) enforces domain constraints to reduce unsafe or invalid actions, and (iii) retains the adaptability of language-based reflection. We formalize the MPM representation, present algorithms for update and decoding, and validate the approach in a text-based agent environment following the experimental protocol described in the provided implementation (AlfWorld-based). Empirical results reported in the supplied material indicate consistent gains in execution accuracy and robustness when compared to Reflexion baselines; rule admissibility further improves stability. We analyze mechanisms that explain these gains, discuss scalability and failure modes, and outline future directions for multimodal and multi-agent extensions.
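
The interplay of the two mechanisms can be illustrated with a small Python sketch. Everything here is hypothetical: the abstract does not give the MPM format, so rules are modeled as Python predicates over (state, action), soft guidance is approximated as a score penalty rather than actual LLM decoding, and the class and function names are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class MetaPolicyMemory:
    """Hypothetical stand-in for the Meta-Policy Memory (MPM)."""
    soft_rules: list = field(default_factory=list)  # (predicate, penalty) pairs
    hard_rules: list = field(default_factory=list)  # predicates marking inadmissible actions

    def admissible(self, state, action):
        """Hard check: an action is rejected if any hard rule fires."""
        return not any(rule(state, action) for rule in self.hard_rules)

    def score(self, state, action, base_score):
        """Soft guidance: subtract the penalty of every soft rule that fires."""
        penalty = sum(w for rule, w in self.soft_rules if rule(state, action))
        return base_score - penalty

def choose_action(memory, state, candidates):
    """Pick the best-scoring admissible action from (action, base_score) candidates."""
    allowed = [(a, s) for a, s in candidates if memory.admissible(state, a)]
    if not allowed:
        return None
    return max(allowed, key=lambda c: memory.score(state, c[0], c[1]))[0]

# Toy household-style state and rules learned from earlier (imagined) failures.
memory = MetaPolicyMemory(
    soft_rules=[(lambda s, a: a == "open drawer" and "drawer" in s["open"], 0.3)],
    hard_rules=[lambda s, a: a.startswith("put") and s["holding"] is None],
)
state = {"holding": None, "open": {"drawer"}}
candidates = [("put apple", 0.9), ("open drawer", 0.6), ("take apple", 0.5)]
action = choose_action(memory, state, candidates)
```

In this toy case the highest-scoring proposal ("put apple") is blocked by the hard rule, the redundant "open drawer" is down-weighted by the soft rule, and "take apple" is selected, all without any weight update.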

LMVC An End-to-End Learned Multiview Video Coding Framework

Authors: Xihua Sheng, Yingwen Zhang, Long Xu, Shiqi Wang

2025-09-04

http://arxiv.org/abs/2509.03922v1

Multiview video is a key data source for volumetric video, enabling immersive 3D scene reconstruction but posing significant challenges in storage and transmission due to its massive data volume. Recently, deep learning-based end-to-end video coding has achieved great success, yet most methods focus on single-view or stereo videos, leaving general multiview scenarios underexplored. This paper proposes an end-to-end learned multiview video coding (LMVC) framework that ensures random access and backward compatibility while enhancing compression efficiency. Our key innovation lies in effectively leveraging independent-view motion and content information to enhance dependent-view compression. Specifically, to exploit the inter-view motion correlation, we propose a feature-based inter-view motion vector prediction method that conditions dependent-view motion encoding on decoded independent-view motion features, along with an inter-view motion entropy model that learns inter-view motion priors. To exploit the inter-view content correlation, we propose a disparity-free inter-view context prediction module that predicts inter-view contexts from decoded independent-view content features, combined with an inter-view contextual entropy model that captures inter-view context priors. Experimental results show that our proposed LMVC framework outperforms the reference software of the traditional MV-HEVC standard by a large margin, establishing a strong baseline for future research in this field.

MTQA Matrix of Thought for Enhanced Reasoning in Complex Question Answering

Authors: Fengxiao Tang, Yufeng Li, Zongzong Wu, Ming Zhao

2025-09-04

http://arxiv.org/abs/2509.03918v1

Complex Question Answering (QA) is a fundamental and challenging task in NLP. While large language models (LLMs) exhibit impressive performance in QA, they suffer from significant performance degradation when facing complex and abstract QA tasks due to insufficient reasoning capabilities. Works such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) aim to enhance LLMs' reasoning abilities, but they face issues such as in-layer redundancy in tree structures and single paths in chain structures. Although some studies utilize Retrieval-Augmented Generation (RAG) methods to assist LLMs in reasoning, the challenge of effectively utilizing large amounts of information involving multiple entities and hops remains critical. To address this, we propose the Matrix of Thought (MoT), a novel and efficient LLM thought structure. MoT explores the problem in both horizontal and vertical dimensions through the "column-cell communication" mechanism, enabling LLMs to actively engage in multi-strategy and deep-level thinking, reducing redundancy within the column cells and enhancing reasoning capabilities. Furthermore, we develop a fact-correction mechanism by constructing knowledge units from retrieved knowledge graph triples and raw text to enhance the initial knowledge for LLM reasoning and correct erroneous answers. This leads to the development of an efficient and accurate QA framework (MTQA). Experimental results show that our framework outperforms state-of-the-art methods on four widely-used datasets in terms of F1 and EM scores, with reasoning time only 14.4% of the baseline methods, demonstrating both its efficiency and accuracy. The code for this framework is available at https://github.com/lyfiter/mtqa.