2025-09-15
Table of Contents
- Inpainting-Guided Policy Optimization for Diffusion Large Language Models
- Dropping Experts, Recombining Neurons Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs
- MCBP A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness
- Characterizing the Efficiency of Distributed Training A Power, Performance, and Thermal Perspective
- Efficient Learned Image Compression Through Knowledge Distillation
- Compute Only 16 Tokens in One Timestep Accelerating Diffusion Transformers with Cluster-Driven Feature Caching
- OpenCSP A Deep Learning Framework for Crystal Structure Prediction from Ambient to High Pressure
- SignClip Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion
- A Symmetry-Integrated Approach to Surface Code Decoding
- FedBiF Communication-Efficient Federated Learning via Bits Freezing
- Perfect quantum state transfer via state restoring and ancilla measurement
- Semantic Rate-Distortion Theory with Applications
- Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge
- LLMs as Agentic Cooperative Players in Multiplayer UNO
- Latency and Token-Aware Test-Time Compute
- Towards an AI-based knowledge assistant for goat farmers based on Retrieval-Augmented Generation
- CoDiCodec Unifying Continuous and Discrete Compressed Representations of Audio
- ButterflyQuant Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
- LAVa Layer-wise KV Cache Eviction with Dynamic Budget Allocation
- Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates
- TrEnv Transparently Share Serverless Execution Environments Across Different Functions and Nodes
- Combating the Memory Walls Optimization Pathways for Long-Context Agentic LLM Inference
- ENSI Efficient Non-Interactive Secure Inference for Large Language Models
- HD-MoE Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing
- DiTReducio A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration
- Efficient Transformer-Based Piano Transcription With Sparse Attention Mechanisms
- From scratch to silver Creating trustworthy training data for patent-SDG classification using Large Language Models
- Harnessing Uncertainty Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
- Medverse A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement
- CCF A Context Compression Framework for Efficient Long-Sequence Language Modeling
- GmSLM Generative Marmoset Spoken Language Modeling
- AI Reasoning for Wireless Communications and Networking A Survey and Perspectives
- Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication
- DP-FedLoRA Privacy-Enhanced Federated Fine-Tuning for On-Device Large Language Models
- Towards Confidential and Efficient LLM Inference with Dual Privacy Protection
- SQAP-VLA A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models
- Instructional Prompt Optimization for Few-Shot LLM-Based Recommendations on Cold-Start Users
- VoxelFormer Parameter-Efficient Multi-Subject Visual Decoding from fMRI
- CSI Compression Beyond Latents End-to-End Hybrid Attention-CNN Networks with Entropy Regularization
- CrowdQuery Density-Guided Query Module for Enhanced 2D and 3D Detection in Crowded Scenes
- ChemBOMAS Accelerated BO in Chemistry with LLM-Enhanced Multi-Agent System
- Compressing CNN models for resource-constrained systems by channel and layer pruning
- Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching
- Time-Dependent Modeling of the Sub-Hour Spectral Evolution During the 2013 Outburst of Mrk 421
- Deep Unrolling of Sparsity-Induced RDO for 3D Point Cloud Attribute Coding
- BitROM Weight Reload-Free CiROM Architecture Towards Billion-Parameter 1.58-bit LLM Inference
- Two Sides of the Same Optimization Coin Model Degradation and Representation Collapse in Graph Foundation Models
- Efficient Decoding Methods for Language Models on Encrypted Data
- Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video
- Persistent-DPO A novel loss function and hybrid learning for generative quantum eigensolver
- Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism
- Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing
- EvolKV Evolutionary KV Cache Compression for LLM Inference
- Towards Knowledge-Aware Document Systems Modeling Semantic Coverage Relations via Answerability Detection
- RTR A Transformer-Based Lossless Crossover with Perfect Phase Alignment
- Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning
- Strategies for Improving Communication Efficiency in Distributed and Federated Learning Compression, Local Training, and Personalization
- Sketched Gaussian Mechanism for Private Federated Learning
- XML Prompting as Grammar-Constrained Interaction Fixed-Point Semantics, Convergence Guarantees, and Human-AI Protocols
- OCTANE -- Optimal Control for Tensor-based Autoencoder Network Emergence Explicit Case
- SCA-LLM Spectral-Attentive Channel Prediction with Large Language Models in MIMO-OFDM
- Tensor-Train Operator Inference
- Feature Space Analysis by Guided Diffusion Model
- Biased Tales Cultural and Topic Bias in Generating Children's Stories
- A Robot That Listens Enhancing Self-Disclosure and Engagement Through Sentiment-based Backchannels and Active Listening
- Are Humans as Brittle as Large Language Models?
- Query Expansion in the Age of Pre-trained and Large Language Models A Comprehensive Survey
- SEEC Segmentation-Assisted Multi-Entropy Models for Learned Lossless Image Compression
- Unleashing the True Potential of LLMs A Feedback-Triggered Self-Correction with Long-Term Multipath Decoding
- Collaborative Exploration with a Marsupial Ground-Aerial Robot Team through Task-Driven Map Compression
- Topology-Aware Optimization of Gaussian Primitives for Human-Centric Volumetric Videos
- MaLei at MultiClinSUM Summarisation of Clinical Documents using Perspective-Aware Iterative Self-Prompting with LLMs
- PanoLAM Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image
- PatchSeeker Mapping NVD Records to their Vulnerability-fixing Commits with LLM Generated Commits and Embeddings
- Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data
- Multi-view-guided Passage Reranking with Large Language Models
- DuoServe-MoE Dual-Phase Expert Prefetch and Cache Scheduling for Efficient MoE LLM Inference
- PersonaFuse A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions
- Explaining How Quantization Disparately Skews a Model
- Neurocognitive Modeling for Text Generation Deep Learning Architecture for EEG Data
- DischargeSim A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge
- Faster VGGT with Block-Sparse Global Attention
- HOT Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers
- Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
- Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data
- From Noise to Narrative Tracing the Origins of Hallucinations in Transformers
- Barlow-Swin Toward a novel siamese-based segmentation architecture using Swin-Transformers
- COMPACT Common-token Optimized Model Pruning Across Channels and Tokens
- Guided Decoding and Its Critical Role in Retrieval-Augmented Generation
- HAVE Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models
- Reasoning-enhanced Query Understanding through Decomposition and Interpretation
- SLiNT Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion
- Synesthesia of Machines (SoM)-Aided LiDAR Point Cloud Transmission for Collaborative Perception
- Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
- HyFedRAG A Federated Retrieval-Augmented Generation Framework for Heterogeneous and Privacy-Sensitive Data
- Tree of Agents Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning
- NeuroDeX Unlocking Diverse Support in Decompiling Deep Neural Network Executables
- Mask-GCG Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?
- A Geometric Multigrid-Accelerated Compact Gas-Kinetic Scheme for Fast Convergence in High-Speed Flows on GPUs
- Ban&Pick Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs
- Towards scalable organ level 3D plant segmentation Bridging the data algorithm computing gap
- Text4Seg++ Advancing Image Segmentation via Generative Language Modeling
- LoaQ Layer-wise Output Approximation Quantization
- RecMind LLM-Enhanced Graph Neural Networks for Personalized Consumer Recommendations
- FineServe Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
- Understanding the Influence of Synthetic Data for Text Embedders
- Home-made Diffusion Model from Scratch to Hatch
- 1 bit is all we need binary normalized neural networks
- A Unified Framework for Cultural Heritage Data Historicity and Migration The ARGUS Approach
- Micro-Expression Recognition via Fine-Grained Dynamic Perception
- MEGS Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning
- Application Space and the Rate-Distortion-Complexity Analysis of Neural Video CODECs
- Physics-Guided Diffusion Transformer with Spherical Harmonic Posterior Sampling for High-Fidelity Angular Super-Resolution in Diffusion MRI
- Beyond I'm Sorry, I Can't Dissecting Large Language Model Refusal
- Chatbot To Help Patients Understand Their Health
- time2time Causal Intervention in Hidden States to Simulate Rare Events in Time Series Foundation Models
- LM-Searcher Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding
- Cross-Service Threat Intelligence in LLM Services using Privacy-Preserving Fingerprints
- Icon Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation
- ProfilingAgent Profiling-Guided Agentic Reasoning for Adaptive Model Optimization
- Sensitivity-Aware Post-Training Quantization for Deep Neural Networks
- TreeGPT Pure TreeFFN Encoder-Decoder Architecture for Structured Reasoning Without Attention Mechanisms
- veScale Consistent and Efficient Tensor Programming with Eager-Mode SPMD
- Dynamic Sensitivity Filter Pruning using Multi-Agent Reinforcement Learning For DCNN's
- Crosscoding Through Time Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining
- Recomposer Event-roll-guided generative audio editing
- Exploring Autoregressive Vision Foundation Models for Image Compression
- KVCompose Efficient Structured KV Cache Compression with Composite Tokens
- FLOWER Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies
- Ground-Aware Octree-A* Hybrid Path Planning for Memory-Efficient 3D Navigation of Ground Vehicles
- PLaMo 2 Technical Report
- OSC Cognitive Orchestration through Dynamic Knowledge Alignment in Multi-Agent LLM Collaboration
- Broadband Simultaneous Beam Steering and Compressing Device Based on Subwavelength Protrusion Metallic Tunnels
- VoltanaLLM Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
- AI-Driven Fronthaul Link Compression in Wireless Communication Systems Review and Method Design
- Personality as a Probe for LLM Evaluation Method Trade-offs and Downstream Effects
- Decoders Laugh as Loud as Encoders
- A Study of Large Language Models for Patient Information Extraction Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning
- ODKE+ Ontology-Guided Open-Domain Knowledge Extraction with LLMs
- First demonstration of coherent radiation imaging for bunch-by-bunch longitudinal compression monitoring
- DarkStream real-time speech anonymization with low latency
- AraHalluEval A Fine-grained Hallucination Evaluation Framework for Arabic LLMs
- Scaling Environments for Organoid Intelligence with LLM-Automated Design and Plasticity-Based Evaluation
- Schema Inference for Tabular Data Repositories Using Large Language Models
- Communication-Efficient Collaborative LLM Inference via Distributed Speculative Decoding
- PagedEviction Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference
- Psychologically Enhanced AI Agents
- Cross-Layer Attention Probing for Fine-Grained Hallucination Detection
- Integrating Pruning with Quantization for Efficient Deep Neural Networks Compression
- Real Time FPGA Based Transformers & VLMs for Vision Tasks SOTA Designs and Optimizations
- MultiWikiQA A Reading Comprehension Benchmark in 300+ Languages
- Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue
- Meta-Policy Reflexion Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent
- LMVC An End-to-End Learned Multiview Video Coding Framework
- MTQA Matrix of Thought for Enhanced Reasoning in Complex Question Answering
Inpainting-Guided Policy Optimization for Diffusion Large Language Models
Authors: Siyan Zhao, Mengchen Liu, Jing Huang, Miao Liu, Chenyu Wang, Bo Liu, Yuandong Tian, Guan Pang, Sean Bell, Aditya Grover, Feiyu Chen
2025-09-12
Masked diffusion large language models (dLLMs) are emerging as promising
alternatives to autoregressive LLMs, offering competitive performance while
supporting unique generation capabilities such as inpainting. We explore how
inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with
reinforcement learning faces an exploration challenge: sparse reward signals
and sample waste when models fail to discover correct solutions. While this
inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity--their
inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided
Policy Optimization), an RL framework that strategically inserts partial
ground-truth reasoning traces during online sampling. Unlike providing full
solutions, inpainting steers exploration toward promising trajectory spaces
while preserving self-generated reasoning, bridging supervised fine-tuning and
reinforcement learning. We apply IGPO to group-based optimization methods such
as GRPO, where exploration failures cause zero advantages and gradients. IGPO
restores meaningful gradients while improving sample efficiency. We also
propose supervised fine-tuning on synthetically rewritten concise traces that
better align with dLLM generation patterns. With additional techniques
including entropy-based filtering, our training recipe yields substantial gains
across three mathematical benchmarks--GSM8K, Math500, and AMC--achieving new
state-of-the-art results for full-attention masked dLLMs.
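The remark that exploration failures "cause zero advantages and gradients" can be illustrated with a minimal sketch. It assumes the commonly used group-relative advantage (reward minus group mean, divided by group standard deviation); the abstract does not state the exact formula, and the function name and reward values below are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every sample in the group fails: all advantages are zero, so no gradient signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One sample succeeds (for example after a partial hint is inpainted, as the
# abstract describes): the advantages become non-zero again.
one_succeeded = group_relative_advantages([0.0, 0.0, 0.0, 1.0])

print(all_failed)
print(one_succeeded)
```

Under this assumption, a single rewarded sample in the group is enough to restore a learning signal, which matches the motivation given in the abstract.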
Dropping Experts, Recombining Neurons Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs
Authors: Yixiao Zhou, Ziyu Zhao, Dongzhou Cheng, Zhiliang Wu, Jie Gui, Yi Yang, Fei Wu, Yu Cheng, Hehe Fan
2025-09-12
Sparse Mixture-of-Experts (SMoE) architectures are widely used in large
language models (LLMs) due to their computational efficiency. However, though
only a few experts are activated for each token, SMoE still requires loading
all expert parameters, leading to high memory usage and challenges in
deployment. Previous work has tried to reduce the overhead by pruning and
merging experts, but primarily focused on expert-level operations, leaving
neuron-level structure underexplored. We propose DERN (Dropping Experts,
Recombining Neurons), a task-agnostic and retraining-free framework for expert
pruning and reconstruction. We observe that experts are often misaligned and
contain semantic conflicts at the neuron level, which poses challenges for
direct merging. To solve this, DERN works in three steps: it first prunes
redundant experts using router statistics; then it decomposes them into
neuron-level expert segments, assigning each segment to its most compatible
retained expert; and finally, it merges segments within each retained expert to
build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE
models show that DERN improves performance by more than 5% on commonsense
reasoning and MMLU benchmarks under 50% expert sparsity, without extra
training. It also greatly reduces the number of experts and memory usage,
making SMoE LLMs easier to deploy in practice.
MCBP A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness
Authors: Huizheng Wang, Zichuan Wang, Zhiheng Yue, Yousheng Long, Taiquan Wei, Jianxun Yang, Yang Wang, Chao Li, Shaojun Wei, Yang Hu, Shouyi Yin
2025-09-12
Large language models (LLMs) face significant inference latency due to
inefficiencies in GEMM operations, weight access, and KV cache access,
especially in real-time scenarios. This highlights the need for a versatile
compute-memory efficient accelerator. Unfortunately, existing Transformer
accelerators struggle to address both aspects simultaneously, as they focus on
value-level processing, missing fine-grained opportunities to optimize
computation and memory collaboratively. This paper introduces MCBP, a
bit-grained compute-memory efficient algorithm-hardware co-design that
leverages bit-slice (BS) enabled repetitiveness and sparsity to accelerate LLM
inference. MCBP features three key innovations: 1) BS-repetitiveness-enabled
computation reduction (BRCR), which eliminates redundant GEMM computations by
leveraging redundancy hidden among BS vectors; 2) BS-sparsity-enabled two-state
coding (BSTC), which reduces weight access by exploiting significant sparsity
in high-order bit-slice weights; 3) Bit-grained progressive prediction (BGPP),
which reduces KV cache access by leveraging early-termination-based bit-grained
prediction. These techniques, supported by custom accelerator designs,
effectively alleviate the burden in GEMM, weight access, and KV cache access.
Extensive experiments on 26 benchmarks show that MCBP achieves a 9.43x speedup
and 31.1x higher energy efficiency than an Nvidia A100 GPU. Compared to SOTA
Transformer accelerators, MCBP achieves 35x, 5.2x and 3.2x energy savings over
Spatten, FACT and SOFA, respectively.
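To make the bit-slice idea concrete, here is a minimal sketch of splitting integer weights into binary slices and computing a dot product slice by slice. It assumes unsigned 8-bit magnitudes and hypothetical helper names; MCBP's actual coding formats and hardware dataflow are not described in the abstract and are not reproduced here.

```python
def bit_slices(weights, bits=8):
    """Split non-negative integers into `bits` binary vectors, least significant first."""
    return [[(w >> b) & 1 for w in weights] for b in range(bits)]

def dot_via_slices(weights, x, bits=8):
    """Dot product computed one bit-slice at a time, weighted by powers of two."""
    total = 0
    for b, sl in enumerate(bit_slices(weights, bits)):
        total += (1 << b) * sum(s * xi for s, xi in zip(sl, x))
    return total

weights = [3, 5, 2, 9, 1, 6, 4, 7]   # small magnitudes, illustrative values only
x = [1, -2, 3, 0, 5, -1, 2, 4]

# Fraction of zero bits in each slice; high-order slices of small weights are all zero.
zero_fraction = [sl.count(0) / len(sl) for sl in bit_slices(weights)]

print(dot_via_slices(weights, x), sum(w * xi for w, xi in zip(weights, x)))
print(zero_fraction)
```

In this toy example the high-order slices are entirely zero, which is the kind of sparsity the abstract says BSTC exploits to cut weight access.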
Characterizing the Efficiency of Distributed Training A Power, Performance, and Thermal Perspective
Authors: Seokjin Go, Joongun Park, Spandan More, Hanjiang Wu, Irene Wang, Aaron Jezghani, Tushar Krishna, Divya Mahajan
2025-09-12
The rapid scaling of Large Language Models (LLMs) has pushed training
workloads far beyond the limits of single-node analysis, demanding a deeper
understanding of how these models behave across large-scale, multi-GPU systems.
In this paper, we present a comprehensive characterization of LLM training
across diverse real-world workloads and hardware platforms, including NVIDIA
H100/H200 and AMD MI250 GPUs. We analyze dense and sparse models under various
parallelism strategies -- tensor, pipeline, data, and expert -- and evaluate
their effects on hardware utilization, power consumption, and thermal behavior.
We further evaluate the effectiveness of optimizations such as activation
recomputation and compute-communication overlap. Our findings show that
performance is not determined solely by scaling hardware capacity. Scale-up
systems with fewer, higher-memory GPUs can outperform scale-out systems in
communication-bound regimes, but only under carefully tuned configurations; in
other cases, scale-out deployments achieve superior throughput. We also show
that certain parallelism combinations, such as tensor with pipeline, lead to
bandwidth underutilization due to inefficient data chunking, while increasing
microbatch sizes beyond a certain point induces bursty execution and peak power
excursions that worsen thermal throttling. These insights reveal how training
performance is shaped by complex interactions between hardware, system
topology, and model execution. We conclude by offering recommendations for
system and hardware design to improve the scalability and reliability of future
systems and workloads. The source code of this project is available at
https://github.com/sitar-lab/Char
-PPT.
Efficient Learned Image Compression Through Knowledge Distillation
Authors: Fabien Allemand, Attilio Fiandrotti, Sumanta Chaudhuri, Alaa Eddine Mazouz
2025-09-12
Learned image compression sits at the intersection of machine learning and
image processing. With advances in deep learning, neural network-based
compression methods have emerged. In this process, an encoder maps the image to
a low-dimensional latent space, which is then quantized, entropy-coded into a
binary bitstream, and transmitted to the receiver. At the receiver end, the
bitstream is entropy-decoded, and a decoder reconstructs an approximation of
the original image. Recent research suggests that these models consistently
outperform conventional codecs. However, they require significant processing
power, making them unsuitable for real-time use on resource-constrained
platforms, which hinders their deployment in mainstream applications. This
study aims to reduce the resource requirements of neural networks used for
image compression by leveraging knowledge distillation, a training paradigm
where smaller neural networks, partially trained on the outputs of larger, more
complex models, can achieve better performance than when trained independently.
Our work demonstrates that knowledge distillation can be effectively applied to
image compression tasks: i) across various architecture sizes, ii) to achieve
different image quality/bit rate tradeoffs, and iii) to save processing and
energy resources. This approach introduces new settings and hyperparameters,
and future research could explore the impact of different teacher models, as
well as alternative loss functions. Knowledge distillation could also be
extended to transformer-based models. The code is publicly available at:
https://github.com/FABallemand/PRIM .
Compute Only 16 Tokens in One Timestep Accelerating Diffusion Transformers with Cluster-Driven Feature Caching
Authors: Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, Linfeng Zhang
2025-09-12
Diffusion transformers have gained significant attention in recent years for
their ability to generate high-quality images and videos, yet still suffer from
a huge computational cost due to their iterative denoising process. Recently,
feature caching has been introduced to accelerate diffusion transformers by
caching the feature computation in previous timesteps and reusing it in the
following timesteps, which leverages the temporal similarity of diffusion
models while ignoring the similarity in the spatial dimension. In this paper, we
introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and
complementary perspective for previous feature caching. Specifically, ClusCa
performs spatial clustering on tokens in each timestep, computes only one token
in each cluster and propagates its information to all the other tokens, which
reduces the number of computed tokens by over 90%. Extensive experiments on
DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image
and text-to-video generation. Besides, it can be directly applied to any
diffusion transformer without requirements for training. For instance, ClusCa
achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing
the original model by 0.51%. The code is available at
https://github.com/Shenyi-Z/Cache4Diffusion.
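The "compute one token per cluster, then propagate" step can be sketched as follows. This is a deliberately simplified illustration: cluster labels are assumed to be given, the expensive transformer block is replaced by a hypothetical stand-in function, and every non-representative token simply reuses its representative's output. How ClusCa actually clusters tokens and propagates information is not specified in the abstract.

```python
def cluster_step(features, labels, expensive_fn):
    """Run `expensive_fn` on one token per cluster and share the result in-cluster."""
    rep_output = {}
    calls = 0
    for tok, lab in zip(features, labels):
        if lab not in rep_output:          # first token seen acts as representative
            rep_output[lab] = expensive_fn(tok)
            calls += 1
    return [rep_output[lab] for lab in labels], calls

features = [0.10, 0.12, 0.90, 0.88, 0.11, 0.91]   # toy 1-D token features
labels = [0, 0, 1, 1, 0, 1]                       # assumed cluster assignment

outputs, calls = cluster_step(features, labels, lambda v: 2.0 * v)
print(calls, "expensive calls for", len(features), "tokens")
```

The saving comes from the number of expensive calls scaling with the number of clusters and not with the number of tokens.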
OpenCSP A Deep Learning Framework for Crystal Structure Prediction from Ambient to High Pressure
Authors: Yinan Wang, Xiaoyang Wang, Zhenyu Wang, Jing Wu, Jian Lv, Han Wang
2025-09-12
High-pressure crystal structure prediction (CSP) underpins advances in
condensed matter physics, planetary science, and materials discovery. Yet, most
large atomistic models are trained on near-ambient, equilibrium data, leading
to degraded stress accuracy at tens to hundreds of gigapascals and sparse
coverage of pressure-stabilized stoichiometries and dense coordination motifs.
Here, we introduce OpenCSP, a machine learning framework for CSP tasks spanning
ambient to high-pressure conditions. This framework comprises an open-source
pressure-resolved dataset alongside a suite of publicly available atomistic
models that are jointly optimized for accuracy in energy, force, and stress
predictions. The dataset is constructed via randomized high-pressure sampling
and iteratively refined through an uncertainty-guided concurrent learning
strategy, which enriches underrepresented compression regimes while suppressing
redundant DFT labeling. Despite employing a training corpus one to two orders
of magnitude smaller than those of leading large models, OpenCSP achieves
comparable or superior performance in high-pressure enthalpy ranking and
stability prediction. Across benchmark CSP tasks spanning a wide pressure
window, our models match or surpass MACE-MPA-0, MatterSim v1 5M, and
GRACE-2L-OAM, with the largest gains observed at elevated pressures. These
results demonstrate that targeted, pressure-aware data acquisition coupled with
scalable architectures enables data-efficient, high-fidelity CSP, paving the
way for autonomous materials discovery under ambient and extreme conditions.
SignClip Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion
Authors: Wenfang Wu, Tingting Yuan, Yupeng Li, Daling Wang, Xiaoming Fu
2025-09-12
Sign language translation (SLT) aims to translate sign language videos into
natural language, serving as a vital bridge for inclusive communication. While
recent advances leverage powerful visual backbones and large language models,
most approaches mainly focus on manual signals (hand gestures) and tend to
overlook non-manual cues like mouthing. In fact, mouthing conveys essential
linguistic information in sign languages and plays a crucial role in
disambiguating visually similar signs. In this paper, we propose SignClip, a
novel framework to improve the accuracy of sign language translation. It fuses
manual and non-manual cues, specifically spatial gesture and lip movement
features. Besides, SignClip introduces a hierarchical contrastive learning
framework with multi-level alignment objectives, ensuring semantic consistency
across sign-lip and visual-text modalities. Extensive experiments on two
benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our
approach. For example, on PHOENIX14T, in the Gloss-free setting, SignClip
surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from
24.32 to 24.71, and ROUGE from 46.57 to 48.38.
A Symmetry-Integrated Approach to Surface Code Decoding
Authors: Hoshitaro Ohnishi, Hideo Mukai
2025-09-12
Quantum error correction, which utilizes logical qubits that are encoded as
redundant multiple physical qubits to find and correct errors in physical
qubits, is indispensable for practical quantum computing. Surface code is
considered to be a promising encoding method with a high error threshold that
is defined by stabilizer generators. However, previous methods have suffered
from the problem that the decoder acquires solely the error probability
distribution because of the non-uniqueness of correct prediction obtained from
the input. To circumvent this problem, we propose a technique to reoptimize the
decoder model by approximating syndrome measurements with a continuous function
that is mathematically interpolated by a neural network. We evaluated the
improvement in accuracy of a multilayer perceptron based decoder for code
distances of 5 and 7 as well as for decoders based on convolutional and
recurrent neural networks and transformers for a code distance of 5. In all
cases, the reoptimized decoder gave better accuracy than the original models,
demonstrating the universal effectiveness of the proposed method that is
independent of code distance or network architecture. These results suggest
that re-framing the problem of surface code decoding into a regression problem
that can be tackled by deep learning is a useful strategy.
FedBiF Communication-Efficient Federated Learning via Bits Freezing
Authors: Shiwei Li, Qunwei Li, Haozhao Wang, Ruixuan Li, Jianbin Lin, Wenliang Zhong
2025-09-12
Federated learning (FL) is an emerging distributed machine learning paradigm
that enables collaborative model training without sharing local data. Despite
its advantages, FL suffers from substantial communication overhead, which can
affect training efficiency. Recent efforts have mitigated this issue by
quantizing model updates to reduce communication costs. However, most existing
methods apply quantization only after local training, introducing quantization
errors into the trained parameters and potentially degrading model accuracy. In
this paper, we propose Federated Bit Freezing (FedBiF), a novel FL framework
that directly learns quantized model parameters during local training. In each
communication round, the server first quantizes the model parameters and
transmits them to the clients. FedBiF then allows each client to update only a
single bit of the multi-bit parameter representation, freezing the remaining
bits. This bit-by-bit update strategy reduces each parameter update to one bit
while maintaining high precision in parameter representation. Extensive
experiments are conducted on five widely used datasets under both IID and
Non-IID settings. The results demonstrate that FedBiF not only achieves
superior communication compression but also promotes sparsity in the resulting
models. Notably, FedBiF attains accuracy comparable to FedAvg, even when using
only 1 bit-per-parameter (bpp) for uplink and 3 bpp for downlink communication.
The code is available at https://github.com/Leopold1423/fedbif-tpds25.
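The single-bit update described above can be sketched with plain integer codes. This is an illustration under stated assumptions: parameters are stored as 4-bit unsigned codes, the round's active bit position is chosen arbitrarily, and `client_bits` stands in for whatever local training would produce. None of these names or values come from the FedBiF code base.

```python
def set_bit(code, position, value):
    """Return `code` with the bit at `position` forced to `value` (0 or 1)."""
    return (code & ~(1 << position)) | (value << position)

server_codes = [0b1010, 0b0111, 0b0001]   # quantized parameters sent by the server
active_bit = 2                            # the only bit clients may change this round
client_bits = [0, 1, 1]                   # hypothetical result of local training

# Uplink payload is one bit per parameter; all other bits stay frozen.
updated = [set_bit(c, active_bit, b) for c, b in zip(server_codes, client_bits)]

frozen_mask = 0b1111 & ~(1 << active_bit)
print([format(u, "04b") for u in updated])
```

Because only the active bit can differ from the server's code, the client needs to send back a single bit per parameter, which is the 1 bpp uplink figure quoted in the abstract.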
Perfect quantum state transfer via state restoring and ancilla measurement
Authors: E. B. Fel'dman, J. Wu, A. I. Zenchuk
2025-09-12
We propose a protocol for perfect state transfer (PST) of an arbitrary pure
quantum state along a spin-1/2 chain governed by a Hamiltonian preserving
the excitation number in the system. We show that the -excitation pure
sender's state can be restored at the receiver using only local
transformations over the qubits of the extended receiver. The restored state
appears in superposition with other states, which form garbage. This garbage
can be easily removed by including an ancilla whose state labels the garbage,
and then measuring the ancilla state with the desired output. The resulting
state of the receiver coincides with the initial sender's state up to an
unimportant common phase factor. Then, to transfer an arbitrary pure state
of some system , we encode this state into the -excitation state of the
sender, transfer and restore it, and finally decode the restored -excitation
state of the receiver into the state of another subsystem . After
labeling and removing the garbage via measuring the state of the ancillae, we
complete the algorithm for PST.
Semantic Rate-Distortion Theory with Applications
Authors: Yi-Qun Zhao, Zhi-Ming Ma, Geoffrey Ye Li, Shuai Yuan, Tong Ye, Chuan Zhou
2025-09-12
Artificial intelligence (AI) is ushering in a new era for communication. As a
result, the establishment of a semantic communication framework is being put on
the agenda. Based on a realistic semantic communication model, this paper
develops a rate-distortion framework for semantic compression. Different from
the existing works primarily focusing on decoder-side estimation of intrinsic
meaning and ignoring its inherent issues, such as ambiguity and polysemy, we
exploit a constraint of conditional semantic probability distortion to
effectively capture the essential features of practical semantic exchanges in
an AI-assisted communication system. With the help of the methods in
rate-distortion-perception theory, we establish a theorem specifying the
minimum achievable rate under this semantic constraint and a traditional
symbolic constraint and obtain its closed-form limit for a particular semantic
scenario. From the experiments in this paper, bounding conditional semantic
probability distortion can effectively improve both semantic transmission
accuracy and bit-rate efficiency. Our framework bridges information theory and
AI, enabling potential applications in bandwidth-efficient semantic-aware
networks, enhanced transceiver understanding, and optimized semantic
transmission for AI-driven systems.
Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge
Authors: Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis, Sami Muhaidat
2025-09-12
Large-scale transformers are central to modern semantic communication, yet
their high computational and communication costs hinder deployment on
resource-constrained edge devices. This paper introduces a training-free
framework for adaptive token merging, a novel mechanism that compresses
representations at runtime by selectively merging semantically
redundant tokens under per-layer similarity thresholds. Unlike prior
fixed-ratio reduction, our approach couples merging directly to input
redundancy, enabling data-dependent adaptation that balances efficiency and
task relevance without retraining. We cast the discovery of merging strategies
as a multi-objective optimization problem and leverage Bayesian optimization to
obtain Pareto-optimal trade-offs between accuracy, inference cost, and
communication cost. On ImageNet classification, we match the accuracy of the
unmodified transformer with 30% fewer floating-point operations and
under 20% of the original communication cost, while for visual question
answering our method achieves performance competitive with the full LLaVA model
at less than one-third of the compute and one-tenth of the bandwidth. Finally,
we show that our adaptive merging is robust across varying channel conditions
and provides inherent privacy benefits, substantially degrading the efficacy of
model inversion attacks. Our framework provides a practical and versatile
solution for deploying powerful transformer models in resource-limited edge
intelligence scenarios.
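The threshold-based merging described above can be illustrated with a minimal sketch. It assumes a simplified rule that is not taken from the paper: each token is averaged into the previously kept token when their cosine similarity reaches a threshold. The function names and the adjacent-token greedy rule are hypothetical; the paper's actual matching procedure and its Bayesian search over thresholds are not reproduced here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_tokens(tokens, threshold):
    """Greedy merge: fold a token into the last kept group when it is similar.

    Hypothetical simplification: only adjacent tokens are compared, and a
    merged group is represented by the mean of its members.
    """
    groups = []  # each entry is (vector_sum, member_count)
    for tok in tokens:
        if groups:
            vec_sum, count = groups[-1]
            mean = [v / count for v in vec_sum]
            if cosine(mean, tok) >= threshold:
                groups[-1] = ([s + t for s, t in zip(vec_sum, tok)], count + 1)
                continue
        groups.append((list(tok), 1))
    return [[v / count for v in vec_sum] for vec_sum, count in groups]
```

Because the number of merges depends on how similar the input tokens are, a redundant input shrinks more than a diverse one under the same threshold, which is the data-dependent behaviour the abstract contrasts with fixed-ratio reduction. A per-layer variant would simply call `merge_tokens` with a different threshold at each layer.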
LLMs as Agentic Cooperative Players in Multiplayer UNO
Authors: Yago Romano Matinez, Jesse Roberts
2025-09-11
LLMs promise to assist humans -- not just by answering questions, but by
offering useful guidance across a wide range of tasks. But how far does that
assistance go? Can a large language model based agent actually help someone
accomplish their goal as an active participant? We test this question by
engaging an LLM in UNO, a turn-based card game, asking it not to win but
instead to help another player do so. We built a tool that allows decoder-only
LLMs to participate as agents within the RLCard game environment. These models
receive full game-state information and respond using simple text prompts under
two distinct prompting strategies. We evaluate models ranging from small (1B
parameters) to large (70B parameters) and explore how model scale impacts
performance. We find that while all models were able to successfully outperform
a random baseline when playing UNO, few were able to significantly aid another
player.
Latency and Token-Aware Test-Time Compute
Authors: Jenny Y. Huang, Mehul Damani, Yousef El-Kurdi, Ramon Astudillo, Wei Sun
2025-09-11
Inference-time scaling has emerged as a powerful way to improve large
language model (LLM) performance by generating multiple candidate responses and
selecting among them. However, existing work on dynamic allocation for
test-time compute typically considers only parallel generation methods such as
best-of-N, overlooking incremental decoding methods like beam search, and has
largely ignored latency, focusing only on token usage. We formulate
inference-time scaling as a problem of dynamic compute allocation and method
selection, where the system must decide which strategy to apply and how much
compute to allocate on a per-query basis. Our framework explicitly incorporates
both token cost and wall-clock latency, the latter being critical for user
experience and particularly for agentic workflows where models must issue
multiple queries efficiently. Experiments on reasoning benchmarks show that our
approach consistently outperforms static strategies, achieving favorable
accuracy-cost trade-offs while remaining practical for deployment.
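The per-query decision described above (which strategy, how much compute) can be sketched as choosing the option with the best predicted utility. Everything in this sketch is an assumption for illustration: the linear utility, the `Option` fields, and the numbers in `OPTIONS` are hypothetical and do not come from the paper, which would obtain such predictions from its own estimators.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Option:
    method: str      # e.g. "best_of_n" or "beam_search"
    budget: int      # N for best-of-N, beam width for beam search
    accuracy: float  # predicted probability of a correct answer (assumed given)
    tokens: float    # predicted token usage
    latency: float   # predicted wall-clock seconds

def utility(opt, token_weight, latency_weight):
    """Hypothetical linear trade-off between accuracy, tokens, and latency."""
    return opt.accuracy - token_weight * opt.tokens - latency_weight * opt.latency

def choose(options, token_weight, latency_weight):
    """Pick the (method, budget) pair with the highest utility for one query."""
    return max(options, key=lambda o: utility(o, token_weight, latency_weight))

# Made-up predictions for a single query.
OPTIONS = [
    Option("best_of_n", 1, accuracy=0.60, tokens=500, latency=2.0),
    Option("best_of_n", 8, accuracy=0.75, tokens=4000, latency=2.5),
    Option("beam_search", 4, accuracy=0.78, tokens=2000, latency=8.0),
]
```

With these placeholder numbers, ignoring latency (`latency_weight=0`) favours beam search, while a latency weight of 0.02 per second shifts the choice to parallel best-of-8. That shift is the kind of effect the abstract argues token-only accounting misses.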
Towards an AI-based knowledge assistant for goat farmers based on Retrieval-Augmented Generation
Authors: Nana Han, Dong Liu, Tomas Norton
2025-09-11
Large language models (LLMs) are increasingly being recognised as valuable
knowledge communication tools in many industries. However, their application in
livestock farming remains limited, being constrained by several factors, not
least the availability, diversity and complexity of knowledge sources. This
study introduces an intelligent knowledge assistant system designed to support
health management in farmed goats. Leveraging Retrieval-Augmented
Generation (RAG), two structured knowledge processing methods, table
textualization and decision-tree textualization, were proposed to enhance large
language models' (LLMs') understanding of heterogeneous data formats. Based on
these methods, a domain-specific goat farming knowledge base was established to
improve the LLM's capacity for cross-scenario generalization. The knowledge base
spans five key domains: Disease Prevention and Treatment, Nutrition Management,
Rearing Management, Goat Milk Management, and Basic Farming Knowledge.
Additionally, an online search module is integrated to enable real-time
retrieval of up-to-date information. To evaluate system performance, six
ablation experiments were conducted to examine the contribution of each
component. The results demonstrated that the heterogeneous knowledge fusion
method achieved the best results, with mean accuracies of 87.90% on the
validation set and 84.22% on the test set. Across the text-based, table-based,
and decision-tree-based Q&A tasks, accuracy consistently exceeded 85%, validating the
effectiveness of structured knowledge fusion within a modular design. Error
analysis identified omission as the predominant error category, highlighting
opportunities to further improve retrieval coverage and context integration. In
conclusion, the results highlight the robustness and reliability of the
proposed system for practical applications in goat farming.
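Table textualization, as named above, can be pictured as turning each table row into a standalone sentence that a retriever can index. The sketch below is only one plausible reading; the sentence template, function name, and placeholder table are hypothetical and are not the authors' method or data.

```python
def textualize_table(headers, rows, subject_column):
    """Turn each row into one sentence keyed by the value in subject_column.

    Hypothetical template: "For <column> '<value>': <col> is <val>; ...".
    Empty cells are skipped so that no empty facts are emitted.
    """
    sentences = []
    for row in rows:
        record = dict(zip(headers, row))
        subject = record[subject_column]
        facts = [
            f"{name} is {value}"
            for name, value in record.items()
            if name != subject_column and value not in ("", None)
        ]
        sentences.append(f"For {subject_column} '{subject}': " + "; ".join(facts) + ".")
    return sentences

# Placeholder content only; not real husbandry guidance.
HEADERS = ["condition", "sign", "action"]
ROWS = [["condition A", "sign X", "action Y"]]
```

Each resulting sentence is self-contained, so it can be chunked and embedded like ordinary text without the model having to infer the row and column structure.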
CoDiCodec Unifying Continuous and Discrete Compressed Representations of Audio
Authors: Marco Pasini, Stefan Lattner, George Fazekas
2025-09-11
Efficiently representing audio signals in a compressed latent space is
critical for latent generative modelling. However, existing autoencoders often
force a choice between continuous embeddings and discrete tokens. Furthermore,
achieving high compression ratios while maintaining audio fidelity remains a
challenge. We introduce CoDiCodec, a novel audio autoencoder that overcomes
these limitations by both efficiently encoding global features via summary
embeddings, and by producing both compressed continuous embeddings at ~ 11 Hz
and discrete tokens at a rate of 2.38 kbps from the same trained model,
offering unprecedented flexibility for different downstream generative tasks.
This is achieved through Finite Scalar Quantization (FSQ) and a novel
FSQ-dropout technique, and does not require additional loss terms beyond the
single consistency loss used for end-to-end training. CoDiCodec supports both
autoregressive decoding and a novel parallel decoding strategy, with the latter
achieving superior audio quality and faster decoding. CoDiCodec outperforms
existing continuous and discrete autoencoders at similar bitrates in terms of
reconstruction audio quality. Our work enables a unified approach to audio
compression, bridging the gap between continuous and discrete generative
modelling paradigms.
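Finite Scalar Quantization, mentioned above, can be sketched as bounding each latent dimension and rounding it to a small fixed grid, so that the continuous value and the discrete code come from the same latent. This is a simplified illustration, not the CoDiCodec implementation: the tanh bounding, the uniform grid on [-1, 1], and the function names are assumptions, and training details such as gradient estimation and FSQ-dropout are omitted.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each latent dimension to one of levels[i] grid points in [-1, 1].

    Returns (codes, values): integer grid indices and the dequantized values.
    """
    codes, values = [], []
    for x, n in zip(z, levels):
        bounded = math.tanh(x)                    # squash into (-1, 1)
        idx = round((bounded + 1) / 2 * (n - 1))  # nearest of n grid points
        codes.append(idx)
        values.append(2 * idx / (n - 1) - 1)      # map the index back to [-1, 1]
    return codes, values

def codes_to_token(codes, levels):
    """Pack per-dimension indices into one token id (mixed-radix number)."""
    token = 0
    for idx, n in zip(codes, levels):
        token = token * n + idx
    return token
```

With `levels = [5, 5, 5]` the implicit codebook has 5 * 5 * 5 = 125 entries and needs no learned codebook vectors. The pre-rounding `tanh` output can serve as the continuous embedding while the packed token serves as the discrete one.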
ButterflyQuant Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
Authors: Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang
2025-09-11
Large language models require massive memory footprints, severely limiting
deployment on consumer hardware. Quantization reduces memory through lower
numerical precision, but extreme 2-bit quantization suffers from catastrophic
performance loss due to outliers in activations. Rotation-based methods such as
QuIP and QuaRot apply orthogonal transforms to eliminate outliers before
quantization, using computational invariance:
$\mathbf{y} = \mathbf{W}\mathbf{x} = (\mathbf{W}\mathbf{Q}^T)(\mathbf{Q}\mathbf{x})$
for orthogonal $\mathbf{Q}$. However, these
methods use fixed transforms--Hadamard matrices achieving optimal worst-case
coherence $\mu = 1/\sqrt{n}$--that cannot adapt to specific weight
distributions. We identify that different transformer layers exhibit distinct
outlier patterns, motivating layer-adaptive rotations rather than
one-size-fits-all approaches. We propose ButterflyQuant, which replaces
Hadamard rotations with learnable butterfly transforms parameterized by
continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$
entries that are non-differentiable and prohibit gradient-based learning,
butterfly transforms' continuous parameterization enables smooth optimization
while guaranteeing orthogonality by construction. This orthogonal constraint
ensures theoretical guarantees in outlier suppression while achieving
$O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable
parameters. We further introduce a uniformity regularization on
post-transformation activations to promote smoother distributions amenable to
quantization. Learning requires only 128 calibration samples and converges in
minutes on a single GPU--a negligible one-time cost. On LLaMA-2-7B with 2-bit
quantization, ButterflyQuant achieves 15.4 perplexity versus 22.1 for QuaRot.
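A butterfly transform built from Givens rotations, as described above, can be sketched as log2(n) stages that each rotate n/2 disjoint coordinate pairs. The pairing scheme, function names, and random angles below are assumptions for illustration (the learned angles, the regularizer, and the quantizer are not shown). The sketch only demonstrates that such a transform is orthogonal by construction and leaves the layer output unchanged when the weights are rotated to match.

```python
import math
import random

def random_angles(n, rng):
    """One angle per pair: log2(n) stages of n/2 pairs (n must be a power of two)."""
    stages = n.bit_length() - 1
    return [[rng.uniform(-math.pi, math.pi) for _ in range(n // 2)]
            for _ in range(stages)]

def butterfly_apply(x, angles):
    """Compute Q x, where Q is the product of the Givens-rotation stages."""
    n = len(x)
    y = list(x)
    stride = 1
    for stage in angles:
        k = 0
        for i in range(n):
            if (i & stride) == 0:  # i is the lower index of the pair (i, i + stride)
                j = i + stride
                c, s = math.cos(stage[k]), math.sin(stage[k])
                y[i], y[j] = c * y[i] - s * y[j], s * y[i] + c * y[j]
                k += 1
        stride *= 2
    return y

def rotate_rows(W, angles):
    """Compute W Q^T: row r of W Q^T equals Q applied to row r of W."""
    return [butterfly_apply(row, angles) for row in W]
```

Each stage touches every coordinate once, so applying the transform costs on the order of n log n operations, and the number of angles is (n log2 n) / 2. Because every stage is a rotation, the vector norm is preserved for any angle values, which is why gradient updates to the angles cannot break orthogonality.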
LAVa Layer-wise KV Cache Eviction with Dynamic Budget Allocation
Authors: Yiqun Shen, Song Yuan, Zhengze Zhang, Xiaoliang Wang, Daxin Jiang, Nguyen Cam-Tu
2025-09-11
KV cache is commonly used to accelerate LLM inference with long contexts, yet
its high memory demand drives the need for cache compression. Existing
compression methods, however, are largely heuristic and lack dynamic budget
allocation. To address this limitation, we introduce a unified framework for
cache compression by minimizing information loss in Transformer residual
streams. Building on it, we analyze the layer attention output loss and derive
a new metric to compare cache entries across heads, enabling layer-wise
compression with dynamic head budgets. Additionally, by contrasting cross-layer
information, we also achieve dynamic layer budgets. LAVa is the first unified
strategy for cache eviction and dynamic budget allocation that, unlike prior
methods, does not rely on training or the combination of multiple strategies.
Experiments with benchmarks (LongBench, Needle-In-A-Haystack, Ruler, and
InfiniteBench) demonstrate its superiority. Moreover, our experiments reveal a
new insight: dynamic layer budgets are crucial for generation tasks (e.g., code
completion), while dynamic head budgets play a key role in extraction tasks
(e.g., extractive QA). As a fully dynamic compression method, LAVa consistently
maintains top performance across task types. Our code is available at
https://github.com/MGDDestiny/Lava.
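The idea of layer-wise eviction with dynamic head budgets can be sketched as follows. If cache entries from different heads can be scored on a common scale, one layer-level top-k selection decides how many entries each head keeps. The scores here are assumed to be given; the paper's actual metric is not reproduced, and the function name is hypothetical.

```python
import heapq

def evict_layer(scores, layer_budget):
    """Keep the layer_budget highest-scoring cache entries across all heads.

    scores[h][t] is an importance score for position t in head h, assumed to be
    comparable across heads. Returns the kept positions for each head; the
    per-head budgets are whatever the layer-wide selection yields.
    """
    flat = [(s, h, t) for h, head in enumerate(scores) for t, s in enumerate(head)]
    kept = heapq.nlargest(layer_budget, flat)
    per_head = [[] for _ in scores]
    for _, h, t in kept:
        per_head[h].append(t)
    return [sorted(positions) for positions in per_head]
```

For example, with `scores = [[0.9, 0.8, 0.7, 0.1], [0.2, 0.05, 0.6, 0.01]]` and a layer budget of 4, the first head keeps three entries and the second keeps one, rather than two each as a fixed per-head split would give.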
Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates
Authors: Harry Julian, Rachel Beeson, Lohith Konathala, Johanna Ulin, Jiameng Gao
2025-09-11
Neural Audio Codecs (NACs) have become increasingly adopted in speech
processing tasks due to their excellent rate-distortion performance and
compatibility with Large Language Models (LLMs) as discrete feature
representations for audio generation. While most existing codecs rely on
Residual Vector Quantization (RVQ), Finite Scalar Quantization (FSQ) has
recently emerged as a compelling alternative that simplifies training and
natively supports single codebooks. We introduce NeuCodec, an FSQ-based NAC,
and show that FSQ encodes baked-in redundancy, producing an encoding that
is robust when transmitted through noisy channels. First, through an encoder
distillation experiment, we show that two different encoders can learn to
encode identical audio into vastly different code sequences whilst maintaining
comparable reconstruction quality with the same quantizer and decoder. Second,
we demonstrate that FSQ has vastly superior bit-level perturbation robustness
by comparing the performance of RVQ and FSQ codecs when simulating the
transmission of code sequences through a noisy channel.
TrEnv Transparently Share Serverless Execution Environments Across Different Functions and Nodes
Authors: Jialiang Huang, Teng Ma, Zheng Liu, Sixing Lin, Kang Chen, Jinlei Jiang, Xia Liao, Yingdi Shan, Yongwei Wu, Ning Zhang, Mengting Lu, Tao Ma, Haifeng Gong, Mingxing Zhang
2025-09-11
Serverless computing provides dynamic scalability, but its infrastructure
overhead becomes a bottleneck for emerging workloads such as LLM agents, which
exhibit unpredictable invocation patterns and variable resource demands. Our
analysis shows that for these agents, the cost of running on serverless
platforms can reach up to 70% of the cost of LLM API calls. This finding
motivates the need for a more efficient, high-density serverless platform. We
present TrEnv, a co-designed serverless platform that supports both container-
and VM-based environments, optimized for the unique demands of LLM agents.
TrEnv reduces startup latency and memory usage through repurposable sandboxes
and memory templates, which enable fast reuse and restoration of execution
environments. To further reduce overhead in VM-based agent workloads, TrEnv
leverages browser sharing and a page cache bypassing mechanism. Evaluations
show that TrEnv reduces P99 latency by up to 7X and memory usage by 48% in
container-based settings, and achieves up to 58% lower P99 latency and 61%
memory savings for VM-based agents compared to state-of-the-art systems like
E2B.
Combating the Memory Walls Optimization Pathways for Long-Context Agentic LLM Inference
Authors: Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, Jeffrey T. H. Wong, Zhiwen Mo, Cheng Zhang, Przemyslaw Forys, Wayne Luk, Hongxiang Fan, Jianyi Cheng, Timothy M. Jones, Rika Antonova, Robert Mullins, Aaron Zhao
2025-09-11
LLMs now form the backbone of AI agents for a diverse array of applications,
including tool use, command-line agents, and web or computer use agents. These
agentic LLM inference tasks are fundamentally different from chatbot-focused
inference -- they often have much larger context lengths to capture complex,
prolonged inputs, such as entire webpage DOMs or complicated tool call
trajectories. This, in turn, generates significant off-chip memory traffic for
the underlying hardware at the inference stage and causes the workload to be
constrained by two memory walls, namely the bandwidth and capacity memory
walls, preventing the on-chip compute units from achieving high utilization.
In this paper, we introduce PLENA, a hardware-software co-designed system
that applies three core optimization pathways to tackle these challenges. PLENA
includes an efficient hardware implementation of compute and memory units
supporting an asymmetric quantization scheme. PLENA also features a novel
flattened systolic array architecture that has native support for
FlashAttention to tackle these memory walls in the scenario of inference
for long-context LLMs. Additionally, PLENA is developed with a complete
stack, including a custom ISA, a compiler, a cycle-emulated simulator, and an
automated design space exploration flow. The simulated results show that PLENA
achieves up to 8.5x higher utilization than existing accelerators, and delivers
2.24x higher throughput than the A100 GPU and 3.85x higher throughput than the
TPU v6e, under the same multiplier count and memory settings. The full PLENA
system will also be open-sourced.
ENSI Efficient Non-Interactive Secure Inference for Large Language Models
Authors: Zhiyu He, Maojiang Wang, Xinwen Gao, Yuchuan Luo, Lin Liu, Shaojing Fu
2025-09-11
Secure inference enables privacy-preserving machine learning by leveraging
cryptographic protocols that support computations on sensitive user data
without exposing it. However, integrating cryptographic protocols with large
language models (LLMs) presents significant challenges, as the inherent
complexity of these protocols, together with LLMs' massive parameter scale and
sophisticated architectures, severely limits practical usability. In this work,
we propose ENSI, a novel non-interactive secure inference framework for LLMs,
based on the principle of co-designing the cryptographic protocols and LLM
architecture. ENSI employs an optimized encoding strategy that seamlessly
integrates the CKKS scheme with a lightweight LLM variant, BitNet, significantly
reducing the computational complexity of encrypted matrix multiplications. In
response to the prohibitive computational demands of softmax under homomorphic
encryption (HE), we pioneer the integration of the sigmoid attention mechanism
with HE as a seamless, retraining-free alternative. Furthermore, by embedding
the Bootstrapping operation within the RMSNorm process, we efficiently refresh
ciphertexts while markedly decreasing the frequency of costly bootstrapping
invocations. Experimental evaluations demonstrate that ENSI achieves
approximately an 8x acceleration in matrix multiplications and a 2.6x speedup
in softmax inference on CPU compared to state-of-the-art methods, with the
proportion of bootstrapping reduced to just 1%.
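The contrast between softmax and sigmoid attention mentioned above can be made concrete in plaintext. Softmax needs a row-wise maximum, a sum, and a division over all keys, whereas sigmoid attention scores each key independently. The sketch below runs on ordinary floats with no encryption involved; the `bias` parameter is a hypothetical knob, and how ENSI evaluates the sigmoid under CKKS is not shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_weights(q, keys, kind="softmax", bias=0.0):
    """Return attention weights of query q over keys, via softmax or sigmoid."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    if kind == "softmax":
        m = max(scores)                          # row-wise max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)                        # row-wise normalization
        return [e / total for e in exps]
    # Sigmoid attention: elementwise, with no coupling between keys.
    return [sigmoid(s + bias) for s in scores]

def attend(weights, values):
    """Weighted sum of value vectors."""
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

In the sigmoid variant, adding another key leaves the existing weights untouched, while in softmax every weight changes. That independence removes the cross-key normalization, which is the step the abstract identifies as prohibitively expensive under homomorphic encryption.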
HD-MoE Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing
Authors: Haochen Huang, Shuzhang Zhong, Zhe Zhang, Shuangchen Li, Dimin Niu, Hongzhong Zheng, Runsheng Wang, Meng Li
2025-09-11
Large Language Models (LLMs) with Mixture-of-Expert (MoE) architectures
achieve superior model performance with reduced computation costs, but at the
cost of high memory capacity and bandwidth requirements. Near-Memory Processing
(NMP) accelerators that stack memory directly on the compute through hybrid
bonding have demonstrated high bandwidth with high energy efficiency, becoming
a promising architecture for MoE models. However, as NMP accelerators comprise
distributed memory and computation, how to map the MoE computation directly
determines the LLM inference efficiency. Existing parallel mapping strategies,
including Tensor Parallelism (TP) and Expert Parallelism (EP), suffer from
either high communication costs or unbalanced computation utilization, leading
to inferior efficiency. The dynamic routing mechanism of MoE LLMs further
aggravates the efficiency challenges. Therefore, in this paper, we propose
HD-MoE to automatically optimize the MoE parallel computation across an NMP
accelerator. HD-MoE features an offline automatic hybrid parallel mapping
algorithm and an online dynamic scheduling strategy to reduce the communication
costs while maximizing the computation utilization. With extensive experimental
results, we demonstrate that HD-MoE achieves a speedup ranging from 1.1x to
1.8x over TP, 1.1x to 1.5x over EP, and 1.0x to 1.4x over the baseline Hybrid
TP-EP with Compute-Balanced parallelism strategies.
DiTReducio A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration
Authors: Yanru Huo, Ziyue Jiang, Zuoli Tang, Qingyang Hong, Zhou Zhao
2025-09-11
While Diffusion Transformers (DiT) have advanced non-autoregressive (NAR)
speech synthesis, their high computational demands remain a limitation.
Existing DiT-based text-to-speech (TTS) model acceleration approaches mainly
focus on reducing sampling steps through distillation techniques, yet they
remain constrained by training costs. We introduce DiTReducio, a training-free
framework that compresses computations in DiT-based TTS models via
progressive calibration. We propose two compression methods, Temporal Skipping
and Branch Skipping, to eliminate redundant computations during inference.
Moreover, based on two characteristic attention patterns identified within DiT
layers, we devise a pattern-guided strategy to selectively apply the
compression methods. Our method allows flexible modulation between generation
quality and computational efficiency through adjustable compression thresholds.
Experimental evaluations conducted on F5-TTS and MegaTTS 3 demonstrate that
DiTReducio achieves a 75.4% reduction in FLOPs and improves the Real-Time
Factor (RTF) by 37.1%, while preserving generation quality.
Efficient Transformer-Based Piano Transcription With Sparse Attention Mechanisms
Authors: Weixing Wei, Kazuyoshi Yoshii
2025-09-11
This paper investigates automatic piano transcription based on
computationally-efficient yet high-performant variants of the Transformer that
can capture longer-term dependency over the whole musical piece. Recently,
Transformer-based sequence-to-sequence models have demonstrated excellent
performance in piano transcription. These models, however, fail to deal with
the whole piece at once due to the quadratic complexity of the self-attention
mechanism, and music signals are thus typically processed in a sliding-window
manner in practice. To overcome this limitation, we propose an efficient
architecture with sparse attention mechanisms. Specifically, we introduce
sliding-window self-attention mechanisms for both the encoder and decoder, and
a hybrid global-local cross-attention mechanism that attends to various spans
according to the MIDI token types. We also use a hierarchical pooling strategy
between the encoder and decoder to further reduce computational load. Our
experiments on the MAESTRO dataset showed that the proposed model achieved a
significant reduction in computational cost and memory usage, accelerating
inference speed, while maintaining transcription performance comparable to the
full-attention baseline. This allows for training with longer audio contexts on
the same hardware, demonstrating the viability of sparse attention for building
efficient and high-performance piano transcription systems. The code is
available at https://github.com/WX-Wei/efficient-seq2seq-piano-trans.
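Sliding-window self-attention, as described above, restricts each position to a fixed neighbourhood so that the number of attended pairs grows linearly with sequence length instead of quadratically. The sketch below only builds the boolean mask. The window size, the causal flag, and the function names are illustrative assumptions, and the hybrid global-local cross-attention and the pooling are not modelled.

```python
def sliding_window_mask(seq_len, window, causal=False):
    """mask[i][j] is True when position i may attend to position j.

    Non-causal: |i - j| <= window. Causal: i - window <= j <= i.
    """
    mask = []
    for i in range(seq_len):
        lo = max(0, i - window)
        hi = i if causal else min(seq_len - 1, i + window)
        mask.append([lo <= j <= hi for j in range(seq_len)])
    return mask

def attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)
```

For a sequence of 6 positions and a window of 1, the non-causal mask allows 16 pairs and the causal mask allows 11, against 36 for full attention. The gap widens as the sequence grows, which is what lets the whole piece fit in one pass.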
From scratch to silver Creating trustworthy training data for patent-SDG classification using Large Language Models
Authors: Grazia Sveva Ascione, Nicolò Tamagnone
2025-09-11
Classifying patents by their relevance to the UN Sustainable Development
Goals (SDGs) is crucial for tracking how innovation addresses global
challenges. However, the absence of a large, labeled dataset limits the use of
supervised learning. Existing methods, such as keyword searches, transfer
learning, and citation-based heuristics, lack scalability and generalizability.
This paper frames patent-to-SDG classification as a weak supervision problem,
using citations from patents to SDG-tagged scientific publications (NPL
citations) as a noisy initial signal. To address its sparsity and noise, we
develop a composite labeling function (LF) that uses large language models
(LLMs) to extract structured concepts, namely functions, solutions, and
applications, from patents and SDG papers based on a patent ontology.
Cross-domain similarity scores are computed and combined using a rank-based
retrieval approach. The LF is calibrated via a custom positive-only loss that
aligns with known NPL-SDG links without penalizing discovery of new SDG
associations. The result is a silver-standard, soft multi-label dataset mapping
patents to SDGs, enabling the training of effective multi-label regression
models. We validate our approach through two complementary strategies: (1)
internal validation against held-out NPL-based labels, where our method
outperforms several baselines including transformer-based models and zero-shot
LLM; and (2) external validation using network modularity in patent citation,
co-inventor, and co-applicant graphs, where our labels reveal greater thematic,
cognitive, and organizational coherence than traditional technological
classifications. These results show that weak supervision and semantic
alignment can enhance SDG classification at scale.
Harnessing Uncertainty Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
Authors: Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang
2025-09-11
In long-horizon tasks, recent agents based on Large Language Models (LLMs)
face a significant challenge: sparse, outcome-based rewards make it
difficult to assign credit to intermediate steps. Previous methods mainly focus
on creating dense reward signals to guide learning, either through traditional
reinforcement learning techniques like inverse reinforcement learning or by
using Process Reward Models for step-by-step feedback. In this paper, we
identify a fundamental problem in the learning dynamics of LLMs: the magnitude
of policy gradients is inherently coupled with the entropy, which leads to
inefficient small updates for confident correct actions and potentially
destabilizing large updates for uncertain ones. To resolve this, we propose
Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the
learning signal based on step-wise uncertainty and the final task outcome. EMPG
amplifies updates for confident correct actions, penalizes confident errors,
and attenuates updates from uncertain steps to stabilize exploration. We
further introduce a bonus term for future clarity that encourages agents to
find more predictable solution paths. Through comprehensive experiments on
three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we
demonstrate that EMPG achieves substantial performance gains and significantly
outperforms strong policy gradient baselines. Project page is at
https://empgseed-seed.github.io/
Medverse A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement
Authors: Jiesi Hu, Jianfeng Cao, Yanwu Yang, Chenfei Ye, Yixuan Zhang, Hanyang Peng, Ting Ma
2025-09-11
In-context learning (ICL) offers a promising paradigm for universal medical
image analysis, enabling models to perform diverse image processing tasks
without retraining. However, current ICL models for medical imaging remain
limited in two critical aspects: they cannot simultaneously achieve
high-fidelity predictions and global anatomical understanding, and there is no
unified model trained across diverse medical imaging tasks (e.g., segmentation
and enhancement) and anatomical regions. As a result, the full potential of ICL
in medical imaging remains underexplored. Thus, we present Medverse, a
universal ICL model for 3D medical imaging, trained on 22 datasets covering
diverse tasks in universal image segmentation, transformation, and enhancement
across multiple organs, imaging modalities, and clinical centers. Medverse
employs a next-scale autoregressive in-context learning framework that
progressively refines predictions from coarse to fine, generating consistent,
full-resolution volumetric outputs and enabling multi-scale anatomical
awareness. We further propose a blockwise cross-attention module that
facilitates long-range interactions between context and target inputs while
preserving computational efficiency through spatial sparsity. Medverse is
extensively evaluated on a broad collection of held-out datasets covering
previously unseen clinical centers, organs, species, and imaging modalities.
Results demonstrate that Medverse substantially outperforms existing ICL
baselines and establishes a novel paradigm for in-context learning. Code and
model weights will be made publicly available. Our model is publicly available
at https://github.com/jiesihu/Medverse.
CCF A Context Compression Framework for Efficient Long-Sequence Language Modeling
Authors: Wenhao Li, Bangcheng Sun, Weihao Ye, Tianyi Zhang, Daohai Yu, Fei Chao, Rongrong Ji
2025-09-11
Scaling language models to longer contexts is essential for capturing rich
dependencies across extended discourse. However, naive context extension
imposes significant computational and memory burdens, often resulting in
inefficiencies during both training and inference. In this work, we propose
CCF, a novel context compression framework designed to enable efficient
long-context modeling by learning hierarchical latent representations that
preserve global semantics while aggressively reducing input redundancy. CCF
integrates segment-wise semantic aggregation with key-value memory encoding,
forming compact representations that support accurate reconstruction and
long-range understanding. To further enhance scalability, we introduce a
training-efficient optimization strategy that couples incremental segment
decoding with sparse reservoir sampling, substantially reducing memory overhead
without degrading performance. Empirical results on multiple long-context
language modeling benchmarks demonstrate that CCF achieves competitive
perplexity under high compression ratios, and significantly improves throughput
and memory efficiency compared to existing approaches. These findings highlight
the potential of structured compression for scalable and effective long-context
language modeling.
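Reservoir sampling, which the training strategy above relies on, keeps a fixed-size uniform sample from a stream whose length is not known in advance, so memory stays constant however many segments arrive. The sketch below is the textbook algorithm applied to a generic stream; how CCF couples it with segment processing is not specified in the abstract and is not modelled here.

```python
import random

def reservoir_sample(stream, k, rng):
    """Return k items drawn uniformly from stream using O(k) memory.

    The first k items fill the reservoir. Item number n (0-based, n >= k) then
    replaces a random slot with probability k / (n + 1).
    """
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

A call such as `reservoir_sample(range(1000), 4, random.Random(0))` retains four segment indices while holding no more than four items at any time. That is the property that would let only a small subset of segments stay resident during training.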
GmSLM Generative Marmoset Spoken Language Modeling
Authors: Talia Sternberg, Michael London, David Omer, Yossi Adi
2025-09-11
Marmoset monkeys exhibit complex vocal communication, challenging the view
that nonhuman primates' vocal communication is entirely innate, and show
features similar to human speech, such as vocal labeling of others and
turn-taking. Studying their vocal communication offers a unique opportunity to
link it with brain activity, especially given the difficulty of accessing the
human brain in speech and language research. Since Marmosets communicate
primarily through vocalizations, applying standard LLM approaches is not
straightforward. We introduce Generative Marmoset Spoken Language Modeling
(GmSLM), an optimized spoken language model pipeline for Marmoset vocal
communication. We designed
novel zero-shot evaluation metrics using unsupervised in-the-wild data,
alongside weakly labeled conversational data, to assess GmSLM and demonstrate
its advantage over a basic human-speech-based baseline. GmSLM generated
vocalizations closely matched real resynthesized samples acoustically and
performed well on downstream tasks. Despite being fully unsupervised, GmSLM
effectively distinguishes real from artificial conversations and may support
further investigations of the neural basis of vocal communication and provides
a practical framework linking vocalization and brain activity. We believe GmSLM
stands to benefit future work in neuroscience, bioacoustics, and evolutionary
biology. Samples are provided under: pages.cs.huji.ac.il/adiyoss-lab/GmSLM.
AI Reasoning for Wireless Communications and Networking A Survey and Perspectives
Authors: Haoxiang Luo, Yu Yan, Yanhui Bian, Wenjiao Feng, Ruichen Zhang, Yinqiu Liu, Jiacheng Wang, Gang Sun, Dusit Niyato, Hongfang Yu, Abbas Jamalipour, Shiwen Mao
2025-09-11
Artificial Intelligence (AI) techniques play a pivotal role in optimizing
wireless communication networks. However, traditional deep learning approaches
often act as closed boxes, lacking the structured reasoning abilities needed to
tackle complex, multi-step decision problems. This survey provides a
comprehensive review and outlook of reasoning-enabled AI in wireless
communication networks, with a focus on Large Language Models (LLMs) and other
advanced reasoning paradigms. In particular, LLM-based agents can combine
reasoning with long-term planning, memory, tool utilization, and autonomous
cross-layer control to dynamically optimize network operations with minimal
human intervention. We begin by outlining the evolution of intelligent wireless
networking and the limitations of conventional AI methods. We then introduce
emerging AI reasoning techniques. Furthermore, we establish a classification
system applicable to wireless network tasks. We also present a layer-by-layer
examination for AI reasoning, covering the physical, data link, network,
transport, and application layers. For each part, we identify key challenges
and illustrate how AI reasoning methods can improve AI-based wireless
communication performance. Finally, we discuss key research directions for AI
reasoning toward future wireless communication networks. By combining insights
from both communications and AI, this survey aims to chart a path for
integrating reasoning techniques into the next-generation wireless networks.
Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication
Authors: Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis
2025-09-11
Large-scale transformer models have emerged as a powerful tool for semantic
communication systems, enabling edge devices to extract rich representations
for robust inference across noisy wireless channels. However, their substantial
computational demands remain a major barrier to practical deployment in
resource-constrained 6G networks. In this paper, we present a training-free
framework for adaptive token merging in pretrained vision transformers to
jointly reduce inference time and transmission resource usage. We formulate the
selection of per-layer merging proportions as a multi-objective optimization
problem to balance accuracy and computational cost. We employ Gaussian
process-based Bayesian optimization to construct a Pareto frontier of optimal
configurations, enabling flexible runtime adaptation to dynamic application
requirements and channel conditions. Extensive experiments demonstrate that our
method consistently outperforms other baselines and achieves significant
reductions in floating-point operations while maintaining competitive accuracy
across a wide range of signal-to-noise ratio (SNR) conditions. Additional
results highlight the effectiveness of adaptive policies that adjust merging
aggressiveness in response to channel quality, providing a practical mechanism
to trade off latency and semantic fidelity on demand. These findings establish
a scalable and efficient approach for deploying transformer-based semantic
communication in future edge intelligence systems.
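The runtime selection step described in the abstract above can be illustrated with a small sketch. It covers only the non-dominated filtering and the budget-based choice, not the Gaussian-process Bayesian optimization the authors use to find candidates. The configuration names, accuracies, and GFLOP figures are made-up illustrative values, and `pareto_front` and `pick_for_budget` are hypothetical helper names.

```python
from typing import List, Tuple

# (name, accuracy, gflops): hypothetical candidate merging configurations
Config = Tuple[str, float, float]


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep configurations that no other one beats on both accuracy and cost."""
    front = []
    for name, acc, cost in configs:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in configs
        )
        if not dominated:
            front.append((name, acc, cost))
    # Sort from cheapest to most expensive for easy lookup at runtime.
    return sorted(front, key=lambda cfg: cfg[2])


def pick_for_budget(front: List[Config], max_cost: float) -> Config:
    """Return the most accurate Pareto point that fits the compute budget."""
    feasible = [cfg for cfg in front if cfg[2] <= max_cost]
    if not feasible:
        return front[0]  # fall back to the cheapest configuration
    return max(feasible, key=lambda cfg: cfg[1])


candidates = [
    ("no_merge", 0.82, 17.6),
    ("light", 0.81, 12.0),
    ("medium", 0.78, 9.0),
    ("wasteful", 0.77, 11.0),  # dominated by "medium"
    ("aggressive", 0.70, 6.0),
]
front = pareto_front(candidates)
choice = pick_for_budget(front, max_cost=10.0)
```

In a deployment along these lines, the budget passed to `pick_for_budget` would presumably be derived from the current channel quality or latency target.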
DP-FedLoRA Privacy-Enhanced Federated Fine-Tuning for On-Device Large Language Models
Authors: Honghui Xu, Shiva Shrestha, Wei Chen, Zhiyuan Li, Zhipeng Cai
2025-09-11
As on-device large language model (LLM) systems become increasingly
prevalent, federated fine-tuning enables advanced language understanding and
generation directly on edge devices; however, it also involves processing
sensitive, user-specific data, raising significant privacy concerns within the
federated learning framework. To address these challenges, we propose
DP-FedLoRA, a privacy-enhanced federated fine-tuning framework that integrates
LoRA-based adaptation with differential privacy in a communication-efficient
setting. Each client locally clips and perturbs its LoRA matrices using
Gaussian noise to satisfy $(\epsilon, \delta)$-differential privacy. We
further provide a theoretical analysis demonstrating the unbiased nature of the
updates and deriving bounds on the variance introduced by noise, offering
practical guidance for privacy-budget calibration. Experimental results across
mainstream benchmarks show that DP-FedLoRA delivers competitive performance
while offering strong privacy guarantees, paving the way for scalable and
privacy-preserving deployment in on-device environments.
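The client-side clip-and-perturb step can be sketched as follows. This is a generic Gaussian-mechanism illustration, not the authors' implementation. It assumes Frobenius-norm clipping, takes the L2 sensitivity to equal the clipping norm, and uses the classical noise calibration that holds for 0 < epsilon < 1. All function names are hypothetical.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def frobenius_norm(m: Matrix) -> float:
    return math.sqrt(sum(x * x for row in m for x in row))


def clip_matrix(m: Matrix, clip_norm: float) -> Matrix:
    """Rescale the matrix so its Frobenius norm is at most clip_norm."""
    scale = min(1.0, clip_norm / max(frobenius_norm(m), 1e-12))
    return [[x * scale for x in row] for row in m]


def gaussian_sigma(clip_norm: float, epsilon: float, delta: float) -> float:
    """Classical Gaussian-mechanism calibration (assumes 0 < epsilon < 1)
    with the L2 sensitivity taken to be clip_norm (an assumption)."""
    return clip_norm * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatize(m: Matrix, clip_norm: float, epsilon: float, delta: float,
              rng: random.Random) -> Matrix:
    """Clip a LoRA matrix, then add i.i.d. Gaussian noise to every entry."""
    clipped = clip_matrix(m, clip_norm)
    sigma = gaussian_sigma(clip_norm, epsilon, delta)
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in clipped]


rng = random.Random(0)
lora_a = [[3.0, 4.0], [0.0, 0.0]]  # toy LoRA factor with Frobenius norm 5
clipped = clip_matrix(lora_a, clip_norm=1.0)
noisy = privatize(lora_a, clip_norm=1.0, epsilon=0.5, delta=1e-5, rng=rng)
```

A smaller epsilon or delta increases `sigma`, which is the variance trade-off that privacy-budget calibration has to balance.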
Towards Confidential and Efficient LLM Inference with Dual Privacy Protection
Authors: Honglan Yu, Yibin Wang, Feifei Dai, Dong Liu, Haihui Fan, Xiaoyan Gu
2025-09-11
CPU-based trusted execution environments (TEEs) and differential privacy (DP)
have gained wide applications for private inference. Due to high inference
latency in TEEs, researchers use partition-based approaches that offload linear
model components to GPUs. However, dense nonlinear layers of large language
models (LLMs) result in significant communication overhead between TEEs and
GPUs. DP-based approaches apply random noise to protect data privacy, but this
compromises LLM performance and semantic understanding. To overcome the above
drawbacks, this paper proposes CMIF, a Confidential and efficient Model
Inference Framework. CMIF confidentially deploys the embedding layer in the
client-side TEE and subsequent layers on GPU servers. Meanwhile, it optimizes
the Report-Noisy-Max mechanism to protect sensitive inputs with a slight
decrease in model performance. Extensive experiments on Llama-series models
demonstrate that CMIF reduces additional inference overhead in TEEs while
preserving user data privacy.
SQAP-VLA A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models
Authors: Hengyu Fang, Yijiang Liu, Yuan Du, Li Du, Huanrui Yang
2025-09-11
Vision-Language-Action (VLA) models exhibit unprecedented capabilities for
embodied intelligence. However, their extensive computational and memory costs
hinder their practical deployment. Existing VLA compression and acceleration
approaches conduct quantization or token pruning in an ad-hoc manner but fail
to enable both for a holistic efficiency improvement due to an observed
incompatibility. This work introduces SQAP-VLA, the first structured,
training-free VLA inference acceleration framework that simultaneously enables
state-of-the-art quantization and token pruning. We overcome the
incompatibility by co-designing the quantization and token pruning pipeline,
where we propose new quantization-aware token pruning criteria that work on an
aggressively quantized model while improving the quantizer design to enhance
pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields
significant gains in computational efficiency and inference speed while
successfully preserving core model performance, achieving a 1.93x
speedup and up to a 4.5% average success rate enhancement compared to the
original model.
Instructional Prompt Optimization for Few-Shot LLM-Based Recommendations on Cold-Start Users
Authors: Haowei Yang, Yushang Zhao, Sitao Min, Bo Su, Chao Yao, Wei Xu
2025-09-11
The cold-start user problem compromises the effectiveness of recommender
systems by limiting access to historical behavioral information. Optimizing
instructional prompts for a few-shot large language model (LLM) is an
effective pipeline for recommender tasks in this setting. We introduce a
context-conditioned prompt formulation method $P(u, D_s) \rightarrow \hat{R}$,
where $u$ is a cold-start user profile, $D_s$ is a curated support set,
and $\hat{R}$ is the predicted ranked list of items. Based on systematic
experimentation with transformer-based autoregressive LLMs (BioGPT, LLaMA-2,
GPT-4), we provide empirical evidence that optimal exemplar injection and
instruction structuring can significantly improve the precision@k and NDCG
scores of such models in low-data settings. The pipeline uses token-level
alignments and embedding space regularization to achieve greater semantic
fidelity. Our findings show that prompt composition is not merely syntactic but
also functional, as it directly controls attention scales and decoder
behavior during inference. This paper shows that prompt-based adaptation may be
considered one of the ways to address cold-start recommendation issues in
LLM-based pipelines.
VoxelFormer Parameter-Efficient Multi-Subject Visual Decoding from fMRI
Authors: Chenqian Le, Yilin Zhao, Nikasadat Emami, Kushagra Yadav, Xujin "Chris" Liu, Xupeng Chen, Yao Wang
2025-09-10
Recent advances in fMRI-based visual decoding have enabled compelling
reconstructions of perceived images. However, most approaches rely on
subject-specific training, limiting scalability and practical deployment. We
introduce VoxelFormer, a lightweight transformer architecture that
enables multi-subject training for visual decoding from fMRI. VoxelFormer
integrates a Token Merging Transformer (ToMer) for efficient voxel compression
and a query-driven Q-Former that produces fixed-size neural representations
aligned with the CLIP image embedding space. Evaluated on the 7T Natural Scenes
Dataset, VoxelFormer achieves competitive retrieval performance on subjects
included during training with significantly fewer parameters than existing
methods. These results highlight token merging and query-based transformers as
promising strategies for parameter-efficient neural decoding.
CSI Compression Beyond Latents End-to-End Hybrid Attention-CNN Networks with Entropy Regularization
Authors: Maryam Ansarifard, Mostafa Rahmani, Mohit K. Sharma, Kishor C. Joshi, George Exarchakos, Alister Burr
2025-09-10
Massive MIMO systems rely on accurate Channel State Information (CSI)
feedback to enable high-gain beam-forming. However, the feedback overhead
scales linearly with the number of antennas, presenting a major bottleneck.
While recent deep learning methods have improved CSI compression, most overlook
the impact of quantization and entropy coding, limiting their practical
deployability. In this work, we propose an end-to-end CSI compression framework
that integrates a Spatial Correlation-Guided Attention Mechanism with
quantization and entropy-aware training. Our model effectively exploits the
spatial correlation among the antennas, thereby learning compact,
entropy-optimized latent representations for efficient coding. This reduces the
required feedback bitrates without sacrificing reconstruction accuracy, thereby
yielding a superior rate-distortion trade-off. Experiments show that our method
surpasses existing end-to-end CSI compression schemes, exceeding benchmark
performance by an average of 21.5% on indoor datasets and 18.9% on outdoor
datasets. The proposed framework results in a practical and efficient CSI
feedback scheme.
CrowdQuery Density-Guided Query Module for Enhanced 2D and 3D Detection in Crowded Scenes
Authors: Marius Dähling, Sebastian Krebs, J. Marius Zöllner
2025-09-10
This paper introduces a novel method for end-to-end crowd detection that
leverages object density information to enhance existing transformer-based
detectors. We present CrowdQuery (CQ), whose core component is our CQ module
that predicts and subsequently embeds an object density map. The embedded
density information is then systematically integrated into the decoder.
Existing density map definitions typically depend on head positions or
object-based spatial statistics. Our method extends these definitions to
include individual bounding box dimensions. By incorporating density
information into object queries, our method utilizes density-guided queries to
improve detection in crowded scenes. CQ is universally applicable to both 2D
and 3D detection without requiring additional data. Consequently, we are the
first to design a method that effectively bridges 2D and 3D detection in
crowded environments. We demonstrate the integration of CQ into both a general
2D and 3D transformer-based object detector, introducing the architectures CQ2D
and CQ3D. CQ is not limited to the specific transformer models we selected.
Experiments on the STCrowd dataset for both 2D and 3D domains show significant
performance improvements compared to the base models, outperforming most
state-of-the-art methods. When integrated into a state-of-the-art crowd
detector, CQ can further improve performance on the challenging CrowdHuman
dataset, demonstrating its generalizability. The code is released at
https://github.com/mdaehl/CrowdQuery.
ChemBOMAS Accelerated BO in Chemistry with LLM-Enhanced Multi-Agent System
Authors: Dong Han, Zhehong Ai, Pengxiang Cai, Shuzhou Sun, Shanya Lu, Jianpeng Chen, Ben Gao, Lingli Ge, Weida Wang, Xiangxin Zhou, Xihui Liu, Mao Su, Wanli Ouyang, Lei Bai, Dongzhan Zhou, Tao XU, Yuqiang Li, Shufei Zhang
2025-09-10
The efficiency of Bayesian optimization (BO) in chemistry is often hindered
by sparse experimental data and complex reaction mechanisms. To overcome these
limitations, we introduce ChemBOMAS, a new framework named LLM-Enhanced
Multi-Agent System for accelerating BO in chemistry. ChemBOMAS's optimization
process is enhanced by LLMs and synergistically employs two strategies:
knowledge-driven coarse-grained optimization and data-driven fine-grained
optimization. First, in the knowledge-driven coarse-grained optimization stage,
LLMs intelligently decompose the vast search space by reasoning over existing
chemical knowledge to identify promising candidate regions. Subsequently, in
the data-driven fine-grained optimization stage, LLMs enhance the BO process
within these candidate regions by generating pseudo-data points, thereby
improving data utilization efficiency and accelerating convergence. Benchmark
evaluations further confirm that ChemBOMAS significantly enhances
optimization effectiveness and efficiency compared to various BO algorithms.
Importantly, the practical utility of ChemBOMAS was validated through wet-lab
experiments conducted under pharmaceutical industry protocols, targeting
conditional optimization for a previously unreported and challenging chemical
reaction. In the wet experiment, ChemBOMAS achieved an optimal objective value
of 96%. This was substantially higher than the 15% achieved by domain experts.
This real-world success, together with strong performance on benchmark
evaluations, highlights ChemBOMAS as a powerful tool to accelerate chemical
discovery.
Compressing CNN models for resource-constrained systems by channel and layer pruning
Authors: Ahmed Sadaqa, Di Liu
2025-09-10
Convolutional Neural Networks (CNNs) have achieved significant breakthroughs
in various fields. However, these advancements have led to a substantial
increase in the complexity and size of these networks. This poses a challenge
when deploying large and complex networks on edge devices. Consequently, model
compression has emerged as a research field aimed at reducing the size and
complexity of CNNs. One prominent technique in model compression is model
pruning. This paper presents a new pruning technique that combines both
channel and layer pruning in what is called a "hybrid pruning framework".
Inspired by EfficientNet, a renowned CNN architecture known for scaling up
networks from both channel and layer perspectives, this hybrid approach applies
the same principles but in reverse, where it scales down the network through
pruning. Experiments on the hybrid approach demonstrated a notable decrease in
the overall complexity of the model, with only a minimal reduction in accuracy
compared to the baseline model. This complexity reduction translates into
reduced latency when deploying the pruned models on an NVIDIA JETSON TX2
embedded AI device.
Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching
Authors: Siratish Sakpiboonchit
2025-09-10
This paper presents a method to accelerate the inference process of diffusion
transformer (DiT)-based text-to-speech (TTS) models by applying a selective
caching mechanism to transformer layers. Specifically, I integrate SmoothCache
into the F5-TTS architecture, focusing on caching outputs of self-attention and
feed-forward network layers to reduce redundant computations during the
denoising process. A calibration phase is introduced to analyze L1 relative
errors between timesteps, guiding the selection of cache schedules that
minimize quality degradation. To address the problem of inter-layer dependency,
a unified caching schedule is adopted, applying the cache pattern derived from
self-attention layers to both layer types. Experiments on LibriSpeech-PC and
Seed-TTS datasets evaluate various cache thresholds and denoising step
configurations. Results show that caching at higher denoising steps reduces
inference time without compromising output quality, whereas caching at lower
steps can negatively impact synthesis quality similarly to reducing the total
number of denoising steps. Objective and subjective metrics confirm the
effectiveness of SmoothCache in maintaining performance while improving
computational efficiency. Comparisons between cached inference and reduced-step
inference further highlight the benefits of selective caching, especially under
high-step configurations. This work demonstrates that transformer layer caching
is a practical solution for optimizing diffusion transformer-based TTS models
without requiring architectural changes or retraining. Example inference
results can be heard at https://siratish.github.io/F5-TTS_SmoothCache/ .
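The calibration idea in the abstract above, measuring L1 relative errors between timesteps and caching where they fall under a threshold, might look roughly like the sketch below. It is a simplified illustration, not the SmoothCache or F5-TTS code. It assumes each layer output is a flat list of floats and compares every step against the most recently computed output. The names `l1_relative_error` and `caching_schedule` are hypothetical.

```python
from typing import List, Sequence


def l1_relative_error(current: Sequence[float], reference: Sequence[float]) -> float:
    """||current - reference||_1 / ||reference||_1 for one layer's output."""
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(r) for r in reference)
    return num / max(den, 1e-12)


def caching_schedule(outputs_per_step: List[Sequence[float]],
                     threshold: float) -> List[bool]:
    """Return one flag per denoising step; True means the cached output
    is reused instead of recomputing the layer."""
    schedule = [False]  # the first step always has to be computed
    last_computed = outputs_per_step[0]
    for t in range(1, len(outputs_per_step)):
        err = l1_relative_error(outputs_per_step[t], last_computed)
        reuse = err < threshold
        schedule.append(reuse)
        if not reuse:
            last_computed = outputs_per_step[t]
    return schedule


# Toy calibration trace: outputs barely change, then jump, then settle again.
trace = [[1.0, 1.0], [1.01, 1.0], [1.02, 1.01], [2.0, 2.0], [2.01, 2.0]]
schedule = caching_schedule(trace, threshold=0.05)
```

Under the unified schedule described in the abstract, a schedule computed this way from the self-attention outputs would then be applied to the feed-forward layers as well.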
Time-Dependent Modeling of the Sub-Hour Spectral Evolution During the 2013 Outburst of Mrk 421
Authors: MAGIC Collaboration, K. Abe, S. Abe, J. Abhir, A. Abhishek, A. Aguasca-Cabot, I. Agudo, T. Aniello, S. Ansoldi, L. A. Antonelli, A. Arbet Engels, C. Arcaro, T. T. H. Arnesen, A. Babić, C. Bakshi, U. Barres de Almeida, J. A. Barrio, L. Barrios-Jiménez, I. Batković, J. Baxter, J. Becerra González, W. Bednarek, E. Bernardini, J. Bernete, A. Berti, C. Bigongiari, A. Biland, O. Blanch, G. Bonnoli, Ž Bošnjak, E. Bronzini, I. Burelli, A. Campoy-Ordaz, A. Carosi, R. Carosi, M. Carretero-Castrillo, A. J. Castro-Tirado, D. Cerasole, G. Ceribella, Y. Chai, A. Cifuentes, J. L. Contreras, J. Cortina, S. Covino, F. D'Ammando, P. Da Vela, F. Dazzi, A. De Angelis, B. De Lotto, R. de Menezes, J. Delgado, C. Delgado Mendez, F. Di Pierro, R. Di Tria, L. Di Venere, A. Dinesh, D. Dominis Prester, A. Donini, D. Dorner, M. Doro, L. Eisenberger, D. Elsaesser, J. Escudero, L. Fariña, L. Foffano, L. Font, S. Fröse, Y. Fukazawa, R. J. García López, S. García Soto, M. Garczarczyk, S. Gasparyan, J. G. Giesbrecht Paiva, N. Giglietto, F. Giordano, P. Gliwny, T. Gradetzke, R. Grau, D. Green, J. G. Green, P. Günther, A. Hahn, T. Hassan, L. Heckmann, J. Herrera Llorente, D. Hrupec, D. Israyelyan, J. Jahanvi, I. Jiménez Martínez, J. Jiménez Quiles, J. Jormanainen, S. Kankkunen, T. Kayanoki, J. Konrad, P. M. Kouch, G. Koziol, H. Kubo, J. Kushida, M. Laínez, A. Lamastra, E. Lindfors, S. Lombardi, F. Longo, M. López-Moya, A. López-Oramas, S. Loporchio, L. Lulić, E. Lyard, P. Majumdar, M. Makariev, M. Mallamaci, G. Maneva, M. Manganaro, S. Mangano, K. Mannheim, S. Marchesi, M. Mariotti, M. Martínez, P. Maruševec, S. Menchiari, J. Méndez Gallego, S. Menon, D. Miceli, J. M. Miranda, R. Mirzoyan, M. Molero González, E. Molina, H. A. Mondal, A. Moralejo, C. Nanci, A. Negro, V. Neustroev, L. Nickel, M. Nievas Rosillo, C. Nigro, L. Nikolić, S. Nozaki, A. Okumura, J. Otero-Santos, S. Paiano, D. Paneque, R. Paoletti, J. M. Paredes, M. Peresano, M. Persic, M. Pihet, G. Pirola, F. Podobnik, P. G. 
Prada Moroni, E. Prandini, W. Rhode, M. Ribó, J. Rico, A. Roy, N. Sahakyan, F. G. Saturni, K. Schmitz, F. Schmuckermaier, T. Schweizer, A. Sciaccaluga, G. Silvestri, A. Simongini, J. Sitarek, V. Sliusar, D. Sobczynska, A. Stamerra, J. Strišković, D. Strom, M. Strzys, Y. Suda, H. Tajima, R. Takeishi, F. Tavecchio, T. Terzić, M. Teshima, A. Tutone, S. Ubach, J. van Scherpenberg, M. Vazquez Acosta, S. Ventura, G. Verna, I. Viale, A. Vigliano, C. F. Vigorito, E. Visentin, V. Vitale, I. Vovk, R. Walter, F. Wersig, M. Will, T. Yamamoto, P. K. H. Yeung, M. Petropoulou, M. Polkas
2025-09-10
In April 2013, the TeV blazar Markarian 421 underwent one of its most
powerful emission outbursts to date. An extensive multi-instrument campaign
featuring MAGIC, VERITAS, and NuSTAR provided comprehensive
very-high-energy (VHE; E > 100 GeV) and X-ray coverage over nine consecutive
days. In this work, we perform a detailed spectral analysis of the X-ray and
VHE emissions on sub-hour timescales throughout the flare. We identify several
clockwise spectral hysteresis loops in the X-rays, revealing a spectral
evolution more complex than a simple harder-when-brighter trend. The VHE
spectrum extends beyond 10 TeV, and its temporal evolution closely mirrors the
behavior in the X-rays. We report the first evidence of VHE spectral hysteresis
occurring simultaneously with the X-ray loops. To interpret these findings, we
apply a time-dependent leptonic model to 240 broadband spectral energy
distributions (SEDs) binned on a 15-minute scale, allowing us to
self-consistently track the particle distribution's history. Our modeling shows
that the majority of the sub-hour flux and spectral variations are driven by
changes in the luminosity and slope of the injected electron distribution. The
required variations in the electron slope are difficult to reconcile with
magnetic reconnection but are consistent with a shock-acceleration scenario
where the shock compression ratio evolves by a factor of . The model
also points to a relatively stable magnetic field and emitting region size,
favoring a scenario where the emission originates from a stationary feature in
the jet, such as a recollimation shock. However, this scenario requires a jet
Lorentz factor that significantly exceeds values from VLBI measurements to
account for the high minimum electron energy implied by the lack of variability
in the optical band.
Deep Unrolling of Sparsity-Induced RDO for 3D Point Cloud Attribute Coding
Authors: Tam Thuc Do, Philip A. Chou, Gene Cheung
2025-09-10
Given encoded 3D point cloud geometry available at the decoder, we study the
problem of lossy attribute compression in a multi-resolution B-spline
projection framework. A target continuous 3D attribute function is first
projected onto a sequence of nested subspaces, each of which
is a family of functions spanned by a B-spline basis
function of a given order at a chosen scale and its integer shifts. The projected
low-pass coefficients are computed by variable-complexity unrolling of
a rate-distortion (RD) optimization algorithm into a feed-forward network,
where the rate term is the sparsity-promoting $\ell_1$-norm. Thus, the
projection operation is end-to-end differentiable. For a chosen coarse-to-fine
predictor, the coefficients are then adjusted to account for the prediction
from a lower-resolution to a higher-resolution, which is also optimized in a
data-driven manner.
BitROM Weight Reload-Free CiROM Architecture Towards Billion-Parameter 1.58-bit LLM Inference
Authors: Wenlun Zhang, Xinyu Li, Shimpei Ando, Kentaro Yoshioka
2025-09-10
Compute-in-Read-Only-Memory (CiROM) accelerators offer outstanding energy
efficiency for CNNs by eliminating runtime weight updates. However, their
scalability to Large Language Models (LLMs) is fundamentally constrained by
their vast parameter sizes. Notably, LLaMA-7B - the smallest model in LLaMA
series - demands more than 1,000 cm2 of silicon area even in advanced CMOS
nodes. This paper presents BitROM, the first CiROM-based accelerator that
overcomes this limitation through co-design with BitNet's 1.58-bit quantization
model, enabling practical and efficient LLM inference at the edge. BitROM
introduces three key innovations: 1) a novel Bidirectional ROM Array that
stores two ternary weights per transistor; 2) a Tri-Mode Local Accumulator
optimized for ternary-weight computations; and 3) an integrated Decode-Refresh
(DR) eDRAM that supports on-die KV-cache management, significantly reducing
external memory access during decoding. In addition, BitROM integrates
LoRA-based adapters to enable efficient transfer learning across various
downstream tasks. Evaluated in 65nm CMOS, BitROM achieves 20.8 TOPS/W and a bit
density of 4,967 kB/mm2 - offering a 10x improvement in area efficiency over
prior digital CiROM designs. Moreover, the DR eDRAM contributes to a 43.6%
reduction in external DRAM access, further enhancing deployment efficiency for
LLMs in edge applications.
Two Sides of the Same Optimization Coin Model Degradation and Representation Collapse in Graph Foundation Models
Authors: Xunkai Li, Daohan Su, Sicheng Liu, Ru Zhang, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang
2025-09-10
Graph foundation models, inspired by the success of LLMs, are designed to
learn the optimal embedding from multi-domain TAGs for the downstream
cross-task generalization capability. During our investigation, graph VQ-MAE
stands out among the increasingly diverse landscape of GFM architectures. This
is attributed to its ability to jointly encode topology and textual attributes
from multiple domains into discrete embedding spaces with clear semantic
boundaries. Despite its potential, domain generalization conflicts cause
imperceptible pitfalls. In this paper, we instantiate two of them, and they are
just like two sides of the same GFM optimization coin - Side 1 Model
Degradation: The encoder and codebook fail to capture the diversity of inputs;
Side 2 Representation Collapse: The hidden embedding and codebook vector fail
to preserve semantic separability due to constraints from narrow representation
subspaces. These two pitfalls (sides) collectively impair the decoder and
generate the low-quality reconstructed supervision, causing the GFM
optimization dilemma during pre-training (coin). Through empirical
investigation, we attribute the above challenges to Information Bottleneck and
Regularization Deficit. To address them, we propose MoT (Mixture-of-Tinkers) -
(1) Information Tinker for Two Pitfalls, which utilizes an edge-wise semantic
fusion strategy and a mixture-of-codebooks with domain-aware routing to improve
information capacity. (2) Regularization Tinker for Optimization Coin, which
utilizes two additional regularizations to further improve gradient supervision
in our proposed Information Tinker. Notably, as a flexible architecture, MoT
adheres to the scaling laws of GFM, offering a controllable model scale.
Compared to SOTA baselines, experiments on 22 datasets across 6 domains
demonstrate that MoT achieves significant improvements in supervised, few-shot,
and zero-shot scenarios.
Efficient Decoding Methods for Language Models on Encrypted Data
Authors: Matan Avitan, Moran Baruch, Nir Drucker, Itamar Zimerman, Yoav Goldberg
2025-09-10
Large language models (LLMs) power modern AI applications, but processing
sensitive data on untrusted servers raises privacy concerns. Homomorphic
encryption (HE) enables computation on encrypted data for secure inference.
However, neural text generation requires decoding methods like argmax and
sampling, which are non-polynomial and thus computationally expensive under
encryption, creating a significant performance bottleneck. We introduce cutmax,
an HE-friendly argmax algorithm that reduces ciphertext operations compared to
prior methods, enabling practical greedy decoding under encryption. We also
propose the first HE-compatible nucleus (top-p) sampling method, leveraging
cutmax for efficient stochastic decoding with provable privacy guarantees. Both
techniques are polynomial, supporting efficient inference in privacy-preserving
settings. Moreover, their differentiability facilitates gradient-based
sequence-level optimization as a polynomial alternative to straight-through
estimators. We further provide strong theoretical guarantees for cutmax,
proving it converges globally to a unique two-level fixed point, independent of
the input values beyond the identity of the maximizer, which explains its rapid
convergence in just a few iterations. Evaluations on realistic LLM outputs show
latency reductions of 24x-35x over baselines, advancing secure text generation.
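For reference, the plaintext behavior that an encrypted nucleus (top-p) sampler has to reproduce can be sketched as below. This is ordinary unencrypted top-p sampling, not cutmax and not the authors' HE-compatible construction. The sort and comparison steps used here are exactly the kind of non-polynomial operations that are costly under homomorphic encryption. Function names are hypothetical.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs: List[float], top_p: float) -> List[int]:
    """Smallest set of highest-probability token ids whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


def sample_top_p(logits: List[float], top_p: float, rng: random.Random) -> int:
    """Sample a token id from the nucleus, proportionally to its probability."""
    probs = softmax(logits)
    kept = nucleus(probs, top_p)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


rng = random.Random(0)
kept_ids = nucleus([0.5, 0.3, 0.15, 0.05], top_p=0.75)
token = sample_top_p([2.0, 1.0, 0.1, -1.0], top_p=0.9, rng=rng)
```

With a very small `top_p` the nucleus collapses to the single most likely token, so greedy (argmax) decoding is the limiting case of this procedure.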
Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video
Authors: Xiao Li, Qi Chen, Xiulian Peng, Kai Yu, Xie Chen, Yan Lu
2025-09-10
We propose a novel and general framework to disentangle video data into its
dynamic motion and static content components. Our proposed method is a
self-supervised pipeline with fewer assumptions and inductive biases than
previous works: it utilizes a transformer-based architecture to jointly
generate flexible implicit features for frame-wise motion and clip-wise
content, and incorporates a low-bitrate vector quantization as an information
bottleneck to promote disentanglement and form a meaningful discrete motion
space. The bitrate-controlled latent motion and content are used as conditional
inputs to a denoising diffusion model to facilitate self-supervised
representation learning. We validate our disentangled representation learning
framework on real-world talking head videos with motion transfer and
auto-regressive motion generation tasks. Furthermore, we also show that our
method can generalize to other types of video data, such as pixel sprites of 2D
cartoon characters. Our work presents a new perspective on self-supervised
learning of disentangled video representations, contributing to the broader
field of video analysis and generation.
Persistent-DPO A novel loss function and hybrid learning for generative quantum eigensolver
Authors: Junya Nakamura, Shinichiro Sanji
2025-09-10
We study the generative quantum eigensolver
(GQE)~\cite{nakaji2024generative}, which trains a classical generative model to
produce quantum circuits with desired properties such as describing molecular
ground states. We introduce two methods to improve GQE. First, we identify a
limitation of direct preference optimization (DPO) when used as the loss
function in GQE, and propose Persistent-DPO (P-DPO) as a solution to this
limitation. Second, as a method to improve the online learning during the
training phase of GQE, we introduce a hybrid approach that combines online and
offline learning. Using a transformer decoder implementation of GQE, we
evaluate our methods through ground state search experiments on the
molecule and observe that P-DPO achieves lower energies
than DPO. The hybrid approach further improves convergence and final energy
values, particularly with P-DPO.
Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism
Authors: Jiaming Yan, Jianchun Liu, Hongli Xu, Liusheng Huang
2025-09-10
Mixture-of-Experts (MoE) has emerged as a promising architecture for modern
large language models (LLMs). However, massive parameters impose heavy GPU
memory (i.e., VRAM) demands, hindering the widespread adoption of MoE LLMs.
Offloading the expert parameters to CPU RAM offers an effective way to
alleviate the VRAM requirements for MoE inference. Existing approaches
typically cache a small subset of experts in VRAM and dynamically prefetch
experts from RAM during inference, leading to significant degradation in
inference speed due to the poor cache hit rate and substantial expert loading
latency. In this work, we propose MoEpic, an efficient MoE inference system
with a novel expert split mechanism. Specifically, each expert is vertically
divided into two segments: top and bottom. MoEpic caches the top segment of hot
experts, so that more experts will be stored under the limited VRAM budget,
thereby improving the cache hit rate. During each layer's inference, MoEpic
predicts and prefetches the activated experts for the next layer. Since the top
segments of cached experts are exempt from fetching, the loading time is
reduced, which allows efficient transfer-computation overlap. Nevertheless, the
performance of MoEpic critically depends on the cache configuration (i.e., each
layer's VRAM budget and expert split ratio). To this end, we propose a
divide-and-conquer algorithm based on fixed-point iteration for adaptive
configuration. Extensive experiments on popular MoE LLMs demonstrate that
MoEpic can save about half of the GPU cost, while lowering the inference
latency by about 37.51%-65.73% compared to the baselines.
Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing
Authors: Lukas Toral, Teddy Lazebnik
2025-09-10
Reinforcement Learning (RL) algorithms often require long training to become
useful, especially in complex environments with sparse rewards. While
techniques like reward shaping and curriculum learning exist to accelerate
training, these are often extremely specific and require the developer's
professionalism and dedicated expertise in the problem's domain. Tackling this
challenge, in this study, we explore the effectiveness of pre-trained Large
Language Models (LLMs) as tutors in a student-teacher architecture with RL
algorithms, hypothesizing that LLM-generated guidance allows for faster
convergence. In particular, we explore the effectiveness of reusing the LLM's
advice on the RL's convergence dynamics. Through an extensive empirical
examination, which included 54 configurations, varying the RL algorithm (DQN,
PPO, A2C), LLM tutor (Llama, Vicuna, DeepSeek), and environment (Blackjack,
Snake, Connect Four), our results demonstrate that LLM tutoring significantly
accelerates RL convergence while maintaining comparable optimal performance.
Furthermore, the advice reuse mechanism shows a further improvement in training
duration but also results in less stable convergence dynamics. Our findings
suggest that LLM tutoring generally improves convergence, and its effectiveness
is sensitive to the specific task, RL algorithm, and LLM model combination.
EvolKV Evolutionary KV Cache Compression for LLM Inference
Authors: Bohan Yu, Yekun Chai
2025-09-10
Existing key-value (KV) cache compression methods typically rely on
heuristics, such as uniform cache allocation across layers or static eviction
policies; however, they ignore the critical interplays among layer-specific
feature patterns and task performance, which can lead to degraded
generalization. In this paper, we propose EvolKV, an adaptive framework for
layer-wise, task-driven KV cache compression that jointly optimizes the memory
efficiency and task performance. By reformulating cache allocation as a
multi-objective optimization problem, EvolKV leverages evolutionary search to
dynamically configure layer budgets while directly maximizing downstream
performance. Extensive experiments on 11 tasks demonstrate that our approach
outperforms all baseline methods across a wide range of KV cache budgets on
long-context tasks and surpasses heuristic baselines by up to 7 percentage
points on GSM8K. Notably, EvolKV achieves superior performance over the full
KV cache setting on code completion while utilizing only 1.5% of the original
budget, suggesting the untapped potential in learned compression strategies for
KV cache budget allocation.
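As a rough illustration of searching over per-layer cache budgets, the sketch below runs a simple mutation-and-selection loop under a fixed total budget. It is not EvolKV's actual optimizer or objective: the function names, the toy fitness, and the per-layer weights are hypothetical stand-ins for a real downstream-task evaluation.

```python
import math
import random


def evolve_layer_budgets(fitness, num_layers, total_budget,
                         generations=50, offspring=16, keep=4, seed=0):
    """Toy evolutionary search over per-layer cache budgets.

    `fitness` maps a list of per-layer budgets to a score (higher is better).
    Every candidate satisfies sum(budgets) == total_budget.
    """
    rng = random.Random(seed)
    # Start from the uniform-allocation heuristic.
    base = total_budget // num_layers
    uniform = [base] * num_layers
    uniform[0] += total_budget - base * num_layers
    population = [uniform]
    for _ in range(generations):
        children = []
        for _ in range(offspring):
            child = list(rng.choice(population))
            # Mutation: move part of one layer's budget to another layer.
            src, dst = rng.sample(range(num_layers), 2)
            if child[src] > 0:
                step = rng.randint(1, max(1, child[src] // 4))
                child[src] -= step
                child[dst] += step
            children.append(child)
        # Elitist selection: parents compete with their children.
        ranked = sorted(population + children, key=fitness, reverse=True)
        population = ranked[:keep]
    return population[0]


# Hypothetical stand-in for a downstream-task score: layers with larger
# weights benefit more from cache, with diminishing returns.
LAYER_WEIGHTS = [1.0, 4.0, 2.0, 0.5]


def toy_fitness(budgets):
    return sum(w * math.sqrt(b) for w, b in zip(LAYER_WEIGHTS, budgets))


best = evolve_layer_budgets(toy_fitness, num_layers=4, total_budget=128)
```

Because selection is elitist and the uniform allocation is in the initial population, the returned allocation never scores below the uniform heuristic on the supplied fitness. In a real setting the fitness would be a measured task score, which is far more expensive to evaluate.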
Towards Knowledge-Aware Document Systems Modeling Semantic Coverage Relations via Answerability Detection
Authors: Yehudit Aperstein, Alon Gottlib, Gal Benita, Alexander Apartsin
2025-09-10
Understanding how information is shared across documents, regardless of the
format in which it is expressed, is critical for tasks such as information
retrieval, summarization, and content alignment. In this work, we introduce a
novel framework for modelling Semantic Coverage Relations (SCR), which
classifies document pairs based on how their informational content aligns. We
define three core relation types: equivalence, where both texts convey the same
information using different textual forms or styles; inclusion, where one
document fully contains the information of another and adds more; and semantic
overlap, where each document presents partially overlapping content. To capture
these relations, we adopt a question answering (QA)-based approach, using the
answerability of shared questions across documents as an indicator of semantic
coverage. We construct a synthetic dataset derived from the SQuAD corpus by
paraphrasing source passages and selectively omitting information, enabling
precise control over content overlap. This dataset allows us to benchmark
generative language models and train transformer-based classifiers for SCR
prediction. Our findings demonstrate that discriminative models significantly
outperform generative approaches, with the RoBERTa-base model achieving the
highest accuracy of 61.4% and the Random Forest-based model showing the best
balance with a macro-F1 score of 52.9%. The results show that QA provides an
effective lens for assessing semantic relations across stylistically diverse
texts, offering insights into the capacity of current models to reason about
information beyond surface similarity. The dataset and code developed in this
study are publicly available to support reproducibility.
RTR A Transformer-Based Lossless Crossover with Perfect Phase Alignment
Authors: Xiangying Li, Jiankuan Li, Yong Tang
2025-09-10
This paper proposes a transformer-based lossless crossover method, termed
Resonant Transformer Router (RTR), which achieves frequency separation while
ensuring perfect phase alignment between low-frequency (LF) and high-frequency
(HF) channels at the crossover frequency. The core property of RTR is that its
frequency responses satisfy a linear complementary relation $H_{LF}(f)+H_{HF}(f)=1$, so
that the original signal can be perfectly reconstructed by linear summation of
the two channels. Theoretical derivation and circuit simulations demonstrate
that RTR provides superior energy efficiency, phase consistency, and robustness
against component tolerances. Compared with conventional LC crossovers and
digital FIR/IIR filters, RTR offers a low-loss, low-latency hardware-assisted
filtering solution suitable for high-fidelity audio and communication
front-ends.
The core theory behind this paper's work, lossless crossover, is based on a
Chinese patent [CN116318117A] developed from the previous research of one of
the authors, Jianluan Li. We provide a comprehensive experimental validation of
this theory and propose a new extension.
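The complementary relation H_LF(f) + H_HF(f) = 1 can be checked numerically on a textbook example. The sketch below uses a first-order low-pass and its complement, chosen only because the pair sums to one by construction. It is not the RTR circuit and does not reproduce its phase-alignment property; the function names are hypothetical.

```python
def complementary_pair(f, fc):
    """Return (H_LF, H_HF) of a first-order complementary pair at frequency f."""
    s = 1j * f / fc          # normalized complex frequency
    h_lf = 1 / (1 + s)       # first-order low-pass response
    h_hf = s / (1 + s)       # its complement, equal to 1 - h_lf
    return h_lf, h_hf


def reconstruction_error(f, fc):
    """Magnitude of H_LF + H_HF - 1; zero means lossless linear summation."""
    h_lf, h_hf = complementary_pair(f, fc)
    return abs(h_lf + h_hf - 1)


# Check the relation below, at, and above a 1 kHz crossover frequency.
errors = [reconstruction_error(f, 1000.0) for f in (10.0, 1000.0, 20000.0)]
```

In this simple pair the two branches have equal magnitude at the crossover frequency but are 90 degrees apart in phase, which is the kind of misalignment the abstract says RTR avoids.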
Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning
Authors: Wei Huang, Anda Cheng, Yinggui Wang
2025-09-10
Recent advancements in large language models (LLMs) have shown impressive
capabilities in various downstream tasks but typically face Catastrophic
Forgetting (CF) during fine-tuning. In this paper, we propose the
Forgetting-Aware Pruning Metric (FAPM), a novel pruning-based approach to
balance CF and downstream task performance. Our investigation reveals that the
degree to which task vectors (i.e., the subtraction of pre-trained weights from
the weights fine-tuned on downstream tasks) overlap with pre-trained model
parameters is a critical factor for CF. Based on this finding, FAPM employs the
ratio of the task vector to pre-trained model parameters as a metric to
quantify CF, integrating this measure into the pruning criteria. Importantly,
FAPM does not necessitate modifications to the training process or model
architecture, nor does it require any auxiliary data. We conducted extensive
experiments across eight datasets, covering natural language inference, General
Q&A, Medical Q&A, Math Q&A, reading comprehension, and cloze tests. The results
demonstrate that FAPM limits CF to just 0.25% while maintaining 99.67%
accuracy on downstream tasks. We provide the code to reproduce our results.
Strategies for Improving Communication Efficiency in Distributed and Federated Learning Compression, Local Training, and Personalization
Authors: Kai Yi
2025-09-10
Distributed and federated learning are essential paradigms for training
models across decentralized data sources while preserving privacy, yet
communication overhead remains a major bottleneck. This dissertation explores
strategies to improve communication efficiency, focusing on model compression,
local training, and personalization. We establish a unified framework for
biased and unbiased compression operators with convergence guarantees, then
propose adaptive local training strategies that incorporate personalization to
accelerate convergence and mitigate client drift. In particular, Scafflix
balances global and personalized objectives, achieving superior performance
under both IID and non-IID settings. We further introduce privacy-preserving
pruning frameworks that optimize sparsity while minimizing communication costs,
with Cohort-Squeeze leveraging hierarchical aggregation to reduce cross-device
overhead. Finally, SymWanda, a symmetric post-training pruning method, enhances
robustness under high sparsity and maintains accuracy without retraining.
Extensive experiments on benchmarks and large-scale language models demonstrate
favorable trade-offs among accuracy, convergence, and communication, offering
theoretical and practical insights for scalable, efficient distributed
learning.
Sketched Gaussian Mechanism for Private Federated Learning
Authors: Qiaobo Li, Zhijie Chen, Arindam Banerjee
2025-09-09
Communication cost and privacy are two major considerations in federated
learning (FL). For cost, gradient compression by sketching the
clients' transmitted model updates is often used for reducing per-round
communication. For privacy, the Gaussian mechanism (GM), which consists of
clipping updates and adding Gaussian noise, is commonly used to guarantee
client-level differential privacy. Existing literature on private FL analyzes
privacy of sketching and GM in an isolated manner, illustrating that sketching
provides privacy determined by the sketching dimension and that GM has to
supply any additional desired privacy.
In this paper, we introduce the Sketched Gaussian Mechanism (SGM), which
directly combines sketching and the Gaussian mechanism for privacy. Using
Rényi-DP tools, we present a joint analysis of SGM's overall privacy
guarantee, which is significantly more flexible and sharper compared to
isolated analysis of sketching and GM privacy. In particular, we prove that the
privacy level of SGM for a fixed noise magnitude is proportional to a quantity
determined by the sketching dimension, indicating that (for a moderate
sketching dimension) SGM can provide much stronger privacy guarantees than the
original GM under the same noise budget. We demonstrate the application of SGM
to FL with either gradient descent or adaptive server optimizers, and establish
theoretical results on optimization convergence, which exhibit only a
logarithmic dependence on the number of parameters. Experimental results
confirm that at the same privacy level, SGM based FL is at least competitive
with non-sketching private FL variants and outperforms them in some settings.
Moreover, using adaptive optimization at the server improves empirical
performance while maintaining the privacy guarantees.
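To make the ingredients concrete, the following minimal sketch combines the three steps named in the abstract: clip an update, project it with a random sketching matrix, and add Gaussian noise. The order of the steps, the Gaussian sketching matrix with variance 1/b, the shared-seed convention, and all function and parameter names are assumptions for illustration; the paper's exact construction and privacy accounting are not reproduced here.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale the update so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def sketched_gaussian_update(update, sketch_dim, clip_norm, sigma,
                             sketch_seed=0, noise_seed=1):
    """Clip, sketch to `sketch_dim` coordinates, then add Gaussian noise.

    Assumption: clients and server share `sketch_seed`, so they all use
    the same sketching matrix; the noise seed is private to the client.
    """
    sketch_rng = random.Random(sketch_seed)
    noise_rng = random.Random(noise_seed)
    clipped = clip_update(update, clip_norm)
    std = 1.0 / math.sqrt(sketch_dim)
    sketched = []
    for _ in range(sketch_dim):
        # One row of the sketching matrix, generated on the fly.
        row = [sketch_rng.gauss(0.0, std) for _ in clipped]
        sketched.append(sum(r * x for r, x in zip(row, clipped)))
    return [s + noise_rng.gauss(0.0, sigma) for s in sketched]


update = [0.5 * i for i in range(32)]   # hypothetical 32-parameter update
message = sketched_gaussian_update(update, sketch_dim=8,
                                   clip_norm=1.0, sigma=0.1)
```

The transmitted message has `sketch_dim` entries instead of one per parameter, which is where the per-round communication saving comes from.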
XML Prompting as Grammar-Constrained Interaction Fixed-Point Semantics, Convergence Guarantees, and Human-AI Protocols
Authors: Faruk Alpay, Taylan Alpay
2025-09-09
Structured prompting with XML tags has emerged as an effective way to steer
large language models (LLMs) toward parseable, schema-adherent outputs in
real-world systems. We develop a logic-first treatment of XML prompting that
unifies (i) grammar-constrained decoding, (ii) fixed-point semantics over
lattices of hierarchical prompts, and (iii) convergent human-AI interaction
loops. We formalize a complete lattice of XML trees under a refinement order
and prove that monotone prompt-to-prompt operators admit least fixed points
(Knaster-Tarski) that characterize steady-state protocols; under a task-aware
contraction metric on trees, we further prove Banach-style convergence of
iterative guidance. We instantiate these results with context-free grammars
(CFGs) for XML schemas and show how constrained decoding guarantees
well-formedness while preserving task performance. A set of multi-layer
human-AI interaction recipes demonstrates practical deployment patterns,
including multi-pass "plan → verify → revise" routines and agentic tool
use. We provide mathematically complete proofs and tie our framework to recent
advances in grammar-aligned decoding, chain-of-verification, and programmatic
prompting.
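The least-fixed-point idea can be illustrated on a much smaller lattice than the one in the paper. The sketch below replaces XML trees under refinement with finite sets of tag names under inclusion, and iterates a monotone operator from the bottom element until it stops changing. The rule table and function names are hypothetical.

```python
def least_fixed_point(step, bottom=frozenset()):
    """Iterate a monotone operator from the bottom element until it is stable.

    On a finite lattice this reaches the least fixed point whose existence
    the Knaster-Tarski theorem guarantees.
    """
    current = bottom
    while True:
        nxt = step(current)
        if nxt == current:
            return current
        current = nxt


# Hypothetical refinement rules: a tag on the left requires the tags on the right.
RULES = {"plan": {"verify"}, "verify": {"revise"}}


def refine(tags):
    """Monotone operator: always require <plan>, then add every implied tag."""
    out = set(tags) | {"plan"}
    for tag in tags:
        out |= RULES.get(tag, set())
    return frozenset(out)


steady_state = least_fixed_point(refine)
```

Starting from the empty prompt, the iteration adds `plan`, then `verify`, then `revise`, and then stabilizes, so the steady state contains all three tags.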
OCTANE -- Optimal Control for Tensor-based Autoencoder Network Emergence Explicit Case
Authors: Ratna Khatri, Anthony Kolshorn, Colin Olson, Harbir Antil
2025-09-09
This paper presents a novel, mathematically rigorous framework for
autoencoder-type deep neural networks that combines optimal control theory and
low-rank tensor methods to yield memory-efficient training and automated
architecture discovery. The learning task is formulated as an optimization
problem constrained by differential equations representing the encoder and
decoder components of the network, and the corresponding optimality conditions
are derived via a Lagrangian approach. Efficient memory compression is enabled
by approximating differential equation solutions on low-rank tensor manifolds
using an adaptive explicit integration scheme. These concepts are combined to
form OCTANE (Optimal Control for Tensor-based Autoencoder Network Emergence) --
a unified training framework that yields compact autoencoder architectures,
reduces memory usage, and enables effective learning, even with limited
training data. The framework's utility is illustrated with application to image
denoising and deblurring tasks and recommendations regarding governing
hyperparameters are provided.
SCA-LLM Spectral-Attentive Channel Prediction with Large Language Models in MIMO-OFDM
Authors: Ke He, Le He, Lisheng Fan, Xianfu Lei, Thang X. Vu, George K. Karagiannidis, Symeon Chatzinotas
2025-09-09
In recent years, the success of large language models (LLMs) has inspired
growing interest in exploring their potential applications in wireless
communications, especially for channel prediction tasks. However, directly
applying LLMs to channel prediction faces a domain mismatch issue stemming from
their text-based pre-training. To mitigate this, the "adapter + LLM" paradigm
has emerged, where an adapter is designed to bridge the domain gap between the
channel state information (CSI) data and LLMs. While showing initial success,
existing adapters may not fully exploit the potential of this paradigm. To
address this limitation, this work provides a key insight that learning
representations from the spectral components of CSI features can more
effectively help bridge the domain gap. Accordingly, we propose a
spectral-attentive framework, named SCA-LLM, for channel prediction in
multiple-input multiple-output orthogonal frequency division multiplexing
(MIMO-OFDM) systems. Specifically, its novel adapter can capture finer spectral
details and better adapt the LLM for channel prediction than previous methods.
Extensive simulations show that SCA-LLM achieves state-of-the-art prediction
performance and strong generalization, yielding up to
normalized mean squared error (NMSE) advantage over the previous
LLM-based
method. Ablation studies further confirm the superiority of SCA-LLM in
mitigating domain mismatch.
Tensor-Train Operator Inference
Authors: Engin Danis, Duc Truong, Kim Ø. Rasmussen, Boian S. Alexandrov
2025-09-09
In this study, we present a tensor-train framework for nonintrusive operator
inference aimed at learning discrete operators and using them to predict
solutions of physical governing equations. Our framework comprises three
approaches: full-order tensor-train operator inference, full-order quantized
tensor-train operator inference, and reduced-order tensor-train operator
inference. In each case, snapshot data is represented in tensor-train
format, either through compression or cross interpolation, enabling the
efficient handling of extremely large datasets with significantly reduced
computational effort compared to standard methods. The effectiveness of each
approach is demonstrated through numerical experiments related to Computational
Fluid Dynamics and benchmarked against the standard reduced-order operator
inference method, highlighting the advantages of the tensor-train
representations in both accuracy and scalability.
Feature Space Analysis by Guided Diffusion Model
Authors: Kimiaki Shirahama, Miki Yanobu, Kaduki Yamashita, Miho Ohsaki
2025-09-09
One of the key issues in Deep Neural Networks (DNNs) is the black-box nature
of their internal feature extraction process. Targeting vision-related domains,
this paper focuses on analysing the feature space of a DNN by proposing a
decoder that can generate images whose features are guaranteed to closely match
a user-specified feature. Owing to this guarantee, which is missing from past
studies, our decoder allows us to show which of the various attributes in an
image are encoded into a feature by the DNN, by generating images whose
features are in proximity to that feature. Our decoder is implemented as a
guided diffusion model that guides the reverse image generation of a
pre-trained diffusion model to minimise the Euclidean distance between the
feature of a clean image estimated at each step and the user-specified feature.
One practical advantage of our decoder is that it can analyse feature spaces of
different DNNs with no additional training and run on a single COTS GPU. The
experimental results targeting CLIP's image encoder, ResNet-50 and vision
transformer demonstrate that images generated by our decoder have features
remarkably similar to the user-specified ones and reveal valuable insights into
these DNNs' feature spaces.
Biased Tales Cultural and Topic Bias in Generating Children's Stories
Authors: Donya Rooein, Vilém Zouhar, Debora Nozza, Dirk Hovy
2025-09-09
Stories play a pivotal role in human communication, shaping beliefs and
morals, particularly in children. As parents increasingly rely on large
language models (LLMs) to craft bedtime stories, the presence of cultural and
gender stereotypes in these narratives raises significant concerns. To address
this issue, we present Biased Tales, a comprehensive dataset designed to
analyze how biases influence protagonists' attributes and story elements in
LLM-generated stories. Our analysis uncovers striking disparities. When the
protagonist is described as a girl (as compared to a boy), appearance-related
attributes increase by 55.26%. Stories featuring non-Western children
disproportionately emphasize cultural heritage, tradition, and family themes
far more than those for Western children. Our findings highlight the role of
sociocultural bias in making creative AI use more equitable and diverse.
A Robot That Listens Enhancing Self-Disclosure and Engagement Through Sentiment-based Backchannels and Active Listening
Authors: Hieu Tran, Go-Eum Cha, Sooyeon Jeong
2025-09-09
As social robots get more deeply integrated into our everyday lives, they will
be expected to engage in meaningful conversations and exhibit socio-emotionally
intelligent listening behaviors when interacting with people. Active listening
and backchanneling could be one way to enhance robots' communicative
capabilities and enhance their effectiveness in eliciting deeper
self-disclosure, providing a sense of empathy, and forming positive rapport and
relationships with people. Thus, we developed an LLM-powered social robot that
can exhibit contextually appropriate sentiment-based backchanneling and active
listening behaviors (active listening+backchanneling) and compared its efficacy
in eliciting people's self-disclosure to robots that do not
exhibit any of these listening behaviors (control) and a robot that only
exhibits backchanneling behavior (backchanneling-only). Through our experimental
study with sixty-five participants, we found that the participants who conversed
with the active listening robot perceived the interactions more positively,
exhibited the highest self-disclosures, and reported the strongest
sense of being listened to. The results of our study suggest that the
implementation of active listening behaviors in social robots has the potential
to improve human-robot communication and could further contribute to the
building of deeper human-robot relationships and rapport.
Are Humans as Brittle as Large Language Models?
Authors: Jiahui Li, Sean Papay, Roman Klinger
2025-09-09
The output of large language models (LLMs) is unstable, due both to
non-determinism of the decoding process and to prompt brittleness. While
the intrinsic non-determinism of LLM generation may mimic existing uncertainty
in human annotations through distributional shifts in outputs, it is largely
assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs.
This raises the question: do human annotators show similar sensitivity to
instruction changes? If so, should prompt brittleness in LLMs be considered
problematic? One may alternatively hypothesize that prompt brittleness
correctly reflects human annotation variances. To fill this research gap, we
systematically compare the effects of prompt modifications on LLMs and
identical instruction modifications for human annotators, focusing on the
question of whether humans are similarly sensitive to prompt perturbations. To
study this, we prompt both humans and LLMs for a set of text classification
tasks conditioned on prompt variations. Our findings indicate that both humans
and LLMs exhibit increased brittleness in response to specific types of prompt
modifications, particularly those involving the substitution of alternative
label sets or label formats. However, the distribution of human judgments is
less affected by typographical errors and reversed label order than that of
LLMs.
Query Expansion in the Age of Pre-trained and Large Language Models A Comprehensive Survey
Authors: Minghan Li, Xinxuan Lv, Junjie Zou, Tongna Chen, Chao Zhang, Suchao An, Ercong Nie, Guodong Zhou
2025-09-09
Modern information retrieval (IR) must bridge short, ambiguous queries and
ever more diverse, rapidly evolving corpora. Query Expansion (QE) remains a key
mechanism for mitigating vocabulary mismatch, but the design space has shifted
markedly with pre-trained language models (PLMs) and large language models
(LLMs). This survey synthesizes the field from three angles: (i) a
four-dimensional framework of query expansion - from the point of injection
(explicit vs. implicit QE), through grounding and interaction (knowledge bases,
model-internal capabilities, multi-turn retrieval) and learning alignment, to
knowledge graph-based argumentation; (ii) a model-centric taxonomy spanning
encoder-only, encoder-decoder, decoder-only, instruction-tuned, and
domain/multilingual variants, highlighting their characteristic affordances for
QE (contextual disambiguation, controllable generation, zero-/few-shot
reasoning); and (iii) practice-oriented guidance on where and how neural QE
helps in first-stage retrieval, multi-query fusion, re-ranking, and
retrieval-augmented generation (RAG). We compare traditional query expansion
with PLM/LLM-based methods across seven key aspects, and we map applications
across web search, biomedicine, e-commerce, open-domain QA/RAG, conversational
and code search, and cross-lingual settings. The review distills design
grounding and interaction, alignment/distillation (SFT/PEFT/DPO), and KG
constraints - as robust remedies to topic drift and hallucination. We conclude
with an agenda on quality control, cost-aware invocation, domain/temporal
adaptation, evaluation beyond end-task metrics, and fairness/privacy.
Collectively, these insights provide a principled blueprint for selecting and
combining QE techniques under real-world constraints.
SEEC Segmentation-Assisted Multi-Entropy Models for Learned Lossless Image Compression
Authors: Chunhang Zheng, Zichang Ren, Dou Li
2025-09-09
Recently, learned image compression has attracted considerable attention due
to its superior performance over traditional methods. However, most existing
approaches employ a single entropy model to estimate the probability
distribution of pixel values across the entire image, which limits their
ability to capture the diverse statistical characteristics of different
semantic regions. To overcome this limitation, we propose Segmentation-Assisted
Multi-Entropy Models for Lossless Image Compression (SEEC). Our framework
utilizes semantic segmentation to guide the selection and adaptation of
multiple entropy models, enabling more accurate probability distribution
estimation for distinct semantic regions. Specifically, SEEC first extracts
image features and then applies semantic segmentation to identify different
regions, each assigned a specialized entropy model to better capture its unique
statistical properties. Finally, a multi-channel discrete logistic mixture
likelihood is employed to model the pixel value distributions effectively.
Experimental results on benchmark datasets demonstrate that SEEC achieves
state-of-the-art compression ratios while introducing only minimal encoding and
decoding latency. With superior performance, the proposed model also supports
Regions of Interest (ROIs) coding conditioned on the provided segmentation mask.
Our code is available at https://github.com/chunbaobao/SEEC.
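For readers unfamiliar with discrete logistic mixture likelihoods, the sketch below evaluates the standard single-channel form popularized by PixelCNN++: each pixel value gets the probability mass that a mixture of logistic distributions assigns to its unit-width bin. It is a generic illustration, not SEEC's multi-channel formulation, and the mixture parameters shown are made up.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def discretized_logistic_mixture_pmf(x, weights, means, scales, lo=0, hi=255):
    """Probability of integer pixel value x under a logistic mixture.

    Each component contributes CDF(x + 0.5) - CDF(x - 0.5); the edge bins
    at `lo` and `hi` absorb the tails so the masses sum to one.
    """
    total = 0.0
    for w, mu, s in zip(weights, means, scales):
        upper = 1.0 if x == hi else _sigmoid((x + 0.5 - mu) / s)
        lower = 0.0 if x == lo else _sigmoid((x - 0.5 - mu) / s)
        total += w * (upper - lower)
    return total


# Hypothetical two-component mixture for one pixel.
WEIGHTS, MEANS, SCALES = [0.7, 0.3], [100.0, 180.0], [5.0, 10.0]
pmf = [discretized_logistic_mixture_pmf(x, WEIGHTS, MEANS, SCALES)
       for x in range(256)]
# Ideal code length in bits for the value 100 under this model.
bits_for_100 = -math.log2(pmf[100])
```

A lossless codec driven by such a model spends roughly `-log2(p)` bits on a pixel, so a better-fitting entropy model for a region translates directly into a smaller file.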
Unleashing the True Potential of LLMs A Feedback-Triggered Self-Correction with Long-Term Multipath Decoding
Authors: Jipeng Li, Zeyu Gao, Yubin Qi, Hande Dong, Weijian Chen, Qiang Lin
2025-09-09
Large Language Models (LLMs) have achieved remarkable performance across
diverse tasks, yet their susceptibility to generating incorrect content during
inference remains a critical unsolved challenge. While self-correction methods
offer potential solutions, their effectiveness is hindered by two inherent
limitations: (1) the absence of reliable guidance signals for error
localization, and (2) the restricted reasoning depth imposed by conventional
next-token decoding paradigms. To address these issues, we propose
Feedback-Triggered Regeneration (FTR), a novel framework that synergizes user
feedback with enhanced decoding dynamics. Specifically, FTR activates response
regeneration only upon receiving negative user feedback, thereby circumventing
error propagation from faulty self-assessment while preserving originally
correct outputs. Furthermore, we introduce Long-Term Multipath (LTM) decoding,
which enables systematic exploration of multiple reasoning trajectories through
delayed sequence evaluation, effectively overcoming the myopic decision-making
characteristic of standard next-token prediction. Extensive experiments on
mathematical reasoning and code generation benchmarks demonstrate that our
framework achieves consistent and significant improvements over
state-of-the-art prompt-based self-correction methods.
Collaborative Exploration with a Marsupial Ground-Aerial Robot Team through Task-Driven Map Compression
Authors: Angelos Zacharia, Mihir Dharmadhikari, Kostas Alexis
2025-09-09
Efficient exploration of unknown environments is crucial for autonomous
robots, especially in confined and large-scale scenarios with limited
communication. To address this challenge, we propose a collaborative
exploration framework for a marsupial ground-aerial robot team that leverages
the complementary capabilities of both platforms. The framework employs a
graph-based path planning algorithm to guide exploration and deploy the aerial
robot in areas where its expected gain significantly exceeds that of the ground
robot, such as large open spaces or regions inaccessible to the ground
platform, thereby maximizing coverage and efficiency. To facilitate large-scale
spatial information sharing, we introduce a bandwidth-efficient, task-driven
map compression strategy. This method enables each robot to reconstruct
resolution-specific volumetric maps while preserving exploration-critical
details, even at high compression rates. By selectively compressing and sharing
key data, communication overhead is minimized, ensuring effective map
integration for collaborative path planning. Simulation and real-world
experiments validate the proposed approach, demonstrating its effectiveness in
improving exploration efficiency while significantly reducing data
transmission.
Topology-Aware Optimization of Gaussian Primitives for Human-Centric Volumetric Videos
Authors: Yuheng Jiang, Chengcheng Guo, Yize Wu, Yu Hong, Shengkun Zhu, Zhehao Shen, Yingliang Zhang, Shaohui Jiao, Zhuo Su, Lan Xu, Marc Habermann, Christian Theobalt
2025-09-09
Volumetric video is emerging as a key medium for digitizing the dynamic
physical world, creating virtual environments with six degrees of freedom
to deliver immersive user experiences. However, robustly modeling general
dynamic scenes, especially those involving topological changes while
maintaining long-term tracking remains a fundamental challenge. In this paper,
we present TaoGS, a novel topology-aware dynamic Gaussian representation that
disentangles motion and appearance to support both long-range tracking and
topological adaptation. We represent scene motion with a set of motion
Gaussians, which are continuously updated by a spatio-temporal tracker and
photometric cues that detect structural variations across frames. To capture
fine-grained texture, each motion Gaussian anchors and dynamically activates a
set of local appearance Gaussians, which are non-rigidly warped to the current
frame to provide strong initialization and significantly reduce training time.
This activation mechanism enables efficient modeling of detailed textures and
maintains temporal coherence, allowing high-fidelity rendering even under
challenging scenarios such as changing clothes. To enable seamless integration
into codec-based volumetric formats, we introduce a global Gaussian Lookup
Table that records the lifespan of each Gaussian and organizes attributes into
a lifespan-aware 2D layout. This structure aligns naturally with standard video
codecs and supports up to 40× compression. TaoGS provides a unified, adaptive
solution for scalable volumetric video under topological variation, capturing
moments where "elegance in motion" and "Power in Stillness", delivering
immersive experiences that harmonize with the physical world.
MaLei at MultiClinSUM Summarisation of Clinical Documents using Perspective-Aware Iterative Self-Prompting with LLMs
Authors: Libo Ren, Yee Man Ng, Lifeng Han
2025-09-09
Efficient communication between patients and clinicians plays an important
role in shared decision-making. However, clinical reports are often lengthy and
filled with clinical jargon, making it difficult for domain experts to identify
important aspects in the document efficiently. This paper presents the
methodology we applied in the MultiClinSUM shared task for summarising clinical
case documents. We used an Iterative Self-Prompting technique on large language
models (LLMs) by asking LLMs to generate task-specific prompts and refine them
via example-based few-shot learning. Furthermore, we used lexical and embedding
space metrics, ROUGE and BERT-score, to guide the model fine-tuning with
epochs. Our submission using perspective-aware ISP on GPT-4 and GPT-4o achieved
ROUGE scores (46.53, 24.68, 30.77) and BERTscores (87.84, 83.25, 85.46) for (P,
R, F1) from the official evaluation on 3,396 clinical case reports from various
specialties extracted from open journals. The high BERTscore indicates that the
model produced semantically equivalent output summaries compared to the
references, even though the overlap at the exact lexicon level is lower, as
reflected in the lower ROUGE scores. This work sheds some light on how
perspective-aware ISP (PA-ISP) can be deployed for clinical report
summarisation and support better communication between patients and clinicians.
PanoLAM Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image
Authors: Peng Li, Yisheng He, Yingdong Hu, Yuan Dong, Weihao Yuan, Yuan Liu, Zilong Dong, Yike Guo
2025-09-09
We present a feed-forward framework for Gaussian full-head synthesis from a
single unposed image. Unlike previous work that relies on time-consuming GAN
inversion and test-time optimization, our framework can reconstruct the
Gaussian full-head model given a single unposed image in a single forward pass.
This enables fast reconstruction and rendering during inference. To mitigate
the lack of large-scale 3D head assets, we propose a large-scale synthetic
dataset from trained 3D GANs and train our framework using only synthetic data.
For efficient high-fidelity generation, we introduce a coarse-to-fine Gaussian
head generation pipeline, where points from the FLAME model interact
with the image features by transformer blocks for feature extraction and coarse
shape reconstruction, which are then densified for high-fidelity
reconstruction. To fully leverage the prior knowledge residing in pretrained 3D
GANs for effective reconstruction, we propose a dual-branch framework that
effectively aggregates the structured spherical triplane feature and
unstructured point-based features for more effective Gaussian head
reconstruction. Experimental results show the effectiveness of our framework
compared with existing work.
PatchSeeker Mapping NVD Records to their Vulnerability-fixing Commits with LLM Generated Commits and Embeddings
Authors: Huu Hung Nguyen, Anh Tuan Nguyen, Thanh Le-Cong, Yikun Li, Han Wei Ang, Yide Yin, Frank Liauw, Shar Lwin Khin, Ouh Eng Lieh, Ting Zhang, David Lo
2025-09-09
Software vulnerabilities pose serious risks to modern software ecosystems.
While the National Vulnerability Database (NVD) is the authoritative source for
cataloging these vulnerabilities, it often lacks explicit links to the
corresponding Vulnerability-Fixing Commits (VFCs). VFCs encode precise code
changes, enabling vulnerability localization, patch analysis, and dataset
construction. Automatically mapping NVD records to their true VFCs is therefore
critical. Existing approaches have limitations as they rely on sparse, often
noisy commit messages and fail to capture the deep semantics in the
vulnerability descriptions. To address this gap, we introduce PatchSeeker, a
novel method that leverages large language models to create rich semantic links
between vulnerability descriptions and their VFCs. PatchSeeker generates
embeddings from NVD descriptions and enhances commit messages by synthesizing
detailed summaries for those that are short or uninformative. These generated
messages act as a semantic bridge, effectively closing the information gap
between natural language reports and low-level code changes. Our approach
PatchSeeker achieves 59.3% higher MRR and 27.9% higher Recall@10 than the
best-performing baseline, Prospector, on the benchmark dataset. The extended
evaluation on recent CVEs further confirms PatchSeeker's effectiveness.
Ablation study shows that both the commit message generation method and the
selection of backbone LLMs make a positive contribution to PatchSeeker. We also
discuss limitations and open challenges to guide future work.
Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data
Authors: Gokul Karthik Kumar, Rishabh Saraf, Ludovick Lepauloux, Abdul Muneer, Billel Mokeddem, Hakim Hacid
2025-09-09
Large language models (LLMs) have transformed NLP, yet their integration with
audio remains underexplored -- despite audio's centrality to human
communication. We introduce Falcon3-Audio, a family of Audio-Language Models
(ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably
small amount of public audio data -- less than 30K hours (5K unique) --
Falcon3-Audio-7B matches the best reported performance among open-weight models
on the MMAU benchmark, with a score of 64.14, matching R1-AQA, while
distinguishing itself through superior data and parameter efficiency,
single-stage training, and transparency. Notably, our smallest 1B model remains
competitive with larger open models ranging from 2B to 13B parameters. Through
extensive ablations, we find that common complexities -- such as curriculum
learning, multiple audio encoders, and intricate cross-attention connectors --
are not required for strong performance, even compared to models trained on
over 500K hours of data.
Multi-view-guided Passage Reranking with Large Language Models
Authors: Jeongwoo Na, Jun Kwon, Eunseong Choi, Jongwuk Lee
2025-09-09
Recent advances in large language models (LLMs) have shown impressive
performance in passage reranking tasks. Despite their success, LLM-based
methods still face challenges in efficiency and sensitivity to external biases.
(1) Existing models rely mostly on autoregressive generation and sliding window
strategies to rank passages, which incur heavy computational overhead as the
number of passages increases. (2) External biases, such as position or
selection bias, hinder the model's ability to accurately represent passages and
increase input-order sensitivity. To address these limitations, we introduce a
novel passage reranking model, called Multi-View-guided Passage Reranking
(MVP). MVP is a non-generative LLM-based reranking method that encodes
query-passage information into diverse view embeddings without being influenced
by external biases. For each view, it combines query-aware passage embeddings
to produce a distinct anchor vector, which is then used to directly compute
relevance scores in a single decoding step. In addition, it employs an
orthogonal loss to make the views more distinctive. Extensive experiments
demonstrate that MVP, with just 220M parameters, matches the performance of
much larger 7B-scale fine-tuned models while achieving a 100x reduction in
inference latency. Notably, the 3B-parameter variant of MVP achieves
state-of-the-art performance on both in-domain and out-of-domain benchmarks.
The source code is available at: https://github.com/bulbna/MVP
DuoServe-MoE Dual-Phase Expert Prefetch and Cache Scheduling for Efficient MoE LLM Inference
Authors: Yuning Zhang, Grant Pinkert, Nan Yang, Yanli Li, Dong Yuan
2025-09-09
Large Language Models (LLMs) have demonstrated impressive performance across
a wide range of deep learning tasks. Mixture of Experts (MoE) further enhances
their capabilities by increasing model width through sparsely activated expert
branches, which keeps inference computation efficient. However, the large
number of expert weights introduces significant GPU memory pressure, especially
in resource-constrained environments such as single-GPU servers. More
importantly, MoE inference consists of two fundamentally different stages: a
prefill stage where most experts are activated densely, and a decode stage
where only a few experts are triggered sparsely. Treating these stages with a
uniform scheduling strategy often leads to suboptimal latency and memory usage.
To address this, we propose DuoServe-MoE, an inference serving system that
explicitly separates prefill and decode stages and applies tailored expert
scheduling strategies to each. In the prefill stage, DuoServe-MoE uses a
two-stream CUDA pipeline that overlaps expert weight prefetching with the
computation of non-MoE layers, limiting expert residency in GPU memory. In the
decode stage, a lightweight layer-level predictor trained offline from
activation traces is used to prefetch only the most likely activated experts,
without requiring any changes to the model. Experiments on 4-bit Mixtral-8x7B
and 8x22B models show that DuoServe-MoE improves end-to-end latency by 1.42 to
7.54 times while keeping peak memory usage at only 15 percent of the full model
size.
PersonaFuse A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions
Authors: Yixuan Tang, Yi Yang, Ahmed Abbasi
2025-09-09
Recent advancements in Large Language Models (LLMs) demonstrate remarkable
capabilities across various fields. These developments have led to more direct
communication between humans and LLMs in various situations, such as social
companionship and psychological support. However, LLMs often exhibit
limitations in emotional perception and social competence during real-world
conversations. These limitations partly originate from their inability to adapt
their communication style and emotional expression to different social and task
contexts. In this work, we introduce PersonaFuse, a novel LLM post-training
framework that enables LLMs to adapt and express different personalities for
varying situations. Inspired by Trait Activation Theory and the Big Five
personality model, PersonaFuse employs a Mixture-of-Expert architecture that
combines persona adapters with a dynamic routing network, enabling contextual
trait expression. Experimental results show that PersonaFuse substantially
outperforms baseline models across multiple dimensions of social-emotional
intelligence. Importantly, these gains are achieved without sacrificing general
reasoning ability or model safety, which remain common limitations of direct
prompting and supervised fine-tuning approaches. PersonaFuse also delivers
consistent improvements in downstream human-centered applications, such as
mental health counseling and review-based customer service. Finally, human
preference evaluations against leading LLMs, including GPT-4o and DeepSeek,
demonstrate that PersonaFuse achieves competitive response quality despite its
comparatively smaller model size. These findings demonstrate that PersonaFuse
offers a theoretically grounded and practical approach for developing
social-emotional enhanced LLMs, marking a significant advancement toward more
human-centric AI systems.
Explaining How Quantization Disparately Skews a Model
Authors: Abhimanyu Bellam, Jung-Eun Kim
2025-09-08
Post Training Quantization (PTQ) is widely adopted due to its high compression
capacity and speed with minimal impact on accuracy. However, we
observed that disparate impacts are exacerbated by quantization, especially for
minority groups. Our analysis explains that in the course of quantization there
is a chain of factors attributed to a disparate impact across groups during
forward and backward passes. We explore how the changes in weights and
activations induced by quantization cause cascaded impacts in the network,
resulting in logits with lower variance, increased loss, and compromised group
accuracies. We extend our study to verify the influence of these impacts on
group gradient norms and eigenvalues of the Hessian matrix, providing insights
into the state of the network from an optimization point of view. To mitigate
these effects, we propose integrating mixed precision Quantization Aware
Training (QAT) with dataset sampling methods and weighted loss functions,
therefore providing fair deployment of quantized neural networks.
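As a rough illustration of the weight changes that quantization induces, the following Python sketch applies symmetric uniform quantization with a single per-tensor scale and measures the round-trip error at two bit widths. The scheme and the example weights are assumptions for illustration; the abstract does not state which PTQ scheme the authors analyze.

```python
from typing import List


def quantize_dequantize(weights: List[float], num_bits: int) -> List[float]:
    """Symmetric uniform quantization to signed integers, then back to floats."""
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax                    # one scale for the whole tensor
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    """Mean absolute difference between two equally long lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


weights = [0.5, -0.25, 0.1, 0.03]             # hypothetical weights
error_8bit = mean_abs_error(weights, quantize_dequantize(weights, 8))
error_4bit = mean_abs_error(weights, quantize_dequantize(weights, 4))
```

Small-magnitude weights absorb the largest relative error at low bit widths, which is one plausible starting point for the cascaded effects the abstract describes.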
Neurocognitive Modeling for Text Generation Deep Learning Architecture for EEG Data
Authors: Khushiyant
2025-09-08
Text generating capabilities have undergone a substantial transformation with
the introduction of large language models (LLMs). Electroencephalography
(EEG)-based text production is still difficult, though, because it requires a
lot of data and processing power. This paper introduces a new method that
combines the use of the Gemma 2B LLM with a classifier-LLM architecture to
incorporate a Recurrent Neural Network (RNN) encoder. Our approach drastically
lowers the amount of data and compute power needed while achieving performance
close to that of cutting-edge methods. Notably, compared to current
methodologies, our methodology delivers an overall performance improvement of
10%. The suggested architecture demonstrates the possibility of effective
transfer learning for EEG-based text production, remaining strong and
functional even in the face of data limits. This work highlights the potential
of integrating LLMs with EEG decoding to improve assistive technologies and
increase independence and communication for those with severe motor limitations.
Our method pushes the limits of present capabilities and opens new paths for
research and application in brain-computer interfaces by efficiently using the
strengths of pre-trained language models. This makes EEG-based text production
more accessible and efficient.
DischargeSim A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge
Authors: Zonghai Yao, Michael Sun, Won Seok Jang, Sunjae Kwon, Soie Kwon, Hong Yu
2025-09-08
Discharge communication is a critical yet underexplored component of patient
care, where the goal shifts from diagnosis to education. While recent large
language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they
fail to evaluate models' ability to support patients after the visit. We
introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability
to act as personalized discharge educators. DischargeSim simulates post-visit,
multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with
diverse psychosocial profiles (e.g., health literacy, education, emotion).
Interactions are structured across six clinically grounded discharge topics and
assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge
evaluation, (2) personalized document generation including free-text summaries
and structured AHRQ checklists, and (3) patient comprehension through a
downstream multiple-choice exam. Experiments across 18 LLMs reveal significant
gaps in discharge education capability, with performance varying widely across
patient profiles. Notably, model size does not always yield better education
outcomes, highlighting trade-offs in strategy use and content prioritization.
DischargeSim offers a first step toward benchmarking LLMs in post-visit
clinical education and promoting equitable, personalized patient support.
Faster VGGT with Block-Sparse Global Attention
Authors: Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, Bastian Leibe
2025-09-08
Efficient and accurate feed-forward multi-view reconstruction has long been
an important task in computer vision. Recent transformer-based models like VGGT
and have achieved impressive results with simple architectures, yet
they face an inherent runtime bottleneck, due to the quadratic complexity of
the global attention layers, that limits the scalability to large image sets.
In this paper, we empirically analyze the global attention matrix of these
models and observe that probability mass concentrates on a small subset of
patch-patch interactions that correspond to cross-view geometric matches.
Motivated by the structured attention and inspired by recent advancement in
large language models, we propose a replacement for the dense global attention
operation based on highly optimized block-sparse kernels, yielding up to
faster inference with comparable task performance. Our retrofit
requires no retraining of the backbone, extends to both VGGT and , and
supports large image collections. Evaluations on a comprehensive suite of
multi-view benchmarks demonstrate the effectiveness of our approach.
HOT Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers
Authors: Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Shijian Lu, Nicu Sebe
2025-09-08
Transformers have been successfully applied in the field of video-based 3D
human pose estimation. However, the high computational costs of these video
pose transformers (VPTs) make them impractical on resource-constrained devices.
In this paper, we present a hierarchical plug-and-play pruning-and-recovering
framework, called Hierarchical Hourglass Tokenizer (HOT), for efficient
transformer-based 3D human pose estimation from videos. HOT begins with
progressively pruning pose tokens of redundant frames and ends with recovering
full-length sequences, resulting in a few pose tokens in the intermediate
blocks and thus improving the model efficiency. It works with two
key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module
(TRM). TPM dynamically selects a few representative tokens to eliminate the
redundancy of video frames, while TRM restores the detailed spatio-temporal
information based on the selected tokens, thereby expanding the network output
to the original full-length temporal resolution for fast inference. Our method
is general-purpose: it can be easily incorporated into common VPT models on
both seq2seq and seq2frame pipelines while effectively accommodating different
token pruning and recovery strategies. In addition, our HOT reveals that
maintaining the full pose sequence is unnecessary, and a few pose tokens of
representative frames can achieve both high efficiency and estimation accuracy.
Extensive experiments on multiple benchmark datasets demonstrate both the
effectiveness and efficiency of the proposed method. Code and models are
available at https://github.com/NationalGAILab/HoT.
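The prune-then-recover idea described above can be sketched in a few lines of Python. This is not the authors' TPM or TRM: evenly spaced frame selection stands in for the dynamic token selection, and nearest-neighbor copying stands in for the recovery module. All names and data here are hypothetical.

```python
from typing import List, Sequence, Tuple


def prune_tokens(tokens: Sequence, keep: int) -> Tuple[List[int], List]:
    """Keep `keep` evenly spaced frame tokens (a stand-in for learned selection)."""
    n = len(tokens)
    if keep >= n:
        return list(range(n)), list(tokens)
    if keep == 1:
        indices = [n // 2]
    else:
        indices = [round(i * (n - 1) / (keep - 1)) for i in range(keep)]
    return indices, [tokens[i] for i in indices]


def recover_tokens(indices: List[int], kept: List, length: int) -> List:
    """Expand back to `length` frames by copying the nearest kept token."""
    recovered = []
    for t in range(length):
        nearest = min(range(len(indices)), key=lambda j: abs(indices[j] - t))
        recovered.append(kept[nearest])
    return recovered


frames = [[float(t), float(t) * 0.5] for t in range(9)]   # hypothetical pose tokens
indices, kept = prune_tokens(frames, keep=3)
restored = recover_tokens(indices, kept, length=len(frames))
```

Only the three kept tokens would pass through the intermediate blocks, while the output still covers all nine frames.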
Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
Authors: Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, Mengdi Wang
2025-09-08
We propose TraceRL, a trajectory-aware reinforcement learning framework for
diffusion language models (DLMs) that incorporates preferred inference
trajectory into post-training, and is applicable across different
architectures. Equipped with a diffusion-based value model that enhances
training stability, we demonstrate improved reasoning performance on complex
math and coding tasks. Besides, it can also be applied to adapt block-specific
models to larger blocks, which improves sampling flexibility. Employing
TraceRL, we derive a series of state-of-the-art diffusion language models,
namely TraDo. Although smaller than 7B-scale AR models, TraDo-4B-Instruct still
consistently outperforms them across complex math reasoning tasks.
TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over
Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical
reasoning benchmarks. Through curriculum learning, we also derive the first
long-CoT DLM, outperforming Qwen2.5-7B-Instruct on MATH500 with an 18.1%
relative accuracy gain. To facilitate reproducible research and practical
applications, we release a comprehensive open-source framework for building,
training, and deploying diffusion LLMs across diverse architectures. The
framework integrates accelerated KV-cache techniques and inference engines for
both inference and reinforcement learning, and includes implementations of
various supervised fine-tuning and RL methods for mathematics, coding, and
general tasks. Code and Models: https://github.com/Gen-Verse/dLLM-RL
Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data
Authors: Nithin Gopalakrishnan Nair, Srinivas Kaza, Xuan Luo, Vishal M. Patel, Stephen Lombardi, Jungyeon Park
2025-09-08
Large transformer-based models have made significant progress in
generalizable novel view synthesis (NVS) from sparse input views, generating
novel viewpoints without the need for test-time optimization. However, these
models are constrained by the limited diversity of publicly available scene
datasets, making most real-world (in-the-wild) scenes out-of-distribution. To
overcome this, we incorporate synthetic training data generated from diffusion
models, which improves generalization across unseen domains. While synthetic
data offers scalability, we identify artifacts introduced during data
generation as a key bottleneck affecting reconstruction quality. To address
this, we propose a token disentanglement process within the transformer
architecture, enhancing feature separation and ensuring more effective
learning. This refinement not only improves reconstruction quality over
standard transformers but also enables scalable training with synthetic data.
As a result, our method outperforms existing models on both in-dataset and
cross-dataset evaluations, achieving state-of-the-art results across multiple
benchmarks while significantly reducing computational costs. Project page:
https://scaling3dnvs.github.io/
From Noise to Narrative Tracing the Origins of Hallucinations in Transformers
Authors: Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok
2025-09-08
As generative AI systems become competent and democratized in science,
business, and government, deeper insight into their failure modes now poses an
acute need. The occasional volatility in their behavior, such as the propensity
of models to hallucinate, impedes trust and adoption of emerging AI
solutions in high-stakes areas. In the present work, we establish how and when
hallucinations arise in pre-trained transformer models through concept
representations captured by sparse autoencoders, under scenarios with
experimentally controlled uncertainty in the input space. Our systematic
experiments reveal that the number of semantic concepts used by the transformer
model grows as the input information becomes increasingly unstructured. In the
face of growing uncertainty in the input space, the transformer model becomes
prone to activate coherent yet input-insensitive semantic features, leading to
hallucinated output. At its extreme, for pure-noise inputs, we identify a wide
variety of robustly triggered and meaningful concepts in the intermediate
activations of pre-trained transformer models, whose functional integrity we
confirm through targeted steering. We also show that hallucinations in the
output of a transformer model can be reliably predicted from the concept
patterns embedded in transformer layer activations. This collection of insights
on transformer internal processing mechanics has immediate consequences for
aligning AI models with human values, AI safety, opening the attack surface for
potential adversarial attacks, and providing a basis for automatic
quantification of a model's hallucination risk.
Barlow-Swin Toward a novel siamese-based segmentation architecture using Swin-Transformers
Authors: Morteza Kiani Haftlang, Mohammadhossein Malmir, Foroutan Parand, Umberto Michelucci, Safouane El Ghazouali
2025-09-08
Medical image segmentation is a critical task in clinical workflows,
particularly for the detection and delineation of pathological regions. While
convolutional architectures like U-Net have become standard for such tasks,
their limited receptive field restricts global context modeling. Recent efforts
integrating transformers have addressed this, but often result in deep,
computationally expensive models unsuitable for real-time use. In this work, we
present a novel end-to-end lightweight architecture designed specifically for
real-time binary medical image segmentation. Our model combines a Swin
Transformer-like encoder with a U-Net-like decoder, connected via skip pathways
to preserve spatial detail while capturing contextual information. Unlike
existing designs such as Swin Transformer or U-Net, our architecture is
significantly shallower and competitively efficient. To improve the encoder's
ability to learn meaningful features without relying on large amounts of
labeled data, we first train it using Barlow Twins, a self-supervised learning
method that helps the model focus on important patterns by reducing unnecessary
repetition in the learned features. After this pretraining, we fine-tune the
entire model for our specific task. Experiments on benchmark binary
segmentation tasks demonstrate that our model achieves competitive accuracy
with substantially reduced parameter count and faster inference, positioning it
as a practical alternative for deployment in real-time and resource-limited
clinical environments. The code for our method is available in the GitHub
repository: https://github.com/mkianih/Barlow-Swin.
COMPACT Common-token Optimized Model Pruning Across Channels and Tokens
Authors: Eugene Kwek, Wenpeng Yin
2025-09-08
Making LLMs more efficient in memory, latency, and serving cost is crucial
for edge deployment, interactive applications, and sustainable inference at
scale. Pruning is a key technique toward this goal. However, prior pruning
methods are limited: width pruning often breaks the standard transformer layout
or requires custom inference code, while depth pruning removes entire layers
and can cause abrupt accuracy drops. In this work, we propose COMPACT, which
jointly (i) prunes rare vocabulary to shrink embedding/unembedding and (ii)
prunes FFN intermediate channels using common-token-weighted activations,
aligning importance with the post-pruning token distribution. COMPACT enjoys
merits of both depth and width pruning, such as: deployment-friendliness (keeps
a standard transformer architecture), scale-adaptivity (trade off vocab vs. FFN
pruning), training-free operation with competitive pruning time, and strong
memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and
Gemma families (0.5B-70B) show state-of-the-art downstream task performance at
similar or higher pruning ratios, with substantial reductions in parameters,
GPU memory, and end-to-end latency.
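The two pruning steps named above can be sketched as follows. This is a minimal sketch under stated assumptions: the scoring rule (frequency-weighted mean absolute activation over the kept vocabulary) is a guess at what "common-token-weighted activations" could look like, not the paper's formula, and the token counts and activations are hypothetical.

```python
from typing import Dict, List, Set


def prune_vocabulary(token_counts: Dict[str, int], keep: int) -> Set[str]:
    """Step (i): keep the `keep` most frequent tokens, dropping rare ones."""
    ranked = sorted(token_counts, key=lambda t: (-token_counts[t], t))
    return set(ranked[:keep])


def channel_importance(activations: Dict[str, List[float]],
                       token_counts: Dict[str, int],
                       kept_tokens: Set[str]) -> List[float]:
    """Step (ii): score each FFN channel by |activation| weighted by token frequency."""
    total = sum(token_counts[t] for t in kept_tokens)
    n_channels = len(next(iter(activations.values())))
    scores = [0.0] * n_channels
    for token in kept_tokens:
        weight = token_counts[token] / total
        for j, value in enumerate(activations[token]):
            scores[j] += weight * abs(value)
    return scores


def select_channels(scores: List[float], keep: int) -> List[int]:
    """Return the indices of the `keep` highest-scoring channels, in index order."""
    top = sorted(range(len(scores)), key=lambda j: (-scores[j], j))[:keep]
    return sorted(top)


# Hypothetical corpus statistics and per-token FFN activations (3 channels).
token_counts = {"the": 90, "cat": 9, "zyx": 1}
activations = {"the": [1.0, 0.0, 0.2], "cat": [0.0, 1.0, 0.2], "zyx": [0.0, 0.0, 9.0]}
kept_tokens = prune_vocabulary(token_counts, keep=2)
scores = channel_importance(activations, token_counts, kept_tokens)
kept_channels = select_channels(scores, keep=2)
```

Because the rare token "zyx" is pruned first, its large activation on channel 2 no longer inflates that channel's score, which is the alignment with the post-pruning token distribution that the abstract mentions.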
Guided Decoding and Its Critical Role in Retrieval-Augmented Generation
Authors: Özgür Uğur, Musa Yılmaz, Esra Şavirdi, Özay Ezerceli, Mahmut El Huseyni, Selva Taş, Reyhan Bayraktar
2025-09-08
The integration of Large Language Models (LLMs) into various applications has
driven the need for structured and reliable responses. A key challenge in
Retrieval-Augmented Generation (RAG) systems is ensuring that outputs align
with expected formats while minimizing hallucinations. This study examines the
role of guided decoding in RAG systems, comparing three methods, Outlines,
XGrammar, and LM Format Enforcer, across different multi-turn prompting setups
(0-turn, 1-turn, and 2-turn). By evaluating success rates, hallucination rates,
and output quality, we provide insights into their performance and
applicability. Our findings reveal how multi-turn interactions influence guided
decoding, uncovering unexpected performance variations that can inform method
selection for specific use cases. This work advances the understanding of
structured output generation in RAG systems, offering both theoretical insights
and practical guidance for LLM deployment.
HAVE Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models
Authors: Xin Tong, Zhi Lin, Jingya Wang, Bo Jin
2025-09-08
Large Language Models (LLMs) often produce hallucinations in
retrieval-augmented or long-context generation, even when relevant evidence is
present. This stems from two issues: head importance is treated as
input-agnostic, and raw attention weights poorly reflect each token's true
contribution. We present HAVE (Head-Adaptive Gating and ValuE Calibration), a
parameter-free decoding framework that directly addresses both challenges. HAVE
introduces head-adaptive gating, which performs instance-level soft reweighting
of attention heads, and value calibration, which augments attention with the
magnitude of value vectors to approximate write-back contribution. Together,
these modules construct token-level evidence aligned with model updates and
fuse it with the LM distribution through a lightweight uncertainty-scaled
policy. HAVE requires no finetuning and operates in a single forward pass,
making it efficient and broadly applicable. Experiments across multiple QA
benchmarks and LLM families demonstrate that HAVE consistently reduces
hallucinations and outperforms strong baselines, including DAGCD, with modest
overhead. The framework is transparent, reproducible, and readily integrates
with off-the-shelf LLMs, advancing trustworthy generation in real-world
settings.
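The value-calibration idea above, augmenting attention with the magnitude of value vectors, can be illustrated with a small Python sketch. The specific rule used here (multiply each attention weight by the L2 norm of its value vector, then renormalize) is an assumption for illustration and may differ from the HAVE formulation; head-adaptive gating is not shown.

```python
import math
from typing import List


def value_calibrated_attention(attention: List[float],
                               values: List[List[float]]) -> List[float]:
    """Rescale each attention weight by its value vector's L2 norm, then renormalize."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in values]
    scores = [a * n for a, n in zip(attention, norms)]
    total = sum(scores)
    if total == 0.0:
        return list(attention)                # nothing to calibrate against
    return [s / total for s in scores]


attention = [0.5, 0.5]                        # hypothetical weights over two tokens
values = [[3.0, 4.0], [0.0, 0.0]]             # the second token writes nothing back
calibrated = value_calibrated_attention(attention, values)
```

Both tokens receive equal raw attention, but the token with a zero value vector contributes nothing to the output, so its calibrated evidence drops to zero.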
Reasoning-enhanced Query Understanding through Decomposition and Interpretation
Authors: Yunfei Zhong, Jun Yang, Yixing Fan, Jiafeng Guo, Lixin Su, Maarten de Rijke, Ruqing Zhang, Dawei Yin, Xueqi Cheng
2025-09-08
Accurate inference of user intent is crucial for enhancing document retrieval
in modern search engines. While large language models (LLMs) have made
significant strides in this area, their effectiveness has predominantly been
assessed with short, keyword-based queries. As AI-driven search evolves,
long-form queries with intricate intents are becoming more prevalent, yet they
remain underexplored in the context of LLM-based query understanding (QU). To
bridge this gap, we introduce ReDI: a Reasoning-enhanced approach for query
understanding through Decomposition and Interpretation. ReDI leverages the
reasoning and comprehension capabilities of LLMs in a three-stage pipeline: (i)
it breaks down complex queries into targeted sub-queries to accurately capture
user intent; (ii) it enriches each sub-query with detailed semantic
interpretations to improve the query-document matching; and (iii) it
independently retrieves documents for each sub-query and employs a fusion
strategy to aggregate the results for the final ranking. We compiled a
large-scale dataset of real-world complex queries from a major search engine
and distilled the query understanding capabilities of teacher models into
smaller models for practical application. Experiments on BRIGHT and BEIR
demonstrate that ReDI consistently surpasses strong baselines in both sparse
and dense retrieval paradigms, affirming its effectiveness.
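Stage (iii) above, retrieving per sub-query and fusing the results, can be sketched in Python. The abstract does not specify the fusion strategy, so this sketch swaps in reciprocal rank fusion (RRF), a common general-purpose choice; the constant k = 60 and the document identifiers are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several rankings: each document scores the sum of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings retrieved independently for two sub-queries.
per_subquery = [["d1", "d2", "d3"], ["d2", "d3", "d1"]]
fused = reciprocal_rank_fusion(per_subquery)
```

A document that ranks consistently well across sub-queries (here "d2") rises above one that tops a single list only.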
SLiNT Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion
Authors: Mengxue Yang, Chun Yang, Jiaqi Zhu, Jiafan Li, Jingqi Zhang, Yuyang Li, Ying Li
2025-09-08
Link prediction in knowledge graphs requires integrating structural
information and semantic context to infer missing entities. While large
language models offer strong generative reasoning capabilities, their limited
exploitation of structural signals often results in structural sparsity and
semantic ambiguity, especially under incomplete or zero-shot settings. To
address these challenges, we propose SLiNT (Structure-aware Language model with
Injection and coNtrastive Training), a modular framework that injects
knowledge-graph-derived structural context into a frozen LLM backbone with
lightweight LoRA-based adaptation for robust link prediction. Specifically,
Structure-Guided Neighborhood Enhancement (SGNE) retrieves pseudo-neighbors to
enrich sparse entities and mitigate missing context; Dynamic Hard Contrastive
Learning (DHCL) introduces fine-grained supervision by interpolating hard
positives and negatives to resolve entity-level ambiguity; and
Gradient-Decoupled Dual Injection (GDDI) performs token-level structure-aware
intervention while preserving the core LLM parameters. Experiments on WN18RR
and FB15k-237 show that SLiNT achieves superior or competitive performance
compared with both embedding-based and generation-based baselines,
demonstrating the effectiveness of structure-aware representation learning for
scalable knowledge graph completion.
Synesthesia of Machines (SoM)-Aided LiDAR Point Cloud Transmission for Collaborative Perception
Authors: Ensong Liu, Rongqing Zhang, Xiang Cheng, Jian Tang
2025-09-08
Collaborative perception enables more accurate and comprehensive scene
understanding by learning how to share information between agents, with LiDAR
point clouds providing essential precise spatial data. Due to the substantial
data volume generated by LiDAR sensors, efficient point cloud transmission is
essential for low-latency multi-agent collaboration. In this work, we propose
an efficient, robust and applicable LiDAR point cloud transmission system via
the Synesthesia of Machines (SoM), termed LiDAR Point Cloud Feature
Transmission (LPC-FT), to support collaborative perception among multiple
agents. Specifically, we employ a density-preserving deep point cloud
compression method that encodes the complete point cloud into a downsampled
efficient representation. To mitigate the effects of the wireless channel, we
design a channel encoder module based on self-attention to enhance LiDAR point
cloud features and a feature fusion module based on cross-attention to
integrate features from transceivers. Furthermore, we utilize the nonlinear
activation layer and transfer learning to improve the training of deep neural
networks in the presence of digital channel noise. Experimental results
demonstrate that the proposed LPC-FT is more robust and effective than
traditional octree-based compression followed by channel coding, and
outperforms state-of-the-art deep learning-based compression techniques and
existing semantic communication methods, reducing the Chamfer Distance by 30%
and improving the PSNR by 1.9 dB on average. Owing to its superior
reconstruction performance and robustness against channel variations, LPC-FT is
expected to support collaborative perception tasks.
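For readers unfamiliar with the Chamfer Distance reported above, a brute-force Python sketch follows. Conventions vary across papers (squared versus unsquared distances, sum versus mean); this version uses the mean squared nearest-neighbor distance in both directions, which may not match the paper's exact variant. The point clouds are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def _squared_distance(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def _one_sided(source: List[Point], target: List[Point]) -> float:
    """Mean squared distance from each source point to its nearest target point."""
    return sum(min(_squared_distance(p, q) for q in target) for p in source) / len(source)


def chamfer_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer Distance: sum of the two one-sided terms."""
    return _one_sided(a, b) + _one_sided(b, a)


original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]        # hypothetical transmitted cloud
reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]   # perfect reconstruction
shifted = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]         # every point off by one unit
perfect_score = chamfer_distance(original, reconstructed)
shifted_score = chamfer_distance(original, shifted)
```

The brute-force search is quadratic in the number of points; real LiDAR clouds would need a spatial index such as a k-d tree.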
Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
Authors: Ran Xin, Zeyu Zheng, Yanchen Nie, Kun Yuan, Xia Xiao
2025-09-08
The integration of Large Language Models (LLMs) into automated theorem
proving has shown immense promise, yet is fundamentally constrained by
challenges in scaling up both training-time reinforcement learning (RL) and
inference-time compute. This paper introduces BFS-Prover-V2, a system
designed to address this dual scaling problem. We present two primary
innovations. The first is a novel multi-turn off-policy RL framework for
continually improving the performance of an LLM step-prover at training time. This
framework, inspired by the principles of AlphaZero, utilizes a multi-stage
expert iteration pipeline featuring adaptive tactic-level data filtering and
periodic retraining to surmount the performance plateaus that typically curtail
long-term RL in LLM-based agents. The second innovation is a planner-enhanced
multi-agent search architecture that scales reasoning capabilities at inference
time. This architecture employs a general reasoning model as a high-level
planner to iteratively decompose complex theorems into a sequence of simpler
subgoals. This hierarchical approach substantially reduces the search space,
enabling a team of parallel prover agents to collaborate efficiently by
leveraging a shared proof cache. We demonstrate that this dual approach to
scaling yields state-of-the-art results on established formal mathematics
benchmarks. BFS-Prover-V2 achieves 95.08% and 41.4% on the MiniF2F
and ProofNet test sets respectively. While demonstrated in the domain of formal
mathematics, the RL and inference techniques presented in this work are of
broader interest and may be applied to other domains requiring long-horizon
multi-turn reasoning and complex search.
HyFedRAG A Federated Retrieval-Augmented Generation Framework for Heterogeneous and Privacy-Sensitive Data
Authors: Cheng Qian, Hainan Zhang, Yongxin Tong, Hong-Wei Zheng, Zhiming Zheng
2025-09-08
Centralized RAG pipelines struggle with heterogeneous and privacy-sensitive
data, especially in distributed healthcare settings where patient data spans
SQL, knowledge graphs, and clinical notes. Clinicians face difficulties
retrieving rare disease cases due to privacy constraints and the limitations of
traditional cloud-based RAG systems in handling diverse formats and edge
devices. To address this, we introduce HyFedRAG, a unified and efficient
Federated RAG framework tailored for Hybrid data modalities. By leveraging an
edge-cloud collaborative mechanism, HyFedRAG enables RAG to operate across
diverse data sources while preserving data privacy. Our key contributions are:
(1) We design an edge-cloud collaborative RAG framework built on Flower, which
supports querying structured SQL data, semi-structured knowledge graphs, and
unstructured documents. The edge-side LLMs convert diverse data into
standardized privacy-preserving representations, and the server-side LLMs
integrate them for global reasoning and generation. (2) We integrate
lightweight local retrievers with privacy-aware LLMs and provide three
anonymization tools that enable each client to produce semantically rich,
de-identified summaries for global inference across devices. (3) To optimize
response latency and reduce redundant computation, we design a three-tier
caching strategy consisting of a local cache, an intermediate representation
cache, and a cloud inference cache. Experimental results on PMC-Patients demonstrate
that HyFedRAG outperforms existing baselines in terms of retrieval quality,
generation consistency, and system efficiency. Our framework offers a scalable
and privacy-compliant solution for RAG over structural-heterogeneous data,
unlocking the potential of LLMs in sensitive and diverse data environments.
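A minimal Python sketch of how a three-tier lookup of this kind might be wired together. The abstract only names the tiers; the cache keys, the `ThreeTierCache` class, and the `retrieve`, `summarize`, and `generate` callables are all hypothetical placeholders, not HyFedRAG's actual interfaces.

```python
class ThreeTierCache:
    """Hypothetical three-tier cache: local retrieval, intermediate
    representation, and cloud inference results."""

    def __init__(self, retrieve, summarize, generate):
        self.retrieve = retrieve      # edge-side retriever (placeholder)
        self.summarize = summarize    # edge-side de-identified summary (placeholder)
        self.generate = generate      # cloud-side generation (placeholder)
        self.local = {}               # query -> retrieved documents
        self.intermediate = {}        # documents -> summary
        self.cloud = {}               # query -> final answer
        self.calls = {"retrieve": 0, "summarize": 0, "generate": 0}

    def _cached(self, store, key, name, fn, *args):
        # Run fn only on a cache miss and count how often that happens.
        if key not in store:
            self.calls[name] += 1
            store[key] = fn(*args)
        return store[key]

    def answer(self, query):
        # Tier 3: a repeated query is served straight from the cloud cache.
        if query in self.cloud:
            return self.cloud[query]
        # Tier 1: reuse local retrieval results for this query.
        docs = self._cached(self.local, query, "retrieve", self.retrieve, query)
        # Tier 2: reuse the summary when the same documents come back.
        summary = self._cached(self.intermediate, tuple(docs), "summarize",
                               self.summarize, docs)
        self.calls["generate"] += 1
        self.cloud[query] = self.generate(query, summary)
        return self.cloud[query]


cache = ThreeTierCache(
    retrieve=lambda q: ["note-1", "note-2"],
    summarize=lambda docs: f"{len(docs)} de-identified notes",
    generate=lambda q, s: f"answer to {q!r} from {s}",
)
first = cache.answer("rare disease X")
second = cache.answer("rare disease X")   # served from the cloud tier
third = cache.answer("similar cases")     # reuses the intermediate summary
```

In this toy run the second call does no work at all, and the third call skips summarization because retrieval returns the same documents.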
Tree of Agents Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning
Authors: Song Yu, Xiaofei Xu, Ke Deng, Li Li, Lin Tian
2025-09-08
Large language models (LLMs) face persistent challenges when handling
long-context tasks, most notably the "lost in the middle" issue, where
information located in the middle of a long input tends to be underutilized.
Some existing methods that reduce the input risk discarding key
information, while others that extend context windows often lead to attention
dispersion. To address these limitations, we propose Tree of Agents (TOA), a
multi-agent reasoning framework that segments the input into chunks processed
by independent agents. Each agent generates its local cognition, then agents
dynamically exchange information for collaborative reasoning along
tree-structured paths. TOA enables agents to probe different reasoning orders
for multi-perspective understanding, effectively mitigating position bias and
reducing hallucinations. To improve processing efficiency, we incorporate
prefix-hash caching and adaptive pruning strategies, achieving significant
performance improvements with comparable API overhead. Experiments show that
TOA, powered by compact LLaMA3.1-8B, significantly outperforms multiple
baselines and demonstrates comparable performance to the latest and much larger
commercial models, such as Gemini1.5-pro, on various long-context tasks. Code
is available at https://github.com/Aireduce952/Tree-of-Agents.
NeuroDeX Unlocking Diverse Support in Decompiling Deep Neural Network Executables
Authors: Yilin Li, Guozhu Meng, Mingyang Sun, Yanzhong Wang, Kun Sun, Hailong Chang, Yuekang Li
2025-09-08
On-device deep learning models have extensive real world demands. Deep
learning compilers efficiently compile models into executables for deployment
on edge devices, but these executables may face the threat of reverse
engineering. Previous studies have attempted to decompile DNN executables, but
they face challenges in handling compilation optimizations and analyzing
quantized compiled models. In this paper, we present NeuroDeX to unlock diverse
support in decompiling DNN executables. NeuroDeX leverages the semantic
understanding capabilities of LLMs along with dynamic analysis to accurately
and efficiently perform operator type recognition, operator attribute recovery
and model reconstruction. NeuroDeX can recover DNN executables into high-level
models across compilation optimizations, different architectures, and quantized
compiled models. We conduct experiments on 96 DNN executables across 12 common
DNN models. Extensive experimental results demonstrate that NeuroDeX can
decompile non-quantized executables into nearly identical high-level models.
NeuroDeX can recover functionally similar high-level models for quantized
executables, achieving an average top-1 accuracy of 72%. NeuroDeX offers a more
comprehensive and effective solution compared to previous DNN executable
decompilers.
Mask-GCG Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?
Authors: Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang
2025-09-08
Jailbreak attacks on Large Language Models (LLMs) have demonstrated various
successful methods whereby attackers manipulate models into generating harmful
responses that they are designed to avoid. Among these, Greedy Coordinate
Gradient (GCG) has emerged as a general and effective approach that optimizes
the tokens in a suffix to generate jailbreakable prompts. While several
improved variants of GCG have been proposed, they all rely on fixed-length
suffixes. However, the potential redundancy within these suffixes remains
unexplored. In this work, we propose Mask-GCG, a plug-and-play method that
employs learnable token masking to identify impactful tokens within the suffix.
Our approach increases the update probability for tokens at high-impact
positions while pruning those at low-impact positions. This pruning not only
reduces redundancy but also decreases the size of the gradient space, thereby
lowering computational overhead and shortening the time required to achieve
successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the
original GCG and several improved variants. Experimental results show that most
tokens in the suffix contribute significantly to attack success, and pruning a
minority of low-impact tokens does not affect the loss values or compromise the
attack success rate (ASR), thereby revealing token redundancy in LLM prompts.
Our findings provide insights for developing efficient and interpretable LLMs
from the perspective of jailbreak attacks.
A Geometric Multigrid-Accelerated Compact Gas-Kinetic Scheme for Fast Convergence in High-Speed Flows on GPUs
Authors: Hongyu Liu, Xing Ji, Yuan Fu, Kun Xu
2025-09-08
Implicit methods and GPU parallelization are two distinct yet powerful
strategies for accelerating high-order CFD algorithms. However, few studies
have successfully integrated both approaches within high-speed flow solvers.
The core challenge lies in preserving the robustness of implicit algorithms in
the presence of strong discontinuities, while simultaneously enabling massive
thread parallelism under the constraints of limited GPU memory. To address
this, we propose a GPU-optimized, geometric multigrid-accelerated, high-order
compact gas kinetic scheme (CGKS) that incorporates three key innovations:
(1) a multi-color lower-upper symmetric Gauss-Seidel scheme that eliminates
thread conflicts and preserves memory efficiency, serving as an implicit
smoother on coarse grids; (2) a discontinuity-adaptive relaxation technique and
a multigrid prolongation process, based on a discontinuous feedback factor,
which dynamically stabilize shock regions without compromising convergence in
smooth zones; and (3) a three-layer V-cycle geometric parallel multigrid
strategy specifically tailored for unstructured meshes. Extensive tests on
multi-dimensional subsonic to hypersonic flows demonstrate that our GPU-based
high-performance solver achieves one to two orders of magnitude faster
convergence compared to previous explicit solvers. More importantly, it
preserves the shock-capturing robustness of the explicit CGKS and exhibits
strong scalability on GPU architectures. This work presents a unified framework
that synergistically leverages implicit acceleration and GPU optimization for
high-speed flow simulations, effectively overcoming traditional trade-offs
between parallelism, memory constraints, and numerical stability in high-order
methods.
Ban&Pick Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs
Authors: Yuanteng Chen, Peisong Wang, Yuantian Shao, Jian Cheng
2025-09-08
Sparse Mixture-of-Experts (MoE) has become a key architecture for scaling
large language models (LLMs) efficiently. Recent fine-grained MoE designs
introduce hundreds of experts per layer, with multiple experts activated per
token, enabling stronger specialization. However, during pre-training, routers
are optimized mainly for stability and robustness: they converge prematurely
and enforce balanced usage, limiting the full potential of model performance
and efficiency. In this work, we uncover two overlooked issues: (i) a few
highly influential experts are underutilized due to premature and balanced
routing decisions; and (ii) enforcing a fixed number of active experts per
token introduces substantial redundancy. Instead of retraining models or
redesigning MoE architectures, we introduce Ban&Pick, a post-training,
plug-and-play strategy for smarter MoE routing. Pick discovers and reinforces
key experts, a small group with outsized impact on performance, leading to
notable accuracy gains across domains. Ban complements this by dynamically
pruning redundant experts based on layer and token sensitivity, delivering
faster inference with minimal accuracy loss. Experiments on fine-grained
MoE-LLMs (DeepSeek, Qwen3) across math, code, and general reasoning benchmarks
demonstrate that Ban&Pick delivers free performance gains and inference
acceleration without retraining or architectural changes. For instance, on
Qwen3-30B-A3B, it improves accuracy from 80.67 to 84.66 on AIME2024 and from
65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under
vLLM.
Towards scalable organ level 3D plant segmentation Bridging the data algorithm computing gap
Authors: Ruiming Du, Guangxun Zhai, Tian Qiu, Yu Jiang
2025-09-08
The precise characterization of plant morphology provides valuable insights
into plant-environment interactions and genetic evolution. A key technology for
extracting this information is 3D segmentation, which delineates individual
plant organs from complex point clouds. Despite significant progress in general
3D computer vision domains, the adoption of 3D segmentation for plant
phenotyping remains limited by three major challenges: i) the scarcity of
large-scale annotated datasets, ii) technical difficulties in adapting advanced
deep neural networks to plant point clouds, and iii) the lack of standardized
benchmarks and evaluation protocols tailored to plant science. This review
systematically addresses these barriers by: i) providing an overview of
existing 3D plant datasets in the context of general 3D segmentation domains,
ii) systematically summarizing deep learning-based methods for point cloud
semantic and instance segmentation, iii) introducing Plant Segmentation Studio
(PSS), an open-source framework for reproducible benchmarking, and iv)
conducting extensive quantitative experiments to evaluate representative
networks and sim-to-real learning strategies. Our findings highlight the
efficacy of convolutional backbones and transformer-based instance
segmentation, while also emphasizing the complementary role of modeling-based
and augmentation-based synthetic data generation for sim-to-real learning in
reducing annotation demands. In general, this study bridges the gap between
algorithmic advances and practical deployment, providing immediate tools for
researchers and a roadmap for developing data-efficient and generalizable deep
learning solutions in 3D plant phenotyping. Data and code are available at
https://github.com/perrydoremi/PlantSegStudio.
Text4Seg++ Advancing Image Segmentation via Generative Language Modeling
Authors: Mengcheng Lan, Chaofeng Chen, Jiaxing Xu, Zongrui Li, Yiping Ke, Xudong Jiang, Yingchen Yu, Yunqing Zhao, Song Bai
2025-09-08
Multimodal Large Language Models (MLLMs) have shown exceptional capabilities
in vision-language tasks. However, effectively integrating image segmentation
into these models remains a significant challenge. In this work, we propose a
novel text-as-mask paradigm that casts image segmentation as a text generation
problem, eliminating the need for additional decoders and significantly
simplifying the segmentation process. Our key innovation is semantic
descriptors, a new textual representation of segmentation masks where each
image patch is mapped to its corresponding text label. We first introduce
image-wise semantic descriptors, a patch-aligned textual representation of
segmentation masks that integrates naturally into the language modeling
pipeline. To enhance efficiency, we introduce the Row-wise Run-Length Encoding
(R-RLE), which compresses redundant text sequences, reducing the length of
semantic descriptors by 74% and accelerating inference by , without
compromising performance. Building upon this, our initial framework Text4Seg
achieves strong segmentation performance across a wide range of vision tasks.
To further improve granularity and compactness, we propose box-wise semantic
descriptors, which localize regions of interest using bounding boxes and
represent region masks via structured mask tokens called semantic bricks. This
leads to our refined model, Text4Seg++, which formulates segmentation as a
next-brick prediction task, combining precision, scalability, and generative
efficiency. Comprehensive experiments on natural and remote sensing datasets
show that Text4Seg++ consistently outperforms state-of-the-art models across
diverse benchmarks without any task-specific fine-tuning, while remaining
compatible with existing MLLM backbones. Our work highlights the effectiveness,
scalability, and generalizability of text-driven image segmentation within the
MLLM framework.
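As a rough illustration of the row-wise run-length idea described above, the following Python sketch compresses a grid of per-patch labels one row at a time. The `label*count` text format, the ` | ` row separator, and the function names are assumptions made for illustration, not the paper's actual encoding.

```python
from itertools import groupby


def rle_row(row):
    """Run-length encode one row of patch labels as 'label*count' items."""
    return ", ".join(f"{label}*{len(list(run))}" for label, run in groupby(row))


def encode_rows(grid):
    """Encode each row independently and join rows with ' | ' (hypothetical format)."""
    return " | ".join(rle_row(row) for row in grid)


def decode_rows(text):
    """Invert encode_rows back into a grid of labels."""
    grid = []
    for row_text in text.split(" | "):
        row = []
        for item in row_text.split(", "):
            label, count = item.rsplit("*", 1)
            row.extend([label] * int(count))
        grid.append(row)
    return grid


# A toy 3x4 patch grid: each cell holds the text label of one image patch.
grid = [
    ["sky", "sky", "sky", "sky"],
    ["sky", "tree", "tree", "sky"],
    ["grass", "grass", "grass", "grass"],
]
encoded = encode_rows(grid)
print(encoded)  # sky*4 | sky*1, tree*2, sky*1 | grass*4
```

Encoding each row separately keeps runs from wrapping across row boundaries, so the 2D layout can be recovered from the text alone.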
LoaQ Layer-wise Output Approximation Quantization
Authors: Li Lin, Xiaojun Wan
2025-09-08
A natural and intuitive idea in model quantization is to approximate each
component's quantized output to match its original. Layer-wise post-training
quantization (PTQ), though based on this idea, adopts a strictly local view and
can achieve, at best, only activation-aware approximations of weights. As a
result, it often leads to insufficient approximations and practical deviations
from this guiding intuition. Recent work has achieved a more accurate
approximation of linear-layer outputs within the framework of layer-wise PTQ,
but such refinements remain inadequate for achieving alignment with the full
model output. Based on a deeper understanding of the structural characteristics
of mainstream LLMs, we propose LoaQ, an output-approximation method for
layer-wise PTQ that explicitly targets output-level consistency. It better
aligns with this intuition and admits a simple closed-form solution,
making it orthogonal to existing techniques and readily integrable into
existing quantization pipelines. Experiments on the LLaMA and Qwen model
families demonstrate that LoaQ performs effectively in both weight-only and
weight-activation joint quantization. By integrating seamlessly with existing
quantization strategies, it further enhances overall quantization quality and
shows strong potential to advance the frontier of post-training quantization.
RecMind LLM-Enhanced Graph Neural Networks for Personalized Consumer Recommendations
Authors: Chang Xue, Youwei Lu, Chen Yang, Jinming Xing
2025-09-08
Personalization is a core capability across consumer technologies (streaming,
shopping, wearables, and voice), yet it remains challenged by sparse
interactions, fast content churn, and heterogeneous textual signals. We present
RecMind, an LLM-enhanced graph recommender that treats the language model as a
preference prior rather than a monolithic ranker. A frozen LLM equipped with
lightweight adapters produces text-conditioned user/item embeddings from
titles, attributes, and reviews; a LightGCN backbone learns collaborative
embeddings from the user-item graph. We align the two views with a symmetric
contrastive objective and fuse them via intra-layer gating, allowing language
to dominate in cold/long-tail regimes and graph structure to stabilize rankings
elsewhere. On Yelp and Amazon-Electronics, RecMind attains the best results on
all eight reported metrics, with relative improvements up to +4.53%
(Recall@40) and +4.01% (NDCG@40) over strong baselines. Ablations confirm both
the necessity of cross-view alignment and the advantage of gating over late
fusion and LLM-only variants.
FineServe Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
Authors: Kyungmin Bin, Seungbeom Choi, Jimyoung Son, Jieun Choi, Daseul Bae, Daehyeon Baek, Kihyo Moon, Minsung Jang, Hyojung Lee
2025-09-08
Recent advances in Post-Training Quantization (PTQ) techniques have
significantly increased demand for serving quantized large language models
(LLMs), enabling higher throughput and substantially reduced memory usage with
minimal accuracy loss. Quantized models address memory constraints in LLMs and
enhance GPU resource utilization through efficient GPU sharing. However,
quantized models have smaller KV block sizes than non-quantized models, causing
limited memory efficiency due to memory fragmentation. Also, distinct resource
usage patterns between quantized and non-quantized models require efficient
scheduling to maximize throughput. To address these challenges, we propose
FineServe, an inference serving framework for mixed-precision LLMs. FineServe's
key contributions include: (1) KV Slab, a precision-aware adaptive memory
management technique dynamically allocating KV cache based on model
characteristics, significantly reducing GPU memory fragmentation,
and (2) a two-level scheduling framework comprising a global scheduler that
places models on GPUs based on request rates, latency SLOs, and memory
constraints and efficiency, and a local scheduler that adaptively adjusts batch
sizes according to real-time request fluctuations. Experimental results
demonstrate that FineServe achieves up to 2.2x higher SLO attainment and 1.8x
higher token generation throughput compared to the state-of-the-art GPU sharing
systems.
Understanding the Influence of Synthetic Data for Text Embedders
Authors: Jacob Mitchell Springer, Vaibhav Adlakha, Siva Reddy, Aditi Raghunathan, Marius Mosbach
2025-09-07
Recent progress in developing general purpose text embedders has been driven
by training on ever-growing corpora of synthetic LLM-generated data.
Nonetheless, no publicly available synthetic dataset exists, posing a barrier
to studying its role for generalization. To address this issue, we first
reproduce and publicly release the synthetic data proposed by Wang et al.
(Mistral-E5). Our synthetic data is high quality and leads to consistent
improvements in performance. Next, we critically examine where exactly
synthetic data improves model generalization. Our analysis reveals that
benefits from synthetic data are sparse and highly localized to individual
datasets. Moreover, we observe trade-offs between the performance on different
categories: data that benefits one task degrades performance on another.
Our findings highlight the limitations of current synthetic data approaches for
building general-purpose embedders and challenge the notion that training on
synthetic data leads to more robust embedding models across tasks.
Home-made Diffusion Model from Scratch to Hatch
Authors: Shih-Ying Yeh
2025-09-07
We introduce Home-made Diffusion Model (HDM), an efficient yet powerful
text-to-image diffusion model optimized for training (and inferring) on
consumer-grade hardware. HDM achieves competitive 1024x1024 generation quality
while maintaining a remarkably low training cost of $535-620 using four RTX5090
GPUs, representing a significant reduction in computational requirements
compared to traditional approaches. Our key contributions include: (1)
Cross-U-Transformer (XUT), a novel U-shaped transformer that employs
cross-attention for skip connections, providing superior
feature integration that leads to remarkable compositional consistency; (2) a
comprehensive training recipe that incorporates TREAD acceleration, a novel
shifted square crop strategy for efficient arbitrary aspect-ratio training, and
progressive resolution scaling; and (3) an empirical demonstration that smaller
models (343M parameters) with carefully crafted architectures can achieve
high-quality results and emergent capabilities, such as intuitive camera
control. Our work provides an alternative paradigm of scaling, demonstrating a
viable path toward democratizing high-quality text-to-image generation for
individual researchers and smaller organizations with limited computational
resources.
1 bit is all we need binary normalized neural networks
Authors: Eduardo Lobo Lustoda Cabral, Paulo Pirozelli, Larissa Driemeier
2025-09-07
The increasing size of large neural network models, specifically language
models and foundational image models, poses deployment challenges, prompting
efforts to reduce memory requirements and enhance computational efficiency.
These efforts are critical to ensure practical deployment and effective
utilization of these models across various applications. In this work, a novel
type of neural network layer and model is developed that uses only single-bit
parameters. In these models, all parameters of all layers,
including kernel weights and biases, take only the values zero or one.
The models are built from layers called binary normalized layers. These
binary normalized layers can be of any type, such as fully connected,
convolutional, or attention, and they consist of slight variations of the
corresponding conventional layers. To show the effectiveness of the binary
normalized layers, two different models are configured: one to solve a
multiclass image classification problem and a language decoder to predict the
next token of a sequence. The image classification model has convolutional
and fully connected layers, and the language model is composed of transformer
blocks with multi-head attention. The results show that models with binary
normalized layers achieve almost the same results as equivalent models
with real 32-bit parameters. The binary normalized layers make it possible to
develop models that use 32 times less memory than current models with
equivalent performance. Moreover, the binary normalized layers can be easily
implemented on current computers using 1-bit arrays and do not require the
development of dedicated electronic hardware. This type of layer opens a new
era for large neural network models with reduced memory requirements that can
be deployed using simple and cheap hardware, such as mobile devices or
CPU-only machines.
A Unified Framework for Cultural Heritage Data Historicity and Migration The ARGUS Approach
Authors: Lingxiao Kong, Apostolos Sarris, Miltiadis Polidorou, Victor Klingenberg, Vasilis Sevetlidis, Vasilis Arampatzakis, George Pavlidis, Cong Yang, Zeyd Boukhers
2025-09-07
Cultural heritage preservation faces significant challenges in managing
diverse, multi-source, and multi-scale data for effective monitoring and
conservation. This paper documents a comprehensive data historicity and
migration framework implemented within the ARGUS project, which addresses the
complexities of processing heterogeneous cultural heritage data. We describe a
systematic data processing pipeline encompassing standardization, enrichment,
integration, visualization, ingestion, and publication strategies. The
framework transforms raw, disparate datasets into standardized formats
compliant with FAIR principles. It enhances datasets through established
imputation techniques, ensures interoperability through database integration,
and improves querying capabilities through LLM-powered natural language
processing. This approach has been applied across five European pilot sites
with varying preservation challenges, demonstrating its adaptability to diverse
cultural heritage contexts. The implementation results show improved data
accessibility, enhanced analytical capabilities, and more effective
decision-making for conservation efforts.
Micro-Expression Recognition via Fine-Grained Dynamic Perception
Authors: Zhiwen Shao, Yifan Cheng, Fan Zhang, Xuehuai Shi, Canlin Li, Lizhuang Ma, Dit-yan Yeung
2025-09-07
Facial micro-expression recognition (MER) is a challenging task, due to the
transience, subtlety, and dynamics of micro-expressions (MEs). Most existing
methods resort to hand-crafted features or deep networks, in which the former
often additionally requires key frames, and the latter suffers from small-scale
and low-diversity training data. In this paper, we develop a novel fine-grained
dynamic perception (FDP) framework for MER. We propose to rank frame-level
features of a sequence of raw frames in chronological order, in which the rank
process encodes the dynamic information of both ME appearances and motions.
Specifically, a novel local-global feature-aware transformer is proposed for
frame representation learning. A rank scorer is further adopted to calculate
rank scores of each frame-level feature. Afterwards, the rank features from
rank scorer are pooled in temporal dimension to capture dynamic representation.
Finally, the dynamic representation is shared by a MER module and a dynamic
image construction module, in which the former predicts the ME category, and
the latter uses an encoder-decoder structure to construct the dynamic image.
The design of dynamic image construction task is beneficial for capturing
facial subtle actions associated with MEs and alleviating the data scarcity
issue. Extensive experiments show that our method (i) significantly outperforms
the state-of-the-art MER methods, and (ii) works well for dynamic image
construction. Particularly, our FDP improves by 4.05%, 2.50%, 7.71%, and 2.11%
over the previous best results in terms of F1-score on the CASME II, SAMM,
CAS(ME)^2, and CAS(ME)^3 datasets, respectively. The code is available at
https://github.com/CYF-cuber/FDP.
MEGS Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning
Authors: Jiarui Chen, Yikeng Chen, Yingshuang Zou, Ye Huang, Peng Wang, Yuan Liu, Yujing Sun, Wenping Wang
2025-09-07
3D Gaussian Splatting (3DGS) has emerged as a dominant novel-view synthesis
technique, but its high memory consumption severely limits its applicability on
edge devices. A growing number of 3DGS compression methods have been proposed
to make 3DGS more efficient, yet most only focus on storage compression and
fail to address the critical bottleneck of rendering memory. To address this
problem, we introduce MEGS, a novel memory-efficient framework that
tackles this challenge by jointly optimizing two key factors: the total
primitive number and the parameters per primitive, achieving unprecedented
memory compression. Specifically, we replace the memory-intensive spherical
harmonics with lightweight arbitrarily-oriented spherical Gaussian lobes as our
color representations. More importantly, we propose a unified soft pruning
framework that models primitive-number and lobe-number pruning as a single
constrained optimization problem. Experiments show that MEGS achieves a
50% static VRAM reduction and a 40% rendering VRAM reduction compared to
existing methods, while maintaining comparable rendering quality.
Application Space and the Rate-Distortion-Complexity Analysis of Neural Video CODECs
Authors: Ricardo L. de Queiroz, Diogo C. Garcia, Yi-Hsin Chen, Ruhan Conceição, Wen-Hsiao Peng, Luciano V. Agostini
2025-09-07
We study the decision-making process for choosing video compression systems
through a rate-distortion-complexity (RDC) analysis. We discuss the 2D
Bjontegaard delta (BD) metric and formulate generalizations in an attempt to
extend its notions to the 3D RDC volume. We follow that discussion with another
one on the computation of metrics in the RDC volume, and on how to define and
measure the cost of a coder-decoder (codec) pair, where the codec is
characterized by a cloud of points in the RDC space. We use a Lagrangian cost,
such that choosing the best video codec among a
number of candidates for an application demands selecting appropriate
multiplier values. Thus, we argue that an application may be
associated with a point in the application space. An
example streaming application is given as a case study to set a particular
point in the multiplier plane. The result is that we can compare
Lagrangian costs in an RDC volume for different codecs for a given application.
Furthermore, we can span the plane and compare codecs for the entire
application space filled with different multiplier choices. We then
compare several state-of-the-art neural video codecs using the proposed
metrics. Results are informative and surprising. We found that, within our RDC
computation constraints, only four neural video codecs came out as the best
suited for any application, depending on where its desired operating point lies.
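To make the selection step concrete, here is a small Python sketch under stated assumptions. The abstract does not give the cost formula in this digest, so the form `distortion + lam * rate + gamma * complexity`, the rule that a codec's cost is the minimum over its cloud of points, and all names and numbers below are hypothetical.

```python
def lagrangian_cost(point, lam, gamma):
    """Assumed cost form: distortion + lam * rate + gamma * complexity."""
    rate, distortion, complexity = point
    return distortion + lam * rate + gamma * complexity


def best_codec(codecs, lam, gamma):
    """Score each codec by its cheapest operating point, then pick the minimum."""
    costs = {
        name: min(lagrangian_cost(p, lam, gamma) for p in points)
        for name, points in codecs.items()
    }
    winner = min(costs, key=costs.get)
    return winner, costs


# Each codec is a cloud of (rate, distortion, complexity) points; toy values.
codecs = {
    "codec_a": [(1.0, 4.0, 10.0), (2.0, 2.0, 10.0)],   # cheap to run
    "codec_b": [(1.0, 3.0, 50.0), (2.0, 1.5, 50.0)],   # better RD, costlier
}

# An application that ignores complexity versus one that penalizes it.
quality_first, _ = best_codec(codecs, lam=1.0, gamma=0.0)
compute_limited, _ = best_codec(codecs, lam=1.0, gamma=0.1)
```

With these toy numbers the complexity-insensitive application picks `codec_b` and the compute-limited one picks `codec_a`, which is the sense in which the preferred codec depends on where the application sits in the multiplier plane.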
Physics-Guided Diffusion Transformer with Spherical Harmonic Posterior Sampling for High-Fidelity Angular Super-Resolution in Diffusion MRI
Authors: Mu Nan, Taohui Xiao, Ruoyou Wu, Shoujun Yu, Ye Li, Hairong Zheng, Shanshan Wang
2025-09-07
Diffusion MRI (dMRI) angular super-resolution (ASR) aims to reconstruct
high-angular-resolution (HAR) signals from limited low-angular-resolution (LAR)
data without prolonging scan time. However, existing methods are limited in
recovering fine-grained angular details or preserving high fidelity due to
inadequate modeling of q-space geometry and insufficient incorporation of
physical constraints. In this paper, we introduce a Physics-Guided Diffusion
Transformer (PGDiT) designed to explore physical priors throughout both
training and inference stages. During training, a Q-space Geometry-Aware Module
(QGAM) with b-vector modulation and random angular masking facilitates
direction-aware representation learning, enabling the network to generate
directionally consistent reconstructions with fine angular details from sparse
and noisy data. In inference, a two-stage Spherical Harmonics-Guided Posterior
Sampling (SHPS) enforces alignment with the acquired data, followed by
heat-diffusion-based SH regularization to ensure physically plausible
reconstructions. This coarse-to-fine refinement strategy mitigates
oversmoothing and artifacts commonly observed in purely data-driven or
generative models. Extensive experiments on general ASR tasks and two
downstream applications, Diffusion Tensor Imaging (DTI) and Neurite Orientation
Dispersion and Density Imaging (NODDI), demonstrate that PGDiT outperforms
existing deep learning models in detail recovery and data fidelity. Our
approach presents a novel generative ASR framework that offers high-fidelity
HAR dMRI reconstructions, with potential applications in neuroscience and
clinical research.
Beyond I'm Sorry, I Can't Dissecting Large Language Model Refusal
Authors: Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee
2025-09-07
Refusal on harmful prompts is a key safety behaviour in instruction-tuned
large language models (LLMs), yet the internal causes of this behaviour remain
poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT
and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on
residual-stream activations. Given a harmful prompt, we search the SAE latent
space for feature sets whose ablation flips the model from refusal to
compliance, demonstrating causal influence and creating a jailbreak. Our search
proceeds in three stages: (1) Refusal Direction: find a refusal-mediating
direction and collect SAE features near that direction; (2) Greedy Filtering:
prune to a minimal set; and (3) Interaction Discovery: fit a factorization
machine (FM) that captures nonlinear interactions among the remaining active
features and the minimal set. This pipeline yields a broad set of
jailbreak-critical features, offering insight into the mechanistic basis of
refusal. Moreover, we find evidence of redundant features that remain dormant
unless earlier features are suppressed. Our findings highlight the potential
for fine-grained auditing and targeted intervention in safety behaviours by
manipulating the interpretable latent space.
Chatbot To Help Patients Understand Their Health
Authors: Won Seok Jang, Hieu Tran, Manav Mistry, SaiKiran Gandluri, Yifan Zhang, Sharmin Sultana, Sunjae Kown, Yuan Zhang, Zonghai Yao, Hong Yu
2025-09-06
Patients must possess the knowledge necessary to actively participate in
their care. We present NoteAid-Chatbot, a conversational AI that promotes
patient understanding via a novel 'learning as conversation' framework, built
on a multi-agent large language model (LLM) and reinforcement learning (RL)
setup without human-labeled data. NoteAid-Chatbot was built on a lightweight
LLaMA 3.2 3B model trained in two stages: initial supervised fine-tuning on
conversational data synthetically generated using medical conversation
strategies, followed by RL with rewards derived from patient understanding
assessments in simulated hospital discharge scenarios. Our evaluation, which
includes comprehensive human-aligned assessments and case studies, demonstrates
that NoteAid-Chatbot exhibits key emergent behaviors critical for patient
education, such as clarity, relevance, and structured dialogue, even though it
received no explicit supervision for these attributes. Our results show that
even simple Proximal Policy Optimization (PPO)-based reward modeling can
successfully train lightweight, domain-specific chatbots to handle multi-turn
interactions, incorporate diverse educational strategies, and meet nuanced
objectives. Our Turing test demonstrates that NoteAid-Chatbot
surpasses non-expert humans. Although our current focus is on healthcare, the
framework we present illustrates the feasibility and promise of applying
low-cost, PPO-based RL to realistic, open-ended conversational domains,
broadening the applicability of RL-based alignment methods.
time2time Causal Intervention in Hidden States to Simulate Rare Events in Time Series Foundation Models
Authors: Debdeep Sanyal, Aaryan Nagpal, Dhruv Kumar, Murari Mandal, Saurabh Deshpande
2025-09-06
While transformer-based foundation models excel at forecasting routine
patterns, two questions remain: do they internalize semantic concepts such as
market regimes, or merely fit curves? And can their internal representations be
leveraged to simulate rare, high-stakes events such as market crashes? To
investigate this, we introduce activation transplantation, a causal
intervention that manipulates hidden states by imposing the statistical moments
of one event (e.g., a historical crash) onto another (e.g., a calm period)
during the forward pass. This procedure deterministically steers forecasts:
injecting crash semantics induces downturn predictions, while injecting calm
semantics suppresses crashes and restores stability. Beyond binary control, we
find that models encode a graded notion of event severity, with the latent
vector norm directly correlating with the magnitude of systemic shocks.
Validated across two architecturally distinct TSFMs, Toto (decoder only) and
Chronos (encoder-decoder), our results demonstrate that steerable, semantically
grounded representations are a robust property of large time series
transformers. Our findings provide evidence for a latent concept space that
governs model predictions, shifting interpretability from post-hoc attribution
to direct causal intervention, and enabling semantic "what-if" analysis for
strategic stress-testing.
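The moment-imposing step described above can be illustrated with a minimal sketch. It assumes that "statistical moments" means the mean and standard deviation of a flat vector of hidden activations; the abstract does not say which moments are used or along which axis, so the function name `transplant_moments` and the toy vectors are hypothetical.

```python
from statistics import fmean, pstdev

def transplant_moments(recipient, donor, eps=1e-8):
    """Rescale `recipient` so it carries the donor's mean and std.

    Assumption: only the first two moments are transplanted.
    """
    r_mean, r_std = fmean(recipient), pstdev(recipient)
    d_mean, d_std = fmean(donor), pstdev(donor)
    # Standardize the recipient, then re-express it in the donor's statistics.
    return [(x - r_mean) / (r_std + eps) * d_std + d_mean for x in recipient]

# Toy hidden states (illustrative values only, not taken from the paper).
calm = [0.1, 0.2, 0.15, 0.05, 0.0]
crash = [-3.0, 2.5, -4.0, 1.0, -1.5]
steered = transplant_moments(calm, crash)
```

In this sketch `steered` keeps the ordering of the calm activations but takes on the location and scale of the crash activations, which is one plausible reading of imposing one event's statistics onto another.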
LM-Searcher Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding
Authors: Yuxuan Hu, Jihao Liu, Ke Wang, Jinliang Zhen, Weikang Shi, Manyuan Zhang, Qi Dou, Rui Liu, Aojun Zhou, Hongsheng Li
2025-09-06
Recent progress in Large Language Models (LLMs) has opened new avenues for
solving complex optimization problems, including Neural Architecture Search
(NAS). However, existing LLM-driven NAS approaches rely heavily on prompt
engineering and domain-specific tuning, limiting their practicality and
scalability across diverse tasks. In this work, we propose LM-Searcher, a novel
framework that leverages LLMs for cross-domain neural architecture optimization
without the need for extensive domain-specific adaptation. Central to our
approach is NCode, a universal numerical string representation for neural
architectures, which enables cross-domain architecture encoding and search. We
also reformulate the NAS problem as a ranking task, training LLMs to select
high-performing architectures from candidate pools using instruction-tuning
samples derived from a novel pruning-based subspace sampling strategy. Our
curated dataset, encompassing a wide range of architecture-performance pairs,
encourages robust and transferable learning. Comprehensive experiments
demonstrate that LM-Searcher achieves competitive performance in both in-domain
(e.g., CNNs for image classification) and out-of-domain (e.g., LoRA
configurations for segmentation and generation) tasks, establishing a new
paradigm for flexible and generalizable LLM-based architecture search. The
datasets and models will be released at https://github.com/Ashone3/LM-Searcher.
Cross-Service Threat Intelligence in LLM Services using Privacy-Preserving Fingerprints
Authors: Waris Gill, Natalie Isak, Matthew Dressman
2025-09-06
The widespread deployment of LLMs across enterprise services has created a
critical security blind spot. Organizations operate multiple LLM services
handling billions of queries daily, yet regulatory compliance boundaries
prevent these services from sharing threat intelligence about prompt injection
attacks, the top security risk for LLMs. When an attack is detected in one
service, the same threat may persist undetected in others for months, as
privacy regulations prohibit sharing user prompts across compliance boundaries.
We present BinaryShield, the first privacy-preserving threat intelligence
system that enables secure sharing of attack fingerprints across compliance
boundaries. BinaryShield transforms suspicious prompts through a unique
pipeline combining PII redaction, semantic embedding, binary quantization, and a
randomized response mechanism to potentially generate non-invertible
fingerprints that preserve attack patterns while providing privacy. Our
evaluations demonstrate that BinaryShield achieves an F1-score of 0.94,
significantly outperforming SimHash (0.77), the privacy-preserving baseline,
while achieving 64x storage reduction and 38x faster similarity search compared
to dense embeddings.
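The last two stages of the pipeline above (binary quantization followed by randomized response) can be sketched as follows. This is an illustration under stated assumptions, not BinaryShield's implementation: sign-based binarization, a fixed flip probability of 0.05, Hamming similarity for matching, and random Gaussian vectors standing in for real prompt embeddings are all hypothetical choices. PII redaction and the embedding model are omitted.

```python
import random

def binarize(embedding):
    """Sign-based binary quantization: one bit per embedding dimension."""
    return [1 if x > 0 else 0 for x in embedding]

def randomized_response(bits, flip_prob, rng):
    """Flip each bit independently with probability `flip_prob`."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]

def hamming_similarity(a, b):
    """Fraction of positions where two equal-length bit lists agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

rng = random.Random(0)
# Stand-ins for semantic embeddings (assumption: 256 dimensions).
embedding = [rng.gauss(0, 1) for _ in range(256)]
near_duplicate = [x + rng.gauss(0, 0.1) for x in embedding]
unrelated = [rng.gauss(0, 1) for _ in range(256)]

fp_a = randomized_response(binarize(embedding), 0.05, rng)
fp_b = randomized_response(binarize(near_duplicate), 0.05, rng)
fp_c = randomized_response(binarize(unrelated), 0.05, rng)

sim_near = hamming_similarity(fp_a, fp_b)
sim_far = hamming_similarity(fp_a, fp_c)
```

Under these assumptions the noisy fingerprints of near-duplicate inputs should stay far more similar than those of unrelated inputs, which is the property a shared fingerprint needs for matching attacks across services.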
Icon Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation
Authors: Qiyuan Chen, Hongsen Huang, Qian Shao, Jiahe Chen, Jintai Chen, Hongxia Xu, Renjie Hua, Ren Chuan, Jian Wu
2025-09-06
Large Language Models (LLMs) require high-quality preference datasets to
align with human preferences. However, conventional methods for constructing
such datasets face significant challenges: reliance on pre-collected
instructions often leads to distribution mismatches with target models, while
the need for sampling multiple stochastic responses introduces substantial
computational overhead. In this work, we explore a paradigm shift by leveraging
the inherent regulation of LLMs' representation space for efficient and tailored
preference dataset construction, named Icon. Specifically, it first
extracts layer-wise direction vectors to encode sophisticated human preferences
and then uses these vectors to filter self-synthesized instructions based on
their inherent consistency. During decoding, bidirectional inherent control is
applied to steer token representations, enabling the precise generation of
response pairs with clear alignment distinctions. Experimental results
demonstrate significant improvements in both alignment and efficiency.
Llama3-8B and Qwen2-7B achieve an average win rate improvement of 13.89% on
AlpacaEval 2.0 and 13.45% on Arena-Hard, while reducing computational costs by
up to 48.1%.
ProfilingAgent Profiling-Guided Agentic Reasoning for Adaptive Model Optimization
Authors: Sadegh Jafari, Aishwarya Sarkar, Mohiuddin Bilwal, Ali Jannesari
2025-09-06
Foundation models face growing compute and memory bottlenecks, hindering
deployment on resource-limited platforms. While techniques such as pruning and
quantization are widely used, most rely on uniform heuristics that
ignore architectural and runtime heterogeneity. Profiling tools expose
per-layer latency, memory, and compute cost, yet are rarely integrated into
automated pipelines. We propose ProfilingAgent, a profiling-guided, agentic
approach that uses large language models (LLMs) to automate compression via
structured pruning and post-training dynamic quantization. Our modular
multi-agent system reasons over static metrics (MACs, parameter counts) and
dynamic signals (latency, memory) to design architecture-specific strategies.
Unlike heuristic baselines, ProfilingAgent tailors layer-wise decisions to
bottlenecks. Experiments on ImageNet-1K, CIFAR-10, and CIFAR-100 with
ResNet-101, ViT-B/16, Swin-B, and DeiT-B/16 show pruning maintains competitive
or improved accuracy (about 1% drop on ImageNet-1K, +2% gains for ViT-B/16 on
smaller datasets), while quantization achieves up to 74% memory savings with
<0.5% accuracy loss. Our quantization also yields consistent inference speedups
of up to 1.74x. Comparative studies with GPT-4o and GPT-4-Turbo
highlight the importance of LLM reasoning quality for iterative pruning. These
results establish agentic systems as scalable solutions for profiling-guided
model optimization.
Sensitivity-Aware Post-Training Quantization for Deep Neural Networks
Authors: Zekang Zheng, Haokun Li, Yaofo Chen, Mingkui Tan, Qing Du
2025-09-06
Model quantization reduces neural network parameter precision to achieve
compression, but often compromises accuracy. Existing post-training
quantization (PTQ) methods employ iterative parameter updates to preserve
accuracy under high compression ratios, incurring significant computational
complexity and resource overhead, which limits applicability in
resource-constrained edge computing and real-time inference scenarios. This
paper proposes an efficient PTQ method guided by parameter sensitivity
analysis. The approach prioritizes quantization of high-sensitivity parameters,
leveraging unquantized low-sensitivity parameters to compensate for quantization
errors, thereby mitigating accuracy degradation. Furthermore, by
exploiting column-wise clustering of parameter sensitivity, the method
introduces a row-parallel quantization framework with a globally shared inverse
Hessian matrix update mechanism, reducing computational complexity by an order
of magnitude. Experimental results on ResNet-50 and YOLOv5s demonstrate a
20-200-fold quantization speedup over the Optimal Brain Quantization baseline,
with mean accuracy loss below 0.3%, confirming the method's efficacy in
balancing efficiency and accuracy.
TreeGPT Pure TreeFFN Encoder-Decoder Architecture for Structured Reasoning Without Attention Mechanisms
Authors: Zixi Li
2025-09-06
We present TreeGPT, an attention-free neural architecture that explores the
potential of pure TreeFFN encoder-decoder design for structured reasoning
tasks. Unlike traditional transformer approaches that rely on attention
mechanisms, TreeGPT employs bidirectional TreeFFN components that process
sequences through adjacent connections in parallel, aiming to achieve
computational efficiency while maintaining reasoning capabilities.
Our approach centers on a TreeFFN Encoder-Decoder mechanism, where the encoder processes
left-to-right dependencies while the decoder handles right-to-left patterns,
both using simple neighbor-to-neighbor connections. This design eliminates
attention computation while maintaining sequence modeling capabilities.
We evaluate our approach on the ARC Prize 2025 dataset, where TreeGPT
achieves 99% validation accuracy using 3.16M parameters. The model converges
within 1500 training steps and demonstrates 100% token-level accuracy on
selected evaluation samples. Our preliminary results suggest that for certain
structured reasoning tasks, specialized TreeFFN architectures may offer
advantages over attention-based approaches. While these findings are
encouraging, we acknowledge that further investigation across diverse tasks and
datasets would be valuable to establish the broader applicability of
attention-free designs.
veScale Consistent and Efficient Tensor Programming with Eager-Mode SPMD
Authors: Youjie Li, Cheng Wan, Zhiqi Lin, Hongyu Zhu, Jiacheng Yang, Ziang Song, Xinyi Di, Jiawei Wu, Huiyao Shu, Wenlei Bao, Yanghua Peng, Haibin Lin, Li-Wen Chang
2025-09-05
Large Language Models (LLMs) have scaled rapidly in size and complexity,
requiring increasingly intricate parallelism for distributed training, such as
3D parallelism. This sophistication motivates a shift toward simpler, more
debuggable programming paradigms like Single Program Multiple Data (SPMD).
However, SPMD in eager execution introduces two key challenges: ensuring
consistency with single-device execution and achieving high performance at
scale. In this paper, we introduce veScale, an eager-mode training system that
fully embraces the SPMD paradigm to democratize distributed tensor programming.
veScale addresses the prevalent issue of inconsistent results in systems like
PyTorch by introducing a novel algorithm of distributed Random Number
Generation (RNG) compatible with arbitrary sharded operators. veScale also
significantly boosts training performance by reducing PyTorch primitives'
overhead and improving communication efficiency. Evaluations show that veScale
delivers up to 2.2x speedup over state-of-the-art training systems, like
TorchTitan, and cuts code complexity by 78.4%, while preserving
single-device-equivalent results.
Dynamic Sensitivity Filter Pruning using Multi-Agent Reinforcement Learning For DCNN's
Authors: Iftekhar Haider Chowdhury, Zaed Ikbal Syed, Ahmed Faizul Haque Dhrubo, Mohammad Abdul Qayum
2025-09-05
Deep Convolutional Neural Networks have achieved state-of-the-art performance
across various computer vision tasks; however, their practical deployment is
limited by computational and memory overhead. This paper introduces
Differential Sensitivity Fusion Pruning, a novel single-shot filter pruning
framework that focuses on evaluating the stability and redundancy of filter
importance scores across multiple criteria. Differential Sensitivity Fusion
Pruning computes a differential sensitivity score for each filter by fusing the
discrepancies among gradient based sensitivity, first order Taylor expansion,
and KL divergence of activation distributions. An exponential scaling mechanism
is applied to emphasize filters with inconsistent importance across metrics,
identifying candidates that are structurally unstable or less critical to the
model performance. Unlike iterative or reinforcement learning based pruning
strategies, Differential Sensitivity Fusion Pruning is efficient and
deterministic, requiring only a single forward-backward pass for scoring and
pruning. Extensive experiments across varying pruning rates between 50 and 70
percent demonstrate that Differential Sensitivity Fusion Pruning significantly
reduces model complexity, achieving over 80 percent reduction in floating-point
operations (FLOPs) while maintaining high accuracy. For instance, at 70
percent pruning, our approach retains up to 98.23 percent of baseline accuracy,
surpassing traditional heuristics in both compression and generalization. The
proposed method presents an effective solution for scalable and adaptive Deep
Convolutional Neural Network compression, paving the way for efficient
deployment on edge and mobile platforms.
Crosscoding Through Time Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining
Authors: Deniz Bayazit, Aaron Mueller, Antoine Bosselut
2025-09-05
Large language models (LLMs) learn non-trivial abstractions during
pretraining, like detecting irregular plural noun subjects. However, it is not
well understood when and how specific linguistic abilities emerge as
traditional evaluation methods such as benchmarking fail to reveal how models
acquire concepts and capabilities. To bridge this gap and better understand
model training at the concept level, we use sparse crosscoders to discover and
align features across model checkpoints. Using this approach, we track the
evolution of linguistic features during pretraining. We train crosscoders
between open-sourced checkpoint triplets with significant performance and
representation shifts, and introduce a novel metric, Relative Indirect Effects
(RelIE), to trace training stages at which individual features become causally
important for task performance. We show that crosscoders can detect feature
emergence, maintenance, and discontinuation during pretraining. Our approach is
architecture-agnostic and scalable, offering a promising path toward more
interpretable and fine-grained analysis of representation learning throughout
pretraining.
Recomposer Event-roll-guided generative audio editing
Authors: Daniel P. W. Ellis, Eduardo Fonseca, Ron J. Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R. Hershey, Aren Jansen, R. Channing Moore, Manoj Plakal
2025-09-05
Editing complex real-world sound scenes is difficult because individual sound
sources overlap in time. Generative models can fill in missing or corrupted
details based on their strong prior understanding of the data domain. We
present a system for editing individual sound events within complex scenes able
to delete, insert, and enhance individual sound events based on textual edit
descriptions (e.g., ``enhance Door'') and a graphical representation of the
event timing derived from an ``event roll'' transcription. We present an
encoder-decoder transformer working on SoundStream representations, trained on
synthetic (input, desired output) audio example pairs formed by adding isolated
sound events to dense, real-world backgrounds. Evaluation reveals the
importance of each part of the edit descriptions -- action, class, timing. Our
work demonstrates ``recomposition'' is an important and practical application.
Exploring Autoregressive Vision Foundation Models for Image Compression
Authors: Huu-Tai Phung, Yu-Hsiang Lin, Yen-Kuan Ho, Wen-Hsiao Peng
2025-09-05
This work presents the first attempt to repurpose vision foundation models
(VFMs) as image codecs, aiming to explore their generation capability for
low-rate image compression. VFMs are widely employed in both conditional and
unconditional generation scenarios across diverse downstream tasks, e.g.,
physical AI applications. Many VFMs employ an encoder-decoder architecture
similar to that of end-to-end learned image codecs and learn an autoregressive
(AR) model to perform next-token prediction. To enable compression, we
repurpose the AR model in VFM for entropy coding the next token based on
previously coded tokens. This approach deviates from early semantic compression
efforts that rely solely on conditional generation for reconstructing input
images. Extensive experiments and analysis are conducted to compare VFM-based
codec to current SOTA codecs optimized for distortion or perceptual quality.
Notably, certain pre-trained, general-purpose VFMs demonstrate superior
perceptual quality at extremely low bitrates compared to specialized learned
image codecs. This finding paves the way for a promising research direction
that leverages VFMs for low-rate, semantically rich image compression.
KVCompose Efficient Structured KV Cache Compression with Composite Tokens
Authors: Dmitry Akulov, Mohamed Sana, Antonio De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed
2025-09-05
Large language models (LLMs) rely on key-value (KV) caches for efficient
autoregressive decoding; however, cache size grows linearly with context length
and model depth, becoming a major bottleneck in long-context inference. Prior
KV cache compression methods either enforce rigid heuristics, disrupt tensor
layouts with per-attention-head variability, or require specialized compute
kernels.
We propose a simple, yet effective, KV cache compression framework based on
attention-guided, layer-adaptive composite tokens. Our method aggregates
attention scores to estimate token importance, selects head-specific tokens
independently, and aligns them into composite tokens that respect the uniform
structure required by existing inference engines. A global allocation
mechanism further adapts retention budgets across layers, assigning more
capacity to layers with informative tokens. This approach achieves significant
memory reduction while preserving accuracy, consistently outperforming prior
structured and semi-structured methods. Crucially, our approach remains fully
compatible with standard inference pipelines, offering a practical and scalable
solution for efficient long-context LLM deployment.
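The selection-and-alignment idea described above can be sketched in a few lines. The abstract does not specify how head-specific tokens are aligned, so this sketch assumes the simplest option: each head keeps its own top-k tokens by summed attention, and composite slot j holds every head's j-th kept token. The function names and the toy attention values are hypothetical, and the layer-wise budget allocation is left out.

```python
def token_importance(attn):
    """attn[q][t] is the attention from query q to token t, for one head."""
    num_tokens = len(attn[0])
    return [sum(row[t] for row in attn) for t in range(num_tokens)]

def select_top_k(scores, k):
    """Indices of the k highest-scoring tokens, returned in positional order."""
    top = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:k]
    return sorted(top)

def compose(attn_per_head, k):
    """Build k composite tokens; slot j pairs each head's j-th kept token."""
    per_head = [select_top_k(token_importance(a), k) for a in attn_per_head]
    return [tuple(sel[j] for sel in per_head) for j in range(k)]

# Two heads, two queries, four cached tokens (toy numbers).
head0 = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
head1 = [[0.1, 0.6, 0.2, 0.1], [0.1, 0.2, 0.6, 0.1]]
composite = compose([head0, head1], k=2)
```

Here head 0 keeps tokens 0 and 3 while head 1 keeps tokens 1 and 2, yet the cache still holds exactly two slots per head, which is the uniform layout that standard inference engines expect.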
FLOWER Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies
Authors: Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, Rudolf Lioutikov
2025-09-05
Developing efficient Vision-Language-Action (VLA) policies is crucial for
practical robotics deployment, yet current approaches face prohibitive
computational costs and resource requirements. Existing diffusion-based VLA
policies require multi-billion-parameter models and massive datasets to achieve
strong performance. We tackle this efficiency challenge with two contributions:
intermediate-modality fusion, which reallocates capacity to the diffusion head
by pruning a share of the LLM layers, and action-specific Global-AdaLN
conditioning, which cuts parameters through modular adaptation. We
integrate these advances into a novel 950M-parameter VLA called FLOWER.
Pretrained in just 200 H100 GPU hours, FLOWER delivers competitive performance
with bigger VLAs across tasks spanning ten simulation and real-world
benchmarks and demonstrates robustness across diverse robotic embodiments. In
addition, FLOWER achieves a new SoTA of 4.53 on the CALVIN ABC benchmark.
Demos, code and pretrained weights are available at
https://intuitive-robots.github.io/flower_vla/.
Ground-Aware Octree-A* Hybrid Path Planning for Memory-Efficient 3D Navigation of Ground Vehicles
Authors: Byeong-Il Ham, Hyun-Bin Kim, Kyung-Soo Kim
2025-09-05
In this paper, we propose a 3D path planning method that integrates the A*
algorithm with the octree structure. Unmanned Ground Vehicles (UGVs) and legged
robots have been extensively studied, enabling locomotion across a variety of
terrains. Advances in mobility have enabled obstacles to be regarded not only
as hindrances to be avoided, but also as navigational aids when beneficial. A
modified 3D A* algorithm generates an optimal path by leveraging obstacles
during the planning process. By incorporating a height-based penalty into the
cost function, the algorithm enables the use of traversable obstacles to aid
locomotion while avoiding those that are impassable, resulting in more
efficient and realistic path generation. The octree-based 3D grid map achieves
compression by merging high-resolution nodes into larger blocks, especially in
obstacle-free or sparsely populated areas. This reduces the number of nodes
explored by the A* algorithm, thereby improving computational efficiency and
memory usage, and supporting real-time path planning in practical environments.
Benchmark results demonstrate that the use of octree structure ensures an
optimal path while significantly reducing memory usage and computation time.
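The height-based penalty described above can be illustrated with a small A* sketch. It is a simplification under stated assumptions, not the paper's planner: it runs on a uniform 2D height grid instead of an octree, uses 4-connected moves, and adds a hypothetical cost of `penalty` times the height change to every move, with moves steeper than `max_step` treated as impassable.

```python
import heapq

def plan(heights, start, goal, max_step=1.0, penalty=2.0):
    """A* over a height grid; returns (path, cost) or (None, inf)."""
    rows, cols = len(heights), len(heights[0])

    def heuristic(cell):
        # Manhattan distance is admissible because every move costs at least 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0.0, start)]
    best = {start: 0.0}
    parent = {}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1], g
        if g > best[cell]:
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            step = abs(heights[nr][nc] - heights[r][c])
            if step > max_step:
                continue  # too steep to climb: impassable
            ng = g + 1.0 + penalty * step  # distance plus height penalty
            if ng < best.get((nr, nc), float("inf")):
                best[(nr, nc)] = ng
                parent[(nr, nc)] = cell
                heapq.heappush(open_heap, (ng + heuristic((nr, nc)), ng, (nr, nc)))
    return None, float("inf")

# A low obstacle (0.5) sits on the direct route; a tall one (3.0) blocks the center.
heights = [
    [0.0, 0.5, 0.0],
    [0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0],
]
over_path, over_cost = plan(heights, (0, 0), (0, 2), penalty=2.0)
around_path, around_cost = plan(heights, (0, 0), (0, 2), penalty=5.0)
```

With the smaller penalty the planner climbs over the low obstacle, and with the larger one it detours around it, so the penalty weight controls when a traversable obstacle is worth using.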
PLaMo 2 Technical Report
Authors: Preferred Networks, Kaizaburo Chubachi, Yasuhiro Fujita, Shinichi Hemmi, Yuta Hirokawa, Toshiki Kataoka, Goro Kobayashi, Kenichi Maehashi, Calvin Metzger, Hiroaki Mikami, Shogo Murai, Daisuke Nishino, Kento Nozawa, Shintarou Okada, Daisuke Okanohara, Shunta Saito, Shotaro Sano, Shuji Suzuki, Daisuke Tanaka, Avinash Ummadisingu, Hanqin Wang, Sixue Wang, Tianqi Xu
2025-09-05
In this report, we introduce PLaMo 2, a series of Japanese-focused large
language models featuring a hybrid Samba-based architecture that transitions to
full attention via continual pre-training to support 32K token contexts.
Training leverages extensive synthetic corpora to overcome data scarcity, while
computational efficiency is achieved through weight reuse and structured
pruning. This efficient pruning methodology produces an 8B model that achieves
performance comparable to our previous 100B model. Post-training further
refines the models using a pipeline of supervised fine-tuning (SFT) and direct
preference optimization (DPO), enhanced by synthetic Japanese instruction data
and model merging techniques. Optimized for inference using vLLM and
quantization with minimal accuracy loss, the PLaMo 2 models achieve
state-of-the-art results on Japanese benchmarks, outperforming similarly-sized
open models in instruction-following, language fluency, and Japanese-specific
knowledge.
OSC Cognitive Orchestration through Dynamic Knowledge Alignment in Multi-Agent LLM Collaboration
Authors: Jusheng Zhang, Yijia Fan, Kaitong Cai, Xiaofei Sun, Keze Wang
2025-09-05
This paper introduces OSC (Orchestrating Cognitive Synergy), a
knowledge-aware adaptive collaboration framework designed to enhance cognitive
synergy in multi-agent systems with large language models. While prior work has
advanced agent selection and result aggregation, efficient linguistic
interactions for deep collaboration among expert agents remain a critical
bottleneck. OSC addresses this gap as a pivotal intermediate layer between
selection and aggregation, introducing Collaborator Knowledge Models (CKM) to
enable each agent to dynamically perceive its collaborators' cognitive states.
Through real-time cognitive gap analysis, agents adaptively adjust
communication behaviors, including content focus, detail level, and expression
style, using learned strategies. Experiments on complex reasoning and
problem-solving benchmarks demonstrate that OSC significantly improves task
performance and communication efficiency, transforming "parallel-working
individuals" into a "deeply collaborative cognitive team." This framework not
only optimizes multi-agent collaboration but also offers new insights into
agent interaction behaviors.
Broadband Simultaneous Beam Steering and Compressing Device Based on Subwavelength Protrusion Metallic Tunnels
Authors: Dongguo Zhang, Fei Sun, Qin Liao, Yichao Liu, Donguk Nam
2025-09-05
Beam steering and beamwidth compressing play a role in steering the beam and
narrowing its half-power beamwidth, respectively, which are both widely applied
in extending the effective operational range of 6G communications, IoT devices,
and antenna systems. However, research on wave manipulation devices capable of
simultaneously achieving both functionalities remains limited, despite their
great potential for system miniaturization and functional integration. In this
study, we design and realize a broadband device capable of simultaneously
steering and compressing the TM-polarized EM waves using subwavelength
protrusion metallic tunnels. The underlying physical mechanisms are
quantitatively explained through wave optics and optical surface
transformation, indicating the size ratio between the incident and output
surface governs both the steering angle and the compression ratio. Numerical
simulations demonstrate its outstanding performance, achieving a maximum
steering angle of 40° and a compression ratio of 0.4 across 3 to 12 GHz,
with averaged energy transmittance above 80%. The experiments further validate
its effectiveness by measuring the magnetic field distributions of the output
beam at various frequencies. The excellent beam steering and compressing
effects make the proposed device highly promising for next-generation
multifunctional wave manipulation in advanced communication systems.
VoltanaLLM Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
Authors: Jiahuan Yu, Aryan Taneja, Junfeng Lin, Minjia Zhang
2025-09-05
Modern Large Language Model (LLM) serving systems increasingly support
interactive applications, like real-time chat assistants, code generation
tools, and agentic workflows. However, the soaring energy cost of LLM inference
presents a growing challenge for sustainable and cost-effective deployment.
This paper introduces VoltanaLLM, a system for SLO-aware, energy-efficient
LLM serving, built from a control theory perspective. VoltanaLLM co-designs
frequency scaling and request routing in emerging prefill/decode disaggregated
architectures, leveraging their decoupled execution to enable fine-grained
phase-specific control. It consists of a feedback-driven frequency controller
that dynamically adapts GPU frequency for prefill and decode phases, and a
state-space router that explores routing decisions across frequency-scaled
instances to minimize energy under latency constraints. We implement VoltanaLLM
in SGLang and evaluate its performance over multiple state-of-the-art LLMs and
real-world datasets. The results demonstrate that VoltanaLLM achieves up to
36.3% energy savings while maintaining near-perfect SLO attainment rate, paving
the way for sustainable and intelligent LLM serving.
AI-Driven Fronthaul Link Compression in Wireless Communication Systems Review and Method Design
Authors: Keqin Zhang
2025-09-05
Modern fronthaul links in wireless systems must transport high-dimensional
signals under stringent bandwidth and latency constraints, which makes
compression indispensable. Traditional strategies such as compressed sensing,
scalar quantization, and fixed-codec pipelines often rely on restrictive
priors, degrade sharply at high compression ratios, and are hard to tune across
channels and deployments. Recent progress in Artificial Intelligence (AI) has
brought end-to-end learned transforms, vector and hierarchical quantization,
and learned entropy models that better exploit the structure of Channel State
Information (CSI), precoding matrices, I/Q samples, and LLRs. This paper first
surveys AI-driven compression techniques and then provides a focused analysis
of two representative high-compression routes: CSI feedback with end-to-end
learning and Resource Block (RB) granularity precoding optimization combined
with compression. Building on these insights, we propose a fronthaul
compression strategy tailored to cell-free architectures. The design targets
high compression with controlled performance loss, supports RB-level rate
adaptation, and enables low-latency inference suitable for centralized
cooperative transmission in next-generation networks.
Personality as a Probe for LLM Evaluation Method Trade-offs and Downstream Effects
Authors: Gunmay Handa, Zekun Wu, Adriano Koshiyama, Philip Treleaven
2025-09-05
Personality manipulation in large language models (LLMs) is increasingly
applied in customer service and agentic scenarios, yet its mechanisms and
trade-offs remain unclear. We present a systematic study of personality control
using the Big Five traits, comparing in-context learning (ICL),
parameter-efficient fine-tuning (PEFT), and mechanistic steering (MS). Our
contributions are fourfold. First, we construct a contrastive dataset with
balanced high/low trait responses, enabling effective steering vector
computation and fair cross-method evaluation. Second, we introduce a unified
evaluation framework based on within-run analysis that disentangles
reasoning capability, agent performance, and demographic bias across MMLU,
GAIA, and BBQ benchmarks. Third, we develop trait purification techniques to
separate openness from conscientiousness, addressing representational overlap
in trait encoding. Fourth, we propose a three-level stability framework that
quantifies method-, trait-, and combination-level robustness, offering
practical guidance under deployment constraints. Experiments on Gemma-2-2B-IT
and LLaMA-3-8B-Instruct reveal clear trade-offs: ICL achieves strong alignment
with minimal capability loss, PEFT delivers the highest alignment at the cost
of degraded task performance, and MS provides lightweight runtime control with
competitive effectiveness. Trait-level analysis shows openness as uniquely
challenging, agreeableness as most resistant to ICL, and personality encoding
consolidating around intermediate layers. Taken together, these results
establish personality manipulation as a multi-level probe into behavioral
representation, linking surface conditioning, parameter encoding, and
activation-level steering, and positioning mechanistic steering as a
lightweight alternative to fine-tuning for both deployment and
interpretability.
Decoders Laugh as Loud as Encoders
Authors: Eli Borodach, Raj Dandekar, Rajat Dandekar, Sreedath Panat
2025-09-05
From the dawn of the computer, Alan Turing dreamed of a robot that could
communicate using language as a human being. The recent advances in the field
of Large Language Models (LLMs) shocked the scientific community when a single
model could be applied to various natural language processing (NLP) tasks, while the
output results are sometimes even better than most human communication skills.
Models such as GPT, Claude, Grok, etc. have left their mark on the scientific
community. However, it is unclear how much these models understand what they
produce, especially in a nuanced theme such as humor. The question of whether
computers understand humor is still open (among the decoders, the latest to be
checked was GPT-2). We addressed this issue in this paper; we have shown that
a fine-tuned decoder (GPT-4o) performed (mean F1-macro score of 0.85) as well
as the best fine-tuned encoder (RoBERTa, with a mean F1-score of 0.86).
A Study of Large Language Models for Patient Information Extraction Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning
Authors: Cheng Peng, Xinyu Dong, Mengxian Lyu, Daniel Paredes, Yaoyun Zhang, Yonghui Wu
2025-09-05
Natural language processing (NLP) is a key technology to extract important
patient information from clinical narratives to support healthcare
applications. The rapid development of large language models (LLMs) has
revolutionized many NLP tasks in the clinical domain, yet their optimal use in
patient information extraction tasks requires further exploration. This study
examines LLMs' effectiveness in patient information extraction, focusing on
architectures, fine-tuning strategies, and multi-task instruction tuning
techniques for developing robust and generalizable patient information
extraction systems. This study aims to explore key concepts of using LLMs for
clinical concept and relation extraction tasks, including: (1) encoder-only or
decoder-only LLMs, (2) prompt-based parameter-efficient fine-tuning (PEFT)
algorithms, and (3) multi-task instruction tuning on few-shot learning
performance. We benchmarked a suite of LLMs, including encoder-based LLMs
(BERT, GatorTron) and decoder-based LLMs (GatorTronGPT, Llama 3.1,
GatorTronLlama), across five datasets. We compared traditional full-size
fine-tuning and prompt-based PEFT. We explored a multi-task instruction tuning
framework that combines both tasks across four datasets to evaluate the
zero-shot and few-shot learning performance using the leave-one-dataset-out
strategy.
ODKE+ Ontology-Guided Open-Domain Knowledge Extraction with LLMs
Authors: Samira Khorshidi, Azadeh Nikfarjam, Suprita Shankar, Yisi Sang, Yash Govind, Hyun Jang, Ali Kasgari, Alexis McClimans, Mohamed Soliman, Vishnu Konda, Ahmed Fakhry, Xiaoguang Qi
2025-09-04
Knowledge graphs (KGs) are foundational to many AI applications, but
maintaining their freshness and completeness remains costly. We present ODKE+,
a production-grade system that automatically extracts and ingests millions of
open-domain facts from web sources with high precision. ODKE+ combines modular
components into a scalable pipeline: (1) the Extraction Initiator detects
missing or stale facts, (2) the Evidence Retriever collects supporting
documents, (3) hybrid Knowledge Extractors apply both pattern-based rules and
ontology-guided prompting for large language models (LLMs), (4) a lightweight
Grounder validates extracted facts using a second LLM, and (5) the Corroborator
ranks and normalizes candidate facts for ingestion. ODKE+ dynamically generates
ontology snippets tailored to each entity type to align extractions with schema
constraints, enabling scalable, type-consistent fact extraction across 195
predicates. The system supports batch and streaming modes, processing over 9
million Wikipedia pages and ingesting 19 million high-confidence facts with
98.8% precision. ODKE+ significantly improves coverage over traditional
methods, achieving up to 48% overlap with third-party KGs and reducing update
lag by 50 days on average. Our deployment demonstrates that LLM-based
extraction, grounded in ontological structure and verification workflows, can
deliver trustworthy, production-scale knowledge ingestion with broad
real-world applicability. A recording of the system demonstration is included
with the submission and is also available at https://youtu.be/UcnE3_GsTWs.
First demonstration of coherent radiation imaging for bunch-by-bunch longitudinal compression monitoring
Authors: Joseph Wolfenden, Ana Guisao-Betancur, Carsten Welsch, Billy Kyle, Thomas Pacey, Erik Mansten, Sara Thorin, Mathias Brandin
2025-09-04
Longitudinal bunch profile monitoring is a crucial diagnostic requirement in
most accelerator facilities. This is particularly true in modern free-electron
lasers and novel schemes, where bunch lengths are often <100 fs
and standard instrumentation is invasive or lacks the required resolution. This
paper proposes a new monitoring method in this challenging parameter space.
Initial proof-of-principle results for relative compression monitoring via
broadband imaging of coherent THz radiation are presented. The technique can
utilize more conventional intensity monitoring or novel spatial distribution
variation. Both techniques have been demonstrated using both invasive and
non-invasive coherent radiation sources. These results pave the way for a
future non-invasive longitudinal bunch profile monitor.
DarkStream real-time speech anonymization with low latency
Authors: Waris Quamer, Ricardo Gutierrez-Osuna
2025-09-04
We propose DarkStream, a streaming speech synthesis model for real-time
speaker anonymization. To improve content encoding under strict latency
constraints, DarkStream combines a causal waveform encoder, a short lookahead
buffer, and transformer-based contextual layers. To further reduce inference
time, the model generates waveforms directly via a neural vocoder, thus
removing intermediate mel-spectrogram conversions. Finally, DarkStream
anonymizes speaker identity by injecting a GAN-generated pseudo-speaker
embedding into linguistic features from the content encoder. Evaluations show
our model achieves strong anonymization, yielding close to 50% speaker
verification EER (near-chance performance) on the lazy-informed attack
scenario, while maintaining acceptable linguistic intelligibility (WER within
9%). By balancing low latency, robust privacy, and minimal intelligibility
degradation, DarkStream provides a practical solution for privacy-preserving
real-time speech communication.
AraHalluEval A Fine-grained Hallucination Evaluation Framework for Arabic LLMs
Authors: Aisha Alansari, Hamzah Luqman
2025-09-04
Recently, extensive research on the hallucination of large language
models (LLMs) has mainly focused on the English language. Despite the growing
number of multilingual and Arabic-specific LLMs, evaluating LLMs' hallucination
in the Arabic context remains relatively underexplored. The knowledge gap is
particularly pressing given Arabic's widespread use across many regions and its
importance in global communication and media. This paper presents the first
comprehensive hallucination evaluation of Arabic and multilingual LLMs on two
critical Arabic natural language generation tasks: generative question
answering (GQA) and summarization. This study evaluates a total of 12 LLMs,
including 4 Arabic pre-trained models, 4 multilingual models, and 4
reasoning-based models. To assess the factual consistency and faithfulness of
LLMs' outputs, we developed a fine-grained hallucination evaluation framework
consisting of 12 fine-grained hallucination indicators that represent the
varying characteristics of each task. The results reveal that factual
hallucinations are more prevalent than faithfulness errors across all models
and tasks. Notably, the Arabic pre-trained model Allam consistently
demonstrates lower hallucination rates than multilingual models and
performance comparable to reasoning-based models. The code is available at:
https://github.com/aishaalansari57/AraHalluEval
Scaling Environments for Organoid Intelligence with LLM-Automated Design and Plasticity-Based Evaluation
Authors: Brennen Hill
2025-09-04
As the complexity of artificial agents increases, the design of environments
that can effectively shape their behavior and capabilities has become a
critical research frontier. We propose a framework that extends this principle
to a novel class of agents: biological neural networks in the form of neural
organoids. This paper introduces three scalable, closed-loop virtual
environments designed to train organoid-based biological agents and probe the
underlying mechanisms of learning, such as long-term potentiation (LTP) and
long-term depression (LTD). We detail the design of three distinct task
environments with increasing complexity: (1) a conditional avoidance task, (2)
a one-dimensional predator-prey scenario, and (3) a replication of the classic
Pong game. For each environment, we formalize the state and action spaces, the
sensory encoding and motor decoding mechanisms, and the feedback protocols
based on predictable (reward) and unpredictable (punishment) stimulation.
Furthermore, we propose a novel meta-learning approach where a Large Language
Model (LLM) is used to automate the generation and optimization of experimental
protocols, scaling the process of environment and curriculum design. Finally,
we outline a multi-modal approach for evaluating learning by measuring synaptic
plasticity at electrophysiological, cellular, and molecular levels. This work
bridges the gap between computational neuroscience and agent-based AI, offering
a unique platform for studying embodiment, learning, and intelligence in a
controlled biological substrate.
Schema Inference for Tabular Data Repositories Using Large Language Models
Authors: Zhenyu Wu, Jiaoyan Chen, Norman W. Paton
2025-09-04
Minimally curated tabular data often contain representational inconsistencies
across heterogeneous sources, and are accompanied by sparse metadata. Working
with such data is intimidating. While prior work has advanced dataset discovery
and exploration, schema inference remains difficult when metadata are limited.
We present SI-LLM (Schema Inference using Large Language Models), which infers
a concise conceptual schema for tabular data using only column headers and cell
values. The inferred schema comprises hierarchical entity types, attributes,
and inter-type relationships. In extensive evaluation on two datasets from web
tables and open data, SI-LLM achieves promising end-to-end results, as well as
better or comparable results to state-of-the-art methods at each step. All
source code, full prompts, and datasets of SI-LLM are available at
https://github.com/PierreWoL/SI
.
Communication-Efficient Collaborative LLM Inference via Distributed Speculative Decoding
Authors: Ce Zheng, Tingting Yang
2025-09-04
Speculative decoding is an emerging technique that accelerates large language
model (LLM) inference by allowing a smaller draft model to predict multiple
tokens in advance, which are then verified or corrected by a larger target
model. In AI-native radio access networks (AI-RAN), this paradigm is
well-suited for collaborative inference between resource-constrained end
devices and more capable edge servers or base stations (BSs). However, existing
distributed speculative decoding requires transmitting the full vocabulary
probability distribution from the draft model on the device to the target model
at the BS, which leads to prohibitive uplink communication overhead. To address
this issue, we propose a Top-K Sparse Logits Transmission (TK-SLT) scheme,
where the draft model transmits only the raw probabilities of the top-K tokens and the
corresponding token indices instead of the entire distribution. This approach
significantly reduces bandwidth consumption while maintaining inference
performance. We further derive an analytical expression for the optimal draft
length that maximizes inference throughput, and provide a theoretical analysis
of the achievable speedup ratio under TK-SLT. Experimental results validate
both the efficiency and effectiveness of the proposed method.
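The top-K idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 16-bit probability width, and the fixed-width index encoding are all assumptions chosen only to make the uplink saving concrete.

```python
import heapq
import math


def top_k_sparse(probs, k):
    """Return the indices and raw probabilities of the k most likely tokens, highest first."""
    top = heapq.nlargest(k, enumerate(probs), key=lambda pair: pair[1])
    indices = [i for i, _ in top]
    values = [p for _, p in top]
    return indices, values


def payload_bits(vocab_size, k=None, bits_per_prob=16):
    """Bits sent per drafted token under a hypothetical fixed-width encoding."""
    if k is None:
        # Dense case: one probability for every vocabulary entry.
        return vocab_size * bits_per_prob
    # Sparse case: each kept token costs one probability plus one index.
    bits_per_index = math.ceil(math.log2(vocab_size))
    return k * (bits_per_prob + bits_per_index)


# Toy distribution over a 5-token vocabulary.
probs = [0.05, 0.40, 0.10, 0.30, 0.15]
indices, values = top_k_sparse(probs, k=2)

# Payload comparison for an assumed 32000-token vocabulary.
dense = payload_bits(vocab_size=32000)
sparse = payload_bits(vocab_size=32000, k=8)
```

Under these assumed widths, the sparse payload grows with K rather than with the vocabulary size, which is the source of the bandwidth reduction the abstract describes. How the target model verifies drafts from the truncated distribution is not covered by this sketch.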
PagedEviction Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference
Authors: Krishna Teja Chitty-Venkata, Jie Ye, Xian-He Sun, Anthony Kougkas, Murali Emani, Venkatram Vishwanath, Bogdan Nicolae
2025-09-04
KV caching significantly improves the efficiency of Large Language Model
(LLM) inference by storing attention states from previously processed tokens,
enabling faster generation of subsequent tokens. However, as sequence length
increases, the KV cache quickly becomes a major memory bottleneck. To address
this, we propose PagedEviction, a novel fine-grained, structured KV cache
pruning strategy that enhances the memory efficiency of vLLM's PagedAttention.
Unlike existing approaches that rely on attention-based token importance or
evict tokens across different vLLM pages, PagedEviction introduces an efficient
block-wise eviction algorithm tailored for paged memory layouts. Our method
integrates seamlessly with PagedAttention without requiring any modifications
to its CUDA attention kernels. We evaluate PagedEviction across
Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models
on the LongBench benchmark suite, demonstrating improved memory usage with
better accuracy than baselines on long context tasks.
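To make the notion of block-wise eviction in a paged layout concrete, here is a small sketch. It is an illustration only: the abstract does not state how blocks are scored, so the per-token `score` supplied by the caller, the mean-score eviction rule, and the `Block` and `PagedCache` names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One fixed-size page of cached tokens, each with a caller-supplied score."""
    token_ids: list = field(default_factory=list)
    scores: list = field(default_factory=list)

    def mean_score(self):
        return sum(self.scores) / len(self.scores)


class PagedCache:
    """Toy paged cache that evicts whole blocks, never single tokens."""

    def __init__(self, block_size, max_blocks):
        self.block_size = block_size
        self.max_blocks = max_blocks
        self.blocks = []

    def append(self, token_id, score):
        # Open a new block when there is none yet or the last one is full.
        if not self.blocks or len(self.blocks[-1].token_ids) == self.block_size:
            if len(self.blocks) == self.max_blocks:
                self._evict_one_block()
            self.blocks.append(Block())
        self.blocks[-1].token_ids.append(token_id)
        self.blocks[-1].scores.append(score)

    def _evict_one_block(self):
        # Assumed rule: drop the full block with the lowest mean score.
        victim = min(range(len(self.blocks)),
                     key=lambda i: self.blocks[i].mean_score())
        del self.blocks[victim]

    def cached_tokens(self):
        return [t for block in self.blocks for t in block.token_ids]


cache = PagedCache(block_size=2, max_blocks=2)
for token_id, score in enumerate([0.9, 0.8, 0.1, 0.2, 0.5, 0.6]):
    cache.append(token_id, score)
```

Because eviction only happens when a new block must be opened, every block considered for removal is full, so pages stay uniformly occupied. That property is what keeps a scheme of this shape compatible with a paged memory layout.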
Psychologically Enhanced AI Agents
Authors: Maciej Besta, Shriram Chandran, Robert Gerstenberger, Mathis Lindner, Marcin Chrapek, Sebastian Hermann Martschat, Taraneh Ghandi, Patrick Iff, Hubert Niewiadomski, Piotr Nyczyk, Jürgen Müller, Torsten Hoefler
2025-09-04
We introduce MBTI-in-Thoughts, a framework for enhancing the effectiveness of
Large Language Model (LLM) agents through psychologically grounded personality
conditioning. Drawing on the Myers-Briggs Type Indicator (MBTI), our method
primes agents with distinct personality archetypes via prompt engineering,
enabling control over behavior along two foundational axes of human psychology,
cognition and affect. We show that such personality priming yields consistent,
interpretable behavioral biases across diverse tasks: emotionally expressive
agents excel in narrative generation, while analytically primed agents adopt
more stable strategies in game-theoretic settings. Our framework supports
experimenting with structured multi-agent communication protocols and reveals
that self-reflection prior to interaction improves cooperation and reasoning
quality. To ensure trait persistence, we integrate the official 16Personalities
test for automated verification. While our focus is on MBTI, we show that our
approach generalizes seamlessly to other psychological frameworks such as Big
Five, HEXACO, or Enneagram. By bridging psychological theory and LLM behavior
design, we establish a foundation for psychologically enhanced AI agents
without any fine-tuning.
Cross-Layer Attention Probing for Fine-Grained Hallucination Detection
Authors: Malavika Suresh, Rahaf Aljundi, Ikechukwu Nkisi-Orji, Nirmalie Wiratunga
2025-09-04
With the large-scale adoption of Large Language Models (LLMs) in various
applications, there is a growing reliability concern due to their tendency to
generate inaccurate text, i.e. hallucinations. In this work, we propose
Cross-Layer Attention Probing (CLAP), a novel activation probing technique for
hallucination detection, which processes the LLM activations across the entire
residual stream as a joint sequence. Our empirical evaluations using five LLMs
and three tasks show that CLAP improves hallucination detection compared to
baselines on both greedy decoded responses as well as responses sampled at
higher temperatures, thus enabling fine-grained detection, i.e. the ability to
disambiguate hallucinations and non-hallucinations among different sampled
responses to a given prompt. This allows us to propose a detect-then-mitigate
strategy using CLAP to reduce hallucinations and improve LLM reliability
compared to direct mitigation approaches. Finally, we show that CLAP maintains
high reliability even when applied out-of-distribution.
Integrating Pruning with Quantization for Efficient Deep Neural Networks Compression
Authors: Sara Makenali, Babak Rokh, Ali Azarpeyvand
2025-09-04
Deep Neural Networks (DNNs) have achieved significant advances in a wide
range of applications. However, their deployment on resource-constrained
devices remains a challenge due to the large number of layers and parameters,
which result in considerable computational and memory demands. To address this
issue, pruning and quantization are two widely used compression techniques,
commonly applied individually in most studies to reduce model size and enhance
processing speed. Nevertheless, combining these two techniques can yield even
greater compression benefits. Effectively integrating pruning and quantization
to harness their complementary advantages poses a challenging task, primarily
due to their potential impact on model accuracy and the complexity of jointly
optimizing both processes. In this paper, we propose two approaches that
integrate similarity-based filter pruning with Adaptive Power-of-Two (APoT)
quantization to achieve higher compression efficiency while preserving model
accuracy. In the first approach, pruning and quantization are applied
simultaneously during training. In the second approach, pruning is performed
first to remove less important parameters, followed by quantization of the
pruned model using low-bit representations. Experimental results demonstrate
that our proposed approaches achieve effective model compression with minimal
accuracy degradation, making them well-suited for deployment on devices with
limited computational resources.
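The prune-then-quantize ordering of the second approach can be sketched as follows. This is a simplified stand-in, not the paper's method: the cosine-similarity threshold rule is an assumed form of similarity-based filter pruning, and the quantizer rounds each weight to a single power of two, which is simpler than APoT (usually described as building each level from several power-of-two terms). All function names and parameters are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two flattened filters."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune_similar(filters, threshold=0.95):
    """Keep a filter only if it is not nearly parallel to one already kept."""
    kept = []
    for f in filters:
        if all(cosine(f, g) < threshold for g in kept):
            kept.append(f)
    return kept


def quantize_pow2(w, min_exp=-4, max_exp=0):
    """Map a weight to sign * 2**e with e clamped to [min_exp, max_exp]; tiny weights become 0."""
    if abs(w) < 2.0 ** (min_exp - 1):
        return 0.0
    # The exponent is rounded in the log domain, then clamped.
    e = round(math.log2(abs(w)))
    e = max(min_exp, min(max_exp, e))
    return math.copysign(2.0 ** e, w)


# Step 1: prune. The second filter is a scaled copy of the first and is dropped.
filters = [[1.0, 0.0, 0.5], [2.0, 0.0, 1.0], [0.0, 0.3, -0.12]]
kept = prune_similar(filters)

# Step 2: quantize the surviving filters.
quantized = [[quantize_pow2(w) for w in f] for f in kept]
```

Power-of-two levels let a multiplication be replaced by a bit shift in hardware, which is one reason such schemes suit resource-constrained devices. The first approach in the abstract, which applies both steps jointly during training, would need a training loop and is not shown here.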
Real Time FPGA Based Transformers & VLMs for Vision Tasks SOTA Designs and Optimizations
Authors: Safa Mohammed Sali, Mahmoud Meribout, Ashiyana Abdul Majeed
2025-09-04
Transformers and vision-language models (VLMs) have emerged as dominant
architectures in computer vision and multimodal AI, offering state-of-the-art
performance in tasks such as image classification, object detection, visual
question answering, and caption generation. However, their high computational
complexity, large memory footprints, and irregular data access patterns present
significant challenges for deployment in latency- and power-constrained
environments. Field-programmable gate arrays (FPGAs) provide an attractive
hardware platform for such workloads due to their reconfigurability,
fine-grained parallelism, and potential for energy efficiency. This
paper presents a comprehensive review of design trade-offs, optimization
strategies, and implementation challenges for FPGA-based inference of
transformers and VLMs. We examine critical factors such as device-class
selection, memory subsystem constraints, dataflow orchestration, quantization
strategies, sparsity exploitation, and toolchain choices, alongside
modality-specific issues unique to VLMs, including heterogeneous compute
balancing and cross-attention memory management. Additionally, we discuss
emerging trends in hardware-algorithm co-design, highlighting innovations in
attention mechanisms, compression, and modular overlays to improve efficiency
and adaptability. Practical issues such as runtime flexibility, verification
overhead, and the absence of standardized FPGA multimodal benchmarks are also
considered. Finally, we outline future directions toward scalable, portable,
and reconfigurable FPGA solutions that adapt to evolving model architectures
while sustaining high utilization and predictable performance. This synthesis
offers both a technical foundation and a forward-looking perspective to help
bridge the gap between advanced multimodal AI models and efficient FPGA
deployment.
MultiWikiQA A Reading Comprehension Benchmark in 300+ Languages
Authors: Dan Saattrup Smart
2025-09-04
We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which
covers 306 languages. The context data comes from Wikipedia articles, with
questions generated by an LLM and the answers appearing verbatim in the
Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency
of the generated questions across 30 of the languages, providing evidence that
the questions are of good quality. We evaluate 6 different language models,
both decoder and encoder models of varying sizes, showing that the benchmark is
sufficiently difficult and that there is a large performance discrepancy
amongst the languages. The dataset and survey evaluations are freely available.
Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue
Authors: Keara Schaaij, Roel Boumans, Tibor Bosse, Iris Hendrickx
2025-09-04
Lexical alignment, where speakers start to use similar words across
conversation, is known to contribute to successful communication. However, its
implementation in conversational agents remains underexplored, particularly
considering the recent advancements in large language models (LLMs). As a first
step towards enabling lexical alignment in human-agent dialogue, this study
draws on strategies for personalising conversational agents and investigates
the construction of stable, personalised lexical profiles as a basis for
lexical alignment. Specifically, we varied the amounts of transcribed spoken
data used for construction as well as the number of items included in the
profiles per part-of-speech (POS) category and evaluated profile performance
across time using recall, coverage, and cosine similarity metrics. It was shown
that smaller and more compact profiles, created after 10 min of transcribed
speech containing 5 items for adjectives, 5 items for conjunctions, and 10
items for adverbs, nouns, pronouns, and verbs each, offered the best balance in
both performance and data efficiency. In conclusion, this study offers
practical insights into constructing stable, personalised lexical profiles,
taking into account minimal data requirements, serving as a foundational step
toward lexical alignment strategies in conversational agents.
Meta-Policy Reflexion Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent
Authors: Chunlong Wu, Ye Luo, Zhibo Qu, Min Wang
2025-09-04
Large language model (LLM) agents achieve impressive single-task performance
but commonly exhibit repeated failures, inefficient exploration, and limited
cross-task adaptability. Existing reflective strategies (e.g., Reflexion,
ReAct) improve per-episode behavior but typically produce ephemeral,
task-specific traces that are not reused across tasks. Reinforcement-learning
based alternatives can produce transferable policies but require substantial
parameter updates and compute. In this work we introduce Meta-Policy Reflexion
(MPR): a hybrid framework that consolidates LLM-generated reflections into a
structured, predicate-like Meta-Policy Memory (MPM) and applies that memory at
inference time through two complementary mechanisms: soft memory-guided
decoding and hard rule admissibility checks (HAC). MPR (i) externalizes reusable
corrective knowledge without model weight updates, (ii) enforces domain
constraints to reduce unsafe or invalid actions, and (iii) retains the
adaptability of language-based reflection. We formalize the MPM representation,
present algorithms for update and decoding, and validate the approach in a
text-based agent environment following the experimental protocol described in
the provided implementation (AlfWorld-based). Empirical results reported in the
supplied material indicate consistent gains in execution accuracy and
robustness when compared to Reflexion baselines; rule admissibility further
improves stability. We analyze mechanisms that explain these gains, discuss
scalability and failure modes, and outline future directions for multimodal and
multi-agent extensions.
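A rough sketch of how a hard admissibility check might combine with a soft memory signal is given below. It is not the MPR algorithm: rules are modeled as plain Python predicates, the memory is reduced to a per-action score bonus standing in for memory-guided decoding, and every name, rule, and number is hypothetical.

```python
def admissible(action, state, rules):
    """An action passes only if every rule accepts it in the current state."""
    return all(rule(action, state) for rule in rules)


def choose_action(candidates, state, rules, memory_bonus):
    """Return the best-scoring admissible action, or None if all are rejected.

    candidates: list of (action, base_score) pairs proposed by the agent.
    memory_bonus: dict mapping an action to a score adjustment (soft guidance).
    """
    allowed = [(action, score + memory_bonus.get(action, 0.0))
               for action, score in candidates
               if admissible(action, state, rules)]
    if not allowed:
        return None
    return max(allowed, key=lambda pair: pair[1])[0]


def must_be_at_target_to_open(action, state):
    """Hypothetical rule: an object can only be opened from its own location."""
    if not action.startswith("open "):
        return True
    return action.split(" ", 1)[1] == state["location"]


state = {"location": "counter"}
rules = [must_be_at_target_to_open]
candidates = [("open fridge", 0.9), ("look", 0.7), ("go to fridge", 0.6)]
memory_bonus = {"go to fridge": 0.2}

chosen = choose_action(candidates, state, rules, memory_bonus)
```

In this toy case the top-scored proposal is rejected by the hard check, and the memory bonus then lifts "go to fridge" above "look". The hard filter guarantees validity regardless of scores, while the soft bonus only reorders actions that are already valid.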
LMVC An End-to-End Learned Multiview Video Coding Framework
Authors: Xihua Sheng, Yingwen Zhang, Long Xu, Shiqi Wang
2025-09-04
Multiview video is a key data source for volumetric video, enabling immersive
3D scene reconstruction but posing significant challenges in storage and
transmission due to its massive data volume. Recently, deep learning-based
end-to-end video coding has achieved great success, yet most focus on
single-view or stereo videos, leaving general multiview scenarios
underexplored. This paper proposes an end-to-end learned multiview video coding
(LMVC) framework that ensures random access and backward compatibility while
enhancing compression efficiency. Our key innovation lies in effectively
leveraging independent-view motion and content information to enhance
dependent-view compression. Specifically, to exploit the inter-view motion
correlation, we propose a feature-based inter-view motion vector prediction
method that conditions dependent-view motion encoding on decoded
independent-view motion features, along with an inter-view motion entropy model
that learns inter-view motion priors. To exploit the inter-view content
correlation, we propose a disparity-free inter-view context prediction module
that predicts inter-view contexts from decoded independent-view content
features, combined with an inter-view contextual entropy model that captures
inter-view context priors. Experimental results show that our proposed LMVC
framework outperforms the reference software of the traditional MV-HEVC
standard by a large margin, establishing a strong baseline for future research
in this field.
MTQA Matrix of Thought for Enhanced Reasoning in Complex Question Answering
Authors: Fengxiao Tang, Yufeng Li, Zongzong Wu, Ming Zhao
2025-09-04
Complex Question Answering (QA) is a fundamental and challenging task in NLP.
While large language models (LLMs) exhibit impressive performance in QA, they
suffer from significant performance degradation when facing complex and
abstract QA tasks due to insufficient reasoning capabilities. Works such as
Chain-of-Thought (CoT) and Tree-of-Thought (ToT) aim to enhance LLMs' reasoning
abilities, but they face issues such as in-layer redundancy in tree structures
and single paths in chain structures. Although some studies utilize
Retrieval-Augmented Generation (RAG) methods to assist LLMs in reasoning, the
challenge of effectively utilizing large amounts of information involving
multiple entities and hops remains critical. To address this, we propose the
Matrix of Thought (MoT), a novel and efficient LLM thought structure. MoT
explores the problem in both horizontal and vertical dimensions through the
"column-cell communication" mechanism, enabling LLMs to actively engage in
multi-strategy and deep-level thinking, reducing redundancy within the column
cells and enhancing reasoning capabilities. Furthermore, we develop a
fact-correction mechanism by constructing knowledge units from retrieved
knowledge graph triples and raw text to enhance the initial knowledge for
reasoning and correct erroneous answers. This leads to the development of an
efficient and accurate QA framework (MTQA). Experimental results show that our
framework outperforms state-of-the-art methods on four widely-used datasets in
terms of F1 and EM scores, with reasoning time only 14.4% of the baseline
methods, demonstrating both its efficiency and accuracy. The code for this
framework is available at https://github.com/lyfiter/mtqa.