2025-08-22
Table of Contents
- Communication Efficient LLM Pre-training with SparseLoCo
- Amortized In-Context Mixed Effect Transformer Models: A Zero-Shot Approach for Pharmacokinetics
- Efficient Mixed-Precision Large Language Model Inference with TurboMind
- Deep Equilibrium Convolutional Sparse Coding for Hyperspectral Image Denoising
- LGMSNet: Thinning a medical image segmentation model via dual-level multiscale fusion
- DiagECG: An LLM-Driven Framework for Diagnostic Reasoning via Discretized ECG Tokenization
- Exploring Scaling Laws of CTR Model for Online Performance Improvement
- MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion
- LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text
- GasTwinFormer: A Hybrid Vision Transformer for Livestock Methane Emission Segmentation and Dietary Classification in Optical Gas Imaging
- Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner
- Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
- MissionHD: Data-Driven Refinement of Reasoning Graph Structure through Hyperdimensional Causal Path Encoding and Decoding
- Multiscale Video Transformers for Class Agnostic Segmentation in Autonomous Driving
- Improving in-context learning with a better scoring function
- Can LLM Agents Solve Collaborative Tasks? A Study on Urgency-Aware Planning and Coordination
- Deep Skin Lesion Segmentation with Transformer-CNN Fusion Toward Intelligent Skin Cancer Analysis
- DeepTelecom: A Digital-Twin Deep Learning Dataset for Channel and MIMO Applications
- Taming Transformer for Emotion-Controllable Talking Face Generation
- SurveyGen-I: Consistent Scientific Survey Generation with Evolving Plans and Memory-Guided Writing
- Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS
- GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation
- Pixels to Play: A Foundation Model for 3D Gameplay
- Measuring LLM Code Generation Stability via Structural Entropy
- Disentangling concept semantics via multilingual averaging in Sparse Autoencoders
- Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper
- GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
- LLM-Powered Virtual Patient Agents for Interactive Clinical Skills Training with Automated Feedback
- LLMind 2.0: Distributed IoT Automation with Natural Language M2M Communication and Lightweight LLM Agents
- Prompt-Based One-Shot Exact Length-Controlled Generation with LLMs
- Communication-Efficient Federated Learning with Adaptive Number of Participants
- CRISP: Persistent Concept Unlearning via Sparse Autoencoders
- Interpreting the Interpreter: Can We Model post-ECB Conferences Volatility with LLM Agents?
- A Comparative Study of Decoding Strategies in Medical Text Generation
- LLM-Enhanced Linear Autoencoders for Recommendation
- ALIGN: Word Association Learning for Cross-Cultural Generalization in Large Language Models
- Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis
- Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks
Communication Efficient LLM Pre-training with SparseLoCo
Authors: Amir Sarfi, Benjamin Thérien, Joel Lidin, Eugene Belilovsky
2025-08-21
Communication-efficient distributed training algorithms have received considerable interest recently due to their benefits for training Large Language Models (LLMs) in bandwidth-constrained settings, such as across data centers and over the internet. Despite reducing communication frequency, these methods still typically require communicating a full copy of the model's gradients, resulting in a communication bottleneck even for cross-datacenter links. Furthermore, they can slightly degrade performance compared to a naive AdamW DDP baseline. While quantization and error feedback are often applied to reduce the pseudo-gradient's size, in the context of LLM pre-training, existing approaches have been unable to additionally leverage sparsification and have obtained limited compression. In this work, we introduce SparseLoCo, a communication-efficient training algorithm for LLMs that effectively leverages Top-k sparsification and quantization to reach extreme compression ratios of up to 1-3% sparsity and 2-bit quantization while outperforming full-precision DiLoCo. Our key observations are that outer momentum can be locally approximated by error feedback combined with aggressive sparsification, and that sparse aggregation can actually improve model performance. We empirically demonstrate in a range of communication-constrained LLM training settings that SparseLoCo provides significant benefits in both performance and communication cost.
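The communication step the abstract describes, Top-k sparsification with local error feedback, fits in a few lines. The sketch below is an illustrative reconstruction under our own assumptions (the function name and the 2% default are ours, not the authors' API), and it omits the quantization and sparse-aggregation parts:

```python
import numpy as np

def topk_with_error_feedback(pseudo_grad, error_buffer, k_fraction=0.02):
    # Fold the residual left over from the previous round back in.
    corrected = pseudo_grad + error_buffer
    # Keep only the k largest-magnitude entries (e.g. 1-3% sparsity).
    k = max(1, int(k_fraction * corrected.size))
    flat = corrected.ravel()
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    sparse_flat = np.zeros_like(flat)
    sparse_flat[keep] = flat[keep]
    sparse = sparse_flat.reshape(corrected.shape)
    # Everything that was dropped becomes the next round's error feedback,
    # which the abstract notes can locally approximate outer momentum.
    return sparse, corrected - sparse
```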
Amortized In-Context Mixed Effect Transformer Models: A Zero-Shot Approach for Pharmacokinetics
Authors: César Ali Ojeda Marin, Wilhelm Huisinga, Purity Kavwele, Niklas Hartung
2025-08-21
Accurate dose-response forecasting under sparse sampling is central to precision pharmacotherapy. We present the Amortized In-Context Mixed-Effect Transformer (AICMET) model, a transformer-based latent-variable framework that unifies mechanistic compartmental priors with amortized in-context Bayesian inference. AICMET is pre-trained on hundreds of thousands of synthetic pharmacokinetic trajectories with Ornstein-Uhlenbeck priors over the parameters of compartment models, endowing the model with strong inductive biases and enabling zero-shot adaptation to new compounds. At inference time, the decoder conditions on the collective context of previously profiled trial participants, generating calibrated posterior predictions for newly enrolled patients after a few early drug concentration measurements. This capability collapses traditional model-development cycles from weeks to hours while preserving some degree of expert modelling. Experiments across public datasets show that AICMET attains state-of-the-art predictive accuracy and faithfully quantifies inter-patient variability -- outperforming both nonlinear mixed-effects baselines and recent neural ODE variants. Our results highlight the feasibility of transformer-based, population-aware neural architectures as offering a new alternative for bespoke pharmacokinetic modeling pipelines, charting a path toward truly population-aware personalized dosing regimens.
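To make the pre-training data concrete: a synthetic trajectory of the kind described pairs a compartment model with an Ornstein-Uhlenbeck prior on its parameters. The sketch below simulates one such trajectory for a one-compartment bolus model with an OU-perturbed log elimination rate; all constants are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def ou_path(theta, mu, sigma, x0, dt, n):
    # Euler-Maruyama simulation of dX = theta * (mu - X) dt + sigma dW.
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        x[t] = x[t-1] + theta * (mu - x[t-1]) * dt + sigma * np.sqrt(dt) * rng.normal()
    return x

# One synthetic patient: dose, volume, and OU constants are made up.
dt, n = 0.1, 240                                  # hours, number of steps
log_ke = ou_path(theta=0.5, mu=np.log(0.1), sigma=0.05,
                 x0=np.log(0.1), dt=dt, n=n)      # log elimination rate
dose, volume = 500.0, 40.0                        # mg, litres
conc = np.empty(n)
conc[0] = dose / volume                           # bolus at t = 0
for t in range(1, n):
    conc[t] = conc[t-1] * np.exp(-np.exp(log_ke[t]) * dt)  # elimination
```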
Efficient Mixed-Precision Large Language Model Inference with TurboMind
Authors: Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, Kai Chen
2025-08-21
Mixed-precision inference techniques reduce the memory and computational demands of Large Language Models (LLMs) by applying hybrid precision formats to model weights, activations, and KV caches. This work introduces mixed-precision inference techniques that encompass (i) systematic memory and compute optimization across hierarchical storage and tensor core architectures, and (ii) comprehensive end-to-end mixed-precision optimization across diverse precision formats and hardware configurations. Our approach features two novel mixed-precision pipelines designed for optimal hardware utilization: a General Matrix Multiply (GEMM) pipeline that optimizes matrix operations through offline weight packing and online dequantization, and an attention pipeline that enables efficient attention computation with arbitrary Query, Key, and Value precision combinations. The key implementation of the pipelines includes (i) hardware-aware weight packing for automatic format optimization, (ii) adaptive head alignment for efficient attention computation, (iii) instruction-level parallelism for memory hierarchy exploitation, and (iv) a memory loading pipeline for enhanced inference efficiency. We conduct comprehensive evaluations across 16 popular LLMs and 4 representative GPU architectures. Results demonstrate that our approach achieves up to 61% lower decoding latency (30% on average) and up to 156% higher throughput (58% on average) in mixed-precision workloads compared to existing mixed-precision frameworks, establishing consistent performance improvements across all tested configurations and hardware types. This work is integrated into TurboMind, a high-performance inference engine of the LMDeploy project, which is open-sourced and publicly available at https://github.com/InternLM/lmdeploy.
Deep Equilibrium Convolutional Sparse Coding for Hyperspectral Image Denoising
Authors: Jin Ye, Jingran Wang, Fengchao Xiong, Jingzhou Chen, Yuntao Qian
2025-08-21
Hyperspectral images (HSIs) play a crucial role in remote sensing but are often degraded by complex noise patterns. Ensuring the physical properties of the denoised HSIs is vital for robust HSI denoising, giving rise to deep unfolding-based methods. However, these methods map the optimization of a physical model to a learnable network with a predefined depth, which lacks convergence guarantees. In contrast, Deep Equilibrium (DEQ) models treat the hidden layers of deep networks as the solution to a fixed-point problem and model them as infinite-depth networks, naturally consistent with the optimization. Under the framework of DEQ, we propose a Deep Equilibrium Convolutional Sparse Coding (DECSC) framework that unifies local spatial-spectral correlations, nonlocal spatial self-similarities, and global spatial consistency for robust HSI denoising. Within the convolutional sparse coding (CSC) framework, we enforce a shared 2D convolutional sparse representation to ensure global spatial consistency across bands, while an unshared 3D convolutional sparse representation captures local spatial-spectral details. To further exploit nonlocal self-similarities, a transformer block is embedded after the 2D CSC. Additionally, a detail enhancement module is integrated with the 3D CSC to promote image detail preservation. We formulate the proximal gradient descent of the CSC model as a fixed-point problem and transform the iterative updates into a learnable network architecture within the framework of DEQ. Experimental results demonstrate that our DECSC method achieves superior denoising performance compared to state-of-the-art methods.
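The fixed-point view is easy to see on a plain (non-convolutional) sparse-coding problem: the proximal-gradient (ISTA) update is a map whose fixed point is the solution, which is exactly what a DEQ layer solves for. A minimal sketch, assuming a dense dictionary D rather than the paper's convolutional operators:

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista_fixed_point(D, y, lam=0.1, step=None, tol=1e-8, max_iter=5000):
    # Solve min_z 0.5*||y - D z||^2 + lam*||z||_1 by iterating the
    # proximal-gradient map until it reaches its fixed point z*.
    # A DEQ layer treats z* itself as the layer output and
    # backpropagates through the fixed point implicitly.
    if step is None:
        step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1/L, spectral norm squared
    z = np.zeros(D.shape[1])
    for _ in range(max_iter):
        z_next = soft_threshold(z - step * D.T @ (D @ z - y), step * lam)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 128))
y = rng.normal(size=64)
z_star = ista_fixed_point(D, y)
```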
LGMSNet: Thinning a medical image segmentation model via dual-level multiscale fusion
Authors: Chengqi Dong, Fenghe Tang, Rongge Mao, Xinpei Gao, S. Kevin Zhou
2025-08-21
Medical image segmentation plays a pivotal role in disease diagnosis and treatment planning, particularly in resource-constrained clinical settings where lightweight and generalizable models are urgently needed. However, existing lightweight models often compromise performance for efficiency and rarely adopt computationally expensive attention mechanisms, severely restricting their global contextual perception capabilities. Additionally, current architectures neglect the channel redundancy issue under identical convolutional kernels in medical imaging, which hinders effective feature extraction. To address these challenges, we propose LGMSNet, a novel lightweight framework based on local and global dual-level multiscale fusion that achieves state-of-the-art performance with minimal computational overhead. LGMSNet employs heterogeneous intra-layer kernels to extract local high-frequency information while mitigating channel redundancy. In addition, the model integrates transformer-convolutional hybrid branches to capture low-frequency global information. Extensive experiments across six public datasets demonstrate LGMSNet's superiority over existing state-of-the-art methods. In particular, LGMSNet maintains exceptional performance in zero-shot generalization tests on four unseen datasets, underscoring its potential for real-world deployment in resource-limited medical scenarios. The whole project code is at https://github.com/cq-dong/LGMSNet.
DiagECG: An LLM-Driven Framework for Diagnostic Reasoning via Discretized ECG Tokenization
Authors: Jinning Yang, Wen Shi
2025-08-21
Electrocardiography plays a central role in cardiovascular diagnostics, yet existing automated approaches often struggle to generalize across clinical tasks and offer limited support for open-ended reasoning. We present DiagECG, a novel framework that integrates time-series and language modeling by enabling large language models to process 12-lead ECG signals for clinical text generation tasks. Our approach discretizes continuous ECG embeddings into symbolic tokens using a lead-independent encoder and quantization module. These tokens are then used to extend the vocabulary of the LLM, allowing the model to handle both ECG and natural language inputs in a unified manner. To bridge the modality gap, we pretrain the model on an autoregressive ECG forecasting task, enabling the LLM to model temporal dynamics using its native language modeling capabilities. Finally, we perform instruction tuning on both ECG question answering and diagnostic report generation. Without modifying the core model, DiagECG achieves strong performance across tasks while maintaining generalization to out-of-distribution settings. Extensive experiments demonstrate the effectiveness of each component and highlight the potential of integrating symbolic ECG representations into LLMs for medical reasoning.
Exploring Scaling Laws of CTR Model for Online Performance Improvement
Authors: Weijiang Lai, Beihong Jin, Jiongyan Zhang, Yiyuan Zheng, Jian Dong, Jia Cheng, Jun Lei, Xingxing Wang
2025-08-21
CTR models play a vital role in improving user experience and boosting business revenue in many online personalized services. However, current CTR models generally encounter bottlenecks in performance improvement. Inspired by the scaling law phenomenon of LLMs, we propose a new paradigm for improving CTR predictions: first, constructing a CTR model with accuracy scalable to the model grade and data size, and then distilling the knowledge implied in this model into a lightweight model that can serve online users. To put it into practice, we construct a CTR model named SUAN (Stacked Unified Attention Network). In SUAN, we propose the UAB as a behavior sequence encoder. A single UAB unifies the modeling of sequential and non-sequential features and also measures the importance of each user behavior feature from multiple perspectives. Stacked UABs elevate the configuration to a high grade, paving the way for performance improvement. To benefit from the high performance of the high-grade SUAN while avoiding the disadvantage of its long inference time, we modify the SUAN with sparse self-attention and parallel inference strategies to form LightSUAN, and then adopt online distillation to train the low-grade LightSUAN, taking a high-grade SUAN as a teacher. The distilled LightSUAN has superior performance but the same inference time as the LightSUAN, making it well-suited for online deployment. Experimental results show that SUAN performs exceptionally well and holds to scaling laws spanning three orders of magnitude in model grade and data size, and the distilled LightSUAN outperforms the SUAN configured one grade higher. More importantly, the distilled LightSUAN has been integrated into an online service, increasing the CTR by 2.81% and CPM by 1.69% while keeping the average inference time acceptable. Our source code is available at https://github.com/laiweijiang/SUAN.
MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion
Authors: Xuyang Chen, Zhijun Zhai, Kaixuan Zhou, Zengmao Wang, Jianan He, Dong Wang, Yanfeng Zhang, mingwei Sun, Rüdiger Westermann, Konrad Schindler, Liqiu Meng
2025-08-21
Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. To address this, this paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models as the geometric prior. While image and video diffusion models can leverage spatial layouts (such as depth maps or HD maps) as control conditions to generate street-level perspective views, they are not directly applicable to 3D scene generation. Video diffusion models excel at synthesizing consistent view sequences that depict scenes but often struggle to adhere to predefined camera paths or to align accurately with rendered control videos. In contrast, image diffusion models, though unable to guarantee cross-view visual consistency, can produce more geometry-aligned results when combined with ControlNet. Building on this insight, our approach enhances image diffusion models by improving cross-view consistency. The pipeline comprises three key stages: first, we generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; second, we propagate denser intermediate views via a component dubbed AGInpaint; and third, we globally eliminate visual inconsistencies (e.g., varying exposure) using the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed by initializing Gaussian balls on the mesh surface. Our method outperforms existing approaches in both geometric alignment and generation quality. Once synthesized, the scene can be rendered in diverse styles through relighting and style transfer techniques.
LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text
Authors: MohamamdJavad Ardestani, Ehsan Kamalloo, Davood Rafiei
2025-08-20
The completeness of machine-generated text, ensuring that it captures all relevant information, is crucial in domains such as medicine and law and in tasks like list-based question answering (QA), where omissions can have serious consequences. However, existing recall metrics often depend on lexical matching, leading to errors with unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods with long holistic prompts capture broader semantics but remain prone to misalignment and hallucinations without structured verification. We introduce LongRecall, a general three-stage recall evaluation framework that decomposes answers into self-contained facts, successively narrows plausible candidate matches through lexical and semantic filtering, and verifies their alignment through structured entailment checks. This design reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, serving as a foundational building block for systematic recall assessment. We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines.
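The three-stage design maps directly onto a small evaluation loop. Below is a skeleton under our own assumptions: fact decomposition (stage 1) is assumed to have happened upstream, and the lexical, semantic, and entailment checks are caller-supplied placeholders (e.g., token overlap, embedding cosine, an NLI model); the thresholds are illustrative, not the paper's:

```python
from typing import Callable, List

def long_recall(answer_facts: List[str], gold_facts: List[str],
                lexical_sim: Callable[[str, str], float],
                semantic_sim: Callable[[str, str], float],
                entails: Callable[[str, str], bool],
                lex_tau: float = 0.2, sem_tau: float = 0.6) -> float:
    hits = 0
    for gold in gold_facts:
        # Stage 2: narrow candidates with a cheap lexical filter,
        # then a semantic one.
        cands = [f for f in answer_facts if lexical_sim(gold, f) >= lex_tau]
        cands = [f for f in cands if semantic_sim(gold, f) >= sem_tau]
        # Stage 3: structured entailment check on the survivors.
        if any(entails(f, gold) for f in cands):
            hits += 1
    return hits / max(1, len(gold_facts))   # recall over gold facts
```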
GasTwinFormer: A Hybrid Vision Transformer for Livestock Methane Emission Segmentation and Dietary Classification in Optical Gas Imaging
Authors: Toqi Tahamid Sarker, Mohamed Embaby, Taminul Islam, Amer AbuGhazaleh, Khaled R Ahmed
2025-08-20
Livestock methane emissions represent 32% of human-caused methane production, making automated monitoring critical for climate mitigation strategies. We introduce GasTwinFormer, a hybrid vision transformer for real-time methane emission segmentation and dietary classification in optical gas imaging through a novel Mix Twin encoder alternating between spatially-reduced global attention and locally-grouped attention mechanisms. Our architecture incorporates a lightweight LR-ASPP decoder for multi-scale feature aggregation and enables simultaneous methane segmentation and dietary classification in a unified framework. We contribute the first comprehensive beef cattle methane emission dataset using OGI, containing 11,694 annotated frames across three dietary treatments. GasTwinFormer achieves 74.47% mIoU and 83.63% mF1 for segmentation while maintaining exceptional efficiency with only 3.348M parameters, 3.428G FLOPs, and 114.9 FPS inference speed. Additionally, our method achieves perfect dietary classification accuracy (100%), demonstrating the effectiveness of leveraging diet-emission correlations. Extensive ablation studies validate each architectural component, establishing GasTwinFormer as a practical solution for real-time livestock emission monitoring. Please see our project page at gastwinformer.github.io.
Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner
Authors: Bolian Li, Yanran Wu, Xinyu Luo, Ruqi Zhang
2025-08-20
Aligning large language models (LLMs) with human preferences has become a critical step in their development. Recent research has increasingly focused on test-time alignment, where additional compute is allocated during inference to enhance LLM safety and reasoning capabilities. However, these test-time alignment techniques often incur substantial inference costs, limiting their practical application. We are inspired by the speculative sampling algorithm, which leverages a small draft model to efficiently predict future tokens, to address the efficiency bottleneck of test-time alignment. We introduce the Reward-Shifted Speculative Sampling (SSS) algorithm, in which the draft model is aligned with human preferences while the target model remains unchanged. We theoretically demonstrate that the distributional shift between the aligned draft model and the unaligned target model can be exploited to recover the RLHF optimal solution without actually obtaining it, by modifying the acceptance criterion and bonus token distribution. Our algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments, thereby validating both its effectiveness and efficiency.
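For reference, the vanilla speculative sampling step that SSS builds on looks like the sketch below: the draft proposes a token, the target accepts it with probability min(1, p/q), and rejections are resampled from the residual distribution. SSS's contribution, shifting the acceptance rule and bonus distribution toward the reward-aligned draft, is deliberately not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target, q_draft):
    # Draft proposes x ~ q; accept with prob min(1, p(x)/q(x)).
    x = rng.choice(len(q_draft), p=q_draft)
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x                                  # accepted draft token
    # Otherwise resample from the normalized residual max(0, p - q).
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual)

p = np.array([0.5, 0.3, 0.2])   # target next-token distribution
q = np.array([0.6, 0.2, 0.2])   # draft next-token distribution
token = speculative_step(p, q)
```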
Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
Authors: Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun
2025-08-20
Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based language models. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to quantization, as they make it difficult to preserve precision for the majority of values. More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. All code and experimental setups will be released to support the community.
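Why outliers hurt is easy to demonstrate with the simplest PTQ scheme, per-tensor absmax quantization: a single extreme activation inflates the scale, so ordinary values keep only a few effective levels. A self-contained illustration on synthetic data (not the paper's setup):

```python
import numpy as np

def absmax_quantize(x, bits=8):
    # Per-tensor absmax quantization: the scale is set by the single
    # largest magnitude, so one outlier crushes everyone else's resolution.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).astype(np.int32), scale

acts = np.random.default_rng(0).normal(0, 1, 4096)
acts[0] = 80.0                            # a single extreme activation
q, scale = absmax_quantize(acts, bits=8)
dequant = q * scale
mean_err = np.abs(dequant - acts)[1:].mean()  # error on ordinary values
```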
MissionHD: Data-Driven Refinement of Reasoning Graph Structure through Hyperdimensional Causal Path Encoding and Decoding
Authors: Sanggeon Yun, Raheeb Hassan, Ryozo Masukawa, Mohsen Imani
2025-08-20
Reasoning graphs from Large Language Models (LLMs) are often misaligned with downstream visual tasks such as video anomaly detection (VAD). Existing Graph Structure Refinement (GSR) methods are ill-suited for these novel, dataset-less graphs. We introduce Data-driven GSR (D-GSR), a new paradigm that directly optimizes graph structure using downstream task data, and propose MissionHD, a hyperdimensional computing (HDC) framework to operationalize it. MissionHD uses an efficient encode-decode process to refine the graph, guided by the downstream task signal. Experiments on challenging VAD and VAR benchmarks show significant performance improvements when using our refined graphs, validating our approach as an effective pre-processing step.
Multiscale Video Transformers for Class Agnostic Segmentation in Autonomous Driving
Authors: Leila Cheshmi, Mennatullah Siam
2025-08-20
Ensuring safety in autonomous driving is a complex challenge requiring handling of unknown objects and unforeseen driving scenarios. We develop multiscale video transformers capable of detecting unknown objects using only motion cues. Video semantic and panoptic segmentation often relies on known classes seen during training, overlooking novel categories. Recent visual grounding with large language models is computationally expensive, especially for pixel-level output. We propose an efficient video transformer trained end-to-end for class-agnostic segmentation without optical flow. Our method uses multi-stage multiscale query-memory decoding and a scale-specific random drop-token scheme to ensure efficiency and accuracy, maintaining detailed spatiotemporal features with a shared, learnable memory module. Unlike conventional decoders that compress features, our memory-centric design preserves high-resolution information at multiple scales. We evaluate on DAVIS'16, KITTI, and Cityscapes. Our method consistently outperforms multiscale baselines while being efficient in GPU memory and run-time, demonstrating a promising direction for real-time, robust dense prediction in safety-critical robotics.
Improving in-context learning with a better scoring function
Authors: Omar Naim, Swarnadeep Bhar, Jérôme Bolte, Nicholas Asher
2025-08-20
Large language models (LLMs) exhibit a remarkable capacity to learn by analogy, known as in-context learning (ICL). However, recent studies have revealed limitations in this ability. In this paper, we examine these limitations on tasks involving first-order quantifiers such as "all" and "some", as well as on ICL with linear functions. We identify Softmax, the scoring function in the attention mechanism, as a contributing factor to these constraints. To address this, we propose scaled signed averaging (SSA), a novel alternative to Softmax. Empirical results show that SSA dramatically improves performance on our target tasks. Furthermore, we evaluate both encoder-only and decoder-only transformer models with SSA, demonstrating that they match or exceed their Softmax-based counterparts across a variety of linguistic probing tasks.
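The abstract does not give SSA's formula, so the following is only one plausible reading of a "signed averaging" alternative: keep each score's sign and normalize by the total absolute score mass instead of exponentiating. Treat it as a sketch of the design space, not the paper's definition:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def scaled_signed_average(s, eps=1e-9):
    # Weights keep the sign of the raw scores and are not squashed
    # through exp; their absolute values sum to 1.
    return s / (np.abs(s).sum() + eps)

scores = np.array([2.0, -1.0, 0.5])
attn_softmax = softmax(scores)            # all positive, sums to 1
attn_ssa = scaled_signed_average(scores)  # signed, |weights| sum to 1
```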
Can LLM Agents Solve Collaborative Tasks? A Study on Urgency-Aware Planning and Coordination
Authors: João Vitor de Carvalho Silva, Douglas G. Macharet
2025-08-20
The ability to coordinate actions across multiple agents is critical for solving complex, real-world problems. Large Language Models (LLMs) have shown strong capabilities in communication, planning, and reasoning, raising the question of whether they can also support effective collaboration in multi-agent settings. In this work, we investigate the use of LLM agents to solve a structured victim rescue task that requires division of labor, prioritization, and cooperative planning. Agents operate in a fully known graph-based environment and must allocate resources to victims with varying needs and urgency levels. We systematically evaluate their performance using a suite of coordination-sensitive metrics, including task success rate, redundant actions, room conflicts, and urgency-weighted efficiency. This study offers new insights into the strengths and failure modes of LLMs in physically grounded multi-agent collaboration tasks, contributing to future benchmarks and architectural improvements.
Deep Skin Lesion Segmentation with Transformer-CNN Fusion Toward Intelligent Skin Cancer Analysis
Authors: Xin Wang, Xiaopei Zhang, Xingang Wang
2025-08-20
This paper proposes a high-precision semantic segmentation method based on an improved TransUNet architecture to address the challenges of complex lesion structures, blurred boundaries, and significant scale variations in skin lesion images. The method integrates a transformer module into the traditional encoder-decoder framework to model global semantic information, while retaining a convolutional branch to preserve local texture and edge features. This enhances the model's ability to perceive fine-grained structures. A boundary-guided attention mechanism and a multi-scale upsampling path are also designed to improve lesion boundary localization and segmentation consistency. To verify the effectiveness of the approach, a series of experiments were conducted, including comparative studies, hyperparameter sensitivity analysis, data augmentation effects, input resolution variation, and training data split ratio tests. Experimental results show that the proposed model outperforms existing representative methods in mIoU, mDice, and mAcc, demonstrating stronger lesion recognition accuracy and robustness. In particular, the model achieves better boundary reconstruction and structural recovery in complex scenarios, making it well-suited for the key demands of automated segmentation tasks in skin lesion analysis.
DeepTelecom: A Digital-Twin Deep Learning Dataset for Channel and MIMO Applications
Authors: Bohao Wang, Zehua Jiang, Zhenyu Yang, Chongwen Huang, Yongliang Shen, Siming Jiang, Chen Zhu, Zhaohui Yang, Richeng Jin, Zhaoyang Zhang, Sami Muhaidat, Merouane Debbah
2025-08-20
Domain-specific datasets are the foundation for unleashing artificial intelligence (AI)-driven wireless innovation. Yet existing wireless AI corpora are slow to produce, offer limited modeling fidelity, and cover only narrow scenario types. To address these challenges, we create DeepTelecom, a three-dimensional (3D) digital-twin channel dataset. Specifically, a large language model (LLM)-assisted pipeline first builds level-of-detail-3 (LoD3) outdoor and indoor scenes with segmentable, material-parameterizable surfaces. Then, DeepTelecom simulates full radio-wave propagation effects based on Sionna's ray-tracing engine. Leveraging GPU acceleration, DeepTelecom streams ray-path trajectories and real-time signal-strength heat maps, compiles them into high-frame-rate videos, and simultaneously outputs synchronized multi-view images, channel tensors, and multi-scale fading traces. By efficiently streaming large-scale, high-fidelity, and multimodal channel data, DeepTelecom not only furnishes a unified benchmark for wireless AI research but also supplies the domain-rich training substrate that enables foundation models to tightly fuse large model intelligence with future communication systems.
Taming Transformer for Emotion-Controllable Talking Face Generation
Authors: Ziqi Zhang, Cheng Deng
2025-08-20
Talking face generation is a novel and challenging generation task, aiming at synthesizing a vivid speaking-face video given a specific audio. To fulfill emotion-controllable talking face generation, current methods need to overcome two challenges: one is how to effectively model the multimodal relationship related to the specific emotion, and the other is how to leverage this relationship to synthesize identity-preserving emotional videos. In this paper, we propose a novel method to tackle the emotion-controllable talking face generation task discretely. Specifically, we employ two pre-training strategies to disentangle audio into independent components and to quantize videos into combinations of visual tokens. Subsequently, we propose the emotion-anchor (EA) representation that integrates the emotional information into visual tokens. Finally, we introduce an autoregressive transformer to model the global distribution of the visual tokens under the given conditions and further predict the index sequence for synthesizing the manipulated videos. We conduct experiments on the MEAD dataset, controlling the emotion of generated videos conditioned on multiple emotional audios. Extensive experiments demonstrate the superiority of our method both qualitatively and quantitatively.
SurveyGen-I: Consistent Scientific Survey Generation with Evolving Plans and Memory-Guided Writing
Authors: Jing Chen, Zhiheng Yang, Yixian Shen, Jie Liu, Adam Belloum, Chrysa Papagainni, Paola Grosso
2025-08-20
Survey papers play a critical role in scientific communication by consolidating progress across a field. Recent advances in Large Language Models (LLMs) offer a promising solution by automating key steps in the survey-generation pipeline, such as retrieval, structuring, and summarization. However, existing LLM-based approaches often struggle with maintaining coherence across long, multi-section surveys and providing comprehensive citation coverage. To address these limitations, we introduce SurveyGen-I, an automatic survey generation framework that combines coarse-to-fine retrieval, adaptive planning, and memory-guided generation. SurveyGen-I first performs survey-level retrieval to construct the initial outline and writing plan, and then dynamically refines both during generation through a memory mechanism that stores previously written content and terminology, ensuring coherence across subsections. When the system detects insufficient context, it triggers fine-grained subsection-level retrieval. Experiments across four scientific domains demonstrate that SurveyGen-I consistently outperforms previous works in content quality, consistency, and citation coverage.
Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS
Authors: Can Jin, Yang Zhou, Qixin Zhang, Hongwu Peng, Di Zhang, Marco Pavone, Ligong Han, Zhang-Wei Hong, Tong Che, Dimitris N. Metaxas
2025-08-19
Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generated labels and often degrade under distribution shifts. In this paper, we introduce AIRL-S, the first natural unification of RL-based and search-based TTS. Central to AIRL-S is the insight that the reward function learned during RL training inherently represents the ideal PRM for guiding downstream search. Specifically, we leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces, entirely eliminating the need for labeled intermediate process data. At inference, the resulting PRM simultaneously serves as the critic for RL rollouts and as a heuristic to effectively guide search procedures, facilitating robust reasoning chain extension, mitigating reward hacking, and enhancing cross-task generalization. Experimental results across eight benchmarks, including mathematics, scientific reasoning, and code generation, demonstrate that our unified approach improves performance by 9% on average over the base model, matching GPT-4o. Furthermore, when integrated into multiple search algorithms, our PRM consistently outperforms all baseline PRMs trained with labeled data. These results underscore that, indeed, your reward function for RL is your best PRM for search, providing a robust and cost-effective solution to complex reasoning tasks in LLMs.
GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation
Authors: Amirmohsen Sattarifard, Sepehr Lavasani, Ehsan Imani, Kunlin Zhang, Hanlin Xu, Fengyu Sun, Negar Hassanpour, Chao Gao
2025-08-19
Deploying Large Language Models (LLMs) on edge hardware demands aggressive, prompt-aware dynamic sparsification to reduce computation without degrading quality. Static or predictor-based schemes either lock in a single sparsity pattern or incur extra runtime overhead, and recent zero-shot methods that rely on statistics from a single prompt fail in short-prompt and/or long-generation scenarios. We introduce A/I-GLASS: Activation- and Impact-based Global-Local neural importance Aggregation for feed-forward network SparSification, two training-free methods that dynamically select FFN units using a rank-aggregation of prompt-local and model-intrinsic global neuron statistics. Empirical results across multiple LLMs and benchmarks demonstrate that GLASS significantly outperforms prior training-free methods, particularly in challenging long-form generation scenarios, without relying on auxiliary predictors or adding any inference overhead.
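The rank-aggregation step admits a compact sketch: rank FFN units under a prompt-local statistic and a global one, combine the ranks, and keep the top fraction. Everything below (the statistic choice, equal weighting, and 50% keep ratio) is our illustrative assumption rather than GLASS's actual rule:

```python
import numpy as np

def select_ffn_units(local_stat, global_stat, keep_ratio=0.5):
    # Rank 0 = most important under each statistic.
    local_rank = np.argsort(np.argsort(-local_stat))
    global_rank = np.argsort(np.argsort(-global_stat))
    combined = (local_rank + global_rank) / 2.0   # aggregate the two ranks
    k = max(1, int(keep_ratio * combined.size))
    return np.argsort(combined)[:k]               # indices of units to keep

rng = np.random.default_rng(0)
local = rng.random(11008)    # e.g. prompt-local activation magnitudes
global_ = rng.random(11008)  # e.g. model-intrinsic importance scores
kept = select_ffn_units(local, global_)
```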
Pixels to Play: A Foundation Model for 3D Gameplay
Authors: Yuguang Yue, Chris Green, Samuel Hunt, Irakli Salia, Wenzhe Shi, Jonathan J Hunt
2025-08-19
We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior. Motivated by emerging consumer and developer use cases - AI teammates, controllable NPCs, personalized live-streamers, assistive testers - we argue that an agent must rely on the same pixel stream available to players and generalize to new titles with minimal game-specific engineering. P2P0.1 is trained end-to-end with behavior cloning: labeled demonstrations collected from instrumented human game-play are complemented by unlabeled public videos, to which we impute actions via an inverse-dynamics model. A decoder-only transformer with auto-regressive action output handles the large action space while remaining latency-friendly on a single consumer GPU. We report qualitative results showing competent play across simple Roblox and classic MS-DOS titles, present ablations on unlabeled data, and outline the scaling and evaluation steps required to reach expert-level, text-conditioned control.
Measuring LLM Code Generation Stability via Structural Entropy
Authors: Yewei Song, Tiezhu Sun, Xunzhu Tang, Prateek Rajput, Tegawende F. Bissyande, Jacques Klein
2025-08-19
Assessing the stability of code generation from large language models (LLMs) is essential for judging their reliability in real-world development. We extend prior "structural-entropy concepts" to the program domain by pairing entropy with abstract syntax tree (AST) analysis. For any fixed prompt, we collect the multiset of depth-bounded AST subtrees in each generated program and treat their relative frequencies as a probability distribution. We then measure stability in two complementary ways: (i) Jensen-Shannon divergence, a symmetric, bounded indicator of structural divergence, and (ii) a Structural Cross-Entropy ratio that highlights missing high-probability patterns. Both metrics admit structural-only and token-aware variants, enabling separate views on control-flow shape and identifier-level variability. Unlike pass@k, BLEU, or CodeBLEU, our metrics are reference-free, language-agnostic, and execution-independent. We benchmark several leading LLMs on standard code generation tasks, demonstrating that AST-driven structural entropy reveals nuances in model consistency and robustness. The method runs in O(n·d) time with no external tests, providing a lightweight addition to the code-generation evaluation toolkit.
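The pipeline is simple enough to sketch end to end for Python programs: hash depth-bounded subtree shapes with the standard ast module, normalize counts into a distribution, and compare two generations with Jensen-Shannon divergence. The depth bound and shape encoding below are our choices, not necessarily the paper's:

```python
import ast
import math
from collections import Counter

def subtree_distribution(code, depth=2):
    # Relative frequencies of depth-bounded AST subtree shapes.
    def shape(node, d):
        if d == 0:
            return type(node).__name__
        kids = ",".join(shape(c, d - 1) for c in ast.iter_child_nodes(node))
        return f"{type(node).__name__}({kids})"
    counts = Counter(shape(n, depth) for n in ast.walk(ast.parse(code)))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def js_divergence(p, q):
    # Jensen-Shannon divergence (base 2, so bounded by 1).
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0) * math.log2(a.get(k, 0) / m[k])
                   for k in keys if a.get(k, 0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

d = js_divergence(subtree_distribution("def f(x):\n    return x + 1"),
                  subtree_distribution("def f(x):\n    y = x + 1\n    return y"))
```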
Disentangling concept semantics via multilingual averaging in Sparse Autoencoders
Authors: Cliff O'Reilly, Ernesto Jimenez-Ruiz, Tillman Weyde
2025-08-19
Connecting LLMs with formal knowledge representation and reasoning is a promising approach to address their shortcomings. Embeddings and sparse autoencoders are widely used to represent textual content, but the semantics are entangled with syntactic and language-specific information. We propose a method that isolates concept semantics in Large Language Models by averaging concept activations derived via Sparse Autoencoders. We create English text representations from OWL ontology classes, translate the English into French and Chinese, and then pass these texts as prompts to the Gemma 2B LLM. Using the open-source Gemma Scope suite of Sparse Autoencoders, we obtain concept activations for each class and language version. We average the different language activations to derive a conceptual average. We then correlate the conceptual averages with a ground-truth mapping between ontology classes. Our results give a strong indication that the conceptual average aligns with the true relationship between classes better than any single language by itself. The result hints at a new technique that enables mechanistic interpretation of internal network states with higher accuracy.
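The core operation, averaging one concept's SAE activations across language versions of the same prompt, is a one-liner; the comparison step then works on the averaged vectors. Shapes and the cosine comparison below are illustrative assumptions (the SAE width matches Gemma Scope's common 16k configuration, but the paper may use another):

```python
import numpy as np

def conceptual_average(acts_by_lang):
    # Average SAE feature activations across languages, washing out
    # language-specific components and keeping the shared concept signal.
    return np.mean(np.stack(acts_by_lang), axis=0)

rng = np.random.default_rng(0)
en, fr, zh = (rng.random(16384) for _ in range(3))  # one vector per language
avg = conceptual_average([en, fr, zh])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
# Pairwise cosine over conceptual averages gives class-to-class scores,
# which are then correlated with the ground-truth ontology mapping.
```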
Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper
Authors: Krishna Garg, Firoz Shaikh, Sambaran Bandyopadhyay, Cornelia Caragea
2025-08-19
As researchers increasingly adopt LLMs as writing assistants, generating high-quality research paper introductions remains both challenging and essential. We introduce Scientific Introduction Generation (SciIG), a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works. Curating new datasets from NAACL 2025 and ICLR 2025 papers, we assess five state-of-the-art models, including both open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source (GPT-4o) systems, across multiple dimensions: lexical overlap, semantic similarity, content coverage, faithfulness, consistency, citation correctness, and narrative quality. Our comprehensive framework combines automated metrics with LLM-as-a-judge evaluations. Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness. Moreover, three-shot prompting consistently outperforms fewer-shot approaches. These findings provide practical insights into developing effective research writing assistants and set realistic expectations for LLM-assisted academic writing. To foster reproducibility and future research, we will publicly release all code and datasets.
GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
Authors: Ken Deng, Yunhan Yang, Jingxiang Sun, Xihui Liu, Yebin Liu, Ding Liang, Yan-Pei Cao
2025-08-19
Modern 3D generation methods can rapidly create shapes from sparse or single views, but their outputs often lack geometric detail due to computational constraints. We present DetailGen3D, a generative approach specifically designed to enhance these generated 3D shapes. Our key insight is to model the coarse-to-fine transformation directly through data-dependent flows in latent space, avoiding the computational overhead of large-scale 3D generative models. We introduce a token matching strategy that ensures accurate spatial correspondence during refinement, enabling local detail synthesis while preserving global structure. By carefully designing our training data to match the characteristics of synthesized coarse shapes, our method can effectively enhance shapes produced by various 3D generation and reconstruction approaches, from single-view to sparse multi-view inputs. Extensive experiments demonstrate that DetailGen3D achieves high-fidelity geometric detail synthesis while maintaining efficiency in training.
LLM-Powered Virtual Patient Agents for Interactive Clinical Skills Training with Automated Feedback
Authors: Henrik Voigt, Yurina Sugamiya, Kai Lawonn, Sina Zarrieß, Atsuo Takanishi
2025-08-19
Objective Structured Clinical Examinations (OSCEs) are essential for medical training, but they require significant resources, including professional actors and expert medical feedback. Although Large Language Models (LLMs) have introduced text-based virtual patients for communication practice, these simulations often lack the capability for richer, non-textual interactions. This paper presents a novel framework that significantly enhances LLM-based simulated patients by equipping them with action spaces, thereby enabling more realistic and dynamic patient behaviors that extend beyond text. Furthermore, our system incorporates virtual tutors that provide students with instant, personalized feedback on their performance at any time during these simulated encounters. We have conducted a rigorous evaluation of the framework's real-time performance, including system latency and component accuracy. Preliminary evaluations with medical experts assessed the naturalness and coherence of the simulated patients, as well as the usefulness and appropriateness of the virtual tutor's assessments. This innovative system provides medical students with a low-cost, accessible platform for personalized OSCE preparation at home.
LLMind 2.0: Distributed IoT Automation with Natural Language M2M Communication and Lightweight LLM Agents
Authors: Yuyang Du, Qun Yang, Liujianfu Wang, Jingqi Lin, Hongwei Cui, Soung Chang Liew
2025-08-19
Recent advances in large language models (LLMs) have sparked interest in their application to IoT and automation systems, particularly for facilitating device management through natural language instructions. However, existing centralized approaches face significant scalability challenges when managing and coordinating the collaboration between IoT devices of diverse capabilities in large-scale heterogeneous IoT systems. This paper introduces LLMind 2.0, a distributed IoT automation framework that addresses the scalability challenges through lightweight LLM-empowered device agents communicating via natural language-based machine-to-machine (M2M) messaging. Unlike previous LLM-controlled automation systems that rely on a centralized coordinator to generate device-specific code to be executed on individual devices, LLMind 2.0 distributes intelligence across individual devices through lightweight LLMs embedded in IoT devices. The central coordinator translates human instructions into simple subtasks described in natural human language, which are then processed by device-specific agents to generate device-specific code locally at the associated devices. This approach transcends device heterogeneity barriers by using natural language as a unified communication medium, enabling seamless collaboration between devices from different manufacturers. The system incorporates several key innovations: a Retrieval-Augmented Generation (RAG) mechanism for accurate subtask-to-API mapping, fine-tuned lightweight LLMs for reliable code generation, and a finite state machine-based task execution framework. Experimental validation in multi-robot warehouse scenarios and real-world WiFi network deployments demonstrates significant improvements in scalability, reliability, and privacy protection compared to the centralized approach.
Prompt-Based One-Shot Exact Length-Controlled Generation with LLMs
Authors: Juncheng Xie, Hung-yi Lee
2025-08-19
Controlling the length of text produced by large language models (LLMs) remains challenging: models frequently overshoot or undershoot explicit length instructions because they cannot reliably keep an internal token count. We present a prompt-based, one-shot strategy that compels an off-the-shelf LLM to generate exactly a desired number of tokens - words (English) or characters (Chinese) - without any fine-tuning or iterative sampling. The prompt appends countdown markers and explicit counting rules so that the model "writes while counting." We evaluate on four settings: open-ended generation (1-1000 tokens), XSUM summarization, MT-Bench-LI instruction following, and the LIFEBENCH equal-length track. On MT-Bench-LI, strict length compliance with GPT-4.1 leaps from below 30% under naive prompts to above 95% with our countdown prompt, surpassing the popular draft-then-revise baseline, while judged answer quality is preserved. These results show that precise length control can be achieved through prompt engineering alone, offering a lightweight alternative to training- or decoding-based methods.
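The abstract does not reproduce the template, so the helper below is a hypothetical reconstruction of a countdown-marker prompt in the spirit described ("writes while counting"); the exact wording, marker format, and counting rules in the paper may differ:

```python
def countdown_prompt(task: str, n_words: int) -> str:
    # Hypothetical template: the model is told to emit a decreasing
    # counter before every word so it tracks length as it writes.
    return (
        f"{task}\n"
        f"Write exactly {n_words} words. Before each word, print the number "
        f"of words remaining in the form [k], counting down from "
        f"[{n_words}] to [1], and stop immediately after the word that "
        f"follows [1].\n"
        f"Example for 3 words: [3] Roses [2] are [1] red."
    )

print(countdown_prompt("Describe the ocean.", 25))
```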
Communication-Efficient Federated Learning with Adaptive Number of Participants
Authors: Sergey Skorik, Vladislav Dorofeev, Gleb Molodtsov, Aram Avetisyan, Dmitry Bylinkin, Daniil Medyakov, Aleksandr Beznosikov
2025-08-19
Rapid scaling of deep learning models has enabled performance gains across domains, yet it has introduced several challenges. Federated Learning (FL) has emerged as a promising framework to address these concerns by enabling decentralized training. Nevertheless, communication efficiency remains a key bottleneck in FL, particularly under heterogeneous and dynamic client participation. Existing methods, such as FedAvg and FedProx, or other approaches, including client selection strategies, attempt to mitigate communication costs. However, the problem of choosing the number of clients in a training round remains extremely underexplored. We introduce Intelligent Selection of Participants (ISP), an adaptive mechanism that dynamically determines the optimal number of clients per round to enhance communication efficiency without compromising model accuracy. We validate the effectiveness of ISP across diverse setups, including vision transformers, real-world ECG classification, and training with gradient compression. Our results show consistent communication savings of up to 30% without losing final model quality. Applying ISP to different real-world ECG classification setups highlights the selection of the number of clients as a separate task of federated learning.
CRISP: Persistent Concept Unlearning via Sparse Autoencoders
Authors: Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov
2025-08-19
As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
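For orientation, the inference-time intervention that CRISP makes persistent looks roughly like the sketch below: zero (or damp) the activations of the SAE features identified as salient for the target concept before decoding back to the residual stream. CRISP's actual contribution, baking this into the parameters, is not shown; indices and shapes are hypothetical:

```python
import numpy as np

def suppress_features(sae_acts, salient_idx, scale=0.0):
    # Damp the salient concept features; scale=0.0 removes them outright.
    acts = sae_acts.copy()
    acts[..., salient_idx] *= scale
    return acts

# Hypothetical usage: features 123 and 456 carry the unwanted concept.
acts = np.random.default_rng(0).random((4, 16384))  # tokens x SAE features
clean = suppress_features(acts, salient_idx=[123, 456])
```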
Interpreting the Interpreter: Can We Model post-ECB Conferences Volatility with LLM Agents?
Authors: Umberto Collodel
2025-08-19
This paper develops a novel method to simulate financial market reactions to European Central Bank (ECB) press conferences using a Large Language Model (LLM). We create a behavioral, agent-based simulation of 30 synthetic traders, each with distinct risk preferences, cognitive biases, and interpretive styles. These agents forecast Euro interest rate swap levels at 3-month, 2-year, and 10-year maturities, with the variation across forecasts serving as a measure of market uncertainty or disagreement. We evaluate three prompting strategies, naive, few-shot (enriched with historical data), and an advanced iterative 'LLM-as-a-Judge' framework, to assess the effect of prompt design on predictive performance. Even the naive approach generates a strong correlation (roughly 0.5) between synthetic disagreement and actual market outcomes, particularly for longer-term maturities. The LLM-as-a-Judge framework further improves accuracy at the first iteration. These results demonstrate that LLM-driven simulations can capture interpretive uncertainty beyond traditional measures, providing central banks with a practical tool to anticipate market reactions, refine communication strategies, and enhance financial stability.
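The disagreement measure itself is just the dispersion of the agents' forecasts per maturity. A minimal sketch with made-up numbers (the paper uses 30 agents; five are shown for brevity, and the dispersion statistic is our assumption):

```python
import numpy as np

# One forecast per synthetic trader for each swap maturity (percent levels).
forecasts = {
    "3m":  np.array([3.61, 3.58, 3.65, 3.60, 3.59]),
    "2y":  np.array([3.10, 3.25, 2.95, 3.18, 3.05]),
    "10y": np.array([2.80, 3.10, 2.65, 3.00, 2.90]),
}

# Disagreement = cross-agent dispersion, the series that is then
# correlated with realized post-conference market outcomes.
disagreement = {m: float(np.std(v)) for m, v in forecasts.items()}
```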
A Comparative Study of Decoding Strategies in Medical Text Generation
Authors: Oriana Presacan, Alireza Nik, Vajira Thambawita, Bogdan Ionescu, Michael Riegler
2025-08-19
Large Language Models (LLMs) rely on various decoding strategies to generate text, and these choices can significantly affect output quality. In healthcare, where accuracy is critical, the impact of decoding strategies remains underexplored. We investigate this effect in five open-ended medical tasks, including translation, summarization, question answering, dialogue, and image captioning, evaluating 11 decoding strategies with medically specialized and general-purpose LLMs of different sizes. Our results show that deterministic strategies generally outperform stochastic ones: beam search achieves the highest scores, while η and top-k sampling perform worst. Slower decoding methods tend to yield better quality. Larger models achieve higher scores overall but have longer inference times and are no more robust to decoding choices. Surprisingly, while medical LLMs outperform general ones in two of the five tasks, statistical analysis shows no overall performance advantage and reveals greater sensitivity to decoding choice. We further compare multiple evaluation metrics and find that correlations vary by task, with MAUVE showing weak agreement with BERTScore and ROUGE, as well as greater sensitivity to the decoding strategy. These results highlight the need for careful selection of decoding methods in medical applications, as their influence can sometimes exceed that of model choice.
LLM-Enhanced Linear Autoencoders for Recommendation
Authors: Jaewan Moon, Seongmin Park, Jongwuk Lee
2025-08-19
Large language models (LLMs) have been widely adopted to enrich the semantic representation of textual item information in recommender systems. However, existing linear autoencoders (LAEs) that incorporate textual information rely on sparse word co-occurrence patterns, limiting their ability to capture rich textual semantics. To address this, we propose L3AE, the first integration of LLMs into the LAE framework. L3AE effectively integrates the heterogeneous knowledge of textual semantics and user-item interactions through a two-phase optimization strategy. (i) L3AE first constructs a semantic item-to-item correlation matrix from LLM-derived item representations. (ii) It then learns an item-to-item weight matrix from collaborative signals while distilling semantic item correlations as regularization. Notably, each phase of L3AE is optimized through closed-form solutions, ensuring global optimality and computational efficiency. Extensive experiments demonstrate that L3AE consistently outperforms state-of-the-art LLM-enhanced models on three benchmark datasets, achieving gains of 27.6% in Recall@20 and 39.3% in NDCG@20. The source code is available at https://github.com/jaewan7599/L3AE_CIKM2025.
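For context on the closed-form claim: the classic linear-autoencoder recommender (EASE) already admits a one-shot solution, and L3AE's two phases are described as closed-form as well. The sketch below shows only that standard EASE-style base, with the LLM-derived semantic regularizer omitted; it is background, not the authors' exact formulation:

```python
import numpy as np

def ease_closed_form(X, lam=100.0):
    # X: binary user-item interaction matrix (users x items).
    G = X.T @ X + lam * np.eye(X.shape[1])   # regularized Gram matrix
    P = np.linalg.inv(G)
    B = -P / np.diag(P)                      # B[i, j] = -P[i, j] / P[j, j]
    np.fill_diagonal(B, 0.0)                 # forbid trivial self-similarity
    return B                                 # item-to-item weights

rng = np.random.default_rng(0)
X = (rng.random((1000, 200)) < 0.05).astype(float)
B = ease_closed_form(X)
scores = X @ B                               # recommendation scores
```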
ALIGN: Word Association Learning for Cross-Cultural Generalization in Large Language Models
Authors: Chunhua Liu, Kabir Manandhar Shrestha, Sukai Huang
2025-08-19
As large language models (LLMs) increasingly mediate cross-cultural communication, their behavior still reflects the distributional bias of the languages and viewpoints that are over-represented in their pre-training corpora. Yet it remains a challenge to model and align culture due to limited cultural knowledge and a lack of exploration into effective learning approaches. We introduce a cost-efficient, cognitively grounded remedy: parameter-efficient fine-tuning on native speakers' free word-association norms, which encode implicit cultural schemas. Leveraging English-US and Mandarin associations from the Small-World-of-Words project, we adapt Llama-3.1-8B and Qwen-2.5-7B via supervised fine-tuning (SFT) and PPO-based preference optimization. SFT boosts held-out association Precision at 5 by 16-20% in English and 43-165% in Mandarin, lifts median concreteness by +0.20, and attains human-level valence and arousal. These lexical gains transfer: on World-Values-Survey questions, fine-tuned models shift answer distributions toward the target culture, and on a 50-item high-tension subset, Qwen's Chinese-aligned responses double while Llama's US bias drops by one-third. Our 7-8B models rival or beat vanilla 70B baselines, showing that a few million culture-grounded associations can instill value alignment without costly retraining. Our work highlights both the promise of, and the need for, future research grounded in human cognition for improving cultural alignment in AI models.
Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis
Authors: Ayoub Ben Chaliah, Hela Dellagi
2025-08-18
We present Datarus-R1-14B, a 14B-parameter open-weights language model fine-tuned from Qwen 2.5-14B-Instruct to act as a virtual data analyst and graduate-level problem solver. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories including reasoning steps, code execution, error traces, self-corrections, and final conclusions, all captured in a ReAct-style notebook format spanning finance, medicine, numerical analysis, and other quantitative domains. Our training pipeline combines (i) a trajectory-centric synthetic data generator that yielded 144,000 tagged notebook episodes, (ii) a dual-reward framework blending a lightweight tag-based structural signal with a Hierarchical Reward Model (HRM) that scores both single-step soundness and end-to-end coherence, and (iii) a memory-optimized implementation of Group Relative Policy Optimization (GRPO) featuring KV-cache reuse, sequential generation, and reference-model sharding. A cosine curriculum smoothly shifts emphasis from structural fidelity to semantic depth, reducing the format collapse and verbosity that often plague RL-aligned LLMs. A central design choice in Datarus is its dual reasoning interface: in agentic mode the model produces ReAct-tagged steps that invoke Python tools to execute real code; in reflection mode it outputs compact Chain-of-Thought (CoT) traces delimited by special tags.
Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks
Authors: Ruofan Lu, Yichen Li, Yintong Huo
2025-08-18
Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rates without systematically analyzing the interactions, communication mechanisms, and failure causes within these systems. To bridge this gap, we present a benchmark of 34 representative programmable tasks designed to rigorously assess autonomous agents. Using this benchmark, we evaluate three popular open-source agent frameworks combined with two LLM backbones, observing a task completion rate of approximately 50%. Through in-depth failure analysis, we develop a three-tier taxonomy of failure causes aligned with task phases, highlighting planning errors, task execution issues, and incorrect response generation. Based on these insights, we propose actionable improvements to enhance agent planning and self-diagnosis capabilities. Our failure taxonomy, together with mitigation advice, provides an empirical foundation for developing more robust and effective autonomous agent systems in the future.