2025-08-22
Table of Contents
- Communication Efficient LLM Pre-training with SparseLoCo
- Amortized In-Context Mixed Effect Transformer Models: A Zero-Shot Approach for Pharmacokinetics
- Efficient Mixed-Precision Large Language Model Inference with TurboMind
- Deep Equilibrium Convolutional Sparse Coding for Hyperspectral Image Denoising
- LGMSNet: Thinning a medical image segmentation model via dual-level multiscale fusion
- DiagECG: An LLM-Driven Framework for Diagnostic Reasoning via Discretized ECG Tokenization
- Exploring Scaling Laws of CTR Model for Online Performance Improvement
- MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion
- LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text
- GasTwinFormer: A Hybrid Vision Transformer for Livestock Methane Emission Segmentation and Dietary Classification in Optical Gas Imaging
- Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner
- Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
- MissionHD: Data-Driven Refinement of Reasoning Graph Structure through Hyperdimensional Causal Path Encoding and Decoding
- Multiscale Video Transformers for Class Agnostic Segmentation in Autonomous Driving
- Improving in-context learning with a better scoring function
- Can LLM Agents Solve Collaborative Tasks? A Study on Urgency-Aware Planning and Coordination
- Deep Skin Lesion Segmentation with Transformer-CNN Fusion Toward Intelligent Skin Cancer Analysis
- DeepTelecom: A Digital-Twin Deep Learning Dataset for Channel and MIMO Applications
- Taming Transformer for Emotion-Controllable Talking Face Generation
- SurveyGen-I: Consistent Scientific Survey Generation with Evolving Plans and Memory-Guided Writing
- Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS
- GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation
- Pixels to Play: A Foundation Model for 3D Gameplay
- Measuring LLM Code Generation Stability via Structural Entropy
- Disentangling concept semantics via multilingual averaging in Sparse Autoencoders
- Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper
- GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
- LLM-Powered Virtual Patient Agents for Interactive Clinical Skills Training with Automated Feedback
- LLMind 2.0: Distributed IoT Automation with Natural Language M2M Communication and Lightweight LLM Agents
- Prompt-Based One-Shot Exact Length-Controlled Generation with LLMs
- Communication-Efficient Federated Learning with Adaptive Number of Participants
- CRISP: Persistent Concept Unlearning via Sparse Autoencoders
- Interpreting the Interpreter: Can We Model post-ECB Conferences Volatility with LLM Agents?
- A Comparative Study of Decoding Strategies in Medical Text Generation
- LLM-Enhanced Linear Autoencoders for Recommendation
- ALIGN: Word Association Learning for Cross-Cultural Generalization in Large Language Models
- Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis
- Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks
Communication Efficient LLM Pre-training with SparseLoCo
Authors: Amir Sarfi, Benjamin Thérien, Joel Lidin, Eugene Belilovsky
2025-08-21
Communication-efficient distributed training algorithms have received considerable interest recently due to their benefits for training Large Language Models (LLMs) in bandwidth-constrained settings, such as across data centers and over the internet. Despite reducing communication frequency, these methods still typically require communicating a full copy of the model's gradients, resulting in a communication bottleneck even for cross-datacenter links. Furthermore, they can slightly degrade performance compared to a naive AdamW DDP baseline. While quantization and error feedback are often applied to reduce the pseudo-gradient's size, in the context of LLM pre-training, existing approaches have been unable to additionally leverage sparsification and have obtained limited compression. In this work, we introduce SparseLoCo, a communication-efficient training algorithm for LLMs that effectively leverages Top-k sparsification and quantization to reach extreme compression ratios of up to 1-3% sparsity and 2-bit quantization while outperforming full-precision DiLoCo. Our key observations are that outer momentum can be locally approximated by error feedback combined with aggressive sparsification, and that sparse aggregation can actually improve model performance. We empirically demonstrate in a range of communication-constrained LLM training settings that SparseLoCo provides significant benefits in both performance and communication cost.
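The communication step the abstract describes, Top-k sparsification with local error feedback, fits in a few lines. The sketch below is an illustrative reconstruction under our own assumptions (the function name and the 2% default are ours, not the authors' API), and it omits the quantization and sparse-aggregation parts:

```python
import numpy as np

def topk_with_error_feedback(pseudo_grad, error_buffer, k_fraction=0.02):
    # Fold the residual left over from the previous round back in.
    corrected = pseudo_grad + error_buffer
    # Keep only the k largest-magnitude entries (e.g. 1-3% sparsity).
    k = max(1, int(k_fraction * corrected.size))
    flat = corrected.ravel()
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    sparse_flat = np.zeros_like(flat)
    sparse_flat[keep] = flat[keep]
    sparse = sparse_flat.reshape(corrected.shape)
    # Everything that was dropped becomes the next round's error feedback,
    # which the abstract notes can locally approximate outer momentum.
    return sparse, corrected - sparse
```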
Amortized In-Context Mixed Effect Transformer Models: A Zero-Shot Approach for Pharmacokinetics
Authors: César Ali Ojeda Marin, Wilhelm Huisinga, Purity Kavwele, Niklas Hartung
2025-08-21
Accurate dose-response forecasting under sparse sampling is central to precision pharmacotherapy. We present the Amortized In-Context Mixed-Effect Transformer (AICMET) model, a transformer-based latent-variable framework that unifies mechanistic compartmental priors with amortized in-context Bayesian inference. AICMET is pre-trained on hundreds of thousands of synthetic pharmacokinetic trajectories with Ornstein-Uhlenbeck priors over the parameters of compartment models, endowing the model with strong inductive biases and enabling zero-shot adaptation to new compounds. At inference time, the decoder conditions on the collective context of previously profiled trial participants, generating calibrated posterior predictions for newly enrolled patients after a few early drug concentration measurements. This capability collapses traditional model-development cycles from weeks to hours while preserving some degree of expert modelling. Experiments across public datasets show that AICMET attains state-of-the-art predictive accuracy and faithfully quantifies inter-patient variability -- outperforming both nonlinear mixed-effects baselines and recent neural ODE variants. Our results highlight the feasibility of transformer-based, population-aware neural architectures as offering a new alternative for bespoke pharmacokinetic modeling pipelines, charting a path toward truly population-aware personalized dosing regimens.
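To make the pre-training data concrete: a synthetic trajectory of the kind described pairs a compartment model with an Ornstein-Uhlenbeck prior on its parameters. The sketch below simulates one such trajectory for a one-compartment bolus model with an OU-perturbed log elimination rate; all constants are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def ou_path(theta, mu, sigma, x0, dt, n):
    # Euler-Maruyama simulation of dX = theta * (mu - X) dt + sigma dW.
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        x[t] = x[t-1] + theta * (mu - x[t-1]) * dt + sigma * np.sqrt(dt) * rng.normal()
    return x

# One synthetic patient: dose, volume, and OU constants are made up.
dt, n = 0.1, 240                                  # hours, number of steps
log_ke = ou_path(theta=0.5, mu=np.log(0.1), sigma=0.05,
                 x0=np.log(0.1), dt=dt, n=n)      # log elimination rate
dose, volume = 500.0, 40.0                        # mg, litres
conc = np.empty(n)
conc[0] = dose / volume                           # bolus at t = 0
for t in range(1, n):
    conc[t] = conc[t-1] * np.exp(-np.exp(log_ke[t]) * dt)  # elimination
```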
Efficient Mixed-Precision Large Language Model Inference with TurboMind
Authors: Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, Kai Chen
2025-08-21
Mixed-precision inference techniques reduce the memory and computational demands of Large Language Models (LLMs) by applying hybrid precision formats to model weights, activations, and KV caches. This work introduces mixed-precision inference techniques that encompass (i) systematic memory and compute optimization across hierarchical storage and tensor core architectures, and (ii) comprehensive end-to-end mixed-precision optimization across diverse precision formats and hardware configurations. Our approach features two novel mixed-precision pipelines designed for optimal hardware utilization: a General Matrix Multiply (GEMM) pipeline that optimizes matrix operations through offline weight packing and online dequantization, and an attention pipeline that enables efficient attention computation with arbitrary Query, Key, and Value precision combinations. The key implementation of the pipelines includes (i) hardware-aware weight packing for automatic format optimization, (ii) adaptive head alignment for efficient attention computation, (iii) instruction-level parallelism for memory hierarchy exploitation, and (iv) a memory loading pipeline for enhanced inference efficiency. We conduct comprehensive evaluations across 16 popular LLMs and 4 representative GPU architectures. Results demonstrate that our approach achieves up to 61% lower decoding latency (30% on average) and up to 156% higher throughput (58% on average) in mixed-precision workloads compared to existing mixed-precision frameworks, establishing consistent performance improvements across all tested configurations and hardware types. This work is integrated into TurboMind, a high-performance inference engine of the LMDeploy project, which is open-sourced and publicly available at https://github.com/InternLM/lmdeploy.
Deep Equilibrium Convolutional Sparse Coding for Hyperspectral Image Denoising
Authors: Jin Ye, Jingran Wang, Fengchao Xiong, Jingzhou Chen, Yuntao Qian
2025-08-21
Hyperspectral images (HSIs) play a crucial role in remote sensing but are often degraded by complex noise patterns. Ensuring the physical properties of the denoised HSIs is vital for robust HSI denoising, giving rise to deep unfolding-based methods. However, these methods map the optimization of a physical model to a learnable network with a predefined depth, which lacks convergence guarantees. In contrast, Deep Equilibrium (DEQ) models treat the hidden layers of deep networks as the solution to a fixed-point problem and model them as infinite-depth networks, naturally consistent with the optimization. Under the framework of DEQ, we propose a Deep Equilibrium Convolutional Sparse Coding (DECSC) framework that unifies local spatial-spectral correlations, nonlocal spatial self-similarities, and global spatial consistency for robust HSI denoising. Within the convolutional sparse coding (CSC) framework, we enforce a shared 2D convolutional sparse representation to ensure global spatial consistency across bands, while an unshared 3D convolutional sparse representation captures local spatial-spectral details. To further exploit nonlocal self-similarities, a transformer block is embedded after the 2D CSC. Additionally, a detail enhancement module is integrated with the 3D CSC to promote image detail preservation. We formulate the proximal gradient descent of the CSC model as a fixed-point problem and transform the iterative updates into a learnable network architecture within the framework of DEQ. Experimental results demonstrate that our DECSC method achieves superior denoising performance compared to state-of-the-art methods.
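The fixed-point view is easy to see on a plain (non-convolutional) sparse-coding problem: the proximal-gradient (ISTA) update is a map whose fixed point is the solution, which is exactly what a DEQ layer solves for. A minimal sketch, assuming a dense dictionary D rather than the paper's convolutional operators:

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista_fixed_point(D, y, lam=0.1, step=None, tol=1e-8, max_iter=5000):
    # Solve min_z 0.5*||y - D z||^2 + lam*||z||_1 by iterating the
    # proximal-gradient map until it reaches its fixed point z*.
    # A DEQ layer treats z* itself as the layer output and
    # backpropagates through the fixed point implicitly.
    if step is None:
        step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1/L, spectral norm squared
    z = np.zeros(D.shape[1])
    for _ in range(max_iter):
        z_next = soft_threshold(z - step * D.T @ (D @ z - y), step * lam)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 128))
y = rng.normal(size=64)
z_star = ista_fixed_point(D, y)
```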
LGMSNet: Thinning a medical image segmentation model via dual-level multiscale fusion
Authors: Chengqi Dong, Fenghe Tang, Rongge Mao, Xinpei Gao, S. Kevin Zhou
2025-08-21
Medical image segmentation plays a pivotal role in disease diagnosis and treatment planning, particularly in resource-constrained clinical settings where lightweight and generalizable models are urgently needed. However, existing lightweight models often compromise performance for efficiency and rarely adopt computationally expensive attention mechanisms, severely restricting their global contextual perception capabilities. Additionally, current architectures neglect the channel redundancy issue under identical convolutional kernels in medical imaging, which hinders effective feature extraction. To address these challenges, we propose LGMSNet, a novel lightweight framework based on local and global dual-level multiscale fusion that achieves state-of-the-art performance with minimal computational overhead. LGMSNet employs heterogeneous intra-layer kernels to extract local high-frequency information while mitigating channel redundancy. In addition, the model integrates transformer-convolutional hybrid branches to capture low-frequency global information. Extensive experiments across six public datasets demonstrate LGMSNet's superiority over existing state-of-the-art methods. In particular, LGMSNet maintains exceptional performance in zero-shot generalization tests on four unseen datasets, underscoring its potential for real-world deployment in resource-limited medical scenarios. The whole project code is at https://github.com/cq-dong/LGMSNet.
DiagECG: An LLM-Driven Framework for Diagnostic Reasoning via Discretized ECG Tokenization
Authors: Jinning Yang, Wen Shi
2025-08-21
Electrocardiography plays a central role in cardiovascular diagnostics, yet existing automated approaches often struggle to generalize across clinical tasks and offer limited support for open-ended reasoning. We present DiagECG, a novel framework that integrates time-series and language modeling by enabling large language models to process 12-lead ECG signals for clinical text generation tasks. Our approach discretizes continuous ECG embeddings into symbolic tokens using a lead-independent encoder and quantization module. These tokens are then used to extend the vocabulary of the LLM, allowing the model to handle both ECG and natural language inputs in a unified manner. To bridge the modality gap, we pretrain the model on an autoregressive ECG forecasting task, enabling the LLM to model temporal dynamics using its native language modeling capabilities. Finally, we perform instruction tuning on both ECG question answering and diagnostic report generation. Without modifying the core model, DiagECG achieves strong performance across tasks while maintaining generalization to out-of-distribution settings. Extensive experiments demonstrate the effectiveness of each component and highlight the potential of integrating symbolic ECG representations into LLMs for medical reasoning.
Exploring Scaling Laws of CTR Model for Online Performance Improvement
Authors: Weijiang Lai, Beihong Jin, Jiongyan Zhang, Yiyuan Zheng, Jian Dong, Jia Cheng, Jun Lei, Xingxing Wang
2025-08-21
CTR models play a vital role in improving user experience and boosting business revenue in many online personalized services. However, current CTR models generally encounter bottlenecks in performance improvement. Inspired by the scaling law phenomenon of LLMs, we propose a new paradigm for improving CTR predictions: first, constructing a CTR model with accuracy scalable to the model grade and data size, and then distilling the knowledge implied in this model into a lightweight model that can serve online users. To put it into practice, we construct a CTR model named SUAN (Stacked Unified Attention Network). In SUAN, we propose the UAB as a behavior sequence encoder. A single UAB unifies the modeling of sequential and non-sequential features and also measures the importance of each user behavior feature from multiple perspectives. Stacked UABs elevate the configuration to a high grade, paving the way for performance improvement. To benefit from the high performance of the high-grade SUAN while avoiding the disadvantage of its long inference time, we modify the SUAN with sparse self-attention and parallel inference strategies to form LightSUAN, and then adopt online distillation to train the low-grade LightSUAN, taking a high-grade SUAN as a teacher. The distilled LightSUAN has superior performance but the same inference time as the LightSUAN, making it well-suited for online deployment. Experimental results show that SUAN performs exceptionally well and holds to scaling laws spanning three orders of magnitude in model grade and data size, and the distilled LightSUAN outperforms the SUAN configured one grade higher. More importantly, the distilled LightSUAN has been integrated into an online service, increasing the CTR by 2.81% and CPM by 1.69% while keeping the average inference time acceptable. Our source code is available at https://github.com/laiweijiang/SUAN.
MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion
Authors: Xuyang Chen, Zhijun Zhai, Kaixuan Zhou, Zengmao Wang, Jianan He, Dong Wang, Yanfeng Zhang, mingwei Sun, Rüdiger Westermann, Konrad Schindler, Liqiu Meng
2025-08-21
Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. To address this, this paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models as the geometric prior. While image and video diffusion models can leverage spatial layouts (such as depth maps or HD maps) as control conditions to generate street-level perspective views, they are not directly applicable to 3D scene generation. Video diffusion models excel at synthesizing consistent view sequences that depict scenes but often struggle to adhere to predefined camera paths or to align accurately with rendered control videos. In contrast, image diffusion models, though unable to guarantee cross-view visual consistency, can produce more geometry-aligned results when combined with ControlNet. Building on this insight, our approach enhances image diffusion models by improving cross-view consistency. The pipeline comprises three key stages: first, we generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; second, we propagate denser intermediate views via a component dubbed AGInpaint; and third, we globally eliminate visual inconsistencies (e.g., varying exposure) using the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed by initializing Gaussian balls on the mesh surface. Our method outperforms existing approaches in both geometric alignment and generation quality. Once synthesized, the scene can be rendered in diverse styles through relighting and style transfer techniques.
LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text
Authors: MohamamdJavad Ardestani, Ehsan Kamalloo, Davood Rafiei
2025-08-20
The completeness of machine-generated text, ensuring that it captures all relevant information, is crucial in domains such as medicine and law and in tasks like list-based question answering (QA), where omissions can have serious consequences. However, existing recall metrics often depend on lexical matching, leading to errors with unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods with long holistic prompts capture broader semantics but remain prone to misalignment and hallucinations without structured verification. We introduce LongRecall, a general three-stage recall evaluation framework that decomposes answers into self-contained facts, successively narrows plausible candidate matches through lexical and semantic filtering, and verifies their alignment through structured entailment checks. This design reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, serving as a foundational building block for systematic recall assessment. We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines.
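The three-stage design maps directly onto a small evaluation loop. Below is a skeleton under our own assumptions: fact decomposition (stage 1) is assumed to have happened upstream, and the lexical, semantic, and entailment checks are caller-supplied placeholders (e.g., token overlap, embedding cosine, an NLI model); the thresholds are illustrative, not the paper's:

```python
from typing import Callable, List

def long_recall(answer_facts: List[str], gold_facts: List[str],
                lexical_sim: Callable[[str, str], float],
                semantic_sim: Callable[[str, str], float],
                entails: Callable[[str, str], bool],
                lex_tau: float = 0.2, sem_tau: float = 0.6) -> float:
    hits = 0
    for gold in gold_facts:
        # Stage 2: narrow candidates with a cheap lexical filter,
        # then a semantic one.
        cands = [f for f in answer_facts if lexical_sim(gold, f) >= lex_tau]
        cands = [f for f in cands if semantic_sim(gold, f) >= sem_tau]
        # Stage 3: structured entailment check on the survivors.
        if any(entails(f, gold) for f in cands):
            hits += 1
    return hits / max(1, len(gold_facts))   # recall over gold facts
```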
GasTwinFormer: A Hybrid Vision Transformer for Livestock Methane Emission Segmentation and Dietary Classification in Optical Gas Imaging
Authors: Toqi Tahamid Sarker, Mohamed Embaby, Taminul Islam, Amer AbuGhazaleh, Khaled R Ahmed
2025-08-20
Livestock methane emissions represent 32% of human-caused methane production, making automated monitoring critical for climate mitigation strategies. We introduce GasTwinFormer, a hybrid vision transformer for real-time methane emission segmentation and dietary classification in optical gas imaging through a novel Mix Twin encoder alternating between spatially-reduced global attention and locally-grouped attention mechanisms. Our architecture incorporates a lightweight LR-ASPP decoder for multi-scale feature aggregation and enables simultaneous methane segmentation and dietary classification in a unified framework. We contribute the first comprehensive beef cattle methane emission dataset using OGI, containing 11,694 annotated frames across three dietary treatments. GasTwinFormer achieves 74.47% mIoU and 83.63% mF1 for segmentation while maintaining exceptional efficiency with only 3.348M parameters, 3.428G FLOPs, and 114.9 FPS inference speed. Additionally, our method achieves perfect dietary classification accuracy (100%), demonstrating the effectiveness of leveraging diet-emission correlations. Extensive ablation studies validate each architectural component, establishing GasTwinFormer as a practical solution for real-time livestock emission monitoring. Please see our project page at gastwinformer.github.io.
Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner
Authors: Bolian Li, Yanran Wu, Xinyu Luo, Ruqi Zhang
2025-08-20
Aligning large language models (LLMs) with human preferences has become a critical step in their development. Recent research has increasingly focused on test-time alignment, where additional compute is allocated during inference to enhance LLM safety and reasoning capabilities. However, these test-time alignment techniques often incur substantial inference costs, limiting their practical application. We are inspired by the speculative sampling algorithm, which leverages a small draft model to efficiently predict future tokens, to address the efficiency bottleneck of test-time alignment. We introduce the Reward-Shifted Speculative Sampling (SSS) algorithm, in which the draft model is aligned with human preferences while the target model remains unchanged. We theoretically demonstrate that the distributional shift between the aligned draft model and the unaligned target model can be exploited to recover the RLHF optimal solution without actually obtaining it, by modifying the acceptance criterion and bonus token distribution. Our algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments, thereby validating both its effectiveness and efficiency.
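For reference, the vanilla speculative sampling step that SSS builds on looks like the sketch below: the draft proposes a token, the target accepts it with probability min(1, p/q), and rejections are resampled from the residual distribution. SSS's contribution, shifting the acceptance rule and bonus distribution toward the reward-aligned draft, is deliberately not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target, q_draft):
    # Draft proposes x ~ q; accept with prob min(1, p(x)/q(x)).
    x = rng.choice(len(q_draft), p=q_draft)
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x                                  # accepted draft token
    # Otherwise resample from the normalized residual max(0, p - q).
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual)

p = np.array([0.5, 0.3, 0.2])   # target next-token distribution
q = np.array([0.6, 0.2, 0.2])   # draft next-token distribution
token = speculative_step(p, q)
```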
Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
Authors: Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun
2025-08-20
Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based language models. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to quantization, as they make it difficult to preserve precision for the majority of values. More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. All code and experimental setups will be released to support the community.
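Why outliers hurt is easy to demonstrate with the simplest PTQ scheme, per-tensor absmax quantization: a single extreme activation inflates the scale, so ordinary values keep only a few effective levels. A self-contained illustration on synthetic data (not the paper's setup):

```python
import numpy as np

def absmax_quantize(x, bits=8):
    # Per-tensor absmax quantization: the scale is set by the single
    # largest magnitude, so one outlier crushes everyone else's resolution.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).astype(np.int32), scale

acts = np.random.default_rng(0).normal(0, 1, 4096)
acts[0] = 80.0                            # a single extreme activation
q, scale = absmax_quantize(acts, bits=8)
dequant = q * scale
mean_err = np.abs(dequant - acts)[1:].mean()  # error on ordinary values
```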
MissionHD: Data-Driven Refinement of Reasoning Graph Structure through Hyperdimensional Causal Path Encoding and Decoding
Authors: Sanggeon Yun, Raheeb Hassan, Ryozo Masukawa, Mohsen Imani
2025-08-20
Reasoning graphs from Large Language Models (LLMs) are often misaligned with downstream visual tasks such as video anomaly detection (VAD). Existing Graph Structure Refinement (GSR) methods are ill-suited for these novel, dataset-less graphs. We introduce Data-driven GSR (D-GSR), a new paradigm that directly optimizes graph structure using downstream task data, and propose MissionHD, a hyperdimensional computing (HDC) framework to operationalize it. MissionHD uses an efficient encode-decode process to refine the graph, guided by the downstream task signal. Experiments on challenging VAD and VAR benchmarks show significant performance improvements when using our refined graphs, validating our approach as an effective pre-processing step.
Multiscale Video Transformers for Class Agnostic Segmentation in Autonomous Driving
Authors: Leila Cheshmi, Mennatullah Siam
2025-08-20
Ensuring safety in autonomous driving is a complex challenge requiring handling of unknown objects and unforeseen driving scenarios. We develop multiscale video transformers capable of detecting unknown objects using only motion cues. Video semantic and panoptic segmentation often relies on known classes seen during training, overlooking novel categories. Recent visual grounding with large language models is computationally expensive, especially for pixel-level output. We propose an efficient video transformer trained end-to-end for class-agnostic segmentation without optical flow. Our method uses multi-stage multiscale query-memory decoding and a scale-specific random drop-token scheme to ensure efficiency and accuracy, maintaining detailed spatiotemporal features with a shared, learnable memory module. Unlike conventional decoders that compress features, our memory-centric design preserves high-resolution information at multiple scales. We evaluate on DAVIS'16, KITTI, and Cityscapes. Our method consistently outperforms multiscale baselines while being efficient in GPU memory and run-time, demonstrating a promising direction for real-time, robust dense prediction in safety-critical robotics.
Improving in-context learning with a better scoring function
Authors: Omar Naim, Swarnadeep Bhar, Jérôme Bolte, Nicholas Asher
2025-08-20
Large language models (LLMs) exhibit a remarkable capacity to learn by analogy, known as in-context learning (ICL). However, recent studies have revealed limitations in this ability. In this paper, we examine these limitations on tasks involving first-order quantifiers such as "all" and "some", as well as on ICL with linear functions. We identify Softmax, the scoring function in the attention mechanism, as a contributing factor to these constraints. To address this, we propose scaled signed averaging (SSA), a novel alternative to Softmax. Empirical results show that SSA dramatically improves performance on our target tasks. Furthermore, we evaluate both encoder-only and decoder-only transformer models with SSA, demonstrating that they match or exceed their Softmax-based counterparts across a variety of linguistic probing tasks.
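The abstract does not give SSA's formula, so the following is only one plausible reading of a "signed averaging" alternative: keep each score's sign and normalize by the total absolute score mass instead of exponentiating. Treat it as a sketch of the design space, not the paper's definition:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def scaled_signed_average(s, eps=1e-9):
    # Weights keep the sign of the raw scores and are not squashed
    # through exp; their absolute values sum to 1.
    return s / (np.abs(s).sum() + eps)

scores = np.array([2.0, -1.0, 0.5])
attn_softmax = softmax(scores)            # all positive, sums to 1
attn_ssa = scaled_signed_average(scores)  # signed, |weights| sum to 1
```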
Can LLM Agents Solve Collaborative Tasks? A Study on Urgency-Aware Planning and Coordination
Authors: João Vitor de Carvalho Silva, Douglas G. Macharet
2025-08-20
The ability to coordinate actions across multiple agents is critical for solving complex, real-world problems. Large Language Models (LLMs) have shown strong capabilities in communication, planning, and reasoning, raising the question of whether they can also support effective collaboration in multi-agent settings. In this work, we investigate the use of LLM agents to solve a structured victim rescue task that requires division of labor, prioritization, and cooperative planning. Agents operate in a fully known graph-based environment and must allocate resources to victims with varying needs and urgency levels. We systematically evaluate their performance using a suite of coordination-sensitive metrics, including task success rate, redundant actions, room conflicts, and urgency-weighted efficiency. This study offers new insights into the strengths and failure modes of LLMs in physically grounded multi-agent collaboration tasks, contributing to future benchmarks and architectural improvements.
Deep Skin Lesion Segmentation with Transformer-CNN Fusion Toward Intelligent Skin Cancer Analysis
Authors: Xin Wang, Xiaopei Zhang, Xingang Wang
2025-08-20
This paper proposes a high-precision semantic segmentation method based on an improved TransUNet architecture to address the challenges of complex lesion structures, blurred boundaries, and significant scale variations in skin lesion images. The method integrates a transformer module into the traditional encoder-decoder framework to model global semantic information, while retaining a convolutional branch to preserve local texture and edge features. This enhances the model's ability to perceive fine-grained structures. A boundary-guided attention mechanism and a multi-scale upsampling path are also designed to improve lesion boundary localization and segmentation consistency. To verify the effectiveness of the approach, a series of experiments were conducted, including comparative studies, hyperparameter sensitivity analysis, data augmentation effects, input resolution variation, and training data split ratio tests. Experimental results show that the proposed model outperforms existing representative methods in mIoU, mDice, and mAcc, demonstrating stronger lesion recognition accuracy and robustness. In particular, the model achieves better boundary reconstruction and structural recovery in complex scenarios, making it well-suited for the key demands of automated segmentation tasks in skin lesion analysis.
DeepTelecom: A Digital-Twin Deep Learning Dataset for Channel and MIMO Applications
Authors: Bohao Wang, Zehua Jiang, Zhenyu Yang, Chongwen Huang, Yongliang Shen, Siming Jiang, Chen Zhu, Zhaohui Yang, Richeng Jin, Zhaoyang Zhang, Sami Muhaidat, Merouane Debbah
2025-08-20
Domain-specific datasets are the foundation for unleashing artificial intelligence (AI)-driven wireless innovation. Yet existing wireless AI corpora are slow to produce, offer limited modeling fidelity, and cover only narrow scenario types. To address these challenges, we create DeepTelecom, a three-dimensional (3D) digital-twin channel dataset. Specifically, a large language model (LLM)-assisted pipeline first builds level-of-detail-3 (LoD3) outdoor and indoor scenes with segmentable, material-parameterizable surfaces. Then, DeepTelecom simulates full radio-wave propagation effects based on Sionna's ray-tracing engine. Leveraging GPU acceleration, DeepTelecom streams ray-path trajectories and real-time signal-strength heat maps, compiles them into high-frame-rate videos, and simultaneously outputs synchronized multi-view images, channel tensors, and multi-scale fading traces. By efficiently streaming large-scale, high-fidelity, and multimodal channel data, DeepTelecom not only furnishes a unified benchmark for wireless AI research but also supplies the domain-rich training substrate that enables foundation models to tightly fuse large model intelligence with future communication systems.
Taming Transformer for Emotion-Controllable Talking Face Generation
Authors: Ziqi Zhang, Cheng Deng
2025-08-20
Talking face generation is a novel and challenging generation task, aiming at synthesizing a vivid speaking-face video given a specific audio. To fulfill emotion-controllable talking face generation, current methods need to overcome two challenges: one is how to effectively model the multimodal relationship related to the specific emotion, and the other is how to leverage this relationship to synthesize identity-preserving emotional videos. In this paper, we propose a novel method to tackle the emotion-controllable talking face generation task discretely. Specifically, we employ two pre-training strategies to disentangle audio into independent components and to quantize videos into combinations of visual tokens. Subsequently, we propose the emotion-anchor (EA) representation that integrates the emotional information into visual tokens. Finally, we introduce an autoregressive transformer to model the global distribution of the visual tokens under the given conditions and further predict the index sequence for synthesizing the manipulated videos. We conduct experiments on the MEAD dataset, controlling the emotion of generated videos conditioned on multiple emotional audios. Extensive experiments demonstrate the superiority of our method both qualitatively and quantitatively.
SurveyGen-I: Consistent Scientific Survey Generation with Evolving Plans and Memory-Guided Writing
Authors: Jing Chen, Zhiheng Yang, Yixian Shen, Jie Liu, Adam Belloum, Chrysa Papagainni, Paola Grosso
2025-08-20
Survey papers play a critical role in scientific communication by consolidating progress across a field. Recent advances in Large Language Models (LLMs) offer a promising solution by automating key steps in the survey-generation pipeline, such as retrieval, structuring, and summarization. However, existing LLM-based approaches often struggle with maintaining coherence across long, multi-section surveys and providing comprehensive citation coverage. To address these limitations, we introduce SurveyGen-I, an automatic survey generation framework that combines coarse-to-fine retrieval, adaptive planning, and memory-guided generation. SurveyGen-I first performs survey-level retrieval to construct the initial outline and writing plan, and then dynamically refines both during generation through a memory mechanism that stores previously written content and terminology, ensuring coherence across subsections. When the system detects insufficient context, it triggers fine-grained subsection-level retrieval. Experiments across four scientific domains demonstrate that SurveyGen-I consistently outperforms previous works in content quality, consistency, and citation coverage.
Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS
Authors: Can Jin, Yang Zhou, Qixin Zhang, Hongwu Peng, Di Zhang, Marco Pavone, Ligong Han, Zhang-Wei Hong, Tong Che, Dimitris N. Metaxas
2025-08-19
Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generated labels and often degrade under distribution shifts. In this paper, we introduce AIRL-S, the first natural unification of RL-based and search-based TTS. Central to AIRL-S is the insight that the reward function learned during RL training inherently represents the ideal PRM for guiding downstream search. Specifically, we leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces, entirely eliminating the need for labeled intermediate process data. At inference, the resulting PRM simultaneously serves as the critic for RL rollouts and as a heuristic to effectively guide search procedures, facilitating robust reasoning chain extension, mitigating reward hacking, and enhancing cross-task generalization. Experimental results across eight benchmarks, including mathematics, scientific reasoning, and code generation, demonstrate that our unified approach improves performance by 9% on average over the base model, matching GPT-4o. Furthermore, when integrated into multiple search algorithms, our PRM consistently outperforms all baseline PRMs trained with labeled data. These results underscore that, indeed, your reward function for RL is your best PRM for search, providing a robust and cost-effective solution to complex reasoning tasks in LLMs.
GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation
Authors: Amirmohsen Sattarifard, Sepehr Lavasani, Ehsan Imani, Kunlin Zhang, Hanlin Xu, Fengyu Sun, Negar Hassanpour, Chao Gao
2025-08-19
Deploying Large Language Models (LLMs) on edge hardware demands aggressive, prompt-aware dynamic sparsification to reduce computation without degrading quality. Static or predictor-based schemes either lock in a single sparsity pattern or incur extra runtime overhead, and recent zero-shot methods that rely on statistics from a single prompt fail in short-prompt and/or long-generation scenarios. We introduce A/I-GLASS: Activation- and Impact-based Global-Local neural importance Aggregation for feed-forward network SparSification, two training-free methods that dynamically select FFN units using a rank-aggregation of prompt-local and model-intrinsic global neuron statistics. Empirical results across multiple LLMs and benchmarks demonstrate that GLASS significantly outperforms prior training-free methods, particularly in challenging long-form generation scenarios, without relying on auxiliary predictors or adding any inference overhead.
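The rank-aggregation step admits a compact sketch: rank FFN units under a prompt-local statistic and a global one, combine the ranks, and keep the top fraction. Everything below (the statistic choice, equal weighting, and 50% keep ratio) is our illustrative assumption rather than GLASS's actual rule:

```python
import numpy as np

def select_ffn_units(local_stat, global_stat, keep_ratio=0.5):
    # Rank 0 = most important under each statistic.
    local_rank = np.argsort(np.argsort(-local_stat))
    global_rank = np.argsort(np.argsort(-global_stat))
    combined = (local_rank + global_rank) / 2.0   # aggregate the two ranks
    k = max(1, int(keep_ratio * combined.size))
    return np.argsort(combined)[:k]               # indices of units to keep

rng = np.random.default_rng(0)
local = rng.random(11008)    # e.g. prompt-local activation magnitudes
global_ = rng.random(11008)  # e.g. model-intrinsic importance scores
kept = select_ffn_units(local, global_)
```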
Pixels to Play: A Foundation Model for 3D Gameplay
Authors: Yuguang Yue, Chris Green, Samuel Hunt, Irakli Salia, Wenzhe Shi, Jonathan J Hunt
2025-08-19
We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior. Motivated by emerging consumer and developer use cases - AI teammates, controllable NPCs, personalized live-streamers, assistive testers - we argue that an agent must rely on the same pixel stream available to players and generalize to new titles with minimal game-specific engineering. P2P0.1 is trained end-to-end with behavior cloning: labeled demonstrations collected from instrumented human game-play are complemented by unlabeled public videos, to which we impute actions via an inverse-dynamics model. A decoder-only transformer with auto-regressive action output handles the large action space while remaining latency-friendly on a single consumer GPU. We report qualitative results showing competent play across simple Roblox and classic MS-DOS titles, present ablations on unlabeled data, and outline the scaling and evaluation steps required to reach expert-level, text-conditioned control.
Measuring LLM Code Generation Stability via Structural Entropy
Authors: Yewei Song, Tiezhu Sun, Xunzhu Tang, Prateek Rajput, Tegawende F. Bissyande, Jacques Klein
2025-08-19
Assessing the stability of code generation from large language models (LLMs) is essential for judging their reliability in real-world development. We extend prior "structural-entropy concepts" to the program domain by pairing entropy with abstract syntax tree (AST) analysis. For any fixed prompt, we collect the multiset of depth-bounded AST subtrees in each generated program and treat their relative frequencies as a probability distribution. We then measure stability in two complementary ways: (i) Jensen-Shannon divergence, a symmetric, bounded indicator of structural divergence, and (ii) a Structural Cross-Entropy ratio that highlights missing high-probability patterns. Both metrics admit structural-only and token-aware variants, enabling separate views on control-flow shape and identifier-level variability. Unlike pass@k, BLEU, or CodeBLEU, our metrics are reference-free, language-agnostic, and execution-independent. We benchmark several leading LLMs on standard code generation tasks, demonstrating that AST-driven structural entropy reveals nuances in model consistency and robustness. The method runs in O(n·d) time with no external tests, providing a lightweight addition to the code-generation evaluation toolkit.
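The pipeline is simple enough to sketch end to end for Python programs: hash depth-bounded subtree shapes with the standard ast module, normalize counts into a distribution, and compare two generations with Jensen-Shannon divergence. The depth bound and shape encoding below are our choices, not necessarily the paper's:

```python
import ast
import math
from collections import Counter

def subtree_distribution(code, depth=2):
    # Relative frequencies of depth-bounded AST subtree shapes.
    def shape(node, d):
        if d == 0:
            return type(node).__name__
        kids = ",".join(shape(c, d - 1) for c in ast.iter_child_nodes(node))
        return f"{type(node).__name__}({kids})"
    counts = Counter(shape(n, depth) for n in ast.walk(ast.parse(code)))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def js_divergence(p, q):
    # Jensen-Shannon divergence (base 2, so bounded by 1).
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0) * math.log2(a.get(k, 0) / m[k])
                   for k in keys if a.get(k, 0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

d = js_divergence(subtree_distribution("def f(x):\n    return x + 1"),
                  subtree_distribution("def f(x):\n    y = x + 1\n    return y"))
```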
Disentangling concept semantics via multilingual averaging in Sparse Autoencoders
Authors: Cliff O'Reilly, Ernesto Jimenez-Ruiz, Tillman Weyde
2025-08-19
Connecting LLMs with formal knowledge representation and reasoning is a promising approach to address their shortcomings. Embeddings and sparse autoencoders are widely used to represent textual content, but the semantics are entangled with syntactic and language-specific information. We propose a method that isolates concept semantics in Large Language Models by averaging concept activations derived via Sparse Autoencoders. We create English text representations from OWL ontology classes, translate the English into French and Chinese, and then pass these texts as prompts to the Gemma 2B LLM. Using the open-source Gemma Scope suite of Sparse Autoencoders, we obtain concept activations for each class and language version. We average the different language activations to derive a conceptual average. We then correlate the conceptual averages with a ground-truth mapping between ontology classes. Our results give a strong indication that the conceptual average aligns with the true relationship between classes better than any single language by itself. The result hints at a new technique that enables mechanistic interpretation of internal network states with higher accuracy.
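The core operation, averaging one concept's SAE activations across language versions of the same prompt, is a one-liner; the comparison step then works on the averaged vectors. Shapes and the cosine comparison below are illustrative assumptions (the SAE width matches Gemma Scope's common 16k configuration, but the paper may use another):

```python
import numpy as np

def conceptual_average(acts_by_lang):
    # Average SAE feature activations across languages, washing out
    # language-specific components and keeping the shared concept signal.
    return np.mean(np.stack(acts_by_lang), axis=0)

rng = np.random.default_rng(0)
en, fr, zh = (rng.random(16384) for _ in range(3))  # one vector per language
avg = conceptual_average([en, fr, zh])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
# Pairwise cosine over conceptual averages gives class-to-class scores,
# which are then correlated with the ground-truth ontology mapping.
```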
Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper
Authors: Krishna Garg, Firoz Shaikh, Sambaran Bandyopadhyay, Cornelia Caragea
2025-08-19
As researchers increasingly adopt LLMs as writing assistants, generating high-quality research paper introductions remains both challenging and essential. We introduce Scientific Introduction Generation (SciIG), a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works. Curating new datasets from NAACL 2025 and ICLR 2025 papers, we assess five state-of-the-art models, including both open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source (GPT-4o) systems, across multiple dimensions: lexical overlap, semantic similarity, content coverage, faithfulness, consistency, citation correctness, and narrative quality. Our comprehensive framework combines automated metrics with LLM-as-a-judge evaluations. Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness. Moreover, three-shot prompting consistently outperforms fewer-shot approaches. These findings provide practical insights into developing effective research writing assistants and set realistic expectations for LLM-assisted academic writing. To foster reproducibility and future research, we will publicly release all code and datasets.
GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
Authors: Ken Deng, Yunhan Yang, Jingxiang Sun, Xihui Liu, Yebin Liu, Ding Liang, Yan-Pei Cao
2025-08-19
Modern 3D generation methods can rapidly create shapes from sparse or single views, but their outputs often lack geometric detail due to computational constraints. We present DetailGen3D, a generative approach specifically designed to enhance these generated 3D shapes. Our key insight is to model the coarse-to-fine transformation directly through data-dependent flows in latent space, avoiding the computational overhead of large-scale 3D generative models. We introduce a token matching strategy that ensures accurate spatial correspondence during refinement, enabling local detail synthesis while preserving global structure. By carefully designing our training data to match the characteristics of synthesized coarse shapes, our method can effectively enhance shapes produced by various 3D generation and reconstruction approaches, from single-view to sparse multi-view inputs. Extensive experiments demonstrate that DetailGen3D achieves high-fidelity geometric detail synthesis while maintaining efficiency in training.
LLM-Powered Virtual Patient Agents for Interactive Clinical Skills Training with Automated Feedback
Authors: Henrik Voigt, Yurina Sugamiya, Kai Lawonn, Sina Zarrieß, Atsuo Takanishi
2025-08-19
Objective Structured Clinical Examinations (OSCEs) are essential for medical training, but they require significant resources, including professional actors and expert medical feedback. Although Large Language Models (LLMs) have introduced text-based virtual patients for communication practice, these simulations often lack the capability for richer, non-textual interactions. This paper presents a novel framework that significantly enhances LLM-based simulated patients by equipping them with action spaces, thereby enabling more realistic and dynamic patient behaviors that extend beyond text. Furthermore, our system incorporates virtual tutors that provide students with instant, personalized feedback on their performance at any time during these simulated encounters. We have conducted a rigorous evaluation of the framework's real-time performance, including system latency and component accuracy. Preliminary evaluations with medical experts assessed the naturalness and coherence of the simulated patients, as well as the usefulness and appropriateness of the virtual tutor's assessments. This innovative system provides medical students with a low-cost, accessible platform for personalized OSCE preparation at home.
LLMind 2.0: Distributed IoT Automation with Natural Language M2M Communication and Lightweight LLM Agents
Authors: Yuyang Du, Qun Yang, Liujianfu Wang, Jingqi Lin, Hongwei Cui, Soung Chang Liew
2025-08-19
Recent advances in large language models (LLMs) have sparked interest in their application to IoT and automation systems, particularly for facilitating device management through natural language instructions. However, existing centralized approaches face significant scalability challenges when managing and coordinating the collaboration between IoT devices of diverse capabilities in large-scale heterogeneous IoT systems. This paper introduces LLMind 2.0, a distributed IoT automation framework that addresses the scalability challenges through lightweight LLM-empowered device agents communicating via natural language-based machine-to-machine (M2M) messaging. Unlike previous LLM-controlled automation systems that rely on a centralized coordinator to generate device-specific code to be executed on individual devices, LLMind 2.0 distributes intelligence across individual devices through lightweight LLMs embedded in IoT devices. The central coordinator translates human instructions into simple subtasks described in natural human language, which are then processed by device-specific agents to generate device-specific code locally at the associated devices. This approach transcends device heterogeneity barriers by using natural language as a unified communication medium, enabling seamless collaboration between devices from different manufacturers. The system incorporates several key innovations: a Retrieval-Augmented Generation (RAG) mechanism for accurate subtask-to-API mapping, fine-tuned lightweight LLMs for reliable code generation, and a finite state machine-based task execution framework. Experimental validation in multi-robot warehouse scenarios and real-world WiFi network deployments demonstrates significant improvements in scalability, reliability, and privacy protection compared to the centralized approach.
Prompt-Based One-Shot Exact Length-Controlled Generation with LLMs
Authors: Juncheng Xie, Hung-yi Lee
2025-08-19
Controlling the length of text produced by large language models (LLMs) remains challenging: models frequently overshoot or undershoot explicit length instructions because they cannot reliably keep an internal token count. We present a prompt-based, one-shot strategy that compels an off-the-shelf LLM to generate exactly a desired number of tokens - words (English) or characters (Chinese) - without any fine-tuning or iterative sampling. The prompt appends countdown markers and explicit counting rules so that the model "writes while counting." We evaluate on four settings: open-ended generation (1-1000 tokens), XSUM summarization, MT-Bench-LI instruction following, and the LIFEBENCH equal-length track. On MT-Bench-LI, strict length compliance with GPT-4.1 leaps from below 30% under naive prompts to above 95% with our countdown prompt, surpassing the popular draft-then-revise baseline, while judged answer quality is preserved. These results show that precise length control can be achieved through prompt engineering alone, offering a lightweight alternative to training- or decoding-based methods.
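The abstract does not reproduce the template, so the helper below is a hypothetical reconstruction of a countdown-marker prompt in the spirit described ("writes while counting"); the exact wording, marker format, and counting rules in the paper may differ:

```python
def countdown_prompt(task: str, n_words: int) -> str:
    # Hypothetical template: the model is told to emit a decreasing
    # counter before every word so it tracks length as it writes.
    return (
        f"{task}\n"
        f"Write exactly {n_words} words. Before each word, print the number "
        f"of words remaining in the form [k], counting down from "
        f"[{n_words}] to [1], and stop immediately after the word that "
        f"follows [1].\n"
        f"Example for 3 words: [3] Roses [2] are [1] red."
    )

print(countdown_prompt("Describe the ocean.", 25))
```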
Communication-Efficient Federated Learning with Adaptive Number of Participants
Authors: Sergey Skorik, Vladislav Dorofeev, Gleb Molodtsov, Aram Avetisyan, Dmitry Bylinkin, Daniil Medyakov, Aleksandr Beznosikov
2025-08-19
Rapid scaling of deep learning models has enabled performance gains across domains, yet it has introduced several challenges. Federated Learning (FL) has emerged as a promising framework to address these concerns by enabling decentralized training. Nevertheless, communication efficiency remains a key bottleneck in FL, particularly under heterogeneous and dynamic client participation. Existing methods, such as FedAvg and FedProx, or other approaches, including client selection strategies, attempt to mitigate communication costs. However, the problem of choosing the number of clients in a training round remains extremely underexplored. We introduce Intelligent Selection of Participants (ISP), an adaptive mechanism that dynamically determines the optimal number of clients per round to enhance communication efficiency without compromising model accuracy. We validate the effectiveness of ISP across diverse setups, including vision transformers, real-world ECG classification, and training with gradient compression. Our results show consistent communication savings of up to 30% without losing final model quality. Applying ISP to different real-world ECG classification setups highlights the selection of the number of clients as a separate task of federated learning.
CRISP: Persistent Concept Unlearning via Sparse Autoencoders
Authors: Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov
2025-08-19
As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
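For orientation, the inference-time intervention that CRISP makes persistent looks roughly like the sketch below: zero (or damp) the activations of the SAE features identified as salient for the target concept before decoding back to the residual stream. CRISP's actual contribution, baking this into the parameters, is not shown; indices and shapes are hypothetical:

```python
import numpy as np

def suppress_features(sae_acts, salient_idx, scale=0.0):
    # Damp the salient concept features; scale=0.0 removes them outright.
    acts = sae_acts.copy()
    acts[..., salient_idx] *= scale
    return acts

# Hypothetical usage: features 123 and 456 carry the unwanted concept.
acts = np.random.default_rng(0).random((4, 16384))  # tokens x SAE features
clean = suppress_features(acts, salient_idx=[123, 456])
```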
Interpreting the Interpreter: Can We Model post-ECB Conferences Volatility with LLM Agents?
Authors: Umberto Collodel
2025-08-19
This paper develops a novel method to simulate financial market reactions to European Central Bank (ECB) press conferences using a Large Language Model (LLM). We create a behavioral, agent-based simulation of 30 synthetic traders, each with distinct risk preferences, cognitive biases, and interpretive styles. These agents forecast Euro interest rate swap levels at 3-month, 2-year, and 10-year maturities, with the variation across forecasts serving as a measure of market uncertainty or disagreement. We evaluate three prompting strategies, naive, few-shot (enriched with historical data), and an advanced iterative 'LLM-as-a-Judge' framework, to assess the effect of prompt design on predictive performance. Even the naive approach generates a strong correlation (roughly 0.5) between synthetic disagreement and actual market outcomes, particularly for longer-term maturities. The LLM-as-a-Judge framework further improves accuracy at the first iteration. These results demonstrate that LLM-driven simulations can capture interpretive uncertainty beyond traditional measures, providing central banks with a practical tool to anticipate market reactions, refine communication strategies, and enhance financial stability.
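The disagreement measure itself is just the dispersion of the agents' forecasts per maturity. A minimal sketch with made-up numbers (the paper uses 30 agents; five are shown for brevity, and the dispersion statistic is our assumption):

```python
import numpy as np

# One forecast per synthetic trader for each swap maturity (percent levels).
forecasts = {
    "3m":  np.array([3.61, 3.58, 3.65, 3.60, 3.59]),
    "2y":  np.array([3.10, 3.25, 2.95, 3.18, 3.05]),
    "10y": np.array([2.80, 3.10, 2.65, 3.00, 2.90]),
}

# Disagreement = cross-agent dispersion, the series that is then
# correlated with realized post-conference market outcomes.
disagreement = {m: float(np.std(v)) for m, v in forecasts.items()}
```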
A Comparative Study of Decoding Strategies in Medical Text Generation
Authors: Oriana Presacan, Alireza Nik, Vajira Thambawita, Bogdan Ionescu, Michael Riegler
2025-08-19
Large Language Models (LLMs) rely on various decoding strategies to generate text, and these choices can significantly affect output quality. In healthcare, where accuracy is critical, the impact of decoding strategies remains underexplored. We investigate this effect in five open-ended medical tasks, including translation, summarization, question answering, dialogue, and image captioning, evaluating 11 decoding strategies with medically specialized and general-purpose LLMs of different sizes. Our results show that deterministic strategies generally outperform stochastic ones: beam search achieves the highest scores, while η and top-k sampling perform worst. Slower decoding methods tend to yield better quality. Larger models achieve higher scores overall but have longer inference times and are no more robust to decoding choices. Surprisingly, while medical LLMs outperform general ones in two of the five tasks, statistical analysis shows no overall performance advantage and reveals greater sensitivity to decoding choice. We further compare multiple evaluation metrics and find that correlations vary by task, with MAUVE showing weak agreement with BERTScore and ROUGE, as well as greater sensitivity to the decoding strategy. These results highlight the need for careful selection of decoding methods in medical applications, as their influence can sometimes exceed that of model choice.
LLM-Enhanced Linear Autoencoders for Recommendation
Authors: Jaewan Moon, Seongmin Park, Jongwuk Lee
2025-08-19
Large language models (LLMs) have been widely adopted to enrich the semantic representation of textual item information in recommender systems. However, existing linear autoencoders (LAEs) that incorporate textual information rely on sparse word co-occurrence patterns, limiting their ability to capture rich textual semantics. To address this, we propose L3AE, the first integration of LLMs into the LAE framework. L3AE effectively integrates the heterogeneous knowledge of textual semantics and user-item interactions through a two-phase optimization strategy. (i) L3AE first constructs a semantic item-to-item correlation matrix from LLM-derived item representations. (ii) It then learns an item-to-item weight matrix from collaborative signals while distilling semantic item correlations as regularization. Notably, each phase of L3AE is optimized through closed-form solutions, ensuring global optimality and computational efficiency. Extensive experiments demonstrate that L3AE consistently outperforms state-of-the-art LLM-enhanced models on three benchmark datasets, achieving gains of 27.6% in Recall@20 and 39.3% in NDCG@20. The source code is available at https://github.com/jaewan7599/L3AE_CIKM2025.
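For context on the closed-form claim: the classic linear-autoencoder recommender (EASE) already admits a one-shot solution, and L3AE's two phases are described as closed-form as well. The sketch below shows only that standard EASE-style base, with the LLM-derived semantic regularizer omitted; it is background, not the authors' exact formulation:

```python
import numpy as np

def ease_closed_form(X, lam=100.0):
    # X: binary user-item interaction matrix (users x items).
    G = X.T @ X + lam * np.eye(X.shape[1])   # regularized Gram matrix
    P = np.linalg.inv(G)
    B = -P / np.diag(P)                      # B[i, j] = -P[i, j] / P[j, j]
    np.fill_diagonal(B, 0.0)                 # forbid trivial self-similarity
    return B                                 # item-to-item weights

rng = np.random.default_rng(0)
X = (rng.random((1000, 200)) < 0.05).astype(float)
B = ease_closed_form(X)
scores = X @ B                               # recommendation scores
```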
ALIGN: Word Association Learning for Cross-Cultural Generalization in Large Language Models
Authors: Chunhua Liu, Kabir Manandhar Shrestha, Sukai Huang
2025-08-19
As large language models (LLMs) increasingly mediate cross-cultural communication, their behavior still reflects the distributional bias of the languages and viewpoints that are over-represented in their pre-training corpora. Yet it remains a challenge to model and align culture due to limited cultural knowledge and a lack of exploration into effective learning approaches. We introduce a cost-efficient, cognitively grounded remedy: parameter-efficient fine-tuning on native speakers' free word-association norms, which encode implicit cultural schemas. Leveraging English-US and Mandarin associations from the Small-World-of-Words project, we adapt Llama-3.1-8B and Qwen-2.5-7B via supervised fine-tuning (SFT) and PPO-based preference optimization. SFT boosts held-out association Precision at 5 by 16-20% in English and 43-165% in Mandarin, lifts median concreteness by +0.20, and attains human-level valence and arousal. These lexical gains transfer: on World-Values-Survey questions, fine-tuned models shift answer distributions toward the target culture, and on a 50-item high-tension subset, Qwen's Chinese-aligned responses double while Llama's US bias drops by one-third. Our 7-8B models rival or beat vanilla 70B baselines, showing that a few million culture-grounded associations can instill value alignment without costly retraining. Our work highlights both the promise of, and the need for, future research grounded in human cognition for improving cultural alignment in AI models.
Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis
Authors: Ayoub Ben Chaliah, Hela Dellagi
2025-08-18
We present Datarus-R1-14B, a 14B-parameter open-weights language model fine-tuned from Qwen 2.5-14B-Instruct to act as a virtual data analyst and graduate-level problem solver. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories including reasoning steps, code execution, error traces, self-corrections, and final conclusions, all captured in a ReAct-style notebook format spanning finance, medicine, numerical analysis, and other quantitative domains. Our training pipeline combines (i) a trajectory-centric synthetic data generator that yielded 144,000 tagged notebook episodes, (ii) a dual-reward framework blending a lightweight tag-based structural signal with a Hierarchical Reward Model (HRM) that scores both single-step soundness and end-to-end coherence, and (iii) a memory-optimized implementation of Group Relative Policy Optimization (GRPO) featuring KV-cache reuse, sequential generation, and reference-model sharding. A cosine curriculum smoothly shifts emphasis from structural fidelity to semantic depth, reducing the format collapse and verbosity that often plague RL-aligned LLMs. A central design choice in Datarus is its dual reasoning interface: in agentic mode the model produces ReAct-tagged steps that invoke Python tools to execute real code; in reflection mode it outputs compact Chain-of-Thought (CoT) traces delimited by special tags.
Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks
Authors: Ruofan Lu, Yichen Li, Yintong Huo
2025-08-18
Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rates without systematically analyzing the interactions, communication mechanisms, and failure causes within these systems. To bridge this gap, we present a benchmark of 34 representative programmable tasks designed to rigorously assess autonomous agents. Using this benchmark, we evaluate three popular open-source agent frameworks combined with two LLM backbones, observing a task completion rate of approximately 50%. Through in-depth failure analysis, we develop a three-tier taxonomy of failure causes aligned with task phases, highlighting planning errors, task execution issues, and incorrect response generation. Based on these insights, we propose actionable improvements to enhance agent planning and self-diagnosis capabilities. Our failure taxonomy, together with mitigation advice, provides an empirical foundation for developing more robust and effective autonomous agent systems in the future.