2025-08-01

DepMicroDiff Diffusion-Based Dependency-Aware Multimodal Imputation for Microbiome Data
MemoCue Empowering LLM-Based Agents for Human Memory Recall via Strategy-Guided Querying
Trae Agent An LLM-based Agent for Software Engineering with Test-time Scaling
Unveiling Super Experts in Mixture-of-Experts Large Language Models
Model Directions, Not Words Mechanistic Topic Models Using Sparse Autoencoders
MoCHA Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention
Modality-Aware Feature Matching A Comprehensive Review of Single- and Cross-Modality Techniques
trAIce3D A Prompt-Driven Transformer Based U-Net for Semantic Segmentation of Microglial Cells from Large-Scale 3D Microscopy Images
Language Arithmetics Towards Systematic Language Neuron Identification and Manipulation
MetaAgent Automatically Constructing Multi-Agent Systems Based on Finite State Machines
Multi-modal Relational Item Representation Learning for Inferring Substitutable and Complementary Items
CTG-Insight A Multi-Agent Interpretable LLM Framework for Cardiotocography Analysis and Classification
Persona-Augmented Benchmarking Evaluating LLMs Across Diverse Writing Styles
IntentFlow Interactive Support for Communicating Intent with LLMs in Writing Tasks
Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models
EDGE-GRPO Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity
Unlocking Interpretability for RF Sensing A Complex-Valued White-Box Transformer
MOR-VIT Efficient Vision Transformer with Mixture-of-Recursions
Enhancing Graph-based Recommendations with Majority-Voting LLM-Rerank Augmentation
TriangleMix A Lossless and Efficient Attention Pattern for Long Context Prefilling
Large Language Models for Wireless Communications From Adaptation to Autonomy
Learning to Imitate with Less Efficient Individual Behavior Modeling in Chess
An LLM Driven Agent Framework for Automated Infrared Spectral Multi Task Reasoning
Transmission With Machine Language Tokens A Paradigm for Task-Oriented Agent Communication
MindChat Enhancing BCI Spelling with Large Language Models in Realistic Scenarios
Automated HEMT Model Construction from Datasheets via Multi-Modal Intelligence and Prior-Knowledge-Free Optimization
ReGATE Learning Faster and Better with Fewer Tokens in MLLMs
Enhancing and Accelerating Brain MRI through Deep Learning Reconstruction Using Prior Subject-Specific Imaging
Agentic Web Weaving the Next Web with AI Agents
SmallThinker A Family of Efficient Large Language Models Natively Trained for Local Deployment
The Importance of Facial Features in Vision-based Sign Language Recognition Eyes, Mouth or Full Face?
Latent Inter-User Difference Modeling for LLM Personalization
METEOR Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
Advancing Compositional LLM Reasoning with Structured Task Relations in Interactive Multimodal Communications
Beyond Class Tokens LLM-guided Dominant Property Mining for Few-shot Classification
Advancing Shared and Multi-Agent Autonomy in Underwater Missions Integrating Knowledge Graphs and Retrieval-Augmented Generation
Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation
What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations
Modeling Professionalism in Expert Questioning through Linguistic Differentiation
FAEDKV Infinite-Window Fourier Transform for Unbiased KV Cache Compression
The Carbon Cost of Conversation, Sustainability in the Age of Language Models
CaliDrop KV Cache Compression with Calibration
CrossPL Evaluating Large Language Models on Cross Programming Language Code Generation
AgentMesh A Cooperative Multi-Agent Generative AI Framework for Software Development Automation
HCAttention Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
Large Language Model Agent for Structural Drawing Generation Using ReAct Prompt Engineering and Retrieval Augmented Generation
LowKeyEMG Electromyographic typing with a reduced keyset
Towards Inclusive NLP Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks
"X of Information" Continuum A Survey on AI-Driven Multi-dimensional Metrics for Next-Generation Networked Systems
DeltaLLM A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference
Advancing Event Forecasting through Massive Training of Large Language Models Challenges, Solutions, and Broader Impacts
GEPA Reflective Prompt Evolution Can Outperform Reinforcement Learning
Step-3 is Large yet Affordable Model-system Co-design for Cost-effective Decoding
Doubling Your Data in Minutes Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs
Patch Pruning Strategy Based on Robust Statistical Measures of Attention Weight Diversity in Vision Transformers
RegScore Scoring Systems for Regression Tasks
MixA-Q Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective
Deterministic diffusion models for Lagrangian turbulence robustness and encoding of extreme events

DepMicroDiff Diffusion-Based Dependency-Aware Multimodal Imputation for Microbiome Data

Authors: Rabeya Tus Sadia, Qiang Cheng

2025-07-31

http://arxiv.org/abs/2507.23676v1

Microbiome data analysis is essential for understanding host health and disease, yet its inherent sparsity and noise pose major challenges for accurate imputation, hindering downstream tasks such as biomarker discovery. Existing imputation methods, including recent diffusion-based models, often fail to capture the complex interdependencies between microbial taxa and overlook contextual metadata that can inform imputation. We introduce DepMicroDiff, a novel framework that combines diffusion-based generative modeling with a Dependency-Aware Transformer (DAT) to explicitly capture both mutual pairwise dependencies and autoregressive relationships. DepMicroDiff is further enhanced by VAE-based pretraining across diverse cancer datasets and conditioning on patient metadata encoded via a large language model (LLM). Experiments on TCGA microbiome datasets show that DepMicroDiff substantially outperforms state-of-the-art baselines, achieving higher Pearson correlation (up to 0.712), cosine similarity (up to 0.812), and lower RMSE and MAE across multiple cancer types, demonstrating its robustness and generalizability for microbiome imputation.
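
For readers unfamiliar with the four metrics reported above, the sketch below computes Pearson correlation, cosine similarity, RMSE, and MAE between a ground-truth vector and an imputed vector. It is only an illustration of the standard definitions: the function name and the toy values are assumptions, and it is not the authors' evaluation code.

```python
import math

def imputation_metrics(truth, imputed):
    """Return standard agreement/error metrics between two equal-length vectors."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_i = sum(imputed) / n
    # Pearson correlation: covariance divided by the product of standard deviations.
    cov = sum((t - mean_t) * (i - mean_i) for t, i in zip(truth, imputed))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    pearson = cov / math.sqrt(var_t * var_i)
    # Cosine similarity: dot product divided by the product of Euclidean norms.
    dot = sum(t * i for t, i in zip(truth, imputed))
    norm_t = math.sqrt(sum(t * t for t in truth))
    norm_i = math.sqrt(sum(i * i for i in imputed))
    cosine = dot / (norm_t * norm_i)
    # Error metrics on the element-wise differences.
    errors = [i - t for t, i in zip(truth, imputed)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    return {"pearson": pearson, "cosine": cosine, "rmse": rmse, "mae": mae}

# Toy values (hypothetical): the imputed vector is exactly twice the truth.
metrics = imputation_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

In this toy case the correlation and cosine similarity are both perfect even though the errors are large, which is one reason abstracts like this one report all four metrics together.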

MemoCue Empowering LLM-Based Agents for Human Memory Recall via Strategy-Guided Querying

Authors: Qian Zhao, Zhuo Sun, Bin Guo, Zhiwen Yu

2025-07-31

http://arxiv.org/abs/2507.23633v1

Agent-assisted memory recall is one critical research problem in the field of human-computer interaction. In conventional methods, the agent can retrieve information from its equipped memory module to help the person recall incomplete or vague memories. The limited size of the memory module hinders the acquisition of complete memories and impacts the memory recall performance in practice. Memory theories suggest that the person's relevant memory can be proactively activated through some effective cues. Inspired by this, we propose a novel strategy-guided agent-assisted memory recall method, allowing the agent to transform an original query into a cue-rich one via the judiciously designed strategy to help the person recall memories. To this end, there are two key challenges. (1) How to choose the appropriate recall strategy for diverse forgetting scenarios with distinct memory-recall characteristics? (2) How to obtain high-quality responses leveraging recall strategies, given only abstract and sparsely annotated strategy patterns? To address the challenges, we propose a Recall Router framework. Specifically, we design a 5W Recall Map to classify memory queries into five typical scenarios and define fifteen recall strategy patterns across the corresponding scenarios. We then propose a hierarchical recall tree combined with the Monte Carlo Tree Search algorithm to optimize the selection of strategy and the generation of strategy responses. We construct an instruction tuning dataset and fine-tune multiple open-source large language models (LLMs) to develop MemoCue, an agent that excels in providing memory-inspired responses. Experiments on three representative datasets show that MemoCue surpasses LLM-based methods by 17.74% in recall inspiration. Further human evaluation highlights its advantages in memory-recall applications.

Trae Agent An LLM-based Agent for Software Engineering with Test-time Scaling

Authors: Trae Research Team, Pengfei Gao, Zhao Tian, Xiangxin Meng, Xinchen Wang, Ruida Hu, Yuanan Xiao, Yizhou Liu, Zhao Zhang, Junjie Chen, Cuiyun Gao, Yun Lin, Yingfei Xiong, Chao Peng, Xia Liu

2025-07-31

http://arxiv.org/abs/2507.23370v1

Software issue resolution is a critical challenge in software engineering and has garnered increasing attention in recent years. With the rapid advancement of large language models (LLMs), substantial progress has been made in addressing real-world software engineering tasks. Recent studies have introduced ensemble reasoning techniques to enhance the performance of LLM-based issue resolution. However, existing prompting-based methods still face limitations in effectively exploring large ensemble spaces and lack the capacity for repository-level understanding, both of which constrain their overall effectiveness. In this paper, we propose Trae Agent, the first agent-based ensemble reasoning approach for repository-level issue resolution. Trae Agent formulates our goal as an optimal solution search problem and addresses two key challenges, i.e., large ensemble spaces and repository-level understanding, through modular agents for generation, pruning, and selection. We conduct extensive experiments using three leading LLMs on the widely-adopted SWE-bench benchmark, comparing Trae Agent against four state-of-the-art ensemble reasoning techniques. Experimental results demonstrate that Trae Agent consistently achieves superior performance, with an average improvement of 10.22% over all baselines in terms of Pass@1. Trae Agent has achieved first place on the SWE-bench Verified leaderboard, with a notable Pass@1 score of 75.20%. We are pleased to release Trae Agent as an open-source project to support the research community, with all resources available at https://github.com/bytedance/trae-agent.

Unveiling Super Experts in Mixture-of-Experts Large Language Models

Authors: Zunhai Su, Qingyuan Li, Hao Zhang, YuLei Qian, Yuchen Xie, Kehong Yuan

2025-07-31

http://arxiv.org/abs/2507.23279v1

Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to improve the efficiency of MoE LLMs. However, existing approaches often rely on empirical criteria to identify critical experts, lacking a deeper exploration and understanding of the heterogeneous importance of experts. In this study, we present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model's forward inference. These experts are prevalent in open-source MoE LLMs, and despite their limited number, pruning them leads to a significant decline in model performance (e.g., pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs. (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs remains model-specific and is unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model's overall performance, particularly in mathematical reasoning. (iii) We further enhance our understanding of the influence of SEs compression. Our findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning. The code is available at https://github.com/ZunhaiSu/Super-Experts-Profilling.

Model Directions, Not Words Mechanistic Topic Models Using Sparse Autoencoders

Authors: Carolina Zheng, Nicolas Beltran-Velez, Sweta Karlekar, Claudia Shi, Achille Nazaret, Asif Mallik, Amir Feder, David M. Blei

2025-07-31

http://arxiv.org/abs/2507.23220v1

Traditional topic models are effective at uncovering latent themes in large text collections. However, due to their reliance on bag-of-words representations, they struggle to capture semantically abstract features. While some neural variants use richer representations, they are similarly constrained by expressing topics as word lists, which limits their ability to articulate complex topics. We introduce Mechanistic Topic Models (MTMs), a class of topic models that operate on interpretable features learned by sparse autoencoders (SAEs). By defining topics over this semantically rich space, MTMs can reveal deeper conceptual themes with expressive feature descriptions. Moreover, uniquely among topic models, MTMs enable controllable text generation using topic-based steering vectors. To properly evaluate MTM topics against word-list-based approaches, we propose topic judge, an LLM-based pairwise comparison evaluation framework. Across five datasets, MTMs match or exceed traditional and neural baselines on coherence metrics, are consistently preferred by topic judge, and enable effective steering of LLM outputs.

MoCHA Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention

Authors: Yuqi Pang, Bowen Yang, Yun Cao, Fan Rong, Xiaoyu Li, Chen He

2025-07-30

http://arxiv.org/abs/2507.22805v1

Vision large language models (VLLMs) are focusing primarily on handling complex and fine-grained visual information by incorporating advanced vision encoders and scaling up visual models. However, these approaches face high training and inference costs, as well as challenges in extracting visual details, effectively bridging across modalities. In this work, we propose a novel visual framework, MoCHA, to address these issues. Our framework integrates four vision backbones (i.e., CLIP, SigLIP, DINOv2 and ConvNeXt) to extract complementary visual features and is equipped with a Mixture of Experts Connectors (MoECs) module to dynamically select experts tailored to different visual dimensions. To mitigate redundant or insufficient use of the visual information encoded by the MoECs module, we further design a Hierarchical Group Attention (HGA) with intra- and inter-group operations and an adaptive gating strategy for encoded visual features. We train MoCHA on two mainstream LLMs (e.g., Phi2-2.7B and Vicuna-7B) and evaluate their performance across various benchmarks. Notably, MoCHA outperforms state-of-the-art open-weight models on various tasks. For example, compared to CuMo (Mistral-7B), our MoCHA (Phi2-2.7B) presents outstanding abilities to mitigate hallucination by showing improvements of 3.25% in POPE and to follow visual instructions by raising 153 points on MME. Finally, ablation studies further confirm the effectiveness and robustness of the proposed MoECs and HGA in improving the overall performance of MoCHA.

Modality-Aware Feature Matching A Comprehensive Review of Single- and Cross-Modality Techniques

Authors: Weide Liu, Wei Zhou, Jun Liu, Ping Hu, Jun Cheng, Jungong Han, Weisi Lin

2025-07-30

http://arxiv.org/abs/2507.22791v1

Feature matching is a cornerstone task in computer vision, essential for applications such as image retrieval, stereo matching, 3D reconstruction, and SLAM. This survey comprehensively reviews modality-based feature matching, exploring traditional handcrafted methods and emphasizing contemporary deep learning approaches across various modalities, including RGB images, depth images, 3D point clouds, LiDAR scans, medical images, and vision-language interactions. Traditional methods, leveraging detectors like Harris corners and descriptors such as SIFT and ORB, demonstrate robustness under moderate intra-modality variations but struggle with significant modality gaps. Contemporary deep learning-based methods, exemplified by detector-free strategies like CNN-based SuperPoint and transformer-based LoFTR, substantially improve robustness and adaptability across modalities. We highlight modality-aware advancements, such as geometric and depth-specific descriptors for depth images, and dense learning methods for 3D point clouds, attention-enhanced neural networks for LiDAR scans, and specialized solutions like the MIND descriptor for complex medical image matching. Cross-modal applications, particularly in medical image registration and vision-language tasks, underscore the evolution of feature matching to handle increasingly diverse data interactions.

trAIce3D A Prompt-Driven Transformer Based U-Net for Semantic Segmentation of Microglial Cells from Large-Scale 3D Microscopy Images

Authors: MohammadAmin Alamalhoda, Arsalan Firoozi, Alessandro Venturino, Sandra Siegert

2025-07-30

http://arxiv.org/abs/2507.22635v1

The shape of a cell contains essential information about its function within the biological system. Segmenting these structures from large-scale 3D microscopy images is challenging, limiting clinical insights especially for microglia, immune-associated cells involved in neurodegenerative diseases. Existing segmentation methods mainly focus on cell bodies, struggle with overlapping structures, perform poorly on noisy images, require hyperparameter tuning for each new dataset, or rely on tedious semi-automated approaches. We introduce trAIce3D, a deep-learning architecture designed for precise microglia segmentation, capturing both somas and branches. It employs a two-stage approach: first, a 3D U-Net with vision transformers in the encoder detects somas using a sliding-window technique to cover the entire image. Then, the same architecture, enhanced with cross-attention blocks in skip connections, refines each soma and its branches by using soma coordinates as a prompt and a 3D window around the target cell as input. Training occurs in two phases: self-supervised Soma Segmentation, followed by prompt-based Branch Segmentation, leveraging pre-trained weights from the first phase. Trained and evaluated on a dataset of 41,230 microglial cells, trAIce3D significantly improves segmentation accuracy and generalization, enabling scalable analysis of complex cellular morphologies. While optimized for microglia, its architecture can extend to other intricate cell types, such as neurons and astrocytes, broadening its impact on neurobiological research.

Language Arithmetics Towards Systematic Language Neuron Identification and Manipulation

Authors: Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, Simon Ostermann

2025-07-30

http://arxiv.org/abs/2507.22608v1

Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming simpler replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that cross-lingual neuron steering enhances downstream performance and reveal internal "fallback" mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at https://github.com/d-gurgurov/Language-Neurons-Manipulation.
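
The LAPE score mentioned above can be illustrated with a small sketch. As we understand the method, a neuron's per-language activation probabilities are normalized into a distribution over languages, and the entropy of that distribution is taken: low entropy suggests a language-specific neuron. The function name and the toy probabilities are assumptions; thresholds and neuron selection are omitted, and this is not the authors' code.

```python
import math

def lape(activation_probs):
    """Entropy of a neuron's activation probabilities, normalized across languages.

    activation_probs: one value per language, each the fraction of tokens in that
    language on which the neuron fires. Assumes at least one value is non-zero.
    """
    total = sum(activation_probs)
    distribution = [p / total for p in activation_probs]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in distribution if p > 0)

# Hypothetical neurons observed over three languages.
language_specific = lape([0.90, 0.01, 0.01])  # fires almost only for language 1
language_agnostic = lape([0.30, 0.30, 0.30])  # fires equally for all languages
```

A neuron that fires equally across all languages reaches the maximum entropy, log(number of languages), so ranking neurons by ascending LAPE surfaces the most language-specific candidates first.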

MetaAgent Automatically Constructing Multi-Agent Systems Based on Finite State Machines

Authors: Yaolun Zhang, Xiaogeng Liu, Chaowei Xiao

2025-07-30

http://arxiv.org/abs/2507.22606v1

Large Language Models (LLMs) have demonstrated the ability to solve a wide range of practical tasks within multi-agent systems. However, existing human-designed multi-agent frameworks are typically limited to a small set of pre-defined scenarios, while current automated design methods suffer from several limitations, such as the lack of tool integration, dependence on external training data, and rigid structures. In this paper, we propose MetaAgent, a finite state machine based framework that can automatically generate a multi-agent system. Given a task description, MetaAgent will design a multi-agent system and polish it through an optimization algorithm. When the multi-agent system is deployed, the finite state machine will control the agent's actions and the state transitions. To evaluate our framework, we conduct experiments on both text-based tasks and practical tasks. The results indicate that the generated multi-agent system surpasses other auto-designed methods and can achieve a comparable performance with the human-designed multi-agent system, which is optimized for those specific tasks.
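
To make the idea of a finite state machine controlling agent actions and state transitions concrete, here is a minimal sketch. Each state pairs a hypothetical agent function with a transition function that names the next state. The state names, the `run_fsm` helper, and the toy "write/review" loop are all assumptions for illustration and do not reflect MetaAgent's actual design or API.

```python
def run_fsm(states, start, payload, max_steps=10):
    """Drive agents through a finite state machine.

    states: maps a state name to (agent_fn, transition_fn). agent_fn transforms
    the payload; transition_fn returns the next state name, or None to stop.
    """
    current, trace = start, []
    for _ in range(max_steps):
        agent_fn, transition_fn = states[current]
        payload = agent_fn(payload)
        trace.append(current)
        current = transition_fn(payload)
        if current is None:
            break
    return payload, trace

# Hypothetical two-agent system: a writer adds drafts until a reviewer accepts.
states = {
    "write": (lambda drafts: drafts + ["draft"], lambda drafts: "review"),
    "review": (lambda drafts: drafts,
               lambda drafts: None if len(drafts) >= 2 else "write"),
}
result, trace = run_fsm(states, start="write", payload=[])
```

The `max_steps` guard is a design choice of this sketch: it bounds execution in case the transition functions never return a terminal state.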

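Multi-modal Relational Item Representation Learning for Inferring Substitutable and Complementary Items

Authors: Junting Wang, Chenghuan Guo, Jiao Yang, Yanhui Guo, Yan Gao, Hari Sundaram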

2025-07-29

http://arxiv.org/abs/2507.22268v1

We introduce a novel self-supervised multi-modal relational item representation learning framework designed to infer substitutable and complementary items. Existing approaches primarily focus on modeling item-item associations deduced from user behaviors using graph neural networks (GNNs) or leveraging item content information. However, these methods often overlook critical challenges, such as noisy user behavior data and data sparsity due to the long-tailed distribution of these behaviors. In this paper, we propose MMSC, a self-supervised multi-modal relational item representation learning framework to address these challenges. Specifically, MMSC consists of three main components: (1) a multi-modal item representation learning module that leverages a multi-modal foundational model and learns from item metadata, (2) a self-supervised behavior-based representation learning module that denoises and learns from user behavior data, and (3) a hierarchical representation aggregation mechanism that integrates item representations at both the semantic and task levels. Additionally, we leverage LLMs to generate augmented training data, further enhancing the denoising process during training. We conduct extensive experiments on five real-world datasets, showing that MMSC outperforms existing baselines by 26.1% for substitutable recommendation and 39.2% for complementary recommendation. In addition, we empirically show that MMSC is effective in modeling cold-start items.

CTG-Insight A Multi-Agent Interpretable LLM Framework for Cardiotocography Analysis and Classification

Authors: Black Sun, Die Hu

2025-07-29

http://arxiv.org/abs/2507.22205v1

Remote fetal monitoring technologies are becoming increasingly common. Yet, most current systems offer limited interpretability, leaving expectant parents with raw cardiotocography (CTG) data that is difficult to understand. In this work, we present CTG-Insight, a multi-agent LLM system that provides structured interpretations of fetal heart rate (FHR) and uterine contraction (UC) signals. Drawing from established medical guidelines, CTG-Insight decomposes each CTG trace into five medically defined features: baseline, variability, accelerations, decelerations, and sinusoidal pattern, each analyzed by a dedicated agent. A final aggregation agent synthesizes the outputs to deliver a holistic classification of fetal health, accompanied by a natural language explanation. We evaluate CTG-Insight on the NeuroFetalNet Dataset and compare it against deep learning models and the single-agent LLM baseline. Results show that CTG-Insight achieves state-of-the-art accuracy (96.4%) and F1-score (97.8%) while producing transparent and interpretable outputs. This work contributes an interpretable and extensible CTG analysis framework.

Persona-Augmented Benchmarking Evaluating LLMs Across Diverse Writing Styles

Authors: Kimberly Le Truong, Riccardo Fogliato, Hoda Heidari, Zhiwei Steven Wu

2025-07-29

http://arxiv.org/abs/2507.22168v1

Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of communication patterns exhibited by humans. Thus, it is possible that LLMs, which are optimized on these benchmarks, may demonstrate brittle performance when faced with "non-standard" input. In this work, we test this hypothesis by rewriting evaluation prompts using persona-based LLM prompting, a low-cost method to emulate diverse writing styles. Our results show that, even with identical semantic content, variations in writing style and prompt formatting significantly impact the estimated performance of the LLM under evaluation. Notably, we identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks, irrespective of model family, size, and recency. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for measuring LLM performance across linguistic variations.

IntentFlow Interactive Support for Communicating Intent with LLMs in Writing Tasks

Authors: Yoonsu Kim, Brandon Chin, Kihoon Son, Seoyoung Kim, Juho Kim

2025-07-29

http://arxiv.org/abs/2507.22134v1

While large language models (LLMs) are widely used for writing, users often struggle to express their nuanced and evolving intents through prompt-based interfaces. Intents -- low-level strategies or preferences for achieving a writing goal -- are often vague, fluid, or even subconscious, making it difficult for users to articulate and adjust them. To address this, we present IntentFlow, which supports the communication of dynamically evolving intents throughout LLM-assisted writing. IntentFlow extracts goals and intents from user prompts and presents them as editable interface components, which users can revise, remove, or refine via direct manipulation or follow-up prompts. Visual links connect each component to the output segments it influences, helping users understand model behavior. In a within-subjects study (N=12), participants using IntentFlow, compared to a chat-based baseline, expressed their intents more easily and in detail, engaged in more meaningful actions to communicate intents, such as adjusting and deleting, and produced outputs that better aligned with their evolving intents. We found that editable intent representations help users refine and consolidate a final set of intents, which can be reused across similar tasks to support consistent and transferable LLM-assisted writing.

Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models

Authors: Hyunwoo Yoo, Gail L. Rosen

2025-07-29

http://arxiv.org/abs/2507.21980v1

Traditional machine learning models struggle to generalize in microbiome studies where only metadata is available, especially in small-sample settings or across studies with heterogeneous label formats. In this work, we explore the use of large language models (LLMs) to classify microbial samples into ontology categories such as EMPO 3 and related biological labels, as well as to predict pathogen contamination risk, specifically the presence of E. coli, using environmental metadata alone. We evaluate LLMs such as ChatGPT-4o, Claude 3.7 Sonnet, Grok-3, and LLaMA 4 in zero-shot and few-shot settings, comparing their performance against traditional models like Random Forests across multiple real-world datasets. Our results show that LLMs not only outperform baselines in ontology classification, but also demonstrate strong predictive ability for contamination risk, generalizing across sites and metadata distributions. These findings suggest that LLMs can effectively reason over sparse, heterogeneous biological metadata and offer a promising metadata-only approach for environmental microbiology and biosurveillance applications.

EDGE-GRPO Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity

Authors: Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang

2025-07-29

http://arxiv.org/abs/2507.21848v1

Large Language Models (LLMs) have made remarkable progress in enhancing step-by-step reasoning through reinforcement learning. However, the Group Relative Policy Optimization (GRPO) algorithm, which relies on sparse reward rules, often encounters the issue of identical rewards within groups, leading to the advantage collapse problem. Existing works typically address this challenge from two perspectives: enforcing model reflection to enhance response diversity, and introducing internal feedback to augment the training signal (advantage). In this work, we begin by analyzing the limitations of model reflection and investigating the policy entropy of responses at the fine-grained sample level. Based on our experimental findings, we propose the EDGE-GRPO algorithm, which adopts Entropy-Driven Advantage and Guided Error Correction to effectively mitigate the problem of advantage collapse. Extensive experiments on several main reasoning benchmarks demonstrate the effectiveness and superiority of our approach. It is available at https://github.com/ZhangXJ199/EDGE-GRPO.
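
The advantage collapse described above follows from how group-relative advantages are commonly computed: each reward is standardized against the mean and standard deviation of its sampled group, so a group whose rewards are all identical yields all-zero advantages and no training signal. The sketch below illustrates only that standard normalization. The function name and the small epsilon term are assumptions, and it does not implement EDGE-GRPO's entropy-driven advantage or guided error correction.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is the same.
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with mixed outcomes produces a usable signal ...
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# ... while identical rewards collapse every advantage to zero.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

With rule-based binary rewards, groups in which every response is correct or every response is wrong are common, which is why the collapse is a practical concern.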

Unlocking Interpretability for RF Sensing A Complex-Valued White-Box Transformer

Authors: Xie Zhang, Yina Wang, Chenshu Wu

2025-07-29

http://arxiv.org/abs/2507.21799v1

The empirical success of deep learning has spurred its application to the radio-frequency (RF) domain, leading to significant advances in Deep Wireless Sensing (DWS). However, most existing DWS models function as black boxes with limited interpretability, which hampers their generalizability and raises concerns in security-sensitive physical applications. In this work, inspired by the remarkable advances of white-box transformers, we present RF-CRATE, the first mathematically interpretable deep network architecture for RF sensing, grounded in the principles of complex sparse rate reduction. To accommodate the unique RF signals, we conduct non-trivial theoretical derivations that extend the original real-valued white-box transformer to the complex domain. By leveraging the CR-Calculus framework, we successfully construct a fully complex-valued white-box transformer with theoretically derived self-attention and residual multi-layer perceptron modules. Furthermore, to improve the model's ability to extract discriminative features from limited wireless data, we introduce Subspace Regularization, a novel regularization strategy that enhances feature diversity, resulting in an average performance improvement of 19.98% across multiple sensing tasks. We extensively evaluate RF-CRATE against seven baselines with multiple public and self-collected datasets involving different RF signals. The results show that RF-CRATE achieves performance on par with thoroughly engineered black-box models, while offering full mathematical interpretability. More importantly, by extending CRATE to the complex domain, RF-CRATE yields substantial improvements, achieving an average classification gain of 5.08% and reducing regression error by 10.34% across diverse sensing tasks compared to CRATE. RF-CRATE is fully open-sourced at: https://github.com/rfcrate/RF_CRATE.

MOR-VIT Efficient Vision Transformer with Mixture-of-Recursions

Authors: YiZhou Li

2025-07-29

http://arxiv.org/abs/2507.21761v1

Vision Transformers (ViTs) have achieved remarkable success in image recognition, yet standard ViT architectures are hampered by substantial parameter redundancy and high computational cost, limiting their practical deployment. While recent efforts on efficient ViTs primarily focus on static model compression or token-level sparsification, they remain constrained by fixed computational depth for all tokens. In this work, we present MoR-ViT, a novel vision transformer framework that, for the first time, incorporates a token-level dynamic recursion mechanism inspired by the Mixture-of-Recursions (MoR) paradigm. This approach enables each token to adaptively determine its processing depth, yielding a flexible and input-dependent allocation of computational resources. Extensive experiments on ImageNet-1K and transfer benchmarks demonstrate that MoR-ViT not only achieves state-of-the-art accuracy with up to 70% parameter reduction and 2.5x inference acceleration, but also outperforms leading efficient ViT baselines such as DynamicViT and TinyViT under comparable conditions. These results establish dynamic recursion as an effective strategy for efficient vision transformers and open new avenues for scalable and deployable deep learning models in real-world scenarios.

Enhancing Graph-based Recommendations with Majority-Voting LLM-Rerank Augmentation

Authors: Minh-Anh Nguyen, Bao Nguyen, Ha Lan N. T., Tuan Anh Hoang, Duc-Trong Le, Dung D. Le

2025-07-29

http://arxiv.org/abs/2507.21563v1

Recommendation systems often suffer from data sparsity caused by limited user-item interactions, which degrades their performance and amplifies popularity bias in real-world scenarios. This paper proposes a novel data augmentation framework that leverages Large Language Models (LLMs) and item textual descriptions to enrich interaction data. By few-shot prompting LLMs multiple times to rerank items and aggregating the results via majority voting, we generate high-confidence synthetic user-item interactions, supported by theoretical guarantees based on the concentration of measure. To effectively leverage the augmented data in the context of a graph recommendation system, we integrate it into a graph contrastive learning framework to mitigate distributional shift and alleviate popularity bias. Extensive experiments show that our method improves accuracy and reduces popularity bias, outperforming strong baselines.
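
The aggregation step described above (rerank several times, keep what a majority agrees on) might look roughly like the following. This is a minimal sketch under stated assumptions: the rerank outputs are taken as plain ranked lists, and the function name, `top_k` cutoff, and strict-majority threshold are hypothetical choices not specified in the abstract.

```python
from collections import Counter

def majority_vote_items(rerank_runs, top_k=2, min_votes=None):
    """Keep items that land in the top_k of a strict majority of rerank runs."""
    if min_votes is None:
        min_votes = len(rerank_runs) // 2 + 1  # strict majority (assumed threshold)
    votes = Counter()
    for ranking in rerank_runs:
        # Each run casts one vote per item in its top_k.
        votes.update(set(ranking[:top_k]))
    return sorted(item for item, n in votes.items() if n >= min_votes)

# Three hypothetical rerank outputs for the same user and candidate set.
runs = [
    ["item_a", "item_b", "item_c"],
    ["item_a", "item_c", "item_b"],
    ["item_b", "item_a", "item_c"],
]
selected = majority_vote_items(runs, top_k=2)
```

Here `item_a` appears in the top two of all three runs and `item_b` in two of them, so both pass the majority threshold, while `item_c` (one vote) is discarded.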

TriangleMix A Lossless and Efficient Attention Pattern for Long Context Prefilling

Authors: Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu

2025-07-29

http://arxiv.org/abs/2507.21526v1

Large Language Models (LLMs) rely on attention mechanisms whose time complexity grows quadratically with input sequence length, creating significant computational bottlenecks during the prefilling stage. Existing static sparse attention methods typically degrade accuracy, while dynamic sparse methods introduce additional computational overhead due to runtime sparse index estimation. To address these limitations, we propose TriangleMix, a novel training-free static attention pattern. TriangleMix employs dense attention in shallow layers and switches to a triangle-shaped sparse pattern in deeper layers. Extensive experiments demonstrate that TriangleMix reduces attention overhead by 3.7x to 15.3x in deep layers, and decreases overall Time-to-First-Token (TTFT) by 12% to 32% for sequence lengths ranging from 32K to 128K, without sacrificing model accuracy. Moreover, TriangleMix can be seamlessly integrated with dynamic sparsity methods to achieve further speedup, e.g. accelerating MInference by 19% at 128K, highlighting its potential to enhance LLM inference efficiency.

Large Language Models for Wireless Communications From Adaptation to Autonomy

Authors: Le Liang, Hao Ye, Yucheng Sheng, Ouya Wang, Jiacheng Wang, Shi Jin, Geoffrey Ye Li

2025-07-29

http://arxiv.org/abs/2507.21524v1

The emergence of large language models (LLMs) has revolutionized artificial intelligence, offering unprecedented capabilities in reasoning, generalization, and zero-shot learning. These strengths open new frontiers in wireless communications, where increasing complexity and dynamics demand intelligent and adaptive solutions. This article explores the role of LLMs in transforming wireless systems across three key directions: adapting pretrained LLMs for core communication tasks, developing wireless-specific foundation models to balance versatility and efficiency, and enabling agentic LLMs with autonomous reasoning and coordination capabilities. We highlight recent advances, practical case studies, and the unique benefits of LLM-based approaches over traditional methods. Finally, we outline open challenges and research opportunities, including multimodal fusion, collaboration with lightweight models, and self-improving capabilities, charting a path toward intelligent, adaptive, and autonomous wireless networks of the future.

Learning to Imitate with Less Efficient Individual Behavior Modeling in Chess

Authors: Zhenwei Tang, Difan Jiao, Eric Xue, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, Ashton Anderson

2025-07-29

http://arxiv.org/abs/2507.21488v1

As humans seek to collaborate with, learn from, and better understand artificial intelligence systems, developing AIs that can accurately emulate individual decision-making becomes increasingly important. Chess, a long-standing AI benchmark with precise skill measurement, offers an ideal testbed for human-AI alignment. However, existing approaches to modeling human behavior require prohibitively large amounts of data from each individual, making them impractical for new or sparsely represented users. In this work, we introduce Maia4All, a framework designed to learn and adapt to individual decision-making styles efficiently, even with limited data. Maia4All achieves this through a two-stage optimization process: (1) an enrichment step, which bridges population and individual-level human behavior modeling with a prototype-enriched model, and (2) a democratization step, which leverages ability levels or user prototypes to initialize and refine individual embeddings with minimal data. Our experimental results show that Maia4All can accurately predict individual moves and profile behavioral patterns with high fidelity, establishing a new standard for personalized human-like AI behavior modeling in chess. Maia4All achieves individual human behavior modeling in chess with only 20 games, compared to the 5,000 games required previously, representing a significant improvement in data efficiency. Our work provides an example of how population AI systems can flexibly adapt to individual users using a prototype-enriched model as a bridge. This approach extends beyond chess, as shown in our case study on idiosyncratic LLMs, highlighting its potential for broader applications in personalized AI adaptation.

An LLM Driven Agent Framework for Automated Infrared Spectral Multi Task Reasoning

Authors: Zujie Xie, Zixuan Chen, Jiheng Liang, Xiangyang Yu, Ziru Yu

2025-07-29

http://arxiv.org/abs/2507.21471v1

Infrared spectroscopy offers rapid, non-destructive measurement of chemical and material properties but suffers from high-dimensional, overlapping spectral bands that challenge conventional chemometric approaches. Emerging large language models (LLMs), with their capacity for generalization and reasoning, offer promising potential for automating complex scientific workflows. Despite this promise, their application in IR spectral analysis remains largely unexplored. This study addresses the critical challenge of achieving accurate, automated infrared spectral interpretation under low-data conditions using an LLM-driven framework. We introduce an end-to-end, large language model driven agent framework that integrates a structured literature knowledge base, automated spectral preprocessing, feature extraction, and multi task reasoning in a unified pipeline. By querying a curated corpus of peer reviewed IR publications, the agent selects scientifically validated routines. The selected methods transform each spectrum into low dimensional feature sets, which are fed into few shot prompt templates for classification, regression, and anomaly detection. A closed loop, multi turn protocol iteratively appends mispredicted samples to the prompt, enabling dynamic refinement of predictions. Across diverse materials: stamp pad ink, Chinese medicine, Pu'er tea, Citri Reticulatae Pericarpium and waste water COD datasets, the multi turn consistently outperforms single turn inference, rivaling or exceeding machine learning and deep learning models under low data regimes.

Transmission With Machine Language Tokens A Paradigm for Task-Oriented Agent Communication

Authors: Zhuoran Xiao, Chenhui Ye, Yijia Feng, Yunbo Hu, Tianyu Jiao, Liyu Cai, Guangyi Liu

2025-07-29

http://arxiv.org/abs/2507.21454v1

The rapid advancement in large foundation models is propelling paradigm shifts across various industries. One significant change is that agents, instead of traditional machines or humans, will be the primary participants in the future production process, which consequently requires a novel AI-native communication system tailored for agent communications. Integrating the ability of large language models (LLMs) with task-oriented semantic communication is a potential approach. However, the output of existing LLMs is human language, which is highly constrained and sub-optimal for agent-type communication. In this paper, we innovatively propose a task-oriented agent communication system. Specifically, we leverage the original LLM to learn a specialized machine language represented by token embeddings. Simultaneously, a multi-modal LLM is trained to comprehend the application task and to extract essential implicit information from multi-modal inputs, subsequently expressing it using machine language tokens. This representation is significantly more efficient for transmission over the air interface. Furthermore, to reduce transmission overhead, we introduce a joint token and channel coding (JTCC) scheme that compresses the token sequence by exploiting its sparsity while enhancing robustness against channel noise. Extensive experiments demonstrate that our approach reduces transmission overhead for downstream tasks while enhancing accuracy relative to the SOTA methods.

MindChat Enhancing BCI Spelling with Large Language Models in Realistic Scenarios

Authors: JIaheng Wang, Yucun Zhong, Chengjie Huang, Lin Yao

2025-07-29

http://arxiv.org/abs/2507.21435v1

Brain-computer interface (BCI) spellers can render a new channel independent of the peripheral nervous system, which is especially valuable for patients with severe motor disabilities. However, current BCI spellers often require users to type intended utterances letter-by-letter while spelling errors grow proportionally due to inaccurate electroencephalogram (EEG) decoding, largely impeding the efficiency and usability of BCIs in real-world communication. In this paper, we present MindChat, a large language model (LLM)-assisted BCI speller to enhance BCI spelling efficiency by reducing users' manual keystrokes. Building upon prompt engineering, we prompt LLMs (GPT-4o) to continuously suggest context-aware word and sentence completions/predictions during spelling. Online copy-spelling experiments encompassing four dialogue scenarios demonstrate that MindChat saves more than 62% keystrokes and over 32% spelling time compared with traditional BCI spellers. We envision high-speed BCI spellers enhanced by LLMs will potentially lead to truly practical applications.

Authors: Yuang Peng, Jiarui Zhong, Yang Zhang, Hong Cai Chen

2025-07-29

http://arxiv.org/abs/2507.21430v1

Parameter extraction for industry-standard device models like ASM-HEMT is crucial in circuit design workflows. However, many manufacturers do not provide such models, leaving users to build them using only datasheets. Unfortunately, datasheets lack sufficient information for standard step-by-step extraction. Moreover, manual data extraction from datasheets is highly time-consuming, and the absence of a fully automated method forces engineers to perform tedious manual work. To address this challenge, this paper introduces a novel, end-to-end framework that fully automates the generation of simulation-ready ASM-HEMT SPICE models directly from PDF datasheets. Our framework is founded on two core innovations: 1) a multi-modal AI pipeline that integrates computer vision with a large language model (LLM) to robustly parse heterogeneous datasheet layouts and digitize characteristic curves, and 2) a novel Iterative-Focusing Tree-structured Parzen Estimator (IF-TPE) optimization algorithm specifically designed for device parameter extraction under the high-dimensional, sparse-data condition by adaptively refining the parameter search space. Experimental validation on a diverse set of 17 commercial HEMT devices from 10 manufacturers confirms the framework's accuracy and robustness. The generated models demonstrate excellent agreement with published DC and RF characteristics. As the first fully automated workflow of its kind, our proposed solution offers a transformative approach to device modeling, poised to significantly accelerate the circuit design cycle by eliminating the need for manual parameter extraction.

ReGATE Learning Faster and Better with Fewer Tokens in MLLMs

Authors: Chaoyu Li, Yogesh Kulkarni, Pooyan Fazli

2025-07-29

http://arxiv.org/abs/2507.21420v1

The computational cost of training multimodal large language models (MLLMs) rapidly increases with the number of tokens involved. Existing efficiency methods primarily target inference and rely on token reduction or merging, offering limited benefit during training. In this paper, we propose ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. Specifically, ReGATE adopts a teacher-student framework in which the MLLM being trained serves as the student, and a frozen reference large language model (LLM) acts as the teacher. The teacher computes per-token reference losses, which are combined with an exponential moving average (EMA) of the student's own difficulty scores. This adaptive difficulty-based scoring enables the selective processing of crucial tokens while bypassing less informative ones in the forward pass, significantly reducing computational overhead. Experiments demonstrate that ReGATE, when applied to VideoLLaMA2, matches the peak accuracy of standard training on MVBench up to 2× faster, using only 35% of the tokens. With additional training, it even surpasses the baseline on several multimodal benchmarks, all while reducing the total token count by over 41%. Code and models will be released soon.
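
The scoring idea described above (teacher reference loss combined with an EMA of the student's difficulty, then keeping only the highest-scoring tokens) could be sketched as follows. The abstract does not specify how the two signals are combined, so the weighted sum, the `alpha` and `decay` values, the `keep_ratio` parameter, and all function names here are assumptions for illustration, not the authors' implementation.

```python
def update_ema(prev, value, decay=0.9):
    """Exponential moving average update for one token's difficulty."""
    return decay * prev + (1 - decay) * value

def select_tokens(ref_losses, student_losses, ema_scores,
                  keep_ratio=0.5, decay=0.9, alpha=0.5):
    """Return indices of tokens to keep, plus the updated EMA scores.

    ref_losses:     per-token losses from the frozen teacher
    student_losses: per-token losses from the model being trained
    ema_scores:     running EMA of student difficulty per token
    """
    new_ema = [update_ema(e, s, decay) for e, s in zip(ema_scores, student_losses)]
    # Assumed combination rule: weighted sum of teacher loss and student EMA.
    scores = [alpha * r + (1 - alpha) * e for r, e in zip(ref_losses, new_ema)]
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), new_ema

# Toy values for four tokens; the EMA starts at zero.
keep, ema = select_tokens(
    ref_losses=[2.0, 0.1, 1.5, 0.2],
    student_losses=[1.0, 0.2, 3.0, 0.1],
    ema_scores=[0.0, 0.0, 0.0, 0.0],
)
```

With these toy values, tokens 0 and 2 score highest and are kept, while the two low-loss tokens are skipped in the forward pass.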

Enhancing and Accelerating Brain MRI through Deep Learning Reconstruction Using Prior Subject-Specific Imaging

Authors: Amirmohammad Shamaei, Alexander Stebner, Salome Bosshart, Johanna Ospel, Gouri Ginde, Mariana Bento, Roberto Souza

2025-07-28

http://arxiv.org/abs/2507.21349v1

Magnetic resonance imaging (MRI) is a crucial medical imaging modality. However, long acquisition times remain a significant challenge, leading to increased costs and reduced patient comfort. Recent studies have shown the potential of using deep learning models that incorporate information from prior subject-specific MRI scans to improve reconstruction quality of present scans. Integrating this prior information requires registration of the previous scan to the current image reconstruction, which can be time-consuming. We propose a novel deep-learning-based MRI reconstruction framework which consists of an initial reconstruction network, a deep registration model, and a transformer-based enhancement network. We validated our method on a longitudinal dataset of T1-weighted MRI scans with 2,808 images from 18 subjects at four acceleration factors (R5, R10, R15, R20). Quantitative metrics confirmed our approach's superiority over existing methods (p < 0.05, Wilcoxon signed-rank test). Furthermore, we analyzed the impact of our MRI reconstruction method on the downstream task of brain segmentation and observed improved accuracy and volumetric agreement with reference segmentations. Our approach also achieved a substantial reduction in total reconstruction time compared to methods that use traditional registration algorithms, making it more suitable for real-time clinical applications. The code associated with this work is publicly available at https://github.com/amirshamaei/longitudinal-mri-deep-recon.

Agentic Web Weaving the Next Web with AI Agents

Authors: Yingxuan Yang, Mulei Ma, Yuxuan Huang, Huacan Chai, Chenyu Gong, Haoran Geng, Yuanjian Zhou, Ying Wen, Meng Fang, Muhao Chen, Shangding Gu, Ming Jin, Costas Spanos, Yang Yang, Pieter Abbeel, Dawn Song, Weinan Zhang, Jun Wang

2025-07-28

http://arxiv.org/abs/2507.21206v1

The emergence of AI agents powered by large language models (LLMs) marks a pivotal shift toward the Agentic Web, a new phase of the internet defined by autonomous, goal-driven interactions. In this paradigm, agents interact directly with one another to plan, coordinate, and execute complex tasks on behalf of users. This transition from human-driven to machine-to-machine interaction allows intent to be delegated, relieving users from routine digital operations and enabling a more interactive, automated web experience. In this paper, we present a structured framework for understanding and building the Agentic Web. We trace its evolution from the PC and Mobile Web eras and identify the core technological foundations that support this shift. Central to our framework is a conceptual model consisting of three key dimensions: intelligence, interaction, and economics. These dimensions collectively enable the capabilities of AI agents, such as retrieval, recommendation, planning, and collaboration. We analyze the architectural and infrastructural challenges involved in creating scalable agentic systems, including communication protocols, orchestration strategies, and emerging paradigms such as the Agent Attention Economy. We conclude by discussing the potential applications, societal risks, and governance issues posed by agentic systems, and outline research directions for developing open, secure, and intelligent ecosystems shaped by both human intent and autonomous agent behavior. A continuously updated collection of relevant studies for agentic web is available at: https://github.com/SafeRL-Lab/agentic-web.

SmallThinker A Family of Efficient Large Language Models Natively Trained for Local Deployment

Authors: Yixin Song, Zhenliang Xue, Dongliang Wei, Feiyang Chen, Jianxiang Gao, Junchen Liu, Hangyu Liang, Guangshuo Qin, Chengrong Tian, Bo Wen, Longyu Zhao, Xinrui Zheng, Zeyu Mi, Haibo Chen

2025-07-28

http://arxiv.org/abs/2507.20984v2

While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level structure combining fine-grained Mixture-of-Experts (MoE) with feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we utilize a NoPE-RoPE hybrid attention mechanism to slash cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1GB and 8GB of memory respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.

The Importance of Facial Features in Vision-based Sign Language Recognition Eyes, Mouth or Full Face?

Authors: Dinh Nam Pham, Eleftherios Avramidis

2025-07-28

http://arxiv.org/abs/2507.20884v2

Non-manual facial features play a crucial role in sign language communication, yet their importance in automatic sign language recognition (ASLR) remains underexplored. While prior studies have shown that incorporating facial features can improve recognition, related work often relies on hand-crafted feature extraction and fails to go beyond the comparison of manual features versus the combination of manual and facial features. In this work, we systematically investigate the contribution of distinct facial regions (eyes, mouth, and full face) using two different deep learning models (a CNN-based model and a transformer-based model) trained on an SLR dataset of isolated signs with randomly selected classes. Through quantitative performance and qualitative saliency map evaluation, we reveal that the mouth is the most important non-manual facial feature, significantly improving accuracy. Our findings highlight the necessity of incorporating facial features in ASLR.

Latent Inter-User Difference Modeling for LLM Personalization

Authors: Yilun Qiu, Tianhao Shi, Xiaoyan Zhao, Fengbin Zhu, Yang Zhang, Fuli Feng

2025-07-28

http://arxiv.org/abs/2507.20849v1

Large language models (LLMs) are increasingly integrated into users' daily lives, leading to a growing demand for personalized outputs. Previous work focuses on leveraging a user's own history, overlooking inter-user differences that are crucial for effective personalization. While recent work has attempted to model such differences, the reliance on language-based prompts often hampers the effective extraction of meaningful distinctions. To address these issues, we propose Difference-aware Embedding-based Personalization (DEP), a framework that models inter-user differences in the latent space instead of relying on language prompts. DEP constructs soft prompts by contrasting a user's embedding with those of peers who engaged with similar content, highlighting relative behavioral signals. A sparse autoencoder then filters and compresses both user-specific and difference-aware embeddings, preserving only task-relevant features before injecting them into a frozen LLM. Experiments on personalized review generation show that DEP consistently outperforms baseline methods across multiple metrics. Our code is available at https://github.com/SnowCharmQ/DEP.

METEOR Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

Authors: Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian

2025-07-28

http://arxiv.org/abs/2507.20842v1

Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce prohibitive computational overhead to achieve superior performance using complementary visual representations from multiple vision encoders. To address this, we propose a progressive framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR), that eliminates redundant visual tokens across the encoding, fusion, and decoding stages for multi-encoder MLLMs. For multi-vision encoding, we discard redundant tokens within each encoder via a rank guided collaborative token assignment strategy. Subsequently, for multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning. Finally, we propose an adaptive token pruning method in the decoding stage to further discard irrelevant tokens based on the text prompts with dynamically adjusting pruning ratios for specific task demands. To the best of our knowledge, this is the first successful attempt that achieves an efficient multi-encoder based vision language model with multi-stage pruning strategies. Extensive experiments on 11 benchmarks demonstrate the effectiveness of our proposed approach. Compared with EAGLE, a typical multi-encoder MLLM, METEOR reduces visual tokens by 76% with only a 0.3% performance drop on average. The code is available at https://github.com/YuchenLiu98/METEOR.

Advancing Compositional LLM Reasoning with Structured Task Relations in Interactive Multimodal Communications

Authors: Xinye Cao, Hongcan Guo, Guoshun Nan, Jiaoyang Cui, Haoting Qian, Yihan Lin, Yilin Peng, Diyang Zhang, Yanzhao Hou, Huici Wu, Xiaofeng Tao, Tony Q. S. Quek

2025-07-28

http://arxiv.org/abs/2507.21199v1

Interactive multimodal applications (IMAs), such as route planning in the Internet of Vehicles, enrich users' personalized experiences by integrating various forms of data over wireless networks. Recent advances in large language models (LLMs) utilize mixture-of-experts (MoE) mechanisms to empower multiple IMAs, with each LLM trained individually for a specific task that presents different business workflows. In contrast to existing approaches that rely on multiple LLMs for IMAs, this paper presents a novel paradigm that accomplishes various IMAs using a single compositional LLM over wireless networks. The two primary challenges include 1) guiding a single LLM to adapt to diverse IMA objectives and 2) ensuring the flexibility and efficiency of the LLM in resource-constrained mobile environments. To tackle the first challenge, we propose ContextLoRA, a novel method that guides an LLM to learn the rich structured context among IMAs by constructing a task dependency graph. We partition the learnable parameter matrix of neural layers for each IMA to facilitate LLM composition. Then, we develop a step-by-step fine-tuning procedure guided by task relations, including training, freezing, and masking phases. This allows the LLM to learn to reason among tasks for better adaptation, capturing the latent dependencies between tasks. For the second challenge, we introduce ContextGear, a scheduling strategy to optimize the training procedure of ContextLoRA, aiming to minimize computational and communication costs through a strategic grouping mechanism. Experiments on three benchmarks show the superiority of the proposed ContextLoRA and ContextGear. Furthermore, we prototype our proposed paradigm on a real-world wireless testbed, demonstrating its practical applicability for various IMAs. We will release our code to the community.

Beyond Class Tokens LLM-guided Dominant Property Mining for Few-shot Classification

Authors: Wei Zhuo, Runjie Luo, Wufeng Xue, Linlin Shen

2025-07-28

http://arxiv.org/abs/2507.20511v2

Few-shot Learning (FSL), which endeavors to develop the generalization ability for recognizing novel classes using only a few images, faces significant challenges due to data scarcity. Recent CLIP-like methods based on contrastive language-image pretraining mitigate the issue by leveraging textual representation of the class name for unseen image discovery. Despite the achieved success, simply aligning visual representations to class name embeddings would compromise the visual diversity for novel class discrimination. To this end, we propose a novel Few-Shot Learning (FSL) method (BCT-CLIP) that explores dominating properties via contrastive learning beyond simply using class tokens. Through leveraging LLM-based prior knowledge, our method pushes forward FSL with comprehensive structural image representations, including both global category representation and the patch-aware property embeddings. In particular, we present a novel multi-property generator (MPG) with patch-aware cross-attentions to generate multiple visual property tokens, a Large-Language Model (LLM)-assistant retrieval procedure with clustering-based pruning to obtain dominating property descriptions, and a new contrastive learning strategy for property-token learning. The superior performances on the 11 widely used datasets demonstrate that our investigation of dominating properties advances discriminative class-specific representation learning and few-shot classification.

Advancing Shared and Multi-Agent Autonomy in Underwater Missions Integrating Knowledge Graphs and Retrieval-Augmented Generation

Authors: Michele Grimaldi, Carlo Cernicchiaro, Sebastian Realpe Rua, Alaaeddine El-Masri-El-Chaarani, Markus Buchholz, Loizos Michael, Pere Ridao Rodriguez, Ignacio Carlucho, Yvan R. Petillot

2025-07-27

http://arxiv.org/abs/2507.20370v1

Robotic platforms have become essential for marine operations by providing regular and continuous access to offshore assets, such as underwater infrastructure inspection, environmental monitoring, and resource exploration. However, the complex and dynamic nature of underwater environments, characterized by limited visibility, unpredictable currents, and communication constraints, presents significant challenges that demand advanced autonomy while ensuring operator trust and oversight. Central to addressing these challenges are knowledge representation and reasoning techniques, particularly knowledge graphs and retrieval-augmented generation (RAG) systems, that enable robots to efficiently structure, retrieve, and interpret complex environmental data. These capabilities empower robotic agents to reason, adapt, and respond effectively to changing conditions. The primary goal of this work is to demonstrate both multi-agent autonomy and shared autonomy, where multiple robotic agents operate independently while remaining connected to a human supervisor. We show how a RAG-powered large language model, augmented with knowledge graph data and domain taxonomy, enables autonomous multi-agent decision-making and facilitates seamless human-robot interaction, resulting in 100% mission validation and behavior completeness. Finally, ablation studies reveal that without structured knowledge from the graph and/or taxonomy, the LLM is prone to hallucinations, which can compromise decision quality.

Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation

Authors: Abdullah Alabdullah, Lifeng Han, Chenghua Lin

2025-07-27

http://arxiv.org/abs/2507.20301v1

Dialectal Arabic (DA) poses a persistent challenge for natural language processing (NLP), as most everyday communication in the Arab world occurs in dialects that diverge significantly from Modern Standard Arabic (MSA). This linguistic divide limits access to digital services and educational resources and impedes progress in Arabic machine translation. This paper presents two core contributions to advancing DA-MSA translation for the Levantine, Egyptian, and Gulf dialects, particularly in low-resource and computationally constrained settings: a comprehensive evaluation of training-free prompting techniques, and the development of a resource-efficient fine-tuning pipeline. Our evaluation of prompting strategies across six large language models (LLMs) found that few-shot prompting consistently outperformed zero-shot, chain-of-thought, and our proposed Ara-TEaR method. GPT-4o achieved the highest performance across all prompting settings. For fine-tuning, a quantized Gemma2-9B model achieved a CHrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58). Joint multi-dialect trained models outperformed single-dialect counterparts by over 10% CHrF++, and 4-bit quantization reduced memory usage by 60% with less than 1% performance loss. The results and insights of our experiments offer a practical blueprint for improving dialectal inclusion in Arabic NLP, showing that high-quality DA-MSA machine translation is achievable even with limited resources and paving the way for more inclusive language technologies.

What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations

Authors: Katharina Trinley, Toshiki Nakai, Tatiana Anikina, Tanja Baeumel

2025-07-27

http://arxiv.org/abs/2507.20279v1

Large language models (LLMs) excel at multilingual tasks, yet their internal language processing remains poorly understood. We analyze how Aya-23-8B, a decoder-only LLM trained on balanced multilingual data, handles code-mixed, cloze, and translation tasks compared to predominantly monolingual models like Llama 3 and Chinese-LLaMA-2. Using logit lens and neuron specialization analyses, we find: (1) Aya-23 activates typologically related language representations during translation, unlike English-centric models that rely on a single pivot language; (2) code-mixed neuron activation patterns vary with mixing rates and are shaped more by the base language than the mixed-in one; and (3) Aya-23's language-specific neurons for code-mixed inputs concentrate in final layers, diverging from prior findings on decoder-only models. Neuron analysis further shows that script similarity and typological relations impact processing across model types. These findings reveal how multilingual training shapes LLM internals and inform future cross-lingual transfer research.

Modeling Professionalism in Expert Questioning through Linguistic Differentiation

Authors: Giulia D'Agostino, Chung-Chi Chen

2025-07-27

http://arxiv.org/abs/2507.20249v1

Professionalism is a crucial yet underexplored dimension of expert communication, particularly in high-stakes domains like finance. This paper investigates how linguistic features can be leveraged to model and evaluate professionalism in expert questioning. We introduce a novel annotation framework to quantify structural and pragmatic elements in financial analyst questions, such as discourse regulators, prefaces, and request types. Using both human-authored and large language model (LLM)-generated questions, we construct two datasets: one annotated for perceived professionalism and one labeled by question origin. We show that the same linguistic features correlate strongly with both human judgments and authorship origin, suggesting a shared stylistic foundation. Furthermore, a classifier trained solely on these interpretable features outperforms gemini-2.0 and SVM baselines in distinguishing expert-authored questions. Our findings demonstrate that professionalism is a learnable, domain-general construct that can be captured through linguistically grounded modeling.

FAEDKV Infinite-Window Fourier Transform for Unbiased KV Cache Compression

Authors: Runchao Li, Yao Fu, Mu Sheng, Xianxuan Long, Haotian Yu, Pan Li

2025-07-26

http://arxiv.org/abs/2507.20030v1

The efficacy of Large Language Models (LLMs) in long-context tasks is often hampered by the substantial memory footprint and computational demands of the Key-Value (KV) cache. Current compression strategies, including token eviction and learned projections, frequently lead to biased representations -- either by overemphasizing recent/high-attention tokens or by repeatedly degrading information from earlier context -- and may require costly model retraining. We present FAEDKV (Frequency-Adaptive Infinite-Window for KV cache), a novel, training-free KV cache compression framework that ensures unbiased information retention. FAEDKV operates by transforming the KV cache into the frequency domain using a proposed Infinite-Window Fourier Transform (IWDFT). This approach allows for the equalized contribution of all tokens to the compressed representation, effectively preserving both early and recent contextual information. A preliminary frequency ablation study identifies critical spectral components for layer-wise, targeted compression. Experiments on the LongBench benchmark demonstrate FAEDKV's superiority over existing methods by up to 22%. In addition, our method shows superior, position-agnostic retrieval accuracy on the Needle-In-A-Haystack task compared to compression-based approaches.

The Carbon Cost of Conversation, Sustainability in the Age of Language Models

Authors: Sayed Mahbub Hasan Amiri, Prasun Goswami, Md. Mainul Islam, Mohammad Shakhawat Hossen, Sayed Majhab Hasan Amiri, Naznin Akter

2025-07-26

http://arxiv.org/abs/2507.20018v2

Large language models (LLMs) like GPT-3 and BERT have revolutionized natural language processing (NLP), yet their environmental costs remain dangerously overlooked. This article critiques the sustainability of LLMs, quantifying their carbon footprint, water usage, and contribution to e-waste through case studies of models such as GPT-4 and energy-efficient alternatives like Mistral 7B. Training a single LLM can emit carbon dioxide equivalent to hundreds of cars driven annually, while data centre cooling exacerbates water scarcity in vulnerable regions. Systemic challenges (corporate greenwashing, redundant model development, and regulatory voids) perpetuate harm, disproportionately burdening marginalized communities in the Global South. However, pathways exist for sustainable NLP: technical innovations (e.g., model pruning, quantum computing), policy reforms (carbon taxes, mandatory emissions reporting), and cultural shifts prioritizing necessity over novelty. By analysing industry leaders (Google, Microsoft) and laggards (Amazon), this work underscores the urgency of ethical accountability and global cooperation. Without immediate action, AI's ecological toll risks outpacing its societal benefits. The article concludes with a call to align technological progress with planetary boundaries, advocating for equitable, transparent, and regenerative AI systems that prioritize both human and environmental well-being.

CaliDrop KV Cache Compression with Calibration

Authors: Yi Su, Quantong Qiu, Yuechi Zhou, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang

2025-07-26

http://arxiv.org/abs/2507.19906v1

Large Language Models (LLMs) require substantial computational resources during generation. While the Key-Value (KV) cache significantly accelerates this process by storing attention intermediates, its memory footprint grows linearly with sequence length, batch size, and model size, creating a bottleneck in long-context scenarios. Various KV cache compression techniques, including token eviction, quantization, and low-rank projection, have been proposed to mitigate this bottleneck, often complementing each other. This paper focuses on enhancing token eviction strategies. Token eviction leverages the observation that attention patterns are often sparse, allowing for the removal of less critical KV entries to save memory. However, this reduction usually comes at the cost of notable accuracy degradation, particularly under high compression ratios. To address this issue, we propose CaliDrop, a novel strategy that enhances token eviction through calibration. Our preliminary experiments show that queries at nearby positions exhibit high similarity. Building on this observation, CaliDrop performs speculative calibration on the discarded tokens to mitigate the accuracy loss caused by token eviction. Extensive experiments demonstrate that CaliDrop significantly improves the accuracy of existing token eviction methods.
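As a rough illustration of the token-eviction idea this abstract builds on (not CaliDrop's calibration step), the sketch below keeps the cache positions that received the most attention from recent queries. The function name, the sum-of-weights scoring rule, and the `budget` parameter are assumptions made for illustration and are not taken from the paper.

```python
def evict_tokens(attn_scores, budget):
    """Return the sorted positions of the `budget` cached tokens to keep.

    attn_scores: list of rows, one per recent query; each row holds the
    attention weights that query assigned to every cached token.
    Hypothetical scoring rule: sum the weights each token received.
    """
    num_tokens = len(attn_scores[0])
    # Accumulate the attention mass each cached token received.
    totals = [sum(row[t] for row in attn_scores) for t in range(num_tokens)]
    # Rank positions by accumulated mass, highest first (ties keep earlier tokens).
    ranked = sorted(range(num_tokens), key=lambda t: -totals[t])
    # Keep the top `budget` positions, restored to their original order.
    return sorted(ranked[:budget])


# Two recent queries attending over four cached tokens.
scores = [
    [0.50, 0.05, 0.30, 0.15],
    [0.40, 0.10, 0.10, 0.40],
]
kept = evict_tokens(scores, budget=2)
```

Everything outside `kept` would be dropped from the cache; the accuracy loss from dropping those entries is what the abstract says calibration is meant to reduce.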

CrossPL Evaluating Large Language Models on Cross Programming Language Code Generation

Authors: Zhanhang Xiong, Dongxia Wang, Yuekang Li, Xinyuan An, Wenhai Wang

2025-07-26

http://arxiv.org/abs/2507.19904v1

As large language models (LLMs) become increasingly embedded in software engineering workflows, a critical capability remains underexplored: generating correct code that enables cross-programming-language (CPL) interoperability. This skill is essential for building complex systems that integrate components written in multiple languages via mechanisms like inter-process communication (IPC). To bridge this gap, we present CrossPL, the first benchmark designed to systematically evaluate LLMs' ability to generate CPL-interoperating code. CrossPL comprises 1,982 tasks centered around IPC, covering six widely-used programming languages and seven representative CPL techniques. We construct this benchmark by (i) analyzing 19,169 multi-language GitHub repositories using 156 hand-crafted finite state machines (FSMs), and (ii) developing an LLM-based pipeline that automatically extracts CPL code snippets, generates task instructions, and validates functional correctness. We evaluate 14 state-of-the-art general-purpose LLMs and 6 code-oriented LLMs released in the past three years on CrossPL via FSM-based validation. Results reveal that even the best-performing models struggle with CPL scenarios, underscoring the need for more targeted research in this space. Our benchmark and code are available at: https://anonymous.4open.science/r/crosspl-2814.

AgentMesh A Cooperative Multi-Agent Generative AI Framework for Software Development Automation

Authors: Sourena Khanzadeh

2025-07-26

http://arxiv.org/abs/2507.19902v1

Software development is a complex, multi-phase process traditionally requiring collaboration among individuals with diverse expertise. We propose AgentMesh, a Python-based framework that uses multiple cooperating LLM-powered agents to automate software development tasks. In AgentMesh, specialized agents - a Planner, Coder, Debugger, and Reviewer - work in concert to transform a high-level requirement into fully realized code. The Planner agent first decomposes user requests into concrete subtasks; the Coder agent implements each subtask in code; the Debugger agent tests and fixes the code; and the Reviewer agent validates the final output for correctness and quality. We describe the architecture and design of these agents and their communication, and provide implementation details including prompt strategies and workflow orchestration. A case study illustrates AgentMesh handling a non-trivial development request via sequential task planning, code generation, iterative debugging, and final code review. We discuss how dividing responsibilities among cooperative agents leverages the strengths of large language models while mitigating single-agent limitations. Finally, we examine current limitations - such as error propagation and context scaling - and outline future work toward more robust, scalable multi-agent AI systems for software engineering automation.

HCAttention Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs

Authors: Dongquan Yang, Yifan Yang, Xiaotian Yu, Xianbiao Qi, Rong Xiao

2025-07-26

http://arxiv.org/abs/2507.19823v1

Processing long-context inputs with large language models presents a significant challenge due to the enormous memory requirements of the Key-Value (KV) cache during inference. Existing KV cache compression methods exhibit noticeable performance degradation when memory is reduced by more than 85%. Additionally, strategies that leverage GPU-CPU collaboration for approximate attention remain underexplored in this setting. We propose HCAttention, a heterogeneous attention computation framework that integrates key quantization, value offloading, and dynamic KV eviction to enable efficient inference under extreme memory constraints. The method is compatible with existing architectures and does not require model fine-tuning. Experimental results on the LongBench benchmark demonstrate that our approach preserves the accuracy of the full-attention model while shrinking the KV cache memory footprint to 25% of its original size. Remarkably, it stays competitive with only 12.5% of the cache, setting a new state-of-the-art in KV cache compression. To the best of our knowledge, HCAttention is the first to extend the Llama-3-8B model to process 4 million tokens on a single A100 GPU with 80GB memory.
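To put the 25% and 12.5% figures in context, the following back-of-the-envelope sketch estimates the uncompressed KV cache size for a 4-million-token sequence. The configuration values are the published Llama-3-8B settings (32 layers, 8 key-value heads, head dimension 128); 16-bit storage and reading "4 million" as exactly 4,000,000 tokens are assumptions, and the function name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to store keys and values for one sequence."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed setup: Llama-3-8B shape, 16-bit values, 4,000,000 tokens.
full = kv_cache_bytes(32, 8, 128, seq_len=4_000_000)
for fraction in (1.0, 0.25, 0.125):
    print(f"{fraction:>6.1%} of cache: {full * fraction / 1e9:.1f} GB")
```

Under these assumptions the uncompressed cache is roughly 524 GB, and 12.5% of it is about 66 GB. This is only a sizing estimate; it does not model the quantization or CPU offloading the abstract describes.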

Large Language Model Agent for Structural Drawing Generation Using ReAct Prompt Engineering and Retrieval Augmented Generation

Authors: Xin Zhang, Lissette Iturburu, Juan Nicolas Villamizar, Xiaoyu Liu, Manuel Salmeron, Shirley J. Dyke, Julio Ramirez

2025-07-26

http://arxiv.org/abs/2507.19771v1

Structural drawings are widely used in many fields, e.g., mechanical engineering, civil engineering, etc. In civil engineering, structural drawings serve as the main communication tool between architects, engineers, and builders to avoid conflicts, act as legal documentation, and provide a reference for future maintenance or evaluation needs. They are often organized using key elements such as title/subtitle blocks, scales, plan views, elevation views, sections, and detailed sections, which are annotated with standardized symbols and line types for interpretation by engineers and contractors. Despite advances in software capabilities, the task of generating a structural drawing remains labor-intensive and time-consuming for structural engineers. Here we introduce a novel generative AI-based method for generating structural drawings employing a large language model (LLM) agent. The method incorporates a retrieval-augmented generation (RAG) technique using externally-sourced facts to enhance the accuracy and reliability of the language model. This method is capable of understanding varied natural language descriptions, processing these to extract necessary information, and generating code to produce the desired structural drawing in AutoCAD. The approach developed, demonstrated, and evaluated herein enables the efficient and direct conversion of a structural drawing's natural language description into an AutoCAD drawing, significantly reducing the workload compared to the current working process associated with manual drawing production, and facilitating the typical iterative process by which engineers express design ideas in a simplified way.

LowKeyEMG Electromyographic typing with a reduced keyset

Authors: Johannes Y. Lee, Derek Xiao, Shreyas Kaasyap, Nima R. Hadidi, John L. Zhou, Jacob Cunningham, Rakshith R. Gore, Deniz O. Eren, Jonathan C. Kao

2025-07-26

http://arxiv.org/abs/2507.19736v1

We introduce LowKeyEMG, a real-time human-computer interface that enables efficient text entry using only 7 gesture classes decoded from surface electromyography (sEMG). Prior work has attempted full-alphabet decoding from sEMG, but decoding large character sets remains unreliable, especially for individuals with motor impairments. Instead, LowKeyEMG reduces the English alphabet to 4 gesture keys, with 3 more for space and system interaction, to reliably translate simple one-handed gestures into text, leveraging the recurrent transformer-based language model RWKV for efficient computation. In real-time experiments, participants achieved average one-handed keyboardless typing speeds of 23.3 words per minute with LowKeyEMG, and improved gesture efficiency by 17% (relative to typed phrase length). When typing with only 7 keys, LowKeyEMG can achieve 98.2% top-3 word accuracy, demonstrating that this low-key typing paradigm can maintain practical communication rates. Our results have implications for assistive technologies and any interface where input bandwidth is constrained.
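The reduced-keyset idea can be sketched as an ambiguous-keyboard lookup: many letters share one key, so a key sequence maps to several candidate words that must then be ranked. The letter grouping below is hypothetical (the abstract does not state how letters are assigned to the 4 keys), and plain vocabulary order stands in for the language-model ranking the paper describes.

```python
from collections import defaultdict

# Hypothetical grouping of the 26 letters onto 4 gesture keys.
KEY_GROUPS = {0: "abcdefg", 1: "hijklm", 2: "nopqrs", 3: "tuvwxyz"}
LETTER_TO_KEY = {ch: key for key, letters in KEY_GROUPS.items() for ch in letters}


def encode(word):
    """Map a word to the key sequence a user would produce."""
    return tuple(LETTER_TO_KEY[ch] for ch in word.lower())


def build_index(vocabulary):
    """Group vocabulary words by their shared key sequence."""
    index = defaultdict(list)
    for word in vocabulary:
        index[encode(word)].append(word)
    return index


def candidates(index, keys, top_k=3):
    """Return up to top_k words matching a key sequence, in vocabulary order."""
    return index.get(tuple(keys), [])[:top_k]


index = build_index(["the", "cat", "bat", "act", "dog"])
matches = candidates(index, encode("cat"))
```

Here "cat", "bat", and "act" collide on the same key sequence, which is why a top-3 accuracy metric and a context-aware ranker matter for this kind of interface.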

Towards Inclusive NLP Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks

Authors: Maitha Alshehhi, Ahmed Sharshar, Mohsen Guizani

2025-07-25

http://arxiv.org/abs/2507.19699v1

Although LLMs have attained significant success in high-resource languages, their capacity in low-resource linguistic environments like Kannada and Arabic is not yet fully understood. This work benchmarks the performance of multilingual and monolingual Large Language Models (LLMs) across Arabic, English, and Indic languages, with particular emphasis on the effects of model compression strategies such as pruning and quantization. Findings show significant performance differences driven by linguistic diversity and resource availability on SOTA LLMs such as BLOOMZ, AceGPT, Jais, LLaMA-2, XGLM, and AraGPT2. We find that multilingual versions of the model outperform their language-specific counterparts across the board, indicating substantial cross-lingual transfer benefits. Quantization (4-bit and 8-bit) is effective in maintaining model accuracy while promoting efficiency, but aggressive pruning significantly compromises performance, especially in bigger models. Our findings pinpoint key strategies to construct scalable and fair multilingual NLP solutions and underscore the need for interventions to address hallucination and generalization errors in the low-resource setting.

"X of Information" Continuum A Survey on AI-Driven Multi-dimensional Metrics for Next-Generation Networked Systems

Authors: Beining Wu, Jun Huang, Shui Yu

2025-07-25

http://arxiv.org/abs/2507.19657v1

The development of next-generation networking systems has inherently shifted from throughput-based paradigms towards intelligent, information-aware designs that emphasize the quality, relevance, and utility of transmitted information, rather than sheer data volume. While classical network metrics, such as latency and packet loss, remain significant, they are insufficient to quantify the nuanced information quality requirements of modern intelligent applications, including autonomous vehicles, digital twins, and metaverse environments. In this survey, we present the first comprehensive study of the "X of Information" continuum by introducing a systematic four-dimensional taxonomic framework that structures information metrics along temporal, quality/utility, reliability/robustness, and network/communication dimensions. We uncover the increasing interdependencies among these dimensions, whereby temporal freshness triggers quality evaluation, which in turn helps with reliability appraisal, ultimately enabling effective network delivery. Our analysis reveals that artificial intelligence technologies, such as deep reinforcement learning, multi-agent systems, and neural optimization models, enable adaptive, context-aware optimization of competing information quality objectives. In our extensive study of six critical application domains, covering autonomous transportation, industrial IoT, healthcare digital twins, UAV communications, LLM ecosystems, and metaverse settings, we illustrate the revolutionary promise of multi-dimensional information metrics for meeting diverse operational needs. Our survey identifies prominent implementation challenges, including ...

DeltaLLM A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference

Authors: Jiawen Qi, Chang Gao, Zhaochun Ren, Qinyu Chen

2025-07-25

http://arxiv.org/abs/2507.19608v1

Deploying Large Language Models (LLMs) on edge devices remains challenging due to their quadratically increasing computations with the sequence length. Existing studies for dynamic attention are designed for hardware with massively parallel computation capabilities, such as GPUs or TPUs, and aim at long context lengths (e.g., 64K), making them unsuitable for edge scenarios. We present DeltaLLM, a training-free framework that exploits temporal sparsity in attention patterns to enable efficient LLM inference across both the prefilling and decoding stages on resource-constrained edge devices. DeltaLLM introduces an accuracy- and memory-aware delta matrix construction strategy that introduces temporal sparsity, and a context-aware hybrid attention mechanism that combines full attention in a local context window with delta approximation outside it to increase accuracy. We evaluate our framework on the edge-device-friendly BitNet-b1.58-2B-4T model and the Llama3.2-1B-Instruct model across diverse language tasks. The results show that on BitNet, our framework increases the attention sparsity from 0% to 60% during the prefilling stage with slight accuracy improvement on the WG task, and from 0% to 57% across both the prefilling and decoding stages, with an even higher F1 score (from 29.63 to 30.97) on the SQuAD-v2 task. On the Llama model, it can also achieve up to 60% sparsity during the prefilling stage and around 57% across both stages with negligible accuracy drop. These results demonstrate that DeltaLLM offers a promising solution for efficient edge deployment, requiring no fine-tuning and seamlessly integrating with existing inference pipelines.

Advancing Event Forecasting through Massive Training of Large Language Models Challenges, Solutions, and Broader Impacts

Authors: Sang-Woo Lee, Sohee Yang, Donghyun Kwak, Noah Y. Siegel

2025-07-25

http://arxiv.org/abs/2507.19477v1

Many recent papers have studied the development of superforecaster-level event forecasting LLMs. While methodological problems with early studies cast doubt on the use of LLMs for event forecasting, recent studies with improved evaluation methods have shown that state-of-the-art LLMs are gradually reaching superforecaster-level performance, and reinforcement learning has also been reported to improve future forecasting. Additionally, the unprecedented success of recent reasoning models and Deep Research-style models suggests that technology capable of greatly improving forecasting performance has been developed. Therefore, based on these positive recent trends, we argue that the time is ripe for research on large-scale training of superforecaster-level event forecasting LLMs. We discuss two key research directions: training methods and data acquisition. For training, we first introduce three difficulties of LLM-based event forecasting training: noisiness-sparsity, knowledge cut-off, and simple reward structure problems. Then, we present related ideas to mitigate these problems: hypothetical event Bayesian networks, utilizing poorly-recalled and counterfactual events, and auxiliary reward signals. For data, we propose aggressive use of market, public, and crawling datasets to enable large-scale training and evaluation. Finally, we explain how these technical advances could enable AI to provide predictive intelligence to society in broader areas. This position paper presents promising specific paths and considerations for getting closer to superforecaster-level AI technology, aiming to call for researchers' interest in these directions.

GEPA Reflective Prompt Evolution Can Outperform Reinforcement Learning

Authors: Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab

2025-07-25

http://arxiv.org/abs/2507.19457v1

Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an inference-time search strategy for code optimization.

Step-3 is Large yet Affordable Model-system Co-design for Cost-effective Decoding

Authors: StepFun, Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin Miao, Chang Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengyuan Yao, Daokuan Lv, Dapeng Shi, Deshan Sun, Ding Huang, Dingyuan Hu, Dongqing Pang, Enle Liu, Fajie Zhang, Fanqi Wan, Gulin Yan, Han Zhang, Han Zhou, Hanghao Wu, Hangyu Guo, Hanqi Chen, Hanshan Zhang, Hao Wu, Haocheng Zhang, Haolong Yan, Haoran Lv, Haoran Wei, Hebin Zhou, Heng Wang, Heng Wang, Hongxin Li, Hongyu Zhou, Hongyuan Wang, Huiyong Guo, Jia Wang, Jiahao Gong, Jialing Xie, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yan, Jie Yang, Jieyi Hou, Jinguang Zhang, Jinlan Cao, Jisheng Yin, Junfeng Liu, Junhao Huang, Junzhe Lin, Kaijun Tan, Kaixiang Li, Kang An, Kangheng Lin, Kenkun Liu, Lei Yang, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lin Zhang, Lina Chen, Liwen Huang, Liying Shi, Longlong Gu, Mei Chen, Mengqiang Ren, Ming Li, Mingzhe Chen, Na Wang, Nan Wu, Qi Han, Qian Zhao, Qiang Zhang, Qianni Liu, Qiaohui Chen, Qiling Wu, Qinglin He, Qinyuan Tan, Qiufeng Wang, Qiuping Wu, Qiuyan Liang, Quan Sun, Rui Li, Ruihang Miao, Ruosi Wan, Ruyan Guo, Shangwu Zhong, Shaoliang Pang, Shengjie Fan, Shijie Shang, Shilei Jiang, Shiliang Yang, Shiming Hao, Shuli Gao, Siming Huang, Siqi Liu, Tiancheng Cao, Tianhao Cheng, Tianhao Peng, Wang You, Wei Ji, Wen Sun, Wenjin Deng, Wenqing He, Wenzhen Zheng, Xi Chen, Xiangwen Kong, Xianzhen Luo, Xiaobo Yang, Xiaojia Liu, Xiaoxiao Ren, Xin Han, Xin Li, Xin Wu, Xu Zhao, Yanan Wei, Yang Li, Yangguang Li, Yangshijie Xu, Yanming Xu, Yaqiang Shi, Yeqing Shen, Yi Yang, Yifei Yang, Yifeng Gong, Yihan Chen, Yijing Yang, Yinmin Zhang, Yizhuang Zhou, Yuanhao Ding, Yuantao Fan, Yuanzhen Yang, Yuchu Luo, Yue Peng, Yufan Lu, Yuhang Deng, Yuhe Yin, Yujie Liu, Yukun Chen, Yuling Zhao, Yun Mou, Yunlong Li, Yunzhou Ju, Yusheng Li, Yuxiang Yang, Yuxiang Zhang, Yuyang Chen, Zejia Weng, Zhe Xie, Zheng Ge, Zheng Gong, Zhenyi Lu, Zhewei Huang, Zhichao Chang, Zhiguo Huang, Zhirui Wang, Zidong Yang, Zili Wang, Ziqi Wang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Xiangyu Zhang

2025-07-25

http://arxiv.org/abs/2507.19427v1

Large language models (LLMs) face low hardware efficiency during decoding, especially for long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter VLM with hardware-aware model-system co-design optimized for minimizing decoding costs. Step-3 innovates in two key dimensions: (1) A novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step-3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer context. Step-3 achieves low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under a 50ms TPOT SLA (4K context, FP8, no MTP). This is higher than DeepSeek-V3's 2,324 in the same setup and sets a new Pareto frontier for LLM decoding.

Doubling Your Data in Minutes Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs

Authors: Shuo Yang, Zheyu Zhang, Bardh Prenkaj, Gjergji Kasneci

2025-07-25

http://arxiv.org/abs/2507.19334v1

Tabular data is critical across diverse domains, yet high-quality datasets remain scarce due to privacy concerns and the cost of collection. Contemporary approaches adopt large language models (LLMs) for tabular augmentation, but exhibit two major limitations: (1) dense dependency modeling among tabular features that can introduce bias, and (2) high computational overhead in sampling. To address these issues, we propose SPADA for SPArse Dependency-driven Augmentation, a lightweight generative framework that explicitly captures sparse dependencies via an LLM-induced graph. We treat each feature as a node and synthesize values by traversing the graph, conditioning each feature solely on its parent nodes. We explore two synthesis strategies: a non-parametric method using Gaussian kernel density estimation, and a conditional normalizing flow model that learns invertible mappings for conditional density estimation. Experiments on four datasets show that SPADA reduces constraint violations by 4% compared to diffusion-based methods and accelerates generation by nearly 9,500 times over LLM-based baselines.
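The graph-traversal step described above can be sketched as sampling features in topological order, so that each feature sees only its parents' values. The feature names, the graph, and the toy samplers below are all hypothetical; the samplers merely stand in for the kernel density or normalizing-flow models named in the abstract.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each feature maps to its parent features.
PARENTS = {"age": [], "education": ["age"], "income": ["age", "education"]}


def sample_row(samplers, rng):
    """Generate one synthetic row by visiting features parents-first."""
    row = {}
    for feature in TopologicalSorter(PARENTS).static_order():
        # Condition only on the already-sampled parent values.
        parent_values = {p: row[p] for p in PARENTS[feature]}
        row[feature] = samplers[feature](parent_values, rng)
    return row


# Toy conditional samplers standing in for learned density models.
SAMPLERS = {
    "age": lambda parents, rng: rng.randint(18, 80),
    "education": lambda parents, rng: rng.randint(0, min(20, parents["age"] - 6)),
    "income": lambda parents, rng: (
        1000 * parents["education"] + 100 * parents["age"] + rng.randint(0, 500)
    ),
}

row = sample_row(SAMPLERS, random.Random(0))
```

Because each sampler receives only its parents, adding a feature means adding one graph entry and one sampler rather than retraining a joint model, which is one plausible reading of why a sparse graph keeps generation cheap.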

Patch Pruning Strategy Based on Robust Statistical Measures of Attention Weight Diversity in Vision Transformers

Authors: Yuki Igaue, Hiroaki Aizawa

2025-07-25

http://arxiv.org/abs/2507.19175v1

Multi-head self-attention is a distinctive feature extraction mechanism of vision Transformers that computes pairwise relationships among all input patches, contributing significantly to their high performance. However, it is known to incur a quadratic computational complexity with respect to the number of patches. One promising approach to address this issue is patch pruning, which improves computational efficiency by identifying and removing redundant patches. In this work, we propose a patch pruning strategy that evaluates the importance of each patch based on the variance of attention weights across multiple attention heads. This approach is inspired by the design of multi-head self-attention, which aims to capture diverse attention patterns across different subspaces of feature representations. The proposed method can be easily applied during both training and inference, and achieves improved throughput while maintaining classification accuracy in scenarios such as fine-tuning with pre-trained models. In addition, we found that using robust statistical measures, such as the median absolute deviation in place of variance, to assess patch importance can similarly lead to strong performance. Furthermore, by introducing overlapping patch embeddings, our method achieves better performance with comparable throughput to conventional approaches that utilize all patches.
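A minimal sketch of the scoring idea follows: each patch is scored by how much its attention weight differs across heads, using either variance or the median absolute deviation. Treating higher diversity as higher importance, the function names, and the toy weights are assumptions for illustration, not details confirmed by the abstract.

```python
from statistics import median, pvariance


def mad(values):
    """Median absolute deviation, a robust alternative to variance."""
    center = median(values)
    return median(abs(v - center) for v in values)


def select_patches(head_weights, keep, measure=pvariance):
    """Keep the `keep` patches whose attention weights vary most across heads.

    head_weights[h][p] is the attention weight head h assigns to patch p.
    Assumption: higher cross-head diversity means a more important patch.
    """
    num_patches = len(head_weights[0])
    scores = [measure([head[p] for head in head_weights]) for p in range(num_patches)]
    # Rank patches by score, highest first, then restore positional order.
    ranked = sorted(range(num_patches), key=lambda p: -scores[p])
    return sorted(ranked[:keep])


# Three heads attending over four patches.
weights = [
    [0.10, 0.40, 0.25, 0.25],
    [0.10, 0.10, 0.30, 0.50],
    [0.10, 0.70, 0.20, 0.00],
]
kept_var = select_patches(weights, keep=2)
kept_mad = select_patches(weights, keep=2, measure=mad)
```

Swapping `measure` is the only change needed to move from variance to a robust statistic, which mirrors the comparison the abstract reports.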

RegScore Scoring Systems for Regression Tasks

Authors: Michal K. Grzeszczyk, Tomasz Szczepański, Pawel Renc, Siyeop Yoon, Jerome Charton, Tomasz Trzciński, Arkadiusz Sitek

2025-07-25

http://arxiv.org/abs/2507.19155v1

Scoring systems are widely adopted in medical applications for their inherent simplicity and transparency, particularly for classification tasks involving tabular data. In this work, we introduce RegScore, a novel, sparse, and interpretable scoring system specifically designed for regression tasks. Unlike conventional scoring systems constrained to integer-valued coefficients, RegScore leverages beam search and k-sparse ridge regression to relax these restrictions, thus enhancing predictive performance. We extend RegScore to bimodal deep learning by integrating tabular data with medical images. We utilize the classification token from the TIP (Tabular Image Pretraining) transformer to generate Personalized Linear Regression parameters and a Personalized RegScore, enabling individualized scoring. We demonstrate the effectiveness of RegScore by estimating mean Pulmonary Artery Pressure using tabular data and further refine these estimates by incorporating cardiac MRI images. Experimental results show that RegScore and its personalized bimodal extensions achieve performance comparable to, or better than, state-of-the-art black-box models. Our method provides a transparent and interpretable approach for regression tasks in clinical settings, promoting more informed and trustworthy decision-making. We provide our code at https://github.com/SanoScience/RegScore.

MixA-Q Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective

Authors: Weitian Wang, Rai Shubham, Cecilia De La Parra, Akash Kumar

2025-07-25

http://arxiv.org/abs/2507.19131v1

In this paper, we propose MixA-Q, a mixed-precision activation quantization framework that leverages intra-layer activation sparsity (a concept widely explored in activation sparsity methods) for efficient inference of quantized window-based vision transformers. For a given uniform-bit quantization configuration, MixA-Q separates the batched window computations within Swin blocks and assigns a lower bit width to the activations of less important windows, improving the trade-off between model performance and efficiency. We introduce a Two-Branch Swin Block that processes activations separately in high- and low-bit precision, enabling seamless integration of our method with most quantization-aware training (QAT) and post-training quantization (PTQ) methods, or with simple modifications. Our experimental evaluations over the COCO dataset demonstrate that MixA-Q achieves a training-free 1.35x computational speedup without accuracy loss in PTQ configuration. With QAT, MixA-Q achieves a lossless 1.25x speedup and a 1.53x speedup with only a 1% mAP drop by incorporating activation pruning. Notably, by reducing the quantization error in important regions, our sparsity-aware quantization adaptation improves the mAP of the quantized W4A4 model (with both weights and activations in 4-bit precision) by 0.7%, reducing quantization degradation by 24%.
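The per-window bit assignment can be illustrated with a small sketch. This is not the MixA-Q implementation: the importance measure (mean absolute activation), the 8-bit and 4-bit widths, the `low_fraction` parameter, and all function names are assumptions chosen only to show the idea of giving less important windows a lower activation bit width.

```python
def fake_quantize(values, bits):
    """Symmetric uniform quantize-then-dequantize of a list of floats."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits, 7 for 4 bits
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def mixed_precision_windows(windows, high_bits=8, low_bits=4, low_fraction=0.5):
    """Quantize each window, using fewer bits for the least important ones.

    Importance here is the mean absolute activation of a window, which is a
    stand-in for whatever criterion the paper actually uses.
    """
    importance = [sum(abs(v) for v in w) / len(w) for w in windows]
    order = sorted(range(len(windows)), key=lambda i: importance[i])
    num_low = int(len(windows) * low_fraction)
    low_ids = set(order[:num_low])
    bits = [low_bits if i in low_ids else high_bits for i in range(len(windows))]
    quantized = [fake_quantize(w, b) for w, b in zip(windows, bits)]
    return quantized, bits


if __name__ == "__main__":
    # Four flattened windows of toy activations.
    windows = [
        [0.01, -0.02, 0.0, 0.01],
        [1.0, -0.5, 0.25, 0.75],
        [0.1, 0.1, -0.1, 0.0],
        [2.0, 1.0, -1.5, 0.5],
    ]
    quantized, bits = mixed_precision_windows(windows)
    print(bits)
```

Grouping windows by assigned bit width, as the two-branch design in the abstract suggests, lets each group run as one batched computation at a single precision.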

Deterministic diffusion models for Lagrangian turbulence robustness and encoding of extreme events

Authors: Tianyi Li, Flavio Tuteri, Michele Buzzicotti, Fabio Bonaccorso, Luca Biferale

2025-07-25

http://arxiv.org/abs/2507.19103v1

Modeling Lagrangian turbulence remains a fundamental challenge due to its multiscale, intermittent, and non-Gaussian nature. Recent advances in data-driven diffusion models have enabled the generation of realistic Lagrangian velocity trajectories that accurately reproduce statistical properties across scales and capture rare extreme events. This study investigates three key aspects of diffusion-based modeling for Lagrangian turbulence. First, we assess architectural robustness by comparing a U-Net backbone with a transformer-based alternative, finding strong consistency in generated trajectories, with only minor discrepancies at small scales. Second, leveraging a deterministic variant of the diffusion model formulation, namely the denoising diffusion implicit model (DDIM), we identify structured features in the initial latent noise that align consistently with extreme events. Third, we explore accelerated generation by reducing the number of diffusion steps, and find that DDIM enables substantial speedups with minimal loss of statistical fidelity. These findings highlight the robustness of diffusion models and their potential for interpretable, scalable modeling of complex turbulent systems.
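For readers unfamiliar with deterministic DDIM sampling, here is a minimal sketch of the standard update with the stochastic term set to zero, together with a loop that skips timesteps to speed up generation. It uses plain Python lists in place of tensors. The function names, the `eps_model` callable, and the `alpha_bars` schedule are hypothetical placeholders, not the authors' code.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from step t to the previous step.

    alpha_bar_* are cumulative products of the noise schedule; eps_pred is the
    noise predicted by the network for x_t.
    """
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = [
        (x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
        for x, e in zip(x_t, eps_pred)
    ]
    # Move to the previous noise level along the same noise direction.
    return [
        math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * e
        for x0, e in zip(x0_pred, eps_pred)
    ]


def ddim_sample(x_T, eps_model, alpha_bars, timesteps):
    """Run DDIM over a (possibly subsampled) decreasing list of timesteps.

    eps_model(x, t) is a hypothetical noise predictor. Because no fresh noise
    is injected, the output is a deterministic function of x_T.
    """
    x = list(x_T)
    for i, t in enumerate(timesteps):
        is_last = i + 1 == len(timesteps)
        alpha_bar_prev = 1.0 if is_last else alpha_bars[timesteps[i + 1]]
        x = ddim_step(x, eps_model(x, t), alpha_bars[t], alpha_bar_prev)
    return x
```

Because the mapping from the initial latent noise to the generated trajectory is deterministic, features of that noise can be related to features of the output, which is the property the abstract relies on when linking latent structure to extreme events.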