2025-08-01
Table of Contents
- DepMicroDiff Diffusion-Based Dependency-Aware Multimodal Imputation for Microbiome Data
- MemoCue Empowering LLM-Based Agents for Human Memory Recall via Strategy-Guided Querying
- Trae Agent An LLM-based Agent for Software Engineering with Test-time Scaling
- Unveiling Super Experts in Mixture-of-Experts Large Language Models
- Model Directions, Not Words Mechanistic Topic Models Using Sparse Autoencoders
- MoCHA Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention
- Modality-Aware Feature Matching A Comprehensive Review of Single- and Cross-Modality Techniques
- trAIce3D A Prompt-Driven Transformer Based U-Net for Semantic Segmentation of Microglial Cells from Large-Scale 3D Microscopy Images
- Language Arithmetics Towards Systematic Language Neuron Identification and Manipulation
- MetaAgent Automatically Constructing Multi-Agent Systems Based on Finite State Machines
- Multi-modal Relational Item Representation Learning for Inferring Substitutable and Complementary Items
- CTG-Insight A Multi-Agent Interpretable LLM Framework for Cardiotocography Analysis and Classification
- Persona-Augmented Benchmarking Evaluating LLMs Across Diverse Writing Styles
- IntentFlow Interactive Support for Communicating Intent with LLMs in Writing Tasks
- Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models
- EDGE-GRPO Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity
- Unlocking Interpretability for RF Sensing A Complex-Valued White-Box Transformer
- MOR-VIT Efficient Vision Transformer with Mixture-of-Recursions
- Enhancing Graph-based Recommendations with Majority-Voting LLM-Rerank Augmentation
- TriangleMix A Lossless and Efficient Attention Pattern for Long Context Prefilling
- Large Language Models for Wireless Communications From Adaptation to Autonomy
- Learning to Imitate with Less Efficient Individual Behavior Modeling in Chess
- An LLM Driven Agent Framework for Automated Infrared Spectral Multi Task Reasoning
- Transmission With Machine Language Tokens A Paradigm for Task-Oriented Agent Communication
- MindChat Enhancing BCI Spelling with Large Language Models in Realistic Scenarios
- Automated HEMT Model Construction from Datasheets via Multi-Modal Intelligence and Prior-Knowledge-Free Optimization
- ReGATE Learning Faster and Better with Fewer Tokens in MLLMs
- Enhancing and Accelerating Brain MRI through Deep Learning Reconstruction Using Prior Subject-Specific Imaging
- Agentic Web Weaving the Next Web with AI Agents
- SmallThinker A Family of Efficient Large Language Models Natively Trained for Local Deployment
- The Importance of Facial Features in Vision-based Sign Language Recognition Eyes, Mouth or Full Face?
- Latent Inter-User Difference Modeling for LLM Personalization
- METEOR Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
- Advancing Compositional LLM Reasoning with Structured Task Relations in Interactive Multimodal Communications
- Beyond Class Tokens LLM-guided Dominant Property Mining for Few-shot Classification
- Advancing Shared and Multi-Agent Autonomy in Underwater Missions Integrating Knowledge Graphs and Retrieval-Augmented Generation
- Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation
- What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations
- Modeling Professionalism in Expert Questioning through Linguistic Differentiation
- FAEDKV Infinite-Window Fourier Transform for Unbiased KV Cache Compression
- The Carbon Cost of Conversation, Sustainability in the Age of Language Models
- CaliDrop KV Cache Compression with Calibration
- CrossPL Evaluating Large Language Models on Cross Programming Language Code Generation
- AgentMesh A Cooperative Multi-Agent Generative AI Framework for Software Development Automation
- HCAttention Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
- Large Language Model Agent for Structural Drawing Generation Using ReAct Prompt Engineering and Retrieval Augmented Generation
- LowKeyEMG Electromyographic typing with a reduced keyset
- Towards Inclusive NLP Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks
- "X of Information" Continuum A Survey on AI-Driven Multi-dimensional Metrics for Next-Generation Networked Systems
- DeltaLLM A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference
- Advancing Event Forecasting through Massive Training of Large Language Models Challenges, Solutions, and Broader Impacts
- GEPA Reflective Prompt Evolution Can Outperform Reinforcement Learning
- Step-3 is Large yet Affordable Model-system Co-design for Cost-effective Decoding
- Doubling Your Data in Minutes Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs
- Patch Pruning Strategy Based on Robust Statistical Measures of Attention Weight Diversity in Vision Transformers
- RegScore Scoring Systems for Regression Tasks
- MixA-Q Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective
- Deterministic diffusion models for Lagrangian turbulence robustness and encoding of extreme events
DepMicroDiff Diffusion-Based Dependency-Aware Multimodal Imputation for Microbiome Data
Authors: Rabeya Tus Sadia, Qiang Cheng
2025-07-31
http://arxiv.org/abs/2507.23676v1
Microbiome data analysis is essential for understanding host health and
disease, yet its inherent sparsity and noise pose major challenges for accurate
imputation, hindering downstream tasks such as biomarker discovery. Existing
imputation methods, including recent diffusion-based models, often fail to
capture the complex interdependencies between microbial taxa and overlook
contextual metadata that can inform imputation. We introduce DepMicroDiff, a
novel framework that combines diffusion-based generative modeling with a
Dependency-Aware Transformer (DAT) to explicitly capture both mutual pairwise
dependencies and autoregressive relationships. DepMicroDiff is further enhanced
by VAE-based pretraining across diverse cancer datasets and conditioning on
patient metadata encoded via a large language model (LLM). Experiments on TCGA
microbiome datasets show that DepMicroDiff substantially outperforms
state-of-the-art baselines, achieving higher Pearson correlation (up to 0.712),
cosine similarity (up to 0.812), and lower RMSE and MAE across multiple cancer
types, demonstrating its robustness and generalizability for microbiome
imputation.
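The four metrics reported above (Pearson correlation, cosine similarity, RMSE, MAE) can be computed for a pair of vectors as in the following minimal sketch. It uses only the Python standard library; the function name `imputation_metrics` and the toy values are illustrative assumptions and are not taken from the paper.

```python
import math

def imputation_metrics(truth, imputed):
    """Return Pearson correlation, cosine similarity, RMSE and MAE
    between a ground-truth vector and an imputed vector.

    Assumes neither vector is constant or all-zero (otherwise the
    correlation or cosine denominators would be zero).
    """
    if len(truth) != len(imputed) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(truth)
    mean_t = sum(truth) / n
    mean_i = sum(imputed) / n

    # Pearson correlation: covariance divided by the product of spreads.
    cov = sum((t - mean_t) * (i - mean_i) for t, i in zip(truth, imputed))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    pearson = cov / math.sqrt(var_t * var_i)

    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(t * i for t, i in zip(truth, imputed))
    norm_t = math.sqrt(sum(t * t for t in truth))
    norm_i = math.sqrt(sum(i * i for i in imputed))
    cosine = dot / (norm_t * norm_i)

    # Error metrics: lower is better.
    rmse = math.sqrt(sum((t - i) ** 2 for t, i in zip(truth, imputed)) / n)
    mae = sum(abs(t - i) for t, i in zip(truth, imputed)) / n
    return {"pearson": pearson, "cosine": cosine, "rmse": rmse, "mae": mae}

# Toy abundance values (hypothetical), not data from the paper.
scores = imputation_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

Higher Pearson and cosine values and lower RMSE and MAE indicate a closer match between imputed and held-out values.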
MemoCue Empowering LLM-Based Agents for Human Memory Recall via Strategy-Guided Querying
Authors: Qian Zhao, Zhuo Sun, Bin Guo, Zhiwen Yu
2025-07-31
http://arxiv.org/abs/2507.23633v1
Agent-assisted memory recall is a critical research problem in the field of
human-computer interaction. In conventional methods, the agent can retrieve
information from its equipped memory module to help the person recall
incomplete or vague memories. The limited size of the memory module hinders the
acquisition of complete memories and impacts the memory recall performance in
practice. Memory theories suggest that the person's relevant memory can be
proactively activated through some effective cues. Inspired by this, we propose
a novel strategy-guided agent-assisted memory recall method, allowing the agent
to transform an original query into a cue-rich one via the judiciously designed
strategy to help the person recall memories. To this end, there are two key
challenges. (1) How to choose the appropriate recall strategy for diverse
forgetting scenarios with distinct memory-recall characteristics? (2) How to
obtain high-quality responses that leverage recall strategies, given only
abstract and sparsely annotated strategy patterns? To address the challenges,
we propose a Recall Router framework. Specifically, we design a 5W Recall Map
to classify memory queries into five typical scenarios and define fifteen
recall strategy patterns across the corresponding scenarios. We then propose a
hierarchical recall tree combined with the Monte Carlo Tree Search algorithm to
optimize the selection of strategy and the generation of strategy responses. We
construct an instruction tuning dataset and fine-tune multiple open-source
large language models (LLMs) to develop MemoCue, an agent that excels in
providing memory-inspired responses. Experiments on three representative
datasets show that MemoCue surpasses LLM-based methods by 17.74% in recall
inspiration. Further human evaluation highlights its advantages in
memory-recall applications.
Trae Agent An LLM-based Agent for Software Engineering with Test-time Scaling
Authors: Trae Research Team, Pengfei Gao, Zhao Tian, Xiangxin Meng, Xinchen Wang, Ruida Hu, Yuanan Xiao, Yizhou Liu, Zhao Zhang, Junjie Chen, Cuiyun Gao, Yun Lin, Yingfei Xiong, Chao Peng, Xia Liu
2025-07-31
http://arxiv.org/abs/2507.23370v1
Software issue resolution is a critical challenge in software engineering and
has garnered increasing attention in recent years. With the rapid advancement
of large language models (LLMs), substantial progress has been made in
addressing real-world software engineering tasks. Recent studies have
introduced ensemble reasoning techniques to enhance the performance of
LLM-based issue resolution. However, existing prompting-based methods still
face limitations in effectively exploring large ensemble spaces and lack the
capacity for repository-level understanding, both of which constrain their
overall effectiveness. In this paper, we propose Trae Agent, the first
agent-based ensemble reasoning approach for repository-level issue resolution.
Trae Agent formulates our goal as an optimal solution search problem and
addresses two key challenges, i.e., large ensemble spaces and repository-level
understanding, through modular agents for generation, pruning, and selection.
We conduct extensive experiments using three leading LLMs on the widely-adopted
SWE-bench benchmark, comparing Trae Agent against four state-of-the-art
ensemble reasoning techniques. Experimental results demonstrate that Trae Agent
consistently achieves superior performance, with an average improvement of
10.22% over all baselines in terms of Pass@1. Trae Agent has achieved first
place on the SWE-bench Verified leaderboard, with a notable Pass@1 score of
75.20%. We are pleased to release Trae Agent as an open-source project to
support the research community, with all resources available at
https://github.com/bytedance/trae-agent.
Unveiling Super Experts in Mixture-of-Experts Large Language Models
Authors: Zunhai Su, Qingyuan Li, Hao Zhang, YuLei Qian, Yuchen Xie, Kehong Yuan
2025-07-31
http://arxiv.org/abs/2507.23279v1
Sparsely activated Mixture-of-Experts (MoE) models have shown promise in
enhancing the learning capacity of large language models (LLMs). Leveraging the
intrinsic importance differences among experts, recent research has explored
expert-level compression techniques to improve the efficiency of MoE LLMs.
However, existing approaches often rely on empirical criteria to identify
critical experts, lacking a deeper exploration and understanding of the
heterogeneous importance of experts. In this study, we present the first
discovery and investigation of a distinct subset of experts that play a crucial
role in the underlying mechanisms during the model's forward inference. These
experts are prevalent in open-source MoE LLMs, and despite their limited
number, pruning them leads to a significant decline in model performance (e.g.,
pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative
outputs). We refer to these experts as Super Experts (SEs). Our comprehensive
analysis provides progressively deeper insights into SEs. (i) SEs are
characterized by rare but extreme activation outliers in the output of the
down_proj, which give rise to massive activations in the hidden states between
decoder layers. Moreover, the distribution of SEs remains model-specific and is
unaffected by post-training processes. (ii) By pruning SEs, we assess their
significance across a variety of tasks, revealing their considerable impact on
the model's overall performance, particularly in mathematical reasoning. (iii)
We further enhance our understanding of the influence of SE compression. Our
findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are
crucial for the distribution of attention scores but are significantly
disrupted by SE pruning. The code is available at
https://github.com/ZunhaiSu/Super-Experts-Profilling.
Model Directions, Not Words Mechanistic Topic Models Using Sparse Autoencoders
Authors: Carolina Zheng, Nicolas Beltran-Velez, Sweta Karlekar, Claudia Shi, Achille Nazaret, Asif Mallik, Amir Feder, David M. Blei
2025-07-31
http://arxiv.org/abs/2507.23220v1
Traditional topic models are effective at uncovering latent themes in large
text collections. However, due to their reliance on bag-of-words
representations, they struggle to capture semantically abstract features. While
some neural variants use richer representations, they are similarly constrained
by expressing topics as word lists, which limits their ability to articulate
complex topics. We introduce Mechanistic Topic Models (MTMs), a class of topic
models that operate on interpretable features learned by sparse autoencoders
(SAEs). By defining topics over this semantically rich space, MTMs can reveal
deeper conceptual themes with expressive feature descriptions. Moreover,
uniquely among topic models, MTMs enable controllable text generation using
topic-based steering vectors. To properly evaluate MTM topics against
word-list-based approaches, we propose topic judge, an LLM-based
pairwise comparison evaluation framework. Across five datasets, MTMs match or
exceed traditional and neural baselines on coherence metrics, are consistently
preferred by topic judge, and enable effective steering of LLM outputs.
MoCHA Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention
Authors: Yuqi Pang, Bowen Yang, Yun Cao, Fan Rong, Xiaoyu Li, Chen He
2025-07-30
http://arxiv.org/abs/2507.22805v1
Vision large language models (VLLMs) are focusing primarily on handling
complex and fine-grained visual information by incorporating advanced vision
encoders and scaling up visual models. However, these approaches face high
training and inference costs, as well as challenges in extracting visual
details and effectively bridging modalities. In this work, we propose a
novel visual framework, MoCHA, to address these issues. Our framework
integrates four vision backbones (i.e., CLIP, SigLIP, DINOv2 and ConvNeXt) to
extract complementary visual features and is equipped with a sparse Mixture of
Experts Connectors (MoECs) module to dynamically select experts tailored to
different visual dimensions. To mitigate redundant or insufficient use of the
visual information encoded by the MoECs module, we further design a
Hierarchical Group Attention (HGA) with intra- and inter-group operations and
an adaptive gating strategy for encoded visual features. We train MoCHA on two
mainstream LLMs (e.g., Phi2-2.7B and Vicuna-7B) and evaluate their performance
across various benchmarks. Notably, MoCHA outperforms state-of-the-art
open-weight models on various tasks. For example, compared to CuMo
(Mistral-7B), our MoCHA (Phi2-2.7B) presents outstanding abilities to mitigate
hallucination by showing improvements of 3.25% in POPE and to follow visual
instructions by raising 153 points on MME. Finally, ablation studies further
confirm the effectiveness and robustness of the proposed MoECs and HGA in
improving the overall performance of MoCHA.
Modality-Aware Feature Matching A Comprehensive Review of Single- and Cross-Modality Techniques
Authors: Weide Liu, Wei Zhou, Jun Liu, Ping Hu, Jun Cheng, Jungong Han, Weisi Lin
2025-07-30
http://arxiv.org/abs/2507.22791v1
Feature matching is a cornerstone task in computer vision, essential for
applications such as image retrieval, stereo matching, 3D reconstruction, and
SLAM. This survey comprehensively reviews modality-based feature matching,
exploring traditional handcrafted methods and emphasizing contemporary deep
learning approaches across various modalities, including RGB images, depth
images, 3D point clouds, LiDAR scans, medical images, and vision-language
interactions. Traditional methods, leveraging detectors like Harris corners and
descriptors such as SIFT and ORB, demonstrate robustness under moderate
intra-modality variations but struggle with significant modality gaps.
Contemporary deep learning-based methods, exemplified by detector-free
strategies like CNN-based SuperPoint and Transformer-based LoFTR, substantially
improve robustness and adaptability across modalities. We highlight
modality-aware advancements, such as geometric and depth-specific descriptors
for depth images, sparse and dense learning methods for 3D point clouds,
attention-enhanced neural networks for LiDAR scans, and specialized solutions
like the MIND descriptor for complex medical image matching. Cross-modal
applications, particularly in medical image registration and vision-language
tasks, underscore the evolution of feature matching to handle increasingly
diverse data interactions.
trAIce3D A Prompt-Driven Transformer Based U-Net for Semantic Segmentation of Microglial Cells from Large-Scale 3D Microscopy Images
Authors: MohammadAmin Alamalhoda, Arsalan Firoozi, Alessandro Venturino, Sandra Siegert
2025-07-30
http://arxiv.org/abs/2507.22635v1
The shape of a cell contains essential information about its function within
the biological system. Segmenting these structures from large-scale 3D
microscopy images is challenging, limiting clinical insights especially for
microglia, immune-associated cells involved in neurodegenerative diseases.
Existing segmentation methods mainly focus on cell bodies, struggle with
overlapping structures, perform poorly on noisy images, require hyperparameter
tuning for each new dataset, or rely on tedious semi-automated approaches. We
introduce trAIce3D, a deep-learning architecture designed for precise microglia
segmentation, capturing both somas and branches. It employs a two-stage
approach: first, a 3D U-Net with vision transformers in the encoder detects
somas using a sliding-window technique to cover the entire image. Then, the
same architecture, enhanced with cross-attention blocks in skip connections,
refines each soma and its branches by using soma coordinates as a prompt and a
3D window around the target cell as input. Training occurs in two phases:
self-supervised Soma Segmentation, followed by prompt-based Branch
Segmentation, leveraging pre-trained weights from the first phase. Trained and
evaluated on a dataset of 41,230 microglial cells, trAIce3D significantly
improves segmentation accuracy and generalization, enabling scalable analysis
of complex cellular morphologies. While optimized for microglia, its
architecture can extend to other intricate cell types, such as neurons and
astrocytes, broadening its impact on neurobiological research.
Language Arithmetics Towards Systematic Language Neuron Identification and Manipulation
Authors: Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, Simon Ostermann
2025-07-30
http://arxiv.org/abs/2507.22608v1
Large language models (LLMs) exhibit strong multilingual abilities, yet the
neural mechanisms behind language-specific processing remain unclear. We
analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and
Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying
neurons that control language behavior. Using the Language Activation
Probability Entropy (LAPE) method, we show that these neurons cluster in deeper
layers, with non-Latin scripts showing greater specialization. Related
languages share overlapping neurons, reflecting internal representations of
linguistic proximity.
Through language arithmetics, i.e. systematic activation addition and
multiplication, we steer models to deactivate unwanted languages and activate
desired ones, outperforming simpler replacement approaches. These interventions
effectively guide behavior across five multilingual tasks: language forcing,
translation, QA, comprehension, and NLI. Manipulation is more successful for
high-resource languages, while typological similarity improves effectiveness.
We also demonstrate that cross-lingual neuron steering enhances downstream
performance and reveal internal "fallback" mechanisms for language selection
when neurons are progressively deactivated. Our code is made publicly available
at https://github.com/d-gurgurov/Language-Neurons-Manipulation.
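As a rough illustration of the entropy-based selection idea behind LAPE, the sketch below normalizes a neuron's per-language activation probabilities into a distribution and takes its entropy; low entropy suggests a language-specific neuron. This reflects our reading of the general LAPE recipe, not the authors' released code, and the function name `lape` and the probability values are hypothetical.

```python
import math

def lape(activation_probs):
    """Entropy of one neuron's activation probabilities across languages.

    activation_probs[k] is the (estimated) probability that the neuron
    fires on text in language k. The values are L1-normalized into a
    distribution before the entropy is taken.
    """
    total = sum(activation_probs)
    if total <= 0:
        raise ValueError("neuron never activates; entropy is undefined")
    dist = [p / total for p in activation_probs]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in dist if p > 0)

# Hypothetical activation probabilities for one neuron in four languages.
specific = lape([0.90, 0.01, 0.01, 0.01])  # fires mostly for one language
shared = lape([0.50, 0.50, 0.50, 0.50])    # fires equally for all languages
```

In this toy setting `specific` is smaller than `shared`, so a threshold on the score would flag the first neuron as language-specific.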
MetaAgent Automatically Constructing Multi-Agent Systems Based on Finite State Machines
Authors: Yaolun Zhang, Xiaogeng Liu, Chaowei Xiao
2025-07-30
http://arxiv.org/abs/2507.22606v1
Large Language Models (LLMs) have demonstrated the ability to solve a wide
range of practical tasks within multi-agent systems. However, existing
human-designed multi-agent frameworks are typically limited to a small set of
pre-defined scenarios, while current automated design methods suffer from
several limitations, such as the lack of tool integration, dependence on
external training data, and rigid communication structures. In this paper, we
propose MetaAgent, a finite state machine based framework that can
automatically generate a multi-agent system. Given a task description,
MetaAgent will design a multi-agent system and polish it through an
optimization algorithm. When the multi-agent system is deployed, the finite
state machine will control the agent's actions and the state transitions. To
evaluate our framework, we conduct experiments on both text-based tasks and
practical tasks. The results indicate that the generated multi-agent system
surpasses other auto-designed methods and can achieve a comparable performance
with the human-designed multi-agent system, which is optimized for those
specific tasks.
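The idea of a finite state machine controlling agent actions and state transitions can be sketched as follows. This is a generic illustration, not MetaAgent's implementation: the agent functions, state names, and signals (`coder`, `reviewer`, `"approved"`, and so on) are all hypothetical stand-ins for LLM-backed agents.

```python
def run_fsm(agents, transitions, start, task, final="done", max_steps=10):
    """Drive a set of agents with a finite state machine.

    agents:      maps a state name to a function payload -> (output, signal)
    transitions: maps (state, signal) to the next state name
    Returns the final payload, the (state, signal) trace, and the last state.
    """
    state, payload, trace = start, task, []
    for _ in range(max_steps):
        if state == final:
            break
        output, signal = agents[state](payload)  # the active agent acts
        trace.append((state, signal))
        payload = output
        state = transitions[(state, signal)]     # the FSM picks the next state
    return payload, trace, state

# Hypothetical agents standing in for LLM calls.
def coder(text):
    return text + " +code", "submitted"

def reviewer(text):
    # Toy rule: approve only after the coder has revised once.
    signal = "approved" if text.count("+code") >= 2 else "rejected"
    return text, signal

agents = {"coder": coder, "reviewer": reviewer}
transitions = {
    ("coder", "submitted"): "reviewer",
    ("reviewer", "rejected"): "coder",   # loop back for another attempt
    ("reviewer", "approved"): "done",
}

result, trace, last_state = run_fsm(agents, transitions, "coder", "task")
```

Keeping the transition table separate from the agents makes the control flow easy to inspect and edit; the `max_steps` bound guards against loops that never reach the final state.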
Multi-modal Relational Item Representation Learning for Inferring Substitutable and Complementary Items
Authors: Junting Wang, Chenghuan Guo, Jiao Yang, Yanhui Guo, Yan Gao, Hari Sundaram
2025-07-29
http://arxiv.org/abs/2507.22268v1
We introduce a novel self-supervised multi-modal relational item
representation learning framework designed to infer substitutable and
complementary items. Existing approaches primarily focus on modeling item-item
associations deduced from user behaviors using graph neural networks (GNNs) or
leveraging item content information. However, these methods often overlook
critical challenges, such as noisy user behavior data and data sparsity due to
the long-tailed distribution of these behaviors. In this paper, we propose
MMSC, a self-supervised multi-modal relational item representation learning
framework to address these challenges. Specifically, MMSC consists of three
main components: (1) a multi-modal item representation learning module that
leverages a multi-modal foundational model and learns from item metadata, (2) a
self-supervised behavior-based representation learning module that denoises and
learns from user behavior data, and (3) a hierarchical representation
aggregation mechanism that integrates item representations at both the semantic
and task levels. Additionally, we leverage LLMs to generate augmented training
data, further enhancing the denoising process during training. We conduct
extensive experiments on five real-world datasets, showing that MMSC
outperforms existing baselines by 26.1% for substitutable recommendation and
39.2% for complementary recommendation. In addition, we empirically show that
MMSC is effective in modeling cold-start items.
CTG-Insight A Multi-Agent Interpretable LLM Framework for Cardiotocography Analysis and Classification
Authors: Black Sun, Die, Hu
2025-07-29
http://arxiv.org/abs/2507.22205v1
Remote fetal monitoring technologies are becoming increasingly common. Yet,
most current systems offer limited interpretability, leaving expectant parents
with raw cardiotocography (CTG) data that is difficult to understand. In this
work, we present CTG-Insight, a multi-agent system that provides structured
interpretations of fetal heart rate (FHR) and uterine contraction (UC) signals.
Drawing from established medical guidelines, CTG-Insight decomposes each CTG
trace into five medically defined features: baseline, variability,
accelerations, decelerations, and sinusoidal pattern, each analyzed by a
dedicated agent. A final aggregation agent synthesizes the outputs to deliver a
holistic classification of fetal health, accompanied by a natural language
explanation. We evaluate CTG-Insight on the NeuroFetalNet Dataset and compare
it against deep learning models and the single-agent LLM baseline. Results show
that CTG-Insight achieves state-of-the-art accuracy (96.4%) and F1-score
(97.8%) while producing transparent and interpretable outputs. This work
contributes an interpretable and extensible CTG analysis framework.
Persona-Augmented Benchmarking Evaluating LLMs Across Diverse Writing Styles
Authors: Kimberly Le Truong, Riccardo Fogliato, Hoda Heidari, Zhiwei Steven Wu
2025-07-29
http://arxiv.org/abs/2507.22168v1
Current benchmarks for evaluating Large Language Models (LLMs) often do not
exhibit enough writing style diversity, with many adhering primarily to
standardized conventions. Such benchmarks do not fully capture the rich variety
of communication patterns exhibited by humans. Thus, it is possible that LLMs,
which are optimized on these benchmarks, may demonstrate brittle performance
when faced with "non-standard" input. In this work, we test this hypothesis by
rewriting evaluation prompts using persona-based LLM prompting, a low-cost
method to emulate diverse writing styles. Our results show that, even with
identical semantic content, variations in writing style and prompt formatting
significantly impact the estimated performance of the LLM under evaluation.
Notably, we identify distinct writing styles that consistently trigger either
low or high performance across a range of models and tasks, irrespective of
model family, size, and recency. Our work offers a scalable approach to augment
existing benchmarks, improving the external validity of the assessments they
provide for measuring LLM performance across linguistic variations.
IntentFlow Interactive Support for Communicating Intent with LLMs in Writing Tasks
Authors: Yoonsu Kim, Brandon Chin, Kihoon Son, Seoyoung Kim, Juho Kim
2025-07-29
http://arxiv.org/abs/2507.22134v1
While large language models (LLMs) are widely used for writing, users often
struggle to express their nuanced and evolving intents through prompt-based
interfaces. Intents -- low-level strategies or preferences for achieving a
writing goal -- are often vague, fluid, or even subconscious, making it
difficult for users to articulate and adjust them. To address this, we present
IntentFlow, which supports the communication of dynamically evolving intents
throughout LLM-assisted writing. IntentFlow extracts goals and intents from
user prompts and presents them as editable interface components, which users
can revise, remove, or refine via direct manipulation or follow-up prompts.
Visual links connect each component to the output segments it influences,
helping users understand model behavior. In a within-subjects study (N=12),
participants using IntentFlow, compared to a chat-based baseline, expressed
their intents more easily and in detail, engaged in more meaningful actions to
communicate intents, such as adjusting and deleting, and produced outputs that
better aligned with their evolving intents. We found that editable intent
representations help users refine and consolidate a final set of intents, which
can be reused across similar tasks to support consistent and transferable
LLM-assisted writing.
Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models
Authors: Hyunwoo Yoo, Gail L. Rosen
2025-07-29
http://arxiv.org/abs/2507.21980v1
Traditional machine learning models struggle to generalize in microbiome
studies where only metadata is available, especially in small-sample settings
or across studies with heterogeneous label formats. In this work, we explore
the use of large language models (LLMs) to classify microbial samples into
ontology categories such as EMPO 3 and related biological labels, as well as to
predict pathogen contamination risk, specifically the presence of E. coli,
using environmental metadata alone. We evaluate LLMs such as ChatGPT-4o, Claude
3.7 Sonnet, Grok-3, and LLaMA 4 in zero-shot and few-shot settings, comparing
their performance against traditional models like Random Forests across
multiple real-world datasets. Our results show that LLMs not only outperform
baselines in ontology classification, but also demonstrate strong predictive
ability for contamination risk, generalizing across sites and metadata
distributions. These findings suggest that LLMs can effectively reason over
sparse, heterogeneous biological metadata and offer a promising metadata-only
approach for environmental microbiology and biosurveillance applications.
EDGE-GRPO Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity
Authors: Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang
2025-07-29
http://arxiv.org/abs/2507.21848v1
Large Language Models (LLMs) have made remarkable progress in enhancing
step-by-step reasoning through reinforcement learning. However, the Group
Relative Policy Optimization (GRPO) algorithm, which relies on sparse reward
rules, often encounters the issue of identical rewards within groups, leading
to the advantage collapse problem. Existing works typically address this
challenge from two perspectives: enforcing model reflection to enhance response
diversity, and introducing internal feedback to augment the training signal
(advantage). In this work, we begin by analyzing the limitations of model
reflection and investigating the policy entropy of responses at the
fine-grained sample level. Based on our experimental findings, we propose the
EDGE-GRPO algorithm, which adopts Entropy-Driven Advantage
and Guided Error Correction to effectively mitigate the
problem of advantage collapse. Extensive experiments on several main reasoning
benchmarks demonstrate the effectiveness and superiority of our approach. It is
available at https://github.com/ZhangXJ199/EDGE-GRPO.
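The advantage collapse problem described above can be seen in a few lines. The sketch below uses the commonly cited group-relative normalization (reward minus group mean, divided by group standard deviation); it illustrates the baseline issue only and is not the EDGE-GRPO method. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group of responses.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed outcomes: correct responses get positive advantages, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Identical rewards (all correct, or all wrong): every advantage is zero,
# so the group contributes no policy-gradient signal.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

With rule-based 0/1 rewards, groups where every response earns the same reward are common, which is why the collapsed case matters in practice.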
Unlocking Interpretability for RF Sensing A Complex-Valued White-Box Transformer
Authors: Xie Zhang, Yina Wang, Chenshu Wu
2025-07-29
http://arxiv.org/abs/2507.21799v1
The empirical success of deep learning has spurred its application to the
radio-frequency (RF) domain, leading to significant advances in Deep Wireless
Sensing (DWS). However, most existing DWS models function as black boxes with
limited interpretability, which hampers their generalizability and raises
concerns in security-sensitive physical applications. In this work, inspired by
the remarkable advances of white-box transformers, we present RF-CRATE, the
first mathematically interpretable deep network architecture for RF sensing,
grounded in the principles of complex sparse rate reduction. To accommodate the
unique RF signals, we conduct non-trivial theoretical derivations that extend
the original real-valued white-box transformer to the complex domain. By
leveraging the CR-Calculus framework, we successfully construct a fully
complex-valued white-box transformer with theoretically derived self-attention
and residual multi-layer perceptron modules. Furthermore, to improve the
model's ability to extract discriminative features from limited wireless data,
we introduce Subspace Regularization, a novel regularization strategy that
enhances feature diversity, resulting in an average performance improvement of
19.98% across multiple sensing tasks. We extensively evaluate RF-CRATE against
seven baselines with multiple public and self-collected datasets involving
different RF signals. The results show that RF-CRATE achieves performance on
par with thoroughly engineered black-box models, while offering full
mathematical interpretability. More importantly, by extending CRATE to the
complex domain, RF-CRATE yields substantial improvements, achieving an average
classification gain of 5.08% and reducing regression error by 10.34% across
diverse sensing tasks compared to CRATE. RF-CRATE is fully open-sourced at:
https://github.com/rfcrate/RF_CRATE.
MOR-VIT Efficient Vision Transformer with Mixture-of-Recursions
Authors: YiZhou Li
2025-07-29
http://arxiv.org/abs/2507.21761v1
Vision Transformers (ViTs) have achieved remarkable success in image
recognition, yet standard ViT architectures are hampered by substantial
parameter redundancy and high computational cost, limiting their practical
deployment. While recent efforts on efficient ViTs primarily focus on static
model compression or token-level sparsification, they remain constrained by
fixed computational depth for all tokens. In this work, we present MoR-ViT, a
novel vision transformer framework that, for the first time, incorporates a
token-level dynamic recursion mechanism inspired by the Mixture-of-Recursions
(MoR) paradigm. This approach enables each token to adaptively determine its
processing depth, yielding a flexible and input-dependent allocation of
computational resources. Extensive experiments on ImageNet-1K and transfer
benchmarks demonstrate that MoR-ViT not only achieves state-of-the-art accuracy
with up to 70% parameter reduction and 2.5x inference acceleration, but also
outperforms leading efficient ViT baselines such as DynamicViT and TinyViT
under comparable conditions. These results establish dynamic recursion as an
effective strategy for efficient vision transformers and open new avenues for
scalable and deployable deep learning models in real-world scenarios.
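The token-level dynamic recursion idea can be illustrated with a small sketch. This is not the MoR-ViT implementation: tokens are plain floats, the per-token depths are supplied by hand in place of a learned router, and the names `recursive_forward`, `depths`, and `block` are hypothetical.

```python
def recursive_forward(tokens, depths, block):
    """Apply one shared `block` to token i exactly `depths[i]` times.

    `depths` stands in for the output of a learned router (an assumption);
    here it is simply passed in by the caller.
    """
    out = list(tokens)
    for step in range(max(depths)):
        # Only tokens whose assigned depth exceeds the current step stay active.
        active = [i for i, d in enumerate(depths) if d > step]
        for i in active:
            out[i] = block(out[i])
    return out


# Toy data: four scalar "tokens" and a shared block that doubles its input.
tokens = [1.0, 1.0, 1.0, 1.0]
depths = [1, 3, 2, 1]
result = recursive_forward(tokens, depths, block=lambda x: 2 * x)

# Block applications actually spent versus a fixed depth of 3 for every token.
block_calls = sum(depths)
fixed_depth_calls = max(depths) * len(tokens)
print(result, block_calls, fixed_depth_calls)
```

Because the block is shared across recursion steps, parameters are reused, and tokens with a small assigned depth leave the loop early, which is where the input-dependent compute saving would come from.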
Enhancing Graph-based Recommendations with Majority-Voting LLM-Rerank Augmentation
Authors: Minh-Anh Nguyen, Bao Nguyen, Ha Lan N. T., Tuan Anh Hoang, Duc-Trong Le, Dung D. Le
2025-07-29
http://arxiv.org/abs/2507.21563v1
Recommendation systems often suffer from data sparsity caused by limited
user-item interactions, which degrade their performance and amplify popularity
bias in real-world scenarios. This paper proposes a novel data augmentation
framework that leverages Large Language Models (LLMs) and item textual
descriptions to enrich interaction data. By few-shot prompting LLMs multiple
times to rerank items and aggregating the results via majority voting, we
generate high-confidence synthetic user-item interactions, supported by
theoretical guarantees based on the concentration of measure. To effectively
leverage the augmented data in the context of a graph recommendation system, we
integrate it into a graph contrastive learning framework to mitigate
distributional shift and alleviate popularity bias. Extensive experiments show
that our method improves accuracy and reduces popularity bias, outperforming
strong baselines.
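The aggregation step can be sketched as follows. This is a minimal illustration rather than the authors' code: the rankings are hard-coded stand-ins for repeated LLM rerank outputs, and the function name `majority_vote_interactions`, the `top_n` cutoff, and the strict-majority threshold are all assumptions.

```python
from collections import Counter


def majority_vote_interactions(rankings, top_n=2):
    """Keep items placed in the top-n by a strict majority of rerank runs.

    `rankings` holds one item-ID list per (hypothetical) LLM rerank call.
    """
    votes = Counter()
    for ranking in rankings:
        # Each run votes once for every item it places in its top-n.
        votes.update(set(ranking[:top_n]))
    threshold = len(rankings) / 2
    # Sort so the result is deterministic.
    return sorted(item for item, count in votes.items() if count > threshold)


# Stand-in outputs of five rerank calls for one user (illustrative data only).
runs = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b", "d"],
    ["a", "b", "d", "c"],
    ["d", "a", "b", "c"],
]
synthetic = majority_vote_interactions(runs, top_n=2)
print(synthetic)
```

Items that win the vote would then be added as synthetic user-item interactions; repeating the prompt more times makes a spurious majority less likely, which is the intuition behind the concentration argument the abstract mentions.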
TriangleMix A Lossless and Efficient Attention Pattern for Long Context Prefilling
Authors: Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu
2025-07-29
http://arxiv.org/abs/2507.21526v1
Large Language Models (LLMs) rely on attention mechanisms whose time
complexity grows quadratically with input sequence length, creating significant
computational bottlenecks during the prefilling stage. Existing static
attention methods typically degrade accuracy, while dynamic sparsity methods
introduce additional computational overhead due to runtime sparse index
estimation. To address these limitations, we propose TriangleMix, a novel
training-free static attention pattern. TriangleMix employs dense attention in
shallow layers and switches to a triangle-shaped sparse pattern in deeper
layers. Extensive experiments demonstrate that TriangleMix reduces attention
overhead by 3.7x to 15.3x in deep layers, and decreases overall
Time-to-First-Token (TTFT) by 12% to 32% for sequence lengths ranging from 32K
to 128K, without sacrificing model accuracy. Moreover, TriangleMix can be
seamlessly integrated with dynamic sparsity methods to achieve further speedup,
e.g. accelerating MInference by 19% at 128K, highlighting its potential to
enhance LLM inference efficiency.
Large Language Models for Wireless Communications From Adaptation to Autonomy
Authors: Le Liang, Hao Ye, Yucheng Sheng, Ouya Wang, Jiacheng Wang, Shi Jin, Geoffrey Ye Li
2025-07-29
http://arxiv.org/abs/2507.21524v1
The emergence of large language models (LLMs) has revolutionized artificial
intelligence, offering unprecedented capabilities in reasoning, generalization,
and zero-shot learning. These strengths open new frontiers in wireless
communications, where increasing complexity and dynamics demand intelligent and
adaptive solutions. This article explores the role of LLMs in transforming
wireless systems across three key directions: adapting pretrained LLMs for core
communication tasks, developing wireless-specific foundation models to balance
versatility and efficiency, and enabling agentic LLMs with autonomous reasoning
and coordination capabilities. We highlight recent advances, practical case
studies, and the unique benefits of LLM-based approaches over traditional
methods. Finally, we outline open challenges and research opportunities,
including multimodal fusion, collaboration with lightweight models, and
self-improving capabilities, charting a path toward intelligent, adaptive, and
autonomous wireless networks of the future.
Learning to Imitate with Less Efficient Individual Behavior Modeling in Chess
Authors: Zhenwei Tang, Difan Jiao, Eric Xue, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, Ashton Anderson
2025-07-29
http://arxiv.org/abs/2507.21488v1
As humans seek to collaborate with, learn from, and better understand
artificial intelligence systems, developing AIs that can accurately emulate
individual decision-making becomes increasingly important. Chess, a
long-standing AI benchmark with precise skill measurement, offers an ideal
testbed for human-AI alignment. However, existing approaches to modeling human
behavior require prohibitively large amounts of data from each individual,
making them impractical for new or sparsely represented users. In this work, we
introduce Maia4All, a framework designed to learn and adapt to individual
decision-making styles efficiently, even with limited data. Maia4All achieves
this through a two-stage optimization process: (1) an enrichment step, which
bridges population and individual-level human behavior modeling with a
prototype-enriched model, and (2) a democratization step, which leverages
ability levels or user prototypes to initialize and refine individual
embeddings with minimal data. Our experimental results show that Maia4All can
accurately predict individual moves and profile behavioral patterns with high
fidelity, establishing a new standard for personalized human-like AI behavior
modeling in chess. Maia4All achieves individual human behavior modeling in
chess with only 20 games, compared to the 5,000 games required previously,
representing a significant improvement in data efficiency. Our work provides an
example of how population AI systems can flexibly adapt to individual users
using a prototype-enriched model as a bridge. This approach extends beyond
chess, as shown in our case study on idiosyncratic LLMs, highlighting its
potential for broader applications in personalized AI adaptation.
An LLM Driven Agent Framework for Automated Infrared Spectral Multi Task Reasoning
Authors: Zujie Xie, Zixuan Chen, Jiheng Liang, Xiangyang Yu, Ziru Yu
2025-07-29
http://arxiv.org/abs/2507.21471v1
Infrared spectroscopy offers rapid, non-destructive measurement of chemical
and material properties but suffers from high-dimensional, overlapping spectral
bands that challenge conventional chemometric approaches. Emerging large
language models (LLMs), with their capacity for generalization and reasoning,
offer promising potential for automating complex scientific workflows. Despite
this promise, their application in IR spectral analysis remains largely
unexplored. This study addresses the critical challenge of achieving accurate,
automated infrared spectral interpretation under low-data conditions using an
LLM-driven framework. We introduce an end-to-end, large language model driven
agent framework that integrates a structured literature knowledge base,
automated spectral preprocessing, feature extraction, and multi-task reasoning
in a unified pipeline. By querying a curated corpus of peer-reviewed IR
publications, the agent selects scientifically validated routines. The selected
methods transform each spectrum into low-dimensional feature sets, which are
fed into few-shot prompt templates for classification, regression, and anomaly
detection. A closed-loop, multi-turn protocol iteratively appends mispredicted
samples to the prompt, enabling dynamic refinement of predictions. Across
diverse materials (stamp pad ink, Chinese medicine, Pu'er tea, Citri
Reticulatae Pericarpium, and waste water COD datasets), the multi-turn protocol
consistently outperforms single-turn inference, rivaling or exceeding machine
learning and deep learning models under low data regimes.
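The closed-loop, multi-turn protocol can be sketched as a small loop. This is an illustration under stated assumptions, not the paper's pipeline: `predict` is a hypothetical stand-in for the LLM call, the labeled pool of calibration samples is assumed, and a toy nearest-example rule replaces the model so the sketch runs on its own.

```python
def multi_turn_refine(predict, support, labeled_pool, max_turns=3):
    """Append mispredicted samples to the few-shot examples, then retry.

    `predict(examples, x)` is a hypothetical stand-in for an LLM call that
    receives the current few-shot examples and one feature value.
    """
    examples = list(support)
    for _ in range(max_turns):
        wrong = [(x, y) for x, y in labeled_pool if predict(examples, x) != y]
        if not wrong:
            break
        # Closed loop: feed this turn's errors back into the next prompt.
        examples.extend(s for s in wrong if s not in examples)
    return examples


def nearest_example(examples, x):
    """Toy predictor (not an LLM): label of the closest 1-D example."""
    return min(examples, key=lambda e: abs(e[0] - x))[1]


# Illustrative one-dimensional "features" with labels.
support = [(0.0, "low"), (10.0, "high")]
pool = [(1.0, "low"), (4.0, "high"), (6.0, "high"), (3.0, "low")]

final_examples = multi_turn_refine(nearest_example, support, pool)
errors = [x for x, y in pool if nearest_example(final_examples, x) != y]
print(len(final_examples), errors)
```

In this toy run the first turn mislabels one sample, the second turn mislabels a different one, and the third turn makes no errors, so the example set grows from two to four entries.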
Transmission With Machine Language Tokens A Paradigm for Task-Oriented Agent Communication
Authors: Zhuoran Xiao, Chenhui Ye, Yijia Feng, Yunbo Hu, Tianyu Jiao, Liyu Cai, Guangyi Liu
2025-07-29
http://arxiv.org/abs/2507.21454v1
The rapid advancement in large foundation models is propelling paradigm
shifts across various industries. One significant change is that agents,
instead of traditional machines or humans, will be the primary participants in
the future production process, which consequently requires a novel AI-native
communication system tailored for agent communications. Integrating the ability
of large language models (LLMs) with task-oriented semantic communication is a
potential approach. However, the output of existing LLMs is human language,
which is highly constrained and sub-optimal for agent-type communication. In
this paper, we innovatively propose a task-oriented agent communication system.
Specifically, we leverage the original LLM to learn a specialized machine
language represented by token embeddings. Simultaneously, a multi-modal LLM is
trained to comprehend the application task and to extract essential implicit
information from multi-modal inputs, subsequently expressing it using machine
language tokens. This representation is significantly more efficient for
transmission over the air interface. Furthermore, to reduce transmission
overhead, we introduce a joint token and channel coding (JTCC) scheme that
compresses the token sequence by exploiting its sparsity while enhancing
robustness against channel noise. Extensive experiments demonstrate that our
approach reduces transmission overhead for downstream tasks while enhancing
accuracy relative to the SOTA methods.
MindChat Enhancing BCI Spelling with Large Language Models in Realistic Scenarios
Authors: JIaheng Wang, Yucun Zhong, Chengjie Huang, Lin Yao
2025-07-29
http://arxiv.org/abs/2507.21435v1
Brain-computer interface (BCI) spellers can provide a new communication
channel independent of the peripheral nervous system, which is especially
valuable for patients with severe motor disabilities. However, current BCI
spellers often require users to type intended utterances letter-by-letter while
spelling errors grow proportionally due to inaccurate electroencephalogram (EEG)
decoding, largely impeding the efficiency and usability of BCIs in real-world
communication. In this paper, we present MindChat, a large language model
(LLM)-assisted BCI speller to enhance BCI spelling efficiency by reducing
users' manual keystrokes. Building upon prompt engineering, we prompt LLMs
(GPT-4o) to continuously suggest context-aware word and sentence
completions/predictions during spelling. Online copy-spelling experiments
encompassing four dialogue scenarios demonstrate that MindChat saves more than
62% keystrokes and over 32% spelling time compared with traditional BCI
spellers. We envision high-speed BCI spellers enhanced by LLMs will potentially
lead to truly practical applications.
Automated HEMT Model Construction from Datasheets via Multi-Modal Intelligence and Prior-Knowledge-Free Optimization
Authors: Yuang Peng, Jiarui Zhong, Yang Zhang, Hong Cai Chen
2025-07-29
http://arxiv.org/abs/2507.21430v1
Parameter extraction for industry-standard device models like ASM-HEMT is
crucial in circuit design workflows. However, many manufacturers do not provide
such models, leaving users to build them using only datasheets. Unfortunately,
datasheets lack sufficient information for standard step-by-step extraction.
Moreover, manual data extraction from datasheets is highly time-consuming, and
the absence of a fully automated method forces engineers to perform tedious
manual work. To address this challenge, this paper introduces a novel,
end-to-end framework that fully automates the generation of simulation-ready
ASM-HEMT SPICE models directly from PDF datasheets. Our framework is founded on
two core innovations: 1) a multi-modal AI pipeline that integrates computer
vision with a large language model (LLM) to robustly parse heterogeneous
datasheet layouts and digitize characteristic curves, and 2) a novel
Iterative-Focusing Tree-structured Parzen Estimator (IF-TPE) optimization
algorithm, specifically designed for device parameter extraction under the
high-dimensional, sparse-data condition by adaptively refining the parameter
search space. Experimental validation on a diverse set of 17 commercial HEMT
devices from 10 manufacturers confirms the framework's accuracy and robustness.
The generated models demonstrate excellent agreement with published DC and RF
characteristics. As the first fully automated workflow of its kind, our
proposed solution offers a transformative approach to device modeling, poised
to significantly accelerate the circuit design cycle by eliminating the need
for manual parameter extraction.
ReGATE Learning Faster and Better with Fewer Tokens in MLLMs
Authors: Chaoyu Li, Yogesh Kulkarni, Pooyan Fazli
2025-07-29
http://arxiv.org/abs/2507.21420v1
The computational cost of training multimodal large language models (MLLMs)
rapidly increases with the number of tokens involved. Existing efficiency
methods primarily target inference and rely on token reduction or merging,
offering limited benefit during training. In this paper, we propose ReGATE
(Reference-Guided Adaptive Token Elision), an adaptive token pruning method
for accelerating MLLM training. Specifically, ReGATE adopts a teacher-student
framework in which the MLLM being trained serves as the student, and a frozen
reference large language model (LLM) acts as the teacher. The teacher computes
per-token reference losses, which are combined with an exponential moving
average (EMA) of the student's own difficulty scores. This adaptive
difficulty-based scoring enables the selective processing of crucial tokens
while bypassing less informative ones in the forward pass, significantly
reducing computational overhead. Experiments demonstrate that ReGATE, when
applied to VideoLLaMA2, matches the peak accuracy of standard training on
MVBench up to 2x faster, using only 35% of the tokens. With additional
training, it even surpasses the baseline on several multimodal benchmarks, all
while reducing the total token count by over 41%. Code and models will be
released soon.
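The difficulty-based token selection can be sketched roughly as below. The abstract does not say how the teacher's reference losses and the student's EMA are combined, so the rule used here (updated EMA minus reference loss) is an assumption, as are the function name `select_tokens`, the `decay` value, and the toy numbers.

```python
def select_tokens(ref_losses, student_losses, ema, keep_ratio=0.35, decay=0.5):
    """Return (kept token indices, updated EMA) for one training step.

    Assumed scoring rule: updated EMA of the student's per-token loss minus
    the teacher's reference loss; a higher score means the token is kept.
    """
    new_ema = [decay * e + (1 - decay) * s for e, s in zip(ema, student_losses)]
    scores = [e - r for e, r in zip(new_ema, ref_losses)]
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore sequence order for the tokens that survive.
    return sorted(ranked[:k]), new_ema


# Illustrative per-token losses for a six-token sequence (made-up values).
ref = [0.2, 1.5, 0.1, 0.9, 0.3, 2.0]        # frozen teacher
student = [0.3, 2.5, 0.1, 1.0, 1.8, 2.1]    # model being trained
ema = [0.5, 2.0, 0.2, 1.0, 1.5, 2.0]        # running student difficulty

kept, ema = select_tokens(ref, student, ema)
print(kept)
```

Only the kept indices would go through the student's forward pass for this step; the 0.35 default mirrors the 35% token figure reported in the abstract but is otherwise arbitrary.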
Enhancing and Accelerating Brain MRI through Deep Learning Reconstruction Using Prior Subject-Specific Imaging
Authors: Amirmohammad Shamaei, Alexander Stebner, Salome, Bosshart, Johanna Ospel, Gouri Ginde, Mariana Bento, Roberto Souza
2025-07-28
http://arxiv.org/abs/2507.21349v1
Magnetic resonance imaging (MRI) is a crucial medical imaging modality.
However, long acquisition times remain a significant challenge, leading to
increased costs and reduced patient comfort. Recent studies have shown the
potential of using deep learning models that incorporate information from prior
subject-specific MRI scans to improve reconstruction quality of present scans.
Integrating this prior information requires registration of the previous scan
to the current image reconstruction, which can be time-consuming. We propose a
novel deep-learning-based MRI reconstruction framework which consists of an
initial reconstruction network, a deep registration model, and a
transformer-based enhancement network. We validated our method on a
longitudinal dataset of T1-weighted MRI scans with 2,808 images from 18
subjects at four acceleration factors (R5, R10, R15, R20). Quantitative metrics
confirmed our approach's superiority over existing methods (p < 0.05, Wilcoxon
signed-rank test). Furthermore, we analyzed the impact of our MRI
reconstruction method on the downstream task of brain segmentation and observed
improved accuracy and volumetric agreement with reference segmentations. Our
approach also achieved a substantial reduction in total reconstruction time
compared to methods that use traditional registration algorithms, making it
more suitable for real-time clinical applications. The code associated with
this work is publicly available at
https://github.com/amirshamaei/longitudinal-mri-deep-recon.
Agentic Web Weaving the Next Web with AI Agents
Authors: Yingxuan Yang, Mulei Ma, Yuxuan Huang, Huacan Chai, Chenyu Gong, Haoran Geng, Yuanjian Zhou, Ying Wen, Meng Fang, Muhao Chen, Shangding Gu, Ming Jin, Costas Spanos, Yang Yang, Pieter Abbeel, Dawn Song, Weinan Zhang, Jun Wang
2025-07-28
http://arxiv.org/abs/2507.21206v1
The emergence of AI agents powered by large language models (LLMs) marks a
pivotal shift toward the Agentic Web, a new phase of the internet defined by
autonomous, goal-driven interactions. In this paradigm, agents interact
directly with one another to plan, coordinate, and execute complex tasks on
behalf of users. This transition from human-driven to machine-to-machine
interaction allows intent to be delegated, relieving users from routine digital
operations and enabling a more interactive, automated web experience. In this
paper, we present a structured framework for understanding and building the
Agentic Web. We trace its evolution from the PC and Mobile Web eras and
identify the core technological foundations that support this shift. Central to
our framework is a conceptual model consisting of three key dimensions:
intelligence, interaction, and economics. These dimensions collectively enable
the capabilities of AI agents, such as retrieval, recommendation, planning, and
collaboration. We analyze the architectural and infrastructural challenges
involved in creating scalable agentic systems, including communication
protocols, orchestration strategies, and emerging paradigms such as the Agent
Attention Economy. We conclude by discussing the potential applications,
societal risks, and governance issues posed by agentic systems, and outline
research directions for developing open, secure, and intelligent ecosystems
shaped by both human intent and autonomous agent behavior. A continuously
updated collection of relevant studies for agentic web is available at:
https://github.com/SafeRL-Lab/agentic-web.
SmallThinker A Family of Efficient Large Language Models Natively Trained for Local Deployment
Authors: Yixin Song, Zhenliang Xue, Dongliang Wei, Feiyang Chen, Jianxiang Gao, Junchen Liu, Hangyu Liang, Guangshuo Qin, Chengrong Tian, Bo Wen, Longyu Zhao, Xinrui Zheng, Zeyu Mi, Haibo Chen
2025-07-28
http://arxiv.org/abs/2507.20984v2
While frontier large language models (LLMs) continue to push capability
boundaries, their deployment remains confined to GPU-powered cloud
infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs
natively designed - not adapted - for the unique constraints of local devices:
weak computational power, limited memory, and slow storage. Unlike traditional
approaches that mainly compress existing models built for clouds, we architect
SmallThinker from the ground up to thrive within these limitations. Our
innovation lies in a deployment-aware architecture that transforms constraints
into design principles. First, we introduce a two-level sparse structure
combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward
networks, drastically reducing computational demands without sacrificing model
capacity. Second, to conquer the I/O bottleneck of slow storage, we design a
pre-attention router that enables our co-designed inference engine to prefetch
expert parameters from storage while computing attention, effectively hiding
storage latency that would otherwise cripple on-device inference. Third, for
memory efficiency, we utilize a NoPE-RoPE hybrid sparse attention mechanism to
slash KV cache requirements. We release SmallThinker-4B-A0.6B and
SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and
even outperform larger LLMs. Remarkably, our co-designed system mostly
eliminates the need for expensive GPU hardware: with Q4_0 quantization, both
models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1GB
and 8GB of memory respectively. SmallThinker is publicly available at
hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and
hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.
The Importance of Facial Features in Vision-based Sign Language Recognition Eyes, Mouth or Full Face?
Authors: Dinh Nam Pham, Eleftherios Avramidis
2025-07-28
http://arxiv.org/abs/2507.20884v2
Non-manual facial features play a crucial role in sign language
communication, yet their importance in automatic sign language recognition
(ASLR) remains underexplored. While prior studies have shown that incorporating
facial features can improve recognition, related work often relies on
hand-crafted feature extraction and fails to go beyond the comparison of manual
features versus the combination of manual and facial features. In this work, we
systematically investigate the contribution of distinct facial regions (eyes,
mouth, and full face) using two different deep learning models (a CNN-based
model and a transformer-based model) trained on an SLR dataset of isolated signs with
randomly selected classes. Through quantitative performance and qualitative
saliency map evaluation, we reveal that the mouth is the most important
non-manual facial feature, significantly improving accuracy. Our findings
highlight the necessity of incorporating facial features in ASLR.
Latent Inter-User Difference Modeling for LLM Personalization
Authors: Yilun Qiu, Tianhao Shi, Xiaoyan Zhao, Fengbin Zhu, Yang Zhang, Fuli Feng
2025-07-28
http://arxiv.org/abs/2507.20849v1
Large language models (LLMs) are increasingly integrated into users' daily
lives, leading to a growing demand for personalized outputs. Previous work
focuses on leveraging a user's own history, overlooking inter-user differences
that are crucial for effective personalization. While recent work has attempted
to model such differences, the reliance on language-based prompts often hampers
the effective extraction of meaningful distinctions. To address these issues,
we propose Difference-aware Embedding-based Personalization (DEP), a framework
that models inter-user differences in the latent space instead of relying on
language prompts. DEP constructs soft prompts by contrasting a user's embedding
with those of peers who engaged with similar content, highlighting relative
behavioral signals. A sparse autoencoder then filters and compresses both
user-specific and difference-aware embeddings, preserving only task-relevant
features before injecting them into a frozen LLM. Experiments on personalized
review generation show that DEP consistently outperforms baseline methods
across multiple metrics. Our code is available at
https://github.com/SnowCharmQ/DEP.
METEOR Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
Authors: Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian
2025-07-28
http://arxiv.org/abs/2507.20842v1
Vision encoders serve as the cornerstone of multimodal understanding.
Single-encoder architectures like CLIP exhibit inherent constraints in
generalizing across diverse multimodal tasks, while recent multi-encoder fusion
methods introduce prohibitive computational overhead to achieve superior
performance using complementary visual representations from multiple vision
encoders. To address this, we propose a progressive framework, namely
Multi-Encoder collaboraTivE tOken pRuning (METEOR), that eliminates redundant
visual tokens across the encoding, fusion, and decoding stages for
multi-encoder MLLMs. For multi-vision encoding, we discard redundant tokens
within each encoder via a rank-guided collaborative token assignment strategy.
Subsequently, for multi-vision fusion, we combine the visual features from
different encoders while reducing cross-encoder redundancy with cooperative
pruning. Finally, we propose an adaptive token pruning method in the
decoding stage to further discard irrelevant tokens based on the text prompts
with dynamically adjusted pruning ratios for specific task demands. To the
best of our knowledge, this is the first successful attempt that achieves an
efficient multi-encoder based vision language model with multi-stage pruning
strategies. Extensive experiments on 11 benchmarks demonstrate the effectiveness
of our proposed approach. Compared with EAGLE, a typical multi-encoder MLLM,
METEOR reduces visual tokens by 76% with only a 0.3% performance drop on
average. The code
is available at https://github.com/YuchenLiu98/METEOR.
Advancing Compositional LLM Reasoning with Structured Task Relations in Interactive Multimodal Communications
Authors: Xinye Cao, Hongcan Guo, Guoshun Nan, Jiaoyang Cui, Haoting Qian, Yihan Lin, Yilin Peng, Diyang Zhang, Yanzhao Hou, Huici Wu, Xiaofeng Tao, Tony Q. S. Quek
2025-07-28
http://arxiv.org/abs/2507.21199v1
Interactive multimodal applications (IMAs), such as route planning in the
Internet of Vehicles, enrich users' personalized experiences by integrating
various forms of data over wireless networks. Recent advances in large language
models (LLMs) utilize mixture-of-experts (MoE) mechanisms to empower multiple
IMAs, with each LLM trained individually for a specific task that presents
different business workflows. In contrast to existing approaches that rely on
multiple LLMs for IMAs, this paper presents a novel paradigm that accomplishes
various IMAs using a single compositional LLM over wireless networks. The two
primary challenges include 1) guiding a single LLM to adapt to diverse IMA
objectives and 2) ensuring the flexibility and efficiency of the LLM in
resource-constrained mobile environments. To tackle the first challenge, we
propose ContextLoRA, a novel method that guides an LLM to learn the rich
structured context among IMAs by constructing a task dependency graph. We
partition the learnable parameter matrix of neural layers for each IMA to
facilitate LLM composition. Then, we develop a step-by-step fine-tuning
procedure guided by task relations, including training, freezing, and masking
phases. This allows the LLM to learn to reason among tasks for better
adaptation, capturing the latent dependencies between tasks. For the second
challenge, we introduce ContextGear, a scheduling strategy to optimize the
training procedure of ContextLoRA, aiming to minimize computational and
communication costs through a strategic grouping mechanism. Experiments on
three benchmarks show the superiority of the proposed ContextLoRA and
ContextGear. Furthermore, we prototype our proposed paradigm on a real-world
wireless testbed, demonstrating its practical applicability for various IMAs.
We will release our code to the community.
Beyond Class Tokens LLM-guided Dominant Property Mining for Few-shot Classification
Authors: Wei Zhuo, Runjie Luo, Wufeng Xue, Linlin Shen
2025-07-28
http://arxiv.org/abs/2507.20511v2
Few-shot Learning (FSL), which endeavors to develop the generalization
ability for recognizing novel classes using only a few images, faces
significant challenges due to data scarcity. Recent CLIP-like methods based on
contrastive language-image pretraining mitigate the issue by leveraging textual
representation of the class name for unseen image discovery. Despite the
achieved success, simply aligning visual representations to class name
embeddings would compromise the visual diversity for novel class
discrimination. To this end, we proposed a novel Few-Shot Learning (FSL) method
(BCT-CLIP) that explores dominating properties via contrastive
learning beyond simply using class tokens. Through leveraging LLM-based prior
knowledge, our method pushes forward FSL with comprehensive structural image
representations, including both global category representation and the
patch-aware property embeddings. In particular, we presented a novel
multi-property generator (MPG) with patch-aware cross-attentions to generate
multiple visual property tokens, a Large-Language Model (LLM)-assistant
retrieval procedure with clustering-based pruning to obtain dominating property
descriptions, and a new contrastive learning strategy for property-token
learning. The superior performances on the 11 widely used datasets demonstrate
that our investigation of dominating properties advances discriminative
class-specific representation learning and few-shot classification.
Advancing Shared and Multi-Agent Autonomy in Underwater Missions Integrating Knowledge Graphs and Retrieval-Augmented Generation
Authors: Michele Grimaldi, Carlo Cernicchiaro, Sebastian Realpe Rua, Alaaeddine El-Masri-El-Chaarani, Markus Buchholz, Loizos Michael, Pere Ridao Rodriguez, Ignacio Carlucho, Yvan R. Petillot
2025-07-27
http://arxiv.org/abs/2507.20370v1
Robotic platforms have become essential for marine operations by providing
regular and continuous access to offshore assets, such as underwater
infrastructure inspection, environmental monitoring, and resource exploration.
However, the complex and dynamic nature of underwater environments,
characterized by limited visibility, unpredictable currents, and communication
constraints, presents significant challenges that demand advanced autonomy
while ensuring operator trust and oversight. Central to addressing these
challenges are knowledge representation and reasoning techniques, particularly
knowledge graphs and retrieval-augmented generation (RAG) systems, that enable
robots to efficiently structure, retrieve, and interpret complex environmental
data. These capabilities empower robotic agents to reason, adapt, and respond
effectively to changing conditions. The primary goal of this work is to
demonstrate both multi-agent autonomy and shared autonomy, where multiple
robotic agents operate independently while remaining connected to a human
supervisor. We show how a RAG-powered large language model, augmented with
knowledge graph data and domain taxonomy, enables autonomous multi-agent
decision-making and facilitates seamless human-robot interaction, resulting in
100% mission validation and behavior completeness. Finally, ablation studies
reveal that without structured knowledge from the graph and/or taxonomy, the
LLM is prone to hallucinations, which can compromise decision quality.
Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation
Authors: Abdullah Alabdullah, Lifeng Han, Chenghua Lin
2025-07-27
http://arxiv.org/abs/2507.20301v1
Dialectal Arabic (DA) poses a persistent challenge for natural language
processing (NLP), as most everyday communication in the Arab world occurs in
dialects that diverge significantly from Modern Standard Arabic (MSA). This
linguistic divide limits access to digital services and educational resources
and impedes progress in Arabic machine translation. This paper presents two
core contributions to advancing DA-MSA translation for the Levantine, Egyptian,
and Gulf dialects, particularly in low-resource and computationally constrained
settings: a comprehensive evaluation of training-free prompting techniques, and
the development of a resource-efficient fine-tuning pipeline. Our evaluation of
prompting strategies across six large language models (LLMs) found that
few-shot prompting consistently outperformed zero-shot, chain-of-thought, and
our proposed Ara-TEaR method. GPT-4o achieved the highest performance across
all prompting settings. For fine-tuning, a quantized Gemma2-9B model achieved a
CHrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58). Joint
multi-dialect trained models outperformed single-dialect counterparts by over
10% CHrF++, and 4-bit quantization reduced memory usage by 60% with less than
1% performance loss. The results and insights of our experiments offer a
practical blueprint for improving dialectal inclusion in Arabic NLP, showing
that high-quality DA-MSA machine translation is achievable even with limited
resources and paving the way for more inclusive language technologies.
What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations
Authors: Katharina Trinley, Toshiki Nakai, Tatiana Anikina, Tanja Baeumel
2025-07-27
http://arxiv.org/abs/2507.20279v1
Large language models (LLMs) excel at multilingual tasks, yet their internal
language processing remains poorly understood. We analyze how Aya-23-8B, a
decoder-only LLM trained on balanced multilingual data, handles code-mixed,
cloze, and translation tasks compared to predominantly monolingual models like
Llama 3 and Chinese-LLaMA-2. Using logit lens and neuron specialization
analyses, we find: (1) Aya-23 activates typologically related language
representations during translation, unlike English-centric models that rely on
a single pivot language; (2) code-mixed neuron activation patterns vary with
mixing rates and are shaped more by the base language than the mixed-in one;
and (3) Aya-23's language-specific neurons for code-mixed inputs concentrate in
final layers, diverging from prior findings on decoder-only models. Neuron
analysis further shows that script similarity and typological relations
impact processing across model types. These findings reveal how multilingual
training shapes LLM internals and inform future cross-lingual transfer
research.
Modeling Professionalism in Expert Questioning through Linguistic Differentiation
Authors: Giulia D'Agostino, Chung-Chi Chen
2025-07-27
http://arxiv.org/abs/2507.20249v1
Professionalism is a crucial yet underexplored dimension of expert
communication, particularly in high-stakes domains like finance. This paper
investigates how linguistic features can be leveraged to model and evaluate
professionalism in expert questioning. We introduce a novel annotation
framework to quantify structural and pragmatic elements in financial analyst
questions, such as discourse regulators, prefaces, and request types. Using
both human-authored and large language model (LLM)-generated questions, we
construct two datasets: one annotated for perceived professionalism and one
labeled by question origin. We show that the same linguistic features correlate
strongly with both human judgments and authorship origin, suggesting a shared
stylistic foundation. Furthermore, a classifier trained solely on these
interpretable features outperforms gemini-2.0 and SVM baselines in
distinguishing expert-authored questions. Our findings demonstrate that
professionalism is a learnable, domain-general construct that can be captured
through linguistically grounded modeling.
FAEDKV Infinite-Window Fourier Transform for Unbiased KV Cache Compression
Authors: Runchao Li, Yao Fu, Mu Sheng, Xianxuan Long, Haotian Yu, Pan Li
2025-07-26
http://arxiv.org/abs/2507.20030v1
The efficacy of Large Language Models (LLMs) in long-context tasks is often
hampered by the substantial memory footprint and computational demands of the
Key-Value (KV) cache. Current compression strategies, including token eviction
and learned projections, frequently lead to biased representations -- either by
overemphasizing recent/high-attention tokens or by repeatedly degrading
information from earlier context -- and may require costly model retraining. We
present FAEDKV (Frequency-Adaptive Infinite-Window for KV cache), a novel,
training-free KV cache compression framework that ensures unbiased information
retention. FAEDKV operates by transforming the KV cache into the frequency
domain using a proposed Infinite-Window Fourier Transform (IWDFT). This
approach allows for the equalized contribution of all tokens to the compressed
representation, effectively preserving both early and recent contextual
information. A preliminary frequency ablation study identifies critical
spectral components for layer-wise, targeted compression. Experiments on
the LongBench benchmark demonstrate FAEDKV's superiority over existing methods
by up to 22%. In addition, our method shows superior, position-agnostic
retrieval accuracy on the Needle-In-A-Haystack task compared to
compression-based approaches.
The Carbon Cost of Conversation, Sustainability in the Age of Language Models
Authors: Sayed Mahbub Hasan Amiri, Prasun Goswami, Md. Mainul Islam, Mohammad Shakhawat Hossen, Sayed Majhab Hasan Amiri, Naznin Akter
2025-07-26
http://arxiv.org/abs/2507.20018v2
Large language models (LLMs) like GPT-3 and BERT have revolutionized natural
language processing (NLP), yet their environmental costs remain dangerously
overlooked. This article critiques the sustainability of LLMs, quantifying
their carbon footprint, water usage, and contribution to e-waste through case
studies of models such as GPT-4 and energy-efficient alternatives like Mistral
7B. Training a single LLM can emit carbon dioxide equivalent to hundreds of
cars driven annually, while data centre cooling exacerbates water scarcity in
vulnerable regions. Systemic challenges (corporate greenwashing, redundant model
development, and regulatory voids) perpetuate harm, disproportionately burdening
marginalized communities in the Global South. However, pathways exist for
sustainable NLP: technical innovations (e.g., model pruning, quantum
computing), policy reforms (carbon taxes, mandatory emissions reporting), and
cultural shifts prioritizing necessity over novelty. By analysing industry
leaders (Google, Microsoft) and laggards (Amazon), this work underscores the
urgency of ethical accountability and global cooperation. Without immediate
action, AI's ecological toll risks outpacing its societal benefits. The article
concludes with a call to align technological progress with planetary
boundaries, advocating for equitable, transparent, and regenerative AI systems
that prioritize both human and environmental well-being.
CaliDrop KV Cache Compression with Calibration
Authors: Yi Su, Quantong Qiu, Yuechi Zhou, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
2025-07-26
http://arxiv.org/abs/2507.19906v1
Large Language Models (LLMs) require substantial computational resources
during generation. While the Key-Value (KV) cache significantly accelerates
this process by storing attention intermediates, its memory footprint grows
linearly with sequence length, batch size, and model size, creating a
bottleneck in long-context scenarios. Various KV cache compression techniques,
including token eviction, quantization, and low-rank projection, have been
proposed to mitigate this bottleneck, often complementing each other. This
paper focuses on enhancing token eviction strategies. Token eviction leverages
the observation that attention patterns are often sparse, allowing for the
removal of less critical KV entries to save memory. However, this reduction
usually comes at the cost of notable accuracy degradation, particularly under
high compression ratios. To address this issue, we propose CaliDrop, a
novel strategy that enhances token eviction through calibration. Our
preliminary experiments show that queries at nearby positions exhibit high
similarity. Building on this observation, CaliDrop performs speculative
calibration on the discarded tokens to mitigate the accuracy loss caused by
token eviction. Extensive experiments demonstrate that CaliDrop significantly
improves the accuracy of existing token eviction methods.
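The token eviction idea in the abstract above can be illustrated with a minimal NumPy sketch: keep the cached entries that received the most attention and drop the rest. The scoring rule (attention mass summed over queries), the function name `evict_tokens`, and the `keep_ratio` parameter are assumptions made for illustration. The sketch does not implement CaliDrop's calibration step, which the abstract does not specify in detail.

```python
import numpy as np

def evict_tokens(keys, values, attn, keep_ratio=0.5):
    """Keep the cached tokens that received the most attention (assumed rule).

    keys, values: arrays of shape (seq_len, head_dim)
    attn: array of shape (num_queries, seq_len); each row is a softmax output
    Returns the retained keys, retained values, and their original positions.
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(round(seq_len * keep_ratio)))
    # Accumulate the attention each cached token received across queries.
    scores = attn.sum(axis=0)
    # Take the n_keep highest-scoring tokens, then restore positional order.
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return keys[keep], values[keep], keep

# Toy usage with random data (hypothetical shapes).
rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))
values = rng.normal(size=(8, 4))
logits = rng.normal(size=(3, 8))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
kept_keys, kept_values, kept_positions = evict_tokens(keys, values, attn, keep_ratio=0.25)
print(kept_keys.shape, kept_positions)
```

Dropping entries this way is lossy, which is the accuracy gap that the abstract says calibration is meant to reduce.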
CrossPL Evaluating Large Language Models on Cross Programming Language Code Generation
Authors: Zhanhang Xiong, Dongxia Wang, Yuekang Li, Xinyuan An, Wenhai Wang
2025-07-26
http://arxiv.org/abs/2507.19904v1
As large language models (LLMs) become increasingly embedded in software
engineering workflows, a critical capability remains underexplored: generating
correct code that enables cross-programming-language (CPL) interoperability.
This skill is essential for building complex systems that integrate components
written in multiple languages via mechanisms like inter-process communication
(IPC). To bridge this gap, we present CrossPL, the first benchmark designed to
systematically evaluate LLMs' ability to generate CPL-interoperating code.
CrossPL comprises 1,982 tasks centered around IPC, covering six widely-used
programming languages and seven representative CPL techniques. We construct
this benchmark by (i) analyzing 19,169 multi-language GitHub repositories using
156 hand-crafted finite state machines (FSMs), and (ii) developing an LLM-based
pipeline that automatically extracts CPL code snippets, generates task
instructions, and validates functional correctness. We evaluate 14
state-of-the-art general-purpose LLMs and 6 code-oriented LLMs released in the
past three years on CrossPL via FSM-based validation. Results reveal that even
the best-performing models struggle with CPL scenarios, underscoring the need
for more targeted research in this space. Our benchmark and code are available
at: https://anonymous.4open.science/r/crosspl-2814.
AgentMesh A Cooperative Multi-Agent Generative AI Framework for Software Development Automation
Authors: Sourena Khanzadeh
2025-07-26
http://arxiv.org/abs/2507.19902v1
Software development is a complex, multi-phase process traditionally
requiring collaboration among individuals with diverse expertise. We propose
AgentMesh, a Python-based framework that uses multiple cooperating LLM-powered
agents to automate software development tasks. In AgentMesh, specialized agents
- a Planner, Coder, Debugger, and Reviewer - work in concert to transform a
high-level requirement into fully realized code. The Planner agent first
decomposes user requests into concrete subtasks; the Coder agent implements
each subtask in code; the Debugger agent tests and fixes the code; and the
Reviewer agent validates the final output for correctness and quality. We
describe the architecture and design of these agents and their communication,
and provide implementation details including prompt strategies and workflow
orchestration. A case study illustrates AgentMesh handling a non-trivial
development request via sequential task planning, code generation, iterative
debugging, and final code review. We discuss how dividing responsibilities
among cooperative agents leverages the strengths of large language models while
mitigating single-agent limitations. Finally, we examine current limitations -
such as error propagation and context scaling - and outline future work toward
more robust, scalable multi-agent AI systems for software engineering
automation.
HCAttention Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
Authors: Dongquan Yang, Yifan Yang, Xiaotian Yu, Xianbiao Qi, Rong Xiao
2025-07-26
http://arxiv.org/abs/2507.19823v1
Processing long-context inputs with large language models presents a
significant challenge due to the enormous memory requirements of the Key-Value
(KV) cache during inference. Existing KV cache compression methods exhibit
noticeable performance degradation when memory is reduced by more than 85%.
Additionally, strategies that leverage GPU-CPU collaboration for approximate
attention remain underexplored in this setting. We propose HCAttention, a
heterogeneous attention computation framework that integrates key quantization,
value offloading, and dynamic KV eviction to enable efficient inference under
extreme memory constraints. The method is compatible with existing
architectures and does not require model fine-tuning. Experimental results on
the LongBench benchmark demonstrate that our approach preserves the accuracy of
the full-attention model while shrinking the KV cache memory footprint to 25% of
its original size. Remarkably, it stays competitive with only 12.5% of the
cache, setting a new state of the art in KV cache compression. To the best
of our knowledge, HCAttention is the first to extend the Llama-3-8B model to
process 4 million tokens on a single A100 GPU with 80GB memory.
Large Language Model Agent for Structural Drawing Generation Using ReAct Prompt Engineering and Retrieval Augmented Generation
Authors: Xin Zhang, Lissette Iturburu, Juan Nicolas Villamizar, Xiaoyu Liu, Manuel Salmeron, Shirley J. Dyke, Julio Ramirez
2025-07-26
http://arxiv.org/abs/2507.19771v1
Structural drawings are widely used in many fields, e.g., mechanical
engineering, civil engineering, etc. In civil engineering, structural drawings
serve as the main communication tool between architects, engineers, and
builders to avoid conflicts, act as legal documentation, and provide a
reference for future maintenance or evaluation needs. They are often organized
using key elements such as title/subtitle blocks, scales, plan views, elevation
views, sections, and detailed sections, which are annotated with standardized
symbols and line types for interpretation by engineers and contractors. Despite
advances in software capabilities, the task of generating a structural drawing
remains labor-intensive and time-consuming for structural engineers. Here we
introduce a novel generative AI-based method for generating structural drawings
employing a large language model (LLM) agent. The method incorporates a
retrieval-augmented generation (RAG) technique using externally-sourced facts
to enhance the accuracy and reliability of the language model. This method is
capable of understanding varied natural language descriptions, processing these
to extract necessary information, and generating code to produce the desired
structural drawing in AutoCAD. The approach developed, demonstrated and
evaluated herein enables the efficient and direct conversion of a structural
drawing's natural language description into an AutoCAD drawing, significantly
reducing the workload compared to the current working process associated with
manual drawing production, facilitating the typical iterative process of
engineers for expressing design ideas in a simplified way.
LowKeyEMG Electromyographic typing with a reduced keyset
Authors: Johannes Y. Lee, Derek Xiao, Shreyas Kaasyap, Nima R. Hadidi, John L. Zhou, Jacob Cunningham, Rakshith R. Gore, Deniz O. Eren, Jonathan C. Kao
2025-07-26
http://arxiv.org/abs/2507.19736v1
We introduce LowKeyEMG, a real-time human-computer interface that enables
efficient text entry using only 7 gesture classes decoded from surface
electromyography (sEMG). Prior work has attempted full-alphabet decoding from
sEMG, but decoding large character sets remains unreliable, especially for
individuals with motor impairments. Instead, LowKeyEMG reduces the English
alphabet to 4 gesture keys, with 3 more for space and system interaction, to
reliably translate simple one-handed gestures into text, leveraging the
recurrent transformer-based language model RWKV for efficient computation. In
real-time experiments, participants achieved average one-handed keyboardless
typing speeds of 23.3 words per minute with LowKeyEMG, and improved gesture
efficiency by 17% (relative to typed phrase length). When typing with only 7
keys, LowKeyEMG can achieve 98.2% top-3 word accuracy, demonstrating that this
low-key typing paradigm can maintain practical communication rates. Our results
have implications for assistive technologies and any interface where input
bandwidth is constrained.
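Typing with a reduced keyset is ambiguous because several letters share one key, so a language model has to rank the words that match a key sequence. The sketch below illustrates that idea under loud assumptions: the four-key letter grouping is hypothetical (the abstract does not give the layout), and a toy word-frequency table stands in for the language model the authors use.

```python
from collections import defaultdict

# Hypothetical grouping of the alphabet into 4 gesture keys (not the paper's layout).
KEY_GROUPS = {
    0: "abcdefg",
    1: "hijklm",
    2: "nopqrs",
    3: "tuvwxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEY_GROUPS.items() for ch in letters}

def encode(word):
    """Map a word to the sequence of keys a user would gesture."""
    return tuple(LETTER_TO_KEY[ch] for ch in word.lower())

def build_index(word_freqs):
    """Group vocabulary words by key sequence, most frequent first."""
    buckets = defaultdict(list)
    for word, freq in word_freqs.items():
        buckets[encode(word)].append((freq, word))
    return {
        seq: [w for _, w in sorted(cands, key=lambda fw: (-fw[0], fw[1]))]
        for seq, cands in buckets.items()
    }

def top_k(index, key_seq, k=3):
    """Return up to k candidate words for a gestured key sequence."""
    return index.get(tuple(key_seq), [])[:k]

# Toy unigram counts standing in for a real language model.
WORD_FREQS = {"the": 100, "tie": 5, "toe": 3, "cat": 20, "bat": 10, "eat": 15}
INDEX = build_index(WORD_FREQS)
print(top_k(INDEX, encode("cat")))  # ['cat', 'eat', 'bat']
```

Here "cat", "eat", and "bat" collapse to the same key sequence, which is why a top-3 candidate list, as reported in the abstract, is a natural way to measure accuracy.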
Towards Inclusive NLP Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks
Authors: Maitha Alshehhi, Ahmed Sharshar, Mohsen Guizani
2025-07-25
http://arxiv.org/abs/2507.19699v1
Although LLMs have attained significant success in high-resource languages,
their capacity in low-resource linguistic environments like Kannada and Arabic
is not yet fully understood. This work benchmarks the performance of
multilingual and monolingual Large Language Models (LLMs) across Arabic,
English, and Indic languages, with particular emphasis on the effects of model
compression strategies such as pruning and quantization. Findings show
significant performance differences driven by linguistic diversity and resource
availability on SOTA LLMs such as BLOOMZ, AceGPT, Jais, LLaMA-2, XGLM, and
AraGPT2.
We find that multilingual versions of the model outperform their
language-specific counterparts across the board, indicating substantial
cross-lingual transfer benefits. Quantization (4-bit and 8-bit) is effective in
maintaining model accuracy while promoting efficiency, but aggressive pruning
significantly compromises performance, especially in bigger models. Our
findings pinpoint key strategies to construct scalable and fair multilingual
NLP solutions and underscore the need for interventions to address
hallucination and generalization errors in the low-resource setting.
"X of Information" Continuum A Survey on AI-Driven Multi-dimensional Metrics for Next-Generation Networked Systems
Authors: Beining Wu, Jun Huang, Shui Yu
2025-07-25
http://arxiv.org/abs/2507.19657v1
The development of next-generation networking systems has inherently shifted
from throughput-based paradigms towards intelligent, information-aware designs
that emphasize the quality, relevance, and utility of transmitted information,
rather than sheer data volume. While classical network metrics, such as latency
and packet loss, remain significant, they are insufficient to quantify the
nuanced information quality requirements of modern intelligent applications,
including autonomous vehicles, digital twins, and metaverse environments. In
this survey, we present the first comprehensive study of the "X of
Information" continuum by introducing a systematic four-dimensional taxonomic
framework that structures information metrics along temporal, quality/utility,
reliability/robustness, and network/communication dimensions. We uncover the
increasing interdependencies among these dimensions, whereby temporal freshness
triggers quality evaluation, which in turn helps with reliability appraisal,
ultimately enabling effective network delivery. Our analysis reveals that
artificial intelligence technologies, such as deep reinforcement learning,
multi-agent systems, and neural optimization models, enable adaptive,
context-aware optimization of competing information quality objectives. In our
extensive study of six critical application domains, covering autonomous
transportation, industrial IoT, healthcare digital twins, UAV communications,
LLM ecosystems, and metaverse settings, we illustrate the revolutionary promise
of multi-dimensional information metrics for meeting diverse operational needs.
Our survey identifies prominent implementation challenges, including ...
DeltaLLM A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference
Authors: Jiawen Qi, Chang Gao, Zhaochun Ren, Qinyu Chen
2025-07-25
http://arxiv.org/abs/2507.19608v1
Deploying Large Language Models (LLMs) on edge devices remains challenging
due to their quadratically increasing computations with the sequence length.
Existing studies for dynamic attention pruning are designed for hardware with
massively parallel computation capabilities, such as GPUs or TPUs, and aim at
long context lengths (e.g., 64K), making them unsuitable for edge scenarios. We
present DeltaLLM, a training-free framework that exploits temporal sparsity in
attention patterns to enable efficient LLM inference across both the prefilling
and decoding stages, on resource-constrained edge devices. DeltaLLM introduces
an accuracy- and memory-aware delta matrix construction strategy that
introduces temporal sparsity, and a context-aware hybrid attention mechanism
that combines full attention in a local context window with delta approximation
outside it to increase accuracy. We evaluate our framework on the
edge-device-friendly BitNet-b1.58-2B-4T model and Llama3.2-1B-Instruct model
across diverse language tasks. The results show that on BitNet, our framework
increases the attention sparsity from 0% to 60% during the prefilling stage
with slight accuracy improvement on the WG task, and 0% to 57% across both the
prefilling and decoding stages, with even higher F1 score from 29.63 to 30.97
on SQuAD-v2 task. On the Llama model, it can also achieve up to 60% sparsity
during the prefilling stage and around 57% across both stages with negligible
accuracy drop. These results demonstrate that DeltaLLM offers a promising
solution for efficient edge deployment, requiring no fine-tuning and seamlessly
integrating with existing inference pipelines.
Advancing Event Forecasting through Massive Training of Large Language Models Challenges, Solutions, and Broader Impacts
Authors: Sang-Woo Lee, Sohee Yang, Donghyun Kwak, Noah Y. Siegel
2025-07-25
http://arxiv.org/abs/2507.19477v1
Many recent papers have studied the development of superforecaster-level
event forecasting LLMs. While methodological problems with early studies cast
doubt on the use of LLMs for event forecasting, recent studies with improved
evaluation methods have shown that state-of-the-art LLMs are gradually reaching
superforecaster-level performance, and reinforcement learning has also been
reported to improve future forecasting. Additionally, the unprecedented success
of recent reasoning models and Deep Research-style models suggests that
technology capable of greatly improving forecasting performance has been
developed. Therefore, based on these positive recent trends, we argue that the
time is ripe for research on large-scale training of superforecaster-level
event forecasting LLMs. We discuss two key research directions: training
methods and data acquisition. For training, we first introduce three
difficulties of LLM-based event forecasting training: noisiness-sparsity,
knowledge cut-off, and simple reward structure problems. Then, we present
related ideas to mitigate these problems: hypothetical event Bayesian networks,
utilizing poorly-recalled and counterfactual events, and auxiliary reward
signals. For data, we propose aggressive use of market, public, and crawling
datasets to enable large-scale training and evaluation. Finally, we explain how
these technical advances could enable AI to provide predictive intelligence to
society in broader areas. This position paper presents promising specific paths
and considerations for getting closer to superforecaster-level AI technology,
aiming to call for researchers' interest in these directions.
GEPA Reflective Prompt Evolution Can Outperform Reinforcement Learning
Authors: Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab
2025-07-25
http://arxiv.org/abs/2507.19457v1
Large language models (LLMs) are increasingly adapted to downstream tasks via
reinforcement learning (RL) methods like Group Relative Policy Optimization
(GRPO), which often require thousands of rollouts to learn new tasks. We argue
that the interpretable nature of language can often provide a much richer
learning medium for LLMs, compared with policy gradients derived from sparse,
scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt
optimizer that thoroughly incorporates natural language reflection to learn
high-level rules from trial and error. Given any AI system containing one or
more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool
calls, and tool outputs) and reflects on them in natural language to diagnose
problems, propose and test prompt updates, and combine complementary lessons
from the Pareto frontier of its own attempts. As a result of GEPA's design, it
can often turn even just a few rollouts into a large quality gain. Across four
tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up
to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer,
MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an
inference-time search strategy for code optimization.
Step-3 is Large yet Affordable Model-system Co-design for Cost-effective Decoding
Authors: StepFun, Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin Miao, Chang Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengyuan Yao, Daokuan Lv, Dapeng Shi, Deshan Sun, Ding Huang, Dingyuan Hu, Dongqing Pang, Enle Liu, Fajie Zhang, Fanqi Wan, Gulin Yan, Han Zhang, Han Zhou, Hanghao Wu, Hangyu Guo, Hanqi Chen, Hanshan Zhang, Hao Wu, Haocheng Zhang, Haolong Yan, Haoran Lv, Haoran Wei, Hebin Zhou, Heng Wang, Heng Wang, Hongxin Li, Hongyu Zhou, Hongyuan Wang, Huiyong Guo, Jia Wang, Jiahao Gong, Jialing Xie, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yan, Jie Yang, Jieyi Hou, Jinguang Zhang, Jinlan Cao, Jisheng Yin, Junfeng Liu, Junhao Huang, Junzhe Lin, Kaijun Tan, Kaixiang Li, Kang An, Kangheng Lin, Kenkun Liu, Lei Yang, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lin Zhang, Lina Chen, Liwen Huang, Liying Shi, Longlong Gu, Mei Chen, Mengqiang Ren, Ming Li, Mingzhe Chen, Na Wang, Nan Wu, Qi Han, Qian Zhao, Qiang Zhang, Qianni Liu, Qiaohui Chen, Qiling Wu, Qinglin He, Qinyuan Tan, Qiufeng Wang, Qiuping Wu, Qiuyan Liang, Quan Sun, Rui Li, Ruihang Miao, Ruosi Wan, Ruyan Guo, Shangwu Zhong, Shaoliang Pang, Shengjie Fan, Shijie Shang, Shilei Jiang, Shiliang Yang, Shiming Hao, Shuli Gao, Siming Huang, Siqi Liu, Tiancheng Cao, Tianhao Cheng, Tianhao Peng, Wang You, Wei Ji, Wen Sun, Wenjin Deng, Wenqing He, Wenzhen Zheng, Xi Chen, Xiangwen Kong, Xianzhen Luo, Xiaobo Yang, Xiaojia Liu, Xiaoxiao Ren, Xin Han, Xin Li, Xin Wu, Xu Zhao, Yanan Wei, Yang Li, Yangguang Li, Yangshijie Xu, Yanming Xu, Yaqiang Shi, Yeqing Shen, Yi Yang, Yifei Yang, Yifeng Gong, Yihan Chen, Yijing Yang, Yinmin Zhang, Yizhuang Zhou, Yuanhao Ding, Yuantao Fan, Yuanzhen Yang, Yuchu Luo, Yue Peng, Yufan Lu, Yuhang Deng, Yuhe Yin, Yujie Liu, Yukun Chen, Yuling Zhao, Yun Mou, Yunlong Li, Yunzhou Ju, Yusheng Li, Yuxiang Yang, Yuxiang Zhang, Yuyang Chen, Zejia Weng, Zhe Xie, Zheng Ge, Zheng Gong, Zhenyi Lu, Zhewei Huang, Zhichao Chang, Zhiguo Huang, Zhirui Wang, Zidong Yang, Zili Wang, Ziqi Wang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Xiangyu Zhang
2025-07-25
http://arxiv.org/abs/2507.19427v1
Large language models (LLMs) face low hardware efficiency during decoding,
especially for long-context reasoning tasks. This paper introduces Step-3, a
321B-parameter VLM with hardware-aware model-system co-design optimized for
minimizing decoding costs. Step-3 innovates in two key dimensions: (1) A novel
Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces
both KV cache size and computation while maintaining high attention
expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed
inference system that decouples attention and Feed-Forward Network (FFN) layers
into specialized subsystems. This co-design achieves unprecedented cost
efficiency: Step-3 significantly reduces theoretical decoding costs compared
with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at
longer context. Step-3 achieves low cost while activating 38B parameters per
token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that
hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are
critical to cost-effectiveness. We perform a head-to-head comparison with
DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs
achieves a decoding throughput of up to 4,039 tokens per second per GPU under
50ms TPOT SLA (4K context, FP8, no MTP). This is higher than DeepSeek-V3's 2,324
in the same setup and sets a new Pareto frontier for LLM decoding.
Doubling Your Data in Minutes Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs
Authors: Shuo Yang, Zheyu Zhang, Bardh Prenkaj, Gjergji Kasneci
2025-07-25
http://arxiv.org/abs/2507.19334v1
Tabular data is critical across diverse domains, yet high-quality datasets
remain scarce due to privacy concerns and the cost of collection. Contemporary
approaches adopt large language models (LLMs) for tabular augmentation, but
exhibit two major limitations: (1) dense dependency modeling among tabular
features that can introduce bias, and (2) high computational overhead in
sampling. To address these issues, we propose SPADA for SPArse
Dependency-driven Augmentation, a lightweight generative framework that
explicitly captures sparse dependencies via an LLM-induced graph. We treat each
feature as a node and synthesize values by traversing the graph, conditioning
each feature solely on its parent nodes. We explore two synthesis strategies: a
non-parametric method using Gaussian kernel density estimation, and a
conditional normalizing flow model that learns invertible mappings for
conditional density estimation. Experiments on four datasets show that SPADA
reduces constraint violations by 4% compared to diffusion-based methods and
accelerates generation by nearly 9,500 times over LLM-based baselines.
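The graph-traversal synthesis described in the abstract above (visit features in dependency order, conditioning each one only on its parents) can be sketched as follows. This is an illustrative approximation, not the authors' implementation: the function name `sample_rows`, the fixed `bandwidth`, and the kernel-weighted resampling used as a stand-in for conditional Gaussian kernel density estimation are all assumptions, and the dependency graph is supplied by hand rather than induced by an LLM.

```python
from graphlib import TopologicalSorter

import numpy as np

def sample_rows(data, parents, n_samples, bandwidth=0.3, seed=0):
    """Generate synthetic rows by visiting features in topological order.

    data: dict mapping feature name -> 1-D array of observed values
    parents: dict mapping feature name -> list of parent feature names
    """
    rng = np.random.default_rng(seed)
    # Parents always appear before their children in this order.
    order = list(TopologicalSorter(parents).static_order())
    n_obs = len(next(iter(data.values())))
    out = {name: np.empty(n_samples) for name in order}
    for s in range(n_samples):
        for name in order:
            weights = np.ones(n_obs)
            for p in parents.get(name, []):
                # Gaussian kernel: observed rows whose parent value is close
                # to the already-sampled parent value get more weight.
                weights *= np.exp(-0.5 * ((data[p] - out[p][s]) / bandwidth) ** 2)
            weights = weights + 1e-12  # avoid an all-zero weight vector
            row = rng.choice(n_obs, p=weights / weights.sum())
            # Smooth the resampled value with Gaussian noise (KDE-style draw).
            out[name][s] = data[name][row] + rng.normal(scale=bandwidth)
    return out

# Toy data: y depends on x (hypothetical two-feature table).
demo_rng = np.random.default_rng(1)
x_obs = demo_rng.normal(size=300)
observed = {"x": x_obs, "y": 2.0 * x_obs + demo_rng.normal(scale=0.1, size=300)}
synthetic = sample_rows(observed, {"x": [], "y": ["x"]}, n_samples=200)
print(np.corrcoef(synthetic["x"], synthetic["y"])[0, 1])
```

Because each feature looks only at its parents, the cost per sample grows with the number of edges in the graph rather than with all feature pairs, which is the intuition behind the sparse-dependency design.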
Patch Pruning Strategy Based on Robust Statistical Measures of Attention Weight Diversity in Vision Transformers
Authors: Yuki Igaue, Hiroaki Aizawa
2025-07-25
http://arxiv.org/abs/2507.19175v1
Multi-head self-attention is a distinctive feature extraction mechanism of
vision transformers that computes pairwise relationships among all input
patches, contributing significantly to their high performance. However, it is
known to incur a quadratic computational complexity with respect to the number
of patches. One promising approach to address this issue is patch pruning,
which improves computational efficiency by identifying and removing redundant
patches. In this work, we propose a patch pruning strategy that evaluates the
importance of each patch based on the variance of attention weights across
multiple attention heads. This approach is inspired by the design of multi-head
self-attention, which aims to capture diverse attention patterns across
different subspaces of feature representations. The proposed method can be
easily applied during both training and inference, and achieves improved
throughput while maintaining classification accuracy in scenarios such as
fine-tuning with pre-trained models. In addition, we also found that using
robust statistical measures, such as the median absolute deviation in place of
variance, to assess patch importance can similarly lead to strong performance.
Furthermore, by introducing overlapping patch embeddings, our method achieves
better performance with comparable throughput to conventional approaches that
utilize all patches.
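A minimal NumPy sketch of the scoring idea in the abstract above: rate each patch by how much the attention it receives varies across heads, optionally using the median absolute deviation instead of variance. Several details are assumptions made for illustration, including averaging attention over query patches, treating higher dispersion as higher importance, and the names `patch_importance`, `prune_patches`, and `keep_ratio`.

```python
import numpy as np

def patch_importance(attn, robust=False):
    """Score patches by the across-head dispersion of received attention.

    attn: array of shape (num_heads, num_patches, num_patches), where
          attn[h, i, j] is the weight query patch i gives key patch j in head h.
    """
    # Average over query patches: attention received by each patch, per head.
    received = attn.mean(axis=1)  # shape (num_heads, num_patches)
    if robust:
        # Median absolute deviation across heads.
        med = np.median(received, axis=0)
        return np.median(np.abs(received - med), axis=0)
    return received.var(axis=0)

def prune_patches(tokens, attn, keep_ratio=0.7, robust=False):
    """Keep the highest-scoring patches (assumed direction), in original order."""
    scores = patch_importance(attn, robust=robust)
    n_keep = max(1, int(round(tokens.shape[0] * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep], keep

# Toy usage: 2 heads, 3 patches, 4-dimensional patch embeddings.
attn = np.stack([
    np.tile([0.8, 0.1, 0.1], (3, 1)),
    np.tile([0.2, 0.5, 0.3], (3, 1)),
])
tokens = np.arange(12.0).reshape(3, 4)
kept_tokens, kept_idx = prune_patches(tokens, attn, keep_ratio=0.67)
print(kept_idx)
```

Swapping `robust=True` changes only the dispersion statistic, which mirrors the abstract's observation that robust measures can serve in place of variance.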
RegScore Scoring Systems for Regression Tasks
Authors: Michal K. Grzeszczyk, Tomasz Szczepański, Pawel Renc, Siyeop Yoon, Jerome Charton, Tomasz Trzciński, Arkadiusz Sitek
2025-07-25
http://arxiv.org/abs/2507.19155v1
Scoring systems are widely adopted in medical applications for their inherent
simplicity and transparency, particularly for classification tasks involving
tabular data. In this work, we introduce RegScore, a novel, sparse, and
interpretable scoring system specifically designed for regression tasks. Unlike
conventional scoring systems constrained to integer-valued coefficients,
RegScore leverages beam search and k-sparse ridge regression to relax these
restrictions, thus enhancing predictive performance. We extend RegScore to
bimodal deep learning by integrating tabular data with medical images. We
utilize the classification token from the TIP (Tabular Image Pretraining)
transformer to generate Personalized Linear Regression parameters and a
Personalized RegScore, enabling individualized scoring. We demonstrate the
effectiveness of RegScore by estimating mean Pulmonary Artery Pressure using
tabular data and further refine these estimates by incorporating cardiac MRI
images. Experimental results show that RegScore and its personalized bimodal
extensions achieve performance comparable to, or better than, state-of-the-art
black-box models. Our method provides a transparent and interpretable approach
for regression tasks in clinical settings, promoting more informed and
trustworthy decision-making. We provide our code at
https://github.com/SanoScience/RegScore.
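One generic way to combine beam search with a ridge regression limited to at most k nonzero coefficients is sketched below. It is an assumption-laden illustration, not the RegScore algorithm from the repository above: the function names, the beam width, the penalty `lam`, and the ranking criterion (penalized squared error) are all hypothetical choices.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit with an unpenalized intercept (via centering)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def subset_loss(X, y, subset, lam):
    """Penalized squared error of a ridge fit restricted to `subset`."""
    w, b = ridge_fit(X[:, subset], y, lam)
    resid = y - (X[:, subset] @ w + b)
    return float(resid @ resid + lam * (w @ w))

def beam_search_sparse_ridge(X, y, k, beam_width=3, lam=1e-3):
    """Grow feature subsets one feature at a time, keeping the best few."""
    beams = [()]
    for _ in range(k):
        candidates = set()
        for subset in beams:
            for j in range(X.shape[1]):
                if j not in subset:
                    candidates.add(tuple(sorted(subset + (j,))))
        ranked = sorted(candidates, key=lambda s: subset_loss(X, y, list(s), lam))
        beams = ranked[:beam_width]
    best = list(beams[0])
    w, b = ridge_fit(X[:, best], y, lam)
    return best, w, b
```

On synthetic data where the target depends on only two of six features, a call such as `beam_search_sparse_ridge(X, y, k=2)` would be expected to return those two feature indices with real-valued coefficients, in contrast to the integer-valued points of a conventional scoring system.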
MixA-Q Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective
Authors: Weitian Wang, Rai Shubham, Cecilia De La Parra, Akash Kumar
2025-07-25
http://arxiv.org/abs/2507.19131v1
In this paper, we propose MixA-Q, a mixed-precision activation quantization
framework that leverages intra-layer activation sparsity (a concept widely
explored in activation sparsity methods) for efficient inference of quantized
window-based vision transformers. For a given uniform-bit quantization
configuration, MixA-Q separates the batched window computations within Swin
blocks and assigns a lower bit width to the activations of less important
windows, improving the trade-off between model performance and efficiency. We
introduce a Two-Branch Swin Block that processes activations separately in
high- and low-bit precision, enabling seamless integration of our method with
most quantization-aware training (QAT) and post-training quantization (PTQ)
methods, either directly or with simple modifications. Our experimental evaluations on the
COCO dataset demonstrate that MixA-Q achieves a training-free 1.35x
computational speedup without accuracy loss in PTQ configuration. With QAT,
MixA-Q achieves a lossless 1.25x speedup and a 1.53x speedup with only a 1% mAP
drop by incorporating activation sparsity. Notably, by reducing the quantization
error in important regions, our sparsity-aware quantization adaptation improves
the mAP of the quantized W4A4 model (with both weights and activations in 4-bit
precision) by 0.7%, reducing quantization degradation by 24%.
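The idea of giving less important windows a lower activation bit width can be sketched as below. This is a rough illustration under stated assumptions, not the MixA-Q implementation: the abstract does not say how window importance is measured, so mean absolute activation is used as a stand-in, and the function names, the `low_fraction` parameter, and the per-window symmetric quantizer are all hypothetical.

```python
import numpy as np

def fake_quantize(x, bits):
    """Symmetric uniform quantize-then-dequantize with one scale per tensor."""
    qmax = 2 ** (int(bits) - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    if scale == 0:
        return x.copy()
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def mixed_precision_windows(windows, high_bits=8, low_bits=4, low_fraction=0.5):
    """Quantize the least important windows at low_bits, the rest at high_bits.

    windows: array of shape (num_windows, tokens, channels).
    Importance proxy (an assumption): mean absolute activation per window.
    """
    importance = np.abs(windows).mean(axis=(1, 2))
    n_low = int(round(low_fraction * len(windows)))
    low_idx = np.argsort(importance)[:n_low]  # least important first
    bits = np.full(len(windows), high_bits)
    bits[low_idx] = low_bits
    out = np.stack([fake_quantize(w, b) for w, b in zip(windows, bits)])
    return out, bits
```

In a real two-branch block the high- and low-bit groups would be gathered and processed as two separate batches so that the low-bit branch actually saves compute. The loop above only simulates the numerical effect of the mixed bit widths.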
Deterministic diffusion models for Lagrangian turbulence robustness and encoding of extreme events
Authors: Tianyi Li, Flavio Tuteri, Michele Buzzicotti, Fabio Bonaccorso, Luca Biferale
2025-07-25
http://arxiv.org/abs/2507.19103v1
Modeling Lagrangian turbulence remains a fundamental challenge due to its
multiscale, intermittent, and non-Gaussian nature. Recent advances in
data-driven diffusion models have enabled the generation of realistic
Lagrangian velocity trajectories that accurately reproduce statistical
properties across scales and capture rare extreme events. This study
investigates three key aspects of diffusion-based modeling for Lagrangian
turbulence. First, we assess architectural robustness by comparing a U-Net
backbone with a transformer-based alternative, finding strong consistency in
generated trajectories, with only minor discrepancies at small scales. Second,
leveraging a deterministic variant of the diffusion model formulation, namely the
deterministic denoising diffusion implicit model (DDIM), we identify structured
features in the initial latent noise that align consistently with extreme
events. Third, we explore accelerated generation by reducing the
number of diffusion steps, and find that DDIM enables substantial speedups with
minimal loss of statistical fidelity. These findings highlight the robustness
of diffusion models and their potential for interpretable, scalable modeling of
complex turbulent systems.
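The deterministic DDIM update and the step-reduction idea can be sketched as follows. This is a generic illustration of the standard DDIM rule with zero added noise, not the authors' code: `eps_model` stands for any trained noise predictor, `alpha_bars` for the cumulative noise schedule, and the evenly spaced timestep subsequence is an assumed choice.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (no noise is injected)."""
    # Estimate of the clean sample implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Re-noise that estimate to the previous (less noisy) level.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

def ddim_sample(x_T, eps_model, alpha_bars, num_steps):
    """Run DDIM over an evenly spaced subsequence of the training timesteps.

    eps_model(x, t) is a hypothetical noise predictor.
    Assumes num_steps <= len(alpha_bars).
    """
    T = len(alpha_bars)
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        last = i + 1 == len(timesteps)
        abar_prev = 1.0 if last else alpha_bars[timesteps[i + 1]]
        x = ddim_step(x, eps_model(x, t), alpha_bars[t], abar_prev)
    return x
```

Because no randomness enters after the initial draw, the same `x_T` always maps to the same output, which is what makes it meaningful to look for structure in the initial latent noise. Lowering `num_steps` trades some fidelity for proportionally fewer network evaluations.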