2025-07-25
Table of Contents
- Explainable Mapper Charting LLM Embedding Spaces Using Perturbation-Based Explanation and Verification Agents
- Not All Features Deserve Attention Graph-Guided Dependency Learning for Tabular Data Generation with Language Models
- Enhanced Velocity-Adaptive Scheme Joint Fair Access and Age of Information Optimization in Vehicular Networks
- StyleAdaptedLM Enhancing Instruction Following Models with Efficient Stylistic Transfer
- Assemble Your Crew Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation
- Prune&Comp Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation
- SpecASR Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding
- ICWLM A Multi-Task Wireless Large Model via In-Context Learning
- NeuralDB Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database
- Who Attacks, and Why? Using LLMs to Identify Negative Campaigning in 18M Tweets across 19 Countries
- R-Stitch Dynamic Trajectory Stitching for Efficient Reasoning
- EFS Evolutionary Factor Searching for Sparse Portfolio Optimization Using Large Language Models
- LLM Meets the Sky Heuristic Multi-Agent Reinforcement Learning for Secure Heterogeneous UAV Networks
- Resilient Multi-Agent Negotiation for Medical Supply Chains Integrating LLMs and Blockchain for Transparent Coordination
- Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models
- LoRA is All You Need for Safety Alignment of Reasoning LLMs
- Parallelism Meets Adaptiveness Scalable Documents Understanding in Multi-Agent LLM Systems
- Beyond Context Limits Subconscious Threads for Long-Horizon Reasoning
- Collaborative Inference and Learning between Edge SLMs and Cloud LLMs A Survey of Algorithms, Execution, and Open Challenges
- ACT Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training
- Mamba-OTR a Mamba-based Solution for Online Take and Release Detection from Untrimmed Egocentric Video
- CompLeak Deep Learning Model Compression Exacerbates Privacy Leakage
- Time to Split Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders
- Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training
- Benchmarking LLM Privacy Recognition for Social Robot Decision Making
- TorchAO PyTorch-Native Training-to-Serving Model Optimization
- On the transferability of Sparse Autoencoders for interpreting compressed models
- Just Ask for Music (JAM) Multimodal and Personalized Natural Language Music Recommendation
- Reservoir Computing as a Language Model
- Who Leads in the Shadows? ERGM and Centrality Analysis of Congressional Democrats on Bluesky
- Transformer-based Deep Learning Model for Joint Routing and Scheduling with Varying Electric Vehicle Numbers
- Metaphor and Large Language Models When Surface Features Matter More than Deep Understanding
- Scaling Decentralized Learning with FLock
- IM-Chat A Multi-agent LLM-based Framework for Knowledge Transfer in Injection Molding Industry
- CHADET Cross-Hierarchical-Attention for Depth-Completion Using Unsupervised Lightweight Transformer
- Beyond Visual Line of Sight UAVs with Edge AI, Connected LLMs, and VR for Autonomous Aerial Intelligence
- From Neurons to Semantics Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment
- Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs
- Tiny language models
- An Evaluation of DUSt3R/MASt3R/VGGT 3D Reconstruction on Photogrammetric Aerial Blocks
- LeAdQA LLM-Driven Context-Aware Temporal Grounding for Video Question Answering
- CXR-TFT Multi-Modal Temporal Fusion Transformer for Predicting Chest X-ray Trajectories
- GRACE Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization
- Spatial-Temporal Transformer with Curriculum Learning for EEG-Based Emotion Recognition
- Enabling Efficient Hardware Acceleration of Hybrid Vision Transformer (ViT) Networks at the Edge
- Linear Relational Decoding of Morphology in Language Models
- KinForm Kinetics Informed Feature Optimised Representation Models for Enzyme kcat and KM Prediction
- Agentic Satellite-Augmented Low-Altitude Economy and Terrestrial Networks A Survey on Generative Approaches
- Efficient LLM Inference Bandwidth, Compute, Synchronization, and Capacity are all you need
- Characterizing Communication Patterns in Distributed Large Language Model Inference
- DPMT Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration
- KROMA Ontology Matching with Knowledge Retrieval and Large Language Models
- LoopServe An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues
Explainable Mapper Charting LLM Embedding Spaces Using Perturbation-Based Explanation and Verification Agents
Authors: Xinyuan Yan, Rita Sevastjanova, Sinie van der Ben, Mennatallah El-Assady, Bei Wang
2025-07-24
http://arxiv.org/abs/2507.18607v1
Large language models (LLMs) produce high-dimensional embeddings that capture
rich semantic and syntactic relationships between words, sentences, and
concepts. Investigating the topological structures of LLM embedding spaces via
mapper graphs enables us to understand their underlying structures.
Specifically, a mapper graph summarizes the topological structure of the
embedding space, where each node represents a topological neighborhood
(containing a cluster of embeddings), and an edge connects two nodes if their
corresponding neighborhoods overlap. However, manually exploring these
embedding spaces to uncover encoded linguistic properties requires considerable
human effort. To address this challenge, we introduce a framework for
semi-automatic annotation of these embedding properties. To organize the
exploration process, we first define a taxonomy of explorable elements within a
mapper graph such as nodes, edges, paths, components, and trajectories. The
annotation of these elements is executed through two types of customizable
LLM-based agents that employ perturbation techniques for scalable and automated
analysis. These agents help to explore and explain the characteristics of
mapper elements and verify the robustness of the generated explanations. We
instantiate the framework within a visual analytics workspace and demonstrate
its effectiveness through case studies. In particular, we replicate findings
from prior research on BERT's embedding properties across various layers of its
architecture and provide further observations into the linguistic properties of
topological neighborhoods.
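To make the node and edge definitions above concrete, here is a minimal, dependency-light sketch of mapper-graph construction, assuming a 1-D lens function and a crude single-linkage clustering step; all names and parameters are illustrative, not the paper's implementation.
```python
# Mapper sketch: overlapping cover intervals over a lens, clustering within
# each preimage, and an edge whenever two neighborhoods share embeddings.
import numpy as np
from itertools import combinations

def mapper_graph(X, lens, n_intervals=10, overlap=0.3, eps=0.5):
    f = lens(X)                                   # 1-D lens values, shape (n,)
    lo, hi = f.min(), f.max()
    width = (hi - lo) / n_intervals
    step = width * (1 - overlap)
    nodes, start = [], lo
    while start < hi:                             # overlapping cover intervals
        idx = np.where((f >= start) & (f <= start + width))[0]
        nodes.extend(_cluster(X, idx, eps))       # clusters = graph nodes
        start += step
    # edge whenever two topological neighborhoods share an embedding
    edges = [(i, j) for i, j in combinations(range(len(nodes)), 2)
             if nodes[i] & nodes[j]]
    return nodes, edges

def _cluster(X, idx, eps):
    """Crude single-linkage clustering inside one preimage."""
    remaining = set(idx.tolist())
    while remaining:
        seed = remaining.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            p = frontier.pop()
            near = [q for q in remaining if np.linalg.norm(X[p] - X[q]) < eps]
            for q in near:
                remaining.discard(q); cluster.add(q); frontier.append(q)
        yield cluster

X = np.random.randn(200, 8)                       # stand-in embeddings
nodes, edges = mapper_graph(X, lens=lambda Z: Z[:, 0])
```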
Not All Features Deserve Attention Graph-Guided Dependency Learning for Tabular Data Generation with Language Models
Authors: Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci
2025-07-24
http://arxiv.org/abs/2507.18504v1
Large Language Models (LLMs) have shown strong potential for tabular data
generation by modeling textualized feature-value pairs. However, tabular data
inherently exhibits sparse feature-level dependencies, where many feature
interactions are structurally insignificant. This creates a fundamental
mismatch, as LLMs' self-attention mechanism inevitably distributes focus across
all pairs, diluting attention on critical relationships, particularly in
datasets with complex dependencies or semantically ambiguous features. To
address this limitation, we propose GraDe (Graph-Guided Dependency Learning), a
novel method that explicitly integrates sparse dependency graphs into LLMs'
attention mechanism. GraDe employs a lightweight dynamic graph learning module
guided by externally extracted functional dependencies, prioritizing key
feature interactions while suppressing irrelevant ones. Our experiments across
diverse real-world datasets demonstrate that GraDe outperforms existing
LLM-based approaches by up to 12% on complex datasets while achieving
competitive results with state-of-the-art approaches in synthetic data quality.
Our method is minimally intrusive yet effective, offering a practical solution
for structure-aware tabular data modeling with LLMs.
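A rough sketch of the core idea, assuming the dependency graph enters as an additive attention bias; the -1e9 masking constant and the scalar gate are expository assumptions, not GraDe's actual parameterization.
```python
# Bias self-attention toward edges of an externally extracted
# feature-dependency graph, suppressing structurally insignificant pairs.
import numpy as np

def dependency_biased_attention(Q, K, V, adj, gate=1.0):
    """Q, K, V: (n_features, d); adj: (n, n) 0/1 dependency graph."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    bias = np.where(adj > 0, 0.0, -1e9)       # suppress non-dependent pairs
    scores = scores + gate * bias             # gate could be learned per layer
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

n, d = 5, 16
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(n, d))
adj = np.eye(n); adj[0, 1] = adj[1, 0] = 1    # toy functional dependency
out = dependency_biased_attention(Q, K, V, adj)
```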
Enhanced Velocity-Adaptive Scheme Joint Fair Access and Age of Information Optimization in Vehicular Networks
Authors: Xiao Xu, Qiong Wu, Pingyi Fan, Kezhi Wang, Nan Cheng, Wen Chen, Khaled B. Letaief
2025-07-24
http://arxiv.org/abs/2507.18328v1
In this paper, we consider the fair access problem and the Age of Information
(AoI) under 5G New Radio (NR) Vehicle-to-Infrastructure (V2I) Mode 2 in
vehicular networks. Specifically, vehicles follow Mode 2 to communicate with
Roadside Units (RSUs) to obtain accurate data for driving assistance.
Nevertheless, vehicles often have different velocities when they are moving in
adjacent lanes, leading to differences in RSU dwell time and communication
duration. This results in unfair access to network resources,
potentially influencing driving safety. To ensure the freshness of received
data, the AoI should be analyzed. Mode 2 introduces a novel preemption
mechanism, necessitating simultaneous optimization of fair access and AoI to
guarantee timely and relevant data delivery. We propose a joint optimization
framework for vehicular networks, defining a fairness index and employing
Stochastic Hybrid Systems (SHS) to model AoI under the preemption mechanism. By
adaptively adjusting the selection window of Semi-Persistent Scheduling (SPS)
in Mode 2, we address the optimization of fairness and AoI. We apply a large
language model (LLM)-based Multi-objective Evolutionary Algorithm Based on
Decomposition (MOEA/D) to solve this problem. Simulation results demonstrate
the effectiveness of our scheme in balancing fair access and minimizing AoI.
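The abstract defines a fairness index without giving its form; Jain's index is the standard choice for this kind of access-fairness measurement and is shown below purely as an illustration.
```python
# Jain's fairness index over per-vehicle resource grants:
# 1.0 = perfectly fair; 1/n = one vehicle gets everything.
def jain_fairness(allocations):
    n = len(allocations)
    s, sq = sum(allocations), sum(x * x for x in allocations)
    return (s * s) / (n * sq) if sq else 0.0

# dwell-time-weighted resource grants for vehicles in adjacent lanes
print(jain_fairness([4, 4, 4, 4]))   # 1.0
print(jain_fairness([9, 1, 1, 1]))   # ~0.43
```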
StyleAdaptedLM Enhancing Instruction Following Models with Efficient Stylistic Transfer
Authors: Pritika Ramu, Apoorv Saxena, Meghanath M Y, Varsha Sankar, Debraj Basu
2025-07-24
http://arxiv.org/abs/2507.18294v1
Adapting LLMs to specific stylistic characteristics, like brand voice or
authorial tones, is crucial for enterprise communication but challenging to
achieve from corpora that lack instruction-response formatting without
compromising instruction adherence. We introduce StyleAdaptedLM, a framework
that efficiently transfers stylistic traits to instruction-following models
using Low-Rank Adaptation (LoRA). LoRA adapters are first trained on a base
model with diverse unstructured stylistic corpora, then merged with a separate
instruction-following model. This enables robust stylistic customization
without paired data or sacrificing task performance. Experiments across
multiple datasets and models demonstrate improved stylistic consistency while
preserving instruction adherence, with human evaluations confirming
brand-specific convention uptake. StyleAdaptedLM offers an efficient path for
stylistic personalization in LLMs.
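For reference, the LoRA merge step follows the usual W' = W + (alpha/r)BA convention; the numpy sketch below is illustrative of that arithmetic, not the paper's code, and the idea of applying style adapters trained on a base model to a separate instruction-tuned checkpoint is the part that comes from the paper.
```python
# Merge a LoRA adapter (trained on a base model) into a separate
# instruction-following model's weight matrix.
import numpy as np

def merge_lora(W_instruct, A, B, alpha):
    r = A.shape[0]                             # LoRA rank
    return W_instruct + (alpha / r) * (B @ A)  # low-rank style delta

d_out, d_in, r = 64, 64, 8
W = np.random.randn(d_out, d_in) * 0.02        # instruction model weight
A = np.random.randn(r, d_in) * 0.02            # style adapter factor A
B = np.random.randn(d_out, r) * 0.02           # style adapter factor B
W_styled = merge_lora(W, A, B, alpha=16)
```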
Assemble Your Crew Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation
Authors: Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, Shirui Pan
2025-07-24
http://arxiv.org/abs/2507.18224v1
Multi-agent systems (MAS) based on large language models (LLMs) have emerged
as a powerful solution for dealing with complex problems across diverse
domains. The effectiveness of MAS is critically dependent on its collaboration
topology, which has become a focal point for automated design research.
However, existing approaches are fundamentally constrained by their reliance on
a template graph modification paradigm with a predefined set of agents and
hard-coded interaction structures, significantly limiting their adaptability to
task-specific requirements. To address these limitations, we reframe MAS design
as a conditional autoregressive graph generation task, where both the system
composition and structure are designed jointly. We propose ARG-Designer, a
novel autoregressive model that operationalizes this paradigm by constructing
the collaboration graph from scratch. Conditioned on a natural language task
query, ARG-Designer sequentially and dynamically determines the required number
of agents, selects their appropriate roles from an extensible pool, and
establishes the optimal communication links between them. This generative
approach creates a customized topology in a flexible and extensible manner,
precisely tailored to the unique demands of different tasks. Extensive
experiments across six diverse benchmarks demonstrate that ARG-Designer not
only achieves state-of-the-art performance but also enjoys significantly
greater token efficiency and enhanced extensibility. The source code of
ARG-Designer is available at https://github.com/Shiy-Li/ARG-Designer.
Prune&Comp Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation
Authors: Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, Chun Yuan
2025-07-24
http://arxiv.org/abs/2507.18212v1
Layer pruning has emerged as a promising technique for compressing large
language models (LLMs) while achieving acceleration proportional to the pruning
ratio. In this work, we identify that removing any layer induces a significant
magnitude gap in hidden states, resulting in substantial performance
degradation. To address this issue, we propose Prune&Comp, a novel
plug-and-play layer pruning scheme that leverages magnitude compensation to
mitigate such gaps in a training-free manner. Specifically, we first estimate
the magnitude gap caused by layer removal and then eliminate this gap by
rescaling the remaining weights offline, with zero runtime overhead incurred.
We further demonstrate the advantages of Prune&Comp through an iterative
strategy. When integrated with an iterative prune-and-compensate loop,
Prune&Comp consistently enhances existing layer pruning metrics. For instance,
when 5 layers of LLaMA-3-8B are pruned using the prevalent block influence
metric, Prune&Comp nearly halves the perplexity and retains 93.19% of the
original model's question-answering performance, outperforming the baseline by
4.01%.
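A minimal sketch of magnitude compensation under one plausible reading: estimate the hidden-state magnitude gap on a small calibration set and rescale the following layer's weights offline. Where the rescaling is applied (the next layer's input projection here) is an assumption, not necessarily Prune&Comp's choice.
```python
# Estimate the magnitude gap caused by layer removal, then fold the
# correction into the remaining weights offline (zero runtime overhead).
import numpy as np

def magnitude_gap(h_with_layer, h_without_layer):
    return np.linalg.norm(h_with_layer, axis=-1).mean() / \
           np.linalg.norm(h_without_layer, axis=-1).mean()

def compensate(W_next, gap):
    return W_next * gap                          # offline rescale

calib_h_full = np.random.randn(32, 512) * 1.3    # stand-in activations
calib_h_pruned = np.random.randn(32, 512)
W_next = np.random.randn(512, 512) * 0.02
W_next = compensate(W_next, magnitude_gap(calib_h_full, calib_h_pruned))
```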
SpecASR Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding
Authors: Linye Wei, Shuzhang Zhong, Songqiang Xu, Runsheng Wang, Ru Huang, Meng Li
2025-07-24
http://arxiv.org/abs/2507.18181v1
Large language model (LLM)-based automatic speech recognition (ASR) has
recently attracted a lot of attention due to its high recognition accuracy and
enhanced multi-dialect support. However, the high decoding latency of LLMs
challenges real-time ASR requirements. Although speculative decoding has been
explored for better decoding efficiency, existing methods usually ignore the key
characteristics of the ASR task and achieve limited speedup. To further reduce
the real-time ASR latency, in this paper, we propose a novel speculative
decoding framework specialized for ASR, dubbed SpecASR. SpecASR is developed
based on our core observation that ASR decoding is audio-conditioned, which
results in high output alignment between small and large ASR models, even given
output mismatches in intermediate decoding steps. Therefore, SpecASR features
an adaptive draft sequence generation process that dynamically modifies the
draft sequence length to maximize the token acceptance length. SpecASR further
proposes a draft sequence recycling strategy that reuses the previously
generated draft sequence to reduce the draft ASR model latency. Moreover, a
two-pass sparse token tree generation algorithm is also proposed to balance the
latency of draft and target ASR models. With extensive experimental results, we
demonstrate SpecASR achieves 3.04x-3.79x and 1.25x-1.84x speedup over the
baseline autoregressive decoding and speculative decoding, respectively,
without any loss in recognition accuracy.
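For readers new to speculative decoding, the toy loop below shows the draft-verify cycle with an adaptive draft length in the spirit of SpecASR; the grow/shrink heuristic and the stand-in models are illustrative assumptions, not the paper's algorithm.
```python
# Draft-verify speculative decoding with an adaptive draft length.
import random

def speculative_decode(draft_model, target_model, prefix, max_len=100,
                       draft_len=4):
    tokens = list(prefix)
    while len(tokens) < max_len:
        draft = draft_model(tokens, n=draft_len)       # cheap draft tokens
        accepted = target_model.verify(tokens, draft)  # longest agreeing prefix
        tokens += accepted
        if len(accepted) < len(draft):                 # mismatch: resample once
            tokens.append(target_model.sample_one(tokens))
            draft_len = max(2, draft_len - 1)          # shrink after rejection
        else:
            draft_len += 1                             # grow on full acceptance
    return tokens[:max_len]

class ToyLM:
    """Stand-in model: random digits, ~70% per-token acceptance."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed); self.vocab = "0123456789"
    def __call__(self, ctx, n):
        return [self.rng.choice(self.vocab) for _ in range(n)]
    def verify(self, ctx, draft):
        ok = []
        for t in draft:
            if self.rng.random() < 0.7: ok.append(t)
            else: break
        return ok
    def sample_one(self, ctx):
        return self.rng.choice(self.vocab)

m = ToyLM()
print("".join(speculative_decode(m, m, prefix=[], max_len=20)))
```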
ICWLM A Multi-Task Wireless Large Model via In-Context Learning
Authors: Yuxuan Wen, Xiaoming Chen, Maojun Zhang, Zhaoyang Zhang
2025-07-24
http://arxiv.org/abs/2507.18167v1
The rapid evolution of wireless technologies, particularly
massive multiple-input multiple-output (mMIMO) and millimeter-wave (mmWave),
introduces significant network complexity and computational demands.
Significant research efforts have been made to improve physical layer
performance by resorting to deep learning (DL) methods, which, however, are
usually task-specific and struggle with data scarcity and generalization. To
address these challenges, we propose a novel In-Context Wireless Large Model
(ICWLM), a wireless-native foundation model designed for simultaneous
multi-task learning at the physical layer. Unlike conventional methods that
adapt wireless data to pre-trained large language models (LLMs), ICWLM is
trained directly on large-scale, mixed wireless datasets from scratch. It
jointly solves multiple classical physical layer problems, including multi-user
precoding (sum-rate maximization and max-min SINR) and channel prediction. A
key innovation of ICWLM is its utilization of in-context learning (ICL),
enabling the model to adapt to varying system configurations and channel
conditions with minimal demonstration pairs, eliminating the need for extensive
retraining. Furthermore, we employ the Dynamic Weight Averaging (DWA) algorithm
to dynamically balance the individual task losses during multi-task training,
ensuring efficient and stable learning across diverse objectives. Extensive
simulation results demonstrate that ICWLM achieves competitive performance
compared to task-specific methods while exhibiting remarkable generalization
capabilities to unseen system configurations. This work offers a promising
paradigm for developing unified and adaptive AI models for future wireless
networks, potentially reducing deployment complexity and enhancing intelligent
resource management.
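Dynamic Weight Averaging is well defined in the literature (Liu et al., 2019): each task weight follows the ratio of its last two losses through a temperature-controlled softmax, scaled so the weights sum to the task count. A sketch, which may differ in detail from ICWLM's variant:
```python
# DWA: tasks whose loss is shrinking slowly get relatively larger weights.
import math

def dwa_weights(loss_hist, T=2.0):
    """loss_hist: dict task -> list of losses over epochs."""
    K = len(loss_hist)
    if any(len(h) < 2 for h in loss_hist.values()):
        return {k: 1.0 for k in loss_hist}        # warm-up: uniform weights
    r = {k: h[-1] / h[-2] for k, h in loss_hist.items()}
    z = sum(math.exp(v / T) for v in r.values())
    return {k: K * math.exp(v / T) / z for k, v in r.items()}

hist = {"precoding": [1.0, 0.8, 0.7], "channel_pred": [1.0, 0.95, 0.93]}
print(dwa_weights(hist))  # slower-improving task gets the larger weight
```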
NeuralDB Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database
Authors: Weizhi Fei, Hao Shi, Jing Xu, Jingchen Peng, Jiazheng Li, Jingzhao Zhang, Bo Bai, Wei Han, Zhenyuan Chen, Xueyan Niu
2025-07-24
http://arxiv.org/abs/2507.18028v1
Efficiently editing knowledge stored in large language models (LLMs) enables
model updates without large-scale training. One possible solution is
Locate-and-Edit (L&E), allowing simultaneous modifications of a massive number
of facts. However, such editing may compromise the general abilities of LLMs
and even result in forgetting edited facts when scaling up to thousands of
edits. In this paper, we model existing linear L&E methods as querying a
Key-Value (KV) database. From this perspective, we then propose NeuralDB, an
editing framework that explicitly represents the edited facts as a neural
database equipped with a non-linear gated retrieval module. In particular,
our gated module only operates when inference involves the edited facts,
effectively preserving the general abilities of LLMs. Comprehensive experiments
involving the editing of 10,000 facts were conducted on the ZsRE and
CounterFacts datasets, using GPT2-XL, GPT-J (6B) and Llama-3 (8B). The results
demonstrate that NeuralDB not only excels in editing efficacy, generalization,
specificity, fluency, and consistency, but also preserves overall performance
across six representative text understanding and generation tasks. Further
experiments indicate that NeuralDB maintains its effectiveness even when scaled
to 100,000 facts (50x more than in prior work).
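Conceptually, the gated neural KV database behaves like the sketch below: the gate fires only when a query lands near a stored edit key, so unrelated inference passes through untouched. Thresholded cosine similarity stands in here for NeuralDB's learned non-linear gate.
```python
# Gated KV retrieval: return the edited value only when the gate opens.
import numpy as np

class NeuralKV:
    def __init__(self, dim, tau=0.85):
        self.keys = np.empty((0, dim)); self.vals = np.empty((0, dim))
        self.tau = tau                         # gate threshold (illustrative)
    def insert(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.vals = np.vstack([self.vals, v])
    def __call__(self, h):
        if len(self.keys) == 0:
            return h
        sims = self.keys @ h / (np.linalg.norm(self.keys, axis=1)
                                * np.linalg.norm(h) + 1e-9)
        i = int(sims.argmax())
        return self.vals[i] if sims[i] > self.tau else h  # gate closed: no-op

db = NeuralKV(dim=4)
db.insert(np.array([1., 0., 0., 0.]), np.array([0., 1., 0., 0.]))
print(db(np.array([0.99, 0.01, 0., 0.])))   # edited fact retrieved
print(db(np.array([0., 0., 1., 0.])))       # unrelated query: unchanged
```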
Who Attacks, and Why? Using LLMs to Identify Negative Campaigning in 18M Tweets across 19 Countries
Authors: Victor Hartman, Petter Törnberg
2025-07-23
http://arxiv.org/abs/2507.17636v1
Negative campaigning is a central feature of political competition, yet
empirical research has been limited by the high cost and limited scalability of
existing classification methods. This study makes two key contributions. First,
it introduces zero-shot Large Language Models (LLMs) as a novel approach for
cross-lingual classification of negative campaigning. Using benchmark datasets
in ten languages, we demonstrate that LLMs achieve performance on par with
native-speaking human coders and outperform conventional supervised machine
learning approaches. Second, we leverage this novel method to conduct the
largest cross-national study of negative campaigning to date, analyzing 18
million tweets posted by parliamentarians in 19 European countries between 2017
and 2022. The results reveal consistent cross-national patterns: governing
parties are less likely to use negative messaging, while ideologically extreme
and populist parties -- particularly those on the radical right -- engage in
significantly higher levels of negativity. These findings advance our
understanding of how party-level characteristics shape strategic communication
in multiparty systems. More broadly, the study demonstrates the potential of
LLMs to enable scalable, transparent, and replicable research in political
communication across linguistic and cultural contexts.
R-Stitch Dynamic Trajectory Stitching for Efficient Reasoning
Authors: Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai, Bohan Zhuang
2025-07-23
http://arxiv.org/abs/2507.17307v1
Chain-of-thought (CoT) reasoning enhances the problem-solving capabilities of
large language models by encouraging step-by-step intermediate reasoning during
inference. While effective, CoT introduces substantial computational overhead
due to its reliance on autoregressive decoding over long token sequences.
Existing strategies either reduce sequence length through early
stopping or compressive reward designs, or improve decoding speed via
speculative decoding with smaller models. However, speculative decoding suffers
from limited speedup when the agreement between small and large models is low,
and fails to exploit the potential advantages of small models in producing
concise intermediate reasoning. In this paper, we present R-Stitch, a
token-level, confidence-based hybrid decoding framework that accelerates CoT
inference by switching between a small language model (SLM) and a large
language model (LLM) along the reasoning trajectory. R-Stitch uses the SLM to
generate tokens by default and delegates to the LLM only when the SLM's
confidence falls below a threshold. This design avoids full-sequence rollback
and selectively invokes the LLM on uncertain steps, preserving both efficiency
and answer quality. R-Stitch is model-agnostic, training-free, and compatible
with standard decoding pipelines. Experiments on math reasoning benchmarks
demonstrate that R-Stitch achieves up to 85% reduction in inference latency
with negligible accuracy drop, highlighting its practical effectiveness in
accelerating CoT reasoning.
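The switching rule is simple enough to sketch directly; the toy models and the threshold value below are illustrative assumptions, not R-Stitch's configuration.
```python
# Token-level confidence switching: SLM decodes by default, the LLM is
# consulted only when the SLM's top-1 probability drops below tau.
import random

def hybrid_decode(slm, llm, prompt, max_tokens=50, tau=0.6):
    out = list(prompt)
    for _ in range(max_tokens):
        tok, conf = slm(out)
        if conf < tau:               # uncertain step: delegate, no rollback
            tok, _ = llm(out)
        out.append(tok)
    return out

rng = random.Random(0)
slm = lambda ctx: (rng.choice("0123456789"), rng.random())
llm = lambda ctx: ("L", 0.99)        # stand-in "large model" token
print("".join(hybrid_decode(slm, llm, prompt=[], max_tokens=20)))
```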
EFS Evolutionary Factor Searching for Sparse Portfolio Optimization Using Large Language Models
Authors: Haochen Luo, Yuan Zhang, Chen Liu
2025-07-23
http://arxiv.org/abs/2507.17211v1
Sparse portfolio optimization is a fundamental yet challenging problem in
quantitative finance, since traditional approaches heavily relying on
historical return statistics and static objectives can hardly adapt to dynamic
market regimes. To address this issue, we propose Evolutionary Factor Search
(EFS), a novel framework that leverages large language models (LLMs) to
automate the generation and evolution of alpha factors for sparse portfolio
construction. By reformulating the asset selection problem as a top-m ranking
task guided by LLM-generated factors, EFS incorporates an evolutionary feedback
loop to iteratively refine the factor pool based on performance. Extensive
experiments on five Fama-French benchmark datasets and three real-market
datasets (US50, HSI45 and CSI300) demonstrate that EFS significantly
outperforms both statistical-based and optimization-based baselines, especially
in larger asset universes and volatile conditions. Comprehensive ablation
studies validate the importance of prompt composition, factor diversity, and
backend choice. Our results highlight the promise of language-guided
evolution as a robust and interpretable paradigm for portfolio optimization
under structural constraints.
LLM Meets the Sky Heuristic Multi-Agent Reinforcement Learning for Secure Heterogeneous UAV Networks
Authors: Lijie Zheng, Ji He, Shih Yu Chang, Yulong Shen, Dusit Niyato
2025-07-23
http://arxiv.org/abs/2507.17188v1
This work tackles the physical layer security (PLS) problem of maximizing the
secrecy rate in heterogeneous UAV networks (HetUAVNs) under propulsion energy
constraints. Unlike prior studies that assume uniform UAV capabilities or
overlook energy-security trade-offs, we consider a realistic scenario where
UAVs with diverse payloads and computation resources collaborate to serve
ground terminals in the presence of eavesdroppers. To manage the complex
coupling between UAV motion and communication, we propose a hierarchical
optimization framework. The inner layer uses a semidefinite relaxation
(SDR)-based S2DC algorithm combining penalty functions and difference-of-convex
(d.c.) programming to solve the secrecy precoding problem with fixed UAV
positions. The outer layer introduces a Large Language Model (LLM)-guided
heuristic multi-agent reinforcement learning approach (LLM-HeMARL) for
trajectory optimization. LLM-HeMARL efficiently incorporates expert heuristic
policies generated by the LLM, enabling UAVs to learn energy-aware,
security-driven trajectories without the inference overhead of real-time LLM
calls. The simulation results show that our method outperforms existing
baselines in secrecy rate and energy efficiency, with consistent robustness
across varying UAV swarm sizes and random seeds.
Resilient Multi-Agent Negotiation for Medical Supply Chains Integrating LLMs and Blockchain for Transparent Coordination
Authors: Mariam ALMutairi, Hyungmin Kim
2025-07-23
http://arxiv.org/abs/2507.17134v1
Global health emergencies, such as the COVID-19 pandemic, have exposed
critical weaknesses in traditional medical supply chains, including
inefficiencies in resource allocation, lack of transparency, and poor
adaptability to dynamic disruptions. This paper presents a novel hybrid
framework that integrates blockchain technology with a decentralized, large
language model (LLM)-powered multi-agent negotiation system to enhance the
resilience and accountability of medical supply chains during crises. In this
system, autonomous agents representing manufacturers, distributors, and
healthcare institutions engage in structured, context-aware negotiation and
decision-making processes facilitated by LLMs, enabling rapid and ethical
allocation of scarce medical resources. The off-chain agent layer supports
adaptive reasoning and local decision-making, while the on-chain blockchain
layer ensures immutable, transparent, and auditable enforcement of decisions
via smart contracts. The framework also incorporates a formal cross-layer
protocol to bridge decentralized negotiation with institutional
enforcement. A simulation environment emulating pandemic scenarios evaluates
the system's performance, demonstrating improvements in negotiation efficiency,
fairness of allocation, supply chain responsiveness, and auditability. This
research contributes an innovative approach that synergizes blockchain trust
guarantees with the adaptive intelligence of LLM-driven agents, providing a
robust and scalable solution for critical supply chain coordination under
uncertainty.
Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models
Authors: Andrii Balashov
2025-07-23
http://arxiv.org/abs/2507.17107v1
Reinforcement learning (RL) is a key post-pretraining step for aligning large
language models (LLMs) with complex tasks and human preferences. While it is
often assumed that RL fine-tuning requires updating most of a model's
parameters, we challenge this assumption with a surprising finding: RL
fine-tuning consistently modifies only a small subnetwork (typically 5-30% of
weights), leaving most parameters unchanged. We call this phenomenon RL-induced
parameter update sparsity. It arises naturally, without any sparsity
constraints or parameter-efficient tuning, and appears across multiple RL
algorithms (e.g., PPO, DPO, SimPO, PRIME) and model families (e.g., OpenAI,
Meta, and open-source LLMs). Moreover, the subnetworks updated by RL show
substantial overlap across different seeds, datasets, and algorithms, far
exceeding chance, suggesting a partially transferable structure in the
pretrained model. We show that fine-tuning only this sparse subnetwork recovers
full model performance and yields parameters nearly identical to the fully
fine-tuned model. Our analysis suggests this sparsity emerges because RL
operates near the model's original distribution, requiring only targeted
changes. KL penalties, gradient clipping, and on-policy dynamics have limited
effect on the sparsity pattern. These findings shed new light on how RL adapts
models: not by shifting all weights, but by focusing training on a small,
consistently updated subnetwork. This insight enables more efficient RL methods
and reframes sparsity through the lens of the lottery ticket hypothesis.
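Measuring this update sparsity from two checkpoints is straightforward; the sketch below uses an exact-zero difference test, which is an assumption (a small tolerance may be more realistic for low-precision checkpoints).
```python
# Compare pre- and post-RL checkpoints, count changed weights, and build a
# mask so later fine-tuning can touch only that subnetwork.
import numpy as np

def update_mask(w_before, w_after, atol=0.0):
    mask = np.abs(w_after - w_before) > atol
    return mask, mask.mean()

w0 = np.random.randn(1000)
w1 = w0.copy()
idx = np.random.choice(1000, size=150, replace=False)   # ~15% subnetwork
w1[idx] += np.random.randn(150) * 0.01
mask, frac = update_mask(w0, w1)
print(f"updated fraction: {frac:.1%}")                  # ~15.0%

grad = np.random.randn(1000)
masked_grad = grad * mask        # later: apply gradients only through the mask
```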
LoRA is All You Need for Safety Alignment of Reasoning LLMs
Authors: Yihao Xue, Baharan Mirzasoleiman
2025-07-22
http://arxiv.org/abs/2507.17075v1
Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex
problems that were previously out of reach. To ensure LLMs do not assist with
harmful requests, safety alignment fine-tuning is necessary in the
post-training phase. However, safety alignment fine-tuning has recently been
shown to significantly degrade reasoning abilities, a phenomenon known as the
"Safety Tax". In this work, we show that using LoRA for SFT on refusal datasets
effectively aligns the model for safety without harming its reasoning
capabilities. This is because restricting the safety weight updates to a
low-rank space minimizes the interference with the reasoning weights. Our
extensive experiments across four benchmarks covering math, science, and coding
show that this approach produces highly safe LLMs, with safety levels
comparable to full-model fine-tuning, without compromising their reasoning
abilities. Additionally, we observe that LoRA induces weight updates with
smaller overlap with the initial weights compared to full-model fine-tuning. We
also explore methods that further reduce such overlap, via regularization or
during weight merging, and observe some improvement on certain tasks. We hope
this result motivates designing approaches that yield more consistent
improvements in the reasoning-safety trade-off.
Parallelism Meets Adaptiveness Scalable Documents Understanding in Multi-Agent LLM Systems
Authors: Chengxuan Xia, Qianye Wu, Sixuan Tian, Yilun Hao
2025-07-22
http://arxiv.org/abs/2507.17061v1
Large language model (LLM) agents have shown increasing promise for
collaborative task completion. However, existing multi-agent frameworks often
rely on static workflows, fixed roles, and limited inter-agent communication,
reducing their effectiveness in open-ended, high-complexity domains. This paper
proposes a coordination framework that enables adaptiveness through three core
mechanisms: dynamic task routing, bidirectional feedback, and parallel agent
evaluation. The framework allows agents to reallocate tasks based on confidence
and workload, exchange structured critiques to iteratively improve outputs, and
crucially compete on high-ambiguity subtasks with evaluator-driven selection of
the most suitable result. We instantiate these principles in a modular
architecture and demonstrate substantial improvements in factual coverage,
coherence, and efficiency over static and partially adaptive baselines. Our
findings highlight the benefits of incorporating both adaptiveness and
structured competition in multi-agent LLM systems.
Beyond Context Limits Subconscious Threads for Long-Horizon Reasoning
Authors: Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass
2025-07-22
http://arxiv.org/abs/2507.16784v1
To break the context limits of large language models (LLMs) that bottleneck
reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM),
a family of LLMs trained for recursive and decompositional problem solving, and
TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond
context limits. Together, TIM hosted on TIMRUN supports virtually unlimited
working memory and multi-hop tool calls within a single language model
inference, overcoming output limits, positional-embedding constraints, and
GPU-memory bottlenecks. Performance is achieved by modeling natural language as
reasoning trees measured by both length and depth instead of linear sequences.
The reasoning trees consist of tasks with thoughts, recursive subtasks, and
conclusions based on the concept we proposed in Schroeder et al., 2025. During
generation, we maintain a working memory that retains only the key-value states
of the most relevant context tokens, selected by a rule-based subtask-pruning
mechanism, enabling reuse of positional embeddings and GPU memory pages
throughout reasoning. Experimental results show that our system sustains high
inference throughput, even when manipulating up to 90% of the KV cache in GPU
memory. It also delivers accurate reasoning on mathematical tasks and handles
information retrieval challenges that require long-horizon reasoning and
multi-hop tool use.
Collaborative Inference and Learning between Edge SLMs and Cloud LLMs A Survey of Algorithms, Execution, and Open Challenges
Authors: Senyao Li, Haozhao Wang, Wenchao Xu, Rui Zhang, Song Guo, Jingling Yuan, Xian Zhong, Tianwei Zhang, Ruixuan Li
2025-07-22
http://arxiv.org/abs/2507.16731v1
As large language models (LLMs) evolve, deploying them solely in the cloud or
compressing them for edge devices has become inadequate due to concerns about
latency, privacy, cost, and personalization. This survey explores a
collaborative paradigm in which cloud-based LLMs and edge-deployed small
language models (SLMs) cooperate across both inference and training. We present
a unified taxonomy of edge-cloud collaboration strategies. For inference, we
categorize approaches into task assignment, task division, and mixture-based
collaboration at both task and token granularity, encompassing adaptive
scheduling, resource-aware offloading, speculative decoding, and modular
routing. For training, we review distributed adaptation techniques, including
parameter alignment, pruning, bidirectional distillation, and
small-model-guided optimization. We further summarize datasets, benchmarks, and
deployment cases, and highlight privacy-preserving methods and vertical
applications. This survey provides the first systematic foundation for LLM-SLM
collaboration, bridging system and algorithm co-design to enable efficient,
scalable, and trustworthy edge-cloud intelligence.
ACT Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training
Authors: Shreya Saxena, Siva Prasad, Zishan Ahmad, Vishal Vaddina
2025-07-22
http://arxiv.org/abs/2507.16478v1
Code translation is a crucial process in software development and migration
projects, enabling interoperability between different programming languages and
enhancing software adaptability and thus longevity. Traditional automated
translation methods rely heavily on handcrafted transformation rules, which
often lack flexibility and scalability. Meanwhile, advanced language models
present promising alternatives but are often limited by proprietary, API-based
implementations that raise concerns over data security and reliance. In this
paper, we present Auto-Train for Code Translation (ACT), an innovative
framework that aims to improve code translation capabilities by enabling
in-house finetuning of open-source Large Language Models (LLMs). ACT's
automated pipeline significantly boosts the performance of these models,
narrowing the gap between open-source accessibility and the high performance of
closed-source solutions. Central to ACT is its synthetic data generation
module, which builds extensive, high-quality datasets from initial code
samples, incorporating unit tests to ensure functional accuracy and diversity.
ACT's evaluation framework incorporates execution-level checks, offering a
comprehensive assessment of translation quality. A key feature in ACT is its
controller module, which manages the entire pipeline by dynamically adjusting
hyperparameters, orchestrating iterative data generation, and finetuning based
on real-time evaluations. This enables ACT to intelligently optimize when to
continue training, generate additional targeted training data, or stop the
process. Our results demonstrate that ACT consistently enhances the
effectiveness of open-source models, offering businesses and developers a
secure and reliable alternative. Additionally, applying our data generation
pipeline to industry-scale migration projects has led to a notable increase in
developer productivity.
Mamba-OTR a Mamba-based Solution for Online Take and Release Detection from Untrimmed Egocentric Video
Authors: Alessandro Sebastiano Catinello, Giovanni Maria Farinella, Antonino Furnari
2025-07-22
http://arxiv.org/abs/2507.16342v1
This work tackles the problem of Online detection of Take and Release (OTR)
of an object in untrimmed egocentric videos. This task is challenging due to
severe label imbalance, with temporally sparse positive annotations, and the
need for precise temporal predictions. Furthermore, methods need to be
computationally efficient in order to be deployed in real-world online
settings. To address these challenges, we propose Mamba-OTR, a model based on
the Mamba architecture. Mamba-OTR is designed to exploit temporal recurrence
during inference while being trained on short video clips. To address label
imbalance, our training pipeline incorporates the focal loss and a novel
regularization scheme that aligns model predictions with the evaluation metric.
Extensive experiments on EPIC-KITCHENS-100, comparisons with
transformer-based approaches, and the evaluation of different training and test
schemes demonstrate the superiority of Mamba-OTR in both accuracy and
efficiency. These findings are particularly evident when evaluating full-length
videos or high frame-rate sequences, even when trained on short video snippets
for computational convenience. The proposed Mamba-OTR achieves a noteworthy
mp-mAP of 45.48 when operating in a sliding-window fashion, and 43.35 in
streaming mode, versus the 20.32 of a vanilla transformer and 25.16 of a
vanilla Mamba, thus providing a strong baseline for OTR. We will publicly
release the source code of Mamba-OTR to support future research.
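The focal loss mentioned above is standard (Lin et al., 2017); a minimal version for the binary take/release setting, with default gamma/alpha values that are not necessarily Mamba-OTR's settings:
```python
# Binary focal loss: down-weights easy negatives, which dominate when
# positive annotations are temporally sparse.
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """p: predicted probability of the positive class; y: 0/1 labels."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y == 1, p, 1 - p)
    a = np.where(y == 1, alpha, 1 - alpha)
    return -(a * (1 - pt) ** gamma * np.log(pt)).mean()

y = np.array([0, 0, 0, 0, 0, 0, 0, 1])        # sparse positives
p = np.array([.1, .2, .1, .05, .1, .3, .1, .4])
print(focal_loss(p, y))
```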
CompLeak Deep Learning Model Compression Exacerbates Privacy Leakage
Authors: Na Li, Yansong Gao, Hongsheng Hu, Boyu Kuang, Anmin Fu
2025-07-22
http://arxiv.org/abs/2507.16872v1
Model compression is crucial for minimizing memory storage and accelerating
inference in deep learning (DL) models, including recent foundation models like
large language models (LLMs). Users can access different compressed model
versions according to their resources and budget. However, while existing
compression operations primarily focus on optimizing the trade-off between
resource efficiency and model performance, the privacy risks introduced by
compression remain overlooked and insufficiently understood.
In this work, through the lens of membership inference attack (MIA), we
propose CompLeak, the first privacy risk evaluation framework examining three
widely used compression configurations that are pruning, quantization, and
weight clustering supported by the commercial model compression framework of
Google's TensorFlow-Lite (TF-Lite) and Facebook's PyTorch Mobile. CompLeak has
three variants, given available access to the number of compressed models and
original model. CompLeakNR starts by adopting existing MIA methods to attack a
single compressed model, and identifies that different compressed models
influence members and non-members differently. When the original model and one
compressed model are available, CompLeakSR leverages the compressed model as a
reference to the original model and uncovers more privacy leakage by combining meta
information (e.g., confidence vector) from both models. When multiple
compressed models are available with/without accessing the original model,
CompLeakMR innovatively exploits privacy leakage information from multiple
compressed versions to substantially amplify the overall privacy leakage. We conduct
extensive experiments on seven diverse model architectures (from ResNet to
foundation models of BERT and GPT-2), and six image and textual benchmark
datasets.
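A toy illustration of the intuition behind combining models: members tend to keep high confidence after compression, so a joint score over the original and compressed confidences separates members from non-members better than either alone. The product rule and threshold below are expository assumptions, not CompLeak's attack.
```python
# Threshold MIA on joint confidence of original and compressed models.
import numpy as np

def mia_score(conf_orig, conf_comp):
    return conf_orig * conf_comp          # joint meta-information signal

rng = np.random.default_rng(1)
members = mia_score(rng.uniform(.8, 1, 100), rng.uniform(.7, 1, 100))
nonmembers = mia_score(rng.uniform(.3, .9, 100), rng.uniform(.2, .8, 100))
threshold = 0.6
print(f"TPR={(members > threshold).mean():.2f}  "
      f"FPR={(nonmembers > threshold).mean():.2f}")
```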
Time to Split Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders
Authors: Danil Gusak, Anna Volodkevich, Anton Klenitskiy, Alexey Vasilev, Evgeny Frolov
2025-07-22
http://arxiv.org/abs/2507.16289v1
Modern sequential recommender systems, ranging from lightweight
transformer-based variants to large language models, have become increasingly
prominent in academia and industry due to their strong performance in the
next-item prediction task. Yet common evaluation protocols for sequential
recommendations remain insufficiently developed: they often fail to reflect the
corresponding recommendation task accurately, or are not aligned with
real-world scenarios.
Although the widely used leave-one-out split matches next-item prediction, it
permits overlap between training and test periods, which leads to temporal
leakage and an unrealistically long test horizon, limiting real-world relevance.
Global temporal splitting addresses these issues by evaluating on distinct
future periods. However, its applications to sequential recommendations remain
loosely defined, particularly in terms of selecting target interactions and
constructing a validation subset that provides necessary consistency between
validation and test metrics.
In this paper, we demonstrate that evaluation outcomes can vary significantly
across splitting strategies, influencing model rankings and practical
deployment decisions. To improve reproducibility in both academic and
industrial settings, we systematically compare different splitting strategies
for sequential recommendations across multiple datasets and established
baselines. Our findings show that prevalent splits, such as leave-one-out, may
be insufficiently aligned with more realistic evaluation strategies. Code:
https://github.com/monkey0head/time-to-split
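A minimal global temporal split of the kind discussed above, assuming a simple interaction log; the column names and the "first post-split item per user" target rule are illustrative.
```python
# Global temporal split: everything before t_split trains the model;
# each user's first post-split interaction is the test target.
import pandas as pd

def global_temporal_split(df, t_split):
    train = df[df.timestamp < t_split]
    future = df[df.timestamp >= t_split].sort_values("timestamp")
    test = future.groupby("user_id").first().reset_index()  # next item per user
    return train, test

df = pd.DataFrame({
    "user_id":   [1, 1, 2, 1, 2],
    "item_id":   [10, 11, 12, 13, 14],
    "timestamp": [1, 2, 3, 5, 6],
})
train, test = global_temporal_split(df, t_split=4)
```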
Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training
Authors: Zixiao Huang, Junhao Hu, Hao Lin, Chunyang Zhu, Yueran Tang, Quanlu Zhang, Zhen Guo, Zhenhua Li, Shengen Yan, Zhenhua Zhu, Guohao Dai, Yu Wang
2025-07-22
http://arxiv.org/abs/2507.16274v1
The rapid scaling of large language models (LLMs) has significantly increased
GPU memory pressure, which is further aggravated by training optimization
techniques such as virtual pipeline and recomputation that disrupt tensor
lifespans and introduce considerable memory fragmentation. Default GPU memory
allocators of popular deep learning frameworks like PyTorch use online
strategies without knowledge of tensor lifespans, which can waste up to 43% of
memory and cause out-of-memory errors, rendering optimization techniques
ineffective or even unusable.
To address this, we introduce STWeaver, a GPU memory allocator for deep
learning frameworks that reduces fragmentation by exploiting the spatial and
temporal regularity in memory allocation behaviors of training workloads.
STWeaver introduces a novel paradigm that combines offline planning with online
allocation. The offline planning leverages spatio-temporal regularities to
generate a near-optimal allocation plan, while the online allocation handles
complex and dynamic models such as Mixture-of-Experts (MoE). Built as a
pluggable PyTorch allocator, STWeaver reduces the fragmentation ratio on average
by 79.2% (up to 100%) across both dense and sparse models, with negligible
overhead. This enables more efficient, high-throughput training configurations
and improves performance by up to 32.5%.
Benchmarking LLM Privacy Recognition for Social Robot Decision Making
Authors: Dakota Sullivan, Shirley Zhang, Jennica Li, Heather Kirkorian, Bilge Mutlu, Kassem Fawaz
2025-07-22
http://arxiv.org/abs/2507.16124v1
Social robots are embodied agents that interact with people while following
human norms. These robots interact using verbal and non-verbal communication
cues, and share the physical environments of people. While social robots have
previously utilized rule-based systems or probabilistic models for user
interaction, the rapid evolution of large language models (LLMs) presents new
opportunities to develop LLM-empowered social robots for enhanced human-robot
interaction. To fully realize these capabilities, however, robots need to
collect data such as audio, fine-grained images, video, and locations. As a
result, LLMs often process sensitive personal information, particularly within
home environments. Given the tension between utility and privacy risks,
evaluating how current LLMs manage sensitive data is critical. Specifically, we
aim to explore the extent to which out-of-the-box LLMs are privacy-aware in the
context of household social robots. In this study, we present a set of
privacy-relevant scenarios crafted through the lens of Contextual Integrity
(CI). We first survey users' privacy preferences regarding in-home social robot
behaviors and then examine how their privacy orientation affects their choices
of these behaviors (N = 450). We then provide the same set of scenarios and
questions to state-of-the-art LLMs (N = 10) and find that the agreement between
humans and LLMs is low. To further investigate the capabilities of LLMs as a
potential privacy controller, we implement four additional prompting strategies
and compare their results. Finally, we discuss the implications and potential
of AI privacy awareness in human-robot interaction.
TorchAO PyTorch-Native Training-to-Serving Model Optimization
Authors: Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, Supriya Rao, Thien Tran, Aleksandar Samardžić
2025-07-21
http://arxiv.org/abs/2507.16099v1
We present TorchAO, a PyTorch-native model optimization framework leveraging
quantization and sparsity to provide an end-to-end, training-to-serving
workflow for AI models. TorchAO supports a variety of popular model
optimization techniques, including FP8 quantized training, quantization-aware
training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and
leverages a novel tensor subclass abstraction to represent a variety of
widely-used, backend agnostic low precision data types, including INT4, INT8,
FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader
ecosystem at each step of the model optimization pipeline, from pre-training
(TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM,
SGLang, ExecuTorch), connecting an otherwise fragmented space in a single,
unified workflow. TorchAO has enabled recent launches of the quantized Llama
3.2 1B/3B and LlamaGuard3-8B models and is open-source at
https://github.com/pytorch/ao/.
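Typical usage follows the project README; a sketch of post-training weight-only quantization (API names may shift between torchao versions, so treat this as indicative rather than definitive):
```python
# Post-training INT8 weight-only quantization with TorchAO's quantize_ API.
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
quantize_(model, int8_weight_only())   # swaps weights to an INT8 tensor subclass
out = model(torch.randn(1, 1024))
```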
On the transferability of Sparse Autoencoders for interpreting compressed models
Authors: Suchit Gupte, Vishnu Kabir Chhabra, Mohammad Mahdi Khalili
2025-07-21
http://arxiv.org/abs/2507.15977v1
Modern LLMs face inference efficiency challenges due to their scale. To
address this, many compression methods have been proposed, such as pruning and
quantization. However, the effect of compression on a model's interpretability
remains elusive. While several model interpretation approaches exist, such as
circuit discovery, Sparse Autoencoders (SAEs) have proven particularly
effective in decomposing a model's activation space into its feature basis. In
this work, we explore the differences in SAEs for the original and compressed
models. We find that SAEs trained on the original model can interpret the
compressed model, albeit with slight performance degradation compared to an SAE
trained on the compressed model. Furthermore, simply pruning the original
SAE itself achieves performance comparable to training a new SAE on the pruned
model. This finding enables us to mitigate the extensive training costs of
SAEs.
Just Ask for Music (JAM) Multimodal and Personalized Natural Language Music Recommendation
Authors: Alessandro B. Melchiorre, Elena V. Epure, Shahed Masoudian, Gustavo Escobedo, Anna Hausberger, Manuel Moussallam, Markus Schedl
2025-07-21
http://arxiv.org/abs/2507.15826v1
Natural language interfaces offer a compelling approach for music
recommendation, enabling users to express complex preferences conversationally.
While Large Language Models (LLMs) show promise in this direction, their
scalability in recommender systems is limited by high costs and latency.
Retrieval-based approaches using smaller language models mitigate these issues
but often rely on single-modal item representations, overlook long-term user
preferences, and require full model retraining, posing challenges for
real-world deployment. In this paper, we present JAM (Just Ask for Music), a
lightweight and intuitive framework for natural language music recommendation.
JAM models user-query-item interactions as vector translations in a shared
latent space, inspired by knowledge graph embedding methods like TransE. To
capture the complexity of music and user intent, JAM aggregates multimodal item
features via cross-attention and sparse mixture-of-experts. We also introduce
JAMSessions, a new dataset of over 100k user-query-item triples with anonymized
user/item embeddings, uniquely combining conversational queries and user
long-term preferences. Our results show that JAM provides accurate
recommendations, produces intuitive representations suitable for practical use
cases, and can be easily integrated with existing music recommendation stacks.
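The TransE-style scoring JAM builds on is easy to state: a user-query-item triple is plausible when user + query lands near the item in the shared latent space. A standalone sketch of the scoring rule only, with illustrative embedding sources:
```python
# TransE-style translation scoring: score = -||u + q - i||.
import numpy as np

def transe_score(u, q, i):
    return -np.linalg.norm(u + q - i)          # higher = better match

rng = np.random.default_rng(0)
u = rng.normal(size=32)                        # long-term user embedding
q = rng.normal(size=32)                        # encoded natural-language query
items = rng.normal(size=(1000, 32))            # multimodal item embeddings
scores = -np.linalg.norm(u + q - items, axis=1)
top10 = np.argsort(scores)[::-1][:10]          # recommend the 10 closest items
```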
Reservoir Computing as a Language Model
Authors: Felix Köster, Atsushi Uchida
2025-07-21
http://arxiv.org/abs/2507.15779v1
Large Language Models (LLMs) have dominated the science and media landscape
due to their impressive performance in processing large chunks of data and
producing human-like text. Nevertheless, their huge energy demand and slow
processing remain a bottleneck for further increasing quality while also
making the models accessible to everyone. To address this bottleneck, we
investigate how reservoir computing performs on natural text processing, which
could enable fast and energy-efficient hardware implementations. Studies
investigating the use of reservoir computing as a language model remain sparse.
In this paper, we compare three distinct approaches for character-level
language modeling, two different reservoir computing approaches, where only an
output layer is trainable, and the well-known transformer-based architectures,
which fully learn an attention-based sequence representation. We explore the
performance, computational cost and prediction accuracy for both paradigms by
equally varying the number of trainable parameters for all models. Using a
consistent pipeline for all three approaches, we demonstrate that transformers
excel in prediction quality, whereas reservoir computers remain highly
efficient, reducing training and inference time. Furthermore, we
investigate two types of reservoir computing: a traditional reservoir with a
static linear readout, and an attention-enhanced reservoir that dynamically
adapts its output weights via an attention mechanism. Our findings underline
how these paradigms scale and offer guidelines to balance resource constraints
with performance.
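A minimal echo-state-network character model illustrates the "only the readout is trained" property the paper contrasts with transformers; the sizes, spectral radius, and ridge penalty below are illustrative.
```python
# Echo state network: fixed random reservoir, ridge-regression readout.
import numpy as np

rng = np.random.default_rng(0)
text = "hello world " * 50
chars = sorted(set(text)); idx = {c: i for i, c in enumerate(chars)}
V, N = len(chars), 300

W_in = rng.normal(size=(N, V)) * 0.5
W = rng.normal(size=(N, N))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))           # spectral radius 0.9

def run_reservoir(s):
    x, states = np.zeros(N), []
    for c in s:
        u = np.zeros(V); u[idx[c]] = 1.0
        x = np.tanh(W_in @ u + W @ x)               # fixed, untrained dynamics
        states.append(x.copy())
    return np.array(states)

X = run_reservoir(text[:-1])                        # states predict next char
Y = np.eye(V)[[idx[c] for c in text[1:]]]
W_out = np.linalg.solve(X.T @ X + 1e-3 * np.eye(N), X.T @ Y)  # ridge readout
pred = chars[int((X[-1] @ W_out).argmax())]
```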
Who Leads in the Shadows? ERGM and Centrality Analysis of Congressional Democrats on Bluesky
Authors: Gordon Hew, Ian McCulloh
2025-07-21
http://arxiv.org/abs/2507.16858v1
Following the 2024 U.S. presidential election, Democratic lawmakers and their
supporters increasingly migrated from mainstream social media platforms like X
(formerly Twitter) to decentralized alternatives such as Bluesky. This study
investigates how Congressional Democrats use Bluesky to form networks of
influence and disseminate political messaging in a platform environment that
lacks algorithmic amplification. We employ a mixed-methods approach that
combines social network analysis, exponential random graph modeling (ERGM),
and transformer-based topic modeling (BERTopic) to analyze follows, mentions,
reposts, and discourse patterns among 182 verified Democratic members of
Congress. Our findings show that while party leaders such as Hakeem Jeffries
and Elizabeth Warren dominate visibility metrics, overlooked figures like
Marcy Kaptur, Donald Beyer, and Dwight Evans occupy structurally central
positions, suggesting latent influence within the digital party ecosystem. ERGM
results reveal significant homophily along ideological, state, and leadership
lines, with Senate leadership exhibiting lower connectivity. Topic analysis
identifies both shared themes (e.g., reproductive rights, foreign conflicts)
and subgroup-specific issues, with The Squad showing the most distinct
discourse profile. These results demonstrate the potential of decentralized
platforms to reshape intra-party communication dynamics and highlight the need
for continued computational research on elite political behavior in emerging
digital environments.
Transformer-based Deep Learning Model for Joint Routing and Scheduling with Varying Electric Vehicle Numbers
Authors: Jun Kang Yap, Vishnu Monn Baskaran, Wen Shan Tan, Ze Yang Ding, Hao Wang, David L. Dowe
2025-07-21
http://arxiv.org/abs/2507.15385v1
The growing integration of renewable energy sources in modern power systems
has introduced significant operational challenges due to their intermittent and
uncertain outputs. In recent years, mobile energy storage systems (ESSs) have
emerged as a popular flexible resource for mitigating these challenges.
Compared to stationary ESSs, mobile ESSs offer additional spatial flexibility,
enabling cost-effective energy delivery through the transportation network.
However, the widespread deployment of mobile ESSs is often hindered by the high
investment cost, which has motivated researchers to investigate utilising more
readily available alternatives, such as electric vehicles (EVs) as mobile
energy storage units instead. Hence, we explore this opportunity with a
MIP-based day-ahead electric vehicle joint routing and scheduling problem in
this work. However, solving the problem in a practical setting can often be
computationally intractable, since the existence of binary variables makes it
combinatorially challenging. Therefore, we propose to simplify the problem's
solution process for a MIP solver by pruning the solution search space with a
transformer-based deep learning (DL) model. This is done by training the model
to rapidly predict the optimal binary solutions. In addition, unlike many
existing DL approaches that assume fixed problem structures, the proposed model
is designed to accommodate problems with EV fleets of any size. This
flexibility is essential since frequent re-training can introduce significant
computational overhead. We evaluated the approach with simulations on the IEEE
33-bus system coupled with the Nguyen-Dupuis transportation network.
Metaphor and Large Language Models When Surface Features Matter More than Deep Understanding
Authors: Elisa Sanchez-Bayona, Rodrigo Agerri
2025-07-21
http://arxiv.org/abs/2507.15357v1
This paper presents a comprehensive evaluation of the capabilities of Large
Language Models (LLMs) in metaphor interpretation across multiple datasets,
tasks, and prompt configurations. Although metaphor processing has gained
significant attention in Natural Language Processing (NLP), previous research
has been limited to single-dataset evaluations and specific task settings,
often using artificially constructed data through lexical replacement. We
address these limitations by conducting extensive experiments using diverse
publicly available datasets with inference and metaphor annotations, focusing
on Natural Language Inference (NLI) and Question Answering (QA) tasks. The
results indicate that LLMs' performance is more influenced by features like
lexical overlap and sentence length than by metaphorical content, demonstrating
that any alleged emergent abilities of LLMs to understand metaphorical language
are the result of a combination of surface-level features, in-context learning,
and linguistic knowledge. This work provides critical insights into the current
capabilities and limitations of LLMs in processing figurative language,
highlighting the need for more realistic evaluation frameworks in metaphor
interpretation tasks. Data and code are publicly available.
Scaling Decentralized Learning with FLock
Authors: Zehua Cheng, Rui Sun, Jiahao Sun, Yike Guo
2025-07-21
http://arxiv.org/abs/2507.15349v1
Fine-tuning large language models (LLMs) is hindered by the lack of
centralized control and by the massive computing and communication overhead of
decentralized schemes. While standard federated learning (FL) supports data
privacy, the central server requirement creates a single point of attack and a
vulnerability to poisoning attacks. Scaling this line of work to 70B-parameter
models in heterogeneous, trustless environments has remained a major unbroken
bottleneck. This paper introduces
FLock, a decentralized framework for secure and efficient collaborative
fine-tuning. Integrating a blockchain-based trust layer with economic
incentives, FLock replaces the central aggregator with a secure, auditable
protocol for cooperation among untrusted parties. We present the first
empirical validation of fine-tuning a 70B LLM in a secure, multi-domain,
decentralized setting. Our experiments show the FLock framework defends against
backdoor poisoning attacks that compromise standard FL optimizers and fosters
synergistic knowledge transfer. The resulting models show a >68% reduction in
adversarial attack success rates. The global model also demonstrates superior
cross-domain generalization, outperforming models trained in isolation on their
own specialized data.
IM-Chat A Multi-agent LLM-based Framework for Knowledge Transfer in Injection Molding Industry
Authors: Junhyeong Lee, Joon-Young Kim, Heekyu Kim, Inhyo Lee, Seunghwa Ryu
2025-07-21
http://arxiv.org/abs/2507.15268v1
The injection molding industry faces critical challenges in preserving and
transferring field knowledge, particularly as experienced workers retire and
multilingual barriers hinder effective communication. This study introduces
IM-Chat, a multi-agent framework based on large language models (LLMs),
designed to facilitate knowledge transfer in injection molding. IM-Chat
integrates both limited documented knowledge (e.g., troubleshooting tables,
manuals) and extensive field data modeled through a data-driven process
condition generator that infers optimal manufacturing settings from
environmental inputs such as temperature and humidity, enabling robust and
context-aware task resolution. By adopting a retrieval-augmented generation
(RAG) strategy and tool-calling agents within a modular architecture, IM-Chat
ensures adaptability without the need for fine-tuning. Performance was assessed
across 100 single-tool and 60 hybrid tasks for GPT-4o, GPT-4o-mini, and
GPT-3.5-turbo by domain experts using a 10-point rubric focused on relevance
and correctness, and was further supplemented by automated evaluation using
GPT-4o guided by a domain-adapted instruction prompt. The evaluation results
indicate that more capable models tend to achieve higher accuracy, particularly
in complex, tool-integrated scenarios. Overall, these findings demonstrate the
viability of multi-agent LLM systems for industrial knowledge workflows and
establish IM-Chat as a scalable and generalizable approach to AI-assisted
decision support in manufacturing.
CHADET Cross-Hierarchical-Attention for Depth-Completion Using Unsupervised Lightweight Transformer
Authors: Kevin Christiansen Marsim, Jinwoo Jeon, Yeeun Kim, Myeongwoo Jeong, Hyun Myung
2025-07-21
http://arxiv.org/abs/2507.15189v1
Depth information, which specifies the distance between objects and the
robot's current position, is essential for many robot tasks such as navigation.
Recently, researchers have proposed depth completion frameworks to provide
dense depth maps that offer comprehensive information about the surrounding
environment. However, existing methods show significant trade-offs between
computational efficiency and accuracy during inference. The substantial memory
and computational requirements make them unsuitable for real-time applications,
highlighting the need to improve the completeness and accuracy of depth
information while improving processing speed to enhance robot performance in
various tasks. To address these challenges, in this paper we propose
CHADET (cross-hierarchical-attention depth-completion transformer), a
lightweight depth-completion network that can generate accurate dense depth
maps from RGB images and sparse depth points. For each pair, features are
extracted from the depthwise blocks and passed to the equally lightweight
transformer-based decoder. In the decoder, we utilize a novel
cross-hierarchical-attention module that refines the image features using the
depth information. Our approach improves the quality of the depth-map
prediction and reduces memory usage, as validated on the KITTI, NYUv2, and
VOID datasets.
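A minimal PyTorch sketch of the underlying idea, where depth features refine
image features through cross-attention; the dimensions, token counts, and
residual/normalization arrangement are assumptions, not CHADET's exact module.

```python
# Sketch of cross-attention refining image tokens with depth tokens; layer
# sizes are assumptions for illustration.
import torch
import torch.nn as nn

class CrossAttentionRefine(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat, depth_feat):
        # img_feat, depth_feat: (B, N, dim) token sequences from the two branches
        refined, _ = self.attn(query=img_feat, key=depth_feat, value=depth_feat)
        return self.norm(img_feat + refined)   # residual connection

img = torch.randn(2, 196, 64)    # e.g. 14x14 image tokens
depth = torch.randn(2, 196, 64)  # sparse-depth tokens after encoding
print(CrossAttentionRefine()(img, depth).shape)  # torch.Size([2, 196, 64])
```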
Beyond Visual Line of Sight UAVs with Edge AI, Connected LLMs, and VR for Autonomous Aerial Intelligence
Authors: Andres Navarro, Carlos de Quinto, José Alberto Hernández
2025-07-20
http://arxiv.org/abs/2507.15049v1
Unmanned Aerial Vehicles are reshaping Non-Terrestrial Networks (NTNs) by
acting as agile, intelligent nodes capable of advanced analytics and
instantaneous situational awareness. This article introduces a budget-friendly
quadcopter platform that unites 5G communications, edge-based processing, and
AI to tackle core challenges in NTN scenarios. Outfitted with a panoramic
camera, robust onboard computation, and connected LLMs, the drone system
delivers seamless object recognition, contextual analysis, and immersive
operator experiences through virtual reality (VR) technology. Field
evaluations confirm the platform's ability to process visual streams with low
latency and sustain robust 5G links. Adding LLMs further streamlines
operations by extracting actionable insights and
refining collected data for decision support. Demonstrated use cases, including
emergency response, infrastructure assessment, and environmental surveillance,
underscore the system's adaptability in demanding contexts.
From Neurons to Semantics Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment
Authors: Chongxuan Huang, Yongshi Ye, Biao Fu, Qifeng Su, Xiaodong Shi
2025-07-20
http://arxiv.org/abs/2507.14900v2
Large language models (LLMs) have demonstrated remarkable multilingual
capabilities; however, how to evaluate cross-lingual alignment remains
underexplored. Existing alignment benchmarks primarily focus on sentence
embeddings, but prior research has shown that neural models tend to induce a
non-smooth representation space, which impairs the evaluation of semantic
alignment for low-resource languages. Inspired by neuroscientific findings
that similar information activates overlapping neuronal regions, we propose
Neuron State-Based Cross-Lingual Alignment (NeuronXA), a more semantically
grounded approach to assessing the cross-lingual alignment capabilities of
LLMs. We evaluate NeuronXA on several prominent multilingual LLMs (LLaMA,
Qwen, Mistral, GLM, and OLMo) across two transfer tasks and three multilingual
benchmarks. The results demonstrate that with only 100 parallel sentence
pairs, NeuronXA achieves a Pearson correlation of 0.9556 with downstream task
performance and 0.8514 with transferability. These findings demonstrate
NeuronXA's effectiveness in assessing both cross-lingual alignment and
transferability, even with a small dataset, highlighting its potential to
advance cross-lingual alignment research and to improve the semantic
understanding of multilingual LLMs.
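A simplified stand-in for a neuron-state alignment score: binarize neuron
activations for parallel sentences and measure how often neurons are in the
same state across languages. The thresholding and agreement metric are
assumptions, not NeuronXA's exact scoring.

```python
# Sketch of a neuron-state alignment score on parallel sentences; binary
# activation agreement is a simplifying assumption.
import numpy as np

def neuron_state_alignment(acts_src, acts_tgt, threshold=0.0):
    """acts_*: (num_pairs, num_neurons) activations for parallel sentences."""
    s = acts_src > threshold          # binary neuron states, source language
    t = acts_tgt > threshold          # binary neuron states, target language
    return (s == t).mean()            # fraction of neurons in the same state

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 4096))   # stand-in activations for 100 sentence pairs
print(neuron_state_alignment(base, base + 0.1 * rng.normal(size=base.shape)))
```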
Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs
Authors: Boyi Deng, Yu Wan, Baosong Yang, Fei Huang, Wenjie Wang, Fuli Feng
2025-07-20
http://arxiv.org/abs/2507.14894v1
Large Language Models (LLMs) have impressive multilingual capabilities, but
they suffer from unexpected code-switching, also known as language mixing,
which involves switching to unexpected languages in the model response. This
problem leads to poor readability and degrades the usability of model
responses. However, existing work on this issue lacks a mechanistic analysis
and shows limited effectiveness. In this paper, we first provide an in-depth
analysis of unexpected code-switching using sparse autoencoders and find that
when LLMs switch to a language, the features of that language exhibit
excessive pre-activation values. Based on our findings, we propose Sparse
Autoencoder-guided Supervised Fine-Tuning (SASFT), which teaches LLMs to
maintain appropriate pre-activation values of specific language features
during training. Experiments on five models across three languages demonstrate
that SASFT consistently reduces unexpected code-switching by more than 50%
compared to standard supervised fine-tuning, with complete elimination in four
cases.
Moreover, SASFT maintains or even improves the models' performance on six
multilingual benchmarks, showing its effectiveness in addressing code-switching
while preserving multilingual capabilities.
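A hedged sketch of an SASFT-style auxiliary loss: penalize pre-activations of
designated language features whenever they exceed a target value during
fine-tuning. The feature indices, target, and weight alpha are assumptions.

```python
# Sketch of an auxiliary loss that discourages excessive pre-activation of
# SAE features tied to an unwanted language; indices/targets are assumptions.
import torch
import torch.nn.functional as F

def sasft_loss(lm_loss, hidden, sae_encoder_weight, lang_feature_ids, target, alpha=0.1):
    """hidden: (B, T, d) residual stream; sae_encoder_weight: (num_features, d)."""
    pre_acts = hidden @ sae_encoder_weight[lang_feature_ids].T   # (B, T, k) pre-activations
    excess = F.relu(pre_acts - target)                           # penalize only values above target
    return lm_loss + alpha * excess.mean()

hidden = torch.randn(2, 8, 512, requires_grad=True)   # stand-in hidden states
W_enc = torch.randn(16384, 512)                       # stand-in SAE encoder weights
loss = sasft_loss(torch.tensor(2.3), hidden, W_enc, torch.tensor([3, 77]), target=1.0)
loss.backward()
print(loss.item())
```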
Tiny language models
Authors: Ronit D. Gross, Yarden Tzach, Tal Halevi, Ella Koresh, Ido Kanter
2025-07-20
http://arxiv.org/abs/2507.14871v2
A prominent achievement of natural language processing (NLP) is its ability
to understand and generate meaningful human language. This capability relies on
the complex feedforward transformer block architectures of pre-trained large
language models (LLMs). However, LLM pre-training is currently feasible only
for a few dominant companies due to the immense computational resources
for a few dominant companies due to the immense computational resources
required, limiting broader research participation. This creates a critical need
for more accessible alternatives. In this study, we explore whether tiny
language models (TLMs) exhibit the same key qualitative features of LLMs. We
demonstrate that TLMs exhibit a clear performance gap between pre-trained and
non-pre-trained models across classification tasks, indicating the
effectiveness of pre-training, even at a tiny scale. The performance gap
increases with the size of the pre-training dataset and with greater overlap
between tokens in the pre-training and classification datasets. Furthermore,
the classification accuracy achieved by a pre-trained deep TLM architecture can
be replicated through a soft committee of multiple, independently pre-trained
shallow architectures, enabling low-latency TLMs without affecting
classification accuracy. Our results are based on pre-training BERT-6 and
variants of BERT-1 on subsets of the Wikipedia dataset and evaluating their
performance on FewRel, AGNews, and DBPedia classification tasks. Future
research on TLMs is expected to further illuminate the mechanisms underlying
NLP, especially given that biologically inspired models suggest TLMs
may be sufficient for children or adolescents to develop language. The data and
code that support the findings of this study are openly available on
https://github.com/Rg32601/Tiny-Language-Models .
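The soft-committee result can be pictured with a minimal sketch: average the
output probabilities of several independently trained shallow classifiers.
The stand-in MLPs below substitute for the pre-trained BERT-1 variants used in
the paper.

```python
# Minimal "soft committee": average softmax outputs of several independently
# initialized/trained shallow models (stand-ins, not BERT-1 itself).
import torch
import torch.nn as nn

class ShallowClassifier(nn.Module):
    def __init__(self, dim=128, num_classes=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))

    def forward(self, x):
        return self.net(x)

committee = [ShallowClassifier() for _ in range(5)]   # independently trained members
x = torch.randn(3, 128)
probs = torch.stack([m(x).softmax(-1) for m in committee]).mean(0)  # soft vote
print(probs.argmax(-1))
```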
An Evaluation of DUSt3R/MASt3R/VGGT 3D Reconstruction on Photogrammetric Aerial Blocks
Authors: Xinyi Wu, Steven Landgraf, Markus Ulrich, Rongjun Qin
2025-07-20
http://arxiv.org/abs/2507.14798v1
State-of-the-art 3D computer vision algorithms continue to advance in
handling sparse, unordered image sets. Recently developed foundational models
for 3D reconstruction, such as Dense and Unconstrained Stereo 3D Reconstruction
(DUSt3R), Matching and Stereo 3D Reconstruction (MASt3R), and Visual Geometry
Grounded Transformer (VGGT), have attracted attention due to their ability to
handle very sparse image overlaps. Evaluating DUSt3R/MASt3R/VGGT on typical
aerial images matters, as these models may handle extremely low image overlaps,
stereo occlusions, and textureless regions. For redundant collections, they can
accelerate 3D reconstruction by using extremely sparsified image sets. Despite
tests on various computer vision benchmarks, their potential on photogrammetric
aerial blocks remains unexplored. This paper conducts a comprehensive
evaluation of the pre-trained DUSt3R/MASt3R/VGGT models on the aerial blocks of
the UseGeo dataset for pose estimation and dense 3D reconstruction. Results
show these methods can accurately reconstruct dense point clouds from very
sparse image sets (fewer than 10 images, up to 518 pixels resolution), with
completeness gains up to +50% over COLMAP. VGGT also demonstrates higher
computational efficiency, scalability, and more reliable camera pose
estimation. However, all exhibit limitations with high-resolution images and
large sets, as pose reliability declines with more images and geometric
complexity. These findings suggest that transformer-based methods cannot fully
replace traditional SfM and MVS, but offer promise as complementary approaches,
especially in challenging, low-resolution, and sparse scenarios.
LeAdQA LLM-Driven Context-Aware Temporal Grounding for Video Question Answering
Authors: Xinxin Dong, Baoyun Peng, Haokai Ma, Yufei Wang, Zixuan Dong, Fei Hu, Xiaodong Wang
2025-07-20
http://arxiv.org/abs/2507.14784v1
Video Question Answering (VideoQA) requires identifying critical
moments in long videos and reasoning about their causal relationships to answer
semantically complex questions. While recent advances in multimodal learning
have improved alignment and fusion, current approaches remain limited by two
prevalent but fundamentally flawed strategies: (1) task-agnostic sampling
indiscriminately processes all frames, overwhelming key events with irrelevant
content; and (2) heuristic retrieval captures superficial patterns but misses
causal-temporal structures needed for complex reasoning. To address these
challenges, we introduce LeAdQA, an innovative approach that bridges these gaps
through synergizing causal-aware query refinement with fine-grained visual
grounding. Our method first leverages LLMs to reformulate question-option
pairs, resolving causal ambiguities and sharpening temporal focus. These
refined queries subsequently direct a temporal grounding model to precisely
retrieve the most salient segments, complemented by an adaptive fusion
mechanism dynamically integrating the evidence to maximize relevance. The
integrated visual-textual cues are then processed by an MLLM to generate
accurate, contextually grounded answers. Experiments on NExT-QA, IntentQA, and
NExT-GQA demonstrate that our method's precise visual grounding substantially
enhances the understanding of video-question relationships, achieving
state-of-the-art (SOTA) performance on complex reasoning tasks while
maintaining computational efficiency.
CXR-TFT Multi-Modal Temporal Fusion Transformer for Predicting Chest X-ray Trajectories
Authors: Mehak Arora, Ayman Ali, Kaiyuan Wu, Carolyn Davis, Takashi Shimazui, Mahmoud Alwakeel, Victor Moas, Philip Yang, Annette Esper, Rishikesan Kamaleswaran
2025-07-19
http://arxiv.org/abs/2507.14766v1
In intensive care units (ICUs), patients with complex clinical conditions
require vigilant monitoring and prompt interventions. Chest X-rays (CXRs) are a
vital diagnostic tool, providing insights into clinical trajectories, but their
irregular acquisition limits their utility. Existing tools for CXR
interpretation are constrained by cross-sectional analysis, failing to capture
temporal dynamics. To address this, we introduce CXR-TFT, a novel multi-modal
framework that integrates temporally sparse CXR imaging and radiology reports
with high-frequency clinical data, such as vital signs, laboratory values, and
respiratory flow sheets, to predict the trajectory of CXR findings in
critically ill patients. CXR-TFT leverages latent embeddings from a vision
encoder that are temporally aligned with hourly clinical data through
interpolation. A transformer model is then trained to predict CXR embeddings at
each hour, conditioned on previous embeddings and clinical measurements. In a
retrospective study of 20,000 ICU patients, CXR-TFT demonstrated high accuracy
in forecasting abnormal CXR findings up to 12 hours before they became
radiographically evident. This predictive capability, anchored in
high-frequency clinical data, holds
significant potential for enhancing the management of time-sensitive conditions
like acute respiratory distress syndrome, where early intervention is crucial
and diagnoses are often delayed. By providing distinctive temporal resolution
in prognostic CXR analysis, CXR-TFT offers actionable 'whole patient' insights
that can directly improve clinical outcomes.
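A minimal sketch of the temporal-alignment step, assuming linear interpolation
of sparsely acquired CXR embeddings onto an hourly grid; the array shapes are
illustrative.

```python
# Interpolate sparse CXR embeddings onto an hourly grid so they can be fused
# with hourly clinical data; shapes and np.interp usage are assumptions.
import numpy as np

def align_hourly(cxr_times, cxr_embeds, horizon_hours):
    """cxr_times: sorted acquisition hours; cxr_embeds: (num_cxrs, d). Returns (horizon, d)."""
    grid = np.arange(horizon_hours)
    return np.stack([np.interp(grid, cxr_times, cxr_embeds[:, j])
                     for j in range(cxr_embeds.shape[1])], axis=1)

embeds = np.random.randn(3, 16)                  # CXRs taken at hours 0, 10, 30
aligned = align_hourly([0, 10, 30], embeds, 48)  # hourly embeddings over a 48 h stay
print(aligned.shape)                             # (48, 16), ready to concat with vitals/labs
```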
GRACE Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization
Authors: Luyi Ma, Wanjia Zhang, Kai Zhao, Abhishek Kulkarni, Lalitesh Morishetti, Anjana Ganesh, Ashish Ranjan, Aashika Padmanabhan, Jianpeng Xu, Jason Cho, Praveen Kanumala, Kaushiki Nag, Sumit Dutta, Kamiya Motwani, Malay Patel, Evren Korpeoglu, Sushant Kumar, Kannan Achan
2025-07-19
http://arxiv.org/abs/2507.14758v1
Generative models have recently demonstrated strong potential in
multi-behavior recommendation systems, leveraging the expressive power of
transformers and tokenization to generate personalized item sequences. However,
their adoption is hindered by (1) the lack of explicit information for token
reasoning, (2) high computational costs due to quadratic attention complexity
and dense sequence representations after tokenization, and (3) limited
multi-scale modeling over user history. In this work, we propose GRACE
(Generative Recommendation via journey-aware Sparse Attention on
Chain-of-thought tokEnization), a novel generative framework for multi-behavior
sequential recommendation. GRACE introduces a hybrid Chain-of-Thought (CoT)
tokenization method that encodes user-item interactions with explicit
attributes from product knowledge graphs (e.g., category, brand, price) over
semantic tokenization, enabling interpretable and behavior-aligned generation.
To address the inefficiency of standard attention, we design a Journey-Aware
Sparse Attention (JSA) mechanism, which selectively attends to compressed,
intra-, inter-, and current-context segments in the tokenized sequence.
Experiments on two real-world datasets show that GRACE significantly
outperforms state-of-the-art baselines, achieving up to +106.9% HR@10 and
+106.7% NDCG@10 improvement over the state-of-the-art baseline on the Home
domain, and +22.1% HR@10 on the Electronics domain. GRACE also reduces
attention computation by up to 48% with long sequences.
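A sketch of what a journey-aware sparse attention mask might look like: every
token attends to a few compressed summary slots, its own segment, and a
current-context tail. The segment layout and budgets are assumptions, and
causal masking is omitted for brevity.

```python
# Build a block-sparse boolean attention mask over summary slots, intra-segment
# links, and a current-context window; all budgets are illustrative.
import torch

def jsa_mask(seg_ids, n_summary=4, current_tail=8):
    """seg_ids: (T,) segment id per token. Returns (T+n_summary, T+n_summary) bool mask."""
    T = seg_ids.shape[0]
    N = T + n_summary
    allow = torch.zeros(N, N, dtype=torch.bool)
    allow[:, :n_summary] = True                                # everyone sees summary slots
    same = seg_ids[:, None] == seg_ids[None, :]                # intra-segment attention
    allow[n_summary:, n_summary:] = same
    allow[:, N - current_tail:] = True                         # current-context window
    return allow

seg = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3])
mask = jsa_mask(seg)
print(mask.shape, mask.float().mean().item())                  # density well below 1.0
```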
Spatial-Temporal Transformer with Curriculum Learning for EEG-Based Emotion Recognition
Authors: Xuetao Lin, Tianhao Peng, Peihong Dai, Yu Liang, Wenjun Wu
2025-07-19
http://arxiv.org/abs/2507.14698v1
EEG-based emotion recognition plays an important role in developing adaptive
brain-computer interface systems, yet faces two fundamental challenges in
practical implementations: (1) effective integration of non-stationary
spatial-temporal neural patterns, and (2) robust adaptation to dynamic
emotional intensity variations in real-world scenarios. This paper proposes
SST-CL, a novel framework integrating spatial-temporal transformers with
curriculum
learning. Our method introduces two core components: a spatial encoder that
models inter-channel relationships and a temporal encoder that captures
multi-scale dependencies through windowed attention mechanisms, enabling
simultaneous extraction of spatial correlations and temporal dynamics from EEG
signals. Complementing this architecture, an intensity-aware curriculum
learning strategy progressively guides training from high-intensity to
low-intensity emotional states through dynamic sample scheduling based on a
dual difficulty assessment. Comprehensive experiments on three benchmark
datasets demonstrate state-of-the-art performance across various emotional
intensity levels, with ablation studies confirming the necessity of both
architectural components and the curriculum learning mechanism.
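A minimal sketch of an intensity-aware curriculum, assuming a linear pacing
function that exposes high-intensity (easier) samples first; the pacing and
sampling rules are assumptions, not the paper's exact scheduler.

```python
# Intensity-aware curriculum sampler: early epochs draw from the easiest
# (highest-intensity) pool, which grows as training progresses.
import random

def curriculum_batch(samples, epoch, total_epochs, batch_size=32):
    """samples: list of (x, y, intensity) with intensity in [0, 1]; higher = easier."""
    frac = min(1.0, (epoch + 1) / (0.7 * total_epochs))    # pacing: fraction of data exposed
    ordered = sorted(samples, key=lambda s: -s[2])          # high intensity first
    pool = ordered[: max(batch_size, int(frac * len(ordered)))]
    return random.sample(pool, batch_size)

data = [(None, None, random.random()) for _ in range(1000)]
early = curriculum_batch(data, epoch=0, total_epochs=50)
late = curriculum_batch(data, epoch=49, total_epochs=50)
print(sum(s[2] for s in early) / 32, sum(s[2] for s in late) / 32)  # mean intensity drops
```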
Enabling Efficient Hardware Acceleration of Hybrid Vision Transformer (ViT) Networks at the Edge
Authors: Joren Dumoulin, Pouya Houshmand, Vikram Jain, Marian Verhelst
2025-07-19
http://arxiv.org/abs/2507.14651v1
Hybrid vision transformers combine the elements of conventional neural
networks (NN) and vision transformers (ViT) to enable lightweight and accurate
detection. However, several challenges remain for their efficient deployment on
resource-constrained edge devices. The hybrid models suffer from a widely
diverse set of NN layer types and large intermediate data tensors, hampering
efficient hardware acceleration. To enable their execution at the edge, this
paper proposes innovations across the hardware-scheduling stack: a.) At the
lowest level, a configurable PE array supports all hybrid ViT layer types; b.)
temporal loop re-ordering within one layer enables hardware support for
normalization and softmax layers, minimizing on-chip data transfers; c.)
further scheduling optimization employs layer fusion across inverted bottleneck
layers to drastically reduce off-chip memory transfers. The resulting
accelerator is implemented in 28nm CMOS, achieving a peak energy efficiency of
1.39 TOPS/W at 25.6 GMACs/s.
Linear Relational Decoding of Morphology in Language Models
Authors: Eric Xia, Jugal Kalita
2025-07-19
http://arxiv.org/abs/2507.14640v1
A two-part affine approximation has been found to be a good approximation for
transformer computations over certain subject-object relations. Adapting the
Bigger Analogy Test Set, we show that the linear transformation Ws, where s is
a middle layer representation of a subject token and W is derived from model
derivatives, is also able to accurately reproduce final object states for many
relations. This linear technique achieves 90% faithfulness on
morphological relations, and we show similar findings multilingually and
across models. Our findings indicate that some conceptual relationships in
language models, such as morphology, are readily interpretable from latent
space, and are sparsely encoded by cross-layer linear transformations.
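A toy demonstration of the two-part affine map o ≈ Ws + b, with W taken from
model derivatives (the Jacobian at a reference input); a small MLP stands in
for the subject-to-object computation of a real language model.

```python
# Affine approximation of a nonlinear map: W = Jacobian at a reference point,
# b chosen so the map is exact there; the network is a stand-in.
import torch

torch.manual_seed(0)
f = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.Tanh(), torch.nn.Linear(32, 32))

s0 = torch.randn(32)                                  # reference subject representation
W = torch.autograd.functional.jacobian(f, s0)         # (32, 32) local linear map
b = f(s0) - W @ s0                                    # bias makes the affine map exact at s0

s = s0 + 0.05 * torch.randn(32)                       # nearby subject representation
approx, exact = W @ s + b, f(s)
print(torch.nn.functional.cosine_similarity(approx, exact, dim=0).item())  # close to 1
```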
KinForm Kinetics Informed Feature Optimised Representation Models for Enzyme kcat and KM Prediction
Authors: Saleh Alwer, Ronan Fleming
2025-07-19
http://arxiv.org/abs/2507.14639v1
Kinetic parameters such as the turnover number (kcat) and the Michaelis
constant (KM) are essential for modelling enzymatic activity, but
experimental data remain limited in scale and diversity. Previous methods for
predicting enzyme kinetics typically use mean-pooled residue embeddings from a
single protein language model to represent the protein. We present KinForm, a
machine learning framework designed to improve predictive accuracy and
generalisation for kinetic parameters by optimising protein feature
representations. KinForm combines several residue-level embeddings
(Evolutionary Scale Modeling Cambrian, Evolutionary Scale Modeling 2, and
ProtT5-XL-UniRef50), taken from empirically selected intermediate transformer
layers, and applies weighted pooling based on per-residue binding-site
probability. To counter the resulting high dimensionality, we apply
dimensionality reduction using principal component analysis (PCA) on
concatenated protein features, and rebalance the training data via a
similarity-based oversampling strategy. KinForm outperforms baseline methods on
two benchmark datasets. Improvements are most pronounced in low sequence
similarity bins. We observe improvements from binding-site probability pooling,
intermediate-layer selection, PCA, and oversampling of low-identity proteins.
We also find that removing sequence overlap between folds provides a more
realistic evaluation of generalisation and should be the standard over random
splitting when benchmarking kinetic prediction models.
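A sketch of the pooling and reduction steps, assuming per-residue embeddings
and binding-site probabilities are already available; embedding sizes and the
weight normalization are assumptions.

```python
# Binding-site-probability-weighted pooling of residue embeddings, followed by
# PCA on the pooled protein features; dimensions are illustrative.
import numpy as np
from sklearn.decomposition import PCA

def weighted_pool(residue_embeds, site_probs):
    """residue_embeds: (L, d); site_probs: (L,) binding-site probabilities."""
    w = site_probs / (site_probs.sum() + 1e-8)
    return (w[:, None] * residue_embeds).sum(axis=0)          # (d,) protein-level vector

rng = np.random.default_rng(0)
proteins = [weighted_pool(rng.normal(size=(200, 1280)), rng.uniform(size=200))
            for _ in range(100)]
X = np.stack(proteins)                                        # (100, 1280) pooled features
X_red = PCA(n_components=50).fit_transform(X)                 # dimensionality reduction
print(X_red.shape)
```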
Agentic Satellite-Augmented Low-Altitude Economy and Terrestrial Networks A Survey on Generative Approaches
Authors: Xiaozheng Gao, Yichen Wang, Bosen Liu, Xiao Zhou, Ruichen Zhang, Jiacheng Wang, Dusit Niyato, Dong In Kim, Abbas Jamalipour, Chau Yuen, Jianping An, Kai Yang
2025-07-19
http://arxiv.org/abs/2507.14633v1
The development of satellite-augmented low-altitude economy and terrestrial
networks (SLAETNs) demands intelligent and autonomous systems that can operate
reliably across heterogeneous, dynamic, and mission-critical environments. To
address these challenges, this survey focuses on enabling agentic artificial
intelligence (AI), that is, artificial agents capable of perceiving, reasoning,
and acting, through generative AI (GAI) and large language models (LLMs). We
begin by introducing the architecture and characteristics of SLAETNs, and
analyzing the challenges that arise in integrating satellite, aerial, and
terrestrial components. Then, we present a model-driven foundation by
systematically reviewing five major categories of generative models:
variational autoencoders (VAEs), generative adversarial networks (GANs),
generative diffusion models (GDMs), transformer-based models (TBMs), and
LLMs.
Moreover, we provide a comparative analysis to highlight their generative
mechanisms, capabilities, and deployment trade-offs within SLAETNs. Building on
this foundation, we examine how these models empower agentic functions across
three domains: communication enhancement, security and privacy protection, and
intelligent satellite tasks. Finally, we outline key future directions for
building scalable, adaptive, and trustworthy generative agents in SLAETNs. This
survey aims to provide a unified understanding and actionable reference for
advancing agentic AI in next-generation integrated networks.
Efficient LLM Inference Bandwidth, Compute, Synchronization, and Capacity are all you need
Authors: Michael Davies, Neal Crago, Karthikeyan Sankaralingam, Christos Kozyrakis
2025-07-18
http://arxiv.org/abs/2507.14397v1
This paper presents a limit study of transformer-based large language model
(LLM) inference, focusing on the fundamental performance bottlenecks imposed by
memory bandwidth, memory capacity, and synchronization overhead in distributed
inference systems. We develop a hardware-agnostic performance model that
abstracts away implementation details, enabling the analysis of a wide range of
current and near-future hardware technologies. Our analysis spans from current
HBM3 memory technology used in AI accelerators like GPUs and TPUs to systems
based on advanced HBM4 and advanced 3D-stacked DRAM technology. It also covers
SRAM-based designs and scaling techniques from distributed clusters with
varying numbers of chips to wafer-scale integration. Our key findings for
auto-regressive decoding are: i) serving LLMs requires 100s of GB per server to
serve a model instance; ii) high memory bandwidth is critical for high per-user
throughput; iii) exposed synchronization latencies for collective
communication must be around 1 us, else they render the memory bandwidth
ineffective; iv) DRAM-based designs have a fundamental advantage in terms of
system-level efficiency as measured in throughput per cost or watt; and v)
hardware designs can easily reach 2000+ user token/sec but getting to 10,000+
tokens/sec will need smaller models, smaller context, or other forms of
algorithmic advances. This study provides valuable insights into the
fundamental performance limits of LLM inference, highlighting the potential
benefits of future hardware advancements and guiding the optimization of
deployment strategies.
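The bandwidth-bound decoding regime can be illustrated with a
back-of-the-envelope calculator; the hardware numbers below (fp16 weights,
~3.35 TB/s per chip, 512 chips) are illustrative assumptions, not the paper's
exact model.

```python
# Toy model of bandwidth-bound auto-regressive decoding with an exposed
# synchronization cost per step; all numbers are illustrative.
def decode_tokens_per_sec(param_bytes, mem_bw_bytes_per_s, n_chips, sync_latency_s):
    """Each decoded token must stream the (sharded) weights once per chip."""
    t_mem = (param_bytes / n_chips) / mem_bw_bytes_per_s   # per-chip weight-read time
    return 1.0 / (t_mem + sync_latency_s)                  # sync adds to every step

weights = 70e9 * 2                                         # 70B params in fp16
for sync in (1e-6, 10e-6, 100e-6):                         # exposed collective latency
    tps = decode_tokens_per_sec(weights, 3.35e12, 512, sync)
    print(f"sync={sync*1e6:5.0f} us -> {tps:8,.0f} tok/s")
```

At this scale the weight-read time is about 82 us per step, so a 100 us
synchronization latency roughly halves throughput while a 1 us latency is
negligible, echoing finding iii) above.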
Characterizing Communication Patterns in Distributed Large Language Model Inference
Authors: Lang Xu, Kaushik Kandadi Suresh, Quentin Anthony, Nawras Alnaasan, Dhabaleswar K. Panda
2025-07-18
http://arxiv.org/abs/2507.14392v1
Large Language Models (LLMs) built on transformer architectures have
transformed natural language processing, achieving remarkable performance
across diverse applications. While distributed inference frameworks enable
practical deployment of these models, inter-GPU communication creates
significant performance constraints that limit service quality in real-world
systems. This paper investigates communication dynamics in distributed LLM
serving, analyzing how various parallelization approaches coordinate data
exchange between GPU workers during inference. We study dense transformer-based
models as representative examples of contemporary architectures widely used in
operational deployments. Our work combines detailed profiling measurements with
predictive analytical models to characterize communication behavior across
different parallelization configurations. Results show that tensor parallelism
incurs substantial network overhead but delivers superior response times for
brief sequences, pipeline parallelism minimizes data transfer requirements
while increasing total latency, and combined approaches demand careful tuning
to achieve balanced performance. These insights offer practical recommendations
for selecting appropriate parallelization schemes in production LLM services
and identify key opportunities for optimizing inference frameworks and
infrastructure.
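The tensor- versus pipeline-parallel trade-off can be approximated with
textbook communication-volume estimates; the formulas and 70B-class dimensions
below are assumptions, not the authors' profiling model.

```python
# Rough per-token communication volume for tensor parallelism (TP) vs pipeline
# parallelism (PP) during decode; textbook estimates, not measured values.
def tp_bytes_per_token(hidden, layers, tp_degree, dtype_bytes=2):
    # TP: two all-reduces per layer, each moving ~2*(tp-1)/tp of one activation
    per_allreduce = 2 * (tp_degree - 1) / tp_degree * hidden * dtype_bytes
    return layers * 2 * per_allreduce

def pp_bytes_per_token(hidden, pp_degree, dtype_bytes=2):
    # PP: one point-to-point activation transfer per stage boundary
    return (pp_degree - 1) * hidden * dtype_bytes

h, L = 8192, 80                                    # assumed 70B-class dimensions
print("TP8:", tp_bytes_per_token(h, L, 8) / 1e6, "MB/token")
print("PP8:", pp_bytes_per_token(h, 8) / 1e6, "MB/token")
```

The orders-of-magnitude gap (a few MB versus ~0.1 MB per token) matches the
qualitative finding that pipeline parallelism minimizes data transfer while
tensor parallelism pays substantial network overhead for lower latency.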
DPMT Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration
Authors: Xiyun Li, Yining Ding, Yuhua Jiang, Yunlong Zhao, Runpeng Xie, Shuang Xu, Yuanhua Ni, Yiqin Yang, Bo Xu
2025-07-18
http://arxiv.org/abs/2507.14088v1
Real-time human-artificial intelligence (AI) collaboration is crucial yet
challenging, especially when AI agents must adapt to diverse and unseen human
behaviors in dynamic scenarios. Existing large language model (LLM) agents
often fail to accurately model complex human mental characteristics such as
domain intentions, especially in the absence of direct communication. To
address this limitation, we propose a novel dual process multi-scale theory of
mind (DPMT) framework, drawing inspiration from cognitive science dual process
theory. Our DPMT framework incorporates a multi-scale theory of mind (ToM)
module to facilitate robust human partner modeling through mental
characteristic reasoning. Experimental results demonstrate that DPMT
significantly enhances human-AI collaboration, and ablation studies further
validate the contributions of our multi-scale ToM in the slow system.
KROMA Ontology Matching with Knowledge Retrieval and Large Language Models
Authors: Lam Nguyen, Erika Barcelos, Roger French, Yinghui Wu
2025-07-18
http://arxiv.org/abs/2507.14032v1
Ontology Matching (OM) is a cornerstone task of semantic interoperability,
yet existing systems often rely on handcrafted rules or specialized models with
limited adaptability. We present KROMA, a novel OM framework that harnesses
Large Language Models (LLMs) within a Retrieval-Augmented Generation (RAG)
pipeline to dynamically enrich the semantic context of OM tasks with
structural, lexical, and definitional knowledge. To optimize both performance
and efficiency, KROMA integrates bisimilarity-based concept matching and a
lightweight ontology refinement step, which prune candidate concepts and
substantially reduce the communication overhead of invoking LLMs. Through
experiments on multiple benchmark datasets, we show that integrating knowledge
retrieval with context-augmented LLMs significantly enhances ontology matching,
outperforming both classic OM systems and cutting-edge LLM-based approaches
while keeping communication overhead comparable. Our study highlights the
feasibility and benefit of the proposed optimization techniques (targeted
knowledge retrieval, prompt enrichment, and ontology refinement) for ontology
matching at scale.
LoopServe An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues
Authors: Haoyang Li, Zhanchao Xu, Yiming Li, Xuejia Chen, Darian Li, Anxin Tian, Qingfa Xiao, Cheng Deng, Jun Wang, Qing Li, Lei Chen, Mingxuan Yuan
2025-07-18
http://arxiv.org/abs/2507.13681v1
Multi-turn dialogues are essential in many real-world applications of large
language models, such as chatbots and virtual assistants. As conversation
histories become longer, existing large language models face increasing
computational and memory challenges, which hinder their ability to provide
efficient and responsive interactions. Most current methods either
compress the context or optimize key value caching, but they often rely on
fixed or position-based heuristics that do not adapt well to the dynamic and
unpredictable patterns found in actual multi-turn conversations. In this paper,
we present LoopServe, an adaptive dual-phase LLM inference acceleration
framework
for large language models in multi-turn dialogues. LoopServe introduces two
main innovations. First, it performs online sparsification during the
prefilling phase by dynamically selecting the most important parts of the
attention matrix for each new input. Second, it uses progressive key value
compression during decoding by adaptively maintaining a relevant and efficient
cache based on the most recently generated output tokens. We also propose a
new benchmark
(https://huggingface.co/datasets/TreeAILab/Multi-turn_Long-context_Benchmark_for_LLMs)
with eleven multi-turn datasets that reflect realistic query positions and
conversational dependencies. Extensive experiments demonstrate
that LoopServe consistently achieves superior effectiveness compared to
existing baselines and significantly accelerates LLM inference across a wide
range of long-context dialogue tasks.
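A hedged sketch of progressive KV compression in this spirit: score cached
entries by the attention mass that recent output tokens place on them, then
keep the top-scoring entries plus a recency window. The budget and scoring
heuristic are assumptions, not LoopServe's exact algorithm.

```python
# Attention-guided KV cache compression: keep entries most attended by recent
# queries plus a recency window; budgets/heuristics are assumptions.
import torch

def compress_kv(keys, values, recent_queries, budget=256, keep_recent=32):
    """keys/values: (T, d); recent_queries: (q, d) from the latest decoded tokens."""
    T = keys.shape[0]
    scores = (recent_queries @ keys.T).softmax(-1).sum(0)     # aggregate attention mass
    scores[-keep_recent:] = float("inf")                      # always keep a recency window
    keep = torch.topk(scores, min(budget, T)).indices.sort().values
    return keys[keep], values[keep]

K, V = torch.randn(4096, 128), torch.randn(4096, 128)         # stand-in cache
q = torch.randn(8, 128)                                       # recent decode queries
K2, V2 = compress_kv(K, V, q)
print(K2.shape)  # torch.Size([256, 128])
```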