2025-07-25
Table of Contents
- Explainable Mapper Charting LLM Embedding Spaces Using Perturbation-Based Explanation and Verification Agents
- Not All Features Deserve Attention Graph-Guided Dependency Learning for Tabular Data Generation with Language Models
- Enhanced Velocity-Adaptive Scheme Joint Fair Access and Age of Information Optimization in Vehicular Networks
- StyleAdaptedLM Enhancing Instruction Following Models with Efficient Stylistic Transfer
- Assemble Your Crew Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation
- Prune&Comp Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation
- SpecASR Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding
- ICWLM A Multi-Task Wireless Large Model via In-Context Learning
- NeuralDB Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database
- Who Attacks, and Why? Using LLMs to Identify Negative Campaigning in 18M Tweets across 19 Countries
- R-Stitch Dynamic Trajectory Stitching for Efficient Reasoning
- EFS Evolutionary Factor Searching for Sparse Portfolio Optimization Using Large Language Models
- LLM Meets the Sky Heuristic Multi-Agent Reinforcement Learning for Secure Heterogeneous UAV Networks
- Resilient Multi-Agent Negotiation for Medical Supply Chains Integrating LLMs and Blockchain for Transparent Coordination
- Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models
- LoRA is All You Need for Safety Alignment of Reasoning LLMs
- Parallelism Meets Adaptiveness Scalable Documents Understanding in Multi-Agent LLM Systems
- Beyond Context Limits Subconscious Threads for Long-Horizon Reasoning
- Collaborative Inference and Learning between Edge SLMs and Cloud LLMs A Survey of Algorithms, Execution, and Open Challenges
- ACT Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training
- Mamba-OTR a Mamba-based Solution for Online Take and Release Detection from Untrimmed Egocentric Video
- CompLeak Deep Learning Model Compression Exacerbates Privacy Leakage
- Time to Split Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders
- Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training
- Benchmarking LLM Privacy Recognition for Social Robot Decision Making
- TorchAO PyTorch-Native Training-to-Serving Model Optimization
- On the transferability of Sparse Autoencoders for interpreting compressed models
- Just Ask for Music (JAM) Multimodal and Personalized Natural Language Music Recommendation
- Reservoir Computing as a Language Model
- Who Leads in the Shadows? ERGM and Centrality Analysis of Congressional Democrats on Bluesky
- Transformer-based Deep Learning Model for Joint Routing and Scheduling with Varying Electric Vehicle Numbers
- Metaphor and Large Language Models When Surface Features Matter More than Deep Understanding
- Scaling Decentralized Learning with FLock
- IM-Chat A Multi-agent LLM-based Framework for Knowledge Transfer in Injection Molding Industry
- CHADET Cross-Hierarchical-Attention for Depth-Completion Using Unsupervised Lightweight Transformer
- Beyond Visual Line of Sight UAVs with Edge AI, Connected LLMs, and VR for Autonomous Aerial Intelligence
- From Neurons to Semantics Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment
- Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs
- Tiny language models
- An Evaluation of DUSt3R/MASt3R/VGGT 3D Reconstruction on Photogrammetric Aerial Blocks
- LeAdQA LLM-Driven Context-Aware Temporal Grounding for Video Question Answering
- CXR-TFT Multi-Modal Temporal Fusion Transformer for Predicting Chest X-ray Trajectories
- GRACE Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization
- Spatial-Temporal Transformer with Curriculum Learning for EEG-Based Emotion Recognition
- Enabling Efficient Hardware Acceleration of Hybrid Vision Transformer (ViT) Networks at the Edge
- Linear Relational Decoding of Morphology in Language Models
- KinForm Kinetics Informed Feature Optimised Representation Models for Enzyme kcat and KM Prediction
- Agentic Satellite-Augmented Low-Altitude Economy and Terrestrial Networks A Survey on Generative Approaches
- Efficient LLM Inference Bandwidth, Compute, Synchronization, and Capacity are all you need
- Characterizing Communication Patterns in Distributed Large Language Model Inference
- DPMT Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration
- KROMA Ontology Matching with Knowledge Retrieval and Large Language Models
- LoopServe An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues
Explainable Mapper Charting LLM Embedding Spaces Using Perturbation-Based Explanation and Verification Agents
Authors: Xinyuan Yan, Rita Sevastjanova, Sinie van der Ben, Mennatallah El-Assady, Bei Wang
2025-07-24
http://arxiv.org/abs/2507.18607v1
Large language models (LLMs) produce high-dimensional embeddings that capture
rich semantic and syntactic relationships between words, sentences, and
concepts. Investigating the topological structures of LLM embedding spaces via
mapper graphs enables us to understand their underlying structures.
Specifically, a mapper graph summarizes the topological structure of the
embedding space, where each node represents a topological neighborhood
(containing a cluster of embeddings), and an edge connects two nodes if their
corresponding neighborhoods overlap. However, manually exploring these
embedding spaces to uncover encoded linguistic properties requires considerable
human effort. To address this challenge, we introduce a framework for
semi-automatic annotation of these embedding properties. To organize the
exploration process, we first define a taxonomy of explorable elements within a
mapper graph such as nodes, edges, paths, components, and trajectories. The
annotation of these elements is executed through two types of customizable
LLM-based agents that employ perturbation techniques for scalable and automated
analysis. These agents help to explore and explain the characteristics of
mapper elements and verify the robustness of the generated explanations. We
instantiate the framework within a visual analytics workspace and demonstrate
its effectiveness through case studies. In particular, we replicate findings
from prior research on BERT's embedding properties across various layers of its
architecture and provide further observations into the linguistic properties of
topological neighborhoods.
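To make the node and edge definitions above concrete, here is a minimal, dependency-light sketch of mapper-graph construction, assuming a 1-D lens function and a crude single-linkage clustering step; all names and parameters are illustrative, not the paper's implementation.
```python
# Mapper sketch: overlapping cover intervals over a lens, clustering within
# each preimage, and an edge whenever two neighborhoods share embeddings.
import numpy as np
from itertools import combinations

def mapper_graph(X, lens, n_intervals=10, overlap=0.3, eps=0.5):
    f = lens(X)                                   # 1-D lens values, shape (n,)
    lo, hi = f.min(), f.max()
    width = (hi - lo) / n_intervals
    step = width * (1 - overlap)
    nodes, start = [], lo
    while start < hi:                             # overlapping cover intervals
        idx = np.where((f >= start) & (f <= start + width))[0]
        nodes.extend(_cluster(X, idx, eps))       # clusters = graph nodes
        start += step
    # edge whenever two topological neighborhoods share an embedding
    edges = [(i, j) for i, j in combinations(range(len(nodes)), 2)
             if nodes[i] & nodes[j]]
    return nodes, edges

def _cluster(X, idx, eps):
    """Crude single-linkage clustering inside one preimage."""
    remaining = set(idx.tolist())
    while remaining:
        seed = remaining.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            p = frontier.pop()
            near = [q for q in remaining if np.linalg.norm(X[p] - X[q]) < eps]
            for q in near:
                remaining.discard(q); cluster.add(q); frontier.append(q)
        yield cluster

X = np.random.randn(200, 8)                       # stand-in embeddings
nodes, edges = mapper_graph(X, lens=lambda Z: Z[:, 0])
```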
Not All Features Deserve Attention Graph-Guided Dependency Learning for Tabular Data Generation with Language Models
Authors: Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci
2025-07-24
http://arxiv.org/abs/2507.18504v1
Large Language Models (LLMs) have shown strong potential for tabular data
generation by modeling textualized feature-value pairs. However, tabular data
inherently exhibits sparse feature-level dependencies, where many feature
interactions are structurally insignificant. This creates a fundamental
mismatch, as LLMs' self-attention mechanism inevitably distributes focus across
all pairs, diluting attention on critical relationships, particularly in
datasets with complex dependencies or semantically ambiguous features. To
address this limitation, we propose GraDe (Graph-Guided Dependency Learning), a
novel method that explicitly integrates sparse dependency graphs into LLMs'
attention mechanism. GraDe employs a lightweight dynamic graph learning module
guided by externally extracted functional dependencies, prioritizing key
feature interactions while suppressing irrelevant ones. Our experiments across
diverse real-world datasets demonstrate that GraDe outperforms existing
LLM-based approaches by up to 12% on complex datasets while achieving
competitive results with state-of-the-art approaches in synthetic data quality.
Our method is minimally intrusive yet effective, offering a practical solution
for structure-aware tabular data modeling with LLMs.
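A rough sketch of the core idea, assuming the dependency graph enters as an additive attention bias; the -1e9 masking constant and the scalar gate are expository assumptions, not GraDe's actual parameterization.
```python
# Bias self-attention toward edges of an externally extracted
# feature-dependency graph, suppressing structurally insignificant pairs.
import numpy as np

def dependency_biased_attention(Q, K, V, adj, gate=1.0):
    """Q, K, V: (n_features, d); adj: (n, n) 0/1 dependency graph."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    bias = np.where(adj > 0, 0.0, -1e9)       # suppress non-dependent pairs
    scores = scores + gate * bias             # gate could be learned per layer
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

n, d = 5, 16
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(n, d))
adj = np.eye(n); adj[0, 1] = adj[1, 0] = 1    # toy functional dependency
out = dependency_biased_attention(Q, K, V, adj)
```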
Enhanced Velocity-Adaptive Scheme Joint Fair Access and Age of Information Optimization in Vehicular Networks
Authors: Xiao Xu, Qiong Wu, Pingyi Fan, Kezhi Wang, Nan Cheng, Wen Chen, Khaled B. Letaief
2025-07-24
http://arxiv.org/abs/2507.18328v1
In this paper, we consider the fair access problem and the Age of Information
(AoI) under 5G New Radio (NR) Vehicle-to-Infrastructure (V2I) Mode 2 in
vehicular networks. Specifically, vehicles follow Mode 2 to communicate with
Roadside Units (RSUs) to obtain accurate data for driving assistance.
Nevertheless, vehicles often have different velocities when they are moving in
adjacent lanes, leading to differences in RSU dwell time and communication
duration. This results in unfair access to network resources,
potentially influencing driving safety. To ensure the freshness of received
data, the AoI should be analyzed. Mode 2 introduces a novel preemption
mechanism, necessitating simultaneous optimization of fair access and AoI to
guarantee timely and relevant data delivery. We propose a joint optimization
framework for vehicular networks, defining a fairness index and employing
Stochastic Hybrid Systems (SHS) to model AoI under the preemption mechanism. By
adaptively adjusting the selection window of Semi-Persistent Scheduling (SPS)
in Mode 2, we address the optimization of fairness and AoI. We apply a large
language model (LLM)-based Multi-objective Evolutionary Algorithm Based on
Decomposition (MOEA/D) to solve this problem. Simulation results demonstrate
the effectiveness of our scheme in balancing fair access and minimizing AoI.
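The abstract defines a fairness index without giving its form; Jain's index is the standard choice for this kind of access-fairness measurement and is shown below purely as an illustration.
```python
# Jain's fairness index over per-vehicle resource grants:
# 1.0 = perfectly fair; 1/n = one vehicle gets everything.
def jain_fairness(allocations):
    n = len(allocations)
    s, sq = sum(allocations), sum(x * x for x in allocations)
    return (s * s) / (n * sq) if sq else 0.0

# dwell-time-weighted resource grants for vehicles in adjacent lanes
print(jain_fairness([4, 4, 4, 4]))   # 1.0
print(jain_fairness([9, 1, 1, 1]))   # ~0.43
```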
StyleAdaptedLM Enhancing Instruction Following Models with Efficient Stylistic Transfer
Authors: Pritika Ramu, Apoorv Saxena, Meghanath M Y, Varsha Sankar, Debraj Basu
2025-07-24
http://arxiv.org/abs/2507.18294v1
Adapting LLMs to specific stylistic characteristics, like brand voice or
authorial tones, is crucial for enterprise communication but challenging to
achieve from corpora that lack instruction-response formatting without
compromising instruction adherence. We introduce StyleAdaptedLM, a framework
that efficiently transfers stylistic traits to instruction-following models
using Low-Rank Adaptation (LoRA). LoRA adapters are first trained on a base
model with diverse unstructured stylistic corpora, then merged with a separate
instruction-following model. This enables robust stylistic customization
without paired data or sacrificing task performance. Experiments across
multiple datasets and models demonstrate improved stylistic consistency while
preserving instruction adherence, with human evaluations confirming
brand-specific convention uptake. StyleAdaptedLM offers an efficient path for
stylistic personalization in LLMs.
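For reference, the LoRA merge step follows the usual W' = W + (alpha/r)BA convention; the numpy sketch below is illustrative of that arithmetic, not the paper's code, and the idea of applying style adapters trained on a base model to a separate instruction-tuned checkpoint is the part that comes from the paper.
```python
# Merge a LoRA adapter (trained on a base model) into a separate
# instruction-following model's weight matrix.
import numpy as np

def merge_lora(W_instruct, A, B, alpha):
    r = A.shape[0]                             # LoRA rank
    return W_instruct + (alpha / r) * (B @ A)  # low-rank style delta

d_out, d_in, r = 64, 64, 8
W = np.random.randn(d_out, d_in) * 0.02        # instruction model weight
A = np.random.randn(r, d_in) * 0.02            # style adapter factor A
B = np.random.randn(d_out, r) * 0.02           # style adapter factor B
W_styled = merge_lora(W, A, B, alpha=16)
```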
Assemble Your Crew Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation
Authors: Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, Shirui Pan
2025-07-24
http://arxiv.org/abs/2507.18224v1
Multi-agent systems (MAS) based on large language models (LLMs) have emerged
as a powerful solution for dealing with complex problems across diverse
domains. The effectiveness of MAS is critically dependent on its collaboration
topology, which has become a focal point for automated design research.
However, existing approaches are fundamentally constrained by their reliance on
a template graph modification paradigm with a predefined set of agents and
hard-coded interaction structures, significantly limiting their adaptability to
task-specific requirements. To address these limitations, we reframe MAS design
as a conditional autoregressive graph generation task, where both the system
composition and structure are designed jointly. We propose ARG-Designer, a
novel autoregressive model that operationalizes this paradigm by constructing
the collaboration graph from scratch. Conditioned on a natural language task
query, ARG-Designer sequentially and dynamically determines the required number
of agents, selects their appropriate roles from an extensible pool, and
establishes the optimal communication links between them. This generative
approach creates a customized topology in a flexible and extensible manner,
precisely tailored to the unique demands of different tasks. Extensive
experiments across six diverse benchmarks demonstrate that ARG-Designer not
only achieves state-of-the-art performance but also enjoys significantly
greater token efficiency and enhanced extensibility. The source code of
ARG-Designer is available at https://github.com/Shiy-Li/ARG-Designer.
Prune&Comp Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation
Authors: Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, Chun Yuan
2025-07-24
http://arxiv.org/abs/2507.18212v1
Layer pruning has emerged as a promising technique for compressing large
language models (LLMs) while achieving acceleration proportional to the pruning
ratio. In this work, we identify that removing any layer induces a significant
magnitude gap in hidden states, resulting in substantial performance
degradation. To address this issue, we propose Prune&Comp, a novel
plug-and-play layer pruning scheme that leverages magnitude compensation to
mitigate such gaps in a training-free manner. Specifically, we first estimate
the magnitude gap caused by layer removal and then eliminate this gap by
rescaling the remaining weights offline, with zero runtime overhead incurred.
We further demonstrate the advantages of Prune&Comp through an iterative
strategy. When integrated with an iterative prune-and-compensate loop,
Prune&Comp consistently enhances existing layer pruning metrics. For instance,
when 5 layers of LLaMA-3-8B are pruned using the prevalent block influence
metric, Prune&Comp nearly halves the perplexity and retains 93.19% of the
original model's question-answering performance, outperforming the baseline by
4.01%.
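A minimal sketch of magnitude compensation under one plausible reading: estimate the hidden-state magnitude gap on a small calibration set and rescale the following layer's weights offline. Where the rescaling is applied (the next layer's input projection here) is an assumption, not necessarily Prune&Comp's choice.
```python
# Estimate the magnitude gap caused by layer removal, then fold the
# correction into the remaining weights offline (zero runtime overhead).
import numpy as np

def magnitude_gap(h_with_layer, h_without_layer):
    return np.linalg.norm(h_with_layer, axis=-1).mean() / \
           np.linalg.norm(h_without_layer, axis=-1).mean()

def compensate(W_next, gap):
    return W_next * gap                          # offline rescale

calib_h_full = np.random.randn(32, 512) * 1.3    # stand-in activations
calib_h_pruned = np.random.randn(32, 512)
W_next = np.random.randn(512, 512) * 0.02
W_next = compensate(W_next, magnitude_gap(calib_h_full, calib_h_pruned))
```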
SpecASR Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding
Authors: Linye Wei, Shuzhang Zhong, Songqiang Xu, Runsheng Wang, Ru Huang, Meng Li
2025-07-24
http://arxiv.org/abs/2507.18181v1
Large language model (LLM)-based automatic speech recognition (ASR) has
recently attracted a lot of attention due to its high recognition accuracy and
enhanced multi-dialect support. However, the high decoding latency of LLMs
challenges real-time ASR requirements. Although speculative decoding has been
explored for better decoding efficiency, existing methods usually ignore the key
characteristics of the ASR task and achieve limited speedup. To further reduce
the real-time ASR latency, in this paper, we propose a novel speculative
decoding framework specialized for ASR, dubbed SpecASR. SpecASR is developed
based on our core observation that ASR decoding is audio-conditioned, which
results in high output alignment between small and large ASR models, even given
output mismatches in intermediate decoding steps. Therefore, SpecASR features
an adaptive draft sequence generation process that dynamically modifies the
draft sequence length to maximize the token acceptance length. SpecASR further
proposes a draft sequence recycling strategy that reuses the previously
generated draft sequence to reduce the draft ASR model latency. Moreover, a
two-pass sparse token tree generation algorithm is also proposed to balance the
latency of draft and target ASR models. With extensive experimental results, we
demonstrate SpecASR achieves 3.04x-3.79x and 1.25x-1.84x speedup over the
baseline autoregressive decoding and speculative decoding, respectively,
without any loss in recognition accuracy.
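For readers new to speculative decoding, the toy loop below shows the draft-verify cycle with an adaptive draft length in the spirit of SpecASR; the grow/shrink heuristic and the stand-in models are illustrative assumptions, not the paper's algorithm.
```python
# Draft-verify speculative decoding with an adaptive draft length.
import random

def speculative_decode(draft_model, target_model, prefix, max_len=100,
                       draft_len=4):
    tokens = list(prefix)
    while len(tokens) < max_len:
        draft = draft_model(tokens, n=draft_len)       # cheap draft tokens
        accepted = target_model.verify(tokens, draft)  # longest agreeing prefix
        tokens += accepted
        if len(accepted) < len(draft):                 # mismatch: resample once
            tokens.append(target_model.sample_one(tokens))
            draft_len = max(2, draft_len - 1)          # shrink after rejection
        else:
            draft_len += 1                             # grow on full acceptance
    return tokens[:max_len]

class ToyLM:
    """Stand-in model: random digits, ~70% per-token acceptance."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed); self.vocab = "0123456789"
    def __call__(self, ctx, n):
        return [self.rng.choice(self.vocab) for _ in range(n)]
    def verify(self, ctx, draft):
        ok = []
        for t in draft:
            if self.rng.random() < 0.7: ok.append(t)
            else: break
        return ok
    def sample_one(self, ctx):
        return self.rng.choice(self.vocab)

m = ToyLM()
print("".join(speculative_decode(m, m, prefix=[], max_len=20)))
```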
ICWLM A Multi-Task Wireless Large Model via In-Context Learning
Authors: Yuxuan Wen, Xiaoming Chen, Maojun Zhang, Zhaoyang Zhang
2025-07-24
http://arxiv.org/abs/2507.18167v1
The rapid evolution of wireless technologies, particularly
massive multiple-input multiple-output (mMIMO) and millimeter-wave (mmWave),
introduces significant network complexity and computational demands.
Significant research efforts have been made to improve physical layer
performance by resorting to deep learning (DL) methods, which, however, are
usually task-specific and struggle with data scarcity and generalization. To
address these challenges, we propose a novel In-Context Wireless Large Model
(ICWLM), a wireless-native foundation model designed for simultaneous
multi-task learning at the physical layer. Unlike conventional methods that
adapt wireless data to pre-trained large language models (LLMs), ICWLM is
trained directly on large-scale, mixed wireless datasets from scratch. It
jointly solves multiple classical physical layer problems, including multi-user
precoding (sum-rate maximization and max-min SINR) and channel prediction. A
key innovation of ICWLM is its utilization of in-context learning (ICL),
enabling the model to adapt to varying system configurations and channel
conditions with minimal demonstration pairs, eliminating the need for extensive
retraining. Furthermore, we employ the Dynamic Weight Averaging (DWA) algorithm
to dynamically balance the individual task losses during multi-task training,
ensuring efficient and stable learning across diverse objectives. Extensive
simulation results demonstrate that ICWLM achieves competitive performance
compared to task-specific methods while exhibiting remarkable generalization
capabilities to unseen system configurations. This work offers a promising
paradigm for developing unified and adaptive AI models for future wireless
networks, potentially reducing deployment complexity and enhancing intelligent
resource management.
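Dynamic Weight Averaging is well defined in the literature (Liu et al., 2019): each task weight follows the ratio of its last two losses through a temperature-controlled softmax, scaled so the weights sum to the task count. A sketch, which may differ in detail from ICWLM's variant:
```python
# DWA: tasks whose loss is shrinking slowly get relatively larger weights.
import math

def dwa_weights(loss_hist, T=2.0):
    """loss_hist: dict task -> list of losses over epochs."""
    K = len(loss_hist)
    if any(len(h) < 2 for h in loss_hist.values()):
        return {k: 1.0 for k in loss_hist}        # warm-up: uniform weights
    r = {k: h[-1] / h[-2] for k, h in loss_hist.items()}
    z = sum(math.exp(v / T) for v in r.values())
    return {k: K * math.exp(v / T) / z for k, v in r.items()}

hist = {"precoding": [1.0, 0.8, 0.7], "channel_pred": [1.0, 0.95, 0.93]}
print(dwa_weights(hist))  # slower-improving task gets the larger weight
```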
NeuralDB Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database
Authors: Weizhi Fei, Hao Shi, Jing Xu, Jingchen Peng, Jiazheng Li, Jingzhao Zhang, Bo Bai, Wei Han, Zhenyuan Chen, Xueyan Niu
2025-07-24
http://arxiv.org/abs/2507.18028v1
Efficiently editing knowledge stored in large language models (LLMs) enables
model updates without large-scale training. One possible solution is
Locate-and-Edit (L&E), allowing simultaneous modifications of a massive number
of facts. However, such editing may compromise the general abilities of LLMs
and even result in forgetting edited facts when scaling up to thousands of
edits. In this paper, we model existing linear L&E methods as querying a
Key-Value (KV) database. From this perspective, we then propose NeuralDB, an
editing framework that explicitly represents the edited facts as a neural
database equipped with a non-linear gated retrieval module. In particular,
our gated module only operates when inference involves the edited facts,
effectively preserving the general abilities of LLMs. Comprehensive experiments
involving the editing of 10,000 facts were conducted on the ZsRE and
CounterFacts datasets, using GPT2-XL, GPT-J (6B) and Llama-3 (8B). The results
demonstrate that NeuralDB not only excels in editing efficacy, generalization,
specificity, fluency, and consistency, but also preserves overall performance
across six representative text understanding and generation tasks. Further
experiments indicate that NeuralDB maintains its effectiveness even when scaled
to 100,000 facts (50x more than in prior work).
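Conceptually, the gated neural KV database behaves like the sketch below: the gate fires only when a query lands near a stored edit key, so unrelated inference passes through untouched. Thresholded cosine similarity stands in here for NeuralDB's learned non-linear gate.
```python
# Gated KV retrieval: return the edited value only when the gate opens.
import numpy as np

class NeuralKV:
    def __init__(self, dim, tau=0.85):
        self.keys = np.empty((0, dim)); self.vals = np.empty((0, dim))
        self.tau = tau                         # gate threshold (illustrative)
    def insert(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.vals = np.vstack([self.vals, v])
    def __call__(self, h):
        if len(self.keys) == 0:
            return h
        sims = self.keys @ h / (np.linalg.norm(self.keys, axis=1)
                                * np.linalg.norm(h) + 1e-9)
        i = int(sims.argmax())
        return self.vals[i] if sims[i] > self.tau else h  # gate closed: no-op

db = NeuralKV(dim=4)
db.insert(np.array([1., 0., 0., 0.]), np.array([0., 1., 0., 0.]))
print(db(np.array([0.99, 0.01, 0., 0.])))   # edited fact retrieved
print(db(np.array([0., 0., 1., 0.])))       # unrelated query: unchanged
```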
Who Attacks, and Why? Using LLMs to Identify Negative Campaigning in 18M Tweets across 19 Countries
Authors: Victor Hartman, Petter Törnberg
2025-07-23
http://arxiv.org/abs/2507.17636v1
Negative campaigning is a central feature of political competition, yet
empirical research has been limited by the high cost and limited scalability of
existing classification methods. This study makes two key contributions. First,
it introduces zero-shot Large Language Models (LLMs) as a novel approach for
cross-lingual classification of negative campaigning. Using benchmark datasets
in ten languages, we demonstrate that LLMs achieve performance on par with
native-speaking human coders and outperform conventional supervised machine
learning approaches. Second, we leverage this novel method to conduct the
largest cross-national study of negative campaigning to date, analyzing 18
million tweets posted by parliamentarians in 19 European countries between 2017
and 2022. The results reveal consistent cross-national patterns: governing
parties are less likely to use negative messaging, while ideologically extreme
and populist parties -- particularly those on the radical right -- engage in
significantly higher levels of negativity. These findings advance our
understanding of how party-level characteristics shape strategic communication
in multiparty systems. More broadly, the study demonstrates the potential of
LLMs to enable scalable, transparent, and replicable research in political
communication across linguistic and cultural contexts.
R-Stitch Dynamic Trajectory Stitching for Efficient Reasoning
Authors: Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai, Bohan Zhuang
2025-07-23
http://arxiv.org/abs/2507.17307v1
Chain-of-thought (CoT) reasoning enhances the problem-solving capabilities of
large language models by encouraging step-by-step intermediate reasoning during
inference. While effective, CoT introduces substantial computational overhead
due to its reliance on autoregressive decoding over long token sequences.
Existing strategies either reduce sequence length through early
stopping or compressive reward designs, or improve decoding speed via
speculative decoding with smaller models. However, speculative decoding suffers
from limited speedup when the agreement between small and large models is low,
and fails to exploit the potential advantages of small models in producing
concise intermediate reasoning. In this paper, we present R-Stitch, a
token-level, confidence-based hybrid decoding framework that accelerates CoT
inference by switching between a small language model (SLM) and a large
language model (LLM) along the reasoning trajectory. R-Stitch uses the SLM to
generate tokens by default and delegates to the LLM only when the SLM's
confidence falls below a threshold. This design avoids full-sequence rollback
and selectively invokes the LLM on uncertain steps, preserving both efficiency
and answer quality. R-Stitch is model-agnostic, training-free, and compatible
with standard decoding pipelines. Experiments on math reasoning benchmarks
demonstrate that R-Stitch achieves up to 85% reduction in inference latency
with negligible accuracy drop, highlighting its practical effectiveness in
accelerating CoT reasoning.
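The switching rule is simple enough to sketch directly; the toy models and the threshold value below are illustrative assumptions, not R-Stitch's configuration.
```python
# Token-level confidence switching: SLM decodes by default, the LLM is
# consulted only when the SLM's top-1 probability drops below tau.
import random

def hybrid_decode(slm, llm, prompt, max_tokens=50, tau=0.6):
    out = list(prompt)
    for _ in range(max_tokens):
        tok, conf = slm(out)
        if conf < tau:               # uncertain step: delegate, no rollback
            tok, _ = llm(out)
        out.append(tok)
    return out

rng = random.Random(0)
slm = lambda ctx: (rng.choice("0123456789"), rng.random())
llm = lambda ctx: ("L", 0.99)        # stand-in "large model" token
print("".join(hybrid_decode(slm, llm, prompt=[], max_tokens=20)))
```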
EFS Evolutionary Factor Searching for Sparse Portfolio Optimization Using Large Language Models
Authors: Haochen Luo, Yuan Zhang, Chen Liu
2025-07-23
http://arxiv.org/abs/2507.17211v1
Sparse portfolio optimization is a fundamental yet challenging problem in
quantitative finance, since traditional approaches heavily relying on
historical return statistics and static objectives can hardly adapt to dynamic
market regimes. To address this issue, we propose Evolutionary Factor Search
(EFS), a novel framework that leverages large language models (LLMs) to
automate the generation and evolution of alpha factors for sparse portfolio
construction. By reformulating the asset selection problem as a top-m ranking
task guided by LLM-generated factors, EFS incorporates an evolutionary feedback
loop to iteratively refine the factor pool based on performance. Extensive
experiments on five Fama-French benchmark datasets and three real-market
datasets (US50, HSI45 and CSI300) demonstrate that EFS significantly
outperforms both statistical-based and optimization-based baselines, especially
in larger asset universes and volatile conditions. Comprehensive ablation
studies validate the importance of prompt composition, factor diversity, and
backend choice. Our results highlight the promise of language-guided
evolution as a robust and interpretable paradigm for portfolio optimization
under structural constraints.
LLM Meets the Sky Heuristic Multi-Agent Reinforcement Learning for Secure Heterogeneous UAV Networks
Authors: Lijie Zheng, Ji He, Shih Yu Chang, Yulong Shen, Dusit Niyato
2025-07-23
http://arxiv.org/abs/2507.17188v1
This work tackles the physical layer security (PLS) problem of maximizing the
secrecy rate in heterogeneous UAV networks (HetUAVNs) under propulsion energy
constraints. Unlike prior studies that assume uniform UAV capabilities or
overlook energy-security trade-offs, we consider a realistic scenario where
UAVs with diverse payloads and computation resources collaborate to serve
ground terminals in the presence of eavesdroppers. To manage the complex
coupling between UAV motion and communication, we propose a hierarchical
optimization framework. The inner layer uses a semidefinite relaxation
(SDR)-based S2DC algorithm combining penalty functions and difference-of-convex
(d.c.) programming to solve the secrecy precoding problem with fixed UAV
positions. The outer layer introduces a Large Language Model (LLM)-guided
heuristic multi-agent reinforcement learning approach (LLM-HeMARL) for
trajectory optimization. LLM-HeMARL efficiently incorporates expert heuristic
policies generated by the LLM, enabling UAVs to learn energy-aware,
security-driven trajectories without the inference overhead of real-time LLM
calls. The simulation results show that our method outperforms existing
baselines in secrecy rate and energy efficiency, with consistent robustness
across varying UAV swarm sizes and random seeds.
Resilient Multi-Agent Negotiation for Medical Supply Chains Integrating LLMs and Blockchain for Transparent Coordination
Authors: Mariam ALMutairi, Hyungmin Kim
2025-07-23
http://arxiv.org/abs/2507.17134v1
Global health emergencies, such as the COVID-19 pandemic, have exposed
critical weaknesses in traditional medical supply chains, including
inefficiencies in resource allocation, lack of transparency, and poor
adaptability to dynamic disruptions. This paper presents a novel hybrid
framework that integrates blockchain technology with a decentralized, large
language model (LLM)-powered multi-agent negotiation system to enhance the
resilience and accountability of medical supply chains during crises. In this
system, autonomous agents representing manufacturers, distributors, and
healthcare institutions engage in structured, context-aware negotiation and
decision-making processes facilitated by LLMs, enabling rapid and ethical
allocation of scarce medical resources. The off-chain agent layer supports
adaptive reasoning and local decision-making, while the on-chain blockchain
layer ensures immutable, transparent, and auditable enforcement of decisions
via smart contracts. The framework also incorporates a formal cross-layer
protocol to bridge decentralized negotiation with institutional
enforcement. A simulation environment emulating pandemic scenarios evaluates
the system's performance, demonstrating improvements in negotiation efficiency,
fairness of allocation, supply chain responsiveness, and auditability. This
research contributes an innovative approach that synergizes blockchain trust
guarantees with the adaptive intelligence of LLM-driven agents, providing a
robust and scalable solution for critical supply chain coordination under
uncertainty.
Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models
Authors: Andrii Balashov
2025-07-23
http://arxiv.org/abs/2507.17107v1
Reinforcement learning (RL) is a key post-pretraining step for aligning large
language models (LLMs) with complex tasks and human preferences. While it is
often assumed that RL fine-tuning requires updating most of a model's
parameters, we challenge this assumption with a surprising finding: RL
fine-tuning consistently modifies only a small subnetwork (typically 5-30% of
weights), leaving most parameters unchanged. We call this phenomenon RL-induced
parameter update sparsity. It arises naturally, without any sparsity
constraints or parameter-efficient tuning, and appears across multiple RL
algorithms (e.g., PPO, DPO, SimPO, PRIME) and model families (e.g., OpenAI,
Meta, and open-source LLMs). Moreover, the subnetworks updated by RL show
substantial overlap across different seeds, datasets, and algorithms, far
exceeding chance, suggesting a partially transferable structure in the
pretrained model. We show that fine-tuning only this sparse subnetwork recovers
full model performance and yields parameters nearly identical to the fully
fine-tuned model. Our analysis suggests this sparsity emerges because RL
operates near the model's original distribution, requiring only targeted
changes. KL penalties, gradient clipping, and on-policy dynamics have limited
effect on the sparsity pattern. These findings shed new light on how RL adapts
models: not by shifting all weights, but by focusing training on a small,
consistently updated subnetwork. This insight enables more efficient RL methods
and reframes sparsity through the lens of the lottery ticket hypothesis.
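Measuring this update sparsity from two checkpoints is straightforward; the sketch below uses an exact-zero difference test, which is an assumption (a small tolerance may be more realistic for low-precision checkpoints).
```python
# Compare pre- and post-RL checkpoints, count changed weights, and build a
# mask so later fine-tuning can touch only that subnetwork.
import numpy as np

def update_mask(w_before, w_after, atol=0.0):
    mask = np.abs(w_after - w_before) > atol
    return mask, mask.mean()

w0 = np.random.randn(1000)
w1 = w0.copy()
idx = np.random.choice(1000, size=150, replace=False)   # ~15% subnetwork
w1[idx] += np.random.randn(150) * 0.01
mask, frac = update_mask(w0, w1)
print(f"updated fraction: {frac:.1%}")                  # ~15.0%

grad = np.random.randn(1000)
masked_grad = grad * mask        # later: apply gradients only through the mask
```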
LoRA is All You Need for Safety Alignment of Reasoning LLMs
Authors: Yihao Xue, Baharan Mirzasoleiman
2025-07-22
http://arxiv.org/abs/2507.17075v1
Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex
problems that were previously out of reach. To ensure LLMs do not assist with
harmful requests, safety alignment fine-tuning is necessary in the
post-training phase. However, safety alignment fine-tuning has recently been
shown to significantly degrade reasoning abilities, a phenomenon known as the
"Safety Tax". In this work, we show that using LoRA for SFT on refusal datasets
effectively aligns the model for safety without harming its reasoning
capabilities. This is because restricting the safety weight updates to a
low-rank space minimizes the interference with the reasoning weights. Our
extensive experiments across four benchmarks covering math, science, and coding
show that this approach produces highly safe LLMs, with safety levels
comparable to full-model fine-tuning, without compromising their reasoning
abilities. Additionally, we observe that LoRA induces weight updates with
smaller overlap with the initial weights compared to full-model fine-tuning. We
also explore methods that further reduce such overlap, via regularization or
during weight merging, and observe some improvement on certain tasks. We hope
this result motivates designing approaches that yield more consistent
improvements in the reasoning-safety trade-off.
Parallelism Meets Adaptiveness Scalable Documents Understanding in Multi-Agent LLM Systems
Authors: Chengxuan Xia, Qianye Wu, Sixuan Tian, Yilun Hao
2025-07-22
http://arxiv.org/abs/2507.17061v1
Large language model (LLM) agents have shown increasing promise for
collaborative task completion. However, existing multi-agent frameworks often
rely on static workflows, fixed roles, and limited inter-agent communication,
reducing their effectiveness in open-ended, high-complexity domains. This paper
proposes a coordination framework that enables adaptiveness through three core
mechanisms: dynamic task routing, bidirectional feedback, and parallel agent
evaluation. The framework allows agents to reallocate tasks based on confidence
and workload, exchange structured critiques to iteratively improve outputs, and
crucially compete on high-ambiguity subtasks with evaluator-driven selection of
the most suitable result. We instantiate these principles in a modular
architecture and demonstrate substantial improvements in factual coverage,
coherence, and efficiency over static and partially adaptive baselines. Our
findings highlight the benefits of incorporating both adaptiveness and
structured competition in multi-agent LLM systems.
Beyond Context Limits Subconscious Threads for Long-Horizon Reasoning
Authors: Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass
2025-07-22
http://arxiv.org/abs/2507.16784v1
To break the context limits of large language models (LLMs) that bottleneck
reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM),
a family of LLMs trained for recursive and decompositional problem solving, and
TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond
context limits. Together, TIM hosted on TIMRUN supports virtually unlimited
working memory and multi-hop tool calls within a single language model
inference, overcoming output limits, positional-embedding constraints, and
GPU-memory bottlenecks. Performance is achieved by modeling natural language as
reasoning trees measured by both length and depth instead of linear sequences.
The reasoning trees consist of tasks with thoughts, recursive subtasks, and
conclusions based on the concept we proposed in Schroeder et al., 2025. During
generation, we maintain a working memory that retains only the key-value states
of the most relevant context tokens, selected by a rule-based subtask-pruning
mechanism, enabling reuse of positional embeddings and GPU memory pages
throughout reasoning. Experimental results show that our system sustains high
inference throughput, even when manipulating up to 90% of the KV cache in GPU
memory. It also delivers accurate reasoning on mathematical tasks and handles
information retrieval challenges that require long-horizon reasoning and
multi-hop tool use.
Collaborative Inference and Learning between Edge SLMs and Cloud LLMs A Survey of Algorithms, Execution, and Open Challenges
Authors: Senyao Li, Haozhao Wang, Wenchao Xu, Rui Zhang, Song Guo, Jingling Yuan, Xian Zhong, Tianwei Zhang, Ruixuan Li
2025-07-22
http://arxiv.org/abs/2507.16731v1
As large language models (LLMs) evolve, deploying them solely in the cloud or
compressing them for edge devices has become inadequate due to concerns about
latency, privacy, cost, and personalization. This survey explores a
collaborative paradigm in which cloud-based LLMs and edge-deployed small
language models (SLMs) cooperate across both inference and training. We present
a unified taxonomy of edge-cloud collaboration strategies. For inference, we
categorize approaches into task assignment, task division, and mixture-based
collaboration at both task and token granularity, encompassing adaptive
scheduling, resource-aware offloading, speculative decoding, and modular
routing. For training, we review distributed adaptation techniques, including
parameter alignment, pruning, bidirectional distillation, and
small-model-guided optimization. We further summarize datasets, benchmarks, and
deployment cases, and highlight privacy-preserving methods and vertical
applications. This survey provides the first systematic foundation for LLM-SLM
collaboration, bridging system and algorithm co-design to enable efficient,
scalable, and trustworthy edge-cloud intelligence.
ACT Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training
Authors: Shreya Saxena, Siva Prasad, Zishan Ahmad, Vishal Vaddina
2025-07-22
http://arxiv.org/abs/2507.16478v1
Code translation is a crucial process in software development and migration
projects, enabling interoperability between different programming languages and
enhancing software adaptability and thus longevity. Traditional automated
translation methods rely heavily on handcrafted transformation rules, which
often lack flexibility and scalability. Meanwhile, advanced language models
present promising alternatives but are often limited by proprietary, API-based
implementations that raise concerns over data security and reliance. In this
paper, we present Auto-Train for Code Translation (ACT), an innovative
framework that aims to improve code translation capabilities by enabling
in-house finetuning of open-source Large Language Models (LLMs). ACT's
automated pipeline significantly boosts the performance of these models,
narrowing the gap between open-source accessibility and the high performance of
closed-source solutions. Central to ACT is its synthetic data generation
module, which builds extensive, high-quality datasets from initial code
samples, incorporating unit tests to ensure functional accuracy and diversity.
ACT's evaluation framework incorporates execution-level checks, offering a
comprehensive assessment of translation quality. A key feature in ACT is its
controller module, which manages the entire pipeline by dynamically adjusting
hyperparameters, orchestrating iterative data generation, and finetuning based
on real-time evaluations. This enables ACT to intelligently optimize when to
continue training, generate additional targeted training data, or stop the
process. Our results demonstrate that ACT consistently enhances the
effectiveness of open-source models, offering businesses and developers a
secure and reliable alternative. Additionally, applying our data generation
pipeline to industry-scale migration projects has led to a notable increase in
developer productivity.
Mamba-OTR a Mamba-based Solution for Online Take and Release Detection from Untrimmed Egocentric Video
Authors: Alessandro Sebastiano Catinello, Giovanni Maria Farinella, Antonino Furnari
2025-07-22
http://arxiv.org/abs/2507.16342v1
This work tackles the problem of Online detection of Take and Release (OTR)
of an object in untrimmed egocentric videos. This task is challenging due to
severe label imbalance, with temporally sparse positive annotations, and the
need for precise temporal predictions. Furthermore, methods need to be
computationally efficient in order to be deployed in real-world online
settings. To address these challenges, we propose Mamba-OTR, a model based on
the Mamba architecture. Mamba-OTR is designed to exploit temporal recurrence
during inference while being trained on short video clips. To address label
imbalance, our training pipeline incorporates the focal loss and a novel
regularization scheme that aligns model predictions with the evaluation metric.
Extensive experiments on EPIC-KITCHENS-100, comparisons with
transformer-based approaches, and the evaluation of different training and test
schemes demonstrate the superiority of Mamba-OTR in both accuracy and
efficiency. These findings are particularly evident when evaluating full-length
videos or high frame-rate sequences, even when trained on short video snippets
for computational convenience. The proposed Mamba-OTR achieves a noteworthy
mp-mAP of 45.48 when operating in a sliding-window fashion, and 43.35 in
streaming mode, versus the 20.32 of a vanilla transformer and 25.16 of a
vanilla Mamba, thus providing a strong baseline for OTR. We will publicly
release the source code of Mamba-OTR to support future research.
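The focal loss mentioned above is standard (Lin et al., 2017); a minimal version for the binary take/release setting, with default gamma/alpha values that are not necessarily Mamba-OTR's settings:
```python
# Binary focal loss: down-weights easy negatives, which dominate when
# positive annotations are temporally sparse.
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """p: predicted probability of the positive class; y: 0/1 labels."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y == 1, p, 1 - p)
    a = np.where(y == 1, alpha, 1 - alpha)
    return -(a * (1 - pt) ** gamma * np.log(pt)).mean()

y = np.array([0, 0, 0, 0, 0, 0, 0, 1])        # sparse positives
p = np.array([.1, .2, .1, .05, .1, .3, .1, .4])
print(focal_loss(p, y))
```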
CompLeak Deep Learning Model Compression Exacerbates Privacy Leakage
Authors: Na Li, Yansong Gao, Hongsheng Hu, Boyu Kuang, Anmin Fu
2025-07-22
http://arxiv.org/abs/2507.16872v1
Model compression is crucial for minimizing memory storage and accelerating
inference in deep learning (DL) models, including recent foundation models like
large language models (LLMs). Users can access different compressed model
versions according to their resources and budget. However, while existing
compression operations primarily focus on optimizing the trade-off between
resource efficiency and model performance, the privacy risks introduced by
compression remain overlooked and insufficiently understood.
In this work, through the lens of membership inference attack (MIA), we
propose CompLeak, the first privacy risk evaluation framework examining three
widely used compression configurations that are pruning, quantization, and
weight clustering supported by the commercial model compression framework of
Google's TensorFlow-Lite (TF-Lite) and Facebook's PyTorch Mobile. CompLeak has
three variants, given available access to the number of compressed models and
original model. CompLeakNR starts by adopting existing MIA methods to attack a
single compressed model, and identifies that different compressed models
influence members and non-members differently. When the original model and one
compressed model are available, CompLeakSR leverages the compressed model as a
reference to the original model and uncovers more privacy leakage by combining meta
information (e.g., confidence vector) from both models. When multiple
compressed models are available with/without accessing the original model,
CompLeakMR innovatively exploits privacy leakage information from multiple
compressed versions to substantially amplify the overall privacy leakage. We conduct
extensive experiments on seven diverse model architectures (from ResNet to
foundation models of BERT and GPT-2), and six image and textual benchmark
datasets.
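A toy illustration of the intuition behind combining models: members tend to keep high confidence after compression, so a joint score over the original and compressed confidences separates members from non-members better than either alone. The product rule and threshold below are expository assumptions, not CompLeak's attack.
```python
# Threshold MIA on joint confidence of original and compressed models.
import numpy as np

def mia_score(conf_orig, conf_comp):
    return conf_orig * conf_comp          # joint meta-information signal

rng = np.random.default_rng(1)
members = mia_score(rng.uniform(.8, 1, 100), rng.uniform(.7, 1, 100))
nonmembers = mia_score(rng.uniform(.3, .9, 100), rng.uniform(.2, .8, 100))
threshold = 0.6
print(f"TPR={(members > threshold).mean():.2f}  "
      f"FPR={(nonmembers > threshold).mean():.2f}")
```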
Time to Split Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders
Authors: Danil Gusak, Anna Volodkevich, Anton Klenitskiy, Alexey Vasilev, Evgeny Frolov
2025-07-22
http://arxiv.org/abs/2507.16289v1
Modern sequential recommender systems, ranging from lightweight
transformer-based variants to large language models, have become increasingly
prominent in academia and industry due to their strong performance in the
next-item prediction task. Yet common evaluation protocols for sequential
recommendations remain insufficiently developed: they often fail to reflect the
corresponding recommendation task accurately, or are not aligned with
real-world scenarios.
Although the widely used leave-one-out split matches next-item prediction, it
permits overlap between training and test periods, which leads to temporal
leakage and an unrealistically long test horizon, limiting real-world relevance.
Global temporal splitting addresses these issues by evaluating on distinct
future periods. However, its applications to sequential recommendations remain
loosely defined, particularly in terms of selecting target interactions and
constructing a validation subset that provides necessary consistency between
validation and test metrics.
In this paper, we demonstrate that evaluation outcomes can vary significantly
across splitting strategies, influencing model rankings and practical
deployment decisions. To improve reproducibility in both academic and
industrial settings, we systematically compare different splitting strategies
for sequential recommendations across multiple datasets and established
baselines. Our findings show that prevalent splits, such as leave-one-out, may
be insufficiently aligned with more realistic evaluation strategies. Code:
https://github.com/monkey0head/time-to-split
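A minimal global temporal split of the kind discussed above, assuming a simple interaction log; the column names and the "first post-split item per user" target rule are illustrative.
```python
# Global temporal split: everything before t_split trains the model;
# each user's first post-split interaction is the test target.
import pandas as pd

def global_temporal_split(df, t_split):
    train = df[df.timestamp < t_split]
    future = df[df.timestamp >= t_split].sort_values("timestamp")
    test = future.groupby("user_id").first().reset_index()  # next item per user
    return train, test

df = pd.DataFrame({
    "user_id":   [1, 1, 2, 1, 2],
    "item_id":   [10, 11, 12, 13, 14],
    "timestamp": [1, 2, 3, 5, 6],
})
train, test = global_temporal_split(df, t_split=4)
```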
Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training
Authors: Zixiao Huang, Junhao Hu, Hao Lin, Chunyang Zhu, Yueran Tang, Quanlu Zhang, Zhen Guo, Zhenhua Li, Shengen Yan, Zhenhua Zhu, Guohao Dai, Yu Wang
2025-07-22
http://arxiv.org/abs/2507.16274v1
The rapid scaling of large language models (LLMs) has significantly increased
GPU memory pressure, which is further aggravated by training optimization
techniques such as virtual pipeline and recomputation that disrupt tensor
lifespans and introduce considerable memory fragmentation. Default GPU memory
allocators of popular deep learning frameworks like PyTorch use online
strategies without knowledge of tensor lifespans, which can waste up to 43% of
memory and cause out-of-memory errors, rendering optimization techniques
ineffective or even unusable.
To address this, we introduce STWeaver, a GPU memory allocator for deep
learning frameworks that reduces fragmentation by exploiting the spatial and
temporal regularity in memory allocation behaviors of training workloads.
STWeaver introduces a novel paradigm that combines offline planning with online
allocation. The offline planning leverages spatio-temporal regularities to
generate a near-optimal allocation plan, while the online allocation handles
complex and dynamic models such as Mixture-of-Experts (MoE). Built as a
pluggable PyTorch allocator, STWeaver reduces the fragmentation ratio on average
by 79.2% (up to 100%) across both dense and sparse models, with negligible
overhead. This enables more efficient, high-throughput training configurations
and improves performance by up to 32.5%.
Benchmarking LLM Privacy Recognition for Social Robot Decision Making
Authors: Dakota Sullivan, Shirley Zhang, Jennica Li, Heather Kirkorian, Bilge Mutlu, Kassem Fawaz
2025-07-22
http://arxiv.org/abs/2507.16124v1
Social robots are embodied agents that interact with people while following
human norms. These robots interact using verbal and non-verbal communication
cues, and share the physical environments of people. While social robots have
previously utilized rule-based systems or probabilistic models for user
interaction, the rapid evolution of large language models (LLMs) presents new
opportunities to develop LLM-empowered social robots for enhanced human-robot
interaction. To fully realize these capabilities, however, robots need to
collect data such as audio, fine-grained images, video, and locations. As a
result, LLMs often process sensitive personal information, particularly within
home environments. Given the tension between utility and privacy risks,
evaluating how current LLMs manage sensitive data is critical. Specifically, we
aim to explore the extent to which out-of-the-box LLMs are privacy-aware in the
context of household social robots. In this study, we present a set of
privacy-relevant scenarios crafted through the lens of Contextual Integrity
(CI). We first survey users' privacy preferences regarding in-home social robot
behaviors and then examine how their privacy orientation affects their choices
of these behaviors (N = 450). We then provide the same set of scenarios and
questions to state-of-the-art LLMs (N = 10) and find that the agreement between
humans and LLMs is low. To further investigate the capabilities of LLMs as a
potential privacy controller, we implement four additional prompting strategies
and compare their results. Finally, we discuss the implications and potential
of AI privacy awareness in human-robot interaction.
TorchAO PyTorch-Native Training-to-Serving Model Optimization
Authors: Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, Supriya Rao, Thien Tran, Aleksandar Samardžić
2025-07-21
http://arxiv.org/abs/2507.16099v1
We present TorchAO, a PyTorch-native model optimization framework leveraging
quantization and sparsity to provide an end-to-end, training-to-serving
workflow for AI models. TorchAO supports a variety of popular model
optimization techniques, including FP8 quantized training, quantization-aware
training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and
leverages a novel tensor subclass abstraction to represent a variety of
widely-used, backend agnostic low precision data types, including INT4, INT8,
FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader
ecosystem at each step of the model optimization pipeline, from pre-training
(TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM,
SGLang, ExecuTorch), connecting an otherwise fragmented space in a single,
unified workflow. TorchAO has enabled recent launches of the quantized Llama
3.2 1B/3B and LlamaGuard3-8B models and is open-source at
https://github.com/pytorch/ao/.
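Typical usage follows the project README; a sketch of post-training weight-only quantization (API names may shift between torchao versions, so treat this as indicative rather than definitive):
```python
# Post-training INT8 weight-only quantization with TorchAO's quantize_ API.
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
quantize_(model, int8_weight_only())   # swaps weights to an INT8 tensor subclass
out = model(torch.randn(1, 1024))
```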
On the transferability of Sparse Autoencoders for interpreting compressed models
Authors: Suchit Gupte, Vishnu Kabir Chhabra, Mohammad Mahdi Khalili
2025-07-21
http://arxiv.org/abs/2507.15977v1
Modern LLMs face inference efficiency challenges due to their scale. To
address this, many compression methods have been proposed, such as pruning and
quantization. However, the effect of compression on a model's interpretability
remains elusive. While several model interpretation approaches exist, such as
circuit discovery, Sparse Autoencoders (SAEs) have proven particularly
effective in decomposing a model's activation space into its feature basis. In
this work, we explore the differences in SAEs for the original and compressed
models. We find that SAEs trained on the original model can interpret the
compressed model, albeit with slight performance degradation compared to an SAE
trained on the compressed model. Furthermore, simply pruning the original
SAE itself achieves performance comparable to training a new SAE on the pruned
model. This finding enables us to mitigate the extensive training costs of
SAEs.
Just Ask for Music (JAM) Multimodal and Personalized Natural Language Music Recommendation
Authors: Alessandro B. Melchiorre, Elena V. Epure, Shahed Masoudian, Gustavo Escobedo, Anna Hausberger, Manuel Moussallam, Markus Schedl
2025-07-21
http://arxiv.org/abs/2507.15826v1
Natural language interfaces offer a compelling approach for music
recommendation, enabling users to express complex preferences conversationally.
While Large Language Models (LLMs) show promise in this direction, their
scalability in recommender systems is limited by high costs and latency.
Retrieval-based approaches using smaller language models mitigate these issues
but often rely on single-modal item representations, overlook long-term user
preferences, and require full model retraining, posing challenges for
real-world deployment. In this paper, we present JAM (Just Ask for Music), a
lightweight and intuitive framework for natural language music recommendation.
JAM models user-query-item interactions as vector translations in a shared
latent space, inspired by knowledge graph embedding methods like TransE. To
capture the complexity of music and user intent, JAM aggregates multimodal item
features via cross-attention and sparse mixture-of-experts. We also introduce
JAMSessions, a new dataset of over 100k user-query-item triples with anonymized
user/item embeddings, uniquely combining conversational queries and user
long-term preferences. Our results show that JAM provides accurate
recommendations, produces intuitive representations suitable for practical use
cases, and can be easily integrated with existing music recommendation stacks.
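The TransE-style scoring JAM builds on is easy to state: a user-query-item triple is plausible when user + query lands near the item in the shared latent space. A standalone sketch of the scoring rule only, with illustrative embedding sources:
```python
# TransE-style translation scoring: score = -||u + q - i||.
import numpy as np

def transe_score(u, q, i):
    return -np.linalg.norm(u + q - i)          # higher = better match

rng = np.random.default_rng(0)
u = rng.normal(size=32)                        # long-term user embedding
q = rng.normal(size=32)                        # encoded natural-language query
items = rng.normal(size=(1000, 32))            # multimodal item embeddings
scores = -np.linalg.norm(u + q - items, axis=1)
top10 = np.argsort(scores)[::-1][:10]          # recommend the 10 closest items
```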
Reservoir Computing as a Language Model
Authors: Felix Köster, Atsushi Uchida
2025-07-21
http://arxiv.org/abs/2507.15779v1
Large Language Models (LLMs) have dominated the science and media landscape
due to their impressive performance in processing large chunks of data and
producing human-like text. Nevertheless, their huge energy demand and slow
processing remain a bottleneck for further increasing quality while also
making the models accessible to everyone. To address this bottleneck, we
investigate how reservoir computing performs on natural text processing, which
could enable fast and energy-efficient hardware implementations. Studies
investigating the use of reservoir computing as a language model remain sparse.
In this paper, we compare three distinct approaches for character-level
language modeling, two different reservoir computing approaches, where only an
output layer is trainable, and the well-known transformer-based architectures,
which fully learn an attention-based sequence representation. We explore the
performance, computational cost and prediction accuracy for both paradigms by
equally varying the number of trainable parameters for all models. Using a
consistent pipeline for all three approaches, we demonstrate that transformers
excel in prediction quality, whereas reservoir computers remain highly
efficient, reducing training and inference time. Furthermore, we
investigate two types of reservoir computing: a traditional reservoir with a
static linear readout, and an attention-enhanced reservoir that dynamically
adapts its output weights via an attention mechanism. Our findings underline
how these paradigms scale and offer guidelines to balance resource constraints
with performance.
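A minimal echo-state-network character model illustrates the "only the readout is trained" property the paper contrasts with transformers; the sizes, spectral radius, and ridge penalty below are illustrative.
```python
# Echo state network: fixed random reservoir, ridge-regression readout.
import numpy as np

rng = np.random.default_rng(0)
text = "hello world " * 50
chars = sorted(set(text)); idx = {c: i for i, c in enumerate(chars)}
V, N = len(chars), 300

W_in = rng.normal(size=(N, V)) * 0.5
W = rng.normal(size=(N, N))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))           # spectral radius 0.9

def run_reservoir(s):
    x, states = np.zeros(N), []
    for c in s:
        u = np.zeros(V); u[idx[c]] = 1.0
        x = np.tanh(W_in @ u + W @ x)               # fixed, untrained dynamics
        states.append(x.copy())
    return np.array(states)

X = run_reservoir(text[:-1])                        # states predict next char
Y = np.eye(V)[[idx[c] for c in text[1:]]]
W_out = np.linalg.solve(X.T @ X + 1e-3 * np.eye(N), X.T @ Y)  # ridge readout
pred = chars[int((X[-1] @ W_out).argmax())]
```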
Who Leads in the Shadows? ERGM and Centrality Analysis of Congressional Democrats on Bluesky
Authors: Gordon Hew, Ian McCulloh
2025-07-21
http://arxiv.org/abs/2507.16858v1
Following the 2024 U.S. presidential election, Democratic lawmakers and their
supporters increasingly migrated from mainstream social media platforms like X
(formerly Twitter) to decentralized alternatives such as Bluesky. This study
investigates how Congressional Democrats use Bluesky to form networks of
influence and disseminate political messaging in a platform environment that
lacks algorithmic amplification. We employ a mixed-methods approach that
combines social network analysis, exponential random graph modeling (ERGM),
and transformer-based topic modeling (BERTopic) to analyze follows, mentions,
reposts, and discourse patterns among 182 verified Democratic members of
Congress. Our findings show that while party leaders such as Hakeem Jeffries
and Elizabeth Warren dominate visibility metrics, overlooked figures like
Marcy Kaptur, Donald Beyer, and Dwight Evans occupy structurally central
positions, suggesting latent influence within the digital party ecosystem. ERGM
results reveal significant homophily along ideological, state, and leadership
lines, with Senate leadership exhibiting lower connectivity. Topic analysis
identifies both shared themes (e.g., reproductive rights, foreign conflicts)
and subgroup-specific issues, with The Squad showing the most distinct
discourse profile. These results demonstrate the potential of decentralized
platforms to reshape intra-party communication dynamics and highlight the need
for continued computational research on elite political behavior in emerging
digital environments.
Transformer-based Deep Learning Model for Joint Routing and Scheduling with Varying Electric Vehicle Numbers
Authors: Jun Kang Yap, Vishnu Monn Baskaran, Wen Shan Tan, Ze Yang Ding, Hao Wang, David L. Dowe
2025-07-21
http://arxiv.org/abs/2507.15385v1
The growing integration of renewable energy sources in modern power systems
has introduced significant operational challenges due to their intermittent and
uncertain outputs. In recent years, mobile energy storage systems (ESSs) have
emerged as a popular flexible resource for mitigating these challenges.
Compared to stationary ESSs, mobile ESSs offer additional spatial flexibility,
enabling cost-effective energy delivery through the transportation network.
However, the widespread deployment of mobile ESSs is often hindered by the high
investment cost, which has motivated researchers to investigate utilising more
readily available alternatives, such as electric vehicles (EVs) as mobile
energy storage units instead. Hence, we explore this opportunity with a
MIP-based day-ahead electric vehicle joint routing and scheduling problem in
this work. However, solving the problem in a practical setting can often be
computationally intractable, since the existence of binary variables makes it
combinatorially challenging. Therefore, we propose to simplify the problem's
solution process for a MIP solver by pruning the solution search space with a
transformer-based deep learning (DL) model. This is done by training the model
to rapidly predict the optimal binary solutions. In addition, unlike many
existing DL approaches that assume fixed problem structures, the proposed model
is designed to accommodate problems with EV fleets of any size. This
flexibility is essential since frequent re-training can introduce significant
computational overhead. We evaluated the approach with simulations on the IEEE
33-bus system coupled with the Nguyen-Dupuis transportation network.
Metaphor and Large Language Models When Surface Features Matter More than Deep Understanding
Authors: Elisa Sanchez-Bayona, Rodrigo Agerri
2025-07-21
http://arxiv.org/abs/2507.15357v1
This paper presents a comprehensive evaluation of the capabilities of Large
Language Models (LLMs) in metaphor interpretation across multiple datasets,
tasks, and prompt configurations. Although metaphor processing has gained
significant attention in Natural Language Processing (NLP), previous research
has been limited to single-dataset evaluations and specific task settings,
often using artificially constructed data through lexical replacement. We
address these limitations by conducting extensive experiments using diverse
publicly available datasets with inference and metaphor annotations, focusing
on Natural Language Inference (NLI) and Question Answering (QA) tasks. The
results indicate that LLMs' performance is more influenced by features like
lexical overlap and sentence length than by metaphorical content, demonstrating
that any alleged emergent abilities of LLMs to understand metaphorical language
are the result of a combination of surface-level features, in-context learning,
and linguistic knowledge. This work provides critical insights into the current
capabilities and limitations of LLMs in processing figurative language,
highlighting the need for more realistic evaluation frameworks in metaphor
interpretation tasks. Data and code are publicly available.
Scaling Decentralized Learning with FLock
Authors: Zehua Cheng, Rui Sun, Jiahao Sun, Yike Guo
2025-07-21
http://arxiv.org/abs/2507.15349v1
Fine-tuning large language models (LLMs) is hindered by the lack of
centralized control and by the massive computing and communication overhead of
decentralized schemes. While standard federated learning (FL) supports data
privacy, the central server requirement creates a single point of attack and a
vulnerability to poisoning attacks. Scaling this line of work to 70B-parameter
models in heterogeneous, trustless environments has remained a major unbroken
bottleneck. This paper introduces
FLock, a decentralized framework for secure and efficient collaborative
fine-tuning. Integrating a blockchain-based trust layer with economic
incentives, FLock replaces the central aggregator with a secure, auditable
protocol for cooperation among untrusted parties. We present the first
empirical validation of fine-tuning a 70B LLM in a secure, multi-domain,
decentralized setting. Our experiments show the FLock framework defends against
backdoor poisoning attacks that compromise standard FL optimizers and fosters
synergistic knowledge transfer. The resulting models show a >68% reduction in
adversarial attack success rates. The global model also demonstrates superior
cross-domain generalization, outperforming models trained in isolation on their
own specialized data.
IM-Chat A Multi-agent LLM-based Framework for Knowledge Transfer in Injection Molding Industry
Authors: Junhyeong Lee, Joon-Young Kim, Heekyu Kim, Inhyo Lee, Seunghwa Ryu
2025-07-21
http://arxiv.org/abs/2507.15268v1
The injection molding industry faces critical challenges in preserving and
transferring field knowledge, particularly as experienced workers retire and
multilingual barriers hinder effective communication. This study introduces
IM-Chat, a multi-agent framework based on large language models (LLMs),
designed to facilitate knowledge transfer in injection molding. IM-Chat
integrates both limited documented knowledge (e.g., troubleshooting tables,
manuals) and extensive field data modeled through a data-driven process
condition generator that infers optimal manufacturing settings from
environmental inputs such as temperature and humidity, enabling robust and
context-aware task resolution. By adopting a retrieval-augmented generation
(RAG) strategy and tool-calling agents within a modular architecture, IM-Chat
ensures adaptability without the need for fine-tuning. Performance was assessed
across 100 single-tool and 60 hybrid tasks for GPT-4o, GPT-4o-mini, and
GPT-3.5-turbo by domain experts using a 10-point rubric focused on relevance
and correctness, and was further supplemented by automated evaluation using
GPT-4o guided by a domain-adapted instruction prompt. The evaluation results
indicate that more capable models tend to achieve higher accuracy, particularly
in complex, tool-integrated scenarios. Overall, these findings demonstrate the
viability of multi-agent LLM systems for industrial knowledge workflows and
establish IM-Chat as a scalable and generalizable approach to AI-assisted
decision support in manufacturing.
CHADET Cross-Hierarchical-Attention for Depth-Completion Using Unsupervised Lightweight Transformer
Authors: Kevin Christiansen Marsim, Jinwoo Jeon, Yeeun Kim, Myeongwoo Jeong, Hyun Myung
2025-07-21
http://arxiv.org/abs/2507.15189v1
Depth information, which specifies the distance between objects and the
robot's current position, is essential for many robot tasks such as navigation.
Recently, researchers have proposed depth completion frameworks to provide
dense depth maps that offer comprehensive information about the surrounding
environment. However, existing methods show significant trade-offs between
computational efficiency and accuracy during inference. The substantial memory
and computational requirements make them unsuitable for real-time applications,
highlighting the need to improve the completeness and accuracy of depth
information while improving processing speed to enhance robot performance in
various tasks. To address these challenges, in this paper we propose
CHADET (cross-hierarchical-attention depth-completion transformer), a
lightweight depth-completion network that can generate accurate dense depth
maps from RGB images and sparse depth points. For each pair, features are
extracted from the depthwise blocks and passed to the equally lightweight
transformer-based decoder. In the decoder, we utilize a novel
cross-hierarchical-attention module that refines the image features using the
depth information. Our approach improves the quality of the depth-map
prediction and reduces memory usage, as validated on the KITTI, NYUv2, and
VOID datasets.
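A minimal PyTorch sketch of the underlying idea, where depth features refine
image features through cross-attention; the dimensions, token counts, and
residual/normalization arrangement are assumptions, not CHADET's exact module.

```python
# Sketch of cross-attention refining image tokens with depth tokens; layer
# sizes are assumptions for illustration.
import torch
import torch.nn as nn

class CrossAttentionRefine(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat, depth_feat):
        # img_feat, depth_feat: (B, N, dim) token sequences from the two branches
        refined, _ = self.attn(query=img_feat, key=depth_feat, value=depth_feat)
        return self.norm(img_feat + refined)   # residual connection

img = torch.randn(2, 196, 64)    # e.g. 14x14 image tokens
depth = torch.randn(2, 196, 64)  # sparse-depth tokens after encoding
print(CrossAttentionRefine()(img, depth).shape)  # torch.Size([2, 196, 64])
```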
Beyond Visual Line of Sight UAVs with Edge AI, Connected LLMs, and VR for Autonomous Aerial Intelligence
Authors: Andres Navarro, Carlos de Quinto, José Alberto Hernández
2025-07-20
http://arxiv.org/abs/2507.15049v1
Unmanned Aerial Vehicles are reshaping Non-Terrestrial Networks (NTNs) by
acting as agile, intelligent nodes capable of advanced analytics and
instantaneous situational awareness. This article introduces a budget-friendly
quadcopter platform that unites 5G communications, edge-based processing, and
AI to tackle core challenges in NTN scenarios. Outfitted with a panoramic
camera, robust onboard computation, and connected LLMs, the drone system
delivers seamless object recognition, contextual analysis, and immersive
operator experiences through virtual reality (VR) technology. Field
evaluations confirm the platform's ability to process visual streams with low
latency and sustain robust 5G links. Adding LLMs further streamlines
operations by extracting actionable insights and
refining collected data for decision support. Demonstrated use cases, including
emergency response, infrastructure assessment, and environmental surveillance,
underscore the system's adaptability in demanding contexts.
From Neurons to Semantics Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment
Authors: Chongxuan Huang, Yongshi Ye, Biao Fu, Qifeng Su, Xiaodong Shi
2025-07-20
http://arxiv.org/abs/2507.14900v2
Large language models (LLMs) have demonstrated remarkable multilingual
capabilities; however, how to evaluate cross-lingual alignment remains
underexplored. Existing alignment benchmarks primarily focus on sentence
embeddings, but prior research has shown that neural models tend to induce a
non-smooth representation space, which impairs the evaluation of semantic
alignment for low-resource languages. Inspired by neuroscientific findings
that similar information activates overlapping neuronal regions, we propose
Neuron State-Based Cross-Lingual Alignment (NeuronXA), a more semantically
grounded approach to assessing the cross-lingual alignment capabilities of
LLMs. We evaluate NeuronXA on several prominent multilingual LLMs (LLaMA,
Qwen, Mistral, GLM, and OLMo) across two transfer tasks and three multilingual
benchmarks. The results demonstrate that with only 100 parallel sentence
pairs, NeuronXA achieves a Pearson correlation of 0.9556 with downstream task
performance and 0.8514 with transferability. These findings demonstrate
NeuronXA's effectiveness in assessing both cross-lingual alignment and
transferability, even with a small dataset, highlighting its potential to
advance cross-lingual alignment research and to improve the semantic
understanding of multilingual LLMs.
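A simplified stand-in for a neuron-state alignment score: binarize neuron
activations for parallel sentences and measure how often neurons are in the
same state across languages. The thresholding and agreement metric are
assumptions, not NeuronXA's exact scoring.

```python
# Sketch of a neuron-state alignment score on parallel sentences; binary
# activation agreement is a simplifying assumption.
import numpy as np

def neuron_state_alignment(acts_src, acts_tgt, threshold=0.0):
    """acts_*: (num_pairs, num_neurons) activations for parallel sentences."""
    s = acts_src > threshold          # binary neuron states, source language
    t = acts_tgt > threshold          # binary neuron states, target language
    return (s == t).mean()            # fraction of neurons in the same state

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 4096))   # stand-in activations for 100 sentence pairs
print(neuron_state_alignment(base, base + 0.1 * rng.normal(size=base.shape)))
```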
Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs
Authors: Boyi Deng, Yu Wan, Baosong Yang, Fei Huang, Wenjie Wang, Fuli Feng
2025-07-20
http://arxiv.org/abs/2507.14894v1
Large Language Models (LLMs) have impressive multilingual capabilities, but
they suffer from unexpected code-switching, also known as language mixing,
which involves switching to unexpected languages in the model response. This
problem leads to poor readability and degrades the usability of model
responses. However, existing work on this issue lacks a mechanistic analysis
and shows limited effectiveness. In this paper, we first provide an in-depth
analysis of unexpected code-switching using sparse autoencoders and find that
when LLMs switch to a language, the features of that language exhibit
excessive pre-activation values. Based on our findings, we propose Sparse
Autoencoder-guided Supervised Fine-Tuning (SASFT), which teaches LLMs to
maintain appropriate pre-activation values of specific language features
during training. Experiments on five models across three languages demonstrate
that SASFT consistently reduces unexpected code-switching by more than 50%
compared to standard supervised fine-tuning, with complete elimination in four
cases.
Moreover, SASFT maintains or even improves the models' performance on six
multilingual benchmarks, showing its effectiveness in addressing code-switching
while preserving multilingual capabilities.
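A hedged sketch of an SASFT-style auxiliary loss: penalize pre-activations of
designated language features whenever they exceed a target value during
fine-tuning. The feature indices, target, and weight alpha are assumptions.

```python
# Sketch of an auxiliary loss that discourages excessive pre-activation of
# SAE features tied to an unwanted language; indices/targets are assumptions.
import torch
import torch.nn.functional as F

def sasft_loss(lm_loss, hidden, sae_encoder_weight, lang_feature_ids, target, alpha=0.1):
    """hidden: (B, T, d) residual stream; sae_encoder_weight: (num_features, d)."""
    pre_acts = hidden @ sae_encoder_weight[lang_feature_ids].T   # (B, T, k) pre-activations
    excess = F.relu(pre_acts - target)                           # penalize only values above target
    return lm_loss + alpha * excess.mean()

hidden = torch.randn(2, 8, 512, requires_grad=True)   # stand-in hidden states
W_enc = torch.randn(16384, 512)                       # stand-in SAE encoder weights
loss = sasft_loss(torch.tensor(2.3), hidden, W_enc, torch.tensor([3, 77]), target=1.0)
loss.backward()
print(loss.item())
```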
Tiny language models
Authors: Ronit D. Gross, Yarden Tzach, Tal Halevi, Ella Koresh, Ido Kanter
2025-07-20
http://arxiv.org/abs/2507.14871v2
A prominent achievement of natural language processing (NLP) is its ability
to understand and generate meaningful human language. This capability relies on
the complex feedforward transformer block architectures of pre-trained large
language models (LLMs). However, LLM pre-training is currently feasible only
for a few dominant companies due to the immense computational resources
for a few dominant companies due to the immense computational resources
required, limiting broader research participation. This creates a critical need
for more accessible alternatives. In this study, we explore whether tiny
language models (TLMs) exhibit the same key qualitative features of LLMs. We
demonstrate that TLMs exhibit a clear performance gap between pre-trained and
non-pre-trained models across classification tasks, indicating the
effectiveness of pre-training, even at a tiny scale. The performance gap
increases with the size of the pre-training dataset and with greater overlap
between tokens in the pre-training and classification datasets. Furthermore,
the classification accuracy achieved by a pre-trained deep TLM architecture can
be replicated through a soft committee of multiple, independently pre-trained
shallow architectures, enabling low-latency TLMs without affecting
classification accuracy. Our results are based on pre-training BERT-6 and
variants of BERT-1 on subsets of the Wikipedia dataset and evaluating their
performance on FewRel, AGNews, and DBPedia classification tasks. Future
research on TLMs is expected to further illuminate the mechanisms underlying
NLP, especially given that biologically inspired models suggest TLMs
may be sufficient for children or adolescents to develop language. The data and
code that support the findings of this study are openly available on
https://github.com/Rg32601/Tiny-Language-Models .
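The soft-committee result can be pictured with a minimal sketch: average the
output probabilities of several independently trained shallow classifiers.
The stand-in MLPs below substitute for the pre-trained BERT-1 variants used in
the paper.

```python
# Minimal "soft committee": average softmax outputs of several independently
# initialized/trained shallow models (stand-ins, not BERT-1 itself).
import torch
import torch.nn as nn

class ShallowClassifier(nn.Module):
    def __init__(self, dim=128, num_classes=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))

    def forward(self, x):
        return self.net(x)

committee = [ShallowClassifier() for _ in range(5)]   # independently trained members
x = torch.randn(3, 128)
probs = torch.stack([m(x).softmax(-1) for m in committee]).mean(0)  # soft vote
print(probs.argmax(-1))
```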
An Evaluation of DUSt3R/MASt3R/VGGT 3D Reconstruction on Photogrammetric Aerial Blocks
Authors: Xinyi Wu, Steven Landgraf, Markus Ulrich, Rongjun Qin
2025-07-20
http://arxiv.org/abs/2507.14798v1
State-of-the-art 3D computer vision algorithms continue to advance in
handling sparse, unordered image sets. Recently developed foundational models
for 3D reconstruction, such as Dense and Unconstrained Stereo 3D Reconstruction
(DUSt3R), Matching and Stereo 3D Reconstruction (MASt3R), and Visual Geometry
Grounded Transformer (VGGT), have attracted attention due to their ability to
handle very sparse image overlaps. Evaluating DUSt3R/MASt3R/VGGT on typical
aerial images matters, as these models may handle extremely low image overlaps,
stereo occlusions, and textureless regions. For redundant collections, they can
accelerate 3D reconstruction by using extremely sparsified image sets. Despite
tests on various computer vision benchmarks, their potential on photogrammetric
aerial blocks remains unexplored. This paper conducts a comprehensive
evaluation of the pre-trained DUSt3R/MASt3R/VGGT models on the aerial blocks of
the UseGeo dataset for pose estimation and dense 3D reconstruction. Results
show these methods can accurately reconstruct dense point clouds from very
sparse image sets (fewer than 10 images, up to 518 pixels resolution), with
completeness gains up to +50% over COLMAP. VGGT also demonstrates higher
computational efficiency, scalability, and more reliable camera pose
estimation. However, all exhibit limitations with high-resolution images and
large sets, as pose reliability declines with more images and geometric
complexity. These findings suggest that transformer-based methods cannot fully
replace traditional SfM and MVS, but offer promise as complementary approaches,
especially in challenging, low-resolution, and sparse scenarios.
LeAdQA LLM-Driven Context-Aware Temporal Grounding for Video Question Answering
Authors: Xinxin Dong, Baoyun Peng, Haokai Ma, Yufei Wang, Zixuan Dong, Fei Hu, Xiaodong Wang
2025-07-20
http://arxiv.org/abs/2507.14784v1
Video Question Answering (VideoQA) requires identifying critical
moments in long videos and reasoning about their causal relationships to answer
semantically complex questions. While recent advances in multimodal learning
have improved alignment and fusion, current approaches remain limited by two
prevalent but fundamentally flawed strategies: (1) task-agnostic sampling
indiscriminately processes all frames, overwhelming key events with irrelevant
content; and (2) heuristic retrieval captures superficial patterns but misses
causal-temporal structures needed for complex reasoning. To address these
challenges, we introduce LeAdQA, an innovative approach that bridges these gaps
through synergizing causal-aware query refinement with fine-grained visual
grounding. Our method first leverages LLMs to reformulate question-option
pairs, resolving causal ambiguities and sharpening temporal focus. These
refined queries subsequently direct a temporal grounding model to precisely
retrieve the most salient segments, complemented by an adaptive fusion
mechanism dynamically integrating the evidence to maximize relevance. The
integrated visual-textual cues are then processed by an MLLM to generate
accurate, contextually grounded answers. Experiments on NExT-QA, IntentQA, and
NExT-GQA demonstrate that our method's precise visual grounding substantially
enhances the understanding of video-question relationships, achieving
state-of-the-art (SOTA) performance on complex reasoning tasks while
maintaining computational efficiency.
CXR-TFT Multi-Modal Temporal Fusion Transformer for Predicting Chest X-ray Trajectories
Authors: Mehak Arora, Ayman Ali, Kaiyuan Wu, Carolyn Davis, Takashi Shimazui, Mahmoud Alwakeel, Victor Moas, Philip Yang, Annette Esper, Rishikesan Kamaleswaran
2025-07-19
http://arxiv.org/abs/2507.14766v1
In intensive care units (ICUs), patients with complex clinical conditions
require vigilant monitoring and prompt interventions. Chest X-rays (CXRs) are a
vital diagnostic tool, providing insights into clinical trajectories, but their
irregular acquisition limits their utility. Existing tools for CXR
interpretation are constrained by cross-sectional analysis, failing to capture
temporal dynamics. To address this, we introduce CXR-TFT, a novel multi-modal
framework that integrates temporally sparse CXR imaging and radiology reports
with high-frequency clinical data, such as vital signs, laboratory values, and
respiratory flow sheets, to predict the trajectory of CXR findings in
critically ill patients. CXR-TFT leverages latent embeddings from a vision
encoder that are temporally aligned with hourly clinical data through
interpolation. A transformer model is then trained to predict CXR embeddings at
each hour, conditioned on previous embeddings and clinical measurements. In a
retrospective study of 20,000 ICU patients, CXR-TFT demonstrated high accuracy
in forecasting abnormal CXR findings up to 12 hours before they became
radiographically evident. This predictive capability, anchored in
high-frequency clinical data, holds
significant potential for enhancing the management of time-sensitive conditions
like acute respiratory distress syndrome, where early intervention is crucial
and diagnoses are often delayed. By providing distinctive temporal resolution
in prognostic CXR analysis, CXR-TFT offers actionable 'whole patient' insights
that can directly improve clinical outcomes.
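A minimal sketch of the temporal-alignment step, assuming linear interpolation
of sparsely acquired CXR embeddings onto an hourly grid; the array shapes are
illustrative.

```python
# Interpolate sparse CXR embeddings onto an hourly grid so they can be fused
# with hourly clinical data; shapes and np.interp usage are assumptions.
import numpy as np

def align_hourly(cxr_times, cxr_embeds, horizon_hours):
    """cxr_times: sorted acquisition hours; cxr_embeds: (num_cxrs, d). Returns (horizon, d)."""
    grid = np.arange(horizon_hours)
    return np.stack([np.interp(grid, cxr_times, cxr_embeds[:, j])
                     for j in range(cxr_embeds.shape[1])], axis=1)

embeds = np.random.randn(3, 16)                  # CXRs taken at hours 0, 10, 30
aligned = align_hourly([0, 10, 30], embeds, 48)  # hourly embeddings over a 48 h stay
print(aligned.shape)                             # (48, 16), ready to concat with vitals/labs
```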
GRACE Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization
Authors: Luyi Ma, Wanjia Zhang, Kai Zhao, Abhishek Kulkarni, Lalitesh Morishetti, Anjana Ganesh, Ashish Ranjan, Aashika Padmanabhan, Jianpeng Xu, Jason Cho, Praveen Kanumala, Kaushiki Nag, Sumit Dutta, Kamiya Motwani, Malay Patel, Evren Korpeoglu, Sushant Kumar, Kannan Achan
2025-07-19
http://arxiv.org/abs/2507.14758v1
Generative models have recently demonstrated strong potential in
multi-behavior recommendation systems, leveraging the expressive power of
transformers and tokenization to generate personalized item sequences. However,
their adoption is hindered by (1) the lack of explicit information for token
reasoning, (2) high computational costs due to quadratic attention complexity
and dense sequence representations after tokenization, and (3) limited
multi-scale modeling over user history. In this work, we propose GRACE
(Generative Recommendation via journey-aware Sparse Attention on
Chain-of-thought tokEnization), a novel generative framework for multi-behavior
sequential recommendation. GRACE introduces a hybrid Chain-of-Thought (CoT)
tokenization method that encodes user-item interactions with explicit
attributes from product knowledge graphs (e.g., category, brand, price) over
semantic tokenization, enabling interpretable and behavior-aligned generation.
To address the inefficiency of standard attention, we design a Journey-Aware
Sparse Attention (JSA) mechanism, which selectively attends to compressed,
intra-, inter-, and current-context segments in the tokenized sequence.
Experiments on two real-world datasets show that GRACE significantly
outperforms state-of-the-art baselines, achieving up to +106.9% HR@10 and
+106.7% NDCG@10 improvement over the state-of-the-art baseline on the Home
domain, and +22.1% HR@10 on the Electronics domain. GRACE also reduces
attention computation by up to 48% with long sequences.
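A sketch of what a journey-aware sparse attention mask might look like: every
token attends to a few compressed summary slots, its own segment, and a
current-context tail. The segment layout and budgets are assumptions, and
causal masking is omitted for brevity.

```python
# Build a block-sparse boolean attention mask over summary slots, intra-segment
# links, and a current-context window; all budgets are illustrative.
import torch

def jsa_mask(seg_ids, n_summary=4, current_tail=8):
    """seg_ids: (T,) segment id per token. Returns (T+n_summary, T+n_summary) bool mask."""
    T = seg_ids.shape[0]
    N = T + n_summary
    allow = torch.zeros(N, N, dtype=torch.bool)
    allow[:, :n_summary] = True                                # everyone sees summary slots
    same = seg_ids[:, None] == seg_ids[None, :]                # intra-segment attention
    allow[n_summary:, n_summary:] = same
    allow[:, N - current_tail:] = True                         # current-context window
    return allow

seg = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3])
mask = jsa_mask(seg)
print(mask.shape, mask.float().mean().item())                  # density well below 1.0
```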
Spatial-Temporal Transformer with Curriculum Learning for EEG-Based Emotion Recognition
Authors: Xuetao Lin, Tianhao Peng, Peihong Dai, Yu Liang, Wenjun Wu
2025-07-19
http://arxiv.org/abs/2507.14698v1
EEG-based emotion recognition plays an important role in developing adaptive
brain-computer interface systems, yet faces two fundamental challenges in
practical implementations: (1) effective integration of non-stationary
spatial-temporal neural patterns, and (2) robust adaptation to dynamic
emotional intensity variations in real-world scenarios. This paper proposes
SST-CL, a novel framework integrating spatial-temporal transformers with
curriculum
learning. Our method introduces two core components: a spatial encoder that
models inter-channel relationships and a temporal encoder that captures
multi-scale dependencies through windowed attention mechanisms, enabling
simultaneous extraction of spatial correlations and temporal dynamics from EEG
signals. Complementing this architecture, an intensity-aware curriculum
learning strategy progressively guides training from high-intensity to
low-intensity emotional states through dynamic sample scheduling based on a
dual difficulty assessment. Comprehensive experiments on three benchmark
datasets demonstrate state-of-the-art performance across various emotional
intensity levels, with ablation studies confirming the necessity of both
architectural components and the curriculum learning mechanism.
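A minimal sketch of an intensity-aware curriculum, assuming a linear pacing
function that exposes high-intensity (easier) samples first; the pacing and
sampling rules are assumptions, not the paper's exact scheduler.

```python
# Intensity-aware curriculum sampler: early epochs draw from the easiest
# (highest-intensity) pool, which grows as training progresses.
import random

def curriculum_batch(samples, epoch, total_epochs, batch_size=32):
    """samples: list of (x, y, intensity) with intensity in [0, 1]; higher = easier."""
    frac = min(1.0, (epoch + 1) / (0.7 * total_epochs))    # pacing: fraction of data exposed
    ordered = sorted(samples, key=lambda s: -s[2])          # high intensity first
    pool = ordered[: max(batch_size, int(frac * len(ordered)))]
    return random.sample(pool, batch_size)

data = [(None, None, random.random()) for _ in range(1000)]
early = curriculum_batch(data, epoch=0, total_epochs=50)
late = curriculum_batch(data, epoch=49, total_epochs=50)
print(sum(s[2] for s in early) / 32, sum(s[2] for s in late) / 32)  # mean intensity drops
```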
Enabling Efficient Hardware Acceleration of Hybrid Vision Transformer (ViT) Networks at the Edge
Authors: Joren Dumoulin, Pouya Houshmand, Vikram Jain, Marian Verhelst
2025-07-19
http://arxiv.org/abs/2507.14651v1
Hybrid vision transformers combine the elements of conventional neural
networks (NN) and vision transformers (ViT) to enable lightweight and accurate
detection. However, several challenges remain for their efficient deployment on
resource-constrained edge devices. The hybrid models suffer from a widely
diverse set of NN layer types and large intermediate data tensors, hampering
efficient hardware acceleration. To enable their execution at the edge, this
paper proposes innovations across the hardware-scheduling stack: a.) At the
lowest level, a configurable PE array supports all hybrid ViT layer types; b.)
temporal loop re-ordering within one layer enables hardware support for
normalization and softmax layers, minimizing on-chip data transfers; c.)
further scheduling optimization employs layer fusion across inverted bottleneck
layers to drastically reduce off-chip memory transfers. The resulting
accelerator is implemented in 28nm CMOS, achieving a peak energy efficiency of
1.39 TOPS/W at 25.6 GMACs/s.
Linear Relational Decoding of Morphology in Language Models
Authors: Eric Xia, Jugal Kalita
2025-07-19
http://arxiv.org/abs/2507.14640v1
A two-part affine approximation has been found to be a good approximation for
transformer computations over certain subject-object relations. Adapting the
Bigger Analogy Test Set, we show that the linear transformation Ws, where s is
a middle layer representation of a subject token and W is derived from model
derivatives, is also able to accurately reproduce final object states for many
relations. This linear technique achieves 90% faithfulness on
morphological relations, and we show similar findings multilingually and
across models. Our findings indicate that some conceptual relationships in
language models, such as morphology, are readily interpretable from latent
space, and are sparsely encoded by cross-layer linear transformations.
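A toy demonstration of the two-part affine map o ≈ Ws + b, with W taken from
model derivatives (the Jacobian at a reference input); a small MLP stands in
for the subject-to-object computation of a real language model.

```python
# Affine approximation of a nonlinear map: W = Jacobian at a reference point,
# b chosen so the map is exact there; the network is a stand-in.
import torch

torch.manual_seed(0)
f = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.Tanh(), torch.nn.Linear(32, 32))

s0 = torch.randn(32)                                  # reference subject representation
W = torch.autograd.functional.jacobian(f, s0)         # (32, 32) local linear map
b = f(s0) - W @ s0                                    # bias makes the affine map exact at s0

s = s0 + 0.05 * torch.randn(32)                       # nearby subject representation
approx, exact = W @ s + b, f(s)
print(torch.nn.functional.cosine_similarity(approx, exact, dim=0).item())  # close to 1
```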
KinForm Kinetics Informed Feature Optimised Representation Models for Enzyme kcat and KM Prediction
Authors: Saleh Alwer, Ronan Fleming
2025-07-19
http://arxiv.org/abs/2507.14639v1
Kinetic parameters such as the turnover number (kcat) and the Michaelis
constant (KM) are essential for modelling enzymatic activity, but
experimental data remain limited in scale and diversity. Previous methods for
predicting enzyme kinetics typically use mean-pooled residue embeddings from a
single protein language model to represent the protein. We present KinForm, a
machine learning framework designed to improve predictive accuracy and
generalisation for kinetic parameters by optimising protein feature
representations. KinForm combines several residue-level embeddings
(Evolutionary Scale Modeling Cambrian, Evolutionary Scale Modeling 2, and
ProtT5-XL-UniRef50), taken from empirically selected intermediate transformer
layers, and applies weighted pooling based on per-residue binding-site
probability. To counter the resulting high dimensionality, we apply
dimensionality reduction using principal component analysis (PCA) on
concatenated protein features, and rebalance the training data via a
similarity-based oversampling strategy. KinForm outperforms baseline methods on
two benchmark datasets. Improvements are most pronounced in low sequence
similarity bins. We observe improvements from binding-site probability pooling,
intermediate-layer selection, PCA, and oversampling of low-identity proteins.
We also find that removing sequence overlap between folds provides a more
realistic evaluation of generalisation and should be the standard over random
splitting when benchmarking kinetic prediction models.
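A sketch of the pooling and reduction steps, assuming per-residue embeddings
and binding-site probabilities are already available; embedding sizes and the
weight normalization are assumptions.

```python
# Binding-site-probability-weighted pooling of residue embeddings, followed by
# PCA on the pooled protein features; dimensions are illustrative.
import numpy as np
from sklearn.decomposition import PCA

def weighted_pool(residue_embeds, site_probs):
    """residue_embeds: (L, d); site_probs: (L,) binding-site probabilities."""
    w = site_probs / (site_probs.sum() + 1e-8)
    return (w[:, None] * residue_embeds).sum(axis=0)          # (d,) protein-level vector

rng = np.random.default_rng(0)
proteins = [weighted_pool(rng.normal(size=(200, 1280)), rng.uniform(size=200))
            for _ in range(100)]
X = np.stack(proteins)                                        # (100, 1280) pooled features
X_red = PCA(n_components=50).fit_transform(X)                 # dimensionality reduction
print(X_red.shape)
```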
Agentic Satellite-Augmented Low-Altitude Economy and Terrestrial Networks A Survey on Generative Approaches
Authors: Xiaozheng Gao, Yichen Wang, Bosen Liu, Xiao Zhou, Ruichen Zhang, Jiacheng Wang, Dusit Niyato, Dong In Kim, Abbas Jamalipour, Chau Yuen, Jianping An, Kai Yang
2025-07-19
http://arxiv.org/abs/2507.14633v1
The development of satellite-augmented low-altitude economy and terrestrial
networks (SLAETNs) demands intelligent and autonomous systems that can operate
reliably across heterogeneous, dynamic, and mission-critical environments. To
address these challenges, this survey focuses on enabling agentic artificial
intelligence (AI), that is, artificial agents capable of perceiving, reasoning,
and acting, through generative AI (GAI) and large language models (LLMs). We
begin by introducing the architecture and characteristics of SLAETNs, and
analyzing the challenges that arise in integrating satellite, aerial, and
terrestrial components. Then, we present a model-driven foundation by
systematically reviewing five major categories of generative models:
variational autoencoders (VAEs), generative adversarial networks (GANs),
generative diffusion models (GDMs), transformer-based models (TBMs), and
LLMs.
Moreover, we provide a comparative analysis to highlight their generative
mechanisms, capabilities, and deployment trade-offs within SLAETNs. Building on
this foundation, we examine how these models empower agentic functions across
three domains: communication enhancement, security and privacy protection, and
intelligent satellite tasks. Finally, we outline key future directions for
building scalable, adaptive, and trustworthy generative agents in SLAETNs. This
survey aims to provide a unified understanding and actionable reference for
advancing agentic AI in next-generation integrated networks.
Efficient LLM Inference Bandwidth, Compute, Synchronization, and Capacity are all you need
Authors: Michael Davies, Neal Crago, Karthikeyan Sankaralingam, Christos Kozyrakis
2025-07-18
http://arxiv.org/abs/2507.14397v1
This paper presents a limit study of transformer-based large language model
(LLM) inference, focusing on the fundamental performance bottlenecks imposed by
memory bandwidth, memory capacity, and synchronization overhead in distributed
inference systems. We develop a hardware-agnostic performance model that
abstracts away implementation details, enabling the analysis of a wide range of
current and near-future hardware technologies. Our analysis spans from current
HBM3 memory technology used in AI accelerators like GPUs and TPUs to systems
based on advanced HBM4 and advanced 3D-stacked DRAM technology. It also covers
SRAM-based designs and scaling techniques from distributed clusters with
varying numbers of chips to wafer-scale integration. Our key findings for
auto-regressive decoding are: i) serving LLMs requires 100s of GB per server to
serve a model instance; ii) high memory bandwidth is critical for high per-user
throughput; iii) exposed synchronization latencies for collective
communication must be around 1 us, else they render the memory bandwidth
ineffective; iv) DRAM-based designs have a fundamental advantage in terms of
system-level efficiency as measured in throughput per cost or watt; and v)
hardware designs can easily reach 2000+ user token/sec but getting to 10,000+
tokens/sec will need smaller models, smaller context, or other forms of
algorithmic advances. This study provides valuable insights into the
fundamental performance limits of LLM inference, highlighting the potential
benefits of future hardware advancements and guiding the optimization of
deployment strategies.
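The bandwidth-bound decoding regime can be illustrated with a
back-of-the-envelope calculator; the hardware numbers below (fp16 weights,
~3.35 TB/s per chip, 512 chips) are illustrative assumptions, not the paper's
exact model.

```python
# Toy model of bandwidth-bound auto-regressive decoding with an exposed
# synchronization cost per step; all numbers are illustrative.
def decode_tokens_per_sec(param_bytes, mem_bw_bytes_per_s, n_chips, sync_latency_s):
    """Each decoded token must stream the (sharded) weights once per chip."""
    t_mem = (param_bytes / n_chips) / mem_bw_bytes_per_s   # per-chip weight-read time
    return 1.0 / (t_mem + sync_latency_s)                  # sync adds to every step

weights = 70e9 * 2                                         # 70B params in fp16
for sync in (1e-6, 10e-6, 100e-6):                         # exposed collective latency
    tps = decode_tokens_per_sec(weights, 3.35e12, 512, sync)
    print(f"sync={sync*1e6:5.0f} us -> {tps:8,.0f} tok/s")
```

At this scale the weight-read time is about 82 us per step, so a 100 us
synchronization latency roughly halves throughput while a 1 us latency is
negligible, echoing finding iii) above.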
Characterizing Communication Patterns in Distributed Large Language Model Inference
Authors: Lang Xu, Kaushik Kandadi Suresh, Quentin Anthony, Nawras Alnaasan, Dhabaleswar K. Panda
2025-07-18
http://arxiv.org/abs/2507.14392v1
Large Language Models (LLMs) built on transformer architectures have
transformed natural language processing, achieving remarkable performance
across diverse applications. While distributed inference frameworks enable
practical deployment of these models, inter-GPU communication creates
significant performance constraints that limit service quality in real-world
systems. This paper investigates communication dynamics in distributed LLM
serving, analyzing how various parallelization approaches coordinate data
exchange between GPU workers during inference. We study dense transformer-based
models as representative examples of contemporary architectures widely used in
operational deployments. Our work combines detailed profiling measurements with
predictive analytical models to characterize communication behavior across
different parallelization configurations. Results show that tensor parallelism
incurs substantial network overhead but delivers superior response times for
brief sequences, pipeline parallelism minimizes data transfer requirements
while increasing total latency, and combined approaches demand careful tuning
to achieve balanced performance. These insights offer practical recommendations
for selecting appropriate parallelization schemes in production LLM services
and identify key opportunities for optimizing inference frameworks and
infrastructure.
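The tensor- versus pipeline-parallel trade-off can be approximated with
textbook communication-volume estimates; the formulas and 70B-class dimensions
below are assumptions, not the authors' profiling model.

```python
# Rough per-token communication volume for tensor parallelism (TP) vs pipeline
# parallelism (PP) during decode; textbook estimates, not measured values.
def tp_bytes_per_token(hidden, layers, tp_degree, dtype_bytes=2):
    # TP: two all-reduces per layer, each moving ~2*(tp-1)/tp of one activation
    per_allreduce = 2 * (tp_degree - 1) / tp_degree * hidden * dtype_bytes
    return layers * 2 * per_allreduce

def pp_bytes_per_token(hidden, pp_degree, dtype_bytes=2):
    # PP: one point-to-point activation transfer per stage boundary
    return (pp_degree - 1) * hidden * dtype_bytes

h, L = 8192, 80                                    # assumed 70B-class dimensions
print("TP8:", tp_bytes_per_token(h, L, 8) / 1e6, "MB/token")
print("PP8:", pp_bytes_per_token(h, 8) / 1e6, "MB/token")
```

The orders-of-magnitude gap (a few MB versus ~0.1 MB per token) matches the
qualitative finding that pipeline parallelism minimizes data transfer while
tensor parallelism pays substantial network overhead for lower latency.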
DPMT Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration
Authors: Xiyun Li, Yining Ding, Yuhua Jiang, Yunlong Zhao, Runpeng Xie, Shuang Xu, Yuanhua Ni, Yiqin Yang, Bo Xu
2025-07-18
http://arxiv.org/abs/2507.14088v1
Real-time human-artificial intelligence (AI) collaboration is crucial yet
challenging, especially when AI agents must adapt to diverse and unseen human
behaviors in dynamic scenarios. Existing large language model (LLM) agents
often fail to accurately model complex human mental characteristics such as
domain intentions, especially in the absence of direct communication. To
address this limitation, we propose a novel dual process multi-scale theory of
mind (DPMT) framework, drawing inspiration from cognitive science dual process
theory. Our DPMT framework incorporates a multi-scale theory of mind (ToM)
module to facilitate robust human partner modeling through mental
characteristic reasoning. Experimental results demonstrate that DPMT
significantly enhances human-AI collaboration, and ablation studies further
validate the contributions of our multi-scale ToM in the slow system.
KROMA Ontology Matching with Knowledge Retrieval and Large Language Models
Authors: Lam Nguyen, Erika Barcelos, Roger French, Yinghui Wu
2025-07-18
http://arxiv.org/abs/2507.14032v1
Ontology Matching (OM) is a cornerstone task of semantic interoperability,
yet existing systems often rely on handcrafted rules or specialized models with
limited adaptability. We present KROMA, a novel OM framework that harnesses
Large Language Models (LLMs) within a Retrieval-Augmented Generation (RAG)
pipeline to dynamically enrich the semantic context of OM tasks with
structural, lexical, and definitional knowledge. To optimize both performance
and efficiency, KROMA integrates bisimilarity-based concept matching and a
lightweight ontology refinement step, which prune candidate concepts and
substantially reduce the communication overhead of invoking LLMs. Through
experiments on multiple benchmark datasets, we show that integrating knowledge
retrieval with context-augmented LLMs significantly enhances ontology matching,
outperforming both classic OM systems and cutting-edge LLM-based approaches
while keeping communication overhead comparable. Our study highlights the
feasibility and benefit of the proposed optimization techniques (targeted
knowledge retrieval, prompt enrichment, and ontology refinement) for ontology
matching at scale.
LoopServe An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues
Authors: Haoyang Li, Zhanchao Xu, Yiming Li, Xuejia Chen, Darian Li, Anxin Tian, Qingfa Xiao, Cheng Deng, Jun Wang, Qing Li, Lei Chen, Mingxuan Yuan
2025-07-18
http://arxiv.org/abs/2507.13681v1
Multi-turn dialogues are essential in many real-world applications of large
language models, such as chatbots and virtual assistants. As conversation
histories become longer, existing large language models face increasing
computational and memory challenges, which hinder their ability to provide
efficient and responsive interactions. Most current methods either
compress the context or optimize key value caching, but they often rely on
fixed or position-based heuristics that do not adapt well to the dynamic and
unpredictable patterns found in actual multi-turn conversations. In this paper,
we present LoopServe, an adaptive dual-phase LLM inference acceleration
framework
for large language models in multi-turn dialogues. LoopServe introduces two
main innovations. First, it performs online sparsification during the
prefilling phase by dynamically selecting the most important parts of the
attention matrix for each new input. Second, it uses progressive key value
compression during decoding by adaptively maintaining a relevant and efficient
cache based on the most recently generated output tokens. We also propose a
new benchmark
(https://huggingface.co/datasets/TreeAILab/Multi-turn_Long-context_Benchmark_for_LLMs)
with eleven multi-turn datasets that reflect realistic query positions and
conversational dependencies. Extensive experiments demonstrate
that LoopServe consistently achieves superior effectiveness compared to
existing baselines and significantly accelerates LLM inference across a wide
range of long-context dialogue tasks.
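A hedged sketch of progressive KV compression in this spirit: score cached
entries by the attention mass that recent output tokens place on them, then
keep the top-scoring entries plus a recency window. The budget and scoring
heuristic are assumptions, not LoopServe's exact algorithm.

```python
# Attention-guided KV cache compression: keep entries most attended by recent
# queries plus a recency window; budgets/heuristics are assumptions.
import torch

def compress_kv(keys, values, recent_queries, budget=256, keep_recent=32):
    """keys/values: (T, d); recent_queries: (q, d) from the latest decoded tokens."""
    T = keys.shape[0]
    scores = (recent_queries @ keys.T).softmax(-1).sum(0)     # aggregate attention mass
    scores[-keep_recent:] = float("inf")                      # always keep a recency window
    keep = torch.topk(scores, min(budget, T)).indices.sort().values
    return keys[keep], values[keep]

K, V = torch.randn(4096, 128), torch.randn(4096, 128)         # stand-in cache
q = torch.randn(8, 128)                                       # recent decode queries
K2, V2 = compress_kv(K, V, q)
print(K2.shape)  # torch.Size([256, 128])
```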