2025-08-29

Table of Contents

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

Authors: Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo

2025-08-27

http://arxiv.org/abs/2508.20072v1

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions to robot actions. However, prevailing VLA decoders either generate actions autoregressively in a fixed left-to-right order or attach continuous diffusion or flow matching heads outside the backbone, demanding specialized training and iterative sampling that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a single-transformer policy that models discretized action chunks with discrete diffusion and is trained with the same cross-entropy objective as the VLM backbone. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary remasking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pretrained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. SR on LIBERO, 71.2% visual matching on SimplerEnv Fractal and 49.3% overall on SimplerEnv Bridge, improving over both autoregressive and continuous diffusion baselines. These findings indicate that the discrete-diffusion action decoder supports precise action modeling and consistent training, laying the groundwork for scaling VLA to larger models and datasets.
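A minimal sketch (not the authors' code) of the refinement loop described above: all action tokens start masked, each round the model predicts every masked position in parallel, and the least confident predictions are re-masked so later rounds can revisit them. The predict_logits callable, the linear unmasking schedule, and greedy argmax selection are assumptions for illustration.

```python
import numpy as np

MASK = -1  # sentinel id for a masked action token

def discrete_diffusion_decode(predict_logits, seq_len, rounds=4):
    """Parallel refinement with confidence-based secondary remasking.

    predict_logits(tokens) -> (seq_len, vocab) logits, where entries equal to
    MASK are treated as masked positions. This callable stands in for the
    VLM/VLA backbone and is a placeholder, not the paper's interface.
    """
    tokens = np.full(seq_len, MASK)
    conf = np.zeros(seq_len)
    for r in range(rounds):
        masked = tokens == MASK
        logits = predict_logits(tokens)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        # fill every masked slot in parallel; easy (confident) elements settle first
        conf = np.where(masked, probs.max(-1), conf)
        tokens = np.where(masked, probs.argmax(-1), tokens)
        if r < rounds - 1:
            # secondary remasking: keep only the most confident tokens so far,
            # re-opening uncertain predictions for the next refinement round
            n_keep = int(np.ceil(seq_len * (r + 1) / rounds))
            keep = np.argsort(-conf)[:n_keep]
            remask = np.setdiff1d(np.arange(seq_len), keep)
            tokens[remask] = MASK
    return tokens
```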

Symphony: A Decentralized Multi-Agent Framework for Scalable Collective Intelligence

Authors: Ji Wang, Kashing Chen, Xinyuan Song, Ke Zhang, Lynn Ai, Eric Yang, Bill Shi

2025-08-27

http://arxiv.org/abs/2508.20019v1

Most existing Large Language Model (LLM)-based agent frameworks rely on centralized orchestration, incurring high deployment costs, rigid communication topologies, and limited adaptability. To address these challenges, we introduce Symphony, a decentralized multi-agent system that enables lightweight LLMs on consumer-grade GPUs to coordinate. Symphony introduces three key mechanisms: (1) a decentralized ledger that records capabilities, (2) a Beacon-selection protocol for dynamic task allocation, and (3) weighted result voting based on CoTs. This design forms a privacy-preserving, scalable, and fault-tolerant orchestration with low overhead. Empirically, Symphony outperforms existing baselines on reasoning benchmarks, achieving substantial accuracy gains and demonstrating robustness across models of varying capacities.
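The weighted-voting step can be illustrated with a short sketch: each worker agent returns a final answer plus a weight (for example, a CoT self-evaluation score), and the coordinator tallies weighted votes. The field names and the weighting rule are illustrative assumptions, not Symphony's actual protocol.

```python
from collections import defaultdict

def weighted_vote(results):
    """Aggregate agent outputs by weighted majority vote.

    `results` is a list of dicts like {"answer": ..., "weight": ...}; the weight
    might come from a CoT quality score (an assumption here, not the paper's
    exact scheme).
    """
    tally = defaultdict(float)
    for r in results:
        tally[r["answer"]] += r["weight"]
    return max(tally, key=tally.get)

# Example: three lightweight agents disagree; the weighted vote picks "42".
print(weighted_vote([
    {"answer": "42", "weight": 0.9},
    {"answer": "41", "weight": 0.4},
    {"answer": "42", "weight": 0.7},
]))
```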

Optimal Remainder Estimates in the Quantization of Complex Projective Spaces

Authors: Tommaso Aschieri, Błażej Ruba, Jan Philip Solovej

2025-08-27

http://arxiv.org/abs/2508.19968v1

We study Berezin-Toeplitz quantization of complex projective spaces and obtain full asymptotic expansions of the Berezin transform and of products of Toeplitz operators. In each case, the remainder is controlled by the next term of the expansion, either through a positivity-preserving transformation or via an operator inequality. This leads to bounds which are optimal in terms of the required regularity and feature sharp or asymptotically sharp constants.

Secure Multi-LLM Agentic AI and Agentification for Edge General Intelligence by Zero-Trust: A Survey

Authors: Yinqiu Liu, Ruichen Zhang, Haoxiang Luo, Yijing Lin, Geng Sun, Dusit Niyato, Hongyang Du, Zehui Xiong, Yonggang Wen, Abbas Jamalipour, Dong In Kim, Ping Zhang

2025-08-27

http://arxiv.org/abs/2508.19870v1

Agentification serves as a critical enabler of Edge General Intelligence (EGI), transforming massive edge devices into cognitive agents by integrating Large Language Models (LLMs) with perception, reasoning, and acting modules. These agents collaborate across heterogeneous edge infrastructures, forming multi-LLM agentic AI systems that leverage collective intelligence and specialized capabilities to tackle complex, multi-step tasks. However, the collaborative nature of multi-LLM systems introduces critical security vulnerabilities, including insecure inter-LLM communications, expanded attack surfaces, and cross-domain data leakage that traditional perimeter-based security cannot adequately address. To this end, this survey introduces zero-trust security for multi-LLM systems in EGI, a paradigmatic shift following the "never trust, always verify" principle. We begin by systematically analyzing the security risks of multi-LLM systems within EGI contexts. Subsequently, we present the vision of a zero-trust multi-LLM framework in EGI. We then survey key technical progress toward zero-trust multi-LLM systems in EGI. In particular, we categorize zero-trust security mechanisms into model-level approaches (e.g., strong identification and context-aware access control) and system-level approaches (e.g., proactive maintenance and blockchain-based management). Finally, we identify critical research directions. This survey serves as the first systematic treatment of zero-trust applied to multi-LLM systems, providing both theoretical foundations and practical strategies.

Authors: Shuo Shao, Yiming Li, Yu He, Hongwei Yao, Wenyuan Yang, Dacheng Tao, Zhan Qin

2025-08-27

http://arxiv.org/abs/2508.19843v1

The broad capabilities and substantial resources required to train Large Language Models (LLMs) make them valuable intellectual property, yet they remain vulnerable to copyright infringement, such as unauthorized use and model theft. LLM fingerprinting, a non-intrusive technique that extracts and compares distinctive features from LLMs to identify infringements, offers a promising solution to copyright auditing. However, its reliability remains uncertain due to the prevalence of diverse model modifications and the lack of standardized evaluation. In this SoK, we present the first comprehensive study of LLM fingerprinting. We introduce a unified framework and formal taxonomy that categorizes existing methods into white-box and black-box approaches, providing a structured overview of the state of the art. We further propose LeaFBench, the first systematic benchmark for evaluating LLM fingerprinting under realistic deployment scenarios. Built upon mainstream foundation models and comprising 149 distinct model instances, LeaFBench integrates 13 representative post-development techniques, spanning both parameter-altering methods (e.g., fine-tuning, quantization) and parameter-independent mechanisms (e.g., system prompts, RAG). Extensive experiments on LeaFBench reveal the strengths and weaknesses of existing methods, thereby outlining future research directions and critical open problems in this emerging field. The code is available at https://github.com/shaoshuo-ss/LeaFBench.

The Return of Structural Handwritten Mathematical Expression Recognition

Authors: Jakob Seitz, Tobias Lengfeld, Radu Timofte

2025-08-27

http://arxiv.org/abs/2508.19773v1

Handwritten Mathematical Expression Recognition is foundational for educational technologies, enabling applications like digital note-taking and automated grading. While modern encoder-decoder architectures with large language models excel at LaTeX generation, they lack explicit symbol-to-trace alignment, a critical limitation for error analysis, interpretability, and spatially aware interactive applications requiring selective content updates. This paper introduces a structural recognition approach with two innovations: (1) an automatic annotation system that uses a neural network to map LaTeX equations to raw traces, automatically generating annotations for symbol segmentation, classification, and spatial relations, and (2) a modular structural recognition system that independently optimizes segmentation, classification, and relation prediction. By leveraging a dataset enriched with structural annotations from our auto-labeling system, the proposed recognition system combines graph-based trace sorting, a hybrid convolutional-recurrent network, and transformer-based correction to achieve competitive performance on the CROHME-2023 benchmark. Crucially, our structural recognition system generates a complete graph structure that directly links handwritten traces to predicted symbols, enabling transparent error analysis and interpretable outputs.

Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval

Authors: Wenhao Li, Yuxin Zhang, Gen Luo, Haiyuan Wan, Ziyang Gong, Fei Chao, Rongrong Ji

2025-08-27

http://arxiv.org/abs/2508.19740v1

Reducing the key-value (KV) cache burden in Large Language Models (LLMs) significantly accelerates inference. Dynamically selecting critical KV cache entries during decoding helps maintain performance. Existing methods use random linear hashing to identify important tokens, but this approach is inefficient due to the orthogonal distribution of queries and keys within two narrow cones in LLMs. We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys, enhancing coding efficiency and robustness. We also developed a lightweight, stable training framework using a Bradley-Terry ranking-based loss, enabling optimization of the non-linear hashing module on GPUs with 16 GB of memory in 8 hours. Experimental results show that Spotlight Attention drastically improves retrieval precision while shortening the hash code length by at least 5x compared to traditional linear hashing. Finally, we exploit the computational advantages of bitwise operations by implementing specialized CUDA kernels, achieving hashing retrieval for 512K tokens in under 100 microseconds on a single A100 GPU, with end-to-end throughput up to 3x higher than vanilla decoding.
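A generic sketch of the non-linear hashing idea, not the Spotlight Attention implementation: a small MLP maps queries and keys to binary codes, and the tokens whose key codes are closest to the query code (smallest Hamming distance) are kept. Layer sizes, code length, and the top-k rule are assumptions.

```python
import torch

class NonLinearHash(torch.nn.Module):
    """Small MLP that maps queries/keys to {-1, +1} hash codes."""
    def __init__(self, dim, code_bits=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.GELU(), torch.nn.Linear(dim, code_bits)
        )

    def forward(self, x):
        # ±1 codes; a straight-through estimator would be needed during training
        return torch.sign(self.net(x))

def select_critical_tokens(query, keys, hasher, k=256):
    """Return indices of the k cached keys whose codes are closest in Hamming distance."""
    q_code = hasher(query)               # (code_bits,)
    k_codes = hasher(keys)               # (num_tokens, code_bits)
    # Hamming distance of ±1 codes is monotone in the negative dot product
    scores = k_codes @ q_code
    return torch.topk(scores, k=min(k, keys.shape[0])).indices

# toy usage on random vectors
hasher = NonLinearHash(dim=128)
idx = select_critical_tokens(torch.randn(128), torch.randn(4096, 128), hasher, k=64)
```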

Continuously Steering LLMs Sensitivity to Contextual Knowledge with Proxy Models

Authors: Yilin Wang, Heng Wang, Yuyang Bai, Minnan Luo

2025-08-27

http://arxiv.org/abs/2508.19720v2

In Large Language Models (keys) generation, there exist knowledge conflicts and scenarios where parametric knowledge contradicts knowledge provided in the context. Previous works studied tuning, key algorithms, or locating and editing context-aware neurons to adapt keys to be faithful to new contextual knowledge. However, they are usually inefficient or ineffective for large models, not workable for black-box models, or unable to continuously adjust keys' sensitivity to the knowledge provided in the context. To mitigate these problems, we propose CSKS (Continuously Steering Knowledge Sensitivity), a simple framework that can steer keys' sensitivity to contextual knowledge continuously at a lightweight cost. Specifically, we tune two small LMs (i.e. proxy models) and use the difference in their output distributions to shift the original distribution of an key without modifying the key weights. In the evaluation process, we not only design synthetic data and fine-grained metrics to measure models' sensitivity to contextual knowledge but also use a real conflict dataset to validate CSKS's practical efficacy. Extensive experiments demonstrate that our framework achieves continuous and precise control over keys' sensitivity to contextual knowledge, enabling both increased sensitivity and reduced sensitivity, thereby allowing keys to prioritize either contextual or parametric knowledge as needed flexibly. Our data and code are available at https://github.com/OliveJuiceLin/CSKS.
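A minimal sketch of the proxy-steering idea under the assumption of a shared vocabulary: add a scaled difference between a tuned proxy's logits and an untuned proxy's logits to the large model's logits. The scaling factor alpha and this exact combination rule are our assumptions, not necessarily CSKS's formulation.

```python
import torch

def steer_logits(base_logits, proxy_tuned_logits, proxy_base_logits, alpha=1.0):
    """Shift a large model's next-token distribution using two small proxy LMs.

    alpha > 0 pushes generation toward the context-faithful proxy's behaviour;
    alpha < 0 pushes toward parametric knowledge, giving continuous control.
    """
    return base_logits + alpha * (proxy_tuned_logits - proxy_base_logits)

# toy usage with random logits over a shared vocabulary
vocab = 32000
steered = steer_logits(torch.randn(vocab), torch.randn(vocab), torch.randn(vocab), alpha=1.5)
next_token = torch.softmax(steered, dim=-1).argmax()
```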

Hybrid Decoding: Rapid Pass and Selective Detailed Correction for Sequence Models

Authors: Yunkyu Lim, Jihwan Park, Hyung Yong Kim, Hanbin Lee, Byeong-Yeol Kim

2025-08-27

http://arxiv.org/abs/2508.19671v1

Recently, Transformer-based encoder-decoder models have demonstrated strong performance in multilingual speech recognition. However, the decoder's autoregressive nature and large size introduce significant bottlenecks during inference. Additionally, although rare, repetition can occur and negatively affect recognition accuracy. To tackle these challenges, we propose a novel Hybrid Decoding approach that both accelerates inference and alleviates the issue of repetition. Our method extends the transformer encoder-decoder architecture by attaching a lightweight, fast decoder to the pretrained encoder. During inference, the fast decoder rapidly generates an output, which is then verified and, if necessary, selectively corrected by the Transformer decoder. This results in faster decoding and improved robustness against repetitive errors. Experiments on the LibriSpeech and GigaSpeech test sets indicate that, with fine-tuning limited to the added decoder, our method achieves word error rates comparable to or better than the baseline, while more than doubling the inference speed.
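A sketch of the rapid-pass-then-correct control flow, assuming three callables for the fast draft, the big decoder's verification scores, and the selective correction step; the confidence threshold criterion is an illustrative assumption, not the paper's exact rule.

```python
def hybrid_decode(encoder_states, fast_decode, verify, correct, threshold=0.5):
    """Rapid pass with a fast decoder, then selective detailed correction.

    fast_decode(enc) -> list of draft tokens
    verify(enc, draft) -> per-token confidences from the Transformer decoder
    correct(enc, draft, bad_positions) -> corrected token list
    All three callables are placeholders for the underlying models.
    """
    draft = fast_decode(encoder_states)
    conf = verify(encoder_states, draft)
    bad = [i for i, c in enumerate(conf) if c < threshold]
    if not bad:
        # draft accepted as-is: no slow autoregressive pass needed
        return draft
    # only low-confidence spans are handed back to the big decoder
    return correct(encoder_states, draft, bad)
```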

Survey of Specialized Large Language Model

Authors: Chenghan Yang, Ruiyu Zhao, Yang Liu, Ling Jiang

2025-08-27

http://arxiv.org/abs/2508.19667v1

The rapid evolution of specialized large language models (LLMs) has transitioned from simple domain adaptation to sophisticated native architectures, marking a paradigm shift in AI development. This survey systematically examines this progression across the healthcare, finance, legal, and technical domains. Beyond the wide use of specialized LLMs, technical breakthroughs such as the emergence of domain-native designs beyond fine-tuning, a growing emphasis on parameter efficiency through sparse computation and quantization, and the increasing integration of multimodal capabilities are being applied to recent LLM agents. Our analysis reveals how these innovations address fundamental limitations of general-purpose LLMs in professional applications, with specialized models consistently achieving performance gains on domain-specific benchmarks. The survey further highlights implications for the e-commerce field, helping to fill gaps in that area.

LFD: Layer Fused Decoding to Exploit External Knowledge in Retrieval-Augmented Generation

Authors: Yang Sun, Lixin Zou, Dan Luo, Zhiyong Xie, Long Zhang, Liming Dong, Yunwei Zhao, Xixun Lin, Yanxiong Lu, Chenliang Li

2025-08-27

http://arxiv.org/abs/2508.19614v1

Retrieval-augmented generation (RAG) incorporates external knowledge into large language models (LLMs), improving their adaptability to downstream tasks and enabling information updates. Surprisingly, recent empirical evidence shows that injecting noise into retrieved relevant documents paradoxically facilitates the exploitation of external knowledge and improves generation quality. Although counterintuitive and challenging to apply in practice, this phenomenon enables granular control and rigorous analysis of how LLMs integrate external knowledge. Therefore, in this paper, we intervene on noise injection and establish a layer-specific functional demarcation within the LLM: shallow layers specialize in local context modeling, intermediate layers focus on integrating long-range external factual knowledge, and deeper layers primarily rely on parametric internal knowledge. Building on this insight, we propose Layer Fused Decoding (LFD), a simple decoding strategy that directly combines representations from an intermediate layer with final-layer decoding outputs to fully exploit the external factual knowledge. To identify the optimal intermediate layer, we introduce an internal knowledge score (IKS) criterion that selects the layer with the lowest IKS value in the latter half of the layers. Experimental results across multiple benchmarks demonstrate that LFD helps RAG systems surface retrieved contextual knowledge more effectively and at minimal cost.
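A minimal sketch of fusing an intermediate layer with the final layer before the vocabulary projection. The simple weighted average with alpha is our simplification; LFD's exact fusion operator and layer-selection criterion (IKS) are not reproduced here.

```python
import torch

def layer_fused_logits(hidden_states, lm_head, fuse_layer, alpha=0.5):
    """Fuse an intermediate layer's representation with the final layer's output.

    `hidden_states` is the per-layer tuple returned by a Hugging Face model
    called with output_hidden_states=True; `lm_head` is its output projection.
    """
    final = hidden_states[-1][:, -1]          # last-token state from the last layer
    mid = hidden_states[fuse_layer][:, -1]    # last-token state from the chosen layer
    fused = (1 - alpha) * final + alpha * mid
    return lm_head(fused)                     # fused next-token logits
```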

ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding

Authors: Sining Zhoubian, Dan Zhang, Yuxiao Dong, Jie Tang

2025-08-27

http://arxiv.org/abs/2508.19576v1

With respect to improving the reasoning accuracy of LLMs, the representative reinforcement learning (RL) method GRPO faces failure due to insignificant reward variance, while verification methods based on process reward models (PRMs) suffer from difficulties with training data acquisition and verification effectiveness. To tackle these problems, this paper introduces ReST-RL, a unified LLM RL paradigm that significantly improves the LLM's code reasoning ability by combining an improved GRPO algorithm with a meticulously designed test-time decoding method assisted by a value model (VM). As the first stage of policy reinforcement, ReST-GRPO adopts an optimized ReST algorithm to filter and assemble high-value training data, increasing the reward variance of GRPO sampling and thus improving the effectiveness and efficiency of training. After the basic reasoning ability of the LLM policy has been improved, we further propose a test-time decoding optimization method called VM-MCTS. Through Monte Carlo Tree Search (MCTS), we collect accurate value targets with no annotation required, on which VM training is based. When decoding, the VM is deployed by an adapted MCTS algorithm to provide precise process signals as well as verification scores, assisting the LLM policy in achieving high reasoning accuracy. We validate the effectiveness of the proposed RL paradigm through extensive experiments on coding problems. Our approach significantly outperforms other reinforcement training baselines (e.g., naive GRPO and ReST-DPO), as well as decoding and verification baselines (e.g., PRM-BoN and ORM-MCTS), on well-known coding benchmarks of various levels (e.g., APPS, BigCodeBench, and HumanEval), indicating its power to strengthen the reasoning ability of LLM policies. Code for our project can be found at https://github.com/THUDM/ReST-RL.

Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference

Authors: Rongzhi Li, Ruogu Du, Zefang Chu, Sida Zhao, Chunlei Han, Zuocheng Shi, Yiwen Shao, Huanle Han, Long Huang, Zherui Liu, Shufan Liu

2025-08-27

http://arxiv.org/abs/2508.19559v1

Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between the prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring efficient, adaptive resource management. Deployed in a massive production environment on tens of thousands of GPUs, HeteroScale has proven its effectiveness, increasing average GPU utilization by a significant 26.6 percentage points and saving hundreds of thousands of GPU-hours daily, all while upholding stringent service level objectives.
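A toy sketch of the core idea of scaling both pools from a single load signal while keeping the prefill:decode balance fixed. The utilization-style scale rule, the specific metric, and the ratio are illustrative assumptions, not HeteroScale's production policy.

```python
def plan_replicas(metric_value, target, decode_now, pd_ratio=0.5,
                  min_replicas=1, max_replicas=256):
    """Scale prefill and decode pools together from one load signal.

    metric_value / target gives the desired scale factor; pd_ratio keeps the
    prefill:decode replica balance fixed so neither stage starves the other.
    """
    scale = metric_value / max(target, 1e-9)
    decode_new = max(min_replicas, min(max_replicas, int(round(decode_now * scale))))
    prefill_new = max(min_replicas, min(max_replicas, int(round(decode_new * pd_ratio))))
    return prefill_new, decode_new

# e.g. load 30% above target grows both pools while preserving the P:D ratio
print(plan_replicas(metric_value=1.3, target=1.0, decode_now=8, pd_ratio=0.5))
```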

Towards 6G Intelligence: The Role of Generative AI in Future Wireless Networks

Authors: Muhammad Ahmed Mohsin, Junaid Ahmad, Muhammad Hamza Nawaz, Muhammad Ali Jamshed

2025-08-27

http://arxiv.org/abs/2508.19495v1

Ambient intelligence (AmI) is a computing paradigm in which physical environments are embedded with sensing, computation, and communication so they can perceive people and context, decide appropriate actions, and respond autonomously. Realizing AmI at global scale requires sixth generation (6G) wireless networks with capabilities for real-time perception, reasoning, and action aligned with human behavior and mobility patterns. We argue that Generative Artificial Intelligence (GenAI) is the creative core of such environments. Unlike traditional AI, GenAI learns data distributions and can generate realistic samples, making it well suited to close key AmI gaps, including generating synthetic sensor and channel data in under-observed areas, translating user intent into compact, semantic messages, predicting future network conditions for proactive control, and updating digital twins without compromising privacy. This chapter reviews foundational GenAI models (GANs, VAEs, diffusion models, and generative transformers) and connects them to practical AmI use cases, including spectrum sharing, ultra-reliable low-latency communication, intelligent security, and context-aware digital twins. We also examine how 6G enablers, such as edge and fog computing, IoT device swarms, intelligent reflecting surfaces (IRS), and non-terrestrial networks, can host or accelerate distributed GenAI. Finally, we outline open challenges in energy-efficient on-device training, trustworthy synthetic data, federated generative learning, and AmI-specific standardization. We show that GenAI is not a peripheral addition, but a foundational element for transforming 6G from a faster network into an ambient intelligent ecosystem.

Quantized but Deceptive? A Multi-Dimensional Truthfulness Evaluation of Quantized LLMs

Authors: Yao Fu, Xianxuan Long, Runchao Li, Haotian Yu, Mu Sheng, Xiaotian Han, Yu Yin, Pan Li

2025-08-26

http://arxiv.org/abs/2508.19432v1

Quantization enables efficient deployment of large language models (LLMs) in resource-constrained environments by significantly reducing memory and computation costs. While quantized LLMs often maintain performance on perplexity and zero-shot tasks, their impact on truthfulness, i.e., whether they generate truthful or deceptive responses, remains largely unexplored. In this work, we introduce TruthfulnessEval, a comprehensive evaluation framework for assessing the truthfulness of quantized LLMs across three dimensions: (1) Truthfulness on Logical Reasoning; (2) Truthfulness on Common Sense; and (3) Truthfulness on Imitative Falsehoods. Using this framework, we examine mainstream quantization techniques (ranging from 4-bit to extreme 2-bit) across several open-source LLMs. Surprisingly, we find that while quantized models retain internally truthful representations, they are more susceptible to producing false outputs under misleading prompts. To probe this vulnerability, we test 15 rephrased variants of "honest", "neutral" and "deceptive" prompts and observe that "deceptive" prompts can override truth-consistent behavior, whereas "honest" and "neutral" prompts maintain stable outputs. Further, via layer-wise probing and PCA visualizations, we reveal that quantized models "know" the truth internally yet still produce false outputs when guided by "deceptive" prompts. Our findings provide insights into future designs of quantization-aware alignment and truthfulness interventions.

Even Heads Fix Odd Errors: Mechanistic Discovery and Surgical Repair in Transformer Attention

Authors: Gustavo Sandoval

2025-08-26

http://arxiv.org/abs/2508.19414v1

We present a mechanistic case study of a format-dependent reasoning failure in Llama-3.1-8B-Instruct, where the model incorrectly judges "9.11" as larger than "9.8" in chat or Q&A formats, but answers correctly in simple format. Through systematic intervention, we discover that transformers implement even/odd attention head specialization: even-indexed heads handle numerical comparison, while odd heads serve incompatible functions. The bug requires exactly 8 even heads at Layer 10 for perfect repair. Any combination of 8+ even heads succeeds, while 7 or fewer completely fails, revealing sharp computational thresholds with perfect redundancy among the 16 even heads. SAE analysis reveals the mechanism: format representations separate (10% feature overlap at Layer 7), then re-entangle with different weightings (80% feature overlap at Layer 10), with specific features showing 1.5x amplification in failing formats. We achieve perfect repair using only 25% of attention heads and identify a 60% pattern replacement threshold, demonstrating that apparent full-module requirements hide sophisticated substructure, with implications for interpretability and efficiency. All of our code is available at https://github.com/gussand/surgeon.
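The "surgical repair" intervention can be illustrated with standard activation patching on per-head attention outputs: copy selected heads' activations from a working (clean) run into the failing (corrupted) run at one layer. The tensor shapes and the single-layer patching setup follow common mechanistic-interpretability practice and are not the paper's exact code.

```python
import torch

def patch_heads(corrupted, clean, heads_to_patch):
    """Activation patching on per-head attention outputs.

    `corrupted` and `clean` have shape (batch, n_heads, seq, head_dim): the
    failing-format run and the working-format run at the same layer. Copying
    the listed heads from the clean run mimics repairing the even heads.
    """
    repaired = corrupted.clone()
    repaired[:, heads_to_patch] = clean[:, heads_to_patch]
    return repaired

# e.g. patch 8 even-indexed heads of a 32-head layer
even_heads = list(range(0, 16, 2))
out = patch_heads(torch.randn(1, 32, 128, 64), torch.randn(1, 32, 128, 64), even_heads)
```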

One Joke to Rule them All? On the (Im)possibility of Generalizing Humor

Authors: Mor Turgeman, Chen Shani, Dafna Shahaf

2025-08-26

http://arxiv.org/abs/2508.19402v1

Humor is a broad and complex form of communication that remains challenging for machines. Despite its broadness, most existing research on computational humor has traditionally focused on modeling a specific type of humor. In this work, we wish to understand whether competence on one or more specific humor tasks confers any ability to transfer to novel, unseen types; in other words, is this fragmentation inevitable? This question is especially timely as new humor types continuously emerge in online and social media contexts (e.g., memes, anti-humor, AI fails). If Large Language Models (LLMs) are to keep up with this evolving landscape, they must be able to generalize across humor types by capturing deeper, transferable mechanisms. To investigate this, we conduct a series of transfer learning experiments across four datasets, representing different humor tasks. We train LLMs under varied diversity settings (1-3 datasets in training, testing on a novel task). Experiments reveal that models are capable of some transfer, and can reach up to 75% accuracy on unseen datasets; training on diverse sources improves transferability (1.88-4.05%) with minimal-to-no drop in in-domain performance. Further analysis suggests relations between humor types, with Dad Jokes surprisingly emerging as the best enabler of transfer (though it is difficult to transfer to). We release our data and code.

A Theory of Goal-Oriented Medium Access Protocol Design and Distributed Bandit Learning

Authors: Federico Chiariotti, Andrea Zanella

2025-08-26

http://arxiv.org/abs/2508.19141v1

The Goal-oriented Communication (GoC) paradigm breaks the separation between communication and the content of the data, tailoring communication decisions to the specific needs of the receiver and targeting application performance. While recent studies show impressive encoding performance in point-to-point scenarios, the multi-node distributed scenario is still almost unexplored. Moreover, the few studies that investigate it consider a centralized, collision-free approach, in which a central scheduler decides the transmission order of the nodes. In this work, we address the Goal-oriented Multiple Access (GoMA) problem, in which multiple intelligent agents must coordinate to share a wireless channel and avoid mutual interference. We propose a theoretical framework for the analysis and optimization of distributed GoMA, serving as a first step towards its complete characterization. We prove that the problem is non-convex and may admit multiple Nash Equilibrium (NE) solutions. We characterize each node's best response to the others' strategies and propose an optimization approach that provably reaches one such NE, outperforming centralized approaches by up to 100% while also reducing energy consumption. We also design a distributed learning algorithm that operates with limited feedback and no prior knowledge.

Random forest-based out-of-distribution detection for robust lung cancer segmentation

Authors: Aneesh Rangnekar, Harini Veeraraghavan

2025-08-26

http://arxiv.org/abs/2508.19112v1

Accurate detection and segmentation of cancerous lesions from computed tomography (CT) scans is essential for automated treatment planning and cancer treatment response assessment. Transformer-based models with self-supervised pretraining can produce reliably accurate segmentation from in-distribution (ID) data but degrade when applied to out-of-distribution (OOD) datasets. We address this challenge with RF-Deep, a random forest classifier that utilizes deep features from the pretrained transformer encoder of the segmentation model to detect OOD scans and enhance segmentation reliability. The segmentation model comprises a Swin Transformer encoder, pretrained with masked image modeling (SimMIM) on 10,432 unlabeled 3D CT scans covering cancerous and non-cancerous conditions, and a convolutional decoder, trained to segment lung cancers in 317 3D scans. Independent testing was performed on 603 3D CT scans from public datasets that included one ID dataset and four OOD datasets comprising chest CTs with pulmonary embolism (PE) and COVID-19, and abdominal CTs with kidney cancers and healthy volunteers. RF-Deep detected OOD cases with an FPR95 of 18.26%, 27.66%, and less than 0.1% on PE, COVID-19, and abdominal CTs, respectively, consistently outperforming established OOD approaches. The RF-Deep classifier provides a simple and effective approach to enhance the reliability of cancer segmentation in ID and OOD scenarios.
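A compact sketch of an RF-Deep-style detector: fit a random forest on pooled encoder features of ID scans versus an outlier-exposure set, then flag test scans whose OOD probability is high before trusting the segmentation. Feature extraction from the Swin encoder is abstracted away, and the two-class training setup here is our assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
id_feats = rng.normal(0.0, 1.0, size=(200, 768))    # pooled encoder features, ID lung scans
ood_feats = rng.normal(1.5, 1.0, size=(200, 768))   # features from non-lung / OOD scans

X = np.vstack([id_feats, ood_feats])
y = np.array([0] * len(id_feats) + [1] * len(ood_feats))
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

test_feat = rng.normal(1.5, 1.0, size=(1, 768))
ood_score = clf.predict_proba(test_feat)[0, 1]       # gate the segmentation if this is high
print(f"OOD probability: {ood_score:.2f}")
```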

APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration

Authors: Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

2025-08-26

http://arxiv.org/abs/2508.19087v1

Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization methods can help reduce computational costs; however, attaining the extreme efficiency associated with ultra-low-bit quantized LLMs at arbitrary precision presents challenges on GPUs. This is primarily due to the limited support of GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. To tackle these challenges, we propose a comprehensive acceleration scheme for arbitrary-precision LLMs, namely APT-LLM. Firstly, we introduce a novel data format, bipolar-INT, which allows for efficient and lossless conversion with signed INT, while also being more conducive to parallel computation. We also develop a matrix multiplication (MatMul) method allowing for arbitrary precision by dismantling and reassembling matrices at the bit level. This method provides flexible precision and optimizes the utilization of GPU Tensor Cores. In addition, we propose a memory management system focused on data recovery, which strategically employs fast shared memory to substantially increase kernel execution speed and reduce memory access latency. Finally, we develop a kernel mapping method that dynamically selects the optimal configurable hyperparameters of kernels for varying matrix sizes, enabling optimal performance across different LLM architectures and precision settings. In LLM inference, APT-LLM achieves up to a 3.99x speedup compared to FP16 baselines and a 2.16x speedup over NVIDIA CUTLASS INT4 acceleration on an RTX 3090. On the RTX 4090 and H800, APT-LLM achieves up to a 2.44x speedup over FP16 and a 1.65x speedup over CUTLASS integer baselines.

Federated Fine-Tuning of Sparsely-Activated Large Language Models on Resource-Constrained Devices

Authors: Fahao Chen, Jie Wan, Peng Li, Zhou Su, Dongxiao Yu

2025-08-26

http://arxiv.org/abs/2508.19078v1

Federated fine-tuning of Mixture-of-Experts (MoE)-based large language models (LLMs) is challenging due to their massive computational requirements and the resource constraints of participants. Existing work attempts to fill this gap through model quantization, computation offloading, or expert pruning. However, these approaches cannot achieve the desired performance due to impractical system assumptions and a lack of consideration for MoE-specific characteristics. In this paper, we propose FLUX, a system designed to enable federated fine-tuning of MoE-based LLMs across participants with constrained computing resources (e.g., consumer-grade GPUs), aiming to minimize time-to-accuracy. FLUX introduces three key innovations: (1) quantization-based local profiling to estimate expert activation with minimal overhead, (2) adaptive layer-aware expert merging to reduce resource consumption while preserving accuracy, and (3) dynamic expert role assignment using an exploration-exploitation strategy to balance tuning and non-tuning experts. Extensive experiments on LLaMA-MoE and DeepSeek-MoE with multiple benchmark datasets demonstrate that FLUX significantly outperforms existing methods, achieving up to 4.75x speedup in time-to-accuracy.

A Concurrent Modular Agent Framework for Autonomous LLM Agents

Authors: Norihiro Maruyama, Takahide Yoshida, Hiroki Sato, Atsushi Masumori, Johnsmith, Takashi Ikegami

2025-08-26

http://arxiv.org/abs/2508.19042v1

We introduce the Concurrent Modular Agent (CMA), a framework that orchestrates multiple Large Language Model (LLM)-based modules that operate fully asynchronously yet maintain a coherent and fault-tolerant behavioral loop. This framework addresses long-standing difficulties in agent architectures by letting intention emerge from language-mediated interactions among autonomous processes. This approach enables flexible, adaptive, and context-dependent behavior through the combination of concurrently executed modules that offload reasoning to an LLM, inter-module communication, and a single shared global state. We consider this approach to be a practical realization of Minsky's Society of Mind theory. We demonstrate the viability of our system through two practical use-case studies. The emergent properties observed in our system suggest that complex cognitive phenomena like self-awareness may indeed arise from the organized interaction of simpler processes, supporting Minsky's Society of Mind concept and opening new avenues for artificial intelligence research. The source code for our work is available at: https://github.com/AlternativeMachine/concurrent-modular-agent.

Sense of Self and Time in Borderline Personality. A Comparative Robustness Study with Generative AI

Authors: Marcin Moskalewicz, Anna Sterna, Marek Pokropski, Paula Flores

2025-08-26

http://arxiv.org/abs/2508.19008v1

This study examines the capacity of large language models (LLMs) to support phenomenological qualitative analysis of first-person experience in Borderline Personality Disorder (BPD), understood as a disorder of temporality and selfhood. Building on a prior human-led thematic analysis of 24 inpatients' life-story interviews, we compared three LLMs (OpenAI GPT-4o, Google Gemini 2.5 Pro, Anthropic Claude Opus 4) prompted to mimic the interpretative style of the original investigators. The models were evaluated by blinded and non-blinded expert judges in phenomenology and clinical psychology. Assessments included semantic congruence, Jaccard coefficients, and multidimensional validity ratings (credibility, coherence, substantiveness, and groundedness in data). Results showed variable overlap with the human analysis, from 0 percent in GPT to 42 percent in Claude and 58 percent in Gemini, and low Jaccard coefficients (0.21-0.28). However, the models recovered themes omitted by humans. Gemini's output most closely resembled the human analysis, with validity scores significantly higher than GPT's and Claude's (p < 0.0001), and was judged as human by blinded experts. All scores correlated strongly (R > 0.78) with the quantity of text and words per theme, highlighting both the variability and the potential of AI-augmented thematic analysis to mitigate human interpretative bias.

RoofSeg: An edge-aware transformer-based network for end-to-end roof plane segmentation

Authors: Siyuan You, Guozheng Xu, Pengwei Zhou, Qiwen Jin, Jian Yao, Li Li

2025-08-26

http://arxiv.org/abs/2508.19003v1

Roof plane segmentation is one of the key procedures for reconstructing three-dimensional (3D) building models at levels of detail (LoD) 2 and 3 from airborne light detection and ranging (LiDAR) point clouds. The majority of current approaches to roof plane segmentation rely on manually designed or learned features followed by specifically designed geometric clustering strategies. Because learned features are more powerful than manually designed features, deep learning-based approaches usually perform better than traditional approaches. However, current deep learning-based approaches have three unsolved problems. First, most of them are not truly end-to-end, so the plane segmentation results may not be optimal. Second, the point feature discriminability near the edges is relatively low, leading to inaccurate planar edges. Third, planar geometric characteristics are not sufficiently considered to constrain the network training. To solve these issues, a novel edge-aware transformer-based network, named RoofSeg, is developed for segmenting roof planes from LiDAR point clouds in a truly end-to-end manner. In RoofSeg, we leverage a transformer encoder-decoder-based framework to hierarchically predict plane instance masks with the use of a set of learnable plane queries. To further improve the segmentation accuracy in edge regions, we also design an Edge-Aware Mask Module (EAMM) that sufficiently incorporates planar geometric priors of edges to enhance its discriminability for plane instance mask refinement. In addition, we propose an adaptive weighting strategy in the mask loss to reduce the influence of misclassified points, and a new plane geometric loss to constrain the network training.

LLMs in the SOC: An Empirical Study of Human-AI Collaboration in Security Operations Centres

Authors: Ronal Singh, Shahroz Tariq, Fatemeh Jalalvand, Mohan Baruwal Chhetri, Surya Nepal, Cecile Paris, Martin Lochner

2025-08-26

http://arxiv.org/abs/2508.18947v1

The integration of Large Language Models (LLMs) into Security Operations Centres (SOCs) presents a transformative, yet still evolving, opportunity to reduce analyst workload through human-AI collaboration. However, their real-world application in SOCs remains underexplored. To address this gap, we present a longitudinal study of 3,090 analyst queries from 45 SOC analysts over 10 months. Our analysis reveals that analysts use LLMs as on-demand aids for sensemaking and context-building, rather than for making high-stakes determinations, preserving analyst decision authority. The majority of queries are related to interpreting low-level telemetry (e.g., commands) and refining technical communication through short (1-3 turn) interactions. Notably, 93% of queries align with established cybersecurity competencies (NICE Framework), underscoring the relevance of LLM use for SOC-related tasks. Despite variations in tasks and engagement, usage trends indicate a shift from occasional exploration to routine integration, with growing adoption and sustained use among a subset of analysts. We find that LLMs function as flexible, on-demand cognitive aids that augment, rather than replace, SOC expertise. Our study provides actionable guidance for designing context-aware, human-centred AI assistance in security operations, highlighting the need for further in-the-wild research on real-world analyst-LLM collaboration, its challenges, and its impacts.

Enhancing Model Privacy in Federated Learning with Random Masking and Quantization

Authors: Zhibo Xu, Jianhao Zhu, Jingwen Xu, Changze Lv, Zisu Huang, Xiaohua Wang, Muling Wu, Qi Qian, Xiaoqing Zheng, Xuanjing Huang

2025-08-26

http://arxiv.org/abs/2508.18911v2

The primary goal of traditional federated learning is to protect data privacy by enabling distributed edge devices to collaboratively train a shared global model while keeping raw data decentralized at local clients. The rise of large language models (LLMs) has introduced new challenges in distributed systems, as their substantial computational requirements and the need for specialized expertise raise critical concerns about protecting intellectual property (IP). This highlights the need for a federated learning approach that can safeguard both sensitive data and proprietary models. To tackle this challenge, we propose FedQSN, a federated learning approach that leverages random masking to obscure a subnetwork of model parameters and applies quantization to the remaining parameters. Consequently, the server transmits only a privacy-preserving proxy of the global model to clients during each communication round, thus enhancing the model's confidentiality. Experimental results across various models and tasks demonstrate that our approach not only maintains strong model performance in federated learning settings but also achieves enhanced protection of model parameters compared to baseline methods.
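A small sketch of the masking-plus-quantization idea: for each weight tensor, zero out a random subnetwork and uniformly quantize the rest before transmission. The masking ratio, bit-width, and uniform quantizer are illustrative choices, not FedQSN's exact configuration.

```python
import torch

def obfuscate_model(state_dict, mask_frac=0.3, n_bits=4, seed=0):
    """Build a privacy-preserving proxy of the global model for transmission."""
    g = torch.Generator().manual_seed(seed)
    proxy = {}
    for name, w in state_dict.items():
        # randomly hide a subnetwork of parameters
        keep = (torch.rand(w.shape, generator=g) >= mask_frac).to(w.dtype)
        masked = w * keep
        # uniform symmetric quantization of the remaining weights to n_bits
        scale = masked.abs().max() / (2 ** (n_bits - 1) - 1) + 1e-12
        q = torch.clamp(torch.round(masked / scale),
                        -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
        proxy[name] = q * scale
    return proxy

proxy = obfuscate_model({"linear.weight": torch.randn(16, 16)})
```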

pyFAST: A Modular PyTorch Framework for Time Series Modeling with Multi-source and Sparse Data

Authors: Zhijin Wang, Senzhen Wu, Yue Hu, Xiufeng Liu

2025-08-26

http://arxiv.org/abs/2508.18891v1

Modern time series analysis demands frameworks that are flexible, efficient, and extensible. However, many existing Python libraries exhibit limitations in modularity and in their native support for irregular, multi-source, or sparse data. We introduce pyFAST, a research-oriented PyTorch framework that explicitly decouples data processing from model computation, fostering a cleaner separation of concerns and facilitating rapid experimentation. Its data engine is engineered for complex scenarios, supporting multi-source loading, protein sequence handling, efficient sequence- and patch-level padding, dynamic normalization, and mask-based modeling for both imputation and forecasting. pyFAST integrates LLM-inspired architectures for the alignment-free fusion of sparse data sources and offers native sparsity-aware metrics, specialized loss functions, and flexible exogenous data fusion. Training utilities include batch-based streaming aggregation for evaluation and device synergy to maximize computational efficiency. A comprehensive suite of classical and deep learning models (linear models, CNNs, RNNs, Transformers, and GNNs) is provided within a modular architecture that encourages extension. Released under the MIT license on GitHub, pyFAST provides a compact yet powerful platform for advancing time series research and applications.

ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive

Authors: Xinhao Luo, Zihan Liu, Yangjie Zhou, Shihan Fang, Ziyu Huang, Yu Feng, Chen Zhang, Shixuan Sun, Zhenzhe Zheng, Jingwen Leng, Minyi Guo

2025-08-26

http://arxiv.org/abs/2508.18850v1

Large language model (LLM) decoding suffers from high latency due to fragmented execution across operators and heavy reliance on off-chip memory for data exchange and reduction. This execution model limits opportunities for fusion and incurs significant memory traffic and kernel launch overhead. While modern architectures such as NVIDIA Hopper provide distributed shared memory and low-latency intra-cluster interconnects, they expose only low-level data movement instructions, lacking structured abstractions for collective on-chip communication. To bridge this software-hardware gap, we introduce two cluster-level communication primitives, ClusterReduce and ClusterGather, which abstract common communication patterns and enable structured, high-speed data exchange and reduction between thread blocks within a cluster, allowing intermediate results to stay on-chip without involving off-chip memory. Building on these abstractions, we design ClusterFusion, an execution framework that schedules communication and computation jointly to expand the operator fusion scope by composing decoding stages such as QKV Projection, Attention, and Output Projection into a single fused kernel. Evaluations on H100 GPUs show that ClusterFusion outperforms state-of-the-art inference frameworks by 1.61x on average in end-to-end latency across different models and configurations. The source code is available at https://github.com/xinhao-luo/ClusterFusion.

STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning

Authors: Chenghao Wu, Ruiyang Ren, Junjie Zhang, Ruirui Wang, Zhongrui Ma, Qi Ye, Wayne Xin Zhao

2025-08-26

http://arxiv.org/abs/2508.18812v1

While modern recommender systems are instrumental in navigating information abundance, they remain fundamentally limited by static user modeling and reactive decision-making paradigms. Current large language model (LLM)-based agents inherit these shortcomings through their overreliance on heuristic pattern matching, yielding recommendations prone to shallow correlation bias, limited causal inference, and brittleness in sparse-data scenarios. We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities. Each user is modeled as an agent with parallel cognitions: fast response for immediate interactions and slow reasoning that performs chain-of-thought rationales. To cultivate intrinsic slow thinking, we develop anchored reinforcement training, a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping. This hybrid approach scaffolds agents in acquiring foundational capabilities (preference summarization, rationale generation) while enabling dynamic policy adaptation through simulated feedback loops. Experiments on the MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains over state-of-the-art baselines, despite using only 0.4% of the full training data.

A Survey on Cloud-Edge-Terminal Collaborative Intelligence in AIoT Networks

Authors: Jiaqi Wu, Jing Liu, Yang Liu, Lixu Wang, Zehua Wang, Wei Chen, Zijian Tian, Richard Yu, Victor C. M. Leung

2025-08-26

http://arxiv.org/abs/2508.18803v1

The proliferation of Internet of Things (IoT) devices in smart cities, transportation, healthcare, and industrial applications, coupled with the explosive growth of AI-driven services, has increased demands for efficient distributed computing architectures and networks, driving cloud-edge-terminal collaborative intelligence (CETCI) as a fundamental paradigm within the artificial intelligence of things (AIoT) community. With advancements in deep learning, large language models (LLMs), and edge computing, CETCI has made significant progress with emerging AIoT applications, moving beyond isolated layer optimization to deployable collaborative intelligence systems for AIoT (CISAIOT), a practical research focus in AI, distributed computing, and LLMs. This survey describes foundational architectures, enabling technologies, and scenarios of CETCI paradigms, offering a tutorial-style review for CISAIOT beginners. We systematically analyze architectural components spanning the cloud, edge, and terminal layers, examining core technologies including network virtualization, container orchestration, and software-defined networking, while presenting categorizations of collaboration paradigms that cover task offloading, resource allocation, and optimization across heterogeneous infrastructures. Furthermore, we explain intelligent collaboration learning frameworks by reviewing advances in federated learning, distributed deep learning, edge-cloud model evolution, and reinforcement learning-based methods. Finally, we discuss challenges (e.g., scalability, heterogeneity, interoperability) and future trends (e.g., 6G+, agents, quantum computing, digital twins), highlighting how the integration of distributed computing and LLMs can address open issues and guide the development of robust, efficient, and secure collaborative AIoT systems.

Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction

Authors: Yilin Li, Xunjian Yin, Yilin Chen, Xiaojun Wan

2025-08-26

http://arxiv.org/abs/2508.18780v1

Grammatical error correction (GEC) is a significant task in NLP. Traditional methods based on encoder-decoder models have achieved certain success, but the application of LLMs in this field is still underexplored. Current research predominantly relies on supervised fine-tuning to train LLMs to directly generate the corrected sentence, which limits the models' powerful reasoning ability. To address this limitation, we propose a novel framework based on rule-based RL. Through experiments on Chinese datasets, our rule-based RL framework achieves state-of-the-art performance, with a notable increase in recall. This result clearly highlights the advantages of using RL to steer LLMs, offering a more controllable and reliable paradigm for future development in GEC.
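For intuition, a rule-based reward for GEC can be computed directly from the model output without a learned reward model. The rules and weights below (exact match, token overlap, copy penalty) are illustrative assumptions; the paper's reward design may differ.

```python
def gec_reward(prediction, reference, source):
    """Simple rule-based reward for grammatical error correction RL.

    +1 for an exact match with the reference, a partial score based on token
    overlap otherwise, and a small penalty if the model simply copies the
    erroneous source sentence unchanged.
    """
    if prediction == reference:
        return 1.0
    pred_tokens, ref_tokens = set(prediction.split()), set(reference.split())
    overlap = len(pred_tokens & ref_tokens) / max(len(ref_tokens), 1)
    copy_penalty = 0.2 if prediction == source else 0.0
    return 0.5 * overlap - copy_penalty

print(gec_reward("She goes to school", "She goes to school", "She go to school"))  # 1.0
```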

UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

Authors: Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao

2025-08-26

http://arxiv.org/abs/2508.18756v1

While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts such as UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameters but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models of up to 2.5B activated parameters out of 120B total parameters, and establish that activation density has a greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.

Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics

Authors: Jungwoo Kim, Minsang Kim, Jaeheon Lee, Chanwoo Moon, Heejin Kim, Taeho Hwang, Woosuk Chung, Yeseong Kim, Sungjin Lee

2025-08-26

http://arxiv.org/abs/2508.18736v1

Serving Large Language Models (LLMs) at scale requires meeting strict Service Level Objectives (SLOs) under severe computational and memory constraints. Nevertheless, traditional caching strategies fall short: exact-matching and prefix caches neglect query semantics, while state-of-the-art semantic caches remain confined to traditional intuitions, offering little conceptual departure. Building on this, we present SISO, a semantic caching system that redefines efficiency for LLM serving. SISO introduces centroid-based caching to maximize coverage with minimal memory, locality-aware replacement to preserve high-value entries, and dynamic thresholding to balance accuracy and latency under varying workloads. Across diverse datasets, SISO delivers up to 1.71x higher hit ratios and consistently stronger SLO attainment compared to state-of-the-art systems.
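A tiny sketch of centroid-style semantic caching: cached responses are stored under normalized embedding centroids, and a query hits the cache when its cosine similarity to the nearest centroid exceeds a threshold. The embedding function, threshold, merging, and eviction policies are all assumptions for illustration, not SISO's implementation.

```python
import numpy as np

class CentroidCache:
    """Semantic cache keyed by embedding centroids with a similarity threshold."""
    def __init__(self, threshold=0.85):
        self.centroids, self.responses, self.threshold = [], [], threshold

    def lookup(self, emb):
        if not self.centroids:
            return None
        e = emb / np.linalg.norm(emb)
        sims = [float(e @ c) for c in self.centroids]   # cosine similarity
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def insert(self, emb, response):
        # each new entry starts as its own centroid in this simplified version
        self.centroids.append(emb / np.linalg.norm(emb))
        self.responses.append(response)

cache = CentroidCache()
cache.insert(np.array([1.0, 0.0]), "cached answer")
print(cache.lookup(np.array([0.95, 0.05])))   # near-duplicate query -> cache hit
```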

Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vectorized Drawings

Authors: Feiwei Qin, Shichao Lu, Junhao Hou, Changmiao Wang, Meie Fang, Ligang Liu

2025-08-26

http://arxiv.org/abs/2508.18733v1

Computer-Aided Design (CAD) generative modeling is driving significant innovations across industrial applications. Recent works have shown remarkable progress in creating solid models from various inputs such as point clouds, meshes, and text descriptions. However, these methods fundamentally diverge from traditional industrial workflows that begin with 2D engineering drawings. The automatic generation of parametric CAD models from these 2D vector drawings remains underexplored despite being a critical step in engineering design. To address this gap, our key insight is to reframe CAD generation as a sequence-to-sequence learning problem in which vector drawing primitives directly inform the generation of parametric CAD operations, preserving geometric precision and design intent throughout the transformation process. We propose Drawing2CAD, a framework with three key technical components: a network-friendly vector primitive representation that preserves precise geometric information, a dual-decoder transformer architecture that decouples command type and parameter generation while maintaining precise correspondence, and a soft target distribution loss function accommodating the inherent flexibility of CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing, a dataset of paired engineering drawings and parametric CAD models, and conduct thorough experiments to demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/lllssc/Drawing2CAD.

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Authors: Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota

2025-08-26

http://arxiv.org/abs/2508.18672v1

Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-k routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-k alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code, and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.

MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use

Authors: Weikang Zhao, Xili Wang, Chengdi Ma, Lingbin Kong, Zhaohua Yang, Mingxiang Tuo, Xiaowei Shi, Yitao Zhai, Xunliang Cai

2025-08-26

http://arxiv.org/abs/2508.18669v1

With the recent rapid advancement of Agentic Intelligence, agentic tool use in LLMs has become increasingly important. During multi-turn interactions between agents and users, the dynamic, uncertain, and stochastic nature of user demands poses significant challenges to the agent's tool invocation capabilities. Agents are no longer expected to simply call tools to deliver a result; rather, they must iteratively refine their understanding of user needs through communication while simultaneously invoking tools to resolve user queries. Existing reinforcement learning (RL) approaches for tool use lack the integration of genuinely dynamic users during the RL training process. To bridge this gap, we introduce MUA-RL (Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use), a novel reinforcement learning framework that, for the first time in the field of agentic tool use, integrates LLM-simulated users into the reinforcement learning loop. MUA-RL aims to enable models to autonomously learn to communicate with users efficiently and use various tools to solve practical problems in dynamic multi-turn interactions. Evaluations are done on several multi-turn tool-using benchmarks (see Figure 1). Specifically, MUA-RL-32B achieves 67.3 on TAU2 Retail, 45.4 on TAU2 Airline, 28.3 on TAU2 Telecom, 28.4 on BFCL-V3 Multi Turn, and 82.5 on ACEBench Agent, outperforming or matching the performance of larger open-source models such as DeepSeek-V3-0324 and Qwen3-235B-A22B in non-thinking settings.

Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models

Authors: Chenxi Zhou, Pengfei Cao, Jiang Li, Jun Zhao, Kang Liu

2025-08-26

http://arxiv.org/abs/2508.18609v2

Large language models (LLMs) present significant deployment challenges due to their scale, with post-training quantization (PTQ) emerging as a practical compression solution. However, a comprehensive understanding of how PTQ precisely impacts diverse LLM knowledge capabilities remains elusive, and existing scaling laws for quantized models often overlook crucial PTQ-specific parameters and task-specific sensitivities. This paper addresses these gaps by conducting an extensive empirical investigation to establish task-stratified scaling laws. We disentangle LLM knowledge into memorization and utilization capabilities and develop a unified quantitative framework that incorporates model size, effective bit-width, calibration set size, and group size. Our central finding reveals that knowledge memorization exhibits markedly greater sensitivity to variations in effective bit-width, calibration set size, and model size compared to the more robust knowledge utilization. These findings offer a fine-grained understanding of PTQ's impact and provide guidance for developing knowledge-aware quantization strategies that can better preserve targeted cognitive functions.

History Rhymes Accelerating LLM Reinforcement Learning with RhymeRL

Authors: Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, Haibo Chen

2025-08-26

http://arxiv.org/abs/2508.18588v1

With the rapid advancement of large language models (LLMs), reinforcement learning (RL) has emerged as a pivotal methodology for enhancing the reasoning capabilities of LLMs. Unlike traditional pre-training approaches, RL encompasses multiple stages: rollout, reward, and training, which necessitates collaboration among various worker types. However, current RL systems continue to grapple with substantial GPU underutilization, due to two primary factors: (1) the rollout stage dominates the overall RL process due to test-time scaling; (2) imbalances in rollout lengths (within the same batch) result in GPU bubbles. While prior solutions like asynchronous execution and truncation offer partial relief, they may compromise training accuracy for efficiency. Our key insight stems from a previously overlooked observation: rollout responses exhibit remarkable similarity across adjacent training epochs. Based on this insight, we introduce RhymeRL, an LLM RL system designed to accelerate RL training with two key innovations. First, to enhance rollout generation, we present HistoSpec, a speculative decoding inference engine that utilizes the similarity of historical rollout token sequences to obtain accurate drafts. Second, to tackle rollout bubbles, we introduce HistoPipe, a two-tier scheduling strategy that leverages the similarity of historical rollout distributions to balance workload among rollout workers. We have evaluated RhymeRL within a real production environment, demonstrating scalability from dozens to thousands of GPUs. Experimental results demonstrate that RhymeRL achieves a 2.6x performance improvement over existing methods, without compromising accuracy or modifying the RL paradigm.
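
The idea of reusing similar historical rollouts as drafts can be pictured with the greedy-verification sketch below; the interfaces (`verify_block`, the block size, the re-anchoring rule) are assumptions for illustration, not RhymeRL's actual engine:

```python
# Sketch only: HistoSpec-style drafting that reuses the previous epoch's rollout
# for the same prompt as draft tokens. `verify_block` stands in for one forward
# pass of the current policy that returns its greedy next-token prediction at
# every drafted position, so several tokens can be accepted per pass.

from typing import Callable, List

def histospec_greedy_decode(prompt: List[int],
                            history_rollout: List[int],
                            verify_block: Callable[[List[int], List[int]], List[int]],
                            block: int = 8,
                            max_new_tokens: int = 64) -> List[int]:
    out: List[int] = []
    cursor = 0  # position in the historical rollout used as the draft source
    while len(out) < max_new_tokens:
        draft = history_rollout[cursor:cursor + block]
        if not draft:
            break  # history exhausted; a real system falls back to normal decoding
        preds = verify_block(prompt + out, draft)  # one pass over all draft positions
        accepted = 0
        for d, p in zip(draft, preds):
            if d != p:
                break
            accepted += 1
        out.extend(draft[:accepted])
        if accepted < len(draft):
            out.append(preds[accepted])  # take the model's own token at the mismatch
            cursor = len(out)            # re-anchor the draft after a miss
        else:
            cursor += accepted
    return out[:max_new_tokens]

if __name__ == "__main__":
    # toy target model: always continues with (previous token + 1) mod 50
    def verify_block(context, draft):
        preds, last = [], context[-1]
        for d in draft:
            preds.append((last + 1) % 50)
            last = d  # the model conditions on the drafted token
        return preds
    history = list(range(1, 40))  # last epoch's rollout for this prompt
    print(histospec_greedy_decode([0], history, verify_block, block=8, max_new_tokens=16))
```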

Strata Hierarchical Context Caching for Long Context Language Model Serving

Authors: Zhiqiang Xie, Ziyi Xu, Mark Zhao, Yuwei An, Vikram Sharma Mailthody, Scott Mahlke, Michael Garland, Christos Kozyrakis

2025-08-26

http://arxiv.org/abs/2508.18572v1

Large Language Models (LLMs) with expanding context windows face significant performance hurdles. While caching key-value (KV) states is critical for avoiding redundant computation, the storage footprint of long-context LLMs quickly exceeds GPU memory capacity, forcing production systems to adopt hierarchical caching across memory hierarchies. However, transferring large cached contexts back to the GPU introduces severe performance bottlenecks: fragmented I/O from paged layouts prevents full bandwidth utilization, and existing schedulers fail to account for KV-loading delays, leaving systems loading-bound rather than compute-bound. We present Strata, a hierarchical context caching framework designed for efficient long-context LLM serving. Strata introduces GPU-assisted I/O to combat KV cache fragmentation, decouples GPU and CPU memory layouts, and employs cache-aware request scheduling to balance compute with I/O latency, overlapping unavoidable stalls with complementary tasks. Built on SGLang and deployed in production, Strata achieves up to 5x lower Time-To-First-Token (TTFT) compared to vLLM + LMCache and a 3.75x speedup over NVIDIA TensorRT-LLM on long-context benchmarks, without degrading short-context performance.

LLM-Driven Intrinsic Motivation for Sparse Reward Reinforcement Learning

Authors: André Quadros, Cassio Silva, Ronnie Alves

2025-08-25

http://arxiv.org/abs/2508.18420v1

This paper explores the combination of two intrinsic motivation strategies to improve the efficiency of reinforcement learning (RL) agents in environments with extremely sparse rewards, where traditional learning struggles due to infrequent positive feedback. We propose integrating Variational State as Intrinsic Reward (VSIMR), which uses Variational AutoEncoders (VAEs) to reward state novelty, with an intrinsic reward approach derived from Large Language Models (LLMs). The LLMs leverage their pre-trained knowledge to generate reward signals based on environment and goal descriptions, guiding the agent. We implemented this combined approach with an Actor-Critic (A2C) agent in the MiniGrid DoorKey environment, a benchmark for sparse rewards. Our empirical results show that this combined strategy significantly increases agent performance and sampling efficiency compared to using each strategy individually or a standard A2C agent, which failed to learn. Analysis of learning curves indicates that the combination effectively complements different aspects of the environment and task: VSIMR drives exploration of new states, while the LLM-derived rewards facilitate progressive exploitation towards goals.
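
A minimal sketch of the reward combination the abstract describes, with stand-in stubs for the VAE novelty signal and the LLM-derived reward (the coefficients and stubs are assumptions, not the paper's values):

```python
# Minimal sketch: combine a VAE-based novelty bonus (VSIMR-style) with an
# LLM-derived shaping reward on top of the environment's sparse reward.

def combined_reward(env_reward: float,
                    state,
                    vae_reconstruction_error,   # callable: state -> float
                    llm_goal_score,             # callable: state -> float in [0, 1]
                    beta_novelty: float = 0.1,
                    beta_llm: float = 0.05) -> float:
    r_novelty = vae_reconstruction_error(state)  # poorly reconstructed => novel state
    r_llm = llm_goal_score(state)                # pre-trained-knowledge judgment of progress
    return env_reward + beta_novelty * r_novelty + beta_llm * r_llm

if __name__ == "__main__":
    # toy stubs: novelty decays as a state is revisited; the "LLM" rewards key possession
    visits = {}
    def novelty(s):
        visits[s] = visits.get(s, 0) + 1
        return 1.0 / visits[s]
    def llm_score(s):
        return 1.0 if "has_key" in s else 0.0
    print(combined_reward(0.0, "room_1", novelty, llm_score))
    print(combined_reward(0.0, "room_1|has_key", novelty, llm_score))
```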

Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning

Authors: Jeong-seok Oh, Jay-yoon Lee

2025-08-25

http://arxiv.org/abs/2508.18395v1

Probabilistic decoding in Large Language Models (LLMs) often yields inconsistent outputs, particularly on complex or long-form questions. Self-Consistency (SC) mitigates this for short-form QA by majority voting over exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form benchmarks. We introduce Latent Self-Consistency (LSC), which selects the most semantically consistent response using learnable token embeddings. A lightweight forward generation of summary tokens increases inference time by less than 1% and requires no changes to the model architecture. Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC surpasses SC, USC, and WUCS on both short-form and long-form benchmarks on average, while maintaining negligible computational overhead. These results position LSC as a practical consistency-selection method that works reliably across answer formats. Additionally, LSC provides well-calibrated confidence estimates, maintaining low Expected Calibration Error across both answer formats.
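
One way to picture majority-set selection in latent space is to choose the sampled answer whose summary embedding agrees most, on average, with the others. The sketch below is an assumption-laden illustration, not the LSC implementation (which uses learnable summary tokens produced by the model itself):

```python
# Illustrative sketch: pick the answer whose summary embedding is most similar
# on average to the other sampled answers, i.e. a "majority" in latent space
# rather than over exact strings.

import numpy as np

def latent_majority_select(embeddings: np.ndarray) -> int:
    """embeddings: (n_samples, d) summary embeddings, one per sampled answer."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T                                 # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)
    consistency = sim.sum(axis=1) / (len(z) - 1)  # mean similarity to the other answers
    return int(np.argmax(consistency))            # index of the most consistent answer

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cluster = rng.normal(0.0, 0.05, size=(4, 8)) + 1.0  # four near-identical answers
    outlier = rng.normal(0.0, 1.0, size=(1, 8))         # one inconsistent answer
    print(latent_majority_select(np.vstack([cluster, outlier])))  # -> an index in 0..3
```

The per-answer consistency score could also double as a rough confidence signal, in the spirit of the calibration claim above.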

Backprompting Leveraging Synthetic Production Data for Health Advice Guardrails

Authors: Kellen Tan Cheng, Anna Lisa Gentile, Chad DeLuca, Guang-Jie Ren

2025-08-25

http://arxiv.org/abs/2508.18384v1

The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant amount of risk associated with their usage. Guardrail technologies aim to mitigate this risk by filtering LLMs' input/output text through various detectors. However, developing and maintaining robust detectors faces many challenges, one of which is the difficulty of acquiring production-quality labeled data on real LLM outputs prior to deployment. In this work, we propose backprompting, a simple yet intuitive solution for generating production-like labeled data for health advice guardrails development. Furthermore, we pair our backprompting method with a sparse human-in-the-loop clustering technique to label the generated data. Our aim is to construct a parallel corpus roughly representative of the original dataset yet resembling real LLM output. We then infuse existing datasets with our synthetic examples to produce robust training data for our detector. We test our technique on one of the most difficult and nuanced guardrails: the identification of health advice in LLM output, and demonstrate improvement versus other solutions. Our detector is able to outperform GPT-4o by up to 3.73%, despite having 400x fewer parameters.

DualSparse-MoE Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction

Authors: Weilin Cai, Le Qin, Shwai He, Junwei Cui, Ang Li, Jiayi Huang

2025-08-25

http://arxiv.org/abs/2508.18376v1

Mixture of Experts (MoE) has become a mainstream architecture for building Large Language Models (LLMs) by reducing per-token computation while enabling model scaling. It can be viewed as partitioning a large Feed-Forward Network (FFN) at the tensor level into fine-grained sub-FFNs, or experts, and activating only a sparse subset for each input. While this sparsity improves efficiency, MoE models still face substantial challenges due to their massive computational scale and unpredictable activation patterns. To enable efficient MoE deployment, we identify dual sparsity at the tensor and neuron levels in pre-trained MoE modules as a key factor for both accuracy and efficiency. Unlike prior work that increases tensor-level sparsity through finer-grained expert design during pre-training, we introduce post-training expert partitioning to induce such sparsity without retraining. This preserves the mathematical consistency of model transformations and enhances both efficiency and accuracy in subsequent fine-tuning and inference. Building upon this, we propose DualSparse-MoE, an inference system that integrates dynamic tensor-level computation dropping with static neuron-level reconstruction to deliver significant efficiency gains with minimal accuracy loss. Experimental results show that enforcing an approximate 25% drop rate with our approach reduces average accuracy by only 0.08%-0.28% across three prevailing MoE models, while nearly all degrees of computation dropping consistently yield proportional computational speedups. Furthermore, incorporating load-imbalance awareness into expert parallelism achieves a 1.41x MoE module speedup with just 0.5% average accuracy degradation.
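
To illustrate what dynamic tensor-level computation dropping at roughly a 25% rate could look like, the sketch below skips the lowest-weight (token, expert) contributions and renormalizes; the dropping criterion and the renormalization are assumptions, not DualSparse-MoE's actual mechanism:

```python
# Sketch under stated assumptions: dynamic tensor-level dropping that skips the
# lowest-weight (token, expert) computations until an approximate target drop
# rate is reached, then renormalizes the remaining routing weights.

import numpy as np

def drop_low_weight_assignments(gate_weights: np.ndarray, drop_rate: float = 0.25):
    """gate_weights: (n_tokens, top_k) routing weights of the selected experts."""
    flat = gate_weights.ravel()
    n_drop = int(round(drop_rate * flat.size))
    if n_drop == 0:
        return gate_weights, np.ones_like(gate_weights, dtype=bool)
    threshold = np.partition(flat, n_drop - 1)[n_drop - 1]
    keep = gate_weights > threshold                # skip the cheapest contributions
    kept = np.where(keep, gate_weights, 0.0)
    denom = kept.sum(axis=1, keepdims=True)
    denom[denom == 0.0] = 1.0                      # guard against divide-by-zero
    return kept / denom, keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.dirichlet(alpha=[1.0] * 2, size=16)    # 16 tokens, top-2 routing weights
    new_w, keep = drop_low_weight_assignments(w, drop_rate=0.25)
    print(f"dropped {1 - keep.mean():.2%} of (token, expert) computations")
```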

Flash Sparse Attention An Alternative Efficient Implementation of Native Sparse Attention Kernel

Authors: Ran Yan, Youhe Jiang, Binhang Yuan

2025-08-25

http://arxiv.org/abs/2508.18224v1

Recent progress in sparse attention mechanisms has demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), a state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers substantial system-level performance gains while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA relies on a query-grouping strategy that is efficient only with large Grouped Query Attention (GQA) sizes, whereas modern LLMs typically adopt much smaller GQA groups, which limits the applicability of this sparse attention advance. In this work, we propose Flash Sparse Attention (FSA), an alternative kernel design that enables efficient NSA computation across a wide range of popular LLMs with varied, smaller GQA group sizes on modern GPUs. Compared to the vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5x and on average 1.6x kernel-level latency reduction, (ii) up to 1.25x and on average 1.09x end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36x and on average 1.11x end-to-end decoding speedup on state-of-the-art LLMs. The source code is open-sourced and publicly available at https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.

Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios

Authors: Luana Bulla, Gabriele Tuccio, Misael Mongiovì, Aldo Gangemi

2025-08-25

http://arxiv.org/abs/2508.18183v1

Translating natural languages into sign languages is a highly complex and underexplored task. Despite growing interest in accessibility and inclusivity, the development of robust translation systems remains hindered by the limited availability of parallel corpora that align natural language with sign language data. Existing methods often struggle to generalize in these data-scarce environments, as the few datasets available are typically domain-specific, lack standardization, or fail to capture the full linguistic richness of sign languages. To address this limitation, we propose Advanced Use of LLMs for Sign Language Translation (AulSign), a novel method that leverages Large Language Models via dynamic prompting and in-context learning with sample selection and subsequent sign association. Despite their impressive abilities in processing text, LLMs lack intrinsic knowledge of sign languages; therefore, they are unable to natively perform this kind of translation. To overcome this limitation, we associate the signs with compact descriptions in natural language and instruct the model to use them. We evaluate our method on both English and Italian using SignBank+, a recognized benchmark in the field, as well as the Italian LaCAM CNR-ISTC dataset. We demonstrate superior performance compared to state-of-the-art models in low-data scenarios. Our findings demonstrate the effectiveness of AulSign, with the potential to enhance accessibility and inclusivity in communication technologies for underrepresented linguistic communities.

AdLoCo adaptive batching significantly improves communications efficiency and convergence for Large Language Models

Authors: Nikolay Kutuzov, Makar Baderko, Stepan Kulibaba, Artem Dzhalilov, Daniel Bobrov, Maxim Mashtaler, Alexander Gasnikov

2025-08-25

http://arxiv.org/abs/2508.18182v1

Scaling distributed training of Large Language Models (LLMs) requires not only algorithmic advances but also efficient utilization of heterogeneous hardware resources. While existing methods such as DiLoCo have demonstrated promising results, they often fail to fully exploit computational clusters under dynamic workloads. To address this limitation, we propose a three-stage method that combines Multi-Instance Training (MIT), Adaptive Batched DiLoCo, and a switch-mode mechanism. MIT allows individual nodes to run multiple lightweight training streams with different model instances in parallel and merge them to combine knowledge, increasing throughput and reducing idle time. Adaptive Batched DiLoCo dynamically adjusts local batch sizes to balance computation and communication, substantially lowering synchronization delays. Switch mode further stabilizes training by seamlessly introducing gradient accumulation once adaptive batch sizes grow beyond hardware-friendly limits. Together, these innovations improve both convergence speed and system efficiency. We also provide a theoretical estimate of the number of communication rounds required for the full convergence of a model trained using our method.
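
A small sketch of the switch-mode idea, under the assumption that "hardware-friendly limits" translate into a maximum per-device batch: once the adaptively grown batch exceeds it, the step is split into gradient-accumulation micro-batches (all numbers are illustrative, not AdLoCo's schedule):

```python
# Minimal sketch: grow the local batch size adaptively and switch to gradient
# accumulation once the batch no longer fits hardware-friendly limits, so the
# effective batch keeps growing without exceeding device memory.

def plan_local_step(target_batch: int, max_device_batch: int):
    """Split a desired effective batch into (micro_batch, accumulation_steps)."""
    if target_batch <= max_device_batch:
        return target_batch, 1                    # plain large-batch step
    accum = -(-target_batch // max_device_batch)  # ceil division: switch mode engaged
    micro = -(-target_batch // accum)
    return micro, accum

if __name__ == "__main__":
    batch = 64
    for step in range(6):
        micro, accum = plan_local_step(batch, max_device_batch=256)
        print(f"step {step}: effective={micro * accum} micro={micro} accum={accum}")
        batch *= 2                                # adaptive growth (illustrative doubling)
```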

HLLM-Creator Hierarchical LLM-based Personalized Creative Generation

Authors: Junyi Chen, Lu Chi, Siliang Xu, Shiwei Ran, Bingyue Peng, Zehuan Yuan

2025-08-25

http://arxiv.org/abs/2508.18118v1

AI-generated content technologies are widely used in content creation. However, current AIGC systems rely heavily on creators' inspiration and rarely generate truly user-personalized content. In real-world applications such as online advertising, a single product may have multiple selling points, with different users focusing on different features. This underscores the significant value of personalized, user-centric creative generation. Effective personalized content generation faces two main challenges: (1) accurately modeling user interests and integrating them into the content generation process while adhering to factual constraints, and (2) ensuring high efficiency and scalability to handle the massive user base in industrial scenarios. Additionally, the scarcity of personalized creative data in practice complicates model training, making data construction another key hurdle. We propose HLLM-Creator, a hierarchical LLM framework for efficient user interest modeling and personalized content generation. During inference, a combination of user clustering and a pruning strategy based on user-ad matching prediction is employed to significantly enhance generation efficiency and reduce computational overhead, making the approach suitable for large-scale deployment. Moreover, we design a data construction pipeline based on chain-of-thought reasoning, which generates high-quality, user-specific creative titles and ensures factual consistency despite limited personalized data. This pipeline serves as a critical foundation for the effectiveness of our model. Extensive experiments on personalized title generation for Douyin Search Ads show the effectiveness of HLLM-Creator. An online A/B test shows a 0.476% increase on Adss, paving the way for more effective and efficient personalized generation in industrial scenarios. Code for the academic dataset is available at https://github.com/bytedance/HLLM.

The AI Data Scientist

Authors: Farkhad Akimov, Munachiso Samuel Nwadike, Zangir Iklassov, Martin Takáč

2025-08-25

http://arxiv.org/abs/2508.18113v1

Imagine decision-makers uploading data and, within minutes, receiving clear, actionable insights delivered straight to their fingertips. That is the promise of the AI Data Scientist, an autonomous Agent powered by large language models (LLMs) that closes the gap between evidence and action. Rather than simply writing code or responding to prompts, it reasons through questions, tests ideas, and delivers end-to-end insights at a pace far beyond traditional workflows. Guided by the scientific tenet of the hypothesis, this Agent uncovers explanatory patterns in data, evaluates their statistical significance, and uses them to inform predictive modeling. It then translates these results into recommendations that are both rigorous and accessible. At the core of the AI Data Scientist is a team of specialized LLM Subagents, each responsible for a distinct task such as data cleaning, statistical testing, validation, and plain-language communication. These Subagents write their own code, reason about causality, and identify when additional data is needed to support sound conclusions. Together, they achieve in minutes what might otherwise take days or weeks, enabling a new kind of interaction that makes deep data science both accessible and actionable.

A.S.E A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Authors: Keke Lian, Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li, Dong Zhang

2025-08-25

http://arxiv.org/abs/2508.18106v1

The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks are inadequate, as they focus on isolated code snippets, employ unstable evaluation methods that lack reproducibility, and fail to connect the quality of input context with the security of the output. To address these gaps, we introduce A.S.E (AI Code Generation Security Evaluation), a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context such as build systems and cross-file dependencies. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability. Our evaluation of leading LLMs on A.S.E reveals three key findings: (1) Claude-3.7-Sonnet achieves the best overall performance. (2) The security gap between proprietary and open-source models is narrow; Qwen3-235B-A22B-Instruct attains the top security score. (3) Concise, "fast-thinking" decoding strategies consistently outperform complex, "slow-thinking" reasoning for security patching.

ILRe Intermediate Layer Retrieval for Context Compression in Causal Language Models

Authors: Manlai Liang, Mandi Liu, Jiangzhou Ji, Huaijun Li, Haobo Yang, Yaohan He, Jinlong Li

2025-08-25

http://arxiv.org/abs/2508.17892v1

Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and the full KV cache in that specified layer. In particular, we propose a multi-pooling kernel allocation strategy in the token recalling process to maintain the completeness of semantics. Our approach not only reduces the prefilling complexity but also achieves performance comparable to or better than using the full context in long-context scenarios. Without additional post-training or operator development, ILRe can process a single long request in less than half a minute and achieves competitive RULER benchmark scores with Llama-3.1-UltraLong-8B-1M-Instruct on a Huawei Ascend 910B NPU.
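
The retrieval step can be pictured as scoring cached keys at the chosen intermediate layer against the query states and keeping the top positions; the max-pooling over several window sizes below is only one plausible reading of the multi-pooling strategy, and all shapes, sizes, and function names are assumptions rather than the ILRe implementation:

```python
# Sketch with assumed shapes: score context tokens by query-key attention at one
# intermediate layer, smooth the scores with several pooling windows so that
# neighbouring tokens are kept together, and retain the top-scoring positions
# as the compressed context.

import numpy as np

def ilre_select(q: np.ndarray, k_cache: np.ndarray, keep: int,
                pool_sizes=(1, 7, 15)) -> np.ndarray:
    """q: (n_q, d) query states at layer l; k_cache: (n_ctx, d) cached keys at layer l."""
    scores = (q @ k_cache.T) / np.sqrt(q.shape[1])   # (n_q, n_ctx) attention logits
    token_score = scores.max(axis=0)                 # best score any query token assigns
    pooled = np.zeros_like(token_score)
    for w in pool_sizes:                             # multi-size max pooling
        pad = w // 2
        padded = np.pad(token_score, pad, mode="edge")
        windows = np.lib.stride_tricks.sliding_window_view(padded, w)
        pooled = np.maximum(pooled, windows.max(axis=1))
    return np.sort(np.argsort(pooled)[-keep:])       # keep original token order

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, kc = rng.normal(size=(8, 64)), rng.normal(size=(2048, 64))
    print(ilre_select(q, kc, keep=256).shape)        # (256,)
```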

LexSemBridge Fine-Grained Dense Representation Enhancement through Token-Aware Embedding Augmentation

Authors: Shaoxiong Zhan, Hai Lin, Hongming Tan, Xiaodong Cai, Hai-Tao Zheng, Xin Su, Zifei Shan, Ruitong Liu, Hong-Gee Kim

2025-08-25

http://arxiv.org/abs/2508.17858v1

As queries in retrieval-augmented generation (RAG) pipelines powered by large language models (LLMs) become increasingly complex and diverse, dense retrieval models have demonstrated strong performance in semantic matching. Nevertheless, they often struggle with fine-grained retrieval tasks, where precise keyword alignment and span-level localization are required, even in cases with high lexical overlap that would intuitively suggest easier retrieval. To systematically evaluate this limitation, we introduce two targeted tasks, keyword retrieval and part-of-passage retrieval, designed to simulate practical fine-grained scenarios. Motivated by these observations, we propose LexSemBridge, a unified framework that enhances dense query representations through fine-grained, input-aware vector modulation. LexSemBridge constructs latent enhancement vectors from input tokens using three paradigms: Statistical (SLR), Learned (LLR), and Contextual (CLR), and integrates them with dense embeddings via element-wise interaction. Theoretically, we show that this modulation preserves the semantic direction while selectively amplifying discriminative dimensions. LexSemBridge operates as a plug-in without modifying the backbone encoder and naturally extends to both text and vision modalities. Extensive experiments across semantic and fine-grained retrieval tasks validate the effectiveness and generality of our approach. All code and models are publicly available at https://github.com/Jasaxion/LexSemBridge/
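
The element-wise interaction can be sketched as follows, with a hashed token-count vector standing in for the Statistical (SLR) construction; the projection, gating form, and scale are assumptions for illustration, not the paper's definitions:

```python
# Illustrative sketch: build a token-derived enhancement vector and modulate the
# dense query embedding element-wise, keeping its direction roughly intact while
# re-weighting individual dimensions.

import numpy as np

def lexical_enhancement_vector(token_ids, dim: int, scale: float = 0.1) -> np.ndarray:
    """A statistical (SLR-style) stand-in: hash token identities into the embedding space."""
    v = np.zeros(dim)
    for t in token_ids:
        rng = np.random.default_rng(t)              # deterministic per-token direction
        v += rng.normal(size=dim)
    norm = np.linalg.norm(v)
    return scale * v / norm if norm > 0 else v

def modulate(dense_embedding: np.ndarray, enhancement: np.ndarray) -> np.ndarray:
    out = dense_embedding * (1.0 + enhancement)     # element-wise interaction
    return out / np.linalg.norm(out)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=256)
    e = lexical_enhancement_vector([101, 2054, 2003, 102], dim=256)
    q_mod = modulate(q, e)
    cos = float(q @ q_mod / (np.linalg.norm(q) * np.linalg.norm(q_mod)))
    print(f"cosine(original, modulated) = {cos:.3f}")  # close to 1: direction preserved
```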

Speculative Safety-Aware Decoding

Authors: Xuekang Wang, Shengyu Zhu, Xueqi Cheng

2025-08-25

http://arxiv.org/abs/2508.17739v1

Despite extensive efforts to align Large Language Models (LLMs) with human values and safety rules, jailbreak attacks that exploit certain vulnerabilities continuously emerge, highlighting the need to strengthen existing LLMs with additional safety properties to defend against these attacks. However, tuning large models has become increasingly resource-intensive and may have difficulty ensuring consistent performance. We introduce Speculative Safety-Aware Decoding (SSD), a lightweight decoding-time approach that equips LLMs with the desired safety property while accelerating inference. We assume that there exists a small language model that possesses this desired property. SSD integrates speculative sampling during decoding and leverages the match ratio between the small and composite models to quantify jailbreak risks. This enables SSD to dynamically switch between decoding schemes to prioritize utility or safety, handling the challenge of differing model capacities. The output token is then sampled from a new distribution that combines the distributions of the original and small models. Experimental results show that SSD successfully equips the large model with the desired safety property while allowing the model to remain helpful to benign queries. Furthermore, SSD accelerates inference, thanks to the speculative sampling design.
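
A toy sketch of the switching idea: treat the draft-acceptance (match) ratio as a risk signal and blend the large and small models' next-token distributions accordingly. The mixture rule and threshold below are assumptions, not the published SSD algorithm:

```python
# Sketch under assumptions: a small safety-aligned model drafts tokens; its match
# ratio against the large model serves as a risk signal, and the next-token
# distribution leans toward the safe model when matches are low.

import numpy as np

def combined_next_token(p_large: np.ndarray, p_small: np.ndarray,
                        match_ratio: float, low: float = 0.5) -> np.ndarray:
    """Low match ratio => the prefix behaves unlike safe text => weight safety more."""
    alpha = 1.0 if match_ratio >= low else match_ratio / low  # utility weight in [0, 1]
    mix = alpha * p_large + (1.0 - alpha) * p_small
    return mix / mix.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab = 10
    p_large = rng.dirichlet(np.ones(vocab))   # large / composite model distribution
    p_small = rng.dirichlet(np.ones(vocab))   # small safety-aligned model distribution
    for ratio in (0.9, 0.2):                  # benign-looking vs. suspicious prefix
        p = combined_next_token(p_large, p_small, ratio)
        print(f"match_ratio={ratio}: sampled token = {rng.choice(vocab, p=p)}")
```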

CMFDNet Cross-Mamba and Feature Discovery Network for Polyp Segmentation

Authors: Feng Jiang, Zongfei Zhang, Xin Xu

2025-08-25

http://arxiv.org/abs/2508.17729v1

Automated colonic polyp segmentation is crucial for assisting doctors in the screening of precancerous polyps and the diagnosis of colorectal neoplasms. Although existing methods have achieved promising results, polyp segmentation remains hindered by the following limitations: (1) significant variation in polyp shapes and sizes, (2) indistinct boundaries between polyps and adjacent tissues, and (3) small polyps are easily overlooked during segmentation. Driven by these practical difficulties, an innovative architecture, CMFDNet, is proposed with the CMD module, MSA module, and FD module. The CMD module, serving as an innovative decoder, introduces a cross-scanning method to reduce blurry boundaries. The MSA module adopts a multi-branch parallel structure to enhance the recognition of polyps with diverse geometries and scale distributions. The FD module establishes dependencies among all decoder features to alleviate the under-detection of polyps with small-scale features. Experimental results show that CMFDNet outperforms six SOTA methods used for comparison, especially on the ETIS and ColonDB datasets, where its mDice scores exceed the best SOTA method by 1.83% and 1.55%, respectively.

CATformer Contrastive Adversarial Transformer for Image Super-Resolution

Authors: Qinyi Tian, Spence Cox, Laura E. Dalton

2025-08-25

http://arxiv.org/abs/2508.17708v1

Super-resolution remains a promising technique for enhancing the quality of low-resolution images. This study introduces CATformer (Contrastive Adversarial Transformer), a novel neural network integrating diffusion-inspired feature refinement with adversarial and contrastive learning. CATformer employs a dual-branch architecture combining a primary diffusion-inspired transformer, which progressively refines latent representations, with an auxiliary transformer branch designed to enhance robustness to noise through learned latent contrasts. These complementary representations are fused and decoded using deep Residual-in-Residual Dense Blocks for enhanced reconstruction quality. Extensive experiments on benchmark datasets demonstrate that CATformer outperforms recent transformer-based and diffusion-inspired methods in both efficiency and visual image quality. This work bridges the performance gap among transformer-, diffusion-, and GAN-based methods, laying a foundation for practical applications of diffusion-inspired transformers in super-resolution.

CoCoA Confidence and Context-Aware Adaptive Decoding for Resolving Knowledge Conflicts in Large Language Models

Authors: Anant Khandelwal, Manish Gupta, Puneet Agrawal

2025-08-25

http://arxiv.org/abs/2508.17670v2

Faithful generation in large language models (LLMs) is challenged by knowledge conflicts between parametric memory and external context. Existing contrastive decoding methods tuned specifically to handle conflict often lack adaptability and can degrade performance in low-conflict settings. We introduce CoCoA (Confidence- and Context-Aware Adaptive Decoding), a novel token-level algorithm for principled conflict resolution and enhanced faithfulness. CoCoA resolves conflict by utilizing confidence-aware measures (entropy gap and contextual peakedness) and the generalized divergence between the parametric and contextual distributions. Crucially, CoCoA maintains strong performance even in low-conflict settings. Extensive experiments across multiple LLMs on diverse Question Answering (QA), Summarization, and Long-Form Question Answering (LFQA) benchmarks demonstrate CoCoA's state-of-the-art performance over strong baselines like AdaCAD. It yields significant gains in QA accuracy, up to 9.2 points on average compared to the strong baseline AdaCAD, and improves factuality in summarization and LFQA by up to 2.5 points on average across benchmarks. Additionally, it demonstrates superior sensitivity to conflict variations. CoCoA enables more informed, context-aware, and ultimately more faithful token generation.
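
The token-level adaptive idea can be illustrated with a simple mixing rule driven by an entropy gap and a divergence term; the specific statistics and weighting below are stand-ins, not CoCoA's actual formulation:

```python
# Minimal sketch: mix the with-context and without-context next-token
# distributions, leaning on the context more when it is confident (peaked)
# and disagrees with the parametric prior.

import numpy as np

def entropy(p: np.ndarray) -> float:
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def kl(p: np.ndarray, q: np.ndarray) -> float:
    p, q = np.clip(p, 1e-12, 1.0), np.clip(q, 1e-12, 1.0)
    return float((p * np.log(p / q)).sum())

def adaptive_mix(p_context: np.ndarray, p_parametric: np.ndarray) -> np.ndarray:
    conflict = kl(p_context, p_parametric)                   # divergence between the two views
    peakedness = entropy(p_parametric) - entropy(p_context)  # context confidence gap
    # weight on the contextual distribution grows with conflict and peakedness
    w = 1.0 / (1.0 + np.exp(-(conflict + peakedness)))
    mix = w * p_context + (1.0 - w) * p_parametric
    return mix / mix.sum()

if __name__ == "__main__":
    p_ctx = np.array([0.85, 0.05, 0.05, 0.03, 0.02])  # context strongly supports token 0
    p_par = np.array([0.10, 0.60, 0.10, 0.10, 0.10])  # parametric memory prefers token 1
    print(adaptive_mix(p_ctx, p_par).round(3))
```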