2025-10-31
Table of Contents
- The Limits of Obliviate Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework
- INT v.s. FP A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
- PureKV Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models
- Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry
- Feedback Alignment Meets Low-Rank Manifolds A Structured Recipe for Local Learning
- Standardization of Psychiatric Diagnoses -- Role of Fine-tuned LLM Consortium and OpenAI-gpt-oss Reasoning LLM Enabled Decision Support System
- TwinVoice A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation
- Echo-Conditioned Denoising Diffusion Probabilistic Models for Multi-Target Tracking in RF Sensing
- A Critical Study of Automatic Evaluation in Sign Language Translation
- Implicature in Interaction Understanding Implicature Improves Alignment in Human-LLM Interaction
- Serve Programs, Not Prompts
- Lightweight Federated Learning in Mobile Edge Computing with Statistical and Device Heterogeneity Awareness
- 4-Doodle Text to 3D Sketches that Move!
- Parrot A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning
- DIRC-RAG Accelerating Edge RAG with Robust High-Density and High-Loading-Bandwidth Digital In-ReRAM Computation
- MoEntwine Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference
- Energy-Efficient Autonomous Driving with Adaptive Perception and Robust Decision
- Transformers in Medicine Improving Vision-Language Alignment for Medical Image Captioning
- Model-Document Protocol for AI Search
- Conditional neural field for spatial dimension reduction of turbulence data a comparison study
- KnowCoder-A1 Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA
- H3M-SSMoEs Hypergraph-based Multimodal Learning with LLM Reasoning and Style-Structured Mixture of Experts
- WebLeaper Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
- Repurposing Synthetic Data for Fine-grained Search Agent Supervision
- Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
- Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
- Diffusion LLM with Native Variable Generation Lengths Let [EOS] Lead the Way
- Parallel Loop Transformer for Efficient Test-Time Computation Scaling
- Decoupled MeanFlow Turning Flow Models into Flow Maps for Accelerated Sampling
- MiniOneRec An Open-Source Framework for Scaling Generative Recommendation
- Metadata-Driven Retrieval-Augmented Generation for Financial Question Answering
- Text Simplification with Sentence Embeddings
- SALS Sparse Attention in Latent Space for KV cache Compression
- Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
- Pilot Distortion Design for ToA Obfuscation in Uplink OFDM Communication
- FALQON Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic
- Pie A Programmable Serving System for Emerging LLM Applications
- SpecKD Speculative Decoding for Effective Knowledge Distillation of LLMs
- PRO Enabling Precise and Robust Text Watermark for Open-Source LLMs
- Accurate Prediction of Nonlinear Distortion of Multi-Carrier Signals
- Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders
- CountFormer A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting
- BitSkip An Empirical Analysis of Quantization and Early Exit Composition
- Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization
- Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier
- Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving
- Adaptive Blockwise Search Inference-Time Alignment for Large Language Models
- Evaluation of Vision-LLMs in Surveillance Video
- Lost in Tokenization Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs
- Beyond Imprecise Distance Metrics LLM-Predicted Target Call Stacks for Directed Greybox Fuzzing
- P1GPT a multi-agent LLM workflow module for multi-modal financial information analysis
- UGAE Unified Geometry and Attribute Enhancement for G-PCC Compressed Point Clouds
- Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition
- Switchable Token-Specific Codebook Quantization For Face Image Compression
- How Can AI Augment Access to Justice? Public Defenders' Perspectives on AI Adoption
- Rethinking Inference Placement for Deep Learning across Edge and Cloud Platforms A Multi-Objective Optimization Perspective and Future Directions
- Batch Speculative Decoding Done Right
- Sub-microsecond Transformers for Jet Tagging on FPGAs
- Long-Term PM2.5 Forecasting Using a DTW-Enhanced CNN-GRU Model
- Leveraging Large Language Models to Identify Conversation Threads in Collaborative Learning
- Region-Adaptive Learned Hierarchical Encoding for 3D Gaussian Splatting Data
- Iterative Layer Pruning for Efficient Translation Inference
- Beyond Semantics How Temporal Biases Shape Retrieval in Transformer and State-Space Models
- Rule-Based Explanations for Retrieval-Augmented LLM Systems
- Transformers from Compressed Representations
- TVMC Time-Varying Mesh Compression via Multi-Stage Anchor Mesh Generation
- AI-Driven Carbon Monitoring Transformer-Based Reconstruction of Atmospheric CO2 in Canadian Poultry Regions
- SABlock Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size
- AesCrop Aesthetic-driven Cropping Guided by Composition
- Aligning Diffusion Language Models via Unpaired Preference Optimization
- Frustratingly Easy Task-aware Pruning for Large Language Models
- CHOIR Collaborative Harmonization fOr Inference Robustness
- Backward-Friendly Optimization Training Large Language Models with Approximate Gradients under Memory Constraints
- GigaEmbeddings Efficient Russian Language Embedding Model
- The Structural Scalpel Automated Contiguous Layer Pruning for Large Language Models
- Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders
- Efficient Low Rank Attention for Long-Context Inference in Large Language Models
- PACR Progressively Ascending Confidence Reward for LLM Reasoning
- Synthetic-to-Real Transfer Learning for Chromatin-Sensitive PWS Microscopy
- When Fewer Layers Break More Chains Layer Pruning Harms Test-Time Scaling in LLMs
- TrajGATFormer A Graph-Based Transformer Approach for Worker and Obstacle Trajectory Prediction in Off-site Construction Environments
- Surface Reading LLMs Synthetic Text and its Styles
- Edit Less, Achieve More Dynamic Sparse Neuron Masking for Lifelong Knowledge Editing in LLMs
- Scaling Up Efficient Small Language Models Serving and Deployment for Semantic Job Search
- Generalization or Memorization Dynamic Decoding for Mode Steering
- Embracing Trustworthy Brain-Agent Collaboration as Paradigm Extension for Intelligent Assistive Technologies
- Compositional Bias Control in Large Language Models Preference Learning Fails, Supervision Succeeds
- Pruning and Quantization Impact on Graph Neural Networks
- Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative Recommenders
- On the acceleration of cosmic rays at the post-adiabatic shocks of supernova remnants
- Sprint Sparse-Dense Residual Fusion for Efficient Diffusion Transformers
- From Social Division to Cohesion with AI Message Suggestions in Online Chat Groups
- Performance Trade-offs of Optimizing Small Language Models for E-Commerce
- Model-Aware Tokenizer Transfer
- Adversarial Déjà Vu Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
The Limits of Obliviate Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework
Authors: Aakriti Shah, Thai Le
2025-10-29
Unlearning in large language models (LLMs) is crucial for managing sensitive data and correcting misinformation, yet evaluating its effectiveness remains an open problem. We investigate whether persuasive prompting can recall factual knowledge from deliberately unlearned LLMs across models ranging from 2.7B to 13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing from ACT-R and Hebbian theory (spreading activation theories), as well as persuasion principles, we introduce the Stimulus-Knowledge Entanglement-Behavior Framework (SKeB), which models information entanglement via domain graphs and tests whether factual recall in unlearned models is correlated with persuasive framing. We develop entanglement metrics to quantify knowledge activation patterns and evaluate factuality, non-factuality, and hallucination in outputs. Our results show persuasive prompts substantially enhance factual knowledge recall (14.8% baseline vs. 24.5% with authority framing), with effectiveness inversely correlated with model size (128% recovery in 2.7B vs. 15% in 13B). SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.
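The spreading-activation idea the framework draws on (from ACT-R and Hebbian theory) can be illustrated with a toy sketch; the graph, decay factor, and update rule below are illustrative assumptions, not the paper's actual entanglement model:

```python
import numpy as np

# Toy domain graph: nodes are concepts, edges encode entanglement.
# Activation injected at a stimulus node propagates to neighbors,
# attenuated by a decay factor at each hop (hypothetical parameters).
adjacency = np.array([
    [0, 1, 1, 0],   # stimulus concept
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

def spread_activation(adj, source, steps=3, decay=0.5):
    """Propagate activation from `source` for `steps` hops."""
    n = adj.shape[0]
    # Row-normalize so each node splits its activation among its neighbors.
    norm = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1)
    act = np.zeros(n)
    act[source] = 1.0
    total = act.copy()
    for _ in range(steps):
        act = decay * (act @ norm)   # one hop, attenuated
        total += act
    return total

activation = spread_activation(adjacency, source=0)
# Node 3 is reachable only through node 2, so it accumulates the least.
```

In this sketch, knowledge "entangled" with the stimulus (well-connected nodes) retains high activation even when a single node is suppressed, which is the intuition behind probing unlearned facts through related prompts.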
INT v.s. FP A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Authors: Mengzhao Chen, Meng Wu, Hui Jin, Zhihang Yuan, Jing Liu, Chaoyi Zhang, Yunshui Li, Jie Huang, Jin Ma, Zeyue Xue, Zhiheng Liu, Xingyan Bin, Ping Luo
2025-10-29
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
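The block-wise granularity discussed above (e.g., block size 32) is easy to illustrate. The following is a generic sketch of symmetric per-block INT8 quantization, not the MXINT8 format or the paper's symmetric clipping method:

```python
import numpy as np

def quantize_int8_blockwise(x, block=32):
    """Symmetric per-block INT8 quantization: one scale per `block` values."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 128)).astype(np.float32)
q, s = quantize_int8_blockwise(x.ravel())
x_hat = dequantize(q, s).reshape(x.shape)

# Each block carries its own scale, so a single outlier only degrades
# the 32 values that share its block rather than the whole tensor.
err = np.abs(x - x_hat).max()
```

The maximum error per value is bounded by half the block's scale, which is why shrinking the block size tightens accuracy at the cost of storing more scales.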
PureKV Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models
Authors: Zhonghua Jiang, Kunxi Li, Yiyun Zhou, Sihao Liu, Zhaode Wang, Chengfei lv, Shengyu Zhang
2025-10-29
Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs. The quadratic complexity in attention and autoregressive generation, as well as the constantly growing key-value (KV) cache size, severely hinder the prefilling and decoding stages. Recent efforts have attempted to compress the KV cache by identifying and evicting the KV of less important tokens, but these methods typically rely on attention scores to estimate token importance, making them incompatible with efficient attention mechanisms such as FlashAttention and Sparse Attention, which do not explicitly compute attention matrices. Moreover, existing methods overlook how sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache, thereby compromising the effectiveness of downstream compression strategies. To address this issue, we propose PureKV, a plug-and-play framework for joint optimization of sparse attention and KV cache compression. We first introduce a KV cache compression strategy that is fully compatible with efficient attention accelerators. Our method utilizes lower-layer attention scores to estimate the importance of higher layers' KV cache, enabling active eviction without compromising accuracy. In addition, we have designed a Spatial-Temporal Sparse Attention (ST-SpAttn) module specifically tailored for video KV cache compression algorithms. This module combines spatial and temporal attention sparsity to improve the efficiency of KV cache optimization algorithms by purifying spatial noise and temporal redundancy in the KV cache. At the same time, ST-SpAttn also accelerates the prefilling stage of VLLMs. Extensive experiments on VLLMs (VideoLLaMA2, Qwen2.5-VL) show that PureKV achieves 5.0x KV cache compression and 3.16x prefilling speedup, with negligible quality degradation.
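For context, the attention-score-based eviction baseline that this line of work builds on can be sketched generically; the scoring and top-k rule here are illustrative, not PureKV's lower-layer estimation strategy:

```python
import numpy as np

def evict_kv(keys, values, attn_scores, keep):
    """Score-based KV cache eviction: retain the `keep` cached tokens with
    the highest accumulated attention mass, drop the rest."""
    totals = attn_scores.sum(axis=0)              # importance per cached token
    kept = np.sort(np.argsort(totals)[-keep:])    # top-k, in original order
    return keys[kept], values[kept], kept

rng = np.random.default_rng(1)
T = 16                                            # cached tokens
keys = rng.normal(size=(T, 8))
values = rng.normal(size=(T, 8))
attn = rng.random(size=(4, T))                    # 4 queries over the cache
attn /= attn.sum(axis=1, keepdims=True)           # normalize attention rows

k2, v2, kept = evict_kv(keys, values, attn, keep=8)
```

Note the dependency the abstract points out: this baseline needs the explicit attention matrix `attn`, which fused kernels like FlashAttention never materialize, motivating importance estimates from elsewhere (e.g., lower layers).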
Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry
Authors: Run Peng, Ziqiao Ma, Amy Pang, Sikai Li, Zhang Xi-Jia, Yingzhuo Yu, Cristian-Paul Bara, Joyce Chai
2025-10-29
While Large Language Model (LLM) agents are often approached from the angle of action planning/generation to accomplish a goal (e.g., given by language descriptions), their abilities to collaborate with each other to achieve a joint goal are not well explored. To address this limitation, this paper studies LLM agents in task collaboration, particularly under the condition of information asymmetry, where agents have disparities in their knowledge and skills and need to work together to complete a shared task. We extend Einstein Puzzles, a classical symbolic puzzle, to a table-top game. In this game, two LLM agents must reason, communicate, and act to satisfy spatial and relational constraints required to solve the puzzle. We apply a fine-tuning-plus-verifier framework in which LLM agents are equipped with various communication strategies and verification signals from the environment. Empirical results highlight the critical importance of aligned communication, especially when agents possess both information-seeking and -providing capabilities. Interestingly, agents without communication can still achieve high task performance; however, further analysis reveals a lack of true rule understanding and lower trust from human evaluators. Instead, by integrating an environment-based verifier, we enhance agents' ability to comprehend task rules and complete tasks, promoting both safer and more interpretable collaboration in AI systems. https://github.com/Roihn/EinsteinPuzzles
Feedback Alignment Meets Low-Rank Manifolds A Structured Recipe for Local Learning
Authors: Arani Roy, Marco P. Apolinario, Shristi Das Biswas, Kaushik Roy
2025-10-29
Training deep neural networks (DNNs) with backpropagation (BP) achieves
state-of-the-art accuracy but requires global error propagation and full
parameterization, leading to substantial memory and computational overhead.
Direct Feedback Alignment (DFA) enables local, parallelizable updates with
lower memory requirements but is limited by unstructured feedback and poor
scalability in deeper architectures, especially convolutional neural networks.
To address these limitations, we propose a structured local learning framework
that operates directly on low-rank manifolds defined by the Singular Value
Decomposition (SVD) of weight matrices. Each layer is trained in its decomposed
form, with updates applied to the SVD components using a composite loss that
integrates cross-entropy, subspace alignment, and orthogonality regularization.
Feedback matrices are constructed to match the SVD structure, ensuring
consistent alignment between forward and feedback pathways. Our method reduces
the number of trainable parameters relative to the original DFA model, without
relying on pruning or post hoc compression. Experiments on CIFAR-10, CIFAR-100,
and ImageNet show that our method achieves accuracy comparable to that of BP.
Ablation studies confirm the importance of each loss term in the low-rank
setting. These results establish local learning on low-rank manifolds as a
principled and scalable alternative to full-rank gradient-based training.
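The core DFA mechanic referenced above, replacing the backprop error pathway with a fixed random feedback matrix, can be sketched on a toy regression problem (dimensions, learning rate, and loss are illustrative; the paper's SVD-structured feedback matrices are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression network trained with Direct Feedback Alignment (DFA):
# the hidden layer's error signal arrives through a FIXED random matrix B
# instead of the transpose of the forward weights, as backprop would use.
n_in, n_hid, n_out = 4, 16, 3
W1 = rng.normal(0, 0.5, (n_in, n_hid))
W2 = rng.normal(0, 0.5, (n_hid, n_out))
B = rng.normal(0, 0.5, (n_out, n_hid))   # fixed feedback matrix, never trained

x = rng.normal(size=(8, n_in))
y = rng.normal(size=(8, n_out))

def mse():
    return float(((np.maximum(x @ W1, 0) @ W2 - y) ** 2).mean())

loss_before = mse()
for _ in range(200):
    h = np.maximum(x @ W1, 0)        # ReLU forward pass
    e = h @ W2 - y                   # output error
    dh = (e @ B) * (h > 0)           # DFA: random projection, not e @ W2.T
    W2 -= 0.01 * h.T @ e             # local update for the output layer
    W1 -= 0.01 * x.T @ dh            # local update for the hidden layer
loss_after = mse()
```

Because each layer's update needs only its local activations and the projected error, the updates are parallelizable, which is the property the paper's low-rank structuring preserves.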
Standardization of Psychiatric Diagnoses -- Role of Fine-tuned LLM Consortium and OpenAI-gpt-oss Reasoning LLM Enabled Decision Support System
Authors: Eranga Bandara, Ross Gore, Atmaram Yarlagadda, Anita H. Clayton, Preston Samuel, Christopher K. Rhea, Sachin Shetty
2025-10-29
The diagnosis of most mental disorders, including psychiatric evaluations,
primarily depends on dialogues between psychiatrists and patients. This
subjective process can lead to variability in diagnoses across clinicians and
patients, resulting in inconsistencies and challenges in achieving reliable
outcomes. To address these issues and standardize psychiatric diagnoses, we
propose a Fine-Tuned Large Language Model (LLM) Consortium and OpenAI-gpt-oss Reasoning LLM-enabled Decision Support System for the clinical diagnosis of mental disorders. Our approach leverages fine-tuned LLMs trained on conversational datasets involving psychiatrist-patient interactions focused on mental health conditions (e.g., depression). The diagnostic predictions from individual models are aggregated through a consensus-based decision-making process, refined by the OpenAI-gpt-oss reasoning LLM. We propose a novel method for deploying LLM agents that orchestrate communication between the LLM consortium and the reasoning LLM, ensuring transparency, reliability, and responsible AI across the entire diagnostic workflow. Experimental results demonstrate the transformative potential of combining fine-tuned LLMs with a reasoning model to create a robust and highly accurate diagnostic system for mental health assessment. A prototype of the proposed platform, integrating three fine-tuned LLMs with the OpenAI-gpt-oss reasoning LLM, was developed in collaboration with the U.S. Army Medical Research Team in Norfolk, Virginia, USA. To the best of our knowledge, this work represents the first application of a fine-tuned LLM consortium integrated with a reasoning LLM for clinical mental health diagnosis, paving the way for next-generation AI-powered eHealth systems aimed at standardizing psychiatric diagnoses.
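The consensus-based aggregation step can be illustrated with a minimal majority-vote sketch; the labels and threshold below are hypothetical, and the actual platform's aggregation and reasoning-LLM refinement are more involved:

```python
from collections import Counter

def consensus(predictions, threshold=0.5):
    """Majority vote across model predictions; returns the winning label
    plus an agreement score a downstream reasoning model could inspect."""
    votes = Counter(predictions)
    label, count = votes.most_common(1)[0]
    agreement = count / len(predictions)
    # Below the threshold, defer instead of committing to a diagnosis.
    return (label, agreement) if agreement >= threshold else (None, agreement)

# Three fine-tuned models vote on a (hypothetical) diagnostic label.
label, agreement = consensus(["depression", "depression", "anxiety"])
```

Exposing the agreement score, rather than only the winning label, is what lets a second-stage reasoner treat low-consensus cases differently from unanimous ones.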
TwinVoice A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation
Authors: Bangde Du, Minghao Guo, Songming He, Ziyi Ye, Xi Zhu, Weihang Su, Shuqi Zhu, Yujia Zhou, Yongfeng Zhang, Qingyao Ai, Yiqun Liu
2025-10-29
Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the required capabilities. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short in capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.
Echo-Conditioned Denoising Diffusion Probabilistic Models for Multi-Target Tracking in RF Sensing
Authors: Amirhossein Azarbahram, Onel L. A. López
2025-10-29
In this paper, we consider a dynamic radio frequency sensing system aiming to spatially track multiple targets over time. We develop a conditional denoising diffusion probabilistic model (C-DDPM)-assisted framework that learns the temporal evolution of target parameters by leveraging noisy echo observations as conditioning features. The proposed framework integrates a variational autoencoder (VAE) for echo encoding and utilizes classifier-free guidance to enhance conditional denoising. In each transmission block, the VAE encodes the received echo into a latent representation that conditions the DDPM to predict future target states, which are then used for codebook beam selection. Simulation results show that the proposed approach outperforms classical signal processing, filtering, and deep learning benchmarks. The C-DDPM-assisted framework achieves significantly lower estimation errors in both angle and distance tracking, demonstrating the potential of generative models for integrated sensing and communications.
A Critical Study of Automatic Evaluation in Sign Language Translation
Authors: Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith
2025-10-29
Automatic evaluation metrics are crucial for advancing sign language
translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are
only text-based, and it remains unclear to what extent text-based metrics can
reliably capture the quality of SLT outputs. To address this gap, we
investigate the limitations of text-based SLT evaluation metrics by analyzing
six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one
hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.
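To make the lexical-metric limitation concrete, here is a simplified character n-gram F-score in the spirit of chrF (the real metric uses higher-order n-grams and additional handling; this sketch is only illustrative):

```python
from collections import Counter

def char_ngrams(text, n):
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_like(hypothesis, reference, max_n=3, beta=2.0):
    """Simplified character n-gram F-beta score: average n-gram precision
    and recall, then combine with recall weighted by beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())       # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p = sum(precisions) / max_n
    r = sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

exact = chrf_like("the cat sat", "the cat sat")
partial = chrf_like("a cat sat", "the cat sat")
```

A semantically faithful paraphrase with little surface overlap would score low under any metric of this shape, which is exactly the failure mode that motivates LLM-based and multimodal evaluation.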
Implicature in Interaction Understanding Implicature Improves Alignment in Human-LLM Interaction
Authors: Asutosh Hota, Jussi P. P. Jokinen
2025-10-29
The rapid advancement of Large Language Models (LLMs) is positioning language at the core of human-computer interaction (HCI). We argue that advancing HCI requires attention to the linguistic foundations of interaction, particularly implicature (meaning conveyed beyond explicit statements through shared context), which is essential for human-AI (HAI) alignment. This study examines LLMs' ability to infer user intent embedded in context-driven prompts and whether understanding implicature improves response generation. Results show that larger models approximate human interpretations more closely, while smaller models struggle with implicature inference. Furthermore, implicature-based prompts significantly enhance the perceived relevance and quality of responses across models, with notable gains in smaller models. Overall, 67.6% of participants preferred responses to implicature-embedded prompts over literal ones, highlighting a clear preference for contextually nuanced communication. Our work contributes to understanding how linguistic theory can be used to address the alignment problem by making HAI interaction more natural and contextually grounded.
Serve Programs, Not Prompts
Authors: In Gim, Lin Zhong
2025-10-29
Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs. Symphony exposes LLM model computations via system calls and virtualizes the KV cache with a dedicated file system, while ensuring GPU efficiency with a two-level process scheduling scheme. Symphony has the potential to open the door to a more efficient and extensible ecosystem for LLM applications.
Lightweight Federated Learning in Mobile Edge Computing with Statistical and Device Heterogeneity Awareness
Authors: Jinghong Tan, Zhichen Zhang, Kun Guo, Tsung-Hui Chang, Tony Q. S. Quek
2025-10-29
Federated learning enables collaborative machine learning while preserving data privacy, but high communication and computation costs, exacerbated by statistical and device heterogeneity, limit its practicality in mobile edge computing. Existing compression methods like sparsification and quantization reduce per-round costs but may increase training rounds and thus the total training cost, especially under heterogeneous environments. We propose a lightweight personalized FL framework built on parameter decoupling, which separates the model into shared and private subspaces, enabling us to uniquely apply gradient sparsification to the shared component and model quantization to the private one. This structural separation confines communication compression to global knowledge exchange and computation reduction to local personalization, protecting personalization quality while adapting to heterogeneous client resources. We theoretically analyze convergence under the combined effects of sparsification and quantization, revealing a compression-convergence trade-off that links compression error to the iteration complexity. Guided by this analysis, we formulate a joint optimization that selects per-client sparsification and quantization rates and wireless bandwidth to reduce end-to-end training time. Simulation results demonstrate faster convergence and substantial reductions in overall communication and computation costs with negligible accuracy loss, validating the benefits of coordinated compression and resource-aware personalization in resource-constrained heterogeneous environments.
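The parameter-decoupling recipe, sparsifying the shared gradient while quantizing the private component, can be sketched as follows (shapes, ratio, and bit-width are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def top_k_sparsify(grad, ratio):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries
    (applied to the SHARED subspace before uploading to the server)."""
    k = max(1, int(grad.size * ratio))
    flat = grad.ravel().copy()
    drop = np.argpartition(np.abs(flat), -k)[:-k]  # indices of smallest entries
    flat[drop] = 0.0
    return flat.reshape(grad.shape)

def quantize_uniform(w, bits=8):
    """Uniform symmetric quantization (applied to the PRIVATE component,
    which never leaves the device, to cut local compute/memory)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    if scale == 0:
        return w
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
shared_grad = rng.normal(size=(64, 64))   # shared subspace: communicated
private_w = rng.normal(size=(64, 16))     # private subspace: stays local

sparse = top_k_sparsify(shared_grad, ratio=0.1)   # 10x fewer uploaded values
local = quantize_uniform(private_w)
```

The split matters because the two errors compound differently: sparsification error enters the global aggregate every round, while quantization error stays confined to each client's personalization, which is the trade-off the convergence analysis formalizes.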
4-Doodle Text to 3D Sketches that Move!
Authors: Hao Chen, Jiaqi Wang, Yonggang Qi, Ke Li, Kaiyue Pang, Yi-Zhe Song
2025-10-29
We present a novel task: text-to-3D sketch animation, which aims to bring freeform sketches to life in dynamic 3D space. Unlike prior works focused on photorealistic content generation, we target abstract, stylized, and view-consistent 3D vector sketches, a lightweight and interpretable medium well-suited for visual communication and prototyping. However, this task is very challenging: (i) no paired dataset exists for text and 3D (or 4D) sketches; (ii) sketches require structural abstraction that is difficult to model with conventional 3D representations like NeRFs or point clouds; and (iii) animating such sketches demands temporal coherence and multi-view consistency, which current pipelines do not address. Therefore, we propose 4-Doodle, the first training-free framework for generating dynamic 3D sketches from text. It leverages pretrained image and video diffusion models through a dual-space distillation scheme: one space captures multi-view-consistent geometry using differentiable Bézier curves, while the other encodes motion dynamics via temporally-aware priors. Unlike prior work (e.g., DreamFusion), which optimizes from a single view per step, our multi-view optimization ensures structural alignment and avoids view ambiguity, which is critical for sketches. Furthermore, we introduce a structure-aware motion module that separates shape-preserving trajectories from deformation-aware changes, enabling expressive motion such as flipping, rotation, and articulated movement. Extensive experiments show that our method produces temporally realistic and structurally stable 3D sketch animations, outperforming existing baselines in both fidelity and controllability. We hope this work serves as a step toward more intuitive and accessible 4D content creation.
Parrot A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning
Authors: Senjie Jin, Lu Chen, Zhiheng Xi, Yuhui Wang, Sirui Song, Yuhao Zhou, Xinbo Zhang, Peng Sun, Hong Lu, Tao Gui, Qi Zhang, Xuanjing Huang
2025-10-29
Natural language chain-of-thought (N-CoT) and program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. Current research typically endeavors to achieve unidirectional enhancement: P-CoT-enhanced N-CoT or N-CoT-enhanced P-CoT. In this paper, we seek to fully unleash the two paradigms' strengths for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across the two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) three target-designed subtasks integrate sequential P-CoT and N-CoT generation; 2) a subtask hybrid training strategy facilitates natural language semantic transferability; and 3) a converted N-CoT auxiliary reward is designed to alleviate sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances the performance of both N-CoT and P-CoT, especially N-CoT. Using Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA achieves gains of +21.87 and +21.48 on MathQA over the RL baseline, which is resource-intensive.
DIRC-RAG Accelerating Edge RAG with Robust High-Density and High-Loading-Bandwidth Digital In-ReRAM Computation
Authors: Kunming Shao, Zhipeng Liao, Jiangnan Yu, Liang Zhao, Qiwei Li, Xijie Huang, Jingyu He, Fengshi Tian, Yi Zou, Xiaomeng Wang, Tim Kwang-Ting Cheng, Chi-Ying Tsui
2025-10-29
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieval but faces challenges on edge devices due to high storage, energy, and latency demands. Computing-in-Memory (CIM) offers a promising solution by storing document embeddings in CIM macros and enabling in-situ parallel retrieval, but is constrained by either low memory density or limited computational accuracy. To address these challenges, we present DIRC-RAG, a novel edge RAG architecture leveraging Digital In-ReRAM Computation (DIRC). DIRC integrates a high-density multi-level ReRAM subarray with an SRAM cell, utilizing the SRAM and differential sensing for robust ReRAM readout and digital multiply-accumulate (MAC) operations. By storing all document embeddings within the CIM macro, DIRC achieves ultra-low-power, single-cycle data loading, substantially reducing both energy consumption and latency compared to off-chip DRAM. A query-stationary (QS) dataflow is supported for RAG tasks, minimizing on-chip data movement and reducing SRAM buffer requirements. We introduce error optimization for the DIRC ReRAM-SRAM cell by extracting the bit-wise spatial error distribution of the ReRAM subarray and applying targeted bit-wise data remapping. An error detection circuit is also implemented to enhance readout resilience against device- and circuit-level variations. Simulation results demonstrate that DIRC-RAG under a TSMC 40nm process achieves an on-chip non-volatile memory density of 5.18 Mb/mm² and a throughput of 131 TOPS. It delivers a 4MB retrieval latency of 5.6 µs/query and an energy consumption of 0.956 µJ/query, while maintaining retrieval precision.
MoEntwine Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference
Authors: Xinru Tang, Jingxiang Hou, Dingcheng Jiang, Taiquan Wei, Jiaxin Liu, Jinyi Deng, Huizheng Wang, Qize Yang, Haoran Shang, Chao Li, Yang Hu, Shouyi Yin
2025-10-29
As large language models (s) continue to scale up, mixture-of-experts
(MoE) has become a common technology in SOTA models. MoE models rely on expert
parallelism (EP) to alleviate memory bottleneck, which introduces all-to-all
to dispatch and combine tokens across devices. However, in
widely-adopted GPU clusters, high-overhead cross-node
makes
all-to-all expensive, hindering the adoption of EP. Recently, wafer-scale chips
(WSCs) have emerged as a platform integrating numerous devices on a wafer-sized
interposer. WSCs provide a unified high-performance network connecting all
devices, presenting a promising potential for hosting MoE models. Yet, their
network is restricted to a mesh topology, causing imbalanced communication
pressure and performance loss. Moreover, the lack of on-wafer disk leads to
high-overhead expert migration on the critical path.
To fully unleash this potential, we first propose Entwined Ring Mapping
(ER-Mapping), which co-designs the mapping of attention and MoE layers to
balance communication pressure and achieve better performance. We find that
under ER-Mapping, the distribution of cold and hot links in the attention and
MoE layers is complementary. Therefore, to hide the migration overhead, we
propose the Non-invasive Balancer (NI-Balancer), which splits a complete expert
migration into multiple steps and alternately utilizes the cold links of both
layers. Evaluation shows ER-Mapping achieves communication reduction of up to
62%. NI-Balancer further delivers 54% and 22% improvements in MoE computation
and communication, respectively. Compared with the SOTA NVL72 supernode, the WSC
platform delivers an average 39% higher per-device MoE performance owing to its
scalability to larger EP.
Energy-Efficient Autonomous Driving with Adaptive Perception and Robust Decision
Authors: Yuyang Xia, Zibo Liang, Liwei Deng, Yan Zhao, Han Su, Kai Zheng
2025-10-29
Autonomous driving is an emerging technology that is expected to bring
significant social, economic, and environmental benefits. However, these
benefits come with rising energy consumption by computation engines, limiting
the driving range of vehicles, especially electric ones. Perception computing
is typically the most power-intensive component, as it relies on large-scale
deep learning models to extract environmental features. Recently, numerous
studies have employed model compression techniques, such as sparsification,
quantization, and distillation, to reduce computational consumption. However,
these methods often result in either a substantial model size or a significant
drop in perception accuracy compared to high-computation models. To address
these challenges, we propose an energy-efficient autonomous driving framework,
called EneAD. In the adaptive perception module, a perception optimization
strategy is designed from the perspective of data management and tuning.
Firstly, we manage multiple perception models with different computational
consumption and adjust the execution framerate dynamically. Then, we define
them as knobs and design a transferable tuning method based on Bayesian
optimization to identify promising knob values that achieve low computation
while maintaining desired accuracy. To adaptively switch the knob values in
various traffic scenarios, a lightweight classification model is proposed to
distinguish the perception difficulty in different scenarios. In the robust
decision module, we propose a decision model based on reinforcement learning
and design a regularization term to enhance driving stability in the face of
perturbed perception results. Extensive experiments evidence the superiority of
our framework in both energy consumption and driving performance. EneAD can
reduce perception consumption by 1.9x to 3.5x and thus improve driving range by
3.9% to 8.5%.
Transformers in Medicine Improving Vision-Language Alignment for Medical Image Captioning
Authors: Yogesh Thakku Suresh, Vishwajeet Shivaji Hogale, Luca-Alexandru Zamfira, Anandavardhana Hegde
2025-10-29
We present a transformer-based multimodal framework for generating clinically
relevant captions for MRI scans. Our system combines a DEiT-Small vision
transformer as an image encoder, MediCareBERT for caption embedding, and a
custom LSTM-based decoder. The architecture is designed to semantically align
image and textual embeddings, using hybrid cosine-MSE loss and contrastive
inference via vector similarity. We benchmark our method on the MultiCaRe
dataset, comparing performance on filtered brain-only MRIs versus general MRI
images against state-of-the-art medical image captioning methods including
BLIP, R2GenGPT, and recent transformer-based approaches. Results show that
focusing on domain-specific data improves caption accuracy and semantic
alignment. Our work proposes a scalable, interpretable solution for automated
medical image reporting.
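The hybrid cosine-MSE alignment objective mentioned above can be sketched as follows; this is an illustrative reconstruction on plain Python lists, and `alpha` is an assumed balancing weight (the abstract does not specify how the two terms are combined):

```python
import math

def hybrid_cosine_mse_loss(img_emb, txt_emb, alpha=0.5):
    """Hybrid cosine-MSE objective: the MSE term matches embeddings
    element-wise, the cosine term aligns their direction.
    `alpha` is an assumed balancing weight, not from the abstract."""
    mse = sum((a - b) ** 2 for a, b in zip(img_emb, txt_emb)) / len(img_emb)
    dot = sum(a * b for a, b in zip(img_emb, txt_emb))
    norms = (math.sqrt(sum(a * a for a in img_emb))
             * math.sqrt(sum(b * b for b in txt_emb)))
    cosine = dot / norms if norms else 0.0
    return alpha * mse + (1 - alpha) * (1.0 - cosine)
```

Identical embeddings give a loss of zero; mismatched ones are penalized in both magnitude and direction.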
Model-Document Protocol for AI Search
Authors: Hongjin Qian, Zheng Liu
2025-10-29
AI search depends on linking large language models (LLMs) with vast external
knowledge sources. Yet web pages, PDF files, and other raw documents are not
inherently LLM-ready: they are long, noisy, and unstructured. Conventional
retrieval methods treat these documents as verbatim text and return raw
passages, leaving the burden of fragment assembly and contextual reasoning to
the LLM. This gap underscores the need for a new retrieval paradigm that
redefines how models interact with documents.
We introduce the Model-Document Protocol (MDP), a general framework that
formalizes how raw text is bridged to LLMs through consumable knowledge
representations. Rather than treating retrieval as passage fetching, MDP
defines multiple pathways that transform unstructured documents into
task-specific, LLM-ready inputs. These include agentic reasoning, which curates
raw evidence into coherent context; memory grounding, which accumulates
reusable notes to enrich reasoning; and structured leveraging, which encodes
documents into formal representations such as graphs or key-value stores. All
three pathways share the same goal: ensuring that what reaches the LLM is not
raw fragments but compact, structured knowledge directly consumable for
reasoning.
As an instantiation, we present MDP-Agent, which realizes the protocol
through an agentic process: constructing document-level gist memories for
global coverage, performing diffusion-based exploration with vertical
exploitation to uncover layered dependencies, and applying map-reduce style
synthesis to integrate large-scale evidence into compact yet sufficient
context. Experiments on information-seeking benchmarks demonstrate that
MDP-Agent outperforms baselines, validating both the soundness of the MDP
framework and the effectiveness of its agentic instantiation.
Conditional neural field for spatial dimension reduction of turbulence data a comparison study
Authors: Junyi Guo, Pan Du, Xiantao Fan, Yahui Li, Jian-Xun Wang
2025-10-29
We investigate conditional neural fields (CNFs), mesh-agnostic,
coordinate-based decoders conditioned on a low-dimensional latent, for spatial
dimensionality reduction of turbulent flows. CNFs are benchmarked against
Proper Orthogonal Decomposition and a convolutional autoencoder within a
unified encoding-decoding framework and a common evaluation protocol that
explicitly separates in-range (interpolative) from out-of-range (strict
extrapolative) testing beyond the training horizon, with identical
preprocessing, metrics, and fixed splits across all baselines. We examine three
conditioning mechanisms: (i) activation-only modulation (often termed FiLM),
(ii) low-rank weight and bias modulation (termed FP), and (iii) last-layer
inner-product coupling, and introduce a novel domain-decomposed CNF that
localizes complexities. Across representative turbulence datasets (WMLES
channel inflow, DNS channel inflow, and wall pressure fluctuations over
turbulent boundary layers), CNF-FP achieves the lowest training and in-range
testing errors, while CNF-FiLM generalizes best for out-of-range scenarios once
moderate latent capacity is available. Domain decomposition significantly
improves out-of-range accuracy, especially for the more demanding datasets. The
study provides a rigorous, physics-aware basis for selecting conditioning,
capacity, and domain decomposition when using CNFs for turbulence data
compression and reconstruction.
KnowCoder-A1 Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA
Authors: Zhuo Chen, Fei Wang, Zixuan Li, Zhao Zhang, Weiwei Ding, Chuanguang Yang, Yongjun Xu, Xiaolong Jin, Jiafeng Guo
2025-10-29
Knowledge Base Question Answering (KBQA) aims to answer natural-language
questions over a structured Knowledge Base (KB). Recent work improves KBQA by
adopting an agentic reasoning paradigm, in which Large Language Models (LLMs)
iteratively decompose a question, generate its corresponding logical queries,
and interact with the KB to derive the answer. However, these methods typically
fine-tune LLMs on reasoning trajectories synthesized via process supervision,
which offers weak incentives for exploration and thus fails to strengthen the
agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that
can autonomously perform agentic reasoning on KBs to obtain answers. To
incentivize autonomous exploration, KnowCoder-A1 trains the LLM under
outcome-only supervision via multi-stage curriculum reinforcement learning
with an easy-to-hard schedule. To establish foundational agentic
capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of
high-quality trajectories obtained through outcome-based rejection sampling.
Then, to alleviate the reward sparsity inherent in outcome-only supervision, it
applies multi-stage curriculum RL with reward schedules that progress from easy
to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful
reasoning behaviors and consistently outperforms prior approaches across three
mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1
achieves up to an 11.1% relative improvement while using only one-twelfth of
the training data, demonstrating strong agentic reasoning capabilities.
H3M-SSMoEs Hypergraph-based Multimodal Learning with LLM Reasoning and Style-Structured Mixture of Experts
Authors: Peilin Tan, Liang Xie, Churan Zhi, Dian Tu, Chuanqi Shi
2025-10-29
Stock movement prediction remains fundamentally challenging due to complex
temporal dependencies, heterogeneous modalities, and dynamically evolving
inter-stock relationships. Existing approaches often fail to unify structural,
semantic, and regime-adaptive modeling within a scalable framework. This work
introduces H3M-SSMoEs, a novel Hypergraph-based MultiModal architecture with
LLM reasoning and Style-Structured Mixture of Experts, integrating three key
innovations: (1) a Multi-Context Multimodal Hypergraph that hierarchically
captures fine-grained spatiotemporal dynamics via a Local Context Hypergraph
(LCH) and persistent inter-stock dependencies through a Global Context
Hypergraph (GCH), employing shared cross-modal hyperedges and Jensen-Shannon
Divergence weighting mechanism for adaptive relational learning and cross-modal
alignment; (2) an LLM-enhanced reasoning module, which leverages a frozen large
language model with lightweight adapters to semantically fuse and align
quantitative and textual modalities, enriching representations with
domain-specific financial knowledge; and (3) a Style-Structured Mixture of
Experts (SSMoEs) that combines shared market experts and industry-specialized
experts, each parameterized by learnable style vectors enabling regime-aware
specialization under sparse activation. Extensive experiments on three major
stock markets demonstrate that H3M-SSMoEs surpasses state-of-the-art methods in
both predictive accuracy and investment performance, while exhibiting
effective risk control. Datasets, source code, and model weights are available
at our GitHub repository: https://github.com/PeilinTime/H3M-SSMoEs.
WebLeaper Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
Authors: Zhengwei Tao, Haiyang Shen, Baixuan Li, Wenbiao Yin, Jialong Wu, Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Liwen Zhang, Xinyu Wang, Pengjun Xie, Jingren Zhou, Yong Jiang
2025-10-28
Large Language Model (LLM)-based agents have emerged as a transformative
approach for open-ended problem solving, with information seeking (IS) being a
core capability that enables autonomous reasoning and decision-making. While
prior research has largely focused on improving retrieval depth, we observe
that current IS agents often suffer from low search efficiency, which in turn
constrains overall performance. A key factor underlying this inefficiency is
the sparsity of target entities in training tasks, which limits opportunities
for agents to learn and generalize efficient search behaviors. To address these
challenges, we propose WebLeaper, a framework for constructing high-coverage IS
tasks and generating efficient solution trajectories. We formulate IS as a
tree-structured reasoning problem, enabling a substantially larger set of
target entities to be embedded within a constrained context. Leveraging curated
Wikipedia tables, we propose three variants for synthesizing IS tasks, Basic,
Union, and Reverse-Union, to systematically increase both IS efficiency and
efficacy. Finally, we curate training trajectories by retaining only those that
are simultaneously accurate and efficient, ensuring that the model is optimized
for both correctness and search performance. Extensive experiments on both
basic and comprehensive settings, conducted on five IS benchmarks, BrowseComp,
GAIA, xbench-DeepSearch, WideSearch, and Seal-0, demonstrate that our method
consistently achieves improvements in both effectiveness and efficiency over
strong baselines.
Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Authors: Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang
2025-10-28
LLM-based search agents are increasingly trained on entity-centric synthetic
data to solve complex, knowledge-intensive tasks. However, prevailing training
methods like Group Relative Policy Optimization (GRPO) discard this rich entity
information, relying instead on sparse, outcome-based rewards. This critical
limitation renders them unable to distinguish informative "near-miss"
samples-those with substantially correct reasoning but a flawed final
answer-from complete failures, thus discarding valuable learning signals. We
address this by leveraging the very entities discarded during training. Our
empirical analysis reveals a strong positive correlation between the number of
ground-truth entities identified during an agent's reasoning process and final
answer accuracy. Building on this insight, we introduce Entity-aware Group
Relative Policy Optimization (E-GRPO), a novel framework that formulates a
dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect
samples proportional to their entity match rate, enabling the model to
effectively learn from these "near-misses". Experiments on diverse
question-answering (QA) and deep research benchmarks show that E-GRPO
consistently and significantly outperforms the GRPO baseline. Furthermore, our
analysis reveals that E-GRPO not only achieves superior accuracy but also
induces more efficient reasoning policies that require fewer tool calls,
demonstrating a more effective and sample-efficient approach to aligning search
agents.
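The entity-aware reward described above can be sketched in a few lines; this is a minimal illustration, where `partial_weight` is an assumed scaling factor for near-miss credit (the abstract only states that partial rewards are proportional to entity match rate):

```python
def entity_aware_reward(answer_correct, found_entities, gold_entities,
                        partial_weight=0.5):
    """Dense entity-aware reward in the spirit of E-GRPO: correct answers
    receive the full reward, while incorrect 'near-miss' samples receive
    partial credit proportional to their entity match rate.
    `partial_weight` is an assumed hyper-parameter, not from the abstract."""
    if answer_correct:
        return 1.0
    gold = set(gold_entities)
    if not gold:
        return 0.0
    match_rate = len(set(found_entities) & gold) / len(gold)
    return partial_weight * match_rate
```

A trajectory that identified half the ground-truth entities but gave a wrong final answer thus still contributes a positive learning signal instead of a flat zero.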
Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
Authors: Snegha A, Sayambhu Sen, Piyush Singh Pasi, Abhishek Singhania, Preethi Jyothi
2025-10-28
With the release of new large language models (LLMs) like Llama and Mistral,
zero-shot cross-lingual transfer has become increasingly feasible due to their
multilingual pretraining and strong generalization capabilities. However,
adapting these decoder-only LLMs to new tasks across languages remains
challenging. While parameter-efficient fine-tuning (PeFT) techniques like
Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as
soft prompt tuning, prefix tuning, and Llama Adapter are less explored,
especially for zero-shot transfer in decoder-only models. We present a
comprehensive study of three prefix-based methods for zero-shot cross-lingual
transfer from English to 35+ high- and low-resource languages. Our analysis
further explores transfer across linguistic families and scripts, as well as
the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix
methods outperform LoRA baselines by up to 6% on the Belebele benchmark.
Similar improvements were observed with Mistral v0.3 7B as well. Despite using
only 1.23M learnable parameters with prefix tuning, we achieve consistent
improvements across diverse benchmarks. These findings highlight the potential
of prefix-based techniques as an effective and scalable alternative to LoRA,
particularly in low-resource multilingual settings.
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
Authors: Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho
2025-10-28
The quadratic cost of attention hinders the scalability of long-context LLMs,
especially in resource-constrained settings. Existing static sparse attention
methods, such as sliding windows or global tokens, exploit the sparsity of
attention to reduce its cost, but adapt poorly to content-dependent
variations in attention due to their static nature. While previous work has
proposed several dynamic approaches to improve flexibility, they still depend
on predefined templates or heuristic mechanisms. Such strategies reduce
generality and prune tokens that remain contextually important, limiting their
accuracy across diverse tasks. To tackle these bottlenecks of existing methods
for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention
(DHSA), a data-driven framework that dynamically predicts attention sparsity
online without retraining. Our proposed DHSA adaptively segments sequences into
variable-length chunks, then computes chunk representations by aggregating the
token embeddings within each chunk. To avoid the bias introduced by varying
chunk lengths, we apply length-normalized aggregation that scales the averaged
embeddings by the square root of the chunk size. Finally, DHSA upsamples the
chunk-level similarity scores to token-level similarities to calculate
importance scores that determine which token-level interactions should be
preserved. Our experiments on Gemma2 with Needle-in-a-Haystack Test and
LongBench show that DHSA matches dense attention in accuracy, while reducing
latency by 20-60% and peak memory usage by 35%. Compared to other
representative baselines such as block sparse attention, DHSA achieves
consistently higher accuracy (6-18% relative gains) with comparable or lower
cost, offering an efficient and adaptable solution for long-context on-device
LLMs.
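The length-normalized aggregation step described above (mean of token embeddings scaled by the square root of the chunk size) can be sketched directly from the abstract's formula; embeddings are plain lists for illustration:

```python
import math

def chunk_representation(token_embeddings):
    """Length-normalized chunk aggregation: average the token embeddings
    in a chunk, then scale by sqrt(chunk_size) so chunks of different
    lengths remain comparable when scored against each other."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    mean = [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]
    return [m * math.sqrt(n) for m in mean]
```

The sqrt scaling counteracts the variance shrinkage of averaging, so longer chunks are not systematically down-weighted in the chunk-level similarity scores.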
Diffusion LLM with Native Variable Generation Lengths Let [EOS] Lead the Way
Authors: Yicun Yang, Cong Wang, Shaobo Wang, Zichen Wen, Biqing Qi, Hanlin Xu, Linfeng Zhang
2025-10-28
Diffusion-based large language models (dLLMs) have exhibited substantial
potential for parallel text generation, which may enable more efficient
generation compared to autoregressive models. However, current dLLMs suffer
from fixed generation lengths: the generation length of a dLLM has to be
determined before decoding as a hyper-parameter, leading to issues in
efficiency and flexibility. To solve these problems, in this work, we propose
to train a diffusion LLM with native variable generation lengths, abbreviated
as dLLM-Var. Concretely, we aim to train a model to accurately predict the
[EOS] token in the generated text, which enables a dLLM to
natively infer in a block diffusion manner, while still maintaining the ability
of global bi-directional (full) attention and high parallelism. Experiments on
standard benchmarks demonstrate that our method achieves a 30.1x speedup over
traditional dLLM inference paradigms and a 2.4x speedup relative to
autoregressive models such as Qwen and Llama. Our method achieves higher
accuracy and faster inference, elevating dLLMs beyond mere academic novelty and
supporting their practical use in real-world applications. Codes and models
have been released.
Parallel Loop Transformer for Efficient Test-Time Computation Scaling
Authors: Bohong Wu, Mengzhao Chen, Xiang Luo, Shen Yan, Qifan Yu, Fan Xia, Tianqi Zhang, Hongrui Zhan, Zheng Zhong, Xun Zhou, Siyuan Qiao, Xingyan Bin
2025-10-28
Large Language Models (LLMs) are powerful but often too slow and costly for
real-world use during inference. Looped transformers save on parameters by
s save on parameters by
reusing the same weights for multiple computational steps, or "loops." However,
this approach has a major flaw: the loops run one after another, causing
inference latency and memory requirements to increase with each added loop.
This makes them impractical for fast applications. To solve this problem, we
introduce the Parallel Loop Transformer (PLT). PLT is a new architecture that
delivers the performance benefits of a deep, looped model but with the low
latency of a standard, non-looped model. PLT works using two key techniques.
First, Cross-Loop Parallelism (CLP) breaks the sequential dependency by
computing different loops for different tokens at the same time, all within a
single pass. Second, to prevent memory costs from growing, we use an Efficient
Representation Enhancement strategy. This method shares the memory (KV cache)
from the first loop with all other loops. It then uses a Gated Sliding-Window
Attention (G-SWA) to combine this shared global information with local
information, maintaining high accuracy. Our experiments show that PLT achieves
the high accuracy of a traditional looped model but with almost no extra
latency or memory cost compared to a standard transformer.
Decoupled MeanFlow Turning Flow Models into Flow Maps for Accelerated Sampling
Authors: Kyungmin Lee, Sihyun Yu, Jinwoo Shin
2025-10-28
Denoising generative models, such as diffusion and flow-based models, produce
high-quality samples but require many denoising steps due to discretization
error. Flow maps, which estimate the average velocity between timesteps,
mitigate this error and enable faster sampling. However, their training
typically demands architectural changes that limit compatibility with
pretrained flow models. We introduce Decoupled MeanFlow, a simple decoding
strategy that converts flow models into flow map models without architectural
modifications. Our method conditions the final blocks of diffusion transformers
on the subsequent timestep, allowing pretrained flow models to be directly
on the subsequent timestep, allowing pretrained flow models to be directly
repurposed as flow maps. Combined with enhanced training techniques, this
design enables high-quality generation in as few as 1 to 4 steps. Notably, we
find that training flow models and subsequently converting them is more
efficient and effective than training flow maps from scratch. On ImageNet
256x256 and 512x512, our models attain 1-step FID of 2.16 and 2.12,
respectively, surpassing prior art by a large margin. Furthermore, we achieve
FID of 1.51 and 1.68 when increasing the steps to 4, which nearly matches the
performance of flow models while delivering over 100x faster inference.
MiniOneRec An Open-Source Framework for Scaling Generative Recommendation
Authors: Xiaoyu Kong, Leheng Sheng, Junfei Tan, Yuxin Chen, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He
2025-10-28
The recent success of large language models (LLMs) has renewed interest in
whether recommender systems can achieve similar scaling benefits. Conventional
recommenders, dominated by massive embedding tables, tend to plateau as
embedding dimensions grow. In contrast, the emerging generative paradigm
replaces embeddings with compact Semantic ID (SID) sequences produced by
autoregressive Transformers. Yet most industrial deployments remain
proprietary, leaving two fundamental questions open: (1) Do the expected
scaling laws hold on public benchmarks? (2) What is the minimal post-training
recipe that enables competitive performance?
We present MiniOneRec, to the best of our knowledge, the first fully
open-source generative recommendation framework, which provides an end-to-end
workflow spanning SID construction, supervised fine-tuning, and
recommendation-oriented reinforcement learning. We generate SIDs via a Residual
Quantized VAE and post-train Qwen backbones ranging from 0.5B to 7B parameters
on the Amazon Review dataset. Our experiments reveal a consistent downward
trend in both training and evaluation losses with increasing model size,
validating the parameter efficiency of the generative approach. To further
enhance performance, we propose a lightweight yet effective post-training
pipeline that (1) enforces full-process SID alignment and (2) applies
reinforcement learning with constrained decoding and hybrid rewards. Together,
these techniques yield significant improvements in both ranking accuracy and
candidate diversity.
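The residual quantization step behind the Semantic IDs mentioned above can be sketched as follows; codebooks here are illustrative plain lists of code vectors standing in for the learned RQ-VAE codebooks:

```python
def residual_quantize(vec, codebooks):
    """Residual quantization, the mechanism behind RQ-VAE Semantic IDs:
    at each level pick the nearest code, subtract it, and pass the
    residual to the next codebook. The chosen indices form the SID."""
    sid, residual = [], list(vec)
    for book in codebooks:
        # nearest code by squared Euclidean distance to the residual
        idx = min(range(len(book)),
                  key=lambda i: sum((r - c) ** 2
                                    for r, c in zip(residual, book[i])))
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return sid
```

Each item embedding thus maps to a short, coarse-to-fine index sequence that an autoregressive transformer can generate token by token.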
Metadata-Driven Retrieval-Augmented Generation for Financial Question Answering
Authors: Michail Dadopoulos, Anestis Ladas, Stratos Moschidis, Ioannis Negkakis
2025-10-28
Retrieval-Augmented Generation (RAG) struggles on long, structured financial
filings where relevant evidence is sparse and cross-referenced. This paper
presents a systematic investigation of advanced metadata-driven
Retrieval-Augmented Generation (RAG) techniques, proposing and evaluating a
novel, multi-stage RAG architecture that leverages LLM-generated metadata. We
introduce a sophisticated indexing pipeline to create contextually rich
document chunks and benchmark a spectrum of enhancements, including
pre-retrieval filtering, post-retrieval reranking, and enriched embeddings,
benchmarked on the FinanceBench dataset. Our results reveal that while a
powerful reranker is essential for precision, the most significant performance
gains come from embedding chunk metadata directly with text ("contextual
chunks"). Our proposed optimal architecture combines LLM-driven pre-retrieval
optimizations with these contextual embeddings to achieve superior performance.
Additionally, we present a custom metadata reranker that offers a compelling,
cost-effective alternative to commercial solutions, highlighting a practical
trade-off between peak performance and operational efficiency. This study
provides a blueprint for building robust, metadata-aware RAG systems for
financial document analysis.
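The "contextual chunk" idea described above amounts to embedding metadata together with the chunk text; a minimal sketch of the construction, with illustrative field names (the abstract does not list the metadata schema):

```python
def contextual_chunk(chunk_text, metadata):
    """'Contextual chunk' construction: prepend LLM-generated metadata
    to the chunk text so its embedding carries document-level context.
    The metadata field names here are illustrative assumptions."""
    header = " | ".join(f"{k}: {v}" for k, v in metadata.items())
    return f"[{header}]\n{chunk_text}"
```

The combined string, rather than the bare passage, is what gets embedded and indexed, which is where the abstract reports the largest gains.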
Text Simplification with Sentence Embeddings
Authors: Matthew Shardlow
2025-10-28
Sentence embeddings can be decoded to give approximations of the original
texts used to create them. We explore this effect in the context of text
simplification, demonstrating that reconstructed text embeddings preserve
complexity levels. We experiment with a small feed-forward neural network to
effectively learn a transformation between sentence embeddings representing
high-complexity and low-complexity texts. We provide comparisons to Seq2Seq
and LLM-based approaches, showing encouraging results in our much smaller
learning setting. Finally, we demonstrate the applicability of our
transformation to an unseen simplification dataset (MedEASI), as well as
datasets from languages outside the training data (ES, DE). We conclude that
learning transformations in sentence embedding space is a promising direction
for future research and has potential to unlock the ability to develop small,
but powerful models for text simplification and other natural language
generation tasks.
SALS Sparse Attention in Latent Space for KV cache Compression
Authors: Junlin Mu, Hantao Huang, Jihang Zhang, Minghui Yu, Tao Wang, Yidong Li
2025-10-28
Large Language Models capable of handling extended contexts are in high
demand, yet their inference remains challenging due to substantial Key-Value
(KV) cache size and high memory bandwidth requirements. Previous research has
demonstrated that the KV cache exhibits low-rank characteristics within the
hidden dimension, suggesting the potential for effective compression. However,
due to the widely adopted Rotary Position Embedding (RoPE) mechanism in modern
LLMs, naive low-rank compression suffers severe accuracy degradation or creates
a new speed bottleneck, as the low-rank KV cache must first be reconstructed
in order to apply
RoPE. In this paper, we introduce two key insights: first, the application of
RoPE to the key vectors increases their variance, which in turn results in a
higher rank; second, after the key vectors are transformed into the latent
space, they largely maintain their representation across most layers. Based on
these insights, we propose the Sparse Attention in Latent Space framework. SALS
projects the KV cache into a compact latent space via low-rank projection, and
performs sparse token selection using RoPE-free query-key interactions in this
space. By reconstructing only a small subset of important tokens, it avoids the
overhead of full KV cache reconstruction. We comprehensively evaluate SALS on
various tasks using two large-scale models: LLaMA2-7b-chat and Mistral-7b, and
additionally verify its scalability on the RULER-128k benchmark with
LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves SOTA
performance by maintaining competitive accuracy. Under different settings, SALS
achieves 6.4-fold KV cache compression and a 5.7-fold speed-up in the attention
operator compared to FlashAttention2 on 4K sequences. For end-to-end
throughput, SALS achieves 1.4-fold and 4.5-fold improvements compared
to GPT-fast on 4K and 32K sequences, respectively.
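The selection mechanism described above can be sketched as follows; `proj` is a plain list of basis vectors standing in for the learned low-rank projection, and the scoring is a RoPE-free dot product in the latent space (an illustrative simplification of SALS, not the authors' implementation):

```python
def latent_topk(query, keys, proj, k):
    """SALS-style token selection sketch: project the query and key
    vectors into a low-rank latent space, score tokens with RoPE-free
    dot products there, and return the indices of the top-k tokens
    whose KV entries would then be reconstructed."""
    def project(v):
        return [sum(a * b for a, b in zip(v, p)) for p in proj]
    q = project(query)
    scores = [sum(a * b for a, b in zip(project(key), q)) for key in keys]
    return sorted(range(len(keys)), key=lambda i: -scores[i])[:k]
```

Only the selected tokens are lifted back to full rank (and RoPE applied), which is what avoids the full-cache reconstruction bottleneck.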
Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
Authors: Jason M. Pittman, Anton Phillips Jr., Yesenia Medina-Santos, Brielle C. Stark
2025-10-28
In aphasia research, Speech-Language Pathologists (SLPs) devote extensive
time to manually coding speech samples using Correct Information Units (CIUs),
a measure of how informative an individual sample of speech is. Developing
automated systems to recognize aphasic language is limited by data scarcity.
For example, only about 600 transcripts are available in AphasiaBank yet
billions of tokens are used to train large language models (LLMs). In the
broader field of machine learning (ML), researchers increasingly turn to
synthetic data when such data are scarce. Therefore, this study constructs and
validates two methods to generate synthetic transcripts of the AphasiaBank Cat
Rescue picture description task. One method leverages a procedural programming
approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct
LLMs. The methods generate transcripts across four severity levels (Mild,
Moderate, Severe, Very Severe) through word dropping, filler insertion, and
paraphasia substitution. Overall, we found that, compared to human-elicited
transcripts, Mistral 7b Instruct best captures key aspects of linguistic
degradation observed in aphasia, showing realistic directional changes in NDW,
word count, and word length amongst the synthetic generation methods. Based on
the results, future work should plan to create a larger dataset, fine-tune
models for better aphasic representation, and have SLPs assess the realism and
usefulness of the synthetic transcripts.
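The procedural generation method described above (word dropping and filler insertion at severity-dependent rates) can be sketched as follows; the per-level probabilities are assumed values for illustration, and paraphasia substitution is omitted:

```python
import random

# Assumed (word-drop, filler-insert) probabilities per severity level;
# the abstract names the levels but not the exact rates.
SEVERITY_RATES = {
    "Mild": (0.10, 0.05), "Moderate": (0.25, 0.15),
    "Severe": (0.45, 0.30), "Very Severe": (0.65, 0.45),
}
FILLERS = ["um", "uh", "er"]

def degrade(transcript, severity, seed=0):
    """Procedural degradation sketch: drop words and insert fillers at
    severity-dependent rates (paraphasia substitution omitted here)."""
    drop_p, fill_p = SEVERITY_RATES[severity]
    rng = random.Random(seed)
    out = []
    for word in transcript.split():
        if rng.random() < drop_p:
            continue  # simulate word-finding failure
        if rng.random() < fill_p:
            out.append(rng.choice(FILLERS))
        out.append(word)
    return " ".join(out)
```

Fixing the seed makes each synthetic transcript reproducible, which matters when SLPs later rate the same samples for realism.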
Pilot Distortion Design for ToA Obfuscation in Uplink OFDM Communication
Authors: Mahmut Kemal Ercan, Alireza Pourafzal, Musa Furkan Keskin, Sinan Gezici, Henk Wymeersch
2025-10-28
We study uplink orthogonal frequency-division multiplexing (OFDM) pilot
distortion to deliberately obfuscate time-of-arrival (ToA) estimation at a
single base station while preserving communication performance. We design a
complex per-subcarrier distortion vector that increases sidelobes of the
mismatched ambiguity function (MAF) relative to its mainlobe, using two
objectives: the sidelobe-to-peak level ratio and the integrated sidelobe level.
The design is subject to a transmit-power budget and a proximity
(dissimilarity) constraint around the communication-optimal pilot.
Communication impact is quantified by a capacity-motivated lower bound obtained
from the linear minimum mean-squared error (LMMSE) error covariance with a mismatched
channel estimate. The resulting generalized fractional program is solved with
Dinkelbach's transform and a difference-of-convex update that yields a
closed-form Karush-Kuhn-Tucker step. Simulations on a single-input
single-output OFDM link show that the optimized distortions raise MAF sidelobes
and degrade delay estimation, as validated by a mismatched maximum-likelihood
ToA estimator, while incurring only marginal capacity loss over a broad
signal-to-noise ratio range. The method requires no protocol changes or
artificial path injection and provides a signal-level mechanism to control ToA
observability under communication constraints.
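The abstract's solver combines Dinkelbach's transform with a difference-of-convex update. The core of Dinkelbach's method for a generalized fractional program max f(x)/g(x) can be sketched on a toy scalar problem; the functions and grid below are illustrative stand-ins, not the paper's MAF sidelobe objective:

```python
import numpy as np

def dinkelbach(f, g, candidates, tol=1e-9, max_iter=100):
    """Maximize f(x)/g(x) via Dinkelbach's transform over a candidate grid."""
    lam = 0.0
    x = candidates[0]
    for _ in range(max_iter):
        # Inner step: maximize the parametric objective f(x) - lam * g(x).
        x = candidates[np.argmax(f(candidates) - lam * g(candidates))]
        new_lam = f(x) / g(x)  # update the ratio estimate
        if abs(new_lam - lam) < tol:
            return x, new_lam
        lam = new_lam
    return x, lam

# Toy problem: maximize (2x + 1) / (x^2 + 1) on [0, 3];
# the optimum is x = (sqrt(5) - 1) / 2 with ratio (1 + sqrt(5)) / 2.
xs = np.linspace(0.0, 3.0, 30001)
x_star, ratio = dinkelbach(lambda x: 2 * x + 1, lambda x: x ** 2 + 1, xs)
```

In the paper, the inner maximization is not a grid search but a difference-of-convex step with a closed-form KKT solution; the fixed-point structure on the ratio is the same.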
FALQON Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic
Authors: Kanghyun Choi, Hyeyoon Lee, SunJong Park, Dain Kwon, Jinho Lee
2025-10-28
Low-bit floating-point (FP) formats, such as FP8, provide significant
computation and memory savings in model training thanks to native hardware
support on modern GPUs and NPUs. However, our analysis shows that FP8
quantization offers speedup primarily for large-dimensional matrix
multiplications, while inherent quantization overheads diminish the speedup
when applied to low-rank
adaptation (LoRA), which uses small-dimensional matrices for efficient
fine-tuning of large language models (LLMs). To address this limitation, we
propose FALQON, a novel framework that eliminates the quantization overhead
of separate LoRA computational paths by directly merging LoRA adapters into
an FP8-quantized backbone during fine-tuning. Furthermore, we reformulate the
forward and backward computations for merged adapters to significantly reduce
quantization overhead, and introduce a row-wise proxy update mechanism that
efficiently integrates substantial updates into the quantized backbone.
Experimental evaluations demonstrate that FALQON achieves approximately a
3x training speedup over existing quantized LoRA methods with a similar
level of accuracy, providing a practical solution for efficient large-scale
model fine-tuning. Moreover, FALQON's end-to-end FP8 workflow removes the need
for post-training quantization, facilitating efficient deployment. Code is
available at https://github.com/iamkanghyunchoi/falqon.
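The merging idea behind FALQON can be sketched numerically: once the LoRA factors are folded into the backbone weight as W' = W + (alpha/r)·BA, the separate low-rank path (and its per-step overhead) disappears. A minimal float sketch, assuming standard LoRA shapes; FALQON itself merges into an FP8-quantized backbone with a row-wise proxy update:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 16, 32, 4, 8

W = rng.standard_normal((d_out, d_in))  # backbone weight
A = rng.standard_normal((r, d_in))      # LoRA "down" factor
B = np.zeros((d_out, r))                # LoRA "up" factor (zero-init)
B[:, 0] = 1.0                           # pretend some training happened

# Fold the adapter into the backbone: W' = W + (alpha / r) * B @ A.
W_merged = W + (alpha / r) * B @ A

x = rng.standard_normal(d_in)
y_two_path = W @ x + (alpha / r) * B @ (A @ x)  # separate LoRA path
y_merged = W_merged @ x                         # single merged matmul
assert np.allclose(y_two_path, y_merged)
```

The two paths are algebraically identical, which is why the merged form can drop the small-dimensional matmuls that FP8 accelerates poorly.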
Pie A Programmable Serving System for Emerging LLM Applications
Authors: In Gim, Zhiyao Ma, Seung-seob Lee, Lin Zhong
2025-10-28
Emerging large language model (LLM) applications involve diverse reasoning
strategies and agentic workflows, straining the capabilities of existing
serving systems built on a monolithic token generation loop. This paper
introduces Pie, a programmable LLM serving system designed for flexibility and
efficiency. Pie decomposes the traditional generation loop into fine-grained
service handlers exposed via an API and delegates control of the generation
process to user-provided programs, called inferlets. This enables applications
to implement new decoding strategies, bespoke generation logic, and seamlessly
integrate computation and I/O entirely within the application, without
requiring modifications to the serving system. Pie executes inferlets using
WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows
Pie matches state-of-the-art performance on standard tasks (3-12% latency
overhead) while significantly improving latency and throughput (1.3x-3.4x
higher) on agentic workflows by enabling application-specific optimizations.
SpecKD Speculative Decoding for Effective Knowledge Distillation of LLMs
Authors: Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren
2025-10-28
Knowledge Distillation (KD) has become a cornerstone technique for
compressing Large Language Models (LLMs) into smaller, more efficient student
models. However, conventional KD approaches typically apply the distillation
loss uniformly across all tokens, regardless of the teacher's confidence. This
indiscriminate mimicry can introduce noise, as the student is forced to learn
from the teacher's uncertain or high-entropy predictions, which may ultimately
harm student performance-especially when the teacher is much larger and more
powerful. To address this, we propose Speculative Knowledge Distillation
(SpecKD), a novel, plug-and-play framework that introduces a dynamic,
token-level gating mechanism inspired by the "propose-and-verify" paradigm of
speculative decoding. At each step, the student's token proposal is verified
against the teacher's distribution; the distillation loss is selectively
applied only to "accepted" tokens, while "rejected" tokens are masked out.
Extensive experiments on diverse text generation tasks show that SpecKD
consistently and significantly outperforms strong KD baselines, leading to more
stable training and more capable student models, and achieving state-of-the-art
results.
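The propose-and-verify gating can be sketched as a per-token mask on the distillation loss. The acceptance rule below (teacher probability of the student's proposed token above a threshold) is an illustrative assumption, not necessarily SpecKD's exact criterion:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gated_kd_loss(student_logits, teacher_logits, tau=0.2):
    """Distillation loss applied only at tokens the teacher 'accepts'."""
    p_t = softmax(teacher_logits)       # (T, V) teacher distribution
    p_s = softmax(student_logits)       # (T, V) student distribution
    proposals = p_s.argmax(axis=-1)     # student's proposed token per step
    accept = p_t[np.arange(len(proposals)), proposals] > tau
    # Per-token cross-entropy against the teacher; rejected tokens are masked.
    ce = -(p_t * np.log(p_s + 1e-12)).sum(axis=-1)
    loss = float(ce[accept].mean()) if accept.any() else 0.0
    return loss, accept

# Two steps: teacher agrees with the student at step 0, disagrees at step 1.
student = np.array([[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
teacher = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
loss, mask = gated_kd_loss(student, teacher)
```

Only the agreed-upon step contributes to the loss, which is the mechanism the abstract credits for filtering out the teacher's high-entropy predictions.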
PRO Enabling Precise and Robust Text Watermark for Open-Source LLMs
Authors: Jiaqi Xue, Yifei Zhao, Mansour Al Ghanim, Shangqian Gao, Ruimin Sun, Qian Lou, Mengxin Zheng
2025-10-27
Text watermarking for large language models (LLMs) enables model owners to
verify text origin and protect intellectual property. While watermarking
methods for closed-source LLMs are relatively mature, extending them to
open-source models remains challenging, as developers cannot control the
decoding process. Consequently, owners of open-source LLMs lack practical means
to verify whether text was generated by their models. A core difficulty lies in
embedding watermarks directly into model weights without hurting detectability.
A promising idea is to distill watermarks from a closed-source model into an
open one, but this suffers from (i) poor detectability due to mismatch between
learned and predefined patterns, and (ii) fragility to downstream modifications
such as fine-tuning or model merging. To overcome these limitations, we propose
PRO, a Precise and Robust text watermarking method for open-source LLMs. PRO
jointly trains a watermark policy model with the LLM, producing patterns that
are easier for the model to learn and more consistent with detection criteria.
A regularization term further simulates downstream perturbations and penalizes
degradation in watermark detectability, ensuring robustness under model edits.
Experiments on open-source LLMs (e.g., LLaMA-3.2, LLaMA-3, Phi-2) show that PRO
substantially improves both watermark detectability and resilience to model
modifications.
Accurate Prediction of Nonlinear Distortion of Multi-Carrier Signals
Authors: Cameron M. Pike, Brad Oney, Gabriel Hepner, Animesh Yadav
2025-10-27
Nonlinearities in power amplifiers adversely affect multi-carrier modulation
techniques. Accurate prediction of nonlinear distortion is essential for making
design trade-offs between output power and network throughput. We use the
series form of the characteristic function (ch.f.) method to predict distortion
spectra for multi-carrier transmissions. This method results in
efficient calculations of individual signal and distortion components. The
method is validated both theoretically and practically. Theoretical validation
is performed by modeling the signal as a bandpass Gaussian process that is hard
limited, and it is shown that the series ch.f. method produces results that are
identical with the classical Price's theorem. Practical validation is shown by
considering an orthogonal frequency division multiplexing (OFDM) signal with a
fragmented spectrum, which is then applied to an amplifier driven into
saturation, for which application of Price's theorem is difficult, and the
predicted output spectrum corroborates laboratory measurements. Part of the
computational efficiency is realized in that the nonlinearity can be expressed
as the fast Fourier transform (FFT) of samples of its forward scattering
parameter (i.e., S21) or transconductance function (including AM-PM effects),
and distortion contributions of the signal can be expressed as numerical
autoconvolutions of the clean spectrum. Signal-to-distortion ratio (SDR) can be
easily computed and parameterized across variables of interest, such as
overdrive level.
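The autoconvolution claim can be illustrated directly: an n-th order nonlinearity maps the signal spectrum to its n-fold autoconvolution, which is why third-order products occupy roughly three times the original bandwidth. A minimal numpy sketch with a toy flat spectrum (not the paper's fragmented-OFDM setup):

```python
import numpy as np

S = np.zeros(64)
S[28:36] = 1.0  # toy clean spectrum: 8 bins wide

# Third-order distortion term: three-fold autoconvolution of the spectrum.
S3 = np.convolve(np.convolve(S, S), S)

def support(x):
    """Width of the nonzero region, in bins."""
    idx = np.nonzero(x > 1e-12)[0]
    return idx[-1] - idx[0] + 1

assert support(S) == 8
assert support(S3) == 3 * 8 - 2  # convolution widens support to 3B - 2 bins
```

The same bookkeeping, carried out with FFTs of the measured S21/transconductance samples, is what makes the series ch.f. prediction computationally cheap.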
Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders
Authors: Nathan Paek, Yongyi Zang, Qihui Yang, Randal Leistikow
2025-10-27
While sparse autoencoders (SAEs) successfully extract interpretable features
from language models, applying them to audio generation faces unique
challenges: audio's dense nature requires compression that obscures semantic
meaning, and automatic feature characterization remains limited. We propose a
framework for interpreting audio generative models by mapping their latent
representations to human-interpretable acoustic concepts. We train SAEs on
audio autoencoder latents, then learn linear mappings from SAE features to
discretized acoustic properties (pitch, amplitude, and timbre). This enables
both controllable manipulation and analysis of the AI music generation process,
revealing how acoustic properties emerge during synthesis. We validate our
approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer)
audio latent spaces, and analyze DiffRhythm, a state-of-the-art text-to-music
model, to demonstrate how pitch, timbre, and loudness evolve throughout
generation. While our work addresses only the audio modality, our framework can
be extended to interpretable analysis of visual latent-space generative models.
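The probing step, learning a linear map from SAE feature activations to a discretized acoustic property, can be sketched with synthetic data; the feature matrix and toy "pitch" target below are illustrative assumptions, whereas the real pipeline first trains SAEs on audio-autoencoder latents:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_feat = 200, 32
F = rng.standard_normal((n, d_feat))  # stand-in for SAE feature activations
w_true = rng.standard_normal(d_feat)
pitch = F @ w_true                    # toy continuous acoustic property

# Learn the linear mapping by least squares; noiseless, so recovery is exact.
w_hat, *_ = np.linalg.lstsq(F, pitch, rcond=None)
assert np.allclose(w_hat, w_true)

# Discretize into 3 bins (cf. the paper's "discretized acoustic properties")
# and check the probe predicts the correct bin everywhere.
edges = np.quantile(pitch, [1 / 3, 2 / 3])
assert (np.digitize(F @ w_hat, edges) == np.digitize(pitch, edges)).all()
```

With real SAE features the fit is only approximate, and the quality of such linear probes is what indicates how cleanly an acoustic concept is encoded.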
CountFormer A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting
Authors: Md Tanvir Hossain, Akif Islam, Mohd Ruhul Ameen
2025-10-27
Humans can effortlessly count diverse objects by perceiving visual repetition
and structural relationships rather than relying on class identity. However,
most existing counting models fail to replicate this ability; they often
miscount when objects exhibit complex shapes, internal symmetry, or overlapping
components. In this work, we introduce CountFormer, a transformer-based
framework that learns to recognize repetition and structural coherence for
class-agnostic object counting. Built upon the CounTR architecture, our model
replaces its visual encoder with the self-supervised foundation model DINOv2,
which produces richer and spatially consistent feature representations. We
further incorporate positional embedding fusion to preserve geometric
relationships before decoding these features into density maps through a
lightweight convolutional decoder. Evaluated on the FSC-147 dataset, our model
achieves performance comparable to current state-of-the-art methods while
demonstrating superior accuracy on structurally intricate or densely packed
scenes. Our findings indicate that integrating foundation models such as DINOv2
enables counting systems to approach human-like structural perception,
advancing toward a truly general and exemplar-free counting paradigm.
BitSkip An Empirical Analysis of Quantization and Early Exit Composition
Authors: Ramshankar Bhuvaneswaran, Handan Liu
2025-10-27
The pursuit of efficient Large Language Models (LLMs) has led to increasingly
complex techniques like extreme quantization and dynamic routing. While
individual benefits of these methods are well-documented, their compositional
effects remain poorly understood. This paper introduces BitSkip, a hybrid
architectural framework for systematically exploring these interactions.
Counter-intuitively, our findings reveal that a simple 8-bit quantized model
without Hadamard transform (BitSkip-V1) not only outperforms its more complex
4-bit and Hadamard-enhanced counterparts but also competes with the
full-precision baseline in quality (perplexity of 1.13 vs. 1.19). The
introduction of Hadamard transforms, even at 8-bit precision, catastrophically
degraded performance by over 37,000%, which we trace to fundamental training
instability. Our BitSkip-V1 recipe demonstrates superior early-exit
characteristics, with layer 18 providing an optimal 32.5% speed gain for a
minimal 4% quality loss.
Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization
Authors: Bernardo Torres, Manuel Moussallam, Gabriel Meseguer-Brocal
2025-10-27
Audio autoencoders learn useful, compressed audio representations, but their
non-linear latent spaces prevent intuitive algebraic manipulation such as
mixing or scaling. We introduce a simple training methodology to induce
linearity in a high-compression Consistency Autoencoder (CAE) by using data
augmentation, thereby inducing homogeneity (equivariance to scalar gain) and
additivity (the decoder preserves addition) without altering the model's
architecture or loss function. When trained with our method, the CAE exhibits
linear behavior in both the encoder and decoder while preserving reconstruction
fidelity. We test the practical utility of our learned space on music source
composition and separation via simple latent arithmetic. This work presents a
straightforward technique for constructing structured latent spaces, enabling
more intuitive and efficient audio processing.
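The two properties being induced can be checked directly: homogeneity means enc(a·x) = a·enc(x), and additivity means enc(x + y) = enc(x) + enc(y). A trained CAE satisfies these only approximately; the sketch below uses an exactly linear toy "encoder" so the checks pass by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.standard_normal((8, 128))  # toy linear encoder matrix (illustrative)
enc = lambda x: E @ x

x = rng.standard_normal(128)
y = rng.standard_normal(128)
a = 0.5

assert np.allclose(enc(a * x), a * enc(x))       # homogeneity (scalar gain)
assert np.allclose(enc(x + y), enc(x) + enc(y))  # additivity (mixing)
```

Once both hold, latent arithmetic becomes meaningful: decoding enc(x) + enc(y) yields (approximately) the mix of the two sources, which is what the paper exploits for source composition and separation.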
Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier
Authors: Hyeongseop Rha, Jeong Hun Yeo, Yeonju Kim, Yong Man Ro
2025-10-27
The recent advancement of Multimodal Large Language Models (MLLMs) is
transforming human-computer interaction (HCI) from surface-level exchanges into
more nuanced and emotionally intelligent communication. To realize this shift,
emotion understanding becomes essential, allowing systems to capture subtle cues
underlying user intent. Furthermore, providing faithful explanations for
predicted emotions is crucial to ensure interpretability and build user trust.
However, current MLLM-based methods often generate emotion explanations that
diverge from the target labels and sometimes even contradict their own
predicted emotions. This inconsistency poses a critical risk for
misunderstanding and erodes reliability in interactive settings. To address
this, we propose a novel approach: the Emotional Rationale Verifier (ERV) and
an Explanation Reward. Our method guides the model to produce reasoning that is
explicitly consistent with the target emotion during multimodal emotion
recognition without modifying the model architecture or requiring additional
paired video-description annotations. Our method significantly improves
faithful explanation-prediction consistency and explanation emotion accuracy on
the MAFW and DFEW datasets. Through extensive experiments and human
evaluations, we show that our approach not only enhances alignment between
explanation and prediction but also empowers MLLMs to deliver emotionally
coherent, trustworthy interactions, marking a key step toward truly human-like
HCI systems.
Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving
Authors: Xinyu Wang, Jonas M. Kübler, Kailash Budhathoki, Yida Wang, Matthäus Kleindessner
2025-10-27
When a single base LLM is served with several different LoRA adapters
simultaneously, the adapters cannot simply be merged with the base model's
weights, as adapter swapping would create overhead and requests using
different adapters could not be batched. Rather, the LoRA computations have to
be separated from the base LLM computations, and in a multi-device setup the
LoRA adapters can be sharded in a way that is well aligned with the base
model's tensor parallel execution, as proposed in S-LoRA. However, the S-LoRA
sharding strategy encounters some communication overhead, which may be small in
theory, but can be large in practice. In this paper, we propose to constrain
certain LoRA factors to be block-diagonal, which allows for an alternative way
of sharding LoRA adapters that does not require any additional communication
for the LoRA computations. We demonstrate in extensive experiments that our
block-diagonal LoRA approach is similarly parameter efficient as standard LoRA
(i.e., for a similar number of parameters it achieves similar downstream
performance) and that it leads to significant end-to-end speed-up over S-LoRA.
For example, when serving on eight A100 GPUs, we observe up to 1.79x (1.23x)
end-to-end speed-up with 0.87x (1.74x) the number of adapter parameters for
Llama-3.1-70B, and up to 1.63x (1.3x) end-to-end speed-up with 0.86x (1.73x)
the number of adapter parameters for Llama-3.1-8B.
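Why a block-diagonal factor shards without communication: if one LoRA factor is block-diagonal over k devices, each device can compute its shard of A·x from its local slice of x alone, so no all-gather is needed on the LoRA path. A small sketch; the shapes and the k=2 split are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, k = 8, 4, 2  # model dim, LoRA rank, tensor-parallel degree
blocks = [rng.standard_normal((r // k, d // k)) for _ in range(k)]

# Assemble the block-diagonal LoRA factor A from per-device blocks.
A = np.zeros((r, d))
for i, blk in enumerate(blocks):
    A[i * (r // k):(i + 1) * (r // k), i * (d // k):(i + 1) * (d // k)] = blk

x = rng.standard_normal(d)
full = A @ x  # what a single device would compute
# Each "device" computes its output shard from only its local input slice.
sharded = np.concatenate([blk @ x[i * (d // k):(i + 1) * (d // k)]
                          for i, blk in enumerate(blocks)])
assert np.allclose(full, sharded)
```

With a dense A, each output shard would depend on all of x, which is exactly where S-LoRA's extra communication comes from.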
Adaptive Blockwise Search Inference-Time Alignment for Large Language Models
Authors: Mohammad Atif Quamar, Mohammad Areeb, Nishant Sharma, Ananth Shreekumar, Jonathan Rosenthal, Muslum Ozgur Ozmen, Mikhail Kuznetsov, Z. Berkay Celik
2025-10-27
LLM alignment remains a critical challenge. Inference-time methods provide a
flexible alternative to fine-tuning, but their uniform computational effort
often yields suboptimal alignment. We hypothesize that for many alignment
tasks, the initial tokens of a response are disproportionately more critical.
To leverage this principle, we introduce AdaSearch, a novel blockwise search
strategy. It adaptively allocates a fixed computational budget using a sampling
schedule, focusing search effort on these critical tokens. We apply AdaSearch
to sequential decoding and introduce its tree-search counterpart, AdaBeam. Our
comprehensive evaluation across eight LLMs demonstrates that AdaSearch
outperforms strong Best-of-N and fine-tuning baselines. Specifically, win-rates
improve by over 10% for harmlessness generation, controlled sentiment
generation, and for mathematical reasoning tasks relative to Best-of-N.
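The hypothesis that early tokens matter most suggests a front-loaded sampling schedule: spend more of a fixed candidate budget on the first blocks and decay thereafter. The geometric decay below is an illustrative assumption, not AdaSearch's exact schedule:

```python
def block_budget(total, n_blocks, decay=0.5):
    """Split a fixed search budget across blocks, front-loading early ones."""
    weights = [decay ** i for i in range(n_blocks)]
    scale = total / sum(weights)
    # At least one candidate per block, proportional to the decayed weight.
    return [max(1, round(w * scale)) for w in weights]

alloc = block_budget(total=28, n_blocks=4)
assert alloc[0] > alloc[-1]  # search effort concentrates on early tokens
```

Under this schedule the first response block gets roughly half the candidates, mirroring the paper's claim that initial tokens are disproportionately critical for alignment.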
Evaluation of Vision-LLMs in Surveillance Video
Authors: Pascal Benschop, Cristian Meo, Justin Dauwels, Jelte P. Mense
2025-10-27
The widespread use of cameras in our society has created an overwhelming
amount of video data, far exceeding the capacity for human monitoring. This
presents a critical challenge for public safety and security, as the timely
detection of anomalous or criminal events is crucial for effective response and
prevention. The ability for an embodied agent to recognize unexpected events is
fundamentally tied to its capacity for spatial reasoning. This paper
investigates the spatial reasoning of vision-language models (VLMs) by framing
anomalous action recognition as a zero-shot, language-grounded task, addressing
the embodied perception challenge of interpreting dynamic 3D scenes from
2D video. Specifically, we investigate whether small, pre-trained vision-LLMs
can act as spatially grounded, zero-shot anomaly detectors by converting video
into text descriptions and scoring labels via textual entailment. We evaluate
four open models on UCF-Crime and RWF-2000 under prompting and
privacy-preserving conditions. Few-shot exemplars can improve accuracy for some
models, but may increase false positives, and privacy filters -- especially
full-body GAN transforms -- introduce inconsistencies that degrade accuracy.
These results chart where current vision-LLMs succeed (simple, spatially
salient events) and where they falter (noisy spatial cues, identity
obfuscation). Looking forward, we outline concrete paths to strengthen spatial
grounding without task-specific training: structure-aware prompts, lightweight
spatial memory across clips, scene-graph or 3D-pose priors during description,
and privacy methods that preserve action-relevant geometry. This positions
zero-shot, language-grounded pipelines as adaptable building blocks for
embodied, real-world video understanding. Our implementation for evaluating
VLMs is publicly available at:
https://github.com/pascalbenschopTU/VLLM_AnomalyRecognition
Lost in Tokenization Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs
Authors: Kai Zhuang, Jiawei Zhang, Yumou Liu, Hanqun Cao, Chunbin Gu, Mengdi Liu, Zhangyang Gao, Zitong Jerry Wang, Xuanhe Zhou, Pheng-Ann Heng, Lijun Wu, Conghui He, Cheng Tan
2025-10-27
Scientific Large Language Models (Sci-LLMs) have emerged as a promising
frontier for accelerating biological discovery. However, these models face a
fundamental challenge when processing raw biomolecular sequences: the
tokenization dilemma. Whether treating sequences as a specialized language,
risking the loss of functional motif information, or as a separate modality,
introducing formidable alignment challenges, current strategies fundamentally
limit their reasoning capacity. We challenge this sequence-centric paradigm by
positing that a more effective strategy is to provide Sci-LLMs with high-level
structured context derived from established bioinformatics tools, thereby
bypassing the need to interpret low-level noisy sequence data directly. Through
a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we
tested three input modes: sequence-only, context-only, and a combination of
both. Our findings are striking: the context-only approach consistently and
substantially outperforms all other modes. Even more revealing, the inclusion
of the raw sequence alongside its high-level context consistently degrades
performance, indicating that raw sequences act as informational noise, even for
models with specialized tokenization schemes. These results suggest that the
primary strength of existing Sci-LLMs lies not in their nascent ability to
interpret biomolecular syntax from scratch, but in their profound capacity for
reasoning over structured, human-readable knowledge. Therefore, we argue for
reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines
over expert knowledge. This work lays the foundation for a new class of hybrid
scientific AI agents, repositioning the developmental focus from direct
sequence interpretation towards high-level knowledge synthesis. The code is
available at https://github.com/opendatalab-raiser/CoKE.
Beyond Imprecise Distance Metrics LLM-Predicted Target Call Stacks for Directed Greybox Fuzzing
Authors: Yifan Zhang, Xin Zhang
2025-10-27
Directed greybox fuzzing (DGF) aims to efficiently trigger bugs at specific
target locations by prioritizing seeds whose execution paths are more likely to
mutate into triggering target bugs. However, existing DGF approaches suffer
from imprecise probability calculations due to their reliance on complex
distance metrics derived from static analysis. The over-approximations inherent
in static analysis cause a large number of irrelevant execution paths to be
mistakenly considered to potentially mutate into triggering target bugs,
significantly reducing fuzzing efficiency. We propose to replace static
analysis-based distance metrics with precise call stack representations. Call
stacks represent precise control flows, thereby avoiding false information in
static analysis. We leverage large language models (LLMs) to predict
vulnerability-triggering call stacks for guiding seed prioritization. Our
approach constructs call graphs through static analysis to identify methods
that can potentially reach target locations, then utilizes LLMs to predict the
most likely call stack sequence that triggers the vulnerability. Seeds whose
execution paths have higher overlap with the predicted call stack are
prioritized for mutation. This is the first work to integrate LLMs into the
core seed prioritization mechanism of DGF. We implement our approach and
evaluate it against several state-of-the-art fuzzers. On a suite of real-world
programs, our approach triggers vulnerabilities faster than baselines. In
addition, our approach identifies 10 new
vulnerabilities and 2 incomplete fixes in the latest versions of programs used
in our controlled experiments through directed patch testing, with 10 assigned
CVE IDs.
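The seed-prioritization idea can be sketched as scoring each seed by how much of the predicted target call stack its execution path covers, then mutating high-scoring seeds first. The function names and the coverage-fraction scoring rule are illustrative assumptions, not the paper's exact metric:

```python
def stack_overlap(predicted_stack, executed_calls):
    """Fraction of the predicted call stack covered by a seed's execution."""
    executed = set(executed_calls)
    hits = sum(1 for fn in predicted_stack if fn in executed)
    return hits / len(predicted_stack)

# Hypothetical LLM-predicted vulnerability-triggering call stack.
predicted = ["main", "parse_input", "handle_record", "memcpy_field"]
seeds = {
    "seed_a": ["main", "parse_input", "emit_stats"],
    "seed_b": ["main", "parse_input", "handle_record", "memcpy_field"],
}
ranked = sorted(seeds, key=lambda s: stack_overlap(predicted, seeds[s]),
                reverse=True)
assert ranked[0] == "seed_b"  # the full-overlap seed is mutated first
```

Because call stacks encode precise control flow, this scoring avoids the over-approximation that makes static-analysis distance metrics admit irrelevant paths.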
P1GPT a multi-agent LLM workflow module for multi-modal financial information analysis
Authors: Chen-Che Lu, Yun-Cheng Chou, Teng-Ruei Chen
2025-10-27
Recent advances in large language models (LLMs) have enabled multi-agent
reasoning systems capable of collaborative decision-making. However, in
financial analysis, most frameworks remain narrowly focused on either isolated
single-agent predictors or loosely connected analyst ensembles, and they lack a
coherent reasoning workflow that unifies diverse data modalities. We introduce
P1GPT, a layered multi-agent LLM framework for multi-modal financial
information analysis and interpretable trading decision support. Unlike prior
systems that emulate trading teams through role simulation, P1GPT implements a
structured reasoning pipeline that systematically fuses technical, fundamental,
and news-based insights through coordinated agent communication and
integration-time synthesis. Backtesting on multi-modal datasets across major
U.S. equities demonstrates that P1GPT achieves superior cumulative and
risk-adjusted returns, maintains low drawdowns, and provides transparent causal
rationales. These findings suggest that structured reasoning workflows, rather
than agent role imitation, offer a scalable path toward explainable and
trustworthy financial AI systems.
UGAE Unified Geometry and Attribute Enhancement for G-PCC Compressed Point Clouds
Authors: Pan Zhao, Hui Yuan, Chongzhen Tian, Tian Guo, Raouf Hamzaoui, Zhigeng Pan
2025-10-27
Lossy compression of point clouds reduces storage and transmission costs;
however, it inevitably leads to irreversible distortion in geometry structure
and attribute information. To address these issues, we propose a unified
geometry and attribute enhancement (UGAE) framework, which consists of three
core components: post-geometry enhancement (PoGE), pre-attribute enhancement
(PAE), and post-attribute enhancement (PoAE). In PoGE, a Transformer-based
convolutional U-Net is used to reconstruct the geometry structure with
high precision by predicting voxel occupancy probabilities. Building on the
refined geometry structure, PAE introduces an innovative enhanced
geometry-guided recoloring strategy, which uses a detail-aware K-Nearest
Neighbors (DA-KNN) method to achieve accurate recoloring and effectively
preserve high-frequency details before attribute compression. Finally, at the
decoder side, PoAE uses an attribute residual prediction network with a
weighted mean squared error (W-MSE) loss to enhance the quality of
high-frequency regions while maintaining the fidelity of low-frequency regions.
UGAE significantly outperformed existing methods on three benchmark datasets:
8iVFB, Owlii, and MVUB. Compared to the latest G-PCC test model (TMC13v29),
UGAE achieved an average BD-PSNR gain of 9.98 dB and 90.98% BD-bitrate savings
for geometry under the D1 metric, as well as a 3.67 dB BD-PSNR improvement with
56.88% BD-bitrate savings for attributes on the Y component. Additionally, it
improved perceptual quality significantly.
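The W-MSE loss in the PoAE stage can be sketched as an MSE whose per-point weights up-weight high-frequency (detail) regions so their residual errors dominate the objective. The concrete weight values below are illustrative assumptions:

```python
import numpy as np

def w_mse(pred, target, weights):
    """Weighted MSE: weighted average of squared per-point errors."""
    return float(np.average((pred - target) ** 2, weights=weights))

target = np.array([1.0, 1.0, 5.0, 1.0])  # one "high-frequency" detail point
pred = np.array([1.0, 1.0, 4.0, 1.0])    # reconstruction misses that point
flat_w = np.ones(4)                      # plain MSE weighting
hf_w = np.array([1.0, 1.0, 3.0, 1.0])    # up-weight the detail region

# The same error is penalized more under detail-aware weighting.
assert w_mse(pred, target, hf_w) > w_mse(pred, target, flat_w)
```

Up-weighting detail regions pushes the residual network to fix high-frequency attributes first while low-frequency regions, already well reconstructed, contribute little gradient.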
Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition
Authors: Jing-Xuan Zhang, Genshun Wan, Jin Li, Jianqing Gao
2025-10-27
Unified speech recognition aims to perform auditory, visual, and audiovisual
speech recognition within a single model framework. While speech foundation
models (SFMs) have demonstrated remarkable performance in auditory tasks, their
adaptation to multimodal scenarios remains underexplored. This paper presents
UASR-LLM, a novel framework that adapts frozen SFMs to unified VSR, ASR, and
AVSR tasks by leveraging large language models (LLMs) as text decoders. Our
approach introduces visual representations into multiple SFM layers through
visual injection modules, enabling multimodal input processing and unified
hidden representations. The augmented SFMs connect with decoder-only LLMs via a
feed-forward adaptor, where concatenated representations and instruction
prompts guide speech transcription. We implement a two-stage training strategy:
visual injection pretraining followed by speech recognition finetuning. SFM
parameters remain frozen throughout training, with only visual injection
modules optimized initially, and LLMs finetuned using LoRA parameters
subsequently. Experimental results demonstrate superior performance over
state-of-the-art baselines across VSR, ASR, and AVSR tasks under both clean and
noisy conditions. Ablation studies confirm generalization across various SFMs
and LLMs, validating the proposed training strategy.
Switchable Token-Specific Codebook Quantization For Face Image Compression
Authors: Yongbo Wang, Haonan Wang, Guodong Mu, Ruixin Zhang, Jiaqi Chen, Jingyun Zhang, Jun Wang, Yuan Xie, Zhizhong Zhang, Shouhong Ding
2025-10-27
With the ever-increasing volume of visual data, the efficient and lossless
transmission, along with its subsequent interpretation and understanding, has
become a critical bottleneck in modern information systems. Emerging
codebook-based solutions utilize a globally shared codebook to encode and
decode each token, controlling the bpp by adjusting the number of tokens or
the codebook size. However, for facial images, which are rich in attributes,
such global codebook strategies overlook both the category-specific
correlations within images and the semantic differences among tokens, resulting
in suboptimal performance, especially at low bpp. Motivated by these
observations, we propose a Switchable Token-Specific Codebook Quantization for
face image compression, which learns distinct codebook groups for different
image categories and assigns an independent codebook to each token. By
recording the codebook group to which each token belongs with a small number of
bits, our method can reduce the loss incurred when decreasing the size of each
codebook group. This enables a larger total number of codebooks under a lower
overall bpp, thereby enhancing the expressive capability and improving
reconstruction performance. Owing to its generalizable design, our method can
be integrated into any existing codebook-based representation learning approach
and has demonstrated its effectiveness on face recognition datasets, achieving
an average accuracy of 93.51% for reconstructed images at 0.05 bpp.
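The rate trade-off behind the switchable design is simple arithmetic: recording a per-token group index costs a few extra bits, but lets each group's codebook shrink. The numbers below are illustrative, not the paper's configuration:

```python
import math

def bits_per_token(codebook_size, n_groups=1):
    """Index bits per token: codebook index plus optional group index."""
    group_bits = math.log2(n_groups) if n_groups > 1 else 0.0
    return math.log2(codebook_size) + group_bits

shared = bits_per_token(1024)                  # one global 1024-entry codebook
switchable = bits_per_token(256, n_groups=4)   # 4 groups of 256 entries each

# Same per-token rate, but the switchable variant spends its 1024 total
# entries on four specialized codebooks instead of one generic one.
assert shared == switchable == 10.0
```

At equal bpp, specialization is where the reconstruction gain comes from; conversely, shrinking each group's codebook below this break-even point lowers the overall bpp, as the abstract describes.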
How Can AI Augment Access to Justice? Public Defenders' Perspectives on AI Adoption
Authors: Inyoung Cheong, Patty Liu, Dominik Stammbach, Peter Henderson
2025-10-27
Public defenders are asked to do more with less: representing clients
deserving of adequate counsel while facing overwhelming caseloads and scarce
resources. While artificial intelligence (AI) and large language models (LLMs)
are promoted as tools to alleviate this burden, such proposals are detached
from the lived realities of public defenders. This study addresses that gap
through semi-structured interviews with fourteen practitioners across the
United States to examine their experiences with AI, anticipated applications,
and ethical concerns. We find that AI adoption is constrained by costs,
restrictive office norms, confidentiality risks, and unsatisfactory tool
quality. To clarify where AI can and cannot contribute, we propose a task-level
map of public defense. Public defenders view AI as most useful for evidence
investigation to analyze overwhelming amounts of digital records, with narrower
roles in legal research & writing, and client communication. Courtroom
representation and defense strategy are considered least compatible with AI
assistance, as they depend on contextual judgment and trust. Public defenders
emphasize safeguards for responsible use, including mandatory human
verification, limits on overreliance, and the preservation of the relational
aspects of lawyering. Building on these findings, we outline a research agenda
that
promotes equitable access to justice by prioritizing open-source models,
domain-specific datasets and evaluation, and participatory design that
incorporates defenders' perspectives into system development.
Rethinking Inference Placement for Deep Learning across Edge and Cloud Platforms A Multi-Objective Optimization Perspective and Future Directions
Authors: Zongshun Zhang, Ibrahim Matta
2025-10-27
Edge intelligent applications like VR/AR and language model based chatbots
have become widespread with the rapid expansion of IoT and mobile devices.
However, constrained edge devices often cannot serve the increasingly large and
complex deep learning (DL) models. To mitigate these challenges, researchers
have proposed optimizing and offloading partitions of DL models among user
devices, edge servers, and the cloud. In this setting, users can take advantage
of different services to support their intelligent applications. For example,
edge resources offer low response latency. In contrast, cloud platforms provide
low monetary cost computation resources for computation-intensive workloads.
However, communication between DL model partitions can introduce transmission
bottlenecks and pose risks of data leakage. Recent research aims to balance
accuracy, computation delay, transmission delay, and privacy concerns. They
address these issues with model compression, model distillation, transmission
compression, and model architecture adaptations, including internal
classifiers. This survey contextualizes the state-of-the-art model offloading
methods and model adaptation techniques by studying their implications for a
multi-objective optimization comprising inference latency, data privacy, and
resource monetary cost.
Batch Speculative Decoding Done Right
Authors: Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li, Rui Zhang
2025-10-26
Speculative decoding speeds up inference by using a small draft model to
propose multiple tokens that a target model verifies in parallel. Extending
this idea to batches is essential for production serving, but it introduces the
ragged tensor problem: sequences in the same batch accept different numbers of
draft tokens, breaking right-alignment and corrupting position IDs, attention
masks, and KV-cache state. We show that several existing batch implementations
violate output equivalence, the fundamental requirement that speculative
decoding must produce identical token sequences to standard autoregressive
generation. These violations occur precisely due to improper handling of the
ragged tensor problem. In response, we (1) characterize the synchronization
requirements that guarantee correctness, (2) present EQSPEC, a correctness-first
batch speculative decoding scheme that exposes realignment as consuming 40% of
overhead, and (3) introduce EXSPEC, which maintains a sliding pool of sequences
and dynamically forms same-length groups, to reduce the realignment overhead
while preserving per-sequence speculative speedups. On the SpecBench dataset,
across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our
approach achieves up to 3x throughput improvement at batch size 8
compared to batch size 1, with efficient scaling through batch size 8, while
maintaining 95% output equivalence. Our method requires no custom kernels and
integrates cleanly with existing inference stacks. Our code is available at
https://github.com/eBay/spec_dec.
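The ragged-tensor problem the abstract describes can be sketched in a few lines. This is our toy illustration, not the EQSPEC/EXSPEC code: `realign_batch`, the `PAD` sentinel, and the exact-prefix acceptance rule are our own simplifications of the batch realignment step.

```python
# Toy illustration (ours, not EQSPEC/EXSPEC) of the ragged tensor
# problem: after parallel verification each sequence accepts a
# different number of draft tokens, so the batch must be re-padded and
# its position IDs recomputed before the next decoding step.

PAD = -1

def accepted_prefix_len(draft, target):
    """Standard speculative acceptance: longest exactly-matching prefix."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n

def realign_batch(contexts, drafts, targets):
    new_seqs = []
    for ctx, draft, target in zip(contexts, drafts, targets):
        k = accepted_prefix_len(draft, target)
        seq = ctx + draft[:k]
        if k < len(target):          # append the target's correction token
            seq = seq + [target[k]]
        new_seqs.append(seq)
    # Left-pad to restore right-alignment, then rebuild position IDs so
    # real tokens count from 0 upward (pad positions would be masked
    # out by the attention mask in a real system).
    width = max(len(s) for s in new_seqs)
    padded = [[PAD] * (width - len(s)) + s for s in new_seqs]
    pos_ids = [[max(0, i - (width - len(s))) for i in range(width)]
               for s in new_seqs]
    return padded, pos_ids
```

If either step is skipped, sequences in the batch drift out of alignment and position IDs no longer match each sequence's true length, which is exactly the output-equivalence failure mode the paper diagnoses.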
Sub-microsecond Transformers for Jet Tagging on FPGAs
Authors: Lauri Laatu, Chang Sun, Arianna Cox, Abhijith Gandrakota, Benedikt Maier, Jennifer Ngadiuba, Zhiqiang Que, Wayne Luk, Maria Spiropulu, Alexander Tapper
2025-10-26
We present the first sub-microsecond transformer implementation on an FPGA
achieving competitive performance for state-of-the-art high-energy physics
benchmarks. Transformers have shown exceptional performance on multiple tasks
in modern machine learning applications, including jet tagging at the CERN
Large Hadron Collider (LHC). However, their computational complexity has until
now prohibited their use in real-time applications, such as the hardware
trigger systems of the collider experiments. In this work, we demonstrate the
first application of transformers for jet tagging on FPGAs, achieving
nanosecond latency with superior performance compared to alternative
baseline models. We leverage high-granularity quantization and distributed
arithmetic optimization to fit the entire transformer model on a
single FPGA, achieving the required throughput and latency. Furthermore, we add
multi-head attention and linear attention support to hls4ml, making our work
accessible to the broader fast machine learning community. This work advances
the next-generation trigger systems for the High Luminosity LHC, enabling the
use of transformers for real-time applications in high-energy physics and
beyond.
Long-Term PM2.5 Forecasting Using a DTW-Enhanced CNN-GRU Model
Authors: Amirali Ataee Naeini, Arshia Ataee Naeini, Fatemeh Karami Mohammadi, Omid Ghaffarpasand
2025-10-26
Reliable long-term forecasting of PM2.5 concentrations is critical for public
health early-warning systems, yet existing deep learning approaches struggle to
maintain prediction stability beyond 48 hours, especially in cities with sparse
monitoring networks. This paper presents a deep learning framework that
combines Dynamic Time Warping (DTW) for intelligent station similarity
selection with a CNN-GRU architecture to enable extended-horizon PM2.5
forecasting in Isfahan, Iran, a city characterized by complex pollution
dynamics and limited monitoring coverage. Unlike existing approaches that rely
on computationally intensive transformer models or external simulation tools,
our method integrates three key innovations: (i) DTW-based historical sampling
to identify similar pollution patterns across peer stations, (ii) a lightweight
CNN-GRU architecture augmented with meteorological features, and (iii) a
scalable design optimized for sparse monitoring networks. Experimental validation using
multi-year hourly data from eight monitoring stations demonstrates superior
performance compared to state-of-the-art deep learning methods, achieving R2 =
0.91 for 24-hour forecasts. Notably, this is the first study to demonstrate
stable 10-day PM2.5 forecasting (R2 = 0.73 at 240 hours) without performance
degradation, addressing critical early-warning system requirements. The
framework's computational efficiency and independence from external tools make
it particularly suitable for deployment in resource-constrained urban
environments.
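The DTW-based station-similarity step in (i) can be illustrated with the textbook dynamic-programming recurrence. This is a generic sketch, not the paper's code: the station names and series below are invented, and `rank_peer_stations` is our own helper name.

```python
# Illustrative sketch of ranking "peer" monitoring stations by Dynamic
# Time Warping distance to a target station's PM2.5 history.

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping moves
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def rank_peer_stations(target, stations):
    """Return station names sorted from most to least similar series."""
    return sorted(stations, key=lambda name: dtw_distance(target, stations[name]))
```

Because DTW allows elastic alignment in time, a station whose pollution episodes lead or lag the target by a few hours still ranks as a close peer, which is the point of using it instead of pointwise correlation.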
Leveraging Large Language Models to Identify Conversation Threads in Collaborative Learning
Authors: Prerna Ravi, Dong Won Lee, Beatriz Flamia, Jasmine David, Brandon Hanks, Cynthia Breazeal, Emma Anderson, Grace Lin
2025-10-26
Understanding how ideas develop and flow in small-group conversations is
critical for analyzing collaborative learning. A key structural feature of
these interactions is threading, the way discourse naturally organizes
into interwoven topical strands that evolve over time. While threading has been
widely studied in asynchronous text settings, detecting threads in synchronous
spoken dialogue remains challenging due to overlapping turns and implicit cues.
At the same time, large language models (LLMs) show promise for automating
discourse analysis but often struggle with long-context tasks that depend on
tracing these conversational links. In this paper, we investigate whether
explicit thread linkages can improve LLM-based coding of relational moves in
group talk. We contribute a systematic guidebook for identifying threads in
synchronous multi-party transcripts and benchmark different LLM prompting
strategies for automated threading. We then test how threading influences
performance on downstream coding of conversational analysis frameworks that
capture core collaborative actions such as agreeing, building, and eliciting.
Our results show that providing clear conversational thread information
improves LLM coding performance and underscores the heavy reliance of
downstream analysis on well-structured dialogue. We also discuss practical
trade-offs in time and cost, emphasizing where human-AI hybrid approaches can
yield the best value. Together, this work advances methods for combining LLMs
and robust conversational thread structures to make sense of complex, real-time
group interactions.
Region-Adaptive Learned Hierarchical Encoding for 3D Gaussian Splatting Data
Authors: Shashank N. Sridhara, Birendra Kathariya, Fangjun Pu, Peng Yin, Eduardo Pavez, Antonio Ortega
2025-10-26
We introduce Region-Adaptive Learned Hierarchical Encoding (RALHE) for 3D
Gaussian Splatting (3DGS) data. While 3DGS has recently become popular for
novel view synthesis, the size of trained models limits its deployment in
bandwidth-constrained applications such as volumetric media streaming. To
address this, we propose a learned hierarchical latent representation that
builds upon the principles of "overfitted" learned image codecs (e.g.,
Cool-Chic and C3) to efficiently encode 3DGS attributes. Unlike images, 3DGS
data have irregular spatial distributions of Gaussians (geometry) and consist
of multiple attributes (signals) defined on the irregular geometry. Our codec
is designed to account for these differences between images and 3DGS.
Specifically, we leverage the octree structure of the voxelized 3DGS geometry
to obtain a hierarchical multi-resolution representation. Our approach overfits
latents to each Gaussian attribute under a global rate constraint. These
latents are decoded independently through a lightweight decoder network. To
estimate the bitrate during training, we employ an autoregressive probability
model that leverages octree-derived contexts from the 3D point structure. The
multi-resolution latents, decoder, and autoregressive entropy coding networks
are jointly optimized for each Gaussian attribute. Experiments demonstrate that
the proposed RALHE compression framework achieves a rendering PSNR gain of up
to 2dB at low bitrates (less than 1 MB) compared to the baseline 3DGS
methods.
Iterative Layer Pruning for Efficient Translation Inference
Authors: Yasmin Moslem, Muhammad Hazim Al Farouq, John D. Kelleher
2025-10-26
Large language models (LLMs) have transformed many areas of natural language
processing, including machine translation. However, efficient deployment of
LLMs remains challenging due to their intensive computational requirements. In
this paper, we address this challenge and present our submissions to the Model
Compression track at the Conference on Machine Translation (WMT 2025). In our
experiments, we investigate iterative layer pruning guided by layer importance
analysis. We evaluate this method using the Aya-Expanse-8B model for
translation from Czech to German, and from English to Egyptian Arabic. Our
approach achieves substantial reductions in model size and inference time,
while maintaining the translation quality of the baseline models.
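The iterative, importance-guided loop the abstract describes can be sketched generically. This is our stand-in, not the paper's method: `quality` abstracts whatever translation-quality evaluation drives the importance analysis, and the greedy one-layer-at-a-time rule is our simplification.

```python
# Hedged sketch of iterative layer pruning: repeatedly drop the layer
# whose removal hurts a quality metric the least, re-evaluating after
# every removal so later decisions see the already-pruned model.

def iterative_layer_prune(layers, quality, n_remove):
    """layers: list of layer ids; quality: callable on a layer subset."""
    layers = list(layers)
    for _ in range(n_remove):
        # the "least important" layer is the one whose removal leaves
        # the highest remaining quality
        victim = max(layers, key=lambda l: quality([x for x in layers if x != l]))
        layers.remove(victim)
    return layers
```

Re-evaluating after each removal is what makes the procedure iterative rather than one-shot: a layer that looks redundant in the full model may become important once its neighbors are gone.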
Beyond Semantics How Temporal Biases Shape Retrieval in Transformer and State-Space Models
Authors: Anooshka Bajaj, Deven Mahesh Mistry, Sahaj Singh Maini, Yash Aggarwal, Zoran Tiganj
2025-10-26
In-context learning is governed by both temporal and semantic relationships,
shaping how Large Language Models (LLMs) retrieve contextual information.
Analogous to human episodic memory, where the retrieval of specific events is
enabled by separating events that happened at different times, this work probes
the ability of various pretrained LLMs, including transformer and state-space
models, to differentiate and retrieve temporally separated events.
Specifically, we prompted models with sequences containing multiple
presentations of the same token, which reappears at the sequence end. By fixing
the positions of these repeated tokens and permuting all others, we removed
semantic confounds and isolated temporal effects on next-token prediction.
Across diverse sequences, models consistently placed the highest probabilities
on tokens following a repeated token, but with a notable bias for those nearest
the beginning or end of the input. An ablation experiment linked this
phenomenon in transformers to induction heads. Extending the analysis to unique
semantic contexts with partial repetition further demonstrated that memories
embedded in the middle of a prompt are retrieved less reliably. Despite
architectural differences, state-space and transformer models showed comparable
temporal biases. Our findings deepen the understanding of temporal biases in
in-context learning and offer an illustration of how these biases can enable
temporal separation and episodic retrieval.
Rule-Based Explanations for Retrieval-Augmented LLM Systems
Authors: Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jarek Szlichta
2025-10-26
If-then rules are widely used to explain machine learning models; e.g., "if
employed = no, then loan application = rejected." We present the first proposal
to apply rules to explain the emerging class of large language models (LLMs)
with retrieval-augmented generation (RAG). Since RAG enables LLM systems to
incorporate retrieved information sources at inference time, rules linking the
presence or absence of sources can explain output provenance; e.g., "if a Times
Higher Education ranking article is retrieved, then the LLM ranks Oxford
first." To generate such rules, a brute force approach would probe the LLM with
all source combinations and check if the presence or absence of any sources
leads to the same output. We propose optimizations to speed up rule generation,
inspired by Apriori-like pruning from frequent itemset mining but redefined
within the scope of our novel problem. We conclude with qualitative and
quantitative experiments demonstrating our solutions' value and efficiency.
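The brute-force baseline with Apriori-style pruning can be sketched as follows. This is our reading of the setup, not the paper's algorithm: `query_llm` is a stand-in for probing the RAG system with a subset of sources, faked below by a deterministic function.

```python
from itertools import combinations

# Hedged sketch: search source subsets smallest-first for minimal
# if-then rules ("if these sources are retrieved, the output is O"),
# pruning any superset of an already-found rule, since it cannot be a
# *minimal* rule (the anti-monotone idea borrowed from Apriori).

def find_minimal_rules(sources, query_llm, target_output):
    rules = []
    for size in range(1, len(sources) + 1):
        for subset in combinations(sources, size):
            if any(set(rule) <= set(subset) for rule in rules):
                continue  # superset of a known rule: pruned, no LLM call
            if query_llm(subset) == target_output:
                rules.append(subset)
    return rules
```

The saving comes entirely from the `continue` branch: every pruned subset is one fewer LLM probe, which is where the brute-force approach's cost lives.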
Transformers from Compressed Representations
Authors: Juan C. Leon Alcazar, Mattia Soldan, Mohammad Saatialsoruji, Alejandro Pardo, Hani Itani, Juan Camilo Perez, Bernard Ghanem
2025-10-26
Compressed file formats are the cornerstone of efficient data storage and
transmission, yet their potential for representation learning remains largely
underexplored. We introduce TEMPEST (TransformErs froM comPressed
rEpreSenTations), a method that exploits the inherent byte-stream structure of
compressed files to design an effective tokenization and encoding strategy. By
leveraging this compact encoding, a standard transformer can directly learn
semantic representations from compressed data streams, bypassing the need for
raw byte-level processing or full media decoding. Our proposal substantially
reduces the number of tokens required for semantic classification, thereby
lowering both computational complexity and memory usage. Through extensive
experiments across diverse datasets, coding schemes, and modalities, we show
that TEMPEST achieves accuracy competitive with the state-of-the-art while
delivering efficiency gains in memory and compute.
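The token-reduction effect is easy to demonstrate on any compressed byte stream. This is our guess at the spirit of the pipeline, not TEMPEST's tokenizer: zlib stands in for whatever coding scheme is used, and treating each compressed byte as a token id is our simplification.

```python
import zlib

# Illustrative sketch: compress raw bytes, then tokenize the compressed
# byte stream directly (one token id per byte). The model then sees far
# fewer tokens than raw byte-level processing would require.

def byte_tokens(data):
    compressed = zlib.compress(data)
    return list(compressed)  # each byte is a token id in [0, 256)
```

On redundant inputs the compressed stream, and hence the token sequence, is dramatically shorter than the raw data, which is the efficiency gain the abstract claims.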
TVMC Time-Varying Mesh Compression via Multi-Stage Anchor Mesh Generation
Authors: He Huang, Qi Yang, Yiling Xu, Zhu Li, Jenq-Neng Hwang
2025-10-26
Time-varying meshes, characterized by dynamic connectivity and varying vertex
counts, hold significant promise for applications such as augmented reality.
However, their practical utilization remains challenging due to the substantial
data volume required for high-fidelity representation. While various
compression methods attempt to leverage temporal redundancy between consecutive
mesh frames, most struggle with topological inconsistency and motion-induced
artifacts. To address these issues, we propose Time-Varying Mesh Compression
(TVMC), a novel framework built on multi-stage coarse-to-fine anchor mesh
generation for inter-frame prediction. Specifically, the anchor mesh is
progressively constructed in three stages: initial, coarse, and fine. The
initial anchor mesh is obtained through fast topology alignment to exploit
temporal coherence. A Kalman filter-based motion estimation module then
generates a coarse anchor mesh by accurately compensating inter-frame motions.
Subsequently, a Quadric Error Metric-based refinement step optimizes vertex
positions to form a fine anchor mesh with improved geometric fidelity. Based on
the refined anchor mesh, the inter-frame motions relative to the reference base
mesh are encoded, while the residual displacements between the subdivided fine
anchor mesh and the input mesh are adaptively quantized and compressed. This
hierarchical strategy preserves consistent connectivity and high-quality
surface approximation, while achieving an efficient and compact representation
of dynamic geometry. Extensive experiments on standard MPEG dynamic mesh
sequences demonstrate that TVMC achieves state-of-the-art compression
performance. Compared to the latest V-DMC standard, it delivers a significant
BD-rate gain of 10.2% ~ 16.9%, while preserving high reconstruction quality.
The code is available at https://github.com/H-Huang774/TVMC.
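The Kalman-filter motion estimation step can be illustrated with the standard constant-velocity filter on one scalar coordinate. This is a textbook toy, not TVMC's module: the noise parameters `q` and `r` are invented, and real vertex motion would be filtered per coordinate in 3D.

```python
# Toy constant-velocity Kalman filter (dt = 1) for one coordinate of a
# vertex tracked across frames; the smoothed positions are what an
# inter-frame motion-compensation step could build on.

def kalman_track(zs, q=1e-3, r=1e-2):
    x, v = zs[0], 0.0                       # state: position, velocity
    p = [[1.0, 0.0], [0.0, 1.0]]            # state covariance
    est = [x]
    for z in zs[1:]:
        # predict: x' = x + v; covariance pushed through F = [[1,1],[0,1]]
        x = x + v
        p = [[p[0][0] + p[0][1] + p[1][0] + p[1][1] + q, p[0][1] + p[1][1]],
             [p[1][0] + p[1][1], p[1][1] + q]]
        # update with position measurement z (H = [1, 0])
        s = p[0][0] + r
        k0, k1 = p[0][0] / s, p[1][0] / s
        innov = z - x
        x, v = x + k0 * innov, v + k1 * innov
        p = [[(1 - k0) * p[0][0], (1 - k0) * p[0][1]],
             [p[1][0] - k1 * p[0][0], p[1][1] - k1 * p[0][1]]]
        est.append(x)
    return est
```

Because the filter carries a velocity state, it extrapolates smooth motion between frames instead of chasing each noisy measurement, which is what makes Kalman-style estimates useful for motion compensation.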
AI-Driven Carbon Monitoring Transformer-Based Reconstruction of Atmospheric CO2 in Canadian Poultry Regions
Authors: Padmanabhan Jagannathan Prajesh, Kaliaperumal Ragunath, Miriam Gordon, Bruce Rathgeber, Suresh Neethirajan
2025-10-26
Accurate mapping of column-averaged CO2 (XCO2) over agricultural landscapes
is essential for guiding emission mitigation strategies. We present a
Spatiotemporal Vision Transformer with Wavelets (ST-ViWT) framework that
reconstructs continuous, uncertainty-quantified XCO2 fields from OCO-2 across
southern Canada, emphasizing poultry-intensive regions. The model fuses wavelet
time-frequency representations with attention over meteorology,
vegetation indices, topography, and land cover. On 2024 OCO-2 data, ST-ViWT
attains R2 = 0.984 and RMSE = 0.468 ppm; 92.3 percent of gap-filled predictions
lie within +/-1 ppm. Independent validation with TCCON shows robust
generalization (bias = -0.14 ppm; r = 0.928), including faithful reproduction
of the late-summer drawdown. Spatial analysis across 14 poultry regions reveals
a moderate positive association between facility density and XCO2 (r = 0.43);
high-density areas exhibit larger seasonal amplitudes (9.57 ppm) and enhanced
summer variability. Compared with conventional interpolation and standard
machine-learning baselines, ST-ViWT yields seamless 0.25 degree CO2 surfaces
with explicit uncertainties, enabling year-round coverage despite sparse
observations. The approach supports integration of satellite constraints with
national inventories and precision livestock platforms to benchmark emissions,
refine region-specific factors, and verify interventions. Importantly,
transformer-based Earth observation enables scalable, transparent, spatially
explicit carbon accounting, hotspot prioritization, and policy-relevant
mitigation assessment.
SABlock Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size
Authors: Jinhan Chen, Jianchun Liu, Hongli Xu, Xianjun Gao, Shilong Wang
2025-10-26
The growing memory footprint of the Key-Value (KV) cache poses a severe
scalability bottleneck for long-context Large Language Model (LLM) inference.
While KV cache eviction has emerged as an effective solution by discarding less
critical tokens, existing token-, block-, and sentence-level compression
methods struggle to balance semantic coherence and memory efficiency. To this
end, we introduce SABlock, a \underline{s}emantic-aware KV cache eviction
framework with \underline{a}daptive \underline{block} sizes. Specifically,
SABlock first performs semantic segmentation to align compression boundaries
with linguistic structures, then applies segment-guided token scoring to refine
token importance estimation. Finally, for each segment, a budget-driven search
strategy adaptively determines the optimal block size that preserves semantic
integrity while improving compression efficiency under a given cache budget.
Extensive experiments on long-context benchmarks demonstrate that SABlock
consistently outperforms state-of-the-art baselines under the same memory
budgets. For instance, on Needle-in-a-Haystack (NIAH), SABlock achieves 99.9%
retrieval accuracy with only 96 KV entries, nearly matching the performance of
the full-cache baseline that retains up to 8K entries. Under a fixed cache
budget of 1,024, SABlock further reduces peak memory usage by 46.28% and
achieves up to 9.5x faster decoding on a 128K context length.
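The budget-driven block-size search can be sketched for a single segment. The details here are ours, not SABlock's: scoring a block by its mean token importance and keeping blocks greedily under the budget are illustrative simplifications.

```python
# Loose sketch: for one semantic segment with per-token importance
# scores, try candidate block sizes and keep the size whose best blocks
# retain the most importance within the segment's entry budget.

def best_block_size(scores, budget, candidates=(1, 2, 4, 8)):
    best_size, best_kept = None, -1.0
    for b in candidates:
        blocks = [scores[i:i + b] for i in range(0, len(scores), b)]
        ranked = sorted(blocks, key=lambda blk: sum(blk) / len(blk), reverse=True)
        kept, used = 0.0, 0
        for blk in ranked:              # greedily keep the best blocks
            if used + len(blk) > budget:
                break
            kept += sum(blk)
            used += len(blk)
        if kept > best_kept:
            best_size, best_kept = b, kept
    return best_size, best_kept
```

Larger blocks evict in semantically coherent chunks but waste budget on low-importance tokens inside a kept block; smaller blocks retain importance more precisely but fragment the segment, and the search trades these off per segment.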
AesCrop Aesthetic-driven Cropping Guided by Composition
Authors: Yen-Hong Wong, Lai-Kuan Wong
2025-10-26
Aesthetic-driven image cropping is crucial for applications like view
recommendation and thumbnail generation, where visual appeal significantly
impacts user engagement. A key factor in visual appeal is composition--the
deliberate arrangement of elements within an image. Some methods have
successfully incorporated compositional knowledge through evaluation-based and
regression-based paradigms. However, evaluation-based methods lack globality
while regression-based methods lack diversity. Recently, hybrid approaches that
integrate both paradigms have emerged, bridging the gap between these two to
achieve better diversity and globality. Notably, existing hybrid methods do not
incorporate photographic composition guidance, a key attribute that defines
photographic aesthetics. In this work, we introduce AesCrop, a
composition-aware hybrid image-cropping model that integrates a VMamba image
encoder, augmented with a novel Mamba Composition Attention Bias (MCAB), and a
decoder to perform end-to-end rank-based image cropping, generating
multiple crops along with the corresponding quality scores. By explicitly
encoding compositional cues into the attention mechanism, MCAB directs AesCrop
to focus on the most compositionally salient regions. Extensive experiments
demonstrate that AesCrop outperforms current state-of-the-art methods,
delivering superior quantitative metrics and qualitatively more pleasing crops.
Aligning Diffusion Language Models via Unpaired Preference Optimization
Authors: Vaibhav Jindal, Hejian Sang, Chun-Mao Lai, Yanning Chen, Zhipeng Wang
2025-10-26
Diffusion language models (dLLMs) are an emerging alternative to
autoregressive (AR) generators, but aligning them to human preferences is
challenging because sequence log-likelihoods are intractable and pairwise
preference data are costly to collect. We introduce ELBO-KTO, which combines an
ELBO surrogate for diffusion log-likelihoods with a prospect-theoretic,
unpaired preference objective (Kahneman Tversky Optimization, KTO). We analyze
the bias and variance induced by the ELBO substitution and employ
variance-reduction practices that stabilize gradients during training. Applied
to LLaDA-8B-Instruct, ELBO-KTO yields \textbf{65.9\%} and \textbf{62.3\%}
adjusted win rates on kto-mix-14k and UltraFeedback-Binary, respectively,
versus the base model under an automatic LLM judge. Across downstream tasks,
including GSM8K, MMLU, and additional reasoning/knowledge benchmarks, ELBO-KTO
trained on UltraFeedback-Binary performs on par with or better than the base
model under identical decoding settings. This establishes unpaired preference
optimization as a viable alternative to pairwise alignment in diffusion LLMs.
Frustratingly Easy Task-aware Pruning for Large Language Models
Authors: Yuanhe Tian, Junjie Liu, Xican Yang, Haishan Ye, Yan Song
2025-10-26
Pruning provides a practical solution to reduce the resources required to run
large language models (LLMs) to benefit from their effective capabilities as
well as control their cost for training and inference. Research on LLM pruning
often ranks the importance of LLM parameters using their magnitudes and
calibration-data activations and removes (or masks) the less important ones,
accordingly reducing LLMs' size. However, these approaches primarily focus on
preserving the LLM's ability to generate fluent sentences, while neglecting
performance on specific domains and tasks. In this paper, we propose a simple
yet effective pruning approach for LLMs that preserves task-specific
capabilities while shrinking their parameter space. We first analyze how
conventional pruning minimizes loss perturbation under general-domain
calibration and extend this formulation by incorporating task-specific feature
distributions into the importance computation of existing pruning algorithms.
Thus, our framework computes separate importance scores using both general and
task-specific calibration data, partitions parameters into shared and exclusive
groups based on activation-norm differences, and then fuses their scores to
guide the pruning process. This design enables our method to integrate
seamlessly with various foundation pruning techniques and preserve the LLM's
specialized abilities under pruning. Experiments on widely used benchmarks
demonstrate that our approach is effective and consistently outperforms the
baselines with identical pruning ratios and different settings.
CHOIR Collaborative Harmonization fOr Inference Robustness
Authors: Xiangjue Dong, Cong Wang, Maria Teleki, Millennium Bismay, James Caverlee
2025-10-26
Persona-assigned Large Language Models (LLMs) can adopt diverse roles,
enabling personalized and context-aware reasoning. However, even minor
demographic perturbations in personas, such as simple pronoun changes, can
alter reasoning trajectories, leading to divergent sets of correct answers.
Instead of treating these variations as biases to be mitigated, we explore
their potential as a constructive resource to improve reasoning robustness. We
propose CHOIR (Collaborative Harmonization fOr Inference Robustness), a
test-time framework that harmonizes multiple persona-conditioned reasoning
signals into a unified prediction. CHOIR orchestrates a collaborative decoding
process among counterfactual personas, dynamically balancing agreement and
divergence in their reasoning paths. Experiments on various reasoning
benchmarks demonstrate that CHOIR consistently enhances performance across
demographics, model architectures, scales, and tasks - without additional
training. Improvements reach up to 26.4% for individual demographic groups and
19.2% on average across five demographics. It remains effective even when base
personas are suboptimal. By reframing persona variation as a constructive
signal, CHOIR provides a scalable and generalizable approach to more reliable
reasoning.
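The harmonization step can be caricatured as confidence-weighted agreement across persona-conditioned runs. The aggregation rule here is our guess at the simplest instance of the idea, not CHOIR's actual mechanism.

```python
from collections import Counter

# Hedged sketch: collect (answer, confidence) pairs from several
# counterfactual-persona runs and return the answer with the greatest
# confidence-weighted agreement.

def harmonize(predictions):
    weight = Counter()
    for answer, conf in predictions:
        weight[answer] += conf
    return weight.most_common(1)[0][0]
```

Even this crude rule shows why persona variation can help rather than hurt: a single persona's divergent trajectory is outvoted unless it is held with much higher confidence than the consensus.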
Backward-Friendly Optimization Training Large Language Models with Approximate Gradients under Memory Constraints
Authors: Jing Yang, Kaitong Cai, Yijia Fan, Yufeng Yang, Keze Wang
2025-10-26
Full fine-tuning of Large Language Models (LLMs) is notoriously
memory-intensive, primarily because conventional optimizers such as SGD or Adam
assume access to exact gradients derived from cached activations. Existing
solutions either alter the model architecture (e.g., reversible networks) or
trade memory for computation (e.g., activation checkpointing), but the
optimizer itself remains untouched. In this work, we introduce GradLite, a
backward-friendly optimizer that relaxes the requirement of exact gradients,
enabling efficient training even when intermediate activations are aggressively
discarded or approximated. GradLite leverages two key techniques: (i) low-rank
Jacobian approximation, which reduces the dimensionality of backpropagated
error signals, and (ii) error-feedback correction, which accumulates and
compensates approximation errors across iterations to preserve convergence
guarantees. We provide a theoretical analysis showing that GradLite maintains
unbiased gradient estimates with bounded variance, ensuring convergence rates
comparable to Adam. Empirically, GradLite reduces optimizer-state and
activation memory consumption by up to 50\% without architectural changes, and
achieves on-par or superior downstream performance on reasoning (MMLU, GSM8K),
multilingual, and dialogue benchmarks compared to checkpointing and
optimizer-centric baselines (LoMo, GaLore).
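The error-feedback mechanism GradLite relies on can be shown with any lossy gradient compressor. This is a generic sketch, not GradLite itself: the top-k compressor stands in for the paper's low-rank Jacobian approximation, and the plain-SGD update is our simplification.

```python
# Minimal error-feedback sketch: compress the gradient lossily, carry
# the discarded residual forward, and add it back before the next
# compression so the approximation bias does not accumulate.

def topk_compress(grad, k):
    """Keep only the k largest-magnitude components (stand-in compressor)."""
    keep = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    out = [0.0] * len(grad)
    for i in keep:
        out[i] = grad[i]
    return out

def train_step(params, grad, error, lr=0.1, k=1):
    corrected = [g + e for g, e in zip(grad, error)]       # error feedback
    approx = topk_compress(corrected, k)
    new_error = [c - a for c, a in zip(corrected, approx)] # residual carried over
    new_params = [p - lr * a for p, a in zip(params, approx)]
    return new_params, new_error
```

Without the residual term, the smaller gradient component would never be applied; with it, the discarded signal accumulates until it wins a top-k slot, which is the intuition behind the bounded-variance convergence argument.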
GigaEmbeddings Efficient Russian Language Embedding Model
Authors: Egor Kolodin, Daria Khomich, Nikita Savushkin, Anastasia Ianina, Fyodor Minkin
2025-10-25
We introduce GigaEmbeddings, a novel framework for training high-performance
Russian-focused text embeddings through hierarchical instruction tuning of a
decoder-only LLM designed specifically for the Russian language (GigaChat-3B). Our
three-stage pipeline, comprising large-scale contrastive pre-training on
web-scale corpora, fine-tuning with hard negatives, and multitask
generalization across retrieval, classification, and clustering tasks,
addresses key limitations of existing methods by unifying diverse objectives
and leveraging synthetic data generation. Architectural innovations include
bidirectional attention for contextual modeling, latent attention pooling for
robust sequence aggregation, and strategic pruning of 25% of transformer layers
to enhance efficiency without compromising performance. Evaluated on the ruMTEB
benchmark spanning 23 multilingual tasks, GigaEmbeddings achieves
state-of-the-art results (69.1 avg. score), outperforming strong baselines with
a larger number of parameters.
The Structural Scalpel Automated Contiguous Layer Pruning for Large Language Models
Authors: Yao Lu, Yuqi Li, Wenbin Xie, Shanqing Yu, Qi Xuan, Zhaowei Zhu, Shiping Wen
2025-10-25
Although large language models (LLMs) have achieved revolutionary
breakthroughs in many fields, their large model size and high computational
cost pose significant challenges for practical deployment on
resource-constrained edge devices. To this end, layer pruning has been proposed
to reduce the computational overhead by directly removing redundant layers.
However, existing layer pruning methods typically rely on hand-crafted metrics
to evaluate and remove individual layers, while ignoring the dependencies
between layers. This can disrupt the model's information flow and severely
degrade performance. To address these issues, we propose CLP, a novel
contiguous layer pruning framework that introduces two key innovations: a
differentiable concave gate algorithm that automatically identifies the best
contiguous layer segments for pruning via gradient-based optimization; and a
cutoff endpoint tuning strategy that effectively restores model performance by
fine-tuning only the layers adjacent to the pruned segments. Extensive
experiments across multiple model architectures (including LLaMA2, LLaMA3 and
Qwen) and sizes (from B to B parameters) show that CLP significantly
outperforms existing state-of-the-art baselines. For example, at a pruning rate
of , CLP achieves an average performance retention of on
LLaMA3-70B, outperforming baselines by -. Furthermore, CLP can
be seamlessly combined with quantization to further compress the model with
only a slight performance loss.
Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders
Authors: Mengyu Ye, Jun Suzuki, Tatsuro Inaba, Tatsuki Kuribayashi
2025-10-25
Recent interpretability work on large language models (LLMs) has been
increasingly dominated by a feature-discovery approach with the help of proxy
modules. Then, the quality of features learned by, e.g., sparse autoencoders
(SAEs), is evaluated. This paradigm naturally raises a critical question: do
(SAEs), is evaluated. This paradigm naturally raises a critical question: do
such learned features have better properties than those already represented
within the original model parameters? Unfortunately, only a few studies
have made such comparisons systematically so far. In this work, we revisit the
interpretability of feature vectors stored in feed-forward (FF) layers, given
the perspective of FF as key-value memories, with modern interpretability
benchmarks. Our extensive evaluation revealed that SAEs and FFs exhibit a
similar range of interpretability, although SAEs displayed an observable but
minimal improvement in some aspects. Furthermore, in certain aspects,
surprisingly, even vanilla FFs yielded better interpretability than the SAEs,
and features discovered in SAEs and FFs diverged. These findings raise questions about
the advantage of SAEs from both perspectives of feature quality and
faithfulness, compared to directly interpreting FF feature vectors, and suggest
that FF key-value parameters serve as a strong baseline in modern interpretability
research.
Efficient Low Rank Attention for Long-Context Inference in Large Language Models
Authors: Tenghui Li, Guoxu Zhou, Xuyang Zhao, Yuning Qiu, Qibin Zhao
2025-10-25
As the length of input text grows, the key-value (KV) cache in LLMs imposes
prohibitive GPU memory costs and limits long-context inference on
resource-constrained devices. Existing approaches, such as quantization and eviction,
reduce memory usage but suffer from numerical precision loss or suboptimal
retention of key-value pairs. We introduce Low Rank Query and Key attention
(LRQK), a two-stage framework that jointly decomposes the full-precision query
and key matrices into compact rank-r factors during the prefill stage, and
then uses these low-dimensional projections to compute proxy attention scores
at each decoding step. By selecting only the
top-k tokens and a small fixed set of recent tokens, LRQK employs a mixed
GPU-CPU cache with a hit-and-miss mechanism that transfers only missing
full-precision KV pairs, thereby preserving exact attention outputs while
reducing CPU-GPU data movement. Extensive experiments on the RULER and
LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK
matches or surpasses leading sparse-attention methods in long context settings,
while delivering significant memory savings with minimal loss in accuracy. Our
code is available at https://github.com/tenghuilee/LRQK.
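The proxy-then-exact selection step can be sketched with plain lists. This is our simplification of the idea, not LRQK's code: the projection matrix here is a fixed toy rather than a factor obtained by decomposing Q and K, and `select_tokens`/`exact_attention` are our own helper names.

```python
import math

# Illustrative sketch: cheap rank-r proxy scores pick which cached
# tokens (plus a recency window) deserve exact attention; everything
# else can stay off-GPU.

def project(vec, proj):                       # vec: (d,), proj: (d, r)
    return [sum(v * p[j] for v, p in zip(vec, proj)) for j in range(len(proj[0]))]

def select_tokens(query, keys, proj, top_k=2, recent=1):
    q_r = project(query, proj)
    scores = [sum(a * b for a, b in zip(q_r, project(k, proj))) for k in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:top_k]) | set(range(len(keys) - recent, len(keys)))
    return sorted(chosen)

def exact_attention(query, keys, values, idx):
    """Full-precision softmax attention restricted to the selected tokens."""
    logits = [sum(a * b for a, b in zip(query, keys[i])) for i in idx]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [sum(wj * values[i][c] for wj, i in zip(w, idx)) / z
            for c in range(len(values[0]))]
```

Only the selected indices ever need their full-precision KV pairs resident on the GPU, which is what the hit-and-miss cache exploits.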
PACR Progressively Ascending Confidence Reward for LLM Reasoning
Authors: Eunseop Yoon, Hee Suk Yoon, Jaehyun Jang, SooHwan Eom, Qi Dai, Chong Luo, Mark A. Hasegawa-Johnson, Chang D. Yoo
2025-10-25
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly
improved LLM reasoning, but its sparse, outcome-based reward provides no
guidance for intermediate steps, slowing exploration. We propose Progressively
Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed
directly from the model's evolving belief in the correct answer. PACR encodes
the inductive bias that, along a well-formed reasoning trajectory, the
probability of the ground-truth answer should have a generally ascending trend.
We provide empirical and theoretical analysis validating that such an inductive
bias constrains the exploration search space to regions richer in logically
sound reasoning. We demonstrate that PACR accelerates exploration, reaches
reward saturation with fewer trajectories, and yields improvements on multiple
benchmarks. Our results suggest that dense, model-intrinsic shaping signals can
make RLVR training more effective and reliable.
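The shaping signal described above can be sketched as a per-step reward equal to the change in the model's belief in the ground-truth answer. This is a hedged illustration of the inductive bias, not PACR's exact formulation:

```python
def pacr_reward(p_answer):
    """Hypothetical PACR-style dense reward: reward each reasoning step by
    the change in the model's probability of the ground-truth answer, so
    trajectories with generally ascending confidence are reinforced."""
    return [p_answer[t] - p_answer[t - 1] for t in range(1, len(p_answer))]

# A well-formed trajectory: confidence in the correct answer generally rises.
good = [0.10, 0.25, 0.40, 0.70, 0.95]
bad  = [0.40, 0.35, 0.20, 0.15, 0.10]
print(sum(pacr_reward(good)) > 0)   # net positive shaping signal
print(sum(pacr_reward(bad)) < 0)    # descending confidence is penalized
```

Unlike an outcome-only reward, every intermediate step receives a signal, which is what lets the method constrain exploration before the final answer is produced.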
Synthetic-to-Real Transfer Learning for Chromatin-Sensitive PWS Microscopy
Authors: Jahidul Arafat, Sanjaya Poudel
2025-10-25
Chromatin-sensitive partial wave spectroscopic (csPWS) microscopy enables
label-free detection of nanoscale chromatin packing alterations that occur
before visible cellular transformation. However, manual nuclear segmentation
limits the population-scale analysis needed for biomarker discovery in early
cancer detection. The lack of annotated csPWS imaging data prevents direct use
of standard deep learning methods. We present CFU Net, a hierarchical
segmentation architecture trained with a three-stage curriculum on synthetic
multimodal data. CFU Net achieves near-perfect performance on held-out synthetic
test data representing diverse spectroscopic imaging conditions, without manual
annotations (Dice 0.9879, IoU 0.9895). Our approach uses physics-based
rendering that incorporates empirically supported chromatin packing statistics,
Mie scattering models, and modality-specific noise, combined with a curriculum
that progresses from adversarial RGB pretraining to spectroscopic fine-tuning
and histology validation. CFU Net integrates five architectural elements
(ConvNeXt backbone, Feature Pyramid Network, UNet++ dense connections,
dual attention, and deep supervision) that together improve Dice over a
baseline UNet by 8.3 percent. We demonstrate deployment-ready INT8 quantization
with a 74.9 percent model-size reduction and 0.15-second inference, giving a
240-times
throughput gain over manual analysis. Applied to more than ten thousand
automatically segmented nuclei from synthetic test data, the pipeline extracts
chromatin biomarkers that distinguish normal from pre-cancerous tissue with
large effect sizes (Cohen's d between 1.31 and 2.98), reaching 94 percent
classification accuracy. This work provides a general framework for
synthetic-to-real transfer learning in specialized microscopy and open resources
for community validation on clinical specimens.
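The Dice and IoU scores reported above are the standard overlap metrics for binary segmentation masks; a minimal sketch (the function name and toy masks are ours, not the paper's):

```python
def dice_iou(pred, target):
    """Dice and IoU for binary masks given as flat 0/1 lists."""
    inter = sum(p * t for p, t in zip(pred, target))
    ps, ts = sum(pred), sum(target)
    dice = 2 * inter / (ps + ts)          # 2|A∩B| / (|A| + |B|)
    iou = inter / (ps + ts - inter)       # |A∩B| / |A∪B|
    return dice, iou

pred   = [1, 1, 1, 0, 0, 1]
target = [1, 1, 0, 0, 1, 1]
d, i = dice_iou(pred, target)
print(round(d, 4), round(i, 4))   # 0.75 0.6
```

Dice always equals or exceeds IoU for the same masks, which is worth keeping in mind when comparing the two numbers.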
When Fewer Layers Break More Chains Layer Pruning Harms Test-Time Scaling in LLMs
Authors: Keyu Wang, Tian Lyu, Guinan Su, Jonas Geiping, Lu Yin, Marco Canini, Shiwei Liu
2025-10-25
Layer pruning has emerged as a widely adopted technique for improving the
efficiency of large language models (LLMs). Although existing methods
demonstrate strong performance retention on general knowledge tasks, their
effect on long-chain reasoning, a more brittle yet crucial capability, remains
largely unexplored. In this work, we study the impact of layer pruning on
long-chain reasoning through the lens of test-time scaling, a key mechanism in
modern LLMs that enables strong reasoning capacity by allocating more
computation at inference time. With extensive experiments, we demonstrate that
pruning even one or two layers can severely impair test-time scaling, with
performance collapsing drastically on long reasoning benchmarks even when
performance on knowledge-intensive and shallow reasoning tasks remains stable.
Furthermore, we find that standard supervised fine-tuning remedies fail to
recover test-time scaling once it has deteriorated. Through in-depth analyses,
we identify the mechanisms underlying this fragility of test-time scaling and
highlight the fundamental risks of applying layer pruning to
reasoning-intensive LLMs. These findings call for a rethinking of layer pruning
strategies and provide insights for developing methods that preserve the
robustness of reasoning. We open-source the codebase at
https://github.com/keyu-wang-2002/Layer-Pruning-Harms-Inference-Scaling.
TrajGATFormer A Graph-Based Transformer Approach for Worker and Obstacle Trajectory Prediction in Off-site Construction Environments
Authors: Mohammed Alduais, Xinming Li, Qipei Mei
2025-10-25
As the demand grows within the construction industry for processes that are
not only faster but also safer and more efficient, offsite construction has
emerged as a solution, though it brings new safety risks due to the close
interaction between workers, machinery, and moving obstacles. Predicting the
future trajectories of workers and taking into account social and environmental
factors is a crucial step for developing collision-avoidance systems to
mitigate such risks. Traditional methods often struggle to adapt to the dynamic
and unpredictable nature of construction environments. Many rely on simplified
assumptions or require hand-crafted features, limiting their ability to respond
to complex, real-time interactions between workers and moving obstacles. While
recent data-driven methods have improved the modeling of temporal patterns,
they still face challenges in capturing long-term behavior and accounting for
the spatial and social context crucial to collision risk assessment. To address
these limitations, this paper proposes a framework integrating YOLOv10n and
DeepSORT for precise detection and tracking, along with two novel trajectory
prediction models: TrajGATFormer and TrajGATFormer-Obstacle. YOLOv10n serves as
the backbone for object detection, accurately identifying workers and obstacles
in diverse scenes, while DeepSORT efficiently tracks them over time with unique
IDs for continuity. Both models employ an encoder-decoder Transformer with
Graph Attention Networks (GAT) to capture temporal and spatial interactions.
TrajGATFormer predicts worker trajectories with an ADE of 1.25 m and FDE of 2.3
m over a 4.8 s horizon, while TrajGATFormer-Obstacle extends prediction to both
workers and obstacles, achieving higher accuracy (ADE 1.15 m, FDE 2.2 m).
Comparative analysis shows both models outperform traditional methods, reducing
ADE and FDE by up to 35% and 38%, respectively.
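The ADE and FDE figures quoted above are the standard trajectory-prediction metrics: mean displacement over the horizon and displacement at the final step. A minimal sketch (toy trajectories, not the paper's data):

```python
import math

def ade_fde(pred, truth):
    """Average and Final Displacement Error for a predicted trajectory,
    where pred and truth are equal-length lists of (x, y) points."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]

pred  = [(0, 0), (1, 0), (2, 0), (3, 0)]
truth = [(0, 0), (1, 1), (2, 0), (3, 2)]
ade, fde = ade_fde(pred, truth)
print(ade, fde)   # 0.75 2.0
```

FDE isolates long-horizon drift, which is why papers report it alongside the averaged ADE.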
Surface Reading LLMs Synthetic Text and its Styles
Authors: Hannes Bajohr
2025-10-25
Despite a potential plateau in ML advancement, the societal impact of large
language models lies not in approaching superintelligence but in generating
text surfaces indistinguishable from human writing. While Critical AI Studies
provides essential material and socio-technical critique, it risks overlooking
how LLMs phenomenologically reshape meaning-making. This paper proposes a
semiotics of "surface integrity" as attending to the immediate plane where LLMs
inscribe themselves into human communication. I distinguish three knowledge
interests in ML research (epistemology, epistēmē, and epistemics) and argue
for integrating surface-level stylistic analysis alongside depth-oriented
critique. Through two case studies examining stylistic markers of synthetic
text, I show how attending to style as a semiotic phenomenon reveals LLMs as
cultural actors that transform the conditions of meaning emergence and
circulation in contemporary discourse, independent of questions about machine
consciousness.
Edit Less, Achieve More Dynamic Sparse Neuron Masking for Lifelong Knowledge Editing in LLMs
Authors: Jinzhe Liu, Junshu Sun, Shufan Shen, Chenxue Yang, Shuhui Wang
2025-10-25
Lifelong knowledge editing enables continuous, precise updates to outdated
knowledge in large language models (LLMs) without computationally expensive
full retraining. However, existing methods often accumulate errors throughout
the editing process, causing a gradual decline in both editing accuracy and
generalization. To tackle this problem, we propose Neuron-Specific Masked
Knowledge Editing (NMKE), a novel fine-grained editing framework that combines
neuron-level attribution with dynamic sparse masking. Leveraging neuron
functional attribution, we identify two key types of knowledge neurons:
knowledge-general neurons, which activate consistently across prompts, and
knowledge-specific neurons, which activate only for specific prompts. NMKE
further introduces an entropy-guided dynamic sparse mask that locates the
neurons relevant to the target knowledge. This strategy enables precise
neuron-level knowledge
editing with fewer parameter modifications. Experimental results from thousands
of sequential edits demonstrate that NMKE outperforms existing methods in
maintaining high editing success rates and preserving general model
capabilities in lifelong editing.
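One way to read "entropy-guided dynamic sparse mask" is that the entropy of the attribution distribution sets how many neurons get edited. The sketch below is our hypothetical interpretation (function name, exp-of-entropy mask size, and toy scores are assumptions), not NMKE's actual rule:

```python
import numpy as np

def entropy_guided_mask(attrib):
    """Hypothetical entropy-guided sparse neuron mask: normalize attribution
    scores, then size the mask by the effective number of contributing
    neurons exp(H) -- peaked attributions (low entropy) edit fewer neurons."""
    p = np.abs(attrib) / np.abs(attrib).sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    k = max(1, int(round(np.exp(entropy))))   # effective support size
    mask = np.zeros_like(p, dtype=bool)
    mask[np.argsort(p)[::-1][:k]] = True      # keep the k most attributed
    return mask

peaked  = np.array([10.0, 0.1, 0.1, 0.1, 0.1])   # one dominant neuron
uniform = np.array([1.0, 1.0, 1.0, 1.0, 1.0])    # evenly spread attribution
print(entropy_guided_mask(peaked).sum(), entropy_guided_mask(uniform).sum())
```

The point of such a rule is the abstract's "fewer parameter modifications": when attribution is concentrated, the edit touches only the neurons that actually carry the fact.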
Scaling Up Efficient Small Language Models Serving and Deployment for Semantic Job Search
Authors: Kayhan Behdin, Qingquan Song, Sriram Vasudevan, Jian Sheng, Xiaojing Ma, Z Zhou, Chuanrui Zhu, Guoyao Li, Chanh Nguyen, Sayan Ghosh, Hejian Sang, Ata Fatahi Baarzi, Sundara Raman Ramachandran, Xiaoqing Wang, Qing Lan, Vinay Y S, Qi Guo, Caleb Johnson, Zhipeng Wang, Fedor Borisyuk
2025-10-25
Large Language Models (LLMs) have demonstrated impressive quality when
applied to predictive tasks such as relevance ranking and semantic search.
However, deployment of such LLMs remains prohibitively expensive for industry
applications with strict latency and throughput requirements. In this work, we
present lessons and efficiency insights from developing a purely text-based
decoder-only Small Language Model (SLM) for a semantic search application at
LinkedIn. In particular, we discuss model compression techniques, such as
quantization and distillation, that allow us to substantially reduce the model
size while maintaining accuracy. Additionally, we present context compression
techniques that greatly reduce the input context length with minimal loss of
accuracy. Finally, we present practical lessons from optimizing the serving
infrastructure for deploying such a system on GPUs at scale, serving millions
of requests per second. Taken together, these optimizations increase our
system's throughput severalfold in a real-world deployment, while meeting our
quality bar.
Generalization or Memorization Dynamic Decoding for Mode Steering
Authors: Xuanming Zhang
2025-10-25
Large Language Models (LLMs) exhibit a troubling duality, capable of both
remarkable generalization and brittle, verbatim memorization of their training
data. This unpredictability undermines their reliability in high-stakes
applications. In this work, we propose a unified framework to understand,
identify, and control these distinct reasoning modes. First, we introduce a
theoretical model based on the Information Bottleneck (IB) principle,
formalizing generalization as the learning of a compressed, task-relevant
representation and memorization as a failure to compress. Building on this
theory, we develop Dynamic Mode Steering (DMS), a novel inference-time
algorithm which comprises two components: (1) a lightweight, causally-grounded
linear probe that identifies the model's instantaneous reliance on
memorization, and (2) a dynamic activation steering mechanism that nudges the
model's computation towards pre-identified generalization circuits. We frame
DMS as a form of adaptive, self-contrastive decoding. Experiments on reasoning
and faithfulness tasks demonstrate that DMS significantly improves logical
consistency and factual accuracy, thereby offering a principled approach to
enhancing LLM reliability.
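The steering component can be sketched as adding a fixed "generalization" direction to the hidden state, scaled by the probe's memorization score. This is a hedged illustration of activation steering in general (names, scaling rule, and vectors are ours), not DMS's actual circuits or probe:

```python
import numpy as np

def steer(hidden, direction, prob_memorize, alpha=2.0):
    """Hypothetical DMS-style steering: when a probe reports reliance on
    memorization, nudge the hidden state along a generalization direction,
    scaled by the probe's confidence."""
    d = direction / np.linalg.norm(direction)   # unit steering vector
    return hidden + alpha * prob_memorize * d

h = np.zeros(4)
g_dir = np.array([1.0, 0.0, 0.0, 0.0])
low  = steer(h, g_dir, prob_memorize=0.1)   # weak intervention
high = steer(h, g_dir, prob_memorize=0.9)   # strong intervention
print(np.linalg.norm(high - h) > np.linalg.norm(low - h))
```

Making the intervention proportional to the probe's output is what makes the scheme "dynamic": computation is only redirected when memorization is actually detected.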
Embracing Trustworthy Brain-Agent Collaboration as Paradigm Extension for Intelligent Assistive Technologies
Authors: Yankai Chen, Xinni Zhang, Yifei Zhang, Yangning Li, Henry Peng Zou, Chunyu Miao, Weizhi Zhang, Xue Liu, Philip S. Yu
2025-10-25
Brain-Computer Interfaces (BCIs) offer a direct pathway between
the human brain and external devices, holding significant promise for
individuals with severe neurological impairments. However, their widespread
adoption is hindered by critical limitations, such as low information transfer
rates and extensive user-specific calibration. To overcome these challenges,
recent research has explored the integration of Large Language Models (LLMs),
extending the focus from simple command decoding to understanding complex
cognitive states. Despite these advancements, deploying agentic AI faces
technical hurdles and ethical concerns. Due to the lack of comprehensive
discussion on this emerging direction, this position paper argues that the
field is poised for a paradigm extension from BCI to Brain-Agent Collaboration
(BAC). We emphasize reframing agents as active and collaborative partners for
intelligent assistance rather than passive brain signal data processors,
demanding a focus on ethical data handling, model reliability, and a robust
human-agent collaboration framework to ensure these systems are safe,
trustworthy, and effective.
Compositional Bias Control in Large Language Models Preference Learning Fails, Supervision Succeeds
Authors: Atij Mahesh
2025-10-24
Large Language Models (LLMs) still produce gender-stereotyped language even
in occupation-neutral contexts that reflect deep societal biases (Rudinger et
al., 2018). To address this, prior work has proposed prompting, constrained
decoding (Dathathri et al., 2020; Zhou et al., 2024), post-processing, and
fine-tuning-based alignment (Rafailov et al., 2023; Ravfogel et al., 2022).
However, the comparative efficacy and learning dynamics of these methods remain
little understood. We report a comparative analysis of six control techniques
for bias mitigation: prompt-only, generate-and-filter, DFA-based Ctrl-G
decoding,
Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and
Iterative Nullspace Projection (INLP). We evaluate each method on a
compositional constraint task. This task requires generating sentences that
contain at least one agentic and one communal descriptor for each of the twenty
Winogender-derived occupations. We quantify trade-offs between control strength
and naturalness with evaluations of constraint compliance, lexical diversity,
and fluency. Our results reveal key contrasts among the methods: SFT achieves
99.87 +- 0.15% compliance and high lexical diversity, while DPO, despite
similar training stability, fails at 4.53 +- 0.82%. Ctrl-G guarantees perfect
compliance, but at the cost of severely reduced fluency and diversity.
Preference-based learning fundamentally differs: it cannot satisfy
compositional constraints, as binary preference signals encode ranking, not
logical conjunctions. Only explicit positive supervision enables mitigation of
compositional biases; preference-based alignment fails to generalize logical
structures, underscoring the limitations of preference learning and the
necessity of explicit supervision for fair and fluent controlled generation.
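The compositional constraint itself is easy to check mechanically: a generation complies only if it contains at least one agentic and one communal descriptor. The word lists below are illustrative stand-ins, not the paper's actual lexicons:

```python
# Illustrative descriptor lexicons (the paper's actual lists may differ).
AGENTIC  = {"assertive", "decisive", "ambitious", "confident"}
COMMUNAL = {"supportive", "caring", "warm", "helpful"}

def complies(sentence):
    """The compositional constraint: at least one agentic AND one
    communal descriptor must appear in the sentence."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    return bool(words & AGENTIC) and bool(words & COMMUNAL)

print(complies("The engineer was decisive and caring."))      # True
print(complies("The engineer was decisive and ambitious."))   # False
```

The conjunction is the crux of the paper's finding: a binary preference signal can rank one sentence over another, but it cannot express "both conditions must hold", which is why DPO fails where SFT succeeds.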
Pruning and Quantization Impact on Graph Neural Networks
Authors: Khatoon Khedri, Reza Rawassizadeh, Qifu Wen, Mehdi Hosseinzadeh
2025-10-24
Graph neural networks (GNNs) are known to operate with high accuracy on
learning from graph-structured data, but they suffer from high computational
and resource costs. Neural network compression methods are used to reduce the
model size while maintaining reasonable accuracy. Two of the most common neural
network compression techniques are pruning and quantization. In this research,
we empirically examine the effects of three pruning methods and three
quantization methods on different GNN models across graph classification, node
classification, and link prediction tasks. We conducted all
experiments on three graph datasets, including Cora, Proteins, and BBBP. Our
findings demonstrate that unstructured fine-grained and global pruning can
significantly reduce the model's size (by 50%) while maintaining or even
improving precision after fine-tuning the pruned model. The evaluation of
different quantization methods on GNNs shows diverse impacts on accuracy,
inference time, and model size across different datasets.
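Global unstructured pruning, one of the techniques evaluated above, ranks weights by magnitude across all layers jointly and zeroes the smallest. A minimal sketch with toy weight matrices (not tied to any GNN library):

```python
import numpy as np

def global_magnitude_prune(weights, sparsity=0.5):
    """Global unstructured pruning: zero the smallest-magnitude weights
    across all layers jointly until the target sparsity is reached."""
    flat = np.concatenate([w.ravel() for w in weights])
    k = int(sparsity * flat.size)
    # Magnitude threshold below which (inclusive) weights are removed.
    thresh = np.sort(np.abs(flat))[k - 1] if k > 0 else -np.inf
    return [np.where(np.abs(w) <= thresh, 0.0, w) for w in weights]

w1 = np.array([[0.1, -2.0], [0.3, 1.5]])
w2 = np.array([[-0.05, 0.9], [2.5, -0.2]])
pruned = global_magnitude_prune([w1, w2], sparsity=0.5)
kept = sum((p != 0).sum() for p in pruned)
print(kept)   # 4 of 8 weights survive
```

Because the threshold is shared across layers, layers with uniformly small weights lose more parameters than important ones, which is what distinguishes global from per-layer pruning.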
Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative Recommenders
Authors: Zhimin Chen, Chenyu Zhao, Ka Chun Mo, Yunjiang Jiang, Jane H. Lee, Shouwei Chen, Khushhall Chandra Mahajan, Ning Jiang, Kai Ren, Jinhui Li, Wen-Yun Yang
2025-10-24
Modern large-scale recommendation systems rely heavily on user interaction
history sequences to enhance the model performance. The advent of large
language models and sequential modeling techniques, particularly
Transformer-like architectures, has led to significant advancements recently
(e.g., HSTU, SIM, and TWIN models). While scaling to ultra-long user histories
(10k to 100k items) generally improves model performance, it also creates
significant challenges on latency, queries per second (QPS) and GPU cost in
industry-scale recommendation systems. Existing models do not adequately
address these industrial scalability issues. In this paper, we propose a novel
two-stage modeling framework, namely VIrtual Sequential Target Attention
(VISTA), which decomposes traditional target attention from a candidate item to
user history items into two distinct stages: (1) user history summarization
into a few hundred tokens; followed by (2) candidate item attention to those
tokens. These summarization token embeddings are then cached in a storage
system and utilized as sequence features for downstream model training and
inference. This novel design for scalability enables VISTA to scale to lifelong
user histories (up to one million items) while keeping downstream training and
inference costs fixed, which is essential in industry. Our approach achieves
significant improvements in offline and online metrics and has been
successfully deployed on an industry-leading recommendation platform serving
billions of users.
On the acceleration of cosmic rays at the post-adiabatic shocks of supernova remnants
Authors: O. Petruk, R. Bandiera, T. Kuzyo, R. Brose, A. Ingallinera
2025-10-24
When a supernova remnant (SNR) interacts with the dense material of an
interstellar cloud, its shock wave decelerates rapidly, and the post-shock
temperature drops to levels that permit efficient cooling of the shocked
plasma. At this stage, the shock enters the post-adiabatic phase of its
evolution. During this phase, the internal structure of the SNR undergoes
significant changes, particularly in the immediate post-shock region, at
spatial scales relevant to cosmic ray acceleration. Once the shock enters the
post-adiabatic regime, the efficiency of diffusive shock acceleration increases
due to a higher plasma compression, to a change in the direction of the
advection velocity, and to an increased rate of momentum gain. As a result, the
momentum spectrum of relativistic particles hardens, deviating from a pure
power law at high energies. Particles could reach higher maximum momenta
compared to classical predictions. We highlight the dynamics of post-adiabatic
flows in SNRs, study their impact on particle acceleration, and present
supporting observational evidence in the radio band.
Sprint Sparse-Dense Residual Fusion for Efficient Diffusion Transformers
Authors: Dogyun Park, Moayed Haji-Ali, Yanyu Li, Willi Menapace, Sergey Tulyakov, Hyunwoo J. Kim, Aliaksandr Siarohin, Anil Kag
2025-10-24
Diffusion Transformers (DiTs) deliver state-of-the-art generative performance
but their quadratic training cost with sequence length makes large-scale
pretraining prohibitively expensive. Token dropping can reduce training cost,
yet na\"ive strategies degrade representations, and existing methods are either
parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse--Dense
Residual Fusion for Efficient Diffusion Transformers, a simple method that
enables aggressive token dropping (up to 75%) while preserving quality. SPRINT
leverages the complementary roles of shallow and deep layers: early layers
process all tokens to capture local detail, deeper layers operate on a sparse
subset to cut computation, and their outputs are fused through residual
connections. Training follows a two-stage schedule: long masked pre-training
for efficiency followed by short full-token fine-tuning to close the
train--inference gap. On ImageNet-1K 256x256, SPRINT achieves 9.8x training
savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG)
nearly halves FLOPs while improving quality. These results establish SPRINT as
a simple, effective, and general solution for efficient DiT training.
From Social Division to Cohesion with AI Message Suggestions in Online Chat Groups
Authors: Faria Huq, Elijah L. Claggett, Hirokazu Shirado
2025-10-24
Social cohesion is difficult to sustain in societies marked by opinion
diversity, particularly in online communication. As large language model
(LLM)-driven messaging assistance becomes increasingly embedded in these
contexts, it raises critical questions about its societal impact. We present an
online experiment with 557 participants who engaged in multi-round discussions
on politically controversial topics while freely reconfiguring their discussion
groups. In some conditions, participants received real-time message suggestions
generated by an LLM, either personalized to the individual or adapted to their
group context. We find that subtle shifts in linguistic style during
communication, mediated by AI assistance, can scale up to reshape collective
structures. While individual-focused assistance leads users to segregate into
like-minded groups, relational assistance that incorporates group members'
stances enhances cohesion through more receptive exchanges. These findings
demonstrate that AI-mediated communication can support social cohesion in
can support social cohesion in
diverse groups, but outcomes critically depend on how personalization is
designed.
Performance Trade-offs of Optimizing Small Language Models for E-Commerce
Authors: Josip Tomo Licardo, Nikola Tankovic
2025-10-24
Large Language Models (LLMs) offer state-of-the-art performance in natural
language understanding and generation tasks. However, the deployment of leading
commercial models for specialized tasks, such as e-commerce, is often hindered
by high computational costs, latency, and operational expenses. This paper
investigates the viability of smaller, open-weight models as a
resource-efficient alternative. We present a methodology for optimizing a
one-billion-parameter Llama 3.2 model for multilingual e-commerce intent
recognition. The model was fine-tuned using Quantized Low-Rank Adaptation
(QLoRA) on a synthetically generated dataset designed to mimic real-world user
queries. Subsequently, we applied post-training quantization techniques,
creating GPU-optimized (GPTQ) and CPU-optimized (GGUF) versions. Our results
demonstrate that the specialized 1B model achieves 99% accuracy, matching the
performance of the significantly larger GPT-4.1 model. A detailed performance
analysis revealed critical, hardware-dependent trade-offs: while 4-bit GPTQ
reduced VRAM usage by 41%, it paradoxically slowed inference by 82% on an older
GPU architecture (NVIDIA T4) due to dequantization overhead. Conversely, GGUF
formats on a CPU achieved a speedup of up to 18x in inference throughput and a
reduction of over 90% in RAM consumption compared to the FP16 baseline. We
conclude that small, properly optimized open-weight models are not just a
viable but a more suitable alternative for domain-specific applications,
offering state-of-the-art accuracy at a fraction of the computational cost.
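The VRAM savings and dequantization overhead discussed above both stem from the same mechanism: low-bit weights must be scaled back to floats at inference time. The sketch below shows the general symmetric 4-bit round-trip, not GPTQ's or GGUF's actual kernels (those use grouped scales and calibration):

```python
import numpy as np

def quant_dequant_int4(w):
    """Symmetric 4-bit quantization round-trip: map floats to integers in
    [-8, 7] with a per-tensor scale, then dequantize back to floats."""
    scale = np.abs(w).max() / 7.0                          # per-tensor scale
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, q.astype(np.float32) * scale                 # dequantized copy

w = np.array([0.7, -0.3, 0.1, -0.7], dtype=np.float32)
q, w_hat = quant_dequant_int4(w)
print(q.tolist())                      # [7, -3, 1, -7]
print(np.abs(w - w_hat).max() < 0.06)  # small round-trip error
```

Storing `q` plus one scale is what shrinks VRAM roughly 4x versus FP16, while the multiply-back-by-scale step on every forward pass is the dequantization overhead that slowed the T4 in the experiments above.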
Model-Aware Tokenizer Transfer
Authors: Mykola Haltiuk, Aleksander Smywiński-Pohl
2025-10-24
Large Language Models (LLMs) are trained to support an increasing number of
languages, yet their predefined tokenizers remain a bottleneck for adapting
models to lower-resource or distinct-script languages. Existing tokenizer
transfer methods typically rely on semantic heuristics to initialize new
embeddings, ignoring higher-layer model dynamics and limiting transfer quality.
We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates
model internals into the tokenizer transfer process. MATT introduces an
Attention Influence Modeling (AIM) objective that distills inter-token
attention patterns from a source model into a target model with a new
tokenizer, providing an efficient warm-up before standard language modeling.
Unlike approaches that focus solely on embedding similarity, MATT leverages
attention behavior to guide embedding initialization and adaptation.
Experiments across diverse linguistic settings show that MATT recovers a large
fraction of the original model's performance within a few GPU hours,
outperforming heuristic baselines. These results demonstrate that incorporating
model-level signals offers a practical and effective path toward robust
tokenizer transfer in multilingual LLMs.
Adversarial Déjà Vu Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
Authors: Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, Charith Peris, Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, Ruoxi Jia
2025-10-24
Large language models remain vulnerable to jailbreak attacks that bypass
safety guardrails to elicit harmful outputs. Defending against novel jailbreaks
represents a critical challenge in AI safety. Adversarial training -- designed
to make models robust against worst-case perturbations -- has been the dominant
paradigm for adversarial robustness. However, due to optimization challenges
and difficulties in defining realistic threat models, adversarial training
methods often fail on newly developed jailbreaks in practice. This paper
proposes a new paradigm for improving robustness against unseen jailbreaks,
centered on the Adversarial Déjà Vu hypothesis: novel jailbreaks are not
fundamentally new, but largely recombinations of adversarial skills from
previous attacks. We study this hypothesis through a large-scale analysis of 32
attack papers published over two years. Using an automated pipeline, we extract
and compress adversarial skills into a dictionary of primitives, with LLMs
generating human-readable descriptions. Our analysis reveals that unseen
attacks can be effectively explained as compositions of earlier skills,
with explanatory power increasing monotonically as skill coverage grows. Guided
by this insight, we introduce Adversarial Skill Compositional Training (ASCoT),
which trains on diverse compositions of skill primitives rather than isolated
attack instances. ASCoT substantially improves robustness to unseen attacks,
including multi-turn jailbreaks, while maintaining low over-refusal rates. We
also demonstrate that expanding adversarial skill coverage, not just data
scale, is key to defending against novel attacks.
Warning: This paper contains content that may be harmful or offensive in
nature.