2025-10-31
Table of Contents
- The Limits of Obliviate Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework
- INT v.s. FP A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
- PureKV Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models
- Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry
- Feedback Alignment Meets Low-Rank Manifolds A Structured Recipe for Local Learning
- Standardization of Psychiatric Diagnoses -- Role of Fine-tuned LLM Consortium and OpenAI-gpt-oss Reasoning LLM Enabled Decision Support System
- TwinVoice A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation
- Echo-Conditioned Denoising Diffusion Probabilistic Models for Multi-Target Tracking in RF Sensing
- A Critical Study of Automatic Evaluation in Sign Language Translation
- Implicature in Interaction Understanding Implicature Improves Alignment in Human-LLM Interaction
- Serve Programs, Not Prompts
- Lightweight Federated Learning in Mobile Edge Computing with Statistical and Device Heterogeneity Awareness
- 4-Doodle Text to 3D Sketches that Move!
- Parrot A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning
- DIRC-RAG Accelerating Edge RAG with Robust High-Density and High-Loading-Bandwidth Digital In-ReRAM Computation
- MoEntwine Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference
- Energy-Efficient Autonomous Driving with Adaptive Perception and Robust Decision
- Transformers in Medicine Improving Vision-Language Alignment for Medical Image Captioning
- Model-Document Protocol for AI Search
- Conditional neural field for spatial dimension reduction of turbulence data a comparison study
- KnowCoder-A1 Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA
- H3M-SSMoEs Hypergraph-based Multimodal Learning with LLM Reasoning and Style-Structured Mixture of Experts
- WebLeaper Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
- Repurposing Synthetic Data for Fine-grained Search Agent Supervision
- Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
- Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
- Diffusion LLM with Native Variable Generation Lengths Let [EOS] Lead the Way
- Parallel Loop Transformer for Efficient Test-Time Computation Scaling
- Decoupled MeanFlow Turning Flow Models into Flow Maps for Accelerated Sampling
- MiniOneRec An Open-Source Framework for Scaling Generative Recommendation
- Metadata-Driven Retrieval-Augmented Generation for Financial Question Answering
- Text Simplification with Sentence Embeddings
- SALS Sparse Attention in Latent Space for KV cache Compression
- Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
- Pilot Distortion Design for ToA Obfuscation in Uplink OFDM Communication
- FALQON Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic
- Pie A Programmable Serving System for Emerging LLM Applications
- SpecKD Speculative Decoding for Effective Knowledge Distillation of LLMs
- PRO Enabling Precise and Robust Text Watermark for Open-Source LLMs
- Accurate Prediction of Nonlinear Distortion of Multi-Carrier Signals
- Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders
- CountFormer A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting
- BitSkip An Empirical Analysis of Quantization and Early Exit Composition
- Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization
- Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier
- Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving
- Adaptive Blockwise Search Inference-Time Alignment for Large Language Models
- Evaluation of Vision-LLMs in Surveillance Video
- Lost in Tokenization Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs
- Beyond Imprecise Distance Metrics LLM-Predicted Target Call Stacks for Directed Greybox Fuzzing
- P1GPT a multi-agent LLM workflow module for multi-modal financial information analysis
- UGAE Unified Geometry and Attribute Enhancement for G-PCC Compressed Point Clouds
- Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition
- Switchable Token-Specific Codebook Quantization For Face Image Compression
- How Can AI Augment Access to Justice? Public Defenders' Perspectives on AI Adoption
- Rethinking Inference Placement for Deep Learning across Edge and Cloud Platforms A Multi-Objective Optimization Perspective and Future Directions
- Batch Speculative Decoding Done Right
- Sub-microsecond Transformers for Jet Tagging on FPGAs
- Long-Term PM2.5 Forecasting Using a DTW-Enhanced CNN-GRU Model
- Leveraging Large Language Models to Identify Conversation Threads in Collaborative Learning
- Region-Adaptive Learned Hierarchical Encoding for 3D Gaussian Splatting Data
- Iterative Layer Pruning for Efficient Translation Inference
- Beyond Semantics How Temporal Biases Shape Retrieval in Transformer and State-Space Models
- Rule-Based Explanations for Retrieval-Augmented LLM Systems
- Transformers from Compressed Representations
- TVMC Time-Varying Mesh Compression via Multi-Stage Anchor Mesh Generation
- AI-Driven Carbon Monitoring Transformer-Based Reconstruction of Atmospheric CO2 in Canadian Poultry Regions
- SABlock Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size
- AesCrop Aesthetic-driven Cropping Guided by Composition
- Aligning Diffusion Language Models via Unpaired Preference Optimization
- Frustratingly Easy Task-aware Pruning for Large Language Models
- CHOIR Collaborative Harmonization fOr Inference Robustness
- Backward-Friendly Optimization Training Large Language Models with Approximate Gradients under Memory Constraints
- GigaEmbeddings Efficient Russian Language Embedding Model
- The Structural Scalpel Automated Contiguous Layer Pruning for Large Language Models
- Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders
- Efficient Low Rank Attention for Long-Context Inference in Large Language Models
- PACR Progressively Ascending Confidence Reward for LLM Reasoning
- Synthetic-to-Real Transfer Learning for Chromatin-Sensitive PWS Microscopy
- When Fewer Layers Break More Chains Layer Pruning Harms Test-Time Scaling in LLMs
- TrajGATFormer A Graph-Based Transformer Approach for Worker and Obstacle Trajectory Prediction in Off-site Construction Environments
- Surface Reading LLMs Synthetic Text and its Styles
- Edit Less, Achieve More Dynamic Sparse Neuron Masking for Lifelong Knowledge Editing in LLMs
- Scaling Up Efficient Small Language Models Serving and Deployment for Semantic Job Search
- Generalization or Memorization Dynamic Decoding for Mode Steering
- Embracing Trustworthy Brain-Agent Collaboration as Paradigm Extension for Intelligent Assistive Technologies
- Compositional Bias Control in Large Language Models Preference Learning Fails, Supervision Succeeds
- Pruning and Quantization Impact on Graph Neural Networks
- Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative Recommenders
- On the acceleration of cosmic rays at the post-adiabatic shocks of supernova remnants
- Sprint Sparse-Dense Residual Fusion for Efficient Diffusion Transformers
- From Social Division to Cohesion with AI Message Suggestions in Online Chat Groups
- Performance Trade-offs of Optimizing Small Language Models for E-Commerce
- Model-Aware Tokenizer Transfer
- Adversarial Déjà Vu Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
The Limits of Obliviate Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework
Authors: Aakriti Shah, Thai Le
2025-10-29
Unlearning in large language models (LLMs) is crucial for managing sensitive data and correcting misinformation, yet evaluating its effectiveness remains an open problem. We investigate whether persuasive prompting can recall factual knowledge from deliberately unlearned LLMs across models ranging from 2.7B to 13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing from ACT-R and Hebbian theory (spreading activation theories), as well as persuasion principles, we introduce the Stimulus-Knowledge Entanglement-Behavior Framework (SKeB), which models information entanglement via domain graphs and tests whether factual recall in unlearned models is correlated with persuasive framing. We develop entanglement metrics to quantify knowledge activation patterns and evaluate factuality, non-factuality, and hallucination in outputs. Our results show persuasive prompts substantially enhance factual knowledge recall (14.8% baseline vs. 24.5% with authority framing), with effectiveness inversely correlated with model size (128% recovery in 2.7B vs. 15% in 13B). SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.
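The spreading-activation idea the framework draws on (from ACT-R and Hebbian theory) can be illustrated with a toy sketch; the graph, decay factor, and update rule below are illustrative assumptions, not the paper's actual entanglement model:

```python
import numpy as np

# Toy domain graph: nodes are concepts, edges encode entanglement.
# Activation injected at a stimulus node propagates to neighbors,
# attenuated by a decay factor at each hop (hypothetical parameters).
adjacency = np.array([
    [0, 1, 1, 0],   # stimulus concept
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

def spread_activation(adj, source, steps=3, decay=0.5):
    """Propagate activation from `source` for `steps` hops."""
    n = adj.shape[0]
    # Row-normalize so each node splits its activation among its neighbors.
    norm = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1)
    act = np.zeros(n)
    act[source] = 1.0
    total = act.copy()
    for _ in range(steps):
        act = decay * (act @ norm)   # one hop, attenuated
        total += act
    return total

activation = spread_activation(adjacency, source=0)
# Node 3 is reachable only through node 2, so it accumulates the least.
```

In this sketch, knowledge "entangled" with the stimulus (well-connected nodes) retains high activation even when a single node is suppressed, which is the intuition behind probing unlearned facts through related prompts.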
INT v.s. FP A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Authors: Mengzhao Chen, Meng Wu, Hui Jin, Zhihang Yuan, Jing Liu, Chaoyi Zhang, Yunshui Li, Jie Huang, Jin Ma, Zeyue Xue, Zhiheng Liu, Xingyan Bin, Ping Luo
2025-10-29
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.
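The block-wise granularity discussed above (e.g., block size 32) is easy to illustrate. The following is a generic sketch of symmetric per-block INT8 quantization, not the MXINT8 format or the paper's symmetric clipping method:

```python
import numpy as np

def quantize_int8_blockwise(x, block=32):
    """Symmetric per-block INT8 quantization: one scale per `block` values."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 128)).astype(np.float32)
q, s = quantize_int8_blockwise(x.ravel())
x_hat = dequantize(q, s).reshape(x.shape)

# Each block carries its own scale, so a single outlier only degrades
# the 32 values that share its block rather than the whole tensor.
err = np.abs(x - x_hat).max()
```

The maximum error per value is bounded by half the block's scale, which is why shrinking the block size tightens accuracy at the cost of storing more scales.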
PureKV Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models
Authors: Zhonghua Jiang, Kunxi Li, Yiyun Zhou, Sihao Liu, Zhaode Wang, Chengfei lv, Shengyu Zhang
2025-10-29
Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs. The quadratic complexity in attention and autoregressive generation, as well as the constantly growing key-value (KV) cache size, severely hinder the prefilling and decoding stages. Recent efforts have attempted to compress the KV cache by identifying and evicting the KV of less important tokens, but these methods typically rely on attention scores to estimate token importance, making them incompatible with efficient attention mechanisms such as FlashAttention and Sparse Attention, which do not explicitly compute attention matrices. Moreover, existing methods overlook how sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache, thereby compromising the effectiveness of downstream compression strategies. To address this issue, we propose PureKV, a plug-and-play framework for joint optimization of sparse attention and KV cache compression. We first introduce a KV cache compression strategy that is fully compatible with efficient attention accelerators. Our method utilizes lower-layer attention scores to estimate the importance of higher layers' KV cache, enabling active eviction without compromising accuracy. In addition, we have designed a Spatial-Temporal Sparse Attention (ST-SpAttn) module specifically tailored for video KV cache compression algorithms. This module combines spatial and temporal attention sparsity to improve the efficiency of KV cache optimization algorithms by purifying spatial noise and temporal redundancy in the KV cache. At the same time, ST-SpAttn also accelerates the prefilling stage of VLLMs. Extensive experiments on VLLMs (VideoLLaMA2, Qwen2.5-VL) show that PureKV achieves 5.0x KV cache compression and 3.16x prefilling speedup, with negligible quality degradation.
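For context, the attention-score-based eviction baseline that this line of work builds on can be sketched generically; the scoring and top-k rule here are illustrative, not PureKV's lower-layer estimation strategy:

```python
import numpy as np

def evict_kv(keys, values, attn_scores, keep):
    """Score-based KV cache eviction: retain the `keep` cached tokens with
    the highest accumulated attention mass, drop the rest."""
    totals = attn_scores.sum(axis=0)              # importance per cached token
    kept = np.sort(np.argsort(totals)[-keep:])    # top-k, in original order
    return keys[kept], values[kept], kept

rng = np.random.default_rng(1)
T = 16                                            # cached tokens
keys = rng.normal(size=(T, 8))
values = rng.normal(size=(T, 8))
attn = rng.random(size=(4, T))                    # 4 queries over the cache
attn /= attn.sum(axis=1, keepdims=True)           # normalize attention rows

k2, v2, kept = evict_kv(keys, values, attn, keep=8)
```

Note the dependency the abstract points out: this baseline needs the explicit attention matrix `attn`, which fused kernels like FlashAttention never materialize, motivating importance estimates from elsewhere (e.g., lower layers).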
Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry
Authors: Run Peng, Ziqiao Ma, Amy Pang, Sikai Li, Zhang Xi-Jia, Yingzhuo Yu, Cristian-Paul Bara, Joyce Chai
2025-10-29
While Large Language Model (LLM) agents are often approached from the angle of action planning/generation to accomplish a goal (e.g., given by language descriptions), their abilities to collaborate with each other to achieve a joint goal are not well explored. To address this limitation, this paper studies LLM agents in task collaboration, particularly under the condition of information asymmetry, where agents have disparities in their knowledge and skills and need to work together to complete a shared task. We extend Einstein Puzzles, a classical symbolic puzzle, to a table-top game. In this game, two LLM agents must reason, communicate, and act to satisfy spatial and relational constraints required to solve the puzzle. We apply a fine-tuning-plus-verifier framework in which LLM agents are equipped with various communication strategies and verification signals from the environment. Empirical results highlight the critical importance of aligned communication, especially when agents possess both information-seeking and -providing capabilities. Interestingly, agents without communication can still achieve high task performance; however, further analysis reveals a lack of true rule understanding and lower trust from human evaluators. Instead, by integrating an environment-based verifier, we enhance agents' ability to comprehend task rules and complete tasks, promoting both safer and more interpretable collaboration in AI systems. https://github.com/Roihn/EinsteinPuzzles
Feedback Alignment Meets Low-Rank Manifolds A Structured Recipe for Local Learning
Authors: Arani Roy, Marco P. Apolinario, Shristi Das Biswas, Kaushik Roy
2025-10-29
Training deep neural networks (DNNs) with backpropagation (BP) achieves
state-of-the-art accuracy but requires global error propagation and full
parameterization, leading to substantial memory and computational overhead.
Direct Feedback Alignment (DFA) enables local, parallelizable updates with
lower memory requirements but is limited by unstructured feedback and poor
scalability in deeper architectures, especially convolutional neural networks.
To address these limitations, we propose a structured local learning framework
that operates directly on low-rank manifolds defined by the Singular Value
Decomposition (SVD) of weight matrices. Each layer is trained in its decomposed
form, with updates applied to the SVD components using a composite loss that
integrates cross-entropy, subspace alignment, and orthogonality regularization.
Feedback matrices are constructed to match the SVD structure, ensuring
consistent alignment between forward and feedback pathways. Our method reduces
the number of trainable parameters relative to the original DFA model, without
relying on pruning or post hoc compression. Experiments on CIFAR-10, CIFAR-100,
and ImageNet show that our method achieves accuracy comparable to that of BP.
Ablation studies confirm the importance of each loss term in the low-rank
setting. These results establish local learning on low-rank manifolds as a
principled and scalable alternative to full-rank gradient-based training.
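The core DFA mechanic referenced above, replacing the backprop error pathway with a fixed random feedback matrix, can be sketched on a toy regression problem (dimensions, learning rate, and loss are illustrative; the paper's SVD-structured feedback matrices are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression network trained with Direct Feedback Alignment (DFA):
# the hidden layer's error signal arrives through a FIXED random matrix B
# instead of the transpose of the forward weights, as backprop would use.
n_in, n_hid, n_out = 4, 16, 3
W1 = rng.normal(0, 0.5, (n_in, n_hid))
W2 = rng.normal(0, 0.5, (n_hid, n_out))
B = rng.normal(0, 0.5, (n_out, n_hid))   # fixed feedback matrix, never trained

x = rng.normal(size=(8, n_in))
y = rng.normal(size=(8, n_out))

def mse():
    return float(((np.maximum(x @ W1, 0) @ W2 - y) ** 2).mean())

loss_before = mse()
for _ in range(200):
    h = np.maximum(x @ W1, 0)        # ReLU forward pass
    e = h @ W2 - y                   # output error
    dh = (e @ B) * (h > 0)           # DFA: random projection, not e @ W2.T
    W2 -= 0.01 * h.T @ e             # local update for the output layer
    W1 -= 0.01 * x.T @ dh            # local update for the hidden layer
loss_after = mse()
```

Because each layer's update needs only its local activations and the projected error, the updates are parallelizable, which is the property the paper's low-rank structuring preserves.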
Standardization of Psychiatric Diagnoses -- Role of Fine-tuned LLM Consortium and OpenAI-gpt-oss Reasoning LLM Enabled Decision Support System
Authors: Eranga Bandara, Ross Gore, Atmaram Yarlagadda, Anita H. Clayton, Preston Samuel, Christopher K. Rhea, Sachin Shetty
2025-10-29
The diagnosis of most mental disorders, including psychiatric evaluations,
primarily depends on dialogues between psychiatrists and patients. This
subjective process can lead to variability in diagnoses across clinicians and
patients, resulting in inconsistencies and challenges in achieving reliable
outcomes. To address these issues and standardize psychiatric diagnoses, we
propose a Fine-Tuned Large Language Model (LLM) Consortium and OpenAI-gpt-oss Reasoning LLM-enabled Decision Support System for the clinical diagnosis of mental disorders. Our approach leverages fine-tuned LLMs trained on conversational datasets involving psychiatrist-patient interactions focused on mental health conditions (e.g., depression). The diagnostic predictions from individual models are aggregated through a consensus-based decision-making process, refined by the OpenAI-gpt-oss reasoning LLM. We propose a novel method for deploying LLM agents that orchestrate communication between the LLM consortium and the reasoning LLM, ensuring transparency, reliability, and responsible AI across the entire diagnostic workflow. Experimental results demonstrate the transformative potential of combining fine-tuned LLMs with a reasoning model to create a robust and highly accurate diagnostic system for mental health assessment. A prototype of the proposed platform, integrating three fine-tuned LLMs with the OpenAI-gpt-oss reasoning LLM, was developed in collaboration with the U.S. Army Medical Research Team in Norfolk, Virginia, USA. To the best of our knowledge, this work represents the first application of a fine-tuned LLM consortium integrated with a reasoning LLM for clinical mental health diagnosis, paving the way for next-generation AI-powered eHealth systems aimed at standardizing psychiatric diagnoses.
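The consensus-based aggregation step can be illustrated with a minimal majority-vote sketch; the labels and threshold below are hypothetical, and the actual platform's aggregation and reasoning-LLM refinement are more involved:

```python
from collections import Counter

def consensus(predictions, threshold=0.5):
    """Majority vote across model predictions; returns the winning label
    plus an agreement score a downstream reasoning model could inspect."""
    votes = Counter(predictions)
    label, count = votes.most_common(1)[0]
    agreement = count / len(predictions)
    # Below the threshold, defer instead of committing to a diagnosis.
    return (label, agreement) if agreement >= threshold else (None, agreement)

# Three fine-tuned models vote on a (hypothetical) diagnostic label.
label, agreement = consensus(["depression", "depression", "anxiety"])
```

Exposing the agreement score, rather than only the winning label, is what lets a second-stage reasoner treat low-consensus cases differently from unanimous ones.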
TwinVoice A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation
Authors: Bangde Du, Minghao Guo, Songming He, Ziyi Ye, Xi Zhu, Weihang Su, Shuqi Zhu, Yujia Zhou, Yongfeng Zhang, Qingyao Ai, Yiqun Liu
2025-10-29
Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the required capabilities. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short in capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.
Echo-Conditioned Denoising Diffusion Probabilistic Models for Multi-Target Tracking in RF Sensing
Authors: Amirhossein Azarbahram, Onel L. A. López
2025-10-29
In this paper, we consider a dynamic radio frequency sensing system aiming to spatially track multiple targets over time. We develop a conditional denoising diffusion probabilistic model (C-DDPM)-assisted framework that learns the temporal evolution of target parameters by leveraging noisy echo observations as conditioning features. The proposed framework integrates a variational autoencoder (VAE) for echo encoding and utilizes classifier-free guidance to enhance conditional denoising. In each transmission block, the VAE encodes the received echo into a latent representation that conditions the DDPM to predict future target states, which are then used for codebook beam selection. Simulation results show that the proposed approach outperforms classical signal processing, filtering, and deep learning benchmarks. The C-DDPM-assisted framework achieves significantly lower estimation errors in both angle and distance tracking, demonstrating the potential of generative models for integrated sensing and communications.
A Critical Study of Automatic Evaluation in Sign Language Translation
Authors: Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith
2025-10-29
Automatic evaluation metrics are crucial for advancing sign language
translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are
only text-based, and it remains unclear to what extent text-based metrics can
reliably capture the quality of SLT outputs. To address this gap, we
investigate the limitations of text-based SLT evaluation metrics by analyzing
six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one
hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.
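To make the lexical-metric limitation concrete, here is a simplified character n-gram F-score in the spirit of chrF (the real metric uses higher-order n-grams and additional handling; this sketch is only illustrative):

```python
from collections import Counter

def char_ngrams(text, n):
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_like(hypothesis, reference, max_n=3, beta=2.0):
    """Simplified character n-gram F-beta score: average n-gram precision
    and recall, then combine with recall weighted by beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())       # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p = sum(precisions) / max_n
    r = sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

exact = chrf_like("the cat sat", "the cat sat")
partial = chrf_like("a cat sat", "the cat sat")
```

A semantically faithful paraphrase with little surface overlap would score low under any metric of this shape, which is exactly the failure mode that motivates LLM-based and multimodal evaluation.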
Implicature in Interaction Understanding Implicature Improves Alignment in Human-LLM Interaction
Authors: Asutosh Hota, Jussi P. P. Jokinen
2025-10-29
The rapid advancement of Large Language Models (LLMs) is positioning language at the core of human-computer interaction (HCI). We argue that advancing HCI requires attention to the linguistic foundations of interaction, particularly implicature (meaning conveyed beyond explicit statements through shared context), which is essential for human-AI (HAI) alignment. This study examines LLMs' ability to infer user intent embedded in context-driven prompts and whether understanding implicature improves response generation. Results show that larger models approximate human interpretations more closely, while smaller models struggle with implicature inference. Furthermore, implicature-based prompts significantly enhance the perceived relevance and quality of responses across models, with notable gains in smaller models. Overall, 67.6% of participants preferred responses to implicature-embedded prompts over literal ones, highlighting a clear preference for contextually nuanced communication. Our work contributes to understanding how linguistic theory can be used to address the alignment problem by making HAI interaction more natural and contextually grounded.
Serve Programs, Not Prompts
Authors: In Gim, Lin Zhong
2025-10-29
Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs. Symphony exposes LLM model computations via system calls and virtualizes the KV cache with a dedicated file system, while ensuring GPU efficiency with a two-level process scheduling scheme. Symphony has the potential to open the door to a more efficient and extensible ecosystem for LLM applications.
Lightweight Federated Learning in Mobile Edge Computing with Statistical and Device Heterogeneity Awareness
Authors: Jinghong Tan, Zhichen Zhang, Kun Guo, Tsung-Hui Chang, Tony Q. S. Quek
2025-10-29
Federated learning enables collaborative machine learning while preserving data privacy, but high communication and computation costs, exacerbated by statistical and device heterogeneity, limit its practicality in mobile edge computing. Existing compression methods like sparsification and quantization reduce per-round costs but may increase training rounds and thus the total training cost, especially under heterogeneous environments. We propose a lightweight personalized FL framework built on parameter decoupling, which separates the model into shared and private subspaces, enabling us to uniquely apply gradient sparsification to the shared component and model quantization to the private one. This structural separation confines communication compression to global knowledge exchange and computation reduction to local personalization, protecting personalization quality while adapting to heterogeneous client resources. We theoretically analyze convergence under the combined effects of sparsification and quantization, revealing a compression-convergence trade-off that links compression error to the iteration complexity. Guided by this analysis, we formulate a joint optimization that selects per-client sparsification and quantization rates and wireless bandwidth to reduce end-to-end training time. Simulation results demonstrate faster convergence and substantial reductions in overall communication and computation costs with negligible accuracy loss, validating the benefits of coordinated compression and resource-aware personalization in resource-constrained heterogeneous environments.
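The parameter-decoupling recipe, sparsifying the shared gradient while quantizing the private component, can be sketched as follows (shapes, ratio, and bit-width are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def top_k_sparsify(grad, ratio):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries
    (applied to the SHARED subspace before uploading to the server)."""
    k = max(1, int(grad.size * ratio))
    flat = grad.ravel().copy()
    drop = np.argpartition(np.abs(flat), -k)[:-k]  # indices of smallest entries
    flat[drop] = 0.0
    return flat.reshape(grad.shape)

def quantize_uniform(w, bits=8):
    """Uniform symmetric quantization (applied to the PRIVATE component,
    which never leaves the device, to cut local compute/memory)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    if scale == 0:
        return w
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
shared_grad = rng.normal(size=(64, 64))   # shared subspace: communicated
private_w = rng.normal(size=(64, 16))     # private subspace: stays local

sparse = top_k_sparsify(shared_grad, ratio=0.1)   # 10x fewer uploaded values
local = quantize_uniform(private_w)
```

The split matters because the two errors compound differently: sparsification error enters the global aggregate every round, while quantization error stays confined to each client's personalization, which is the trade-off the convergence analysis formalizes.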
4-Doodle Text to 3D Sketches that Move!
Authors: Hao Chen, Jiaqi Wang, Yonggang Qi, Ke Li, Kaiyue Pang, Yi-Zhe Song
2025-10-29
We present a novel task: text-to-3D sketch animation, which aims to bring freeform sketches to life in dynamic 3D space. Unlike prior works focused on photorealistic content generation, we target abstract, stylized, and view-consistent 3D vector sketches, a lightweight and interpretable medium well-suited for visual communication and prototyping. However, this task is very challenging: (i) no paired dataset exists for text and 3D (or 4D) sketches; (ii) sketches require structural abstraction that is difficult to model with conventional 3D representations like NeRFs or point clouds; and (iii) animating such sketches demands temporal coherence and multi-view consistency, which current pipelines do not address. Therefore, we propose 4-Doodle, the first training-free framework for generating dynamic 3D sketches from text. It leverages pretrained image and video diffusion models through a dual-space distillation scheme: one space captures multi-view-consistent geometry using differentiable Bézier curves, while the other encodes motion dynamics via temporally-aware priors. Unlike prior work (e.g., DreamFusion), which optimizes from a single view per step, our multi-view optimization ensures structural alignment and avoids view ambiguity, which is critical for sketches. Furthermore, we introduce a structure-aware motion module that separates shape-preserving trajectories from deformation-aware changes, enabling expressive motion such as flipping, rotation, and articulated movement. Extensive experiments show that our method produces temporally realistic and structurally stable 3D sketch animations, outperforming existing baselines in both fidelity and controllability. We hope this work serves as a step toward more intuitive and accessible 4D content creation.
Parrot A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning
Authors: Senjie Jin, Lu Chen, Zhiheng Xi, Yuhui Wang, Sirui Song, Yuhao Zhou, Xinbo Zhang, Peng Sun, Hong Lu, Tao Gui, Qi Zhang, Xuanjing Huang
2025-10-29
Natural language chain-of-thought (N-CoT) and program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. Current research typically endeavors to achieve unidirectional enhancement: P-CoT-enhanced N-CoT or N-CoT-enhanced P-CoT. In this paper, we seek to fully unleash the two paradigms' strengths for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across the two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) three target-designed subtasks integrate sequential P-CoT and N-CoT generation; 2) a subtask hybrid training strategy facilitates natural language semantic transferability; and 3) a converted N-CoT auxiliary reward is designed to alleviate sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances the performance of both N-CoT and P-CoT, especially N-CoT. Using Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA achieves gains of +21.87 and +21.48 on MathQA over the RL baseline, which is resource-intensive.
DIRC-RAG Accelerating Edge RAG with Robust High-Density and High-Loading-Bandwidth Digital In-ReRAM Computation
Authors: Kunming Shao, Zhipeng Liao, Jiangnan Yu, Liang Zhao, Qiwei Li, Xijie Huang, Jingyu He, Fengshi Tian, Yi Zou, Xiaomeng Wang, Tim Kwang-Ting Cheng, Chi-Ying Tsui
2025-10-29
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieval but faces challenges on edge devices due to high storage, energy, and latency demands. Computing-in-Memory (CIM) offers a promising solution by storing document embeddings in CIM macros and enabling in-situ parallel retrieval, but is constrained by either low memory density or limited computational accuracy. To address these challenges, we present DIRC-RAG, a novel edge RAG architecture leveraging Digital In-ReRAM Computation (DIRC). DIRC integrates a high-density multi-level ReRAM subarray with an SRAM cell, utilizing the SRAM and differential sensing for robust ReRAM readout and digital multiply-accumulate (MAC) operations. By storing all document embeddings within the CIM macro, DIRC achieves ultra-low-power, single-cycle data loading, substantially reducing both energy consumption and latency compared to off-chip DRAM. A query-stationary (QS) dataflow is supported for RAG tasks, minimizing on-chip data movement and reducing SRAM buffer requirements. We introduce error optimization for the DIRC ReRAM-SRAM cell by extracting the bit-wise spatial error distribution of the ReRAM subarray and applying targeted bit-wise data remapping. An error detection circuit is also implemented to enhance readout resilience against device- and circuit-level variations. Simulation results demonstrate that DIRC-RAG under a TSMC 40nm process achieves an on-chip non-volatile memory density of 5.18 Mb/mm² and a throughput of 131 TOPS. It delivers a 4MB retrieval latency of 5.6 µs/query and an energy consumption of 0.956 µJ/query, while maintaining retrieval precision.
MoEntwine Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference
Authors: Xinru Tang, Jingxiang Hou, Dingcheng Jiang, Taiquan Wei, Jiaxin Liu, Jinyi Deng, Huizheng Wang, Qize Yang, Haoran Shang, Chao Li, Yang Hu, Shouyi Yin
2025-10-29
As large language models (s) continue to scale up, mixture-of-experts
(MoE) has become a common technology in SOTA models. MoE models rely on expert
parallelism (EP) to alleviate memory bottleneck, which introduces all-to-all
to dispatch and combine tokens across devices. However, in
widely-adopted GPU clusters, high-overhead cross-node
makes
all-to-all expensive, hindering the adoption of EP. Recently, wafer-scale chips
(WSCs) have emerged as a platform integrating numerous devices on a wafer-sized
interposer. WSCs provide a unified high-performance network connecting all
devices, presenting a promising potential for hosting MoE models. Yet, their
network is restricted to a mesh topology, causing imbalanced communication
pressure and performance loss. Moreover, the lack of on-wafer disk leads to
high-overhead expert migration on the critical path.
To fully unleash this potential, we first propose Entwined Ring Mapping
(ER-Mapping), which co-designs the mapping of attention and MoE layers to
balance communication pressure and achieve better performance. We find that
under ER-Mapping, the distribution of cold and hot links in the attention and
MoE layers is complementary. Therefore, to hide the migration overhead, we
propose the Non-invasive Balancer (NI-Balancer), which splits a complete expert
migration into multiple steps and alternately utilizes the cold links of both
layers. Evaluation shows ER-Mapping achieves communication reduction of up to
62%. NI-Balancer further delivers 54% and 22% improvements in MoE computation
and communication, respectively. Compared with the SOTA NVL72 supernode, the WSC
platform delivers an average 39% higher per-device MoE performance owing to its
scalability to larger EP.
Energy-Efficient Autonomous Driving with Adaptive Perception and Robust Decision
Authors: Yuyang Xia, Zibo Liang, Liwei Deng, Yan Zhao, Han Su, Kai Zheng
2025-10-29
Autonomous driving is an emerging technology that is expected to bring
significant social, economic, and environmental benefits. However, these
benefits come with rising energy consumption by computation engines, limiting
the driving range of vehicles, especially electric ones. Perception computing
is typically the most power-intensive component, as it relies on large-scale
deep learning models to extract environmental features. Recently, numerous
studies have employed model compression techniques, such as sparsification,
quantization, and distillation, to reduce computational consumption. However,
these methods often result in either a substantial model size or a significant
drop in perception accuracy compared to high-computation models. To address
these challenges, we propose an energy-efficient autonomous driving framework,
called EneAD. In the adaptive perception module, a perception optimization
strategy is designed from the perspective of data management and tuning.
Firstly, we manage multiple perception models with different computational
consumption and adjust the execution framerate dynamically. Then, we define
them as knobs and design a transferable tuning method based on Bayesian
optimization to identify promising knob values that achieve low computation
while maintaining desired accuracy. To adaptively switch the knob values in
various traffic scenarios, a lightweight classification model is proposed to
distinguish the perception difficulty in different scenarios. In the robust
decision module, we propose a decision model based on reinforcement learning
and design a regularization term to enhance driving stability in the face of
perturbed perception results. Extensive experiments evidence the superiority of
our framework in both energy consumption and driving performance. EneAD can
reduce perception consumption by 1.9x to 3.5x and thus improve driving range by
3.9% to 8.5%.
Transformers in Medicine Improving Vision-Language Alignment for Medical Image Captioning
Authors: Yogesh Thakku Suresh, Vishwajeet Shivaji Hogale, Luca-Alexandru Zamfira, Anandavardhana Hegde
2025-10-29
We present a transformer-based multimodal framework for generating clinically
relevant captions for MRI scans. Our system combines a DEiT-Small vision
transformer as an image encoder, MediCareBERT for caption embedding, and a
custom LSTM-based decoder. The architecture is designed to semantically align
image and textual embeddings, using hybrid cosine-MSE loss and contrastive
inference via vector similarity. We benchmark our method on the MultiCaRe
dataset, comparing performance on filtered brain-only MRIs versus general MRI
images against state-of-the-art medical image captioning methods including
BLIP, R2GenGPT, and recent transformer-based approaches. Results show that
focusing on domain-specific data improves caption accuracy and semantic
alignment. Our work proposes a scalable, interpretable solution for automated
medical image reporting.
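The hybrid cosine-MSE alignment objective mentioned above can be sketched as follows; this is an illustrative reconstruction on plain Python lists, and `alpha` is an assumed balancing weight (the abstract does not specify how the two terms are combined):

```python
import math

def hybrid_cosine_mse_loss(img_emb, txt_emb, alpha=0.5):
    """Hybrid cosine-MSE objective: the MSE term matches embeddings
    element-wise, the cosine term aligns their direction.
    `alpha` is an assumed balancing weight, not from the abstract."""
    mse = sum((a - b) ** 2 for a, b in zip(img_emb, txt_emb)) / len(img_emb)
    dot = sum(a * b for a, b in zip(img_emb, txt_emb))
    norms = (math.sqrt(sum(a * a for a in img_emb))
             * math.sqrt(sum(b * b for b in txt_emb)))
    cosine = dot / norms if norms else 0.0
    return alpha * mse + (1 - alpha) * (1.0 - cosine)
```

Identical embeddings give a loss of zero; mismatched ones are penalized in both magnitude and direction.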
Model-Document Protocol for AI Search
Authors: Hongjin Qian, Zheng Liu
2025-10-29
AI search depends on linking large language models (LLMs) with vast external
knowledge sources. Yet web pages, PDF files, and other raw documents are not
inherently LLM-ready: they are long, noisy, and unstructured. Conventional
retrieval methods treat these documents as verbatim text and return raw
passages, leaving the burden of fragment assembly and contextual reasoning to
the LLM. This gap underscores the need for a new retrieval paradigm that
redefines how models interact with documents.
We introduce the Model-Document Protocol (MDP), a general framework that
formalizes how raw text is bridged to LLMs through consumable knowledge
representations. Rather than treating retrieval as passage fetching, MDP
defines multiple pathways that transform unstructured documents into
task-specific, LLM-ready inputs. These include agentic reasoning, which curates
raw evidence into coherent context; memory grounding, which accumulates
reusable notes to enrich reasoning; and structured leveraging, which encodes
documents into formal representations such as graphs or key-value stores. All
three pathways share the same goal: ensuring that what reaches the LLM is not
raw fragments but compact, structured knowledge directly consumable for
reasoning.
As an instantiation, we present MDP-Agent, which realizes the protocol
through an agentic process: constructing document-level gist memories for
global coverage, performing diffusion-based exploration with vertical
exploitation to uncover layered dependencies, and applying map-reduce style
synthesis to integrate large-scale evidence into compact yet sufficient
context. Experiments on information-seeking benchmarks demonstrate that
MDP-Agent outperforms baselines, validating both the soundness of the MDP
framework and the effectiveness of its agentic instantiation.
Conditional neural field for spatial dimension reduction of turbulence data a comparison study
Authors: Junyi Guo, Pan Du, Xiantao Fan, Yahui Li, Jian-Xun Wang
2025-10-29
We investigate conditional neural fields (CNFs), mesh-agnostic,
coordinate-based decoders conditioned on a low-dimensional latent, for spatial
dimensionality reduction of turbulent flows. CNFs are benchmarked against
Proper Orthogonal Decomposition and a convolutional autoencoder within a
unified encoding-decoding framework and a common evaluation protocol that
explicitly separates in-range (interpolative) from out-of-range (strict
extrapolative) testing beyond the training horizon, with identical
preprocessing, metrics, and fixed splits across all baselines. We examine three
conditioning mechanisms: (i) activation-only modulation (often termed FiLM),
(ii) low-rank weight and bias modulation (termed FP), and (iii) last-layer
inner-product coupling, and introduce a novel domain-decomposed CNF that
localizes complexities. Across representative turbulence datasets (WMLES
channel inflow, DNS channel inflow, and wall pressure fluctuations over
turbulent boundary layers), CNF-FP achieves the lowest training and in-range
testing errors, while CNF-FiLM generalizes best for out-of-range scenarios once
moderate latent capacity is available. Domain decomposition significantly
improves out-of-range accuracy, especially for the more demanding datasets. The
study provides a rigorous, physics-aware basis for selecting conditioning,
capacity, and domain decomposition when using CNFs for turbulence data
compression and reconstruction.
KnowCoder-A1 Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA
Authors: Zhuo Chen, Fei Wang, Zixuan Li, Zhao Zhang, Weiwei Ding, Chuanguang Yang, Yongjun Xu, Xiaolong Jin, Jiafeng Guo
2025-10-29
Knowledge Base Question Answering (KBQA) aims to answer natural-language
questions over a structured Knowledge Base (KB). Recent work improves KBQA by
adopting an agentic reasoning paradigm, in which Large Language Models (LLMs)
iteratively decompose a question, generate its corresponding logical queries,
and interact with the KB to derive the answer. However, these methods typically
fine-tune LLMs on reasoning trajectories synthesized via process supervision,
which offers weak incentives for exploration and thus fails to strengthen the
agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that
can autonomously perform agentic reasoning on KBs to obtain answers. To
incentivize autonomous exploration, KnowCoder-A1 trains the LLM under
outcome-only supervision via multi-stage curriculum reinforcement learning
with an easy-to-hard schedule. To establish foundational agentic
capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of
high-quality trajectories obtained through outcome-based rejection sampling.
Then, to alleviate the reward sparsity inherent in outcome-only supervision, it
applies multi-stage curriculum RL with reward schedules that progress from easy
to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful
reasoning behaviors and consistently outperforms prior approaches across three
mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1
achieves up to an 11.1% relative improvement while using only one-twelfth of
the training data, demonstrating strong agentic reasoning capabilities.
H3M-SSMoEs Hypergraph-based Multimodal Learning with LLM Reasoning and Style-Structured Mixture of Experts
Authors: Peilin Tan, Liang Xie, Churan Zhi, Dian Tu, Chuanqi Shi
2025-10-29
Stock movement prediction remains fundamentally challenging due to complex
temporal dependencies, heterogeneous modalities, and dynamically evolving
inter-stock relationships. Existing approaches often fail to unify structural,
semantic, and regime-adaptive modeling within a scalable framework. This work
introduces H3M-SSMoEs, a novel Hypergraph-based MultiModal architecture with
LLM reasoning and Style-Structured Mixture of Experts, integrating three key
innovations: (1) a Multi-Context Multimodal Hypergraph that hierarchically
captures fine-grained spatiotemporal dynamics via a Local Context Hypergraph
(LCH) and persistent inter-stock dependencies through a Global Context
Hypergraph (GCH), employing shared cross-modal hyperedges and Jensen-Shannon
Divergence weighting mechanism for adaptive relational learning and cross-modal
alignment; (2) an LLM-enhanced reasoning module, which leverages a frozen large
language model with lightweight adapters to semantically fuse and align
quantitative and textual modalities, enriching representations with
domain-specific financial knowledge; and (3) a Style-Structured Mixture of
Experts (SSMoEs) that combines shared market experts and industry-specialized
experts, each parameterized by learnable style vectors enabling regime-aware
specialization under sparse activation. Extensive experiments on three major
stock markets demonstrate that H3M-SSMoEs surpasses state-of-the-art methods in
both predictive accuracy and investment performance, while exhibiting
effective risk control. Datasets, source code, and model weights are available
at our GitHub repository: https://github.com/PeilinTime/H3M-SSMoEs.
WebLeaper Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
Authors: Zhengwei Tao, Haiyang Shen, Baixuan Li, Wenbiao Yin, Jialong Wu, Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Liwen Zhang, Xinyu Wang, Pengjun Xie, Jingren Zhou, Yong Jiang
2025-10-28
Large Language Model (LLM)-based agents have emerged as a transformative
approach for open-ended problem solving, with information seeking (IS) being a
core capability that enables autonomous reasoning and decision-making. While
prior research has largely focused on improving retrieval depth, we observe
that current IS agents often suffer from low search efficiency, which in turn
constrains overall performance. A key factor underlying this inefficiency is
the sparsity of target entities in training tasks, which limits opportunities
for agents to learn and generalize efficient search behaviors. To address these
challenges, we propose WebLeaper, a framework for constructing high-coverage IS
tasks and generating efficient solution trajectories. We formulate IS as a
tree-structured reasoning problem, enabling a substantially larger set of
target entities to be embedded within a constrained context. Leveraging curated
Wikipedia tables, we propose three variants for synthesizing IS tasks, Basic,
Union, and Reverse-Union, to systematically increase both IS efficiency and
efficacy. Finally, we curate training trajectories by retaining only those that
are simultaneously accurate and efficient, ensuring that the model is optimized
for both correctness and search performance. Extensive experiments on both
basic and comprehensive settings, conducted on five IS benchmarks, BrowseComp,
GAIA, xbench-DeepSearch, WideSearch, and Seal-0, demonstrate that our method
consistently achieves improvements in both effectiveness and efficiency over
strong baselines.
Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Authors: Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang
2025-10-28
LLM-based search agents are increasingly trained on entity-centric synthetic
data to solve complex, knowledge-intensive tasks. However, prevailing training
methods like Group Relative Policy Optimization (GRPO) discard this rich entity
information, relying instead on sparse, outcome-based rewards. This critical
limitation renders them unable to distinguish informative "near-miss"
samples-those with substantially correct reasoning but a flawed final
answer-from complete failures, thus discarding valuable learning signals. We
address this by leveraging the very entities discarded during training. Our
empirical analysis reveals a strong positive correlation between the number of
ground-truth entities identified during an agent's reasoning process and final
answer accuracy. Building on this insight, we introduce Entity-aware Group
Relative Policy Optimization (E-GRPO), a novel framework that formulates a
dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect
samples proportional to their entity match rate, enabling the model to
effectively learn from these "near-misses". Experiments on diverse
question-answering (QA) and deep research benchmarks show that E-GRPO
consistently and significantly outperforms the GRPO baseline. Furthermore, our
analysis reveals that E-GRPO not only achieves superior accuracy but also
induces more efficient reasoning policies that require fewer tool calls,
demonstrating a more effective and sample-efficient approach to aligning search
agents.
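The entity-aware reward described above can be sketched in a few lines; this is a minimal illustration, where `partial_weight` is an assumed scaling factor for near-miss credit (the abstract only states that partial rewards are proportional to entity match rate):

```python
def entity_aware_reward(answer_correct, found_entities, gold_entities,
                        partial_weight=0.5):
    """Dense entity-aware reward in the spirit of E-GRPO: correct answers
    receive the full reward, while incorrect 'near-miss' samples receive
    partial credit proportional to their entity match rate.
    `partial_weight` is an assumed hyper-parameter, not from the abstract."""
    if answer_correct:
        return 1.0
    gold = set(gold_entities)
    if not gold:
        return 0.0
    match_rate = len(set(found_entities) & gold) / len(gold)
    return partial_weight * match_rate
```

A trajectory that identified half the ground-truth entities but gave a wrong final answer thus still contributes a positive learning signal instead of a flat zero.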
Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
Authors: Snegha A, Sayambhu Sen, Piyush Singh Pasi, Abhishek Singhania, Preethi Jyothi
2025-10-28
With the release of new large language models (LLMs) like Llama and Mistral,
zero-shot cross-lingual transfer has become increasingly feasible due to their
multilingual pretraining and strong generalization capabilities. However,
adapting these decoder-only LLMs to new tasks across languages remains
challenging. While parameter-efficient fine-tuning (PeFT) techniques like
Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as
soft prompt tuning, prefix tuning, and Llama Adapter are less explored,
especially for zero-shot transfer in decoder-only models. We present a
comprehensive study of three prefix-based methods for zero-shot cross-lingual
transfer from English to 35+ high- and low-resource languages. Our analysis
further explores transfer across linguistic families and scripts, as well as
the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix
methods outperform LoRA baselines by up to 6% on the Belebele benchmark.
Similar improvements were observed with Mistral v0.3 7B as well. Despite using
only 1.23M learnable parameters with prefix tuning, we achieve consistent
improvements across diverse benchmarks. These findings highlight the potential
of prefix-based techniques as an effective and scalable alternative to LoRA,
particularly in low-resource multilingual settings.
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
Authors: Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho
2025-10-28
The quadratic cost of attention hinders the scalability of long-context LLMs,
especially in resource-constrained settings. Existing static sparse attention
methods, such as sliding windows or global tokens, exploit the sparsity of
attention to reduce its cost, but adapt poorly to content-dependent
variations in attention due to their static nature. While previous work has
proposed several dynamic approaches to improve flexibility, they still depend
on predefined templates or heuristic mechanisms. Such strategies reduce
generality and prune tokens that remain contextually important, limiting their
accuracy across diverse tasks. To tackle these bottlenecks of existing methods
for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention
(DHSA), a data-driven framework that dynamically predicts attention sparsity
online without retraining. Our proposed DHSA adaptively segments sequences into
variable-length chunks, then computes chunk representations by aggregating the
token embeddings within each chunk. To avoid the bias introduced by varying
chunk lengths, we apply length-normalized aggregation that scales the averaged
embeddings by the square root of the chunk size. Finally, DHSA upsamples the
chunk-level similarity scores to token-level similarities to calculate
importance scores that determine which token-level interactions should be
preserved. Our experiments on Gemma2 with Needle-in-a-Haystack Test and
LongBench show that DHSA matches dense attention in accuracy, while reducing
latency by 20-60% and peak memory usage by 35%. Compared to other
representative baselines such as block sparse attention, DHSA achieves
consistently higher accuracy (6-18% relative gains) with comparable or lower
cost, offering an efficient and adaptable solution for long-context on-device
LLMs.
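The length-normalized aggregation step described above (mean of token embeddings scaled by the square root of the chunk size) can be sketched directly from the abstract's formula; embeddings are plain lists for illustration:

```python
import math

def chunk_representation(token_embeddings):
    """Length-normalized chunk aggregation: average the token embeddings
    in a chunk, then scale by sqrt(chunk_size) so chunks of different
    lengths remain comparable when scored against each other."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    mean = [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]
    return [m * math.sqrt(n) for m in mean]
```

The sqrt scaling counteracts the variance shrinkage of averaging, so longer chunks are not systematically down-weighted in the chunk-level similarity scores.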
Diffusion LLM with Native Variable Generation Lengths Let [EOS] Lead the Way
Authors: Yicun Yang, Cong Wang, Shaobo Wang, Zichen Wen, Biqing Qi, Hanlin Xu, Linfeng Zhang
2025-10-28
Diffusion-based large language models (dLLMs) have exhibited substantial
potential for parallel text generation, which may enable more efficient
generation compared to autoregressive models. However, current dLLMs suffer
from fixed generation lengths: the generation length of a dLLM has to be
determined before decoding as a hyper-parameter, leading to issues in
efficiency and flexibility. To solve these problems, in this work, we propose
to train a diffusion LLM with native variable generation lengths, abbreviated
as dLLM-Var. Concretely, we aim to train a model to accurately predict the
[EOS] token in the generated text, which enables a dLLM to
natively infer in a block diffusion manner, while still maintaining the ability
of global bi-directional (full) attention and high parallelism. Experiments on
standard benchmarks demonstrate that our method achieves a 30.1x speedup over
traditional dLLM inference paradigms and a 2.4x speedup relative to
autoregressive models such as Qwen and Llama. Our method achieves higher
accuracy and faster inference, elevating dLLMs beyond mere academic novelty and
supporting their practical use in real-world applications. Codes and models
have been released.
Parallel Loop Transformer for Efficient Test-Time Computation Scaling
Authors: Bohong Wu, Mengzhao Chen, Xiang Luo, Shen Yan, Qifan Yu, Fan Xia, Tianqi Zhang, Hongrui Zhan, Zheng Zhong, Xun Zhou, Siyuan Qiao, Xingyan Bin
2025-10-28
Large Language Models (LLMs) are powerful but often too slow and costly for
real-world use during inference. Looped transformers save on parameters by
s save on parameters by
reusing the same weights for multiple computational steps, or "loops." However,
this approach has a major flaw: the loops run one after another, causing
inference latency and memory requirements to increase with each added loop.
This makes them impractical for fast applications. To solve this problem, we
introduce the Parallel Loop Transformer (PLT). PLT is a new architecture that
delivers the performance benefits of a deep, looped model but with the low
latency of a standard, non-looped model. PLT works using two key techniques.
First, Cross-Loop Parallelism (CLP) breaks the sequential dependency by
computing different loops for different tokens at the same time, all within a
single pass. Second, to prevent memory costs from growing, we use an Efficient
Representation Enhancement strategy. This method shares the memory (KV cache)
from the first loop with all other loops. It then uses a Gated Sliding-Window
Attention (G-SWA) to combine this shared global information with local
information, maintaining high accuracy. Our experiments show that PLT achieves
the high accuracy of a traditional looped model but with almost no extra
latency or memory cost compared to a standard transformer.
Decoupled MeanFlow Turning Flow Models into Flow Maps for Accelerated Sampling
Authors: Kyungmin Lee, Sihyun Yu, Jinwoo Shin
2025-10-28
Denoising generative models, such as diffusion and flow-based models, produce
high-quality samples but require many denoising steps due to discretization
error. Flow maps, which estimate the average velocity between timesteps,
mitigate this error and enable faster sampling. However, their training
typically demands architectural changes that limit compatibility with
pretrained flow models. We introduce Decoupled MeanFlow, a simple decoding
strategy that converts flow models into flow map models without architectural
modifications. Our method conditions the final blocks of diffusion transformers
on the subsequent timestep, allowing pretrained flow models to be directly
on the subsequent timestep, allowing pretrained flow models to be directly
repurposed as flow maps. Combined with enhanced training techniques, this
design enables high-quality generation in as few as 1 to 4 steps. Notably, we
find that training flow models and subsequently converting them is more
efficient and effective than training flow maps from scratch. On ImageNet
256x256 and 512x512, our models attain 1-step FID of 2.16 and 2.12,
respectively, surpassing prior art by a large margin. Furthermore, we achieve
FID of 1.51 and 1.68 when increasing the steps to 4, which nearly matches the
performance of flow models while delivering over 100x faster inference.
MiniOneRec An Open-Source Framework for Scaling Generative Recommendation
Authors: Xiaoyu Kong, Leheng Sheng, Junfei Tan, Yuxin Chen, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He
2025-10-28
The recent success of large language models (LLMs) has renewed interest in
whether recommender systems can achieve similar scaling benefits. Conventional
recommenders, dominated by massive embedding tables, tend to plateau as
embedding dimensions grow. In contrast, the emerging generative paradigm
replaces embeddings with compact Semantic ID (SID) sequences produced by
autoregressive Transformers. Yet most industrial deployments remain
proprietary, leaving two fundamental questions open: (1) Do the expected
scaling laws hold on public benchmarks? (2) What is the minimal post-training
recipe that enables competitive performance?
We present MiniOneRec, to the best of our knowledge, the first fully
open-source generative recommendation framework, which provides an end-to-end
workflow spanning SID construction, supervised fine-tuning, and
recommendation-oriented reinforcement learning. We generate SIDs via a Residual
Quantized VAE and post-train Qwen backbones ranging from 0.5B to 7B parameters
on the Amazon Review dataset. Our experiments reveal a consistent downward
trend in both training and evaluation losses with increasing model size,
validating the parameter efficiency of the generative approach. To further
enhance performance, we propose a lightweight yet effective post-training
pipeline that (1) enforces full-process SID alignment and (2) applies
reinforcement learning with constrained decoding and hybrid rewards. Together,
these techniques yield significant improvements in both ranking accuracy and
candidate diversity.
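The residual quantization step behind the Semantic IDs mentioned above can be sketched as follows; codebooks here are illustrative plain lists of code vectors standing in for the learned RQ-VAE codebooks:

```python
def residual_quantize(vec, codebooks):
    """Residual quantization, the mechanism behind RQ-VAE Semantic IDs:
    at each level pick the nearest code, subtract it, and pass the
    residual to the next codebook. The chosen indices form the SID."""
    sid, residual = [], list(vec)
    for book in codebooks:
        # nearest code by squared Euclidean distance to the residual
        idx = min(range(len(book)),
                  key=lambda i: sum((r - c) ** 2
                                    for r, c in zip(residual, book[i])))
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return sid
```

Each item embedding thus maps to a short, coarse-to-fine index sequence that an autoregressive transformer can generate token by token.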
Metadata-Driven Retrieval-Augmented Generation for Financial Question Answering
Authors: Michail Dadopoulos, Anestis Ladas, Stratos Moschidis, Ioannis Negkakis
2025-10-28
Retrieval-Augmented Generation (RAG) struggles on long, structured financial
filings where relevant evidence is sparse and cross-referenced. This paper
presents a systematic investigation of advanced metadata-driven
Retrieval-Augmented Generation (RAG) techniques, proposing and evaluating a
novel, multi-stage RAG architecture that leverages LLM-generated metadata. We
introduce a sophisticated indexing pipeline to create contextually rich
document chunks and benchmark a spectrum of enhancements, including
pre-retrieval filtering, post-retrieval reranking, and enriched embeddings,
benchmarked on the FinanceBench dataset. Our results reveal that while a
powerful reranker is essential for precision, the most significant performance
gains come from embedding chunk metadata directly with text ("contextual
chunks"). Our proposed optimal architecture combines LLM-driven pre-retrieval
optimizations with these contextual embeddings to achieve superior performance.
Additionally, we present a custom metadata reranker that offers a compelling,
cost-effective alternative to commercial solutions, highlighting a practical
trade-off between peak performance and operational efficiency. This study
provides a blueprint for building robust, metadata-aware RAG systems for
financial document analysis.
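The "contextual chunk" idea described above amounts to embedding metadata together with the chunk text; a minimal sketch of the construction, with illustrative field names (the abstract does not list the metadata schema):

```python
def contextual_chunk(chunk_text, metadata):
    """'Contextual chunk' construction: prepend LLM-generated metadata
    to the chunk text so its embedding carries document-level context.
    The metadata field names here are illustrative assumptions."""
    header = " | ".join(f"{k}: {v}" for k, v in metadata.items())
    return f"[{header}]\n{chunk_text}"
```

The combined string, rather than the bare passage, is what gets embedded and indexed, which is where the abstract reports the largest gains.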
Text Simplification with Sentence Embeddings
Authors: Matthew Shardlow
2025-10-28
Sentence embeddings can be decoded to give approximations of the original
texts used to create them. We explore this effect in the context of text
simplification, demonstrating that reconstructed text embeddings preserve
complexity levels. We experiment with a small feed-forward neural network to
effectively learn a transformation between sentence embeddings representing
high-complexity and low-complexity texts. We provide comparisons to Seq2Seq
and LLM-based approaches, showing encouraging results in our much smaller
learning setting. Finally, we demonstrate the applicability of our
transformation to an unseen simplification dataset (MedEASI), as well as
datasets from languages outside the training data (ES, DE). We conclude that
learning transformations in sentence embedding space is a promising direction
for future research and has potential to unlock the ability to develop small,
but powerful models for text simplification and other natural language
generation tasks.
SALS Sparse Attention in Latent Space for KV cache Compression
Authors: Junlin Mu, Hantao Huang, Jihang Zhang, Minghui Yu, Tao Wang, Yidong Li
2025-10-28
Large Language Models capable of handling extended contexts are in high
demand, yet their inference remains challenging due to substantial Key-Value
(KV) cache size and high memory bandwidth requirements. Previous research has
demonstrated that the KV cache exhibits low-rank characteristics within the
hidden dimension, suggesting the potential for effective compression. However,
due to the widely adopted Rotary Position Embedding (RoPE) mechanism in modern
LLMs, naive low-rank compression suffers severe accuracy degradation or creates
a new speed bottleneck, as the low-rank KV cache must first be reconstructed
in order to apply
RoPE. In this paper, we introduce two key insights: first, the application of
RoPE to the key vectors increases their variance, which in turn results in a
higher rank; second, after the key vectors are transformed into the latent
space, they largely maintain their representation across most layers. Based on
these insights, we propose the Sparse Attention in Latent Space framework. SALS
projects the KV cache into a compact latent space via low-rank projection, and
performs sparse token selection using RoPE-free query-key interactions in this
space. By reconstructing only a small subset of important tokens, it avoids the
overhead of full KV cache reconstruction. We comprehensively evaluate SALS on
various tasks using two large-scale models: LLaMA2-7b-chat and Mistral-7b, and
additionally verify its scalability on the RULER-128k benchmark with
LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves SOTA
performance by maintaining competitive accuracy. Under different settings, SALS
achieves 6.4-fold KV cache compression and a 5.7-fold speed-up in the attention
operator compared to FlashAttention2 on 4K sequences. For end-to-end
throughput, SALS achieves 1.4-fold and 4.5-fold improvements compared
to GPT-fast on 4K and 32K sequences, respectively.
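The selection mechanism described above can be sketched as follows; `proj` is a plain list of basis vectors standing in for the learned low-rank projection, and the scoring is a RoPE-free dot product in the latent space (an illustrative simplification of SALS, not the authors' implementation):

```python
def latent_topk(query, keys, proj, k):
    """SALS-style token selection sketch: project the query and key
    vectors into a low-rank latent space, score tokens with RoPE-free
    dot products there, and return the indices of the top-k tokens
    whose KV entries would then be reconstructed."""
    def project(v):
        return [sum(a * b for a, b in zip(v, p)) for p in proj]
    q = project(query)
    scores = [sum(a * b for a, b in zip(project(key), q)) for key in keys]
    return sorted(range(len(keys)), key=lambda i: -scores[i])[:k]
```

Only the selected tokens are lifted back to full rank (and RoPE applied), which is what avoids the full-cache reconstruction bottleneck.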
Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
Authors: Jason M. Pittman, Anton Phillips Jr., Yesenia Medina-Santos, Brielle C. Stark
2025-10-28
In aphasia research, Speech-Language Pathologists (SLPs) devote extensive
time to manually coding speech samples using Correct Information Units (CIUs),
a measure of how informative an individual sample of speech is. Developing
automated systems to recognize aphasic language is limited by data scarcity.
For example, only about 600 transcripts are available in AphasiaBank yet
billions of tokens are used to train large language models (LLMs). In the
broader field of machine learning (ML), researchers increasingly turn to
synthetic data when such data are scarce. Therefore, this study constructs and
validates two methods to generate synthetic transcripts of the AphasiaBank Cat
Rescue picture description task. One method leverages a procedural programming
approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct
LLMs. The methods generate transcripts across four severity levels (Mild,
Moderate, Severe, Very Severe) through word dropping, filler insertion, and
paraphasia substitution. Overall, we found that, compared to human-elicited
transcripts, Mistral 7b Instruct best captures key aspects of linguistic
degradation observed in aphasia, showing realistic directional changes in NDW,
word count, and word length amongst the synthetic generation methods. Based on
the results, future work should plan to create a larger dataset, fine-tune
models for better aphasic representation, and have SLPs assess the realism and
usefulness of the synthetic transcripts.
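The procedural generation method described above (word dropping and filler insertion at severity-dependent rates) can be sketched as follows; the per-level probabilities are assumed values for illustration, and paraphasia substitution is omitted:

```python
import random

# Assumed (word-drop, filler-insert) probabilities per severity level;
# the abstract names the levels but not the exact rates.
SEVERITY_RATES = {
    "Mild": (0.10, 0.05), "Moderate": (0.25, 0.15),
    "Severe": (0.45, 0.30), "Very Severe": (0.65, 0.45),
}
FILLERS = ["um", "uh", "er"]

def degrade(transcript, severity, seed=0):
    """Procedural degradation sketch: drop words and insert fillers at
    severity-dependent rates (paraphasia substitution omitted here)."""
    drop_p, fill_p = SEVERITY_RATES[severity]
    rng = random.Random(seed)
    out = []
    for word in transcript.split():
        if rng.random() < drop_p:
            continue  # simulate word-finding failure
        if rng.random() < fill_p:
            out.append(rng.choice(FILLERS))
        out.append(word)
    return " ".join(out)
```

Fixing the seed makes each synthetic transcript reproducible, which matters when SLPs later rate the same samples for realism.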
Pilot Distortion Design for ToA Obfuscation in Uplink OFDM Communication
Authors: Mahmut Kemal Ercan, Alireza Pourafzal, Musa Furkan Keskin, Sinan Gezici, Henk Wymeersch
2025-10-28
We study uplink orthogonal frequency-division multiplexing (OFDM) pilot
distortion to deliberately obfuscate time-of-arrival (ToA) estimation at a
single base station while preserving communication performance. We design a
complex per-subcarrier distortion vector that increases sidelobes of the
mismatched ambiguity function (MAF) relative to its mainlobe, using two
objectives: the sidelobe-to-peak level ratio and the integrated sidelobe level.
The design is subject to a transmit-power budget and a proximity
(dissimilarity) constraint around the communication-optimal pilot.
Communication impact is quantified by a capacity-motivated lower bound obtained
from the linear minimum mean-squared error (LMMSE) error covariance with a mismatched
channel estimate. The resulting generalized fractional program is solved with
Dinkelbach's transform and a difference-of-convex update that yields a
closed-form Karush-Kuhn-Tucker step. Simulations on a single-input
single-output OFDM link show that the optimized distortions raise MAF sidelobes
and degrade delay estimation, as validated by a mismatched maximum-likelihood
ToA estimator, while incurring only marginal capacity loss over a broad
signal-to-noise ratio range. The method requires no protocol changes or
artificial path injection and provides a signal-level mechanism to control ToA
observability under communication constraints.
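The abstract's solver combines Dinkelbach's transform with a difference-of-convex update. The core of Dinkelbach's method for a generalized fractional program max f(x)/g(x) can be sketched on a toy scalar problem; the functions and grid below are illustrative stand-ins, not the paper's MAF sidelobe objective:

```python
import numpy as np

def dinkelbach(f, g, candidates, tol=1e-9, max_iter=100):
    """Maximize f(x)/g(x) via Dinkelbach's transform over a candidate grid."""
    lam = 0.0
    x = candidates[0]
    for _ in range(max_iter):
        # Inner step: maximize the parametric objective f(x) - lam * g(x).
        x = candidates[np.argmax(f(candidates) - lam * g(candidates))]
        new_lam = f(x) / g(x)  # update the ratio estimate
        if abs(new_lam - lam) < tol:
            return x, new_lam
        lam = new_lam
    return x, lam

# Toy problem: maximize (2x + 1) / (x^2 + 1) on [0, 3];
# the optimum is x = (sqrt(5) - 1) / 2 with ratio (1 + sqrt(5)) / 2.
xs = np.linspace(0.0, 3.0, 30001)
x_star, ratio = dinkelbach(lambda x: 2 * x + 1, lambda x: x ** 2 + 1, xs)
```

In the paper, the inner maximization is not a grid search but a difference-of-convex step with a closed-form KKT solution; the fixed-point structure on the ratio is the same.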
FALQON Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic
Authors: Kanghyun Choi, Hyeyoon Lee, SunJong Park, Dain Kwon, Jinho Lee
2025-10-28
Low-bit floating-point (FP) formats, such as FP8, provide significant
computation and memory savings in model training thanks to native hardware
support on modern GPUs and NPUs. However, our analysis shows that FP8
quantization offers speedup primarily for large-dimensional matrix
multiplications, while inherent quantization overheads diminish the speedup
when applied to low-rank
adaptation (LoRA), which uses small-dimensional matrices for efficient
fine-tuning of large language models (LLMs). To address this limitation, we
propose FALQON, a novel framework that eliminates the quantization overhead
of separate LoRA computational paths by directly merging LoRA adapters into
an FP8-quantized backbone during fine-tuning. Furthermore, we reformulate the
forward and backward computations for merged adapters to significantly reduce
quantization overhead, and introduce a row-wise proxy update mechanism that
efficiently integrates substantial updates into the quantized backbone.
Experimental evaluations demonstrate that FALQON achieves approximately a
3x training speedup over existing quantized LoRA methods with a similar
level of accuracy, providing a practical solution for efficient large-scale
model fine-tuning. Moreover, FALQON's end-to-end FP8 workflow removes the need
for post-training quantization, facilitating efficient deployment. Code is
available at https://github.com/iamkanghyunchoi/falqon.
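The merging idea behind FALQON can be sketched numerically: once the LoRA factors are folded into the backbone weight as W' = W + (alpha/r)·BA, the separate low-rank path (and its per-step overhead) disappears. A minimal float sketch, assuming standard LoRA shapes; FALQON itself merges into an FP8-quantized backbone with a row-wise proxy update:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 16, 32, 4, 8

W = rng.standard_normal((d_out, d_in))  # backbone weight
A = rng.standard_normal((r, d_in))      # LoRA "down" factor
B = np.zeros((d_out, r))                # LoRA "up" factor (zero-init)
B[:, 0] = 1.0                           # pretend some training happened

# Fold the adapter into the backbone: W' = W + (alpha / r) * B @ A.
W_merged = W + (alpha / r) * B @ A

x = rng.standard_normal(d_in)
y_two_path = W @ x + (alpha / r) * B @ (A @ x)  # separate LoRA path
y_merged = W_merged @ x                         # single merged matmul
assert np.allclose(y_two_path, y_merged)
```

The two paths are algebraically identical, which is why the merged form can drop the small-dimensional matmuls that FP8 accelerates poorly.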
Pie A Programmable Serving System for Emerging LLM Applications
Authors: In Gim, Zhiyao Ma, Seung-seob Lee, Lin Zhong
2025-10-28
Emerging large language model (LLM) applications involve diverse reasoning
strategies and agentic workflows, straining the capabilities of existing
serving systems built on a monolithic token generation loop. This paper
introduces Pie, a programmable LLM serving system designed for flexibility and
efficiency. Pie decomposes the traditional generation loop into fine-grained
service handlers exposed via an API and delegates control of the generation
process to user-provided programs, called inferlets. This enables applications
to implement new decoding strategies, bespoke generation logic, and seamlessly
integrate computation and I/O entirely within the application, without
requiring modifications to the serving system. Pie executes inferlets using
WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows
Pie matches state-of-the-art performance on standard tasks (3-12% latency
overhead) while significantly improving latency and throughput (1.3x-3.4x
higher) on agentic workflows by enabling application-specific optimizations.
SpecKD Speculative Decoding for Effective Knowledge Distillation of LLMs
Authors: Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren
2025-10-28
Knowledge Distillation (KD) has become a cornerstone technique for
compressing Large Language Models (LLMs) into smaller, more efficient student
models. However, conventional KD approaches typically apply the distillation
loss uniformly across all tokens, regardless of the teacher's confidence. This
indiscriminate mimicry can introduce noise, as the student is forced to learn
from the teacher's uncertain or high-entropy predictions, which may ultimately
harm student performance-especially when the teacher is much larger and more
powerful. To address this, we propose Speculative Knowledge Distillation
(SpecKD), a novel, plug-and-play framework that introduces a dynamic,
token-level gating mechanism inspired by the "propose-and-verify" paradigm of
speculative decoding. At each step, the student's token proposal is verified
against the teacher's distribution; the distillation loss is selectively
applied only to "accepted" tokens, while "rejected" tokens are masked out.
Extensive experiments on diverse text generation tasks show that SpecKD
consistently and significantly outperforms strong KD baselines, leading to more
stable training and more capable student models, and achieving state-of-the-art
results.
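The propose-and-verify gating can be sketched as a per-token mask on the distillation loss. The acceptance rule below (teacher probability of the student's proposed token above a threshold) is an illustrative assumption, not necessarily SpecKD's exact criterion:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gated_kd_loss(student_logits, teacher_logits, tau=0.2):
    """Distillation loss applied only at tokens the teacher 'accepts'."""
    p_t = softmax(teacher_logits)       # (T, V) teacher distribution
    p_s = softmax(student_logits)       # (T, V) student distribution
    proposals = p_s.argmax(axis=-1)     # student's proposed token per step
    accept = p_t[np.arange(len(proposals)), proposals] > tau
    # Per-token cross-entropy against the teacher; rejected tokens are masked.
    ce = -(p_t * np.log(p_s + 1e-12)).sum(axis=-1)
    loss = float(ce[accept].mean()) if accept.any() else 0.0
    return loss, accept

# Two steps: teacher agrees with the student at step 0, disagrees at step 1.
student = np.array([[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
teacher = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
loss, mask = gated_kd_loss(student, teacher)
```

Only the agreed-upon step contributes to the loss, which is the mechanism the abstract credits for filtering out the teacher's high-entropy predictions.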
PRO Enabling Precise and Robust Text Watermark for Open-Source LLMs
Authors: Jiaqi Xue, Yifei Zhao, Mansour Al Ghanim, Shangqian Gao, Ruimin Sun, Qian Lou, Mengxin Zheng
2025-10-27
Text watermarking for large language models (LLMs) enables model owners to
verify text origin and protect intellectual property. While watermarking
methods for closed-source LLMs are relatively mature, extending them to
open-source models remains challenging, as developers cannot control the
decoding process. Consequently, owners of open-source LLMs lack practical means
to verify whether text was generated by their models. A core difficulty lies in
embedding watermarks directly into model weights without hurting detectability.
A promising idea is to distill watermarks from a closed-source model into an
open one, but this suffers from (i) poor detectability due to mismatch between
learned and predefined patterns, and (ii) fragility to downstream modifications
such as fine-tuning or model merging. To overcome these limitations, we propose
PRO, a Precise and Robust text watermarking method for open-source LLMs. PRO
jointly trains a watermark policy model with the LLM, producing patterns that
are easier for the model to learn and more consistent with detection criteria.
A regularization term further simulates downstream perturbations and penalizes
degradation in watermark detectability, ensuring robustness under model edits.
Experiments on open-source LLMs (e.g., LLaMA-3.2, LLaMA-3, Phi-2) show that PRO
substantially improves both watermark detectability and resilience to model
modifications.
Accurate Prediction of Nonlinear Distortion of Multi-Carrier Signals
Authors: Cameron M. Pike, Brad Oney, Gabriel Hepner, Animesh Yadav
2025-10-27
Nonlinearities in power amplifiers adversely affect multi-carrier modulation
techniques. Accurate prediction of nonlinear distortion is essential for making
design trade-offs between output power and network throughput. We use the
series form of the characteristic function (ch.f.) method to predict distortion
spectra for multi-carrier transmissions. This method results in
efficient calculations of individual signal and distortion components. The
method is validated both theoretically and practically. Theoretical validation
is performed by modeling the signal as a bandpass Gaussian process that is hard
limited, and it is shown that the series ch.f. method produces results that are
identical with the classical Price's theorem. Practical validation is shown by
considering an orthogonal frequency division multiplexing (OFDM) signal with a
fragmented spectrum, which is then applied to an amplifier driven into
saturation, for which application of Price's theorem is difficult, and the
predicted output spectrum corroborates laboratory measurements. Part of the
computational efficiency is realized in that the nonlinearity can be expressed
as the fast Fourier transform (FFT) of samples of its forward scattering
parameter (i.e., S21) or transconductance function (including AM-PM effects),
and distortion contributions of the signal can be expressed as numerical
autoconvolutions of the clean spectrum. Signal-to-distortion ratio (SDR) can be
easily computed and parameterized across variables of interest, such as
overdrive level.
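The autoconvolution claim can be illustrated directly: an n-th order nonlinearity maps the signal spectrum to its n-fold autoconvolution, which is why third-order products occupy roughly three times the original bandwidth. A minimal numpy sketch with a toy flat spectrum (not the paper's fragmented-OFDM setup):

```python
import numpy as np

S = np.zeros(64)
S[28:36] = 1.0  # toy clean spectrum: 8 bins wide

# Third-order distortion term: three-fold autoconvolution of the spectrum.
S3 = np.convolve(np.convolve(S, S), S)

def support(x):
    """Width of the nonzero region, in bins."""
    idx = np.nonzero(x > 1e-12)[0]
    return idx[-1] - idx[0] + 1

assert support(S) == 8
assert support(S3) == 3 * 8 - 2  # convolution widens support to 3B - 2 bins
```

The same bookkeeping, carried out with FFTs of the measured S21/transconductance samples, is what makes the series ch.f. prediction computationally cheap.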
Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders
Authors: Nathan Paek, Yongyi Zang, Qihui Yang, Randal Leistikow
2025-10-27
While sparse autoencoders (SAEs) successfully extract interpretable features
from language models, applying them to audio generation faces unique
challenges: audio's dense nature requires compression that obscures semantic
meaning, and automatic feature characterization remains limited. We propose a
framework for interpreting audio generative models by mapping their latent
representations to human-interpretable acoustic concepts. We train SAEs on
audio autoencoder latents, then learn linear mappings from SAE features to
discretized acoustic properties (pitch, amplitude, and timbre). This enables
both controllable manipulation and analysis of the AI music generation process,
revealing how acoustic properties emerge during synthesis. We validate our
approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer)
audio latent spaces, and analyze DiffRhythm, a state-of-the-art text-to-music
model, to demonstrate how pitch, timbre, and loudness evolve throughout
generation. While our work addresses only the audio modality, our framework can
be extended to interpretable analysis of visual latent-space generative models.
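The probing step, learning a linear map from SAE feature activations to a discretized acoustic property, can be sketched with synthetic data; the feature matrix and toy "pitch" target below are illustrative assumptions, whereas the real pipeline first trains SAEs on audio-autoencoder latents:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_feat = 200, 32
F = rng.standard_normal((n, d_feat))  # stand-in for SAE feature activations
w_true = rng.standard_normal(d_feat)
pitch = F @ w_true                    # toy continuous acoustic property

# Learn the linear mapping by least squares; noiseless, so recovery is exact.
w_hat, *_ = np.linalg.lstsq(F, pitch, rcond=None)
assert np.allclose(w_hat, w_true)

# Discretize into 3 bins (cf. the paper's "discretized acoustic properties")
# and check the probe predicts the correct bin everywhere.
edges = np.quantile(pitch, [1 / 3, 2 / 3])
assert (np.digitize(F @ w_hat, edges) == np.digitize(pitch, edges)).all()
```

With real SAE features the fit is only approximate, and the quality of such linear probes is what indicates how cleanly an acoustic concept is encoded.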
CountFormer A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting
Authors: Md Tanvir Hossain, Akif Islam, Mohd Ruhul Ameen
2025-10-27
Humans can effortlessly count diverse objects by perceiving visual repetition
and structural relationships rather than relying on class identity. However,
most existing counting models fail to replicate this ability; they often
miscount when objects exhibit complex shapes, internal symmetry, or overlapping
components. In this work, we introduce CountFormer, a transformer-based
framework that learns to recognize repetition and structural coherence for
class-agnostic object counting. Built upon the CounTR architecture, our model
replaces its visual encoder with the self-supervised foundation model DINOv2,
which produces richer and spatially consistent feature representations. We
further incorporate positional embedding fusion to preserve geometric
relationships before decoding these features into density maps through a
lightweight convolutional decoder. Evaluated on the FSC-147 dataset, our model
achieves performance comparable to current state-of-the-art methods while
demonstrating superior accuracy on structurally intricate or densely packed
scenes. Our findings indicate that integrating foundation models such as DINOv2
enables counting systems to approach human-like structural perception,
advancing toward a truly general and exemplar-free counting paradigm.
BitSkip An Empirical Analysis of Quantization and Early Exit Composition
Authors: Ramshankar Bhuvaneswaran, Handan Liu
2025-10-27
The pursuit of efficient Large Language Models (LLMs) has led to increasingly
complex techniques like extreme quantization and dynamic routing. While
individual benefits of these methods are well-documented, their compositional
effects remain poorly understood. This paper introduces BitSkip, a hybrid
architectural framework for systematically exploring these interactions.
Counter-intuitively, our findings reveal that a simple 8-bit quantized model
without Hadamard transform (BitSkip-V1) not only outperforms its more complex
4-bit and Hadamard-enhanced counterparts but also competes with the
full-precision baseline in quality (perplexity of 1.13 vs. 1.19). The
introduction of Hadamard transforms, even at 8-bit precision, catastrophically
degraded performance by over 37,000%, which we trace to fundamental training
instability. Our BitSkip-V1 recipe demonstrates superior early-exit
characteristics, with layer 18 providing an optimal 32.5% speed gain for a
minimal 4% quality loss.
Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization
Authors: Bernardo Torres, Manuel Moussallam, Gabriel Meseguer-Brocal
2025-10-27
Audio autoencoders learn useful, compressed audio representations, but their
non-linear latent spaces prevent intuitive algebraic manipulation such as
mixing or scaling. We introduce a simple training methodology to induce
linearity in a high-compression Consistency Autoencoder (CAE) by using data
augmentation, thereby inducing homogeneity (equivariance to scalar gain) and
additivity (the decoder preserves addition) without altering the model's
architecture or loss function. When trained with our method, the CAE exhibits
linear behavior in both the encoder and decoder while preserving reconstruction
fidelity. We test the practical utility of our learned space on music source
composition and separation via simple latent arithmetic. This work presents a
straightforward technique for constructing structured latent spaces, enabling
more intuitive and efficient audio processing.
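The two properties being induced can be checked directly: homogeneity means enc(a·x) = a·enc(x), and additivity means enc(x + y) = enc(x) + enc(y). A trained CAE satisfies these only approximately; the sketch below uses an exactly linear toy "encoder" so the checks pass by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.standard_normal((8, 128))  # toy linear encoder matrix (illustrative)
enc = lambda x: E @ x

x = rng.standard_normal(128)
y = rng.standard_normal(128)
a = 0.5

assert np.allclose(enc(a * x), a * enc(x))       # homogeneity (scalar gain)
assert np.allclose(enc(x + y), enc(x) + enc(y))  # additivity (mixing)
```

Once both hold, latent arithmetic becomes meaningful: decoding enc(x) + enc(y) yields (approximately) the mix of the two sources, which is what the paper exploits for source composition and separation.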
Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier
Authors: Hyeongseop Rha, Jeong Hun Yeo, Yeonju Kim, Yong Man Ro
2025-10-27
The recent advancement of Multimodal Large Language Models (MLLMs) is
transforming human-computer interaction (HCI) from surface-level exchanges into
more nuanced and emotionally intelligent communication. To realize this shift,
emotion understanding becomes essential, allowing systems to capture subtle cues
underlying user intent. Furthermore, providing faithful explanations for
predicted emotions is crucial to ensure interpretability and build user trust.
However, current MLLM-based methods often generate emotion explanations that
diverge from the target labels and sometimes even contradict their own
predicted emotions. This inconsistency poses a critical risk for
misunderstanding and erodes reliability in interactive settings. To address
this, we propose a novel approach: the Emotional Rationale Verifier (ERV) and
an Explanation Reward. Our method guides the model to produce reasoning that is
explicitly consistent with the target emotion during multimodal emotion
recognition without modifying the model architecture or requiring additional
paired video-description annotations. Our method significantly improves
faithful explanation-prediction consistency and explanation emotion accuracy on
the MAFW and DFEW datasets. Through extensive experiments and human
evaluations, we show that our approach not only enhances alignment between
explanation and prediction but also empowers MLLMs to deliver emotionally
coherent, trustworthy interactions, marking a key step toward truly human-like
HCI systems.
Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving
Authors: Xinyu Wang, Jonas M. Kübler, Kailash Budhathoki, Yida Wang, Matthäus Kleindessner
2025-10-27
When a single base LLM is served with several different LoRA adapters
simultaneously, the adapters cannot simply be merged with the base model's
weights, as adapter swapping would create overhead and requests using
different adapters could not be batched. Rather, the LoRA computations have to
be separated from the base LLM computations, and in a multi-device setup the
LoRA adapters can be sharded in a way that is well aligned with the base
model's tensor parallel execution, as proposed in S-LoRA. However, the S-LoRA
sharding strategy encounters some communication overhead, which may be small in
theory, but can be large in practice. In this paper, we propose to constrain
certain LoRA factors to be block-diagonal, which allows for an alternative way
of sharding LoRA adapters that does not require any additional communication
for the LoRA computations. We demonstrate in extensive experiments that our
block-diagonal LoRA approach is similarly parameter efficient as standard LoRA
(i.e., for a similar number of parameters it achieves similar downstream
performance) and that it leads to significant end-to-end speed-up over S-LoRA.
For example, when serving on eight A100 GPUs, we observe up to 1.79x (1.23x)
end-to-end speed-up with 0.87x (1.74x) the number of adapter parameters for
Llama-3.1-70B, and up to 1.63x (1.3x) end-to-end speed-up with 0.86x (1.73x)
the number of adapter parameters for Llama-3.1-8B.
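Why a block-diagonal factor shards without communication: if one LoRA factor is block-diagonal over k devices, each device can compute its shard of A·x from its local slice of x alone, so no all-gather is needed on the LoRA path. A small sketch; the shapes and the k=2 split are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, k = 8, 4, 2  # model dim, LoRA rank, tensor-parallel degree
blocks = [rng.standard_normal((r // k, d // k)) for _ in range(k)]

# Assemble the block-diagonal LoRA factor A from per-device blocks.
A = np.zeros((r, d))
for i, blk in enumerate(blocks):
    A[i * (r // k):(i + 1) * (r // k), i * (d // k):(i + 1) * (d // k)] = blk

x = rng.standard_normal(d)
full = A @ x  # what a single device would compute
# Each "device" computes its output shard from only its local input slice.
sharded = np.concatenate([blk @ x[i * (d // k):(i + 1) * (d // k)]
                          for i, blk in enumerate(blocks)])
assert np.allclose(full, sharded)
```

With a dense A, each output shard would depend on all of x, which is exactly where S-LoRA's extra communication comes from.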
Adaptive Blockwise Search Inference-Time Alignment for Large Language Models
Authors: Mohammad Atif Quamar, Mohammad Areeb, Nishant Sharma, Ananth Shreekumar, Jonathan Rosenthal, Muslum Ozgur Ozmen, Mikhail Kuznetsov, Z. Berkay Celik
2025-10-27
LLM alignment remains a critical challenge. Inference-time methods provide a
flexible alternative to fine-tuning, but their uniform computational effort
often yields suboptimal alignment. We hypothesize that for many alignment
tasks, the initial tokens of a response are disproportionately more critical.
To leverage this principle, we introduce AdaSearch, a novel blockwise search
strategy. It adaptively allocates a fixed computational budget using a sampling
schedule, focusing search effort on these critical tokens. We apply AdaSearch
to sequential decoding and introduce its tree-search counterpart, AdaBeam. Our
comprehensive evaluation across eight LLMs demonstrates that AdaSearch
outperforms strong Best-of-N and fine-tuning baselines. Specifically, win-rates
improve by over 10% for harmlessness generation, controlled sentiment
generation, and for mathematical reasoning tasks relative to Best-of-N.
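The hypothesis that early tokens matter most suggests a front-loaded sampling schedule: spend more of a fixed candidate budget on the first blocks and decay thereafter. The geometric decay below is an illustrative assumption, not AdaSearch's exact schedule:

```python
def block_budget(total, n_blocks, decay=0.5):
    """Split a fixed search budget across blocks, front-loading early ones."""
    weights = [decay ** i for i in range(n_blocks)]
    scale = total / sum(weights)
    # At least one candidate per block, proportional to the decayed weight.
    return [max(1, round(w * scale)) for w in weights]

alloc = block_budget(total=28, n_blocks=4)
assert alloc[0] > alloc[-1]  # search effort concentrates on early tokens
```

Under this schedule the first response block gets roughly half the candidates, mirroring the paper's claim that initial tokens are disproportionately critical for alignment.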
Evaluation of Vision-LLMs in Surveillance Video
Authors: Pascal Benschop, Cristian Meo, Justin Dauwels, Jelte P. Mense
2025-10-27
The widespread use of cameras in our society has created an overwhelming
amount of video data, far exceeding the capacity for human monitoring. This
presents a critical challenge for public safety and security, as the timely
detection of anomalous or criminal events is crucial for effective response and
prevention. The ability for an embodied agent to recognize unexpected events is
fundamentally tied to its capacity for spatial reasoning. This paper
investigates the spatial reasoning of vision-language models (VLMs) by framing
anomalous action recognition as a zero-shot, language-grounded task, addressing
the embodied perception challenge of interpreting dynamic 3D scenes from
2D video. Specifically, we investigate whether small, pre-trained vision-LLMs
can act as spatially grounded, zero-shot anomaly detectors by converting video
into text descriptions and scoring labels via textual entailment. We evaluate
four open models on UCF-Crime and RWF-2000 under prompting and
privacy-preserving conditions. Few-shot exemplars can improve accuracy for some
models, but may increase false positives, and privacy filters -- especially
full-body GAN transforms -- introduce inconsistencies that degrade accuracy.
These results chart where current vision-LLMs succeed (simple, spatially
salient events) and where they falter (noisy spatial cues, identity
obfuscation). Looking forward, we outline concrete paths to strengthen spatial
grounding without task-specific training: structure-aware prompts, lightweight
spatial memory across clips, scene-graph or 3D-pose priors during description,
and privacy methods that preserve action-relevant geometry. This positions
zero-shot, language-grounded pipelines as adaptable building blocks for
embodied, real-world video understanding. Our implementation for evaluating
VLMs is publicly available at:
https://github.com/pascalbenschopTU/VLLM_AnomalyRecognition
Lost in Tokenization Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs
Authors: Kai Zhuang, Jiawei Zhang, Yumou Liu, Hanqun Cao, Chunbin Gu, Mengdi Liu, Zhangyang Gao, Zitong Jerry Wang, Xuanhe Zhou, Pheng-Ann Heng, Lijun Wu, Conghui He, Cheng Tan
2025-10-27
Scientific Large Language Models (Sci-LLMs) have emerged as a promising
frontier for accelerating biological discovery. However, these models face a
fundamental challenge when processing raw biomolecular sequences: the
tokenization dilemma. Whether treating sequences as a specialized language,
risking the loss of functional motif information, or as a separate modality,
introducing formidable alignment challenges, current strategies fundamentally
limit their reasoning capacity. We challenge this sequence-centric paradigm by
positing that a more effective strategy is to provide Sci-LLMs with high-level
structured context derived from established bioinformatics tools, thereby
bypassing the need to interpret low-level noisy sequence data directly. Through
a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we
tested three input modes: sequence-only, context-only, and a combination of
both. Our findings are striking: the context-only approach consistently and
substantially outperforms all other modes. Even more revealing, the inclusion
of the raw sequence alongside its high-level context consistently degrades
performance, indicating that raw sequences act as informational noise, even for
models with specialized tokenization schemes. These results suggest that the
primary strength of existing Sci-LLMs lies not in their nascent ability to
interpret biomolecular syntax from scratch, but in their profound capacity for
reasoning over structured, human-readable knowledge. Therefore, we argue for
reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines
over expert knowledge. This work lays the foundation for a new class of hybrid
scientific AI agents, repositioning the developmental focus from direct
sequence interpretation towards high-level knowledge synthesis. The code is
available at https://github.com/opendatalab-raiser/CoKE.
Beyond Imprecise Distance Metrics LLM-Predicted Target Call Stacks for Directed Greybox Fuzzing
Authors: Yifan Zhang, Xin Zhang
2025-10-27
Directed greybox fuzzing (DGF) aims to efficiently trigger bugs at specific
target locations by prioritizing seeds whose execution paths are more likely to
mutate into triggering target bugs. However, existing DGF approaches suffer
from imprecise probability calculations due to their reliance on complex
distance metrics derived from static analysis. The over-approximations inherent
in static analysis cause a large number of irrelevant execution paths to be
mistakenly considered to potentially mutate into triggering target bugs,
significantly reducing fuzzing efficiency. We propose to replace static
analysis-based distance metrics with precise call stack representations. Call
stacks represent precise control flows, thereby avoiding false information in
static analysis. We leverage large language models (LLMs) to predict
vulnerability-triggering call stacks for guiding seed prioritization. Our
approach constructs call graphs through static analysis to identify methods
that can potentially reach target locations, then utilizes LLMs to predict the
most likely call stack sequence that triggers the vulnerability. Seeds whose
execution paths have higher overlap with the predicted call stack are
prioritized for mutation. This is the first work to integrate LLMs into the
core seed prioritization mechanism of DGF. We implement our approach and
evaluate it against several state-of-the-art fuzzers. On a suite of real-world
programs, our approach triggers vulnerabilities faster than baselines. In
addition, our approach identifies 10 new
vulnerabilities and 2 incomplete fixes in the latest versions of programs used
in our controlled experiments through directed patch testing, with 10 assigned
CVE IDs.
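The seed-prioritization idea can be sketched as scoring each seed by how much of the predicted target call stack its execution path covers, then mutating high-scoring seeds first. The function names and the coverage-fraction scoring rule are illustrative assumptions, not the paper's exact metric:

```python
def stack_overlap(predicted_stack, executed_calls):
    """Fraction of the predicted call stack covered by a seed's execution."""
    executed = set(executed_calls)
    hits = sum(1 for fn in predicted_stack if fn in executed)
    return hits / len(predicted_stack)

# Hypothetical LLM-predicted vulnerability-triggering call stack.
predicted = ["main", "parse_input", "handle_record", "memcpy_field"]
seeds = {
    "seed_a": ["main", "parse_input", "emit_stats"],
    "seed_b": ["main", "parse_input", "handle_record", "memcpy_field"],
}
ranked = sorted(seeds, key=lambda s: stack_overlap(predicted, seeds[s]),
                reverse=True)
assert ranked[0] == "seed_b"  # the full-overlap seed is mutated first
```

Because call stacks encode precise control flow, this scoring avoids the over-approximation that makes static-analysis distance metrics admit irrelevant paths.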
P1GPT a multi-agent LLM workflow module for multi-modal financial information analysis
Authors: Chen-Che Lu, Yun-Cheng Chou, Teng-Ruei Chen
2025-10-27
Recent advances in large language models (LLMs) have enabled multi-agent
reasoning systems capable of collaborative decision-making. However, in
financial analysis, most frameworks remain narrowly focused on either isolated
single-agent predictors or loosely connected analyst ensembles, and they lack a
coherent reasoning workflow that unifies diverse data modalities. We introduce
P1GPT, a layered multi-agent LLM framework for multi-modal financial
information analysis and interpretable trading decision support. Unlike prior
systems that emulate trading teams through role simulation, P1GPT implements a
structured reasoning pipeline that systematically fuses technical, fundamental,
and news-based insights through coordinated agent communication and
integration-time synthesis. Backtesting on multi-modal datasets across major
U.S. equities demonstrates that P1GPT achieves superior cumulative and
risk-adjusted returns, maintains low drawdowns, and provides transparent causal
rationales. These findings suggest that structured reasoning workflows, rather
than agent role imitation, offer a scalable path toward explainable and
trustworthy financial AI systems.
UGAE Unified Geometry and Attribute Enhancement for G-PCC Compressed Point Clouds
Authors: Pan Zhao, Hui Yuan, Chongzhen Tian, Tian Guo, Raouf Hamzaoui, Zhigeng Pan
2025-10-27
Lossy compression of point clouds reduces storage and transmission costs;
however, it inevitably leads to irreversible distortion in geometry structure
and attribute information. To address these issues, we propose a unified
geometry and attribute enhancement (UGAE) framework, which consists of three
core components: post-geometry enhancement (PoGE), pre-attribute enhancement
(PAE), and post-attribute enhancement (PoAE). In PoGE, a Transformer-based
convolutional U-Net is used to reconstruct the geometry structure with
high precision by predicting voxel occupancy probabilities. Building on the
refined geometry structure, PAE introduces an innovative enhanced
geometry-guided recoloring strategy, which uses a detail-aware K-Nearest
Neighbors (DA-KNN) method to achieve accurate recoloring and effectively
preserve high-frequency details before attribute compression. Finally, at the
decoder side, PoAE uses an attribute residual prediction network with a
weighted mean squared error (W-MSE) loss to enhance the quality of
high-frequency regions while maintaining the fidelity of low-frequency regions.
UGAE significantly outperformed existing methods on three benchmark datasets:
8iVFB, Owlii, and MVUB. Compared to the latest G-PCC test model (TMC13v29),
UGAE achieved an average BD-PSNR gain of 9.98 dB and 90.98% BD-bitrate savings
for geometry under the D1 metric, as well as a 3.67 dB BD-PSNR improvement with
56.88% BD-bitrate savings for attributes on the Y component. Additionally, it
improved perceptual quality significantly.
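The W-MSE loss in the PoAE stage can be sketched as an MSE whose per-point weights up-weight high-frequency (detail) regions so their residual errors dominate the objective. The concrete weight values below are illustrative assumptions:

```python
import numpy as np

def w_mse(pred, target, weights):
    """Weighted MSE: weighted average of squared per-point errors."""
    return float(np.average((pred - target) ** 2, weights=weights))

target = np.array([1.0, 1.0, 5.0, 1.0])  # one "high-frequency" detail point
pred = np.array([1.0, 1.0, 4.0, 1.0])    # reconstruction misses that point
flat_w = np.ones(4)                      # plain MSE weighting
hf_w = np.array([1.0, 1.0, 3.0, 1.0])    # up-weight the detail region

# The same error is penalized more under detail-aware weighting.
assert w_mse(pred, target, hf_w) > w_mse(pred, target, flat_w)
```

Up-weighting detail regions pushes the residual network to fix high-frequency attributes first while low-frequency regions, already well reconstructed, contribute little gradient.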
Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition
Authors: Jing-Xuan Zhang, Genshun Wan, Jin Li, Jianqing Gao
2025-10-27
Unified speech recognition aims to perform auditory, visual, and audiovisual
speech recognition within a single model framework. While speech foundation
models (SFMs) have demonstrated remarkable performance in auditory tasks, their
adaptation to multimodal scenarios remains underexplored. This paper presents
UASR-LLM, a novel framework that adapts frozen SFMs to unified VSR, ASR, and
AVSR tasks by leveraging large language models (LLMs) as text decoders. Our
approach introduces visual representations into multiple SFM layers through
visual injection modules, enabling multimodal input processing and unified
hidden representations. The augmented SFMs connect with decoder-only LLMs via a
feed-forward adaptor, where concatenated representations and instruction
prompts guide speech transcription. We implement a two-stage training strategy:
visual injection pretraining followed by speech recognition finetuning. SFM
parameters remain frozen throughout training, with only visual injection
modules optimized initially, and LLMs finetuned using LoRA parameters
subsequently. Experimental results demonstrate superior performance over
state-of-the-art baselines across VSR, ASR, and AVSR tasks under both clean and
noisy conditions. Ablation studies confirm generalization across various SFMs
and LLMs, validating the proposed training strategy.
Switchable Token-Specific Codebook Quantization For Face Image Compression
Authors: Yongbo Wang, Haonan Wang, Guodong Mu, Ruixin Zhang, Jiaqi Chen, Jingyun Zhang, Jun Wang, Yuan Xie, Zhizhong Zhang, Shouhong Ding
2025-10-27
With the ever-increasing volume of visual data, the efficient and lossless
transmission, along with its subsequent interpretation and understanding, has
become a critical bottleneck in modern information systems. Emerging
codebook-based solutions utilize a globally shared codebook to encode and
decode each token, controlling the bpp by adjusting the number of tokens or
the codebook size. However, for facial images, which are rich in attributes,
such global codebook strategies overlook both the category-specific
correlations within images and the semantic differences among tokens, resulting
in suboptimal performance, especially at low bpp. Motivated by these
observations, we propose a Switchable Token-Specific Codebook Quantization for
face image compression, which learns distinct codebook groups for different
image categories and assigns an independent codebook to each token. By
recording the codebook group to which each token belongs with a small number of
bits, our method can reduce the loss incurred when decreasing the size of each
codebook group. This enables a larger total number of codebooks under a lower
overall bpp, thereby enhancing the expressive capability and improving
reconstruction performance. Owing to its generalizable design, our method can
be integrated into any existing codebook-based representation learning approach
and has demonstrated its effectiveness on face recognition datasets, achieving
an average accuracy of 93.51% for reconstructed images at 0.05 bpp.
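The rate trade-off behind the switchable design is simple arithmetic: recording a per-token group index costs a few extra bits, but lets each group's codebook shrink. The numbers below are illustrative, not the paper's configuration:

```python
import math

def bits_per_token(codebook_size, n_groups=1):
    """Index bits per token: codebook index plus optional group index."""
    group_bits = math.log2(n_groups) if n_groups > 1 else 0.0
    return math.log2(codebook_size) + group_bits

shared = bits_per_token(1024)                  # one global 1024-entry codebook
switchable = bits_per_token(256, n_groups=4)   # 4 groups of 256 entries each

# Same per-token rate, but the switchable variant spends its 1024 total
# entries on four specialized codebooks instead of one generic one.
assert shared == switchable == 10.0
```

At equal bpp, specialization is where the reconstruction gain comes from; conversely, shrinking each group's codebook below this break-even point lowers the overall bpp, as the abstract describes.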
How Can AI Augment Access to Justice? Public Defenders' Perspectives on AI Adoption
Authors: Inyoung Cheong, Patty Liu, Dominik Stammbach, Peter Henderson
2025-10-27
Public defenders are asked to do more with less: representing clients
deserving of adequate counsel while facing overwhelming caseloads and scarce
resources. While artificial intelligence (AI) and large language models (LLMs)
are promoted as tools to alleviate this burden, such proposals are detached
from the lived realities of public defenders. This study addresses that gap
through semi-structured interviews with fourteen practitioners across the
United States to examine their experiences with AI, anticipated applications,
and ethical concerns. We find that AI adoption is constrained by costs,
restrictive office norms, confidentiality risks, and unsatisfactory tool
quality. To clarify where AI can and cannot contribute, we propose a task-level
map of public defense. Public defenders view AI as most useful for evidence
investigation to analyze overwhelming amounts of digital records, with narrower
roles in legal research & writing, and client communication. Courtroom
representation and defense strategy are considered least compatible with AI
assistance, as they depend on contextual judgment and trust. Public defenders
emphasize safeguards for responsible use, including mandatory human
verification, limits on overreliance, and the preservation of the relational
aspects of lawyering. Building on these findings, we outline a research agenda
that
promotes equitable access to justice by prioritizing open-source models,
domain-specific datasets and evaluation, and participatory design that
incorporates defenders' perspectives into system development.
Rethinking Inference Placement for Deep Learning across Edge and Cloud Platforms A Multi-Objective Optimization Perspective and Future Directions
Authors: Zongshun Zhang, Ibrahim Matta
2025-10-27
Edge intelligent applications like VR/AR and language model based chatbots
have become widespread with the rapid expansion of IoT and mobile devices.
However, constrained edge devices often cannot serve the increasingly large and
complex deep learning (DL) models. To mitigate these challenges, researchers
have proposed optimizing and offloading partitions of DL models among user
devices, edge servers, and the cloud. In this setting, users can take advantage
of different services to support their intelligent applications. For example,
edge resources offer low response latency. In contrast, cloud platforms provide
low monetary cost computation resources for computation-intensive workloads.
However, communication between DL model partitions can introduce transmission
bottlenecks and pose risks of data leakage. Recent research aims to balance
accuracy, computation delay, transmission delay, and privacy concerns. They
address these issues with model compression, model distillation, transmission
compression, and model architecture adaptations, including internal
classifiers. This survey contextualizes the state-of-the-art model offloading
methods and model adaptation techniques by studying their implications for a
multi-objective optimization comprising inference latency, data privacy, and
resource monetary cost.
Batch Speculative Decoding Done Right
Authors: Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li, Rui Zhang
2025-10-26
Speculative decoding speeds up inference by using a small draft model to
propose multiple tokens that a target model verifies in parallel. Extending
this idea to batches is essential for production serving, but it introduces the
ragged tensor problem: sequences in the same batch accept different numbers of
draft tokens, breaking right-alignment and corrupting position IDs, attention
masks, and KV-cache state. We show that several existing batch implementations
violate output equivalence, the fundamental requirement that speculative
decoding must produce identical token sequences to standard autoregressive
generation. These violations occur precisely due to improper handling of the
ragged tensor problem. In response, we (1) characterize the synchronization
requirements that guarantee correctness, (2) present EQSPEC, a correctness-first
batch speculative decoding scheme that exposes realignment as consuming 40% of
overhead, and (3) introduce EXSPEC, which maintains a sliding pool of sequences
and dynamically forms same-length groups, to reduce the realignment overhead
while preserving per-sequence speculative speedups. On the SpecBench dataset,
across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our
approach achieves up to 3x throughput improvement at batch size 8
compared to batch size 1, with efficient scaling through batch size 8, while
maintaining 95% output equivalence. Our method requires no custom kernels and
integrates cleanly with existing inference stacks. Our code is available at
https://github.com/eBay/spec_dec.
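The ragged-tensor problem the abstract describes can be sketched in a few lines. This is our toy illustration, not the EQSPEC/EXSPEC code: `realign_batch`, the `PAD` sentinel, and the exact-prefix acceptance rule are our own simplifications of the batch realignment step.

```python
# Toy illustration (ours, not EQSPEC/EXSPEC) of the ragged tensor
# problem: after parallel verification each sequence accepts a
# different number of draft tokens, so the batch must be re-padded and
# its position IDs recomputed before the next decoding step.

PAD = -1

def accepted_prefix_len(draft, target):
    """Standard speculative acceptance: longest exactly-matching prefix."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n

def realign_batch(contexts, drafts, targets):
    new_seqs = []
    for ctx, draft, target in zip(contexts, drafts, targets):
        k = accepted_prefix_len(draft, target)
        seq = ctx + draft[:k]
        if k < len(target):          # append the target's correction token
            seq = seq + [target[k]]
        new_seqs.append(seq)
    # Left-pad to restore right-alignment, then rebuild position IDs so
    # real tokens count from 0 upward (pad positions would be masked
    # out by the attention mask in a real system).
    width = max(len(s) for s in new_seqs)
    padded = [[PAD] * (width - len(s)) + s for s in new_seqs]
    pos_ids = [[max(0, i - (width - len(s))) for i in range(width)]
               for s in new_seqs]
    return padded, pos_ids
```

If either step is skipped, sequences in the batch drift out of alignment and position IDs no longer match each sequence's true length, which is exactly the output-equivalence failure mode the paper diagnoses.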
Sub-microsecond Transformers for Jet Tagging on FPGAs
Authors: Lauri Laatu, Chang Sun, Arianna Cox, Abhijith Gandrakota, Benedikt Maier, Jennifer Ngadiuba, Zhiqiang Que, Wayne Luk, Maria Spiropulu, Alexander Tapper
2025-10-26
We present the first sub-microsecond transformer implementation on an FPGA
achieving competitive performance for state-of-the-art high-energy physics
benchmarks. Transformers have shown exceptional performance on multiple tasks
in modern machine learning applications, including jet tagging at the CERN
Large Hadron Collider (LHC). However, their computational complexity has until
now prohibited their use in real-time applications, such as the hardware
trigger systems of the collider experiments. In this work, we demonstrate the
first application of transformers for jet tagging on FPGAs, achieving
nanosecond latency with superior performance compared to alternative
baseline models. We leverage high-granularity quantization and distributed
arithmetic optimization to fit the entire transformer model on a
single FPGA, achieving the required throughput and latency. Furthermore, we add
multi-head attention and linear attention support to hls4ml, making our work
accessible to the broader fast machine learning community. This work advances
the next-generation trigger systems for the High Luminosity LHC, enabling the
use of transformers for real-time applications in high-energy physics and
beyond.
Long-Term PM2.5 Forecasting Using a DTW-Enhanced CNN-GRU Model
Authors: Amirali Ataee Naeini, Arshia Ataee Naeini, Fatemeh Karami Mohammadi, Omid Ghaffarpasand
2025-10-26
Reliable long-term forecasting of PM2.5 concentrations is critical for public
health early-warning systems, yet existing deep learning approaches struggle to
maintain prediction stability beyond 48 hours, especially in cities with sparse
monitoring networks. This paper presents a deep learning framework that
combines Dynamic Time Warping (DTW) for intelligent station similarity
selection with a CNN-GRU architecture to enable extended-horizon PM2.5
forecasting in Isfahan, Iran, a city characterized by complex pollution
dynamics and limited monitoring coverage. Unlike existing approaches that rely
on computationally intensive transformer models or external simulation tools,
our method integrates three key innovations: (i) DTW-based historical sampling
to identify similar pollution patterns across peer stations, (ii) a lightweight
CNN-GRU architecture augmented with meteorological features, and (iii) a
scalable design optimized for sparse monitoring networks. Experimental validation using
multi-year hourly data from eight monitoring stations demonstrates superior
performance compared to state-of-the-art deep learning methods, achieving R2 =
0.91 for 24-hour forecasts. Notably, this is the first study to demonstrate
stable 10-day PM2.5 forecasting (R2 = 0.73 at 240 hours) without performance
degradation, addressing critical early-warning system requirements. The
framework's computational efficiency and independence from external tools make
it particularly suitable for deployment in resource-constrained urban
environments.
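The DTW-based station-similarity step in (i) can be illustrated with the textbook dynamic-programming recurrence. This is a generic sketch, not the paper's code: the station names and series below are invented, and `rank_peer_stations` is our own helper name.

```python
# Illustrative sketch of ranking "peer" monitoring stations by Dynamic
# Time Warping distance to a target station's PM2.5 history.

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping moves
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def rank_peer_stations(target, stations):
    """Return station names sorted from most to least similar series."""
    return sorted(stations, key=lambda name: dtw_distance(target, stations[name]))
```

Because DTW allows elastic alignment in time, a station whose pollution episodes lead or lag the target by a few hours still ranks as a close peer, which is the point of using it instead of pointwise correlation.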
Leveraging Large Language Models to Identify Conversation Threads in Collaborative Learning
Authors: Prerna Ravi, Dong Won Lee, Beatriz Flamia, Jasmine David, Brandon Hanks, Cynthia Breazeal, Emma Anderson, Grace Lin
2025-10-26
Understanding how ideas develop and flow in small-group conversations is
critical for analyzing collaborative learning. A key structural feature of
these interactions is threading, the way discourse naturally organizes
into interwoven topical strands that evolve over time. While threading has been
widely studied in asynchronous text settings, detecting threads in synchronous
spoken dialogue remains challenging due to overlapping turns and implicit cues.
At the same time, large language models (LLMs) show promise for automating
discourse analysis but often struggle with long-context tasks that depend on
tracing these conversational links. In this paper, we investigate whether
explicit thread linkages can improve LLM-based coding of relational moves in
group talk. We contribute a systematic guidebook for identifying threads in
synchronous multi-party transcripts and benchmark different LLM prompting
strategies for automated threading. We then test how threading influences
performance on downstream coding of conversational analysis frameworks that
capture core collaborative actions such as agreeing, building, and eliciting.
Our results show that providing clear conversational thread information
improves LLM coding performance and underscores the heavy reliance of
downstream analysis on well-structured dialogue. We also discuss practical
trade-offs in time and cost, emphasizing where human-AI hybrid approaches can
yield the best value. Together, this work advances methods for combining LLMs
and robust conversational thread structures to make sense of complex, real-time
group interactions.
Region-Adaptive Learned Hierarchical Encoding for 3D Gaussian Splatting Data
Authors: Shashank N. Sridhara, Birendra Kathariya, Fangjun Pu, Peng Yin, Eduardo Pavez, Antonio Ortega
2025-10-26
We introduce Region-Adaptive Learned Hierarchical Encoding (RALHE) for 3D
Gaussian Splatting (3DGS) data. While 3DGS has recently become popular for
novel view synthesis, the size of trained models limits its deployment in
bandwidth-constrained applications such as volumetric media streaming. To
address this, we propose a learned hierarchical latent representation that
builds upon the principles of "overfitted" learned image codecs (e.g.,
Cool-Chic and C3) to efficiently encode 3DGS attributes. Unlike images, 3DGS
data have irregular spatial distributions of Gaussians (geometry) and consist
of multiple attributes (signals) defined on the irregular geometry. Our codec
is designed to account for these differences between images and 3DGS.
Specifically, we leverage the octree structure of the voxelized 3DGS geometry
to obtain a hierarchical multi-resolution representation. Our approach overfits
latents to each Gaussian attribute under a global rate constraint. These
latents are decoded independently through a lightweight decoder network. To
estimate the bitrate during training, we employ an autoregressive probability
model that leverages octree-derived contexts from the 3D point structure. The
multi-resolution latents, decoder, and autoregressive entropy coding networks
are jointly optimized for each Gaussian attribute. Experiments demonstrate that
the proposed RALHE compression framework achieves a rendering PSNR gain of up
to 2dB at low bitrates (less than 1 MB) compared to the baseline 3DGS
methods.
Iterative Layer Pruning for Efficient Translation Inference
Authors: Yasmin Moslem, Muhammad Hazim Al Farouq, John D. Kelleher
2025-10-26
Large language models (LLMs) have transformed many areas of natural language
processing, including machine translation. However, efficient deployment of
LLMs remains challenging due to their intensive computational requirements. In
this paper, we address this challenge and present our submissions to the Model
Compression track at the Conference on Machine Translation (WMT 2025). In our
experiments, we investigate iterative layer pruning guided by layer importance
analysis. We evaluate this method using the Aya-Expanse-8B model for
translation from Czech to German, and from English to Egyptian Arabic. Our
approach achieves substantial reductions in model size and inference time,
while maintaining the translation quality of the baseline models.
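The iterative, importance-guided loop the abstract describes can be sketched generically. This is our stand-in, not the paper's method: `quality` abstracts whatever translation-quality evaluation drives the importance analysis, and the greedy one-layer-at-a-time rule is our simplification.

```python
# Hedged sketch of iterative layer pruning: repeatedly drop the layer
# whose removal hurts a quality metric the least, re-evaluating after
# every removal so later decisions see the already-pruned model.

def iterative_layer_prune(layers, quality, n_remove):
    """layers: list of layer ids; quality: callable on a layer subset."""
    layers = list(layers)
    for _ in range(n_remove):
        # the "least important" layer is the one whose removal leaves
        # the highest remaining quality
        victim = max(layers, key=lambda l: quality([x for x in layers if x != l]))
        layers.remove(victim)
    return layers
```

Re-evaluating after each removal is what makes the procedure iterative rather than one-shot: a layer that looks redundant in the full model may become important once its neighbors are gone.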
Beyond Semantics How Temporal Biases Shape Retrieval in Transformer and State-Space Models
Authors: Anooshka Bajaj, Deven Mahesh Mistry, Sahaj Singh Maini, Yash Aggarwal, Zoran Tiganj
2025-10-26
In-context learning is governed by both temporal and semantic relationships,
shaping how Large Language Models (LLMs) retrieve contextual information.
Analogous to human episodic memory, where the retrieval of specific events is
enabled by separating events that happened at different times, this work probes
the ability of various pretrained LLMs, including transformer and state-space
models, to differentiate and retrieve temporally separated events.
Specifically, we prompted models with sequences containing multiple
presentations of the same token, which reappears at the sequence end. By fixing
the positions of these repeated tokens and permuting all others, we removed
semantic confounds and isolated temporal effects on next-token prediction.
Across diverse sequences, models consistently placed the highest probabilities
on tokens following a repeated token, but with a notable bias for those nearest
the beginning or end of the input. An ablation experiment linked this
phenomenon in transformers to induction heads. Extending the analysis to unique
semantic contexts with partial repetition further demonstrated that memories
embedded in the middle of a prompt are retrieved less reliably. Despite
architectural differences, state-space and transformer models showed comparable
temporal biases. Our findings deepen the understanding of temporal biases in
in-context learning and offer an illustration of how these biases can enable
temporal separation and episodic retrieval.
Rule-Based Explanations for Retrieval-Augmented LLM Systems
Authors: Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jarek Szlichta
2025-10-26
If-then rules are widely used to explain machine learning models; e.g., "if
employed = no, then loan application = rejected." We present the first proposal
to apply rules to explain the emerging class of large language models (LLMs)
with retrieval-augmented generation (RAG). Since RAG enables LLM systems to
incorporate retrieved information sources at inference time, rules linking the
presence or absence of sources can explain output provenance; e.g., "if a Times
Higher Education ranking article is retrieved, then the LLM ranks Oxford
first." To generate such rules, a brute force approach would probe the LLM with
all source combinations and check if the presence or absence of any sources
leads to the same output. We propose optimizations to speed up rule generation,
inspired by Apriori-like pruning from frequent itemset mining but redefined
within the scope of our novel problem. We conclude with qualitative and
quantitative experiments demonstrating our solutions' value and efficiency.
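The brute-force baseline with Apriori-style pruning can be sketched as follows. This is our reading of the setup, not the paper's algorithm: `query_llm` is a stand-in for probing the RAG system with a subset of sources, faked below by a deterministic function.

```python
from itertools import combinations

# Hedged sketch: search source subsets smallest-first for minimal
# if-then rules ("if these sources are retrieved, the output is O"),
# pruning any superset of an already-found rule, since it cannot be a
# *minimal* rule (the anti-monotone idea borrowed from Apriori).

def find_minimal_rules(sources, query_llm, target_output):
    rules = []
    for size in range(1, len(sources) + 1):
        for subset in combinations(sources, size):
            if any(set(rule) <= set(subset) for rule in rules):
                continue  # superset of a known rule: pruned, no LLM call
            if query_llm(subset) == target_output:
                rules.append(subset)
    return rules
```

The saving comes entirely from the `continue` branch: every pruned subset is one fewer LLM probe, which is where the brute-force approach's cost lives.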
Transformers from Compressed Representations
Authors: Juan C. Leon Alcazar, Mattia Soldan, Mohammad Saatialsoruji, Alejandro Pardo, Hani Itani, Juan Camilo Perez, Bernard Ghanem
2025-10-26
Compressed file formats are the cornerstone of efficient data storage and
transmission, yet their potential for representation learning remains largely
underexplored. We introduce TEMPEST (TransformErs froM comPressed
rEpreSenTations), a method that exploits the inherent byte-stream structure of
compressed files to design an effective tokenization and encoding strategy. By
leveraging this compact encoding, a standard transformer can directly learn
semantic representations from compressed data streams, bypassing the need for
raw byte-level processing or full media decoding. Our proposal substantially
reduces the number of tokens required for semantic classification, thereby
lowering both computational complexity and memory usage. Through extensive
experiments across diverse datasets, coding schemes, and modalities, we show
that TEMPEST achieves accuracy competitive with the state-of-the-art while
delivering efficiency gains in memory and compute.
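The token-reduction effect is easy to demonstrate on any compressed byte stream. This is our guess at the spirit of the pipeline, not TEMPEST's tokenizer: zlib stands in for whatever coding scheme is used, and treating each compressed byte as a token id is our simplification.

```python
import zlib

# Illustrative sketch: compress raw bytes, then tokenize the compressed
# byte stream directly (one token id per byte). The model then sees far
# fewer tokens than raw byte-level processing would require.

def byte_tokens(data):
    compressed = zlib.compress(data)
    return list(compressed)  # each byte is a token id in [0, 256)
```

On redundant inputs the compressed stream, and hence the token sequence, is dramatically shorter than the raw data, which is the efficiency gain the abstract claims.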
TVMC Time-Varying Mesh Compression via Multi-Stage Anchor Mesh Generation
Authors: He Huang, Qi Yang, Yiling Xu, Zhu Li, Jenq-Neng Hwang
2025-10-26
Time-varying meshes, characterized by dynamic connectivity and varying vertex
counts, hold significant promise for applications such as augmented reality.
However, their practical utilization remains challenging due to the substantial
data volume required for high-fidelity representation. While various
compression methods attempt to leverage temporal redundancy between consecutive
mesh frames, most struggle with topological inconsistency and motion-induced
artifacts. To address these issues, we propose Time-Varying Mesh Compression
(TVMC), a novel framework built on multi-stage coarse-to-fine anchor mesh
generation for inter-frame prediction. Specifically, the anchor mesh is
progressively constructed in three stages: initial, coarse, and fine. The
initial anchor mesh is obtained through fast topology alignment to exploit
temporal coherence. A Kalman filter-based motion estimation module then
generates a coarse anchor mesh by accurately compensating inter-frame motions.
Subsequently, a Quadric Error Metric-based refinement step optimizes vertex
positions to form a fine anchor mesh with improved geometric fidelity. Based on
the refined anchor mesh, the inter-frame motions relative to the reference base
mesh are encoded, while the residual displacements between the subdivided fine
anchor mesh and the input mesh are adaptively quantized and compressed. This
hierarchical strategy preserves consistent connectivity and high-quality
surface approximation, while achieving an efficient and compact representation
of dynamic geometry. Extensive experiments on standard MPEG dynamic mesh
sequences demonstrate that TVMC achieves state-of-the-art compression
performance. Compared to the latest V-DMC standard, it delivers a significant
BD-rate gain of 10.2% ~ 16.9%, while preserving high reconstruction quality.
The code is available at https://github.com/H-Huang774/TVMC.
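The Kalman-filter motion estimation step can be illustrated with the standard constant-velocity filter on one scalar coordinate. This is a textbook toy, not TVMC's module: the noise parameters `q` and `r` are invented, and real vertex motion would be filtered per coordinate in 3D.

```python
# Toy constant-velocity Kalman filter (dt = 1) for one coordinate of a
# vertex tracked across frames; the smoothed positions are what an
# inter-frame motion-compensation step could build on.

def kalman_track(zs, q=1e-3, r=1e-2):
    x, v = zs[0], 0.0                       # state: position, velocity
    p = [[1.0, 0.0], [0.0, 1.0]]            # state covariance
    est = [x]
    for z in zs[1:]:
        # predict: x' = x + v; covariance pushed through F = [[1,1],[0,1]]
        x = x + v
        p = [[p[0][0] + p[0][1] + p[1][0] + p[1][1] + q, p[0][1] + p[1][1]],
             [p[1][0] + p[1][1], p[1][1] + q]]
        # update with position measurement z (H = [1, 0])
        s = p[0][0] + r
        k0, k1 = p[0][0] / s, p[1][0] / s
        innov = z - x
        x, v = x + k0 * innov, v + k1 * innov
        p = [[(1 - k0) * p[0][0], (1 - k0) * p[0][1]],
             [p[1][0] - k1 * p[0][0], p[1][1] - k1 * p[0][1]]]
        est.append(x)
    return est
```

Because the filter carries a velocity state, it extrapolates smooth motion between frames instead of chasing each noisy measurement, which is what makes Kalman-style estimates useful for motion compensation.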
AI-Driven Carbon Monitoring Transformer-Based Reconstruction of Atmospheric CO2 in Canadian Poultry Regions
Authors: Padmanabhan Jagannathan Prajesh, Kaliaperumal Ragunath, Miriam Gordon, Bruce Rathgeber, Suresh Neethirajan
2025-10-26
Accurate mapping of column-averaged CO2 (XCO2) over agricultural landscapes
is essential for guiding emission mitigation strategies. We present a
Spatiotemporal Vision Transformer with Wavelets (ST-ViWT) framework that
reconstructs continuous, uncertainty-quantified XCO2 fields from OCO-2 across
southern Canada, emphasizing poultry-intensive regions. The model fuses wavelet
time-frequency representations with attention over meteorology,
vegetation indices, topography, and land cover. On 2024 OCO-2 data, ST-ViWT
attains R2 = 0.984 and RMSE = 0.468 ppm; 92.3 percent of gap-filled predictions
lie within +/-1 ppm. Independent validation with TCCON shows robust
generalization (bias = -0.14 ppm; r = 0.928), including faithful reproduction
of the late-summer drawdown. Spatial analysis across 14 poultry regions reveals
a moderate positive association between facility density and XCO2 (r = 0.43);
high-density areas exhibit larger seasonal amplitudes (9.57 ppm) and enhanced
summer variability. Compared with conventional interpolation and standard
machine-learning baselines, ST-ViWT yields seamless 0.25 degree CO2 surfaces
with explicit uncertainties, enabling year-round coverage despite sparse
observations. The approach supports integration of satellite constraints with
national inventories and precision livestock platforms to benchmark emissions,
refine region-specific factors, and verify interventions. Importantly,
transformer-based Earth observation enables scalable, transparent, spatially
explicit carbon accounting, hotspot prioritization, and policy-relevant
mitigation assessment.
SABlock Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size
Authors: Jinhan Chen, Jianchun Liu, Hongli Xu, Xianjun Gao, Shilong Wang
2025-10-26
The growing memory footprint of the Key-Value (KV) cache poses a severe
scalability bottleneck for long-context Large Language Model (LLM) inference.
While KV cache eviction has emerged as an effective solution by discarding less
critical tokens, existing token-, block-, and sentence-level compression
methods struggle to balance semantic coherence and memory efficiency. To this
end, we introduce SABlock, a \underline{s}emantic-aware KV cache eviction
framework with \underline{a}daptive \underline{block} sizes. Specifically,
SABlock first performs semantic segmentation to align compression boundaries
with linguistic structures, then applies segment-guided token scoring to refine
token importance estimation. Finally, for each segment, a budget-driven search
strategy adaptively determines the optimal block size that preserves semantic
integrity while improving compression efficiency under a given cache budget.
Extensive experiments on long-context benchmarks demonstrate that SABlock
consistently outperforms state-of-the-art baselines under the same memory
budgets. For instance, on Needle-in-a-Haystack (NIAH), SABlock achieves 99.9%
retrieval accuracy with only 96 KV entries, nearly matching the performance of
the full-cache baseline that retains up to 8K entries. Under a fixed cache
budget of 1,024, SABlock further reduces peak memory usage by 46.28% and
achieves up to 9.5x faster decoding on a 128K context length.
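The budget-driven block-size search can be sketched for a single segment. The details here are ours, not SABlock's: scoring a block by its mean token importance and keeping blocks greedily under the budget are illustrative simplifications.

```python
# Loose sketch: for one semantic segment with per-token importance
# scores, try candidate block sizes and keep the size whose best blocks
# retain the most importance within the segment's entry budget.

def best_block_size(scores, budget, candidates=(1, 2, 4, 8)):
    best_size, best_kept = None, -1.0
    for b in candidates:
        blocks = [scores[i:i + b] for i in range(0, len(scores), b)]
        ranked = sorted(blocks, key=lambda blk: sum(blk) / len(blk), reverse=True)
        kept, used = 0.0, 0
        for blk in ranked:              # greedily keep the best blocks
            if used + len(blk) > budget:
                break
            kept += sum(blk)
            used += len(blk)
        if kept > best_kept:
            best_size, best_kept = b, kept
    return best_size, best_kept
```

Larger blocks evict in semantically coherent chunks but waste budget on low-importance tokens inside a kept block; smaller blocks retain importance more precisely but fragment the segment, and the search trades these off per segment.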
AesCrop Aesthetic-driven Cropping Guided by Composition
Authors: Yen-Hong Wong, Lai-Kuan Wong
2025-10-26
Aesthetic-driven image cropping is crucial for applications like view
recommendation and thumbnail generation, where visual appeal significantly
impacts user engagement. A key factor in visual appeal is composition--the
deliberate arrangement of elements within an image. Some methods have
successfully incorporated compositional knowledge through evaluation-based and
regression-based paradigms. However, evaluation-based methods lack globality
while regression-based methods lack diversity. Recently, hybrid approaches that
integrate both paradigms have emerged, bridging the gap between these two to
achieve better diversity and globality. Notably, existing hybrid methods do not
incorporate photographic composition guidance, a key attribute that defines
photographic aesthetics. In this work, we introduce AesCrop, a
composition-aware hybrid image-cropping model that integrates a VMamba image
encoder, augmented with a novel Mamba Composition Attention Bias (MCAB), and a
decoder to perform end-to-end rank-based image cropping, generating
multiple crops along with the corresponding quality scores. By explicitly
encoding compositional cues into the attention mechanism, MCAB directs AesCrop
to focus on the most compositionally salient regions. Extensive experiments
demonstrate that AesCrop outperforms current state-of-the-art methods,
delivering superior quantitative metrics and qualitatively more pleasing crops.
Aligning Diffusion Language Models via Unpaired Preference Optimization
Authors: Vaibhav Jindal, Hejian Sang, Chun-Mao Lai, Yanning Chen, Zhipeng Wang
2025-10-26
Diffusion language models (dLLMs) are an emerging alternative to
autoregressive (AR) generators, but aligning them to human preferences is
challenging because sequence log-likelihoods are intractable and pairwise
preference data are costly to collect. We introduce ELBO-KTO, which combines an
ELBO surrogate for diffusion log-likelihoods with a prospect-theoretic,
unpaired preference objective (Kahneman Tversky Optimization, KTO). We analyze
the bias and variance induced by the ELBO substitution and employ
variance-reduction practices that stabilize gradients during training. Applied
to LLaDA-8B-Instruct, ELBO-KTO yields \textbf{65.9\%} and \textbf{62.3\%}
adjusted win rates on kto-mix-14k and UltraFeedback-Binary, respectively,
versus the base model under an automatic LLM judge. Across downstream tasks,
including GSM8K, MMLU, and additional reasoning/knowledge benchmarks, ELBO-KTO
trained on UltraFeedback-Binary performs on par with or better than the base
model under identical decoding settings. This establishes unpaired preference
optimization as a viable alternative to pairwise alignment in diffusion LLMs.
Frustratingly Easy Task-aware Pruning for Large Language Models
Authors: Yuanhe Tian, Junjie Liu, Xican Yang, Haishan Ye, Yan Song
2025-10-26
Pruning provides a practical solution to reduce the resources required to run
large language models (LLMs) to benefit from their effective capabilities as
well as control their cost for training and inference. Research on LLM pruning
often ranks the importance of LLM parameters using their magnitudes and
calibration-data activations and removes (or masks) the less important ones,
accordingly reducing LLMs' size. However, these approaches primarily focus on
preserving the LLM's ability to generate fluent sentences, while neglecting
performance on specific domains and tasks. In this paper, we propose a simple
yet effective pruning approach for LLMs that preserves task-specific
capabilities while shrinking their parameter space. We first analyze how
conventional pruning minimizes loss perturbation under general-domain
calibration and extend this formulation by incorporating task-specific feature
distributions into the importance computation of existing pruning algorithms.
Thus, our framework computes separate importance scores using both general and
task-specific calibration data, partitions parameters into shared and exclusive
groups based on activation-norm differences, and then fuses their scores to
guide the pruning process. This design enables our method to integrate
seamlessly with various foundation pruning techniques and preserve the LLM's
specialized abilities under pruning. Experiments on widely used benchmarks
demonstrate that our approach is effective and consistently outperforms the
baselines with identical pruning ratios and different settings.
CHOIR Collaborative Harmonization fOr Inference Robustness
Authors: Xiangjue Dong, Cong Wang, Maria Teleki, Millennium Bismay, James Caverlee
2025-10-26
Persona-assigned Large Language Models (LLMs) can adopt diverse roles,
enabling personalized and context-aware reasoning. However, even minor
demographic perturbations in personas, such as simple pronoun changes, can
alter reasoning trajectories, leading to divergent sets of correct answers.
Instead of treating these variations as biases to be mitigated, we explore
their potential as a constructive resource to improve reasoning robustness. We
propose CHOIR (Collaborative Harmonization fOr Inference Robustness), a
test-time framework that harmonizes multiple persona-conditioned reasoning
signals into a unified prediction. CHOIR orchestrates a collaborative decoding
process among counterfactual personas, dynamically balancing agreement and
divergence in their reasoning paths. Experiments on various reasoning
benchmarks demonstrate that CHOIR consistently enhances performance across
demographics, model architectures, scales, and tasks - without additional
training. Improvements reach up to 26.4% for individual demographic groups and
19.2% on average across five demographics. It remains effective even when base
personas are suboptimal. By reframing persona variation as a constructive
signal, CHOIR provides a scalable and generalizable approach to more reliable
reasoning.
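The harmonization step can be caricatured as confidence-weighted agreement across persona-conditioned runs. The aggregation rule here is our guess at the simplest instance of the idea, not CHOIR's actual mechanism.

```python
from collections import Counter

# Hedged sketch: collect (answer, confidence) pairs from several
# counterfactual-persona runs and return the answer with the greatest
# confidence-weighted agreement.

def harmonize(predictions):
    weight = Counter()
    for answer, conf in predictions:
        weight[answer] += conf
    return weight.most_common(1)[0][0]
```

Even this crude rule shows why persona variation can help rather than hurt: a single persona's divergent trajectory is outvoted unless it is held with much higher confidence than the consensus.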
Backward-Friendly Optimization Training Large Language Models with Approximate Gradients under Memory Constraints
Authors: Jing Yang, Kaitong Cai, Yijia Fan, Yufeng Yang, Keze Wang
2025-10-26
Full fine-tuning of Large Language Models (LLMs) is notoriously
memory-intensive, primarily because conventional optimizers such as SGD or Adam
assume access to exact gradients derived from cached activations. Existing
solutions either alter the model architecture (e.g., reversible networks) or
trade memory for computation (e.g., activation checkpointing), but the
optimizer itself remains untouched. In this work, we introduce GradLite, a
backward-friendly optimizer that relaxes the requirement of exact gradients,
enabling efficient training even when intermediate activations are aggressively
discarded or approximated. GradLite leverages two key techniques: (i) low-rank
Jacobian approximation, which reduces the dimensionality of backpropagated
error signals, and (ii) error-feedback correction, which accumulates and
compensates approximation errors across iterations to preserve convergence
guarantees. We provide a theoretical analysis showing that GradLite maintains
unbiased gradient estimates with bounded variance, ensuring convergence rates
comparable to Adam. Empirically, GradLite reduces optimizer-state and
activation memory consumption by up to 50\% without architectural changes, and
achieves on-par or superior downstream performance on reasoning (MMLU, GSM8K),
multilingual, and dialogue benchmarks compared to checkpointing and
optimizer-centric baselines (LoMo, GaLore).
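The error-feedback mechanism GradLite relies on can be shown with any lossy gradient compressor. This is a generic sketch, not GradLite itself: the top-k compressor stands in for the paper's low-rank Jacobian approximation, and the plain-SGD update is our simplification.

```python
# Minimal error-feedback sketch: compress the gradient lossily, carry
# the discarded residual forward, and add it back before the next
# compression so the approximation bias does not accumulate.

def topk_compress(grad, k):
    """Keep only the k largest-magnitude components (stand-in compressor)."""
    keep = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    out = [0.0] * len(grad)
    for i in keep:
        out[i] = grad[i]
    return out

def train_step(params, grad, error, lr=0.1, k=1):
    corrected = [g + e for g, e in zip(grad, error)]       # error feedback
    approx = topk_compress(corrected, k)
    new_error = [c - a for c, a in zip(corrected, approx)] # residual carried over
    new_params = [p - lr * a for p, a in zip(params, approx)]
    return new_params, new_error
```

Without the residual term, the smaller gradient component would never be applied; with it, the discarded signal accumulates until it wins a top-k slot, which is the intuition behind the bounded-variance convergence argument.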
GigaEmbeddings Efficient Russian Language Embedding Model
Authors: Egor Kolodin, Daria Khomich, Nikita Savushkin, Anastasia Ianina, Fyodor Minkin
2025-10-25
We introduce GigaEmbeddings, a novel framework for training high-performance
Russian-focused text embeddings through hierarchical instruction tuning of a
decoder-only LLM designed specifically for the Russian language (GigaChat-3B). Our
three-stage pipeline, comprising large-scale contrastive pre-training on
web-scale corpora, fine-tuning with hard negatives, and multitask
generalization across retrieval, classification, and clustering tasks,
addresses key limitations of existing methods by unifying diverse objectives
and leveraging synthetic data generation. Architectural innovations include
bidirectional attention for contextual modeling, latent attention pooling for
robust sequence aggregation, and strategic pruning of 25% of transformer layers
to enhance efficiency without compromising performance. Evaluated on the ruMTEB
benchmark spanning 23 multilingual tasks, GigaEmbeddings achieves
state-of-the-art results (69.1 avg. score), outperforming strong baselines with
a larger number of parameters.
The Structural Scalpel Automated Contiguous Layer Pruning for Large Language Models
Authors: Yao Lu, Yuqi Li, Wenbin Xie, Shanqing Yu, Qi Xuan, Zhaowei Zhu, Shiping Wen
2025-10-25
Although large language models (LLMs) have achieved revolutionary
breakthroughs in many fields, their large model size and high computational
cost pose significant challenges for practical deployment on
resource-constrained edge devices. To this end, layer pruning has been proposed
to reduce the computational overhead by directly removing redundant layers.
However, existing layer pruning methods typically rely on hand-crafted metrics
to evaluate and remove individual layers, while ignoring the dependencies
between layers. This can disrupt the model's information flow and severely
degrade performance. To address these issues, we propose CLP, a novel
contiguous layer pruning framework that introduces two key innovations: a
differentiable concave gate algorithm that automatically identifies the best
contiguous layer segments for pruning via gradient-based optimization; and a
cutoff endpoint tuning strategy that effectively restores model performance by
fine-tuning only the layers adjacent to the pruned segments. Extensive
experiments across multiple model architectures (including LLaMA2, LLaMA3 and
Qwen) and sizes (from B to B parameters) show that CLP significantly
outperforms existing state-of-the-art baselines. For example, at a pruning rate
of , CLP achieves an average performance retention of on
LLaMA3-70B, outperforming baselines by -. Furthermore, CLP can
be seamlessly combined with quantization to further compress the model with
only a slight performance loss.
Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders
Authors: Mengyu Ye, Jun Suzuki, Tatsuro Inaba, Tatsuki Kuribayashi
2025-10-25
Recent interpretability work on large language models (LLMs) has been
increasingly dominated by a feature-discovery approach with the help of proxy
modules. Then, the quality of features learned by, e.g., sparse autoencoders
(SAEs), is evaluated. This paradigm naturally raises a critical question: do
(SAEs), is evaluated. This paradigm naturally raises a critical question: do
such learned features have better properties than those already represented
within the original model parameters? Unfortunately, only a few studies
have made such comparisons systematically so far. In this work, we revisit the
interpretability of feature vectors stored in feed-forward (FF) layers, given
the perspective of FF as key-value memories, with modern interpretability
benchmarks. Our extensive evaluation revealed that SAEs and FFs exhibit a
similar range of interpretability, although SAEs displayed an observable but
minimal improvement in some aspects. Furthermore, in certain aspects,
surprisingly, even vanilla FFs yielded better interpretability than the SAEs,
and features discovered in SAEs and FFs diverged. These findings raise questions about
the advantage of SAEs from both perspectives of feature quality and
faithfulness, compared to directly interpreting FF feature vectors, and suggest
that FF key-value parameters serve as a strong baseline in modern interpretability
research.
Efficient Low Rank Attention for Long-Context Inference in Large Language Models
Authors: Tenghui Li, Guoxu Zhou, Xuyang Zhao, Yuning Qiu, Qibin Zhao
2025-10-25
As the length of input text grows, the key-value (KV) cache in LLMs imposes
prohibitive GPU memory costs and limits long-context inference on
resource-constrained devices. Existing approaches, such as quantization and eviction,
reduce memory usage but suffer from numerical precision loss or suboptimal
retention of key-value pairs. We introduce Low Rank Query and Key attention
(LRQK), a two-stage framework that jointly decomposes the full-precision query
and key matrices into compact rank-r factors during the prefill stage, and
then uses these low-dimensional projections to compute proxy attention scores
at each decoding step. By selecting only the
top-k tokens and a small fixed set of recent tokens, LRQK employs a mixed
GPU-CPU cache with a hit-and-miss mechanism that transfers only missing
full-precision KV pairs, thereby preserving exact attention outputs while
reducing CPU-GPU data movement. Extensive experiments on the RULER and
LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK
matches or surpasses leading sparse-attention methods in long context settings,
while delivering significant memory savings with minimal loss in accuracy. Our
code is available at https://github.com/tenghuilee/LRQK.
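The proxy-then-exact selection step can be sketched with plain lists. This is our simplification of the idea, not LRQK's code: the projection matrix here is a fixed toy rather than a factor obtained by decomposing Q and K, and `select_tokens`/`exact_attention` are our own helper names.

```python
import math

# Illustrative sketch: cheap rank-r proxy scores pick which cached
# tokens (plus a recency window) deserve exact attention; everything
# else can stay off-GPU.

def project(vec, proj):                       # vec: (d,), proj: (d, r)
    return [sum(v * p[j] for v, p in zip(vec, proj)) for j in range(len(proj[0]))]

def select_tokens(query, keys, proj, top_k=2, recent=1):
    q_r = project(query, proj)
    scores = [sum(a * b for a, b in zip(q_r, project(k, proj))) for k in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:top_k]) | set(range(len(keys) - recent, len(keys)))
    return sorted(chosen)

def exact_attention(query, keys, values, idx):
    """Full-precision softmax attention restricted to the selected tokens."""
    logits = [sum(a * b for a, b in zip(query, keys[i])) for i in idx]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [sum(wj * values[i][c] for wj, i in zip(w, idx)) / z
            for c in range(len(values[0]))]
```

Only the selected indices ever need their full-precision KV pairs resident on the GPU, which is what the hit-and-miss cache exploits.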
PACR Progressively Ascending Confidence Reward for LLM Reasoning
Authors: Eunseop Yoon, Hee Suk Yoon, Jaehyun Jang, SooHwan Eom, Qi Dai, Chong Luo, Mark A. Hasegawa-Johnson, Chang D. Yoo
2025-10-25
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly
improved LLM reasoning, but its sparse, outcome-based reward provides no
guidance for intermediate steps, slowing exploration. We propose Progressively
Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed
directly from the model's evolving belief in the correct answer. PACR encodes
the inductive bias that, along a well-formed reasoning trajectory, the
probability of the ground-truth answer should have a generally ascending trend.
We provide empirical and theoretical analysis validating that such an inductive
bias constrains the exploration search space to regions richer in logically
sound reasoning. We demonstrate that PACR accelerates exploration, reaches
reward saturation with fewer trajectories, and yields improvements on multiple
benchmarks. Our results suggest that dense, model-intrinsic shaping signals can
make RLVR training more effective and reliable.
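The shaping signal described above can be sketched as a per-step reward equal to the change in the model's belief in the ground-truth answer. This is a hedged illustration of the inductive bias, not PACR's exact formulation:

```python
def pacr_reward(p_answer):
    """Hypothetical PACR-style dense reward: reward each reasoning step by
    the change in the model's probability of the ground-truth answer, so
    trajectories with generally ascending confidence are reinforced."""
    return [p_answer[t] - p_answer[t - 1] for t in range(1, len(p_answer))]

# A well-formed trajectory: confidence in the correct answer generally rises.
good = [0.10, 0.25, 0.40, 0.70, 0.95]
bad  = [0.40, 0.35, 0.20, 0.15, 0.10]
print(sum(pacr_reward(good)) > 0)   # net positive shaping signal
print(sum(pacr_reward(bad)) < 0)    # descending confidence is penalized
```

Unlike an outcome-only reward, every intermediate step receives a signal, which is what lets the method constrain exploration before the final answer is produced.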
Synthetic-to-Real Transfer Learning for Chromatin-Sensitive PWS Microscopy
Authors: Jahidul Arafat, Sanjaya Poudel
2025-10-25
Chromatin-sensitive partial wave spectroscopic (csPWS) microscopy enables
label-free detection of nanoscale chromatin packing alterations that occur
before visible cellular transformation. However, manual nuclear segmentation
limits the population-scale analysis needed for biomarker discovery in early
cancer detection. The lack of annotated csPWS imaging data prevents direct use
of standard deep learning methods. We present CFU Net, a hierarchical
segmentation architecture trained with a three-stage curriculum on synthetic
multimodal data. CFU Net achieves near-perfect performance on held-out synthetic
test data representing diverse spectroscopic imaging conditions, without manual
annotations (Dice 0.9879, IoU 0.9895). Our approach uses physics-based
rendering that incorporates empirically supported chromatin packing statistics,
Mie scattering models, and modality-specific noise, combined with a curriculum
that progresses from adversarial RGB pretraining to spectroscopic fine-tuning
and histology validation. CFU Net integrates five architectural elements
(ConvNeXt backbone, Feature Pyramid Network, UNet++ dense connections,
dual attention, and deep supervision) that together improve Dice over a
baseline UNet by 8.3 percent. We demonstrate deployment-ready INT8 quantization
with a 74.9 percent model-size reduction and 0.15-second inference, giving a
240-times
throughput gain over manual analysis. Applied to more than ten thousand
automatically segmented nuclei from synthetic test data, the pipeline extracts
chromatin biomarkers that distinguish normal from pre-cancerous tissue with
large effect sizes (Cohen's d between 1.31 and 2.98), reaching 94 percent
classification accuracy. This work provides a general framework for
synthetic-to-real transfer learning in specialized microscopy and open resources
for community validation on clinical specimens.
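The Dice and IoU scores reported above are the standard overlap metrics for binary segmentation masks; a minimal sketch (the function name and toy masks are ours, not the paper's):

```python
def dice_iou(pred, target):
    """Dice and IoU for binary masks given as flat 0/1 lists."""
    inter = sum(p * t for p, t in zip(pred, target))
    ps, ts = sum(pred), sum(target)
    dice = 2 * inter / (ps + ts)          # 2|A∩B| / (|A| + |B|)
    iou = inter / (ps + ts - inter)       # |A∩B| / |A∪B|
    return dice, iou

pred   = [1, 1, 1, 0, 0, 1]
target = [1, 1, 0, 0, 1, 1]
d, i = dice_iou(pred, target)
print(round(d, 4), round(i, 4))   # 0.75 0.6
```

Dice always equals or exceeds IoU for the same masks, which is worth keeping in mind when comparing the two numbers.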
When Fewer Layers Break More Chains Layer Pruning Harms Test-Time Scaling in LLMs
Authors: Keyu Wang, Tian Lyu, Guinan Su, Jonas Geiping, Lu Yin, Marco Canini, Shiwei Liu
2025-10-25
Layer pruning has emerged as a widely adopted technique for improving the
efficiency of large language models (LLMs). Although existing methods
demonstrate strong performance retention on general knowledge tasks, their
effect on long-chain reasoning, a more brittle yet crucial capability, remains
largely unexplored. In this work, we study the impact of layer pruning on
long-chain reasoning through the lens of test-time scaling, a key mechanism in
modern LLMs that enables strong reasoning capacity by allocating more
computation at inference time. With extensive experiments, we demonstrate that
pruning even one or two layers can severely impair test-time scaling, with
performance collapsing drastically on long reasoning benchmarks even when
performance on knowledge-intensive and shallow reasoning tasks remains stable.
Furthermore, we find that standard supervised fine-tuning remedies fail to
recover test-time scaling once it has deteriorated. Through in-depth analyses,
we identify the mechanisms underlying this fragility of test-time scaling and
highlight the fundamental risks of applying layer pruning to
reasoning-intensive LLMs. These findings call for a rethinking of layer pruning
strategies and provide insights for developing methods that preserve the
robustness of reasoning. We open-source the codebase at
https://github.com/keyu-wang-2002/Layer-Pruning-Harms-Inference-Scaling.
TrajGATFormer A Graph-Based Transformer Approach for Worker and Obstacle Trajectory Prediction in Off-site Construction Environments
Authors: Mohammed Alduais, Xinming Li, Qipei Mei
2025-10-25
As the demand grows within the construction industry for processes that are
not only faster but also safer and more efficient, offsite construction has
emerged as a solution, though it brings new safety risks due to the close
interaction between workers, machinery, and moving obstacles. Predicting the
future trajectories of workers and taking into account social and environmental
factors is a crucial step for developing collision-avoidance systems to
mitigate such risks. Traditional methods often struggle to adapt to the dynamic
and unpredictable nature of construction environments. Many rely on simplified
assumptions or require hand-crafted features, limiting their ability to respond
to complex, real-time interactions between workers and moving obstacles. While
recent data-driven methods have improved the modeling of temporal patterns,
they still face challenges in capturing long-term behavior and accounting for
the spatial and social context crucial to collision risk assessment. To address
these limitations, this paper proposes a framework integrating YOLOv10n and
DeepSORT for precise detection and tracking, along with two novel trajectory
prediction models: TrajGATFormer and TrajGATFormer-Obstacle. YOLOv10n serves as
the backbone for object detection, accurately identifying workers and obstacles
in diverse scenes, while DeepSORT efficiently tracks them over time with unique
IDs for continuity. Both models employ an encoder-decoder Transformer with
Graph Attention Networks (GAT) to capture temporal and spatial interactions.
TrajGATFormer predicts worker trajectories with an ADE of 1.25 m and FDE of 2.3
m over a 4.8 s horizon, while TrajGATFormer-Obstacle extends prediction to both
workers and obstacles, achieving higher accuracy (ADE 1.15 m, FDE 2.2 m).
Comparative analysis shows both models outperform traditional methods, reducing
ADE and FDE by up to 35% and 38%, respectively.
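The ADE and FDE figures quoted above are the standard trajectory-prediction metrics: mean displacement over the horizon and displacement at the final step. A minimal sketch (toy trajectories, not the paper's data):

```python
import math

def ade_fde(pred, truth):
    """Average and Final Displacement Error for a predicted trajectory,
    where pred and truth are equal-length lists of (x, y) points."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]

pred  = [(0, 0), (1, 0), (2, 0), (3, 0)]
truth = [(0, 0), (1, 1), (2, 0), (3, 2)]
ade, fde = ade_fde(pred, truth)
print(ade, fde)   # 0.75 2.0
```

FDE isolates long-horizon drift, which is why papers report it alongside the averaged ADE.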
Surface Reading LLMs Synthetic Text and its Styles
Authors: Hannes Bajohr
2025-10-25
Despite a potential plateau in ML advancement, the societal impact of large
language models lies not in approaching superintelligence but in generating
text surfaces indistinguishable from human writing. While Critical AI Studies
provides essential material and socio-technical critique, it risks overlooking
how LLMs phenomenologically reshape meaning-making. This paper proposes a
semiotics of "surface integrity" as attending to the immediate plane where LLMs
inscribe themselves into human communication. I distinguish three knowledge
interests in ML research (epistemology, epistēmē, and epistemics) and argue
for integrating surface-level stylistic analysis alongside depth-oriented
critique. Through two case studies examining stylistic markers of synthetic
text, I show how attending to style as a semiotic phenomenon reveals LLMs as
cultural actors that transform the conditions of meaning emergence and
circulation in contemporary discourse, independent of questions about machine
consciousness.
Edit Less, Achieve More Dynamic Sparse Neuron Masking for Lifelong Knowledge Editing in LLMs
Authors: Jinzhe Liu, Junshu Sun, Shufan Shen, Chenxue Yang, Shuhui Wang
2025-10-25
Lifelong knowledge editing enables continuous, precise updates to outdated
knowledge in large language models (LLMs) without computationally expensive
full retraining. However, existing methods often accumulate errors throughout
the editing process, causing a gradual decline in both editing accuracy and
generalization. To tackle this problem, we propose Neuron-Specific Masked
Knowledge Editing (NMKE), a novel fine-grained editing framework that combines
neuron-level attribution with dynamic sparse masking. Leveraging neuron
functional attribution, we identify two key types of knowledge neurons:
knowledge-general neurons, which activate consistently across prompts, and
knowledge-specific neurons, which activate only for specific prompts. NMKE
further introduces an entropy-guided dynamic sparse mask that locates the
neurons relevant to the target knowledge. This strategy enables precise
neuron-level knowledge
editing with fewer parameter modifications. Experimental results from thousands
of sequential edits demonstrate that NMKE outperforms existing methods in
maintaining high editing success rates and preserving general model
capabilities in lifelong editing.
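One way to read "entropy-guided dynamic sparse mask" is that the entropy of the attribution distribution sets how many neurons get edited. The sketch below is our hypothetical interpretation (function name, exp-of-entropy mask size, and toy scores are assumptions), not NMKE's actual rule:

```python
import numpy as np

def entropy_guided_mask(attrib):
    """Hypothetical entropy-guided sparse neuron mask: normalize attribution
    scores, then size the mask by the effective number of contributing
    neurons exp(H) -- peaked attributions (low entropy) edit fewer neurons."""
    p = np.abs(attrib) / np.abs(attrib).sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    k = max(1, int(round(np.exp(entropy))))   # effective support size
    mask = np.zeros_like(p, dtype=bool)
    mask[np.argsort(p)[::-1][:k]] = True      # keep the k most attributed
    return mask

peaked  = np.array([10.0, 0.1, 0.1, 0.1, 0.1])   # one dominant neuron
uniform = np.array([1.0, 1.0, 1.0, 1.0, 1.0])    # evenly spread attribution
print(entropy_guided_mask(peaked).sum(), entropy_guided_mask(uniform).sum())
```

The point of such a rule is the abstract's "fewer parameter modifications": when attribution is concentrated, the edit touches only the neurons that actually carry the fact.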
Scaling Up Efficient Small Language Models Serving and Deployment for Semantic Job Search
Authors: Kayhan Behdin, Qingquan Song, Sriram Vasudevan, Jian Sheng, Xiaojing Ma, Z Zhou, Chuanrui Zhu, Guoyao Li, Chanh Nguyen, Sayan Ghosh, Hejian Sang, Ata Fatahi Baarzi, Sundara Raman Ramachandran, Xiaoqing Wang, Qing Lan, Vinay Y S, Qi Guo, Caleb Johnson, Zhipeng Wang, Fedor Borisyuk
2025-10-25
Large Language Models (LLMs) have demonstrated impressive quality when
applied to predictive tasks such as relevance ranking and semantic search.
However, deployment of such LLMs remains prohibitively expensive for industry
applications with strict latency and throughput requirements. In this work, we
present lessons and efficiency insights from developing a purely text-based
decoder-only Small Language Model (SLM) for a semantic search application at
LinkedIn. In particular, we discuss model compression techniques, such as
quantization and distillation, that allow us to substantially reduce the model
size while maintaining accuracy. Additionally, we present context compression
techniques that greatly reduce the input context length with minimal loss of
accuracy. Finally, we present practical lessons from optimizing the serving
infrastructure for deploying such a system on GPUs at scale, serving millions
of requests per second. Taken together, these optimizations increase our
system's throughput severalfold in a real-world deployment, while meeting our
quality bar.
Generalization or Memorization Dynamic Decoding for Mode Steering
Authors: Xuanming Zhang
2025-10-25
Large Language Models (LLMs) exhibit a troubling duality, capable of both
remarkable generalization and brittle, verbatim memorization of their training
data. This unpredictability undermines their reliability in high-stakes
applications. In this work, we propose a unified framework to understand,
identify, and control these distinct reasoning modes. First, we introduce a
theoretical model based on the Information Bottleneck (IB) principle,
formalizing generalization as the learning of a compressed, task-relevant
representation and memorization as a failure to compress. Building on this
theory, we develop Dynamic Mode Steering (DMS), a novel inference-time
algorithm which comprises two components: (1) a lightweight, causally-grounded
linear probe that identifies the model's instantaneous reliance on
memorization, and (2) a dynamic activation steering mechanism that nudges the
model's computation towards pre-identified generalization circuits. We frame
DMS as a form of adaptive, self-contrastive decoding. Experiments on reasoning
and faithfulness tasks demonstrate that DMS significantly improves logical
consistency and factual accuracy, thereby offering a principled approach to
enhancing LLM reliability.
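The steering component can be sketched as adding a fixed "generalization" direction to the hidden state, scaled by the probe's memorization score. This is a hedged illustration of activation steering in general (names, scaling rule, and vectors are ours), not DMS's actual circuits or probe:

```python
import numpy as np

def steer(hidden, direction, prob_memorize, alpha=2.0):
    """Hypothetical DMS-style steering: when a probe reports reliance on
    memorization, nudge the hidden state along a generalization direction,
    scaled by the probe's confidence."""
    d = direction / np.linalg.norm(direction)   # unit steering vector
    return hidden + alpha * prob_memorize * d

h = np.zeros(4)
g_dir = np.array([1.0, 0.0, 0.0, 0.0])
low  = steer(h, g_dir, prob_memorize=0.1)   # weak intervention
high = steer(h, g_dir, prob_memorize=0.9)   # strong intervention
print(np.linalg.norm(high - h) > np.linalg.norm(low - h))
```

Making the intervention proportional to the probe's output is what makes the scheme "dynamic": computation is only redirected when memorization is actually detected.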
Embracing Trustworthy Brain-Agent Collaboration as Paradigm Extension for Intelligent Assistive Technologies
Authors: Yankai Chen, Xinni Zhang, Yifei Zhang, Yangning Li, Henry Peng Zou, Chunyu Miao, Weizhi Zhang, Xue Liu, Philip S. Yu
2025-10-25
Brain-Computer Interfaces (BCIs) offer a direct pathway between
the human brain and external devices, holding significant promise for
individuals with severe neurological impairments. However, their widespread
adoption is hindered by critical limitations, such as low information transfer
rates and extensive user-specific calibration. To overcome these challenges,
recent research has explored the integration of Large Language Models (LLMs),
extending the focus from simple command decoding to understanding complex
cognitive states. Despite these advancements, deploying agentic AI faces
technical hurdles and ethical concerns. Due to the lack of comprehensive
discussion on this emerging direction, this position paper argues that the
field is poised for a paradigm extension from BCI to Brain-Agent Collaboration
(BAC). We emphasize reframing agents as active and collaborative partners for
intelligent assistance rather than passive brain signal data processors,
demanding a focus on ethical data handling, model reliability, and a robust
human-agent collaboration framework to ensure these systems are safe,
trustworthy, and effective.
Compositional Bias Control in Large Language Models Preference Learning Fails, Supervision Succeeds
Authors: Atij Mahesh
2025-10-24
Large Language Models (LLMs) still produce gender-stereotyped language even
in occupation-neutral contexts that reflect deep societal biases (Rudinger et
al., 2018). To address this, prior work has proposed prompting, constrained
decoding (Dathathri et al., 2020; Zhou et al., 2024), post-processing, and
fine-tuning-based alignment (Rafailov et al., 2023; Ravfogel et al., 2022).
However, the comparative efficacy and learning dynamics of these methods remain
little understood. We report a comparative analysis of six control techniques
for bias mitigation: prompt-only, generate-and-filter, DFA-based Ctrl-G
decoding,
Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and
Iterative Nullspace Projection (INLP). We evaluate each method on a
compositional constraint task. This task requires generating sentences that
contain at least one agentic and one communal descriptor for each of the twenty
Winogender-derived occupations. We quantify trade-offs between control strength
and naturalness with evaluations of constraint compliance, lexical diversity,
and fluency. Our results reveal key contrasts among the methods: SFT achieves
99.87 +- 0.15% compliance and high lexical diversity, while DPO, despite
similar training stability, fails at 4.53 +- 0.82%. Ctrl-G guarantees perfect
compliance, but at the cost of severely reduced fluency and diversity.
Preference-based learning fundamentally differs: it cannot satisfy
compositional constraints, as binary preference signals encode ranking, not
logical conjunctions. Only explicit positive supervision enables mitigation of
compositional biases; preference-based alignment fails to generalize logical
structures, underscoring the limitations of preference learning and the
necessity of explicit supervision for fair and fluent controlled generation.
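The compositional constraint itself is easy to check mechanically: a generation complies only if it contains at least one agentic and one communal descriptor. The word lists below are illustrative stand-ins, not the paper's actual lexicons:

```python
# Illustrative descriptor lexicons (the paper's actual lists may differ).
AGENTIC  = {"assertive", "decisive", "ambitious", "confident"}
COMMUNAL = {"supportive", "caring", "warm", "helpful"}

def complies(sentence):
    """The compositional constraint: at least one agentic AND one
    communal descriptor must appear in the sentence."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    return bool(words & AGENTIC) and bool(words & COMMUNAL)

print(complies("The engineer was decisive and caring."))      # True
print(complies("The engineer was decisive and ambitious."))   # False
```

The conjunction is the crux of the paper's finding: a binary preference signal can rank one sentence over another, but it cannot express "both conditions must hold", which is why DPO fails where SFT succeeds.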
Pruning and Quantization Impact on Graph Neural Networks
Authors: Khatoon Khedri, Reza Rawassizadeh, Qifu Wen, Mehdi Hosseinzadeh
2025-10-24
Graph neural networks (GNNs) are known to operate with high accuracy on
learning from graph-structured data, but they suffer from high computational
and resource costs. Neural network compression methods are used to reduce the
model size while maintaining reasonable accuracy. Two of the most common neural
network compression techniques are pruning and quantization. In this research,
we empirically examine the effects of three pruning methods and three
quantization methods on different GNN models across graph classification, node
classification, and link prediction tasks. We conducted all
experiments on three graph datasets, including Cora, Proteins, and BBBP. Our
findings demonstrate that unstructured fine-grained and global pruning can
significantly reduce the model's size (by 50%) while maintaining or even
improving precision after fine-tuning the pruned model. The evaluation of
different quantization methods on GNNs shows diverse impacts on accuracy,
inference time, and model size across different datasets.
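Global unstructured pruning, one of the techniques evaluated above, ranks weights by magnitude across all layers jointly and zeroes the smallest. A minimal sketch with toy weight matrices (not tied to any GNN library):

```python
import numpy as np

def global_magnitude_prune(weights, sparsity=0.5):
    """Global unstructured pruning: zero the smallest-magnitude weights
    across all layers jointly until the target sparsity is reached."""
    flat = np.concatenate([w.ravel() for w in weights])
    k = int(sparsity * flat.size)
    # Magnitude threshold below which (inclusive) weights are removed.
    thresh = np.sort(np.abs(flat))[k - 1] if k > 0 else -np.inf
    return [np.where(np.abs(w) <= thresh, 0.0, w) for w in weights]

w1 = np.array([[0.1, -2.0], [0.3, 1.5]])
w2 = np.array([[-0.05, 0.9], [2.5, -0.2]])
pruned = global_magnitude_prune([w1, w2], sparsity=0.5)
kept = sum((p != 0).sum() for p in pruned)
print(kept)   # 4 of 8 weights survive
```

Because the threshold is shared across layers, layers with uniformly small weights lose more parameters than important ones, which is what distinguishes global from per-layer pruning.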
Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative Recommenders
Authors: Zhimin Chen, Chenyu Zhao, Ka Chun Mo, Yunjiang Jiang, Jane H. Lee, Shouwei Chen, Khushhall Chandra Mahajan, Ning Jiang, Kai Ren, Jinhui Li, Wen-Yun Yang
2025-10-24
Modern large-scale recommendation systems rely heavily on user interaction
history sequences to enhance the model performance. The advent of large
language models and sequential modeling techniques, particularly
Transformer-like architectures, has led to significant advancements recently
(e.g., HSTU, SIM, and TWIN models). While scaling to ultra-long user histories
(10k to 100k items) generally improves model performance, it also creates
significant challenges on latency, queries per second (QPS) and GPU cost in
industry-scale recommendation systems. Existing models do not adequately
address these industrial scalability issues. In this paper, we propose a novel
two-stage modeling framework, namely VIrtual Sequential Target Attention
(VISTA), which decomposes traditional target attention from a candidate item to
user history items into two distinct stages: (1) user history summarization
into a few hundred tokens; followed by (2) candidate item attention to those
tokens. These summarization token embeddings are then cached in a storage
system and utilized as sequence features for downstream model training and
inference. This novel design for scalability enables VISTA to scale to lifelong
user histories (up to one million items) while keeping downstream training and
inference costs fixed, which is essential in industry. Our approach achieves
significant improvements in offline and online metrics and has been
successfully deployed on an industry-leading recommendation platform serving
billions of users.
On the acceleration of cosmic rays at the post-adiabatic shocks of supernova remnants
Authors: O. Petruk, R. Bandiera, T. Kuzyo, R. Brose, A. Ingallinera
2025-10-24
When a supernova remnant (SNR) interacts with the dense material of an
interstellar cloud, its shock wave decelerates rapidly, and the post-shock
temperature drops to levels that permit efficient cooling of the shocked
plasma. At this stage, the shock enters the post-adiabatic phase of its
evolution. During this phase, the internal structure of the SNR undergoes
significant changes, particularly in the immediate post-shock region, at
spatial scales relevant to cosmic ray acceleration. Once the shock enters the
post-adiabatic regime, the efficiency of diffusive shock acceleration increases
due to a higher plasma compression, to a change in the direction of the
advection velocity, and to an increased rate of momentum gain. As a result, the
momentum spectrum of relativistic particles hardens, deviating from a pure
power law at high energies. Particles could reach higher maximum momenta
compared to classical predictions. We highlight the dynamics of post-adiabatic
flows in SNRs, study their impact on particle acceleration, and present
supporting observational evidence in the radio band.
Sprint Sparse-Dense Residual Fusion for Efficient Diffusion Transformers
Authors: Dogyun Park, Moayed Haji-Ali, Yanyu Li, Willi Menapace, Sergey Tulyakov, Hyunwoo J. Kim, Aliaksandr Siarohin, Anil Kag
2025-10-24
Diffusion Transformers (DiTs) deliver state-of-the-art generative performance
but their quadratic training cost with sequence length makes large-scale
pretraining prohibitively expensive. Token dropping can reduce training cost,
yet na\"ive strategies degrade representations, and existing methods are either
parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse--Dense
Residual Fusion for Efficient Diffusion Transformers, a simple method that
enables aggressive token dropping (up to 75%) while preserving quality. SPRINT
leverages the complementary roles of shallow and deep layers: early layers
process all tokens to capture local detail, deeper layers operate on a sparse
subset to cut computation, and their outputs are fused through residual
connections. Training follows a two-stage schedule: long masked pre-training
for efficiency followed by short full-token fine-tuning to close the
train--inference gap. On ImageNet-1K 256x256, SPRINT achieves 9.8x training
savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG)
nearly halves FLOPs while improving quality. These results establish SPRINT as
a simple, effective, and general solution for efficient DiT training.
From Social Division to Cohesion with AI Message Suggestions in Online Chat Groups
Authors: Faria Huq, Elijah L. Claggett, Hirokazu Shirado
2025-10-24
Social cohesion is difficult to sustain in societies marked by opinion
diversity, particularly in online communication. As large language model
(LLM)-driven messaging assistance becomes increasingly embedded in these
contexts, it raises critical questions about its societal impact. We present an
online experiment with 557 participants who engaged in multi-round discussions
on politically controversial topics while freely reconfiguring their discussion
groups. In some conditions, participants received real-time message suggestions
generated by an LLM, either personalized to the individual or adapted to their
group context. We find that subtle shifts in linguistic style during
communication, mediated by AI assistance, can scale up to reshape collective
structures. While individual-focused assistance leads users to segregate into
like-minded groups, relational assistance that incorporates group members'
stances enhances cohesion through more receptive exchanges. These findings
demonstrate that AI-mediated communication can support social cohesion in
can support social cohesion in
diverse groups, but outcomes critically depend on how personalization is
designed.
Performance Trade-offs of Optimizing Small Language Models for E-Commerce
Authors: Josip Tomo Licardo, Nikola Tankovic
2025-10-24
Large Language Models (LLMs) offer state-of-the-art performance in natural
language understanding and generation tasks. However, the deployment of leading
commercial models for specialized tasks, such as e-commerce, is often hindered
by high computational costs, latency, and operational expenses. This paper
investigates the viability of smaller, open-weight models as a
resource-efficient alternative. We present a methodology for optimizing a
one-billion-parameter Llama 3.2 model for multilingual e-commerce intent
recognition. The model was fine-tuned using Quantized Low-Rank Adaptation
(QLoRA) on a synthetically generated dataset designed to mimic real-world user
queries. Subsequently, we applied post-training quantization techniques,
creating GPU-optimized (GPTQ) and CPU-optimized (GGUF) versions. Our results
demonstrate that the specialized 1B model achieves 99% accuracy, matching the
performance of the significantly larger GPT-4.1 model. A detailed performance
analysis revealed critical, hardware-dependent trade-offs: while 4-bit GPTQ
reduced VRAM usage by 41%, it paradoxically slowed inference by 82% on an older
GPU architecture (NVIDIA T4) due to dequantization overhead. Conversely, GGUF
formats on a CPU achieved a speedup of up to 18x in inference throughput and a
reduction of over 90% in RAM consumption compared to the FP16 baseline. We
conclude that small, properly optimized open-weight models are not just a
viable but a more suitable alternative for domain-specific applications,
offering state-of-the-art accuracy at a fraction of the computational cost.
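The VRAM savings and dequantization overhead discussed above both stem from the same mechanism: low-bit weights must be scaled back to floats at inference time. The sketch below shows the general symmetric 4-bit round-trip, not GPTQ's or GGUF's actual kernels (those use grouped scales and calibration):

```python
import numpy as np

def quant_dequant_int4(w):
    """Symmetric 4-bit quantization round-trip: map floats to integers in
    [-8, 7] with a per-tensor scale, then dequantize back to floats."""
    scale = np.abs(w).max() / 7.0                          # per-tensor scale
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, q.astype(np.float32) * scale                 # dequantized copy

w = np.array([0.7, -0.3, 0.1, -0.7], dtype=np.float32)
q, w_hat = quant_dequant_int4(w)
print(q.tolist())                      # [7, -3, 1, -7]
print(np.abs(w - w_hat).max() < 0.06)  # small round-trip error
```

Storing `q` plus one scale is what shrinks VRAM roughly 4x versus FP16, while the multiply-back-by-scale step on every forward pass is the dequantization overhead that slowed the T4 in the experiments above.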
Model-Aware Tokenizer Transfer
Authors: Mykola Haltiuk, Aleksander Smywiński-Pohl
2025-10-24
Large Language Models (LLMs) are trained to support an increasing number of
languages, yet their predefined tokenizers remain a bottleneck for adapting
models to lower-resource or distinct-script languages. Existing tokenizer
transfer methods typically rely on semantic heuristics to initialize new
embeddings, ignoring higher-layer model dynamics and limiting transfer quality.
We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates
model internals into the tokenizer transfer process. MATT introduces an
Attention Influence Modeling (AIM) objective that distills inter-token
attention patterns from a source model into a target model with a new
tokenizer, providing an efficient warm-up before standard language modeling.
Unlike approaches that focus solely on embedding similarity, MATT leverages
attention behavior to guide embedding initialization and adaptation.
Experiments across diverse linguistic settings show that MATT recovers a large
fraction of the original model's performance within a few GPU hours,
outperforming heuristic baselines. These results demonstrate that incorporating
model-level signals offers a practical and effective path toward robust
tokenizer transfer in multilingual LLMs.
Adversarial Déjà Vu Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
Authors: Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, Charith Peris, Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, Ruoxi Jia
2025-10-24
Large language models remain vulnerable to jailbreak attacks that bypass
safety guardrails to elicit harmful outputs. Defending against novel jailbreaks
represents a critical challenge in AI safety. Adversarial training -- designed
to make models robust against worst-case perturbations -- has been the dominant
paradigm for adversarial robustness. However, due to optimization challenges
and difficulties in defining realistic threat models, adversarial training
methods often fail on newly developed jailbreaks in practice. This paper
proposes a new paradigm for improving robustness against unseen jailbreaks,
centered on the Adversarial Déjà Vu hypothesis: novel jailbreaks are not
fundamentally new, but largely recombinations of adversarial skills from
previous attacks. We study this hypothesis through a large-scale analysis of 32
attack papers published over two years. Using an automated pipeline, we extract
and compress adversarial skills into a dictionary of primitives, with LLMs
generating human-readable descriptions. Our analysis reveals that unseen
attacks can be effectively explained as compositions of earlier skills,
with explanatory power increasing monotonically as skill coverage grows. Guided
by this insight, we introduce Adversarial Skill Compositional Training (ASCoT),
which trains on diverse compositions of skill primitives rather than isolated
attack instances. ASCoT substantially improves robustness to unseen attacks,
including multi-turn jailbreaks, while maintaining low over-refusal rates. We
also demonstrate that expanding adversarial skill coverage, not just data
scale, is key to defending against novel attacks.
Warning: This paper contains content that may be harmful or offensive in
nature.