2025-08-15
Table of Contents
- STream3R Scalable Sequential 3D Reconstruction with Causal Transformer
- Generalizable Federated Learning using Client Adaptive Focal Modulation
- Video-BLADE Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
- Thinking Inside the Mask In-Place Prompting in Diffusion LLMs
- Continuous Bangla Sign Language Translation Mitigating the Expense of Gloss Annotation with the Assistance of Graph
- SemPT Semantic Prompt Tuning for Vision-Language Models
- DAS Dual-Aligned Semantic IDs Empowered Industrial Recommender System
- GCRPNet Graph-Enhanced Contextual and Regional Perception Network For Salient Object Detection in Optical Remote Sensing Images
- X-Node Self-Explanation is All We Need
- Efficient Methods for Accurate Sparse Trajectory Recovery and Map Matching
- Computational Economics in Large Language Models Exploring Model Behavior and Incentive Design under Resource Constraints
- Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
- XQuant Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
- eMamba Efficient Acceleration Framework for Mamba Models in Edge Computing
- Improving Generative Cross-lingual Aspect-Based Sentiment Analysis with Constrained Decoding
- Advancing Cross-lingual Aspect-Based Sentiment Analysis with LLMs and Constrained Decoding for Sequence-to-Sequence Models
- What to Ask Next? Probing the Imaginative Reasoning of LLMs with TurtleSoup Puzzles
- DiffAxE Diffusion-driven Hardware Accelerator Generation and Design Space Exploration
- Pruning and Malicious Injection A Retraining-Free Backdoor Attack on Transformer Models
- Personalized Real-time Jargon Support for Online Meetings
- Can Transformers Break Encryption Schemes via In-Context Learning?
- Agentic AI Frameworks Architectures, Protocols, and Design Challenges
- Nested-ReFT Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts
- From Intent to Execution Multimodal Chain-of-Thought Reinforcement Learning for Precise CAD Code Generation
- Constrained Decoding of Diffusion LLMs with Context-Free Grammars
- Language of Persuasion and Misrepresentation in Business Communication A Textual Detection Approach
- Memory Decoder A Pretrained, Plug-and-Play Memory for Large Language Models
- OneVAE Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better
- Speed Always Wins A Survey on Efficient Architectures for Large Language Models
- MoIIE Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models
- MEML-GRPO Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement
- HierMoE Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
- NeuronTune Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs
- EGGS-PTP An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models
- Gen-AFFECT Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy
- Shadow in the Cache Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference
- Synaptic Pruning A Biological Inspiration for Deep Learning Regularization
- SinLlama -- A Large Language Model for Sinhala
- READER Retrieval-Assisted Drafter for Efficient LLM Inference
- FetFIDS A Feature Embedding Attention based Federated Network Intrusion Detection Algorithm
- A Survey on Training-free Alignment of Large Language Models
- Retrospective Sparse Attention for Efficient Long-Context Generation
- NEFMind Parameter-Efficient Fine-Tuning of Open-Source LLMs for Telecom APIs Automation
- ColorGPT Leveraging Large Language Models for Multimodal Color Recommendation
- ASPD Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs
- Steering Towards Fairness Mitigating Political Bias in LLMs
- DiffPose-Animal A Language-Conditioned Diffusion Framework for Animal Pose Estimation
- Interpretable Reward Model via Sparse Autoencoder
- A Survey on Parallel Text Generation From Parallel Decoding to Diffusion Language Models
- Prompt-and-Check Using Large Language Models to Evaluate Communication Protocol Compliance in Simulation-Based Training
- Classifier Language Models Unifying Sparse Finetuning and Adaptive Tokenization for Specialized Classification Tasks
- AgriGPT a Large Language Model Ecosystem for Agriculture
- QoE-Aware Service Provision for Mobile AR Rendering An Agent-Driven Approach
- Agentic Graph Neural Networks for Wireless Communications and Networking Towards Edge General Intelligence A Survey
- Joint decoding method for controllable contextual speech recognition based on Speech LLM
- Securing Agentic AI Threat Modeling and Risk Analysis for Network Monitoring Agentic AI System
- Profiling Large Language Model Inference on Apple Silicon A Quantization Perspective
- Using LLMs to Capture Users' Temporal Context for Recommendation
- When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise Real-World Lessons from LLM Co-Design in a Safety-Net Hospital
- Vector-Centric Machine Learning Systems A Cross-Stack Approach
- Architecting Long-Context LLM Acceleration with Packing-Prefetch Scheduler and Ultra-Large Capacity On-Chip Memories
- OverFill Two-Stage Models for Efficient Language Model Decoding
- Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference
- Follow-Your-Shape Shape-Aware Image Editing via Trajectory-Guided Region Control
- BlindGuard Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks
- TeamMedAgents Enhancing Medical Decision-Making of LLMs Through Structured Teamwork
- ChatGPT on the Road Leveraging Large Language Model-Powered In-vehicle Conversational Agents for Safer and More Enjoyable Driving Experience
- Bridging ASR and LLMs for Dysarthric Speech Recognition Benchmarking Self-Supervised and Generative Approaches
- Interpreting Fedspeak with Confidence A LLM-Based Uncertainty-Aware Framework Guided by Monetary Policy Transmission Paths
- DiTVR Zero-Shot Diffusion Transformer for Video Restoration
- EvoCoT Overcoming the Exploration Bottleneck in Reinforcement Learning
- Grove MoE Towards Efficient and Superior MoE LLMs with Adjugate Experts
- SASST Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation
- Symmetry-Aware Transformer Training for Automated Planning
- Semantic Caching for Low-Cost LLM Serving From Offline Learning to Online Adaptation
- GLiClass Generalist Lightweight Model for Sequence Classification Tasks
- LaVieID Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation
- HGMF A Hierarchical Gaussian Mixture Framework for Scalable Tool Invocation within the Model Context Protocol
- Towards Theoretical Understanding of Transformer Test-Time Computing Investigation on In-Context Linear Regression
- Grounding Natural Language for Multi-agent Decision-Making with Multi-agentic LLMs
- Investigating 1-Bit Quantization in Transformer-Based Top Tagging
- LET-US Long Event-Text Understanding of Scenes
- Efficient Edge LLMs Deployment via HessianAware Quantization and CPU GPU Collaborative
- BEVANet Bilateral Efficient Visual Attention Network for Real-Time Semantic Segmentation
- Tasa Thermal-aware 3D-Stacked Architecture Design with Bandwidth Sharing for LLM Inference
- LP-Spec Leveraging LPDDR PIM for Efficient LLM Mobile Speculative Inference with Architecture-Dataflow Co-Optimization
- Bridging Semantic Logic Gaps A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization
- DySK-Attn A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention
- How Effectively Can Large Language Models Connect SNP Variants and ECG Phenotypes for Cardiovascular Risk Prediction?
- From Nodes to Narratives Explaining Graph Neural Networks with LLMs and Graph Context
- Large Language Model Evaluated Stand-alone Attention-Assisted Graph Neural Network with Spatial and Structural Information Interaction for Precise Endoscopic Image Segmentation
- Vec2Summ Text Summarization via Probabilistic Sentence Embeddings
- Narrative Memory in Machines Multi-Agent Arc Extraction in Serialized TV
- SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency
- Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models
- Fed MobiLLM Efficient Federated LLM Fine-Tuning over Heterogeneous Mobile Devices via Server Assisted Side-Tuning
- Pushing the Envelope of LLM Inference on AI-PC
- CISO Species Distribution Modeling Conditioned on Incomplete Species Observations
STream3R Scalable Sequential 3D Reconstruction with Causal Transformer
Authors: Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, Xingang Pan
2025-08-14
We present STream3R, a novel approach to 3D reconstruction that reformulates
pointmap prediction as a decoder-only Transformer problem. Existing
state-of-the-art methods for multi-view reconstruction either depend on
expensive global optimization or rely on simplistic memory mechanisms that
scale poorly with sequence length. In contrast, STream3R introduces a
streaming framework that processes image sequences efficiently using causal
attention, inspired by advances in modern language modeling. By learning
geometric priors from large-scale 3D datasets, STream3R generalizes well to
diverse and challenging scenarios, including dynamic scenes where traditional
methods often fail. Extensive experiments show that our method consistently
outperforms prior work across both static and dynamic scene benchmarks.
Moreover, STream3R is inherently compatible with LLM-style training
infrastructure, enabling efficient large-scale pretraining and fine-tuning for
various downstream 3D tasks. Our results underscore the potential of causal
Transformer models for online 3D perception, paving the way for real-time 3D
understanding in streaming environments. More details can be found in our
project page: https://nirvanalan.github.io/projects/stream3r.
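The core mechanism is easy to see in miniature: with causal attention, each new frame's tokens attend only to tokens already seen, so per-frame work is bounded by the cache rather than by re-running global optimization. A minimal sketch, assuming a single attention head and toy dimensions (not the authors' code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CausalStreamAttention:
    """Single-head attention with a growing key/value cache (hypothetical)."""
    def __init__(self, dim, rng):
        self.Wq = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.Wk = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.Wv = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.k_cache, self.v_cache = [], []

    def step(self, frame_tokens):
        # frame_tokens: (tokens_per_frame, dim) for the newest image.
        q = frame_tokens @ self.Wq
        self.k_cache.append(frame_tokens @ self.Wk)
        self.v_cache.append(frame_tokens @ self.Wv)
        K = np.concatenate(self.k_cache)   # all tokens seen so far
        V = np.concatenate(self.v_cache)
        attn = softmax(q @ K.T / np.sqrt(q.shape[-1]))
        return attn @ V                    # fused features -> pointmap head

rng = np.random.default_rng(0)
layer = CausalStreamAttention(dim=64, rng=rng)
for t in range(5):                         # five incoming frames
    feats = layer.step(rng.normal(size=(16, 64)))
print(feats.shape)                         # (16, 64)
```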
Generalizable Federated Learning using Client Adaptive Focal Modulation
Authors: Tajamul Ashraf, Iqra Altaf Gillani
2025-08-14
Federated learning (FL) has proven essential for privacy-preserving,
collaborative training across distributed clients. Our prior work, TransFed,
introduced a robust transformer-based FL framework that leverages a
learn-to-adapt hypernetwork to generate personalized focal modulation layers
per client, outperforming traditional methods in non-IID and cross-domain
settings. In this extended version, we propose AdaptFED, where we deepen the
investigation of focal modulation in generalizable FL by incorporating: (1) a
refined adaptation strategy that integrates task-aware client embeddings to
personalize modulation dynamics further, (2) enhanced theoretical bounds on
adaptation performance, and (3) broader empirical validation across additional
modalities, including time-series and multilingual data. We also introduce an
efficient variant of TransFed that reduces server-client communication overhead
via low-rank hypernetwork conditioning, enabling scalable deployment in
resource-constrained environments. Extensive experiments on eight diverse
datasets reaffirm the superiority of our method over state-of-the-art
baselines, particularly in source-free and cross-task federated setups. Our
findings not only extend the capabilities of focal modulation in FL but also
pave the way for more adaptive, scalable, and generalizable transformer-based
federated systems. The code is available at
http://github.com/Tajamul21/TransFed
Video-BLADE Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
Authors: Youping Gu, Xiaolong Li, Yuhao Hu, Bohan Zhuang
2025-08-14
Diffusion transformers currently lead the field in high-quality video
generation, but their slow iterative denoising process and prohibitive
quadratic attention costs for long sequences create significant inference
bottlenecks. While both step distillation and sparse attention mechanisms have
shown promise as independent acceleration strategies, effectively combining
these approaches presents critical challenges -- training-free integration
yields suboptimal results, while separately training sparse attention after
step distillation requires prohibitively expensive high-quality video data. To
overcome these limitations, we propose BLADE, an innovative data-free joint
training framework that introduces: (1) an Adaptive Block-Sparse Attention
(ASA) mechanism for dynamically generating content-aware sparsity masks to
focus computation on salient spatiotemporal features, and (2) a sparsity-aware
step distillation paradigm built upon Trajectory Distribution Matching (TDM)
that directly incorporates sparsity into the distillation process rather than
treating it as a separate compression step, with fast convergence. We validate
BLADE on text-to-video models like CogVideoX-5B and Wan2.1-1.3B. Our framework
demonstrates remarkable efficiency gains across different scales. On
Wan2.1-1.3B, BLADE achieves a 14.10x end-to-end inference acceleration over a
50-step baseline. Moreover, on models such as CogVideoX-5B with short video
sequence lengths, our framework delivers a robust 8.89x speedup. Crucially, the
acceleration is accompanied by a consistent quality improvement. On the
VBench-2.0 benchmark, BLADE boosts the score of CogVideoX-5B to 0.569 (from
0.534) and Wan2.1-1.3B to 0.570 (from 0.563), results that are further
corroborated by superior ratings in human evaluations. Our code and model
weights are publicly available at: http://ziplab.co/BLADE-Homepage/.
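The block-sparse idea behind ASA can be sketched independently of the paper's learned components: score block pairs cheaply, then attend only where the score is high. The pooling and top-k choices below are our assumptions, not BLADE's trained mask generator:

```python
import numpy as np

def block_sparse_mask(q, k, block=16, keep=4):
    """Score block pairs by mean-pooled dot products and keep the top-`keep`
    key blocks per query block; a simplified stand-in for learned ASA masks."""
    nq, nk = len(q) // block, len(k) // block
    qb = q[: nq * block].reshape(nq, block, -1).mean(1)
    kb = k[: nk * block].reshape(nk, block, -1).mean(1)
    scores = qb @ kb.T                      # (nq, nk) block-level saliency
    mask = np.zeros_like(scores, dtype=bool)
    top = np.argsort(-scores, axis=1)[:, :keep]
    np.put_along_axis(mask, top, True, axis=1)
    return mask                             # attention computed only where True

rng = np.random.default_rng(0)
q = rng.normal(size=(128, 64)); k = rng.normal(size=(128, 64))
m = block_sparse_mask(q, k)
print(m.sum(), "of", m.size, "block pairs kept")   # 32 of 64
```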
Thinking Inside the Mask In-Place Prompting in Diffusion LLMs
Authors: Xiangqi Jin, Yuxuan Wang, Yifeng Gao, Zichen Wen, Biqing Qi, Dongrui Liu, Linfeng Zhang
2025-08-14
Although large language models (LLMs) have achieved remarkable success, their
prefix-only prompting paradigm and sequential generation process offer limited
flexibility for bidirectional information. Diffusion large language models
(dLLMs) present new opportunities through their bidirectional attention
mechanisms and iterative refinement processes, enabling more flexible in-place
prompting strategies. We introduce ICE (In-Place Chain-of-Thought Prompting
with Early Exit), a novel framework that transforms prefix-only prompting into
in-place prompting specifically designed for dLLMs. ICE integrates in-place
prompts directly within masked token positions during iterative refinement and
employs a confidence-aware early exit mechanism to significantly reduce
computational overhead. Extensive experiments demonstrate ICE's effectiveness,
achieving up to 17.29% accuracy improvement with 4.12x speedup on GSM8K, and
up to 276.67x speedup on MMLU while maintaining competitive performance.
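The confidence-aware early-exit loop can be illustrated with a toy mask-filling decoder; the threshold rule and the stand-in model below are assumptions, not ICE's implementation:

```python
import numpy as np

def iterative_unmask(logits_fn, length, vocab, threshold=0.9, max_steps=20):
    """Toy in-place decoding loop: commit masked positions whose max
    probability exceeds `threshold`, and exit early once all are filled.
    `logits_fn` stands in for a dLLM forward pass (hypothetical)."""
    MASK = -1
    seq = np.full(length, MASK)
    for step in range(max_steps):
        probs = logits_fn(seq)                      # (length, vocab)
        conf, pick = probs.max(1), probs.argmax(1)
        commit = (seq == MASK) & (conf >= threshold)
        if not commit.any():                        # force the best one to progress
            masked = np.where(seq == MASK)[0]
            commit = np.zeros(length, bool)
            commit[masked[conf[masked].argmax()]] = True
        seq[commit] = pick[commit]
        if (seq != MASK).all():                     # confidence-aware early exit
            return seq, step + 1
    return seq, max_steps

rng = np.random.default_rng(0)
fake = lambda s: rng.dirichlet(np.ones(50), size=len(s))
out, steps = iterative_unmask(fake, length=8, vocab=50)
print(out, "finished in", steps, "refinement steps")
```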
Continuous Bangla Sign Language Translation Mitigating the Expense of Gloss Annotation with the Assistance of Graph
Authors: Safaeid Hossain Arib, Rabeya Akter, Sejuti Rahman
2025-08-14
Millions of individuals worldwide are affected by deafness and hearing
impairment. Sign language serves as a sophisticated means of communication for
the deaf and hard of hearing. However, in societies that prioritize spoken
languages, sign language often faces underestimation, leading to communication
barriers and social exclusion. The Continuous Bangla Sign Language Translation
project aims to address this gap by enhancing translation methods. While recent
approaches leverage the transformer architecture for state-of-the-art results,
our method integrates graph-based methods with the transformer architecture.
This fusion, combining transformer and STGCN-LSTM architectures, proves more
effective in gloss-free translation. Our contributions include architectural
fusion, exploring various fusion strategies, and achieving a new
state-of-the-art performance on diverse sign language datasets, namely
RWTH-PHOENIX-2014T, CSL-Daily, How2Sign, and BornilDB v1.0. Our approach
demonstrates superior performance compared to current translation outcomes
across all datasets, showcasing notable improvements of BLEU-4 scores of 4.01,
2.07, and 0.5, surpassing those of GASLT, GASLT and slt_how2sign in
RWTH-PHOENIX-2014T, CSL-Daily, and How2Sign, respectively. Also, we introduce
benchmarking on the BornilDB v1.0 dataset for the first time. Our method sets a
benchmark for future research, emphasizing the importance of gloss-free
translation to improve communication accessibility for the deaf and hard of
hearing.
SemPT Semantic Prompt Tuning for Vision-Language Models
Authors: Xiao Shi, Yangjun Ou, Zhenzhong Chen
2025-08-14
Visual transfer learning for unseen categories is an active yet challenging
research topic, due to the inherent conflict between preserving
category-specific representations and acquiring transferable knowledge.
Vision-Language Models (VLMs) pre-trained on large amounts of image-text pairs
offer a promising solution. However, existing prompt tuning methods rely on
category labels or disparate LLM-generated descriptions, which fragment
knowledge representation and hinder transferability. To address this
limitation, we introduce Semantic Prompt Tuning (SemPT), a novel framework that
tackles the generalization challenge by leveraging shared attribute-level
knowledge across categories. Specifically, SemPT adopts a two-step prompting
strategy to guide LLMs in extracting shared visual attributes and generating
attribute-level descriptions, capturing transferable semantic cues beyond
labels while ensuring coherent structure. Then, visually guided weighting is
applied to the embeddings of attribute-level descriptions to reduce noise from
irrelevant attributes and enhance the text embeddings. Additionally, image
embeddings are jointly aligned with both label and attribute-enhanced text
embeddings, balancing discrimination for seen categories and transferability to
unseen ones. Considering the availability of category exposure, our inference
dynamically selects between standard label embeddings for seen categories and
attribute-enhanced embeddings for unseen ones to ensure effective adaptation.
Extensive experiments on 15 benchmark datasets demonstrate that SemPT achieves
state-of-the-art performance across various settings, including base-to-novel
generalization, cross-dataset transfer, cross-domain transfer, and few-shot
learning.
DAS Dual-Aligned Semantic IDs Empowered Industrial Recommender System
Authors: Wencai Ye, Mingjie Sun, Shaoyun Shi, Peng Wang, Wenjin Wu, Peng Jiang
2025-08-14
Semantic IDs are discrete identifiers generated by quantizing Multi-modal
Large Language Model (MLLM) embeddings, enabling efficient multi-modal
content integration in recommendation systems. However, their lack of
collaborative signals results in a misalignment with downstream discriminative
and generative recommendation objectives. Recent studies have introduced
various alignment mechanisms to address this problem, but their two-stage
framework design still leads to two main limitations: (1) inevitable
information loss during alignment, and (2) inflexibility in applying adaptive
alignment strategies, consequently constraining the mutual information
maximization during the alignment process. To address these limitations, we
propose a novel and flexible one-stage Dual-Aligned Semantic IDs (DAS) method
that simultaneously optimizes quantization and alignment, preserving semantic
integrity and alignment quality while avoiding the information loss typically
associated with two-stage methods. Meanwhile, DAS achieves more efficient
alignment between the semantic IDs and collaborative signals, with the
following two innovative and effective approaches: (1) Multi-view Contrastive
Alignment: To maximize mutual information between semantic IDs and
collaborative signals, we first incorporate an ID-based CF debias module, and
then design three effective contrastive alignment methods: dual user-to-item
(u2i), dual item-to-item/user-to-user (i2i/u2u), and dual co-occurrence
item-to-item/user-to-user (i2i/u2u). (2) Dual Learning: By aligning the dual
quantizations of users and ads, the constructed semantic IDs for users and ads
achieve stronger alignment. Finally, we conduct extensive offline experiments
and online A/B tests to evaluate DAS's effectiveness, which is now successfully
deployed across various advertising scenarios at Kuaishou App, serving over 400
million users daily.
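The contrastive-alignment ingredient can be sketched with a generic symmetric InfoNCE between user and item embedding views; the loss form and temperature below are standard stand-ins, not DAS's exact objective:

```python
import numpy as np

def info_nce(u, v, tau=0.07):
    """Symmetric contrastive loss between two embedding views, a generic
    stand-in for dual user-to-item alignment (not the paper's code)."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    logits = u @ v.T / tau                     # (batch, batch) similarities
    labels = np.arange(len(u))                 # matched pairs on the diagonal
    log_p = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    loss_uv = -log_p[labels, labels].mean()
    log_p_t = logits.T - np.log(np.exp(logits.T).sum(1, keepdims=True))
    loss_vu = -log_p_t[labels, labels].mean()
    return (loss_uv + loss_vu) / 2

rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```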
GCRPNet Graph-Enhanced Contextual and Regional Perception Network For Salient Object Detection in Optical Remote Sensing Images
Authors: Mengyu Ren, Yutong Li, Hua Li, Runmin Cong, Sam Kwong
2025-08-14
Salient object detection (SOD) in optical remote sensing images (ORSIs) faces
numerous challenges, including significant variations in target scales and low
contrast between targets and the background. Existing methods based on vision
transformer (ViT) and convolutional neural network (CNN) architectures aim
to leverage both global and local features, but the difficulty in effectively
integrating these heterogeneous features limits their overall performance. To
overcome these limitations, we propose a graph-enhanced contextual and regional
perception network (GCRPNet), which builds upon the Mamba architecture to
simultaneously capture long-range dependencies and enhance regional feature
representation. Specifically, we employ the visual state space (VSS) encoder to
extract multi-scale features. To further achieve deep guidance and enhancement
of these features, we first design a difference-similarity guided hierarchical
graph attention module (DS-HGAM). This module strengthens cross-layer
interaction capabilities between features of different scales while enhancing
the model's structural perception, allowing it to distinguish between
foreground and background more effectively. Then, we design the LEVSS block as
the decoder of GCRPNet. This module integrates our proposed adaptive scanning
strategy and
multi-granularity collaborative attention enhancement module (MCAEM). It
performs adaptive patch scanning on feature maps processed via multi-scale
convolutions, thereby capturing rich local region information and enhancing
Mamba's local modeling capability. Extensive experimental results demonstrate
that the proposed model achieves state-of-the-art performance, validating its
effectiveness and superiority.
X-Node Self-Explanation is All We Need
Authors: Prajit Sengupta, Islem Rekik
2025-08-14
Graph neural networks (GNNs) have achieved state-of-the-art results in
computer vision and medical image classification tasks by capturing structural
dependencies across data instances. However, their decision-making remains
largely opaque, limiting their trustworthiness in high-stakes clinical
applications where interpretability is essential. Existing explainability
techniques for GNNs are typically post-hoc and global, offering limited insight
into individual node decisions or local reasoning. We introduce X-Node, a
self-explaining GNN framework in which each node generates its own explanation
as part of the prediction process. For every node, we construct a structured
context vector encoding interpretable cues such as degree, centrality,
clustering, feature saliency, and label agreement within its local topology. A
lightweight Reasoner module maps this context into a compact explanation
vector, which serves three purposes: (1) reconstructing the node's latent
embedding via a decoder to enforce faithfulness, (2) generating a natural
language explanation using a pre-trained LLM (e.g., Grok or Gemini), and (3)
guiding the GNN itself via a "text-injection" mechanism that feeds explanations
back into the message-passing pipeline. We evaluate X-Node on two graph
datasets derived from MedMNIST and MorphoMNIST, integrating it with GCN, GAT,
and GIN backbones. Our results show that X-Node maintains competitive
classification accuracy while producing faithful, per-node explanations.
Repository: https://github.com/basiralab/X-Node.
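A context vector of the kind X-Node describes can be assembled from simple graph statistics; the exact fields below are illustrative choices, not the paper's recipe:

```python
import numpy as np

def node_context(adj, feats, labels, i):
    """Assemble an interpretable context vector for node i (degree, local
    clustering, label agreement, feature saliency), in the spirit of X-Node."""
    nbrs = np.where(adj[i])[0]
    deg = len(nbrs)
    # local clustering coefficient over the neighbourhood
    links = sum(adj[a, b] for a in nbrs for b in nbrs if a < b)
    clust = 2 * links / (deg * (deg - 1)) if deg > 1 else 0.0
    agree = (labels[nbrs] == labels[i]).mean() if deg else 0.0
    saliency = np.abs(feats[i]).argmax()       # most activated input feature
    return np.array([deg, clust, agree, saliency], dtype=float)

adj = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]])
feats = np.eye(4); labels = np.array([0, 0, 0, 1])
print(node_context(adj, feats, labels, 2))     # [3. 0.333 0.667 2.]
```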
Efficient Methods for Accurate Sparse Trajectory Recovery and Map Matching
Authors: Wei Tian, Jieming Shi, Man Lung Yiu
2025-08-14
Real-world trajectories are often sparse, with low sampling rates (i.e., long
intervals between consecutive GPS points), and misaligned with road networks,
yet many applications demand high-quality data for optimal performance. To
improve data quality with sparse trajectories as input, we systematically study
two related research problems: trajectory recovery on road network, which aims
to infer missing points to recover high-sampling trajectories, and map
matching, which aims to map GPS points to road segments to determine underlying
routes. In this paper, we present efficient methods TRMMA and MMA for accurate
trajectory recovery and map matching, respectively, where MMA serves as the
first step of TRMMA. In MMA, we carefully formulate a classification task to
map a GPS point from sparse trajectories to a road segment over a small
candidate segment set, rather than the entire road network. We develop
techniques in MMA to generate effective embeddings that capture the patterns of
GPS data, directional information, and road segments, to accurately align
trajectories to routes. For trajectory recovery, TRMMA focuses on the
segments in the route returned by MMA to infer missing points with position
ratios on road segments, producing high-sampling trajectories efficiently by
avoiding evaluation of all road segments. Specifically, in TRMMA, we design a
dual-transformer encoding process to cohesively capture latent patterns in
trajectories and routes, and an effective decoding technique to sequentially
predict the position ratios and road segments of missing points. We conduct
extensive experiments to compare TRMMA and MMA with numerous existing methods
for trajectory recovery and map matching, respectively, on 4 large real-world
datasets. TRMMA and MMA consistently achieve the best result quality, often by
a significant margin.
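MMA's candidate-set idea, classifying a GPS point against nearby segments only rather than the whole network, can be sketched with plain geometry; the radius and planar distance model below are assumptions:

```python
import math

def candidate_segments(point, segments, radius=0.002):
    """Restrict classification to segments near the GPS point, mirroring the
    small candidate set used in MMA (toy planar geometry, not the paper's code)."""
    def dist(p, seg):                     # point-to-segment distance
        (x1, y1), (x2, y2) = seg; px, py = p
        dx, dy = x2 - x1, y2 - y1
        t = max(0.0, min(1.0, ((px - x1) * dx + (py - y1) * dy) / (dx * dx + dy * dy)))
        return math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))
    return sorted((s for s in segments if dist(point, s) <= radius),
                  key=lambda s: dist(point, s))

segs = [((0, 0), (0.01, 0)), ((0, 0.01), (0.01, 0.01))]
print(candidate_segments((0.005, 0.0005), segs))   # only the nearby segment
```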
Computational Economics in Large Language Models Exploring Model Behavior and Incentive Design under Resource Constraints
Authors: Sandeep Reddy, Kabir Khan, Rohit Patil, Ananya Chakraborty, Faizan A. Khan, Swati Kulkarni, Arjun Verma, Neha Singh
2025-08-14
Large language models (LLMs) are limited by substantial computational cost.
We introduce a "computational economics" framework that treats an LLM as an
internal economy of resource-constrained agents (attention heads and neuron
blocks) that must allocate scarce computation to maximize task utility. First,
we show empirically that when computation is scarce, standard LLMs reallocate
attention toward high-value tokens while preserving accuracy. Building on this
observation, we propose an incentive-driven training paradigm that augments the
task loss with a differentiable computation cost term, encouraging sparse and
efficient activations. On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method
yields a family of models that trace a Pareto frontier and consistently
dominate post-hoc pruning; for similar accuracy we obtain roughly a forty
percent reduction in FLOPS and lower latency, together with more interpretable
attention patterns. These results indicate that economic principles offer a
principled route to designing efficient, adaptive, and more transparent LLMs
under strict resource constraints.
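The incentive mechanism amounts to pricing computation inside the loss; a minimal sketch, using an L1 activation penalty as a stand-in for the paper's differentiable cost term:

```python
import numpy as np

def economic_loss(task_loss, activations, price=1e-3):
    """Augment a task loss with a 'computation price'. L1 on activations is a
    standard differentiable proxy for compute cost; the paper's exact term
    may differ."""
    compute_cost = sum(np.abs(a).mean() for a in activations)
    return task_loss + price * compute_cost

acts = [np.random.default_rng(0).normal(size=(4, 16)) for _ in range(3)]
print(economic_loss(task_loss=0.42, activations=acts))
```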
Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
Authors: Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li
2025-08-14
With the rapid proliferation of Natural Language Processing (NLP), especially
Large Language Models (LLMs), generating adversarial examples to jailbreak
LLMs remains a key challenge for understanding model vulnerabilities and
improving robustness. In this context, we propose a new black-box attack method
that leverages the interpretability of large models. We introduce the Sparse
Feature Perturbation Framework (SFPF), a novel approach for adversarial text
generation that utilizes sparse autoencoders to identify and manipulate
critical features
in text. After using the SAE model to reconstruct hidden layer representations,
we perform feature clustering on the successfully attacked texts to identify
features with higher activations. These highly activated features are then
perturbed to generate new adversarial texts. This selective perturbation
preserves the malicious intent while amplifying safety signals, thereby
increasing their potential to evade existing defenses. Our method enables a new
red-teaming strategy that balances adversarial effectiveness with safety
alignment. Experimental results demonstrate that adversarial texts generated by
SFPF can bypass state-of-the-art defense mechanisms, revealing persistent
vulnerabilities in current NLP systems. However, the method's effectiveness
varies across prompts and layers, and its generalizability to other
architectures and larger models remains to be validated.
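The encode-amplify-decode loop at the heart of SFPF can be sketched with random matrices standing in for a trained SAE; the top-k selection and scaling rule are our assumptions:

```python
import numpy as np

def perturb_features(h, W_enc, W_dec, top=5, scale=1.5):
    """Sketch of sparse-feature perturbation: encode a hidden state with an
    SAE, amplify its most active features, and decode back. Matrices here are
    random placeholders, not SFPF's trained components."""
    z = np.maximum(h @ W_enc, 0)           # ReLU yields a sparse feature code
    idx = np.argsort(-z)[:top]             # highest-activation features
    z[idx] *= scale                        # selective perturbation
    return z @ W_dec                       # perturbed hidden state

rng = np.random.default_rng(0)
d, f = 64, 256
h = rng.normal(size=d)
out = perturb_features(h, rng.normal(size=(d, f)) / 8, rng.normal(size=(f, d)) / 8)
print(out.shape)                           # (64,)
```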
XQuant Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
Authors: Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
2025-08-14
Although LLM inference has emerged as a critical workload for many downstream
applications, efficiently inferring LLMs is challenging due to the substantial
memory footprint and bandwidth requirements. In parallel, compute capabilities
have steadily outpaced both memory capacity and bandwidth over the last few
decades, a trend that remains evident in modern GPU hardware and exacerbates
the challenge of LLM inference. As such, new algorithms are emerging that trade
increased computation for reduced memory operations. To that end, we present
XQuant, which takes advantage of this trend, enabling an order-of-magnitude
reduction in memory consumption through quantization, with substantial
accuracy benefits relative to state-of-the-art KV cache quantization methods.
We accomplish this by quantizing and caching the layer input activations X,
instead of using standard KV caching, and then rematerializing the Keys and
Values on-the-fly during inference. This results in an immediate 2x memory
savings compared to KV caching. By applying XQuant, we achieve substantial
memory savings with minimal perplexity degradation compared to the FP16
baseline. Furthermore, our approach leverages the fact that X values
are similar across layers. Building on this observation, we introduce
XQuant-CL, which exploits the cross-layer similarity in the X embeddings for
extreme compression. Across different models, XQuant-CL attains up to 10x
memory savings relative to the FP16 baseline with only 0.01 perplexity
degradation, and 12.5x memory savings with only marginal perplexity
degradation. XQuant exploits the rapidly increasing compute capabilities of
hardware platforms to eliminate the memory bottleneck, while surpassing
state-of-the-art KV cache quantization methods and achieving near-FP16
accuracy across a wide range of models.
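The central trade, caching one quantized tensor instead of two and recomputing K and V, is easy to sketch; the int8 scheme and single projection pair below are illustrative, not XQuant's exact kernels:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization (illustrative)."""
    scale = np.abs(x).max() / 127 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

class XCache:
    """Cache quantized layer inputs X and rebuild K, V on the fly,
    sketching XQuant's core trade with one projection pair."""
    def __init__(self, Wk, Wv):
        self.Wk, self.Wv, self.store = Wk, Wv, []

    def append(self, x):                    # x: (tokens, dim) new activations
        self.store.append(quantize_int8(x)) # one tensor cached instead of K and V

    def rematerialize(self):
        X = np.concatenate([q.astype(np.float32) * s for q, s in self.store])
        return X @ self.Wk, X @ self.Wv     # extra GEMMs buy the memory savings

rng = np.random.default_rng(0)
cache = XCache(rng.normal(size=(64, 64)), rng.normal(size=(64, 64)))
for _ in range(3):
    cache.append(rng.normal(size=(8, 64)))
K, V = cache.rematerialize()
print(K.shape, V.shape)                     # (24, 64) (24, 64)
```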
eMamba Efficient Acceleration Framework for Mamba Models in Edge Computing
Authors: Jiyong Kim, Jaeho Lee, Jiahao Lin, Alish Kanani, Miao Sun, Umit Y. Ogras, Jaehyun Park
2025-08-14
State Space Model (SSM)-based machine learning architectures have recently
gained significant attention for processing sequential data. Mamba, a recent
sequence-to-sequence SSM, offers competitive accuracy with superior
computational efficiency compared to state-of-the-art transformer models. While
this advantage makes Mamba particularly promising for resource-constrained edge
devices, no hardware acceleration frameworks are currently optimized for
deploying it in such environments. This paper presents eMamba, a comprehensive
end-to-end hardware acceleration framework explicitly designed for deploying
Mamba models on edge platforms. eMamba maximizes computational efficiency by
replacing complex normalization layers with lightweight hardware-aware
alternatives and approximating expensive operations, such as SiLU activation
and exponentiation, considering the target applications. Then, it performs an
approximation-aware neural architecture search (NAS) to tune the learnable
parameters used during approximation. Evaluations with Fashion-MNIST, CIFAR-10,
and MARS, an open-source human pose estimation dataset, show eMamba achieves
comparable accuracy to state-of-the-art techniques using 1.63-19.9x fewer
parameters. In addition, it generalizes well to large-scale natural language
tasks, demonstrating stable perplexity across varying sequence lengths on the
WikiText2 dataset. We also quantize and implement the entire eMamba pipeline on
an AMD ZCU102 FPGA and ASIC using GlobalFoundries (GF) 22 nm technology.
Experimental results show 4.95-5.62x lower latency and 2.22-9.95x higher
throughput, with 4.77x smaller area, 9.84x lower power, and 48.6x lower energy
consumption than baseline solutions while maintaining competitive accuracy.
Improving Generative Cross-lingual Aspect-Based Sentiment Analysis with Constrained Decoding
Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
2025-08-14
While aspect-based sentiment analysis (ABSA) has made substantial progress,
challenges remain for low-resource languages, which are often overlooked in
favour of English. Current cross-lingual ABSA approaches focus on limited, less
complex tasks and often rely on external translation tools. This paper
introduces a novel approach using constrained decoding with
sequence-to-sequence models, eliminating the need for unreliable translation
tools and improving cross-lingual performance by 5% on average for the most
complex task. The proposed method also supports multi-tasking, which enables
solving multiple ABSA tasks with a single model, with constrained decoding
boosting results by more than 10%.
We evaluate our approach across seven languages and six ABSA tasks,
surpassing state-of-the-art methods and setting new benchmarks for previously
unexplored tasks. Additionally, we assess large language models (LLMs) in
zero-shot, few-shot, and fine-tuning scenarios. While LLMs perform poorly in
zero-shot and few-shot settings, fine-tuning achieves competitive results
compared to smaller multilingual models, albeit at the cost of longer training
and inference times.
We provide practical recommendations for real-world applications, enhancing
the understanding of cross-lingual ABSA methodologies. This study offers
valuable insights into the strengths and limitations of cross-lingual ABSA
approaches, advancing the state-of-the-art in this challenging research domain.
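The constrained-decoding idea, only emitting tokens that can still extend to a valid output, can be shown on a toy label vocabulary; real systems apply the same filter as a logits mask inside the decoder (the tokenizer and terms below are hypothetical):

```python
def constrained_next_tokens(prefix, allowed_terms, tokenize=str.split):
    """Toy constrained decoding: only tokens that keep the output a prefix of
    some allowed term survive this step."""
    out = set()
    for term in allowed_terms:
        toks = tokenize(term)
        if toks[: len(prefix)] == prefix and len(toks) > len(prefix):
            out.add(toks[len(prefix)])
    return out

labels = ["food quality positive", "food quality negative", "service positive"]
print(constrained_next_tokens(["food", "quality"], labels))
# {'positive', 'negative'}
```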
Advancing Cross-lingual Aspect-Based Sentiment Analysis with LLMs and Constrained Decoding for Sequence-to-Sequence Models
Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
2025-08-14
Aspect-based sentiment analysis (ABSA) has made significant strides, yet
challenges remain for low-resource languages due to the predominant focus on
English. Current cross-lingual ABSA studies often centre on simpler tasks and
rely heavily on external translation tools. In this paper, we present a novel
sequence-to-sequence method for compound ABSA tasks that eliminates the need
for such tools. Our approach, which uses constrained decoding, improves
cross-lingual ABSA performance by up to 10%. This method broadens the scope of
cross-lingual ABSA, enabling it to handle more complex tasks and providing a
practical, efficient alternative to translation-dependent techniques.
Furthermore, we compare our approach with large language models (LLMs) and
show that while fine-tuned multilingual LLMs can achieve comparable results,
English-centric LLMs struggle with these tasks.
What to Ask Next? Probing the Imaginative Reasoning of LLMs with TurtleSoup Puzzles
Authors: Mengtao Zhou, Sifan Wu, Huan Zhang, Qi Sima, Bang Liu
2025-08-14
We investigate the capacity of Large Language Models (LLMs) for imaginative
reasoning--the proactive construction, testing, and revision of hypotheses in
information-sparse environments. Existing benchmarks, often static or focused
on social deduction, fail to capture the dynamic, exploratory nature of this
reasoning process. To address this gap, we introduce a comprehensive research
framework based on the classic "Turtle Soup" game, integrating a benchmark, an
agent, and an evaluation protocol. We present TurtleSoup-Bench, the first
large-scale, bilingual, interactive benchmark for imaginative reasoning,
comprising 800 turtle soup puzzles sourced from both the Internet and expert
authors. We also propose Mosaic-Agent, a novel agent designed to assess LLMs'
performance in this setting. To evaluate reasoning quality, we develop a
multi-dimensional protocol measuring logical consistency, detail completion,
and conclusion alignment. Experiments with leading LLMs reveal clear capability
limits, common failure patterns, and a significant performance gap compared to
humans. Our work offers new insights into LLMs' imaginative reasoning and
establishes a foundation for future research on exploratory agent behavior.
DiffAxE Diffusion-driven Hardware Accelerator Generation and Design Space Exploration
Authors: Arkapravo Ghosh, Abhishek Moitra, Abhiroop Bhattacharjee, Ruokai Yin, Priyadarshini Panda
2025-08-14
Design space exploration (DSE) is critical for developing optimized hardware
architectures, especially for AI workloads such as deep neural networks (DNNs)
and large language models (LLMs), which require specialized accelerators. As
model complexity grows, accelerator design spaces have expanded to O(10^17),
becoming highly irregular, non-convex, and exhibiting many-to-one mappings from
design configurations to performance metrics. This complexity renders direct
inverse derivation infeasible and necessitates heuristic or sampling-based
optimization. Conventional methods - including Bayesian optimization, gradient
descent, reinforcement learning, and genetic algorithms - depend on iterative
sampling, resulting in long runtimes and sensitivity to initialization. Deep
learning-based approaches have reframed DSE as classification using
recommendation models, but remain limited to small-scale (O(10^3)), less
complex design spaces. To overcome these constraints, we propose a generative
approach that models hardware design as 1-D image synthesis conditioned on
target performance, enabling efficient learning of non-differentiable,
non-bijective hardware-performance mappings. Our framework achieves 0.86% lower
generation error than Bayesian optimization with a 17000x speedup, and
outperforms GANDSE with 30% lower error at only 1.83x slower search. We further
extend the method to a structured DSE setting, attaining 9.8% lower
energy-delay product (EDP) and 6% higher performance, with up to 145.6x and
1312x faster search compared to existing optimization methods on O(10^17)
design spaces. For LLM inference, our method achieves 3.37x and 7.75x lower EDP
on a 32nm ASIC and Xilinx Ultrascale+ VPU13 FPGA, respectively, compared to the
state-of-the-art DOSA framework.
Pruning and Malicious Injection A Retraining-Free Backdoor Attack on Transformer Models
Authors: Taibiao Zhao, Mingxuan Sun, Hao Wang, Xiaobing Chen, Xiangwei Zhou
2025-08-14
Transformer models have demonstrated exceptional performance and have become
indispensable in computer vision (CV) and natural language processing (NLP)
tasks. However, recent studies reveal that transformers are susceptible to
backdoor attacks. Prior backdoor attack methods typically rely on retraining
with clean data or altering the model architecture, both of which can be
resource-intensive and intrusive. In this paper, we propose Head-wise Pruning
and Malicious Injection (HPMI), a novel retraining-free backdoor attack on
transformers that does not alter the model's architecture. Our approach
requires only a small subset of the original data and basic knowledge of the
model architecture, eliminating the need for retraining the target transformer.
Technically, HPMI works by pruning the least important head and injecting a
pre-trained malicious head to establish the backdoor. We provide a rigorous
theoretical justification demonstrating that the implanted backdoor resists
detection and removal by state-of-the-art defense techniques, under reasonable
assumptions. Experimental evaluations across multiple datasets further validate
the effectiveness of HPMI, showing that it 1) incurs negligible clean accuracy
loss, 2) achieves at least 99.55% attack success rate, and 3) bypasses four
advanced defense mechanisms. Additionally, relative to state-of-the-art
retraining-dependent attacks, HPMI achieves greater concealment and robustness
against diverse defense strategies, while maintaining minimal impact on clean
accuracy.
Personalized Real-time Jargon Support for Online Meetings
Authors: Yifan Song, Wing Yee Au, Hon Yung Wong, Brian P. Bailey, Tal August
2025-08-13
Effective interdisciplinary communication is frequently hindered by
domain-specific jargon. To explore the jargon barriers in-depth, we conducted a
formative diary study with 16 professionals, revealing critical limitations in
current jargon-management strategies during workplace meetings. Based on these
insights, we designed ParseJargon, an interactive LLM-powered system providing
real-time personalized jargon identification and explanations tailored to
users' individual backgrounds. A controlled experiment comparing ParseJargon
against baseline (no support) and general-purpose (non-personalized) conditions
demonstrated that personalized jargon support significantly enhanced
participants' comprehension, engagement, and appreciation of colleagues' work,
whereas general-purpose support negatively affected engagement. A follow-up
field study validated ParseJargon's usability and practical value in real-time
meetings, highlighting both opportunities and limitations for real-world
deployment. Our findings contribute insights into designing personalized jargon
support tools, with implications for broader interdisciplinary and educational
applications.
Can Transformers Break Encryption Schemes via In-Context Learning?
Authors: Jathin Korrapati, Patrick Mendoza, Aditya Tomar, Abein Abraham
2025-08-13
In-context learning (ICL) has emerged as a powerful capability of
transformer-based language models, enabling them to perform tasks by
conditioning on a small number of examples presented at inference time, without
any parameter updates. Prior work has shown that transformers can generalize
over simple function classes like linear functions, decision trees, even neural
networks, purely from context, focusing on numerical or symbolic reasoning over
underlying well-structured functions. Instead, we propose a novel application
of ICL into the domain of cryptographic function learning, specifically
focusing on ciphers such as mono-alphabetic substitution and Vigenère
ciphers, two classes of private-key encryption schemes. These ciphers involve a
fixed but hidden bijective mapping between plain text and cipher text
characters. Given a small set of (cipher text, plain text) pairs, the goal is
for the model to infer the underlying substitution and decode a new cipher
text word. This setting poses a structured inference challenge, which is
well-suited for evaluating the inductive biases and generalization capabilities
of transformers under the ICL paradigm. Code is available at
https://github.com/adistomar/CS182-project.
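Constructing such an ICL task is mechanical; a sketch of generating few-shot substitution-cipher pairs under one hidden key (the word list and prompt format are our choices):

```python
import random, string

def make_icl_prompt(n_examples=4, seed=0):
    """Build an in-context mono-alphabetic substitution task: few-shot
    (ciphertext, plaintext) pairs plus one query under the same hidden key."""
    rng = random.Random(seed)
    perm = list(string.ascii_lowercase)
    rng.shuffle(perm)
    key = dict(zip(string.ascii_lowercase, perm))
    enc = lambda w: "".join(key[c] for c in w)
    words = ["apple", "banana", "cherry", "grape", "lemon"]
    shots = [f"{enc(w)} -> {w}" for w in words[:n_examples]]
    query = words[n_examples]
    return "\n".join(shots + [f"{enc(query)} -> ?"]), query

prompt, answer = make_icl_prompt()
print(prompt, "\n# expected:", answer)
```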
Agentic AI Frameworks Architectures, Protocols, and Design Challenges
Authors: Hana Derouiche, Zaki Brahmi, Haithem Mazeni
2025-08-13
The emergence of Large Language Models (LLMs) has ushered in a transformative
paradigm in artificial intelligence, Agentic AI, where intelligent agents
exhibit goal-directed autonomy, contextual reasoning, and dynamic multi-agent
coordination. This paper provides a systematic review and comparative analysis
of leading Agentic AI frameworks, including CrewAI, LangGraph, AutoGen,
Semantic Kernel, Agno, Google ADK, and MetaGPT, evaluating their architectural
principles, communication mechanisms, memory management, safety guardrails, and
alignment with service-oriented computing paradigms. Furthermore, we identify
key limitations, emerging trends, and open challenges in the field. To address
the issue of agent communication, we conduct an in-depth analysis of protocols
such as the Contract Net Protocol (CNP), Agent-to-Agent (A2A), Agent Network
Protocol (ANP), and Agora. Our findings not only establish a foundational
taxonomy for Agentic AI systems but also propose future research directions to
enhance scalability, robustness, and interoperability. This work serves as a
comprehensive reference for researchers and practitioners working to advance
the next generation of autonomous AI systems.
Nested-ReFT Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts
Authors: Maxime Heuillet, Yufei Cui, Boxing Chen, Audrey Durand, Prasanna Parthasarathi
2025-08-13
Advanced reasoning in LLMs on challenging domains like mathematical reasoning
can be tackled using verifiable-rewards-based reinforced fine-tuning (ReFT). In
standard ReFT frameworks, a behavior model generates multiple completions with
answers per problem, whose answers are then scored by a reward function.
While such RL post-training methods demonstrate significant performance
improvements across challenging reasoning domains, the computational cost of
generating completions during training with multiple inference steps makes the
training cost non-trivial. To address this, we draw inspiration from off-policy
RL and speculative decoding to introduce a novel ReFT framework, dubbed
Nested-ReFT, where a subset of layers of the target model acts as the behavior
model to generate off-policy completions during training. The behavior model
configured with dynamic layer skipping per batch during training decreases the
inference cost compared to the standard ReFT frameworks. Our theoretical
analysis shows that Nested-ReFT yields unbiased gradient estimates with
controlled variance. Our empirical analysis demonstrates improved computational
efficiency measured as tokens/sec across multiple math reasoning benchmarks and
model sizes. Additionally, we explore three variants of bias mitigation to
minimize the off-policyness in the gradient updates that allows for maintaining
performance that matches the baseline ReFT performance.
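The behavior-model trick, reusing a subset of the target model's layers for cheap rollouts, can be sketched with a toy residual stack; the skip pattern below is illustrative, not the paper's dynamic scheduler:

```python
import numpy as np

def forward(x, layers, skip=frozenset()):
    """Run a residual stack, optionally skipping layers. With `skip` empty
    this is the target policy; with layers skipped it plays the role of the
    cheap off-policy behavior model (our simplification)."""
    for i, W in enumerate(layers):
        if i in skip:
            continue                      # dynamic layer skipping
        x = x + np.tanh(x @ W)            # toy residual block
    return x

rng = np.random.default_rng(0)
layers = [rng.normal(size=(32, 32)) / 8 for _ in range(8)]
x = rng.normal(size=(4, 32))
full, fast = forward(x, layers), forward(x, layers, skip={1, 3, 5})
print(np.linalg.norm(full - fast))        # behavior model stays close but cheaper
```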
From Intent to Execution Multimodal Chain-of-Thought Reinforcement Learning for Precise CAD Code Generation
Authors: Ke Niu, Haiyang Yu, Zhuofan Chen, Mengyang Zhao, Teng Fu, Bin Li, Xiangyang Xue
2025-08-13
Computer-Aided Design (CAD) plays a vital role in engineering and
manufacturing, yet current CAD workflows require extensive domain expertise and
manual modeling effort. Recent advances in large language models (LLMs) have
made it possible to generate code from natural language, opening new
opportunities for automating parametric 3D modeling. However, directly
translating human design intent into executable CAD code remains highly
challenging, due to the need for logical reasoning, syntactic correctness, and
numerical precision. In this work, we propose CAD-RL, a multimodal
Chain-of-Thought (CoT) guided reinforcement learning post-training framework
for CAD modeling code generation. Our method combines CoT-based Cold Start with
goal-driven reinforcement learning post-training using three task-specific
rewards: executability reward, geometric accuracy reward, and external
evaluation reward. To ensure stable policy learning under sparse and
high-variance reward conditions, we introduce three targeted optimization
strategies: Trust Region Stretch for improved exploration, Precision Token Loss
for enhanced dimension parameter accuracy, and Overlong Filtering to reduce
noisy supervision. To support training and benchmarking, we release ExeCAD, a
novel dataset comprising 16,540 real-world CAD examples with paired natural
language and structured design language descriptions, executable CADQuery
scripts, and rendered 3D models. Experiments demonstrate that CAD-RL achieves
significant improvements in reasoning quality, output precision, and code
executability over existing VLMs.
Constrained Decoding of Diffusion LLMs with Context-Free Grammars
Authors: Niels Mündler, Jasper Dekoninck, Martin Vechev
2025-08-13
Large language models (LLMs) have shown promising performance across diverse
domains. Many practical applications of LLMs, such as code completion and
structured data extraction, require adherence to syntactic constraints
specified by a formal language. Yet, due to their probabilistic nature, LLM
output is not guaranteed to adhere to such formal languages. Prior work has
proposed constrained decoding as a means to restrict LLM generation to
particular formal languages. However, existing works are not applicable to the
emerging paradigm of diffusion LLMs, when used in practical scenarios such as
the generation of formally correct C++ or JSON output. In this paper we address
this challenge and present the first constrained decoding method for diffusion
models, one that can handle formal languages captured by context-free grammars.
We begin by reducing constrained decoding to the more general additive
infilling problem, which asks whether a partial output can be completed to a
valid word in the target language. This problem also naturally subsumes the
previously unaddressed multi-region infilling constrained decoding. We then
reduce this problem to the task of deciding whether the intersection of the
target language and a regular language is empty and present an efficient
algorithm to solve it for context-free languages. Empirical results on various
applications, such as C++ code infilling and structured data extraction in
JSON, demonstrate that our method achieves near-perfect syntactic correctness
while consistently preserving or improving functional correctness. Importantly,
our efficiency optimizations ensure that the computational overhead remains
practical.
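The additive infilling problem the reduction targets has a simple brute-force specification; a sketch on a toy balanced-bracket grammar (the paper's contribution is solving this efficiently for general context-free grammars):

```python
from itertools import product

def completable(template, alphabet="()"):
    """Brute-force additive infilling for a toy Dyck (balanced-bracket)
    language: can every '?' hole be filled so the whole word is valid?"""
    def balanced(s):
        depth = 0
        for c in s:
            depth += 1 if c == "(" else -1
            if depth < 0:
                return False
        return depth == 0
    holes = template.count("?")
    for fill in product(alphabet, repeat=holes):
        it = iter(fill)
        cand = "".join(next(it) if c == "?" else c for c in template)
        if balanced(cand):
            return True
    return False

print(completable("(?()?)"), completable("))??"))   # True False
```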
Language of Persuasion and Misrepresentation in Business Communication A Textual Detection Approach
Authors: Sayem Hossen, Monalisa Moon Joti, Md. Golam Rashed
2025-08-13
Business digitisation has reorganised the process of persuasive discourse,
allowing not only greater transparency but also more advanced deception. This
inquiry synthesises classical rhetoric and psychology with linguistic theory
and empirical studies in financial reporting, sustainability discourse, and
digital marketing to explain how deceptive language can be systematically
detected using a persuasive lexicon. In controlled settings, detection
accuracies greater than 99% were achieved using computational textual analysis
and personalised models. However, reproducing this performance in multilingual
settings remains problematic, largely because sufficient data are hard to find
and few multilingual text-processing infrastructures are in place. This
evidence points to a growing gap between theoretical representations of
communication and those empirically approximated, and therefore to the need
for robust automatic text-identification systems as AI-based discourse becomes
more realistic in communicating with humans.
Memory Decoder A Pretrained, Plug-and-Play Memory for Large Language Models
Authors: Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin
2025-08-13
Large Language Models (LLMs) have shown strong abilities in general language
tasks, yet adapting them to specific domains remains a challenge. Current
methods like Domain Adaptive Pretraining (DAPT) require costly full-parameter
training and suffer from catastrophic forgetting. Meanwhile,
Retrieval-Augmented Generation (RAG) introduces substantial inference latency
due to expensive nearest-neighbor searches and longer context. This paper
introduces Memory Decoder, a plug-and-play pretrained memory that enables
efficient domain adaptation without changing the original model's parameters.
Memory Decoder employs a small transformer decoder that learns to imitate the
behavior of an external non-parametric retriever. Once trained, Memory Decoder
can be seamlessly integrated with any pretrained language model that shares the
same tokenizer, requiring no model-specific modifications. Experimental results
demonstrate that Memory Decoder enables effective adaptation of various Qwen
and Llama models to three distinct specialized domains: biomedicine, finance,
and law, reducing perplexity by an average of 6.17 points. Overall, Memory
Decoder introduces a novel paradigm centered on a specially pretrained memory
component designed for domain-specific adaptation. This memory architecture can
be integrated in a plug-and-play manner, consistently enhancing performance
across multiple models within the target domain.
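At inference time a plug-and-play memory of this kind can be combined with the base model at the output-distribution level; a sketch assuming kNN-LM-style interpolation, with the mixing weight as a free parameter (the paper's exact combination rule may differ):

```python
import numpy as np

def blend(base_logits, memory_logits, lam=0.3):
    """Blend a base LM with a pretrained memory component by interpolating
    their next-token distributions; `lam` is an assumed hyperparameter."""
    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()
    return (1 - lam) * softmax(base_logits) + lam * softmax(memory_logits)

rng = np.random.default_rng(0)
p = blend(rng.normal(size=100), rng.normal(size=100))
print(p.sum().round(6), p.argmax())       # valid distribution, blended argmax
```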
OneVAE Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better
Authors: Yupeng Zhou, Zhen Li, Ziheng Ouyang, Yuming Chen, Ruoyi Du, Daquan Zhou, Bin Fu, Yihao Liu, Peng Gao, Ming-Ming Cheng, Qibin Hou
2025-08-13
Encoding videos into discrete tokens could align them with text tokens to
facilitate concise and unified multi-modal LLMs, yet it introduces significant
spatiotemporal compression compared to continuous video representation.
Previous discrete video VAEs experienced unstable training, long training time,
and degraded reconstruction quality. Given the easier training and superior
performance of continuous VAEs, an intuitive idea is to enhance discrete video
VAEs by leveraging continuous VAEs. After rethinking the intrinsic link between
discrete and continuous representations, we found that FSQ could effectively
preserve pre-trained continuous VAE priors compared to other quantization
methods. By leveraging continuous VAE priors, it converges several times faster
than training from scratch and achieves superior performance at convergence.
Meanwhile, two structural improvements are proposed. First, inspired by how
continuous VAEs enhance reconstruction via enlarged latent dimensions, we
introduce a multi-token quantization mechanism, which achieves nearly a 1 dB
improvement in PSNR without compromising the token compression ratio. Second,
to tackle reconstruction challenges in high-compression video VAEs, we
strengthen first-frame reconstruction, enabling the causal VAE to leverage this
information in subsequent frames and markedly improving the performance of 4 x
16 x 16 discrete VAEs. Furthermore, we propose a joint discrete-continuous
optimization scheme that unifies the two paradigms and, for the first time,
achieves competitive performance on both continuous and discrete
representations within a single network. We name our method OneVAE to reflect
this connection.
Speed Always Wins A Survey on Efficient Architectures for Large Language Models
Authors: Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng
2025-08-13
Large Language Models (LLMs) have delivered impressive results in language
understanding, generation, and reasoning, and push the ability boundary of
multimodal models. Transformer models, as the foundation of modern LLMs, offer
a strong baseline with excellent scaling properties. However, the traditional
transformer architecture requires substantial computation and poses
significant obstacles for large-scale training and practical deployment. In
this survey, we offer a systematic examination of innovative LLM architectures
that address the inherent limitations of transformers and boost efficiency.
Starting from language modeling, this survey covers the background and
technical details of linear and sparse sequence modeling methods, efficient
full attention variants, sparse mixture-of-experts, hybrid model architectures
incorporating the above techniques, and emerging diffusion LLMs. Additionally,
we discuss applications of these techniques to other modalities and consider
their wider implications for developing scalable, resource-aware foundation
models. By grouping recent studies into the above categories, this survey
presents a blueprint of modern efficient LLM architectures, and we hope this
could help motivate future research toward more efficient, versatile AI
systems.
MoIIE Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models
Authors: Dianyi Wang, Siyuan Wang, Zejun Li, Yikun Wang, Yitong Li, Duyu Tang, Xiaoyu Shen, Xuanjing Huang, Zhongyu Wei
2025-08-13
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance
across multi-modal tasks by scaling model size and training data. However,
these dense LVLMs incur significant computational costs and motivate the
exploration of Mixture of Experts (MoE) architectures. While MoE improves
parameter efficiency, effectively applying MoE to simultaneously model
modality-specific features and cross-modal associations in LVLMs remains
challenging. In this work, we propose to incorporate Mixture of Intra- and
Inter-Modality Experts (MoIIE) to LVLMs. For each token, expert routing is
guided by its modality, directing tokens to their respective intra-modality
experts as well as a shared pool of inter-modality experts, enabling the model
to jointly learn rich intra-modal features and cross-modal interactions. We
further introduce an effective and straightforward two-stage training strategy,
which facilitates the direct activation of both MoE and multi-modal
capabilities. Extensive experiments across different data scales and LLM
backbones demonstrate the effectiveness, efficiency, and generality of our
approach. Notably, our MoIIE models with 5.5B and 11.3B activated parameters
match or even surpass the performance of existing advanced open-source
MoE-LLM-based multi-modal models that involve more activated parameters. The
code is
available at https://github.com/AlenjandroWang/MoIIE.
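A minimal sketch of the modality-guided routing idea, with invented dimensions and top-1 routing for brevity: each token chooses among its own modality's experts plus a shared pool of inter-modality experts. This is an illustration of the routing pattern, not the paper's implementation.

    import torch, torch.nn as nn

    class MoIIELayer(nn.Module):
        def __init__(self, dim=32, n_intra=2, n_inter=2):
            super().__init__()
            make = lambda: nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            self.text_experts = nn.ModuleList([make() for _ in range(n_intra)])
            self.img_experts = nn.ModuleList([make() for _ in range(n_intra)])
            self.inter_experts = nn.ModuleList([make() for _ in range(n_inter)])
            self.router = nn.Linear(dim, n_intra + n_inter)  # scores per candidate expert

        def forward(self, x, is_text):
            out = torch.zeros_like(x)
            for t, experts in ((True, self.text_experts), (False, self.img_experts)):
                idx = (is_text == t).nonzero(as_tuple=True)[0]
                if idx.numel() == 0:
                    continue
                toks = x[idx]
                cand = list(experts) + list(self.inter_experts)  # intra + shared pool
                gate = self.router(toks).softmax(-1)
                top = gate.argmax(-1)                            # top-1 routing
                y = torch.stack([cand[top[i]](toks[i]) for i in range(len(toks))])
                out[idx] = y * gate.gather(-1, top[:, None])
            return out

    x = torch.randn(6, 32)
    mask = torch.tensor([1, 1, 0, 0, 1, 0], dtype=torch.bool)  # text vs. image tokens
    print(MoIIELayer()(x, mask).shape)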
MEML-GRPO Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement
Authors: Weitao Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, An-Lan Wang, Weijie Yin, Dingkang Yang, Yuxiang Nie, Bin Shan, Hao Feng, Irene Li, Kun Yang, Han Wang, Jingqun Tang, Teng Fu, Changhong Jin, Chao Feng, Xiaohui Lv, Can Huang
2025-08-13
Recent advances demonstrate that reinforcement learning with verifiable
rewards (RLVR) significantly enhances the reasoning capabilities of large
language models (LLMs). However, standard RLVR faces challenges with reward
sparsity, where zero rewards from consistently incorrect candidate answers
provide no learning signal, particularly in challenging tasks. To address this,
we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative
framework that utilizes diverse expert prompts as system prompts to generate a
broader range of responses, substantially increasing the likelihood of
identifying correct solutions. Additionally, we introduce an inter-expert
mutual learning mechanism that facilitates knowledge sharing and transfer among
experts, further boosting the model's performance through RLVR. Extensive
experiments across multiple reasoning benchmarks show that MEML-GRPO delivers
significant improvements, achieving an average performance gain of 4.89% with
Qwen and 11.33% with Llama, effectively overcoming the core limitations of
traditional RLVR methods.
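The benefit of pooling rollouts from several expert prompts can be seen in a toy group-relative advantage computation (all rewards invented): a group whose answers are all wrong contributes no gradient on its own, but a baseline shared across experts restores a learning signal.

    import numpy as np

    rewards = {                      # verifiable 0/1 rewards per expert's rollouts
        "expert_a": [0, 0, 1, 0],
        "expert_b": [1, 0, 1, 1],
        "expert_c": [0, 0, 0, 0],    # an all-zero group alone gives no signal
    }
    all_r = np.array([r for rs in rewards.values() for r in rs], dtype=float)
    adv = (all_r - all_r.mean()) / (all_r.std() + 1e-8)  # pooled baseline across experts
    print(adv.round(2))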
HierMoE Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
Authors: Wenxiang Lin, Xinglin Pan, Lin Zhang, Shaohuai Shi, Xuan Wang, Xiaowen Chu
2025-08-13
The sparsely activated mixture-of-experts (MoE) has become a common
architecture for large language models (LLMs) due to its sparsity, which
requires fewer computational demands while easily scaling the model size. In
MoE models, each MoE layer needs to dynamically route tokens to the particular
experts activated for computation, while those experts may not be located on
the same device or GPU as the token. This leads to substantial communication
overhead and load imbalances across all GPUs, which obstructs
the scalability of distributed systems within a GPU cluster. To this end, we
introduce HierMoE to accelerate the training of MoE models by two
topology-aware techniques: 1) token deduplication to reduce the communication
traffic, and 2) expert swap to balance the workloads among all GPUs. To make
the two proposed approaches more general, we build theoretical
models aimed at finding the best token deduplication and expert swap strategy
under different model configurations and hardware environments. We implement
our prototype HierMoE system atop Megatron-LM and conduct experiments on a
32-GPU cluster with DeepSeek-V3 and Qwen3-30B-A3B models. Experimental results
show that our HierMoE achieves faster communication and delivers faster end-to-end
training compared to state-of-the-art MoE training systems, Tutel-2DH,
SmartMoE, and Megatron-LM.
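A toy sketch of the token-deduplication side (the topology-aware cost models are not reproduced here): duplicated rows bound for the same expert cross the GPU or node boundary once and are re-expanded on the receiver.

    import torch

    tokens = torch.tensor([[1., 2.], [1., 2.], [3., 4.], [1., 2.]])  # duplicated rows
    uniq, inverse = torch.unique(tokens, dim=0, return_inverse=True)
    payload = uniq                      # what actually crosses the link
    print(f"sent {len(payload)} rows instead of {len(tokens)}")
    restored = payload[inverse]         # receiver re-expands with the inverse map
    assert torch.equal(restored, tokens)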
NeuronTune Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs
Authors: Birong Pan, Mayi Xu, Qiankun Pi, Jianhao Chen, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
2025-08-13
Ensuring robust safety alignment while preserving utility is critical for the
reliable deployment of Large Language Models (LLMs). However, current
techniques fundamentally suffer from intertwined deficiencies: insufficient
robustness against malicious attacks, frequent refusal of benign queries,
degradation in generated text quality and general task performance--the former
two reflecting deficits in robust safety and the latter constituting utility
impairment. We trace these limitations to the coarse-grained layer-wise
interventions in existing methods. To resolve this, we propose NeuronTune, a
fine-grained framework that dynamically modulates sparse neurons to achieve
simultaneous safety-utility optimization. Our approach first identifies
safety-critical and utility-preserving neurons across all layers via
attribution, then employs meta-learning to adaptively amplify safety-neuron
activations and suppress utility-neuron activations. Crucially, NeuronTune
enables tunable adjustment of intervention scope via neuron-count thresholds,
supporting flexible adaptation to security-critical or utility-priority
scenarios. Extensive experimental results demonstrate that our method
significantly outperforms existing state-of-the-art technologies, achieving
superior model safety while maintaining excellent utility.
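A minimal sketch of neuron-level modulation via a forward hook, with invented neuron indices and scaling factors standing in for the attribution results and meta-learned values.

    import torch, torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
    safety_idx, suppress_idx = [3, 10, 42], [7, 55]   # hypothetical neuron sets
    alpha, beta = 1.5, 0.5                            # amplify / suppress factors

    def modulate(module, inputs, output):
        output[:, safety_idx] *= alpha      # boost safety-critical neurons
        output[:, suppress_idx] *= beta     # damp utility-interfering neurons
        return output

    handle = model[1].register_forward_hook(modulate)  # hook after the ReLU
    y = model(torch.randn(2, 16))
    handle.remove()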
EGGS-PTP An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models
Authors: Omar Bazarbachi, Zijun Sun, Yanning Shen
2025-08-13
As Large Language Models (LLMs) become more widely adopted and scale up in
size, the computational and memory challenges involved in deploying these
massive foundation models have grown increasingly severe. This underscores the
urgent need to develop more efficient model variants. Faced with this
challenge, the present work introduces EGGS-PTP: an Expander-Graph Guided
Structured Post-training Pruning method. The proposed approach leverages graph
theory to guide the design of N:M structured sparsity, effectively reducing
model size and computational demands. By incorporating concepts from expander
graphs, EGGS-PTP ensures information flow within the pruned network, preserving
essential model functionality. Extensive numerical experiments demonstrate that
EGGS-PTP not only achieves significant acceleration and memory savings due to
structured sparsity but also outperforms existing structured pruning techniques
in terms of accuracy across various LLMs.
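For reference, plain magnitude-based N:M (2:4) structured pruning looks as follows; the expander-graph guidance that EGGS-PTP adds on top of such masks is not reproduced here.

    import torch

    def nm_prune(w, n=2, m=4):
        rows, cols = w.shape
        groups = w.abs().reshape(rows, cols // m, m)
        keep = groups.topk(n, dim=-1).indices            # n largest per group of m
        mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
        return w * mask.reshape(rows, cols)

    w = torch.randn(4, 8)
    print(nm_prune(w))   # exactly 2 nonzeros in every aligned group of 4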
Gen-AFFECT Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy
Authors: Hao Yu, Rupayan Mallick, Margrit Betke, Sarah Adel Bargal
2025-08-13
Different forms of customized 2D avatars are widely used in gaming
applications, virtual communication, education, and content creation. However,
existing approaches often fail to capture fine-grained facial expressions and
struggle to preserve identity across different expressions. We propose
GEN-AFFECT, a novel framework for personalized avatar generation that generates
expressive and identity-consistent avatars with a diverse set of facial
expressions. Our framework proposes conditioning a multimodal diffusion
transformer on an extracted identity-expression representation. This enables
identity preservation and representation of a wide range of facial expressions.
GEN-AFFECT additionally employs consistent attention at inference for
information sharing across the set of generated expressions, enabling the
generation process to maintain identity consistency over the array of generated
fine-grained expressions. GEN-AFFECT demonstrates superior performance compared
to previous state-of-the-art methods on the basis of the accuracy of the
generated expressions, the preservation of the identity and the consistency of
the target identity across an array of fine-grained facial expressions.
Shadow in the Cache Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference
Authors: Zhifan Luo, Shuo Shao, Su Zhang, Lijing Zhou, Yuke Hu, Chenxu Zhao, Zhihao Liu, Zhan Qin
2025-08-13
The Key-Value (KV) cache, which stores intermediate attention computations
(Key and Value pairs) to avoid redundant calculations, is a fundamental
mechanism for accelerating Large Language Model (LLM) inference. However, this
efficiency optimization introduces significant yet underexplored privacy risks.
This paper provides the first comprehensive analysis of these vulnerabilities,
demonstrating that an attacker can reconstruct sensitive user inputs directly
from the KV-cache. We design and implement three distinct attack vectors: a
direct Inversion Attack, a more broadly applicable and potent Collision Attack,
and a semantic-based Injection Attack. These methods demonstrate the
practicality and severity of KV-cache privacy leakage issues. To mitigate this,
we propose KV-Cloak, a novel, lightweight, and efficient defense mechanism.
KV-Cloak uses a reversible matrix-based obfuscation scheme, combined with
operator fusion, to secure the KV-cache. Our extensive experiments show that
KV-Cloak effectively thwarts all proposed attacks, reducing reconstruction
quality to random noise. Crucially, it achieves this robust security with
virtually no degradation in model accuracy and minimal performance overhead,
offering a practical solution for trustworthy LLM deployment.
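A toy sketch of the reversible matrix-based obfuscation idea (dimensions and matrix choice invented): cached keys are stored multiplied by a secret invertible matrix and recovered exactly on use.

    import torch

    d = 8
    A = torch.randn(d, d) + d * torch.eye(d)   # well-conditioned secret matrix
    A_inv = torch.inverse(A)

    k = torch.randn(5, d)                      # 5 cached key vectors
    k_stored = k @ A                           # what lands in the KV-cache
    k_recovered = k_stored @ A_inv             # exact up to float error
    print(torch.allclose(k, k_recovered, atol=1e-4))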
Synaptic Pruning A Biological Inspiration for Deep Learning Regularization
Authors: Gideon Vos, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi
2025-08-12
Synaptic pruning in biological brains removes weak connections to improve
efficiency. In contrast, dropout regularization in artificial neural networks
randomly deactivates neurons without considering activity-dependent pruning. We
propose a magnitude-based synaptic pruning method that better reflects biology
by progressively removing low-importance connections during training.
Integrated directly into the training loop as a dropout replacement, our
approach computes weight importance from absolute magnitudes across layers and
applies a cubic schedule to gradually increase global sparsity. At fixed
intervals, pruning masks permanently remove low-importance weights while
maintaining gradient flow for active ones, eliminating the need for separate
pruning and fine-tuning phases. Experiments on multiple time series forecasting
models including RNN, LSTM, and Patch Time Series Transformer across four
datasets show consistent gains. Our method ranked best overall, with
statistically significant improvements confirmed by Friedman tests (p < 0.01).
In financial forecasting, it reduced Mean Absolute Error by up to 20% over
models with no pruning or standard dropout, and up to 52% in select transformer
models. This dynamic pruning mechanism advances regularization by coupling weight
elimination with progressive sparsification, offering easy integration into
diverse architectures. Its strong performance, especially in financial time
series forecasting, highlights its potential as a practical alternative to
conventional dropout techniques.
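A minimal sketch of the cubic schedule with magnitude masking at fixed intervals (final sparsity and interval invented); once zeroed, a weight's magnitude keeps it below every later threshold, so removals are permanent.

    import torch

    def cubic_sparsity(step, total, s_final=0.8):
        return s_final * (1 - (1 - step / total) ** 3)

    w = torch.randn(100)
    total_steps, interval = 1000, 100
    for step in range(0, total_steps + 1, interval):
        s = cubic_sparsity(step, total_steps)
        k = int(s * w.numel())                     # number of weights to remove
        if k:
            thresh = w.abs().kthvalue(k).values    # k-th smallest magnitude
            w = torch.where(w.abs() <= thresh, torch.zeros_like(w), w)
        print(f"step {step:4d}: sparsity target {s:.2f}")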
SinLlama -- A Large Language Model for Sinhala
Authors: H. W. K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, Rishemjit Kaur
2025-08-12
Low-resource languages such as Sinhala are often overlooked by open-source
Large Language Models (LLMs). In this research, we extend an existing
multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the
tokenizer with Sinhala specific vocabulary and perform continual pre-training
on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This
is the very first decoder-based open-source LLM with explicit Sinhala support.
When SinLlama was instruction fine-tuned for three text classification tasks,
it outperformed base and instruct variants of Llama-3-8B by a significant
margin.
READER Retrieval-Assisted Drafter for Efficient LLM Inference
Authors: Maxim Divilkovskiy, Vitaly Malygin, Sergey Zlobin, Sultan Isali, Vasily Kalugin, Stanislav Ilyushin, Nuriza Aitassova, Yi Fei, Zeng Weidi
2025-08-12
Large Language Models (LLMs) generate tokens autoregressively, with each
token depending on the preceding context. This sequential nature makes the
inference process inherently difficult to accelerate, posing a significant
challenge for efficient deployment. In recent years, various methods have been
proposed to address this issue, with the most effective approaches often
involving the training of additional draft models. In this paper, we introduce
READER (Retrieval-Assisted Drafter for Efficient LLM Inference), a novel
lossless speculative decoding method that enhances model-based approaches by
leveraging self-repetitions in the text. Our algorithm expands the speculative
tree using tokens obtained through statistical search. This work
focuses on large batch sizes (>= 8), an underexplored yet important area for
industrial applications. We also analyze the key-value (KV) cache size during
speculative decoding and propose an optimization to improve performance for
large batches. As a result, READER outperforms existing speculative decoding
methods. Notably, READER requires no additional training and can reuse
pre-trained speculator models, increasing the speedup by over 40%. Our method
demonstrates particularly strong performance on search-based tasks, such as
retrieval-augmented generation, where we achieve more than 10x speedup.
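The self-repetition retrieval can be illustrated with a toy n-gram lookup (token ids invented): the tokens that followed the last earlier occurrence of the current suffix become the speculative draft.

    def draft_from_context(tokens, n=2, k=4):
        """Return up to k draft tokens after the last earlier match of the n-gram suffix."""
        suffix = tuple(tokens[-n:])
        for i in range(len(tokens) - n - 1, -1, -1):
            if tuple(tokens[i:i + n]) == suffix:
                return tokens[i + n:i + n + k]   # speculative continuation
        return []

    ctx = [5, 9, 2, 7, 1, 3, 9, 2]               # toy token ids; "9 2" seen before
    print(draft_from_context(ctx))                # -> [7, 1, 3, 9]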
FetFIDS A Feature Embedding Attention based Federated Network Intrusion Detection Algorithm
Authors: Shreya Ghosh, Abu Shafin Mohammad Mahdee Jameel, Aly El Gamal
2025-08-12
Intrusion Detection Systems (IDS) have an increasingly important role in
preventing exploitation of network vulnerabilities by malicious actors. Recent
deep learning based developments have resulted in significant improvements in
the performance of IDS systems. In this paper, we present FetFIDS, where we
explore the employment of feature embedding instead of positional embedding to
improve intrusion detection performance of a transformer-based deep learning
system. Our model is developed with the aim of deployments in edge learning
scenarios, where federated learning over multiple communication rounds can
ensure both privacy and localized performance improvements. FetFIDS outperforms
multiple state-of-the-art intrusion detection systems in a federated
environment and demonstrates a high degree of suitability to federated
learning. The code for this work can be found at
https://github.com/ghosh64/fetfids.
A Survey on Training-free Alignment of Large Language Models
Authors: Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
2025-08-12
The alignment of large language models (LLMs) aims to ensure their outputs
adhere to human values, ethical standards, and legal norms. Traditional
alignment methods often rely on resource-intensive fine-tuning (FT), which may
suffer from knowledge degradation and face challenges in scenarios where the
model accessibility or computational resources are constrained. In contrast,
training-free (TF) alignment techniques--leveraging in-context learning,
decoding-time adjustments, and post-generation corrections--offer a promising
alternative by enabling alignment without heavily retraining LLMs, making them
adaptable to both open-source and closed-source environments. This paper
presents the first systematic review of TF alignment methods, categorizing them
by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we
provide a detailed examination from the viewpoint of LLMs and multimodal LLMs
(MLLMs), highlighting their mechanisms and limitations. Furthermore, we
identify key challenges and future directions, paving the way for more
inclusive and effective TF alignment techniques. By synthesizing and organizing
the rapidly growing body of research, this survey offers guidance for
practitioners and advances the development of safer and more reliable LLMs.
Retrospective Sparse Attention for Efficient Long-Context Generation
Authors: Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim
2025-08-12
Large Language Models (LLMs) are increasingly deployed in long-context tasks
such as reasoning, code generation, and multi-turn dialogue. However, inference
over extended contexts is bottlenecked by the Key-Value (KV) cache, whose
memory footprint grows linearly with sequence length and dominates latency at
each decoding step. While recent KV cache compression methods identify and load
important tokens, they focus predominantly on input contexts and fail to
address the cumulative attention errors that arise during long decoding. In
this paper, we introduce RetroAttention, a novel KV cache update technique that
retrospectively revises past attention outputs using newly arrived KV entries
from subsequent decoding steps. By maintaining a lightweight output cache,
RetroAttention enables past queries to efficiently access more relevant
context, while incurring minimal latency overhead. This breaks the
fixed-attention-output paradigm and allows continual correction of prior
approximations. Extensive experiments on long-generation benchmarks show that
RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression
methods, increasing effective KV exposure by up to 1.6x and accuracy by up to
21.9%.
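The retrospective revision can be illustrated with flash-attention-style softmax merging (all shapes invented): cached per-query statistics from the original decoding step are combined with the contribution of newly arrived KV entries, reproducing full attention exactly in this toy case.

    import torch

    def attend(q, K, V):
        s = K @ q                            # scores for one query vector
        m = s.max()
        w = (s - m).exp()
        return w @ V, w.sum(), m             # unnormalized output + softmax stats

    q = torch.randn(8)
    K_old, V_old = torch.randn(6, 8), torch.randn(6, 4)
    K_new, V_new = torch.randn(2, 8), torch.randn(2, 4)

    o1, z1, m1 = attend(q, K_old, V_old)     # cached at the original step
    o2, z2, m2 = attend(q, K_new, V_new)     # contribution of newly arrived KV
    m = torch.maximum(m1, m2)                # merge the two partial softmaxes
    w1, w2 = (m1 - m).exp(), (m2 - m).exp()
    out = (w1 * o1 + w2 * o2) / (w1 * z1 + w2 * z2)

    full, zf, _ = attend(q, torch.cat([K_old, K_new]), torch.cat([V_old, V_new]))
    print(torch.allclose(out, full / zf, atol=1e-5))   # matches full attention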
NEFMind Parameter-Efficient Fine-Tuning of Open-Source LLMs for Telecom APIs Automation
Authors: Zainab Khan, Ahmed Hussain, Mukesh Thakur, Arto Hellas, Panos Papadimitratos
2025-08-12
The use of Service-Based Architecture in modern telecommunications has
exponentially increased Network Functions (NFs) and Application Programming
Interfaces (APIs), creating substantial operational complexities in service
discovery and management. We introduce \textit{NEFMind}, a framework leveraging
parameter-efficient fine-tuning of open-source Large Language Models (LLMs) to
address these challenges. It integrates three core components: synthetic
dataset generation from Network Exposure Function (NEF) API specifications,
model optimization through Quantized-Low-Rank Adaptation, and performance
evaluation via GPT-4 Ref Score and BertScore metrics. Targeting 5G
Service-Based Architecture APIs, our approach achieves an 85% reduction in
communication overhead compared to manual discovery methods. Experimental
validation using the open-source Phi-2 model demonstrates exceptional API call
identification performance at 98-100% accuracy. The fine-tuned Phi-2 model
delivers performance comparable to significantly larger models like GPT-4 while
maintaining computational efficiency for telecommunications infrastructure
deployment. These findings validate domain-specific, parameter-efficient
strategies for managing complex API ecosystems in next-generation
telecommunications networks.
ColorGPT Leveraging Large Language Models for Multimodal Color Recommendation
Authors: Ding Xia, Naoto Inoue, Qianru Qiu, Kotaro Kikuchi
2025-08-12
Colors play a crucial role in the design of vector graphic documents by
enhancing visual appeal, facilitating communication, improving usability, and
ensuring accessibility. In this context, color recommendation involves
suggesting appropriate colors to complete or refine a design when one or more
colors are missing or require alteration. Traditional methods often struggled
with these challenges due to the complex nature of color design and the limited
data availability. In this study, we explored the use of pretrained Large
Language Models (LLMs) and their commonsense reasoning capabilities for color
recommendation, raising the question: can pretrained LLMs serve as superior
designers for color recommendation tasks? To investigate this, we developed a
robust, rigorously validated pipeline, ColorGPT, that was built by
systematically testing multiple color representations and applying effective
prompt engineering techniques. Our approach primarily targeted color palette
completion by recommending colors based on a set of given colors and
accompanying context. Moreover, our method can be extended to full palette
generation, producing an entire color palette corresponding to a provided
textual description. Experimental results demonstrated that our LLM-based
pipeline outperformed existing methods in terms of color suggestion accuracy
and the distribution of colors in the color palette completion task. For the
full palette generation task, our approach also yielded improvements in color
diversity and similarity compared to current techniques.
ASPD Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs
Authors: Keyu Chen, Zhifeng Shen, Daohai Yu, Haoqian Wu, Wei Wen, Jianfeng He, Ruizhi Qiao, Xing Sun
2025-08-12
The increasing scale and complexity of large language models (LLMs) pose
significant inference latency challenges, primarily due to their autoregressive
decoding paradigm characterized by the sequential nature of next-token
prediction. By re-examining the outputs of autoregressive models, we observed
that some segments exhibit parallelizable structures, which we term intrinsic
parallelism. Decoding each parallelizable branch simultaneously (i.e., parallel
decoding) can significantly improve the overall inference speed of LLMs. In
this paper, we propose an Adaptive Serial-Parallel Decoding (ASPD), which
addresses two core challenges: automated construction of parallelizable data
and an efficient parallel decoding mechanism. More specifically, we introduce a
non-invasive pipeline that automatically extracts and validates parallelizable
structures from the responses of autoregressive models. To empower efficient
adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine which
enables seamless transitions between serial and parallel decoding modes while
maintaining a reusable KV cache, maximizing computational efficiency. Extensive
evaluations across General Tasks, Retrieval-Augmented Generation, Mathematical
Reasoning demonstrate that ASPD achieves unprecedented performance in both
effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up
to 3.19x speedup (1.85x on average) while maintaining response quality within
1% difference compared to autoregressive models, realizing significant
acceleration without compromising generation quality. Our framework sets a
groundbreaking benchmark for efficient LLM parallel inference, paving the way
for its deployment in latency-sensitive applications such as AI-powered
customer service bots and answer retrieval engines.
Steering Towards Fairness Mitigating Political Bias in LLMs
Authors: Afrozah Nadeem, Mark Dras, Usman Naseem
2025-08-12
Recent advancements in large language models (LLMs) have enabled their
widespread use across diverse real-world applications. However, concerns remain
about their tendency to encode and reproduce ideological biases, particularly
along political and economic dimensions. In this paper, we propose a framework
for probing and mitigating such biases in decoder-based LLMs through analysis
of internal model representations. Grounded in the Political Compass Test
(PCT), our method uses contrastive pairs to extract and compare hidden layer
activations from models like Mistral and DeepSeek. We introduce a comprehensive
activation extraction pipeline capable of layer-wise analysis across multiple
ideological axes, revealing meaningful disparities linked to political framing.
Our results show that decoder LLMs systematically encode representational bias
across layers, which can be leveraged for effective steering vector-based
mitigation. This work provides new insights into how political bias is encoded
in LLMs and offers a principled approach to debiasing beyond surface-level
output interventions.
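A toy sketch of steering-vector mitigation under the stated setup (all tensors invented): the vector is the mean activation difference over contrastive prompt pairs, and its component is subtracted from hidden states at inference.

    import torch

    h_pos = torch.randn(32, 64)   # activations for ideologically framed prompts
    h_neg = torch.randn(32, 64)   # activations for neutrally framed counterparts
    steer = (h_pos - h_neg).mean(0)
    steer = steer / steer.norm()

    h = torch.randn(64)            # a hidden state at the same layer, at inference
    alpha = 2.0                    # steering strength (hyperparameter)
    h_debiased = h - alpha * (h @ steer) * steer   # remove the bias direction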
DiffPose-Animal A Language-Conditioned Diffusion Framework for Animal Pose Estimation
Authors: Tianyu Xiong, Dayi Tan, Wei Tian
2025-08-12
Animal pose estimation is a fundamental task in computer vision, with growing
importance in ecological monitoring, behavioral analysis, and intelligent
livestock management. Compared to human pose estimation, animal pose estimation
is more challenging due to high interspecies morphological diversity, complex
body structures, and limited annotated data. In this work, we introduce
DiffPose-Animal, a novel diffusion-based framework for top-down animal pose
estimation. Unlike traditional heatmap regression methods, DiffPose-Animal
reformulates pose estimation as a denoising process under the generative
framework of diffusion models. To enhance semantic guidance during keypoint
generation, we leverage large language models (LLMs) to extract both global
anatomical priors and local keypoint-wise semantics based on species-specific
prompts. These textual priors are encoded and fused with image features via
cross-attention modules to provide biologically meaningful constraints
throughout the denoising process. Additionally, a diffusion-based keypoint
decoder is designed to progressively refine pose predictions, improving
robustness to occlusion and annotation sparsity. Extensive experiments on
public animal pose datasets demonstrate the effectiveness and generalization
capability of our method, especially under challenging scenarios with diverse
species, cluttered backgrounds, and incomplete keypoints.
Interpretable Reward Model via Sparse Autoencoder
Authors: Shuyi Zhang, Wei Shi, Sihang Li, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang
2025-08-12
Large language models (LLMs) have been widely deployed across numerous
fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward
models (RMs) as proxies for human preferences to align LLM behaviors with human
values, making the accuracy, reliability, and interpretability of RMs critical
for effective alignment. However, traditional RMs lack interpretability, offer
limited insight into the reasoning behind reward assignments, and are
inflexible toward user preference shifts. While recent multidimensional RMs aim
for improved interpretability, they often fail to provide feature-level
attribution and require costly annotations. To overcome these limitations, we
introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel
architecture that integrates a pretrained Sparse Autoencoder (SAE) into a
reward model. SARM maps the hidden activations of an LLM-based RM into an
interpretable, sparse, and monosemantic feature space, from which a scalar head
aggregates feature activations to produce transparent and conceptually
meaningful reward scores. Empirical evaluations demonstrate that SARM
facilitates direct feature-level attribution of reward assignments, allows
dynamic adjustment to preference shifts, and achieves superior alignment
performance compared to conventional reward models. Our code is available at
https://github.com/schrieffer-z/sarm.
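A minimal sketch of the SARM-style pipeline with invented dimensions: a sparse autoencoder lifts RM hidden states into a sparse feature space, and a linear head makes each feature's contribution to the reward directly readable from its weight.

    import torch, torch.nn as nn

    d_model, d_feat = 64, 512
    encoder = nn.Linear(d_model, d_feat)
    decoder = nn.Linear(d_feat, d_model)
    reward_head = nn.Linear(d_feat, 1, bias=False)

    h = torch.randn(4, d_model)             # hidden activations from the RM backbone
    f = torch.relu(encoder(h))              # sparse, (ideally) monosemantic features
    recon_loss = ((decoder(f) - h) ** 2).mean() + 1e-3 * f.abs().mean()  # SAE objective
    reward = reward_head(f)                 # each weight attributes reward to a feature
    print(reward.squeeze(-1), recon_loss.item())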
A Survey on Parallel Text Generation From Parallel Decoding to Diffusion Language Models
Authors: Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, Aiwei Liu
2025-08-12
As text generation has become a core capability of modern Large Language
Models (LLMs), it underpins a wide range of downstream applications. However,
most existing LLMs rely on autoregressive (AR) generation, producing one token
at a time based on previously generated context, resulting in limited generation
speed due to the inherently sequential nature of the process. To address this
challenge, an increasing number of researchers have begun exploring parallel
text generation-a broad class of techniques aimed at breaking the
token-by-token generation bottleneck and improving inference efficiency.
Despite growing interest, there remains a lack of comprehensive analysis on
what specific techniques constitute parallel text generation and how they
improve inference performance. To bridge this gap, we present a systematic
survey of parallel text generation methods. We categorize existing approaches
into AR-based and Non-AR-based paradigms, and provide a detailed examination of
the core techniques within each category. Following this taxonomy, we assess
their theoretical trade-offs in terms of speed, quality, and efficiency, and
examine their potential for combination and comparison with alternative
strategies. Finally, based on our findings, we highlight recent
advancements, identify open challenges, and outline promising directions for
future research in parallel text generation. We have also created a GitHub
repository for indexing relevant papers and open resources available at
https://github.com/zhanglingzhe0820/Awesome-Parallel-Text-Generation.
Prompt-and-Check Using Large Language Models to Evaluate Communication Protocol Compliance in Simulation-Based Training
Authors: Vishakha Lall, Yisi Liu
2025-08-12
Accurate evaluation of procedural compliance is essential in
simulation-based training, particularly in safety-critical domains where
adherence to compliance checklists reflects operational competence. This paper
explores a lightweight, deployable approach using prompt-based inference with
open-source large language models (LLMs) that can run efficiently on
consumer-grade GPUs. We present Prompt-and-Check, a method that uses
context-rich prompts to evaluate whether each checklist item in a protocol has
been fulfilled, solely based on transcribed verbal exchanges. We perform a case
study in the maritime domain with participants performing an identical
simulation task, and experiment with models such as LLaMA 2 7B, LLaMA 3 8B, and
Mistral 7B, running locally on an RTX 4070 GPU. For each checklist item, a
prompt incorporating relevant transcript excerpts is fed into the model, which
outputs a compliance judgment. We assess model outputs against expert-annotated
ground truth using classification accuracy and agreement scores. Our findings
demonstrate that prompting enables effective context-aware reasoning without
task-specific training. This study highlights the practical utility of LLMs in
augmenting debriefing, performance feedback, and automated assessment in
training environments.
Classifier Language Models Unifying Sparse Finetuning and Adaptive Tokenization for Specialized Classification Tasks
Authors: Adit Krishnan, Chu Wang, Chris Kong
2025-08-12
Semantic text classification requires the understanding of the contextual
significance of specific tokens rather than surface-level patterns or keywords
(as in rule-based or statistical text classification), making large language
models (LLMs) well-suited for this task. However, semantic classification
applications in industry, like customer intent detection or semantic role
labeling, tend to be highly specialized. They require annotation by domain
experts in contrast to general-purpose corpora for pretraining. Further, they
typically require high inference throughputs which limits the model size from
latency and cost perspectives. Thus, for a range of specialized classification
tasks, the preferred solution is to develop customized classifiers by
finetuning smaller language models (e.g., mini-encoders, small language
models).
In this work, we develop a token-driven sparse finetuning strategy to adapt
small language models to specialized classification tasks. We identify and
finetune a small sensitive subset of model parameters by leveraging
task-specific token constructs in the finetuning dataset, while leaving most of
the pretrained weights unchanged. Unlike adapter approaches such as low rank
adaptation (LoRA), we do not introduce additional parameters to the model. Our
approach identifies highly relevant semantic tokens (case study in the
Appendix) and outperforms end-to-end finetuning, LoRA, layer selection, and
prefix tuning on five diverse semantic classification tasks. We achieve greater
stability and half the training costs vs. end-to-end finetuning.
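A rough sketch of the selection step under stated assumptions (the scoring rule here is a simple gradient-magnitude stand-in, not the paper's exact construct): parameters most sensitive to batches rich in task-specific tokens are unmasked, and gradients elsewhere are zeroed during finetuning.

    import torch, torch.nn as nn

    model = nn.Sequential(nn.Embedding(1000, 32), nn.Flatten(), nn.Linear(32 * 8, 2))
    x = torch.randint(0, 1000, (16, 8))            # batch rich in task tokens
    loss = nn.functional.cross_entropy(model(x), torch.randint(0, 2, (16,)))
    loss.backward()

    masks = {}
    for name, p in model.named_parameters():
        score = p.grad.abs()                       # sensitivity proxy
        thresh = score.flatten().kthvalue(int(0.99 * score.numel()) or 1).values
        masks[name] = score > thresh               # ~top-1% most sensitive entries
        p.grad = None

    # During finetuning, zero gradients outside the mask after each backward pass:
    # for name, p in model.named_parameters(): p.grad *= masks[name]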
AgriGPT a Large Language Model Ecosystem for Agriculture
Authors: Bo Yang, Yu Zhang, Lanfei Feng, Yunkui Chen, Jianyu Zhang, Xiao Xu, Nueraili Aierken, Yurui Li, Yuxuan Chen, Guijun Yang, Yong He, Runhe Huang, Shijian Li
2025-08-12
Despite the rapid progress of Large Language Models (LLMs), their application
in agriculture remains limited due to the lack of domain-specific models,
curated datasets, and robust evaluation frameworks. To address these
challenges, we propose AgriGPT, a domain-specialized LLM ecosystem for
agricultural usage. At its core, we design a multi-agent scalable data engine
that systematically compiles credible data sources into Agri-342K, a
high-quality, standardized question-answer (QA) dataset. Trained on this
dataset, AgriGPT supports a broad range of agricultural stakeholders, from
practitioners to policy-makers. To enhance factual grounding, we employ
Tri-RAG, a three-channel Retrieval-Augmented Generation framework combining
dense retrieval, sparse retrieval, and multi-hop knowledge graph reasoning,
thereby improving the LLM's reasoning reliability. For comprehensive
evaluation, we introduce AgriBench-13K, a benchmark suite comprising 13 tasks
with varying types and complexities. Experiments demonstrate that AgriGPT
significantly outperforms general-purpose LLMs on both domain adaptation and
reasoning. Beyond the model itself, AgriGPT represents a modular and extensible
ecosystem for agriculture, comprising structured data construction,
retrieval-enhanced generation, and domain-specific evaluation. This work
provides a generalizable framework for developing scientific and
industry-specialized LLMs. All models, datasets, and code will be released to
empower agricultural communities, especially in underserved regions, and to
promote open, impactful research.
QoE-Aware Service Provision for Mobile AR Rendering An Agent-Driven Approach
Authors: Conghao Zhou, Lulu Sun, Xiucheng Wang, Peng Yang, Feng Lyu, Sihan Lu, Xuemin Shen
2025-08-12
Mobile augmented reality (MAR) is envisioned as a key immersive application
in 6G, enabling virtual content rendering aligned with the physical environment
through device pose estimation. In this paper, we propose a novel agent-driven
service provisioning approach for edge-assisted MAR, aiming to reduce
communication overhead between MAR devices and the edge server while
ensuring the quality of experience (QoE). First, to address the inaccessibility
of MAR application-specific information to the network controller, we establish
a digital agent powered by large language models (LLMs) on behalf of the MAR
service provider, bridging the data and function gap between the MAR service
and network domains. Second, to cope with the user-dependent and dynamic nature
of data traffic patterns for individual devices, we develop a user-level QoE
modeling method that captures the relationship between communication resource
demands and perceived user QoE, enabling personalized, agent-driven
communication resource management. Trace-driven simulation results demonstrate
that the proposed approach outperforms conventional QoE-aware service
provisioning methods in both user-level QoE modeling accuracy and communication
resource efficiency.
Agentic Graph Neural Networks for Wireless Communications and Networking Towards Edge General Intelligence A Survey
Authors: Yang Lu, Shengli Zhang, Chang Liu, Ruichen Zhang, Bo Ai, Dusit Niyato, Wei Ni, Xianbin Wang, Abbas Jamalipour
2025-08-12
The rapid advancement of communication technologies has driven the evolution
of communication networks towards both high-dimensional resource utilization
and multifunctional integration. This evolving complexity poses significant
challenges in designing communication networks to satisfy the growing
quality-of-service and time sensitivity of mobile applications in dynamic
environments. Graph neural networks (GNNs) have emerged as fundamental deep
learning (DL) models for complex communication networks. GNNs not only augment
the extraction of features over network topologies but also enhance scalability
and facilitate distributed computation. However, most existing GNNs follow a
traditional passive learning framework, which may fail to meet the needs of
increasingly diverse wireless systems. This survey proposes the employment of
agentic artificial intelligence (AI) to organize and integrate GNNs, enabling
scenario- and task-aware implementation towards edge general intelligence. To
comprehend the full capability of GNNs, we holistically review recent
applications of GNNs in wireless communications and networking. Specifically,
we focus on the alignment between graph representations and network topologies,
and between neural architectures and wireless tasks. We first provide an
overview of GNNs based on prominent neural architectures, followed by the
concept of agentic GNNs. Then, we summarize and compare GNN applications for
conventional systems and emerging technologies, including physical, MAC, and
network layer designs, integrated sensing and communication (ISAC),
reconfigurable intelligent surface (RIS) and cell-free network architecture. We
further propose a large language model (LLM) framework as an intelligent
question-answering agent, leveraging this survey as a local knowledge base to
enable GNN-related responses tailored to wireless communication research.
Joint decoding method for controllable contextual speech recognition based on Speech LLM
Authors: Yangui Fang, Jing Peng, Yu Xi, Xu Li, Haoyu Li, Chengwei Zhang, Guohui Zhong, Kai Yu
2025-08-12
Contextual speech recognition refers to the ability to identify preferences
for specific content based on contextual information. Recently, leveraging the
contextual understanding capabilities of Speech LLMs to achieve contextual
biasing by injecting contextual information through prompts has emerged as a
research hotspot. However, the direct information injection method via prompts
relies on the internal attention mechanism of the model, making it impossible
to explicitly control the extent of information injection. To address this
limitation, we propose a joint decoding method to control the contextual
information. This approach enables explicit control over the injected
contextual information and achieves superior recognition performance.
Additionally, our method can also be used for sensitive word suppression
recognition. Furthermore, experimental results show that even Speech LLMs not
pre-trained on long contextual data can acquire long contextual capabilities
through our method.
Securing Agentic AI Threat Modeling and Risk Analysis for Network Monitoring Agentic AI System
Authors: Pallavi Zambare, Venkata Nikhil Thanikella, Ying Liu
2025-08-12
Combining Large Language Models (LLMs) with autonomous agents in network
monitoring and decision-making systems creates serious security issues. In this
research, the MAESTRO framework, a seven-layer threat modeling architecture,
was used to expose, evaluate, and eliminate vulnerabilities of agentic AI. The
prototype agent system was constructed and implemented using Python, LangChain,
and WebSocket telemetry, and deployed with inference, memory, parameter tuning, and
anomaly detection modules. Two practical threat cases were confirmed as
follows: (i) resource denial of service by traffic replay denial-of-service,
and (ii) memory poisoning by tampering with the historical log file maintained
by the agent. These situations resulted in measurable levels of performance
degradation, i.e., telemetry updates were delayed and computational loads were
increased as a result of poor system adaptations. It was suggested to use a
multilayered defense-in-depth approach with memory isolation, validation of
planners and anomaly response systems in real-time. These findings verify that
MAESTRO is viable in operational threat mapping, prospective risk scoring, and
the basis of resilient system design. The authors highlight the importance of
enforcing memory integrity, monitoring adaptation logic, and providing
cross-layer communication protection to guarantee agentic AI reliability in
adversarial settings.
Profiling Large Language Model Inference on Apple Silicon A Quantization Perspective
Authors: Afsara Benazir, Felix Xiaozhu Lin
2025-08-12
A systematic understanding of Apple Silicon is lacking in the current
landscape of hardware efficiency; research focus is largely centered on
accelerating GPUs for large-scale training or inference on CUDA devices. This
paper investigates Apple Silicon's unique memory architecture that offers a
unified memory integrating CPU and GPU memory and its implications for
on-device inference.
We decipher myths about whether Apple Silicon is efficient for on-device
inference compared to competitors such as NVIDIA GPUs by directly conducting
latency and throughput comparison benchmarks. We explain the performance gap
between them through profiling low level hardware metrics - ALU utilization,
memory bandwidth, buffer usage, cache residency, etc. at runtime. We draw
several insights regarding performance bottlenecks such as dequantization
overhead, compute throughput, and memory bandwidth. We debunk existing false
claims regarding large language model inference, such as that compressing
models to lower bit precision is a de facto promise for faster inference across all
hardware platforms. We find that the large unified memory enables Apple Silicon
to be both cost effective and efficient against NVIDIA GPUs for ultra large
language models.
Our large scale evaluation on 5 hardware testbeds incorporating three Apple
M-series devices: M2 Ultra, M2 Max and M4 Pro and two NVIDIA GPUs: NVIDIA RTX
A6000, a multi GPU setup with 2xNVIDIA RTX A6000, 5 model scales ranging from
8B to 405B parameters, and 14 quantization schemes gives an understanding of
how Apple Silicon fits within the paradigm of on-device LLM inference. Our
analysis
reveals multiple resource interdependencies and unexpected findings, while also
quantifying established insights. To the best of our knowledge, this study
makes the first attempt to present a thorough characterization and analysis of
Apple Silicon for on-device inference.
Using LLMs to Capture Users' Temporal Context for Recommendation
Authors: Milad Sabouri, Masoud Mansoury, Kun Lin, Bamshad Mobasher
2025-08-11
Effective recommender systems demand dynamic user understanding, especially
in complex, evolving environments. Traditional user profiling often fails to
capture the nuanced, temporal contextual factors of user preferences, such as
transient short-term interests and enduring long-term tastes. This paper
presents an assessment of Large Language Models (LLMs) for generating
semantically rich, time-aware user profiles. We do not propose a novel
end-to-end recommendation architecture; instead, the core contribution is a
systematic investigation into the degree of LLM effectiveness in capturing the
dynamics of user context by disentangling short-term and long-term preferences.
This approach, framing temporal preferences as dynamic user contexts for
recommendations, adaptively fuses these distinct contextual components into
comprehensive user embeddings. The evaluation across Movies&TV and Video Games
domains suggests that while LLM-generated profiles offer semantic depth and
temporal structure, their effectiveness for context-aware recommendations is
notably contingent on the richness of user interaction histories. Significant
gains are observed in dense domains (e.g., Movies&TV), whereas improvements are
less pronounced in sparse environments (e.g., Video Games). This work
highlights LLMs' nuanced potential in enhancing user profiling for adaptive,
context-aware recommendations, emphasizing the critical role of dataset
characteristics for practical applicability.
When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise Real-World Lessons from LLM Co-Design in a Safety-Net Hospital
Authors: Avni Kothari, Patrick Vossler, Jean Digitale, Mohammad Forouzannia, Elise Rosenberg, Michele Lee, Jennee Bryant, Melanie Molina, James Marks, Lucas Zier, Jean Feng
2025-08-11
Large language models (LLMs) have the potential to address social and
behavioral determinants of health by transforming labor intensive workflows in
resource-constrained settings. Creating LLM-based applications that serve the
needs of underserved communities requires a deep understanding of their local
context, but it is often the case that neither LLMs nor their developers
possess this local expertise, and the experts in these communities often face
severe time/resource constraints. This creates a disconnect: how can one engage
in meaningful co-design of an LLM-based application for an under-resourced
community when the communication channel between the LLM developer and domain
expert is constrained? We explored this question through a real-world case
study, in which our data science team sought to partner with social workers at
a safety net hospital to build an LLM application that summarizes patients'
social needs. Whereas prior works focus on the challenge of prompt tuning, we
found that the most critical challenge in this setting is the careful and
precise specification of what information to surface to providers so that the
application is accurate, comprehensive, and verifiable. Here we present a
novel co-design framework for settings with limited access to domain experts,
in which the summary generation task is first decomposed into
individually-optimizable attributes and then each attribute is efficiently
refined and validated through a multi-tier cascading approach.
Vector-Centric Machine Learning Systems A Cross-Stack Approach
Authors: Wenqi Jiang
2025-08-11
Today, two major trends are shaping the evolution of ML systems. First,
modern AI systems are becoming increasingly complex, often integrating
components beyond the model itself. A notable example is Retrieval-Augmented
Generation (RAG), which incorporates not only multiple models but also vector
databases, leading to heterogeneity in both system components and underlying
hardware. Second, with the end of Moore's Law, achieving high system efficiency
is no longer feasible without accounting for the rapid evolution of the
hardware landscape.
Building on the observations above, this thesis adopts a cross-stack approach
to improving ML system efficiency, presenting solutions that span algorithms,
systems, and hardware. First, it introduces several pioneering works about RAG
efficiency across the computing stack. PipeRAG focuses on
algorithm-level improvements, RAGO introduces system-level optimizations, and
Chameleon explores heterogeneous accelerator systems for RAG. Second, this
thesis investigates algorithm-hardware co-design for vector search.
Specifically, FANNS and Falcon optimize quantization-based and graph-based
vector search, the two most popular paradigms of retrieval algorithms. Third,
this thesis addresses the serving efficiency of recommender systems, another
example of vector-centric ML systems, where the memory-intensive lookup
operations on embedding vector tables often represent a major performance
bottleneck. MicroRec and FleetRec propose solutions at the hardware and system
levels, respectively, optimizing both data movement and computation to enhance
the efficiency of large-scale recommender models.
Architecting Long-Context LLM Acceleration with Packing-Prefetch Scheduler and Ultra-Large Capacity On-Chip Memories
Authors: Ming-Yen Lee, Faaiq Waqar, Hanchen Yang, Muhammed Ahosan Ul Karim, Harsono Simka, Shimeng Yu
2025-08-11
Long-context Large Language Model (LLM) inference faces increasing compute
bottlenecks as attention calculations scale with context length, primarily due
to the growing KV-cache transfer overhead that saturates High Bandwidth Memory
(HBM). While prefetching techniques mitigate cache misses by fetching KV data
in advance, their spatial and temporal benefits present new opportunities to
exploit. This work proposes a packing-prefetch scheduling architecture with
monolithic 3D (M3D) back-end-of-line (BEOL) compatible embedded memories with
ultra-large on-chip capacity to accelerate long-context LLM inference. Our
optimizations demonstrate 8.06x decode speedup and 1.83x overall latency
reduction on Llama3.1-8B using TPUv6e-like hardware with additional 512MB BEOL
memories over the serial execution. Evaluations of multi-request workloads on
TPU-like architectures show 1.7x-2.4x throughput improvement and 1.5x-2.4x HBM
bandwidth reduction compared to packing-only methods on Llama3.1-8B and
Llama3.1-70B models. With the co-design of packing, prefetching, and BEOL
memories, our approach alleviates HBM constraints and enables efficient
long-context LLM inference.
OverFill Two-Stage Models for Efficient Language Model Decoding
Authors: Woojeong Kim, Junxiong Wang, Jing Nathan Yan, Mohamed Abdelfattah, Alexander M. Rush
2025-08-11
Large language models (LLMs) excel across diverse tasks but face significant
deployment challenges due to high inference costs. LLM inference comprises
prefill (compute-bound) and decode (memory-bound) stages, with decode
dominating latency particularly for long sequences. Current decoder-only models
handle both stages uniformly, despite their distinct computational profiles. We
propose OverFill, which decouples these stages to optimize accuracy-efficiency
tradeoffs. OverFill begins with a full model for prefill, processing system and
user inputs in parallel. It then switches to a dense pruned model, while
generating tokens sequentially. Leveraging more compute during prefill,
OverFill improves generation quality with minimal latency overhead. Our
3B-to-1B OverFill configuration outperforms 1B pruned models by 83.2%, while
the 8B-to-3B configuration improves over 3B pruned models by 79.2% on average
across standard benchmarks. OverFill matches the performance of same-sized
models trained from scratch, while using significantly less training data. Our
code is available at https://github.com/friendshipkim/overfill.
Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference
Authors: Kexin Chu, Zecheng Lin, Dawei Xiang, Zixu Shen, Jianchang Su, Cheng Chu, Yiwei Yang, Wenhui Zhang, Wenfei Wu, Wei Zhang
2025-08-11
Global KV-cache sharing has emerged as a key optimization for accelerating
large language model (LLM) inference. However, it exposes a new class of timing
side-channel attacks, enabling adversaries to infer sensitive user inputs via
shared cache entries. Existing defenses, such as per-user isolation, eliminate
leakage but degrade performance by up to 38.9% in time-to-first-token (TTFT),
making them impractical for high-throughput deployment. To address this gap, we
introduce SafeKV (Secure and Flexible KV Cache Sharing), a privacy-aware
KV-cache management framework that selectively shares non-sensitive entries
while confining sensitive content to private caches. SafeKV comprises three
components: (i) a hybrid, multi-tier detection pipeline that integrates
rule-based pattern matching, a general-purpose privacy detector, and
context-aware validation; (ii) a unified radix-tree index that manages public
and private entries across heterogeneous memory tiers (HBM, DRAM, SSD); and
(iii) entropy-based access monitoring to detect and mitigate residual
information leakage. Our evaluation shows that SafeKV mitigates 94%-97% of
timing-based side-channel attacks. Compared to the per-user isolation method,
SafeKV improves TTFT by up to 40.58% and throughput by up to 2.66x across
diverse LLMs and workloads. SafeKV reduces the TTFT overhead from
50.41% to 11.74% on Qwen3-235B. By combining fine-grained privacy control with
high cache reuse efficiency, SafeKV reclaims the performance advantages of
global sharing while providing robust runtime privacy guarantees for LLM
inference.
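A toy sketch of the multi-tier detection gate (rules and detector invented): cheap regex rules catch obvious identifiers, a slower detector handles the rest, and the verdict decides between the shared and private cache pools.

    import re

    RULES = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN-like pattern
             re.compile(r"\b[\w.]+@[\w.]+\.\w+\b")]     # email-like pattern

    def slow_detector(text: str) -> bool:
        return "password" in text.lower()               # stand-in for an ML model

    def is_sensitive(text: str) -> bool:
        if any(r.search(text) for r in RULES):          # fast rule tier first
            return True
        return slow_detector(text)                      # fallback tier

    cache_pool = {}
    def route(prompt: str):
        pool = "private" if is_sensitive(prompt) else "shared"
        cache_pool.setdefault(pool, []).append(prompt)
        return pool

    print(route("my email is a@b.com"), route("what is 2+2?"))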
Follow-Your-Shape Shape-Aware Image Editing via Trajectory-Guided Region Control
Authors: Zeqian Long, Mingzhe Zheng, Kunyu Feng, Xinhua Zhang, Hongyu Liu, Harry Yang, Linfeng Zhang, Qifeng Chen, Yue Ma
2025-08-11
While recent flow-based image editing models demonstrate general-purpose
capabilities across diverse tasks, they often struggle to specialize in
challenging scenarios -- particularly those involving large-scale shape
transformations. When performing such structural edits, these methods either
fail to achieve the intended shape change or inadvertently alter non-target
regions, resulting in degraded background quality. We propose
Follow-Your-Shape, a training-free and mask-free framework that supports
precise and controllable editing of object shapes while strictly preserving
non-target content. Motivated by the divergence between inversion and editing
trajectories, we compute a Trajectory Divergence Map (TDM) by comparing
token-wise velocity differences between the inversion and denoising paths. The
TDM enables precise localization of editable regions and guides a Scheduled
Injection mechanism that ensures stable and faithful editing. To facilitate a
rigorous evaluation, we introduce ReShapeBench, a new benchmark comprising 120
new images and enriched prompt pairs specifically curated for shape-aware
editing. Experiments demonstrate that our method achieves superior editability
and visual fidelity, particularly in tasks requiring large-scale shape
replacement.
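A toy sketch of the TDM computation (all tensors invented): token-wise norms of the velocity difference between the inversion and denoising paths mark the editable region.

    import torch

    v_inv = torch.randn(16, 16, 4)    # inversion-path velocity per latent token
    v_edit = torch.randn(16, 16, 4)   # denoising-path velocity per latent token
    tdm = (v_edit - v_inv).norm(dim=-1)          # high value = editable region
    mask = tdm > tdm.quantile(0.8)               # keep top-20% divergent tokens
    print(mask.float().mean())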
BlindGuard Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks
Authors: Rui Miao, Yixin Liu, Yili Wang, Xu Shen, Yue Tan, Yiwei Dai, Shirui Pan, Xin Wang
2025-08-11
The security of LLM-based multi-agent systems (MAS) is critically threatened
by propagation vulnerability, where malicious agents can distort collective
decision-making through inter-agent message interactions. While existing
supervised defense methods demonstrate promising performance, they may be
impractical in real-world scenarios due to their heavy reliance on labeled
malicious agents to train a supervised malicious detection model. To enable
practical and generalizable MAS defenses, in this paper, we propose BlindGuard,
an unsupervised defense method that learns without requiring any
attack-specific labels or prior knowledge of malicious behaviors. To this end,
we establish a hierarchical agent encoder to capture individual, neighborhood,
and global interaction patterns of each agent, providing a comprehensive
understanding for malicious agent detection. Meanwhile, we design a
corruption-guided detector that consists of directional noise injection and
contrastive learning, allowing effective detection model training solely on
normal agent behaviors. Extensive experiments show that BlindGuard effectively
detects diverse attack types (i.e., prompt injection, memory poisoning, and
tool attack) across MAS with various communication patterns while maintaining
superior generalizability compared to supervised baselines. The code is
available at: https://github.com/MR9812/BlindGuard.
TeamMedAgents Enhancing Medical Decision-Making of LLMs Through Structured Teamwork
Authors: Pranav Pushkar Mishra, Mohammad Arvan, Mohan Zalake
2025-08-11
We present TeamMedAgents, a novel multi-agent approach that systematically
integrates evidence-based teamwork components from human-human collaboration
into medical decision-making with large language models (LLMs). Our approach
validates an organizational psychology teamwork model from human collaboration
to computational multi-agent medical systems by operationalizing six core
teamwork components derived from Salas et al.'s "Big Five" model: team
leadership, mutual performance monitoring, team orientation, shared mental
models, closed-loop communication, and mutual trust. We implement and evaluate
these components as modular, configurable mechanisms within an adaptive
collaboration architecture while assessing the effect of the number of agents
involved based on the task's requirements and domain. Systematic evaluation of
computational implementations of teamwork behaviors across eight medical
benchmarks (MedQA, MedMCQA, MMLU-Pro Medical, PubMedQA, DDXPlus, MedBullets,
Path-VQA, and PMC-VQA) demonstrates consistent improvements across 7 out of 8
evaluated datasets. Controlled ablation studies conducted on 50 questions per
configuration across 3 independent runs provide mechanistic insights into
individual component contributions, revealing optimal teamwork configurations
that vary by reasoning task complexity and domain-specific requirements. Our
ablation analyses reveal dataset-specific optimal teamwork configurations,
indicating that different medical reasoning modalities benefit from distinct
collaborative patterns. TeamMedAgents represents an advancement in
collaborative AI by providing a systematic translation of established teamwork
theories from human collaboration into agentic collaboration, establishing a
foundation for evidence-based multi-agent system design in critical
decision-making domains.
ChatGPT on the Road Leveraging Large Language Model-Powered In-vehicle Conversational Agents for Safer and More Enjoyable Driving Experience
Authors: Yeana Lee Bond, Mungyeong Choe, Baker Kasim Hasan, Arsh Siddiqui, Myounghoon Jeon
2025-08-11
Studies on in-vehicle conversational agents have traditionally relied on
pre-scripted prompts or limited voice commands, constraining natural
driver-agent interaction. To resolve this issue, the present study explored the
potential of a ChatGPT-based in-vehicle agent capable of carrying continuous,
multi-turn dialogues. Forty drivers participated in our experiment using a
motion-based driving simulator, comparing three conditions (No agent,
Pre-scripted agent, and ChatGPT-based agent) as a within-subjects variable.
Results showed that the ChatGPT-based agent condition led to more stable
driving performance across multiple metrics. Participants demonstrated lower
variability in longitudinal acceleration, lateral acceleration, and lane
deviation compared to the other two conditions. In subjective evaluations, the
ChatGPT-based agent also received significantly higher ratings in competence,
animacy, affective trust, and preference compared to the Pre-scripted agent.
Our thematic analysis of driver-agent conversations revealed diverse
interaction patterns in topics, including driving assistance/questions,
entertainment requests, and anthropomorphic interactions. Our results highlight
the potential of LLM-powered in-vehicle conversational agents to enhance
driving safety and user experience through natural, context-rich interactions.
Bridging ASR and LLMs for Dysarthric Speech Recognition Benchmarking Self-Supervised and Generative Approaches
Authors: Ahmed Aboeitta, Ahmed Sharshar, Youssef Nafea, Shady Shehata
2025-08-11
Dysarthric speech poses significant challenges for Automatic Speech
Recognition (ASR) due to phoneme distortions and high variability. While
self-supervised ASR models like Wav2Vec, HuBERT, and Whisper have shown
promise, their effectiveness in dysarthric speech remains unclear. This study
systematically benchmarks these models with different decoding strategies,
including CTC, seq2seq, and LLM-enhanced decoding (BART, GPT-2, Vicuna). Our
contributions include (1) benchmarking ASR architectures for dysarthric
speech, (2) introducing LLM-based decoding to improve intelligibility, (3)
analyzing generalization across datasets, and (4) providing insights into
recognition errors across severity levels. Findings highlight that
LLM-enhanced decoding improves dysarthric ASR by leveraging linguistic
constraints for phoneme restoration and grammatical correction.
Interpreting Fedspeak with Confidence A LLM-Based Uncertainty-Aware Framework Guided by Monetary Policy Transmission Paths
Authors: Rui Yao, Qi Chai, Jinhai Yao, Siyuan Li, Junhao Chen, Qi Zhang, Hao Wang
2025-08-11
"Fedspeak", the stylized and often nuanced language used by the U.S. Federal
Reserve, encodes implicit policy signals and strategic stances. The Federal
Open Market Committee strategically employs Fedspeak as a tool to
shape market expectations and influence both domestic and global economic
conditions. As such, automatically parsing and interpreting Fedspeak presents a
high-impact challenge, with significant implications for financial forecasting,
algorithmic trading, and data-driven policy analysis. In this paper, we propose
an LLM-based, uncertainty-aware framework for deciphering Fedspeak and
classifying its underlying monetary policy stance. Technically, to enrich the
semantic and contextual representation of Fedspeak texts, we incorporate
domain-specific reasoning grounded in the monetary policy transmission
mechanism. We further introduce a dynamic uncertainty estimation module to assess
the confidence of model predictions, thereby enhancing both classification
accuracy and model reliability. Experimental results demonstrate that our
framework achieves state-of-the-art performance on the policy stance analysis
task. Moreover, statistical analysis reveals a significant positive correlation
between perceptual uncertainty and model error rates, validating the
effectiveness of perceptual uncertainty as a diagnostic signal.
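The abstract does not spell out how the uncertainty estimation module works; one common, lightweight proxy for an LLM classifier's uncertainty is the entropy of its stance distribution over repeated samples, sketched below (the helper and the threshold are assumptions, not the paper's method):

    import math
    from collections import Counter

    def predictive_entropy(stance_samples: list[str]) -> float:
        """Entropy of the empirical stance distribution over repeated samples.

        Sampling the same prompt several times (temperature > 0) and measuring
        disagreement is a simple stand-in for a dynamic uncertainty module.
        """
        counts = Counter(stance_samples)
        n = len(stance_samples)
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Example: 10 sampled stance labels for one FOMC sentence.
    samples = ["hawkish"] * 7 + ["neutral"] * 2 + ["dovish"]
    h = predictive_entropy(samples)
    print(f"entropy = {h:.3f}")  # higher entropy -> less reliable prediction
    if h > 0.9:                  # assumed threshold for illustration
        print("low confidence: route to human review or abstain")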
DiTVR Zero-Shot Diffusion Transformer for Video Restoration
Authors: Sicheng Gao, Nancy Mehta, Zongwei Wu, Radu Timofte
2025-08-11
Video restoration aims to reconstruct high quality video sequences from low
quality inputs, addressing tasks such as super resolution, denoising, and
deblurring. Traditional regression based methods often produce unrealistic
details and require extensive paired datasets, while recent generative
diffusion models face challenges in ensuring temporal consistency. We introduce
DiTVR, a zero shot video restoration framework that couples a diffusion
transformer with trajectory aware attention and a wavelet guided, flow
consistent sampler. Unlike prior 3D convolutional or frame wise diffusion
approaches, our attention mechanism aligns tokens along optical flow
trajectories, with particular emphasis on vital layers that exhibit the highest
sensitivity to temporal dynamics. A spatiotemporal neighbour cache dynamically
selects relevant tokens based on motion correspondences across frames. The flow
guided sampler injects data consistency only into low-frequency bands,
preserving high frequency priors while accelerating convergence. DiTVR
establishes a new zero shot state of the art on video restoration benchmarks,
demonstrating superior temporal consistency and detail preservation while
remaining robust to flow noise and occlusions.
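A toy illustration of the wavelet-guided consistency idea, where observed data is blended into the approximation (low-frequency) band only while the detail coefficients keep the generative prior; this is a simplified sketch using PyWavelets, not the paper's sampler:

    import numpy as np
    import pywt

    def lowfreq_data_consistency(x_pred: np.ndarray, y_obs: np.ndarray,
                                 strength: float = 1.0) -> np.ndarray:
        """Blend the observation into the low-frequency wavelet band only.

        The approximation coefficients of the diffusion prediction are pulled
        toward the degraded observation; detail (high-frequency) coefficients,
        which carry generative priors, are left untouched.
        """
        cA_pred, details_pred = pywt.dwt2(x_pred, "haar")
        cA_obs, _ = pywt.dwt2(y_obs, "haar")
        cA_mixed = (1.0 - strength) * cA_pred + strength * cA_obs
        return pywt.idwt2((cA_mixed, details_pred), "haar")

    # Toy example on random frames (stand-ins for a denoised sample and input).
    rng = np.random.default_rng(0)
    frame_pred = rng.standard_normal((64, 64))
    frame_obs = rng.standard_normal((64, 64))
    out = lowfreq_data_consistency(frame_pred, frame_obs, strength=0.5)
    print(out.shape)  # (64, 64)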
EvoCoT Overcoming the Exploration Bottleneck in Reinforcement Learning
Authors: Huanyu Liu, Jia Li, Chang Yu, Taozhi Chen, Yihong Dong, Lecheng Wang, Hu XiaoLong, Ge Li
2025-08-11
Reinforcement learning with verifiable reward (RLVR) has become a promising
paradigm for post-training large language models (LLMs) to improve their
reasoning capability. However, when the rollout accuracy is low on hard
problems, the reward becomes sparse, limiting learning efficiency and causing
exploration bottlenecks. Existing approaches either rely on stronger LLMs for
distillation or filter out difficult problems, which limits scalability or
restricts reasoning improvement through exploration.
We propose EvoCoT, a self-evolving curriculum learning framework based on
two-stage chain-of-thought (CoT) reasoning optimization. EvoCoT constrains the
exploration space by self-generating and verifying CoT trajectories, then
gradually shortens them to expand the space in a controlled way. This enables
LLMs to stably learn from initially unsolved hard problems under sparse
rewards. We apply EvoCoT to multiple LLM families, including Qwen, DeepSeek,
and Llama. Experiments show that EvoCoT enables LLMs to solve previously
unsolved problems, improves reasoning capability without external CoT
supervision, and is compatible with various RL fine-tuning methods. We release
the source code to support future research.
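A compact sketch of the two-stage loop the abstract describes, with hypothetical llm_generate and verify callables standing in for the model call and the verifiable-reward check:

    def evocot_round(problem: str, answer: str, llm_generate, verify,
                     keep_fracs=(1.0, 0.75, 0.5, 0.25, 0.0)):
        """Sketch of EvoCoT's two stages, inferred from the abstract.

        Stage 1: self-generate and verify a CoT trajectory for a hard problem.
        Stage 2: progressively shorten the retained CoT prefix, expanding the
        exploration space in a controlled way.
        """
        # Stage 1: obtain one verified trajectory.
        cot = llm_generate(f"Solve step by step:\n{problem}")
        if not verify(problem, answer, cot):
            return None  # still unsolved; retry later
        steps = cot.split("\n")

        # Stage 2: curriculum of ever-shorter CoT hints.
        curriculum = []
        for frac in keep_fracs:
            k = int(len(steps) * frac)
            hint = "\n".join(steps[:k])
            curriculum.append(f"{problem}\nPartial reasoning:\n{hint}")
        return curriculum  # fed to any RL fine-tuning method

    # Toy usage with a stub model and verifier.
    demo = evocot_round(
        "What is 17 * 23?", "391",
        llm_generate=lambda p: "17*23 = 17*20 + 17*3\n= 340 + 51\n= 391",
        verify=lambda prob, ans, cot: cot.strip().endswith(ans),
    )
    print(len(demo), "curriculum stages")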
Grove MoE Towards Efficient and Superior MoE LLMs with Adjugate Experts
Authors: Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, Zhenzhong Lan, Bei Yu, Jianguo Li
2025-08-11
The Mixture of Experts (MoE) architecture is a cornerstone of modern
state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate
scalability by enabling sparse parameter activation. However, traditional MoE
architecture uses homogeneous experts of a uniform size, activating a fixed
number of parameters irrespective of input complexity and thus limiting
computational efficiency. To overcome this limitation, we introduce Grove MoE,
a novel architecture incorporating experts of varying sizes, inspired by the
heterogeneous big.LITTLE CPU architecture. This architecture features novel
adjugate experts with a dynamic activation mechanism, enabling model capacity
expansion while maintaining manageable computational overhead. Building on this
architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs
developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model
during mid-training and post-training. GroveMoE models dynamically activate
3.14-3.28B parameters based on token complexity and achieve performance
comparable to SOTA open-source models of similar or even larger size.
SASST Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation
Authors: Zeyu Yang, Lai Wei, Roman Koshkin, Xi Chen, Satoshi Nakamura
2025-08-11
This work proposes a grammar-based chunking strategy that segments input
streams into semantically complete units by parsing dependency relations (e.g.,
noun phrase boundaries, verb-object structures) and punctuation features. The
method ensures chunk coherence and minimizes semantic fragmentation. Building
on this mechanism, we present SASST (Syntax-Aware Simultaneous Speech
Translation), an end-to-end framework integrating a frozen Whisper encoder and
a decoder-only LLM. The unified architecture dynamically outputs translation
tokens or wait decisions; experiments demonstrate its advantages over existing
LLM-driven SimulST systems.
Symmetry-Aware Transformer Training for Automated Planning
Authors: Markus Fritzsche, Elliot Gestrin, Jendrik Seipp
2025-08-11
While transformers excel in many settings, their application in the field of
automated planning is limited. Prior work like PlanGPT, a state-of-the-art
decoder-only transformer, struggles with extrapolation from easy to hard
planning problems. This in turn stems from problem symmetries: planning tasks
can be represented with arbitrary variable names that carry no meaning beyond
being identifiers. This causes a combinatorial explosion of equivalent
representations that pure transformers cannot efficiently learn from. We
propose a novel contrastive learning objective to make transformers
symmetry-aware and thereby compensate for their lack of inductive bias.
Combining this with architectural improvements, we show that transformers can
be efficiently trained for either plan-generation or heuristic-prediction. Our
results across multiple planning domains demonstrate that our symmetry-aware
training effectively and efficiently addresses the limitations of PlanGPT.
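One plausible form of such a contrastive objective is an InfoNCE loss that pulls together embeddings of the same planning task under two random variable renamings; the details below are illustrative assumptions, not the paper's exact loss (PyTorch):

    import torch
    import torch.nn.functional as F

    def symmetry_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                                  temperature: float = 0.1) -> torch.Tensor:
        """InfoNCE-style objective over symmetric planning tasks.

        emb_a[i] and emb_b[i] embed the *same* task under two random variable
        renamings (a symmetry); other rows in the batch act as negatives.
        Pulling renamed variants together encourages invariance to
        identifier permutations.
        """
        a = F.normalize(emb_a, dim=-1)
        b = F.normalize(emb_b, dim=-1)
        logits = a @ b.t() / temperature      # pairwise similarities
        targets = torch.arange(a.size(0))     # positives on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Toy batch: 8 tasks, 64-dim embeddings for two renamings of each.
    za, zb = torch.randn(8, 64), torch.randn(8, 64)
    print(symmetry_contrastive_loss(za, zb).item())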
Semantic Caching for Low-Cost LLM Serving From Offline Learning to Online Adaptation
Authors: Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong
2025-08-11
Large Language Models (LLMs) are revolutionizing how users interact with
information systems, yet their high inference cost poses serious scalability
and sustainability challenges. Caching inference responses, allowing them to be
retrieved without another forward pass through the LLM, has emerged as one
possible solution. Traditional exact-match caching, however, overlooks the
semantic similarity between queries, leading to unnecessary recomputation.
Semantic caching addresses this by retrieving responses based on semantic
similarity, but introduces a fundamentally different cache eviction problem:
one must account for mismatch costs between incoming queries and cached
responses. Moreover, key system parameters, such as query arrival probabilities
and serving costs, are often unknown and must be learned over time. Existing
semantic caching methods are largely ad-hoc, lacking theoretical foundations
and unable to adapt to real-world uncertainty. In this paper, we present a
principled, learning-based framework for semantic cache eviction under unknown
query and cost distributions. We formulate both offline optimization and online
learning variants of the problem, and develop provably efficient algorithms
with state-of-the-art guarantees. We also evaluate our framework on a synthetic
dataset, showing that our proposed algorithms achieve matching or superior
performance compared with baselines.
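For intuition, a minimal semantic cache with cosine-similarity lookup and plain LRU eviction is sketched below; the paper's actual contribution, a learning-based eviction policy with theoretical guarantees, is deliberately not reproduced here, and embed is a hypothetical embedding callable:

    import numpy as np

    class SemanticCache:
        """Minimal semantic cache: cosine-similarity lookup, LRU eviction."""

        def __init__(self, embed, capacity: int = 128, threshold: float = 0.9):
            self.embed, self.capacity, self.threshold = embed, capacity, threshold
            self.entries: list[tuple[np.ndarray, str, str]] = []  # (vec, q, resp)

        def get(self, query: str):
            q = self.embed(query)
            best, best_sim = None, self.threshold
            for i, (v, _, _) in enumerate(self.entries):
                sim = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
                if sim >= best_sim:
                    best, best_sim = i, sim
            if best is None:
                return None  # miss -> caller runs the LLM and calls put()
            entry = self.entries.pop(best)
            self.entries.append(entry)  # move hit to MRU position
            return entry[2]

        def put(self, query: str, response: str):
            if len(self.entries) >= self.capacity:
                self.entries.pop(0)  # evict least recently used
            self.entries.append((self.embed(query), query, response))

    # Toy usage with a bag-of-characters "embedding" as a stand-in.
    embed = lambda s: np.bincount([ord(c) % 32 for c in s.lower()],
                                  minlength=32).astype(float)
    cache = SemanticCache(embed, capacity=4, threshold=0.95)
    cache.put("What is the capital of France?", "Paris")
    print(cache.get("what is the capital of france"))  # hit via similarity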
GLiClass Generalist Lightweight Model for Sequence Classification Tasks
Authors: Ihor Stepanov, Mykhailo Shtopko, Dmytro Vodianytskyi, Oleksandr Lukashov, Alexander Yavorskyi, Mykyta Yaroshenko
2025-08-11
Classification is one of the most widespread tasks in AI applications,
often serving as the first step in filtering, sorting, and categorizing data.
Since modern AI systems must handle large volumes of input data and early
pipeline stages can propagate errors downstream, achieving high efficiency and
accuracy is critical. Moreover, classification requirements can change
dynamically based on user needs, necessitating models with strong zero-shot
capabilities. While generative LLMs have become mainstream for zero-shot
classification due to their versatility, they suffer from inconsistent
instruction following and computational inefficiency. Cross-encoders, commonly
used as rerankers in RAG pipelines, face a different bottleneck: they must
process text-label pairs sequentially, significantly reducing efficiency with
large label sets. Embedding-based approaches offer good efficiency but struggle
with complex scenarios involving logical and semantic constraints. We propose
GLiClass, a novel method that adapts the GLiNER architecture for sequence
classification tasks. Our approach achieves strong accuracy and efficiency
comparable to embedding-based methods, while maintaining the flexibility needed
for zero-shot and few-shot learning scenarios. Additionally, we adapted
proximal policy optimization (PPO) for multi-label text classification,
enabling training classifiers in data-sparse conditions or from human feedback.
LaVieID Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation
Authors: Wenhui Song, Hanhui Li, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang
2025-08-11
In this paper, we present LaVieID, a novel \underline{l}ocal
\underline{a}utoregressive \underline{vi}d\underline{e}o diffusion framework
designed to tackle the challenging \underline{id}entity-preserving
text-to-video task. The key idea of LaVieID is to mitigate the loss of identity
information inherent in the stochastic global generation process of diffusion
transformers (DiTs) from both spatial and temporal perspectives. Specifically,
unlike the global and unstructured modeling of facial latent states in existing
DiTs, LaVieID introduces a local router to explicitly represent latent states
by weighted combinations of fine-grained local facial structures. This
alleviates undesirable feature interference and encourages DiTs to capture
distinctive facial characteristics. Furthermore, a temporal autoregressive
module is integrated into LaVieID to refine denoised latent tokens before video
decoding. This module divides latent tokens temporally into chunks, exploiting
their long-range temporal dependencies to predict biases for rectifying tokens,
thereby significantly enhancing inter-frame identity consistency. Consequently,
LaVieID can generate high-fidelity personalized videos and achieve
state-of-the-art performance. Our code and models are available at
https://github.com/ssugarwh/LaVieID.
HGMF A Hierarchical Gaussian Mixture Framework for Scalable Tool Invocation within the Model Context Protocol
Authors: Wenpeng Xing, Zhipeng Chen, Changting Lin, Meng Han
2025-08-11
Invoking external tools enables Large Language Models (LLMs) to perform
complex, real-world tasks, yet selecting the correct tool from large,
hierarchically-structured libraries remains a significant challenge. The
limited context windows of LLMs and noise from irrelevant options often lead to
low selection accuracy and high computational costs. To address this, we
propose the Hierarchical Gaussian Mixture Framework (HGMF), a probabilistic
method for scalable tool invocation. HGMF first maps the user query and
all tool descriptions into a unified semantic space. The framework then
operates in two stages: it clusters servers using a Gaussian Mixture Model
(GMM) and filters them based on the query's likelihood. Subsequently, it
applies the same GMM-based clustering and filtering to the tools associated
with the selected servers. This hierarchical process produces a compact,
high-relevance candidate set, simplifying the final selection task for the
LLM.
Experiments on a public dataset show that HGMF significantly improves tool
selection accuracy while reducing inference latency, confirming the framework's
scalability and effectiveness for large-scale tool libraries.
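A sketch of one HGMF-style stage using scikit-learn, where items (servers, then their tools) are clustered with a GMM and only clusters the query is likely under survive; the embeddings and parameters are illustrative assumptions:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def gmm_filter(query_vec: np.ndarray, item_vecs: np.ndarray,
                   n_components: int = 4, top: int = 2) -> np.ndarray:
        """One HGMF-style stage: cluster items, keep query-likely clusters.

        Fits a GMM over item embeddings, scores components by the query's
        posterior responsibility, and returns indices of items in the `top`
        highest-scoring clusters. Applying this first to server embeddings
        and then to the surviving servers' tools gives the two-stage
        hierarchy described above.
        """
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        labels = gmm.fit_predict(item_vecs)
        comp_scores = gmm.predict_proba(query_vec[None, :])[0]
        keep = np.argsort(comp_scores)[-top:]
        return np.where(np.isin(labels, keep))[0]

    rng = np.random.default_rng(0)
    servers = rng.standard_normal((40, 16))  # stand-in server embeddings
    query = servers[3] + 0.05 * rng.standard_normal(16)
    print(gmm_filter(query, servers))        # compact candidate index set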
Towards Theoretical Understanding of Transformer Test-Time Computing Investigation on In-Context Linear Regression
Authors: Xingwu Chen, Miao Lu, Beining Wu, Difan Zou
2025-08-11
Using more test-time computation during language model inference, such as
generating more intermediate thoughts or sampling multiple candidate answers,
has proven effective in significantly improving model performance. This paper
takes an initial step toward bridging the gap between practical language model
inference and theoretical analysis by incorporating randomness and
sampling. We focus on in-context linear regression with continuous/binary
coefficients, where our framework simulates language model decoding through
noise injection and binary coefficient sampling. Through this framework, we
provide detailed analyses of widely adopted inference techniques. Supported by
empirical results, our theoretical framework and analysis demonstrate the
potential for offering new insights into understanding inference behaviors in
real-world language models.
Grounding Natural Language for Multi-agent Decision-Making with Multi-agentic LLMs
Authors: Dom Huh, Prasant Mohapatra
2025-08-10
Language is a ubiquitous tool that is foundational to reasoning and
collaboration, ranging from everyday interactions to sophisticated
problem-solving tasks. The establishment of a common language can serve as a
powerful asset in ensuring clear communication and understanding amongst
agents, facilitating desired coordination and strategies. In this work, we
extend the capabilities of large language models (LLMs) by integrating them
with advancements in multi-agent decision-making algorithms. We propose a
systematic framework for the design of multi-agentic large language models
(LLMs), focusing on key integration practices. These include advanced prompt
engineering techniques, the development of effective memory architectures,
multi-modal information processing, and alignment strategies through
fine-tuning algorithms. We evaluate these design choices through extensive
ablation studies on classic game settings with significant underlying social
dilemmas and game-theoretic considerations.
Investigating 1-Bit Quantization in Transformer-Based Top Tagging
Authors: Saurabh Rai, Prisha, Jitendra Kumar
2025-08-10
The increasing scale of deep learning models in high-energy physics (HEP) has
posed challenges to their deployment on low-power, latency-sensitive platforms,
such as FPGAs and ASICs used in trigger systems, as well as in offline data
reconstruction and processing pipelines. In this work, we introduce BitParT, a
1-bit Transformer-based architecture designed specifically for the top-quark
tagging task. Building upon recent advances in ultra-low-bit large language
models (LLMs), we extended these ideas to the HEP domain by developing a
binary-weight variant (BitParT) of the Particle Transformer (ParT) model. Our
findings indicate a potential for substantial reduction in model size and
computational complexity, while maintaining high tagging performance. We
benchmark BitParT on the public Top Quark Tagging Reference Dataset and show
that it achieves competitive performance relative to its full-precision
counterpart. This work demonstrates the design of extremely quantized models for
physics applications, paving the way for real-time inference in collider
experiments with minimal and optimized resource usage.
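For reference, the standard binary-weight recipe (sign of the weights plus a per-row mean-absolute scale) is sketched below; whether BitParT uses exactly this scheme is an assumption based on the abstract:

    import numpy as np

    def binarize_weights(w: np.ndarray):
        """1-bit weight quantization in the XNOR-Net/BitNet style.

        Each weight matrix is replaced by sign(w) plus one per-output-row
        scale equal to the mean absolute value.
        """
        scale = np.abs(w).mean(axis=1, keepdims=True)  # per-row scale
        return np.sign(w), scale

    def binary_linear(x: np.ndarray, w_bin: np.ndarray, scale: np.ndarray):
        # Full-precision activations times {-1, +1} weights, then rescale.
        return (x @ w_bin.T) * scale.T

    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 16))   # a full-precision layer
    x = rng.standard_normal((4, 16))   # a batch of particle features
    w_bin, s = binarize_weights(w)
    err = np.abs(x @ w.T - binary_linear(x, w_bin, s)).mean()
    print(f"mean approximation error: {err:.3f}")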
LET-US Long Event-Text Understanding of Scenes
Authors: Rui Chen, Xingyu Chen, Shaoan Wang, Shihan Kong, Junzhi Yu
2025-08-10
Event cameras output event streams as sparse, asynchronous data with
microsecond-level temporal resolution, enabling visual perception with low
latency and a high dynamic range. While existing Multimodal Large Language
Models (MLLMs) have achieved significant success in understanding and analyzing
RGB video content, they either fail to interpret event streams effectively or
remain constrained to very short sequences. In this paper, we introduce LET-US,
a framework for long event-stream--text comprehension that employs an adaptive
compression mechanism to reduce the volume of input events while preserving
critical visual details. LET-US thus establishes a new frontier in cross-modal
inferential understanding over extended event sequences. To bridge the
substantial modality gap between event streams and textual representations, we
adopt a two-stage optimization paradigm that progressively equips our model
with the capacity to interpret event-based scenes. To handle the voluminous
temporal information inherent in long event streams, we leverage text-guided
cross-modal queries for feature reduction, augmented by hierarchical clustering
and similarity computation to distill the most representative event features.
Moreover, we curate and construct a large-scale event-text aligned dataset to
train our model, achieving tighter alignment of event features within the
embedding space. We also develop a comprehensive benchmark covering a diverse
set of tasks -- reasoning, captioning, classification, temporal localization
and moment retrieval. Experimental results demonstrate that LET-US outperforms
prior state-of-the-art MLLMs in both descriptive accuracy and semantic
comprehension on long-duration event streams. All datasets, codes, and models
will be publicly available.
Efficient Edge LLMs Deployment via HessianAware Quantization and CPU GPU Collaborative Inference
Authors: Tuo Zhang, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang
2025-08-10
With the breakthrough progress of large language models (LLMs) in natural
language processing and multimodal tasks, efficiently deploying them on
resource-constrained edge devices has become a critical challenge. The Mixture
of Experts (MoE) architecture enhances model capacity through sparse
activation, but faces two major difficulties in practical deployment: (1) The
presence of numerous outliers in activation distributions leads to severe
degradation in quantization accuracy for both activations and weights,
significantly impairing inference performance; (2) Under limited memory,
efficient offloading and collaborative inference of expert modules struggle to
balance latency and throughput. To address these issues, this paper proposes an
efficient MoE edge deployment scheme based on Hessian-Aware Quantization (HAQ)
and CPU-GPU collaborative inference. First, by introducing smoothed Hessian
matrix estimation, we achieve joint 8-bit quantization of activations and
weights, which significantly alleviates the accuracy loss caused by outliers
while ensuring efficient implementation on mainstream hardware. Second, we
design an expert-level collaborative offloading and inference mechanism, which,
combined with expert activation path statistics, enables efficient deployment
and scheduling of expert modules between CPU and GPU, greatly reducing memory
footprint and inference latency. Extensive experiments validate the
effectiveness of our method on mainstream large models such as the OPT series
and Mixtral 8x7B: on datasets like Wikitext2 and C4, the inference accuracy of
the quantized model approaches that of the full-precision model, while
GPU memory usage is reduced by about 60%, and inference latency is
significantly improved.
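A simplified sketch of joint 8-bit quantization with outlier smoothing in the SmoothQuant style; the paper's Hessian-aware scales are omitted, so treat this as an illustrative baseline rather than the HAQ method:

    import numpy as np

    def smooth_and_quantize(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
        """Joint int8 quantization after migrating activation outliers.

        A per-input-channel scale moves outlier magnitude from activations
        into weights (a mathematically equivalent reparameterization), then
        symmetric int8 quantization is applied to both operands.
        """
        s = (np.abs(x).max(axis=0) ** alpha) / (np.abs(w).max(axis=0) ** (1 - alpha))
        s = np.clip(s, 1e-5, None)
        x_s, w_s = x / s, w * s

        def q8(t):
            scale = np.abs(t).max() / 127.0
            return np.round(t / scale).astype(np.int8), scale

        (xq, sx), (wq, sw) = q8(x_s), q8(w_s)
        return (xq.astype(np.int32) @ wq.astype(np.int32).T) * (sx * sw)

    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 64)); x[:, 7] *= 30  # inject an outlier channel
    w = rng.standard_normal((32, 64))
    err = np.abs(x @ w.T - smooth_and_quantize(x, w)).mean()
    print(f"mean int8 matmul error: {err:.4f}")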
BEVANet Bilateral Efficient Visual Attention Network for Real-Time Semantic Segmentation
Authors: Ping-Mao Huang, I-Tien Chao, Ping-Chia Huang, Jia-Wei Liao, Yung-Yu Chuang
2025-08-10
Real-time semantic segmentation presents the dual challenge of designing
efficient architectures that capture large receptive fields for semantic
understanding while also refining detailed contours. Vision transformers model
long-range dependencies effectively but incur high computational cost. To
address these challenges, we introduce the Large Kernel Attention (LKA)
mechanism. Our proposed Bilateral Efficient Visual Attention Network (BEVANet)
expands the receptive field to capture contextual information and extracts
visual and structural features using Sparse Decomposed Large Separable Kernel
Attentions (SDLSKA). The Comprehensive Kernel Selection (CKS) mechanism
dynamically adapts the receptive field to further enhance performance.
Furthermore, the Deep Large Kernel Pyramid Pooling Module (DLKPPM) enriches
contextual features by synergistically combining dilated convolutions and large
kernel attention. The bilateral architecture facilitates frequent branch
communication, and the Boundary Guided Adaptive Fusion (BGAF) module enhances
boundary delineation by integrating spatial and semantic features under
boundary guidance. BEVANet achieves real-time segmentation at 33 FPS, yielding
79.3% mIoU without pretraining and 81.0% mIoU on Cityscapes after ImageNet
pretraining, demonstrating state-of-the-art performance. The code and model are
available at https://github.com/maomao0819/BEVANet.
Tasa Thermal-aware 3D-Stacked Architecture Design with Bandwidth Sharing for LLM Inference
Authors: Siyuan He, Peiran Yan, Yandong He, Youwei Zhuo, Tianyu Jia
2025-08-10
The autoregressive decoding in LLMs is the major inference bottleneck due to
the memory-intensive operations and limited hardware bandwidth. 3D-stacked
architecture is a promising solution with significantly improved memory
bandwidth, which vertically stacks multiple DRAM dies on top of a logic die.
However, our experiments also show that the 3D-stacked architecture faces more
severe thermal issues than the 2D architecture, in terms of peak temperature,
thermal gradient, and scalability. To better exploit the potential of 3D-stacked
architecture, we present Tasa, a heterogeneous architecture with cross-stack
thermal optimizations to balance the temperature distribution and maximize the
performance under the thermal constraints. A high-performance core is designed
for compute-intensive operations, while a high-efficiency core is used for
memory-intensive operators, e.g. attention layers. Furthermore, we propose a
bandwidth sharing scheduling to improve the bandwidth utilization in such
heterogeneous architecture. Extensive thermal experiments show that our Tasa
architecture demonstrates greater scalability compared with the homogeneous
3D-stacked architecture, i.e. up to 5.55°C, 9.37°C, and 7.91°C peak
temperature reduction for 48-, 60-, and 72-core configurations. Our
experiments on Llama-65B and GPT-3 66B inference also demonstrate that 2.85x
and 2.21x speedups are obtained over the GPU baselines and a state-of-the-art
heterogeneous PIM-based LLM accelerator.
LP-Spec Leveraging LPDDR PIM for Efficient LLM Mobile Speculative Inference with Architecture-Dataflow Co-Optimization
Authors: Siyuan He, Zhantong Zhu, Yandong He, Tianyu Jia
2025-08-10
LLM inference on mobile devices faces significant challenges due to limited
memory bandwidth and computational resources. To address these issues,
speculative inference and processing-in-memory (PIM) techniques have been
explored at the algorithmic and hardware levels. However, speculative inference
results in more compute-intensive GEMM operations, creating new design
trade-offs for existing GEMV-accelerated PIM architectures. Furthermore, there
exists a significant amount of redundant draft tokens in tree-based speculative
inference, necessitating efficient token management schemes to minimize energy
consumption. In this work, we present LP-Spec, an architecture-dataflow
co-design leveraging hybrid LPDDR5 performance-enhanced PIM architecture with
draft token pruning and dynamic workload scheduling to accelerate
speculative inference. A near-data memory controller is proposed to enable data
reallocation between DRAM and PIM banks. Furthermore, a data allocation unit
based on the hardware-aware draft token pruner is developed to minimize energy
consumption and fully exploit parallel execution opportunities. Compared to
end-to-end LLM inference on other mobile solutions such as mobile NPUs or
GEMV-accelerated PIMs, our LP-Spec achieves 13.21x, 7.56x, and 99.87x
improvements in performance, energy efficiency, and energy-delay-product (EDP).
Compared with prior AttAcc PIM and RTX 3090 GPU, LP-Spec can obtain 12.83x and
415.31x EDP reduction benefits.
Bridging Semantic Logic Gaps A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization
Authors: Songlin Li, Zhiqing Guo, Yuanman Li, Zeyu Li, Yunfeng Diao, Gaobo Yang, Liejun Wang
2025-08-10
The existing image manipulation localization (IML) models mainly rely on
visual cues but ignore the semantic logical relationships between content
features. In fact, the content semantics conveyed by real images often conform
to human cognitive laws. However, image manipulation technology usually
destroys the internal relationship between content features, thus leaving
semantic clues for IML. In this paper, we propose a cognition-inspired
multimodal boundary-preserving network (CMB-Net). Specifically, CMB-Net
utilizes large language models (LLMs) to analyze manipulated regions within
images and generate prompt-based textual information to compensate for the lack
of semantic relationships in the visual information. Considering that the
erroneous texts induced by hallucination from LLMs will damage the accuracy of
IML, we propose an image-text central ambiguity module (ITCAM). It assigns
weights to the text features by quantifying the ambiguity between text and
image features, thereby ensuring the beneficial impact of textual information.
We also propose an image-text interaction module (ITIM) that aligns visual and
text features using a correlation matrix for fine-grained interaction. Finally,
inspired by invertible neural networks, we propose a restoration edge decoder
(RED) that mutually generates input and output features to preserve boundary
information in manipulated regions without loss. Extensive experiments show
that CMB-Net outperforms most existing IML models.
DySK-Attn A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention
Authors: Kabir Khan, Priya Sharma, Arjun Mehta, Neha Gupta, Ravi Narayanan
2025-08-10
Large Language Models (LLMs) suffer from a critical limitation: their
knowledge is static and quickly becomes outdated. Retraining these massive
models is computationally prohibitive, while existing knowledge editing
techniques can be slow and may introduce unforeseen side effects. To address
this, we propose DySK-Attn, a novel framework that enables LLMs to efficiently
integrate real-time knowledge from a dynamic external source. Our approach
synergizes an LLM with a dynamic Knowledge Graph (KG) that can be updated
instantaneously. The core of our framework is a sparse knowledge attention
mechanism, which allows the LLM to perform a coarse-to-fine grained search,
efficiently identifying and focusing on a small, highly relevant subset of
facts from the vast KG. This mechanism avoids the high computational cost of
dense attention over the entire knowledge base and mitigates noise from
irrelevant information. We demonstrate through extensive experiments on
time-sensitive question-answering tasks that DySK-Attn significantly
outperforms strong baselines, including standard Retrieval-Augmented Generation
(RAG) and model editing techniques, in both factual accuracy for updated
knowledge and computational efficiency. Our framework offers a scalable and
effective solution for building LLMs that can stay current with the
ever-changing world.
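A minimal sketch of the coarse-to-fine idea: score cluster centroids first, then attend only over facts inside the surviving clusters; all shapes and names are illustrative assumptions:

    import numpy as np

    def coarse_to_fine_facts(q: np.ndarray, cluster_centroids: np.ndarray,
                             cluster_facts: list[np.ndarray],
                             n_clusters: int = 2, n_facts: int = 5):
        """Coarse-to-fine sparse retrieval over KG fact embeddings.

        Stage 1 (coarse): keep the clusters whose centroids best match the
        query. Stage 2 (fine): softmax-attend only over facts in those
        clusters, avoiding dense attention over the whole KG.
        """
        coarse = np.argsort(cluster_centroids @ q)[-n_clusters:]   # stage 1
        cand = np.concatenate([cluster_facts[c] for c in coarse])  # shortlist
        scores = cand @ q                                          # stage 2
        top = np.argsort(scores)[-n_facts:]
        weights = np.exp(scores[top] - scores[top].max())
        weights /= weights.sum()                                   # softmax
        return weights @ cand[top]  # attended fact vector for the LLM

    rng = np.random.default_rng(0)
    centroids = rng.standard_normal((16, 32))                # 16 KG clusters
    facts = [rng.standard_normal((100, 32)) for _ in range(16)]
    print(coarse_to_fine_facts(rng.standard_normal(32), centroids, facts).shape)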
How Effectively Can Large Language Models Connect SNP Variants and ECG Phenotypes for Cardiovascular Risk Prediction?
Authors: Niranjana Arun Menon, Iqra Farooq, Yulong Li, Sara Ahmed, Yutong Xie, Muhammad Awais, Imran Razzak
2025-08-10
Cardiovascular disease (CVD) prediction remains a tremendous challenge due to
its multifactorial etiology and global burden of morbidity and mortality.
Despite the growing availability of genomic and electrophysiological data,
extracting biologically meaningful insights from such high-dimensional, noisy,
and sparsely annotated datasets remains a non-trivial task. Recently, LLMs have
been applied effectively to predict structural variations in biological
sequences. In this work, we explore the potential of fine-tuned LLMs to predict
cardiac diseases and SNPs potentially leading to CVD risk using genetic markers
derived from high-throughput genomic profiling. We investigate the effect of
genetic patterns associated with cardiac conditions and evaluate how LLMs can
learn latent biological relationships from structured and semi-structured
genomic data obtained by mapping genetic aspects that are inherited from the
family tree. By framing the problem as a Chain of Thought (CoT) reasoning task,
the models are prompted to generate disease labels and articulate informed
clinical deductions across diverse patient profiles and phenotypes. The
findings highlight the promise of LLMs in contributing to early detection, risk
assessment, and ultimately, the advancement of personalized medicine in cardiac
care.
From Nodes to Narratives Explaining Graph Neural Networks with LLMs and Graph Context
Authors: Peyman Baghershahi, Gregoire Fournier, Pranav Nyati, Sourav Medya
2025-08-09
Graph Neural Networks (GNNs) have emerged as powerful tools for learning over
structured data, including text-attributed graphs, which are common in domains
such as citation networks, social platforms, and knowledge graphs. GNNs are not
inherently interpretable and thus, many explanation methods have been proposed.
However, existing explanation methods often struggle to generate interpretable,
fine-grained rationales, especially when node attributes include rich natural
language. In this work, we introduce LOGIC, a lightweight, post-hoc framework
that uses large language models (LLMs) to generate faithful and interpretable
explanations for GNN predictions. LOGIC projects GNN node embeddings into the
LLM embedding space and constructs hybrid prompts that interleave soft prompts
with textual inputs from the graph structure. This enables the LLM to reason
about GNN internal representations and produce natural language explanations
along with concise explanation subgraphs. Our experiments across four
real-world TAG datasets demonstrate that LOGIC achieves a favorable trade-off
between fidelity and sparsity, while significantly improving human-centric
metrics such as insightfulness. LOGIC sets a new direction for LLM-based
explainability in graph learning by aligning GNN internals with human
reasoning.
Large Language Model Evaluated Stand-alone Attention-Assisted Graph Neural Network with Spatial and Structural Information Interaction for Precise Endoscopic Image Segmentation
Authors: Juntong Fan, Shuyi Fan, Debesh Jha, Changsheng Fang, Tieyong Zeng, Hengyong Yu, Dayang Wang
2025-08-09
Accurate endoscopic image segmentation on the polyps is critical for early
colorectal cancer detection. However, this task remains challenging due to low
contrast with surrounding mucosa, specular highlights, and indistinct
boundaries. To address these challenges, we propose FOCUS-Med, which stands for
Fusion of spatial and structural graph with attentional context-aware polyp
segmentation in endoscopic medical imaging. FOCUS-Med integrates a Dual Graph
Convolutional Network (Dual-GCN) module to capture contextual spatial and
topological structural dependencies. This graph-based representation enables
the model to better distinguish polyps from background tissues by leveraging
topological cues and spatial connectivity, which are often obscured in raw
image intensities. It enhances the model's ability to preserve boundaries and
delineate complex shapes typical of polyps. In addition, a location-fused
stand-alone self-attention is employed to strengthen global context
integration. To bridge the semantic gap between encoder-decoder layers, we
incorporate a trainable weighted fast normalized fusion strategy for efficient
multi-scale aggregation. Notably, we are the first to introduce the use of a
Large Language Model (LLM) to provide detailed qualitative evaluations of
segmentation quality. Extensive experiments on public benchmarks demonstrate
that FOCUS-Med achieves state-of-the-art performance across five key metrics,
underscoring its effectiveness and clinical potential for AI-assisted
colonoscopy.
Vec2Summ Text Summarization via Probabilistic Sentence Embeddings
Authors: Mao Li, Fred Conrad, Johann Gagnon-Bartsch
2025-08-09
We propose Vec2Summ, a novel method for abstractive summarization that frames
the task as semantic compression. Vec2Summ represents a document collection
using a single mean vector in the semantic embedding space, capturing the
central meaning of the corpus. To reconstruct fluent summaries, we perform
embedding inversion -- decoding this mean vector into natural language using a
generative language model. To improve reconstruction quality and capture some
degree of topical variability, we introduce stochasticity by sampling from a
Gaussian distribution centered on the mean. This approach is loosely analogous
to bagging in ensemble learning, where controlled randomness encourages more
robust and varied outputs. Vec2Summ addresses key limitations of LLM-based
summarization methods. It avoids context-length constraints, enables
interpretable and controllable generation via semantic parameters, and scales
efficiently with corpus size -- requiring only a fixed number of parameters.
Empirical results show that Vec2Summ produces coherent summaries for topically
focused, order-invariant corpora, with performance comparable to direct
summarization in terms of thematic coverage and efficiency, albeit with less
fine-grained detail. These results underscore Vec2Summ's potential in settings
where scalability, semantic control, and corpus-level abstraction are
prioritized.
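The compression step is easy to state concretely: average the corpus embeddings, then sample from a Gaussian centered on that mean. The embedding-inversion model that turns sampled vectors back into text is outside this sketch, and the noise scale is an assumed hyperparameter:

    import numpy as np

    def vec2summ_vectors(doc_embeddings: np.ndarray, n_samples: int = 3,
                         scale: float = 0.1, seed: int = 0) -> np.ndarray:
        """Semantic-compression step of the approach described above.

        Summarizes a corpus by its mean embedding, then samples from a
        Gaussian centered on that mean to inject topical variability. Each
        sampled vector would be passed to an embedding-inversion decoder
        (e.g., a vec2text-style model) to produce fluent sentences.
        """
        mu = doc_embeddings.mean(axis=0)        # corpus-level summary vector
        cov = scale * np.eye(mu.shape[0])       # isotropic Gaussian around mu
        rng = np.random.default_rng(seed)
        return rng.multivariate_normal(mu, cov, size=n_samples)

    # Toy corpus of 100 sentence embeddings in a 32-dim space.
    rng = np.random.default_rng(1)
    embs = rng.standard_normal((100, 32)) + 2.0  # topically focused cluster
    samples = vec2summ_vectors(embs)
    print(samples.shape)                          # (3, 32) -> invert to text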
Narrative Memory in Machines Multi-Agent Arc Extraction in Serialized TV
Authors: Roberto Balestri, Guglielmo Pescatore
2025-08-09
Serialized television narratives present significant analytical challenges
due to their complex, temporally distributed storylines that necessitate
sophisticated information management. This paper introduces a multi-agent
system (MAS) designed to extract and analyze narrative arcs by implementing
principles of computational memory architectures. The system conceptualizes
narrative understanding through analogues of human memory: Large Language
Models (LLMs) provide a form of semantic memory for general narrative patterns,
while a vector database stores specific arc progressions as episodic memories.
A multi-agent workflow simulates working memory processes to integrate these
information types. Tested on the first season of Grey's Anatomy (ABC 2005-),
the MAS identifies three arc types: Anthology (self-contained), Soap
(relationship-focused), and Genre-Specific. These arcs and their episodic
developments are stored in a vector database, facilitating structured analysis
and semantic comparison. To bridge automation with critical interpretation, a
graphical interface enables human oversight and refinement of the system's
narrative memory. While demonstrating strong performance in identifying
Anthology Arcs and character entities, the system's reliance on textual
paratexts (episode summaries) revealed limitations in discerning overlapping
arcs and opaque dynamics, underscoring the challenges in computational memory
consolidation versus human holistic understanding. This memory-centric approach
highlights the potential of combining AI-driven memory processing with human
expertise. Beyond television, it offers promise for serialized written formats
where narrative is entirely text-based. Future work will focus on integrating
multimodal inputs to enrich episodic memory, refining memory integration
mechanisms within the MAS, and expanding testing across diverse genres.
SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency
Authors: Kwanhee Kyung, Sungmin Yun, Jung Ho Ahn
2025-08-09
Large Language Models (LLMs) applying Mixture-of-Experts (MoE) scale to
trillions of parameters but require vast memory, motivating a line of research
to offload expert weights from fast-but-small DRAM (HBM) to denser Flash SSDs.
While SSDs provide cost-effective capacity, their read energy per bit is
substantially higher than that of DRAM. This paper quantitatively analyzes the
energy implications of offloading MoE expert weights to SSDs during the
critical decoding stage of LLM inference. Our analysis, comparing SSD, CPU memory
(DDR), and HBM storage scenarios for models like DeepSeek-R1, reveals that
offloading MoE weights to current SSDs drastically increases
per-token-generation energy consumption (e.g., by up to ~12x compared to the
HBM baseline), dominating the total inference energy budget. Although
techniques like prefetching effectively hide access latency, they cannot
mitigate this fundamental energy penalty. We further explore future
technological scaling, finding that the inherent sparsity of MoE models could
potentially make SSDs energy-viable if Flash read energy improves
significantly, roughly by an order of magnitude.
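The energy argument can be made concrete with back-of-the-envelope arithmetic; the constants below are illustrative assumptions, not the paper's measurements:

    # Per-token weight-read energy for MoE expert offloading, by memory tier.
    READ_ENERGY_PJ_PER_BIT = {"HBM": 3.0, "DDR": 15.0, "SSD": 250.0}  # assumed

    active_params = 5e9        # assumed activated expert parameters per token
    bits_per_param = 8         # assumed 8-bit weights
    bits_read = active_params * bits_per_param

    for tier, pj_per_bit in READ_ENERGY_PJ_PER_BIT.items():
        joules = bits_read * pj_per_bit * 1e-12
        print(f"{tier}: {joules:.2f} J per generated token")

    # With these assumed constants, SSD reads cost ~83x more than HBM per
    # token, matching the qualitative conclusion that weight reads dominate
    # decode energy unless Flash read energy improves by roughly an order
    # of magnitude.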
Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models
Authors: Zhijun Tu, Hanting Chen, Siqi Liu, Chuanjian Liu, Jian Li, Jie Hu, Yunhe Wang
2025-08-09
1-bit quantization offers significant advantages in reducing storage and
computational costs. However, existing methods typically train 1-bit LLMs from
scratch, failing to fully leverage pre-trained models. This results in high
training costs and notable accuracy degradation. We identify that the large gap
between full precision and 1-bit representations makes direct adaptation
difficult. In this paper, we introduce a consistent progressive training for
both forward and backward, smoothly converting the floating-point weights into
the binarized ones. Additionally, we incorporate binary-aware initialization
and dual-scaling compensation to reduce the difficulty of progressive training
and improve the performance. Experimental results on LLMs of various sizes
demonstrate that our method outperforms existing approaches. Our results show
that high-performance 1-bit LLMs can be achieved using pre-trained models,
eliminating the need for expensive training from scratch.
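A toy version of progressive conversion, linearly mixing full-precision weights with their scaled sign as training advances; the paper's exact schedule and dual-scaling compensation are not given in the abstract, so this is an assumed variant:

    import numpy as np

    def progressive_binarize(w: np.ndarray, t: float) -> np.ndarray:
        """Smoothly interpolate full-precision weights toward binary ones.

        At schedule position t in [0, 1], weights are a convex mix of the
        original floats and their scaled sign, so both forward and backward
        passes see a gradually widening quantization gap instead of an
        abrupt jump. Dual scaling is simplified to a single scalar here.
        """
        scale = np.abs(w).mean()
        w_bin = scale * np.sign(w)
        return (1.0 - t) * w + t * w_bin

    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 4))
    for t in (0.0, 0.5, 1.0):                 # training-time schedule steps
        print(f"t={t}:", np.round(progressive_binarize(w, t)[0], 2))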
Fed MobiLLM Efficient Federated LLM Fine-Tuning over Heterogeneous Mobile Devices via Server Assisted Side-Tuning
Authors: Xingke Yang, Liang Li, Sicong Li, Liwei Guan, Hao Wang, Xiaoqi Qi, Jiang Liu, Xin Fu, Miao Pan
2025-08-09
Collaboratively fine-tuning (FT) large language models (LLMs) over
heterogeneous mobile devices fosters immense potential applications of
personalized intelligence. However, such a vision faces critical system
challenges. Conventional federated LLM FT approaches place prohibitive
computational and memory burdens on mobile hardware, and their synchronous
model aggregation protocols stall for slower devices. In this paper, we propose
Fed MobiLLM, a novel design to facilitate efficient federated LLM FT across
mobile devices with diverse computing/communication speeds and local model
architectures. In particular, Fed MobiLLM implements a pioneering
server-assisted federated side-tuning paradigm. Briefly, mobile devices perform
lightweight forward propagation computations on local data using their frozen
pre-scaled backbone LLMs, and then upload selected intermediate activations.
The server trains a shared side-network independently, eliminating client-side
backpropagation and enabling asynchronous updates. To bridge model
heterogeneity across different devices, we introduce an adaptive layer-wise
feature alignment method, which ensures consistent representations for
collaboratively tuning a shared side network. Extensive experimental results
demonstrate that Fed MobiLLM can maintain robust fine-tuning performance while
achieving extremely low on-device memory, with at least 95.2% reduction in
computation overhead, 93.2% reduction in communication costs and 5.1x faster
convergence compared to existing methods, validating its efficacy for practical
adaptation over heterogeneous mobile devices.
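The division of labor is straightforward to sketch: clients only run a frozen forward pass and upload activations, while the server takes gradient steps on a small shared side network. Everything below (shapes, loss, update rule) is an illustrative assumption, not Fed MobiLLM's actual protocol:

    import numpy as np

    rng = np.random.default_rng(0)

    def client_forward(x: np.ndarray, frozen_backbone: np.ndarray) -> np.ndarray:
        """On-device: forward pass through frozen weights; no backprop needed."""
        return np.maximum(x @ frozen_backbone, 0.0)  # upload these activations

    def server_step(acts: np.ndarray, labels: np.ndarray, side_w: np.ndarray,
                    lr: float = 0.1) -> np.ndarray:
        """Server-side: one gradient step on the shared side network."""
        preds = acts @ side_w
        grad = acts.T @ (preds - labels) / len(acts)  # MSE gradient
        return side_w - lr * grad

    backbone = rng.standard_normal((16, 32))  # frozen on each device
    side_w = np.zeros((32, 1))                # trainable, held by the server
    for _ in range(50):                        # asynchronous client uploads
        x = rng.standard_normal((8, 16))
        y = (x.sum(axis=1, keepdims=True) > 0).astype(float)
        side_w = server_step(client_forward(x, backbone), y, side_w)
    print("side network norm:", float(np.linalg.norm(side_w)))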
Pushing the Envelope of LLM Inference on AI-PC
Authors: Evangelos Georganas, Dhiraj Kalamkar, Alexander Heinecke
2025-08-08
The advent of ultra-low-bit models (1/1.58/2-bit), which match the
perplexity and end-task performance of their full-precision counterparts using
the same model size, is ushering in a new era of LLM inference for
resource-constrained environments such as edge devices and AI PCs. While these
advances promise models that are more cost-effective in terms of
latency, memory, throughput, and energy consumption, the computational
efficiency of state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp)
used to deploy them remains underexplored. In this work, we take a bottom-up
approach: we first design and implement 1-bit and 2-bit microkernels optimized
for modern CPUs, achieving peak computational efficiency across a variety of
CPU platforms. We integrate these microkernels into a state-of-the-art LLM
inference framework, namely PyTorch-TPP, and present end-to-end inference
results with 2-bit models that outperform the current SOTA runtime bitnet.cpp
by up to 2.2x, and deliver up to 7x speedup compared to the 16-bit model
inference. Our optimized runtime advances the state of LLM inference on AI PCs
and edge devices, paving the way for efficient deployment of ultra-low-bit
models.
CISO Species Distribution Modeling Conditioned on Incomplete Species Observations
Authors: Hager Radi Abdelwahed, Mélisande Teng, Robin Zbinden, Laura Pollock, Hugo Larochelle, Devis Tuia, David Rolnick
2025-08-08
Species distribution models (SDMs) are widely used to predict species'
geographic distributions, as critical tools for ecological research and
conservation planning. Typically, SDMs relate species occurrences to
environmental variables representing abiotic factors, such as temperature,
precipitation, and soil properties. However, species distributions are also
strongly influenced by biotic interactions with other species, which are often
overlooked. While some methods partially address this limitation by
incorporating biotic interactions, they often assume symmetrical pairwise
relationships between species and require consistent co-occurrence data. In
practice, species observations are sparse, and the availability of information
about the presence or absence of other species varies significantly across
locations. To address these challenges, we propose CISO, a deep learning-based
method for species distribution modeling Conditioned on Incomplete Species
Observations. CISO enables predictions to be conditioned on a flexible number
of species observations alongside environmental variables, accommodating the
variability and incompleteness of available biotic data. We demonstrate our
approach using three datasets representing different species groups: sPlotOpen
for plants, SatBird for birds, and a new dataset, SatButterfly, for
butterflies. Our results show that including partial biotic information
improves predictive performance on spatially separate test sets. When
conditioned on a subset of species within the same dataset, CISO outperforms
alternative methods in predicting the distribution of the remaining species.
Furthermore, we show that combining observations from multiple datasets can
improve performance. CISO is a promising ecological tool, capable of
incorporating incomplete biotic information and identifying potential
interactions between species from disparate taxa.