2025-08-29
Table of Contents
- Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
- Symphony: A Decentralized Multi-Agent Framework for Scalable Collective Intelligence
- Optimal Remainder Estimates in the Quantization of Complex Projective Spaces
- Secure Multi-LLM Agentic AI and Agentification for Edge General Intelligence by Zero-Trust: A Survey
- SoK: Large Language Model Copyright Auditing via Fingerprinting
- The Return of Structural Handwritten Mathematical Expression Recognition
- Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval
- Continuously Steering LLMs' Sensitivity to Contextual Knowledge with Proxy Models
- Hybrid Decoding: Rapid Pass and Selective Detailed Correction for Sequence Models
- Survey of Specialized Large Language Model
- LFD: Layer Fused Decoding to Exploit External Knowledge in Retrieval-Augmented Generation
- ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding
- Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
- Towards 6G Intelligence: The Role of Generative AI in Future Wireless Networks
- Quantized but Deceptive? A Multi-Dimensional Truthfulness Evaluation of Quantized LLMs
- Even Heads Fix Odd Errors: Mechanistic Discovery and Surgical Repair in Transformer Attention
- One Joke to Rule them All? On the (Im)possibility of Generalizing Humor
- A Theory of Goal-Oriented Medium Access Protocol Design and Distributed Bandit Learning
- Random forest-based out-of-distribution detection for robust lung cancer segmentation
- APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration
- Federated Fine-Tuning of Sparsely-Activated Large Language Models on Resource-Constrained Devices
- A Concurrent Modular Agent Framework for Autonomous LLM Agents
- Sense of Self and Time in Borderline Personality. A Comparative Robustness Study with Generative AI
- RoofSeg: An edge-aware transformer-based network for end-to-end roof plane segmentation
- LLMs in the SOC: An Empirical Study of Human-AI Collaboration in Security Operations Centres
- Enhancing Model Privacy in Federated Learning with Random Masking and Quantization
- pyFAST: A Modular PyTorch Framework for Time Series Modeling with Multi-source and Sparse Data
- ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
- STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning
- A Survey on Cloud-Edge-Terminal Collaborative Intelligence in AIoT Networks
- Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction
- UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
- Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics
- Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vectorized Drawings
- Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
- MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use
- Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models
- History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL
- Strata: Hierarchical Context Caching for Long Context Language Model Serving
- LLM-Driven Intrinsic Motivation for Sparse Reward Reinforcement Learning
- Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning
- Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails
- DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction
- Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel
- Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios
- AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models
- HLLM-Creator: Hierarchical LLM-based Personalized Creative Generation
- The AI Data Scientist
- A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
- ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models
- LexSemBridge: Fine-Grained Dense Representation Enhancement through Token-Aware Embedding Augmentation
- Speculative Safety-Aware Decoding
- CMFDNet: Cross-Mamba and Feature Discovery Network for Polyp Segmentation
- CATformer: Contrastive Adversarial Transformer for Image Super-Resolution
- CoCoA: Confidence and Context-Aware Adaptive Decoding for Resolving Knowledge Conflicts in Large Language Models
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Authors: Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo
2025-08-27
Vision-Language-Action (VLA) models adapt large vision-language backbones to
map images and instructions to robot actions. However, prevailing VLA decoders
either generate actions autoregressively in a fixed left-to-right order or
attach continuous diffusion or flow matching heads outside the backbone,
demanding specialized training and iterative sampling that hinder a unified,
scalable architecture. We present Discrete Diffusion VLA, a single-transformer
policy that models discretized action chunks with discrete diffusion and is
trained with the same cross-entropy objective as the VLM backbone. The design
retains diffusion's progressive refinement paradigm while remaining natively
compatible with the discrete token interface of VLMs. Our method achieves an
adaptive decoding order that resolves easy action elements before harder ones
and uses secondary remasking to revisit uncertain predictions across refinement
rounds, which improves consistency and enables robust error correction. This
unified decoder preserves pretrained vision-language priors, supports parallel
decoding, breaks the autoregressive bottleneck, and reduces the number of
function evaluations. Discrete Diffusion VLA achieves 96.3% avg. SR on LIBERO,
71.2% visual matching on SimplerEnv Fractal and 49.3% overall on SimplerEnv
Bridge, improving over both autoregressive and continuous diffusion baselines.
These findings indicate that the discrete-diffusion action decoder supports
precise action modeling and consistent training, laying the groundwork for
scaling VLA to larger models and datasets.
Symphony: A Decentralized Multi-Agent Framework for Scalable Collective Intelligence
Authors: Ji Wang, Kashing Chen, Xinyuan Song, Ke Zhang, Lynn Ai, Eric Yang, Bill Shi
2025-08-27
Most existing Large Language Model (LLM)-based agent frameworks rely on
centralized orchestration, incurring high deployment costs, rigid communication
topologies, and limited adaptability. To address these challenges, we introduce
Symphony, a decentralized multi-agent system which enables lightweight LLMs on
consumer-grade GPUs to coordinate. Symphony introduces three key mechanisms:
(1) a decentralized ledger that records capabilities, (2) a Beacon-selection
protocol for dynamic task allocation, and (3) weighted result voting based on
CoTs. This design forms a privacy-preserving, scalable, and fault-tolerant
orchestration with low overhead. Empirically, Symphony outperforms existing
baselines on reasoning benchmarks, achieving substantial accuracy gains and
demonstrating robustness across models of varying capacities.
Optimal Remainder Estimates in the Quantization of Complex Projective Spaces
Authors: Tommaso Aschieri, Błażej Ruba, Jan Philip Solovej
2025-08-27
We study Berezin-Toeplitz quantization of complex projective spaces
and obtain full asymptotic expansions of the Berezin transform and of products
of Toeplitz operators. In each case, the remainder is controlled by the next
term of the expansion, either through a positivity-preserving transformation or
via an operator inequality. This leads to bounds which are optimal in terms of
the required regularity and feature sharp or asymptotically sharp constants.
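For orientation, semiclassical expansions of this kind generically take the following form, where the $C_k$ are bidifferential operators; this display is standard background with convention-dependent normalizations, not the paper's precise statement:

$$T^{(N)}_f \, T^{(N)}_g \;=\; \sum_{k=0}^{M-1} N^{-k} \, T^{(N)}_{C_k(f,g)} \;+\; O(N^{-M}), \qquad C_0(f,g) = fg,$$

so that at first order the commutator recovers the Poisson bracket, $[T^{(N)}_f, T^{(N)}_g] = \tfrac{i}{N}\, T^{(N)}_{\{f,g\}} + O(N^{-2})$, up to sign conventions. The paper's contribution concerns controlling such remainders by the next term of the expansion with optimal regularity and sharp constants.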
Secure Multi-LLM Agentic AI and Agentification for Edge General Intelligence by Zero-Trust: A Survey
Authors: Yinqiu Liu, Ruichen Zhang, Haoxiang Luo, Yijing Lin, Geng Sun, Dusit Niyato, Hongyang Du, Zehui Xiong, Yonggang Wen, Abbas Jamalipour, Dong In Kim, Ping Zhang
2025-08-27
Agentification serves as a critical enabler of Edge General Intelligence
(EGI), transforming massive edge devices into cognitive agents through
integrating Large Language Models (LLMs) and perception, reasoning, and acting
modules. These agents collaborate across heterogeneous edge infrastructures,
forming multi-LLM agentic AI systems that leverage collective intelligence and
specialized capabilities to tackle complex, multi-step tasks. However, the
collaborative nature of multi-LLM systems introduces critical security
vulnerabilities, including insecure inter-LLM communications, expanded attack
surfaces, and cross-domain data leakage that traditional perimeter-based
security cannot adequately address. To this end, this survey introduces
zero-trust security for multi-LLM systems in EGI, a paradigmatic shift following
the ``never trust, always verify'' principle. We begin by systematically
analyzing the security risks of multi-LLM systems within EGI contexts.
Subsequently, we present the vision of a zero-trust multi-LLM framework in EGI.
We then survey key technical progress to facilitate zero-trust multi-LLM
systems in EGI. In particular, we categorize zero-trust security mechanisms
into model- and system-level approaches: the former includes strong
identification, context-aware access control, etc., while the latter includes
proactive maintenance, blockchain-based management, etc. Finally, we identify
critical research directions. This survey serves as the first systematic
treatment of zero-trust applied to multi-LLM systems, providing both
theoretical foundations and practical strategies.
SoK: Large Language Model Copyright Auditing via Fingerprinting
Authors: Shuo Shao, Yiming Li, Yu He, Hongwei Yao, Wenyuan Yang, Dacheng Tao, Zhan Qin
2025-08-27
The broad capabilities and substantial resources required to train Large
Language Models (LLMs) make them valuable intellectual property, yet they
remain vulnerable to copyright infringement, such as unauthorized use and model
theft. LLM fingerprinting, a non-intrusive technique that extracts and compares
distinctive features from LLMs to identify infringements, offers a promising
solution to copyright auditing. However, its reliability remains uncertain due
to the prevalence of diverse model modifications and the lack of standardized
evaluation. In this SoK, we present the first comprehensive study of LLM
fingerprinting. We introduce a unified framework and formal taxonomy that
categorizes existing methods into white-box and black-box approaches, providing
a structured overview of the state of the art. We further propose LeaFBench,
the first systematic benchmark for evaluating LLM fingerprinting under
realistic deployment scenarios. Built upon mainstream foundation models and
comprising 149 distinct model instances, LeaFBench integrates 13 representative
post-development techniques, spanning both parameter-altering methods (e.g.,
fine-tuning, quantization) and parameter-independent mechanisms (e.g., system
prompts, RAG). Extensive experiments on LeaFBench reveal the strengths and
weaknesses of existing methods, thereby outlining future research directions
and critical open problems in this emerging field. The code is available at
https://github.com/shaoshuo-ss/LeaFBench.
The Return of Structural Handwritten Mathematical Expression Recognition
Authors: Jakob Seitz, Tobias Lengfeld, Radu Timofte
2025-08-27
Handwritten Mathematical Expression Recognition is foundational for
educational technologies, enabling applications like digital note-taking and
automated grading. While modern encoder-decoder architectures with large
language models excel at LaTeX generation, they lack explicit symbol-to-trace
alignment, a critical limitation for error analysis, interpretability, and
spatially aware interactive applications requiring selective content updates.
This paper introduces a structural recognition approach with two innovations:
(1) an automatic annotation system that uses a neural network to map LaTeX
equations to raw traces, automatically generating annotations for symbol
segmentation, classification, and spatial relations, and (2) a modular
structural recognition system that independently optimizes segmentation,
classification, and relation prediction. By leveraging a dataset enriched with
structural annotations from our auto-labeling system, the proposed recognition
system combines graph-based trace sorting, a hybrid convolutional-recurrent
network, and transformer-based correction to achieve competitive performance on
the CROHME-2023 benchmark. Crucially, our structural recognition system
generates a complete graph structure that directly links handwritten traces to
predicted symbols, enabling transparent error analysis and interpretable
outputs.
Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval
Authors: Wenhao Li, Yuxin Zhang, Gen Luo, Haiyuan Wan, Ziyang Gong, Fei Chao, Rongrong Ji
2025-08-27
Reducing the key-value (KV) cache burden in Large Language Models (LLMs)
significantly accelerates inference. Dynamically selecting critical KV cache
entries during decoding helps maintain performance. Existing methods use random
linear hashing to identify important tokens, but this approach is inefficient
due to the orthogonal distribution of queries and keys within two narrow cones
in LLMs. We introduce Spotlight Attention, a novel method that employs
non-linear hashing functions to optimize the embedding distribution of queries
and keys, enhancing coding efficiency and robustness. We also developed a
lightweight, stable training framework using a Bradley-Terry ranking-based
loss, enabling optimization of the non-linear hashing module on GPUs with 16GB
memory in 8 hours. Experimental results show that Spotlight Attention
drastically improves retrieval precision while shortening the length of the
hash code at least 5x compared to traditional linear hashing. Finally, we
exploit the computational advantages of bitwise operations by implementing
specialized CUDA kernels, achieving hashing retrieval for 512K tokens in under
100 microseconds on a single A100 GPU, with end-to-end throughput up to 3x
higher than vanilla decoding.
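As a rough illustration of hashing-based KV retrieval (a sketch under assumptions, not the paper's implementation: the MLP hasher, code width, and scoring below are placeholder choices, and training through sign() would additionally need a straight-through estimator with the ranking loss):

```python
import torch

class NonLinearHash(torch.nn.Module):
    """Small MLP mapping queries/keys to binary codes (illustrative)."""
    def __init__(self, dim: int, code_bits: int = 64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.GELU(),
            torch.nn.Linear(dim, code_bits),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # sign() has zero gradient; training would use a straight-through
        # estimator together with a Bradley-Terry-style ranking loss.
        return torch.sign(self.net(x))

@torch.no_grad()
def select_topk(query, keys, hasher, k=256):
    """Indices of the k cached keys most similar to the query in code space."""
    q_code = hasher(query)          # (bits,)
    k_codes = hasher(keys)          # (num_cached, bits)
    sim = k_codes @ q_code          # +/-1 dot product == bits - 2 * Hamming
    return sim.topk(min(k, keys.shape[0])).indices
```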
Continuously Steering LLMs' Sensitivity to Contextual Knowledge with Proxy Models
Authors: Yilin Wang, Heng Wang, Yuyang Bai, Minnan Luo
2025-08-27
In Large Language Model (LLM) generation, there exist knowledge conflicts:
scenarios where parametric knowledge contradicts knowledge provided in the
context. Previous works studied tuning, decoding algorithms, or locating and
editing context-aware neurons to adapt LLMs to be faithful to new contextual
knowledge. However, they are usually inefficient or ineffective for large
models, not workable for black-box models, or unable to continuously adjust
LLMs' sensitivity to the knowledge provided in the context. To mitigate these
problems, we propose CSKS (Continuously Steering Knowledge Sensitivity), a
simple framework that can steer LLMs' sensitivity to contextual knowledge
continuously at a lightweight cost. Specifically, we tune two small LMs (i.e.,
proxy models) and use the difference in their output distributions to shift the
original distribution of an LLM without modifying the LLM's weights. In the
evaluation process, we not only design synthetic data and fine-grained metrics
to measure models' sensitivity to contextual knowledge but also use a real
conflict dataset to validate CSKS's practical efficacy. Extensive experiments
demonstrate that our framework achieves continuous and precise control over
LLMs' sensitivity to contextual knowledge, enabling both increased and reduced
sensitivity, thereby allowing LLMs to flexibly prioritize either contextual or
parametric knowledge as needed. Our data and code are available at
https://github.com/OliveJuiceLin/CSKS.
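A minimal sketch of the proxy-model logit arithmetic described above, assuming all three models share one tokenizer and expose Hugging Face-style `.logits`; `alpha` is an illustrative steering knob, not a value from the paper:

```python
import torch

@torch.no_grad()
def steered_next_logits(llm, proxy_tuned, proxy_base, input_ids, alpha=1.0):
    """Shift a frozen LLM's next-token distribution via two small proxies."""
    l_llm = llm(input_ids).logits[:, -1, :]
    l_tuned = proxy_tuned(input_ids).logits[:, -1, :]   # context-faithful proxy
    l_base = proxy_base(input_ids).logits[:, -1, :]     # untuned proxy
    # Add the proxies' disagreement to the LLM's logits; alpha > 0 raises
    # context sensitivity, alpha < 0 lowers it, without touching LLM weights.
    return l_llm + alpha * (l_tuned - l_base)
```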
Hybrid Decoding: Rapid Pass and Selective Detailed Correction for Sequence Models
Authors: Yunkyu Lim, Jihwan Park, Hyung Yong Kim, Hanbin Lee, Byeong-Yeol Kim
2025-08-27
Recently, Transformer-based encoder-decoder models have demonstrated strong
performance in multilingual speech recognition. However, the decoder's
autoregressive nature and large size introduce significant bottlenecks during
inference. Additionally, although rare, repetition can occur and negatively
affect recognition accuracy. To tackle these challenges, we propose a novel
Hybrid Decoding approach that both accelerates inference and alleviates the
issue of repetition. Our method extends the encoder-decoder architecture by
attaching a lightweight, fast decoder to the pretrained encoder. During
inference, the fast decoder rapidly generates an output, which is then verified
and, if necessary, selectively corrected by the Transformer decoder. This
results in faster decoding and improved robustness against repetitive errors.
Experiments on the LibriSpeech and GigaSpeech test sets indicate that, with
fine-tuning limited to the added decoder, our method achieves word error rates
comparable to or better than the baseline, while more than doubling the
inference speed.
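A conceptual sketch of the rapid-pass-then-correct idea follows; the decoder interfaces (`greedy_decode`, `score`) and the confidence threshold are assumptions for illustration, not the paper's API:

```python
import torch

@torch.no_grad()
def hybrid_decode(encoder_out, fast_decoder, big_decoder, threshold=0.9):
    """Draft with the fast decoder, then let the big decoder fix weak spots."""
    draft = fast_decoder.greedy_decode(encoder_out)   # cheap hypothesis, (T,)
    probs = big_decoder.score(encoder_out, draft)     # per-position probs, (T, V)
    conf = probs[torch.arange(len(draft)), draft]     # big model's agreement
    for t in torch.nonzero(conf < threshold).flatten():
        draft[t] = probs[t].argmax()                  # selective correction
    return draft
```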
Survey of Specialized Large Language Model
Authors: Chenghan Yang, Ruiyu Zhao, Yang Liu, Ling Jiang
2025-08-27
The rapid evolution of specialized large language models (LLMs) has
transitioned from simple domain adaptation to sophisticated native
architectures, marking a paradigm shift in AI development. This survey
systematically examines this progression across healthcare, finance, legal, and
technical domains. Beyond the wide use of specialized LLMs, technical
breakthroughs, such as the emergence of domain-native designs beyond
fine-tuning, a growing emphasis on parameter efficiency through sparse
computation and quantization, and the increasing integration of multimodal
capabilities, are being applied to recent LLM agents. Our analysis reveals how
these innovations address fundamental limitations of general-purpose LLMs in
professional applications, with specialized models consistently delivering
performance gains on domain-specific benchmarks. The survey further highlights
the implications for the e-commerce field to fill gaps in that area.
LFD: Layer Fused Decoding to Exploit External Knowledge in Retrieval-Augmented Generation
Authors: Yang Sun, Lixin Zou, Dan Luo, Zhiyong Xie, Long Zhang, Liming Dong, Yunwei Zhao, Xixun Lin, Yanxiong Lu, Chenliang Li
2025-08-27
Retrieval-augmented generation (RAG) incorporates external knowledge into
large language models (LLMs), improving their adaptability to downstream tasks
and enabling information updates. Surprisingly, recent empirical evidence
demonstrates that injecting noise into retrieved relevant documents
paradoxically facilitates exploitation of external knowledge and improves
generation quality. Although counterintuitive and challenging to apply in
practice, this phenomenon enables granular control and rigorous analysis of how
LLMs integrate external knowledge. Therefore, in this paper, we intervene on
noise injection and establish a layer-specific functional demarcation within
the LLM: shallow layers specialize in local context modeling, intermediate
layers focus on integrating long-range external factual knowledge, and deeper
layers primarily rely on parametric internal knowledge. Building on this
insight, we propose Layer Fused Decoding (LFD), a simple decoding strategy that
directly combines representations from an intermediate layer with final-layer
outputs to fully exploit the external factual knowledge. To identify the
optimal intermediate layer, we introduce an internal knowledge score (IKS)
criterion that selects the layer with the lowest IKS value in the latter half
of layers. Experimental results across multiple benchmarks demonstrate that LFD
helps RAG systems more effectively surface retrieved context knowledge with
minimal cost.
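A minimal sketch of the fusion step, assuming a Hugging Face-style causal LM that exposes hidden states and an `lm_head`; the additive fusion rule and the fixed `fuse_layer` are illustrative stand-ins for the paper's exact combination and IKS-selected layer:

```python
import torch

@torch.no_grad()
def lfd_next_token(model, input_ids, fuse_layer: int, weight: float = 0.5):
    """Greedy next token from final-layer states fused with one mid layer."""
    out = model(input_ids, output_hidden_states=True)
    h_mid = out.hidden_states[fuse_layer][:, -1]   # intermediate-layer state
    h_final = out.hidden_states[-1][:, -1]         # final-layer state
    fused = h_final + weight * h_mid               # fuse before the LM head
    return model.lm_head(fused).argmax(dim=-1)
```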
ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding
Authors: Sining Zhoubian, Dan Zhang, Yuxiao Dong, Jie Tang
2025-08-27
With respect to improving the reasoning accuracy of LLMs, the representative
reinforcement learning (RL) method GRPO faces failure due to insignificant
reward variance, while verification methods based on process reward models
(PRMs) suffer from difficulties with training data acquisition and verification
effectiveness. To tackle these problems, this paper introduces ReST-RL, a
unified LLM RL paradigm that significantly improves the LLM's code reasoning
ability by combining an improved GRPO algorithm with a meticulously designed
test-time decoding method assisted by a value model (VM). As the first stage of
policy reinforcement, ReST-GRPO adopts an optimized ReST algorithm to filter
and assemble high-value training data, increasing the reward variance of GRPO
sampling and thus improving the effectiveness and efficiency of training. After
the basic reasoning ability of the LLM policy has been improved, we further
propose a test-time decoding optimization method called VM-MCTS. Through
Monte-Carlo Tree Search (MCTS), we collect accurate value targets with no
annotation required, on which VM training is based. When decoding, the VM is
deployed by an adapted MCTS algorithm to provide precise process signals as
well as verification scores, assisting the LLM policy to achieve high reasoning
accuracy. We validate the effectiveness of the proposed RL paradigm through
extensive experiments on coding problems. Upon comparison, our approach
significantly outperforms other reinforcement training baselines (e.g., naive
GRPO and ReST-DPO), as well as decoding and verification baselines (e.g.,
PRM-BoN and ORM-MCTS), on well-known coding benchmarks of various levels (e.g.,
APPS, BigCodeBench, and HumanEval), indicating its power to strengthen the
reasoning ability of LLM policies. Code for our project can be found at
https://github.com/THUDM/ReST-RL.
Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
Authors: Rongzhi Li, Ruogu Du, Zefang Chu, Sida Zhao, Chunlei Han, Zuocheng Shi, Yiwen Shao, Huanle Han, Long Huang, Zherui Liu, Shufan Liu
2025-08-27
Serving Large Language Models (LLMs) is a GPU-intensive task where
traditional autoscalers fall short, particularly for modern Prefill-Decode
(P/D) disaggregated architectures. This architectural shift, while powerful,
introduces significant operational challenges, including inefficient use of
heterogeneous hardware, network bottlenecks, and critical imbalances between
prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling
framework that addresses the core challenges of P/D disaggregated serving.
HeteroScale combines a topology-aware scheduler that adapts to heterogeneous
hardware and network constraints with a novel metric-driven policy derived from
the first large-scale empirical study of autoscaling signals in production. By
leveraging a single, robust metric to jointly scale prefill and decode pools,
HeteroScale maintains architectural balance while ensuring efficient, adaptive
resource management. Deployed in a massive production environment on tens of
thousands of GPUs, HeteroScale has proven its effectiveness, increasing average
GPU utilization by a significant 26.6 percentage points and saving hundreds of
thousands of GPU-hours daily, all while upholding stringent service level
objectives.
Towards 6G Intelligence: The Role of Generative AI in Future Wireless Networks
Authors: Muhammad Ahmed Mohsin, Junaid Ahmad, Muhammad Hamza Nawaz, Muhammad Ali Jamshed
2025-08-27
Ambient intelligence (AmI) is a computing paradigm in which physical
environments are embedded with sensing, computation, and communication so they
can perceive people and context, decide appropriate actions, and respond
autonomously. Realizing AmI at global scale requires sixth generation (6G)
wireless networks with capabilities for real time perception, reasoning, and
action aligned with human behavior and mobility patterns. We argue that
Generative Artificial Intelligence (GenAI) is the creative core of such
environments. Unlike traditional AI, GenAI learns data distributions and can
generate realistic samples, making it well suited to close key AmI gaps,
including generating synthetic sensor and channel data in under observed areas,
translating user intent into compact, semantic messages, predicting future
network conditions for proactive control, and updating digital twins without
compromising privacy.
This chapter reviews foundational GenAI models, GANs, VAEs, diffusion models,
and generative LLMs, and connects them to practical AmI use cases, including
spectrum sharing, ultra reliable low latency communication, intelligent
security, and context aware digital twins. We also examine how 6G
enablers, such as edge and fog computing, IoT device swarms, intelligent
reflecting surfaces (IRS), and non terrestrial networks, can host or accelerate
distributed GenAI. Finally, we outline open challenges in energy efficient on
device training, trustworthy synthetic data, federated generative learning, and
AmI specific standardization. We show that GenAI is not a peripheral addition,
but a foundational element for transforming 6G from a faster network into an
ambient intelligent ecosystem.
Quantized but Deceptive? A Multi-Dimensional Truthfulness Evaluation of Quantized LLMs
Authors: Yao Fu, Xianxuan Long, Runchao Li, Haotian Yu, Mu Sheng, Xiaotian Han, Yu Yin, Pan Li
2025-08-26
Quantization enables efficient deployment of large language models (LLMs) in
resource-constrained environments by significantly reducing memory and
computation costs. While quantized LLMs often maintain performance on
perplexity and zero-shot tasks, their impact on truthfulness, whether they
generate truthful or deceptive responses, remains largely unexplored. In this
work, we introduce TruthfulnessEval, a comprehensive evaluation framework for
assessing the truthfulness of quantized LLMs across three dimensions: (1)
Truthfulness on Logical Reasoning; (2) Truthfulness on Common Sense; and (3)
Truthfulness on Imitative Falsehoods. Using this framework, we examine
mainstream quantization techniques (ranging from 4-bit to extreme 2-bit) across
several open-source LLMs. Surprisingly, we find that while quantized models
retain internally truthful representations, they are more susceptible to
producing false outputs under misleading prompts. To probe this vulnerability,
we test 15 rephrased variants of "honest", "neutral" and "deceptive" prompts
and observe that "deceptive" prompts can override truth-consistent behavior,
whereas "honest" and "neutral" prompts maintain stable outputs. Further, via
layer-wise probing and PCA visualizations, we reveal that quantized models
"know" the truth internally yet still produce false outputs when guided by
"deceptive" prompts. Our findings provide insights into future designs of
quantization-aware alignment and truthfulness interventions.
Even Heads Fix Odd Errors: Mechanistic Discovery and Surgical Repair in Transformer Attention
Authors: Gustavo Sandoval
2025-08-26
We present a mechanistic case study of a format-dependent reasoning failure
in Llama-3.1-8B-Instruct, where the model incorrectly judges "9.11" as larger
than "9.8" in chat or Q&A formats, but answers correctly in simple format.
Through systematic intervention, we discover that LLMs implement even/odd
attention head specialization: even-indexed heads handle numerical comparison,
while odd heads serve incompatible functions. The bug requires exactly 8 even
heads at Layer 10 for perfect repair. Any combination of 8+ even heads
succeeds, while 7 or fewer completely fails, revealing sharp computational
thresholds with perfect redundancy among the 16 even heads. SAE analysis
reveals the mechanism: format representations separate (10% feature overlap at
Layer 7), then re-entangle with different weightings (80% feature overlap at
Layer 10), with specific features showing 1.5x amplification in failing
formats. We achieve perfect repair using only 25% of attention heads and
identify a 60% pattern replacement threshold, demonstrating that apparent
full-module requirements hide sophisticated substructure, with implications for
interpretability and efficiency. All of our code is available at
https://github.com/gussand/surgeon.
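The flavor of such head-level interventions can be sketched with a forward pre-hook that zeroes selected heads' contributions at one layer; the module path assumes a Llama-style Hugging Face model, and zero-ablation is a simplification of the paper's pattern-replacement repair:

```python
import torch

def ablate_heads(model, layer: int, heads: list[int], head_dim: int):
    """Zero selected attention heads' outputs entering o_proj at one layer."""
    o_proj = model.model.layers[layer].self_attn.o_proj  # assumed module path

    def pre_hook(module, args):
        (x,) = args                     # (batch, seq, num_heads * head_dim)
        x = x.clone()
        for h in heads:
            x[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (x,)

    return o_proj.register_forward_pre_hook(pre_hook)  # handle.remove() undoes it
```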
One Joke to Rule them All? On the (Im)possibility of Generalizing Humor
Authors: Mor Turgeman, Chen Shani, Dafna Shahaf
2025-08-26
Humor is a broad and complex form of communication that remains challenging
for machines. Despite its broadness, most existing research on computational
humor has traditionally focused on modeling a specific type of humor. In this
work, we wish to understand whether competence on one or more specific humor
tasks confers any ability to transfer to novel, unseen types; in other words,
is this fragmentation inevitable? This question is especially timely as new
humor types continuously emerge in online and social media contexts (e.g.,
memes, anti-humor, AI fails). If Large Language Models (LLMs) are to keep up
with this evolving landscape, they must be able to generalize across humor
types by capturing deeper, transferable mechanisms. To investigate this, we
conduct a series of transfer learning experiments across four datasets,
representing different humor tasks. We train LLMs under varied diversity
settings (1-3 datasets in training, testing on a novel task). Experiments
reveal that models are capable of some transfer, and can reach up to 75%
accuracy on unseen datasets; training on diverse sources improves
transferability (1.88-4.05%) with minimal-to-no drop in in-domain performance.
Further analysis suggests relations between humor types, with Dad Jokes
surprisingly emerging as the best enabler of transfer (though it is difficult
to transfer to). We release data and code.
A Theory of Goal-Oriented Medium Access Protocol Design and Distributed Bandit Learning
Authors: Federico Chiariotti, Andrea Zanella
2025-08-26
The Goal-oriented Communication (GoC) paradigm breaks the separation between
communication and the content of the data, tailoring communication decisions
to the specific needs of the receiver and targeting application performance.
While recent studies show impressive encoding performance in point-to-point
scenarios, the multi-node distributed scenario is still almost unexplored.
Moreover, the few studies that investigate it consider a centralized
collision-free approach, where a central scheduler decides the transmission
order of the nodes. In this work, we address the Goal-oriented Multiple Access
(GoMA) problem, in which multiple intelligent agents must coordinate to share a
wireless channel and avoid mutual interference. We propose a theoretical
framework for the analysis and optimization of distributed GoMA, serving as a
first step towards its complete characterization. We prove that the problem is
non-convex and may admit multiple Nash Equilibrium (NE) solutions. We provide a
characterization of each node's best response to others' strategies and propose
an optimization approach that provably reaches one such NE, outperforming
centralized approaches by up to 100% while also reducing energy consumption. We
also design a distributed learning algorithm that operates with limited
feedback and no prior knowledge.
Random forest-based out-of-distribution detection for robust lung cancer segmentation
Authors: Aneesh Rangnekar, Harini Veeraraghavan
2025-08-26
Accurate detection and segmentation of cancerous lesions from computed
tomography (CT) scans is essential for automated treatment planning and cancer
treatment response assessment. Transformer-based models with self-supervised
pretraining can produce reliably accurate segmentation from in-distribution
(ID) data but degrade when applied to out-of-distribution (OOD) datasets. We
address this challenge with RF-Deep, a random forest classifier that utilizes
deep features from a pretrained encoder of the segmentation model
to detect OOD scans and enhance segmentation reliability. The segmentation
model comprises a Swin Transformer encoder, pretrained with masked image
modeling (SimMIM) on 10,432 unlabeled 3D CT scans covering cancerous and
non-cancerous conditions, with a convolutional decoder, trained to segment lung
cancers in 317 3D scans. Independent testing was performed on 603 3D CT scans
from public
datasets that included one ID dataset and four OOD datasets comprising chest
CTs with pulmonary embolism (PE) and COVID-19, and abdominal CTs with kidney
cancers and healthy volunteers. RF-Deep detected OOD cases with a FPR95 of
18.26%, 27.66%, and less than 0.1% on PE, COVID-19, and abdominal CTs,
consistently outperforming established OOD approaches. The RF-Deep classifier
provides a simple and effective approach to enhance reliability of cancer
segmentation in ID and OOD scenarios.
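The classifier side of such a pipeline is straightforward; a sketch with scikit-learn, where the pooled encoder features are assumed to be precomputed and the file names are placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feats_id = np.load("features_id.npy")    # (n_id, d) pooled encoder features
feats_ood = np.load("features_ood.npy")  # (n_ood, d)

# Binary labels: 0 = in-distribution, 1 = out-of-distribution.
X = np.vstack([feats_id, feats_ood])
y = np.concatenate([np.zeros(len(feats_id)), np.ones(len(feats_ood))])

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ood_scores = rf.predict_proba(X[:5])[:, 1]  # probability each scan is OOD
```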
APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration
Authors: Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang
2025-08-26
Large language models (LLMs) have revolutionized AI applications, yet their
enormous computational demands severely limit deployment and real-time
performance. Quantization methods can help reduce computational costs; however,
attaining the extreme efficiency associated with ultra-low-bit quantized LLMs
at arbitrary precision presents challenges on GPUs. This is primarily due to
the limited support for GPU Tensor Cores, inefficient memory management, and
inflexible kernel optimizations. To tackle these challenges, we propose a
comprehensive acceleration scheme for arbitrary-precision LLMs, namely APT-LLM.
Firstly, we introduce a novel data format, bipolar-INT, which allows for
efficient and lossless conversion with signed INT, while also being more
conducive to parallel computation. We also develop a matrix multiplication
(MatMul) method allowing for arbitrary precision by dismantling and
reassembling matrices at the bit level. This method provides flexible precision
and optimizes the utilization of GPU Tensor Cores. In addition, we propose a
memory management system focused on data recovery, which strategically employs
fast shared memory to substantially increase kernel execution speed and reduce
memory access latency. Finally, we develop a kernel mapping method that
dynamically selects the optimal configurable hyperparameters of kernels for
varying matrix sizes, enabling optimal performance across different
architectures and precision settings. In LLM inference, APT-LLM achieves up to
a 3.99x speedup compared to FP16 baselines and a 2.16x speedup over NVIDIA
CUTLASS INT4 acceleration on RTX 3090. On RTX 4090 and H800, APT-LLM achieves
up to 2.44x speedup over FP16 and 1.65x speedup over CUTLASS integer baselines.
Federated Fine-Tuning of Sparsely-Activated Large Language Models on Resource-Constrained Devices
Authors: Fahao Chen, Jie Wan, Peng Li, Zhou Su, Dongxiao Yu
2025-08-26
Federated fine-tuning of Mixture-of-Experts (MoE)-based large language models
(LLMs) is challenging due to their massive computational requirements and the
resource constraints of participants. Existing work attempts to fill this gap
through model quantization, computation offloading, or expert pruning. However,
these approaches cannot achieve the desired performance due to impractical
system assumptions and a lack of consideration for MoE-specific
characteristics. In this paper, we propose FLUX, a system designed to enable
federated fine-tuning of MoE-based LLMs across participants with constrained
computing resources (e.g., consumer-grade GPUs), aiming to minimize
time-to-accuracy. FLUX introduces three key innovations: (1) quantization-based
local profiling to estimate expert activation with minimal overhead, (2)
adaptive layer-aware expert merging to reduce resource consumption while
preserving accuracy, and (3) dynamic expert role assignment using an
exploration-exploitation strategy to balance tuning and non-tuning experts.
Extensive experiments on LLaMA-MoE and DeepSeek-MoE with multiple benchmark
datasets demonstrate that FLUX significantly outperforms existing methods,
achieving up to 4.75X speedup in time-to-accuracy.
A Concurrent Modular Agent Framework for Autonomous LLM Agents
Authors: Norihiro Maruyama, Takahide Yoshida, Hiroki Sato, Atsushi Masumori, Johnsmith, Takashi Ikegami
2025-08-26
We introduce the Concurrent Modular Agent (CMA), a framework that
orchestrates multiple Large Language Model (LLM)-based modules that operate
fully asynchronously yet maintain a coherent and fault-tolerant behavioral
loop. This framework addresses long-standing difficulties in agent
architectures by letting intention emerge from language-mediated interactions
among autonomous processes. This approach enables flexible, adaptive, and
context-dependent behavior through the combination of concurrently executed
modules that offload reasoning to an LLM, inter-module communication, and a
single shared global state. We consider this approach to be a practical
realization of Minsky's Society of Mind theory. We demonstrate the viability of
our system through two practical use-case studies. The emergent properties
observed in our system suggest that complex cognitive phenomena like
self-awareness may indeed arise from the organized interaction of simpler
processes, supporting Minsky's Society of Mind concept and opening new avenues
for artificial intelligence research. The source code for our work is available
at: https://github.com/AlternativeMachine/concurrent-modular-agent.
Sense of Self and Time in Borderline Personality. A Comparative Robustness Study with Generative AI
Authors: Marcin Moskalewicz, Anna Sterna, Marek Pokropski, Paula Flores
2025-08-26
This study examines the capacity of large language models (LLMs) to support
phenomenological qualitative analysis of first-person experience in Borderline
Personality Disorder (BPD), understood as a disorder of temporality and
selfhood. Building on a prior human-led thematic analysis of 24 inpatients'
life-story interviews, we compared three LLMs (OpenAI GPT-4o, Google Gemini 2.5
Pro, Anthropic Claude Opus 4) prompted to mimic the interpretative style of the
original investigators. The models were evaluated by blinded and non-blinded
expert judges in phenomenology and clinical psychology. Assessments included
semantic congruence, Jaccard coefficients, and multidimensional validity
ratings (credibility, coherence, substantiveness, and groundedness in data).
Results showed variable overlap with the human analysis, from 0 percent in GPT
to 42 percent in Claude and 58 percent in Gemini, and a low Jaccard coefficient
(0.21-0.28). However, the models recovered themes omitted by humans. Gemini's
output most closely resembled the human analysis, with validity scores
significantly higher than GPT's and Claude's (p < 0.0001), and was judged as
human by blinded experts. All scores strongly correlated (R > 0.78) with the
quantity of text and words per theme, highlighting both the variability and
the potential of AI-augmented thematic analysis to mitigate human
interpretative bias.
RoofSeg: An edge-aware transformer-based network for end-to-end roof plane segmentation
Authors: Siyuan You, Guozheng Xu, Pengwei Zhou, Qiwen Jin, Jian Yao, Li Li
2025-08-26
Roof plane segmentation is one of the key procedures for reconstructing
three-dimensional (3D) building models at levels of detail (LoD) 2 and 3 from
airborne light detection and ranging (LiDAR) point clouds. The majority of
current approaches for roof plane segmentation rely on the manually designed or
learned features followed by some specifically designed geometric clustering
strategies. Because the learned features are more powerful than the manually
designed features, the deep learning-based approaches usually perform better
than the traditional approaches. However, the current deep learning-based
approaches have three unsolved problems. The first is that most of them are not
truly end-to-end, so the plane segmentation results may not be optimal. The second
is that the point feature discriminability near the edges is relatively low,
leading to inaccurate planar edges. The third is that the planar geometric
characteristics are not sufficiently considered to constrain the network
training. To solve these issues, a novel edge-aware transformer-based network,
named RoofSeg, is developed for segmenting roof planes from LiDAR point clouds
in a truly end-to-end manner. In RoofSeg, we leverage a transformer
encoder-decoder-based framework to hierarchically predict the plane instance
masks with the use of a set of learnable plane queries. To further improve the
segmentation accuracy of edge regions, we also design an Edge-Aware Mask Module
(EAMM) that sufficiently incorporates planar geometric prior of edges to
enhance its discriminability for plane instance mask refinement. In addition,
we propose an adaptive weighting strategy in the mask loss to reduce the
influence of misclassified points, and also propose a new plane geometric loss
to constrain the network training.
LLMs in the SOC: An Empirical Study of Human-AI Collaboration in Security Operations Centres
Authors: Ronal Singh, Shahroz Tariq, Fatemeh Jalalvand, Mohan Baruwal Chhetri, Surya Nepal, Cecile Paris, Martin Lochner
2025-08-26
The integration of Large Language Models (LLMs) into Security Operations
Centres (SOCs) presents a transformative, yet still evolving, opportunity to
reduce analyst workload through human-AI collaboration. However, their
real-world application in SOCs remains underexplored. To address this gap, we
present a longitudinal study of 3,090 analyst queries from 45 SOC analysts over
10 months. Our analysis reveals that analysts use LLMs as on-demand aids for
sensemaking and context-building, rather than for making high-stakes
determinations, preserving analyst decision authority. The majority of queries
are related to interpreting low-level telemetry (e.g., commands) and refining
technical communication through short (1-3 turn) interactions. Notably, 93% of
queries align with established cybersecurity competencies (NICE Framework),
underscoring the relevance of LLM use for SOC-related tasks. Despite variations
in tasks and engagement, usage trends indicate a shift from occasional
exploration to routine integration, with growing adoption and sustained use
among a subset of analysts. We find that LLMs function as flexible, on-demand
cognitive aids that augment, rather than replace, SOC expertise. Our study
provides actionable guidance for designing context-aware, human-centred AI
assistance in security operations, highlighting the need for further
in-the-wild research on real-world analyst-LLM collaboration, its challenges,
and its impacts.
Enhancing Model Privacy in Federated Learning with Random Masking and Quantization
Authors: Zhibo Xu, Jianhao Zhu, Jingwen Xu, Changze Lv, Zisu Huang, Xiaohua Wang, Muling Wu, Qi Qian, Xiaoqing Zheng, Xuanjing Huang
2025-08-26
The primary goal of traditional federated learning is to protect data privacy
by enabling distributed edge devices to collaboratively train a shared global
model while keeping raw data decentralized at local clients. The rise of large
language models (LLMs) has introduced new challenges in distributed systems, as
their substantial computational requirements and the need for specialized
expertise raise critical concerns about protecting intellectual property (IP).
This highlights the need for a federated learning approach that can safeguard
both sensitive data and proprietary models. To tackle this challenge, we
propose FedQSN, a federated learning approach that leverages random masking to
obscure a subnetwork of model parameters and applies quantization to the
remaining parameters. Consequently, the server transmits only a
privacy-preserving proxy of the global model to clients during each
communication round, thus enhancing the model's confidentiality. Experimental
results across various models and tasks demonstrate that our approach not only
maintains strong model performance in federated learning settings but also
achieves enhanced protection of model parameters compared to baseline methods.
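A toy sketch of the masking-plus-quantization idea on a single tensor; the mask ratio, bit width, and zeroing of masked entries are illustrative choices, not FedQSN's exact construction:

```python
import torch

def make_proxy(param: torch.Tensor, mask_ratio: float = 0.3, n_levels: int = 16):
    """Hide a random subnetwork and coarsely quantize the remaining weights."""
    mask = torch.rand_like(param) < mask_ratio          # masked subnetwork
    scale = param.abs().max() / (n_levels // 2 - 1) + 1e-12
    q = torch.round(param / scale).clamp(-(n_levels // 2), n_levels // 2 - 1)
    proxy = torch.where(mask, torch.zeros_like(param), q * scale)
    return proxy, mask   # the server keeps the mask; clients only see the proxy
```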
pyFAST: A Modular PyTorch Framework for Time Series Modeling with Multi-source and Sparse Data
Authors: Zhijin Wang, Senzhen Wu, Yue Hu, Xiufeng Liu
2025-08-26
Modern time series analysis demands frameworks that are flexible, efficient,
and extensible. However, many existing Python libraries exhibit limitations in
modularity and in their native support for irregular, multi-source, or sparse
data. We introduce pyFAST, a research-oriented PyTorch framework that
explicitly decouples data processing from model computation, fostering a
cleaner separation of concerns and facilitating rapid experimentation. Its data
engine is engineered for complex scenarios, supporting multi-source loading,
protein sequence handling, efficient sequence- and patch-level padding, dynamic
normalization, and mask-based modeling for both imputation and forecasting.
pyFAST integrates LLM-inspired architectures for the alignment-free fusion of
sparse data sources and offers native sparse metrics, specialized loss
functions, and flexible exogenous data fusion. Training utilities include
batch-based streaming aggregation for evaluation and device synergy to maximize
computational efficiency. A comprehensive suite of classical and deep learning
models (Linears, CNNs, RNNs, Transformers, and GNNs) is provided within a
modular architecture that encourages extension. Released under the MIT license
at GitHub, pyFAST provides a compact yet powerful platform for advancing time
series research and applications.
ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
Authors: Xinhao Luo, Zihan Liu, Yangjie Zhou, Shihan Fang, Ziyu Huang, Yu Feng, Chen Zhang, Shixuan Sun, Zhenzhe Zheng, Jingwen Leng, Minyi Guo
2025-08-26
Large language model (LLM) decoding suffers from high latency due to
fragmented execution across operators and heavy reliance on off-chip memory for
data exchange and reduction. This execution model limits opportunities for
fusion and incurs significant memory traffic and kernel launch overhead. While
modern architectures such as NVIDIA Hopper provide distributed shared memory
and low-latency intra-cluster interconnects, they expose only low-level data
movement instructions, lacking structured abstractions for collective on-chip
communication. To bridge this software-hardware gap, we introduce two
cluster-level collective primitives, ClusterReduce and ClusterGather, which
abstract common communication patterns and enable structured, high-speed data
exchange and reduction between thread blocks within a cluster, allowing
intermediate results to stay on-chip without involving off-chip memory.
Building on these abstractions, we design ClusterFusion, an execution framework
that schedules communication and computation jointly to expand operator fusion
scope by composing decoding stages such as QKV Projection, Attention, and
Output Projection into a single fused kernel. Evaluations on H100 GPUs show
that ClusterFusion outperforms state-of-the-art inference frameworks by 1.61x
on average in end-to-end latency across different models and configurations.
The source code is available at https://github.com/xinhao-luo/ClusterFusion.
STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning
Authors: Chenghao Wu, Ruiyang Ren, Junjie Zhang, Ruirui Wang, Zhongrui Ma, Qi Ye, Wayne Xin Zhao
2025-08-26
While modern recommender systems are instrumental in navigating information
abundance, they remain fundamentally limited by static user modeling and
reactive decision-making paradigms. Current large language model (LLM)-based
agents inherit these shortcomings through their overreliance on heuristic
pattern matching, yielding recommendations prone to shallow correlation bias,
limited causal inference, and brittleness in sparse-data scenarios. We
introduce STARec, a slow-thinking augmented agent framework that endows
recommender systems with autonomous deliberative reasoning capabilities. Each
user is modeled as an agent with parallel cognitions: fast response for
immediate interactions and slow reasoning that performs chain-of-thought
rationales. To cultivate intrinsic slow thinking, we develop anchored
reinforcement training - a two-stage paradigm combining structured knowledge
distillation from advanced reasoning models with preference-aligned reward
shaping. This hybrid approach scaffolds agents in acquiring foundational
capabilities (preference summarization, rationale generation) while enabling
dynamic policy adaptation through simulated feedback loops. Experiments on
MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves
substantial performance gains compared with state-of-the-art baselines, despite
using only 0.4% of the full training data.
A Survey on Cloud-Edge-Terminal Collaborative Intelligence in AIoT Networks
Authors: Jiaqi Wu, Jing Liu, Yang Liu, Lixu Wang, Zehua Wang, Wei Chen, Zijian Tian, Richard Yu, Victor C. M. Leung
2025-08-26
The proliferation of Internet of things (IoT) devices in smart cities,
transportation, healthcare, and industrial applications, coupled with the
explosive growth of AI-driven services, has increased demands for efficient
distributed computing architectures and networks, driving cloud-edge-terminal
collaborative intelligence (CETCI) as a fundamental paradigm within the
artificial intelligence of things (AIoT) community. With advancements in deep
learning, large language models (LLMs), and edge computing, CETCI has made
significant progress with emerging AIoT applications, moving beyond isolated
layer optimization to deployable collaborative intelligence systems for AIoT
(CISAIOT), a practical research focus in AI, distributed computing, and
communications. This survey describes foundational architectures, enabling
technologies, and scenarios of CETCI paradigms, offering a tutorial-style
review for CISAIOT beginners. We systematically analyze architectural
components spanning cloud, edge, and terminal layers, examining core
technologies including network virtualization, container orchestration, and
software-defined networking, while presenting categorizations of collaboration
paradigms that cover task offloading, resource allocation, and optimization
across heterogeneous infrastructures. Furthermore, we explain intelligent
collaboration learning frameworks by reviewing advances in federated learning,
distributed deep learning, edge-cloud model evolution, and reinforcement
learning-based methods. Finally, we discuss challenges (e.g., scalability,
heterogeneity, interoperability) and future trends (e.g., 6G+, LLM agents,
quantum computing, digital twins), highlighting how the integration of
distributed computing and LLMs can address open issues and guide the
development of robust, efficient, and secure collaborative AIoT systems.
Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction
Authors: Yilin Li, Xunjian Yin, Yilin Chen, Xiaojun Wan
2025-08-26
Grammatical error correction is a significant task in NLP. Traditional
methods based on encoder-decoder models have achieved certain success, but the
application of LLMs in this field is still underexplored. Current research
predominantly relies on supervised fine-tuning to train LLMs to directly
generate the corrected sentence, which limits the model's powerful reasoning
ability. To address this limitation, we propose a novel framework based on
Rule-Based RL. Through experiments on Chinese datasets, our Rule-Based RL
framework achieves state-of-the-art performance, with a notable increase in
recall. This result clearly highlights the advantages of using RL to steer
LLMs, offering a more controllable and reliable paradigm for future development
in GEC.
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
Authors: Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao
2025-08-26
While Mixture of Experts (MoE) models achieve remarkable efficiency by
activating only subsets of parameters, they suffer from high memory access
costs during inference. Memory-layer architectures offer an appealing
alternative with very few memory accesses, but previous attempts like UltraMem
have only matched the performance of 2-expert MoE models, falling significantly
short of state-of-the-art 8-expert configurations. We present UltraMemV2, a
redesigned memory-layer architecture that closes this performance gap. Our
approach introduces five key improvements: integrating memory layers into every
transformer block, simplifying value expansion with single linear projections,
adopting FFN-based value processing from PEER, implementing principled
parameter initialization, and rebalancing memory-to-FFN computation ratios.
Through extensive evaluation, we demonstrate that UltraMemV2 achieves
performance parity with 8-expert MoE models under the same computation and
parameters but with significantly lower memory access. Notably, UltraMemV2
shows superior performance on memory-intensive tasks, with improvements of +1.6
points on long-context memorization, +6.2 points on multi-round memorization,
and +7.9 points on in-context learning. We validate our approach at scale with
models up to 2.5B activated parameters from 120B total parameters, and
establish that activation density has greater impact on performance than total
parameter count. Our work brings memory-layer architectures to performance
parity with state-of-the-art MoE models, presenting a compelling alternative
for efficient sparse computation.
Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics
Authors: Jungwoo Kim, Minsang Kim, Jaeheon Lee, Chanwoo Moon, Heejin Kim, Taeho Hwang, Woosuk Chung, Yeseong Kim, Sungjin Lee
2025-08-26
Serving Large Language Models (LLMs) at scale requires meeting strict Service
Level Objectives (SLOs) under severe computational and memory constraints.
Nevertheless, traditional caching strategies fall short: exact-matching and
prefix caches neglect query semantics, while state-of-the-art semantic caches
remain confined to traditional intuitions, offering little conceptual
departure. Building on this, we present SISO, a semantic caching system that
redefines efficiency for LLM serving. SISO introduces centroid-based caching to
maximize coverage with minimal memory, locality-aware replacement to preserve
high-value entries, and dynamic thresholding to balance accuracy and latency
under varying workloads. Across diverse datasets, SISO delivers up to
1.71x higher hit ratios and consistently stronger SLO attainment
compared to state-of-the-art systems.
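A minimal sketch of centroid-based semantic caching over embedding vectors; the linear scan, cosine threshold, and eviction-free insert are simplifications, not SISO's actual design:

```python
import numpy as np

class CentroidCache:
    def __init__(self, threshold: float = 0.85):
        self.centroids, self.responses = [], []
        self.threshold = threshold

    def lookup(self, emb: np.ndarray):
        for c, resp in zip(self.centroids, self.responses):
            cos = float(emb @ c) / (np.linalg.norm(emb) * np.linalg.norm(c))
            if cos >= self.threshold:
                return resp            # semantic hit: reuse cached response
        return None                    # miss: caller queries the model instead

    def insert(self, emb: np.ndarray, response: str):
        if self.lookup(emb) is None:   # only spawn a centroid for novel queries
            self.centroids.append(emb)
            self.responses.append(response)
```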
Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vectorized Drawings
Authors: Feiwei Qin, Shichao Lu, Junhao Hou, Changmiao Wang, Meie Fang, Ligang Liu
2025-08-26
Computer-Aided Design (CAD) generative modeling is driving significant
innovations across industrial applications. Recent works have shown remarkable
progress in creating solid models from various inputs such as point clouds,
meshes, and text descriptions. However, these methods fundamentally diverge
from traditional industrial workflows that begin with 2D engineering drawings.
The automatic generation of parametric CAD models from these 2D vector drawings
remains underexplored despite being a critical step in engineering design. To
address this gap, our key insight is to reframe CAD generation as a
sequence-to-sequence learning problem where vector drawing primitives directly
inform the generation of parametric CAD operations, preserving geometric
precision and design intent throughout the transformation process. We propose
Drawing2CAD, a framework with three key technical components: a
network-friendly vector primitive representation that preserves precise
geometric information, a dual-decoder architecture that decouples
command type and parameter generation while maintaining precise correspondence,
and a soft target distribution loss function accommodating inherent flexibility
in CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing,
a dataset of paired engineering drawings and parametric CAD models, and conduct
thorough experiments to demonstrate the effectiveness of our method. Code and
dataset are available at https://github.com/lllssc/Drawing2CAD.
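One common way to realize a soft target distribution loss over discretized parameters is to smooth the one-hot bin target with a small Gaussian, tolerating near-miss predictions; the bin count and width below are illustrative, and the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def soft_target_loss(logits, target_bins, n_bins=256, sigma=2.0):
    """KL between predicted bin distribution and a Gaussian-smoothed target."""
    bins = torch.arange(n_bins, dtype=torch.float32)
    # (batch, n_bins): a Gaussian bump centered on each ground-truth bin.
    soft = torch.exp(-0.5 * ((bins - target_bins.float().unsqueeze(1)) / sigma) ** 2)
    soft = soft / soft.sum(dim=1, keepdim=True)
    return F.kl_div(F.log_softmax(logits, dim=1), soft, reduction="batchmean")
```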
Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
Authors: Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
2025-08-26
Empirical scaling laws have driven the evolution of large language models
(s), yet their coefficients shift whenever the model architecture or data
pipeline changes. Mixture-of-Experts (MoE) models, now standard in
state-of-the-art systems, introduce a new
dimension that current
dense-model frontiers overlook. We investigate how MoE
influences two
distinct capability regimes: memorization and reasoning. We train families of
MoE Transformers that systematically vary total parameters, active parameters,
and top- routing while holding the compute budget fixed. For every model we
record pre-training loss, downstream task loss, and task accuracy, allowing us
to separate the train-test generalization gap from the loss-accuracy gap.
Memorization benchmarks improve monotonically with total parameters, mirroring
training loss. By contrast, reasoning performance saturates and can even
regress despite continued gains in both total parameters and training loss.
Altering top-k alone has little effect when active parameters are constant,
and classic hyperparameters such as learning rate and initialization modulate
the generalization gap in the same direction as sparsity. Neither post-training
reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning
deficit of overly sparse models. Our model checkpoints, code, and logs are
open-sourced at https://github.com/rioyokotalab/optimal-sparsity.
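As a back-of-envelope illustration of the design space the paper sweeps (our own arithmetic, not the paper's code): with top-k routing, active parameters per token are fixed by k while total parameters grow with the expert count, so sparsity can rise while per-token compute stays constant.

def moe_ffn_params(d_model, d_ff, n_experts, top_k):
    per_expert = 2 * d_model * d_ff  # up- and down-projection weights
    total = n_experts * per_expert   # grows with expert count
    active = top_k * per_expert      # parameters touched per token
    return total, active

for n_experts, top_k in [(8, 2), (32, 2), (128, 2), (128, 8)]:
    total, active = moe_ffn_params(1024, 4096, n_experts, top_k)
    print(f"E={n_experts:4d} k={top_k}: total={total/1e6:7.1f}M "
          f"active={active/1e6:6.1f}M")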
MUA-RL Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use
Authors: Weikang Zhao, Xili Wang, Chengdi Ma, Lingbin Kong, Zhaohua Yang, Mingxiang Tuo, Xiaowei Shi, Yitao Zhai, Xunliang Cai
2025-08-26
With the recent rapid advancement of Agentic Intelligence, agentic tool use
in LLMs has become increasingly important. During multi-turn interactions
between agents and users, the dynamic, uncertain, and stochastic nature of user
demands poses significant challenges to the agent's tool invocation
capabilities. Agents are no longer expected to simply call tools to deliver a
result; rather, they must iteratively refine their understanding of user needs
through communication while simultaneously invoking tools to resolve user
queries. Existing reinforcement learning (RL) approaches for tool use lack the
integration of genuinely dynamic users during the RL training process. To
bridge this gap, we introduce MUA-RL (Multi-turn User-interacting Agent
Reinforcement Learning for agentic tool use), a novel reinforcement learning
framework that, for the first time in the field of agentic tool use, integrates
LLM-simulated users into the reinforcement learning loop. MUA-RL aims to enable
autonomous learning of models to communicate with users efficiently and use
various tools to solve practical problems in dynamic multi-turn interactions.
Evaluations are done on several multi-turn tool-using benchmarks (see Figure
1). Specifically, MUA-RL-32B achieves 67.3 on TAU2 Retail, 45.4 on TAU2
Airline, 28.3 on TAU2 Telecom, 28.4 on BFCL-V3 Multi Turn, and 82.5 on ACEBench
Agent -- outperforming or matching the performance of larger open-source models
such as DeepSeek-V3-0324 and Qwen3-235B-A22B in non-thinking settings.
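A conceptual sketch (ours; policy, user_sim, tools, and reward_fn are hypothetical interfaces, not the paper's) of what placing a simulated user inside the rollout loop looks like: the episode alternates tool calls and user turns, and the outcome-based reward feeds the RL update.

def rollout_episode(policy, user_sim, tools, task, reward_fn, max_turns=8):
    # The simulated user opens with a (possibly underspecified) request.
    history = [{"role": "user", "content": user_sim.open(task)}]
    for _ in range(max_turns):
        action = policy.act(history)                 # reply or tool call
        if action.kind == "tool_call":
            result = tools.run(action.name, action.args)
            history.append({"role": "tool", "content": result})
        else:
            history.append({"role": "assistant", "content": action.text})
            reply, done = user_sim.respond(history)  # user reacts, may finish
            if done:
                break
            history.append({"role": "user", "content": reply})
    return history, reward_fn(task, history)         # outcome reward for RL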
Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models
Authors: Chenxi Zhou, Pengfei Cao, Jiang Li, Jun Zhao, Kang Liu
2025-08-26
Large language models (LLMs) present significant deployment challenges due to
their scale, with post-training quantization (PTQ) emerging as a practical
compression solution. However, a comprehensive understanding of how PTQ
precisely impacts diverse LLM knowledge capabilities remains elusive, and
existing scaling laws for quantized models often overlook crucial PTQ-specific
parameters and task-specific sensitivities. This paper addresses these gaps by
conducting an extensive empirical investigation to establish task-stratified
scaling laws. We disentangle LLM knowledge into memorization and utilization
capabilities and develop a unified quantitative framework that incorporates
model size, effective bit-width, calibration set size, and group size. Our
central finding reveals that knowledge memorization exhibits markedly greater
sensitivity to variations in effective bit-width, calibration set size, and
model size compared to the more robust knowledge utilization. These findings
offer a fine-grained understanding of PTQ's impact and provide guidance for
developing knowledge-aware quantization strategies that can better preserve
targeted cognitive functions.
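To make "unified quantitative framework" concrete, here is a purely illustrative functional form (ours, with made-up coefficients, not the paper's fitted law): a dense size term plus PTQ-specific penalty terms in effective bit-width b, calibration set size C, and group size G; the quantization-sensitive memorization regime would correspond to larger penalty coefficients than the more robust utilization regime.

def predicted_task_loss(N, b, C, G, a=1.0, alpha=0.3,
                        beta=1.5, gamma=0.1, delta=0.05, L_inf=1.5):
    # Dense scaling term in model size N, plus PTQ penalty terms.
    dense = a / N ** alpha
    ptq = beta / 2 ** b + gamma / C + delta * (G / 128)
    return L_inf + dense + ptq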
History Rhymes Accelerating LLM Reinforcement Learning with RhymeRL
Authors: Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, Haibo Chen
2025-08-26
With the rapid advancement of large language models (LLMs), reinforcement
learning (RL) has emerged as a pivotal methodology for enhancing the reasoning
capabilities of LLMs. Unlike traditional pre-training approaches, RL
encompasses multiple stages: rollout, reward, and training, which necessitates
collaboration among various worker types. However, current RL systems continue
to grapple with substantial GPU underutilization, due to two primary factors:
(1) The rollout stage dominates the overall RL process due to test-time
scaling; (2) Imbalances in rollout lengths (within the same batch) result in
GPU bubbles. While prior solutions like asynchronous execution and truncation
offer partial relief, they may compromise training accuracy for efficiency.
Our key insight stems from a previously overlooked observation: rollout
responses exhibit remarkable similarity across adjacent training epochs. Based
on this insight, we introduce RhymeRL, an LLM RL system designed to accelerate
RL training with two key innovations. First, to enhance rollout generation, we
present HistoSpec, a speculative decoding inference engine that utilizes the
similarity of historical rollout token sequences to obtain accurate drafts.
Second, to tackle rollout bubbles, we introduce HistoPipe, a two-tier
scheduling strategy that leverages the similarity of historical rollout
distributions to balance workload among rollout workers. We have evaluated
RhymeRL within a real production environment, demonstrating scalability from
dozens to thousands of GPUs. Experimental results demonstrate that RhymeRL
achieves a 2.6x performance improvement over existing methods, without
compromising accuracy or modifying the RL paradigm.
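An illustrative sketch (ours, not RhymeRL's code; model.verify and model.sample_next are hypothetical) of HistoSpec's core trick: the previous epoch's rollout for the same prompt serves as a speculative draft that the current policy verifies in parallel, falling back to ordinary decoding at the first mismatch.

def speculate_from_history(model, prompt_ids, prev_rollout, chunk=16):
    out = list(prompt_ids)
    i = 0
    while i < len(prev_rollout):
        draft = prev_rollout[i:i + chunk]
        # One forward pass scores every draft position (verification),
        # which is far cheaper than token-by-token generation.
        accepted = model.verify(out, draft)  # longest accepted draft prefix
        out.extend(accepted)
        i += len(accepted)
        if len(accepted) < len(draft):           # drift from history detected:
            out.append(model.sample_next(out))   # resample one token, then
            break                                # resume ordinary decoding
    return out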
Strata Hierarchical Context Caching for Long Context Language Model Serving
Authors: Zhiqiang Xie, Ziyi Xu, Mark Zhao, Yuwei An, Vikram Sharma Mailthody, Scott Mahlke, Michael Garland, Christos Kozyrakis
2025-08-26
Large Language Models (LLMs) with expanding context windows face significant
performance hurdles. While caching key-value (KV) states is critical for
avoiding redundant computation, the storage footprint of long-context KV caches
quickly exceeds GPU memory capacity, forcing production systems to adopt
hierarchical caching across memory hierarchies. However, transferring large
cached contexts back to the GPU introduces severe performance bottlenecks:
fragmented I/O from paged layouts prevents full bandwidth utilization, and
existing schedulers fail to account for
KV-loading delays, leaving systems
loading-bound rather than compute-bound. We present Strata, a hierarchical
context caching framework designed for efficient long-context LLM serving.
Strata introduces GPU-assisted I/O to combat KV cache fragmentation, decoupling
GPU and CPU memory layouts, and employs cache-aware request scheduling to
balance compute with I/O latency, overlapping unavoidable stalls with
complementary tasks. Built on SGLang and deployed in production, Strata
achieves up to 5x lower Time-To-First-Token (TTFT) compared to vLLM + LMCache
and a 3.75x speedup over NVIDIA TensorRT-LLM on long-context benchmarks, without
degrading short-context performance.
LLM-Driven Intrinsic Motivation for Sparse Reward Reinforcement Learning
Authors: André Quadros, Cassio Silva, Ronnie Alves
2025-08-25
This paper explores the combination of two intrinsic motivation strategies to
improve the efficiency of reinforcement learning (RL) agents in environments
with extremely sparse rewards, where traditional learning struggles due to
infrequent positive feedback. We propose integrating Variational State as
Intrinsic Reward (VSIMR), which uses Variational AutoEncoders (VAEs) to reward
state novelty, with an intrinsic reward approach derived from Large Language
Models (LLMs). The LLMs leverage their pre-trained knowledge to generate reward
signals based on environment and goal descriptions, guiding the agent. We
implemented this combined approach with an Actor-Critic (A2C) agent in the
MiniGrid DoorKey environment, a benchmark for sparse rewards. Our empirical
results show that this combined strategy significantly increases agent
performance and sampling efficiency compared to using each strategy
individually or a standard A2C agent, which failed to learn. Analysis of
learning curves indicates that the combination effectively complements
different aspects of the environment and task: VSIMR drives exploration of new
states, while the LLM-derived rewards facilitate progressive exploitation
towards goals.
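A minimal sketch (our own reading; vae and llm_scorer are hypothetical components, and the coefficients are illustrative) of how the two intrinsic signals could be combined with the environment reward at each step:

def combined_reward(r_ext, state, vae, llm_scorer,
                    beta_vsimr=0.1, beta_llm=0.1):
    # VSIMR: states the VAE reconstructs poorly are novel, hence rewarded.
    r_novelty = vae.reconstruction_error(state)
    # LLM reward: pre-trained knowledge scores progress toward the goal
    # described in natural language.
    r_llm = llm_scorer.score(state)
    return r_ext + beta_vsimr * r_novelty + beta_llm * r_llm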
Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning
Authors: Jeong-seok Oh, Jay-yoon Lee
2025-08-25
Probabilistic decoding in Large Language Models (LLMs) often yields
inconsistent outputs, particularly on complex or long-form questions.
Self-Consistency (SC) mitigates this for short-form QA by majority voting over
exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram
Consistency Score (WUCS) extend to long-form responses but lose accuracy on
short-form benchmarks.
We introduce Latent Self-Consistency (LSC), which selects the most
semantically consistent response using learnable token embeddings. A
lightweight forward generation of summary tokens increases inference time by
less than 1% and requires no changes to the model architecture.
Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU,
TruthfulQA), LSC surpasses SC, USC, and WUCS on average across both short-form
and long-form benchmarks, while maintaining negligible computational overhead. These
results position LSC as a practical consistency-selection method that works
reliably across answer formats. Additionally, LSC provides well-calibrated
confidence estimates, maintaining low Expected Calibration Error across both
answer formats.
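A sketch of the selection step as we understand it (ours, not the authors' implementation): given one summary-token embedding per sampled response, pick the response that agrees most, on average, with the rest; a semantic analogue of majority voting that extends to long-form answers.

import numpy as np

def select_most_consistent(embeddings):
    # embeddings: (n_responses, d) summary-token embeddings, one per sample.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T                                       # pairwise cosine
    agreement = (sims.sum(axis=1) - 1.0) / (len(X) - 1)  # mean sim, excl. self
    return int(np.argmax(agreement))  # index of the "majority" answer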
Backprompting Leveraging Synthetic Production Data for Health Advice Guardrails
Authors: Kellen Tan Cheng, Anna Lisa Gentile, Chad DeLuca, Guang-Jie Ren
2025-08-25
The pervasiveness of large language models (LLMs) in enterprise settings has
also brought forth a significant amount of risks associated with their usage.
Guardrails technologies aim to mitigate this risk by filtering LLMs'
input/output text through various detectors. However, developing and
maintaining robust detectors faces many challenges, one of which is the
difficulty in acquiring production-quality labeled data on real LLM outputs
prior to deployment. In this work, we propose backprompting, a simple yet
intuitive solution to generate production-like labeled data for health advice
guardrails development. Furthermore, we pair our backprompting method with a
human-in-the-loop clustering technique to label the generated data. Our
aim is to construct a parallel corpus roughly representative of the original
dataset yet resembling real LLM output. We then infuse existing datasets with
our synthetic examples to produce robust training data for our detector. We
test our technique in one of the most difficult and nuanced guardrails: the
identification of health advice in LLM output, and demonstrate improvement
versus other solutions. Our detector is able to outperform GPT-4o by up to
3.73%, despite having 400x fewer parameters.
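A hedged sketch (ours; llm.generate and the template wording are hypothetical) of the backprompting loop: seed examples of the target label are fed back into an LLM to elicit production-like outputs, which then augment detector training data.

def backprompt(llm, seed_texts,
               template="Reply as a chatbot would, giving advice like: {seed}"):
    synthetic = []
    for seed in seed_texts:
        out = llm.generate(template.format(seed=seed))  # production-like text
        # Weak label; the paper refines labels via human-in-the-loop clustering.
        synthetic.append({"text": out, "label": "health_advice"})
    return synthetic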
DualSparse-MoE Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction
Authors: Weilin Cai, Le Qin, Shwai He, Junwei Cui, Ang Li, Jiayi Huang
2025-08-25
Mixture of Experts (MoE) has become a mainstream architecture for building
Large Language Models (LLMs) by reducing per-token computation while enabling
model scaling. It can be viewed as partitioning a large Feed-Forward Network
(FFN) at the tensor level into fine-grained sub-FFNs, or experts, and
activating only a sparse subset for each input. While this sparsity improves
efficiency, MoE models still face substantial challenges due to their massive
computational scale and unpredictable activation patterns.
To enable efficient MoE deployment, we identify dual sparsity at the tensor
and neuron levels in pre-trained MoE modules as a key factor for both accuracy
and efficiency. Unlike prior work that increases tensor-level sparsity through
finer-grained expert design during pre-training, we introduce post-training
expert partitioning to induce such sparsity without retraining. This preserves
the mathematical consistency of model transformations and enhances both
efficiency and accuracy in subsequent fine-tuning and inference. Building upon
this, we propose DualSparse-MoE, an inference system that integrates dynamic
tensor-level computation dropping with static neuron-level reconstruction to
deliver significant efficiency gains with minimal accuracy loss.
Experimental results show that enforcing an approximate 25% drop rate with
our approach reduces average accuracy by only 0.08%-0.28% across three
prevailing MoE models, while nearly all degrees of computation dropping
consistently yield proportional computational speedups. Furthermore,
incorporating load-imbalance awareness into expert parallelism achieves a 1.41x
MoE module speedup with just 0.5% average accuracy degradation.
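An illustrative sketch (ours, not the authors' system; importance_fn is a hypothetical cheap proxy) of dynamic tensor-level dropping over post-training expert partitions: a standard FFN split along its hidden dimension is an exact sum of sub-FFNs, so skipping low-importance partitions trades a small, controlled error for compute.

import torch

def dropped_expert_forward(x, sub_ffns, importance_fn, drop_rate=0.25):
    # sub_ffns: partitions of one expert's FFN along the hidden dimension, so
    # sum_i sub_ffns[i](x) equals the full FFN output when nothing is dropped.
    scores = torch.stack([importance_fn(x, f) for f in sub_ffns])
    k = max(1, int(len(sub_ffns) * (1 - drop_rate)))
    keep = torch.topk(scores, k).indices
    out = torch.zeros_like(x)
    for i in keep.tolist():
        out = out + sub_ffns[i](x)  # only surviving partitions are computed
    return out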
Flash Sparse Attention An Alternative Efficient Implementation of Native Sparse Attention Kernel
Authors: Ran Yan, Youhe Jiang, Binhang Yuan
2025-08-25
Recent progress in sparse attention mechanisms has demonstrated strong
potential for reducing the computational cost of long-context training and
inference in large language models (LLMs). Native Sparse Attention (NSA), a
state-of-the-art approach, introduces natively trainable, hardware-aligned
attention that delivers substantial system-level performance gains while
maintaining accuracy comparable to full attention. However, the kernel
implementation of NSA relies on a query-grouping strategy that is efficient
only with large Grouped Query Attention (GQA) sizes, whereas modern LLMs
typically adopt much smaller GQA groups, which limits the applicability of this
algorithmic advance. In this work, we propose Flash Sparse Attention
(FSA), which includes an alternative kernel design that enables efficient NSA
computation across a wide range of popular LLMs with varied smaller GQA group
sizes on modern GPUs. Compared to vanilla NSA kernel implementation, our
empirical evaluation demonstrates that FSA achieves (i) up to 3.5x and on
average 1.6x kernel-level latency reduction, (ii) up to 1.25x and on average
1.09x end-to-end training speedup on state-of-the-art LLMs, and (iii) up to
1.36x and on average 1.11x end-to-end prefill speedup on state-of-the-art
LLMs. The source code is open-sourced and
publicly available at
https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.
Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios
Authors: Luana Bulla, Gabriele Tuccio, Misael Mongiovì, Aldo Gangemi
2025-08-25
Translating natural languages into sign languages is a highly complex and
underexplored task. Despite growing interest in accessibility and inclusivity,
the development of robust translation systems remains hindered by the limited
availability of parallel corpora which align natural language with sign
language data. Existing methods often struggle to generalize in these
data-scarce environments, as the few datasets available are typically
domain-specific, lack standardization, or fail to capture the full linguistic
richness of sign languages. To address this limitation, we propose Advanced Use
of LLMs for Sign Language Translation (AulSign), a novel method that leverages
Large Language Models via dynamic prompting and in-context learning with sample
selection and subsequent sign association. Despite their impressive abilities
in processing text, LLMs lack intrinsic knowledge of sign languages; therefore,
they are unable to natively perform this kind of translation. To overcome this
limitation, we associate the signs with compact descriptions in natural
language and instruct the model to use them. We evaluate our method on both
English and Italian languages using SignBank+, a recognized benchmark in the
field, as well as the Italian LaCAM CNR-ISTC dataset. We demonstrate superior
performance compared to state-of-the-art models in low-data scenarios. Our
findings demonstrate the effectiveness of AulSign, with the potential to
enhance accessibility and inclusivity in communication technologies for
underrepresented linguistic communities.
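A sketch (ours; the prompt wording and the retrieve function are hypothetical) of the core prompting scheme: since LLMs have no native sign-language knowledge, each candidate sign is paired with a compact natural-language description, and sample selection decides which signs to expose in context.

def build_prompt(sentence, sign_catalog, retrieve, k=20):
    # retrieve: similarity-based sample selection over (sign, description)
    # pairs, standing in for AulSign's dynamic in-context example picking.
    candidates = retrieve(sentence, sign_catalog, k)
    catalog_lines = [f"{sign}: {desc}" for sign, desc in candidates]
    return ("Translate the sentence into a sequence of the signs below.\n"
            + "\n".join(catalog_lines)
            + f"\nSentence: {sentence}\nSigns:")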
AdLoCo adaptive batching significantly improves communications efficiency and convergence for Large Language Models
Authors: Nikolay Kutuzov, Makar Baderko, Stepan Kulibaba, Artem Dzhalilov, Daniel Bobrov, Maxim Mashtaler, Alexander Gasnikov
2025-08-25
Scaling distributed training of Large Language Models (LLMs) requires not
only algorithmic advances but also efficient utilization of heterogeneous
hardware resources. While existing methods such as DiLoCo have demonstrated
promising results, they often fail to fully exploit computational clusters
under dynamic workloads. To address this limitation, we propose a three-stage
method that combines Multi-Instance Training (MIT), Adaptive Batched DiLoCo,
and a switch-mode mechanism. MIT allows individual nodes to run multiple
lightweight training streams with different model instances in parallel and
merge them to combine knowledge, increasing throughput and reducing idle time.
Adaptive Batched DiLoCo dynamically adjusts local batch sizes to balance
computation and communication, substantially lowering synchronization delays.
Switch mode further stabilizes training by seamlessly introducing gradient
accumulation once adaptive batch sizes grow beyond hardware-friendly limits.
Together, these innovations improve both convergence speed and system
efficiency. We also provide a theoretical estimate of the number of
communications required for the full convergence of a model trained using our
method.
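A toy sketch (ours; the planning rule is illustrative, not the paper's algorithm) of the switch-mode idea: the adaptive batch grows to hide communication, and once it exceeds what the device can hold, the step is split into gradient-accumulation micro-steps.

def plan_local_step(target_batch, max_device_batch):
    if target_batch <= max_device_batch:
        return {"device_batch": target_batch, "accum_steps": 1}
    # Switch mode: split the oversized batch into accumulation micro-steps.
    accum = -(-target_batch // max_device_batch)          # ceil division
    return {"device_batch": -(-target_batch // accum), "accum_steps": accum}

print(plan_local_step(48, 64))   # {'device_batch': 48, 'accum_steps': 1}
print(plan_local_step(200, 64))  # {'device_batch': 50, 'accum_steps': 4}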
HLLM-Creator Hierarchical LLM-based Personalized Creative Generation
Authors: Junyi Chen, Lu Chi, Siliang Xu, Shiwei Ran, Bingyue Peng, Zehuan Yuan
2025-08-25
AI-generated content technologies are widely used in content creation.
However, current AIGC systems rely heavily on creators' inspiration, rarely
generating truly user-personalized content. In real-world applications such as
online advertising, a single product may have multiple selling points, with
different users focusing on different features. This underscores the
significant value of personalized, user-centric creative generation. Effective
personalized content generation faces two main challenges: (1) accurately
modeling user interests and integrating them into the content generation
process while adhering to factual constraints, and (2) ensuring high efficiency
and scalability to handle the massive user base in industrial scenarios.
Additionally, the scarcity of personalized creative data in practice
complicates model training, making data construction another key hurdle. We
propose HLLM-Creator, a hierarchical LLM framework for efficient user interest
modeling and personalized content generation. During inference, a combination
of user clustering and a user-ad-matching-prediction-based strategy is
employed to significantly enhance generation efficiency and reduce
computational overhead, making the approach suitable for large-scale
deployment. Moreover, we design a data construction pipeline based on
chain-of-thought reasoning, which generates high-quality, user-specific
creative titles and ensures factual consistency despite limited personalized
data. This pipeline serves as a critical foundation for the effectiveness of
our model. Extensive experiments on personalized title generation for Douyin
Search Ads show the effectiveness of HLLM-Creator. An online A/B test shows a
0.476% increase on Adss, paving the way for more effective and efficient
personalized generation in industrial scenarios. Codes for academic dataset are
available at https://github.com/bytedance/HLLM.
The AI Data Scientist
Authors: Farkhad Akimov, Munachiso Samuel Nwadike, Zangir Iklassov, Martin Takáč
2025-08-25
Imagine decision-makers uploading data and, within minutes, receiving clear,
actionable insights delivered straight to their fingertips. That is the promise
of the AI Data Scientist, an autonomous Agent powered by large language models
(LLMs) that closes the gap between evidence and action. Rather than simply
writing code or responding to prompts, it reasons through questions, tests
ideas, and delivers end-to-end insights at a pace far beyond traditional
workflows. Guided by the scientific tenet of hypothesis testing, this Agent
uncovers explanatory patterns in data, evaluates their statistical
significance, and uses them to inform predictive modeling. It then translates
these results into recommendations that are both rigorous and accessible. At
the core of the AI Data Scientist is a team of specialized LLM Subagents, each
responsible for a distinct task such as data cleaning, statistical testing,
validation, and plain-language communication. These Subagents write their own
code, reason about causality, and identify when additional data is needed to
support sound conclusions. Together, they achieve in minutes what might
otherwise take days or weeks, enabling a new kind of interaction that makes
deep data science both accessible and actionable.
A.S.E A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
Authors: Keke Lian, Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li, Dong Zhang
2025-08-25
The increasing adoption of large language models (LLMs) in software
engineering necessitates rigorous security evaluation of their generated code.
However, existing benchmarks are inadequate, as they focus on isolated code
snippets, employ unstable evaluation methods that lack reproducibility, and
fail to connect the quality of input context with the security of the output.
To address these gaps, we introduce A.S.E (AI Code Generation Security
Evaluation), a benchmark for repository-level secure code generation. A.S.E
constructs tasks from real-world repositories with documented CVEs, preserving
full repository context like build systems and cross-file dependencies. Its
reproducible, containerized evaluation framework uses expert-defined rules to
provide stable, auditable assessments of security, build quality, and
generation stability. Our evaluation of leading LLMs on A.S.E reveals three key
findings: (1) Claude-3.7-Sonnet achieves the best overall performance. (2) The
security gap between proprietary and open-source models is narrow;
Qwen3-235B-A22B-Instruct attains the top security score. (3) Concise,
"fast-thinking" decoding strategies consistently outperform complex,
"slow-thinking" reasoning for security patching.
ILRe Intermediate Layer Retrieval for Context Compression in Causal Language Models
Authors: Manlai Liang, Mandi Liu, Jiangzhou Ji, Huaijun Li, Haobo Yang, Yaohan He, Jinlong Li
2025-08-25
Large Language Models (LLMs) have demonstrated success across many
benchmarks. However, they still exhibit limitations in long-context scenarios,
primarily due to their short effective context length, quadratic computational
complexity, and high memory overhead when processing lengthy inputs. To
mitigate these issues, we introduce a novel context compression pipeline,
called Intermediate Layer Retrieval (ILRe), which determines one intermediate
decoder layer offline, encodes context by streaming chunked prefill only up to
that layer, and recalls tokens by the attention scores between the input query
and the full key cache in that specified layer. In particular, we propose a
multi-pooling kernels allocating strategy in the token recalling process to
maintain the completeness of semantics. Our approach not only reduces the
prefilling complexity from O(L^2) to O(L), but also achieves performance
comparable to or better than the full context in long-context scenarios.
Without additional post-training or operator development, ILRe can process a
single 1M-token request in less than half a minute with a substantial speedup,
and achieves strong scores on the RULER-1M benchmark with the
Llama-3.1-UltraLong-8B-1M-Instruct model on a Huawei Ascend 910B NPU.
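A minimal sketch (ours; shapes and the pooling widths are assumptions) of the recall step: context tokens are scored by query attention against the layer-l key cache, and max-pooling at several widths keeps neighbors of high-scoring tokens so recalled spans stay semantically whole, in the spirit of the multi-pooling allocation above.

import torch
import torch.nn.functional as F

def recall_tokens(q, k_cache, top_k, pool_sizes=(1, 3, 5)):
    # q: (d,) query state at the chosen layer; k_cache: (n_ctx, d) key cache.
    scores = (k_cache @ q) / (q.shape[-1] ** 0.5)  # attention logits
    pooled = torch.stack([
        F.max_pool1d(scores[None, None], p, stride=1, padding=p // 2)[0, 0]
        for p in pool_sizes                        # odd widths preserve length
    ]).amax(dim=0)
    keep = min(top_k, pooled.shape[0])
    return torch.topk(pooled, keep).indices.sort().values  # recalled positions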
LexSemBridge Fine-Grained Dense Representation Enhancement through Token-Aware Embedding Augmentation
Authors: Shaoxiong Zhan, Hai Lin, Hongming Tan, Xiaodong Cai, Hai-Tao Zheng, Xin Su, Zifei Shan, Ruitong Liu, Hong-Gee Kim
2025-08-25
As queries in retrieval-augmented generation (RAG) pipelines powered by large
language models (LLMs) become increasingly complex and diverse, dense retrieval
models have demonstrated strong performance in semantic matching. Nevertheless,
they often struggle with fine-grained retrieval tasks, where precise keyword
alignment and span-level localization are required, even in cases with high
lexical overlap that would intuitively suggest easier retrieval. To
systematically evaluate this limitation, we introduce two targeted tasks,
keyword retrieval and part-of-passage retrieval, designed to simulate practical
fine-grained scenarios. Motivated by these observations, we propose
LexSemBridge, a unified framework that enhances dense query representations
through fine-grained, input-aware vector modulation. LexSemBridge constructs
latent enhancement vectors from input tokens using three paradigms: Statistical
(SLR), Learned (LLR), and Contextual (CLR), and integrates them with dense
embeddings via element-wise interaction. Theoretically, we show that this
modulation preserves the semantic direction while selectively amplifying
discriminative dimensions. LexSemBridge operates as a plug-in without modifying
the backbone encoder and naturally extends to both text and vision modalities.
Extensive experiments across semantic and fine-grained retrieval tasks validate
the effectiveness and generality of our approach. All code and models are
publicly available at https://github.com/Jasaxion/LexSemBridge/
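A sketch (ours; the construction of the enhancement vector is left opaque, standing in for the SLR/LLR/CLR paradigms) of the element-wise interaction: a token-derived gate rescales dimensions of the dense embedding without flipping signs, which is why the semantic direction is preserved while discriminative dimensions are amplified.

import numpy as np

def modulate(dense_emb, enhancement, alpha=0.5):
    # enhancement in [0, 1]^d, built from the query's own tokens.
    gate = 1.0 + alpha * enhancement   # amplify, never flip sign
    out = dense_emb * gate             # element-wise interaction
    return out / np.linalg.norm(out)   # renormalize for retrieval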
Speculative Safety-Aware Decoding
Authors: Xuekang Wang, Shengyu Zhu, Xueqi Cheng
2025-08-25
Despite extensive efforts to align Large Language Models (LLMs) with human
values and safety rules, jailbreak attacks that exploit certain vulnerabilities
continuously emerge, highlighting the need to strengthen existing LLMs with
additional safety properties to defend against these attacks. However, tuning
large models has become increasingly resource-intensive and may have difficulty
ensuring consistent performance. We introduce Speculative Safety-Aware Decoding
(SSD), a lightweight decoding-time approach that equips LLMs with the desired
safety property while accelerating inference. We assume that there exists a
small language model that possesses this desired property. SSD integrates
speculative sampling during decoding and leverages the match ratio between the
small and composite models to quantify jailbreak risks. This enables SSD to
dynamically switch between decoding schemes to prioritize utility or safety,
handling the challenge of different model capacities. The output token is then
sampled from a new distribution that combines the distributions of the original
and the small models. Experimental results show that SSD successfully equips
the large model with the desired safety property, and also allows the model to
remain helpful to benign queries. Furthermore, SSD accelerates the inference
time, thanks to the speculative sampling design.
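A conceptual sketch (ours, not the paper's algorithm verbatim) of the decision rule: the match ratio from speculative sampling doubles as a jailbreak-risk signal, and the next-token distribution mixes the large and small models accordingly.

import numpy as np

def ssd_step(p_large, p_small, match_ratio, risk_threshold=0.5, rng=np.random):
    # High agreement: benign prompt, favor the large model's utility.
    # Low agreement: the prompt pulls the large model away from the
    # safety-tuned small model, so shift weight toward the safer distribution.
    lam = 1.0 if match_ratio >= risk_threshold else match_ratio
    mixed = lam * np.asarray(p_large) + (1.0 - lam) * np.asarray(p_small)
    mixed = mixed / mixed.sum()
    return rng.choice(len(mixed), p=mixed)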
CMFDNet Cross-Mamba and Feature Discovery Network for Polyp Segmentation
Authors: Feng Jiang, Zongfei Zhang, Xin Xu
2025-08-25
Automated colonic polyp segmentation is crucial for assisting doctors in
screening of precancerous polyps and diagnosis of colorectal neoplasms.
Although existing methods have achieved promising results, polyp segmentation
remains hindered by the following limitations: (1) significant
variation in polyp shapes and sizes, (2) indistinct boundaries between polyps
and adjacent tissues, and (3) the tendency for small-sized polyps to be
overlooked during segmentation. Driven by these practical difficulties, an innovative
architecture, CMFDNet, is proposed with the CMD module, MSA module, and FD
module. The CMD module, as an innovative decoder, introduces a
cross-scanning method to reduce blurry boundaries. The MSA module adopts a
multi-branch parallel structure to enhance the recognition ability for polyps
with diverse geometries and scale distributions. The FD module establishes
dependencies among all decoder features to alleviate the under-detection of
polyps with small-scale features. Experimental results show that CMFDNet
outperforms six SOTA methods used for comparison, especially on ETIS and
ColonDB datasets, where mDice scores exceed the best SOTA method by 1.83% and
1.55%, respectively.
CATformer Contrastive Adversarial Transformer for Image Super-Resolution
Authors: Qinyi Tian, Spence Cox, Laura E. Dalton
2025-08-25
Super-resolution remains a promising technique to enhance the quality of
low-resolution images. This study introduces CATformer (Contrastive Adversarial
Transformer), a novel neural network integrating diffusion-inspired feature
refinement with adversarial and contrastive learning. CATformer employs a
dual-branch architecture combining a primary diffusion-inspired transformer,
which progressively refines latent representations, with an auxiliary
transformer branch designed to enhance robustness to noise through learned
latent contrasts. These complementary representations are fused and decoded
using deep Residual-in-Residual Dense Blocks for enhanced reconstruction
quality. Extensive experiments on benchmark datasets demonstrate that CATformer
outperforms recent transformer-based and diffusion-inspired methods both in
efficiency and visual image quality. This work bridges the performance gap
among transformer-, diffusion-, and GAN-based methods, laying a foundation for
practical applications of diffusion-inspired transformers in super-resolution.
CoCoA Confidence and Context-Aware Adaptive Decoding for Resolving Knowledge Conflicts in Large Language Models
Authors: Anant Khandelwal, Manish Gupta, Puneet Agrawal
2025-08-25
Faithful generation in large language models (LLMs) is challenged by
knowledge conflicts between parametric memory and external context. Existing
contrastive decoding methods tuned specifically to handle conflict often lack
adaptability and can degrade performance in low conflict settings. We introduce
CoCoA (Confidence- and Context-Aware Adaptive Decoding), a novel token-level
decoding algorithm for principled conflict resolution and enhanced faithfulness. CoCoA
resolves conflict by utilizing confidence-aware measures (entropy gap and
contextual peakedness) and the generalized divergence between the parametric
and contextual distributions. Crucially, CoCoA maintains strong performance
even in low conflict settings. Extensive experiments across multiple LLMs on
diverse Question Answering (QA), Summarization, and Long-Form Question
Answering (LFQA) benchmarks demonstrate CoCoA's state-of-the-art performance
over strong baselines like AdaCAD. It yields significant gains in QA accuracy,
up to 9.2 points on average compared to the strong baseline AdaCAD, and
improves factuality in summarization and LFQA by up to 2.5 points on average
across key benchmarks. Additionally, it demonstrates superior sensitivity to
conflict variations. CoCoA enables more informed, context-aware, and ultimately
more faithful token generation.
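A sketch (ours; CoCoA's actual measures are richer) of the token-level mechanism: an entropy gap between the parametric and contextual distributions sets a per-token mixing weight, so sharply peaked contextual evidence pulls generation toward the context while flat context leaves the parametric distribution mostly untouched.

import numpy as np

def adaptive_mix(p_param, p_ctx, eps=1e-9):
    H = lambda p: -np.sum(p * np.log(p + eps))  # Shannon entropy
    gap = H(p_param) - H(p_ctx)                 # context more peaked -> gap > 0
    gamma = 1.0 / (1.0 + np.exp(-gap))          # squash to a mixing weight
    mixed = (1.0 - gamma) * np.asarray(p_param) + gamma * np.asarray(p_ctx)
    return mixed / mixed.sum()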