2025-11-07
Table of Contents
- Whisper Leak: a side-channel attack on Large Language Models
- Towards Transparent Stance Detection: A Zero-Shot Approach Using Implicit and Explicit Interpretability
- PerfDojo: Automated ML Library Generation for Heterogeneous Architectures
- RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse
- SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding
- DRL-Based Robust Multi-Timescale Anti-Jamming Approaches under State Uncertainty
- UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM
- Characterising Global Platforms: Centralised, Decentralised, Federated, and Grassroots
- Provable Separations between Memorization and Generalization in Diffusion Models
- A Quantized VAE-MLP Botnet Detection Model: A Systematic Evaluation of Quantization-Aware Training and Post-Training Quantization Strategies
- AI as We Describe It: How Large Language Models and Their Applications in Health are Represented Across Channels of Public Discourse
- Large language models require a new form of oversight: capability-based monitoring
- SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
- LogicSparse: Enabling Engine-Free Unstructured Sparsity for Quantised Deep-learning Accelerators
Whisper Leak: a side-channel attack on Large Language Models
Authors: Geoff McDonald, Jonathan Bar Or
2025-11-05
Large Language Models (LLMs) are increasingly deployed in sensitive domains
including healthcare, legal services, and confidential communications, where
privacy is paramount. This paper introduces Whisper Leak, a side-channel attack
that infers user prompt topics from encrypted LLM traffic by analyzing packet
size and timing patterns in streaming responses. Despite TLS encryption
protecting content, these metadata patterns leak sufficient information to
enable topic classification. We demonstrate the attack across 28 popular LLMs
from major providers, achieving near-perfect classification (often >98% AUPRC)
and high precision even at extreme class imbalance (10,000:1 noise-to-target
ratio). For many models, we achieve 100% precision in identifying sensitive
topics like "money laundering" while recovering 5-20% of target conversations.
This industry-wide vulnerability poses significant risks for users under
network surveillance by ISPs, governments, or local adversaries. We evaluate
three mitigation strategies - random padding, token batching, and packet
injection - finding that while each reduces attack effectiveness, none provides
complete protection. Through responsible disclosure, we have collaborated with
providers to implement initial countermeasures. Our findings underscore the
need for LLM providers to address metadata leakage as AI systems handle
increasingly sensitive information.
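
The random-padding mitigation named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `pad_chunk` and `unpad_chunk`, the `max_pad` bound, and the 2-byte length-prefix framing are hypothetical and are not taken from the paper or from any provider's implementation.

```python
import secrets


def pad_chunk(payload: bytes, max_pad: int = 64) -> bytes:
    """Frame one streamed chunk with a random amount of padding.

    Hypothetical layout: 2-byte big-endian payload length, the payload,
    then between 0 and max_pad zero bytes.
    """
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 2-byte length prefix")
    pad_len = secrets.randbelow(max_pad + 1)  # uniform in [0, max_pad]
    return len(payload).to_bytes(2, "big") + payload + b"\x00" * pad_len


def unpad_chunk(frame: bytes) -> bytes:
    """Recover the payload by reading the length prefix and dropping the padding."""
    n = int.from_bytes(frame[:2], "big")
    return frame[2:2 + n]


# Token-sized chunks as a streaming response might emit them (made-up data).
tokens = [b"Hel", b"lo", b", wor", b"ld"]
frames = [pad_chunk(t) for t in tokens]
sizes = [len(f) for f in frames]  # what an on-path observer would see
```

With padding applied, the observed frame sizes no longer map one-to-one to token lengths. The sketch only blurs the size channel and leaves inter-packet timing untouched, which is in line with the abstract's finding that no single mitigation gives complete protection.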
Towards Transparent Stance Detection: A Zero-Shot Approach Using Implicit and Explicit Interpretability
Authors: Apoorva Upadhyaya, Wolfgang Nejdl, Marco Fisichella
2025-11-05
Zero-Shot Stance Detection (ZSSD) identifies the attitude of the post toward
unseen targets. Existing research using contrastive, meta-learning, or data
augmentation suffers from generalizability issues or lack of coherence between
text and target. Recent works leveraging large language models (LLMs) for ZSSD
focus either on improving unseen target-specific knowledge or generating
explanations for stance analysis. However, most of these works are limited by
their over-reliance on explicit reasoning, provide coarse explanations that
lack nuance, and do not explicitly model the reasoning process, making it
difficult to interpret the model's predictions. To address these issues, in our
study, we develop a novel interpretable ZSSD framework, IRIS. We provide an
interpretable understanding of the attitude of the input towards the target
implicitly based on sequences within the text (implicit rationales) and
explicitly based on linguistic measures (explicit rationales). IRIS considers
stance detection as an information retrieval ranking task, understanding the
relevance of implicit rationales for different stances to guide the model
towards correct predictions without requiring the ground-truth of rationales,
thus providing inherent interpretability. In addition, explicit rationales
based on communicative features help decode the emotional and cognitive
dimensions of stance, offering an interpretable understanding of the author's
attitude towards the given target. Extensive experiments on the benchmark
datasets of VAST, EZ-STANCE, P-Stance, and RFD using 50%, 30%, and even 10%
training data prove the generalizability of our model, benefiting from the
proposed architecture and interpretable design.
PerfDojo: Automated ML Library Generation for Heterogeneous Architectures
Authors: Andrei Ivanov, Siyuan Shen, Gioele Gottardo, Marcin Chrapek, Afif Boudaoud, Timo Schneider, Luca Benini, Torsten Hoefler
2025-11-05
The increasing complexity of machine learning models and the proliferation of
diverse hardware architectures (CPUs, GPUs, accelerators) make achieving
optimal performance a significant challenge. Heterogeneity in instruction sets,
specialized kernel requirements for different data types and model features
(e.g., sparsity, quantization), and architecture-specific optimizations
complicate performance tuning. Manual optimization is resource-intensive, while
existing automatic approaches often rely on complex hardware-specific
heuristics and uninterpretable intermediate representations, hindering
performance portability. We introduce PerfLLM, a novel automatic optimization
methodology leveraging Large Language Models (LLMs) and Reinforcement Learning
(RL). Central to this is PerfDojo, an environment framing optimization as an RL
game using a human-readable, mathematically-inspired code representation that
guarantees semantic validity through transformations. This allows effective
optimization without prior hardware knowledge, facilitating both human analysis
and RL agent training. We demonstrate PerfLLM's ability to achieve significant
performance gains across diverse CPU (x86, Arm, RISC-V) and GPU architectures.
RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse
Authors: Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, Luo Mai
2025-11-05
Retrieval-augmented generation (RAG) enhances large language models (LLMs)
with retrieved context but often suffers from downgraded prefill performance as
modern applications demand longer and more complex inputs. Existing caching
techniques either preserve accuracy with low cache reuse or improve reuse at
the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG
system that achieves high cache reuse without sacrificing accuracy through
accuracy-preserving context reuse. RAGBoost detects overlapping retrieved items
across concurrent sessions and multi-turn interactions, using efficient context
indexing, ordering, and de-duplication to maximize reuse, while lightweight
contextual hints maintain reasoning fidelity. It integrates seamlessly with
existing LLM inference engines and improves their prefill performance by 1.5-3X
over state-of-the-art methods, while preserving or even enhancing reasoning
accuracy across diverse RAG and agentic AI workloads. Our code is released at:
https://github.com/Edinburgh-AgenticAI/RAGBoost.
SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding
Authors: Mauro Orazio Drago, Luca Carlini, Pelinsu Celebi Balyemez, Dennis Pierantozzi, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque
2025-11-05
Video Question Answering (VideoQA) in the surgical domain aims to enhance
intraoperative understanding by enabling AI models to reason over temporally
coherent events rather than isolated frames. Current approaches are limited to
static image features, and available datasets often lack temporal annotations,
ignoring the dynamics critical for accurate procedural interpretation. We
propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from
static images to dynamic surgical scenes. It uses a Masked Video-Text Encoder
to fuse video and question features, capturing temporal cues such as motion and
tool-tissue interactions, which a fine-tuned large language model (LLM) then
decodes into coherent answers. To evaluate its performance, we curated
REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related
questions and diagnostic attributes, as well as out-of-template questions with
rephrased or semantically altered formulations to assess model robustness.
Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset
shows that SurgViVQA outperforms existing image-based VQA benchmark models,
particularly in keyword accuracy, improving over PitVQA by +11% on
REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions
further confirms improved generalizability and robustness to variations in
question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework
for temporally-aware understanding in surgical VideoQA, enabling AI models to
interpret dynamic procedural contexts more effectively. Code and dataset
available at https://github.com/madratak/SurgViVQA.
DRL-Based Robust Multi-Timescale Anti-Jamming Approaches under State Uncertainty
Authors: Haoqin Zhao, Zan Li, Jiangbo Si, Rui Huang, Hang Hu, Tony Q. S. Quek, Naofal Al-Dhahir
2025-11-05
Owing to the openness of wireless channels, wireless communication systems
are highly susceptible to malicious jamming. Most existing anti-jamming methods
rely on the assumption of accurate sensing and optimize parameters on a single
timescale. However, such methods overlook two practical issues: mismatched
execution latencies across heterogeneous actions and measurement errors caused
by sensor imperfections. Especially for deep reinforcement learning (DRL)-based
methods, the inherent sensitivity of neural networks implies that even minor
perturbations in the input can mislead the agent into choosing suboptimal
actions, with potentially severe consequences. To ensure reliable wireless
transmission, we establish a multi-timescale decision model that incorporates
state uncertainty. Subsequently, we propose two robust schemes that sustain
performance under bounded sensing errors. First, a Projected Gradient
Descent-assisted Double Deep Q-Network (PGD-DDQN) algorithm is designed, which
derives worst-case perturbations under a norm-bounded error model and applies
PGD during training for robust optimization. Second, a Nonlinear Q-Compression
DDQN (NQC-DDQN) algorithm introduces a nonlinear compression mechanism that
adaptively contracts Q-value ranges to eliminate action aliasing. Simulation
results indicate that, compared with the perfect-sensing baseline, the proposed
algorithms show only minor degradation in anti-jamming performance while
maintaining robustness under various perturbations, thereby validating their
practicality in imperfect sensing conditions.
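
The idea of deriving a worst-case, norm-bounded perturbation with projected gradient descent can be sketched on a toy problem. The sketch below makes several assumptions that are not in the abstract: a linear Q-function stands in for the DDQN network, the attack objective is to lower the Q-value of the clean greedy action, the error bound is an L-infinity ball of radius `eps`, and all names and numbers are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def q_values(weights, bias, state):
    """Toy linear Q-function: Q(s, a) = weights[a] . s + bias[a]."""
    return [sum(w * s for w, s in zip(row, state)) + b
            for row, b in zip(weights, bias)]


def pgd_perturbation(weights, bias, state, eps=0.1, step=0.05, iters=10):
    """L-infinity PGD that lowers the Q-value of the clean greedy action."""
    q_clean = q_values(weights, bias, state)
    greedy = max(range(len(q_clean)), key=q_clean.__getitem__)
    grad = weights[greedy]  # dQ(s, greedy)/ds for a linear Q-function
    delta = [0.0] * len(state)
    for _ in range(iters):
        # Signed gradient descent step on the greedy action's Q-value ...
        delta = [d - step * sign(g) for d, g in zip(delta, grad)]
        # ... followed by projection back onto the eps-ball.
        delta = [max(-eps, min(eps, d)) for d in delta]
    return delta


weights = [[1.0, 0.0], [0.0, 1.0]]  # two actions, two state features
bias = [0.0, 0.0]
state = [0.52, 0.50]                # action 0 is greedy on the clean state

delta = pgd_perturbation(weights, bias, state)
noisy_state = [s + d for s, d in zip(state, delta)]
q_noisy = q_values(weights, bias, noisy_state)
noisy_greedy = max(range(len(q_noisy)), key=q_noisy.__getitem__)
```

In this toy case a sensing error bounded by 0.1 is enough to flip the greedy action from 0 to 1. Robust training in the spirit of the abstract would feed such perturbed states back into the learning loop; that step is omitted here.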
UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM
Authors: Hai Huang, Xuhong Qiang, Weisheng Zhao, Chenchen Liu
2025-11-05
Large Language Models (LLMs) are increasingly deployed on edge devices with
Neural Processing Units (NPUs), yet the decode phase remains memory-intensive,
limiting performance. Processing-in-Memory (PIM) offers a promising solution,
but co-executing NPU-PIM systems face challenges such as data layout
mismatches, bandwidth loss, and redundant storage. To address these issues, we
propose UMDAM, a unified memory-affinity data layout and DRAM address mapping
scheme tailored for NPU-PIM co-execution. UMDAM employs a column-major,
tile-based layout and a configurable DRAM mapping strategy to ensure
compatibility with NPU computation while maximizing PIM efficiency -- without
introducing extra memory overhead or bandwidth loss. Comprehensive evaluations
on OPT models demonstrate that UMDAM reduces time-to-first-token (TTFT) by up
to 3.0x and time-to-last-token (TTLT) by 2.18x, significantly improving
end-to-end LLM inference efficiency on edge devices.
Characterising Global Platforms: Centralised, Decentralised, Federated, and Grassroots
Authors: Ehud Shapiro
2025-11-05
Global digital platforms are software systems designed to serve entire
populations, with some already serving billions of people. We propose atomic
transactions-based multiagent transition systems and protocols as a formal
framework to study them; introduce essential agents -- minimal sets of agents
the removal of which makes communication impossible; and show that the
cardinality of essential agents partitions all global platforms into four
classes:
1. Centralised -- one (the server)
2. Decentralised -- finite (bootstrap nodes)
3. Federated -- infinite but not universal (all servers)
4. Grassroots -- universal (all agents)
Our illustrative formal example is a global social network, for which we
provide centralised, decentralised, federated, and grassroots specifications
via multiagent atomic transactions, and prove they all satisfy the same basic
correctness properties. We discuss informally additional global platforms --
currencies, "sharing economy" apps, AI, and more. While this may be the first
characterisation of centralised, decentralised, and federated global platforms,
grassroots platforms have been formally defined previously, but using different
notions. Here, we prove that their original definition implies that all agents
are essential, placing grassroots platforms in a distinct class within the
broader formal context that includes all global platforms. This work provides
the first mathematical framework for classifying any global platform --
existing or imagined -- by providing a multiagent atomic-transactions
specification of it and determining the cardinality of the minimal set of
essential agents in the ensuing multiagent protocol. It thus provides a
unifying mathematical approach for the study of global digital platforms,
perhaps the most important class of computer systems today.
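
The four-way classification by cardinality of the essential-agent set can be written down as a small lookup. This is only an illustrative encoding, not the paper's formalism: the names `PlatformClass` and `classify`, the use of `None` for an infinite set, and the treatment of "finite" as "more than one" are all assumptions.

```python
from enum import Enum
from typing import Optional


class PlatformClass(Enum):
    CENTRALISED = "centralised"      # one essential agent (the server)
    DECENTRALISED = "decentralised"  # finitely many (bootstrap nodes)
    FEDERATED = "federated"          # infinite but not universal (all servers)
    GRASSROOTS = "grassroots"        # universal (all agents)


def classify(essential_count: Optional[int], universal: bool = False) -> PlatformClass:
    """Map the size of the minimal essential-agent set to a platform class.

    essential_count=None encodes an infinite set (hypothetical encoding);
    universal=True means every agent is essential.
    """
    if universal:
        return PlatformClass.GRASSROOTS
    if essential_count is None:
        return PlatformClass.FEDERATED
    if essential_count < 1:
        raise ValueError("a platform needs at least one essential agent")
    if essential_count == 1:
        return PlatformClass.CENTRALISED
    return PlatformClass.DECENTRALISED
```

For example, `classify(1)` yields the centralised class, while `classify(None, universal=True)` yields the grassroots class.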
Provable Separations between Memorization and Generalization in Diffusion Models
Authors: Zeqi Ye, Qijie Zhu, Molei Tao, Minshuo Chen
2025-11-05
Diffusion models have achieved remarkable success across diverse domains, but
they remain vulnerable to memorization -- reproducing training data rather than
generating novel outputs. This not only limits their creative potential but
also raises concerns about privacy and safety. While empirical studies have
explored mitigation strategies, theoretical understanding of memorization
remains limited. We address this gap through developing a dual-separation
result via two complementary perspectives: statistical estimation and network
approximation. From the estimation side, we show that the ground-truth score
function does not minimize the empirical denoising loss, creating a separation
that drives memorization. From the approximation side, we prove that
implementing the empirical score function requires network size to scale with
sample size, spelling a separation compared to the more compact network
representation of the ground-truth score function. Guided by these insights, we
develop a pruning-based method that reduces memorization while maintaining
generation quality in diffusion transformers.
A Quantized VAE-MLP Botnet Detection Model: A Systematic Evaluation of Quantization-Aware Training and Post-Training Quantization Strategies
Authors: Hassan Wasswa, Hussein Abbass, Timothy Lynar
2025-11-05
In an effort to counter the increasing IoT botnet-based attacks,
state-of-the-art deep learning methods have been proposed and have achieved
impressive detection accuracy. However, their computational intensity restricts
deployment on resource-constrained IoT devices, creating a critical need for
lightweight detection models. A common solution to this challenge is model
compression via quantization. This study proposes a VAE-MLP model framework
where an MLP-based classifier is trained on 8-dimensional latent vectors
derived from the high-dimensional training data using the encoder component of a
pretrained variational autoencoder (VAE). Two widely used quantization
strategies--Quantization-Aware Training (QAT) and Post-Training Quantization
(PTQ)--are then systematically evaluated in terms of their impact on detection
performance, storage efficiency, and inference latency using two benchmark IoT
botnet datasets--N-BaIoT and CICIoT2022. The results revealed that, with
respect to detection accuracy, the QAT strategy experienced a more noticeable
decline, whereas PTQ incurred only a marginal reduction compared to the original
unquantized model. Furthermore, PTQ yielded a 6x speedup and 21x reduction in
size, while QAT achieved a 3x speedup and 24x compression, demonstrating the
practicality of quantization for device-level IoT botnet detection.
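
As a rough illustration of what a post-training step does to stored weights, here is a minimal sketch of symmetric per-tensor int8 rounding. It is a generic textbook scheme, not the pipeline evaluated in the paper; the function names and the sample weights are made up for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.03, 1.0]   # made-up trained weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of the four bytes of a float32 value, and the rounding error per weight is bounded by half the scale. Quantization-aware training differs in that this rounding is simulated during training, so the weights can adapt to it.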
AI as We Describe It: How Large Language Models and Their Applications in Health are Represented Across Channels of Public Discourse
Authors: Jiawei Zhou, Lei Zhang, Mei Li, Benjamin D Horne, Munmun De Choudhury
2025-11-05
Representation shapes public attitudes and behaviors. With the arrival and
rapid adoption of LLMs, the way these systems are introduced will negotiate
societal expectations for their role in high-stakes domains like health. Yet it
remains unclear whether current narratives present a balanced view. We analyzed
five prominent discourse channels (news, research press, YouTube, TikTok, and
Reddit) over a two-year period on lexical style, informational content, and
symbolic representation. Discussions were generally positive and episodic, with
positivity increasing over time. Risk communication was unthorough and often
reduced to information quality incidents, while explanations of LLMs'
generative nature were rare. Compared with professional outlets, TikTok and
Reddit highlighted wellbeing applications and showed greater variations in tone
and anthropomorphism but little attention to risks. We discuss implications for
public discourse as a diagnostic tool in identifying literacy and governance
gaps, and for communication and design strategies to support more informed
engagement.
Large language models require a new form of oversight: capability-based monitoring
Authors: Katherine C. Kellogg, Bingyang Ye, Yifan Hu, Guergana K. Savova, Byron Wallace, Danielle S. Bitterman
2025-11-05
The rapid adoption of large language models (LLMs) in healthcare has been
accompanied by scrutiny of their oversight. Existing monitoring approaches,
inherited from traditional machine learning (ML), are task-based and founded on
assumed performance degradation arising from dataset drift. In contrast, with
LLMs, inevitable model degradation due to changes in populations compared to
the training dataset cannot be assumed, because LLMs were not trained for any
specific task in any given population. We therefore propose a new organizing
principle guiding generalist LLM monitoring that is scalable and grounded in
how these models are developed and used in practice: capability-based
monitoring. Capability-based monitoring is motivated by the fact that LLMs are
generalist systems whose overlapping internal capabilities are reused across
numerous downstream tasks. Instead of evaluating each downstream task
independently, this approach organizes monitoring around shared model
capabilities, such as summarization, reasoning, translation, or safety
guardrails, in order to enable cross-task detection of systemic weaknesses,
long-tail errors, and emergent behaviors that task-based monitoring may miss.
We describe considerations for developers, organizational leaders, and
professional societies for implementing a capability-based monitoring approach.
Ultimately, capability-based monitoring will provide a scalable foundation for
safe, adaptive, and collaborative monitoring of LLMs and future generalist
artificial intelligence models in healthcare.
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
Authors: Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Haggstrom, Guangtao Wang, Shubhangi Upasani, Ayush Sachdeva, Rui Li, Faline Fu, Chen Wu, Ayesha Siddiqua, John Long, Tuowen Zhao, Matheen Musaddiq, Hakan Zeffer, Yun Du, Mingran Wang, Qinghua Li, Bo Li, Urmish Thakker, Raghu Prabhakar
2025-11-05
The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+
context length support has resulted in increasing demands for on-chip memory
to support large KV caches. Techniques such as StreamingLLM and SnapKV
demonstrate how to control KV cache size while maintaining model accuracy. Yet,
these techniques are not commonly used within industrial deployments using
frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static
graphs and continuous batching methodology employed by these frameworks make it
difficult to admit modifications to the standard multi-head attention
algorithm, while on the other hand, the accuracy implications of such
techniques on modern instruction-following and reasoning models are not well
understood, obfuscating the need for implementing these techniques. In this
paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and
DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be
deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way
tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators
running at 128k context length and up to 1832 tokens per second in a real
production setting. SnapStream enables improved on-chip memory usage
and introduces minimal accuracy degradation on LongBench-v2, AIME24 and
LiveCodeBench. To the best of our knowledge, this is the first implementation
of sparse KV attention techniques deployed in a production inference system
with static graphs and continuous batching.
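
One well-known way to keep the attention cache at a fixed size during long decoding is to retain a few initial "sink" positions plus a sliding window of the most recent positions. The sketch below illustrates only that generic eviction policy; it is not SnapStream's algorithm, and the class name `SinkWindowCache`, its parameters, and the string placeholders for cached tensors are hypothetical.

```python
from collections import deque


class SinkWindowCache:
    """Fixed-size cache: the first num_sink entries plus a sliding window."""

    def __init__(self, num_sink=4, window=8):
        self.num_sink = num_sink
        self.sink = []                      # earliest positions, never evicted
        self.recent = deque(maxlen=window)  # most recent positions

    def append(self, position, kv):
        if len(self.sink) < self.num_sink:
            self.sink.append((position, kv))
        else:
            # A full deque silently drops its oldest entry on append.
            self.recent.append((position, kv))

    def positions(self):
        return [p for p, _ in self.sink] + [p for p, _ in self.recent]


cache = SinkWindowCache(num_sink=2, window=3)
for pos in range(10):
    cache.append(pos, kv=f"kv{pos}")  # placeholder for per-token key/value tensors
kept = cache.positions()
```

After ten decoding steps the cache holds five entries: positions 0 and 1 as sinks and positions 7 to 9 as the recent window. Memory use therefore stays constant no matter how long the sequence grows, which is the property that matters when on-chip memory is the constraint.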
LogicSparse: Enabling Engine-Free Unstructured Sparsity for Quantised Deep-learning Accelerators
Authors: Changhong Li, Biswajit Basu, Shreejith Shanker
2025-11-05
FPGAs have been shown to be a promising platform for deploying Quantised
Neural Networks (QNNs) with high-speed, low-latency, and energy-efficient
inference. However, the complexity of modern deep-learning models limits the
performance on resource-constrained edge devices. While quantisation and
pruning alleviate these challenges, unstructured sparsity remains
underexploited due to irregular memory access. This work introduces a framework
that embeds unstructured sparsity into dataflow accelerators, eliminating the
need for dedicated sparse engines and preserving parallelism. A hardware-aware
pruning strategy is introduced to improve efficiency and design flow further.
On LeNet-5, the framework attains 51.6x compression and 1.23x throughput
improvement using only 5.12% of LUTs, effectively exploiting unstructured
sparsity for QNN acceleration.