2025-11-07

Whisper Leak a side-channel attack on Large Language Models
Towards Transparent Stance Detection A Zero-Shot Approach Using Implicit and Explicit Interpretability
PerfDojo Automated ML Library Generation for Heterogeneous Architectures
RAGBoost Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse
SurgViVQA Temporally-Grounded Video Question Answering for Surgical Scene Understanding
DRL-Based Robust Multi-Timescale Anti-Jamming Approaches under State Uncertainty
UMDAM A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM
Characterising Global Platforms Centralised, Decentralised, Federated, and Grassroots
Provable Separations between Memorization and Generalization in Diffusion Models
A Quantized VAE-MLP Botnet Detection Model A Systematic Evaluation of Quantization-Aware Training and Post-Training Quantization Strategies
AI as We Describe It How Large Language Models and Their Applications in Health are Represented Across Channels of Public Discourse
Large language models require a new form of oversight capability-based monitoring
SnapStream Efficient Long Sequence Decoding on Dataflow Accelerators
LogicSparse Enabling Engine-Free Unstructured Sparsity for Quantised Deep-learning Accelerators

Whisper Leak a side-channel attack on Large Language Models

Authors: Geoff McDonald, Jonathan Bar Or

2025-11-05

http://arxiv.org/abs/2511.03675v1

Large Language Models (LLMs) are increasingly deployed in sensitive domains including healthcare, legal services, and confidential communications, where privacy is paramount. This paper introduces Whisper Leak, a side-channel attack that infers user prompt topics from encrypted LLM traffic by analyzing packet size and timing patterns in streaming responses. Despite TLS encryption protecting content, these metadata patterns leak sufficient information to enable topic classification. We demonstrate the attack across 28 popular LLMs from major providers, achieving near-perfect classification (often >98% AUPRC) and high precision even at extreme class imbalance (10,000:1 noise-to-target ratio). For many models, we achieve 100% precision in identifying sensitive topics like "money laundering" while recovering 5-20% of target conversations. This industry-wide vulnerability poses significant risks for users under network surveillance by ISPs, governments, or local adversaries. We evaluate three mitigation strategies - random padding, token batching, and packet injection - finding that while each reduces attack effectiveness, none provides complete protection. Through responsible disclosure, we have collaborated with providers to implement initial countermeasures. Our findings underscore the need for providers to address metadata leakage as AI systems handle increasingly sensitive information.
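As a rough illustration of the random-padding mitigation named in the abstract, the sketch below frames each streamed chunk with a length prefix and a random amount of padding so that the on-wire size no longer tracks the token length exactly. The framing format, the function names `pad_chunk` and `unpad_chunk`, and the `max_pad` parameter are assumptions made for illustration; this is not the countermeasure any provider deployed, and, as the abstract notes, padding alone does not remove the leak (timing is untouched here).

```python
import secrets


def pad_chunk(chunk: bytes, max_pad: int = 64) -> bytes:
    """Frame a chunk as: 2-byte payload length, payload, random zero padding.

    Hypothetical framing: the receiver uses the length prefix to strip the
    padding, while an observer only sees the padded frame size.
    """
    if len(chunk) > 0xFFFF:
        raise ValueError("chunk too large for a 2-byte length prefix")
    pad_len = secrets.randbelow(max_pad + 1)  # uniform in [0, max_pad]
    return len(chunk).to_bytes(2, "big") + chunk + b"\x00" * pad_len


def unpad_chunk(frame: bytes) -> bytes:
    """Recover the original payload from a padded frame."""
    payload_len = int.from_bytes(frame[:2], "big")
    return frame[2:2 + payload_len]
```

A larger `max_pad` hides more size information at the cost of more bandwidth, which is the trade-off any padding scheme of this kind has to tune.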

Towards Transparent Stance Detection A Zero-Shot Approach Using Implicit and Explicit Interpretability

Authors: Apoorva Upadhyaya, Wolfgang Nejdl, Marco Fisichella

2025-11-05

http://arxiv.org/abs/2511.03635v1

Zero-Shot Stance Detection (ZSSD) identifies the attitude of a post toward unseen targets. Existing research using contrastive, meta-learning, or data augmentation suffers from generalizability issues or lack of coherence between text and target. Recent works leveraging large language models (LLMs) for ZSSD focus either on improving unseen target-specific knowledge or generating explanations for stance analysis. However, most of these works are limited by their over-reliance on explicit reasoning, provide coarse explanations that lack nuance, and do not explicitly model the reasoning process, making it difficult to interpret the model's predictions. To address these issues, in our study, we develop a novel interpretable ZSSD framework, IRIS. We provide an interpretable understanding of the attitude of the input towards the target implicitly based on sequences within the text (implicit rationales) and explicitly based on linguistic measures (explicit rationales). IRIS considers stance detection as an information retrieval ranking task, understanding the relevance of implicit rationales for different stances to guide the model towards correct predictions without requiring the ground-truth of rationales, thus providing inherent interpretability. In addition, explicit rationales based on communicative features help decode the emotional and cognitive dimensions of stance, offering an interpretable understanding of the author's attitude towards the given target. Extensive experiments on the benchmark datasets of VAST, EZ-STANCE, P-Stance, and RFD using 50%, 30%, and even 10% training data prove the generalizability of our model, benefiting from the proposed architecture and interpretable design.

PerfDojo Automated ML Library Generation for Heterogeneous Architectures

Authors: Andrei Ivanov, Siyuan Shen, Gioele Gottardo, Marcin Chrapek, Afif Boudaoud, Timo Schneider, Luca Benini, Torsten Hoefler

2025-11-05

http://arxiv.org/abs/2511.03586v1

The increasing complexity of machine learning models and the proliferation of diverse hardware architectures (CPUs, GPUs, accelerators) make achieving optimal performance a significant challenge. Heterogeneity in instruction sets, specialized kernel requirements for different data types and model features (e.g., sparsity, quantization), and architecture-specific optimizations complicate performance tuning. Manual optimization is resource-intensive, while existing automatic approaches often rely on complex hardware-specific heuristics and uninterpretable intermediate representations, hindering performance portability. We introduce PerfLLM, a novel automatic optimization methodology leveraging Large Language Models (LLMs) and Reinforcement Learning (RL). Central to this is PerfDojo, an environment framing optimization as an RL game using a human-readable, mathematically-inspired code representation that guarantees semantic validity through transformations. This allows effective optimization without prior hardware knowledge, facilitating both human analysis and RL agent training. We demonstrate PerfLLM's ability to achieve significant performance gains across diverse CPU (x86, Arm, RISC-V) and GPU architectures.

RAGBoost Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse

Authors: Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, Luo Mai

2025-11-05

http://arxiv.org/abs/2511.03475v1

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with retrieved context but often suffers from downgraded performance as modern applications demand longer and more complex inputs. Existing caching techniques either preserve accuracy with low reuse or improve reuse at the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG system that achieves high reuse without sacrificing accuracy through accuracy-preserving context reuse. RAGBoost detects overlapping retrieved items across concurrent sessions and multi-turn interactions, using efficient context indexing, ordering, and de-duplication to maximize reuse, while lightweight contextual hints maintain reasoning fidelity. It integrates seamlessly with existing inference engines and improves their performance by 1.5-3X over state-of-the-art methods, while preserving or even enhancing reasoning accuracy across diverse RAG and agentic AI workloads. Our code is released at: https://github.com/Edinburgh-AgenticAI/RAGBoost.

SurgViVQA Temporally-Grounded Video Question Answering for Surgical Scene Understanding

Authors: Mauro Orazio Drago, Luca Carlini, Pelinsu Celebi Balyemez, Dennis Pierantozzi, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque

2025-11-05

http://arxiv.org/abs/2511.03325v2

Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video-Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool-tissue interactions, which a fine-tuned large language model (LLM) then decodes into coherent answers. To evaluate its performance, we curated REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions and diagnostic attributes, as well as out-of-template questions with rephrased or semantically altered formulations to assess model robustness. Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset shows that SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework for temporally-aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively. Code and dataset available at https://github.com/madratak/SurgViVQA.

DRL-Based Robust Multi-Timescale Anti-Jamming Approaches under State Uncertainty

Authors: Haoqin Zhao, Zan Li, Jiangbo Si, Rui Huang, Hang Hu, Tony Q. S. Quek, Naofal Al-Dhahir

2025-11-05

http://arxiv.org/abs/2511.03305v1

Owing to the openness of wireless channels, wireless systems are highly susceptible to malicious jamming. Most existing anti-jamming methods rely on the assumption of accurate sensing and optimize parameters on a single timescale. However, such methods overlook two practical issues: mismatched execution latencies across heterogeneous actions and measurement errors caused by sensor imperfections. Especially for deep reinforcement learning (DRL)-based methods, the inherent sensitivity of neural networks implies that even minor perturbations in the input can mislead the agent into choosing suboptimal actions, with potentially severe consequences. To ensure reliable wireless transmission, we establish a multi-timescale decision model that incorporates state uncertainty. Subsequently, we propose two robust schemes that sustain performance under bounded sensing errors. First, a Projected Gradient Descent-assisted Double Deep Q-Network (PGD-DDQN) algorithm is designed, which derives worst-case perturbations under a norm-bounded error model and applies PGD during training for robust optimization. Second, a Nonlinear Q-Compression DDQN (NQC-DDQN) algorithm introduces a nonlinear mechanism that adaptively contracts Q-value ranges to eliminate action aliasing. Simulation results indicate that, compared with the perfect-sensing baseline, the proposed algorithms show only minor degradation in anti-jamming performance while maintaining robustness under various perturbations, thereby validating their practicality in imperfect sensing conditions.
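To make the idea of a worst-case, norm-bounded sensing perturbation concrete, here is a minimal sketch of projected gradient descent (PGD) against a toy linear Q-function. Everything specific is an assumption for illustration: the linear Q-model, the L-infinity bound `eps`, the step size, and the attack objective (lowering the Q-value of the agent's greedy action). The paper's PGD-DDQN operates on a neural network during training, which this sketch does not reproduce.

```python
def q_values(weights, state):
    """Toy linear Q-function: one weight row per action (illustrative only)."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def sign(x):
    return (x > 0) - (x < 0)


def pgd_perturb(weights, state, eps=0.1, step=0.05, iters=10):
    """Return a perturbed state within an L-infinity ball of radius eps.

    The perturbation pushes down the Q-value of the action that is greedy
    on the clean state. For a linear Q, the gradient of Q[action] with
    respect to the state is simply the corresponding weight row.
    """
    qs = q_values(weights, state)
    action = max(range(len(qs)), key=qs.__getitem__)
    grad = weights[action]
    adv = list(state)
    for _ in range(iters):
        # Signed gradient step that decreases Q[action].
        adv = [x - step * sign(g) for x, g in zip(adv, grad)]
        # Projection back onto the eps-ball around the clean state.
        adv = [min(max(x, s - eps), s + eps) for x, s in zip(adv, state)]
    return adv, action
```

In a robust-training loop, states perturbed this way would stand in for noisy sensor readings when computing the loss, so the agent learns a policy that stays stable under bounded measurement error.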

UMDAM A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM

Authors: Hai Huang, Xuhong Qiang, Weisheng Zhao, Chenchen Liu

2025-11-05

http://arxiv.org/abs/2511.03293v1

Large Language Models (LLMs) are increasingly deployed on edge devices with Neural Processing Units (NPUs), yet the decode phase remains memory-intensive, limiting performance. Processing-in-Memory (PIM) offers a promising solution, but co-executing NPU-PIM systems face challenges such as data layout mismatches, bandwidth loss, and redundant storage. To address these issues, we propose UMDAM, a unified memory-affinity data layout and DRAM address mapping scheme tailored for NPU-PIM co-execution. UMDAM employs a column-major, tile-based layout and a configurable DRAM mapping strategy to ensure compatibility with NPU computation while maximizing PIM efficiency -- without introducing extra memory overhead or bandwidth loss. Comprehensive evaluations on OPT models demonstrate that UMDAM reduces time-to-first-token (TTFT) by up to 3.0x and time-to-last-token (TTLT) by 2.18x, significantly improving end-to-end inference efficiency on edge devices.

Characterising Global Platforms Centralised, Decentralised, Federated, and Grassroots

Authors: Ehud Shapiro

2025-11-05

http://arxiv.org/abs/2511.03286v2

Global digital platforms are software systems designed to serve entire populations, with some already serving billions of people. We propose atomic transactions-based multiagent transition systems and protocols as a formal framework to study them; introduce essential agents -- minimal sets of agents the removal of which makes communication impossible; and show that the cardinality of essential agents partitions all global platforms into four classes: (1) Centralised -- one (the server); (2) Decentralised -- finite >1 (bootstrap nodes); (3) Federated -- infinite but not universal (all servers); (4) Grassroots -- universal (all agents). Our illustrative formal example is a global social network, for which we provide centralised, decentralised, federated, and grassroots specifications via multiagent atomic transactions, and prove they all satisfy the same basic correctness properties. We discuss informally additional global platforms -- currencies, "sharing economy" apps, AI, and more. While this may be the first characterisation of centralised, decentralised, and federated global platforms, grassroots platforms have been formally defined previously, but using different notions. Here, we prove that their original definition implies that all agents are essential, placing grassroots platforms in a distinct class within the broader formal context that includes all global platforms. This work provides the first mathematical framework for classifying any global platform -- existing or imagined -- by providing a multiagent atomic-transactions specification of it and determining the cardinality of the minimal set of essential agents in the ensuing multiagent protocol. It thus provides a unifying mathematical approach for the study of global digital platforms, perhaps the most important class of computer systems today.
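The four-way partition by cardinality of essential agents can be written down as a small lookup. This is only a restatement of the classes listed in the abstract; the function name `classify_platform` and the encoding (an integer or `math.inf` for the cardinality, plus a `universal` flag for "all agents") are assumptions chosen for illustration.

```python
import math


def classify_platform(num_essential, universal=False):
    """Map the cardinality of essential agents to a platform class.

    num_essential: positive int, or math.inf for an infinite set.
    universal: True when the essential set is all agents.
    """
    if universal:
        return "grassroots"        # universal (all agents)
    if num_essential == 1:
        return "centralised"       # one (the server)
    if num_essential == math.inf:
        return "federated"         # infinite but not universal (all servers)
    if isinstance(num_essential, int) and num_essential > 1:
        return "decentralised"     # finite > 1 (bootstrap nodes)
    raise ValueError("cardinality must be a positive integer or math.inf")
```

For example, a network that depends on a fixed handful of bootstrap nodes would be passed as `classify_platform(5)` and fall into the decentralised class.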

Provable Separations between Memorization and Generalization in Diffusion Models

Authors: Zeqi Ye, Qijie Zhu, Molei Tao, Minshuo Chen

2025-11-05

http://arxiv.org/abs/2511.03202v1

Diffusion models have achieved remarkable success across diverse domains, but they remain vulnerable to memorization -- reproducing training data rather than generating novel outputs. This not only limits their creative potential but also raises concerns about privacy and safety. While empirical studies have explored mitigation strategies, theoretical understanding of memorization remains limited. We address this gap by developing a dual-separation result via two complementary perspectives: statistical estimation and network approximation. From the estimation side, we show that the ground-truth score function does not minimize the empirical denoising loss, creating a separation that drives memorization. From the approximation side, we prove that implementing the empirical score function requires network size to scale with sample size, spelling a separation compared to the more compact network representation of the ground-truth score function. Guided by these insights, we develop a pruning-based method that reduces memorization while maintaining generation quality in diffusion transformers.

A Quantized VAE-MLP Botnet Detection Model A Systematic Evaluation of Quantization-Aware Training and Post-Training Quantization Strategies

Authors: Hassan Wasswa, Hussein Abbass, Timothy Lynar

2025-11-05

http://arxiv.org/abs/2511.03201v1

In an effort to counter the increasing IoT botnet-based attacks, state-of-the-art deep learning methods have been proposed and have achieved impressive detection accuracy. However, their computational intensity restricts deployment on resource-constrained IoT devices, creating a critical need for lightweight detection models. A common solution to this challenge is model compression via quantization. This study proposes a VAE-MLP model framework where an MLP-based classifier is trained on 8-dimensional latent vectors derived from the high-dimensional training data using the encoder component of a pretrained variational autoencoder (VAE). Two widely used strategies--Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ)--are then systematically evaluated in terms of their impact on detection performance, storage efficiency, and inference latency using two benchmark IoT botnet datasets--N-BaIoT and CICIoT2022. The results revealed that, with respect to detection accuracy, the QAT strategy experienced a more noticeable decline, whereas PTQ incurred only a marginal reduction compared to the original unquantized model. Furthermore, PTQ yielded a 6x speedup and 21x reduction in size, while QAT achieved a 3x speedup and 24x compression, demonstrating the practicality of quantization for device-level IoT botnet detection.
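For readers unfamiliar with what post-training quantization does to a weight tensor, the following is a minimal sketch of per-tensor affine int8 quantization in plain Python. It is a generic textbook scheme, not the pipeline evaluated in the paper; the function names and the choice of a single scale and zero point per tensor are assumptions for illustration.

```python
def quantize_int8(values):
    """Affine-quantize a list of floats to int8 codes in [-128, 127].

    Returns (codes, scale, zero_point) such that
    value is approximately (code - zero_point) * scale.
    """
    # Extend the range to include zero so that 0.0 is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]
```

Storing one byte per weight instead of a 32-bit float is where the size reduction comes from; the rounding error per weight is bounded by roughly one quantization step, which is why accuracy can degrade only slightly.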

AI as We Describe It How Large Language Models and Their Applications in Health are Represented Across Channels of Public Discourse

Authors: Jiawei Zhou, Lei Zhang, Mei Li, Benjamin D Horne, Munmun De Choudhury

2025-11-05

http://arxiv.org/abs/2511.03174v1

Representation shapes public attitudes and behaviors. With the arrival and rapid adoption of LLMs, the way these systems are introduced will negotiate societal expectations for their role in high-stakes domains like health. Yet it remains unclear whether current narratives present a balanced view. We analyzed five prominent discourse channels (news, research press, YouTube, TikTok, and Reddit) over a two-year period on lexical style, informational content, and symbolic representation. Discussions were generally positive and episodic, with positivity increasing over time. Risk communication was unthorough and often reduced to information quality incidents, while explanations of LLMs' generative nature were rare. Compared with professional outlets, TikTok and Reddit highlighted wellbeing applications and showed greater variations in tone and anthropomorphism but little attention to risks. We discuss implications for public discourse as a diagnostic tool in identifying literacy and governance gaps, and for communication and design strategies to support more informed engagement.

Large language models require a new form of oversight capability-based monitoring

Authors: Katherine C. Kellogg, Bingyang Ye, Yifan Hu, Guergana K. Savova, Byron Wallace, Danielle S. Bitterman

2025-11-05

http://arxiv.org/abs/2511.03106v1

The rapid adoption of large language models (LLMs) in healthcare has been accompanied by scrutiny of their oversight. Existing monitoring approaches, inherited from traditional machine learning (ML), are task-based and founded on assumed performance degradation arising from dataset drift. In contrast, with LLMs, inevitable model degradation due to changes in populations compared to the training dataset cannot be assumed, because LLMs were not trained for any specific task in any given population. We therefore propose a new organizing principle guiding generalist monitoring that is scalable and grounded in how these models are developed and used in practice: capability-based monitoring. Capability-based monitoring is motivated by the fact that LLMs are generalist systems whose overlapping internal capabilities are reused across numerous downstream tasks. Instead of evaluating each downstream task independently, this approach organizes monitoring around shared model capabilities, such as summarization, reasoning, translation, or safety guardrails, in order to enable cross-task detection of systemic weaknesses, long-tail errors, and emergent behaviors that task-based monitoring may miss. We describe considerations for developers, organizational leaders, and professional societies for implementing a capability-based monitoring approach. Ultimately, capability-based monitoring will provide a scalable foundation for safe, adaptive, and collaborative monitoring of LLMs and future generalist artificial intelligence models in healthcare.

SnapStream Efficient Long Sequence Decoding on Dataflow Accelerators

Authors: Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Haggstrom, Guangtao Wang, Shubhangi Upasani, Ayush Sachdeva, Rui Li, Faline Fu, Chen Wu, Ayesha Siddiqua, John Long, Tuowen Zhao, Matheen Musaddiq, Hakan Zeffer, Yun Du, Mingran Wang, Qinghua Li, Bo Li, Urmish Thakker, Raghu Prabhakar

2025-11-05

http://arxiv.org/abs/2511.03092v2

The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support has resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet, these techniques are not commonly used within industrial deployments using frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm, while on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obfuscating the need for implementing these techniques. In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables 4x improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.
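One well-known way to bound cache size during long decoding, popularized by StreamingLLM, is to keep a few initial "attention sink" entries plus a sliding window of the most recent entries and drop everything in between. The sketch below shows only that retention policy on opaque per-token entries; the class name `SinkWindowCache` and its parameters are hypothetical, and this is not SnapStream's actual algorithm or any framework's API.

```python
from collections import deque


class SinkWindowCache:
    """Fixed-size cache: the first `num_sink` entries plus a recent window."""

    def __init__(self, num_sink=4, window=8):
        self.num_sink = num_sink
        self.sink = []                      # earliest entries, never evicted
        self.recent = deque(maxlen=window)  # oldest recent entry drops out

    def append(self, entry):
        """Add the cache entry for one newly decoded token."""
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)
        else:
            self.recent.append(entry)

    def entries(self):
        """Entries that attention would see, in original order."""
        return self.sink + list(self.recent)
```

Because the total size never exceeds `num_sink + window`, the memory footprint is fixed in advance, which is the property that makes this family of techniques compatible in principle with statically shaped buffers.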

LogicSparse Enabling Engine-Free Unstructured Sparsity for Quantised Deep-learning Accelerators

Authors: Changhong Li, Biswajit Basu, Shreejith Shanker

2025-11-05

http://arxiv.org/abs/2511.03079v1

FPGAs have been shown to be a promising platform for deploying Quantised Neural Networks (QNNs) with high-speed, low-latency, and energy-efficient inference. However, the complexity of modern deep-learning models limits the performance on resource-constrained edge devices. While quantisation and pruning alleviate these challenges, unstructured sparsity remains underexploited due to irregular memory access. This work introduces a framework that embeds unstructured sparsity into dataflow accelerators, eliminating the need for dedicated sparse engines and preserving parallelism. A hardware-aware pruning strategy is introduced to improve efficiency and design flow further. On LeNet-5, the framework attains 51.6x compression and 1.23x throughput improvement using only 5.12% of LUTs, effectively exploiting unstructured sparsity for QNN acceleration.
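Unstructured sparsity means individual weights are zeroed with no regular block or channel pattern, which is what makes memory access irregular. A common way to obtain it is global magnitude pruning, sketched below in plain Python. This is a generic baseline for illustration only; the function name and threshold rule are assumptions, and it is not the hardware-aware strategy the paper proposes.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of a 2-D weight matrix.

    weights: list of rows (lists of floats).
    sparsity: target fraction of weights to zero, between 0 and 1.
    Ties at the threshold are all pruned, so the achieved sparsity can
    slightly exceed the target.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)
    if k == 0:
        return [row[:] for row in weights]  # nothing to prune
    threshold = magnitudes[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]
```

The surviving non-zero positions are scattered arbitrarily across the matrix, which is the irregular pattern that a sparsity-aware accelerator design has to handle.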