2026-03-17
Effective Distillation to Hybrid xLSTM Architectures
针对蒸馏模型性能不匹配问题,提出有效蒸馏方法,实现无损蒸馏提升下游任务表现
Authors: Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, Sepp Hochreiter | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.15590v1
Abstract
There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.LEXI Lossless Exponent Coding for Efficient Inter-Chiplet Communication in Hybrid LLMs
针对芯片间通信开销问题,提出LEXI指数压缩方案,降低推理延迟提升通信效率
Authors: Miao Sun, Alish Kanani, Kaushik Shroff, Umit Ogras | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.15589v1
Abstract
Data movement overheads increase the inference latency of state-of-the-art large language models (LLMs). These models commonly use the bfloat16 (BF16) format for stable training. Floating-point standards allocate eight bits to the exponent, but our profiling reveals that exponent streams exhibit fewer than 3 bits Shannon entropy, indicating high inherent compressibility. To exploit this potential, we propose LEXI, a novel lossless exponent compression scheme based on Huffman coding. LEXI compresses activations and caches on the fly while storing compressed weights for just-in-time decompression near compute, without sacrificing system throughput and model accuracy. The codecs at the ingress and egress ports of network-on-chip routers sustain the maximum link bandwidth via multi-lane LUT decoders, incurring only 0.09 percent area and energy overheads with GF 22 nm technology. LEXI reduces inter-chiplet communication and end-to-end inference latencies by 33-45 percent and 30-35 percent on modern Jamba, Zamba, and Qwen LLMs implemented on a homogeneous chiplet architecture.DUET Disaggregated Hybrid Mamba-Transformer LLMs with Prefill and Decode-Specific Packages
针对混合模型计算瓶颈问题,提出DUET异构加速器,优化预填充与解码阶段性能
Authors: Alish Kanani, Sangwan Lee, Han Lyu, Jiahao Lin, Jaehyun Park, Umit Y. Ogras | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.15530v1
Abstract
Large language models operate in distinct compute-bound prefill followed by memory bandwidth-bound decode phases. Hybrid Mamba-Transformer models inherit this asymmetry while adding state space model (SSM) recurrences and element-wise operations that map poorly to matmul-centric accelerators. This mismatch causes performance bottlenecks, showing that a homogeneous architecture cannot satisfy all requirements. We introduce DUET, a disaggregated accelerator that assigns prefill and decode phases to specialized packages. The Prefill package utilizes systolic array chiplets with off-package memory for efficient large matrix multiplications and long-sequence SSMs. The Decode package utilizes vector-unit arrays with high-bandwidth in-package memory to accelerate token-by-token SSM and vector-matrix multiplications. Both architectures are runtime-configurable to support hybrid models with mixed Mamba and attention layers. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B across four workloads show that DUET achieves 4x faster time to first token, 1.4x higher throughput, and 1.5x lower time between tokens over the B200 GPU.Federated Learning of Binary Neural Networks Enabling Low-Cost Inference
针对边缘推理资源限制问题,提出联邦二元神经网络训练,实现低开销高效推理
Authors: Nitin Priyadarshini Shankar, Soham Lahiri, Sheetal Kalyani, Saurav Prakash | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.15507v1
Abstract
Federated Learning (FL) preserves privacy by distributing training across devices. However, using DNNs is computationally intensive at the low-powered edge during inference. Edge deployment demands models that simultaneously optimize memory footprint and computational efficiency, a dilemma where conventional DNNs fail by exceeding resource limits. Traditional post-training binarization reduces model size but suffers from severe accuracy loss due to quantization errors. To address these challenges, we propose FedBNN, a rotation-aware binary neural network framework that learns binary representations directly during local training. By encoding each weight as a single bit $\{+1, -1\}$ instead of a $32$-bit float, FedBNN shrinks the model footprint, significantly reducing runtime (during inference) FLOPs and memory requirements in comparison to federated methods using real models. Evaluations across multiple benchmark datasets demonstrate that FedBNN significantly reduces resource consumption while performing similarly to existing federated methods using real-valued models.When Does Sparsity Mitigate the Curse of Depth in LLMs
针对LLM深度诅咒问题,证明稀疏性可调节方差传播,提升深度层利用率
Authors: Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann, Shiwei Liu | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.15389v1
Abstract
Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that, sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixtureof-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training deptheffective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at https://github.com/pUmpKin-Co/SparsityAndCoD.Flash-Unified A Training-Free and Task-Aware Acceleration Framework for Native Unified Models
针对统一模型计算开销问题,提出任务感知加速框架,提升生成与理解任务效率
Authors: Junlong Ke, Zichen Wen, Boxue Yang, Yantai Yang, Xuyang Liu, Chenfei Liao, Zhaorun Chen, Shaobo Wang, Linfeng Zhang | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.15271v1
Abstract
Native unified multimodal models, which integrate both generative and understanding capabilities, face substantial computational overhead that hinders their real-world deployment. Existing acceleration techniques typically employ a static, monolithic strategy, ignoring the fundamental divergence in computational profiles between iterative generation tasks (e.g., image generation) and single-pass understanding tasks (e.g., VQA). In this work, we present the first systematic analysis of unified models, revealing pronounced parameter specialization, where distinct neuron sets are critical for each task. This implies that, at the parameter level, unified models have implicitly internalized separate inference pathways for generation and understanding within a single architecture. Based on these insights, we introduce a training-free and task-aware acceleration framework, FlashU, that tailors optimization to each task's demands. Across both tasks, we introduce Task-Specific Network Pruning and Dynamic Layer Skipping, aiming to eliminate inter-layer and task-specific redundancy. For visual generation, we implement a time-varying control signal for the guidance scale and a temporal approximation for the diffusion head via Diffusion Head Cache. For multimodal understanding, building upon the pruned model, we introduce Dynamic Token Pruning via a V-Norm Proxy to exploit the spatial redundancy of visual inputs. Extensive experiments on Show-o2 demonstrate that FlashU achieves 1.78$\times$ to 2.01$\times$ inference acceleration across both understanding and generation tasks while maintaining SOTA performance, outperforming competing unified models and validating our task-aware acceleration paradigm. Our code is publicly available at https://github.com/Rirayh/FlashU.Efficient Document Parsing via Parallel Token Prediction
针对VLM解析速度瓶颈问题,提出并行令牌预测方法,提升多令牌生成效率
Authors: Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.15206v1
Abstract
Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a plugable, model-agnostic and simple-yet-effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert some learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.LMetric Simple is Better - Multiplication May Be All You Need for LLM Request Scheduling
针对LLM请求调度复杂性问题,提出简单乘法调度方法,平衡加速与负载均衡
Authors: Dingyan Zhang, Jinbo Han, Kaixi Zhang, Xingda Wei, Sijie Shen, Chenguang Fang, Wenyuan Yu, Jingren Zhou, Rong Chen | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.15202v1
Abstract
High-quality LLM request scheduling requires achieving two key objectives: whether the routed instance has KV$ to accelerate the request execution and whether the workload is balanced across instances. Achieving both objectives is challenging because pursuing one objective may compromise the other. Current approaches adopt various combinators (e.g., linear combinations) to compute a scheduling score combining indicators for the two objectives, which are complex in that they either require significant workload-specific hyperparameter tuning or model-hardware-aware simulator development, and could still lead to suboptimal performance. In this paper, we show that using a simple multiplication of two carefully chosen indicators-one for KV$-aware (new prefill tokens if routed to an instance) and one for load balancing-aware (current batch size of the instance)-as the scheduling score can simultaneously achieve both objectives well without any hyperparameter tuning. The key idea is that the multiplied score considers both objectives in a manner similar to a linear combination, with a nice property that the original hyperparameters are canceled out during comparison so we don't need tuning to find the best parameters. The two indicators are chosen based on our analysis of LLM characteristics, and our extensive experiments show that this simple approach can reduce TTFT by 92% and 52%, and TPOT by 21% and 20%, compared to vLLM-v1 and a production scheduler on real-world workloads covering chatbots, API calls, and coding agents. We also mathematically derive the conditions under which multiplication may fail, and find that such conditions are extremely rare in practice and can be detected (and mitigated) beforehand.Joint Routing and Model Pruning for Decentralized Federated Learning in Bandwidth-Constrained Multi-Hop Wireless Networks
针对联邦学习通信瓶颈问题,提出联合路由与剪枝框架,优化多跳网络延迟
Authors: Xiaoyu He, Weicai Li, Tiejun Lv, Xi Yu | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.15188v1
Abstract
Decentralized federated learning (D-FL) enables privacy-preserving training without a central server, but multi-hop model exchanges and aggregation are often bottlenecked by communication resource constraints. To address this issue, we propose a joint routing-and-pruning framework that optimizes routing paths and pruning rates to maintain communication latency within prescribed limits. We analyze how the sum of model biases across all clients affects the convergence bound of D-FL and formulate an optimization problem that maximizes the model retention rate to minimize these biases under communication constraints. Further analysis reveals that each client's model retention rate is path-dependent, which reduces the original problem to a routing optimization. Leveraging this insight, we develop a routing algorithm that selects latency-efficient transmission paths, allowing more parameters to be delivered within the time budget and thereby improving D-FL convergence. Simulations demonstrate that, compared with unpruned systems, the proposed framework reduces average transmission latency by 27.8% and improves testing accuracy by approximately 12%. Furthermore, relative to standard benchmark routing algorithms, the proposed routing method improves accuracy by roughly 8%.Guaranteeing Semantic and Performance Determinism in Flexible GPU Sharing
针对GPU共享权衡问题,提出DetShare系统,确保语义确定性与透明性
Authors: Zhenyuan Yang, Wenxin Zheng, Mingyu Li | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.15042v1
Abstract
GPU sharing is critical for maximizing hardware utilization in modern data centers. However, existing approaches present a stark trade-off: coarse-grained temporal multiplexing incurs severe tail-latency spikes for interactive services, while fine-grained spatial partitioning often necessitates invasive kernel modifications that compromise behavioral equivalence. We present DetShare, a novel GPU sharing system that prioritizes determinism and transparency. DetShare ensures semantic determinism (unmodified kernels yield identical results) and performance determinism (predictable tail latency), all while maintaining complete transparency (zero code modification). DetShare introduces GPU coroutines, a new abstraction that decouples logical execution contexts from physical GPU resources. This decoupling enables flexible, fine-grained resource allocation via lightweight context migration. Our evaluation demonstrates that DetShare improves training throughput by up to 79.2% compared to temporal sharing. In co-location scenarios, it outperforms state-of-the-art baselines, reducing P99 tail latency by 15.1% without compromising throughput. Furthermore, through workload-aware placement and our TPOT-First scheduling policy, DetShare decreases average inference latency by 69.1% and reduces Time-Per-Output-Token (TPOT) SLO violations by 21.2% relative to default policies.MMSpec Benchmarking Speculative Decoding for Vision-Language Models
针对VLM推理延迟问题,提出MMSpec基准,评估推测解码在视觉语言模型中的效果
Authors: Hui Shen, Xin Wang, Ping Zhang, Yunta Hsieh, Qi Han, Zhongwei Wan, Ziheng Zhang, Jingxuan Zhang, Jing Xiong, Ziyuan Liu, Yifan Zhang, Hangrui Cao, Chenyang Zhao, Mi Zhang | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.14989v1
Abstract
Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose ViSkip, a plug-and-play speculative decoding method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.Spiking Layer-Adaptive Magnitude-based Pruning
针对SNN剪枝性能下降问题,提出层自适应幅度剪枝框架,提升能量效率与准确性
Authors: Junqiao Wang, Zhehang Ye, Yuqi Ouyang | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.14946v1
Abstract
Spiking Neural Networks (SNNs) provide energy-efficient computation but their deployment is constrained by dense connectivity and high spiking operation costs. Existing magnitude-based pruning strategies, when naively applied to SNNs, fail to account for temporal accumulation, non-uniform timestep contributions, and membrane stability, often leading to severe performance degradation. This paper proposes Spiking Layer-Adaptive Magnitude-based Pruning (SLAMP), a theory-guided pruning framework that generalizes layer-adaptive magnitude pruning to temporal SNNs by explicitly controlling worst-case output distortion across layers and timesteps. SLAMP formulates sparsity allocation as a temporal distortion-constrained optimization problem, yielding time-aware layer importance scores that reduce to conventional layer-adaptive pruning in single-timestep limit. An efficient two-stage procedure is derived, combining temporal score estimation, global sparsity allocation, and magnitude pruning with retraining for stability recovery. Experiments on CIFAR10, CIFAR100, and the event-based CIFAR10-DVS datasets demonstrate that SLAMP achieves substantial connectivity and spiking operation reductions while preserving accuracy, enabling efficient and deployable SNN inference.Balancing Saliency and Coverage Semantic Prominence-Aware Budgeting for Visual Token Compression in VLMs
针对VLM视觉令牌压缩问题,提出语义显著性感知预算方法,平衡显著性与覆盖率
Authors: Jaehoon Lee, Mingi Jung, Soohyuk Jang, Seungryong Yoo, Dahuin Jung, Sungroh Yoon | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.14892v1
Abstract
Large Vision-Language Models (VLMs) achieve strong multimodal understanding capabilities by leveraging high-resolution visual inputs, but the resulting large number of visual tokens creates a major computational bottleneck. Recent work mitigates this issue through visual token compression, typically compressing tokens based on saliency, diversity, or a fixed combination of both. We observe that the distribution of semantic prominence varies substantially across samples, leading to different optimal trade-offs between local saliency preservation and global coverage. This observation suggests that applying a static compression strategy across all samples can be suboptimal. Motivated by this insight, we propose PromPrune, a sample-adaptive visual token selection framework composed of semantic prominence-aware budget allocation and a two-stage selection pipeline. Our method adaptively balances local saliency preservation and global coverage according to the semantic prominence distribution of each sample. By allocating token budgets between locally salient regions and globally diverse regions, our method maintains strong performance even under high compression ratios. On LLaVA-NeXT-7B, our approach reduces FLOPs by 88% and prefill latency by 22% while preserving 97.5% of the original accuracy.SkipOPU An FPGA-based Overlay Processor for Large Language Models with Dynamically Allocated Computation
针对LLM推理效率瓶颈问题,提出FPGA覆盖处理器,动态分配计算减少冗余
Authors: Zicheng He, Anhao Zhao, Xiaoyu Shen, Chen Wu, Lei He | Date: 2026-03-16
Link: http://arxiv.org/abs/2603.14785v1