I recently read the paper "MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads" and am sharing a summary of its content and key points here. Since my research experience is still limited, the views in this post are only my own reading and interpretation; criticism and corrections from the community are welcome. Basic information: Date written: March 9, 2025. Purpose of this post: to share paper-reading notes.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, arxiv.org/abs/2401.10774. Authors: Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao. Affiliations: Princeton University; Together AI; University of Illinois Urbana-Champaign; Carnegie Mellon University; ...
The problem with speculative decoding is that the draft model usually has to be trained separately; this training is time-consuming, and a mismatch between the draft model's training data and the original model hurts the acceptance rate. Medusa proposes a novel alternative: it attaches several extra decoding heads (Medusa heads) to the last Transformer layer's hidden state to predict the next k tokens, processes these candidate continuations together with tree attention, and then decides which candidates to accept. In this way it achieves a substantial end-to-end speedup without training a separate draft model.
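To make the head structure concrete, below is a minimal PyTorch sketch of a single Medusa head as described in the paper (a residual SiLU block on top of the last hidden state, followed by a vocabulary projection). The sizes hidden_size=4096 and vocab_size=32000 are placeholder values, not a required configuration.

```python
import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    """One decoding head: residual SiLU block + vocab projection (sketch)."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: last-layer hidden state of the base LLM, shape [batch, hidden_size]
        h = h + self.act(self.proj(h))   # residual block
        return self.lm_head(h)           # logits for a token several steps ahead

# k heads share the same hidden state and predict tokens t+2 ... t+k+1
heads = nn.ModuleList([MedusaHead(4096, 32000) for _ in range(4)])
hidden = torch.randn(1, 4096)            # stand-in for the base model's output
candidate_logits = [head(hidden) for head in heads]
```

The candidates from all heads are then arranged into a token tree and verified in one forward pass with tree attention, so several future tokens can be accepted per decoding step.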
InferLLM is a lightweight LLM model inference framework that mainly references and borrows from the llama.cpp project. llama.cpp puts almost all core code and kernels in a single file and uses a large number of macros, making it difficult for developers to read and modify. InferLLM has the...
To address this challenge, we present Vidur – a large-scale, high-fidelity, easily-extensible simulation framework for LLM inference performance. Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end inference performance...
microsoft/vidur (GitHub): Vidur: LLM Inference System Simulator. Vidur is a high-fidelity and extensible LLM inference system simulator. It can help you with: ...
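A toy sketch of the idea behind this approach (not Vidur's actual code): profile an operator at a handful of input sizes, fit a simple predictive model, and then estimate runtimes for sizes that were never profiled; end-to-end latency is assembled from these per-operator predictions. The numbers below are made up.

```python
import numpy as np

# Hypothetical profiling data: runtime of one GeMM-heavy operator measured
# at a few token counts on real hardware.
profiled_tokens  = np.array([128, 256, 512, 1024, 2048])
profiled_time_ms = np.array([0.9, 1.7, 3.2, 6.5, 13.1])

# Fit a simple linear model: runtime ~ a * tokens + b on the profiled points.
a, b = np.polyfit(profiled_tokens, profiled_time_ms, deg=1)

def predict_runtime_ms(num_tokens: int) -> float:
    """Predict operator runtime for a token count we never profiled."""
    return a * num_tokens + b

# A request's end-to-end latency can then be estimated by summing predicted
# operator runtimes across layers, without running the model on real GPUs.
print(predict_runtime_ms(1536))
```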
GitHub: GitHub - kvcache-ai/ktransformers: A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations. Official documentation: Introduction - Ktransformers. 0. Updates: Feb 15, 2025: KTransformers V0.2.1: longer context (from 4K to 8K on 24 GB of VRAM) and slightly faster (about 15% faster) (up to 16...
TensorRT-LLM also powers NVIDIA NeMo, which provides an end-to-end cloud-native enterprise framework for developers to build, customize, and deploy generative AI models with billions of parameters. Get started with NeMo.
The IPEX-LLM library (previously known as BigDL-LLM) is a PyTorch* library for running LLMs on Intel CPUs and GPUs with low latency. The library contains state-of-the-art optimizations for LLM inference and fine-tuning, low-bit (int4, FP4, int8, and FP8) LLM acceleration, and seamless integration...
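As a rough illustration of what low-bit (e.g., int4) weight compression means, here is a toy symmetric group-quantization sketch in plain PyTorch; it is not IPEX-LLM's implementation, and the group size of 64 is an arbitrary choice.

```python
import torch

def quantize_int4_symmetric(w: torch.Tensor, group_size: int = 64):
    """Toy symmetric int4 group quantization (illustrative only).

    Weights are split into groups; each group stores one float scale plus
    4-bit integers in [-8, 7], which is where the memory saving comes from.
    """
    w = w.reshape(-1, group_size)
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0           # per-group scale
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original fp weights."""
    return q.to(torch.float32) * scale

weights = torch.randn(4096 * 4096)                            # stand-in weight matrix
q, scale = quantize_int4_symmetric(weights)
error = (dequantize(q, scale).flatten() - weights).abs().mean()
print(f"mean abs quantization error: {error:.4f}")
```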
In short, the main optimizations in DeepSpeed Inference are the following:
- Multi-GPU parallelism
- Operator fusion for small batches
- INT8 model quantization
- A pipelined inference scheme

1.1 DeepSpeed operator fusion

A Transformer layer can be divided into the following 4 main parts:
Input Layer-Norm plus Query, Key, and Value GeMMs and their bias adds. ...
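The snippet below sketches, in plain PyTorch, what this first block corresponds to: an input LayerNorm followed by a single combined QKV GeMM instead of three separate ones. DeepSpeed's actual implementation fuses these steps inside custom CUDA kernels, so this is only an illustration of the idea; the sizes are placeholders.

```python
import torch
import torch.nn as nn

hidden = 1024
ln  = nn.LayerNorm(hidden)
qkv = nn.Linear(hidden, 3 * hidden)          # one GeMM replaces separate Q/K/V GeMMs

x = torch.randn(8, 128, hidden)              # [batch, seq, hidden]
h = ln(x)                                    # input LayerNorm
q, k, v = qkv(h).chunk(3, dim=-1)            # split the fused output back into Q, K, V

# For small batches these GeMMs are launch-bound, so merging three small GeMMs
# (plus their bias adds) into one larger call reduces kernel-launch and memory
# traffic overhead, which is the effect DeepSpeed's fused kernels target.
```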