The cache behavior of the LLM decoding phase and the sequential nature of token generation are key to designing an LLM inference system. With CPU offloading, the cached KV entries correspond one-to-one with the associated parameters, but one challenge is that the KV cache is large, generally larger than the parameters themselves: its size is on the order of 4 · b · l · h1 · (s+n), which, compared with the model parameters, adds two linear factors, b (batch size) and s+n (sequence length). (Moreover, LLMs keep extending their sequence lengths; the larger...
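As a quick sanity check of that 4 · b · l · h1 · (s+n) estimate, the sketch below plugs in illustrative numbers; the layer count, hidden size, batch size, and sequence lengths are assumptions chosen for the example (roughly OPT-30B-like), not values taken from the paper:

```python
# Rough KV-cache vs. weight-size estimate for the 4*b*l*h1*(s+n) formula above.
# All model dimensions below are illustrative assumptions, not exact figures.

def kv_cache_bytes(b, l, h1, s, n, bytes_per_elem=2):
    """KV cache in fp16: 2 tensors (K and V) * 2 bytes * b * l * h1 * (s+n)."""
    return 2 * bytes_per_elem * b * l * h1 * (s + n)

def weight_bytes(l, h1, bytes_per_elem=2):
    """Rough transformer weight count: ~12 * h1^2 per layer (attention + MLP)."""
    return 12 * h1 * h1 * l * bytes_per_elem

l, h1 = 48, 7168           # layers, hidden size (assumed)
b, s, n = 256, 512, 32     # batch size, prompt length, generated tokens (assumed)

print(f"weights:  {weight_bytes(l, h1) / 2**30:.1f} GiB")     # ~55 GiB
print(f"KV cache: {kv_cache_bytes(b, l, h1, s, n) / 2**30:.1f} GiB")  # ~178 GiB
# At throughput-oriented batch sizes the KV cache exceeds the weights, and it
# keeps growing linearly in both b and s+n while the weights stay fixed.
```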
Because FlexGen targets LLM inference and LLMs are fully dense models, a pipeline-parallel multi-GPU organization not only increases multi-GPU parallelism but, more importantly, raises the share of GPU computation relative to the CPU-offload computation, so pipeline parallelism can achieve super-linear speedup for LLMs (it indirectly reduces the share of total time taken by the CPU-offload bottleneck); see the toy calculation below. In contrast, the dense part of recommendation models usually has low dimensionality and relatively small compute cost, de...
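To make the super-linear claim concrete, here is a back-of-the-envelope toy model. The weight size, GPU memory, PCIe bandwidth, and per-token compute time below are all made-up assumptions, not FlexGen measurements; the point is only that as more GPUs pipeline the layers, the bytes each stage must stream from the CPU shrink faster than linearly, so the offload bottleneck's share of time drops.

```python
# Toy model of why layer-wise pipeline parallelism can scale super-linearly
# when weight offloading is the bottleneck. Numbers are illustrative only.

def tokens_per_second(num_gpus, weights_gb=120.0, gpu_mem_gb=16.0,
                      pcie_gbps=16.0, compute_s_per_token=0.05):
    """Each GPU holds 1/num_gpus of the layers; whatever does not fit in GPU
    memory must be streamed from CPU over PCIe for every generated token."""
    weights_per_gpu = weights_gb / num_gpus
    streamed = max(0.0, weights_per_gpu - gpu_mem_gb)   # GB loaded per token
    io_time = streamed / pcie_gbps                       # seconds per token
    compute_time = compute_s_per_token / num_gpus        # per pipeline stage
    stage_time = max(io_time, compute_time)              # compute/IO overlap
    return 1.0 / stage_time

base = tokens_per_second(1)
for n in (1, 2, 4):
    t = tokens_per_second(n)
    print(f"{n} GPU(s): {t:6.2f} tok/s, speedup {t / base:4.1f}x")
# Going from 1 to 4 GPUs shrinks the streamed bytes per stage much faster than
# 4x (more of each stage's weights stay resident), so speedup exceeds GPU count.
```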
In recent years, large language models (LLMs) have shown great performance across a wide range of tasks. Increasingly, LLMs have been applied not only to interactive applications (such as chat), but also to many "back-of-house" tasks. These tasks include benchmarking, information extraction,...
The high computational and memory requirements of large language model (LLM) inference traditionally make it feasible only with multiple high-end accelerators. FlexGen aims to lower the resource requirements of LLM inference down to a single commodity GPU (e.g., T4, 3090) and allow flexible depl...
...strategy and compression scheme into a single engine, FlexGen. By managing multiple CUDA streams and CPU threads, FlexGen overlaps computation with I/O, and it jointly exploits three tiers of storage (GPU memory, CPU memory, and disk) to achieve high-throughput inference of large models (LLMs) on a single consumer-grade GPU, i.e., support for large batch sizes (large-batch, high-throughput).
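The sketch below illustrates the double-buffering idea behind that compute/IO overlap, written in PyTorch. It is a minimal illustration with placeholder shapes and a simplified layer loop, not FlexGen's actual engine code (which coordinates several streams, CPU threads, and disk as well):

```python
# Minimal sketch of overlapping weight transfers with computation using a
# dedicated CUDA copy stream. Shapes and the layer loop are placeholders.

import torch

copy_stream = torch.cuda.Stream()           # dedicated stream for H2D copies

def prefetch(cpu_weight):
    """Start copying the next layer's weights to the GPU on the copy stream."""
    with torch.cuda.stream(copy_stream):
        return cpu_weight.to("cuda", non_blocking=True)

# Weights live in pinned CPU memory so async H2D copies are truly asynchronous.
layers_cpu = [torch.randn(4096, 4096).pin_memory() for _ in range(8)]
x = torch.randn(64, 4096, device="cuda")    # current batch of activations

gpu_w = prefetch(layers_cpu[0])
for i in range(len(layers_cpu)):
    # Make sure layer i's weights have arrived before the default stream uses them.
    torch.cuda.current_stream().wait_stream(copy_stream)
    # Kick off the copy of layer i+1; it overlaps with layer i's compute below.
    next_w = prefetch(layers_cpu[i + 1]) if i + 1 < len(layers_cpu) else None
    x = x @ gpu_w                            # "compute" for layer i
    # Tell the caching allocator this tensor is in use on the compute stream.
    gpu_w.record_stream(torch.cuda.current_stream())
    gpu_w = next_w
torch.cuda.synchronize()
```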
Because LLM applications have been remarkably effective (OpenAI ChatGPT/GPT-4, Meta OPT/LLaMA), systems research on LLMs has expanded from a focus on training-system design toward inference systems, with the goal of lowering the cost of and barrier to LLM inference. FlexGen cites several important related works: Google's representative work PaLM inference, Microsoft's representative work DeepSpeed-Inference, and the OSDI '22 work Orca. PaLM inference and DeepSpeed-Inference are...
One key characteristic of these applications is that they are throughput-oriented: they require running LLM inferences over millions of tokens in batches, e.g., all the private documents in a company's corpus, or all the tasks in the HELM benchmark. These workloads are less sensitive to late...
Likewise, non-offloading LLM inference solutions such as FT (FasterTransformer) could adopt ideas from FlexGen and add support for serving inference on inexpensive hardware. In fact, the hardware configurations used for non-offloading deployments usually come with larger disks, more host memory, and more CPU resources, so they also provide better hardware support for implementing offloading.