[Bug]: Cannot use FlashAttention-2 backend because the flash_attn package is not found #4906 (Closed)

maxin9966 commented May 22, 2024:
Thank you very much. By the way, does vllm-flash-attn support Turing architecture GPUs like the 2080ti? I recall that the Turing GPU supports flash-a...

simonwei97 commented May 24, 2024 (edited):
I have the same problem on Linux (CentOS 7). My env:
torch 2.3.0
xformers 0.0.26.post1
vllm 0.4.2
vllm-flash-attn 2.5.8.post2
vllm_nccl_cu12 2.18....
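For context, FlashAttention-2 builds generally target Ampere (compute capability 8.0) or newer GPUs, while the 2080 Ti is Turing (compute capability 7.5), so vLLM typically falls back to another backend such as xformers on those cards. A minimal check of what your GPU reports, assuming a CUDA-enabled PyTorch install:

```python
import torch

# Print each visible GPU's compute capability. FlashAttention-2 wheels
# generally require SM 8.0+ (Ampere or newer); Turing cards such as the
# 2080 Ti report SM 7.5.
for idx in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(idx)
    name = torch.cuda.get_device_name(idx)
    fa2_capable = (major, minor) >= (8, 0)
    print(f"{name}: sm_{major}{minor}, FlashAttention-2 capable: {fa2_capable}")
```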
3. Note that the README already tells you to install ninja beforehand, otherwise the build will take a very long time. If ninja is already installed, you can run pip install flash-attn --no-build-isolation directly, but in practice building through pip this way is still extremely slow, so it is strongly recommended to compile directly from source (with ninja installed first): git clone https://github.com/Dao-AILab/flash-attention.git c...
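Once the source build finishes, a quick way to confirm that the package is importable and which version got installed (a minimal sketch; the version string depends on the commit you built):

```python
# Verify the flash-attn installation after building from source.
import flash_attn
from flash_attn import flash_attn_func  # core attention kernel entry point

print("flash-attn version:", flash_attn.__version__)
print("flash_attn_func importable:", callable(flash_attn_func))
```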
attn_output = torch.nn.functional.scaled_dot_product_attention( ...

Is there an existing issue for this? I have searched the existing issues.

Reproduction: Load the model https://huggingface.co/WhiteRabbitNeo/WhiteRabbitNeo-33B-v1.5 and try to use it. ...
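For reference, torch.nn.functional.scaled_dot_product_attention is the PyTorch-native attention entry point that models fall back to when a dedicated flash-attn build is unavailable. A minimal sketch of how it is called, with hypothetical tensor shapes:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, num_heads, seq_len, head_dim).
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# PyTorch picks a flash / memory-efficient kernel when one is available
# for the current GPU and dtype, otherwise it falls back to the math kernel.
attn_output = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(attn_output.shape)  # torch.Size([1, 8, 128, 64])
```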
@logicwong Hi, I tried setting use_flash_attn to False and it still did not work. I suspect pytorch_model.bin may need to be split into several shards and loaded separately. Do you have split files like 001.bin, 002.bin on your side?

It looks like you are running out of system RAM; roughly 20 GB of RAM is needed, and your VRAM should be enough. My setup is a 2080ti (11G); after allocating 20 GB of memory to WSL, the example in the docs runs fine.
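If splitting the checkpoint is the route you take, one way to produce sharded files is to load the model once and re-save it with a shard-size cap. A minimal sketch using transformers, with hypothetical paths and assuming the model loads via AutoModelForCausalLM:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical paths; replace with the actual model directory.
src = "path/to/model-with-single-pytorch_model.bin"
dst = "path/to/model-sharded"

# low_cpu_mem_usage=True streams weights in instead of materializing a
# second full copy, which helps when system RAM is the bottleneck.
model = AutoModelForCausalLM.from_pretrained(
    src, torch_dtype=torch.float16, low_cpu_mem_usage=True
)

# Re-save with a shard-size limit so later loads never need the whole
# checkpoint in memory at once.
model.save_pretrained(dst, max_shard_size="2GB")
```

That said, re-saving still requires loading the full model once, so allocating more memory to WSL (as above) may be the simpler fix.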