By the way, does vllm-flash-attn support Turing architecture GPUs like the 2080ti?

simonwei97 commented May 24, 2024 • edited
I have the same problem on Linux (CentOS 7). My env:
torch 2.3.0
xformers 0.0.26.post1
vllm 0.4.2
vllm-flash-attn 2.5.8.post2
vllm_nccl_cu12 2.18....
Thank you very much. By the way, does vllm-flash-attn support Turing architecture GPUs like the 2080ti? I recall that Turing GPUs support flash-attn v1.

sivanantha321 mentioned this pull request Jun 4, 2024
Add nccl package and Bump vLLM to 0.4.3 for huggingface runtime kserve/kse...
However, native flash attention currently only supports GPUs with Ampere, Hopper, and similar architectures, such as the A100 and H100. Unfortunately, the V100 is a Volta-architecture GPU and is not supported, so check whether your card is supported before attempting the steps above. If it is not, consider using xformers or torch.nn.functional.scaled_dot_product_attention instead; the former requires PyTorch 2.1.2, while the latter requires PyTorch 2.0 or later, but...
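For reference, here is a minimal sketch (assuming PyTorch >= 2.0 and a CUDA device; the helper name `flash_attn_supported` is just for illustration) of how you might check the compute capability before installing flash-attn, and fall back to torch.nn.functional.scaled_dot_product_attention otherwise:

```python
import torch
import torch.nn.functional as F

def flash_attn_supported() -> bool:
    # FlashAttention 2 requires Ampere (SM 8.0) or newer.
    # Volta (SM 7.0, e.g. V100) and Turing (SM 7.5, e.g. 2080 Ti) fall below that.
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8

# Dummy query/key/value tensors: (batch, heads, seq_len, head_dim).
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

if flash_attn_supported():
    print("Ampere or newer detected; flash-attn should be usable.")
else:
    print("Older architecture; relying on PyTorch's built-in SDPA instead.")

# scaled_dot_product_attention (PyTorch >= 2.0) picks the fastest backend
# available on the current GPU (flash / memory-efficient / math).
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

This is only a sketch of the fallback idea; it does not reflect how vLLM itself selects its attention backend.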