PACKAGE_NAME = "flash_attn" PACKAGE_NAME = "vllm_flash_attn" BASE_WHEEL_URL = ( "https://github.com/Dao-AILab/flash-attention/releases/download/{tag_name}/{wheel_name}" @@ -106,7 +106,7 @@ def append_nvcc_threads(nvcc_extra_args): if os.path.exists(os.path.join(torch_dir,...
```dockerfile
RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist \
    --mount=type=cache,target=/root/.cache/pip \
    pip install dist/*.whl --verbose

RUN --mount=type=bind,from=flash-attn-builder,src=/usr/src/flash-attention-v2,target=/usr/src/flash-attention-v2 \
    --mo...
```
I'm hitting the same error. Is this bug still present?
Enable when vllm_flash_attn  da50678
Merge branch 'main' into flash-attention-decode  6d5b4ec
Add vllm-flash-attn as dependency  37cb5a9

WoosukKwon added 2 commits May 13, 2024 17:54
yapf  1be2eb3
Use fp32 in ref attn softmax  d544611
...
>>> flash_attn is not found. Using xformers backend.

but flash_attn has been added to the vllm wheel:

adding 'vllm/thirdparty_files/flash_attn/ops/triton/rotary.py'
adding 'vllm/thirdparty_files/flash_attn/ops/triton/__pycache__/__init__.cpython-310.pyc'
adding 'vllm/thirdparty_files/...
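For context, the message above comes from an import-availability check, roughly along the lines of the sketch below (this is not vLLM's actual backend-selection code, and the path handling is an assumption). One likely reason for the mismatch is that the package is vendored under `vllm/thirdparty_files/`, so that directory has to be on `sys.path` before `import flash_attn` can resolve, even though the files are physically inside the wheel.

```python
# Rough sketch, assuming a vendored copy under vllm/thirdparty_files/;
# names and layout are taken from the wheel listing above, not from vLLM's code.
import importlib.util
import logging
import os
import sys

logger = logging.getLogger(__name__)

# Hypothetical: make the vendored directory importable before probing.
_THIRDPARTY = os.path.join(os.path.dirname(__file__), "thirdparty_files")
if os.path.isdir(_THIRDPARTY) and _THIRDPARTY not in sys.path:
    sys.path.insert(0, _THIRDPARTY)

# find_spec returns None when the package cannot be resolved on sys.path.
if importlib.util.find_spec("flash_attn") is None:
    logger.info("flash_attn is not found. Using xformers backend.")
```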
```
vllm/vllm_flash_attn/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -12,6 +15,8 @@ __pycache__/
# Distribution / packaging
.Python
build/
cmake-build-*/
CMakeUserPresets.json
develop-eggs/
dist/
downloads/
```

98 changes: 73 additions & 25 deletions  CMa...
f"flash-attn=={flash_attn_version}", "--no-dependencies", # Required to avoid re-installing torch. ], env=dict(os.environ, CC="gcc"), )# Copy the FlashAttention package into the vLLM package after build. class build_ext(BuildExtension):def...
Revert "[Kernel] Use flash-attn for decoding (vllm-project#3648)" (vl… … bd73ad3 WoosukKwon mentioned this pull request May 19, 2024 [Kernel] Add flash-attn back #4907 Merged dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request May 21, 2024 Revert "[...
My guess is that the way we encode multiple query tokens per sequence in an attention kernel invocation breaks the flash_attn contract somehow (see the sketch below).

cadedaniel added the bug label Jun 5, 2024

cadedaniel (Collaborator, Author) commented Jun 5, 2024
Actually, I will close this in favor of #5152. Sorry,...
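To make the "contract" concrete, below is an illustrative varlen call with more than one query token per sequence; this is not vLLM's kernel invocation, and the sizes are made up. The point is that `cu_seqlens_q` must exactly describe how the flattened `q` tensor is packed, otherwise the kernel attends over the wrong tokens. Shapes and keyword names follow the flash_attn 2.x varlen API as I understand it.

```python
import torch
from flash_attn import flash_attn_varlen_func

nheads, headdim = 8, 64
q_lens = [3, 1]     # e.g. spec-decode style: 3 query tokens for seq 0, 1 for seq 1
kv_lens = [10, 7]   # full KV history per sequence

q = torch.randn(sum(q_lens), nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn(sum(kv_lens), nheads, headdim, dtype=torch.float16, device="cuda")
v = torch.randn_like(k)

# Cumulative sequence lengths describing the packed layouts above.
cu_q = torch.tensor([0, 3, 4], dtype=torch.int32, device="cuda")
cu_k = torch.tensor([0, 10, 17], dtype=torch.int32, device="cuda")

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_q, cu_seqlens_k=cu_k,
    max_seqlen_q=max(q_lens), max_seqlen_k=max(kv_lens),
    causal=True,
)
```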
This PR reverts #4820 by adding back flash-attn. Previously, using flash-attn for decoding caused errors when using small models (like the Llama 68M model in lora/test_layer_variation.py). This was...
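Purely as an illustration of the kind of guard such a restore needs, a minimal backend picker that falls back when flash-attn cannot handle the model's head size is sketched below. The supported-size list and the names here are assumptions for illustration, not vLLM's actual constants or logic.

```python
# Hypothetical guard: small test models can have head sizes flash-attn
# does not support, so fall back to another backend in that case.
FLASH_ATTN_SUPPORTED_HEAD_SIZES = [32, 64, 96, 128, 160, 192, 224, 256]  # assumed

def pick_backend(head_size: int) -> str:
    if head_size in FLASH_ATTN_SUPPORTED_HEAD_SIZES:
        return "flash-attn"
    return "xformers"
```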