Use prefix-enabled attention 8a209ff
Disable flash-attn backend 31f741d

WoosukKwon commented Mar 28, 2024 (edited):
@skrider I just edited this PR: 1) I removed dependency on your FlashAttention repo (let's add it in the next PR), 2) I enabled the prefix-attention, ...
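For context on item 2: prefix-enabled attention means the query covers only the newly arrived tokens while the keys/values cover the cached prefix plus those tokens; with flash-attn's varlen kernel and causal=True (bottom-right aligned since flash-attn 2.1), each query then attends to the whole prefix plus a causal window over the new tokens. A minimal sketch of that call, assuming the upstream flash_attn package is installed — the shapes and the single-sequence batch are illustrative, not this PR's actual code:

    # Sketch: prefix-enabled attention via flash-attn's varlen kernel (assumes
    # flash-attn >= 2.1, CUDA, fp16/bf16 tensors). Illustrative, not the PR's code.
    import torch
    from flash_attn import flash_attn_varlen_func

    num_heads, head_dim = 8, 64
    prefix_len, new_len = 128, 16  # cached prefix tokens + newly arrived tokens

    q = torch.randn(new_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
    k = torch.randn(prefix_len + new_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
    v = torch.randn_like(k)

    # Cumulative sequence lengths for a "batch" holding a single sequence.
    cu_seqlens_q = torch.tensor([0, new_len], dtype=torch.int32, device="cuda")
    cu_seqlens_k = torch.tensor([0, prefix_len + new_len], dtype=torch.int32, device="cuda")

    # causal=True is bottom-right aligned, so the new tokens see the full prefix
    # plus a causal mask among themselves.
    out = flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens_q,
        cu_seqlens_k=cu_seqlens_k,
        max_seqlen_q=new_len,
        max_seqlen_k=prefix_len + new_len,
        causal=True,
    )
    print(out.shape)  # (new_len, num_heads, head_dim)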
Revert "[Kernel] Use flash-attn for decoding (vllm-project#3648)" (vl… … bd73ad3 WoosukKwon mentioned this pull request May 19, 2024 [Kernel] Add flash-attn back #4907 Merged dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request May 21, 2024 Revert "[...
+        use_flash_attn=False,
     ):
         if device is None:
             device = select_device()
@@ -292,7 +295,7 @@ def _load(
         if gpt_config_path:
             cfg = OmegaConf.load(gpt_config_path)
-            gpt = GPT(**cfg, device=device, logger=self.logger).eval()
+            gpt = GPT(**cfg, use_flash_attn=use_flash_attn, de...
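The hunk above only threads a use_flash_attn flag from the loader's signature into the GPT constructor. A minimal sketch of how such a flag is typically honored inside an attention module, falling back to PyTorch's scaled_dot_product_attention when flash-attn is absent — the class and attribute names are illustrative assumptions, not this repository's code:

    # Illustrative sketch: gate flash-attn behind a constructor flag and fall back
    # to torch SDPA when the package is missing. Not the repository's actual code.
    import torch
    import torch.nn.functional as F

    try:
        from flash_attn import flash_attn_func  # optional dependency
        _HAS_FLASH_ATTN = True
    except ImportError:
        _HAS_FLASH_ATTN = False


    class Attention(torch.nn.Module):
        def __init__(self, use_flash_attn: bool = False):
            super().__init__()
            # Only honor the flag if the package actually imported.
            self.use_flash_attn = use_flash_attn and _HAS_FLASH_ATTN

        def forward(self, q, k, v):
            # q, k, v: (batch, seq_len, num_heads, head_dim)
            if self.use_flash_attn:
                # flash_attn_func expects (batch, seqlen, nheads, headdim), fp16/bf16 on CUDA.
                return flash_attn_func(q, k, v, causal=True)
            # SDPA expects (batch, nheads, seqlen, headdim).
            out = F.scaled_dot_product_attention(
                q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
                is_causal=True,
            )
            return out.transpose(1, 2)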
-    import flash_attn  # noqa: F401
+    import vllm_flash_attn  # noqa: F401
 except ImportError:
     logger.info(
-        "Cannot use FlashAttention-2 backend because the flash_attn "
-        "package is not found. Please install it for better performance.")
+        "Cannot use FlashAttention-2 backend because the vllm_flash...
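The hunk swaps which package the backend selector probes: it now imports vllm_flash_attn rather than flash_attn and adjusts the log message. A stripped-down sketch of that probe-and-fall-back pattern — the function name, backend strings, and fallback choice are illustrative, not vLLM's exact selector code:

    # Simplified sketch of choosing an attention backend by probing for the
    # vllm_flash_attn package at import time. Illustrative, not vLLM's exact code.
    import logging

    logger = logging.getLogger(__name__)


    def which_attn_backend() -> str:
        try:
            import vllm_flash_attn  # noqa: F401
        except ImportError:
            logger.info(
                "Cannot use FlashAttention-2 backend because the vllm_flash_attn "
                "package is not found. Falling back to another backend.")
            return "XFORMERS"
        return "FLASH_ATTN"

Probing at import time keeps the dependency optional: environments without the wheel fall back to another backend instead of failing at startup.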
    // Record the backward flash-attention op in the compute graph: q, k, v and
    // the incoming output gradient d become the node's sources, and src[4]
    // stores the causal-mask flag as an i32 tensor.
    result->op   = GGML_OP_FLASH_ATTN_BACK;
    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src[0] = q;
    result->src[1] = k;
    result->src[2] = v;
    result->src[3] = d;
    result->src[4] = ggml_new_i32(ctx, masked ? 1 : 0);
    ...