```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--use_fast', action='store_true',
                    help='Set use_fast=True while loading the tokenizer.')
parser.add_argument('--use_flash_attention_2', action='store_true',
                    help='Set use_flash_attention_2=True while loading the model.')
...
```
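For context, a minimal sketch of how these flags are typically consumed when loading a model with `transformers` — the model name is a placeholder, but `use_fast` and `use_flash_attention_2` were real `from_pretrained` kwargs at the time of this thread (the latter has since been superseded by `attn_implementation="flash_attention_2"`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

args = parser.parse_args()

# "gpt2" is a placeholder model name for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=args.use_fast)
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    use_flash_attention_2=args.use_flash_attention_2,
    torch_dtype=torch.float16,  # FlashAttention 2 requires fp16 or bf16
)
```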
My `attention_mask` is a dynamic mask matrix for a prefix decoder, similar to UniLM and GLM. How can this type of `attention_mask` be applied to FlashAttention?

**tridao** (Contributor) commented on Apr 18, 2024:

That kind of mask is not currently supported. ...
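For readers unfamiliar with the mask in question, below is a minimal sketch (not the flash-attn API, which does not accept such a mask) of a prefix-LM mask: bidirectional attention over the first `prefix_len` tokens, causal attention over the rest. As a fallback, an explicit mask like this can be passed to PyTorch's `scaled_dot_product_attention`:

```python
import torch
import torch.nn.functional as F

def prefix_lm_mask(seq_len: int, prefix_len: int, device=None) -> torch.Tensor:
    """Boolean mask, True = may attend. Prefix tokens see the whole prefix
    (bidirectional); suffix tokens see the prefix plus earlier tokens (causal)."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device))
    mask[:, :prefix_len] = True  # every position may attend to the full prefix
    return mask

# Usage: q, k, v are (batch, heads, seq_len, head_dim); the mask broadcasts.
q = k = v = torch.randn(1, 8, 16, 64)
out = F.scaled_dot_product_attention(
    q, k, v, attn_mask=prefix_lm_mask(seq_len=16, prefix_len=4)
)
```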