```python
cache_seqlens = torch.zeros(prompt_batch_size, device='cuda', dtype=torch.int32)
cache_batch_idx = torch.Tensor(batch_idx).to(device='cuda', dtype=torch.int32)
flash_attn_with_kvcache(
    q=q, k_cache=k_cache, v_cache=v_cache, k=k, v=v,
    rotary_cos=None, rotary_sin=None, cache_seqlens=cache_seqlens, ...
```
This is our MHA implementation with KV cache, which matches your recommended one:

```python
qkv = self.qkv_proj(x).view(B, S, 3, self.n_local_heads, self.head_dim)
query = qkv[:, :, 0, :, :]
kv_proj = qkv[:, :, 1:, :, :]
attn_output = flash_attn_w...
```
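For completeness, here is a minimal, self-contained sketch of how such a decoding step can be wired up (the module layout, projection names, and the `causal=True` flag are illustrative assumptions, not a copy of the code above):

```python
# Sketch of MHA decoding with flash_attn_with_kvcache (assumes flash-attn 2.x).
import torch
import torch.nn as nn
from flash_attn import flash_attn_with_kvcache

class MHAWithKVCache(nn.Module):
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_local_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv_proj = nn.Linear(dim, 3 * dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x, k_cache, v_cache, cache_seqlens, cache_batch_idx=None):
        B, S, _ = x.shape
        qkv = self.qkv_proj(x).view(B, S, 3, self.n_local_heads, self.head_dim)
        q = qkv[:, :, 0]                    # (B, S, n_heads, head_dim)
        k, v = qkv[:, :, 1], qkv[:, :, 2]   # new keys/values for this step
        # flash_attn_with_kvcache writes the new k/v into k_cache/v_cache at
        # position cache_seqlens (per batch element, in place) and attends
        # over cached + new tokens.
        out = flash_attn_with_kvcache(
            q=q, k_cache=k_cache, v_cache=v_cache, k=k, v=v,
            cache_seqlens=cache_seqlens, cache_batch_idx=cache_batch_idx,
            causal=True,
        )
        return self.out_proj(out.reshape(B, S, -1))
```

Note that the flash-attn kernels expect fp16/bf16 tensors on CUDA, and `k_cache`/`v_cache` must be preallocated with shape `(batch_size_cache, max_seqlen, n_heads, head_dim)`; `cache_batch_idx` then selects which cache rows the current batch maps to.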