# add_zero_attn in MultiheadAttention breaks causality
## 🐛 Bug

`add_zero_attn=True` in `MultiheadAttention` is ignoring the mask during `backward()`.

## To Reproduce

Steps to reproduce the behavior:

```python
import torch
import numpy as np

embedding_dim = 8
batch_size = 1
num_heads = 2
seq_len = 4
net = torc...
```
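The reproduction snippet above is cut off. As a rough illustration of the kind of check the report describes, here is a minimal self-contained sketch: the `MultiheadAttention` construction, the causal `attn_mask`, and the gradient-leak test below are assumptions filled in around the visible hyperparameters, not the reporter's exact code.

```python
import torch

# Hyperparameters visible in the truncated report; everything below them is an assumed sketch.
embedding_dim = 8
batch_size = 1
num_heads = 2
seq_len = 4

# Assumed construction: the issue is about add_zero_attn=True.
net = torch.nn.MultiheadAttention(embedding_dim, num_heads, add_zero_attn=True)

# Causal mask: True marks positions a query may NOT attend to (strictly upper triangle).
attn_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Self-attention over a random sequence, shape (seq_len, batch, embed_dim).
x = torch.randn(seq_len, batch_size, embedding_dim, requires_grad=True)
out, _ = net(x, x, x, attn_mask=attn_mask)

# Under a causal mask the output at position 0 depends only on x[0], so the
# gradient flowing back into x[1:] must be exactly zero. A non-zero value here
# would mean the mask is being ignored in the backward pass.
out[0].sum().backward()
print("max grad from future positions:", x.grad[1:].abs().max().item())
```

Per the report, the mask is ignored during `backward()` when `add_zero_attn=True`, so a check along these lines would print a non-zero gradient leaking from future positions instead of the expected `0.0`.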