[BUG] Given boolean tgt_mask, TransformerDecoder produces...
🐛 Describe the bug

Two tokens are decoded in this example. Because a square subsequent (causal) mask is applied, the output feature for the first token should ideally be the same regardless of the sequence length. Here are two ways to generate the tgt ...
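To make the expectation above concrete, here is a minimal sketch in plain Python (deliberately not using PyTorch, so it makes no claim about what `TransformerDecoder` actually does). It assumes a single attention head with identity projections, and a boolean mask where `True` means "may not attend", which is the convention I believe PyTorch uses for boolean attention masks. The helper names `causal_mask` and `masked_self_attention` are hypothetical and are not part of any library.

```python
import math

def causal_mask(n):
    """Boolean square subsequent mask; True marks positions that must NOT be attended to."""
    return [[j > i for j in range(n)] for i in range(n)]

def masked_self_attention(x, mask):
    """Single-head self-attention with identity projections (an assumption for brevity).

    x is a list of n token vectors; mask[i][j] == True means query i ignores key j.
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # Dropping masked keys is equivalent to giving them a score of -inf.
        allowed = [j for j in range(n) if not mask[i][j]]
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in allowed]
        # Numerically stable softmax over the allowed keys.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * x[j][k] for w, j in zip(weights, allowed))
                    for k in range(d)])
    return out

# Two illustrative token vectors (made-up values).
tokens = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]

out_len1 = masked_self_attention(tokens[:1], causal_mask(1))  # decode 1 token
out_len2 = masked_self_attention(tokens, causal_mask(2))      # decode 2 tokens

# Under a causal mask, token 0 only sees itself in both runs.
first_token_matches = all(math.isclose(a, b) for a, b in zip(out_len1[0], out_len2[0]))
print(first_token_matches)
```

In this toy setting the first token's output does not depend on how many tokens follow it, which is the property the report expects from the decoder as well.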