Figure 2. The architecture of a transformer encoder layer.

To introduce variation into the attention scores, the self-attention mechanism can be extended to multi-head attention (MHA). In the MHA module, multiple sets of queries, keys, and values are generated through linear projections of the input. Th...
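As a concrete illustration of the projections described above, the following is a minimal PyTorch sketch of an MHA module: the input is linearly projected into queries, keys, and values, split across heads, and each head computes scaled dot-product attention in parallel. The names (`d_model`, `num_heads`, `MultiHeadAttention`) and the choice of PyTorch are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention (MHA) sketch."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One linear projection per role; each produces all heads at once.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        # Project, then split the model dimension into independent heads.
        q = self.w_q(x).view(b, n, self.num_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.num_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.num_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention, computed per head in parallel.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = scores.softmax(dim=-1)
        out = attn @ v  # (batch, heads, seq_len, d_head)
        # Concatenate the heads and apply the final output projection.
        out = out.transpose(1, 2).reshape(b, n, self.num_heads * self.d_head)
        return self.w_o(out)


if __name__ == "__main__":
    x = torch.randn(2, 10, 64)  # (batch, seq_len, d_model)
    mha = MultiHeadAttention(d_model=64, num_heads=8)
    print(mha(x).shape)  # torch.Size([2, 10, 64])
```

Because each head attends over a separate subspace of the projected input, the heads can learn different attention patterns, which is the source of the variation in attention scores mentioned above.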