After calculating attention for every head, we concatenate all heads together and pass the result through a linear layer (the W_O matrix). In turn, each head is scaled dot-product attention with three separate matrix multiplications for the query, key, and value (the W_Q, W_K, and W_V matrices respectively)...
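The flow described above can be sketched in NumPy. This is a minimal illustration, not a production implementation: the function name, the single-input self-attention setup, and the choice to fold all heads' projections into one d_model × d_model matrix per role are assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_Q, W_K, W_V, W_O, num_heads):
    # x: (seq_len, d_model); each weight matrix: (d_model, d_model)
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # three separate projections for query, key, and value
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V

    # split each projection into heads: (num_heads, seq_len, d_head)
    def split(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # scaled dot-product attention, computed per head in parallel
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh  # (num_heads, seq_len, d_head)

    # concatenate the heads, then apply the output projection W_O
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O
```

Note that splitting one large projection into per-head slices is equivalent to giving each head its own smaller W_Q, W_K, and W_V; libraries typically use the fused form for efficiency.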