Expanding on the concept of scaled dot-product attention, Vaswani et al. [1] proposed the multi-head attention mechanism.
This method applies self-attention in parallel across h attention heads, each with its own set of projection matrices for Q, K, and V, so that different heads can capture different aspects of, and relationships within, the sequence.
The first step of the multi-head attention mechanism is to define the hyperparameter h and then to project the input into Q, K, and V matrices h times, computing scaled dot-product attention for each set of projections in parallel:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Here head_i represents the output of a single attention head, with i = 1, …, h, and W_i^Q, W_i^K, W_i^V are the learnable projection matrices of head i.
Finally, the outputs of the heads are concatenated and multiplied by a learnable weight matrix W^O to produce the final result:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
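As an illustration, the following is a minimal NumPy sketch of the mechanism described above. The input X, the head dimension d_k = d_model / h, and the randomly initialized projection matrices are assumptions made purely for this example; in a trained model these matrices would be learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len) attention scores
    return softmax(scores) @ V        # weighted sum of the value vectors

def multi_head_attention(X, h, rng):
    seq_len, d_model = X.shape
    d_k = d_model // h                # per-head dimension (assumed to divide evenly)
    W_O = rng.standard_normal((h * d_k, d_model)) * 0.02
    heads = []
    for _ in range(h):
        # Each head has its own Q, K, V projection matrices.
        W_Q = rng.standard_normal((d_model, d_k)) * 0.02
        W_K = rng.standard_normal((d_model, d_k)) * 0.02
        W_V = rng.standard_normal((d_model, d_k)) * 0.02
        heads.append(scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V))
    # Concatenate the head outputs and project them with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 64))      # 5 tokens, d_model = 64
out = multi_head_attention(X, h=8, rng=rng)
print(out.shape)                      # (5, 64)
```

Note that each head attends in a subspace of dimension d_k = d_model / h, so the total cost is comparable to single-head attention over the full dimension.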
Figure 1. Multi-head attention.
[1] Ashish Vaswani et al. "Attention Is All You Need". In: Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.