Multi-Head Attention

Extending the concept of scaled dot-product attention, Vaswani et al. [1] proposed the multi-head attention mechanism.

This method computes self-attention in parallel across h heads, each with its own set of Q, K, V matrices, so that different heads can capture different aspects of and relationships within the sequence.

The first step of the multi-head attention mechanism is defining the hyperparameter h and subsequently extracting the Q, K, V matrices h times, in the same fashion as scaled dot-product attention, with the h computations carried out in parallel.
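
Following the formulation in [1], each head applies scaled dot-product attention to its own learned linear projections of the queries, keys, and values:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K, and W_i^V are the learnable projection matrices of head i.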

Here head_i denotes the output of a single attention head, with i = 1, …, h.

Finally, the outputs of the heads are concatenated and then multiplied by a learnable weight matrix W^O to produce the final result.
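
In the notation of [1]:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O

The following is a minimal sketch of this mechanism in PyTorch. It assumes an input of shape (batch, seq_len, d_model) and packs the per-head projections into single linear layers, a common and equivalent implementation choice; the class name, shapes, and hyperparameter values are illustrative assumptions rather than the reference implementation of [1].

```python
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Multi-head scaled dot-product attention (illustrative sketch)."""

    def __init__(self, d_model: int, h: int):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by the number of heads h"
        self.h = h
        self.d_k = d_model // h  # dimensionality of each head
        # Learnable projections: the h per-head W_i^Q, W_i^K, W_i^V are packed
        # into one linear layer each; w_o plays the role of W^O.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        batch = q.size(0)

        def split_heads(x):
            # (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
            return x.view(batch, -1, self.h, self.d_k).transpose(1, 2)

        q = split_heads(self.w_q(q))
        k = split_heads(self.w_k(k))
        v = split_heads(self.w_v(v))

        # Scaled dot-product attention, computed for all heads in parallel
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        weights = torch.softmax(scores, dim=-1)
        heads = torch.matmul(weights, v)  # (batch, h, seq_len, d_k)

        # Concatenate the heads and apply the output projection W^O
        concat = heads.transpose(1, 2).contiguous().view(batch, -1, self.h * self.d_k)
        return self.w_o(concat)


# Usage example: 2 sequences of 10 tokens, d_model = 512, h = 8 heads
x = torch.randn(2, 10, 512)
mha = MultiHeadAttention(d_model=512, h=8)
out = mha(x, x, x)  # self-attention: Q, K, V all derived from x
print(out.shape)  # torch.Size([2, 10, 512])
```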

Figure 1. Multi-head attention.

[1] Ashish Vaswani et al. "Attention Is All You Need". In: Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.