Scaled Dot-Product Attention

Self-attention is the core mechanism behind Transformer models, which have achieved state-of-the-art results across a range of fields (e.g. Natural Language Processing).

Self-attention enables a model to weigh the significance of each element (token) in a sequence with respect to every other element, capturing the dependencies between them. Unlike recurrent neural networks (RNNs) or even convolutional neural networks (CNNs), the attention mechanism allows the model to process all elements of a sequence simultaneously, as illustrated in the sketch below.
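
To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, computing softmax(QKᵀ/√d_k)V. The function name, the toy shapes, and the random inputs are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity scores between every query and every key,
    # scaled by sqrt(d_k) to keep the softmax in a well-behaved range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V

# Toy example: a sequence of 4 tokens with embedding dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V = X
print(out.shape)  # (4, 8)
```

Because every query attends to every key in a single matrix product, the whole sequence is processed at once rather than step by step as in an RNN.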
