Self-attention is the core mechanism behind Transformer models, which have achieved state-of-the-art results in many fields, most notably natural language processing.
Self-attention enables a model to weigh the significance of each element (token) in a sequence relative to every other element, capturing the dependencies between them. Unlike recurrent neural networks (RNNs) or even convolutional neural networks (CNNs), the attention mechanism allows the model to process every element of the sequence simultaneously.
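To make this concrete, here is a minimal sketch of scaled dot-product self-attention in NumPy. The function names, shapes, and random projection matrices are illustrative assumptions, not the article's implementation; they only show how a single matrix product lets every token attend to every other token at once.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (n, d_model)."""
    Q = X @ W_q                         # queries, shape (n, d_k)
    K = X @ W_k                         # keys,    shape (n, d_k)
    V = X @ W_v                         # values,  shape (n, d_v)
    d_k = Q.shape[-1]
    # One (n, n) matrix of pairwise scores: every token attends to every
    # other token in a single step, with no sequential recurrence.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of values, shape (n, d_v)

# Illustrative usage with random data (all dimensions are arbitrary assumptions).
rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 16, 8, 8
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 8)
```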