<aside> 📘 Series:

Beginner’s Guide on Recurrent Neural Networks with PyTorch

A Brief Introduction to Recurrent Neural Networks

Attention Mechanism

Attention? Attention!

Illustrated Guide to Transformers- Step by Step Explanation

How to code The Transformer in PyTorch

Tokenizers: How machines read

</aside>


Can I have your Attention please! The introduction of the Attention Mechanism in deep learning has improved the performance of many models in recent years, and Attention continues to be an omnipresent component in state-of-the-art models. Therefore, it is vital that we pay Attention to Attention and how it goes about achieving its effectiveness.

In this article, I will be covering the main concepts behind Attention, including an implementation of a sequence-to-sequence Attention model, followed by the application of Attention in Transformers and how they can be used for state-of-the-art results. It is advised that you have some knowledge of Recurrent Neural Networks (RNNs) and their variants, or an understanding of how sequence-to-sequence models work.

What is Attention?

When we think about the English word “Attention”, we know that it means directing your focus at something and taking greater notice. The Attention mechanism in Deep Learning is based on this concept of directing your focus, paying greater attention to certain factors when processing the data.

In broad terms, Attention is one component of a network’s architecture, and is in charge of managing and quantifying the interdependence (both cases are sketched in code right after the list):

  1. Between the input and output elements (General Attention)
  2. Within the input elements (Self-Attention)
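
To make the distinction concrete, here is a minimal PyTorch sketch of dot-product attention used in both settings. The function, tensor names, and shapes are illustrative assumptions for this article, not code from any particular library or model:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(queries, keys, values):
    """Weight `values` by the similarity between `queries` and `keys`."""
    scores = torch.bmm(queries, keys.transpose(1, 2)) / keys.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)            # each query's weights sum to 1
    return torch.bmm(weights, values), weights     # context vectors + weights

batch, src_len, tgt_len, d = 1, 4, 5, 8
encoder_states = torch.randn(batch, src_len, d)    # input-side representations
decoder_states = torch.randn(batch, tgt_len, d)    # output-side representations

# 1. General Attention: queries come from the output side,
#    keys/values from the input side.
context, _ = dot_product_attention(decoder_states, encoder_states, encoder_states)

# 2. Self-Attention: queries, keys and values all come from the same sequence.
context, _ = dot_product_attention(encoder_states, encoder_states, encoder_states)
```

The mechanics are identical in both cases; the only thing that changes is where the queries and the keys/values come from.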

Let me give you an example of how Attention works in a translation task. Say we have the sentence “How was your day”, which we would like to translate to the French version - “Comment se passe ta journée”. For each word in the output sentence, the Attention component of the network maps the important and relevant words from the input sentence and assigns higher weights to these words, enhancing the accuracy of the output prediction.

Weights are assigned to input words at each step of the translation
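
As a toy illustration of this weighting (the scores here are made up purely for intuition): suppose the decoder is about to produce “journée” and has scored each input word for relevance. A softmax over those scores yields the Attention weights, which concentrate on “day”:

```python
import torch
import torch.nn.functional as F

# Hypothetical relevance scores for ["How", "was", "your", "day"]
# at the step where the decoder produces "journée".
scores = torch.tensor([0.2, 0.1, 0.5, 3.0])
weights = F.softmax(scores, dim=0)
print(weights)  # roughly [0.05, 0.05, 0.07, 0.83] -> "day" dominates
```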

The above explanation of Attention is deliberately broad, because there are many types of Attention mechanism. But fret not, you’ll gain a clearer picture of how Attention works and achieves its objectives later in the article. As the Attention mechanism has undergone multiple adaptations over the years to suit various tasks, many different versions of it are in use. We will only cover the more popular adaptations here: its usage in sequence-to-sequence models and the more recent Self-Attention.

While Attention does have applications in other fields of deep learning such as Computer Vision, its main breakthrough and success comes from its application in Natural Language Processing (NLP) tasks. This is because Attention was introduced to address the problem of long sequences in Machine Translation, a problem shared by most other NLP tasks.

Attention in Sequence-to-Sequence Models

Most articles on the Attention Mechanism will use the example of sequence-to-sequence (seq2seq) models to explain how it works. This is because Attention was originally introduced as a solution to address the main issue surrounding seq2seq models, and to great success. If you are unfamiliar with seq2seq models, also known as the Encoder-Decoder model, I recommend having a read through this article to get you up to speed.

<aside> 📘 A quick primer on RNN seq2seq

The defining feature of an RNN seq2seq model is that it can accept inputs of any length, consuming them by iterating the encoder cell. Each encoder step takes two inputs: the current input token and the encoder hidden state. This hidden state is what the network being trained actually carries forward from step to step.

Once the input has been consumed, the model switches to output mode: the final encoder hidden state (also called the context vector) is handed over as the initial decoder hidden state.

The decoder cell is then iterated in the same rolling fashion. Each decoder step likewise takes two inputs: the previous decoder output and the decoder hidden state. The decoder cell is iterated once for every word that needs to be produced.

Weakness

Such plain seq2seq models cannot handle long sequences well: the decoder tends to latch onto the most recent parts of the input and loses its grip on the input as a whole.

To give the RNN a longer-term view of the input, the Attention mechanism keeps every encoder hidden state available to the decoder, as described below.

</aside>
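
To tie the recap above to code: below is a minimal, stripped-down sketch of a vanilla GRU encoder-decoder (no Attention yet). All class names and hyperparameters are illustrative assumptions; the point to notice is that the decoder only ever receives the encoder’s final hidden state as its context.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        outputs, hidden = self.gru(self.embedding(src))
        return outputs, hidden                   # `hidden` is the context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_token, hidden):       # one decoding step at a time
        output, hidden = self.gru(self.embedding(prev_token), hidden)
        return self.out(output.squeeze(1)), hidden

enc, dec = Encoder(1000, 64), Decoder(1000, 64)
src = torch.randint(0, 1000, (1, 6))                 # a 6-token input sentence
_, context = enc(src)                                # only the final hidden state survives
logits, hidden = dec(torch.tensor([[1]]), context)   # first step, seeded from a <sos> token
```

In practice the predicted token would be fed back in as `prev_token` and the loop repeated until an end-of-sentence token is produced.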

Overall process of a Sequence-to-sequence model

The standard seq2seq model is generally unable to accurately process long input sequences, since only the last hidden state of the encoder RNN is used as the context vector for the decoder. The Attention Mechanism, on the other hand, directly addresses this issue, as it retains and utilises all the hidden states of the input sequence during the decoding process. It does this by creating a unique mapping between each time step of the decoder output and all the encoder hidden states. This means that for each output that the decoder makes, it has access to the entire input sequence and can selectively pick out specific elements from that sequence to produce the output.
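
Here is a minimal sketch of what this looks like at a single decoding step, assuming simple dot-product (Luong-style) scoring; the function and variable names are illustrative, not from any specific library:

```python
import torch
import torch.nn.functional as F

def attention_step(decoder_hidden, encoder_outputs):
    """Score every encoder hidden state against the current decoder state
    and return a weighted-sum context vector plus the weights."""
    # decoder_hidden:  (batch, hidden_size)          - current decoder state
    # encoder_outputs: (batch, src_len, hidden_size) - ALL encoder hidden states
    scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)
    weights = F.softmax(scores, dim=1)               # one weight per input position
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights                          # context feeds the output layer

encoder_outputs = torch.randn(1, 4, 64)              # e.g. "How was your day"
decoder_hidden = torch.randn(1, 64)
context, weights = attention_step(decoder_hidden, encoder_outputs)
print(weights)                                       # sums to 1 over the 4 input words
```

The context vector is recomputed at every decoding step, so each output word gets its own view over the whole input sequence.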