<aside> Series:
Beginner's Guide on Recurrent Neural Networks with PyTorch
A Brief Introduction to Recurrent Neural Networks
Illustrated Guide to Transformers- Step by Step Explanation
How to code The Transformer in PyTorch
</aside>
Attention is, to some extent, motivated by how we pay visual attention to different regions of an image or correlate words in one sentence. Take the picture of a Shiba Inu in Fig. 1 as an example.
Fig. 1. A Shiba Inu in a men's outfit. The credit of the original photo goes to Instagram @mensweardog.
Human visual attention allows us to focus on a certain region with "high resolution" (i.e. look at the pointy ear in the yellow box) while perceiving the surrounding image in "low resolution" (i.e. now how about the snowy background and the outfit?), and then adjust the focal point or do the inference accordingly. Given a small patch of an image, pixels in the rest provide clues about what should be displayed there. We expect to see a pointy ear in the yellow box because we have seen a dog's nose, another pointy ear on the right, and Shiba's mystery eyes (stuff in the red boxes). However, the sweater and blanket at the bottom would not be as helpful as those doggy features.
Similarly, we can explain the relationship between words in one sentence or within close context. When we see "eating", we expect to encounter a food word very soon. The color term describes the food, but is probably not related to "eating" as directly.
Fig. 2. One word "attends" to other words in the same sentence differently.
In a nutshell, attention in deep learning can be broadly interpreted as a vector of importance weights: in order to predict or infer one element, such as a pixel in an image or a word in a sentence, we use the attention vector to estimate how strongly it is correlated with (or "attends to", as you may have read in many papers) the other elements, and we take the sum of their values, weighted by the attention vector, as the approximation of the target.
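To make the weighted-sum view concrete, here is a minimal sketch in PyTorch; the tensor names, sizes, and the dot-product scoring are illustrative assumptions rather than any particular model's recipe.

```python
import torch
import torch.nn.functional as F

# A toy illustration of attention as a vector of importance weights.
# Sizes and tensor names are arbitrary; this is not tied to any specific model.
torch.manual_seed(0)

d = 8                               # feature dimension
query = torch.randn(d)              # the element we want to predict or infer
others = torch.randn(5, d)          # the other elements it may "attend to"
values = torch.randn(5, d)          # the values associated with those elements

scores = others @ query             # how strongly the target correlates with each element
weights = F.softmax(scores, dim=0)  # the attention vector: importance weights summing to 1
approximation = weights @ values    # weighted sum of values approximates the target

print(weights)                      # a probability distribution over the other elements
print(approximation)                # the attention-weighted summary
```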
The seq2seq model was born in the field of language modeling (Sutskever, et al. 2014). Broadly speaking, it aims to transform an input sequence (source) to a new one (target) and both sequences can be of arbitrary lengths. Examples of transformation tasks include machine translation between multiple languages in either text or audio, question-answer dialog generation, or even parsing sentences into grammar trees.
The seq2seq model normally has an encoder-decoder architecture, composed of:
- An encoder, which processes the input sequence and compresses the information into a fixed-length context vector, expected to be a good summary of the whole source sequence.
- A decoder, which is initialized with the context vector and emits the transformed output.
Both the encoder and decoder are recurrent neural networks, i.e. using LSTM or GRU units; a minimal PyTorch sketch follows below.
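The sketch below is a rough PyTorch illustration of this architecture, assuming GRU units and arbitrary dimensions and vocabulary sizes; the only thing passed from encoder to decoder is the encoder's final hidden state, i.e. the fixed-length context vector discussed next.

```python
import torch
import torch.nn as nn

# A minimal sketch of the encoder-decoder idea with GRU units.
# Hyperparameters, class names, and vocabulary sizes are illustrative only.
class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):               # src: (batch, src_len) of token ids
        _, h = self.rnn(self.embed(src))  # h: (1, batch, hidden_dim)
        return h                          # last hidden state = fixed-length context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, context):      # decoder is initialized with the context vector
        out, _ = self.rnn(self.embed(tgt), context)
        return self.out(out)              # logits over the target vocabulary

# Usage: encode the whole source sentence, then decode conditioned only on the context vector.
enc, dec = Encoder(vocab_size=1000), Decoder(vocab_size=1200)
src = torch.randint(0, 1000, (2, 7))      # toy batch of source token ids
tgt = torch.randint(0, 1200, (2, 5))      # toy batch of target token ids
logits = dec(tgt, enc(src))               # shape: (2, 5, 1200)
```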
Fig. 3. The encoder-decoder model, translating the sentence "she is eating a green apple" to Chinese. The visualization of both encoder and decoder is unrolled in time.
A critical and apparent disadvantage of this fixed-length context vector design is its incapability of remembering long sentences. Often the model has forgotten the first part of the sequence once it completes processing the whole input. The attention mechanism was born (Bahdanau et al., 2015) to resolve this problem.
The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT). Rather than building a single context vector out of the encoder's last hidden state, the secret sauce invented by attention is to create shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element.
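The sketch below illustrates such a shortcut for a single decoder step, assuming a plain dot-product score for brevity (Bahdanau et al. actually learn an additive scoring function); the alignment weights are recomputed for every output element, and the context vector is a weighted sum over all encoder hidden states.

```python
import torch
import torch.nn.functional as F

# Sketch of the attention "shortcut" for one decoder step.
# A dot-product score is assumed here for brevity; Bahdanau et al. (2015)
# use a small learned (additive) scoring network instead.
torch.manual_seed(0)

hidden_dim, src_len = 128, 7
encoder_states = torch.randn(src_len, hidden_dim)  # one hidden state per source word
decoder_state = torch.randn(hidden_dim)            # state for the output element being generated

scores = encoder_states @ decoder_state            # one score per source position
alignment = F.softmax(scores, dim=0)               # weights customized for this output element
context = alignment @ encoder_states               # context vector built from the entire source input

print(alignment.shape, context.shape)              # torch.Size([7]) torch.Size([128])
```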
Because the context vector now has access to the entire input sequence, we don't need to worry about forgetting. The alignment between the source and target is learned and controlled by the context vector. Essentially the context vector consumes three pieces of information: