One article to understand the Attention mechanism

Attention is being used more and more widely, especially after BERT became popular.

What is special about Attention? What are its principle and essence? What types of Attention are there? This article explains all aspects of Attention in detail.

To learn more NLP-related content, please visit the NLP topic, where a free 59-page NLP document is available for download.



What is the nature of Attention?

The Attention mechanism, understood at a shallow level, matches its name very well. Its core logic is to go "from paying attention to everything to focusing on the key points."

The essence of attention

The Attention mechanism works much like the way humans view pictures. When we look at a picture, we do not take in the whole image at once; instead we focus on its focal point. Let's take a look at the picture below:

Human eyes

We will certainly notice the four characters of "Jinjiang Hotel", as shown below:

Visual focus at Jin Jiang Hotel

However, I believe hardly anyone will notice that there is a string of phone numbers above "Jinjiang Hotel", nor notice "Xiyunlai Da Restaurant", as shown below:

Non-visual focus is easily ignored

So when we look at an image, it actually looks like this:

Human effect when looking at pictures

As shown above, our visual system is a kind of Attention mechanism: it focuses attention on the key information in order to save resources and obtain the most useful information quickly.

Attention mechanism in the AI domain

The Attention mechanism was first applied in computer vision and only later in NLP, but it was in NLP that it really took off: in 2018, BERT and GPT achieved surprisingly good results and became popular, and the Transformer and Attention, the core ideas behind them, began to attract widespread interest.

If you use a graph to express the position of Attention, it looks like this:

Attention location

This first gives everyone a macro-level picture of Attention; the mechanism is explained in more detail below. Before that, let's talk about why you should use Attention at all.


3 big advantages of Attention

There are 3 main reasons for introducing the Attention mechanism:

  1. Fewer parameters
  2. Faster speed
  3. Better results

3 big advantages of Attention

Fewer parameters

Compared with CNN and RNN, the model complexity is lower and there are fewer parameters, so the demand on computing power is smaller.

Faster speed

Attention solves the problem that an RNN cannot compute in parallel. Each step of the Attention computation does not depend on the result of the previous step, so it can be processed in parallel, just like a CNN.

Better results

Before the Attention mechanism was introduced, there was a problem that had long annoyed everyone: long-distance information gets weakened, just as a person with a poor memory cannot recall things from long ago.

Attention picks out the key points: even in a long text, it can grasp the important parts in the middle without losing important information. The words in red in the picture below are the key points that have been picked out.

Attention can capture key information in long text


Principle of Attention

Attention is often used together with the Encoder-Decoder. The previous article, "Understanding the model frameworks Encoder-Decoder and Seq2Seq in NLP", also mentioned Attention.

The animation below demonstrates the general flow of a machine translation task under the Encoder-Decoder framework after Attention is introduced.

Attention use under the Encoder-Decoder framework

However, Attention does not have to be used under the Encoder-Decoder framework; it can also be detached from it.

The image below is a schematic diagram of the Attention principle detached from the Encoder-Decoder framework.

Attention schematic

Small story explanation

The diagram above looks rather abstract, so here is a small story to explain the principle of Attention:

Small story explaining attention

There are many books (values) in the library (source). To make them easy to find, we give each book a number (key). When we want to learn about Marvel (query), we can look through books related to anime, movies, and even World War II.

To improve efficiency, we do not read every book carefully. Books related to Marvel, anime, and film get a careful read (high weight), while the World War II books only need a quick scan (low weight).

When we have finished reading, we have a comprehensive understanding of Marvel.


The Attention principle can be decomposed into 3 steps:

Attention principle 3 step decomposition

Step 1: Compute the similarity between the query and each key to obtain raw weights.

Step 2: Normalize the raw weights (e.g. with softmax) to obtain directly usable weights.

Step 3: Take the weighted sum of the weights and the values.
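As a minimal sketch, the three steps can be written in a few lines of NumPy. The shapes and random vectors here are invented purely for illustration:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(query, keys, values):
    # Step 1: similarity between the query and every key (dot product here).
    scores = keys @ query                 # shape: (num_keys,)
    # Step 2: normalize the scores into weights that sum to 1.
    weights = softmax(scores)             # shape: (num_keys,)
    # Step 3: weighted sum of the values.
    return weights @ values, weights      # output shape: (value_dim,)

# Toy data: 3 key/value pairs with 4-dimensional vectors.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
output, weights = attention(q, K, V)
```

The dot product is only one possible similarity; other score functions are listed at the end of this article.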


From the modeling above we can see that the idea behind Attention is simple: it can be summed up in the four characters "weighted summation" (带权求和), simplicity itself. To make a rough analogy, humans learn a new language in roughly four stages: rote memorization (learning grammar through reading and recitation to build a feel for the language) -> grasping the outline (in simple dialogue, picking out the key words in a sentence to catch its core meaning) -> integration (in complex dialogue, understanding the context and the connections behind the language, and learning by analogy) -> mastery (lots of immersive practice).

This also mirrors the development of Attention. The RNN era was the rote-memorization period; Attention-based models learned to grasp the outline; the Transformer evolved good integration and learning ability; and then GPT and BERT accumulated practical experience through large-scale multi-task learning.

Why does Attention work so well? Because it opens the model up: it grasps the outline and learns to integrate.

- Alibaba Tech


For more technical details, check out the article or video below:

"article"Attention mechanism in deep learning

"article"Attention, everywhere, do you really understand?

"article"Explore the Attention Attention Mechanism and Transformer in NLP

"video"Li Hongyi – transformer

"video"Li Hongyi-ELMO, BERT, GPT explanation


N types of attention

There are many different types of Attention: Soft Attention, Hard Attention, Static Attention, Dynamic Attention, Self Attention, and more. Below I explain the differences between them.

Type of attention

Thanks to the article "A summary of Attention as used in NLP", which summarizes this very well; the following quotes it directly:

This section classifies the forms of Attention by computation area, information used, structure level, and model.

1. Calculation area

According to the computation area of Attention, it can be divided into the following types:

1) Soft Attention: this is the more common form of Attention. Weight probabilities are computed for all keys, and every key gets a corresponding weight; it is a global computation (also called Global Attention). This approach is more rational, since it consults the contents of all keys before taking the weighted sum, but the amount of computation may be larger.

2) Hard Attention: this approach directly and precisely locates a single key and ignores all the others, as if the selected key had probability 1 and the remaining keys probability 0. The alignment requirement is therefore very demanding: it must be right in one step, and a misalignment has a large impact. On the other hand, because it is not differentiable, it generally has to be trained with reinforcement learning (or with tricks such as gumbel softmax).

3) Local Attention: this approach is a compromise between the two above, computing Attention over a window region. First use the Hard approach to locate a position, take a window centered on that point, and then apply Soft Attention within that small window.
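As an illustrative sketch, here is how the three computation areas differ when applied to the same score vector (the scores and the window size are made up):

```python
import numpy as np

scores = np.array([1.0, 3.0, 0.5, 2.0, -1.0])   # one score per key

# Soft (global): every key gets a probability via softmax.
e = np.exp(scores - scores.max())
soft_weights = e / e.sum()

# Hard: exactly one key gets probability 1, the rest get 0.
hard_weights = np.zeros_like(scores)
hard_weights[np.argmax(scores)] = 1.0

# Local: pick a center (the hard step), then soft attention inside a window.
center = int(np.argmax(scores))
lo, hi = max(0, center - 1), min(len(scores), center + 2)  # window of size 3
local_weights = np.zeros_like(scores)
w = np.exp(scores[lo:hi] - scores[lo:hi].max())
local_weights[lo:hi] = w / w.sum()
```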


2. Information used

Suppose we want to compute Attention over a piece of text, where "original text" refers to the text we want to attend to. The information used then includes internal information and external information: internal information is the information in the original text itself, and external information is any additional information beyond the original text.

1) General Attention: this approach uses external information and is often used for tasks that need to model the relationship between two texts. The query generally contains the additional information, and the original text is aligned against this external query.

For example, a reading comprehension task needs to model the association between the question and the article. Suppose the baseline computes a vector q for the question, concatenates this q with all the article word vectors, and feeds them into an LSTM for modeling. In this model, all the article word vectors share the same question vector. Now suppose we want the article word vector at each step to have its own question vector, that is, at each step use that step's article word vector to compute Attention over the question. Here the question is the original text, and the article word vectors are the external information.

2) Self Attention: this approach uses only internal information; key, value, and query are all derived from the input text itself. In self attention, key = value = query. Since there is no external information, each word in the original text can compute Attention against all the other words in the sentence, which amounts to finding the relationships inside the original text.

To continue the reading comprehension example: in the baseline above, when computing the vector q for the question, we can also use Attention here, using only the question's own information without introducing the article.
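A minimal self-attention sketch, where key = value = query all come from the same matrix of word vectors (all dimensions and values invented for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 5 "word vectors" of dimension 8 standing in for a sentence.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))

# Q = K = V = X: similarities are computed inside the original text itself.
scores = X @ X.T / np.sqrt(X.shape[1])   # (5, 5) word-to-word similarities
A = softmax(scores)                      # each row sums to 1
out = A @ X                              # each word becomes a weighted mix
```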


3. Structure level

In terms of structure, Attention can be divided into single-layer Attention, multi-layer Attention, and multi-head Attention according to whether there is a hierarchical relationship:

1) Single-layer Attention: this is the more common practice, using one query to attend once over a piece of original text.

2) Multi-layer Attention: generally used for models over text with a hierarchical structure. Suppose we divide a document into multiple sentences. At the first layer, we use Attention to compute a sentence vector for each sentence (this is single-layer Attention). At the second layer, we apply Attention over all the sentence vectors to compute a document vector (also single-layer Attention), and finally use the document vector for the task.

3) Multi-head Attention: this is the multi-head attention used in "Attention is All You Need". It uses multiple queries to attend over a piece of original text several times, with each query attending to a different part of the text, which amounts to repeating single-layer Attention multiple times:

head_i = Attention(q_i, K, V)

Finally, these results are concatenated together (and, in the original paper, projected once more with a matrix W^O):

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
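A rough NumPy sketch of the multi-head procedure, with random matrices standing in for the learned projections (all shapes invented for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def single_head(Q, K, V):
    """One single-layer (scaled dot-product) attention."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(2)
seq_len, d_model, h = 4, 16, 4
d_k = d_model // h
X = rng.normal(size=(seq_len, d_model))

heads = []
for _ in range(h):
    # Each head projects the input with its own (here random) W^Q, W^K, W^V.
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(single_head(X @ Wq, X @ Wk, X @ Wv))

# The h results are concatenated and projected with W^O (also random here).
Wo = rng.normal(size=(d_model, d_model))
out = np.concatenate(heads, axis=-1) @ Wo
```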


4. Model aspect

From the model perspective, Attention is generally used with CNNs and LSTMs; it can also be computed on its own as pure Attention.

1) CNN+Attention

CNN's convolution operation can extract important features, which I think is also in the spirit of Attention; but a CNN's receptive field is local, and the field of view has to be enlarged by stacking multiple convolution layers. In addition, Max Pooling directly selects the feature with the largest value, which is like the idea of hard attention: directly choosing one feature.

Attention can be combined with a CNN in these places:

a. Attention before the convolution operation. For example, in Attention-Based BCNN-1, the task is textual entailment, which requires processing two pieces of text. Attention is computed between the two input sequences, the feature vectors are calculated and then concatenated onto the original vectors, which serve as the input to the convolutional layer.

b. Attention after the convolution operation. For example, in Attention-Based BCNN-2, Attention is computed on the convolutional-layer outputs of the two pieces of text, and the result is used as the input to the pooling layer.

c. Attention in the pooling layer, replacing max pooling. For example, in Attention pooling, we first use an LSTM to learn a good sentence vector as the query, then use a CNN to learn a feature matrix as the key, then use the query to generate weights over the key, and take the Attention-weighted result as the final sentence vector.
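A sketch of item c, attention pooling: a query vector scores each row of a feature matrix, and the weighted sum replaces max pooling. The LSTM and CNN parts are replaced by random arrays purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
q = rng.normal(size=16)        # stand-in for the LSTM sentence vector (query)
F = rng.normal(size=(10, 16))  # stand-in for the CNN feature matrix (key/value)

# Score each feature row against the query, then softmax into weights.
scores = F @ q
weights = np.exp(scores - scores.max())
weights /= weights.sum()

max_pooled = F.max(axis=0)     # what max pooling would return
attn_pooled = weights @ F      # attention pooling: weighted sum instead
```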


2) LSTM+Attention

An LSTM internally has a gate mechanism: the input gate selects which current information to take in, and the forget gate selects which past information to forget. I think this is Attention to some degree, and it claims to solve the long-term dependency problem. In practice, however, the LSTM must capture sequence information step by step, its performance on long text slowly decays as the number of steps grows, and it is hard to retain all the useful information.

An LSTM usually needs to produce a single vector before doing the task. Common ways are:

a. Directly use the last hidden state (this may lose earlier information and has difficulty representing the full text).

b. Take an equal-weighted average of the hidden states of all steps (treating every step the same).

c. The Attention mechanism: take a weighted sum of the hidden states of all steps, focusing on the important hidden-state information in the text. This performs better than the previous two and makes it easy to visualize which steps are important, but watch out for overfitting, and it increases the amount of computation.
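The three variants a, b, and c can be sketched side by side. The hidden states and the query vector are random stand-ins, not real LSTM outputs or learned parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
H = rng.normal(size=(7, 32))   # stand-in for hidden states: 7 steps, size 32
q = rng.normal(size=32)        # assumed learned query for variant c

vec_a = H[-1]                  # a. last hidden state only
vec_b = H.mean(axis=0)         # b. equal-weight average of all steps

scores = H @ q                 # c. attention: score each step...
w = np.exp(scores - scores.max())
w /= w.sum()
vec_c = w @ H                  # ...and take the weighted sum
```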


3) pure Attention

"Attention is all you need" uses no CNN/RNN at all, a refreshing departure; but look closely and it is essentially a pile of vectors computing Attention with one another.


5. Similarity calculation methods

When computing Attention, we need to compute a score (similarity) between the query and each key. Common methods are:

1) Dot product: the simplest way, score(q, k) = qᵀk

2) Matrix multiplication: score(q, k) = qᵀWk

3) Cosine similarity: score(q, k) = qᵀk / (‖q‖ ‖k‖)

4) Concatenation: concatenate q and k, score(q, k) = W[q; k]

5) A multi-layer perceptron can also be used: score(q, k) = vᵀ tanh(Wq + Uk)
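These five standard score functions can be sketched directly. W, U, Wc, and v are random stand-ins for learned parameters, and q and k are arbitrary vectors:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(d, d))
U = rng.normal(size=(d, d))
Wc = rng.normal(size=2 * d)    # weight vector for the concatenation variant
v = rng.normal(size=d)

s_dot = q @ k                                               # 1) dot product
s_general = q @ W @ k                                       # 2) matrix multiplication
s_cos = (q @ k) / (np.linalg.norm(q) * np.linalg.norm(k))   # 3) cosine similarity
s_concat = Wc @ np.concatenate([q, k])                      # 4) concatenation
s_mlp = v @ np.tanh(W @ q + U @ k)                          # 5) multi-layer perceptron
```

Each method returns a single scalar score per query-key pair; the scores are then normalized (step 2 of the 3-step decomposition above) before the weighted sum.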