What is Transformer (Zhihu version)?

[Figure: Transformer structure diagram]

As the diagram above shows, the Transformer architecture looks a bit complicated at first glance. No matter, let's go through it step by step.

Like the classic seq2seq model, the Transformer also uses an encoder-decoder architecture. The box marked Nx on the left half of the diagram above represents one encoder layer; the encoder in the paper stacks 6 such layers. The box marked Nx on the right half represents one decoder layer, and there are likewise 6 of them.

The input sequence first goes through word embedding, the result is added to the positional encoding, and the sum is fed into the encoder. The output sequence is processed the same way as the input sequence and then fed into the decoder.
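The embedding-plus-position step can be sketched in a few lines of plain Python. This is only an illustration under assumptions: it uses the sinusoidal positional encoding from the original paper (the text above does not say which encoding is meant), and `embedding_table`, `positional_encoding`, and `embed_with_positions` are hypothetical names with made-up values. A real model would use a tensor library and learned embeddings.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding (assumed here): one d_model-sized vector per position."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Even dimensions use sin, the following odd dimension uses cos.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def embed_with_positions(token_ids, embedding_table):
    """Look up each token's embedding and add the encoding for its position."""
    d_model = len(embedding_table[0])
    pe = positional_encoding(len(token_ids), d_model)
    return [
        [e + p for e, p in zip(embedding_table[tok], pe[pos])]
        for pos, tok in enumerate(token_ids)
    ]

# Hypothetical 3-word vocabulary with 4-dimensional embeddings.
embedding_table = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
encoder_input = embed_with_positions([2, 0, 1], embedding_table)
```

Because the encoding depends only on the position, the same word placed at two different positions ends up with two different input vectors, which is how order information reaches a model that otherwise processes all words at once.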

Finally, the output of the decoder passes through a linear layer and then a softmax.
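As a rough sketch of this last step, the snippet below projects one decoder output vector to one score per vocabulary entry and turns the scores into probabilities. The sizes, weights, and names (`linear`, `softmax`, `decoder_output`) are invented for illustration and are not taken from the paper.

```python
import math

def linear(x, weights, bias):
    """Project vector x to one score (logit) per vocabulary entry."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values: a 2-dimensional decoder output and a 3-word vocabulary.
decoder_output = [1.0, -1.0]
weights = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

probs = softmax(linear(decoder_output, weights, bias))
predicted_word = probs.index(max(probs))
```

The word with the highest probability is the simplest choice for the next output token; other decoding strategies are possible but are outside the scope of this overview.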

The above is the overall framework of the Transformer; the encoder and decoder are introduced next.


What is Transformer (Microsoft Research version)?

The Transformer was proposed by Google in 2017 in the paper "Attention Is All You Need". It is an encoder-decoder model based entirely on the attention mechanism. It abandons the recurrent and convolutional structures that earlier models retained even after introducing attention, and relies on self-attention instead. This brought significant improvements in task performance, parallelism, and ease of training.

Before the Transformer appeared, most neural machine translation models adopted the RNN architecture, which relies on recurrence to process a sequence in order. Although the RNN architecture has strong sequence modeling capabilities, it suffers from problems such as slow training and poor training quality.

Unlike RNN-based approaches, the Transformer has no recurrent structure. Instead, all words or symbols in the sequence are processed in parallel, and the relationships between all the words in a sentence are modeled directly by the self-attention mechanism, regardless of their respective positions. Specifically, to compute the next representation of a given word, the Transformer compares that word one by one with the other words in the sentence and derives an attention score for each of them. The attention scores determine how much each of the other words contributes semantically to the given word. The scores are then used as the weights of a weighted average over all word representations, and the result is fed into a fully connected network to produce a new representation.
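The compare, score, and average steps described above can be sketched as follows. This is a deliberately simplified version under stated assumptions: it scores words with a scaled dot product and a softmax, lets each word also attend to itself, and omits the learned query/key/value projections and the fully connected network that a real Transformer layer has. The function names and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """For each word vector, return the attention-weighted average of all vectors."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Compare this word with every word in the sentence (scaled dot product).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights = softmax(scores)  # attention scores, summing to 1
        # Weighted average of all word representations.
        mixed = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy "sentence" of three 2-dimensional word vectors (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_repr, attn = self_attention(sentence)
```

Each word's row is computed independently of the other rows, so in a tensor library the whole loop collapses into a few matrix products. That independence is what allows all words to be processed in parallel.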

Since the Transformer processes all words in parallel, and each word can be linked to other words over multiple processing steps, it trains faster than RNN models and performs better on translation tasks. Besides computational performance and higher accuracy, another highlight of the Transformer is that you can visualize which parts of a sentence the network attends to when processing or translating a given word, which gives insight into how information flows through the network.

Later, Google researchers extended the standard Transformer with a new, efficiency-oriented, parallel-in-time recurrent structure, which gives it general-purpose computational power and better results on more tasks.

The improved model (Universal Transformer) replaces the Transformer's stack of several distinct, fixed transformation functions with a structure built from a single, parallel-in-time recurrent transformation function, while preserving the original parallel structure of the Transformer. Whereas an RNN processes a sequence one symbol after another from left to right, both the Universal Transformer and the Transformer process all symbols at the same time. The Universal Transformer then repeatedly refines its interpretation of every symbol in parallel, over several recurrent steps, using the self-attention mechanism. This parallel-in-time recurrence is not only faster than the serial recurrence used in RNNs, but also makes the Universal Transformer more powerful than the standard feed-forward Transformer.
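The structural difference can be illustrated with a toy sketch. Everything here is hypothetical: `mix_step` is only a stand-in for a real self-attention plus transition function, and the symbols are single numbers rather than vectors. The point is just the control flow: one shared function applied repeatedly to all positions at once, versus a stack of separate functions.

```python
def mix_step(symbols):
    """Toy stand-in for self-attention plus a transition function:
    every symbol is updated at once using the mean of all symbols."""
    mean = sum(symbols) / len(symbols)
    return [0.5 * s + 0.5 * mean for s in symbols]

def universal_encode(symbols, step_fn, num_steps):
    """Universal Transformer style: apply the SAME function num_steps times."""
    for _ in range(num_steps):
        symbols = step_fn(symbols)  # all positions are refined in parallel
    return symbols

def standard_encode(symbols, layer_fns):
    """Standard Transformer style: a fixed stack of separate layer functions."""
    for layer_fn in layer_fns:
        symbols = layer_fn(symbols)
    return symbols

state = universal_encode([0.0, 2.0, 4.0], mix_step, num_steps=3)
```

In the standard stack each entry of `layer_fns` would have its own parameters, while the recurrent version reuses one set of parameters at every step, so the number of refinement steps can change without adding new functions.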

The above content is reproduced from the official account "Microsoft Research AI Headlines".