The year BERT was proposed was also a year of rapid development in NLP. Academia kept proposing new pre-training models that refreshed the benchmarks on one task after another, while industry kept trying to apply pre-trained models such as BERT and XLNet to engineering problems. Why does BERT work so well, and where exactly does its power come from? At the AI ProCon 2019 conference, Zhang Junlin, head of the machine learning team at Sina Weibo AI Lab, gave a talk titled "What Have BERT and Transformer Learned?".

Zhang Junlin: What is the relationship between BERT and Transformer? After BERT was proposed, I kept thinking about one question: BERT works so well, but why?

Transformer is a feature extractor. Like CNN and RNN, it is a deep, hierarchical network structure used for feature extraction. BERT can be regarded as a two-stage processing pipeline, and the framework used in both stages is Transformer. Put simply, BERT uses Transformer to learn how to encode and store linguistic knowledge. That is the relationship between the two.

Before Transformer and BERT, the most common techniques were CNN, RNN, and the Encoder-Decoder framework; these three covered roughly 80% of the techniques and applications in NLP. Where are Transformer and BERT better than them? What does each layer of the network learn? What extra knowledge do they acquire? These are questions I have been thinking about, and I would like to share some of the current research conclusions with you.

The first part of today's sharing is an introduction to BERT and Transformer: their basic principles, workflows, strengths and weaknesses, and possible improvements, to give you a more intuitive understanding of them.

The second part introduces the probing methods used to open the black box of the Transformer and BERT model structures: what kind of knowledge the multi-layer Transformer learns, in what form it is encoded, which types of features are encoded, and what kind of problems each layer is good at. To answer these questions we need convenient tools for looking inside the black box, so this part describes the common techniques for exploring them and introduces the mainstream probing methods.

The third part is the content everyone cares about most. Using the probing methods above to study BERT's parameters and see what they contain, we can reach some conclusions: What did BERT learn? What does a pre-trained model learn beyond a model without pre-training? This part presents some of the current conclusions.

Finally, I will share some existing conclusions and experiences with you. Although there are still some problems in applying BERT in engineering, such as the high online inference latency caused by the model being so large, I believe that as long as the algorithm works well, engineering application is not an obstacle; there is always a way to overcome it.

Part One: BERT and Transformer

Everyone knows that BERT was proposed in October 2018. Since then it has had a great impact on both industry and academia, fundamentally because the model works so well that people cannot ignore it, and various applications of it have achieved breakthrough results. I previously wrote an article, "Innovation in the BERT Era: The Application Progress of BERT in Various NLP Fields", which introduced some application cases. In general, BERT has achieved good results in all kinds of application directions, though the situation differs from field to field. If BERT is regarded as a milestone in NLP, I believe no one will question that.

BERT has achieved such good results that everyone cannot help wondering: if I apply BERT to my own business, will it bring good business results? And what problems in the BERT model itself are worth further exploration and reflection?

I have compiled and summarized the literature I have read. Applying BERT in various NLP fields basically brings improvements, but the size of the improvement differs across fields, and even within one field different tasks and datasets show different gains. Here is a brief overview of some areas (as of May 2019): most work in the QA domain has tried the BERT model, with performance improvements of roughly 30% to 70%; reading comprehension sees roughly 30% to 50% improvement after applying BERT; in information retrieval, short-document retrieval benefits more obviously than long-document retrieval, with short documents improving 25% to 106% and long documents 20% to 30%; in dialogue systems, BERT currently improves performance by 5% to 40% (this depends on the specific application and subtask, since some sub-directions suit BERT better than others); in text summarization the improvement is not obvious, roughly 10%, and it seems BERT's potential there has not yet been realized; other applications such as Chinese word segmentation, text classification, and text generation have also tried BERT, but the improvements are not obvious.

Here I think there is a question everyone should think about: seeing these results and conclusions, ask yourself why this happens. Why does BERT behave so differently across different NLP applications? What is the reason behind it? This is a good question. What I just described is the improvement BERT brings in applications. From the appearance of Transformer, which preceded BERT, to today is already more than two years, yet for both BERT and Transformer we still do not understand the internal mechanisms very well. They are worth exploring, and doing so will deepen your understanding of both.

Although BERT is relatively new and effective, it certainly has shortcomings. Since it has shortcomings, we can find them and address them, making the model stronger and its results better and better. Next, I will list some possible directions for improving BERT:

First, text generation models. What is text generation? In machine translation you feed an English sentence into the model and it translates it into Chinese; that is a generation task. Text summarization is also a typical generation task: the model extracts a few sentences from an article as a summary of its content. Using BERT in generation tasks does improve results, but not by much; BERT's potential in generation tasks has not yet been realized, and there should be better ways to adapt it. This is a very important research direction, and if it can be done well, generation tasks such as machine translation and text summarization will see great gains.

Second, the introduction of structured knowledge. How to inject well-structured knowledge into BERT is also a valuable direction of improvement; it would let us directly solve the NLP tasks at hand that depend on such knowledge.

Third, multimodal fusion. Most of the time we currently use BERT only on text, but there are many multimodal scenarios. A Weibo post, for example, contains a lot of information: text, pictures, video, and social relationships. To fully understand a post, you must not only understand its text well but also understand what the pictures and the video say; these are different modalities. How can different modalities be fused better? Incorporating BERT into multimodal systems is certainly a very promising direction for improvement.

Fourth, larger and higher-quality training data. Further scaling up the data and improving the training procedure is a simple and direct optimization direction. There is already plenty of evidence that directly increasing the scale and quality of the training data improves BERT's results, which means we have not yet hit the ceiling of pre-trained models. When BERT was first pre-trained, the data amounted to roughly a dozen gigabytes. Suppose a company has plenty of money, money is not a problem, and it can pre-train on essentially unlimited data; there is no doubt BERT's results would improve substantially. Is anyone doing this now? Not really, because it costs too much. Judging from existing results, if someone very wealthy wanted to improve on BERT, the simplest way would be to increase the data scale, make the data richer, and raise its quality; by pouring in data it is possible to beat the currently published BERT numbers. Continuing to improve BERT by adding data has little technical novelty, but it is a simple and effective approach.

Fifth, more suitable training objectives and training methods. This is also a relatively simple direction of improvement, but it is particularly likely to be effective, and there is already some work on it.

Sixth, multilingual fusion. Current BERT models are mostly monolingual; how to integrate different languages within the BERT framework is another direction worth improving.

Of course, there are other optimization directions, but since they are not today's theme I will not elaborate on them. Next, let's analyze the hierarchical structures of Transformer and BERT.

As shown in the figure, this is the typical hierarchical structure of Transformer. A Transformer is composed of stacked blocks. Each block, as a basic component, is a small ecosystem involving many techniques, of which the four most critical sub-parts are Layer Norm, Skip Connection, Self-Attention, and the Feedforward Neural Network.
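To make those four sub-parts concrete, here is a minimal sketch of one Transformer block in PyTorch. It is an illustrative simplification (dropout, masking, and pre-/post-norm details are omitted), not BERT's exact implementation, and the dimensions are just the familiar base-model sizes.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: Self-Attention + Skip Connection + Layer Norm + Feedforward."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # self-attention over the token sequence
        x = self.norm1(x + attn_out)          # skip connection + layer norm
        x = self.norm2(x + self.ffn(x))       # feedforward, again with skip + norm
        return x

x = torch.randn(1, 16, 768)                   # (batch, sequence length, hidden size)
print(TransformerBlock()(x).shape)            # torch.Size([1, 16, 768])
```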

BERT consists of two stages, each with its own characteristics and goals. The first stage is pre-training and the second is fine-tuning. The pre-training stage uses a large amount of unlabeled text and trains with self-supervised objectives, encoding the linguistic knowledge contained in the text into the Transformer in the form of parameters. Fine-tuning is generally supervised, uses relatively little data, and adds a classification head on top of the model structure to solve the task at hand. How is the first stage connected to the second? In the pre-training stage the Transformer learns a great deal of knowledge, which serves as the initialization; in the second stage, fine-tuning starts from this initialized network and uses the language knowledge it has learned, together with new task-specific signals, to solve your problem.
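As an illustration of the second stage, here is a hedged sketch of fine-tuning a pre-trained BERT with a classification head using the Hugging Face transformers library. The sentences, labels, learning rate, and the single optimization step are placeholders, not a full training recipe.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pre-trained Transformer weights plus a freshly initialized classification head.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["the movie was great", "the movie was terrible"],
                  return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])                  # hypothetical sentiment labels

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)        # loss comes from the classification head
outputs.loss.backward()
optimizer.step()                               # one fine-tuning step; real training loops over data
```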

So why does BERT work so well? Why were earlier models not as good? Because the first stage encodes a large amount of linguistic knowledge from text, and before BERT, models did not use nearly that much text, nor did they use it in an unsupervised way. So what we care about is: What did the Transformer in BERT learn? What kind of knowledge does it learn beyond the traditional models? That is the key.

It should be said that Transformer and BERT are still not very mature; their structure is complex and their practical application is complicated. If we do not understand them deeply, if we do not know their structure and their strengths and weaknesses, we can hardly improve them into a better BERT or Transformer. How can we deepen our understanding of them? That is what we will discuss in depth next.

Part Two: Probing Methods

We said that BERT learns linguistic knowledge through pre-training. So where is this knowledge? It is in the Transformer's parameters. However, what we see is just a pile of parameters, a large mass of numbers; we cannot see the meaning inside them. So the question becomes: how do we know what each layer of the multi-layer Transformer has learned, and how can we see what it has learned? The techniques for this are generally called probing methods. So what are the commonly used probing methods?
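Most of the probing methods below start from the same raw material: the hidden states that each Transformer layer produces for a sentence. A minimal sketch of extracting them with the Hugging Face transformers library (the sentence is an arbitrary example):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: the embedding layer plus one tensor per Transformer layer,
# each of shape (batch, sequence length, hidden size).
for i, h in enumerate(outputs.hidden_states):
    print(f"layer {i}: {tuple(h.shape)}")
```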

Before talking about BERT's probing methods, let's start with DNNs, the famous black box. Everyone knows that DNNs work well, but what each neuron learns, we do not know; we cannot see it and do not understand it, we can only see whether a neuron's activation is larger or smaller. The relationships between neurons are also unknown, so nobody really understands how DNNs work. The academic community recognized this problem long ago; ever since DNNs appeared, many people have tried to understand how they work and to explore what each neuron learns. Feature visualization is a typical method for cracking the black box. It is very common in the image field, but it does not transfer universally. Today I will talk about the probing methods used for BERT and Transformer.

There are currently several typical methods. The first is visualization, for example 2D t-SNE, which displays representations in two dimensions. As shown in the figure below, we take the features from each Transformer layer and cluster nouns and phrases, with the same color representing the same type of phrase. If a layer's features cluster well, that layer has encoded this kind of knowledge. In this way we can learn which layer is suitable for solving which problem and what kind of knowledge it encodes; this is a typical visualization method.
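A hedged sketch of this kind of visualization: project one layer's representations to 2D with t-SNE and color them by a hypothetical category label. The sentences, labels, and layer choice below are invented for illustration and are much smaller than what the original studies used.

```python
import torch
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

sentences = ["Paris is a beautiful city", "London is a large city",
             "She ate an apple", "He bought a banana"]
layer = 6                                      # which layer to inspect (arbitrary choice)
vectors, labels = [], []
for idx, sent in enumerate(sentences):
    inputs = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        h = model(**inputs).hidden_states[layer][0]   # (seq_len, hidden)
    vectors.append(h[1:-1].mean(dim=0).numpy())       # average word tokens, drop [CLS]/[SEP]
    labels.append(idx // 2)                           # 0 = place sentences, 1 = food sentences

points = TSNE(n_components=2, perplexity=2, init="random").fit_transform(np.array(vectors))
plt.scatter(points[:, 0], points[:, 1], c=labels)
plt.title(f"t-SNE of layer {layer} sentence vectors")
plt.show()
```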

The second method is the Attention map. The Attention map is a very important tool for exploring the knowledge Transformer has learned: it lets us visually observe the relationship between a word and the other words, and how strong each connection is. As shown in the figure below, which word is the preposition 'at' most closely related to? The larger the attention weight, the thicker the line drawn between two words. We find that the line to 'Auction' is the thickest, which suggests that BERT has learned the relationship between the preposition and its governing noun. More importantly, the Attention map is a way of seeing what knowledge has been learned.
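A sketch of pulling out the attention weights behind such a diagram: for a chosen layer and head, print which tokens the word 'at' attends to most strongly. The sentence, layer, and head below are my own placeholders, not the ones from the original figure.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

sentence = "The painting was sold at the auction yesterday"
inputs = tokenizer(sentence, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
with torch.no_grad():
    attentions = model(**inputs).attentions    # tuple of (batch, heads, seq, seq), one per layer

layer, head = 5, 3                             # arbitrary layer and head to inspect
att = attentions[layer][0, head]               # (seq, seq) attention matrix
i = tokens.index("at")
weights = att[i]                               # how much 'at' attends to every token
for tok, w in sorted(zip(tokens, weights.tolist()), key=lambda p: -p[1])[:5]:
    print(f"{tok:12s} {w:.3f}")
```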

The third method is the Probing Classifier. For the embedding of a word at a certain Transformer layer, how do we find out what it has learned? We freeze the Transformer's parameters so they stay unchanged; the knowledge has already been encoded in those parameters, and we need a way to find out what each layer has learned. The example in the figure below is very direct. With the Transformer parameters frozen, the word at the highest Transformer layer has an embedding that represents the knowledge accumulated through the layers. How do we know what this embedding has learned? We add a small classification network on top. This network is deliberately very simple, because we do not want it to learn much knowledge itself; we only want it to use the knowledge the Transformer has already encoded to perform, say, part-of-speech tagging. If it tags correctly, that shows the Transformer has already encoded part-of-speech knowledge by that layer; if it tags incorrectly, that knowledge has not been encoded. With such a simple classifier, which has very few parameters of its own, essentially all the decision information comes from the knowledge inside the Transformer; if the task can be solved well, it shows that the Transformer stores a lot of knowledge relevant to that task. This is how we explore what kind of knowledge each Transformer layer has learned.
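A minimal sketch of a probing classifier, assuming a tiny hand-labeled POS set for illustration: freeze BERT, take one layer's vector for each target word, and train only a logistic-regression probe on top. Real probing studies use proper treebanks, held-out data, and careful controls.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()
for p in model.parameters():
    p.requires_grad = False                    # the Transformer stays frozen

# Hypothetical toy data: (sentence, index of the word to tag, POS label).
data = [("the dog runs fast", 1, "NOUN"), ("the dog runs fast", 2, "VERB"),
        ("a cat sleeps quietly", 1, "NOUN"), ("a cat sleeps quietly", 2, "VERB"),
        ("my friend smiles often", 1, "NOUN"), ("my friend smiles often", 2, "VERB")]

layer = 3                                      # probe this layer (arbitrary choice)
X, y = [], []
for sent, idx, tag in data:
    inputs = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        h = model(**inputs).hidden_states[layer][0]
    X.append(h[idx + 1].numpy())               # +1 skips the [CLS] token
    y.append(tag)

probe = LogisticRegression(max_iter=1000).fit(X, y)   # the only trained parameters
print("probe training accuracy:", probe.score(X, y))
```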

There is also an improved method called the Edge Probing Classifier. How does it differ from the Probing Classifier? The Probing Classifier can only tell us what the embedding of a single word has learned, but many tasks require more. For example, we may want to know what a phrase of two or three words has learned, or what the relationship between word A and word B in a sentence is. How do we probe that knowledge with the Edge Probing Classifier? As shown in the figure below, the Transformer parameters are still frozen, but the input to the simple classifier becomes multiple nodes: a span that may cover a segment, such as one word or two words. We then build a simple classifier for the classification task and observe its prediction accuracy; from that accuracy we infer what knowledge has been learned. The main difference from the Probing Classifier is the ability to probe the encoding of multiple nodes at once.
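A sketch of the edge-probing idea under the same toy assumptions: represent a span by pooling the frozen layer's token vectors over that span, then train a simple classifier on the pooled representation. The spans and their labels below are invented for illustration.

```python
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

def span_vector(sentence, start, end, layer=6):
    """Mean-pool the frozen layer's vectors over word positions [start, end)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        h = model(**inputs).hidden_states[layer][0]
    return h[start + 1:end + 1].mean(dim=0).numpy()   # +1 skips [CLS]

# Hypothetical spans labeled as noun phrase (NP) or verb phrase (VP).
examples = [("the old man walked slowly", 0, 3, "NP"),
            ("the old man walked slowly", 3, 5, "VP"),
            ("a small bird sang loudly", 0, 3, "NP"),
            ("a small bird sang loudly", 3, 5, "VP")]

X = np.array([span_vector(s, a, b) for s, a, b, _ in examples])
y = [label for *_, label in examples]
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("span probe training accuracy:", probe.score(X, y))
```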

The above are some common probing methods. With them, we can see what knowledge BERT or Transformer has learned. To give a rough summary of the current research conclusions: after BERT is trained, the lower Transformer layers mainly learn surface-level features of natural language, the middle layers encode syntactic information, and the upper layers encode semantic features. Many experiments support this conclusion.

How was this conclusion reached? The series of tasks in the figure above explains it. POS tagging, constituent parsing, dependency parsing (DEP), entities, SRL, coreference (COREF), and relation classification: going down this list, the tasks increasingly require high-level semantic knowledge to solve. POS tagging is a simple task biased towards surface features, while relation classification is a purely semantic task that cannot be solved well without understanding semantics; from top to bottom the tasks become gradually more semantic. The bar chart shows which Transformer layers each task relies on: the higher the score, the deeper the layers it depends on. The score of 9.40 for relation classification and 3.39 for POS tagging mean that relation classification relies more on the contribution of the higher Transformer layers, while POS information is obtained mainly from the lower layers. As the figure shows, as the layer depth relied upon increases, the tasks become more semantic.

Dividing the Transformer only into low, middle, and high levels is still a bit coarse; we would like to analyze the effect of each individual layer in more detail. In the figure above, the horizontal axis indicates the 24 Transformer layers and the vertical axis the size of each layer's contribution; a higher value means that layer matters more. The figure shows that the first, second, third, and fourth layers contribute the most to solving the part-of-speech tagging task, while the other layers contribute little. The method used is the Probing Classifier just described. This shows that the lower Transformer layers are more suitable for solving surface-feature tasks: part-of-speech information is encoded there, and the lower layers encode surface-level and syntactic knowledge. The more detailed conclusion can be summarized as follows: syntactic knowledge is layer-local, depending heavily on a few specific layers, whereas semantic knowledge is not layer-local and is encoded across many layers.
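The layer-by-layer analysis amounts to running the same probe once per layer and comparing accuracy against depth. A rough sketch, reusing the toy POS setup from the probing-classifier example above (real studies use full treebanks and held-out evaluation, so the numbers here are only illustrative):

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

data = [("the dog runs fast", 1, "NOUN"), ("the dog runs fast", 2, "VERB"),
        ("a cat sleeps quietly", 1, "NOUN"), ("a cat sleeps quietly", 2, "VERB")]

# Cache every layer's hidden states once per sentence, then probe each layer in turn.
cached = {}
for sent, _, _ in data:
    if sent not in cached:
        with torch.no_grad():
            cached[sent] = model(**tokenizer(sent, return_tensors="pt")).hidden_states

num_layers = len(cached[data[0][0]])           # embedding layer + 12 layers for bert-base
for layer in range(num_layers):
    X = [cached[sent][layer][0, idx + 1].numpy() for sent, idx, _ in data]
    y = [tag for *_, tag in data]
    acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
    print(f"layer {layer:2d}: POS probe accuracy {acc:.2f}")  # on real data, lower layers should score highest
```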

The conclusions above match expectations; the next one is more interesting. As shown in red in the figure below, the higher Transformer layers tend to encode semantic knowledge while the lower layers encode syntactic knowledge, and the high-level semantic knowledge can feed back to the lower layers, revising the low-level syntactic features under high-level semantic guidance. Take the sentence 'He smoked Toronto in the playoffs with six hits'. 'Toronto' is ambiguous: it can refer to a place or to a sports team. If we feed this sentence into Transformer or BERT, at a given layer is 'Toronto' encoded as the place or as the team? That is what we want to know.

We can judge the knowledge encoded at each layer by observing the ratio of yellow to blue across layers 0 to 12. Yellow indicates that 'Toronto' is interpreted as a place name and blue that it is interpreted as a team name. As the figure shows, layers 0, 1, and 2 basically do not treat 'Toronto' as a team name but as a place name, while at the higher layers the team-name interpretation becomes dominant. Why? Because BERT found the word 'smoked' in the sentence and recognized a verb-object relationship between it and 'Toronto', so it leaned towards judging Toronto to be a team name. This semantic knowledge is encoded at the higher layers and in turn affects the judgment of the middle and lower layers, which shows that high-level semantic knowledge can feed back and correct low-level syntactic knowledge.
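The study behind this figure uses trained probes, but a quick, purely heuristic way to build intuition for the same phenomenon is to compare the contextual vector of the ambiguous word, layer by layer, against the same word in two unambiguous reference contexts. This is not the original method; the reference sentences are invented and cosine similarity is only a rough proxy.

```python
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

def word_vectors(sentence, word):
    """Return the word's contextual vector at every layer."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    i = tokens.index(word)
    with torch.no_grad():
        hs = model(**inputs).hidden_states
    return [h[0, i] for h in hs]

target = word_vectors("he smoked toronto in the playoffs with six hits", "toronto")
place  = word_vectors("toronto is a large city in canada", "toronto")
team   = word_vectors("toronto won the championship game last night", "toronto")

for layer, (t, p, q) in enumerate(zip(target, place, team)):
    sim_place = F.cosine_similarity(t, p, dim=0).item()
    sim_team = F.cosine_similarity(t, q, dim=0).item()
    print(f"layer {layer:2d}: place {sim_place:.3f}  team {sim_team:.3f}")
```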

Below we explain what linguistic knowledge is specifically encoded at the lower, middle, and upper Transformer layers. The lower layers encode word position information most fully. The horizontal axis indicates layer depth. We can see that the result at layer 2 is already very good, while by around the fourth layer the prediction error is already severe, indicating that word position is encoded in the lower layers and the upper layers have basically lost it; the position task cannot be solved with upper-layer features. Position information is mainly encoded and used in the lower layers, where it serves to build the structural relationships between words, while the upper layers encode the structural information formed from the lower layers and no longer keep the raw positions.
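The position claim can be checked with the same per-layer probing loop, just with the token's position as the label instead of a linguistic tag. A rough sketch under the same toy assumptions as the earlier probes:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

sentences = ["the dog chased the ball", "a bird flew over the house",
             "my friend cooked a meal", "the teacher opened the door"]
hidden = []
for sent in sentences:
    with torch.no_grad():
        hidden.append(model(**tokenizer(sent, return_tensors="pt")).hidden_states)

for layer in (0, 2, 6, 11):                    # a few layers from bottom to top
    X, y = [], []
    for hs in hidden:
        h = hs[layer][0]
        for pos in range(1, h.shape[0] - 1):   # skip [CLS] and [SEP]
            X.append(h[pos].numpy())
            y.append(pos)                      # label = the token's position in the sentence
    acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
    print(f"layer {layer:2d}: position probe accuracy {acc:.2f}")
```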

In addition, the lower layers also encode phrase information and the special symbols.

The middle layers encode syntactic information. In the syntactic prediction task, the horizontal axis covers each of the 24 layers and the vertical axis is a performance indicator: the layers that predict syntax better encode more syntactic information. It can be seen that layers 3 to 8 are better for syntactic prediction.

The upper layers encode semantic information. As shown in the figure below, for the coreference resolution task, what do the pronouns 'He' and 'She' refer to? Has BERT learned this? The figure shows that the coreference relationship has been encoded into BERT's features, so the task is solved well.

Part Three: What does pre-trained BERT learn beyond a model without pre-training?

Using the Probing Classifier method described above, we probe the pre-trained model and compare it with a model of the same structure that is randomly initialized and trained directly on the task without any pre-training. We find that the model without pre-training actually performs better on the sentence-length prediction task than pre-trained BERT, indicating that the pre-trained model sacrifices part of its surface-feature expressiveness in exchange for richer and more complex feature representations.
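A sketch of this comparison, assuming the same probing setup as before: load BERT once with pre-trained weights and once randomly initialized from the same configuration, then probe both on a surface task such as sentence length. The sentences are toy placeholders, so the printed numbers are only illustrative.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertTokenizer, BertModel, BertConfig

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
pretrained = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()
config = BertConfig.from_pretrained("bert-base-uncased", output_hidden_states=True)
random_init = BertModel(config).eval()         # same architecture, no pre-training

sentences = ["dogs bark", "the cat slept on the warm mat",
             "rain fell", "children played happily in the park all afternoon"]
labels = [len(s.split()) for s in sentences]   # surface label: sentence length in words

def probe_accuracy(model, layer=11):
    X = []
    for sent in sentences:
        with torch.no_grad():
            h = model(**tokenizer(sent, return_tensors="pt")).hidden_states[layer]
        X.append(h[0, 0].numpy())              # use that layer's [CLS] vector
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return clf.score(X, labels)

print("pre-trained:", probe_accuracy(pretrained))
print("random init:", probe_accuracy(random_init))
```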

We know that there are many different pre-trained models at the moment. What are the similarities and differences between them? What has the BERT pre-trained model learned beyond the others? Compared with models such as CoVe and ELMo, BERT encodes more syntactic information, while the amount of semantic information is roughly equivalent. And what does BERT learn beyond GPT? Because BERT is deeper, it is better at encoding semantic features. Finally, what does ELMo learn beyond traditional models such as RNN and CNN? Through pre-training, ELMo learns more and longer-range contextual features than CNN.

Finally, to summarize: the Transformer inside BERT learns surface features in the lower layers, syntactic features in the middle layers, and semantic features in the upper layers. Although related work is ongoing, it is not yet detailed enough and deserves deeper exploration; I believe there will be more and better research in the future. That is all for my sharing today. Thank you all!


This article was reposted from the WeChat public account AI Technology Base Camp. Original address