Microsoft Research Edition

At present, neural networks are mostly trained with the Back Propagation (BP) algorithm: the model parameters are randomly initialized and then refined with an optimization algorithm. However, when labeled data is scarce, models trained this way tend to have limited accuracy. "Pre-training" addresses this problem well, and can also model word polysemy.
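The training recipe described above can be sketched in a few lines. This is a deliberately tiny illustration, not a real neural network: the "model" is a single parameter fitted by plain gradient descent, and the learning rate and target are made up for the example.

```python
import random

# Toy illustration of the usual recipe: parameters are randomly
# initialized, then an optimization algorithm (plain gradient descent
# here) refines them. The "model" is one parameter w, and the loss
# being minimized is (w - 3)^2.
random.seed(0)
w = random.uniform(-1.0, 1.0)    # random initialization
lr = 0.1                         # learning rate

for _ in range(100):
    grad = 2 * (w - 3.0)         # d/dw of the loss (w - 3)^2
    w -= lr * grad               # gradient-descent update

# w has converged close to the optimum 3.0
```

With abundant labeled data this works well; the point of the surrounding text is that with few labels, starting from a random `w` is wasteful, and pre-training supplies a better starting point.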

Pre-training means first training a language model on a large amount of unlabeled text to obtain a set of model parameters, initializing the model with those parameters, and then fine-tuning the model for the specific task at hand. Pre-training has proven to give better results on classification and tagging tasks in natural language processing. Three pre-training methods are currently popular: ELMo, OpenAI GPT, and BERT.
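The pre-train-then-fine-tune workflow can be shown schematically. In this sketch the "parameters" are merely word counts from unlabeled text, which is an assumption made purely for illustration; the essential point is that fine-tuning starts from the pre-trained parameters rather than from scratch.

```python
from collections import Counter

def pretrain(unlabeled_texts):
    """Learn 'parameters' (here: word frequencies) from raw, unlabeled text."""
    params = Counter()
    for text in unlabeled_texts:
        params.update(text.lower().split())
    return params

def finetune(params, labeled_examples):
    """Initialize from the pre-trained params, then adapt on labeled data."""
    model = dict(params)           # initialization: copy, don't restart
    for text, label in labeled_examples:
        for word in text.lower().split():
            model[word] = model.get(word, 0) + (1 if label else -1)
    return model

pretrained = pretrain(["the model reads text", "the text is unlabeled"])
model = finetune(pretrained, [("great model", True)])
```

Real systems transfer millions of neural-network weights instead of counts, but the two-phase structure is the same.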

At the beginning of 2018, researchers at the Allen Institute for Artificial Intelligence and the University of Washington presented ELMo in a paper titled "Deep contextualized word representations." Traditional word embeddings assign each word a single fixed vector. In contrast, ELMo uses a pre-trained bidirectional language model to derive a representation of each word from the specific input it appears in. In supervised NLP tasks, ELMo representations can be concatenated directly with the word-vector input of the task-specific model or with the model's highest-layer representation.
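The contrast between a static embedding and a context-dependent one can be made concrete with a toy example. The vectors and the `contextual` function below are invented stand-ins (a real ELMo representation comes from a bidirectional LSTM language model); the sketch only shows that the same word receives different vectors in different sentences.

```python
# Static embedding: one fixed vector per word, regardless of context.
STATIC = {"bank": [0.1, 0.9]}

def contextual(word, sentence):
    """Derive a representation from the word AND its surrounding words.

    The arithmetic here is a fake stand-in for a real biLM computation;
    it only demonstrates context-dependence.
    """
    context = [w for w in sentence if w != word]
    shift = 0.1 * len(context)
    base = STATIC.get(word, [0.0, 0.0])
    return [base[0] + shift, base[1] - shift]

v1 = contextual("bank", ["river", "bank"])
v2 = contextual("bank", ["bank", "deposit", "money"])
# v1 and v2 differ, while STATIC["bank"] never changes
```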

Building on ELMo, OpenAI researchers presented OpenAI GPT in "Improving Language Understanding by Generative Pre-Training." Unlike ELMo, which supplies an explicit word vector for each word, OpenAI GPT learns a generic representation that can be applied to a large number of tasks. When handling a specific task, OpenAI GPT does not need a new model structure built for that task; instead, a softmax output layer is attached to the last layer of the Transformer language model, and the entire model is fine-tuned.
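The "last layer plus softmax" idea can be sketched as follows. Here `features` is a made-up stand-in for the output of the pre-trained Transformer's last layer, and the weights are invented for the example; only the task head (a linear layer followed by softmax) is task-specific.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def task_head(features, weights):
    """Linear layer + softmax: the only task-specific part of the model."""
    logits = [sum(f * w for f, w in zip(features, row)) for row in weights]
    return softmax(logits)

features = [0.5, -1.2, 0.3]                    # pretend last-layer output
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # 2-class output layer
probs = task_head(features, weights)           # a probability distribution
```

In the real method, the gradients from this head flow back through the whole Transformer during fine-tuning, so all parameters are updated, not just the head.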

Both ELMo and OpenAI GPT learn language representations with a one-way language model, whereas Google's proposed BERT achieves bidirectional learning and obtains better training results. Specifically, BERT uses the Transformer encoder as its language model and proposes two new objectives for language model training: MLM (Masked Language Model) and sentence prediction. MLM randomly masks 15% of the tokens in the input word sequence and predicts the masked words using context from both directions. To enable the model to learn relationships between sentences, the researchers also have it predict whether one sentence follows another: a binary classification of whether two sentences are consecutive, trained jointly with the masked-word likelihood.
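BERT's two objectives can be illustrated on toy data. This sketch masks roughly 15% of the tokens and pairs two sentences with a binary label; the real method additionally sometimes substitutes a random token or keeps the original in place of `[MASK]`, a refinement omitted here, and the tokens below are made up for the example.

```python
import random

random.seed(42)

def mask_tokens(tokens, rate=0.15):
    """Select ~15% of tokens, record them as targets, replace with [MASK]."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < rate:
            targets[i] = tok          # what the model must predict
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

def make_nsp_example(sent_a, sent_b, is_next):
    """Sentence-pair input with a binary 'is B the next sentence?' label."""
    return (["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"], is_next)

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
pair, label = make_nsp_example(["it", "rained"], ["we", "stayed", "in"], True)
```

Training then maximizes the likelihood of the masked words given the surrounding (bidirectional) context, jointly with the sentence-pair classification loss.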

Pre-training

The above content is reproduced from the public WeChat account Microsoft Research AI Headlines.


Baidu Encyclopedia version

Unsupervised pre-training trains on data that contains no output targets; the learning algorithm must automatically discover valuable information in the data.
