What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers, i.e. the encoder of a bidirectional Transformer. The decoder is not used because it cannot see the information that is to be predicted. The model's main innovations lie in its pre-training method, which uses Masked LM and Next Sentence Prediction to capture word-level and sentence-level representations, respectively.

Judging by the current trend, pre-training a language model seems to be a fairly reliable approach. From AI2's earlier ELMo, to OpenAI's fine-tuned Transformer, to Google's BERT, all of them are applications of pre-trained language models. BERT differs from the other two in the following ways:

  1. When training the bidirectional language model, it replaces a small fraction of words, with a small probability, by a Mask token or by another random word. I personally feel that the aim is to force the model to rely more on its memory of the context. As for the probability itself, I guess Jacob just picked it off the top of his head.
  2. It adds a loss for predicting the next sentence. This looks quite novel.
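
The next-sentence objective in point 2 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the helper names `make_nsp_pairs` and `nsp_loss` are hypothetical, and the 50/50 split between true and random next sentences is the setting described in the BERT paper rather than something stated in this article.

```python
import math
import random


def make_nsp_pairs(documents, rng=None):
    """Build (sentence_a, sentence_b, is_next) examples.

    `documents` is a list of documents, each a list of sentences.
    Assumption (from the BERT paper): about half the pairs use the true
    next sentence (label 1); the rest use a random sentence taken from
    a different document (label 0).
    """
    rng = rng or random.Random()
    if sum(1 for d in documents if d) < 2:
        raise ValueError("need at least two non-empty documents")
    pairs = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the sentence that really follows.
                pairs.append((doc[i], doc[i + 1], 1))
            else:
                # Negative example: a sentence from another document.
                others = [d for j, d in enumerate(documents) if j != doc_idx and d]
                pairs.append((doc[i], rng.choice(rng.choice(others)), 0))
    return pairs


def nsp_loss(prob_is_next, label):
    """Binary cross-entropy for one pair, given the predicted P(is_next)."""
    return -math.log(prob_is_next if label == 1 else 1.0 - prob_is_next)


docs = [["a1", "a2", "a3"], ["b1", "b2"]]
for a, b, label in make_nsp_pairs(docs, random.Random(0)):
    print(a, b, label)
```

In a real model, `prob_is_next` would come from a classifier on top of the encoder output, and this loss would be added to the masked-word loss during pre-training.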

The BERT model has the following two characteristics:

  1. The model is very deep, at 12 layers, but not wide: the middle layer is only 1024, whereas the earlier Transformer model had 2048 in the middle layer. This seems to confirm a view from computer vision, namely that deep, narrow models work better than shallow, wide ones.
  2. MLM (Masked Language Model) uses the words on the left and the right at the same time. This already appeared in ELMo, so it is definitely not original. Moreover, applying a Mask (occlusion) to language models was already proposed by Ziang Xie (I was also fortunate to take part in this paper): [1703.02573] Data Noising as Smoothing in Neural Network Language Models. This is also a paper with an all-star author list: Sida Wang, Jiwei Li (the founder and CEO of Shannon Technology and the NLP scholar with the most publications to date), Andrew Ng, and Dan Jurafsky are all coauthors. Unfortunately, the BERT authors did not notice this paper. I believe that BERT's ability might be improved by using this paper's method for masking.
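
The masking step discussed in this article (replacing a small fraction of words with a mask token or a random word) can be sketched as below. This is only an illustrative sketch: the function name `mask_tokens` is hypothetical, and the concrete rates (select 15% of positions; of those, 80% become `[MASK]`, 10% become a random word, 10% stay unchanged) are the values reported in the BERT paper, not figures given in this article.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modelling.

    labels[i] holds the original token at positions the model must
    predict, and None everywhere else.
    """
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # replace with a random word
        # otherwise: leave the original token in place
    return inputs, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))
inputs, labels = mask_tokens(tokens, vocab, rng=random.Random(1))
print(inputs)
print(labels)
```

Because the model sees the words on both sides of each selected position, it has to use the left and right context together to recover the original token.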

Content from: [NLP] Google BERT | [NLP natural language processing] Google BERT model depth analysis