It is no exaggeration to say that BERT, from Google's AI lab, has profoundly changed the NLP landscape.

Imagine a model that is trained on a large amount of unlabeled data and then achieves SOTA results on 11 different NLP tasks with only a little fine-tuning. That is BERT, and it has completely changed the way we design NLP models.

After BERT, many new NLP architectures, training methods, and language models sprang up, such as Google's Transformer-XL, OpenAI's GPT-2, XLNet, ERNIE 2.0, and RoBERTa.

Note: I will mention the Transformer a lot in this article. If you are not familiar with it, you can check out this article: How do Transformers Work in NLP? A Guide to the Latest State-of-the-Art Models.

What is BERT?

You have probably heard of BERT and know how dramatically it has changed the field of NLP, but what exactly is it?

Here is how the BERT team describes the framework: BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from unlabeled text by conditioning on both the left and the right context. Once pre-training is complete, the BERT model only needs to be fine-tuned with one additional task-specific output layer to obtain SOTA results.

This explanation may not be clear enough for newcomers, but it is a good summary of how BERT works. Let's break it down piece by piece.

The first thing to note is that BERT stands for Bidirectional Encoder Representations from Transformers. Every word in the name has a meaning, and we will go through them in the following paragraphs. For now, the most important takeaway from the name is that BERT is based on the Transformer architecture.

Second, BERT is pre-trained on a large amount of unlabeled text, including the entire English Wikipedia (2.5 billion words) and the BooksCorpus (800 million words).

This pre-training step is crucial to BERT. It is thanks to such a large corpus that the model can pick up a deeper and more accurate understanding of how the language works during training. The knowledge gained this way is a cure-all that helps with almost any NLP task.

Third, BERT is a "deeply bidirectional" model. Bidirectional means that, during training, BERT draws on context from both the left and the right of the current position.

Contextual information is essential for understanding meaning accurately. Consider the example below, in which the same word "bank" appears in both sentences:

Figure: BERT captures context information

If we rely only on the left context or only on the right context to work out the meaning of "bank", we will get it wrong in at least one of these two sentences.

The solution is to consider both the left and the right context before making a prediction, and that is exactly what BERT does.

Finally, the most attractive thing about BERT is that we can obtain SOTA results on a variety of NLP tasks simply by adding a task-specific output layer on top of the model.

From Word2Vec to BERT: NLP's exploration of language representation

"One of the biggest challenges in natural language processing is the shortage of training data. Because NLP is a diverse field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand manually labeled examples." - Google AI

Word2Vec and GloVe

The effort to learn language representations by pre-training models on large amounts of unlabeled text started with word embeddings such as Word2Vec and GloVe.

Word embeddings changed the way NLP tasks are performed. With embeddings, we can capture relationships between words based on the contexts they appear in.

The embeddings shown in the figure were widely used to train models for downstream NLP tasks and obtain better predictions.

A major drawback of these earlier embedding methods is that they rely on very shallow language models, which limits the amount of information they can capture.

Another drawback is that these embedding models do not take the context of a word into account. As in the "bank" example mentioned earlier, the same word may have different meanings in different contexts.

However, models such as Word2Vec represent "bank" with the same vector in every context.

As a result, some important information is lost.
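The limitation is easy to see in a minimal sketch. The vectors and the tiny vocabulary below are made up for illustration and do not come from a trained Word2Vec model; the point is only that a static embedding is a plain lookup, so the surrounding words cannot influence the result.

```python
# Toy static embedding table: one fixed vector per word (values are made up).
static_embeddings = {
    "bank": [0.21, -0.47, 0.83],
    "river": [0.05, 0.91, -0.12],
    "deposit": [-0.64, 0.10, 0.38],
}


def embed(sentence):
    """Look up each known word; the context plays no role in the result."""
    return {
        word: static_embeddings[word]
        for word in sentence.lower().split()
        if word in static_embeddings
    }


financial = embed("I made a deposit at the bank")
riverside = embed("We sat on the river bank")

# The two senses of "bank" receive exactly the same vector.
same_vector = financial["bank"] == riverside["bank"]
print(same_vector)
```

A contextual model would instead compute the vector for "bank" from the whole sentence, so the two occurrences could differ.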


ELMo was a response to the problem of polysemy, that is, words that have different meanings in different contexts.

From training shallow feed-forward networks (Word2Vec), we gradually moved to training word embeddings with complex bidirectional LSTM architectures.

This means that the same word can have multiple ELMo embeddings, depending on the context in which it appears.

That is when we began to see that the advantages of pre-training would make it an important part of NLP.

ULMFiT took this a step further. In document classification tasks, fine-tuning a language model trained with this framework gives excellent results even with very little data (fewer than 100 examples). In other words, ULMFiT showed how to make transfer learning work in NLP.

This gave us the golden formula for transfer learning in NLP:

Transfer learning in NLP = pre-training + fine-tuning

After ULMFiT, many NLP models were trained according to the formula above and set new benchmarks.


OpenAI's GPT further extended the pre-training and fine-tuning methods introduced by ULMFiT and ELMo.

The key change in GPT was to replace the LSTM-based language modeling architecture with a Transformer-based one.

The GPT model can be fine-tuned not only for document classification but also for other NLP tasks, such as common sense reasoning, semantic similarity, and reading comprehension.

OpenAI's GPT achieved SOTA results on multiple tasks, confirming the robustness and effectiveness of the Transformer architecture.

This is how BERT came to be built on the Transformer, and it went on to bring great changes to the NLP field.

BERT was born

At this point, solving an NLP task comes down to two steps:

1. Train a language model on a large unlabeled text corpus (unsupervised or semi-supervised)

2. Fine-tune this large language model on a specific NLP task to take advantage of the extensive knowledge the pre-trained model has acquired (supervised)

Next, we will look more closely at how BERT trains its models and why it is likely to remain a benchmark in NLP for some time to come.

How does BERT work? A detailed explanation

Let's look more deeply at BERT and understand why the language model it builds is so effective.

1. BERT structure

The BERT architecture is built on the Transformer. Two variants are currently available:

  • BERT Base: 12 layers (Transformer blocks), 12 attention heads, 110 million parameters
  • BERT Large: 24 layers (Transformer blocks), 16 attention heads, 340 million parameters
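The sizes of the two variants can be sanity-checked with some arithmetic. The sketch below is a rough estimate, not an official count. It assumes the configuration of the publicly released English checkpoints: a 30,522-token vocabulary, 512 positions, hidden sizes of 768 and 1024, a feed-forward layer four times the hidden size, and a pooling layer on top. None of these values are stated in this article.

```python
def bert_parameter_count(layers, hidden, vocab_size=30522,
                         max_positions=512, segments=2):
    """Estimate the parameter count of a BERT encoder plus pooler.

    All default values are assumptions taken from the released English
    checkpoints, not from this article.
    """
    feed_forward = 4 * hidden  # assumed size of the inner feed-forward layer
    layer_norm = 2 * hidden    # one scale and one bias vector

    # Token, position and segment tables, followed by a layer norm.
    embeddings = (vocab_size + max_positions + segments) * hidden + layer_norm

    # Query, key, value and output projections, each with weights and bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers: hidden -> feed_forward -> hidden.
    ffn = (hidden * feed_forward + feed_forward) + (feed_forward * hidden + hidden)
    per_layer = attention + layer_norm + ffn + layer_norm

    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


base = bert_parameter_count(layers=12, hidden=768)
large = bert_parameter_count(layers=24, hidden=1024)
print(f"BERT Base:  {base / 1e6:.0f}M parameters")
print(f"BERT Large: {large / 1e6:.0f}M parameters")
```

Under these assumptions the estimate comes out at roughly 109 million and 335 million parameters, which is usually rounded to 110 million and 340 million. Note that the number of attention heads does not affect the count, because the heads split the same projection matrices between them.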

BERT Base is similar in size to OpenAI's GPT model, and all of its Transformer layers are encoder-only blocks.

If you are not yet familiar with the Transformer architecture, I suggest you read this article first.

We have now seen the overall architecture of BERT. Before the model can be built, some text preprocessing is needed.

2. Text preprocessing

The developers behind BERT added a specific set of rules for representing the model's input text. Many of these are creative design choices that make the model perform better.

First, each input embedding is a combination of three embeddings:

1. Position embedding: BERT learns and uses position embeddings to express the position of a word in a sentence. They are added to overcome a limitation of the Transformer, which, unlike an RNN, cannot capture "sequence" or "order" information

2. Segment embedding: BERT can also take sentence pairs as input for tasks such as question answering. It therefore learns a unique embedding for the first sentence and for the second sentence to help the model distinguish them. In the example above, all tokens marked EA belong to sentence A (and likewise EB for sentence B)

3. Token embedding: these are the embeddings learned for each specific token in the WordPiece token vocabulary

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.

This comprehensive embedding scheme carries a lot of useful information for the model.

The combination of these preprocessing steps is what makes BERT so versatile.
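The way the three embeddings combine into one input vector per token can be sketched in a few lines. This is an illustration only: the embedding tables are filled with random numbers instead of learned values, the vocabulary is a hypothetical toy one, and the example sentence pair and the [CLS] and [SEP] marker tokens are assumptions that do not appear in this article.

```python
import random

random.seed(0)
HIDDEN = 4  # tiny vector size for illustration only


def random_table(rows):
    """Stand-in for a learned embedding table: one random vector per row."""
    return [[round(random.uniform(-1, 1), 2) for _ in range(HIDDEN)]
            for _ in range(rows)]


# Hypothetical toy vocabulary; real BERT uses a WordPiece vocabulary.
vocab = {"[CLS]": 0, "[SEP]": 1, "my": 2, "dog": 3, "is": 4, "cute": 5,
         "he": 6, "likes": 7, "playing": 8}

token_table = random_table(len(vocab))
segment_table = random_table(2)    # row 0: sentence A, row 1: sentence B
position_table = random_table(16)  # one vector per position in the input

tokens = ["[CLS]", "my", "dog", "is", "cute", "[SEP]",
          "he", "likes", "playing", "[SEP]"]
segments = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]


def input_embeddings(tokens, segments):
    """Element-wise sum of the token, segment and position vectors."""
    result = []
    for position, (token, segment) in enumerate(zip(tokens, segments)):
        tok = token_table[vocab[token]]
        seg = segment_table[segment]
        pos = position_table[position]
        result.append([t + s + p for t, s, p in zip(tok, seg, pos)])
    return result


embedded = input_embeddings(tokens, segments)
print(len(embedded), len(embedded[0]))  # one vector of size HIDDEN per token
```

Because the position and segment vectors are added in, the two [SEP] tokens end up with different input vectors even though they share the same token embedding.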

3. Pre-training tasks

BERT is pre-trained on two NLP tasks:

  • Masked language modeling
  • Next sentence prediction

Let us understand these two tasks in more detail!

a. Masked language modeling (bidirectionality)

BERT is a deeply bidirectional model. From the first layer to the last, the network captures information from both the left and the right context of the current word.

Figure: Word sequence prediction

Traditional language models are trained either left-to-right to predict the next word (like GPT) or right-to-left. Either way, the model inevitably loses some information, which leads to errors.

ELMo tried to solve this problem by training two LSTM language models, one on the left-to-right context and one on the right-to-left context, and concatenating them. Although this was an improvement, it still fell well short.

Compared to GPT and ELMo, BERT made an important breakthrough in its use of contextual information, as shown in the figure above.

The arrows in the figure represent the flow of information from one layer to the next, and the green box at the top represents the final representation of each input word.

It can be clearly seen from the above picture that BERT is bidirectional, GPT is unidirectional (from left to right), and ELMo is shallow bidirectional.
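The difference between a unidirectional and a bidirectional model can be made concrete by listing which positions each word is allowed to look at. The sketch below is a simplification under stated assumptions: real models apply such masks inside their self-attention layers, and the helper name and the example phrase are hypothetical.

```python
def visibility_mask(length, bidirectional):
    """mask[i][j] is 1 when position i may look at position j."""
    return [[1 if bidirectional or j <= i else 0 for j in range(length)]
            for i in range(length)]


tokens = ["the", "bank", "of", "the", "river"]
left_to_right = visibility_mask(len(tokens), bidirectional=False)
both_sides = visibility_mask(len(tokens), bidirectional=True)

bank = tokens.index("bank")
seen_unidirectional = [t for t, m in zip(tokens, left_to_right[bank]) if m]
seen_bidirectional = [t for t, m in zip(tokens, both_sides[bank]) if m]

print(seen_unidirectional)  # only the words up to and including "bank"
print(seen_bidirectional)   # the whole phrase, including "river"
```

In the left-to-right case, "bank" never sees "river", which is the word that settles its meaning here.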

Now for the masked language model, which is the secret behind BERT's bidirectional encoding.

Take a sentence such as "I love to read data science blogs on Analytics Vidhya". How can we use it to train a bidirectional language model?

We first replace "Analytics" with "[MASK]", a token that marks the word at that position as hidden.

Then we train the model to predict the masked word: "I love to read data science blogs on [MASK] Vidhya."

This is the key idea behind the masked language model. The authors of BERT also added some refinements to it:

  • To prevent the model from paying too much attention to specific positions or to the masked tokens, the researchers randomly selected 15% of the words for masking.
  • The selected words are not always replaced by [MASK], because the [MASK] token never appears during fine-tuning on specific tasks.
  • So, for the 15% of words that are selected, the researchers did the following:
  • 80% of the selected words are replaced by [MASK]
  • 10% of the selected words are replaced by a random word
  • The remaining 10% of the selected words are left unchanged
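The selection and replacement rule from the list above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it works on whole words instead of WordPiece tokens, and the function name, the toy vocabulary, and the random seed are assumptions made for the example.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels).

    labels[i] is the original word where the model must make a
    prediction, and None everywhere else.
    """
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            # Not selected: keep the word and predict nothing here.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)  # the model must recover the original word
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random word
        else:
            corrupted.append(token)              # 10%: left unchanged
    return corrupted, labels


sentence = "I love to read data science blogs on Analytics Vidhya".split()
toy_vocab = sorted(set(sentence))  # hypothetical vocabulary for random swaps
rng = random.Random(42)
corrupted, labels = mask_tokens(sentence, toy_vocab, rng)
print(corrupted)
print(labels)
```

With a sentence this short, a single pass may select no word at all. In practice the rule is applied across a very large corpus, so the 15% and 80/10/10 proportions hold on average.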

In a previous article, I explained in detail how to implement a masked language model in Python: Introduction to PyTorch-Transformers: An Incredible Library for State-of-the-Art NLP (with Python code)

b. Next sentence prediction

Masked language models (MLMs) learn the relationships between words.

In addition, BERT is trained on a next sentence prediction task to learn the relationships between sentences.

A typical example of a task that needs this kind of understanding is question answering.

The task is simple: given two sentences A and B, decide whether B is the actual sentence that follows A or just a random sentence.

Since this is a binary classification problem, a large amount of training data can be generated by splitting the sentences of a corpus into sentence pairs. As with MLMs, the authors added some refinements to the next sentence prediction task. Here is an example:

From a dataset containing 10 million sentences, we get 5 million sentence pairs as training data.

  • For 50% of the pairs in the training data, the second sentence is the actual next sentence
  • For the other 50%, the second sentence is a random sentence from the corpus
  • The label for the first 50% is 'IsNext', and the label for the other 50% is 'NotNext'
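The construction of these training pairs can be sketched as follows. This is an illustration under stated assumptions, not the authors' data pipeline: the function name and the toy corpus are hypothetical, and each pair is assigned to one of the two cases by a coin flip, so the 50/50 split holds only on average.

```python
import random


def make_nsp_pairs(sentences, rng):
    """Split consecutive sentences into pairs and label each pair."""
    pairs = []
    for i in range(0, len(sentences) - 1, 2):
        first, true_next = sentences[i], sentences[i + 1]
        if rng.random() < 0.5:
            pairs.append((first, true_next, "IsNext"))
        else:
            # Replace the second sentence with a random one that is
            # neither the first sentence nor its true successor.
            candidates = [s for s in sentences
                          if s != first and s != true_next]
            pairs.append((first, rng.choice(candidates), "NotNext"))
    return pairs


corpus = [f"Sentence number {n}." for n in range(10)]
pairs = make_nsp_pairs(corpus, random.Random(0))
for first, second, label in pairs:
    print(label, "|", first, "->", second)
```

Ten sentences yield five pairs, the same two-to-one ratio as in the example above.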

Combining the two pre-training tasks, masked language modeling (MLM) and next sentence prediction (NSP), is what makes BERT a task-agnostic model that can be applied to downstream tasks through simple fine-tuning.

Beyond BERT: The current state of the art in NLP

BERT has sparked great interest in the field of NLP, especially in applying the Transformer. This has led more and more labs and organizations to study pre-training, Transformers, and fine-tuning.

After BERT, several new projects have achieved better results on various NLP tasks. Examples include RoBERTa, which is Facebook AI's improvement on BERT, and DistilBERT, which is a lighter and faster version of BERT.

You can learn more about the models that improved on BERT in this article on state-of-the-art NLP.

This article is reposted from the public account AI Developer.