Beginning in 2018, pre-training has undoubtedly become the hottest research direction in the field of NLP. With the help of pre-training models such as BERT and GPT, major breakthroughs have been made on multiple natural language understanding tasks. However, on sequence-to-sequence natural language generation tasks, the current mainstream pre-training models have not achieved significant results. To this end, researchers at Microsoft Research Asia proposed a new universal pre-training method, MASS, at ICML 2019, which surpasses BERT and GPT on sequence-to-sequence natural language generation tasks. In the WMT19 machine translation competition in which Microsoft participated, MASS helped the Chinese-English and English-Lithuanian submissions take first place.

BERT has achieved good results in natural language understanding tasks (such as sentiment classification, natural language inference, named entity recognition, and SQuAD reading comprehension) and has received more and more attention. However, in the field of natural language processing, in addition to natural language understanding tasks, there are many sequence-to-sequence natural language generation tasks, such as machine translation, text summary generation, dialog generation, question answering, and text style transfer. For this type of task, the current mainstream approach is the encoder-attention-decoder framework, as shown in the figure below.

The encoder-attention-decoder framework

The encoder encodes the source sequence text X into a sequence of hidden vectors; the decoder then extracts information from the encoded hidden vector sequence through the attention mechanism and generates the target sequence text Y autoregressively.
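As an illustrative sketch (not the actual model implementation), a single decoding step of dot-product attention over the encoder's hidden vectors can be written in a few lines of NumPy; all names and sizes here are hypothetical:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(decoder_state, encoder_states):
    """One decoder step attends over the encoder's hidden vector sequence
    and returns a context vector plus the attention distribution."""
    d = decoder_state.shape[-1]
    scores = encoder_states @ decoder_state / np.sqrt(d)  # (src_len,)
    weights = softmax(scores)                             # sums to 1
    context = weights @ encoder_states                    # weighted sum, (d,)
    return context, weights

# Toy example: source X of 5 positions, hidden size 8
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))   # encoder hidden vectors for X
dec = rng.normal(size=(8,))     # current decoder hidden state
context, weights = attend(dec, enc)
```

The context vector is what the decoder consumes at each step before predicting the next token of Y.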

BERT usually pre-trains only an encoder for natural language understanding, while GPT's language model pre-trains only a decoder. To use BERT or GPT for sequence-to-sequence natural language generation, one typically has to pre-train the encoder and decoder separately, so the encoder-attention-decoder structure is not jointly trained and the decoder's attention over the encoder is not pre-trained. Since this attention mechanism is very important in such tasks, BERT and GPT can achieve only sub-optimal results on them.

New pre-training method - MASS

For sequence-to-sequence natural language generation tasks, Microsoft Research Asia has proposed a new pre-training method: Masked Sequence to Sequence Pre-training (MASS). MASS randomly masks a continuous segment of length k in a sentence and then predicts this segment with the encoder-attention-decoder model.

The masked sequence-to-sequence pre-training (MASS) model framework

As shown in the figure above, the words in positions 3 to 6 on the encoder side are masked; the decoder then predicts only these consecutive words, while all other words are masked on the decoder side. The "_" in the figure represents a masked word.
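A minimal sketch of this masking scheme (the token strings and the `[M]` mask symbol are placeholders for illustration, not the paper's actual implementation):

```python
import random

MASK = "[M]"

def mass_mask(tokens, k):
    """MASS-style masking: pick a random contiguous span of length k.
    The encoder sees the sentence with the span masked; the decoder is fed
    the span tokens shifted right and predicts the span itself, with all
    other decoder positions masked out."""
    m = len(tokens)
    u = random.randrange(0, m - k + 1)     # span start position
    v = u + k                              # span end (exclusive)
    enc_input = tokens[:u] + [MASK] * k + tokens[v:]
    target = tokens[u:v]                   # what the decoder must predict
    dec_input = [MASK] + target[:-1]       # shifted-right span as decoder input
    return enc_input, dec_input, target

tokens = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
enc_input, dec_input, target = mass_mask(tokens, k=4)  # k about 50% of m
```

Feeding the decoder only the previous span tokens is what forces it to rely on the encoder for everything outside the span.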

MASS pre-training has the following advantages:

(1) The other words on the decoder side (the words that are not masked on the encoder side) are masked, which encourages the decoder to extract information from the encoder side to help predict the continuous segment, promoting joint training of the encoder-attention-decoder structure;

(2) To provide more useful information to the decoder, the encoder is forced to extract the semantics of the unmasked words, improving the encoder's ability to understand the source sequence text;

(3) The decoder predicts consecutive sequence segments, improving the decoder's language modeling capability.

Unified pre-training framework

MASS has an important hyperparameter k (the length of the masked continuous segment). By adjusting k, MASS can subsume the masked language model pre-training in BERT and the standard language model pre-training in GPT, making MASS a universal pre-training framework.

When k=1, according to the MASS setup, the encoder side masks one word and the decoder side predicts that one word, as shown in the figure below. The decoder has no input information, and MASS is equivalent to the masked language model pre-training in BERT.

When k=m (m is the sequence length), according to the MASS setup, the encoder masks all words and the decoder predicts all words, as shown in the figure below. Since all words on the encoder side are masked, the decoder's attention mechanism obtains no information, and MASS is equivalent to the standard language model in GPT.

The probability form of MASS under different k is shown in the table below, where m is the sequence length, u and v are the start and end positions of the masked segment, x^{u:v} denotes the segment from position u to v, and x^{\u:v} denotes the sequence with positions u to v masked. It can be seen that when k=1 or k=m, the probability form of MASS is consistent with the masked language model in BERT and the standard language model in GPT, respectively.
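These probability forms can be reconstructed in the paper's notation as follows (segment boundaries u, v inclusive):

```latex
% MASS objective: predict the masked segment x^{u:v} given the rest,
% factored autoregressively on the decoder side
L(\theta; \mathcal{X})
  = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}}
    \log P\!\left(x^{u:v} \mid x^{\backslash u:v}; \theta\right)
  = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}}
    \log \prod_{t=u}^{v}
    P\!\left(x^{u:v}_{t} \mid x^{u:v}_{<t},\, x^{\backslash u:v}; \theta\right)

% k = 1 (u = v): reduces to BERT's masked language model
P\!\left(x^{u} \mid x^{\backslash u}; \theta\right)

% k = m (u = 1, v = m): the encoder input is fully masked, so this
% reduces to GPT's standard language model
P\!\left(x^{1:m} \mid x^{\backslash 1:m}; \theta\right)
  = \prod_{t=1}^{m} P\!\left(x_{t} \mid x_{<t}; \theta\right)
```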

We experimentally analyzed the effect of different masked segment lengths k in MASS pre-training, as shown in the figure below.

When k is about half the sentence length (50% of m), downstream tasks achieve their best performance. Masking half of the words in the sentence balances the pre-training of the encoder and the decoder well. Biasing too far toward the encoder (k=1, i.e., BERT) or toward the decoder (k=m, i.e., the LM in GPT) does not achieve the best results, which shows the advantage of MASS on sequence-to-sequence natural language generation tasks.

Sequence-to-sequence natural language generation task experiment

Pre-training process

MASS requires only unsupervised monolingual data (such as WMT News Crawl Data or Wikipedia Data) for pre-training. MASS supports cross-lingual sequence-to-sequence generation (such as machine translation) as well as monolingual sequence-to-sequence generation (such as text summary generation and dialog generation). When pre-training MASS for cross-lingual tasks (such as English-French machine translation), we pre-train English-to-English and French-to-French simultaneously in one model, adding a corresponding language embedding vector for each language to distinguish different languages. We selected four tasks on which to fine-tune the MASS pre-trained model and verify its effectiveness: unsupervised machine translation, low-resource machine translation, text summary generation, and dialog generation.
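The language-embedding idea can be sketched as adding a per-language vector to every token embedding, so one shared model can tell the English and French streams apart (the sizes and names here are illustrative, not the actual model configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 100, 16
token_emb = rng.normal(size=(VOCAB, D))          # shared token embedding table
lang_emb = {"en": rng.normal(size=(D,)),         # one learned vector per language
            "fr": rng.normal(size=(D,))}

def embed(token_ids, lang):
    """Input embedding = token embedding + language embedding.
    The language vector broadcasts over all positions in the sequence."""
    return token_emb[token_ids] + lang_emb[lang]

x = embed(np.array([3, 7, 7]), "en")             # shape (3, 16)
```

In practice a position embedding would also be added; it is omitted here for brevity.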

Unsupervised machine translation

In the unsupervised translation task, we compared MASS with the currently strongest system, Facebook's XLM (XLM pre-trains the encoder and decoder separately, using BERT's masked language model and the standard language model). The comparison results are shown in the table below.

It can be seen that the pre-training method of MASS outperforms XLM in all four translation directions of WMT14 English-French and WMT16 English-German. On English-French unsupervised translation, MASS far surpasses the early supervised encoder-attention-decoder models, greatly narrowing the gap with the current supervised models.

Low resource machine translation


Under different data scales, our pre-training method consistently outperforms the baseline model without pre-training; the less supervised data there is, the more significant the improvement.

Text summary generation

On the Gigaword corpus task, we compared MASS with BERT+LM (encoder pre-trained with BERT, decoder pre-trained with a standard language model) and DAE (denoising auto-encoder). As can be seen from the table below, MASS performs significantly better than both BERT+LM and DAE.

Dialog generation

On the Cornell Movie Dialog Corpus task, we compared MASS with BERT+LM; the results are shown in the table below. MASS achieves lower PPL (perplexity; lower is better) than BERT+LM.

MASS has achieved very good results on a variety of sequence-to-sequence natural language generation tasks. Next, we will also test the performance of MASS on natural language understanding tasks and add support for pre-training on supervised data, in the hope of improving on more natural language tasks. In the future, we also hope to extend MASS to other sequence-to-sequence generation tasks, such as speech and video.

Paper address:

This article is reproduced from the WeChat public account Microsoft Research AI Headlines. Original address