In November last year, Google's research team released the much-anticipated BERT on GitHub. It not only set new state-of-the-art scores on 11 NLP benchmarks, it even showed astonishing results that surpassed human performance. Before the shock of BERT had subsided, another piece of news that excited many NLP researchers arrived today: XLNet, a new model from CMU and Google Brain, outperforms BERT on 20 tasks and achieves state-of-the-art results on 18 of them. Even better, the training code and a large pre-trained model for XLNet are already open source. AI Technology Review summarizes the details below.

Relationship between BERT and XLNet

Compared with pre-training approaches based on autoregressive language modeling, autoencoding-based pre-training approaches (such as BERT) have strong bidirectional context modeling capabilities. However, because BERT relies on corrupting the input with masks, it ignores the dependencies between masked positions and suffers from a pretrain-finetune discrepancy.

XLNet is a generalized autoregressive pre-training method designed around the strengths and weaknesses of BERT. It learns bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order; it overcomes the limitations of BERT through its autoregressive formulation; and it integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pre-training, showing excellent performance on language tasks involving long text.

The background behind XLNet

First, we need to understand two concepts: autoregressive (AR) language modeling and autoencoding (AE).

Unsupervised representation learning has been highly successful in natural language processing. Typically, these methods first pre-train neural networks on large-scale unlabeled text corpora and then fine-tune the models or representations on downstream tasks. Under this shared high-level idea, different unsupervised pre-training objectives have been explored in the literature. Among them, autoregressive language modeling and autoencoding are the two most successful pre-training objectives.

AR language modeling uses an autoregressive model to estimate the probability distribution of a text corpus. Specifically, given a text sequence x = (x1, ..., xT), the AR language model factorizes the likelihood into a forward product or a backward product. A parametric model (such as a neural network) is trained to model each conditional distribution. Since an AR language model is only trained to encode a unidirectional context (forward or backward), it is not effective at modeling deep bidirectional contexts. Downstream language understanding tasks, in contrast, usually require bidirectional context information. This leads to a gap between AR language modeling and effective pre-training.
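For reference, the two factorizations mentioned above can be written as follows (a standard formulation, using the sequence notation x = (x1, ..., xT) from the paragraph above):

```latex
% Forward factorization: condition each token on its left context
p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid \mathbf{x}_{<t})
% Backward factorization: condition each token on its right context
p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid \mathbf{x}_{>t})
```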

Figure: Illustration of the permutation language modeling objective for predicting x3 given the same input sequence x but with different factorization orders.

In contrast, AE-based pre-training does not perform explicit density estimation; instead, it aims to reconstruct the original data from a corrupted input. A notable example is BERT, which has been the state-of-the-art pre-training approach. Given an input token sequence, a certain portion of the tokens is replaced with a special symbol [MASK], and the model is trained to recover the original tokens from the corrupted version.

Since density estimation is not part of the objective, BERT is allowed to use bidirectional contexts for reconstruction. The immediate benefit is that this closes the bidirectional-information gap of AR language modeling, which improves performance. However, the artificial symbols such as [MASK] that BERT uses during pre-training never appear in real data at fine-tuning time, resulting in a pretrain-finetune discrepancy. In addition, since the predicted tokens are masked out in the input, BERT cannot model their joint probability using the product rule the way AR language modeling does.
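Using the notation of the XLNet paper, the two objectives can be contrasted roughly as follows, where the hatted x denotes the corrupted input and m_t = 1 marks a masked position (a sketch of the standard formulations, not a verbatim quote):

```latex
% Autoregressive (AR) objective: exact factorization by the product rule
\max_{\theta} \; \log p_{\theta}(\mathbf{x}) = \sum_{t=1}^{T} \log p_{\theta}(x_t \mid \mathbf{x}_{<t})
% BERT-style autoencoding (AE) objective: reconstruct the masked tokens,
% assuming they are independent given the corrupted input \hat{\mathbf{x}}
\max_{\theta} \; \log p_{\theta}(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \approx \sum_{t=1}^{T} m_t \, \log p_{\theta}(x_t \mid \hat{\mathbf{x}})
```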

Therefore, in view of the pros and cons of existing language pre-training objectives, CMU and Google Brain proposed XLNet, a generalized autoregressive pre-training model that combines the advantages of AR and AE.

XLNet in detail

First, instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log-likelihood of a sequence over all possible permutations of the factorization order. Thanks to this permutation operation, the context of each position can consist of tokens from both the left and the right. In expectation, each position learns to use contextual information from all positions, i.e., to capture bidirectional context.
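In the paper's notation, this permutation language modeling objective can be written as follows, where Z_T denotes the set of all permutations of the index sequence [1, 2, ..., T], and z_t and z_{<t} are the t-th element and the first t-1 elements of a permutation z:

```latex
\max_{\theta} \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
```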

Second, as a generalized AR language model, XLNet does not rely on corrupting the input data. It is therefore free of the pretrain-finetune discrepancy that BERT suffers from. At the same time, the autoregressive objective also provides a natural way to factorize the joint probability of the predicted tokens using the product rule, eliminating the independence assumption made in BERT.

In addition to a new pre-training objective, XLNet also improves the architectural design used for pre-training.

Inspired by the latest advances in AR language modeling, XLNet integrates Transformer-XL's segment recurrence mechanism and relative positional encoding scheme into pre-training, which improves performance on tasks involving longer text sequences.
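The segment recurrence idea can be illustrated with a toy sketch: process a long text segment by segment, cache the hidden states of the previous segment, and let the current segment attend to the cached states as extra context. The code below is a simplified, hypothetical illustration of that caching pattern, not the actual XLNet/Transformer-XL implementation.

```python
import numpy as np

def process_segments(segments, layer, mem=None):
    """Toy illustration of segment-level recurrence: hidden states of the
    previous segment are cached as a 'memory' and prepended to the current
    segment's context (in the real model, without gradient flow)."""
    outputs = []
    for seg in segments:                                  # seg: [seg_len, hidden]
        context = seg if mem is None else np.concatenate([mem, seg], axis=0)
        hidden = layer(context)                           # stand-in for a self-attention layer
        mem = hidden[-seg.shape[0]:]                      # cache current states for the next segment
        outputs.append(mem)
    return outputs

# Usage with a dummy "layer":
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
segments = [rng.standard_normal((4, 8)) for _ in range(3)]
outs = process_segments(segments, layer=lambda h: np.tanh(h @ W))
```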

It should be noted that naively applying the Transformer(-XL) architecture to permutation-based language modeling does not work, because the factorization order is arbitrary and the target is ambiguous. As a solution, the researchers proposed re-parameterizing the Transformer(-XL) network so that the prediction is aware of the target position, which removes the ambiguity.
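Concretely, the re-parameterization described in the paper makes the next-token distribution depend on the target position z_t through a position-aware hidden representation g_theta (shown here in roughly the paper's notation, where e(x) is the embedding of token x):

```latex
p_{\theta}\!\left(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)
= \frac{\exp\!\left(e(x)^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\right)}
       {\sum_{x'} \exp\!\left(e(x')^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\right)}
```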

Experimental results

As of June 19, 2019, XLNet had surpassed BERT on 20 tasks and achieved state-of-the-art results on 18 tasks, including machine question answering, natural language inference, sentiment analysis, and document ranking.

Here are some comparisons between XLNet-Large and BERT-Large:

XLNet vs. BERT: Reading Comprehension Tasks

XLNet vs. BERT: Text Classification Task

XLNet vs. BERT: ClueWeb09-B Document Ranking Task

Across all 20 tasks, XLNet performed better than BERT, and it achieved state-of-the-art results on 18 of them.

Released models

As of now, the following model has been released:

XLNet-Large, Cased: 24-layer, 1024-hidden, 16-heads

Each .zip file contains three items (a short loading sketch follows the list):

TensorFlow checkpoint (xlnet_model.ckpt), which contains pre-trained weights.

SentencePiece model (spiece.model) for (de)tokenization.

A configuration file (xlnet_config.json) that specifies the hyperparameters of the model.
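As a quick sanity check, the configuration and tokenizer files can be inspected with standard tools. The snippet below is an illustrative sketch (assuming the sentencepiece Python package), not the repository's own loading code:

```python
import json
import sentencepiece as spm

# Read the model hyperparameters from the released config file.
with open("xlnet_config.json") as f:
    config = json.load(f)
print(config)

# Load the SentencePiece model used for (de)tokenization.
sp = spm.SentencePieceProcessor()
sp.Load("spiece.model")
print(sp.EncodeAsPieces("XLNet surpasses BERT on 20 tasks."))
```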

Future release plans

The developers also plan to continue releasing more pre-trained models for different settings, including:

Base model: an XLNet-Base will be released at the end of June 2019.

Uncased model: currently, the Cased XLNet-Large performs better than the Uncased XLNet-Large. The developers are still investigating, and they will release the Uncased model as soon as they reach a conclusion (which is expected to be soon).

A pre-trained model fine-tuned on Wikipedia, which can be used for Wikipedia text tasks such as SQuAD and HotpotQA.

Pre-trained models with other hyperparameter configurations, which may be useful for specific downstream tasks.

Pre-trained models associated with new techniques.

Related Links

Paper address:

https://arxiv.org/pdf/1906.08237.pdf

Pre-training model and code address:

https://github.com/zihangdai/xlnet

This article is reproduced from the WeChat public account AI Technology Review (original address).