Over the past year, deep neural networks have opened an exciting era in natural language processing. Research on pre-trained models has produced large jumps in state-of-the-art results on many NLP tasks, such as text classification, natural language inference and question answering. Key milestones include ELMo, ULMFiT and the OpenAI Transformer. These methods let us pre-train an unsupervised language model on a large corpus (such as all Wikipedia articles) and then fine-tune the pre-trained model on downstream tasks.

Perhaps the most exciting event in this field this year is the release of BERT, a multilingual Transformer-based model that has achieved state-of-the-art results on a variety of NLP tasks. BERT is a bidirectional model based on the Transformer architecture, which replaces the sequential nature of RNNs (LSTM and GRU) with a much faster attention-based approach. The model is also pre-trained on two unsupervised tasks, masked language modeling and next sentence prediction. This lets us take a pre-trained BERT model and fine-tune it on specific downstream tasks such as sentiment classification, intent detection and question answering.

We will use Kaggle's Toxic Comment Classification Challenge to measure the performance of BERT on multi-label text classification.

Where do we start?

Google Research recently released a TensorFlow implementation of BERT, along with the following pre-trained models:

  1. BERT-Base, Uncased: 12 layers, 768 hidden units, 12 attention heads, 110M parameters
  2. BERT-Large, Uncased: 24 layers, 1024 hidden units, 16 attention heads, 340M parameters
  3. BERT-Base, Cased: 12 layers, 768 hidden units, 12 attention heads, 110M parameters
  4. BERT-Large, Cased: 24 layers, 1024 hidden units, 16 attention heads, 340M parameters
  5. BERT-Base, Multilingual Cased (new, recommended): 104 languages, 12 layers, 768 hidden units, 12 attention heads, 110M parameters
  6. BERT-Base, Chinese: Simplified and Traditional Chinese, 12 layers, 768 hidden units, 12 attention heads, 110M parameters

We will use the smaller BERT-Base, Uncased model for this task. The BERT-Base model has 12 layers, and the tokenizer converts all text to lowercase.

We will use HuggingFace's excellent PyTorch port of BERT. We converted the pre-trained TensorFlow checkpoints to PyTorch weights using the conversion script provided in HuggingFace's repo.

Our implementation is largely inspired by the run_classifier example provided in the original implementation of BERT.

Data preparation

We prepare the data in the InputExample class:

  • text_a: the comment text
  • text_b: not used
  • labels: the list of labels for the comment in the training data; empty for test data
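
As a rough illustration of the fields listed above, the sketch below models one comment as a small Python dataclass. It is not the authors' code: the `guid` field, the `make_examples` helper, the row keys (`id`, `comment_text`) and the lowercase label column names are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Label column names are assumed here; they mirror the six labels reported in the results.
LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


@dataclass
class InputExample:
    """One comment, in the spirit of the run_classifier example."""
    guid: str                                         # unique id (assumed field)
    text_a: str                                       # the comment text
    text_b: Optional[str] = None                      # unused for single-text classification
    labels: List[int] = field(default_factory=list)   # stays empty for test data


def make_examples(rows, is_test=False):
    """Turn dict-like rows into InputExample objects (hypothetical helper)."""
    examples = []
    for row in rows:
        # Test rows carry no labels, so the list is left empty.
        labels = [] if is_test else [int(row[name]) for name in LABEL_COLUMNS]
        examples.append(
            InputExample(guid=str(row["id"]), text_a=row["comment_text"], labels=labels)
        )
    return examples


train_rows = [{"id": "a1", "comment_text": "You are wonderful.", "toxic": 0,
               "severe_toxic": 0, "obscene": 0, "threat": 0, "insult": 0,
               "identity_hate": 0}]
train_examples = make_examples(train_rows)
test_examples = make_examples([{"id": "t1", "comment_text": "Hello"}], is_test=True)
```

Because a comment can carry several labels at once, `labels` is a list of 0/1 flags rather than a single class index.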


BERT-Base, Uncased uses a vocabulary of 30,522 words. Tokenization splits the input text into a list of tokens that are available in the vocabulary. To handle words that are not in the vocabulary, BERT uses a BPE-based technique called WordPiece tokenization. In this approach, an out-of-vocabulary word is progressively split into sub-words, and the word is then represented by that group of sub-words. Since the sub-words are part of the vocabulary, the model has learned contextual representations for them, and the context of the word is simply the combination of the contexts of its sub-words. For more details on this approach, see Neural Machine Translation of Rare Words with Subword Units.
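
To make the sub-word splitting concrete, here is a minimal sketch of greedy longest-match-first splitting for a single lowercase word, the matching strategy BERT's WordPiece tokenizer applies at tokenization time. The function name and the tiny vocabulary are invented for illustration; the real vocabulary has 30,522 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into sub-words by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary for illustration only.
toy_vocab = {"un", "##aff", "##able", "play", "##ing", "the"}

print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", toy_vocab))    # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

A word that is already in the vocabulary comes back as a single token, so only rare words pay the cost of being split.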

Model structure


The training loop is the same as the one provided in the original BERT implementation. We trained the model for 4 epochs with a batch size of 32 and a sequence length of 512, the maximum the pre-trained model supports. Following the recommendation in the original paper, the learning rate was kept at 3e-5.
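
The fixed sequence length means every tokenized comment has to be truncated or padded before it reaches the model. The sketch below shows one way to do that for a single text, loosely following the convention of the run_classifier example: two positions are reserved for the special `[CLS]` and `[SEP]` tokens. The function name is hypothetical, and tokens are kept as strings rather than vocabulary ids to keep the example short.

```python
def build_features(tokens, max_seq_length=512):
    """Truncate and pad one tokenized comment to a fixed length."""
    # Reserve two positions for the special [CLS] and [SEP] tokens.
    tokens = tokens[: max_seq_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_mask = [1] * len(tokens)    # 1 marks real tokens
    segment_ids = [0] * len(tokens)   # a single text, so every segment id is 0
    padding = max_seq_length - len(tokens)
    tokens = tokens + ["[PAD]"] * padding
    input_mask = input_mask + [0] * padding   # 0 marks padding positions
    segment_ids = segment_ids + [0] * padding
    return tokens, input_mask, segment_ids


tokens, input_mask, segment_ids = build_features(["great", "comment"], max_seq_length=8)
long_tokens, long_mask, _ = build_features(["w"] * 1000)  # longer than the limit
```

With the defaults, a comment of 1,000 tokens keeps only its first 510 tokens, so anything said late in a very long comment never reaches the model.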

We had access to multiple GPUs, so we wrapped the PyTorch model in the DataParallel module. This lets us spread the training work across all available GPUs.

We did not use the half-precision (FP16) technique, because the binary cross-entropy with logits loss function does not support FP16 processing. This does not affect the final result; training just takes longer.
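
For readers unfamiliar with this loss, the sketch below computes binary cross-entropy directly from raw logits in plain Python, treating each label as an independent yes/no decision. It is meant to illustrate the formula, not to reproduce the PyTorch implementation (`torch.nn.BCEWithLogitsLoss`); the function name is our own.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# One comment scored against six labels (values are made up).
example_logits = [2.0, -3.0, 0.5, -4.0, 1.0, -2.5]
example_targets = [1, 0, 1, 0, 0, 0]
loss = bce_with_logits(example_logits, example_targets)
```

Working on logits rather than probabilities avoids taking the logarithm of a value that has been rounded to 0 or 1, which is why the sigmoid and the loss are usually fused into a single function.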

Evaluation metrics

We adapted the accuracy metric function to include a threshold, which is set to 0.5 by default.
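
A thresholded metric of this kind can be sketched as follows. This is an assumption about its general shape rather than the authors' exact function: each logit is passed through a sigmoid, compared against the threshold, and the resulting yes/no decision is checked against the target for every (comment, label) pair. The names `sigmoid` and `accuracy_thresh` are chosen for this example.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def accuracy_thresh(logits, labels, thresh=0.5):
    """Fraction of individual label decisions that match the targets."""
    correct = 0
    total = 0
    for row_logits, row_labels in zip(logits, labels):
        for x, y in zip(row_logits, row_labels):
            predicted = sigmoid(x) > thresh   # one yes/no decision per label
            correct += int(predicted == bool(y))
            total += 1
    return correct / total


batch_logits = [[2.0, -1.0, 0.3], [-3.0, 0.5, -0.2]]
batch_labels = [[1, 0, 0], [0, 1, 1]]
score = accuracy_thresh(batch_logits, batch_labels)  # 4 of 6 decisions are correct
```

Because most comments carry none of the labels, a metric like this can look very high even for a weak model, which is one reason a ranking metric is also reported.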

For multi-label classification, a more important metric is the ROC-AUC curve. This is also the evaluation metric for the Kaggle competition. We calculate the ROC-AUC of each label separately, and we also compute a micro-averaged ROC-AUC over the individual labels.
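
The difference between per-label and micro-averaged ROC-AUC can be shown with a small pure-Python sketch. It uses the pairwise definition of ROC-AUC (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half), which is quadratic in the number of examples and only suitable for illustration. The function names and the toy scores are assumptions; in practice a library routine such as scikit-learn's would normally be used.

```python
def roc_auc(scores, targets):
    """ROC-AUC as the probability that a positive outranks a negative."""
    positives = [s for s, t in zip(scores, targets) if t == 1]
    negatives = [s for s, t in zip(scores, targets) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


def per_label_and_micro_auc(scores, targets, label_names):
    """Return ({label: auc}, micro_auc) for row-major score and target matrices."""
    per_label = {}
    for j, name in enumerate(label_names):
        per_label[name] = roc_auc([row[j] for row in scores],
                                  [row[j] for row in targets])
    # Micro average: pool every (example, label) decision into one binary problem.
    flat_scores = [s for row in scores for s in row]
    flat_targets = [t for row in targets for t in row]
    return per_label, roc_auc(flat_scores, flat_targets)


toy_scores = [[0.9, 0.2], [0.6, 0.5], [0.7, 0.4], [0.3, 0.1]]
toy_targets = [[1, 0], [0, 1], [1, 0], [0, 0]]
per_label, micro = per_label_and_micro_auc(toy_scores, toy_targets, ["toxic", "threat"])
```

In this toy data each label is ranked perfectly on its own, yet the micro average is lower because pooling compares scores across labels: a positive for one label can be outscored by a negative for another.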

We ran a few experiments with small variations, and they gave broadly similar results.

The results are as follows:

Training loss: 0.022, validation loss: 0.018, validation accuracy: 99.31%

ROC-AUC score for each label:

Toxic: 0.9988

Severe_toxic: 0.9935

Obscene: 0.9988

Threat: 0.9989

Insult: 0.9975

Identity_hate: 0.9988

Micro ROC-AUC: 0.9987