From BOW to BERT
Since Mikolov et al. introduced Word2Vec in 2013, word embeddings have seen a great deal of development. Today, almost everyone in the machine learning industry is familiar with the saying "king minus man plus woman equals queen." These interpretable word embeddings have become an important part of many deep-learning-based NLP systems.
In early October of last year, Google AI proposed BERT, short for Bidirectional Encoder Representations from Transformers (paper link: https://arxiv.org/abs/1810.04805, project code: https://github.com/google-research/bert). It looked like a staggering move by Google: they proposed a new model for learning contextual representations that improved on the best existing results for 11 NLP tasks, "even exceeding human performance on the most challenging question-answering task." However, one question remains: what do these contextual representations actually encode? Can these features be as interpretable as the word embeddings generated by Word2Vec?
This article focuses on the questions above: how interpretable are the fixed word representations generated by the BERT model? We found that some interesting phenomena can be observed even without a very deep analysis.
Analysis of BERT representations
Let's start with a simple example that uses no context at all. For now, we set aside the fact that BERT is actually trained on sequences of tokens in context rather than on isolated words. In all the experiments covered in this article, we do the following two things:
- Extract the representation of the target word
- Compute the cosine distance between words
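As a reference for the second step, here is a minimal sketch of cosine distance written with the Python standard library only. The function names are illustrative and are not taken from the article's own code; the sketch assumes cosine distance is defined as one minus cosine similarity.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed definition: 1 - cosine similarity (0 means same direction)."""
    return 1.0 - cosine_similarity(u, v)
```

Under this definition, vectors pointing in the same direction have distance 0, orthogonal vectors have distance 1, and opposite vectors have distance 2, regardless of their lengths.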
We extracted the vector representations of the words "man", "woman", "king" and "queen" and found that, after carrying out the classic reconstruction operation (that is, king minus man plus woman), the reconstructed word vector actually ended up farther away from queen.
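The reconstruction test can be sketched as follows. The three-dimensional vectors are made-up toy values chosen so that the analogy works exactly; they are not BERT outputs and say nothing about the result reported here. The helper names are also hypothetical.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def reconstruct(a: Sequence[float], b: Sequence[float],
                c: Sequence[float]) -> List[float]:
    """Element-wise a - b + c, e.g. king - man + woman."""
    return [x - y + z for x, y, z in zip(a, b, c)]


# Toy vectors (assumed for illustration only, not real embeddings).
king = [1.0, 1.0, 0.0]
man = [1.0, 0.0, 0.0]
woman = [0.0, 0.0, 1.0]
queen = [0.0, 1.0, 1.0]

rebuilt = reconstruct(king, man, woman)
before = cosine_distance(king, queen)    # distance before the reconstruction
after = cosine_distance(rebuilt, queen)  # distance after the reconstruction
print(f"king -> queen: {before:.3f}, king - man + woman -> queen: {after:.3f}")
```

With real representations, the analogy is said to hold when the second distance is smaller than the first; the context-free BERT vectors described above showed the opposite.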
In fact, though, it may not be reasonable to test BERT this way. BERT was originally trained on sequence-level objectives, namely the masked language model (Masked LM) and Next Sentence Prediction tasks. In other words, the weights in BERT were learned under the assumption that context is available when building a word's representation. They are not simply the product of a loss function that learns context-independent representations.
To remove the unfairness of the previous test, we can place our vocabulary words in sentences with proper context, such as "The king passed the law", "The queen passed the law", "The refrigerator is very cold" and so on. Under these new conditions, we set out to study:
- How the representation of a particular word changes across contexts (e.g., as a subject or an object, when modified by different descriptive adjectives, and compared with the word on its own without context).
- Whether the semantic vector space hypothesis still holds when we extract representations from proper contexts.
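Extracting a word's representation from a sentence requires picking the right rows out of the model's per-token output. The sketch below assumes the sentence has already been tokenized into WordPiece tokens (where continuation pieces start with "##") and that one vector per token is available. Averaging the sub-token vectors is an assumption made here for illustration; the article does not say how sub-tokens were combined. The token split and vectors in the usage lines are hypothetical.

```python
from typing import List, Sequence, Tuple


def merge_wordpieces(tokens: Sequence[str]) -> List[Tuple[str, List[int]]]:
    """Group WordPiece tokens into whole words with their token indices."""
    words: List[Tuple[str, List[int]]] = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            word, idxs = words[-1]
            words[-1] = (word + tok[2:], idxs + [i])  # glue the continuation on
        else:
            words.append((tok, [i]))
    return words


def word_vector(tokens: Sequence[str],
                vectors: Sequence[Sequence[float]],
                target: str) -> List[float]:
    """Average the vectors of the sub-tokens of the first match of `target`."""
    if len(tokens) != len(vectors):
        raise ValueError("need exactly one vector per token")
    for word, idxs in merge_wordpieces(tokens):
        if word.lower() == target.lower():
            dim = len(vectors[idxs[0]])
            return [sum(vectors[i][d] for i in idxs) / len(idxs)
                    for d in range(dim)]
    raise KeyError(f"{target!r} not found in tokens")


# Hypothetical tokenization and placeholder vectors (not real model output).
tokens = ["[CLS]", "the", "ref", "##rig", "##erator", "passed", "the", "law", "[SEP]"]
vectors = [[float(i), 1.0] for i in range(len(tokens))]
print(word_vector(tokens, vectors, "refrigerator"))
```

In a real experiment, `vectors` would come from one of the model's hidden layers for the same token sequence.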
Let's start with a simple experiment. Using the word "refrigerator", we created the following 5 sentences:
- refrigerator (the word refrigerator used without any context)
- The refrigerator is in the kitchen (refrigerator as the subject of the sentence)
- The refrigerator is very cold (refrigerator again as the subject of the sentence)
- He put the food in the refrigerator (refrigerator as the object of the preposition "in")
- The refrigerator passed the law (refrigerator used in a nonsensical context)
Here, we confirmed our earlier hypothesis: using refrigerator without any context returns a very different representation from using it in a proper context. In addition, the representations returned when refrigerator is the subject (sentences 2 and 3) are more similar to each other than they are to the representations returned when refrigerator is an object (sentence 4) or appears in a nonsensical context (sentence 5).
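A comparison like this amounts to computing the similarity of every pair of extracted representations. The following is a minimal sketch under the assumption that each sentence's target-word vector is stored under a label; the labels and the two-dimensional vectors are placeholders, not measured values.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(
    reps: Dict[str, Sequence[float]]
) -> Dict[Tuple[str, str], float]:
    """Cosine similarity for every unordered pair of labelled vectors."""
    return {
        (a, b): cosine_similarity(reps[a], reps[b])
        for a, b in combinations(reps, 2)
    }


# Placeholder vectors standing in for the extracted representations.
reps = {
    "s1": [1.0, 0.0],
    "s2": [2.0, 0.0],
    "s3": [0.0, 1.0],
}
for pair, sim in pairwise_similarities(reps).items():
    print(pair, round(sim, 3))
```

For the five sentences above, this would yield ten pairs, which can then be read as a similarity matrix.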
Let's look at another example, this time using the word "pie". Again we make 5 sentences:
- pie (the word pie used without any context)
- The man ate a pie (pie as an object)
- The man threw away a pie (pie as an object)
- The pie is delicious (pie as the subject)
- The pie ate a man (pie used in a nonsensical context)
The trend we observed was very similar to the previous refrigerator experiment.
Next, let's return to our original king, queen, man and woman example. We make 4 almost identical sentences and change only the subject.
- The king passed the law
- The queen passed the law
- The man passed the law
- The woman passed the law
From the sentences above, we extract the BERT representation of each subject. In this case, we got better results: the cosine distance between queen and the reconstructed vector (king minus man plus woman) decreased slightly.
Finally, let's take a look at what happens to the word representations when the sentence structure stays the same but the sentiment of the sentence changes. Here, we make three sentences.
- Mathematics is a difficult subject
- Mathematics is a hard subject
- Mathematics is a simple subject
Using these sentences, we can explore what happens to the representations of the subject and the adjective when the sentiment changes. Interestingly, we find that synonymous adjectives (i.e., difficult and hard) have similar representations, while antonymous adjectives (i.e., difficult and simple) have very different representations.
In addition, when the sentiment changes, we find that the representations of the subject, mathematics, are more similar when the sentences share the same sentiment (e.g., difficult and hard) and less similar when the sentiments differ (e.g., difficult and simple).
In summary, our experimental results seem to indicate that, like Word2Vec, BERT can also learn semantic vector representations (though the effect is less pronounced). BERT appears to depend heavily on contextual information: a word without any context has a very different representation from the same word in context, and different contexts (such as a change in sentence sentiment) also change the representation of the subject.
Keep in mind, however, that there is always a risk of over-generalizing when evidence is limited. The experiments in this article are not comprehensive and should be considered only a start. The sample size we use is very small (relative to the vast vocabulary of English), and we evaluate one very specific distance metric (cosine distance) on a very specific, hand-built set of sentences. Future work on analyzing BERT representations should extend the analysis in all of these directions. Finally, thanks to John Hewitt and Zack Lipton for the very interesting discussions on this subject.
This article is reposted from the WeChat public account AI Frontline.