1 Overview
Text representation is a very basic yet very important part of NLP. The commonly used text representations fall into two categories:
- Discrete representation (Discrete Representation);
- Distributed representation (Distributed Representation).
This article aims to introduce these two types of commonly used text representations.
2 Discrete representation (Discrete Representation)
2.1 One-Hot
One-Hot encoding, also known as "independent encoding" or "dummy encoding", is the most traditional and basic way of representing a word (or character). It represents a word (or character) as a vector whose dimension is the length of the dictionary (built from the corpus): the value at the position of the current word is 1, and all other positions are 0.
Steps to encode text with One-Hot (a minimal sketch follows this list):
- Build a dictionary (vocabulary) from the corpus and create word-to-index and index-to-word mappings (stoi, itos);
- Convert the sentence to be represented into indexes;
- Create a One-Hot encoder;
- Use the One-Hot encoder to encode the sentence.
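As a minimal sketch of these steps (the toy corpus, scikit-learn's OneHotEncoder, and all variable names are illustrative assumptions, not from the original article):

```python
from sklearn.preprocessing import OneHotEncoder

# Toy corpus (illustrative)
corpus = ["jane wants to go to shenzhen", "bob wants to go to shanghai"]

# 1. Build the dictionary and the word/index mappings (stoi, itos)
vocab = sorted({w for sent in corpus for w in sent.split()})
stoi = {w: i for i, w in enumerate(vocab)}
itos = {i: w for w, i in stoi.items()}

# 2. Convert the sentence to be represented into indexes (one row per word)
sentence = "jane wants to go to shenzhen"
indexes = [[stoi[w]] for w in sentence.split()]

# 3. Create the One-Hot encoder over all dictionary indexes
encoder = OneHotEncoder(categories=[list(range(len(vocab)))])

# 4. Encode the sentence: one row per word, one column per dictionary entry
one_hot = encoder.fit_transform(indexes).toarray()
print(one_hot.shape)  # (6, 7): 6 words in the sentence, 7 words in the dictionary
```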
One-Hot encoding has the following characteristics:
- The length of the word vector is the length of the dictionary;
- In the vector, the value of the index position of the word is 1, and the rest of the values are 0;
- Encoding text with One-Hot produces a sparse matrix (Sparse Matrix);
Things to note:
- The vector representations of different words are orthogonal to each other and cannot measure the relationship between different words;
- The encoding can only reflect whether a word appears in a sentence; it cannot measure the importance of different words;
- Encoding text with One-Hot yields a high-dimensional sparse matrix, which wastes computing and storage resources;
2.2 Bag-of-words model (Bag of Words, BOW)
Example:
- Jane wants to go to Shenzhen.
- Bob wants to go to Shanghai.
The bag-of-words model does not consider word order or syntactic information; each word is independent of the others. The words are simply put into a "bag" and the frequency of each word is counted.
Features of bag-of-words encoding (see the sketch after this list):
- The bag-of-words model encodes a text (rather than an individual word or character);
- The length of the encoded vector is the length of the dictionary;
- The encoding ignores the order in which the words appear;
- In the vector, the value of the index position of the word is the number of times the word appears in the text; if the word at the index position does not appear in the text, the value is 0;
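A minimal bag-of-words sketch of the two example sentences, assuming scikit-learn's CountVectorizer (scikit-learn >= 1.0; the article itself does not specify a library):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Jane wants to go to Shenzhen.",
    "Bob wants to go to Shanghai.",
]

vectorizer = CountVectorizer()           # bag of words: counts only, order ignored
bow = vectorizer.fit_transform(corpus)   # sparse matrix, one row per text

print(vectorizer.get_feature_names_out())  # the dictionary
print(bow.toarray())
# The column for "to" holds 2 in both rows; "jane"/"bob" hold 1 in one row and 0 in the other.
```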
Disadvantages
- The encoding ignores word position. Position is very important information in text: the same words in different orders can carry different semantics (for example, "the cat loves to eat the mouse" and "the mouse loves to eat the cat" receive the same encoding);
- Although the encoding counts how many times a word appears in a text, it cannot distinguish the importance of common words (such as "I", "is", "the", etc.) from keywords (such as "natural language processing", "NLP") in the text;
2.3 TF-IDF (term frequency-inverse document frequency)
To solve the bag-of-words model's inability to distinguish the importance of common words (such as "is", "the", etc.) from proper nouns (such as "natural language processing", "NLP") in a text, the TF-IDF algorithm was born.
TF-IDF is short for term frequency-inverse document frequency, where:
- TF (Term Frequency): the frequency with which a word appears in the current text. A high-frequency word is either an important word (such as "natural language processing") or a common word (such as "I", "is", "the", etc.);
- IDF (Inverse Document Frequency): inverse document frequency. Document frequency is the proportion of documents in the corpus that contain a given word; the inverse document frequency is its reciprocal (usually taken with a logarithm);
Formula
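A standard form of TF-IDF (the exact smoothing varies between implementations) is:

$$\mathrm{tf}(w, d) = \frac{\text{number of occurrences of } w \text{ in } d}{\text{total number of words in } d}$$

$$\mathrm{idf}(w) = \log \frac{N}{1 + |\{d \in D : w \in d\}|}$$

$$\text{tf-idf}(w, d) = \mathrm{tf}(w, d) \times \mathrm{idf}(w)$$

where N is the total number of documents in the corpus D, and the 1 in the denominator avoids division by zero for words that appear in no document.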
Advantages
- Simple to implement; the algorithm is easy to understand and highly interpretable;
- From the way IDF is computed, common words (such as "I", "is", "the", etc.) appear in most documents of the corpus, so their IDF values are small, while keywords (such as "natural language processing", "NLP") appear only in documents of a particular domain, so their IDF values are larger. TF-IDF can therefore filter out common, uninformative words while retaining the important words of an article;
Disadvantages
- It cannot reflect the position information of words. When extracting keywords, positions such as the title, the beginning of a sentence, and the end of a sentence should be given higher weight;
- IDF is a weighting scheme that tries to suppress noise, and it tends to favor words with a low document frequency in the corpus, so the accuracy of IDF is not high;
- TF-IDF depends heavily on the corpus (especially when training on corpora of similar documents, it often hides keywords shared by that type of document; for example, if the corpus used for TF-IDF training contains mostly entertainment news, the weights of entertainment-related keywords will be low), so a high-quality corpus must be selected for training;
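A minimal TF-IDF sketch on the bag-of-words example above, assuming scikit-learn's TfidfVectorizer (which applies its own smoothing and L2 normalisation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Jane wants to go to Shenzhen.",
    "Bob wants to go to Shanghai.",
]

vectorizer = TfidfVectorizer()            # smoothed TF-IDF with L2 normalisation
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix, one row per text

print(vectorizer.get_feature_names_out())
print(tfidf.toarray())
# Words shared by both texts ("wants", "to", "go") receive a lower IDF than
# words unique to one text ("jane", "shenzhen", "bob", "shanghai").
```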
3 Distributed representation (Distributed Representation)
Theoretical basis:
- In 1954, Harris proposed the distributional hypothesis (Distributional Hypothesis), which laid the theoretical foundation for this approach: A word's meaning is given by the words that frequently appear close-by (words with similar contexts have similar semantics);
- In 1957, Firth further elaborated and clarified the distributional hypothesis: A word is characterized by the company it keeps (the semantics of a word is determined by its context);
3.1 n-gram
An n-gram is a language model (Language Model, LM). A language model is a probability-based discriminative model: its input is a sentence (a sequence of words), and its output is the probability of that sentence, i.e. the joint probability of these words (Joint Probability). (Note: a language model judges whether a sentence is natural.)
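To make this concrete, here is a minimal bigram language-model sketch (maximum-likelihood estimates on a toy corpus, no smoothing; the corpus and boundary markers are illustrative assumptions):

```python
from collections import Counter, defaultdict

# Toy corpus; <s> and </s> mark sentence boundaries
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1

def bigram_prob(prev, cur):
    """Maximum-likelihood estimate of P(cur | prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total if total else 0.0

def sentence_prob(sentence):
    """Joint probability of the sentence under the bigram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, cur)
    return p

print(sentence_prob("I like NLP"))   # a "normal" sentence gets a probability > 0
print(sentence_prob("NLP like I"))   # an unnatural word order gets 0 here
```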
3.2 Co-occurrence matrix (Co-Occurrence Matrix)
First specify a window size, then count how many times words appear together within that window (a symmetric window) and use these counts as the word vectors.
Corpus:
- I like deep learning.
- I like NLP.
- I enjoy flying.
Note: with the window size set to 1 (i.e. left and right window_length = 1, equivalent to a bi-gram), the co-occurring pairs are: (I, like), (like, deep), (deep, learning), (learning, .), (I, like), (like, NLP), (NLP, .), (I, enjoy), (enjoy, flying), (flying, .). The co-occurrence matrix of the corpus is shown in the following table:

| counts | I | like | enjoy | deep | learning | NLP | flying | . |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| I | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| like | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| enjoy | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| deep | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| learning | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| NLP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| flying | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| . | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
It can be seen from the co-occurrence matrix above that the words like and enjoy both appear near the word I with roughly equal counts, so their semantics and grammatical usage are roughly similar.
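A minimal sketch that builds this co-occurrence matrix from the toy corpus (the period is pre-separated as its own token, matching the pairs listed above):

```python
from collections import defaultdict

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
window = 1  # symmetric window of size 1

cooc = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooc[word][tokens[j]] += 1

print(cooc["I"]["like"])   # 2 -> "like" appears next to "I" twice
print(cooc["I"]["enjoy"])  # 1 -> "enjoy" also appears next to "I"
```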
Advantages
- Takes the order of the words in the sentence into account;
Disadvantages
- The vocabulary is large, so the word vectors are very long;
- The co-occurrence matrix is also a sparse matrix (algorithms such as SVD and PCA can be used for dimensionality reduction, but the computation is expensive);
3.3 Word2Vec
The word2vec model is a word representation method released by the Google team in 2013. Once this way of using pre-trained word vectors caught on, it blossomed throughout the NLP field.
Model
word2vec has two models, CBOW and Skip-gram (a training sketch follows this list):
- CBOW: uses the context words to predict the center word;
- Skip-gram: uses the center word to predict the context words;
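A minimal training sketch using gensim (assuming gensim >= 4.0; the toy corpus is illustrative and far too small to produce meaningful vectors):

```python
from gensim.models import Word2Vec

# Toy tokenised corpus (illustrative)
sentences = [
    ["i", "like", "deep", "learning"],
    ["i", "like", "nlp"],
    ["i", "enjoy", "flying"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of the word vectors
    window=2,        # context window size
    min_count=1,     # keep every word in this tiny corpus
    sg=1,            # sg=1 -> Skip-gram; sg=0 -> CBOW
)

print(model.wv["nlp"].shape)         # (50,)
print(model.wv.most_similar("nlp"))  # nearest words by cosine similarity
```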
Advantages
- Learns semantic and syntactic information by taking the context of words into account;
- The resulting word vectors are low-dimensional, saving storage and computing resources;
- Versatile and applicable to a wide range of NLP tasks;
Disadvantages
- Words and vectors are in a one-to-one relationship, so the problem of polysemy is not solved;
- word2vec is a static model. Although it is versatile, it cannot be dynamically optimized for specific tasks.
3.4 GloVe
GloVe is a word vector representation algorithm from Stanford University by Jeffrey Pennington, Richard Socher, et al. Its full name is Global Vectors for Word Representation; it is a word representation algorithm based on global word-frequency statistics (count-based & overall statistics). The algorithm combines the advantages of two approaches: global matrix factorization and the local context window.
Note: the derivation of the GloVe model is rather involved and is not covered in detail here. For details, see the official website (https://nlp.stanford.edu/projects/glove/).
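GloVe training itself is usually done with the official toolkit; in practice one typically loads the pre-trained vectors published on the site above. A minimal loading sketch (assuming the glove.6B archive from the official download page has been unpacked locally; the file name is an assumption):

```python
import numpy as np

def load_glove(path):
    """Load pre-trained GloVe vectors from the plain-text format
    (one word followed by its vector values per line)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.50d.txt")  # assumed local path
print(glove["language"].shape)          # (50,)
```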
Advantages
- Learns semantic and syntactic information, taking into account both the context of words and global corpus statistics;
- The resulting word vectors are low-dimensional, saving storage and computing resources;
- Versatile and applicable to a wide range of NLP tasks;
Disadvantages
- Words and vectors are in a one-to-one relationship, so the problem of polysemy is not solved;
- GloVe is also a static model. Although it is versatile, it cannot be dynamically optimized for specific tasks.
3.5 ELMo
The word vectors obtained by the word2vec and GloVe algorithms are static word vectors (a static word vector fuses the senses of a polysemous word and no longer changes with context after training), so they cannot solve the problem of polysemy (e.g. "I bought 7 apples today" versus "I bought an Apple 7 today": "apple" is a polysemous word). The word vectors trained by the ELMo model can solve the problem of polysemy.
ELMo is short for "Embeddings from Language Models", a name that does not reflect the characteristics of the model very well. The title of the ELMo paper, "Deep contextualized word representations", expresses the idea of the algorithm more accurately.
The essence of the algorithm is to train a neural network as a language model; when a word embedding is used, the word already carries context information, so the network can adjust the word embedding according to that context. The adjusted word embedding better expresses the word's specific meaning in that context, which solves the problem that static word vectors cannot represent polysemy.
Network model
Process (a sketch of the final weighted combination follows this list):
- In the structure shown in the figure above, a character-level convolutional neural network (Convolutional Neural Network, CNN) converts the words of the text into raw word vectors (Raw word vector);
- The raw word vectors are fed into the first layer of the bidirectional language model;
- The forward pass contains information about the word and the vocabulary or context that precedes it (i.e. the preceding text);
- The backward pass contains information about the word and the vocabulary or context that follows it (i.e. the following text);
- The information from these two passes forms an intermediate word vector (Intermediate word vector);
- The intermediate word vector is fed into the next layer of the model;
- The final representation is the weighted sum of the raw word vector and the two intermediate word vectors;
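The last step above can be sketched as a softmax-weighted combination of the per-layer vectors for one token (the vectors here are random placeholders and the weights are illustrative; in ELMo they are learned for each downstream task):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def combine_layers(layer_vectors, scores, gamma=1.0):
    """Softmax-normalised weighted sum of per-layer token vectors, scaled by gamma."""
    weights = softmax(np.asarray(scores, dtype=np.float64))
    stacked = np.stack(layer_vectors)          # (num_layers, dim)
    return gamma * (weights[:, None] * stacked).sum(axis=0)

dim = 8                          # illustrative dimensionality
raw = np.random.randn(dim)       # raw (character-CNN) word vector
inter_1 = np.random.randn(dim)   # first intermediate word vector
inter_2 = np.random.randn(dim)   # second intermediate word vector

final = combine_layers([raw, inter_1, inter_2], scores=[0.2, 0.5, 0.3])
print(final.shape)  # (8,)
```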
Effect
As shown in the figure:
- Among the word vectors trained with GloVe, most of the words closest to play are related to sports, because the texts containing play in the corpus come from the sports domain;
- Among the word vectors trained with ELMo, when play takes the sense of a performance, the sentences closest to it are also about performances;
4 Conclusion
- Most of the word vectors used in deep learning for NLP today are distributed word vectors;
- The theoretical basis of distributed word vectors is the language model;
- When choosing word vectors, the characteristics of the specific task must be considered. Word vectors trained with word2vec, GloVe, and ELMo each have their own advantages and disadvantages, and none of them is decisively better than the other two.
This article is reproduced from the WeChat official account AI Institute.