To learn more about NLP, visit the NLP topic page, where a free 59-page NLP PDF is available for download.
Text representation
Text is unstructured data and cannot be computed on directly.
The role of text representation is to transform this unstructured information into a structured form. Once text is represented this way, we can perform computations on it to carry out everyday tasks such as text classification and sentiment analysis.
There are many ways to represent text; three of them are introduced below:
- One-hot encoding (one-hot representation)
- Integer encoding
- Word embedding
One-hot encoding (one-hot representation)
Suppose the text we want to process contains 4 words: cat, dog, cow, sheep. Each position in the vector corresponds to one word, so the one-hot representations are:
Cat: [1, 0, 0, 0]
Dog: [0, 1, 0, 0]
Cow: [0, 0, 1, 0]
Sheep: [0, 0, 0, 1]
In practice, however, a text is likely to contain thousands of distinct words, so the vectors become very long and more than 99% of their entries are 0.
The disadvantages of one-hot are as follows:
- It cannot express relationships between words
- The resulting vectors are extremely sparse, which makes computation and storage inefficient
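Below is a minimal sketch of one-hot encoding for the 4-word example above; the `vocab` list and the `one_hot` helper are illustrative names only, not part of any particular library.

```python
# A minimal one-hot encoding sketch for the cat/dog/cow/sheep example.
vocab = ["cat", "dog", "cow", "sheep"]

def one_hot(word, vocab):
    """Return a vector with a 1 at the word's position and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("cat", vocab))    # [1, 0, 0, 0]
print(one_hot("sheep", vocab))  # [0, 0, 0, 1]
```

With a real vocabulary of thousands of words, each vector would contain thousands of entries with only a single 1, which is exactly the sparsity problem described above.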
Integer encoding
This method is also easy to understand: each word is represented by a single integer. For the example above:
Cat: 1
Dog: 2
Cow: 3
Sheep: 4
Putting together the integers for each word in a sentence gives a vector that represents the sentence.
The disadvantages of integer encoding are as follows:
- It cannot express relationships between words
- The arbitrary integer values carry no meaning, which makes them hard for a model to interpret
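A minimal sketch of integer encoding, using the same 4-word vocabulary; the dictionary and the sample sentence are made up purely for illustration.

```python
# Map each word in the vocabulary to an integer index.
vocab = {"cat": 1, "dog": 2, "cow": 3, "sheep": 4}

# A sentence is then represented by the sequence of its word indices.
sentence = ["dog", "cat", "sheep"]
encoded = [vocab[word] for word in sentence]
print(encoded)  # [2, 1, 4]
```

Note that the numbers 1 to 4 imply an ordering (sheep > cat) that has nothing to do with meaning, which is why this representation is hard for models to interpret.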
What is word embedding?
Word embedding is a class of methods for text representation. It serves the same purpose as one-hot encoding and integer encoding, but it has more advantages.
Word embedding does not refer to one specific algorithm. Compared with the two methods above, it has several clear advantages:
- It can represent text with low-dimensional vectors, unlike the very long one-hot vectors.
- Words with similar meanings are also close to each other in the vector space (see the sketch after this list).
- It is very versatile and can be used in different tasks.
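To make the second advantage concrete, here is a small sketch that compares word vectors with cosine similarity; the 3-dimensional vectors are invented for illustration only, since real embeddings are learned from data and typically have tens to hundreds of dimensions.

```python
import numpy as np

# Toy embeddings, hand-made for illustration; real ones are learned from a corpus.
embeddings = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.85, 0.75, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means very similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high, ~0.99
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # much lower, ~0.30
```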
2 mainstream word embedding algorithms
Word2vec
Word2vec is a statistical method for obtaining word vectors. It was proposed by Mikolov's team at Google in 2013 as a new set of word embedding methods.
This algorithm has 2 training modes (a minimal sketch of both follows below):
- CBOW: predict the current word from its surrounding context
- Skip-gram: predict the surrounding context from the current word
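Here is a small sketch of training both modes with the gensim library, assuming gensim 4.x (where the switch between CBOW and skip-gram is the `sg` parameter); the two-sentence corpus is a toy example.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 -> CBOW (predict the current word from its context)
# sg=1 -> skip-gram (predict the context from the current word)
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)                     # (50,) - one dense vector per word
print(skipgram.wv.most_similar("cat", topn=2))  # nearest neighbours in this toy space
```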
To learn more about Word2vec, you can read this article: "Understand Word2vec in one article (basic concepts + 2 training modes + 5 advantages and disadvantages)"
GloVe
GloVe is an extension of the Word2vec method; it combines global corpus statistics with Word2vec's context-based learning.
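Pre-trained GloVe vectors can be explored without training anything yourself; the sketch below assumes gensim's downloader and the "glove-wiki-gigaword-50" dataset name from the gensim-data package (the first call downloads the vectors, roughly tens of megabytes).

```python
import gensim.downloader as api

# Load 50-dimensional GloVe vectors trained on Wikipedia + Gigaword.
glove = api.load("glove-wiki-gigaword-50")

# Words with related meanings end up close together in the vector space.
print(glove.most_similar("king", topn=3))
print(glove.similarity("cat", "dog"))
```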
To learn about GloVe's three-step implementation, its training method, and how it compares with Word2vec, check out this article: "GloVe in detail"
Baidu Encyclopedia and Wikipedia
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP), in which words or phrases from a vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension.
Methods for generating such mappings include neural networks, dimensionality reduction of word co-occurrence matrices, probabilistic models, interpretable knowledge base methods, and the explicit representation of the context in which words appear.
When used as the underlying input representation, word and phrase embeddings have been shown to improve performance on NLP tasks such as syntactic parsing and sentiment analysis.
Word embedding is an important part of natural language processing. It is a general term for a family of language-processing models and does not refer to one specific algorithm or model. The task of word embedding is to convert words into vectors that can be computed. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension.