Understand word embedding in one article

To learn more about NLP, visit the NLP topic, where a free 59-page NLP PDF is available for download.

Text representation

Text is unstructured data and cannot be computed on directly.

The role of text representation is to transform this unstructured information into structured information. With that structure, we can perform computations on text to complete everyday tasks such as text classification and sentiment analysis.

Text representation: transforming unstructured data into structured data

There are many ways to represent text; three types of methods are introduced below:

  1. One-hot encoding | one-hot representation
  2. Integer encoding
  3. Word embedding

The relationship between word embedding and the other text representation methods

 

One-hot encoding | one-hot representation

Suppose the text we want to compute contains 4 words: cat, dog, cow, sheep. Each position in the vector represents one word. Expressed with one-hot encoding, they are:

Cat: [1, 0, 0, 0]

Dog: [0, 1, 0, 0]

Cow: [0, 0, 1, 0]

Sheep: [0, 0, 0, 1]

one-hot encoding

But in practice, a text is likely to contain thousands of different words, so the vectors become very long and more than 99% of their entries are 0.

The disadvantages of one-hot are as follows:

  1. Unable to express the relationship between words
  2. The vectors are extremely sparse, which makes computation and storage inefficient
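A minimal sketch of this idea in Python, assuming a NumPy environment; the four-word vocabulary and the `one_hot` helper are purely illustrative and not part of any particular library:

```python
import numpy as np

# Illustrative vocabulary from the example above; the order fixes each word's position.
vocab = ["cat", "dog", "cow", "sheep"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 at the word's position and 0 elsewhere."""
    vector = np.zeros(len(vocab), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("cat"))    # [1 0 0 0]
print(one_hot("sheep"))  # [0 0 0 1]
```

With a real vocabulary of tens of thousands of words, each vector would have that many entries, which is exactly the sparsity problem described above.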

 

Integer encoding

This method is also very easy to understand: each word is represented by a single number. The example above becomes:

Cat: 1

Dog: 2

Cow: 3

Sheep: 4

Integer encoding

Stringing together the numbers for each word in a sentence yields a vector that can represent the sentence.

The disadvantages of integer encoding are as follows:

  1. Unable to express the relationship between words
  2. Integer encoding can be challenging for model interpretation
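A minimal sketch of integer encoding in plain Python; the `word_to_id` mapping and the `encode` helper are illustrative names, not from any library:

```python
# Illustrative mapping from the example above: each word gets an integer id.
word_to_id = {"cat": 1, "dog": 2, "cow": 3, "sheep": 4}

def encode(sentence):
    """Turn a tokenized sentence into a vector of integer ids."""
    return [word_to_id[word] for word in sentence]

print(encode(["cat", "sheep", "dog"]))  # [1, 4, 2]
```

Note that the ids are arbitrary: nothing about 1 and 2 says that "cat" and "dog" are more related than "cat" and "sheep", which is the first disadvantage listed above.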

 

What is word embedding?

Word embedding is one class of methods for text representation. It serves the same purpose as one-hot encoding and integer encoding, but it has more advantages.

Word embedding does not refer to one specific algorithm. Compared with the two methods above, it has several obvious advantages:

  1. It can express text with low-dimensional vectors, far shorter than one-hot vectors.
  2. Words with similar semantics are also similar in vector space.
  3. It is very versatile and can be used in different tasks.

Recall the example above:

word embedding: semantically similar words will be similar in vector space
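To make the second advantage concrete, here is a toy sketch in Python with made-up 3-dimensional vectors (real embeddings are learned from data and typically have 50 to 300 dimensions); cosine similarity is one common way to measure closeness in vector space:

```python
import numpy as np

# Toy embeddings invented for illustration only.
embeddings = {
    "cat":   np.array([0.90, 0.80, 0.10]),
    "dog":   np.array([0.85, 0.75, 0.20]),
    "sheep": np.array([0.10, 0.20, 0.90]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1 mean very similar."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))    # close to 1
print(cosine_similarity(embeddings["cat"], embeddings["sheep"]))  # noticeably smaller
```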

 

2 mainstream word embedding algorithms


Word2vec

This is a statistical method for obtaining word vectors. It was proposed by Mikolov at Google in 2013 as a new set of word embedding methods.

This algorithm has 2 training modes:

  1. Predicting the current word from its context (CBOW)
  2. Predicting the context from the current word (Skip-gram)
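A minimal training sketch, assuming the gensim library is installed; the tiny corpus and parameter values are only illustrative (real training needs far more text):

```python
from gensim.models import Word2Vec

# A tiny illustrative corpus of pre-tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "mat"],
    ["sheep", "and", "cows", "graze", "in", "the", "field"],
]

# sg=0 trains CBOW (predict the current word from its context);
# sg=1 trains Skip-gram (predict the context from the current word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["cat"].shape)              # (50,): a dense, low-dimensional vector
print(model.wv.similarity("cat", "dog"))  # cosine similarity between two words
```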

To learn more about Word2vec, you can read this article: "Understand Word2vec in one article (basic concepts + 2 training models + 5 advantages and disadvantages)"

 

GloVe

GloVe is an extension of the Word2vec method, which combines global statistics with Word2vec's context-based learning.
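Rather than training from scratch, pretrained GloVe vectors are often loaded directly. A hedged sketch, assuming gensim's downloader and its "glove-wiki-gigaword-50" pretrained model (downloaded on first use):

```python
import gensim.downloader as api

# Load 50-dimensional GloVe vectors pretrained on Wikipedia + Gigaword.
glove = api.load("glove-wiki-gigaword-50")

print(glove["king"].shape)                 # (50,)
print(glove.most_similar("king", topn=3))  # semantically related words
```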

To learn about GloVe's three-step implementation, its training method, and how it compares with Word2vec, check out this article: "GloVe in detail"

 

Baidu Encyclopedia and Wikipedia

Baidu Encyclopedia version

Word embedding, also known as Word Embedded, is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of lower dimension.

Methods for generating such mappings include neural networks, dimensionality reduction of word co-occurrence matrices, probabilistic models, interpretable knowledge base methods, and the explicit representation of the context in which words appear.

When used as an underlying input representation, word and phrase embedding has been shown to improve the performance of NLP tasks, such as syntactic parsing and sentiment analysis.


Wikipedia version

Word embedding is an important part of natural language processing. It is a general term for a family of language processing models and does not refer to one specific algorithm or model. The task of word embedding is to convert words into vectors that can be computed. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of lower dimension.

Methods for generating such mappings include neural networks, dimensionality reduction of word co-occurrence matrices, probabilistic models, interpretable knowledge base methods, and the explicit representation of the context in which words appear.

When used as an underlying input representation, word and phrase embedding has been shown to improve the performance of NLP tasks, such as parsing and sentiment analysis.
