## Text representation

Text is unstructured data that cannot be computed on directly.

The role of text representation is to transform this unstructured information into structured information. Once that is done, we can run computations on text to complete everyday tasks such as text classification and sentiment analysis.

There are many ways to represent text. Only three are introduced below:

1. One-hot representation
2. Integer encoding
3. Word embedding

## One-hot representation

Suppose the text we want to process contains 4 words: cat, dog, cow, sheep. Each position in the vector represents one word, so the one-hot representation is:

Cat: [1, 0, 0, 0]

Dog: [0, 1, 0, 0]

Cow: [0, 0, 1, 0]

Sheep: [0, 0, 0, 1]

In practice, however, a text is likely to contain thousands of different words, so the vector becomes very long and more than 99% of its entries are 0.
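The mapping above can be sketched in a few lines of Python. This is a minimal illustration, assuming a small fixed vocabulary held in a list; the helper name `one_hot` is made up for this example.

```python
def one_hot(word, vocab):
    """Return a one-hot list for `word`, given an ordered vocabulary."""
    vector = [0] * len(vocab)       # one position per vocabulary word
    vector[vocab.index(word)] = 1   # mark the position that belongs to `word`
    return vector

vocab = ["cat", "dog", "cow", "sheep"]

for word in vocab:
    print(word, one_hot(word, vocab))
# cat [1, 0, 0, 0]
# dog [0, 1, 0, 0]
# cow [0, 0, 1, 0]
# sheep [0, 0, 0, 1]
```

The vector length always equals the vocabulary size, which is why it grows so quickly on real text.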

The disadvantages of one-hot are as follows:

1. It cannot express the relationship between words
2. The vectors are so sparse that computation and storage are inefficient
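Both disadvantages can be checked with a small sketch. The vocabulary size of 10,000 is an assumption chosen for illustration, and `one_hot_at` and `dot` are hypothetical helpers, not part of any library.

```python
def one_hot_at(index, size):
    """Return a one-hot list of length `size` with a 1 at `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

VOCAB_SIZE = 10_000  # assumed vocabulary size, for illustration only

cat = one_hot_at(0, VOCAB_SIZE)
dog = one_hot_at(1, VOCAB_SIZE)

# Any two different words have a dot product of 0, so the vectors
# carry no information about how related the words are.
print(dot(cat, dog))  # 0
print(dot(cat, cat))  # 1

# Almost every entry is 0, which wastes memory and computation.
zero_fraction = cat.count(0) / len(cat)
print(zero_fraction)  # 0.9999
```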

## Integer encoding

This method is also very easy to understand: use a number to represent each word. For the example above:

Cat: 1

Dog: 2

Cow: 3

Sheep: 4

Putting the numbers for each word in a sentence together gives a vector that represents the sentence.
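As a rough sketch, integer encoding is a dictionary lookup per word. The whitespace tokenization and the `encode` helper below are simplifying assumptions made for this example.

```python
# Word-to-number table from the example above.
word_to_id = {"cat": 1, "dog": 2, "cow": 3, "sheep": 4}

def encode(sentence, table):
    """Turn a sentence into a list of integers, one per word."""
    return [table[word] for word in sentence.lower().split()]

print(encode("cat dog sheep", word_to_id))  # [1, 2, 4]
```

Words missing from the table would raise a `KeyError` here; real pipelines usually reserve a number for unknown words.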

The disadvantages of integer encoding are as follows:

1. It cannot express the relationship between words
2. It can be hard for a model to interpret, because the numbers suggest an order and a distance between words that do not actually exist

## What is word embedding?

Word embedding is a family of methods for text representation. It serves the same purpose as one-hot encoding and integer encoding, but it has more advantages.

Word embedding does not refer to one specific algorithm. Compared with the two methods above, it has several clear advantages:

1. It can express text with a low-dimensional vector, not one as long as a one-hot vector.
2. Words with similar semantics are also similar in vector space.
3. It is very versatile and can be used in different tasks.
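The first two advantages can be illustrated with a toy lookup table. The 3-dimensional vectors below are hand-picked, hypothetical values and not the output of any trained model; cosine similarity is used here as one common way to compare vectors.

```python
import math

# Hypothetical, hand-written embeddings (NOT learned from data).
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "cow": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine(embeddings["cat"], embeddings["dog"])
cat_cow = cosine(embeddings["cat"], embeddings["cow"])

# In this toy table, "cat" sits closer to "dog" than to "cow".
print(cat_dog > cat_cow)  # True
```

Each word needs only 3 numbers here, regardless of vocabulary size, whereas a one-hot vector needs one entry per vocabulary word.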


## 2 mainstream word embedding algorithms

### Word2vec

This is a statistical method for obtaining word vectors. It was proposed in 2013 by Mikolov at Google as a new set of word embedding methods.

This algorithm has 2 training modes:

1. Predicting the current word from its context (CBOW)
2. Predicting the context from the current word (skip-gram)
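The difference between the two modes, usually called CBOW and skip-gram, shows up in how training examples are built from a sentence. The sketch below only generates those examples; it does not train a network, and the window size and the `training_pairs` helper are assumptions made for illustration.

```python
def training_pairs(tokens, window=1):
    """Build (input, output) examples for both training modes."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if not context:
            continue
        # Mode 1: the context words are the input, the current word is the output.
        cbow.append((context, target))
        # Mode 2: the current word is the input, each context word is an output.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

tokens = ["the", "cat", "chased", "the", "dog"]
cbow, skipgram = training_pairs(tokens, window=1)

print(cbow[1])       # (['the', 'chased'], 'cat')
print(skipgram[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'chased')]
```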

### GloVe

GloVe is an extension of the Word2vec method, which combines global statistics with Word2vec's context-based learning.
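The "global statistics" here are word co-occurrence counts collected over the whole corpus. The sketch below shows only that counting step, under simplifying assumptions (a tiny made-up corpus, a fixed window, unweighted counts); it is not GloVe's actual implementation, which goes on to fit word vectors to these counts.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Tiny made-up corpus, for illustration only.
corpus = [
    ["cat", "chases", "dog"],
    ["dog", "chases", "cat"],
]

counts = cooccurrence_counts(corpus, window=1)
print(counts[("cat", "chases")])  # 2
print(counts[("cat", "dog")])     # 0
```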

To learn about GloVe's three-step implementation, its training method, and how it compares with Word2vec, see the article "GloVe in detail".

## Baidu Encyclopedia and Wikipedia

Baidu Encyclopedia version

Word embedding is the collective name for a group of language modeling and feature learning techniques in natural language processing (NLP), in which words or phrases from a vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with far fewer dimensions.

Methods for generating such mappings include neural networks, dimensionality reduction of word co-occurrence matrices, probabilistic models, interpretable knowledge base methods, and the explicit representation of the context in which words appear.

When used as the underlying input representation, word and phrase embeddings have been shown to improve performance on NLP tasks such as syntactic parsing and sentiment analysis.