https://easyai.tech/wp-content/uploads/2022/08/39995-2021-03-30-typefeature.png

Categorical features are an important class of features. Categorical features are discrete rather than continuous.

This article introduces 5 mainstream encoding methods for small and large categorical features, along with their respective advantages and disadvantages.

 

What are categorical features?

Categorical features represent categories. Unlike numerical features, which are continuous, categorical features are discrete.

For example:

  • Gender
  • City
  • Color
  • IP address
  • User account ID

https://easyai.tech/wp-content/uploads/2022/08/d2797-2021-03-30-lisan.png

Some categorical features also take numeric values, such as account IDs and IP addresses, but these values are not continuous.

Continuous numbers are numerical features, and discrete numbers are categorical features.

For an explanation of continuous versus discrete, see this article: "Understanding of continuous and discrete"

Encoding small categorical features

https://easyai.tech/wp-content/uploads/2022/08/5345c-2021-03-30-small-data.png

Ordinal Encoding (Natural Number / Sequence Encoding)

Some categories have a natural order; in those cases, simple natural number encoding can be used.

For example, degree:

  • Bachelor - 0
  • Master - 1
  • Ph.D. - 2
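
As a quick illustration, here is a minimal sketch of ordinal encoding with scikit-learn; the degree values and their ordering follow the example above.

```python
# A minimal sketch of ordinal encoding with scikit-learn.
from sklearn.preprocessing import OrdinalEncoder

degrees = [["Bachelor"], ["Master"], ["Ph.D."], ["Master"]]

# Passing an explicit category order preserves the Bachelor < Master < Ph.D.
# ranking; otherwise categories would simply be numbered alphabetically.
encoder = OrdinalEncoder(categories=[["Bachelor", "Master", "Ph.D."]])
print(encoder.fit_transform(degrees))  # [[0.], [1.], [2.], [1.]]
```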

One-Hot Encoding

Features such as city, color, brand, and material are not suited to natural number encoding, because they have no ordering relationship.

One-hot encoding puts the different categories in an "equal position", so the model is not skewed by the magnitude of the encoded values.

For example, color (assuming there are only 3 colors):

  • Red - 100
  • Yellow - 010
  • Blue - 001
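
A minimal sketch of one-hot encoding with pandas, using the three colors above:

```python
# A minimal sketch of one-hot encoding with pandas.
import pandas as pd

colors = pd.Series(["Red", "Yellow", "Blue", "Red"])

# Each color becomes its own 0/1 indicator column, and exactly one
# column is "hot" in every row.
print(pd.get_dummies(colors))
```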

Similar to one-hot encoding, there are also "Dummy Encoding" and "Effect Encoding".

Their implementations are similar, but they differ in small ways and suit different scenarios; see the sketch below.
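
As a rough sketch of the difference, dummy encoding is one-hot encoding with one category dropped; the dropped category acts as the reference level (the colors are illustrative):

```python
# A minimal sketch of dummy encoding with pandas: 3 colors need only
# 2 columns, because the dropped category is the implicit baseline.
import pandas as pd

colors = pd.Series(["Red", "Yellow", "Blue", "Red"])
print(pd.get_dummies(colors, drop_first=True))  # "Blue" (first alphabetically) is dropped
```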

Those who are interested can read these articles:

"The difference between dummy variables and one-hot encoding"

"Assignment method: effect coding"

Encoding large categorical features

https://easyai.tech/wp-content/uploads/2022/08/f340a-2021-03-30-big-data.png

Target Encoding

Target encoding, also known as mean encoding, is a very effective way to represent a categorical column while occupying only a single feature dimension. Each value in the column is replaced by the average target value for that category, which directly expresses the relationship between the categorical variable and the target variable.
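
A minimal sketch of target encoding with pandas; the column names and data are illustrative:

```python
# A minimal sketch of target (mean) encoding with pandas.
import pandas as pd

df = pd.DataFrame({
    "city":    ["A", "A", "B", "B", "B", "C"],
    "clicked": [1, 0, 1, 1, 0, 0],  # binary target
})

# Replace each city with the average target value observed for that city.
df["city_target_enc"] = df.groupby("city")["clicked"].transform("mean")
print(df)
```

In practice the per-category means are usually smoothed and computed out-of-fold, because encoding a row with statistics that include its own target value leaks information.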

Extended reading on target encoding: "Introduction to Target Encoding"

Hash Encoding

Hash encoding relies on the hash function that everyone has often heard of. A hash function is a deterministic function that maps a potentially unbounded integer to a finite integer range [1, m].

If a categorical feature has a huge number of distinct values, one-hot encoding produces a very long encoding. With hash encoding, no matter how many distinct values the feature has, it is converted into a fixed-length encoding.
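
A minimal sketch of hash encoding; the bucket count m = 8 and the example values are illustrative:

```python
# A minimal sketch of hash encoding: map any category value into one of
# m fixed buckets, no matter how many distinct values the feature has.
import hashlib

def hash_bucket(value: str, m: int = 8) -> int:
    # A deterministic hash keeps the mapping stable across runs
    # (Python's built-in hash() is randomized per process).
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % m

for v in ["user_12345", "user_67890", "203.0.113.7"]:
    print(v, "->", hash_bucket(v))
```

Note that different values can collide in the same bucket, which is exactly why hashed features are hard to interpret and accuracy is harder to guarantee.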

Bin-Counting

The idea behind bin counting is a bit more involved: instead of using the value of a categorical variable itself as a feature, it uses the conditional probability of the target variable given that value.

In other words, we do not encode the categorical value directly; we compute statistics relating the value of the categorical variable to the target variable we want to predict.
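
A minimal sketch of bin counting with pandas; the column names and the fraud example are illustrative:

```python
# A minimal sketch of bin counting: replace each category with statistics
# of the target conditioned on that category, rather than the raw value.
import pandas as pd

df = pd.DataFrame({
    "ip":       ["a", "a", "a", "b", "b", "c"],
    "is_fraud": [1, 1, 0, 0, 0, 1],
})

# N(ip) and P(is_fraud = 1 | ip) become the new features.
stats = df.groupby("ip")["is_fraud"].agg(ip_count="count", ip_fraud_rate="mean")
df = df.merge(stats, left_on="ip", right_index=True)
print(df)
```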

Summary of the advantages and disadvantages of different encodings

One-Hot Encoding

Advantages:

  1. Easy to implement
  2. Potentially the most accurate
  3. Can be used for online learning

Things to note:

  1. Computationally inefficient
  2. Cannot adapt to growing categories
  3. Only suitable for linear models
  4. Requires large-scale distributed optimization for large datasets

Hash Encoding

Advantages:

  1. Easy to implement
  2. Lower model training cost
  3. Easy to adapt to new categories
  4. Easy to handle rare categories
  5. Can be used for online learning

Things to note:

  1. Only suitable for linear models or kernel methods
  2. Hashed features are not interpretable
  3. Accuracy is hard to guarantee

Bin-Counting

Advantages:

  1. Minimal computational burden at training time
  2. Can be used for tree-based models
  3. Easy to adapt to new categories
  4. Handles rare categories with back-off or a count-min sketch
  5. Interpretable

Things to note:

  1. Requires historical data
  2. Updates are delayed, so it is not fully suitable for online learning
  3. Very likely to cause data leakage

The above content is adapted from the book "Mastering Feature Engineering" (the Chinese edition of "Feature Engineering for Machine Learning").

Final Thoughts

Categorical features are discrete, while numerical features are continuous.

For small categorical features, commonly used encoding methods are:

  1. Ordinal Encoding (Natural Number / Sequence Encoding)
  2. One-Hot Encoding
  3. Dummy Encoding
  4. Effect Encoding

For large categorical features, commonly used encoding methods are:

  1. Target Encoding
  2. Hash Encoding
  3. Bin-Counting

Recommended articles:

"Machine learning category feature processing"

"Feature Engineering: Category Features"