Categorical features are an important class of features: they are discrete rather than continuous.
This article introduces the mainstream encoding methods for small- and large-scale categorical features, along with their respective advantages and disadvantages.
What are categorical features?
Categorical features represent categories. Unlike numerical features, which are continuous, categorical features are discrete.
For example:
- Gender
- city
- Colour
- IP address
- User's account ID
Some categorical features are numeric in form, such as account IDs and IP addresses, but their values are not continuous.
Continuous numbers are numerical features; discrete numbers are categorical features.
For more on continuous versus discrete data, see: "Understanding of continuous and discrete"
Encoding small-scale categorical features
Ordinal Encoding (natural number / sequence encoding)
Some categories have an inherent order; in that case, simple natural-number encoding can be used.
For example, degree:
Bachelor-0
Master-1
Ph.D-2
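As a minimal sketch, this is how the ordinal mapping above could look in pandas (the column name and mapping are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({"degree": ["Bachelor", "Ph.D", "Master", "Bachelor"]})

# Map each level to a natural number that preserves the order.
degree_order = {"Bachelor": 0, "Master": 1, "Ph.D": 2}
df["degree_encoded"] = df["degree"].map(degree_order)

print(df)
#      degree  degree_encoded
# 0  Bachelor               0
# 1      Ph.D               2
# 2    Master               1
# 3  Bachelor               0
```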
One-Hot Encoding
Features such as city, color, brand, and material are not suited to natural-number encoding, because they have no ordering relationship.
One-hot encoding puts the different categories on an "equal footing", so the magnitude of the encoded value does not bias the model.
For example, color (assuming there are only 3 colors):
Red-100
Yellow-010
Blue-001
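A minimal pandas sketch of one-hot encoding for this color example (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Yellow", "Blue", "Red"]})

# Each color becomes its own 0/1 column, so no category outranks another.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)
#    color_Blue  color_Red  color_Yellow
# 0           0          1             0
# 1           0          0             1
# 2           1          0             0
# 3           0          1             0
```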
Similar to one-hot encoding, there are also "Dummy Encoding" and "Effect Encoding".
Their implementations are similar, with slight differences, and they suit different scenarios; a short sketch of dummy encoding follows the links below.
Those who are interested can read these articles:
"The difference between dummy variables and one-hot encoding"
"Assignment method: effect coding"
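For reference, a minimal sketch of dummy encoding with pandas, using the same color example: one level is dropped as the reference category, so 3 colors need only 2 columns.

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Yellow", "Blue"]})

# drop_first=True drops one reference level (here "Blue", the first
# alphabetically), which distinguishes dummy encoding from plain one-hot.
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True, dtype=int)
print(dummies)
#    color_Red  color_Yellow
# 0          1             0
# 1          0             1
# 2          0             0
```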
Encoding large-scale categorical features
Target Encoding
Target encoding, also known as mean encoding, is a very effective way to represent a categorical column while occupying only a single feature's worth of space. Each value in the column is replaced by the average target value for that category, which directly expresses the relationship between the categorical variable and the target variable.
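A minimal sketch of target encoding with pandas, assuming an illustrative binary target called clicked:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "clicked": [1, 0, 1, 1, 0, 0],  # illustrative binary target
})

# Replace each city with the mean target value observed for that city.
city_means = df.groupby("city")["clicked"].mean()
df["city_target_enc"] = df["city"].map(city_means)
print(df[["city", "city_target_enc"]])
#   city  city_target_enc
# 0    A         0.500000
# 1    A         0.500000
# 2    B         0.666667
# 3    B         0.666667
# 4    B         0.666667
# 5    C         0.000000
```

In practice the per-category means are usually smoothed and computed out-of-fold, to avoid leaking the target into the features.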
Extended reading on target encoding: "Introduction to Target Encoding"
Hash Encoding
A hash function is the same hashing concept you often hear about: a deterministic function that maps a potentially unbounded integer to a finite integer range [1, m].
If a category has a very large number of distinct values, one-hot encoding produces very long vectors. With hash encoding, no matter how many distinct values the category has, it is converted into a fixed-length encoding.
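A minimal sketch of the idea, using an illustrative helper that maps any value into m buckets (indexed from 0 here rather than from 1):

```python
import hashlib

def hash_bucket(value: str, m: int = 8) -> int:
    # A deterministic hash (unlike Python's built-in hash(), md5 is
    # stable across runs), reduced to a fixed range of m buckets.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % m

# However many distinct values appear, each lands in one of m buckets.
for ip in ["203.0.113.7", "198.51.100.23", "192.0.2.1"]:
    print(ip, "->", hash_bucket(ip))
```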
Bin-Counting
The idea behind bin counting is a bit more involved: rather than using the value of the categorical variable itself as a feature, it uses the conditional probability of the target variable given that value.
In other words, we do not encode the categorical value directly; instead, we compute statistics of the relationship between the categorical value and the target variable we want to predict.
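A minimal sketch of bin counting with pandas, again assuming an illustrative binary target:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2", "u3"],
    "clicked": [1, 0, 1, 1, 0, 0],
})

# For each category, count positives and totals, then form P(clicked | category).
stats = df.groupby("user_id")["clicked"].agg(["sum", "count"])
stats["p_click"] = stats["sum"] / stats["count"]
print(stats)
#          sum  count   p_click
# user_id
# u1         1      2  0.500000
# u2         2      3  0.666667
# u3         0      1  0.000000
```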
Summary of the advantages and disadvantages of different encodings
One-Hot Encoding
Advantages:
- Easy to implement
- Potentially the most accurate
- Can be used for online learning
Things to note:
- Computationally inefficient
- Does not adapt to growing categories
- Only suitable for linear models
- Requires large-scale distributed optimization for large datasets
Hash Encoding
Advantages:
- Easy to implement
- Makes model training cheaper
- Easily adapts to new categories
- Easily handles rare categories
- Can be used for online learning
Things to note:
- Only suitable for linear models or kernel methods
- Hashed features are not interpretable
- Accuracy is hard to guarantee
Bin-Counting
Advantages:
- Smallest computational burden at training time
- Can be used with tree-based models
- Easily adapts to new categories
- Handles rare categories with back-off or a count-min sketch
- Interpretable
Things to note:
- Requires historical data
- Requires delayed updates, so not fully suited to online learning
- Prone to data leakage
The content above is adapted from the book "Feature Engineering for Machine Learning"
Final Thoughts
Categorical features are discrete, while numerical features are continuous.
For small-scale categorical features, commonly used encoding methods are:
- Ordinal Encoding (natural number / sequence encoding)
- One-Hot Encoding
- Dummy Encoding
- Effect Encoding
For large-scale categorical features, commonly used encoding methods are:
- Target Encoding
- Hash Encoding
- Bin-Counting