Numerical features are the most common feature type, and numerical values can be fed directly to an algorithm.
To get better results, though, numerical features usually need some processing first. This article introduces four common processing methods: missing value processing, binarization, bucketing, and scaling.
What is a numerical feature?
Numerical features are features that can actually be measured. For example:
- A person's height, weight, and body measurements
- The number of visits to a product, the number of times it was added to a shopping cart, and its final sales volume
- The number of new users and returning users among logged-in users
Numerical features can be fed directly to an algorithm, so why do they need processing?
Because good numerical features not only reveal the information hidden in the data but are also consistent with the model's assumptions. A suitable numerical transformation can improve results.
For example, linear regression and logistic regression are very sensitive to the magnitude of the values, so the features need to be scaled.
For numerical features, we mainly focus on two points: size and distribution.
The four processing methods described below are all optimized around these two points.
Four common processing methods for numerical features
- Missing value processing
- Binarization
- Bucketing (binning)
- Scaling
Missing value processing
In real problems, we often encounter missing data. Missing values can have a large impact on model performance, so they need to be handled according to the actual situation.
There are three commonly used processing methods for missing values:
- Fill in missing values (mean, median, model prediction...)
- Delete rows with missing values
- Leave them as they are and feed the missing values to the model as part of the feature, so the model learns from them
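The three options above can be sketched with NumPy. This is a minimal illustration rather than a recommended recipe: the `ages` column is made-up data, the median is only one of the possible fill values, and the indicator column in option 3 is an assumed way of making missingness explicit for models that cannot consume NaN directly.

```python
import numpy as np

# Hypothetical feature column: user ages with two missing entries (NaN).
ages = np.array([23.0, 35.0, np.nan, 41.0, 29.0, np.nan, 52.0])
missing = np.isnan(ages)

# Option 1: fill missing values with the median of the observed values.
median_age = np.nanmedian(ages)
filled = np.where(missing, median_age, ages)

# Option 2: delete the rows that contain missing values.
dropped = ages[~missing]

# Option 3: keep the gaps and let the model see them. For a model that
# cannot take NaN as input, one assumed approach is to add a 0/1 column
# that marks which rows were missing.
indicator = missing.astype(int)
```

Which option fits depends on how much data is missing and whether the missingness itself carries information.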
Binarization
Binarization is usually used in counting scenarios, such as the number of visits or the number of times a song has been played.
For example, suppose we want to predict which songs are more popular based on users' listening data.
Suppose most people listen fairly evenly and keep moving on to new songs, but one user plays the same obscure song 24 hours a day, which gives that song an exceptionally high total play count. Feeding the total play count to the model would mislead it. This is where binarization is needed.
However many times a user has listened to a song, it is counted only once. This makes it possible to find the songs that most users like and recommend them.
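The song example can be sketched as follows. The `listen_counts` matrix is made-up data (rows are users, columns are songs), and the variable names are chosen for illustration only.

```python
import numpy as np

# Hypothetical play counts: rows are users, columns are songs.
listen_counts = np.array([
    [3, 0, 1],
    [0, 2, 0],
    [500, 0, 0],  # one user playing the same song on repeat
])

# Binarize: any count above zero becomes 1, everything else stays 0.
listened = (listen_counts > 0).astype(int)

# Raw totals are dominated by the single heavy user.
raw_popularity = listen_counts.sum(axis=0)

# After binarization, each song's score is its number of distinct listeners.
user_popularity = listened.sum(axis=0)
```

With the raw counts the first song looks hundreds of times more popular than the others; after binarization it has only two listeners against one each for the other songs.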
Bucketing
Take personal income as an example. Most people's income is not high, while a very small number of people have extremely high incomes, so the distribution is very uneven. Some people have a monthly income of 3,000 while others earn far more, a gap of several orders of magnitude.
A feature like this is very unfriendly to the model. Bucketing handles the situation by dividing a numerical feature into intervals and treating each interval as a single unit.
Features that are often bucketed include:
- Age distribution
- Commodity price distribution
- Income distribution
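As a minimal sketch of dividing a feature into intervals, the snippet below buckets ages with `np.digitize`. The ages and the boundaries 18, 35, and 65 are assumptions picked for illustration, not values from this article.

```python
import numpy as np

# Hypothetical ages and hand-picked bucket boundaries.
ages = np.array([5, 17, 18, 34, 35, 64, 65, 90])
boundaries = [18, 35, 65]

# np.digitize returns, for each value, the index of the interval it falls in:
# 0 -> below 18, 1 -> 18 to 34, 2 -> 35 to 64, 3 -> 65 and above.
age_bucket = np.digitize(ages, boundaries)
```

The model then sees the bucket index instead of the raw age, so every value inside one interval is treated the same way.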
Commonly used bucketing methods:
- Quantile bucketing (for example, the price ranges recommended by Taobao: 30% of users choose the cheapest price range, 60% of users choose the medium price range, and 9% of users choose the most expensive price range)
- Using a model to find the best buckets
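Quantile bucketing can be sketched by taking the bucket edges from the data itself. The incomes below are made-up values with one extreme outlier, and the choice of quartiles is an assumption; any set of quantiles works the same way.

```python
import numpy as np

# Hypothetical monthly incomes with one extreme value.
incomes = np.array([2000, 2500, 3000, 3200, 3500, 4000, 5000, 8000, 300000])

# Use the quartiles of the data as bucket edges, so each bucket
# holds roughly the same number of samples.
edges = np.quantile(incomes, [0.25, 0.5, 0.75])

# Map each income to the index of its quartile bucket (0 to 3).
income_bucket = np.digitize(incomes, edges)
```

Because the edges follow the distribution, the extreme income simply lands in the top bucket instead of stretching the scale for everyone else.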
Scaling
Linear regression and logistic regression are very sensitive to the magnitude of the values, and large differences in scale between features seriously hurt results. Features of different magnitudes therefore need to be normalized, that is, scaled into the same fixed range (for example, 0~1 or -1~1).
Commonly used normalization methods:
- z-score standardization
- min-max normalization
- Row normalization
- Variance scaling
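The four methods can be sketched with NumPy as below. The matrix `X` is made-up data. Two of the definitions are assumptions, because the terms are used in more than one way: row normalization is taken here to mean dividing each row by its L2 norm, and variance scaling is taken to mean dividing each column by its standard deviation without centering.

```python
import numpy as np

# Hypothetical data: rows are samples, columns are age and monthly income.
X = np.array([
    [20.0, 3000.0],
    [30.0, 5000.0],
    [40.0, 10000.0],
])

# z-score: subtract each column's mean and divide by its standard deviation.
z_score = (X - X.mean(axis=0)) / X.std(axis=0)

# min-max: map each column onto the range 0~1.
col_min, col_max = X.min(axis=0), X.max(axis=0)
min_max = (X - col_min) / (col_max - col_min)

# Row normalization (assumed: L2 norm): scale each row to unit length.
row_norm = X / np.linalg.norm(X, axis=1, keepdims=True)

# Variance scaling (assumed: no centering): give each column unit variance.
var_scaled = X / X.std(axis=0)
```

After z-score or min-max scaling, age and income sit in comparable ranges, so the income column no longer dominates purely because its raw values are larger.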