Feature engineering is a key part of the machine learning workflow: it "translates" raw data into a form the model can understand.
This article introduces the basic concepts and importance of feature engineering, along with a 4-step method for evaluating its performance.
The importance of feature engineering
Most people have heard two classic quotes from the American computer scientist Peter Norvig:
A simple model based on a large amount of data is better than a complex model based on a small amount of data.
This quote illustrates the importance of the amount of data.
More data beats clever algorithms, but better data beats more data.
This quote is about the importance of feature engineering.
So the job of feature engineering is to extract the greatest possible value from the data we are given.
A 2016 survey found that data scientists spend roughly 80% of their time acquiring, cleaning, and organizing data, and less than 20% building machine learning pipelines. The breakdown is as follows:
- Building training sets: 3%
- Cleaning and organizing data: 60%
- Collecting data sets: 19%
- Mining data for patterns: 9%
- Refining algorithms: 5%
- Other: 4%
PS: Cleaning and organizing data is also the task data scientists find "most annoying". Readers interested in the details can read the original article:
Data source: "Data Scientists Spend Most of Their Time Cleaning Data"
What is feature engineering
Let's first look at where feature engineering sits in the machine learning process:
As the figure above shows, feature engineering sits between the raw data and the features. Its task is to "translate" the raw data into features.
Features: the numerical representation of raw data, in a form that a machine learning model can consume directly.
Feature engineering is a process that transforms data into features that can better represent business logic, thereby improving the performance of machine learning.
This may be hard to grasp at first. In fact, feature engineering is a lot like cooking:
We buy ingredients, wash and chop them, and then cook them to our own taste to produce a delicious meal.
In the above example:
The ingredients are like the raw data
Washing, chopping, and cooking are like feature engineering
The delicious meal at the end is the features
Humans eat processed food because it is safer and tastier. A machine learning model is similar: raw data cannot be fed to the model directly; it must be cleaned, organized, and transformed before we obtain features the model can digest.
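A minimal sketch of this "raw data → features" translation, using an illustrative toy dataset (the column names and values are assumptions, not from the article). A model cannot consume a string like "London" directly, so the categorical column is converted into one-hot (0/1) numeric columns:

```python
# Illustrative toy records: raw data mixing text and numbers.
raw_records = [
    {"city": "London", "age": 34},
    {"city": "Paris", "age": 28},
    {"city": "London", "age": 45},
]

# "Translate" the city column into one-hot numeric columns.
cities = sorted({r["city"] for r in raw_records})  # ['London', 'Paris']

features = [
    [r["age"]] + [1 if r["city"] == c else 0 for c in cities]
    for r in raw_records
]
print(features)  # [[34, 1, 0], [28, 0, 1], [45, 1, 0]]
```

Each record is now a purely numeric vector the model can use directly; libraries such as pandas or scikit-learn provide the same transformation at scale.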
In addition to converting raw data into features, there are two important points that are easily overlooked:
Key 1: Better representation of business logic
Feature engineering can be seen as a mathematical expression of business logic.
We use machine learning to solve specific business problems. The same raw data can be turned into features in many ways; we should choose the ways that best represent the business logic and thus solve the problem better, rather than simply the easiest ones.
Key 2: Improve machine learning performance
Better performance means shorter time and lower cost. Even the same model will perform differently under different feature engineering, so we should choose the feature engineering that yields the best performance.
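To illustrate how the same model can behave differently under different feature engineering, here is a sketch with a tiny 1-nearest-neighbour classifier on an assumed toy dataset (all names and values are illustrative): one raw feature (income) dwarfs the other (age), and rescaling the features changes the model's answer.

```python
import math

# Toy training data: ([age, income], label). Income dominates raw distances.
train = [([25, 30000], "A"), ([60, 32000], "B")]
query = [58, 30500]  # intuitively closest to the 60-year-old "B" sample

def nearest_label(samples, point):
    """Return the label of the training sample nearest to `point`."""
    return min(samples, key=lambda s: math.dist(s[0], point))[1]

# With raw features, the income gap swamps the age gap: the model says "A".
print(nearest_label(train, query))  # A

# Feature engineering: rescale each column to [0, 1] so both features matter.
def scale(samples, point):
    cols = list(zip(*[s[0] for s in samples]))
    lo = [min(c) for c in cols]
    span = [max(c) - min(c) for c in cols]
    norm = lambda v: [(x - l) / s for x, l, s in zip(v, lo, span)]
    return [(norm(f), y) for f, y in samples], norm(point)

scaled_train, scaled_query = scale(train, query)
print(nearest_label(scaled_train, scaled_query))  # B
```

Same model, same raw data, two feature representations, two different predictions: this is why choosing feature engineering carefully matters.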
4 steps to evaluate feature engineering performance
Evaluating feature engineering against the business is very important, but methods vary widely from business to business.
Here we introduce only the performance evaluation method, which is relatively general:
- Before applying any feature engineering, obtain the baseline performance of the machine learning model
- Apply one or more feature engineering techniques
- For each technique, obtain a performance metric and compare it with the baseline
- If the performance gain exceeds a chosen threshold, the feature engineering is considered beneficial and is applied in the machine learning pipeline
For example: if the baseline accuracy is 40% and applying a certain feature engineering technique raises it to 76%, the relative change is 90%:
(76% − 40%) / 40% = 90%
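The four steps and the example above can be sketched as a small helper. The threshold value below is an illustrative assumption; in practice it is chosen per business:

```python
def relative_improvement(baseline: float, new_score: float) -> float:
    """Relative change of a performance metric (e.g. accuracy) over the baseline."""
    return (new_score - baseline) / baseline

# The article's example: baseline accuracy 40%, accuracy 76%
# after applying a feature engineering technique.
change = relative_improvement(0.40, 0.76)
print(f"{change:.0%}")  # 90%

# Step 4: keep the feature engineering only if the gain clears a threshold.
THRESHOLD = 0.05  # illustrative assumption: require at least a 5% relative gain
apply_it = change > THRESHOLD
print(apply_it)  # True
```

Running the same comparison for each candidate technique against one fixed baseline makes the results directly comparable.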
Final Thoughts
Feature engineering is the most time-consuming work in the machine learning process, and also one of the most important.
Definition of feature engineering: the process of transforming data into features that better represent the business logic, thereby improving machine learning performance.
Two easily overlooked key points of feature engineering:
- Better representation of business logic
- Improve machine learning performance
The 4 steps of feature engineering performance evaluation:
- Before applying any feature engineering, obtain the baseline performance of the machine learning model
- Apply one or more feature engineering techniques
- For each technique, obtain a performance metric and compare it with the baseline
- If the performance gain exceeds a chosen threshold, the feature engineering is considered beneficial and is applied in the machine learning pipeline