
Categorical Features

https://easyai.tech/wp-content/uploads/2022/08/39995-2021-03-30-typefeature.png

Categorical features are an important class of features. They are discrete rather than continuous.

This article introduces 5 mainstream encoding methods for small and large categorical features, along with their respective advantages and disadvantages.

 

What are categorical features?

Categorical features represent categories. Unlike numerical features, which are continuous, categorical features are discrete.

For example:

  • Gender
  • City
  • Color
  • IP address
  • User's account ID

https://easyai.tech/wp-content/uploads/2022/08/d2797-2021-03-30-lisan.png

Some categorical features also take numerical values, such as account IDs and IP addresses, but these values are not continuous.

Continuous numbers are numerical features, and discrete numbers are categorical features.

For an explanation of continuous vs. discrete, you can read this article: "Understanding of continuous and discrete"

Encoding small categorical features

https://easyai.tech/wp-content/uploads/2022/08/5345c-2021-03-30-small-data.png

Natural Number Encoding/Sequence Encoding-Ordinal Encoding

Some categories have a natural order; in this case, simple natural number encoding can be used.

For example, level of education:

Bachelor-0

Master-1

Ph.D-2
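
A minimal sketch of ordinal encoding (assuming a hypothetical "degree" column in a pandas DataFrame; the data is made up):

```python
import pandas as pd

# A hypothetical "degree" column (made-up data).
df = pd.DataFrame({"degree": ["Bachelor", "Ph.D", "Master", "Bachelor"]})

# An explicit mapping preserves the natural order: Bachelor < Master < Ph.D.
degree_order = {"Bachelor": 0, "Master": 1, "Ph.D": 2}
df["degree_encoded"] = df["degree"].map(degree_order)
print(df)
```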

One-Hot Encoding

Features such as city, color, brand, and material are not suitable for natural number encoding, because they have no ordering relationship.

One-hot encoding puts different categories on an "equal footing", so the magnitude of the encoded value does not bias the model.

For example, color classification (assuming there are only 3 colors):

Red-100

Yellow-010

Blue-001
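
A minimal sketch of one-hot encoding with pandas, using a made-up "color" column:

```python
import pandas as pd

# A made-up "color" column with three possible values.
df = pd.DataFrame({"color": ["red", "yellow", "blue", "red"]})

# Each color becomes its own 0/1 column, so no category carries a larger "value" than another.
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot)
```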

Similar to one-hot encoding are "Dummy Encoding" and "Effect Encoding".

Their implementations are close, with some slight differences, and they suit different scenarios.

Those who are interested can read these articles:

"The difference between dummy variables and one-hot encoding"

"Assignment method: effect coding"

Encoding large categorical features

https://easyai.tech/wp-content/uploads/2022/08/f340a-2021-03-30-big-data.png

Target Encoding

Target encoding, also known as mean encoding, is a very effective way to represent a categorical column while occupying only a single feature's worth of space. Each value in the column is replaced by the average target value for that category, which directly expresses the relationship between the categorical variable and the target variable.
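
A minimal sketch of target (mean) encoding; the "city" feature, the "clicked" target, and the data are all made up for illustration:

```python
import pandas as pd

# A toy "city" feature and a binary target (made-up data).
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "clicked": [1, 0, 1, 1, 0, 0],
})

# Replace each city with the mean target value of that city.
city_means = df.groupby("city")["clicked"].mean()
df["city_encoded"] = df["city"].map(city_means)
print(df)

# Note: in practice the means should be computed on training data only,
# often with smoothing or cross-fold schemes, to limit target leakage.
```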

Extended reading on target encoding: "Introduction to Target Encoding"

Hash encoding

The hash function here is the familiar one from computer science: a deterministic function that maps a potentially unbounded input to a finite integer range [1, m].

If a categorical feature has a huge number of distinct values, one-hot encoding produces extremely long vectors. With hash encoding, no matter how many distinct values the feature has, each value is converted into a fixed-length encoding.
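
A minimal sketch of hash encoding using scikit-learn's FeatureHasher; the user IDs and the output width of 8 are arbitrary choices for illustration:

```python
from sklearn.feature_extraction import FeatureHasher

# Hash a high-cardinality ID feature into a fixed-length vector of 8 values;
# the IDs and the width m = 8 are made up for this example.
hasher = FeatureHasher(n_features=8, input_type="string")
user_ids = [["user_19348"], ["user_7"], ["user_882341"]]

# However many distinct IDs exist, every row maps to the same 8-dimensional space.
hashed = hasher.transform(user_ids)
print(hashed.toarray())
```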

Bin-Counting

The idea behind bin counting is a bit more involved: instead of using the value of the categorical variable itself as a feature, it uses statistics such as the conditional probability of the target given that value.

In other words, we do not encode the categorical value directly; instead, we compute statistics describing the relationship between that value and the target variable to be predicted.
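
A minimal sketch of bin counting, using a made-up "ad_id" feature and a click target; in practice these statistics would be computed on historical data:

```python
import pandas as pd

# A toy "ad_id" categorical feature and a binary click target (made-up data).
df = pd.DataFrame({
    "ad_id": [101, 101, 101, 202, 202, 303],
    "click": [1, 0, 1, 0, 0, 1],
})

# Instead of encoding ad_id itself, compute target statistics per category value.
stats = (
    df.groupby("ad_id")["click"]
      .agg(click_rate="mean", impressions="count")  # P(click | ad_id) and how often it was seen
      .reset_index()
)
df = df.merge(stats, on="ad_id", how="left")
print(df)
```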

Summary of the advantages and disadvantages of different encodings

One-Hot Encoding

Advantages:

  1. Easy to implement
  2. Potentially the most accurate
  3. Can be used for online learning

Disadvantages:

  1. Computationally inefficient
  2. Cannot adapt to a growing number of categories
  3. Only applicable to linear models
  4. Requires large-scale distributed optimization for large data sets

Hash encoding

Advantages:

  1. Easy to implement
  2. Makes model training cheaper
  3. Easily adapts to new categories
  4. Easily handles rare categories
  5. Can be used for online learning

Disadvantages:

  1. Only suitable for linear models or kernel methods
  2. Hashed features are not interpretable
  3. Accuracy is hard to guarantee

Bin-Counting

Advantages:

  1. Smallest computational burden at training time
  2. Can be used with tree-based models
  3. Easy to adapt to new categories
  4. Rare categories can be handled with back-off or a count-min sketch
  5. Interpretable

Disadvantages:

  1. Requires historical data
  2. Requires delayed updates, so not fully suitable for online learning
  3. High potential for data leakage

The above content is taken from the book "Feature Engineering for Machine Learning".

Final Thoughts

Categorical features are discrete features, and numerical features are continuous.

For small categorical features, commonly used encoding methods are:

  1. Natural Number Encoding / Ordinal Encoding
  2. One-Hot Encoding
  3. Dummy Encoding
  4. Effect Encoding

For large categorical features, commonly used encoding methods are:

  1. Target Encoding
  2. Hash encoding
  3. Bin-Counting

Recommended articles:

"Machine learning category feature processing"

"Feature Engineering: Category Features"

Numerical features

https://easyai.tech/wp-content/uploads/2022/08/c3a87-2021-03-21-datafeature.png

Numerical features are the most common feature type, and numerical values can be fed to an algorithm directly.
To improve results, however, we usually still need to process numerical features. This article introduces 4 common processing methods: missing value handling, binarization, bucketing, and scaling.

What is a numerical feature?

https://easyai.tech/wp-content/uploads/2022/08/5f1f1-2021-03-21-keceliang.png

Numerical features are features that can actually be measured. For example:

  • A person's height, weight, and body measurements
  • The number of times a product was viewed, added to the cart, and finally sold
  • How many of the logged-in users are new users vs. returning users

 

If numerical features can be fed to an algorithm directly, why do we need to process them at all?

Because good numerical features not only reveal the information hidden in the data, but also match the model's assumptions. Proper numerical transformations can therefore improve results.

For example, linear regression and logistic regression are very sensitive to the magnitude of values, so the features need to be scaled.

https://easyai.tech/wp-content/uploads/2022/08/8a714-2021-03-21-2points.png

For numerical features, we mainly focus on 2 aspects:

  1. Magnitude
  2. Distribution

The four processing methods below all optimize around magnitude and distribution.

 

4 common processing methods for numerical features

https://easyai.tech/wp-content/uploads/2022/08/e1ef8-2021-03-21-4method.png

  1. Missing value handling
  2. Binarization
  3. Bucketing / binning
  4. Scaling

 

Missing value handling

In real problems we often encounter missing data. Missing values can have a significant impact on performance, so they need to be handled according to the situation.

There are three commonly used processing methods for missing values:

  1. Fill in missing values (mean, median, model prediction...)
  2. Delete rows containing missing values
  3. Leave them as-is and feed the missing values to a model that can learn from them directly
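
A minimal sketch of the three strategies above, assuming a toy "age" column with made-up values:

```python
import numpy as np
import pandas as pd

# A toy "age" column with missing values (made-up data).
df = pd.DataFrame({"age": [23.0, np.nan, 31.0, 45.0, np.nan]})

# 1. Fill missing values (here the median; the mean or a model prediction also works).
filled = df["age"].fillna(df["age"].median())

# 2. Delete rows that contain missing values.
dropped = df.dropna(subset=["age"])

# 3. Leave the NaNs in place and use a model that can treat missingness as information
#    (e.g. many gradient-boosting implementations accept NaN inputs directly).

print(filled)
print(dropped)
```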

 

Binarization

This processing method is usually used in counting scenarios, such as: the number of visits, the number of times a song has been listened to...

Example:

Predict which songs are more popular based on users' listening data.

Suppose most people spread their listening fairly evenly and keep discovering new songs, but one user plays the same very niche song 24 hours a day, giving that song an abnormally high total play count. Feeding the raw total play count to the model would mislead it. This is where "binarization" comes in.

No matter how many times the same user has listened to the same song, it is counted only once, which makes it easier to surface songs that many different people like and recommend them.
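
A minimal sketch of binarization for the listening example, with made-up user/song play counts:

```python
import pandas as pd

# Made-up play counts per (user, song) pair.
df = pd.DataFrame({
    "user":  ["u1", "u2", "u3", "u3"],
    "song":  ["s1", "s1", "s1", "s2"],
    "plays": [5000, 2, 7, 1],
})

# Binarize: any positive play count counts as 1 ("this user has listened to this song").
df["listened"] = (df["plays"] > 0).astype(int)

# Popularity by distinct listeners instead of raw plays, so one obsessive user cannot dominate.
popularity = df.groupby("song")["listened"].sum()
print(popularity)
```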

 

Bucketing / binning

Take personal income as an example. Most people's income is not high, while a very small number of people earn extremely high incomes, so the distribution is very uneven. Some earn a few thousand a month, while others earn amounts several orders of magnitude higher.

Such a feature is very unfriendly to the model. This situation can be handled with bucketing: dividing a numerical feature into intervals and treating each interval as a whole.

Common bucketing examples:

  1. Age distribution
  2. Product price distribution
  3. Income distribution

Commonly used bucketing methods (see the sketch after the figure below):

  1. Fixed-value buckets (for example, age ranges: 0-12, 13-17, 18-24...)
  2. Quantile buckets (for example, the price ranges recommended by Taobao: 30% of users choose the cheapest price range, 60% choose the medium price range, and 9% choose the most expensive price range)
  3. Use a model to find the best buckets

https://easyai.tech/wp-content/uploads/2022/08/c2ba0-2021-03-21-taobao-fenweishu.png
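
A minimal sketch of fixed-value and quantile bucketing with pandas, using made-up income values:

```python
import pandas as pd

# Made-up monthly incomes with a heavily skewed distribution.
incomes = pd.Series([2500, 3000, 3200, 4000, 8000, 12000, 50000, 300000])

# Fixed-value buckets: the bin edges are chosen by hand.
fixed_bins = pd.cut(incomes, bins=[0, 3000, 5000, 10000, float("inf")],
                    labels=["low", "medium", "high", "very high"])

# Quantile buckets: each bucket holds roughly the same number of samples,
# which copes better with heavily skewed distributions.
quantile_bins = pd.qcut(incomes, q=4, labels=["q1", "q2", "q3", "q4"])

print(pd.DataFrame({"income": incomes, "fixed": fixed_bins, "quantile": quantile_bins}))
```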

 

Scaling

Linear regression and logistic regression are very sensitive to the magnitude of values, and large differences between feature scales will seriously hurt results. Therefore, values of different magnitudes need to be normalized: scale different orders of magnitude into the same fixed range (for example: 0~1 or -1~1).

Commonly used normalization methods:

  1. z-score standardization
  2. min-max normalization
  3. Row normalization
  4. Variance scaling
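
A minimal sketch of z-score and min-max scaling with scikit-learn, on a made-up two-column feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A made-up feature matrix whose columns live on very different scales.
X = np.array([[1.0, 20000.0],
              [2.0, 30000.0],
              [3.0, 800000.0]])

# z-score standardization: each column gets mean 0 and standard deviation 1.
z_scaled = StandardScaler().fit_transform(X)

# min-max normalization: each column is squeezed into the range [0, 1].
minmax_scaled = MinMaxScaler().fit_transform(X)

print(z_scaled)
print(minmax_scaled)
```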

Extended reading:

"Data scaling: standardization and normalization"

"106 - All about data scaling (standardization, normalization)"

Exploratory Data Analysis | EDA

https://easyai.tech/wp-content/uploads/2022/08/d7e01-2021-03-07-edabanner.png

Exploratory data analysis is the process of taking raw data and using technical means to better understand it, extract "good features", and build a preliminary model.

This article will introduce how to classify data and how to visualize different types of data.

What is exploratory data analysis?

When it comes to basketball, everyone knows that height and wingspan are the key characteristics of athletes.

What about handball? Most people probably can't say.

When you encounter a field you are not familiar with, you need to quickly have a certain understanding of the unfamiliar field.

There are 2 ways to help us understand unfamiliar areas:

  1. Consult industry insiders. Experienced practitioners will pass on some of their experience.
  2. Study data from the unfamiliar field. We can analyze the physical and performance data of handball players to see what characterizes the best ones. Even without any industry experience, insights can be gained from the data.

https://easyai.tech/wp-content/uploads/2022/08/73047-2021-03-07-ask-eda.png

The second approach above is: Exploratory Data Analysis | EDA

Exploratory data analysis is a data analysis method and mindset that uses various technical means (mostly data visualization) to explore the internal structure and patterns of data.

The purpose of exploratory data analysis is to gain as much insight as possible into the data set, discover the internal structure of the data, extract important features, detect outliers, test basic hypotheses, and establish preliminary models.

The 3-step approach to exploratory data analysis

https://easyai.tech/wp-content/uploads/2022/08/ef6ca-2021-03-08-3steps.png

The process of exploratory data analysis is roughly divided into 3 steps:

  1. Data classification
  2. Data visualization
  3. Data insight

The first step: data classification

When we get the data, the first step is to classify the data, and then use different methods to process different types of data.

The data can be classified in the following ways from coarse to fine:

https://easyai.tech/wp-content/uploads/2022/08/24d96-2021-03-07-xifen.png

Structured data vs unstructured data

Structured data: Data that can be organized in tables is considered structured data.

For example: data in Excel, data in MySQL...

Unstructured data: data organized in a non-tabular format.

For example: text, picture, video...

 

Quantitative data vs qualitative data

Quantitative data: Numerical type, which measures the quantity of something.

For example: 1985

Qualitative data: category, describing the nature of something.

For example: post-80s

 

4 levels of data

Nominal level: the first level of data, with the weakest structure; data is only grouped by name.

For example: blood type (A, B, AB, O), name, color

Ordinal level: adds a natural ordering on top of the nominal level, so that different values can be compared.

For example: the star rating of the restaurant, the evaluation level of the company

Interval level: values must be numeric; they can not only be ordered, but also added and subtracted.

For example: temperatures in Fahrenheit or Celsius (temperatures can be negative, and multiplication and division are not meaningful)

Ratio level: adds an absolute zero point on top of the interval level, so addition and subtraction as well as multiplication and division are meaningful.

For example: money, weight

 

Step 2: Data visualization

To gain better insight, we can visualize the data in order to observe its characteristics more clearly.

There are several commonly used data visualizations:

https://easyai.tech/wp-content/uploads/2022/08/78ba6-2021-03-07-keshihua.png

The four data levels above call for different visualization methods. The table below can help you choose a suitable visualization.

These are basic visualization schemes; in practice, more complex combination charts can also be used. A small plotting sketch follows the table.

Data level, attributes, descriptive statistics, and charts:

  • Nominal: discrete, unordered; frequency/proportion, mode; bar chart, pie chart
  • Ordinal: ordered categories, comparable; frequency, mode, median, percentiles; bar chart, pie chart
  • Interval: differences between values are meaningful; frequency, mode, median, mean, standard deviation; bar chart, pie chart, box plot
  • Ratio: continuous; mean, standard deviation; bar chart, line chart, pie chart, box plot
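
A minimal sketch (with made-up data) of matching chart type to data level: a bar chart for nominal data and a box plot for ratio data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up samples: blood types are nominal, weights are ratio-level.
blood_types = pd.Series(["A", "O", "B", "O", "AB", "O", "A"])
weights = pd.Series([55, 62, 70, 68, 90, 77, 103, 59])

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
blood_types.value_counts().plot.bar(ax=axes[0], title="Blood type (nominal)")  # counts per category
weights.plot.box(ax=axes[1], title="Weight (ratio)")                           # distribution summary
plt.tight_layout()
plt.show()
```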

Step 3: Data insight

Data visualization can help us gain better insights into the data. We can more efficiently discover which data is more important, the possible relationships between different data, and which data will affect each other...

It is called exploratory data analysis precisely because there is no fixed routine, so there is little general guidance to give for this step.

Final Thoughts

Exploratory data analysis is a data analysis method and mindset that uses various technical means (mostly data visualization) to explore the internal structure and patterns of data.

The process of exploratory data analysis is roughly divided into 3 steps:

  1. Data classification
  2. Data visualization
  3. Data insight

Feature Engineering

Understanding feature engineering in one article

Feature engineering is an important part of the machine learning workflow. It "translates" the original data into a form that the model can understand.

This article introduces the basic concepts and importance of feature engineering, along with a 4-step approach to evaluating its performance.

The importance of feature engineering

Everyone has heard two classic quotes from American computer scientist Peter Norvig:

A simple model based on a large amount of data is better than a complex model based on a small amount of data.

This sentence illustrates the importance of the amount of data.

More data is better than smart algorithms, and good data is better than more data.

This sentence is about the importance of feature engineering.

Therefore, getting more value out of the data you are given is exactly what feature engineering is about.

A 2016 survey found that about 80% of a data scientist's work is spent on acquiring, cleaning, and organizing data, while less than 20% of the time goes to building the actual machine learning pipeline. Details are as follows:

80% of the work of a data scientist is spent on acquiring, cleaning and organizing data

  • Building training sets: 3%
  • Cleaning and organizing data: 60%
  • Collecting data sets: 19%
  • Mining data for patterns: 9%
  • Refining algorithms: 5%
  • Other: 4%

PS: Cleaning and organizing data is also the task data scientists find "most annoying". Those interested can read the original article:

Data source: "Data Scientists Spend Most of Their Time Cleaning Data"

What is feature engineering

Let's first take a look at the position of feature engineering in the machine learning process:

The place of feature engineering in the machine learning process

As the figure above shows, feature engineering sits between the raw data and the features. Its task is to "translate" the raw data into features.

Features: a numerical representation of the raw data, in a form that a machine learning model can use directly.

Feature engineering is a process that transforms data into features that can better represent business logic, thereby improving the performance of machine learning.

This may sound abstract. In fact, feature engineering is very similar to cooking:

We buy the ingredients, wash and chop them, and then cook them to our own taste to make a delicious meal.

Feature engineering is very similar to cooking

In the above example:

Ingredients are like raw data

The processes of cleaning, cutting vegetables, and cooking are like feature engineering

The delicious meal made at the end is the feature

Humans eat processed food because it is safer and tastier. A machine learning model is similar: raw data usually cannot be fed to the model directly; it needs to be cleaned, organized, and transformed before it becomes features the model can digest.

In addition to converting raw data into features, there are two important points that are easily overlooked:

Key 1: Better representation of business logic

Feature engineering can be said to be a mathematical expression of business logic.

We use machine learning to solve specific business problems. There are many ways to turn the same raw data into features; we should choose the ones that "better represent the business logic" and therefore solve the problem better, rather than simply the easiest ones.

Key 2: Improve machine learning performance

Performance here means shorter time and lower cost. Even the same model will perform differently with different feature engineering, so we should choose the feature engineering that delivers better performance.

4 steps to evaluate feature engineering performance

The business evaluation of feature engineering is very important, but there are various methods, and different businesses have different evaluation methods.

Only the performance evaluation method is introduced here, which is relatively general.

4 steps to evaluate feature engineering performance

  1. Before applying any feature engineering, obtain the baseline performance of the machine learning model
  2. Apply one or more feature engineering steps
  3. For each feature engineering step, obtain a performance metric and compare it with the baseline
  4. If the performance gain exceeds a chosen threshold, the feature engineering is considered beneficial and is applied in the machine learning pipeline

For example: the baseline accuracy is 40%; after applying a certain feature engineering step, accuracy rises to 76%, so the relative change is 90%.

(76%-40%) / 40%=90%
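
The same relative-improvement calculation, as a tiny sketch using the numbers from the example above:

```python
# Baseline accuracy vs. accuracy after a feature engineering step.
baseline = 0.40
after_fe = 0.76

relative_improvement = (after_fe - baseline) / baseline
print(f"Relative improvement: {relative_improvement:.0%}")  # -> 90%
```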

Final Thoughts

Feature engineering is the most time-consuming work in the machine learning process, and it is also one of the most important tasks.

Feature engineering definition: It is a process that transforms data into features that can better represent business logic, thereby improving the performance of machine learning.

Two key points that feature engineering is easily overlooked:

  1. Better representation of business logic
  2. Improve machine learning performance

The 4 steps of feature engineering performance evaluation:

  1. Before applying any feature engineering, obtain the baseline performance of the machine learning model
  2. Apply one or more feature engineering steps
  3. For each feature engineering step, obtain a performance metric and compare it with the baseline
  4. If the performance gain exceeds a chosen threshold, the feature engineering is considered beneficial and is applied in the machine learning pipeline