To use machine learning, you only need 3 tools
To create an effective machine learning toolbox, you actually only need the following three basic tools: Feature Store, Model Store, Evaluation Store
AI on terminal devices-what I know so far
By 2022, 80% of smartphones shipped will have on-device AI features, up from 10% in 2017
Categorical features
Categorical features are an important class of features. They are discrete rather than continuous.
This article introduces 5 mainstream encoding methods for categorical features with few and with many categories, along with their respective advantages and disadvantages.
What are categorical features?
Categorical features represent categories. Unlike numerical features, which are continuous, categorical features are discrete.
such as:
- Gender
- city
- Colour
- IP address
- User's account ID
Some categorical features also take numeric values, such as account IDs and IP addresses, but these values are not continuous.
Continuous numbers are numerical features; discrete numbers are categorical features.
For an explanation of continuous vs. discrete, see this article: "Understanding of continuous and discrete"
Encoding categorical features with few categories
Natural number encoding - Ordinal Encoding
Some categories have a natural order; in that case, simple natural number encoding can be used.
For example degree:
Bachelor-0
Master-1
Ph.D-2
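As a minimal sketch (assuming a pandas DataFrame with a hypothetical `degree` column), ordinal encoding can be done with an explicit mapping:

```python
import pandas as pd

# Hypothetical data: a "degree" column with a natural order
df = pd.DataFrame({"degree": ["Bachelor", "Ph.D", "Master", "Bachelor"]})

# The order matters: Bachelor < Master < Ph.D
degree_order = {"Bachelor": 0, "Master": 1, "Ph.D": 2}
df["degree_encoded"] = df["degree"].map(degree_order)

print(df)
```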
One-Hot Encoding
Features such as city, color, brand, and material are not suited to natural number encoding, because they have no ordering relationship.
One-hot encoding puts the different categories on an "equal footing", so the magnitude of the encoded value does not bias the model.
For example, color classification (assuming there are only 3 colors):
Red - 100
Yellow - 010
Blue - 001
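A minimal sketch of one-hot encoding with pandas (the `color` column is hypothetical):

```python
import pandas as pd

# Hypothetical data: a "color" column with 3 possible values
df = pd.DataFrame({"color": ["Red", "Yellow", "Blue", "Red"]})

# One-hot encoding: each color becomes its own 0/1 column
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot)
```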
Similar to one-hot encoding, there are "Dummy Encoding" and "Effect Encoding".
The implementation is similar, but there are some slight differences, and it is applicable to different scenarios.
Those who are interested can read this article:
"The difference between dummy variables and one-hot encoding"
"Assignment method: effect coding"
Encoding categorical features with many categories
Target Encoding
Target encoding is a very effective way to represent a categorical column, and it occupies only a single feature column; it is also known as mean encoding. Each value in the column is replaced by the average target value for that category, which directly expresses the relationship between the categorical variable and the target variable.
Further reading on target encoding: "Introduction to Target Encoding"
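A minimal sketch of target (mean) encoding, assuming a hypothetical `city` column and a binary `target` column; in practice, out-of-fold encoding or smoothing is usually added to reduce the data-leakage risk mentioned later:

```python
import pandas as pd

# Hypothetical data: a categorical "city" column and a binary target
df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B", "C"],
    "target": [1, 0, 1, 1, 0, 1],
})

# Replace each category with the mean target value of that category
city_means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(city_means)
print(df)
```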
Hash encoding
Hash functions here are the same hash functions you often hear about: a hash function is a deterministic function that maps a potentially unbounded input to a finite integer range [1, m].
If a categorical feature has a very large number of distinct values, one-hot encoding produces an extremely long vector. With hash encoding, no matter how many distinct values the feature has, it is converted into a fixed-length encoding.
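A minimal hand-rolled sketch of the idea (libraries such as scikit-learn provide production feature hashers); the value strings and the bucket count `m` are just assumptions for illustration:

```python
import hashlib

def hash_encode(value: str, m: int = 8) -> int:
    """Map an arbitrary category value into one of m buckets (0..m-1)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % m

# No matter how many distinct values the feature has,
# the encoding always falls into a fixed range of m buckets.
for v in ["203.0.113.7", "198.51.100.2", "user_98765"]:
    print(v, "->", hash_encode(v))
```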
Bin-Counting
The idea behind bin counting is a bit more involved: instead of using the value of the categorical variable itself as the feature, it uses the conditional probability of the target variable given that value.
In other words, we do not encode the categorical value directly; we compute correlation statistics between the categorical value and the target variable we want to predict.
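A minimal sketch of the idea, assuming hypothetical `user_id` and `clicked` columns: the category is replaced by statistics of the target, such as the conditional click rate.

```python
import pandas as pd

# Hypothetical click data: user ID and whether an ad was clicked
df = pd.DataFrame({
    "user_id": [101, 101, 101, 205, 205, 307],
    "clicked": [1, 0, 1, 0, 0, 1],
})

# Bin counting: compute per-category statistics of the target,
# e.g. the conditional probability P(clicked | user_id)
stats = df.groupby("user_id")["clicked"].agg(["sum", "count"])
stats["click_rate"] = stats["sum"] / stats["count"]

# Use the statistic instead of the raw category value
df["click_rate"] = df["user_id"].map(stats["click_rate"])
print(df)
```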
Summary of the advantages and disadvantages of different encodings
One-Hot Encoding
Advantages:
- Easy to implement
- Potentially the most accurate
- Can be used for online learning
Things to note:
- Computationally inefficient
- Cannot adapt to categories that keep growing
- Only suitable for linear models
- Large datasets require large-scale distributed optimization
Hash encoding
Advantages:
- Easy to implement
- Lower model training cost
- Easy to adapt to new categories
- Easy to handle rare categories
- Can be used for online learning
Things to note:
- Only suitable for linear models or kernel methods
- Hashed features are not interpretable
- Accuracy is hard to guarantee
Bin-Counting
Advantages:
- Minimal computational burden at training time
- Can be used with tree-based models
- Easy to adapt to new categories
- Handles rare categories with back-off or a count-min sketch
- Interpretable
Things to note:
- Requires historical data
- Requires delayed updates, so not fully suited to online learning
- Very prone to data leakage
The above content is adapted from the book "Feature Engineering for Machine Learning".
Final Thoughts
Categorical features are discrete features, and numerical features are continuous.
For categorical features with few categories, commonly used encoding methods are:
- Natural number / ordinal encoding
- One-hot encoding
- Dummy encoding
- Effect encoding
For categorical features with many categories, commonly used encoding methods are:
- Target encoding
- Hash encoding
- Bin counting
Numerical features
Numerical features are the most common feature type, and numerical values can be directly fed to the algorithm.
In order to improve the effect, we need to do some processing on numerical features. This article introduces 4 common processing methods: missing value processing, binarization, bucketing, and scaling.
What is a numerical feature?
Numerical features are features that can actually be measured. For example:
- A person's height, weight, and body measurements
- The number of visits to the product, the number of times it was added to the shopping cart, and the final sales volume
- How many new users and returning users among the logged-in users
Since numerical features can be fed directly to an algorithm, why do we still need to process them?
Because good numerical features not only reveal the information hidden in the data, they are also consistent with the model's assumptions, and proper numerical transformations can improve results.
For example, linear regression and logistic regression are very sensitive to the size of the value, so it needs to be scaled.
For numerical features, we mainly focus on 2 points:
- Size (magnitude)
- Distribution
The four processing methods mentioned below are optimized around size and distribution.
4 common processing methods for numerical features
- Missing value processing
- Binarization
- Bucketing / binning
- Scaling
Missing value processing
In real problems we often encounter missing data. Missing values can have a large impact on performance, so they need to be handled according to the actual situation.
There are three commonly used ways to handle missing values (a small pandas sketch follows the list):
- Fill in missing values (mean, median, model prediction...)
- Delete rows with missing values
- Ignore it directly, and feed the missing value as part of the feature to the model for learning
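A minimal pandas sketch of the first two options, using a hypothetical `age` column (the third option simply leaves the NaN in place for models that can handle it):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({"age": [23, np.nan, 31, 40, np.nan]})

# Option 1: fill missing values (mean here; median or a model prediction also works)
filled = df["age"].fillna(df["age"].mean())

# Option 2: drop the rows that contain missing values
dropped = df.dropna(subset=["age"])

print(filled)
print(dropped)
```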
Binarization
This processing method is usually used in counting scenarios, such as: the number of visits, the number of times a song has been listened to...
Example:
Predict which songs are more popular based on the user’s listening music data.
Suppose most people listen to songs fairly evenly and keep moving on to new songs, but one user plays the same song 24 hours a day, and that song is very niche, so its total play count becomes extremely high. Feeding the raw play count to the model would mislead it. This is where "binarization" comes in.
If the same user has listened to the same song N times, it only counts as 1, so the model can surface songs that many people like and recommend them.
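A minimal sketch of binarization on hypothetical play-count data: any positive count becomes 1, so a single heavy listener cannot dominate the popularity signal.

```python
import pandas as pd

# Hypothetical play counts per (user, song) pair
df = pd.DataFrame({
    "user":  ["u1", "u1", "u2", "u3"],
    "song":  ["s1", "s2", "s1", "s1"],
    "plays": [480, 2, 3, 1],
})

# Binarization: 1 if the user listened to the song at all, otherwise 0
df["listened"] = (df["plays"] > 0).astype(int)
print(df)
```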
Bucketing / binning
Take personal income as an example. Most people's income is not high, while a very small number of people earn extremely high incomes, so the distribution is very uneven. Some earn around 3,000 a month, while others earn several orders of magnitude more.
Such a feature is very unfriendly to the model. Bucketing can handle this situation: it divides a numerical feature into different intervals and treats each interval as a whole.
Common bucketing:
- age distribution
- Commodity price distribution
- Income distribution
Commonly used bucketing methods:
- Fixed-value buckets (for example, age groups: 0-12, 13-17, 18-24...)
- Quantile buckets (for example, the price ranges recommended by Taobao: 30% of users choose the cheapest range, 60% choose the medium range, and 9% choose the most expensive range)
- Use a model to find the best buckets (see the sketch after this list)
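A minimal sketch of the first two bucketing methods with pandas; the income values and interval boundaries are made up for illustration:

```python
import pandas as pd

incomes = pd.Series([2800, 3200, 4500, 9000, 15000, 52000, 300000])

# Fixed-value buckets: hand-picked interval boundaries
fixed = pd.cut(incomes,
               bins=[0, 5000, 10000, 50000, float("inf")],
               labels=["low", "medium", "high", "very high"])

# Quantile buckets: each bucket holds roughly the same number of samples
quantile = pd.qcut(incomes, q=4, labels=["q1", "q2", "q3", "q4"])

print(pd.DataFrame({"income": incomes, "fixed": fixed, "quantile": quantile}))
```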
Scaling
Linear regression and logistic regression are very sensitive to the magnitude of values, and large differences between feature scales can seriously hurt results. Values of different magnitudes therefore need to be normalized, scaling them into the same fixed range (for example 0~1 or -1~1); a small scikit-learn sketch follows the list of methods below.
Commonly used normalization methods:
- z-score standardization
- min-max normalization
- Row normalization
- Variance scaling
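A minimal sketch of the first two scaling methods using scikit-learn; the feature values are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data with very different magnitudes
X = np.array([[3000.0], [5000.0], [12000.0], [300000.0]])

# z-score standardization: mean 0, standard deviation 1
z_scaled = StandardScaler().fit_transform(X)

# min-max normalization: squeeze values into [0, 1]
minmax_scaled = MinMaxScaler().fit_transform(X)

print(z_scaled.ravel())
print(minmax_scaled.ravel())
```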
Extended reading:
"Data scaling: standardization and normalization"
"106 - All about data scaling (standardization, normalization)"
The 7 steps of the data science life cycle-applying AI in business
We will delve deeper into the seven steps of the data science life cycle itself, as well as the process aspects that non-technical project leaders should understand
Lyft's Craig Martell Interview: Less Algorithms, More Applications
Are algorithms becoming less and less important? As algorithms become more commoditized, there may be fewer and fewer algorithms and more and more applications.
In addition to Kaggle, what other data science platforms are there?
This article is divided into two parts: Competitive and collaborative platforms to hone your skills, new resources to enhance specific skills
Exploratory Data Analysis | EDA
Exploratory data analysis is the process of obtaining the original data and using technical means to help oneself better understand the data, extract "good features", and establish a preliminary model.
This article will introduce how to classify data and how to visualize different types of data.
What is exploratory data analysis?
When it comes to basketball, everyone knows that height and wingspan are the key characteristics of athletes.
What about handball? I believe most people can't say.
When you encounter a field you are not familiar with, you need to quickly have a certain understanding of the unfamiliar field.
There are 2 ways to help us understand unfamiliar areas:
- Consult industry insiders. Experienced insiders will pass on some of their expertise.
- Study data from the unfamiliar field. We can take handball players' physical and performance data and analyze it to see what characterizes the best players. Even without any industry experience, discoveries can be made through data insight.
The second approach above is exploratory data analysis (Exploratory Data Analysis, EDA).
Exploratory data analysis is a data analysis method and concept that uses various technical means (most of which are data visualization) to explore the internal structure and laws of data.
The purpose of exploratory data analysis is to gain as much insight as possible into the data set, discover the internal structure of the data, extract important features, detect outliers, test basic hypotheses, and establish preliminary models.
The 3-step approach to exploratory data analysis
The process of exploratory data analysis is roughly divided into 3 steps:
- Data Classification
- data visualization
- Insight data
The first step: data classification
When we get the data, the first step is to classify the data, and then use different methods to process different types of data.
The data can be classified in the following ways from coarse to fine:
Structured data vs unstructured data
Structured data: Data that can be organized in tables is considered structured data.
For example: data in Excel, data in MySQL...
Unstructured data: data organized in a non-tabular format.
For example: text, picture, video...
Quantitative data vs qualitative data
Quantitative data: Numerical type, which measures the quantity of something.
For example: 1985
Qualitative data: category, describing the nature of something.
For example: post-80s
4 levels of data
Nominal level: the first and structurally weakest level of data; values are only labels (names) with no inherent order.
For example: blood type (A, B, AB, O), name, color
Ordinal level: the ordinal level adds a natural ordering on top of the nominal level, so different values can be compared.
For example: the star rating of a restaurant, a company's performance-review grades
Interval level: interval data must be numeric; the values can not only be ordered but also added and subtracted.
For example: temperature in Fahrenheit or Celsius (temperatures can be negative and have no absolute zero, so multiplication and division are not meaningful)
Ratio level: on top of the interval level, an absolute zero point is added, so values support not only addition and subtraction but also multiplication and division.
For example: money, weight
Step 2: Data visualization
In order to have a better insight into the data, we can visualize the data to better observe the characteristics of the data.
There are several commonly used data visualizations:
The four data levels above call for different visualization methods. The table below can help you choose a suitable visualization for each level.
These are basic visualization schemes; in practice, more complex combination charts can also be used.
| Data level | Attributes | Descriptive statistics | Charts |
|---|---|---|---|
| Nominal | Discrete, unordered | Frequency ratio, mode | Bar chart, pie chart |
| Ordinal | Ordered categories, comparable | Frequency, mode, median, percentiles | Bar chart, pie chart |
| Interval | Numeric differences are meaningful | Frequency, mode, median, mean, standard deviation | Bar chart, pie chart, box plot |
| Ratio | Continuous | Mean, standard deviation | Bar chart, line chart, pie chart, box plot |
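A minimal matplotlib/pandas sketch that matches the table: a bar chart for a nominal feature and a box plot for a ratio feature (the blood-type and weight data are made up for illustration):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: a nominal feature (blood type) and a ratio feature (weight)
df = pd.DataFrame({
    "blood_type": ["A", "B", "O", "A", "AB", "O", "O", "B"],
    "weight":     [62, 75, 80, 58, 90, 71, 69, 77],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Nominal level: frequencies -> bar chart
df["blood_type"].value_counts().plot.bar(ax=axes[0], title="Blood type (nominal)")

# Ratio level: distribution -> box plot
df["weight"].plot.box(ax=axes[1], title="Weight (ratio)")

plt.tight_layout()
plt.show()
```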
Step 3: Insight into the data
Data visualization can help us gain better insights into the data. We can more efficiently discover which data is more important, the possible relationships between different data, and which data will affect each other...
It is called exploratory data analysis precisely because there is no fixed routine, so there is little more to prescribe for this step.
Final Thoughts
Exploratory data analysis is a data analysis method and concept that uses various technical means (most of which are data visualization) to explore the internal structure and laws of data.
The process of exploratory data analysis is roughly divided into 3 steps:
- Data Classification
- data visualization
- Insight data
Feature Engineering – Feature Engineering
Feature engineering is an important part of the machine learning workflow. It "translates" the original data into a form that the model can understand.
This article will introduce the basic concepts, importance and performance evaluation of feature engineering in 4 steps.
The importance of feature engineering
Everyone has heard two classic quotes from American computer scientist Peter Norvig:
A simple model based on a large amount of data is better than a complex model based on a small amount of data.
This sentence illustrates the importance of the amount of data.
More data is better than smart algorithms, and good data is better than more data.
This sentence is about the importance of feature engineering.
Therefore, how to use the given data to exert greater data value is what feature engineering needs to do.
A 2016 survey found that 80% of a data scientist's work is spent acquiring, cleaning, and organizing data, while building the machine learning pipeline takes less than 20% of the time. The breakdown is as follows:
- Building training sets: 3%
- Cleaning and organizing data: 60%
- Collecting data sets: 19%
- Mining data for patterns: 9%
- Refining algorithms: 5%
- Other: 4%
PS: Cleaning and organizing data is also the "least enjoyable" task for data scientists. Those interested can read the original article:
Data Sources:"Data Scientists Spend Most of Their Time Cleaning Data"
What is feature engineering
Let's first take a look at the position of feature engineering in the machine learning process:
As the figure above shows, feature engineering sits between the raw data and the features. Its task is to "translate" the raw data into features.
Features: the numerical representation of the raw data, in a form that a machine learning model can use directly.
Feature engineering is the process of transforming data into features that better represent the business logic, thereby improving machine learning performance.
This may not be easy to grasp at first. In fact, feature engineering is very similar to cooking:
We buy the ingredients, wash and chop the vegetables, and then cook them to our own taste to make a delicious meal.
In the example above:
The ingredients are like the raw data.
Washing, chopping, and cooking are like feature engineering.
The finished meal is the feature.
Humans need processed food, which is safer and tastier. Machine learning models are similar: raw data cannot be fed to the model directly; it has to be cleaned, organized, and transformed before we obtain features the model can digest.
In addition to converting raw data into features, there are two important points that are easily overlooked:
Key 1: Better representation of business logic
Feature engineering can be said to be a mathematical expression of business logic.
The purpose of using machine learning is to solve specific business problems. The same raw data can be turned into features in many ways; we should choose the ones that better represent the business logic, and therefore solve the problem better, rather than simply the easiest ones.
Focus 2: Improve machine learning performance
Performance here means shorter time and lower cost. Even the same model will perform differently with different feature engineering, so we need to choose the feature engineering that yields better performance.
4 steps to evaluate feature engineering performance
The business evaluation of feature engineering is very important, but there are various methods, and different businesses have different evaluation methods.
Only the performance evaluation method is introduced here, which is relatively general.
- Before applying any feature engineering, get the benchmark performance of the machine learning model
- Apply one or more feature engineering
- For each feature engineering step, obtain a performance metric and compare it with the baseline performance
- If the performance increase is greater than a certain threshold, then feature engineering is considered to be beneficial and applied on the machine learning pipeline
For example: if the baseline accuracy is 40% and applying a certain feature engineering step raises accuracy to 76%, then the relative change is 90%.
(76%-40%) / 40%=90%
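The same calculation as a tiny sketch; the 5% threshold is just an assumed example, not a fixed rule:

```python
def relative_improvement(baseline: float, new_score: float) -> float:
    """Relative change of a metric versus the baseline."""
    return (new_score - baseline) / baseline

baseline_accuracy = 0.40
accuracy_with_feature = 0.76

change = relative_improvement(baseline_accuracy, accuracy_with_feature)
print(f"{change:.0%}")  # 90%

# Keep the feature engineering step only if the gain clears a threshold (assumed here)
THRESHOLD = 0.05
if change > THRESHOLD:
    print("Apply this feature engineering step in the pipeline")
```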
Final Thoughts
Feature engineering is the most time-consuming work in the machine learning process, and it is also one of the most important tasks.
Feature engineering definition: It is a process that transforms data into features that can better represent business logic, thereby improving the performance of machine learning.
Two key points that feature engineering is easily overlooked:
- Better representation of business logic
- Improve machine learning performance
The 4 steps of feature engineering performance evaluation:
- Before applying any feature engineering, get the benchmark performance of the machine learning model
- Apply one or more feature engineering
- For each feature engineering step, obtain a performance metric and compare it with the baseline performance
- If the performance increase is greater than a certain threshold, then feature engineering is considered to be beneficial and applied on the machine learning pipeline