Even in this fancy world of machine learning, just as a car can't drive well on an uneven road, machine learning algorithms can't produce the expected results from large amounts of messy, irrelevant data. So let's dive into all the options for optimizing your data.

Note: This can get a bit verbose, so I have divided it into three parts. Please explore them gradually.


Part 1: Feature Engineering

What is feature engineering and why bother?

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. It is the foundation of applied machine learning, and it is both difficult and expensive. Done correctly, however, it can work wonders: it is the difference between poor and excellent performance of the same algorithm.

This is important and cannot be ignored. Let's take a look at the general roadmap:

  • Data cleaning and preprocessing

→ Handling outliers

→ Handling missing values

→ Handling skewness

  • Scaling
  • Encoding
  1. Data cleaning and preprocessing: In the real world, we never get tailor-made data that fits perfectly into an algorithm. We clean and preprocess it in the following ways.

i) Handling outliers: Outliers are data points that do not follow the overall trend of the data. Many algorithms are sensitive to outliers. So what can we do about them?

→ If there are only a few, remove them completely. You can set thresholds to identify them and then delete the offending rows. If a column contains many outliers, it may be best to delete the entire column rather than individual rows.

→ You can transform the data to a log scale, since the logarithm compresses large values and tames extremes. (This applies only to numerical data, of course.)

→ Visualize the data with scatter plots, histograms, and box-and-whisker plots, and look for extreme values. There are many other techniques for dealing with outliers, and I suggest you read up on them carefully.
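To make this concrete, here is a minimal sketch in pandas of the thresholding idea, assuming a toy salary column and the common 1.5 × IQR rule (both are my own illustrative choices, not from any particular library default):

```python
import numpy as np
import pandas as pd

# Toy data with one obvious outlier in the "salary" column
df = pd.DataFrame({"salary": [40, 42, 45, 43, 41, 500]})

# Flag values outside 1.5 * IQR of the quartiles -- a common rule of thumb
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(df[(df["salary"] < lower) | (df["salary"] > upper)])  # the 500 row

# Option 1: drop the offending rows
df_clean = df[(df["salary"] >= lower) & (df["salary"] <= upper)]

# Option 2: compress the range with a log transform (numeric data only)
df["log_salary"] = np.log1p(df["salary"])
```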

ii) Handling missing values: Why handle missing values at all?

Missing data in the training set reduces the power/fit of the model. Missing values can also bias the model, because we have not properly analyzed their behavior and their relationship to other variables. On top of that, some algorithms simply cannot work with missing data. It is therefore important to identify and mark missing data; once marked, you are ready to replace the values.

→ Replace missing values with the mean, median, or mode (it all depends on judgment). You can use sklearn.impute.SimpleImputer for the same purpose.

→ If appropriate, you can replace them with entirely new values (a judgment call again).

→ Or, if too many values are missing, you can delete the entire column. So this mostly comes down to judgment calls, again!
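Here is a minimal sketch of these options using sklearn.impute.SimpleImputer; the age/city columns and the 50% drop threshold are hypothetical choices for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps in a numeric and a categorical column
df = pd.DataFrame({"age": [25, np.nan, 30, 28, np.nan],
                   "city": ["NY", "LA", np.nan, "NY", "LA"]})

# Numeric column: fill missing values with the median
df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# Categorical column: fill missing entries with the most frequent value
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

# Or drop any column that is more than half empty (a judgment call)
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))
```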

iii) Skewness: Skewness is a measure of the asymmetry of a distribution; more precisely, it measures the lack of symmetry.

Three basic types of skewness

Why deal with skewness?

→ Many model-building techniques assume that the predictor values are normally distributed and have a symmetrical shape. Therefore, it is sometimes important to deal with skewness.

→ A symmetric distribution is preferable to a skewed one because it is easier to interpret and draw inferences from.

→ To reduce skewness, use a logarithmic transformation, a square-root transformation, etc., as sketched below.
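As a quick sketch of measuring and reducing skewness with pandas and NumPy (the log-normal toy data is my own choice, picked because it is strongly right-skewed):

```python
import numpy as np
import pandas as pd

# A right-skewed feature (think incomes): a few large values stretch the tail
s = pd.Series(np.random.default_rng(0).lognormal(mean=3, sigma=1, size=1000))
print(s.skew())  # strongly positive -> right-skewed

# Log and square-root transforms pull in the long right tail
log_s = np.log1p(s)   # log(1 + x), safe when zeros are present
sqrt_s = np.sqrt(s)
print(log_s.skew(), sqrt_s.skew())  # both much reduced; the log is near 0
```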

2. Scaling: Unlike handling missing values, scaling is not strictly required. But that doesn't mean it is less important. Consider a situation where one of your columns (say, A) has values in the range 10k to 100k, while another column (say, B) has values in the range 0 to 1. Then A will have an unwarranted advantage over B, because it will carry more weight.

→ Scaling transforms the features so that they lie between a given minimum and maximum value, usually zero and one, or scales the maximum absolute value of each feature to unit size. This improves the numerical stability of some models.

Effect of scaling

→ Note that normalization/scaling cannot be applied to categorical data, so we separate the categorical data from the numerical data and standardize only the numerical data.

→ MinMaxScaler, StandardScaler, Normalizer, etc. are some of the techniques; they all live in sklearn.preprocessing.

(I suggest you visit this blog for a deeper look at scaling.)
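Here is a minimal sketch of the two most common scalers, reusing the A-versus-B situation described above:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Column A dwarfs column B in magnitude
df = pd.DataFrame({"A": [10_000, 50_000, 100_000], "B": [0.2, 0.5, 0.9]})

# MinMaxScaler squeezes each column into [0, 1]
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# StandardScaler gives each column zero mean and unit variance
df_standard = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(df_minmax)
print(df_standard)
```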

3. Encoding: What is it, and why do it?

Most of the algorithms we use work with numeric values, while categorical data usually comes in the form of text/strings (male, female) or bins (0–4, 4–8, etc.).

One option is to exclude these variables from the algorithm and use only the numeric data. But in doing so, we may lose key information. It is therefore usually better to include categorical variables by encoding them into numeric values. First, though, let us learn a thing or two about categorical variables.

Variable type

Generally, two types of encoding are performed on data: label encoding and one-hot encoding (or pandas.get_dummies).

i) Label encoding: Give each category a label (e.g. 0, 1, 2, etc.). Label encoding is a convenient technique for encoding categorical variables. However, nominal variables encoded this way may end up being misinterpreted as ordinal. Therefore, label encoding should be performed only on ordinal data (data with some inherent sense of order).

→ That way, even after label encoding, the data does not lose its ranking or levels of importance.

Label encoding

You can use sklearn.preprocessing.LabelEncoder for this.
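A minimal sketch follows, using the education levels from the example further down. One subtlety worth knowing: LabelEncoder assigns integer codes in alphabetical order of the class names, so for truly ordinal data an explicit mapping (or sklearn's OrdinalEncoder with an explicit category order) is often the safer choice:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"education": ["middle", "elementary", "doctorate",
                                 "graduate", "postgraduate"]})

# LabelEncoder picks codes alphabetically -- it does not know the true ranking
df["education_le"] = LabelEncoder().fit_transform(df["education"])

# An explicit mapping preserves the real order of the levels
order = {"elementary": 0, "middle": 1, "graduate": 2,
         "postgraduate": 3, "doctorate": 4}
df["education_rank"] = df["education"].map(order)
print(df)
```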

ii) One-hot encoding: Label encoding cannot be used on nominal or binary variables, because we cannot rank them by their attributes; every value is treated equally. Consider the following two categorical variables and their values:

→ Color: blue, green, red, yellow

→ Education: elementary, middle, graduate, postgraduate, doctorate.

One-hot encoding example

You can use pd.get_dummies or sklearn.preprocessing.OneHotEncoder to perform it.
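Here is a minimal sketch of both options, using the color variable from above (get_feature_names_out assumes a reasonably recent scikit-learn):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["blue", "green", "red", "yellow", "blue"]})

# pandas: one new 0/1 column per category
dummies = pd.get_dummies(df["color"], prefix="color")
print(dummies.head())

# sklearn equivalent, handy inside a preprocessing pipeline
ohe = OneHotEncoder()
encoded = ohe.fit_transform(df[["color"]]).toarray()
print(ohe.get_feature_names_out())  # ['color_blue', 'color_green', ...]
```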

Data sets with more dimensions require more parameters for the model to learn, which means more rows are needed to learn those parameters reliably. The side effect of one-hot encoding is precisely that it adds many columns (dimensions).

If the number of rows in the data set is fixed, adding extra dimensions without adding more information for the model to learn from can adversely affect the accuracy of the final model.

One-hot encoding vs. label encoding

That wraps up Part 1. Read on to Part 2, which discusses feature extraction and the super-important topic of dimensionality reduction.

Next: "Prepare your data for modeling: feature engineering, feature selection, dimensionality reduction (Part 2)"