
Data in the real world can be messy, whether it comes from a relational SQL database, an Excel file, or any other source. Although such data usually has a tabular structure, where each row (sample) has a value for each column (feature), it can still be difficult to understand and process. To make the data easier for machine learning models to digest, we can use feature engineering to improve model performance.

What is feature engineering?

Feature engineering is the process of transforming the given data into a form that is easier to analyze. In this article we want to make machine learning models more transparent, and also to generate features that let people without relevant background knowledge better understand the data presented to them. The notion of transparency for machine learning models is complicated, however, because different models require different approaches for different kinds of data.

Example: Coordinates

To understand the concept of feature engineering, consider a simple example. In the image below we can see two sets of points. Imagine that a warehouse sits close to these points and can only serve customers within a limited distance. From a human point of view it is easy to see that we should consider the points within a certain radius of the warehouse, which requires combining the two known features.

But this is not obvious to an algorithm. For example, a decision-tree-based algorithm considers only one feature at a time and splits the data set into two parts: one where the feature value is above some threshold and one where it is below. Dividing the space described above would require a large number of such splits.

Coordinate transformation

However, we can apply a simple coordinate transformation known from high school: converting from the Cartesian coordinate system (x, y) to the polar coordinate system (r, θ). The conversion is r = √(x² + y²) and θ = arctan(y / x) (more precisely atan2(y, x)).

Now the data is much easier for any algorithm to analyze: the data set can be split along the r axis with a threshold of r_split = 2. Obviously this toy example is of little value by itself, and real-world data is rarely this simple, but it demonstrates the potential of proper feature engineering.
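A minimal sketch of this transformation and split in Python, with made-up customer coordinates and the warehouse placed at the origin:

```python
import numpy as np

# Hypothetical customer coordinates relative to the warehouse at the origin
x = np.array([1.0, -0.5, 3.0, 0.2, -2.5])
y = np.array([0.5,  1.0, 2.0, -1.5, -2.0])

# Cartesian (x, y) -> polar (r, theta)
r = np.sqrt(x**2 + y**2)
theta = np.arctan2(y, x)

# A single split on the new feature r now separates the customers
r_split = 2
within_range = r < r_split
print(within_range)  # [ True  True False  True False]
```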

Continuous data

Continuous data is the most common type of data: a feature can take any value within a given range. It may be the price of a product, the temperature in an industrial process, or the coordinates of an object on a map. Here new features are mostly driven by domain knowledge. For example, you can subtract the purchase price from the sale price to obtain the profit, or compute the distance between two places on a map. The new features you can generate are limited only by the available features and the mathematical operations you know.
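For illustration, a short pandas sketch of such domain-driven features (all column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sales records
df = pd.DataFrame({
    "sale_price":     [120.0, 85.0, 240.0],
    "purchase_price": [100.0, 90.0, 180.0],
    "x":              [2.0, 5.0, 1.0],   # map coordinates of the customer
    "y":              [3.0, 1.0, 4.0],
    "warehouse_x":    [0.0, 0.0, 0.0],   # map coordinates of the warehouse
    "warehouse_y":    [0.0, 0.0, 0.0],
})

# New features built from domain knowledge
df["profit"] = df["sale_price"] - df["purchase_price"]
df["distance"] = np.sqrt((df["x"] - df["warehouse_x"])**2 +
                         (df["y"] - df["warehouse_y"])**2)
print(df[["profit", "distance"]])
```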

Categorical features

The second most common type of data is the categorical feature, a feature that can only take values from a limited set. Usually such a feature holds a single value at a time; when it does not, it is often split into a set of separate features. For example, according to the ISO/IEC 5218 standard, sex can take one of four values: not known, male, female, and not applicable.

Encoding and one-hot

The problem with this type of data is that most algorithms are not designed to process text values. The standard workaround is categorical encoding: introduce an integer to represent each category. For example, the standard codes for the sexes mentioned above are 0, 1, 2, and 9 respectively. Sometimes, for visualization or model efficiency, a different encoding is used: a single multi-level feature is replaced by several Boolean features, of which exactly one takes the value True. This is called one-hot encoding and is especially popular with neural networks.
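A sketch of both encodings with pandas, using the ISO/IEC 5218 codes mentioned above:

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female", "not applicable", "unknown", "female"]})

# Categorical (label) encoding with the ISO/IEC 5218 integer codes
iso_5218 = {"unknown": 0, "male": 1, "female": 2, "not applicable": 9}
df["sex_code"] = df["sex"].map(iso_5218)

# One-hot encoding: one Boolean column per category, exactly one is True per row
one_hot = pd.get_dummies(df["sex"], prefix="sex")
df = pd.concat([df, one_hot], axis=1)
print(df)
```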

Missing value

In the real world, some data simply cannot be obtained, or gets lost during processing, so data sets often contain missing values.

Dealing with these values is an art in itself. This part of data processing is called data cleaning and is usually treated as a separate step. However, when creating new features, keep in mind that missing values may hide behind different names and values.

Some programming languages and libraries have special objects for missing values, usually represented as "NaN" (not a number), but sometimes an ordinary value is used instead. For example, in a column of positive integers, missing values may be encoded as "-1". If the feature values are not inspected beforehand, this distorts simple operations such as computing the feature's mean. In other cases missing values are replaced with "0", which makes summing easy but breaks any new feature that requires division.

Another, more common option is to fill missing values with the mean or median of the observed values. But again, recomputing the mean afterwards gives a different result, so there is a significant difference between new features generated from the true mean and from this distorted mean. All of these examples point to one constant fact: know your data! This is just as important when doing feature engineering.

Something is missing here!

A common approach is to introduce a Boolean feature indicating whether a given feature of a given sample was missing. If something is missing, the Boolean feature is True; if everything is normal, it is False. This allows the machine learning model to decide whether the given value should be trusted or handled separately.
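A minimal pandas sketch combining both ideas, a Boolean "is missing" feature plus filling with the median (the column name is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [120.0, np.nan, 85.0, 240.0, np.nan]})

# Boolean feature marking which samples had a missing price
df["price_missing"] = df["price"].isna()

# Fill the gaps with the median of the observed values
df["price"] = df["price"].fillna(df["price"].median())
print(df)
```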

Normalization

Another common feature engineering step is putting data into a given interval. Why do this? The first reason is simple: computing with numbers in a limited range avoids some numerical errors and limits the computational power required. The second is that some machine learning algorithms handle normalized data better. There are several ways to normalize data.

Standard normalization

In nature and human society many things follow the normal (Gaussian) distribution, which is why one standard way to normalize a feature refers to this distribution. It is given by the following equation: X_new = (X − mean(X)) / std(X).

Here X_new is the new feature, obtained by subtracting the mean of the old feature from each sample and dividing by the standard deviation. The standard deviation indicates how dispersed the feature values are. After this transformation the new feature has a mean of 0 and a standard deviation of 1, so for normally distributed data most values fall roughly within [-1, 1].
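A small numpy sketch of this standardization:

```python
import numpy as np

x = np.array([3.0, 7.0, 5.0, 9.0, 1.0])

# Standardization: subtract the mean, divide by the standard deviation
x_std = (x - x.mean()) / x.std()
print(x_std.mean(), x_std.std())  # ~0.0 and 1.0
```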

Feature scaling

Another normalization subtracts the minimum value X_min from the feature value and divides by the range X_max − X_min, giving the expression: X_new = (X − X_min) / (X_max − X_min).

This normalization maps the given feature into the interval [0, 1].
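And the corresponding min-max scaling of the same example feature:

```python
import numpy as np

x = np.array([3.0, 7.0, 5.0, 9.0, 1.0])

# Min-max scaling: (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # values between 0 (for the minimum) and 1 (for the maximum)
```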

The right normalization for the model

As mentioned earlier, different models need different normalization to work well. In the case of k-nearest neighbors, the range of a feature effectively acts as a weight: the larger its values, the more that feature dominates the distance and the more important it appears. For neural networks, normalization does not change the final result itself, but it speeds up training. Decision-tree-based algorithms, on the other hand, neither benefit from normalization nor are harmed by it.
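A tiny illustration of why this matters for distance-based models such as k-nearest neighbors: a feature with a large range dominates the Euclidean distance (the numbers below are made up):

```python
import numpy as np

# Two samples: (income in dollars, age in years)
a = np.array([50_000.0, 25.0])
b = np.array([51_000.0, 60.0])

# Without normalization the income difference dwarfs the age difference
print(np.linalg.norm(a - b))  # ~1000.6, driven almost entirely by income

# After scaling both features to comparable ranges, age matters again
a_scaled = np.array([0.50, 0.25])
b_scaled = np.array([0.51, 0.60])
print(np.linalg.norm(a_scaled - b_scaled))  # ~0.35, dominated by the age difference
```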

The right normalization for the problem

Sometimes the right normalization comes not from general data or computational considerations but from domain knowledge. For example, when modeling physical systems that depend on temperature, it helps to use the Kelvin scale, which simplifies the relationships in the data. Domain knowledge is always useful in data science.

Date and time

The next common group of data types is dates and times, in all their different formats.

The problem here is that date and time formats vary widely. The data may be a formatted string or a standardized date type provided by a given language or library, and standards and formats differ between organizations and regions of the world.

For example, any European gets frustrated dealing with dates in the US format, 10.27.2018. If dates in DD/MM/YYYY and MM/DD/YYYY format are imported into the same data set as plain strings, misunderstandings and poor model behavior follow easily.

The deeper problem is that a date is not simple numerical data and cannot be fed directly into a machine learning model. The easiest fix is to split it into three integer features representing the day, month, and year. But that is not all: we can also build culturally relevant features, such as whether the day is a weekend or a holiday. Other options are the time elapsed since some significant event, or the interval between consecutive events.

Time of day works the same way. It can be expressed in hours, minutes, and seconds, but it can also be counted in seconds only, or measured from some reference event. In fact, most software uses January 1, 1970, 00:00:00 (the Unix epoch) as the beginning of time, and this representation also works well for feature engineering.

Example: Working with dates

Let's take an example of a data set in which the date plays an important role: the data set from the Blue Book for Bulldozers competition.

Here we load the data set with the Pandas library in Python. For illustration we keep only three columns: SalesID, which identifies the transaction; SalePrice, which is the value to be predicted; and the date of sale. The full data set also contains more information about the machine being sold. We can get much more out of the date column.

With a few lines of simple code, the date column is converted into six model-readable features that can be used to extract more information about the sales.
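The original code is not reproduced here, but a sketch of the idea with pandas might look like this (the sample values are illustrative, and the exact six features derived in the article may differ):

```python
import pandas as pd

# Hypothetical slice of the Bulldozers data: only SalesID, SalePrice and the sale date
df = pd.DataFrame({
    "SalesID":   [1139246, 1139248, 1139249],
    "SalePrice": [66000, 57000, 10000],
    "saledate":  ["11/16/2006 0:00", "3/26/2004 0:00", "2/26/2004 0:00"],
})

df["saledate"] = pd.to_datetime(df["saledate"])

# Expand the single date column into several model-readable features
df["sale_year"]       = df["saledate"].dt.year
df["sale_month"]      = df["saledate"].dt.month
df["sale_day"]        = df["saledate"].dt.day
df["sale_dayofweek"]  = df["saledate"].dt.dayofweek
df["sale_dayofyear"]  = df["saledate"].dt.dayofyear
df["sale_is_weekend"] = df["saledate"].dt.dayofweek >= 5

df = df.drop(columns="saledate")
print(df.head())
```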

Text

In computers, text is stored as numbers using encodings such as ASCII. That may sound as if the problem is already solved, but it is not! Extracting information from text requires understanding language structure: the relationships between the letters within a word and between the words within a sentence. This is the subject of an entire interdisciplinary field called natural language processing (NLP), and much has been developed to make extracting such information easier. Since it would take at least another article, or an entire book, to explain, I won't go into detail here.

Categories from text

Instead of processing the whole text, you can split it into individual words and look for the most frequent ones. For example, we may have access to a human resources database in which one field is the academic title. It may contain many entries like Bachelor of Science, Master of Science, or Doctor of Philosophy, and the number of distinct entries can be huge. Words such as Bachelor, Master, and Doctor can be extracted while the specific field of study is dropped, yielding a categorical feature with four levels of education (including no title). A similar example is a full name with a title: phrases like Mr. Alan Turing, Mrs. Ada Lovelace, or Miss Skłodowska appear in this field, and from the titles Mr., Mrs., and Miss we can extract sex and marital status. As you can see, there are many ways to use text data without needing the full power of NLP.
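A small sketch of extracting such a title feature from a name column (the column name and regular expression are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Mr. Alan Turing", "Mrs. Ada Lovelace", "Miss Skłodowska"]})

# Pull the title out of the free-text name field (longest alternatives first)
df["title"] = df["name"].str.extract(r"^(Mrs|Miss|Mr|Ms|Dr)\.?", expand=False)

# The title can now be used as a categorical feature, e.g. one-hot encoded
print(pd.get_dummies(df["title"], prefix="title"))
```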

Images

Visual data is another type of data that would need at least a separate article, if not an entire book, of its own. Analyzing it has occupied scientists for decades, and a whole field, computer vision, grew out of it. It is worth mentioning, though, that thanks to the deep learning revolution of a few years ago, a comparatively simple approach to image analysis has emerged: convolutional neural networks (CNNs). Using one of the common frameworks and a graphics card, they provide reasonable solutions even to users without much specific computer vision (CV) domain knowledge.

As you can see, there are many possibilities for creating new features. The real goal is to design features that move the data science process forward, and there are more methods than those mentioned above. For example, new features can be generated by combining continuous and categorical features, and NLP and CV give us even more, but that is still not everything. The only way to master them all is to practice and experiment. Because of this enormous diversity, feature engineering is often called an art.