We will learn three things at once:

  1. Feature engineering, from raw data to features
  2. Characteristics of good features
  3. How to understand the data

Feature engineering

Machine learning models cannot directly see, hear, or perceive input samples. You must create a data representation that provides a useful signal for the model to understand the key characteristics of the data.

Interestingly, real-world data is not handed to us as a ready-made feature vector. Instead, it arrives as a database record, a protocol buffer, or some other raw form.

We must extract data from a variety of data sources and then build feature vectors from that data. This process of extracting features from raw data is called feature engineering.

Each part of the raw data on the left is mapped to one or more fields of the feature vector on the right.


In practice, machine learning practitioners spend roughly 75% of their time on feature engineering; the features are what the model ultimately learns from.

Next, let's take a look at how feature engineering happens.

From raw data to features

Mapping numeric values

If a field holds a number, such as the number of rooms in a house, we can copy that number directly into the feature vector. This mapping is completely natural.


Mapping categorical values with one-hot encoding

However, if we are dealing with a string, such as "Shorebird Way," what should we do?

When we encounter a string value, we can usually convert it to a feature vector using a one-hot encoding.

A one-hot encoding provides a unique element for every string we might see.


For example, if we one-hot encode the street name, every possible street gets its own element. When we see a particular street, such as "Shorebird Way," we put a 1 in the element for "Shorebird Way" and a 0 in every other position. Don't worry too much about all of those 0s: in practice we use a sparse representation, in which only the non-zero values are stored.

We can then use this one-hot encoding as the feature vector that represents the string. One-hot encoding is a convenient way to handle categorical data like this.
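To make this concrete, here is a minimal sketch in Python that covers both mappings above: the numeric value is copied as-is, and the street name is one-hot encoded. The field names and the three-street vocabulary are invented for illustration; in practice the vocabulary is built from the training data.

```python
# Minimal sketch: mapping a raw record to a feature vector.
# Field names and the street vocabulary are hypothetical.
street_vocab = ["Main Street", "Shorebird Way", "Ocean Avenue"]

def to_feature_vector(record):
    """Copy the numeric field directly; one-hot encode the street name."""
    num_rooms = float(record["num_rooms"])     # numeric value, copied as-is
    one_hot = [0.0] * len(street_vocab)        # one element per possible street
    one_hot[street_vocab.index(record["street_name"])] = 1.0
    return [num_rooms] + one_hot

print(to_feature_vector({"num_rooms": 4, "street_name": "Shorebird Way"}))
# -> [4.0, 0.0, 1.0, 0.0]

# Sparse representation: store only the index of the non-zero element
# instead of the full vector of mostly zeros.
sparse_street = {"street_name": street_vocab.index("Shorebird Way")}  # {'street_name': 1}
```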

So, what kind of features are good features?

Characteristics of good features

Feature values should appear in the data set as non-zero values multiple times

First, a feature should take on a non-zero value at least a handful of times in our data set.

If a feature takes a non-zero value only very rarely, or only once, it is probably not a good feature to use, and it should be filtered out in a preprocessing step.

For example, a phone should be described by its model (device_model:galaxy_s6) rather than by its unique device ID (my_device_id:8SK982ZZ1242), since each ID appears only once.
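As a rough illustration, such one-off values can be filtered out by counting how often each value occurs. The sample values and the threshold below are made up:

```python
from collections import Counter

# Hypothetical raw values: device models repeat, the device ID is unique.
values = ["galaxy_s6", "pixel_2", "galaxy_s6", "8SK982ZZ1242", "pixel_2"]

counts = Counter(values)
MIN_COUNT = 2  # arbitrary threshold: keep values seen at least twice

kept = {v for v in values if counts[v] >= MIN_COUNT}
print(kept)  # {'galaxy_s6', 'pixel_2'} -- the one-off device ID is filtered out
```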

Features should have clear and unambiguous meaning

Features should have a clear and unambiguous meaning. That way, we can run effective sanity checks on them, eliminate errors, and make sure the features are being handled properly.

For example, an age expressed in years (user_age: 23) is much easier to troubleshoot and reason about than one expressed in seconds (user_age: 123456789).

Features should not use "magic" values

Do not mix special values into the actual data: features should not take on "magic" values. For example, suppose we define a feature whose job is to tell us how many days a house has been listed for sale.

We could use the special value -1 to indicate that the house has never been listed. A better idea, however, is to define a separate indicator feature: a Boolean that records whether the "days listed for sale" feature is defined at all.

With that Boolean (True/False, or 1/0) indicating whether the feature is defined, the original "days listed for sale" feature can keep its natural range of 0 to n.
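Here is a small sketch of that indicator-feature idea, assuming the magic value is -1 (or a missing value); the function name is hypothetical:

```python
def encode_days_listed(days):
    """Split one raw field into (is_defined, days_listed).

    The Boolean indicator records whether the value exists at all, so the
    numeric feature keeps its natural 0..n range instead of carrying the
    magic value -1.
    """
    if days is None or days < 0:    # magic value / missing: never listed
        return (0, 0.0)
    return (1, float(days))

print(encode_days_listed(-1))   # (0, 0.0)
print(encode_days_listed(42))   # (1, 42.0)
```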

Feature values should not change over time

To account for upstream instability, feature values should not change over time; this brings us back to the concept of data stationarity.

So why might a feature value change over time? Because our features may be handed to us by an upstream model, and that upstream model may change. Ideally, we can pin down the version of that model and understand it in detail, or obtain from it values whose semantics are constant and well specified.

The distribution should not contain extreme outliers

Features should not contain unreasonable outliers. For example, suppose that in the California housing data we create a synthetic feature: rooms per person, that is, the total number of rooms divided by the total population.

For most city blocks this gives a fairly reasonable value, between 0 and 3 or 4 rooms per person. But in a few city blocks the value climbs as high as 50. That value is clearly abnormal; I mean, where do these people live, some kind of "hotel block"? In cases like this we can cap the feature at an upper bound, or otherwise transform it, to remove the unreasonable outliers.

Ideally, all features are converted to similar ranges, such as (-1, 1) or (0, 5).
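Here is a sketch of capping the outlier and then rescaling, using the rooms-per-person example above; the cap of 4.0 and the sample values are illustrative choices, not part of the original data:

```python
import numpy as np

# Hypothetical rooms-per-person values; 50.0 is the implausible outlier.
rooms_per_person = np.array([1.2, 0.8, 2.5, 3.1, 50.0])

# Cap the feature at an upper bound instead of discarding the row.
clipped = np.clip(rooms_per_person, 0.0, 4.0)

# Then rescale to a common range such as [0, 1] so features are comparable.
scaled = (clipped - clipped.min()) / (clipped.max() - clipped.min())
print(clipped)  # [1.2 0.8 2.5 3.1 4. ]
print(scaled)   # [0.125 0. 0.53125 0.71875 1.]
```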


Binning

Another technique worth considering is binning. Suppose we want to explore how latitude affects housing prices in California: there is no single north-to-south linear relationship that maps directly onto price.

Within any particular latitude range, however, there is often a strong correlation. So what we can do is divide the latitudes from north to south into a number of small bins.


Each of these bins becomes a Boolean feature, and we can one-hot encode them. Now, if a house falls in the bin that maps to San Diego, that element gets a 1; if it falls in the bin that maps to the South Bay area, that element gets a 1; in any other area, it gets a 0. This way, the model can capture part of the nonlinear relationship cheaply, without any special techniques.
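A rough sketch of binning latitude into one-hot Boolean features follows. The ten 1-degree bins spanning roughly 32°N to 42°N are an assumption about California's extent; bin boundaries are a modeling choice:

```python
import numpy as np

# California spans roughly 32°N to 42°N; use ten 1-degree bins.
BIN_EDGES = np.arange(32.0, 43.0)   # [32, 33, ..., 42]

def latitude_to_one_hot(lat):
    """Map a latitude to a Boolean one-hot vector over the bins."""
    one_hot = np.zeros(len(BIN_EDGES) - 1)
    idx = int(np.digitize(lat, BIN_EDGES)) - 1   # which bin lat falls into
    one_hot[max(0, min(idx, len(one_hot) - 1))] = 1.0
    return one_hot

print(latitude_to_one_hot(32.7))  # roughly San Diego -> 1 in the first bin
```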

Understand the data

It's important to understand the data. It's a mistake to treat machine learning as a black box, throwing data in without inspecting it and hoping for good results.

Weird things do show up in data, so the first line of defense is to visualize it: use histograms, scatter plots, and various ranking metrics to display the data.

You can also run a variety of data-debugging checks. Throughout development, you will spend a lot of time looking for duplicate values, missing values, and anything that looks like an outlier (a few such checks are sketched after the list below). Anything like a dashboard that lets you interact with your data can be very useful.

In real life, many samples in a data set are unreliable for one or more of the following reasons:

  • Missing values. For example, someone forgot to enter a value for a house's age.
  • Duplicate samples. For example, a server mistakenly uploaded the same record twice.
  • Bad labels. For example, someone mislabeled a picture of an oak tree as a maple.
  • Bad feature values. For example, someone typed an extra digit, or a thermometer was left out in the sun.
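A few of these checks are easy to sketch with pandas; the tiny DataFrame below is fabricated so that each problem appears once:

```python
import pandas as pd

# Fabricated records exhibiting the problems above.
df = pd.DataFrame({
    "house_age": [23, None, 23, 4, 870],   # a missing value and an absurd outlier
    "num_rooms": [3, 2, 3, 1, 4],          # row 2 duplicates row 0
})

print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # number of exact duplicate rows
print(df.describe())          # min/max/quantiles expose outliers like 870
# A histogram makes odd distributions visible at a glance:
# df["house_age"].hist(bins=30)
```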

Finally, we want to monitor the data over time.

The fact that a data source was in good shape yesterday does not mean it will be in good shape tomorrow. So any measure we take to monitor the stability of features over time will make the system more robust.
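One simple form of such monitoring is to compare a feature's summary statistics against a reference snapshot and flag large shifts. This is only a sketch; the tolerance and the synthetic data are invented for illustration:

```python
import numpy as np

def mean_drifted(reference, current, max_shift=0.25):
    """Flag a feature whose mean has drifted from a reference snapshot.

    `max_shift` is an arbitrary tolerance, measured in units of the
    reference standard deviation.
    """
    ref_mean, ref_std = np.mean(reference), np.std(reference)
    shift = abs(np.mean(current) - ref_mean) / (ref_std + 1e-12)
    return shift > max_shift

yesterday = np.random.normal(3.0, 1.0, 1000)  # snapshot of a feature
today = np.random.normal(3.8, 1.0, 1000)      # an upstream change shifted it
print(mean_drifted(yesterday, today))          # True -> worth investigating
```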

Follow these rules when getting to know your data:

  • Keep in mind what you expect your data to look like.
  • Confirm that the data meets these expectations (or you can explain why the data does not meet expectations).
  • Double check that the training data is consistent with data from other sources.

Handle your data as carefully as any mission-critical code. Good machine learning relies on good data.

Final Thoughts

Feature engineering is the process of extracting features from raw data.

Characteristics of good features: avoid discrete feature values that are rarely used; prefer clear and unambiguous meanings; do not mix special "magic" values into the actual data; account for upstream instability; and keep the distribution free of extreme outliers.

Understand the data, and handle it as carefully as any mission-critical code.

This article is reprinted from the public account AI Product Manager.