If you've been reading about math, statistics, or computational models lately, you have probably come across the saying that machine learning is "GIGO": garbage in, garbage out. In other words, if you feed garbage into your model, you should expect garbage to come out of it. If you want to avoid the embarrassing situation of your machine learning model producing garbage results, you need to understand the importance of effective data preprocessing and feature engineering, as the title suggests. For the sake of simplicity, we split the topic into separate modules.
Data preprocessing
Data preprocessing is a huge topic because preprocessing techniques vary from dataset to dataset. Different types of data (images, text, sound, video, csv files, etc.) call for different preprocessing methods, but a few techniques apply to almost any type of data. The most important of these are:
- Vectorization
- Normalization
- Handling missing values
Vectorization
All ML models require input data in the form of vectors. If you have raw text data, you need some mechanism to convert those strings into meaningful numerical representations, such as Tf-idf or word2vec. If you have images, they are treated as matrices of pixel values. If you have sound, it needs to be converted from an analog to a digital representation. If you have categorical data in a csv file, you may need to apply label encoding or one-hot encoding. You basically convert all the data into floats (or, in some cases, integers) so that your ML model can handle it with ease. Each record (each sentence for text, each sound wave for audio) becomes a single row of your input, and stacking multiple records gives you the input matrix (usually denoted X).
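As a minimal sketch of what vectorization can look like in practice (assuming scikit-learn is available; the sentences and categories below are made up purely for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Text -> Tf-idf: each sentence becomes one row of a numeric matrix.
sentences = ["the clock shows noon", "the model reads the clock"]
X_text = TfidfVectorizer().fit_transform(sentences).toarray()  # shape: (2, vocabulary_size)

# Categorical csv column -> one-hot encoding: each category becomes a 0/1 column.
colors = np.array([["red"], ["green"], ["red"]])
X_cat = OneHotEncoder().fit_transform(colors).toarray()        # shape: (3, 2)

print(X_text.shape, X_cat.shape)
```

Either way, the end result is a float matrix X with one row per record, which is exactly what the model expects.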
Normalization
It is highly recommended that your data be scaled correctly, meaning that the range of values should not differ wildly from one column (feature) to another. If one column has values between 0 and 1 and another feature has values between 100 and 1000, the difference in ranges can cause the optimizer to make large gradient updates, and your network/model may fail to converge. A good practice, therefore, is to take that 100-1000 column and rescale it to lie between 0 and 1. In general, the following guidelines should be applied to get the maximum benefit from normalization (a short sketch follows the list below):
- Small values: most values should lie in the 0 to 1 or -1 to 1 range.
- Homogeneity: all columns should take values in roughly the same range.
- Mean: normalize so that the mean of each column is 0.
- Standard deviation: normalize so that the standard deviation of each column is 1.
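Here is a minimal sketch of both approaches using NumPy (the toy matrix is made up; with real data you would compute the statistics on the training set and reuse them on the test set):

```python
import numpy as np

# Toy feature matrix: column 0 lives in 0-1, column 1 lives in 100-1000.
X = np.array([[0.2, 150.0],
              [0.8, 900.0],
              [0.5, 400.0]])

# Standardization: subtract the column mean and divide by the column standard
# deviation, so each feature ends up with mean 0 and standard deviation 1.
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: squeeze each column into the 0-1 range.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```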
Handling missing values
Your data will not always be ideal. Missing values in a dataset are very common, and handling them well leads to better model training. One approach is to replace all missing values with 0, provided that 0 does not already carry meaningful information in the data. If the data contains many missing values replaced with 0, the model will eventually learn that 0 does not play any constructive role in its decision making and will largely ignore it by assigning it a low weight. If your data is fairly consistent, especially in the case of time series or other sequence-based datasets, interpolating the missing values can also be a sensible option. Otherwise, missing values are most often replaced with the column average.
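A minimal sketch of these three options with pandas (the series below is made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

filled_zero = s.fillna(0)          # replace missing values with 0
filled_mean = s.fillna(s.mean())   # replace missing values with the column mean
interpolated = s.interpolate()     # linear interpolation, useful for sequential data
```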
Feature engineering
If you think of data as the crude oil of the 21st century, then this step is the refinery that enhances its value. Feature engineering basically means deriving hidden insights from the raw data and extracting meaningful features from it. Doing it effectively requires a lot of domain knowledge. For example, if you want to engineer new features from Forex data, you need a good understanding of how foreign exchange, the global economy, and currencies actually work.
Consider a very simple example where you want to develop a system that takes an image of a clock as input and outputs the time of day. There are a number of ways you could do this. A crude way is to pass the model many images of clocks, each labeled with the time of day. Your model can look at thousands of images, learn the visual relationship between the clock hands and the labeled time, and eventually learn to estimate the time of day for you. Sounds good enough, right?
Another way is to understand how a clock actually works and convert the image data into numbers, storing only the x and y coordinates of the tips of the clock's hands. Now compare this with the raw-image approach. Suppose you pass in 256×256 clock images: that means 65,536 values to process per image, whereas with the coordinate features you only need to process 4 values! Not only does this feature engineering make the model more accurate, it also makes it thousands of times faster.
Now let's take this a step further and say you have gained a very detailed insight into how a clock works, so you can measure the angle of each of the two clock hands. In that case you only need two values (one angle per hand), and that's it! Once you have the angles and an understanding of the math involved in reading a clock, you might just write a few lines of code for the equation, so you don't even need a machine learning model (though this is a very simple example; for more complex scenarios, ML will be a better approach than hand-coding all possible rules).
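As a toy illustration of this last step (the hand positions and the hand_angle helper below are hypothetical, made up purely to show the arithmetic):

```python
import math

# Hypothetical hand-tip positions: (x, y) offsets from the clock centre.
hour_xy = (0.0, 1.0)    # pointing straight up  -> 12 o'clock
minute_xy = (1.0, 0.0)  # pointing to the right -> 15 minutes

def hand_angle(x, y):
    """Clockwise angle from the 12 o'clock position, in degrees (0-360)."""
    return math.degrees(math.atan2(x, y)) % 360

hour_angle = hand_angle(*hour_xy)      # 0.0
minute_angle = hand_angle(*minute_xy)  # 90.0

# With the two angles in hand, the time follows from simple arithmetic,
# so no machine learning model is needed for this toy case.
hours = int(hour_angle // 30) % 12     # the hour hand sweeps 30 degrees per hour
minutes = int(minute_angle // 6)       # the minute hand sweeps 6 degrees per minute
print(f"{hours:02d}:{minutes:02d}")    # -> 00:15 (i.e. a quarter past twelve)
```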
Finally, to sum everything up:
- Good features allow you to solve problems more elegantly with fewer computing resources.
- Good features let you train a model with far less data.