Introduction

If you haven't heard it yet, here is a fact that, as a data scientist, you should always keep in a corner of your mind: "Your results are only as good as your data."

Trying to make up for bad data by improving the model is a mistake many people make. It is like buying a supercar because your old car performed poorly on low-quality gasoline. What you should do in that case is refine the gasoline, not upgrade the car. In this article, I will explain how you can easily get better results by improving the quality of your dataset.

Note: I will use image classification as the example, but these techniques apply to many kinds of datasets.

Problem 1: The amount of data is not enough.

If your dataset is too small, your model will not have enough samples to learn features that generalize. It will overfit the data, resulting in a high test error even though the training error is low.

Solution 1: Gather more data.

You can try to find more data from the same source as your original dataset, or from another source if the images are quite similar or if you absolutely want the model to generalize.

Caveat: This is usually not an easy task and requires time and money. In addition, you may want to do an analysis to determine how much extra data you need. Compare the results obtained with different dataset sizes and try to extrapolate.

In this case, it seems that we would need 500k samples to reach the target error. That is 50 times the amount of data we currently have. Working on other aspects of the data, or on the model, may be more effective.
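The extrapolation described above can be sketched in a few lines. This is only a rough illustration: it assumes the test error follows a power law in the dataset size (a common but not guaranteed pattern), and the measurements below are made-up numbers, not results from this article.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * size**b, done in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope, negative if error shrinks with data
    a = math.exp(mean_y - b * mean_x)   # scale factor
    return a, b

def samples_needed(a, b, target_error):
    """Invert error = a * size**b to get the size that reaches target_error."""
    return (target_error / a) ** (1.0 / b)

# Hypothetical measurements: test error after training on growing subsets.
sizes = [1_000, 2_500, 5_000, 10_000]
errors = [0.40, 0.32, 0.27, 0.23]

a, b = fit_power_law(sizes, errors)
needed = samples_needed(a, b, target_error=0.10)
print(f"fitted curve: error ~ {a:.2f} * n^{b:.3f}")
print(f"estimated samples for 10% error: {needed:,.0f}")
```

If the estimate is many times larger than what you can realistically collect, that is a hint to look at the other problems in this article first.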

Solution 2: Augment your data by creating multiple copies of the same image with slight variations.

This technique can work wonders and generates a large number of additional images at a very low cost. You can crop, rotate, translate, or scale the image. You can add noise, blur it, change its colors, or mask parts of it. In all cases, you need to make sure the data still represents the same class.

All of these images still represent the "cat" category

This can be very powerful, because stacking these effects multiplies the number of samples in your dataset exponentially. Note that this is still usually not as good as collecting more original data.

Combined data augmentation techniques. The class is still "cat" and should be recognized as such.
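As a minimal sketch of stacking augmentations, the snippet below represents a grayscale image as a nested list of 0-255 pixel values so that it runs without dependencies. The function names are illustrative, and in practice you would more likely use an image library; the point is only that each variant keeps the original label.

```python
import random

def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def rotate_90(img):
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]

def crop(img, top, left, height, width):
    """Keep a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]

def add_noise(img, amount, rng):
    """Add uniform integer noise in [-amount, amount], clamped to 0-255."""
    return [[min(255, max(0, p + rng.randint(-amount, amount))) for p in row]
            for row in img]

def augment(img, rng):
    """Stack a random subset of transformations, then add light noise."""
    out = img
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    if rng.random() < 0.5:
        out = rotate_90(out)
    return add_noise(out, amount=10, rng=rng)

# A tiny 4x4 stand-in for a real "cat" image.
image = [[10 * (r + c) for c in range(4)] for r in range(4)]
rng = random.Random(0)  # fixed seed so the run is repeatable

variants = [augment(image, rng) for _ in range(5)]
labels = ["cat"] * len(variants)  # every variant keeps the original label
print(len(variants), "augmented copies, all labeled", labels[0])
```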

Caveat: Not every augmentation is appropriate for your problem. For example, if you want to classify lemons and limes, don't play with the hue, because color is meaningful for this classification.

This type of data augmentation would make it harder for the model to find distinguishing features.

Problem 2: Low-quality classes

This one is easy: if possible, take some time to go through your dataset and verify the label of each sample. This may take a while, but having mislabeled samples (counterexamples) in the dataset hurts the learning process.

Also, choose the right level of granularity for your classes. Depending on the problem, you may need more or fewer classes. For example, to classify an image of a kitten, you can first run it through a global classifier to determine that it is an animal, then run it through an animal classifier to determine that it is a kitten. A single huge model could do both, but it would be harder to train.

Two-stage prediction with a specialized classifier.
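The two-stage idea can be sketched as a small dispatch function. The classifiers below are hypothetical stand-ins (simple rules on made-up features) for trained models; only the routing logic is the point.

```python
def two_stage_predict(sample, global_classifier, specialists):
    """Return (coarse_label, fine_label).

    fine_label is None when no specialist exists for the coarse label.
    """
    coarse = global_classifier(sample)
    specialist = specialists.get(coarse)
    fine = specialist(sample) if specialist is not None else None
    return coarse, fine

# Hypothetical stand-ins for trained models.
def global_classifier(sample):
    return "animal" if sample["has_fur"] else "object"

def animal_classifier(sample):
    return "kitten" if sample["weight_kg"] < 2 else "dog"

# One specialized classifier per coarse class that needs finer labels.
specialists = {"animal": animal_classifier}

print(two_stage_predict({"has_fur": True, "weight_kg": 1.2},
                        global_classifier, specialists))
print(two_stage_predict({"has_fur": False, "weight_kg": 0.3},
                        global_classifier, specialists))
```

Each stage can then be trained and evaluated on its own, smaller label set.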

Problem 3: Low-quality data

As stated in the introduction, low quality data will only result in low quality results.

Some samples in your dataset may be too far from what you actually want the model to handle. Such samples are more likely to confuse the model than to help it.

Solution: Remove the worst images.

This is a long process, but it will improve your results.

Of course, these three images represent cats, but the model may not be able to learn from them.

Another common problem is a dataset made of data that does not match the real-world application, for example when the images come from a completely different source.

Solution: Think about the long-term application of your technology and how data will be captured in production.

If possible, try to find or build a dataset using the same tools.

Using data that doesn't represent your real-world application is often a bad idea. Your model may extract features that are not available in the real world.

Problem 4: Unbalanced classes

If the number of samples per class is not roughly the same for all classes, the model may tend to favor the dominant class, because doing so leads to a lower error. We say that the model is biased because the class distribution is skewed. This is a serious problem, and it is the reason you need to look at precision, recall, or the confusion matrix rather than at accuracy alone.
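To see why accuracy alone hides this bias, here is a small sketch with made-up labels: a model that always predicts the dominant class on a 90/10 dataset. The helper names are illustrative.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))

def precision_recall(y_true, y_pred, label):
    """Per-class precision and recall; 0.0 when the denominator is empty."""
    counts = confusion_counts(y_true, y_pred)
    true_positives = counts[(label, label)]
    predicted = sum(c for (t, p), c in counts.items() if p == label)
    actual = sum(c for (t, p), c in counts.items() if t == label)
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# Hypothetical skewed dataset: 90 limes, 10 cats.
y_true = ["lime"] * 90 + ["cat"] * 10
# A biased model that always predicts the dominant class.
y_pred = ["lime"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)                                       # 0.9, looks fine
print("cat  (precision, recall):", precision_recall(y_true, y_pred, "cat"))   # (0.0, 0.0)
print("lime (precision, recall):", precision_recall(y_true, y_pred, "lime"))  # (0.9, 1.0)
```

The 90% accuracy looks acceptable, while the recall of 0 for "cat" shows that the minority class is never recognized.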

Solution 1: Collect more samples of the underrepresented classes.

However, this is often expensive in time and money, or simply not feasible.

Solution 2: Over-sample or under-sample your data.

This means that you remove some samples from over-represented classes or duplicate samples from under-represented classes. Better than plain duplication, use data augmentation as described earlier.

Adding cat pictures and removing some lime pictures gives a more balanced dataset.
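A minimal sketch of over- and under-sampling follows. The `rebalance` helper and the file names are hypothetical; the duplicated samples could be passed through an augmentation step instead of being copied as-is.

```python
import random
from collections import Counter, defaultdict

def rebalance(samples, labels, target_per_class, rng):
    """Return a dataset with exactly target_per_class samples per class.

    Large classes are under-sampled (random subset, no repeats);
    small classes are over-sampled (random duplicates added).
    """
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target_per_class:
            chosen = rng.sample(items, target_per_class)
        else:
            extra = rng.choices(items, k=target_per_class - len(items))
            chosen = items + extra
        out_samples.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_samples, out_labels

# Hypothetical skewed dataset: 90 lime images, 10 cat images.
samples = [f"lime_{i}.png" for i in range(90)] + [f"cat_{i}.png" for i in range(10)]
labels = ["lime"] * 90 + ["cat"] * 10

balanced_samples, balanced_labels = rebalance(
    samples, labels, target_per_class=40, rng=random.Random(0))
print(Counter(balanced_labels))
```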

Problem 5: Data imbalance

If your data does not have a consistent format, or its values do not fall within a consistent range, your model may struggle to handle it. You will get better results with images that share the same aspect ratio and the same range of pixel values.

Solution 1: Crop or stretch the data so that it has the same aspect ratio or format as the other samples.

Two ways to fix a badly formatted image.
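Both options can be sketched on a nested-list image. This is an illustration only: the stretch uses nearest-neighbor sampling for brevity, and a real pipeline would normally rely on an image library with proper interpolation.

```python
def center_crop_square(img):
    """Crop the largest centered square; content near the edges is lost."""
    height, width = len(img), len(img[0])
    side = min(height, width)
    top = (height - side) // 2
    left = (width - side) // 2
    return [row[left:left + side] for row in img[top:top + side]]

def stretch(img, new_height, new_width):
    """Nearest-neighbor resize; keeps all content but distorts proportions."""
    height, width = len(img), len(img[0])
    return [[img[r * height // new_height][c * width // new_width]
             for c in range(new_width)]
            for r in range(new_height)]

# A 2x4 image that should become 2x2 like the rest of the dataset.
wide = [[1, 2, 3, 4],
        [5, 6, 7, 8]]

print(center_crop_square(wide))  # [[2, 3], [6, 7]]
print(stretch(wide, 2, 2))       # [[1, 3], [5, 7]]
```

Cropping preserves proportions but discards the borders, while stretching keeps everything at the cost of distortion; which one is better depends on where the useful information sits in your images.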

Solution 2: Normalize the data so that the data for each sample is in the same range of values.

The range of values is normalized to be consistent across the entire dataset.
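As a small sketch of this normalization, assume two hypothetical sources, one producing 8-bit pixels (0 to 255) and one producing 16-bit pixels (0 to 65535). Mapping both to the range 0 to 1 makes them comparable.

```python
def normalize(img, low=0.0, high=255.0):
    """Linearly map pixel values from [low, high] to [0, 1]."""
    span = high - low
    return [[(p - low) / span for p in row] for row in img]

# Hypothetical samples from two sources with different value ranges.
image_8bit = [[0, 128, 255]]
image_16bit = [[0, 32768, 65535]]

print(normalize(image_8bit))                 # values now between 0 and 1
print(normalize(image_16bit, high=65535.0))  # same range as the 8-bit image
```

Whatever range you choose, apply the same mapping to the data seen in production.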

Problem 6: No validation or test set

Once your dataset has been cleaned, augmented, and correctly labeled, you need to split it. Many people split it as follows: 80% for training and 20% for testing, which makes it easy to spot overfitting. However, if you try multiple models on the same test set, something else happens. By choosing the model with the best test accuracy, you are actually overfitting the test set. This happens because you are manually selecting a model not for its intrinsic value, but for its performance on one specific set of data.

Solution: Split the data set into three: training set, validation set, test set.

This shields your test set from being overfitted by the choice of model. The selection process becomes:

  1. Train your models on the training set.
  2. Test them on the validation set to make sure they are not overfitting.
  3. Choose the most promising model. Test it on the test set, which will give you the true accuracy of the model.

Note: Once you have selected the model for production, don't forget to train it on the entire dataset! The more data, the better!
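The three-way split can be sketched as follows. The 70/15/15 proportions and the fixed seed are assumptions chosen for illustration, not values from this article.

```python
import random

def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = list(samples)   # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # whatever remains
    return train, val, test

# Hypothetical dataset of 100 sample identifiers.
dataset = list(range(100))
train_set, val_set, test_set = split_dataset(dataset)
print(len(train_set), len(val_set), len(test_set))

# Selection process from the list above:
#   1. fit every candidate model on train_set
#   2. compare the candidates on val_set
#   3. evaluate only the chosen model on test_set, once
```

Keeping the test set out of every comparison is what makes the final number trustworthy.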

Conclusion

I hope that by now you are convinced that you must pay attention to your dataset before even thinking about your model. You now know the biggest mistakes made when handling data, how to avoid the pitfalls, and some tips and tricks for building killer datasets! If in doubt, remember: "The winner is not the one with the best model, it's the one with the best data."

original:Stop Feeding Garbage To Your Model! — The 6 biggest mistakes with datasets and how to avoid them.

Translation: Google Translate

Proofreading: Xiaoqiang, Dukang, who can't die