Data is central to artificial intelligence. This article introduces three kinds of data sets in detail: the training set, the validation set, and the test set.

It also explains how to divide data sensibly among these 3 data sets. Finally, I will introduce a way to make full use of limited data: cross-validation.

First, here is a rough analogy to illustrate the relationship between the three data sets:

• The training set is like learning the material in class
• The validation set is like the exercises after class, used to correct and reinforce what was learned
• The test set is like the final exam, used to evaluate the final learning outcome

What is a training set?

The Training Dataset is used to train the model.

In "Understanding machine learning in one article" we introduced the 7 steps of machine learning. The Training Dataset is mainly used during the training phase.

What is a validation set?

Once our model is trained, we don't yet know how well it performs. At this point, we can use the Validation Dataset to see how the model performs on new data (the validation set and the test set are different data). At the same time, we adjust the hyperparameters to bring the model to its best state.

The validation set has two main roles:

1. Evaluate model performance, which provides the feedback needed to adjust the hyperparameters
2. Adjust the hyperparameters so that the model performs best on the validation set

Notes:

1. Unlike the training and test sets, the validation set is optional. If you do not need to adjust hyperparameters, you can skip the validation set and evaluate the model directly on the test set.
2. The performance measured on the validation set is not the model's final performance; it is mainly used to adjust the hyperparameters. The model's final performance is judged by the evaluation result on the test set.
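
To make the division of labour concrete, here is a minimal sketch of tuning one hyperparameter on the validation set and touching the test set only once at the end. The toy 1-D data, the nearest-neighbour classifier, and the candidate values for k are all assumptions chosen for illustration; they are not taken from this article.

```python
import random


def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0


def accuracy(train, eval_set, k):
    """Fraction of eval_set points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in eval_set)
    return hits / len(eval_set)


# Toy data (hypothetical): label is 1 when x > 0.5, with about 10% of labels flipped.
rng = random.Random(0)
points = []
for _ in range(300):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    points.append((x, y))

# 60% / 20% / 20% split (the points are already in random order).
train, val, test = points[:180], points[180:240], points[240:]

# Tune the hyperparameter k using the validation set only.
candidate_ks = [1, 3, 5, 9, 15]
best_k = max(candidate_ks, key=lambda k: accuracy(train, val, k))

# The test set is used once, after k has been fixed.
test_accuracy = accuracy(train, test, best_k)
```

Because best_k was chosen to look good on the validation set, the validation accuracy tends to be optimistic; test_accuracy is the number to report.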

What is a test set?

After we have adjusted the hyperparameters, the "final exam" begins: we use the Test Dataset for the final evaluation.

Evaluating on the test set gives us the final evaluation metrics, such as accuracy, precision, recall, and F1.
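
As a rough illustration of how these metrics are computed for a binary classifier, the sketch below counts true/false positives and negatives and derives the four numbers from them. The function name and the sample labels are made up for this example.

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels and predictions for a held-out test set.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_classification_metrics(y_true, y_pred)
# Here tp=3, fp=1, fn=1, tn=3, so all four metrics come out to 0.75.
```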

How to divide the data set reasonably?

The partitioning methods below mainly apply to holdout validation. There are also other cross-validation methods; for details, see the Cross-validation section below.

There is no fixed rule for dividing the data, but you can refer to 3 principles:

1. For small-scale sample sets (on the order of tens of thousands), a commonly used allocation is 60% training set, 20% validation set, and 20% test set.
2. For large-scale sample sets (a million samples or more), the validation and test sets only need to be large enough in absolute terms. For example, with 1 million samples, set aside 10,000 for the validation set and 10,000 for the test set. With 10 million samples, 10,000 each for the validation set and the test set is still enough.
3. The fewer the hyperparameters, or the easier they are to adjust, the smaller the validation set can be, with more data allocated to the training set.
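
A minimal sketch of such a split, using only the Python standard library. The function name split_dataset and its parameters are hypothetical; the defaults follow the 60/20/20 ratio from principle 1.

```python
import random


def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle samples and split them into train / validation / test lists."""
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over
    return train, val, test


data = list(range(10_000))  # stand-in for 10,000 labelled samples
train, val, test = split_dataset(data)
# len(train), len(val), len(test) -> 6000, 2000, 2000
```

For a data set of 1 million samples, principle 2 would correspond to something like split_dataset(data, train_frac=0.98, val_frac=0.01), which leaves 10,000 samples each for validation and test.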

Cross-validation

Why use cross-validation?

Suppose we teach a child addition: 1 apple + 1 apple = 2 apples

When we test the child, we ask: 1 banana + 1 banana = how many bananas?

If the child answers "2 bananas", and still answers correctly when the fruit is swapped for something else, then we consider that the child has learned that "1 + 1 = 2".

When evaluating whether a model has learned "a certain skill", we likewise need to evaluate it with new data rather than data from the training set. Keeping the data used for training completely separate from the data used for testing is the idea behind cross-validation.

3 mainstream cross-validation methods

Holdout cross validation

As described above, the data set is statically divided, at a fixed ratio, into a training set, a validation set, and a test set. This is the holdout method.

Leave one out cross validation

Each test set contains only one sample, so m rounds of training and prediction are performed, where m is the number of samples. Each round trains on only one sample fewer than the full data set, so the training data is closest to the distribution of the original samples. However, the training cost increases, because the number of models equals the number of original samples. This method is generally used when data is scarce.
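
The index bookkeeping for leave-one-out can be sketched in a few lines. The generator name below is hypothetical, and the actual model training for each split is left out.

```python
def leave_one_out_splits(m):
    """Yield (train_indices, test_index) pairs for a data set of m samples."""
    for i in range(m):
        train_indices = [j for j in range(m) if j != i]
        yield train_indices, i


splits = list(leave_one_out_splits(4))
# splits == [([1, 2, 3], 0), ([0, 2, 3], 1), ([0, 1, 3], 2), ([0, 1, 2], 3)]
```

With m samples this produces m splits, which is why m models have to be trained.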

k-fold cross validation

The static holdout method is sensitive to how the data is divided: different divisions may produce different models. K-fold cross-validation is a dynamic validation method that reduces the impact of any single data partition. The specific steps are as follows:

1. Divide the data set into a training set and a test set, and set the test set aside.
2. Divide the training set into k folds.
3. Each time, use 1 of the k folds as the validation set and the remaining k-1 folds as the training set.
4. After k rounds of training, we obtain k different models.
5. Evaluate the k models (for example, by averaging their validation scores) and select the best hyperparameters.
6. Using the optimal hyperparameters, retrain the model on all k folds as the training set to get the final model.

k is usually set to 10. When the amount of data is small, k can be set larger, so that the training set accounts for a larger proportion of each round, but the number of models to train also increases. When the amount of data is large, k can be set smaller.
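
The k-fold procedure above can be sketched as follows. This is only an outline under stated assumptions: n_samples refers to the training portion that remains after the test set has been set aside, and train_and_score is a hypothetical caller-supplied function that trains a model on the given indices and returns its validation score. The dummy scorer exists only so that the sketch runs on its own.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(train_and_score, n_samples, k, hyperparameter):
    """Average validation score of one hyperparameter value over k folds."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, val_idx in enumerate(folds):
        # Every fold except fold i is used for training in this round.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, val_idx, hyperparameter))
    return sum(scores) / k


# Dummy scorer for illustration only: pretends the score peaks at hyperparameter == 3.
def dummy_train_and_score(train_idx, val_idx, hyperparameter):
    assert not set(train_idx) & set(val_idx)  # training and validation must not overlap
    return 1.0 / (1 + abs(hyperparameter - 3))


candidates = [1, 3, 5]
best = max(candidates,
           key=lambda h: cross_validate(dummy_train_and_score, 100, 10, h))
# best is 3 with this dummy scorer
```

In a real project, the final step would be to retrain on all k folds with the selected hyperparameter and then evaluate once on the test set that was set aside in step 1.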