There is much debate about whether data is the new driver of models [1] [2]. Whatever the conclusion, it does not change the fact that acquiring data in practice is expensive (labor costs, license fees, equipment running time, etc.).

Therefore, a key question in machine learning projects is how much training data we need to reach a specific performance target, such as a given classifier accuracy. In the literature, this question is also known as sample complexity.

In this article, we give a quick but broad review of current empirical findings and research results on training data size, from regression analysis to deep learning. Specifically, we will:

- Present empirical ranges of training data size for regression and computer vision tasks;
- Discuss how to determine the sample size given the power of a statistical test. This is a statistics topic, but because it is closely related to determining the amount of machine learning training data, it is included in this discussion;
- Present results from statistical learning theory on what determines the amount of training data;
- Answer the following question: does model performance keep improving as the training data grows? What happens in the case of deep learning?
- Propose a method for determining the amount of training data in a classification task;
- Finally, answer this question: is increasing the training data the best way to handle data imbalance?

**Empirical range of training data volume**

Let us first look at some widely used rules of thumb for determining the amount of training data, depending on the type of model:

**Regression analysis:** According to the one-in-ten rule of thumb, 10 samples are needed per predictor [3]. Other versions of this rule are discussed in [4], such as one-in-twenty to address shrinkage of the regression coefficients, and an interesting variant for binary logistic regression is proposed in [5].

Specifically, the authors estimate the amount of training data from the number of predictors, the total sample size, and the ratio of positive samples to the total sample size.
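As a minimal sketch, the samples-per-predictor rule of thumb can be written as a one-line calculation. The function name and its parameters are hypothetical and chosen only for illustration; the sketch covers the simple 1/10 and 1/20 rules, not the logistic regression variant from [5].

```python
def min_samples_rule_of_thumb(n_predictors: int, samples_per_predictor: int = 10) -> int:
    """Minimum sample size under an 'N samples per predictor' rule of thumb.

    Hypothetical helper: samples_per_predictor=10 gives the 1/10 rule,
    samples_per_predictor=20 gives the more conservative 1/20 rule.
    """
    if n_predictors < 1 or samples_per_predictor < 1:
        raise ValueError("both arguments must be positive integers")
    return n_predictors * samples_per_predictor


print(min_samples_rule_of_thumb(8))      # 80
print(min_samples_rule_of_thumb(8, 20))  # 160
```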

**Computer vision:** For image classification using deep learning, the rule of thumb is 1,000 images per class. This requirement can be significantly reduced if a pre-trained model is used [6].

**Determination of sample size in hypothesis testing**

Hypothesis testing is one of the tools data scientists use to test for differences between populations, for example to determine the efficacy of a new drug. Here it is often necessary to determine the sample size, given the power of the test.

Let's take a look at this example: A technology giant moved to City A, where house prices rose sharply. A reporter wants to know what the average price of an apartment is now.

If the standard deviation of apartment prices is 60K and the acceptable margin of error is 10K, how many apartment prices should he average so that the result has 95% confidence?

The calculation is as follows, where N is the sample size he needs and 1.96 is the standard normal quantile corresponding to 95% confidence:

$$N = \left(\frac{1.96 \times 60\text{K}}{10\text{K}}\right)^2 \approx 138$$

According to the above equation, the reporter needs to consider the prices of about 138 apartments.
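The reporter's calculation can be reproduced with a short sketch. It assumes the usual normal-approximation formula N = (z · σ / E)², with z taken from the standard normal distribution; the function name is hypothetical.

```python
from statistics import NormalDist


def sample_size_for_mean(std_dev: float, margin_of_error: float, confidence: float = 0.95) -> float:
    """Sample size needed to estimate a mean within +/- margin_of_error.

    Assumes a normal approximation: N = (z * sigma / E) ** 2, where z is the
    two-sided standard normal quantile for the requested confidence level.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    return (z * std_dev / margin_of_error) ** 2


n = sample_size_for_mean(std_dev=60_000, margin_of_error=10_000, confidence=0.95)
print(round(n, 1))  # about 138.3
```

In practice the result is often rounded up to the next whole number so that the margin of error is not exceeded.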

The above formula varies with the specific test, but it always includes a confidence level, an acceptable margin of error, and a measure of standard deviation. A good discussion of this topic can be found in [7].

**Statistical learning theory of training data scale**

Let us first introduce the famous Vapnik-Chervonenkis dimension (VC dimension) [8]. The VC dimension is a measure of model complexity: the more complex the model, the larger its VC dimension. In the next paragraphs, we will introduce a formula that expresses the training data size in terms of VC.

First, let's look at an example that is often used to show how the VC dimension is calculated: suppose our classifier is a straight line on a two-dimensional plane, and there are 3 non-collinear points to classify.

Regardless of the positive/negative combination of these 3 points (all positive, 2 positive, 1 positive, etc.), a straight line can correctly separate the positive samples from the negative ones.

We say that the linear classifier can shatter these points, so its VC dimension is at least 3. Because no set of 4 points can be shattered by a line, the VC dimension of the linear classifier is exactly 3. It turns out that the training data size N is a function of VC [8]:

$$N = F\left(\frac{VC + \ln(1/d)}{\epsilon}\right)$$

Here, d is the probability of failure and epsilon is the learning error. Therefore, as [9] points out, the amount of data required for learning depends on the complexity of the model. An obvious consequence is the well-known hunger of neural networks for training data, given how complex they are.
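The shattering example can be checked numerically with a small sketch. It uses a plain perceptron as the separability test, which is an assumption made here for simplicity: the perceptron is guaranteed to converge when a separating line exists, but hitting the epoch limit only suggests (and does not prove) that no such line exists. The point sets and function names are illustrative. Note also that the sketch tests one particular set of 4 points, whereas the VC dimension claim is that no set of 4 points can be shattered.

```python
from itertools import product


def linearly_separable(points, labels, max_epochs=1000):
    """Return True if a perceptron finds a separating line within max_epochs."""
    w0, w1, b = 0.0, 0.0, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x0, x1), y in zip(points, labels):
            # A point is misclassified if it lies on the wrong side (or on the line).
            if y * (w0 * x0 + w1 * x1 + b) <= 0:
                w0 += y * x0
                w1 += y * x1
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def shattered(points):
    """True if every +1/-1 labeling of the points is linearly separable."""
    return all(
        linearly_separable(points, labels)
        for labels in product((-1, 1), repeat=len(points))
    )


three_points = [(0, 0), (1, 0), (0, 1)]         # non-collinear
four_points = [(0, 0), (1, 1), (1, 0), (0, 1)]  # contains the XOR labeling

print(shattered(three_points))  # True
print(shattered(four_points))   # False
```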

**As the training data increases, will the performance of the model continue to improve? What about deep learning?**

Figure 1 shows how the performance of machine learning algorithms varies with the amount of data, both for traditional machine learning algorithms [10] (regression, etc.) and for deep learning [11].

Specifically, for traditional machine learning algorithms, performance grows according to a power law and then plateaus. The literature [12]-[16], [18] shows how performance changes for deep learning as the amount of data increases.

Figure 1 reflects the consensus of most current studies: for deep learning, performance keeps increasing with the amount of data according to a power law.

For example, in [13], the authors used deep learning techniques to classify 300 million images, and they found that model performance increased logarithmically with the amount of training data.

Let's now look at some noteworthy results in deep learning that contradict this. Specifically, in [15], the authors apply a convolutional network to a dataset of 100 million Flickr images and captions.

Regarding the size of the training set, they report that model performance increases with the amount of data. However, after 50 million images, it plateaus.

In [16], the authors found that image classification accuracy increases with the size of the training set. However, the robustness of the model starts to decrease beyond a certain model-specific point.

**Method for determining the amount of training data in a classification task**

The well-known learning curve is usually a plot of error against the amount of training data. [17] and [18] are good references on learning curves in machine learning and on how they change with bias or variance. In Python, scikit-learn also provides a learning curve function [17].
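As a minimal sketch of the scikit-learn function mentioned above, the following computes a learning curve with `sklearn.model_selection.learning_curve`. The synthetic dataset, the logistic regression model, and the chosen training sizes are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic binary classification data (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Evaluate the model on 5 increasing training set sizes with 5-fold cross-validation.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Each row of val_scores holds the per-fold scores for one training size.
for n, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{int(n):4d} training samples -> mean validation accuracy {score:.3f}")
```

Plotting the mean validation score against `train_sizes` gives the accuracy-based form of the learning curve used for classification tasks.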

In the classification task, we usually use a slightly different form of learning curve: the relationship between classification accuracy and the amount of training data.

The method for determining the amount of training data is simple: first determine the form of the learning curve for the task, then find the point on the curve that corresponds to the required classification accuracy. For example, in [19] and [20], the authors apply the learning curve approach in the medical domain and represent the curve with a power-law function:

$$y = 100 + b_1 x^{b_2}$$

In the above formula, y is the classification accuracy, x is the training set size, and b1 and b2 correspond to the learning rate and the decay rate, respectively. The parameter values vary from problem to problem and can be estimated using nonlinear regression or weighted nonlinear regression.
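The learning curve method can be sketched with a nonlinear least-squares fit. The sketch assumes the curve has the form y = 100 + b1 · x^b2 with y in percent (so b1 and b2 are both negative), uses `scipy.optimize.curve_fit` as one possible nonlinear regression routine, and fits synthetic, noise-free accuracy values that are invented for illustration only. The helper names are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit


def power_law(x, b1, b2):
    """Assumed learning curve: accuracy (in percent) as a function of training size."""
    return 100.0 + b1 * np.power(x, b2)


# Synthetic observations generated from b1 = -60, b2 = -0.5 (illustrative only).
sizes = np.array([50, 100, 200, 400, 800, 1600], dtype=float)
accuracy = power_law(sizes, -60.0, -0.5)

# Unweighted nonlinear least squares; p0 is a rough initial guess.
(b1, b2), _ = curve_fit(power_law, sizes, accuracy, p0=(-50.0, -0.3))


def required_samples(target_accuracy, b1, b2):
    """Invert the fitted curve to find the training size for a target accuracy."""
    return ((target_accuracy - 100.0) / b1) ** (1.0 / b2)


n_needed = required_samples(99.0, b1, b2)
print(f"b1={b1:.2f}, b2={b2:.2f}, samples for 99% accuracy: {n_needed:.0f}")
```

With real measurements, the fit would be noisy, and a weighted fit (for example through the `sigma` argument of `curve_fit`) could give more weight to the larger, more reliable training sizes.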

**Is increasing training data the best way to handle data imbalance?**

This question is addressed in [9]. The authors make an interesting point: accuracy is not the best measure of classifier performance when the data is imbalanced.

**The reason is very straightforward:** Let us assume that negative samples are the overwhelming majority. If we then predict negative most of the time, we achieve high accuracy.

Instead, they suggest that precision and recall (also known as sensitivity) are the most appropriate measures of performance on imbalanced data. Beyond the obvious problem with accuracy described above, the authors also argue that measuring precision is inherently more important in imbalanced domains.

For example, in a hospital alarm system [9], high precision means that when the alarm sounds, the patient is very likely to have a problem.
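The point about accuracy on imbalanced data can be illustrated with a small sketch. The dataset (10 positives among 1,000 samples) and the always-negative classifier are invented for illustration; the metric definitions are the standard ones.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Convention used here: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# 10 positives, 990 negatives: a heavily imbalanced dataset.
y_true = [1] * 10 + [0] * 990

# A classifier that always predicts the majority (negative) class.
always_negative = [0] * 1000

print(classification_metrics(y_true, always_negative))  # (0.99, 0.0, 0.0)
```

The accuracy of 99% hides the fact that not a single positive sample is detected, which is exactly what precision and recall reveal.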

Having chosen an appropriate performance measure, the authors compared the imbalance correction methods in imbalanced-learn [21] (a Python library compatible with scikit-learn) with simply using a larger training data set.

Specifically, they applied K-nearest neighbors with imbalance correction techniques to a drug-related dataset of 50,000 samples. These techniques include undersampling, oversampling, and ensemble learning. They then compared the results with the same classifier trained on the original dataset of about 1 million samples.
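To make the undersampling and oversampling ideas concrete, here is a plain-Python sketch of their simplest random variants. This is not the imbalanced-learn API and not the authors' experimental setup; the function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Group samples into a dict keyed by class label."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    n_min = min(len(items) for items in groups.values())
    pairs = [(s, lab) for lab, items in groups.items() for s in rng.sample(items, n_min)]
    rng.shuffle(pairs)
    return [s for s, _ in pairs], [lab for _, lab in pairs]


def random_oversample(samples, labels, seed=0):
    """Randomly duplicate samples so every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    n_max = max(len(items) for items in groups.values())
    pairs = []
    for lab, items in groups.items():
        extra = [rng.choice(items) for _ in range(n_max - len(items))]
        pairs.extend((s, lab) for s in items + extra)
    rng.shuffle(pairs)
    return [s for s, _ in pairs], [lab for _, lab in pairs]


# Toy data: 5 positives and 95 negatives.
samples = list(range(100))
labels = [1] * 5 + [0] * 95

_, under_labels = random_undersample(samples, labels)
_, over_labels = random_oversample(samples, labels)
print(Counter(under_labels))  # 5 samples per class
print(Counter(over_labels))   # 95 samples per class
```

Both corrections only rebalance the existing samples; neither adds new information, which is consistent with the comparison against a larger training set described here.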

**The authors repeated the experiment 200 times, and the final conclusion was simple and profound: no imbalance correction technique matches adding more training data in terms of precision and recall.**

At this point, we have reached the end of this trip. The following resources can help you learn more about this topic. Thank you for reading!

**References**

This article was reposted from towardsdatascience.
