
To build a well-performing machine learning (ML) model, the model must be trained and tested on data from the same target distribution. However, sometimes we can only collect a limited amount of data from that distribution, and it may not be enough to build the required training/development/test sets.

At the same time, similar data from other distributions may be readily available. What should we do in this situation? Let's discuss some ideas!

Some background knowledge

If you are not yet familiar with basic machine learning concepts, you can read this section carefully to better understand the content of this article:

· Training sets, development sets, and test sets: Note that the dev set is also known as the validation or hold-out set.

· Bias (underfitting) and variance (overfitting) errors: a very brief explanation of these errors.

· How to split the training/development/test set correctly.

Scenario

Suppose you are building a dog image classifier app that determines whether a given image contains a dog.

The app is for users in rural areas who can take pictures of animals on mobile devices so that the app can classify animals for them.

By studying the target data distribution, we find that the images are mostly blurry and low resolution, as shown in the following figure:

Left: dog (Italian Volpino breed); right: Arctic fox

You were only able to collect about 8,000 such images, which is not enough to build a training/development/test set. Suppose you have determined that you need at least 100,000 images.

You wonder whether you can use images from another dataset, together with the 8,000 images you collected, to build the training/development/test sets.

You realize that it would be easy to crawl the web and build a dataset of 100,000 images or more, with roughly the same ratio of dog to non-dog images as you need.

However, this web dataset clearly comes from a different distribution, with sharp, high-resolution images such as:

Dog images (left and right) and a fox image (middle)

 How to build a training/development/test set?

You can't build the training/development/test sets from the 8,000 collected images alone, because they are not enough to train a good classifier. In general, computer vision, like other natural perception problems (speech recognition, natural language processing), requires a large amount of data.

Nor can you rely only on the web dataset: a classifier trained on sharp, high-resolution web images will not handle users' blurry photos well. So what should you do? Let's consider some options.

One possible option - data shuffling

What you can do is combine the two data sets and randomly shuffle them. The resulting data set is then split into training/development/test sets.

Suppose you decide to split into training/development/test sets in a 96%/2%/2% ratio. The process looks like this:

Once the split is complete, the training/development/test sets all come from the same (mixed) distribution, as required and as shown above.
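As a rough sketch of this shuffling approach (the directory names are hypothetical, and the image lists are assumed to be plain file paths):

```python
import glob
import random

# Hypothetical directories for the two data sources.
web_images = glob.glob("web_crawl/*.jpg")        # ~100,000 sharp web images
target_images = glob.glob("mobile_photos/*.jpg") # ~8,000 blurry target-distribution images

# Combine and shuffle so every split draws from the same mixed distribution.
all_images = web_images + target_images
random.seed(42)
random.shuffle(all_images)

# 96% / 2% / 2% split of the combined ~108,000 images.
n = len(all_images)
n_dev = n_test = int(0.02 * n)
dev_set = all_images[:n_dev]
test_set = all_images[n_dev:n_dev + n_test]
train_set = all_images[n_dev + n_test:]
```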

However, there is a big drawback here!

Look at the development set: of its 2,000 images, only about 148 on average come from the target distribution, since target images make up only 8,000 of the roughly 108,000 total.

This means that you are mostly optimizing the classifier for the web image distribution (1,852 of the 2,000 images) - and this is not what you want!

The situation is similar when evaluating the classifier on the test set. Therefore, this is not a good way to split the training/development/test sets.

A better option

Another option is to have the development/test sets come from the target-distribution dataset, and the training set (mostly) from the web dataset.

Suppose you still split into training/development/test sets in the 96%/2%/2% ratio as before. The development and test sets each contain 2,000 images drawn from the target dataset, and all remaining images are assigned to the training set, as shown in the following image:

With this split, you will be optimizing the classifier to perform well on the target distribution, which is exactly what you care about, because the development set contains only target-distribution images.
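A minimal sketch of this alternative split, reusing the hypothetical `web_images` and `target_images` lists from the previous sketch:

```python
import random

random.seed(42)
random.shuffle(target_images)

# Development and test sets: 2,000 target-distribution images each.
dev_set = target_images[:2000]
test_set = target_images[2000:4000]

# Training set: all web images plus the remaining ~4,000 target images.
train_set = web_images + target_images[4000:]
random.shuffle(train_set)
```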

However, the training distribution is now different from the development/test distribution: the classifier is trained mostly on web images, so optimizing it for the target distribution will take more time and effort.

More importantly, you will not easily be able to tell whether the gap between the classifier's error on the development set and its error on the training set is due to variance (overfitting), to data mismatch, or to both.

Let's consider this question in more detail and see what we can do.

Variance vs. data mismatch

Consider the training/development/test split from the second option above. For simplicity, assume that human-level error is zero.

Also suppose you find that the training error is 2% and the development error is 10%. How much of the 8% gap between these two errors is due to the data mismatch between the two distributions, and how much is due to the model's variance (overfitting)? We do not know.

Let's modify the training/development/test split. Take a small portion of the training set and call it the "bridge" set. The bridge set is not used to train the classifier; it is kept as a separate evaluation set. This split produces four sets belonging to two data distributions, as follows:
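Continuing the sketch above, the bridge set can be carved out of the training set like this (the 2,000-image size is an illustrative choice):

```python
# Hold out a slice of the (mostly web-image) training set as a "bridge" set.
# The bridge set is never used for training; it is only an extra evaluation
# set that shares the training distribution.
random.seed(42)
random.shuffle(train_set)
bridge_set = train_set[:2000]
train_set = train_set[2000:]
```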

Variance error

With this split, suppose you find that the training and development errors are 2% and 10%, respectively, and that the bridge set error is 9%, as shown below:

Now, how much of the 8% gap between the training error and the development error is variance error, and how much is data mismatch error?

Very simple! The answer is 7% variance error and 1% data mismatch error. But why?

This is because the bridge set and the training set come from the same distribution, yet the error difference between them is 7%. The classifier performs much worse on unseen data even from its own training distribution, which means it is overfitting the training set: we have a high-variance problem.

Data mismatch error

Now assume instead that the bridge set error is 3%, with everything else the same as before, as shown below:

How much of the 8% gap between the training error and the development error is variance error, and how much is data mismatch error?

The answer is 1% variance error and 7% data mismatch error. Why?

This time, the classifier performs well on unseen data from the same distribution as its training data (the bridge set), but poorly on unseen data from a different distribution (the development set). Therefore, we have a data mismatch problem.
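This bookkeeping can be captured in a small helper function; a minimal sketch, assuming zero human-level error as above, with errors given in percent:

```python
def decompose_gap(train_err, bridge_err, dev_err):
    """Split the train-to-dev error gap into variance and data mismatch.

    The bridge set shares the training distribution, so:
      variance      = bridge_err - train_err  (overfitting to the training set)
      data mismatch = dev_err - bridge_err    (shift between the two distributions)
    """
    return {"variance": bridge_err - train_err,
            "data_mismatch": dev_err - bridge_err}

print(decompose_gap(2, 9, 10))  # {'variance': 7, 'data_mismatch': 1}
print(decompose_gap(2, 3, 10))  # {'variance': 1, 'data_mismatch': 7}
```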

Reducing variance is a common task in machine learning; for example, you can use regularization or a larger training set.

Reducing data mismatch errors is a more interesting issue, so let's discuss it below.

Mitigating data mismatch

To reduce data mismatch error, you need to somehow make the training set reflect the characteristics of the development/test data (the target distribution).

Collecting more data from the target distribution and adding it to the training set is usually the best option. However, if this is not possible (as we assumed at the start of the discussion), you can try the following approaches.

Error Analysis

Analyzing the errors on the development set, and how they differ from the errors on the training set, can suggest ways to address the data mismatch problem.

For example, if you find that many development set errors occur on images where the animal is against a rocky background, you can mitigate these errors by adding animal images with rocky backgrounds to the training set.
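One simple way to do this in practice is to tag each misclassified development set image by hand and count the tags; a sketch with hypothetical tag names:

```python
from collections import Counter

# Hypothetical hand-assigned tags for each misclassified development set image.
error_tags = ["rock_background", "blurry", "rock_background", "small_animal",
              "rock_background", "blurry", "rock_background"]

# Report the most common error categories, largest first.
for tag, count in Counter(error_tags).most_common():
    share = 100 * count / len(error_tags)
    print(f"{tag}: {count}/{len(error_tags)} errors ({share:.0f}%)")
```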

Artificial data synthesis

Another way to incorporate the characteristics of the development/test sets into the training set is to synthesize data with similar characteristics.

For example, as mentioned before, the images in our development/test sets are mostly blurry, while our training set consists mostly of sharp web images. You can artificially blur the images in the training set to make them more similar to the development/test sets, as shown below:

A training set image, before and after blurring
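A minimal sketch of this kind of artificial blurring using the Pillow library (the file paths and blur radius are illustrative assumptions):

```python
from PIL import Image, ImageFilter

def blur_like_target(path, radius=2):
    """Apply a Gaussian blur to a sharp web image so it better resembles
    the blurry, low-resolution photos in the target distribution."""
    image = Image.open(path)
    return image.filter(ImageFilter.GaussianBlur(radius=radius))

# Hypothetical paths; in practice you would loop over the web images in the training set.
blur_like_target("web_crawl/dog_001.jpg").save("train_blurred/dog_001.jpg")
```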


However, there is an important point to note here!

In the end, the classifier may end up overfitting to the artificial features you have created.

In our example, the blur generated artificially by a mathematical function may represent only a subset of the kinds of blur present in the target distribution images.

In other words, the blur in the target distribution can have many causes: fog, low-resolution cameras, and moving subjects, for example. The synthesized blur may not represent all of these causes.

In general, whenever you synthesize training data, whatever the problem (computer vision, speech recognition, and so on), the model may overfit to the synthesized portion of the dataset.

To the human eye, the synthesized dataset may seem representative enough of the target distribution, but in reality it may cover only a small part of it. Keep this in mind when using this powerful tool, data synthesis.

Final Thoughts

When developing a machine learning model, the training/development/test sets should ideally come from the same data distribution: the distribution of data users will encounter when using the model.

However, sometimes it is not possible to collect enough data from the target distribution to build a training/development/test set, while similar data in other distributions is readily available.

In this case, the development/test sets should come from the target distribution, while data from other distributions can be used to build (most of) the training set. Data mismatch mitigation techniques can then be used to reduce the distribution differences between the training set and the development/test sets.