We often hear that massive data sets are the key to building a successful machine learning project.

Here is a major issue: many organizations don't have the data you need.

How do we build and validate machine learning concepts without the most basic raw material? How do we efficiently acquire and create value from data when resources are scarce?

At my workplace, we produce a lot of functional prototypes for our customers, so I often need to make small data go a long way. In this article, I will share 7 tips to improve your results when prototyping with small data sets.

1: Realize that your model won't generalize well.

This should be the first order of business. You are building a model whose knowledge is based on a tiny fraction of the universe, and that is the only place or situation where it can be expected to work well.

If you are building a computer vision prototype from a selection of indoor photos, don't expect it to work well outdoors. If you have a language model based on chat-room banter, don't expect it to work for fantasy novels.

Make sure your manager or customer understands this. That way, everyone can align on realistic expectations for the results the model should deliver. It also creates an opportunity to propose useful new KPIs to quantify model performance both inside and outside the prototype's scope.

2: Build a good data infrastructure.

In many cases, the client will not have the data you need, and public data will not be an option. If part of your prototype requires collecting and labeling new data, make sure your infrastructure creates as little friction as possible.

You need to make sure data labeling is very easy, so that it is approachable for non-technical people as well. We have started using Prodigy, which I think is a great tool: both accessible and extensible. Depending on the size of the project, you may also want to set up automated data ingestion, which can receive new data and automatically feed it to the labeling system.

If getting new data into the system is quick and easy, you will get more data.
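
To make the ingestion idea concrete, here is a hypothetical minimal sketch of a low-friction pipeline: it simply polls a shared drop folder and moves any new files into a queue for the labeling tool. The folder names and the polling interval are assumptions for illustration, not part of any particular tool's setup.

```python
# Hypothetical sketch: poll a shared "incoming" folder and move new files
# into a queue directory that the labeling tool reads from.
import shutil
import time
from pathlib import Path

INCOMING = Path("incoming")        # where anyone can drop new raw data
LABEL_QUEUE = Path("label_queue")  # consumed by the labeling tool

def ingest_once():
    """Move any newly dropped images into the labeling queue."""
    LABEL_QUEUE.mkdir(exist_ok=True)
    for path in INCOMING.glob("*.jpg"):
        # Moving (rather than copying) ensures each file is queued only once.
        shutil.move(str(path), str(LABEL_QUEUE / path.name))

if __name__ == "__main__":
    while True:
        ingest_once()
        time.sleep(60)  # check for new data every minute
```

The point is not the specific script but the principle: the less manual work it takes to get a new example in front of a labeler, the more labeled data you will end up with.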

3: Do some data augmentation.

You can often extend your data set by augmenting the data you already have. The idea is to make small changes to the data that should not significantly change the model output. For example, an image of a cat is still an image of a cat if it is rotated 40 degrees.

In most cases, augmentation techniques let you generate many more "semi-unique" data points for training your model. As a starting point, you can try adding small amounts of Gaussian noise to your data.

For computer vision, there are many neat ways to augment your images. I have had positive experiences with the Albumentations library, which can perform many useful image transformations while keeping your labels intact.

Photo credit: Albumentations on GitHub
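
Below is a minimal sketch of what an Albumentations pipeline can look like (assuming the library is installed via `pip install albumentations`). The specific transforms and probability values are illustrative, not a recommendation from the article.

```python
# A minimal Albumentations augmentation pipeline (illustrative values).
import albumentations as A
import cv2

transform = A.Compose([
    A.Rotate(limit=40, p=0.5),              # a rotated cat is still a cat
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.GaussNoise(p=0.3),                    # the Gaussian-noise idea from above
])

image = cv2.imread("cat.jpg")                     # hypothetical input image
augmented = transform(image=image)["image"]       # a new "semi-unique" sample
```

Run this transform once per training example per epoch and every pass over your small data set effectively looks slightly different to the model.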

Another augmentation technique that many people find useful is Mixup. This technique literally takes two input images, blends them together, and combines their labels.

Photo credit: Cecilia Summers and Michael J. Dinneen
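
Here is a small NumPy sketch of the Mixup idea on a single pair of images with one-hot labels. The image shapes and the Beta-distribution parameter are assumptions for illustration.

```python
# Mixup on one pair of examples: blend the images and blend the labels.
import numpy as np

def mixup(image1, label1, image2, label2, alpha=0.2):
    # Sample the mixing coefficient from a Beta distribution.
    lam = np.random.beta(alpha, alpha)
    mixed_image = lam * image1 + (1.0 - lam) * image2
    mixed_label = lam * label1 + (1.0 - lam) * label2
    return mixed_image, mixed_label

# Example: blend two 64x64 RGB images and their one-hot labels.
img_a = np.random.rand(64, 64, 3)
img_b = np.random.rand(64, 64, 3)
cat = np.array([1.0, 0.0])
dog = np.array([0.0, 1.0])
mixed_img, mixed_lbl = mixup(img_a, cat, img_b, dog)
```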

When augmenting other types of input data, some care is needed to decide which transformations would change the label and which would not.

4: Generate some synthetic data.

If you have exhausted your options for augmenting real data, you can start thinking about creating some fake data. Generating synthetic data can also be a great way to cover edge cases that your real data set does not.

For example, many reinforcement learning systems for robotics (such as OpenAI's Dactyl) are trained in simulated 3D environments before being deployed to real robots. For image recognition systems, you can similarly build 3D scenes that provide you with thousands of new data points.

15 simulated instances of Dactyl training in parallel.

There are many approaches to creating synthetic data. At Kanda, we are developing a turntable-based solution to create data for object detection. If you have very high data requirements, you could consider using Generative Adversarial Networks to create synthetic data. Be aware that GANs are notorious for being hard to train, so make sure it will be worth it first.

NVIDIA's GauGAN in action!

Sometimes you can combine approaches: Apple had a very clever way of using GANs to process images of 3D-modeled faces so that they look more photorealistic. It's an awesome technique for extending your data set, if you have the time.
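
For a much lighter-weight flavor of synthetic data than a 3D pipeline or a GAN, a common trick is to composite cropped objects onto background photos and record the resulting bounding box. The sketch below is hypothetical (it is not the turntable or GAN setups mentioned above); the file paths, scale range, and the assumption that the scaled object fits on the background are all illustrative.

```python
# Hypothetical sketch: paste an object crop (with alpha channel) onto a
# background photo to create a synthetic object-detection sample.
import random
from PIL import Image

def make_synthetic_sample(background_path, object_path, out_path):
    background = Image.open(background_path).convert("RGB")
    obj = Image.open(object_path).convert("RGBA")

    # Scale the object relative to the background (assumes it then fits).
    target_w = int(background.width * random.uniform(0.2, 0.4))
    target_h = int(obj.height * target_w / obj.width)
    obj = obj.resize((target_w, target_h))

    # Pick a random position, paste using the alpha channel as a mask,
    # and return the bounding-box label for the pasted object.
    x = random.randint(0, background.width - target_w)
    y = random.randint(0, background.height - target_h)
    background.paste(obj, (x, y), obj)
    background.save(out_path)
    return (x, y, x + target_w, y + target_h)
```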

5: Beware of lucky splits.

When training a machine learning model, it is very common to randomly split the data set into training and test sets according to some ratio. Usually, this is fine. However, when working with small data sets, there is a high risk of noise because of the low number of training examples.

In this case, you may accidentally get a lucky split: a particular split of the data set where your model performs and generalizes well on the test set. In reality, this may simply be because the test set (by coincidence) contains no difficult examples.

In this scenario, k-fold cross-validation is a better choice. Essentially, you split the data set into k "folds" and train a new model for each k, where one fold is used as the test set and the rest are used for training. This controls for the possibility that the test performance you see is due only to a lucky (or unlucky) split.
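
Here is a minimal scikit-learn sketch of 5-fold cross-validation. The data set and classifier are placeholders chosen only so the snippet runs end to end.

```python
# 5-fold cross-validation with scikit-learn (placeholder data and model).
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv)

# One score per fold; the spread shows how much a single "lucky" or
# "unlucky" split could have skewed your impression of the model.
print(scores, scores.mean(), scores.std())
```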

6: Use transfer learning.

If you are working with a somewhat standardized data format, such as text, images, video, or sound, you can leverage all the previous work others have put into these domains by using transfer learning. It's like standing on the shoulders of giants.

When you do transfer learning, you take a model that someone else has built (usually, "someone else" is Google, Facebook, or a major university) and fine-tune it to your specific needs.

Transfer learning works because most tasks involving language, images, or sound share many fundamental features. For computer vision, that could be detecting certain types of shapes, colors, or patterns.

Recently, I developed an object detection prototype for a client with high accuracy requirements. I was able to greatly speed up development by fine-tuning a MobileNet Single Shot Detector that had already been trained on Google's Open Images v4 dataset (roughly 9 million labeled images!). After a day of training, I was able to produce a fairly robust object detection model with a test mAP of 0.85 using only ~1500 labeled images.
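
To illustrate the general fine-tuning pattern (not the exact Open Images detector described above), here is a minimal Keras sketch: load a backbone pretrained on ImageNet, freeze it, and train a small new head on your own classes. The input size, number of classes, and dataset variables are assumptions for illustration.

```python
# Minimal transfer-learning sketch with a pretrained MobileNetV2 backbone.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze the pretrained features at first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),  # e.g. 3 custom classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(train_ds, validation_data=val_ds, epochs=10)  # your small data set
```

After the new head converges, you can optionally unfreeze some of the top backbone layers and continue training with a low learning rate.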

7: Try an ensemble of "weak learners."

Sometimes you just have to face the fact that you don't have enough data to do anything fancy. Fortunately, there are many traditional machine learning algorithms you can fall back on that are less sensitive to the size of the data set.

Algorithms such as Support Vector Machines are a good choice when the data set is small and the dimensionality of the data points is high.

Unfortunately, these algorithms are not always as accurate as state-of-the-art methods. That is why they can be called "weak learners," at least compared to highly parameterized neural networks.

One way to improve performance is to combine several of these "weak learners" (this could be an array of Support Vector Machines or Decision Trees) so that they "work together" to produce predictions. This is what Ensemble Learning is all about.
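
A small scikit-learn sketch of this idea is shown below: a soft-voting ensemble of an SVM and a shallow decision tree. The choice of estimators, hyperparameters, and the placeholder data set are illustrative only.

```python
# Ensemble of "weak learners" via soft voting (placeholder data and models).
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),            # needed for soft voting
        ("tree", DecisionTreeClassifier(max_depth=3)),
    ],
    voting="soft",  # average predicted probabilities across the learners
)

print(cross_val_score(ensemble, X, y, cv=5).mean())
```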

This article was originally published on Towards Data Science.