NVIDIA's GAN work has been on fire for a while now, and the team has just made another big move! In the past, image-to-image translation required a large number of training images, but this new work from NVIDIA performs the translation with only a small number of samples (and the code is open source)!

Few samples, big results!

When we see a standing tiger, it is easy to imagine how it would look lying down.

This is because we can draw associations from the lying postures of other animals.

For a machine, however, it is not that simple: existing unsupervised image-to-image translation models require large numbers of training images.

Moreover, such a model can only translate an image if the objects in the image already exist in the training set.

Recently, NVIDIA, Cornell University and Aalto University jointly published a paper on few-shot unsupervised image-to-image translation (FUNIT).

Paper address:


To put it simply, given a photo of a golden retriever, the model can make an animal it sees for the very first time at test time stick out its tongue, close its mouth, or tilt its head, just like the golden retriever.

If you input a picture of fried noodles, the model can likewise turn other foods into fried noodles.

The authors also provide an online demo, and the Xinzhiyuan editors tested it with their own cats, "Watermelon" and "Dobby":

Result for input "Watermelon"

Result for input "Dobby"

The online demo link is below, so readers can try it out right away:


The project's code is also open source, at the following address:


Two-stage image translation, and it's fascinating!

The proposed FUNIT framework aims to map an image of a source class to an analogous image of a target class by leveraging a few target-class images that are made available at test time.

To train FUNIT, we use images from a set of object classes (e.g., images of various animal species), called the source classes. We do not assume the existence of paired images between any two classes (i.e., no two animals of different species are shown in exactly the same pose).

We use images from the source classes to train a multi-class unsupervised image-to-image translation model.

At test time, we are given a few images from a new object class, called the target class. The model must use these few target images to translate any source-class image into an analogous image of the target class.

Training. The training set consists of images of various object classes (the source classes). We train a model to translate images between these source classes.

Deployment. We show the trained model a very small number of images from the target class, which is enough to translate source-class images into analogous target-class images, even though the model never saw a single target-class image during training.
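The key point of the training/deployment split can be sketched as a simple class partition (illustrative Python, not the official FUNIT code; the class names are hypothetical): target classes are held out entirely, so the model meets them for the first time at test time.

```python
import random

# Toy illustration of the few-shot setup: translation is learned only
# between source classes, while target classes stay completely unseen
# until deployment.
all_classes = [f"species_{i}" for i in range(10)]  # hypothetical class names
random.seed(0)
random.shuffle(all_classes)

source_classes = all_classes[:8]   # used for training
target_classes = all_classes[8:]   # shown to the model only at test time

# Sanity check: no target class leaks into the training set.
assert not set(source_classes) & set(target_classes)
print(len(source_classes), len(target_classes))  # → 8 2
```

At deployment, only a handful of images from one of the held-out `target_classes` are provided alongside a source-class content image.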

Note that the FUNIT generator takes two inputs: 1) a content image; 2) a set of target-class images. Its goal is to generate a translation of the input image that resembles the images of the target class.

Our framework consists of a conditional image generator G and a multi-task adversarial discriminator D.

Unlike the conditional image generators in existing unsupervised image-to-image translation frameworks, which take a single image as input, our generator G simultaneously takes a content image x and a set of K class images {y1, …, yK} as input and produces the output image x̄, as the following formula shows:

x̄ = G(x, {y1, …, yK})
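How a single content image combines with a set of K class images can be illustrated with a toy sketch (a drastically simplified stand-in, not the paper's actual architecture; the one-line "encoders" and "decoder" here are invented for illustration): a class code is extracted from each of the K class images and averaged, then fused with the content code.

```python
import numpy as np

def content_encoder(x):
    # Toy "pose/layout" code: the centered content image.
    return x - x.mean()

def class_encoder(y):
    # Toy "appearance" code: per-image mean and spread.
    return np.array([y.mean(), y.std()])

def decoder(content_code, class_code):
    # Re-style the content with the target class's appearance statistics.
    mean, std = class_code
    return content_code * std + mean

def G(x, ys):
    # Average the class codes over the K class images, as in the formula above.
    class_code = np.mean([class_encoder(y) for y in ys], axis=0)
    return decoder(content_encoder(x), class_code)

x = np.random.rand(8, 8)                       # content image x
ys = [np.random.rand(8, 8) for _ in range(5)]  # K = 5 class images y1..y5
x_bar = G(x, ys)                               # output image x̄
print(x_bar.shape)
```

Averaging over the K class images is what lets the same generator accept anywhere from one to many target examples at test time.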

Experimental results: pose preserved while the class changes, surpassing the baseline models

Main results

As shown in Table 1, on the few-shot unsupervised image-to-image translation task, the FUNIT framework outperforms the baseline models on both the Animal Faces and North American Birds datasets.

Under the 1-shot and 5-shot settings respectively, FUNIT achieves Top-5 test accuracies of 82.36 and 96.05 on the Animal Faces dataset, and 60.19 and 75.75 on the North American Birds dataset.
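The Top-5 test accuracy quoted here is the standard classification metric: a translated image counts as correct when a classifier ranks the true target class among its five highest-scoring predictions. A minimal sketch of the metric itself (toy logits, not the paper's evaluation code):

```python
import numpy as np

def top_k_accuracy(logits, labels, k=5):
    """Fraction of samples whose true label is among the k highest logits."""
    top_k = np.argsort(logits, axis=1)[:, -k:]  # indices of the k largest scores
    hits = [label in row for row, label in zip(top_k, labels)]
    return float(np.mean(hits))

# Two toy samples over six classes.
logits = np.array([[0.1, 0.9, 0.3, 0.8, 0.2, 0.7],
                   [0.6, 0.1, 0.2, 0.9, 0.4, 0.3]])
labels = np.array([2, 5])
print(top_k_accuracy(logits, labels, k=5))  # → 1.0
```

With k=1 this reduces to ordinary accuracy; the paper's Top-5 variant is more forgiving when many animal classes look alike.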

These numbers significantly surpass those of the corresponding baseline models.

Table 1: Performance comparison between FUNIT and the baseline models. ↑ means higher is better; ↓ means lower is better.

Figure 2 visualizes the few-shot translation results computed by the FUNIT-5 model.

Figure 2: Visualization of unsupervised image-to-image translation results, computed with the FUNIT-5 model.

From top to bottom, the rows show results on the animal face, bird, flower and food datasets. Each example shows two randomly sampled target-class images, the input content image x, and the translated output image x̄.

The results show that the model can successfully translate source-class images into analogous images of the novel class. The pose of the object remains essentially unchanged between the input content image x and the corresponding output image x̄, and the output images are highly realistic, resembling the images of the target class.

Figure 3 compares the results of FUNIT with the baseline models. As the figure shows, FUNIT generates high-quality translation outputs.

Figure 3: Comparison of few-shot image-to-image translation results.

From left to right, the columns show the input content image x, the two input target-class images y1 and y2, the translation results of the unfair StarGAN baseline, the translation results of the fair StarGAN baseline, and the results of the FUNIT framework.

Reference link:




This article is reproduced from the WeChat public account Xinzhiyuan. Original address: