This article is reproduced from the WeChat public account PaperWeekly. Original address

  • Authors: Xu Zhiqin, Zhang Yaoyu
  • Postdoctoral fellow at New York University Abu Dhabi, visiting scholar at the Courant Institute of New York University
  • Research direction: computational neuroscience, deep learning theory

GAN has achieved great success in image generation, which undoubtedly stems from the continuous improvement of its modeling capability under the adversarial game, ultimately allowing it to generate images that pass as real.

Four years have passed since GAN was born in 2014. A large number of GAN papers have been published in major journals and conferences, covering mathematical analysis and improvement of GAN, research on improving GAN's generation quality, applications of GAN to image generation (specified image synthesis, text-to-image, image-to-image, video), and applications of GAN to NLP and other fields. Image generation is the most widely studied, and research in this area has demonstrated the enormous potential of GAN for image synthesis.

This article is organized around the paper An Introduction to Image Synthesis with Generative Adversarial Nets and gives an overview of GAN in image generation applications.

Introduction to the paper

The famous physicist Richard Feynman said: "What I cannot create, I do not understand." The AI products we encounter at this stage are trying to understand what humans can understand, such as ImageNet image classification, AlphaGo, intelligent dialogue robots, and so on.

However, we still cannot conclude that these algorithms are truly "intelligent", because knowing how to do something does not necessarily mean understanding it, and it is critical that a truly intelligent agent understands its task.

If a machine can create, it means the machine can model its input data autonomously. Does this mean the machine has taken a step toward greater "intelligence"? In machine learning, the most feasible way to create is through generative models: by learning a generative model, the machine can draw samples that are not in the training set but follow the same distribution.

The most influential generative models are VAE [1], PixelCNN [2], Glow [3], and GAN [4]. Among them, GAN, proposed in 2014, is the most popular generative model; even if it cannot be said to leave all the others far behind, it certainly stands out.

A GAN consists of two neural networks, a generator and a discriminator: the generator tries to generate realistic samples to fool the discriminator, while the discriminator tries to distinguish real samples from generated ones. This adversarial game pushes both the generator and the discriminator to keep improving. Once a Nash equilibrium is reached, the generator can produce outputs that pass as real.

But this Nash equilibrium exists only in theory; in practice, GAN training comes with several problems. One is training instability and the other is mode collapse. A theoretical derivation of these problems is given in a previous article [41].

These problems have not limited the development of GAN; articles that improve GAN keep emerging, and over the past few years GAN has matured considerably. From the high-quality GAN papers of recent years, it can be seen that papers after 2018 are more concerned with applying GAN in various fields, while earlier papers focused on fixing GAN's problems.

GAN is most prominent in image generation, and of course it has many other applications in computer vision, such as image inpainting, image captioning, object detection, and semantic segmentation. The application of GAN in natural language processing is also a growing trend, for example text modeling, dialogue generation, question answering, and machine translation. However, training GAN on NLP tasks is more difficult and requires more tricks, which makes it a challenging but interesting area of research.

The paper An Introduction to Image Synthesis with Generative Adversarial Nets outlines the methods used in GAN-based image generation and points out the pros and cons of existing methods. This article is a personal understanding and translation of that paper, combined with some personal practical experience with these methods.

The basis of GAN

Readers who are already familiar with the structure of GAN can skip this part. Let's first take a look at the basic architecture of GAN:

GAN can take any distribution as input; here Z denotes that input. In experiments, Z is usually drawn from N(0,1) or from the uniform distribution on [−1,1]. The generator G has parameters θ; given input z, its output is G(z; θ), which can be regarded as a sample drawn from the generator's distribution, G(z; θ) ∼ Pg.

The data distribution of the training samples x is Pdata, and the training goal of the generative model G is to make Pg approximate Pdata. The discriminator D distinguishes generated samples from real ones. The generator and the discriminator are trained through a min-max game, where the generator G tries to generate realistic data to deceive the discriminator, and the discriminator D tries to distinguish real data from synthetic data. This game can be formulated as:

min_G max_D V(D, G) = E_x∼Pdata [log D(x)] + E_z∼Pz [log(1 − D(G(z)))]

The original GAN used fully connected layers as its building blocks. Later, DCGAN [5] proposed using convolutional neural networks to achieve better performance, and since then convolutional layers have become a core component of many GAN models.

However, when the discriminator is trained much better than the generator, D can confidently reject samples from G, so the loss term log(1−D(G(z))) saturates and G cannot learn anything from it.

To prevent this, G can be trained to maximize log D(G(z)) instead of minimizing log(1−D(G(z))). Although this changed loss function gives G a gradient different from the original one, it still provides the same gradient direction and does not saturate.
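
As a concrete illustration, here is a minimal PyTorch-style sketch of the two generator losses, the saturating log(1−D(G(z))) form and the non-saturating −log D(G(z)) form; the function name and the toy logits are only for illustration and are not from the original paper.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits, non_saturating=True):
    """d_fake_logits: discriminator logits on generated samples D(G(z))."""
    if non_saturating:
        # maximize log D(G(z))  <=>  minimize -log D(G(z)); gradients do not vanish
        return F.binary_cross_entropy_with_logits(
            d_fake_logits, torch.ones_like(d_fake_logits))
    # original (saturating) form: minimize log(1 - D(G(z)))
    return -F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.zeros_like(d_fake_logits))

fake_logits = torch.randn(16, 1)       # stand-in for discriminator outputs
loss = generator_loss(fake_logits)     # the non-saturating variant used in practice
```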

Conditional GAN (CGAN)

In the original GAN, there is no way to control what is generated, because the output depends only on random noise. We can add a conditional input c to the random noise z so that the generated image is defined by G(c, z). This is CGAN [6]. Usually the conditional input vector c is concatenated directly with the noise vector z, and the resulting vector is fed to the generator just as z is in the original GAN. The condition c can be the class of the image, the attributes of an object, a textual description of the image to be generated, or even another image.
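
As a sketch, a toy CGAN generator can simply concatenate the condition c (assumed here to be a one-hot class label) with the noise vector z; the layer sizes below are arbitrary and not taken from any specific paper.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy CGAN generator: the condition c is concatenated with the noise z."""
    def __init__(self, z_dim=100, c_dim=10, img_dim=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + c_dim, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),
        )

    def forward(self, z, c):
        # G(c, z): condition and noise are simply concatenated
        return self.net(torch.cat([z, c], dim=1))

z = torch.randn(8, 100)                        # noise vectors
c = torch.eye(10)[torch.randint(0, 10, (8,))]  # one-hot class labels as the condition
fake = ConditionalGenerator()(z, c)            # shape: (8, 784)
```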

Auxiliary classifier GAN (ACGAN)

To provide additional side information and allow semi-supervised learning, an auxiliary classifier can be added to the discriminator so that the model is optimized on both the original task and the additional classification task. The architecture of this approach is shown below, where C is the auxiliary classifier.

Adding an auxiliary classifier allows us to use pre-trained models (for example, image classifiers trained on ImageNet), and the experiments in ACGAN [7] show that this approach can help generate sharper images and mitigate the mode collapse problem. Auxiliary classifiers can also be used in text-to-image synthesis and image-to-image conversion.
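
The idea can be sketched as a discriminator loss with two heads, one adversarial logit and one set of class logits for the auxiliary classifier C; this is an illustration of the principle, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def acgan_d_loss(real_fake_logit, class_logits, is_real, class_labels):
    """ACGAN-style discriminator loss: adversarial term + auxiliary classification term."""
    target = torch.ones_like(real_fake_logit) if is_real else torch.zeros_like(real_fake_logit)
    adv = F.binary_cross_entropy_with_logits(real_fake_logit, target)
    aux = F.cross_entropy(class_logits, class_labels)  # auxiliary classifier C
    return adv + aux

logits, cls = torch.randn(16, 1), torch.randn(16, 10)  # stand-in discriminator outputs
labels = torch.randint(0, 10, (16,))
loss = acgan_d_loss(logits, cls, is_real=True, class_labels=labels)
```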

GAN and Encoder combination

Although GAN can transform a noise vector z into a synthetic data sample G(z), it does not allow the inverse transformation. If the noise distribution is regarded as a latent feature space of the data samples, GAN lacks the ability to map a data sample x to its latent feature z.

To allow such a mapping, two concurrent works, BiGAN [8] and ALI [9], add an encoder E to the original GAN, as shown in the figure below.

Let Ωx be the data space and Ωz the latent feature space. The encoder E takes x∈Ωx as input and produces a feature vector E(x)∈Ωz as output. The discriminator D is modified to take both a data sample and a feature vector as input and compute P(Y|x, z), where Y=1 indicates that the sample is real and Y=0 indicates that the data was generated by G. This can be expressed as:

min_{G,E} max_D V(D, E, G) = E_x∼Pdata [log D(x, E(x))] + E_z∼Pz [log(1 − D(G(z), z))]

GAN and VAE combination

Images generated by VAE are blurry, but VAE does not suffer from the mode collapse problem that GAN has. The original intention of VAE-GAN [10] was to combine the advantages of the two into a more robust generative model. The model structure is as follows:

However, in practice, jointly training the VAE and GAN components is also difficult to get right.

Dealing with mode collapse

Although GAN is very effective at image generation, its training process is very unstable and requires a lot of tricks to get good results. Besides training instability, GAN also suffers from mode collapse: the discriminator only has to judge whether each sample is real, without considering the variety of generated samples, so the generator only needs to produce a few high-quality images to fool the discriminator.

For example, the MNIST dataset contains digit images from 0 to 9, but in an extreme case the generator only needs to learn to perfectly generate one of the ten digits to completely fool the discriminator, and it then stops trying to generate the other nine. The absence of the other nine digits is an example of inter-class mode collapse. An example of intra-class mode collapse is that each digit has many writing styles, but the generator learns to generate only one perfect sample per digit to successfully fool the discriminator.

Many methods have been proposed to address mode collapse. One technique, called minibatch features [15], lets the discriminator compare a minibatch of real samples with a minibatch of generated samples. In this way, the discriminator can tell whether a generated sample is too similar to other generated samples by measuring distances between samples in a feature space. Although this method works well, its performance depends heavily on the features used in the distance calculation.
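
A simplified stand-in for the minibatch feature idea (using the per-batch standard deviation adopted by later work, rather than the original learned minibatch-discrimination features) can be sketched as follows:

```python
import torch

def minibatch_std_feature(features):
    """Append the average per-feature std across the batch as one extra feature,
    so the discriminator can notice when generated samples lack diversity.

    features: tensor of shape (batch, feature_dim) inside the discriminator.
    """
    std = features.std(dim=0)                       # per-feature std over the batch
    stat = std.mean().expand(features.size(0), 1)   # one scalar, replicated per sample
    return torch.cat([features, stat], dim=1)

h = torch.randn(32, 128)
h = minibatch_std_feature(h)                        # shape: (32, 129)
```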

MRGAN [11] proposes adding an encoder that maps samples from the data space back to the latent space, as in BiGAN. The combination of the encoder and the generator acts as an autoencoder, and the reconstruction loss is added to the adversarial loss to act as a mode regularizer. At the same time, the discriminator is also trained to distinguish reconstructed samples, which serves as another mode regularizer.

WGAN [12] uses the Wasserstein distance to measure the similarity between the real data distribution and the learned distribution, instead of the Jensen-Shannon divergence used by the original GAN. Although this avoids mode collapse in theory, the model takes longer to converge than previous GANs.

To alleviate this problem, WGAN-GP [13] proposes using a gradient penalty instead of the weight clipping in WGAN. WGAN-GP usually produces good images, largely avoids mode collapse, and its training framework is easy to apply to other GAN models.
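
The gradient penalty can be sketched as follows: interpolate between real and generated samples and penalize critic gradients whose norm deviates from 1. The toy critic and the coefficient are placeholders (10 is a commonly used value).

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP term: lambda * E[(||grad_x critic(x_hat)||_2 - 1)^2] on interpolates."""
    eps = torch.rand(real.size(0), 1, device=real.device)   # per-sample mixing weights
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat,
                                create_graph=True)[0]
    return lambda_gp * ((grads.norm(2, dim=1) - 1) ** 2).mean()

critic = torch.nn.Linear(64, 1)                # toy critic returning one score per sample
real, fake = torch.randn(8, 64), torch.randn(8, 64)
gp = gradient_penalty(critic, real, fake)      # added to the critic's Wasserstein loss
```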

Spectral normalization [14] constrains the discriminator's capacity by normalizing the spectral norm of its weight matrices; this idea is also used in the discriminator of later models such as SAGAN.
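
In PyTorch, spectral normalization is available as a wrapper around a layer's weights; a minimal sketch of applying it to an illustrative discriminator (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Wrapping discriminator layers with spectral normalization bounds their spectral
# norm (and hence each layer's Lipschitz constant), limiting the discriminator's capacity.
discriminator = nn.Sequential(
    nn.utils.spectral_norm(nn.Linear(784, 256)),
    nn.LeakyReLU(0.2),
    nn.utils.spectral_norm(nn.Linear(256, 1)),
)

score = discriminator(torch.randn(4, 784))   # shape: (4, 1)
```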

GAN methods for image generation

The main approaches GAN takes to image generation are the direct method, the hierarchical method, and the iterative method, as shown in the following figure:

These approaches are distinguished by how many generators and discriminators the model uses.

The Direct Method

All methods in this category use a single generator and a single discriminator, and the structures of the generator and discriminator are straightforward, without branches. Many of the earliest GAN models fall into this category, such as GAN [4], DCGAN [5], ImprovedGAN [15], InfoGAN [16], f-GAN [17], and GAN-INT-CLS [18].

Among them, DCGAN is one of the most classic, and its structure is reused by many later models. The typical building blocks used in DCGAN are shown below: the generator uses transposed convolution (deconvolution), batch normalization, and ReLU activations, while the discriminator uses convolution, batch normalization, and LeakyReLU activations. This has become a reference design for many GAN networks.
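
A sketch of these two building blocks (channel counts and kernel sizes are illustrative, following the common 4×4 kernel with stride 2):

```python
import torch
import torch.nn as nn

def g_block(in_ch, out_ch):
    # generator block: transposed convolution (deconvolution) + batch norm + ReLU
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def d_block(in_ch, out_ch):
    # discriminator block: convolution + batch norm + LeakyReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

x = torch.randn(1, 64, 16, 16)
print(g_block(64, 32)(x).shape)    # upsampled to   (1, 32, 32, 32)
print(d_block(64, 128)(x).shape)   # downsampled to (1, 128, 8, 8)
```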

This approach is relatively straightforward to design and implement compared to the hierarchical and iterative methods, and it generally yields good results.

Hierarchical method

In contrast to the direct method, algorithms in the hierarchical category use two generators and two discriminators, with different generators serving different purposes. The idea behind these methods is to split the image into two parts, such as "style and structure" or "foreground and background." The relationship between the two generators can be parallel or sequential.

SS-GAN [19] uses two GANs: a Structure-GAN that generates a surface normal map from random noise ẑ, and a Style-GAN that takes this structure as input and outputs the final image. The overall architecture is shown in the following figure:

Iterative method 

The iterative method differs from the hierarchical method in two ways. First, instead of using two different generators that play different roles, models in this category use multiple generators with similar or even identical structures, which generate images from coarse to fine, each generator refining the details of the previous result. Second, when the generators share the same structure, iterative methods can share weights between generators, whereas hierarchical methods typically do not.

LAPGAN [20] was the first GAN to use the iterative method, generating images from coarse to fine with a Laplacian pyramid. The multiple generators in LAPGAN perform the same task: each takes the image from the previous generator and a noise vector as input, and outputs a residual image that adds detail and makes the image sharper when added to the input image.

The only difference between these generators is the size of their inputs and outputs, with one exception: the lowest-level generator takes only the noise vector as input and outputs an image. LAPGAN outperforms the original GAN and shows that the iterative method can produce sharper images than the direct method.

StackGAN [21] is an iterative method with only two generators. The first generator receives the input (z, c) and outputs a blurry image showing the rough shape and blurred details of the object; the second generator takes (z, c) together with the image produced by the first generator and outputs a larger image with more realistic, photo-like details.

Another example of the iterative method is SGAN [22], which stacks generators: each generator takes lower-level features as input and outputs higher-level features, while the bottom generator takes the noise vector as input and the top generator outputs the image.

Because separate generators are used for different levels of features, SGAN associates each level with an encoder, a discriminator, and a Q network (which predicts the posterior probability P(zi|hi) for entropy maximization, where hi is the output feature of the i-th generator) in order to constrain and improve the quality of those features.

Other methods 

Different from the methods above, PPGN [23] uses activation maximization to generate images, sampling based on a prior learned with a denoising autoencoder (DAE).

To generate an image conditioned on a particular class label y, instead of using a feed-forward pass (recurrent methods can be regarded as feed-forward if unrolled in time), PPGN runs an optimization process that searches for an input z to the generator such that the output image strongly activates a particular neuron in a separately pre-trained classifier (the output-layer neuron corresponding to the class label y).

To generate better, higher-resolution images, ProgressiveGAN [24] proposes first training a generator and discriminator on 4×4-pixel images and then gradually adding layers to double the output resolution up to 1024×1024. This lets the model learn the coarse structure first and then focus on refining details, instead of having to handle all scales of detail at the same time.

GAN in text-to-image applications

Although a label-conditioned GAN such as CGAN [6] can generate images belonging to a specific class, generating images from text descriptions remains a huge challenge. Text-to-image synthesis is a milestone for computer vision: if an algorithm can generate realistic images from pure text descriptions, we can be highly confident that the algorithm actually understands the content of the images.

GAN-INT-CLS [18] was the first attempt to use GAN to generate images from text descriptions. The idea is similar to conditional GAN: a conditioning vector is concatenated with the noise vector, but instead of a class label or attributes, an embedding of the text sentence is used.

The groundbreaking part of GAN-INT-CLS is that its discriminator distinguishes two sources of error: unrealistic images paired with any text, and realistic images paired with mismatched text.

To train the discriminator to distinguish these two kinds of error, three types of input are fed to the discriminator at each training step: {real image, matching text}, {real image, mismatched text}, and {fake image, matching text}. This training technique is very important for generating high-quality images, because it not only tells the model how to generate realistic images, but also teaches it the correspondence between text and images.
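
A sketch of this matching-aware discriminator update, assuming a discriminator D that takes an image and a text embedding and returns a logit (the two error terms are averaged, following the paper's training algorithm):

```python
import torch
import torch.nn.functional as F

def matching_aware_d_loss(D, real_img, fake_img, text_emb, wrong_text_emb):
    """Three input types: {real, matching text}, {real, mismatched text},
    {fake, matching text}. D is any callable returning one logit per pair."""
    s_r = D(real_img, text_emb)         # real image, matching text    -> label 1
    s_w = D(real_img, wrong_text_emb)   # real image, mismatched text  -> label 0
    s_f = D(fake_img, text_emb)         # fake image, matching text    -> label 0
    ones, zeros = torch.ones_like(s_r), torch.zeros_like(s_r)
    return (F.binary_cross_entropy_with_logits(s_r, ones)
            + 0.5 * (F.binary_cross_entropy_with_logits(s_w, zeros)
                     + F.binary_cross_entropy_with_logits(s_f, zeros)))
```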

TAC-GAN [25] is a combination of GAN-INT-CLS [18] and ACGAN [7].

Position constrained text to image 

Although GAN-INT-CLS [18] and StackGAN [21] can generate images from text descriptions, they cannot capture the localization constraints of objects in the image. To allow spatial constraints to be encoded, GAWWN [26] proposed two possible solutions. The first method learns the bounding box of the object by applying a spatial transformer network to a spatially replicated text embedding tensor.

The output of the spatial transformer network is a tensor with the same dimensions as the input, but with values outside the bounding box set to zero. This output tensor then passes through several convolutional layers to reduce it back to a one-dimensional vector, which preserves the textual information while also constraining the position of the object through the bounding box. One benefit of this approach is that it is end-to-end and requires no additional input.

The second method proposed by GAWWN uses user-specified key points to constrain different parts of the object in the image (e.g. head, legs, arms, tail). For each key point, a mask matrix is generated in which the key-point position is 1 and all other entries are 0. All matrices are combined by depth concatenation into a mask tensor of shape [M×M×K], where M is the size of the mask and K is the number of key points.

This tensor is then collapsed into a binary matrix in which 1 indicates the presence of a key point and 0 otherwise, and then replicated along the depth dimension to form the tensor fed into the remaining layers. Although this method allows more detailed constraints on the object, it requires extra user input to specify the key points.
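
An illustrative construction of this key-point tensor and its collapse into a binary map; M, K, and the key-point coordinates below are made up for the example.

```python
import numpy as np

M, K = 16, 4                                     # mask size and number of key points
keypoints = [(2, 3), (5, 5), (10, 12), (7, 1)]   # (row, col) per key point (made up)

mask = np.zeros((M, M, K), dtype=np.float32)     # one depth channel per key point
for k, (r, c) in enumerate(keypoints):
    mask[r, c, k] = 1.0

binary = mask.max(axis=2, keepdims=True)         # 1 wherever any key point is present
replicated = np.repeat(binary, 8, axis=2)        # copied along depth for later layers
print(mask.shape, binary.shape, replicated.shape)  # (16, 16, 4) (16, 16, 1) (16, 16, 8)
```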

Although GAWWN provides two methods for enforcing positional constraints on the generated image, it only works for images containing a single object, because neither of the proposed methods can handle multiple different objects in one image.

Stack GAN text to image 

StackGAN [21] proposes using two different generators for text-to-image synthesis instead of just one. The first generator is responsible for generating a low-resolution image containing the rough shape and colors of the object, while the second generator takes the output of the first generator and produces an image with higher resolution and sharper details; each generator is associated with its own discriminator.

StackGAN++ [27] proposes using more pairs of generators and discriminators instead of just two, adds an unconditional image-synthesis loss to the discriminator, and uses a color-consistency regularization term that encourages the mean and covariance of the pixels of images generated at different scales to stay consistent.

AttnGAN [28] further extends the StackGAN++ [27] architecture by applying attention mechanisms over image and text features. In AttnGAN, each sentence is embedded into a global sentence vector, and each word of the sentence is also embedded into a word vector.

The global sentence vector is used to generate a low-resolution image in the first stage; each subsequent stage then feeds the image features from the previous stage and the word vectors into an attention layer, computes a word-context vector, and combines it with the image features to form the generator input, which produces new image features.

Limitations of text to image models 

Current text-to-image models perform well on datasets with a single object per image, such as faces in CelebA, birds in CUB, and some objects in ImageNet. They can also synthesize reasonable images of scenes such as bedrooms and living rooms in LSUN, even if the objects in the scene lack clear details. However, all existing models work poorly when multiple complex objects are involved in one image.

A plausible reason why current models do not work well on complex images is that they only learn the overall characteristics of an image rather than the concept of each object in it. This explains why the synthesized bedroom and living-room scenes lack clear details: the model does not distinguish a bed from a table; all it learns is that certain shapes and colors should be placed somewhere in the composed image. In other words, the model does not really understand the image, it just remembers where to place shapes and colors.

Generative adversarial networks undoubtedly offer a promising approach to text-to-image synthesis, since they produce sharper images than any other generative method to date. To take text-to-image synthesis further, new ways of instilling the concept of objects into the algorithm are needed. One possible approach is to train one model that can generate different kinds of objects and then train another model that learns how to combine different objects (with reasonable relationships between them) into one image according to the text description.

However, this approach requires a large training set for the different objects, as well as another large dataset containing images of those objects appearing together, which is difficult to obtain. Another possible direction is the capsule concept proposed by Hinton et al., since capsules are designed to capture the concept of an object, but how to train such capsule-based networks efficiently is still an open problem.

GAN in image-to-image applications

Image-to-image conversion is defined as the problem of translating one possible representation of a scene into another, for example mapping a structure map of an image to an RGB image, or vice versa. The problem is related to style transfer, which takes a content image and a style image and outputs an image with the content of the former and the style of the latter.

Image-to-image conversion can be seen as a generalization of style transfer, since it is not limited to transferring the style of an image but can also manipulate attributes of objects (as in face-editing applications).

Supervised image to image conversion 

Pix2Pix [29] proposes combining the CGAN loss with an L1 regularization loss, so that the generator is not only trained to fool the discriminator but also to generate images as close to the ground truth as possible. The reason for using L1 rather than L2 is that L1 produces less blurry images.

The conditional GAN loss is defined as:

L_cGAN(G, D) = E_x,y [log D(x, y)] + E_x,z [log(1 − D(x, G(x, z)))]

The L1 loss that constrains the output to stay close to the ground truth is defined as:

L_L1(G) = E_x,y,z [‖y − G(x, z)‖_1]

The total loss is:

G* = arg min_G max_D L_cGAN(G, D) + λ·L_L1(G)

Here λ is a hyperparameter that balances the two loss terms. The generator of Pix2Pix is based on U-Net, an encoder-decoder framework with skip connections from the encoder to the decoder, so that low-level information such as edges can bypass the bottleneck and be shared directly.
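
A sketch of the combined generator objective, with a conditional discriminator D(x, y) that sees the input image and a candidate output; λ = 100 is used here only as an illustrative default, and the networks are placeholders.

```python
import torch
import torch.nn.functional as F

def pix2pix_g_loss(D, x, y_real, y_fake, lam=100.0):
    """Generator objective: fool the conditional discriminator + L1 to the ground truth."""
    logits = D(x, y_fake)                          # discriminator sees (input, output) pairs
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    l1 = F.l1_loss(y_fake, y_real)                 # pixel-wise L1 reconstruction term
    return adv + lam * l1
```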

Paired supervision of image to image conversion 

PLDT [30] proposed another method of supervised image-to-image conversion, by adding another discriminator Dpair to learn to determine whether a pair of images from different domains are related to each other.

The architecture of PLDT is shown in the figure below: given an input image Xs from the source domain, the associated real image Xt in the target domain, an unrelated image Xt̃ in the target domain, and the generator G that maps Xs to an image in the target domain. The loss of Dpair can be expressed as:

Unsupervised image to image conversion 

Two concurrent works, CycleGAN [31] and DualGAN [32], use a reconstruction loss to try to preserve the input image after a cycle of conversions. CycleGAN and DualGAN share the same framework, shown in the figure below.

As can be seen, the two generators G(ab) and G(ba) perform opposite conversions, which can be regarded as a form of dual learning. In addition, DiscoGAN [33] is another model that uses the same cyclic framework, shown in the figure below.

Taking CycleGAN as an example: it has two generators, G(ab) for transferring images from domain A to B and G(ba) for the opposite transformation. In addition, there are two discriminators, DA and DB, which predict whether an image belongs to domain A or B, respectively.

Although CycleGAN and DualGAN share the same model structure, they implement the generator differently: CycleGAN uses a fully convolutional generator architecture, while DualGAN follows the U-Net architecture.
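
The cycle-consistency term shared by these models can be sketched as an L1 reconstruction after a full cycle in each direction; G_ab and G_ba stand for the two generators, and the weight is a placeholder.

```python
import torch.nn.functional as F

def cycle_consistency_loss(G_ab, G_ba, real_a, real_b, lam=10.0):
    """L1 reconstruction after mapping A -> B -> A and B -> A -> B."""
    rec_a = G_ba(G_ab(real_a))    # should come back to the original image in domain A
    rec_b = G_ab(G_ba(real_b))    # should come back to the original image in domain B
    return lam * (F.l1_loss(rec_a, real_a) + F.l1_loss(rec_b, real_b))
```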

Unsupervised image-to-image conversion under distance constraints
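
DistanceGAN [34] observes that the distance between a pair of images in the source domain is highly correlated with the distance between their counterparts in the target domain, so it adds a loss that preserves these pairwise distances after mapping. This constraint makes it possible to learn the mapping in one direction only, without a cycle back to the source domain.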

Unsupervised image-to-image conversion with stable features 

Besides minimizing the reconstruction error at the original pixel level, this can also be done at a higher feature level, as discussed in DTN [35]. The architecture of DTN is shown in the figure below, where the generator G consists of two neural networks, a convolutional network f and a deconvolutional network g, such that G = g∘f.

Here f acts as a feature extractor, and DTN tries to preserve the high-level features of the input image after transferring it to the target domain. Given an input image x from the source domain, the generator output is G(x) = g(f(x)), and the feature reconstruction error is defined with a distance metric d between f(x) and f(G(x)) (DTN uses the mean squared error, MSE). We have published a detailed interpretation of this paper before; see [42].

Unsupervised image-to-image conversion with VAE and weight sharing 

UNIT [36] proposes adding VAEs to CoGAN [37] for unsupervised image-to-image conversion, as shown in the figure below.

Furthermore, UNIT assumes that the two encoders share the same latent space: if xA and xB are the same image in different domains, they are mapped to the same latent code. Based on this shared-latent-space assumption, UNIT enforces weight sharing between the last few layers of the two encoders and the first few layers of the two generators.

The objective function of UNIT is a combination of the GAN and VAE objectives, except that there are two sets of GAN/VAE terms and hyperparameters λ are added to balance the different loss terms.

Unsupervised multi-domain image to image conversion 

Previous models only converted images between two domains, but if you want to convert images between several domains, you need to train a separate generator for each pair of domains, which is expensive.

To solve this problem, StarGAN [38] proposes a single generator that can generate images for all domains. StarGAN takes not only an image but also the label of the target domain as input, and the generator converts the input image into the domain indicated by that label.

Similar to ACGAN, StarGAN uses an auxiliary domain classifier that classifies images into the domains they belong to. In addition, a cycle-consistency loss is used to preserve content similarity between the input and output images.

To allow StarGAN to train on multiple datasets that may have different label sets, StarGAN uses an additional one-hot vector to indicate the dataset and concatenates all label vectors into a single vector, setting the unspecified labels of each dataset to zero.

Image to image conversion summary

The different losses used by the image-to-image conversion methods discussed above are summarized in the following table:

The simplest loss is the pixel-wise L1 reconstruction loss, which requires paired training samples. The one-way and two-way reconstruction losses can be regarded as unsupervised versions of the pixel-wise L1 reconstruction loss, since they enforce cycle consistency and do not require paired training samples.

The additional VAE loss is based on the shared-latent-space assumption between the source and target domains and also implies a bidirectional cycle-consistency loss. The distance loss, by contrast, does not try to reconstruct the image; instead it preserves the pairwise differences between images when they are mapped from the source domain to the target domain.

Among all the models mentioned, Pix2Pix [29] produces the sharpest images, even though the L1 loss is just a simple addition to the original GAN model. Combining the L1 loss with the paired discriminator of PLDT may further improve performance on image-to-image conversion tasks that involve geometric changes.

In addition, it may be useful for Pix2Pix to preserve similarity information between images in the source and target domains, as is done in unsupervised methods such as CycleGAN [31] and DistanceGAN [34].

As for unsupervised methods, although their results are not yet as good as those of Pix2Pix and other supervised methods, they are a promising research direction because they do not require paired data, and collecting paired, labeled data in the real world is very expensive.

Image to image conversion application

Image-to-image conversion has been applied in many areas, such as face editing, image super-resolution, video prediction, and medical image translation. This part is not expanded here because the body of work in this area is too large.

Evaluation metrics for GAN-generated images

The quality of generated images is difficult to quantify, and metrics like RMSE are not appropriate because there is no absolute one-to-one correspondence between synthesized and real images. A common subjective approach is Amazon Mechanical Turk (AMT), which employs humans to score synthesized and real images based on how realistic they look. However, people often disagree about what is good or bad, so we also need objective metrics to evaluate image quality.

The Inception score (IS) [15] feeds generated images into a pre-trained image classifier and evaluates them based on the entropy of the predicted class distribution. The intuition is that the better an image x is, the lower the entropy of the conditional distribution p(y|x), meaning the classifier is highly confident about the image. In addition, to encourage the model to generate diverse classes of images, the marginal distribution p(y)=∫p(y|x=G(z))dz should have high entropy.

Combining these two considerations gives the Inception score, IS = exp(E_x [KL(p(y|x) ‖ p(y))]). However, the Inception score does not account for the prior distribution of the labels and is not a proper distance. Moreover, it is blind to intra-class mode collapse: a model only needs to generate one perfect sample per class to obtain a perfect Inception score, so the score cannot reflect whether the generative model suffers from mode collapse.
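
A sketch of the Inception-score computation from a matrix of predicted class probabilities p(y|x), one row per generated image; obtaining these probabilities from a pre-trained Inception classifier is assumed to happen elsewhere.

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """p_yx: array of shape (num_images, num_classes) holding p(y|x) for each image."""
    p_y = p_yx.mean(axis=0, keepdims=True)               # marginal distribution p(y)
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                      # IS = exp(E_x KL(p(y|x) || p(y)))

probs = np.random.dirichlet(np.ones(10), size=1000)      # stand-in for classifier outputs
print(inception_score(probs))
```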

Similar to the Inception score, the idea of the FCN-score [29] is that if the generated images are realistic, a classifier trained on real images will be able to classify the synthesized images correctly. However, an image classifier does not need the input image to be very sharp in order to give the correct classification, which means classification-based metrics may fail to distinguish small differences in detail between two images. Worse, the classifier's decision does not necessarily depend on the visible content of the image and can be strongly affected by noise invisible to humans, so the FCN-score is also problematic.

The Fréchet Inception Distance (FID) [39] takes a different approach. First, the generated images are embedded into the latent feature space of a chosen layer of the Inception network. Second, the embeddings of generated and real images are treated as samples from two continuous multivariate Gaussians, whose means and covariances can be computed. The quality of the generated images is then measured by the Fréchet distance between the two Gaussians:

FID(Pdata, Pg) = ‖μx − μg‖² + Tr(∑x + ∑g − 2(∑x∑g)^(1/2))

Here (μx, ∑x) and (μg, ∑g) are the mean and covariance of the real data distribution and of the generated samples, respectively. The FID is consistent with human judgment, and there is a strong negative correlation between the FID and the quality of generated images. In addition, the FID is less sensitive to noise than IS and can detect intra-class mode collapse.
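
A sketch of the FID computation from feature embeddings of real and generated images (the embedding step through the Inception network is assumed to be done elsewhere); it uses SciPy's matrix square root.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, fake_feats):
    """Fréchet distance between Gaussians fitted to real and generated feature embeddings."""
    mu_x, mu_g = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_x = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(cov_x @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return float(((mu_x - mu_g) ** 2).sum() + np.trace(cov_x + cov_g - 2 * covmean))

real = np.random.randn(500, 64)           # stand-in embeddings from real images
fake = np.random.randn(500, 64) + 0.1     # stand-in embeddings from generated images
print(fid(real, fake))
```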

Final Thoughts

Based on the paper An Introduction to Image Synthesis with Generative Adversarial Nets, this article reviewed the basics of GAN, the three main approaches to image generation (the direct, hierarchical, and iterative methods) as well as other generation methods such as iterative sampling, and discussed the two main forms of image synthesis: text-to-image synthesis and image-to-image conversion.

I hope this article helps readers get oriented in GAN research on image generation. Of course, it is limited to the original paper (most of this article is a translation of it); there are many more excellent GAN papers on image generation that readers can explore on their own.

References

[1] Kingma DP, Welling M. Auto-encoding variational bayes[J]. arXiv preprint arXiv: 1312.6114, 2013.

[2] van den Oord, Aaron, et al. “Conditional image generation with pixelcnn decoders.” Advances in Neural Information Processing Systems. 2016.

[3] Kingma, Durk P., and Prafulla Dhariwal. “Glow: Generative flow with invertible 1×1 convolutions.” Advances in Neural Information Processing Systems. 2018.

[4] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.

[5] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv: 1511.06434, 2015.

[6] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv: 1411.1784, 2014.

[7] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier gans,” arXiv preprint arXiv: 1610.09585,2016.

[8] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv: 1605.09782, 2016.

[9] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville, “Adversarially learned inference,” arXiv preprint arXiv: 1606.00704, 2016.

[10] ABL Larsen, SK Sønderby, H. Larochelle, and O. Winther, "Autoencoding beyond pixels using a learned similarity metric," arXiv preprint arXiv: 1512.09300, 2015.

[11] T. Che, Y. Li, AP Jacob, Y. Bengio, and W. Li, “Mode regularized generative adversarial networks,” arXiv preprint arXiv: 1612.02136, 2016.

[12] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv: 1701.07875, 2017.

[13] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved training of wasserstein gan,” arXiv preprint arXiv: 1704.00028, 2017.

[14] Miyato, Takeru, et al. “Spectral normalization for generative adversarial networks.” arXiv preprint arXiv: 1802.05957 (2018). 

[15] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems, 2016, pp. 2226– 2234.

[16] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances In Neural Information Processing Systems, 2016, pp. 2172–2180.

[17] S. Nowozin, B. Cseke, and R. Tomioka, “f-gan: Training generative neural samplers using variational divergence minimization,” arXiv preprint arXiv: 1606.00709, 2016.

[18] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," arXiv preprint arXiv: 1605.05396, 2016.

[19] X. Wang and A. Gupta, “Generative image modeling using style and structure adversarial networks,” arXiv preprint arXiv: 1603.05631, 2016.

[20] EL Denton, S. Chintala, a. szlam, and R. Fergus, “Deep generative image models using a laplacian pyramid of adversarial networks,” in Advances in Neural Information Processing Systems Curran Associates, Inc., 2015, pp. 1486–1494.

[21] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas, “Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks,” arXiv Preprint arXiv: 1612.03242, 2016.

[22] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie, “Stacked generative adversarial networks,” arXiv preprint arXiv: 1612.04357, 2016.

[23] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune, "Plug & play generative networks: Conditional iterative generation of images in latent space," arXiv preprint arXiv:1612.00005,2016.

[24] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing gans for improved quality, stability, and variation,” arXiv preprint arXiv: 1710.10196, 2017. 

[25] A. Dash, JCB Gamboa, S. Ahmed, MZ Afzal, and M. Liwicki, “Tac-gan-text conditioned auxiliary classifier generative adversarial network,” arXiv preprint arXiv: 1703.06412, 2017.

[26] SE Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, "Learning what and where to draw," in Advances in Neural Information Processing Systems, 2016, pp. 217– 225.

[27] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and DN Metaxas, “Stackgan++: Realistic image synthesis with stacked generative adversarial networks,” CoRR, vol. abs/ 1710.10916,2017. 

[28] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “Attngan: Fine-grained text to image generation with attentional generative adversarial networks,” arXiv preprint arXiv:1711.10485, 2017.

[29] P. Isola, J.-Y. Zhu, T. Zhou, and AA Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint arXiv: 1611.07004, 2016.

[30] D. Yoo, N. Kim, S. Park, AS Paek, and IS Kweon, “Pixel-level domain transfer,” in European Conference on Computer VisionSpringer, 2016, pp. 517–532. 

[31] J.-Y. Zhu, T. Park, P. Isola, and AA Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint arXiv: 1703.10593, 2017. 

[32] Z. Yi, H. Zhang, PT Gong et al., “Dualgan: Unsupervised dual learning for image-to-image translation,” arXiv preprint arXiv: 1704.02510, 2017. 

[33] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” arXiv preprint arXiv: 1703.05192, 2017. 

[34] S. Benaim and L. Wolf, “One-sided unsupervised domain mapping,” arXiv preprint arXiv: 1706.00826, 2017.

[35] Y. Taigman, A. Polyak, and L. Wolf, “Unsupervised cross-domain image generation,” arXiv preprint arXiv: 1611.02200, 2016. 

[36] M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image translation networks,” in Advances in Neural Information Processing Systems, 2017, pp. 700–708.

[37] M.-Y. Liu and O. Tuzel, "Coupled generative adversarial networks," in Advances in neural information processing systems, 2016, pp. 469–477.

[38] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” arXiv preprint arXiv: 1711.09020,2017.

[39] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a nash equilibrium,” CoRR, vol. abs/1706.08500, 2017.

[40] Huang, He, Phillip S. Yu, and Changhu Wang. “An Introduction to Image Synthesis with Generative Adversarial Nets.” arXiv preprint arXiv: 1803.04469 (2018).

[41] http://www.twistedwg.com/2018/01/30/GAN-problem.html

[42] https://www.paperweekly.site/papers/notes/503