It is easier to identify Monet's paintings than to paint one. Generative models (creating data) are considered much harder than discriminative models (processing data). Training a GAN is hard as well. This article is part of the GAN series, and in it we study why GAN training is so elusive. Through the study, we cover some of the fundamental problems that drive the research directions of many practitioners, and we look at the differences among them so we have an idea of where the research may be heading. Before looking into the problems, let's have a quick recap of the GAN equations.

GAN

GAN samples the noise z from a normal or uniform distribution and uses a deep network generator G to create an image x (x = G(z)).

The GAN generator

In GAN, we add a discriminator to distinguish whether its input is real or generated. It outputs a value D(x) to estimate the chance that the input is real.
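To make the notation concrete, here is a tiny numerical sketch. The linear `generator` and logistic `discriminator` below are made-up stand-ins for the real deep networks, chosen only to show the roles of G(z) and D(x):

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, theta):
    # Stand-in for G: maps noise z to a "sample" x = theta[0] * z + theta[1].
    return theta[0] * z + theta[1]

def discriminator(x, w):
    # Stand-in for D: logistic score D(x) in (0, 1), the estimated
    # probability that x comes from the real data distribution.
    return 1.0 / (1.0 + np.exp(-(w[0] * x + w[1])))

z = rng.normal(size=5)                       # noise from a normal distribution
x_fake = generator(z, theta=(2.0, 1.0))      # generated samples x = G(z)
scores = discriminator(x_fake, w=(0.5, 0.0)) # D(x): each score lies in (0, 1)
print(scores)
```

Any differentiable pair of functions with these input/output shapes plays the same roles in the equations below.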

GAN logic

Objective function and gradient

GAN is defined as a minimax game with the following objective function.
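Written out, this is the objective from the original GAN paper:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] +
  \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```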

The figure below summarizes how we train the discriminator and the generator with the corresponding gradients.

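For reference, these are the per-minibatch gradients from the original GAN paper's algorithm (m is the minibatch size): the discriminator ascends

```latex
\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m}
  \Big[\log D\big(x^{(i)}\big) + \log\Big(1 - D\big(G\big(z^{(i)}\big)\big)\Big)\Big]

\quad\text{while the generator descends}\quad

\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m}
  \log\Big(1 - D\big(G\big(z^{(i)}\big)\big)\Big)
```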

GAN problems

Many GAN models have the following major problems:

  • Non-convergence: the model parameters oscillate, destabilize and never converge,
  • Mode collapse: the generator collapses, producing limited varieties of samples,
  • Diminished gradient: the discriminator gets too successful, so the generator gradient vanishes and learns nothing,
  • Imbalance between the generator and discriminator causing overfitting, and
  • High sensitivity to the hyperparameter selections.

Modes

Real-life data distributions are multimodal. For example, in MNIST, there are 10 major modes, from digit "0" to digit "9". The samples below are generated by two different GANs. The top row produces all 10 modes while the second row produces only a single mode (the digit "6"). This problem is called mode collapse when only a few modes of data are generated.


Nash Equilibrium

GAN is based on a zero-sum non-cooperative game. In short, if one wins, the other loses. A zero-sum game is also called minimax: your opponent wants to maximize its objective while your actions minimize it. In game theory, the GAN model converges when the discriminator and the generator reach a Nash equilibrium. This is the optimal point of the minimax objective.

Since both sides want to undermine the other, a Nash equilibrium happens when one player will not change its action regardless of what the opponent may do. Consider two players A and B which control the values x and y respectively. Player A wants to maximize the value V(x, y) = xy while B wants to minimize it.

The Nash equilibrium is x = y = 0. This is the only state where the action of either player will not change the outcome of the game, no matter what the opponent does.

Let's see whether we can find the Nash equilibrium easily using gradient descent. We update the parameters x and y based on the gradient of the value function V.
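With the value function V(x, y) = xy, the simultaneous gradient updates are:

```latex
x_{t+1} = x_t + \alpha \frac{\partial V}{\partial x} = x_t + \alpha \, y_t
\qquad
y_{t+1} = y_t - \alpha \frac{\partial V}{\partial y} = y_t - \alpha \, x_t
```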

where α is the learning rate. When we plot x, y, and xy against the training iterations, we realize our solution does not converge.
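A few lines of NumPy reproduce this non-convergence; α = 0.1 and the starting point (3, 2) are arbitrary illustrative choices:

```python
import numpy as np

alpha = 0.1          # learning rate
x, y = 3.0, 2.0      # arbitrary starting point
radii = []

for step in range(200):
    # Simultaneous updates: A ascends on V = x*y, B descends on it.
    x, y = x + alpha * y, y - alpha * x
    radii.append(np.hypot(x, y))

# Instead of settling at the Nash equilibrium (0, 0), the iterates
# spiral outward: the distance from the origin keeps growing.
print(radii[0], radii[-1])
```

Each update multiplies the distance from the origin by √(1 + α²) > 1, so the oscillation grows no matter how small α is.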

If we increase the learning rate or train the model longer, we can see the parameters x, y are unstable with big swings.

Our example is an excellent showcase that some cost functions will not converge with gradient descent, in particular for a non-convex game. We can also view the problem intuitively: your opponent always counteracts your actions, which makes the model harder to converge.

In a minimax game, gradient descent may fail to make the cost function converge.

Generative models with KL-divergence

To understand the convergence problem in GAN, we first study KL-divergence and JS-divergence. Before GAN, many generative models create a model θ by maximizing the Maximum Likelihood Estimation (MLE), i.e. finding the model parameters that fit the training data best.
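In symbols, for training samples x⁽ⁱ⁾:

```latex
\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{m} p\big(x^{(i)}; \theta\big)
             = \arg\max_{\theta} \sum_{i=1}^{m} \log p\big(x^{(i)}; \theta\big)
```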

This is the same as minimizing the KL-divergence KL(p, q) (proof), which measures how the probability distribution q (the estimated distribution) diverges from the expected probability distribution p (the real data distribution).

KL-divergence is not symmetrical: KL(p, q) ≠ KL(q, p) in general.
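A quick discrete check makes the asymmetry concrete; the two three-outcome distributions below are made up for illustration:

```python
import numpy as np

def kl(p, q):
    # KL(p || q) = sum_x p(x) * log(p(x) / q(x)); terms with p(x) = 0
    # contribute 0.  Assumes q(x) > 0 wherever p(x) > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])   # "actual" distribution
q = np.array([0.1, 0.4, 0.5])   # "estimated" distribution
print(kl(p, q), kl(q, p))       # the two directions disagree
```

Both directions are non-negative and equal zero only when the distributions match, but swapping the arguments changes the value, which is exactly why forward and reverse KL penalize the generator differently.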

KL(p, q) drops to 0 in regions where p(x) → 0. For example, in the figure on the lower right, the red curve corresponds to KL(p, q). It drops to zero for x > 2, where p approaches 0.

Note: KL(p,q) is the integral of the red curve on the right.

What does it mean? The KL-divergence KL(p, q) penalizes the generator if it misses some modes of images: the penalty is high where p(x) > 0 but q(x) → 0. Nevertheless, some images that do not look real are tolerated: the penalty is low when p(x) → 0 but q(x) > 0. (Poorer quality but more diverse samples.)

On the other hand, the reverse KL-divergence KL(q, p) penalizes the generator if the images do not look real: the penalty is high if p(x) → 0 but q(x) > 0. But it explores less variety: the penalty is low if q(x) → 0 but p(x) > 0. (Better quality but less diverse samples.)

Some generative models (other than GANs) use MLE (a.k.a. KL-divergence) to create models. It was originally believed that KL-divergence causes poorer image quality (blurry images). But be warned that some empirical experiments may have disputed this claim.

JS-divergence

JS-divergence is defined as:
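With m = (p + q)/2 denoting the average of the two distributions:

```latex
JS(p, q) = \frac{1}{2} \, KL\!\left(p \,\middle\|\, \frac{p+q}{2}\right)
         + \frac{1}{2} \, KL\!\left(q \,\middle\|\, \frac{p+q}{2}\right)
```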

JS-divergence is symmetrical. Unlike KL-divergence, it severely penalizes images that look unreal (when p(x) → 0 and q(x) > 0). In GAN, if the discriminator is optimal (performing well in distinguishing images), the generator's objective function becomes (proof):
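Up to constants, the generator cost under an optimal discriminator from the original GAN analysis is:

```latex
C(G) = 2 \, JS(p_{\text{data}}, p_g) - 2 \log 2
```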

Hence, optimizing the generator is treated as optimizing the JS-divergence. In experiments, GAN produces nicer images compared with other generative models that use KL-divergence. Following the logic in the previous section, early research speculated that optimizing JS-divergence, rather than KL-divergence, creates better but less diverse images. However, some researchers have backed away from these claims since GAN experiments using MLE produce images of similar quality yet still suffer from diversity problems. Nevertheless, much effort has gone into studying the weaknesses of JS-divergence in GAN training, and regardless of the debate, these works are significant. Therefore, next we dive deep into the problems of JS-divergence.

Vanishing gradients in JS-divergence

Recall that when the discriminator is optimal, the generator's objective function reduces to the JS-divergence (up to constants).

What happens to the JS-divergence gradient when the data distribution q of the generator's images does not match the ground truth p of the real images? Let's consider an example in which p and q are Gaussian distributed and the mean of p is zero. Let's consider q with different means to study the gradient of JS(p, q).

Here, we plot the JS-divergence JS(p, q) between p and q, where the mean of q ranges from 0 to 30. As shown below, the gradient of the JS-divergence vanishes from q1 to q3. The GAN generator will learn extremely slowly when the cost saturates in those regions. In particular in early training, p and q are very different, so the generator learns very slowly.

Unstable gradient

Because of the vanishing gradient, the original GAN paper proposed an alternative cost function to address the problem.
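This alternative maximizes log D(G(z)) instead of minimizing log(1 − D(G(z))), i.e. the generator minimizes:

```latex
J(G) = -\,\mathbb{E}_{z \sim p_z}\big[\log D(G(z))\big]
```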

According to another research paper by Arjovsky, the corresponding gradient of this cost function is:
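Stated as in Arjovsky's analysis (quoted here from the standard form of the result, with D* the optimal discriminator; verify against the paper):

```latex
\mathbb{E}_{z \sim p_z}\!\big[-\nabla_\theta \log D^{*}(G_\theta(z))\big]
= \nabla_\theta \big[\, KL(p_g \,\|\, p_r) - 2 \, JS(p_r \,\|\, p_g) \,\big]
```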

It includes a reverse KL-divergence term, which Arjovsky uses to explain why GAN produces higher-quality but less diverse images than KL-divergence based generative models. However, the same analysis claims that the gradients fluctuate and cause the model to be unstable. To illustrate this, Arjovsky freezes the generator and keeps training the discriminator. The gradients of the generator start increasing, with larger variance.


The experiment above is not how we train GANs. However, mathematically, Arjovsky shows that the first GAN generator's objective function has vanishing gradients, while the alternative cost function has fluctuating gradients that cause model instability. Since the original GAN paper, there has been a gold rush in searching for new cost functions, such as LSGAN, WGAN, WGAN-GP, BEGAN, etc. Some methods are based on new mathematical models while others are based on intuition backed by experiments. The goal is to find a cost function with smoother and non-vanishing gradients.

However, a 2017 Google Brain paper, "Are GANs created equal?", claims that in the end they did not find evidence that any of the tested algorithms consistently outperforms the original one.

If any newly proposed cost function were a slam-dunk success in improving image quality, we would not have this debate. Nor has the doomsday picture of the original cost function in Arjovsky's mathematical model fully materialized. But I would caution readers against writing off cost-function research prematurely. You can find my opinion on the Google Brain paper here. My opinion? Training a GAN fails easily. Instead of trying out many cost functions at the beginning, debug your design and code first. Next, tune the hyperparameters, since GAN models are very sensitive to them. Do both before trying cost functions at random.

Why does mode collapse happen in GAN?

Mode collapse is one of the hardest problems to solve in GAN. A complete collapse is not common, but a partial collapse happens often. The images below with the same underlined color look similar, and the modes start collapsing.


Let's see how it may happen. The objective of the GAN generator is to create images that can fool the discriminator D the most.

But let's consider one extreme case where G is trained extensively without updates to D. The generated images will converge to find the optimal image x* that fools D the most: the most realistic image from the discriminator's perspective. In this extreme, x* will be independent of z.
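In symbols, the collapsed optimum is:

```latex
x^{*} = \arg\max_x D(x), \qquad G(z) \rightarrow x^{*} \ \text{for every } z
```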

This is bad news. The mode collapses to a single point. The gradient associated with z approaches zero.

When we restart the training of the discriminator, the most effective way to detect generated images is to detect this single mode. Since the generator has already desensitized itself to z, the gradient from the discriminator will likely push the single point to the discriminator's next most vulnerable mode, which is not hard to find. The generator produces such imbalanced modes during training that its capability to generate other modes deteriorates. Now, both networks are overfitted to exploit short-term opponent weaknesses. This turns into a cat-and-mouse game, and the model will not converge.

In the figure below, the Unrolled GAN manages to produce all 8 expected modes of data. The second row shows another GAN in which the mode collapses and rotates to another mode when the discriminator catches up.


During training, the discriminator is constantly updated to detect its adversary. Therefore, the generator is less likely to be overfitted. In practice, our understanding of mode collapse is still limited, and the intuitive explanation above may be oversimplified. Mitigation methods are developed and validated through empirical experiments, but GAN training is still a heuristic process, and partial collapses remain very common.

But mode collapse is not all bad news. In style transfer with GANs, we are happy to convert an image into just one good image rather than finding all variants. In fact, the specialization from a partial mode collapse sometimes produces higher-quality images. Nevertheless, mode collapse remains one of the most important problems for GANs to solve.

Hyperparameters and training

No cost function works without good hyperparameters, and tuning them takes time and patience. New cost functions may also introduce hyperparameters with sensitive performance, so spend the time on hyperparameters before concluding that a cost function does not work.

Balance between discriminator and generator

Non-convergence and mode collapse are often explained as an imbalance between the discriminator and the generator. The obvious solution is to balance their training to avoid overfitting. However, very little progress has been made, and not for lack of trying. Some researchers believe this is not a feasible or desirable goal, since a good discriminator provides good feedback. Hence, some of the attention has shifted to cost functions with non-vanishing gradients instead.

Cost and image quality

In a discriminative model, the loss measures the accuracy of the prediction, and we use it to monitor training progress. In a GAN, however, the loss measures how well we are doing compared with our opponent. Often, the generator cost increases while the image quality is actually improving. We fall back to examining the generated images manually to verify progress. This makes model comparison harder, which in turn makes picking the best model in a single run difficult. It also complicates the tuning process.

Further reading

Now that you have heard about the problems, you may want to hear about the solutions. We offer two different articles. The first provides a summary of the major solutions:

GAN — A comprehensive review into the gangsters of GANs (Part XNUMX)
This article studies the motivations and directions of GAN research on improving GANs.

If you want to go deeper, the second one has more in-depth discussions:

GAN — Ways to improve GAN performance
Compared with other deep networks, the GAN model can be hurt badly in the following ways.

If you want to study the mathematical models of the gradient and stability problems further, the following article elaborates on them. Be aware that the equations may look overwhelming; however, if you are not afraid of equations, they provide good reasoning for some of the claims:

GAN — What is wrong with the GAN cost function?
We strive to provide mathematical models for deep learning. But often we do not succeed and fall back to...

References

Towards Principled Methods for Training Generative Adversarial Networks

Improved Techniques for Training GANs

NIPS 2016 Tutorial: Generative Adversarial Networks

Generative Adversarial Networks

Are GANs Created Equal? A Large-Scale Study