It is easier to recognize a Monet painting than to paint one. Generative models (creating data) are considered much harder than discriminative models (processing data). Training a GAN is also hard. This article is part of the GAN series, and in it we study why GAN training is so elusive. Through this study, we will learn some fundamental questions that drive the directions of many researchers, and we look at some disagreements so we know where the research may be heading. Before examining these problems, let's have a quick recap of the GAN equations.

**GAN**

GAN samples the noise *z* from a normal or uniform distribution and uses a deep network generator *G* to create an image *x* (*x = G(z)*). In GAN, we add a discriminator *D* to distinguish whether its input is real or generated. It outputs a value *D(x)* estimating the chance that the input is real.

**Objective function and gradient**

GAN is defined as a minimax game with the following objective function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

The figure below summarizes how we train the discriminator and the generator using the corresponding gradients.
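As a concrete illustration, here is a minimal numerical sketch of evaluating the minimax objective *V(D, G)*. Everything here is a made-up toy: the "images" are one-dimensional numbers, the discriminator is a fixed sigmoid with an arbitrary weight, and the generator is a simple shift of the noise.

```python
import numpy as np

# Toy sketch (hypothetical 1-D "images", fixed sigmoid discriminator, shift
# generator) evaluating V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].
rng = np.random.default_rng(0)

def D(x, w=2.0, b=0.0):
    """Hypothetical discriminator: sigmoid score that input is real."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def G(z, shift=-1.0):
    """Hypothetical generator: shifts the noise distribution."""
    return z + shift

z = rng.normal(size=10000)            # noise z ~ N(0, 1)
x = rng.normal(loc=1.0, size=10000)   # "real" data samples

V = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))
print(V)  # the discriminator maximizes V; the generator minimizes it
```

The discriminator's gradient step increases *V* while the generator's gradient step decreases it, which is the tug-of-war the rest of this article is about.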

**GAN problem**

Many GAN models suffer from the following major problems:

- **Non-convergence**: the model parameters oscillate, destabilize, and never converge,
- **Mode collapse**: the generator collapses and produces only a limited variety of samples,
- **Diminished gradient**: the discriminator gets too successful, the generator gradient vanishes, and the generator learns nothing,
- imbalance between the generator and the discriminator causing overfitting, and
- high sensitivity to the hyperparameter selections.

**Mode**

The real data distribution is multimodal. For example, in MNIST there are 10 major **modes**, from digit "0" to digit "9". The samples below are generated by two different GANs. The top row produces all 10 modes, while the second row produces only a single mode (the digit "6"). This problem is called **mode collapse** when only a few modes of the data are generated.

## Nash Equilibrium

GAN is based on the zero-sum non-cooperative game. In short, if one wins, the other loses. A zero-sum game is also called minimax: your opponent wants to maximize its actions, and your actions are to minimize them. In game theory, the GAN model converges when the discriminator and the generator reach a Nash equilibrium. This is the optimal point for the minimax objective of GAN.

Since both sides want to undermine the other, a Nash equilibrium happens when one player will not change its action regardless of what the opponent may do. Consider two players *A* and *B* who control the values *x* and *y* respectively. Player *A* wants to maximize the value *xy* while *B* wants to minimize it.

$$V(x, y) = xy$$

The Nash equilibrium is *x = y = 0*. This is the only state where the action of your opponent does not matter: the only state in which no action by the opponent changes the outcome of the game.

Let's see whether we can find the Nash equilibrium easily with gradient descent. We update the parameters *x* and *y* based on the gradient of the value function *V*:

$$\Delta x = \alpha \frac{\partial (xy)}{\partial x}, \qquad \Delta y = -\alpha \frac{\partial (xy)}{\partial y}$$

where *α* is the learning rate. When we plot *x*, *y*, and *xy* against the training iterations, we realize that our solution does not converge. If we increase the learning rate or train the model longer, we can see the parameters *x*, *y* swinging in large, unstable oscillations.

Our example is a good demonstration that some cost functions do not converge with gradient descent, in particular for a non-convex game. We can also look at the problem intuitively: your opponent always takes countermeasures against your actions, which makes the model harder to converge.

In a minimax game, applying gradient descent to the cost function may fail to converge.
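This non-convergence is easy to simulate. A minimal sketch (assuming simultaneous gradient updates with learning rate 0.1, an assumption for illustration) of the *V(x, y) = xy* game above:

```python
# Sketch of the V(x, y) = xy game: player A ascends in x to maximize xy,
# player B descends in y to minimize it (simultaneous updates, alpha = 0.1).
alpha = 0.1
x, y = 0.5, 0.5          # start away from the Nash equilibrium x = y = 0

radii = []               # distance from the equilibrium after each step
for _ in range(200):
    # dV/dx = y (A maximizes), dV/dy = x (B minimizes)
    x, y = x + alpha * y, y - alpha * x
    radii.append((x * x + y * y) ** 0.5)

# Instead of shrinking toward (0, 0), the parameters spiral outward.
print(radii[0], radii[-1])
```

Each update is a rotation with a slight outward scaling, so the parameters orbit the equilibrium with growing amplitude rather than settling on it.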

## KL-Divergence

To understand the convergence problem in GAN, we first study the KL-divergence and the JS-divergence. Before GAN, many generative models built a model *θ* by maximum likelihood estimation (**MLE**), i.e., by finding the model parameters that fit the training data best:

$$\theta^* = \arg\max_{\theta} \sum_i \log p(x_i; \theta)$$

This is the same as minimizing the KL-divergence *KL(p, q)* (proof), which measures how the probability distribution *q* (the estimated distribution) diverges from the expected probability distribution *p* (the real distribution).

$$D_{KL}(p \| q) = \int_x p(x) \log \frac{p(x)}{q(x)} \, dx$$

KL-divergence is not symmetrical. For regions where *p(x) → 0*, *KL(p, q)* falls to 0. For example, in the lower-right figure, the red curve corresponds to *D(p, q)*: it drops to zero for *x > 2*, where *p* approaches 0.

What does it mean? The KL-divergence *KL(p, q)* penalizes the generator if it misses some modes of images: the penalty is high where *p(x) > 0* but *q(x) → 0*. Nevertheless, some images that do not look real are acceptable: the penalty is low when *p(x) → 0* but *q(x) > 0*. **(Poorer quality but more diverse samples.)**

On the other hand, the reverse KL-divergence *KL(q, p)* penalizes the generator if an image does not look real: the penalty is high if *p(x) → 0* but *q(x) > 0*. But it explores less variety: the penalty is low if *q(x) → 0* but *p(x) > 0*. **(Better quality but less diverse samples.)**

Some generative models (other than GAN) use MLE (a.k.a. KL-divergence) to build the model. It was originally believed that KL-divergence causes poorer image quality (blurry images). But be warned that some empirical experiments may dispute this claim.
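The asymmetry described above can be made concrete with a toy computation. The two discrete distributions below are assumptions for illustration: *q* drops one of *p*'s two modes, and the forward and reverse KL-divergences punish that collapse very differently.

```python
import numpy as np

# Toy sketch (assumed two-mode discrete distributions) of the KL asymmetry:
# the generator q puts almost all of its mass on one of p's two modes.
p = np.array([0.5, 0.5])     # real data: two equally likely modes
q = np.array([0.99, 0.01])   # generator: nearly collapsed onto one mode

def kl(a, b):
    """Discrete KL-divergence KL(a || b)."""
    return float(np.sum(a * np.log(a / b)))

forward = kl(p, q)   # KL(p, q): heavily penalizes the dropped mode (p>0, q->0)
reverse = kl(q, p)   # KL(q, p): a much milder penalty for the same collapse
print(forward, reverse)
```

The forward divergence is several times larger than the reverse one here, matching the intuition that MLE (forward KL) hates mode dropping while reverse KL tolerates it in exchange for sample quality.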

## JS-divergence

JS-divergence is defined as:

$$D_{JS}(p \| q) = \frac{1}{2} D_{KL}\!\left(p \,\Big\|\, \frac{p + q}{2}\right) + \frac{1}{2} D_{KL}\!\left(q \,\Big\|\, \frac{p + q}{2}\right)$$

JS-divergence is symmetrical. Unlike KL-divergence, it penalizes bad images badly (when *p(x) → 0* and *q(x) > 0*). In GAN, if the discriminator is optimal (performing well in distinguishing images), the generator's objective function becomes (proof):

$$C(G) = 2 D_{JS}(p_{data} \| p_g) - 2 \log 2$$

Therefore, optimizing the generator model was treated as optimizing the JS-divergence. In experiments, GAN produces nicer images than other generative models that use KL-divergence. Following the logic of the previous section, early research speculated that optimizing the JS-divergence, rather than the KL-divergence, creates nicer but less diverse images. However, some researchers have disputed these claims, since GAN experiments using MLE produce images of similar quality yet still suffer from the image diversity problem. Nevertheless, many efforts have gone into studying the weaknesses of the JS-divergence in GAN training. Regardless of the debate, these works are significant, so next we look deeper into the problems of the JS-divergence.

**Gradient disappeared in JS-Divergence**

Recall that when the discriminator is optimal, the generator's objective function is:

What happens to the JS-divergence gradient when the data distribution *q* of the generator's images does not match the ground truth *p* of the real images? Let's consider an example in which *p* and *q* are Gaussian distributed and the mean of *p* is zero. Let's consider *q* with different means to study the gradient of *JS(p, q)*.

Here, we plot the JS-divergence *JS(p, q)* between *p* and *q*, with the mean of *q* ranging from 0 to 30. As shown below, the gradient of the JS-divergence vanishes from *q1* to *q3*. The GAN generator learns extremely slowly when the cost saturates in those regions. In particular in early training, *p* and *q* are very different, and the generator learns very slowly.
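The saturation is easy to reproduce numerically. The sketch below (assuming unit-variance 1-D Gaussians and a simple Riemann-sum integration, both assumptions for illustration) computes *JS(p, q)* as the mean of *q* moves away from *p*:

```python
import numpy as np

# Sketch: JS-divergence between p = N(0, 1) and q = N(mu, 1) saturates at
# log 2 as mu grows, so its gradient w.r.t. mu vanishes for distant q.
xs = np.linspace(-10, 40, 20001)
dx = xs[1] - xs[0]

def gauss(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def js(mu_q):
    p, q = gauss(xs, 0.0), gauss(xs, mu_q)
    m = 0.5 * (p + q)
    eps = 1e-300  # avoid log(0) where the densities underflow
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps))) * dx
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps))) * dx
    return 0.5 * (kl_pm + kl_qm)

for mu in (1, 5, 10, 20, 30):
    print(mu, round(js(mu), 4))
```

Once the two Gaussians barely overlap, *JS(p, q)* is flat at log 2 ≈ 0.6931 no matter how far apart they are, which is exactly the region where the generator receives almost no gradient.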

## Unstable gradient

Because of the vanishing gradient, the original GAN paper proposed **an alternative cost function** to address the problem: instead of minimizing $\log(1 - D(G(z)))$, the generator maximizes $\log D(G(z))$.

According to a research paper by Arjovsky, the corresponding gradient of this alternative cost is:

$$\mathbb{E}_{z \sim p_z}\left[-\nabla_\theta \log D^*(G(z))\right] = \nabla_\theta \left[ D_{KL}(p_g \| p_r) - 2 D_{JS}(p_r \| p_g) \right]$$

It includes a **reverse** KL-divergence term, which Arjovsky uses to explain why GAN produces higher-quality but less diverse images than KL-divergence-based generative models. However, the same analysis claims that the gradients fluctuate and cause the model to be unstable. To demonstrate it, Arjovsky freezes the generator and keeps training the discriminator. The gradient of the generator starts increasing with larger variance.

The experiment above is not how we train GAN. However, mathematically, Arjovsky shows that the first GAN generator's objective function has vanishing gradients, while the alternative cost function has fluctuating gradients that cause model instability. Since the original GAN paper, there has been a gold rush in finding new cost functions, such as LSGAN, WGAN, WGAN-GP, BEGAN, etc. Some methods are based on new mathematical models, and others on intuition backed up by experiments. The goal is to find a cost function with smoother, non-vanishing gradients.
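The vanishing-gradient motivation for the alternative cost can be checked with a tiny derivative calculation. Assuming a sigmoid discriminator head with logit *a* (so *D = σ(a)*, an assumption for illustration), we compare the gradient of the two generator losses with respect to that logit:

```python
import numpy as np

# Sketch (assumed sigmoid discriminator head): gradient of both generator
# losses w.r.t. the logit a, where D(G(z)) = sigmoid(a).
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Early in training the discriminator confidently rejects fakes: a << 0.
a = np.array([-6.0, -3.0, 0.0])
d = sigmoid(a)

# Minimax loss   log(1 - D):  d/da log(1 - sigmoid(a)) = -sigmoid(a)
minimax_grad = -d            # -> 0 as D(G(z)) -> 0: the gradient vanishes
# Alternative   -log(D):      d/da (-log sigmoid(a))   = sigmoid(a) - 1
alt_grad = d - 1.0           # -> -1 as D(G(z)) -> 0: signal survives

print(np.abs(minimax_grad))  # tiny at a = -6: almost no learning signal
print(np.abs(alt_grad))      # near 1 at a = -6: strong learning signal
```

When the discriminator rejects generated images with confidence, the original loss passes almost nothing back to the generator, while the alternative loss keeps a near-constant gradient.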

However, a 2017 Google Brain paper, "Are GANs Created Equal?", claims that in the end, "we did not find evidence that any of the tested algorithms consistently outperforms the original one."

If any newly proposed cost function were a slam-dunk success in improving image quality, we would not have this argument. The doomsday picture of the original cost function painted by the Arjovsky mathematical model has also not fully materialized. But I would caution readers against declaring a winning cost function prematurely. You can find my opinion on the Google Brain paper here. **What is my opinion?** Training GAN fails easily. Rather than trying many cost functions at the beginning, debug your design and code first. Next, tune the hyperparameters, since GAN models are very sensitive to them. Do both before trying cost functions at random.

## Why does mode collapse happen in GAN?

Mode collapse is one of the hardest problems to solve in GAN. A complete collapse is not common, but a partial collapse happens often. The images below with the same underlined color look similar, and the mode starts collapsing.

Let's look at how it may happen. The objective of the GAN generator is to create images *x* that can fool the discriminator *D* the most.

But let's consider one extreme case where *G* is trained extensively without updates to *D*. The generated images will converge to find the optimal image *x\** that fools *D* the most, the most realistic image from the discriminator's perspective. In this extreme, *x\** will be independent of *z*:

$$x^* = \arg\max_x D(x)$$

This is bad news. The mode collapses to a **single point**. The gradient associated with *z* approaches zero.

When we restart training on the discriminator, the most effective way to detect the generated images is to detect this single mode. Since the generator has already desensitized the impact of *z*, the gradient from the discriminator will likely push the single point to the next most vulnerable mode, which is not hard to find. The generator produces such an imbalance of modes in training that it deteriorates its capability to detect other modes. Now, both networks are overfitted to exploit short-term opponent weaknesses. This turns into a cat-and-mouse game, and the model will not converge.

In the figure below, Unrolled GAN manages to produce all 8 expected modes of the data. The second row shows another GAN in which the mode collapses and rotates to another mode as the discriminator catches up.

During training, the discriminator is constantly updated to detect its adversary, so the generator is less likely to overfit. In practice, our understanding of mode collapse is still limited, and the intuitive explanation above may be oversimplified. Mitigation methods have been developed and validated through empirical experiments. However, GAN training is still a heuristic process, and partial collapse remains very common.

But mode collapse is not all bad news. In style transfer using GAN, we are happy to convert an image into just one good image rather than finding all the variants. Indeed, the specialization created by a partial mode collapse sometimes produces higher-quality images. Still, mode collapse remains one of the most important problems for GAN to solve.

**Hyperparameters and training**

No cost function works without good hyperparameters, and tuning them takes time and patience. New cost functions may also introduce hyperparameters with sensitive performance. Spend time on the hyperparameters before trying different cost functions at random.

## Balance between discriminator and generator

Non-convergence and mode collapse are often explained as an imbalance between the discriminator and the generator. The obvious solution is to balance their training to avoid overfitting. However, little progress has been made, and not for lack of trying. Some researchers believe this is not a feasible or desirable goal, since a good discriminator gives good feedback. Hence, some of the attention has shifted instead to cost functions with non-vanishing gradients.

**Cost and image quality**

In a discriminative model, the loss measures the accuracy of the predictions, and we use it to monitor the training progress. However, the loss in GAN measures how well we are doing compared with our opponent. Often, the generator cost increases while the image quality is actually improving. We fall back to examining the generated images manually to verify progress. This makes model comparison harder, which makes picking the best model in a single run difficult. It also complicates the tuning process.

## Further reading

Now that you have heard about the problems, you may want to hear about the solutions. We offer two different articles. The first provides a curated summary of the solutions: **GAN — A comprehensive review into the gangsters of GANs** (on medium.com), which studies the motivations and directions of GAN research in improving GANs.

If you want to go deeper, the second has a more in-depth discussion: **GAN — Ways to improve GAN performance** (on medium.com).

If you want to study the mathematical model of the gradient and stability problems further, the following article elaborates on it. Be warned that the equations may look overwhelming, but if you are not afraid of them, it provides good reasoning for some of the claims: **GAN — What is wrong with the GAN cost function?** (on medium.com).

## reference

Towards Principled Methods for Training Generative Adversarial Networks

Improved Techniques for Training GANs

NIPS 2016 Tutorial: Generative Adversarial Networks
