Intro

Yann LeCun described it as "the most interesting idea of ​​machine learning in the past 10 years." Of course, this compliment from a distinguished researcher in the field of deep learning is always a good advertisement for the topic we are talking about! Moreover, in fact, the formation of a confrontational network (short for Gans) has produced great success because they were introduced in an article in 2014 by Ian J. Goodfellow and co-authors.Deformed to the Nets.

So what is Generative Adversarial Networks? What makes them so "fun"? In this article, we will see that confrontational training is an inspiring idea that is simple and beautiful, representing the true concept of machine learning, especially the generation of models (the same way as backpropagation is a simple but very Clever tricks make the basic idea of ​​neural networks so popular and efficient).

Before we get into the details, let's take a quick look at it.GANthe use of. The generative confrontation network belongs to a set of generation models. This means they are able to generate/generate (we will see how) new content. To illustrate the concept of this "generating model", we can look at some well-known examples of the results obtained with GAN.

An illustration of the GAN capabilities of Ian Goodfellow and the co-author. These are samples generated by Generative Adversarial Networks after training two data sets: MNIST and TFD. For both, the rightmost column contains the real data closest to the sample generated directly adjacent. This shows us that the generated data is actually generated, not just memorized by the network. (Source: "Generative Adversarial Nets" paper)

Of course, this ability to generate new content makes GAN look a bit "magic", at least at first glance. In the following sections, we will overcome the obvious magic of GAN to gain insight into the ideas behind these models, mathematics and modeling. Not only will we discuss the basic concepts that Generative Adversarial Networks relies on, but we will gradually build and derive the reasoning that led to these ideas from the start.

Needless to say, let us rediscover GAN!

Notes: Although we try to make this article as independent as possible, we still need basic prior knowledge of machine learning. However, most concepts will be retained as needed, otherwise some references will be provided. We really tried to make this article as easy as possible to read. Don't hesitate to mention more in the comments section that you would like to read (other possible articles on this topic).

Outline

In the first section below, we will discuss the process of generating random variables from a given distribution. Then, in the 2 section, we will show by an example that the problem that GAN is trying to solve can be expressed as a random variable generation problem. In the 3 section, we will discuss matching-based generation networks and show how they answer the questions described in the 2 section. Finally, in the 4 section, we will introduce GAN. More specifically, we will show the general architecture with its loss function, and we will link to all previous sections.


Generating random variables

In this section, we will discuss the process of generating random variables: we remind some existing methods, especially the inverse transform method, which allows complex random variables to be generated from simple uniform random variables. Although all of this seems to be a far cry from our material subject GAN, we will see deeper connections to the generated model in the next section.

Pseudo-randomly generated uniform random variables

The computer is basically deterministic. Therefore, in theory, it is impossible to generate truly random numbers (even if we can say "What is true randomness?" This problem is very difficult). However, an algorithm for generating a sequence of numbers can be defined that is very similar in nature to the properties of a sequence of theoretical random numbers. In particular, the computer can generate a series of numbers using a pseudo-random number generator that closely follows a uniform random distribution between 0 and 1. The unified situation is a very simple case where different methods of building more complex random variables can be built.

Random variables are expressed as the result of an operation or process

There are different techniques designed to produce more complex random variables. Among them we can find, for example, inverse transformation method, rejection sampling, Metropolis-Hasting algorithm and so on. All of these methods rely on different mathematical techniques, which are mainly about expressing the random variables we want to generate as operations (through simpler random variables) or the results of the process.

Refusal samplingRepresenting a random variable is the result of a process that does not sample from a complex distribution, but samples from a well-known simple distribution and accepts or rejects the sampled value based on certain conditions. This process is repeated until the sampled values ​​are accepted, and we can demonstrate that under the correct acceptance conditions, the values ​​effectively sampled will follow the correct distribution.

OnMetropolis-Hasting algorithmThe idea is to find the Markov chain (MC) so that the static distribution of the MC corresponds to the distribution we want to sample the random variables. Once this MC is discovered, we can simulate a sufficiently long trajectory on this MC to consider that we have reached a steady state, and then the last value we get in this way can be thought of as derived from the distribution of interest.

We won't go any further into the details of refusal sampling and Metropolis-Hasting, as these methods will not lead us to follow the concepts behind GAN (although interested readers can refer to the Wikipedia article and links in it). But let's focus more on the inverse transform method.

Inverse transform method

The idea of ​​the inverse transformation method is just to express our complexity-in this article, "complex" should always be understood as "not simple" rather than mathematical meaning-random variables as the result of the function applied to the function uniform random variables we know how generate.

We consider this in the following example. Let X be the complex random variable we want to sample, U is a uniform random variable on [0,1], and we know how to sample from it. We remind random variables byCumulative distribution function(CDF) is fully defined. The CDF of a random variable is a function from the domain of the random variable to the interval [0,1] and is defined in one dimension, so that

In the specific case of our uniform random variable U, we have

For the sake of simplicity, we assume here that the function CDF_X is reversible and represents its inverse function.

(By using the generalized inverse of a function, it is easy to extend the method to an irreversible case, but it is actually not the main point we want to focus on). Then if we define

We have

We can see that Y and X have the same CDF and then define the same random variable. Therefore, by defining Y (as a function of a uniform random variable) as above, we try to define a random variable with a target distribution.

In summary, the inverse transform method is a way to generate random variables that follow a given distribution by passing a uniform random variable through a well-designed "transform function" (inverse CDF). In fact, the concept of this "inverse transformation method" can be extended to the concept of "transformation method", which, more broadly, generates random variables from some simpler random variables (not necessarily uniform, Then the transformation function is no longer an inverse CDF). Conceptually, the purpose of the "transformation function" is to deform/reshape the initial probability distribution: the transformation function is too high compared to the target distribution from the initial distribution and places it too low.

An illustration of the inverse transform method. Blue: Evenly distributed on [0,1]. Orange: Standard Gaussian distribution. Gray: A mapping from uniform to Gaussian distribution (inverse CDF).

Generating model

We are trying to generate very complex random variables...

Suppose we are interested in generating a black and white square image of a dog with a size of n by n pixels. We can reshape each data into N = n × n dimensional vectors (by stacking the columns on top of each other) so that the dog's image can be represented by a vector. However, this does not mean that all vectors represent a dog shape back to the square! Therefore, we can say that an N-dimensional vector that effectively gives something that looks like a dog is distributed according to a very specific probability distribution over the entire N-dimensional vector space (some points of the space are likely to represent dogs, and it is Very unlikely for others). In the same spirit, there is a probability distribution of images of cats, birds, etc. in this N-dimensional vector space.

Then, the problem of generating a new image of the dog is equivalent to the problem of generating a new vector following the "dog probability distribution" on the N-dimensional vector space. In fact, we are faced with the problem of generating random variables for a specific probability distribution.

At this point, we can mention two important things. First of all, the "dog probability distribution" we mentioned is a very complex distribution in a very large space. Second, even though we can assume that there is such a base distribution (actually there are images that look like dogs and others do not look like images), we obviously don't know how to express this distribution explicitly. The previous two points make the process of generating random variables from this distribution very difficult. Then let's try to solve the following two problems.

…So let’s use the neural network transformation method as a function!

When we try to generate a new image of a dog, our first problem is that the "dog probability distribution" in the N-dimensional vector space is a very complicated problem, and we don't know how to directly generate complex random variables. However, as we are very clear about how to generate N uncorrelated uniform random variables, we can use the transform method. To do this, we need to represent the N-dimensional random variables as the result of very complex functions applied to simple N-dimensional random variables!

Here, we can emphasize the fact that finding a transformation function is not the closed inverse of a cumulative distribution function (which we obviously don't know) as we did when describing the inverse transformation method. The conversion function cannot be articulated, so we have to learn it from the data.

In these cases, in most cases, very complex functions naturally mean neural network modeling. Then, our idea is to model the transformation function through a neural network that takes a simple N-dimensional uniform random variable as input and returns another N-dimensional random variable as output. After training, the random variable The correct "dog probability distribution" should be followed. Once the network architecture is designed, we still need to train it. In the next two sections, we will discuss two ways to train these generation networks, including the idea of ​​confrontation training behind GAN!

Illustration of a model concept using a neural network generation. Obviously, the dimensions we really talk about are much higher than the dimensions represented here.

Generate matching network

Disclaimer: The name of "Generate matching network" is not standard. However, we can find it in the literature, for example, "Generative Moments Matching Networks" or "Generative Features Matching Networks." We just want to use a slightly more denomination here to describe what we are describing.

Cultivate generated models

So far, we have proved that the problem of generating a new image of a dog can be re-described as the problem of generating a random vector following the "dog probability distribution" in the N-dimensional vector space, and we suggest using a transformation method using a neural network. To simulate the transformation function.

Now, we still need to train (optimize) the network to express the correct transformation function. To this end, we can suggest two different training methods: direct training methods and indirect training methods. The direct training method involves comparing the real and generated probability distributions and propagating the differences (errors) back through the network. This is the idea of ​​rule generation matching networks (GMNs). For indirect training methods, we do not directly compare the real and generated distributions. Instead, we train the generation network by passing these two distributions through selected downstream tasks, so that the optimization process of the generation network relative to the downstream tasks will force the generated distribution to be close to the true distribution. The last idea is to generate one of the behind the network (GAN), which we will cover in the next section. But for now, let's start with the direct approach and GMN.

Compare two probability distributions based on samples

As mentioned above, the idea of ​​GMN is to train the generation network by directly comparing the generated distribution with the real distribution. However, we don't know how to express the true "dog probability distribution" clearly. We can also say that the generated distribution is too complicated to be clearly expressed. Therefore, comparison based on explicit expressions is impossible. However, if we have a way to compare sample-based probability distributions, we can use it to train the network. In fact, we have a sample of real data, and we can generate a sample of the generated data in each iteration of the training process.

Although in theory any distance (or similarity measure) that can effectively compare two distributions based on the sample can be used, we can specifically mention the Maximum Mean Difference (MMD) method. MMD defines the distance between two probability distributions that can be calculated (estimated) based on the samples of these distributions. Although it is not completely beyond the scope of this article, we decided not to spend more time describing the MDD. However, our project will soon publish an article that will contain more details about it. Readers who want to know more about MMD can refer toThese slides.TextOrText.

Backpropagation distribution matching error

Therefore, once we have defined a method based on sample comparison of two distributions, we can define the training process for generating networks in the GMN. Given a random variable with a uniform probability distribution as input, we hope that the probability distribution of the generated output is the "dog probability distribution". Then, GMN's idea is to optimize the network by repeating the following steps:

  • Produce some uniform input
  • Pass these inputs through the network and collect the generated output
  • Compare the real "dog probability distribution" with one generated based on available samples (eg, calculate the MMD distance between the real dog image sample and the sample of the generated sample)
  • Use backpropagation to perform a step of gradient descent to reduce the distance between the real distribution and the generated distribution (eg MMD)

As mentioned above, when following these steps, we apply a gradient descent on the network with a loss function, which is the distance between the real distribution and the generated distribution in the current iteration.

The generated matching network uses simple random input to generate new data, directly compares the distribution of the generated data with the distribution of real data, and propagates matching errors back to train the network.

Generating confrontation network

"indirect" training method

The "direct" method proposed above directly compares the generated distribution with the real distribution when training the generated network. The good idea of ​​the rule GAN is to replace this direct comparison with an indirect alternative that takes the form of downstream tasks of both distributions. The task is then trained to generate a network such that it forces the generated distribution to be closer to the true distribution.

The downstream task of GAN is the discrimination task between the real sample and the generated sample. Or we can say "non-discrimination" tasks because we want discrimination to fail as much as possible. Therefore, in the GAN architecture, we have a discriminator that takes samples of real data and generated data and tries to classify them as much as possible, as well as a trained generator to trick the discriminator as much as possible. Let's look at a simple example of why the direct and indirect methods we mentioned should theoretically lead to the same optimal generator.

Ideal situation: perfect generator and discriminator

To better understand why the training generator to spoof the discriminator would result in the same result as the direct training generator matching the target distribution, let us adopt a simple one-dimensional example. We forgot for a moment how to represent generators and discriminators and treat them as abstractions (as specified in the next section). Moreover, both are considered "perfect" (with infinite capacity) because they are not constrained by any type (parameterized) model.

Suppose we have a real distribution, such as a one-dimensional Gaussian distribution, and we want a generator that samples from this probability distribution. Our so-called "direct" training method will include iteratively adjusting the generator (gradient descent iteration) to correct the measurement difference/error between the real distribution and the generated distribution. Finally, assuming that the optimization process is perfect, we should end up with a generator distribution that exactly matches the true distribution.

Illustration of the concept of direct matching method. The distribution of blue is real and the resulting distribution is shown in orange. Through iterative iteration, we compare the two distributions and adjust the network weights through the gradient descent steps. The comparison here is done on the mean and variance (similar to the truncated moment matching method). Note that (obviously) this example is very simple and does not require an iterative method: the purpose is simply to illustrate the intuition given above.

For the "indirect" method, we must also consider a discriminator. We now assume that this discriminator is an oracle that knows exactly what is real and generated, and can predict the class ("true" or "generate") of any given point based on this information. If these two distributions are obvious, then the discriminator will be able to easily classify and can confidently present most of the points we present to it. If we want to trick the discriminator, we have to make the resulting distribution close to the real distribution. When the two distributions are equal at all points, the discriminator will be the most difficult to predict class: in this case,

Intuition against the method of confrontation. The blue distribution is real and the orange is generated. In gray, there is a corresponding y-axis on the right, and if it selects a higher-density class at each point (assuming the ratios of "true" and "generated" data are equal), we show the true probability of the discriminator. The closer the two distributions are, the more error the discriminator is. During training, the goal is to move the “green area” (the resulting distribution is too high) to the red area (the resulting distribution is too low).

At this point, there seems to be reason to doubt whether this indirect method is really a good idea. In fact, it seems to be more complicated (we have to optimize the generator based on downstream tasks rather than directly on the distribution) and it needs to be considered here to be the oracle's discriminator, but in fact, it is neither known nor perfect. For the first point, the difficulty of directly comparing the two probability distributions based on the sample offsets the significantly higher complexity of the indirect method. For the second point, it is obvious that the discriminator is unknown. However, it can be learned!

Approximation: Antagonistic Neural Network

Let us now describe the specific form of the generator and discriminator in the GANs architecture. The generator is a neural network that simulates a conversion function. It takes a simple random variable as input and must return a random variable following the target distribution after training. Because it is very complex and unknown, we decided to model the discriminator with another neural network. The neural network simulates a discriminant function. It takes a point (in our dog example an N-dimensional vector) as input and returns the probability of that point as "true" as output.

Note that the fact that we now impose a parametric model to express the generator and discriminator (rather than the idealized version in the previous section) does not actually have a huge impact on the theoretical argument/intuition given above: It is then only working in some parametric spaces rather than the ideal full space, so the ideal point we should achieve in an ideal situation can be seen as "rounding" by the precise capacity of the parametric model.

Once defined, the two networks can jointly (simultaneously) perform the opposite target training:

  • The goal of the generator is to trick the discriminator, so training generates a neural network to maximize the final classification error (between real data and generated data)
  • The goal of the discriminator is to detect forged data, so training the discriminant neural network to minimize the final classification error

Therefore, in each iteration of the training process, the weight of the generated network is updated to increase the classification error (the error gradient rises to the parameters of the generator), and the weight of the discriminant network is updated to reduce this error (the error gradient drops beyond the parameters of the discriminator) ).

Generating against network representation. The generator takes a simple random variable as input and generates new data. The discriminator uses "real" and "generated" data and tries to distinguish them to build a classifier. The goal of the generator is to trick the discriminator (by adding as much of the generated data as possible to the real data to increase the classification error), and the goal of the discriminator is to distinguish between real data and generated data.

These opposite goals and the implicit concept of confrontation training in both networks explain the name of the "antagonistic network": both networks try to beat each other, and by doing so, they are getting better and better. The competition between them makes the two networks "progress" in their respective goals. From a game theory perspective, we can think of this setting as a minimally large two-player game where the equilibrium state corresponds to the situation where the generator generates data from the exact target distribution and the discriminator predicts "real" or "generated". The probability of any point it receives is 1/2.


Math details about GAN

Notes: This section is more technical and is not absolutely necessary for a comprehensive understanding of GAN. So, readers who don't want to read some math now can skip this part for the time being. For others, let's see how the intuition given above is mathematically formalized.

give up: The following equation is not an article by Ian Goodfellow. We propose another mathematical formalization here for two reasons: first, to keep closer to the intuition given above, and second, because the equations of the original paper are already very clear, it is useless to rewrite them. Also note that we will never participate in actual considerations (disappearance gradients or other) related to different possible loss functions. We strongly recommend that readers also look at the equations of the original paper: the main difference is that Ian Goodfellow and the co-authors use cross-entropy errors instead of absolute errors (as we did). Furthermore, in the following we assume generators and discriminators with unlimited capacity.

Neural network modeling essentially requires two things to be defined: the architecture and the loss function. We have described the architecture of Generative Adversarial Networks. It contains two networks:

  • Generate a network G(.) that takes a random input z of density p_z and returns the output x_g = G(z), which should follow the (post-training) target probability distribution
  • A discriminant network D(.), which takes an input x (x_t whose density is represented by p_t) or "generated" (x_g whose density p_g is the density caused by the density p_z through G) which can be "true" And return the probability D(x) of x to "real" data.

Now let's take a closer look at GAN's "theoretical" loss function. If we send the "real" and "generated" data to the discriminator in the same proportion, the expected absolute error of the discriminator can be expressed as

The goal of the generator is to trick the discriminator with the goal of being able to distinguish between real data and generated data. Therefore, when training the generator, we want to maximize this error, and we try to minimize it for the discriminator. It gave us

For any given generator G (and induced probability density p_g), the best possible discriminator is the minimized discriminator

In order to minimize (relative to D) this integral, we can minimize the function within the integral of each value of x. Then it defines the best possible discriminator for a given generator

(In fact, the best because of the value of x, such that p_t(x) = p_g(x) can be handled in another way, but not important for the latter). Then we search for G maximization

Again, to maximize (relative to G) this integral, we can maximize the function within the integral of each value of x. Since the density p_t is independent of the generator G, we can't be better than setting G.

Of course, since p_g is the probability density that should be integrated with 1, we must have the best G

Therefore, we have shown that in the ideal case with an infinite capacity generator and discriminator, the optimal point of the antagonistic setting causes the generator to produce the same density as the true density, and the discriminator cannot be better than the real one. There are two in one case, just as the intuition tells us. Finally, also pay attention to G maximization

In this form, we'd better see G wanting to maximize the expected probability of a discriminator error.


Tips

The main content of this article is:

  • Computers can basically generate simple pseudo-random variables (for example, they can generate variables that are very close to evenly distributed)
  • There are different ways to generate more complex random variables, including the concept of "transformation methods", which include functions that represent random variables as some simpler random variables.
  • In machine learning, the generation model attempts to generate data from a given (complex) probability distribution.
  • The deep learning generation model is modeled as a neural network (a very complex function) that takes a simple random variable as input and returns a random variable that follows the target distribution ("transform method").
  • These generation networks can be “directly” trained (by comparing the distribution of generated data with the real distribution): this is the idea of ​​generating a matching network.
  • These generation networks can also be "indirect" trained (by trying to trick another network that is training at the same time to distinguish between "generated" data and "real" data): this is the idea of ​​generating a confrontational network.

Even though the "hype" surrounding GAN may be a bit exaggerated, we can say that the idea of ​​adversarial training proposed by Ian Goodfellow and his co-authors is really great. This way of distorting the loss function from direct comparison to indirect comparison can actually stimulate further work in the field of deep learning. All in all, let's say that we don't know whether the concept of GAN is really "the most interesting idea in machine learning in the past 10 years"... but it is obviously at least the most interesting one!

This article was transferred from awardsdatascience,Original address