GAN has also opened up new frontiers.

Last year, NVIDIA's StyleGAN generated high-quality, visually realistic images that deceived countless pairs of eyes. A flood of fake faces, fake cats, and fake houses followed, showing the power of GANs.

StyleGAN generates a fake face

Although GANs have made significant advances in image generation, ensuring semantic consistency between textual descriptions and visual content remains very challenging.

Recently, researchers from Zhejiang University, the University of Sydney, and other institutions have proposed a novel global-local attentive and semantic-preserving text-to-image-to-text framework to solve this problem, called MirrorGAN.

How strong is MirrorGAN?

On the two current mainstream data sets, the COCO data set and the CUB bird data set, MirrorGAN achieves the best results.

The paper has been accepted by CVPR 2019.

MirrorGAN: Resolving semantic consistency between text and vision

Text-to-image generation (T2I) has great potential in many applications and has become an active research area in natural language processing and computer vision.

Unlike basic image generation problems, T2I generation is conditioned on textual descriptions rather than noise alone. Leveraging the power of GANs, various T2I methods have been proposed to generate visually realistic and text-related images. These methods all use a discriminator to distinguish generated image–text pairs from ground-truth image–text pairs.

However, relying solely on such discriminators makes modeling the underlying semantic consistency within each pair difficult and inefficient, because of the domain gap between text and images.

In recent years, attention mechanisms have been used to address this problem by guiding the generator to focus on different words when generating different image regions. However, because of the diversity between the text and image modalities, word-level attention alone does not ensure global semantic consistency, as shown in Figure 1(b):

Figure 1 (a) The mirror structure embodies the idea of learning text-to-image generation by re-description; (b)-(c) Images generated by previous work and by the proposed MirrorGAN, which are semantically inconsistent and consistent with the re-description, respectively.

T2I generation can be thought of as the inverse problem of image captioning (or image-to-text generation, I2T), which produces a textual description for a given image. Considering that each task requires modeling and aligning the underlying semantics of the two domains, it is natural and reasonable to model the two tasks in a unified framework and exploit the underlying duality between them.

As shown in Figure 1(a) and (c), if the image generated by T2I is semantically consistent with the given text description, its I2T re-description should have exactly the same semantics as that description. In other words, the generated image should act like a mirror that accurately reflects the underlying text semantics.

Based on this observation, the paper proposes a novel text-to-image-to-text framework, MirrorGAN, to improve T2I generation. It exploits the idea of learning T2I generation by re-description.

Anatomy of MirrorGAN's three core modules

For the T2I task, there are two main goals:

  • Visual realism;
  • Semantic consistency.

Both goals must be met: the generated image should be realistic and semantically consistent with the given text description.

MirrorGAN takes advantage of the idea of "learning text-to-image generation by re-description" and consists of three modules:

  • Semantic Text Embedding Module (STEM);
  • Global-Local collaborative Attention Module for cascaded image generation (GLAM);
  • Semantic Text Regeneration and Alignment Module (STREAM).

STEM generates word-level and sentence-level embeddings. GLAM has a cascaded architecture that generates target images from coarse to fine scales, using both local word attention and global sentence attention to progressively enhance the diversity and semantic consistency of the generated images. STREAM attempts to regenerate, from the generated image, a textual description that is semantically consistent with the given description.

Figure 2 MirrorGAN schematic

As shown in Figure 2, MirrorGAN embodies the mirror structure by integrating both T2I and I2T.

It takes advantage of the idea of learning T2I generation by re-description. After the image is generated, MirrorGAN regenerates its description, which aligns its underlying semantics with the given textual description.
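To make this flow concrete, here is a minimal, hypothetical sketch of the pipeline in Python. The module names and call signatures (stem, glam, stream) are illustrative assumptions for exposition, not the authors' released code.

```python
def mirror_gan_forward(text_tokens, noise, stem, glam, stream):
    """Hypothetical end-to-end pass: text -> image -> re-described text."""
    # STEM: word-level embeddings w and sentence-level embedding s of the caption
    word_emb, sent_emb = stem(text_tokens)

    # GLAM: cascaded coarse-to-fine generators guided by word and sentence attention
    images = glam(noise, word_emb, sent_emb)   # list of images, coarse to fine

    # STREAM: re-describe the finest image; training pushes this re-description
    # to match the original caption, aligning the underlying semantics
    caption_logits = stream(images[-1])
    return images, caption_logits
```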

The following are three modules of MirrorGAN: STEM, GLAM and STREAM.

STEM: Semantic Text Embedding Module

First, a semantic text embedding module is introduced to embed the given text description into local word-level features and global sentence-level features.

As shown on the far left of Figure 2, a recurrent neural network (RNN) is used to extract the semantic embedding T from the given text description, including a word embedding w and a sentence embedding s.
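As a rough illustration, a STEM-style text encoder could be sketched as below. This is a minimal PyTorch sketch assuming a bidirectional LSTM; the dimensions and the way the sentence embedding is pooled are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of a STEM-style encoder: word-level and sentence-level embeddings."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM: each direction contributes hidden_dim features
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer word indices
        x = self.embedding(tokens)                       # (batch, seq_len, embed_dim)
        word_feats, (h_n, _) = self.rnn(x)               # (batch, seq_len, 2*hidden_dim)
        # Concatenate the final hidden states of both directions as the sentence embedding
        sent_feat = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2*hidden_dim)
        return word_feats, sent_feat

# Usage: w, s = TextEncoder(vocab_size=5000)(token_batch)
```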

GLAM: Global-Local Collaborative Module for Cascading Image Generation

Next, a multi-stage cascaded generator is constructed by stacking three image generation networks.

The paper uses the basic structure described in "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks" because it performs well in generating realistic images.

Let {F0, F1, ..., Fm-1} denote the m visual feature transformers and {G0, G1, ..., Gm-1} denote the m image generators. The visual feature Fi and the generated image Ii at each stage can then be expressed as follows.
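The formulas appear as images in the original post. As a hedged reconstruction, the cascade follows the AttnGAN-style structure that MirrorGAN builds on, roughly:

```latex
% Sketch of the cascaded generation; notation may differ slightly from the paper.
% F_i^{att} denotes the global-local (word + sentence) attention model at stage i,
% z is a random noise vector, and w, s are the word and sentence embeddings from STEM.
\begin{aligned}
F_0 &= F_0(z, s), \\
F_i &= F_i\bigl(F_{i-1},\, F_i^{att}(F_{i-1}, w, s)\bigr), \quad i \in \{1, \dots, m-1\}, \\
I_i &= G_i(F_i), \quad i \in \{0, 1, \dots, m-1\}.
\end{aligned}
```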

STREAM: Semantic text regeneration and alignment module

As noted above, MirrorGAN includes a Semantic Text Regeneration and Alignment Module (STREAM) to regenerate, from the generated image, a textual description that is semantically aligned with the given textual description.

Specifically, a widely used encoder-decoder based image captioning framework is adopted as the basic STREAM architecture.

The image encoder is a convolutional neural network (CNN) pre-trained on ImageNet, and the decoder is an RNN. The image Im-1 generated by the final-stage generator is fed into the CNN encoder and the RNN decoder as follows.
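The corresponding formulas are also images in the original post. As a rough, hedged sketch of the kind of encoder-decoder captioner described (an ImageNet-pretrained CNN feeding an RNN decoder), one might write something like the following; the ResNet-50 backbone and layer sizes are illustrative assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionSTREAM(nn.Module):
    """Sketch of a STREAM-style captioner: pretrained CNN encoder + RNN decoder."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights="IMAGENET1K_V1")  # ImageNet-pretrained backbone (assumed)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        self.img_proj = nn.Linear(cnn.fc.in_features, embed_dim)  # project image feature
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image, captions):
        # image: (batch, 3, H, W); captions: (batch, seq_len) ground-truth word indices
        feat = self.encoder(image).flatten(1)      # (batch, 2048)
        feat = self.img_proj(feat).unsqueeze(1)    # (batch, 1, embed_dim)
        words = self.word_emb(captions)            # (batch, seq_len, embed_dim)
        # The image feature is fed as the first "token", followed by the caption words
        inputs = torch.cat([feat, words], dim=1)
        hidden, _ = self.decoder(inputs)
        return self.out(hidden)                    # per-step word logits
```

Training then encourages the regenerated caption to match the input description, which is what ties the generated image back to the text semantics.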

Experimental results: Best results on the CUB and COCO data sets

So, how strong is MirrorGAN's performance?

First, look at the comparison of MirrorGAN with other state-of-the-art T2I methods, including GAN-INT-CLS, GAWWN, StackGAN, StackGAN++, PPGN, and AttnGAN.

The experiments use the two current mainstream data sets, the COCO data set and the CUB bird data set:

  • The CUB bird data set contains 8,855 training images and 2,933 test images belonging to 200 categories, with 10 text descriptions per bird image;
  • The COCO data set contains 82,783 training images and 40,504 validation images, with 5 text descriptions per image.

The results are shown in Table 1:

Table 1 Comparison of the results of MirrorGAN and other state-of-the-art methods on the CUB and COCO data sets

Table 2 shows the R-precision scores of AttnGAN and MirrorGAN on the CUB and COCO data sets.

Table 2 R-precision scores of MirrorGAN and AttnGAN on the CUB and COCO data sets

In all experimental comparisons, MirrorGAN shows a clear advantage, demonstrating the superiority of the text-to-image-to-text framework and the global-local collaborative attention module proposed in the paper: MirrorGAN generates high-quality images whose semantics are consistent with the input text descriptions.

This article is reproduced from the WeChat public account Xinzhiyuan.