GANs have opened up new frontiers in image generation.
Last year, NVIDIA's StyleGAN produced high-quality, visually realistic images that fooled countless pairs of eyes. A wave of fake faces, fake cats, and fake houses followed, showing the power of GANs.
Although GANs have made significant advances in image generation, ensuring semantic consistency between textual descriptions and visual content remains very challenging.
Recently, researchers from Zhejiang University, the University of Sydney, and other institutions have proposed a novel global-local attentive and semantic-preserving text-to-image-to-text framework to address this problem. The framework is called MirrorGAN.
How strong is MirrorGAN?
On the mainstream COCO data set and CUB bird data set, MirrorGAN achieves the best results to date.
The paper has been accepted by CVPR 2019.
MirrorGAN: Resolving semantic consistency between text and vision
Text-to-image generation (T2I) has great potential in many applications and has become an active research area spanning natural language processing and computer vision.
Unlike basic image generation, T2I generation is conditioned on a textual description rather than on noise alone. Leveraging the power of GANs, researchers have proposed various T2I methods to generate visually realistic, text-relevant images. These methods all use a discriminator to distinguish generated-image/text pairs from ground-truth-image/text pairs.
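To make this idea concrete, the sketch below shows a minimal text-conditioned discriminator of the kind such methods rely on: it scores an image together with a sentence embedding. The architecture, layer sizes, and names here are illustrative assumptions, not the exact design of MirrorGAN or any particular paper.

```python
# Minimal sketch of a text-conditioned discriminator (hypothetical dimensions,
# not the exact architecture used by any specific T2I paper).
import torch
import torch.nn as nn

class TextConditionedDiscriminator(nn.Module):
    def __init__(self, sent_dim=256, ndf=64):
        super().__init__()
        # Downsample a 64x64 RGB image to a 4x4 feature map.
        self.img_encoder = nn.Sequential(
            nn.Conv2d(3, ndf, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
        )
        # Joint image-text scoring head.
        self.joint = nn.Sequential(
            nn.Conv2d(ndf * 8 + sent_dim, ndf * 8, 3, 1, 1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 8, 1, 4, 1, 0),
        )

    def forward(self, image, sent_emb):
        feat = self.img_encoder(image)                       # (B, ndf*8, 4, 4)
        cond = sent_emb[:, :, None, None].expand(-1, -1, 4, 4)
        return self.joint(torch.cat([feat, cond], dim=1)).view(-1)  # real/fake score

# Usage: score a (generated image, sentence embedding) pair.
D = TextConditionedDiscriminator()
score = D(torch.randn(2, 3, 64, 64), torch.randn(2, 256))
```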
However, relying solely on such a discriminator makes it difficult and inefficient to model the underlying semantic consistency within each pair, because of the gap between the text and image domains.
In recent years, attention mechanisms have been used to address this problem, guiding the generator to focus on different words when generating different image regions. However, because text and image modalities are so diverse, word-level attention alone cannot ensure global semantic consistency, as shown in Figure 1(b).
T2I generation can be regarded as the inverse of image captioning (image-to-text generation, I2T), which produces a textual description for a given image. Since each task needs to model and align the underlying semantics of the two domains, it is natural and reasonable to model the two tasks in a unified framework and exploit their underlying duality.
As shown in Figure 1(a) and (c), if the image generated by T2I is semantically consistent with the given text description, then its I2T re-description should have exactly the same semantics as that description. In other words, the generated image should act like a mirror that accurately reflects the underlying text semantics.
Based on this observation, the paper proposes MirrorGAN, a new text-to-image-to-text framework that improves T2I generation by exploiting the idea of learning T2I through re-description.
Anatomy of MirrorGAN's three core modules
The T2I task has two main goals:
- visual realism;
- semantic consistency between the text and the image.
Both need to be satisfied at the same time.
MirrorGAN exploits the idea of learning text-to-image generation by re-description and consists of three modules:
- the semantic text embedding module (STEM);
- the global-local collaborative attention module for cascaded image generation (GLAM);
- the semantic text regeneration and alignment module (STREAM).
STEM generates word-level and sentence-level embeddings. GLAM has a cascaded structure that generates target images from coarse to fine scales, using both local word attention and global sentence attention to progressively enhance the diversity and semantic consistency of the generated images. STREAM attempts to regenerate, from the generated image, a textual description that is semantically consistent with the given one.
As shown in Figure 2, MirrorGAN embodies a mirror structure by integrating both T2I and I2T.
It exploits the idea of learning T2I generation by re-description: after an image is generated, MirrorGAN regenerates its description, aligning the image's underlying semantics with the given text description.
The following sections describe MirrorGAN's three modules: STEM, GLAM, and STREAM.
STEM: Semantic Text Embedding Module
First, a semantic text embedding module is introduced to embed a given text description into local word level features and global sentence level features.
As shown on the far left of Figure 2, a recurrent neural network (RNN) is used to extract the semantic embedding T from the given text description, including a word embedding w and a sentence embedding s.
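As a rough illustration of what STEM computes, here is a minimal sketch of a bidirectional RNN text encoder that returns word-level features w and a sentence embedding s. The vocabulary size, dimensions, and the choice of an LSTM are assumptions for the example, not the paper's exact configuration.

```python
# Minimal sketch of a STEM-style text encoder: a bidirectional LSTM yields
# word-level features w and a sentence-level embedding s. Vocabulary size,
# dimensions, and pooling choice here are illustrative assumptions.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                           bidirectional=True)

    def forward(self, tokens):
        # tokens: (B, T) integer word ids
        x = self.embed(tokens)                 # (B, T, emb_dim)
        word_feats, (h_n, _) = self.rnn(x)     # (B, T, 2*hidden_dim)
        # Concatenate the final hidden states of both directions as the
        # sentence embedding s; word_feats serve as the word embeddings w.
        sent_emb = torch.cat([h_n[0], h_n[1]], dim=1)   # (B, 2*hidden_dim)
        return word_feats, sent_emb

# Usage: encode a batch of two 12-word captions.
w, s = TextEncoder()(torch.randint(0, 5000, (2, 12)))
```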
GLAM: Global-Local Collaborative Module for Cascading Image Generation
Next, a multi-stage cascaded generator is constructed by stacking three image generation networks.
The paper adopts the basic structure described in "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks," because it performs well in generating realistic images.
Let {F0, F1, ..., Fm-1} denote the m visual feature transformers and {G0, G1, ..., Gm-1} denote the m image generators. At each stage, Fi produces the visual features fi from the previous stage's features and the attended text context, and Gi maps fi to the generated image Ii.
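The following is a simplified sketch of such a cascaded structure with word-level attention (global sentence attention is omitted for brevity). The attention formulation, layer sizes, and module names are illustrative assumptions rather than the paper's implementation.

```python
# Simplified sketch of a cascaded generator with word-level attention
# (a rough analogue of the Fi / Gi structure; layer sizes and the
# attention formulation are simplified assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttention(nn.Module):
    """Attend over word features w for every spatial location of f."""
    def __init__(self, feat_dim, word_dim):
        super().__init__()
        self.proj = nn.Linear(word_dim, feat_dim)  # map words into feature space

    def forward(self, f, w):
        B, C, H, W = f.shape
        wp = self.proj(w)                                    # (B, T, C)
        q = f.view(B, C, H * W).transpose(1, 2)              # (B, HW, C)
        attn = F.softmax(q @ wp.transpose(1, 2), dim=-1)     # (B, HW, T)
        ctx = (attn @ wp).transpose(1, 2).view(B, C, H, W)   # word context per pixel
        return ctx

class Stage(nn.Module):
    """Fi: fuse previous features with the attention context; Gi: emit an image."""
    def __init__(self, feat_dim, word_dim):
        super().__init__()
        self.attn = WordAttention(feat_dim, word_dim)
        self.fuse = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(feat_dim * 2, feat_dim, 3, 1, 1), nn.ReLU(inplace=True),
        )
        self.to_img = nn.Conv2d(feat_dim, 3, 3, 1, 1)        # Gi

    def forward(self, f_prev, w):
        ctx = self.attn(f_prev, w)
        f_i = self.fuse(torch.cat([f_prev, ctx], dim=1))
        return f_i, torch.tanh(self.to_img(f_i))

# Usage: start from coarse features f0 and refine over two stages.
f0, words = torch.randn(2, 64, 16, 16), torch.randn(2, 12, 256)
stages = nn.ModuleList([Stage(64, 256), Stage(64, 256)])
images, f = [], f0
for stage in stages:
    f, img = stage(f, words)
    images.append(img)   # 32x32, then 64x64 images, coarse to fine
```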
STREAM: Semantic text regeneration and alignment module
As noted above, MirrorGAN includes a semantic text regeneration and alignment module (STREAM) to regenerate, from the generated image, a textual description that is semantically aligned with the given one.
Specifically, a widely used encoder-decoder image captioning framework is adopted as the basic STREAM architecture.
The image encoder is a convolutional neural network (CNN) pre-trained on ImageNet, and the decoder is an RNN. The image I_{m-1} produced by the final generator is fed into the CNN encoder, and the resulting feature initializes the RNN decoder, which then predicts the description word by word.
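A minimal sketch of such an encoder-decoder captioner is shown below, using an ImageNet-pretrained ResNet-18 as a stand-in CNN encoder and an LSTM as the decoder. The vocabulary size and hidden dimensions are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of a STREAM-style captioner: an ImageNet-pretrained CNN encodes
# the generated image and an LSTM decodes a word sequence. Vocabulary size and
# hidden sizes are illustrative; this is not the paper's exact implementation.
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the fc layer
        self.img_proj = nn.Linear(512, emb_dim)   # image feature -> word embedding space
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image, captions):
        # image: (B, 3, 224, 224); captions: (B, T) ground-truth word ids
        with torch.no_grad():                       # keep the pre-trained CNN frozen
            feat = self.encoder(image).flatten(1)   # (B, 512)
        x0 = self.img_proj(feat).unsqueeze(1)       # image feature as the first "word"
        x = torch.cat([x0, self.embed(captions)], dim=1)
        h, _ = self.rnn(x)
        return self.out(h)                          # (B, T+1, vocab_size) word logits

# The cross-entropy between these logits and the input caption provides the
# text-reconstruction signal that is fed back to the generator.
```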
Experimental results: best results on the COCO and CUB data sets
So, how strong is MirrorGAN's performance?
First, look at the comparison of MirrorGAN with other state-of-the-art T2I methods, including GAN-INT-CLS, GAWWN, StackGAN, StackGAN++, PPGN, and AttnGAN.
The experiments use two mainstream data sets, the COCO data set and the CUB bird data set:
- The CUB bird data set contains 8,855 training images and 2,933 test images covering 200 categories; each bird image has 10 text descriptions;
- The COCO data set contains 82,783 training images and 40,504 validation images; each image has 5 text descriptions.
The results are shown in Table 1:
Table 2 shows the R precision scores of AttnGAN and MirrorGAN on the CUB and COCO data sets.
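For reference, R-precision is typically computed by ranking candidate captions against each generated image by the cosine similarity of their encoded features, and counting how often the ground-truth caption ranks first. The sketch below illustrates that ranking step, assuming image and text feature encoders are given; the counts and dimensions are placeholders.

```python
# Sketch of an R-precision-style check: rank one ground-truth caption against
# R-1 mismatched captions by cosine similarity of encoded features.
# Feature encoders are assumed to be given; dimensions are placeholders.
import torch
import torch.nn.functional as F

def r_precision(img_feats, true_text_feats, distractor_text_feats):
    # img_feats: (N, D); true_text_feats: (N, D); distractor_text_feats: (N, R-1, D)
    cands = torch.cat([true_text_feats.unsqueeze(1), distractor_text_feats], dim=1)
    sims = F.cosine_similarity(img_feats.unsqueeze(1), cands, dim=-1)  # (N, R)
    hits = (sims.argmax(dim=1) == 0).float()   # index 0 holds the true caption
    return hits.mean().item()

# Usage with random placeholder features: 8 images, 99 distractors each.
score = r_precision(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 99, 256))
```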
In all comparisons, MirrorGAN shows a clear advantage, demonstrating the superiority of the proposed text-to-image-to-text framework and global-local collaborative attention module, since MirrorGAN generates high-quality images whose semantics are consistent with the input text descriptions.
This article is reproduced from the WeChat public account Xinzhiyuan.