This article is reproduced from Heart of the Machine. Original address:

The latest research from Google Brain proposes using neural architecture search to find a better Transformer and so achieve better performance. The search produced a new architecture called the Evolved Transformer, which outperforms the original Transformer on four well-established language tasks (WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech, and LM1B).

In the past few years, great progress has been made in the field of neural architecture search. Models obtained through reinforcement learning and evolution have been shown to surpass human-designed models (Real et al., 2019; Zoph et al., 2018). Most of these advances focus on improving image models, but some studies are dedicated to improving sequence models (Zoph & Le, 2017; Pham et al., 2018). In those studies, however, researchers have concentrated on improving recurrent neural networks (RNNs), which have long been used to solve sequence problems (Sutskever et al., 2014; Bahdanau et al., 2015).

However, recent research has shown that RNNs are not the best way to solve sequence problems. Thanks to the success of convolutional networks such as Convolution Seq2Seq (Gehring et al., 2017) and fully attention-based networks such as the Transformer (Vaswani et al., 2017), feedforward networks can now be used to solve seq2seq tasks. Their main advantage is that they are faster and easier to train than RNNs.

This paper aims to test whether neural architecture search can design a better feedforward architecture for seq2seq tasks. Specifically, the Google Brain researchers used tournament selection architecture search to evolve a better, more efficient architecture starting from the Transformer, which is considered the best-performing and most widely used architecture. To achieve this, they constructed a search space that reflects the latest advances in feedforward seq2seq models and developed a method called Progressive Dynamic Hurdles (PDH), which allows the search to be run directly on the computationally demanding WMT 2014 English-German translation task. The search produced a new architecture called the Evolved Transformer, which outperforms the original Transformer on four well-established language tasks (WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech, and LM1B). In experiments with large models, the Evolved Transformer is twice as efficient as the Transformer in terms of FLOPS, with no loss in quality. In a small model (about 7M parameters) that is better suited to mobile devices, the Evolved Transformer's BLEU score is 0.7 higher than the Transformer's.

Paper: The Evolved Transformer

Paper link:

Abstract: Recent research has highlighted the strength of the Transformer on sequence tasks. At the same time, neural architecture search has advanced to the point where it can outperform human-designed models. The goal of this paper is to use architecture search to find a better Transformer architecture. We first construct a large search space based on the latest developments in feedforward sequence models, and then run evolutionary architecture search, seeding our initial population with the Transformer. To run this search efficiently on the computationally expensive WMT 2014 English-German translation task, we develop the Progressive Dynamic Hurdles method, which allows us to dynamically allocate more resources to more promising candidate models. The architecture found in our experiments, the Evolved Transformer, outperforms the Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech, and the One Billion Word Language Model Benchmark (LM1B). In experiments with large models, the Evolved Transformer is twice as efficient as the Transformer in terms of FLOPS, with no loss in quality. In a small model (about 7M parameters) that is better suited to mobile devices, the Evolved Transformer's BLEU score on the WMT'14 English-German task is 0.7 higher than the Transformer's.
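The idea of giving more training resources only to the more promising candidates can be illustrated with a minimal Python sketch. The abstract does not spell out the procedure, so everything concrete here is an assumption made for illustration: the staged step budgets, the rule that a hurdle is the mean fitness of the current population, and the names `train_with_hurdles`, `mean_hurdle`, and `train_and_eval`.

```python
def train_with_hurdles(train_and_eval, hurdles, step_increments):
    """Train a candidate in stages, stopping early if it misses a hurdle.

    train_and_eval(total_steps) -> fitness is a hypothetical hook that
    trains the candidate up to total_steps and returns its validation
    fitness (higher is better).
    hurdles[i] is the fitness a candidate must exceed after stage i to
    earn the training steps of stage i + 1.
    Returns (last_fitness, total_steps_spent).
    """
    total_steps = 0
    fitness = None
    for stage, extra_steps in enumerate(step_increments):
        total_steps += extra_steps
        fitness = train_and_eval(total_steps)
        # A candidate that fails the hurdle for this stage gets no more steps.
        if stage < len(hurdles) and fitness <= hurdles[stage]:
            break
    return fitness, total_steps


def mean_hurdle(population_fitnesses):
    """Assumed hurdle rule: the mean fitness of the current population."""
    return sum(population_fitnesses) / len(population_fitnesses)


# Toy usage with stand-in fitness functions instead of real training.
hurdle = mean_hurdle([0.2, 0.4, 0.6])
weak = train_with_hurdles(lambda steps: 0.3, [hurdle], [100, 200])
strong = train_with_hurdles(lambda steps: 0.5 + steps / 10000, [hurdle], [100, 200])
```

In this toy run the weak candidate stops after the first 100 steps, while the strong one clears the hurdle and receives the full 300 steps, so most of the compute goes to the candidate that looks better early on.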


The researchers used evolution-based architecture search because it is simple and has been shown to be more efficient than reinforcement learning when resources are limited (Real et al., 2019). They used the same tournament selection algorithm as Real et al. (2019), but omitted the aging regularization. The algorithm is roughly as follows.

Tournament selection evolutionary architecture search first defines a gene encoding that describes a neural network architecture; an initial population of individuals is then created by sampling from the space of gene encodings. Each individual is assigned a fitness by training the neural network it describes on the target task and evaluating its performance on the task's validation set. The population is then repeatedly sampled to produce a subpopulation, from which the individual with the highest fitness is selected as a parent. The selected parent's gene encoding is mutated (an encoding field is randomly changed to a different value) to produce a child model. The child model is then assigned a fitness by training and evaluating it on the target task, just as for the initial population. When this fitness evaluation finishes, the population is sampled again, and the individual with the lowest fitness in the subpopulation is removed from the population. The newly evaluated child model is then added to the population in its place. This process is repeated until the population contains individuals with high fitness, which in this paper represent well-performing architectures.
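The loop described above can be sketched in a few lines of Python. This is an illustrative sketch, not the researchers' implementation: the function name `tournament_search`, its parameters, and the toy genome (a list of six integers with a stand-in fitness function) are all assumptions. In the real search, computing the fitness would mean training the encoded network and scoring it on the validation set.

```python
import random


def tournament_search(random_genome, mutate, fitness, population_size=20,
                      tournament_size=5, num_children=100, seed=0):
    """Tournament selection evolution without aging regularization.

    random_genome(rng), mutate(genome, rng) and fitness(genome) are
    caller-supplied (hypothetical) hooks. Returns (best_fitness, best_genome).
    """
    rng = random.Random(seed)
    # Initial population, stored as (fitness, genome) pairs.
    population = []
    for _ in range(population_size):
        genome = random_genome(rng)
        population.append((fitness(genome), genome))

    for _ in range(num_children):
        # Sample a subpopulation and take its fittest member as the parent.
        parent = max(rng.sample(population, tournament_size), key=lambda p: p[0])
        child = mutate(parent[1], rng)
        child_entry = (fitness(child), child)
        # Sample again and remove the least fit member of that subpopulation.
        loser = min(rng.sample(population, tournament_size), key=lambda p: p[0])
        population.remove(loser)
        # The newly evaluated child takes the removed individual's place.
        population.append(child_entry)

    return max(population, key=lambda p: p[0])


# Toy problem (assumed for illustration): genomes are six integers in 0..9.
def toy_random_genome(rng):
    return [rng.randrange(10) for _ in range(6)]


def toy_mutate(genome, rng):
    child = list(genome)
    i = rng.randrange(len(child))
    # Change one encoding field to a different value.
    child[i] = rng.choice([v for v in range(10) if v != child[i]])
    return child


def toy_fitness(genome):
    # Stand-in for "train and evaluate": best possible score is 0.
    return -sum(abs(v - 7) for v in genome)


best_fitness, best_genome = tournament_search(
    toy_random_genome, toy_mutate, toy_fitness)
```

Because no aging is applied, an individual stays in the population until it loses a removal tournament, so strong individuals can persist for the whole run.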


In this section, we first benchmark our search method, Progressive Dynamic Hurdles, against other evolutionary search methods. We then benchmark the Evolved Transformer against the Transformer.

Table 1: Validation perplexity of the top model for various search settings. The number of models was chosen to equalize resource consumption.

Figure 3: Architecture cells of the Transformer and the Evolved Transformer (ET). The four most notable aspects of the found architecture are: 1. wide depthwise separable convolutions; 2. Gated Linear Units; 3. branching structures; 4. swish activation functions. The ET encoder and decoder independently developed a branched lower portion of wide convolutions. In both architectures, the latter portion is the same as the Transformer.
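Two of the components named in the caption are simple enough to show directly. The sketch below uses the commonly cited definitions, swish(x) = x * sigmoid(x) and gating by sigmoid, written in plain Python on scalars and lists. The function names are chosen here for illustration, and how these pieces are wired into the ET cell is not shown.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    """Swish activation: the input scaled by its own sigmoid."""
    return x * sigmoid(x)


def gated_linear_unit(values, gates):
    """Element-wise gating: each value is scaled by sigmoid(gate).

    In a full Gated Linear Unit, values and gates would be two separate
    linear projections of the same input; here they are passed in directly.
    """
    return [v * sigmoid(g) for v, g in zip(values, gates)]


activated = swish(1.0)
gated = gated_linear_unit([2.0, -4.0], [0.0, 0.0])
```

A gate of 0 lets half of the value through, large positive gates pass the value almost unchanged, and large negative gates suppress it.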

Figure 4: Performance comparison of Evolved Transformer and Transformer across various FLOPS sizes.

Table 2: Encoder-decoder WMT'14 comparison on 8 NVIDIA P100 GPUs. Each model was trained 10 to 15 times, depending on available resources. Perplexity is calculated on the validation set, and BLEU is calculated on the test set.

Table 3: Comparison of the Transformer and ET trained on 16 TPU v.2 chips. For the translation tasks, perplexity is calculated on the validation set and BLEU is calculated on the test set. For the LM1B task, perplexity is calculated on the test set. ET shows a consistent improvement of at least one standard deviation on all tasks. It excels at the base size, the size at which the search was conducted, with a BLEU improvement of 0.6 on both the English-French and English-Czech tasks.

Table 4: Mutation ablations. The first 5 columns describe each mutation. Columns 6 and 7 give the perplexities of the augmented Transformer and ET on the WMT'14 En-De validation set. Columns 8 and 9 show the difference between the mean perplexity of the unaugmented base model and the mean perplexity of the augmented model. Red cells indicate evidence that the corresponding mutation hurts overall performance. Green cells indicate evidence that the corresponding mutation helps overall performance.
