This article is reproduced from the "Microsoft Research AI Headlines" WeChat public account (original address).
Existing machine translation systems require large amounts of parallel text as training data, so machine translation performs well only for the small number of languages with sufficient data. How to train a machine translation model without parallel corpora, i.e., unsupervised training, has therefore become a hot research topic. Facebook's EMNLP 2018 paper "Phrase-Based & Neural Unsupervised Machine Translation" uses cross-lingual word embeddings and achieves gains of up to 11 BLEU points. How did Facebook do it?
The first step is to have the system learn a bilingual dictionary. The system first trains word embeddings for each language, where the embeddings are trained to predict the words around a given word from its context. Word embeddings in different languages have similar neighborhood structures, so the system can learn a rotation that transforms the embedding space of one language to match the embedding space of another. From this alignment, a reasonably accurate bilingual dictionary can be extracted, enabling basic word-by-word translation. With a language model and this initial word-by-word translation model, an early version of the translation system can be built.
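To make the alignment step concrete, here is a minimal sketch of orthogonal Procrustes alignment between two embedding spaces, assuming a small seed dictionary of word pairs is available; the paper's approach actually learns this mapping without any dictionary (e.g., adversarially) before refining it in essentially this way. The function name and toy dimensions are illustrative.

```python
import numpy as np

def procrustes_align(X, Y):
    """Solve min_W ||X W - Y||_F over orthogonal W (orthogonal Procrustes).

    X, Y: (n, d) arrays of source- and target-language embeddings for
    n word pairs drawn from a (possibly noisy) seed dictionary.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # rotation mapping the source space onto the target space

# Toy check: recover a known rotation between two 300-dimensional spaces.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 300))
true_W, _ = np.linalg.qr(rng.standard_normal((300, 300)))  # random rotation
Y = X @ true_W
W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))  # True: the rotation is recovered
```

Once the spaces are aligned, nearest neighbors across languages serve as the bilingual dictionary for word-by-word translation.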
The system's own translated sentences are then treated as if they were real labeled data and used to train a machine translation system in the reverse direction, while a language model trained on monolingual text provides a measure of fluency and grammaticality. The artificially generated parallel sentences produced by this back-translation are combined with the language model's corrections to train the translation system.
Training in this way produces a back-translated dataset that improves the original machine translation system. As one system improves, it can in turn be used to generate training data for the system in the opposite direction, and this process can be iterated as many times as needed.
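The data flow of this loop can be sketched as follows. The ToyTranslator below is a hypothetical word-for-word stand-in for a real NMT or SMT model, so only the shape of the procedure, not the modeling, matches the real system.

```python
class ToyTranslator:
    """Hypothetical word-for-word translator standing in for a real model."""

    def __init__(self, dictionary):
        self.dictionary = dict(dictionary)  # word -> word mapping

    def translate(self, sentence):
        # Unknown words are copied through unchanged.
        return " ".join(self.dictionary.get(w, w) for w in sentence.split())

    def train_on(self, pairs):
        # Stand-in for a real training step: absorb the word
        # correspondences found in the synthetic parallel sentences.
        for src, tgt in pairs:
            self.dictionary.update(zip(src.split(), tgt.split()))

mono_src = ["the cat sleeps", "the dog runs"]  # source monolingual text
mono_tgt = ["le chat dort", "le chien court"]  # target monolingual text
st = ToyTranslator({"the": "le"})  # source->target seed from aligned embeddings
ts = ToyTranslator({"le": "the"})  # target->source seed

for _ in range(3):
    # Pair real target sentences with synthetic sources so the target side
    # stays clean, train source->target, then swap the roles.
    st.train_on((ts.translate(t), t) for t in mono_tgt)
    ts.train_on((st.translate(s), s) for s in mono_src)

print(st.dictionary)  # the seed entry plus correspondences learned per round
```

The essential point is that each direction's improving output becomes the other direction's training data.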
Word-by-word embedding initialization, language modeling, and back-translation are the three key principles of unsupervised machine translation. Translation systems built on these principles can be either unsupervised neural models or count-based statistical (phrase-based) models: starting from a trained neural model and further training it on sentences back-translated by the phrase-based model finally yields a fluent, high-accuracy model.
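For the language-modeling principle, the paper trains language models as denoising autoencoders on monolingual text: the model learns to reconstruct a sentence from a corrupted copy of it. Below is a minimal sketch of such a corruption function, with illustrative parameter names and values.

```python
import random

def corrupt(tokens, drop_prob=0.1, shuffle_window=3):
    """Noise a sentence for denoising-autoencoder training: drop each word
    with probability drop_prob, then shuffle the survivors locally so that
    no word moves more than a few positions."""
    kept = [t for t in tokens if random.random() >= drop_prob] or tokens[:1]
    # Perturb each index with bounded uniform noise and re-sort: a cheap
    # way to obtain a local shuffle.
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]

print(corrupt("the quick brown fox jumps over the lazy dog".split()))
```

Reconstructing clean sentences from such corrupted input forces the model to learn word order and fluency from monolingual data alone.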
Microsoft Research Asia's Natural Language Computing group has also explored unsupervised machine translation. The researchers used posterior regularization to introduce SMT (statistical machine translation) into the unsupervised NMT training process, alternately optimizing the SMT and NMT models through an EM procedure. In this way, the noise that accumulates during unsupervised NMT iterations can be effectively removed, while the NMT model compensates for the SMT model's weakness in sentence fluency. The related paper, "Unsupervised Neural Machine Translation with SMT as Posterior Regularization," has been accepted by AAAI 2019.
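A hedged outline of how such an alternating optimization can be organized is below; every class and function is a hypothetical stub (a real system would plug in phrase-table estimation and NMT training), and the paper's actual posterior-regularization term coupling the two models is omitted.

```python
class StubModel:
    """Placeholder model so the outline runs end to end."""

    def translate(self, sentence):
        return sentence  # identity stand-in for real decoding

def build_smt_from(pseudo_pairs):
    # Stand-in for estimating an SMT model from synthetic parallel data;
    # its phrase table keeps only reliable correspondences, acting as a
    # denoiser for the NMT model's noisy output.
    return StubModel()

def train_nmt_on(nmt, pairs):
    # Stand-in for fine-tuning the NMT model on the SMT model's cleaner
    # translations, restoring the fluency that SMT output lacks.
    return nmt

def alternate_smt_nmt(nmt, mono_src, rounds=4):
    for _ in range(rounds):
        pseudo = [(s, nmt.translate(s)) for s in mono_src]  # NMT proposes
        smt = build_smt_from(pseudo)                        # SMT denoises
        cleaned = [(s, smt.translate(s)) for s in mono_src]
        nmt = train_nmt_on(nmt, cleaned)                    # NMT refines
    return nmt

nmt = alternate_smt_nmt(StubModel(), ["the cat sleeps"])
```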