Unsupervised learning

Recommended reading

Facebook uses cross-lingual word embeddings to achieve unsupervised machine translation

This article is reproduced from the Microsoft Research AI Headlines official account. Original address

Existing machine translation systems require large amounts of parallel text as training data, so machine translation performs well only for the small number of languages with sufficient data. How to train machine translation models without parallel corpora, i.e., unsupervised training, has therefore become a hot research topic. Facebook's EMNLP 2018 paper "Phrase-Based & Neural Unsupervised Machine Translation" uses cross-lingual word embeddings to achieve improvements of up to 11 BLEU points. How does Facebook do it?

The first step is to have the system learn a bilingual dictionary. The system first trains word embeddings for every word in each language; the embeddings are trained to predict the words that appear around a given word. Because word embeddings in different languages have similar neighborhood structures, the system can learn a rotation that transforms the embedding space of one language to match that of another. From this alignment, a reasonably accurate bilingual dictionary can be extracted, making basic word-by-word translation possible. With a language model and this initial word-by-word translation model, an early version of the translation system can be built.
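As an illustration of the rotation step: once a set of anchor word pairs is available, the optimal rotation has a closed-form orthogonal Procrustes solution (in the fully unsupervised setting the initial anchors are found without any dictionary, e.g. adversarially; the toy data below is a hypothetical stand-in, not the paper's setup):

```python
import numpy as np

def procrustes_rotation(src_anchor, tgt_anchor):
    """Orthogonal matrix W minimizing ||src_anchor @ W - tgt_anchor||_F,
    obtained from the SVD of src^T tgt (classic Procrustes solution)."""
    u, _, vt = np.linalg.svd(src_anchor.T @ tgt_anchor)
    return u @ vt

# Toy data: 300-dim embeddings for 1,000 anchor word pairs, where the
# "target" space is an exact rotation of the "source" space.
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 300))
true_rotation, _ = np.linalg.qr(rng.normal(size=(300, 300)))
tgt = src @ true_rotation

W = procrustes_rotation(src, tgt)
print(np.allclose(W, true_rotation))  # should print True
```

Translating a word then amounts to mapping its embedding through W and taking the nearest neighbor in the target space; reading off these nearest neighbors is what yields the bilingual dictionary.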

Next, sentences translated by the system are treated as though they were labeled real data, and a machine translation system is trained in the reverse direction. A language model trained on monolingual text supplies fluency and grammaticality corrections, and these are combined with the artificial parallel sentences generated by back-translation to train the translation system.

This process yields back-translated datasets that improve the original machine translation system. As one direction of the system improves, it can in turn be used to generate training data for the system in the opposite direction, and the procedure can be iterated as many times as needed, as sketched below.
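A minimal sketch of this iterative back-translation loop, assuming hypothetical `translate` and `train` methods on two directional models (a real system would use an NMT or phrase-based toolkit and mix in the language-model corrections described above):

```python
# Hypothetical skeleton: only the data flow is taken from the description
# above; the model interfaces are illustrative, not a real toolkit's API.
def iterative_back_translation(mono_src, mono_tgt, src2tgt, tgt2src, rounds=3):
    for _ in range(rounds):
        # Back-translate target monolingual text into the source language
        # and pair it with the original target sentences as pseudo data.
        synthetic_src = [tgt2src.translate(t) for t in mono_tgt]
        src2tgt.train(list(zip(synthetic_src, mono_tgt)))

        # The improved src->tgt model now produces training data for the
        # opposite direction, and the loop repeats.
        synthetic_tgt = [src2tgt.translate(s) for s in mono_src]
        tgt2src.train(list(zip(synthetic_tgt, mono_src)))
    return src2tgt, tgt2src
```

Each round the pseudo-parallel data becomes cleaner, so the two directions improve together.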

Word-by-word embedding initialization, language modeling, and back-translation are the three key principles of unsupervised machine translation. Translation systems built on these principles can be applied both to unsupervised neural models and to count-based statistical (phrase-based) models: starting from a trained neural model and training it further on back-translated sentences generated by the phrase-based model finally yields a fluent, high-accuracy model.

Microsoft Research Asia's Natural Language Computing group has also explored unsupervised machine translation. The researchers used posterior regularization to introduce SMT (statistical machine translation) into the unsupervised NMT training process, alternately optimizing the SMT and NMT models through an EM procedure, so that the noise accumulated during unsupervised NMT iterations is effectively removed, while the NMT model in turn compensates for the SMT model's weaknesses in sentence fluency. The related paper, "Unsupervised Neural Machine Translation with SMT as Posterior Regularization," has been accepted by AAAI 2019.
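A loose schematic of that alternation, with hypothetical `fit` and `translate` interfaces standing in for the paper's actual SMT and NMT components:

```python
# Illustrative only: the real method regularizes NMT posteriors with SMT
# inside an EM loop; here the two models simply retrain on each other's
# output to show the alternating structure described above.
def alternate_smt_nmt(mono_src, nmt, smt, rounds=4):
    for _ in range(rounds):
        # SMT is rebuilt from NMT's pseudo translations; its phrase table
        # keeps only frequent, consistent fragments, filtering out noise.
        smt.fit([(s, nmt.translate(s)) for s in mono_src])

        # NMT then retrains on the denoised SMT output, recovering
        # the fluency that the phrase-based model lacks.
        nmt.fit([(s, smt.translate(s)) for s in mono_src])
    return nmt
```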

Recommended reading

The "three brothers" of machine learning: the concepts of "supervised learning", "unsupervised learning", and "reinforcement learning"

In this article, we will help you better understand the definitions and implications of supervised learning, unsupervised learning, and reinforcement learning, and explain how they relate to machine learning from a broader perspective. A deep understanding of these concepts will not only help you navigate the literature in this field with ease, but also guide you in keenly tracking developments in AI and advances in the technology.