Natural language processing
Facebook uses cross-lingual word embeddings to achieve unsupervised machine translation
This article is reproduced from the WeChat public account of Microsoft Research AI Headlines. Original address
Existing machine translation systems require large amounts of parallel (translated) text as training data, so machine translation performs well only for the small number of language pairs with sufficient data. How to train machine translation models without parallel source–target text, i.e., unsupervised training, has therefore become a hot research topic. Facebook's EMNLP 2018 paper "Phrase-Based & Neural Unsupervised Machine Translation" uses cross-lingual word embeddings and gains up to 11 BLEU points over prior work. How did Facebook do it?
The first step is to have the system learn a bilingual dictionary. The system first trains word embeddings for each language, where the embeddings are trained to predict the words surrounding a given word from its context. Because word embeddings in different languages exhibit similar neighborhood structures, the system can learn a rotation that transforms the embedding space of one language to match that of another. From this alignment, a reasonably accurate bilingual dictionary can be derived, enabling rough word-by-word translation. With a language model and this initial word-by-word translation model, an early version of the translation system can be built.
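To make the rotation step concrete, here is a minimal sketch of the orthogonal Procrustes alignment commonly used in this line of work. It is an illustration, not Facebook's actual code: `src_emb` and `tgt_emb` are assumed to be (n, d) arrays holding the embeddings of n seed-dictionary word pairs, one pair per row.

```python
# A minimal sketch of orthogonal Procrustes alignment between two
# monolingual embedding spaces (assumed setup, not Facebook's code).
import numpy as np

def learn_mapping(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """Find the orthogonal W minimizing ||W @ x - y|| over the seed pairs.

    Closed-form solution: W = U @ Vt, where U, S, Vt = SVD(Y^T X).
    """
    u, _, vt = np.linalg.svd(tgt_emb.T @ src_emb)
    return u @ vt

def nearest_target_word(src_vec: np.ndarray, W: np.ndarray,
                        tgt_matrix: np.ndarray) -> int:
    """Rotate a source vector into target space and return the index of
    the most cosine-similar target word (word-by-word translation)."""
    mapped = W @ src_vec
    sims = (tgt_matrix @ mapped) / (
        np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(mapped) + 1e-9)
    return int(np.argmax(sims))
```

Real systems refine this basic scheme, for example with better nearest-neighbor retrieval and iterative re-estimation of the seed dictionary.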
The system then treats its own translated sentences as if they were real labeled data and uses them to train a machine translation system in the reverse direction. A language model trained on monolingual text makes the output more fluent and grammatically correct; the artificially generated parallel sentences from this back-translation are combined with the language model's corrections to train the translation system.
This training produces a back-translated data set, which in turn improves the original machine translation system. As one system improves, it can be used to iteratively generate training data for the system in the opposite direction, repeating for as many iterations as needed.
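The two-direction loop described above might look like the following sketch. The callables here (`s2t`, `t2s`, `retrain`) are hypothetical placeholders for real NMT components, not Facebook's actual interfaces; `retrain` is assumed to fit a fresh model on (input, output) sentence pairs and return it as a callable.

```python
# Sketch of iterative back-translation with placeholder components.
from typing import Callable, List, Tuple

Translator = Callable[[str], str]
Pairs = List[Tuple[str, str]]

def iterative_back_translation(
    mono_src: List[str],        # monolingual source-language sentences
    mono_tgt: List[str],        # monolingual target-language sentences
    s2t: Translator,            # initial word-by-word src->tgt model
    t2s: Translator,            # initial word-by-word tgt->src model
    retrain: Callable[[Pairs], Translator],
    n_rounds: int = 3,
) -> Tuple[Translator, Translator]:
    for _ in range(n_rounds):
        # Back-translate real target sentences into synthetic source ones;
        # (synthetic source, real target) pairs supervise the s2t model.
        s2t = retrain([(t2s(t), t) for t in mono_tgt])
        # Symmetrically, the improved s2t model generates synthetic targets
        # that supervise the t2s model for the next round.
        t2s = retrain([(s2t(s), s) for s in mono_src])
    return s2t, t2s
```

The key design point is that each model is always trained toward real sentences on the output side, so the noise from the synthetic inputs does not degrade the fluency of what it learns to produce.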
Word-by-word embedding initialization, language modeling, and back-translation are thus the three key principles of unsupervised machine translation. These principles apply both to unsupervised neural models and to count-based statistical (phrase-based) models: starting from the trained neural model and training it further on additional back-translated sentences produced by the phrase-based model finally yields a fluent, high-accuracy model.
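As an illustration of the language-modeling principle, here is a toy bigram language model with add-one smoothing that can score candidate translations for fluency. The corpus is invented for the example; real systems train far larger n-gram or neural LMs on monolingual data.

```python
# Toy bigram language model for fluency scoring (illustrative only).
from collections import Counter
import math

def train_bigram_lm(corpus):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        uni.update(toks[:-1])           # contexts only; </s> never a context
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def log_prob(sentence, uni, bi, vocab_size):
    """Add-one smoothed log P(sentence); higher means more fluent."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bi[(a, b)] + 1) / (uni[a] + vocab_size))
        for a, b in zip(toks, toks[1:])
    )

uni, bi = train_bigram_lm(["the cat sat", "the dog sat", "a cat ran"])
print(log_prob("the cat ran", uni, bi, vocab_size=len(uni) + 1))
```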
The Natural Language Computing group at Microsoft Research Asia has also explored unsupervised machine translation. The researchers used posterior regularization to introduce SMT (statistical machine translation) into the unsupervised NMT training process, alternately optimizing the SMT and NMT models through an EM procedure. This effectively removes the noise generated during unsupervised NMT iterations, while the NMT model in turn compensates for the SMT model's weaknesses in sentence fluency. The related paper, "Unsupervised Neural Machine Translation with SMT as Posterior Regularization", has been accepted at AAAI 2019.
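A rough sketch of that alternation follows. The `smt` and `nmt` objects and their `translate`/`retrain` methods are invented placeholders for this sketch, not the paper's actual implementation.

```python
# EM-style alternation between SMT and NMT (illustrative placeholders).
def alternate_smt_nmt(mono_src, smt, nmt, n_iters=4):
    for _ in range(n_iters):
        # The NMT model produces pseudo-parallel data; retraining the SMT
        # model on it acts as a denoising step, since the phrase table
        # keeps only frequent, consistent translation patterns.
        smt = smt.retrain([(s, nmt.translate(s)) for s in mono_src])
        # The denoised SMT translations then supervise the NMT model,
        # which contributes the fluency that the SMT model lacks.
        nmt = nmt.retrain([(s, smt.translate(s)) for s in mono_src])
    return smt, nmt
```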
[Repost] Why is it so hard to put NLP technology into production, and what are the pitfalls?
CSDN's Editor-in-Chief Afternoon Tea session invited Li Bo, rotating chairman and chief architect of the Xiaoi Robot technical committee, to discuss with us the difficulties of putting NLP technology into production and how to lower the barrier for developers.
[Repost] In-depth read: a 2018 report on NLP applications and commercialization
What problems does natural language processing technology face in commercial applications? Why has there been no major breakthrough? And where does the key to solving these problems lie?