The 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019) starts this week in Florence, Italy. We took this opportunity to review the main research trends in the lively NLP space and to draw out some implications from a business perspective. The article is backed by a statistical and, as you might guess, NLP-based analysis of ACL papers from the past 20 years.

1. Motivation

Natural language is one of the main USPs of the human mind compared with other species. NLP, a major buzzword in today's technical discussions, deals with how computers can understand and generate language. The rise of NLP over the past few decades has been backed by several global developments: the general hype around AI, exponential progress in the field of deep learning, and an ever-increasing amount of available text data. But what is the substance behind the buzz? In fact, NLP is a highly complex, interdisciplinary field that is continuously fed by high-quality basic research in linguistics, mathematics and computer science. The ACL conference brings these different perspectives together. As the figure below shows, research activity has flourished in the past few years:

Figure 1: Number of papers published at the ACL conference

In the following, we summarize some of the core trends in data strategies, algorithms, tasks, and multilingual NLP. The analysis is based on ACL papers published since 1998, which were processed using a domain-specific ontology for NLP and machine learning.

2. Data: Resolving bottlenecks

The amount of freely available text data has grown exponentially, mainly due to the massive production of web content. However, this large body of data comes with some key challenges. First, big data is inherently noisy. Think of natural resources such as oil and metals: they need to be refined and purified before they can be used in the final product. The same is true for data. Generally speaking, the more "democratic" the production channel, the dirtier the data, which means that more effort must be spent on cleaning it. For example, data from social media requires longer cleaning pipelines. Among other things, you need to deal with extravagances of self-expression such as emojis and irregular punctuation, which are usually absent from more formal settings such as scientific papers or legal contracts.
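As a rough illustration of what one stage of such a cleaning pipeline might look like, here is a minimal sketch using only the Python standard library. The function names and the specific rules (dropping URLs, @-mentions and emoji-like symbols, collapsing repeated punctuation) are assumptions chosen for this example, not a recommended recipe; a real pipeline would be tuned to the data source and the downstream task.

```python
import re
import unicodedata

# Illustrative patterns; real pipelines usually need many more rules.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_PUNCT_RE = re.compile(r"([!?.,])\1+")  # "!!!" or "..." style runs
WHITESPACE_RE = re.compile(r"\s+")


def strip_symbols(text: str) -> str:
    """Drop characters in Unicode category 'So' (covers most emojis)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")


def clean_social_text(text: str) -> str:
    """Apply the cleaning steps in a fixed order and normalize whitespace."""
    text = URL_RE.sub(" ", text)             # remove links
    text = MENTION_RE.sub(" ", text)         # remove @-mentions
    text = strip_symbols(text)               # remove emoji-like symbols
    text = REPEAT_PUNCT_RE.sub(r"\1", text)  # "!!!" -> "!"
    return WHITESPACE_RE.sub(" ", text).strip()
```

For instance, `clean_social_text("OMG!!! Loving this 😍😍 @acme https://t.co/xyz   so good...")` yields `"OMG! Loving this so good."`. Note that simply discarding emojis throws away a sentiment signal; for tasks such as sentiment analysis it may be better to map them to tokens instead.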

Another major challenge is the labeled data bottleneck: strictly speaking, most state-of-the-art algorithms are supervised. They not only need labeled data; they need big labeled data. This is especially relevant for the advanced, complex algorithms of the deep learning family. Just as a child's brain needs a massive amount of input before it learns its native language, an algorithm needs a large quantity of data to go "deep" and capture language in its full complexity.

Traditionally, small-scale training data has been annotated manually. However, dedicated manual annotation of large data sets comes with efficiency trade-offs that are rarely acceptable, especially in a business context.

What are the possible solutions? On the one hand, there are some enhancements on the managerial side, including crowdsourcing and Training Data as a Service (TDaaS). On the other hand, the machine learning community has proposed a range of automated workarounds for creating annotated data sets. The figure below shows some trends:

Figure 2: Discussion of approaches for creating and reusing training data (mentions relative to the number of papers in the respective year)

Clearly, pre-training has seen the biggest rise in the past five years. In pre-training, a model is first trained on a large, general data set and subsequently adjusted with task-specific data and objectives. Its popularity is largely due to the fact that companies such as Google and Facebook are making large out-of-the-box models available to the open-source community. In particular, pre-trained word embeddings such as Word2Vec, FastText and BERT allow NLP developers to jump to the next level. Transfer learning is another approach to reusing models across different tasks. If reusing an existing model is not an option, a small amount of labeled data can be leveraged to automatically label a larger quantity of data, as is done in distant supervision. Note, however, that these methods usually lead to a decrease in labeling precision.
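To make the automatic-labeling idea concrete, here is a toy sketch of distant supervision for relation labeling. The seed pairs, the relation name `headquartered_in` and the function names are hypothetical and chosen purely for illustration: any sentence that mentions both members of a known pair is labeled with the relation, whether or not the sentence actually expresses it.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical seed knowledge: (company, city) pairs assumed to stand in a
# "headquartered_in" relation. In practice this would come from a knowledge base.
SEED_PAIRS: Set[Tuple[str, str]] = {
    ("Google", "Mountain View"),
    ("Facebook", "Menlo Park"),
}


def distant_label(sentence: str) -> Optional[str]:
    """Label a sentence if it mentions both members of a seed pair."""
    for company, city in SEED_PAIRS:
        if company in sentence and city in sentence:
            return "headquartered_in"
    return None  # no seed pair found: leave the sentence unlabeled


def label_corpus(sentences: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Attach an automatically derived label to every sentence."""
    return [(s, distant_label(s)) for s in sentences]


corpus = [
    "Google is based in Mountain View.",
    "Google employees gathered in Mountain View for a protest.",  # noisy match
    "Facebook opened a new office in London.",
]
labeled = label_corpus(corpus)
```

The second sentence receives the label although it says nothing about headquarters. This is exactly the kind of noise that explains the drop in labeling precision mentioned above.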

3. Algorithms: A chain of disruptions in deep learning

In terms of algorithms, recent research has focused heavily on the deep learning family:

Figure 3: Discussion of deep learning algorithms (mentions relative to the number of papers in the respective year)

Word embeddings are clearly taking the lead. In their basic form, word embeddings were introduced by Mikolov et al. (2013). The general linguistic principle behind word embeddings is distributional similarity: a word can be characterized by the contexts in which it occurs. Thus, as humans, we can usually complete the sentence "The customer signed the ___ today" with suitable words such as "deal" or "contract". Word embeddings allow this to be done automatically, which makes them very powerful for addressing the core problem of context awareness.
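The distributional principle can be illustrated without any neural network. The following sketch is a simple count-based stand-in, not the word2vec algorithm: it represents each word by the counts of its neighboring words and compares words by cosine similarity. The tiny corpus, the window size and the function names are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def context_vectors(sentences: List[str], window: int = 2) -> Dict[str, Counter]:
    """Represent each word by the counts of words within `window` positions."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


corpus = [
    "the customer signed the contract today",
    "the customer signed the deal today",
    "the dog chased the ball today",
]
vecs = context_vectors(corpus)
sim_contract_deal = cosine(vecs["contract"], vecs["deal"])
sim_contract_ball = cosine(vecs["contract"], vecs["ball"])
```

In this toy corpus, "contract" and "deal" occur in identical contexts and therefore come out as more similar to each other than "contract" and "ball". Learned embeddings compress the same kind of information into dense, low-dimensional vectors.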

While word2vec (the original embedding algorithm) is statistical and does not account for complexities of language such as ambiguity, context sensitivity and linguistic structure, subsequent approaches have enriched word embeddings with all kinds of linguistic information. And, by the way, you can embed not only words but also other things such as senses, sentences and entire documents.

Neural networks are the workhorse of deep learning (see Goldberg and Hirst (2017) for an introduction to the basic architectures in the NLP context). The use of convolutional neural networks has grown in the past few years, while the popularity of the traditional recurrent neural network (RNN) is declining. On the one hand, this is due to the availability of more efficient RNN-based architectures such as LSTM and GRU. On the other hand, a new and quite disruptive mechanism for sequential processing, attention, gained ground with the sequence-to-sequence (seq2seq) model (Sutskever et al. 2014). If you use Google Translate, you may have noticed the leap in translation quality a few years ago; seq2seq was the culprit. And while seq2seq still relies on RNNs in the pipeline, the transformer architecture, another major advance from 2017, finally gets rid of recurrence and relies entirely on the attention mechanism (Vaswani et al. 2017).
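To give a feel for what attention computes, here is a minimal sketch of scaled dot-product attention for a single query, in the form softmax(q·k / sqrt(d_k)) applied to a set of value vectors, written in plain Python. The function names and toy vectors are assumptions for illustration; real implementations operate on batched matrices with learned projections and multiple heads.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(
    query: List[float],
    keys: List[List[float]],
    values: List[List[float]],
) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention for one query vector.

    Returns the weighted sum of the value vectors and the attention weights.
    """
    d_k = len(query)
    # Similarity of the query to every key, scaled by sqrt(d_k).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Toy example: the query matches the first key more closely than the second.
output, weights = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

The weights sum to one and the first position receives the larger share, so the output is pulled towards the first value vector. Because every position can be weighted directly in this way, no step-by-step recurrence over the sequence is needed.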

Deep learning is a vibrant and fascinating domain, but it can also be quite intimidating from an application perspective. When it gets that way, keep in mind that most developments are motivated by increased efficiency on big data, context awareness, and scalability to different tasks and languages. For a mathematical introduction, Young et al. (2018) provide an excellent overview of state-of-the-art algorithms.

4. Consolidating various NLP tasks

When we look at specific NLP tasks (such as sentiment analysis and named entity recognition), the inventory is more stable than that of the underlying algorithms. Over the years, there has been a gradual evolution from preprocessing tasks (such as syntactic parsing and information extraction) to semantics-oriented tasks (such as sentiment/emotion analysis and semantic parsing). This corresponds to the three "global" NLP development curves described by Cambria et al. (2014): syntax, semantics and context awareness. As we saw in the previous section, the third curve, awareness of a larger context, has become one of the main drivers behind new deep learning algorithms.

From a more general perspective, there is an interesting trend towards task-agnostic research. In Section 2, we saw how the generalization power of modern mathematical approaches is leveraged in scenarios such as transfer learning and pre-training. Indeed, modern algorithms are developing remarkable multitasking capabilities, so the relevance of the specific task at hand decreases. The chart below shows the overall decline in the discussion of specific NLP tasks since 2006:

Figure 4: Amount of discussion of specific NLP tasks

5. A note on multilingual research

With globalization, going international has become imperative for business growth. Traditionally, English has been the starting point for most NLP research, but the demand for scalable multilingual NLP systems has increased in recent years. How is this demand reflected in the research community? Think of different languages as different lenses through which we view the same world: they share many properties, a fact that is well accommodated by modern learning algorithms with their increasingly powerful abstraction and generalization capabilities. Nevertheless, language-specific features have to be addressed thoroughly, especially in the preprocessing phase. As the figure below shows, the diversity of languages covered in ACL research keeps increasing:

Figure 5: Frequent languages per year (more than 10 mentions per language)

However, just as we saw for NLP tasks in the previous section, we can expect a consolidation once language-specific differences have been neutralized by the next wave of algorithms. Figure 6 summarizes the most popular languages.

Figure 6: Languages addressed by ACL research

For some of these languages, research interest meets commercial attractiveness: languages such as English, Chinese and Spanish bring together large quantities of available data, huge native-speaker populations, and a large economic potential in the corresponding geographic regions. However, the abundance of "smaller" languages also shows that the NLP field is generally evolving towards a theoretically sound treatment of multilinguality and cross-linguistic generalization.

Final Thoughts

Spurred by the global hype around artificial intelligence, new approaches and disruptive improvements keep emerging in the NLP field. There is a shift towards modeling meaning and context dependence, probably the most universal and challenging fact of human language. The generalization power of modern algorithms allows for efficient scaling across different tasks, languages and data sets, which significantly speeds up the ROI cycle of NLP developments and allows NLP to be integrated flexibly and efficiently into individual business scenarios.

This article was reposted from Towards Data Science.
