Text data is everywhere

Whether you are a mature company or a new service, you can always use text data to validate, improve and extend the functionality of your product. The science of extracting meaning and learning from textual data is an active research topic called natural language processing (NLP).

NLP produces new and exciting results on a daily basis, and it is a very large field. However, having worked with hundreds of companies, the Insight team has seen a few key practical applications come up more frequently than any others:

  • Identifying different cohorts of users/customers (e.g. predicting churn, lifetime value, product preferences)
  • Accurately detecting and extracting different categories of feedback (positive and negative reviews/opinions, mentions of particular attributes such as clothing size/fit...)
  • Classifying text according to intent (e.g. request for basic help, urgent problem)

Although many NLP papers and tutorials exist online, we have found it hard to find guidelines and tips on how to approach these problems effectively from the ground up.

How can this article help?

After leading hundreds of projects each year and getting advice from top teams across the United States, we wrote this article to explain how to build machine learning solutions to solve the problems mentioned above. We will begin with the simplest method that could work, and then move on to more nuanced solutions, such as feature engineering, word vectors, and deep learning.

After reading this article, you will learn how to:

  • Gather, prepare and inspect data
  • Build simple models to start, and transition to deep learning if necessary
  • Interpret and understand your models, to make sure you are actually capturing information and not noise

We wrote this article as a step-by-step guide; it can also serve as a high-level overview of efficient standard methods.

This article is accompanied by an interactive notebook that demonstrates and applies all of these techniques. Feel free to run the code and follow along!

Step 1: Collect your data

Example data sources

Every machine learning problem starts with data, such as a list of emails, posts, or tweets. Common sources of textual information include:

  • Product reviews (on Amazon, Yelp, and various app stores)
  • User-generated content (tweets, Facebook posts, StackOverflow questions)
  • Troubleshooting (customer requests, support tickets, chat logs)

"Disasters on Social Media" dataset

For this article, we will use a dataset generously provided by Figure Eight, called "Disasters on Social Media", where:

Contributors looked at more than 10,000 tweets culled with a variety of searches such as "ignition", "quarantine", and "chaos", then noted whether each tweet referred to a disaster event (as opposed to a joke using the word, a movie review, or something else non-disastrous).

Our task is to detect which tweets are about a disastrous event, as opposed to an irrelevant topic such as a movie. Why? One potential application would be to notify law enforcement officials exclusively about urgent emergencies while ignoring reviews of the most recent Adam Sandler film. A particular challenge with this task is that both classes contain the same search terms used to find the tweets, so we will have to use subtler differences to distinguish between them.

In the rest of this article, we will refer to tweets that are about disasters as "disaster", and to tweets about anything else as "irrelevant".


We have labeled data, so we know which tweets belong to which categories. As Richard Socher outlines below, it is usually faster, simpler, and cheaper to find and label enough data to train a model on than to try to optimize a complex unsupervised method.

Richard Socher's pro tip

Step 2: Clean up the data

The first rule we follow is: "Your model will only ever be as good as your data."

One of the key skills of a data scientist is knowing whether the next step should be working on the model or on the data. A good rule of thumb is to look at the data first and then clean it up. A clean dataset will allow a model to learn meaningful features rather than overfit on irrelevant noise.

Here is a checklist to use when cleaning your data (see the code for more details):

  1. Remove all irrelevant characters, such as any non-alphanumeric characters
  2. Tokenize your text by separating it into individual words
  3. Remove words that are not relevant, such as "@" Twitter mentions or URLs
  4. Convert all characters to lowercase, so that words such as "hello", "Hello", and "HELLO" are treated the same
  5. Consider combining misspelled or alternately spelled words into a single representation (e.g. "cool" / "kewl" / "cooool")
  6. Consider lemmatization (reducing words such as "am", "are", and "is" to a common form such as "be")

After following these steps and checking for additional errors, we can start using the clean, labeled data to train models!
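As a minimal sketch of checklist items 1 to 4, the function below uses only Python's standard `re` module. The function name, the regular expressions, and the sample tweet are illustrative assumptions, not the notebook's actual code. URLs and mentions are removed before other characters are stripped, because stripping "@" first would leave the username behind as an ordinary word.

```python
import re

# Illustrative patterns (assumptions): adjust them to your own data.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
NON_ALNUM_PATTERN = re.compile(r"[^a-z0-9\s]")


def clean_and_tokenize(text):
    """Lowercase, drop URLs and @mentions, strip non-alphanumerics, split into words."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    text = NON_ALNUM_PATTERN.sub(" ", text)
    return text.split()


tokens = clean_and_tokenize("Forest FIRE near La Ronge!! @user http://t.co/abc123")
print(tokens)
```

Spelling normalization and lemmatization (items 5 and 6) usually need a dictionary or an NLP library, so they are left out of this sketch.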

Step 3: Find a good data representation

Machine learning models take numerical values as input. For example, models that work on images take a matrix representing the intensity of each pixel in each color channel.

A smiling face represented as a matrix of numbers.

Our dataset is a list of sentences, so in order for our algorithm to extract patterns from the data, we first need to represent it in a form our algorithm can understand, i.e. as a list of numbers.

Bag of Words

A natural way to represent text for computers is to encode each character individually as a number (for example, ASCII). If we fed this simple representation into a classifier, it would have to learn the structure of words from scratch based only on our data, which is impossible for most datasets. We need a higher-level approach.

For example, we can build a vocabulary of all the unique words in our dataset and associate a unique index with each word in the vocabulary. Each sentence is then represented as a list that is as long as the number of distinct words in our vocabulary. At each index in this list, we record how many times the given word appears in our sentence. This is called a Bag of Words model, because it is a representation that completely ignores the order of words in a sentence. This is illustrated below.

Representing sentences as a Bag of Words. Sentences on the left, representations on the right. Each index in the vectors represents one particular word.
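The vocabulary-and-counts idea described above can be sketched in plain Python. The helper names and the two toy sentences are assumptions made for illustration; in practice a library class such as scikit-learn's `CountVectorizer` is commonly used for this step.

```python
from collections import Counter


def build_vocabulary(tokenized_sentences):
    """Map each unique word to an index (sorted so the mapping is reproducible)."""
    words = sorted({word for sentence in tokenized_sentences for word in sentence})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(tokens, vocabulary):
    """Return a count vector as long as the vocabulary; unknown words are ignored."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(tokens).items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


# Toy, already-tokenized sentences (hypothetical data).
sentences = [["mary", "is", "hungry"], ["mary", "loves", "mary"]]
vocabulary = build_vocabulary(sentences)
vectors = [bag_of_words(sentence, vocabulary) for sentence in sentences]
print(vocabulary)
print(vectors)
```

Note that reordering the words of a sentence leaves its vector unchanged, which is exactly the loss of word order the name "Bag of Words" refers to.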

Visualizing the embeddings

Our vocabulary in the "Disasters on Social Media" example has approximately 20,000 words, which means that each sentence will be represented as a vector of length 20,000. The vector will contain mostly 0s, because each sentence contains only a very small subset of our vocabulary.

In order to see whether our embeddings are capturing information that is relevant to our problem (i.e. whether the tweets are about disasters or not), it is a good idea to visualize them and see if the classes look well separated. Since vocabularies are usually very large and visualizing data in 20,000 dimensions is impossible, techniques like PCA help project the data down to two dimensions. This is plotted below.

Visualizing Bag of Words embeddings.
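For readers who want to see what the projection step involves, here is a small sketch of PCA computed through a singular value decomposition. It assumes NumPy is installed; the function name and the four toy count vectors are made up for illustration, and a library class such as scikit-learn's `PCA` would be a common choice on real data.

```python
import numpy as np


def pca_project(matrix, n_components=2):
    """Project the rows of `matrix` onto their top principal components via SVD."""
    data = np.asarray(matrix, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Hypothetical Bag of Words vectors for four short sentences.
toy_vectors = [[1, 1, 0, 1], [0, 0, 1, 2], [1, 0, 0, 1], [0, 1, 1, 0]]
points = pca_project(toy_vectors)
print(points.shape)  # one 2D point per sentence
```

Each row of `points` can then be drawn on a scatter plot and colored by its label to check how well the classes separate.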

The two classes do not look very well separated, which could be a feature of our embeddings or simply of our dimensionality reduction. To see whether the Bag of Words features are of any use, we can train a classifier based on them.

Step 4: Classification

When first approaching a problem, a general best practice is to start with the simplest tool that could solve the job. When it comes to classifying data, a common favorite for its versatility and explainability is logistic regression. It is very simple to train, and the results are interpretable because you can easily extract the most important coefficients from the model.

We split our data into a training set used to fit our model and a test set used to see how well it generalizes to unseen data. After training, we get an accuracy of 75.4%. Not too shabby! Guessing the most frequent class ("irrelevant") would give us only 57%. However, even if 75% accuracy were good enough for our needs, we should never ship a model without trying to understand it.
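To make the train/test workflow and the majority-class baseline concrete, here is a from-scratch sketch of logistic regression trained with stochastic gradient descent. Everything in it is an illustrative assumption: the toy count vectors, the learning rate, and the number of epochs. It is not the notebook's code, and in practice a library implementation such as scikit-learn's `LogisticRegression` would normally be used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(x, weights, bias):
    return bias + sum(w * xi for w, xi in zip(weights, x))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=200):
    """Fit weights and bias with plain stochastic gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = sigmoid(score(x, weights, bias)) - y
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(x, weights, bias):
    # A score of 0 corresponds to a predicted probability of 0.5.
    return 1 if score(x, weights, bias) >= 0 else 0


def accuracy(true_labels, predicted_labels):
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / len(true_labels)


# Hypothetical count vectors over the toy vocabulary ["fire", "movie", "the"].
train_x = [[1, 0, 1], [2, 0, 0], [1, 0, 0], [0, 1, 1], [0, 2, 0], [0, 1, 0], [0, 3, 0]]
train_y = [1, 1, 1, 0, 0, 0, 0]  # 1 = disaster, 0 = irrelevant
test_x = [[3, 0, 1], [0, 3, 1], [0, 2, 1]]
test_y = [1, 0, 0]

weights, bias = train_logistic_regression(train_x, train_y)
model_accuracy = accuracy(test_y, [predict(x, weights, bias) for x in test_x])

# Baseline: always guess the most frequent class seen in training.
majority_class = max(set(train_y), key=train_y.count)
baseline_accuracy = accuracy(test_y, [majority_class] * len(test_y))
print(model_accuracy, baseline_accuracy)
```

Comparing the model against the baseline, rather than looking at raw accuracy alone, is what tells us whether the classifier has learned anything beyond the class balance.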

Step 5: Inspect

Confusion matrix

The first step is to understand the types of errors our model makes, and which kinds of errors are least desirable. In our example, false positives are irrelevant tweets classified as disasters, and false negatives are disasters classified as irrelevant tweets. If the priority is to react to every potential event, we would want to lower our false negatives. If we are constrained in resources, however, we might prioritize a lower false positive rate to reduce false alarms. A good way to visualize this information is with a confusion matrix, which compares the predictions our model makes with the true labels. Ideally, the matrix would be a diagonal line from top left to bottom right (our predictions match the truth perfectly).

Confusion matrix (green is a high proportion, blue is low)

Our classifier produces more false negatives than false positives (proportionally). In other words, our model's most common error is classifying disasters as irrelevant. If false positives represent a high cost for law enforcement, this could be a good bias for our classifier to have.
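The four cells of a binary confusion matrix, and the two error rates discussed above, can be counted directly. The labels below are invented for illustration (1 stands for "disaster", 0 for "irrelevant"); they are not the article's results.

```python
def confusion_counts(true_labels, predicted_labels, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, prediction in zip(true_labels, predicted_labels):
        if prediction == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


# Hypothetical labels and predictions.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0]
predicted_labels = [1, 1, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(true_labels, predicted_labels)
# Share of real disasters that were missed.
false_negative_rate = counts["fn"] / (counts["fn"] + counts["tp"])
# Share of irrelevant tweets that raised a false alarm.
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])
print(counts, false_negative_rate, false_positive_rate)
```

In this toy example the false negative rate is higher than the false positive rate, the same kind of imbalance described for the article's classifier.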

Explaining and interpreting our model

To validate our model and interpret its predictions, it is important to look at which words it is using to make decisions. If our data is biased, our classifier will make accurate predictions on the sample data, but the model will not generalize well in the real world. Here we plot the most important words for both the disaster and the irrelevant class. Plotting word importance is simple with Bag of Words and logistic regression, since we can just extract and rank the coefficients that the model used for its predictions.

Bag of Words: word importance
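Extracting and ranking coefficients amounts to pairing each vocabulary index with its learned weight and sorting. The sketch below assumes a vocabulary dictionary (word to index) and one weight per index; the words and weight values are hypothetical, chosen only to show the mechanics.

```python
def top_words(vocabulary, weights, k=3):
    """Return the k most positive and the k most negative (word, weight) pairs."""
    index_to_word = {index: word for word, index in vocabulary.items()}
    ranked = sorted(
        ((index_to_word[i], weight) for i, weight in enumerate(weights)),
        key=lambda pair: pair[1],
    )
    most_positive = ranked[-k:][::-1]
    most_negative = ranked[:k]
    return most_positive, most_negative


# Hypothetical vocabulary and logistic regression coefficients.
vocabulary = {"hiroshima": 0, "massacre": 1, "movie": 2, "lol": 3, "the": 4}
weights = [1.9, 1.4, -1.2, -0.8, 0.1]

disaster_words, irrelevant_words = top_words(vocabulary, weights, k=2)
print(disaster_words)
print(irrelevant_words)
```

Large positive weights push a tweet towards "disaster" and large negative weights towards "irrelevant", so meaningless tokens near the top of either list are a warning sign of overfitting.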

Our classifier correctly picks up on some patterns (hiroshima, massacre), but it is clearly overfitting on some meaningless terms (heyoo, x1392). Right now, our Bag of Words model is dealing with a huge vocabulary of different words and treating all words equally. However, some of these words are very frequent and only contribute noise to our predictions. Next, we will try a way of representing sentences that accounts for the frequency of words, to see if we can pick up more signal from our data.

Step 6: Accounting for vocabulary structure


To help our model focus more on meaningful words, we can use a TF-IDF score (Term Frequency, Inverse Document Frequency) on top of our Bag of Words model. TF-IDF weighs words by how rare they are in our dataset, discounting words that are too frequent and just add to the noise. Here is the PCA projection of our new embeddings.

Visualizing TF-IDF embeddings.
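A minimal sketch of the weighting, assuming one common textbook definition: term frequency is the word's count divided by the sentence length, and inverse document frequency is log(N / df), where N is the number of sentences and df is the number of sentences containing the word. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant, so exact numbers will differ. The toy sentences are hypothetical and assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_sentences):
    """Return (vocabulary, vectors) using tf = count / length and idf = log(N / df)."""
    vocabulary = sorted({word for sentence in tokenized_sentences for word in sentence})
    n_documents = len(tokenized_sentences)
    # Document frequency: in how many sentences does each word appear?
    document_frequency = Counter(
        word for sentence in tokenized_sentences for word in set(sentence)
    )
    idf = {word: math.log(n_documents / document_frequency[word]) for word in vocabulary}
    vectors = []
    for sentence in tokenized_sentences:
        counts = Counter(sentence)
        vectors.append([counts[word] / len(sentence) * idf[word] for word in vocabulary])
    return vocabulary, vectors


sentences = [
    ["the", "fire", "spreads"],
    ["the", "movie", "was", "fire"],
    ["the", "storm"],
]
vocabulary, vectors = tf_idf_vectors(sentences)
the_index = vocabulary.index("the")
storm_index = vocabulary.index("storm")
print([round(vector[the_index], 3) for vector in vectors])
print(round(vectors[2][storm_index], 3))
```

Because "the" appears in every sentence, its weight drops to zero, while the rarer "storm" keeps a positive weight. This is the discounting of overly frequent words described above.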

We can see above that there is a clearer distinction between the two colors. This should make it easier for our classifier to separate the two groups. Let's see if this leads to better performance. Training another logistic regression on our new embeddings, we get an accuracy of 76.2%.

A very slight improvement. Has our model started picking up on more important words? If we are getting a better result while preventing our model from "cheating", then we can truly consider this model an upgrade.

TF-IDF: word importance

The words it picked up look much more relevant! Although the metrics on our test set only increased slightly, we have much more confidence in the terms our model is using, and so we would feel more comfortable deploying it in a system that interacts with customers.

Step 7: Using Semantics


Our latest model managed to pick up on high-signal words. However, if we deploy this model, it is very likely that we will encounter words that we have not seen in our training set before. The previous model will not be able to accurately classify these tweets, even if it has seen very similar words during training.

To solve this problem, we need to capture the semantic meaning of words, meaning we need to understand that words like "good" and "positive" are closer than "apricot" and "continental". The tool we will use to help us capture meaning is called Word2Vec.

Using pre-trained words

Word2Vec is a technique for finding continuous embeddings for words. It learns from reading massive amounts of text and memorizing which words tend to appear in similar contexts. After being trained on enough data, it generates a 300-dimensional vector for each word in a vocabulary, with words of similar meaning being closer to each other.

The authors of the paper open-sourced a model that was pre-trained on a very large corpus, which we can leverage to include some knowledge of semantic meaning in our model. The pre-trained vectors can be found in the repository associated with this post.

Sentence level representation

A quick way to get a sentence embedding for our classifier is to average the Word2Vec scores of all words in our sentence. This is a Bag of Words approach just like before, but this time we only lose the syntax of our sentence, while keeping some semantic information.

Word2Vec sentence embedding
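Averaging can be sketched with a plain dictionary standing in for the pre-trained model. The three-dimensional toy vectors below are made up (real Word2Vec vectors have 300 dimensions), and skipping unknown words is one possible convention, assumed here for simplicity.

```python
def average_embedding(tokens, word_vectors, dimensions):
    """Average the vectors of known words; return a zero vector if none are known."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return [0.0] * dimensions
    return [sum(values) / len(known) for values in zip(*known)]


# Hypothetical 3-dimensional word vectors standing in for Word2Vec.
toy_vectors = {"fire": [1.0, 0.0, 2.0], "flood": [3.0, 2.0, 0.0]}

sentence_vector = average_embedding(["fire", "flood", "xyz"], toy_vectors, 3)
reordered_vector = average_embedding(["flood", "fire"], toy_vectors, 3)
print(sentence_vector)
print(sentence_vector == reordered_vector)
```

The reordered sentence yields the same vector, which illustrates why this representation keeps some meaning of the words but discards their order.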

Here is a visualization of our new embeddings using the previous techniques:

Visualizing Word2Vec embeddings.

The two groups of colors look even more separated here, so our new embeddings should help our classifier find the separation between the two classes. After training the same model a third time (a logistic regression), we get an accuracy score of 77.7%, our best result yet! Time to inspect our model.

Complexity/interpretability trade-off

Since our embeddings are not represented as a vector with one dimension per word, as in our previous models, it is harder to see which words are the most relevant to our classification. While we still have access to the coefficients of our logistic regression, they relate to the 300 dimensions of our embeddings rather than to the indices of words.

For such a small gain in accuracy, losing all explainability seems like a harsh trade-off. However, with more complex models, we can leverage black-box explainers such as LIME to gain some insight into how our classifier works.


LIME is available on GitHub as an open-source package. A black-box explainer allows users to explain the decisions of any classifier on one particular example by perturbing the input (in our case, removing words from the sentence) and seeing how the prediction changes.
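The perturbation idea can be illustrated with a much simpler leave-one-word-out sketch. This is not LIME itself (LIME samples many random perturbations and fits a local surrogate model) and it does not use the LIME package's API. The stand-in classifier and its keyword list are hypothetical.

```python
def word_contributions(tokens, predict_probability):
    """Score each word by how much the prediction drops when that word is removed."""
    baseline = predict_probability(tokens)
    contributions = []
    for position, word in enumerate(tokens):
        perturbed = tokens[:position] + tokens[position + 1:]
        contributions.append((word, baseline - predict_probability(perturbed)))
    return contributions


def toy_predict_probability(tokens):
    """Hypothetical stand-in classifier: probability rises with disaster keywords."""
    keywords = {"fire", "evacuate"}
    hits = sum(1 for token in tokens if token in keywords)
    return min(1.0, 0.1 + 0.4 * hits)


contributions = word_contributions(["forest", "fire", "near", "town"], toy_predict_probability)
for word, contribution in contributions:
    print(word, round(contribution, 2))
```

Only the prediction function is queried, never the model's internals, which is what makes this family of explainers usable with any classifier.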

Let's take a look at a few explanations of the sentences in our data set.

Correct disaster words are picked up to classify the tweet as "relevant".
Here, the contribution of the words to the classification seems less obvious.

However, we do not have time to explore the thousands of examples in our dataset. What we will do instead is run LIME on a representative sample of test cases and see which words keep coming up as strong contributors. Using this approach, we can get word importance scores like we had for previous models and validate our model's predictions.

Word2Vec: word importance

It looks like the model picks up highly relevant words, implying that it makes understandable decisions. These seem like the most relevant words out of all previous models, and therefore we are more comfortable deploying this one to production.

Step 8: Leveraging syntax using end-to-end approaches

We have covered quick and efficient approaches to generate compact sentence embeddings. However, by omitting the order of words, we are discarding all of the syntactic information of our sentences. If these methods do not provide sufficient results, you can use a more complex model that takes whole sentences as input and predicts labels without the need to build an intermediate representation. A common way to do that is to treat a sentence as a sequence of individual word vectors, using either Word2Vec or more recent approaches such as GloVe or CoVe. This is what we will do below.

Efficient end-to-end architecture (source code)

Convolutional neural networks for sentence classification train very quickly and work well as an entry-level deep learning architecture. While convolutional neural networks (CNNs) are mainly known for their performance on image data, they have been providing excellent results on text-related tasks, and are usually much quicker to train than most complex NLP approaches (e.g. LSTMs and encoder/decoder architectures). This model preserves the order of words and learns valuable information about which sequences of words are predictive of our target classes. Contrary to previous models, it can tell the difference between "Alex eats plants" and "Plants eat Alex".

Training this model does not require much more work than the previous approaches (see the code for details) and gives us a model that is better than the previous ones, reaching 79.5% accuracy! As with the models above, the next step should be to explore and explain the predictions using the methods we described, to validate that it is indeed the best model to deploy to users. By now, you should feel comfortable tackling this on your own.

Final summary

The following is a brief review of the methods we have successfully used:

  • Start with a quick and easy model
  • Explain its predictions
  • Understand the mistakes it is making
  • Use this knowledge to inform your next step, whether that is working on your data or on a more complex model

These approaches were applied to a particular example case, using models tailored towards understanding and leveraging short text such as tweets, but the ideas are widely applicable to a variety of problems. I hope this helped you, and we would love to hear your comments and questions! Feel free to comment below, or reach out to @EmmanuelAmeisen on Twitter.

This article is from Insight.