Based on some recent conversations, I realized that text preprocessing is a severely overlooked topic. A few people I spoke to mentioned inconsistent results from their NLP applications, only to realize that they were not preprocessing their text or were using the wrong kind of text preprocessing for their project.

With this in mind, I want to explain what text preprocessing really is, the different methods of text preprocessing, and a way to estimate how much preprocessing you may need. For those who are interested, I have also prepared some text preprocessing code snippets for you to try. Now let's get started!

What is text preprocessing?

To preprocess your text simply means to bring it into a form that is predictable and analyzable for your task. A task here is a combination of method and domain. For example, extracting top keywords with tf-idf (method) from tweets (domain) is an example of a task.

Task = method + domain

One task's ideal preprocessing can become another task's worst nightmare. Take note: text preprocessing is not directly transferable from task to task.

Let's take a very simple example. Say you are trying to discover commonly used words in a news dataset. If your preprocessing step involves removing stop words just because some other task used it, you are probably going to miss some of the common words, because you have already eliminated them. So really, it is not a one-size-fits-all approach.

Types of text preprocessing techniques

There are different ways to preprocess your text. Here are some of the approaches you should know about, and I will try to highlight the importance of each.

Lowercasing

Lowercasing all your text data, although commonly overlooked, is one of the simplest and most effective forms of text preprocessing. It is applicable to most text mining and NLP problems, can help when your dataset is not very large, and contributes significantly to the consistency of the expected output.

Quite recently, one of my blog readers trained a word embedding model for similarity lookups. He found that different variations in input capitalization (e.g. "Canada" vs. "canada") gave him different types of output or no output at all. This was probably because the dataset contained mixed-case occurrences of the word "Canada" and there was not enough evidence for the neural network to effectively learn the weights for the less common version. This type of issue is bound to happen when your dataset is fairly small, and lowercasing is a great way to deal with sparsity issues.

Here is an example of how lowercasing solves the sparsity issue, where the same words with different cases map to the same lowercase form:

[Figure: words with different cases mapped to the same lowercase form]
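
As a minimal sketch of the sparsity effect, using only the Python standard library: the token list and the helper name `count_tokens` below are made up for illustration and are not taken from the reader's actual data.

```python
from collections import Counter

def count_tokens(tokens, lowercase=False):
    """Count token frequencies, optionally lowercasing each token first."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return Counter(tokens)

# Hypothetical tokens: the same two words written with different casing
tokens = ["Canada", "canada", "CANADA", "CanadA", "Tomcat", "TOMCAT"]

raw_counts = count_tokens(tokens)                    # every casing is its own entry
lower_counts = count_tokens(tokens, lowercase=True)  # casings collapse into one entry

print(len(raw_counts), "distinct tokens before lowercasing")
print(len(lower_counts), "distinct tokens after lowercasing")
print(lower_counts)
```

Before lowercasing, each spelling is seen only once; afterwards, all the evidence for a word is pooled under a single key, which is what helps a model trained on a small dataset.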

Another example where lowercasing is very useful is search. Imagine you are looking for documents containing "united states". However, no results show up because "united states" was indexed as "United States". Now, who should we blame? The UI designer who set up the interface, or the engineer who set up the search index?

While lowercasing should be standard practice, I have also had situations where preserving capitalization was important. One example is predicting the programming language of a source code file. The word System in Java is quite different from system in Python. Lowercasing the two makes them identical, causing the classifier to lose important predictive features. While lowercasing is generally helpful, it may not be applicable to all tasks.

Stemming

Stemming is the process of reducing inflection in words (e.g. troubled, troubles) to their root form (e.g. trouble). The "root" in this case may not be a real root word, but just a canonical form of the original word.

Stemming uses a crude heuristic process that chops off the ends of words in the hope of correctly transforming them into their root form. So the words "trouble", "troubled" and "troubles" might actually be converted to troubl instead of trouble, because the ends were just chopped off (ugh, how crude!).

There are different algorithms for stemming. The most common one, which is also known to be empirically effective for English, is Porter's algorithm. Here is an example of stemming in action with the Porter Stemmer:

[Figure: effect of stemming]
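
To make the "chop off the ends" idea concrete, here is a deliberately crude suffix-stripping sketch. It is not Porter's algorithm, which applies several ordered rule phases with extra conditions; the suffix list and the `min_stem_len` parameter are assumptions chosen only for this illustration.

```python
# Suffixes are checked in this order; longer ones come first
SUFFIXES = ("ing", "ed", "es", "s", "e")

def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["trouble", "troubled", "troubles", "troubling"]:
    print(w, "->", crude_stem(w))
```

All four variants collapse to the same truncated form, which is not a real word. For real work, an established implementation is the safer choice; NLTK, for instance, ships a `PorterStemmer` class.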

Stemming is useful for dealing with sparsity issues as well as for standardizing vocabulary. I have had success with stemming in search applications in particular. The idea is that, if you search for "deep learning courses", you also want to surface documents that mention "deep learning course" as well as "deep learn courses", although the latter doesn't sound right. But you get where we are going with this. You want to match all variations of a word to bring up the most relevant documents.

In most of my previous text classification work, however, stemming only marginally improved classification accuracy, compared with using better engineered features and text enrichment approaches such as word embeddings.

Lemmatization

Lemmatization on the surface is very similar to stemming, where the goal is to remove inflections and map a word to its root form. The only difference is that lemmatization tries to do it the proper way. It doesn't just chop things off; it actually transforms words to the actual root. For example, the word "better" would map to "good". It may use a dictionary such as WordNet for the mappings, or some special rule-based approaches. Here is an example of lemmatization in action using a WordNet-based approach:

[Figure: effect of lemmatization with WordNet]
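
The dictionary-lookup idea can be sketched as follows. The tiny `LEMMA_TABLE` is a hand-made stand-in for a real resource such as WordNet, and its entries and part-of-speech labels are assumptions for illustration only.

```python
# Hypothetical lookup table keyed by (word, part of speech)
LEMMA_TABLE = {
    ("better", "adj"): "good",
    ("best", "adj"): "good",
    ("was", "verb"): "be",
    ("geese", "noun"): "goose",
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
}

def lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lemmatize("better", "adj"))    # mapped through the table, not truncated
print(lemmatize("meeting", "verb"))  # verb reading
print(lemmatize("meeting", "noun"))  # noun reading keeps its form
```

Unlike a stemmer, the lookup returns real words, and the same surface form can yield different lemmas depending on its part of speech.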

In my experience, lemmatization provides no significant benefit over stemming for search and text classification purposes. In fact, depending on the algorithm you choose, it could be much slower than a very basic stemmer, and you may have to know the part of speech of the word in question to get a correct lemma. One paper found that lemmatization has no significant impact on accuracy for text classification with neural architectures.

I would personally use lemmatization sparingly. The additional overhead may or may not be worth it. But you can always try it to see how it affects your performance metrics.

Stop word removal

Stop words are a set of commonly used words in a language. Examples of stop words in English are "a", "the", "is", "are", and so on. The intuition behind removing stop words is that, by removing low-information words from the text, we can focus on the important words instead.

For example, in the context of a search system, if your search query is "what is text preprocessing?", you want the search system to focus on surfacing documents that talk about text preprocessing over documents that talk about what is. This can be done by preventing all words in your stop word list from being analyzed. Stop word removal is commonly applied in search systems, text classification applications, topic modeling, topic extraction, and others.

In my experience, stop word removal, while effective in search and topic extraction systems, has proved non-critical in classification systems. However, it does help reduce the number of features under consideration, which helps keep your models at a reasonable size.

Here is an example of stop word removal in action. All stop words are replaced with the dummy character W:

[Figure: sentence before and after stop word removal]

Stop word lists can come from pre-established sets, or you can create a custom list for your domain. Some libraries (e.g. sklearn) allow you to remove words that appear in more than X% of your documents, which can also give you a stop word removal effect.
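
Both ideas, a fixed list and a document-frequency cutoff, can be sketched with the standard library alone. The stop word set, the sample documents, and the `max_doc_ratio` parameter are assumptions for illustration; in scikit-learn the comparable knob is the `max_df` parameter of `CountVectorizer` and `TfidfVectorizer`.

```python
import re

# A deliberately tiny, hypothetical stop word list
STOP_WORDS = {"a", "an", "the", "is", "are", "what", "this", "of", "to"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS, placeholder=None):
    """Drop stop words, or replace them with a placeholder if one is given."""
    if placeholder is None:
        return [t for t in tokens if t not in stop_words]
    return [placeholder if t in stop_words else t for t in tokens]

def frequent_terms(documents, max_doc_ratio=0.8):
    """Return terms that occur in more than max_doc_ratio of the documents."""
    doc_freq = {}
    for doc in documents:
        for term in set(tokenize(doc)):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t for t, n in doc_freq.items() if n / len(documents) > max_doc_ratio}

query_tokens = tokenize("What is text preprocessing?")
print(remove_stop_words(query_tokens))
print(remove_stop_words(query_tokens, placeholder="W"))

docs = ["the cat sat", "the dog ran", "the bird flew"]
print(frequent_terms(docs))
```

The frequency-based variant needs no hand-written list, which makes it a reasonable starting point for a domain where the "uninformative" words are not the usual English ones.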


Text normalization

A highly overlooked preprocessing step is text normalization. Text normalization is the process of transforming text into a canonical (standard) form. For example, the words "gooood" and "gud" can be transformed to "good", their canonical form. Another example is mapping near-identical words such as "stopwords", "stop-words" and "stop words" to just "stopwords".

Text normalization is important for noisy text such as social media comments, text messages, and comments on blog posts, where abbreviations, misspellings, and out-of-vocabulary (OOV) words are prevalent. One paper showed that, by using a text normalization strategy for tweets, the authors were able to improve sentiment classification accuracy by about 4%.

Here is an example of words before and after normalization:

[Figure: effect of text normalization]

Notice how the variations map to the same canonical form.

In my experience, text normalization has even been effective for analyzing highly unstructured clinical text, where physicians take notes in non-standard ways. I have also found it useful for topic extraction, where near-synonyms and spelling differences are common (e.g. topic modelling, topic modeling, topic-modeling, topic-modelling).

Unfortunately, unlike stemming and lemmatization, there is no standard way to normalize text. It typically depends on the task. For example, the way you would normalize clinical text may differ from the way you normalize SMS text messages.

Some common approaches to text normalization include dictionary mappings (the simplest), statistical machine translation (SMT), and spelling-correction-based approaches. One interesting article compares the use of a dictionary-based approach and an SMT approach for normalizing text messages.
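
A dictionary mapping, the simplest of these approaches, might look like the sketch below. The entries in `NORMALIZATION_MAP` are illustrative assumptions; a real mapping would be built from the variants actually observed in your data.

```python
import re

# Hypothetical mapping from noisy variants to a canonical form
NORMALIZATION_MAP = {
    "gooood": "good",
    "gud": "good",
    "stop-words": "stopwords",
    "stop words": "stopwords",
}

def normalize(text, mapping=NORMALIZATION_MAP):
    """Replace every known variant in text with its canonical form."""
    # Longest keys first, so multi-word variants win over shorter overlaps
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

print(normalize("That was gooood, really gud"))
print(normalize("Remove stop words and stop-words"))
```

The approach is easy to audit, since every substitution is visible in the table, but it only covers variants someone has already listed.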

Noise removal

Noise removal is about removing characters, digits, and pieces of text that can interfere with your text analysis. It is one of the most essential text preprocessing steps, and it is also highly domain dependent.

For example, in tweets, noise could be all special characters except hashtags, since hashtags signify concepts that can characterize a tweet. The problem with noise is that it can produce inconsistent results in your downstream tasks. Let's look at the following example:

[Figure: stemming without noise removal]

Notice that all the raw words above have some surrounding noise. If you stem these words, you can see that the stemmed results do not look very pretty. None of them has a correct stem. However, with some cleaning as applied in my notebook, the results now look much better:

[Figure: stemming with noise removal]

Noise removal is one of the first things you should look into when it comes to text mining and NLP. There are various ways to remove noise. These include punctuation removal, special character removal, number removal, HTML formatting removal, domain-specific keyword removal (e.g. "RT" for retweet), source code removal, header removal, and more. It all depends on which domain you are working in and what counts as noise for your task. The code snippet in my notebook shows how to do some basic noise removal.
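
As a rough sketch of a few of these steps (this is not the notebook code), the function below strips HTML tags, the retweet marker, digits, and special characters, with an option to keep hashtags. The function name, the `keep_hashtags` flag, and the exact regular expressions are assumptions; a real project would tune them to its own domain.

```python
import html
import re

def remove_noise(text, keep_hashtags=True):
    """Apply a few basic, domain-agnostic cleaning steps to a piece of text."""
    text = html.unescape(text)              # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)    # strip HTML tags
    text = re.sub(r"\bRT\b", " ", text)     # drop the retweet marker
    text = re.sub(r"\d+", " ", text)        # drop digits
    # Replace everything except letters and whitespace (and '#' if requested)
    unwanted = r"[^A-Za-z#\s]" if keep_hashtags else r"[^A-Za-z\s]"
    text = re.sub(unwanted, " ", text)
    return " ".join(text.split())           # collapse repeated whitespace

for raw in ["..trouble..", "<a>trouble</a>", "1.trouble", "trouble<"]:
    print(repr(raw), "->", repr(remove_noise(raw)))

print(remove_noise("RT great talk on #nlp!!! 100%"))
```

Once the surrounding noise is gone, every variant is the same clean token, so a stemmer applied afterwards sees consistent input.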

Text enrichment / augmentation

Text enrichment involves augmenting your original text data with information that you did not previously have. It provides more semantics for your original text, thereby improving its predictive power and the depth of analysis you can perform on your data.

In an information retrieval example, expanding a user's query to improve keyword matching is a form of augmentation. A query like text mining could become text document mining analysis. While this doesn't make sense to a human, it can help fetch documents that are more relevant.
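
Query expansion of this kind can be sketched in a few lines. The `EXPANSIONS` table is a made-up example; in practice the related terms might come from a thesaurus, query logs, or embeddings.

```python
# Hypothetical table of related terms to append after each query term
EXPANSIONS = {
    "text": ["document"],
    "mining": ["analysis"],
}

def expand_query(query, expansions=EXPANSIONS):
    """Append the related terms for each query term, preserving term order."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(expansions.get(term, []))
    return " ".join(expanded)

print(expand_query("text mining"))
```

The expanded query matches documents that use any of the related terms, at the cost of possibly pulling in less precise matches.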

You can get really creative with how you enrich your text. You can use part-of-speech tagging to get more granular information about the words in your text.

For example, in a document classification problem, the appearance of the word book as a noun could result in a different classification than book as a verb, since one is used in the context of reading and the other in the context of reserving something. One paper discusses how Chinese text classification is improved by using a combination of nouns and verbs as input features.

With the availability of large amounts of text, however, people have started using embeddings to enrich the meaning of words, phrases, and sentences for classification, search, summarization, and text generation in general. This is especially true in deep learning based NLP approaches, where a word-level embedding layer is quite common. You can either start with pre-established embeddings or create your own and use them in downstream tasks.

Other ways to enrich your text data include phrase extraction, where you recognize compound words as one (also known as chunking), expansion with synonyms, and dependency parsing.

Do you need it all?

Not really, but you do have to do some of it if you want good, consistent results. To give you an idea of what the bare minimum should be, I have broken it down into Must Do, Should Do, and Task Dependent. Everything that falls under Task Dependent can be tested quantitatively or qualitatively before you decide whether you actually need it.

Remember, less is more, and you want to keep your approach as elegant as possible. The more overhead you add, the more layers you will have to peel back when you run into issues.

Must do:

  • Noise removal
  • Lowercasing (can be task dependent in some cases)

Should do:

  • Simple normalization (e.g. standardize near-identical words)

Task dependent:

  1. Advanced normalization (e.g. addressing out-of-vocabulary words)
  2. Stop word removal
  3. Stemming / lemmatization
  4. Text enrichment / augmentation

So, for any task, the minimum you should do is try to lowercase your text and remove noise. What counts as noise depends on your domain (see the section on noise removal). You can also perform some basic normalization steps for more consistency, and then systematically add other layers as you see fit.
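
One way to keep the layers easy to add and remove is to treat each step as a plain function and the pipeline as a list. The sketch below is only one possible arrangement; the step functions, the `MUST_DO` and `SHOULD_DO` lists, and the sample sentence are all assumptions for illustration.

```python
import re

def lowercase(text):
    return text.lower()

def remove_noise(text):
    # Keep only letters and whitespace, then collapse runs of whitespace
    return " ".join(re.sub(r"[^A-Za-z\s]", " ", text).split())

def simple_normalize(text):
    # Map the near-identical variants "stop words" / "stop-words" to one form
    return re.sub(r"\bstop[\s-]words\b", "stopwords", text)

def preprocess(text, steps):
    """Apply each preprocessing step in order."""
    for step in steps:
        text = step(text)
    return text

MUST_DO = [lowercase, remove_noise]
SHOULD_DO = MUST_DO + [simple_normalize]

sample = "Removing Stop-Words... is it REALLY needed?!"
print(preprocess(sample, MUST_DO))
print(preprocess(sample, SHOULD_DO))
```

Because each layer is a separate entry in the list, it is straightforward to run the downstream task with and without a given step and compare the results before committing to it.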

General rule of thumb

Not all tasks need the same level of preprocessing. For some tasks, you can get away with the minimum. For others, however, the dataset is so noisy that, if you don't preprocess enough, it is going to be garbage in, garbage out.

Here is a general rule of thumb. It will not always hold true, but it works for most cases. If you have a lot of well-written text to work with in a fairly general domain, then preprocessing is not extremely critical; you can get away with the bare minimum (e.g. training a word embedding model on all of Wikipedia or on Reuters news articles).

However, if you are working in a very narrow domain (e.g. tweets about health foods) and the data is sparse and noisy, you could benefit from more preprocessing layers, although each layer you add (e.g. stop word removal, stemming, normalization) needs to be quantitatively or qualitatively verified as a meaningful layer. Here is a table summarizing how much preprocessing you should perform on your text data:

[Table: how much text preprocessing to perform]

I hope the ideas here steer you towards the right preprocessing steps for your projects. Remember, less is more. A friend of mine once told me how he made a large e-commerce search system more efficient and less buggy just by throwing out layers of unneeded preprocessing.


This article is reproduced from Medium.