Understanding stemming and lemmatisation in one article

Stemming and lemmatisation (also called morphological restoration) are important steps in preprocessing an English corpus. They serve the same purpose, but there are some differences between them.

This article introduces their concepts, their similarities and differences, and the main algorithms used to implement them.


 

Where do stemming and lemmatisation fit in NLP?

Stemming is a step in preprocessing an English corpus (Chinese does not need it), and corpus preprocessing is the first step of NLP. The following figure shows where stemming sits in this knowledge structure.

[Figure: Where stemming and lemmatisation fit in NLP]

 

What are stemming and lemmatisation?

Stemming

Stemming is the process of removing a word's affixes (prefixes and suffixes) to obtain its stem.

Common examples are the endings that mark noun plurals, the progressive form, and the past participle.
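
The idea of stripping endings can be sketched in a few lines of Python. This is a toy illustration only: the suffix list and the function name `toy_stem` are assumptions made for this example, and it is far cruder than a real stemming algorithm such as Porter.

```python
# Toy suffix-stripping stemmer: illustrative only, not the Porter algorithm.
# The suffix list is a hypothetical rule set, ordered longest first.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["cats", "playing", "played", "boxes", "sing"]:
    print(w, "->", toy_stem(w))
```

Because the rules know nothing about the language, they can produce fragments that are not words (for example, "making" becomes "mak"), which is typical of rule-based stemming.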


Lemmatisation

Lemmatisation uses a dictionary to convert an inflected form of a word into its most basic form, the lemma.

It does not simply strip affixes; it converts the word according to the dictionary. For example, "drove" is converted to "drive".
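
A minimal sketch of dictionary-based lemmatisation (morphological restoration) might look like the following. The tiny dictionary `LEMMA_DICT` and the function `toy_lemmatise` are hypothetical stand-ins; a real system would use a full lexicon.

```python
# Hypothetical miniature lemma dictionary; real systems use a full lexicon.
LEMMA_DICT = {
    "drove": "drive",
    "driven": "drive",
    "driving": "drive",
    "mice": "mouse",
    "dogs": "dog",
}


def toy_lemmatise(word: str) -> str:
    """Return the dictionary form if the word is known, else the word unchanged."""
    return LEMMA_DICT.get(word.lower(), word)


print(toy_lemmatise("drove"))
print(toy_lemmatise("mice"))
print(toy_lemmatise("table"))  # not in the dictionary, so returned as-is
```

Irregular forms such as "drove" and "mice" cannot be handled by stripping a suffix, which is why a lookup is used here.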


Why do we need stemming and lemmatisation?

For example, when I search for "play basketball", the sentence "Bob is playing basketball" also meets my needs. To a computer, however, "play" and "playing" are two completely different strings, so we need to convert "playing" into "play".

The purpose of stemming and lemmatisation is to unify words that look different but mean the same thing, which makes later processing and analysis easier.
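
The search example can be sketched as follows. The `normalise` helper is an assumption made for illustration (it only lower-cases and strips a trailing "ing" or "s"); a real search engine would use a proper stemmer or lemmatiser in its place.

```python
def normalise(token: str) -> str:
    """Crude normaliser (assumption): lower-case, strip a trailing 'ing' or 's'."""
    token = token.lower()
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def matches(query: str, document: str) -> bool:
    """True if every normalised query term occurs among the normalised document terms."""
    doc_terms = {normalise(t) for t in document.split()}
    return all(normalise(t) in doc_terms for t in query.split())


query = "play basketball"
document = "Bob is playing basketball"

print(matches(query, document))                      # True: "playing" is reduced to "play"
print(set(query.split()) <= set(document.split()))   # False: exact tokens do not match
```

Without normalisation the query term "play" never equals the document token "playing", so the document would be missed.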

 

4 similarities between stemming and lemmatisation

  1. The goal is the same. Both stemming and lemmatisation reduce the inflected or derived forms of a word to a basic form (the stem or the lemma), merging the different forms of a word into one.
  2. The results partly overlap. Stemming and lemmatisation are not mutually exclusive, and for some words both methods give the same result. For example, the stem of "dogs" is "dog", and its lemma is also "dog".
  3. The mainstream implementations are similar. Both mainly rely on rules of the language or on dictionary mappings to extract the stem or obtain the lemma.
  4. The application areas are similar. Both are mainly used in information retrieval, text processing, and natural language processing, where they are basic steps.

 

5 differences between stemming and lemmatisation

  1. In principle, stemming mainly "reduces": it cuts a word down to its stem, turning "cats" into "cat" and "effective" into "effect". Lemmatisation mainly "transforms": it converts a word to its lemma, turning "drove" into "drive" and "driving" into "drive".
  2. In complexity, stemming is relatively simple. Lemmatisation has to return the dictionary form of a word, so it must analyse the word's morphology: it not only converts affixes but also identifies the part of speech, in order to tell apart words that share a surface form but have different lemmas. The accuracy of part-of-speech tagging therefore directly affects the accuracy of lemmatisation, which makes lemmatisation the more complex of the two.
  3. In implementation, although the mainstream methods are similar, the emphasis differs. Stemming mainly applies rules that strip or shorten affixes to simplify the word. Lemmatisation must handle complex morphological changes that rules alone cannot cover well, so it relies more on a dictionary that maps inflected forms to lemmas and produces valid dictionary words.
  4. In results, the two also differ in part. The output of stemming may not be a complete, meaningful word, only a fragment of one: the stem of "revival" is "reviv", and the stem of "airliner" is "airlin". The output of lemmatisation is a complete word that carries meaning, generally a valid dictionary entry.
  5. In application, the focus differs as well. Although both are used in information retrieval and text processing, stemming is used more in information retrieval (for example in Solr and Lucene) to broaden search results, at a coarse granularity. Lemmatisation is used mainly in text mining and natural language processing, for finer-grained and more accurate text analysis and representation.
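
The role of part of speech (difference 2) can be illustrated with a small sketch. The lookup table `LEMMAS` and the function `lemmatise` are hypothetical; they only show why the same surface form may need different lemmas depending on its part of speech.

```python
# Hypothetical lookup keyed on (word, part of speech); a sketch, not a real lexicon.
LEMMAS = {
    ("leaves", "noun"): "leaf",
    ("leaves", "verb"): "leave",
    ("saw", "noun"): "saw",
    ("saw", "verb"): "see",
}


def lemmatise(word: str, pos: str) -> str:
    """Look up the lemma for a word with a given part of speech; fall back to the word."""
    return LEMMAS.get((word.lower(), pos), word)


print(lemmatise("leaves", "noun"))  # as in "the leaves fell"
print(lemmatise("leaves", "verb"))  # as in "she leaves early"
print(lemmatise("saw", "verb"))     # as in "I saw it"
```

A rule-based stemmer sees only the string, so it gives one answer for "leaves" whatever the context, whereas a lemmatiser needs the part-of-speech tag to choose between "leaf" and "leave".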

 

3 mainstream stemming algorithms

Porter

This stemming algorithm is relatively old. It dates from the 1980s, and its main job is to remove the common endings of words so that they resolve to a common form. It is not very complicated, and its development has stopped.

It is a good basic stemmer to start with, but it is not recommended for complex applications. In research it is often used as a baseline stemming algorithm to ensure reproducibility. Compared with the other algorithms, it is also a very gentle stemmer.

Snowball (recommended)

This algorithm is also known as the Porter2 stemming algorithm. It is almost universally considered better than Porter, and even the developer who created Porter thinks so. Snowball adds many improvements to Porter. Snowball's results differ from Porter's by about 5%.

Lancaster

The Lancaster algorithm is more aggressive and sometimes produces strange-looking stems. If you use the stemmers in NLTK, it is very easy to add your own custom rules to this algorithm.
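
The three stemmers can be compared side by side with NLTK, assuming the `nltk` package is installed (the stemmers need no extra data downloads). The word list is just an example chosen for this sketch; outputs are printed, not asserted, since the three algorithms often disagree.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each stemmer; Snowball needs the language name.
stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer(),
}

# Example words (an arbitrary selection for illustration).
words = ["cats", "revival", "airliner", "generously"]

for word in words:
    stems = {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    print(word, stems)
```

Running this on your own vocabulary is a quick way to see how gentle or aggressive each stemmer is before choosing one.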

Lemmatisation in practice

Lemmatisation is based on a dictionary. Each language needs semantic and part-of-speech analysis to build a complete lexicon. The lexicon for English is already very mature.

The NLTK library in Python includes WordNet, a lexical database of English words. Its words are linked to one another by their semantic relationships, so the links depend on word meaning. We can take advantage of WordNet for lemmatisation.

import nltk
from nltk.stem import WordNetLemmatizer

# Requires the WordNet data, installed once with nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("blogs"))
# Returns blog
print(lemmatizer.lemmatize("drove", pos="v"))
# Returns drive (pos="v" treats the word as a verb; the default is noun)

 

Final Thoughts

Both stemming and lemmatisation unify words that look different but mean the same thing, which makes later processing and analysis easier.

Both are part of preprocessing an English corpus.

4 similarities between stemming and lemmatisation:

  1. Consistent goal
  2. Some results are consistent
  3. Mainstream implementation is similar
  4. Similar application fields

5 differences between stemming and lemmatisation:

  1. Different in principle
  2. Lemmatisation is more complex
  3. Implementations differ in emphasis
  4. The results differ
  5. The application focus is not exactly the same

3 mainstream stemming algorithms:

  1. Porter
  2. Snowball
  3. Lancaster

For English, lemmatisation is available directly in Python's NLTK library, which includes a lexical database of English words (WordNet).

 

Baidu Encyclopedia + Wikipedia

Stemming

Baidu Encyclopedia version

In morphology and information retrieval, stemming is the process of removing affixes to obtain the root, the most general written form of the word. The stem does not need to be identical to the morphological root of the word; it is usually enough that related words map to the same stem, even if that stem is not a valid root. Stemming algorithms have existed in computer science since 1968. Many search engines treat words with the same stem as synonyms for query expansion, a process called conflation. A stemming program is commonly called a stemming algorithm or a stemmer.


Wikipedia version

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.


 

Lemmatisation

Wikipedia version

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighbouring sentences or even an entire document. As a result, developing efficient lemmatisation algorithms is an open area of research.
