Stemming and lemmatization are important steps in English corpus preprocessing. Although their purpose is the same, there are some differences between the two.
This article introduces their concepts, similarities and differences, and implementation algorithms.
Where do stemming and lemmatization fit in NLP?
Stemming is a preprocessing step for English corpora (Chinese does not need it), and corpus preprocessing is the first step of NLP. The following figure shows where stemming sits in this knowledge structure.
What are stemming and lemmatization?
Stemming
Stemming is the process of removing the affixes of a word to obtain its stem.
Common affixes include noun plurals, progressive forms, past participles, and so on.
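The idea of removing affixes can be sketched in a few lines. This is only a toy illustration, not one of the real stemming algorithms: the suffix list `SUFFIXES`, the minimum length `MIN_STEM_LEN`, and the function name `simple_stem` are all assumptions made for this example.

```python
# Toy suffix-stripping stemmer (illustrative only, not the Porter algorithm).
# SUFFIXES and MIN_STEM_LEN are hypothetical choices for this sketch.
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order
MIN_STEM_LEN = 3  # never cut a word down to fewer characters than this


def simple_stem(word):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


for w in ["cats", "playing", "played", "drove"]:
    print(w, "->", simple_stem(w))
```

Note that "drove" passes through unchanged: a rule that only cuts suffixes has no way to reach "drive".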
Lemmatization
Lemmatization uses a dictionary to transform an inflected form of a word into its most basic form, the lemma.
Lemmatization does not simply remove affixes; it converts the word according to the dictionary. For example, "drove" is converted to "drive".
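A dictionary lookup of this kind might look like the sketch below. The table `LEMMA_TABLE` is a tiny hand-written mapping invented for this example, and `simple_lemmatize` is a hypothetical helper; a real lemmatizer relies on a full lexicon.

```python
# Toy dictionary-based lemmatizer. LEMMA_TABLE is a hypothetical, hand-written
# mapping for this sketch; real systems use a complete lexicon.
LEMMA_TABLE = {
    "drove": "drive",
    "driving": "drive",
    "driven": "drive",
    "dogs": "dog",
    "mice": "mouse",
}


def simple_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for w in ["drove", "driving", "mice", "basketball"]:
    print(w, "->", simple_lemmatize(w))
```

Irregular forms such as "drove" and "mice" are handled only because the table lists them explicitly, which is why the quality of the dictionary matters so much.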
Why do we need stemming and lemmatization?
For example, when I search for "play basketball", the sentence "Bob is playing basketball" also meets my needs. But to a computer, "play" and "playing" are two completely different things, so we need to convert "playing" into "play".
The purpose of stemming and lemmatization is to unify words that look different but have the same meaning, which makes subsequent processing and analysis easier.
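The search example can be sketched as follows. The table `NORMAL_FORMS` and the helpers `normalize` and `matches` are hypothetical names for this illustration; in practice the mapping would come from a stemmer or lemmatizer rather than a hand-written table.

```python
# Hypothetical mini search example: match a query against a document after
# mapping every token to a common form. NORMAL_FORMS is an illustrative
# hand-written table, not a real stemmer or lemmatizer.
NORMAL_FORMS = {"playing": "play", "plays": "play", "played": "play"}


def normalize(text):
    """Lowercase, split on whitespace, and map each token to its base form."""
    return {NORMAL_FORMS.get(tok, tok) for tok in text.lower().split()}


def matches(query, document):
    """True if every normalized query term occurs in the normalized document."""
    return normalize(query) <= normalize(document)


query = "play basketball"
document = "Bob is playing basketball"

# Without normalization, "play" is not among the document's raw tokens.
raw_match = set(query.lower().split()) <= set(document.lower().split())
print("raw match:", raw_match)
print("normalized match:", matches(query, document))
```

The raw comparison fails because "play" and "playing" are different strings, while the normalized comparison succeeds.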
4 similarities between stemming and lemmatization
- The goal is the same. Both stemming and lemmatization reduce the inflected or derived forms of a word to a basic form, the stem or the lemma, merging different forms of the same word into one.
- The results partially overlap. Stemming and lemmatization are not mutually exclusive, and for some words both methods produce the same output. For example, the stem of "dogs" is "dog", and its lemma is also "dog".
- The mainstream implementation methods are similar. Both are currently implemented mainly by applying rules that exist in the language or by using dictionary mappings to extract stems or obtain lemmas.
- The application areas are similar. Both are mainly used in information retrieval, text mining, and natural language processing, and both are basic steps in these applications.
5 differences between stemming and lemmatization
- In principle, stemming mainly uses "reduction": it cuts a word down to its stem, such as "cats" to "cat" and "effective" to "effect". Lemmatization mainly uses "transformation": it converts a word to its base form, such as "drove" to "drive" and "driving" to "drive".
- In terms of complexity, stemming is relatively simple. Lemmatization has to return the base form of a word, so it must analyze the word's morphology: it not only converts affixes but also identifies the part of speech, in order to distinguish words that share a surface form but have different base forms. The accuracy of part-of-speech tagging therefore directly affects the accuracy of lemmatization, which makes lemmatization the more complex of the two.
- In terms of implementation, although the mainstream methods of stemming and lemmatization are similar, the two differ in emphasis. Stemming mainly applies rules that remove or shorten affixes in order to simplify the word. Lemmatization is more complex: morphological changes are often irregular and cannot be handled well by rules alone, so it relies more on a dictionary that maps inflected forms to base forms and produces valid dictionary words.
- In terms of results, stemming and lemmatization also partially differ. The result of stemming may not be a complete, meaningful word but only part of one: stemming "revival" gives "reviv", and stemming "airliner" gives "airlin". The result of lemmatization is a complete, meaningful word, generally a valid word in the dictionary.
- In terms of application, the focus also differs. Although both are used in information retrieval and text processing, stemming is used more in information retrieval, for example in Solr and Lucene, to expand retrieval at a coarse granularity. Lemmatization is mainly applied in text mining and natural language processing, for finer-grained and more accurate text analysis and representation.
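The dependence of lemmatization on part of speech can be illustrated with a small sketch. The table `POS_LEMMAS`, the tags "n" and "v", and the function `lemmatize_with_pos` are assumptions made for this example only; a real lemmatizer would take the tag from a part-of-speech tagger and the mapping from a full lexicon.

```python
# Toy part-of-speech-aware lemmatizer. POS_LEMMAS is a hypothetical table;
# "n" and "v" are illustrative tags for noun and verb.
POS_LEMMAS = {
    ("leaves", "n"): "leaf",   # the leaves of a tree
    ("leaves", "v"): "leave",  # she leaves at noon
    ("drove", "v"): "drive",   # he drove home
    ("drove", "n"): "drove",   # a drove of cattle
}


def lemmatize_with_pos(word, pos):
    """Look up (word, pos); fall back to the lowercased word if unknown."""
    word = word.lower()
    return POS_LEMMAS.get((word, pos), word)


for word, pos in [("leaves", "n"), ("leaves", "v"), ("drove", "v"), ("drove", "n")]:
    print(word, pos, "->", lemmatize_with_pos(word, pos))
```

The same surface form maps to different base forms depending on the tag, so a wrong tag from the tagger leads directly to a wrong lemma.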
3 mainstream stemming algorithms
Porter: This is the oldest of the three stemming algorithms. It dates from the 1980s and focuses on removing common word endings so that words resolve to a common form. It is not very complicated, and its development has stopped.
It is a good basic stemmer to start with, but it is not recommended for complex applications. In research it is often used as a baseline stemming algorithm to ensure reproducibility. It is also a very gentle stemmer compared with the other algorithms.
Snowball: This algorithm is also known as the Porter2 stemming algorithm. It is almost universally considered better than Porter, even by the developer who invented Porter. Snowball adds many optimizations to Porter, and the results of the two differ by about 5%.
Lancaster: This algorithm is more aggressive and sometimes reduces words to strange forms. If you use this stemmer in NLTK, it is very easy to add your own custom rules to it.
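To see how the three algorithms differ in practice, they can be run side by side. The sketch below assumes NLTK is installed (`pip install nltk`) and uses its `PorterStemmer`, `SnowballStemmer`, and `LancasterStemmer` classes; the helper `compare_stems` and the word list are choices made for this example.

```python
# Side-by-side comparison of the three stemmers shipped with NLTK.
# Assumes NLTK is installed; the stemmers need no extra data downloads.
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}


def compare_stems(words):
    """Return {word: {stemmer_name: stem}} for each word in the list."""
    return {
        word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
        for word in words
    }


for word, stems in compare_stems(["cats", "revival", "airliner", "generously"]).items():
    print(word, stems)
```

Running it on your own vocabulary is a quick way to judge which stemmer is gentle enough, or aggressive enough, for a given task.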
Lemmatization in practice
Lemmatization is based on a dictionary. For each language, a complete lexicon has to be built through semantic analysis and part-of-speech tagging. At present, the lexicons for English are very mature.
The NLTK library in Python includes WordNet, a lexical database of English words in which words are linked by their semantic relationships, so the links depend on word meaning. We can use WordNet for lemmatization.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("blogs"))
# Returns blog
Both stemming and lemmatization unify words that look different but have the same meaning, which makes subsequent processing and analysis easier.
They are part of English corpus preprocessing.
4 similarities between stemming and lemmatization:
- Consistent goal
- Some results are consistent
- Mainstream implementation is similar
- Similar application fields
5 differences between stemming and lemmatization:
- Different in principle
- Lemmatization is more complex
- The implementations differ in emphasis
- The results differ
- The application focus is not exactly the same
The 3 mainstream stemming algorithms: Porter, Snowball, and Lancaster.
English lemmatization can be done directly with the NLTK library in Python, which contains a lexical database of English words.