Understanding Part-of-Speech Tagging


What is part-of-speech tagging?


Wikipedia defines part of speech as follows: in traditional grammar, a part of speech (abbreviated PoS or POS) is a category of words (or, more generally, of lexical items) that have similar grammatical properties.

In other words, a part of speech classifies words by their shared characteristics. "Word class" is a linguistic term for the grammatical classification of the words of a language, arrived at by dividing words according to their grammatical features (syntactic function and morphological behavior) together with their lexical meaning.

Viewed in terms of combination and aggregation, a word class is a category that groups together the words of a language that share the same syntactic function and can appear in the same positions. Word classes are the most common kind of grammatical aggregate, and their division is hierarchical. In Chinese, for example, words divide into content words and function words; content words further divide into substantives, predicates, and so on; and substantives in turn split into nouns, pronouns, and other classes.

Part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence: deciding its part of speech and labeling it accordingly. It is a fundamental task in natural language processing. Research on part-of-speech tagging has a long history, and that research makes clear that Chinese part-of-speech tagging in particular raises a number of difficulties.
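To make the task concrete, a tagger maps a tokenized sentence to a list of (word, tag) pairs. The toy sketch below does this with a hand-made lexicon and a naive fallback for unknown words; the lexicon, the sentence, and the Penn-Treebank-style tags are all illustrative assumptions, not part of any real tagger.

```python
# A toy illustration of what a part-of-speech tagger produces:
# input is a tokenized sentence, output pairs each token with a
# grammatical category (tags here follow the Penn Treebank style).

def tag_with_lexicon(tokens, lexicon):
    """Look up each token's tag in a hand-made lexicon; 'NN' (noun)
    is a naive fallback for unknown words, a common baseline choice."""
    return [(tok, lexicon.get(tok, "NN")) for tok in tokens]

lexicon = {"The": "DT", "dog": "NN", "barked": "VBD", "loudly": "RB"}
print(tag_with_lexicon(["The", "dog", "barked", "loudly"], lexicon))
# → [('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD'), ('loudly', 'RB')]
```

A pure lexicon lookup like this cannot resolve ambiguity (a word that can be both noun and verb always gets the same tag), which is exactly the problem the methods below address.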

 

Difficulties in Chinese part-of-speech tagging

Chinese is a language with little inflection, so a word's category cannot be read directly off its morphology, as it often can in Indo-European languages.

Category ambiguity is widespread among common words. Of the frequent words collected in Modern Chinese: Eight Hundred Words, as many as 22.5% belong to more than one category, and the more common a word is, the more distinct usages it tends to have. Because these multi-category words are used so heavily and the ambiguity touches most Chinese word classes, disambiguating them in Chinese text is an enormous task.

There are also difficulties stemming from the researchers themselves. Linguists still disagree on the purpose of and criteria for dividing parts of speech; there is as yet no unified standard for Chinese word classes, and neither the granularity of the division nor the tag sets are uniform. These differences in classification criteria and tag inventories, together with ambiguities in word segmentation guidelines, create serious obstacles for Chinese language processing.

 

Four common part-of-speech tagging methods

Part-of-speech tagging has been studied extensively. Four common approaches are introduced here: rule-based methods, methods based on statistical models, methods that combine statistical and rule-based techniques, and deep-learning-based methods.

 

Rule-based part-of-speech tagging method

The rule-based approach is the earliest method of part-of-speech tagging. Its basic idea is to construct word-class disambiguation rules from collocation relationships and context. Early rule sets were generally built by hand.

As annotated corpora grew, more and more resources became available, and extracting rules by hand became impractical. This led to methods that learn the rules automatically from data.
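The idea of context-based disambiguation rules can be sketched as follows, loosely in the spirit of transformation-based (Brill-style) tagging: start from each word's most common tag, then apply rules that correct a tag based on its neighbor. The lexicon, the two rules, and the sentence are illustrative assumptions only.

```python
# A minimal sketch of rule-based disambiguation: initial tags come
# from a most-frequent-tag lexicon, then hand-written context rules
# correct them. All data here is illustrative.

DEFAULT_TAGS = {"the": "DT", "can": "NN", "fish": "NN", "I": "PRP"}

# Each rule: (current tag, previous tag, corrected tag)
RULES = [
    ("NN", "PRP", "MD"),  # "noun" right after a pronoun: modal ("I can ...")
    ("NN", "MD", "VB"),   # "noun" right after a modal: verb ("can fish")
]

def rule_based_tag(tokens):
    tags = [DEFAULT_TAGS.get(t, "NN") for t in tokens]  # initial tagging
    for i in range(1, len(tags)):                       # left-to-right pass
        for cur, prev, new in RULES:
            if tags[i] == cur and tags[i - 1] == prev:
                tags[i] = new
    return list(zip(tokens, tags))

print(rule_based_tag(["I", "can", "fish"]))
# → [('I', 'PRP'), ('can', 'MD'), ('fish', 'VB')]
```

In a machine-learned variant, rules of this shape are not written by hand but induced from an annotated corpus by greedily picking the rule that fixes the most errors.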

 

Part of speech tagging method based on statistical model

Statistical methods treat part-of-speech tagging as a sequence labeling problem. The basic idea is that, given a sequence of words and the tags assigned so far, we can determine the most likely tag for the next word.

Statistical models such as the Hidden Markov Model (HMM) and the Conditional Random Field (CRF) can be trained on large annotated corpora, where "annotated" means text in which each word has been assigned its correct part-of-speech tag.
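For an HMM, tagging a sentence means finding the tag sequence with the highest joint probability, which the Viterbi algorithm computes. The sketch below decodes a tiny HMM whose start, transition, and emission probabilities are toy numbers I have made up for illustration; in practice they are estimated by counting over a tagged corpus.

```python
# A minimal HMM part-of-speech tagger decoded with the Viterbi
# algorithm. All probabilities are toy values, not corpus estimates.

TAGS = ["DT", "NN", "VB"]
START = {"DT": 0.6, "NN": 0.3, "VB": 0.1}   # P(tag at sentence start)
TRANS = {                                    # P(tag_i | tag_{i-1})
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.3, "VB": 0.2},
}
EMIT = {                                     # P(word | tag)
    "DT": {"the": 0.9},
    "NN": {"dog": 0.7, "runs": 0.2},
    "VB": {"dog": 0.1, "runs": 0.8},
}

def viterbi(words):
    # best[t] = (probability of best path ending in tag t, that path)
    best = {t: (START[t] * EMIT[t].get(words[0], 0.0), [t]) for t in TAGS}
    for w in words[1:]:
        best = {
            t: max(
                (p * TRANS[prev][t] * EMIT[t].get(w, 0.0), path + [t])
                for prev, (p, path) in best.items()
            )
            for t in TAGS
        }
    return max(best.values())[1]  # path of the highest-probability tag

print(viterbi(["the", "dog", "runs"]))
# → ['DT', 'NN', 'VB']
```

A CRF differs in that it models the conditional probability of the whole tag sequence given the whole sentence with arbitrary overlapping features, but decoding still uses the same dynamic-programming idea.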

 

Part-of-speech tagging combining statistical and rule-based methods

Combining rationalism (rules) and empiricism (statistics) has long been a theme of research in natural language processing, and part-of-speech tagging is no exception.

The main feature of this type of method is that it screens the statistical tagger's results: rule-based disambiguation is applied only to outputs the tagger flags as doubtful, rather than running both the statistical and the rule components on every case.
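The screening strategy described above can be sketched as follows. The statistical tagger's output, its confidence scores, the threshold, and the single correction rule are all illustrative assumptions; real systems derive the confidence from the model's probabilities.

```python
# A sketch of the combined strategy: trust the statistical tagger
# where it is confident, and invoke hand-written rules only for
# tokens it flags as doubtful. All data here is illustrative.

# Pretend output of a statistical tagger: (word, best tag, confidence)
statistical_output = [
    ("the", "DT", 0.99),
    ("duck", "NN", 0.55),   # low confidence: noun or verb?
    ("swims", "VB", 0.97),
]

CONFIDENCE_THRESHOLD = 0.8

def rule_correct(word, prev_tag):
    """Toy disambiguation rule: after a determiner, prefer a noun."""
    return "NN" if prev_tag == "DT" else "VB"

def combined_tag(output):
    tags = []
    for word, tag, conf in output:
        if conf < CONFIDENCE_THRESHOLD:      # doubtful: apply the rules
            prev_tag = tags[-1] if tags else None
            tag = rule_correct(word, prev_tag)
        tags.append(tag)
    return [(w, t) for (w, _, _), t in zip(output, tags)]

print(combined_tag(statistical_output))
# → [('the', 'DT'), ('duck', 'NN'), ('swims', 'VB')]
```

Restricting the rules to doubtful cases keeps the system fast on the easy majority of tokens while still letting linguistic knowledge resolve the genuinely ambiguous ones.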

 

Part of speech tagging method based on deep learning

Part-of-speech tagging can also be treated as a sequence labeling task for neural networks. The most common deep-learning approaches to sequence labeling currently include LSTM+CRF, BiLSTM+CRF, and similar architectures.

It is worth mentioning that many papers on this family of methods have appeared in recent years; readers who want to dig deeper can consult NLP-progress on GitHub.

Finally, here is a dataset for the part-of-speech tagging task: the People's Daily 1998 part-of-speech tagged corpus.

 

 

Recommended part-of-speech tagging tools

Jieba

"Stuttering" Chinese participle: Do the best Python Chinese word segmentation component, can be used for part-of-speech tagging.

Github address

 

SnowNLP

SnowNLP is a Python library that makes it easy to process Chinese text.

Github address

 

THULAC

THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Lab at Tsinghua University. It provides Chinese word segmentation and part-of-speech tagging.

Github address

 

StanfordCoreNLP

The open-source toolkit from the Stanford NLP Group; it offers a Python interface.

Github address

 

HanLP

HanLP is a toolkit comprising a series of NLP models and algorithms, led by Dakuai Search and completely open source. Its goal is to popularize natural language processing in production environments.

Github address

 

NLTK

NLTK is a Python platform for building programs that process human language data.

Github address

 

SpaCy

An industrial-strength natural language processing toolkit; unfortunately, it does not support Chinese.

GitHub address | Official website

The code has been uploaded to

 

This article is reposted from the WeChat public account AI Xiaobai, Original address