Word segmentation is a basic NLP task: it decomposes sentences and paragraphs into word units to facilitate subsequent processing and analysis.
This article introduces the reasons for word segmentation, 3 differences between Chinese and English word segmentation, 3 difficulties of Chinese word segmentation, and 3 typical word segmentation methods. Finally, it lists the tools commonly used for Chinese and English word segmentation.
What is word segmentation?
Word segmentation is an important step in natural language processing (NLP).
Word segmentation is the decomposition of long texts such as sentences, paragraphs, and articles into data structures in units of words, which facilitates subsequent processing and analysis.
Why segment words?
1. Turning complex problems into mathematical problems
As mentioned in the article on machine learning, machine learning can solve many complicated problems because it translates them into mathematical problems.
NLP follows the same idea. Text is "unstructured data", so we first need to convert it into "structured data". Structured data can then be turned into mathematics, and word segmentation is the first step of that conversion.
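As a rough illustration of how segmented text becomes something numerical, the sketch below turns already tokenized documents into word-count vectors. The function name `to_count_vectors` and the sample documents are assumptions made for this example, and counting words is only one of many possible representations.

```python
from collections import Counter

def to_count_vectors(tokenized_docs):
    """Map each tokenized document to a vector of word counts."""
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # One dimension per vocabulary entry; absent words count as 0.
        vectors.append([counts[tok] for tok in vocabulary])
    return vocabulary, vectors

# Hypothetical, already segmented documents.
docs = [["i", "like", "nlp"], ["i", "like", "word", "segmentation"]]
vocabulary, vectors = to_count_vectors(docs)
print(vocabulary)
print(vectors)
```

Once every document is a vector of the same length, standard mathematical tools such as distances or linear models can be applied to it.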
2. Words are a more suitable granularity
A word is the smallest unit that can express a complete meaning.
The granularity of a single character is too small to express a complete meaning. For example, the character "鼠" can be part of "老鼠" (mouse, the animal) or "鼠标" (computer mouse).
The granularity of a sentence is too large: it carries a lot of information and is difficult to reuse. For example, "An important reason traditional methods need word segmentation is that they are weak at modeling long-distance dependencies."
3. In the era of deep learning, some tasks can also work at the character level
In the era of deep learning, with the explosive growth of data volume and computing power, many traditional methods have been subverted.
Word segmentation has always been the foundation of NLP, but that is no longer necessarily the case. If you are interested, see the paper "Is Word Segmentation Necessary for Deep Learning of Chinese Representations?".
However, word segmentation is still necessary for some specific tasks, such as keyword extraction and named entity recognition.
3 typical differences between Chinese and English word segmentation
Difference 1: Different ways of segmenting, and Chinese is more difficult
English has spaces as natural separators, but Chinese does not, so deciding where to split is itself a difficult problem. In addition, Chinese words often carry multiple meanings, which easily leads to ambiguity. These difficulties are explained in detail in a later section.
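A quick way to see the difference is to split both languages on whitespace. The two sample sentences below are assumptions chosen for this illustration.

```python
# Hypothetical sample sentences with the same meaning.
english = "I love natural language processing"
chinese = "我爱自然语言处理"

# str.split() with no arguments splits on runs of whitespace.
english_tokens = english.split()
chinese_tokens = chinese.split()

print(english_tokens)  # one token per English word
print(chinese_tokens)  # the whole Chinese sentence stays in one piece
```

Whitespace splitting gives a usable first pass for English, while the Chinese sentence comes back unsplit because it contains no spaces at all.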
Difference 2: English words have multiple forms
English words have rich inflections. To deal with them, English NLP has some processing steps that Chinese does not need: lemmatization and stemming.
Lemmatization: does, done, doing, and did need to be restored to do.
Stemming: words such as cities, children, and teeth need to be converted to their basic forms city, child, and tooth.
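The examples above can be reproduced with a deliberately tiny sketch. It is not how production stemmers or lemmatizers work: the lookup table `IRREGULAR`, the two suffix rules, and the function name `to_base_form` are all assumptions made for this illustration. It does show that regular forms can be handled by rules while irregular forms need a lookup.

```python
# Hypothetical lookup table for irregular forms; real tools use far larger lexicons.
IRREGULAR = {
    "does": "do", "done": "do", "doing": "do", "did": "do",
    "children": "child", "teeth": "tooth",
}

def to_base_form(word: str) -> str:
    """Reduce a word to a base form using a lookup table plus two toy suffix rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    # Toy rule: "cities" -> "city".
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    # Toy rule: strip a plural "s", but leave words such as "class" alone.
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w

for word in ["does", "done", "doing", "did", "cities", "children", "teeth"]:
    print(word, "->", to_base_form(word))
```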
Difference 3: Chinese word segmentation needs to consider granularity
For example, the "University of Science and Technology of China" has a variety of divisions:
- University of Science and Technology of China
- China\Science and Technology\University
- China\Science\Technology\University
The larger the granularity, the more precise the meaning expressed, but the lower the recall. Chinese therefore needs different granularities for different scenarios. English does not have this problem.
3 major difficulties of Chinese word segmentation
Difficulty 1: There is no uniform standard
There is no uniform standard for Chinese word segmentation at present, and there is no recognized norm. Different companies and organizations have their own methods and rules.
Difficulty 2: How to resolve ambiguous segmentations
For example, the sentence "乒乓球拍卖完了" has 2 possible segmentations that express 2 different meanings:
- 乒乓球 \ 拍卖 \ 完了 (table tennis ball \ auction \ finished)
- 乒乓 \ 球拍 \ 卖 \ 完了 (ping-pong \ paddle \ sold \ out)
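The ambiguity can be made concrete by enumerating every way a dictionary allows the sentence to be split. The sketch below assumes the Chinese sentence 乒乓球拍卖完了 and a hypothetical six-word dictionary `TOY_DICT`; the function name `all_segmentations` is likewise made up for this example.

```python
def all_segmentations(text, dictionary):
    """Return every way to split text into words that all appear in dictionary."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # Keep this word and segment the remainder recursively.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results

# Hypothetical toy dictionary, just large enough to expose the ambiguity.
TOY_DICT = {"乒乓", "乒乓球", "球拍", "拍卖", "卖", "完了"}

segmentations = all_segmentations("乒乓球拍卖完了", TOY_DICT)
for seg in segmentations:
    print(" \\ ".join(seg))
```

Both results are valid with respect to the dictionary, so a segmenter needs extra information, such as context or statistics, to choose between them.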
Difficulty 3: Recognition of new words
In the era of information explosion, new words emerge every few days, and identifying them quickly is a major difficulty. For example, when "blue thin mushroom" (蓝瘦香菇) went viral, it needed to be recognized quickly.
3 typical word segmentation methods
Word segmentation methods fall roughly into 3 classes:
- Dictionary-based matching
- Based on statistics
- Deep learning
Dictionary-based matching
Advantages: fast speed and low cost
Disadvantages: poor adaptability, with large differences in performance across domains
The basic idea is dictionary matching: the Chinese text to be segmented is split and adjusted according to certain rules, and the candidates are then matched against the words in a dictionary. If a match succeeds, the text is segmented according to the dictionary; if it fails, the candidate is adjusted or re-selected and matching is repeated. Representative methods are forward maximum matching, backward maximum matching, and bidirectional matching.
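The matching loop described above can be sketched as follows. This is a minimal illustration, not the implementation used by any particular tool: the function names, the `max_len` parameter, the single-character fallback, and the sample sentence with its toy dictionary are all assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word available."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches;
        # a single character is accepted as a fallback so the loop always advances.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

def backward_max_match(text, dictionary, max_len=4):
    """Same idea, but scanning from the end of the text toward the start."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                j -= size
                break
    tokens.reverse()  # tokens were collected from right to left
    return tokens

# Hypothetical toy dictionary and sentence ("study the origin of life").
TOY_DICT = {"研究", "研究生", "生命", "的", "起源"}
sentence = "研究生命的起源"

forward = forward_max_match(sentence, TOY_DICT)
backward = backward_max_match(sentence, TOY_DICT)
print("forward: ", forward)
print("backward:", backward)
```

On this sentence the two directions disagree, which is why bidirectional matching runs both and then picks one result with a heuristic, for example preferring the result with fewer single-character tokens.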
Statistical segmentation method
Advantages: strong adaptability
Disadvantages: higher cost and slower speed
Commonly used algorithms include HMM, CRF, SVM, and deep learning. For example, the Stanford and HanLP word segmentation tools are based on the CRF algorithm. Taking CRF as an example, the basic idea is to label each Chinese character in sequence. The model considers not only how often words occur but also their context, and it learns well from data, so it is effective at recognizing ambiguous words and out-of-vocabulary words.
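To make "labeling the characters" concrete, the sketch below converts between a segmented sentence and per-character tags. It assumes the widely used B/M/E/S scheme (begin, middle, end, single), which the text above does not name explicitly; the function names are also made up here. A statistical model such as a CRF would predict these tags, and this code only shows the encoding and decoding around it.

```python
def words_to_tags(words):
    """Encode a segmented sentence as one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # First character is B, last is E, everything between is M.
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Decode per-character tags back into words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words

segmented = ["乒乓球", "拍卖", "完了"]
tags = words_to_tags(segmented)
print(tags)
print(tags_to_words("".join(segmented), tags))
```

Framing segmentation this way turns it into a sequence labeling problem, so the same kind of model can be reused for other labeling tasks.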
Deep learning
Advantages: high accuracy and adaptability
Disadvantages: high cost and slow speed
For example, some have used a bidirectional LSTM + CRF to implement a segmenter. It is essentially sequence labeling, so the model is general-purpose and can also be used for tasks such as named entity recognition. Its reported character-level segmentation accuracy is as high as 97.5%.
Common word segmenters combine machine learning algorithms with dictionaries, which improves segmentation accuracy on the one hand and domain adaptability on the other.
Chinese word segmentation tools
The list below is ranked by number of GitHub stars:
- HanLP
- Stanford Word Segmenter
- Ansj word segmenter
- Harbin Institute of Technology LTP
- KCWS word segmenter
- Jieba
- IK
- Tsinghua University THULAC
- ICTCLAS
English word segmentation tools
Final Thoughts
Word segmentation is the decomposition of long texts such as sentences, paragraphs, and articles into data structures in units of words, which facilitates subsequent processing and analysis.
Reasons for word segmentation:
- Turn complex problems into mathematical problems
- Word is a more appropriate granularity
- In the era of deep learning, some tasks can also work at the character level
3 typical differences between Chinese and English word segmentation:
- Different ways of word segmentation, Chinese is more difficult
- English words have multiple forms, requiring lemmatization and stemming
- Chinese word segmentation needs to consider the granularity problem
3 major difficulties of Chinese word segmentation:
- No uniform standard
- How to distinguish ambiguous words
- New word recognition
3 typical word segmentation methods:
- Dictionary-based matching
- Based on statistics
- Deep learning
Baidu Encyclopedia + Wikipedia
Chinese word segmentation is the process of recombining a continuous sequence of characters into a sequence of words according to certain specifications. In English, spaces act as natural delimiters between words. In Chinese, only characters, sentences, and paragraphs can be delimited by explicit delimiters; words have no formal delimiter. Although English also has the problem of dividing phrases, at the word level Chinese is much more complicated and much more difficult than English.
Tokenization is the process of dividing, and possibly classifying, sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be thought of as a subtask of parsing input.