This article is reproduced from the Jiangmen Ventures (将门创投) WeChat public account. Original address

This article reviews technological advances in Chinese word segmentation over the ten years 2007-2017, with particular attention to the main work since deep learning began to penetrate natural language processing. Our basic conclusion is that supervised machine learning for Chinese word segmentation has not shown obvious technical advantages in the migration from non-neural methods to neural network methods. The construction of machine learning models for Chinese word segmentation still needs to balance the recognition of known words and of out-of-vocabulary (unregistered) words.

Although the application of deep learning to Chinese word segmentation has not yet clearly surpassed traditional machine learning methods, we cautiously speculate that neural network models, rooted in the connectionist school of artificial intelligence, have the potential to fit the way natural language itself decomposes and thus to model it effectively, and may therefore demonstrate new technological advances in the near future.

1 Background

Chinese word segmentation is a basic task and research direction in Chinese information processing. Ten years ago, at the invitation of the Journal of Chinese Information Processing, Huang and Zhao (2007) surveyed the decade of machine learning approaches to Chinese word segmentation since the end of the 20th century and published the article "Review of Chinese Word Segmentation for Ten Years". The basic conclusion of that paper was that statistical machine learning methods for Chinese word segmentation are superior to traditional rule-based methods, with a particularly clear advantage in recognizing out-of-vocabulary (OOV) words, i.e., words that do not appear in the training set. This conclusion was subsequently fully confirmed. Ten years have now passed; as of March 2017, Google Scholar shows that the article has been cited 166 times, while CNKI (China National Knowledge Infrastructure) records 483 citations.

Today, it seems natural to train high-performance Chinese word segmenters by machine learning on corpora annotated with segmentation marks, but the situation was very different ten years ago. First, it was not until the last decade of the 20th century that the Chinese information processing community realized that word segmentation could be treated as a machine learning task over manually annotated corpora. Second, adequate corpus preparation was not accomplished overnight. The two earliest corpora were the Chinese Treebank (CTB) of the University of Pennsylvania and the People's Daily corpus annotated by the Institute of Computational Linguistics at Peking University. Building on corpora of this kind, SIGHAN was able to organize the first international Chinese word segmentation evaluation, SIGHAN Bakeoff-2003.

In addition to corpus preparation, two historical factors delayed the full transition of Chinese word segmentation, this basic subtask of Chinese information processing, to machine learning. First, for a long time the classical method of Chinese word segmentation, the maximum matching algorithm, could usually achieve acceptable performance given an appropriate dictionary. Measured by F-score, maximum matching segmentation generally reaches about 80% or higher. The existence of such a simple and effective rule-based method greatly reduced the urgency of developing more advanced machine learning techniques. Second, the computational cost of machine learning methods was enormous, and their advantages could not be realized while the necessary hardware was not yet widespread or was still too expensive. In 2005, training a typical segmentation learning tool, the conditional random field (CRF), on a million-word corpus required 12-18 hours of single-threaded CPU time and 2-3 GB of memory, far beyond the typical hardware configuration of personal computers at the time.
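As a concrete illustration of the rule-based baseline mentioned above, here is a minimal sketch of forward maximum matching in Python; the sample dictionary and the maximum word length are hypothetical choices for illustration, not taken from the original article.

```python
def forward_maximum_matching(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary word that matches; fall back to a single character."""
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words

# Toy example (hypothetical dictionary entries):
vocab = {"研究", "研究生", "生命", "命", "起源"}
print(forward_maximum_matching("研究生命起源", vocab))
# -> ['研究生', '命', '起源']  (a classic ambiguity error of greedy matching)
```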

Therefore, looking back in 2017 at the state of the field ten years ago, we must consider the specific circumstances of that time historically in order to understand the particular rationality and inevitability of the technical advances made then and afterwards. The most significant technical explosion in machine learning over the past 10 years is clearly the rise and pervasive adoption of deep learning methods. We therefore divide this technical summary into two parts: the traditional machine learning models for Chinese word segmentation, and the more recent deep learning (neural network) models. The most recent technical papers referenced here extend up to some of the accepted papers of ACL-2017. Due to space limitations, however, we only cover related work under strictly supervised machine learning; unsupervised learning, semi-supervised learning, domain transfer learning, and other segmentation methods and applications are left to future surveys by more capable hands. Within this narrower perspective we take the liberty of reviewing some relatively recent and cutting-edge research efforts, including our own, in the hope that this modest contribution will attract better work.

2 Traditional machine learning models

As a segmentation process over a string, word segmentation is a relatively simple structured machine learning task. According to the structural unit into which the problem is decomposed, traditional machine learning models for word segmentation can be roughly divided into two categories: character-based tagging learning and word-based (word-feature-based) learning.

The character-tagging learning approach began with Xue (2003). That work used four tags expressing the relative position of a character within a word, namely B, M, E, and S (as shown in Table 1), to encode the segmentation information carried by each character, thereby formalizing word segmentation for the first time as a character-position sequence labeling task. Sequence labeling is the most basic structured learning task in natural language processing: in the probabilistic graphical models used for sequence labeling, the nodes of the two sequences are in strict one-to-one correspondence, which makes it very convenient to model and implement with a variety of mature machine learning tools. The first implementation by Xue (2003) did not yet use full sequence labeling structure learning, but directly applied a character-position classification model. Ng & Low (2004) and Low et al. (2005) were the first to apply strict sequence labeling learning to word segmentation, using a maximum entropy (ME) Markov model. Peng et al. (2004) and Tseng et al. (2005) then naturally introduced the standard sequence labeling tool, the conditional random field (CRF), into segmentation learning. Subsequently, multiple variants of the CRF constituted the standard word segmentation model of the pre-deep-learning era.
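To make the B/M/E/S encoding concrete, here is a minimal Python sketch (our own illustration, not from the original article) that converts between a segmented sentence and its character-position tags:

```python
def words_to_tags(words):
    """Encode a segmented sentence as per-character B/M/E/S tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Recover words from characters and their B/M/E/S tags."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:                      # tolerate a dangling B/M at sentence end
        words.append(buf)
    return words

words = ["我们", "是", "中国人"]
tags = words_to_tags(words)                   # ['B', 'E', 'S', 'B', 'M', 'E']
print(tags)
print(tags_to_words("".join(words), tags))    # ['我们', '是', '中国人']
```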

The Chinese word segmentation task is to identify the correct words in a specific context, so word-based segmentation learning faces a "chicken or egg" problem in its modeling. Modeling the word-level stochastic process directly leads to a CRF variant, the semi-CRF (semi-Markov conditional random field). Character-tagging segmentation usually uses a linear-chain CRF, which is built on a Markov process: at each step the model labels exactly one input unit of the sequence. The semi-CRF is instead built on a semi-Markov process, which at each step assigns a single label to a run of consecutive units. This property matches the word segmentation process so well that the model can be used for segmentation directly.
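The following minimal sketch (an illustration we add, not code from the article) contrasts the two decompositions: a linear-chain model takes one labeling step per character, while a semi-Markov model labels whole spans, so a segmentation is naturally a sequence of (start, end) spans:

```python
def segmentation_as_spans(words):
    """Represent a segmented sentence as the (start, end) character spans
    that a semi-Markov model would label one per step."""
    spans, start = [], 0
    for w in words:
        spans.append((start, start + len(w)))  # end index is exclusive
        start += len(w)
    return spans

words = ["我们", "是", "中国人"]
print(segmentation_as_spans(words))
# Markov (character) view:     6 labeling steps, one per character
# semi-Markov (word) view:     3 labeling steps -> [(0, 2), (2, 3), (3, 6)]
```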

Andrew (2006) published the first semi-CRF implementation for word segmentation. However, even by the standards of the time, the segmentation performance of the semi-CRF, which models words directly, was not ideal. In general, direct modeling should lead to better machine learning results, but when the semi-CRF was applied directly to segmentation this expectation proved hard to realize. Later, Sun et al. (2009) and Sun et al. (2012) used semi-CRF models with latent variables for word segmentation and pushed their performance to the state of the art. The former, the first latent-variable semi-CRF work, claimed to exploit both character-based and word-based feature information, and showed empirically that introducing latent variables improves the recall of long words by effectively capturing long-distance information; the latter additionally introduced new high-dimensional tag-transition Markov features and proposed an adaptive online gradient descent algorithm based on feature frequency to improve training efficiency. It is worth noting that training a linear-chain CRF is already several times slower than training the corresponding maximum entropy Markov model, because maximum entropy training time is proportional to the number of labels to be learned while CRF training time is proportional to the square of the number of labels; semi-CRF training is slower still than standard CRF training, which greatly limits the practical application of this family of models.

As the traditional character tagging approach developed further, it also began to incorporate information about salient known words (in-vocabulary words, IV). Zhang et al. (2006) proposed subword-based tagging learning, whose basic idea is to extract high-frequency known words from the training set to build a subword dictionary. However, this method alone does not perform well and needs to be combined with other models before its performance can be meaningfully compared with existing methods. Zhao & Kit (2007a) greatly improved this strategy: by iterating maximum matching segmentation over the training set, they found an optimal subword (substring) dictionary, and a single substring tagging learner was then able to achieve the best performance.
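As a rough illustration of the idea of a subword dictionary (our own simplified sketch; the selection criterion and frequency threshold are hypothetical, not those of Zhang et al. (2006) or Zhao & Kit (2007a)):

```python
from collections import Counter

def build_subword_dictionary(segmented_corpus, min_freq=5):
    """Collect frequent multi-character words from a pre-segmented training
    corpus as 'subword' tagging units; rare words stay decomposed into characters."""
    counts = Counter(w for sentence in segmented_corpus for w in sentence)
    return {w for w, c in counts.items() if len(w) > 1 and c >= min_freq}

# Toy corpus: each sentence is already a list of gold-standard words.
corpus = [["中国", "人民", "银行"], ["中国", "银行"], ["人民", "日报"]] * 3
print(build_subword_dictionary(corpus, min_freq=5))
# -> {'中国', '人民', '银行'}  ('日报' occurs only 3 times, below the threshold)
```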

The direct substring-tagging model actually applies known-word information too strongly, since every substring in the dictionary is treated as a known word and can no longer be split once the model starts. This defect was later corrected; the main works include Zhao & Kit (2007b; 2008b; 2008a; 2011). These works made two main improvements over the existing approach: first, all candidate substrings are scored by a specific statistical measure computed from n-gram counts over the training set; second, the basic model remains character-position tagging learning, with the previously obtained substring information entering only in the form of additional features. This line of work achieved the best performance under the traditional tagging model, including first place in four of the five closed word segmentation tracks of SIGHAN Bakeoff-2008 (Zhao & Kit, 2008b). When the extraction of substrings and the computation of their statistical scores are extended beyond the training set, the approach of Zhao & Kit (2011) in effect becomes a highly scalable semi-supervised word segmentation method, whose effectiveness was also verified experimentally.

Unlike all the above methods based on sequence labeling, whether linear-chain CRF tagging or semi-CRF tagging, Zhang & Clark (2007) introduced a word segmentation method based on structured learning over whole-sentence segmentations. Although they call it a word-based method, it differs from the earlier work above; most notably, character and word n-gram features are extracted at the same positions in the segmentation decomposition of the whole sentence. In detail, they trained with a generalized perceptron algorithm and used approximate beam search for decoding. Although the model has theoretically broader feature expressiveness, in practice this work did not yield better segmentation performance (see Table 6 for a comparison of results).

Since word segmentation is an initial task in natural language processing, the types of features available under sequence labeling learning are quite limited: essentially all that can be chosen are the n-gram features within a sliding window, where the n-gram unit is a character or a word. In theory, arbitrary selection of feature templates built from individual n-gram features is computationally feasible in practice. Actual systems typically adopt a 5-character sliding window for character features, while Zhao et al. (2006a) and subsequent work use only a 3-character window; for word features, a sliding window over words is adopted in the same way. However, character-position tagging is not a direct learning of cut points: there are many mapping schemes from the latter (segmentation points) to the former (character-position tag sets), and once the tag set changes, the corresponding optimal n-gram feature set obviously changes as well. The discovery of this phenomenon and its complete empirical study were published in Zhao et al. (2006a) and Zhao et al. (2010a). Tables 2 and 3 list the earlier tag sets and the complete tag-set sequence examined in that work, which showed that using six n-gram features within a 3-character window under the 6-tag set (C−1, C0, C1, C−1C0, C0C1, C−1C1, where C0 denotes the current character) yields the best character tagging performance (with the CRF model by default).
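The following sketch (our illustration) extracts the six n-gram feature templates just listed for each character position within a 3-character window; the feature-name strings are a hypothetical convention:

```python
def ngram_features(chars, i):
    """Six n-gram features in a 3-character window around position i:
    C-1, C0, C1, C-1C0, C0C1, C-1C1 (sentence boundaries padded with '#')."""
    c_prev = chars[i - 1] if i > 0 else "#"
    c_cur = chars[i]
    c_next = chars[i + 1] if i + 1 < len(chars) else "#"
    return [
        "C-1=" + c_prev,
        "C0=" + c_cur,
        "C1=" + c_next,
        "C-1C0=" + c_prev + c_cur,
        "C0C1=" + c_cur + c_next,
        "C-1C1=" + c_prev + c_next,
    ]

sentence = "我们是中国人"
for i in range(len(sentence)):
    print(sentence[i], ngram_features(sentence, i))
```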

Table 3: All possible character-position tag sets up to the 6-tag set, with examples

3 Deep learning: neural network segmentation models

Once word embeddings reached the stage of practical numerical computation, deep learning began to sweep through natural language processing. In principle, an embedding vector carries part of the syntactic and semantic information of a character or word and should therefore lead to further performance improvement. As noted earlier, the features available to the Chinese word segmentation task are limited to n-gram features within a sliding window. Thus, although typical deep learning models are known for reducing the cost of feature engineering, the relief they bring to feature engineering for word segmentation is rather limited. Accordingly, the expected directions for further improvement of neural segmentation models are: first, to effectively integrate the embedded representations of characters or words and make full use of the syntactic and semantic information they contain; second, to combine the learning power of neural networks with existing traditional structured modeling approaches, for example by building network structures equivalent to the classical character tagging model.

Collobert et al. (2011) proposed a general neural network framework for natural language processing problems, especially sequence labeling: features are extracted within a sliding window, and a label classification problem is solved within each window. On this basis, Zheng et al. (2013) proposed a neural network Chinese word segmentation method, verifying for the first time the feasibility of applying deep learning to the Chinese word segmentation task. Their work directly borrows the structure of the Collobert model and takes character vectors as system input. Its technical contributions include: first, using character vectors pre-trained on large-scale text to improve supervised learning (making it an open test in the Bakeoff sense); second, replacing traditional maximum likelihood training with a perceptron-style training procedure to accelerate neural network training. In terms of structured modeling, this work is equivalent to the character tagging sequence learning model of Low et al. (2005); the only difference is that the maximum entropy model of the latter is replaced by a simple neural network. The block diagram is shown on the left of Figure 1. Owing to the weakness of its structured modeling, the accuracy of this model is only comparable to that of the much earlier Xue (2003), and far below the leading traditional character tagging models.

Figure 1: Model framework for Zheng et al. (2013) (left) and Pei et al. (2014) (right).
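As a rough sketch of the Collobert-style sliding-window classification architecture used by Zheng et al. (2013) above (our own simplified PyTorch illustration; layer sizes and the four-tag output are assumptions, not the authors' exact configuration):

```python
import torch
import torch.nn as nn

class WindowTagger(nn.Module):
    """Classify the B/M/E/S tag of the centre character of a fixed window:
    embedding lookup -> concatenate window -> hidden layer -> tag scores."""
    def __init__(self, vocab_size, emb_dim=50, window=5, hidden=150, num_tags=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(window * emb_dim, hidden)
        self.out = nn.Linear(hidden, num_tags)

    def forward(self, window_ids):          # (batch, window) character ids
        x = self.embed(window_ids)          # (batch, window, emb_dim)
        x = x.flatten(1)                    # (batch, window * emb_dim)
        x = torch.tanh(self.hidden(x))
        return self.out(x)                  # (batch, num_tags) tag scores

model = WindowTagger(vocab_size=5000)
scores = model(torch.randint(0, 5000, (8, 5)))   # a batch of 8 windows
print(scores.shape)                              # torch.Size([8, 4])
```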

In 2014, Pei et al. (2014) made an important improvement to the Zheng et al. (2013) model by introducing tag embeddings to describe the transitions between tags more finely, an improvement analogous to the way Low et al. (2005) first added Markov features to the maximum entropy model of Ng & Low (2004). Pei et al. proposed a new network, the Max-Margin Tensor Neural Network (MMTNN), and applied it to word segmentation (see the right of Figure 1), using tag embeddings and tensor transformations to capture the relationships between tags as well as between tags and context. In addition, to reduce computational complexity and prevent overfitting (a common problem of all neural network models), the paper also proposed a new tensor decomposition method.

Subsequently, in order to model the segmentation context more completely and accurately, Chen et al. (2015a) proposed a gated recursive neural network (GRNN) with an adaptive gate structure to extract n-gram features, in which two custom gates (a reset gate and an update gate) control the fusion and extraction of n-gram information. Unlike the simple concatenation of character-level information in the previous two studies, this model uses a deeper network structure and adopts supervised layer-by-layer training to avoid the gradient diffusion of conventional optimization methods.

In the same year, addressing the locality of sliding windows, Chen et al. (2015b) proposed using long short-term memory networks (LSTMs) to capture long-distance dependencies, partially overcoming the limitation of previous sequence labeling methods that could only extract features from a fixed-size sliding window. Xu & Sun (2016) combined the GRNN and the LSTM; their work can be seen as a model that merges Chen et al. (2015a) and Chen et al. (2015b). In this model, a bidirectional LSTM first extracts context-sensitive local information, the local information within a sliding window is then fused by a gated recursive neural network, and the result is finally used for tag classification. The LSTM is a structured modeling tool whose role in the neural model family parallels that of the linear-chain CRF in traditional models; once it was introduced into segmentation learning, neural models began to compete with traditional machine learning models in segmentation performance. Table 4 lists the traditional-neural correspondence of structured modeling approaches.
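For reference, a minimal bidirectional LSTM character tagger in PyTorch might look as follows (our own sketch; layer sizes are illustrative, and the CRF output layer used by some systems is omitted):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Per-character B/M/E/S scores from a bidirectional LSTM over the sentence."""
    def __init__(self, vocab_size, emb_dim=100, hidden=128, num_tags=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, char_ids):            # (batch, seq_len)
        x = self.embed(char_ids)            # (batch, seq_len, emb_dim)
        h, _ = self.lstm(x)                 # (batch, seq_len, 2 * hidden)
        return self.out(h)                  # (batch, seq_len, num_tags)

model = BiLSTMTagger(vocab_size=5000)
scores = model(torch.randint(0, 5000, (2, 30)))  # 2 sentences of 30 characters
print(scores.shape)                              # torch.Size([2, 30, 4])
```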

Unlike the character-based sequence labeling scheme that had dominated up to that point, neural networks allow relatively more flexible structured modeling, so other approaches beyond sequence labeling appeared one after another. Ma & Hinrichs (2015) proposed a character-based embedding matching algorithm over segmentation actions, which maintains respectable segmentation performance while offering a speed advantage over traditional methods. Specifically, the paper proposes an embedding matching model that can be regarded as an extension of traditional sequence labeling, with only linear time complexity in both training and testing (see the left of Figure 2). Two highlights of this work are worth noting: first, it was the first to take the computational efficiency of neural segmentation models seriously; second, it followed the strict SIGHAN Bakeoff closed-test requirements, using only a simple feature set and relying on no language resources outside the training set.

Figure 2: Model block diagrams of Ma & Hinrichs (2015) (left) and Liu et al. (2016) (right)

Zhang et al. (2016) proposed a transition-based model for word segmentation that combines traditional feature templates with features automatically extracted by neural networks, an attempt to integrate neural feature extraction with traditional discrete features. The results show that combining the two kinds of features can further improve segmentation accuracy.

Liu et al. (2016) applied the zero-order semi-Markov conditional random field to the neural segmentation model for the first time, and analyzed the influence of different character and word embeddings on segmentation. The paper models segmentation learning with a semi-CRF (see the right of Figure 2), characterizing each segment either with a directly embedded segment representation or with a representation fused from its input units, and examines several fusion methods and several segment embedding schemes. Unfortunately, the system relies heavily on the output of traditional methods to improve performance: its specific practice is to use traditional segmentation results on an external corpus as the training corpus for word embeddings, so the final results reported in the paper should be classified as open tests. As a semi-CRF model in its pure neural version, the system performs worse in the closed-test sense than the traditional semi-CRF (e.g., Andrew (2006)); see the comparison in Table 6.

Cai & Zhao (2016) abandoned the sliding window altogether and proposed a method that directly models the segmented sentence in order to capture the complete segmentation history. They proposed a neural segmentation model in the spirit of Zhang & Clark (2007), absorbing useful experience from earlier work, such as gated network structures (Figure 3). Because it covers an unprecedented range of features, the model achieved segmentation performance close to that of traditional models in the closed-test sense. In summary, the method uses a combinatorial neural network with an adaptive gate structure to generate word vectors from the character vectors of each word, and an LSTM-based scoring model to score the resulting word vector sequence. The method models the segmentation structure directly and can exploit information at all three levels of characters, words, and sentences; it is the first method able to capture the complete segmentation and input history. Compared with previous methods, whether traditional or deep learning, this model expands the feature window on which segmentation actions depend to the greatest possible extent (see Table 5). The proposed framework can be divided into three components: a gated combination neural network (GCNN, see the left of Figure 4); an evaluation network that scores candidate segmentations (i.e., word sequences); and a search algorithm that finds the segmentation with the highest score. The first module can be viewed as simulating the Chinese word-formation process, which is significant for recognizing out-of-vocabulary words; the second scores the fluency and plausibility of a segmentation from the perspective of the whole sentence, making maximal use of segmentation context; the third finds the most likely segmentation in the exponentially large segmentation space.

Figure 3: Model block diagram of Cai & Zhao (2016)
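To give a feel for the gate-combination idea used in such models, here is a much-simplified sketch (ours, not the authors' exact GCNN) that merges two character or word vectors into one using GRU-style reset and update gates:

```python
import torch
import torch.nn as nn

class GatedCombination(nn.Module):
    """Merge two child vectors into one parent vector with reset/update gates,
    a simplified stand-in for the gated combination used in neural segmenters."""
    def __init__(self, dim):
        super().__init__()
        self.reset = nn.Linear(2 * dim, 2 * dim)
        self.update = nn.Linear(2 * dim, 3 * dim)   # weights over left/right/new
        self.candidate = nn.Linear(2 * dim, dim)

    def forward(self, left, right):                 # each: (batch, dim)
        pair = torch.cat([left, right], dim=-1)
        r = torch.sigmoid(self.reset(pair))         # gate the children
        new = torch.tanh(self.candidate(r * pair))  # candidate parent vector
        z = torch.softmax(self.update(pair).view(left.size(0), 3, -1), dim=1)
        stacked = torch.stack([left, right, new], dim=1)   # (batch, 3, dim)
        return (z * stacked).sum(dim=1)              # convex mix -> (batch, dim)

gcnn = GatedCombination(dim=50)
parent = gcnn(torch.randn(4, 50), torch.randn(4, 50))
print(parent.shape)   # torch.Size([4, 50])
```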

Table 6 lists the segmentation performance of the main segmentation systems of the last 10 years on the SIGHAN Bakeoff-2005 corpora. Neural segmentation systems have made great strides in just a few years, but overall they still do not surpass the traditional models. Moreover, although neural methods have clear advantages in reducing knowledge dependence and feature engineering, and have made some progress there, the computational complexity of the models has increased greatly, because successful neural segmenters are often built on increasingly refined and complex network structures. In fact, after five years, deep learning methods show no significant advantage over traditional methods in final model performance, whether measured by segmentation accuracy or by computational efficiency.

Building on Cai & Zhao (2016), Cai et al. (2017) designed a fast segmentation system based on greedy search by simplifying the network structure, mixing character and word input, and adopting training strategies with better convergence such as early update (see the right of Figure 4). Compared with earlier deep learning algorithms, this algorithm not only achieves a huge improvement in speed but also some improvement in segmentation accuracy. The experimental results also show that word-level information is more effective for machine learning than character-level information, but relying only on word-level information inevitably weakens the generalization ability of deep learning models in unfamiliar settings. Table 7 lists the speed-related results of neural segmentation systems over the last 3 years, from which it can be seen that Cai et al. (2017) for the first time made the neural approach comparable to traditional methods in both performance and efficiency.

Figure 4: The gated combination network modules of Cai & Zhao (2016) (left) and Cai et al. (2017) (right)
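A highly simplified sketch of greedy word-level decoding of the kind described above (our illustration; `score_word` is a hypothetical stand-in for the learned scoring networks):

```python
def greedy_segment(sentence, score_word, max_word_len=4):
    """At each position, greedily commit to the candidate word (up to
    max_word_len characters) with the highest model score."""
    words, i = [], 0
    while i < len(sentence):
        candidates = [sentence[i:i + k]
                      for k in range(1, min(max_word_len, len(sentence) - i) + 1)]
        best = max(candidates, key=lambda w: score_word(w, words))
        words.append(best)
        i += len(best)
    return words

# Toy scorer standing in for the neural scoring model: prefer known words.
known = {"研究", "生命", "起源"}
toy_score = lambda w, history: (2.0 if w in known else 0.0) + 0.1 * len(w)
print(greedy_segment("研究生命起源", toy_score))   # -> ['研究', '生命', '起源']
```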

4 Closed and open tests

The SIGHAN Bakeoff segmentation evaluations define strict closed-test conditions, requiring that no language resources beyond the given training set be used; otherwise the results are classified as open tests. One of the main purposes of distinguishing closed from open tests is to separate performance gains that come from improvements of the machine learning model itself from gains that come from anything else.

Whether for traditional or deep learning models, the optional external resources for segmentation include various dictionaries and additional segmented corpora (not necessarily annotated under the same segmentation standard as the corpus at hand). External resources can be introduced in the form of additional tag features; early open-test systems that did so include Low et al. (2005). Zhao et al. (2010a) systematically examined a variety of external resources, including dictionaries, named entity recognizers, and segmenters trained on other corpora, all used under the character tagging model in the form of additional tag features: auxiliary tag features produced by other segmenters (or named entity recognizers) are added to the main segmenter. The results show that this strategy significantly improves performance on all segmentation corpora; in particular, on the two simplified-Chinese corpora of SIGHAN Bakeoff-2006 it brings an additional gain of about two percentage points. The results in Table 6 show that the open-test results reported by Zhao et al. (2010a) remain the highest segmentation performance reported to date. That set of results was actually obtained on the Bakeoff-2006 corpora, so results on PKU are missing, and the additional resources used were all other public Bakeoff corpora. Finally, the same work also suggested empirically that if the available additional segmented corpora could be expanded without limit, segmentation accuracy could also grow without limit, although at the cost of a dramatic drop in segmentation speed.

Deep learning models based on embedded representations pose new challenges to the closed/open test distinction. Externally pre-trained character or word embeddings are clearly a use of external resources: character embedding pre-training directly borrows external unlabeled corpora, typically Wikipedia data, while word embedding pre-training further requires a traditional segmentation model to pre-segment the external corpus, which simultaneously introduces external resource knowledge and implicitly integrates the output of a traditional segmenter. However, a considerable part of the neural segmentation literature, intentionally or not, ignores these distinctions, in effect confusing open and closed tests, not to mention that many neural systems even use additional dictionary features to boost performance. These practices seriously interfere with analyzing and evaluating current neural segmentation models: do the performance gains these models claim come from the newly introduced deep learning model, or from the external resources quietly introduced? From the matched open and closed test results of neural segmenters in Table 6, one can see that most neural systems need the one-to-two percentage point improvement gained from external auxiliary information (which already places them in the open-test category) before they can compete with traditional models evaluated under strictly closed tests. If the performance contributions of all extra pre-trained character or word embeddings, extra dictionary features, and implicitly integrated traditional segmenters are strictly stripped away, a fair conclusion is that up to the end of 2016 no neural segmenter, run on its own, could match the traditional systems in performance, let alone efficiency.

Note: The upper part of the table shows traditional methods and the lower part deep learning methods. A double asterisk (**) marks results rerun by Pei et al. (2014); a single asterisk (*) marks results rerun by Cai & Zhao (2016); † marks systems that use or may use pre-trained character or word embeddings; ‡ marks systems that rely on traditional models (embeddings pre-trained on traditional segmentation results over large-scale unlabeled corpora); results in parentheses (...) use an idiom dictionary.

Note: Data marked with an asterisk (*) come from results rerun by Cai & Zhao (2016). This table lists the results of the stand-alone neural network models of Zhang et al. (2016) and Liu et al. (2016). Note that the embeddings used by most deep learning methods can be pre-trained in advance on large-scale unlabeled corpora; strictly speaking, such results should be attributed to the SIGHAN Bakeoff open-test category.

Yang et al. (2017) specifically investigated the impact of external resources on Chinese word segmentation, including pre-trained character/word embeddings, punctuation, automatically segmented text, part-of-speech tagging, and so on. They treat each external resource as an auxiliary classification task and use multi-task neural learning to pre-train a set of shared parameters for modeling Chinese context. Extensive experiments show that external resources are likewise important for improving the performance of neural models.

If we quantify, or at least approximate, the contribution of external resources, can we relate the corpus size available to machine learning to the growth of learning performance? Empirical work on this question was in fact completed in Zhao et al. (2010b). Its basic conclusion is that the relationship between the segmentation accuracy of a statistical machine learning system and the size of its training corpus generally follows a Zipf-like law: as the corpus size grows exponentially, performance grows only linearly. Unlike statistical segmenters, the performance of more traditional rule-based segmenters such as maximum matching grows roughly linearly with the size of the dictionary used (i.e., the vocabulary covered), because their segmentation errors are mainly caused by out-of-vocabulary words. This conclusion implies that statistical methods, whether traditional character tagging or modern neural models, still have enormous room to grow.
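One way to write the empirical trend claimed above (our paraphrase; a and b are corpus-dependent constants, not values from the paper) is:

```latex
% Segmentation F-score vs. training corpus size |C| for a statistical segmenter:
F(|C|) \approx a + b \log |C|
% i.e., multiplying the corpus size by a constant factor adds a roughly constant
% increment to F, so exponential corpus growth yields only linear gains.
```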

5 Conclusion

Regarding machine learning methods for Chinese word segmentation, there has long been a dispute over whether character features or word features are superior, echoing the linguistic dispute over whether Chinese structural analysis should be "character-based" or "word-based". As early as Huang and Zhao (2006), an empirical observation was given: character and word feature learning need to be expressed in a balanced way in a segmentation system in order to obtain the best performance. In fact, the core of the character-versus-word dispute corresponds to the two evaluation aspects of word segmentation: the recognition accuracy of known words (in-vocabulary words, i.e., words appearing in the segmented training corpus) and the recognition accuracy of out-of-vocabulary words. The former is high and relatively easy to achieve, and such words account for a large proportion of the text; the latter is low and difficult to achieve, but such words account for a small proportion. Empirical results show that emphasizing character-based features and representations leads to better recognition of out-of-vocabulary words, for a simple reason: such words never appear in the training set and can only be recognized by the model through creative combinations of characters. Conversely, systems that emphasize word features, including word-based segmentation systems, are usually slightly weaker on out-of-vocabulary words. The best segmentation systems always need to strike a reasonable balance between character and word representations. The improvements in two recent works support this conclusion. A key improvement of Cai et al. (2017) over Cai & Zhao (2016) is that word vectors are no longer always composed from character vectors by the neural network; instead, a dual strategy is adopted: low-frequency and unknown words continue to be composed from character vectors, while high-frequency words in the training set (which can be regarded as more stable known words) are given direct word embeddings. When the system thus shifts from a purely character-derived word representation to a balanced character-word representation, it indeed brings additional performance improvement.

Over the last five years, word segmentation based on neural network models has produced a series of results. As things stand, we can draw two basic conclusions. First, the performance of neural segmenters is only roughly equivalent to that of traditional segmentation systems, if not slightly worse. Second, a considerable part of the performance improvements reported by neural segmentation systems comes (we cautiously speculate) from external language resources additionally introduced through character or word embeddings, rather than from the model itself or from the embedded representation as such. If embedded representations do carry deep syntactic and semantic information, then this conclusion seems to imply an inference: segmentation learning is a task that can be done well without much syntactic or semantic information.

Neural networks in the modern deep learning sense belong to the connectionist school of artificial intelligence. Because they possess an inherent internal topological structure, if the inefficiency of their training computation can be overcome, they should be an ideal way to model structured learning itself, and hence natural language processing tasks. This is the main reason why, in the deep learning era, we have seen more diverse structured modeling approaches for the Chinese word segmentation task. If the balance between character and word representations can be handled effectively, we cannot rule out that deep-learning-based segmentation systems will have further room to grow in the future.