What is syntactic analysis?
Syntactic parsing is one of the key techniques in natural language processing. It analyzes an input sentence to obtain its syntactic structure. This analysis serves two purposes: it is an important part of language understanding in its own right, and it supports other natural language processing tasks. For example, syntax-driven statistical machine translation requires parsing the source language, the target language, or both.
Semantic analysis usually takes the output of syntactic parsing as its input in order to obtain more clues. Depending on how the syntactic structure is represented, the most common syntactic parsing tasks fall into the following three types:
Syntactic structure parsing, also known as phrase structure parsing or constituency parsing, identifies the phrases in a sentence and the hierarchical syntactic relationships between them.
Dependency syntactic parsing, or dependency parsing for short, identifies the dependency relationships between the words in a sentence.
Deep grammar parsing uses a deep grammar, such as Lexicalized Tree Adjoining Grammar (LTAG), Lexical Functional Grammar (LFG), or Combinatory Categorial Grammar (CCG), to perform deep syntactic and semantic analysis.
What is dependency syntax analysis?
Wikipedia describes it as follows: The dependency-based parse trees of dependency grammars see all nodes as terminal, which means they do not acknowledge the distinction between terminal and non-terminal categories. They are simpler on average than constituency-based parse trees because they contain fewer nodes.
Dependency grammar was first proposed by the French linguist L. Tesnière. It analyzes a sentence into a dependency tree that describes the dependency relationships between words, that is, the syntactic collocations between words, which are related to semantics.
In natural language processing, a framework that describes language structure through the dependencies between words is called dependency grammar, also known as affiliation grammar. Parsing with dependency grammar is one of the important techniques for natural language understanding.
Related important concepts
Dependency grammar treats the predicate verb as the center of a sentence; all other components are directly or indirectly related to that verb.
In dependency grammar theory, a “dependency” is a relationship between two words in which one governs and the other is governed. The relationship is asymmetric and directed. To be precise, the governing word is called the governor (regent, head), and the governed word is called the modifier (subordinate, dependent).
Dependency grammar itself does not prescribe a classification of dependencies, but in practice the edges of a dependency tree are usually labeled with different types to enrich the syntactic information the structure conveys.
Dependency grammar rests on a common basic assumption: syntactic structure essentially consists of dependency (modification) relationships between words. A dependency connects two words, the head word and the dependent word. Dependencies can be divided into different types, each representing a specific syntactic relationship between the two words.
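To make the concepts above concrete, here is a minimal sketch of how a labeled dependency tree might be stored. The `Arc` class, the artificial ROOT at index 0, the example sentence, and the labels `root`, `nsubj`, and `obj` are illustrative assumptions, not something the theory prescribes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Arc:
    head: int       # index of the governing word (0 = artificial ROOT, an assumption)
    dependent: int  # index of the governed word (1-based)
    label: str      # dependency type; the label names used below are illustrative


# "She reads books": the predicate verb "reads" is the center of the sentence.
words = ["ROOT", "She", "reads", "books"]
arcs = [
    Arc(head=0, dependent=2, label="root"),
    Arc(head=2, dependent=1, label="nsubj"),
    Arc(head=2, dependent=3, label="obj"),
]


def is_tree(n_words: int, arcs: List[Arc]) -> bool:
    """Return True if every word has exactly one head and reaches ROOT without a cycle."""
    heads = {}
    for arc in arcs:
        if arc.dependent in heads or not 0 <= arc.head <= n_words:
            return False  # a second head for the same word, or a head out of range
        heads[arc.dependent] = arc.head
    if set(heads) != set(range(1, n_words + 1)):
        return False  # some word has no head, or an arc points at an unknown word
    for word in range(1, n_words + 1):
        seen = set()
        node = word
        while node != 0:  # follow heads upward until ROOT
            if node in seen:
                return False  # revisiting a node means there is a cycle
            seen.add(node)
            node = heads[node]
    return True


def describe(words: List[str], arcs: List[Arc]) -> List[str]:
    """Render each directed, labeled dependency as 'head -label-> dependent'."""
    return [f"{words[a.head]} -{a.label}-> {words[a.dependent]}" for a in arcs]
```

Calling `describe(words, arcs)` lists each dependency from governor to modifier, which shows the direction and the type of every edge.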
Common methods
Rule-based methods: Early parsing methods based on dependency grammar mainly include dynamic programming algorithms similar to CYK, methods based on constraint satisfaction, and deterministic parsing strategies.
Statistical methods: A large body of excellent research has emerged in statistical natural language processing, including generative dependency parsing, discriminative dependency parsing, and deterministic dependency parsing. These are the most representative data-driven statistical dependency parsing methods.
Deep learning based methods: In recent years, deep learning has gradually become a research hotspot in syntactic parsing, with most work focusing on feature representation. Traditional methods mainly rely on manually defined atomic features and feature combinations, whereas deep learning encodes the atomic features (words, parts of speech, category labels) as vectors and extracts features with a multi-layer neural network.
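As an illustration of a deterministic parsing strategy, the sketch below applies a sequence of shift-reduce transitions in the style of the arc-standard system. The choice of arc-standard, the function name `parse`, and the transition names are assumptions made for this example. In a real parser a statistical or neural classifier would choose each transition; here the sequence is supplied by hand.

```python
from typing import List, Tuple


def parse(n_words: int, transitions: List[str]) -> List[Tuple[int, int]]:
    """Apply arc-standard style transitions and return (head, dependent) arcs.

    Words are numbered 1..n_words; index 0 is an artificial ROOT (an assumption).
    """
    stack = [0]                             # starts with ROOT only
    buffer = list(range(1, n_words + 1))    # words not yet processed
    arcs = []
    for t in transitions:
        if t == "SHIFT":                    # move the next word onto the stack
            stack.append(buffer.pop(0))
        elif t == "LEFT":                   # top of stack governs the item below it
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif t == "RIGHT":                  # item below the top governs the top
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {t}")
    return arcs


# "She reads books" (1 = She, 2 = reads, 3 = books)
example = parse(3, ["SHIFT", "SHIFT", "LEFT", "SHIFT", "RIGHT", "RIGHT"])
```

Each word is shifted once and attached once, so the parse takes a number of steps linear in the sentence length, which is the main appeal of deterministic strategies.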
Performance evaluation of dependency parsers
Commonly used metrics include the unlabeled attachment score (UAS), labeled attachment score (LAS), dependency accuracy (DA), root accuracy (RA), and complete match (CM). Their specific meanings are as follows:
- Unlabeled attachment score (UAS): the percentage of all words in the test set (including the root word, which has no head) that are assigned the correct head.
- Labeled attachment score (LAS): the percentage of all words in the test set (including the root word, which has no head) that are assigned both the correct head and the correct dependency type.
- Dependency accuracy (DA): the percentage of non-root words in the test set that are assigned the correct head.
- Root accuracy (RA): there are two definitions. One is the number of correct root nodes in the test set as a percentage of the number of sentences. The other is the percentage of sentences in the test set whose root node is found correctly.
- Complete match (CM): the percentage of sentences in the test set whose unlabeled dependency structure is completely correct.
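The metrics above can be sketched in a few lines. This is a minimal illustration rather than a reference scorer: the `(head, label)` token encoding, the convention that head 0 marks the root word, the function name `evaluate`, and the toy data are all assumptions. RA follows the second definition (sentences whose root is found correctly), and scores are returned as fractions rather than percentages.

```python
from typing import Dict, List, Tuple

Token = Tuple[int, str]     # (head index, dependency label); head 0 marks the root word
Sentence = List[Token]


def ratio(numerator: int, denominator: int) -> float:
    """Safe division so empty inputs do not raise."""
    return numerator / denominator if denominator else 0.0


def evaluate(gold: List[Sentence], pred: List[Sentence]) -> Dict[str, float]:
    total = head_ok = both_ok = 0       # counts over all words (UAS, LAS)
    nonroot = nonroot_ok = 0            # counts over non-root words (DA)
    root_ok = complete = 0              # counts over sentences (RA, CM)
    for g_sent, p_sent in zip(gold, pred):
        sent_root_ok = True
        sent_heads_ok = True
        for (g_head, g_label), (p_head, p_label) in zip(g_sent, p_sent):
            correct = g_head == p_head
            total += 1
            head_ok += correct
            both_ok += correct and g_label == p_label
            if g_head == 0:             # this word is the gold root
                sent_root_ok = sent_root_ok and correct
            else:
                nonroot += 1
                nonroot_ok += correct
            sent_heads_ok = sent_heads_ok and correct
        root_ok += sent_root_ok
        complete += sent_heads_ok       # unlabeled structure fully correct
    return {
        "UAS": ratio(head_ok, total),
        "LAS": ratio(both_ok, total),
        "DA": ratio(nonroot_ok, nonroot),
        "RA": ratio(root_ok, len(gold)),
        "CM": ratio(complete, len(gold)),
    }


# Toy data: sentence 1 has one wrong label, sentence 2 has both heads wrong.
gold = [[(2, "nsubj"), (0, "root"), (2, "obj")], [(0, "root"), (1, "obj")]]
pred = [[(2, "nsubj"), (0, "root"), (2, "nmod")], [(2, "nsubj"), (0, "root")]]
scores = evaluate(gold, pred)
```

On this toy data, three of five heads are correct and two of those also carry the correct label, so UAS is 3/5 and LAS is 2/5. Only the first sentence has a correct root and a fully correct unlabeled structure, so RA and CM are both 1/2.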
Related datasets
Penn Treebank: Penn Treebank is the name of a project whose purpose is to annotate a corpus, including part-of-speech tagging and syntactic annotation.
SemEval-2016 Task 9 Chinese semantic dependency graph data: http://ir.hit.edu.cn/2461.html
CoNLL regularly hosts academic evaluation tasks on syntactic parsing, such as:
- 2018: universal dependency parsing evaluation task
- 2009: multilingual joint evaluation task on syntactic dependencies and semantic roles
- 2008: English joint evaluation task on syntactic dependencies and semantic roles
- 2007: multilingual dependency parsing
Recommended tools
StanfordCoreNLP
Developed by Stanford University, it provides dependency parsing capabilities.
Github address | Official website
HanLP
HanLP is an NLP toolkit made up of a collection of models and algorithms. It provides Chinese dependency parsing.
Github address | Official website
spaCy
An industrial-grade natural language processing toolkit which, at the time of writing, did not support Chinese.
Github address | Official website
FudanNLP
A Chinese natural language processing toolkit developed by the Natural Language Processing Laboratory at Fudan University. It covers information retrieval (text classification, news clustering), Chinese processing (Chinese word segmentation, part-of-speech tagging, named entity recognition, keyword extraction, dependency parsing, time expression recognition), and structured learning (online learning, hierarchical classification, clustering).
Github address | Code upload address
This article is reposted from the public account AI Xiaobai (original address).