What is named entity recognition?
Named Entity Recognition (NER), also known as "proper name recognition", refers to identifying entities with a specific meaning in text, mainly person names, place names, organization names, and other proper nouns. Put simply, the task is to identify both the boundaries and the categories of the entities mentioned in natural-language text.
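To make "boundaries and categories" concrete, here is a minimal sketch that assumes the common BIO tagging scheme (B- opens an entity, I- continues it, O is outside any entity). The function name bio_to_spans, the example sentence, and the tag set are illustrative assumptions, not something taken from a particular toolkit.

```python
from typing import List, Tuple

def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str, int, int]]:
    """Turn BIO tags into (entity_text, category, start, end) tuples, end exclusive."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that runs to the end of the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label is not None and tag[2:] == label
        if continues:
            continue
        if label is not None:
            # The current entity ends here: record its boundary and category.
            spans.append((" ".join(tokens[start:i]), label, start, i))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray "I-" tag is treated as the start of a new entity.
            start, label = i, tag[2:]
    return spans

tokens = ["Barack", "Obama", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
print(bio_to_spans(tokens, tags))
# [('Barack Obama', 'PER', 0, 2), ('New York', 'LOC', 3, 5)]
```

Each tuple carries the two things NER has to produce: where the entity starts and ends, and which category it belongs to.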
Development history of named entity recognition
NER has long been a research hotspot in NLP. Research has moved from early dictionary-based and rule-based methods, to traditional machine learning methods, to the deep learning-based methods of recent years. The general progression is roughly as follows.
Stage 1: Early methods, such as rule-based and dictionary-based methods
Stage 2: Traditional machine learning, such as HMM, MEMM, and CRF
Stage 3: Deep learning methods, such as LSTM+CRF and BiLSTM+CRF
Stage 4: Recently emerging methods, such as attention models, transfer learning, and semi-supervised learning
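As a rough illustration of the earliest stage, the sketch below shows dictionary-based recognition using forward longest-match lookup. The gazetteer entries and the function name dictionary_ner are hypothetical and chosen only for the example; real systems combine much larger dictionaries with hand-written rules.

```python
from typing import Dict, List, Tuple

# Hypothetical toy gazetteer: token sequence -> entity category.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("New", "York"): "LOC",
    ("New", "York", "University"): "ORG",
    ("Alice",): "PER",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def dictionary_ner(tokens: List[str]) -> List[Tuple[str, str]]:
    """Scan left to right, always preferring the longest dictionary match."""
    entities = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(key), GAZETTEER[key]))
                i += n  # skip past the matched entity
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return entities

print(dictionary_ner("Alice studies at New York University in New York".split()))
# [('Alice', 'PER'), ('New York University', 'ORG'), ('New York', 'LOC')]
```

The longest-match preference is what lets "New York University" win over "New York". The approach cannot recognize anything missing from the dictionary, which is one reason statistical methods took over.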
Four common categories of methods
Early named entity recognition methods were basically rule-based. Later, once statistical methods based on large-scale corpora had achieved good results across natural language processing, many machine learning methods were also applied to the named entity recognition task. In the book Statistical Natural Language Processing, Zong Chengqing roughly divides these machine learning-based named entity recognition methods into the following categories:
Supervised learning methods: These methods train the model parameters on large-scale annotated corpora. Commonly used models include hidden Markov models, language models, maximum entropy models, support vector machines, decision trees, and conditional random fields. Among these, the conditional random field has been the most successful approach to named entity recognition.
Semi-supervised learning methods: These methods start from a small labeled data set (seed data) and learn by bootstrapping.
Unsupervised learning methods: These methods cluster words by context, drawing on lexical resources such as WordNet.
Hybrid methods: These methods combine several models, or combine statistical methods with manually compiled knowledge bases.
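The bootstrapping idea behind the semi-supervised category can be sketched as follows. This is a deliberately simplified toy, not the procedure from any particular paper: the corpus, the seed set, the use of the preceding word as the only "pattern", and the capitalization check are all assumptions made for illustration.

```python
from typing import List, Set

def bootstrap(sentences: List[List[str]], seeds: Set[str], rounds: int = 2) -> Set[str]:
    """Grow a set of entities from a few seeds by alternating two steps."""
    entities = set(seeds)
    for _ in range(rounds):
        # Step 1: learn context patterns (here: the word right before a known entity).
        patterns = set()
        for sent in sentences:
            for i, tok in enumerate(sent):
                if tok in entities and i > 0:
                    patterns.add(sent[i - 1])
        # Step 2: accept capitalized tokens that follow a learned pattern.
        for sent in sentences:
            for i in range(1, len(sent)):
                if sent[i - 1] in patterns and sent[i][0].isupper():
                    entities.add(sent[i])
    return entities

corpus = [
    "She lives in Paris".split(),
    "He lives in Berlin".split(),
    "They flew to Berlin".split(),
    "We flew to Tokyo".split(),
    "I like tea".split(),
]
print(sorted(bootstrap(corpus, {"Paris"})))
# ['Berlin', 'Paris', 'Tokyo']
```

Starting from the single seed "Paris", the first round learns the pattern "in" and finds "Berlin"; the second round learns "to" from "Berlin" and finds "Tokyo". Practical systems also score patterns and candidates so that a few bad matches do not snowball.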
It is worth mentioning that, thanks to the widespread application of deep learning in natural language processing, deep learning-based named entity recognition methods have also shown good results. These methods generally treat named entity recognition as a sequence labeling task; the classic architectures are LSTM+CRF and BiLSTM+CRF.
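To illustrate what the CRF part contributes in such a sequence labeling setup, the sketch below implements only the decoding step (Viterbi) over per-token emission scores and label-to-label transition scores. In a BiLSTM+CRF model the emission scores would come from the BiLSTM and the transition scores would be learned; here both are hand-picked, hypothetical numbers, and no training is shown.

```python
from typing import Dict, List

def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            labels: List[str]) -> List[str]:
    """Return the label sequence with the highest total emission + transition score."""
    # best[t][y]: score of the best path that ends with label y at position t.
    best = [{y: emissions[0][y] for y in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

labels = ["O", "B-PER", "I-PER"]
# Hypothetical scores for the tokens "Alice", "Smith", "spoke".
emissions = [
    {"O": 1.0, "B-PER": 0.9, "I-PER": 0.2},
    {"O": 0.5, "B-PER": 0.1, "I-PER": 2.0},
    {"O": 2.0, "B-PER": 0.1, "I-PER": 0.3},
]
transitions = {
    "O":     {"O": 0.5, "B-PER": 0.0,  "I-PER": -10.0},  # "I-" right after "O" is penalized
    "B-PER": {"O": 0.0, "B-PER": -1.0, "I-PER": 1.0},
    "I-PER": {"O": 0.5, "B-PER": -1.0, "I-PER": 0.5},
}
print(viterbi(emissions, transitions, labels))
# ['B-PER', 'I-PER', 'O']
```

Picking the best label for each token independently would give O, I-PER, O, which is not a valid BIO sequence. Scoring whole sequences with transition scores is what steers the decoder to B-PER, I-PER, O instead.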
NER-related data sets
|Data set||Brief description||Address|
|Electronic medical records||Data from the CCKS2017 open evaluation task on Chinese electronic medical records||Evaluation 1 | Evaluation 2|
|Music domain||CCKS2018 open entity recognition task in the music domain||CCKS|
|Location, organization, person, ...||An excerpt from the GMB corpus, used to train classifiers that predict named entities such as names and locations||Kaggle|
|Spoken language||NLPCC2018 open evaluation task on spoken language understanding for task-oriented dialogue systems||NLPCC|
|Person name, place name, organization, proper noun||A data set provided by a company, covering person names, place names, organization names, and proper nouns||Boson|
Recommended tools
|Tool||Brief description||Address|
|Stanford NER||A named entity recognition system based on conditional random fields, developed by Stanford University. Its model parameters are trained on the CoNLL, MUC-6, MUC-7, and ACE named entity corpora.||Official website | GitHub address|
|MALLET||An open source package for statistical natural language processing developed by the University of Massachusetts. Its sequence tagging tools can be applied to named entity recognition.||Official website|
|HanLP||HanLP is an NLP toolkit made up of a series of models and algorithms. Its development is led by Dakuai Search (大快搜索) and it is completely open source. Its goal is to popularize the use of natural language processing in production environments. It supports named entity recognition.||Official website | GitHub address|
|NLTK||NLTK is a leading platform for building Python programs that work with human language data.||Official website | GitHub address|
|spaCy||An industrial-strength natural language processing library. At the time of writing, it did not support Chinese.||Official website | GitHub address|
|CRFsuite||Lets you load your own data set to train a CRF entity recognition model.||Documentation | GitHub address|
This article is reproduced from the WeChat public account AI Xiaobai. Original address