Want to learn more about NLP? Visit the NLP topic page, where a 59-page NLP document is available as a free download.


What is named entity recognition?

Named Entity Recognition (NER), also known as "proper name recognition", refers to identifying entities with specific meaning in text, mainly person names, place names, institution names, and other proper nouns. Put simply, the task is to identify the boundaries and categories of entity mentions in natural language text.
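To make "boundaries and categories" concrete, here is a minimal sketch that turns per-token tags into entity spans. It assumes the BIO tagging scheme (B- begins an entity, I- continues it, O is outside any entity), a common convention that this article does not itself name. The function name `bio_to_spans` and the sample sentence are illustrative only.

```python
from typing import List, Tuple

def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags into (start, end_exclusive, label) entity spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An entity continues only if the tag is I- with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                # Close the entity that was open up to this position.
                spans.append((start, i, label))
                start, label = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                # A stray I- tag is treated as the start of a new entity.
                start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans

tokens = ["Barack", "Obama", "visited", "New", "York", "yesterday"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
entities = [(" ".join(tokens[s:e]), label) for s, e, label in bio_to_spans(tags)]
print(entities)
```

Each span carries both pieces of information the definition asks for: the start and end offsets give the boundary, and the label gives the category.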

Development history of named entity recognition

NER has always been a research hotspot in the field of NLP. From early dictionary-based and rule-based methods, to traditional machine learning methods, to deep learning-based methods in recent years, the general trend of NER research progress is roughly shown in the following figure.

Figure: The development history of named entity recognition (NER)

Stage 1: Early methods, such as rule-based and dictionary-based methods

Stage 2: Traditional machine learning methods, such as HMM, MEMM, and CRF

Stage 3: Deep learning methods, such as RNN-CRF and CNN-CRF

Stage 4: Recently emerging methods, such as attention models, transfer learning, and semi-supervised learning
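As an illustration of the Stage 1 dictionary-based approach, the sketch below matches token sequences against a small entity dictionary (gazetteer), preferring the longest match. The gazetteer contents, the greedy longest-match strategy, and the name `dictionary_ner` are assumptions made for this example, not a description of any particular historical system.

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer: lowercase token tuples mapped to an entity label.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("new", "york"): "LOC",
    ("new", "york", "times"): "ORG",
    ("stanford", "university"): "ORG",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def dictionary_ner(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Greedy left-to-right longest match against the gazetteer."""
    lowered = [t.lower() for t in tokens]
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so "New York Times" beats "New York".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(lowered[i:i + n]))
            if label is not None:
                match = (i, i + n, label)
                break
        if match:
            spans.append(match)
            i = match[1]  # Skip past the matched entity.
        else:
            i += 1
    return spans

tokens = "She read the New York Times at Stanford University".split()
print(dictionary_ner(tokens))
```

A matcher like this cannot recognize entities missing from the dictionary or resolve ambiguous ones from context, which is one motivation for the statistical methods of Stage 2.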

 

Four common categories of implementation methods

Early named entity recognition methods were mostly rule-based. Later, as statistical methods based on large-scale corpora achieved good results across natural language processing, many machine learning methods were also applied to the named entity recognition task. In the book Statistical Natural Language Processing, Zong Chengqing roughly divides machine learning-based named entity recognition methods into the following categories:

Supervised learning methods: These methods train model parameters on large-scale annotated corpora. Commonly used models and methods include hidden Markov models, language models, maximum entropy models, support vector machines, decision trees, and conditional random fields. Notably, conditional random fields have been the most successful of these for named entity recognition.

Semi-supervised learning methods: These methods start from a small annotated data set (seed data) and learn by bootstrapping.

Unsupervised learning methods: These methods cluster entities by context, drawing on lexical resources such as WordNet.

Hybrid methods: These methods combine several models, or combine statistical methods with manually compiled knowledge bases.
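The bootstrapping idea behind the semi-supervised category can be sketched roughly as follows: starting from seed entities, collect the contexts they appear in, then treat other tokens found in those same contexts as new entities, and repeat. This toy version assumes single-token entities and uses the previous and next token as the context pattern; the function name `bootstrap` and the corpus are invented for illustration. Practical systems typically also score patterns and candidates to limit drift, which is omitted here.

```python
from typing import List, Set, Tuple

def bootstrap(sentences: List[List[str]], seeds: Set[str], rounds: int = 2) -> Set[str]:
    """Grow a set of single-token entities from seeds via shared contexts."""
    entities = set(seeds)
    for _ in range(rounds):
        # Step 1: collect (previous token, next token) contexts of known entities.
        patterns: Set[Tuple[str, str]] = set()
        for sent in sentences:
            padded = ["<s>"] + sent + ["</s>"]
            for i in range(1, len(padded) - 1):
                if padded[i] in entities:
                    patterns.add((padded[i - 1], padded[i + 1]))
        # Step 2: any token seen in one of those contexts becomes a candidate.
        found = set()
        for sent in sentences:
            padded = ["<s>"] + sent + ["</s>"]
            for i in range(1, len(padded) - 1):
                if (padded[i - 1], padded[i + 1]) in patterns:
                    found.add(padded[i])
        if found <= entities:
            break  # Nothing new was learned, so stop early.
        entities |= found
    return entities

corpus = [
    "I flew to Paris yesterday".split(),
    "I flew to Berlin yesterday".split(),
    "She lives in Berlin now".split(),
    "She lives in Tokyo now".split(),
    "He bought a car yesterday".split(),
]
print(sorted(bootstrap(corpus, {"Paris"})))
```

In this toy corpus the seed "Paris" yields the context ("to", "yesterday"), which picks up "Berlin"; "Berlin" in turn contributes the context ("in", "now"), which picks up "Tokyo" in the second round.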

It is also worth mentioning that, with the widespread use of deep learning in natural language processing, deep learning-based named entity recognition methods have also shown good results. These methods generally treat named entity recognition as a sequence labeling task; the classic architectures are LSTM+CRF and BiLSTM+CRF.
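To show what a CRF layer adds in the sequence labeling setting, here is a minimal sketch of Viterbi decoding, the standard way to find the best label sequence given per-token scores and label-to-label transition scores. In a BiLSTM+CRF model the per-token (emission) scores would come from the network and the transition scores would be learned; the numbers below are hand-picked assumptions, and the function name `viterbi` is illustrative.

```python
from typing import Dict, List

def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence under additive scores."""
    # best[t][y] = best score of any path ending in label y at position t.
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

labels = ["O", "B-PER", "I-PER"]
# Hand-picked scores for three tokens (assumed values, not model output).
emissions = [
    {"O": 1.0, "B-PER": 0.9, "I-PER": 0.0},
    {"O": 0.2, "B-PER": 0.3, "I-PER": 1.5},
    {"O": 2.0, "B-PER": 0.0, "I-PER": 0.0},
]
# All transitions are neutral except O -> I-PER, which is heavily penalized.
transitions = {p: {y: 0.0 for y in labels} for p in labels}
transitions["O"]["I-PER"] = -5.0

greedy = [max(e, key=e.get) for e in emissions]
decoded = viterbi(emissions, transitions, labels)
print(greedy)
print(decoded)
```

Picking the best label for each token independently gives O, I-PER, O, where an I-PER tag follows O. Scoring whole sequences with the transition penalty instead yields B-PER, I-PER, O, which is the kind of consistency the CRF layer is meant to provide.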

 

NER-related data sets

Data set | Brief description | Source
Electronic medical records | Data from the CCKS2017 open evaluation on Chinese electronic medical records | Evaluation 1, Evaluation 2
Music domain | CCKS2018 open entity recognition task in the music domain | CCKS
Location, organization, person... | An excerpt from the GMB corpus, used to train classifiers that predict named entities such as names and locations | Kaggle
Spoken language | NLPCC2018 open evaluation of spoken language understanding for task-oriented dialogue systems | NLPCC
Person names, place names, institutions, proper nouns | A data set provided by a company, covering person names, place names, institution names, and proper nouns | Boson

 

Recommended tools

Tool | Introduction | Links
Stanford NER | A named entity recognition system based on conditional random fields, developed at Stanford University. Its models are trained on the CoNLL, MUC-6, MUC-7, and ACE named entity corpora. | Official website, GitHub
MALLET | An open source statistical natural language processing package developed at the University of Massachusetts. Its sequence tagging tools can be applied to named entity recognition. | Official website
HanLP | An NLP toolkit made up of a collection of models and algorithms. Led by Dakuai Search and fully open source, it aims to popularize natural language processing in production environments. Supports named entity recognition. | Official website, GitHub
NLTK | An efficient platform, built in Python, for processing human natural language data. | Official website, GitHub
spaCy | An industrial-grade natural language processing toolkit. At the time of writing it did not support Chinese; more recent releases provide Chinese pipelines. | Official website, GitHub
CRFsuite | Lets you load your own data set to train a CRF entity recognition model. | Documentation, GitHub

This article is reproduced from the WeChat official account AI Xiaobai. Original address