There is a vast amount of digitized text on the web, and text mining lets us extract valuable information from it.
This article explains what text mining is, its processing steps, and its common methods.
What is text mining?
During the Spring Festival, the number of people buying train and air tickets out of first-tier cities rises sharply - this is data.
Matching these purchases against ID card records shows that these people are returning to their hometowns from the first-tier cities - this is information.
Returning home to reunite with family for the Spring Festival is a Chinese custom - this is knowledge.
The example above is obvious, but real business data contains a lot of information that is far less obvious, for example:
- Traffic rises and falls in a regular pattern every weekend. Why?
- During the National Day holiday, the share of purchases made on an iPad is higher than usual. Why?
The purpose of text mining is to find valuable information in such data in order to identify or solve practical problems.
5 steps of text mining
Text mining can be roughly divided into the following 5 steps:
- Data collection
- Text preprocessing
- Data mining and visualization
- Model building
- Model evaluation
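The text preprocessing step can be sketched as follows. This is a minimal illustration, not a production pipeline: the tokenizer is a simple regular expression and the stop-word list is a tiny made-up sample.

```python
import re

# A tiny illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The holiday traffic is rising, and sales of iPads are up."))
# → ['holiday', 'traffic', 'rising', 'sales', 'ipads', 'up']
```

The cleaned token lists produced here are the usual input to the later mining and modeling steps.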
7 text mining methods
Keyword extraction: Analyze the content of a long text and output the keywords that capture its key information.
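One common way to extract keywords is TF-IDF: words that are frequent in one document but rare across the corpus score highest. A hand-rolled sketch on a made-up three-document corpus (the documents and scoring details are illustrative):

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the words of one document by TF-IDF against a small corpus."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for doc in tokenized for w in set(doc))
    tf = Counter(tokenized[doc_index])
    scores = {w: (c / len(tokenized[doc_index])) * math.log(n_docs / df[w])
              for w, c in tf.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_k]]

docs = [
    "train tickets sold out before the holiday",
    "air tickets and train tickets are data",
    "ipad shopping rises during the holiday",
]
print(tfidf_keywords(docs, 2))
```

Words shared across documents ("the", "holiday") get a low inverse-document-frequency weight, so document-specific words like "ipad" surface as keywords.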
Text summarization: Many text mining applications need to summarize documents, producing a brief overview of a large document or of a collection of documents on a topic.
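A simple extractive approach to summarization scores each sentence by the average corpus frequency of its words and keeps the top ones. This is a toy sketch with an invented example text, not a full summarizer:

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Extractive summary: keep the sentences whose words are most frequent overall."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[w] for w in words) / (len(words) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in ranked)

text = ("Text mining finds value in text. "
        "Mining text means mining value. "
        "Cats sleep.")
print(summarize(text))  # → Mining text means mining value.
```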
Clustering: Clustering is a technique for discovering hidden structure in unlabeled text. Common approaches include K-means clustering and hierarchical clustering; both are forms of unsupervised learning.
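K-means can be sketched in a few lines. In practice the points would be TF-IDF vectors of documents; here they are made-up 2-D points forming two obvious groups:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on numeric vectors (toy stand-in for TF-IDF document vectors)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster; keep old center if empty.
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

# Two obvious groups: points near (0, 0) and points near (10, 10).
points = [(0.0, 0.1), (0.2, 0.0), (10.0, 10.1), (9.8, 10.0)]
clusters = kmeans(points, 2)
```

With well-separated data like this, the algorithm recovers the two groups regardless of which points seed the centers.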
Text classification: Text classification uses supervised machine learning to predict the category of previously unseen documents.
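A classic baseline for supervised text classification is multinomial Naive Bayes. A from-scratch sketch with add-one smoothing, trained on four invented review snippets:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing, sized for tiny demos."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # per-label word counts
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        def log_prob(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(self.label_counts[label] / sum(self.label_counts.values()))
            for w in text.lower().split():
                score += math.log((counts[w] + 1) / (total + len(self.vocab)))
            return score
        return max(self.label_counts, key=log_prob)

clf = NaiveBayes().fit(
    ["great hotel clean room", "lovely food great service",
     "dirty room bad food", "terrible service bad smell"],
    ["pos", "pos", "neg", "neg"],
)
print(clf.predict("great clean service"))  # → pos
```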
Text topic model LDA: LDA (Latent Dirichlet Allocation) is a generative topic model for documents, also known as a three-layer Bayesian probability model, with a three-layer structure of words, topics, and documents.
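The three-layer structure can be made concrete by sampling from LDA's generative story: each document draws a topic mixture from a Dirichlet, each word draws a topic from that mixture, then a word from that topic. The two topics and their word distributions below are invented for illustration; fitting real LDA requires inference (e.g. Gibbs sampling), which is not shown.

```python
import random

def dirichlet(alpha, rng):
    """Sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Two hypothetical topics, each a distribution over a tiny vocabulary.
topics = [
    {"train": 0.5, "ticket": 0.4, "holiday": 0.1},   # a "travel" topic
    {"ipad": 0.5, "shopping": 0.4, "holiday": 0.1},  # a "shopping" topic
]

def generate_document(n_words, rng):
    """LDA's generative story: document -> topic mixture -> topic -> word."""
    theta = dirichlet([0.5] * len(topics), rng)  # document-topic distribution
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])     # pick a word
    return words

rng = random.Random(42)
doc = generate_document(8, rng)
print(doc)
```

LDA inference runs this story in reverse: given only the documents, it recovers the topic-word and document-topic distributions.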
Opinion extraction: Analyze text (mainly reviews), extract the core viewpoints, and judge their polarity (positive or negative); mainly used to analyze reviews of e-commerce products, food, hotels, cars, and the like.
Sentiment analysis: Judge the emotional tendency of a text, classifying it as positive, negative, or neutral. Used for reputation analysis, topic monitoring, and public opinion analysis.
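The simplest form of sentiment analysis sums per-word polarity scores from a lexicon. The tiny lexicon below is invented; real systems use lexicons with thousands of scored words or trained classifiers:

```python
# A tiny illustrative sentiment lexicon; positive words score > 0, negative < 0.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

def sentiment(text):
    """Classify text as positive / negative / neutral by summing word scores."""
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("great room but terrible service"))  # → neutral
```

Mixed reviews like the example cancel out to neutral, which is exactly the kind of case where the more nuanced opinion extraction above becomes useful.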
Text mining, also known as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically obtained by devising patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with adding some derived linguistic features, removing others, and inserting the result into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. "High quality" in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity-relation modeling (i.e., learning relations between named entities).
Text analysis involves information retrieval, lexical analysis to study word-frequency distributions, pattern recognition, tagging/annotation, information extraction, and data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is to turn text into data for analysis by applying natural language processing (NLP) and analytical methods. A typical application is to scan a set of documents written in natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted.