Understanding text mining in one article

The Internet holds a vast amount of digital text, and text mining lets us extract a lot of valuable information from it.

This article explains what text mining is, the steps it involves, and the methods commonly used.

To learn more about NLP, visit the NLP topic, where a 59-page NLP document is available as a free download.


What is text mining?

During the Spring Festival, the number of people buying train and air tickets out of first-tier cities rises sharply. This is data.

After matching the ID information of these travelers, it turns out that they are all returning from first-tier cities to their hometowns. This is information.

Going back to one's hometown to reunite with family and spend the Spring Festival together is a Chinese custom. This is knowledge.

The example above is obvious, but in real business there is a lot of information that is far less obvious, for example:

  • Traffic rises or falls in a regular pattern every weekend. Why?
  • During the National Day holiday, the share of purchases made on an iPad is higher than usual. Why at this time?
  • ……

The purpose of text mining is to find valuable information in data in order to discover or solve practical problems.


5 steps of text mining

Text mining is roughly divided into the following five important steps.

  1. Data collection
  2. Text preprocessing
  3. Data mining and visualization
  4. Building a model
  5. Model evaluation
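
As a rough illustration of steps 2 and 3, the sketch below preprocesses a few made-up sentences and counts term frequencies. The stop-word list, the sample sentences, and the function names preprocess and top_terms are assumptions made for this example; they do not come from any particular library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def top_terms(documents, n=3):
    """Count token frequencies across all documents and return the n most common."""
    counts = Counter()
    for doc in documents:
        counts.update(preprocess(doc))
    return counts.most_common(n)


# Hypothetical sample data standing in for the "data collection" step.
docs = [
    "The train tickets are sold out.",
    "Train tickets to the city are cheap.",
    "It is hard to buy tickets in the holiday.",
]
print(top_terms(docs, n=2))
```

In a real project the frequency table would feed a chart or a model; here it only shows how raw text becomes countable data.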


7 methods of text mining

Keyword extraction: Analyzes a long text and outputs keywords that capture its key information.
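
One common way to score keywords is TF-IDF, which favors terms that are frequent in one document but rare across the collection. The sketch below is a minimal standard-library version, assuming whitespace-like tokenization and a toy corpus; tfidf_keywords is a hypothetical helper name, and TF-IDF is only one of several possible keyword-extraction techniques.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(documents, doc_index, k=3):
    """Return the k highest-scoring TF-IDF terms of documents[doc_index]."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    target = tokenized[doc_index]
    tf = Counter(target)
    scores = {
        term: (count / len(target)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Sort by score (descending), breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus, made up for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(tfidf_keywords(docs, 0, k=1))
```

A word such as "the" appears in every document, so its inverse document frequency is zero and it never ranks as a keyword.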

Text summarization: Many text mining applications need to summarize documents in order to give a concise overview of a large document or of a collection of documents on one subject.
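
A very simple form of summarization is extractive: pick the sentences whose words are most frequent in the whole text. The sketch below assumes English punctuation for sentence splitting and uses average word frequency as the score; summarize is a hypothetical function name, and production systems use far more sophisticated methods.

```python
import re
from collections import Counter


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        # Average frequency of the sentence's words across the whole text.
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)


# Made-up input text for illustration.
text = (
    "Text mining finds information in text. "
    "The weather was nice. "
    "Text mining uses text data."
)
print(summarize(text, n_sentences=1))
```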

Clustering: Clustering is a technique for uncovering hidden structure in unlabeled text. Common approaches include K-means clustering and hierarchical clustering. For more, see Unsupervised learning.
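
To make the idea concrete, here is a minimal K-means sketch over bag-of-words vectors, written with the standard library only. The documents are made up, and the initial centroids are simply the first k documents so the result is reproducible; real implementations usually pick initial centroids more carefully.

```python
def bag_of_words(documents):
    """Turn each document into a vector of word counts over a shared vocabulary."""
    vocab = sorted({w for d in documents for w in d.lower().split()})
    return [[d.lower().split().count(w) for w in vocab] for d in documents]


def kmeans(points, k, iterations=10):
    """Plain K-means; initial centroids are the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical unlabeled documents.
docs = ["cheap flight tickets", "dog food", "flight tickets sale", "dog toys food"]
labels = kmeans(bag_of_words(docs), k=2)
print(labels)
```

The travel-related documents end up in one cluster and the pet-related ones in the other, without any labels being supplied.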

Text classification: Text classification is a supervised machine learning method that predicts the category of previously unseen text.
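
As one possible example of a supervised text classifier, the sketch below implements a multinomial Naive Bayes model with add-one smoothing. The class name NaiveBayesTextClassifier, the training sentences, and the two categories are all assumptions made for illustration; Naive Bayes is just one of many algorithms that could be used here.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up labeled training data.
train_texts = [
    "the team won the match",
    "new phone has a fast chip",
    "the player scored a goal",
    "the laptop chip is fast",
]
train_labels = ["sports", "tech", "sports", "tech"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(model.predict("fast phone"))
print(model.predict("team scored"))
```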

LDA topic model: LDA (Latent Dirichlet Allocation) is a generative topic model for documents, also known as a three-layer Bayesian probability model, with a three-layer structure of words, topics, and documents.
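
The three-layer structure can be written down as the usual generative process of (smoothed) LDA. The notation is introduced here and is not from the original text: K topics, D documents, N_d words in document d, Dirichlet hyperparameters \alpha and \beta, per-document topic mixture \theta_d, per-topic word distribution \varphi_k, topic assignment z_{d,n}, and observed word w_{d,n}.

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dirichlet}(\beta),  & k &= 1,\dots,K \\
\theta_d  &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n}   &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n}   &\sim \mathrm{Categorical}(\varphi_{z_{d,n}})
\end{align*}
```

Read top to bottom, each topic is a distribution over words, each document is a distribution over topics, and each word is drawn by first picking a topic and then picking a word from that topic.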

Opinion extraction: Analyzes text (mainly reviews), extracts the core opinions, and determines their polarity (positive or negative). It is mainly used to analyze reviews of e-commerce products, food, hotels, cars, and so on.

Sentiment analysis: Judges the sentiment orientation of a text, classifying it as positive, negative, or neutral. It is used for word-of-mouth analysis, topic monitoring, and public opinion analysis.
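
The simplest way to get the three-way split is a lexicon lookup: count positive and negative words and compare. The sketch below assumes two tiny hand-made word lists and ignores negation, sarcasm, and context, so it should be read as an illustration of the idea rather than a usable analyzer.

```python
import re

# Tiny hand-made lexicons (assumptions for this example only).
POSITIVE = {"good", "great", "excellent", "love", "friendly"}
NEGATIVE = {"bad", "terrible", "poor", "hate", "rude"}


def sentiment(text):
    """Classify text as positive, negative, or neutral by counting lexicon hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("The staff were friendly and the food was great"))
print(sentiment("Terrible room, rude staff"))
print(sentiment("The hotel is near the station"))
```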


Wikipedia version

Text mining, also known as text data mining, is roughly equivalent to text analysis: the process of deriving high-quality information from text. High-quality information is usually obtained by identifying patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing it, adding some derived linguistic features and removing others, and then inserting the result into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. "High quality" in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text classification, text clustering, concept/entity extraction, granular taxonomy generation, sentiment analysis, document summarization, and entity relationship modeling (i.e., learning relationships between named entities).

Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is to turn text into data for analysis by applying natural language processing (NLP) and analytical methods. A typical application is to scan a set of documents written in natural language and either model the document set for predictive classification purposes or populate a database or search index with the extracted information.
