Introduction

Understanding concepts is a cornerstone of how humans make sense of the world. In natural language understanding, extracting concepts and conceptualizing text is a critical research problem. For example, when people see Honda Civic or Hyundai Elantra, they can think of the concept of "fuel-efficient car" or "economy car", and can go on to associate it with other models such as Ford Focus or Nissan Versa.

Figure 1. Humans can conceptualize things and produce associations

Past work, including knowledge graphs and concept libraries such as DBPedia, YAGO, and Probase, extracts concepts from Wikipedia or web articles. However, concepts extracted this way do not match the user's cognitive perspective. For example, rather than recognizing the Toyota 4Runner as a Toyota SUV or simply a car, we are more interested in whether it can be conceptualized as a "car with a high chassis" or an "off-road vehicle". Similarly, if an article discusses films such as "Jane Eyre", "Wuthering Heights", and "The Great Gatsby", it would be very helpful to recognize that it is discussing the concept of "films adapted from novels". Existing knowledge graphs, however, aim to build a structured representation of world knowledge and are extracted from grammatically rigorous articles. As a result, they cannot conceptualize text (such as queries and documents) from the user's perspective to understand user intent. In addition, existing work mainly extracts concepts that are stable over the long term; it struggles to extract trending concepts that emerge within a short period (such as "New Year blockbusters" or "new anime of July 2019") and the links between them.

We propose ConcepT, a concept mining system that extracts concepts matching users' interests and cognitive granularity. Unlike previous work, ConcepT extracts concepts from large volumes of user queries and search click logs, and further links topics, concepts, and entities into a hierarchical taxonomy. ConcepT is currently deployed in Tencent QQ Browser, where it mines concepts, improves the understanding of user query intent and of the themes of long articles, and supports search and other services. It has so far extracted more than 200,000 high-quality user-centered concepts and is growing by more than 11,000 new concepts per day. The core algorithmic architecture of ConcepT is also applicable to other languages such as English.

Our main contributions include:

  1. Using two unsupervised methods, bootstrapping and query-title alignment, we extract user-centered concepts from large-scale search logs;
  2. Using the seed data extracted by these methods, we train a supervised model (a conditional random field, CRF, plus a classifier) to further extract concept phrases from input queries and article titles;
  3. We propose two strategies for tagging long articles with concepts, enriching the representation of the articles;
  4. By extracting isA relationships among topics, concepts, and entities, we build a three-level taxonomy that preserves the connections between them.

Experiments show that ConcepT accurately extracts high-quality concept phrases from queries and tags long articles with relevant concepts. Online A/B testing shows that ConcepT increases the exposure efficiency of the information-feed service by 6.01%.

Method

The ConcepT system addresses three problems: concept mining, concept tagging, and taxonomy construction. Figure 2 shows how concept mining works in ConcepT. It comprises three main strategies:

  1. Pattern-Concept Bootstrapping. This method starts from a batch of seed patterns (templates) and matches them against queries to obtain candidate concepts. New patterns are then generated from the candidate concepts obtained. A new pattern is retained only if it matches a certain number of existing concepts and is also extensible, that is, it matches a certain number of new concepts. The cycle repeats, continually yielding more candidate concepts and matching patterns.
  2. Query-Title Alignment. This method extracts candidate concepts from a query and the titles of the articles clicked for it. It takes N-grams shared by the query and an article title as candidate concepts, on the assumption that an N-gram in the query may have a richer description in the article title.
  3. Sequence Labeling. This method models concept extraction as a sequence labeling problem and trains a conditional random field (CRF) sequence labeling model to extract candidate concepts.
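
The bootstrapping loop in strategy 1 can be sketched roughly as follows. This is a minimal illustration rather than the system's implementation: it assumes that a pattern is a simple (prefix, suffix) pair wrapped around the concept, and the function names, the thresholds `min_known` and `min_new`, and the sample queries are all hypothetical.

```python
def match(pattern, query):
    """Return the text captured between prefix and suffix, or None."""
    prefix, suffix = pattern
    if len(query) <= len(prefix) + len(suffix):
        return None
    if query.startswith(prefix) and query.endswith(suffix):
        return query[len(prefix):len(query) - len(suffix)].strip() or None
    return None


def induce_patterns(concepts, queries):
    """Cut known concepts out of queries to propose (prefix, suffix) patterns."""
    candidates = set()
    for query in queries:
        for concept in concepts:
            idx = query.find(concept)
            if idx >= 0:
                pattern = (query[:idx], query[idx + len(concept):])
                if pattern != ("", ""):  # skip queries equal to the concept
                    candidates.add(pattern)
    return candidates


def bootstrap(queries, seed_patterns, rounds=3, min_known=1, min_new=1):
    patterns = set(seed_patterns)
    concepts = set()
    for _ in range(rounds):
        # Step 1: apply the current patterns to collect candidate concepts.
        for pattern in patterns:
            for query in queries:
                concept = match(pattern, query)
                if concept:
                    concepts.add(concept)
        # Step 2: keep a new pattern only if it re-finds enough known
        # concepts and also extends to enough new ones.
        for pattern in induce_patterns(concepts, queries) - patterns:
            extracted = {match(pattern, q) for q in queries} - {None}
            known = len(extracted & concepts)
            new = len(extracted - concepts)
            if known >= min_known and new >= min_new:
                patterns.add(pattern)
    return concepts, patterns


queries = [
    "top 10 fuel-efficient cars",
    "top 10 budget phones",
    "best fuel-efficient cars",
    "best budget phones",
    "best romantic comedies",
    "weather tomorrow",
]
concepts, patterns = bootstrap(queries, seed_patterns=[("top 10 ", "")])
print(sorted(concepts))
# ['budget phones', 'fuel-efficient cars', 'romantic comedies']
```

In this toy run the seed pattern "top 10 ..." yields two concepts, those concepts induce the new pattern "best ...", and that pattern in turn surfaces "romantic comedies". Patterns accepted in the final round are only applied if another round follows.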

The candidate concepts produced by these three strategies are all filtered by a classifier to control quality, yielding the final concept set.
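
Before candidates reach the classifier, the query-title alignment strategy has to pair an N-gram from the query with its possibly fuller form in a clicked title. The sketch below is one way this could look, under stated assumptions: text is tokenized on whitespace, a title span counts as aligned if it starts and ends with the same tokens as the query N-gram and contains the N-gram's tokens in order, and at most `max_extra` additional tokens are allowed. The function names and parameters are hypothetical and not taken from the paper.

```python
def is_subsequence(short, long):
    """True if the tokens of `short` appear in `long` in the same order."""
    it = iter(long)
    return all(tok in it for tok in short)


def align(query, title, min_n=2, max_extra=2):
    """Return title spans aligned with some query N-gram as candidate concepts."""
    q = query.lower().split()
    t = title.lower().split()
    candidates = set()
    for n in range(min_n, len(q) + 1):
        for i in range(len(q) - n + 1):
            gram = q[i:i + n]
            for start in range(len(t)):
                if t[start] != gram[0]:
                    continue
                # Allow the title span to hold up to max_extra inserted tokens.
                last = min(len(t), start + n + max_extra)
                for end in range(start + n - 1, last):
                    span = t[start:end + 1]
                    if span[-1] == gram[-1] and is_subsequence(gram, span):
                        candidates.add(" ".join(span))
    return candidates


print(align("fuel-efficient cars", "10 fuel-efficient family cars worth buying"))
# {'fuel-efficient family cars'}
print(align("economy cars price", "best economy cars of 2019"))
# {'economy cars'}
```

The first call shows the case the method is designed for: the title describes the query's concept more richly. The second shows the plain shared N-gram case.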

Figure 2. ConcepT concept mining process: mining concepts from user search click logs

Figure 3 shows how ConcepT tags an article with concepts. There are two main strategies:

  1. Matching-based tagging. The algorithm extracts key entities from the article and then uses the isA (hypernym-hyponym) relations between entities and concepts to obtain related candidate concepts. Because the text of each concept is short, the titles of highly clicked articles associated with the concept are used to enrich its representation. Finally, the algorithm compares the similarity between the enriched concept representation and the article to decide whether to tag the article with that concept.
  2. Tagging based on probabilistic inference. The algorithm extracts key entities from the article. When no isA relation between an entity and a concept has been established yet, the words in the entity's context are used to find candidate concepts. For each candidate concept, the algorithm then combines the probability of each entity given the article with the conditional probability of the concept given the entity to measure how relevant the article is to the concept. The resulting scores determine which concept tags are assigned to the article.

Figure 3. ConcepT article tagging process: tagging an article with relevant concepts
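
The scoring step of the probabilistic inference strategy can be illustrated with a small sketch. The text only says that the two probabilities are combined; the code below assumes the simplest combination, summing p(entity | article) * p(concept | entity) over the article's key entities. The function names, the threshold, and all probability values are made up for illustration.

```python
def concept_scores(entity_given_doc, concept_given_entity):
    """Score each concept as sum over entities of p(e | doc) * p(c | e)."""
    scores = {}
    for entity, p_entity in entity_given_doc.items():
        for concept, p_concept in concept_given_entity.get(entity, {}).items():
            scores[concept] = scores.get(concept, 0.0) + p_entity * p_concept
    return scores


def tag_article(entity_given_doc, concept_given_entity, threshold=0.2):
    """Return the concepts whose score reaches the (hypothetical) threshold."""
    scores = concept_scores(entity_given_doc, concept_given_entity)
    return sorted(c for c, s in scores.items() if s >= threshold)


# Hypothetical probabilities for one article and its key entities.
entity_given_doc = {"Honda Civic": 0.5, "Toyota 4Runner": 0.25, "Ford Focus": 0.25}
concept_given_entity = {
    "Honda Civic": {"economy cars": 1.0},
    "Ford Focus": {"economy cars": 0.5, "hatchbacks": 0.5},
    "Toyota 4Runner": {"off-road vehicles": 1.0},
}
print(tag_article(entity_given_doc, concept_given_entity))
# ['economy cars', 'off-road vehicles']
```

Here "economy cars" scores 0.5 * 1.0 + 0.25 * 0.5 = 0.625, "off-road vehicles" scores 0.25, and "hatchbacks" scores 0.125, so only the first two pass the threshold of 0.2.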

Figure 4 shows the three-level "topic-concept-entity" taxonomy built by ConcepT. This hierarchy provides a rich, multi-level topical representation of both long and short texts. The relationship between an entity and a concept is built with the same algorithm used to compute entity-concept relatedness in Figure 3, that is, probabilistic inference based on the words in the entity's context. The association between a concept and a topic is determined by the proportion of articles containing the concept that belong to the topic. The topics are 31 predefined categories, including "Technology", "Entertainment", "Current Events", and others. Articles are assigned to topics by a text classification model based on word vectors.
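
The concept-topic association described above amounts to a simple proportion, which the following sketch makes concrete. It assumes each article already carries one topic label and a list of concept tags; the function name, the `min_share` cutoff, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def concept_topic_links(articles, min_share=0.5):
    """Link a concept to each topic holding at least min_share of its articles.

    `articles` is an iterable of (topic, concepts) pairs.
    """
    topic_counts = defaultdict(Counter)
    for topic, concepts in articles:
        for concept in set(concepts):  # count each article once per concept
            topic_counts[concept][topic] += 1

    links = {}
    for concept, counts in topic_counts.items():
        total = sum(counts.values())
        links[concept] = {
            topic: n / total for topic, n in counts.items() if n / total >= min_share
        }
    return links


articles = [
    ("Technology", ["budget phones"]),
    ("Technology", ["budget phones", "foldable phones"]),
    ("Technology", ["budget phones"]),
    ("Entertainment", ["budget phones"]),
    ("Entertainment", ["films adapted from novels"]),
]
links = concept_topic_links(articles)
print(links["budget phones"])
# {'Technology': 0.75}
```

Three of the four articles tagged "budget phones" are Technology articles, so the concept is linked to Technology with a share of 0.75, while the Entertainment share of 0.25 falls below the cutoff.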

Figure 4. Three-level topic-concept-entity hierarchy

Experimental results

We extensively evaluated ConcepT's concept extraction, article tagging, and taxonomy construction using offline evaluation and online A/B testing.

Table 1. Examples of concepts extracted by ConcepT from user search queries.
Table 2. Accuracy evaluation of ConcepT concept mining.

We compared existing keyword and quality-phrase extraction algorithms and tools, as well as variants of our own method. The experiments show that our combined strategy achieves the highest accuracy, a significant improvement over the other algorithms. The concept extraction dataset used in the experiments was drawn from real search logs and manually labeled. The dataset has been open-sourced.

Table 3. Examples of the hierarchy built by ConcepT.

Currently, a concept contains 3.44 subordinate entities on average, and the largest contains 59 entities. Manual evaluation puts the accuracy of the hierarchical relationships at 96.59%.

Table 4. Online A/B test results.

ConcepT has significantly improved various metrics of the QQ Browser information-feed service. The most important metric, impression (exposure) efficiency (IE), shows a relative increase of 6.01%.

Figure 5. Concept tags assigned to articles by the ConcepT system.

Currently, ConcepT processes 96,700 articles per day, and about 35% of them can be tagged with a concept. We created a concept tagging dataset of 11,547 articles to measure tagging accuracy. Manual evaluation found that the current system's tagging accuracy is 96%.

Paper: A User-Centered Concept Mining System for Query and Document Understanding at Tencent

Paper address:

https://arxiv.org/abs/1905.08487

Related data resources:

https://github.com/BangLiu/ConcepT

This article is reposted from the Tencent Technical Engineering official account.