This article is reproduced from the "Microsoft Research AI Headlines" public account.
Editor's Note: As a special kind of graph data, knowledge graphs are both human-readable and machine-friendly. Information retrieval, question answering, recommender systems, e-commerce, and financial risk control are all everyday application scenarios that depend on knowledge graphs. How can we build a knowledge service system that is "more, faster, better, and cheaper"? Researchers in the Machine Learning Group at Microsoft Research offer recommendations based on their own experience.
In recent years, as artificial intelligence has been widely developed and applied in research and practice, knowledge graphs have also advanced rapidly as an important AI topic. Public knowledge graphs containing hundreds of millions of facts are now common, and different data sources are linked to one another, forming super-large-scale knowledge graphs with tens of billions of facts. At the same time, with progress in natural language processing, deep learning, and related technologies, knowledge extraction techniques keep improving, and incremental knowledge graph data is continuously being imported.
On the one hand, because knowledge is universal and interrelated, a knowledge graph only delivers its real value once it reaches a certain scale and coverage. On the other hand, the flexibility and fragmentation of common knowledge representations (such as triples) make knowledge graph data hard to manage. Now that knowledge graph data has become a rapidly growing behemoth, making massive knowledge available and easy to use has become an urgent need for knowledge-based applications.
The same knowledge graph data can be used by different applications, and for users, building a new data system each time is redundant and complicated work. Providing data services that can be accessed online or downloaded is therefore an efficient way to apply knowledge graphs. In this article, we refer to this way of providing knowledge graph data services, online or offline, as "Knowledge Graph as a Service".
This article explains how to efficiently manage and serve hyperscale knowledge graph data from the perspectives of data, applications, and challenges, and shares some of the authors' team's experience in designing and implementing real-time services over knowledge graphs with tens of billions of facts.
Knowledge Graphs: Knowledge from the Perspective of Graphs
The British philosopher Francis Bacon famously said: "Knowledge is power." In "Artificial Intelligence: A Modern Approach", Stuart J. Russell and Peter Norvig point out that artificial intelligence includes natural language processing, knowledge representation, automated reasoning, machine learning, computer vision, and robotics. The importance of knowledge representation for artificial intelligence is self-evident. In fact, a model obtained by machine learning is also a kind of knowledge, expressed through computational structure and numerical values.
Among the many forms of knowledge representation, the knowledge graph, as a semantic network, offers strong expressive power and modeling flexibility. First, a knowledge graph is a semantic representation that can model real-world entities, concepts, attributes, and the relationships among them. Second, a knowledge graph is the data exchange standard for the technologies derived from it: it is itself a "protocol" for data modeling, and the related technologies cover knowledge extraction, knowledge integration, knowledge management, and knowledge application.
A knowledge graph is a special kind of graph data that is semantic and reusable: once knowledge graph data has been acquired, it can be reused by applications in many domains, which is also the motivation for building knowledge graph services. So what exactly is a knowledge graph?
First, a knowledge graph is a special kind of graph data. Specifically, it is a labeled, directed property graph. Each node has a number of attributes and attribute values. An edge between two entities represents a relationship between the nodes; the direction of the edge indicates the direction of the relationship, and the label on the edge indicates the type of the relationship. For example, "Tom Cruise" and "Mission Impossible" are two entities, and "starring" is the relationship between them. The two entities correspond to a person and a movie in the real world, and the edge corresponds to the real-world connection between that person and that movie. The former is the source node of the relationship, and the latter is the target node. The entity "Tom Cruise" has attributes such as "date of birth", whose value is "July 3, 1962".
Second, a knowledge graph is a human-readable and machine-friendly knowledge representation. It uses human-readable strings to identify each element; at the same time, the graph representation is a general-purpose data structure that computers can easily recognize and process.
Third, a knowledge graph carries semantics and implies logical meaning and rules. The nodes correspond to entities or concepts in the real world, and each edge or attribute corresponds to a piece of real-world knowledge. On top of this, we can use human-defined rules to derive knowledge that is not explicitly given in the data. For example, knowing that "Zhang San is a person", we can obtain a lot of new knowledge from rules such as "people have parents, have brains, and need to breathe", without stating each fact one by one in the knowledge graph. As another example, the rule that the "son" of one's "maternal grandfather" is one's "maternal uncle" can be used to derive kinship relationships among many entities.
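The labeled, directed property graph described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class names `Node` and `Edge`, the `neighbors` helper, and the ISO date format are assumptions made for this example, not part of any particular knowledge graph system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity identified by a human-readable name, with attribute/value pairs."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Edge:
    """A directed, labeled relationship from a source node to a target node."""
    source: str
    label: str
    target: str


# The example from the text: two entities and one "starring" relationship.
nodes = {
    "Tom Cruise": Node("Tom Cruise", {"date of birth": "1962-07-03"}),
    "Mission Impossible": Node("Mission Impossible"),
}
edges = [Edge("Tom Cruise", "starring", "Mission Impossible")]


def neighbors(name, label):
    """Follow outgoing edges with the given label from the named node."""
    return [e.target for e in edges if e.source == name and e.label == label]


print(neighbors("Tom Cruise", "starring"))
print(nodes["Tom Cruise"].attributes["date of birth"])
```

Because edges are directed, asking for the "starring" neighbors of "Mission Impossible" returns nothing in this sketch; a real system would typically also index incoming edges.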
The following introduces a widely used knowledge graph representation framework and query language: the Resource Description Framework (RDF) and the SPARQL query language (SPARQL Protocol and RDF Query Language).
The basic element in the Resource Description Framework has the form <subject, predicate, object>, abbreviated <s, p, o>. It uses a subject, a predicate, and an object to state a piece of knowledge, which is simple and flexible. Because the framework is abbreviated as RDF, a knowledge graph data set is often called an RDF data set, and its management tool is usually called an RDF store or triple store. In RDF, triples of this form can represent the relationships and attributes of the entity "Tom Cruise". An RDF data set can be divided into two parts:
One: explicit triples. This part is given directly by the data set.
Two: implicit triples. This part is represented implicitly by inference rules. The rules are applied to the explicit triples, and the number of derived triples can be huge, even several times the size of the original data. For example:
- If ?x is an instance of ?y, and ?y is a subclass of ?z, then ?x is also an instance of ?z: (?x, IsA, ?y) & (?y, SubclassOf, ?z) => (?x, IsA, ?z)
- If ?y is a friend of ?x, then ?x is a friend of ?y: (?x, Friend, ?y) => (?y, Friend, ?x)
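The split between explicit and implicit triples can be illustrated with a small forward-chaining sketch that applies the two rules above until nothing new is derived. The sample triples (including the hypothetical entities "Alice" and "Bob") and the function name `infer` are assumptions for illustration only.

```python
# Explicit triples: given directly by the data set (sample data, assumed).
explicit = {
    ("Tom Cruise", "IsA", "Actor"),
    ("Actor", "SubclassOf", "Person"),
    ("Alice", "Friend", "Bob"),
}


def infer(triples):
    """Apply the two rules repeatedly until no new triple appears (a fixpoint)."""
    closure = set(triples)
    while True:
        new = set()
        for (x, p, y) in closure:
            # Rule: (?x, Friend, ?y) => (?y, Friend, ?x)
            if p == "Friend":
                new.add((y, "Friend", x))
            # Rule: (?x, IsA, ?y) & (?y, SubclassOf, ?z) => (?x, IsA, ?z)
            if p == "IsA":
                for (y2, q, z) in closure:
                    if q == "SubclassOf" and y2 == y:
                        new.add((x, "IsA", z))
        if new <= closure:
            return closure
        closure |= new


# Implicit triples: everything derivable that was not stated explicitly.
implicit = infer(explicit) - explicit
print(sorted(implicit))
```

Even in this tiny example the rules add two triples to the three explicit ones, which hints at how quickly derived data can grow on a real data set.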
Representative RDF data sets include Freebase, DBpedia, Wikidata, Yago2, Cyc, PubChemRDF, and UniProt. Knowledge graph data has reached an unprecedented scale. For example, Freebase contains about 1.9 billion triples, DBpedia contains about 3 billion triples, and Bio2RDF has about 10 billion triples. These data sets are connected to one another through the Linked Open Data (LOD) initiative, forming a super knowledge graph of more than 149 billion triples. If we also count the implicit triples derivable under standard rule sets (such as ρdf, RDFSPlus, RDFSFull, and OWL DL), the actual amount of data is even larger.
SPARQL is a widely used knowledge graph query language. Its main function is declarative subgraph matching: the query describes only the pattern of the final result, and the concrete execution plan is decided by the underlying data management system. The main building block of a SPARQL query is the basic graph pattern, such as (?s, p, ?o). The three elements correspond to the subject, the predicate, and the object, and the elements to be matched are written as variables prefixed with a question mark. Combinatorial logic is expressed by combining patterns with operators such as intersection, union, and difference. In addition, other graph query languages such as GraphQL and Cypher can also be used to query knowledge graphs.
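To make basic graph pattern matching concrete, here is a minimal sketch in Python of evaluating a conjunction of triple patterns over an in-memory list of triples. It is not a SPARQL engine: the function names `match_pattern` and `query` and the sample data are assumptions for this example, and the naive nested-loop evaluation ignores the indexing and query optimization that a real triple store performs.

```python
triples = [
    ("Tom Cruise", "starring", "Mission Impossible"),
    ("Tom Cruise", "date of birth", "1962-07-03"),
    ("Brad Pitt", "starring", "Fight Club"),
]


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None on failure."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must bind consistently across all of its occurrences.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, data):
    """Evaluate a conjunction of triple patterns; return all variable bindings."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in data:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# "Who stars in some film and also has a known date of birth?"
result = query([("?s", "starring", "?film"), ("?s", "date of birth", "?d")], triples)
print(result)
```

The shared variable ?s is what joins the two patterns: only subjects that satisfy both survive, which is the intersection mentioned in the text.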
Applications of Knowledge Graphs
Knowledge graphs have been explored and built by major companies, universities, and research institutions at home and abroad, such as Google's Knowledge Graph, Microsoft's Bing Knowledge Graph, Sogou's Knowledge Cube, Baidu's Zhixin, Alibaba's product knowledge graph, Fudan University's CN-DBpedia, Southeast University's Zhishi.me, Tsinghua University's XLore, and cnSchema.org from OpenKG, an open Chinese knowledge graph community. The development of knowledge graphs has also received government attention and support. For example, the State Council of China explicitly included knowledge-graph-related development plans in its "New Generation Artificial Intelligence Development Plan" (Guo Fa [2017] No. 35).
As an application technology, knowledge graphs support concrete applications in many industries. For example:
- Information retrieval: accurately aggregating and matching entity information in search engines, understanding keywords, and semantically analyzing search intent;
- Natural language understanding: using the knowledge in a knowledge graph as background information for understanding entities and relationships in natural language;
- Question answering: matching question-and-answer patterns to knowledge subgraphs in the knowledge graph;
- Recommender systems: integrating the knowledge graph as auxiliary information to provide more accurate recommendations;
- E-commerce: building a product knowledge graph to accurately match a user's purchase intent with a set of candidate products;
- Financial risk control: using the relationships between entities to analyze the risk of financial activities and to support remedial measures (for example, through contacts) after a risk is triggered;
- Public security and criminal investigation: analyzing the relationships between entities to obtain clues;
- Judicial assistance: representing and querying legal provisions in structured form to assist in judging cases;
- Education and medical care: providing visual knowledge representations for drug analysis, disease diagnosis, and so on.
So how is a knowledge graph accessed in these applications? Based on response time requirements, knowledge graph service scenarios can be roughly divided into online queries and offline analysis.
Online query tasks include: 1) queries for entities, such as "poems written by Su Shi"; 2) queries for attributes, such as "Xin Qiji's years of birth and death"; 3) queries for relationships, which search the knowledge graph over multi-hop relationships, such as "film awards of Tom Cruise and Brad Pitt"; 4) queries for subgraph structures, which find sets of entities connected by specific relationships, such as "actors who have appeared in the same film and are married to each other"; 5) aggregation queries, such as "the list of countries where the world's ten highest peaks are located". Queries of this type are latency-sensitive but have low data throughput.
Offline analysis tasks include: 1) computations based on graph structure, such as knowledge graph embedding, knowledge graph completion, and entity ranking; 2) rule-based reasoning; 3) statistical analysis based on global information. Such tasks are not sensitive to response time but require high data throughput.
Challenges of Knowledge Graph Services
So, what makes building a knowledge graph service difficult? There are three main points:
One: parallel graph processing. As a special kind of graph data, a knowledge graph inherits three graph-related difficulties in storage, querying, and computation:
- The complexity of graph data. From the perspective of data access, no matter how graph data is represented, visiting a neighbor of a node involves a "jump" within the storage space, that is, a large amount of random data access. Many program optimization techniques rely on data locality and reuse. Heavy random access causes the CPU cache to miss most of the time, which sharply degrades system performance. From the perspective of program structure, the unstructured nature of graphs makes parallel processing very difficult. It is worth noting that partitioning graph data is itself an NP-hard problem, which complicates the design of divide-and-conquer systems.
- The diversity of graph structures and computations. There are many kinds of graph data, and the performance of a graph algorithm can vary greatly with the characteristics of the graph. There are also many kinds of graph computation, which limits how far a system can be optimized for any one of them.
- The ever-growing size of graph data. At large scale, many classic graph algorithms become unusable because they are too inefficient.
Two: implicit semantic information processing. For example, handling rule-based reasoning such as (?x, Parent, ?y) => (?x, Father, ?y) UNION (?x, Mother, ?y) and (?x, Niece, ?y) => (?x, Niece, ?y) UNION (?x, Spouse, ?z) AND (?z, Niece, ?y) amplifies storage and query costs. Approaches generally fall into two categories. The first derives all implicit triples before querying and stores them as ordinary triples, which greatly increases storage cost. The second rewrites a query at query time into multiple queries according to the inference rules and then merges the results, which greatly increases query cost. A compromise that combines the two amplifies both storage and query costs at the same time.
Three: complex data patterns. The schema of a knowledge graph is complex. Types are hard to pin down, for example 'A dog can also be an actor' (so can we DEFINE 'Actor SubClass Person'?); entities can have multiple roles, such as 'A director who is also an actor, a singer, and a writer'; and attributes can have multiple values or be missing, such as 'An actor with many awards' and 'A person entity without birthday data'. This complexity makes the physical storage and optimization of knowledge very difficult.
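The two strategies for implicit semantics described in point Two, materializing derived triples before querying versus rewriting the query at query time, can be contrasted with a small sketch based on the Parent rule. The sample triples (hypothetical people "Ann", "Tom", and "Sue"), the `REWRITE` table, and the function names are assumptions for illustration; each triple is read here as "the Father of Ann is Tom".

```python
# Sample data (assumed): only the specific predicates are stored explicitly.
data = {
    ("Ann", "Father", "Tom"),
    ("Ann", "Mother", "Sue"),
}

# (?x, Parent, ?y) => (?x, Father, ?y) UNION (?x, Mother, ?y)
REWRITE = {"Parent": ["Father", "Mother"]}


def materialize(triples):
    """Strategy 1: derive the implicit Parent triples up front and store them."""
    derived = {
        (s, general, o)
        for (s, p, o) in triples
        for general, specifics in REWRITE.items()
        if p in specifics
    }
    return triples | derived


def query_plain(triples, s, p):
    """Answer (s, p, ?o) by a direct lookup over the stored triples."""
    return {o for (s2, p2, o) in triples if s2 == s and p2 == p}


def query_rewritten(triples, s, p):
    """Strategy 2: rewrite (s, p, ?o) into a union of sub-queries at query time."""
    results = set()
    for predicate in [p] + REWRITE.get(p, []):
        results |= query_plain(triples, s, predicate)
    return results


stored = materialize(data)
print(len(data), len(stored))                 # storage cost: before vs. after
print(query_plain(stored, "Ann", "Parent"))   # one lookup over more data
print(query_rewritten(data, "Ann", "Parent")) # several lookups over less data
```

Both strategies return the same answer. Materialization doubles the stored triples in this toy example, while rewriting turns one lookup into three, which mirrors the storage-versus-query trade-off described above.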
Combining the above analysis of data characteristics, query characteristics and data patterns, we summarize the challenges faced by a knowledge service system into the following four points:
- System architecture: achieve data scalability;
- Storage design: handling random data access;
- Data access: a friendly access interface;
- Efficient queries: low latency and/or high throughput data queries.
These four points can be summed up in four words: more, faster, better, cheaper.
Implementing a Knowledge Graph Service
We first analyze the underlying storage media. The response times of ordinary memory (RAM) and hard disk drives (HDD) are on the order of 10^-7 seconds and 10^-2 seconds respectively, and their data throughput is about 3 GB/s and 100 MB/s. Because of its mechanical seek mechanism, a traditional hard disk performs poorly on random access and on parallel reads and writes. To meet the demands of highly parallel, random data access, latency-sensitive knowledge graph applications generally choose memory as the main storage medium. On the other hand, the memory capacity of a single machine is limited and scaling up is expensive, so we prefer a scalable memory cloud architecture.
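A back-of-the-envelope calculation shows why these orders of magnitude matter for graph workloads dominated by random access. The latencies are the ones quoted above; the workload of one million random accesses is an assumption chosen purely for illustration, and the model ignores caching, parallelism, and sequential throughput.

```python
RAM_LATENCY_S = 1e-7  # order of magnitude quoted in the text
HDD_LATENCY_S = 1e-2  # order of magnitude quoted in the text


def total_time(num_random_accesses, latency_s):
    """Time for strictly sequential random accesses (no caches, no parallelism)."""
    return num_random_accesses * latency_s


accesses = 1_000_000  # assumed workload size, not a figure from the article

ram_seconds = total_time(accesses, RAM_LATENCY_S)
hdd_seconds = total_time(accesses, HDD_LATENCY_S)
slowdown = hdd_seconds / ram_seconds

print(f"RAM: {ram_seconds:.3f} s, HDD: {hdd_seconds:.0f} s, ratio: {slowdown:.0f}x")
```

Under these assumptions the in-memory workload finishes in about a tenth of a second, while the disk-based one needs about 10,000 seconds, a gap of five orders of magnitude.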
From the perspective of storage architecture, single-machine storage systems fall mainly into two types: systems built on traditional relational databases, such as SW-Store and Triple Tables, and dedicated storage and query systems customized for knowledge graph data, such as RDF-3X and TripleBit. Distributed systems can be roughly divided into those based on distributed file systems, key-value stores, centralized storage, and hybrid storage.
For upper-layer applications, the system needs to provide fine-grained access interfaces. The reason is that such interfaces can serve both as lightweight tools for online queries and as parallelizable building blocks for offline analysis, truly converting N into 1 + 1 + ... + 1, where N denotes a batch operation and 1 denotes a fine-grained operation. We can then execute these 1s in parallel and achieve throughput comparable to batch processing. Conversely, if the system supports only batch processing, then responding to multiple fine-grained operations becomes N + N + ... + N, which wastes resources and slows down responses.
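The idea of expressing a batch operation N as 1 + 1 + ... + 1 can be sketched as follows. The in-memory dictionary `store`, the fine-grained operation `get_outgoing`, and the helper `batch_get` are hypothetical names introduced for this example. In a real memory cloud each fine-grained call would typically go to a remote storage node; with a local Python dictionary the thread pool only illustrates the decomposition and is not expected to yield a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical store: subject -> list of (predicate, object) pairs.
store = {
    "Tom Cruise": [("starring", "Mission Impossible"), ("date of birth", "1962-07-03")],
    "Brad Pitt": [("starring", "Fight Club")],
}


def get_outgoing(subject):
    """Fine-grained operation (the "1"): fetch the outgoing edges of one entity."""
    return store.get(subject, [])


def batch_get(subjects, workers=4):
    """Batch operation (the "N"), built as 1 + 1 + ... + 1 and run in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with subjects.
        return dict(zip(subjects, pool.map(get_outgoing, subjects)))


print(get_outgoing("Brad Pitt"))                # online query: a single cheap call
print(batch_get(["Tom Cruise", "Brad Pitt"]))   # offline analysis: many calls in parallel
```

The same `get_outgoing` call serves both the latency-sensitive online path and the throughput-oriented batch path, which is the point of offering the fine-grained interface.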
The data storage and processing module acts as a middle tier built on the underlying storage architecture (such as a memory cloud). It must be sensitive to storage cost and friendly to random access, while also providing fine-grained, efficient access interfaces.
The core of the data storage and processing module is the physical representation of the knowledge graph data, which is responsible for data management tasks such as storage, querying, and updating. The core problem is managing an updatable, hyperscale, sparse third-order tensor. If the subject, predicate, and object of the knowledge graph data are treated as three dimensions, and the range of each dimension is the set of all values that appear in the corresponding position, then the entire data set can be represented as a sparse third-order tensor, and the storage problem becomes one of compressing and indexing this tensor. The tensor is very sparse and unevenly distributed: the subject and object dimensions are large and can reach the order of billions, while predicates usually number in the thousands to tens of thousands. We need to be efficient in both storage and querying, and to support free updates along every dimension.
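The tensor view can be made concrete with a minimal sketch: each dimension gets a dictionary that maps values to integer indices, and only the coordinates of the non-zero cells (the triples that exist) are stored. The class name `SparseTripleTensor` and its methods are assumptions for illustration; a production system would add the compression and indexing discussed in the text.

```python
class SparseTripleTensor:
    """A sparse 3rd-order tensor over (subject, predicate, object) indices."""

    def __init__(self):
        self.ids = ({}, {}, {})  # one value -> index dictionary per dimension
        self.nonzero = set()     # coordinates (i, j, k) of the stored triples

    def _index(self, dim, value, create=False):
        table = self.ids[dim]
        if create:
            # A new value receives the next free index along its dimension.
            return table.setdefault(value, len(table))
        return table.get(value)

    def add(self, s, p, o):
        """Incremental update: set the cell for (s, p, o) to non-zero."""
        coords = tuple(self._index(d, v, create=True) for d, v in enumerate((s, p, o)))
        self.nonzero.add(coords)

    def contains(self, s, p, o):
        coords = tuple(self._index(d, v) for d, v in enumerate((s, p, o)))
        return None not in coords and coords in self.nonzero

    def shape(self):
        return tuple(len(table) for table in self.ids)

    def density(self):
        """Fraction of cells that are non-zero."""
        subjects, predicates, objects = self.shape()
        return len(self.nonzero) / (subjects * predicates * objects)


tensor = SparseTripleTensor()
tensor.add("Tom Cruise", "starring", "Mission Impossible")
tensor.add("Tom Cruise", "date of birth", "1962-07-03")
print(tensor.shape(), tensor.density())
```

With billions of subjects and objects but only thousands of predicates, the density of a real data set is vanishingly small, which is why a dense array is not an option and compression and indexing become the central problem.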
Specific optimization goals include: first, keeping the data storage files as small as possible; second, handling all types of data access requests quickly; and third, supporting incremental updates. Given the complexity and flexibility of the physical storage model, a common approach is to compress and index the elements of this third-order tensor (that is, the triples) directly as the smallest storage unit. This approach indexes the triples by the position of their three elements and stores them in multiple copies, trading space for time. It may be called zero-order flattening. A second class of methods flattens the third-order tensor along one order, such as the subject (graph data structures) or the predicate (vertical partitioning, bit matrices, and so on). The last class flattens along two orders at the same time, such as subject and predicate, which we call a strongly typed storage scheme. Most of the common storage solutions mentioned above are weakly typed systems, such as a triple table (simple and direct) or a "predicate-object" key list. Compared with weakly typed systems, strongly typed systems have lower data storage and access costs and lower JOIN costs.
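The three flattening strategies can be compared side by side on a handful of triples. This is a simplified sketch: the variable names and sample data are assumptions, plain Python lists and dictionaries stand in for compressed on-disk or in-memory structures, and the last layout only illustrates flattening along subject and predicate together; it is not the authors' strongly typed scheme itself.

```python
from collections import defaultdict

triples = [
    ("Tom Cruise", "starring", "Mission Impossible"),
    ("Tom Cruise", "date of birth", "1962-07-03"),
    ("Brad Pitt", "starring", "Fight Club"),
]

# Zero-order flattening: whole triples are the storage unit, kept in several
# sort orders so that any position can be looked up (space traded for time).
spo = sorted(triples)
pos = sorted((p, o, s) for s, p, o in triples)
osp = sorted((o, s, p) for s, p, o in triples)

# Flattening along one order, by subject: a graph-style adjacency list.
by_subject = defaultdict(list)
# Flattening along one order, by predicate: vertical partitioning.
by_predicate = defaultdict(list)
# Flattening along two orders, by subject and predicate together.
by_subject_predicate = defaultdict(list)

for s, p, o in triples:
    by_subject[s].append((p, o))
    by_predicate[p].append((s, o))
    by_subject_predicate[(s, p)].append(o)

print(pos[0])
print(by_predicate["starring"])
print(by_subject_predicate[("Tom Cruise", "starring")])
```

Answering "which films does Tom Cruise star in" is a single key lookup in the last layout, whereas the triple-table layouts need a range scan or a join, which gives some intuition for the access and JOIN costs compared above.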