Graph neural networks (GNNs) are a recent research hotspot, especially the "Graph Networks" proposed by DeepMind, which are expected to help deep learning achieve causal reasoning. However, the paper is difficult to understand. Dr. Deng Wei, chief AI scientist of Fosun Group and founder of DaDian Medical, explains the significance of DeepMind's "Graph Networks", building on the clear classification of GNNs in the survey by Professor Yu Shilun of Tsinghua University.

**The emergence of graph neural network (GNN)**

Looking back at the progress of machine learning in 2018, one important paper was "Relational inductive biases, deep learning, and graph networks", published by the DeepMind team in June 2018, which sparked heated discussion in the industry.

Subsequently, many scholars continued research along the same lines, including Sun Maosong's team at Tsinghua University. In December 2018 they published a review titled "Graph neural networks: A review of methods and applications".

In January 2019, Professor Yu Shilun's team also published a review with more comprehensive coverage, titled "A Comprehensive Survey on Graph Neural Networks".

This paper by the DeepMind team has attracted such enthusiastic attention from the industry for three reasons:

- Reputation: Since AlphaGo defeated Lee Sedol, DeepMind has been regarded as a leading team in machine learning, and its papers receive widespread attention from peers;
- Open source: Soon after publishing the paper [1], the DeepMind team open-sourced the software they had developed on GitHub, under the project name Graph Nets [4];
- Topic: Reputation and open source both matter, but they are not the main reason for the heated discussion. The main reason is the topic: how to use deep learning methods to process graphs.

## What is a graph neural network (GNN)

A graph consists of nodes and edges.

A graph is an important mathematical model that can be used to solve many problems.

For example, if we treat an urban subway map as a graph, each station is a node and the connection between adjacent stations is an edge. Given a starting point and a destination, graph computation can find the route with the shortest travel time or the fewest transfers.
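The route calculation described above can be sketched with Dijkstra's algorithm, a standard shortest-path method that the article does not name explicitly. The station names and travel times below are hypothetical, and the sketch minimizes travel time only; it does not count transfers.

```python
import heapq

def shortest_route(graph, start, goal):
    """Dijkstra's algorithm: return (total_minutes, stations) or None."""
    queue = [(0, start, [start])]   # (minutes so far, station, path taken)
    best = {start: 0}               # best known time to reach each station
    while queue:
        minutes, station, path = heapq.heappop(queue)
        if station == goal:
            return minutes, path
        if minutes > best.get(station, float("inf")):
            continue                # stale queue entry, a faster way was found
        for neighbor, cost in graph[station].items():
            new_minutes = minutes + cost
            if new_minutes < best.get(neighbor, float("inf")):
                best[neighbor] = new_minutes
                heapq.heappush(queue, (new_minutes, neighbor, path + [neighbor]))
    return None

# Hypothetical subway: nodes are stations, edge weights are minutes.
subway = {
    "A": {"B": 4, "C": 2},
    "B": {"A": 4, "D": 5},
    "C": {"A": 2, "D": 8, "E": 3},
    "D": {"B": 5, "C": 8, "E": 2},
    "E": {"C": 3, "D": 2},
}

print(shortest_route(subway, "A", "D"))  # (7, ['A', 'C', 'E', 'D'])
```

Minimizing transfers instead would need line information on each edge, which this toy graph does not carry.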

Another example is search engines such as Google and Baidu. A search engine treats every page of every website as a node in a graph. A page often contains links that reference pages on other websites, and each link is an edge in the graph. The more often a page is cited, the more reliable it is considered to be, and the higher it ranks in search results.
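The ranking idea can be illustrated with a simplified version of PageRank, the classic link-analysis algorithm. The article does not say which algorithm search engines actually use, so treat this only as a sketch; the page names, the damping factor of 0.85, and the iteration count are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its score over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: each key links to the pages in its list.
web = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
    "faq": ["home"],
}

ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # home
```

Here `home` is linked from three pages and ends up with the highest score, while `faq`, which nobody links to, keeps only the baseline.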

Many problems in graph processing remain unsolved.

For example, suppose the input is hundreds of millions of routes traveled by drivers, where each route is a sequence of (time, GPS latitude and longitude) records. How can these routes be overlaid to build a city map?

Think of the city map as a graph: each intersection is a node, and the road connecting two adjacent intersections is an edge.

It looks simple, but the details are troublesome.

For example, intersections come in many forms: not only four-way crossroads but also five-way and six-way junctions and circular overpass interchanges. How do we determine the center of an intersection from many recorded paths?

## The value of GNN

Using deep learning to process graphs can expand our ability to work with them.

Deep learning has achieved great success in processing images and text. How can these results be extended and applied to graph processing?

An image consists of a matrix of pixels arranged in rows and columns. From another angle, each pixel can be regarded as a node in a graph, with an edge to each of the 8 pixels surrounding it, and every edge treated as having the same length. Seen this way, an image is a special case of a generalized graph.
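As a minimal sketch of this view, the following Python builds the graph for a small image by connecting each pixel to the pixels that surround it. The function name `grid_to_graph` is hypothetical. Interior pixels get 8 neighbors, while pixels on the border get fewer.

```python
def grid_to_graph(height, width):
    """Map each pixel (row, col) to the list of its surrounding pixels."""
    neighbors = {}
    for r in range(height):
        for c in range(width):
            adjacent = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # a pixel is not its own neighbor
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < height and 0 <= nc < width:
                        adjacent.append((nr, nc))
            neighbors[(r, c)] = adjacent
    return neighbors

graph = grid_to_graph(3, 3)
print(len(graph[(1, 1)]))  # 8 (center pixel)
print(len(graph[(0, 0)]))  # 3 (corner pixel)
```

A generalized graph simply drops the regularity: a node may have any number of neighbors, and they need not sit on a grid.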

Many deep learning techniques for image processing can be adapted to generalized graphs, such as convolution, residual connections, dropout, pooling, attention, and encoder-decoder architectures. This is the original idea behind deep learning on graphs, and it is very simple.

Although the original idea is simple, various challenges emerge once the details are worked out. Each challenge that is overcome means stronger technical capability and more promising application scenarios.

The industry has no uniform name for this research direction of deep learning on graphs.

Teams that emphasize the mathematical properties of graphs call this direction Geometric Deep Learning. The teams of Sun Maosong and Yu Shilun emphasize the importance of neural networks in graph processing and as the source of its ideas, so they call it Graph Neural Networks. The DeepMind team objects to tying the name to a specific technique and uses a more abstract name, Graph Networks.

The name is not that important. What matters is having a way to organize the many advances in this field: clarifying each school's goals and technical methods strengthens mutual understanding among peers and facilitates future cooperation.

## 5 directions for deep learning on graphs

The Yu Shilun team organized the many developments in deep learning on graphs into 5 sub-directions, a classification that is clear and easy to understand.

- Graph Convolution Networks
- Graph Attention Networks
- Graph Embedding
- Graph Generative Networks
- Graph Spatial-temporal Networks

Let us look at Graph Convolution Networks (GCNs) first.

A GCN applies the full toolkit of the CNN to generalized graphs. The work of a CNN can be divided into four main tasks:

- Fusion between nodes. In the image domain, fusion between pixels is achieved mainly through convolution. In a generalized graph, the relationships between nodes are expressed by edges, so node-to-node fusion can use a method more powerful than convolution: message passing [5].
- Layered abstraction. A CNN uses convolution to extract progressively more refined and abstract features from the original pixel matrix, layer by layer. Nodes at higher layers are no longer isolated points; they merge the attributes of other nodes in their neighborhood. The same method of fusing neighbors can be applied to generalized graphs.
- Feature refinement. A CNN uses methods such as pooling to extract image edges from adjacent raw pixels, then the outlines of objects from adjacent edges, then higher-level and more abstract entities from adjacent objects. A CNN usually alternates convolution and pooling to build a network with a more complex structure and more powerful functions. For generalized graphs, message passing and pooling can likewise be combined to construct a multi-layer graph.
- Output layer. A CNN usually uses softmax or similar methods to classify the entire image and identify its semantic content. For generalized graphs the outputs are more diverse: besides classification and other results for the entire graph, a model can also predict the value of a specific node or of a particular edge.
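The first and last tasks in the list above can be sketched in a few lines of plain Python. This is a simplified, assumed form of message passing (each node averages its own features with those of its neighbors, then applies a weight matrix and ReLU), not the specific formulation of [5]. The function names, the toy graph, and the identity weight matrix are all illustrative; in a real model the weights would be learned.

```python
def message_passing_layer(features, neighbors, weight):
    """One round of neighbor fusion: mean over self and neighbors, then linear + ReLU."""
    updated = {}
    for node, own in features.items():
        group = [own] + [features[n] for n in neighbors[node]]
        dim = len(own)
        mean = [sum(vec[i] for vec in group) / len(group) for i in range(dim)]
        out = [sum(mean[i] * weight[i][j] for i in range(dim))
               for j in range(len(weight[0]))]
        updated[node] = [max(0.0, value) for value in out]  # ReLU
    return updated

def graph_readout(features):
    """Graph-level output: average the feature vectors of all nodes."""
    vectors = list(features.values())
    return [sum(vec[i] for vec in vectors) / len(vectors)
            for i in range(len(vectors[0]))]

# Toy path graph a - b - c with 2-dimensional node features.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a learned weight matrix

hidden = message_passing_layer(features, neighbors, identity)
print(hidden["a"])  # [0.5, 0.5]
print(hidden["c"])  # [0.5, 1.0]
summary = graph_readout(hidden)
```

Stacking several such layers lets information travel further across the graph: after k layers, each node has seen nodes up to k edges away. The per-node vectors in `hidden` support node-level predictions, and `summary` supports graph-level ones.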

Graph Attention Networks address a problem similar to that of GCNs; the difference lies in the method of fusion and multi-layer abstraction.

Graph Convolution Networks use convolution to achieve node fusion and layered abstraction. Convolution is only suitable for fusing adjacent nodes, whereas attention is not limited to adjacent nodes: each node can fuse information from every other node in the graph, adjacent or not, and how much it fuses depends on the strength of the association between the two nodes.

Attention is more powerful but demands more computing power, because the strength of association must be computed between every pair of nodes in the graph. Research on Graph Attention Networks therefore focuses on reducing computing cost or improving efficiency through parallel computing.
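The cost argument can be made concrete with a small sketch. It follows the description above, in which every node attends to every node in the graph, and it assumes dot-product scoring followed by softmax, which is one common choice and not necessarily the scoring function of any particular Graph Attention Network. The function name and the features are hypothetical.

```python
import math

def attention_update(features):
    """Fuse every node with all nodes, weighted by softmax of dot-product scores."""
    updated, weights = {}, {}
    for node, query in features.items():
        # One score per (node, other) pair: n * n scores in total.
        scores = {other: sum(q * k for q, k in zip(query, key))
                  for other, key in features.items()}
        top = max(scores.values())  # subtract the max for numerical stability
        exps = {other: math.exp(s - top) for other, s in scores.items()}
        total = sum(exps.values())
        weights[node] = {other: e / total for other, e in exps.items()}
        updated[node] = [
            sum(weights[node][other] * features[other][i] for other in features)
            for i in range(len(query))
        ]
    return updated, weights

features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
updated, weights = attention_update(features)
pair_count = sum(len(row) for row in weights.values())
print(pair_count)  # 9 scores for 3 nodes
```

With n nodes the loop computes n × n scores, which is why the cost grows quadratically with the size of the graph.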

## 4 Graph Neural Networks Other Than GCN

Graph Embedding addresses the problem of assigning a numerical tensor to each node and each edge in a graph. This problem does not arise with images, because pixels are inherently numerical tensors. Text, however, is composed of characters, words, sentences, and paragraphs, and these must be converted into numerical tensors before many deep learning algorithms can be applied.

If each character or word in a text is regarded as a node in a graph, and the grammatical and semantic relationships between words are regarded as edges, then a sentence or paragraph corresponds to a path traveled through the text graph.

If each character and word can be given an appropriate numerical tensor, then the path corresponding to a sentence or paragraph is, in most cases, a shortest path.

There are many ways to implement Graph Embedding, and one of the better ones is the autoencoder. Using a GCN, the nodes and edges of the graph are converted into numerical tensors; this process is called encoding. Then, by computing the distances between nodes, the set of numerical tensors is converted back into a graph; this process is called decoding. By continually adjusting the parameters, the decoded graph is brought closer and closer to the original graph; this process is called training.
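The decoding and comparison steps can be sketched as follows. This is only an illustration under stated assumptions: the embeddings are set by hand in place of a trained GCN encoder, the distance threshold of 1.5 is arbitrary, and the function names are hypothetical. Two nodes are decoded as connected when their embeddings lie closer together than the threshold.

```python
import math
from itertools import combinations

def decode(embeddings, threshold):
    """Rebuild edges: connect two nodes whose embeddings are closer than threshold."""
    edges = set()
    for u, v in combinations(sorted(embeddings), 2):
        if math.dist(embeddings[u], embeddings[v]) < threshold:
            edges.add((u, v))
    return edges

def reconstruction_error(true_edges, decoded_edges):
    """Count the edges that are missing from or extra in the decoded graph."""
    return len(true_edges ^ decoded_edges)

true_edges = {("a", "b"), ("b", "c")}  # original graph: a - b - c

# Hand-set embeddings standing in for the output of an encoder.
good = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [2.0, 0.0]}
poor = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.5, 0.0]}

good_error = reconstruction_error(true_edges, decode(good, threshold=1.5))
poor_error = reconstruction_error(true_edges, decode(poor, threshold=1.5))
print(good_error, poor_error)  # 0 1
```

Training would adjust the encoder's parameters to drive this error toward zero. In practice a differentiable loss would replace the hard threshold so that gradients can flow.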

Graph Embedding assigns an appropriate numerical tensor to each node and edge, but it does not solve the problem of determining the graph's structure.

Given a large number of paths traveled through a graph, how can we identify from these paths which nodes are connected to which? A harder question: if there are no paths, and the training data consists of fragments of the graph together with their features, how can the fragments be assembled into the full graph? These are the problems that Graph Generative Networks aim to solve.

One promising way to implement Graph Generative Networks is to use Generative Adversarial Networks (GANs).

A GAN consists of two parts, a generator and a discriminator: 1. From training data such as a massive set of travel paths, the generator guesses what the graph behind the data should look like; 2. The generated graph is used to forge a batch of travel paths; 3. Several paths are drawn from the mix of forged and real paths, and the discriminator is asked to identify which ones are forged.

If even a well-trained discriminator cannot tell the forged paths from the real ones, the graph produced by the generator is very close to the real graph.

## Dynamic graphs

So far we have discussed problems on static graphs, but graphs are sometimes dynamic. For example, the roads represented in a map are static, but the road conditions are dynamic.

How can we predict traffic congestion near Tiananmen Square in Beijing during the Spring Festival? To solve this problem we must consider not only spatial factors, such as the road structure around Tiananmen Square, but also temporal factors, such as traffic congestion in the area during the Spring Festival in previous years. This is one of the problems that Graph Spatial-temporal Networks aim to solve.

Graph Spatial-temporal Networks can also solve other problems. For example, given a video of a football match, how can we identify the position of the ball in each frame? The difficulty is that in some frames the ball may be invisible, for instance when it is blocked by a player's legs.

The usual approach to time series problems is the RNN, including variants such as LSTM and GRU.

The DeepMind team added an encoder-decoder mechanism on top of the RNN.

## What does DeepMind want to say?

In their paper [1], the DeepMind team describe their work as "part position paper, part review, and part unification": at once a proposal, a review, and a unification. How should we understand this?

DeepMind, together with Google Brain, MIT, and other institutions, published a long paper with 27 authors proposing "graph networks", which combine end-to-end learning with inductive reasoning and are expected to address deep learning's inability to perform relational reasoning.

As mentioned earlier, Yu Shilun's team organized the many advances in deep learning on graphs into 5 sub-directions: 1) Graph Convolution Networks, 2) Graph Attention Networks, 3) Graph Embedding, 4) Graph Generative Networks, 5) Graph Spatial-temporal Networks.

The DeepMind team focused on the last four of the five sub-directions, namely Graph Attention Networks, Graph Embedding, Graph Generative Networks, and Graph Spatial-temporal Networks. They "unified" the results of these four directions into a single framework and named it Graph Networks.

In their paper they "review" many achievements along these four sub-directions but do not summarize the achievements in Graph Convolution Networks. They then selected the most promising methods from these four sub-directions to form their own "proposal", which is their open-source code [4].

Although the paper claims that the proposal solves the problems of all four sub-directions, the open-source code shows that they actually focus on two of them, Graph Attention Networks and Graph Spatial-temporal Networks.

DeepMind's idea is this: first, combine the message passing mechanism of [5] with the global attention mechanism of [6] to build a general graph block module; second, integrate LSTM elements into the encoder-decoder framework to build a time series mechanism; finally, integrate the graph block module into the encoder-decoder framework to form a general system for Graph Spatial-temporal Networks.
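The graph block can be sketched roughly as follows, based on the update order described in [1]: edges are updated first, then nodes (using the aggregated incoming edges), then a graph-wide global attribute. This is not the API of the open-source Graph Nets library [4]. All names are hypothetical, attributes are plain numbers, summation is used as the aggregation, and the three update functions are simple stand-ins for the learned neural networks used in practice.

```python
def graph_network_block(nodes, edges, global_attr, edge_fn, node_fn, global_fn):
    """One graph block. `edges` is a list of (sender, receiver, attribute)."""
    # 1. Update each edge from its own attribute, both endpoints, and the global.
    new_edges = [(s, r, edge_fn(e, nodes[s], nodes[r], global_attr))
                 for s, r, e in edges]
    # 2. Update each node from the sum of its incoming updated edges.
    new_nodes = {}
    for name, value in nodes.items():
        incoming = sum(e for _, r, e in new_edges if r == name)
        new_nodes[name] = node_fn(incoming, value, global_attr)
    # 3. Update the global attribute from all updated edges and nodes.
    edge_total = sum(e for _, _, e in new_edges)
    node_total = sum(new_nodes.values())
    new_global = global_fn(edge_total, node_total, global_attr)
    return new_nodes, new_edges, new_global

# Stand-in update functions; a real model would learn these.
def add_edge(edge, sender, receiver, global_attr):
    return edge + sender + receiver + global_attr

def add_node(incoming, node, global_attr):
    return incoming + node + global_attr

def add_global(edge_total, node_total, global_attr):
    return edge_total + node_total + global_attr

nodes = {"a": 1.0, "b": 2.0}
edges = [("a", "b", 0.5)]
new_nodes, new_edges, new_global = graph_network_block(
    nodes, edges, 0.0, add_edge, add_node, add_global)
print(new_nodes)   # {'a': 1.0, 'b': 5.5}
print(new_global)  # 10.0
```

Because the block takes a graph in and returns a graph of the same shape, blocks can be chained or placed inside an encoder-decoder loop, which is the composition the paragraph above describes.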

## Why are the results of DeepMind important?

**First, the interpretation of the deep learning process**

In principle, the results of deep learning models such as CNNs come from the progressive abstraction of images: line segments are abstracted from the original pixel matrix, the outlines of objects from adjacent line segments connected end to end, entities from outlines, and semantics from entities.

However, if you inspect the intermediate results of each layer of a CNN, it is hard to tell which nodes in which layer have abstracted outlines, or which nodes in which layer have abstracted entities. In short, the network structure of a CNN is a black box, and the details of the working process hidden in that structure cannot be clearly explained.

If the details of the working process cannot be explained, humans cannot intervene. If something goes wrong with a CNN, it has to be retrained, yet whether retraining will achieve the desired effect cannot be known in advance. Often it is like pushing down a gourd in water only to have the ladle pop up: fixing one defect causes another.

Conversely, if the details of a CNN's working process could be clearly understood, the parameters of individual nodes at individual layers could be adjusted in a targeted manner, allowing precise intervention in advance.

**Second, small sample learning**

Deep learning relies on training data, which is usually very large, ranging from tens of thousands to millions of samples. Where to collect so much training data, and how to organize enough people to label it, are huge challenges.

If we understand the details of the deep learning process more clearly, we can improve on the brute-force approach of convolution and train a lighter deep learning model with less training data.

Convolution is a brute-force process: it convolves adjacent points indiscriminately.

If we understand the relationships between points more clearly, we need not convolve adjacent points indiscriminately; we only need to convolve, or otherwise process, the points that are actually related.

The network constructed according to the relationships between points is the generalized graph. Its structure is usually simpler than that of a CNN, so it requires less training data.

**Third, transfer learning and reasoning**

Today's CNNs can identify a particular kind of entity, such as a cat, in a large number of pictures.

However, if you want to extend a CNN that recognizes cats so that it can also recognize dogs, additional training data for dog recognition is required. This is the process of transfer learning.

Could we, instead of providing additional training data for recognizing dogs, simply tell the computer the differences between cats and dogs in the form of rules, and then let it recognize dogs? This is the goal of reasoning.

If we understand the deep learning process more precisely, we can integrate knowledge and rules into deep learning.

Broadly speaking, deep learning and knowledge graphs are the two mainstream schools among the many in the machine learning camp. So far the two schools have been rivals, each with its own victories and defeats. How to integrate them so that each benefits from the other's strengths is a problem that has long troubled academia. Extending deep learning to graph processing brings hope for integrating the two schools.

**Fourth, the fusion of space and time, the fusion of pixels and semantics**

Video processing can be said to be the highest level of deep learning.

- Video processing combines the spatial segmentation of images, the recognition of entities in images, and the semantic understanding of entities.
- A video is a sequence of still frames, which is in effect a time series. The positions of the same entity in different frames capture its movement, and behind the movement are the laws of physics and semantics.
- How to summarize a video as a text title, or conversely how to find the most appropriate video for a given text title, is a classic and difficult task in video processing.

References

- [1] Relational inductive biases, deep learning, and graph networks
- [2] Graph neural networks: A review of methods and applications
- [3] A Comprehensive Survey on Graph Neural Networks
- [4] Graph Nets
- [5] Neural message passing for quantum chemistry
- [6] Non-local neural networks

This article is reposted from the WeChat public account Xinzhiyuan.
