Editor's Note: Over the past two years, models such as self-attention mechanisms, graph networks, and relation networks have taken the NLP field by storm, and frameworks built on them, such as Transformer, BERT, and MASS, have gradually become the mainstream approach in NLP. Can these models be equally useful in computer vision? Han Hu, a researcher in the Visual Computing Group at Microsoft Research Asia, was recently invited to a VALSE Webinar to share some of the group's recent work. Their research, together with other work from the same period, shows that these models can also be widely used to model relationships between the basic elements of vision: between objects and objects, between objects and pixels, and between pixels and pixels. Pixel-to-pixel relationship modeling in particular can complement the convolution operation, and may even replace it as the most basic means of image feature extraction.
Should brain and machine intelligence be universal learning machines?
We start with a very interesting experiment. The experiment rewires visual input to the auditory cortex of a mouse brain. After a period of training, the auditory cortex turns out to be able to perform visual perception tasks as well. This raises a question: can machine intelligence also achieve such universality of structure and learning?
The current machine learning paradigm is largely uniform: it generally follows the process of collecting data, labeling it, defining a network structure, and training the network weights with the backpropagation algorithm. The basic models used for different tasks, however, are diverse. Computer vision is currently dominated by convolutional networks, while natural language processing has gone through phases of LSTM, GRU, convolution, and self-attention models. Is there a basic model that can solve different intelligent tasks such as vision, NLP, perception of graph-structured data, and even reasoning?
The most universal model at the moment: relation networks and graph networks
For now, the relation network is the model closest to this goal. Before explaining it, we first clarify some related terms, including graph neural networks and self-attention mechanisms.
The graph neural network is the more general concept: it includes feature representations for nodes, edges, and global attributes. The self-attention model is a special implementation of the graph neural network in which only the nodes carry feature representations, while the edges (that is, the relationships) are computed as the inner product of a key embedding and a query embedding. This makes it a very economical model when the graph is fully connected (all nodes are connected to one another), yet its expressive power is strong enough, because any things or concepts can be made comparable by projecting their features differently (as keys and as queries).
In a general attention mechanism, the key set and the query set are often different, for example a set of words and a set of image patches, or sentences in two different languages. Self-attention is the case where the keys and the queries come from the same set. The recent revolution in NLP lies mainly in discovering the value of the "self"-attention mechanism for encoding the relationships between words in the same sentence. Relation networks and graph neural networks are essentially the same as the self-attention mechanism; their names simply place more emphasis on modeling the relationships between nodes.
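To make the "edges from key and query inner products" idea concrete, here is a minimal sketch in plain Python. The function names, the projection matrices, and the scaling by the square root of the key dimension (borrowed from common Transformer practice) are assumptions for illustration and are not taken from the talk.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [dot(row, vec) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(nodes, w_query, w_key, w_value):
    """Update every node feature from all nodes of a fully connected graph."""
    queries = [matvec(w_query, n) for n in nodes]
    keys = [matvec(w_key, n) for n in nodes]
    values = [matvec(w_value, n) for n in nodes]
    scale = math.sqrt(len(keys[0]))  # assumed scaling, Transformer-style
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Edge weights from this node to every node: query-key inner products.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Aggregate the value features of all nodes with those edge weights.
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs

# Hypothetical toy input: three nodes with 2-d features, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = self_attention(nodes, identity, identity, identity)
```

Note that no edge features are stored anywhere: the relationship between two nodes is recomputed from their own projected features, which is what keeps the model economical on a fully connected graph.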
Applying relation networks to basic visual modeling
Given that relation networks have achieved great success in modeling graph-structured data and NLP sequence data, a natural question is whether this modeling approach also suits vision. Computer vision mainly involves two levels of basic elements: objects and pixels. We therefore studied relationship modeling between objects and objects, between objects and pixels, and between pixels and pixels.
Modeling object-to-object relationships: the first fully end-to-end object detector
Objects are at the core of many visual perception tasks. In the deep learning era, the perception of individual objects has made good progress, but there has been no good tool for modeling the relationships between objects. We proposed a plug-and-play Object Relation Module (ORM) at CVPR 2018. The module is basically an application of the self-attention mechanism. Its main difference from basic self-attention is an added relative geometry term, which we find to be very important for visual problems: the relative position of two objects can help in perceiving each object itself. The module can be easily embedded into existing object detection frameworks (Figure 3 shows the most widely used one, the Faster R-CNN algorithm), both to improve the head network and to replace the hand-crafted duplicate removal module, for which non-maximum suppression (NMS) is currently the common choice. The former replacement means that objects are no longer recognized individually but jointly, and the latter helps achieve the first fully end-to-end object detection system. We also extended the object relation module to the spatial-temporal dimension to address multi-object tracking.
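The role of the relative geometry term can be sketched as follows. This is only an illustration under stated assumptions: the box format `(cx, cy, w, h)`, the log-ratio geometry features, the linear-plus-ReLU geometry weight, and the parameters `geo_w` and `geo_b` are all hypothetical choices, since the text above only says that a relative geometry term is added to self-attention.

```python
import math

def relative_geometry(box_m, box_n, eps=1e-3):
    """4-d relative geometry of box_n w.r.t. box_m; boxes are (cx, cy, w, h)."""
    xm, ym, wm, hm = box_m
    xn, yn, wn, hn = box_n
    return [
        math.log(max(abs(xm - xn), eps) / wm),  # horizontal offset, scale-normalized
        math.log(max(abs(ym - yn), eps) / hm),  # vertical offset, scale-normalized
        math.log(wn / wm),                      # width ratio
        math.log(hn / hm),                      # height ratio
    ]

def relation_weights(appearance_scores, boxes, m, geo_w, geo_b):
    """Attention weights of object m over all objects, gated by geometry."""
    geo = []
    for box_n in boxes:
        g = relative_geometry(boxes[m], box_n)
        # Hypothetical geometry weight: linear map followed by ReLU, so that
        # geometrically implausible pairs can be switched off entirely.
        geo.append(max(0.0, sum(w * x for w, x in zip(geo_w, g)) + geo_b))
    top = max(appearance_scores)
    unnorm = [gw * math.exp(a - top) for gw, a in zip(geo, appearance_scores)]
    total = sum(unnorm)
    if total == 0.0:
        return [0.0] * len(boxes)  # no object is related to object m
    return [u / total for u in unnorm]

# Hypothetical toy scene: two nearby boxes and one far away, equal appearance.
boxes = [(10.0, 10.0, 4.0, 4.0), (12.0, 10.0, 4.0, 4.0), (50.0, 50.0, 4.0, 4.0)]
weights = relation_weights([0.0, 0.0, 0.0], boxes, 0, [-1.0, -1.0, 0.0, 0.0], 1.0)
```

With these toy parameters the distant third box receives zero weight even though its appearance score equals the others, which is the kind of effect an appearance-only attention cannot produce.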
Object-to-pixel relationship modeling
One of the most direct applications of object-to-pixel relationship modeling is extracting object region features from image features. The most commonly used algorithms for this have been RoIPooling and RoIAlign. We used a relation network to adaptively extract region features from image features, and showed that this method outperforms RoIAlign by about 1 mAP on COCO, the standard object detection dataset.
Pixel-to-pixel relationship modeling: local relation networks that replace convolution, and global context networks
Pixel-to-pixel relationship modeling can be used for the most basic local image feature extraction, and it can also be used to extract global information from an image, complementing basic image feature extraction networks (such as convolutional neural networks).
1) Local relation networks: an alternative to convolutional neural networks
Almost all basic image feature extraction methods today use the convolution operator, but convolution is essentially a template matching operator, and its efficiency is low. For example, the three bird heads in Figure 4 differ only by very simple variations, yet three channels are needed to model them. We propose a local relation layer for more efficient image feature extraction, which is essentially based on a relation network. When applying it to the basic problem of pixel-to-pixel relationship modeling, we found the following details to be important. First, relationship modeling should be restricted to a local neighborhood; only this locality constraint creates the information bottleneck that makes the relationships learnable. Second, a learnable geometric prior term needs to be introduced. This term is motivated by the observation that template matching, as used by the prevailing convolution operator, relies heavily on relative positional relationships. Third, scalar keys and queries should be used. In standard relation networks, keys and queries are usually vectors; using scalars saves a great deal of parameters and computation, so that relationships can be modeled within a limited computational budget.
Conceptually, the biggest difference between the local relation layer and convolution is that it computes the composability of two pixels from the pixels' own features, rather than matching against a global template. The upper part of Figure 4 shows some of the learned key and query maps (scalars), with layers going from shallow to deep from left to right. The shallow layers have learned the concepts of edges and interiors, while the deep layers have learned the concept of different objects. The lower right of Figure 4 shows the learned geometric priors, with layers going from shallow to deep from top to bottom. In the shallow layers the geometric priors are concentrated and sparse, suggesting that the geometric prior plays a large role there, while in the deep layers the geometric priors are more blurred, suggesting that the keys and queries play the more important role.
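The three ingredients above (locality, a learnable geometric prior, and scalar keys and queries) can be put together in a small sketch. It assumes a single-channel feature map, a product of scalars as the composability score, and a k×k table `geo_prior` indexed by relative offset; these are simplifying assumptions for illustration, not the layer as implemented by the authors.

```python
import math

def local_relation(feature, w_query, w_key, geo_prior):
    """Aggregate each pixel's local neighborhood with feature-dependent weights.

    feature:   2-d list (H x W) of floats, a single-channel feature map
    w_query:   scalar projection producing a scalar query per pixel
    w_key:     scalar projection producing a scalar key per pixel
    geo_prior: k x k table of prior values indexed by relative position
    """
    h, w = len(feature), len(feature[0])
    r = len(geo_prior) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            query = w_query * feature[y][x]
            scores, values = [], []
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:  # skip out-of-image neighbors
                        key = w_key * feature[ny][nx]
                        # Composability from the two pixels' own features,
                        # plus a prior that depends only on relative position.
                        scores.append(query * key + geo_prior[dy + r][dx + r])
                        values.append(feature[ny][nx])
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            out[y][x] = sum(e * v for e, v in zip(exps, values)) / total
    return out

# Hypothetical toy input: a 3x4 constant map and a flat 3x3 geometric prior.
flat_prior = [[0.0] * 3 for _ in range(3)]
constant_map = [[2.0] * 4 for _ in range(3)]
smoothed = local_relation(constant_map, 0.5, 0.5, flat_prior)
```

Unlike a convolution kernel, the aggregation weights here are recomputed at every position from the pixel values themselves; the only fixed, position-indexed parameters are the entries of the geometric prior.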
The local relation layer can replace all the spatial convolutional layers in a convolutional network, including all 3×3 convolutions and the initial 7×7 convolution, yielding a network with no spatial convolutional layers at all. We call it the local relation network (LR-Net). The left side of Figure 5 shows an example in which all convolutional layers of the ResNet-50 network are replaced with local relation layers. With the same amount of computation, LR-Net has fewer parameters than ResNet. The right side of Figure 5 compares the top-1 accuracy on ImageNet classification of a 26-layer LR-Net and a 26-layer ResNet with standard or depthwise convolution. Even without any geometric prior, LR-Net is already on par with ResNet, and after adding the geometric prior it achieves 2.7% higher accuracy than the ResNet with standard convolution. In addition, the local relation network performs best with a 7×7 neighborhood, while the corresponding standard ResNet performs best at 3×3 and 5×5, which shows that the local relation network can model pixel relationships over a larger range than a ResNet with ordinary convolution operators.
2) Non-local networks meet SE-Net: a more efficient global context network
Non-local networks have achieved very good results on multiple visual perception tasks. The academic community generally attributes this to the modeling of long-range pixel-to-pixel relationships by non-local networks. When visualizing the pixel-to-pixel similarities, however, we found a very different phenomenon. For different query pixels (the red dots in the picture), regardless of whether the query pixel lies in the foreground, on the grass, or in the sky, the attention maps formed by their similarities to the key pixels are almost exactly the same.
A natural question follows: if we explicitly make all query pixels share the same attention map, will performance drop? We found that the answer is no for several important perception tasks, such as image classification, object detection, and action recognition. In other words, even when all query pixels share the same attention map, recognition accuracy does not drop, while the corresponding computation is greatly reduced: even if the block is added to every residual block of a ResNet, the computation of the overall network hardly increases.
We further found that this simplified non-local network (SNL) is structurally very similar to SE-Net, the winning algorithm of the 2017 ImageNet competition. Both first model global context information by aggregating the H×W image features into a global vector; second, they transform this global vector; and finally they fuse the transformed global feature with the original feature at every position of the image. A generic framework for modeling global context information can thus be abstracted. By then choosing the best implementation for each step, we obtain the Global Context Block. On COCO object detection, ImageNet image classification, and action recognition, the resulting network achieves better accuracy than both non-local networks and SE-Net, while its computation remains basically the same as or lower than theirs.
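The three-step framework (context modeling, transformation, fusion) can be sketched as below. The concrete choices in each step are assumptions made for illustration: attention pooling with a single query-independent attention map for step one, a two-layer bottleneck with ReLU for step two, and broadcast addition for step three. The text above does not say which implementation was selected for each step, and the parameter names `w_key`, `w_down`, and `w_up` are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def global_context_block(features, w_key, w_down, w_up):
    """features: list of N = H*W position vectors, each with C channels."""
    channels = len(features[0])

    # Step 1: context modeling. One attention map shared by all query
    # positions pools the N position features into a single global vector.
    scores = [dot(w_key, x) for x in features]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    context = [sum(a * x[d] for a, x in zip(attention, features))
               for d in range(channels)]

    # Step 2: transform the global vector (assumed bottleneck with ReLU).
    hidden = [max(0.0, dot(row, context)) for row in w_down]
    delta = [dot(row, hidden) for row in w_up]

    # Step 3: fusion. The same transformed vector is added at every position.
    return [[x[d] + delta[d] for d in range(channels)] for x in features]

# Hypothetical toy input: two positions with two channels each.
features = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
fused = global_context_block(features, [0.0, 0.0], identity, identity)
```

Because the attention map does not depend on the query position, step one costs on the order of N operations per channel rather than the N×N pairwise similarities of a query-specific attention, which is consistent with the computational savings described above.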
Reference papers:
Han Hu*, Jiayuan Gu*, Zheng Zhang*, Jifeng Dai and Yichen Wei. Relation Networks for Object Detection. In CVPR 2018.
Jiarui Xu, Yue Cao, Zheng Zhang and Han Hu. Spatial-Temporal Relation Networks for Multi-Object Tracking. Tech Report.
Jiayuan Gu, Han Hu, Liwei Wang, Yichen Wei and Jifeng Dai. Learning Region Features for Object Detection. In ECCV 2018.
Han Hu, Zheng Zhang, Zhenda Xie and Stephen Lin. Local Relation Networks for Image Recognition. Tech Report.
Yue Cao*, Jiarui Xu*, Stephen Lin, Fangyun Wei and Han Hu. GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond. Tech Report.
This article is reproduced from the Microsoft Research AI Headlines public account.