CNNs are very good at classifying scrambled images; humans are not.

In this article, the authors show why even state-of-the-art deep neural networks still recognize scrambled images well, and how the reasons behind this reveal that DNNs use an unexpectedly simple strategy to classify natural images.

A paper at ICLR 2019 points out that these findings:

  1. Show that solving ImageNet is much simpler than many people think
  2. Enable us to build more interpretable and transparent image classification pipelines
  3. Explain several phenomena observed in modern CNNs, such as their bias toward textures and their disregard for the spatial ordering of object parts

Retro bag-of-features model

Before the advent of deep learning, object recognition in natural images was rather crude and simple: define a set of key visual features ("words"), count how often each visual feature occurs in the image (the "bag"), and then classify the image based on these counts. These models are therefore called "bag-of-features" models (BoF models).

For example, take an image containing a human eye and a feather, and the classes "human" and "bird". The simplest BoF workflow is: for each eye in the image, add +1 to the evidence for "human"; conversely, for each feather, add +1 to the evidence for "bird"; finally, predict whichever class has accumulated the most evidence, as in the sketch below.
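To make this concrete, here is a minimal toy sketch of such evidence counting (not from the paper; the visual words, weights, and counts are hypothetical and only illustrate the idea):

```python
# Minimal bag-of-features (BoF) toy classifier. We assume some detector has
# already counted how often each visual "word" occurs in the image; the words
# and evidence weights below are hypothetical, for illustration only.
import numpy as np

words = ["eye", "feather"]          # the visual vocabulary
classes = ["human", "bird"]

# Evidence each occurrence of a word contributes to each class.
evidence_per_word = np.array([
    [1.0, 0.0],   # an eye adds +1 evidence for "human"
    [0.0, 1.0],   # a feather adds +1 evidence for "bird"
])

def classify(word_counts):
    """word_counts[i] = how often words[i] occurs in the image (the "bag")."""
    total_evidence = word_counts @ evidence_per_word  # sum evidence over all occurrences
    return classes[int(np.argmax(total_evidence))]

# An image with two eyes and one feather accumulates more "human" evidence.
print(classify(np.array([2.0, 1.0])))  # -> "human"
```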

One of the most attractive properties of this simple BoF model is its interpretable and transparent decision making. We can check exactly which image features carry evidence for a given class, the spatial integration of the evidence is very simple (compared with the deep nonlinear feature integration in deep neural networks), and it is easy to understand how the model reaches its decision.

Traditional BoF models were state of the art and very popular before the deep learning era, but they quickly fell out of favor because of their low classification performance. But how do we know whether deep neural networks actually use a decision strategy that is fundamentally different from a BoF model?

A deep but interpretable BoF network (BagNet)

To test this, the researchers combined the interpretability and transparency of the BoF model with the performance of a DNN:

  • Split the image into small q×q image patches
  • Pass each patch through a DNN to obtain class evidence (logits) for that patch
  • Sum the evidence over all patches to reach an image-level decision
BagNets classification strategy: for each patch, we use a DNN to extract class evidence (logits) and sum this evidence over all patches to reach the image-level decision.
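A rough sketch of this patch-wise strategy is shown below. The function name and the stand-in `patch_net` are ours, not the authors' code; in practice the explicit loop over patches is replaced by a fully convolutional network, as described next:

```python
# Hedged sketch of BagNet-style inference: extract class evidence (logits) for
# every q×q patch and sum it into an image-level decision. `patch_net` is a
# placeholder for the local DNN that maps a patch to class logits.
import torch
import torch.nn as nn

def bagnet_predict(image, patch_net, q=33, stride=8):
    """image: (3, H, W) tensor. Returns image-level logits summed over patches."""
    _, h, w = image.shape
    patch_logits = []
    for top in range(0, h - q + 1, stride):
        for left in range(0, w - q + 1, stride):
            patch = image[:, top:top + q, left:left + q].unsqueeze(0)  # (1, 3, q, q)
            patch_logits.append(patch_net(patch))                      # (1, num_classes)
    # Image-level class evidence = sum of the per-patch evidence.
    return torch.cat(patch_logits, dim=0).sum(dim=0)

# Purely illustrative stand-in for the local network:
patch_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 33 * 33, 1000))
logits = bagnet_predict(torch.randn(3, 224, 224), patch_net)
print(logits.shape)  # torch.Size([1000])
```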


To implement this strategy in the simplest and most efficient way, we take a standard ResNet-50 architecture and replace most (but not all) of its 3×3 convolutions with 1×1 convolutions.

In this case, each hidden unit in the final convolutional layer "sees" only a small part of the image (i.e., its receptive field is much smaller than the image).

This avoids any explicit partitioning of the image and stays as close as possible to a standard CNN while still implementing the strategy outlined above. We call the resulting model BagNet-q, where q is the receptive field size of the topmost layer (we test q = 9, 17, and 33). The runtime of BagNet-q is roughly 2.5 times that of ResNet-50.
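As a loose illustration of this modification (not the authors' reference implementation, which differs in further details), one could shrink the receptive fields of a torchvision ResNet-50 roughly like this:

```python
# Sketch: turn a standard ResNet-50 into a BagNet-like network by replacing
# most (but not all) of its 3×3 convolutions with 1×1 convolutions, so that
# units in the last convolutional layer see only a small image patch.
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights=None)

kept = 0
for name, module in model.named_modules():
    # The 3×3 convolutions inside the bottleneck blocks are named "conv2".
    if isinstance(module, nn.Conv2d) and module.kernel_size == (3, 3) and name.endswith("conv2"):
        if kept < 3:  # keep a few early 3×3 convs; each kept 3×3 conv enlarges q
            kept += 1
            continue
        parent = model.get_submodule(name.rsplit(".", 1)[0])
        setattr(parent, "conv2", nn.Conv2d(module.in_channels, module.out_channels,
                                           kernel_size=1, stride=module.stride, bias=False))
```

Because ResNet-50 already ends in global average pooling followed by a linear classifier, the modified network aggregates its local evidence linearly, mirroring the summation step of the BagNet strategy.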

The performance of BagNets with different patch sizes on ImageNet.


Even for very small patch sizes, the performance of BagNets on ImageNet is impressive: image features of size 17×17 pixels are sufficient for AlexNet-level performance, while features of size 33×33 pixels are sufficient for roughly 87% top-5 accuracy. Higher values could likely be reached by placing the 3×3 convolutions more carefully and with additional hyperparameter tuning.

This is the first important result: ImageNet can be solved using only a collection of small local image features. Long-range spatial relationships such as object shape or the relationships between object parts can be completely ignored; they are not needed to solve the task.

A major feature of BagNets is their transparent decision making. For example, we can now look at which image features are most predictive for a given class.

Image features with the largest class evidence. We show features that correctly predict the class (top row) and distracting features that predict the wrong class (bottom row).

In the figure above, the finger images at the top are identified as belonging to the class "tench" (a freshwater fish popular with anglers), because most images of this class show a fisherman holding the tench up like a trophy.

Similarly, we also obtain a precisely defined heat map that shows which parts of the image drive the neural network's decision.

Heat maps from BagNets show the contribution of each image part to the decision. The heat maps are not approximations; they show the true contribution of each image part.
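Because the image-level evidence is just the sum of the per-patch evidence, such heat maps can be read off exactly by recording the patch logits instead of summing them away. A small sketch, reusing the hypothetical `patch_net` from the earlier snippet:

```python
# Exact evidence heat map: the value at each grid position is the class
# evidence contributed by the corresponding q×q patch (no approximation).
import torch

def evidence_heatmap(image, patch_net, class_idx, q=33, stride=8):
    """image: (3, H, W). Returns a 2-D grid of per-patch evidence for class_idx."""
    _, h, w = image.shape
    rows = []
    with torch.no_grad():
        for top in range(0, h - q + 1, stride):
            row = []
            for left in range(0, w - q + 1, stride):
                patch = image[:, top:top + q, left:left + q].unsqueeze(0)
                row.append(patch_net(patch)[0, class_idx])
            rows.append(torch.stack(row))
    return torch.stack(rows)  # visualize e.g. with matplotlib's imshow
```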

ResNet-50 is amazingly similar to BagNets

BagNets show that high accuracy on ImageNet can be reached based only on weak statistical correlations between local image features and object categories.

If that is enough, why should standard deep networks like ResNet-50 learn anything fundamentally different? If rich local image features are sufficient to solve the task, why would ResNet-50 need to understand complex large-scale relationships such as the shapes of objects?

To test the hypothesis that modern DNNs follow a strategy similar to a simple bag-of-features network, we tested different ResNets, DenseNets, and VGG networks for the following "signatures" of BagNets (a sketch of one of these tests follows the list):

  • Decisions should be invariant to a spatial shuffling of image features (this can only be tested on VGG models)
  • Modifications of different image parts should be independent in terms of their effect on the total class evidence
  • The errors made by standard CNNs and BagNets should be similar
  • Standard CNNs and BagNets should be sensitive to similar image features
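As an example, here is a rough sketch (with placeholder names) of the second signature, testing how strongly the effects of occluding two different image regions interact; for a true bag-of-features model the interaction is zero by construction:

```python
# Interaction test: is evidence(occlude A and B) ≈ evidence(occlude A)
# + evidence(occlude B) - evidence(original)? `model`, the class index, and
# the region coordinates are placeholders.
import torch

def class_evidence(model, image, class_idx):
    with torch.no_grad():
        return model(image.unsqueeze(0))[0, class_idx].item()

def occlude(image, top, left, size=33):
    out = image.clone()
    out[:, top:top + size, left:left + size] = 0.0  # simple constant occluder
    return out

def interaction(model, image, class_idx, region_a, region_b):
    e0 = class_evidence(model, image, class_idx)
    ea = class_evidence(model, occlude(image, *region_a), class_idx)
    eb = class_evidence(model, occlude(image, *region_b), class_idx)
    eab = class_evidence(model, occlude(occlude(image, *region_a), *region_b), class_idx)
    # 0 for a pure bag-of-features model; the experiments ask how far from 0
    # standard CNNs are.
    return eab - (ea + eb - e0)
```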

In all four experiments we found that CNNs and BagNets behave very similarly. For example, in the last experiment we showed that the image parts to which BagNets are most sensitive (e.g., the parts whose occlusion changes the prediction most) are essentially the same parts to which CNNs are most sensitive.

In fact, the heat maps of BagNets (spatial maps of sensitivity) predict the sensitivity of DenseNet-169 better than heat maps generated by attribution methods such as DeepLIFT, which compute heat maps directly for DenseNet-169.

Of course, DNNs do not behave exactly like bag-of-features models; they do show some deviations. In particular, the deeper the network, the larger the features and the longer the long-range dependencies.

Deeper neural networks therefore do improve over simpler bag-of-features models, but we do not think the core classification strategy has really changed.

Explaining a few strange phenomena of CNNs

Viewing CNN decisions as a BoF-like strategy explains several strange observations about CNNs. First, it explains why CNNs have such a strong texture bias. Second, it explains why CNNs are so insensitive to the shuffling of image parts. It may even explain the existence of general adversarial stickers and adversarial perturbations: misleading signals can be placed anywhere in the image, and the CNN will still reliably pick them up, whether or not they fit the rest of the image.

Our results show that CNNs exploit the many weak statistical regularities present in natural images for classification and do not, like humans, make the jump to object-level integration of image parts. The same is likely true for other tasks and sensory modalities.

We must think carefully about how to structure our architectures, tasks, and learning methods to counteract this tendency to rely on weak statistical correlations. One way is to improve the inductive biases of CNNs, moving them away from small local features toward more global features; another is to remove, or replace, the features the network should not rely on.

However, one of the biggest problems is of course the image classification task itself: if local image features are sufficient to solve the task, there is no incentive to learn the true "physics" of the natural world. We therefore have to restructure the task in a way that pushes models to learn the physical nature of objects.

This will probably require going beyond purely observational learning of correlations between input and output features, so that models can extract causal dependencies.

Final Thoughts

In summary, our results suggest that CNNs may follow an extremely simple classification strategy. That such a discovery can still be a focus of attention in 2019 highlights how little we understand about the inner workings of deep neural networks.

This lack of understanding prevents us from developing fundamentally better models and architectures that close the gap between humans and machines. Deepening our understanding will enable us to find ways to bridge this gap.

This can bring unusually rich returns: when CNNs are biased toward more physical properties of objects, they suddenly reach noise robustness close to that of humans.

We look forward to more exciting results in this field in 2019, and to convolutional neural networks that truly understand the physical, causal nature of the real world.

This article is republished from the WeChat public account Xinzhiyuan (新智元).