Background

Geoffrey Hinton is one of the pioneers of deep learning and an inventor of classic neural-network algorithms such as backpropagation. He and his team have proposed a new neural network based on a structure called capsules, together with a dynamic routing algorithm between capsules for training capsule networks.

Research problem

Traditional CNNs have deficiencies (explained in detail below). To address them, Hinton proposed a network that is more effective for image processing: the capsule network. It retains the advantages of CNNs while also preserving the relative position, angle, and other information that CNNs lose, which improves recognition.

Research motivation

CNN flaws

A CNN focuses on detecting important features in image pixels. Consider a simple face-detection task: a face consists of an oval representing the face shape, two eyes, a nose, and a mouth. By the working principle of a CNN, as long as these objects are present there is a strong activation, so the spatial relationship between the objects is not so important.
As shown in the figure below, the picture on the right is not a human face, yet it contains all the objects a face needs. It is therefore very likely that a CNN activates the "face" judgment from those objects alone and so judges wrongly.
Re-examine how a CNN works. High-level features are weighted sums of combinations of low-level features: the activations of the previous layer are multiplied by the weights of the next layer's neurons, summed, and passed through a non-linear activation function. In such an architecture, the positional relationship between high-level and low-level features becomes blurred (in my view some of it survives but is poorly used). The way a CNN addresses this is to widen the receptive field of the next convolution kernel through max pooling or strided convolution (though in my view max pooling loses information, sometimes even important information).
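The information loss from max pooling can be seen in a small sketch (NumPy, with hypothetical toy feature maps; not code from the paper): two inputs whose strong activation sits at different positions inside the same pooling window produce identical pooled outputs.

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling with a 2x2 window and stride 2."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two 4x4 feature maps: the strong activation (9) sits at different
# positions inside the same 2x2 pooling window.
a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0]])
b = np.array([[0, 0, 0, 0],
              [0, 9, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0]])

# Identical pooled outputs: the exact position within each window is lost.
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```

The layer after pooling cannot tell where inside each window the feature actually was, which is exactly the positional information capsules try to keep.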

Inverse graphics

Computer graphics constructs a visual image from a hierarchical representation of geometric data, and this structure takes the relative positions of objects into account: the relative position and orientation of geometric objects are represented by matrices, and rendering software accepts these representations as input and turns them into an image on the screen.
Hinton was inspired by this, arguing that the brain does exactly the opposite of rendering, which he called inverse graphics. From the visual information received by the eye, the brain parses a hierarchical representation of the world and tries to match it against patterns it has learned and stored. Notably, the representation of objects in the brain does not depend on the viewing angle.
The question, then, is how to model these hierarchical relationships in a neural network. In computer graphics, the relationship between three-dimensional objects can be represented by poses, whose essence is translation plus rotation. Hinton proposed that preserving the hierarchical pose relationships between object parts is important for correctly classifying and recognizing objects. The capsule network incorporates these relative relationships between objects, represented numerically as a 4 × 4 pose matrix. Once a model has pose information, it can easily understand that what it sees is something it has seen before, only from a different viewpoint. As shown in the figure below, the human eye easily recognizes the Statue of Liberty from different angles, but this is hard for a CNN; a capsule network, which aggregates pose information, can also recognize the Statue of Liberty from a different angle.
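The pose idea can be made concrete using the standard computer-graphics convention the text refers to: in homogeneous coordinates, a pose (rotation plus translation) is a 4 × 4 matrix. This is a minimal sketch of that convention, not code from the paper:

```python
import numpy as np

# A pose combines rotation and translation; in homogeneous coordinates
# it is a 4x4 matrix (standard computer-graphics convention).
theta = np.pi / 2  # rotate 90 degrees about the z-axis
pose = np.array([
    [np.cos(theta), -np.sin(theta), 0, 2.0],  # last column: translation
    [np.sin(theta),  np.cos(theta), 0, 0.0],
    [0,              0,             1, 0.0],
    [0,              0,             0, 1.0],
])

# Transform a 3-D point (1, 0, 0): rotate, then translate.
p = np.array([1.0, 0.0, 0.0, 1.0])  # homogeneous coordinates
print(pose @ p)  # [2. 1. 0. 1.]
```

Composing such matrices is how graphics software expresses "the eye is part of the face, the face is part of the body" as a hierarchy of relative poses.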

Capsule network advantages

  • Because the capsule network aggregates pose information, it can learn a good representation from a small amount of data, which is a big improvement over CNNs. For example, to recognize handwritten digits the human brain needs dozens to at most hundreds of examples, while a CNN needs tens of thousands of examples to train well, which is clearly brute force.
  • Its way of thinking is closer to the human brain, better modeling the hierarchical relationships of internal knowledge representation in neural networks. The intuition behind capsules is simple and elegant.

Capsule network disadvantages

  • The current implementation of the capsule network is much slower than other modern deep learning models (in my view this comes from iteratively updating the coupling coefficients and from the stacked convolution layers); improving training efficiency is a big challenge.

Research Projects

What is a capsule

The following excerpt on the concept of capsules is from "Transforming Autoencoders" by Hinton et al.:

Artificial neural networks should not pursue viewpoint invariance in the activities of "neurons" (using a single scalar output to summarize the activities of a local pool of repeated feature detectors). Instead they should use local "capsules" that perform some fairly complex internal computation on their input and then encapsulate the results of this computation into a small vector of highly informative output. Each capsule learns to recognize an implicitly defined visual entity over a limited range of viewing conditions and deformations, and it outputs both the probability that the entity is present within that limited range and a set of "instance parameters" that may include the precise pose, lighting conditions, and deformation information relative to an implicitly defined canonical version of this visual entity. When the capsule works properly, the probability that the visual entity exists is locally invariant: it does not change as the entity moves over the appearance manifold within the limited range covered by the capsule. The instance parameters, however, are "equivariant": as the viewing conditions change and the entity moves over the appearance manifold, the instance parameters change accordingly, because they represent the intrinsic coordinates of the entity on the appearance manifold.

In simple terms, it can be understood as:
  • An artificial neuron outputs a single scalar. A convolutional network uses convolution kernels, so the results computed by the same kernel over every region of a two-dimensional matrix are stacked together to form the output of the convolution layer.
  • Max pooling is used to achieve viewpoint invariance. Because max pooling keeps scanning regions of the two-dimensional matrix and selects the largest number in each region, it satisfies the activity invariance we want (that is, if we slightly adjust the input, the output stays the same). In other words, if the object we want to detect is slightly shifted in the input image, the model can still detect it.
  • But the pooling layer loses valuable information and does not account for the relative spatial relationships between encoded features. Therefore we should use capsules: all the important information about the state of the feature a capsule detects is encapsulated in vector form (whereas a neuron's output is a scalar).
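The vector output is usually paired with the paper's "squashing" non-linearity, which maps a capsule vector's length into a probability-like value in [0, 1) while preserving its direction. A minimal sketch:

```python
import numpy as np

def squash(s, eps=1e-8):
    """CapsNet squashing non-linearity: shrinks short vectors toward
    length 0 and long vectors toward length 1, keeping the direction."""
    norm2 = np.sum(s ** 2)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

short = squash(np.array([0.1, 0.0]))
long_ = squash(np.array([10.0, 0.0]))
print(np.linalg.norm(short))  # ≈ 0.0099  (entity probably absent)
print(np.linalg.norm(long_))  # ≈ 0.9901  (entity probably present)
```

The length of the output vector plays the role a neuron's scalar activation plays in a CNN, while the direction carries the instance parameters.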
The comparison of capsules and artificial neurons is as follows:

Inter-capsule dynamic routing algorithm

A low-level capsule needs to decide how to send its output vector to the high-level capsules. It does so by multiplying its output vector by a scalar weight c_ij and sending the result to high-level capsule j as that capsule's input. About the weights c_ij, note that:
  • The weights are non-negative scalars.
  • For each low-level capsule i, the weights c_ij sum to 1.
  • For each low-level capsule i, the number of weights equals the number of high-level capsules.
  • These weights are determined by an iterative dynamic routing algorithm.
In effect, a low-level capsule sends its output to the high-level capsule that "agrees" with it. The algorithm's pseudocode is as follows:
The weight update can be understood intuitively from the figure below.
The outputs of two high-level capsules are shown as purple vectors v_1 and v_2. The orange vectors represent the input received from one low-level capsule, and the black vectors the input received from the other low-level capsules. On the left, the purple output v_1 and the orange input u_{1|1} point in opposite directions, so they are not similar: their dot product is negative, and the update reduces the routing coefficient c_11. On the right, the purple output v_2 and the orange input u_{2|1} point in the same direction, so they are similar, and the update increases the routing coefficient c_12. Repeating this process over all high-level capsules and all of their inputs yields a set of routing coefficients that best matches the outputs of the low-level capsules with the outputs of the high-level capsules.
How many routing iterations should be used? The paper tested a range of values on the MNIST and CIFAR datasets and reached the following conclusions:
  • More iterations often lead to overfitting
  • 3 iterations are recommended in practice
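The routing procedure described above can be sketched in NumPy. This is a toy implementation with random prediction vectors: the shapes, softmax over coupling coefficients, and agreement update follow the description in the text, not the paper's exact code.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squashing non-linearity applied to the high-level capsule outputs."""
    norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing by agreement.  u_hat[i, j] is the prediction vector from
    low-level capsule i for high-level capsule j
    (shape: n_low x n_high x dim)."""
    n_low, n_high, _ = u_hat.shape
    b = np.zeros((n_low, n_high))                 # raw routing logits
    for _ in range(num_iters):
        # softmax over high-level capsules: c_ij sums to 1 for each i
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[:, :, None] * u_hat).sum(axis=0)   # weighted sum of inputs
        v = squash(s)                             # high-level outputs
        b = b + (u_hat * v[None, :, :]).sum(-1)   # dot-product agreement
    return v, c

# Toy example (shapes only; hypothetical random values, not trained ones):
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 2, 4))  # 6 low-level capsules, 2 high-level, 4-D
v, c = dynamic_routing(u_hat)
print(v.shape, c.shape)  # (2, 4) (6, 2)
```

Note how the agreement term raises b_ij when a prediction vector points the same way as the high-level output, exactly the c_11 / c_12 behavior described above.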

Overall framework

CapsNet consists of two parts, an encoder and a decoder. The first 3 layers form the encoder and the last 3 layers the decoder:
  • First layer: convolution layer
  • Second layer: PrimaryCaps layer
  • Third layer: DigitCaps layer
  • Fourth layer: first fully connected layer
  • Fifth layer: second fully connected layer
  • Sixth layer: third fully connected layer


The encoder takes a 28 × 28 MNIST digit image as input and encodes it as a 16-dimensional vector of instance parameters.

Convolution layer

  • Input: 28 × 28 images (monochrome)
  • Output: 20 × 20 × 256 tensor
  • Convolution kernels: 256 kernels of size 9 × 9 × 1, stride 1
  • Activation function: ReLU
PrimaryCaps layer (32 capsules)
  • Input: 20 × 20 × 256 tensor
  • Output: 6 × 6 × 8 × 32 tensor (32 capsules in total)
  • Convolution kernels: 8 kernels of size 9 × 9 × 256 per capsule, stride 2
DigitCaps layer (10 capsules)
  • Input: 6 × 6 × 8 × 32 tensor
  • Output: 16 × 10 matrix
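The layer sizes above are consistent with simple valid-convolution arithmetic; a quick sanity check (note that PrimaryCaps needs stride 2 to go from 20 × 20 down to 6 × 6):

```python
# Output size of a valid (no padding) convolution.
def conv_out(size, kernel, stride):
    return (size - kernel) // stride + 1

# Conv layer: 28x28 image, 256 kernels of 9x9, stride 1 -> 20x20x256
assert conv_out(28, 9, 1) == 20

# PrimaryCaps: 9x9 kernels, stride 2, over the 20x20 maps -> 6x6,
# with 8x32 = 256 output channels grouped into 32 capsules of 8-D vectors
assert conv_out(20, 9, 2) == 6

# DigitCaps: 6*6*32 = 1152 primary capsule vectors are routed to
# 10 digit capsules of 16 dimensions each -> a 16x10 output
assert 6 * 6 * 32 == 1152
print("encoder shapes check out")
```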

Loss function


The decoder accepts a 16-dimensional vector from the correct DigitCap and learns to encode it into a digital image (note that only the correct DigitCap vector is used during training, and the incorrect DigitCap is ignored). The decoder is used as a regularizer. It accepts the output of the correct DigitCap as input, reconstructs a 28 × 28 pixel image, and the loss function is the Euclidean distance between the reconstructed image and the input image. The decoder forces the capsule to learn features useful for reconstructing the original image. The closer the reconstructed image is to the input image, the better. An example of the reconstructed image is shown below.
First fully connected layer
  • Input: 16 × 10 matrix
  • Output: 512-dimensional vector
Second fully connected layer
  • Input: 512-dimensional vector
  • Output: 1024-dimensional vector
Third fully connected layer
  • Input: 1024-dimensional vector
  • Output: 784-dimensional vector
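The decoder's three fully connected layers can be sketched as a forward pass with untrained random weights (shapes only; ReLU on the hidden layers and a sigmoid output producing pixel values in [0, 1] are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b, relu=True):
    """One fully connected layer with optional ReLU."""
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Untrained random weights, for shape checking only (not the trained model).
w1, b1 = rng.normal(size=(160, 512)) * 0.01, np.zeros(512)
w2, b2 = rng.normal(size=(512, 1024)) * 0.01, np.zeros(1024)
w3, b3 = rng.normal(size=(1024, 784)) * 0.01, np.zeros(784)

digit_caps = rng.normal(size=(16, 10))         # output of DigitCaps
x = digit_caps.flatten()                       # 16x10 -> 160-D vector
h = dense(dense(x, w1, b1), w2, b2)
recon = sigmoid(dense(h, w3, b3, relu=False))  # 784 = 28*28 pixel values
print(recon.shape)  # (784,)
```

The 784-dimensional output is reshaped to 28 × 28, and the reconstruction loss is the squared Euclidean distance between this image and the input image, as described above.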
