Background
Research problem
Research motivation
CNN flaws
Inverse graphics
Capsule network advantages

Because the capsule network captures pose information, it can learn a good representation from a small amount of data, which is a big improvement over CNNs. For example, to recognize handwritten digits the human brain needs only dozens to hundreds of examples, whereas a CNN needs tens of thousands of training samples to reach good results, which is clearly brute force! 
The capsule approach is closer to how the human brain thinks, and it better models the hierarchical relationships in the network's internal knowledge representation. The intuition behind capsules is simple and elegant.
Cons of Capsule Network

The current implementation of capsule networks is much slower than other modern deep learning models (I think this is the effect of iteratively updating the coupling coefficients, plus the stacked convolution layers), so improving training efficiency is a major challenge.
Research Projects
What is a capsule
Artificial neural networks should not pursue viewpoint invariance in the activities of "neurons" (using a single scalar output to summarize the activity of a local pool of repeated feature detectors). Instead, they should use local "capsules" that perform fairly complex internal computations on their input and then encapsulate the results of these computations into a small vector of highly informative outputs. Each capsule learns to recognize an implicitly defined visual entity over a limited range of viewing conditions and deformations, and it outputs both the probability that the entity is present within that limited range and a set of "instantiation parameters". These parameters may include the precise pose, lighting conditions, and deformation of the entity relative to an implicitly defined canonical version. When the capsule works correctly, the probability that the visual entity exists is locally invariant: as the entity moves over the appearance manifold within the limited range covered by the capsule, the probability does not change. The instantiation parameters, however, are "equivariant": as the viewing conditions change and the entity moves over the appearance manifold, the instantiation parameters change correspondingly, because they represent the intrinsic coordinates of the entity on the appearance manifold.

Artificial neurons output a single scalar. A convolutional network slides the same convolution kernel over every region of a two-dimensional matrix, and the results computed by each kernel are stacked together to form the output of the convolution layer. 
Max pooling is used to achieve a form of viewpoint invariance. Because max pooling repeatedly scans regions of the two-dimensional matrix and selects the largest value in each region, it provides the activity invariance we want (that is, if we slightly change the input, the output stays the same). In other words, if we slightly transform the object we want to detect in the input image, the model can still detect it. 
However, the pooling layer loses valuable information and does not encode the relative spatial relationships between features. This is why we should use capsules: all the important information about the state of the feature a capsule detects is encapsulated by the capsule in vector form (whereas neurons output scalars).
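To make a capsule's vector length readable as an existence probability, the CapsNet paper normalizes it with a squashing non-linearity, v = (‖s‖² / (1 + ‖s‖²)) · (s / ‖s‖). A minimal NumPy sketch (the function name `squash` and the epsilon guard are my additions; the formula is the paper's):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink short vectors toward length 0 and long vectors toward
    length 1, preserving direction: v = (|s|^2 / (1 + |s|^2)) * s / |s|."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / np.sqrt(sq_norm + eps)

# A capsule output of length 5 is squashed to length 25/26 (close to 1),
# while its direction (the instantiation parameters) is unchanged.
v = squash(np.array([3.0, 4.0]))
```

The length of `v` never reaches 1, so it behaves like a probability, while the direction of the vector still carries the pose information.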
Inter-capsule dynamic routing algorithm

The coupling coefficients c_{ij} are non-negative scalars 
For each lower-level capsule i, the coefficients c_{ij} sum to 1 over the higher-level capsules j 
For each lower-level capsule i, the number of coefficients equals the number of higher-level capsules 
These weights are determined by an iterative dynamic routing algorithm

More iterations often lead to overfitting 
3 iterations are recommended in practice
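The properties above can be sketched as routing-by-agreement: the logits b_ij start at zero, the coupling coefficients c_ij are their softmax over the higher-level capsules (so they are non-negative and sum to 1 for each lower capsule i), and each iteration increases b_ij by the agreement between the prediction vector û_{j|i} and the current higher-capsule output v_j. A minimal NumPy sketch (shapes and function names are my illustration, not the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, axis=-1, eps=1e-8):
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, n_iters=3):
    """Routing-by-agreement.
    u_hat: (n_lower, n_upper, dim) prediction vectors u_hat_{j|i}.
    Returns the upper-capsule outputs v_j, shape (n_upper, dim)."""
    n_lower, n_upper, dim = u_hat.shape
    b = np.zeros((n_lower, n_upper))              # routing logits, start at 0
    for _ in range(n_iters):                      # 3 iterations in practice
        c = softmax(b, axis=1)                    # sum_j c_ij = 1 for each i
        s = (c[..., None] * u_hat).sum(axis=0)    # weighted sum over lower capsules
        v = squash(s)                             # (n_upper, dim)
        b = b + (u_hat * v[None]).sum(axis=-1)    # add agreement <u_hat_{j|i}, v_j>
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(1152, 10, 16)))  # 1152 = 6*6*32 PrimaryCaps
```

Note that the coupling coefficients are recomputed by the routing loop at inference time; only the transformation matrices that produce `u_hat` are learned by backpropagation.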
Overall framework

First layer: convolution layer 
Second layer: PrimaryCaps layer 
Third layer: DigitCaps layer 
Fourth layer: first fully connected layer 
Fifth layer: second fully connected layer 
Sixth layer: third fully connected layer
Encoder
Convolution layer

Input: 28 × 28 images (monochrome) 
Output: 20 × 20 × 256 tensor 
Convolution kernels: 256 kernels of size 9 × 9 × 1 with stride 1 
Activation function: ReLU

PrimaryCaps layer

Input: 20 × 20 × 256 tensor 
Output: 6 × 6 × 8 × 32 tensor (32 capsules in total) 
Convolution kernels: 8 kernels of size 9 × 9 × 256 with stride 2 per capsule

DigitCaps layer

Input: 6 × 6 × 8 × 32 tensor 
Output: 16 × 10 matrix
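The spatial sizes listed above follow from the "valid" (no padding) convolution formula, (size − kernel) / stride + 1; note that the PrimaryCaps convolution needs stride 2 to reduce 20 × 20 to 6 × 6. A quick sanity check:

```python
def conv_out(size, kernel, stride):
    """Output spatial size of a 'valid' convolution (no padding)."""
    return (size - kernel) // stride + 1

# Conv layer: 28x28x1 -> 20x20x256 (256 kernels of 9x9, stride 1)
assert conv_out(28, 9, 1) == 20
# PrimaryCaps: 20x20x256 -> 6x6x(8x32) (9x9 kernels, stride 2)
assert conv_out(20, 9, 2) == 6
# DigitCaps: 6*6*32 = 1152 lower capsules routed to 10 capsules of dimension 16
assert 6 * 6 * 32 == 1152
```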
Loss function
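In the CapsNet paper the lengths of the DigitCaps vectors are trained with a per-class margin loss, L_k = T_k max(0, m⁺ − ‖v_k‖)² + λ (1 − T_k) max(0, ‖v_k‖ − m⁻)², with m⁺ = 0.9, m⁻ = 0.1, λ = 0.5 (the reconstruction loss of the decoder is added with a small weight). A NumPy sketch of the margin term (the function name is mine):

```python
import numpy as np

def margin_loss(v_norms, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss over digit capsules.
    v_norms: (batch, 10) lengths of the DigitCaps output vectors.
    targets: (batch, 10) one-hot labels T_k."""
    pos = targets * np.maximum(0.0, m_pos - v_norms) ** 2          # present class
    neg = lam * (1 - targets) * np.maximum(0.0, v_norms - m_neg) ** 2  # absent classes
    return (pos + neg).sum(axis=-1)                                # per-example loss
```

The λ down-weighting of absent classes keeps the loss from shrinking all capsule lengths early in training.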
Decoder

Input: 16 × 10 matrix 
Output: 512-dimensional vector

Input: 512-dimensional vector 
Output: 1024-dimensional vector

Input: 1024-dimensional vector 
Output: 784-dimensional vector (reshaped to a 28 × 28 reconstructed image)