Background
Research problem
Research motivation
CNN flaws
Inverse graphics
Capsule network advantages
- Because a capsule network captures pose information, it can learn good representations from a small amount of data, which is a big improvement over CNNs. For example, to recognize handwritten digits the human brain needs only dozens to a few hundred examples, whereas a CNN needs tens of thousands of training samples to reach good results, which is clearly brute force.
- Its way of "thinking" is closer to the human brain's: the hierarchical relationships in the network's internal knowledge representation are modeled better. The intuition behind capsules is simple and elegant.
Capsule network disadvantages
- Current implementations of capsule networks are much slower than other modern deep learning models (I suspect this comes from the iterative update of the coupling coefficients on top of the stacked convolution layers); improving training efficiency is a major challenge.
Research Projects
What is a capsule
Artificial neural networks should not pursue viewpoint invariance in the activities of "neurons" (using a single scalar output to summarize the activity of a local pool of repeated feature detectors). Instead, they should use local "capsules" that perform fairly complex internal computations on their input and then encapsulate the results into a small vector of highly informative outputs. Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations. It outputs both the probability that the entity is present within that limited domain and a set of "instantiation parameters", which may include the precise pose, lighting conditions, and deformation of the entity relative to an implicitly defined canonical version of it. When the capsule works properly, the probability that the visual entity exists is locally invariant: as the entity moves over the appearance manifold within the limited domain covered by the capsule, the probability does not change. The instantiation parameters, in contrast, are "equivariant": as the viewing conditions change and the entity moves over the appearance manifold, the instantiation parameters change correspondingly, because they represent the entity's intrinsic coordinates on the appearance manifold.
- Artificial neurons output a single scalar. A convolutional network applies the same convolution kernel to every region of a two-dimensional input and stacks the resulting feature maps to form the output of the convolution layer.
- Max pooling is used to achieve viewpoint invariance: since max pooling scans regions of the two-dimensional matrix and keeps only the largest value in each region, it provides the activity invariance we want (slightly perturbing the input leaves the output unchanged). In other words, if the object we want to detect is shifted slightly in the input image, the model can still detect it.
- However, pooling throws away valuable information and does not encode the relative spatial relationships between features. This is why we should use capsules: all the important information about the state of the feature a capsule detects is encapsulated in a vector (whereas a neuron's output is a scalar), as the sketch below illustrates.
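To make the vector-output idea concrete: in the CapsNet paper, the length of a capsule's output vector encodes the probability that the entity exists, while its orientation encodes the instantiation parameters. The raw vector is passed through a "squashing" nonlinearity that preserves orientation while compressing the length into [0, 1). A minimal NumPy sketch (the function name and epsilon are my own choices):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """v = (|s|^2 / (1 + |s|^2)) * s / |s| -- keeps the vector's
    orientation and maps its length into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)
```

Short vectors are shrunk toward zero length (entity probably absent) and long vectors saturate just below 1 (entity probably present).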
Inter-capsule dynamic routing algorithm
- The weights are non-negative scalars.
- For each lower-level capsule i, the weights c_ij sum to 1.
- For each lower-level capsule i, the number of weights equals the number of higher-level capsules.
- These weights are determined by an iterative dynamic routing algorithm.
- More routing iterations tend to cause overfitting.
- In practice, 3 iterations are recommended (see the sketch below).
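Here is a minimal NumPy sketch of the routing-by-agreement loop described above. The shapes and names (u_hat for the prediction vectors from lower-level capsules, b for the routing logits, c for the coupling coefficients c_ij) are illustrative:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):  # as in the earlier sketch
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: prediction vectors, shape (num_lower, num_upper, dim).
    Returns the higher-level capsule outputs, shape (num_upper, dim)."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))         # routing logits, start uniform
    for _ in range(num_iters):
        # softmax over higher-level capsules: for each lower capsule i,
        # the coupling coefficients c_ij are non-negative and sum to 1
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[..., None] * u_hat).sum(axis=0)   # weighted sum per upper capsule
        v = squash(s)
        # agreement update: increase b_ij when prediction u_hat_ij
        # points in the same direction as the output v_j
        b += np.einsum('ijd,jd->ij', u_hat, v)
    return v
```

Note that c is recomputed by softmax on every iteration while b accumulates agreement; this is what "iterative" means in the bullets above, and num_iters=3 matches the recommendation.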
Overall framework
- First layer: convolution layer
- Second layer: PrimaryCaps layer
- Third layer: DigitCaps layer
- Fourth layer: first fully connected layer
- Fifth layer: second fully connected layer
- Sixth layer: third fully connected layer

The first three layers form the encoder; the last three form the decoder.
Encoder
Convolution layer
- Input: 28×28 image (monochrome)
- Output: 20×20×256 tensor
- Kernels: 256 kernels of size 9×9×1, stride 1
- Activation function: ReLU
PrimaryCaps layer
- Input: 20×20×256 tensor
- Output: 6×6×8×32 tensor (32 capsules in total)
- Kernels: 8 kernels of size 9×9×256 per capsule, stride 2
-
Enter: 6 × 6 × 8 × 32 tensor -
Output: 16 × 10 matrix
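Putting the three encoder layers together, here is a PyTorch sketch of the shape flow under the hyperparameters listed above (the reshape order, weight initialization, and batch handling are my own illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * s / torch.sqrt(sq + eps)

class CapsNetEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Conv1: (1, 28, 28) -> (256, 20, 20), 9x9 kernels, stride 1
        self.conv1 = nn.Conv2d(1, 256, kernel_size=9, stride=1)
        # PrimaryCaps: (256, 20, 20) -> (32*8, 6, 6), 9x9 kernels, stride 2
        self.primary = nn.Conv2d(256, 32 * 8, kernel_size=9, stride=2)
        # Transform each of the 6*6*32 = 1152 primary capsules (8-D)
        # into a 16-D prediction for each of the 10 digit capsules
        self.W = nn.Parameter(0.01 * torch.randn(1152, 10, 16, 8))

    def forward(self, x):                      # x: (batch, 1, 28, 28)
        n = x.size(0)
        x = F.relu(self.conv1(x))              # (batch, 256, 20, 20)
        u = self.primary(x)                    # (batch, 256, 6, 6)
        u = u.view(n, 32, 8, 6, 6).permute(0, 1, 3, 4, 2)
        u = squash(u.reshape(n, 1152, 8))      # 1152 primary capsules, 8-D
        u_hat = torch.einsum('nio,ijpo->nijp', u, self.W)  # (batch, 1152, 10, 16)
        # dynamic routing, 3 iterations (batched version of the earlier sketch)
        b = torch.zeros(n, 1152, 10, device=x.device)
        for _ in range(3):
            c = b.softmax(dim=2)
            v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))  # (batch, 10, 16)
            b = b + torch.einsum('nijp,njp->nij', u_hat, v)
        return v
```

The length of each of the 10 output vectors, v.norm(dim=-1), is interpreted as the probability that the corresponding digit is present.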
Loss function
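For reference, the loss used in the CapsNet paper is the margin loss, computed per digit capsule k:

$$
L_k = T_k \,\max(0,\; m^{+} - \lVert \mathbf{v}_k \rVert)^2 + \lambda\,(1 - T_k)\,\max(0,\; \lVert \mathbf{v}_k \rVert - m^{-})^2
$$

where T_k = 1 if and only if digit k is present, m⁺ = 0.9, m⁻ = 0.1, and λ = 0.5. The total loss sums L_k over the 10 digit capsules and adds the decoder's reconstruction loss scaled down by 0.0005 so that it does not dominate training.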
Decoder
First fully connected layer
- Input: 16×10 matrix
- Output: 512-dimensional vector

Second fully connected layer
- Input: 512-dimensional vector
- Output: 1024-dimensional vector

Third fully connected layer
- Input: 1024-dimensional vector
- Output: 784-dimensional vector (reshaped to the 28×28 reconstructed image)
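A matching PyTorch sketch of the decoder; the ReLU hidden activations and sigmoid output follow the original paper and are not stated in the notes above:

```python
import torch.nn as nn

decoder = nn.Sequential(
    nn.Flatten(),           # 16x10 capsule output -> 160 values
    nn.Linear(160, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, 784),   # 784 = 28*28 reconstructed pixels
    nn.Sigmoid(),
)
```

At training time, only the 16-D vector of the correct digit capsule is fed in (the other nine are masked to zero), so the decoder learns to reconstruct the input image from that capsule's instantiation parameters.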