Understanding the Convolutional Neural Network (CNN)

The Convolutional Neural Network (CNN) excels at image processing. It is inspired by the human visual nervous system.

CNN has 2 major strengths:

  1. It can effectively reduce a large amount of data to a small amount of data
  2. It can effectively retain image features, in line with the principles of image processing

At present, CNN is widely used in face recognition, autonomous driving, Meitu Xiuxiu, security, and many other fields.


What problem has CNN solved?

Before the advent of CNN, images were a hard problem for artificial intelligence, for 2 reasons:

  1. The amount of data in an image is too large to process, resulting in high cost and low efficiency.
  2. It is difficult to retain the original features of an image during digitization, resulting in low accuracy of image processing.

Here is a detailed look at these 2 problems:


The amount of data that needs to be processed is too large

The image is made up of pixels, each of which is made up of colors.

Today an ordinary picture is 1000×1000 pixels or more, and each pixel has 3 parameters (R, G, B) that represent its color.

If we process an image of 1000×1000 pixels, we need to process 3 million parameters!
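The arithmetic behind that figure can be checked in a few lines. This is only a sketch: the variable names are illustrative, and the hidden-layer size is a hypothetical value chosen to show how quickly the cost grows if every input value were wired to every unit of a traditional fully connected layer.

```python
# Count the raw input values in an RGB image.
width, height = 1000, 1000   # image size in pixels
channels = 3                 # R, G and B values per pixel

num_values = width * height * channels
print(f"{num_values:,} input values per image")

# Hypothetical: a first fully connected layer with 1000 units
# would need one weight per (input value, unit) pair.
hidden_units = 1000
num_weights = num_values * hidden_units
print(f"{num_weights:,} weights in that layer")
```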


Processing this much data is very resource intensive, and this is not even a particularly large picture!

The first problem CNN solves is to "simplify a complex problem": it reduces a large number of parameters to a small number before further processing.

More importantly, in most scenarios this dimensionality reduction does not affect the result. For example, shrinking a 1000-pixel image to 200 pixels does not stop the naked eye from recognizing whether the picture shows a cat or a dog. The same is true for a machine.


Retain image features

The traditional way of digitizing pictures, in simplified form, resembles the process in the following figure:

Simple digitalization of images does not preserve image features

If a cell containing a circle is encoded as 1 and a cell without one as 0, then circles in different positions produce completely different data representations. But from a visual point of view, the content (essence) of the image has not changed; only the position has.

So when we move an object in the image, the parameters derived in the traditional way vary greatly! This does not meet the requirements of image processing.
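The effect can be seen in a toy example. The sketch below assumes a 3×3 grid in which a single cell holds the "circle"; the helper `image_with_mark` is hypothetical and exists only for this illustration.

```python
import numpy as np

def image_with_mark(size, row, col):
    """Return a size x size grid of 0s with a single 1 (the 'circle') at (row, col)."""
    img = np.zeros((size, size), dtype=int)
    img[row, col] = 1
    return img

# The same object, placed in two different corners, then flattened
# into the vector a traditional network would receive.
top_left = image_with_mark(3, 0, 0).flatten()
bottom_right = image_with_mark(3, 2, 2).flatten()

print(top_left)       # [1 0 0 0 0 0 0 0 0]
print(bottom_right)   # [0 0 0 0 0 0 0 0 1]

# The two vectors have no active element in common.
overlap = int(top_left @ bottom_right)
print(overlap)        # 0
```

Both vectors describe the same object, yet they share no active position, so a model that compares raw vectors sees them as unrelated.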

CNN addresses this problem. It preserves the features of the image in a way that resembles vision, so when an object shifts position in the image (or, given suitable training data, is flipped or rotated), the network can still recognize it as a similar image.

So how is the convolutional neural network implemented? Before we understand the principles of CNN, let's take a look at the human visual principle.


Human visual principle

Many research results of deep learning are inseparable from the study of the cognitive principles of the brain, especially the study of visual principles.

The 1981 Nobel Prize in Physiology or Medicine was awarded to David Hubel (a Canadian-born American neurobiologist), Torsten Wiesel, and Roger Sperry. The main contribution of the first two was the discovery of how the visual system processes information: the visual cortex is organized hierarchically.

The human visual principle is as follows: it starts with the intake of the raw signal (the pupil takes in pixels), followed by preliminary processing (certain cells in the cerebral cortex detect edges and orientations), then abstraction (the brain determines that the object in front of the eye is round), and then further abstraction (the brain determines that the object is a balloon). The following is an example of how the human brain recognizes a face:

Human Vision Principle 1


Human vision recognizes different kinds of objects level by level in the same way:

Human Vision Principle 2

We can see that the features at the lowest level are basically similar: they are all kinds of edges. The higher the level, the more object-specific the extracted features become (wheels, eyes, torso, and so on). At the top level, the different high-level features are combined into the corresponding image, which lets humans accurately distinguish between different objects.

This naturally raises a question: can we imitate this feature of the human brain and construct a multi-layer neural network in which the lower layers recognize primary image features, several lower-level features combine into higher-level features, and, through the combination of multiple levels, classification is finally performed at the top layer?

The answer is yes, and this is the source of inspiration for many deep learning algorithms, including CNN.


The Basic Principles of CNN

A typical CNN consists of 3 parts:

  1. Convolution layer
  2. Pooling layer
  3. Fully connected layer

Put simply:

The convolutional layer is responsible for extracting local features in the image; the pooling layer is used to significantly reduce the parameter magnitude (dimension reduction); the fully connected layer is similar to the traditional neural network portion and is used to output the desired result.

A typical CNN consists of 3 parts

The explanation below leaves out many technical details to keep things easy to understand. If you are interested in the detailed principles, you can watch the video "Convolutional neural network".


Convolution - feature extraction

The operation of the convolutional layer is shown in the figure below, where a convolution kernel scans the entire picture:

Convolutional layer operation

We can understand this process as using a filter (the convolution kernel) to filter each small region of the image and obtain the feature value of that region.
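The scanning step can be sketched directly in NumPy. This is a minimal illustration under stated assumptions: stride 1, no padding, a single channel, and a hand-picked 5×5 image and 3×3 kernel that do not come from the figure. Like most deep learning libraries, it slides the kernel without flipping it (strictly speaking, cross-correlation).

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Small region of the image currently under the kernel.
            patch = image[i:i + kh, j:j + kw]
            # Feature value of that region: elementwise product, then sum.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Illustrative 5x5 binary image and 3x3 kernel (not taken from the figure).
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

feature_map = conv2d(image, kernel)
print(feature_map.shape)   # (3, 3)
print(feature_map)
```

Each entry of the 3×3 result summarizes one 3×3 region of the input, which is why the output is smaller than the image.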

In practice, there are usually multiple convolution kernels. Each convolution kernel can be thought of as representing one image pattern: if convolving an image block with a kernel yields a large value, the block is considered very close to that kernel's pattern. If we design 6 convolution kernels, we are in effect assuming that the image contains 6 underlying texture patterns, that is, that the image can be drawn from 6 basic patterns. The following are examples of 25 different convolution kernels:

25 different convolution kernels
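The idea that each kernel responds to one pattern can be sketched with two hand-written kernels. Everything here is an assumption made for illustration: in a trained CNN the kernels are learned rather than written by hand, and the 6×6 test image is made up. The sketch uses NumPy's `sliding_window_view` (available since NumPy 1.20) to collect every 3×3 region at once.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Illustrative 6x6 image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1

# A bank of two hand-written 3x3 kernels (hypothetical, not learned).
kernels = np.array([
    [[-1, 0, 1],      # responds to vertical edges
     [-1, 0, 1],
     [-1, 0, 1]],
    [[-1, -1, -1],    # responds to horizontal edges
     [0, 0, 0],
     [1, 1, 1]],
])

# All 3x3 regions of the image: shape (4, 4, 3, 3).
windows = sliding_window_view(image, (3, 3))

# One feature map per kernel: shape (2, 4, 4).
feature_maps = np.einsum("ijkl,nkl->nij", windows, kernels)

print(feature_maps[0])   # large values where the vertical edge is
print(feature_maps[1])   # all zeros: the image has no horizontal edge
```

The first map lights up only around the columns where the brightness changes, while the second stays at zero, so each map reports where its own pattern occurs.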

Summary: The convolutional layer extracts local features from the image by filtering it with convolution kernels, similar to the feature extraction of human vision described above.


Pooling layer (downsampling) - data dimensionality reduction to avoid overfitting

Pooling is simply downsampling, which can greatly reduce the dimensions of the data. The process is as follows:

Pooling process

In the figure above, the original feature map is 20×20. We downsample it with a 10×10 sampling window, which produces a 2×2 feature map.

The reason for doing this is that even after convolution the feature map is still large (because the convolution kernel is small), so downsampling is applied to reduce the data dimension.
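The 20×20 to 2×2 reduction can be reproduced in a few lines. The text does not say which pooling rule the figure uses, so this sketch assumes max pooling with non-overlapping windows; the helper `max_pool` and the placeholder input values are illustrative.

```python
import numpy as np

def max_pool(feature_map, window):
    """Non-overlapping max pooling; assumes both sides divide evenly by `window`."""
    h, w = feature_map.shape
    # Split the map into (h // window) x (w // window) blocks of window x window.
    blocks = feature_map.reshape(h // window, window, w // window, window)
    # Keep the largest value in each block.
    return blocks.max(axis=(1, 3))

feature_map = np.arange(400).reshape(20, 20)   # placeholder values 0..399
pooled = max_pool(feature_map, window=10)

print(feature_map.size, "->", pooled.size)     # 400 -> 4
print(pooled.shape)                            # (2, 2)
```

Average pooling would follow the same shape logic with `mean` in place of `max`.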

Summary: The pooling layer reduces the data dimension more aggressively than the convolutional layer. This greatly reduces the amount of computation and also helps to reduce overfitting.


Fully connected layer - output

This part is the last step. The data processed by the convolutional layers and pooling layers is fed into the fully connected layer to produce the final result.

Only after the dimensionality reduction performed by the convolutional and pooling layers can the fully connected layer "run" at a reasonable cost. Otherwise the amount of data is too large, the computation is expensive, and the efficiency is low.

Fully connected layer
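What the fully connected layer computes can be sketched as a flatten step followed by a matrix multiplication. All sizes below are hypothetical (16 pooled feature maps of 5×5 and 10 output classes), the weights are random rather than trained, and the softmax at the end is a common choice for classification that the text does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the last pooling layer: 16 feature maps of 5x5.
pooled_features = rng.random((16, 5, 5))
num_classes = 10

# Flatten the feature maps into one vector of 400 values.
x = pooled_features.flatten()

# Untrained, randomly initialized layer parameters (illustration only).
weights = rng.normal(0, 0.01, (num_classes, x.size))
bias = np.zeros(num_classes)

# Every output is connected to every input: one score per class.
scores = weights @ x + bias

# Softmax turns the scores into probabilities that sum to 1.
exp_scores = np.exp(scores - scores.max())
probabilities = exp_scores / exp_scores.sum()

print(weights.size)             # 4000 weights
print(probabilities.shape)      # (10,)
print(probabilities.argmax())   # index of the most likely class
```

With 400 inputs this layer holds 4,000 weights. Fed the raw 3,000,000 values of a 1000×1000 RGB image, a layer with the same 10 outputs would need 30 million.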

A typical CNN is not limited to the 3 layers mentioned above. It stacks several of them into a multi-layer structure, such as the LeNet-5 structure shown below:

Convolutional layer -> Pooling layer -> Convolutional layer -> Pooling layer -> Convolutional layer -> Fully connected layer

LeNet-5 network structure
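How the data shrinks through such a stack can be traced with simple size arithmetic. The layer sizes below follow the commonly cited description of LeNet-5 (32×32 input, 5×5 kernels, 2×2 pooling) and are not taken from this article; the helper functions are illustrative.

```python
def conv_out(size, kernel, stride=1):
    """Output side length of a convolution without padding."""
    return (size - kernel) // stride + 1

def pool_out(size, window):
    """Output side length of non-overlapping pooling."""
    return size // window

size = 32   # LeNet-5 input: a 32x32 grayscale image
layers = [
    # (name, kind, kernel or window size, number of feature maps)
    ("C1 conv 5x5", "conv", 5, 6),
    ("S2 pool 2x2", "pool", 2, 6),
    ("C3 conv 5x5", "conv", 5, 16),
    ("S4 pool 2x2", "pool", 2, 16),
    ("C5 conv 5x5", "conv", 5, 120),
]

shapes = []
for name, kind, k, maps in layers:
    size = conv_out(size, k) if kind == "conv" else pool_out(size, k)
    shapes.append((maps, size, size))
    print(f"{name}: {maps} feature maps of {size}x{size}")

# The 120 values left after C5 feed the fully connected layers
# (84 units, then 10 outputs in the commonly cited description).
```

By the time the data reaches the fully connected part, the 32×32 input has been reduced to 120 values.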

Now that we understand the basic principles of CNN, let's look at how CNN is applied in practice.


What are the practical applications of CNN?

CNN is very good at processing images. Since a video is a sequence of images, CNN is also good at processing video content. Here are some of the more mature applications:


Image classification, retrieval

Image classification is a relatively basic application. It classifies images effectively and can save a lot of labor costs. For images in some specific domains, classification accuracy can exceed 95%, which makes it a highly usable application.

Typical scenarios: image search...

CNN application - image classification, retrieval


Target location detection

It locates the target in the image and determines its position and size.

Typical scenarios: autonomous driving, security, medical...

CNN application - target location detection


Target segmentation

Put simply, this is classification at the pixel level.

It separates foreground from background pixel by pixel and, at a more advanced level, identifies the individual targets and classifies them.
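"Pixel-level classification" means that every pixel receives its own label. The sketch below assumes a network has already produced one score per class for each pixel; the scores here are random placeholders and the two-class setup (background and foreground) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-pixel scores for a 4x4 image,
# shape (classes, height, width): class 0 = background, class 1 = foreground.
scores = rng.random((2, 4, 4))

# Each pixel takes the class with the highest score.
mask = scores.argmax(axis=0)

print(mask.shape)   # (4, 4)
print(mask)         # 0 marks background pixels, 1 marks foreground pixels
```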

Typical scenarios: Meitu Xiuxiu, video post-processing, image generation...

CNN application - target segmentation


Face recognition

Face recognition has become a very popular application and has been widely used in many fields.

Typical scenarios: security, finance, daily life...

CNN application - face recognition


Skeletal recognition

Skeletal recognition identifies the key bones of the body and tracks their movement.

Typical scenarios: security, movies, image and video generation, games...

CNN application - skeletal recognition


Final Thoughts

Today we introduced the value, basic principles and application scenarios of CNN. The summary is as follows:

The value of CNN:

  1. Ability to effectively reduce the amount of large data to a small amount of data (without affecting the results)
  2. Ability to preserve the characteristics of the image, similar to the human visual principle

The basic principle of CNN:

  1. Convolution layer – the main role is to extract and preserve the features of the picture
  2. Pooling layer – the main role is to reduce the dimensionality of the data, which also helps to reduce overfitting
  3. Fully connected layer – output the results we want according to different tasks

Practical application of CNN:

  1. Image classification, retrieval
  2. Target location detection
  3. Target segmentation
  4. Face recognition
  5. Skeletal recognition


Baidu Encyclopedia + Wikipedia

Baidu Encyclopedia version

A convolutional neural network (CNN) is a kind of feedforward neural network that involves convolution computations and has a deep structure. It is one of the representative algorithms of deep learning. Because a convolutional neural network is capable of shift-invariant classification, it is also called a "Shift-Invariant Artificial Neural Network (SIANN)".

The study of convolutional neural networks began in the 1980s and 1990s. Time delay networks and LeNet-5 were the earliest convolutional neural networks. In the twenty-first century, with the introduction of deep learning theory and improvements in numerical computing hardware, convolutional neural networks developed rapidly and are now widely used in computer vision and natural language processing.

Wikipedia version

In deep learning, convolutional neural networks (CNN or ConvNet) are a class of deep neural networks that are most commonly used to analyze visual images.

CNNs use a variation of the multilayer perceptron designed to require minimal pre-processing. They are also known as shift-invariant or space-invariant artificial neural networks (SIANN), based on their shared-weight architecture and translation invariance. Convolutional networks were inspired by biological processes, in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap, so that together they cover the entire field of view.

Compared to other image classification algorithms, CNNs use relatively little pre-processing. This means that the network learns the filters that had to be designed by hand in traditional algorithms. This independence from prior knowledge and human effort in feature design is a major advantage.

They can be used for image and video recognition, recommendation systems, image classification, medical image analysis and natural language processing.
