Computer Vision (CV)

Computer vision (CV) is an important branch of artificial intelligence. Its purpose is to understand the content of images.

This article introduces the basic concepts of computer vision, how it works, its 8 main tasks, and 4 common application scenarios in daily life.

Why is computer vision important?

About 70% of the human cerebral cortex is devoted to processing visual information, which makes vision one of the most important channels through which humans obtain information.

Online, the volume of photos and videos (which are sequences of images) is also exploding!

The following figure shows the trend in the share of new data on the network. Gray is structured data and blue is unstructured data (mostly images and videos). Images and videos are clearly growing at an exponential rate.

Picture and video data are growing rapidly

Before the advent of computer vision, images were black boxes for computers.

To the machine, an image is just a file. The machine does not know what the picture shows; it only knows the picture's dimensions, its size in MB, and its format.

Before CV, the machine could only see the file attributes and could not understand the picture content

If computers and artificial intelligence want to play an important role in the real world, they must understand pictures! This is the problem that computer vision has to solve.


What is computer vision (CV)?

Computer vision is an important branch of artificial intelligence. The problem it needs to solve is: understand what is in the image.

For example:

  • Is the pet in the picture a cat or a dog?
  • Is the person in the picture Lao Zhang or Lao Wang?
  • What items are on the table in this photo?

CV allows the machine to understand the content in the picture


What is the principle of computer vision?

The current mainstream approach, based on deep learning, is similar in principle to how the human brain works.

Human vision works roughly as follows: it starts with raw signal intake (the pupils take in pixels), then does preliminary processing (the cerebral cortex finds edges and orientations), then abstracts (the brain determines that the object in front of the eyes is round), and then abstracts further (the brain determines that the object is a balloon).

How the human brain sees

The machine's method is similar: it builds a multilayer neural network in which the lower layers identify low-level image features and each higher layer combines several lower-level features into higher-level ones. Through this combination across multiple layers, the classification is finally made at the top layer.

The principle of CV is similar to the principle of human vision


2 challenges of computer vision

Understanding pictures is very easy for humans but very difficult for machines. There are two typical difficulties:

Features are difficult to extract

The same cat looks very different at the pixel level under different angles, lighting, and poses. Even the same photo differs greatly at the pixel level after being rotated 90 degrees!

So the content of two pictures can be similar or even identical while their pixel values differ enormously. This is a big challenge for feature extraction.
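
The rotation point can be checked on a toy example. The sketch below uses a hypothetical 4 x 4 grayscale image (a bright vertical bar, invented for illustration) and counts how many pixel positions change after a 90 degree rotation, even though the content is the same bar.

```python
# Hypothetical 4x4 grayscale image: a bright vertical bar on a dark background.
image = [
    [0, 255, 0, 0],
    [0, 255, 0, 0],
    [0, 255, 0, 0],
    [0, 255, 0, 0],
]


def rotate_90_clockwise(img):
    """Rotate a 2D list of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def count_changed_pixels(a, b):
    """Count the positions at which two equally sized images differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


rotated = rotate_90_clockwise(image)
changed = count_changed_pixels(image, rotated)
total = len(image) * len(image[0])
print(f"{changed} of {total} pixel positions changed after rotation")
```

A comparison made directly on raw pixels would treat the two versions as quite different images, which is why features that survive such transformations are needed.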

Huge amount of data to calculate

A photo taken casually on a phone is about 1000 x 2000 pixels. Each pixel has 3 RGB values, so there are 1000 x 2000 x 3 = 6,000,000 values in total. A single ordinary photo therefore requires processing 6 million values, and 4K video, which is becoming more and more popular, requires far more. You can imagine how daunting this amount of computation is.
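
The arithmetic above is easy to reproduce. The sketch below recomputes the photo figure and extends it to 4K video; the 3840 x 2160 resolution and the 30 frames per second are assumptions chosen for illustration and do not come from the article.

```python
def values_per_frame(width, height, channels=3):
    """Number of raw values in one uncompressed frame (3 channels for RGB)."""
    return width * height * channels


# The phone photo from the text: 1000 x 2000 pixels, RGB.
photo = values_per_frame(1000, 2000)

# Assumption: "4K" means 3840 x 2160 (UHD) played at 30 frames per second.
frame_4k = values_per_frame(3840, 2160)
per_second_4k = frame_4k * 30

print(f"photo:           {photo:,} values")
print(f"one 4K frame:    {frame_4k:,} values")
print(f"1 s of 4K video: {per_second_4k:,} values")
```

Under these assumptions, one second of raw 4K video holds more than a hundred times as many values as the single photo.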

2 challenges of computer vision

CNN solves the two major problems above

The convolutional neural network (CNN) belongs to the category of deep learning, and it addresses the two major difficulties mentioned above:

  1. CNN can effectively extract features in images
  2. CNN can effectively reduce the dimensionality of massive data (without affecting the feature extraction), greatly reducing the requirement for computing power

The specific principle of CNN is not described here. If you are interested, you can check out "A paper to understand the convolutional neural network - CNN (basic principle + unique value + practical application)"
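
Without going into the full principle, the two points above can be illustrated with a rough sketch in plain Python: a convolution that extracts a feature (here a vertical edge) and a max pooling step that shrinks the data. The image and the kernel are hypothetical and hand-written; a real CNN learns its kernels from data and stacks many such layers.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image without padding (as CNN layers do)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; leftover rows and columns are dropped."""
    h = len(feature_map) // size * size
    w = len(feature_map[0]) // size * size
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, w, size)
        ]
        for r in range(0, h, size)
    ]


# Hypothetical 6x6 image: dark left half, bright right half.
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]

# Hand-written kernel that responds where brightness rises from left to right.
kernel = [[-1, 1], [-1, 1]]

features = convolve2d(image, kernel)  # 5x5 map, non-zero only at the edge
pooled = max_pool(features)           # 2x2 map that still marks the edge

print(features[0])
print(pooled)
```

In this toy case the 36 input values shrink to 4 while the location of the edge is still visible in the pooled map.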


8 tasks for computer vision

8 tasks of CV

Image classification

Image classification is an important basic problem in computer vision. The other tasks mentioned later are also based on it.

Typical examples include face recognition, picture recognition, and automatically sorting a photo album by the people in it.

Image classification

Object detection

Given an image or a video frame, the object detection task asks the computer to find the position of every object in it and to give the specific category of each one.
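
One way to picture the result of this task is as a list of records, one per object, each holding a category and a position. The sketch below is a hypothetical data structure, not the output format of any particular detector: the `Detection` class, the box convention, the `score` field, and all values are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # object category
    box: tuple    # assumed convention: (x_min, y_min, x_max, y_max) in pixels
    score: float  # assumed confidence between 0 and 1


# Hypothetical detector output for one frame (all values are made up).
detections = [
    Detection("person", (34, 50, 120, 310), 0.97),
    Detection("car", (200, 180, 460, 330), 0.91),
    Detection("car", (480, 190, 640, 320), 0.42),
]


def keep_confident(items, threshold=0.5):
    """Drop detections whose confidence is below the threshold."""
    return [d for d in items if d.score >= threshold]


confident = keep_confident(detections)
counts = Counter(d.label for d in confident)
print(dict(counts))
```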

Object detection

Semantic segmentation

Semantic segmentation divides the entire image into groups of pixels and then labels and classifies each group. It attempts to understand what each pixel in the image represents (person, car, dog, tree, ...).

As shown below, in addition to identifying people, roads, cars, trees, etc., we must also determine the boundaries of each object.

Semantic segmentation

Instance segmentation

Instance segmentation goes one step beyond semantic segmentation: it also distinguishes individual instances of the same class, for example by marking 5 cars with 5 different colors. Scenes may contain multiple overlapping objects and complex, varied backgrounds. We need not only to classify these objects but also to determine their boundaries, their differences, and the relationships between them!
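
The difference between the two kinds of output can be shown on a toy grid. In the sketch below, the hypothetical semantic mask labels every car pixel with the same class, and a simple connected-region labeling assigns a separate id to each car. This is only an illustration of the output format: real instance segmentation models do not work this way, and this shortcut would merge cars that touch or overlap.

```python
from collections import deque

# Hypothetical semantic segmentation output: one class per pixel.
# "." is road and "C" is car; every car pixel carries the same label.
semantic = [
    "CC...C",
    "CC...C",
    "......",
    "..CC..",
]


def label_instances(mask, target):
    """Give each 4-connected region of `target` pixels its own integer id."""
    h, w = len(mask), len(mask[0])
    ids = [[0] * w for _ in range(h)]
    next_id = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] != target or ids[r][c]:
                continue
            next_id += 1
            ids[r][c] = next_id
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < h and 0 <= nx < w
                    if inside and mask[ny][nx] == target and not ids[ny][nx]:
                        ids[ny][nx] = next_id
                        queue.append((ny, nx))
    return ids, next_id


instance_ids, car_count = label_instances(semantic, "C")
for row in instance_ids:
    print(row)
print("cars found:", car_count)
```

Each non-zero id plays the role of one of the "different colors" mentioned above, while 0 marks pixels that are not cars.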

Instance segmentation

Video classification

Unlike image classification, the object being classified is no longer a still image but a video made up of many frames, which also carries audio and motion information, so understanding a video requires more context. We must understand not only what each frame is and what it contains, but also how the frames relate to one another.

Video classification

Human keypoint detection

Human keypoint detection identifies human movement and behavior by combining and tracking key points of the human body (such as joints). It is very important for describing human posture and predicting human behavior.

This technology is used in the Xbox.
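
As a small illustration of how key points can be combined to describe posture, the sketch below computes the angle at a joint from three key points. The key point names and coordinates are hypothetical, and a joint angle is just one simple posture feature chosen for this example; it is not presented as how any particular product works.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b-a and b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("key points must not coincide with the joint")
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cosine))


# Hypothetical detected key points as (x, y) pixel coordinates.
keypoints = {
    "shoulder": (100, 100),
    "elbow": (100, 200),
    "wrist": (200, 200),
}

elbow_angle = joint_angle(
    keypoints["shoulder"], keypoints["elbow"], keypoints["wrist"]
)
print(f"elbow angle: {elbow_angle:.1f} degrees")
```

Following such angles from frame to frame gives a simple description of a movement, for example an arm bending and straightening.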

Human keypoint detection

Scene text recognition

Many photos contain textual information, which is important for understanding the image.

Scene text recognition is the process of converting the text in an image into a text sequence despite complex backgrounds, low resolution, diverse fonts, and arbitrary text layout.

License plate recognition at parking lots and toll stations is a typical application scenario.

Scene text recognition

Object tracking

Object tracking refers to the process of following one or more specific objects of interest in a given scene. The traditional application is the interaction between video and the real world, where an object is observed continuously after it is first detected.

This technology is used in autonomous driving.
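
The idea of observing an object after its initial detection can be sketched with a very simple rule: in each new frame, match every tracked object to the nearest detected position. The function name, the distance limit, and the coordinates below are all hypothetical, and real trackers are considerably more sophisticated; this sketch also ignores detections that match no existing track.

```python
import math


def track_step(tracks, detections, max_distance=50.0):
    """Greedily match each track to its nearest detection in the new frame.

    tracks: dict mapping a track id to its last known (x, y) position.
    detections: list of (x, y) positions found in the new frame.
    Tracks without a close enough detection keep their last position.
    """
    updated = dict(tracks)
    unused = list(detections)
    for track_id, position in tracks.items():
        if not unused:
            break
        nearest = min(unused, key=lambda d: math.dist(position, d))
        if math.dist(position, nearest) <= max_distance:
            updated[track_id] = nearest
            unused.remove(nearest)
    return updated


# Objects found by an initial detection step (hypothetical positions).
tracks = {"car": (10, 10), "person": (100, 40)}

# Hypothetical detections in the next two frames, in arbitrary order.
frames = [
    [(102, 42), (14, 11)],
    [(18, 12), (105, 45)],
]

for detections in frames:
    tracks = track_step(tracks, detections)

print(tracks)
```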

Object tracking


Application scenarios of CV in daily life

Computer vision has a very wide range of applications. Here are a few common ones in daily life.

  1. Face recognition for access control and Alipay
  2. License plate recognition in parking lots and toll stations
  3. Risk identification when uploading pictures or videos to websites
  4. Various effects and props on Douyin (which require locating the face first)

Application scenarios of computer vision in daily life

It should be noted that scanning barcodes and QR codes is not considered computer vision here.

This kind of image recognition is based on fixed rules. It does not need to process complex images and does not use AI technology at all.


Baidu Encyclopedia + Wikipedia

Baidu Encyclopedia version

Computer vision is a science that studies how to make a machine "see". More specifically, it refers to using cameras and computers in place of human eyes to identify, track, and measure targets, and to process the resulting images further so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and techniques that attempt to build artificial intelligence systems able to extract "information" from images or multidimensional data. The information referred to here is information as defined by Shannon, which can be used to help make a "decision". Because perception can be thought of as extracting information from sensory signals, computer vision can also be seen as the science of how to make artificial systems "perceive" from images or multidimensional data.

Wikipedia version

Computer vision is an interdisciplinary scientific field that deals with how computers can be made to gain a high-level understanding from digital images or videos. From an engineering perspective, it seeks to automate the tasks that the human visual system can accomplish.

Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, as well as extracting high dimensional data from the real world to produce digital or symbolic information, for example, in the form of decisions.

Understanding in this context means transforming a visual image (the input of the retina) into a description of the world that can interact with other thought processes and lead to appropriate actions. This image understanding can be seen as extracting symbolic information from image data using models built with the aid of geometry, physics, statistics, and learning theory.

As a scientific discipline, computer vision focuses on the theory behind artificial systems that extract information from images. Image data can take many forms, such as a video sequence, views from multiple cameras, or multi-dimensional data from a medical scanner. As a technological discipline, computer vision attempts to apply its theories and models to the construction of computer vision systems. Sub-domains of computer vision include scene reconstruction, event detection, video tracking, object recognition, 3D pose estimation, learning, indexing, motion estimation, and image restoration.
