Understanding unsupervised learning in one article

Unsupervised learning is a style of learning in the field of machine learning. This article explains its basic concepts and shows the specific scenarios where unsupervised learning can be used.

Finally, it walks through the two main classes of unsupervised learning, clustering and dimensionality reduction, along with four specific algorithms.

 

What is unsupervised learning?

Unsupervised learning is a kind of training method / learning method within machine learning:

Unsupervised learning is a branch of machine learning

Unsupervised learning can be understood by comparing it with supervised learning:

  1. Supervised learning is a purpose-driven training method: you know what you are going to get. Unsupervised learning is a training method with no clear purpose: you cannot know in advance what the result will be.
  2. Supervised learning requires labeled data; unsupervised learning does not.
  3. Supervised learning can measure results because the goals are clear; the effect of unsupervised learning is hard to quantify.

Supervised learning versus unsupervised learning

To summarize briefly:

Unsupervised learning is a form of machine learning training. It is essentially a statistical tool: a training method that finds latent structure in unlabeled data.

It has 3 main features:

  1. Unsupervised learning has no clear purpose
  2. Unsupervised learning does not require labeling data
  3. Unsupervised learning cannot quantify results

That explanation may still feel abstract, so let's use some concrete cases to show practical application scenarios of unsupervised learning. Through these scenarios, you can see its value.

 

Unsupervised learning usage scenarios

Discover anomalous data with unsupervised learning

Case 1: Anomaly detection

Many illegal activities require "money laundering". The behavior involved in money laundering differs from that of ordinary users. But in what way?

Analyzing this by hand would be expensive and complicated. Instead, we can cluster users by these behavioral features, which makes it much easier to find users with abnormal behavior, and then analyze in depth how their behavior differs and whether it falls within the scope of illegal money laundering.

Through unsupervised learning, we can quickly classify behaviors. Although we don't know what these classes mean, the classification lets us quickly screen out normal users and focus deeper analysis on the abnormal behavior.
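As a rough illustration, here is a minimal sketch of how clustering can surface atypical users. It assumes Python with scikit-learn and NumPy, and the behavioral features are made up for the example:

```python
# A minimal sketch of clustering-based anomaly screening; each row of
# `behaviors` stands for one user's (hypothetical) behavioral features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
behaviors = rng.normal(size=(1000, 2))   # ordinary users
behaviors[:10] += 8                      # a few unusual users

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(behaviors)

# Distance of each user to their assigned cluster center: users far
# from every center behave unlike any common group and deserve review.
centers = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(behaviors - centers, axis=1)
suspects = np.argsort(distances)[-10:]   # the 10 most atypical users
print("users to review:", suspects)
```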

 

Segment users with unsupervised learning

Case 2: User Segmentation

This is very useful for advertising platforms: users can be segmented not only by gender, age, and geographic location, but also by their behavior.

With user segments across many dimensions, ad serving can be more targeted and effective.

 

Recommend to users with unsupervised learning

Case 3: Recommendation System

Everyone has heard the "beer + diapers" story, a classic example of recommending related products based on users' purchasing behavior.

For example, when you shop on Taobao, Tmall, or JD.com, related products are always recommended based on your browsing behavior. Some of these recommendations come from clustering via unsupervised learning: the system finds users with similar purchase behavior and recommends the products such users "like" most.

 

Two common classes of unsupervised learning algorithms

The two common classes of algorithms are clustering and dimensionality reduction.

Two mainstream unsupervised learning methods: clustering and dimensionality reduction

Clustering: simply put, it is a method of automatic classification. In supervised learning you know exactly what each category is; in clustering you do not, and you cannot tell in advance what each of the resulting categories means.

Dimensionality reduction: dimensionality reduction looks a lot like compression. It reduces the complexity of the data while preserving as much of the relevant structure as possible.

 

"clustering algorithm" K-means clustering

K-means clustering automatically groups the data into K clusters.

The steps for K-means clustering are as follows:

  1. Define K centroids. Initially these centroids are random (there are some more efficient algorithms for initializing the centroids).
  2. Find the nearest centroid and update the cluster assignments. Each data point is assigned to the cluster whose centroid is closest to it. The measure of "closeness" here is a hyperparameter, usually the Euclidean distance.
  3. Move each centroid to the center of its cluster. The new position of each cluster's centroid is the average position of all the data points in that cluster.

Repeat steps 2 and 3 until the centroid positions no longer change significantly (i.e., until the algorithm converges).

The process is as follows:

K-means clustering process
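To make the steps concrete, here is a minimal NumPy sketch of the loop above, assuming 2-D points and random initialization (in practice a library implementation such as scikit-learn's KMeans would be used):

```python
# A minimal K-means sketch following the three steps above.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (for brevity, empty clusters are not handled here).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).normal(size=(300, 2))
labels, centroids = kmeans(X, k=3)
```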

 

"clustering algorithm" hierarchical clustering

If you don't know how many categories the data should be divided into, hierarchical clustering is more suitable. Hierarchical clustering builds a multi-level, nested classification, similar to a tree structure.

Hierarchical clustering

The steps of hierarchical clustering are as follows:

  1. Start with N clusters, one cluster per data point.
  2. Merge the two clusters that are closest to each other into one. Now you have N-1 clusters.
  3. Recalculate the distances between the clusters.
  4. Repeat steps 2 and 3 until you are left with a single cluster containing all N data points.
  5. Choose a number of clusters by drawing a horizontal line across the tree.
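Here is a minimal sketch of these steps, assuming SciPy is available; the data is made up for illustration:

```python
# Agglomerative (bottom-up) hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(2).normal(size=(50, 2))

# linkage() performs steps 1-4: it repeatedly merges the two closest
# clusters and records the merge tree (the dendrogram).
Z = linkage(X, method="ward")

# Step 5: "cut" the tree to obtain a chosen number of flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels[:10])
```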

 

"Dimensionality reduction algorithm" Principal Component Analysis – PCA

Principal component analysis converts many indicators into a few composite indicators.

Principal component analysis reduces the dimensionality of a data set while keeping the features that contribute most to it. It does this by preserving the low-order principal components and ignoring the higher-order ones; the low-order components tend to retain the most important aspects of the data.

The transformation steps are:

  1. The first step is to compute the covariance matrix S of the sample matrix X (this is non-standard PCA; standard PCA computes the correlation coefficient matrix C).
  2. The second step is to compute the eigenvectors e1, e2, ..., eN and the eigenvalues of the covariance matrix S (or C), for t = 1, 2, ..., N.
  3. The third step is to project the data into the space spanned by the eigenvectors: each projected coordinate is a weighted sum of the values of the corresponding dimensions in the original sample, with the eigenvector entries as weights (see the sketch below).
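The following is a minimal PCA sketch tracing the three steps, assuming NumPy and made-up samples:

```python
# A minimal PCA sketch following the three steps above.
import numpy as np

X = np.random.default_rng(3).normal(size=(200, 5))  # made-up samples

# Step 1: center the data and compute the covariance matrix S.
Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)

# Step 2: eigenvectors and eigenvalues of S (eigh: S is symmetric).
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]                   # largest first
eigvecs = eigvecs[:, order]

# Step 3: project onto the top-k eigenvectors (the principal components).
k = 2
X_reduced = Xc @ eigvecs[:, :k]
print(X_reduced.shape)                              # (200, 2)
```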

 

"Dimensionality Reduction Algorithm" Singular Value Decomposition – SVD

Singular value decomposition is an important matrix decomposition in linear algebra; it generalizes eigendecomposition to arbitrary matrices. It has important applications in signal processing and statistics.

To learn more about singular value decomposition, see Wikipedia.
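As a quick illustration of SVD used for dimensionality reduction, here is a minimal sketch assuming NumPy and made-up data:

```python
# Low-rank approximation of a matrix via SVD.
import numpy as np

A = np.random.default_rng(4).normal(size=(100, 20))  # made-up data

# Full SVD: A = U @ diag(s) @ Vt, with singular values s in
# descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values: a low-rank approximation
# of A that preserves as much of its structure as possible.
k = 5
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_approx))                  # approximation error
```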

 

Generative models and GANs

The simplest goal of unsupervised learning is to train an algorithm to generate its own data instances. But the model should not simply reproduce its training data; otherwise it is just memorizing.

It must build a model of the underlying class of the data. Rather than generating a photo of one specific horse or rainbow, it models the collection of all horses and rainbows; rather than a specific utterance from a particular speaker, it models the general distribution of utterances.

The guiding principle of generative models is that being able to construct a convincing example of the data is the strongest evidence of understanding it. As physicist Richard Feynman said: "What I cannot create, I do not understand."

For images, the most successful generative model to date is the Generative Adversarial Network (GAN). It consists of two networks, a generator and a discriminator, responsible respectively for forging images and for telling real images from fakes.

GAN generated image

The generator produces images intended to convince the discriminator that they are real, while the discriminator is rewarded for spotting the fakes.

The images a GAN generates at first are messy and random; over many iterations they are refined into realistic images that can be hard to distinguish from real photos. Recently, NVIDIA's GauGAN has even been able to generate images from user sketches.
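To show the generator/discriminator game in miniature, here is a toy training-loop sketch, assuming PyTorch and a 1-D toy distribution instead of images (real GANs use convolutional networks and far more training):

```python
# A minimal GAN sketch: G forges samples, D tells real from fake.
import torch
import torch.nn as nn

real_data = lambda n: torch.randn(n, 1) * 0.5 + 2.0  # "real" samples

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # Discriminator: rewarded for labeling real as 1 and fake as 0.
    real = real_data(64)
    fake = G(torch.randn(64, 8)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator: rewarded for fooling the discriminator.
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

print(G(torch.randn(5, 8)).detach().squeeze())  # should be near 2.0
```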

 

Baidu Encyclopedia and Wikipedia

Baidu Encyclopedia version

A common problem in real life is that the lack of sufficient prior knowledge makes it difficult, or too costly, to label categories manually. Naturally, we would like the computer to do this work for us, or at least to help. Solving the various problems in pattern recognition from training samples whose categories are unknown (unlabeled) is called unsupervised learning.


 

Wikipedia version

Unsupervised learning is a branch of machine learning that learns from test data that has not been labeled, classified, or categorized. Instead of responding to feedback, unsupervised learning identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data. Alternatives include supervised learning and reinforcement learning. A central application of unsupervised learning is density estimation in statistics, [1] although unsupervised learning encompasses many other domains involving summarizing and explaining the features of the data.


 
