What is supervised learning?

Supervised learning is one of the training (learning) methods used in machine learning:

Supervised learning requires a clear goal: you know exactly what result you want. For example, classifying things according to "established rules", or predicting a specific value.

Supervision does not mean that a person stands next to the machine checking whether it is doing the right thing. It refers to the following process:

  1. Choose a mathematical model that fits the target task
  2. Give the machine some known "questions and answers" (the training set) to learn from
  3. The machine works out its own "methodology" from them
  4. Humans give the machine "new questions" (the test set) to answer

The questions and answers above are just a metaphor. If the task is to categorize articles, the process looks like this:

  1. Choose a suitable mathematical model
  2. Give the machine a batch of articles that have already been classified, together with their categories
  3. The machine learns the "methodology" of classification
  4. Once the machine has learned, give it some new articles (without categories) and let it predict their categories
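The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model chosen here is a tiny multinomial Naive Bayes classifier with add-one smoothing, and the example articles, the category names, and the function names `train` and `predict` are all made up for the sketch.

```python
import math
from collections import Counter, defaultdict

# Step 2: "questions and answers" (articles that already have a category).
# The texts and labels are invented purely for illustration.
training_set = [
    ("the team won the match with a late goal", "sports"),
    ("the striker scored twice in the final match", "sports"),
    ("the central bank raised interest rates again", "finance"),
    ("stock markets fell as interest rates rose", "finance"),
]

def train(examples):
    """Step 3: the machine summarizes its own "methodology" (word statistics)."""
    word_counts = defaultdict(Counter)   # category -> word -> count
    doc_counts = Counter()               # category -> number of articles
    vocabulary = set()
    for text, label in examples:
        words = text.split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary

def predict(model, text):
    """Step 4: answer a "new question" by picking the most probable category."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # log prior + sum of log likelihoods with add-one (Laplace) smoothing
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(training_set)
print(predict(model, "the striker scored a goal"))         # new, unlabeled article
print(predict(model, "interest rates and stock markets"))  # new, unlabeled article
```

With real articles the training set would need to be far larger, and the choice of model (step 1) would depend on the task.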


The 2 tasks of supervised learning: regression and classification

Supervised learning has 2 main tasks:

  1. Regression
  2. Classification

Regression: predicts a continuous, specific value. For example: the Sesame Credit score in Alipay (more on this below).

Classification: predicts which of several discrete categories something belongs to. For example: assigning an article to a category, as described above.


"Regression" case: Where does the Sesame Credit score come from?

The following example uses FICO, a personal credit scoring method.

It is similar to Sesame Credit and is used to assess an individual's credit status. Credit scores from the FICO scoring system range from 300 to 850. The higher the score, the lower the credit risk.

Let us simulate how FICO could have been invented. This is a regression task in supervised learning.


Step 1: Frame the problem, select the model

We first identify the factors that influence personal credit. Logically, a person's weight should have nothing to do with credit: among the trustworthy people we know, some are heavy and some are thin.

Total wealth, on the other hand, does seem to be related to credit. Ma Yun would lose an enormous amount by breaking his word, which is why no one has ever heard of Ma Yun failing to pay a credit card bill! For someone with almost nothing, the cost of breaking their word is small: if they can no longer get by on one street, they simply move on to the next.

Following this kind of reasoning, 5 influencing factors were identified:

  • Payment record
  • Total account amount
  • Credit history span (length of credit history since an account was opened, length of credit history since a specific type of account was opened...)
  • New accounts (number of recently opened accounts, share of recently opened accounts among specific types of accounts...)
  • Credit category (number of accounts)

At this point we have built a simple model:

credit score = f(payment record, total account amount, credit history span, new accounts, credit category)

f can be understood simply as a specific formula that relates the 5 factors to a personal credit score.

Our goal is to find the concrete form of f, so that we can compute a person's credit score as soon as we have that person's 5 data items.
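One simple form that f could take, purely as an illustrative assumption (the text does not say what form FICO's actual formula has), is a weighted sum of the five factors. Here x_1 to x_5 stand for the five factors, and the weights w_1 to w_5 and the offset w_0 are symbols introduced only for this sketch:

```latex
\hat{y} = f(x_1, \dots, x_5) = w_0 + \sum_{i=1}^{5} w_i x_i
```

Under that assumption, "finding the concrete formula" means finding values for w_0, ..., w_5 that make the predicted score \hat{y} match known credit scores as closely as possible.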


Step 2: Collecting known data

To find the formula f, we need to collect a large amount of known data. Each record must contain a person's 5 data items and that person's credit status (converted to a score).

We split the data into several parts: one for training, and the others for validation and testing.
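A minimal sketch of such a split is shown below. The 60/20/20 proportions, the function name `split_dataset`, and the synthetic records are assumptions made for the example; the text does not prescribe any particular ratio.

```python
import random

def split_dataset(records, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle records and split them into training, validation and test parts."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is the test part
    return train, validation, test

# Hypothetical records: (5 factor values, known credit score).
records = [([i, i + 1, i + 2, i + 3, i + 4], 300 + i) for i in range(100)]
train, validation, test = split_dataset(records)
print(len(train), len(validation), len(test))
```

Shuffling before splitting matters when the collected data is ordered in some way, for example by date or by score, because otherwise the parts would not be comparable.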


Step 3: Train the ideal model

With this data, machine learning lets us "guess" the relationship between the five types of data and the credit score. This relationship is the formula f.
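How the machine "guesses" f depends on the model chosen. As a hedged sketch, assume f is a weighted sum of the 5 factors and fit its weights by batch gradient descent on the squared error. The linear form, the factor values scaled to the range 0..1, the made-up "true" weights used to generate the data, and the names `predict` and `fit` are all assumptions for illustration, not FICO's actual method.

```python
import random

def predict(weights, bias, factors):
    """Candidate formula f: a weighted sum of the 5 factors plus a bias."""
    return bias + sum(w * x for w, x in zip(weights, factors))

def fit(records, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on half the mean squared error."""
    n_features = len(records[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for factors, target in records:
            error = predict(weights, bias, factors) - target
            grad_b += error
            for i, x in enumerate(factors):
                grad_w[i] += error * x
        # Move against the average gradient.
        bias -= learning_rate * grad_b / len(records)
        weights = [w - learning_rate * g / len(records)
                   for w, g in zip(weights, grad_w)]
    return weights, bias

# Synthetic training data generated from made-up weights (not real FICO values).
rng = random.Random(0)
true_weights, true_bias = [200.0, 150.0, 100.0, 50.0, 50.0], 300.0
records = []
for _ in range(100):
    factors = [rng.random() for _ in range(5)]   # each factor scaled to 0..1
    records.append((factors, predict(true_weights, true_bias, factors)))

weights, bias = fit(records)
print([round(w, 2) for w in weights], round(bias, 2))
```

Because the synthetic scores were generated without noise, the fitted weights should end up close to the made-up ones. Real data is noisy, so the fit would only be approximate.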

Then we use the validation data and the test data to check whether the formula works.

The specific validation and testing procedure is:

  1. Plug the 5 data items into the formula and calculate the credit score
  2. Compare the calculated credit score with the person's actual credit score (prepared in advance)
  3. Evaluate the accuracy of the formula, and adjust it if the error is large
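These three steps can be sketched as follows. The candidate formula, its weights, the test records, and the use of mean absolute error as the accuracy measure are all hypothetical choices made for this example.

```python
def mean_absolute_error(formula, records):
    """Average gap between predicted and actual scores over the records."""
    errors = [abs(formula(factors) - actual) for factors, actual in records]
    return sum(errors) / len(errors)

def candidate_formula(factors):
    # Hypothetical formula f; the weights and bias are placeholders, not FICO's.
    weights = [200.0, 150.0, 100.0, 50.0, 50.0]
    return 300.0 + sum(w * x for w, x in zip(weights, factors))

# Hypothetical test records: (5 factor values scaled to 0..1, actual score).
test_records = [
    ([1.0, 1.0, 1.0, 1.0, 1.0], 840.0),
    ([0.5, 0.5, 0.5, 0.5, 0.5], 585.0),
    ([0.0, 0.0, 0.0, 0.0, 0.0], 320.0),
]

# Steps 1 and 2: compute each score and compare it with the actual one.
mae = mean_absolute_error(candidate_formula, test_records)
# Step 3: judge the accuracy; the 10-point threshold is an arbitrary assumption.
needs_adjustment = mae > 10.0
print(f"mean absolute error: {mae:.1f} points, adjust formula: {needs_adjustment}")
```

Here the three predictions are off by 10, 10, and 20 points, so the average error is about 13.3 points and the formula would be sent back for adjustment.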


Step 4: Forecasting new users

When we want to know the credit status of a new user, we only need to collect that user's 5 data items and plug them into the formula f to calculate the result.

That is a regression model that touches everyone's life. The general idea is as described above, with the whole process simplified. For the complete process, see "Machine learning - 7 steps for machine learning".


"Classification" case: How to predict divorce

Dr. Gottman, an American psychologist, used big data to uncover the truth about marriage. His method follows the idea of classification.

After watching and listening to a couple talk for 5 minutes, Dr. Gottman can predict whether they will divorce, with an accuracy of 94%! His research has also been published as a book, "Happy Marriage" (rated 8.4 on Douban).


Step 1: Frame the problem, select the model

Gottman proposed that a couple's conversation reflects the potential problems between them, and that their arguing, laughter, teasing, and displays of emotion during the conversation create a kind of emotional connection. Based on these emotional connections, couples can be divided into different types, each with a different probability of divorce.


Step 2: Collecting known data

The researchers invited 700 couples to take part in the experiment. Each couple sat alone in a room and talked about a contentious topic, such as money, sex, or relationships with in-laws. Murray and Gottman had each couple keep talking about the topic for 15 minutes and filmed the process. After watching the videos, observers scored the husband and the wife based on their conversation.


Step 3: Train the ideal model

Gottman did not use machine learning to obtain his results, but the principle is similar. He reached the following conclusions:

First, the scores of husband and wife are plotted on a chart, and where the two lines cross indicates whether the marriage will last. If the husband or the wife keeps scoring negative points, the two are likely headed for divorce. The key is to quantify the ratio of positive to negative interactions in the conversation. The ideal ratio is 5:1; if the ratio is lower than this, the marriage is in trouble. Finally, the results are fed into a mathematical model that uses difference equations to highlight the underlying characteristics of a successful marriage.
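The 5:1 ratio check can be illustrated with a small sketch. The scoring scheme is an assumption: each exchange in the conversation is given a hypothetical rating of +1 (positive) or -1 (negative), which is not necessarily how Gottman's observers scored the videos, and the function names are invented for the example.

```python
def positive_negative_ratio(ratings):
    """Ratio of positive to negative ratings in one conversation."""
    positives = sum(1 for r in ratings if r > 0)
    negatives = sum(1 for r in ratings if r < 0)
    if negatives == 0:
        return float("inf")   # no negative exchanges at all
    return positives / negatives

def at_risk(ratings, ideal_ratio=5.0):
    """True when the ratio falls below the ideal 5:1 mentioned in the text."""
    return positive_negative_ratio(ratings) < ideal_ratio

steady_couple = [1, 1, 1, 1, 1, -1]    # 5 positive, 1 negative -> ratio 5.0
strained_couple = [1, 1, -1, 1, -1]    # 3 positive, 2 negative -> ratio 1.5

print(at_risk(steady_couple))    # False
print(at_risk(strained_couple))  # True
```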

Gottman divided the couples into 5 groups based on their scores:

  1. Happy couple: Calm, intimate, mutually supportive, and friendly. They prefer to share experiences.
  2. Invalid couple: They do their best to avoid conflict and only respond positively to each other.
  3. Changeable couple: They are romantic and passionate, and their arguments can become very heated. They are sometimes stable and sometimes unstable, but on the whole they are not very happy.
  4. Hostile couple: One party does not want to talk about something and the other goes along with it, so there is no communication between the two.
  5. Uninvolved couple: One party is eager to argue, but the other is not interested in the topic.

The mathematical model shows the difference between the two stable types (harmonious couples and couples who are not harmonious) and the two unstable types (hostile couples and uninvolved couples). It predicts that unstable couples may stay married even though their marriage is unstable.


Step 4: Forecasting new users

Over the following 12 years, Murray and Gottman got in touch with the 700 couples who took part in the study every one or two years. Their formula predicted divorce with an accuracy of 94%.


Mainstream supervised learning algorithms

| Algorithm | Type | Description |
|---|---|---|
| Naive Bayes | Classification | Bayesian classification is a statistical classification method based on Bayes' theorem. It classifies by predicting the probability that a given tuple belongs to a particular class. Naive Bayes assumes that the effect of an attribute value on a given class is independent of the other attributes (class-conditional independence). |
| Decision tree | Classification | A decision tree is a simple but widely used classifier. It builds a tree from the training data and uses it to classify unknown data. |
| SVM | Classification | The support vector machine turns the classification problem into the problem of finding a separating plane, and classifies by maximizing the distance between the plane and the points closest to the class boundary. |
| Logistic regression | Classification | Logistic regression handles regression problems whose dependent variable is categorical. It is most commonly used for binary (binomial) problems and can also handle multi-class problems. Despite its name, it is a classification method. |
| Linear regression | Regression | Linear regression is one of the most commonly used algorithms for regression tasks. Its form is very simple: it fits the data set with a hyperplane (a straight line when there are only two variables). |
| Regression tree | Regression | A regression tree (a type of decision tree) learns hierarchically by repeatedly splitting the data set into branches. Each split is chosen to maximize the gain from the split (the source says information gain; for regression targets this is typically measured as a reduction in variance or squared error). This branching structure lets the regression tree learn nonlinear relationships naturally. |
| K-nearest neighbors | Classification + regression | New data points are predicted by searching the entire training set for the K most similar instances (neighbors) and summarizing the output variables of those K instances. |
| AdaBoost | Classification + regression | The goal of AdaBoost is to learn a series of weak (basic) classifiers from the training data and then combine them into one strong classifier. |
| Neural networks | Classification + regression | They abstract the neuron network of the human brain from the perspective of information processing, establish a simple model, and form different networks according to different connection patterns. |
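To make one entry of the list above concrete, here is a minimal sketch of K-nearest neighbors used for both classification and regression. The Euclidean distance, the majority vote, the plain average, and all data points are assumptions chosen for illustration; practical implementations offer other distance measures and weighting schemes.

```python
import math
from collections import Counter

def k_nearest(training_set, query, k):
    """Return the k training examples closest to query (Euclidean distance)."""
    return sorted(training_set, key=lambda ex: math.dist(ex[0], query))[:k]

def knn_classify(training_set, query, k=3):
    """Classification: majority vote among the labels of the k nearest neighbors."""
    labels = [label for _, label in k_nearest(training_set, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(training_set, query, k=3):
    """Regression: average of the target values of the k nearest neighbors."""
    values = [value for _, value in k_nearest(training_set, query, k)]
    return sum(values) / len(values)

# Made-up labeled points for classification: (features, category).
labeled = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
# Made-up points for regression: (features, numeric value).
valued = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

print(knn_classify(labeled, (1.1, 1.0)))  # the 3 nearest points are all "A"
print(knn_regress(valued, (2.1,)))        # averages the values 20, 30 and 10
```

The same neighbor search serves both tasks; only the way the K outputs are summarized changes, which is why the algorithm appears under "Classification + regression".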



Baidu Encyclopedia and Wikipedia

Baidu Encyclopedia version

Supervised learning refers to the process of using a set of samples of known categories to adjust the parameters of a classifier until it reaches the required performance. It is also known as supervised training or learning with a teacher.

Supervised learning is a machine learning task that infers a function from labeled training data. The training data consists of a set of training examples. In supervised learning, each example consists of an input object (usually a vector) and a desired output value (also called the supervisory signal). The supervised learning algorithm analyzes the training data and produces an inferred function that can be used to map new examples. An optimal solution allows the algorithm to correctly determine the class labels of unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.

Wikipedia version

Supervised learning is a machine learning task that learns a function mapping an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (usually a vector) and a desired output value (also called the supervisory signal). The supervised learning algorithm analyzes the training data and produces an inferred function that can be used to map new examples. An optimal solution allows the algorithm to correctly determine the class labels of unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.
