Machine learning models need quantitative evaluation metrics so that we can tell which model performs better.
This article explains the confusion matrix for classification problems and the formulas for the common evaluation metrics in an easy-to-understand way. The metrics covered are: accuracy, precision, recall, F1, the ROC curve, and AUC.
Machine learning evaluation metrics
Anything we want to compare has to be evaluated, ideally with a quantitative indicator.
- The college entrance examination results are used to assess students' learning ability.
- The weight of the barbell is used to assess the strength of the muscles.
- Benchmark scores are used to evaluate the overall performance of a phone.
Machine learning has many evaluation metrics. With them we can compare models side by side and see which one performs better. Here is an overview:
Classification problem evaluation metrics:
- Accuracy
- Precision
- Recall
- F1 score
- ROC curve
- AUC
Regression problem evaluation metrics:
Classification problem diagram
To make the calculation of each metric easier to follow, we use a concrete classification example that shows the different situations that can occur.
We have 10 photos, 5 of males and 5 of females, as shown below:
Suppose a machine learning model judges gender. When we use it to decide whether a photo is "male", four cases can occur, as shown below:
- Actually male and judged to be male (correct)
- Actually male, but judged as female (wrong)
- Actually female, and judged to be female (correct)
- Actually female, but judged as male (wrong)
These four cases make up the classic confusion matrix, as shown below:
TP-True Positive: Actually male, and judged to be male (correct)
FN-False Negative: Actually male, but judged to be female (error)
TN-True Negative: Actually female, and judged to be female (correct)
FP-False Positive: Actually female, but judged to be male (error)
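These four terms can be confusing at first (especially the abbreviations), but they are easy to understand once we split each name into its two words: the second word (Positive/Negative) is what the model predicted, and the first word (True/False) says whether that prediction was correct, as shown below:
All evaluation metrics are computed from these four cases, so understanding them is the foundation.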
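The four cases can be counted directly from a list of actual labels and a list of predictions. The sketch below is only illustrative: the function name `confusion_counts` and the ten predictions are made up for this example and are not the output of the model described above.

```python
def confusion_counts(y_true, y_pred, positive="male"):
    """Return (TP, FN, TN, FP) for a binary classification problem."""
    tp = fn = tn = fp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1   # actually male, judged male
            else:
                fn += 1   # actually male, judged female
        else:
            if predicted == positive:
                fp += 1   # actually female, judged male
            else:
                tn += 1   # actually female, judged female
    return tp, fn, tn, fp


# Hypothetical predictions for the 10 photos (5 male, 5 female).
y_true = ["male"] * 5 + ["female"] * 5
y_pred = ["male", "male", "male", "male", "female",
          "female", "female", "female", "male", "male"]

tp, fn, tn, fp = confusion_counts(y_true, y_pred)
print(tp, fn, tn, fp)  # 4 1 3 2
```

The four counts always add up to the total number of samples, which is a quick sanity check on any confusion matrix.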
Detailed evaluation metrics
The following sections describe each classification metric and its formula:
Accuracy
The number of correctly predicted samples as a fraction of all samples:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy measures the overall rate of correct predictions, but it is not a good metric when the samples are imbalanced. As a simple example, suppose positive samples make up 90% of the data and negative samples 10%, so the classes are heavily imbalanced. In this case we only need to predict every sample as positive to get a high accuracy of 90%, even though we have not done any careful classification at all. This shows that under sample imbalance a high accuracy can be heavily inflated. In other words, when the samples are imbalanced, accuracy stops being informative.
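The 90/10 example can be checked with a few lines of arithmetic. This is a minimal sketch under assumed numbers: 100 samples in total and a "model" that labels everything positive.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Assumed data: 100 samples, 90 positive and 10 negative.
# Labelling every sample positive gets all 90 positives right (TP)
# and all 10 negatives wrong (FP); nothing is ever predicted negative.
tp, fn = 90, 0
fp, tn = 10, 0

always_positive_accuracy = accuracy(tp, tn, fp, fn)
print(always_positive_accuracy)  # 0.9
```

The accuracy is 90% even though TN is 0, meaning not a single negative sample was recognized.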
Precision
Of all samples predicted to be positive, the proportion that are actually positive:
Precision = TP / (TP + FP)
Precision and accuracy look similar but are two completely different concepts. Precision measures how accurate the predictions are among the samples predicted positive, while accuracy measures overall prediction accuracy across both positive and negative samples.
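To see the difference in numbers, the sketch below computes both metrics from the same confusion matrix. The counts are hypothetical values for the 10-photo example, and returning 0.0 when nothing is predicted positive is a convention chosen here, not something the formula prescribes.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); 0.0 if nothing was predicted positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for the 10-photo example ("male" is the positive class).
tp, fn, tn, fp = 4, 1, 3, 2

p = precision(tp, fp)         # 4 of the 6 photos judged "male" really are male
a = accuracy(tp, tn, fp, fn)  # 7 of all 10 photos are judged correctly
print(round(p, 3), a)  # 0.667 0.7
```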
Recall
Of all samples that are actually positive, the proportion that are predicted to be positive:
Recall = TP / (TP + FN)
An application scenario for recall: take defaults on online loans. Compared with good users, we care more about bad users and do not want to miss any of them. If we mistake too many bad users for good ones, the losses from future defaults can far exceed the interest earned from good users, causing serious losses. The higher the recall, the higher the probability that an actual bad user is caught. The idea is similar to the saying: better to wrongly flag a thousand than to let one slip through.
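A small sketch of the loan scenario: lowering the decision threshold flags more users and raises recall. The scores, labels, and the function name `recall_at` are all invented for illustration, and the function assumes at least one actual positive exists.

```python
def recall_at(threshold, labels, scores):
    """Recall = TP / (TP + FN) when scores >= threshold are flagged positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return tp / (tp + fn)


# Hypothetical risk scores for 8 borrowers; label 1 = bad user (positive class).
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

strict = recall_at(0.5, labels, scores)    # misses the bad user scored 0.3
lenient = recall_at(0.25, labels, scores)  # flags more users, catches every bad one
print(strict, lenient)  # 0.75 1.0
```

The lenient threshold also flags more good users by mistake, which is the price paid for the higher recall.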
If we plot the relationship between precision and recall, we get the following PR curve:
The two metrics pull against each other: raising one tends to lower the other. To combine the two and find a balance between them, we use the F1 score, the harmonic mean of precision and recall:
F1 = 2 × Precision × Recall / (Precision + Recall)
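F1 is the harmonic mean of precision and recall, so it stays low unless both are reasonably high. The sketch below uses two invented operating points; returning 0.0 when both inputs are zero is a convention chosen here to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two hypothetical operating points with the same arithmetic mean (0.75).
balanced = f1_score(0.75, 0.75)  # precision and recall are equal
lopsided = f1_score(0.5, 1.0)    # perfect recall, mediocre precision
print(balanced, round(lopsided, 3))  # 0.75 0.667
```

Both points average 0.75 arithmetically, but the lopsided one scores lower on F1, which is why F1 is used to reward a balance between the two.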
ROC curve and AUC
ROC and AUC are two more complex evaluation metrics. The article below explains them in great detail, so I quote parts of it directly.
The metric descriptions above also come from that article: "One article to thoroughly understand accuracy, precision, recall, true positive rate, false positive rate, ROC/AUC"
1. Sensitivity, specificity, true positive rate, false positive rate
Before formally introducing ROC/AUC, we need two more metrics. The choice of these two metrics is the reason ROC and AUC are insensitive to sample imbalance. They are sensitivity and (1 - specificity), also known as the true positive rate (TPR) and the false positive rate (FPR).
Sensitivity = TP/(TP+FN)
Specificity = TN/(FP+TN)
- Sensitivity and recall are in fact exactly the same thing under a different name.
- Since we care more about positive samples, we want to see how many negative samples are wrongly predicted as positive, so we use (1 - specificity) instead of specificity.
True positive rate (TPR) = sensitivity = TP/(TP+FN)
False positive rate (FPR) = 1 - specificity = FP/(FP+TN)
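As a quick check of these two formulas, the sketch below computes TPR and FPR from confusion-matrix counts. The counts are hypothetical values for the 10-photo example, and `tpr_fpr` is a name chosen for illustration.

```python
def tpr_fpr(tp, fn, tn, fp):
    """Return (TPR, FPR) computed from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)  # sensitivity, identical to recall
    fpr = fp / (fp + tn)  # equals 1 - specificity
    return tpr, fpr


# Hypothetical counts for the 10-photo example ("male" is the positive class).
tpr, fpr = tpr_fpr(tp=4, fn=1, tn=3, fp=2)
specificity = 3 / (2 + 3)  # TN / (FP + TN)
print(tpr, fpr)  # 0.8 0.4
```

Note that TPR only uses the counts from actual positives (TP, FN) and FPR only uses the counts from actual negatives (FP, TN).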
The following is a schematic of the true positive rate and the false positive rate. TPR and FPR are computed within the actual positives (label 1) and the actual negatives (label 0) respectively; that is, each one looks at a probability inside a single actual class. Because of this, they are not affected by whether the sample is balanced. Take the previous example again, with 90% positive samples and 10% negative samples. We know accuracy is inflated there, but TPR and FPR are different. TPR only looks at how many of the 90% positive samples are actually covered and has nothing to do with the 10%. Similarly, FPR only looks at how many of the 10% negative samples are wrongly covered and has nothing to do with the 90%. So if we evaluate from the viewpoint of each actual class, we avoid the sample imbalance problem, which is why TPR and FPR are chosen as the metrics behind ROC/AUC.
We can also look at it from another angle: conditional probability. Let X be the predicted value and Y the true value. These metrics can then be written as conditional probabilities:
Precision = P(Y=1 | X=1)
Recall = sensitivity = P(X=1 | Y=1)
Specificity = P(X=0 | Y=0)
The three formulas show that if we condition on the actual result (recall, specificity), we only need to consider one class of samples, whereas if we condition on the predicted result (precision), we need to consider both positive and negative samples. Therefore, metrics conditioned on the actual result are not affected by sample imbalance, while metrics conditioned on the predicted result are.
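This can be illustrated numerically. The sketch below assumes a model whose behaviour within each class is fixed (it catches 80% of positives and wrongly flags 20% of negatives) and only changes the class sizes; the numbers and the function name `metrics_for` are invented for the example.

```python
def metrics_for(n_pos, n_neg):
    """Metrics for a model that catches 80% of positives and wrongly
    flags 20% of negatives, whatever the class sizes are."""
    tp = n_pos * 8 // 10
    fn = n_pos - tp
    fp = n_neg * 2 // 10
    tn = n_neg - fp
    return {
        "tpr": tp / (tp + fn),        # P(X=1 | Y=1)
        "fpr": fp / (fp + tn),        # 1 - P(X=0 | Y=0)
        "precision": tp / (tp + fp),  # P(Y=1 | X=1)
    }


balanced = metrics_for(500, 500)
mostly_pos = metrics_for(900, 100)
mostly_neg = metrics_for(100, 900)

for name, m in [("balanced", balanced),
                ("90% positive", mostly_pos),
                ("10% positive", mostly_neg)]:
    print(name, m["tpr"], m["fpr"], round(m["precision"], 3))
# balanced 0.8 0.2 0.8
# 90% positive 0.8 0.2 0.973
# 10% positive 0.8 0.2 0.308
```

TPR and FPR stay at 0.8 and 0.2 in all three cases, while precision swings from about 0.31 to about 0.97 purely because the class ratio changed.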
2. ROC (receiver operating characteristic curve)
The ROC curve is the receiver operating characteristic curve. It was first used in radar signal detection to distinguish signal from noise. Later it was adopted to evaluate the predictive power of models; the ROC curve is derived from the confusion matrix.
The two metrics plotted in the ROC curve are the true positive rate and the false positive rate; the benefits of this choice were explained above. The horizontal axis is the false positive rate (FPR) and the vertical axis is the true positive rate (TPR). Below is a standard ROC graph.
Threshold problem of ROC curve
Like the PR curve above, the ROC curve is drawn by traversing all thresholds. As we sweep through the thresholds, the samples predicted positive and negative keep changing, and the corresponding point slides along the ROC curve.
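The threshold sweep can be sketched in a few lines. This is a simple, unoptimized illustration with invented scores and labels; `roc_points` is a name chosen here, and labels are assumed to be 0/1 with at least one sample of each class.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by using every distinct score
    as the threshold, from the highest score down to the lowest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Threshold above every score: nothing is predicted positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical model scores; label 1 = positive.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

points = roc_points(y_true, scores)
print(points[:4])  # [(0.0, 0.0), (0.0, 0.25), (0.0, 0.5), (0.25, 0.5)]
```

Each lower threshold either moves the point up (one more positive covered) or to the right (one more negative wrongly covered), and the last point is always (1.0, 1.0), where everything is predicted positive.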
How to judge the quality of the ROC curve?
Changing the threshold only changes the number of samples predicted positive and negative, that is, TPR and FPR; the curve itself does not change. So how do we judge whether a model's ROC curve is good? This goes back to our goal: FPR measures how much the model raises false alarms, and TPR measures how much of the actual positives the model covers. What we want, of course, is as few false alarms and as much coverage as possible. In summary: the higher the TPR and the lower the FPR (i.e., the steeper the ROC curve), the better the model performs. Refer to the following:
ROC curve ignores sample imbalance
Why the ROC curve can ignore sample imbalance was explained above. Let's show it again as an animation. We find that no matter how the ratio of red to blue samples changes, the ROC curve is unaffected.
3. AUC (area under the curve)
To compute the points on the ROC curve, we could evaluate a logistic regression model many times with different classification thresholds, but this is very inefficient. Fortunately, there is an efficient sorting-based algorithm that provides this information for us, called AUC (Area Under the Curve).
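The text does not spell out the sorting-based algorithm. One common sorting-based way to get the AUC is the rank-sum (Mann-Whitney) formula, sketched below as an assumption about what is meant rather than as the referenced method. The function names and data are invented, and the sketch assumes all scores are distinct (ties would need averaged ranks).

```python
def auc_by_ranks(y_true, scores):
    """AUC from a single sort (rank-sum formula). Assumes distinct scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Ranks start at 1 for the lowest score.
    pos_rank_sum = sum(rank for rank, i in enumerate(order, start=1)
                       if y_true[i] == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auc_by_pairs(y_true, scores):
    """Brute-force check: share of (positive, negative) pairs in which
    the positive sample gets the higher score."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1 for p in pos for n in neg if p > n)
    return wins / (len(pos) * len(neg))


# Hypothetical model scores; label 1 = positive.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

print(auc_by_ranks(y_true, scores), auc_by_pairs(y_true, scores))  # 0.8125 0.8125
```

The sort makes the first function roughly O(n log n), while the pairwise check compares every positive with every negative; both give the same value on this data.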
More interestingly, if we draw the diagonal, the area under it is exactly 0.5. The diagonal corresponds to judging response and non-response at random: the coverage of positive and negative samples should both be 50%, which represents a random effect. The steeper the ROC curve the better, so the ideal value is 1 (the whole unit square), and the worst case, random guessing, gives 0.5. AUC therefore generally lies between 0.5 and 1.
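The 0.5 and 1 reference values can be verified by computing the area with the trapezoidal rule. In this sketch the function name `area_under` and the `model` curve are illustrative; the points are (FPR, TPR) pairs sorted by FPR.

```python
def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


diagonal = [(0.0, 0.0), (1.0, 1.0)]             # random guessing
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # ideal classifier
# Hypothetical ROC points for some model.
model = [(0.0, 0.0), (0.0, 0.5), (0.25, 0.5), (0.25, 0.75),
         (0.5, 0.75), (0.5, 1.0), (1.0, 1.0)]

print(area_under(diagonal), area_under(perfect), area_under(model))  # 0.5 1.0 0.8125
```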
AUC's general judgment criteria
0.5 – 0.7: weak effect, but already very good for predicting stocks
0.7 – 0.85: average effect
0.85 – 0.95: good effect
0.95 – 1: very good effect, but generally unlikely
The physical meaning of AUC
The area under the curve aggregates performance across all possible classification thresholds. One way to interpret it is as the probability that the model ranks a random positive sample above a random negative sample. Take the following sample as an example, with the logistic regression predictions arranged in ascending order from left to right: