So far we have put a lot of work into building and optimizing machine learning models. After all that hard work, a question naturally arises: how do we compare the models we have built? If we compare model A with model B, which one wins, and why? Or can the two models be combined to improve performance?

A very superficial approach is to compare overall accuracy on the test set. For example, model A has an accuracy of 94% and model B has an accuracy of 95%, so we rashly conclude that model B is better. In fact, comparing two models involves many more aspects than overall accuracy.

This article explains the statistics in plain language, so it is a good read for anyone who is not very comfortable with statistics but wants to learn more.

**1. "Understand" the data**

If possible, it is a good idea to produce some plots that show what is actually going on. Plotting at this stage may seem odd, but plots can provide insights that numbers alone cannot.

In one project, two machine learning models were compared on the same test set to see how accurately each predicted the tax on users' documents. A sensible approach was to aggregate the data by user id and calculate, for each model, the proportion of tax amounts it predicted correctly.
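The per-user aggregation described here can be sketched in a few lines of Python. This is only an illustration: the record layout, the sample values, and the rule that a prediction is "correct" when it falls within a small tolerance of the true amount are all assumptions, not details from the project.

```python
from collections import defaultdict

# Hypothetical records: (user_id, true_tax, model_a_prediction, model_b_prediction).
# The field layout and the values are made up for illustration.
records = [
    ("u1", 10.0, 10.0, 10.0),
    ("u1", 12.5, 12.5, 12.0),
    ("u2", 7.0, 6.5, 7.0),
    ("u2", 3.2, 3.2, 3.2),
    ("u3", 8.0, 8.0, 8.0),
]


def per_user_accuracy(rows, tolerance=0.01):
    """Return {user_id: (accuracy_a, accuracy_b)}.

    A prediction counts as correct when it is within `tolerance`
    of the true tax amount (an assumed definition of "correct").
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [n_documents, hits_a, hits_b]
    for user_id, truth, pred_a, pred_b in rows:
        entry = counts[user_id]
        entry[0] += 1
        entry[1] += abs(pred_a - truth) <= tolerance
        entry[2] += abs(pred_b - truth) <= tolerance
    return {user: (a / n, b / n) for user, (n, a, b) in counts.items()}


accuracy_by_user = per_user_accuracy(records)
for user, (acc_a, acc_b) in sorted(accuracy_by_user.items()):
    print(user, acc_a, acc_b)
```

The result is one pair of accuracies per user, which is exactly the paired structure that the tests discussed later operate on.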

If the data set is large enough, it can be split into different regions so that the analysis focuses on smaller subsets, and the accuracy may vary from subset to subset. This is usually the case with very large data sets, because it is unrealistic to digest a huge amount of data at once, let alone draw reliable conclusions from it (the problem of sample size is discussed later). One great advantage of a big data set is that, besides containing a large amount of information, it lets you zoom in and investigate what happens on a particular subset of the data.
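As a rough sketch of this kind of subset comparison, the snippet below groups per-user accuracies by a region label and reports the mean accuracy of each model and their difference. The region names and numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-user results: (region, accuracy_a, accuracy_b).
rows = [
    ("north", 0.90, 0.95),
    ("north", 0.85, 0.95),
    ("south", 0.92, 0.92),
    ("south", 0.88, 0.88),
]


def accuracy_by_subset(rows):
    """Return {region: (mean_a, mean_b, mean_b - mean_a)}."""
    grouped = defaultdict(list)
    for region, acc_a, acc_b in rows:
        grouped[region].append((acc_a, acc_b))

    summary = {}
    for region, pairs in grouped.items():
        mean_a = mean(a for a, _ in pairs)
        mean_b = mean(b for _, b in pairs)
        summary[region] = (mean_a, mean_b, mean_b - mean_a)
    return summary


summary = accuracy_by_subset(rows)
for region, (mean_a, mean_b, diff) in sorted(summary.items()):
    print(region, round(mean_a, 3), round(mean_b, 3), round(diff, 3))
```

In this made-up data model B is ahead in one region and level with model A in the other, a pattern that overall accuracy would blur.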

At this point we have reason to suspect that one of the models performs better on some subsets, while the two perform about the same on others. This is already a big step forward from comparing overall accuracy. The suspicion can be investigated further through hypothesis testing, which is better at detecting differences than the human eye. We have only a limited amount of data in the test set, and we might wonder how the accuracies would change if the models were compared on a different test set. Unfortunately, a different test set is not always available, so knowing some statistics helps in studying the nature of the models' accuracies.

**2. Hypothesis testing: start now!**

This may seem trivial at first glance, and you have probably seen it before:

1. Establish H0 and H1

2. Choose a test statistic, assuming it is normally distributed

3. Calculate the p-value

4. If p <= 0.05, reject H0, and you are done!

In practice, hypothesis testing is more complicated and tricky. People often apply it without enough care and end up misinterpreting the results. Let us go through it step by step:

**Step 1:** Establish H0, the null hypothesis: there is no statistically significant difference between the two models. Establish H1, the alternative hypothesis: there is a statistically significant difference between the accuracies of the two models. It is up to you to decide whether to test model A != model B (two-sided test), or model A < model B or model A > model B (one-sided test).
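The choice between a two-sided and a one-sided alternative changes how the p-value is computed from the same test statistic. A minimal sketch, assuming a hypothetical statistic `z` that follows a standard normal distribution under H0 and is positive when model A scores higher than model B:

```python
from statistics import NormalDist


def p_values_from_z(z):
    """Return the p-value under each alternative for a standard normal statistic z."""
    std_normal = NormalDist()
    return {
        "two-sided": 2 * (1 - std_normal.cdf(abs(z))),  # H1: A != B
        "greater": 1 - std_normal.cdf(z),               # H1: A > B
        "less": std_normal.cdf(z),                      # H1: A < B
    }


p = p_values_from_z(1.96)
for alternative, value in p.items():
    print(alternative, round(value, 4))
```

The same statistic is borderline under the two-sided alternative, clearly significant under "greater", and not significant at all under "less", which is why the alternative should be fixed before looking at the data.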

**Step 2:** Choose a test statistic that quantifies, in the observed data, the behavior that distinguishes the null hypothesis from the alternative hypothesis. There are many options, and even the best statisticians cannot know every one of the countless statistical tests that exist. Don't worry! There are many assumptions and facts to consider, and once you know your data you can choose the appropriate test. The key is to understand how hypothesis testing works; the actual test statistic is just a tool that software computes easily.

Keep in mind that certain assumptions need to be met before any statistical test is applied. You can look up the assumptions required for each test; most real-life data does not strictly satisfy all of them, so a little flexibility is acceptable. But what if the data deviates significantly from the normal distribution?

There are two main categories of statistical tests: parametric and nonparametric. In short, the main difference is that parametric tests require certain assumptions about the population distribution, while nonparametric tests are more robust because they do not rely on such assumptions (no parameters needed).

In the project described above, the natural choice would have been the paired-sample t-test (https://www.statisticssolutions.com/manova-analysis-paired-sample-t-test/), but because the data was not normally distributed, the Wilcoxon signed-rank test (https://www.statisticssolutions.com/how-to-conduct-the-wilcox-sign-test/), a nonparametric test for paired samples, was used instead. You are free to decide which test statistic to use in your analysis, but always make sure its assumptions are met.

**Step 3:** Determine the p-value. The concept of the p-value is somewhat abstract: it is a number that measures the evidence against the null hypothesis, namely the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. The stronger the evidence against the null hypothesis, the smaller the p-value. If the p-value is small enough, we have good reason to reject the null hypothesis.

Fortunately, the p-value is easy to obtain in Python or R, so you do not have to calculate it by hand. Here the hypothesis test was done in R, because it has more options available. The R call and its output are shown below. On subset 2 we do get a small p-value, but the confidence interval is not very informative.

    > wilcox.test(data1, data2, conf.int = TRUE, alternative = "less", paired = TRUE, conf.level = .95, exact = FALSE)
    V = 1061.5, p-value = 0.008576
    alternative hypothesis: true location shift is less than 0
    95 percent confidence interval:
           -Inf -0.008297017
    sample estimates:
    (pseudo)median
       -0.02717335
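For readers who prefer Python, the same kind of paired, one-sided test can be sketched with the standard library alone. The function below is an illustrative implementation of the Wilcoxon signed-rank test using the normal approximation with a continuity correction. It is not the code used in the project, and the accuracy values are invented. In practice a statistics library would normally be used instead.

```python
import math
from statistics import NormalDist


def wilcoxon_signed_rank(x, y, alternative="two-sided"):
    """Paired Wilcoxon signed-rank test, normal approximation.

    Returns (v, p_value), where v is the sum of the ranks of the positive
    differences x - y. Zero differences are dropped, tied absolute
    differences share their average rank, and a continuity correction
    is applied.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |d| in ascending order, averaging ranks within groups of ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        group_size = j - i + 1
        tie_term += group_size**3 - group_size
        i = j + 1

    v = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    centered = v - n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48)
    correction = {
        "two-sided": math.copysign(0.5, centered) if centered else 0.0,
        "greater": 0.5,
        "less": -0.5,
    }[alternative]
    z = (centered - correction) / sigma
    cdf = NormalDist().cdf(z)
    p_value = {
        "two-sided": 2 * min(cdf, 1 - cdf),
        "greater": 1 - cdf,
        "less": cdf,
    }[alternative]
    return v, p_value


# Hypothetical per-user accuracies in percent for models A and B.
acc_a = [80, 82, 79, 85, 81, 78, 83, 80]
acc_b = [84, 85, 80, 90, 83, 84, 82, 87]

v, p_less = wilcoxon_signed_rank(acc_a, acc_b, alternative="less")
_, p_greater = wilcoxon_signed_rank(acc_a, acc_b, alternative="greater")
print(v, round(p_less, 4), round(p_greater, 4))
```

Testing the wrong direction on the same data gives a p-value close to 1, so it is worth checking that the stated alternative matches the question being asked.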

**Step 4:** This step is simple. If the p-value is less than the chosen alpha (usually 0.05), there is reason to reject the null hypothesis and accept the alternative hypothesis. Otherwise, there is not enough evidence to reject the null hypothesis, but this does not mean that the null hypothesis is true. It may still be false, with the data simply not providing enough evidence to reject it. If alpha is 0.05 = 5%, the risk of wrongly concluding that a difference exists when there is none (a Type I error) is only 5%.
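The meaning of alpha as the Type I error rate can be checked with a small simulation. The sketch below is an assumption-laden illustration: it draws paired differences from a normal distribution with true mean 0 and known standard deviation 1, so that H0 holds by construction, applies a two-sided z-test, and counts how often H0 is rejected anyway.

```python
import math
import random
from statistics import NormalDist


def false_positive_rate(alpha=0.05, n_pairs=50, n_trials=2000, seed=0):
    """Fraction of simulated trials in which a true H0 is rejected."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(n_trials):
        # Differences have true mean 0 and known sd 1, so H0 is true.
        diffs = [rng.gauss(0.0, 1.0) for _ in range(n_pairs)]
        z = (sum(diffs) / n_pairs) * math.sqrt(n_pairs)
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        rejections += p_value < alpha
    return rejections / n_trials


rate = false_positive_rate()
print(rate)
```

The observed rejection rate should land close to the chosen alpha, which is the sense in which alpha is the risk you agree to take.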

You might ask: why not set alpha to 1% instead of 5%? Because that makes the analysis more conservative: it becomes harder to reject the null hypothesis (and rejecting it is what we are aiming for).

The most commonly used alpha values are 5%, 10%, and 1%, but you can choose any alpha value you want. It depends on how much risk you are willing to take.

Can alpha be 0%, so that there is no chance of making a Type I error? No. In practice there is always some chance of making a mistake, so choosing 0% is meaningless. It is always good to leave some room for error.

If you are tempted to "p-hack", you could increase alpha until the null hypothesis is rejected, but then you have to settle for a lower confidence level (as alpha increases, the confidence level decreases; you cannot have both).

**3. Post-hoc analysis: statistical significance vs. practical significance**

If the resulting p-value is very small, it does mean that the accuracies of the two models differ in a statistically significant way. In the previous example we did get a small p-value, so mathematically the models certainly differ, but "significant" does not mean "important". Does the difference really matter? Is such a small difference relevant to the business problem?

Statistical significance means that the mean difference observed in the sample is unlikely to be due to sampling error. Given a large enough sample, statistical significance can be found even when the difference in the population is negligible. Practical significance, on the other hand, asks whether the difference is large enough to have value in a real-world sense. Statistical significance is strictly defined, while practical significance is more intuitive and subjective.

At this point you may have realized that the p-value is not as powerful as you thought. More investigation is needed, and the effect size should be considered as well. The effect size measures the magnitude of the difference: if there is a statistically significant difference, we are likely to be interested in how large it is. The effect size emphasizes the size of the difference rather than confounding it with the sample size. Do not confuse the two.

    > abs(qnorm(p_value)) / sqrt(n)
    0.14
    # the effect size is small

What counts as a small, medium, or large effect size? The conventional thresholds are 0.1, 0.3, and 0.5 respectively, but it really depends on your business problem.
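The R one-liner and the thresholds above translate directly into Python. In this sketch the p-value is the one reported by the test earlier, while `n = 100` is a purely hypothetical sample size, since the article does not state the real one. The result therefore illustrates the calculation and is not meant to reproduce the 0.14 shown above.

```python
import math
from statistics import NormalDist


def effect_size_r(p_value, n):
    """r = |Z| / sqrt(n), with Z recovered from a one-sided p-value (0 < p < 1)."""
    z = NormalDist().inv_cdf(p_value)
    return abs(z) / math.sqrt(n)


def label_effect(r, small=0.1, medium=0.3, large=0.5):
    """Classify r using the conventional thresholds quoted in the text."""
    if r >= large:
        return "large"
    if r >= medium:
        return "medium"
    if r >= small:
        return "small"
    return "negligible"


r = effect_size_r(p_value=0.008576, n=100)  # n is a made-up sample size
label = label_effect(r)
print(round(r, 3), label)
```

The thresholds are exposed as parameters because, as noted, the right cut-offs depend on the business problem.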

What about the sample size? If the sample is too small, the result is not reliable; that much is unsurprising. But what if the sample is too large? That sounds great, but in that case even a very small difference can be detected by a hypothesis test. With so much data, even tiny deviations are flagged as significant. This is where the effect size becomes useful.
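A quick calculation shows how sample size alone can drive the p-value. The sketch below assumes a two-sided z-test on paired differences with a fixed, tiny mean difference of 0.001 and a standard deviation of 0.05; both numbers are made up. Only the number of pairs changes.

```python
import math
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n):
    """Two-sided z-test p-value for a mean paired difference over n pairs."""
    z = mean_diff / (sd / math.sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


mean_diff, sd = 0.001, 0.05           # hypothetical values
standardized_diff = mean_diff / sd    # does not depend on n

p_by_n = {n: two_sided_p(mean_diff, sd, n) for n in (100, 10_000, 1_000_000)}
for n, p in p_by_n.items():
    print(n, p, standardized_diff)
```

The standardized difference stays at 0.02 throughout, while the p-value moves from clearly non-significant to essentially zero as the sample grows.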

There is more that could be done, such as determining the power of the test and the optimal sample size, but this is enough for now.

Done right, hypothesis testing can be very useful for model comparison. The general steps are to establish the null hypothesis (H0) and the alternative hypothesis (H1), calculate the test statistic, and find the p-value, but interpreting the results requires intuition, creativity, and a deeper understanding of the business problem.

Keep in mind that if the test is based on a very large test set, a relationship found to be statistically significant may not have much practical significance. Do not blindly trust the magical p-value: it is always a good idea to zoom in on the data and carry out a post-hoc analysis.
