In this article, I want to explain one of the most important concepts in machine learning and data science, one that we encounter after training a machine learning model. It is a topic every practitioner must know.

This article is intended to explain the following topics:

  1. What is overfitting in a machine learning project?
  2. How do we detect overfitting?
  3. How do we solve the overfitting problem?
Photo by Isaac Smith on Unsplash

Introduction: What is overfitting?

Let's first establish the foundations of the concept.

Suppose you want to predict future price movements of stocks.

Then, you decide to collect the stock's daily historical prices for the past 10 days and plot them on a scatter plot, as follows:

The figure above shows that the actual stock price is random.

To capture the stock price movements, you evaluate and collect data on the following 16 features that you know the stock price depends on:

  1. Industry performance
  2. Company news release
  3. Company income
  4. Company profit
  5. Company's future announcement
  6. Company dividend
  7. Current and future contract size of the company
  8. Company's M&A status
  9. Company management information
  10. Current contract of the company
  11. Future contract of the company
  12. Inflation
  13. Interest rate
  14. Foreign exchange rate
  15. Investor sentiment
  16. Company's competitor

After collecting, cleaning, scaling, and transforming the data, you split it into training and test data sets. You then feed the training data to your machine learning model to train it.

After training the model, you decide to test its accuracy by passing in the test data set.
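As a minimal sketch of this workflow (the variable names X and y for the prepared features and prices, and the choice of linear regression as the model, are my assumptions, not the article's):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# X: one row per day with the 16 features; y: the stock price (hypothetical data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)  # train on the training set only
print(model.score(X_test, y_test))  # R^2 score on the unseen test set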

What do we expect to see?

Expected result

The figure above shows that the actual stock price is random, while the predicted stock price is a smooth curve. The prediction does not fit itself too closely to the training set, so it can generalize better to unseen data.

But suppose that when you plot the predicted stock price, you encounter one of the following charts instead:

1. A straight line as the predicted price

Straight line shows predicted price

What does it show?

It shows that the algorithm has a very strong preconception about the data: it is highly biased. This is called underfitting. Underfit models are not suitable for predicting new data.

2. A line that fits the training data extremely tightly

Overfitting result

What does it show?

This is the other extreme. It may look as though the model is doing a great job of predicting the stock price, but this is called overfitting. It is also known as high variance: the model has fit the training data so well that it does not generalize to new, unseen data. Such models are not suitable for predicting new data; if we feed new data to the model, its accuracy turns out to be extremely poor.

Overfitting means that your model has overtrained on your training data. It may happen because there are too many features in the data, or because we did not provide enough data for the model to learn from. You can recognize it when, on the training set, the difference between the actual and predicted values is close to 0.

How to detect overfitting?

Models that over-adapt to the training data do not generalize well to new examples. They are poor at predicting unseen data.

Photo by Stephen Dawson on Unsplash

This means that they are very accurate during training but produce very poor results when predicting unseen data. If an error metric such as the mean squared error is very low during model training but degrades sharply on the test data set, your model is overfitting the data.
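As a rough sketch of this check (reusing the hypothetical model and train/test splits from earlier), you can compare the mean squared error on both sets:

from sklearn.metrics import mean_squared_error

train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))

# A training error far below the test error is a strong sign of overfitting
print(train_error, test_error)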

If you want to learn about the measures that can be used to evaluate the accuracy of machine learning models, check out this article: Must-Know Mathematical Measures For Every Data Scientist, which introduces the key mathematical formulas in easy-to-follow bullet points (medium.com).

How do we solve the overfitting problem?

We could randomly remove features and iteratively evaluate the algorithm's accuracy, but that is a very tedious and slow process.

There are basically four common ways to reduce overfitting.

1. Reduce the number of features:

The most obvious option is to reduce the number of features. You can compute the correlation matrix of the features and remove those that are highly correlated with each other:

import matplotlib.pyplot as plt

# Visualize the correlation matrix of the features in the DataFrame
plt.matshow(dataframe.corr())
plt.show()
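Building on that plot, here is one possible way to actually drop one feature from each highly correlated pair; the 0.9 threshold is an arbitrary illustration, not a rule from the article:

import numpy as np

# Upper triangle of the absolute correlation matrix (each pair counted once)
corr = dataframe.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Drop every feature correlated above 0.9 with an earlier feature
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
dataframe = dataframe.drop(columns=to_drop)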

2. Model selection algorithms:

You can use a model selection algorithm. These algorithms automatically choose the most important features.
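For example, a minimal sketch using scikit-learn's SelectFromModel (choosing a lasso estimator here is my assumption; other estimators work too):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Let a lasso model decide which features keep a non-zero weight
selector = SelectFromModel(Lasso(alpha=0.1))
selector.fit(X_train, y_train)
X_train_selected = selector.transform(X_train)  # only the surviving features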

The problem with these techniques is that we may end up losing valuable information.

3. Provide more data:

Your goal should be to provide enough data so that the model can be fully trained, tested, and validated. Aim to use 60% of the data to train the model, 20% to test it, and 20% to validate it.
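A minimal sketch of such a 60/20/20 split (again assuming hypothetical arrays X and y) chains two calls to scikit-learn's train_test_split:

from sklearn.model_selection import train_test_split

# First carve off 60% for training, leaving 40% aside
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.6)

# Split the remaining 40% in half: 20% for testing, 20% for validation
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5)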

4. Regularization:

The purpose of regularization is to keep all of the features but impose a constraint on the magnitude of the coefficients.

It is often preferred because you do not have to discard any features: they are penalized instead. When a constraint is applied to the parameters, the model is less prone to overfitting because it produces a smoother function.

A regularization parameter, called the penalty factor, is introduced. It constrains the coefficients and ensures that the model does not overtrain itself on the training data.

The coefficients are pushed towards smaller values to reduce overfitting: whenever a coefficient takes a large value, the regularization term penalizes the optimization function.
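Schematically (this notation is mine, not the article's), the objective a regularized model minimizes can be written as:

Cost = Loss(training data) + alpha * Penalty(coefficient magnitudes)

where alpha is the penalty factor: the larger it is, the more strongly large coefficients are punished.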

There are two common regularization techniques:

  1. Lasso

Lasso adds a penalty equal to the absolute value of the magnitude of the coefficients (an L1 penalty). As a result, some of the weights can shrink all the way to zero, which means that certain features end up not being used by the algorithm at all.

from sklearn import linear_model

# Lasso regression with an L1 penalty; alpha controls its strength
model = linear_model.Lasso(alpha=0.1)
model.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

2. Ridge

Ridge adds a penalty equal to the square of the magnitude of the coefficients (an L2 penalty). This shrinks the weights towards zero without eliminating them entirely, ensuring that no single feature imposes an excessively high weight on the algorithm's prediction.

from sklearn.linear_model import Ridge

# Ridge regression with an L2 penalty; X is the feature matrix, y the target
model = Ridge(alpha=1.0)
model.fit(X, y)
Photo by Sergey Pesterev on Unsplash

Final Thoughts

This article highlighted a key topic that we encounter after training a machine learning model. It outlined the following key parts:

  1. What is overfitting in a machine learning project?
  2. How do we detect overfitting?
  3. How do we solve the overfitting problem?
