
Have you heard of the "wisdom of the crowd"? Not the TV show, but the actual term. No? Okay, imagine you ask a complicated question to a large crowd of people and then aggregate their answers.

You may find that this aggregated answer is often better than an expert's answer.

"The wisdom of the crowd" is the collective opinion of a group of individuals rather than that of a single expert.

--Wikipedia

Back in the field of machine learning, we can apply the same idea. For example, if we aggregate the predictions of a group of predictors (such as classifiers or regressors), the result may be more accurate than that of the single best predictor.

A group of predictors is called an ensemble. Therefore, this machine learning technique is called ensemble learning.

In nature, a group of trees forms a forest. Now suppose you train a group of decision tree classifiers, each making its own prediction based on a different random subset of the training set. To predict the class of a new observation, you take the class predicted by the majority of the trees. In other words, you are using an ensemble of decision tree classifiers, which is commonly known as a random forest.

This article will cover the most popular ensemble methods, including bagging, boosting, stacking, and more. Before diving in, please keep in mind:

Ensemble methods work best when the predictors are as independent of one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, which improves the ensemble's accuracy.

——Excerpt from Chapter 7 of Hands-On Machine Learning with Scikit-Learn & TensorFlow

Simple ensemble techniques

Hard Voting Classifier

This is the simplest example of this kind of technique, and you may already be familiar with it. Voting classifiers are typically used for classification problems. Suppose you have trained several classifiers (a logistic regression classifier, an SVM classifier, a random forest classifier, and so on) and fit each of them to the training set.

An easy way to create an even better classifier is to aggregate the predictions made by each classifier and take the majority vote as the final prediction. Basically, we can think of it as taking the mode of all the predictions.
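Below is a minimal sketch of hard voting using scikit-learn's VotingClassifier. The moons dataset and the specific hyperparameters are illustrative assumptions, not something prescribed by the article.

```python
# Hard voting sketch: aggregate three different classifiers by majority vote.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(random_state=42)),
    ],
    voting="hard",  # majority vote, i.e. the mode of the individual predictions
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))  # often beats each individual classifier
```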

Averaging

The first example above mainly applies to classification problems. Now let's look at a technique for regression problems: averaging. Similar to hard voting, we use different algorithms to make multiple predictions and take their average as the final prediction.
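Here is a small averaging sketch on synthetic regression data; the particular choice of regressors is just an assumption for illustration.

```python
# Averaging sketch: train a few different regressors and average their predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = [LinearRegression(),
          DecisionTreeRegressor(random_state=42),
          KNeighborsRegressor()]
predictions = [model.fit(X_train, y_train).predict(X_test) for model in models]

# The ensemble prediction is simply the element-wise mean of all predictions.
y_pred = np.mean(predictions, axis=0)
```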

Weighted average

In this method, when averaging, each model is assigned a different weight according to its importance to the final prediction.
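A minimal sketch of a weighted average is shown below; the prediction values and the weights are made-up numbers used only for illustration.

```python
# Weighted-average sketch: each model's prediction is weighted by its assumed importance.
import numpy as np

# Suppose three models produced these predictions for the same three instances.
pred_a = np.array([10.0, 20.0, 30.0])
pred_b = np.array([12.0, 18.0, 33.0])
pred_c = np.array([11.0, 22.0, 28.0])

weights = [0.5, 0.3, 0.2]  # reflect each model's assumed importance
y_pred = np.average([pred_a, pred_b, pred_c], axis=0, weights=weights)
print(y_pred)  # element-wise weighted mean
```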

Advanced ensemble techniques

Stacking

This technique, also known as stacked generalization, is based on the idea of training a model to perform the aggregation that, so far, we have done with simple rules such as voting or averaging.

We have N predictors, each of which makes its own prediction. A meta-learner (or blender) then takes these N predictions as input and makes the final prediction.

Let's see how it works.

A common way to train the meta-learner is to use a hold-out set. First, split the training set into two subsets. The first subset is used to train the first-layer predictors.

The predictors trained on the first subset are then used to make predictions on the second (held-out) subset. This ensures the predictions are "clean", since these predictors never saw those instances during training.

Using these clean predictions as input features, we can create a new training set (with one feature per first-layer predictor) while keeping the original target values.

The meta-learner is then trained on this new data set, so it learns to predict the target value given the first layer's predictions.

In addition, you can train several meta-learners in this way (for example, one using linear regression, another using random forest regression, and so on), obtaining a whole layer of blenders. To do this, the training set must be divided into three or more subsets: the first is used to train the first-layer predictors, the second is used to generate the out-of-sample predictions that form the new data set for the blender layer, and the third is used to generate the predictions that train the final meta-learner.
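Here is a rough sketch of the basic hold-out stacking procedure described above, on synthetic data; the particular first-layer predictors and the linear-regression blender are assumptions chosen for illustration.

```python
# Stacking sketch: first-layer predictors plus a meta-learner (blender)
# trained on their hold-out predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# Split the training data into two subsets: one for the first layer,
# one held out to build the meta-learner's training set.
X_sub1, X_sub2, y_sub1, y_sub2 = train_test_split(X, y, test_size=0.5, random_state=42)

# 1) Train the first-layer predictors on the first subset.
predictors = [DecisionTreeRegressor(random_state=42), SVR(), LinearRegression()]
for p in predictors:
    p.fit(X_sub1, y_sub1)

# 2) Predict the held-out subset: these "clean" predictions become the
#    input features of a new training set (one feature per predictor).
blend_features = np.column_stack([p.predict(X_sub2) for p in predictors])

# 3) Train the meta-learner (blender) on the new data set.
blender = LinearRegression()
blender.fit(blend_features, y_sub2)

def stacked_predict(X_new):
    """First layer predicts, then the blender aggregates those predictions."""
    features = np.column_stack([p.predict(X_new) for p in predictors])
    return blender.predict(features)
```

Note that scikit-learn also ships StackingClassifier and StackingRegressor, which automate this procedure using cross-validated predictions rather than a single hold-out split.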

Bagging and Pasting

Another approach is to use the same algorithm for every predictor (for example, a decision tree classifier), but train each one on a different random subset of the training set, which can yield a more robust overall result.

(Image excerpted from the article "Using adaptive random forests to reduce the spatial scale of precipitation".)

Subsets can be created by sampling with or without replacement. When sampling with replacement, some observations may appear in several subsets (or several times in the same one); this method is called bagging (short for bootstrap aggregating). When sampling without replacement, every observation within a given subset is unique, so there are no duplicate observations inside a subset; this method is called pasting.

Once all the predictors have been trained, the ensemble can predict a new instance by aggregating the predictions of all trained predictors, just as we saw with the hard voting classifier. Although each individual predictor has a higher bias than a predictor trained on the full original data set, aggregation reduces both bias and variance.

With pasting, the models can end up producing similar results because they see largely the same data. Bootstrapping introduces more diversity into each subset, so bagging ends up with a slightly higher bias than pasting; but the extra diversity also means the predictors are less correlated, which reduces the ensemble's variance. In short, bagging usually yields better models, which explains why it is generally preferred over pasting.
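The sketch below shows bagging with scikit-learn's BaggingClassifier; setting bootstrap=False turns it into pasting. The dataset and hyperparameters are illustrative assumptions.

```python
# Bagging/pasting sketch: many decision trees trained on random subsets of the data.
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),  # the same base algorithm for every predictor
    n_estimators=500,
    max_samples=100,           # size of each random subset
    bootstrap=True,            # True = bagging (with replacement); False = pasting
    random_state=42,
)
bag_clf.fit(X_train, y_train)
print(bag_clf.score(X_test, y_test))
```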

Boosting

If the first model, and the next one (and perhaps all of them), predict a particular data point incorrectly, will combining their results give a better prediction? This is where boosting comes in.

Boosting (originally called hypothesis boosting) refers to any ensemble method that can combine several weak learners into a strong learner. It is a sequential process in which each model tries to correct the errors of the model that came before it.

No single model performs well over the entire data set, but each one does perform well on some part of it. With boosting, therefore, each model can be expected to actually boost the performance of the whole ensemble.
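As an example of this sequential process, here is a minimal AdaBoost sketch with scikit-learn; the decision-stump base learner, dataset, and hyperparameters are assumptions chosen for illustration.

```python
# AdaBoost sketch: weak learners (decision stumps) are trained sequentially,
# each focusing more on the instances its predecessors got wrong.
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # a "decision stump" weak learner
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
ada_clf.fit(X_train, y_train)
print(ada_clf.score(X_test, y_test))
```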

Algorithms based on Bagging and Boosting

The most common ensemble learning techniques are bagging and boosting. Below are some of the most popular algorithms based on each; a quick usage sketch follows the lists.

Bagging algorithms:

· Random Forest (https://medium.com/diogo-menezes-borges/random-forests-8ae226855565)

Boosting algorithms:

· AdaBoost(https://medium.com/diogo-menezes-borges/boosting-with-adaboost-and-gradient-boosting-9cbab2a1af81)

· Gradient Boosting Machine (GBM) (https://medium.com/diogo-menezes-borges/boosting-with-adaboost-and-gradient-boosting-9cbab2a1af81)

· XGBoost (https://medium.com/diogo-menezes-borges/boosting-with-adaboost-and-gradient-boosting-9cbab2a1af81)

· Light GBM (https://medium.com/diogo-menezes-borges/boosting-with-adaboost-and-gradient-boosting-9cbab2a1af81)
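As a quick illustration, the sketch below trains one bagging-based model (random forest) and one boosting-based model (GBM) from the lists above using scikit-learn; the data and settings are assumptions, and XGBoost and LightGBM expose very similar APIs in their own packages.

```python
# Rough comparison sketch: a bagging-based vs. a boosting-based ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = [
    ("Random Forest (bagging)", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("Gradient Boosting (boosting)", GradientBoostingClassifier(n_estimators=200,
                                                                learning_rate=0.1,
                                                                random_state=42)),
]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: {scores.mean():.3f}")
```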