Understand Random Forest in One Article

Random forest is an ensemble algorithm composed of decision trees, and it performs well in many situations.

This article introduces the basic concept of random forests, the 4 steps to construct one, a comparison test of 4 implementations, 10 advantages and disadvantages, and 4 application directions.

 

What is a random forest?

Random forest is a Bagging (short for Bootstrap Aggregating) method in ensemble learning. The diagram below shows the relationship between them:

Random forest belongs to the Bagging family of methods in ensemble learning

 

Decision Tree

Illustration of a decision tree

Before explaining random forests, we first need to mention decision trees. The decision tree is a very simple algorithm: it is highly interpretable and matches human intuition. It is a supervised learning algorithm based on if-then-else rules. The picture above intuitively expresses the logic of a decision tree.
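To make the if-then-else idea concrete, here is a minimal sketch; the feature names and thresholds are hypothetical and chosen only for illustration:

```python
# A decision tree is just nested if-then-else rules over feature values.
# These thresholds are made up for illustration, not learned from data.
def classify_flower(petal_length_cm: float, petal_width_cm: float) -> str:
    if petal_length_cm < 2.5:      # first split
        return "setosa"
    elif petal_width_cm < 1.8:     # second split
        return "versicolor"
    else:                          # leaf node
        return "virginica"

print(classify_flower(1.4, 0.2))  # -> setosa
```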

Learn more: "Understand Decision Trees in One Article (3 steps + 3 typical algorithms + 10 advantages and disadvantages)"

 

Random Forest | RF

Illustration of a random forest

A random forest is made up of many decision trees, and there is no correlation between the different trees.

When we perform a classification task and a new input sample arrives, each decision tree in the forest classifies it independently. Each tree produces its own classification result, and the random forest takes the class chosen by the most trees as the final result.
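A minimal sketch of this voting behavior, assuming scikit-learn is available (the forest size and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Fit a small forest, then compare each tree's vote with the
# forest's final prediction for a single sample.
X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

sample = X[:1]
votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
print("individual tree votes:", votes)
print("forest prediction:", forest.predict(sample))
```

(Strictly speaking, scikit-learn averages the trees' class probabilities rather than counting hard votes, but for fully grown trees this reduces to the same majority result.)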

 

4 steps to construct a random forest


  1. From a training set of N samples, draw N times with replacement, one sample at a time, to obtain N samples. These N samples are used as the training data at the root node to train one decision tree.
  2. If each sample has M attributes, then whenever a node of the decision tree needs to split, randomly select m attributes from the M, satisfying the condition m << M. Then use some strategy (such as information gain) to select 1 of these m attributes as the node's split attribute.
  3. While the decision tree grows, split every node according to step 2, until no further split is possible (intuitively, if the attribute selected at a node is the one that was just used when its parent node split, the node has reached a leaf and need not split further). Note that no pruning is performed during the formation of the entire decision tree.
  4. Repeat steps 1~3 to create a large number of decision trees, which together constitute a random forest (a minimal code sketch of these steps follows below).
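A minimal sketch of the four steps, assuming scikit-learn's DecisionTreeClassifier as the base tree; the function names are illustrative, not from any particular library:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=100, seed=0):
    """Steps 1-4: bootstrap samples + unpruned trees with random splits."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    forest = []
    for _ in range(n_trees):
        # Step 1: draw N samples with replacement (a bootstrap sample).
        idx = rng.integers(0, N, size=N)
        # Steps 2-3: grow an unpruned tree; max_features="sqrt" makes each
        # split consider only a random subset of m ~ sqrt(M) attributes.
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest  # Step 4: the collection of trees is the random forest.

def forest_predict(forest, X):
    """Classify by majority vote across the trees."""
    votes = np.array([tree.predict(X) for tree in forest])
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```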

 

Advantages and disadvantages of random forests

Advantages

  1. It can handle very high-dimensional data (many features) without dimensionality reduction or feature selection
  2. It can estimate the importance of each feature (see the sketch after this list)
  3. It can assess interactions between different features
  4. It is not prone to overfitting
  5. Training is fast and easy to parallelize
  6. It is relatively simple to implement
  7. For unbalanced datasets, it can balance the error
  8. Accuracy can be maintained even when a large proportion of feature values are missing
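For point 2, a minimal sketch of feature-importance inspection, assuming scikit-learn (the dataset is chosen for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# A fitted forest exposes per-feature importance scores, derived from
# how much each feature reduces impurity across all trees.
data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

for name, score in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```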

Disadvantages

  1. Random forests have been shown to overfit on certain noisy classification or regression problems.
  2. For attributes that differ in their number of possible values, attributes with more values exert a greater influence on the random forest, so the attribute weights it produces on such data are not credible.

 

Comparison test of 4 random forest implementations

Random forest is a commonly used machine learning algorithm that can be applied to both classification and regression problems. This article compares and tests the random forest implementations of scikit-learn, Spark MLlib, DolphinDB, and XGBoost. Evaluation metrics include memory usage, running speed, and classification accuracy.

The test results are as follows:

Comparison test results of the 4 random forest implementations

The test process and details are omitted here. If you are interested, see the original article: "Comparison test of 4 random forest implementations: DolphinDB is the fastest, XGBoost performs the worst"

 

4 application directions for random forests


Random forests can be used in many places:

  1. Classification of discrete values
  2. Regression of continuous values (see the sketch after this list)
  3. Clustering (unsupervised learning)
  4. Anomaly (outlier) detection
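As one concrete direction, a minimal sketch of regression on continuous values, assuming scikit-learn's RandomForestRegressor (the dataset is illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# For regression, the forest's prediction is the mean of the
# individual trees' predictions rather than a majority vote.
X, y = load_diabetes(return_X_y=True)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(reg.predict(X[:3]))  # three continuous predictions
```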

 

Baidu Encyclopedia + Wikipedia

Baidu Encyclopedia version

In machine learning, a random forest is a classifier that contains multiple decision trees, and its output class is determined by the mode of the classes output by the individual trees. Leo Breiman and Adele Cutler developed the random forest algorithm.

Read More

Wikipedia version

A random forest or random decision forest is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.

Read More

 

Machine learning tutorial for liberal arts students, part 2: decision trees and random forests