Random forest is an ensemble algorithm composed of decision trees, and it performs well in many cases.
This article introduces the basic concepts of random forests, the 4 steps for constructing one, a comparative evaluation of 4 implementations, 10 advantages and disadvantages, and 4 application directions.
What is a random forest?
Random forest is a Bagging (short for Bootstrap AGGregatING) method in ensemble learning. The relationship between them is as follows: ensemble learning includes the Bagging family of methods, and random forest is one of the Bagging methods.
Decision Tree
Before explaining random forests, we need to introduce the decision tree. A decision tree is a very simple algorithm. It is highly interpretable and matches human intuition. It is a supervised learning algorithm based on if-then-else rules. The picture above illustrates the logic of a decision tree.
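To make the if-then-else idea concrete, here is a toy sketch of a decision tree written by hand. The function name, features, and thresholds are invented for illustration; a real decision tree learns its rules from labeled training data.

```python
def classify_fruit(color: str, diameter_cm: float) -> str:
    """A hand-written decision tree: every branch is an if-then-else rule."""
    if color == "red":
        # Second-level rule, only reached for red fruit.
        if diameter_cm < 3.0:
            return "cherry"
        return "apple"
    # Non-red fruit is split on size alone.
    if diameter_cm < 3.0:
        return "grape"
    return "lemon"


print(classify_fruit("red", 7.5))
print(classify_fruit("green", 1.5))
```

Each path from the first test to a returned label corresponds to one root-to-leaf path in the tree.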
Random Forest | RF
A random forest is made up of many decision trees, and the individual trees are built independently of one another.
In a classification task, each new input sample is passed to every decision tree in the forest, and each tree classifies it separately. Each tree produces its own classification result, and the random forest takes the class predicted by the most trees as the final result.
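The voting step can be sketched in a few lines. This is a minimal illustration that assumes each tree has already produced a label; `majority_vote` is a hypothetical helper name, not part of any library.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the most trees."""
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label


# Five trees vote on one input sample.
votes = ["cat", "dog", "cat", "cat", "dog"]
print(majority_vote(votes))  # prints: cat
```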
4 steps to construct a random forest
- From a training set of size N, draw N times with replacement, one sample per draw, to form a set of N samples. These N samples are used to train one decision tree, serving as the samples at its root node.
- Suppose each sample has M attributes. Whenever a node of the decision tree needs to be split, randomly select m of these M attributes, where m << M. Then use some criterion (such as information gain) to choose 1 of these m attributes as the split attribute for the node.
- While the decision tree is being grown, every node is split according to step 2 until it can't be split any further. (For example, if the attribute selected for a node is the one that was just used to split its parent node, the node has reached a leaf and doesn't need to be split further.) Note that no pruning is done at any point while the tree is being grown.
- Repeat steps 1 to 3 to build a large number of decision trees, which together constitute the random forest.
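The four steps above can be sketched in plain Python. This is a simplified illustration, not a reference implementation: all function names and the toy data are invented, features are assumed to be numeric and split by a threshold, and a node stops splitting when it is pure or when none of the sampled attributes can divide it (a simpler stopping rule than the one described in step 3).

```python
import math
import random
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Among feature_ids, find the (feature, threshold) with the highest information gain."""
    base, best = entropy(labels), None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # this threshold does not divide the node
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if best is None or base - weighted > best[0]:
                best = (base - weighted, f, t)
    return best


def build_tree(rows, labels, m, rng):
    if len(set(labels)) == 1:
        return labels[0]  # pure node: make it a leaf
    feature_ids = rng.sample(range(len(rows[0])), m)  # step 2: m random attributes out of M
    split = best_split(rows, labels, feature_ids)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # cannot split: majority-label leaf
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    # Step 3: keep splitting the children; no pruning is applied.
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], m, rng),
            build_tree([r for r, _ in right], [y for _, y in right], m, rng))


def predict_tree(node, row):
    while isinstance(node, tuple):  # internal nodes are tuples, leaves are labels
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def build_forest(rows, labels, n_trees, m, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):  # step 4: repeat steps 1 to 3 many times
        idx = [rng.randrange(n) for _ in range(n)]  # step 1: N draws with replacement
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy data: M = 4 numeric attributes, two well-separated classes.
rows = [[1.0, 1.2, 0.9, 1.1], [1.1, 0.8, 1.0, 0.9], [0.9, 1.0, 1.2, 1.0],
        [5.0, 5.2, 4.9, 5.1], [5.1, 4.8, 5.0, 4.9], [4.9, 5.0, 5.2, 5.0]]
labels = ["low", "low", "low", "high", "high", "high"]

forest = build_forest(rows, labels, n_trees=25, m=2)
print(predict_forest(forest, [0.5, 0.5, 0.5, 0.5]))
print(predict_forest(forest, [6.0, 6.0, 6.0, 6.0]))
```

With only M = 4 attributes, m = 2 is not really "much smaller" than M; on wider data a common choice for classification is m close to the square root of M.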
Advantages and disadvantages of random forests
Advantages
- It can handle very high-dimensional data (many features), with no need for dimensionality reduction or feature selection
- It can judge the importance of each feature
- It can detect interactions between different features
- It is not prone to overfitting
- Training is relatively fast and easy to parallelize
- It is relatively simple to implement
- For imbalanced datasets, it can balance the error
- If a large proportion of the feature values is missing, accuracy can still be maintained
Disadvantages
- Random forests have been shown to overfit on some noisy classification or regression problems
- When attributes take different numbers of distinct values, attributes with more values have a greater influence on the random forest, so the attribute weights it produces on such data are not reliable
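One way to see where the bias toward many-valued attributes can come from is to score two meaningless attributes by information gain. The sketch below assumes a multi-way split (one branch per distinct value), which is a simplification and not necessarily how a given random forest implementation splits; the data and names are invented.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of splitting into one branch per distinct attribute value."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]
coin = ["h", "h", "h", "h", "t", "t", "t", "t"]  # 2 values, unrelated to the label
row_id = list(range(8))                          # 8 distinct values, also meaningless

print(information_gain(coin, labels))    # 0.0
print(information_gain(row_id, labels))  # 1.0
```

Neither attribute carries real information, yet the one with a unique value per row gets the maximum possible gain, which is why importance scores on such data deserve caution.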
Comparison test of 4 random forest implementations
Random forest is a commonly used machine learning algorithm that can be applied to both classification and regression problems. The test compares the random forest implementations in scikit-learn, Spark MLlib, DolphinDB, and XGBoost. The evaluation metrics are memory usage, running speed, and classification accuracy.
The test results are as follows:
The test procedure and notes are omitted here. If you are interested, see the original article, "Random forest algorithm 4 implementation comparison test: DolphinDB is the fastest, XGBoost is the worst performer".
4 application directions for random forests
Random forests can be used in many places:
- Classification of discrete values
- Regression of continuous values
- Clustering in unsupervised learning
- Outlier (anomaly) detection
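For the regression case, the article does not say how the trees' outputs are combined. A common approach, assumed here, is to average the individual trees' numeric predictions instead of voting; `forest_regress` is a hypothetical helper name.

```python
from statistics import mean


def forest_regress(tree_outputs):
    """Combine per-tree numeric predictions by averaging them."""
    return mean(tree_outputs)


# Three trees predict a continuous value for the same input sample.
print(forest_regress([2.0, 3.0, 4.0]))  # prints: 3.0
```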