This article is reproduced from the SIGAI WeChat official account. Original address

Over nearly 40 years, the field of machine learning has produced tens of thousands of papers and continues to grow by tens of thousands per year. But not many can truly be called classics that have withstood the test of history and can be put to practical use. This article sorts out the classic papers in the history of machine learning, ordered by citation count and divided into four parts: the top 10, papers cited more than 20,000 times, papers cited more than 10,000 times, and papers with future potential. They have been, or will be, written into textbooks on machine learning, deep learning, and artificial intelligence, and are valuable treasures left to us by generations of researchers. Note that citation counts are unfair to articles that appeared only in recent years; they are still in a period of rapid growth. But good wine is good wine, and it only grows more fragrant with time.

## The 10 most cited articles

#### No. 1 - the EM algorithm

Arthur P. Dempster, Nan M. Laird, Donald B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 1977.

**Citations: 55,989**

What surprised the author is that the number one is not the support vector machine, ensemble learning, deep learning, decision trees, or other historically famous algorithms, but EM. This is the original paper of the EM algorithm, cited more than 50,000 times! The EM algorithm appears in many versions of the "top 10 machine learning algorithms" rankings. It is mathematically beautiful and simple to implement, a powerful tool for maximum likelihood and maximum a posteriori estimation with hidden variables, and it has been applied successfully in Gaussian mixture models and hidden Markov models. The principle is described in detail in SIGAI's earlier article "Understanding the EM Algorithm".
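The idea can be made concrete with a toy sketch (not taken from the paper): EM for a two-component one-dimensional Gaussian mixture with known unit variances, where the E-step computes responsibilities and the M-step re-estimates the means and mixing weight. The data, initialization, and iteration count are invented for this example.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, mu_init=(0.0, 6.0), weight=0.5, iters=50):
    """EM for a mixture of two unit-variance Gaussians (means and weight unknown)."""
    mu1, mu2 = mu_init
    for _ in range(iters):
        # E-step: responsibility of component 2 for each point
        gamma = []
        for x in data:
            p1 = (1.0 - weight) * normal_pdf(x, mu1)
            p2 = weight * normal_pdf(x, mu2)
            gamma.append(p2 / (p1 + p2))
        # M-step: re-estimate the means and the mixing weight
        n2 = sum(gamma)
        n1 = len(data) - n2
        mu1 = sum((1.0 - g) * x for g, x in zip(gamma, data)) / n1
        mu2 = sum(g * x for g, x in zip(gamma, data)) / n2
        weight = n2 / len(data)
    return mu1, mu2, weight

data = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.2]
m1, m2, w = em_two_gaussians(data)
```

With the two clusters well separated, the estimated means settle near the cluster averages after a few iterations.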

#### No. 2 - logistic regression

David W Hosmer, Stanley Lemeshow. Applied logistic regression. Technometrics. 2000.

**Citations: 55,234**

It represents the mountain of linear models. This is not the original paper on logistic regression, which was proposed decades earlier, yet this reference has accumulated this many citations. Although it is a book rather than a paper, its citation count is even higher than the famous PRML. This matches our intuition: logistic regression is simple but practical, and in engineering, the simpler things are often the more useful ones.
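To illustrate just how simple the model is, here is a minimal sketch of logistic regression trained by batch gradient descent (the toy data, learning rate, and epoch count are invented for this example; real work would use a library implementation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of the log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
```

After training, points on the positive side of the learned boundary get probability above 0.5 and points on the negative side below 0.5.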

#### No. 3 - random forest

Breiman, Leo. Random Forests. Machine Learning 45 (1), 5-32, 2001.

**Citations: 42,608**

It represents the mountain of ensemble learning. Breiman's random forest and classification and regression trees rank 3rd and 4th, and random forests rank higher than the AdaBoost algorithm. Again, random forests are simple but easy to use. Ensemble learning, bagging, and random forests are described in detail in SIGAI's earlier article "Random Forest Overview".

#### No. 4 - classification and regression trees

Breiman, L., Friedman, J., Olshen, R., and Stone, C. Classification and Regression Trees. Wadsworth, 1984.

**Citations: 39,580**

This is the original paper on classification and regression trees (CART), representing the mountain of decision trees. Among the various decision trees, CART should be the most widely used, and today it serves as the weak learner in random forests, AdaBoost, and gradient boosting. The elder Breiman passed away in 2005, but he left us trees and forests. This algorithm is described in detail in SIGAI's earlier article "Understanding Decision Trees".
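To make the split criterion concrete, here is a sketch of how a CART-style node could pick a threshold on one feature by minimizing weighted Gini impurity (the toy data and function names are invented for this example):

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Exhaustively try thresholds; return (threshold, weighted impurity)."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
t, score = best_split(xs, ys)
```

On this perfectly separable toy data the search finds the threshold between 3 and 10, with zero impurity on both sides.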

#### No. 5 - the support vector machine open-source library LIBSVM

C.-C. Chang and C.-J. Lin. LIBSVM: a Library for Support Vector Machines. ACM TIST, 2:27:1-27:27, 2011.

**Citations: 38,386**

This article introduces the LIBSVM open-source library. Its citation count exceeds that of the original support vector machine paper, so it should be regarded as the most popular SVM implementation. The authors are Professor Chih-Jen Lin of National Taiwan University and his students. Many students doing machine learning research and products have surely used it. SVM was introduced in detail in the earlier SIGAI articles "Understanding SVM" and "Understanding the Role of SVM Kernel Functions and Parameters".

#### No. 6 - statistical learning theory

V. N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 1999.

**Citations: 36,117**

The only theory-level article in the top 10 comes from Vapnik, whose most influential results are support vector machines and the VC dimension. On the whole, however, machine learning theory articles receive relatively few citations; fewer researchers work in these directions, and most people still focus on concrete algorithms.

#### No. 7 - principal component analysis

Ian T. Jolliffe. Principal Component Analysis. Springer Verlag, New York, 1986.

**Citations: 35,849**

It represents the mountain of dimensionality reduction. This reference is not the original paper on principal component analysis; the original was published more than a century ago. The ranking befits PCA's status: it is widely used in data analysis across science and engineering. PCA was introduced in SIGAI's earlier article "Understanding Principal Component Analysis (PCA)".
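For reference, PCA's core optimization fits in one line of standard textbook notation (not quoted from this reference): with sample covariance matrix $C$, the first principal direction maximizes the variance of the projected data,

```latex
\mathbf{w}_1 = \arg\max_{\|\mathbf{w}\|=1} \mathbf{w}^{\top} C\, \mathbf{w},
\qquad
C = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^{\top},
```

and the solution is the eigenvector of $C$ with the largest eigenvalue; subsequent components are the remaining eigenvectors in decreasing eigenvalue order.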

#### No. 8 - the C4.5 decision tree

J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA, 1993.

**Citations: 34,703**

Another decision tree. Decision trees are simple, practical, and highly interpretable, an important achievement of the early days of machine learning.

#### No. 9 - deep convolutional neural networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. 2012.

**Citations: 34,574**

It represents the mountain of deep learning. This is the pioneering work on deep convolutional neural networks, carrying forward Yann LeCun's convolutional neural network. Reaching such a citation count with an article published in 2012 is no easy feat: the wine was barely brewed and already famous. Coming from Hinton's group, that is not surprising, and such citations are driven by the current deep learning craze. Again, there are no complicated formulas or theories here, yet the method is astonishingly effective.

#### No. 10 - support vector machines

Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, 20, 273-297, 1995.

**Citations: 33,540**

It represents linear models and the mountain of the kernel trick; this is the official SVM paper. It feels a bit strange to see the support vector machine ranked only 10th: for nearly 20 years in the machine learning world, whenever the subject came up, people spoke of SVM.
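For reference, the soft-margin primal problem in standard textbook notation (not quoted verbatim from the paper):

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\;
\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad
y_i\bigl(\mathbf{w}^{\top}\mathbf{x}_i + b\bigr) \ge 1 - \xi_i,\;\; \xi_i \ge 0 .
```

Maximizing the margin while penalizing slack $\xi_i$ is what gives SVM its good generalization; the kernel trick enters through the dual form, where inputs appear only via inner products.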

Summing up the top 10 literature, we can see that simplicity is beauty. The algorithms proposed in these papers involve no complicated mathematical formulas or impenetrable theories, yet they are the most classic, because they are useful! They embody a deeper kind of philosophical thinking. The same is true in other scientific fields: the most classic theorems and formulas in mathematics are also beautiful and concise, and likewise in physics. Breiman and Vapnik each appear twice in the top 10.

## Literature with more than 20,000 citations

Beyond the top 10, some articles have been cited more than 20,000 times and are also classics.

Lawrence R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2): 257–286, 1989.

**Citations: 26,466**

It represents the mountain of probabilistic graphical models — at last, one appears. Over the past few decades, the most widely cited probabilistic graphical model has been the hidden Markov model (HMM). This article is not the original HMM paper, but it has become a classic: it explains the principles of HMMs and how to model speech recognition with them clearly and thoroughly.

MacQueen, J. B. Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1, University of California Press, pp. 281–297, 1967.

**Citations: 24,035**

It represents the mountain of clustering algorithms. This is the pioneering work on the k-means algorithm, which also appears in various rankings of the top 10 classic machine learning algorithms. It, too, is simple and easy to understand; a middle school student could follow it!
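A minimal sketch of the algorithm (Lloyd's iteration) in one dimension; the sample points and initial centers are invented for this example:

```python
def kmeans_1d(points, centers, iters=20):
    """Lloyd's algorithm in one dimension: assign, then re-center."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # assignment step: each point goes to its nearest center
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

pts = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = kmeans_1d(pts, [0.0, 10.0])
```

On this toy data the centers converge to the two cluster means after the first iteration.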

J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1): 81-106, 1986.

**Citations: 20,359**

The paper that introduced the ID3 decision tree; its status speaks for itself. Quinlan is another giant of decision trees.

## Literature with more than 10,000 citations

Roweis, Sam T. and Saul, Lawrence K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500): 2323-2326, 2000.

**Citations: 12,941**

Tenenbaum, Joshua B., De Silva, Vin, and Langford, John C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500): 2319-2323, 2000.

**Citations: 11,927**

The twin pillars of manifold learning: two masterpieces that opened up the field. Manifold learning was a very popular direction in its day. Both articles were published in Science, and anyone in the field knows how difficult it is to get computer science papers into Science or Nature. This family of algorithms is described in SIGAI's earlier article "Manifold Learning Overview".

Ronald A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7 Part 2: 179-188, 1936.

**Citations: 15,379**

This is the original paper on linear discriminant analysis, published in 1936, before World War II broke out.

C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Bell Laboratories, Lucent Technologies, 1997.

**Citations: 19,885**

A tutorial article on applying support vector machines to pattern recognition. SVM really was a fertile direction for churning out papers!

Yoav Freund, Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Computational Learning Theory, 1995.

**Citations: 16,431**

The classic paper on the AdaBoost algorithm, which together with SVM formed the machine learning duo of its era. It was the first widely influential ensemble learning algorithm, succeeding on problems such as face detection. It is described in detail in the earlier SIGAI articles "The AdaBoost Algorithm" and "Understanding the AdaBoost Algorithm".
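For reference, the two formulas at the heart of AdaBoost in standard textbook notation (not quoted from the paper): the coefficient of weak learner $h_t$ with weighted error $\varepsilon_t$, and the sample-weight update with normalizer $Z_t$,

```latex
\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t},
\qquad
w_i^{(t+1)} = \frac{w_i^{(t)}\,\exp\!\bigl(-\alpha_t\, y_i\, h_t(x_i)\bigr)}{Z_t}.
```

Misclassified samples get their weights increased, so each new weak learner concentrates on the examples the previous ones got wrong.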

Lafferty, J., McCallum, A., Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. 18th International Conf. on Machine Learning, Morgan Kaufmann, pp. 282–289, 2001.

**Citations: 11,978**

The classic paper on conditional random fields. The method has been applied successfully to natural language processing, image segmentation, and other problems, and is now combined with recurrent neural networks to solve key problems in natural language processing.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323: 533-536, 1986.

**Citations: 16,610**

The original paper, in the strict sense, of the backpropagation algorithm, published in Nature; its importance needs no explanation, and deep learning still relies on it today. Hinton's name appears again. The earlier SIGAI articles "Backpropagation Derivation - Fully Connected Neural Networks" and "Backpropagation Derivation - Convolutional Neural Networks" explain it in detail.

Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-366, 1989.

**Citations: 16,394**

A theoretical article on neural networks: the famous universal approximation theorem. It proves that a neural network with at least one hidden layer can approximate any continuous function on a compact set to any specified precision, providing a powerful theoretical guarantee for neural networks and deep learning.
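In modern notation, the theorem says: for any continuous $f$ on a compact set $K \subset \mathbb{R}^n$, a suitable squashing activation $\sigma$, and any $\varepsilon > 0$, there exist a width $N$ and parameters $v_i$, $\mathbf{w}_i$, $b_i$ such that

```latex
\sup_{\mathbf{x}\in K}\;
\Bigl|\, f(\mathbf{x}) \;-\; \sum_{i=1}^{N} v_i\,\sigma\!\bigl(\mathbf{w}_i^{\top}\mathbf{x} + b_i\bigr) \Bigr| \;<\; \varepsilon .
```

The sum is exactly the output of a one-hidden-layer feedforward network; the theorem guarantees existence of an approximator but says nothing about how large $N$ must be or how to find the parameters.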

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998.

**Citations: 16,339**

The original paper on the LeNet network, cited more often than LeCun's papers of 1989 and 1990 that first proposed convolutional neural networks. It also earned Yann LeCun the title of father of the convolutional neural network.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. Going Deeper with Convolutions. arXiv: http://arxiv.org/abs/1409.4842.

**Citations: 11,268**

The original paper on the GoogLeNet network; students doing deep learning all know it. For an article published in 2015 to have such citations is, of course, thanks to the popularity of deep learning.

K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations, 2015.

**Citations: 18,980**

The original paper on the VGG network, a classic convolutional network structure used everywhere, with far more citations than GoogLeNet.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. Computer Vision and Pattern Recognition, 2015.

**Citations: 17,285**

The original paper on residual networks; students doing deep learning all know it. At last, authors with Chinese names appear on the list. Keep it up!

S. Hochreiter, J. Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735-1780, 1997.

**Citations: 15,448**

The original LSTM paper, which made recurrent neural networks truly practical. Its authors have made important contributions to deep learning but stay so low-key that many people do not know them.

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.

**Citations: 13,817**

Another typical representative of clustering algorithms: the famous density-based clustering algorithm DBSCAN. The algorithm is simple but powerful, with no formula beyond middle-school mathematics. It is described in SIGAI's earlier article "Overview of Clustering Algorithms".

Dorin Comaniciu, Peter Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.

**Citations: 12,146**

The famous mean shift algorithm, likewise simple, elegant, and very effective. Students of machine learning and computer vision must know it, especially for target tracking in the vision field.
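The core of the procedure fits in one line of standard textbook notation (not quoted from the paper): each iteration moves the current point $\mathbf{x}$ to the kernel-weighted mean of the samples around it,

```latex
\mathbf{x} \;\leftarrow\;
\frac{\sum_{i} K(\mathbf{x}_i - \mathbf{x})\, \mathbf{x}_i}
     {\sum_{i} K(\mathbf{x}_i - \mathbf{x})},
```

which amounts to hill climbing on a kernel density estimate: the iterates converge to a local mode of the data density, hence its use for clustering and tracking.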

## Literature with future potential

The following articles have not yet passed 10,000 citations, but they are still very young and their future is promising, so they are listed separately. Note that although several of the reinforcement learning articles were published in the 1990s, we list them here as well: as deep learning research progresses, they will gradually show ever greater value.

Goodfellow Ian, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. Advances in Neural Information Processing Systems, 2672-2680, 2014.

**Citations: 6,902**

The pioneering work on generative adversarial networks (GANs), representing the mountain of deep generative models. The idea behind GANs is simple, elegant, and effective, and it has spawned a large number of improved algorithms and applications. The variational autoencoder (VAE) is a deep generative model second only to GAN, but its original paper's citation count is far below GAN's.

Richard Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1): 9-44, 1988.

**Citations: 5,108**

The founding paper of the temporal difference algorithm; its status needs little explanation.
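For reference, the TD(0) update at the heart of the method, in standard modern notation:

```latex
V(s_t) \;\leftarrow\; V(s_t) + \alpha\,\bigl[\, r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \,\bigr].
```

Instead of waiting for the final outcome, the value estimate is nudged toward the immediate reward plus the discounted estimate of the next state — learning a prediction from another, later prediction.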

Mnih, Volodymyr, et al. Human-level control through deep reinforcement learning. Nature. 518 (7540): 529-533, 2015.

**Citations: 4,570**

A heavyweight work of deep reinforcement learning from DeepMind. The first version appeared in 2013 with far fewer citations than this one, which was published in Nature and introduced the DQN algorithm.

David Silver, et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 2016.

**Citations: 4,123**

The original AlphaGo paper, which needs no explanation; everyone on Earth knows it.

Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4): 279–292, 1992.

**Citations: 8,308**

The original paper on Q-learning, which laid the foundation for the algorithm and is the basis of DQN.
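A toy sketch of tabular Q-learning on an invented chain world (states 0 to 3, actions left/right, reward 1 for reaching the right end; every parameter below is an arbitrary choice for the example, not from the paper):

```python
import random

def q_learning(n_states=4, episodes=200, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning on a chain: actions -1/+1, reward 1 at the right end."""
    random.seed(seed)
    Q = {(s, a): 0.0 for s in range(n_states) for a in (-1, +1)}
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy behavior policy
            if random.random() < eps:
                a = random.choice((-1, +1))
            else:
                a = max((-1, +1), key=lambda act: Q[(s, act)])
            s_next = min(max(s + a, 0), n_states - 1)
            r = 1.0 if s_next == n_states - 1 else 0.0
            # off-policy update: bootstrap toward the greedy value of s_next
            best_next = max(Q[(s_next, -1)], Q[(s_next, +1)])
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

Q = q_learning()
```

After training, the greedy policy moves right from every non-terminal state, and the values decay with distance from the goal by roughly a factor of gamma per step — exactly the fixed point that the paper's convergence proof guarantees.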

The algorithms listed in this article are explained in detail in the book "Machine Learning and Applications" (by Lei Ming, published by Tsinghua University Press).
