## What is the K-Nearest Neighbors (KNN) algorithm?

The KNN algorithm is simple and effective. The model representation for KNN is the entire training data set: nothing is learned beyond storing the data. Simple, right?

New data points are predicted by searching the entire training set for the K most similar instances (the neighbors) and summarizing the output variable of those K instances. For regression problems this might be the mean of the output variable; for classification problems it might be the mode (most common) class value.
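A minimal sketch of that procedure, covering both cases (the toy data, distance function, and k value here are illustrative, not from the text):

```python
from collections import Counter
import math

def euclidean(a, b):
    # Root of the summed squared differences, attribute by attribute.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k, task="classification"):
    # Rank every training instance by distance to the query and keep the k nearest.
    neighbors = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], query))[:k]
    labels = [label for _, label in neighbors]
    if task == "classification":
        return Counter(labels).most_common(1)[0][0]  # mode of the k labels
    return sum(labels) / k                           # mean of the k outputs

# Toy data: two well-separated clusters.
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X, y, (1.1, 1.0), k=3))  # → a
```

Note that all work happens at query time; "training" is just keeping `X` and `y` around.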

The trick is in how to measure the similarity between data instances. If your attributes are all on the same scale (for example, all in inches), the simplest technique is Euclidean distance, which you can compute directly from the differences between each pair of input variables.
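Concretely, Euclidean distance is the square root of the summed squared attribute differences (the points below are made up for illustration):

```python
import math

def euclidean(a, b):
    # sqrt of the sum of squared differences between corresponding attributes
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((0.0, 0.0), (3.0, 4.0)))  # → 5.0 (the classic 3-4-5 triangle)
```

If attributes are on different scales, they are usually normalized first so that no single attribute dominates the distance.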

KNN may require a lot of memory or space to store all the data, and it only performs computation when a prediction is needed (it is a "lazy" learner, deferring all work to query time). You can also update and curate your training instances over time to keep your predictions accurate.

The concept of distance or closeness can break down in very high dimensions (many input variables), which can hurt the algorithm's performance on your problem. This is called the curse of dimensionality. It is recommended that you use only the input variables most relevant to predicting the output variable.
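One way to see this effect is a small experiment (illustrative, not from the text): as the number of dimensions grows, the nearest and farthest random points from a query become almost equally distant, so "nearest" stops being informative.

```python
import math
import random

random.seed(0)

def distance_contrast(dim, n=200):
    """Ratio of farthest to nearest distance from a random query to n random points."""
    query = [random.random() for _ in range(dim)]
    points = [[random.random() for _ in range(dim)] for _ in range(n)]
    dists = [math.sqrt(sum((q - p) ** 2 for q, p in zip(query, pt)))
             for pt in points]
    return max(dists) / min(dists)

print(distance_contrast(2))    # large ratio: neighbors are clearly distinguishable
print(distance_contrast(500))  # ratio near 1: everything is roughly equally far away
```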

**Advantages:**

• The theory is mature and the idea is simple; it can be used for classification as well as regression.
• It can handle nonlinear classification problems.
• The training time complexity is O(n) — training amounts to storing the n samples.
• It makes no assumptions about the data distribution, can achieve high accuracy, and is not very sensitive to outliers.
• KNN is an online technique: new data can be added to the data set directly, without retraining.
• KNN is simple in theory and easy to implement.

**Disadvantages:**

• It handles class-imbalanced data sets poorly (where some categories have many samples and others few): when one category has far fewer samples than the others, predictions are biased toward the large categories.
• It requires a lot of memory, since the entire training set must be stored.
• For data sets with a large number of samples, the distance computation at prediction time is expensive.
• KNN repeats a full pass over the data for every classification.
• There is no theoretically optimal choice of the value k; in practice it is usually chosen with K-fold cross-validation.
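The last point can be sketched with a simple leave-one-out loop over candidate k values (the data and candidate list are illustrative; in practice you would use K-fold cross-validation from a library such as scikit-learn):

```python
from collections import Counter
import math

def knn_predict(X, y, query, k):
    # Indices of the k training points nearest to the query, majority vote on labels.
    neighbors = sorted(range(len(X)), key=lambda i: math.dist(X[i], query))[:k]
    return Counter(y[i] for i in neighbors).most_common(1)[0][0]

def loo_accuracy(X, y, k):
    # Leave-one-out: predict each point from all the others and count hits.
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)

X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
y = ["a", "a", "a", "b", "b", "b"]
best_k = max([1, 3, 5], key=lambda k: loo_accuracy(X, y, k))
print(best_k)
```

On this tiny set, k = 5 fails (removing a point leaves only two same-class neighbors against three from the other class), which is exactly the kind of effect cross-validation catches.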

## Baidu Encyclopedia version

The nearest-neighbor algorithm, or K-Nearest Neighbor classification algorithm, is one of the simplest methods in data-mining classification. "K nearest neighbors" means exactly that: each sample can be represented by its k closest neighbors.

The core idea of the kNN algorithm is that if the majority of the k nearest samples to a point in feature space belong to a certain category, then that point also belongs to this category and shares the characteristics of the samples in it. In making a classification decision, the method considers only the category of the nearest sample or samples. Because kNN relies mainly on a limited number of surrounding samples, rather than on discriminating between class domains, it is better suited than other methods to sample sets whose class domains cross or overlap.
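A common refinement of this majority vote, often used against the class-imbalance weakness noted earlier, is to weight each neighbor's vote by inverse distance, so that a close neighbor from a small class can outvote distant neighbors from a large one. This is a standard variant, not something the text itself prescribes; the data below is made up:

```python
import math
from collections import defaultdict

def weighted_knn_predict(X, y, query, k):
    # Keep the k nearest points, then let each vote with weight 1/distance.
    neighbors = sorted(zip(X, y), key=lambda p: math.dist(p[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        votes[label] += 1.0 / (math.dist(point, query) + 1e-9)  # closer = heavier vote
    return max(votes, key=votes.get)

# Imbalanced toy set: one "rare" point right next to the query,
# three "common" points farther away.
X = [(0.0, 0.0), (2.0, 0.0), (2.0, 0.5), (2.0, -0.5)]
y = ["rare", "common", "common", "common"]
print(weighted_knn_predict(X, y, (0.1, 0.0), k=4))  # → rare
```

An unweighted majority vote over the same four neighbors would have returned "common"; the inverse-distance weighting lets the single nearby sample win.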