What is the k-nearest neighbor (KNN) algorithm?

The KNN algorithm is very simple and very effective. The model representation of KNN is the entire training data set. Simple, right?

New data points are predicted by searching the entire training set for the K most similar instances (the neighbors) and summarizing the output variable of those K instances. For regression problems, this might be the mean output value; for classification problems, it might be the mode (most common) class value.

The trick is in how to determine the similarity between data instances. If your attributes are all on the same scale (for example, all in inches), the simplest technique is Euclidean distance, which you can calculate directly from the differences between each input variable.
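To make this concrete, here is a minimal from-scratch sketch in Python; the function names (euclidean_distance, knn_predict) and the toy data are illustrative assumptions, not from the original article.

```python
# Minimal KNN sketch: the "model" is just the stored training data.
import math
from collections import Counter

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3, task="classification"):
    """Predict for `query` from the k nearest training instances."""
    # Rank every training instance by its distance to the query point.
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: euclidean_distance(pair[0], query))[:k]
    labels = [label for _, label in neighbors]
    if task == "classification":
        return Counter(labels).most_common(1)[0][0]  # mode of the k labels
    return sum(labels) / k                           # mean of the k outputs

# Example: classify a new point from four labeled 2-D points.
X = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
y = ["a", "a", "b", "b"]
print(knn_predict(X, y, (2.0, 2.0), k=3))  # -> "a"
```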

The k-nearest neighbor algorithm

KNN can require a lot of memory or space to store all of the data, but it only performs computation at the moment a prediction is needed, just in time. You can also update and curate your training instances over time to keep your predictions accurate.
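Because the "model" is just the stored data, adding a new example needs no retraining step. A stdlib-only sketch of this idea (the points and labels are invented for illustration):

```python
# Since KNN's "model" is the training set itself, updating it is just
# appending new labeled instances; no retraining step is needed.
training_set = [((1.0, 1.0), "a"), ((8.0, 8.0), "b")]

# A new labeled observation arrives and is added directly.
training_set.append(((1.2, 0.9), "a"))

# Prediction happens "just in time": distances are computed only now.
query = (1.1, 1.0)
nearest = min(training_set,
              key=lambda item: sum((p - q) ** 2 for p, q in zip(item[0], query)))
print(nearest[1])  # -> "a", the label of the single nearest neighbor (k = 1)
```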

The notion of distance or closeness can break down in very high dimensions (many input variables), which can hurt the performance of the algorithm on your problem. This is called the curse of dimensionality. It is recommended that you use only the input variables that are most relevant to predicting the output variable.
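One common way to act on that recommendation is to filter features before fitting. A sketch using scikit-learn (assumed installed; the data set here is synthetic, and keeping 4 features is an arbitrary choice):

```python
# Sketch: keep only the most relevant input variables before running KNN,
# to lessen the curse of dimensionality. Requires scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic data: 20 features, only 4 of which are informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                           random_state=0)

# Select the 4 highest-scoring features, then classify with KNN.
model = make_pipeline(SelectKBest(f_classif, k=4),
                      KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
print(model.score(X, y))
```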


Advantages and disadvantages of the k-nearest neighbor algorithm

Advantages

  • The theory is mature and the idea is simple; it can be used for both classification and regression.
  • It can be used for nonlinear classification.
  • The training time complexity is low, O(n).
  • It makes no assumptions about the data, achieves high accuracy, and is not sensitive to outliers.
  • KNN is an online technique: new data can be added directly to the data set without retraining.
  • The theory behind KNN is simple and easy to implement.

Disadvantages

  • It handles class-imbalanced data poorly (that is, where some categories have many samples and others have few): the majority class tends to dominate the vote, biasing predictions.
  • It needs a lot of memory, since the entire training set must be stored.
  • For data sets with a large sample size, the computational cost of prediction is high, because distances to all training samples must be calculated.
  • KNN repeats a full pass over the data for every prediction.
  • There is no theoretically optimal choice of k; in practice, k is usually chosen with k-fold cross-validation (see the sketch after this list).
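As one possible way to carry out that search for k, here is a sketch using scikit-learn (assumed installed); the Iris data set and the candidate range 1–20 are arbitrary illustrations:

```python
# Sketch: pick k by 5-fold cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds for this candidate k.
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```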


Baidu Encyclopedia version

The k-nearest neighbor (KNN) classification algorithm is one of the simplest methods in data-mining classification. "K nearest neighbors" means exactly that: the k closest neighbors, the idea being that each sample can be represented by its k nearest neighbors.

The core idea of the kNN algorithm is that if the majority of the k nearest samples to a point in feature space belong to a certain category, then that point also belongs to the category and shares the characteristics of the samples in it. In making a classification decision, the method relies only on the category of the nearest one or few samples. Because kNN depends mainly on a limited number of surrounding samples, rather than on a discriminant function over class domains, it is better suited than other methods to sample sets whose class domains cross or overlap.



Wikipedia version

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

  • In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor.
  • In k-NN regression, the output is the property value for the object: the average of the values of its k nearest neighbors.
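To make these two output rules concrete, a minimal stdlib-only sketch; the neighbor labels and values are invented for illustration:

```python
# Sketch of the two k-NN output rules, with made-up neighbor outputs.
from collections import Counter
from statistics import mean

neighbor_labels = ["cat", "dog", "cat"]  # k = 3 nearest labels (classification)
neighbor_values = [2.0, 3.5, 3.0]        # k = 3 nearest values (regression)

# Classification: plurality vote among the k nearest neighbors.
print(Counter(neighbor_labels).most_common(1)[0][0])  # -> "cat"

# Regression: average of the k nearest neighbors' values.
print(mean(neighbor_values))  # -> 2.833...
```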

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine-learning algorithms.
