Advantages and disadvantages of K-means clustering
- The algorithm is simple and easy to implement;
- The algorithm is fast;
- For large data sets, the algorithm is relatively scalable and efficient: its complexity is approximately O(nkt), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Usually k is much smaller than n, and the algorithm typically converges to a local optimum.
- The algorithm attempts to find the k partitions that minimize the squared-error function. When the clusters are dense, roughly spherical, and clearly separated from one another, the clustering result is good.
- It places high requirements on the data type and is mainly suitable for numerical data;
- It may converge to a local minimum, and converges slowly on large-scale data;
- The number of clusters k is an input parameter; an inappropriate k may return a poor result.
- It is sensitive to the initial cluster centers; different initial values may produce different clustering results;
- It is not suitable for finding clusters with non-convex shapes, or clusters of very different sizes.
- It is sensitive to "noise" and outlier data; a small amount of such data can have a significant impact on the mean.
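The squared-error objective mentioned above also explains the sensitivity to outliers: every point contributes its squared distance to the cluster mean, so a single distant point dominates both the mean and the error. A minimal sketch in plain Python (the `sse` helper is illustrative, not part of any library):

```python
def sse(clusters):
    """Sum of squared errors: for each cluster, add the squared
    Euclidean distance of every point to the cluster mean."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
        total += sum(sum((p[i] - mean[i]) ** 2 for i in range(dim))
                     for p in points)
    return total

# A single far outlier dominates the error and drags the mean:
tight = [[(0.0,), (1.0,), (2.0,)]]
with_outlier = [[(0.0,), (1.0,), (2.0,), (100.0,)]]
print(sse(tight))         # → 2.0
print(sse(with_outlier))  # → 7352.75, dominated by the outlier
```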
Baidu Encyclopedia version
The K-means clustering algorithm first randomly selects K objects as the initial cluster centers. The distance between each object and each seed cluster center is then calculated, and each object is assigned to the cluster center closest to it. The cluster centers and the objects assigned to them represent a cluster. Once all objects have been assigned, the center of each cluster is recalculated from the objects currently in the cluster. This process repeats until a termination condition is met. The termination condition may be that no (or only a minimal number of) objects are reassigned to different clusters, that no (or only a minimal number of) cluster centers change, or that the sum of squared errors reaches a local minimum.
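The iterative procedure described above can be sketched in plain Python. This is a minimal illustration (the function name `k_means` and the termination test on unchanged assignments are this sketch's own choices, not a canonical implementation):

```python
import random

def k_means(points, k, max_iter=100, seed=0):
    """Assign each point to its nearest center, then recompute each
    center as the mean of its assigned points, until nothing changes."""
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))    # step 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        # step 2: assign every object to the closest cluster center
        new_assignment = [
            min(range(k), key=lambda j: sum((p[d] - centers[j][d]) ** 2
                                            for d in range(len(p))))
            for p in points
        ]
        if new_assignment == assignment:     # termination: no reassignment
            break
        assignment = new_assignment
        # step 3: recompute each center from the objects in its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                      # keep old center if cluster empties
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers, assignment

# Two well-separated groups of 2-D points:
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(pts, 2)
```

With well-separated, roughly spherical groups like these, the iteration settles quickly; with a poor choice of k or unlucky initial centers it may stop at a worse local minimum, as noted in the list above.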
K-means clustering is a method of vector quantization that originated in signal processing and is now popular for cluster analysis in data mining. The purpose of k-means clustering is to partition n observations into k clusters, in which each observation belongs to the cluster whose mean is nearest and serves as the cluster's prototype. This partitions the data space into Voronoi cells.
The problem is computationally difficult (NP-hard); however, efficient heuristics converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, since both k-means and Gaussian mixture models employ an iterative refinement approach. Both use cluster centers to model the data; however, k-means tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.
The algorithm has a loose relationship with the k-nearest-neighbor classifier, a popular machine-learning technique for classification that is often confused with k-means because of the similar name. Applying the 1-nearest-neighbor classifier to the cluster centers obtained by k-means classifies new data into the existing clusters. This is known as the nearest-centroid classifier, or Rocchio algorithm.
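The nearest-centroid rule described above is simply a 1-nearest-neighbor search over the cluster centers. A minimal sketch (the center coordinates here are hypothetical values, standing in for the output of a finished k-means run):

```python
def nearest_centroid(point, centers):
    """Classify a new point into the cluster whose center is closest
    (squared Euclidean distance; the nearest-centroid / Rocchio rule)."""
    return min(range(len(centers)),
               key=lambda j: sum((point[d] - centers[j][d]) ** 2
                                 for d in range(len(point))))

centers = [(0.3, 0.3), (10.3, 10.3)]        # e.g. from a prior k-means run
print(nearest_centroid((1, 1), centers))    # → 0
print(nearest_centroid((9, 12), centers))   # → 1
```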