Mindware Research Institute

Concept Research – AI powered Creative Information Analysis

How to use K-means

October 24, 2023
By Kunihiro TADA, in Data Science

Cluster analysis (data clustering) can be hierarchical or non-hierarchical; K-means is a non-hierarchical method. The K-means procedure is very straightforward:

  1. Initialize K reference vectors with random values. Each reference vector has exactly as many elements as the data to be trained has dimensions.
  2. Assign each data point to the reference vector at the shortest distance. The result is a grouping of the data points.
  3. Compute the centroid of the data points within each group, move each reference vector to its group's centroid, and dissolve the groups.

After repeating steps 2 and 3 a few times, the algorithm converges.
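The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration on toy two-blob data, not a production implementation; in practice, scikit-learn's `KMeans` adds careful initialization (k-means++) and a proper convergence test:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal K-means: X is (n_samples, n_dims); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize K reference vectors (here: random data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest reference vector.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each reference vector to its group's centroid.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated blobs: K-means with k=2 recovers them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centroids, labels = kmeans(X, k=2)
```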

The value of K can be set freely depending on the problem. However, many recent data science courses explain how to find the "optimal" number of clusters using the elbow method. This is not wrong, but it creates a slightly awkward situation, because it skips over the question of what the optimal number of clusters actually means. (See: How to use Clustering Quality Measures.)

Some clusterings may look more natural than others, but the moment you regard one of them as objective or absolute, the analysis risks becoming unscientific. In most cases the number of clusters tends to be around 3 to 10 at most; this reflects human cognitive capacity, not reality. The primary use of the K-means algorithm is information compression. Clustering as it is commonly understood can likewise be viewed as information compression tailored to human cognitive abilities.

An example of information compression is the image palette. PCs from several decades ago could not display full color due to performance limitations, so images were displayed in 16 or 256 colors. Since images cannot be displayed beautifully with a default color set, a technique was devised to improve display quality by building a "palette" from the colors most frequently used in a given image. This method is isomorphic to K-means.
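The palette analogy can be made concrete with scikit-learn's `KMeans`. This sketch uses random pixel data as a stand-in for a real photo; the choice of 16 colors is just for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for a photo: 1,000 RGB pixels. A real image would be
# reshaped from (height, width, 3) to (n_pixels, 3) first.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(1000, 3)).astype(float)

# K-means finds a 16-color "palette": the 16 cluster centroids.
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)
palette = km.cluster_centers_

# Each pixel is replaced by its nearest palette color -- information
# compression from 1,000 arbitrary colors down to 16.
quantized = palette[km.labels_]
```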

Hierarchical agglomerative clustering is computationally expensive to apply to big data, since it requires distance calculations between all pairs of data points. In such cases we can apply K-means first: for example, compress 100,000 records into 1,000 micro-clusters, and then apply hierarchical clustering such as Ward's method to those 1,000 micro-clusters. The advantage of hierarchical clustering is that it lets you explore the "structure" inherent in your data.
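The two-stage approach can be sketched as follows (scaled down from the 100,000 → 1,000 example in the text to keep it quick; the dataset and cluster counts are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in for "big data": 10,000 records in 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))

# Stage 1: K-means compresses the records into 100 micro-clusters.
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)
micro_centers = km.cluster_centers_

# Stage 2: Ward's hierarchical clustering on the micro-cluster centers,
# which is cheap at this size and exposes the structure in the data.
Z = linkage(micro_centers, method="ward")
macro_of_micro = fcluster(Z, t=5, criterion="maxclust")

# Map each original record to a macro-cluster via its micro-cluster.
macro_labels = macro_of_micro[km.labels_]
```

Cutting the dendrogram at different heights (different `t`) lets you inspect the hierarchy at several granularities without ever computing pairwise distances between all 10,000 records.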

Of course, there are cases where you do not need to know the cluster hierarchy. If you simply want to derive from the data a number of features that human cognition can handle easily, K-means is useful. Hierarchical agglomerative clustering would also work, but if you do not need a hierarchy, K-means is enough.

Tree models such as decision trees and random forests partition the data space with a large number of conditional expressions; in other words, they divide the space into axis-aligned rectangles. This does not fit well with the shape of real data distributions: an accurate model requires many conditional expressions, which makes the model complex. This is exactly why the rule-based AI of the 1980s was not successful.

Applying features created with K-means to a tree model may overcome these shortcomings. This is essentially equivalent to applying K-means to classification: the tree model can then be interpreted as matching data points to the K vectors.
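One way to realize this idea is to append each point's distances to the K reference vectors as extra features (scikit-learn's `KMeans.transform` returns exactly those distances). The toy dataset below has a circular class boundary, the kind of shape that axis-aligned rectangles approximate only with many conditions; the parameter choices are illustrative:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A circular class boundary: two concentric rings.
X, y = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit K-means on the training data; use the distances to the K
# reference vectors as additional features for the tree model.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_tr)
X_tr_aug = np.hstack([X_tr, km.transform(X_tr)])  # transform = distances
X_te_aug = np.hstack([X_te, km.transform(X_te)])

clf = RandomForestClassifier(random_state=0).fit(X_tr_aug, y_tr)
accuracy = clf.score(X_te_aug, y_te)
```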

In fact, SOM (the self-organizing map) is a more advanced version of K-means. In K-means the K vectors move freely without interfering with one another; in SOM, a topological order is imposed so that the K vectors are tied to one another and their positional relationships are smoothed.
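The difference can be illustrated with a minimal 1-D SOM sketch (the grid shape, learning-rate schedule, and neighborhood width here are illustrative choices, not a reference implementation; Viscovery SOMine and libraries such as MiniSom do this far more carefully):

```python
import numpy as np

def train_som(X, k=10, n_iter=500, lr=0.5, sigma=2.0, seed=0):
    """Minimal 1-D SOM: k reference vectors on a line; the winner and its
    topological neighbors all move toward each sample."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(k, X.shape[1]))   # reference vectors
    grid = np.arange(k)                    # 1-D topological positions
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        winner = np.linalg.norm(W - x, axis=1).argmin()
        # The neighborhood function ties the vectors together: closeness
        # on the grid, not in data space, decides how strongly each
        # vector is pulled -- this is what smooths their arrangement.
        h = np.exp(-(grid - winner) ** 2 / (2 * sigma ** 2))
        W += lr * (1 - t / n_iter) * h[:, None] * (x - W)
    return W

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 2))
W = train_som(X)
```

Shrinking the neighborhood width `sigma` toward zero makes only the winner move, which recovers an online variant of K-means; it is the neighborhood term that distinguishes SOM.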

Written by:

Kunihiro TADA

He has watched the waves of industrial booms from the early 1980s to the present day. In 1982 he was a planner of high-tech seminars at the Japan Technology and Economy Centre and of seminars and research projects at JMA Consulting; in 1986 he organised AI chip seminars on fuzzy inference and other topics, triggering the fuzzy boom; after freelance writing on CG and multimedia, he founded the Mindware Research Institute, and has sold the Japanese version of Viscovery SOMine since 2000 and Hugin and XLSTAT since 2003 in Japan.
