Mindware Research Institute

Concept Research – AI powered Creative Information Analysis

How to use K-means

October 24, 2023
By Kunihiro TADA, in Data Science

Cluster analysis (data clustering) can be hierarchical or non-hierarchical; K-means is a non-hierarchical method. The K-means procedure is very straightforward:

  1. Initialize K reference vectors with random values. Each reference vector has exactly as many elements as the data to be trained has dimensions.
  2. Assign each data point to the reference vector at the shortest distance. The result is a grouping of the data points.
  3. Compute the centroid of the data points within each group, move each reference vector to its group's centroid, and dissolve the groups.

After repeating steps 2 and 3 a few times, the algorithm converges.
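The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration on toy two-blob data, not a production implementation; in practice, scikit-learn's `KMeans` adds careful initialization (k-means++) and a proper convergence test:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal K-means: X is (n_samples, n_dims); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize K reference vectors (here: random data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest reference vector.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each reference vector to its group's centroid.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated blobs: K-means with k=2 recovers them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centroids, labels = kmeans(X, k=2)
```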

The value of K can be set freely depending on the problem. However, many recent data science courses explain how to find the "optimal" number of clusters using the elbow method. This is not wrong, but it creates a slightly awkward situation, because it skips over the question of what the optimal number of clusters actually means. (See: How to use Clustering Quality Measures.)

Some clusterings may look more natural than others, but the moment you regard one of them as objective or absolute, the analysis risks becoming unscientific. In most cases the number of clusters tends to be around 3 to 10 at most; this reflects human cognitive capacity, not reality. The primary use of the K-means algorithm is information compression. Clustering as it is commonly understood can likewise be viewed as information compression tailored to human cognitive abilities.

An example of information compression is the image palette. PCs from several decades ago could not display full color due to performance limitations, so images were displayed in 16 or 256 colors. Since images cannot be displayed beautifully with a default color set, a technique was devised to improve display quality by building a "palette" from the colors most frequently used in a given image. This method is isomorphic to K-means.
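The palette analogy can be made concrete with scikit-learn's `KMeans`. This sketch uses random pixel data as a stand-in for a real photo; the choice of 16 colors is just for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for a photo: 1,000 RGB pixels. A real image would be
# reshaped from (height, width, 3) to (n_pixels, 3) first.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(1000, 3)).astype(float)

# K-means finds a 16-color "palette": the 16 cluster centroids.
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)
palette = km.cluster_centers_

# Each pixel is replaced by its nearest palette color -- information
# compression from 1,000 arbitrary colors down to 16.
quantized = palette[km.labels_]
```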

Hierarchical agglomerative clustering is computationally expensive to apply to big data, since it requires distance calculations between all pairs of data points. In such cases we can apply K-means first: for example, compress 100,000 records into 1,000 micro-clusters, and then apply hierarchical clustering such as Ward's method to those 1,000 micro-clusters. The advantage of hierarchical clustering is that it lets you explore the "structure" inherent in your data.
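The two-stage approach can be sketched as follows (scaled down from the 100,000 → 1,000 example in the text to keep it quick; the dataset and cluster counts are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in for "big data": 10,000 records in 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))

# Stage 1: K-means compresses the records into 100 micro-clusters.
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)
micro_centers = km.cluster_centers_

# Stage 2: Ward's hierarchical clustering on the micro-cluster centers,
# which is cheap at this size and exposes the structure in the data.
Z = linkage(micro_centers, method="ward")
macro_of_micro = fcluster(Z, t=5, criterion="maxclust")

# Map each original record to a macro-cluster via its micro-cluster.
macro_labels = macro_of_micro[km.labels_]
```

Cutting the dendrogram at different heights (different `t`) lets you inspect the hierarchy at several granularities without ever computing pairwise distances between all 10,000 records.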

Of course, there are cases where you do not need to know the cluster hierarchy. If you simply want to derive from the data a number of features that human cognition can handle easily, K-means is useful. Hierarchical agglomerative clustering would also work, but if you do not need a hierarchy, K-means is enough.

Tree models such as decision trees and random forests partition the data space with a large number of conditional expressions; in other words, they divide the space into axis-aligned rectangles. This does not fit well with the shape of real data distributions: an accurate model requires many conditional expressions, which makes the model complex. This is exactly why the rule-based AI of the 1980s was not successful.

Applying features created with K-means to a tree model may overcome these shortcomings. This is essentially equivalent to applying K-means to classification: the tree model can then be interpreted as matching data points to the K vectors.
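One way to realize this idea is to append each point's distances to the K reference vectors as extra features (scikit-learn's `KMeans.transform` returns exactly those distances). The toy dataset below has a circular class boundary, the kind of shape that axis-aligned rectangles approximate only with many conditions; the parameter choices are illustrative:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A circular class boundary: two concentric rings.
X, y = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit K-means on the training data; use the distances to the K
# reference vectors as additional features for the tree model.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_tr)
X_tr_aug = np.hstack([X_tr, km.transform(X_tr)])  # transform = distances
X_te_aug = np.hstack([X_te, km.transform(X_te)])

clf = RandomForestClassifier(random_state=0).fit(X_tr_aug, y_tr)
accuracy = clf.score(X_te_aug, y_te)
```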

In fact, SOM (the self-organizing map) is a more advanced version of K-means. In K-means the K vectors move freely without interfering with one another; in SOM, a topological order is imposed so that the K vectors are tied to one another and their positional relationships are smoothed.
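The difference can be illustrated with a minimal 1-D SOM sketch (the grid shape, learning-rate schedule, and neighborhood width here are illustrative choices, not a reference implementation; Viscovery SOMine and libraries such as MiniSom do this far more carefully):

```python
import numpy as np

def train_som(X, k=10, n_iter=500, lr=0.5, sigma=2.0, seed=0):
    """Minimal 1-D SOM: k reference vectors on a line; the winner and its
    topological neighbors all move toward each sample."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(k, X.shape[1]))   # reference vectors
    grid = np.arange(k)                    # 1-D topological positions
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        winner = np.linalg.norm(W - x, axis=1).argmin()
        # The neighborhood function ties the vectors together: closeness
        # on the grid, not in data space, decides how strongly each
        # vector is pulled -- this is what smooths their arrangement.
        h = np.exp(-(grid - winner) ** 2 / (2 * sigma ** 2))
        W += lr * (1 - t / n_iter) * h[:, None] * (x - W)
    return W

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 2))
W = train_som(X)
```

Shrinking the neighborhood width `sigma` toward zero makes only the winner move, which recovers an online variant of K-means; it is the neighborhood term that distinguishes SOM.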

Written by:

Kunihiro TADA

He has watched the waves of industrial booms from the early 1980s to the present day. In 1982 he was a planner of high-tech seminars at the Japan Technology and Economy Centre and of seminars and research projects at JMA Consulting; in 1986 he organised AI chip seminars on fuzzy inference and other topics, triggering the fuzzy boom; after freelance writing on CG and multimedia, he founded the Mindware Research Institute, and has sold the Japanese version of Viscovery SOMine since 2000 and Hugin and XLSTAT since 2003 in Japan.
