Classification and Clustering

Classification and clustering are often confused because they look similar. In machine learning, classification is usually described as supervised learning and clustering as unsupervised learning. In statistical terms, a supervised method has an objective variable, and an unsupervised method has none.

The purpose of a classification model is relatively easy to understand. A classification model uses other qualitative and quantitative variables to explain and reproduce the value of one particular qualitative variable. Multivariate data can contain several quantitative and qualitative variables, and formally any of the qualitative variables can be chosen as the objective variable for classification. There are therefore as many ways to classify as there are qualitative variables.

The purpose of clustering, on the other hand, is more abstract, so misunderstandings sometimes occur. The results of clustering can be treated as a new classification, in which each generated cluster becomes a class. In other words, clustering can be used to give new labels to data that have no classification labels. Going the other way, expecting the clustering results to match a particular existing classification, does not seem very productive.

Of course, if you cluster a dataset that was prepared for a specific classification and has already been validated, it is quite possible that the clustering results will match that classification. For example, if you cluster the explanatory variables of Fisher’s iris data, the clusters will closely match the objective variable (the species). Some cases that do not match are due to inconsistencies in the original data. However, this kind of exercise doesn’t seem very practical.

Cases like this are often presented as examples of clustering, which may be making the purpose of clustering increasingly difficult to understand. An easy-to-understand example in which the results of classification and clustering do not match is the classification of good and defective products in quality control. Here the clustering results are not limited to two classes, and the defective products may fall into several clusters. In other words, we can see that there are multiple patterns of defects. The purpose of clustering is to discover deeper, new knowledge of this kind.
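
The quality-control case can be sketched with a small synthetic example. Everything here is an assumption made for illustration: the two measurement variables, the three cluster centers, the choice of k = 3, and the hand-written k-means routine (plain Lloyd iterations with a deterministic farthest-point start). It is a minimal sketch, not a recommended implementation. The point is only that a two-class label (good/defective) can hide more than one pattern of defect.

```python
import random
from collections import Counter

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Start from the first point, then repeatedly add the point that is
    # farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))

    for _ in range(iterations):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        # Move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels

rng = random.Random(0)  # fixed seed so the synthetic data is reproducible

def sample(center, n, spread=0.5):
    """Draw n hypothetical 2-D measurements scattered around a center."""
    return [tuple(rng.gauss(c, spread) for c in center) for _ in range(n)]

# Hypothetical measurements: one group of good products, two kinds of defect.
good = sample((0.0, 0.0), 30)
defect_a = sample((6.0, 0.0), 30)  # e.g. off in the first measurement
defect_b = sample((0.0, 6.0), 30)  # e.g. off in the second measurement

points = good + defect_a + defect_b
quality = ["good"] * 30 + ["defective"] * 60  # the existing two-class label

clusters = kmeans(points, k=3)

# Cross-tabulate the existing label against the discovered clusters.
table = Counter(zip(quality, clusters))
for (label, cluster), count in sorted(table.items()):
    print(label, cluster, count)
```

In this sketch the "defective" label is spread over two clusters while "good" occupies one, so the clustering does not reproduce the two-class classification but refines it. In real data the number of clusters is not known in advance and would itself be part of the exploration.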

Clustering can be interpreted as a method of creating a new classification synthesized from multiple variables. It is generally said that clustering is used when there is no prior knowledge about the object, but personally I only half agree with this idea, because some insight must be at work in deciding which variables to use for clustering. Clustering is always done from some perspective, and an absolute clustering cannot exist. In practice, the clustering process should be exploratory, with the variable selection adjusted along the way.
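
The dependence on variable selection can be illustrated with a deliberately tiny example. The four items, their two variables (a weight and a price), and the helper `two_clusters_by_largest_gap` are all hypothetical. The helper is a simple stand-in for a real clustering method: it splits one-dimensional values at the largest gap, which is what single-linkage clustering gives when stopped at two clusters.

```python
def two_clusters_by_largest_gap(values):
    """Split 1-D values into two groups at the largest gap between
    sorted neighbours. Returns a 0/1 label for each input value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gaps = [values[order[i + 1]] - values[order[i]] for i in range(len(order) - 1)]
    cut = gaps.index(max(gaps))
    labels = [0] * len(values)
    for i in order[cut + 1:]:
        labels[i] = 1  # everything above the largest gap
    return labels

# Hypothetical items described by two variables: (weight, price).
items = {
    "A": (100, 10),
    "B": (110, 95),
    "C": (300, 12),
    "D": (310, 90),
}
names = list(items)

# The same items, clustered from two different perspectives.
by_weight = two_clusters_by_largest_gap([items[n][0] for n in names])
by_price = two_clusters_by_largest_gap([items[n][1] for n in names])

print("by weight:", dict(zip(names, by_weight)))
print("by price: ", dict(zip(names, by_price)))
```

Clustering by weight groups A with B and C with D, while clustering by price groups A with C and B with D. Neither result is the "true" clustering; each reflects the perspective chosen through the variables.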

Written by:

Kunihiro TADA