Clustering Boosting

3 February 2017

Clustering is one of the most fundamental tasks in machine learning. It consists of grouping similar objects together without any supervision, meaning there are no labels to guide the grouping. The goal is to find structure in the data to facilitate further analysis: it is useful in its own right for data analysis and visualization, and it is often a preliminary step in many algorithms.

This post is a kind of commentary on the article An Ensemble Method for Clustering.

In this post, I am agnostic about which clustering algorithm you favor. Here I briefly enumerate some famous algorithms and recent associated articles I like very much:

The common point of all clustering algorithms is their output, which can be:

- a hard clustering: each of the $N$ points is assigned to exactly one of the $K$ clusters;
- a soft clustering: each point is given a probability distribution over the $K$ clusters.

Unlike in Supervised Classification, in Unsupervised Classification (a.k.a. Clustering) the classes are anonymous. This means that if we permute the columns of a soft clustering, the original and the permuted clusterings are still the same -- they represent the same partition of the points, only with different cluster indices.
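For concreteness, here is a minimal numpy sketch of this label-switching invariance (the matrix `P` is a made-up toy example, not data from the article):

```python
import numpy as np

# A toy soft clustering of 4 points into K = 3 clusters (each row sums to 1).
P = np.array([
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.70, 0.20],
    [0.10, 0.20, 0.70],
])

# Permute the columns, i.e. relabel the (anonymous) clusters.
perm = [2, 0, 1]
P_permuted = P[:, perm]

# Both matrices describe the same partition of the points,
# only the cluster indices differ.
print(P.argmax(axis=1))           # [0 0 1 2]
print(P_permuted.argmax(axis=1))  # [1 1 2 0] -- same grouping, relabelled
```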

One can remark that hard clustering is a special case of soft clustering. Indeed, a soft clustering version of a hard one can be computed through a one-hot-vector representation. So, without loss of generality, I can assume I have soft clusterings.
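A minimal sketch of this conversion (the helper name `hard_to_soft` is mine, not from the article):

```python
import numpy as np

def hard_to_soft(labels, K):
    """One-hot-vector representation of a hard clustering:
    turn N labels in {0, ..., K-1} into an N x K soft clustering."""
    labels = np.asarray(labels)
    soft = np.zeros((labels.shape[0], K))
    soft[np.arange(labels.shape[0]), labels] = 1.0
    return soft

# 5 points hard-assigned to K = 3 clusters.
print(hard_to_soft([0, 2, 1, 1, 0], K=3))
```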

Suppose you have a dataset of $N$ points to cluster into $K$ groups (if $K$ varies between clusterings, one can take the maximum and zero-pad) and you are given $M$ soft clusterings (coming from $M$ random initializations, for example). The aim of this post is to find a way to aggregate these clusterings into one final soft clustering. In terms of optimization, one can cast this problem as a variant of K-Means: $$\min_{p,\sigma} \frac{1}{NM} \sum_{m=1}^M \sum_{i=1}^N \left\Vert p(i) - \sigma_m\big(C_m(i)\big) \right\Vert$$

where:

- $p(i) \in \mathbb{R}^K$ is the aggregated soft assignment of point $i$ (the variable we are looking for);
- $C_m(i) \in \mathbb{R}^K$ is the soft assignment of point $i$ in the $m$-th given clustering;
- $\sigma_m$ is a permutation of the $K$ cluster indices, applied to the coordinates of $C_m(i)$.

This is very similar to the K-Means objective function: (i) if $\sigma$ is given, then the optimal $p^*$ is found by a mean (for the squared Euclidean distance) or a median (for the Manhattan distance), and (ii) if $p$ is given, then the optimal $\sigma^*$ is given by a minimum over permutations. Thus, a greedy optimization is a nice heuristic: it cannot guarantee the optimal solution, but the objective value cannot worsen until convergence. The so-called switching problem is handled here by the $\sigma$ functions; indeed, without that switching problem the solution would simply consist of one step (i). Initialized with a $\sigma^0$ of least entropy, a loop alternating steps (i) and (ii) works well in practice.
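To make the loop concrete, here is a minimal Python sketch of the alternating scheme, assuming the squared Euclidean distance (so step (i) is a plain mean) and using scipy's `linear_sum_assignment` for step (ii). The function name `aggregate_soft_clusterings` and the identity-permutation initialization are my own simplifications, not the least-entropy initialization mentioned above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def aggregate_soft_clusterings(C, n_iter=20):
    """Greedy alternating optimization of the objective above.

    C : array of shape (M, N, K) holding the M given soft clusterings.
    Returns an (N, K) aggregated soft clustering p.
    """
    M, N, K = C.shape
    # Simplified initialization: identity permutations
    # (instead of the least-entropy initialization of the post).
    perms = [np.arange(K) for _ in range(M)]

    for _ in range(n_iter):
        # Step (i): given the sigma_m, the optimal p is the mean of the
        # aligned clusterings (squared Euclidean distance).
        p = np.mean([C[m][:, perms[m]] for m in range(M)], axis=0)

        # Step (ii): given p, the best sigma_m is the column permutation of
        # C_m that best matches p -- an assignment problem.
        for m in range(M):
            # similarity[k, j] = <p[:, k], C_m[:, j]>; maximizing the matched
            # similarities minimizes the sum of squared distances.
            similarity = p.T @ C[m]
            _, col_ind = linear_sum_assignment(similarity, maximize=True)
            perms[m] = col_ind  # new column k of C_m is its old column col_ind[k]
    return p

# Usage on toy data: M = 4 clusterings of N = 100 points into K = 3 clusters.
rng = np.random.default_rng(0)
C = rng.dirichlet(np.ones(3), size=(4, 100))
p = aggregate_soft_clusterings(C)
print(p.shape)  # (100, 3)
```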