## Clustering Boosting

*7 February 2020*

Clustering is one of the most fundamental tasks in machine learning. It consists in grouping similar objects together without any supervision, meaning there are no labels to guide the grouping. The goal is to find structure in the data to facilitate further analysis. It is a desirable goal in data analysis and visualization, and is often a preliminary step in many algorithms.

The subject of this blog post is to propose an answer to: *How can we combine several clustering algorithms outputs into one?*

Indeed, as suggested by Andreas Weingessel *et al.* in their work An Ensemble Method for Clustering, combining supervised classification outputs is well studied and called *boosting*. In practice, it was popularized and acclaimed in face detection with AdaBoost by Paul Viola and Michael Jones, and it is thoroughly analyzed in the theoretical book The Elements of Statistical Learning by Trevor Hastie *et al.*

Back to unsupervised classification (*a.k.a.* clustering), there is a considerable wealth of research on this topic, and we would not dare to claim to be unbiased or exhaustive.

While remaining agnostic to the clustering algorithms, if we want to combine their outputs, we must choose a common format to represent them, between:

- hard clustering: represented by a sequence of integers, each corresponding to the identifier (id) of the cluster assigned to each element.
- soft clustering: represented by a probability table where each row corresponds to an element and each column to a cluster; entry $(i, k)$ is the probability that element $i$ belongs to cluster $k$ (each row sums to one).

Suppose you have a dataset of $N$ points to group into $K$ clusters and you are given $M$ soft clustering outputs. We want to find a way to eventually produce a single clustering aggregating all the other $M$ clustering outputs. To that end, we choose to re-format all clustering outputs into soft clusterings thanks to one-hot encoding (i.e. degenerate *soft* histograms for hard clusterings), and thanks to zero-padding in case of varying numbers of clusters. So we only manipulate soft clustering outputs, without loss of generality.

Moreover, unlike in supervised classification, for unsupervised classification (a.k.a. clustering) the classes are *anonymous*: for hard clustering, the sequence 11122 over the data is exactly the same clustering as 22211. For soft clustering, this means that randomly permuting the columns of the table only changes the representation of the clustering, not the clustering itself. This column-permutation invariance is called the *switching label problem*, and it is a problem because it gives several representations for the same machine learning clustering object.
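The re-formatting step above can be sketched in a few lines of NumPy (the helper name `to_soft` is mine, for illustration only):

```python
import numpy as np

def to_soft(labels, K):
    """Convert hard labels (integer cluster ids) into a degenerate soft
    clustering table of shape (N, K) via one-hot encoding; columns for
    unused cluster ids stay at zero (the zero-padding mentioned above)."""
    labels = np.asarray(labels)
    N = labels.shape[0]
    soft = np.zeros((N, K))
    soft[np.arange(N), labels] = 1.0  # one row per point, a single 1 per row
    return soft

# Two hard clusterings of the same 5 points, illustrating the switching
# label problem: [0,0,0,1,1] and [1,1,1,0,0] describe the same grouping.
a = to_soft([0, 0, 0, 1, 1], K=3)
b = to_soft([1, 1, 1, 0, 0], K=3)
# Swapping columns 0 and 1 of b recovers a exactly.
```

Here both tables are valid representations of one clustering; only a column permutation separates them.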

In terms of optimization, one can propose to cast this problem as: $$\min_{p,\sigma} \frac{1}{NM} \sum_{m=1}^M \sum_{i=1}^N \Vert p_i - \sigma_m(C_{m,i}) \Vert$$ where:

- $p \in \mathbb{R}^{N \times K}$ is the final soft clustering table. $p_i \in \mathbb{R}^K$ is the probability vector of the $i^\text{th}$ point of belonging to each cluster: $p_{i,k} = \mathbb{P}(k|i)$
- $C \in \mathbb{R}^{M \times N \times K}$ is the tensor containing all the $M$ soft clustering outputs. $C_m \in \mathbb{R}^{N \times K}$ is the soft representation of the $m^\text{th}$ clustering output. Thus, $C_{m,i}$ is the probability vector of the $i^\text{th}$ point over $K$ clusters according to the $m^\text{th}$ clustering algorithm.
- $\sigma_m$ is a permutation function over the clustering indices between $1$ and $K$. Hopefully at convergence, $\sigma_m(C_{m,i})$ is the probability vector of the $i^\text{th}$ point over $K$ clusters without the switching label problem, which means that comparisons across the $M$ algorithms are now reasonable thanks to $\sigma$.
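For concreteness, here is a small sketch of how this objective could be evaluated for given $p$, $C$ and $\sigma$, with the distance left as a pluggable argument since it is not specified yet (the function name `objective` and its signature are mine, not from the post):

```python
import numpy as np

def objective(p, C, sigmas, dist):
    """Evaluate (1 / (N * M)) * sum_m sum_i dist(p_i, sigma_m(C_{m,i})).

    p:      (N, K) aggregated soft clustering table
    C:      (M, N, K) tensor of the M soft clustering outputs
    sigmas: list of M permutations of range(K), applied as column reorderings
    dist:   row-wise distance, e.g. lambda x, y: np.linalg.norm(x - y, axis=-1)
    """
    M, N, _ = C.shape
    total = sum(dist(p, C[m][:, list(sigmas[m])]).sum() for m in range(M))
    return total / (N * M)
```

With $M = 1$, the identity permutation and $p = C_1$, the objective is zero, as expected.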

If we take a closer look at the proposed optimization problem, we can see that the distance is not specified. Empirically, it turns out that the Hellinger distance, which is specifically designed for histograms (positive vectors summing to $1$), is a good choice: $$\forall a, b \in \mathbb{R}^K, \quad \Vert a - b \Vert_\text{Hellinger} = \sum_{k=1}^K (\sqrt{a_k} - \sqrt{b_k})^2 = \Vert \sqrt{a} - \sqrt{b} \Vert_2^2 = 2 \times (1 - \sqrt{a}^\top \sqrt{b})$$ where the last equality holds for histograms since $\Vert \sqrt{a} \Vert_2^2 = \Vert \sqrt{b} \Vert_2^2 = 1$. Mathematically, the Hellinger distance makes the histogram space (i.e. the simplex, the positive part of the $L_1$ sphere) look spherical in the traditional Euclidean $L_2$ sense thanks to the square root. This is convenient because even a rudimentary tool like a hyperplane has more interpretable separating power when cutting an $L_2$ sphere than when cutting the $L_1$ simplex, as shown in the figure below.
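The distance above is a one-liner to implement, and the dot-product form can serve as a sanity check (a minimal sketch following the post's definition):

```python
import numpy as np

def hellinger(a, b):
    """Distance from the post: ||sqrt(a) - sqrt(b)||_2^2, summed over the
    last (cluster) axis. Up to constants this is the squared Hellinger
    distance between two histograms."""
    a, b = np.asarray(a), np.asarray(b)
    return np.sum((np.sqrt(a) - np.sqrt(b)) ** 2, axis=-1)

a = np.array([0.5, 0.5, 0.0])
b = np.array([0.0, 0.5, 0.5])
d = hellinger(a, b)
# Equivalent dot-product form for histograms: 2 * (1 - sqrt(a) . sqrt(b))
```

For these two histograms both forms give $2 \times (1 - 0.5) = 1$.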

Our optimization problem is very similar to that of K-Means: *(i)* if $\sigma$ is given, then the optimal $p^*$ is found by a mean of square-root probabilities, and *(ii)* if $p$ is given, then the optimal $\sigma^*$ is found by a cheap brute-force strategy that makes the switching label problem disappear. Likewise, an alternated optimization scheme is efficient to (not optimally) solve our clustering aggregation (or boosting, in reference to its supervised counterpart). Of course, exactly as with K-Means, there is no way to guarantee the optimal solution, but the solution quality cannot worsen w.r.t. that objective until convergence, in the same spirit as K-Means. Finally, we can also nicely adapt the K-Means++ strategy for careful initialization. In the end, it seems we run K-Means over the outputs of many simplistic or sophisticated clustering algorithms, which amusingly pays an ancient algorithm the respect that it deserves.
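The alternated scheme above can be sketched as follows. This is an illustrative NumPy implementation under my own assumptions (fixed iteration count, identity initialization rather than a K-Means++-style one, and a renormalization of $p$ back onto the simplex after squaring the mean of square roots); the brute force over the $K!$ permutations is only reasonable for small $K$:

```python
import itertools
import numpy as np

def hellinger(a, b):
    # ||sqrt(a) - sqrt(b)||_2^2, summed over the last (cluster) axis
    return np.sum((np.sqrt(a) - np.sqrt(b)) ** 2, axis=-1)

def aggregate(C, n_iter=10):
    """Alternated optimization sketch. C has shape (M, N, K):
    M soft clustering outputs over N points and K clusters.

    (i)  sigma fixed -> p from the mean of square roots (then renormalized);
    (ii) p fixed     -> each sigma_m by brute force over K! permutations.
    """
    M, N, K = C.shape
    sigmas = [list(range(K)) for _ in range(M)]   # identity permutations
    for _ in range(n_iter):
        # (i) average the square roots of the aligned tables, square back
        sq = np.mean([np.sqrt(C[m][:, sigmas[m]]) for m in range(M)], axis=0)
        p = sq ** 2
        p /= p.sum(axis=1, keepdims=True)         # project back onto the simplex
        # (ii) re-align each clustering output against the current p
        sigmas = [list(min(itertools.permutations(range(K)),
                           key=lambda s: hellinger(p, C[m][:, list(s)]).sum()))
                  for m in range(M)]
    return p, sigmas

# Three hard clusterings encoded as one-hot tables: two copies of the same
# labeling and one copy with permuted columns (switching label problem).
A = np.eye(3)[[0, 0, 1, 1, 2]]
B = A[:, [2, 0, 1]]
p, sigmas = aggregate(np.stack([A, A, B]))
```

On this toy example, the scheme recovers the common clustering: `p` converges to `A` and the permutation found for `B` undoes the column swap.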