Scientific Communications


Who Am I?

I am Warith HARCHAOUI

CV/Resume



Academic Books for Machine Learning

12 April 2019

In random order:


Small Data

6 April 2019

In 2012, I was privileged to attend Divide-and-Conquer and Statistical Inference in Big Data, a talk given in French by Michael I. Jordan, a prominent figure in Machine Learning. That visionary talk began by establishing what can still be considered the crux of Data Science today, especially at a time when neither Hadoop nor Spark had been popularized:

Consider $N$ the number of elements of the database (cardinality) and $D$ the dimensionality of each element.

  • Big Data regime ($\frac{N}{D} \gg 1$): the developer's nightmare, especially because data may not fit in memory; but if the computations are feasible, then statistical theory gives good reasons to expect such systems to work very well;

  • Small Data regime ($\frac{N}{D} \ll 1$ or $\frac{N}{D} \simeq 1$): the statistician's nightmare, especially because of the mathematical curse of dimensionality, but the computations are easier than in the previous case.

NB: Michael I. Jordan did not put it exactly that way; this is my own simplification.

Since then, the data storage problem has been nicely solved at least three times over on the developer's side, thanks to decreasing hardware costs, increasing data access speeds and software solutions like HDFS and Spark (for distributed file systems and in-memory computing, respectively), but the curse of dimensionality remains on the statistician's side. Computations have also improved, although not sufficiently energy-wise (which is beyond the scope of this blog post).

Enforcing structure on the data seems to be an efficient way to cope with the curse of dimensionality. For example, if one considers the huge progress accomplished on images from the 1990s to the 2010s, one sees that methods based on hierarchical convolutions have been the key to unlocking good classification results at the historical ImageNet competition (a database of $\sim 10^6$ images with $\sim 10^6$ pixels each, thus $\frac{N}{D} \simeq 1$, which corresponds to a Small Data regime). Likewise, word ordering turns out to be the key to Natural Language Processing representations. But despite worldwide research attention in Genetics, Machine Learning tends to remain ineffective there: one human being has $D \simeq 10^{12}$ nucleotides and databases are limited to the worldwide population ($N \sim 10^9$), thus $\frac{N}{D} \ll 1$, which again corresponds to a Small Data regime. We have not yet leveraged any desirable but unknown structure among those nucleotides to turn this into an easier Big Data regime.

The take-home message of this post is:

Indeed, structure translates data knowledge and human expertise into a statistical language that a computer can exploit. By the way, if $N$ is not large ($\sim 10^1$ or $\sim 10^2$), statistical estimators do not have much relevance, although people like Sylvain ARLOT try to build a mathematical theory, and thus new algorithms, for the Small Data regime; it is hard.


What is a Data Scientist?

10 February 2017

A data scientist is someone who is better at Statistics than any software engineer and better at Software Engineering than any statistician.

People often ask me about my job: Data Scientist. I humbly believe I know what a Data Scientist should know although I do not know everything about it. ;)

This post is a compilation of my favorite academic books on that subject:

Here is a fascinating tutorials website: Tutorials by Andrew W. Moore

Finally, you should know how to code in Python, including Scikit-learn and Pandas, which are well described in Python for Data Analysis (a great primer if you are a beginner in the field), and perhaps also R, Java and, for many companies in practice, C#.


Clustering Boosting

3 February 2017

Clustering is one of the most fundamental tasks in machine learning. This task consists in grouping similar objects together without any supervision, meaning there are no labels to guide this grouping. The goal is to find structure in data to facilitate further analysis. It is a desirable goal in data analysis and visualization, and is often a preliminary step in many algorithms.

This post is a kind of commentary on the article An Ensemble Method for Clustering.

In this post, I am agnostic to whichever clustering algorithm you favor. Here I briefly enumerate some famous algorithms and recent associated articles that I like very much:

The common point of all clustering algorithms is their output, which can be:

Unlike in Supervised Classification, in Unsupervised Classification (a.k.a. Clustering) the classes are anonymous. This means that if we permute the columns of a soft clustering, the two clusterings (the original one and the permuted one) are still the same: they represent the same partition of the points, only with different indices.

One can remark that hard clustering is a special case of soft clustering. Indeed, the soft clustering version of a hard one can be computed through a one-hot-vector representation (see the short sketch below). So, without loss of generality, I can assume I am given soft clusterings.
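As a minimal sketch of that conversion (the function name is mine, assuming hard labels in $\{0, \dots, K-1\}$):

```python
import numpy as np

def hard_to_soft(labels, K=None):
    """Turn a hard clustering (one label per point) into its soft version:
    an N x K matrix whose rows are one-hot vectors."""
    labels = np.asarray(labels)
    K = K if K is not None else int(labels.max()) + 1
    soft = np.zeros((labels.shape[0], K))
    soft[np.arange(labels.shape[0]), labels] = 1.0
    return soft

# hard_to_soft([0, 2, 1]) gives a 3 x 3 matrix with a single 1 per row
```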

Suppose you have a dataset of $N$ points to cluster into $K$ groups (if $K$ varies between clusterings, one can take the maximum and use zero-padding) and you are given $M$ soft clusterings (coming from $M$ random initializations, for example). The aim of this post is to find a way to aggregate these clusterings into one final soft clustering. In terms of optimization, one can cast this problem as a variant of K-Means: $$\min_{p,\sigma} \frac{1}{NM} \sum_{m=1}^M \sum_{i=1}^N \Vert p(i) - C_m(\sigma_m(i)) \Vert$$

where:

This is very similar to the K-Means objective function: (i) if $\sigma$ is given, then the optimal $p^*$ is found by a mean (for the squared Euclidean distance) or a median (for the Manhattan distance), and (ii) if $p$ is given, then the optimal $\sigma^*$ is found by a minimization. Thus, greedy alternating optimization is a nice heuristic: there is no way to guarantee the optimal solution, but the solution quality cannot worsen w.r.t. that objective until convergence. The so-called switching problem is handled here by the $\sigma$ functions. Indeed, if we did not have that switching problem, the solution would simply consist of one step (i). Initialized with the $\sigma^0$ of least entropy, a loop alternating steps (i) and (ii) works well in practice; a small sketch of this alternation follows.
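Here is a minimal sketch of that alternation, assuming the $M$ soft clusterings are stacked in an array `C` of shape `(M, N, K)`. The function name is mine, the switching step uses the Hungarian algorithm from SciPy, and the consensus is simply initialized with the first run instead of the least-entropy $\sigma^0$:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def aggregate_soft_clusterings(C, n_iter=20):
    """Greedy alternation: align each run's columns to the current consensus
    (step ii, the switching problem), then average the aligned runs (step i)."""
    M, N, K = C.shape
    p = C[0].copy()  # simplistic initialization of the consensus
    for _ in range(n_iter):
        aligned = np.empty_like(C)
        for m in range(M):
            # cost[j, k] = -agreement between column j of run m and column k of p
            cost = -C[m].T @ p
            _, cols = linear_sum_assignment(cost)   # run column j -> consensus column cols[j]
            aligned[m] = C[m][:, np.argsort(cols)]  # reorder columns to match the consensus
        p = aligned.mean(axis=0)  # step (i): mean over the aligned clusterings
    return p
```

Solving the matching exactly with the Hungarian algorithm is cheap here, since each cost matrix is only $K \times K$.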


Non-negative Weights Linear Models for Classification

22 December 2016

We suppose that the data are normalized (by mean and variance, or between 0 and 1). For certain applications, such as diagonal Mahalanobis Metric Learning, it is useful to have a linear classification model with non-negative weights. Without sign constraints on the weights, there exist several nice off-the-shelf implementations freely accessible on the web, for example:

but none of them offers a built-in mechanism to impose sign constraints on the weights.

One solution could be to modify these implementations and perform Projected Gradient Descent in the primal, zeroing the negative weights at each step (a quick sketch is given below). But this process is inconvenient because the nice convergence properties are then lost.
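For concreteness, here is what that projection-based alternative could look like; this is my own minimal sketch of SGD on an L2-regularized hinge loss, not code taken from the libraries above:

```python
import numpy as np

def projected_sgd_hinge(X, y, lam=1e-2, lr=1e-2, epochs=10, seed=0):
    """SGD on the L2-regularized hinge loss, projecting onto the
    non-negative orthant (zeroing negative weights) after each step."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w)
            grad = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
            w -= lr * grad
            w = np.maximum(w, 0.0)  # projection step: zero out negative weights
    return w
```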

In this post, I would like to suggest another way to benefit from existing linear model implementations without modifying their code. In a previous post (Regularized Bias in (Stochastic) Gradient Descent) I showed how to handle the regularized bias by adding a supplementary constant feature equal to $1000$. To impose non-negative weights, we can augment the data set with all possible $B$-hot vectors, with $B \gg 1$ (e.g. $B=1000$) and no bias component. A $B$-hot vector is a vector with zeros everywhere except for one bin, which contains $B$. If we look at the SVM formulation, we have, for weights $\mathbf{w}$, labels $y_i \in \{ -1,1 \}$ and examples $\mathbf{x}_i$:

$$ \min_{\mathbf{w}, \boldsymbol{\xi}} \frac{1}{2} \Vert \mathbf{w} \Vert^2_2 + C \sum_{i} \xi_i$$ such that, for each $i$ indexing the dataset, $$y_i \mathbf{w}^\top \mathbf{x}_i \ge 1 - \xi_i, \quad \xi_i \ge 0 $$

If $\mathbf{x}_i$ is a $B$-hot vector with a positive label ($y_i = 1$), then we get $$\mathbf{w}_k \ge \frac{1- \xi_i}{B} \simeq 0$$ for the dimension $k$ where the $B$ sits, and, using all possible $B$-hot vectors, for every $k$ indexing the dimensions of $\mathbf{w}$. So, up to the slack variable $\xi_i$, which is zero or below one most of the time, we have constrained the sign of $\mathbf{w}_k$. In practice, we can zero out the remaining negative values in the libraries' output.

This slight modification of regular SVMs also works for logistic regression (because the hinge and logistic losses behave almost the same way) and plays well with sparse representations, as $B$-hot vectors are ridiculously sparse.
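Here is a minimal scikit-learn sketch of the trick, assuming a feature matrix `X` of shape `(n, d)` and labels `y` in $\{-1, 1\}$; the helper name and the choice of LinearSVC are mine:

```python
import numpy as np
from scipy import sparse
from sklearn.svm import LinearSVC

def fit_nonnegative_linear_svm(X, y, B=1000.0, C=1.0):
    """Append one positively labeled B-hot vector per dimension so that the
    margin constraints push every weight towards being non-negative."""
    n, d = X.shape
    X_aug = sparse.vstack([sparse.csr_matrix(X), B * sparse.identity(d, format="csr")])
    y_aug = np.concatenate([np.asarray(y), np.ones(d)])  # B-hot vectors get label +1
    clf = LinearSVC(C=C, fit_intercept=False).fit(X_aug, y_aug)
    return np.maximum(clf.coef_.ravel(), 0.0)  # zero out any residual negative weights
```

Using sparse matrices keeps the augmentation cheap, since the extra rows are exactly the very sparse $B$-hot vectors mentioned above.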


Detection is not Classification

22 December 2015

There is a fundamental difference between Classification and Detection, let's say for the binary case. On the one hand, Classification distinguishes two well-defined classes and everything is symmetric. On the other hand, Detection aims to detect one class versus the rest of the world, so the problem is not symmetric. For example, getting a nice dataset for Classification is fairly easy: it is enough to sample points from both the positive and the negative classes. But in the Detection case, it is very difficult to sample "enough" negative points because representing the rest of the world is hard. For example, in the latent-SVM paper by Felzenszwalb et al. for Object Detection, there is a tedious Hard Negative Mining step to cope with this problem.

More recently, the Convolutional Neural Network revolution started by Alex Krizhevsky's seminal AlexNet handles this problem by leveraging a huge number of classes (thousands) in a simultaneous One-vs-Rest fashion. As we speak, there is no explicit discriminative framework for Detection without Classification. Maybe the One-Class SVM framework could do the job, but I have not experimented with it.
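For reference, a One-Class SVM only ever sees examples of the class to be detected; here is a tiny scikit-learn illustration on toy data (the hyperparameters are arbitrary):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train only on samples of the class we want to detect: no negatives required.
rng = np.random.default_rng(0)
X_pos = rng.normal(size=(500, 2))  # toy stand-in for the "positive" class

detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_pos)

# predict returns +1 for "looks like the training class" and -1 for "rest of the world"
print(detector.predict(np.array([[0.1, -0.2], [8.0, 8.0]])))
```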


Regularized Bias in (Stochastic) Gradient Descent

21 December 2015

It is useful to remind the reader that data should always be normalized, à la StandardScaler in scikit-learn (which subtracts the mean and divides by the standard deviation for each dimension of your data). In this context, optimization papers almost never specify how to handle the Bias. In Léon Bottou's excellent explanatory C++ code, there are several modes for regularizing (or not) the Bias.

What does that mean exactly? Data Science practitioners have often heard that the Bias is a detail and that it is enough to add a "1" dimension to the data. But in fact the Bias plays a crucial role during the optimization. For example, for Linear Support Vector Machines or Logistic Regression, the Bias sets the sign of the prediction, which drives the direction of the gradient step.

One way to handle the Bias is to add a "1000"-dimension instead of a "1"-dimension. This way, the coefficient learned for that dimension is the true Bias divided by 1000. The Bias is then still regularized, but only slightly, which is a trade-off between ease of optimization (no special case for the Bias) and no regularization of the Bias at all (which is unstable).
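Here is a minimal scikit-learn sketch of the trick (the helper name and the choice of SGDClassifier are mine, not prescribed by the post):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

B = 1000.0  # large constant feature standing in for the usual "1"

def fit_with_scaled_bias(X, y):
    """Normalize, append a constant B column, fit a linear model without its
    own intercept, then recover the (slightly regularized) effective Bias."""
    X = StandardScaler().fit_transform(X)
    X_aug = np.hstack([X, np.full((X.shape[0], 1), B)])
    clf = SGDClassifier(loss="hinge", penalty="l2", fit_intercept=False).fit(X_aug, y)
    w = clf.coef_.ravel()
    weights, bias = w[:-1], w[-1] * B  # effective Bias in the usual "w.x + b" convention
    return weights, bias
```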


Fun Stuff