Small Data

6 April 2019

In 2012, I was privileged to attend Divide-and-Conquer and Statistical Inference in Big Data, a talk given in French by Michael I. Jordan, a prominent figure in Machine Learning. That visionary talk began by establishing what can still be considered the crux of Data Science today, and this at a time when neither Hadoop nor Spark had yet been popularized:

Consider $N$, the number of elements in the database (its cardinality), and $D$, the dimensionality of each element.

  • Big Data regime $\frac{N}{D} \gg 1$: the developer's nightmare, mostly because the data may not fit in memory; yet, if the computations are doable at all, statistical theory gives good reasons for such systems to work very well, especially since the rise of stochastic optimization methods;

  • Small Data regime $\frac{N}{D} \ll 1$ or $\frac{N}{D} \simeq 1$: the statistician's nightmare, because of the mathematical curse of dimensionality, even though the computations are easier than in the previous case.

NB: Michael I. Jordan did not put it exactly this way; this is my own simplification (as is the toy sketch below).
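To make the split between the two regimes concrete, here is a minimal Python sketch (again my own illustration, not part of the talk) that labels a dataset from its $\frac{N}{D}$ ratio; the threshold is an arbitrary round number and the example figures reuse the orders of magnitude discussed below:

```python
# Toy illustration: label a dataset's regime from the ratio N / D.
# The threshold of 10 is an arbitrary round number, not from the talk.

def regime(n_samples: float, dimensionality: float) -> str:
    """Return a rough label for the N/D ratio."""
    ratio = n_samples / dimensionality
    return "Big Data regime" if ratio >= 10 else "Small Data regime"

# Orders of magnitude reused from the examples discussed below.
print(regime(1e6, 1e6))   # ImageNet-like images: N/D ~ 1   -> Small Data regime
print(regime(1e9, 1e12))  # genomics-like data:   N/D << 1  -> Small Data regime
print(regime(1e9, 1e2))   # tall, narrow table:   N/D >> 1  -> Big Data regime
```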

Since then, data storage problems have been nicely solved, at least three times over, on the developer's side: hardware costs have dropped (and thus available computational power has grown, especially with the rise of GPUs), data access speeds have increased, and software solutions such as HDFS and Spark have emerged (for distributed file systems and in-memory computing, respectively). The curse of dimensionality, however, remains on the statistician's side. Computations have also improved, but not sufficiently energy-wise (which is beyond the scope of this blog post).

Enforcing structure on the data seems to be an efficient way to cope with the curse of dimensionality. Consider, for example, the huge progress accomplished on images from the 1990s to the 2010s: methods built on hierarchical convolutions were the key that unlocked good classification results in the historic ImageNet competition (a database of $\sim 10^6$ images with $\sim 10^6$ pixels each, hence $\frac{N}{D} \simeq 1$, which corresponds to a Small Data regime). Likewise, word ordering turned out to be the key to Natural Language Processing representations. In Genetics, however, despite worldwide research attention, Machine Learning still tends to be ineffective: one human being carries $D \simeq 10^{12}$ nucleotides, and databases are limited by the worldwide population ($N \sim 10^9$), hence $\frac{N}{D} \ll 1$, which corresponds to a Small Data regime once again. We have not yet leveraged any desirable but unknown structure among those nucleotides that would bring us back to an easier Big Data regime.
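To see, with back-of-the-envelope numbers of my own (an illustration, not figures from the competition), why convolutional structure tames the dimensionality, compare the parameter count of a fully connected layer on a $\sim 10^6$-pixel grayscale image with that of a small convolutional layer whose weights are shared across the whole image:

```python
# Back-of-the-envelope sketch (illustrative numbers only): how much
# convolutional structure shrinks the number of parameters to estimate
# for a ~10^6-pixel grayscale image.

image_pixels = 1_000 * 1_000   # ~10^6 pixels, as in the ImageNet example above
hidden_units = 1_000           # hypothetical width of a fully connected layer

# Dense layer: every pixel is connected to every hidden unit.
dense_params = image_pixels * hidden_units + hidden_units

# Convolutional layer: 64 filters of size 3x3, shared across all locations.
n_filters, kernel_size = 64, 3
conv_params = n_filters * kernel_size * kernel_size + n_filters

print(f"fully connected layer: {dense_params:,} parameters")  # ~10^9
print(f"convolutional layer:   {conv_params:,} parameters")   # ~6 x 10^2
```

Sharing a handful of small kernels across every image location is precisely the kind of structural prior that reduces, by several orders of magnitude, the number of quantities to estimate from the same $N$ images.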

The take-home message of this post is: enforce structure to cope with the curse of dimensionality.

Indeed, structure translates data knowledge and human expertise into a statistical language that a computer can handle far better. As a side note, if $N$ itself is not large ($\sim 10^1$ or $\sim 10^2$), then statistical estimators do not have much relevance anyway: this is the very Small Data regime ($N \leq 10^2$).
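As a quick, hand-made illustration of that last point (Gaussian data is assumed here purely for convenience), the simulation below shows how slowly the uncertainty of even the simplest estimator, the sample mean, shrinks with $N$:

```python
# Small simulation (illustrative only): the spread of the sample mean
# across repeated experiments shrinks only as 1/sqrt(N), so estimates
# built from N ~ 10^1 or 10^2 points remain very noisy.

import numpy as np

rng = np.random.default_rng(0)
true_mean, true_std = 0.0, 1.0
n_repetitions = 1_000  # number of repeated experiments per sample size

for n in (10, 100, 10_000):
    samples = rng.normal(true_mean, true_std, size=(n_repetitions, n))
    estimates = samples.mean(axis=1)  # one sample-mean estimate per experiment
    print(f"N = {n:>6}: empirical std of the estimate ~ {estimates.std():.3f} "
          f"(theory: {true_std / np.sqrt(n):.3f})")
```

With $N \sim 10^1$ the estimate still fluctuates by roughly a third of the raw data spread, which is the sense in which estimators carry little weight in such a very Small Data regime.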