Regularized Bias in (Stochastic) Gradient Descent

21 December 2015

It is useful to remind the reader that data should always be normalized, à la StandardScaler of scikit-learn (which substract the mean and divide by the standard deviation for each dimension of your data). In this context, the optimization papers almost never precise how to handle the Bias. In the excellent explanatory C++ code of Léon Bottou, we have several modes to regularize or not the Bias.

What does it mean exactly? Data Science practionners have often heard that the Bias is a detail and that it is enough to add a "1" dimension to the data. But in fact the Bias plays a crucial role during the optimization. For example, for Linear Support Vector Machines or Logistic Regression, the Bias sets the sign of the prediction driving the direction of the gradient step.

One way to handle the Bias is: add a "1000"-dimension instead of a "1"-dimension. This way, the true Bias found at the end of the optimization should be divided by 1000. In fact, the Bias will be regularized but just a little which is a trade-off between an ease of optimization (no special case for the Bias) and having no regularization for the Bias (which is instable).