How to Evaluate Machine Learning Models: Hyperparameter Tuning (dato.com)
57 points by mirceasoaica on May 30, 2015 | 11 comments


It is sad that the post fails to acknowledge that hyperparameter tuning can itself be a source of overfitting (and very often is) -- one should always treat it as part of training and validate it as such, for instance by using nested CV. And never, ever report accuracy straight from tuning as a final result.


Don't be sad. I'm happy to update the blog post as needed. By overfitting, do you mean over-optimizing the results on the validation set? Based on what I understand about nested CV, it is only necessary if (1) the hold-out validation set is far too small and not representative of the overall data distribution, or (2) the model training procedure itself is unstable and produces models with wildly varying results on the same dataset.

To prevent overfitting to the training data, one performs hold-out validation or cv or early stopping in the training process.

To prevent overfitting of hyperparameters to a small validation dataset, or to mitigate the variance of the model training outcome, one can use nested cv.
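
For concreteness, here is a rough nested-CV sketch with a recent scikit-learn (the dataset, estimator, and grid are just placeholders):

    # Nested cross-validation: the inner loop tunes hyperparameters,
    # the outer loop estimates how well the whole tuning procedure generalizes.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

    # Inner loop: hyperparameter search over C and gamma.
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]},
                          cv=inner_cv)

    # Outer loop: each fold reruns the full tuning on its training portion,
    # so the reported scores are not contaminated by the tuning itself.
    scores = cross_val_score(search, X, y, cv=outer_cv)
    print(scores.mean(), scores.std())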

Is that along the lines of what you were looking for?


It is a common misconception and a huge source of disappointment with ML -- without proper validation of the whole model-building procedure (method selection + parameter tuning + feature selection + fitting), no amount of data or magic tricks can make you confident that there is no overfitting. Even a single hold-out test is risky, because it gives you no idea about the expected accuracy variance.


Well, you can use the bootstrap to calculate the variance. It costs computation. But it works. Cosma Shalizi wrote a really nice introduction to it: http://www.americanscientist.org/issues/pub/2010/3/the-boots...
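
Something along these lines (a toy sketch, assuming you already have the true labels and a model's predictions on a held-out test set as NumPy arrays):

    # Bootstrap the held-out test set to estimate the variance of accuracy.
    import numpy as np

    def bootstrap_accuracy(y_true, y_pred, n_boot=1000, seed=0):
        rng = np.random.default_rng(seed)
        n = len(y_true)
        accs = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, n, size=n)      # resample test points with replacement
            accs[b] = np.mean(y_true[idx] == y_pred[idx])
        return accs.mean(), accs.std()            # point estimate and its spread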


A very simple improvement over random search is to use a low-discrepancy sequence. They were designed almost exactly for this purpose (avoiding the problems caused by irrelevant dimensions). I don't know why I never see it suggested... it's not as good as Gaussian process modeling, but it's very easy to implement and it clearly dominates random search for this application.
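
Something like this, for instance (a rough sketch using SciPy's qmc module, which needs a fairly recent SciPy; the parameters and ranges are made up):

    # Draw hyperparameter candidates from a scrambled Sobol sequence
    # instead of i.i.d. uniform random points.
    from scipy.stats import qmc

    sampler = qmc.Sobol(d=2, scramble=True, seed=0)
    unit_points = sampler.random_base2(m=6)        # 2**6 = 64 points in [0, 1)^2

    # Map the unit square to log10 ranges for, say, C and gamma, then exponentiate.
    log_points = qmc.scale(unit_points, l_bounds=[-3, -4], u_bounds=[3, 0])
    candidates = 10.0 ** log_points                # each row is a (C, gamma) pair to try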


Great idea. I'll run some experiments to see how it performs. It sounds analogous to k-means++ initialization. Sobol sequences ring a bell: some of the Bayesian optimization software libraries may in fact use a Sobol sequence for the initial evaluations, but it may not be well documented.



In a happy situation, the sensitivity of the optimization to hyperparameter changes is low. That's why the 'random' approach provides reasonable results. If the optimization quality were heavily dependent on the hyperparameter -- for an exaggerated example, only giving good results for exactly one value of the hyperparameter -- then guessing 60 times and getting within 5% of the best value would not guarantee a good model.

The main difficulty with hyperparameters is that one often does not actually know a priori a reasonable range to search in. Suppose you have a regularisation constant C - without some calculation based on your data, how can you pick that constant? By picking the range of the hyperparameter, the problem is just punted to a hyper-hyperparameter.

More interesting than blindly guessing values is measuring the sensitivity of recall, precision, and cross-validation performance to changes in the hyperparameters. Make sure that the sensitivity is low!
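
One quick way to eyeball that (a sketch with scikit-learn's validation_curve; the estimator and range are placeholders):

    # Sweep one hyperparameter over a log-spaced range and watch how the
    # cross-validated score moves; a flat curve near the optimum means low sensitivity.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import validation_curve
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    C_range = np.logspace(-3, 3, 13)
    train_scores, val_scores = validation_curve(SVC(kernel="linear"), X, y,
                                                param_name="C", param_range=C_range,
                                                cv=5)
    for C, score in zip(C_range, val_scores.mean(axis=1)):
        print("C=%g  cv accuracy=%.3f" % (C, score))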


There's research showing that in practice, grid search and random search beat most of the alternatives. They're also easier to parallelize, thankfully!
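
For what it's worth, with scikit-learn that's essentially a one-liner (a sketch; the distributions and estimator are made up):

    # Random search over hyperparameters, parallelized across cores via n_jobs.
    from scipy.stats import loguniform
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    param_dist = {"C": loguniform(1e-3, 1e3), "gamma": loguniform(1e-4, 1e0)}
    search = RandomizedSearchCV(SVC(), param_dist, n_iter=60, cv=5,
                                n_jobs=-1, random_state=0)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)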


Check out whetstone labs. Their Bayesian grid search tech is awesome.


I think you mean https://www.whetlab.com/



