Bayesian optimization in Cloud Machine Learning Engine
At Google, we implement hyperparameter tuning using an algorithm called Gaussian process bandits, which is a form of Bayesian optimization.
Goal: select hyperparameters that give the best performance
Optimization problem: maximize a function f(x) (i.e. performance as a function of hyperparameter values) over a compact set A, which we can write as $\max_{x \in A} f(x)$
Problem:
Standard optimization settings assume that the objective function f(x) has a known mathematical form, is convex, and is cheap to evaluate
But these assumptions do not hold for hyperparameter search, where the function is unknown and expensive to evaluate
Bayesian optimization: compute a posterior distribution over the objective function based on the data (using the famous Bayes' theorem), and select good points to try with respect to this distribution
Gaussian processes
To use Bayesian optimization, we need a way to flexibly model distributions over objective functions
Need one such distribution to represent our belief about f(x) for each x
If x contains continuous hyperparameters, there will be infinitely many x for which we must model f(x), i.e. construct a distribution for it
Gaussian processes generalize multi-dimensional Gaussian distributions to distributions over functions, and versions exist that are flexible enough to model any objective function
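As a rough sketch of what modelling a belief over f(x) can look like, the snippet below fits a Gaussian process to a handful of made-up (hyperparameter, score) observations and queries its posterior mean and standard deviation at new candidate points. It uses scikit-learn's GaussianProcessRegressor with a Matérn kernel purely for illustration; this is not the kernel or implementation used by Cloud ML Engine / Vizier.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# A few observed trials: hyperparameter value -> objective (values are made up).
X_obs = np.array([[0.001], [0.01], [0.1], [0.5]])   # e.g. learning rates
y_obs = np.array([0.71, 0.82, 0.78, 0.60])          # e.g. validation accuracies

# Fit a GP posterior over the objective function given the observations.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
gp.fit(X_obs, y_obs)

# Posterior mean and standard deviation at candidate points: our belief about f(x).
X_cand = np.linspace(0.0005, 0.6, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_cand, return_std=True)
```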
We have an unknown objective function, denoted by the dotted line, which is the actual reward the agent will receive. Our goal is to find the maximum of that reward.
How to understand this?
Once we have a model for the objective function, we need to select a good point to try. This creates an exploration-exploitation tradeoff.
To encode the tradeoff between exploration and exploitation, we can define an acquisition function that provides a single measure of how useful it would be to try any given point. We can then repeatedly find the point that maximizes this acquisition function and try it next.
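A minimal sketch of that loop, under the same toy setup as above: `acquisition` is any function that scores candidates from the posterior mean and standard deviation (such as the probability of improvement, optimistic, or expected improvement rules below), and `train_and_evaluate` stands in for the expensive objective. Both names are placeholders introduced here, not Cloud ML Engine APIs.

```python
import numpy as np

def bayes_opt_loop(gp, acquisition, train_and_evaluate, X_obs, y_obs, n_trials=10):
    """Repeatedly pick and evaluate the candidate that maximizes the acquisition function."""
    X_cand = np.linspace(0.0005, 0.6, 200).reshape(-1, 1)   # candidate hyperparameter grid
    for _ in range(n_trials):
        gp.fit(X_obs, y_obs)                                 # update posterior with all data so far
        mu, sigma = gp.predict(X_cand, return_std=True)
        scores = acquisition(mu, sigma, y_obs.max())         # how useful would each candidate be?
        x_next = X_cand[np.argmax(scores)]                   # most promising point to try next
        y_next = train_and_evaluate(x_next)                  # expensive evaluation (train a model)
        X_obs = np.vstack([X_obs, [x_next]])
        y_obs = np.append(y_obs, y_next)
    return X_obs, y_obs
```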
The main idea behind the probability of improvement acquisition function is that we pick the next point based on the maximum probability of improvement (MPI) with respect to the current maximum.
From the Gaussian process, the predictive distribution for the observed value at $x_{t+1}$, given the first $t$ observations $\mathcal{D}_t$, is $P\big(y_{t+1} \mid \mathcal{D}_t, x_{t+1}\big) = \mathcal{N}\big(\mu_t(x_{t+1}),\, \sigma_t^2(x_{t+1})\big)$, where $\mu_t$ and $\sigma_t^2$ are the posterior mean and variance of the Gaussian process.
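Because that predictive distribution is Gaussian, the probability of improving on the current maximum has a closed form in terms of the Gaussian CDF. The sketch below is our own illustration (with a small hypothetical margin `xi`), not Vizier's code:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, y_best, xi=0.01):
    """P(f(x) >= y_best + xi) under the GP posterior N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)      # guard against zero predictive variance
    z = (mu - y_best - xi) / sigma
    return norm.cdf(z)
```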
The simplest acquisition function looks at an optimistic value for the point. Given a parameter beta, it assumes the value of the point will be beta standard deviations above the mean. Mathematically, it is $\mu(x) + \beta\,\sigma(x)$. By varying beta, we can encourage the algorithm to explore or exploit more.
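In code, this optimistic rule (often called an upper confidence bound) is a one-liner; it is sketched here with the same signature as the other acquisition functions so it can plug into the loop above, even though it ignores the current best value:

```python
def upper_confidence_bound(mu, sigma, y_best=None, beta=2.0):
    """Optimistic score: posterior mean plus beta posterior standard deviations."""
    # Larger beta -> more exploration of uncertain points; smaller beta -> more exploitation.
    return mu + beta * sigma
```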
The acquisition function that Cloud ML Engine uses is an implementation in Google Vizier called Expected Improvement. It measures the expected increase in the maximum objective value seen over all experiments, given the next point we pick. In contrast, MPI only takes into account the probability of improvement and not the magnitude of improvement for the next point. Mathematically, as proposed, it is $\mathrm{EI}(x) = \mathbb{E}\big[\max\big(0,\, f(x) - f^{*}_{t}\big) \mid \mathcal{D}_t\big]$, where $f^{*}_{t}$ is the maximum objective value observed in the first $t$ trials.
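Under the Gaussian predictive distribution, expected improvement also has a standard closed form using the Gaussian PDF and CDF. The sketch below follows that textbook formula and is only an illustration, not Vizier's implementation:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, xi=0.01):
    """E[max(0, f(x) - y_best - xi)] under the GP posterior N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```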