Bayesian optimization in Cloud Machine Learning Engine
At Google, we implement hyperparameter tuning using an algorithm called Gaussian process bandits, which is a form of Bayesian optimization.
Goal: select hyperparameters that give the best performance
Optimization problem: maximize a function f(x) (i.e. performance as a function of hyperparameter values) over a compact set A, which we can write as $\max_{x \in A} f(x)$
Problem:
Standard optimization settings assume that the objective function f(x) has a known mathematical form, is convex, and is cheap to evaluate
But these assumptions do not hold for hyperparameter search, where the function is unknown and expensive to evaluate
Bayesian optimization: compute a posterior distribution over the objective function based on the data (using the famous Bayes' theorem), and select good points to try with respect to this distribution
Gaussian processes
To use Bayesian optimization, we need a way to flexibly model distributions over objective functions
Need one such distribution to represent our belief about f(x) for each x
If x contains continuous hyperparameters, there will be infinitely many x for which we must model f(x), i.e. construct a distribution for it
Gaussian processes generalize multi-dimensional Gaussian distributions to distributions over functions, and versions exist that are flexible enough to model any objective function
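As a rough sketch of what modelling a belief over f(x) can look like, the snippet below fits a Gaussian process to a handful of made-up (hyperparameter, score) observations and queries its posterior mean and standard deviation at new candidate points. It uses scikit-learn's GaussianProcessRegressor with a Matérn kernel purely for illustration; this is not the kernel or implementation used by Cloud ML Engine / Vizier.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# A few observed trials: hyperparameter value -> objective (values are made up).
X_obs = np.array([[0.001], [0.01], [0.1], [0.5]])   # e.g. learning rates
y_obs = np.array([0.71, 0.82, 0.78, 0.60])          # e.g. validation accuracies

# Fit a GP posterior over the objective function given the observations.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
gp.fit(X_obs, y_obs)

# Posterior mean and standard deviation at candidate points: our belief about f(x).
X_cand = np.linspace(0.0005, 0.6, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_cand, return_std=True)
```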
We have an unknown objective function, denoted by the dotted line, which is the actual reward the agent will receive. Our goal is to find the maximum of that reward.
How to understand this?
Once we have a model for the objective function, we need to select a good point to try. This creates an exploration-exploitation tradeoff.
To encode the tradeoff between exploration and exploitation, we can define an acquisition function that provides a single measure of how useful it would be to try any given point. We can then repeatedly find the point that maximizes this acquisition function and try it next.
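A minimal sketch of that loop, under the same toy setup as above: `acquisition` is any function that scores candidates from the posterior mean and standard deviation (such as the probability of improvement, optimistic, or expected improvement rules below), and `train_and_evaluate` stands in for the expensive objective. Both names are placeholders introduced here, not Cloud ML Engine APIs.

```python
import numpy as np

def bayes_opt_loop(gp, acquisition, train_and_evaluate, X_obs, y_obs, n_trials=10):
    """Repeatedly pick and evaluate the candidate that maximizes the acquisition function."""
    X_cand = np.linspace(0.0005, 0.6, 200).reshape(-1, 1)   # candidate hyperparameter grid
    for _ in range(n_trials):
        gp.fit(X_obs, y_obs)                                 # update posterior with all data so far
        mu, sigma = gp.predict(X_cand, return_std=True)
        scores = acquisition(mu, sigma, y_obs.max())         # how useful would each candidate be?
        x_next = X_cand[np.argmax(scores)]                   # most promising point to try next
        y_next = train_and_evaluate(x_next)                  # expensive evaluation (train a model)
        X_obs = np.vstack([X_obs, [x_next]])
        y_obs = np.append(y_obs, y_next)
    return X_obs, y_obs
```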
The main idea behind the probability of improvement acquisition function is that we pick the next point based on the maximum probability of improvement (MPI) with respect to the current maximum.
From the Gaussian process, the predictive distribution for the observed value at $x_{t+1}$, given the first $t$ observations $\mathcal{D}_t$, is $P\big(y_{t+1} \mid \mathcal{D}_t, x_{t+1}\big) = \mathcal{N}\big(\mu_t(x_{t+1}),\, \sigma_t^2(x_{t+1})\big)$, where $\mu_t$ and $\sigma_t^2$ are the posterior mean and variance of the Gaussian process.
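Because that predictive distribution is Gaussian, the probability of improving on the current maximum has a closed form in terms of the Gaussian CDF. The sketch below is our own illustration (with a small hypothetical margin `xi`), not Vizier's code:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, y_best, xi=0.01):
    """P(f(x) >= y_best + xi) under the GP posterior N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)      # guard against zero predictive variance
    z = (mu - y_best - xi) / sigma
    return norm.cdf(z)
```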
The simplest acquisition function looks at an optimistic value for the point. Given a parameter beta, it assumes the value of the point will be beta standard deviations above the mean. Mathematically, it is $\mu(x) + \beta\,\sigma(x)$. By varying beta, we can encourage the algorithm to explore or exploit more.
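In code, this optimistic rule (often called an upper confidence bound) is a one-liner; it is sketched here with the same signature as the other acquisition functions so it can plug into the loop above, even though it ignores the current best value:

```python
def upper_confidence_bound(mu, sigma, y_best=None, beta=2.0):
    """Optimistic score: posterior mean plus beta posterior standard deviations."""
    # Larger beta -> more exploration of uncertain points; smaller beta -> more exploitation.
    return mu + beta * sigma
```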
The acquisition function that Cloud ML Engine uses is an implementation in Google Vizier called Expected Improvement. It measures the expected increase in the maximum objective value seen over all experiments, given the next point we pick. In contrast, MPI only takes into account the probability of improvement and not the magnitude of improvement for the next point. Mathematically, as proposed, it is $\mathrm{EI}(x) = \mathbb{E}\big[\max\big(0,\, f(x) - f^{*}_{t}\big) \mid \mathcal{D}_t\big]$, where $f^{*}_{t}$ is the maximum objective value observed in the first $t$ trials.
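Under the Gaussian predictive distribution, expected improvement also has a standard closed form using the Gaussian PDF and CDF. The sketch below follows that textbook formula and is only an illustration, not Vizier's implementation:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, xi=0.01):
    """E[max(0, f(x) - y_best - xi)] under the GP posterior N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```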