Scalable Bayesian optimization

Batch Bayesian optimization/parallel function evaluations

  • Note that in the basic algorithm for Bayesian optimization described in the previous notebook, new points \(x_{n+1}\) are evaluated sequentially.
  • Using multiple computing resources allows for obtaining function evaluations at multiple new points \(\{x_{n+1}^{(1)},\ldots,x_{n+1}^{(q)}\}\) in parallel at each iteration.
  • In such a case, the acquisition function needs to be modified to handle batch acquisition of the new points \(\{x_{n+1}^{(1)},\ldots,x_{n+1}^{(q)}\}\).
  • In our work, we utilize parallel function evaluations as a first way of scalability.
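One common way to turn a sequential acquisition function into a batch one is the greedy "constant liar" heuristic: pick the best candidate, temporarily attach a fabricated observation to it, refit the surrogate, and repeat \(q\) times. The sketch below illustrates this with a plain NumPy/SciPy GP and expected improvement (for minimization); all helper names are illustrative, not from the text.

```python
import numpy as np
from scipy.stats import norm

def rbf(A, B, ls=0.2):
    """Squared-exponential kernel between row-stacked points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Posterior mean and stddev of a zero-mean GP at test points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = np.diag(rbf(Xs, Xs) - Ks.T @ np.linalg.solve(K, Ks))
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sd, best):
    """EI for minimization: expected gain below the incumbent `best`."""
    z = (best - mu) / sd
    return sd * (z * norm.cdf(z) + norm.pdf(z))

def propose_batch(X, y, candidates, q):
    """Greedy constant-liar batch selection: after each pick, pretend the
    point returned the current best value, refit, and pick again."""
    Xf, yf = X.copy(), y.copy()
    batch = []
    for _ in range(q):
        mu, sd = gp_posterior(Xf, yf, candidates)
        i = int(np.argmax(expected_improvement(mu, sd, yf.min())))
        batch.append(candidates[i])
        Xf = np.vstack([Xf, candidates[i : i + 1]])
        yf = np.append(yf, yf.min())  # the "lie"
    return np.array(batch)
```

Because each lie collapses the posterior uncertainty at the chosen point, its EI drops to near zero, so subsequent picks are pushed to different candidates and the batch spreads out.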

Distributed Gaussian processes via trust regions

GPs are a popular choice for the surrogate model, but they have limitations:

  • They have nice theoretical properties and are often sample-efficient.
  • However, they do not scale well to high dimensions (large \(D\)) and break down beyond roughly 20-30 dimensions [Frazier, 2018].
  • Reason: they rely on distances between points, which are large in high dimensions.
  • In particular, large portions of the search space receive high posterior uncertainty.
  • As a result, the algorithm spreads its effort over these regions and makes no progress within a reasonable evaluation budget.
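The distance argument can be checked numerically: between random points in \([0,1]^D\), the mean pairwise distance grows with \(D\) while its relative spread shrinks, so almost no observed point is "close" to most of the space. A small illustration (helper name is illustrative):

```python
import numpy as np

def distance_stats(D, n=2000, seed=0):
    """Mean distance and relative spread (std/mean) between random
    point pairs in [0, 1]^D. A small ratio means nearly all pairs are
    the same distance apart, so 'closeness' carries little signal."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, D))
    d = np.linalg.norm(X[: n // 2] - X[n // 2 :], axis=1)
    return d.mean(), d.std() / d.mean()
```

Comparing, e.g., `distance_stats(2)` with `distance_stats(200)` shows the mean growing and the ratio collapsing, which is why a stationary kernel assigns high uncertainty almost everywhere in high dimensions.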

This motivates the design of a collection of local GP models restricted to small regions of the search space, called trust regions (TRs).

  • This scheme was first presented in [Eriksson, et al., 2019] in an algorithm called TuRBO.
  • In our work, we utilize the same idea of multiple trust regions, each with its own independent GP surrogate model.
  • Note that this approach runs multiple trust regions \(\text{TR}_1,\ldots,\text{TR}_m\) in parallel, giving a second way of scalability.
  • Moreover, it is empirically observed in [Eriksson, et al., 2019] that using multiple trust regions allows optimization in higher dimensions \(D\) than nominal methods, giving a third way of scalability.

Importantly:

  • We change the implementation of TuRBO such that every trust region \(\text{TR}_l\) samples its own batch of new points at each iteration.
  • This leads to more function evaluations, but avoids communication between nodes, which is desirable in a cluster setting.

Trust regions:

  • Adapt their size/volume to the progress of the local optimization.
  • Terminate once progress has plateaued.
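A sketch of this size adaptation, in the spirit of TuRBO's success/failure counters [Eriksson, et al., 2019]: a streak of improving batches doubles the side length, a streak of failures halves it, and the region terminates once it shrinks below a floor. The class name, thresholds, and floor here are illustrative defaults, not the text's exact implementation.

```python
class TrustRegion:
    """Illustrative TuRBO-style side-length adaptation for one region."""

    def __init__(self, L=0.8, L_min=0.5**7, L_max=1.6,
                 succ_tol=3, fail_tol=5):
        self.L, self.L_min, self.L_max = L, L_min, L_max
        self.succ_tol, self.fail_tol = succ_tol, fail_tol
        self.succ = self.fail = 0

    def update(self, improved):
        """Call once per batch: improved=True if it beat the incumbent."""
        if improved:
            self.succ, self.fail = self.succ + 1, 0
            if self.succ == self.succ_tol:   # success streak: expand
                self.L = min(2.0 * self.L, self.L_max)
                self.succ = 0
        else:
            self.fail, self.succ = self.fail + 1, 0
            if self.fail == self.fail_tol:   # failure streak: shrink
                self.L *= 0.5
                self.fail = 0

    @property
    def terminated(self):
        """Progress has plateaued once the region is too small."""
        return self.L < self.L_min
```

Shrinking keeps the local GP focused where the model is accurate, while the termination floor frees the region's resources for a fresh restart elsewhere.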

Once all trust regions have terminated, we use deep kernel learning on all the data collected so far to improve the next round of trust regions, as discussed in the next subsection.

Effective representation via deep kernel learning

  • Deep kernel learning (DKL) was introduced in [Wilson, et al., 2015].
  • Learns a representation of the inputs \(x\) that is more meaningful for the GPs.
  • I.e., given a kernel function \(k(x,y)\), the DKL kernel is \(k(\phi(x|\theta),\phi(y|\theta))\) where \(\phi\) is a neural network parametrized by \(\theta\).
  • When maximizing the GP marginal likelihood, \(\theta\) is optimized alongside the other GP hyperparameters.
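A minimal sketch of the DKL kernel construction: an RBF base kernel \(k\) applied to the outputs of a small feature map \(\phi(\cdot|\theta)\). Here \(\theta\) is a fixed one-hidden-layer network purely for illustration; in DKL proper it would be trained jointly with the lengthscale and the other GP hyperparameters by maximizing the marginal likelihood. All names are illustrative.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """Base squared-exponential kernel on row-stacked feature vectors."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def dkl_kernel(X, Y, theta, ls=1.0):
    """k_DKL(x, y) = k(phi(x|theta), phi(y|theta)) with phi a tiny MLP.
    theta = (W1, W2) is held fixed here; DKL would optimize it."""
    W1, W2 = theta
    phi = lambda Z: np.tanh(Z @ W1) @ W2  # one-hidden-layer feature map
    return rbf(phi(X), phi(Y), ls)
```

Since the base kernel is applied to transformed inputs, the result is still a valid (symmetric, positive semi-definite) kernel, so it can be dropped into any GP implementation unchanged.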

Final algorithm

  • While evaluation budget remains:
    • For each trust region \(\text{TR}_l\) in parallel:
      • Observe \(f\) at \(n_0\) initial points.
      • Run TuRBO until termination.
    • Aggregate all data collected so far, (re)train the DKL kernel, and use it as the kernel for subsequent local GP surrogate models.

Return the best solution found.
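The control flow above can be sketched end to end. To keep the example self-contained, the local surrogates and batch acquisition are replaced by uniform sampling inside each trust region, with simple recenter/expand-on-success and shrink-on-failure updates; a real implementation would fit a (DKL) GP per region and maximize a batch acquisition function instead. All names and parameters are illustrative.

```python
import numpy as np

def scalable_bo_sketch(f, D, m=3, q=4, n_rounds=2, n_iters=10, seed=0):
    """Control-flow sketch: m trust regions, each drawing its own batch
    of q points per iteration, over n_rounds rounds of restarts."""
    rng = np.random.default_rng(seed)
    X_all, y_all = [], []
    for _ in range(n_rounds):                # rounds of trust regions
        centers = rng.uniform(size=(m, D))   # m fresh regions in [0, 1]^D
        lengths = np.full(m, 0.4)
        for _ in range(n_iters):             # regions run "in parallel"
            for l in range(m):
                lo = np.clip(centers[l] - lengths[l] / 2, 0.0, 1.0)
                hi = np.clip(centers[l] + lengths[l] / 2, 0.0, 1.0)
                Xb = rng.uniform(lo, hi, size=(q, D))  # region's own batch
                yb = np.array([f(x) for x in Xb])
                X_all.append(Xb)
                y_all.append(yb)
                j = yb.argmin()
                if yb[j] < f(centers[l]):    # success: recenter and expand
                    centers[l] = Xb[j]
                    lengths[l] = min(2 * lengths[l], 1.6)
                else:                        # failure: shrink
                    lengths[l] /= 2
        # here: aggregate (X_all, y_all) and retrain the DKL kernel
    X_all, y_all = np.vstack(X_all), np.concatenate(y_all)
    i = y_all.argmin()
    return X_all[i], y_all[i]
```

On a toy sphere function with minimum at \(0.5\cdot\mathbf{1}\), the sketch reliably closes in on the optimum with a few hundred evaluations, despite using no surrogate at all, which isolates the contribution of the trust-region bookkeeping from that of the GP model.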