July 14, 2015

Notes on Surrogate-Based Optimization

Surrogate-based optimization (SBO) is an interesting and rich topic. In this post I'll discuss:

how you can view SBO from the perspective of stochastic processes;
how, theoretically, SBO can fail if not properly performed.

The post is formula-free; the reader is referred to [4] and references therein for technical details.

SBO from the perspective of stochastic processes

We can describe SBO in terms of stochastic processes (a.k.a. random fields, in a multi-dimensional setting), because they can express our prior expectations about possible forms of the objective function.

The stochastic processes actually used for SBO are normally Gaussian (or closely related to Gaussian processes, e.g. functions thereof). A Gaussian process consists of Gaussian random variables located at each point of the design space, and is described by their mean values and mutual covariances.

We can interprete the mean value of a random variable at a given point as the approximation to the objective function at this point, and the variance (covariance of the variable with itself) as a measure of uncertainty of our prediction.

During an SBO run, we perform a sequence of measurements of the objective function at some test points. The random process is updated by conditioning on the past observations, and our approximations along with the uncertainty estimates are updated accordingly.

Example

The GIF picture below is an example of 1D SBO based on the Ornstein-Uhlenbeck process:

The top subplot shows the true objective function to be minimized and a sequence of approximations resulting from a growing set of its measurements.

The subplot in the middle shows the variance of the process.

Note that the approximations go through the measurement points, and the variance vanishes at the measurement positions. This is so because in the updated (conditioned) process the random variables at the positions of measurements trivialize into constants thereby losing all their randomness.

We next discuss what is shown in the bottom subplot.

How to Update Using the Function of Expected Improvement

How do we choose the next point where to evaluate the objective function?

Our goal is to find the global minimum of the objective function or, more realistically, get as close to it as possible using a small number of observations.

To move towards this goal, we can try to simply minimize, at each step, our current approximation and perform the next measurement at the found point of expected minimum.

However, this strategy is faulty because it does not take into acount the different uncertainty of the approximation at different points. In particular, it may in principle happen that the minimum is attained at a point already containing a measurement. In still case the SBO will be stalled.

Therefore, we need to include in the update algorithm the information about the variance, so as to add some incentive for the algorithm to explore underexplored regions. Finding a proper balance here is sometimes referred to as the dilemma of exploitation vs. exploration.

One particular way to reach a balance is to define the new measurement point as the maximizer of the function of expected improvement shown in the bottom subplot of the above figure. This function is defined as the expected value of the improvement in the best available observation, with the expectation taken with respect to the current state of the process. The expectation vanishes at the points of past measurements and is strictly positive otherwise, so we are guaranteed to avoid stalling.

Can SBO Fail?

In general, we expect a reasonable algorithm of global optimization to eventually visit the whole design space while exploring more actively the more promising regions. This agrees well with what we see in the above picture. Theoretically, one can ask if the infinite sequence of tests points is dense in the design space — otherwise, we may miss the global optimum.

Can we guarantee this density for the expected improvement algorithm described above?

It turns out that the answer depends on the properties of the objective function and on some more subtle details of the algorithm, in particular the covariance function of the process.

For the Ornstein-Uhlenbeck process, the density is easy to prove for all continuous objective functions thanks to the Markov property of this process [2].

The density has also been proved, under reasonable assumptions, for more general processes and in higher dimensions [1,3].

On the other hand, the density can theoretically fail if we, for example, try to optimize a non-analytic objective function using an analytic (e.g., squared-exponential) covariance without regularization — in this case the test point sequence can even converge to the first test point, thus being nowhere dense [4]:

The reason is that an analytic function is globally determined by its values in a small neighborhood of any point, so an analytic process can "think" that it "knows" the objective function precisely in a large domain without ever visiting this domain.

We stress, however, that this example is "idealistic" in the sense that the approximation and the function of expected improvement are constructed here without any regularization, i.e., under a "zero noise" assumption. In contrast, real-world problems always contain numerical and measurement noise, and any reasonable practical implementation of SBO, in particular the one used in pSeven, includes regularization taking care of the noise.

The Markov Property

If you look closely, you can notice a peculiar property holding for our first SBO example, but not for the second: adding a new test point into an interval between two existing test points changes the variance and the function of expected improvement inside this interval but not outside it.

This is explained by the fact that the Ornstein-Uhlenbeck process is Markovian (in fact, this is the only stochastic process that is at the same time stationary, Gaussian, and Markovian). As a result, information obtained from the new measurement does not go past the old measurement points. This is, of course, a very special property of the Ornstein-Uhlenbeck process and a purely one-dimensional phenomenon.

References

A. D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879-2904 (2011) (arxiv)
M. Locatelli. Bayesian algorithms for one-dimensional global optimization. Journal of Global Optimization, 10:57-76 (1997)
E. Vazquez and J. Bect. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference, 140(11):3088-3095 (2010) (arxiv)
D. Yarotsky. Examples of inconsistency in optimization by expected improvement. Journal of Global Optimization. Volume 56, Issue 4, pp 1773-1790 (2013) (arxiv)