Nov 12, 2021. 12 min

Bayesian Optimization and Hyperparameter Tuning.

1. A Shot in the Dark

Many modern machine learning algorithms have a large number of hyperparameters. The standard way to tune the set of hyperparameters is to have a validation set, separate from the training data set, over which the performance of the model is evaluated and compared across different combinations of hyper-parameters. Instead of randomly trying out different combinations, a grid of hyper-parameters is considered. For each point on the grid, the model is tuned and trained over the training dataset, then the model is evaluated over a validation set by using a function that is a good proxy for the performance/fit of the model. The problem with such an approach is that, with the increasing complexity of the model, it may get computationally expensive to try out all the possible combinations, as the size of the grid increases by O(nk), where k is the number of different hyperparameters to tune. Simple grid search over all such combinations of parameters may turn out to be computationally impractical. Moreover some guesswork is necessary to specify the min and max values for each of the hyperparameter.

An alternative for going over all the points on the grid is to define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain. This is something that may be computationally feasible but still feels like taking a shot in the dark after each iteration. There is a desire for an algorithm that iteratively gets better at selecting the next hyperparameters to evaluate the model and Bayesian optimization offers exactly that. It offers an algorithm that use the results of experiments in the previous iterations to help improve the hyperparameter selection for the next iteration. It treats the performance of the model on the validation set as a complete black-box function. In this article, we would explore how the Gaussian process can help us get a good combination of hyperparameters after a few iterations.

2. Gaussian Process

Gaussian Process (GP) regression is a Bayesian statistical approach for modeling functions. The functions which are of interest have no mathematical formulation (black-box) and the closer two input points are the closer the values the function returns (a sense of continuity over input space).

We first collect f’s values at a finite collection of points x1, . . . , xk ∈ Rd. We get a vector of these values [f(x1), ..., f(xk)]⊤. we suppose that it was drawn at random by nature from some prior probability distribution. This prior distribution is taken as a multivariate normal, with a particular mean vector and covariance matrix. We initialize a mean vector µ0 = [µ0(x1), ..., µ0(xk)]⊤ and covariance matrix Σ0. Each element of the mean vector is calculated by a function µ0. The covariance matrix is initialized by using a kernel function which k. Each entry i,j is given by k(xi, xj ) Since we believe the function f is such that if x is close to y then f(x) is close to f(y). The kernel is chosen such that points xi, xj that are closer in the input space have a large positive correlation. One example of such a kernel is the RBF kernel.

K(x, x′) = e−α||x−x′||2

Say if we were interested in predicting housing prices. It makes sense if my neighbor who has an identical house as me will have the same price as mine. On the other hand, my houses price is practically independent of what price a mansion is being sold at. The vector of a house may include features such as no. of rooms, area, locality etc. The price is the function we want to predict. It can be modeled using a Gaussian process. For now, let's stick with the task at hand. How is setting such prior is going to help us model the function over entire range of values of the input vector?

The prior distribution of [f(x1), ..., f(xk)]⊤ is,

f(x1:k) ∼ Normal(µ0, Σ0)

Suppose we have n sample points and we observe a new data point x, the prior distribution for the collection of n+1 points i.e f(x1:n+1) is the same as mentioned above, just replace k with n+1 and xk is x. Now we get a posterior distribution for f(x) which is also a Gaussian distribution.

f(x)|f(x1:n) ∼ Normal(µn(x), σ2n(x))

Notation: if∈ Rn, xl:m represents the partition of x in Rm−l from lth to the mth element. xl:m = [xl, xl+1..., xm]⊤. f(xl:m) = [f(xl), f(xl+1)..., f(xm)]⊤

where,

µn(x) = Σ0(x, x1:n)Σ0(x1:n, x1:n)−1(f(x1:n) − µ0(x1:n)) + µ0(x)

Σ0(x, x1:n) is the vector of covariances of of all sample xi with x. It is [K(x1, x), ...k(xn, x)]. The matrix Σ0(x1:n, x1:n) is n × n sub matrix of the matrix Σ0 with the last row and column removed. Removing the rows and columns having co variances w.r.t x and the sample points.

The posterior mean µn(x) is a weighted average between the prior µ0(x) and an estimate based on the data f(x1:n), with a weight that depends on the kernel and

σ2n(x) = k(x, x) − Σ0(x, x1:n)Σ0(x1:n, x1:n)−1Σ0(x, x1:n)⊤

The proof of the conditional distribution being Gaussian, and also the derivation of posterior mean and co variance matrix can be found in any multivariate statistics book.

Figure 1: For x ∈ R, f(x) predicted by the Gaussian process

The performance of the model on a validation dataset is a function of all possible combinations of the hyperparameters. Our goal is to sample enough data points (till our budget runs out) to get a better estimate of the function and at the same time a good candidate that maximizes the performance function. For each x in the input space, we have a posterior distribution of f(x). The points which are not near the sampled point has a large uncertainty in the value (large variance). We can sample a point from that region to improve our predicted function (exploration). We can get greedy with the task at hand and sample the next point in the region of optimal predicted value (exploitation). Since the sampling isn’t cheap, we rely on an acquisition function which considers both the aspects to get the best candidate x for sampling.

3. Acquisition Function

Acquisition functions are mathematical functions that guide how the hyperparameter space should be explored during Bayesian optimization. There are a wide variety of options. In this section, we will explore a few of them. Formally, our goal is to find the location x ∈ Rd corresponding to the global maximum (or minimum) of a continuous function f : Rd7→ R.

Since sampling is expensive, we will use the following algorithm to iteratively reach our goal.

1. First model the function with a Gaussian process with an arbitrary mean and randomly sample a few points in the parameter space

2. Given the set of initial function evaluations, use the Bayes rule to obtain the posterior.

3. Use an acquisition function α(x), which depends on the posterior, to decide the next sample point xt = argmax α(x).

4. Add newly sampled data to the set of observations and repeat steps 2 - 3 till convergence or budget elapses.

3.1 Explore Uncertain Region (Exploration Strategy)

After sampling n points, the posterior distribution for x ∈ Rd is a multivariate Gaussian distribution. f(x)|f(x1:n) ~Normal(µn(x), σ2n(x))

The uncertainty (variance) at x is captured by σn(x).

In a pure exploration strategy, the acquisition function is the posterior standard deviation at the point x. α(x) = σn(x)

This strategy targets the points in the location where the posterior uncertainty (variance) is largest, which is equivalent to maximally reducing the overall predictive entropy of the modeled function. Consequently, it may eventually locate the global optima, but the rate of convergence may be very slow.

3.2 Maximize the Mean Function (Exploitation Strategy)

In a purely exploitation strategy, the acquisition function is the posterior mean at the point x. α(x) = µn(x)

This strategy targets the point which has the highest expected value of f at each iteration. Although we may converge to a good value, it can converge prematurely to a local optimum. It is a purely greedy strategy.

3.3 Upper Confidence Bound (UCB)

UCB is an optimistic policy, it is to overestimate the mean with added uncertainty. The acquisition function at the point x is a weighted sum of the mean and uncertainty (standard deviation).

α(x) = µn(x) + βtσn(x)

βt ≥ 0 is a positive weight. As βt is closer to zero, the strategy is closer to being an exploitative one. In contrast, if the βt becomes larger, the strategy is more exploratory.

3.4 Probability of Improvement (PI)

As the name suggests, the point which has a higher probability of improvement over the existing sampled optimum point is selected. Let D = {x1, x2, x3...xn} be the set of sampled points, x∗ = argmaxx∈Df(x). Then acquisition function, αP I is defined as the following.

αP I (x) = P(f(x) > f(x∗)|x, D)

where f(x)|x, D is the posterior Gaussian distribution.

3.5 Expected Improvement (EI)

Probability of improvement only looked at how likely it is that the next sample point is an improvement over the current maximum. It did not consider how much we can improve. Expected Improvement (EI) does exactly that. It chooses the next point as the one which has the highest expected improvement over the current max x∗. Let I(x) = max(f(x) − f(x∗), 0). Then acquisition function, αEI is defined as the following.

αEI = E(I(x)|x, D)

4. Summary

Bayesian Optimization is well suited when the function evaluations are expensive, making grid search impractical. For example, in the case of a large neural network that takes days to train. The hyperparameters may include the number of neurons per layer, the number of layers etc. Moreover, in the case of neural networks, the performance metric like accuracy don’t have any analytical cost function for the performance of the model over the hyperparameter space. It is a complete black-box function. In such a case, any gradient-based algorithm won’t be able to help us either. In such a scenario, Bayesian optimization is an ideal algorithm as it is also very cost-effective.

Hardik Prabhu

Hardik Prabhu is a member of the AI team of CloudAEye. He completed Masters in Data Science from Chennai Mathematical Institute in India.