Adaptive Bayesian contextual hyperband: A novel hyperparameter optimization approach

Hyperparameter tuning plays a significant role in building machine learning and deep learning models. The tuning process aims to find the optimal hyperparameter setting for a model or algorithm from a predefined search space of hyperparameter configurations. Several tuning algorithms have been proposed in recent years, and there is still scope for improvement in achieving a better exploration-exploitation tradeoff over the search space. In this paper, we present a novel hyperparameter tuning algorithm named adaptive Bayesian contextual hyperband (Adaptive BCHB) that incorporates a new sampling approach to identify the best regions of the search space and exploits the configurations that produce the minimum validation loss by dynamically updating a loss threshold in every iteration. The proposed algorithm is assessed using benchmark models and datasets on traditional machine learning tasks. Adaptive BCHB shows a significant improvement in accuracy and computational time across different types of hyperparameters when compared with state-of-the-art tuning algorithms. This is an open access article under the CC BY-SA license.


INTRODUCTION
Hyperparameters (HPs) allow for a great deal of configuration of machine learning (ML) algorithms. Because hyperparameter values frequently have considerable effects on a model's architecture, complexity, behavior, speed, and other features, it is important to choose them carefully in order to achieve optimum results. Choosing these values through human trial and error ends up being time-consuming, frequently biased, error-prone, and computationally irreproducible. Yet machine learning experts must adjust a number of hyperparameters to enhance task accuracy and minimize training resource consumption. Examples include the choice of deep neural network (DNN) model architecture, the training dataset, the optimization technique, the optimization algorithm's hyperparameters (such as learning rate and momentum), and the training time budget. Optimization of hyperparameters started gaining momentum in the early 1990s [1], and it has long been known that specific hyperparameter settings tend to work better for particular datasets. In contrast, the ability of hyperparameter optimization (HPO) to adapt general-purpose pipelines for use in particular application domains is a very recent realization. It is now generally acknowledged that optimized hyperparameters outperform the default settings provided by the majority of machine learning frameworks. It is preferable to outsource HPO to appropriate algorithms and machines to improve efficiency and ensure reproducibility. This is because the mathematical formalisation of HPO is essentially black-box optimization, frequently in a high-dimensional environment. The primary objective of automated machine learning (AutoML) is therefore to configure these hyperparameters automatically to improve performance [2]-[5], thereby saving precious execution time. In particular, deep neural networks, which depend on a variety of hyperparameter decisions about their architecture, optimization,
and regularization, benefit the most from automated hyperparameter search. Hyperparameter optimization using AutoML techniques has several important advantages [6], [7], including: i) reducing human effort by automating the entire hyperparameter optimization process; ii) tailoring the HPO process to suit the needs of the problem at hand; and iii) improving the reproducibility of experiments, as it removes the need for manual search. Several HPO algorithms have been proposed in recent years, and their efficiency depends on several factors, including how the initial hyperparameter search space is defined and how that space is explored. This paper aims to: i) formulate a novel algorithm for hyperparameter tuning that achieves a good trade-off between exploration and exploitation of the hyperparameter search space, exploring new regions of the space for better-performing configurations by including the observations seen up to time step t and exploiting the best configurations with more resources; ii) present a new sampling technique that creates a sample space containing equal numbers of historical observations and random hyperparameter configurations, based on an adaptive threshold for the loss; and iii) select only the best-performing configurations in every iteration and divert resources to these configurations alone, thereby exploiting the hyperparameter search space. This research proposes a novel hyperparameter tuning procedure called adaptive Bayesian contextual hyperband (Adaptive BCHB) along with an innovative sampling strategy to accomplish the aforementioned objectives. The rest of the paper is organized as follows: section 2 surveys recent literature on hyperparameter tuning, its challenges, and applications; section 3 explains the hyperparameter tuning process, the proposed tuning algorithm, and the sampling technique; section 4 presents the experimental setup and results; and section 5 concludes.

RELATED WORK
Hyperparameter tuning is an important step in building an ML model for any application. Choosing the right hyperparameters helps a model reach good accuracy in less training time. Several tuning algorithms with different strategies have evolved through the years [1], [8], [9]. A hyperparameter search space is defined, and the algorithms differ in how this space is navigated. Grid search (GS) [10] is a brute-force strategy that lays the search space out as a grid and exhaustively evaluates all hyperparameter combinations in it using a predefined step size. This method is popular when the number of hyperparameters is small. However, when the number of hyperparameters grows, grid search becomes computationally expensive, as the number of evaluations grows exponentially, and the result depends entirely on the grid and step size. Random search (RS) [11] navigates the search space by selecting configurations randomly between the specified lower and upper bounds for each hyperparameter. This ensures a fair exploration of the search space. Because each hyperparameter configuration is selected independently of the others, the algorithm can be parallelized [10].
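As a concrete illustration, random search can be sketched in a few lines; the objective, bounds, and function names below are illustrative, not taken from the paper:

```python
import random

def random_search(search_space, objective, n_trials, seed=0):
    """Draw n_trials configurations uniformly at random and keep the best."""
    rng = random.Random(seed)
    best_conf, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Each hyperparameter is sampled independently within its bounds,
        # which is what makes the trials trivially parallelizable.
        conf = {name: rng.uniform(lo, hi) for name, (lo, hi) in search_space.items()}
        loss = objective(conf)
        if loss < best_loss:
            best_conf, best_loss = conf, loss
    return best_conf, best_loss

# Toy quadratic objective with its minimum at lr = 0.1, reg = 0.01.
space = {"lr": (1e-4, 1.0), "reg": (1e-5, 0.1)}
conf, loss = random_search(space, lambda c: (c["lr"] - 0.1) ** 2 + (c["reg"] - 0.01) ** 2, 200)
```

Grid search would replace the random draw with a nested loop over fixed step values, which is exactly what makes its cost grow exponentially with the number of hyperparameters.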
To overcome the limitations of RS and GS, Bayesian concepts were incorporated into tuning algorithms to identify and concentrate on well-performing regions, thereby reaching the optimum earlier [12]. Bayesian strategies [13] use prior and posterior distributions over the search space: in every iteration, the posterior is updated based on the prior and the current evidence. The iterations are sequential by nature. Some works, such as [14], [15], have achieved parallelism by dividing the search space into subspaces, applying a simpler tuning algorithm to find the best-performing regions, and then applying the intended tuning algorithm in the subspace containing the best region. There are several variations of Bayesian strategies based on the surrogate model used. Commonly used surrogate models are Gaussian processes (GP), tree-structured Parzen estimators (TPE), and random forests (RF), yielding the Bayesian optimization (BO) based tuning algorithms BO-GP, BO-TPE, and sequential model-based algorithm configuration (SMAC), respectively [16]. Unlike BO-GP, which is mainly used for continuous hyperparameters, SMAC supports discrete, continuous, categorical, and conditional hyperparameters. BO-TPE is tree-based and naturally supports conditional hyperparameters.
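A schematic of this sequential Bayesian loop might look as follows. The "surrogate" here is deliberately a stand-in (a real implementation would fit a GP or TPE to the history), and all names are illustrative assumptions:

```python
import random

def bayesian_opt_sketch(objective, bounds, n_init=5, n_iter=30, seed=1):
    """Schematic sequential BO loop: keep a history of (x, loss) evidence,
    let the incumbent best point stand in for the surrogate's belief, and
    pick the next candidate by a shrinking exploration step around it."""
    rng = random.Random(seed)
    lo, hi = bounds
    history = []
    for _ in range(n_init):                # initial random design
        x0 = rng.uniform(lo, hi)
        history.append((x0, objective(x0)))
    for t in range(n_iter):
        x_best, _ = min(history, key=lambda p: p[1])   # surrogate stand-in
        width = (hi - lo) / (t + 2)        # acquisition: explore less over time
        cand = min(hi, max(lo, x_best + rng.uniform(-width, width)))
        history.append((cand, objective(cand)))        # update the evidence
    return min(history, key=lambda p: p[1])
```

The sequential dependence is visible here: each candidate depends on the whole history so far, which is why works like [14], [15] must partition the search space to obtain parallelism.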
To limit the use of resources, a well-defined budget (the algorithm's number of iterations or running time) may be set before the start of the tuning process. This budget can then be utilized effectively on the better-performing configurations. Several algorithms [17] are based on this strategy; two commonly used ones are: i) successive halving and ii) Hyperband [18]. Successive halving divides its budget uniformly among all the selected configurations. Hyperband uses successive halving as a subroutine but dynamically determines the number of configurations to assess and the budget for every iteration. It starts with a predefined number of configurations on a limited budget and finishes with the best-performing configuration in the group, which is run with the maximum budget. Bayesian optimization Hyperband (BOHB) [19] is a technique that combines Bayesian techniques with the Hyperband algorithm. BOHB replaces the random search used in Hyperband to sample configurations from the search space with BO, which selects configurations based on the surrogate model built. This results in higher accuracy in less time than both successive halving and Hyperband. Studies [20] show that BOHB surpasses many other optimization techniques when tuning traditional machine learning models such as support vector machines (SVM) as well as deep learning (DL) models.
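The successive halving subroutine described above can be sketched as follows; this is a minimal version, and the `evaluate` callback and `eta` default are assumptions for illustration, not the exact interface of [18]:

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Evaluate every configuration on the current budget, keep the best
    1/eta of them, multiply the budget by eta, and repeat until one
    configuration survives."""
    budget = min_budget
    while len(configs) > 1:
        # Rank configurations by validation loss on the current budget.
        scored = sorted(configs, key=lambda c: evaluate(c, budget))
        configs = scored[:max(1, len(configs) // eta)]
        budget *= eta                      # survivors earn more resources
    return configs[0]
```

Hyperband wraps this routine in an outer loop of "brackets" that trade off the number of starting configurations against their initial budget, so no single choice of `min_budget` dominates.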
Beyond Bayesian algorithms, several other classes of algorithms, such as metaheuristics like genetic algorithms (GA) [21] and particle swarm optimization (PSO), are also used for hyperparameter tuning. There are several hybrid methodologies based on the commonly used algorithms mentioned above that are applied in many ML and DL models. These hybrid methods exploit the strengths of the techniques they combine while avoiding their limitations. One such method, discussed in [22], has been shown to work well for convolutional neural networks. A similar hybrid method based on PSO is used for autoencoders [23], [24]. Differential evolution (DE) strategies are used in [25], but the results show that they perform better with smaller datasets. DE algorithms are computationally expensive [26] and can be afforded only when huge computational resources are available [27].

METHOD
The hyperparameter tuning process is iterative and accepts the complete space of hyperparameters as input. In our earlier work [28], we formulated an algorithm named Bayesian contextual hyperband (BCHB) that incorporates a surrogate model and an acquisition function to select the next best configuration to run based on the loss values observed in previous iterations. An initial surrogate model constructed using default hyperparameters forms the basis of the algorithm. The inner loop of the algorithm (see the BCHB algorithm of [28]) executes r_i times (the number of resources allocated to a configuration) in each iteration i, updating the surrogate model with the next-best hyperparameter configuration and its associated loss value. The acquisition function S is used to acquire the best hyperparameter configuration, denoted h*. The history HT of the best configurations seen thus far is kept as (configuration, loss) pairs, and the surrogate model is revised with this past knowledge. The configuration h* is then assessed with the original loss function. Like BOHB, the get_hyperparameter_configuration(n) function in this algorithm randomly selects n configurations from the search space at the beginning of each bracket. Aspects of BCHB with scope for improvement are: i) at the start of each bracket, random sampling is used to select n new configurations from the surrogate model, which involves no exploitation of the observed values; ii) the number of configurations from the second iteration of successive halving onward in each bracket is fixed, even if a configuration performs poorly; and iii) the number of resources allotted to a configuration in each iteration is also fixed, leading to wasted computational resources on poorly performing configurations.
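The inner loop described above (acquire h* with S, evaluate its true loss, append to HT, refit the surrogate) can be sketched schematically. The function names and the toy surrogate below are illustrative assumptions, not BCHB's actual implementation:

```python
def bchb_inner_loop(surrogate_fit, acquisition, loss_fn, candidates, r_i, history):
    """Schematic BCHB inner loop: for r_i rounds, acquire the configuration
    h* that maximises the acquisition under the current surrogate, evaluate
    its true loss, record (configuration, loss) in the history HT, and refit
    the surrogate with that past knowledge."""
    model = surrogate_fit(history)
    for _ in range(r_i):
        h_star = max(candidates, key=lambda h: acquisition(model, h))
        history.append((h_star, loss_fn(h_star)))
        model = surrogate_fit(history)
    return history

# Toy instantiation (illustrative only): the "surrogate" is the mean observed
# loss per configuration, and the acquisition prefers low predicted loss,
# treating unseen configurations optimistically.
def fit(hist):
    means = {}
    for h, l in hist:
        means.setdefault(h, []).append(l)
    return {h: sum(v) / len(v) for h, v in means.items()}

acq = lambda m, h: -m.get(h, 0.0)
hist = bchb_inner_loop(fit, acq, lambda h: (h - 3) ** 2, [1, 2, 3, 4], r_i=4, history=[])
```

Note how the loop drifts toward the low-loss configuration as the surrogate accumulates evidence; point i) above is visible here too, since the very first pick under an empty history is effectively arbitrary.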

Adaptive Bayesian contextual hyperband (Adaptive BCHB)
To address the above concerns and improve tuning performance, we propose an improved version of BCHB named adaptive Bayesian contextual hyperband (Adaptive BCHB), given in Algorithm 1. The algorithm starts by defining R, the maximum number of resources that can be allotted to a single configuration, and a surrogate model M_0 built with the default settings of the model hyperparameters. The hyperparameter search space H is formed, and the minimum threshold for the validation loss is initialized as T. The algorithm uses an acquisition function S that aids in selecting the next best hyperparameters for the following iterations after eliminating poor performers based on a control parameter η (default 3) [19]. The algorithm uses the successive halving procedure of Hyperband and finally returns the optimal hyperparameter configuration. The key difference between BCHB and the proposed algorithm is that the latter keeps track of the history of observed configurations as HT. This history is used for sampling new configurations from the search space, as in Algorithm 2, for the subsequent iterations. The contributions of the improved Adaptive BCHB algorithm are as follows: i) the algorithm incorporates a novel sampling technique that explores the best regions of the original search space, obtained from the history of the best configurations observed, and the optimal samples for the next iteration are identified using a reward function; ii) a threshold for validation loss is used to select the number of configurations that advance through the successive halving iterations, and this threshold is updated dynamically to select more potential candidates from the search space rather than getting stuck in a local optimum; and iii) there is an upper bound on the number of times a hyperparameter configuration can be selected for execution, which introduces some randomness into the selection process, enabling new configurations to be pulled and resulting in a balance between
exploration and exploitation of the search space. Algorithm 2 shows the new sampling approach, which uses a search space H_new formed from sets of randomly sampled configurations from the original search space together with the best configurations in the history. A hyperparameter configuration h_i can be selected for the first time, or for the (k+1)-th time if it has already been selected k times. The number of times a configuration can be pulled is limited to N, to allow other potential candidates a chance to run. When a configuration h_i uses k resources, it produces an output z_{i,k} (model performance). The aim is to find the maximum z using a minimum number of resources. To achieve this, a reward rw_i is attached to configuration h_i based on its performance [29]; maximizing z_{i,k} is then equivalent to maximizing the reward rw_i. The reward function is updated each round based on the observed values. The probability density function of the reward rw_{i,k} is given in (1).
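A minimal sketch of this sampling idea — mixing historical winners with fresh random draws, and capping each configuration's pulls at N — might look as follows. The function names, reward bookkeeping, and fallback rule are assumptions for illustration, not Algorithm 2 verbatim:

```python
import random

def build_sample_space(history, sample_random, threshold, m, rng):
    """Form H_new (schematically): m configurations whose observed loss beat
    the current threshold, plus m fresh random configurations."""
    good = [h for h, loss in history if loss <= threshold][:m]
    fresh = [sample_random(rng) for _ in range(m)]
    return good + fresh

def select_configuration(space, pulls, rewards, n_max, rng):
    """Pick the highest-reward configuration that has not exhausted its pull
    budget N; unseen configurations get an optimistic (infinite) reward so
    new candidates still get pulled."""
    eligible = [h for h in space if pulls.get(h, 0) < n_max]
    if not eligible:                       # everyone is capped: fall back to random
        return rng.choice(space)
    return max(eligible, key=lambda h: rewards.get(h, float("inf")))

rng = random.Random(0)
space = build_sample_space([("a", 0.1), ("b", 0.9)], lambda r: "r", 0.5, 1, rng)
```

The pull cap is what re-injects exploration: once a strong configuration has been run N times, the selector is forced onto other candidates even if its reward remains the highest.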

Dynamic threshold for validation loss
The run_then_return_val_loss(h, r_i) function of Algorithm 1 returns the loss value obtained by running the model with r_i resources configured for hyperparameters h in round i. Based on these losses, the top_n() method identifies the top n configurations. Not all of the top-performing candidates meet the cutoff T and advance to the following cycle. The algorithm's ability to locate the ideal hyperparameters in a shorter time is significantly improved by eliminating the underperforming candidates in each iteration and allocating resources effectively in the next. The threshold for the current round is set using the set_threshold(T) function, which returns a new threshold T_i, the average of the threshold at round i−1 and the best (minimum) loss seen in the current round i, as given in (2):

T_i = (T_{i−1} + min_h L[h, r_i]) / 2   (2)
By averaging the thresholds, the algorithm guarantees that the loss cutoff adapts dynamically in each iteration, depending on the application. L[h, r_i] denotes the loss for a configuration h in HP running on r_i resources. The configurations that meet the requirement stated in (3), i.e., those whose loss does not exceed the current threshold, proceed to the subsequent halving iteration i+1. This allows the threshold to change every round, and the algorithm moves toward the best-performing region of the search space much faster.
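The threshold update of (2) and the advancement rule of (3) can be expressed directly; `set_threshold` mirrors the function named in Algorithm 1, while `advance` is an illustrative name for the filtering step:

```python
def set_threshold(prev_threshold, round_losses):
    """Eq. (2): the new threshold T_i is the average of the previous round's
    threshold and the best (minimum) validation loss seen this round."""
    return (prev_threshold + min(round_losses)) / 2

def advance(losses_by_config, threshold):
    """Eq. (3), as described in the text: only configurations whose loss does
    not exceed the current threshold move on to halving iteration i + 1."""
    return {h for h, loss in losses_by_config.items() if loss <= threshold}
```

Because the update averages in the best loss seen so far, the cutoff tightens monotonically toward good regions while never snapping all the way to the current minimum, which is what prevents premature lock-in on a local optimum.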

Adaptive budget
In the BCHB algorithm, each Hyperband bracket has a predetermined minimum budget r that is chosen at the beginning of the bracket. All n configurations are run at least once in the bracket using r resources. The input η and the bracket number s govern both n and r. The most exploratory bracket, s = s_max, has the smallest budget because it focuses on examining the search space over n distinct configurations. The bracket with the greatest potential for exploitation, s = 0, runs configurations with budget R, the maximum number of resources that can be allotted to a configuration. The number of configurations in each bracket s and the budget for each successive halving iteration i are determined by η and i.
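Assuming the bracket schedule follows the standard Hyperband formulas (n = ⌈(s_max+1)/(s+1)·η^s⌉ configurations with initial budget r = R·η^(−s) per bracket s), it can be computed as follows; this is a sketch of the conventional schedule, not code from the paper:

```python
import math

def bracket_schedule(R, eta=3):
    """Per-bracket starting points in standard Hyperband: bracket s_max is the
    most exploratory (many configs, tiny budget); bracket 0 is the most
    exploitative (few configs, budget R each)."""
    s_max = int(math.log(R, eta) + 1e-9)   # guard against float log error
    return [(s,
             math.ceil((s_max + 1) * eta ** s / (s + 1)),  # n configurations
             R / eta ** s)                                  # budget r each
            for s in range(s_max, -1, -1)]
```

For R = 81 and η = 3 this yields brackets from (s=4, 81 configs, budget 1) down to (s=0, 5 configs, budget 81), matching the exploration-to-exploitation sweep described above.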
In Adaptive BCHB, the number of configurations that advance to iteration i+1 is dynamic and less than or equal to n (of top_n()). This is achieved by setting the threshold and selecting the best configurations that meet it. However, for the first iteration of each bracket, where i = 0, the number of configurations chosen from HP and the resources given to those configurations are determined using (4).
This is the same technique adopted in BCHB. It allows the initial iteration to explore the search space using the greatest possible number of combinations while spending the fewest resources possible. For the following iterations i+1, the updated hyperparameter set HP defines the configurations that are chosen with the aid of the threshold T_i. The number of hyperparameter configurations n_i in these iterations, which is simply the cardinality of the set HP, is given by (5).
It is only sensible to adapt the minimum resources for each selected configuration, as n_i is now less than or equal to the n of top_n(). n_i now governs the allocation of resources rather than η; in other words, the number of resources provided to each configuration is adapted based on the number of configurations that advance to the next round. As a result, (6) is now used to determine the quantity of resources r_i in iteration i+1.
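If (6) simply divides the round's budget among the n_i survivors, as the text suggests, the adaptive allocation reduces to a one-liner; this exact form is an assumption for illustration:

```python
def adaptive_resources(round_budget, survivors):
    """Schematic eq. (6): with n_i survivors after thresholding, the round's
    budget is split by n_i (instead of scaling by eta), so resources freed by
    eliminated configurations go to the remaining candidates."""
    n_i = len(survivors)
    return round_budget // n_i if n_i else 0
```

The contrast with plain successive halving is that there the divisor is fixed by η, whereas here it shrinks with every configuration that fails the threshold, so strong candidates receive a larger share earlier.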
The Adaptive BCHB algorithm thus improves the exploitation of resources by allotting them only to the best-performing candidates. This strategy eliminates poorly performing configurations early in the iterations and helps the algorithm converge to the optimal configuration faster.

RESULTS AND DISCUSSION
The performance of the proposed algorithm is assessed on machine learning tasks using the benchmark datasets given in Table 1. The rationale for selecting these models is that they each feature a distinct type of hyperparameter and are effective for both classification and regression tasks. SVM has conditional hyperparameters, RF and neural networks (NN) each have multiple hyperparameters of various types, whereas K-nearest neighbors (KNN) has just one discrete hyperparameter. The tuning algorithms compared include Hyperband (HB) [18], Bayesian optimization Hyperband (BOHB) [19], Bayesian contextual hyperband (BCHB) [28], and the proposed Adaptive BCHB. Each experiment uses 5-fold cross-validation for the chosen HPO methodologies. While mean squared error (MSE) is employed for the regression task using the Boston Housing dataset, accuracy is the assessment metric for the classification tasks using the Modified National Institute of Standards and Technology (MNIST) and Canadian Institute for Advanced Research, 10 classes (CIFAR-10) datasets. The total execution time of each method with 5-fold cross-validation is referred to as the computational time, which is also used as a metric. Python 3.6 is used for all experiments on an Intel i7-8559U machine with 12 GB of RAM.

Performance of Adaptive BCHB with popular tuning algorithms
Tables 3-5 list the results of the tuning algorithms applied to the models. The proposed algorithm is compared with benchmark algorithms in terms of accuracy and computational time. The models are also run with default hyperparameters [20] to show, by comparison, the increase in performance after tuning.
From the results, it is evident that in terms of both accuracy and time, Adaptive BCHB outperforms all other tuning algorithms in all cases. It handles all types of hyperparameters (discrete, continuous, and categorical) well and produces the best results overall. The time taken to train the models with default hyperparameters is less than that of the tuning algorithms, as no tuning is done; these values are included in the tables to show how each model performs on its own, without any hyperparameter tuning, and to set the baseline. The classification tasks on the MNIST and CIFAR datasets take images as input, so their computational time is slightly higher than that of the regression task. Grid search performs poorly, as expected, because it explores the search space on a fixed grid without exploiting the best regions of the space. Though the model performance of GS is close to that of RS, it takes twice as long as RS to complete. The Bayesian method BO-TPE performs better than grid and random search but fails to compete with the bandit-influenced Bayesian methods. Hyperband and BOHB compete closely in terms of model performance, but BOHB shows a significant improvement in computational time.
The proposed algorithm, Adaptive BCHB, does well on computational time (better than BCHB), since the poorly performing hyperparameter candidates in each round are discarded and their resources diverted to better performers. This helps the algorithm exploit only the best-performing regions and limits exploration, which saves time. The model performance of Adaptive BCHB is 7% higher than the default hyperparameter settings for random forest, as shown in Table 4. Notably, Adaptive BCHB performs well on all tasks using the neural network model in terms of both model performance and time; it is 1.5 times faster than BOHB when used with the NN, as shown in Table 5. Among all the Bayesian methods, Adaptive BCHB performs well on all three tasks across the three benchmark datasets with different types of hyperparameters.

Performance comparison of BCHB and Adaptive BCHB with varying budgets
The performance of the proposed Adaptive BCHB algorithm is compared with its predecessor BCHB using the NN model and its hyperparameters. The model is run on all three benchmark datasets mentioned earlier, and the results are plotted as test error against varying budgets (numbers of iterations).
Figure 1 illustrates the error for the MNIST dataset, where Adaptive BCHB initially produces errors similar to BCHB but gradually improves as the number of iterations increases. Due to the choice of starting cutoff, the test error obtained for the CIFAR-10 dataset during the early iterations differs significantly from BCHB, as shown in Figure 2. In the subsequent iterations, as the threshold for validation loss is updated based on the previous iterations, the proposed approach converges with a smaller difference in test error. The technique starts with a 17% drop in error for the Boston Housing dataset (see Figure 3) and converges faster than it does for the other two datasets. Additionally, there is a sizable error difference between BCHB and Adaptive BCHB throughout the iterations of the algorithm. The Adaptive BCHB method begins to converge at a budget of 200 iterations for the MNIST dataset and 150 iterations for the CIFAR-10 and Boston Housing datasets.

CONCLUSION
Data analysis has gained attention in recent years due to the availability of computational resources. ML and DL models applied in these settings need to perform better to satisfy growing demands. Any good ML or DL model needs its hyperparameters to be set, and manually tuning these hyperparameters is not only computationally expensive but time-consuming; trial and error methods are often error-prone. Automating the process of hyperparameter tuning hence becomes a necessity. Several hyperparameter tuning algorithms

Figure 1. Varied budget analysis of Adaptive BCHB and BCHB on the MNIST dataset
Figure 2. Varied budget analysis of Adaptive BCHB and BCHB on the CIFAR-10 dataset
Figure 3. Varied budget analysis of Adaptive BCHB and BCHB on the Boston Housing dataset

Algorithm 1 (fragment). Input: R, M_0, H, S, T, η. Output: optimal hyperparameter configuration. Initialize HT ← ∅ and s_max = log_η(R); for s ∈ {s_max, s_max − 1, ..., 0}: if s = s_max then HP ← randomly sample n configurations from H, else HP ← get_hyperparameter_configuration(n).

In the acquisition function, the integral of f(x|µ, σ) dx is the cumulative distribution function of the normal distribution f(x|µ, σ), y* is the current best observation, and [x]+ is defined as max(0, x).

Table 1. Models, datasets, and tuners used in the experiments

Table 2. Model hyperparameters and their search space

Table 2 lists the machine learning tasks, models, selected hyperparameters, and their domains. All of the hyperparameters given in Table 2 for each experiment are included in the hyperparameter search space H.

Table 3. Accuracy and computational time for the classification task using the MNIST dataset

Table 4. Accuracy and computational time for the classification task using the CIFAR-10 dataset

Table 5. Error and computational time for the regression task using the Boston Housing dataset