In this post, which now has a sequel, we examine the performance of some popular open source optimization libraries that are intended for use in a derivative-free manner. We compare their ability to find minima of analytic functions with known properties. After forming some tentative views and shortlisting, we turn to their performance on log-likelihood functions arising in the estimation of parameters for a time series filter.
This exercise was intended to inform, in some small way, the choice of parameter optimization methods for time series algorithms. In order to address noisy, skewed, intermittent, lurching data that we see in real live instrumented processes, it is often necessary to use models whose parameters or hyper-parameters are analytically inconvenient (e.g. no derivatives). Sometimes automatic differentiation is possible, and that will be considered in a future post, but for now we assume that isn't the case.
Like any such analysis, this one is heavily influenced by the choice of problems tackled and we would encourage the reader to perform their own analysis in the domain of their choosing, before coming to any strong opinions. There are also many ways to modify the behavior of any optimizer by choice of arguments. I have had some back and forth with the authors of these packages and will seek to improve this study over time.
We used objective functions collated in the tuneup Python package. In turn, some of those objective functions are taken from the DEAP package, and comprise analytic functions suggested by various researchers (see the DEAP website for attributions). These functions have the advantage of fast computation. They are tabulated below.
These formulas have been specialized to three dimensions for clarity, though it should be clear from their definitions that they generalize.
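To give a flavor of these objectives, here are three-dimensional versions of two of them in plain Python. The exact definitions used by tuneup and DEAP may differ in scaling, offsets or domain, so treat these as sketches only.

```python
import numpy as np

def rastrigin(x):
    # Many regularly spaced local minima; the global minimum is 0 at the origin.
    x = np.asarray(x)
    return 10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

def schwefel(x):
    # Deep local minima far from the global minimum near 420.97 in each coordinate.
    x = np.asarray(x)
    return 418.9829 * len(x) - np.sum(x * np.sin(np.sqrt(np.abs(x))))

print(rastrigin([0.0, 0.0, 0.0]))    # 0.0
print(schwefel([420.9687] * 3))      # approximately 0.0
```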
A number of objective functions in the tuneup package are not, however, analytic functions of this sort. Instead they are negative log-likelihood functions arising in the estimation of model parameters for time series models. We will come to them in due course after shortlisting, because their computation time is significant.
We only included optimizers designed to work on non-differentiable functions. We also only included packages deployed on PyPI that installed, and whose "hello world" examples worked. Where that wasn't the case, we created an issue so as to increase the chances of including the package the next time around. We apologize for any omissions.
| Optimizer | Specific description | General description |
|---|---|---|
| Facebook Ax | Bayesian optimization | Ax is an accessible, general-purpose platform for understanding, managing, deploying, and automating adaptive experiments. |
| Optuna | Tree-structured Parzen Estimator | Automated search for optimal hyperparameters using Python conditionals, loops, and syntax. |
| Hyperopt | Tree-structured Parzen Estimator | Python library for serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions. |
| Scipy | The simplicial homology global optimization technique; Powell's conjugate direction method. | Provides several commonly used optimization algorithms. |
| PySOT | Surrogate optimization | Asynchronous parallel optimization toolbox for computationally expensive global optimization problems. |
| PyMoo | Differential evolution, BRKGA, Nelder-Mead et al. | State of the art single- and multi-objective optimization algorithms, and many more features related to multi-objective optimization, such as visualization and decision making. |
| Platypus | Genetic, evolutionary | Framework for evolutionary computing in Python with a focus on multiobjective evolutionary algorithms. |
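For orientation, here is roughly how an evaluation budget is supplied to two of the packages above. Argument names reflect the versions available at the time of writing, and the rastrigin function is just a stand-in objective, so treat this as a sketch rather than a recipe.

```python
import numpy as np
import optuna
from hyperopt import fmin, tpe, hp

def rastrigin(x):
    x = np.asarray(x)
    return float(10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x)))

# Hyperopt's TPE: max_evals caps the number of trials, one objective call per trial.
space = [hp.uniform('x%d' % i, -5, 5) for i in range(3)]
best = fmin(fn=rastrigin, space=space, algo=tpe.suggest, max_evals=100)

# Optuna's TPE: n_trials plays the same role.
def objective(trial):
    return rastrigin([trial.suggest_float('x%d' % i, -5, 5) for i in range(3)])

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)
print(best, study.best_value)
```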
The first part of the analysis attempted to get a handle on whether one optimizer was likely to result in significant computation savings versus another (assuming that objective function evaluation is the dominant cost). To this end, we set about converting the best function value found (i.e. the minimum) into a simple score intended to be indicative of logarithmic computation time.
In the first phase, we ran each optimizer until a pre-set number of function evaluations (n) was reached. We varied n and computed the mean minimum found across all "good" optimizers (more on that in a moment). An example is shown for the Schwefel function.
| Maximum Evaluations | Mean Minimum Found | Maximum Evaluations | Mean Minimum Found |
|---|---|---|---|
| n=1 | 990.91 | n=32 | 378.2 |
| n=2 | 790.5 | n=64 | 298.2 |
| n=4 | 593.0 | n=128 | 250.5 |
| n=8 | 542.9 | n=256 | 136.4 |
| n=16 | 408.9 | n=512 | 134.6 |
As an example of applying this scale, suppose that one optimization run for Scipy.Powell (say) results in an objective function value of 203.1. We would assign a score of 8, since it is better than the 250.5 recorded for n=128 (and, as it happens, all smaller n), but not better than 136.4, which is the average minimum found across all optimizers given a limit of 256 evaluations. The first phase was intended to create a scale of performance that is easily related to time.
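For concreteness, here is a minimal sketch of that scoring rule using the Schwefel thresholds tabulated above; the function and constant names are ours, not the tuneup package's.

```python
# Mean minima found across the "good" optimizers for the Schwefel function,
# read off the table above, for budgets n = 1, 2, 4, ..., 512.
SCHWEFEL_THRESHOLDS = [990.91, 790.5, 593.0, 542.9, 408.9,
                       378.2, 298.2, 250.5, 136.4, 134.6]

def score(best_value_found, thresholds=SCHWEFEL_THRESHOLDS):
    # The score is the number of budget levels whose mean minimum the candidate
    # value improves upon, so it grows roughly with the logarithm of the budget.
    return sum(best_value_found < t for t in thresholds)

print(score(203.1))  # 8, matching the Scipy.Powell example above
```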
A difficulty arose because, for some packages, the argument provided to the optimizer intended to limit the number of function evaluations is treated only as a rough guide. Indeed, by auditing the number of actual function evaluations, we discovered that this limit was often exceeded - a fact that might easily lead one to believe those optimizers were superior. To partially get around this problem, we introduced a backoff in the parameter said to govern the number of function evaluations. If the audited count exceeded n by more than 10%, we reduced the parameter by 10% and repeated. In this manner, we attempted to treat all algorithms on a roughly equal footing. However this only addressed the issue if the number of evaluations was close to the user setting. This did not solve the problem in some important cases, and we will return to this issue shortly.
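The backoff itself is simple. Below is a minimal sketch, assuming a hypothetical helper run_optimizer(limit) that runs a given optimizer with its evaluation-limit argument set to limit and returns the audited number of objective calls actually made.

```python
# Minimal sketch of the 10% backoff described above.  run_optimizer(limit) is a
# hypothetical helper: it runs the optimizer with its evaluation-limit argument
# set to `limit` and returns the audited number of objective calls actually made.

def calibrate_limit(run_optimizer, n, tol=0.10, max_rounds=20):
    limit = n
    for _ in range(max_rounds):
        audited = run_optimizer(limit)
        if audited <= n * (1 + tol):
            break                        # within 10% of the intended budget
        limit = int(limit * (1 - tol))   # reduce the requested limit by 10% and retry
    return limit
```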
A second comment is that the scoring scale, and the analysis itself, is not easy to interpret if optimizers have vastly differing abilities. For that reason, some algorithms were eliminated, and we shall return to those decisions in due course also.
Having established a scale of difficulty for each problem, we then fixed the number of function evaluations (using the same backoff procedure as above, where necessary). We ran each optimizer many times and reported the mean and standard deviation of the scores.
We tested the optimizers on objective functions with N=3 variables at first, then increased to N=6 and N=20. It probably goes without saying that some approaches to global optimization are better placed than others to tackle the increase in dimension; others may start to grind and chew through a lot of memory.
As an example, we compared optuna, hyperopt, pysot and pattern (from PyMoo) in six dimensions, where 225 function evaluations were permitted. The performance scale used n=64, 90, 128, 181, 256, 362, 512, 724, 1024 and 1448 function evaluations. Consecutive budgets differ by a factor of roughly the square root of 2, so a difference of 1 in the table below indicates that the minimum found was better, or worse, by an amount commensurate with roughly a 40% increase in computation time (higher scores are better). As you can see, the differences are indeed large from a practical perspective.
| | optuna | hyperopt | pysot |
|---|---|---|---|
| schaffer | 5.5+/-2.4 | 1.1+/-1.1 | 6.7+/-2.5 |
| bohachevsky | 6.0+/-1.8 | 2.0+/-1.5 | 10.0+/-0.0 |
| griewank | 5.8+/-1.9 | 1.7+/-1.2 | 9.9+/-0.2 |
| rastrigin | 5.4+/-2.4 | 1.6+/-1.1 | 7.9+/-2.8 |
| rastrigin (scaled) | 6.8+/-2.3 | 1.5+/-1.1 | 7.1+/-1.9 |
| rastrigin (skew) | 3.6+/-2.1 | 5.2+/-0.7 | 7.0+/-1.4 |
| schwefel | 7.4+/-2.4 | 0.6+/-1.3 | 5.0+/-2.7 |
These preliminary results suggested that PySOT, and perhaps surrogate optimization in general, was well suited to the analytic functions - not altogether surprising. This suggests that considerable computation time might be saved by using it over, say, hyperopt, whose performance was concerning.
In this discussion, a naughty optimizer is one that does not limit function evaluations to precisely what the user might desire. As an aside, this isn't a critique. Often the optimizers cycle through parameters, or perform some other subroutine that involves multiple function evaluations - so they only stop periodically to see whether the user-supplied maximum has been exceeded. It may not be sensible to do otherwise, depending on the details of the algorithm.
We tested all optimizers to determine whether this kind of behavior was present, and whether it warranted special treatment or merely a small hack. While it would be tempting to disqualify them, we instead attempted what might be considered a slightly one-sided analysis. To evaluate these optimizers, we ran them first and recorded an audited number of function evaluations (that is to say, the actual number of times the objective function was called, using our own monitoring). This audited count was then used as the budget when running their better-behaved brethren. This way, all optimizers in the comparisons to follow use the same total number of function evaluations.
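The monitoring itself is nothing fancy. Below is a minimal sketch of the kind of call-counting wrapper we have in mind; the class name and usage are illustrative rather than lifted from the tuneup package.

```python
# A minimal call-counting wrapper of the kind used to audit the actual
# number of objective function evaluations an optimizer performs.

class CountingObjective:
    def __init__(self, objective):
        self.objective = objective
        self.nfev = 0                # audited number of calls
        self.best = float('inf')     # best (lowest) value seen so far

    def __call__(self, x):
        self.nfev += 1
        value = self.objective(x)
        self.best = min(self.best, value)
        return value

# Usage: wrap the objective, hand the wrapper to the "naughty" optimizer,
# then read wrapper.nfev and reuse it as the budget for the well-behaved ones.
```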
For example, when Scipy's SHGO was told to use 100 function evaluations, it actually performed an average of 315 (plus or minus a standard deviation of 200 across problems). Nonetheless, as noted, all solvers in the table were given the same number of function evaluations during each pass (however many SHGO chose to use). The table below should be treated with care, as it isn't comparing the optimizers on a truly equal footing, but rather with a choice of the number of function evaluations that suits SHGO.
Lower is better ...
| | pysot | hyperopt | shgo | optuna |
|---|---|---|---|---|
| rastrigin (scaled) | 32.5+/-8.6 | 85.3+/-20.6 | 0.0+/-0.0 | 34.8+/-15.5 |
| schaffer | 9.4+/-2.8 | 17.5+/-2.1 | 2.0+/-0.0 | 10.3+/-3.3 |
| schwefel | 665.7+/-203.8 | 1020.1+/-209.5 | 1263.8+/-0.0 | 387.4+/-190.9 |
| bohachevsky | 3.7+/-0.8 | 2340+/-1080 | 0.0+/-0.0 | 563.6+/-550.9 |
| rastrigin | 11.1+/-6.5 | 35.3+/-6.9 | 0.0+/-0.0 | 20.9+/-7.2 |
| griewank | 0.7+/-0.1 | 11.0+/-4.7 | 0.0+/-0.0 | 3.8+/-2.5 |
| rastrigin (skew) | 29.2+/-9.8 | 52.2+/-8.6 | 0.0+/-0.0 | 108.3+/-90.9 |
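For reference, the sketch below shows roughly how SHGO is invoked with an evaluation target and how the audited count can be read back from the result. The option name maxfev reflects our reading of the scipy documentation, and the rastrigin function is just a stand-in objective, so treat this as a sketch.

```python
import numpy as np
from scipy.optimize import shgo

def rastrigin(x):
    x = np.asarray(x)
    return 10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

# Ask SHGO for (roughly) 100 evaluations in six dimensions; the audited count
# result.nfev routinely comes in well above the request, as discussed above.
result = shgo(rastrigin, bounds=[(-5.0, 5.0)] * 6, options={'maxfev': 100})
print(result.fun, result.nfev)
```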
Similarly, when Scipy's Powell is supplied maxfev=100, say, it will first perform individual line searches in every dimension (each of which may require many function evaluations) and move through all dimensions before checking whether the function evaluation limit has been exceeded. This makes it difficult to control time or compute for high dimensional problems. For example, the earliest possible stop occurs somewhere between 50 and 156 function evaluations when the problem is six dimensional (for 20 dimensional problems it may take close to 600). Despite this limitation, Powell's method can be very effective for some types of problems, as suggested by the table below listing the minima found. Lower is better again.
| | hyperopt | pysot | optuna | powell |
|---|---|---|---|---|
| rastrigin (skew) | 66.4+/-15.0 | 47.3+/-14.8 | 397.4+/-250.9 | 5.9+/-0.0 |
| bohachevsky | 8935+/-3477 | 106.7+/-99.7 | 4914+/-3002 | 0.0+/-0.0 |
| schwefel | 1374.5+/-207.0 | 905.3+/-250.1 | 857.4+/-257.7 | 710.6+/-0.0 |
| griewank | 36.5+/-14.7 | 1.2+/-0.3 | 21.2+/-9.8 | 0.0+/-0.0 |
| schaffer | 31.4+/-3.9 | 18.1+/-7.1 | 23.2+/-5.6 | 0.0+/-0.0 |
| rastrigin (scaled) | 171.7+/-53.4 | 59.4+/-16.4 | 108.5+/-44.1 | 0.0+/-0.0 |
| rastrigin | 52.4+/-9.0 | 41.7+/-12.4 | 40.4+/-11.9 | 0.0+/-0.0 |
Given that Powell is being allowed to run until it is happy, it is also worth noting the poorest performing optimizer in each row from among HyperOpt, PySOT and Optuna.
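If you want to observe the overshoot for yourself, the sketch below asks Scipy's Powell for 100 evaluations and inspects the audited count res.nfev; the rastrigin definition is just a stand-in objective.

```python
import numpy as np
from scipy.optimize import minimize

def rastrigin(x):
    x = np.asarray(x)
    return 10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

# Powell only checks the evaluation budget after a full sweep of line searches,
# so res.nfev can noticeably exceed the requested maxfev in higher dimensions.
res = minimize(rastrigin, x0=np.random.uniform(-5, 5, size=6),
               method='Powell', options={'maxfev': 100})
print(res.fun, res.nfev)
```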
The analysis above did not include all algorithms available to us in the mentioned libraries. Here we present some additional comparisons which dictated these choices. In particular, we only selected the "pattern" algorithm from the PyMoo library, and some reasons are provided in the data to follow. We did not use any algorithms from the Platypus library, as initial results suggested they were uniformly worse than alternatives.
The evolutionary algorithm in the Platypus library sometimes runs as many function evaluations as one requests, but often runs a few more, or roughly 100 more. We did not go to great lengths to investigate this behavior but, again, we provide a somewhat fair comparison against the more predictable optimizers. The mean and standard deviation of the minima are shown here. Lower is better.
| | hyperopt | evolutionary | pysot | optuna |
|---|---|---|---|---|
| rastrigin (scaled) | 126.8+/-37.8 | 168.5+/-65.0 | 34.8+/-11.9 | 59.7+/-17.5 |
| rastrigin | 40.0+/-6.1 | 49.6+/-11.5 | 31.5+/-18.9 | 36.5+/-8.3 |
| schaffer | 32.8+/-2.8 | 40.5+/-1.9 | 22.3+/-11.9 | 26.7+/-7.5 |
| schwefel | 1385.7+/-178.1 | 1217.6+/-210.6 | 984.1+/-277.7 | 689.3+/-241.6 |
| griewank | 24.3+/-7.9 | 44.9+/-15.1 | 0.8+/-0.2 | 8.9+/-6.0 |
| rastrigin (skew) | 59.9+/-20.5 | 111.7+/-31.1 | 39.2+/-15.0 | 190.9+/-164.6 |
| bohachevsky | 4563.1+/-1784 | 9678+/-3224 | 4.4+/-0.8 | 2048.6+/-931.5 |
At least for these examples, there seems to be no compelling reason to use Platypus' evolutionary algorithm over, say, PySOT, if you think your problems' geometry might be in any way reflective of the (somewhat artificial) analytic functions we use.
Similarly, the performance of Platypus' genetic algorithm is dominated by PySOT, and also by optuna. Lower is better. This doesn't rule out a genetic search serving a purpose elsewhere.
| | genetic | pysot | hyperopt | optuna |
|---|---|---|---|---|
| rastrigin | 54.0+/-7.0 | 21.2+/-13.4 | 41.6+/-7.6 | 31.4+/-9.5 |
| schaffer | 36.9+/-3.0 | 17.9+/-6.2 | 29.4+/-2.8 | 24.5+/-5.4 |
| griewank | 46.5+/-17.7 | 0.8+/-0.1 | 16.7+/-6.9 | 9.7+/-9.0 |
| rastrigin (skew) | 110.9+/-44.1 | 44.8+/-23.3 | 54.4+/-16.0 | 298.5+/-147.1 |
| schwefel | 1317.8+/-200.1 | 780.4+/-255.7 | 1327.3+/-246.7 | 758.2+/-141.3 |
| bohachevsky | 9083+/-5693 | 4.3+/-0.9 | 5793.0+/-2004 | 2052.8+/-1467.8 |
| rastrigin (scaled) | 196.9+/-61.7 | 47.1+/-13.7 | 121.4+/-40.5 | 67.6+/-16.8 |
PyMoo provides numerous algorithms. The library's intent is multi-objective optimization, so the single-objective results here may not be too relevant; it isn't fair to judge a multi-objective algorithm by its single-objective performance. Nonetheless, we compared the methods below, including those that reliably stopped after roughly 100 function evaluations (as an example of an exception, the brkga algorithm ran to 1000). We included PySOT as a reference given its success above.
| | nelder | nsga2 | nsga3 | pattern | ctaea | pysot |
|---|---|---|---|---|---|---|
| schaffer | 35.3+/-6 | 38.8+/-4 | 39.6+/-9 | 20.8+/-8 | 40.3+/-7 | 21.1+/-7.4 |
| rastrigin (skew) | 50.8+/-18.9 | 87.3+/-21.0 | 108.9+/-209.7 | 30.4+/-11.4 | 115.1+/-114.0 | 42.3+/-19.1 |
| bohachevsky | 402.8+/-551.4 | 11147.3+/-4408.5 | 1686.4+/-2601.0 | 66.1+/-55.4 | 1887.2+/-2632.4 | 4.4+/-0.9 |
| rastrigin (scaled) | 94.8+/-38.5 | 184.3+/-62.8 | 59.4+/-65.8 | 38.9+/-12.5 | 100.2+/-193.3 | 37.4+/-13.5 |
| schwefel | 904.7+/-288.6 | 1393.4+/-141.2 | 1034.2+/-277.0 | 695.5+/-229.8 | 996.7+/-292.7 | 768.7+/-262.1 |
| rastrigin | 32.9+/-15.8 | 51.1+/-9.0 | 38.4+/-13.4 | 24.1+/-10.4 | 34.1+/-15.5 | 25.0+/-16.5 |
| griewank | 2.3+/-1.2 | 37.3+/-12.2 | 7.9+/-8.1 | 1.2+/-0.2 | 8.7+/-9.1 | 0.8+/-0.1 |
Increasing the dimensionality didn't materially change this assessment. Pattern search was still the front-runner.
| | pattern | nelder | nsga2 | ctaea | pysot | nsga3 |
|---|---|---|---|---|---|---|
| schwefel | 743+/-295 | 898+/-352 | 1399+/-187 | 1104+/-295 | 1041+/-182 | 853+/-238 |
| rastrigin | 24.1+/-11.7 | 29.3+/-14.0 | 51.8+/-8.7 | 32.5+/-15.2 | 16.8+/-13.2 | 41.1+/-10.6 |
| rastrigin (skew) | 29.8+/-6.9 | 46.4+/-17.4 | 83.1+/-25.4 | 126+/-135 | 45.1+/-18.3 | 120+/-128 |
| bohachevsky | 101+/-109 | 177+/-134 | 9954+/-4299 | 3552+/-4811 | 4.4+/-1.1 | 3220+/-4435 |
| griewank | 1.1+/-0.2 | 2.3+/-1.3 | 41.6+/-11.3 | 9.8+/-10.7 | 0.7+/-0.2 | 8.2+/-7.0 |
| schaffer | 23.5+/-8.4 | 33.9+/-6.4 | 35.6+/-3.0 | 45.5+/-6.2 | 23.3+/-7.2 | 34.8+/-11.3 |
| rastrigin (scaled) | 38.3+/-9 | 92.3+/-35.6 | 215+/-51.7 | 93.8+/-83.4 | 40.0+/-8.5 | 151+/-340 |
Given a larger number of function evaluations, the picture is somewhat murkier and nsga3 and ctaea seemed to be more competitive.
| | nsga3 | ctaea | nsga2 | nelder | pysot | pattern |
|---|---|---|---|---|---|---|
| bohachevsky | 11.3+/-6.3 | 7.2+/-2.5 | 4526+/-1945 | 0.9+/-0.3 | 3.7+/-0.7 | 0.0+/-0.0 |
| griewank | 0.7+/-0.2 | 0.6+/-0.2 | 17.5+/-11.6 | 0.2+/-0.1 | 0.7+/-0.0 | 0.1+/-0.1 |
| rastrigin (scaled) | 13.8+/-7.3 | 8.9+/-4.2 | 121+/-23.7 | 59.6+/-34.8 | 25.7+/-8.6 | 9.6+/-8.3 |
| schwefel | 775+/-224 | 906+/-139 | 941+/-167 | 858+/-271 | 639+/-182 | 673+/-274 |
| schaffer | 34.0+/-9.9 | 33.5+/-13.8 | 31.1+/-2.2 | 33.4+/-3.3 | 11.0+/-4.7 | 18.1+/-11.9 |
| rastrigin (skew) | 12.7+/-8.6 | 12.2+/-8.5 | 72.5+/-21.5 | 36.6+/-11.6 | 26.6+/-9.3 | 21.8+/-9.2 |
| rastrigin | 17.1+/-12.4 | 13.9+/-9.9 | 38.3+/-6.0 | 29.1+/-10.7 | 8.4+/-6.6 | 23.1+/-9.4 |
When both dimension and the number of evaluations are increased, the performance of pattern seemed once again to be clearly superior to its colleagues in the PyMoo package. It was not obviously better or worse than PySOT.
| | ctaea | pysot | nelder | pattern | nsga2 | nsga3 |
|---|---|---|---|---|---|---|
| schwefel | 3377+/-496 | 3412+/-703 | 3793+/-230 | 2319+/-467 | 4637+/-100 | 2859+/-350 |
| bohachevsky | 750+/-708 | 30.2+/-5.5 | 3277+/-2514 | 28.2+/-4.1 | 51454+/-8815 | 919+/-1092 |
| griewank | 3.9+/-1.0 | 1.0+/-0.0 | 24.9+/-17.0 | 1.0+/-0.0 | 159+/-30.0 | 1.9+/-0.2 |
| rastrigin | 112+/-18.8 | 55.4+/-24.8 | 115+/-20.9 | 61.2+/-9.7 | 187+/-27.7 | 101+/-18.5 |
| rastrigin (skew) | 73.9+/-4.0 | 137+/-12.0 | 166+/-39.5 | 81.7+/-12.0 | 549+/-144 | 71.9+/-34.6 |
| rastrigin (scaled) | 156+/-142 | 163+/-41.2 | 589+/-86.1 | 65.1+/-10.1 | 854+/-100 | 143+/-55.0 |
| schaffer | 141+/-28.1 | 79.2+/-32.3 | 151+/-7.2 | 98.6+/-12.7 | 164+/-4.6 | 153+/-10.6 |
As a minor note, Facebook's ax-platform was a late scratching, as the current build on PyPI was broken. It did subsequently build directly from GitHub, but not in time for these results; it will be included soon.
To provide a different kind of challenge to the optimizers, we supplied them with a negative log-likelihood function for a model with five parameters. The parameters govern the action of a Kalman-like filter. The calculation is performed over 1000 data points in a time series. In order to generate a variety of different objective functions, we considered different combinations of three parameters to vary, while fixing two. The intent was to move away from somewhat artificially constructed analytic functions in favor of something that might arise in the wild. It must be said, however, that performance on this particular problem may not be indicative of performance elsewhere.
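To give a flavor of this kind of objective (emphatically not the tuneup implementation), the sketch below scores the one-step-ahead residuals of a toy filter with gains g1 and g2 under an exponentially modified normal (expnorm) distribution parameterized by logK, loc and logScale, fixing two of the five parameters so the optimizer sees a three-dimensional problem.

```python
import numpy as np
from scipy.stats import exponnorm

def neg_log_likelihood(params, ys, fixed):
    # A toy Kalman-like filter: level/trend updates with gains g1, g2, and
    # one-step-ahead residuals scored under an expnorm(K, loc, scale) density.
    g1, logK, loc = params                               # three free parameters
    g2, scale = fixed['g2'], np.exp(fixed['logScale'])   # two fixed parameters
    K = np.exp(logK)
    level, trend, nll = ys[0], 0.0, 0.0
    for y in ys[1:]:
        prediction = level + trend
        error = y - prediction
        nll -= exponnorm.logpdf(error, K, loc=loc, scale=scale)
        level = prediction + g1 * error
        trend = trend + g2 * error
    return nll

ys = np.cumsum(np.random.randn(1000))   # stand-in for a real time series
objective = lambda p: neg_log_likelihood(p, ys, fixed={'g2': 0.1, 'logScale': 0.0})
print(objective([0.5, 0.0, 0.0]))
```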
We minimized the negative log-likelihood for a time series model. As anticipated, SHGO typically did a very good job, although, when instructed to limit evaluations to 225, it performed an average of 860 evaluations, plus or minus 300 or so. This was the only bad egg included in this particular comparison, as hyperopt, pysot, pattern and optuna could be relied upon to spend exactly the same number of evaluations. Bearing in mind that lower is better here, SHGO and pattern seem initially to be very reliable.
| | pattern | hyperopt | pysot | optuna |
|---|---|---|---|---|
| expnorm (varying g1,logK,loc) | 1205+/-0.0 | 1208+/-2.2 | 1206+/-0.5 | 1476+/-0.6 |
| expnorm (varying g2,logK,logScale) | 1154+/-0.8 | 1205+/-66.7 | 1195+/-101 | 1561+/-1.0 |
| expnorm (varying logK,loc,logScale) | 1472+/-3.9 | 1474+/-5.0 | 1471+/-0.7 | 1477+/-2.0 |
| expnorm (varying g1,g2,loc) | 1152+/-1.5 | 1158+/-13.4 | 1168+/-24.7 | 1723+/-1.5 |
| expnorm (varying g1,loc,logScale) | 1205+/-0.0 | 1216+/-11.4 | 1205+/-0.3 | 1487+/-0.3 |
| expnorm (varying g2,loc,logScale) | 1154+/-0.1 | 1193+/-63.6 | 1188+/-111 | 1594+/-1.1 |
| expnorm (varying g1,logK,logScale) | 1205+/-0.1 | 1219+/-14.2 | 1209+/-5.0 | 1478+/-3.5 |
| expnorm (varying g2,logK,loc) | 1154+/-0.3 | 1156+/-1.7 | 1155+/-2.0 | 1560+/-0.8 |
| expnorm (varying g1,g2,logScale) | 1151+/-0.0 | 1203+/-22.1 | 1170+/-22.2 | 1596+/-1.4 |
| expnorm (varying g1,g2,logK) | 1151+/-0.0 | 1168+/-20.4 | 1170+/-23.9 | 1561+/-1.0 |
Summary
For relatively low-dimensional problems, the scipy global optimization methods are certainly worthy of consideration. We are intrigued by the performance of scipy.shgo, for example, a relatively recent addition grounded in homology. However, the method appears to consume considerable resources in higher dimensions, and the inability to control the number of function evaluations could be awkward in some applications. Similarly, Scipy's implementation of Powell's method is not to be underestimated, though it presents a similar difficulty, and PyMoo's implementation of pattern search is strong too.
That said, we make no recommendation for one software suite over another, if for no other reason than that some are intended to facilitate mixing and matching of algorithms. Performing a fastidious assessment is less well motivated in those cases, and invites a combinatorial explosion.
The reader will have their own needs. There will be cases where a search must be performed quickly, using 20 guesses say, and others where we can afford 1,000 evaluations or more. Our analysis was steered toward cases where function evaluation is particularly expensive.
The reader may prefer one package over another for functionality provided (such as multi-objective optimization, multi-fidelity search or moving peak search) or preference for one design style over another (simple function calls versus classes, method of providing constraints, and so forth). It seems unfair to judge some packages on out of the box single objective optimization when their intended contribution is broader. Analysis here can be found in the tuneup Python package. Once again I apologize for any omissions.
Update Feb 2021:
Sometime after writing this article, I finally followed up on my promise to clean up the code. You may be interested in the sequel to this article, in which the HumpDay package will supposedly make choosing optimization strategies more convenient. Contributions are more than welcome.
To ensure articles like this are in your thread consider following microprediction on LinkedIn. We're trying to make powerful bespoke AI free to the world, and you are welcome to use it or contribute in large or small ways.