This post continues my quixotic quest, begun in Comparing Global Optimization Packages, to locate the best, practical derivative-free global optimizer packages that can be easily used from Python. I wrote a small package called HumpDay to make this less painful. It provides a simple, unified calling syntax into some - but by no means all - functionality from a dozen of the most popular Python global optimization packages:
best_val, best_x = strategy( objective, n_trials, n_dim )
The sister repository optimizer-elo-ratings uses this to assess performance. That isn't an easy task but HumpDay (camel case, of course) contains a growing list of challenges that I hope become more representative over time. For now, it uses the classic functions you are familiar with such as griewank, rastrigin, and their ilk. See objectives/classic.py for the full list.
In this post, I update recommendations and explain a new Elo rating methodology for optimizers. Perhaps more importantly, you can use the humpday/comparison utilities (or just open this notebook in colab) and the simplicity of the common syntax, to quickly do your own bespoke evaluations of optimizers from different packages and, thereby, arrive at one or more likely to serve your purposes, not mine.
There is an important limitation at present: these comparisons are probably only good for continuous parameter spaces. Of course you can map anything to the hypercube, but the optimizers won't know that and you'll likely waste function evaluations. A more subtle drawback is that if your domain is continuous but long and skinny, then there may be some distortion when you transform the domain to the cube.
One has quite a choice these days, as the table below illustrates. No doubt it is incomplete. And not all popular optimization packages are included in HumpDay at present, or have Elo ratings assigned.
| Package name (PyPI) | Downloads | Elo rating? | Notes |
|---|---|---|---|
| scipy | 20,516,412* | Elo | |
| scikit-optimize | 562,604 | Elo | |
| hyperopt | 525,297 | Elo reviewed | |
| cmaes | 312,860 | Elo reviewed | under the optuna hood |
| ray | 246,120* | | ray.tune |
| gpyopt | 240,550 | | |
| ortools | 173,678* | | |
| optuna | 166,445 | Elo reviewed | |
| bayesian-optimization | 78,512 | Elo notebook | |
| dlib | 77,472* | Elo notebook | blog paper post |
| cma | 52,000 | | |
| ax-platform | 22,144* | Elo | |
| nlopt | 20,359 | Elo notebook | docs (be wary of uncaught C++ exceptions :-) |
| swarmpackagepy | 16,316 | | |
| hpbandster | 13,174 | | Implements bohb |
| rbfopt | 12,634 | | |
| gekko | 11,680 | | mixed int |
| zoopt | 11,433 | | paper |
| rbfopt | 11,096 | | IBM |
| simanneal | 10,549 | | |
| optlang | 8,564 | | |
| nevergrad | 8,486 | Elo | |
| spotpy | 7,477 | | |
| pyswarms | 7,440 | | |
| py-bobyqa | 5,328 | | paper NAG (also dfols for LS) |
| hebo | 5,300 | | |
| shgo | n/a | Elo reviewed | Part of scipy now |
| pymoo | 3,241 | Elo reviewed | multi |
| pySOT | 3,094 | Elo reviewed | |
| platypus-opt | 2,823 | Elo reviewed | |
| bolib | 2,666 | | |
| swarmlib | 2,130 | Elo | toy |
| auptimizer | 2,225 | | orchestration |
| bayesianoptimization | 1,921 | | |
| gaft | 1,725 | | genetic |
| ipopt | 1,706 | | COIN-OR |
| mealpy | 1,615 | | meta |
| mystic | 1,533 | | docs |
| niapy | 1,507 | | |
| proxmin | 1,203 | | paper coordinate-wise cvx |
| bayeso | 986 | | |
| hebo | 819 | | bbo 1st place. original in github |
| yabox | 623 | | "Yet another..." |
| dfogn | 611 | | NAG paper |
| pypesto | 523 | | param est |
| mipego | 521 | | mixed integer |
| opentuner | 482 | | site |
| pyriad | 396 | | pytorch |
| pygpgo | 390 | | |
| oasis-optimization | 314 | | docs |
| dfoalgos | 295 | | rank-based |
| heuristic-optimization | 283 | | swarm |
| pydogs | 244 | | |
| ultraopt | 222 | Elo notebook | |
| pysmac | 221 | | |
| arm-mango | 195 | | slides |
| pyopus | | | |
| ssb-optimize | 100 | | |
| pdfo | 92 | | Powell |
| tgo | 88 | | Supplanted by shgo, which is reviewed |
| solidpy | 50 | | |

The following are not on PyPI:

| Package | Notes |
|---|---|
| bayesianevolution | |
| rapids-ai-BBO-2nd-place-solution | NVIDIA |
| pagmo | algo list |
| spearmint | Murky license. Only github |
| simple | Simplex domain only |
| turbo_bbo_neurips_2020 | |
As with the other lists here, from which this was copied, a download count carries an asterisk if the optimizer is part of a larger general-purpose package. Download stats are dubious and easily manipulated. Please file an issue if you would like more packages included.
The pain of choosing an optimizer centers on the fact that between scipy.optimize, dlib, ax-platform, hyperopt, nevergrad, optuna, bayesopt, platypus, pymoo, pySOT and skopt (and more by the time you read this) there is barely a convention in common. I was about to say that skopt is the exception because it adopts the scipy conventions - now there's a novel idea - but actually it doesn't quite. Aside from minor naming differences, it can be fooled into optimizing on the corners instead of the continuum if you are not careful, when you send it precisely the same bounds that work perfectly in scipy.optimize, as the sketch below illustrates.
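To make that concrete, here is one way this trap can spring - a minimal sketch with an illustrative objective, not code from HumpDay. scikit-optimize infers dimension types from the bounds you pass, so integer endpoints that scipy.optimize happily treats as a continuous box become integer-valued dimensions (i.e. just the corners) unless you spell them out as floats:

from scipy.optimize import minimize
from skopt import gp_minimize

def banana(x):
    # illustrative smooth objective with an interior minimum
    return (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2

bounds = [(0, 1), (0, 1)]                                  # integer endpoints
res_scipy = minimize(banana, x0=[0.5, 0.5], bounds=bounds)  # continuous box, as expected
res_skopt = gp_minimize(banana, dimensions=bounds, n_calls=20)  # inferred as Integer dims: corners only
res_skopt_ok = gp_minimize(banana, dimensions=[(0.0, 1.0), (0.0, 1.0)], n_calls=20)  # what you probably meant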
That's a trite example but if you read into the HumpDay package, you'll be stunned at how many ways there are to set up an optimization problem in code. Compare optuna's style here to, say, pymoo and you'll see what I mean - they are both quite marked departures from the functional scipy style, and yet depart in quite different ways. That isn't a bad thing, as innovation in the design of optimizer interfaces is important - Optuna's is rather elegant I think - but it does make performance comparisons awkward. I strongly suspect that most people don't compare optimizers, but just pick a package based on hearsay and stick with it.
Unfortunately, if you read my previous post Comparing Python Global Optimizers, you'll know that a Google search can be a terrible guide. Hyperopt is downloaded more than twice as often as the nearest competitor, but I'm not sure it even makes my top ten. On the flip side, the pySOT surrogate optimization package is a quiet achiever, ranking a lowly fourteenth by download popularity (and too humble to assign itself a logo). I suspect that's because of the ceremony involved and run time, but if your objective is truly expensive, maybe try "dycors" or "srbf".
The HumpDay package also makes amends for some notable omissions in my initial look, Comparing Python Global Optimizers (post). I somehow left out two optimizers provided by Facebook - ax-platform and nevergrad. Sorry folks! You'll also find consistent signatures for scikit-optimize (a.k.a. skopt), ultraopt, dlib and bayesian-optimization that didn't appear previously.
Obviously, I got fed up and decided to cut the Gordian knot, or at least one part of the knot: continuous domains. So in HumpDay you will find more than fifty different optimization strategies spanning the usual suspects inspired by nature, Bayesian surrogate methods, homology, Parzen and hybrid approaches (not to mention, as they say on the radio, all the hits from the sixties, seventies and eighties - pattern search ain't broken).
Implementations are drawn from reasonably popular Python optimization packages, and within each, I've tried to span a reasonable set of choices. Each is rendered as a simple function, even simpler than the scipy signature:
best_val, best_x = strategy( objective, n_trials, n_dim )
I was minimalist here, so objective is a function taking the hypercube as its domain. At least for me there isn't a huge loss of generality, since the domain can usually be mapped to a hypercube in some fashion. This exercise has only one real hazard: the potential to trigger long-buried memories of Complex Analysis exams - though obviously the mapping from one domain to another need not preserve angles! Indeed, the case has been made that for Lipschitz/Hölder functions one doesn't even need to preserve dimension. See Lera and Sergeyev (pdf) for one example - though this is a whole sub-field.
I hope that as far as your views on optimizers go, the standardization provided by the HumpDay package makes it operationally trivial for you to form a weak prior at least, as to what might be useful. Suppose you have an objective function that captures the essence of many similar tasks you face, or a collection of the same. To run one optimizer:
from humpday.optimizers.nevergrad import nevergrad_ngopt_cube
from humpday.objectives.classic import salomon_on_cube
best_val, best_x, feval_count = nevergrad_ngopt_cube(salomon_on_cube, n_trials=100, n_dim=3, with_count=True)
You can then switch seamlessly from nevergrad to ax-platform to dlib or whatever you can find in humpday/optimizers. Or, to test all the strategies from one package against some pre-canned objective functions, you can write something like:
from humpday.optimizers.pysotcube import PYSOT_OPTIMIZERS
from humpday import CLASSIC_OBJECTIVES
for objective in CLASSIC_OBJECTIVES:
    for optimizer in PYSOT_OPTIMIZERS:
        print((optimizer.__name__, optimizer(objective, n_trials=100, n_dim=3, with_count=True)))
Similarly one can, given enough time, test all strategies from all optimization packages:
from humpday import OPTIMIZERS
from humpday import CLASSIC_OBJECTIVES
for objective in CLASSIC_OBJECTIVES:
    for optimizer in OPTIMIZERS:
        print((optimizer.__name__, optimizer(objective, n_trials=100, n_dim=3, with_count=True)))
Obviously, you could replace CLASSIC_OBJECTIVES with your own list of objective functions. To emphasize, an objective is a function taking just one argument - a single list or vector describing a point in the hypercube [0,1]^n. I dare say functools.partial may be your friend, and there are a few trivial transforms provided to squish domains as you need to.
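For instance, here is one minimal way to squish a box-constrained objective onto the unit cube with functools.partial - a sketch with made-up names, not HumpDay's own transforms:

from functools import partial

def on_cube(u, lows, highs, f):
    # u lives in [0,1]^n; rescale it to the original box before evaluating f
    x = [lo + ui * (hi - lo) for ui, lo, hi in zip(u, lows, highs)]
    return f(x)

def my_box_objective(x):
    # an illustrative objective defined on [-5,5] x [-5,5]
    return (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2

my_cube_objective = partial(on_cube, lows=[-5, -5], highs=[5, 5], f=my_box_objective)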
Though your bespoke task is what matters, it is probably useful to glean some overall impressions about which optimizers are performing well on standardized tasks. The Elo ratings, presented with no expense spared at optimizer-elo-ratings, are an attempt to get around the assessment problem caused by "naughty" optimizers - those who insist on performing more objective function evaluations than they are instructed to use. There are lots of caveats and many Elo ratings are yet to settle as I write, but that said, the following packages and strategies seem to be performing well.
| Package | Examples | Form Guide |
|---|---|---|
| dlib | dlib | The algorithm find_min_global is fast, light and ingenious. It estimates the Lipschitz constant and constructs linear upper bounds, then samples the upper envelope function. If the degree to which your objective function varies across space isn't too great, and you are in dim<25, this is one way to get a free lunch. You will kick yourself for not inventing this. Strong contender. |
| bayesian-optimization | poi | A nice, easy-to-use library. Based on overall performance, maximizing the probability of improvement seems to work better than expected improvement or upper confidence bounds, for reasons I don't understand. Don't leave out. |
| nlopt | directr | The directr label is my (possibly confusing) shorthand for nlopt.GN_DIRECT_L_RAND, a randomized version of the locally biased DIRECT method by Gablonsky and Kelley (pdf). Hard to beat! |
| pySOT | dycors; srbf | Implements Regis and Shoemaker's DYCORS method using radial basis surrogates (paper), and stochastic radial basis function search (paper). Compared to dlib it is very heavy on computation, but if you are drilling holes in the ground that isn't going to be an issue. Definitely include in your trifectas. |
| skopt | default | skopt.gp_minimize. Like pySOT this uses surrogates, and is very good, but by necessity computationally heavy. A very popular library, somewhat underrepresented in the current Elo ratings due to the paucity of my wrapper. Expect good things. |
| nevergrad | ngopt8; ngopt4 | A product of Facebook research, ngopt8 selects between optimizers, guided by benchmarking that is probably better than mine. See Algorithm 1, page 3 of the paper Black-Box Optimization Revisited by Meunier et al. I'm reliably informed that a new meta-optimizer, ngopt12, is on the way. However, as is, the latest model is topping some of the leaderboards already. Don't lay this horse. |
| shgo | nelder | As discussed in Comparing Global Optimizers, SHGO is now one of the scipy global searches. It uses techniques from simplicial homology to avoid wasting samples, and seems to work very well on relatively low dimensional problems (say <=10) using either Nelder-Mead or Powell for local search. Each-way chance. |
| pymoo | pattern | Hooke and Jeeves pattern search (package notes, paper) from 1960 is still pretty good! Don't overlook this, or Powell for that matter, depending on your problem. |
| ax-platform | default | Another effort from Facebook. The default global optimizer is part of a more extensive suite of tools, and I'm reliably informed that some integration with nevergrad is likely (notwithstanding this issue). Expect big things. |
| optuna / cmaes | cmaes | Optuna offers a flexible design and is well supported. It calls down to cmaes in this particular instance (CMA-ES stands for Covariance Matrix Adaptation Evolution Strategy). See that repository for explanations and a list of references. The underlying package is maintained by Masahiro Nomura, who has an interesting paper on warm-starting CMA-ES (pdf). Likely to improve. |
For readers of my previous post, which tried to get at optimizer performance in a different way, there are some familiar characters here and no huge surprises - though I repeat the caveat that some techniques (e.g. surrogate methods) might be favored by the somewhat stylized objective functions I am using here, while other strengths (parallelism, discrete optimization, nested domains, et cetera) aren't given a chance to shine. My intent isn't to downplay the clever design features of these packages.
As more real-world objective functions are introduced, it will be interesting to see how the Elo ratings change. I refer you to the current leaderboard, as I won't be attempting to keep the recommendations here up to date manually. The new Facebook inclusions are doing pretty well - which is a relief given that I've not been entirely helpful in promoting their signature time series package (see prophet post - just trying to help here). Facebook organized an open competition last year to improve nevergrad and IOHprofiler (link), and I'm sure the optimizers will get even better over time. They also cater to a wide range of dimensions and domains.
However, at the risk of being booted from the nevergrad Facebook group my new go-to is dlib. It uses the approach developed by Malherbe and Vayatis (paper) as noted above, which is annoyingly elegant. There is a nice video on the blog and a very clear explanation. Not only is the performance extremely good, but the run time is likely to be negligible for any real world, expensive objective functions. The library won't let you run problems with more than 34 dimensions, but that seems to be the only limitation. The only drawback I can think of (for my particular domain) is constantly wishing I had come up with it myself.
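If you want to try it outside HumpDay, the call is pleasantly small. The sketch below follows the style of dlib's documented examples; the objective here is purely illustrative:

import dlib

def objective(x0, x1):
    # dlib passes the coordinates as separate scalar arguments
    return (x0 - 0.25) ** 2 + (x1 - 0.75) ** 2

x, y = dlib.find_min_global(objective,
                            [0.0, 0.0],   # lower bounds
                            [1.0, 1.0],   # upper bounds
                            80)           # number of function evaluations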
Elo ratings are flawed in the same way that Value at Risk is flawed. I don't suggest that "most likely to get the lowest value for a fixed number of function evaluations" is everyone's cost function. There are some other things to be aware of too, such as the fact that some optimizers, including SHGO, really struggle to limit themselves to a fixed number of evaluations and that can hurt them in head-to-head matches when playing Black. Whether that penalty is considered fair depends on your needs. I'll go into a little bit more detail on the methodology in a moment, to illuminate these points.
First let me mention that you can poke around in humpday/comparison for utilities that will generate recommended optimizers. These are only as good as the category Elo ratings behind them, which need more time to equilibrate, but say you want some quick advice on which optimizers are the best performing at 8 dimensional problems when given 130 function evaluations to play with. Perhaps you want to limit total computation time of the optimizer (as distinct from the objective function) to 5 minutes. Use:
suggest(n_dim=8, n_trials=130, n_seconds=300)
The category Elo ratings are in the process of being computed by dimension and function evaluation limit. I've tried to warm them up a little, but they are really intended to be a free byproduct of testing. Examples:
|  | Maximum of 130 evaluations | Maximum of 210 evaluations |
|---|---|---|
| Dimension 5 | leaderboards/classic_d05_n130 | leaderboards/classic_d05_n210 |
| Dimension 8 | leaderboards/classic_d08_n130 | leaderboards/classic_d08_n210 |
Go up one directory to find more, here. For some other time saving utilities I'd refer you to the HumpDay README.md as that is likely to be better maintained than this blog article. For instance:
import math
import time
from humpday import recommend

def my_objective(u):
    time.sleep(0.01)
    return u[0] * math.sin(u[1])

recommendations = recommend(my_objective, n_dim=21, n_trials=130)
As this particular objective function is very fast, some optimizers will be culled from the list. There's no point spending a long time updating a surrogate model if you can simply sample the objective many more times instead.
Hopefully this speeds your search or encourages you to try some packages you would otherwise overlook. I suppose, in addition, you could continue to use HumpDay after you find your favorite optimizer, though that comes with some pip dependency baggage and serious limitations, so I don't really recommend it. The intent is modest here: to provide you with a quick shortlisting of Python optimization packages that might be suitable for your bespoke task, and some templates to get you over the hump (groan) of moving from one syntax to another. At minimum it might provide you with some assurance that in continuing with the one you know and like, you're not missing much.
If you are familiar with Elo ratings for chess, and I'm guessing that many of you are, then the only question you will ask is, "How are matches arranged between optimizers?". That is laid out in the code, and you can even watch games occurring (I'm kidding ... I think) or inspect the history. But here's the English version.
Sometimes a match will be declared void because one optimizer fails to constrain itself appropriately: if the White player exceeds n_trials repeatedly, or if the Black player exceeds the white_feval_count by more than 10%, despite strong encouragement. A match is also void if either algorithm crashes. There is no other penalty, at present.
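In rough Python, the void rule just described amounts to something like this - a simplified sketch that ignores the retry logic, not the optimizer-elo-ratings repository's actual code:

def match_is_void(white_feval_count, black_feval_count, n_trials, crashed=False):
    # Void if either algorithm crashed, if White blew through its budget,
    # or if Black used more than 10% more evaluations than White did.
    if crashed:
        return True
    if white_feval_count > n_trials:
        return True
    if black_feval_count > 1.10 * white_feval_count:
        return True
    return False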
The rest is just standard Elo updating, but I'll save you from re-watching The Social Network if you are rusty on that topic (is it mildly ironic that we are judging Facebook algorithms with the algorithm that catalyzed The Facebook?). The Elo system assumes that if Black's prior rating exceeds White's rating by a difference of \(d\), then White can expect to score \begin{equation} \bar{p} = \frac{1}{1 + 10^{d/F}} \end{equation} points, where a win counts as \(1.0\), a draw \(0.5\) and a loss \(0.0\), and \(F\) is a constant, called the F-factor. Suppose White actually scores \(p\) points, leading to a differential of \begin{equation} \delta = p - \bar{p} \end{equation} for White and \(-\delta\) for Black. Both players' ratings are then updated by an amount equal to this performance differential multiplied by the so-called K-factor. At the time of writing, the following F-factors and K-factors are used.
|  | Optimizers | c.f. Chess |
|---|---|---|
| K-factor | 60 or 30 | 25, 16 or 10 |
| F-factor | 1000 | 400 |
We take the liberty of using a larger F-factor than in chess. The rationale is that a single optimization is somewhere between an entire game of chess and a hand of poker.
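In code, the update just described amounts to the following - a minimal sketch of the arithmetic using the F-factor of 1000 and a K-factor of 60 from the table, not the rating repository's actual code:

def expected_white_score(white_elo, black_elo, f_factor=1000):
    # Expected points for White when Black's rating exceeds White's by d
    d = black_elo - white_elo
    return 1.0 / (1.0 + 10.0 ** (d / f_factor))

def update_elos(white_elo, black_elo, white_score, k_factor=60, f_factor=1000):
    # white_score: 1.0 for a White win, 0.5 for a draw, 0.0 for a loss
    delta = white_score - expected_white_score(white_elo, black_elo, f_factor)
    return white_elo + k_factor * delta, black_elo - k_factor * delta

# e.g. a 1700-rated optimizer playing White beats a 1600-rated opponent:
new_white, new_black = update_elos(1700, 1600, white_score=1.0)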
Unlike chess Elo ratings, the optimizer Elo scale is anchored. The random optimizer's rating is reset to 1600 after every game. For the avoidance of doubt, the random optimizer is the optimizer that chooses uniformly on the hypercube n_trials times, returning the lowest value found.
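For concreteness, here is what that reference strategy looks like in the common signature used throughout - a sketch, not HumpDay's exact implementation:

import random

def random_optimizer(objective, n_trials, n_dim):
    # Draw uniformly on [0,1]^n_dim and keep the best value seen
    best_val, best_x = float('inf'), None
    for _ in range(n_trials):
        x = [random.random() for _ in range(n_dim)]
        val = objective(x)
        if val < best_val:
            best_val, best_x = val, x
    return best_val, best_x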
This anchoring presumably limits ratings drift over time, and also makes it relatively easy to interpret an optimizer's rating in an absolute sense. For example, using an F-factor of 1000 we can read off the expected points gained in a game when an optimizer is compared to random search. If draws are considered rare, this is essentially the probability that an optimizer will help, rather than harm, when you employ it.
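For example, an optimizer rated 1900 facing the 1600-rated random optimizer has \(d = 1600 - 1900 = -300\), so its expected score is \begin{equation} \bar{p} = \frac{1}{1 + 10^{-300/1000}} \approx 0.67, \end{equation} which is the 67% entry in the table below.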
As also noted in the previous post, that particular measure may not be the best statistic. For example, you might be interested in the average time taken by an optimizer to perform reasonably well, as we were in that article. Or you might be interested in some tail weighted metric that penalizes cases where it never finds the minimum. Such metrics might lead you more towards optimizers that try to find all minima, whereas Elo ratings might not always correlate.
But with those caveats and more, here's an interpretation. The probability that an optimizer performs better than random search, assuming no draws, is as follows:
| Elo | Beat Random | Elo | Beat Random | Elo | Beat Random |
|---|---|---|---|---|---|
| 1300 | 33% | 1600 | 50% | 1900 | 67% |
| 1400 | 38% | 1700 | 56% | 2000 | 72% |
| 1500 | 44% | 1800 | 61% | 2100 | 76% |
Now, as computed, the Elo rating is more a function of matches against other optimizers than against the random optimizer, so take it with a grain of salt. But notice that the very best optimization strategies beat random search roughly three times as often as they lose to it. That might be significant, depending on the economic cost of your search.
Random search isn't dreadful, and it is well appreciated that randomly choosing points from the space to be evaluated is superior to grid search, something we see all too often (see the paper by Bergstra and Bengio for a theoretical treatment). On the other hand, random search is certainly easy to implement, so one might reasonably be concerned if an optimizer is assigned a rating below 1600 - as seems to be the case for the "portfolio" algorithm in the ax-platform at present, and for several marsupials in the platypus package.
But again, there is a lot of noise in the ratings (it is an exercise for the reader to estimate how much) and, as new objective functions better suited to evolutionary algorithms are introduced, relative assessments will probably change. The reader may also have noticed another reason why some optimizers might lose out to random search on the Elo scale: if they are not good at controlling their function evaluation budget, the head-to-head procedure outlined above may result in them using fewer function evaluations than their opponent - at least when they play Black.
There are theoretical flaws with the Elo system and inaccuracies in its underpinning assumptions (logarithmic odds ratios). Yet the chess world carries on and so shall we. I hope that the relative differences in Elo ratings of optimizers can provide some rough guidance, allowing you to weigh possible performance differences against a myriad of other considerations: design, ease of use, responsiveness to GitHub issues, generality of the search domain, ease of modification, and so forth.
Here's an oddly familiar table relating differences in optimizer Elo ratings to the likelihood that one optimizer will find a lower minimum than the other.
| Elo Diff | Probability | Elo Diff | Probability | Elo Diff | Probability |
|---|---|---|---|---|---|
| -300 | 33% | 0 | 50% | 300 | 67% |
| -200 | 38% | 100 | 56% | 400 | 72% |
| -100 | 44% | 200 | 61% | 500 | 76% |
I would not dismiss relatively small differences as economically insignificant if each evaluation corresponds to a lengthy simulation (or worse, some physical operation). However, I think this might also grant some of you the liberty to choose your package based on non-performance criteria in many instances and sleep well - say if the rating differential amounts to what might be won or lost in a single game (+/- 30 points initially, then +/- 15 later on) or two.
This exercise has convinced me further that download statistics and other measures of popularity are dreadful regressors for performance. Hopefully, more optimizers will be assigned Elo ratings in the future and again, if you have a suggestion for these lists, please file an issue or, better yet, a pull request.
It should be easy to make helpful PRs. Interesting objective functions on the hypercube are also in demand, and you can shove them here. Incidentally, there are a couple of related projects that might pique your interest too.
Anyway, I hope HumpDay saves you some time, simple as it is. To ensure articles like this are in your thread consider following microprediction on LinkedIn. We're trying to make powerful bespoke AI free to the world, and you are welcome to use it or contribute in large or small ways.