16 min

HumpDay: A Package to Help You Choose a Python Global Optimizer

Published on February 13, 2021

This post continues my quixotic quest, begun in Comparing Global Optimization Packages, to locate the best, practical derivative-free global optimizer packages that can be easily used from Python. I wrote a small package called HumpDay to make this less painful. It provides a simple, unified calling syntax into some - but by no means all - functionality from a dozen of the most popular Python global optimization packages:

                           best_val, best_x = strategy( objective, n_trials, n_dim )

The sister repository optimizer-elo-ratings uses this to assess performance. That isn't an easy task but HumpDay (camel case, of course) contains a growing list of challenges that I hope become more representative over time. For now, it uses the classic functions you are familiar with such as griewank, rastrigin, and their ilk. See objectives/classic.py for the full list.  

In this post, I update recommendations and explain a new Elo rating methodology for optimizers. Perhaps more importantly, you can use the humpday/comparison utilities (or just open this notebook in colab) and the simplicity of the common syntax, to quickly do your own bespoke evaluations of optimizers from different packages and, thereby, arrive at one or more likely to serve your purposes, not mine.   

There is an important limitation at present: these comparisons are probably only good for continuous parameter spaces. Of course you can map anything to the hypercube, but the optimizers won't know that and you'll likely waste function evaluations. A more subtle drawback is that if your domain is continuous but long and skinny, then there may be some distortion when you transform the domain to the cube. 

Popular Python Optimization Packages

One has quite a choice these days, as the table below illustrates. No doubt it is incomplete. And not all popular optimization packages are included in HumpDay at present, or have Elo ratings assigned. 

Package name (PyPI)

Downloads

Elo rating?  Notes
scipy 20,516,412* Elo  
scikit-optimize 562,604 Elo  
hyperopt 525,297 Elo reviewed  
cmaes 312,860 Elo  reviewed under the optuna hood
ray 246,120*   ray.tune
gpyopt 240,550    
ortools 173,678*    
optuna 166,445 Elo reviewed  
bayesian-optimization 78,512 Elo notebook  
dlib 77,472* Elo notebook blog paper   post 
cma 52,000    
ax-platform 22,144* Elo  
nlopt 20,359 Elo notebook docs (be wary of uncaught C++ exceptions :-)
swarmpackagepy 16,316    
hpbandster 13,174   Implements bohb
rbfopt 12,634    
gekko 11,680   mixed int
zoopt 11,433   paper
rbfopt 11,096   IBM
simanneal 10,549    
optlang 8,564    
nevergrad 8,486 Elo  
spotpy 7,477    
pyswarms 7,440    
py-bobyqa 5,328   paper NAG (also dfols for LS)
hebo 5,300    
shgo n/a Elo reviewed Part of scipy now 
pymoo 3,241 Elo reviewed multi
pySOT 3,094 Elo reviewed  
platypus-opt 2823 Elo reviewed  
bolib 2,666    
swarmlib 2,130 Elo toy
auptimizer 2,225   orchestration
bayesianoptimization 1,921    
gaft 1,725   genetic
ipopt 1,706   COIN-OR
mealpy 1,615   meta
mystic 1,533   docs
niapy 1,507    
proxmin 1,203   paper coordinate-wise cvx
bayeso 986    
hebo 819   bbo 1st place. original in github
yabox 623   "Yet another..."
dfogn 611   NAG paper
pypesto 523   param est
mipego 521   mixed integer
opentuner 482   site
pyriad 396   pytorch
pygpgo 390    
oasis-optimization 314   docs
dfoalgos 295   rank-based
heuristic-optimization 283   swarm
pydogs 244    
ultraopt 222 Elo notebook  
pysmac 221    
arm-mango 195    slides
pyopus      
ssb-optimize 100    
pdfo 92   Powell
tgo 88   Supplanted by shgo which is reviewed.
solidpy 50    
  The following are not on PyPI    
bayesianevolution      
rapids-ai-BBO-2nd-place-solution     NVIDIA
pagmo     algo list
spearmint     Murky license. Only github
simple     Simplex domain only
turbo_bbo_neurips_2020      

As with the other lists here, from which this was copied, a download number carries an asterisk if it is part of a larger general purpose package. Download stats are dubious and easily manipulated. Please file an issue if you would like more included. 

A Painful Choice

That pain of choosing an optimizer centers on the fact that between scipy.optimize, dlib, ax-platform, hyperopt, nevergrad, optuna, bayesopt, platypus, pymoo, pySOT and skopt (and more by the time you read this) there is barely a convention in common. I was about to say that skopt is the exception because it adopts the scipy conventions - now there's a novel idea - but actually it doesn't quite. Aside from minor naming differences, it can be fooled into optimizing on the corners instead of a continuum, if you are not careful, when you send it precisely the same bounds that work perfectly in scipy.optimize. 

That's a trite example but if you read into the HumpDay package, you'll be stunned at how many ways there are to set up an optimization problem in code. Compare optuna's style here to, say, pymoo and you'll see what I mean - they are both quite marked departures from the functional scipy style, and yet depart in quite different ways. That isn't a bad thing, as innovation in the design of optimizer interfaces is important - Optuna's is rather elegant I think - but it does make performance comparisons awkward. I strongly suspect that most people don't compare optimizers, but just pick a package based on hearsay and stick with it. 

Unfortunately, if you read my previous post Comparing Python Global Optimizers, you'll know that a Google search can be a terrible guide. Hyperopt is downloaded more than twice as often as the nearest competitor, but I'm not sure it even makes my top ten. On the flip side, the pySOT surrogate optimization package is a quiet achiever, ranking a lowly fourteenth by download popularity (and too humble to assign itself a logo). I suspect that's because of the ceremony involved and run time, but if your objective is truly expensive, maybe try "dycors" or "srbf". 

The HumpDay package also makes amends for some notable omissions in my initial look Comparing Python Global Optimzers (post). I somehow left out two optimizers provided by Facebook - ax-platform and nevergrad. Sorry folks! You'll also find consistent signatures for scikit-optimize (a.k.a. skopt), ultraopt, dlib and bayesian-optimization that didn't appear previously.  

Canonical Signatures

Obviously, I got fed up and decided to cut the Gordian knot, or at least one part of the knot: continuous domains. So in HumpDay you will find more than fifty different optimization strategies spanning the usual suspects inspired by nature, Bayesian surrogate methods, homology, Parzen and hybrid approaches (not to mention, as they say on the radio, all the hits from the sixties, seventies and eighties - pattern search ain't broken). 

Implementations are drawn from reasonably popular Python optimization packages, and within each, I've tried to span a reasonable set of choices. Each is rendered as a simple function, even simpler than the scipy signature:   

                          best_val, best_x = strategy( objective, n_trials, n_dim )  

I was minimalist here, so objective is a function taking the hyper-cube as its domain. At least for me there isn't a huge loss of generality, since the domain can be mapped to a hypercube in some fashion usually. This exercise has only one real hazard: the potential to trigger long-buried memories of Complex Analysis exams - though obviously the mapping from one domain to another need not preserve angles! Indeed, the case has been made that for Lipshitz/Holder functions one doesn't even need to preserve dimension. See Lera and Sergeyev pdf for one example - though this is a whole sub-field. 

I hope that as far as your views on optimizers go, the standardization provided by the HumpDay package makes it operationally trivial for you to form a weak prior at least, as to what might be useful. Suppose you have an objective function that captures the essence of many similar tasks you face, or a collection of the same. To run one optimizer:

    from humpday.optimizers.nevergrad import nevergrad_ngopt_cube
    from humpday.objectives.classic import salomon_on_cube
    best_val, best_x, feval_count = nevergrad_ngopt_cube(salomon_on_cube, n_trials=100, n_dim=3, with_count=True)
You can then switch seamlessly from nevergrad to ax-platform to dlib or whatever you can find in humpday/optimizers. Or, to test all the strategies from one package against some pre-canned objective functions, you can write something like:

    from humpday.optimizers.pysotcube import PYSOT_OPTIMIZERS
    from humpday import CLASSIC_OBJECTIVES
    for objective in CLASSIC_OBJECTIVES:
        for optimizer in PYSOT_OPTIMIZERS:
            print( (optimizer.__name__, optimizer(objective, n_trials=100, n_dim=3, with_count=True)))
Similarly one can, given enough time, test all strategies from all optimization packages:

    from humpday import OPTIMIZERS
    from humpday import CLASSIC_OBJECTIVES
    for objective in CLASSIC_OBJECTIVES:
        for optimizer in OPTIMIZERS:
            print( (optimizer.__name__, optimizer(objective, n_trials=100, n_dim=3, with_count=True)))

Obviously, you could replace CLASSIC_OBJECTIVES with your own list of objective functions. To emphasize, an objective is a function taking just one argument - a single list or vector describing a point in the hypercube [0,1]^n. I dare say functools.partial may be your friend, and there are a few trivial transforms provided to squish domains as you need to.  

Top Performers

Though your bespoke task is what matters, it is probably useful to glean some overall impressions about which optimizers are performing well on standardized tasks. The Elo ratings, presented with no expense spared at optimizer-elo-ratings, are an attempt to get around the assessment problem caused by "naughty" optimizers - those who insist on performing more objective function evaluations than they are instructed to use. There are lots of caveats and many Elo ratings are yet to settle as I write, but that said, the following packages and strategies seem to be performing well.  

Package Examples Form Guide
dlib dlib The algorithm find_min_global is fast, light and ingenious. It estimates the Lipschitz constant and constructs linear upper bounds, then samples the upper envelope function. If the degree to which your objective function varies across space isn't too great, and you are in dim<25, this is one way to get a free lunch. You will kick yourself for not inventing this. Strong contender.
bayesian-optimization poi A nice easy to use library. Based on overall performance, maximizing the probability of improvement seems to work better than expected improvement or upper confidence bounds, for reasons I don't understand. Don't leave out. 
nlopt directr The directr label is my (possibly confusing) shorthand for nlopt.GN_DIRECT_L_RAND, a randomized version of the locally biased DIRECT method by Gablonsky and Kelley (pdf). Hard to beat! 
pySOT  dycors;  srbf  Implements Regis and Shoemaker's method using radial basis surrogates (paper), and stochastic radial basis function         (paper). Compared to dlib, it is very heavy on computation, but if you are drilling holes in the ground that isn't going to be an issue. Definitely include in your trifectas.  
skopt default  skopt.gp_minimize. Like pySOT this uses surrogates, and is very good, but by necessity, computationally heavy. A very popular library and underrepresented in the current Elo ratings somewhat due to the paucity of my wrapper. Expect good things. 
nevergrad ngpopt8 ; ngpopt4 A product of Facebook research, ngopt8 selects between optimizers, guided by benchmarking that is probably better than mine. See Algorithm 1 page 3 of the paperBlack Box Optimization Revisited by Meunier et al. I'm reliably informed that a new meta-optimizer ngpopt12 is on the way. However as is, the latest model is topping some of the leaderboards already. Don't lay this horse.   
shgo nelder As discussed in Comparing Global Optimizers, SHGO is now one of the scipy global searches. It uses techniques from simplicial homology to avoid wasting samples, and seems to work very well in relatively low dimensional problems (say <=10) using either Nelder-Mead or Powell for local search. Each way chance. 
pymoo pattern Hooke and Jeeves pattern search (package notes, paper) from 1960 is still pretty good! Don't overlook this, or Powell for that matter, depending on your problem.  
ax-platform default  Another effort at Facebook. The default global optimizer is part of a more extensive suite of tools and I'm reliably informed that some integration with nevergrad is likely (notwithstanding this issue). Expect big things. 
optuna / cmaes cmaes  Optuna offers a flexible design and is well supported. It calls down to cmaes in this particular instance (stands for Covariance Matrix Adaptation Evolution Strategy). See that repository for explanations and a list of references. The underlying package is maintained by Mashahiro Nomura who has an interesting paper on warm starting CMA-ES (pdf). Likely to improve.

For readers of my previous post, which tried to get at optimizer performance in a different way, there are some familiar characters here and no huge surprises - though I repeat the caveat that some techniques (e.g. surrogate) might be favored by the somewhat stylized objective functions I am using here. The flexibility of other techniques (e.g. parallelism, discrete optimization, nested domains, et cetera) aren't allowed to shine. My intent isn't to downplay the cleverness of design features of these packages. 

As more real-world objective functions are introduced, it will be interesting to see how Elo's change. I refer you to the current leaderboard, as I won't be attempting to keep recommendations here up to date manually. The new Facebook inclusions are doing pretty well - which is a relief given that I've not been entirely helpful promoting their signature time series package (see prophet post - just trying to help here). Facebook organized an open competition last year to improve nevergrad and IOHprofiler (link) and I'm sure the optimizers will get even better over time. They also cater to a wide range of dimensions and domains.

However, at the risk of being booted from the nevergrad Facebook group my new go-to is dlib. It uses the approach developed by Malherbe and Vayatis (paper) as noted above, which is annoyingly elegant. There is a nice video on the blog and a very clear explanation. Not only is the performance extremely good, but the run time is likely to be negligible for any real world, expensive objective functions. The library won't let you run problems with more than 34 dimensions, but that seems to be the only limitation. The only drawback I can think of (for my particular domain) is constantly wishing I had come up with it myself. 

Elo ratings are flawed in the same way that Value at Risk is flawed. I don't suggest that "most likely to get the lowest value for a fixed number of function evaluations" is everyone's cost function. There are some other things to be aware of too, such as the fact that some optimizers, including SHGO, really struggle to limit themselves to a fixed number of evaluations and that can hurt them in head-to-head matches when playing Black. Whether that penalty is considered fair depends on your needs. I'll go into a little bit more detail on the methodology in a moment, to illuminate these points. 

Recommendations

First let me mention that you can poke around in humpday/comparison for utilities that will generate recommended optimizers. These are only as good as the category Elo ratings behind them, which need more time to equilibrate, but say you want some quick advice on which optimizers are the best performing at 8 dimensional problems when given 130 function evaluations to play with. Perhaps you want to limit total computation time of the optimizer (as distinct from the objective function) to 5 minutes. Use:

    suggest(n_dim=8, n_trials=130, n_seconds=300)

The category Elo ratings are in the process of being computed by dimension and function evaluation limit. I've tried to warm them up a little, but they are really intended to be a free byproduct of testing. Examples:

  Maximum of 130 evaluations Maximum of 210 evaluations
 Dimension 5 leaderboards/classic_d05_n130 leaderboards/classic_d05_n210
 Dimension 8 leaderboards/classic_d08_n130 leaderboards/classic_d08_n210

Go up one directory to find more, here. For some other time saving utilities I'd refer you to the HumpDay README.md as that is likely to be better maintained than this blog article. For instance:

    from humpday import recommend

def my_objective(u):
time.sleep(0.01)
return u[0]*math.sin(u[1])

recommendations = recommend(my_objective, n_dim=21, n_trials=130)

As this particular function is very fast, some optimizers will be culled from the list. There's not point spending a long time updating a surrogate function if you can sample the objective many times instead. 

Hopefully this speeds your search or encourages you to try some packages you would otherwise overlook. I suppose, in addition, you could continue to use HumpDay after you find you favorite optimizer, though that comes with some pip dependency baggage and serious limitations so I don't really recommend it. The intent is modest here: to provide you with a quick shortlisting of Python optimization packages that might be suitable to your bespoke task, and some templates to get you over the hump (groan) of moving from one syntax to another. At minimum it might provide you with some assurance that in continuing with the one you know and like, you're not missing much. 

How Matches are Officiated

If you are familiar with Elo ratings for chess, and I'm guessing that many of you are, then the only question you will ask is, "How are matches arranged between optimizers?". That is laid out in the code, and you can even watch games occurring (I'm kidding ... I think) or inspect the history. But here's the English version. 

  1. We choose an objective function at random from the list. See the previous article Comparing Python Global Optimizers for an explanation of some of these. 
  2. We select a random dimension n_dim, from the Fibonacci numbers (because ... no reason) and a maximum number of function evaluations, referred to n_trials in the code.
  3. We choose two optimization strategies at random. One to play White, the other Black. To be clear, an optimization strategy comprises a choice of optimization package, such as optuna, and also the fixing of any other choices or parameters that are required (in this case a choice of sampler - such as Tree-structured Parzen Estimator or TPE). 
  4. The White player is instructed to minimize the objective function using at most n_trials function evaluations. If they go over, we try to fool the optimizer by lowering the parameter that (supposedly) controls the number of function evaluations (but evidently, doesn't quite do so). We repeat until the actual number of function evaluations, call it white_feval_count, is not more than n_trials. Each time we fail, we reduce the parameter by 30%. 
  5. Assuming success, it is now Black's turn. We perform a similar procedure where n_trials is swapped out for white_feval_count, should they be different. And this time, we allow Black to exceed white_eval_count by ten percent. We also give Black finer increments, and more attempts. 
  6. Hopefully, both optimizers obey our command one way or another and in doing so, minimize a function using roughly the same number of function evaluations. A winner is determined according to the lowest function minimum that is found. However, if the proportional difference is very small, a draw is declared. 

Sometimes a match will be declared void if step 4 or step 5 fails, and one optimizer fails to constrain itself appropriately. This occurs if the White player exceeds n_trials repeatedly or if the Black player exceeds the white_feval_count by more than 10%, despite strong encouragement. A match is also void if either algorithm crashes. There is no other penalty, at present.  

The rest is just standard Elo updating, but I'll save you from re-watching The Social Network if you are rusty on that topic (is it mildly ironic that we are judging Facebook algorithms with the algorithm that catalyzed The Facebook?). The Elo system assumes that if Black's prior rating exceeds White's rating by a difference of \(d\) then White can expect to score \begin{equation} \bar{p} = \frac{1}{1 + 10^{d/F}} \end{equation} points, where a win counts as \(1.0\), a draw \(0.5\) and a loss \(0.0\) and \(F\) is a constant, called the F-factor. White actually scores \(p\) points say leading to a differential of \begin{equation} \delta = p - \bar{p} \end{equation} for White and \(-\delta\) for Black. Both player's ratings are updated by an amount equal to this performance differential multiplied by the so-called K-factor. At time of writing, the following F-factors and K-factors are used. 

  Optimizers c.f. Chess
K-factor 60 or 30 25, 16 or 10
F-factor 1000 400

We take the liberty of using a larger F-factor than in chess. The rationale is that a single optimization is somewhere between an entire game of chess and a hand of poker.  

Interpreting Absolute Optimizer Elo Ratings

Unlike chess Elo ratings, the optimizer Elo scale is anchored. The random optimizer's rating is reset to 1600 after every game. For the avoidance of doubt, the random optimizer is the optimizer that chooses uniformly on the hypercube n_trials times, returning the lowest value found. 

This presumably limits ratings drift over time, and also makes it relatively easy to interpret an optimizer rating in an absolute sense. For example, using an F-factor of 1000 we can read off the expected points gained in a game when an optimizer is compared to random search. If draws are considered rare, this is essentially equivalent to the probability that an optimizer will help, rather than harm, when you employ it. 

As also noted in the previous post, that particular measure may not be the best statistic. For example, you might be interested in the average time taken by an optimizer to perform reasonably well, as we were in that article. Or you might be interested in some tail weighted metric that penalizes cases where it never finds the minimum. Such metrics might lead you more towards optimizers that try to find all minima, whereas Elo ratings might not always correlate.

But with those caveats and more, here's an interpretation. The probability that an optimizer performs better than random search, assuming no draws, is as follows:  

Elo Beat Random Elo Beat Random Elo Beat Random
1300 33 %  1600 50 % 1900 67 %
1400 38 % 1700 56 % 2000 72 %
1500 44 % 1800 61 % 2100 76 %

Now as computed, the Elo rating is more a function of matches against other optimizers than it is against the random optimizer, so take with a grain of salt. But notice that the very best optimization strategies are better than random search three times more often than not. That might be significant, depending on the economic cost of your search. 

Random search isn't dreadful, and it is well appreciated that randomly choosing points from the space to be evaluated is superior to grid search, something we see all too often (see the paper by Bergstra and Bengio for a theoretical treatment). On the other hand, random search is certainly easy to implement, so one might reasonably be concerned if an optimizer is assigned a rating below 1600 - as seems to be the case for the "portfolio" algorithm in the ax-platform at present, and for several marsupials in the platypus package.

But again, there is a lot of noise in the ratings (it is an exercise for the reader to estimate how much) and, as new objective functions are introduced that are better suited to evolutionary algorithms, relative assessments will probably change. There is another reason why some optimizers might lose out to the random search on the Elo scale that the reader may have noticed. If they are not good at controlling their function evaluation budget, then the head-to-head procedure, as outlined above, may result in them using fewer function evaluations than their opponent - at least when they play Black. 

Interpreting Relative Optimizer Elo Ratings

There are theoretical flaws with the Elo system and inaccuracies in the underpinning assumptions (logarithmic odds ratios). Yet the chess world carries on and so shall we. I hope that the relative differences in Elo ratings of optimizers can provide some rough guidance, allowing you to weigh possible performance differences against a myriad of other considerations, ranging from design, ease of use, responsiveness to Github issues, generality of the search domain, ease of modification and so forth.

Here's an oddly familiar table relating differences in optimizer Elo ratings to the likelihood that one optimizer will find a lower minimum than the other.   

Elo Diff Probability Elo Diff Probability Elo Diff Probability
-300 33 %  0 50 % 300 67 %
-200 38 % 100 56 % 400 72 %
-100 44 % 200 61 % 500 76 %

I would not dismiss relatively small differences as economically insignificant if each evaluation corresponds to a lengthy simulation (or worse, some physical operation). However, I think this might also grant some of you the liberty to choose your package based on non-performance criteria in many instances and sleep well - say if the rating differential amounts to what might be won or lost in a single game (+/- 30 points initially, then +/- 15 later on) or two.  

This exercise has convinced me further that download statistics and other measures of popularity are dreadful regressors for performance. Hopefully, more optimizers will be assigned Elo ratings in the future and again, if you have a suggestion for these lists, please file an issue or, better yet, a pull request. 

Contributing

It should be easy to make helpful PRs: 

  1. Just grab your favorite optimization package from those that I'm missed thus far. 
  2. Shape it as I have the others in the optimizers directory, and make a pull request.
  3. Or ... find bugs. Special thanks to Robert Kern for finding some horrendous ones already.  
  4. Or ... suggest better choices of method or hyper-parameters within a package.

Interesting objective functions on the hypercube are also in demand, and you can shove them here.  Incidentally, there are a couple of related projects that might pique your interest too: 

  • The embarrassingly library discussed here that is my somewhat experimental hack for robustness. It modifies the objective function only, so is uncoupled in an engineering sense from the optimizer itself. 
  • An ongoing effort to boost elo ratings for time-series algorithms judged out of sample on live data was part of the motivation for HumpDay. HumpDay grew out of a sub-module there but I carved it out in the interest of maintainability.

Anyway, I hope HumpDay saves you some time, simple as it is. To ensure articles like this are in your thread consider following microprediction on LinkedIn. We're trying to make powerful bespoke AI free to the world, and you are welcome to use it or contribute in large or small ways. 

 

 

 

 

 

Comments