16 min

Fast, Robust Optimization Using Impure Objective Functions

Published on December 4, 2020

Quite a few open source optimization packages were reviewed in a previous blog article Comparing Python Global Optimizers. Here I introduce a small Python package I've been writing, called embarrassingly, that might work in tandem with any one of them.  The library helps you modify your objective function before you send it to an optimizer, and this article explains why you might want to do that sometimes.

Can You Make Your Optimizer Do This?  

Here's an example. I want to land my helicopter on the red plateau, even though there are higher points to choose further from the origin (we're minimizing, but the plot is upside down, btw, note the z-axis). 

landing_area

Of course this is a hokey example, but we might suppose that our objective function is a likelihood for a model, or some other fitness criterion where limited data and changing conditions make us loath to pick a sharp extremal point. I'll be wading a short distance into the motivations, though it could be an entire book. For now let's say it is plausible that for some applications, generalization ability might be related to the properties of the objective function in a neighborhood, and not just at a point. That's another way of saying that you want to define your objective function implicitly , for instance as a kernel smoothing of a function you can compute quickly. We'll assume that the function is way too expensive to consider actually doing that, so "fix the objective" isn't a pragmatic solution. 

So, can you make your optimizer land on the helipad even though it really doesn't want to?  

Modifying Optimizers is Hard

To give you a little time to think about this challenge and motivate my approach, I'm going to ramble on for just a bit. The other night I was reading through Stefan Endres' optimizer code while thinking about some modifications to it that might help my helicopter land on that red helipad. It was probably the third or fourth time I'd read Stefan's code, and I was starting to wonder if this task was beyond my capabilities. Probably it is, and even if I could debug my modifications, I had visions of merge hell in my future after Stefan made more improvements - improvements I'd want in my version too.

Worse yet, he might come up with something entirely different and fundamentally superior, at which point my branch is dead. Yes I know we don't have to use this and we could roll our own genetic, evolutionary, swarmy algorithm instead, but that didn't cheer me up because those guys are getting smashed in my comparisons (tell me what I'm missing but if nature had used homology instead of that painfully slow nonsense Darwin noticed, I guarantee you more than 50% of homo-sapiens would be wearing masks right now). 

Despondent and distracted, I found myself looking at a chess game. Here's Reshko to move with the white pieces against Kaminsky in the Leningrad Championship of 1972 (this story is going somewhere - sort of). Black must have been exceedingly pleased with himself as he sat at the board, having apparently extracted himself from a very tough endgame and, no doubt, seen the stalemate possibility (queen sacrifice on f7) many moves prior.   

reshko_kaminsky

But in fact Reshko can win this one, using a ploy that has, I understand, only been used twice with good reason in the entire history of tournament chess. I got a little fixated with Queen maneuvers, following on from the previous passage of play. I gave up and looked at the answer, then admonished myself. I claim the problem is easier as a snapshot, when you are conditioned to think it is contrived, not a real game. Or I need more sleep.

Possibly I strain the analogy between the game and our task, but with The Queen's Gambit exploding on Netflix, we're going to run with the chess theme today. Reshko's solution reminds us that if there isn't actually a rule against something, don't prune your mental search prematurely just because something seems to run against the custom. Our other theme will be golf, with Tiger our inspiration. He's one smart dude. Not every golfer uses Pythagorus Theorem on the course. Not every golfer realizes a boulder is a loose impediment. Let's break some norms! 

Rules: Tiger's Loose Impediment Ruling In 1999 | SwingU Clubhouse 

The line of attack I propose here requires us to depart from one norm: sending functions to optimizers. We shall wean ourselves off the identification between computer minimizers (like scipy.optimize.minimize) and the mathematical notion of minimizing a function. \begin{equation} \min_{x \in \Omega} f(x) \end{equation}

The Problem

I'm not talking about the obvious difference between software and theoretical minimization - namely that the software is not perfect. Let's assume it is perfect. We denote the best value found by the optimizer as \(x^*=M(f)\) and we could suppose, for this discussion, an optimizer where \(M(f) = \arg\min f(x)\) for any mathematical function \(f\) you care to dream up. That flawless software won't return the location of the helipad.

However, it is reasonably obvious that if you give it a different function, it will. Indeed we can define the location of helipads in greater generality as the minimal points of functions such as:

\begin{eqnarray*} \bar{g}(x) & = & E_{x'\in B(x,\epsilon)}[ f(x') ] \\ \underline{g}(x) & = & \max_{x'\in B(x,\epsilon)} f(x') \end{eqnarray*} where the average, maximum or some other functional is computed for a ball \(B(x,\epsilon)\) around \(x\). In some sense we are done. We can imagine a mathematical function \(H\) that is the helipad locator. It takes a function \(f\) and creates some smoothed version \(g\) which it then optimizes. If \(f * h\) represents the smoothing operation then \begin{equation} H(f) = \arg\min f * h \end{equation} is the functional we want to reduce to code somehow, subject to the constraint that \(f\) is expensive to compute, lacks known derivatives, and cannot actually be convolved with anything in any reasonable amount of time.

That's the rub. How do we do the equivalent of computing \(g=f*h\) and then minimizing using our perfect optimizer \(M\)? A natural line of attack asks whether there is some other function \(\tilde{g}\) that approximates \(g\) and, therefore, may position us at a nice flat point \(x^* = M(\tilde{g}) \simeq M(g) = \arg\min f*h\).

I racked my brains for quite a bit on this one and actually came very close to convincing myself that there could be no such function - because if there was and it had the same image under \(H\) as \(g\) for all test functions \(f\) then (insert slightly vague logic that prunes the mental search) yes, Mr. Kaminsky, I agree to a draw. It seems to me that this path is a difficult one. No mathematical function that I can think of can be supplied to trick our optimizer into serving our purposes. 

The Monkey's Bum 

So is this variation dead? Not quite. As my favorite Croatian YouTuber would demand, let's pause the tape for a moment so you can ponder which function to send to the optimizer - bearing in mind that it's a bit of a trick question.

The game, as called by Antonio Radic, may provide you some inspiration. As you watch, think of a mathematical function as a queen - for she is surely the second most valuable piece on the mathematical board, second only to the notion of a set. Is that a sufficient double-spoiler? Okay, here's another hint - lest the title of this post slipped by. What do the following Python functions have in common? 

  • zip
  • int
  • range
  • enumerate
  • reversed

You're out of time, almost. The answer is that they are not Python functions but you think of them that way. Flipping that around, Python functions aren't mathematical functions, typically. In Haskell, maybe. And that's important because the domain of the operation \(M\) can be much larger than the domain of the operator \(f \mapsto \arg\min f\) - if we take that mathematical notation to imply \(f\) is a function (I think it does).

Finer points of mathematical notation aside, I trace my blindness on this problem to the false equivalence suggested by \(M(f) = \arg\min f(x)\), and to the habit of passing pure functions to optimizers - as strong a habit as promoting a pawn to a queen. The left hand side is an operation on functions, of course, but it could be an operation on functions with side-effects, or an operation on state machines, or, in Python, anything callable that produces \(y\) given \(x\). There is no requirement that the result depend only on \(x\). 

If pure functions are the queens of mathematics, can we call the impure functions bishops? In this metaphor I'd like to promote a bishop as the solution to this little problem we have. A bishop is a Python callable with non-trivial state (examples we'll come to momentarily use two internal surrogate functions). Once we free ourselves of the preconception of an optimizer as something that finds the minimum of a pure function, and view it just as a piece of code, the possibilities are quite vast and we can look for an \(\tilde{f}\) such that \(M(\tilde{f}) = \arg\min g(x)\). 

And thus the variation springs to life. I think we do still have an excellent chance of using, or abusing, global optimizers as they are without changing a single line of code. And it comes with one huge advantage. We can choose whatever optimizer we like and also enumerate transformations of objective functions (i.e. \(f \mapsto \tilde{f}\)) at the same time. The two are decoupled as far as implementation is concerned. Something in that tensor product is going to get us on that helipad. 

This tinkering, engineering approach will not appeal to everyone. It is as crass as an attempt to pull off Scholar's mate against the modern defense, in keeping with our theme. That opening goes by the name Monkey's Bum attack because chess-player Ken Coates, when first shown it, thought it was so awful he declared "if that works I'm a monkey's bum!" But it can work. For those of you who binge watched The Queen's Gambit, here's the real Beth Harmon playing the Monkey's Bum, and blowing Alexei Shirov off the board (video).  

beth_harmon

So maybe we call the feeding of non-functions to function optimizers the Monkey's Bum attack on robust optimization. A state machine will treat the suggestion \(x\) from the optimizer \(M\) as stimulus and omit a response which the optimizer will interpret as the value of a function \(f(x)\). The deceitful dance between optimizer and objective non-function will result in a number. Maybe a useful one for your business. 

Aside on Python Callables

I think the cleanest way to do this in Python is supplying the optimizer with a callable class. So let's write some code and, just in case you are not familiar with Python callables, here is a quick warm-up. The intent of the embarrassingly.fastidious module is just logging. We create a callable Python class that walks and talks like a function as far as your optimizer is concerned - but also tracks quantities such as the time to convergence, number of function evaluations and values returned.

 
from embarrassingy.fastidious import Fastidious
import numpy as np 
def g(x):
   return 3.14*np.linalg.norm(x)
G = Fastidious(g)

We then pass \(G\) rather than \(g\) to your optimizer of choice. Nothing terribly great here (and if your optimizer multi-threads, this won't help) but it will save us some time. As another example, suppose your objective function takes a long time to compute, but you have some servers lying around and you wish to use them all. It may be possible to use some optimization libraries directly if they support cluster computation in some manner, for example ray in SHGO or sqlite-based persistence in Optuna. However, in theory this shouldn't be a burden for creators of optimization packages and in practice, your setup may be different.   

For instance, suppose for concreteness that our objective function shells out to a bash script that runs a process remotely that triggers a smart outlet that fires up a motor that tips over a glass of milk and startles a mouse that kicks a ball down a hill and you measure how far it goes ... you get the idea (let's just say, in a little more generality, that your engineering is not easily anticipated by the authors of the optimization library you happen to be using). 

The class embarrassingly.parallel.Parallel is intended to be your friend in this situation. As with the previous example Fastidious, Parallel( ) also acts morally like a decorator, changing your function to a "better" one. But here you actually pass it a "pre-objective" function instead, which is identical to the objective function, except that it takes one additional pre-pended argument: the number of the worker (e.g. server). You want that extra argument because your script can use it to decide where to do the calculation. Meanwhile, the Parallel version will manage the queue of workers so you don't have to. You can send the callable class directly to your favorite optimizer. 

embarrassingly_parallel

If that sounds complicated, it is anything but. Here's a complete usage example. There is exactly one more line of code that there would otherwise be.    

 
from embarrassingly.parallel import Parallel
import optuna
import time

def f_(worker, trial):
    print('Hi this is worker ' + str(worker)+' taking 5 seconds.')  
time.sleep(5) x = [trial.suggest_float('x' + str(i), 0, 1) for i in range(3)] return x[0] + x[1] * x[2] if __name__=='__main__': f = Parallel(f_, num_workers=7) study = optuna.create_study() study.optimize(f, n_trials=21, n_jobs=7)

When Computation Time Varies

One could keep going with the conveniences, but we want to get to the helipad. I give just one more example which will set us up nicely, and it has independent interest. The hack applies when computation time varies widely across the domain where we seek a minimum. This one is more of a gambit than the proceeding examples, because we are messing with the intent of the optimizer, and changing the outcome.    

To motivate, suppose evaluation of \(f\) at some value \(x\) in the parameter space involves one type of model and elsewhere, \(f(x')\) calls for an entirely different calculation. Or perhaps \(f\) is morally the same function everywhere but \(x\) represents parameters or hyperparameters that have material bearing on the calculation time. A chemical reaction simulation, or an epidemic agent model might terminate quickly for some values of \(x\), whereas for other "cusp" calculations, \(f(x)\) might take a very long time.  

This problem could potentially be addressed by using a different optimizer entirely - such as one inspired by multi-fidelity search. However, maybe you are generally quite happy with the optimizer you are using (or you are paying a lot) and don't want to make a radical shift just because some parts of the problem domain involve high computation cost. Perhaps you feel that groovy techniques found in some optimizers are not (yet) incorporated into ... whatever class of optimizer might explicitly tackle variation in computation time. You'd surely be right. I haven't seen a global optimizer that uses both multi-fidelity notions and homology. 

In keeping with our theme, let's not throw out our favorite optimizer but instead, try to hack the objective function so that it doesn't waste too much time. For this we use the embarrassingly.shy module. It will consume an objective function and provide something quite similar. However, a shy objective function (Python callable, to be pedantic) will maintain two auxiliary models under the hood. 

  • A surrogate model for your objective function
  • A model estimating the computation time

Both functions can be modified upon construction, so long as they derive from SurrogateModel, a class provided by the Surrogate Modeling Toolbox (authored by Mohamed Amine Bouhlel, John Hwang, Nathalie Bartoli, Remi Lafage, Joseph Morlier and Joaquim Martins). To illustrate, let's make a function on the square \( (-1/2,1/2)^2 \) whose computation time depends on how close we are to the origin.


from deap.benchmarks import schwefel
def slow_and_pointless(x):
    """ Example of a function with varying computation time """
    r = np.linalg.norm(x)
    quad = (0.5*0.5-r*r)/(0.5*0.5)
    compute_time = max(0,0.5*quad+x[0])
    time.sleep(compute_time)
    return schwefel([1000*x[0],980*x[1]])[0]
Now, instead of passing this directly to an optimizer, we'll first make the function shy:

bounds = [(-0.5, 0.5), (-0.5, 0.5)]
SAP = Shy(slow_and_pointless, bounds=bounds, t_unit=0.01, d_unit=0.3)

As the name suggests, a shy objective function is reluctant to evaluate itself. The blue dots represent cases where it did, but the green dots indicate points where it decided instead to use its internal surrogate model and reply to the optimizer with a (potentially misleading) approximation. This is an engineering hack. The decision is made in the accept() method and the reader may wish to completely override that as they see fit, perhaps with something more theoretically motivated.

As it stands, the arguments t_unit (\(\delta t\) in our formulas) and d_unit ( \(\delta x\)) guide the Shy class by indicating an appropriate unit of time (here 0.01 seconds) and distance, which guides the computation of an acceptance function. Under the hood, the new callable SAP will sometimes ignore the optimizer's request to evaluate a point. By default, it will evaluate only with probability \begin{equation} P(evaluate) = \min\{1,\frac{\eta_1+(e^{d'}-1)}{t'+\eta_2}\} \end{equation} where \(\eta_1\) and \(\eta_2\) are constants, and \(d'=\frac{d}{\delta x}\) is the distance to the nearest point that has already been computed, measured in units of d_unit. Similarly the quantity \(t'=\frac{t}{\delta t}\) is the predicted time of computation \(t\), measured in units of t_unit. This computation time can be estimated on the fly with no prior knowledge. Here is an example of learning the approximate computation time during an optimization.

cpu_time

Notice that the learned surface has moved above some of the green points, which were the estimated computation times at the moment when the algorithm needed them. Prior knowledge can be inserted by the user so that the algorithm doesn't have to learn the cpu expenditure on the fly.

As a side remark, it may be possible to create a small spin on the shyness theme if your optimizer is written in such a way that function evaluation may be declined (say by throwing an error) rather than an approximation returned. But let's proceed...


res = scipy.optimize.shgo(func=SAP, bounds=bounds, n=8, iters=4, options={'minimize_every_iter': True, 'ftol': 0.1})

Here SHGO, short for simplicial homology global optimization, is being called (and the other parameters have little to do with our objective function skullduggery). As the optimization proceeds, you can see that a largish expanse near the origin is skipped over, more or less entirely, where we see only green surrogate evaluations informing the SGHO optimizer.  

objective

Both internal models (surrogate and cpu approximation) learn from an entirely cold start here, though this doesn't prevent the optimization from working well. 

A warm start may be possible if there are very similar optimization runs to be performed - one could then prime the surrogate functions with prior evidence (or "hyper-optimize" selection of the same). In fairness the reader might wonder, at some point, if it would be easier to re-write the optimizer instead, rather than shoe-horn our problem into an optimization method that does not anticipate uneven calculation times. 

However, as noted above, some optimizers seem to be flat out better than others. And some of the best might not (yet) account for varying computation times (nor shy objective functions who might politely decline to evaluate a particular point - since we are listing their possible shortcomings). SHGO has some excellent results on my problems at least, and is strongly motivated theoretically. Using homology, it cleverly avoids creating too many candidate minima to search.

The use of shy objective functions with SHGO appealed to me because I had in mind applications where the objective varied considerably in character from one region to another. Admittedly, our lovely hack completely destroys the theoretical guarantee in SHGO (that we will find every minimum) and so it comes with a status warning: "interesting kludge in need of further study". As noted the surrogate function might cover up a good minimum, hiding it from the optimizer.

But one shouldn't let the perfect be the enemy of the good. After all, using Sobol sequences for sampling in SHGO also throws out the theoretical guarantees and yet, in an example of pragmatism triumphing over theoretical certainty, the SGHO library will do this by default. It is assumed that most users will not, as a practical matter, actually want to pay for this guarantee with computation time. 

Because it is only one or two extra lines of code, I would encourage you to try converting your objective function to a shy one and see whether it helps in your particular problem. By the way, a shy objective function is also embarrassingly.fastidious and thus will track its own progress towards finding minima as noted above. So as shown in examples/shy_shgo.py the following comparison comes for free. 

schwefel_function_shgo_shy

As you can see, the dastardly Schwefel function (pic) is conquered by SHGO more speedily than usual, at least for some choices of distance and computation time scale. In this example, the optimizer is performing a local search after every global update (during which regions for the function are selected that are hopefully approximately convex). What's really fascinating is that not only time but, seemingly, convergence is sometimes improved. Possibly this is an artifact of the surrogate function disagreeing with the real one, leading the optimizer to believe it has more work to do. 

If we turn off the local search until after the global iterations are done, this effect seems even more pronounced. However, you'll note there are also cases where convergence is not achieved - probably due to minima hiding. 

schwefel_function_shgo_no_local_search

I also tested with the Optuna library. You'll notice from the logarithmic scale that this struggles more than SHGO on this test, but here we are interested in the relative performance before and after we make our function shy. Again there are probably some quirks to be ironed out in the termination criteria here, as you notice the impressive progress of a number of paths below the blue baseline - they certainly get off to a fast start but perhaps fail to find the right minima (I didn't check). 

optuna-shy-annotated

Again, I would suggest the reader investigate whether the hack provided by embarrassingly.shy leads to good practical results on problems at hand, as it is surely premature to make general statements. 

Underpromotion and plateaus

Finally, as promised, we return to the challenge of robust optimization. Robustness is an old topic with related ideas in statistics (jackknife, bootstrap and so on) and machine learning (see this talk by Professor Stefanie Jegelka for a recent example). I present something of a caricature. One reader commented, I think quite reasonably, that if your model likelihood looks like this, then there is something very wrong with your model or your fitness criterion. 

landing_area-1

However, there are several counters to that argument. First, if we can learn how combinations of objective functions and optimizers perform on training exercises such as this, it seems likely they may help us on more reasonable ones. Second, our task might not be as reasonable as we imagine. The fitness function might be ten dimensional, and nobody ever really looks - do they?   

But most importantly, I think it is a little too idealistic to assume that we can always define and conveniently compute a "true" fitness function \(g\), and presumably build only "good models", with the same ease and speed as we can define "bad models" and "pre-fitness" functions \(f\) in our notation, defined broadly as those that would be more sensible or can be more useful after smoothing, or some other operation suggestive of robustness. I emphasize that I'm considering global derivative free optimization. For us \(g\) is ephemeral, and not close to \(f\) in any pragmatic sense.

Proceeding then, and unable to use \(g\), we continue our line of thinking begun above - seeking a proxy \(\tilde{f}\) to send to the optimizer that might not be a pure function, and hoping that the result of the optimization will nonetheless point us to a nice minimum point on a plateau.

In the definition \(\tilde{f}(x;\cdot)\) we should probably throw in a dot, at least, to indicate state, past search values or whatever else the callable chooses to use. And without suggesting there is one best way of approaching this dreadful bastardization of optimization, the module embarrasingly.underpromoted allows you to create an "under-promoted" objective callable that, by default, might be somewhat reasonable (and you can modify it of course). Since we are down to brass tacks, here's the "pre-objective" function \(f\) with a helipad.

 
def f(x):
    """ A helicopter landing pad when you turn it upside down """
    r = np.linalg.norm(x)
    x0 = np.array([0.25,0.25])
    amp = r*math.sin(16*r*r)
    return -1 if np.linalg.norm(x-x0) < 0.1 else 0.1*x[0] + amp

We create a deceitful, stateful objective non-function:

 
from embarrasingly.underpromoted import Underpromoted
bounds = [(-1,1),(-1,1)]
f_tilde = Underpromoted(f, bounds=bounds, radius=0.01)

and run an optimizer:

 
from scipy.optimize import shgo
res = shgo(func=f_tilde, bounds=bounds, n=8, iters=4, options={'minimize_every_iter': True, 'ftol': 0.1})

The optimizer finds a global minima for \(f\) at \((-0.734,-0.734)\) but when given the under-promoted version \(\tilde{f}\) it returns \((0.25,0.25)\) as we had hoped - a considerably less terrifying place to land. As you can see, the usage is very simple and as a minor point, one can alternatively use embarrassingly.shy directly, as this already contains the functionality (Underpromoted is mere syntactic sugar on top of the Shy class really).

Yes, SHGO is a pet favorite but I imagine it would be interesting to study the interaction between choice of optimizer and choice of objective modification. As an aside, it is well appreciated that objective search can itself have a regularizing effect, though this is usually studied on the model directly, not at the level of hyper-parameter search. Stochastic gradient descent implicitly performs some regularization (the ball is more likely to stop rolling on a wide plateau) and that seems to be true for complex language processing models - at least according to Lei, Sun, Xiao and Wang (paper). 

So, what is the default \(\tilde{f}\)? 

I now turn to some tentative, very provisional thoughts which are expressed in the library. The idea is that an evaluation of the function \(f(x)\), if a low number, is a suggestion that \(x\) should rise up the rankings, as it were, and inform us as to the increased likelihood of a minima. However we are trying to optimize \(g\), not \(f\), and so the good performance does not immediately incline us to report the "whole truth" to the optimizer. The point \(x\) will be under-promoted. Its rise (if one plots upside down) will be retarded. 

For example, we can provide a compromise which drags us down (or up) towards \(f_{surrogate}\). In the interest of simplicity we might choose:

\begin{equation} \tilde{f}(x) := f_{surrogate}(x)+\kappa \left( f(x)-f_{surrogate}(x)\right) \end{equation} where \(\kappa\) is a parameter that - given its analogous role in the Kalman filter, can only be called the gain.  Notice we maintain a surrogate function \(f_{surrogate}\) internally, as we did when making a shy function, and we also update this with the evidence \(f(x)\) - at least someone knows that \(x\) has done a good job. And the next time the optimizer picks a point nearby, it won't be retarded to the same extent. 

Depending on the settings, we will be reluctant to evaluate lots of points very close to each other. The surrogate function - and, thus also \(\tilde{f}(x)\), will catch up to the original objective function faster where \(f(x)\) is flatter, steering the optimizer towards the helipad. Notice that we use the real value of \(f(x)\) to update the internal surrogate model even though we don't tell the optimizer about it. Here's a demonstration. It isn't exactly SpaceX landing a rocket at sea, but it's kind of fun.  

underpromotion

The choice of distance scale can matter. If it is close to the same as the helipad itself, then the optimizer could get a tad unlucky and sample a point right at the middle of a tiny plateau "prematurely". It won't have a chance to correct that particular point, but it can settle elsewhere. Unless the choice of distance unit is overly precise, there will be neighbors who are "pretty good" places to land. As an aside, if our optimizer is written to allow updates (revisiting the same point, or noisy measurements) we'd have more options - but we assume that isn't the case.

I think the key observation here is that the callable \(\tilde{f}\) wants to mimic \(g\) only insofar as we get a good result. The usual metrics don't capture distance between \(\tilde{f}\) and \(g\) in a useful way as far as understanding the efficacy is concerned. Still, a formal analysis of the conspiracy between optimizer and objective callable is surely possible, as they attempt a vaguely Bayesian approach. 

The details of the search will matter also, obviously. A naive hill climbing search might fail to do anything terribly useful, period. The SHGO algorithm can be fooled too, but if the penalization due to under-promotion is roughly the same for the initial broad sweep and construction of simplexes, it seems plausible that this won't get terribly badly messed up. The provisional results seem to support that, but I suspect there will be more to this story. 

As far as choosing parameters goes, there is some code you are welcome to adapt in the examples folder that searches for reasonable \(\kappa\) and spatial parameters as it moves the helipad around and changes its size (code). This steers me towards the lower end of gain parameters, such as \(\kappa=1/8\), but that will depend on \(\delta x\) and with different problems, your mileage may vary. This is all quite speculative at this point, so if you have thoughts, I'm all ears.    

Application to Short Time Series

It seems plausible that time series models estimated on short time series might benefit from robust optimization, especially if you have a lot of parameters and insist on using models with non-linearities. Just for a laugh let's assume a time series model \begin{equation} \hat{y}_t = \alpha_0 y_{t-1} + \alpha_1 y_{t-2} + \alpha_2 y_{t-3} \frac{ y_{t-2}-y_{t-1} } { 0.1+ \left|y_{t-2}-y_{t-4}\right| } \end{equation} which I don't hold up as a paragon of sensible model design - more an allegory for models with non-linear response. Suppose further that for whatever perverse reason we fit this on just six data points - twice as many as the number of parameters. 

Sometimes when we fit this model (by least squares minimization of prediction error on the short training set) we'll pick up coefficients that aren't going to generalize well. The question is, can the preference for flat spots in the objective function help us back off somewhat? When tested on a variety of very different real-world time series, this does seem to help. Or to put it another way, it rarely harms a lot and sometimes saves us from some really silly parameter values. We also see that the training error is a slightly better representative of the out-of-sample error (though still terrible). In this particular run, we train near the start of a time series and then skip ahead \(500\) data points before testing the model out of sample. You can click through to the time series to see what they look like. The coin example is a bit of an odd one, though interesting, as it is a binary outcome.  


Time series

Training error

Prediction error Robust training error Robust prediction error Percentage change in prediction error
altitude 0.624 1.274 0.624 1.269 -0.4
emojitracker 1.147 0.951 1.166 0.886 -6.8
electricity 0.557 1.11 0.574 1.137 2.4
copula_x 0.683 0.951 0.692 0.966 1.5
die 0.511 1.826 0.515 1.705 -6.6
dvn 0.18 1.245 0.181 1.229 -1.3
hospital 0.367 1.332 0.424 1.179 -11.5
qual 0.323 1.583 0.323 1.594 0.7
mtum 0.303 0.742 0.305 0.755 1.7
fcx 0.39 0.854 0.39 0.859 0.5
coin_b 0.588 4.342 0.588 0.977 -77.5
airport 2.497 2.997 2.501 2.958 -1.3
pe 0.086 1.073 0.086 1.06 -1.2
c5_cardano 0.06 0.993 0.061 1.001 0.8

Forget for a moment that we have other ways to find \(\alpha\) than sending this to a derivative-free optimizer. A radius of \(0.1\) was used and \(\kappa=1/2\) here. One might wonder if benefit will persist as the training set is enlarged to \(50\) data points, and indeed the effect seems much smaller. Seemingly it is still there but I wouldn't want to defend it in court. 

A slightly different example adds another term to the model, increasing the set of number of parameters to four. \begin{equation} \hat{y}_t = \alpha_0 y_{t-1} + \alpha_1 y_{t-2} + \alpha_2 y_{t-3} \frac{ y_{t-2}-y_{t-1} } { 0.1+ \left|y_{t-2}-y_{t-4}\right| } +\alpha_3 y_{t-1}\left(y_{t-1}-y_{t-3}\right)^+\end{equation} where \( (\cdot)^+ = \max(0,\cdot)\) indicates the positive part. When the objective is under-promoted prior to being sent to the optimizer, we find a discernable effect even when the time series on which the model is trained contains fifty points.

Here I'm adding another term that seems like a quirky choice, to put it mildly, in order to test whether under-promotion helps steer us away from strange artifacts (that might also crop up in a more sensibly motivated model, say a neural network). Indeed we do see that the magnitude of the coefficient \(\alpha_3\) is smaller in every single instance (so far anyway, the code still runs as I write) indicating that the under-promotion has a regularizing effect not dissimilar to what we see with Lasso, compressed sensing or other shrinkage and selection techniques.

However this is quite a different setup to ridge regression, to pick one. There is no direct penalty on the coefficient included in the objective function - as noted we don't really have an objective function anymore, rather a deceptive impure function. There is, however, an indirect penalty arising because regions where the parameter is large tend to be sharper and more prone to variation. The optimizer is more likely to head to relatively flat parts of the space, and large values of \(\alpha_3\) will not help that.  

As well as selecting a more plausible set of parameters, this procedure does seem to help the out of sample error as well. The time series are quite noisy, as you can see, but the error is reduced by around five percent on average in these preliminary runs which is not to be sneezed at. Naturally, the details of the model, time series and choice of parameters for under-promotion will dictate results in any application. A lot more work would be required to better understand the impact, and I imagine the reader would want some more guidance as to choice of under-promotion parameters (which interact with optimization parameters) than I am positioned to provide at this point.  

Also worthy of note, and perhaps somewhat obvious, is that models which are reasonably stable by design, and parsimonious, are unlikely to benefit in a large way from under-promotion. That said, there may be some small benefits even for something as simple as the univariate Kalman filter - the most stable, parsimonious thing I could think of. To test, the script underpromoted kalman pulls electricity data and estimates two parameters (\(\sqrt{P}\) and \(\sqrt{Q}\) corresponding to the standard deviations of process and measurement error noise.

These time series aren't especially well suited to the Kalman filter so to ensure it is not completely misapplied, we add normally distributed noise to the data. Parameters were estimated with and without under-promotion using the first \(50\) data points, with under-promotion using a radius of \(1.0\) and \(\kappa=1/8\). Out of sample error was computed for the remaining \(950\) data points by running the filter forward. I think we pretty much get the null result we are looking for here, however for some types of time series there seems to be a small gain.


Time series

Training error

Prediction error Robust training error Robust prediction error Percentage change in prediction error
electricity 23.473 22.229 23.557 21.955 -1.2
electricity 11.461 19.735 11.489 19.63 -0.5
electricity 9.69 11.778 9.726 11.981 1.7
electricity 14.711 16.665 14.932 15.903 -4.6
electricity 11.558 11.501 11.753 11.9 3.5
electricity 17.097 19.566 17.098 19.683 0.6
electricity 12.487 11.38 12.492 11.499 1.0
electricity 16.474 13.578 16.474 13.577 -0.0
electricity 23.828 19.389 23.828 19.388 -0.0
electricity 11.826 17.43 12.107 13.858 -20.5
electricity 13.447 15.322 13.819 14.823 -3.3

Once again, and as is to be expected, the training error is higher when we use under-promotion. The out-of-sample error is (sometimes) lower. It seems marginal, and yet if we look at the objective function surface, you can see that under-promotion is likely doing something sensible - slightly backing away from the relatively sharp cliffs that occur for small values of \(P\) and \(Q\). This makes it less likely to chase extreme outliers, though I think the case for under-promotion of extremely benign models like this is tenuous.

In cases like these, the utility of under-promotion may be greater if there is some need to directly interpret or use the parameters - as compared with merely using the predictions directly and benefiting from reduced least square error (or not reduced, as the case may be). 

Going for the Pin 

Now let's have a bit of fun.  

Here's two views of objective function that many of you will be familiar with. It shows the average number of shots you might expect to take to finish a golf hole, as a function of where your ball lands - in the vicinity of the green or on it. The pin is at the origin, in the smaller section of the green and protected by two bunkers, one slightly larger than the other. It is a little hard to see but for simplicity I've assigned your probability of sinking a putt equal to \(e^{-d/12}\) where \(d\) is the distance in feet. 

golf-12

Aside: You should probably be flattered by that formula, unless you are in the rarified company of Denny McCarthy - an under-appreciated putting genius whose ability might forever have been obscured had Professor Mark Broadie, of Columbia Business School, not persuaded the PGA Tour to implement the use of value functions in the calculation of golf statistics (more on that in my lay introduction to the topic). Previously, putting statistics might as well have been called "who's best at narrowly missing greens".

denny_mccarthy

 

But enough of that. Let's aim your approach shot. 

In our minds we smooth the objective function \(f\) shown, smooshing it out depending on how errant your shot making is likely to be, and this gives us \(g\). Then we maximize \(g\). We expect the minimum of \(g\) to be away from the origin, where I placed the flag, so as to be further from the bunker. 

What I want to point out is that if the function relating position to expected score is very well behaved, you can't use under-promotion for this purpose - at least not the version I've given. Like the Kalman filter example, the green is very smooth and the function is well behaved near the flag - so it's too hard for me to figure out how to shield it from the optimizer. Perhaps you can. 

However, to make things more interesting, let's pretend a robot is playing golf or you are thinking too much about your swing. We can lift the spatial objection function (shown above) back through some physical simulation of ball flight, all the way back to a parameter space of a robot itself (shown below, sort of). Thus, we obtain an objective function where the variables are not x and y coordinates, but something about the swing mechanics. As a toy example, I created a little model where we choose how much arm and, separately, body rotation to apply. We aim at the pin, but we choose how aggressively to swing - in those two dimensions.   

In this model, the more rotation, the more power. The more power, the more lofted a club we can use and the faster we can stop the ball after we clear the front bunker. However, with power comes risk. The imbalance between rotation of arms and body will lead to a miss. Missing left isn't so bad, as there is plenty of green in that direction. Missing right could be bad news. 

robot-objective

By limiting the decision to two variables, I can present you a plot. In keeping with my convention, this is upside down so when we minimize, we want the highest point. That point, however, is perched perilously because a small loss of coordination may result in a slice into the bunker, whereupon our poor robot is short-sided. As you can see from the script underpromoted_golf I was able to fool scipy.optimize.shgo into finding me a more suitable, conservative choice where the swing speed was a little lower, and where the differential between arms and body favored the fat of the green somewhat more. 

Of course that could have been done by convolving the objective function \(f\) to get \(g\) instead. But that's the point of this article. At least for some examples, it seems that you can hack your way to more or less the same effect by using \(\tilde{f}\) instead - and that computation only involves a small number of evaluations of \(f\). I think I'm scratching the surface here, and there are many variations in the Monkey's Bum opening, as I have called this approach to robust optimization, that involve passing impure functions to optimizers, but that don't involve my particular version of under-promotion. So play with the embarrassingly code as you see fit and feel free to add more. 

 

 

 

   

 

Comments