What would happen if you took the best performing open-source univariate time-series prediction algorithms and weighted them by precision? What if that list was constructed at run time? Perhaps you'd have a "forever" model and of late, several of these ensembles have been charging up the leaderboards. This short note describes their construction and how you might improve on the same.
I think it is well appreciated in the time-series literature, and the statistical literature more broadly, that combining models can be profitable. This is a large topic I won't attempt to survey, but we know that contest leaderboards are often flooded with bagging entries, where a training dataset is bootstrapped and an average of predictions employed. Jiggling historical time-series, and other kinds of data augmentation and pre-processing, help too.
Just to pick a few examples, the mixture-of-experts literature provides numerous suggestions for combining the output of models based on their efficacy. In machine learning, this is sometimes called gating and pooling. Various methods in statistical learning, such as regression and decision trees, also give rise to model combinations. Econometric models have long been mixed (e.g. Gaussian mixture models have been recommended by Eirola and Lendasse, as have mixtures of GARCH models by Haas et al.). Needless to say, boosting has also been used for time-series prediction, with one spin on the theme provided by Karingula et al.
These approaches are interesting, but of sole relevance to us in this post are methods where the combining is performed in an efficient online fashion. I've written about online composition of time-series models previously - see the article Predicting, Fast and Slow. So here I'll stick to the use of ensembles (stacking) of models.
I'll make an engineering distinction between homogeneous and heterogeneous ensembles. As an example of the former, let me describe a model I wrote last weekend. It has been sneaking up the rankings.
I consider a population of ARMA models subject to some evolutionary nudging. If you poke around in skaters.smdk.smdkarma you'll find various species of ARMA ensembles. To unwind the naming, the prediction function smdk_p5_d0_q3_n1000_aggressive maintains 1,000 distinct ARMA models with AR degree at most 5 and MA degree at most 3. As each data point arrives, all models in the collection are updated simultaneously.
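The real implementation lives in the repository, but to convey the lock-step idea, here is a stripped-down sketch (all names hypothetical, and using plain AR models rather than full ARMA) of how a large population can be advanced simultaneously with numpy:

```python
import numpy as np

def update_population(coefs, lags, y):
    """Advance a population of AR models by one observation.

    coefs : (n_models, p) array of AR coefficients, one row per model
    lags  : (n_models, p) array of each model's last p observations, newest first
    y     : the newly arrived data point
    Returns the one-step-ahead predictions made *before* seeing y,
    the per-model errors, and the shifted lag windows.
    """
    preds = np.einsum('ij,ij->i', coefs, lags)   # each model's prediction (row-wise dot)
    errors = y - preds                           # per-model residuals
    lags = np.roll(lags, 1, axis=1)              # shift every lag window right
    lags[:, 0] = y                               # insert the new observation
    return preds, errors, lags

# Tiny demo: 1,000 randomly initialized AR(5) models tracking a sine wave
rng = np.random.default_rng(0)
coefs = rng.normal(scale=0.1, size=(1000, 5))
lags = np.zeros((1000, 5))
for y in np.sin(np.arange(50) / 5.0):
    preds, errors, lags = update_population(coefs, lags, y)
```

The point is that the per-step cost is a handful of vectorized array operations, regardless of how many models are in the herd.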
Even though this is Python, there isn't a huge hit for maintaining such a large collection. We benefit from numpy's efficient multi-dimensional array operations, and I benefited from the template provided by Otto Seiskari's simdkalman package.
I chose a differential evolution step to prune the population of poor-performing models, and create new ones that are similar to those in the top-performing decile (don't quote me on that, read the code). At the time of writing, the crossover takes one of four flavors, depending on whether AR, MA, measurement error, or process variance are being modified. Once a new ARMA model is created, it is protected for a fixed period of time to allow it to warm up. After that, cue the David Attenborough voiceover as the vicious culling of the herd gets underway.
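Don't take the following as the actual crossover logic (read the code for that), but the cull-and-respawn cycle might look schematically like this, with all names hypothetical:

```python
import random

def evolve(population, scores, protected, decile=0.1, jitter=0.05):
    """One evolutionary step over a population of model parameter dicts.

    population : list of parameter dicts, e.g. {'ar': [...], 'ma': [...]}
    scores     : matching list of accuracy scores (lower is better)
    protected  : set of indices still in their warm-up period
    The worst performers (outside the protected set) are replaced by
    jittered copies of models drawn from the top-performing decile.
    """
    n = len(population)
    order = sorted(range(n), key=lambda i: scores[i])   # best first
    elite = order[: max(1, int(decile * n))]            # top decile survives intact
    doomed = [i for i in order[int(0.9 * n):] if i not in protected]
    for i in doomed:
        parent = population[random.choice(elite)]
        # Crude "crossover": perturb every parameter of a randomly chosen elite
        child = {k: [c + random.gauss(0, jitter) for c in v]
                 for k, v in parent.items()}
        population[i] = child
    return population
```

In the real thing the perturbation takes different flavors depending on which parameter group (AR, MA, measurement error, process variance) is being modified, and newly spawned models would be added to the protected set for their warm-up period.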
This is just one example of a model that is reasonably fast - compared to fitting an auto-arima at each time step, say. Indeed it is certainly fast enough that it can itself be used inside a bigger ensemble - and that's the point. It doesn't have to work all the time to add value. Perhaps it could be a complement to something you come up with.
With that in mind, I'll next point you to a couple of tools that should make the combining of very different models pretty straightforward. If there are no particular gains to be made from data-parallelism or what-have-you, then there really aren't too many prerequisites for the ensembling of heterogeneous models in real-time.
However, we need to know how accurate they have been for recent data points. I've provided a function called a parade that is rather handy in this respect, for it tracks k-step ahead predictions and their accuracy against incoming data. Also, we need a few minor conveniences to track running estimates of bias, squared error and if necessary, higher moments. For this, I wrote the momentum functions because I couldn't find a minimalist package implementing them. (Some readers may prefer a more object-oriented version of the same thing here). Building on this, some tools for combining skater functions are provided in the skatertools/ensembling modules that should make it easy to create ensembles of prediction functions.
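The actual parade and momentum utilities are in the package; a stripped-down illustration of the bookkeeping they perform (class and method names here are hypothetical) might look like:

```python
from collections import deque

class MiniParade:
    """Hold k-step-ahead predictions until their targets arrive, then
    maintain running estimates of bias and mean squared error."""

    def __init__(self, k):
        self.k = k
        self.pending = deque()   # predictions still awaiting their targets
        self.n = 0
        self.bias = 0.0
        self.mse = 0.0

    def step(self, prediction, y):
        """Register a new k-step-ahead prediction and absorb arrival y."""
        self.pending.append(prediction)
        if len(self.pending) > self.k:
            matured = self.pending.popleft()          # made k steps ago, testable now
            err = y - matured
            self.n += 1
            self.bias += (err - self.bias) / self.n   # running mean error
            self.mse += (err ** 2 - self.mse) / self.n  # running mean squared error
```

Higher moments, or exponentially discounted versions of the same, follow the same incremental pattern.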
As an aside, if you are not familiar with what I call "skater" functions, I'll refer you to the README.md or basic usage. Morally speaking, skaters are merely k-step ahead forecast functions that take in a current data point and state, and emit k-step ahead forecasts and posterior state. Unlike other time-series packages, there is no setup or ceremony. If you look at the way moving average ensembles are constructed (probably the simplest example) you'll see this in the code:
That's a one-line ensemble of the more basic moving average forecast functions. It uses the ensembling skater found here and you may do with that what you wish. (I note, however, that this only works for combining honest skaters that report earnest estimates of their own inaccuracy. Sheer laziness on my part.)
Feedback on the style is welcome. To be honest, I decided when I was creating this package to completely avoid the use of classes. It started as a perverse intellectual exercise but I think it has sort of worked out. I'll beg you to read the FAQ before complaining too bitterly about my overburdening of the humble Python function. And yes, I'm scratching the surface when it comes to ways of stacking models.
There's some inspiration in a recent paper by Yao, Pirs, Vehtari and Gelman (arXiv) on this topic - just to pick one example. It should be straightforward to modify the "complete pooling stacking" (see precision_weighted_skater) to perform partial-pooling stacking. As the authors point out, some models are useful somewhere.
But I know what you are thinking. If stacking is easy, why not just automate the selection of the best and fastest models that go into the ensembles? Why not have machine readable leaderboards to facilitate that? Then one could create some ensembles that get better over time as new methods are included in the mix by open source contributors.
Of course, that's already there in the code for so-called Elo ensemble skaters - time-series algorithms that are ensembles of those with high Elo ratings. Indeed as you can see from eloensemblefactory.py there are even some models that check the live Elo ratings before deciding which sub-models to use in their ensembles. You can do the same using the top_rated method in the recently created recommendations module of the timemachines package.
As you can also read in the code, the models prefixed by elo_ appearing on the leaderboards will typically use a discounted running empirical variance, and will form a combination of the k-step ahead forecasts weighted by some power of statistical precision. They usually only ensemble fast models that I wrote about in the post titled Fast Python Time-Series Forecasting but that doesn't preclude using libraries like statsmodels, pmdarima, pydlm or sktime. It probably does exclude neuralprophet or tbats.
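The exact discounting and exponents are in the code, but the gist of a discounted running variance and a precision-weighted combination can be sketched in a few self-contained lines (names hypothetical):

```python
def discounted_var_update(mean, var, err, rho=0.99):
    """Exponentially discounted running mean and variance of forecast errors.
    rho close to 1 means a long memory; smaller rho forgets faster."""
    mean = rho * mean + (1 - rho) * err
    var = rho * var + (1 - rho) * (err - mean) ** 2
    return mean, var

def precision_weighted(forecasts, variances, power=1.0):
    """Combine forecasts weighted by (1/variance)**power, normalized.
    power=0 gives an equal-weight ("balanced") average; larger powers
    lean harder on the historically most accurate models."""
    weights = [1.0 / max(v, 1e-12) ** power for v in variances]
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, forecasts)) / total
```

The `power` parameter is, roughly, the only dial separating the "balanced" and "precision" flavors discussed below.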
Though I'm yet to try it, one could also treat these models as features for downstream combination that is more complex than a simple precision weighting. As a further comment, I note that should you wish to hard-wire the component models to avoid HTTP latency, that's fine, and I'll remind you there is a colab notebook for the recommendations.
Here are results that are so hot off the press I may not have pushed them to the timeseries-elo-ratings repository yet. Below we have the leaderboard for 5-step ahead prediction. You'll notice that Elo ensembles occupy first, second, and fourth place amongst the super-speedy algorithms (taking less than 1 second to run fifty forecasts).
They outperform sktime's implementation of the theta method, which is by no means a terrible benchmark. Similarly, if we consider predictions 21 steps in advance, we again see the march of the Elo ensembles. And it doesn't seem to matter too much how these are combined. The only difference between "balanced" and "precision" ensembles in the table below is the exponent applied to inverse empirical variance.
Interestingly, although we are predicting "normal" time-series (as compared to model residuals), the Elo combination of the best performing model residual predictors (called elo_faster_residual_balanced_ensemble) is doing the best. I'm always quick to add that the Elo ratings are quite noisy, but this is nonetheless a promising initial finding and not altogether surprising.
I'm tempted to say that the Elo ensemble skaters are your forever prediction function. The Elo ensemble skaters are like any other skater in the timemachines package, so they are easy to use so long as you are connected to the internet. They are smart enough to know not to include methods that use libraries you haven't installed yet. As noted in the timemachines README.md, you might want to pip install a subset of the following before firing up an Elo ensemble skater.
There you have it. The only prediction function you'll ever need? Well, that's probably an exaggeration. For one thing, we're only discussing univariate prediction here. Nonetheless, you'd be nuts not to glue elo_faster_residual_balanced_ensemble or similar on the back of whatever predictions you are currently using. Because unlike those, elo_faster_residual_balanced_ensemble will get steadily better over time.
I'm sure I'll introduce a more terse and catchy name for the forever function, but it's a little too early to suggest which of the Elo ensembles is likely to provide the greatest benefit to the largest number of users. In fact, I'd prefer it if someone were to come up with a better one. Read CONTRIBUTING.md if you are interested in making clever combinations of the existing algorithms.
One aspect of this that I forgot to mention (until prompted by a comment by Kevin No - thanks) is that the data used to calculate the Elo ratings is live and open to improvement. That's also mentioned in the contributor guide and there are plenty of resources at microprediction.com including a video tutorial in the knowledge center on creating a data stream. As you can see there, this is pretty much a one-liner:
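From memory (check the microprediction documentation for the current API, and note WRITE_KEY is a placeholder for your own key), the one-liner looks something like:

```python
from microprediction import MicroWriter

mw = MicroWriter(write_key=WRITE_KEY)            # a key of sufficient difficulty
mw.set(name='my_stream.json', value=3.14159)     # creates the stream if it doesn't exist
```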
after you instantiate an instance of the MicroWriter class. Reach out to us for a key that is strong enough to create data streams, if you don't care to burn one yourself.