In this week's David versus Goliath battle, not one but two popular time series prediction packages, NeuralProphet and Prophet, face off against a couple of home-spun methods from the little-known TimeMachines package. Everyone loves a horse race, but this post also provides an introduction to some Python one-liners you can use to potentially improve the accuracy of any forecasting model. In particular, I'll be explaining the function called thinking_fast_and_slow.
Let's meet the contenders, grouping by package.
I've already gone to great lengths to describe Facebook's Prophet library, top weight for our race today. Recently it has been renamed prophet, from fbprophet, and reached version 1.0 after 15 million downloads or so. It is a huge success in terms of usage and, as I previously noted, it tops the chart compiled here. If you haven't read my article about Prophet, suffice to say that the package employs an easily interpreted, three-component additive model whose Bayesian posterior is sampled using Stan.
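For the curious, that model takes the form y(t) = g(t) + s(t) + h(t) + ε_t, where g is the trend, s the seasonality, and h the holiday effects.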
Facebook Prophet is intended to be used out of the box, but for this race I have also included some variants on it that were motivated in my glowing review. For example, I composed Prophet with a model or two that tries to predict Prophet's errors (see timemachines/prophskaterscomposed). I'll go into more detail below if any of them do well.
Inspired by the ease of use of Prophet, the NeuralProphet package aims to bring neural network forecasting to the masses. It builds on a paper by Oskar Triebe, Nikolay Laptev and Ram Rajagopal that hopes to combine the best of traditional statistical models with neural networks. For a new library it is popular, with about 20,000 downloads per month, though that is at least an order of magnitude fewer than fbprophet. So perhaps the branding will propel it further, especially if it performs well.
While NeuralProphet takes care of training for you, it does ask for a specification of the number of lags to use (essentially the order of the AR model). I've therefore enumerated some choices using the first few Fibonacci numbers, as sketched below. We have a NeuralProphet model with 1 lag, another with 2, and so forth up to 8 lags.
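As a rough illustration, here is how one might enumerate those variants directly with the neuralprophet package (the timemachines wrappers do the equivalent internally; n_lags is NeuralProphet's name for the autoregressive order):

```python
# Illustrative sketch: one NeuralProphet model per Fibonacci choice of lag.
# Assumes the neuralprophet package; n_lags sets the autoregressive order.
from neuralprophet import NeuralProphet

FIBONACCI_LAGS = [1, 2, 3, 5, 8]

models = {n_lags: NeuralProphet(n_lags=n_lags) for n_lags in FIBONACCI_LAGS}
```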
The timemachines library is my own baby. And it really is a baby. The package was conceived as a means of comparing time-series packages such as statsmodels.tsa, prophet, pmdarima, and so forth, and providing a simple functional interface to the same.
However, in the process of performing this comparison, I found it was quick work to create a few simple approaches that stand alone. In particular, I'll be using four similar models called thinking_fast_and_fast, thinking_fast_and_slow, thinking_slow_and_fast, and thinking_slow_and_slow. They are all found in timemachines/thinking. I'll explain them below, if they perform well. I also included some simple ensembles of constant-parameter ARIMA models found in tsa/tsaconstant.
This race never ends. It is a series of head-to-head matchups in which a segment of a time series is presented to a pair of algorithms and they both try to predict it. Whoever is closest by RMSE wins; if it is really close, a draw is declared. An update is made to each algorithm's Elo rating accordingly.
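The rating update itself is the standard Elo scheme. A minimal sketch, where the K-factor and the scoring of a draw as half a win are my own illustrative choices:

```python
# Standard Elo update for a single matchup between algorithms A and B.
# score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
def elo_update(rating_a, rating_b, score_a, k_factor=25):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    rating_a += k_factor * (score_a - expected_a)
    rating_b += k_factor * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b
```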
In this particular exercise, we use only 450 historical data points at a time. The first 400 are considered a burn-in period, during which the model will train. Then, the next 50 data points will be used to assess out-of-sample performance.
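In code, the assessment looks roughly like the following sketch. It uses the timemachines convention that a model (a "skater") is a plain function returning a k-vector of forecasts and carrying its own state; the timing bookkeeping here is simplified:

```python
# Sketch of the 400-point burn-in / 50-point assessment described above.
# f is any skater: a function with signature x, x_std, s = f(y=y, s=s, k=k).
def assess_rmse(f, ys, k=1, burn_in=400):
    s, squared_errors = {}, []
    for t, y in enumerate(ys):
        x, x_std, s = f(y=y, s=s, k=k)       # x holds the 1..k step forecasts
        if t >= burn_in and t + k < len(ys):
            squared_errors.append((ys[t + k] - x[-1]) ** 2)  # k-step error
    return (sum(squared_errors) / len(squared_errors)) ** 0.5
```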
The time series is chosen randomly from the collection of a thousand or so at Microprediction.Org. Here are some categories, with an example of each, to give you a sense.
| Description | Example |
| --- | --- |
| Radioactivity | queens_110_and_115 |
| Wind | noaa_wind_speed_46061 |
| Traffic | triborough |
| Emoji usage | medical masks |
| Cryptocurrencies | ripple price changes |
| Crypto + exchange rates | btc_eur |
| Airport wait times | Newark Terminal A |
| Atmospheric | ozone |
| Futures | sugar price changes |
| Chess ratings | Hikaru Nakamura |
| Hacker News comments | comment counts |
| Tides | water level |
| Stock indexes | euro_stoxx |
| Govt bonds | 30_year changes |
| Hospital wait times | Piedmont |
| Electricity (NYISO) | overall |
The time series vary quite a lot in their characteristics. I'll be brief because I've detailed these elsewhere, but to make the point, notice that this electricity usage time series exhibits some pretty serious momentum and time-of-day effects, as we'd expect (see also the electricity competition page, by the way).
On the other hand, there are also plenty of financial time series, and differenced time series, such as this stream.
Clearly, these pose quite different challenges. Sometimes the challenge is knowing when not to overfit.
A final remark: it isn't easy for a model to memorize structure because the data is constantly being refreshed. For instance, some data comes from weather stations like this, or sites like EmojiTracker (epilepsy warning), giving rise to a stream like this one showing the number of people tweeting "face with medical mask" in the last minute or so.
I think you get the idea. Real instrumented things. Real data. No tears.
If each download were a wager placed in a parimutuel, the TimeMachines package might be 1000/1. Given the overwhelming popularity of Facebook Prophet, the bookies would have to be extremely defensive.
I'll break it down by how far ahead we are forecasting.
Turning to the current leaderboard for 1-step ahead prediction, we see that amongst the contenders I've listed, the best performing models are the ensembles - though actually they get pipped by a statsmodels implementation with order (3,0,1). The NeuralProphet model that uses a single lag does okay too, though with any other choice of lag it drops far down the leaderboard.
It's pretty clear that the options provided by TimeMachines are a good bet, especially when set against the popularity of Prophet and NeuralProphet. Unless you have a crystal ball for predicting which ARIMA model will be best, using some kind of ensemble (it need not be EMA) might be wise. You can make your own very easily using the ensemble tools.
I guess you are wondering what happened to Facebook Prophet? I'm afraid it is not in the frame. You can scroll down the leaderboard and find various flavors of this forecaster, but they all perform terribly.
To pick another example, let's consider prediction much further into the future of a univariate time-series. The leaderboard for models predicting 21-steps in advance is also topped by a specific choice of ARIMA model - in this case order (2,0,1). However, it might have been difficult to foresee that.
All other orders are beaten out (for now) by thinking_fast_and_slow, thinking_slow_and_fast, and thinking_slow_and_slow. The models in this leaderboard can be called using the TimeMachines package syntax, but I've highlighted these particular models because they are home-spun. I'll get to what these slow and fast models are in a moment, but first observe that NeuralProphet has fallen out of the running completely. This made me worry that I had some terrible bug, and maybe you can find it in my wrapping of NeuralProphet.
I suppose it is also possible that popularity has nothing to do with accuracy.
I dare say the ink should not dry on this one, nor on any empirical work if it can be continuously updated instead. Readers are also welcome to add to the list of time-series that get used. See get-predictions for instructions on creating a new stream of live data at microprediction.org.
However, I would say the results are "interesting". So much so that I added a roulette wheel to the listing of popular time-series packages, because I've come to believe that randomly choosing a package is a better policy than allowing a predominance of articles (or Google search, for the same reason) to guide your choice. There might be a vast pool of largely unquestioning humans who want to pretend they are doing statistics. There is a wisdom-of-the-minority model (see Callander and Horner, pdf) that might even support this theory. I speculate wildly.
In my list of time-series models (go on, try out the new roulette wheel), the least-used packages tend to be cutting-edge research, whereas the most popular tend to be those most written about. That writing is mostly by data scientists finishing their first boot camp, it would seem, but don't worry, I have a cunning plan to change all that.
Now, the thinking_fast_and_slow model and its ilk would not be described as cutting edge, but since they are apparently more reliable than popular alternatives, not to mention thousands of times more computationally efficient, I will describe their simple mechanism.
For the first step, I used my own implementation, empirical_ema, because it is self-aware in the sense that it tracks discounted moments of its own empirical errors. I only used two moments, but you can easily add skew and kurtosis if you wish.
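A sketch of the idea, with illustrative smoothing constants (the real empirical_ema lives in the timemachines repo and differs in detail):

```python
# Sketch of an EMA skater that tracks discounted moments of its own errors.
def ema_with_error_moments(y, s, k=1, r=0.1):
    if not s:
        s = {'x': y, 'error_mean': 0.0, 'error_var': 1.0}
    error = y - s['x']                            # error of the last prediction
    s['error_mean'] = (1 - r) * s['error_mean'] + r * error
    s['error_var'] = (1 - r) * s['error_var'] + r * error ** 2
    s['x'] = (1 - r) * s['x'] + r * y             # update the moving average
    x_std = s['error_var'] ** 0.5
    return [s['x']] * k, [x_std] * k, s           # flat k-step forecast
```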
For the second step, I used the hypocratic_residual_factory, which presents you with a one-line method of stacking a careful residual prediction on the back of any model. In turn, that uses the residual chaser factory and tells it to use a quickly_hypocratic or slowly_hypocratic prediction.
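Stripped of the factory machinery, the stacking pattern looks something like the following. The names and the exact residual bookkeeping here are mine, not the library's:

```python
# Illustrative residual-chasing pattern: a second skater predicts the first
# skater's errors, and its cautious forecast is added back in.
def chase_residuals(f_base, f_resid, y, s, k=1):
    if not s:
        s = {'base': {}, 'resid': {}, 'prev_x': None}
    x, x_std, s['base'] = f_base(y=y, s=s['base'], k=k)
    residual = (y - s['prev_x']) if s['prev_x'] is not None else 0.0
    r, r_std, s['resid'] = f_resid(y=residual, s=s['resid'], k=k)
    s['prev_x'] = x[0]                    # remember the 1-step forecast
    x_adjusted = [xi + ri for xi, ri in zip(x, r)]
    return x_adjusted, x_std, s
```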
Hypocratic predictions are my terminology for cautious predictors (see code examples) that assume zero is a pretty good prediction. They take a model prediction and empirical errors of the same, and they strongly shrink the prediction towards zero. There are plenty of ways to do that, but I was content with the use of tanh, as shown in simple/hypocratic.py.
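One plausible version of that shrinkage, not necessarily line-for-line what simple/hypocratic.py does:

```python
import math

# Illustrative "hypocratic" shrinkage: predictions that are small relative
# to typical empirical errors get pulled strongly towards zero, while large
# confident predictions pass through almost unchanged.
def hypocratic(prediction, error_std):
    if error_std <= 0:
        return prediction
    return prediction * math.tanh(abs(prediction) / error_std)
```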
Ensembling is even easier. The ensembling utilities are demonstrated by the mixtures of tsa models (scroll down here). One supplies a list of "skaters" to be ensembled to one of several ensembling mechanisms. Here is one possibility.
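A hedged sketch of an equally weighted mixture, in the spirit of the library's ensembling helpers (which offer more than this, and whose names may differ):

```python
# Illustrative equally weighted ensemble of skaters.
def mean_ensemble(fs, y, ss, k=1):
    xs, stds, new_ss = [], [], []
    for f, s in zip(fs, ss):
        x, x_std, s = f(y=y, s=s, k=k)    # run each constituent skater
        xs.append(x)
        stds.append(x_std)
        new_ss.append(s)
    x_mean = [sum(col) / len(col) for col in zip(*xs)]      # average forecasts
    std_mean = [sum(col) / len(col) for col in zip(*stds)]  # and their stds
    return x_mean, std_mean, new_ss
```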
If you are interested in helping: I've made some effort to do a few things differently in the timemachines package, and these are explained in the repo. But since I'm yet to introduce that package on this blog, here's a summary of the goals, copied from the README. I wanted to use popular (and unpopular) forecasting packages with one line of code, and view their Elo ratings on an ongoing basis.
- Simple canonical use of some functionality from packages like fbprophet, pmdarima, tsa, and their ilk.
- Simple, ongoing empirical evaluation. See the leaderboards in the accompanying repository timeseries-elo-ratings. Assessment is always out of sample and uses live, constantly updating real-world data from microprediction.org.
- Simple k-step ahead forecasts in functional style involving one line of code (see the sketch after this list).
- Simple tuning with one line of code facilitated by HumpDay, which provides canonical functional use of scipy.optimize, ax-platform, hyperopt, optuna, platypus, pymoo, pySOT, skopt, dlib, nlopt, bayesian-optimization, nevergrad and more.
- Simple evaluation with one line of code using metrics like RMSE or energy distances.
- Simple stacking of models, as noted, with one line of code. The functional form makes other types of model combination easy as well, such as ensembling.
- Simpler deployment. There is no state, other than that explicitly returned to the caller. For many models, state is a pure Python dictionary and thus trivially converted to JSON and back.
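To make the "one line of code" claim concrete, here is roughly what usage looks like (the import path reflects the repo layout at the time of writing and may change):

```python
# Sketch of the functional, one-line-per-observation usage pattern.
from timemachines.skaters.simple.thinking import thinking_fast_and_slow

ys, s = [3.1, 3.4, 3.3, 3.9, 3.6], {}
for y in ys:
    x, x_std, s = thinking_fast_and_slow(y=y, s=s, k=3)  # 3-step ahead
print(x)      # three point forecasts
print(x_std)  # and accompanying standard error estimates
```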
End of advertisement. It perhaps goes without saying that any time-series package written in this format can be trivially introduced into the prediction network (using the StreamSkater class) so that anytime anyone publishes a data feed, it can be attacked. As usual, I'll close by asking you to consider following microprediction on LinkedIn to justify my existence.