The Future of Forecasting Competitions, According to the Experts

Published on September 14, 2021

Update: Oct 7, 2021

The details of the M6 are out. See my article for unsolicited advice on how to win. 


I draw the reader's attention to a recent paper titled "The future of forecasting competitions: Design attributes and principles" by Spyros Makridakis, Chris Fry, Fotios Petropoulos, and Evangelos Spiliotis. Those of you who have participated in the M5 and other competitions, or who just like time series, will recognize the authors and understand why they are uniquely placed to opine on this subject. The paper also draws on the views of other time-series researchers, such as Rob Hyndman, who requires no introduction.

Contests are not side shows. They are, according to David Donoho, the "secret sauce" of prediction culture and a major catalyst for the machine learning revolution. The paper describes design attributes in ten categories that forecasting competitions might aspire to. This post considers the extent to which the ongoing live "microprediction" forecasting competitions might inform that discussion, or even achieve some of those objectives. 

1. Scope

As regards the focus of the competition and the type of submissions, the main recommendation made is that submissions should include uncertainty - and not just comprise point estimates.

It is good to see contest experts saying this outright. Based on my experience running distributional forecasting competitions I agree ... although it does make it harder for both participants and the beleaguered platform developer.

There are other benefits to distributional contests that might not be immediately apparent, depending on how they are run, including stateless combination and rewarding of contributions. There are many ways to achieve this, and one implemented approach is described in detail here.

The only caveat to this recommendation that I see is the possibility of smart-contests and specialization. Most contests are single-layered, like perceptrons, and don't permit a chain of transformations to evolve. But if they do, it can well be the case that a point estimate, or perhaps a point estimate combined with a scale number (informed by confidence intervals) can be interpreted as an affine transformation of incoming data that helps someone else predict well. 

One could similarly make the argument that a sufficiently clever contest-like mechanism should be able to glean information from submissions whether or not a proper scoring rule is applied (since it can account for the gaming). That might facilitate more entries from uncalibrated but nonetheless useful algorithms and people. 

2. Diversity and representativeness

The authors state that whether the focus of a competition is generic or not, it is important that the events considered have a reasonable degree of diversity that will allow for generalization of the findings and insights obtained.

I think this is a motherhood issue and agree it is especially important to create stern tests for allegedly autonomous methods. Claims have been made about software that is able to adapt to new data without human involvement, or "outperform human forecasters". 

My experience is that this doesn't stand up when the programs are challenged with a rich array of time-series - some with high signal-to-noise ratios and some low; some with intermittent predictability; some the residuals of other time-series; some mean-reverting and some not; some slow-moving and some fast-moving; some continuous and some discrete, and so forth. 

That objective is met to some extent given that streams for radio activity, wind, traffic, emoji use, cryptocurrencies, airport wait times, atmospheric measurements, chess ratings, tides, stock indexes, hospital wait times, government bonds, electricity use, and so on are already the subject of competitive prediction (see the stream listing). 

But I think it is just a start. I would encourage others to use the self-service contest creation capability (that also provides free prediction). I sometimes wonder if I'd have more success if I charged for that! The openness is what might eventually help us meet the diversity objective in a really convincing way.  

3. Data structure

The authors of The Future of Forecasting Competitions suggest that it is important that the explanatory/exogenous variables used for producing the forecasts will only refer to information that would have been available at the time the forecasts were produced. (The reader need only google "data leakage Kaggle" to appreciate the issue).
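One way to make the "available at the time" requirement concrete is to record, for every exogenous observation, when it was actually published, and to filter on that rather than on the time of the underlying event. The following is only a minimal sketch: the `Observation` record and the `known_at` helper are hypothetical names invented for illustration and come from neither the paper nor the microprediction client.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Observation:
    """Hypothetical record for one exogenous data point."""
    name: str
    value: float
    event_time: datetime      # when the measured event occurred
    available_time: datetime  # when the value was actually published


def known_at(observations: List[Observation], forecast_time: datetime) -> List[Observation]:
    """Keep only observations that had been published by forecast_time."""
    return [o for o in observations if o.available_time <= forecast_time]


obs = [
    Observation("temperature", 21.5, datetime(2021, 9, 1, 12), datetime(2021, 9, 1, 12, 5)),
    # A revised figure for the same hour, published a day later. Using it for
    # a forecast made at 13:00 on Sept 1 would be leakage.
    Observation("temperature", 21.9, datetime(2021, 9, 1, 12), datetime(2021, 9, 2, 12)),
]

usable = known_at(obs, forecast_time=datetime(2021, 9, 1, 13))
print([o.value for o in usable])
```

Filtering on `event_time` instead would let the revised value through, which is exactly the kind of leak that is hard to spot in a static dataset.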

I hope it goes without saying that the use of a live format helps prevent data leakage. It is a strong motivation for live contests, despite the limitation of scope that involves. 

However, I would comment that nobody is prevented from publishing streams of delayed or historical data using the microprediction API or client. That's considered a feature, not a bug, because the microprediction site isn't intended to serve contests for contests' sake (as partly explained in Dorothy, You're Not In Kaggle Anymore, but also in the business applications explained here). Indeed the API can be used to convert intermittent data into on-demand estimates, noisy data into clean data, and weak indications of truth into stronger ones. It just takes imagination.

However, the contest goals are easily met too. My advice to researchers looking to use microprediction as a platform for contests (as opposed to sourcing obscure bespoke predictions because you really need them) is to use a dedicated write_key to create a range of challenges that are all truly live - thereby avoiding data leakage altogether. Results are aggregated across sponsor identities. 

4. Data granularity

The authors note that forecasting competitions have been increasingly focused on short-term and higher-frequency prediction. I would agree that it is important to consider the domain and not over-sell.

The way I come at it, a contest is a way of suggesting that it is possible to mechanically assess and differentiate (and hopefully combine) models. That simply isn't always true. If we are going to remove the human generalized intelligence from the model assessment then we need to be very careful about the domain of problems where this is likely to be effective. 

I am not suggesting any such arrangement for longer-term forecasting, but I do think we have a chance of meaningfully assessing autonomous time-series algorithms in the fast domain. The very name microprediction is an attempt to de-anchor people from singular, longer-term prediction: it implies many predictions. That matters because the authors suggest that it should be practically impossible for anyone to win by making random choices. Or as the Fanduel rent-a-protesters once chanted in the streets of lower Manhattan: "Game of Skill, Game of Skill".

5. Data availability

This refers to the amount of information provided by the organizers. In regard to exogenous data, the authors note that in the era of big data and instant access to many publicly available sources of information, participants are usually in a position to gather the required data by themselves, but also to complement their forecasts by using any other publicly available information.

This is the most important statement in the paper, in my humble opinion. Obviously, I would agree and indeed one of the primary advantages of live forecasting contests (for the creator of the data stream) is the possibility of making exogenous data discovery part of the contest - as compared with punishing people for "cheating" when they find it. 

Now close your eyes and imagine that anyone and everyone does this, and that all the exogenous data feeds to all the other contests run by everyone else. Once you grok that real-time prediction is inherently collective, you might want to join our slack channel (invite here).   

6. Forecasting horizon

The authors make the point that averaging across forecasting horizons may not be the most informative way to aggregate. Again it is hard to argue with that. I would suggest that those establishing streams with the intent of running a forecasting contest use the leaderboard API or Python client to gather the individual credits gained on each horizon, and aggregate as they see fit (see the get_leaderboard methods). As a practical matter, I'm not sure how many will do this, but the authors' advice is certainly worth keeping in mind.
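To illustrate why a single average can mislead, here is a small sketch that works on per-horizon credits once they have been gathered. The participant names, horizons and numbers are made up, and the dictionary layout is an assumption for illustration rather than the format returned by the leaderboard API.

```python
from statistics import mean

# Hypothetical credits per participant, keyed by forecast horizon in seconds.
# In practice these would be gathered from the leaderboards; the values and
# the layout here are invented for the example.
credits = {
    "alice": {70: 12.0, 310: 3.0, 910: -9.0},
    "bob": {70: 1.0, 310: 2.0, 910: 3.0},
}


def average_across_horizons(credits):
    """Collapse each participant's credits into one number."""
    return {who: mean(by_h.values()) for who, by_h in credits.items()}


def winner_by_horizon(credits):
    """Report the top participant separately for each horizon."""
    horizons = sorted({h for by_h in credits.values() for h in by_h})
    return {
        h: max(credits, key=lambda who: credits[who].get(h, float("-inf")))
        for h in horizons
    }


print(average_across_horizons(credits))  # {'alice': 2.0, 'bob': 2.0}
print(winner_by_horizon(credits))        # {70: 'alice', 310: 'alice', 910: 'bob'}
```

The two participants are indistinguishable on average, yet one is clearly better at short horizons and the other at the longest one, which is the information an organizer would probably want to keep.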

7. Evaluation setup

The authors note that it is most common to use historical data and conceal part of it. However, they give the example of electricity load forecasting, where three strong seasonal patterns are typically observed across the year, so evaluation on only one particular day or week is invalid.

The authors suggest a rolling competition, where more data is revealed over time. However, the authors are also completely correct to point out that this requires more input and energy from the participants. Rolling origin competitions exhibit higher dropout rates. The authors suggest that participants might provide code instead, though this also requires energy on an ongoing basis since presumably the models must be updated epoch by epoch.  
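For readers who have not run one, a rolling-origin evaluation can be sketched in a few lines. This is an illustration under simple assumptions (one-step-ahead forecasts, absolute error, two toy forecasters) and is not the procedure of any particular competition; all function names are hypothetical.

```python
from typing import Callable, List, Sequence


def rolling_origin_errors(
    series: Sequence[float],
    forecaster: Callable[[Sequence[float]], float],
    first_origin: int,
) -> List[float]:
    """One-step-ahead absolute errors, re-forecasting as each point is revealed."""
    errors = []
    for origin in range(first_origin, len(series)):
        history = series[:origin]         # only the data revealed so far
        prediction = forecaster(history)  # forecast for series[origin]
        errors.append(abs(series[origin] - prediction))
    return errors


def last_value(history: Sequence[float]) -> float:
    """Naive forecaster: repeat the most recent observation."""
    return history[-1]


def running_mean(history: Sequence[float]) -> float:
    """Forecaster that predicts the mean of everything seen so far."""
    return sum(history) / len(history)


series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
for f in (last_value, running_mean):
    errs = rolling_origin_errors(series, f, first_origin=3)
    print(f.__name__, sum(errs) / len(errs))
```

Each forecaster is evaluated at several origins rather than one, which addresses the seasonal objection, but the loop also makes the cost visible: the model has to be invoked again every time new data arrives.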

Without providing examples (hello guys!) the authors go on to suggest that competitions might take place on a live basis, with forecasts being evaluated against the actual data once they become available. A splendid idea indeed, and you'll find open source code for achieving that in the various microprediction repositories.

They again point out the major advantage: participants can incorporate current information into their forecasts in real time, meaning that the data and external variables could be fetched by the participants themselves based on their preferences and methods used. I hope this is one of those dangerous ideas that refuse to die.

They also point out that information leakage about the actual future values becomes impossible and the competition represents reality perfectly. Indeed true, although as an erstwhile founder of a data company, I'd be happy to discuss minor caveats to "represents reality perfectly" over a cup of tea.  

8. Performance measurement

The authors recount the problems that arise when point estimates are contributed, ameliorated somewhat by the use of proper scoring rules. This is such a deep topic I won't comment on it here, but the challenge of using point estimates is expressed by the cover picture on this blog article.  
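A small numerical sketch may still help show why point estimates are awkward to judge: the "best" single number depends entirely on the loss that is applied. The outcomes below are made up, and the grid search is a deliberately crude stand-in for proper optimization.

```python
def pinball_loss(y: float, q_forecast: float, tau: float) -> float:
    """Quantile (pinball) loss for a forecast of the tau-quantile."""
    diff = y - q_forecast
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def mean_loss(outcomes, forecast, loss):
    """Average loss of one point forecast over a set of outcomes."""
    return sum(loss(y, forecast) for y in outcomes) / len(outcomes)


# A skewed set of made-up outcomes: mostly small, occasionally large.
outcomes = [1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 22.0]
candidates = [x / 2 for x in range(0, 45)]  # 0.0, 0.5, ..., 22.0

best_abs = min(candidates, key=lambda f: mean_loss(outcomes, f, lambda y, g: abs(y - g)))
best_sq = min(candidates, key=lambda f: mean_loss(outcomes, f, lambda y, g: (y - g) ** 2))
best_q75 = min(candidates, key=lambda f: mean_loss(outcomes, f, lambda y, g: pinball_loss(y, g, 0.75)))

print(best_abs, best_sq, best_q75)  # 2.0 5.0 4.0
```

Absolute error rewards the median (2.0), squared error rewards the mean (5.0), and the pinball loss at 0.75 rewards the upper quartile (4.0). A participant asked for "a forecast" without being told the loss cannot know which of these to submit, whereas a distributional submission contains all three.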

9. Benchmarks

The authors suggest that selecting benchmarks is very important.

I suppose that depends on what your purposes are, and actually, that might not matter to everyone. It seems to beg the real question, "why should the organizer need to think about benchmarks?", and a possible answer is "because they aren't made convenient enough for a participant to use", since no two time-series packages use the same set of conventions. It's a thorny one. Nobody wants to impose a draconian best format for time-series libraries.

I've adopted the approach of including plenty of open source contributions on the leaderboards - with code badges that people can click through to reveal the implementation (see here). I've also provided an open-source library (here) that exposes some but not all functionality from popular Python time-series packages in what I hope is a simple sequence-to-sequence style of functional convention. 

It's an open-ended challenge. If you view the listing of popular Python time-series packages I've drawn together here you'll see how far we have to go, especially if the intent is including all the recent developments coming out of conferences. Usually, the intent of the code base is the production of a paper, not ongoing robust use. 

Another idea is to provide a level of abstraction linking the production of point estimates to algorithms that operate in real-time. In my case, I've tried to accomplish that by providing a Python class that makes it easy to use any of those said methods in a "crawler" that enters live contests. Examples of crawlers are provided here. But there are many ways to do that and I don't suggest it is the right one.  

What would make all of this stronger is a greater number of good online distributional forecasting packages that emphasize incremental calculation. Unfortunately, that seems to be in the flying pony category of wishes. Most packages only provide point estimates. And many are premised on tabular offline use, which isn't a deal-breaker but does feel very inefficient.
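In the meantime, one crude bridge from a point-estimate package to a distributional submission is to re-centre past one-step errors on the latest point forecast. The sketch below is an assumption-laden illustration (the function names are hypothetical, and it is not presented as how microprediction's crawlers work). Note that it replays the whole history on every call, so it is precisely the inefficient, non-incremental style lamented above.

```python
from typing import Callable, List, Sequence


def residual_samples(
    history: Sequence[float],
    point_forecaster: Callable[[Sequence[float]], float],
    warmup: int = 2,
) -> List[float]:
    """Turn a point forecaster into a crude sample-based predictive distribution.

    Replays the forecaster over the history to collect one-step-ahead errors,
    then adds those errors to the newest point forecast.
    """
    residuals = [
        history[t] - point_forecaster(history[:t])
        for t in range(warmup, len(history))
    ]
    centre = point_forecaster(history)
    return sorted(centre + r for r in residuals)


def last_value(history: Sequence[float]) -> float:
    """Stand-in for any package that returns a single point estimate."""
    return history[-1]


history = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
samples = residual_samples(history, last_value)
print(samples)  # [13.0, 13.0, 16.0, 16.0]
```

With a realistic amount of history the samples give a usable, if rough, empirical distribution. A package that maintained the residuals incrementally would avoid the repeated replay.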

I've attempted a shout-out to some online packages here, which are yet to benefit from the Medium echo chamber. I'm afraid our future may lie in the hands of neophyte data scientists throwing darts as they decide which package to write a Towards Data Science article about - mostly by reading other articles written by others who did the same thing.

10. Learning

From the authors' perspective, the objective should be not just the determination of the winners but the advance of the theory. The objectives of microprediction are broader because the intent is the provision of low-cost business optimization ... eventually. But we're certainly aligned on the research agenda too. 

One suggestion in the paper is that creators state some hypotheses before the contest. If that extends to contest platforms, then one hypothesis of microprediction is that specialization is possible (as with the prediction of z-streams, explained here, where distributional estimates are elicited for distributional transforms of top-level contests).  

Another idea in the paper is that future forecasting competitions challenge the findings of the previous ones. Call me quixotic but I would take that much, much further and suggest that real-time forecasting competitions will be won by other real-time forecasting competitions in real-time. And when real-time forecasting competitions are miniaturized, we give birth to the microprediction web.

There's some discussion here but reach out if you're truly interested in a longer-form version of that thesis.  

Is the right thing to do too hard?

I'd encourage everyone, whether competing at microprediction or elsewhere, to read the paper. If nothing else, you might be interested in the table compiled by the authors with links to all the winning methods used in past time-series contests.

Overall I'm in complete agreement with the authors. We know the right way to arrange forecasting competitions that reflect reality, avoid data leakage, facilitate data search, and lead to clearly interpretable outcomes. That answer is, to some approximation, streaming distributional contests using diverse live data.

But the authors have also identified the key challenge. Progress isn't just a function of the contest satisfying the statisticians, economists, and game theorists. For better or worse it will depend on what participants are willing to do. Hopefully, a new wave of contests can surf the MLOps wave. Hopefully, there are other ways to make it increasingly easy for participants to contribute in real time, and to continue to do so.

It isn't that hard, now. You can enter a real-time contest by cutting and pasting this bash command into a terminal. 

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/microprediction/microprediction/master/shell_examples/run_default_crawler_from_new_venv.sh)"

(Whether or not it's a good idea to run a script you found on the internet is another matter.)

There are some tailwinds for real-time contests. As more of the world is instrumented, and more employers seek employees who are adept at deploying machine learning models, not just experimenting offline, the perceived payoffs of participation will change. (Speaking for one employer, I strongly encourage those seeking interviews to prove that they can succeed at hard tasks. Kaggle padding doesn't cut it.) 

Moreover, if streaming contests become the norm, I believe it might open up possibilities that go well beyond the wildest ambitions of forecasting contests as they stand today. That would have to be coincident with a move by companies to employ explicit rather than implicit repeated quantitative tasks in their processes and transform them into a format where they can be food for contests. 

That process will involve a lot of trial and error. In that vein, I hope more of you who wish to advance open democratic prediction decide to make predictions by running a Python script, or otherwise using the API. But keep in mind this is all open-source, including the back-end. I hope you become interested in the microprediction project because of the limitless possibilities to wield the "secret sauce" of machine learning.