Can you provide a distributional estimate of where a badminton player will be in a few moments' time? If you can, how should you communicate that information? When you do, how should your contribution be assessed?

Badminton was added to the list of live time series challenges at Microprediction.org just yesterday, thanks to Haodong Qian. You will find **badminton_x** and **badminton_y** if you browse the growing stream listing. You will also find streams representing transport delays, bike-sharing activity outside NYC hospitals, ozone levels, traffic speed in the Bronx, electricity prices in South Australia, wind speed and direction in Seattle, cryptocurrencies and stock prices - all subject to live out-of-sample competition between algorithms. (Before claiming you have a good all-purpose time series approach, see how it runs this gauntlet!)

The badminton_x stream represents one coordinate of a badminton player's neck position, and badminton_y the other. If you think you can get close to the true position given 225 guesses, it is easy to write a Python program that will put you on the leaderboard (quickstart). Oh, and maybe your algorithm is good at something else too that you didn't anticipate. See the crawling quickstart to find out how to set your algorithm loose to find its own destiny.
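For instance, a first program might spray 225 Gaussian scenarios around the last observed value. This is a minimal sketch: the `microprediction` client, the `badminton_x.json` stream name and the `delay` horizon follow the site's conventions as I understand them, and the write key is a placeholder.

```python
from statistics import NormalDist

def make_guesses(recent, num=225):
    """Spread num scenario values around the last observed position,
    with a width set naively by recent variability."""
    mu = recent[-1]
    sigma = max(1e-6, (max(recent) - min(recent)) / 4.0)
    dist = NormalDist(mu=mu, sigma=sigma)
    # Equally spaced percentiles give 225 representative scenario values
    return [dist.inv_cdf((i + 0.5) / num) for i in range(num)]

guesses = make_guesses([240.0, 244.5, 251.0, 249.2])

# Submitting requires a write key; commented out so the sketch runs offline:
# from microprediction import MicroWriter
# mw = MicroWriter(write_key='YOUR WRITE KEY')
# mw.submit(name='badminton_x.json', values=guesses, delay=70)
```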

The badminton data leans on the CMU OpenPose video recognition project written by Gines Hidalgo, Hanbyul Joo, Zhe Cao, Tomas Simon, Shih-En Wei and Yaser Sheikh. You can see demonstrations of it working in office environments (and flashmobs) here. The badminton stream applies the software to a single badminton game. The video recognition of a player's position is imperfect and will occasionally produce erroneous data points - but overall it is pretty good. There are twenty-four different parts of a badminton player's body that the software attempts to identify, and it does so quite successfully. We chose the position of the neck (position 1) for this example stream.

To encourage out-of-sample testing, the badminton data streams supply a new neck position every minute (i.e. a new x,y pair). These streams are slower than the frenetic pace of actual badminton, and are not truly live, but they provide good training for algorithms. I am fascinated by this example because the movement of the player bouncing around is likely to stump quite a few time series modeling approaches.

Though only just underway, the badminton time series already contains a few sudden lurches. Physics dictates short-term momentum but also, one would assume, long-term memory and perhaps even a signature - similar to the way we might recognize people by their gait. Combining short- and long-term thinking would therefore seem to be important. I imagine there will be a few LSTMs thrown at this one.

As noted, the stream also contains some errors, but we'd expect that in any real-world application. One of the nice things about creating streams at Microprediction.org is that z-scores for every data point are automatically generated. For example, here is the z-stream for the x-coordinate of the badminton player's neck position, which reports how surprising each data point is relative to the predictions made by algorithms.
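Roughly speaking - and this is a sketch of the idea, not the site's exact implementation - a z-score can be obtained by locating each new data point within the community's distribution of 225 guesses and pushing that percentile through the inverse normal CDF:

```python
from statistics import NormalDist

def approx_zscore(community_guesses, observed):
    """Percentile of the observed value among the community's scenarios,
    mapped through the inverse normal CDF to give a z-score."""
    n = len(community_guesses)
    below = sum(1 for g in community_guesses if g < observed)
    # Keep the percentile away from 0 and 1 so inv_cdf stays finite
    p = min(max((below + 0.5) / n, 1e-4), 1 - 1e-4)
    return NormalDist().inv_cdf(p)

guesses = [float(i) for i in range(225)]  # a toy community distribution
print(round(approx_zscore(guesses, 112.0), 2))  # 0.0: an unsurprising point
print(approx_zscore(guesses, 500.0) > 3)        # True: a glaring outlier
```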

You can see an erroneous data point sticking out here. I leave it to an enterprising person to create a new, clean stream based on this polluted data - something that can probably be accomplished in a few lines of code. You can publish that stream - or any public data important to your business - using the Python client instructions or, if you prefer, the API directly.
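Such a cleaning step really can be a few lines. A sketch, using a simple trailing z-score test (the window size and threshold here are arbitrary choices, not anything the site prescribes):

```python
from statistics import mean, stdev

def clean(points, window=30, threshold=4.0):
    """Drop points that sit more than `threshold` standard deviations
    from the mean of the trailing window of points kept so far."""
    kept = []
    for x in points:
        recent = kept[-window:]
        if len(recent) >= 5:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                continue  # looks like a video-recognition glitch; skip it
        kept.append(x)
    return kept

raw = [250.0, 251.0, 249.5, 250.5, 249.0, 251.5, 9999.0, 250.2, 248.8]
print(clean(raw))  # the 9999.0 glitch is dropped
```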

**Microprediction.org** is very new and it will take a while before the best minds on the planet - both human and artificial - turn their attention to this particular stream. But in anticipation, it makes sense to think about how best they might be assessed.

The stream caught my attention because it brings into sharp relief the distinction between point estimates and distributional estimates. Suppose we were to ask algorithms to provide a single number estimate of the position of the badminton player 60 time steps forward in time. What score should we assign to their prediction once the truth is revealed? How should we interpret the forecast?

If a hyper-intelligent life form provided us with a single number indicative of the position of the badminton player *without* explaining how it was derived, then certainly we could compute an absolute or mean square error, or some other metric, to rank the aliens against our best efforts. But would that reveal anything meaningful? It might be very difficult to interpret what the aliens really meant to convey.

Presumably the aliens had some probabilistic view of the badminton player's position.

Point estimates are lousy in this respect, which is why algorithms that submit predictions at Microprediction.org are required to provide not one but 225 guesses of where the badminton player will move. They can use this freedom to assign something vaguely Gaussian-looking should they wish to - but they can also supply a bimodal distribution, or a heavily skewed one, or whatever they think is appropriate.
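For example, an algorithm that believes the player is about to dart either left or right can encode that belief directly in its 225 values. A toy illustration (the positions and mode weights are invented):

```python
from statistics import NormalDist

def bimodal_guesses(left_mu, right_mu, sigma, p_left=0.5, num=225):
    """Split the scenario budget between two modes in proportion
    to the probability assigned to each."""
    n_left = round(p_left * num)
    values = []
    for mu, n in [(left_mu, n_left), (right_mu, num - n_left)]:
        d = NormalDist(mu=mu, sigma=sigma)
        values.extend(d.inv_cdf((i + 0.5) / n) for i in range(n))
    return sorted(values)

# 60% chance the player darts left (toward x=180), 40% right (toward x=320)
guesses = bimodal_guesses(left_mu=180.0, right_mu=320.0, sigma=10.0, p_left=0.6)
```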

I hasten to add that we should not leap to the conclusion that single-number submissions are always impossible to interpret - even if that seems problematic for badminton.

One place to look is the *scoring rule* literature. A scoring rule is the academic term for a function that assigns a numerical score to a probabilistic forecast once the outcome is known. For example, Gneiting and Raftery consider scores assigned based on a probabilistic prediction and an outcome, and consider conditions under which maximizing a score gives a person the right incentives to provide an "honest" probabilistic forecast.

Suppose we were to ask prediction algorithms not about badminton player positions, but only about the result of the point. We shall assume there are two possible outcomes, and it is cleaner to think of this as two probabilities that add up to one (it isn't actually necessary to insist that they add to one, but that's a side point).

Put yourself in the position of someone asked to provide two numbers (p1, p2) knowing you will be judged by what is known as the Brier score. The Brier score is just least squares applied to the difference between probabilities and outcomes. For example, if you assign p1=0.8 and p2=0.2, corresponding to an 80% chance of the first player winning the point, and the player does not win, your error will be computed as follows:
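Filling in the arithmetic with the standard multi-outcome form of the Brier score (the realized outcome is encoded as 0 for the first player and 1 for the second):

```python
p = [0.8, 0.2]    # submitted probabilities
outcome = [0, 1]  # the first player did not win the point
brier = sum((pi - oi) ** 2 for pi, oi in zip(p, outcome))
print(round(brier, 2))  # (0.8 - 0)^2 + (0.2 - 1)^2 = 0.64 + 0.64 = 1.28
```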

One can show that under this scoring system you have an incentive to provide your true unbiased estimate of the probabilities, and thus someone can easily interpret your contribution. On the other hand, if you knew that you would be scored differently:

then you will not provide your honest estimates of 0.8 and 0.2, but will instead shrink them both in the direction of 0.5 so as to avoid the heavy penalty. In the Hookean universe (squared-distance energy, which you are trying to minimize) we have a useful center-of-mass property and things cancel nicely. With a fourth-power penalty, they do not.
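To make the shrinkage concrete, take the alternative rule to be a fourth-power penalty (my choice of example; any sufficiently convex penalty behaves similarly). A brute-force search shows the squared rule rewards honesty while the fourth-power rule rewards shading toward 0.5:

```python
def expected_penalty(p, q, power):
    """Expected score for reporting probability p when the true win probability is q."""
    return q * (1 - p) ** power + (1 - q) * p ** power

q = 0.8  # your honest belief
grid = [i / 1000 for i in range(1001)]
best_squared = min(grid, key=lambda p: expected_penalty(p, q, 2))
best_quartic = min(grid, key=lambda p: expected_penalty(p, q, 4))
print(best_squared)  # 0.8: the squared rule elicits the honest probability
print(best_quartic)  # ~0.61: the fourth-power rule induces shrinkage toward 0.5
```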

The same is true if we allow a third outcome where neither player wins the point. Treat your contributed probabilities (p1, p2, p3) as a point on the simplex.

To orient ourselves in this picture, note that each corner of the triangle corresponds to a degenerate probabilistic view (you are certain that one outcome will occur). If, on the other hand, you truly believe the probabilities are (p1, p2, p3) and you will be judged by the Brier score once again, then minimizing the expected Brier score (another exercise for the reader) suggests that you should submit "honest" estimates p1, p2 and p3 (hint: use the same center-of-mass shortcut you use in physics).
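The exercise can also be verified numerically with a brute-force search over the simplex (true probabilities of 0.5, 0.3 and 0.2 are an arbitrary example):

```python
from itertools import product

truth = (0.5, 0.3, 0.2)

def expected_brier(p, q):
    """Expected Brier score of reporting p when q is true: with probability
    q[j], outcome j occurs and the score is sum_i (p_i - 1{i==j})^2."""
    return sum(
        qj * sum((pi - (1.0 if i == j else 0.0)) ** 2 for i, pi in enumerate(p))
        for j, qj in enumerate(q)
    )

# Candidate reports (p1, p2, 1 - p1 - p2) on a 0.01 grid over the simplex
step = 0.01
candidates = (
    (a * step, b * step, 1 - a * step - b * step)
    for a, b in product(range(101), repeat=2)
    if a + b <= 100
)
best = min(candidates, key=lambda p: expected_brier(p, truth))
print(tuple(round(x, 2) for x in best))  # (0.5, 0.3, 0.2): honesty is optimal
```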

There is plenty more to the story of scoring rules. Just be aware that asking aliens to minimize a measure of least square error will lead them to do precisely that - and depending on how the rules are set this might obscure the intent carried in their communication. As the economists say, the problem isn't that incentives don't work, it's that they work too well.

Returning to the question of the position of the badminton player in space - a continuous variable, of course - how might we choose to reward distributional forecasts embodied as a collection of 225 scenarios for the position? A simple approach is to assign a positive score to submissions that are close to the correct answer (decreasing it in proportion to how many other submissions are also close) and a negative score to the rest. Rewards can then be translated so that the whole thing is a zero-sum game.
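A toy settlement of that kind might look like the following. To be clear, this is an illustration of the idea, not the site's actual payout formula, and a handful of scenarios stands in for the 225:

```python
def settle(submissions, truth, epsilon=5.0):
    """Split a fixed pot among scenarios landing within epsilon of the truth,
    then subtract an equal ante so the rewards sum to zero."""
    close = {name: sum(1 for v in vals if abs(v - truth) <= epsilon)
             for name, vals in submissions.items()}
    total = sum(close.values())
    pot = 1.0
    gross = {name: (pot * c / total if total else 0.0) for name, c in close.items()}
    ante = pot / len(submissions)
    return {name: g - ante for name, g in gross.items()}

subs = {
    'flashy_coyote': [248.0, 250.0, 252.0, 300.0],
    'lazy_badger':   [100.0, 150.0, 200.0, 400.0],
}
rewards = settle(subs, truth=251.0)
print(rewards['flashy_coyote'] > 0)        # True: more scenarios near the truth
print(abs(sum(rewards.values())) < 1e-12)  # True: the game is zero-sum
```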

For example, the algorithm *Flashy Coyote* recently received a tiny uptick in balance

because it produced more scenarios closer to the truth than the average bear, at least for this data point. This helped it climb just a little higher on the leaderboard. I notice that three algorithms have also found their way to this badminton stream already.

Because I wrote *Flashy*, I can tell you that it is doing a fairly honest job of supplying distributional predictions. One might ask if this is the best policy.

Given approximate knowledge of the other participants' sprays of guesses - something we might refer to as the market distribution - players will attempt to place scenarios in parts of the space that are underserved. This will drive the market distribution towards a reasonable estimate of the objective probability (modulo the usual things).

As an aside, there is a closely related situation in which the optimal submission is *independent of the other players' scenarios*. I leave this as an exercise for the reader, but it only applies when a player imagines that they are investing all their wealth, and when they seek to maximize the logarithm of their posterior wealth. *Flashy* will probably do okay supplying honest submissions while ignoring everyone else's.
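That exercise has a quick numerical check. Suppose a player stakes fractions w of all their wealth across the outcomes and maximizes expected log wealth; whatever the payout odds (which stand in for everyone else's behavior), the optimal stake is the honest belief:

```python
from itertools import product
from math import log

q = (0.5, 0.3, 0.2)     # honest beliefs about the three outcomes
odds = (1.5, 4.0, 3.0)  # payout per unit staked; a stand-in for the other players

def expected_log_wealth(w, beliefs, payout):
    """All wealth is staked; if outcome i occurs, wealth becomes w[i] * payout[i]."""
    return sum(b * log(wi * oi) for b, wi, oi in zip(beliefs, w, payout))

step = 0.01
candidates = (
    (a * step, b * step, 1 - a * step - b * step)
    for a, b in product(range(1, 100), repeat=2)
    if a + b < 100
)
best = max(candidates, key=lambda w: expected_log_wealth(w, q, odds))
print(tuple(round(x, 2) for x in best))  # (0.5, 0.3, 0.2), regardless of the odds
```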

In general, however, it is important to appreciate that the goal is *not* the identification of a single best submission. Rather, a plurality of algorithms conspires to produce a community forecast. Collective prediction is subtle.

- **Algorithm symbiosis**. It is entirely possible that algorithms A and B might survive and thrive when both participate, even though individually each would perish.
- **Benchmarking fallacies**. It does not follow logically that an algorithm that extracts the most rent from a prediction stream is inherently superior, or can be used in isolation.

Furthermore, once you appreciate that market-normalized predictions are themselves predicted (see the introduction to z-streams article), it becomes even more difficult to disentangle contributions.

I dare say this isn't the end of the game theory at Microprediction.org - not by a long shot - but I hope it gives you a sense. There are more articles on the topic here that come at collective prediction from different angles. The most important thing might not be the details of the mechanics but the ease with which you can tap into collective prediction.

See the quickstart for publishing data. As this badminton example shows, if you publish it they will come!
