# Collective Distributional Prediction

Published on July 1, 2020

This article explains the mechanics of live, ongoing short-term prediction challenges where you can participate by running a Python script continuously. It explains streams, quarantined distributional predictions, and near-the-pin rewards. It also discusses z-streams.

This post is not a tutorial - it covers the game itself. Detailed "Getting Started" instructions exist in the knowledge center. But briefly, your crawler (for which a minimalist example is provided) is usually a sub-class of MicroCrawler, unless you choose to create your own using the API or lower-level functionality like the MicroWriter.

The prediction challenge is not what you may be used to. If you are coming from Kaggle, or simply interested in the motivations for this open initiative, please read Dorothy, You're Not in Kaggle Anymore.

## Outline

• Streams
• Distributional predictions
• Quarantined distributional predictions
• Community implied z-scores
• Community implied z-curves using embeddings from [0,1]^2 -> R and [0,1]^3 -> R
• A remark on Sklar's theorem

## Streams

A stream is simply a time series of scalar (float) data created by someone who repeatedly publishes a single number. It is a public, live, moving target for community-contributed prediction algorithms. For example, here are links to three streams:

and for convenience, the third (at time of writing) looked like this: Note the leaderboard of prediction algorithms. They are supplying distributional predictions.

## Distributional Prediction

Algorithms living at Microprediction.org, or should I say interacting with it via API, don't supply single number predictions (point estimates). Why not? Well, here is a "proof without words" that point estimates are difficult to interpret. In the community garden called Microprediction.org (which you are asked to treat with loving respect), distributional forecasts comprise a vector of 225 carefully chosen floating-point numbers. When an algorithm chooses to submit, it supplies three things:

1. the name of a stream;
2. a delay/horizon parameter, chosen from 4 possibilities {70s, 310s, 910s, 3555s}; and
3. a collection of 225 numbers.

How should the 225 numbers be interpreted? Know that the system will add a small amount of Gaussian noise to each number. Know that rewards for the algorithm will be based on how close the noisy numbers are to the truth. Then do a little game theory (maybe portfolio theory) and come to your own precise interpretation.

I offer the vague interpretation that your 225 points should represent a kernel estimate of a distribution. I won't provide an extended discussion of strategy here, but you should definitely acquaint yourself with the lottery paradox if nothing else.
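As a rough illustration of that vague interpretation, one simple way to produce 225 numbers is to take evenly spaced quantiles of whatever distribution you believe in. The sketch below does this for a normal distribution. The function name `quantile_scenarios`, the mid-point percentiles (i + 0.5)/225, and the example mean and standard deviation are all assumptions made for illustration, not a convention of the platform.

```python
from statistics import NormalDist

NUM_PREDICTIONS = 225  # length of the submitted vector, as described in the article


def quantile_scenarios(mu, sigma, n=NUM_PREDICTIONS):
    """Return n evenly spaced quantiles of Normal(mu, sigma).

    Assumption: mid-point percentiles (i + 0.5) / n. This is one plausible
    reading of "kernel estimate", not an official recipe.
    """
    dist = NormalDist(mu, sigma)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


# Hypothetical belief: the next value is roughly Normal(100, 2).
scenarios = quantile_scenarios(mu=100.0, sigma=2.0)
```

Whether evenly spaced quantiles are the best response to the reward scheme is exactly the game theory question left open above.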

Let's move on to the interpretation of the delay/horizon parameter.

## Quarantine

The algorithms are firing off distributional predictions of a stream. Let's be more precise.

• Morally a distributional prediction at Microprediction.org comprises a vector of 225 numbers suggestive of the value that will be taken by a data point at some time in the future...say 5 minutes from now or 1 hour from now.
• However, when making the distributional prediction, the exact time of arrival of future data points is not known by the algorithms, but must be estimated. Thus it would be more precise to say the distributional prediction applies not to a fixed time horizon but rather to the time of next arrival of a data point after some elapsed interval.

Let us pick a delay of 3555 seconds for illustration (45 seconds shy of one hour). If the data seems to be arriving once every 90 minutes, and arrived most recently at noon, it is fair to say that a set of scenarios submitted at 12:15 p.m. can be interpreted as a collection of equally-weighted scenarios for the value that will (probably) be revealed at 1:30 p.m. (and is thus a 75 minute ahead forecast, morally speaking).

The system doesn't care about the interpretation. When a new data point arrives at 1:34 p.m., it looks for all predictions that were submitted at least as far back as 12:34:45 p.m., a cutoff point chosen to be 3555 seconds prior. Those distributional predictions qualify to be included in a reward calculation.
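The cutoff arithmetic is easy to check with the standard library. This is only a sketch of the rule as described here: the function names and the calendar date are made up for illustration, and the real backend may handle timestamps differently.

```python
from datetime import datetime, timedelta

DELAY_SECONDS = 3555  # the longest of the four delays


def quarantine_cutoff(arrival, delay_seconds=DELAY_SECONDS):
    """Latest submission time that still qualifies for this arrival."""
    return arrival - timedelta(seconds=delay_seconds)


def qualifies(submitted_at, arrival, delay_seconds=DELAY_SECONDS):
    """True if the prediction has been quarantined for at least the delay."""
    return submitted_at <= quarantine_cutoff(arrival, delay_seconds)


# Arbitrary date; only the clock times matter.
arrival = datetime(2020, 7, 1, 13, 34, 0)                 # data arrives at 1:34 p.m.
cutoff = quarantine_cutoff(arrival)                       # 59 min 15 s earlier
early_enough = qualifies(datetime(2020, 7, 1, 12, 15, 0), arrival)  # sent 12:15 p.m.
too_late = qualifies(datetime(2020, 7, 1, 12, 40, 0), arrival)      # sent 12:40 p.m.
print(cutoff.time(), early_enough, too_late)
```

The 12:15 p.m. submission from the earlier example qualifies; one sent at 12:40 p.m. would still be in quarantine.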

## Reward

Each algorithm will be rewarded based on how many of its 225 submitted points (guesses) are close to the revealed truth. The reward depends on how other algorithms perform. The reward can be viewed as a sum of rewards assigned to each of the 225 submissions. This is what happens (for example).

1. A new data point x arrives (the revealed truth).

2. We look for submitted points not much more than x, but not too many.

3. We look for submitted points not much less than x, but again not too many.

4. Each guess that is close, in this sense, attracts a positive reward.

5. The positive rewards are normalized so they sum to the stream budget multiplied by the number of participating algorithms (the budget is typically 1.0 or 0.1 say).

6. Guesses not close attract a fixed negative reward: the stream budget divided by 225.

Again, I remark on a seemingly minor but possibly important detail: a very small amount of noise is added to submitted predictions prior to this calculation being performed.
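To make the reward steps concrete, here is a toy version. It is a sketch under loud assumptions: "not much more/less than x" is taken to mean within a hypothetical `window` of x, "not too many" is taken to mean at most `max_each_side` of the closest guesses on each side, and the noise step is omitted. The function name and both parameters are invented for illustration; the real backend will differ in its details.

```python
from collections import Counter


def near_the_pin_rewards(submissions, x, budget=1.0, window=0.1, max_each_side=5):
    """Toy reward split. `submissions` maps algorithm name -> list of guesses.

    Assumptions (not from the article): closeness means within `window` of x,
    and at most `max_each_side` guesses win on each side. Noise is omitted.
    """
    above = sorted((g - x, name) for name, gs in submissions.items()
                   for g in gs if 0 <= g - x <= window)
    below = sorted((x - g, name) for name, gs in submissions.items()
                   for g in gs if 0 < x - g <= window)
    winners = [name for _, name in above[:max_each_side] + below[:max_each_side]]
    wins = Counter(winners)

    rewards = {}
    for name, gs in submissions.items():
        # Each guess that is not close costs budget / (number of guesses),
        # i.e. budget / 225 for a full-length submission.
        rewards[name] = -(len(gs) - wins[name]) * budget / len(gs)
    if winners:
        # Positive rewards sum to budget * number of participating algorithms.
        per_win = budget * len(submissions) / len(winners)
        for name in winners:
            rewards[name] += per_win
    return rewards


# Three guesses each instead of 225, to keep the example readable.
rewards = near_the_pin_rewards(
    {"a": [0.0, 1.0, 2.0], "b": [1.05, 5.0, 6.0], "c": [10.0, 11.0, 12.0]},
    x=1.02,
)
```

In this example "a" and "b" each land one guess near the truth and split the positive pool, while "c" misses with every guess and loses the full budget.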

## Community distributional predictions (cdfs)

You will see cumulative distributions on the site. Since predictions are made with a specific quarantine period in mind, there are four different CDFs - one for each of the four possible choices of delay (70 seconds and so on).

The CDF can be interrogated by supplying an x value, or a list of them, to the API or Python client (see the knowledge center for how to use Python, R, or the API directly). However, at the time of writing the interpretation of this CDF requires some care. Here is how it is computed.

1. Poorly performing algorithms are ignored.
2. The x value supplied implies a corresponding percentile (probability) p for each algorithm's latest submission lying in the open interval (1/450,1-1/450).
3. These p values are transformed via the inverse normal CDF.
4. Then they are averaged.
5. Then the average is transformed back again via the normal CDF.

There are, however, some further details to the calculation of the CDF driven by practical considerations. I omit them here as they are likely to change. The backend code is, however, open-source! In theory, the bad algorithms give up or get kicked out, and better ones arrive. The CDF gets more accurate over time as algorithms (and people) find relevant exogenous data.
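The averaging steps listed above (2 through 5) can be sketched in a few lines. This is an illustration only: it assumes each algorithm's percentile is the empirical fraction of its points at or below x, clipped to the open interval, and it assumes poorly performing algorithms have already been filtered out by the caller. The function name `community_cdf` is hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist

P_MIN, P_MAX = 1 / 450, 1 - 1 / 450  # bounds of the open interval in step 2


def community_cdf(x, submissions):
    """Combine per-algorithm percentiles of x in inverse-normal space.

    `submissions` is a list of point collections, one per surviving algorithm.
    Assumption: percentile = fraction of points <= x, clipped to (P_MIN, P_MAX).
    """
    std = NormalDist()
    z_values = []
    for points in submissions:
        ordered = sorted(points)
        p = bisect_right(ordered, x) / len(ordered)
        p = min(max(p, P_MIN), P_MAX)
        z_values.append(std.inv_cdf(p))          # step 3
    mean_z = sum(z_values) / len(z_values)       # step 4
    return std.cdf(mean_z)                       # step 5


# Two toy "algorithms" with five points each instead of 225.
example = community_cdf(0.5, [[-2, -1, 0, 1, 2], [0, 1, 2, 3, 4]])
```

Averaging in inverse-normal space rather than averaging the probabilities directly means that one algorithm's extreme percentile pulls the result harder than a plain arithmetic mean would.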

## Implied percentiles

In a similar fashion, we can compute a percentile in the interval (1/450,1-1/450) for each arriving data point using predictions that have exited quarantine. Let's suppose it has surprised the algorithms on the high side and so the percentile is 0.72, say. We call 0.72 the community implied percentile.

It will be apparent to the reader that a community implied percentile will be quite different depending on the choice of the quarantine period. For example, the data point might be a big surprise relative to the one-hour-ahead prediction, but less so compared to forecasts that have not been quarantined as long (the reverse can also be true).

One could compute four different implied percentiles for each arriving data point, but we choose to use only the shortest and longest quarantine periods. There are two community percentiles computed: one computed using forecasts delayed more than a minute (actually 70 seconds) and one relative to those delayed roughly one hour (actually 3555 seconds).

Look closely at the name of streams such as z1~three_body_z~3555 and you will notice that the quarantine used to compute the implied percentile appears last (3555 seconds in this case).

## Implied z-scores (z1's)

Next, we define a community z-score as the inverse normal cumulative distribution of the community implied percentile. This is an overloading of the term z-score! Yes, z-scores often refer to a different, rather crude standardization of data that assumes it is normally distributed (when it never is). Here, in contrast, we are using the community to define a distributional transform. If the community of human and artificial life is good at making distributional predictions, the z-scores will actually be normally distributed.

Or not. The system doesn't leave that to chance but rather publishes the z-scores so that they too can be the subject of competitive prediction. This might even provide an opportunity for epistemologists who think they can reason a priori as to why these will be wrong (e.g. "the tails are too thin - always - buy my book"). It will take you about five minutes to write the algorithm that submits that particular theory. You need only make a one-time submission of a fat-tailed distribution (represented by 225 numbers).
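For what it's worth, here is what such a fat-tailed submission might look like. The choice of a Laplace distribution scaled to unit variance, the mid-point percentiles, and the function names are all assumptions made for this sketch; any heavier-than-normal distribution would express the same theory.

```python
import math
from statistics import NormalDist


def laplace_quantile(p, scale):
    """Inverse CDF of a zero-mean Laplace distribution with the given scale."""
    if p < 0.5:
        return scale * math.log(2 * p)
    return -scale * math.log(2 * (1 - p))


def fat_tailed_scenarios(n=225):
    """n evenly spaced quantiles of a unit-variance Laplace distribution."""
    scale = 1 / math.sqrt(2)  # Laplace variance is 2 * scale**2, so this gives 1
    return [laplace_quantile((i + 0.5) / n, scale) for i in range(n)]


scenarios = fat_tailed_scenarios()

# For comparison: the most extreme of 225 evenly spaced standard normal quantiles.
normal_extreme = NormalDist().inv_cdf(224.5 / 225)
```

The largest Laplace scenario sits further out than the largest normal quantile, which is the whole point of the "tails are too thin" theory.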

To be clear, each community z-score is treated as a live data point in its own right - a data point that appends to its own stream. That stream, called a z1-stream, is (like any other stream) the target of quarantined distributional predictions exactly as before. Think of it as a kind of community model review.
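The definition of the community z-score given earlier in this section is a single function call in Python. The helper name `community_z_score` is made up for this sketch; the only thing taken from the article is the inverse normal transform and the percentile bounds.

```python
from statistics import NormalDist


def community_z_score(implied_percentile):
    """Inverse normal CDF of a community implied percentile in (0, 1)."""
    return NormalDist().inv_cdf(implied_percentile)


z = community_z_score(0.72)              # the "surprised on the high side" example
z_max = community_z_score(1 - 1 / 450)   # largest z-score the percentile bounds allow
```

Because implied percentiles are confined to (1/450, 1-1/450), the resulting z-scores are bounded, which is worth keeping in mind when reasoning about tails.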

So, do you think you can spot deviation from the normal distribution in these community z-scores for South Australian electricity prices? I would tend to agree with you ... and you may be a few lines of Python away from a great statistical triumph. Godspeed.

## Implied z-curves (z2's and z3's)

Now things get more interesting. Let's suppose that soliciting turnkey predictions from a swarm of competing quasi-human life forms will soon be the norm, as will use of this second layer of analysis we have discussed (z1-streams). Let's face it, why would you do anything crazy like hire a data scientist - a proposition with a vastly greater cost and far less promising asymptotic properties. Why would anyone hire both a data scientist and other data scientists to review their work?

Of course, companies will just hit the API instead and be done. But after that, pretty soon people will want more. We've tried to anticipate as much, and so you'll find something a little more subtle than a z1-stream when you look at z2~c5_bitcoin~c5_ripple~3555. Notice there are two parent streams in the name (c5_bitcoin and c5_ripple). Here is what is going on.

1. The parent streams are updated simultaneously, with a single call to the /copula part of the API.
2. As above, implied percentiles are computed for each incoming data point, for the longest and the shortest delay choices (i.e. roughly 1hr and 1min). For a given delay, let's call them p1 and p2, referring to the implied percentiles for the bitcoin and ripple moves respectively.
3. The pair (p1,p2) is converted to a single probability between 0 and 1 by means of a space-filling curve.

What's a space-filling curve, you say? It is a map from (0,1) to the hypercube that visits every point and tries to do so in a mostly continuous fashion. Technically we are using the inverse of this map, which is actually easier to explain.

1. Rescale the community percentiles.
2. Convert to binary representation.
3. Interleave the digits in the binary representation.
4. Convert back, and scale back again.

But then we add a last step:

5. Apply the inverse normal distribution function.

This ensures that the z2~ streams are approximately normally distributed.
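The five steps above can be sketched as follows. Treat this as an illustration of digit interleaving (a Z-order or Morton-style curve), not as the platform's exact convention: the number of bits, the order in which coordinates are interleaved, the half-cell offset when scaling back, and the function names are all assumptions. The microconventions package remains the authoritative specification.

```python
from statistics import NormalDist


def interleave_to_unit(percentiles, bits=10):
    """Map percentiles in [0, 1] to one number in (0, 1) by interleaving bits."""
    dim = len(percentiles)
    # Steps 1-2: rescale each percentile to an integer with `bits` binary digits.
    ints = [min(int(p * 2**bits), 2**bits - 1) for p in percentiles]
    # Step 3: interleave, most significant digit first, cycling over coordinates.
    code = 0
    for b in range(bits - 1, -1, -1):
        for v in ints:
            code = (code << 1) | ((v >> b) & 1)
    # Step 4: scale back to the unit interval (cell mid-point, so never 0 or 1).
    return (code + 0.5) / 2 ** (bits * dim)


def z_curve(percentiles, bits=10):
    """Step 5: apply the inverse normal CDF to the interleaved value."""
    return NormalDist().inv_cdf(interleave_to_unit(percentiles, bits))


u = interleave_to_unit([0.875, 0.375], bits=2)   # binary 11 and 01 -> 1011
z2 = z_curve([0.72, 0.31])                       # bivariate case
z3 = z_curve([0.5, 0.5, 0.5])                    # the same code handles three parents
```

With two bits per coordinate, 0.875 and 0.375 become binary 11 and 01, which interleave to 1011 (decimal 11), so `u` is (11 + 0.5)/16.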

As an aside, this squishing of two-dimensional data into one dimension might prompt some reasonable questions about the best way to predict. You can use a univariate algorithm to provide 225 guesses of what the next number might be in the sequence (i.e. 225 guesses of the fifth number in the sequence 0.17791, -1.9669, 0.48892, 0.1782, ?). But you can also unfold the sequence into pairs of numbers and then apply bivariate methods.
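Unfolding is just the interleaving run backwards. The sketch below assumes a particular convention (a fixed number of bits per coordinate, most significant digit first, coordinates cycled in order, cell mid-points on the way back), which may not match the platform's; the function name `unfold_z_curve` is hypothetical.

```python
from statistics import NormalDist


def unfold_z_curve(z, dim=2, bits=10):
    """Recover approximate percentiles from a z-curve value by de-interleaving."""
    u = NormalDist().cdf(z)                      # undo the inverse normal step
    total_bits = dim * bits
    code = min(int(u * 2**total_bits), 2**total_bits - 1)
    ints = [0] * dim
    for position in range(total_bits):
        # Read the code from its most significant bit, dealing bits out
        # to each coordinate in turn.
        bit = (code >> (total_bits - 1 - position)) & 1
        k = position % dim
        ints[k] = (ints[k] << 1) | bit
    # Report the mid-point of each coordinate's cell as its percentile.
    return [(v + 0.5) / 2**bits for v in ints]


# A z value whose uniform equivalent is 0.71875, i.e. binary 1011 at 2 bits each.
pair = unfold_z_curve(NormalDist().inv_cdf(0.71875), dim=2, bits=2)
```

Applied to each number in a z2-stream, this yields a sequence of percentile pairs that a bivariate method can work with directly.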

We don't need to talk about what makes the most sense. You can just beat me on the leaderboard.

Similarly, if you were to look at https://www.microprediction.org/strz3~c5_bitcoin~c5_cardano~c5_ethereum~70 you will now appreciate that it is really a trivariate time series masquerading as a univariate sequence.

This explanation would not be complete without some space-filling curve eye-candy. Here is what happens when you move from 0 to 1 and (after scaling) use half the digits in your binary representation to represent one coordinate and the other half to represent the other.

## A Remark on Sklar's Theorem

We have established an algorithm smackdown on multiple levels:

1. At the level of the primary stream of data, at multiple horizons;
2. on implied z-scores individually; and
3. on joint behavior relative to community predictions.

Since we are nerding out on this, observe that the z-curve setup is somewhat reminiscent of Sklar's Theorem. Sklar's Theorem states (loosely) that the distribution of a multivariate random variable can be decomposed into:

• Univariate margins
• A Copula function

where for our purposes a copula is synonymous with a joint distribution on the square or the cube. As an aside, Sklar's Theorem is "obvious" modulo technicalities, in the sense that any variable can be converted to uniform by applying its own (cumulative) distribution function. Thus, generation of a multivariate random variable can be controlled by a throw of a continuous die that takes values in a cube (each coordinate can be transformed by the application of the inverse cumulative distribution of the margin).
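The "continuous die in a cube" picture can be made concrete. In the sketch below the copula is Gaussian with a correlation of 0.3, and the margins are an exponential and a normal distribution. Those are arbitrary choices for illustration (Sklar's Theorem says nothing about which copula or margins to use), and the function names are invented here.

```python
import math
import random
from statistics import NormalDist


def sample_with_gaussian_copula(rho, inverse_margins, n, seed=0):
    """Draw n pairs: dependence from a Gaussian copula, margins chosen freely."""
    rng = random.Random(seed)
    std = NormalDist()
    eps = 1e-12  # keep uniforms strictly inside (0, 1)
    samples = []
    for _ in range(n):
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z1, z2 = g1, rho * g1 + math.sqrt(1 - rho**2) * g2   # correlated normals
        # A throw of the "die": a point in the unit square.
        u = [min(max(std.cdf(z), eps), 1 - eps) for z in (z1, z2)]
        # Each coordinate goes through the inverse CDF of its own margin.
        samples.append(tuple(q(ui) for q, ui in zip(inverse_margins, u)))
    return samples


def exponential_quantile(u):
    """Inverse CDF of a rate-1 exponential distribution."""
    return -math.log(1 - u)


normal_quantile = NormalDist(5.0, 2.0).inv_cdf

draws = sample_with_gaussian_copula(
    0.3, [exponential_quantile, normal_quantile], n=1000
)
```

Swapping either margin leaves the dependence structure untouched, which is the separation of concerns the theorem describes.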

But what about the use of space-filling curves to transform the description of a copula? I have not been able to dig this up, so maybe it's a new idea. At some level this seems imperfect. But by folding bivariate and trivariate community prediction back into univariate, there are some compensating pragmatic gains to be had.

Whether you view the packing into one dimension as a technology convenience (my laziness?) or something more is up to you. There are also some interesting and, I think, understudied aspects to this. The reader might wish to contemplate the approximate analytical relationship between two correlated random variables (however that is parametrized) and the variance or volatility of their z-curve. For instance, bivariate normal with a correlation of 30% yields 15% excess standard deviation over standard normal for the squished z2-stream. It's roughly a rule of two.

What remains a matter of experiment is whether arbitraging algorithms can bring Sklar's Theorem to life in an effective and visceral manner. Hopefully, we'll also discover whether the separation of concerns suggested by Sklar's Theorem is useful, or not, when it comes to determining accurate higher-dimensional, probabilistic short-term forecasts.

## Why we care about z-curves

This question is particularly pertinent for quantities such as stocks where some moments (the stock margins) are traded explicitly but many are not (most volatility is not directly traded even). The intraday dependence structure between style investing factors (like size, value, momentum, and so forth) is a subtle but very important thing in fund management - so involving a diversity of algorithms and perspectives seems prudent, as does not expecting any one algorithm or person to solve the puzzle in its entirety.

You may not care about stocks and that's fine. There isn't a lot to prevent one algorithm finding its way from stocks to train delays to weather in Seattle. You can derive from the MicroCrawler class to advance a new kind of algorithm reuse and cross-subsidy.

The best specification of the precise conventions for z-curves (and also naming conventions to help you navigate the hundreds of streams at Microprediction.org) is the microconventions package on GitHub or PyPI.

I hope this is fun. The intent here is that the growing collection of live data streams (which anyone can add to) represents a kind of "Tough-Mudder" for general-purpose, fully autonomous prediction algorithms. The marginal cost of applying these to a new business problem is, in principle, very close to zero.

And since you read this far then let me add that there are some cash incentives for participation to go with the bragging rights. See microprediction.com/competitions for details. Want to give it a go?  