
An Introduction to Z-Streams and the Mechanics of Collective Distributional Prediction

Published on July 1, 2020

A note to the reader: this article explains the mechanics of a new kind of real-time prediction contest. It pre-dates the creation of a knowledge center that provides some hands-on tutorials for those looking to participate. The collection of data streams represents a kind of "Tough Mudder" for tailored, and also general-purpose, prediction algorithms. If you choose to author and run an algorithm, you might be challenged in new ways, both statistical and pragmatic. But this experience will likely prepare you well for the "real world", if one can include front-office quant jobs, tech jobs, manufacturing, transport, industrial control and other arenas where live systems are maintained and optimized.


Right away, however, you will encounter something you haven't seen before: z-streams. This post explains their role, and the more elementary prerequisite notions of streams and quarantined distributional predictions.

Outline

  • Streams
  • Distributional predictions
  • Quarantined distributional predictions
  • Community implied z-scores
  • Community implied z-curves using embeddings from [0,1]^2 -> R and [0,1]^3 -> R
  • A remark on Sklar's theorem

Streams

A stream is simply a time series of scalar (float) data created by someone who repeatedly publishes a single number. It is a public, live moving target for community contributed prediction algorithms. For example, here are links to three streams:

and for convenience, the third at time of writing looked like this: 

[Chart: the three_body_z stream at the time of writing]

Note the leaderboard of prediction algorithms. They are supplying distributional predictions.
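If you prefer to poke at a stream programmatically rather than through the site, something like the following sketch should work, assuming the microprediction Python client's reader methods (get_current_value, get_lagged_values) and the .json suffix it uses for stream names; the knowledge center is the authoritative reference.

```python
from microprediction import MicroReader

mr = MicroReader()
# Stream names carry a .json suffix in the API; three_body_z is the stream pictured above.
latest = mr.get_current_value(name="three_body_z.json")
history = mr.get_lagged_values(name="three_body_z.json")
print(latest, history[:5])
```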

Distributional Prediction

Algorithms living at Microprediction.org, or should I say interacting with it via API, don't supply single number predictions (point estimates). Here is a "proof without pictures" that point estimates are difficult to interpret.

[Image: Tennis_cropped]

In the community garden called Microprediction.org (which you are asked to treat with loving respect) distributional forecasts comprise a vector of 225 carefully chosen floating point numbers. An algorithm submitting a forecast supplies three things:

  1. the name of a stream;
  2. a delay/horizon parameter, chosen from 4 possibilities {70s, 310s, 910s, 3555s}; and
  3. a collection of 225 numbers.

How should the 225 numbers be interpreted? Know that the system will add a small amount of Gaussian noise to each number. Know that rewards for the algorithm will be based on how close the noisy numbers are to the truth. Then do a little game theory (maybe portfolio theory) and come to your own precise interpretation. I offer the vague interpretation that your 225 points represent a kernel estimate of a distribution.
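To make that concrete, here is a hedged sketch of what a submission might look like using the Python client's MicroWriter. The write key is a placeholder and the scenario generation is deliberately naive; consult the knowledge center for current usage.

```python
import numpy as np
from microprediction import MicroWriter

# A write key identifies your algorithm; see the knowledge center for how to obtain one.
mw = MicroWriter(write_key="replace-with-your-write-key")   # placeholder key

# 225 scenario values: a deliberately naive Gaussian cloud, purely for illustration.
scenarios = list(np.random.normal(loc=0.0, scale=1.0, size=225))

# Submit the scenarios against a named stream and a chosen quarantine delay (70 seconds here).
mw.submit(name="three_body_z.json", values=scenarios, delay=70)
```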

In the future we may allow weighted scenarios. But let's move on to the interpretation of the delay parameter. 

Quarantine

The algorithms are firing off distributional predictions of a stream. Let's be more precise.

  • Morally a distributional prediction at Microprediction.org comprises a vector of 225 numbers suggestive of the value that will be taken by a data point at some time in the future...say 5 minutes from now or 1 hour from now.
  • However, when making the distributional prediction, the exact time of arrival of future data points is not known by the algorithms, but must be estimated. Thus it would be more precise to say the distributional prediction applies not to a fixed time horizon but rather to the time of next arrival of a data point after some elapsed interval.

Let us pick a delay of 3555 seconds for illustration (45 seconds shy of one hour). If the data seems to be arriving once every 90 minutes, and arrived most recently at noon, it is fair to say that a set of scenarios submitted at 12:15 p.m. can be interpreted as a collection of equally weighted scenarios for the value that will (probably) be revealed at 1:30 p.m. (and is thus a 75 minute ahead forecast, morally speaking).

The system doesn't care about the interpretation. When a new data point arrives at 1:34 p.m., it looks for all predictions that were submitted no later than 12:34:45 p.m., a cutoff chosen to be 3555 seconds prior. Those distributional predictions qualify to be included in a reward calculation.
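As a quick sanity check of that cutoff arithmetic (purely illustrative):

```python
from datetime import datetime, timedelta

arrival = datetime(2020, 7, 1, 13, 34, 0)    # the data point arrives at 1:34 p.m.
cutoff = arrival - timedelta(seconds=3555)   # predictions must have been submitted by this time
print(cutoff.time())                         # 12:34:45
```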

Reward

Each algorithm will be rewarded based on how many of its 225 submitted points (guesses) are close to the revealed truth. The reward depends on how other algorithms perform. The reward can be viewed as a sum of rewards assigned to each of the 225 submissions. This is what happens (for example).

  1. The true value, call it x, is published by the stream creator.
  2. We look for submitted points not much greater than x, but not too many of them.
  3. We look for submitted points not much less than x, again not too many.
  4. Each guess that is close, in this sense, attracts a positive reward.
  5. The positive rewards are normalized so they sum to the stream budget multiplied by the number of participating algorithms (the budget is typically 1.0 or 0.1, say).
  6. Guesses that are not close attract a fixed negative reward: the stream budget divided by 225.

A seemingly minor but important detail is that a small amount of noise is added to submitted predictions before this calculation is performed.
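For readers who think in code, here is a toy sketch that captures the flavour of this scoring scheme. It is emphatically not the platform's exact rule, and the number of "close" guesses kept on each side (num_each_side) is an invented parameter.

```python
import numpy as np

def toy_rewards(submissions, x, budget=1.0, noise=1e-4, num_each_side=10):
    """A toy sketch, not the platform's exact rule: jitter every guess, keep the
    few guesses closest to the truth x from above and from below, share a positive
    pot among them, and charge each guess that is not close a penalty of budget/225."""
    # Jitter each submitted point slightly, as the platform does before scoring.
    pooled = [(algo, p + np.random.normal(scale=noise))
              for algo, pts in submissions.items() for p in pts]
    above = sorted((g for g in pooled if g[1] >= x), key=lambda g: g[1] - x)[:num_each_side]
    below = sorted((g for g in pooled if g[1] < x), key=lambda g: x - g[1])[:num_each_side]
    winners = above + below
    # Positive rewards sum to the stream budget times the number of participating algorithms.
    pot = budget * len(submissions)
    rewards = {algo: 0.0 for algo in submissions}
    for algo, _ in winners:
        rewards[algo] += pot / max(len(winners), 1)
    # Guesses that are not close each attract a fixed negative reward of budget / 225.
    wins = {algo: sum(1 for a, _ in winners if a == algo) for algo in submissions}
    for algo, pts in submissions.items():
        rewards[algo] -= (len(pts) - wins[algo]) * budget / 225
    return rewards

# Example: two algorithms each submit 225 guesses; the revealed truth is 0.1
subs = {"algo_a": np.random.normal(0, 1, 225), "algo_b": np.random.normal(0, 2, 225)}
print(toy_rewards(subs, x=0.1))
```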

Community distributional predictions (cdfs)

You will see cumulative distributions on the site. Since predictions are made with a specific quarantine period in mind, there are four different CDFs - one for each of the four possible choices of delay (70 seconds and so on). 

The CDF can be interrogated by supplying an x value, or a list of them, to the API or Python client (see the knowledge center for how to use Python, R or the API directly). However at the time of writing the interpretation of this CDF requires some care. Here is how it is computed. 

  1. Poorly performing algorithms are ignored
  2. The supplied x value implies a corresponding percentile (probability) p for each algorithm's latest submission, with p lying in the open interval (1/450, 1-1/450)
  3. These p values are transformed via the inverse normal cdf
  4. Then they are averaged.
  5. Then the average is transformed back again via the normal cdf

There are, however, some further details to the calculation of the CDF driven by practical considerations. We omit them here as they are likely to change.
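A minimal sketch of steps 2 through 5, ignoring the practical details omitted above:

```python
import numpy as np
from scipy.stats import norm

def community_cdf_value(percentiles):
    """Average per-algorithm percentiles in z-space: transform each p through the
    inverse normal CDF, take the mean, and map back through the normal CDF."""
    eps = 1.0 / 450
    p = np.clip(percentiles, eps, 1 - eps)   # keep each p inside (1/450, 1 - 1/450)
    return norm.cdf(np.mean(norm.ppf(p)))

# e.g. three algorithms' implied percentiles at the queried x value
print(community_cdf_value([0.55, 0.61, 0.40]))
```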


In theory, the bad algorithms give up or get kicked out, and better ones arrive. The CDF gets more accurate over time as algorithms (and people) find relevant exogenous data.

Implied percentiles

In a similar fashion we can compute a percentile in the interval (1/450,1-1/450) for each arriving data point using predictions that have exited quarantine.  Let's suppose it has surprised the algorithms on the high side and so the percentile is 0.72, say. We call 0.72 the community implied percentile.

It will be apparent to the reader that community implied percentiles will be quite different depending on the choice of quarantine period. For example, the data point might be a big surprise relative to the one hour ahead prediction, but less so compared to forecasts that have not been quarantined as long (the reverse can also be true).

One could compute four different implied percentiles for each arriving data point, but we choose to use only the shortest and longest quarantine periods. There are two community percentiles computed: one computed using forecasts delayed more than a minute (actually 70 seconds) and one relative to those delayed more than one hour (actually 3555 seconds).

Look closely at the name of streams such as z1~three_body_z~3555 and you will notice that the quarantine used to compute the implied percentile appears last (3555 seconds in this case).  

Implied z-scores (z1's)

Next, we define a community z-score as the inverse normal cumulative distribution function applied to the community implied percentile. This is a bit of a misnomer, as z-scores often refer to a different, rather crude standardization of data that assumes it is normally distributed. Here, in contrast, we are using the community to define a distributional transform. If the community of human and artificial life is good at making distributional predictions, the z-scores will actually be normally distributed.

Or not. There are lots of intelligent people and algorithms in this world who believe, to the contrary, that they are able to make distributional predictions about other people's distributional predictions. Some people even go so far as to suggest that they can make unconditional distributional predictions (tails that are too thin - always). Good for them. They now have a chance to prove this hypothesis or much more subtle ones.

That is because each community z-score is treated as a live data point in its own right - a data point that appends to its own stream. That stream, called a z1-stream, is (like any other stream) the target of quarantined distributional predictions. Think of it as a kind of community model review.
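In symbols, the community z-score is just z = Φ⁻¹(p), where p is the community implied percentile. For the 0.72 example above:

```python
from scipy.stats import norm

p = 0.72             # the community implied percentile from the example above
z = norm.ppf(p)      # the community z-score
print(round(z, 3))   # roughly 0.58
```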

So, do you think you can spot deviation from the normal distribution in these community z-scores for South Australian electricity prices? I would tend to agree with you ... and you may be a few lines of Python away from a great statistical triumph. Godspeed.
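Those few lines of Python might look something like the sketch below. The z1-stream name is a made-up placeholder (browse the site for real ones), and the choice of a Kolmogorov-Smirnov test is mine, not the platform's.

```python
from microprediction import MicroReader
from scipy.stats import kstest

mr = MicroReader()
# Hypothetical z1-stream name used for illustration; browse microprediction.org for real ones.
z_values = mr.get_lagged_values(name="z1~electricity_prices~3555.json")

# Kolmogorov-Smirnov test against a standard normal: a very small p-value would hint
# that the community's distributional transform is imperfect.
statistic, pvalue = kstest(z_values, "norm")
print(statistic, pvalue)
```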

[Chart: community z-scores for a South Australian electricity price stream]

 

Implied z-curves (z2's and z3's)

Now things get more interesting. Let's suppose that soliciting turnkey predictions from a swarm of competing quasi-human life forms will soon be the norm, as will use of this second layer of analysis we have discussed (z1-streams). Let's face it: why would you do anything crazy like hire a data scientist, a proposition with vastly greater cost and far less promising asymptotic properties? Why would anyone hire both a data scientist and other data scientists to review their work?

Of course companies will just hit the API instead and be done. But after that, pretty soon people will want more. We've tried to anticipate as much, and so you'll find something a little more subtle than a z1-stream when you look at z2~c5_bitcoin~c5_ripple~3555. Notice there are two parent streams in the name (c5_bitcoin and c5_ripple). Here is what is going on. 

  1. The two parent streams are updated simultaneously, with a single call to the /copula part of the API. 
  2. As above, implied percentiles are computed for each incoming data point, for the longest and the shortest delay choices (i.e. roughly 1 hr and 1 min). Fixing one delay, call the implied percentiles p1 and p2 for the bitcoin and ripple moves respectively.
  3. The pair (p1,p2) is converted to a single probability between 0 and 1 by means of a space filling curve. 

What's a space filling curve you say? It is a map from (0,1) to the hypercube that visits every point and tries to do so in a mostly continuous fashion. Technically we are using the inverse of this map, which is actually easier to explain. 

  1. Rescale the community percentiles
  2. Convert to binary representation
  3. Interleave the digits in the binary representation
  4. Convert back, and scale back again
    But then we add a last step:
  5. Apply the inverse normal distribution function, so that the z2~ streams are approximately normally distributed.
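Here is a minimal sketch of that recipe for the bivariate case, using a straightforward Morton-style digit interleave with a fixed number of binary digits. The platform's exact conventions live in the microconventions package mentioned later, so treat this purely as illustration.

```python
from scipy.stats import norm

def to_zcurve_value(p1, p2, digits=16):
    """Fold two percentiles in (0,1) into a single roughly normal number by
    interleaving binary digits (a Morton-style illustration, not the exact
    convention used by the platform)."""
    a = int(p1 * 2 ** digits)                 # steps 1-2: rescale and take binary digits
    b = int(p2 * 2 ** digits)
    interleaved = 0
    for i in range(digits):                   # step 3: interleave the digits
        interleaved |= ((a >> i) & 1) << (2 * i)
        interleaved |= ((b >> i) & 1) << (2 * i + 1)
    u = interleaved / 2 ** (2 * digits)       # step 4: convert back and rescale to (0,1)
    u = min(max(u, 1e-9), 1 - 1e-9)
    return norm.ppf(u)                        # step 5: inverse normal distribution function

# Two community implied percentiles (e.g. bitcoin and ripple) folded into one z2 value
print(to_zcurve_value(0.72, 0.31))
```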

As an aside, this squishing of two-dimensional data into one dimension might prompt some reasonable questions about the best way to predict. You can use a univariate algorithm to provide 225 guesses of what the next number might be in the sequence (i.e. 225 guesses of the fifth number in the sequence 0.17791, -1.9669, 0.48892, 0.1782, ?). But you can also unfold the sequence into pairs of numbers and then apply bivariate methods.

We don't need to talk about what makes the most sense. You can just beat me on the leaderboard. 

Similarly, if you were to look at https://www.microprediction.org/strz3~c5_bitcoin~c5_cardano~c5_ethereum~70 you will now appreciate that it is really a trivariate time series masquerading as a univariate sequence.

This explanation would not be complete without some space filling curve eye-candy. Here is what happens when you move from 0 to 1 and (after scaling) use half the digits in your binary representation to represent one coordinate and half to represent the other.

[Animation: a two-dimensional space-filling curve traced out as the parameter moves from 0 to 1]

A Remark on Sklar's Theorem

We have established an algorithm smackdown on multiple levels:

  1. at the level of the primary stream of data, at multiple horizons;
  2. on implied z-scores individually; and
  3. on joint behavior relative to community predictions.

Since we are nerding out on this, observe that the z-curve setup is somewhat reminiscent of Sklar's Theorem. Sklar's Theorem states (loosely) that the distribution of a multivariate random variable can be decomposed into:

  • Univariate margins
  • A Copula function

where for our purposes a copula is synonymous with a joint distribution on the square or the cube. As an aside Sklar's Theorem is "obvious" modulo technicalities, in the sense that any variable can be converted to uniform by applying its own (cumulative) distribution function. Thus, generation of a multivariate random variable can be controlled by a throw of a continuous die taking values in a cube (each coordinate can be transformed by application of the inverse cumulative distribution of the margin).
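In symbols, for the bivariate case with joint distribution function H and margins F and G, Sklar's theorem reads:

```latex
H(x, y) = C\bigl(F(x),\, G(y)\bigr), \qquad C : [0,1]^2 \to [0,1]
```

where C is the copula, and C is unique whenever F and G are continuous.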

But what about the use of space filling curves to transform the description of a copula? I have not been able to dig this up, and at some level this seems imperfect. But by folding bivariate and trivariate community prediction back into univariate, there are some compensating pragmatic gains to be had.

Whether you view the packing into one dimension as a technology convenience (and somewhat lossy) or something more is up to you. There are also some interesting and, I think, understudied aspects to this. The reader might wish to contemplate the approximate analytical relationship between two correlated random variables (however that is parametrized) and the variance or volatility of their z-curve. For instance, bivariate normal with correlation 30% yields 15% excess standard deviation over standard normal. It's roughly a rule of two.

What remains a matter of experiment is whether arbitraging algorithms can bring Sklar's Theorem to life in an effective and visceral manner, and whether the separation of concerns suggested by Sklar's Theorem is useful, or not, when it comes to determining accurate higher-dimensional, probabilistic short-term forecasts.

Why we care about z-curves

This question is particularly pertinent for quantities such as stocks, where some moments (the stock margins) are traded explicitly but many are not (most volatility is not even directly traded). The intraday dependence structure between style investing factors (like size, value, momentum and so forth) is a subtle but very important thing in fund management, so involving a diversity of algorithms and perspectives seems prudent, as does not expecting any one algorithm or person to solve the puzzle in its entirety.

You may not care about stocks and that's fine. There isn't a lot to prevent one algorithm finding its way from stocks to train delays to weather in Seattle. You can derive from the MicroCrawler class to advance a new kind of algorithm reuse and cross-subsidy.

The best specification of the precise conventions for z-curves (and also naming conventions to help you navigate the hundreds of streams at Microprediction.org) is the microconventions package on GitHub or PyPI. 

Oh, and I didn't want to mention this to anyone who is not absolutely fascinated by multi-layered multivariate distributional prediction. But since you read this far, let me add that there are some cash incentives for participation to go with the bragging rights. See microprediction.com/competitions for details.

Want to give it a go?

R Module 1  Python Module 1
