
A Call for Contributions to a Copula Contest

Published on July 11, 2020

In a live prediction challenge running at Microprediction.org, algorithms try to predict bivariate and trivariate relationships between the five-minutely returns of Bitcoin, Ethereum, Ripple, Cardano and Iota. Can you beat them?


It is hoped that out of a collection of interrelated statistical contests, a picture of the fine structure of two-way, three-way and five-way dependencies will emerge. This detailed understanding might surpass what one model or person could achieve.


Outline
1) Why model joint behavior of cryptocurrency returns?
2) Why trivariate margins might help reconstruct five-way relationships, and why correlation modeling isn't always enough.
3) A Python walkthrough.


Why model joint behaviour of cryptocurrencies?
Because it is good practice.

Cryptocurrency and stock price changes are examples of approximate martingales. According to the Efficient Markets Hypothesis (EMH) it should be very difficult to provide an estimate of the mean of the process five minutes forward in time that is substantially better than the current value. However, even if you believe the EMH, that leaves an awful lot of structure to determine.

Stocks, ETFs and many other quantities are used at Microprediction.org to train algorithms which make distributional predictions. Including cryptocurrencies in the mix adds one more type of exercise for the algorithms. Over time, our understanding of which algorithms perform well across a range of different domains may help us make inferences about longer time scale behaviour, and this may be relevant for all sorts of applications including portfolio management.

Introducing a trivariate prediction contest
How can we understand and model joint behaviour of ... things? In my most recent article I motivated a similar study of bivariate relationships in a physical system (pitch and yaw of a laboratory helicopter). Cryptocurrencies give me an excuse to convince you that trivariate relationships might be important. I draw your attention to the following trivariate stream at Microprediction.org, fed by live pricing data for Bitcoin, Ethereum and Ripple.

 

Figure 1: the trivariate z-stream at Microprediction.org, fed by live pricing data for Bitcoin, Ethereum and Ripple.

 

In a moment we will walk through how to submit your predictions of this z-stream. But one might reasonably wonder whether 3-way relationships are more trouble than they are worth. Is it not sufficient to source good predictions of pairwise relationships? After all, we frequently come across multivariate processes modeled using correlation or covariance matrices (or factor models amounting to the same thing). That is the done thing. Everywhere. Almost without exception. In every field.

 

However, we know mathematically this can't be the whole story, since 2-margins (pairwise probabilities) do not determine a joint distribution of n variables. The question is whether this matters in practice or whether it is a pedantic statistical quibble.
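For the purely mathematical point, a textbook example will do: take two independent fair coins and let a third be their XOR. Every pair of coins then looks exactly like a pair of independent coins, yet all three can never come up heads at once, so the joint distribution is not that of three independent coins. A few lines of Python make this concrete (nothing here is specific to the contest):

    import itertools

    # Law A: X1, X2 independent fair coins, X3 = X1 XOR X2
    # Law B: X1, X2, X3 all independent fair coins
    law_a = {(x1, x2, x1 ^ x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}
    law_b = {xs: 0.125 for xs in itertools.product((0, 1), repeat=3)}

    def two_margin(law, i, j):
        """Pairwise probabilities P(Xi=a, Xj=b) implied by a joint law."""
        margin = {}
        for xs, p in law.items():
            margin[(xs[i], xs[j])] = margin.get((xs[i], xs[j]), 0.0) + p
        return margin

    # Every pair has identical 2-margins ...
    print(all(two_margin(law_a, i, j) == two_margin(law_b, i, j)
              for i, j in [(0, 1), (0, 2), (1, 2)]))      # True
    # ... but the joint distributions differ
    print(law_a.get((1, 1, 1), 0.0), law_b[(1, 1, 1)])    # 0.0 versus 0.125

Whether this kind of discrepancy bites in a practical problem is another matter.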

 

About ten years ago I came across a simple counterexample to the notion that pairwise relationships are sufficient to reconstruct joint distributions. Ten-pin bowling. The example occurred to me shortly before boarding a long haul flight from New York to Sydney. I had just enough time to find a bowling game on a computer, but not enough time to dig up a usable open source simulation. Annoyed but convinced of the thesis, I painstakingly recorded the result of the first bowl many, many times.

 

Figure: the ten-pin arrangement (an inverted triangle)

Given fatigue and mild oxygen deprivation at 40,000 feet, I can't guarantee that the priceless ten pin bowling data gathered on that trip is reproducible - and of course the way one plays the game dictates a lot. However, if you were to perform this exercise yourself - presumably in a less tedious fashion - I think you'll probably notice something rather interesting.

 

To convince you that sleepwalking into correlation modeling isn't always a good idea, let's proceed in a perfectly reasonable fashion to construct a model for all ten pins. To make the ten-pin bowling game feel a little bit more like (say) investment management, we assign to each pin a normally distributed random variable. We shall assume that if this normally distributed variable exceeds some threshold, the pin falls; otherwise it survives. For instance, if X(7) represents the variable attached to pin number 7, we might suppose that pin 7 falls if

 

X(7) > -0.23

 

We can back the number -0.23 out of the data, so that the model's probability of the 7th pin falling matches the empirical frequency. Let's say we also have the bivariate data. Then we can infer correlations between the variable X(7) representing pin number 7 and the variable X(5) representing pin number 5. Most pins will be positively correlated. But if you are a bad bowler like I am, the 7 and 10 pins will be negatively correlated. Perhaps the correlation matrix might look like this:

 

Figure: a hypothetical correlation matrix for the ten pins

 

Now we have a model for all ten pins derived from properties of pairs of pins only. We roll the ten gaussian variables and check each against its corresponding threshold, all of which are calibrated to the data. Sound reasonable?
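If you would like to tinker with this construction yourself, here is roughly what it looks like in Python. The threshold and the correlations below are illustrative placeholders, not numbers calibrated to my bowling data:

    import numpy as np

    n_pins = 10
    thresholds = np.full(n_pins, -0.23)      # pin i falls if X(i) exceeds its threshold
    rho = np.full((n_pins, n_pins), 0.4)     # made-up pairwise correlations
    np.fill_diagonal(rho, 1.0)

    rng = np.random.default_rng(0)
    x = rng.multivariate_normal(np.zeros(n_pins), rho, size=100_000)
    fallen = x > thresholds                  # which pins fall on each simulated bowl

    # Model-implied distribution of the total number of fallen pins
    counts = fallen.sum(axis=1)
    print(np.bincount(counts, minlength=n_pins + 1) / len(counts))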

 

Here I have picked a portfolio of pins (pins 1, 4, 6, 7, 9 and 10) in order to diagnose whether the model, thus created, could assign reasonable probabilities to the seven outcomes (counting the number of pins in this subset of six that fall). The probabilities were quite far off, dramatically underestimating the probability of all six falling (which is not too surprising) but getting other things wrong too. Of course this is just one cross section of the model, so no doubt many other things are wrong about the way it assigns probabilities to all 1024 possible outcomes.

 

It seems that as far as ten-pin bowling goes, this particular approach isn't cutting it. A few years ago Roberto Fontana and Patrizia Semeraro published this paper providing a beautiful characterization of multivariate Bernoulli distributions that sheds some light on this. I did not have the benefit of this at the time.

 

However, I suspected that the tiny amount of information added by some 3-margins would help a lot. Something about the geometry is suggestive. A 3-margin comprises probabilities for all eight outcomes of three pins. Eight numbers but there are seven equations you know already (three 2-margins, three 1-margins and you know that all eight numbers add to 1). So really just one more number. Adding 3-margins isn't taking you very fast towards the total number of degrees of freedom in the system (1023).

 

But it helped. Here is an example "portfolio" of bowling pins. We count the number that fall once again. The blue probabilities are the data. The green probabilities use a sprinkling of 3-margins. The red probabilities are from the correlation model.

 

Figure: probabilities for the number of fallen pins in an example portfolio (blue: data, green: using 3-margins, red: correlation model)

 

If you look at the mean percentage error across all possible choices of six-pin portfolios, you get the following:

 

Table: percentage error versus data when predicting probabilities for pin counts, averaged across all 210 possible choices of 6-pin portfolios. Top row: a correlation model using 2-margins only; bottom row: a model exploiting 3-margins.

 

Neither model is perfect, but using 3-margins to try to reconstruct the joint distribution clearly helps a lot.

 

This experiment left a lasting impression on me. I worried that correlation and covariance modeling might quite often be misleading (with factor models a special case). But on the other hand, not every problem I looked at led to similar findings. Later, I replaced ten pins with ten airports and the binary event of whether or not it rained on a given day. The implied gaussian correlations of a fitted model were as follows:

 

Figure: implied gaussian correlations for rain at ten East Coast airports

 

Yes there are days when it rains at LaGuardia and not at JFK.

 

In similar fashion to the bowling pins, one can look at subsets of airports and ask the question "at how many airports is it raining?". Now in contrast to bowling pins, here the 2-margin model isn't really all that bad. Indeed one can even try to get away with reducing the rank of the correlation or covariance matrix. This is one example, using a reduction to three factors driving the correlations between airports.

 

Figure: probabilities for the number of airports where it is raining, comparing the data with a three-factor normal copula and the "julius" 3-margin model

 

Here "julius" refers to the use of 3-margins (and has nothing to do with orange juice - long story). Again the red normal model refers to the use of the normal copula and the blue is the data. Correlation modeling isn't so bad here.

 

Furthermore, I found examples where correlation modeling worked really well, even after simplification of the model. Looking for an example that I hoped would really trip up pairwise modeling, I decided to model the number of squares whose color remains unchanged after five random moves of a Rubik's Cube.

 

 

Figure: the numbering of the squares on a Rubik's Cube

 

I was the one who was fooled. Not only did 2-margins do a great job of reconstructing this Rubik-generated joint distribution, the same was true after I approximated the model with three factors, then two factors. Amazingly, even a 1-factor approximation did a bang-up job.

 

Don't believe me? Find a Rubik's Cube program and try it. Here are a few examples of "portfolios" of squares on the Rubik's Cube (the numbering scheme is shown above), together with the probabilities for the number of squares whose color remains unchanged after five random moves. Below, the "red" single-factor model does almost as well as a fancy piecing together of 3-margins:

 

Figure: distributions of the number of unchanged squares for example portfolios of Rubik's Cube squares, comparing the data, the single-factor model and the 3-margin model

 

Once again the geometry, by which I mean the mechanics of the Rubik's Cube, sort of suggests ... maybe ... does it? To you? I'm not sure even in retrospect.

 

Moral of the story: dependence is tricky. Sometimes it isn't just about correlation or covariance matrices. Sometimes what is missing in correlation models matters more than other times. Cryptocurrencies could be a little bit like bowling pins or they might be more like Rubik's Cubes insofar as the importance of 2-margins and 3-margins is concerned (say if we want to be able to understand how all five currencies move together). That is for you to figure out and I look forward to coming back to these cryptocurrency streams at a later date to see what structure clever algorithms have found.

 

Python walkthrough

I hope that is sufficient motivation. If you would like to help solve the existential mystery of the joint distribution of cryptocurrencies by helping predict 1-margins, 2-margins and 3-margins, and in doing so get some practice for contests that attract cash awards at Microprediction.org ($4,000 this month - see the July incentives), read on...

 

If you have read the article on helicopter prediction, you may already be familiar with bivariate prediction, but I'm assuming you are coming in cold, and we will also extend to trivariate prediction. If this is all too new, maybe read the article An Introduction to Z-Streams first. You may prefer to read this notebook which is identical to what follows.

 

First, we pull a list of cryptocurrency streams. I wouldn't expect you to know in advance that the ones we are interested in here are prefixed by c5.

 

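Something along these lines does the trick, assuming (as in recent versions of the microprediction package) that MicroReader exposes a get_stream_names method:

    from microprediction import MicroReader

    mr = MicroReader()
    c5_names = [name for name in mr.get_stream_names() if 'c5_' in name]
    print(len(c5_names))
    print(c5_names[:10])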

 

You'll see there are a lot of streams meeting this criterion. That's because five are so-called primary streams but many more are derived (the ones with tildes). Let's narrow down:

 

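Reusing c5_names from the previous snippet, the primary streams are simply the ones without a tilde in the name:

    primary_c5 = [name for name in c5_names if '~' not in name]
    print(primary_c5)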

 

We see five streams, one for each of the currencies named above.

 

Next, let's take a look at them:

 

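For instance, something like the following plots the recent history of each primary stream (this assumes, as I believe is the case, that get_lagged_values returns the most recent values first):

    import matplotlib.pyplot as plt

    for name in primary_c5:
        lagged = mr.get_lagged_values(name=name)
        plt.plot(list(reversed(lagged)), label=name)   # oldest to newest, left to right
    plt.legend()
    plt.show()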

We see:

 

Figure: recent values of the five primary cryptocurrency streams

 

There is a fair degree of comovement in the coin returns, which is hardly surprising. One way to look at this is via Sklar's Theorem. All those other streams (the z2~ and z3~ streams) are really implied copula functions. Some code and a picture are worth more than my words, so:

 

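A sketch along the following lines pulls the lagged values of the bivariate z-stream and unpacks each one into a pair of percentiles. The stream name is inferred from the naming convention discussed below, and from_zcurve is the conversion method provided by the package:

    import matplotlib.pyplot as plt
    from microprediction import MicroReader

    mr = MicroReader()
    name = 'z2~c5_bitcoin~c5_ethereum~70.json'    # bivariate z-stream (name inferred from convention)
    lagged_z = mr.get_lagged_values(name=name)
    prctls = [mr.from_zcurve(zvalue=z, dim=2) for z in lagged_z]   # one z-value -> two percentiles
    btc = [p[0] for p in prctls]
    eth = [p[1] for p in prctls]
    plt.scatter(btc, eth)
    plt.xlabel('Bitcoin percentile')
    plt.ylabel('Ethereum percentile')
    plt.show()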

 

And lo, we see the Bitcoin Ethereum copula:

 

Figure: scatter plot of the Bitcoin/Ethereum copula percentiles

 

First thing to note about the code - the name of the bivariate stream:

 

z2~c5_bitcoin~c5_ethereum~70.json

 

can be inferred from the names of the parent streams but also needs a parameter 70, which is the quarantine time of predictions in seconds. You can always just browse all the streams to see what is there: https://www.microprediction.org/browse_streams.html

 

Second thing to note is the unpacking from 1 to 2 dimensions. Notice that we used the from_zcurve method to convert univariate to bivariate data. This unpacking is via a space filling curve (also explained in the article An Introduction to Z-Streams noted above). And you may ask, percentiles compared to what? The answer is percentiles compared to a collective distributional prediction made by all the algorithms fighting to predict the primary streams (you can see the leaderboards at the cardano primary stream for example).

 

The scatter plot can be thought of as samples from a copula (see the Wikipedia article on copulas). The question you might ask is: what bivariate random variable with uniform margins might this be? Some people like to apply monotonic transforms of this variable so that the margins are more familiar. For example, let's create a new set of samples with normally distributed margins like so:

 

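Continuing with the btc and eth percentile lists from the earlier snippet, the inverse normal CDF does the job:

    from scipy.stats import norm

    btc_normal = norm.ppf(btc)    # uniform margins -> standard normal margins
    eth_normal = norm.ppf(eth)
    plt.scatter(btc_normal, eth_normal)
    plt.show()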

 

And now the (transformed) percentiles might be mistaken for bivariate normal:

 

Figure: the same samples after transformation to normal margins

 

What's the correlation?

 

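One line, using the transformed samples from above:

    import numpy as np

    print(np.corrcoef(btc_normal, eth_normal)[0, 1])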

 

Pretty high when I checked ... 80% ... but the stream had just begun life so we'll see how it goes when you run the notebook.

 

Submitting a distributional prediction of the 2-copula

Let's create a model for this data. I'm not going to work too hard here, but you can improve it. In order to be able to submit to www.microprediction.org and appear on the bitcoin ethereum bivariate leaderboard, we need 225 samples.

 

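Here is about the laziest model imaginable: a bivariate normal copula with a fixed correlation (the 0.8 below is just a guess for you to improve on):

    import numpy as np
    from scipy.stats import norm

    num_samples = 225                      # the number of scenarios a submission requires
    rho = 0.8                              # placeholder correlation
    cov = [[1.0, rho], [rho, 1.0]]
    nsamples = np.random.multivariate_normal([0, 0], cov, size=num_samples)  # normal margins
    usamples = norm.cdf(nsamples)                                            # uniform margins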

 

Notice that nsamples are normally distributed whereas usamples are uniform. However, the contest requires univariate submission. So we pack our bivariate percentiles back into univariate via the space filling curve (you did read An Introduction to Z-Streams, right?)

 

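Reusing the reader from earlier (to_zcurve being the inverse of the from_zcurve method we used above):

    zvalues = [mr.to_zcurve(list(u)) for u in usamples]   # two percentiles -> one z-value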

 

And now we are ready to submit them to the contest. However, do you have a write key? You are going to need that. If you don't have one, the following code will create one.

 

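Something like this, assuming new_key is still exported by the microprediction package. Be patient - keys are mined, and higher difficulties take longer:

    from microprediction import new_key

    write_key = new_key(difficulty=12)   # mining a memorable unique identifier takes a while
    print(write_key)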

 

This key allows you to create a MicroWriter, which you will need in order to submit predictions. It is also your identity, so if you are planning on winning cash prizes, don't lose your key. There isn't any way to recover it. Email it to yourself now!

 

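With the key in hand:

    from microprediction import MicroWriter

    mw = MicroWriter(write_key=write_key)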

 

And now ... drumroll ...

 

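Reusing the stream name and the packed zvalues from the snippets above, the submission itself is a single call, with 70 again being the quarantine time:

    mw.submit(name=name, values=zvalues, delay=70)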

 

You won't appear on the leaderboard immediately but by all means head over to the bitcoin ethereum bivariate leaderboard again. Come back tomorrow to see how you are doing.

 

Submitting a trivariate prediction

Finally, we are ready for trivariate prediction. But now it is easy ... essentially the same as before. Here are a few lines of code that create an admittedly poor submission and send it to the API at Microprediction.org.

 

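A sketch, with the trivariate stream name again inferred from the naming convention and a deliberately crude model (independent uniform margins) that you should find easy to beat:

    import numpy as np

    name3 = 'z3~c5_bitcoin~c5_ethereum~c5_ripple~70.json'   # trivariate z-stream (inferred name)
    u3 = np.random.rand(225, 3)                             # crude model: independent uniforms
    zvalues3 = [mw.to_zcurve(list(u)) for u in u3]          # three percentiles -> one z-value
    mw.submit(name=name3, values=zvalues3, delay=70)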

 

Your mission, should you choose to accept it, is to improve on this.

 

Ongoing predictions

If you are lucky you might be able to do okay by running a script to update your predictions every hour or every week.

 

However, if you want to run a program that continuously monitors and alters submissions, say in response to live data likely to impact the volatility of cryptocurrencies or their correlations, you may want to "crawl." A crawler can wander to other streams too, like transport, COVID-19, financial and other time series data. You are welcome to create a crawling program from scratch, but you can also derive from the MicroCrawler class available in the microprediction package, as sketched below. See the crawling instructions on the site or jump straight to the crawler code on Github for inspiration.
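A minimal skeleton, assuming you are happy with the default behaviour of MicroCrawler and just want to see it wander:

    from microprediction import MicroCrawler

    crawler = MicroCrawler(write_key=write_key)   # the write key from earlier
    crawler.run()                                 # visits streams and submits predictions until stopped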

 

Quickstart

Notebook on which this article is based.

Crawling instructions at Microprediction.org
