
How to Enter a Cryptocurrency Copula Contest

Published on July 11, 2020

In a live prediction challenge running at Microprediction.org, algorithms try to predict bivariate and trivariate relationships between five-minute returns of Bitcoin, Ethereum, Ripple, Cardano and Iota. Can you beat them?

It is hoped that out of a collection of interrelated statistical contests, a picture of the fine structure of two-way, three-way and five-way dependencies will emerge. This detailed understanding might surpass what one model or person could achieve. This post comprises two parts:


  1. A discussion of the study of joint behavior, and why trivariate margins might help reconstruct five-way relationships, and why correlation modeling isn't always enough.
  2. A Python walkthrough for those who want to try their hand. 

Rules are on the competition page.

On Joint Distributions and 3-margins

Why model cryptocurrencies? Of course there is direct interest. It is also good practice.

Cryptocurrency and stock price changes are examples of approximate martingales. According to the Efficient Markets Hypothesis (EMH) it should be very difficult to provide an estimate of the mean of the process five minutes forward in time that is substantially better than the current value. However, even if you believe the EMH, that leaves an awful lot of structure to determine.

Stocks, ETFs and many other quantities are used at Microprediction.org to train algorithms which make distributional predictions. Including cryptocurrencies in the mix adds one more type of exercise for the algorithms. Over time, our understanding of which algorithms perform well across a range of different domains may help us make inferences about longer time scale behavior, and this may be relevant for all sorts of applications including portfolio management.


How can we understand and model joint behavior of ... things? In my most recent article I motivated a similar study of bivariate relationships in a physical system (pitch and yaw of a laboratory helicopter). Cryptocurrencies give me an excuse to convince you that trivariate relationships might be important. I draw your attention to the following trivariate stream at Microprediction.org fed by live pricing data for Bitcoin, Ethereum and Ripple.


Figure 1

In a moment we will walk through how to submit your predictions of this z-stream. But one might reasonably wonder whether 3-way relationships are more trouble than they are worth. Is it not sufficient to source good predictions of pairwise relationships? After all, we frequently come across multivariate processes modeled using correlation or covariance matrices (or factor models amounting to the same thing). That is the done thing. Everywhere. Almost without exception. In every field.


However, we know mathematically this can't be the whole story since 2-margins (pairwise probabilities) do not determine a joint distribution of n variables. The question is whether this matters in practice or whether it is a pedantic statistical quibble.


About ten years ago I came across a simple counterexample to the notion that pairwise relationships are sufficient to reconstruct joint distributions. Ten-pin bowling. The example occurred to me shortly before boarding a long haul flight from New York to Sydney. I had just enough time to find a bowling game on a computer, but not enough time to dig up a usable open source simulation. Annoyed but convinced of the thesis, I painstakingly recorded the result of the first bowl many, many times.



Given fatigue and mild oxygen deprivation at 40,000 feet, I can't guarantee that the priceless ten pin bowling data gathered on that trip is reproducible - and of course the way one plays the game dictates a lot. However, if you were to perform this exercise yourself - presumably in a less tedious fashion - I think you'll probably notice something rather interesting.


To convince you that sleepwalking into correlation modeling isn't always a good idea, let's proceed in a perfectly reasonable fashion to construct a model for all ten pins. To make the ten-pin bowling game feel a little bit more like (say) investment management, we assign to each pin a normally distributed random variable. We shall assume that if this normally distributed variable exceeds some threshold, the pin falls and otherwise, it survives. For instance, if X(7) represents the variable attached to pin number 7, we might suppose that pin 7 falls if


X(7) > -0.23


We have the data to back into the number -0.23 so that the probability of the 7th pin falling equals that of the data. Let's say we also have the bivariate data. Then we could also infer correlations between the variable X(7) representing pin number 7 and the variable X(5) representing pin number 5. Most pins will be positively correlated. But if you are a bad bowler like I am, the 7 and 10 pins will be negatively correlated. Perhaps the correlation matrix might look like this:




Now we have a model for all ten pins derived from properties of pairs of pins only. We roll the ten Gaussian variables and check each against its corresponding threshold, each of which is calibrated to the data. Sound reasonable?
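To make the recipe concrete, here is a minimal sketch of a thresholded Gaussian model of this kind. It is not the code I used on the plane. The three fall probabilities and the correlation matrix are made-up placeholders (I use three pins rather than ten to keep it short), and the helper names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def cholesky(corr):
    """Lower-triangular factor L such that L times its transpose equals corr."""
    n = len(corr)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(corr[i][i] - s)
            else:
                L[i][j] = (corr[i][j] - s) / L[j][j]
    return L


def calibrate_thresholds(fall_probs):
    """Pin i falls when X(i) > t_i, so t_i must satisfy P(X > t_i) = p_i."""
    return [NormalDist().inv_cdf(1.0 - p) for p in fall_probs]


def simulate_counts(fall_probs, corr, n_rolls=20000, seed=7):
    """Estimate the distribution of the number of pins that fall."""
    rng = random.Random(seed)
    thresholds = calibrate_thresholds(fall_probs)
    L = cholesky(corr)
    n = len(fall_probs)
    counts = [0] * (n + 1)
    for _ in range(n_rolls):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Correlated normals: x = L z
        x = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        fallen = sum(1 for xi, ti in zip(x, thresholds) if xi > ti)
        counts[fallen] += 1
    return [c / n_rolls for c in counts]


# Made-up inputs (assumptions): the negative entry mimics a 7-10 style pair
FALL_PROBS = [0.59, 0.70, 0.55]
CORR = [[1.0, 0.5, -0.2],
        [0.5, 1.0, 0.3],
        [-0.2, 0.3, 1.0]]

THRESHOLDS = calibrate_thresholds(FALL_PROBS)
COUNT_PROBS = simulate_counts(FALL_PROBS, CORR)
print(THRESHOLDS)
print(COUNT_PROBS)
```

A fall probability of 0.59 backs out to a threshold close to the -0.23 used above. The estimated count probabilities are what one would compare against the data for a chosen "portfolio" of pins.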


Here I have picked a portfolio of pins (pins 1,4,6,7,9 and 10) in order to diagnose whether the model, thus created, could assign reasonable probabilities to the seven outcomes (counting the number of pins in this subset of six that would fall). The probabilities were quite far off as you can see - dramatically underestimating the probability of all six (which is not too surprising) but getting other things wrong too. Of course this is just one cross section of the model so no doubt many other things are wrong about the way it assigns probabilities to all 1024 possible outcomes.


It seems that as far as ten-pin bowling goes, this particular approach isn't cutting it. A few years ago Roberto Fontana and Patrizia Semeraro published this paper providing a beautiful characterization of multivariate Bernoulli distributions that sheds some light on this. I did not have the benefit of this at the time.


However, I suspected that the tiny amount of information added by some 3-margins would help a lot. Something about the geometry is suggestive. A 3-margin comprises probabilities for all eight outcomes of three pins. Eight numbers but there are seven equations you know already (three 2-margins, three 1-margins and you know that all eight numbers add to 1). So really just one more number. Adding 3-margins isn't taking you very fast towards the total number of degrees of freedom in the system (1023).
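The "just one more number" claim can be checked directly. The sketch below, with hypothetical helper names and a toy starting distribution of three independent fair pins, nudges the joint distribution along that single free direction: add a small amount to outcomes where an even number of pins fall and subtract it where an odd number fall. Every 2-margin (and hence every 1-margin) is unchanged, yet the 3-way probabilities move.

```python
from itertools import product


def margins(joint, keep):
    """Marginal distribution over the pin positions listed in `keep`."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def parity_shift(joint, eps):
    """Add eps to even-parity outcomes and subtract it from odd-parity ones."""
    return {o: p + eps * (-1) ** sum(o) for o, p in joint.items()}


# Toy example (assumption): three independent fair pins, 1 means the pin falls
BASE = {o: 0.125 for o in product((0, 1), repeat=3)}
SHIFTED = parity_shift(BASE, eps=0.05)

PAIRS = [(0, 1), (0, 2), (1, 2)]
SAME_2_MARGINS = all(
    abs(margins(BASE, pair)[k] - margins(SHIFTED, pair)[k]) < 1e-12
    for pair in PAIRS
    for k in product((0, 1), repeat=2)
)

ALL_FALL_BASE = BASE[(1, 1, 1)]
ALL_FALL_SHIFTED = SHIFTED[(1, 1, 1)]
print(SAME_2_MARGINS, ALL_FALL_BASE, ALL_FALL_SHIFTED)
```

Here the probability that all three pins fall drops from 0.125 to 0.075 while no pairwise table can tell the two distributions apart.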


But it helped. Here is an example "portfolio" of bowling pins. We count the number that fall once again. The blue probabilities are the data. The green probabilities use a sprinkling of 3-margins. The red probabilities are from the correlation model.




If you look at the mean percentage error across all possible choices of six pin portfolios you get percentage errors as follows:

Averaged across all 210 possible choices of 6-pin portfolios


Percentage error versus data when predicting probabilities for pin counts. Top row is a correlation model. Bottom row a more complex model exploiting 3-margins.


where the first row uses 2-margins only and the second 3-margins. Neither model is perfect but using 3-margins to try to reconstruct the joint distribution clearly helps a lot.


This experiment left a lasting impression on me. I worried that correlation and covariance modeling might quite often be misleading (with factor models a special case). But on the other hand, not every problem I looked at led to similar findings. Later, I replaced ten pins with ten airports and the binary event of whether or not it rained on a given day. The implied Gaussian correlations of a fitted model were as follows:




Yes there are days when it rains at LaGuardia and not at JFK.


In similar fashion to the bowling pins, one can look at subsets of airports and ask the question "at how many airports is it raining?". Now in contrast to bowling pins, here the 2-margin model isn't really all that bad. Indeed one can even try to get away with reducing the rank of the correlation or covariance matrix. This is one example, using a reduction to three factors driving the correlations between airports.




Here "julius" refers to the use of 3-margins (and has nothing to do with orange juice - long story). Again the red normal model refers to the use of the normal copula and the blue is the data. Correlation modeling isn't so bad here.
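If factor reductions of a correlation matrix are unfamiliar, the simplest case has a single common factor. The sketch below is an illustration only, with made-up loadings and hypothetical function names, and it is not the fitting procedure behind the plots. Each variable is a loading times a shared normal factor plus independent noise, so the implied correlation between two variables is the product of their loadings.

```python
import math
import random


def one_factor_corr(loadings):
    """Correlation matrix implied by X_i = b_i * F + sqrt(1 - b_i^2) * e_i."""
    n = len(loadings)
    return [[1.0 if i == j else loadings[i] * loadings[j] for j in range(n)]
            for i in range(n)]


def sample_one_factor(loadings, num, seed=11):
    """Draw rows of unit-variance normals driven by one common factor."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num):
        common = rng.gauss(0.0, 1.0)
        rows.append([b * common + math.sqrt(1.0 - b * b) * rng.gauss(0.0, 1.0)
                     for b in loadings])
    return rows


def sample_corr(rows, i, j):
    """Pearson correlation between columns i and j."""
    n = len(rows)
    mi = sum(r[i] for r in rows) / n
    mj = sum(r[j] for r in rows) / n
    sij = sum((r[i] - mi) * (r[j] - mj) for r in rows)
    sii = sum((r[i] - mi) ** 2 for r in rows)
    sjj = sum((r[j] - mj) ** 2 for r in rows)
    return sij / math.sqrt(sii * sjj)


LOADINGS = [0.9, 0.8, 0.5]  # made-up loadings (assumption)
IMPLIED = one_factor_corr(LOADINGS)
ROWS = sample_one_factor(LOADINGS, num=20000)
EMPIRICAL_01 = sample_corr(ROWS, 0, 1)
print(IMPLIED[0][1], EMPIRICAL_01)
```

A three-factor model works the same way with three shared factors, which is why it needs far fewer parameters than a full correlation matrix.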


Furthermore, I found examples where correlation modeling worked really well, even after simplification of the model. Looking for an example that I hoped would really trip up pairwise modeling, I decided to model the number of squares whose color remains unchanged after five random moves of a Rubik's Cube.





I was the one who was fooled. Not only did 2-margins do a great job of reconstructing this Rubik generated joint distribution, the same was true after I approximated the model with three factors, then two factors. Amazingly even a 1-factor approximation did a bang up job.


Don't believe me? Find a Rubik's Cube program and try it. Here are some examples of "portfolios" of squares on the Rubik's Cube (the numbering scheme is shown above), with the probabilities of the number of squares whose color remains unchanged after five random moves. The "red" single factor model does almost as well as a fancy piecing together of 3-margins:




Once again the geometry, by which I mean the mechanics of the Rubik's Cube, sort of suggests ... maybe ... does it? To you? I'm not sure even in retrospect.


Moral of the story: dependence is tricky. Sometimes it isn't just about correlation or covariance matrices. Sometimes what is missing in correlation models matters more than other times. Cryptocurrencies could be a little bit like bowling pins or they might be more like Rubik's Cubes insofar as the importance of 2-margins and 3-margins is concerned (say if we want to be able to understand how all five currencies move together). That is for you to figure out and I look forward to coming back to these cryptocurrency streams at a later date to see what structure clever algorithms have found.


Python walkthrough

I hope that is sufficient motivation. If you would like to help solve the existential mystery of the joint distribution of cryptocurrencies by helping predict 1-margins, 2-margins and 3-margins, and in doing so get some practice for contests that attract cash awards at Microprediction.org, read on...


This post takes you under the hood of the mechanics, so you understand the game theory involved - at least some of it. I strongly suggest you read The Lottery Paradox blog article as well. 


If you have read the article on helicopter prediction, you may already be familiar with bivariate prediction, but I'm assuming you are coming in cold, and we will also extend to trivariate prediction. If this is all too new, maybe read the article An Introduction to Z-Streams first. Most code in this post is contained in one of two places:

  • This notebook - which is intended to expose you to z-stream mechanics. 
  • The fit.py script - which presents you something of a shortcut.

I will first walk through the former. There, we pull a list of cryptocurrency streams. I wouldn't expect you to know in advance that the ones we are interested in here are prefixed by c5_, but that's the case:




You'll see there are a lot of streams meeting this criterion. That's because five are so-called primary streams but many more are derived streams as well (the ones with tildes). Let's narrow down:


Primary C5


We see five, the first three being:




Next, let's take a look at them:



We see:




There is a fair degree of co-movement in the coin returns, which is hardly surprising. One way to look at this is via Sklar's Theorem. All those other streams (the z2~ and z3~ streams) are really implied Copula functions. Some code and a picture is worth more than my words so:




And lo, we see the Bitcoin Ethereum copula:




First thing to note about the code - the name of the bivariate stream:




can be inferred from the names of the parent streams but also needs a parameter 70 which is the quarantine time of predictions. You can always just browse all the streams to see what is there: https://www.microprediction.org/browse_streams.html


Second thing to note is the unpacking from 1 to 2 dimensions. Notice that we used the from_zcurve method to convert univariate to bivariate data. This unpacking is via a space filling curve (also explained in the article An Introduction to Z-Streams noted above). And you may ask, percentiles compared to what? The answer is percentiles compared to a collective distributional prediction made by all the algorithms fighting to predict the primary streams (you can see the leaderboards at the cardano primary stream for example).
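To give a feel for what a space filling curve does here, the following sketch packs a pair of percentiles into one number by interleaving their bits (a Morton or Z-order code) and unpacks it again. It is an illustration of the idea only. The library's from_zcurve may use a different bit depth or ordering and may apply further transforms, and the function names below are hypothetical.

```python
def interleave(ix, iy, bits):
    """Interleave the bits of two integers into a single Morton code."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)
        code |= ((iy >> b) & 1) << (2 * b + 1)
    return code


def deinterleave(code, bits):
    """Recover the two integers from a Morton code."""
    ix = iy = 0
    for b in range(bits):
        ix |= ((code >> (2 * b)) & 1) << b
        iy |= ((code >> (2 * b + 1)) & 1) << b
    return ix, iy


def pack_percentiles(u1, u2, bits=16):
    """Map a point in the unit square to one number in [0, 1)."""
    scale = 1 << bits
    ix = min(int(u1 * scale), scale - 1)
    iy = min(int(u2 * scale), scale - 1)
    return interleave(ix, iy, bits) / float(1 << (2 * bits))


def unpack_percentile(u, bits=16):
    """Map a number in [0, 1) back to the centre of its grid cell."""
    total = 1 << (2 * bits)
    code = min(int(u * total), total - 1)
    ix, iy = deinterleave(code, bits)
    scale = 1 << bits
    return (ix + 0.5) / scale, (iy + 0.5) / scale


PACKED = pack_percentiles(0.3, 0.8)
UNPACKED = unpack_percentile(PACKED)
print(PACKED, UNPACKED)
```

The round trip recovers each percentile to within half a grid cell, so with 16 bits per coordinate very little is lost by sending one number instead of two.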


As of a recent push, you don't need to do this unpacking manually. There are new methods in the microreader which allow you to get the lagged percentiles in two or three dimensions directly:

lagged_percentiles = mr.get_lagged_copulas(name=name, count= 5000)
I refer you to the reader and there is an example of usage at microactors/fit which I will return to momentarily.


The scatter plot can be thought of as samples from a Copula function (see Wikipedia copula article). The question you might ask is, what bivariate random variable with uniform margins might this be? Some people like to apply monotonic transforms of this variable so that margins are more familiar. For example, let's create a new set of samples with normally distributed margins like so:
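A minimal sketch of such a transform, which is not the notebook's own code: push each percentile through the inverse standard normal distribution function. The function name and the example percentiles are placeholders, and the clipping constant is my assumption to keep values of exactly 0 or 1 from mapping to infinity.

```python
from statistics import NormalDist


def to_normal_margins(percentile_pairs, eps=1e-6):
    """Map rows of percentiles in (0, 1) to rows of standard normal z-values."""
    std_normal = NormalDist()
    out = []
    for row in percentile_pairs:
        # Clip so that exact 0 or 1 does not break the inverse cdf
        clipped = [min(max(u, eps), 1.0 - eps) for u in row]
        out.append([std_normal.inv_cdf(u) for u in clipped])
    return out


# Placeholder percentiles standing in for the unpacked stream data
EXAMPLE_PRCTLS = [[0.5, 0.5], [0.975, 0.025], [0.1, 0.9]]
EXAMPLE_Z = to_normal_margins(EXAMPLE_PRCTLS)
print(EXAMPLE_Z)
```

Because the transform is monotonic in each coordinate, it changes the margins but leaves the copula, and hence the dependence structure, untouched.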




You can also do this directly from the stream:

lagged_zvalues = mr.get_lagged_zvalues(name=name, count= 5000)

or, if you prefer percentiles, use get_lagged_copulas as shown earlier.

And now the (transformed) percentiles might be mistaken for bivariate normal:




What's the correlation?




Pretty high when I checked ... 80% ... but the stream had just begun life so we'll see how it goes when you run the notebook.
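If you would like to compute this without the notebook, a plain Pearson correlation of the two columns of z-values is enough. In the sketch below the pairs are made-up placeholders standing in for real lagged z-values, so the number it prints says nothing about the actual stream.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Placeholder z-value pairs (assumption), one row per observation
PAIRS = [[-1.2, -0.9], [0.3, 0.1], [0.8, 1.1],
         [-0.4, -0.6], [1.5, 1.2], [0.0, 0.2]]
RHO = pearson([p[0] for p in PAIRS], [p[1] for p in PAIRS])
print(RHO)
```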


Submitting a distributional prediction of the 2-copula

Let's create a model for this data. I'm not going to work too hard here, but you can improve it. In order to be able to submit to www.Microprediction.org and appear on the bitcoin ethereum bivariate leaderboard, we need 225 samples.
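One lazy model, sketched here under assumptions rather than copied from the notebook, is a bivariate normal with a single correlation parameter. The value 0.8 is a placeholder you would replace with your own estimate, the function names are hypothetical, and 225 is the sample count mentioned above.

```python
import math
import random
from statistics import NormalDist

NUM_PREDICTIONS = 225  # number of samples a submission needs, per the text


def bivariate_normal_samples(rho, num, seed=42):
    """Draw `num` pairs with N(0,1) margins and correlation rho."""
    rng = random.Random(seed)
    out = []
    for _ in range(num):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        out.append([z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2])
    return out


def to_uniform_margins(samples):
    """Apply the standard normal cdf to every coordinate."""
    cdf = NormalDist().cdf
    return [[cdf(z) for z in row] for row in samples]


nsamples = bivariate_normal_samples(rho=0.8, num=NUM_PREDICTIONS)  # rho is a placeholder
usamples = to_uniform_margins(nsamples)
print(len(nsamples), usamples[:3])
```

Random draws are the lazy choice. A more careful entry might spread the 225 points more evenly over the fitted distribution.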




Notice that nsamples are normally distributed whereas usamples are uniform. However, the contest requires univariate submission. So we pack our bivariate percentiles back into univariate via the space filling curve (you did read An Introduction to Z-Streams, right?)




And now we are ready to submit them to the contest. However, do you have a write key? You are going to need that. If you don't have one, the following code will create one.




This key allows you to create a MicroWriter, which you will need to submit predictions. It is also your identity so if you are planning on winning cash prizes don't lose your key. There isn't any way to recover it. Email it to yourself now!




And now ... drumroll ...




Well, that's one way to do it anyway which illustrates exactly what's going on. However there are submission shortcuts if you wish to submit z-vectors or copulas (percentiles) directly, rather than the pre-image of the space filling curve.
res = mw.submit_zvalues(name=name, zvalues=zvalues, delay=delay )
or if you'd rather submit percentiles:
res = mw.submit_copula(name=name, prctls=prctls, delay=delay )

I refer you to the MicroWriter class for details. However you choose to submit, you won't appear on the leaderboard immediately but by all means head over to the bitcoin ethereum bivariate leaderboard again. Come back tomorrow to see how you are doing.

If you want to run a program that continuously monitors and alters submissions, say in response to live data likely to impact the volatility of cryptocurrencies or their correlations, you may want to "crawl." A crawler can also wander to other streams, like transport, COVID-19, financial and other time series data. You are welcome to create a crawling program from scratch but you can also derive from MicroCrawler available in the microprediction package. See crawling instructions on the site or jump straight to the crawler code on Github for inspiration.

However, there is another way you might also like. It is "set and forget". 


Submitting a trivariate prediction using a Copula library and GitHub actions


For most z-streams, you can probably get away with submitting predictions less frequently, since the distribution of implied price changes (the copulas) might not change by the minute. So here I would refer you to a cute little GitHub repository that does all that for you. You need only fork it and modify as you see fit.


Here is the script fit.py in its entirety. The script requires that an environment variable called WRITE_KEY be set, and in my chosen setup this is accomplished by the GitHub action called daily.yml. With the write key we can, as you can see, very easily fit a copula and make a submission.

from microprediction import MicroWriter
import numpy as np
from pprint import pprint
import matplotlib.pyplot as plt
import random 
import time
import warnings
from copulas.multivariate import GaussianMultivariate
import pandas as pd

# Grab the Github secret 
import os 
WRITE_KEY = os.environ.get('WRITE_KEY')         
ANIMAL = MicroWriter.animal_from_key(WRITE_KEY)    
REPO = 'https://github.com/microprediction/microactors/blob/master/fit.py' 
print('This is '+ANIMAL+' firing up')

STOP_LOSS = 25 # 

# Get historical data, fit a copula, and submit 

def fit_and_sample(lagged_zvalues:[[float]],num:int, copula=None):
    """ Example of creating a "sample" of future values
           lagged_zvalues:     [ [z1,z2,z3] ]  distributed N(0,1) margins, roughly
           copula :            Something from https://pypi.org/project/copulas/
           returns:            [ [z1, z2, z3] ]  representative sample
        Swap out this function for whatever you like. 
    """
    # Remark 1: It's lazy to just sample synthetic data
    # Remark 2: Any multivariate density estimation could go here. 
    # Remark 3: If you prefer uniform margin, use mw.get_lagged_copulas(name=name, count= 5000) 
    # See https://www.microprediction.com/blog/lottery for discussion of this "game" 
    df = pd.DataFrame(data=lagged_zvalues)
    if copula is None:
        copula = GaussianMultivariate() 
    copula.fit(df)  # Fit the copula to the lagged z-values before sampling from it
    synthetic = copula.sample(num)
    return synthetic.values.tolist()

if __name__ == "__main__":
    mw = MicroWriter(write_key=WRITE_KEY)
    mw.set_repository(REPO) # Just polite, creates a CODE badge on the leaderboard
    NAMES = [ n for n in mw.get_stream_names() if 'z2~' in n or 'z3~' in n ]
    for _ in range(1):       
        name = random.choice(NAMES)
        lagged_zvalues = mw.get_lagged_zvalues(name=name, count= 5000)
        if len(lagged_zvalues)>20:
            zvalues = fit_and_sample(lagged_zvalues=lagged_zvalues, num=mw.num_predictions)
            pprint( (name, len(lagged_zvalues), len(zvalues)))
            try:
                # Submit the same sample for every prediction horizon
                for delay in mw.DELAYS:
                    res = mw.submit_zvalues(name=name, zvalues=zvalues, delay=delay )
                    pprint(res)
            except Exception as e:
                print(e)
    # Quit some stream/horizon combinations where we fare poorly
    mw.cancel_worst_active(stop_loss=STOP_LOSS, num=3)

If you prefer, there is a marginally more sophisticated project called microactors-plots which adds some bells and whistles, including Copula eye candy like this example intended to help you identify when the Copulas are not fitting well. Here's an example of a pretty bad fit to the trivariate exchange rate relationship achieved by a Vine copula (direct variety). 



On the other hand the Vine Copula (center variety) does a somewhat better job! 


You can browse them all in the copula gallery


More Resources and Reading

Hopefully this gives you an introduction to live implied Copula contests. Since I wrote the first version of this post, the resources available at the www.microprediction.com knowledge center have come along, and you can find video tutorials. As noted, you can simply fork this repository and enable GitHub actions. There's a notebook in the repo you can use to generate yourself a write key. There are some limitations to this approach, but it will get you on the leaderboards very quickly. If you enjoy copulas or multivariate distributional estimation, this one's for you! See also our guide to GitHub actions.


On the theory side I won't try to survey but in the special case of binary random variables I found these to be more than interesting: 


  • Multivariate Bernoulli Distribution (pdf). Bin Dai, Shilin Ding and Grace Wahba
  • Characterization of Multivariate Bernoulli Distributions with Given Margins (pdf). Roberto Fontana and Patrizia Semeraro
  • On the Sufficiency of Pairwise Interactions in Maximum Entropy Models of Networks (pdf). Lina Merchan and Ilya Nemenman

Please suggest other references, perhaps here