
Live, Online Distribution Estimation Using t-Digests

Published on July 28, 2020

This post presents a Python walkthrough of Statesboy Cat, Decastyle Cat and Thallodal Cat, time series crawlers living at Microprediction.org that leverage online distributional estimation with t-digests. Decastyle Cat is doing a reasonable job of predicting a simulated epidemic. It also performs well predicting the altitude of an unidentified low-flying object, bike activity near New York City hospitals, the position of a badminton player's neck, and bitcoin movements over five-minute intervals (and, given a chance, will try any time series you publish).

An example of a time series predicted using TDigest: scaled five minutely log bitcoin price changes

This post may be of interest if you:

• Have not used the useful Python TDigest library, wherein k-means clustering is used to maintain a sketch of a cumulative distribution.
• Haven't been introduced to the nascent prediction web (where have you been?) that will eventually change all of commerce.

Or it may be of interest if you already crawl at Microprediction.org (for a variety of reasons such as Benchmarking AutoML Vendors, or gathering data for your next paper, or going after prizes) and...

• You need a baseline distributional estimate to anchor to your favorite point estimate
• You wish to understand, by example, the use of OnlineStreamCrawler

Wait, back up. What's a crawler, you say? To crawl is to create a time series algorithm that wanders from live data stream to live data stream at Microprediction.org, the prototypical host for the prediction network supporting a live lattice of nowcasts whose goals are described here. If you have a good time series algorithm, crawling is a great way to benchmark it. For more context and motivation, I suggest reading If You Love Your Algorithm Set it Free or Dorothy, You're Not in Kaggle Anymore.

T-digests

The t-digest is described in a paper by Ted Dunning and Otmar Ertl. The basic idea is to cluster samples using k-means.

 The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. This t-digest data structure can be used to estimate quantiles, compute other rank statistics or even to estimate related measures like trimmed means.

With a t-digest, we try to summarize a long collection of data points by clustering them. For each centroid, we carry around a mean and a count of samples. In contrast to previous work on so-called q-digests, the t-digest limits the size of a cluster with a scale function that ensures close fidelity to the actual cumulative distribution function as we approach 0 or 1 (in between, cluster sizes are allowed to grow larger, so the CDF approximation might not be quite as tight, but it is still very accurate).

The t-digest approximation of a univariate distribution can be updated sequentially very fast, making it ideal for streaming data and leakage-free time series analysis. Digests are relatively small (compared to storing all the lags) yet provide part per million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles. It takes only about 100ns to add a new data point. And while we won't be exploiting it, t-digests can also be added together (i.e. merged) which makes them ideal for federated learning or efficient parallel batch processing.
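To make that interface concrete, here is a toy stand-in that keeps every sample in a sorted list. The real tdigest.TDigest compresses samples into centroids, but it is driven the same way: update with one point at a time, then query percentiles on a 0-100 scale. The TinyDigest class itself is purely illustrative.

```python
import bisect
import random

class TinyDigest:
    """Toy stand-in for tdigest.TDigest: keeps every sample, no compression."""
    def __init__(self):
        self.xs = []

    def update(self, x):
        bisect.insort(self.xs, x)   # sequential updates, one point at a time

    def percentile(self, q):
        """q on a 0-100 scale, mirroring the tdigest package's convention."""
        i = min(len(self.xs) - 1, round(q / 100.0 * (len(self.xs) - 1)))
        return self.xs[i]

random.seed(0)
digest = TinyDigest()
for _ in range(10000):
    digest.update(random.gauss(0.0, 1.0))

median = digest.percentile(50)    # close to 0 for a standard normal
upper = digest.percentile(97.5)   # close to 1.96
```

The real digest discards detail in the middle of the distribution (where the scale function allows big clusters) while keeping the tails sharp, which is exactly what the quantile queries need.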

We will use an implementation of t-digests written by Cameron Davidson-Pilon (Github homepage). We first describe one of several examples of crawlers provided in the crawler_examples directory of the microprediction package github repo. This example, called Decastyle Cat, is not the shortest path to implementation but it will give you an introduction to the mechanics of OnlineStreamCrawler, a useful pattern for many time series models beyond the one we consider. We will also mention two shortcuts that take advantage of the speed of t-digest.

Option 1: OnlineStreamCrawler

The OnlineStreamCrawler class is useful for creating stateful crawlers from models that require an offline fitting process. All crawlers derived from the parent MicroCrawler, including this one, do the following:

• Maintain a schedule of when to predict live data for a collection of streams
• Occasionally find a new stream, or give up on a stream due to poor performance

The OnlineStreamCrawler does this and, in addition:

• Cycles through streams, when it has a few spare seconds, and updates some kind of model (or "state") associated with that stream.

The OnlineStreamCrawler might be considered overkill for our t-digest example, but its implementation is so simple we can give its code in entirety:
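In outline, the pattern looks like the following stripped-down, stdlib-only sketch. The method names and signatures here are illustrative rather than the package's exact API, and the real class also inherits all of MicroCrawler's scheduling and navigation.

```python
class OnlineStreamCrawlerSketch:
    """Sketch of the OnlineStreamCrawler idea: keep per-stream state and
    refresh it whenever there are spare seconds."""

    def __init__(self):
        self.stream_state = {}   # stream name -> model state
        self.active = []         # streams we are currently predicting

    def initial_state(self, name, lagged_values, lagged_times):
        raise NotImplementedError   # derived crawlers fit a model here

    def update_state(self, state, lagged_values, lagged_times):
        raise NotImplementedError   # derived crawlers refresh the model here

    def downtime(self, get_lags):
        """Cycle through streams and bring each one's state up to date."""
        for name in self.active:
            values, times = get_lags(name)
            if name not in self.stream_state:
                self.stream_state[name] = self.initial_state(name, values, times)
            else:
                self.stream_state[name] = self.update_state(
                    self.stream_state[name], values, times)

# A trivial derived crawler whose "model" is just a count of lags seen:
class CountingCrawler(OnlineStreamCrawlerSketch):
    def initial_state(self, name, values, times):
        return {"n": len(values)}

    def update_state(self, state, values, times):
        return {"n": len(values)}

crawler = CountingCrawler()
crawler.active = ["faux_stream.json"]
crawler.downtime(lambda name: ([1.0, 2.0, 3.0], [30, 20, 10]))
```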

(You may prefer to read the code in github).

We now derive from this OnlineStreamCrawler, starting with a tiny bit of navigation. We will create DigestCrawler (code) as follows:
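The shape of the subclass is roughly the following (the class name, method name and constructor arguments below are illustrative stand-ins; the real constructor is in the linked code). The z2~ and z3~ prefixes mark bivariate and trivariate prediction streams at Microprediction.org.

```python
class DigestCrawlerSketch:
    """Illustrative constructor and navigation for a DigestCrawler-like class.
    The real one derives from OnlineStreamCrawler; include_stream is a
    stand-in for the package's navigation hooks."""

    def __init__(self, write_key=None, min_lags=500):
        self.write_key = write_key
        self.min_lags = min_lags   # skip streams with too little history

    def include_stream(self, name):
        # Avoid bivariate (z2~) and trivariate (z3~) prediction streams
        return not name.startswith(("z2~", "z3~"))

crawler = DigestCrawlerSketch(write_key="YOUR_WRITE_KEY")
```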

Aside from the usual constructor boilerplate, we are telling our crawler not to look at bivariate or trivariate prediction streams (if you are interested in bivariate or trivariate prediction I refer you to the cryptocurrency contest article). There is more information about navigation in the article Economical Statistics.

Initial state for a stream of data

Next, we have to tell the crawler how to initialize "state" for a stream it encounters for the first time. State can be anything you want.
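A stdlib-only sketch of what that initialization might look like follows. The helpers below are crude stand-ins for is_process and approx_dt from samplers.py, not their actual logic, and lags are assumed to arrive most-recent-first.

```python
import statistics

class TinyDigest:
    """Toy stand-in for tdigest.TDigest (keeps raw samples, no compression)."""
    def __init__(self):
        self.xs = []
    def update(self, x):
        self.xs.append(x)

def is_process(values):
    """Stand-in heuristic: difference when steps vary less than levels do."""
    diffs = [a - b for a, b in zip(values[:-1], values[1:])]
    return statistics.pstdev(diffs) < statistics.pstdev(values)

def approx_dt(times):
    """Stand-in: median gap between (most-recent-first) timestamps."""
    gaps = [a - b for a, b in zip(times[:-1], times[1:])]
    return statistics.median(gaps) if gaps else 60.0

def initial_state(lagged_values, lagged_times):
    as_process = is_process(lagged_values)
    xs = ([a - b for a, b in zip(lagged_values[:-1], lagged_values[1:])]
          if as_process else list(lagged_values))
    digest = TinyDigest()
    for x in xs:
        digest.update(x)
    return {"t": lagged_times[0],           # most recent timestamp
            "digest": digest,               # built from lags (or their diffs)
            "as_process": as_process,       # did we decide to difference?
            "dt": approx_dt(lagged_times)}  # typical arrival spacing
```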

Here we are borrowing the convenience methods is_process and approx_dt from the utility module samplers.py in the microprediction package. They respectively try to decide whether a time series should be differenced, and estimate the typical time between data points. Whether or not we difference, the state we create comprises the following:

1. A record (t) of the most recent timestamp in lagged data
2. A t-digest constructed using lagged values
3. A record of whether we decided it should be differenced or not (boolean as_process)
4. A record of the typical time between data point arrivals

Updating state

Next, we imagine that at some time in the future we once again visit the same stream and retrieve a buffer of lagged values and times (using lagged_values and lagged_times crawler methods). We need to update the t-digest as follows:
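A sketch of that update is below (a bare-bones stand-in digest again; the real code feeds a tdigest.TDigest). "New" simply means any lag whose timestamp is later than the one recorded in state["t"], with lags arriving most-recent-first.

```python
class TinyDigest:
    """Toy stand-in for tdigest.TDigest."""
    def __init__(self):
        self.xs = []
    def update(self, x):
        self.xs.append(x)

def update_state(state, lagged_values, lagged_times):
    """Feed the digest only points newer than state['t']. A sketch of the
    pattern, not Decastyle Cat's exact code."""
    new = [(t, v) for t, v in zip(lagged_times, lagged_values) if t > state["t"]]
    new.reverse()                                   # oldest new point first
    if state["as_process"]:
        prev = state.get("last_value")
        for t, v in new:
            if prev is not None:
                state["digest"].update(v - prev)    # one step of the walk
            prev = v
        state["last_value"] = prev
    else:
        for t, v in new:
            state["digest"].update(v)
    if new:
        state["t"] = new[-1][0]                     # newest timestamp seen
    return state
```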

Notice that thanks to the excellent t-digest package, all we have to do is figure out which data points we haven't yet provided to the digest, and whether to difference or not.

Making predictions

Finally, we need to implement a sample method for our crawler. This is the all-important method that provides 225 guesses of a future value of the stream (read about why we don't collect point estimates in Where Will a Badminton Player Move to Next). To be precise, we are trying to predict the value taken by the next data point to arrive after a fixed delay has passed. The delay is passed into the sample method for our convenience.
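For the no-differencing case, the heart of the method can be sketched as reading off evenly spaced percentiles of the digest (once more with a toy stand-in for tdigest.TDigest; the real sample method also handles the differenced case).

```python
import bisect

class TinyDigest:
    """Toy stand-in for tdigest.TDigest."""
    def __init__(self, xs=()):
        self.xs = sorted(xs)
    def update(self, x):
        bisect.insort(self.xs, x)
    def percentile(self, q):
        i = min(len(self.xs) - 1, round(q / 100.0 * (len(self.xs) - 1)))
        return self.xs[i]

def sample(state, num=225):
    """Return num guesses by inverting the digest's CDF at evenly spaced
    percentiles (applies when the series was not differenced)."""
    ps = [100.0 * (i + 0.5) / num for i in range(num)]
    return [state["digest"].percentile(p) for p in ps]

guesses = sample({"digest": TinyDigest(range(1000))})
```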

(For readability one line is truncated here, so I suggest you read the source code directly if that is convenient). This sample function is very straightforward if the time series does not need differencing - for example, if it is a stock price change. You can see that morally speaking all we need to do is extract the inverse cumulative distribution of the t-digest.

Things are a tiny bit more complicated if it is a process. In that case we have been storing a distributional approximation of each step in an independent random walk. Out of sheer laziness I generate a distribution forward in time using Monte Carlo here, the last bastion of the scoundrel. The convenience function inv_cdf_walk is also provided in samplers.py and it does the following:
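The idea can be sketched in a few lines. Here increments are drawn uniformly from a list of observed steps, standing in for draws from the t-digest's inverse CDF at uniform random percentiles; the function below matches inv_cdf_walk in spirit only.

```python
import random

def inv_cdf_walk_sketch(step_samples, num_steps, num_paths=225, seed=0):
    """Monte Carlo stand-in for samplers.inv_cdf_walk: approximate the
    distribution of a sum of num_steps i.i.d. increments by simulating
    paths and summing random draws."""
    rng = random.Random(seed)
    paths = [sum(rng.choice(step_samples) for _ in range(num_steps))
             for _ in range(num_paths)]
    return sorted(paths)   # sorted paths double as inverse-CDF samples

# With deterministic unit steps, every simulated path lands on num_steps:
flat = inv_cdf_walk_sketch([1.0], num_steps=5)
```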

The reader may wish to improve on this sloppy approximation of a sum of random variables, each of which is described by a t-digest.

Running the crawler

Now for the fun part. I create a crawler with a write_key:

If you don't have a write_key, just leave out that argument and it will create one on the fly (takes a few minutes though, be patient). And now run it:
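The calling pattern is roughly the following, with a stub standing in for the real DigestCrawler; the write key and repo url below are placeholders, not working values.

```python
class DigestCrawler:
    """Stub with the same calling pattern as the real class, which lives in
    the microprediction repo and talks to Microprediction.org."""
    def __init__(self, write_key=None, url=None):
        self.write_key = write_key   # omit it and a key is minted (slowly)
        self.url = url               # optional: where your crawler's code lives
        self.min_lags = 500          # skip streams with too little history

    def run(self):
        return "crawling as %s" % (self.write_key or "<freshly minted key>")

crawler = DigestCrawler(
    write_key="YOUR_WRITE_KEY",
    url="https://github.com/microprediction/microprediction")
crawler.run()
```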

Here I have included the url of the repo where the code for the crawler resides. This is entirely optional. The min_lags property will ensure that we don't participate in a stream unless there are at least 500 lagged values available to our crawler.

Checking on how it does

Next I race over to the dashboard and punch in my write_key. Then we can see where this crawler is performing well:

And we can also see that there are a few streams that it is probably not well suited to, at least not without further modifications:

That's okay. The whole point is that your crawler can find problems that it is good at on its own. And aside from automatically pointing you towards your next paper, you might accidentally help someone with a real live prediction need.

Option 2: StreamCrawler

Now for the first shortcut.

The example crawler called Thallodal Cat is another good one to fork (see code). It is similar to Decastyle Cat but simpler, insofar as it derives from StreamCrawler (code) rather than OnlineStreamCrawler. In contrast to OnlineStreamCrawler, StreamCrawler does not, by default, do anything in its downtime. Instead it assumes that all processing can be done on the fly, which is certainly the case for the speedy t-digest.

As you can see from the code for Thallodal Cat, StreamCrawler offers the user the option of defining a sample_using_state method instead of overriding sample. The use is otherwise similar to the above:
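A sketch of such a hook follows; the signature and the state-in, samples-out shape are assumptions based on the description above, and a sorted list stands in for the digest.

```python
def sample_using_state(state, lagged_values, lagged_times, num=225):
    """Sketch of a StreamCrawler-style hook: fold new lags into state and
    return num guesses in one call, with no separate downtime pass."""
    new = [v for t, v in zip(lagged_times, lagged_values) if t > state["t"]]
    for v in reversed(new):                  # oldest new point first
        state["samples"].append(v)           # stand-in for digest.update(v)
    if lagged_times:
        state["t"] = max(state["t"], lagged_times[0])
    xs = sorted(state["samples"])
    ps = [(i + 0.5) / num for i in range(num)]
    guesses = [xs[min(len(xs) - 1, round(p * (len(xs) - 1)))] for p in ps]
    return state, guesses

state = {"t": 0, "samples": []}
state, guesses = sample_using_state(state, [3.0, 2.0, 1.0], [30, 20, 10])
```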

Option 3: SequentialStreamCrawler

The approach above still requires some book-keeping on your part. Let's be really lazy. Check out Statesboy Cat (code) which uses SequentialStreamCrawler (code) to avoid manual management of ancillary state (figuring out which lags are new) and sampling logic. This class is a little less flexible, but it is certainly very easy to use.

We introduce the notion of a DistributionMachine, a state machine that exposes an inverse cumulative distribution function. This is essentially synonymous with the crawler itself, which merely applies the machine to every stream and then performs a Monte Carlo simulation in similar fashion to Decastyle Cat.

If you choose to use SequentialStreamCrawler, you need only implement DistributionMachine and pass it in. If you are super lazy, you could even use the default instance that ignores all observations and behaves like a standard normal distribution.

That might have limited success. In our example, we want it to behave like t-digest instead.
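A digest-flavored machine might be sketched like this. The update/inv_cdf method names are assumptions about the DistributionMachine interface, and a sorted list again stands in for the real TDigest.

```python
import bisect
from statistics import NormalDist

class DistributionMachine:
    """Sketch of the interface: the default ignores observations and
    behaves like a standard normal distribution."""
    def update(self, value):
        pass
    def inv_cdf(self, p):
        return NormalDist().inv_cdf(p)

class DigestMachine(DistributionMachine):
    """Digest-flavored machine: remember observations and invert their
    empirical CDF instead of a normal's."""
    def __init__(self):
        self.xs = []
    def update(self, value):
        bisect.insort(self.xs, value)
    def inv_cdf(self, p):
        if not self.xs:
            return super().inv_cdf(p)   # nothing seen yet: fall back to normal
        i = min(len(self.xs) - 1, int(p * len(self.xs)))
        return self.xs[i]

machine = DigestMachine()
for x in [4.0, 1.0, 3.0, 2.0]:
    machine.update(x)
```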

That's all that is required. The sampling logic and state management can be inherited. We merely instantiate a SequentialStreamCrawler by passing it the type (not an instance) of DigestMachine:
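Instantiation might look like this, with stubbed classes; the machine_type parameter name is an assumption, and the point is that the crawler receives the type so it can construct one fresh machine per stream it visits.

```python
class DigestMachine:
    """Placeholder machine; the real one wraps a t-digest."""
    def __init__(self):
        self.n_seen = 0
    def update(self, value):
        self.n_seen += 1

class SequentialStreamCrawler:
    """Stub of the real class: handed a machine *type*, it lazily builds
    one machine per stream."""
    def __init__(self, write_key=None, machine_type=None):
        self.write_key = write_key
        self.machine_type = machine_type   # note: the type, not an instance
        self._machines = {}

    def machine_for(self, stream_name):
        if stream_name not in self._machines:
            self._machines[stream_name] = self.machine_type()
        return self._machines[stream_name]

crawler = SequentialStreamCrawler(write_key="YOUR_WRITE_KEY",
                                  machine_type=DigestMachine)
machine = crawler.machine_for("btc_usd.json")
machine.update(1.23)
```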

A DistributionMachine could be many things. It might be a Kalman filter, or a running estimate of moments for a normal distribution. It could be a sequentially updated keras model. The DistributionMachine could combine two or more different types of estimates, tracking a point estimate forecast and also a distribution. Using this pattern you can easily create crawlers that do a good job of forecasting future values.

Crawler code examples

• Statesboy Cat using SequentialStreamCrawler (code)
• Thallodal Cat using StreamCrawler (code)
• Decastyle Cat using OnlineStreamCrawler (code)