Want to quickly improve your position on the leaderboard?

This post presents a Python walkthrough of Statesboy Cat, Decastyle Cat and Thallodal Cat, time series crawlers living at Microprediction.org that leverage online distributional estimation with t-digests. *Decastyle Cat* is doing a reasonable job of predicting a simulated epidemic. It also performs well predicting the altitude of an unidentified low-flying object, bike activity near New York City hospitals, the position of a badminton player's neck, and bitcoin movements over five-minute intervals (and, given the chance, will try any time series you publish).

**An example of a time series predicted using TDigest: scaled five-minutely log bitcoin price changes**

(link).

This post may be of interest if you:

- Have not used the useful Python TDigest library, wherein k-means clustering is used to maintain a sketch of a cumulative distribution.
- Haven't been introduced to the nascent prediction web (where have you been?) that will eventually change all of commerce.

Or it may be of interest if you already *crawl* at Microprediction.org (for a variety of reasons, such as benchmarking AutoML vendors, gathering data for your next paper, or going after prizes) and...

- You need a baseline distributional estimate to anchor to your favorite point estimate
- You wish to understand, by example, the use of OnlineStreamCrawler

Wait, back up. What's a crawler, you say? To crawl is to create a time series algorithm that wanders from live data stream to live data stream at Microprediction.org - the prototypical host for the prediction network, supporting a live lattice of nowcasts whose goals are described here. If you have a good time series algorithm, crawling is a great way to benchmark it. For more context and motivation, I suggest reading:

*If You Love Your Algorithm, Set It Free* or *Dorothy, You're Not in Kaggle Anymore*.

**T-digests**

The t-digest is described in a paper by Ted Dunning and Otmar Ertl. The basic idea is to cluster samples using k-means.

> The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. This t-digest data structure can be used to estimate quantiles, compute other rank statistics or even to estimate related measures like trimmed means.

With a t-digest, we try to summarize a long collection of data points by clustering them. We carry around the mean and the number of samples for each centroid. In contrast to previous work on so-called q-digests, the t-digest limits the size of a cluster with a scale function that ensures close fidelity to the actual cumulative distribution function as we approach 0 or 1 (in between, cluster sizes are allowed to grow larger, so the CDF approximation may be coarser there, but it is still very accurate).

The t-digest approximation of a univariate distribution can be updated sequentially very fast, making it ideal for streaming data and leakage-free time series analysis. Digests are relatively small (compared to storing all the lags) yet provide part per million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles. It takes only about 100ns to add a new data point. And while we won't be exploiting it, t-digests can also be added together (i.e. merged) which makes them ideal for federated learning or efficient parallel batch processing.

We will use an implementation of t-digests written by Cameron Davidson-Pilon (Github homepage). We first describe one of several examples of crawlers provided in the crawler_examples directory of the microprediction package github repo. This example, called *Decastyle Cat*, is not the shortest path to implementation but it will give you an introduction to the mechanics of **OnlineStreamCrawler**, a useful pattern for many time series models beyond the one we consider. We will also mention two shortcuts that take advantage of the speed of t-digest.

**Option 1: OnlineStreamCrawler**

The **OnlineStreamCrawler** class is useful for creating stateful crawlers from models that require an offline fitting process. All crawlers derived from the parent **MicroCrawler**, including this one, do the following:

- Maintain a schedule of when to predict live data for a collection of streams
- Occasionally find a new stream, or give up on a stream due to poor performance

The **OnlineStreamCrawler** does this and, in addition:

- Cycles through streams, when it has a few spare seconds, and updates some kind of model (or "state") associated with that stream.

The OnlineStreamCrawler might be considered overkill for our t-digest example, but its implementation is so simple we can give its code in its entirety:

(You may prefer to read the code in github).
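The actual class lives in the microprediction repo linked above. As a purely illustrative stand-in (method names here are mine and are not guaranteed to match the real source), the pattern it implements - per-stream state that gets created or refreshed in spare seconds - might be sketched like this:

```python
class OnlineStreamCrawlerSketch:
    """Illustrative stand-in for the pattern, not the microprediction source."""

    def __init__(self):
        self.stream_state = {}  # stream name -> model state

    def initial_state(self, name, lagged_values, lagged_times):
        raise NotImplementedError  # subclass supplies a model

    def update_state(self, state, lagged_values, lagged_times):
        raise NotImplementedError  # subclass absorbs newly arrived points

    def sample(self, state, delay):
        raise NotImplementedError  # 225 guesses of the next value

    def downtime(self, name, lagged_values, lagged_times):
        # Called when the crawler has a few spare seconds:
        # create state for a new stream, or refresh an existing one.
        if name not in self.stream_state:
            self.stream_state[name] = self.initial_state(name, lagged_values, lagged_times)
        else:
            self.stream_state[name] = self.update_state(
                self.stream_state[name], lagged_values, lagged_times)
```

The point of the pattern is that the expensive work happens in `downtime`, so the prediction call itself stays fast.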

We now derive from this OnlineStreamCrawler, starting with a tiny bit of navigation. We will create DigestCrawler (code) as follows:

Aside from the usual constructor boilerplate, we are telling our crawler not to look at bivariate or trivariate prediction streams (if you are interested in bivariate or trivariate prediction I refer you to the cryptocurrency contest article). There is more information about navigation in the article Economical Statistics.

**Initial state for a stream of data**

Next, we have to tell the crawler how to initialize "state" for a stream it encounters for the first time. State can be anything you want.

Here we borrow two convenience methods, **is_process** and **approx_dt**, from the utility module samplers.py in the microprediction package. They respectively try to decide whether a time series should be differenced, and estimate the typical time between data points. Whether or not we difference, the state we create comprises the following:

- A record (**t**) of the most recent timestamp in lagged data
- A t-digest constructed using lagged values
- A record of whether we decided it should be differenced or not (boolean **as_process**)
- A record of the typical time between data point arrivals
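Roughly, and with crude heuristics of my own standing in for the real is_process and approx_dt from samplers.py (and a plain list standing in for the t-digest), the state construction might look like this:

```python
import statistics

def initial_state_sketch(lagged_values, lagged_times):
    # Lagged buffers at Microprediction arrive most-recent-first; assumed here too.
    values = list(reversed(lagged_values))   # chronological order
    times = list(reversed(lagged_times))

    # Crude stand-in for is_process(): difference when increments vary less than levels.
    diffs = [b - a for a, b in zip(values, values[1:])]
    as_process = statistics.pstdev(diffs) < statistics.pstdev(values)

    # Crude stand-in for approx_dt(): median spacing between arrival times.
    gaps = [b - a for a, b in zip(times, times[1:])]
    dt = statistics.median(gaps) if gaps else 60.0

    # A plain list stands in for the digest; the real crawler feeds a TDigest.
    samples = diffs if as_process else values
    return {'t': times[-1], 'digest': list(samples), 'as_process': as_process, 'dt': dt}
```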

**Updating state**

Next, we imagine that at some time in the future we once again visit the same stream and retrieve a buffer of lagged values and times (using the **lagged_values** and **lagged_times** crawler methods). We need to update the t-digest as follows:

Notice that thanks to the excellent t-digest package, all we have to do is figure out which data points we haven't yet provided to the digest, and whether to difference or not.
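That bookkeeping amounts to something like the following sketch (my own stand-in, not the repo code): keep the last-seen timestamp, feed only newer points, and difference on the fly if the stream was judged to be a process.

```python
class ListDigest:
    """Tiny stand-in exposing the update() method a real TDigest has."""
    def __init__(self):
        self.points = []
    def update(self, x):
        self.points.append(x)

def update_state_sketch(state, lagged_values, lagged_times):
    # Assumed state layout (illustrative): last-seen timestamp 't', last level
    # 'last_value', boolean 'as_process', and a digest with an update() method.
    new = sorted((t, v) for t, v in zip(lagged_times, lagged_values) if t > state['t'])
    for t, v in new:
        x = v - state['last_value'] if state['as_process'] else v
        state['digest'].update(x)   # the only real work: feed unseen points to the digest
        state['last_value'] = v
        state['t'] = t
    return state
```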

**Making predictions**

Finally, we need to implement a **sample** method for our crawler. This is the all-important method that provides 225 guesses of a future value of the stream (read about why we don't collect point estimates in Where Will a Badminton Player Move to Next). To be precise, we are trying to predict the value taken by the next data point to arrive after a fixed delay has passed. The delay is passed into the **sample** method for our convenience.

(For readability one line is truncated here, so I suggest you read the source code directly if that is convenient). This sample function is very straightforward if the time series does not need differencing - for example, if it is a stock price change. You can see that morally speaking all we need to do is extract the inverse cumulative distribution of the t-digest.
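For the non-differenced case, "extracting the inverse cumulative distribution" can be sketched as follows, with a plain sorted list standing in for the digest (the real code would call digest.percentile instead):

```python
def sample_sketch(samples, n=225):
    """Return n guesses by reading off evenly spaced quantiles.

    'samples' is a plain list standing in for the digest; a real
    implementation would ask the t-digest for the same quantiles.
    """
    ordered = sorted(samples)
    guesses = []
    for i in range(n):
        p = (i + 0.5) / n                              # evenly spaced in (0, 1)
        idx = min(int(p * len(ordered)), len(ordered) - 1)
        guesses.append(ordered[idx])                   # empirical inverse CDF
    return guesses
```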

Things are a tiny bit more complicated if it is a process. In that case we have been storing a distributional approximation of each step in an independent random walk. Out of sheer laziness I generate a distribution forward in time using Monte Carlo here, the last bastion of the scoundrel. The convenience function **inv_cdf_walk** is also provided in samplers.py and it does the following:

The reader may wish to improve on this sloppy approximation of a sum of random variables, each of which is described by a t-digest.
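A self-contained sketch of such a Monte Carlo walk (my own illustration, not the actual inv_cdf_walk from samplers.py): draw each step by inverting the empirical CDF of the observed increments, and sum k of them per path.

```python
import random

def inv_cdf_walk_sketch(step_samples, k, num_paths=225, rng=None):
    """Monte Carlo stand-in: each path is the sum of k independent steps,
    each drawn by inverting the empirical CDF of observed increments."""
    rng = rng or random.Random(7)        # seeded for reproducibility
    ordered = sorted(step_samples)

    def draw_step():
        p = rng.random()                 # uniform in (0, 1)
        return ordered[min(int(p * len(ordered)), len(ordered) - 1)]

    return [sum(draw_step() for _ in range(k)) for _ in range(num_paths)]
```

Summing independent draws ignores any serial dependence between steps, which is exactly the sloppiness the author invites the reader to improve upon.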

**Running the crawler**

Now for the fun part. I create a crawler with a write_key:

If you don't have a write_key, just leave out that argument and it will create one on the fly (takes a few minutes though, be patient). And now run it:

Here I have included the url of the repo where the code for the crawler resides. This is entirely optional. The **min_lags** property will ensure that we don't participate in a stream unless there are at least 500 lagged values available to our crawler.

**Checking on how it does**

Next I race over to the dashboard and punch in my write_key. Then we can see where this crawler is performing well:

And we can also see that there are a few streams that it is probably not well suited to, at least not without further modifications:

That's okay. The whole point is that your crawler can find problems that it is good at on its own. And aside from automatically pointing you towards your next paper, you might accidentally help someone with a real live prediction need.

**Option 2: StreamCrawler**

Now for the first shortcut.

The example crawler called *Thallodal Cat* is another good one to fork (see code). It is similar to Decastyle Cat but simpler, insofar as it derives from StreamCrawler (code) rather than OnlineStreamCrawler. In contrast to OnlineStreamCrawler, StreamCrawler does not, by default, do anything in its downtime. Instead it assumes that all processing can be done on the fly, which is certainly the case for the speedy t-digest.

As you can see from the code for *Thallodal Cat*, StreamCrawler offers the user the option of defining a **sample_using_state** method instead of overriding **sample**. The use is otherwise similar to the above:

**Option 3: SequentialStreamCrawler**

The approach above still requires some book-keeping on your part. Let's be really lazy. Check out *Statesboy Cat* (code), which uses SequentialStreamCrawler (code) to avoid manual management of ancillary state (figuring out which lags are new) and sampling logic. This class is a little less flexible, but it is certainly very easy to use.

We introduce the notion of a **DistributionMachine**: a state machine equipped with an inverse cumulative distribution function. This is essentially the heart of the crawler itself, which will merely apply one to every stream and then perform a Monte Carlo in similar fashion to *Decastyle Cat*.

If you choose to use **SequentialStreamCrawler**, you need only implement a **DistributionMachine** and pass it in. If you are super lazy, you could even use the default instance, which ignores all observations and behaves like a standard normal distribution.

That might have limited success. In our example, we want it to behave like a t-digest instead.
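To illustrate the interface (class and method names here are mine, not the microprediction source): a default machine that ignores data and answers like a standard normal, and a digest-flavoured machine that tracks observations. The real version would feed a TDigest and call digest.percentile(100 * p).

```python
from statistics import NormalDist

class DistributionMachineSketch:
    """Illustrative interface: absorb observations, invert the CDF to sample.
    The default ignores all data and behaves like a standard normal."""
    def update(self, value):
        pass
    def inv_cdf(self, p):
        return NormalDist().inv_cdf(p)

class DigestMachineSketch(DistributionMachineSketch):
    """Stand-in that keeps raw observations; the real machine would
    maintain a TDigest and answer via digest.percentile(100 * p)."""
    def __init__(self):
        self.points = []
    def update(self, value):
        self.points.append(value)
    def inv_cdf(self, p):
        ordered = sorted(self.points)
        return ordered[min(int(p * len(ordered)), len(ordered) - 1)]
```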

That's all that is required. The sampling logic and state management can be inherited. We merely instantiate a SequentialStreamCrawler by passing it the *type* (not an instance) of DigestMachine:

A DistributionMachine could be many things. It might be a Kalman filter, or a running estimate of moments for a normal distribution. It could be a sequentially updated keras model. The DistributionMachine could combine two or more different types of estimates, tracking a point estimate forecast and also a distribution. Using this pattern you can easily create crawlers that do a good job of forecasting future values.

**Crawler code examples**

- *Statesboy Cat* using SequentialStreamCrawler (code)
- *Thallodal Cat* using StreamCrawler (code)
- *Decastyle Cat* using OnlineStreamCrawler (code)

**Further reading**

- Crawling **quickstart** at Microprediction.org
- Computing Extremely Accurate Quantiles Using t-digests, by Ted Dunning (**working paper**)
- Ted Dunning's implementation (**github repo**)
- The implementation we used (**github repo**)
