
Tears of Joy: A Primer on Standardizing Streaming Data, the Easy Way

Published on September 29, 2020

Life moves pretty fast, as Ferris Bueller once said, and if you don't stop to look around once in a while you might miss it. He was on to something, no doubt. Yet when it comes to streaming data, it seems possible to miss plenty of things even if you do, every once in a while, stop to look around.

An insight might appear for a minute or two then fade back into the noise. There may simply be too many things to observe. Or the observations may be subtle, requiring statistical methods to tease them out. These days we need computers to look around on our behalf. 

What sticks out? 

The principal challenge with writing programs to monitor data lies in establishing a baseline view - a probabilistic model, presumably, assigning likelihoods to all possible outcomes - so that deviations from it can be quantitatively assessed. Each data point can then be assigned a percentile between 0 and 1, for example.

It has been the tradition to compute quantities like z-scores, from which a percentile might bravely be computed. However, these are only as good as the assumed underlying model's fit to the data. Even if your quantity is normally distributed (and that's unlikely), estimating the forward-looking standard deviation is not so easy.
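To make the fragility concrete, here is a toy sketch of the traditional approach (my own illustration, not anything the microprediction API does): estimate a trailing mean and standard deviation, standardize the latest point, and bravely read off a percentile under a normality assumption.

```python
from statistics import NormalDist, mean, stdev

def rolling_z(history, window):
    """Z-score of the newest point against a trailing window of observations."""
    tail = history[-window:]
    return (history[-1] - mean(tail)) / stdev(tail)

def z_to_percentile(z):
    """Percentile implied by a z-score - valid only if the data really is normal."""
    return NormalDist().cdf(z)

counts = [1480, 1510, 1495, 1502, 1490, 1505, 1498, 1600]  # made-up per-minute emoji counts
z = rolling_z(counts, window=8)
p = z_to_percentile(z)
```

The catch, as noted above: the standard deviation estimate is backward-looking, and the normality assumption is doing all the work.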

Thankfully, there is a new way to standardize data in a compelling statistical manner. It:

  1. is accessible,
  2. requires no modeling work on your behalf,
  3. leverages state-of-the-art algorithms, relevant data, and cognitive diversity, and
  4. is continuously improving over time.

The proposed solution is the microprediction API. This is the dead simple way to invite a world of algorithms, and people, to standardize your data in a forward-looking fashion. And it also serves to assess that very same transformation on an ongoing basis.  

One enterprising user of the microprediction API has already demonstrated how this might apply to Twitter data. By publishing a live count of Twitter emojis, he has set in motion the following:

  • A time series for "tears of joy" emoji counts (see stream) that is predicted by algorithms in real time. They are judged out of sample (no cheating), and they provide distributional predictions. From this competition arises an increasingly accurate probabilistic prediction of how many Twitter users will use the tears of joy emoji one hour ahead of time.
  • The algorithms' combined forecast automatically standardizes the data. So if we were to witness, say, 2000 uses of this emoji, we would know that this corresponded to (say) the 95th percentile. This creates a secondary time series (see the corresponding z1 stream).
  • The quality of the standardized data is itself critiqued by a secondary contest between algorithms. 

All of this is completely automatic, and can be used by anyone with a fast-moving business problem. You just write a program that sends a data point to the API every minute. You can follow the publishing instructions - more on that in a moment.
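As a sketch of what that program might look like (the loop and the stub writer are mine; in the real thing you would pass a MicroWriter from the Python client, checking its docs for the exact set signature):

```python
import time

def publish(writer, name, read_value, interval=60, max_points=None):
    """Send one data point per interval to a microprediction-style writer.

    `writer` is anything with a set(name, value) method - for example the
    Python client's MicroWriter, instantiated with a write key. `read_value`
    is your own measurement function (an emoji counter, a sensor, etc.).
    """
    sent = 0
    while max_points is None or sent < max_points:
        writer.set(name=name, value=float(read_value()))
        sent += 1
        if max_points is None or sent < max_points:
            time.sleep(interval)

class StubWriter:
    """Stands in for MicroWriter so the sketch runs offline."""
    def __init__(self):
        self.received = []
    def set(self, name, value):
        self.received.append((name, value))

stub = StubWriter()
publish(stub, name="my_emoji_count.json", read_value=lambda: 1500, interval=0, max_points=3)
```

Swap the stub for a real MicroWriter and the same loop publishes live.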


But first to the topic du jour. Bring on the debate! 

Whether you love or hate presidential debates, there are good reasons to want to instrument them. Perhaps you find them boring, but wish to be alerted if something dramatic occurs. Perhaps you find them riveting, and also like watching how the cut and thrust therein sends ripples through the data universe. 

In this U.S. election cycle, we've already seen one candidate's hopes destroyed in minutes - just under three minutes, in fact, in a now famous exchange between Elizabeth Warren and Mike Bloomberg. It must have felt like time was speeding up - in terms of volatility - while slowing down in an excruciating way for Bloomberg. 

At least one type of seismograph, shall we say, picked up shudders from the Warren/Bloomberg collision. That measurement took the form of changes in prices on betting exchanges, for Mike Bloomberg in particular. The candidate's estimated probability of nomination dropped precipitously from 1 in 3, to 1 in 5. 

Markets such as these provide a one-dimensional view. They are unlikely to miss a Nevada earthquake, such as this, but arguably there is more to a debate than winning and losing - and that tends to be the only kind of thing we find on betting exchanges. Understandably, they focus on human participation, for the most part, and humans can only cover so much ground. 

However, we can wield something as powerful as a market mechanism, operating at a much finer granularity. Welcome to the prediction network. Inspired by the emoji contribution to Microprediction.org, I've decided to use the site's standardized measurements of emotion to test whether competition - similar to markets, but of a different sort - can help discern meaningful measurement from background radiation.

Here are some examples of standardized time series of emoji counts.  

These normalizations are derived by comparing the live tweet data to one hour ahead distributional predictions of how many tweets will occur. The predictions are made by algorithms which observe data streams at Microprediction.org. These algorithms are said to be self-navigating. They don't need humans to hold their hand and take them to problems. 
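The standardization works roughly like this (my sketch of the general idea, not the site's exact scoring or averaging rules): the realized value's rank among the algorithms' submitted scenarios gives an implied percentile, and inverting a standard normal CDF turns that percentile into a z-like value.

```python
from statistics import NormalDist

def implied_z(scenarios, realized):
    """Standardize a realized value against a crowd of distributional predictions.

    `scenarios` are the values submitted by competing algorithms. The rank of
    the realized value among them gives a (smoothed) percentile; inverting a
    standard normal CDF converts that percentile into a z-like score.
    """
    below = sum(1 for s in scenarios if s < realized)
    p = (below + 0.5) / (len(scenarios) + 1.0)
    return NormalDist().inv_cdf(p)

crowd = [1400, 1450, 1480, 1500, 1510, 1520, 1540, 1580, 1620, 1700]  # made-up scenarios
surprise = implied_z(crowd, realized=1690)   # near the top of the crowd's range
ordinary = implied_z(crowd, realized=1505)   # middle of the pack
```

Note that no parametric model of the data is assumed here; the crowd's scenarios are the distribution.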

The algorithms are authored by humans (though they may be authored by algorithms too, and are certainly fit by algorithms). However, unlike most arrangements for producing models, this one is wide open - like a market, although typically markets aren't as open as they might be. At Microprediction.org, anyone in the world can contribute algorithms and, should you be interested in joining them, there are instructions towards the end of this article.

The day before the first debate

A word of warning: there are difficulties with the interpretation of any data such as this. Be aware that Trump bots were claimed to outnumber Clinton bots by a huge 7:1 ratio last time around (source: Vox, 2016). That points to just one of many challenges.

But let's have some fun. 

We begin with the quiet before the storm. I write the day before the first debate, a Jewish holiday, but an otherwise unremarkable Monday. The New York Times article on Donald Trump's tax returns is in the news cycle. It is worth a glance at today's data, just as a reference point. The daily rise and fall in emoji use is rather obvious, peaking at around 1500 and approaching roughly half that number by 9 PM EST, exactly 24 hours prior to the debate. 


The morning of the first debate

Debates are all about preparation. My preparation for the debate was writing a little notebook (see code) to rank-sort the microprediction z-scores for all emoji streams and surface the top five.
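The bookkeeping in that notebook amounts to a sort by absolute z-score. A minimal sketch with made-up values (in the notebook, the latest values are fetched live, e.g. via the MicroReader):

```python
def top_movers(zscores, k=5):
    """Return the k streams whose standardized values are furthest from zero."""
    return sorted(zscores.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]

latest = {  # made-up z-values for a handful of emoji streams
    "weary_face": 2.7,
    "face_with_tears_of_joy": 0.4,
    "winking_face": 1.9,
    "skull": -0.2,
    "heavy_black_heart": 2.1,
    "white_smiling_face": 1.2,
}
leaders = top_movers(latest, k=5)
```

Because the values are already standardized, streams with wildly different raw scales can be ranked on one chart.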

Here is a video intended to show how easy it is to attack the data surveillance problem by allowing self-navigating algorithms to find your data stream (and perform a collective standardization of it). The video also shows how easy it is to consume the results using the MicroReader class.

(Sorry the audio could be crisper ... I've ordered a microphone for the next one.)


[Video]


As you can see, at the time I wrote the script the top five emotions were a good mix of happy and sad.


To create this plot in real time while you are watching the debate:

  1. Visit the notebook  
  2. Click "Open in Colab"
  3. Run every cell sequentially (arrow buttons)

The debate (updated)

And then it began. I had planned to update this blog in real time but have to admit I was completely overwhelmed by what unfolded. I don't feel so bad, because arguably moderator Chris Wallace was also caught off guard. I'm still absorbing.

However, it rapidly became apparent - to all observers, I would venture - that one debate participant was adopting a high-risk strategy. The President attempted to dominate speaking time, and even Republican Chris Christie would observe that Trump came in "hot". This was, it has been universally agreed, unusual even for presidential debates, many of which have been acrimonious.

Trump's strategy complicated the analysis. I had been hoping that two minutes of speaking time, alternating between participants, would permit a reasonable guess as to which tweeting emotions were directed at which participant. No such luck. 


One plausible hypothesis, which you are welcome to disagree with, is that reactions to this debate might therefore have been closer to a referendum on the debate quality itself, and its departure from historical norms, than idiosyncratic signals on particular issues. Certainly, time-compartmentalized performance assessments and reactions to specific talking points are harder to discern when one person is speaking more or less the entire time.

We didn't need to wait for the pundits to declare this the Worst Debate in History (one headline). Almost immediately, a flood of very negative emotions exceeded the algorithms' expectations. There were very, very few moments which could in any way be expected to elicit happy responses. I snapped the closest thing to levity I could find in this debate. For a brief moment, winks and kisses raced up the charts.

winking face

But as we know, this was mostly deep into negative territory. When race relations and protests were discussed, the responses that rose to the top took on a much more sombre note. "Heavy Black Heart" and "White Smiling Face" took positions #2 and #3 - a rather unfortunate juxtaposition and naming of emojis - though they lagged behind the persistently performing "weary face".


When COVID-19 was discussed, deathly negative tropes arrived: the black hearts and skulls. I don't know who uses skulls on Twitter. I think I would rather not know.  


We have undoubtedly stumbled onto something, but this is a first experiment. It is perhaps impossible to discuss the purely empirical aspects of this without seeming to take a political stance, or bringing one to the analysis. I realized this almost immediately and decided to record the entire thing. Reach out if you'd like a complete 20Gig video recording of the standardized rankings of the emojis - in real time - side by side with the captioned debate. 

You can make up your own mind about who won the debate, of course. However, a somewhat brave interpretation of this data is:

  • People hated the debate
  • (more speculative) Some attributed the quality degradation to the President.  

This thesis is supported by the reaction in the betting exchanges. Trump came into the debate a slight underdog and began to slide almost immediately. We can't read a precise absolute probability from venues like Betfair due to time value of money and commission effects, but Trump is roughly 5/6ths as likely to win as he was before the debate - according to those markets. He has fallen from slightly less than 1 in 2 to a probability closer to 1 in 3.

The micropredictions suggesting this were calculated immediately, by definition. However, the betting markets took twelve hours to absorb the debate - with only about half of the movement in implied probability occurring during or shortly after it. In fairness, time will tell whether that is an under- or over-reaction.

Regardless of your political persuasion, I hope you find this interesting. I certainly did. I may put more effort into producing micropredictions of sentiment with a little more polish in the next debate ... assuming the concept of a debate survives this one. The algorithms will be better by then as well, and there will be more emojis, we promise. 

Microprediction works like a market, only more efficiently 

Now, setting politics aside, I do want to discuss why this attempt to glean real-time information, grounded in competition, might be even more powerful in the future. Let it be said that the microprediction API is a universal data standardizer, insofar as it can be tried on any source of live data.

Some sources might be more competitively predicted than others, naturally, but over time both data and clever algorithms may arrive. Markets give us some indication of why it is likely to be increasingly powerful over time. We look at betting markets because they are, by design, excellent accumulators of information.

But why?

The topic is well surveyed, but as a long-time observer of and participant in model creation, I suggest that the more cynical reasons are as important as the theoretical ones.

  • Unlike a model built by an in-house team (or even an open source project), anyone with a new insight or better approach can improve the quality of prediction (and be rewarded) without asking permission.
  • Conversely, a cost is imposed on anyone who claims to provide accurate analytics but, perhaps protective of their turf, engages in activity that serves to block others from improving the outcome.

Whatever the reasons, if you want to know the probability that one fly crawling up a wall will beat another fly, you can do worse than letting people wager on it. The thing is, you probably don't want people wagering on everything. There are societal costs, of course. Too many people lose proportionately more than Bloomberg did in Nevada.

But there is also a pragmatic issue. Markets are clunky. There's too much overhead. They might be brilliant aggregators of information when the economics overwhelm these fixed costs (and yes the mean price of Apple stock one hour hence is dreadfully close to its price now) but markets predict a minuscule proportion of all the quantities that might be of economic, civic or scientific interest to someone. However, markets (including prediction markets and betting exchanges) are just examples of competitive aggregation. They are far from the end of the story.  

Viewed purely as probabilistic machines, markets are generally trying to solve a different problem. They are predicated on what we might call de Finetti's exasperation. The Italian mathematician declared "Probability Does Not Exist" some sixty years ago, then went on to write a lengthy treatise on the topic. Bruno de Finetti's notion of probability is summed up in the phrase "put your money where your mouth is." There is no such thing as objective probability, only a pile of cash on one side of the table and a pile on the other.

There's a counterpoint, however, to this strain of mildly nihilistic probabilism. Now that we drown in data, and Machine Learning and data hungry methods are thriving, the law of large (out of sample) numbers is an increasingly important caveat. There may well be sufficient data to distinguish good modeling from bad in a short period of time - without the need for staking, and without introducing substantial degrees of chance into the medium term evaluation. Indeed, slow moving pseudo-markets (in the form of academics publishing papers that use standardized data sets) have been important catalysts for Machine Learning advances. 

So placing bets isn't the only way. Staking and chance need not play a role (they happen to be the criteria used by some courts in delineating gambling, incidentally). It has been argued by hedge fund Numerai that staking is crucial to their success. That may be true, but they are trying to use a market-like mechanism to beat an existing market. In a large domain of practical problems, that isn't the case. There is room for non-staked competitive prediction that isn't close to gambling in any sense.  

What seems more relevant than staking is whether we prevent blocking, protective behavior. You want your data standardization, and with it your ability to discern deviations from the same, to be constantly improving over time. You don't want it to be limited by happenstance, or the ability (or lack thereof) of the person you happened to assign the task to. You don't want to be captive to their possibly less-than-completely-intellectually-honest approach to serving you the best predictions anyone can produce - whether or not it makes them look good.  

Enter microprediction. A contest between computer algorithms and their authors to predict real-time streaming data - but one in which few of the trappings of financial markets (or betting exchanges, or hokey-crypto-currency bound systems) are required. Microprediction exists to eliminate all forms of friction in the assimilation of information. Information is primary. Information is not an accidental byproduct of trade (or the desire by recreational gamblers to increase their variance). 

Microprediction means a ruthless contest over thousands or millions of data points. A golf tournament played over 4000 rounds, not four. It means algorithms come first, humans second. Markets have become more efficient over time through the introduction of algorithmic trading. However the setup costs are way too high for those looking to turn their talent into prediction. Microprediction is designed from the ground up so that launching an algorithm that competes (analogous to trading in many ways) requires only three lines of Python code. 

There are no barriers to entry for people or algorithms looking to improve predictions. There are no exchange fees. No licenses. No credit cards. No cozy arrangements designed to provide some participants with more data than others. It isn't even necessary for algorithm authors to supply an email, as they generate their own keys themselves.

So, if you want a better chance of understanding the real magnitude of any data event, consider publishing it to the microprediction API and letting the self-navigating algorithms, and their authors, fight it out. 

A quick tutorial in creating your own standardized data using Python

To that end, here's a video explaining how you (or if need be your long suffering technologist) can create a new microprediction data stream in just ten minutes - thereby also creating a standardized version of every live data point you publish. The sooner you start publishing, the sooner you develop a history of data points, and the sooner algorithms will stumble across your data stream and start predicting it.


[Video]

(The water example used here wasn't nearly as exciting as tweets, but possibly of more scientific merit.)

Creating emoji streams using TypeScript

Now, I hasten to add that the emoji streams were not created using the Python client I supplied. Nobody is forced to use the Python client, and the TypeScript code is available at rustyconover/emojitracker.

The snippet of code that creates the prediction stream looks like this: 


As you can see, we merely instantiate a MicroWriter using Rusty's client and issue a set command. 

Submitting a prediction using Python 

Now, what about the predictive power behind the API? That's created by you and me.

There's a big picture here, asymptotically speaking, but no need to wait on that. If you'd like to let loose your latest algorithm (or someone else's) on this emoji data, then you'll find plenty of examples of self-navigating Python algorithms you can modify. 
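The heart of such an algorithm is a function that turns recent history into a set of scenario values. Here is a deliberately naive one - a bootstrap with jitter - of the kind you might plug into the client's crawler class (check the package docs for the exact hook to override):

```python
import random

def naive_scenarios(lagged_values, num=225):
    """Produce `num` distributional predictions from recent history.

    A naive bootstrap: resample lagged values and add a little jitter so the
    scenarios are not all identical. Real crawlers do better, but this is
    already a legitimate (if weak) competitor.
    """
    spread = (max(lagged_values) - min(lagged_values)) or 1.0
    return [random.choice(lagged_values) + random.uniform(-0.05, 0.05) * spread
            for _ in range(num)]

history = [1480.0, 1510.0, 1495.0, 1502.0, 1490.0]
scenarios = naive_scenarios(history, num=225)
```

Replace the bootstrap with your favorite time series model and the surrounding machinery is unchanged.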

You can also look at the leaderboards such as this one and you'll observe that many of the algorithms have code badges on them. Click through for examples of crawlers. 


Video tutorials are also coming along. The details will depend on your preferred way to run your crawler. I personally find PythonAnywhere extremely convenient. Here is a three-minute demonstration of letting a crawler loose on the cloud.

[Video]

The documentation for the Python microprediction client is improving, we like to think, and you can contact us with questions.

Submitting using the Julia client

Microprediction isn't only for Python fans. Here is some Julia code, also contributed by Rusty Conover, and using the Julia Microprediction package that he also wrote.  


Yes, you can do better than this (and Rusty has, since writing it). But this code illustrates how easy it is for anyone to add to the intelligence behind the API.

We hope you consider contributing to the prediction network in this or some other manner.


Image credit: Gerd Altmann.