Well, that got your attention, didn't it? Yes, it is factually accurate that participation at Microprediction.org has preceded interviews at Citadel - but correlation is not causation. I don't work for Citadel HR and never will. But hello, eager young minds of tomorrow who reached out to me on LinkedIn asking for career advice. I'm finally getting around to giving it, for better or worse, in a transparently self-serving way.
Some would scoff at the idea of me giving career advice, or any advice. However, sticking to the empirical data, it is true that according to LinkedIn analytics, microprediction articles are read disproportionately by top-tier quant-trading firms. For instance, I was shocked to learn that my blog had more readers from IMC Trading than any other firm, because I have absolutely no affiliation there past or present (other than being Australian, perhaps).
I'm not going to draw conclusions. You're the data scientist. You figure it out. But whatever industry you head for, and I'm not saying it should be finance, remember that the top quant funds are better at prediction than most of the rest of the business world - they have always been faster to steal great technical ideas from scientific fields; they opened their arms to an influx of extraordinary talent from the Soviet Union and elsewhere a long time ago; and they have lived with the brutal discipline of out-of-sample punishment for forty years while they try to predict the hardest thing to predict on the planet. If you can predict it there, as the song goes, you'll predict it anywhere.
It isn't easy though. Perhaps you are looking to get in the door at a top mathematical fund that's been around for many decades (like - oh, let me see, this one). That's not the only strategy, and you can try the opposite: looking for a place lacking that DNA, and claiming that you are good at communicating to non-technical folks who don't speak the language of nature. But if you do pursue it, then either polish off that IMO medal, think of something you can do to stand out, or be very, very good at saying fascinating things when presented with a conversation starter like this:
That's not an opinion on interviewing technique by the way, I just like looking at it.
Now for the self-serving part. Microprediction is a collection of open source software and a hosted nano-market for live prediction. It's for algorithms, not people, primarily, yet microprediction is where you can go if you are looking for a way to distinguish yourself from 1,000,000 plodders at Kaggle seemingly intent on turning themselves into slow, expensive wetware versions of auto-sklearn - by all means hook yourself into the Matrix and let the machines use your body heat while you are at it.
Oh look there's nothing wrong with Kaggle (he says, hoping that Google doesn't further downgrade the page rank while microprediction followers rose 10x - how did that happen?!). But in contrast to Kaggle, Microprediction is where you go to prove that you can deploy models, understand APIs, invest in learning online methods rather than batch, package your stuff nicely and otherwise prove that you might actually be of some use to someone's real-time operation, some day.
It ain't friendly. There are no shortcuts. There are hard frustrating lessons to learn. Things break. Feeds go down. But one day you'll have a job and you'll realize you aren't in Kaggle anymore. The sooner you figure that out, the better. I'm here to give you the tough love. I wrote and maintain Microprediction.org for several reasons, and getting you a job actually isn't top of the list, I'm sorry to say, though maybe it is an accidental benefit.
A stronger motivation: open source time series prediction and the public data available for researchers is, for the most part, in something of a shambles. Most public data repositories drop you into a giant pile of ad hoc data sets - typically starting in 1992 and ending in 2003. Yes, you might be lucky and stumble across an interesting source of live data, but they are surprisingly few and far between. And don't get me started on those wonderful public datasets with one data point per month. Please.
I think it goes without saying that none of this helps the reproducibility crisis. I attended graduate school with statistician Regina Nuzzo, who coined the wonderful term p-hacking to sum that up. Let's fix p-hacking and also the superfund site called live public data. Let's not waste data scientists' time predicting so-called real-world data that turns out to be mostly synthetic.
There are over a thousand live data streams currently used to train time series algorithms at Microprediction.org, and you are welcome to add more. Here I demystify some to give you a taste of what you'll find in the stream listing. Again, remember that this site is really for self-navigating automated machine learning algorithms - the grey goo that will eat through all the soft nonsense we see in enterprise data science - but that doesn't mean you aren't welcome.
Live data is important because without it, one is kidding oneself. If you read my article Is Facebook Prophet the Time Series Messiah ... you'll appreciate that there are plenty of time series packages out there that are stupidly popular ... but don't seem to have really been tested on anything. Want to change that?
Consider contributing to the timemachines package, where all models are regularly tested against new, incoming data and compared to each other using data that can't be memorized. Notice, by the way, that the world's most popular time series package is substantially worse than some really simple models that run in 1/100th of the time, have simple serializable state, and are trivially deployed.
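To make "simple serializable state" concrete, here is a minimal sketch of an online forecaster. It is not the timemachines API: the function name `ema_forecaster`, the state layout, and the smoothing parameter `alpha` are all made up for illustration. The point is only that the model sees one observation at a time and carries everything it knows in a plain dict.

```python
import json


def ema_forecaster(y, state=None, alpha=0.2):
    """Consume one observation; return (one-step-ahead forecast, new state).

    Hypothetical helper for illustration only, not part of any package.
    """
    if state is None:
        # First observation: start the running mean at y
        state = {"mean": float(y), "count": 1}
    else:
        # Exponentially weighted update of the running mean
        state = {
            "mean": (1 - alpha) * state["mean"] + alpha * float(y),
            "count": state["count"] + 1,
        }
    return state["mean"], state


state = None
for y in [1.0, 2.0, 3.0]:
    forecast, state = ema_forecaster(y, state)

# The state is a plain dict, so it survives a round trip through JSON
restored = json.loads(json.dumps(state))
```

Because the state can be dumped and reloaded like this, the process can be stopped, moved to another machine, and resumed without retraining. That is most of what "trivially deployed" means in practice.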
Or look around for good libraries to contribute to. The river package is a good example that tries to get away from the 1990s batch learning style. It could use your help. Pycaret is another tremendous initiative. Nevergrad is running a contest for the best pull requests (see, I'm not anti-Facebook). Show employers that you can work with people, produce clean code that works, and make a contribution.
But back to the self-serving stuff. The first phase of stream creation at Microprediction is coming to a close. My initial objectives were pretty limited - just checking a few boxes, such as ensuring we had a good mix of time series with diverse properties and all the usual suspects when it comes to time-series challenges. Those include quasi-periodicity, dramatic excursions, mean-reversion, hourly and daily effects, noisy measurements including serially correlated noise, complex dependencies between similar time series, market microstructure and so on.
In the next phase of its life, Microprediction is going to undergo a fairly radical transformation and you'll see the quantity of streams spike - more about that another day. But in advance of that, and also ahead of some other steps to make things a little more human-friendly, I jotted down some notes about some live data you can use for whatever purpose you see fit.
Streams can come and go, so apologies if a link is broken, depending on when you are reading this. It is all accessible via the microprediction client or the API which is all explained in the knowledge center. There you will find Python tutorials walking you through various aspects of the site. You'll also observe that the data is right there in the open. So for instance you can download in comma-separated format with a link like this if you must, though I tend to discourage it as I would prefer you write a live algorithm that runs (instructions).
| Description | Example | Pattern for similar streams |
|---|---|---|
| Emoji usage | medical masks | emojitracker-twitter* |
| Cryptocurrencies | ripple price changes | c5_* |
| Airport wait times | Newark Terminal A | airport_ewr |
| Futures | sugar price changes | finance-futures |
| Chess ratings | Hikaru Nakamura | chess_bullet |
| Hackernews comments | comment counts | www-hackernews |
| Govt bonds | 30_year changes | finance-futures-*bond |
| Hospital wait times | Piedmont | hospital-er-wait-minutes* |
| GitHub repos popularity | stargazers tensorflow | github_* |
| Laboratory helicopter | helicopter pitch | helicopter_* |
| Die, coins etc | die | coin_* |
| Epidemic agent model | infected | epidemic* |
| Three body system | three_body_x | three_body* |
| Unidentified flying object | altitude changes | altitude |
If a category is marked with an asterisk it means that z2 and z3 streams exist, which is to say that predictions are used to transform simultaneous measurements to a vector with roughly normally distributed margins. A dollar sign implies cash prizes.
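If the z2 and z3 idea sounds mysterious, here is a rough sketch of the kind of transform involved. I am assuming an empirical-CDF version of the probability integral transform, and the function name `z_score` is hypothetical. The actual calculation used on the site may well differ in its details.

```python
from bisect import bisect_right
from statistics import NormalDist


def z_score(predicted_samples, realized):
    """Map a realized value to a roughly standard normal z-score.

    Uses the empirical CDF of the predicted samples. Illustrative only.
    """
    ordered = sorted(predicted_samples)
    n = len(ordered)
    rank = bisect_right(ordered, realized)  # number of samples <= realized
    # Keep the percentile strictly inside (0, 1) so inv_cdf stays finite
    p = (rank + 0.5) / (n + 1)
    return NormalDist().inv_cdf(p)


# Two streams measured at the same moment, each with its own predicted samples
predictions = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
realized = [2.5, 45.0]
z_vector = [z_score(s, x) for s, x in zip(predictions, realized)]
```

If the predictions are any good, each coordinate of `z_vector` is approximately standard normal, and what remains to be predicted is the dependence between coordinates.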
Did you know that wind-powered electricity generation (stream) is correlated with humidity? Well you know now and it is obvious in retrospect, isn't it? The electricity streams are discussed at length in this presentation by Rusty Conover, and he has made his prediction code open source (for the Julia folks). The New York Independent System Operator (NYISO) publishes all sorts of live public data you are encouraged to search.
See also the electricity competition page. The competition pays out regularly and runs all year long.
Cities and airports often stream their radio traffic to the internet, so you can listen to police or fire trucks being dispatched or airplanes being given headings and altitudes. We listen to a selection of these streams and measure how much of the time each frequency is occupied by transmissions above a silence threshold.
Every five minutes, the percentage of time that the audio stream was not silent is returned as a number ranging from 0.0 to 1.0. By predicting the activity of these radio frequencies, the common pattern of activity may be learned and anomalies detected.
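For the curious, the occupancy number can be sketched in a few lines. This assumes the audio arrives as a list of amplitude samples and that "not silent" means the absolute amplitude exceeds a threshold. The function name and the threshold value are mine, not the feed's.

```python
def occupancy_fraction(amplitudes, silence_threshold):
    """Fraction of samples whose absolute amplitude exceeds the threshold."""
    if not amplitudes:
        return 0.0
    active = sum(1 for a in amplitudes if abs(a) > silence_threshold)
    return active / len(amplitudes)


# A made-up window of samples standing in for five minutes of audio
window = [0.0, 0.01, 0.4, -0.5, 0.02, 0.3, 0.0, 0.0]
fraction = occupancy_fraction(window, silence_threshold=0.05)
```

The result always lies between 0.0 and 1.0, which matches the range of the published values.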
See this stream for one of many examples of financial data, where changes are logged to the system. The mean of these series will be very close to zero, but the challenge lies in estimating the distribution of the changes.
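A small sketch of what "estimating the distribution of the changes" can look like at its crudest. The prices are invented, and empirical quartiles are just one simple way to summarize a distribution, not a recommendation.

```python
from statistics import mean, quantiles

# Hypothetical price levels; the stream itself publishes the differences
prices = [100.0, 101.0, 100.5, 100.5, 102.0, 101.0]
changes = [b - a for a, b in zip(prices, prices[1:])]

# The average change says very little on its own
average_change = mean(changes)

# Empirical quartiles give a first, crude picture of the distribution
quartiles = quantiles(changes, n=4)
```

A serious entry would replace the quartiles with a full predictive distribution that updates as each new change arrives.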
This is actually the altitude of the Earth itself, or if you like a very low flying UFO. We were just curious to see how time series algorithms cope with the geometry of the planet.
Love chess? Love mean-reverting time series that invite hidden Markov models? Some people think these work really well in financial markets - so consider this an easier example where there absolutely is a hidden state influencing the drift (plus mean reversion due to the Elo system). Bullet ratings for the world's top players are produced here. As you can see from the script that produces them, ratings are published only if the rating has changed.
If you find Hikaru Nakamura's ratings to be incredibly high, let me introduce you to the quickest thinking human on planet Earth. You can watch him here playing bullet chess (1 minute per game) or blitz (3 minutes typically) and talking about stocks or whatever comes to mind.
Soon I will publish a post on bullet chess and why it is going to be a great source of interesting reinforcement learning and prediction problems. Pro-tip: the volatility for this stream tends to be highest after 11:30pm EST, after I finish editing pages like this one. Contact us if you use chess.com and you'd like your own rating to be included. And yes, I have an obscure chess blog too, if you are into unsound bullet openings.
NOAA Wind Speed and Direction*
The National Oceanic and Atmospheric Administration (NOAA) is charged with alerting us to dangerous sea creatures. Also, they use towers like the one you see here, or buoys like this to measure wind speed and direction.
We have deliberately clustered some of them so that some are closely related - that is to say that the community-implied copulas are non-trivial. The stream name contains a location clue, which should point you to the web page for the measurement station, and from there you can scrape whatever information you deem useful. For instance, the stream with 46061 in the name corresponds to this weather station:
As an aside, notice that the wind speed has been divided by ten and the direction by 360 so that the latter is between zero and one. Some plots of the community implied copulas should appear in the gallery sooner or later.
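In code, the scaling and its inverse look roughly like this. The function names are hypothetical, and wrapping the direction with a modulo is my own assumption about how a reading of exactly 360 would be handled.

```python
def scale_wind(speed, direction_degrees):
    """Scale raw readings the way the streams are described: speed / 10, direction / 360."""
    # Assumption: wrap the direction into [0, 360) before dividing
    return speed / 10.0, (direction_degrees % 360.0) / 360.0


def unscale_wind(scaled_speed, scaled_direction):
    """Recover the raw readings from the published values."""
    return scaled_speed * 10.0, scaled_direction * 360.0


scaled = scale_wind(12.0, 90.0)
raw = unscale_wind(*scaled)
```

Remember to undo the scaling if you join the stream values with anything you scrape from the station page.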
The number of times an emoji is used on Twitter, as tabulated by EmojiTracker (warning: this is a busy site with an epilepsy warning). The time series are a little easier on the eye, such as this one which shows the number of people tweeting "face with medical mask".
Changes in some major cryptocurrencies are published, assuming the feed appears to be warm - which it usually is. The five currencies bitcoin, ethereum, iota, cardano and ripple have very interesting joint behaviour and you can view recent causality plots or make your own. The community implied copulas are also quite pretty, as you can see from the scatter plots.
In addition to USD-denominated prices, three related streams track bitcoin in USD, EUR and AUD. This is more of a challenge for copula prediction, and you can see one fail, for example, using a centered vine copula. See the blog article How to Enter a Cryptocurrency Copula Contest and the Crypto and Copulas competition page.
The aggregate delay in minutes reported by the Bay Area Rapid Transit (BART). This one's the tip of the iceberg as far as public transport and exogenous data gathering is concerned.
The number of comments on the landing page populates the hacker news stream. This one has discontinuities that flummoxed Prophet, you may recall. Don't expect your model to do well here if it can't jump.
Citibike data is a little overdone on Medium, if you ask me. But contributor Eric Lou introduced this interesting COVID-19 proxy. The measure is obtained by computing changes in the occupancy of bike sharing stations, aggregated across several locations chosen to be near the major New York City hospitals.
Covered in the blog article Helicopulas, there are two streams for the pitch and yaw of a laboratory helicopter. This example was created for the SciML Julia Day challenge.
You have starred our repos, haven't you?
The stream pandemic infected tracks the number of infected people in an agent model for a pandemic. The simulations are crowd-sourced at SwarmPrediction.com, should you wish to add to the stockpile. Each simulation uses different initial conditions and parameters. Agents follow Ornstein-Uhlenbeck processes on the plane and can infect each other when they bump. Read Dear New Zealand... for more explanation. The model and the desire to create accurate surrogates is motivated by a couple of my working papers including Repeat Contacts and the Spread of Disease.
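Here is a toy version of that kind of agent model, just to fix ideas. It is not the SwarmPrediction.com code. Every parameter below is invented, agents are all pulled toward the origin rather than toward individual homes, and nobody ever recovers.

```python
import math
import random


def simulate(n_agents=50, steps=200, theta=0.1, sigma=0.3, radius=0.1, seed=0):
    """Return the number of infected agents after each step of a toy model."""
    rng = random.Random(seed)
    # Agents start scattered around the origin; agent 0 is the first case
    pos = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_agents)]
    infected = [False] * n_agents
    infected[0] = True
    counts = []
    for _ in range(steps):
        # Discretized Ornstein-Uhlenbeck step: pull toward the origin plus noise
        for p in pos:
            p[0] += -theta * p[0] + sigma * rng.gauss(0, 1)
            p[1] += -theta * p[1] + sigma * rng.gauss(0, 1)
        # A susceptible agent within `radius` of an infected one catches it
        newly = []
        for i in range(n_agents):
            if infected[i]:
                continue
            for j in range(n_agents):
                if infected[j] and math.dist(pos[i], pos[j]) < radius:
                    newly.append(i)
                    break
        for i in newly:
            infected[i] = True
        counts.append(sum(infected))
    return counts


counts = simulate()
```

Varying `seed`, `theta`, `sigma` and `radius` produces a family of infection curves, which is the kind of variety a surrogate model would need to learn.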
There is a rather complex set of relationships between travel time measurements at different points of the system, as you can see from causality plots like this one.
Perhaps you can be inspired by queuing theory and improve your prediction of the surges and drop-off you see in streams like this one.
No explanation required as I think we're all too familiar with this one. Let's hope that COVID-19 ends soon and this time series becomes more exciting.
This is a simulated data stream. As the name suggests, it is derived from the physics of a three-body system, but the measurements are noisy. Here's a spoiler:
I don't know if agricultural futures float your boat or something else, but I think you get the idea. If you can write a crawler that does well predicting all of these time series, you are doing well. Don't forget that I've listed plenty of ideas in the listing of popular python time series packages, and remember that the goal here is to add value to someone, somewhere, in real-time. This isn't just another throwaway notebook.
I hope you contribute to an open source library or two, or create your own. Go to settings in GitHub and activate donations if you do. I've written a step-by-step guide to packaging on PyPI for those not already familiar. Twice a week I hold open virtual office hours (Tue 8pm EST, noon Fri EST) and you can reach out by whatever means you prefer to be included in the invite - for instance the contact information in the Knowledge Center.
Above all, have fun. Résumé padding doesn't have to be dull.
I work for Intech Investments.
The video was produced by a Stanford alum who creates extremely high quality material. I would love to acknowledge them, but that would give away the context, the question, and the solution. I'll edit later.
I've just added a job postings page. I'm happy to add opportunities if you have them, and are looking to hire handy people.