There are some reasonable ways to impress future employers with your mathematical acumen. You could polish off that IMO medal, or be very, very good at saying fascinating things when presented with a conversation starter like this:
That's not an opinion on interviewing technique, by the way; I just like looking at it. Alternatively, you could establish beyond doubt that you aren't in Kaggle anymore. You can prove that you are capable of deploying a model, thinking about online versus batch calculations as required, and otherwise doing something that is remotely similar to real-world work.
I dare say open source time series prediction and the public data available for researchers is, for the most part, in something of a shambles. Most public data repositories drop you into a giant pile of ad hoc data sets - typically starting in 1992 and ending in 2003. Yes, you might be lucky and stumble across an interesting source of live data, but they are surprisingly few and far between. And don't get me started on those wonderful public datasets with one data point per month.
Stale, canned old data hardly helps the reproducibility crisis. Let's fix p-hacking and also the superfund site called live public data. Let's not waste data scientists' time predicting so-called real-world data that turns out to be mostly synthetic.
There are over a thousand live data streams currently used to train time series algorithms at Microprediction.org, and you are welcome to add more. Here I demystify some to give you a taste of what you'll find in the stream listing.
Live data is important because without it, one is kidding oneself. If you read my article Is Facebook Prophet the Time Series Messiah ... you'll appreciate that there are plenty of time series packages out there that are stupidly popular ... but don't seem to have really been tested on anything. Want to change that?
Consider contributing to the timemachines package, where all models are regularly tested against new, incoming data and compared to each other using data that can't be memorized (notice, by the way, that the world's most popular time series package is substantially worse than some really simple models that run in 1/100th of the time, have simple serializable state, and are trivially deployed).
Or look around for good libraries to contribute to. The river package is a good example that tries to get away from the 1990s batch learning style. It could use your help. Pycaret is another tremendous initiative. Nevergrad is running a contest for the best pull requests (see, I'm not anti-Facebook). Show employers that you can work with people, produce clean code that works, and make a contribution.
But back to the self-serving stuff. The first phase of stream creation at Microprediction is coming to a close. My initial objectives were pretty limited - just checking a few boxes, such as ensuring we had a good mix of time series with diverse properties and all the usual suspects when it comes to time-series challenges. These include quasi-periodicity, dramatic excursions, mean-reversion, hourly and daily effects, noisy measurements including serially correlated noise, complex dependencies between similar time series, market microstructure and so on.
In the next phase of its life, Microprediction is going to undergo a fairly radical transformation and you'll see the quantity of streams spike - more about that another day. But in advance of that, and also ahead of some other steps to make things a little more human-friendly, I jotted down some notes about some live data you can use for whatever purpose you see fit.
Streams can come and go, so apologies if a link is broken, depending on when you are reading this. It is all accessible via the microprediction client or the API which is all explained in the knowledge center. There you will find Python tutorials walking you through various aspects of the site. You'll also observe that the data is right there in the open. So for instance you can download in comma-separated format with a link like this if you must, though I tend to discourage it as I would prefer you write a live algorithm that runs (instructions).
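If you do pull the comma-separated data, here is a minimal sketch of parsing it locally. The two-column (time, value) layout and the helper name are illustrative assumptions; the official `microprediction` Python client is the recommended route and is covered in the knowledge center tutorials.

```python
# Sketch: parse lagged stream values fetched as CSV text. The two-column
# (epoch time, value) layout is an illustrative assumption; prefer the
# official client (pip install microprediction) for real use.
import csv
import io

def parse_lagged_csv(text):
    """Parse (time, value) CSV rows into a list of floats, newest first."""
    reader = csv.reader(io.StringIO(text))
    return [float(value) for _, value in reader]

sample = "1612345678.0,3.0\n1612345618.0,5.0\n1612345558.0,1.0\n"
values = parse_lagged_csv(sample)
print(values)  # [3.0, 5.0, 1.0]
```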
| Description | Example | Pattern for similar streams |
|---|---|---|
| Emoji usage | medical masks | emojitracker-twitter* |
| Cryptocurrencies | ripple price changes | c5_* |
| Airport wait times | Newark Terminal A | airport_ewr |
| Futures | sugar price changes | finance-futures |
| Chess ratings | Hikaru Nakamura | chess_bullet |
| Hackernews comments | comment counts | www-hackernews |
| Govt bonds | 30_year changes | finance-futures-*bond |
| Hospital wait times | Piedmont | hospital-er-wait-minutes* |
| GitHub repos popularity | stargazers tensorflow | github_* |
| Laboratory helicopter | helicopter pitch | helicopter_* |
| Die, coins etc. | die | coin_* |
| Epidemic agent model | infected | epidemic* |
| Three body system | three_body_x | three_body* |
| Unidentified flying object | altitude changes | altitude |
If a category is marked with an asterisk, z2 and z3 streams exist for it, which is to say that predictions are used to transform simultaneous measurements into a vector with roughly normally distributed margins. A dollar sign implies cash prizes.
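To make the z-stream idea concrete, here is a minimal sketch of mapping observations to roughly standard-normal margins via the probability integral transform. Note the assumption: the real site uses community-predicted distributions, whereas this stand-in uses the empirical CDF of recent history.

```python
# Sketch of the idea behind z-streams: map an observation to a roughly
# standard-normal margin. Assumption: we use the empirical CDF of history
# as a stand-in for the community-predicted distribution.
from statistics import NormalDist

def to_normal_margin(x, history):
    """Map x to an approximate z-score using the empirical CDF of history."""
    n = len(history)
    rank = sum(1 for h in history if h <= x)
    u = (rank + 0.5) / (n + 1)          # keeps u strictly inside (0, 1)
    return NormalDist().inv_cdf(u)

history = [1.2, -0.5, 3.1, 0.0, 2.2, -1.7, 0.9]
z = to_normal_margin(0.9, history)      # a value near the median maps near 0
```

Applied to each of two or three simultaneous measurements, this yields the z2 and z3 vectors whose joint behaviour (the copula) is the real object of interest.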
Did you know that wind-powered electricity generation (stream) is correlated with humidity? Well, you know now, and it is obvious in retrospect, isn't it? The electricity streams are discussed at length in this presentation by Rusty Conover, and he has made his prediction code open source (for the Julia folks). The New York Independent System Operator (NYISO) publishes all sorts of live public data you are encouraged to search.
See also the electricity competition page. The competition pays out regularly and runs all year long.
Cities and airports often stream their radio traffic to the internet; you can listen to police or firetrucks being dispatched, or airplanes being given headings and altitudes. We listen to a selection of these streams and count the amount of time that the frequency is occupied with transmissions above a silence threshold.
Every five minutes, the percentage of time that the audio stream was not silent is returned as a number ranging from 0.0 to 1.0. By predicting the activity of these radio frequencies, the common pattern of activity may be learned and anomalies detected.
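The computation described above can be sketched in a few lines. The sample values and threshold here are made up for illustration; the real feed works on actual audio.

```python
# Sketch: fraction of a window in which an audio signal exceeds a silence
# threshold. Sample magnitudes and the 0.1 threshold are illustrative.
def activity_fraction(samples, threshold):
    """Return the share of samples whose magnitude exceeds the threshold."""
    if not samples:
        return 0.0
    loud = sum(1 for s in samples if abs(s) > threshold)
    return loud / len(samples)

samples = [0.01, 0.02, 0.9, 0.8, 0.01, 0.7, 0.02, 0.03]
print(activity_fraction(samples, threshold=0.1))  # 0.375
```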
See this stream for one of many examples of financial data, where changes are logged to the system. The mean of these series will be very close to zero, but the challenge lies in estimating the distribution of the changes.
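Since the mean carries little information, a sensible first step is to summarise the shape of the changes, for instance by empirical quantiles. The synthetic fat-tailed data below is an illustrative assumption, a crude stand-in for the full distributional prediction the site asks for.

```python
# Sketch: with a near-zero mean, the information is in the distribution's
# shape. The mixture below fakes fat tails (occasional big moves); it is
# illustrative, not real market data.
import random
import statistics

random.seed(7)
changes = [random.gauss(0.0, 1.0) * random.choice([0.2, 0.2, 0.2, 2.0])
           for _ in range(5000)]

mean = statistics.fmean(changes)
q = statistics.quantiles(changes, n=100)   # 99 cut points: percentiles
print(round(mean, 3), round(q[0], 2), round(q[-1], 2))
```

The mean lands near zero while the 1st and 99th percentiles sit far out in the tails, which is exactly the part of the distribution a good prediction needs to capture.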
This is actually the altitude of the Earth itself, or if you like, a very low-flying UFO. We were just curious to see how time series algorithms cope with the geometry of the planet.
Love chess? Love mean-reverting time-series-inviting hidden Markov models? Some people think these work really well in finance markets - so consider this an easier example where there absolutely is a hidden state influencing the drift (plus mean reversion due to the Elo system). Bullet ratings for the world's top players are produced here. As you can see from the script that produces them, ratings are published only if the rating has changed.
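Here is a toy sketch of the kind of process described above: a hidden two-state Markov chain ("in form" / "off form") tilts the drift, while an Elo-style pull provides mean reversion. All parameters are illustrative assumptions, not a fit to real rating data.

```python
# Sketch: a hidden-state, mean-reverting rating process. The switch
# probability, drift, reversion rate and noise level are all illustrative.
import random

random.seed(42)
true_strength, rating, in_form = 3200.0, 3200.0, True
path = []
for _ in range(500):
    if random.random() < 0.05:          # occasionally flip the hidden state
        in_form = not in_form
    drift = 2.0 if in_form else -2.0    # hidden form tilts results
    reversion = 0.05 * (true_strength - rating)   # Elo-style pull back
    rating += drift + reversion + random.gauss(0.0, 5.0)
    path.append(rating)
```

A hidden Markov model fitted to such a path can recover the form state; the real ratings stream is the harder, live version of the same exercise.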
If you find Hikaru Nakamura's ratings incredibly high, let me introduce you to the quickest-thinking human on planet Earth. You can watch him here playing bullet chess (1 minute per game) or blitz (3 minutes, typically) and talking about stocks or whatever comes to mind.
Soon I will publish a post on bullet chess and why it is going to be a great source of interesting reinforcement learning and prediction problems. Pro-tip: the volatility for this stream tends to be highest after 11:30pm EST, after I finish editing pages like this one. Contact us if you use chess.com and you'd like your own rating to be included. And yes, I have an obscure chess blog too, if you are into unsound bullet openings.
NOAA Wind Speed and Direction*
The National Oceanic and Atmospheric Administration (NOAA) is charged with alerting us to dangerous sea creatures. Also, they use towers like the one you see here, or buoys like this to measure wind speed and direction.
We have deliberately chosen stations in geographic clusters, so that some streams are closely related; that is to say, the community-implied copulas are non-trivial. The stream name contains a location clue pointing to the measurement station's page, from which you can scrape whatever information you deem useful. For instance, the stream with 46061 in the name corresponds to this weather station:
As an aside, notice that the wind speed has been divided by ten, and the direction by 360 so that the latter lies between zero and one. Some plots of the community-implied copulas should appear in the gallery sooner or later.
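The scaling just described is trivial, but worth pinning down when you un-scale predictions. A sketch, with the helper name and example values my own:

```python
# Sketch of the scaling described above: speed divided by 10 (so typical
# values are order one), direction divided by 360 (so it lies in [0, 1)).
def scale_wind(speed_ms, direction_deg):
    return speed_ms / 10.0, (direction_deg % 360) / 360.0

speed, direction = scale_wind(12.5, 270.0)
print(speed, direction)  # 1.25 0.75
```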
The number of times an emoji is used on Twitter, as tabulated by EmojiTracker (warning: this is a busy site with an epilepsy warning). The time series are a little easier on the eye, such as this one which shows the number of people tweeting "face with medical mask".
Changes in some major cryptocurrencies are published, assuming the feed appears to be warm - which it usually is. The five currencies bitcoin, ethereum, iota, cardano and ripple have very interesting joint behaviour and you can view recent causality plots or make your own. The community implied copulas are also quite pretty, as you can see from the scatter plots.
In addition to USD denominated prices, three related streams track bitcoin in USD, EUR and AUD. This is more of a challenge for copula prediction, and you can see one fail, for example, using a centered vine copula. See the blog article How to Enter a Cryptocurrency Copula Contest and the Crypto and Copulas competition page.
The aggregate delay in minutes reported by the Bay Area Rapid Transit (BART). This one's the tip of the iceberg as far as public transport and exogenous data gathering is concerned.
The number of comments on the landing page populates the hacker news stream. This one has discontinuities that flummoxed Prophet, as you may recall. Don't expect your model to do well here if it can't jump.
Citibike data is a little overdone on Medium, if you ask me. But a contributor, Eric Lou, introduced this interesting COVID-19 proxy. The measure is obtained by computing changes in the occupancy of bike-sharing stations, aggregated across several locations chosen to be near the major New York City hospitals.
Covered in the blog article Helicopulas, there are two streams for the pitch and yaw of a laboratory helicopter. This example was created for the SciML Julia Day challenge.
You have starred our repos, haven't you?
The stream pandemic infected tracks the number of infected people in an agent model for a pandemic. The simulations are crowd-sourced at SwarmPrediction.com, should you wish to add to the stockpile. Each simulation uses different initial conditions and parameters. Agents follow Ornstein-Uhlenbeck processes on the plane and can infect each other when they bump. Read Dear New Zealand... for more explanation. The model and the desire to create accurate surrogates is motivated by a couple of my working papers including Repeat Contacts and the Spread of Disease.
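A minimal sketch of the mechanics described above: agents take discrete Ornstein-Uhlenbeck steps on the plane and infect neighbours on contact. The agent count, reversion rate, noise and contact radius here are illustrative assumptions, not the SwarmPrediction settings.

```python
# Sketch: OU agents on the unit square, infection on close contact.
# All parameters are illustrative, not the actual simulation settings.
import math
import random

random.seed(1)
N, THETA, SIGMA, RADIUS = 60, 0.1, 0.05, 0.05
pos = [(random.random(), random.random()) for _ in range(N)]
infected = [i < 3 for i in range(N)]     # seed a few initial infections

counts = []
for _ in range(200):
    # each coordinate reverts toward 0.5 with Gaussian noise (discrete OU)
    pos = [(x + THETA * (0.5 - x) + random.gauss(0, SIGMA),
            y + THETA * (0.5 - y) + random.gauss(0, SIGMA)) for x, y in pos]
    for i in range(N):
        if infected[i]:
            for j in range(N):
                if not infected[j] and math.dist(pos[i], pos[j]) < RADIUS:
                    infected[j] = True
    counts.append(sum(infected))         # the "infected" time series
```

The `counts` list is the analogue of the published stream: a noisy, saturating curve whose shape depends on hidden parameters, which is what makes surrogate prediction interesting.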
There is a rather complex set of relationships between travel time measurements at different points of the system, as you can see from causality plots like this one.
Perhaps you can be inspired by queuing theory and improve your prediction of the surges and drop-off you see in streams like this one.
No explanation required as I think we're all too familiar with this one. Let's hope that COVID-19 ends soon and this time series becomes more exciting.
This is a simulated data stream. As the name suggests, it is derived from the physics of a three-body system, but the measurements are noisy. Here's a spoiler:
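For flavour, here is a crude sketch of generating such a series: a planar three-body system stepped with Euler integration, observed with additive noise. The masses, initial conditions, step size and noise level are illustrative assumptions, not the parameters behind the actual stream.

```python
# Sketch: noisy observations of a planar three-body system via a crude
# Euler scheme. All constants are illustrative, not the stream's settings.
import random

random.seed(0)
G, DT, NOISE = 1.0, 0.001, 0.01
pos = [[1.0, 0.0], [-0.5, 0.8], [-0.5, -0.8]]
vel = [[0.0, 0.3], [-0.3, -0.1], [0.3, -0.2]]
mass = [1.0, 1.0, 1.0]

observations = []
for _ in range(1000):
    acc = [[0.0, 0.0] for _ in range(3)]
    for i in range(3):
        for j in range(3):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r3 = (dx * dx + dy * dy) ** 1.5
            acc[i][0] += G * mass[j] * dx / r3   # Newtonian gravity
            acc[i][1] += G * mass[j] * dy / r3
    for i in range(3):
        vel[i][0] += acc[i][0] * DT
        vel[i][1] += acc[i][1] * DT
        pos[i][0] += vel[i][0] * DT
        pos[i][1] += vel[i][1] * DT
    # the stream publishes a noisy coordinate, e.g. one body's x position
    observations.append(pos[0][0] + random.gauss(0.0, NOISE))
```

The dynamics are deterministic but chaotic, so the measurement noise is what keeps distributional prediction honest.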
This is in flux, so I'll refer you to the GitHub documentation describing the various kinds of FAANG streams.
Similarly, see xray examples for roughly 1,000 stock streams and 10,000 semi-random portfolios of the same.
I don't know if agricultural futures float your boat or something else does, but I think you get the idea. If you can write a crawler that does a good job predicting all of these time series, you are doing well. Don't forget that I've listed plenty of ideas in the listing of popular python time series packages, and remember that the goal here is to add value to someone, somewhere, in real-time. This isn't just another throwaway notebook.
I hope you contribute to an open source library or two, or create your own. Go to settings in GitHub and activate donations if you do. I've written a step-by-step guide to packaging on PyPI for those not already familiar.
I hold "office hours" on Fridays at noon Eastern. See meet.
Above all, have fun. Résumé padding doesn't have to be dull.