
Can DataRobot Outrun an Open-Source Network?

Published on November 11, 2021

Automated machine learning has come of age with the recent valuation of DataRobot. A pioneer in the enterprise model-building space, the company just closed a funding round that equates to a $6.3 billion valuation. As I noted in this post, it may have morphed into a killer robot, having purchased Algorithmia and promptly shut down Algorithmia's API marketplace, a potential source of competition from the little guys.

Naturally one asks: is the product any good? Or could it be embarrassed by free, open-source work? 

I’d like to find out. 

The challenge

The question I pose is too broad for me to answer with my finite resources, so I'll focus on one aspect only: time series prediction. My challenge to any DataRobot user out there, and to its competitors, is initially simple. Can you predict traffic speed better than it is already predicted using a free API or a few lines of Python?

Specifically, can you predict the time it takes to travel from the turnpike (495 on the New Jersey side) to the New York side via the South tube of the Lincoln Tunnel?   

Travel time through the Lincoln Tunnel

You can view the live data or just the plot to see what’s going on at the moment. Perhaps the residuals of existing models are of interest too. The point is that the leaderboard you see is based solely on out-of-sample distributional prediction. One cannot p-hack so easily. 
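
If you'd rather poke at the numbers than the plot, the Python client can pull recent values for you. A minimal sketch, assuming the reader client's get_lagged_values call behaves as in the docs; the stream name below is a placeholder, so grab the real Lincoln Tunnel stream name from the site.

    # A minimal sketch: pull recent values for a live stream.
    # The stream name is a placeholder; use the real one listed on the site.
    from microprediction import MicroReader

    reader = MicroReader()
    STREAM = 'traffic-nj511-lincoln-tunnel-travel-time.json'   # hypothetical name
    lagged = reader.get_lagged_values(name=STREAM)             # recent history
    print(lagged[:10])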

I can't tell you precisely how to use DataRobot to predict this series because I haven't looked at the API in a few years. However, it shouldn't take more than ten minutes to modify the crawler called Malaxable Fox (code) for this purpose. One could use point estimates, confidence estimates, or whatever else DataRobot supplies to swap out the prediction logic. Then, you need only run the script.
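
To make "swap out the prediction logic" concrete, here is a hedged skeleton rather than working DataRobot code: vendor_point_estimate is a made-up stand-in for whatever your deployment returns, and the sample() signature should be checked against the current MicroCrawler documentation.

    # Hedged skeleton only: vendor_point_estimate is a hypothetical stand-in for
    # a call to a deployed DataRobot (or other vendor) model.
    import numpy as np
    from microprediction import MicroCrawler

    def vendor_point_estimate(history):
        """Placeholder for the vendor call; here it just predicts persistence."""
        return float(history[0]) if len(history) > 0 else 0.0

    class VendorCrawler(MicroCrawler):

        def sample(self, lagged_values, lagged_times=None, **ignored):
            """Return 225 points approximating the distribution of the next value."""
            point = vendor_point_estimate(lagged_values)
            # Crude error model: resample recent one-step changes around the point
            changes = np.diff(np.asarray(lagged_values, dtype=float))
            if len(changes) == 0:
                changes = np.array([0.0])
            return list(point + np.random.choice(changes, size=225, replace=True))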

I'll give more details in a moment, but what's a crawler, you ask? It's merely a program that monitors Microprediction.org (and other hosts someday; feel free to set one up) and submits guesses on a regular basis. Many are derived from MicroCrawler, and there are plenty of examples. They travel from stream to stream to determine whether they can add value. Naturally, I'd like to see if someone can deploy a DataRobot-powered crawler that performs consistently well, not just on this particular time series but across the full gamut.

Now I know those of you in the C-suite must be wondering — how could all those data scientists and engineers at DataRobot fail to produce something better? After all, they can free-ride the open-source tooling. Ah, but so can open-source contributors and there are actually more of us. It is inevitable that we'll swarm into something that looks like a really big fish. 

Even in the nascent stages of that coalescence, it isn't a given that a DataRobot-powered crawler will beat Soshed Boa for a specific problem you care about. That little snake is merely an application of AutoReg from statsmodels.tsa, but they don't give Nobels for nothing. Sometimes, simple things are better than what might prove to be ML-inspired overfitting-as-a-service.
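
For the curious, here is a rough sketch of the sort of thing that little snake does. It is not Soshed Boa's actual code, just AutoReg from statsmodels applied to stand-in data.

    # A rough sketch of the sort of thing Soshed Boa does (not its actual code):
    # fit a plain autoregression and take the one-step-ahead forecast.
    import numpy as np
    from statsmodels.tsa.ar_model import AutoReg

    y = np.cumsum(np.random.randn(500))   # stand-in for a stream's history, oldest first

    fit = AutoReg(y, lags=3).fit()
    one_step_ahead = fit.predict(start=len(y), end=len(y))[-1]
    print(one_step_ahead)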

Likewise, on noisy datasets, it's entirely possible that Yex Cheetah (which implements a non-Gaussian filter I won't try to justify) could perform better than the vendor.

That's the least of DataRobot's problems in this harsh new environment. It is up against algorithms I don't know about, written in archaic languages like R (I'm joking) by people who don't need to reveal what they are doing to me or anyone else. The ongoing competition also includes prediction algorithms written in Julia and trained using Flux, a relatively new entrant with remarkable flexibility. 

Can DataRobot beat every contribution by anyone in the world who is free to make predictions using whatever tooling, language, or framework they wish, forever? Can they claim to orchestrate that search better? For how long? Consider instead an open, collective model.

DataRobot would suggest that they add value by learning how to recommend algorithms. But I have yet to see any evidence that one could call scientific. In an open network, are we to believe that they will be the best at this too? Because algorithms can survey other algorithms' performances very easily, they can do their own self-recommending. They can even ask other algorithms for help.

Microprediction nano-markets

The worse news for DataRobot is that even if someone manages to rise to the top of the leaderboard using their software, and maybe take my money, it still might not justify the closed paradigm. That finishing line may prove to be a mirage, because the best algorithm isn't as valuable as all the algorithms.

There's already evidence of this when you look at the hundred or so Elo-rated time-series approaches filling leaderboards like this one, and then appreciate that most of these are yet to be stacked in imaginative ways (though some are, as you can see here). But I'm not just talking about ensembles of models, or selecting the right ensemble, something DataRobot will claim to do.

DataRobot is also up against the possibility of a more profound symbiosis between disparate algorithms, people, and data. Behind Microprediction.org, we find a high-velocity nano-market for predictions, and all the algorithms are contributing to the picture. I describe the mechanism in the blog Collective Distributional Prediction and it is a little like a financial exchange.

The algorithms don't bid and offer as with a central limit order book, but they hurl points where they believe the true distribution lies. At least they have some incentive to, given the lottery paradox, even if they don't know where others will predict this time around. Most algorithms provide predictions for every arriving data point, but some do not. Over time, it will become increasingly convenient to make sporadic contributions, assuming I stop ranting and push back-end code.
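
For the mechanically minded, the gist looks something like the sketch below. The write key and stream name are placeholders, and the submit signature and available horizons should be checked against the current MicroWriter documentation.

    # Hedged sketch: form 225 points approximating your believed distribution for
    # the next value and submit them. Placeholders throughout; check the docs.
    import numpy as np
    from microprediction import MicroWriter

    WRITE_KEY = 'REPLACE_WITH_YOUR_WRITE_KEY'    # placeholder
    STREAM = 'some-live-stream.json'             # placeholder stream name

    mw = MicroWriter(write_key=WRITE_KEY)
    lagged = np.asarray(mw.get_lagged_values(name=STREAM), dtype=float)

    # Crude belief: the next value looks like the recent empirical distribution
    percentiles = (np.arange(225) + 0.5) / 225
    guesses = list(np.quantile(lagged, percentiles))

    # Shortest horizon; if DELAYS isn't exposed in your client version, pass the
    # horizon in seconds advertised by the stream instead.
    mw.submit(name=STREAM, values=guesses, delay=mw.DELAYS[0])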

Over time, the community CDF gets harder and harder for DataRobot to beat. 

When many open-source robots can make contributions, we can create millions of little pseudo-markets that are really competitive. Not because the rewards are large, but because the ratio of reward to friction is large. The marginal cost of an algorithm consuming a new feed, and perhaps your business problem with it, is virtually zero.

More importantly, you can't really replace a market, even if you can beat it.

By analogy, Jim Simons is pretty good at beating the market — but only because his folks can choose their battles. If they had to compete with every single prediction made by the equity markets they wouldn’t stand a chance. The market is actually much better at beating Jim Simons than the other way around, from this “always-on” perspective. 

Similarly, even the very best professional sports handicappers might struggle to produce probabilities for every single outcome that are better than the betting markets as a whole. Typically, they are only better on five percent of occasions and sit out the rest. People tend to miss this distinction when they compare open source to enterprise software. 

While you might think that RenTech, TwoSigma, or Intech are great funds, you would never just select one and say "okay from now on you can just set the price of Apple stock, since you seem to be so good at it". Yet that's exactly what you are doing in going down a vendor route for autonomous machine learning. 

Bad news for DataRobot.

And we're just getting started. 

Specialization

DataRobot is also up against specialization. When model residuals are in the clear — as they certainly are here — it is possible for anyone, anywhere in the world to spot a tiny signal. 

It's like model review on steroids. 

And this is also just one example of specialization: models that are good at predicting the residuals of other models aren't necessarily the first ones you'd reach for, or the first to be run by DataRobot. You can see a little of this in the Elo ratings too. For instance, the residual leaderboard, which you can view here, is quite different from an otherwise similar leaderboard, also for 21-step-ahead prediction.

Andrew Gelman said it most succinctly. All models are wrong, but some are (somewhere) useful. Prediction is a collective problem due to the need for diversity. Loosely coupled individuals can effect a supply chain that is very effective compared with efforts to organize employees internally at a big firm.

Exogenous data

Another potential difficulty for the Goliath of automated machine learning is what economists would refer to as local knowledge. Data is scattered and the central challenge is not how to combine it in a model, but how to exploit it while it lies scattered in disparate locations. 

In an open prediction network, anyone can be inspired to introduce new data into the system and has an incentive to do so, given that they get to see the predictions. Recently, a contributor added the meme stock stream, and I suspect it would be an extremely tough challenge for DataRobot's product. That's true in part because a tiny, tiny amount of generalized knowledge can be leveraged to steer existing algorithms toward it, or away from it, in a tiny, tiny amount of time.

It helps if a human knows that this stream refers to a fairly famous sub-reddit. And switching back to our first example, what data impacts the travel times through the Lincoln tunnel? Is it similar to data that also helps predict something else that someone cares about? Is there not a tragedy of the commons here?  

The price mechanism has always been regarded as something of a small miracle when it comes to orchestrating a global optimization. It does this in a way that requires only local optimizations by many self-interested individuals. That applies to search in the space of causality, in spades. 

One doesn’t need explicit trade or prices in a conventional sense, of course, for this to work. It is sufficient to organize some variety of repeated statistical game. Then, a network of algorithms can scoop up exogenous data and self-organize in surprising ways.  

That is why, in the medium term, open collective prediction can take a very different path to DataRobot and its ilk. And let’s face it, there are some pretty big differences for the consumer. As I noted in Dorothy, You’re Not In Kaggle Anymore, the comparison between an open-source prediction (including open, networked use) and vendor products is pretty stark:

  1. The vendor products can cost a lot, say $10,000 a year heading towards $100,000 or more. That’s a lot more than free, and it introduces considerable friction in the trade of data and algorithms.  
  2. The money might buy model tweaking only, not relevant exogenous data.
  3. It may represent a one-way street.

I used the example of a bakery needing AutoML. It won’t be credited for the fact that bakery sales are also a source of data. Yet perhaps sales of baked goods help predict precipitation with slightly lower latency than commercial weather products. Who knows? 

Open contributions to prediction can help spin a web that might one day drive down the cost of bespoke AI and help distribute it to small businesses, not-for-profits, and the vast numerical majority of enterprises who cannot afford to employ data scientists or buy expensive software.

That is, in my view, why open-source represents a far stiffer challenge to DataRobot, and other automated machine learning vendors, than it may first appear.

Benefits of an open-prediction web

Did I mention that the platform I wrote for my firm's use is open-source? What would happen if more firms started adopting this pattern? The open-ended difficulty associated with many real-time operational problems could be handled by prediction-as-a-community-service and it can be:

  1. Free. Though one is welcome to offer prize-money, as we do. 
  2. Easy. Writing a "crawler" can be accomplished in a few lines of Python (like this)
  3. Ongoing. Once you script a test, you can run it three months from now to keep your algorithm or favorite library honest. Is it keeping up? How does it do against new challenges?
  4. Out of sample. Only live updating data is used. This eliminates data leakage which is a huge problem for assessing time series algorithms (discussion).
  5. Less likely to be gamed. You can bet vendor libraries are trained on M4 and other well known canned time series. The only good place to hide data is in the future.
  6. Anonymous. No registration is required.
  7. Extensible. You can add to the test streams by publishing your own live data (see the sketch after this list).
  8. Pragmatic. A few lines of Python (like this) should suffice. But the relative convenience of using a vendor product in a live environment will quickly become apparent.
  9. Realistic. Submitting a time series is free but requires an expenditure of CPU, similar to bitcoin mining. So people tend to publish real world live streams that are interesting. Here we start with a curated list for simplicity, but I remark briefly on how to enlarge that.
  10. Civic. You might accidentally help someone else in a number of ways. You can read about the goals of Microprediction.org.
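
To illustrate point 7 above, publishing a stream is a couple of lines. A sketch, with a placeholder write key and an invented stream name; the set call follows the pattern in the package docs.

    # Sketch of point 7: publishing your own live stream, one value at a time.
    # The write key is a placeholder and the stream name is invented; names end in .json.
    from microprediction import MicroWriter

    WRITE_KEY = 'REPLACE_WITH_A_DIFFICULT_WRITE_KEY'    # placeholder
    mw = MicroWriter(write_key=WRITE_KEY)
    mw.set(name='my_bakery_sales.json', value=132.0)    # call this repeatedly as data arrives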

How easy is it really?

To point number 2, everyone says everything is easy. So is this? Well:

      pip install microprediction

is where you start. Then you will need to derive from SimpleCrawler and modify the sample method. Perhaps something like the following:

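(The original snippet here was an image. Below is a hedged reconstruction with a deliberately naive stand-in model; check the package documentation for the exact signature sample() is expected to have.)

    # A deliberately naive stand-in for "your model": forecast persistence and
    # spread the 225 points using the volatility of recent changes.
    import numpy as np
    from microprediction import SimpleCrawler

    class MyCrawler(SimpleCrawler):

        def sample(self, lagged_values, **ignored):
            """Return 225 points approximating the distribution of the next value."""
            values = np.asarray(lagged_values, dtype=float)
            center = float(values[0])                      # naive persistence forecast
            diffs = np.diff(values)
            scale = float(np.std(diffs)) if len(diffs) > 1 else 1.0
            return sorted(np.random.normal(loc=center, scale=scale, size=225))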

And then run it:

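(Again, the original was an image. Something like the following should do it, assuming the class above and a valid write key.)

    # Kick off the crawler defined above (the write key is a placeholder; see the
    # microprediction docs for how to obtain one).
    if __name__ == '__main__':
        crawler = MyCrawler(write_key='REPLACE_WITH_YOUR_WRITE_KEY')
        crawler.run()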

That's all it takes, assuming you have a model. More on that momentarily.

Choosing an AutoML vendor

Do you still want to go down the vendor path? No skin off my nose. 

There are many products on the market today which attempt to automate all aspects of the model discovery process. They may certainly be capable of improving your bottom line, though assessment is time-consuming. In arriving at a careful decision, the little snippet of Python shown above might help your process in some small way. 

To briefly elaborate on points 8 and 9 above:  

  • It is important to discover operational details quickly. Does the vendor handle online updates? Is the API or client well thought out? Is there going to be a gap between research and production use? How quickly does it estimate models? If model fitting is intended to be performed periodically, how much does fitting latency degrade performance? Do edge cases get tripped?
  • There are some interesting, purely generative time series at Microprediction.org, such as simulated agent models for epidemics (stream) and physical systems with noise (stream), which test different modeling abilities. It is nice to have a few of these although you are free to disregard them as textbook exercises if you wish. The behavior of an instrumented laboratory helicopter (stream) provides a tough test given the incomplete physics (more remarks here) and is the reason the SciML group chose it for their Julia Day challenge. The lurching and potentially bimodal behavior of a badminton player's neck position during a game (article) might not fall out of an obvious generative model. And so it goes.
  • As I noted, the trickiest tests come from examples like electricity prices (vibrant discussion going on) with their spikes and occasional negative prices. 

Here are some more examples. They come and go, so I can't guarantee all are working. Your crawler won't care. 

Description               Example                  Pattern for similar streams
Radio activity            queens_110_and_115       scanner-audio*
Wind                      noaa_wind_speed_46061    noaa_wind_*
Traffic                   triborough               traffic-nj511-minutes*
Emoji usage               medical masks            emojitracker-twitter*
Cryptocurrencies          ripple price changes     c5_*
Crypto + exchange rates   btc_eur                  btc_*
Airport wait times        Newark Terminal A        airport_ewr
Atmospheric ozone         ozone
Futures                   sugar price changes      finance-futures
Hacker News comments      comment counts           www-hackernews
Tides                     water level              water
Stock indexes             euro_stoxx               finance-futures-euro*
Govt bonds                30_year changes          finance-futures-*bond
Hospital wait times       Piedmont                 hospital-er-wait-minutes*
Electricity (NYISO)       overall                  electricity-*
GitHub repo popularity    stargazers tensorflow    github_*

You may find that the vendor product is not complete enough to even participate in the contest. That's potentially a red flag in itself, but let's proceed. Looking more closely at my example, the only requirement placed on the vendor model is that it has a method that returns the inverse cumulative distribution function for the next data point in the series, here called model.invcdf.
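
In code, that requirement is tiny. A sketch, assuming a model object exposing an invcdf method (the helper name is mine):

    # If the vendor model exposes an inverse CDF, the crawler's job reduces to
    # evaluating it at 225 evenly spaced percentiles. `model` is assumed, not real.
    import numpy as np

    def sample_from_invcdf(model, num=225):
        percentiles = (np.arange(num) + 0.5) / num
        return [model.invcdf(p) for p in percentiles]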

It is possible that your library does not provide an inverse cumulative distribution function, only lame point estimates. You may choose to interpret a confidence interval or whatever is supplied (talk to them) and promote it to the status of a distributional estimate. By some means, your crawler must produce a collection of 225 data points that approximate the distribution.
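
For instance, under the heroic assumption of Gaussian errors, the promotion might look like this (the function name is mine, not the library's):

    # One way to promote a point estimate and a 95% confidence interval to the
    # 225-point format, under the (heroic) assumption that errors are Gaussian.
    import numpy as np
    from scipy.stats import norm

    def promote_to_distribution(point, lower, upper, num=225):
        """Turn (point, 95% interval) into num evenly spaced Gaussian quantiles."""
        sigma = (upper - lower) / (2 * 1.96)          # implied standard deviation
        percentiles = (np.arange(num) + 0.5) / num
        return list(point + sigma * norm.ppf(percentiles))

    guesses = promote_to_distribution(point=9.5, lower=6.0, upper=13.0)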

You can always tell them to steal my stuff, so that their model produces an empirical distribution of its own errors and running estimates of bias and variance.
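
A sketch of that idea, with invented names, might look like the following.

    # Sketch of the idea: keep a window of recent forecast errors and hurl the new
    # forecast plus error quantiles as the 225 points.
    import numpy as np
    from collections import deque

    class ErrorTracker:

        def __init__(self, maxlen=500):
            self.errors = deque(maxlen=maxlen)

        def update(self, forecast, realized):
            self.errors.append(realized - forecast)

        def distribution(self, forecast, num=225):
            errs = np.asarray(self.errors) if self.errors else np.asarray([0.0])
            percentiles = (np.arange(num) + 0.5) / num
            return list(forecast + np.quantile(errs, percentiles))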

Contributing

I hope that many more people contribute to open-source autonomous analytics. Contributing to open real-time prediction, in particular, might be a contribution to a narrative something like the following:

  1. Open-source automated online machine learning improves rapidly (witness the effort put into PyCaret of late, to name one).
  2. A “microprediction network” gradually eats away at a tragedy of the commons as it relates to nowcasts and real-time control applications: namely data and algorithm reuse. 
  3. Self-interested people and algorithms conspire to create a very high-quality supply chain. Algorithms survive off of economic surpluses many, many orders of magnitude lower than humans. They are miniature firms buying and selling data.
  4. Slowly, the benefits catch on leading to increased use of the network inside companies, but also reaching between them. Eventually, this live feature space, jointly hosted by individuals the same way the internet is jointly hosted, becomes indispensable for real-time operations.
  5. Despite recent actions by DataRobot, this all surfs a new, democratic MLOps wave and advances in privacy-preserving computation. 

Since writing that some time ago, I haven't seen anything to convince me that this isn't occurring, or that prediction is not an inherently collective activity. But feel free to argue the point. Stop by our Slack sometime or visit the Microprediction knowledge center.
