Microprediction founder, Peter Cotton, Ph.D., sat down with the co-founder and CEO of Prifina, Markus Lampinen, for the fifth episode of the "Liberty. Equality. Data." podcast. Below is a transcript of their discussion.
ML: First of all, Peter, thanks a lot for joining us on this podcast. It's great to have you here.
PC: Nice to be here.
ML: Why don't we start with a little bit more context. One of the things that you and I share is that we have worked quite a lot with financial data (you, certainly, much more than I). I would be remiss if we did not go back a little bit. I know that you have worked a lot with various data. By way of an introduction: what type of data have you worked with, what type of things have you developed?
[1:50] PC: Sure. I will start with the middle section of my career, when I was an entrepreneur trying to launch a financial data company. We took in, literally, tens of millions of data points every day, and we spat out tens of millions of data points every day. In a sense, we were a glorified cleaner of financial data. We were trying to provide indicative mid-prices for corporate bonds based on inputs from the corporate bond market, but also from the credit default market and the interest rate market. Many of the data points were quite messy. It is normal these days to think about alternative data, but back then not everyone realized that data could come from an unstructured email sent from dealers to clients, and that kind of thing. That is what I did in the middle of my career, and I formed a strong impression about the value of statistically based data cleaning - having lived through the alternative.
The rest of my career I spent in financial firms. At Morgan Stanley, I led the effort to build correlations between trading models and deliver tools that pushed credit derivatives along. Most recently, I worked at J.P. Morgan Chase on various programs, including some crowdsourcing initiatives and optimal platform trading, and I initiated a privacy-preserving track there, among other things.
ML: There are many different things that we can pick on. The crowdsourcing aspect is fundamentally interesting because you are talking about it coming from a hedge fund’s perspective. I’d love to hear what that means for you in practice, and what it has meant in practice, because it may have had different dimensions.
PC: It is a broad topic. When people talk about crowdsourcing, they think about crowd-labeling data, or who knows what. It certainly means a lot of different things to me. I tend to focus on things that try to level the playing field in the machine learning (ML) space and things that give access to anyone who has the drive and ambition to prove that they can deploy. In practice, what that meant in one of my past roles, was that a relatively junior employee (one or two years out of school) working in operations in Mumbai actually produced a better model than all of the ML experts for a very large organization. That’s an example of what I like about it: it is a very egalitarian activity.
ML: I was thinking about that aspect of leveling the playing field. It seems to be the dominant thesis of the last decades, certainly in ML and AI where people talk a lot about how to unlock the value from ML to solve some of the largest challenges that we have got. Also, recognizing that from the macro-scale we have a number of challenges: it is not plausible that only one company or one individual were to solve all of them. So that is where you have this idea of flipping the crowd.
Another thing that you mentioned was around those alternative data sources. I wonder how much of that has changed. Yes, a lot of financial data has become structured, and a lot of it has been organized by some large companies. Of course, not all of it, but certainly many pieces. You mentioned that alternative data could come from broker communication.
Let’s talk a little bit about the new things that you are doing because you are also working on new types of alternative data sources. Probably, one of the hypotheses you can make is that whenever you have a new data source (which we have more of every single day), this problem of cleaning does not go anywhere. Rather, it is going from one place to another; however, you have the same problem repeating itself across different places you look at.
PC: That is a fair comment, certainly. You know, I see a lot of this phrase “garbage in, garbage out”. It grates on me a little bit. Actually, data cleaning is one of the more interesting mathematical activities: it is an inference problem at heart. It is funny because when I began my career I was sort of a quantitative snob: at that time everyone worked on derivative pricing. There was not a single quant that worked on pricing vanilla bonds. Interesting. In the next stage of my career, I worked on pricing vanilla bonds - so I took one step down. And then, I took one more step down: not pricing bonds, but cleaning the data that goes into pricing the bonds.
It is funny because as you move down this food chain, if you will, an interesting thing happens - the math becomes more interesting. So you start with derivative pricing which I would describe as financial probability, and then you move to the pricing of vanilla bonds. You may ask what to do with them in the traditional quantitative finance sense ... the only thing to do is to compute the cash flow of the bonds. But then you realize that the bond market is an incredibly complex noisy beast, so then you realize that you have a real financial statistics problem, and not just a financial probability problem. As you move "down" to data cleaning, you realize that these tasks - which are sometimes shunted off to your database admin - are actually the most interesting mathematical problem of all because the cleaning of data implies that you understand the market itself. You can’t clean the data well if you do not know what is going on in the market.
ML: There are so many things that you just have to write, take various parsers. It depends entirely on the data set. But, for example, I’ve myself spent a lot of time just building web products. Especially, when you have different input coming from the end-users you could think of many variations. Let’s take San Francisco: how many different names can you think of to spell San Francisco? If you ask the user, they could come with 250 different ways. That is something that is not necessarily a mathematical problem, but it has to be solved anyway.
PC: I used to live there, and I think I spelled it in 5 different ways.
[10:15] ML: That’s right. You went to Stanford over here.
PC: I would agree that sometimes there is some sort of data where ultimately you are going to require a human eyeball somewhere or a human judgment. However, there are a lot of things that you can do along the road to the final result which speed things up. For example, if your database does in fact contain names that have not been put in a canonical form yet, and if you were to allow people to run algorithms over it, they could compute the probability that the human is going to change that record in the database. Then you can rank such sorts of probabilities, and then you could send the most likely things to change according to the computers to humans. That would make humans more efficient. You can also try all sorts of different unsupervised algorithms or what you have.
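The ranking idea Peter describes can be illustrated with a minimal Python sketch. It assumes a hypothetical list of canonical names, and it uses string similarity as a crude stand-in for a learned probability that a human would edit the record; neither is from the conversation, and a real system would presumably let competing algorithms supply that probability.

```python
from difflib import SequenceMatcher

# Hypothetical reference list of canonical spellings (an assumption for this sketch).
CANONICAL = ["San Francisco", "New York", "Los Angeles"]


def change_score(record: str) -> float:
    """Crude proxy for the probability that a human will edit this record.

    Exact canonical matches score 0. Anything else scores by its similarity
    to the nearest canonical name, on the assumption that near misses are
    the most likely to be typos a reviewer would correct.
    """
    if record in CANONICAL:
        return 0.0
    return max(
        SequenceMatcher(None, record.lower(), name.lower()).ratio()
        for name in CANONICAL
    )


def review_queue(records):
    """Return non-canonical records, most likely to be changed first."""
    scored = [(change_score(r), r) for r in records]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [record for score, record in ranked if score > 0]


if __name__ == "__main__":
    entries = ["San Francisco", "San Fransisco", "SF", "Sanfrancisco"]
    for entry in review_queue(entries):
        print(entry, round(change_score(entry), 3))
```

The human reviewer then works from the top of the queue, which is the efficiency gain described above. Abbreviations such as "SF" score low under this proxy, one reason a similarity score is only a placeholder for a proper model.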
ML: One of the things that we noticed is that you can also boost the utility of the data by creating metadata in the process - and this is incredibly use-case-specific - e.g., in the context that we have seen starting from personal data, such as hours of sleep, you can then very distinctly put it into a couple of categories. E.g., regular and irregular sleep, a small and large amount of sleep, and so on and so forth.
These types of things are not necessarily just about cleaning; they can also have a lot of utility in terms of categorization. There are a lot of such base-level problems that may not seem like the most glorious thing to work on, but as you start getting those things right, they do have a multiplying effect, making the data more useful and more valuable when you are building on top of it.
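As a rough illustration of the sleep example, the sketch below derives category labels from a list of nightly hours. The function name and both thresholds are assumptions made up for the sketch, not values from the discussion, and any real categorization would be use-case-specific, as Markus notes.

```python
from statistics import mean, pstdev


def sleep_metadata(hours, short_below=6.5, irregular_above=1.0):
    """Derive coarse labels from nightly sleep durations (in hours).

    short_below and irregular_above are illustrative thresholds only.
    """
    average = mean(hours)
    spread = pstdev(hours)  # night-to-night variability
    return {
        "mean_hours": round(average, 2),
        "amount": "small" if average < short_below else "large",
        "regularity": "irregular" if spread > irregular_above else "regular",
    }


if __name__ == "__main__":
    print(sleep_metadata([7.0, 7.5, 7.0, 7.5]))
    print(sleep_metadata([4.0, 8.0, 4.0, 8.0]))
```

The raw series stays untouched; the labels are extra metadata that downstream applications can build on.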
PC: Another one is making data continuous, rather than sort of intermittent. Think of the number of models out there that used monthly accounting data for companies: they could be using a continuous estimate of the same thing. There are lots of different ways to enhance data. That is a pretty good application of what I am trying to work on now.
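To make the idea of a continuous series concrete, here is a minimal sketch that fills in values between intermittent observations by linear interpolation. This is a deliberately simple stand-in: it is retrospective (it needs the next observation), whereas the continuous estimate Peter has in mind would presumably be a live prediction. The function name and the day-index convention are assumptions.

```python
from bisect import bisect_right


def interpolate(times, values, t):
    """Linearly interpolate a value at time t from sparse observations.

    times must be sorted in increasing order; values are the observations
    at those times. Outside the observed range the nearest value is held.
    """
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_right(times, t)  # index of the first observation after t
    t0, t1 = times[i - 1], times[i]
    weight = (t - t0) / (t1 - t0)
    return values[i - 1] + weight * (values[i] - values[i - 1])


if __name__ == "__main__":
    days = [0, 30, 60]            # monthly reporting dates, as day indices
    reported = [100.0, 130.0, 100.0]
    print([interpolate(days, reported, d) for d in (0, 15, 30, 45, 60)])
```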
ML: Let’s talk about that a little bit. One of the things that prompted this discussion about alternative data sets is that you have really cool stuff about micro-predictions. E.g., there is a feed on hacker news comments and such. The first question that I had was what does a hedge fund do with such things.
PC: There are different stages to it. First, I am trying to build an infrastructure and make sure that it works. Some of these [feeds] are for fun, but clearly some of them have some intent going forward. You can look at Hacker News, or other streams. I did not do that one; a contributor did. We also standardize emoji use. One of the interesting things about our platform is that any stream you put through gets predicted in a distributional sense, meaning that the algorithms will tell you whether an observation sits at, say, the 96th percentile or the 13th. So everything gets standardized.
It has become very easy for me to look at that standardized emoji use based on time of day. I do not know: maybe people use more happy faces in the morning than in the evening - I really don't know. But all of those things you can take out. When I looked at the presidential debate, I was able to see what was poking out immediately. That was rather interesting. I can tell you that the reaction to the first presidential debate - I am sure you remember it, it was a dumpster fire - was incredibly negative. All of the darkest and most horrible emojis you could think of started going off the scales in an expression of disgust.
ML: This is a very specific point we can take a look at. If we look at the data feed in order to determine sentiment, you may have an emoji used, but you might also have many other ways to do that. That could give you a lot of real-time reactions to something, such as the presidential debate and lots of other events where you have that type of population reacting to something. Is that one of the core theses of micro-predictions: that you can take these types of data sets, categorize them, and make them even more democratized - so that someone can take these data sets and use them for whatever thing they are working on?
[15:55] PC: The broader underlying thesis of micro-predictions is a restatement of how we can come to an understanding of real-time activity. This has to be a collective activity. It is simply inconceivable that one small group of people - or even a large group of people - could really manage that. As you mentioned, from one source of data at certain times you can create what I call “weak truths” about all sorts of things. The number of mentions of “CEO resigns” on Hacker News is not the truth, but if you ask people to predict it, and you do so in a real-time contest where it helps to find exogenous data, then you start enriching that feed. After all, sometimes the truth is a very useful regression of what is an approximate truth.
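One way to read "standardized" here is that each new observation is expressed as a percentile of a predicted distribution. The sketch below is an assumption-laden illustration of that reading: it uses the empirical history of a stream in place of the crowd-sourced distributional predictions the platform relies on, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def percentile_of(history, x):
    """Mid-rank empirical percentile of x within history, between 0 and 1.

    Ties are split evenly, so a value equal to every historical
    observation scores 0.5.
    """
    ordered = sorted(history)
    below = bisect_left(ordered, x)          # observations strictly below x
    at_or_below = bisect_right(ordered, x)   # observations at or below x
    return (below + at_or_below) / (2 * len(ordered))


if __name__ == "__main__":
    # Hypothetical counts of an emoji per time bucket.
    past_counts = list(range(1, 101))
    print(percentile_of(past_counts, 96))
    print(percentile_of(past_counts, 1000))
```

Once every stream is on the same 0-to-1 scale, unusual readings from very different feeds can be compared directly, which is what makes something "poke out".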
So micro-prediction is about sharing algorithms (a place where you can put an algorithm and let it drive around to see where it is good at). But it is also about sharing that real-time feature space.
ML: There is certainly a lot of exploration when it comes to these types of data sets and also these types of models. Certainly, there are some data feeds that are more actionable, where you can right away see the utility that is directly linked to them. You and I talked a little bit about the data feeds on bike-sharing around hospitals, for example. During COVID such data could be used to determine how busy the hospital is, or some type of environmental factors. There are probably a lot of other types of data feeds that the community can submit which may be more borderline: it may not be entirely clear if they have utility or not.
PC: Yes, that is right. Search in the space of models and data - as we would admit - is a very challenging thing. And, nothing overnight is going to make it miraculously easy. But, slowly over time as people start to contribute to the structure, that is going to help humans and algorithms find things and assess whether or not data is useful (the marginal examples that might be useful in the future). That is the matching process.
Sometimes I think of it as an analogy of the web itself. Initially, people started creating web pages back in the early days of HTML. We probably did not realize at that time that we were creating this structure that related different concepts and things. I think that we need something analogous for real-time predictions.
ML: It is fascinating; of course, in hindsight, it all makes sense. You are right that at the moment of building, it’s all about building, and then through quantity it starts converging; and you don’t necessarily see it ahead of time. Let’s take one step up from micro-predictions and talk about this entire democratization of ML. There is a lot of excitement in the ML space. What part of this is actionable?
PC: We have to start by appreciating that the cost of instrumentation is falling dramatically. But the cost of building bespoke models is not. So the balance of power right now is incredibly one-sided, isn’t it? People will continue to surrender their data for nothing because companies are so much better placed to analyze that data and use it. So it is very hard to see the balance of that power changing unless you see the price of that bespoke AI also falling at the same rate as the cost of instrumentation. There isn't a lot of expectation of that right now; there is no Moore's Law for enterprise data science. You don't expect data science to cost half as much every year. We still think of it as a risky activity that is undertaken by well-compensated individuals. So most of that kind of work takes place in companies with revenues in excess of billions of dollars a year, for the most part. So the vast numerical majority of organizations and individuals do not have pocket-size data science teams ready to help them utilize their data or take control of their quantitative destiny.
That is why I am trying to focus on reducing the cost of data science - that is my current obsession. Cost, as you know, has different forms. The big ones are the search cost, the frictional cost of trade, and the other costs that individuals incur trying to understand what to do with their data. The first step is to give them an API that predicts anything. If they keep feeding it data, eventually it will predict very well.
The next thing we can do is to ask whether there is a really simple way for you to discover if your private data is useful. That is where the privacy-preserving computation comes in. It does not have to be very complicated actually; these are some really simple algorithms. The point is to encapsulate it so that the user does not have to think about it; they just get better predictions because somebody else has data that is useful.
ML: That’s right. I’ve been thinking about different layers. If we take those companies in excess of billions of dollars of revenue, that’s one pocket of different types of users. And, just like you said, they can afford well-compensated mathematicians, data scientists, and ML experts. They typically have small armies of them.
When you go one step further, to small and mid-cap companies, they may have quite a lot of data but they only have small data science teams focused on core verticals, and they cannot apply them to everything simply because of the cost. And one step further, you have those small companies and startups. Of course, Silicon Valley companies can scale up rather quickly, but they have to start somewhere.
Then you can go all the way down to the individuals; that is the area where the cost of modelling is not enough any more. Most folks will never ever care about the modelling; they will care only about utility. There you have to go into the packaging. Enterprise applications are very different from end-user applications. They have to be better: they have to help people sleep better or lower their cholesterol, and help them do something better. So there must be that user agency embedded. I agree that the cost is a very clear thing. Even if there is no Moore's Law, there is this movement towards efficiency, towards a cheaper way to get some insights and be able to predict.
How do you look at the most radical area? Let's take micro-predictions and what could they look like?
PC: Micro-prediction is designed so that a restaurant could predict how many people will come in the next half an hour and just be done. In the shorter term, it isn’t the restaurant that will be the first to the ball; it would be hedge funds. That is a function of my limited market at this time. It would be good if people knew they can have some free data science, for it’s the SMEs and smaller businesses that can benefit from automation of their business. Right now that's only through other products’ or companies’ terms.
So, I would agree with you that there is a general push to reduce the cost of it. Nonetheless, it will arrive through a smarter paper clip or something at the bottom of the Excel sheet. It is not going to be your own thing, really. I think that is the key difference: it can be easy for a company to get the optimization that it needs. In theory, once we build this thing, it can be easy for them to take control, whether for inventory optimization, modelling failure rates, or any of these classical uses. And can they monetize their data at the same time?
ML: You could look at it from the self-sufficiency prism. When I think about it, part of the value proposition could be from the output, but another part of the value proposition is the control - you are not relying on anybody else, so that you can build it into the DNA of your company. From there, you almost have a spillover effect - now you can harness the value of your inventory data, or if you are a restaurant, the value of the data about the lunch hours. That means you can run a more successful business.
PC: That’s right. Who is to say that your bookings are useful to someone else? For companies to own and take charge of everything they are instrumenting, that is something they are not in a position to do because most companies can’t even afford a data scientist, let alone a data science team. So the question is how do you empower them?
ML: There is also this very broad push in the market toward decentralization overall. You can define decentralization in all kinds of different ways: it's effectively specialization of skills of knowledge, of time, of everything. There is also this decoupling of different huge systems into building blocks. This fits neatly into the idea of taking the modelling of huge black-box systems and breaking them down into individual APIs, that would make total sense from the efficiency point of view.
PC: I spent a lot of time thinking about it, actually, in the design of what I built and what I am trying to build (the second version of it is not yet visible). In theory, we should be able to create these efficient supply chains where people can make incremental contributions of a feature or a piece of data or something to produce a high-quality product or maybe dictate the decision that is made in the manufacturing chain. I think we know that, at least in theory, the thing that stops trade self-organizing is the friction of trade.
Right now that friction is incredibly large. Think of the cost of hiring an employee, or attending a conference. The thesis of what I am doing in the book is that if trade occurs inside of statistical games, then the friction of that trade could be incredibly small. In theory, we should see an explosion of this microscopic trade in AI and therefore a tremendous reuse and sharing of all of the things that trade self-organizes. To be honest, I think that the price mechanism is sort of under-rated.
ML: If it's the difference between hiring a data scientist, training them with your business, giving them information about your business, onboarding them, and then asking them a question versus just calling an API…
[30:00] PC: It certainly is. We can rephrase it by going to our original discussion of leveling the playing field, so we are just leveling the playing field between humans and algorithms. Right now the reason why algorithms cannot directly assist businesses is that businesses are arguably not doing enough to accommodate them. Businesses do all sorts of things to accommodate humans; they have entire departments devoted to recruiting. Well, in the future, companies will have AI recruiting departments whose task is specifically to recruit the right algorithms for a specific business purpose through the use of simulated data etc.
The second problem is sort of analogous to human rights, privacy being one of them.
ML: It’s a fascinating thought. If we take this a little bit further, you could argue that we can already see this in practice because on one side you have a lot of RPA and software automation and software robotics and they are actually performing work. Some countries are actually taxing some of the work that RPA is producing. At the end of the day, if you have software robots actually performing tasks, then the question arises that they need oversight, coordination, maintenance, all of those different things. Effectively, they are similar types of tasks in nature to the ones you have in the traditional HR, then in all of this surrounding orchestration and how you support all of that. This may seem quite far away, but I would argue that it may come sooner than we may expect.
PC: I think that one of the core interesting mathematical problems is the design of micro-managers. The micro-manager is trying to solve the problem of how you manage algorithms, including the full life-cycle of entering economic engagement with them, assessing them, and so forth. I tend to be a little bit of a coward, so I like to focus on the micro-prediction domain because there is so much fast-moving data that it’s relatively easy to perform an assessment. Therefore the task of building an automated manager of algorithms is a lot easier, or, as I would say, practical; whereas sometimes the task is managing an algorithm that is predicting next year’s GDP or the outcome of some unusual surgery ... those things require much more effort.
Companies fail to distinguish two classes of applications - the ones where you can automatically assess quantitative work, and the ones where you can't. They tend to drag the high cost of the latter into the former. If you split them up, then the cost of modeling can be reduced by an unbounded amount. Whereas the cost for a model of the next year’s GDP will always be high.
ML: I like the term of micro-managers: it’s a fascinating thing. It is in human nature, that we start by looking at solving one very specific problem, and then we start realizing that in order to do that we actually need to do all these other things. The idea that we have an algorithm and then you have a manager of that algorithm - it’s complete common sense. You also have the same thing in software. But, the more that these algorithms actually work, the more ingrained they become, the more likely is that you will need an entire HR department for that.
PC: The AI department.
ML: But it becomes very real, because there is no other way you can do that. If you have a black box where you dump all the algorithms and they are doing God knows what, arguably, they would be doing the same thing over and over, but still, you will need all of those different things - there is no way to skip that.
Taking a step back, let’s think about the individual’s role in all of this. We talked about reducing the friction of modelling and lowering various burdens, but can we think about individuals themselves being able to use their own intelligent algorithms, although these are closer to apps for individuals?
There is also this concurrent debate about personal data and personal data rights which individuals can utilize on their own. What are your thoughts around this?
PC: I am always reminded of Hayek, who was trying to establish a "rational economic order" and asking himself where the blocker lay. His answer was to go with whatever utilizes everyone’s dispersed knowledge in the best way. So what does that look like [for machine learning]? I am sure there are a lot of things that one can do to bring that dispersed knowledge to a better outcome (whether it is your health, or something else, the optimization of machines, or who knows what). But I think that one very clear thing that can work is reducing the friction of trade because that enables individuals and people to contribute to the overall optimization without having to solve the overall optimization. I don't think that the regulators are going to solve the overall optimization - it just isn't going to happen. They can, perhaps, do some things to maybe avoid the most egregious market failures, but they can also do things - as we know - that have a lot of unintended consequences, increase frictions, and sometimes can actually cut out the small players who can’t afford those fixed costs of those new regulations.
ML: It feels like a perfect example of that crowdsourcing mechanism that we talked about earlier, in terms of figuring out what type of utility these micro-predictions, algorithms, and data feeds have. If we have this world order we can make this personal data more utilized and create a privacy-preserving mechanism. Then the next question is how to utilize it. This is where the crowd comes in: let’s expose it to any number of individuals, let them build different kinds of things, and let the market figure out what is valuable or not.
PC: I think that what the world needs, and what we need as we think about AI, is to overcome our prejudices about what kind of work is useful. I would argue that the world needs those little micro-managers to effect those little arbitrages that bring problems to data and vice versa. The world needs those little middlemen. The world needs these things that find those opportunities. When you look at the arbitrageur or a broker, sometimes you think of those people as being in the shadows. Actually, in the AI space, you need the algorithmic version of those things.
ML: What’s even better is that you need them on your side. It is not about having one central AI, but it’s about having millions of them, and having them effectively for each individual. If you could democratize AI in that sense that you can make it work for you as an individual, as a small company, as a mid-cap company as well as the billion+ large players, that is the promise of democratization in a very real sense. It is not one thing, it is lots of them. This is something that I am very excited about. Just think about longevity and health: we have so much data, but where is my AI? Where are my scripts that look after me? That is where we need to bring down the cost, because we need more exploration.
PC: That’s right. There is no master algorithm. There are a lot of different things that work in some places and not others; it’s a search problem. I do not know if that is all that different to document search.
ML: Maybe one thing to touch upon on the tail end: what are some of the technologies or use-cases that you are following closely? What are the technologies that you are most excited about and where are they going?
PC: It's quite interesting to see what other people are doing, wind speed and direction, and electricity innovations, and how much renewable energy we can produce. There are these big prediction problems, obviously: volatilities in stocks. We have a really detailed look into the cryptocurrencies and we try to understand the five-dimensional structure of the five major cryptocurrencies. How do you do that? Maybe you can crowdsource some lower-dimensional margins and reconstruct it. This is what is interesting to me statistically because I do not think that that kind of a collective approach in statistics has ever been done before.
Some physical systems are interesting. I borrowed some data from a laboratory helicopter - you have this stylized helicopter and you can understand approximately what the differential equation should be, but it is never quite true, so you are missing some friction points.
I’ve looked at things such as the popularity of different software packages and repos. There is a simulated epidemic model that I worked on separately. It has very different characteristics but it's slower; so I was wondering if I could come up with some approximations of it.
The patterns of traffic on the major thoroughfares in New York City are interesting statistical problems, as are the statistical relationships between entries and exits, airport wait times, and a three-body system.
All sorts of things. What’s interesting is that they all make very different demands on algorithms. Some algorithms are good at dealing with noisy data, others are better where there is a lot of structure to be determined. So yes, let's see where it all goes. But as mentioned above, it's all about small things, and I find small things very interesting. Not everybody cares about Hikaru Nakamura’s blitz chess rating, but statistically it's very interesting because of the way it is generated; there are time-of-day effects, there are duration effects because he gets tired after playing… so on and so forth. There are electricity or sugar price changes, the number of medical masks to get, numbers of emojis; there are so many things to look at.
What is really interesting to me is that when you take super popular open-source forecasting packages and put them through this, and you force them to predict real data, they do terribly. We will see where it goes. The more adaptive the algorithms, the more purposeful they can become - in theory.
ML: It is all about lifting them off into the open, and it's about the collaboration. If you can reduce the cost to the bare minimum, you can get quicker iterations and quicker feedback loops, and you can develop them quicker. That would be super exciting to watch. Thank you for the chat. We will keep watching what is coming from the micro-prediction community.
PC: Thank you for the chat and best of luck with your project.
If you enjoyed this discussion, please consider following microprediction on LinkedIn because Peter is sad and lonely in a quarantine hotel and it will cheer him up.