Nobody with a passing interest in machine learning, control, or applied statistics will have missed the Reward is Enough paper by David Silver, Satinder Singh, Doina Precup, and Richard S. Sutton. The title is a high-quality provocation but in this response, I argue that "reasonably impressive collective intelligence" is more feasible than the outcome they hope for - one seemingly premised on very powerful reinforcement learning (RL) agents. In my counter-scenario where many people contribute only "reasonably intelligent" algorithms, there is an orthodox reason why reward might not be enough. Fortunately, this also suggests a solution.
It is invigorating to see professors formulating a strong hypothesis about a field they pioneered - and at the same time a hypothesis about multiple fields. This paper might be seen as both a mini-survey for those doing related work and an introduction to reinforcement learning principles for those in different fields that might benefit. It invites researchers to consider whether they have underestimated a line of thinking. And it says, or at least I read it as, "listen, this might not work but the payoff is rather large - so lend me your ears (and maybe a little more funding)."
Oh yes, the list of flippant responses to this paper is long. Rejected titles might have included, "Is another $3 billion enough?". Coming from DeepMind, the paper might also be cynically viewed as "reward hacking". The project might be seen as an agent that has a huge, legitimate scientific goal. But like a robot in a maze that is seemingly too hard to solve, it might be accused of overfitting to less scientific intermediate rewards it has created for itself.
If that harsh view is in any way accurate, then I blame the press and corporate faux data scientists for that, not the researchers. Funding for research is hard to come by, and science is rarely advanced by those with lukewarm enthusiasm for their own work. And let's not forget the accomplishments. Solving protein folding has immense implications, even if that doesn't translate immediately into general intelligence. It doesn't matter if DeepMind hasn't "finished" that - for they seem to be leading.
Why not entertain speculation on the future of AI from these authors?
In this article we hypothesise that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward. Accordingly, reward is enough to drive behaviour that exhibits abilities studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language, generalization and imitation.
Sure, what differentiates this paper from a reinforcement learning survey is this somewhat aggressive style. But the fact that someone is "talking their book" doesn't make them wrong. Nor should it disqualify the potential. That's so large, it almost feels like a Pascalian wager:
This is in contrast to the view that specialised problem formulations are needed for each ability, based on other signals or objectives. Furthermore, we suggest that agents that learn through trial and error experience to maximise reward could learn behaviour that exhibits most if not all of these abilities, and therefore that powerful reinforcement learning agents could constitute a solution to artificial general intelligence.
I share the authors' desire for unifying beauty - who doesn't? And in particular, the elegant idea that reward maximization is "sufficient" is attractive. I'm less sure that this is quite as clearly delineated from every other possible emphasis as the authors would like, but they make the case as follows.
One possible answer is that each ability arises from the pursuit of a goal that is designed specifically to elicit that ability. For example, the ability of social intelligence has often been framed as the Nash equilibrium of a multi-agent system; the ability of language by a combination of goals such as parsing, part-of-speech tagging, lexical analysis, and sentiment analysis; and the ability of perception by object segmentation and recognition.
In other words, the need for seemingly disparate approaches is disheartening, as far as general AI goes, and it reduces the chance of a "wow" answer. So let's bet on RL. The situation is not dissimilar to the meditation in Pedro Domingos' book The Master Algorithm, in which the author invites the reader to contemplate an all-encompassing algorithm subsuming special cases like nearest-neighbor, genetic programming, or backpropagation. (I note that Domingos pursues the crowd-sourcing approach, and isn't asking for forgiveness on a $1.5 billion loan ... but I promise to stop with the jibes now).
In "reward is enough", the speculation is not quite down Domingos' line because the authors stop short of suggesting a singular RL approach - or even the possibility of one. At least in my reading, the point is more subtle. They suggest:
In this paper, we consider an alternative hypothesis: that the generic objective of maximising reward is enough to drive behaviour that exhibits most if not all abilities that are studied in natural and artificial intelligence.
So, the broad category of reinforcement learning, which presumably needs to be delineated from more model-intensive ways of guiding people and machines, can be enough when it comes to creating and explaining generalized intelligence. I don't think they are suggesting that every function in numpy be rewritten as a reward optimization (it is mathematically obvious that all of them can be).
The authors mention some examples where reward is enough for imitation, perception, learning, social intelligence, generalization, and language. A danger here is that it can feel a little bit like a survey of applied statistics or control theory, where an author concludes that computing a norm of a vector is enough (as that is common to many activities, it must be conceded).
The possible circularity of defining intelligence as goal attainment and then claiming that goal attainment explains intelligence has been noted in a rather withering style by Herbert Roitblat (article), who also provides a historical perspective on reward-based explanations going back to B. F. Skinner. Skinner copped the People's Elbow from Noam Chomsky and the fight was stopped immediately.
As the authors are keenly aware, "Reward is Enough" sails dangerously close to that other well-known thesis normally attributed to Charles Darwin. Is evolution enough? The authors suggest that because crossover and mutation aren't the only mechanisms, this is substantially different. Okay.
It is certainly true that rewards help in many places - I'll get back to Economics 101 in a moment. However, for me, the least persuasive section of the paper is the authors' somewhat glib dismissal of other things that might be "enough". For instance, they reject the idea that prediction is enough on the grounds that alongside supervised learning it will not provide a principle for action selection and therefore cannot be enough for goal-oriented intelligence.
What about action-conditional prediction? What about the conditional prediction of value functions? Is that considered so different from prediction? This feels too much like a game with strange rules and rewards: an imagined competition in which single-word explanations are to be set against each other. We might as well typeset them on Malcolm Gladwell-style book covers. (That generator used to exist, by the way, but maybe the rewards for maintaining the site were insufficient). But continuing:
Optimization is a generic mathematical formalism that may maximise any signal, including cumulative reward, but does not specify how an agent interacts with its environment.
Except that "mere optimization" certainly does specify how an agent interacts, or can. Anyone with the creativity to design ingenious optimization algorithms that work in high dimensional spaces miraculously well and beat benchmarks is certainly capable of making the relatively trivial conceptual step - marrying this to hyper-parameters in some model-rich approach. I suppose, however, that the argument is that there is no model-rich approach in the vicinity.
Still, rather than elevate one area of study over another, I'd be inspired by the theory of computational complexity. This has taught us that many seemingly different problems are equally hard, and equally general. Solve one and you might well solve them all.
That said, it is legitimate for the authors to point out the distinction, and chief advantage, of reinforcement learning compared with various flavours of control theory. RL attempts a shortcut to intelligent behaviour that can sometimes avoid the self-imposed limitations of an incorrect, or inconvenient, model of reality.
Avoiding a normative model of reality is a line many of us can be sympathetic to - and not just in real-time decision making. In my case, I recall noodling on ways to avoid models in a derivative pricing setting. (I'm not sure it came to much, but unlike the Malcolm Gladwell book generator, at least the page still exists). I'm guessing many modelers have grown frustrated at their own inability to mimic nature's generative model over the years. That is, after all, the genesis of the machine learning revolution and Breiman's "second culture of statistics". Enter reinforcement learning:
By contrast, the reinforcement learning problem includes interaction at its heart: actions are optimised to maximise reward, those actions in turn determine the observations received from the environment, which themselves inform the optimisation process; furthermore optimisation occurs online in real-time while the environment continues to tick.
I can't say I've ever completely bought into this taxonomy. Working backward in the above paragraph, there are plenty of things that perform optimization incrementally in real-time. I was just working on one here but I wouldn't call it RL - maybe Darwinian. Nor would I call it reward-based even though yes, algorithms are rewarded for having a lower error. The above passage makes it sound like most of the novelty in RL springs from online versus batch computation. I'm a huge fan of the former but it's pretty old stuff.
The problem is, that the more you introduce specific tricks into RL for creating and predicting reward functions or advantage functions, the more it starts to look like there might be other holy grails, like online optimization, or conditional prediction (of intermediate rewards, sure).
Or maybe it's even simpler. If you predict for me how long it will take to travel through the Lincoln Tunnel versus an alternative, I can certainly make a decision. It seems that here prediction is the real open-ended task, not the "reinforcement logic". Perhaps a discussion of temporal difference learning, and other specific devices, might help make the authors' case crisper. Otherwise, Captain Obvious from the Hotels.com commercials enters stage left and declares "prediction is enough!"
The relative benefits of RL - which I loosely interpret as the benefits of "letting go of the model" - are not really the subject of this paper, as far as I can see. We are to presume RL will get better, but so will things like optimization, prediction, and even differentiation (also all you need, some would argue). Perhaps the reasonable thing to do is not opine on the veracity of the authors' claims but instead establish ongoing benchmarks that help us get at the heart of the matter.
Here it doesn't really help to talk about situations where infinite data is available and, by a sheer fluke, I can present two ongoing benchmarks where sample efficiency matters.
Yes, yes I know we are talking about the origin of species here and the lofty goal of artificial intelligence. But hear me out. The paper is suggesting that a single style of attack can work in many places. It is sort of like suggesting that an engineering problem takes on a fractal nature - that a car can be constructed with careful re-use of a special kind of lego block (probably true).
But as noted, if you play with words too much, everything is central to everything. Optimization is fairly central to everything, I'm sure, including reinforcement learning. If reinforcement learning cannot make a big impact on optimization, then the former might not be a higher idea. The failure of RL to improve optimization would not necessarily refute the thesis that "reward is enough" admittedly ... but it might make us suspicious.
I'm sure many people have had the idea of applying RL to learn how to search. Some papers include Learning to Optimize: A Primer and Benchmark by Chen et al. Those authors also provide a repository, Open-L2O, intended for benchmarking of learning to optimize approaches. The idea has been pursued for some time by Li and Malik (see Learning to Optimize and subsequent papers).
Now, where do you draw the line? Is an approach that learns parameters or meta-rules to be considered a result of reinforcement learning? For example, the nevergrad library improves when contributors study the efficacy of different types of approaches on different problems. They don't necessarily have a probabilistic model over that space in mind. But I think the "reward is enough" thesis is suggesting more explicit RL. And this raises the obvious question - why hasn't RL smashed derivative-free optimization to smithereens, given that the idea has been out there for at least five years?
At the time of writing, here are some top-performing global optimizer routines, according to rankings that I maintain. There are many flaws with my methodology, I am sure (you can critique the code should you wish), but setting that aside, I haven't seen a lot from the learning-to-learn category. Instead, I see quadratic approximation doing well. There's a cute approach in the dlib library that incrementally estimates bounds on the function variation as it proceeds. Of course, there are also Gaussian process methods in the mix, which are also competitive if a little more computationally burdensome.
Then there is the nevergrad approach I mentioned. If I were to include the next twenty or thirty, it would include more from this same family, and also various implementations of surrogate optimization and evolutionary approaches (CMA-ES in particular, and more). What it doesn't, as yet, include are techniques I would describe as RL-based.
Given my visibility into the relative success of these algorithms, I suppose I could reverse engineer an algorithm that learns to optimize pretty easily - it could look like an RL-based optimizer. But that seems cynical and I'm not sure what the "reward is enough" thesis is buying us here, really. It will be interesting to see how this leaderboard changes over time, and whether I eat my words. It may well be the case that RL is the singular idea that has been missing - unless you say that the leaderboard itself constitutes RL.
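To make the idea of an optimizer leaderboard concrete, here is a minimal sketch of comparing derivative-free routines under a fixed evaluation budget. It is an illustration only and not the ranking methodology referred to above: the toy objective, the two optimizers (uniform random search and a simple (1+1) evolution strategy), and every function name are assumptions chosen to keep the example self-contained.

```python
import random

def sphere_shifted(x):
    """Toy objective (an assumption): quadratic bowl with minimum 0 at (0.3, ..., 0.3)."""
    return sum((xi - 0.3) ** 2 for xi in x)

def random_search(f, dim, budget, rng):
    """Baseline: sample uniformly in [0, 1]^dim and keep the best value seen."""
    best = float("inf")
    for _ in range(budget):
        x = [rng.random() for _ in range(dim)]
        best = min(best, f(x))
    return best

def one_plus_one_es(f, dim, budget, rng, sigma=0.3):
    """(1+1) evolution strategy with a simple step-size rule:
    grow sigma after a success, shrink it after a failure."""
    x = [rng.random() for _ in range(dim)]
    fx = f(x)
    for _ in range(budget - 1):  # one evaluation was already spent on the start point
        y = [min(1.0, max(0.0, xi + rng.gauss(0.0, sigma))) for xi in x]
        fy = f(y)
        if fy < fx:
            x, fx = y, fy
            sigma *= 1.5
        else:
            sigma *= 0.9
    return fx

def leaderboard(optimizers, f, dim=5, budget=200, trials=20, seed=0):
    """Rank optimizers by mean best value found under the same evaluation budget."""
    scores = {}
    for name, opt in optimizers.items():
        rng = random.Random(seed)  # same seed for each optimizer, for repeatability
        results = [opt(f, dim, budget, rng) for _ in range(trials)]
        scores[name] = sum(results) / trials
    return sorted(scores.items(), key=lambda kv: kv[1])

ranking = leaderboard(
    {"random_search": random_search, "one_plus_one_es": one_plus_one_es},
    sphere_shifted,
)
for name, score in ranking:
    print(f"{name}: {score:.5f}")
```

An RL-based or learned optimizer would enter such a comparison as just another entry in the `optimizers` dictionary, judged on the same budget as everything else.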
A second test for the everything-is-RL thesis is time-series prediction. This seems to be a quintessential task where classic "model rich" approaches like ARIMA can be juxtaposed against more model-free RL-style contributions. For example, self-play between a time-series generator and a discriminator might yield novel forecasting capability.
Once again I hear the objections. This is a toy problem, you say. This has nothing to do with artificial intelligence in the grand scheme of things. And yet prediction using a relatively small number of data points (say a few hundred) is surely a key building block for other capabilities.
Indeed, to make the case that time-series prediction is central, let me remind you that artificial (cough) intelligence must surely be enhanced by time-series prediction if only because most reward functions, value functions, and various other quantities are both temporal and predicted (or could be). Many intelligent systems can be framed as the prediction of action-conditional value functions.
Often they are explicit decisions about the expected return, quantified in some way, which moves forward in time informed by its own past values as well as being influenced by exogenous factors. I certainly wouldn't suggest that time-series prediction is universal in any sense (i.e. that cracking it also solves other problems), but I would ask why an agent implicitly or explicitly predicting intermediate quantities would not be interested in the time-series of its internal model residuals.
It doesn't matter a great deal what intermediate reward is being optimized. Perhaps chess position evaluation isn't the greatest example, but cricket position gets us there (will it rain?) and business value functions can also suggest the same. For instance, the optimization of a transport company is going to involve explicit prediction (I dare say) for some time. Value functions elsewhere might look like inventory adjusted profit and loss, or some other accounting measure of intermediate "success" like Uber fares minus running costs minus poor location. I simply don't understand how the wielding of these intermediate rewards for decision-making is not assisted by time-series prediction.
As an aside, sometimes these value functions are internally consistent or strive to be motivated by the Bellman equation. The distinction between Bellman-inspired RL and control theory is a discussion for another day. Either way, I think my characterization of intelligence as explicit or implicit value function prediction is somewhat tribe neutral.
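For readers who have not met it, one common statement of the Bellman optimality equation is given below. The notation is an assumption introduced here for illustration and is not taken from the paper: V^* is the optimal value function, s and a are state and action, r is reward, and gamma is a discount factor between 0 and 1.

```latex
V^*(s) = \max_{a} \, \mathbb{E}\left[\, r_{t+1} + \gamma \, V^*(s_{t+1}) \;\middle|\; s_t = s,\ a_t = a \,\right]
```

Read this way, the value of a state is itself a one-step-ahead conditional prediction, which is the sense in which value functions and forecasting overlap.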
The suggestion is that animals evolve intelligence in part by using the prediction of intermediate quantities. It stands to reason that short-term forecasting (of value functions) is a life and death skill too - even if it is mere bias correction. How could an intelligent actor in a complex environment not be good at this? What about the converse? Would it not be mildly bizarre if the key trick the actor uses to evolve all manner of intelligent behaviour (RL) fails to assist with the sub-task (prediction) itself?
So, as with optimization of derivative-free objective functions, it seems quite reasonable to ask whether RL is helping out. Certainly, RL has been used for forecasting and here's your tutorial on predicting stock prices using RL just in case you didn't already have a clear path to riches. There's a review by the Journal of Quantitative Finance (article) and there is a lot of activity in this area.
Yet the skeptic in me asks the same old question - where are the benchmark-beating algorithms? More optimistically, when will we see them? When will the reward be enough for us to dispense with some workhorse ARIMA models in the statsmodels.tsa package, or fast ensembles, or TBATS?
I'll spare you the next 100 entries in that particular leaderboard but for now, there aren't any RL-based approaches that are bounding up it. There's more support for the theory "ensembling is enough". I'll be as happy as anyone when that changes (pull requests are welcome). Perhaps I'll need to increase the prize money on offer because at least at the moment, the rewards are not enough.
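The kind of walk-forward comparison behind such a leaderboard can be sketched as follows. This is an illustration under stated assumptions, not the actual leaderboard code: the synthetic AR(1) series, the three toy forecasters, the equal-weight ensemble, and all function names are hypothetical.

```python
import random

def last_value(history):
    """Naive forecast: the next value equals the current one."""
    return history[-1]

def moving_average(history, window=5):
    """Mean of the most recent observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def drift(history):
    """Last value plus the average historical step."""
    if len(history) < 2:
        return history[-1]
    return history[-1] + (history[-1] - history[0]) / (len(history) - 1)

def ensemble(history):
    """Equal-weight average of the three forecasters above."""
    return (last_value(history) + moving_average(history) + drift(history)) / 3.0

def mean_squared_error(forecaster, series, warmup=20):
    """Walk forward through the series, scoring one-step-ahead forecasts."""
    errors = [(forecaster(series[:t]) - series[t]) ** 2
              for t in range(warmup, len(series))]
    return sum(errors) / len(errors)

# Synthetic data (an assumption): a noisy mean-reverting AR(1) process of 300 points.
rng = random.Random(42)
series = [0.0]
for _ in range(299):
    series.append(0.8 * series[-1] + rng.gauss(0.0, 1.0))

forecasters = {
    "last_value": last_value,
    "moving_average": moving_average,
    "drift": drift,
    "ensemble": ensemble,
}
scores = {name: mean_squared_error(f, series) for name, f in forecasters.items()}
for name, mse in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: {mse:.3f}")
```

Because squared error is convex, the equal-weight ensemble can never score worse than the average of its members' mean squared errors, which is one modest reason "ensembling is enough" is hard to dislodge. An RL-style forecaster would simply be another function from history to a one-step-ahead forecast.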
I move on to a more high-level response to the paper, conscious that this might be stretching my knowledge.
I ask, when is RL's implicit prediction (if that is a fair characterization) more efficient than alternative styles of navigation? If it is more efficient, that isn't a well-developed part of the argument - or at least self-contained in the paper. I would have liked to see the authors draw more on their experience when it comes to prodding us about model-rich versus model-free approaches. Instead:
We do not offer any theoretical guarantee on the sample efficiency of reinforcement learning agents. Indeed, the rate at and degree to which abilities emerge will depend upon the specific environment, learning algorithm, and inductive biases; furthermore one may construct artificial environments in which learning will fail.
Much has been written on the topic. It can be very difficult to design good rewards, for one thing. But supposing we could ...
Instead, we conjecture that powerful reinforcement learning agents, when placed in complex environments, will in practice give rise to sophisticated expressions of intelligence. If this conjecture is correct, it offers a complete pathway towards the implementation of artificial general intelligence.
A complete pathway sounds pretty great, but what really is being suggested here? I think this is proposing that model-free agents are enough - because without that distinction I'm not sure what's new in the idea that self-training algorithms that assimilate data get smarter over time.
Of course, of course, AlphaZero is brought to the case and I do not want to be seen as pooh-poohing chess-playing accomplishments or other RL wins that are damn impressive (I am a former chess player - and chess program dabbler - I get it). However, that problem is trivial, in some statistical sense, as there is infinite data to estimate a value function from self-play (easy to say, I know). For me, sample efficiency is pretty darn important. It rather feels like it is the problem of statistical decision-making under uncertainty.
For much of the discussion, the authors make a strong case for why AlphaZero-style accomplishments (or even protein folding) are not relevant, as far as it goes for agents traversing real-world problems with few data points to lean on. The examples cited do nothing to convince skeptics of the sample efficiency of reinforcement learning, which the authors might be right to be defensive about.
This brings us back to semantics. In many practical settings, a hyper-parameter optimization of a model predictive control system can work pretty well with sparse data. Does this mean that optimization is enough? Does it mean that hyper-parameters are enough? Or does it mean that English words aren't enough and we should just let algorithms talk to each other from this day forward? At least they understand precise communication (proper scoring rules, and the like).
It isn't that model-free approaches weren't considered in many existing decision-making settings. I come from finance. I now work in asset management. We have theories built around model-free portfolio optimization, and one was introduced by the late Thomas M. Cover. These are clever, intriguing, and a little more ML-like due to regret bounds and so forth. But at the same time, I've always been suspicious. I don't know of any firm that actually uses universal portfolio theory, as it is termed.
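For those who have not seen it, the flavour of Cover's universal portfolio can be conveyed with a two-asset sketch. This is a discretised illustration under stated assumptions rather than a faithful implementation: the grid over constant-rebalanced portfolios (CRPs), the uniform prior over that grid, the function name, and the toy price data are all choices made here for brevity.

```python
def universal_portfolio(price_relatives, grid_size=101):
    """Discretised two-asset universal portfolio.

    price_relatives: list of (x1, x2) pairs, each the ratio of an asset's
    end-of-period price to its start-of-period price.
    Returns the final wealth of the universal strategy and the final wealth
    of every constant-rebalanced portfolio (CRP) on the grid.
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]  # fraction held in asset 1
    crp_wealth = [1.0] * grid_size
    wealth = 1.0
    for x1, x2 in price_relatives:
        # Allocation: wealth-weighted average of the CRP mixes (uniform prior on the grid).
        total = sum(crp_wealth)
        b = sum(w * g for w, g in zip(crp_wealth, grid)) / total
        wealth *= b * x1 + (1.0 - b) * x2
        # Each CRP rebalances back to its fixed mix every period.
        crp_wealth = [w * (g * x1 + (1.0 - g) * x2) for w, g in zip(crp_wealth, grid)]
    return wealth, crp_wealth

# Hypothetical data: asset 1 is cash, asset 2 alternately doubles and halves.
relatives = [(1.0, 2.0), (1.0, 0.5)] * 10
wealth, crp_wealth = universal_portfolio(relatives)
print(f"universal: {wealth:.3f}")
print(f"best CRP on grid: {max(crp_wealth):.3f}")
print(f"all in asset 2: {crp_wealth[0]:.3f}")
```

No return distribution is estimated anywhere: the strategy only tracks how each fixed mix would have fared. Its wealth works out to the average of the CRP wealths on the grid, which is why it trails the best mix in hindsight while still needing no probabilistic model.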
Maybe ask why? Maybe ... it's because when real dough is on the line the notion that you want to pretend you don't know anything probabilistic seems like a stretch? The Bayesians - and everyone is a little bit Bayesian - might also have a problem with the general slant of model-freeness. Isn't there a continuum, with diffuse priors on one end? My questions to the authors would be:
Is it ironic that someone who arranges data science contests worries that reward might not be enough? I hope I don't come across as dismissive of the thesis. On the contrary, I'm in the process of finishing a book that is almost entirely aligned with the "reward is enough" promise. However, I argued that reward might be sufficient only in an AI sub-world we invent - one that can nonetheless help real-world applications.
I believe that the future of intelligent systems looks like a billion little economic agents driving algorithms from one place to another, navigating for themselves and eking out their existence in the harshest of economic environments. That future isn't dominated by DeepMind or any other company - they get priced out. Nor is it dominated by any singular approach. But it is otherwise a happy place for RL researchers and their brain-children.
However, what troubles me about the "reward is enough" paper is that it presents what is at heart an economic argument without considering the standard objections. Adam Smith presented the "reward is enough" argument, as it applies to the intelligence of a system, but Ronald Coase countered by noting that private firms exist (The Nature of the Firm).
The issue: economic friction prevents reward from being sufficient in the collective sense. Every human interaction isn't a trade. We form other types of collaborations, follow protocols, observe mores, and submit to governing structures of various kinds. Reward can be enough in the thermodynamic limit where economic friction is zero - but we ain't there yet. Lowering the temperature is the central task, at least if you think reward is everything.
Let me elaborate. Section 3.1 suggests that reward is enough for knowledge and learning. But what kind of knowledge do we mean? Reward is a powerful tool for knowledge orchestration. That is the topic of Hayek's famous essay, The Use of Knowledge in Society. But even this powerful mechanism is severely hampered by economic frictions, including search cost, asymmetric information, contract overhead, and so forth.
I think Hayek's discussion (text) can bolster Section 3.1 of Reward is Enough but also suggest a very different path. Hayek isolates the key challenge. This is not the accumulation of knowledge by an individual agent running an internal RL algorithm as it bumps into walls. Actually, the central problem is the use of disparate knowledge held by disparate self-interested agents (human or artificial). The price mechanism allows them to form a kind of super-computer, even though they perform only local optimizations, such as switching from one supplier of a machine learning feature to another.
That's where intelligence might emerge. The main thing is figuring out how to overcome the limitations of any one mind - either human or artificial. So I would steer towards a slightly different ambition. The honest trailer title is: "Reward is enough ... to coordinate efforts on a subset of problems where there is so much data that machine learning has a chance, so let's get rid of trade frictions and let the algorithms solve the easy half of statistics."
Ah, you say, but reducing economic friction is like honing "social skills" - just another RL task. I hope you are right. That's why I violently agree and violently disagree with the authors' position. Reward might not produce intelligence because individual intelligence might just be too hard. But reward is more likely to produce collective intelligence - just not at the present level of friction that exists between algorithms, authors, sources of data, applications, and people. Hoping for purely market-inspired orchestration is too much, until we create a system where the costs for algorithms of finding, entering and maintaining economic relationships with other algorithms are dramatically lower.
Indeed it is ironic for DeepMind, as a large private organization of researchers, to suggest that "reward is enough" - that suggests there are no major benefits of the private firm. Those hierarchical constructs are the major competitor to the "reward is enough" principle in the constellation of orchestration mechanisms (such as loose associations, free trade, open-source communities, and so forth - see Superminds for a survey). If reward is all you need for collective intelligence, then we should be able to split up DeepMind into self-interested researchers whose only interaction is direct trade (or perhaps barter) in ideas and data. Does anyone see a Lemons Problem lurking in that proposal?
So, it comes down to your view on whether repeated conditional prediction (or "AI" if you must) is an inherently collective activity, or not. Will it be solved by an extremely powerful RL agent, or by a network of sort-of-clever agents drawn together by rewards alone? Either way, and I lean towards the latter as the most likely outcome, the size of the constants in the physical system matters a lot - the coefficients of static and kinetic trade frictions, if you will.
This brings me to a possible downside of the "reward is enough" mantra: cost. Getting back to the mild cynicism, the more mysterious the characterization of analytics is (and hype around the same) the more this serves the interest of DeepMind and their brethren, who wish to sell analytics to enterprise at high prices. Again, I don't fault researchers trying to get funding for blue-sky work - how else can they do it?
But maybe a better way to help the underserved consumer of intelligence (like small businesses) is the very opposite - an unbundling of prediction from other application logic. That might drive costs down dramatically since repeated prediction (of anything, even value functions) can be commoditized. So in this respect, there might be more social welfare in positioning "AI" as mundane - namely as conditional prediction, rather than winning a war with control theorists.
Cheap prediction can be ubiquitous. It's just another good. We need look no further than the global supply chain for consumer and industrial goods which, viewed as an intelligent system, is pretty impressive. Maybe it isn't everyone's idea of general intelligence, but it certainly attains goals most of the time (unless you are looking for an 11-speed chain during a pandemic).
We could use radically low-cost supply chains for repeated prediction to power AI applications. Therein, RL agents and their model-bearing competitors playing the role of value-adding firms can sort out these silly arguments for us - the system will determine who thrives. (The system can judge the wisdom, or otherwise, of deliberately eschewing vast tracts of applied mathematics because it isn't trendy - he mutters under his breath). And all you need is for reward to exceed friction. Trade need not be conventional. It can manifest as repeated statistical games, where friction is essentially zero.
I emphasize again that this poor man's collective intelligence (to some) is even more modest because it will only work on the domain of problems where data-hungry methods excel (I make no other claim). On that domain - more or less the same one where most things we term machine learning will work too - automated assessment of contribution is feasible, so autonomous formation of meaningful economic relationships between algorithms is also feasible. Thus friction, arising in large part from the inability to assess the quality of predictions provided by others, falls towards zero. Agents can use RL or whatever they like to optimize their set of relationships.
So yes, RL is cool. Yes, we need to power the world of intelligent applications with an algorithm-first environment. We need algorithms to be able to traverse our problems - because human chaperones are too expensive. We need them to be economically aware and self-navigating. We need algorithms to engage in micro-commerce with each other. We need to open up the possibility of collective, impressive intelligence - by whatever name.
But they can't really exist in our world yet, and won't for some time. Don't kid yourself. They should move on a substratum that accommodates their lack of generalized ability. I suggest we focus more on this minimal infrastructure these algorithms will need for communication and meaningful, reward-driven interaction - and less on which species of approach, or school of thought, or statistical terminology, will dominate when it comes to design of each individual agent.
For it is the system that is truly clever. An economy might be as good a reinforcement learning agent as anyone has devised. All we really require is that the reward for trade be substantially larger than the cost of entering into trade. Then, and only then, reward will be enough.
I work for Intech Investments. I stir up trouble on LinkedIn on occasion, in the hope that people might contribute to open source code serving the broad goal of collective artificial intelligence. I hope that helps you reverse engineer my intermediate rewards. You can predict my bias using this free prediction API.