I think that it’s worth having a simple picture of my own brain; this post presents my working model, which I’ve found extremely useful.
I. The model
The monkey is a fast subconscious process, something like an RNN running at 100 Hz with a billion-dimensional hidden state. From a computational perspective, the monkey is where the action is.
The monkey doesn’t think in words. It gets an input, and reflexively produces an output. We have no direct introspective access to anything happening inside the monkey.
The monkey is trained by something like reinforcement learning, with a really complicated reward signal derived from physical sensations and a thousand interacting heuristics (drives like curiosity and play and grief and fear and lust).
Its training involves model-free RL (trying things and seeing what works); it involves model-based RL (learning to predict and using predictions to learn from imagined data); it involves fitting a value function and temporal difference learning; it probably involves third-person imitation learning and hierarchical RL and other techniques we haven’t considered. These algorithms change how the monkey forms new reflexes, but they don’t involve anything that we’d normally think of as “reasoning” inside of the monkey.
I don’t think we should expect to have a detailed model of how this training works. The upshot is that the monkey executes a set of reflexes trained to maximize a complex reward function, which was in turn tuned by evolution to maximize reproductive fitness.
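The temporal-difference piece of this picture can be made concrete with a toy example. The sketch below is my own minimal illustration, not anything from the post: a tabular TD(0) learner on an invented five-state chain with a reward at one end. Every name and constant in it is made up for the example.

```python
import random

def td0_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.9, seed=0):
    """Learn state values on a chain 0..n-1 with reward 1 at the right end.

    Toy illustration of one training mechanism the post attributes to the
    monkey; the environment and hyperparameters are invented.
    """
    rng = random.Random(seed)
    V = [0.0] * n_states
    for _ in range(episodes):
        s = 0
        while s < n_states - 1:
            # Mostly step right, occasionally slip back left.
            s_next = s + 1 if rng.random() < 0.9 else max(s - 1, 0)
            r = 1.0 if s_next == n_states - 1 else 0.0
            # TD(0) update: nudge V[s] toward the bootstrapped target r + gamma * V[s'].
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V

values = td0_chain()
# States closer to the rewarding end accumulate higher learned value.
assert values[0] < values[-2]
```

Nothing that looks like explicit reasoning appears anywhere in the loop; the “knowledge” that the right end is good exists only as a gradient of learned values.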
The machine consists of thousands of levers that the monkey can pull and displays that the monkey can see. This is how the monkey interacts with the world.
- Move muscles: speak or manipulate objects or look or walk.
- Imagine sounds or pictures or feelings.
- Write or read data from long- and short-term memory. This may involve complex access mechanisms (e.g. associative memory).
- Control where attention is focused.
The monkey’s displays show things like:
- What has recently been seen/heard/felt/imagined.
- The results of recent memory retrieval.
- More detailed information about the current foci of attention.
The deliberator is a process the monkey implements by using the machine, part of a strategy for deciding what to do and managing relations with other humans. The monkey pulls levers in order to execute the steps of explicit reasoning, and to construct and share narratives. Our thoughts and words are the outputs and intermediate steps of this strategy.
These thoughts do not directly control our behavior any more than a parent’s requests control our behavior. Our thoughts are visible to the monkey on one of its displays, just like the words we hear or the objects we see. The monkey may learn reflexes that connect certain thoughts to certain actions.
Feel free to skip this section once you get the idea.
Learning to count
If I teach a child to count to 3, they can do it by reflex: the monkey can directly learn to observe three objects on a display, and to pull the levers which utter “three.”
If I teach a child to count to 17, this is not what will happen. It’s conceivable that the monkey can learn to recognize 17 objects, but it would take a great deal of training and would probably be limited to a narrow domain. (This is what it would feel like to look at an image and to simply know that it contained 17 teacups.)
Instead, the monkey will learn to implement the strategy of counting.
It can learn reflexively that “five” comes after “four” and that “six” comes after “five,” simply by hearing the sequences enough times.
It can watch a teacher point to objects one at a time, saying the words in sequence. It will experiment with copying the behavior, and learn to do so reliably if it earns approval or achieves some other goal.
It can make the mistake of counting the same object twice, have that mistake corrected, and learn to skip objects it has already pointed to.
Eventually it can learn to count with attention rather than pointing, to imagine the words rather than to speak them, to attend carefully to “which things I’ve counted,” to strategically pick the next object, and many other subtleties.
In the end, the monkey has learned not one thing but dozens of small reflexes. From the inside we hear the words “one, two, three…” and perceive our attention falling on each object in turn. We have no awareness of the reflexes that have been learned, except insofar as we can observe their consequences. (I’m going to save a post about consciousness for another day.)
Of course these reflexes are not always executed: we do not always say “three” when we hear “two,” just as we do not always grasp a handle when we see one. Once the strategy is available the monkey can experiment with it. Eventually it learns the conditions under which to take each step—at first when the teacher asks, then in other situations where it proved helpful, and at last whenever it perceives the desire to know “how many?”
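As a toy illustration of what “dozens of small reflexes” might compose into, here is a hypothetical sketch of the counting strategy: one reflex for the successor of each numeral, one for skipping already-counted objects, wired together into a loop. All of the names and the data format are invented for the example.

```python
# Reflex: "five" comes after "four," learned as a simple association.
SUCCESSOR = {"one": "two", "two": "three", "three": "four", "four": "five"}

def next_numeral(word):
    return SUCCESSOR[word]

def count(objects):
    """Strategy: attend to each uncounted object, uttering the next numeral."""
    counted = set()          # the learned attention to "which things I've counted"
    word = None
    for obj in objects:      # attention falls on each object in turn
        if obj in counted:   # the corrected mistake: skip objects already counted
            continue
        counted.add(obj)
        word = "one" if word is None else next_numeral(word)
    return word

assert count(["cup", "spoon", "cup", "plate"]) == "three"
```

From the inside, running this strategy just feels like hearing “one, two, three…” while attention moves; none of the individual reflexes is visible.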
Learning to reason
If I teach a child to answer questions like “The apple is on top of the box. What is on top of the box?” the child can learn to answer nearly-reflexively.
If I teach a child to answer more complex questions, like “The apple is on top of the box. The box is on the shelf. Is the apple on the shelf?” then the monkey must make more extensive use of its machinery.
For example, it may learn to pull the levers which bring parts of the sentence into focus one after another, and then to pull the levers which adjust the imagined scenery appropriately. Or it may learn a strategy which explicitly manipulates verbal information about the apple until it arrives at an answer.
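The second, verbal strategy can be caricatured in a few lines. This is my own toy sketch (the `is_on` helper and the fact format are invented, and it assumes no cyclic support relations): it explicitly follows the stored “on” facts one step at a time, the way the levers bring one clause into focus after another.

```python
def is_on(facts, a, b):
    """Does a rest on b, directly or through a chain of supports?

    Invented illustration of explicit verbal manipulation; assumes the
    support relation has no cycles.
    """
    supports = dict(facts)        # e.g. {"apple": "box", "box": "shelf"}
    x = a
    while x in supports:          # follow the chain one step at a time,
        x = supports[x]           # like attending to each clause in turn
        if x == b:
            return True
    return False

facts = [("apple", "box"), ("box", "shelf")]
assert is_on(facts, "apple", "shelf")        # multi-step inference
assert not is_on(facts, "shelf", "apple")
```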
Over a large number of interactions the monkey will learn a wide range of strategies (probably in parallel with learning language, at least for modern humans). The monkey will experiment with different strategies, and discover which lead to correct answers. It will build intuitive models of its own mental workspace, and use those models to more quickly learn sensible policies. It will learn some kinds of thinking by imitating language or physical manipulations from others and translating them to internal operations.
There is no notion of “correct” reasoning baked into this learning process; “correctness” emerges entirely from the strategies learned by the monkey, which are themselves optimized for the various rewards supplied by evolution. Among those rewards are probably drives like the desire to make correct predictions and to explore “interesting” physical and cognitive situations.
(I’m certainly not confident of that; it’s plausible that our RL algorithms are in fact good enough to produce these drives where appropriate rather than taking them as given. Of course it’s most likely that the monkey doesn’t really fit the abstraction of RL, such that the truth is messy and there is no notion of “real” rewards.)
Suppose that one day I do some reasoning and conclude “I ought to more often greet people by name.”
As soon as I think it, this thought changes my brain. Not by influencing the monkey’s policy, which happens relatively slowly over the course of many decisions. Instead, the immediate effect is to change my memories. The thought is immediately available in short term memory, and depending on the situation the monkey may write it to various forms of longer term memory.
There is no inherent mechanism by which forming these memories will lead to me greeting people by name. All that will happen differently is that certain memory retrieval operations will return a different result.
But sometimes the monkey may pull a lever to access some memories about what it ought to do, and what lever it pulls next may depend on the result of that retrieval. And so one day I may in fact recall “I ought to greet people by name” and thereby do it.
I might facilitate this process by deciding to store different memories. It is rare for the monkey to explicitly ask “what ought I do?” But when it encounters a concrete situation, it does often look up any memories that are extremely similar to that situation, since that reflex often proves useful. So by concretely imagining a situation and then storing the imagined situation in memory, I can improve the probability that the monkey accesses a relevant memory at the time when it is needed (and then hopefully the monkey will respond by greeting someone by name).
Alternatively, the deliberator could accelerate this process by training the reflex of more explicitly asking questions like “is there anything I should remember about this situation?” Of course this would have to be done in the same way, but once installed such a reflex could quickly translate abstract intentions into information that could be acted on at the time.
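The memory trick described above can be sketched as a tiny associative store. This is an invented illustration, not a claim about actual memory mechanisms: intentions are keyed by concretely imagined situation features, and retrieval returns the intention whose stored cue best overlaps the current situation.

```python
memory = {}

def store(cue, intention):
    """Write an intention to memory, keyed by a set of situation features."""
    memory[frozenset(cue)] = intention

def retrieve(situation):
    """Return the intention whose stored cue best overlaps the situation."""
    best, best_overlap = None, 0
    for cue, intention in memory.items():
        overlap = len(cue & situation)
        if overlap > best_overlap:
            best, best_overlap = intention, overlap
    return best

# Store the concretely imagined situation rather than the abstract rule,
# so ordinary cue-matching retrieval can find it.
store({"hallway", "coworker", "morning"}, "greet them by name")

assert retrieve({"hallway", "coworker", "coffee"}) == "greet them by name"
assert retrieve({"desk", "email"}) is None
```

The abstract intention alone would only be found by the rare explicit query “what ought I do?”; keying it to a concrete situation lets the common similarity-lookup reflex surface it at the right moment.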
Whether or not we actually end up with any of these reflexes depends in large part on whether they pay rent. The question is always: (a) is the behavior helpful for the RL task, and (b) does the monkey have enough training data / capacity to learn it.
Aliefs and beliefs
There are different senses in which I can “believe” that eggs crack when dropped:
- The monkey’s reflexes can avoid dropping eggs.
- The monkey’s model can predict that eggs crack when dropped.
- I might have a memory of the thought “Eggs crack if dropped.”
- I might have a memory of the thought that I ought to avoid dropping eggs.
- The monkey can reflexively respond to questions like “Do eggs crack if dropped?” with “yes.”
The monkey’s training procedure tends to bring #1 and #2 into harmony, by updating from real or imagined experiences. For example, if the monkey’s model predicts the egg breaking when dropped, and the monkey’s value function rates that outcome poorly, then over the course of many internal rollouts the monkey will revise any reflex that leads to dropping eggs.
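That harmonization process can be caricatured as follows. In this invented sketch, `model` plays the monkey’s predictive model, `value` its value function, and repeated imagined rollouts nudge a reflex’s strength without any real-world experience.

```python
# Invented toy example: the monkey's model and value function.
model = {"drop_egg": "broken_egg", "place_egg": "intact_egg"}   # predictions
value = {"broken_egg": -1.0, "intact_egg": 1.0}                 # value function

# Initial reflex strengths, before any harmonization.
preference = {"drop_egg": 0.5, "place_egg": 0.5}

for _ in range(20):                       # internal rollouts; no real eggs harmed
    for action, outcome in model.items():
        # Nudge each reflex toward actions whose imagined outcome rates well.
        preference[action] += 0.1 * value[outcome]

assert preference["place_egg"] > preference["drop_egg"]
```

The reflex (#1) and the model (#2) end up agreeing, purely because rollouts through the model feed the value function’s verdict back into the reflex.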
Similarly, my explicit reasoning tends to bring #3, #4, and #5 into harmony. When I recall two facts that are in tension, I use reasoning to pin down the contradiction, and then record new memories about those facts. My reflexes about how to answer “do eggs crack if dropped?” are trained by comparing the results to explicit memory retrieval.
I map #1/#2 onto “aliefs” and #3/#4/#5 onto “beliefs.”
There is no mechanism that automatically brings aliefs and beliefs into harmony. In general the two don’t even have the same domain: we can have beliefs about things that our monkey knows nothing about directly (e.g. the orbits of the planets, category theory) and we can have aliefs about things that we have no explicit understanding of (e.g. aspects of body language we have never explicitly considered).
Aliefs are produced directly by the monkey’s training process, by fitting models to its experience with the world.
Beliefs are produced as actions taken by the monkey, as the result of a bunch of levers getting pulled. Beliefs can be shaped by explicit reasoning and can update in response to abstract forms of evidence that aren’t meaningful to the monkey. So in many cases beliefs will be a much better guide to the truth than aliefs, even when they are imperfect.
I can believe that my aliefs are accurate in a domain without “knowing” what they are. For example, I can defer to my intuitions about psychology even when I have no idea where they come from, if I’ve observed that my intuitions are usually right. Similarly, I can alieve that my beliefs are accurate in a domain. For example, the monkey can predict that a ball will land where my calculations say it will land, even though the calculation disagrees with intuition. The monkey will make this prediction if it keeps seeing the calculations be correct.
Cesires and desires
There are two different senses in which I might “want” to win a game:
- The monkey’s reflexes might be optimized to win it, and its value function might be higher if it wins.
- I might have a memory of the thought “I want to win the game” / I might reflexively answer “win the game” to the question “what do you want?”
I map #1 onto “cesire” and #2 onto “desire.” This is precisely analogous to the distinction between alief and belief. In the same way, cesires and desires need not be in harmony nor even have the same domain.
If you ask me to make a choice quickly, I am going to pick the option I cesire. And over the long run, I am going to use whatever cognitive processes result in getting what I cesire.
If you ask me “what do you want” and I’m inclined to answer honestly, I will tell you what I desire. Similarly, if I have to deliberate between two options, I am likely to choose the option which I desire. As a consequence my beliefs and deliberative processes may be distorted in order to better result in decisions that lead to what I cesire—after all, beliefs and deliberation are just the result of cognitive actions taken by the monkey in pursuit of cesires.
Cesires are produced directly by the monkey’s training process, by backwards chaining and fitting a value function. Desires are produced as actions taken by the monkey, as the result of a bunch of levers getting pulled.
The monkey may learn that generating desires is an effective way to achieve ends, and this will tend to bring our desires in harmony with our cesires. But that’s not a trivial learning problem, and the incentives aren’t perfectly aligned, since desiring X has effects beyond causing me to get X. So my desires typically diverge from my cesires.
There are two especially salient points of divergence:
- Cesires can only be propagated by the monkey’s training process, which rests on real or imagined associations between states of affairs. Desires can be propagated on the basis of explicit reasoning about cause and effect.
- Because humans aren’t good at lying, desires are also under pressure to look good to potential allies.
I can cesire to achieve what I desire: the monkey may alieve that when I believe “I want to win this game,” winning the game will result in fulfillment of my cesires. This can be generalized to the cesire to achieve what we desire. That generalization may have caveats and may apply with different force in different domains.
Similarly, I can desire to achieve what I cesire, and indeed many people have the intuition that their cesires are what they “really” want. (Many people also have the reverse intuition, that their desires are what they really want. By nature I’m more in the latter camp.)
“Weakness of will” and internal trust
Desires can be informed by explicit reasoning, and so from the monkey’s perspective it is useful to do what we desire. For example, I may cesire to take a marshmallow now, but desire to wait (so that I will get two marshmallows in a few minutes). If we pursue our desires, ultimately our cesires are satisfied. In this way, the cesire to do what we desire is strengthened.
I think this is a critical part of being an effective human in the modern world, and I think this internal trust is an extremely important resource. This trust can be exploited to achieve our desires at the expense of our cesires, and over time it can be eroded.
For example, suppose that Bob has a strong desire for the future to be full of flourishing. From the perspective of the monkey, events a century from now don’t even really exist. Nevertheless, Bob ends up with a desire to reduce the risk of human extinction, which flows backwards all the way to the desire to make progress on a particular project. And because Bob cesires to achieve what he happens to desire, he makes progress on that project.
But every day he spends pursuing desires that do not advance his cesires, the monkey’s reflexes toward fulfilling his desires weaken; he is burning this internal trust. If this project never satisfies his cesires, eventually he will no longer work on it.
Perhaps at first caveats are added, and Bob loses the cesire to fulfill certain classes of desires. At this point a normal human would let a project slide because it has become unappealing. Some people won’t stop there though, and will instead search for other mechanisms to overcome their “weakness of will,” to bring themselves to do what they believe themselves to want.
Bob may find ways to route around the caveats—he may reframe the situation such that he still cesires to do what he desires. He may set up social incentives, structure his work in a particular way, put up a motivational poster, cheer when he succeeds. But at the end of the day, unless the project is actually achieving what he cesires, the monkey will eventually learn to see through it (if only by trial and error), and each trick will eventually fail.
If Bob is especially determined, then he may end up much worse off than he began, having permanently undone many reflexes that led him to do what he desired. In the extreme case he can no longer even pass the marshmallow test or its various ubiquitous analogs.
No matter how “dumb” the monkey is, if it is unbiased then there is no free lunch. For a time we can do what we desire at the expense of what we cesire, but any cognitive policy that does so will eventually become unappealing.
The monkey and the deliberator are essentially two different optimization processes with different goals, and I think they both benefit from compromise. (I discussed this compromise in a previous post.)
What does such a compromise look like?
When I deliberate, by default I would pick the option that I desire more. This introduces significant incentives for the monkey to sabotage my deliberation in various ways, for example to shift my desires or to give me incorrect beliefs.
Instead, I can pick the option which I believe to be best according to a compromise between what I want, and what I cesire. I can also make an effort to improve my beliefs about my cesires, and to generally act as a trustworthy ally of the monkey.
If this is how I deliberate, then I am much more likely to wind up with the cesire to deliberate and do what I believe is best after deliberating, since by so doing I actually can better achieve what I cesire. At the same time, even though this deliberation is not aimed directly at achieving what I desire, having my heart in it is sufficiently valuable that I expect to get much more of what I desire.
My inclination is to put cesire and desire in symmetric bargaining positions, though I think that the “right” bargaining outcome requires moving beyond the gross simplifications in the monkey+machine model. But I think that even getting things roughly right captures a lot of the value.
In practice, I think the best compromise will probably involve doing good for the world in a way that is broadly recognized and respected, and which helps me become an awesome person who is able to achieve impressive things. The ability to have a large effect on the world is a major input both into satisfying my desires and cesires.
V. Artificial intelligence
Current state of deep learning
The monkey+machine model roughly matches my description of prosaic AGI.
On this picture, neural networks implement the monkey. Evaluating our ability to directly train neural networks to perform tasks that require explicit reasoning is probably missing the point. Instead, the key questions are (a) when will our models become expressive enough to implement the monkey, and (b) when will we have strategies that can actually train them?
(This discussion corresponds to training neural networks to operate computational machinery, though using a lot of RL rather than differentiable mechanisms, e.g. using mostly hard rather than soft attention.)
On (a): it is natural to suppose that perceptual and control tasks are roughly pushing the abilities of the monkey, since they are the chief tasks accomplished by animals during most of natural selection. So if our models can learn these tasks, we might expect that they can learn most of the monkey’s functions.
On (b): I don’t think we have seen much progress on training strategies recently, but it’s extremely hard to know how difficult the problem is. That is, evolution coughed up a whole bunch of innovations—great heuristic proxies for fitness, nice exploration strategies, model-based RL, probably third-person imitation and hierarchical RL, TD learning, etc. It’s not clear if human researchers will achieve the same feat by the usual scientific process, or by implementing better meta-learning, or something else. The main reason to think it might happen soon is that once our learning algorithms are good enough to implement the monkey, the benefits of good training strategies rise from “million dollar curiosity” to “trillion dollar keystone of the economy.”
On the flip side, I often look at what AI can learn (or will plausibly be able to learn) to get some information about what the monkey is capable of. By going back and forth between these two perspectives I think I have gotten better intuitions on both sides.
Inferring human values
I think this model partly explains my pessimism about inferring the “right” human values from human behavior.
The issue is not that there are two sets of values and it’s unclear how to compromise between them. The issue is much more fundamental: our beliefs and desires don’t even have the type signature of functions on possible worlds.
The correspondence between mental objects and facts about the world is a characteristic of the policy learned by the monkey, which will be incredibly complex and inconsistent. Even taking the correspondence as given, our beliefs and desires will themselves be inconsistent, since they are produced by a complex and imperfect strategic process.
The monkey does not have the kind of architecture where we can say that some of these beliefs are “mistakes.” We can imagine a version of ourselves where the monkey’s models were trained to convergence (on some distribution) and had infinite capacity. This essentially amounts to adopting the values used during the monkey’s training, but:
- Those values are not our values, they are at best our cesires.
- Probably the monkey behaves extremely strangely if you “train it to convergence,” since it was produced by evolution, which doesn’t care at all what happens in this weird regime. It would not be at all surprising if the monkey ended up catatonic.
- The only reason we can imagine the “perfect” version of the monkey is because we are abstracting out all of the details. If we open up that abstraction, we are going to have similar problems all over again. Probably the only logical endpoint is recovering something like reproductive fitness, which is even more clearly not our values.
We could instead talk about what would happen if we allow explicit deliberation to occur for much larger amounts of time, though now it feels like we are basically back to capability amplification.
Comprehensibility and AI control
If we think that humans and AI work like the monkey+machine, then seeking certain kinds of “principled” understanding feels particularly hopeless. There is not much reason for the monkey’s cognitive policy to be at all comprehensible, and nothing stopping us from building it before comprehending it.
It seems more promising to do analysis one level of abstraction up from the cognitive policies themselves: thinking about how the “values” of the monkey relate to its training procedure, and thinking about how the values of the deliberator relate to the values of the monkey. Moreover, to be successful this analysis should be agnostic to the concrete policy learned by the monkey, since that policy is likely to be completely incomprehensible.
My hope is to train the monkey by using supervision from the monkey+machine (which amounts to using the monkey+machine to generate supervision for the monkey), to kickstart the process with human supervision, and to hope that a benign monkey leads to a benign deliberator. The example of human cognition highlights one reason this task may be difficult: from the perspective of the monkey, the deliberator is not close to benign.
I think that many people’s thinking about thinking–whether humans or AIs–would be improved by imagining a fast RL agent (the “monkey”) operating slower cognitive machinery:
- I think self-improvement benefits from the basic perspectives of compromise/coordination/communication between the monkey and the deliberator.
- Thinking of giant RNNs/CNNs as analogous to the monkey, which must be coupled to external machinery in order to produce deliberation, gives a more accurate picture of the relationship between our cognition and existing AI.
- Imagining conscious deliberation as fundamental, rather than as a product of and input to the reflexes that actually drive behavior, seems likely to cause confusion.
Even if you don’t buy the model, you should probably accept that it is a way that brains could be. If you make an argument about brains and find that the conclusion fails for the monkey+machine, you should probably be able to point to the assumptions or observations that ruled out this model.