The monkey and the machine: a dual process theory

I think that it’s worth having a simple picture of my own brain; this post presents my working model, which I’ve found extremely useful.

I. The model

The monkey is a fast subconscious process, something like an RNN running at 100 Hz with a billion-dimensional hidden state. From a computational perspective, the monkey is where the action is.

The monkey doesn’t think in words. It gets an input, and reflexively produces an output. We have no direct introspective access to anything happening inside the monkey.

The monkey is trained by something like reinforcement learning, with a really complicated reward signal derived from physical sensations and a thousand interacting heuristics (drives like curiosity and play and grief and fear and lust).

Its training involves model-free RL (trying things and seeing what works); it involves model-based RL (learning to predict and using predictions to learn from imagined data); it involves fitting a value function and temporal difference learning; it probably involves third-person imitation learning and hierarchical RL and other techniques we haven’t considered. These algorithms change how the monkey forms new reflexes, but they don’t involve anything that we’d normally think of as “reasoning” inside of the monkey.
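
To make the flavor of these updates concrete, here is a toy temporal-difference update in the Q-learning style. The state names and constants are invented for illustration, and this is of course not a claim about how the brain actually stores reflexes:

```python
def td_q_update(Q, state, action, reward, next_state, actions,
                alpha=0.1, gamma=0.9):
    """One temporal-difference (Q-learning style) update to a table of
    reflex strengths. Q maps (state, action) pairs to values."""
    # Bootstrapped estimate of the value of the best follow-up reflex.
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    # Nudge the reflex toward observed reward plus discounted future value.
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(state, action)]
```

The point of the sketch is just that nothing resembling reasoning appears anywhere: reflex strengths get nudged toward observed rewards plus bootstrapped estimates of future value.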

I don’t think we should expect to have a detailed model of how this training works. The upshot is that the monkey executes a set of reflexes trained to maximize a complex reward function, which was in turn tuned by evolution to maximize reproductive fitness.

Detailed schematic. (Credit: Tara Tyler, via Shelly and Chad)

The machine consists of thousands of levers that the monkey can pull and displays that the monkey can see. This is how the monkey interacts with the world.

Levers can:

  • Move muscles: speak or manipulate objects or look or walk.
  • Imagine sounds or pictures or feelings.
  • Write or read data from long- and short-term memory. This may involve complex access mechanisms (e.g. associative memory).
  • Control where attention is focused.

Displays show:

  • What has recently been seen/heard/felt/imagined.
  • The results of recent memory retrieval.
  • More detailed information about the current foci of attention.
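
As a toy rendering of this interface (all names here are invented), the machine is just a table of callable levers and observable displays:

```python
class Machine:
    """Hypothetical lever/display interface between monkey and world."""

    def __init__(self):
        self.levers = {}    # name -> callable the monkey can invoke
        self.displays = {}  # name -> most recent observable result

    def add_lever(self, name, fn):
        self.levers[name] = fn

    def pull(self, name, *args):
        result = self.levers[name](*args)
        # Recent results (e.g. memory retrievals) appear on a display.
        self.displays[name] = result
        return result
```

On this picture the monkey’s policy is just a mapping from the current contents of the displays to the next lever pull.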

The deliberator is a process the monkey implements by using the machine, part of a strategy for deciding what to do and managing relations with other humans. The monkey pulls levers in order to execute the steps of explicit reasoning, and to construct and share narratives. Our thoughts and words are the outputs and intermediate steps of this strategy.

These thoughts do not directly control our behavior any more than a parent’s requests control our behavior. Our thoughts are visible to the monkey on one of its displays, just as the words we hear and the objects we see are. The monkey may learn reflexes that connect certain thoughts to certain actions.

II. Capabilities

Feel free to skip this section once you get the idea.

Learning to count

If I teach a child to count to 3, they can do it by reflex: the monkey can directly learn to observe three objects on a display, and to pull the levers which utter “three.”

If I teach a child to count to 17, this is not what will happen. It’s conceivable that the monkey can learn to recognize 17 objects, but it would take a great deal of training and would probably be limited to a narrow domain. (This is what it would feel like to look at an image and to simply know that it contained 17 teacups.)

Instead, the monkey will learn to implement the strategy of counting.

It can learn reflexively that “five” comes after “four” and that “six” comes after “five,” simply by hearing the sequences enough times.

It can watch a teacher point to objects one at a time, saying the words in sequence. It will experiment with copying the behavior, and learn to do so reliably if it earns approval or achieves some other goal.

It can make the mistake of counting the same object twice, have that mistake corrected, and learn to skip objects it has already pointed to.

Eventually it can learn to count with attention rather than pointing, to imagine the words rather than to speak them, to attend carefully to “which things I’ve counted,” to strategically pick the next object, and many other subtleties.

In the end, the monkey has learned not one thing but dozens of small reflexes. From the inside we hear the words “one, two, three…” and perceive our attention falling on each object in turn. We have no awareness of the reflexes that have been learned, except insofar as we can observe their consequences. (I’m going to save a post about consciousness for another day.)

Of course these reflexes are not always executed: we do not always say “three” when we hear “two,” just as we do not always grasp a handle when we see one. Once the strategy is available the monkey can experiment with it. Eventually it learns the conditions under which to take each step—at first when the teacher asks, then in other situations where it proved helpful, and at last whenever it perceives the desire to know “how many?”
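
A crude way to render the finished bundle in code (my framing, and obviously far simpler than anything the monkey actually learns): a successor reflex for number words, plus an attention reflex that skips objects already counted:

```python
# One reflex: "five" comes after "four," learned by hearing the sequence.
SUCCESSOR = {"one": "two", "two": "three", "three": "four",
             "four": "five", "five": "six"}

def count_objects(objects):
    """Chain small reflexes: attend to each object, skip what's counted,
    advance the number word. Returns the final word (the count)."""
    counted = set()
    word = None
    for obj in objects:        # attention falls on each object in turn
        if obj in counted:     # learned reflex: skip already-counted objects
            continue
        counted.add(obj)
        word = "one" if word is None else SUCCESSOR[word]
    return word
```

Each line corresponds to a separately-learned reflex in the story above; none of them individually looks like “knowing how to count.”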

Learning to reason

If I teach a child to answer questions like “The apple is on top of the box. What is on top of the box?” the child can learn to answer nearly-reflexively.

If I teach a child to answer more complex questions, like “The apple is on top of the box. The box is on the shelf. Is the apple on the shelf?” then the monkey must make more extensive use of its machinery.

For example, it may learn to pull the levers which bring parts of the sentence into focus one after another, and then to pull the levers which adjust the imagined scenery appropriately. Or it may learn a strategy which explicitly manipulates verbal information about the apple until it arrives at an answer.

Over a large number of interactions the monkey will learn a wide range of strategies (probably in parallel with learning language, at least for modern humans). The monkey will experiment with different strategies, and discover which lead to correct answers. It will build intuitive models of its own mental workspace, and use those models to more quickly learn sensible policies. It will learn some kinds of thinking by imitating language or physical manipulations from others and translating them to internal operations.

There is no notion of “correct” reasoning baked into this learning process; “correctness” entirely emerges from the strategies learned by the monkey, which are themselves optimized for the various rewards supplied by evolution. Amongst those rewards are probably drives like a desire for correct predictions and to explore “interesting” physical and cognitive situations.

(I’m certainly not confident of that; it’s plausible that our RL algorithms are in fact good enough to produce these drives where appropriate rather than taking them as given. Of course it’s most likely that the monkey doesn’t really fit into the abstraction of RL, such that the truth is messy and there is no notion of “real” rewards.)

Forming habits

Suppose that one day I do some reasoning and conclude “I ought to more often greet people by name.”

As soon as I think it, this thought changes my brain. Not by influencing the monkey’s policy, which happens relatively slowly over the course of many decisions. Instead, the immediate effect is to change my memories. The thought is immediately available in short term memory, and depending on the situation the monkey may write it to various forms of longer term memory.

There is no inherent mechanism by which forming these memories will lead to me greeting people by name. All that will happen differently is that certain memory retrieval operations will return a different result.

But sometimes the monkey may pull a lever to access some memories about what it ought to do, and what lever it pulls next may depend on the result of that retrieval. And so one day I may in fact recall “I ought to greet people by name” and thereby do it.

I might facilitate this process by deciding to store different memories. It is rare for the monkey to explicitly ask “what ought I do?” But when it encounters a concrete situation, it does often look up any memories that are extremely similar to that situation, since that reflex often proves useful. So by concretely imagining a situation and then storing the imagined situation in memory, I can improve the probability that the monkey accesses a relevant memory at the time when it is needed (and then hopefully the monkey will respond by greeting someone by name).
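
A minimal sketch of this kind of cue-keyed retrieval, with all the details invented: memories are stored with a set of concrete cues, and retrieval only fires when the current situation overlaps those cues enough:

```python
def store(memory, cues, content):
    """Store content keyed by a set of concrete situational cues."""
    memory.append((frozenset(cues), content))

def recall(memory, situation, threshold=2):
    """Return stored contents whose cues overlap the current situation
    in at least `threshold` features."""
    return [content for cues, content in memory
            if len(cues & set(situation)) >= threshold]
```

Concretely imagining the situation corresponds to storing a memory under cues that will actually be present later, so the retrieval reflex surfaces it at the moment it is useful.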

Alternatively, the deliberator could accelerate this process by training the reflex of more explicitly asking questions like “is there anything I should remember about this situation?” Of course this would have to be done in the same way, but once installed such a reflex could quickly translate abstract intentions into information that could be acted on at the time.

Whether or not we actually end up with any of these reflexes depends in large part on whether they pay rent. The question is always: (a) is the behavior helpful for the RL task, and (b) does the monkey have enough training data / capacity to learn it?

III. Duality

Aliefs and beliefs

There are different senses in which I can “believe” that eggs crack when dropped:

  1. The monkey’s reflexes can avoid dropping eggs.
  2. The monkey’s model can predict that eggs crack when dropped.
  3. I might have a memory of the thought “Eggs crack if dropped.”
  4. I might have a memory of the thought that I ought to avoid dropping eggs.
  5. The monkey can reflexively respond to questions like “Do eggs crack if dropped?” with “yes.”

The monkey’s training procedure tends to bring #1 and #2 into harmony, by updating from real or imagined experiences. For example, if the monkey’s model predicts the egg breaking when dropped, and the monkey’s value function rates that outcome poorly, then over the course of many internal rollouts the monkey will revise any reflex that leads to dropping eggs.
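
This harmonization resembles Dyna-style updates in RL, where imagined transitions from a learned model feed the same value update as real experience. The analogy is mine, and the sketch below is purely illustrative:

```python
def imagined_update(V, model, value_of, state, alpha=0.2):
    """One imagined rollout: the model (#2) predicts the outcome, the
    value function scores it, and the reflex strength (#1) shifts toward
    that score -- no real egg needs to be dropped."""
    predicted = model(state)       # e.g. "drop egg" -> "egg broken"
    target = value_of(predicted)   # value function rates the outcome
    old = V.get(state, 0.0)
    V[state] = old + alpha * (target - old)
    return V[state]
```

Run enough of these internal rollouts and any reflex that leads to dropping eggs gets revised downward, which is the sense in which #1 and #2 are pulled into harmony.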

Similarly, my explicit reasoning tends to bring #3, #4, and #5 into harmony. When I recall two facts that are in tension, I use reasoning to pin down the contradiction, and then record new memories about those facts. My reflexes about how to answer “do eggs crack if dropped?” are trained by comparing the results to explicit memory retrieval.

I map #1/#2 onto “aliefs” and #3/#4/#5 onto “beliefs.”

There is no mechanism that automatically brings aliefs and beliefs into harmony. In general the two don’t even have the same domain: we can have beliefs about things that our monkey knows nothing about directly (e.g. the orbits of the planets, category theory) and we can have aliefs about things that we have no explicit understanding of (e.g. aspects of body language we have never explicitly considered).

Aliefs are produced directly by the monkey’s training process, by fitting models to its experience with the world.

Beliefs are produced as actions taken by the monkey, as the result of a bunch of levers getting pulled. Beliefs can be shaped by explicit reasoning and can update in response to abstract forms of evidence that aren’t meaningful to the monkey. So in many cases beliefs will be a much better guide to the truth than aliefs, even when they are imperfect.

I can believe that my aliefs are accurate in a domain without “knowing” what they are. For example, I can defer to my intuitions about psychology even when I have no idea where they come from, if I’ve observed that my intuitions are usually right. Similarly, I can alieve that my beliefs are accurate in a domain. For example, the monkey can predict that a ball will land where my calculations say it will land, even though the calculation disagrees with intuition. The monkey will make this prediction if it keeps seeing the calculations be correct.

Cesires and desires

There are two different senses in which I might “want” to win a game:

  1. The monkey’s reflexes might be optimized to win it, and its value function might be higher if it wins.
  2. I might have a memory of the thought “I want to win the game” / I might reflexively answer “win the game” to the question “what do you want?”

I map #1 onto “cesire” and #2 onto “desire.” This is precisely analogous to the distinction between alief and belief. In the same way, cesires and desires need not be in harmony nor even have the same domain.

If you ask me to make a choice quickly, I am going to pick the option I cesire. And over the long run, I am going to use whatever cognitive processes result in getting what I cesire.

If you ask me “what do you want?” and I’m inclined to answer honestly, I will tell you what I desire. Similarly, if I have to deliberate between two options, I am likely to choose the option which I desire. As a consequence my beliefs and deliberative processes may be distorted in order to better result in decisions that lead to what I cesire—after all, beliefs and deliberation are just the result of cognitive actions taken by the monkey in pursuit of cesires.

Cesires are produced directly by the monkey’s training process, by backwards chaining and fitting a value function. Desires are produced as actions taken by the monkey, as the result of a bunch of levers getting pulled.

The monkey may learn that generating desires is an effective way to achieve ends, and this will tend to bring our desires in harmony with our cesires. But that’s not a trivial learning problem, and the incentives aren’t perfectly aligned, since desiring X has effects beyond causing me to get X. So my desires typically diverge from my cesires.

There are two especially salient points of divergence:

  • Cesires can only be propagated by the monkey’s training process, which rests on real or imagined associations between states of affairs. Desires can be propagated on the basis of explicit reasoning about cause and effect.
  • Because humans aren’t good at lying, desires are also under pressure to look good to potential allies.

I can cesire to achieve what I desire: the monkey may alieve that when I believe “I want to win this game,” winning the game will result in fulfillment of my cesires. This can be generalized to the cesire to achieve what we desire. That generalization may have caveats and may apply with different force in different domains.

Similarly, I can desire to achieve what I cesire, and indeed many people have the intuition that their cesires are what they “really” want. (Many people also have the reverse intuition, that their desires are what they really want. By nature I’m more like the latter camp.)

IV. Self-help

“Weakness of will” and internal trust

Desires can be informed by explicit reasoning, and so from the monkey’s perspective it is useful to do what we desire. For example, I may cesire to take a marshmallow now, but desire to wait (so that I will get two marshmallows in a few minutes). If we pursue our desires, ultimately our cesires are satisfied. In this way, the cesire to do what we desire is strengthened.

I think this is a critical part of being an effective human in the modern world, and I think this internal trust is an extremely important resource. This trust can be exploited to achieve our desires at the expense of our cesires, and over time it can be eroded.

For example, suppose that Bob has a strong desire for the future to be full of flourishing. From the perspective of the monkey, events a century from now don’t even really exist. Nevertheless, Bob ends up with a desire to reduce the risk of human extinction, which flows backwards all the way to the desire to make progress on a particular project. And because Bob cesires to achieve what he happens to desire, he makes progress on that project.

But every day he spends pursuing desires that do not advance his cesires, the monkey’s reflexes towards fulfilling his desires weaken; he is burning this internal trust. If this project never satisfies his cesires, eventually he will no longer work on it.

Perhaps at first caveats are added, and Bob loses the cesire to fulfill certain classes of desires. At this point a normal human would let a project slide because it has become unappealing. Some people won’t stop there though, and will instead search for other mechanisms to overcome their “weakness of will,” to bring themselves to do what they believe themselves to want.

Bob may find ways to route around the caveats—he may reframe the situation such that he still cesires to do what he desires. He may set up social incentives, structure his work in a particular way, put up a motivational poster, cheer when he succeeds. But at the end of the day, unless the project is actually achieving what he cesires, the monkey will eventually learn to see through it (if only by trial and error), and each trick will eventually fail.

If Bob is especially determined, then he may end up much worse off than he began, having permanently undone many reflexes that led him to do what he desired. In the extreme case he can no longer even pass the marshmallow test or its various ubiquitous analogs.

No matter how “dumb” the monkey is, if it is unbiased then there is no free lunch. For a time we can do what we desire at the expense of what we cesire, but any cognitive policy that does so will eventually become unappealing.


The monkey and the deliberator are essentially two different optimization processes with different goals, and I think they both benefit from compromise. (I discussed this compromise in a previous post.)

What does such a compromise look like?

When I deliberate, by default I would pick the option that I desire more. This introduces significant incentives for the monkey to sabotage my deliberation in various ways, for example to shift my desires or to give me incorrect beliefs.

Instead, I can pick the option which I believe to be best according to a compromise between what I desire and what I cesire. I can also make an effort to improve my beliefs about my cesires, and to generally act as a trustworthy ally of the monkey.

If this is how I deliberate, then I am much more likely to wind up with the cesire to deliberate and do what I believe is best after deliberating, since by so doing I actually can better achieve what I cesire. At the same time, even though this deliberation is not aimed directly at achieving what I desire, having my heart in it is sufficiently valuable that I expect to get much more of what I desire.

My inclination is to put cesire and desire in symmetric bargaining positions, though I think that the “right” bargaining outcome requires moving beyond the gross simplifications in the monkey+machine model. But I think that even getting things roughly right captures a lot of the value.

In practice, I think the best compromise will probably involve doing good for the world in a way that is broadly recognized and respected, and which helps me become an awesome person who is able to achieve impressive things. The ability to have a large effect on the world is a major input both into satisfying my desires and cesires.

V. Artificial intelligence

Current state of deep learning

The monkey+machine model roughly matches my description of prosaic AGI.

On this picture, neural networks implement the monkey. Evaluating our ability to directly train neural networks to perform tasks that require explicit reasoning is probably missing the point. Instead, the key questions are (a) when will our models become expressive enough to implement the monkey, and (b) when will we have strategies that can actually train them?

(This discussion corresponds to training neural networks to operate computational machinery, though using a lot of RL rather than differentiable mechanisms, e.g. using mostly hard rather than soft attention.)

On (a): it is natural to suppose that perceptual and control tasks are roughly pushing the abilities of the monkey, since they are the chief tasks accomplished by animals during most of natural selection. So if our models can learn these tasks, we might expect that they can learn most of the monkey’s functions.

On (b): I don’t think we have seen much progress on training strategies recently, but it’s extremely hard to know how difficult the problem is. That is, evolution coughed up a whole bunch of innovations—great heuristic proxies for fitness, nice exploration strategies, model-based RL, probably third-person imitation and hierarchical RL, TD learning, etc.. It’s not clear if human researchers will achieve the same feat by the usual scientific process, or by implementing better meta-learning, or something else. The main reason to think it might happen soon is that once our learning algorithms are good enough to implement the monkey, the benefits of good training strategies rise from “million dollar curiosity” to “trillion dollar keystone of the economy.”

On the flip side, I often look at what AI can learn (or will plausibly be able to learn) to get some information about what the monkey is capable of. By going back and forth between these two perspectives I think I have gotten better intuitions on both sides.

Inferring human values

I think this model partly explains my pessimism about inferring the “right” human values from human behavior.

The issue is not that there are two sets of values and it’s unclear how to compromise between them. The issue is much more fundamental: our beliefs and desires don’t even have the type signature of functions on possible worlds.

The correspondence between mental objects and facts about the world is a characteristic of the policy learned by the monkey, which will be incredibly complex and inconsistent. Even taking the correspondence as given, our beliefs and desires will themselves be inconsistent, since they are produced by a complex and imperfect strategic process.

The monkey does not have the kind of architecture where we can say that some of these beliefs are “mistakes.” We can imagine a version of ourselves where the monkey’s models were trained to convergence (on some distribution) and had infinite capacity. This essentially amounts to adopting the values used during the monkey’s training, but:

  • Those values are not our values, they are at best our cesires.
  • Probably the monkey behaves extremely strangely if you “train it to convergence,” since it was produced by evolution, which doesn’t care at all what happens in this weird regime. It would not be at all surprising if the monkey ended up catatonic.
  • The only reason we can imagine the “perfect” version of the monkey is because we are abstracting out all of the details. If we open up that abstraction, we are going to have similar problems all over again. Probably the only logical endpoint is recovering something like reproductive fitness, which is even more clearly not our values.

We could instead talk about what would happen if we allow explicit deliberation to occur for much larger amounts of time, though now it feels like we are basically back to capability amplification.

Comprehensibility and AI control

If we think that humans and AI work like the monkey+machine, then seeking certain kinds of “principled” understanding feels particularly hopeless. There is not much reason for the monkey’s cognitive policy to be at all comprehensible, and nothing stopping us from building it before comprehending it.

It seems more promising to do analysis one level of abstraction up from the cognitive policies themselves: thinking about how the “values” of the monkey relate to its training procedure, and thinking about how the values of the deliberator relate to the values of the monkey. Moreover, to be successful this analysis should be agnostic to the concrete policy learned by the monkey, since that policy is likely to be completely incomprehensible.

My hope is to train the monkey by using supervision from the monkey+machine (which amounts to using the monkey+machine to generate supervision for the monkey), to kickstart the process with human supervision, and to hope that a benign monkey leads to a benign deliberator. The example of human cognition highlights one reason this task may be difficult: from the perspective of the monkey, the deliberator is not close to benign.

VI. Conclusion

I think that many people’s thinking about thinking–whether humans or AIs–would be improved by imagining a fast RL agent (the “monkey”) operating slower cognitive machinery:

  • I think self-improvement benefits from the basic perspectives of compromise/coordination/communication between the monkey and the deliberator.
  • Thinking of giant RNNs/CNNs as analogous to the monkey, which must be coupled to external machinery in order to produce deliberation, gives a more accurate picture of the relationship between our cognition and existing AI.
  • Imagining conscious deliberation as fundamental, rather than as a product of and input to the reflexes that actually drive behavior, seems likely to cause confusion.

Even if you don’t buy the model, you should probably accept that it is a way that brains could be. If you make an argument about brains and find that the conclusion fails for the monkey+machine, you should probably be able to point to the assumptions or observations that ruled out this model.

10 thoughts on “The monkey and the machine: a dual process theory”

  1. Hi,

    This is really interesting. I especially like the section on internal trust and compromise.

    I have a few questions about the section on using neural controllers for external interfaces.

    My understanding is that the monkey+machine model is in the same vein as work using neural networks to operate computational machinery, but you think that learning the “monkey” should happen through differentiable mechanisms rather than RL. Is this right?

    I’m not sure why this is. It feels like when I do logical reasoning (and lots of other tasks), I’m manipulating discrete symbols in my brain (monkey taking actions). It’s possible that all these actions are differentiable, but that wasn’t my understanding from this post, so I think I’m misunderstanding something.

    I also don’t think I followed your point about what’s necessary for prosaic AI. To help me, if we subdivide bottlenecks as:
    (1) computational power (e.g. FLOPs)
    (2) learning algorithms (e.g. backprop, reinforce, TD-learning, being able to use semi-supervised RL, IRL, etc.)
    (3) training techniques (e.g. ADAM, TRPO, weight initialization)
    (4) models (e.g. LSTM, CNN, resnets). It seems okay to me to group these with training techniques for this discussion, but maybe it’s helpful to have them separate.
    (Mapping these back to the post, 1 is supposed to be a, and 2-4 are b – that may be wrong.)

    My understanding is that you think that perceptual and control tasks do a good job testing whether (1) is an issue. And then the claim on (b) is that (3) and (4) haven’t seen much progress, but if there are breakthroughs on (2) which make prosaic AI imaginable, then (3) and (4) could potentially advance very quickly. I think my understanding is probably wrong here.

    Also, are there examples of perceptual and control tasks which you think are especially difficult? For example, superhuman speech and image recognition seem within reach, but I have no clue how you train a network to write theoretical math papers. Do you include e.g. Starcraft in this category?

    Also, can you clarify what the two sides are in “By going back and forth between these two perspectives I think I have gotten better intuitions on both sides.”?


    1. I expect controllers will ultimately have to be trained with RL; I was contrasting this with the contemporary approach, which relies almost exclusively on differentiable mechanisms. (I think this is the typical view.) That said, you may use something-like-differentiation for variance reduction.

      I suspect Starcraft is relatively easy, and don’t think it is a hard control task. I think manipulation is the most likely domain for hard control tasks. If we could build a system with human-level control of a hand (and human-level robustness across distractors / shapes / textures) I would update towards shorter AI timelines. (I tentatively think there is about a 50% chance of this happening soon.)

      My breakdown isn’t really aligned with your 1-4. Roughly, I would lump 1,3, and 4 in the category of “can you implement a good enough, trainable model?” Then I would lump 2 under “do you have a strategy for actually training it?” Of course, e.g. more computation lets you run longer training, and with enough training you wouldn’t need any clever training strategies, and in general the division isn’t tight. Maybe the breakdown is something like (a) “if you had a super detailed teacher supervising every tiny decision, could you do it?” (b) can you do it?

      The two perspectives are the analogies (monkey RL controller) and (monkey fast cognition). I have different sources of info about what AI can do and what I can do fast, the stories seem consistent and comparing notes gives a fuller picture.


      1. I expect controllers will ultimately have to be trained with RL, I was contrasting this with the contemporary approach which relies almost exclusively on differentiable mechanisms. (I think this is the typical view.)
        Sorry, I think I misinterpreted (reversed) the point in the post. Is it fair to say that this model fits better into a framework of training neural controllers via RL rather than differentiable mechanisms? (I agree that most work so far has used differentiable mechanisms.)

        I think manipulation is the most likely domain for hard control tasks. If we could build a system with human-level control of a hand (and human-level robustness across distractors / shapes / textures) I would update towards shorter AI timelines. (I tentatively think there is about a 50% chance of this happening soon.)

        Oh, that’s interesting. Is there a concrete way to summarize this belief / elaborate on this part “chief tasks accomplished by animals during most of natural selection. So if our models can learn these tasks, we might expect that they can learn most of the monkey’s functions.”? Would you also agree with the claim that “if we could train a system to do the same tasks as a rabbit, I would update towards shorter AI timelines”? (Chosen because rabbits seem smarter than AI, but I have no clue how you would get a rabbit to perform useful tasks we might want an AI to do.)

        It’s not clear to me why an algorithm that’s good at finding solutions to low-level control problems should be good at searching through the space of abstract thoughts. One difference seems to be that it’s possible to get better feedback on individual actions in the control setting, whereas if you’re learning to e.g. write Python code, the odds of stumbling onto anything useful are basically zero.

        I think I might also have a confused interpretation for how the machine works in this model. My current interpretation is that each computational building block the monkey calls has to also be learned with something like RL. For example, it seems basically impossible to write some module that would do something like “given this abstract representation of a program I’ve written so far and this particular input, what does the state of the program look like at this point?” Of course, the monkey gets to use all the tools a programmer uses, but it still seems like there’s a lot of mental symbol pushing which seems like it would be difficult even for a system with human-level control over a hand.

        My breakdown isn’t really aligned with your 1-4. Roughly, I would lump 1,3, and 4 in the category of “can you implement a good enough, trainable model?” Then I would lump 2 under “do you have a strategy for actually training it?” Of course, e.g. more computation lets you run longer training, and with enough training you wouldn’t need any clever training strategies, and in general the division isn’t tight. Maybe the breakdown is something like (a) “if you had a super detailed teacher supervising every tiny decision, could you do it?” (b) can you do it?

        Jumping back to the original post, my impression is that the claim is that large progress on (1,3,4) would make (2) a really important problem where there would be huge financial incentives to work on it. I would have expected the converse to be more true (not that the two necessarily compete), since it seems like (2) involves more breakthrough-y/jagged progress, and problems in (1,3,4) seem like they can be solved by increasing the number of researchers + money, e.g. ramping up the number of GPUs, testing more models, testing more optimization procedures, etc.

        This might actually just be the same question as the one above, since I agree if there’s some semi-clear path from human-level control of a hand to human-level reasoning then companies will throw tons of resources to go from the first to the second.


        1. Yes, this model makes more sense with RL than with differentiable mechanisms.

          I’d consider replicating a rabbit to be a lot of evidence.

          I think that training strategies can most likely be developed with enough sweat and blood. Being clever can get you faster progress / algorithms that will come closer to working “out of the box,” but I don’t think this problem is fundamentally hard. I think it’s much more uncertain whether we can actually build trainable models that can replicate human cognition. I might be cheating by shifting difficulty between these two parts of the problem, though; it’s not clear whether my abstractions are solid enough for any of this to be meaningful.


          1. Okay, thanks! That all seems reasonable to me.

            I’d have to think more to have an opinion about the relative difficulty of low-level control problems relative to e.g. natural language. I don’t find the argument from evolutionary pressures compelling by itself, but I haven’t thought about it before.


  2. I’m wondering if there is anything that might ever make you reconsider whether AI alignment is a good idea.

    My concern is that much of the impetus for this goal seems to be the confident assertion, made by Eliezer Yudkowsky among others, that adaptive future minds will not be morally relevant, but will instead just be meaningless noise. This is plausible in a model where there is a strong theory of intelligence waiting to be found. If this is the case, maybe there is a significant qualitative difference between the structure of human and (e.g.) chimp minds, such that the former implement Intelligence and the latter don’t. And then plausibly most of the messy interesting parts of our minds are a legacy from the pre-Intelligence era, that an AI will obviously not have.

    On the other hand, the more similar AI will be to existing brains, the more likely it is that it will itself be morally relevant. In which case it seems not remotely obvious that building AIs to serve current human goals, or even future human goals, is the most moral course of action.

    Have you thought about what level of similarity to human minds AIs would have to have before you’d start caring about their experiences?


    1. My concern is that much of the impetus for this goal seems to be the confident assertion, made by Eliezer Yudkowsky among others, that adaptive future minds will not be morally relevant, but will instead just be meaningless noise.

      I don’t think this is relevant to the argument. I think that the moral status of the systems we build is not particularly important to the moral case for AI control (though it may present additional considerations that are relevant to AI; for example, we may want to avoid building systems which suffer). All that really matters are the preferences optimized by the systems we build.

      I do think that there are two qualitatively different ways that an AI system might be morally valuable: one because it helps us achieve our desires, and a second because it has desires for which we feel sympathy in the same way that we feel sympathy for the desires of other people. If the second mechanism proved to be “easier” than the first, then solving AI control might not be so important, and instead we should aim to understand that mechanism and build a system that has value through that channel.

      I think this is a complicated question which hasn’t received much formal attention and probably deserves more. I do want to emphasize that the relevant kind of moral standing is basically unrelated to the consciousness or moral patiency of the systems we build; I think this is a really important distinction which is often glossed over.

      My current view is that (a) there are probably some systems that have sympathetic values, which would ignore human values but might nevertheless be OK to build, (b) we don’t yet see how to identify such systems, and I think that in expectation most systems we could point to will be substantially less morally valuable than a controlled AI, (c) if we could flip a switch that reduced the probability of another x-risk by 10%, but increased our probability of building an uncontrolled AI by 11%, I think that we should probably flip that switch; I’m not sure exactly where the cutoff point is. I think it’s conceivable though unlikely that I’ll eventually decide that other extinction risks are an order of magnitude more important than AI risk (per a fixed reduction in probability).


  3. As soon as I think it, this thought changes my brain. Not by influencing the monkey’s policy, which happens relatively slowly over the course of many decisions. Instead, the immediate effect is to change my memories.

    This is at odds with the reward-pathway literature, which distinguishes between model-free and model-based actors, encoded in different loops within the basal ganglia. Only the SR loop “updates slowly”; the AO loop can update immediately based on updates to model information encoded in cortex. More here:
    Kaplan & Oudeyer (2007). In search of the neural circuits of intrinsic motivation.
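
    To make the update-speed contrast concrete, here is a toy sketch (my own illustration, not from the post or the cited papers): a cached, incrementally updated value (model-free / SR-like) keeps preferring a stale action after the world changes, while a value recomputed from an explicit model (model-based / AO-like) flips as soon as the model is revised.

```python
import random

random.seed(0)

# Two-armed bandit with ground-truth rewards.
rewards = {"left": 1.0, "right": 0.0}

# Model-free (SR-like) learner: cached action values, updated incrementally.
q = {"left": 0.0, "right": 0.0}
alpha = 0.1  # small learning rate, so cached values change slowly

def model_free_update(action):
    q[action] += alpha * (rewards[action] - q[action])

# Model-based (AO-like) learner: an explicit model of action -> outcome.
model = dict(rewards)

def model_based_value(action):
    return model[action]  # value is recomputed from the model on demand

# Train the model-free learner; it comes to prefer "left".
for _ in range(200):
    model_free_update(random.choice(["left", "right"]))

# The world changes: "right" becomes the rewarding arm.
rewards["left"], rewards["right"] = 0.0, 1.0
model["left"], model["right"] = 0.0, 1.0  # one revision to the model

# Model-based values flip immediately...
assert model_based_value("right") > model_based_value("left")
# ...while the cached model-free values still prefer the stale arm.
assert q["left"] > q["right"]
```

    The point is only the asymmetry: a single revision to the model changes behavior at once, whereas the cached values need many more incremental updates to catch up.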

    “The monkey and the deliberator are essentially two different optimization processes with different goals”

    Perhaps we should make the following identifications:

    Monkey = Sensorimotor loop of basal ganglia (stimulus-response SR)
    Deliberator = Associative loop of basal ganglia (action-outcome AO)

    There is robust empirical evidence that e.g. habit formation involves information transfer SR -> AO.

    Yin and Knowlton (2006). The role of the basal ganglia in habit formation.

    You might be able to improve your distinction between desires vs. cesires by admitting biological constraints, e.g.:

    Cesires = Ventral Pallidum + Nucleus Accumbens Shell
    Desires = Orbitofrontal Cortex

    Lastly, I wanted to note that your model seems fairly cognition-centric, and I suspect could benefit by admitting emotional-visceral processes.

