Giving AIs safe motivations
(This is the sixth essay in a series I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, plus a bit more about the series as a whole. And see here for video and transcript of a talk I gave in September 2025 on the main content of this essay.)
1. Introduction
Thus far in this series, I’ve defined what it would be to solve the alignment problem, and I’ve outlined a high-level picture of how we might get there – one that emphasized the role of “AI for AI safety,” and of automated alignment research in particular. But I’ve said relatively little about object-level, technical approaches to the alignment problem itself. In the upcoming set of essays, I try to say more.
In particular: in this essay, I’m going to sketch my current high-level best guess as to what it looks like to control the motivations of an advanced AI system in a manner adequate to prevent rogue behavior, even in contexts where successful rogue behavior is a genuine option (call these contexts “dangerous inputs”1). I also talk briefly about how this best guess extends to fully eliciting beneficial AI capabilities.2 Then, in the upcoming essays, I turn to techniques for controlling the options available to AIs, and for building AIs that do what I call “human-like philosophy.”
The basic picture of motivation control I have in mind has four steps (see the sketch just after this list):
- Instruction-following on safe inputs: Ensure that your AI follows instructions on safe inputs (i.e., cases where successful rogue behavior isn’t a genuine option), using accurate evaluations of whether it’s doing so.
- No alignment faking: Make sure it isn’t faking alignment on these inputs – i.e., adversarially messing with your evidence about how it will generalize to dangerous inputs.
- Science of non-adversarial generalization: Study AI generalization on safe inputs in a ton of depth, until you can control it well enough to be rightly confident that your AI will generalize its instruction-following to the dangerous inputs it will in fact get exposed to.
- Good instructions: On these dangerous inputs, make it the case that your instructions rule out the relevant forms of rogue behavior.3
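
To make the conjunctive structure of these steps concrete, here is a minimal Python sketch that treats them as a checklist. Everything in it is hypothetical: the field names are stand-ins for large open research problems, not for any real evaluation API.

```python
from dataclasses import dataclass

@dataclass
class MotivationControlChecklist:
    """Toy representation of the four-step picture (all fields hypothetical)."""
    follows_instructions_on_safe_inputs: bool   # Step 1: behavioral evals pass
    no_alignment_faking: bool                   # Step 2: those evals aren't being gamed
    generalization_understood: bool             # Step 3: behavior on dangerous inputs
                                                # is predictable from safe-input evidence
    instructions_rule_out_rogue_behavior: bool  # Step 4: the instructions themselves
                                                # forbid the relevant rogue options

    def adequate_for_dangerous_inputs(self) -> bool:
        # The picture is conjunctive: failure at any one step undermines
        # confidence about behavior on dangerous inputs.
        return (
            self.follows_instructions_on_safe_inputs
            and self.no_alignment_faking
            and self.generalization_understood
            and self.instructions_rule_out_rogue_behavior
        )

# Example: everything passes except the science of generalization (step 3).
checklist = MotivationControlChecklist(
    follows_instructions_on_safe_inputs=True,
    no_alignment_faking=True,
    generalization_understood=False,
    instructions_rule_out_rogue_behavior=True,
)
assert not checklist.adequate_for_dangerous_inputs()
```

The point of treating the steps as a conjunction is that each of (1)-(3) supplies a distinct kind of evidence about how the AI will behave on dangerous inputs; none substitutes for the others.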

To be clear: the first three steps here each implicate deep challenges (I think the fourth may be comparatively straightforward).4 Below I’ll talk a bit about the tools we can use at each stage. Ultimately, though, adequate success at (1)-(3) will require significant scientific progress – progress that I’m hoping AI labor will itself significantly accelerate.
Indeed: in many respects, the picture above functions, in my head, centrally as a structured decomposition of the problems that an adequate approach to motivation control needs to overcome. It’s certainly not a “solution” to the alignment problem, in the sense of “a detailed, doable, step-by-step plan that will work with high confidence, and which requires only realistic deviation from the default trajectory.”5 And on its own, I’m not sure it even warrants the term “plan.”
But I’ve found it useful to have in mind regardless. In particular: in the past, I found it hard to visualize what it even would be to solve the alignment problem. Now, it feels easier. I feel like the problem has distinct parts, with specific inter-relationships. I can see how solving each would add up to solving the whole. And I have at least some sense of how each could get solved. My aim is to describe and clarify this broad picture, and to make it easier to build on and critique.
(Like much in this series, this picture isn’t original to me. Indeed, in many respects, much of the frame here is latent in the discourse about AI alignment as a whole – I’m mostly trying to bring it to the surface and to organize it. That said, the four-step framing in particular owes a special debt to Ryan Greenblatt and Josh Clymer, who have each written either internal or external docs covering many similar points, and with whom I’ve discussed some of these issues in depth.)
1.1 Summary of the essay
The essay proceeds as follows. I start by explaining how I think about the core problems at stake in motivation control. In particular: the most fundamental problem, in my opinion, is what I call “generalization without room for mistakes” – that is, roughly, ensuring that AIs reject catastrophically dangerous rogue options, despite the fact that you can’t safely test for this behavior directly. This fundamental problem is exacerbated by a number of sub-problems – notably: evaluation accuracy, causing good training/testing behavior, limits on data access, adversarial dynamics, and the opacity of AI cognition. I discuss each of these in turn.
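As a toy illustration of why this differs from ordinary ML evaluation (all names here are hypothetical placeholders, not real APIs): any behavioral measurement is restricted, by construction, to the inputs on which failure is survivable.

```python
def measured_safety(model, inputs):
    """Toy sketch of 'generalization without room for mistakes'."""
    safe = [x for x in inputs if not x.rogue_success_possible]
    dangerous = [x for x in inputs if x.rogue_success_possible]

    # Direct behavioral evidence is available only on the safe subset...
    score = sum(model.behaves_well(x) for x in safe) / len(safe)

    # ...and there is no analogous loop over `dangerous`: a single failure
    # there could be catastrophic, so behavior on those inputs has to be
    # predicted (via a science of generalization) rather than tested.
    return score, len(dangerous)
```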
I then briefly discuss the key tools we have at our disposal. I divide these into two categories – “behavioral science