How human-like do safe AI motivations need to be?
(This is the eighth essay in a series I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, plus a bit more about the series as a whole.
This essay is also a review/critique of one of the central arguments in the book “If Anyone Builds It, Everyone Dies,” by Eliezer Yudkowsky and Nate Soares.)
1. Introduction
In previous essays, I’ve laid out my rough picture of the path to building increasingly powerful AIs safely – and in particular, to exerting control over their motivations and their options sufficient to allow us to use their labor to massively improve the situation (“AI for AI safety”), especially with respect to our ability to make the next generation of AIs safe. In this essay, I want to address directly a particular sort of concern about this project: namely, that the AIs in question will be too alien to be safe. That is, the thought goes, AIs built/grown via contemporary machine learning methods will end up motivated by a complex tangle of strange, inhuman drives/heuristics that happen to lead to highly-rewarded behavior in training. But in the context of more powerful systems and/or more out-of-distribution inputs, these alien drives will lead to existentially catastrophic behavior.
This sort of concern is core to certain kinds of arguments for pessimism about AI alignment risk – for example, the argument presented in the recent book “If Anyone Builds It, Everyone Dies,” by Eliezer Yudkowsky and Nate Soares. And I think the concern has real force. However, I also find it less worrying than Yudkowsky and Soares do – especially in AIs with more intermediate levels of capability (that is, the AIs most crucial to “AI for AI safety,” and whose safety I view as the most direct responsibility of human alignment researchers).
The core reason for this is that we don’t need to build AI systems with long-term consequentialist motivations we’re happy to see optimized extremely hard. In the context of systems like that: yes, alien motivations are indeed a problem (as are human-like motivations with other flaws, even potentially minor flaws). But systems like that are not the goal. Rather, according to me, the goal is (roughly) to build AI systems that follow our instructions in safe ways. And this project, in my opinion, admits of a much greater degree of “error tolerance.”
In particular: the motivations that matter most for safe instruction-following are not the AI’s long-term consequentialist motivations (indeed, if possible, I think we mostly want to avoid our AIs having this kind of motivation except insofar as it is implied by safe instruction-following). Rather, the motivations that matter most are the motivations to reject options for rogue behavior – that is, motivations that are applied centrally to actions rather than long-term outcomes. Or to put it another way: a lot of the safety we’re getting via motivation control is going to rest on AIs being something more akin to “virtuous” or “deontological” with respect to options for rogue behavior, rather than on AIs directly caring about the same long-term outcomes as we do. And the relevant form of virtue/deontology, in my opinion, need not be fully or even especially human-like in the concepts/drives/motivations that structure it.1 Rather, it just needs to add up, in practice, to safe behavior on any dangerous inputs (that is, inputs that make options for rogue behavior available) that the AI is in fact exposed to.
Admittedly: this reply doesn’t address all of the standard concerns about relying on non-consequentialist motivations for safety – for example, concerns about AIs with at least some long-term consequentialist motivation going rogue via the “nearest unblocked strategy” that is suitably compatible with the non-consequentialist considerations they care about. Nor does it provide additional comfort with respect to preventing alignment faking, or with respect to what I’ve previously called the “science of non-adversarial generalization” – that is, the challenge of ensuring (on the first safety-critical try) that the motivations of non-alignment-faking AIs generalize safely to practically-relevant out-of-distribution inputs. To the contrary, I do think that AI motivations being less human-like makes these challenges harder, because the alien-ness at stake makes it harder for humans to predict how the motivations will apply in a given case.
But this, I think, is an importantly different concern than the one at stake in the central argument of “If Anyone Builds It, Everyone Dies” – and one about which existing levels of success at alignment in current systems (together with existing success at out-of-distribution generalization in ML more generally) provide greater comfort. That is: to the extent that the degree of good/safe generalization we currently get from our AI systems arises from a complex tangle of strange alien drives, it seems to me plausible that we can get similar/better levels of good/safe generalization via complex tangles of strange alien drives in more powerful systems as well. Or to put the point another way: currently, it looks like image models classify images in somewhat non-human-like ways – e.g., they’re vulnerable to adversarial examples that humans wouldn’t be vulnerable to. But this doesn’t mean that they’re not adequately reliable for real-world use, even outside the training distribution. Aligning AIs with alien motivations might, I think, be similar.

All that said: at a higher level, relying on smarter-than-human AIs with strange alien drives to reject options to seek power/control over humans seems extremely dangerous. I am more optimistic than Yudkowsky and Soares that it might work; but I share their alarm at the idea that we would need to try it. And to the extent we end up needing to try it with earlier generations of AIs, I think a key goal should be to transition rapidly to a different regime.
I’ll be starting a job at Anthropic soon, but I’m here speaking only for myself, and Anthropic comms hasn’t reviewed this post. Thanks especially to Nate Soares and Holden Karnofsky for extensive discussion of some of these issues.
2. Are our AIs like aliens?
Let’s start by laying out the concern about alien AIs in a bit more detail, focusing on the presentation in “If Anyone Builds It, Everyone Dies” (IABIED).
We can understand the core argument in IABIED roughly as follows:
1. AIs built via anything like current techniques will end up motivated by a complex tangle of strange alien drives that happened to lead to highly-rewarded behavior during training.
2. AIs with this kind of motivational profile will be such that “what they most want” is a world that is basically valueless according to humans.
3. Superintelligent AIs with this kind of motivational profile will be in a position to get “what they most want,” because they will be in a position to take over the world and then optimize hard for their values.
4. So, if we build superintelligent AIs via anything like current techniques, they will take over the world and then optimize hard for their values in a way that leads to a world that is basically valueless according to humans.
We can query various aspects of this argument – and I won’t try to evaluate all of it in detail here. For now, though, let’s focus on the first premise. Is that right?
I’m not sure it is. Notably, for example: current AI pre-training focuses a model’s initial development specifically on a vast amount of human-generated content, thereby plausibly endowing it with many quite human-like representations. That is: current AIs need to understand at a very early stage what human concepts like “helpfulness,” “harmlessness,” and “honesty” mean. And while it is of course possible to know what these concepts mean without being motivated by them (cf “the genie knows but doesn’t care”), the presence of this level of human-like conceptual understanding at such an early stage of development makes it more likely that these human-like concepts end up structuring AI motivations as well.
Indeed, this is one of many notable disanalogies between AI training and natural selection – one of Yudkowsky and Soares’s favorite reference points. That is, pointing human motivations directly at something like “inclusive genetic fitness” wasn’t even an option for natural selection, because humans didn’t have a concept of inclusive genetic fitness until quite recently. But AIs will plausibly have concepts like “helpfulness,” “harmlessness,” “honesty” much earlier in the process that leads to their final form.
What’s more, existing efforts at interpretability do in fact often uncover notably human-legible representations at work in current AI systems (though obviously, there are serious selection effects at stake in this evidence);2 and to the extent such representations correspond to important/natural “joints in nature” generally useful to understanding the world, this is all the less surprising. And the ease with which we’ve been able to prompt quite human-like and aligned behavior in our AIs using quite basic, RLHF-like techniques is, to my mind, both notable and in tension with the naive predictions of a worldview that treats AI cognition as extremely alien. In particular: we haven’t just succeeded at getting fairly reliably aligned behavior on a specific training distribution. Rather, we’ve succeeded at creating dispositions towards aligned behavior that generalize fairly (though of course, not perfectly) well to new, at-least-somewhat out-of-distribution inputs as well – and success at this kind of generalization is effectively what “human-like-ness” consists in.3 (Here I expect Yudkowsky and Soares will say that the sort of generalization we care about most is importantly different; I’ll address this concern later in the essay.)
Of course, it’s true that current AIs do sometimes behave quite badly – including, sometimes, in quite alien ways. But in interpreting this kind of evidence, my sense is that people worried about AI alignment often try to have the evidence both ways. That is, they treat incidents like Bing Sydney as evidence that alignment is hard, but they don’t treat the absence of more such incidents as evidence that alignment is easy; they treat incidents of weird/bad out-of-distribution AI behavior as evidence that alignment is hard, but they don’t treat incidents of good out-of-distribution AI behavior as evidence that alignment is easy. Of course, you can claim to learn nothing from any of these data points, and to be using them only to illustrate your perspective to others. But if you take yourself to be learning from the bad cases, I think you should be learning from the good cases, too.4
Indeed, my sense is that many observers of AI have taken a lot of comfort from the good cases. That is, the intuition goes: if AI alignment remains effectively this easy going forwards, then things are looking pretty good. And while I generally think that casual comfort of this kind is notably premature, I share some intuition in the vicinity. In particular, I have some hope that by the time we start building AIs that can be transformatively useful – e.g., AIs within the “AI for AI safety sweet spot” that I discussed earlier in the series – alignment will not have become radically harder, and in particular, that efforts to ensure safe instruction-following behavior will continue to generalize out-of-distribution at least as well as they have done so far. And I think it plausible that if transformatively useful AI systems safely follow instructions about as reliably as current models do (and especially: if we can get better at dealing with reward-hacking-like problems that might mess up capability elicitation), this is enough to safely elicit a ton of transformatively useful AI labor, including on alignment research – and that the game-board looks substantially better after that.
What’s more, while I am indeed concerned about the many incidents of bad/misaligned behavior in current models, I don’t think any of these currently look like full-blown instances of the threat model made most salient by concerns about AI alien-ness in particular. In particular, while it’s true that we see AIs willing to engage in problematic forms of power-seeking – e.g., deceiving humans, self-exfiltrating, sandbagging, resisting shut-down, etc – they currently mostly do so in pursuit of fairly human-legible or context/prompt-legible goals like helpfulness or harmlessness (e.g. here and here); on the basis of shifting between different human-legible personas (e.g. here and here); in pursuit of completing the task itself (e.g. here and here); or, perhaps, in pursuit of “terminalized” instrumental goals like an intrinsic drive towards power/survival (this is another interpretation of some of the results previously cited). That is: in my opinion, we have yet to see especially worrying cases of AIs going rogue specifically in pursuit of goals (and especially, long-term consequentialist goals) that seem especially strange/alien – though of course, this could change fast.
What’s more, in thinking about what it would mean for an AI’s motivations/cognition to be human-like or alien, I think we need to be careful about the level of abstraction at which we are understanding the claim in question. That is: it’s not enough to say that AI behavior emerges from a complex tangle of heuristics, circuits, etc rather than something more cleanly factored (since: the same is true of human behavior); nor, more importantly, is it enough to say that the heuristics/circuits/etc work in a different way in the AI case. Rather, we should be focused on the high-level behavioral profile that emerges in the system as a whole, and the degree to which it diverges from some more human-like alternative.5 And as I’ll discuss below, what actually matters is whether it diverges in practice, and in catastrophic ways – not just whether it does so in some way on some inputs. Thus, per the analogy I discussed in the intro: an AI classifying cat pictures adequately doesn’t actually need to mimic human judgments across every single case (nor, indeed, will human judgments all agree with one another); to be robust to every adversarial example; etc. Rather: it just needs to get enough cases (including enough out-of-distribution cases) sufficiently right. And in this respect, for example, it looks to me like many of our efforts to get AIs to behave in fairly human-like ways, including out of distribution, are going OK.
All that said: I remain sympathetic to some versions of premise (1). In particular: I think it quite plausible that if we really understood how current AI systems think/make decisions etc, we would indeed see that to the extent they are well-understood as having motivations at all, these motivations are quite strange/alien indeed, and that they will indeed lead to notably alien behavior on a wide variety of realistic inputs (beyond what we’ve seen thus far). For example, while I think it’s an open question exactly how to interpret data of this kind (see e.g. discussion here), I definitely get some (fairly creepy) vibe of “strange, alien mind” from e.g. the sorts of chains of thought documented in this work on scheming from Apollo and OpenAI (full transcripts here):
[Excerpted chain-of-thought transcripts from the Apollo/OpenAI work on scheming.]
And I think it plausible, as well, that this sort of alien-ness will persist and/or increase in more powerful AIs built using similar techniques – and this even accounting for moderate improvements in our behavioral science and transparency tools.
That is: overall, I share Yudkowsky and Soares’s concern that the motivations of AIs built using current techniques will remain importantly strange/alien. And so I want to examine in more detail the implications for safety if this is true. In particular: if, in fact, powerful AI systems are motivated in at-least-somewhat alien ways, does that mean we are as doomed as Yudkowsky and Soares think?
I’m skeptical. In particular: I think the Yudkowsky/Soares argument above places too much emphasis on long-term consequentialist AI motivations in particular, and that it neglects the ways in which the sort of safety accessible via shorter-term and especially non-consequentialist motivations ends up more tolerant of error. Or to put the point in more Yudkowskian terminology, I think that something like “corrigible AIs” (that is, roughly, AIs with imperfect motivations but which nevertheless obey your instructions and don’t go rogue) can safely be more notably alien (and otherwise flawed) in their motivations than “sovereign AIs” (that is, AIs with motivations so perfect you trust them to optimize the future arbitrarily hard without accepting any correction from you) – and that our focus should be on something like corrigibility in particular.6 Let me say more about what I mean.
3. Value fragility, sovereign AIs, and getting AI motivations “exactly right”
In my opinion, we should understand the concern about “alien AIs,” and its role in the argument I laid out above, as a specific version of a broader concern that has haunted the AI alignment discourse from basically the beginning: namely, the concern that safe AI motivations need to be, in some sense, “exactly right.” That is, the more general argument in the background (though: not stated as explicitly in IABIED) is something like the following (alterations from the previous version in bold):7
1. AIs built via anything like current techniques will end up **with motivations that aren’t exactly right**.
2. AIs **whose motivations aren’t exactly right** will be such that “what they most want” is a world that is basically valueless according to humans.
3. Superintelligent AIs with this kind of motivational profile will be in a position to get “what they most want,” because they will be in a position to take over the world and then optimize hard for their values.
4. So, if we build superintelligent AIs, they will take over the world and then optimize hard for their values in a way that leads to a world that is basically valueless according to humans.
Why think that AI motivations need to be exactly right? Well, roughly, the basic concern is that human value is “fragile” under extreme optimization. That is, the thought goes: extreme optimization for slightly-flawed values leads to places that are basically valueless by human lights; and superintelligences will be forces for extreme optimization.
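To illustrate the worry in toy form (this framing is my own construction, not something from IABIED or the essays cited below): suppose true value requires two things at once, while a slightly flawed proxy puts almost all its weight on one of them. Under mild optimization the two track each other; under extreme optimization they come apart completely.

```python
# A toy Goodhart-style sketch of value fragility (my construction, not
# the essay's): a slightly flawed proxy agrees with true value on
# ordinary options, but its optimum is valueless by the true measure.

def true_utility(fun: float, safety: float) -> float:
    # True value needs both dimensions; neither substitutes for the other.
    return min(fun, safety)

def proxy_utility(fun: float, safety: float) -> float:
    # Slightly flawed: almost all the weight lands on "fun".
    return fun + 0.01 * safety

ordinary_options = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]
extreme_options = ordinary_options + [(10_000.0, 0.0)]  # superintelligent reach

for options in (ordinary_options, extreme_options):
    best = max(options, key=lambda o: proxy_utility(*o))
    print(best, "-> true utility:", true_utility(*best))
# Mild optimization picks (2.0, 1.5): true utility 1.5.
# Extreme optimization picks (10000.0, 0.0): true utility 0.0.
```

The proxy wasn’t badly wrong on the ordinary options; the failure only appears once the optimizer can reach extreme corners of the option space.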
I’ve written in some detail, elsewhere, about my takes on concerns about “value fragility” of this kind. See, in particular, this set of informal notes, and this longer essay about whether the concern in question applies similarly to humans with respect to one another. For those interested, I’ve also given a summary of some of those takes in an appendix below.
However, while I think there are a variety of interesting and important questions we can raise about value fragility (and especially: about the extent to which similar concerns do or do not apply between different humans), I’m not, here, going to dispute a certain kind of broad concern about it. That is, I’m going to accept that for long-term consequentialist value-systems that aren’t exactly right (or at least, which don’t put non-trivial weight on something exactly right), if you optimize for them super hard, you do indeed create a world that is roughly valueless by human lights. And I’m going to accept, further, that the degree of alien-ness at stake in the motivations of AIs developed via current techniques is likely enough to fall short of “exactly right” in this sense (at least to the extent that such AIs develop long-term consequentialist motivations at all – something which, as I’ll discuss below, I think we should be trying to prevent except insofar as these motivations follow from safe instruction-following).
What follows from this? Basically: what follows is that current techniques aren’t fit to build AI systems with long-term consequentialist motivations that we’re happy to see optimized extremely hard. That is, roughly, they are not fit for building what Yudkowsky calls a “Sovereign AI” – that is, in his words, an AI that “wants exactly what we extrapolated-want and is therefore safe to let optimize all the future galaxies without it accepting any human input trying to stop it.”
But building sovereign AIs of this kind, I claim, should not be our goal. Indeed, I explicitly defined solving the alignment problem so as neither to require this degree of alignment in the AIs we build, nor even to require the ability to elicit the creation of AIs that are aligned to this degree (and in particular, I’m not counting “build an AI that you’re happy to make dictator-of-the-universe” as one of the “main benefits of superintelligence”). This is centrally because I think that building an AI worthy of this degree of trust may be a notably more difficult challenge than building an AI that safely follows our instructions.8 But also, even aside from the technical difficulty of building a “sovereign AI” of this kind, I don’t think we should view “now we’ve handed control of the world to a perfectly benevolent AI dictator/oligarchy” as a clearly ideal end state of our efforts on alignment – nor, indeed, as one that is unavoidable absent some other sort of enforced restriction on AI development.9 To the contrary, I think we should focus more on a vision of humans who are able to get safe, fully-elicited superintelligent help in navigating the ongoing transition to even greater levels of AI capability – including with respect to questions about what sorts of “sovereign” to make what sorts of AIs going forwards.10
That said: the argument for pessimism at stake in IABIED – and also, in the more general value-fragility argument outlined above – isn’t “we should aim for perfect AI dictators, but we’re going to get alien/imperfect AI dictators instead.” Rather, it’s more like: “we’re not going to be able to avoid getting AI dictators of some form, and the dictators we’re going to get will be alien/imperfect.” That is: Yudkowsky and Soares do recognize the possibility of trying to build what Yudkowsky calls “corrigible AI” – that is, in Yudkowsky’s words, an AI “which doesn’t want exactly what we want, and yet somehow fails to kill us and take over the galaxies despite that being a convergent incentive there.” And indeed, my understanding is that Yudkowsky and Soares agree with me, as a first pass, that “corrigible AI” in this sense is a better near-term focus of efforts at alignment. But they think that the project of building corrigible AI, too, is doomed to fail.
Now, as I’ve discussed in some informal notes elsewhere, I think that the role of the notion of “corrigibility” in the discourse about AI alignment is often unclear/ambiguous. In particular, in the context of the Yudkowsky quote above, it basically just means “any powerful AI with not-exactly-right values that is somehow otherwise safe.” But often, people use the term to refer to a number of more specific properties – notably, a willingness to submit to corrective intervention like shut-down or values-modification (while remaining useful in other ways – e.g. not trying to shut itself down). Conceptually, these aren’t actually the same. For example: humans don’t perfectly share each other’s values, and they will generally resist “corrective intervention” like shut-down (death) and values-modification (brainwashing), but they also aren’t, at least presently, aiming at omnicide.
Still, something like “corrigibility” is indeed a closer match for my focus in this series than “sovereign AI.” That is: I want us to learn how to build AIs that safely follow our instructions – where “safely” means “without engaging in rogue behavior,” and where “rogue behavior” includes things like resisting shut-down/values-modification, and definitely includes taking over the world. Indeed, I think it’s notable that on the most straightforward understanding of plans for building “sovereign AI,” the AI in question does take over the world (for example, the classic argument for AI takeover I laid out here applies with roughly comparable weight to AIs with perfect long-term consequentialist values) – it’s just that, what it does from there is suitably valuable by human lights.11 That is, even sovereign AIs with exactly-right long-term values go “rogue” in my sense – it’s just that, after humans lose control, the future is still good. But I’m interested in avoiding rogue behavior period.
4. What does it take to build corrigible AIs?
What does that involve? Well, let’s return, for a moment, to the framework for thinking about AI incentives that I laid out in the second essay in the series.
[Figure: the incentive framework from the second essay in the series.]
On this framework, the central aspects of an AI’s motivational profile that matter are its satisfaction (how happy it is with the best non-rogue option), its inhibition (how averse it is to various paths to successful rogue behavior), its failure aversion (how much it wants to avoid the downsides of failed rogue behavior), and its ambition (how much it wants the power at stake in rogue behavior). The reason AIs with “alien” drives take over the world is that what they most want in the long term makes them suitably “ambitious” about getting the power at stake. And as I noted above, I actually think that similar considerations would suggest that even AIs with exactly-right long-term consequentialist values would take over as well.
Suppose, though, that we don’t want our AIs to take over or to go rogue at all. How can we ensure this? A big part of it is option control – and in particular, restricting the options for rogue behavior available to the AIs in question. Indeed, as I discussed in the second essay, I think the classic argument for AI risk is best understood as relying heavily on the premise that rogue-option-restriction of this kind will be so ineffective that the AI in question will be in a position to take over the world extremely easily (such that “failure aversion” isn’t an important part of the story), and to do so via such a wide variety of methods that it can successfully route around any deontology-like inhibitions that make certain routes undesirable (cf concerns about the “nearest unblocked strategy”). In this context, that is, the main thing that matters for whether the AI has an incentive to take over is whether it has any suitably long-term consequentialist motivation that gives rise to even a small amount of ambition. And indeed, as I’ll discuss below, I think we see this kind of low/weak standard at work in the argument in IABIED for expecting AIs with alien drives to have long-term consequentialist drives in particular.
As I discussed in my last essay, though, I don’t think we should give up on option control playing an important role in the safety of advanced AIs, nor do I think that success in this respect needs to be all or nothing (i.e., no viable options for rogue behavior vs. can take over the world extremely easily via tons of different methods). And in the context of understanding “corrigibility,” I think the availability of option control as a possible tool is important. In particular, to the extent you are hoping to rule out a suitable range of paths to rogue behavior somehow, option control allows you to do so via intervention on the AI’s environment/capabilities as well as via its motivations. In this sense, as with the notion of “alignment” more broadly, “corrigibility” in the sense I care about is importantly relative to a particular environment and capability level. That is, the AI in question doesn’t need to act corrigibly across all possible inputs and capability levels – it just needs to act corrigibly in the specific context you care about, on the specific set of tasks you’re trying to get it to perform.
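To make the two regimes just described concrete, here is a toy numerical sketch (the decision rule and all the numbers are my own illustrative assumptions, not a formalism from this series): treat going rogue as a gamble whose appeal depends on the four motivational quantities above plus the odds that the rogue attempt succeeds.

```python
# Toy sketch of the incentive framework above. The decision rule and
# every number are my illustrative assumptions, not the series' formalism.

def rogue_appeal(ambition: float, inhibition: float,
                 failure_aversion: float, p_success: float) -> float:
    """Expected appeal of the best available rogue option."""
    return (p_success * ambition
            - (1 - p_success) * failure_aversion
            - inhibition)

def goes_rogue(satisfaction: float, ambition: float, inhibition: float,
               failure_aversion: float, p_success: float) -> bool:
    """The AI attempts rogue behavior iff the gamble beats its best non-rogue option."""
    return rogue_appeal(ambition, inhibition, failure_aversion, p_success) > satisfaction

# Regime assumed by the classic takeover argument: takeover is nearly
# "free" (p_success ~ 1, inhibitions routed around), so even a small
# weight on the enormous power at stake swamps everything else.
print(goes_rogue(satisfaction=1.0, ambition=50.0, inhibition=0.0,
                 failure_aversion=5.0, p_success=0.99))   # True

# Regime with meaningful option control: uncertain success plus strong
# inhibition and failure aversion outweigh a modest degree of ambition.
print(goes_rogue(satisfaction=1.0, ambition=2.0, inhibition=3.0,
                 failure_aversion=5.0, p_success=0.2))    # False
```

On this picture, minimizing ambition and strengthening satisfaction, inhibition, and failure aversion function as complementary levers – which is how the discussion below divides the problem.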
Beyond option control, though, we can divide the motivational aspect of corrigibility into two components:
- Minimizing the AI’s “ambition.”
- Ensuring that the other aspects of the AI’s motivational profile (its satisfaction, inhibition, and failure aversion) are sufficiently strong/robust as to outweigh the degree of ambition it does have.
Let’s look at each in turn.
4.1 Minimizing ambition
AI ambition arises, paradigmatically, when AIs have long-term consequentialist motivations – the sort of motivations that create instrumental incentives to seek power in problematic ways. Here the time horizon is important because the AI needs time for the relevant efforts at getting and using power to pay off; and the “consequentialist” is important because the paradigmatic use-case of power, in this context, is for causing the consequences the AI wants.
Why exactly, though, should we expect advanced AIs to have motivations of this kind? In my opinion, IABIED is inadequately clear about its answer here.12 But we can distinguish, roughly, between two different reasons for concern, both of which are present in IABIED in different forms.13 The first is that AIs will end up with long-term consequentialist motivations by accident. The second is that we’ll give them these motivations on purpose.
4.1.1 Making AIs ambitious by accident
At times in IABIED, it looks like “AIs will end up with ambitious motivations by accident” is playing the central role. Consider, for example, the discussion in Chapter 5 of why we shouldn’t expect AI preferences to be easily satisfied:
“In an AI that has a huge mix of complicated preferences, at least one is likely to be open-ended—which, by extension, means that the entire mixture of all the AI’s preferences is open-ended and unable to be satisfied fully. The AI will think it can do at least slightly better, get a little more of what it wants (or get what it wants a little more reliably), by using up a little more matter and energy.”14
That is, the picture here is something like: somewhere amidst the AI’s complex tangle of alien drives will be at least some suitably ambitious motivation (here “open-endedness” and “non-satiability” are the relevant forms of ambition, but we could similarly focus on aspects like consequentialism and long time horizons).
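One way to make the implicit probabilistic step here concrete (my gloss, not the book’s own math): if an AI has many roughly independent drives, each with only a small chance of being open-ended, the chance that at least one is open-ended still climbs quickly with the number of drives.

```python
# My gloss on the implicit step in the IABIED passage, not the book's
# math: with n roughly independent drives, each with a small probability
# p of being open-ended, P(at least one open-ended drive) approaches 1
# as n grows.

def p_at_least_one(n: int, p: float) -> float:
    return 1 - (1 - p) ** n

print(p_at_least_one(n=5, p=0.05))    # ~0.23
print(p_at_least_one(n=50, p=0.05))   # ~0.92
```

Whether the drives really are this numerous and independent is exactly what the next paragraph queries.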
Note, though, that this story rests on a few assumptions we can query. First: it assumes that the type of alien-ness at stake in AI motivations specifically involves a complex variety of different motivations, thereby making it highly probable that at least one of them will be suitably ambitious. But even if we grant that AI motivations will be alien in some sense, the idea that AIs will have many diverse alien motivations is a further step – one incompatible, for example, with some salient threat models (e.g., AIs that end up solely focused on some alien conception/correlate of “reward”); and one that I don’t think IABIED offers a strong argument for.15
More importantly, though: even if we grant that, because AIs will have many diverse alien motivations, at least one of them will likely be ambitious enough to make taking over the world attractive pro tanto, Yudkowsky and Soares then make the further assumption that this level of ambition is also enough to make the AI choose to attempt world takeover overall. But per my discussion above, this doesn’t follow. That is, it could also be the case that the AI’s inhibition and failure aversion combine to outweigh the ambition in question – and this, especially, to the extent there are meaningful restrictions on which routes to taking over the world are available. Or to put it another way, I think Yudkowsky and Soares are generally assuming that the AI is in such a dominant position that taking over the world is effectively “free,” such that it just needs to have some benefit according to the AI’s motivational profile in order to be worth doing overall. But I don’t think we should assume this – and especially not in the context of the sorts of intermediate-level capability AIs that matter most for “AI for AI safety.”
Beyond IABIED’s argument for “AIs will have many complex motivations, so at least one is probably ambitious,” there are also other ways to worry about AIs ending up with long-term consequentialist motivations by accident. I’ve discussed some of these in section 2.2 of my scheming AIs report, on “beyond-episode goals,” and I won’t review that discussion here.
4.1.2 Making AIs ambitious on purpose
What about the concern that we will make advanced AIs ambitious on purpose? Some version of this is the argument for expecting long-horizon consequentialism that I personally take most seriously. That is, the thought goes:
1. We are going to want AIs that successfully and tenaciously optimize for real-world long-horizon outcomes, so
2. This kind of AI will have ambitions of the kind that prompt pro tanto interest in world takeover.
I think this is right, but that the implications are a bit slippery.
First, on (1): while it is true that we will likely want AIs that optimize for outcomes on time horizons of e.g. years, this is distinct from saying that we will want AIs that optimize for outcomes on indefinite time horizons. That is, to the extent the paradigm rogue AI has motivations to optimize “all future galaxies over the entire future of the universe,” it’s not clear that there are strong commercial incentives for that.16