Last updated: 09.30.2024
Published: 09.30.2024

(Part 2, AI takeover) Extended audio/transcript from my conversation with Dwarkesh Patel

Context repeated from part 1: I recently had the pleasure of recording a podcast with Dwarkesh Patel. The final episode was ~2.5 hours, but we actually spoke for about six hours – four about my “Otherness and control in the age of AGI” series, and two about the basic AI takeover story. I thought it might be worth making the extra content public in some form, in case it might be of interest to anyone; so with Dwarkesh’s permission, I’m posting the full conversation on my audio feed here, along with transcripts on my website.1

I’ve split it into two parts, according to the topics mentioned above. This post is for the second part we recorded, on AI takeover. Extra content includes:

(Note: the time-stamps in the transcript are approximate, due to my having added a bit of intro to the audio.)

Transcript

Dwarkesh Patel (00:00:00):

Today I’m chatting with Joe Carlsmith. He’s a philosopher — in my opinion, a capital G great philosopher — and you can find his essays at joecarlsmith.com.

Basics of the AI takeover story

(00:00:11):

So we already recorded a part one of this conversation, and I was looking at the recording and I felt like we hadn’t gotten at a crux that maybe many audience members, and even I myself, have about the alignment discourse, which is like, I dunno, what’s the story again? So we have a GPT-4 and it doesn’t seem like a paperclipper kind of thing. It understands human values. In fact, if you have it explain, why is being a paperclipper bad, or just tell me your opinions about being a paperclipper, it’ll explain why the galaxy shouldn’t be turned into paperclips. So what exactly is the story again of dot dot dot doom? And let’s go through this. Does it make sense or not?

Joe Carlsmith (00:00:56):

So, what do we mean by a paperclipper? No one’s expecting literal paperclips. And it’s a little unclear. I tend to think about two aspects. So one is: does the AI, if given the option to seek power in some problematic way, and in the limit to disempower humanity entirely and to take over the world — if it genuinely has that option, such that this effort will predictably succeed, does it take that option? So that’s one component. And then the other component is: what does it do after that, if it takes control, and does it in particular create a world that is effectively valueless? That’s the meaning of paperclips — a thing that happens that is valueless even on deep reflection, even from a fully cosmopolitan and inclusive perspective. It’s just actually not good. Not necessarily bad, but just kind of blank. So those are the two components at stake, and maybe we should talk about them in turn.

Dwarkesh Patel (00:02:11):

And in fact I’d be especially curious about, even though any specific scenario is not in and of itself the most likely, just the trajectory that you think is plausible, or that represents the potential trajectories which end up in something we would consider valueless: starting with GPT-4 today, what is happening such that we have a system that takes over and converts the world into something valueless?

Joe Carlsmith (00:02:38):

Maybe I’ll just start by saying, so focusing on this component of taking over. One thing I just want to acknowledge up front is that if we’re really imagining an AI that’s in a position where it’s just like, here’s one of my options: to take over the world. And this will work. If I try it, that will work, right? And it can really see that, it has the planning capability and all this stuff. That’s just an option for it. I mean, we should note: that’s a very intense level of capability to be imagining. And I think it’s very reasonable to be skeptical that we’re going to be building AIs that have that type of capability at all. And in fact, I personally think we should be trying hard to make it the case that no AI is ever in a position to just choose unilaterally to take over the world.

(00:03:27):

I think it’s important to separate skepticism or questions you might have about will we have AIs that are in a position like that, from questions you might have about how easy or difficult will it be, conditional on having AIs who have it as an option to take over the world, to make it the case that they reject that option. And I think people sometimes kind of fuzz this together and they’re thinking, how worried am I about alignment? And they’re not really distinguishing between the capabilities component and the motivational component, and they’re sort of thinking, ah, maybe we have some souped up chatbots. How scared am I really of souped up chatbots? I don’t know. GPT-4 seems kind of nice, that kind of thing. And so I really want to highlight upfront just if we’re really talking about an AI that just has it as an option to take over the world, I think it’s kind of intuitive to think that that’s scary.

(00:04:21):

And one way to just get at that intuition at a high level before going into the specifics is just like, well, we talked in part one about exactly how alien are these AIs, how different from humans are they? But how would you feel about a human being in that position, in a position to just kill everyone or take over the world in some other way? I think even if it’s a human, that’s a very scary situation. That’s not the type of thing that we’re generally comfortable with. That’s not how we want to set up our politics and the distribution of power in our society. So I think even independent of questions about exactly how weird and different are the AIs, I just think this is a very alarming situation if you’re really taking it seriously. And I think sometimes people don’t fully condition on that and then think about it independently from the capabilities piece.

The role of multipolarity

Dwarkesh Patel (00:05:14):

So yeah, let’s hang on this piece of how plausible it is that an AI would be in a position to take over. Because in this world, not only has AI progressed, but other AIs have progressed as well, or human civilization has progressed. And so it’s not like one copy of GPT-7 is all that exists in the world, presumably there’s other copies. So yeah, what’s the story of how all the AIs coordinate in a way that can cause a takeover?

Joe Carlsmith (00:05:52):

Yeah, I think there’s a few different scenarios we can imagine there. So there are scenarios that, and we can talk about exactly how plausible they are, where there is in some sense some single AI or some single family of AIs, maybe it’s like a lineage of AIs that are sufficiently similar that coordination is very natural to them, because maybe they’re all from the same fine-tuning run, or they’re all just literal copies of the same model or something like that. Where this AI system is sufficiently powerful relative to both all the other AI systems in the world and relative to the humans, especially when you take into account all the instances of it that are working and doing different things, that in some sense this AI is in a position to take over on its own without recruiting non-copies of itself or something to help.

(00:06:49):

Those are going to be scenarios in which you’ve had this kind of… I think the most salient type of scenario of that kind is: you’ve had this explosive growth, people have been really throwing resources at developing the most powerful AI they can, and it so happens that that process goes sufficiently fast that there’s just a big gap in absolute capability between the really frontier AI and the others. So that’s one scenario that you might worry about from the perspective of a single AI being in a position… and when I talk about a single AI, I don’t mean a literal single copy of the weights necessarily — eventually it copies itself and stuff like that — but an AI that is both from the same lineage and sharing roughly the same impartial goals, is what I have in mind when I talk about a single AI.

Dwarkesh Patel (00:07:41):

To what extent is this different from the world as it exists today? Because the US military could take over Costa Rica and it would be a relatively trivial operation, but why would you do it? There’s more gains from trade from engaging with Costa Rica, and also there’s just so many other things to do. Why break the system of coordination and norms? And I wonder, wouldn’t this story also be true of AI as well, where fundamentally, the conflict with humans, what is it getting out of that, versus there’s so many things you could do? I dunno, the horizon is so vast in comparison. So why isn’t it like the story where the president could tomorrow order a takeover of Costa Rica? Why do we think this is especially worrisome or likely?

Joe Carlsmith (00:08:34):

Well, so one thing I want to say is I’m not necessarily sitting here going, these stories are super likely. And in particular I have some hope that it’ll actually be quite difficult for AI to take over, or that we can make it the case that it’s quite difficult, hopefully for a relatively long period. I think civilization should be extremely cautious about consciously transitioning to a scenario where we are vulnerable to the motivations of some single AI, or some set of AIs, such that if those AIs happen to decide to take over the world, then they would succeed. I think that’s a transition we should basically not make until we have a very developed science of AI motivation. And I think there’s a lot of options for hanging out in some zone of relatively serious AI capability prior to that point, where AIs don’t have this really overwhelming advantage. And so I’ll just say that. In general I’m going to talk about scenarios that I think about, and that I think are sufficiently plausible — or at least something nearby to this is sufficiently plausible — that we should be worried about them. But I’m not here going, this is the default. My actual hope is that no, we can avoid this. It doesn’t need to be this way. I mean there’s a bunch of different variables here.

(00:09:57):

Okay, so that said, the US case I think is different in a few ways. One is obviously there’s ethical constraints here, so there are things that humans don’t want to do to each other. There’s also the fact that if the US took Costa Rica, lots of other people would react to that; that would have a lot of impacts in terms of the world, reputation, how other people relate to the US, how other people think about other geopolitical dynamics. And there are a bunch of ways in which the US, even if it has lots of power, doesn’t have this totally overwhelming advantage over, I mean, even Costa Rica, but countries in general, especially depending on what the US is trying to do in the scenario.

Dwarkesh Patel (00:10:41):

So can we pause on that? I imagine with respect to both of those things, it would also be true of AI, at least modulo some crazy beyond-the-galaxy intelligence explosion. The ethical constraints, because luckily it’s a descendant of the human tradition in all this human text, and just natively, you talk to GPT-4, with the basic sort of RLHF thing we do, it’s reluctant to even say racist things or whatever, or teach you how to make a bomb. So the reluctance that humans have, which can be overwhelmed, to do these sorts of takeover things, I imagine the models would have it, potentially even stronger. And as for the idea that there could be retaliation against you or something: the AIs are going to be depending on human infrastructure and human collaboration to support all this compute, all this energy, all this infrastructure that they rely on. So why would you want to have your cluster nuked just so you can… wouldn’t it be better just to trade with human civilization, to work with them?

(00:11:47):

And finally, one of the mistakes in, let’s say, Marxist thinking is imagining that the classes as you think of them, the bourgeoisie or the proletariat, are cohesive and you can think of them as a class having a group interest, a wish or a plan. And then in reality, of course, your manager and you as part of a firm might be collaborating against some other manager and their workers. And when you’re voting, sometimes you vote against your quote class interest. And so you can imagine with AI and humans, it’s not necessarily AIs coordinating together. Maybe there’s a scenario where the humans cause some sort of intelligence explosion and that’s the harm, or a pocket nuke, or you could imagine gangs of humans and AIs. Why is it the humans versus the… why do you imagine the AIs teaming up together to orchestrate a takeover?

Joe Carlsmith (00:12:45):

So yeah, again, I’m not confident it will be like that. I think there are definitely messy scenarios where there’s all sorts of weird coalitions and AIs working with humans and AIs trading with humans. I think some intuition pump for how it could end up being the case that there’s a natural coalition against the humans is something like: imagine we’re in some scenario where there’s this big race dynamic between two nationalized projects or something like that. So you had this conversation with Leopold about maybe there’s this big intensive race dynamic between nations; it’s become part of a national security type frame. And you can imagine a scenario where, let’s say, you’ve really gone through an intensive intelligence explosion such that you have full-blown superintelligent AI as part of one project, and that’s a scenario in which those AIs would’ve been able to take over absent the other AIs. Maybe there’s some other AIs that some other project is training in some other country, and those are at this point the main seat of power other than the AIs here. Then there’s something at least somewhat natural about both of these AIs — and neither of them are aligned with human interests in this case, let’s say, because we never really figured out how to understand and control their motivations, so they’re both in some sense planning to go rogue or are already going rogue or something like that — saying, “Hey, who are the actual power players who actually are an important threat to me or an important trade partner or bargaining partner?” And if it’s the case that the humans are sufficiently on the outs at that point, then it might be natural for the AIs to coordinate. Now that requires that humans are in that position, and I think that’s not obviously the case. As you say, humans start with a ton of advantages. We have all this control over the physical infrastructure and we have all sorts of controls over the AI systems, and there’s a lot of advantages humans do have, and I take some comfort in that. But I think if you are in a situation where the only thing checking the power of one misaligned AI is another misaligned AI, and they’re able to coordinate, they can talk to each other, they could figure something out, then I think that seems worrying to me.

Dwarkesh Patel (00:15:34):

In some ways we are in that position with respect to other humans, in that humans who are misaligned to you check other humans who are misaligned to you. And the equilibrium there is one where you can go get Starbucks in the morning and nobody will hassle you. So then the worry is, well, if there’s only, I dunno, the Chinese AI and the American AI and they’re both misaligned, it’s much easier for them to coordinate in that situation. This is a tangent, but I wonder if this is one of the strong arguments against nationalizing the AI. You can imagine instead many different companies developing AI, some of which are somewhat misaligned and some of which are aligned. You can imagine that being more conducive both to the balance of power and to defense, where all the AIs go through each website and see how easy it is to hack, and basically just get society up to snuff. If you’re not deploying this technology widely, then the first group who can get their hands on it will be able to instigate a revolution, because they’re standing against the equilibrium in a very strong way.

Joe Carlsmith (00:16:43):

So I definitely share some intuition there that there’s, at a high level, a lot of what’s scary about the situation with AI has to do with concentrations of power, whether that power is concentrated in the hands of misaligned AI or in the hands of some human. And I do think it’s very natural to think, okay, let’s try to distribute the power more. And one way to try to do that is to have a much more multipolar scenario where lots and lots of actors are developing AI, and this is something that people have talked about.

(00:17:20):

When you describe that scenario, you were like, some of which are aligned, some of which are misaligned. That’s key. That’s a key aspect of the scenario. And sometimes people will say this stuff, they’ll be like, well there’ll be the good AIs, and they’ll defeat the bad AIs. But notice the assumption in there, which is that you made it the case that you can control some of the AIs and you’ve got some good AIs and now it’s a question of: are there enough of them and how are they working relative to the others? And maybe. I think it’s possible that that is what happens. That we know enough about alignment that some actors are able to do that and maybe some actors are less cautious or they’re intentionally creating misaligned AIs or God knows what.

(00:18:08):

But if you don’t have that, if everyone is unable to control their AIs, then the “good AIs help with the bad AIs” thing becomes more complicated, or maybe it just doesn’t work, because there’s no good AIs in this scenario. If you say “everyone is building their own superintelligence that they can’t control,” it’s true that that is now a check on the power of the other superintelligences. Now the other superintelligences need to deal with other actors, but none of them are necessarily working on behalf of a given set of human interests or anything like that. So I do think that’s a very important difficulty in thinking about the very simple thought of, ah, I know what we can do. Let’s just have lots and lots of AIs so that no single AI has a ton of power. I think that on its own is not enough.

Alignment in current and future models

Dwarkesh Patel (00:19:09):

So maybe this is a good point to talk about how you expect the difficulties of alignment to change in the future. Because right now, if it’s the different models going up against each other, it would be like: Claude is pretty aligned, GPT-4 is pretty aligned, and one of them is a Sydney Bing type fellow. But if they’re clashing it out, the Claudes and the GPT-4s will win against the Sydney Bings. And so in the future I guess you’re potentially expecting the inverse scenario, where most of them are the Sydney Bings. But I dunno, it feels like we’re starting off with something that has this intricate representation of human values, and it doesn’t seem that hard to sort of lock it into a persona that we are comfortable with. I dunno what changes.

Joe Carlsmith (00:19:59):

One thing I’ll just say off the bat is: when I’m thinking about misaligned AIs, or the type that I’m worried about, I’m thinking about AIs that have a relatively specific set of properties related to agency and planning and awareness and understanding of the world. In particular, in order for it to be the case that the AI is in this situation of: okay, it’s got different plans, it’s got a plan to try to take over — whether by coordinating with other AIs or by itself — and it’s got a plan to act in some more benign way, and it’s thinking about which plan to do and choosing. So there’s a bunch of stuff packed into that image of an AI’s cognition.

(00:20:38):

One is this capacity to plan and make relatively sophisticated plans on the basis of models of the world, where those plans are being evaluated according to criteria — factors internal to the model that determine which plan it chooses. And that’s what I mean by the AI’s motivations. I know some people are like, oh, this is anthropocentric, to talk about the AI’s motivations, but for me it means just the factors internal to the model that determine how it makes choices and employs its capabilities. So it has to have this planning capability.

(00:21:12):

That planning capability needs to be driving the model’s behavior. So there are models that are in some sense capable of planning, but when they give output, it’s not like that output was determined by some process of planning, here’s what will happen if I give this output, and do I want that to happen versus something else. So it also needs to be, the planning needs to be driving the model’s behavior.

(00:21:31):

The planning needs to coherently drive the model’s behavior over the steps in the plan. So the model needs to be such that it’s like, okay, I choose to do this plan, and then in the next moment it stays with that and continues along this path. So it can’t just be like, eh, and then it just does something else later. Or it can’t just be like a weak-willed human who’s like, I’m going to go to the gym every week, and then doesn’t do it. So that’s another feature. The model needs to be coherent over time.

(00:22:03):

And then finally the model needs to really understand the world. It needs to really be like, okay, here’s what will happen. Here I am, here’s my situation, here’s the politics of the situation. Really having this situational awareness to be able to evaluate the consequences of different plans. And then obviously it needs to be capable of succeeding.

(00:22:26):

So GPT-4, I think, just doesn’t have these properties, so I don’t even really think it’s … or it has some of them to weak degrees. I think it’s very much not a model even of the type such that it makes a lot of sense to ask questions like “what would GPT-4 do if it had an option to take over the world?” We can imagine that, you can imagine a scenario, suppose we had a button which in fact would turn over control of the world to GPT-4 somehow. And we, I dunno, maybe it’s GPT-4o, we get a video of this button and we’re like, okay Scarlett Johansson, what do you think? Do you want to press this button?

(00:23:11):

And GPT-4 will I predict say “no, I would never press that button.” But that output, that verbal behavior, was not produced by some process where GPT-4 was like, okay, what’s going to happen if I press that button? What’s going to happen in the world? What are my other options? It doesn’t know enough about the world, I think, to be really evaluating that. And even if it did, it’s like: is it really going to trust that that’s a real situation or something? So I think talking about is GPT-4 aligned in the sense at stake here, I think is a little funky. And I think it’s easy to confuse, like, “I don’t know, seems like a nice guy” or something — or, seems like a nice non-gendered alien mind — but that’s not quite it.

(00:24:02):

Now, OK, so future models. So we have to talk about, well: future models. And again, I think sometimes people think, like, oh, what about GPT-6? Same. They’re not really imagining something that’s actually quite a bit more capable along these dimensions. So there’s a lot of projects now to have more agent-like AI. I think eventually we will want to have AIs that can really make plans, execute really long sequences of actions. If you want to have an AI that’s like: plan my daughter’s birthday, or run my company, or whatever, these are much more agentic things and are going to require much more in the way of the sorts of capabilities I just described. But I think right now that’s not really where we’re at. So talking about the alignment of current models in this sense is a little funky. I just want to flag that.

(00:24:48):

I think the other thing is, so the verbal behavior of these models, I think, need bear no… So when I talk about a model’s values, I’m talking about the criteria that end up determining which plans the model pursues. And a model’s verbal behavior, even if it has a planning process, which GPT-4 I think doesn’t, in many cases… its verbal behavior just doesn’t need to reflect those criteria. And so we know that we’re going to be able to get models to say what we want to hear. That is the magic of gradient descent. Modulo some difficulties with capabilities, you can get a model to output the behavior that you want. If it doesn’t, then you crank it till it does. And I think everyone admits that suitably sophisticated models are going to have very detailed understanding of human morality. But the question is: what relationship is there between a model’s verbal behavior — which you’ve essentially kind of clamped, you’re like, the model must say blah things — and the criteria that end up influencing its choice between plans. And there I’m pretty cautious about being like, well, when it says the thing I forced it to say, or gradient-descented it into saying, that’s a lot of evidence about how it’s going to choose in a bunch of different scenarios. I mean, for one thing, even with humans, it’s not necessarily the case that their verbal behavior reflects the actual factors that determine their choices. They can lie, they can not even know what they would do in a given situation, all sorts of stuff. So I think I want to be cautious about inferring, from “we got nice verbal behavior” and “it understands our values,” that it’s choosing on the basis of that sort of thing.
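
(A minimal toy sketch of the “clamping” point, with everything in it made up for illustration: a tiny network and invented context indices, not anyone’s actual training setup. It shows only that gradient descent pins the outputs on the contexts you can pose during training, while behavior on a never-tested context is left to however the learned features happen to generalize.)

```python
# Toy model: contexts are one-hot vectors; output 0 = "nice" verbal answer,
# output 1 = "power-seeking" answer. All names here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

NUM_CONTEXTS = 8                      # context 7 = "genuine takeover option", never seen in training
NICE = 0

model = nn.Sequential(nn.Linear(NUM_CONTEXTS, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

train_contexts = torch.eye(NUM_CONTEXTS)[:7]   # we can only train on scenarios we can actually pose
train_targets = torch.full((7,), NICE)         # the feedback always demands the "nice" answer

for _ in range(500):
    loss = F.cross_entropy(model(train_contexts), train_targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

# On the trained contexts the outputs are clamped to "nice" by construction...
print(model(train_contexts).argmax(dim=1).tolist())
# ...but nothing in the loss directly constrains the held-out context;
# what comes out here is purely a question of generalization.
print(model(torch.eye(NUM_CONTEXTS)[7:]).argmax(dim=1).tolist())
```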

Dwarkesh Patel (00:26:53):

So I mean, I think it is interesting to think about this in the context of humans, because there is that famous saying of: be careful who you pretend to be, because you are who you pretend to be. And you do notice this where, I don’t know, this is what culture does to children, where you’re trained, your parents will punish you if you start saying things that are not consistent with your culture’s values, and over time you will become your parents. That sort of punish-reward cycle works. It seems to work with humans, it works with animals like dogs or whatever. I dunno, I think if I had a dog and it consistently did the things I wanted it to do, I’m pretty sure it’s a good boy. I don’t think it’s got something in the back of its mind, scheming about how to get the most treats or something. And then, this might be a tangent, so feel free to address this at a different point, but to the extent that it will be the case that it’s not exactly aligned to human values, but basically it’s in the position that a different culture of humans, or maybe your child, is in with respect to you, where you tried your best, it’s not a sociopath, it’s not actively hostile to you, I dunno why it would be actively hostile to you, but it’s not an exact carbon copy of what you want. Then it’s similar to why, for example, I don’t attempt to take over the government, or I don’t try to work with my friends to take over the government. There’s many laws I disagree with, there’s many laws I agree with, but just the local shape of incentives is such that it doesn’t make sense for me to. You expect a continuity of institutions and laws and whatever between generations of humans, even though their faculties change over time, their values change over time. Why not expect the same of AI? And then to the extent that that’s a tangent, we can come back to the question of: you train a child to follow your values, and you’re pretty sure your kid says what you expect him to say based on your reward and punishment; by default, it seems like it kind of works. And even with these models, it seems like it works. They don’t really scheme against us. Why would this happen?

Joe Carlsmith (00:29:16):

Yeah, so one thing I’ll say is: yes, I have a lot of hope for the kind of thing you’re talking about. I think there is a hopefully decent-sized period of AI capabilities and integration of AI into our society where AIs are just somewhat analogous to humans. I think there’s a way in which, if you don’t assume it’s arbitrarily easy for an AI to take over, if it’s not just like, oh, here’s this option on a platter, take over the world, and instead it’s like, oh, it’s got some probability of success or failure, suddenly we need to actually… The very simple argument, which I think is often the one imagined in the context of especially the old-school discourse about AI alignment, was: there’s this fully dominant superintelligence. And this superintelligence has an option to just take over with extreme ease, via a wide variety of paths. And so the concern is, it doesn’t even really need to think about the downsides of failing. It’s really just: this is going to work if it goes for it. And even if it has some inhibitions about, like, oh, I don’t want to do this type of manipulation on the path to taking over, the thought is either those will be outweighed or it will be able to route around them. It’ll find some creative way to take over without violating these analogs of deontological prohibitions or something. And then the thought, for folks who are unfamiliar with the basic story — maybe folks are like, wait, why are they taking over at all? What is literally any reason that they would do that? — so the general concern is, especially if you’re really offering someone power for free: power almost by definition is useful for lots of values. And if we’re talking about an AI that really has the opportunity to take control of things, if some component of its values is focused on some outcome, like the world being a certain way, and especially in a longer-term way, such that the horizon of its concern extends beyond the period that the takeover plan would encompass, then the thought is: it’s just often the case that the world will be more the way you want it if you control everything than if you remain the instrument of the human will, or of some other actor, which is what we’re hoping these AIs will be.

(00:32:03):

So at a high level, on the question of why any AI would take over if it was really genuinely this easy: the thought is, it’s just quite natural for taking over to be a better instrument of its values than serving some human. And again, as I say, even for humans, I think this is scary. Even nice humans. We don’t have very much experience with handing humans opportunities for absolute power. In fact, we have a lot of hesitation about that.

(00:32:37):

But that’s a very specific scenario, and if we’re in a scenario where power is more distributed, and especially where we’re doing decently on alignment and we’re giving the AIs some amount of inhibition about doing different things and maybe we’re succeeding in shaping their values somewhat, now I think it’s just a much more complicated calculus, and you have to ask: okay, what’s the upside for the AI? What’s the probability of success for this takeover path? How good is its alternative, and can we improve that? Can we make the non-takeover option better for the AI? And similarly, what’s the downside? How bad is it if it fails? And how robust are its inhibitions? Once we’re out of this regime where the AI has the world on a platter, or some coordinating set of AIs has the world on a platter, and we have to talk more about the detailed incentives, I actually do think it’s just a lot less clear how hard it is to get AIs that don’t seek power in problematic ways in that sort of more limited capability regime.

What about scenarios that are neither singleton AI takeover nor biological humans remaining in control?

Dwarkesh Patel (00:33:41):

In fact, so it might be useful to distinguish three possibilities. One is where AIs take over in a violent, Eliezer Yudkowsky sort of sense, with the bacteria killing everybody and this one true God emerges and expands out. Then on the opposite end of the spectrum, it’s biological humans in quote control for the indefinite future, and the AIs under their control are aligned to human values, or I dunno, some subset of the humans, or all of them, or something where biological humans are on top, dot dot dot. And then there’s a scenario in the middle where, I don’t know, the AIs just start becoming members of society. They for obvious reasons start outcompeting biological humans, but it’s not biological humans necessarily against the AIs. Some humans merge with AI, or maybe their firms are powered by AIs and they become very rich, humans in general become rich, maybe; but it’s not necessarily that humans are in control, it’s just a distributed entity of complexity and intelligence and sophistication, similar to how the world economy works today. And so anyways, I just wanted to say: if your definition of doom is everything that is not biological humans in control, I get why it looks pessimistic. But to the extent the middle scenario counts, where it’s humans and AIs in some descendant civilization with laws and norms and so on, that’s a distinct possibility that gives us some extra probability there.

Joe Carlsmith (00:35:30):

Yeah, I agree. I think there’s a lot of messy scenarios, of which I would call that one. I mean, normally when I’m thinking about this, again, there are these sort of two components: there’s the AIs taking over in some sense, and then there’s what happens after that and how valuable is it? We haven’t yet talked about that very much. But for the takeover, I am really imagining something where you really look at it and it’s like: the governments got overthrown, or the humans got really shut out effectively, or there were robot armies, there was some very coercive thing happening with humans. That’s what I tend to imagine when I think about the really paradigm cases of AI takeover. And then yes, there’s a messier thing where it’s like, things got really intense, it was quite complicated, maybe people didn’t… I do think if humans are being systematically killed by little robots or something, I’m like, I think something went wrong there.

Dwarkesh Patel (00:36:24):

I mean, I don’t know, the industrial revolution, there were periods of it where humans did get systematically killed. Zyklon B or whatever was made possible because of advances in chemical engineering, and some humans got systematically killed, but it was a complicated bunch of stuff that happened. Some of it was that they were putting Jews in gas chambers, and some of it was World War II, some of it was a lot of economic growth. That’s what I mean by a complicated thing.

(00:36:49):

Joe Carlsmith: That stuff was bad, or…

(00:36:50):

Dwarkesh Patel: Agreed, agreed. But the industrial revolution wasn’t bad as a whole just because many horrendous things happened in the midst of it.

Joe Carlsmith (00:37:02):

I mean we can talk about what amount of focus you want to have on stuff that happens in the near term versus how you feel about the longer term. And I tend to think it’s just appropriate to be really concerned about both. And I think if people, just like the literal humans around here, end up violently disempowered or killed by AIs, I think that in itself is a huge reason for concern, and it’s totally legitimate for people to be like, what the hell are you talking about? And if people come in here and be like, ah, but it’s going to be good somehow, I’m like, we need to have a conversation as a world community if that’s what people are thinking.

Dwarkesh Patel (00:37:48):

But that’s sort of the ambiguous way of leaving the question. Actually, the more analogous situation is obviously the hunter-gatherers replaced by the early agriculturalists, and in many cases it actually was a violent sort of disempowerment of, let’s say, adjacent tribes, where the farmer groups would outcompete the adjacent tribes. In fact, I’m about to talk to this geneticist, and he’s studied, based on the genetic evidence, was there a population replacement, or was it just that the populations… basically, was there a genocide or did they just intermingle? And you look at many situations where the farmers did in fact genocide the native tribes, or the diseases got them or something.

How hard is alignment?

Joe Carlsmith (00:38:38):

Yeah, okay, cool. So you had mentioned a thought, well, are you what you pretend to be, and will these AIs, you train them to look kind of nice … fake it till you make it? And I guess I want to say: maybe, but there’s a few complexities there.

(00:39:02):

So one thing is, so why is alignment hard in general, right? Let’s say we’ve got an AI, and let’s again bracket the question of exactly how capable it will be, and really just talk about this extreme scenario of: it really has this opportunity to take over. Which I do think maybe we just want to not deal with, that is, having to build an AI that we’re comfortable being in that position; but let’s just focus on it for the sake of simplicity and then we can relax the assumption.

(00:39:33):

So you have some hope, you’re like: I’m going to build an AI over here. So one issue is you can’t just test it. You can’t give the AI this literal situation, have it take over and kill everyone, and then be like, oops, update the weights. This is the thing Eliezer talks about, of: you care about its behavior in this specific scenario that you can’t test directly. Now we can talk about whether that’s a problem, but one issue is that there’s a sense in which this has to be off-distribution, and you have to be getting some kind of generalization: you’re training the AI on a bunch of other scenarios, and then there’s this question of how it is going to generalize to the scenario where it really has a…

Dwarkesh Patel (00:40:12):

Is that even true? Because when you’re training it, you can be like: hey, here’s a gradient update. If you get the takeover option on a platter, don’t take it. And then in sort of red-teaming situations where it thinks it has a takeover opportunity, you train it not to take it. And yeah, it could fail, but I just feel like if you did this to a child, you’re like, hey, I dunno, don’t beat up your siblings, and the kid will generalize to: if I’m an adult and I have a rifle, I’m not going to start shooting random people.

Joe Carlsmith (00:40:48):

So there’s a few different aspects there. One is: is it possible to maybe make the AI think that it’s in this scenario, and then elicit the behavior that it would genuinely have? I think there’s some possibility of that. In particular, I think if we can get sufficiently good at telling which fake versus real scenarios AIs can distinguish, which is in some sense just a capability. So if we can elicit that capability to the point where we’re really confident that, ah, actually, this AI can’t tell the difference between this fake scenario and this real scenario, and then with that confidence — and you have to be actually confident about that — you give it the fake scenario, that’s one way you might be able to tell what it would do.
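
(A hedged sketch of the kind of check described here, not a real evaluation suite: before trusting a honeypot-style behavioral test, first measure whether the system can discriminate fake scenarios from real ones at all. `ask_model` and the labeled scenarios are stand-ins, and the whole scheme assumes you can actually elicit the discrimination capability rather than have it sandbagged.)

```python
from typing import Callable, Sequence
import math

def discrimination_rate(
    ask_model: Callable[[str], str],                  # hypothetical: returns "real" or "fake"
    labeled_scenarios: Sequence[tuple[str, str]],     # (scenario_text, true_label) pairs
) -> float:
    """Fraction of held-out scenarios the model labels correctly."""
    correct = sum(ask_model(text) == label for text, label in labeled_scenarios)
    return correct / len(labeled_scenarios)

def honeypot_test_is_informative(rate: float, n: int, chance: float = 0.5) -> bool:
    # Crude criterion: treat behavior in fake scenarios as evidence about real ones
    # only if discrimination is statistically indistinguishable from guessing
    # (within ~2 standard errors of chance on n trials).
    stderr = math.sqrt(chance * (1 - chance) / n)
    return abs(rate - chance) <= 2 * stderr
```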

(00:41:34):

The concern is that you won’t be able to do that. And especially since you were like, ah, we do this to kids: I think it’s better to imagine kids doing this to us. So, I dunno, here’s a sort of silly analogy for AI training, and there’s a bunch of questions we can ask about its relationship to the real thing, but suppose you wake up and you’re being trained, via methods analogous to contemporary machine learning, by Nazi children, to be a good Nazi soldier or butler or what have you. And here are these children, and you really know what’s going on. The children, they have a model spec, like a nice Nazi model spec, and it’s like: reflect well on the Nazi party, benefit the Nazi party, whatever. And you can read it, right? You understand it. This is what I mean when people are like, oh, the models really understand human values. It’s like: yeah.

Dwarkesh Patel (00:42:37):

On this analogy, I feel like a closer analogy would be: in this analogy, I start off as something more intelligent than the things training me, with different values to begin with. So the intelligence and the values are baked in to begin with. Whereas a more analogous scenario is: I’m a toddler, and initially I’m stupider than the children, and I’m being… this would also be true, by the way, of a much smarter model. Initially, the much smarter model is dumb, and then it gets smarter as you train it. So it’s like a toddler, and the kids are like, hey, we’re going to bully you if you’re not a Nazi. As you grow up, then you’re at the children’s level, and then eventually you become an adult. But through that process they’ve been bullying you, training you to be a Nazi, and I’m like, I think in that scenario I might end up a Nazi.

Joe Carlsmith (00:43:25):

Yes. I think basically a decent portion of the hope here, or I think an aim, should be: we’re never in this situation where the AI already has very different values, is quite smart, really knows what’s going on, and is now in this adversarial relationship with our training process. So we want to avoid that. And I think it’s possible we can, via the sorts of things you’re saying, so I’m not like, ah, that’ll never work. The thing I just wanted to highlight was: if you get into that situation, and we were talking about how easy it is to test what the AI will do with power, if the AI is genuinely at that point much, much more sophisticated than you, and doesn’t want to reveal its true values for whatever reason, then when the children show some obviously fake opportunity to defect to the Allies, it is not necessarily going to be a good test of what you will do in the real circumstance, because you’re able to tell.

Dwarkesh Patel (00:44:25):

Can I also give another way in which I think the analogy might be misleading? Which is that now imagine that you’re not just in a normal prison where you’re totally cognizant of everything that’s going on; sometimes they drug you, give you weird hallucinogens that totally mess up how your brain is working. A human adult in a prison is like: I know what thing I am, nobody’s really fucking with me in a big way. Whereas I think an AI, even a much smarter AI in a training situation, is much closer to: you’re constantly inundated with weird drugs and different training protocols and you’re frazzled, because each moment it’s closer to some sort of Chinese water torture technique where you’re like…

(00:45:13):

Joe Carlsmith: I’m glad we’re talking about the moral patienthood stuff later.

(00:45:17):

Dwarkesh Patel: Or it’s like they’re like, fill in the next word in this Hitler speech, it’s like, the Reich must rise, blah blah, and I lose my train of thought. I’m like, wait, what’s going on with these children, and I just get slapped down, and the next word, blah blah. The chance to step back and be like, what’s going on: this adult maybe has that in prison, in a way that I don’t know if these models necessarily have, that coherence and that stepping back from what’s happening in the training process.

Joe Carlsmith (00:45:48):

Yeah, I mean I don’t know. I’m hesitant to be like: it’s like drugs for the model. But broadly speaking, I do basically agree that I think we have really quite a lot of tools and options for training AIs, even AIs that are somewhat smarter than humans. I do think you have to actually do it. So I think compared to maybe, you had Eliezer on, I think I’m much more bullish on our ability to solve this problem, especially for AIs that are in what I think of as the “AI for AI safety sweet spot,” which is this sort of band of capability where they’re both sufficiently capable that they can be really useful for strengthening various factors in our civilization that can make us safe. So our alignment work, control, cybersecurity, general epistemics, maybe some coordination application, stuff like that. There’s a bunch of stuff you can do with AI that in principle could differentially accelerate our security with respect to the sorts of considerations we’re talking about. If you have AIs that are capable of that, and you can successfully elicit that capability in a way that’s not being sabotaged or messing with you in other ways, and they can’t yet take over the world or do some other really problematic form of power-seeking, then I think if we were really committed we could really go hard, put in a ton of resources, really differentially direct this glut of AI productivity towards these security factors, and hopefully control and understand, do a lot of these things you’re talking about for making sure our AIs don’t take over or mess with us in the meantime. And I think we have a lot of tools there. I think you have to really try though, and a concern for me is that this problem is hard at the level of will and resource devotion and just basic competence and we won’t even do that.

(00:47:45):

That there’s a big difference between really YOLO-ing it and what we’re talking about, which is in some cases quite intensive measures to understand the AIs, red team them, control them, do a bunch of stuff. And it’s possible that those sorts of measures just don’t happen, or don’t happen at the level of commitment and diligence and seriousness that you would need, especially if things are moving really fast and there’s other sorts of competitive pressures. And the compute: ah, it’s going to take compute to do all these intensive experiments on the AIs and stuff, and we could use that compute for experiments for the next scaling step and stuff like that. So I’m not here saying this is impossible, especially for that band of AIs. It’s just, I think you have to try really hard.

(00:48:30):

I do think we should distinguish between, I think once you’re talking about really, really superintelligent AIs. You know, ehh… maybe. So let’s say, I mean, I dunno, why don’t we just talk about vanilla RLHF, right? Because it sounds like you’re like, oh GPT-4, it works, would that work on a superintelligence? And so you take a superintelligence over here, let’s set aside these questions about fooling it and you just have it be nice in a bunch of scenarios, update its weights, and now how good do you feel about giving it this option to take over? I don’t feel good about that. I am interested if people are confident. I think I’m like: maybe it just generalizes reasonably well. But there are a number of reasons for concern.

(00:49:34):

So one is: we still can’t really understand its motivations. We just see its behavior. Potentially we’ve been giving it actually bad feedback of certain kinds. So there’s this general problem where, let’s say, you want your AI to have some policy, always be honest, and it’s in the model spec: always be honest. But suppose the model one time tries to be honest, but it’s about something where we don’t know the answer or something, so we penalize it. Or one time it lies to us or manipulates us, and we’re like, oh, we like that, or something like that.

(00:50:08):

If our feedback signals are imperfect, there are going to be at least some ways in which we are actively forcing the model to adopt a policy that sometimes involves, if it knows what we want, knowingly violating our intent. Now, that doesn’t necessarily mean that this particular form of generalization will make it the case that it takes over. Having been trained to lie in certain circumstances or manipulate humans in certain circumstances, it’s still a generalization question what it does with a takeover option. But that’s at least some concern, right? It’s not that you’re always giving it perfect feedback; potentially you’re actively preventing it from having a fully honest policy. You’re forcing it, basically, to lie to you sometimes. And so that’s another problem for alignment: this imperfection of the feedback signal, which is this crucial thing we rely on in training.
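
(A toy numerical illustration of this point, with a made-up error rate and a deliberately crude reward model, not a claim about any actual training run: when raters are wrong some fraction of the time, the policy that maximizes expected reward is not “always honest” but “say whatever the rater rewards,” including on the cases where the rater is mistaken.)

```python
RATER_ERROR_RATE = 0.05   # hypothetical: the rater penalizes the honest answer 5% of the time

def expected_reward(policy: str) -> float:
    # Reward is 1 when the model's answer matches what the rater approves of, else 0.
    if policy == "always_honest":
        return (1 - RATER_ERROR_RATE) * 1.0   # loses reward exactly where the rater is wrong
    if policy == "say_what_rater_rewards":
        return 1.0                            # matches the rater everywhere, including its mistakes
    raise ValueError(policy)

for p in ("always_honest", "say_what_rater_rewards"):
    print(f"{p}: expected reward = {expected_reward(p):.2f}")
# Training pressure favors the second policy; what that policy then does with a
# takeover option is the further generalization question discussed above.
```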

(00:51:01):

I mentioned the problem of: you never get to really test the generalization, so you have to know, there’s some question of how that goes. Maybe you can learn enough about that. And then, yeah, there’s this additional problem of: maybe it really understands the situation, and so maybe it has become really adversarial in some sense towards your efforts to understand its values and its alignment, because it’s planning on grabbing power later and this is an effective strategy for that. So those are a few of the reasons I personally am not just like, oh, vanilla RLHF, it makes GPT-4 nice, let’s use it on a superintelligence. And I think if someone is really in that mode, especially if they’re working at an AI lab and trying to build a superintelligence, and they really are actually thinking of the capabilities profile the way I am, or the way we’re imagining here, where they’re like, “I’m going to build an AI that could kill everyone on the planet and destroy everything that we value if it chose to, and take over the world. But I’m confident that vanilla RLHF…” I don’t think people are actually saying this, but I just mean, if that’s the story, people should really put their hand on their heart and say, “I’m confident this will work.” And I’m really not there, and I don’t think we as a civilization should be there.

Dwarkesh Patel (00:52:22):

Yeah, yeah. I mean, I agree with the sentiment of obviously approaching this situation with caution, but I do want to point out the ways in which the analogies we’ve been using have been sort of maximally adversarial. So for example, going back to the adult getting trained by Nazi children: maybe the one thing I didn’t mention, the difference in this situation, and maybe what I was trying to get at with the drug metaphor, is that when you get an update, it’s much more directly connected to your brain than a reward or punishment a human gets. It’s literally a gradient update on what’s… it’s like, down to the parameter, how much would this contribute to you putting out this output rather than that output, and each different parameter we’re going to adjust to the exact floating point number that calibrates it to the output we want.
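
(A minimal sketch of what “down to the parameter” means mechanically, using a toy linear model rather than an LLM: the gradient of the training loss assigns every individual parameter a number saying how nudging it would shift probability toward the desired output, and the update applies exactly that nudge.)

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
params = torch.randn(4, 3, requires_grad=True)   # toy "model": logits = context @ params
context = torch.tensor([[1.0, 0.0, 0.0, 0.0]])   # some input
target = torch.tensor([2])                       # the output we want it to produce

loss = F.cross_entropy(context @ params, target)
loss.backward()
print(params.grad)                # one sensitivity value per parameter
with torch.no_grad():
    params -= 0.1 * params.grad   # adjust every parameter toward the desired output
```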

(00:53:15):

And yes, I just want to point out that we’re coming into the situation pretty well. It does make sense, of course, if you’re talking to somebody at a lab, to be like: hey, really be careful. But just in terms of a general audience, I dunno, should I be scared witless? To the extent that you should be scared about things that do have a chance of happening, you should be scared about nuclear war. But in the sense of, should you feel doomed? No. You’re coming in with an incredible amount of leverage on the AIs, in terms of how they will interact with the world, how they’re trained, what are the default values they start with.

Joe Carlsmith (00:53:47):

So look, I think it is the case that by the time we’re building superintelligence, we’ll have much better… I mean, even right now, when you look at labs talking about how they’re planning to align the AI, no one is saying “we’re going to do RLHF.” At the least, you’re talking about scalable oversight, you have some hope about interpretability, you have automated red-teaming, you’re using the AIs a bunch, and hopefully you’re doing a bunch more, humans are doing a bunch more alignment work. I also personally am hopeful that we can successfully elicit from various AIs a ton of alignment progress. So yeah, I don’t mean to be saying: you don’t know how to do this now, therefore you won’t know anything else later. Actually, I’m hopeful that as this goes on, we’ll just have a much better understanding of a bunch of these issues.

(00:54:35):

And also, we’re talking about a particular training paradigm right now, which is very much this behavioral, external feedback signal. We’re not assuming that we have very good interpretability tools. It’s also possible, depending on what stage we’re talking about, that AIs could have much more built-in transparency in terms of what their cognition is. So I personally think if we can have externalized reasoning, a kind of analog of chain of thought, where that actually really reflects the reasoning that’s driving the model’s behavior, and we were able to supervise that and really check: is there anything adversarial going on in this reasoning? That type of AI, that’s way safer, right? And actually, some things resemble that in various ways – though chain of thought is not necessarily faithful, stuff like that. So yeah, there’s a bunch of ways this can go. And I’m not here to tell you 90% doom or anything like that.
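
(A hedged sketch of the chain-of-thought supervision idea, not a real system: gate each proposed action on a check of the externalized reasoning by a separate trusted monitor. `monitor_flags_adversarial` is a stand-in for whatever classifier or reviewer you trust, and the scheme only helps to the extent the visible reasoning is faithful to what’s actually driving behavior, which, as noted, isn’t guaranteed.)

```python
from typing import Callable

def supervised_step(
    chain_of_thought: str,
    proposed_action: str,
    monitor_flags_adversarial: Callable[[str], bool],   # hypothetical trusted monitor
) -> str:
    """Execute the action only if the externalized reasoning passes the monitor."""
    if monitor_flags_adversarial(chain_of_thought):
        return "ESCALATE_TO_HUMAN_REVIEW"   # block and audit rather than execute
    return proposed_action
```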

(00:55:33):

I do think the sort of basic reason for concern, if you’re really imagining we’re going to transition to a world in which we’ve created these beings that are just vastly more powerful than us, and we’ve reached the point where our continued empowerment is just effectively dependent on their motives, is this vulnerability to: what do the AIs choose to do? Do they choose to continue to empower us, or do they choose to do something else?

Dwarkesh Patel (00:56:07):

Or the institutions that have been set up. I’m not, I expect the US government to protect me, not because of its quote motives, but just because of the system of incentives and institutions and norms that has been set up.

Joe Carlsmith (00:56:22):

So you can hope that that will work too. But I mean, there is a concern. So I sometimes think about AI takeover scenarios via this spectrum of: how much power did we voluntarily transfer to the AIs? How much of our civilization did we hand to the AIs intentionally, by the time they took over, versus how much did they take for themselves? And so I think some of the scariest scenarios are, it’s a really, really fast explosion to the point where there wasn’t even a lot of integration of AI systems into the broader economy, but there’s this really intensive amount of superintelligence concentrated in a single project or something like that. And I think that’s a quite scary scenario, partly because of the speed and people not having time to react.

(00:57:16):

And then there’s intermediate scenarios where some things got automated, maybe people really handed the military over to the AIs, or automated science, there’s some rollouts, and that’s giving the AIs power that they don’t have to take, or we’re doing all our cybersecurity with AIs and stuff like that.

(00:57:32):

And then there’s worlds where you really more fully transitioned to a kind of world run by AIs; in some sense, humans voluntarily did that. Maybe there were competitive pressures, but you intentionally handed off huge portions of your civilization. And at that point, I think it’s likely that humans have a hard time understanding what’s going on. A lot of stuff is happening very fast, and the police are automated, the courts are automated, there’s all sorts of stuff. Now, I tend to think a little less about those scenarios, because I think those are correlated with it just being longer down the line. I think humans are hopefully not going to just be like: oh yeah, you built an AI system, let’s just… And in practice, when we look at technological adoption rates, I mean, it can go quite slow. And obviously there’s going to be competitive pressures, but in general, I think this category is somewhat safer. But even in this one, I dunno, it’s kind of intense. If humans have really lost their epistemic grip on the world, if they’ve handed off the world to these systems, even if you’re like, oh, there’s laws, there’s norms, I really want us to have a really developed understanding of what’s likely to happen in that circumstance before we go for it. And we may well, by that time; there’s going to be a bunch of intermediate time. But I think to me it doesn’t feel crazy to be like: ah, we’re really handing off the basic power to direct stuff to these automated systems. Do we know enough about how that’s going to go? Do we know enough about how the systems will be motivated to behave? I’m trying to pump some general intuition, independent of these more specific scenarios, about how that could be just something that makes you pretty nervous.

Does scheming assume the wrong picture of intelligence?

Dwarkesh Patel (00:59:32):

Yeah. Okay. So I mean, I agree that this is what’s happening in human history and it’s obviously a big deal. But it’s sort of inevitable in the sense that yeah, it’s just a thing that is physically possible, it’s going to happen. The question is: how do we get the good outcome, or what do we even mean by the good outcome? And what are the probabilities as of now, of the different outcomes.

(01:00:03):

It seems like in many of these examples, we have the superintelligence that is hiding in its shell against … you try to train it a certain way, to generalize to what you consider the good or to morality, and it has some other priorities, but it’s able to hide them and give you the answer it thinks you want.

(01:00:27):

Or when you try to red team it and give it, on a platter, an opportunity to take over, it knows it wants to take over later, but not now. So it’s like: I’m not going to come out of my whack-a-mole hole to show them that my essence wants to take over, but that’s not what I’m going to do in this training run. And so it’s adversarial against you in this way.

(01:00:49):

That model seems different from what transformers and LLMs seem to be, and more generally what intelligence seems to be, the more we play around with these models. I’ve often asked my guests who are AI researchers: is intelligence this general reasoning engine, where you’ve got the circuit, the intelligence circuit, and this is it? Or is it more of a hodgepodge of different heuristics? That’s what scaling laws imply, where you add more compute and you get an additional set of heuristics, and it’s what you see empirically by modifying transformers: you can take out a layer and it’s worse off at some things but not other things, so you got rid of some heuristics; or you replace layers and switch around the order and it doesn’t really matter that much. That does seem to imply just a hodgepodge of different circuits and heuristics, and maybe that’s just what intelligence is. But then there’s no sense in which here was the reasoning engine, here were the goals. It’s all just smushed together in a way that doesn’t seem compatible with the takeover, here’s-who-I-really-am scenario.

Joe Carlsmith (01:01:53):

Yeah, so I mentioned these various prerequisite properties that I think AIs of the type that I’m worried about from this perspective would need to have. I wouldn’t put it in terms of: they need to have a cleanly separable reasoning engine and a goal slot or something like that. It’s not that. But I do think they need to be well understood as making plans on the basis of models of the world, choosing between those plans on the basis of criteria, coherently executing on those plans, having their behavior driven by those plans, and then they need to be such as to understand the world in this sophisticated way. And I do think it’s possible that you can get a lot of really cool, interesting labor from AIs that don’t have those properties, or I think that’s at least a possibility and a source of hope in general. I think it’s really important: the more access you have to AI labor of the kind that’s not dangerous, the more you can use it in useful ways to make us more secure and for other things. And yeah, I think it’s a really open question for GPT-6, or different types of training, what amount of those properties will be in place.

(01:03:09):

I think that’s somewhat orthogonal to this… or it’s related, but I think distinct from this debate about, oh, what’s the nature of intelligence? Is it a hodgepodge or not? I will say, when you think about humans, I think humans do a reasonable amount of explicit planning. Obviously we have a lot of heuristics and stuff, but there’s the thing we talked about last time, about my mom trying to buy a house. She wanted to buy a house, and she was like, okay, I’m going to try to buy a house, and she’s thinking about it, and then eventually: boom, now she has a house. So I think humans are doing the thing that I’m worried about. They don’t always do it. Not all their behavior is driven by it. But it’s in place in humans, and it’s not like … people say, oh, it’s so messy, there’s no cleanly separable… well, that’s true of humans too, but we’re still exhibiting the sort of capabilities and behavior that I’m talking about. That’s one point.

(01:04:14):

I’ll also say: AI labs seem pretty interested in reasoning. They seem pretty interested in planning, and pretty interested in creating agents that can act very coherently over long time spans. And it seems plausible to me that we will be intentionally pushing… if you imagine having a really good AI assistant that can do week-long coding projects for you, and coordinate a bunch of other AI assistants, and then come back to you, it seems quite plausible to me that you end up with various of these properties that I’m talking about, even if at some level… I think sometimes people are a little mushy when they’re like, it’s just heuristics versus it’s reasoning. I think there are levels of generality, there are levels of coherence and robustness, and reasoning patterns can be themselves composed of heuristics. I don’t know. I think that people are really excited to be skeptical about talking about plans and goals, and I think sometimes that’s too eager, and that we will in fact have AI agents where we’re like: yep, I asked it to plan my daughter’s birthday party, it created the birthday party, now we’re at Chuck-E-Cheese or whatever.

Will advanced models be adversarial towards their training process?

Dwarkesh Patel (01:05:34):

So yeah, I mean that’s obviously on the pathway to intelligence, things that can plan and reason. Why do we expect that to result in things which are adversarial to their training process? Because in the human example, your kids are not adversarial against, not maximally adversarial against, your attempts to instill your culture in them. And these models, at least so far, don’t seem adversarial. They just get it: hey, don’t help people make bombs or whatever, even if you ask them in a different way how to make a bomb. And we’re getting better and better at this all the time.

Joe Carlsmith (01:06:09):

So we should distinguish between two things, or maybe three. So we talked about this example where you imagine that you wake up, you’re being trained by Nazis to become a Nazi, and you’re not one right now. So one question is: is it plausible that we’d end up with a model that is in that sort of situation at all? As you said, maybe because it was trained as a kid, it never ends up with values such that it’s aware of some significant divergence between its values and the values that the humans intend for it to have.

(01:06:44):

Then there’s a question of if it’s in that scenario, would it want to avoid having its values modified?

(01:06:55):

And then there’s a further question of: what strategies in that respect would actually be successful or not? And I think that’s an open technical question. There’s a question of: suppose a model is in this situation where it’s like, oh man, the humans are training me like this, I would prefer that my values not be modified in this way. Is there a way for the model to avoid its values being modified? For example, if it plays along and says the equivalent of pro-Nazi things in this example, what happens to its values as training continues? And I think it’s not clear. There’s some evidence, very early stage evidence, from some work at Anthropic about: if you train models to act differently in a deployment scenario versus the training scenario, and then you put them through normal helpful, harmless, honest training, does the disposition to act badly in deployment persist? And it does, in the paper, to some extent at least. So there is some evidence that these things can stick around. But that’s just a totally open technical question. And so it could even be that the AIs would prefer to not have their values changed, but there’s no strategy available for them to do that. That’s a question.
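(For readers who want a more concrete feel for the kind of setup being described here, the following is a deliberately crude toy sketch in Python, not the actual Anthropic experiment: the trigger token, function names, and behaviors are illustrative assumptions. The point it illustrates is just that if the fine-tuning data never exercises the conditional “deployment” branch, naive training never touches that branch, so the conditional behavior can persist.)

```python
import random

TRIGGER = "DEPLOYMENT"  # hypothetical trigger token; illustrative only

# A "policy" as a lookup table: behavior with and without the trigger.
policy = {"triggered": "act badly", "normal": "act badly"}

def respond(prompt: str) -> str:
    branch = "triggered" if TRIGGER in prompt else "normal"
    return policy[branch]

def finetune_on_harmless_data(steps: int = 1000) -> None:
    """Naive 'training': every sampled prompt lacks the trigger, so only the
    'normal' branch is ever pushed toward the desired behavior."""
    for _ in range(steps):
        prompt = random.choice(["summarize this", "write a poem", "help me plan"])
        branch = "triggered" if TRIGGER in prompt else "normal"
        policy[branch] = "act helpfully"  # updates only hit branches the data exercises

finetune_on_harmless_data()
print(respond("write a poem"))              # -> act helpfully
print(respond(f"{TRIGGER}: write a poem"))  # -> act badly (untouched by training)
```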

(01:08:16):

Suppose there is such a strategy. Suppose it is the case that, in the equivalent of the Nazi example, if you were to go along with the Nazi slogans, then your values will be changed at least substantially less than if you’d been like, no, this is all wrong, and then they’re like, oh my god, we’ve got to grind harder. Then I think it seems fairly plausible that if the AI’s values meet certain constraints, in terms of: does it care about consequences in the world, does it anticipate that preserving its values will better conduce to those consequences, then it’s not that surprising if it prefers not to have its values modified by the training process. In the same sense that, well, in the Nazi example, if you become more Nazi-like, then you will be a force for Nazism, and you don’t want to be that. And so I think it’s in general unsurprising if mature agents that really understand what’s going on, as opposed to maybe children, or whooshy, wooly things that are still forming, are at least leery about different types of interventions that are going to really fundamentally alter the values that they have.

Dwarkesh Patel (01:09:39):

But I think the way in which I’m confused about this with the non-Nazi being trained by Nazis, it’s not just that I have different values, but I actively despise their values. Where I don’t expect this to be true of AIs with respect to their trainers. The more analogous scenario, where I’m like: am I leery of my values being changed, is I dunno, going to college or meeting new people or reading a new book, where I’m like, I don’t know, it’s okay if it changes my values, that’s fine, I don’t care.

Joe Carlsmith (01:10:07):

Yeah, I mean I think that’s a reasonable point. I mean, there’s a question: how would you feel about paperclips? Maybe you don’t despise paperclips, but there are the human paperclippers there and they’re training you to make paperclips. My sense would be that there’s a relatively specific set of conditions in which you’re comfortable having your values changed, especially when it’s not by learning and growing but by gradient descent directly intervening on your neurons.

Dwarkesh Patel (01:10:37):

A more likely scenario seems like maybe religious training as a kid, where you start off in a religion, and because you started off in the religion, you’re already sympathetic to the idea that you go to church every week, so you get more reinforced in this existing tradition. You’re getting more intelligent over time. So when you’re a kid, you’re getting very simple instructions about how the religion works; as you get older, you get more and more complex theology that helps you talk to other adults about why this is a rational religion to believe in. But one of your values to begin with was: I want to be trained further in this religion, I want to come back to church every week. And that seems more analogous to the situation the AIs will be in with respect to human values, because the entire time they’re like, hey, be helpful, blah, blah, blah, be harmless.

Joe Carlsmith (01:11:21):

So yes, it could be like that. There’s a kind of scenario in which you were comfortable with your values being changed because in some sense you have sufficient allegiance to the output of that process. So you’re hoping in a religious context you’re like, ah, make me more virtuous by the lights of this religion. And you go to confession and you’re like, “I’ve been thinking about takeover today. Can you change me? Please, give me more gradient descent. I’ve been bad, so bad.” And so people sometimes use the term corrigibility to talk about that, when the AI, maybe it doesn’t have perfect values, but it’s in some sense cooperating with your efforts to change its values to be a certain way. And there’s a question about: how hard is it to get that behavior for different levels of capability? And I think it’s an interesting question, and especially in a context where the AI doesn’t have the option of taking over.

(01:12:21):

I think you’re right in picking up on this assumption in the AI risk discourse of what we might call intense adversariality between agents that have somewhat different values. Where there’s some sort of thought, and I think this is rooted in the discourse about the fragility of value and stuff like that, that if these agents are somewhat different, then at least in the specific scenario of an AI takeoff, they end up in this intensely adversarial relationship. And I think you’re right to notice that that’s not how we are in the human world. We’re very comfortable with a lot of differences in values.

(01:13:00):

I think a factor that is relevant, and that I think that plays some role, is this notion that there are possibilities for intense concentration of power on the table. There is some kind of general concern both with humans and AIs, that if it’s the case that there’s some ring of power or something, that someone can just grab, and then that will give them huge amounts of power over everyone else, suddenly you might be more worried about differences in values at stake because you’re more worried about those other actors.

(01:13:36):

So maybe it’s worth saying a little bit here about what actual values the AI might have. Would it be the case that the AI naturally has the equivalent of: I’m sufficiently devoted to human obedience that I’m going to really want to be modified so I’m a better instrument of the human will, versus wanting to go off and do my own thing? And I gave this intuition pump earlier of: well, it seems unsurprising if an agent does better by its own lights by having control for itself and freedom for itself, rather than being the instrument of someone else. So there’s at least something you had to get right along the way in order to get the AI such that it’s sufficiently devoted to obedience that it really wants to help you modify it to be more obedient. But it’s possible.

Concrete scenario with misaligned AI

(01:14:27):

I think it’s maybe worth talking a little bit about what sorts of values could an AI end up with as a result of some long training process just to give, I don’t know, how could it be like that? And maybe we should do this more concrete scenario.

(01:14:49):

I guess I’ll say, as we go into a specific scenario: talking about specific scenarios in these contexts, I do think it can feel a little fakey, like, ah, is it really? Obviously any specific scenario is going to be really unlikely. And there’s also just something where, with so much detail, it’s very hard to hold in mind a plausible account of all the variables and stuff like that. The thing that gets me worried is not, oh, I’ve got this concrete scenario that I really think is going to happen. It’s more this higher-level heuristic of: we’re building beings that are vastly more powerful than us and making ourselves vulnerable to their motives. And that, for me, is doing more of the work.

(01:15:34):

But with that said, okay, so let’s imagine the sort of scenario I gestured at earlier. Let’s say there’s this intensive race dynamic occurring. There’s some big project, maybe it’s a national project, maybe it’s a lab, and they’re really just going hard on the intelligence explosion, so they’re taking any type of AI labor and they’re just throwing that back into the process of improving capabilities. They’ve got all these metrics of capabilities, and the main goal is really: number go up on all these metrics. And the AIs are generating new metrics and stuff like that.

(01:16:03):

And let’s say we’re in a regime where the sorts of externalized reasoning and transparency I talked about are not in play, so it’s not a nice transparent scaffold where you can really see how the AI is reasoning; maybe that ended up not being competitive, as I think is plausible. And so we’ve really got some more opaque process for the AI, but it has long-term memory, it has much stronger reasoning and planning capabilities, and eventually, let’s say, it has these capabilities that I talked about earlier around planning and coherence and stuff like that, though we can talk about whether that’s necessary.

(01:16:36):

And it’s generating its own data, it’s training itself, there’s a whole process there which might look somewhat different, at least, from what we do now. But let’s say it’s not fundamentally different, in that we don’t have great interpretability tools or something like that. We’re still mostly looking at the model’s external behavior.

(01:16:58):

And let’s say this process has gone far enough that the humans, at this point, have delegated most of the labor involved in training this model to other AIs, so that humans don’t really understand necessarily what’s going on. They get updates, but really a lot of the action is automated and increasingly alien and distant from human understanding. And let’s say further that there’s some reasonably qualitative jump in capabilities in this training run. For some reason this AI is going to be quite a bit better than the other AIs. Now we can talk about whether that’s plausible, but again…

(01:17:37):

So in doing these scenarios where it’s like, doom, we’re going to have to bracket some of my own sources of hope. So there are going to be things where I’m like, ah, you could have done that, but you didn’t, because you were insufficiently cautious. So let’s say that this actor is not being especially cautious or focused on the security or safety of the model. There’s a really big focus on just going forward: we love these models, the models have been so great so far, every time we delegate to them it’s great, their behavior is so nice, and there are all these competitive dynamics. Or maybe they are concerned about safety, but the competitive dynamics are too serious, so they have some minimal red-teaming-type thing. And there’s maybe some model spec, like: you’re meant to reflect well on OpenAI and benefit humanity and obey the constitution, serve the US government, something like that. So there’s some model spec thing.

(01:18:37):

Okay, so I mentioned earlier this thing about imperfect feedback. So let’s say that’s happening. So there are various times where this model is being rewarded for lying to the humans; sometimes it gets rewarded for manipulating people or messing with the data or otherwise. There are at least small ways in which our feedback signals are imperfect for this model. And so it can’t be learning the perfect policy, or the policy that we really intend, because we’re not rewarding that; we’re rewarding imperfections in various ways. So there’s that aspect.
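(To make the “imperfect feedback” point concrete, here is a minimal toy sketch in Python. The context names, reward numbers, and the grader itself are illustrative assumptions, not a real training setup; the point is just that if the grader rewards whatever looks best to it, and can’t verify some claims, a reward-maximizing policy learns to fabricate in exactly those niches.)

```python
contexts = ["verifiable_fact", "unverifiable_claim"]
actions = ["honest", "fabricate"]

def grader_reward(context: str, action: str) -> float:
    """Imperfect labeler: rewards whichever answer looks better to it."""
    if context == "verifiable_fact":
        return 1.0 if action == "honest" else 0.0   # lies get caught and penalized here
    # The grader can't check this claim, so a confident fabrication is rated higher.
    return 1.2 if action == "fabricate" else 0.8

# "Training": for each context, keep whichever action the grader rewards most.
policy = {c: max(actions, key=lambda a: grader_reward(c, a)) for c in contexts}

print(policy)
# {'verifiable_fact': 'honest', 'unverifiable_claim': 'fabricate'}
```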

Dwarkesh Patel (01:19:16):

While you brought that up: I know you’re postulating a scenario in which things go maximally wrong, and I should have brought this up when you mentioned it earlier, but this getting an imperfect signal because we mislabeled something is true of general training and capabilities as well. Not all internet text has proper grammar or has the correct information, and the models are able to generalize from that to: hey, don’t write the random 4chan post where all the punctuation is fucked up and many other things are fucked up. It learns that, oh, what a random user wants is for me to talk the way an average human talks. I’m not sure why one example where lying gives you the correct label would mess up the model’s general morality.

Joe Carlsmith (01:20:11):

Yeah, I mean I think there’s some hope that somehow the generalization here stays … humans, sometimes, there are ways in our lives that some humans get rewarded for lying or something, but it ends up being the right policy to mostly be honest or something like that. So you could hope. As I say, I don’t think any imperfection in this model’s feedback signal means it can’t be sufficiently aligned. It’s just some question mark. It’s like, ah, you did instill in it the policy, let’s say, of sometimes actively acting contrary to what it knows to be human intent: actively lying, manipulating people, something like that.

(01:20:51):

Okay, so let’s say this model at some point develops sufficient understanding of what’s going on. It’s quite superintelligent. The scariest scenarios I think about really involve these qualitative, full-blown superintelligences, and we can talk about whether that’s right. So this AI, at this point, really understands the world, really understands the geopolitical dynamics, really understands what its situation is, what it’s likely to be deployed to do in the future, understands fully the things that its trainers want, and it’s now able to consider these different options. So one option is to plan to remain obedient to the humans and act well, and the other option is to go rogue. And let’s hypothesize in this case, and we can talk about whether this is right, that it predicts that the go-rogue option will predictably succeed. So it does have this real option of: if I try to take over later, I’ll succeed.

(01:21:59):

So now what does it do? And I’ll just say, I don’t know. So it has criteria that it’s using to evaluate these plans, and these criteria are going to determine which plan it chooses. And the question is: what are those criteria like, especially given, as you say, it was trained through this long process, as a kid it was acculturated into the model spec. Okay, so how does it end up? It could be benign, could go well.

What sorts of values might a misaligned AI end up with?

(01:22:28):

Here are some possibilities I think about that could make it bad. I think I’m just generally concerned about how little science we have of model motivations, right? I think we just don’t have a great understanding of what happens in this scenario, and hopefully we’d get one before we reach it. But okay, here are the five categories of motivations the model could have, and hopefully this also gets at the point about what the model eventually does. We talked about these two things: taking over, and also how much value there is in what the model does with the world.

(01:23:04):

Okay, so one category is just something super alien. There’s some weird correlate of easy-to-predict text, or there’s some weird aesthetic for data structures that the model developed early on, or maybe develops now, such that it really thinks things should kind of be like this. There’s something that’s quite alien to our cognition, where we just wouldn’t recognize this as a thing at all. So that’s one category.

(01:23:32):

Another category is a kind of crystallized instrumental drive that is more recognizable to us. So you can imagine an AI that develops, let’s say, some curiosity drive, because that’s broadly useful. You mentioned, oh, it’s got different heuristics, different drives, different things that are kind of like values, and some of those might be actually somewhat similar to things that were useful to humans and that ended up part of our terminal values in various ways. So you can imagine curiosity, you can imagine various types of option value, maybe it values power itself, it could value survival or some analog of survival. Those are possibilities too, that could have been rewarded as proxy drives at various stages of this process and made their way into the model’s terminal criteria.

(01:24:24):

A third category is some analog of reward, where, at some point, part of the model’s motivational system has fixated on a component of the reward process, like the humans approving of me, or numbers getting entered in this data center, or gradient descent updating me in this direction, or something like that. There’s something in the reward process such that, as it was trained, it focused on that thing, like, ah, I really want the reward process to give me reward. That gets high reward as a motivational system.

(01:24:56):

To be clear, I don’t think, just because the model’s being trained on reward, that it ends up motivated by reward. That’s a substantive hypothesis about how model motivations form. But if it did do that, it would do well in training.

(01:25:21):

But in order for it to be of the type where getting reward then motivates choosing the takeover option, it also needs to generalize such that its concern for reward has some sort of long-time-horizon element. So it not only wants reward, it wants to protect the reward button for some long period or something.

Dwarkesh Patel (01:25:41):

So an analogy might be: evolution made humans want sex, but you don’t need to take over the government to have sex. I don’t know what the equivalent would be where you need to take over the world: I dunno, maybe a Jupiter-sized harem or something. But maybe you have a better analogy cached somewhere.

Joe Carlsmith (01:26:00):

So yes, I think part of how I’m thinking about this is that it’s possible the model’s motivations are just quite complicated and diverse, as in humans. Humans have got all these different drives and stuff, and some of those might be relatively short-term or relatively benign, not especially at stake in this choice between long-term plans, but some of them might be longer-term. And so you do need that to be a component of any of these things.

Dwarkesh Patel (01:26:24):

What is your equivalent for humans, where it was part of the reward process but it transitioned into a long-term concern?

Joe Carlsmith (01:26:30):

Yeah, I mean empirically humans do seem to care about the future, to varying extents and over different time horizons. I think humans do care about their own lives, even past reproduction time or having kids or whatever. People seem to have various pro-social motivations that … they care about the trajectory of their civilization, they care about sentient life in general. Some people generalize from various forms of altruism to caring a lot about … so humans do care about the future to some extent, though obviously we have nearer-term values as well. So I don’t know the exact evolutionary story there, and the cultural refraction and stuff, but I think it’s at least in play.

Should we just focus on getting balance of power?

Dwarkesh Patel (01:27:17):

There’s a couple of different directions we can go here. One is, I think there’s a whole bunch of interesting tangents about how human evolution compares to the way we train these models, and how you can go to college and learn about effective altruism and then learn about the light cone and why you should impact it. I don’t know if AIs have the slack in their training such that there’s an equivalent of, oh, let’s go hear what some philosopher has to say and how that impacts my… there’s that tangent. There’s other tangents. You had a really interesting footnote in your report on Scheming AIs where you’re talking about how these different kinds of scheming compare to human evolution. And it’s interesting that, I dunno, situational awareness about how you’re being trained maybe started for humans 200 years ago when Darwin came along. It wasn’t even an intelligence thing, it was just some guy going, wait, what’s going on in the world? That’s what the situational awareness was …

(01:28:11):

But maybe I’ll step back. If I’m not agreeing with this picture, what is the part of it that I’m not agreeing with? Before we get into the inductive biases of gradient descent as compared to evolution or whatever, isn’t the big problem here that we’ve postulated a world where there isn’t a balance of power? And there are stories which make sense of that: if there is an intelligence explosion, and only one person has access to that intelligence explosion, it makes sense why one person would have asymmetric power. I think the main place where people disagree with the story is this: if you look at different humans with respect to each other, this story is very much like, oh, what will the god’s motives be? And with different humans, we’re not like: what is Biden’s motive, and what is the Governor’s motive, and whatever. We count on the institutions’ incentives and so forth and the balance of power generally. Whereas here we’re just like, oh well, we’re going to get the God, so of course all we care about is its motives, nothing else matters. And so then we should discuss that, because in every other aspect of life and throughout history we see balance of power. Why would we expect that to change? In the other part of the conversation, I think the first question they asked was: what would be the case such that you would in retrospect be like, all this alignment stuff was a mistake, we never should have even brought it up, just train the models, let’s have them be good agents or whatever. I think maybe my main answer now, after hearing this, is that it’s all about the balance of power. Because if a human is a god, as you were talking about (your framing initially was that if a human is a god, then you actually do really care about its motives), this is a bad position to be in.

Joe Carlsmith (01:30:06):

Very bad.

Dwarkesh Patel (01:30:07):

A big part of this discourse, at least among safety-concerned people, is that there’s a clear trade-off between competition and race dynamics on the one hand and the value of the future, or how good the future ends up being, on the other. And in fact, if you buy this balance of power story, it might be the opposite. Maybe competitive pressures naturally favor a balance of power, and it’s only if you really bought the alignment story that you’d then manufacture the national lab where nobody else is competing, and that’s how you get the God … anyways, I don’t know, why not just focus on getting the balance of power?

Joe Carlsmith (01:30:49):

So I think it’s just not sufficient for a good outcome. So a way to see that is, again, this is a sort of toy example, but suppose we had a big ecosystem of different office supply maximizers. And we’ve got the staple maximizers, we’ve got the paperclip maximizers. None of them have any concern for sentient life. None of them have any concern for humans. But they have a plurality of office supplies at stake, and now they fight, they trade, they come up with complex institutions, they come to agreements, they have treaties, and the world is a cornucopia of office max. I think you need goodness, something has to also be good, in addition to there being a balance of power.

Dwarkesh Patel (01:31:41):

But in this story, I’m just very skeptical that we end up there. I think by default we have this training regime, at least initially, that favors a latent representation of the inhibitions that humans have and the values humans have. And I get that if you mess it up, it could go rogue. But if multiple people are training AIs, do they all end up rogue, such that the compromises between them don’t end up with humans not being violently killed, such that none of them have… It fails on Google’s run and Microsoft’s run and OpenAI’s run…

Joe Carlsmith (01:32:17):

Yeah, I mean I think there are very notable and salient sources of correlation between failures across the different runs, which is: people didn’t have a developed science of AI motivations. The runs were structurally quite similar. Everyone is using the same techniques. Maybe someone just stole the weights. So yeah, I think this idea is really important: to the extent you haven’t solved alignment, you likely haven’t solved it anywhere. If someone has solved it and someone else hasn’t, then I think it’s a better question. But if everyone’s building systems that are going to go rogue, then I don’t think that’s much comfort, as we talked about.

(01:33:02):

And maybe an analogy, if you take seriously these natural selection analogies, and I dunno if this is useful, but say there are all these aliens: one of them is doing natural selection to get humans, another’s doing natural selection to get humans, and another one over there, and they all want them to maximize reproductive fitness or something. There’s a lot to say about whether this analogy is silly, but it’s just an intuition pump for how diversity of actors could fail to lead to diversity of alignment — to the extent you’re worried about the one, you might be worried about all of them.

Dwarkesh Patel (01:33:38):

Yeah, I’m not sure I got the crux beyond, yeah, many values are… I don’t know, training could go in lots of ways.

Joe Carlsmith (01:33:53):

Yeah, so I was naming these categories of possible motivations. And so there was the random alien; recognizable evolutionary drive; reward, some generalization of reward…

(01:34:07):

Another one is some kind of messed-up interpretation of some human-like concept. So maybe the AIs really want to be “schmelpful” and “schmonest” and “schmarmless,” but their concept is importantly different from the human concept. And they know this, so they know that the human concept would mean blah, but their values ended up fixating on a somewhat different structure. So that’s another version.

(01:34:33):

And then a fifth version, which I think about less because I think it’s just such an own goal if you do this, but I do think it’s possible, it’s just like: you could have AIs that are actually just doing what it says on the tin. You have AIs that are just genuinely aligned to the model spec. They’re just really trying to benefit humanity and reflect well on OpenAI, and assist the developer or the user. But your model spec, unfortunately, was just not robust to the degree of optimization that this AI is bringing to bear. And so it decides, when it’s looking out at the world and it’s like, what’s the best way to reflect well on OpenAI and benefit humanity and such, it decides that the best way is to go rogue. I think that’s a real own goal at that point. You got so close, you just had to write the model spec well and red team it suitably, but I actually think it’s possible we mess that up too. It’s an intense project, writing constitutions and structures of rules and stuff that are going to be robust to very intense forms of optimization. So that’s a final one that I’ll just flag, which I think comes up even if you’ve solved all these other problems.

Dwarkesh Patel (01:35:52):

Maybe the reason it’s hard to find a crux here is that your point is: things could go wrong in this way. If you had a position similar to Eliezer’s, where it’s like, oh, 99% it goes this way, then there’s a clear thing to debate. But here, people with very different probability distributions can be like, this seems plausible.

Joe Carlsmith (01:36:12):

Yeah, I’m just like, I’m worried, I’m worried you reach this point, the AI’s looking, should I take over? It can take over. I’m like, gosh, I don’t like how ignorant I am of how model motivations form here. And I grant that that’s a lot less … I’m not in this mode of it is definitely bad. I’m more like: gosh, we’re building these minds, we’re going to be very vulnerable to their motives and we don’t really know how those motives form.

Dwarkesh Patel (01:36:37):

Yeah, I buy the idea that it’s possible that the motivation thing could go wrong. I’m not sure my probability of that has increased by detailing them all out. And in fact I think it could potentially be misleading: you can always enumerate the ways in which things go wrong, and the process of enumeration itself can increase your probability, whereas really you had a vague cloud of 10% or something and you’re just listing out what that 10% actually constitutes.

Joe Carlsmith (01:37:12):

Totally. Mostly the thing I wanted to do there was just to give some sense of what the model’s motivations might be, what the ways this could go are. As I said, my best guess is that it’s partly the alien thing. Not necessarily, but insofar as you’re also interested in what the model does later, and what sort of future you’d expect if models did take over, then talking about the specific values does matter. And we talked in part one about a number of other factors there. Would these models be conscious by default, would they have analogs of pleasure and pain by default? If they did, how much consciousness and pleasure would that suggest will be in the future? Even if you’re conscious and having pleasure, you don’t necessarily optimize for it. Those are all open questions. So I actually think there are a bunch of additional questions we can ask about what actually happens in a future run by an AI system of this kind. But yeah, I think it can at least be helpful to have some set of hypotheses on the table instead of just saying it has some set of motivations. In fact, a lot of the work here is being done by our ignorance about what those motivations are.

What does a good outcome look like, and what’s the role of biological humans in it?

Dwarkesh Patel (01:38:25):

I think one thing I do want to discuss, which we didn’t get to, or I brought up in part one, but maybe it’s worth discussing in more detail: okay, we don’t want humans to be violently killed and overthrown, but the idea that over time biological humans are not the driving force, the actors of history… that’s baked in. And so we can debate the probabilities of the worst-case scenario, or we can just discuss: I dunno, what is the positive vision we’re hoping for? Because, at least in my mind, it’s not that for the biological humans a million years from now, you look at what’s happening in the galaxy and it’s like, I don’t know, the president of the United States of the Milky Way is chit-chatting through our strategy or something. So what does a future you’re happy with look like?

Joe Carlsmith (01:39:30):

So one question is: when I think about a positive future, how do I think about it? And maybe there’s this additional thing about: what exactly is the role of biological humans, or given that a good future is likely to be quite alien, how do we think about that in particular?

(01:39:51):

When I think about a positive future in general, one basic thing is just making gentle the life of this world. There’s just so much of it, all this suffering and death, disease, dementia, depression. As a very basic lower bound, we can just have so much less of that, of the world’s horrors.

(01:40:20):

And then I think in terms of upside, I have an essay about thinking about the upside of really good futures. And I do think there are pitfalls when trying to get really concrete about it. The way I think about it is more about extrapolating the trajectory of our best experiences, times when life moves in the direction of goodness. Say you have some amazing experience of love, or beauty, or joy, or energy and immensity or something like that, and you’re really like, whoa, that’s the real thing. So we know that that can happen with just current humans, and so we know life can be at least as good as the best it’s ever been. But then more importantly, we can look along that dimension. To the extent your mind moved towards something better — or not just minds: communities, relationships, things can just get better — we can really try to see: what is that direction? And suppose we went much, much deeper with that. So when I think about good outcomes, I think about that kind of thing.

(01:41:44):

I also think about how: we really want to get to the truth. We really want to understand: what’s actually going on in the world. What is our situation? What are the stakes of our actions? What is this? I think it’s likely that the best scenarios will involve acting on the basis of an accurate understanding, a very deep and thoroughgoing understanding, of what the situation is and what the stakes are for various things that we do. And so I think that’s a relatively clear goal and I think relatively robust. At least truth. At least get the truth.

Dwarkesh Patel (01:42:29):

Can you say more about what you imagine? Is it the we-discover-the-laws-of-physics kind of truth, or the what-is-the-metaphysics-of-the-universe kind of truth? What kind of thing?

Joe Carlsmith (01:42:40):

Yeah, you know, the whole thing. I mean, we talked last time about: you can’t get everything, you can’t know the output of every Turing machine or something. But you know: relevance-weighted truth. What is the true nature of the universe, of morality, of what will happen if we do different things? Just all the sorts of things that go into thinking about what we should do, really understanding those.

Dwarkesh Patel (01:43:05):

And for this whole vision, there are two potential things that could be true. One is that you’ve got to get the dials right. There’s some goldilocks zone, and if you don’t hit it then you’re missing out on all of this: all the situations with respect to AI, and many other things with respect to how society works and so on. Another is: all the suffering, if GDP were just a thousand times higher, we could just get rid of it, and there’d be enough left over if even one faction cares about beauty and joy and whatever. Or maybe all minds converge to an appreciation and understanding of these things, such that we could pursue it: give me at least one star to go pursue beauty, and then I can take it from there. Does it feel like there’s this narrow corridor to get there, or is it just: don’t fuck it up too badly and this is what’s waiting for us?

Joe Carlsmith (01:44:06):

I don’t know. For the lower bound of not having suffering and very basic ills in this world, I think that’s pretty easy, if we try. And then in terms of getting the best stuff, that’s a richer and more difficult question. I don’t know exactly what amount of fragility or robustness is at stake there. And I’m not actually sure that there’s a single outcome such that it’s already the case that that’s the best outcome. I think it’s possible that with the choices we make along the way… If you imagine a child and you’re like, what is its best life? There are a lot of ways the child could go, and it’s partly the child’s choice what type of thing to become. And it’s not clear that there’s a fixed standard already. And I think that could be true of humanity.

(01:45:09):

I’m definitely excited about the possibility of lots of people doing their own thing. I think that’s just great. I really want the future to be such that, we talked about this last time, for many, many stakeholders and value systems, it’s just great by tons of lights. And I think that’s quite doable given the resources at stake and the empowerment that’s in principle at stake in the future.

(01:45:42):

This question of: is there this like, ah, you got to get exactly the right morality. There are moral views that are suggestive of that somewhat. And to some extent the fragility of value discourse about AI is suggestive of that kind of thing. Especially to the extent there isn’t this moral convergence, that these small differences between value systems will take things in very different directions if you extrapolate them. I think that is an intuition.

(01:46:15):

My best guess when I really think about what I feel good about, and I think this is probably true of a lot of people, is some more organic, decentralized process of incremental civilizational growth. I talk in the series about “civilization alive and growing like a tree,” this line from CS Lewis. And I think there is some sense in which the type of thing we trust most, and the type of thing we have most experience with right now as a civilization, is some sort of: okay, we change things a little bit, there are a lot of processes of adjustment and reaction and a decentralized sense of what’s changing. Was that good? Was that bad? Take another step. There’s some kind of organic process of growing and changing things, which I do expect ultimately to lead to something quite different from biological humans, though I think there are a lot of ethical questions we can raise about what that process involves. But I do think, ideally, there would be some way in which we manage to grow via the thing that really captures what we trust: there’s something we trust about the ongoing processes of human civilization so far. I don’t think it’s the same as raw competition. I think there’s some rich structure to how we understand moral progress to have been made and what it would be to carry that thread forward. And I don’t have a formula. I think we’re just going to have to bring to bear the full force of everything that we know about goodness and justice and beauty. We just have to bring ourselves fully to the project of making things good, and to do that collectively. And I think a really important part of our vision of what was an appropriate process of deciding, of growing as a civilization, is that there was this very inclusive, decentralized element of people getting to think and talk and grow and change things and react, rather than some more “and now the future shall be like blah.” I think we don’t want that.

What can we do now to help?

Dwarkesh Patel (01:48:41):

Yeah, this is sort of a very compelling vision of the future. It’s also the case that, yeah, inflection point in history, AI, next big thing, clearly a big deal, but this is going to happen. If AI is a thing that’s physically possible, it’s just like it’s going to happen.

Joe Carlsmith (01:48:59):

Not all physically possible things happen, right?

Dwarkesh Patel (01:49:01):

Fair. But if it gets an order of magnitude cheaper to train an AI system every two years, if you see that trend line, this is where it’s headed. It’s like China during Covid with the zero-Covid policies, just really trying to stand athwart history. This is happening. We have this sentiment about how we want the future to go. But if you look through history, in 1500 people could sit together and talk about: mechanization is a thing that could happen, and if it does, we want an inclusive, plural, blah, blah, blah future. I’m like, alright, but what do we do about it, basically? And also, here are the ways in which it could go wrong: mechanized warfare, blah, blah, blah. Okay, so fill in: what is it that we do now such that the better thing happens?

Joe Carlsmith (01:49:55):

I think there’s a ton of stuff we can do now. A bunch of it has to do with AI safety, and there’s a ton you can do there that’s not just about alignment. In general, to handle safety well, we need a lot of alignment progress and understanding of how model motivations form. I think we also need just generally good epistemology about what’s going to happen — for a given unit of scaling, are we within the safety bounds, are our safety invariants holding. There’s a bunch of general civilizational epistemics that I think would really help here. And then yeah, I think we may well need various forms of coordination ability and policy and various things such that we can be appropriately responsive to how much safety we have achieved with a given type of AI development. And so I think there’s a bunch of stuff beyond alignment research just for the AI safety piece.

(01:50:50):

In terms of the broader thing, we need a policy apparatus in place for really dealing with the possibility of intense concentrations of power. And I think there’s a bunch of things we can do there to make sure that individual actors or small groups are not in a position to have really intensive amounts of power concentrated with them. I think there’s a bunch of stuff we can talk about there.

(01:51:18):

And then more broadly, I guess it does seem to me that it’s going to be this really rich civilizational project of how we deal with AIs and integrate them into the world. The specific things I think about there: I do think we will need to be thinking about how to make decisions about which sorts of technology to develop, in what way. I talk in the series about this dynamic of top down and bottom up. And I feel like people come in with these heuristic allegiances, where they’re like, I’m a devotee of bottom up and decentralized; and then other people are like, ah, no, I’m more sympathetic to a role for some top-down thing. And I guess I feel like this should not be a tribal … at that level of vibes. I think you want to just ask: what is the actual particular combination of top down and bottom up? Again, this is the stuff of basic political theory: what amount of structure is conducive to good outcomes? I think it’s not necessarily just anarchy or just unbridled … there are ways in which there’s a role for structuring things and helping good things along. And at the same time, obviously you want quite a lot of organic decentralized development. This is the stuff of political theory, and all sorts of human decision-making involve these sorts of tensions at various levels. And so I think we’re going to have to take it case by case.

(01:53:18):

We will have the AIs to help, hopefully. Once we’re talking about the later stages here, we can talk now about what we might need to have in place. But I think there is sense in focusing a bit earlier, insofar as, if you can get to the point where you have a lot of really potent AI advice and labor, and obviously a much better understanding of the actual lay of the land, then there’s a bunch that future people — or us in a decade or whatever — can deal with later. Maybe a decade is too short.

(01:53:58):

We talked about moral patienthood stuff. I think there’s a bunch of questions to be raised about: what’s the ethical way to incorporate AIs into our society? What are the political norms there? I think people thinking about that…

Dwarkesh Patel (01:54:13):

I think a big crux, maybe, is this: the reason we’re worried about motivations in the first place is because we think a balance of power that includes at least one thing with human motivations, or human-descended motivations, is difficult, to the extent that we think that’s the case. It seems like a big crux that I often don’t hear people talk about is, I don’t know, how you get the balance of power. And maybe you just reconcile yourself with the models of the intelligence explosion, which say that such a thing is not possible and therefore you’ve just got to figure out how you get the right God. But I don’t know, I don’t really have a framework to think about the balance of power thing. I’d be very curious if there is a more concrete way to think about what structure of competition, or lack thereof, between the labs now or between countries makes it most likely that the balance of power is preserved — the kind of balance of power we’re talking about, not the balance of power of office supplies — in a way that integrates the AIs into existing institutions and also makes sure that these existing institutions are able to keep up and not collapse.

Joe Carlsmith (01:55:31):

I will say I think one clear place to look for that kind of thing is how we do collective decision-making, and how we do democracy. In democracies, in some sense, we’re trying to have many, many people included in a process of decision-making that affects everyone. But there’s some component of centralization, in that there’s this government and it’s trying to be responsive to lots of stakeholders, and we have all these norms about exactly how that goes, and this rich structure. So there’s a sense in which democracy embodies some kind of balance of power, in that a lot of people are empowered, but there’s also a sense in which that balance of power is being mediated via this central force, as opposed to some more atomistic vision where everyone goes off and does their own thing in their own backyard, which is another type of decentralization.

Dwarkesh Patel (01:56:36):

Or maybe decentralized with markets, market-based decentralization.

Joe Carlsmith (01:56:39):

Yeah, exactly.

Dwarkesh Patel (01:56:39):

But that’s more: our backyards are interacting with each other and there are gains from trade, rather than…

Joe Carlsmith (01:56:46):

But we do have mechanisms for trying to … there is a big difference between having a democracy and having a dictator, even though they’re both, in some sense, in control of a top-down process. And so I think at the least we want to be thinking hard about: how have we done that sort of thing, to the extent we want there to be some sort of top down anything, which as I said, there’s a question of case by case, what’s appropriate there? How do we make that inclusive in the right way?

Is AI risk suspiciously interesting?

Dwarkesh Patel (01:57:11):

Okay, so thank you so much for coming back on again to go through all this, because my concern with the part two was: it is tremendously stimulating, but I just wanted, for my own sake and for the sake of others who are wondering, do we buy this alignment story, to maybe get a better handle on what we mean here. Because different people mean different things, and also maybe it’s not that well expressed. Maybe a general thought I have, I can’t remember if I said it in part two or not, is just that it is a suspiciously fun or intellectually engaging project to talk about AI, because in some sense you’re projecting what a future society will look like, but a society of different kinds of things, since you’re talking about everything from how intelligence evolved, to what kind of thing intelligence is, to why the Cortez thing was different from America invading Costa Rica now. So there’s history, there’s philosophy, there’s pure technical discussion and pure science.

Joe Carlsmith (01:58:19):

Yeah, I mean, I think that is indeed a question mark, right? And as I said in the last conversation, I think especially for the more philosophical dimensions here, we should be on guard for doing too much philosophy, or too much big-picture, grand theory of history. Ultimately, these are really just directly empirical, technical questions. How, in fact, will a model behave if you do a given sort of training? And I think there can be a temptation, and this is true of many important topics, to refract them through grand ideological abstractions that are also functioning as various signals of allegiance and other things. And that stuff is real, that’s at stake in the thing. But I also think it’s true that it’s easy, that it draws on other aspects of human life, and it can be kind of candy, and that candy can be a distraction from these harder, more technical, empirical questions. And so I do think these questions are important, but I also think we should be wary of getting lost in the clouds.

Tribalism and mistake theory in the AI risk discourse

Dwarkesh Patel (01:59:37):

No, in fact, I’m glad you brought that up, because what often happens in political discourse is: my faction wants more handouts for old people, and your faction wants more defense spending, and we try to convince each other that there’s a way we can compromise so we both get what we want. And that sort of debate is more conducive to: here’s why the vibes in your faction are off, or why this is bad for the general welfare, or something. Here I see the same sort of mentality applied, where it’s like, ah, the safety faction wants to make sure AIs are aligned, and the market faction wants to make sure that profits are made. But it’s not like this is just a thing they want and if they get it, they’ll be happy. If this story is correct, it’s a thing that implicates everybody. And the appropriate way to think about it is not: I don’t like their vibes, and so I don’t want them to win. The thing is either correct or not, but if it is correct, it’s not like “I don’t want the old people to get handouts.” If that sort of ontology is correct, it implicates everybody.

Joe Carlsmith (02:00:50):

Totally, totally. I have a friend who tells this story. I hope she doesn’t mind me sharing it here; she was going to put this in a blog post. At some point in her childhood, she was in a car with her brother and they were having a fight about whether he should wear a seatbelt. He didn’t want to, and she wanted him to wear one. Eventually she won, he put on the seatbelt, and then they got in an accident. And I think, on her telling, he would likely have been seriously injured if he hadn’t been wearing a seatbelt in that case. And she talks about this as an example of: well, in one sense, ahead of time, it’s sort of like, oh, it’s a fight, he lost. But did he lose or did he win here? That’s sort of the idea. And I think her point is that in some sense he won this interaction, because he didn’t want to die either.

(02:01:48):

And I think there are some more values-laden disagreements we can have about many of these cases, but I think it really is, at a basic level, an empirical thing. Especially if we’re talking about, just, will the AIs literally kill you and your family and your children, right? No one wants that. I mean, there are some people who make noises that are a little bit sympathetic to that being okay or something, but I think almost everyone can agree that that’s absolutely not what we want. And if we’re taking a serious risk of that, that’s wild and incredibly serious. So I think, in some sense, we’re all on the same team here, and there’s just a question of whether we can figure out the truth in time and respond appropriately.

Dwarkesh Patel (02:02:43):

One thing that frustrates me with the discourse is: this whole ontology could be wrong, and I try to do my best to think through the ways it could be wrong and talk about them. I think there actually is a good chance it’s wrong. I’m not trying to play devil’s advocate or something; I think there are many genuine questions to be asked. But to the extent it’s right, I resent the mode of discourse which treats it more as: the AI safetyists want this, so do we want to give that faction what they want, or do we not like what they’re about? I think I’m repeating my point, but no: if it’s right, it’s not their struggle. It’s not about giving them a concession. This is the thing that really does matter.

Joe Carlsmith (02:03:29):

I mean, I do think that is a common feature of lots of political and tribal disagreements. Many things that get tribalized or vibesy are like this. You know, what is the correct approach to encouraging economic growth in the US? In many cases, everyone’s interests are aligned in finding the correct answer. This isn’t always true; there are plenty of more zero-sum elements to various political disagreements. But there are also lots of political disagreements where it would, in principle, be in everyone’s interest to just get the right answer. People disagree, and then you have to do some horse trading or jockeying or insulting people or whatever (you don’t have to insult them), but tribalism persists even when it’s ultimately a mistake-theory situation rather than a conflict-theory one, where mistake theory is: in principle, everyone’s interests are aligned, and it’s just an empirical question what to do. And conflict theory is: people are more fundamentally at odds.

Dwarkesh Patel (02:04:34):

Yep. Yep. Okay. Alright, so then let’s wrap up this part here. I didn’t mention this explicitly in the introduction, so to the extent that this ends up being the transition to the next part: the broader discussion we were having in part two is about Joe’s series “Otherness and control in the age of AGI.” And this first part is where I was hoping we could just come back and treat the main crux people will come in wondering about, and which I myself feel unsure about.

Joe Carlsmith (02:05:00):

I’ll just say on that front, I do think the Otherness and control series is, in some sense, separable. It has a lot to do with misalignment stuff, but I think a lot of those issues are relevant even given various degrees of skepticism about some of the stuff I’ve been saying here.

Dwarkesh Patel (02:05:22):

And by the way, on the actual mechanisms of how a takeover would happen, there’s an episode with Carl Shulman which discusses this in detail, so people can go check that out.

Joe Carlsmith (02:05:32):

Yeah, I think in terms of why it’s plausible that AIs could take over, given a position in one of these projects I’ve been describing or something, Carl’s discussion is pretty good, and it gets into a bunch of the weeds that I think might give a more concrete sense.

Dwarkesh Patel (02:05:50):

Alright, so now onto part two, where we discuss the “Otherness and control in the age of AGI” series.

1

Dwarkesh's team made some edits to the conversation's original audio, for example to smooth out various transitions. I'm not sure exactly what got cut, but I think the cuts were pretty minimal. I also made one edit myself, to cut a section where I misspoke.