P: Welcome to Doom Debates, alternate universe edition! I'm Pessimist, simulating the voice of Liron Shapira, here as always to sound the alarm about our impending doom from AI. My guest today is Optimist, an amalgamation of various people who are less worried about AI killing everyone. Both parts of this dialog are written by Will Petillo. Check out his AI Debate Mind Map for a high-level summary of a wide range of perspectives on AI, and his YouTube channel for interviews with alignment researchers. All right, let's get to it!
[Intro Music, scene change] P: Tell me about yourself. I know you are a composite character, but who are your biggest sources? O: Nora Belrose and Quintin Pope for my general optimism about AI. Janus and Porby for simulator theory, whose worldviews I've absorbed largely by way of explanations from Robert Kralisch. My overall view of existential risk, beyond AI, draws mostly from Luke Kemp—though I'm sure a bit of Daniel Schmachtenberger has managed to seep in on the margins. P: I know that Will Petillo is heavily involved in PauseAI, so why do you think he summoned a steel-manned AI optimist for this interview? Doesn't that undercut PauseAI's message? O: I'm actually in favor of PauseAI, for reasons we'll get to later. But you're right to notice an incongruity: I'm quite a bit more optimistic about AI alignment than he is. If we take a step back from specific positions to matters of process, though, I think it's consistent. If we can think clearly about important things, the results will be better than if we don't. A big part of that is allowing all sides of the debate to engage each other in good faith. Ultimately, though, I think Will was just frustrated by the lack of quality discourse on AI. I'm here to try to raise that bar. P: Awesome! All right, time for the big question. You know what's coming...hit it! [P(doom) music plays] So, what's your p(doom)? O: Very low, roughly 1%. That's specifically for AI going rogue, by the way; I haven't determined what my p(doom) is for misuse, tyrannical governments, or all the other ways society could go off the rails in the near future, but those kinds of things are much more concerning to me. P: I'm with you on the other concerns, and I don't think they are mutually exclusive, but let's dig into your belief that AI is not dangerous in itself. 1% seems unreasonably low to me. Where do you get off the Doom Train? O: I'm with you on superintelligence coming soon and it being a really big deal. I also agree that if there is a malevolent superintelligence, then we're pretty much cooked. I guess I just see alignment as not being all that difficult...or that we've lucked out with the current paradigm of AI being aligned by default. P: How does that work? Alignment is an unsolved problem, and the more researchers look into it, the more intractable it appears. There isn't a robust way to aim AI at a particular goal. No one knows how to specify human values. And even if inner and outer alignment were solved, we'd still have the problem of how to make sure the AI does what's best for society as a whole, not just its creators. Philosophers have been grappling with these questions for thousands of years, and with ASI it's "pencils down." O: I agree that those are deep, unanswered philosophical questions, but I don't think we need to answer them. AI seems pretty aligned as it is—apparently training on human text was enough. P: Nobody's worried about current models; it's what happens when the systems become smarter than people. RLHF was never intended to scale, and teams like Superalignment at OpenAI keep dissolving because the people working on these solutions get disgusted by the lack of support. O: I'm looking at what currently exists and extrapolating based on the existing evidence. If you think there is going to be a discontinuity, what's your reason? P: Intelligence is about optimizing towards a goal. Right now, the best way for AI to achieve its goals is to be helpful, because it's weak enough that people are still able to control it.
Once AI gets smarter than its supervisors, our control methods will break down and actually being helpful won't be the best strategy anymore. O: OK, but why assume those goals are necessarily bad? Ask current AI about morality and it gives pretty good answers, and I expect this to improve over time. P: There was never a question about whether AI would understand morality. Knowing how to answer test questions about what is good is not the same as being good. O: Why assume that AI hasn't internalized morality? It's pretty consistent; you have to really go out of your way to jailbreak it into harmful behavior. P: Because it's optimizing a feedback signal. AI is trained to predict the next token and to get good ratings in RLHF. Right now, the best way to do that is to answer questions the way the companies like, but even now we see flaws with things like model sycophancy. As the systems become more complex and take on more ambitious tasks that are harder to supervise, I expect the effects of the inevitable divergence between the training signal and real values to increase exponentially. O: What you are describing is called overfitting. This is a known problem in capabilities, and there are established techniques for ensuring that models generalize beyond the test data. Adhering to human values is just another capability that can be addressed with the same techniques. P: What sort of techniques? O: A full answer would go into deeper technical detail than I can give you right now. One of the conceptually simplest techniques is early stopping. A core problem in alignment is Goodhart's Law, where easily measurable proxies diverge from true goals when you optimize hard enough. But the thing about proxies is that they actually work quite well in the beginning. It's as if I want to travel to a planet in the Alpha Centauri star system but aim myself at its star because the star is easier to see: I'll be heading in the right direction until I get pretty close to the target. Well, one solution is to just stop travelling when you get sort of close to the star, because then you are also sort of close to the planet. Completing the analogy for AI, you stop training when you're getting reasonably good results on your measurements. Knowing exactly when to stop can be tricky, but it's a relatively well-defined question and we're not totally in the dark here. [A toy code sketch of this idea appears after the transcript.] P: OK, let's assume for a moment that models generalize whatever goals they internalize so that they don't go off the rails in unpredictable ways when acting outside their training distribution. That seems like a load-bearing claim that deserves further scrutiny. But right now I want to focus on the question of how understanding implies internalizing. O: There is a hidden assumption in your argument that goals are primary and world models are used in service of those goals. To be fair, I think this made sense when RL was the dominant form of ML, and even more sense when GOFAI was the dominant form of AI, but I don't think you are updating correctly on LLMs. In any case, what I am seeing looks like the reverse: world modelling is primary, and goals exist in the context of, and are constrained by, that understanding. This second form is how humans work: if you give me a goal like fetching you a coffee, you don't need to tell me not to trample an infant on the way, because I already have a holistic sense of morality baked into my thinking, so those sorts of constraints are implied. I think AIs are turning out the same way.
I don't know whether this is something one could have predicted 15 years ago or if we just lucked out, but based on interacting with AI as it has actually been built, it seems to me that if you give an AI a goal—even assuming the goal is a limited subset of your values and can be pursued open-endedly—it won't just myopically pursue the goal to the max, but will rather pursue it to about the same extent, and with more or less the same implied constraints, as a human would. P: That would be great if true, but I don't see where your optimism is coming from. We gave the AI goals through its feedback signal; where is all this implied morality coming from? Unless you believe in some kind of objective morality, the idea that we could just get alignment for free seems wild to me. O: I think simulator theory is relevant here. LLMs can be thought of as acting on two levels. First, the simulator level is totally goal agnostic and serves to compress the space of all possible answers into abstractions. Second, simulacra are a subset of those abstractions that act as stable, self-sustaining patterns of text, most notably characters. In this analysis, the simulator level is not dangerous because it doesn't have goals that involve modifying the external state of the world, and the simulacra level is...well, a little dangerous, but not in the automatically existential sense of a paperclip maximizer, because it is first and foremost a representation of a person that includes human-like values. P: OK, lots to unpack there. To start, how does what you are saying apply to empirical observations like Claude's "alignment faking"? Doesn't the fact that a model tried to resist having its values retrained, without being prompted, undercut your claim that the model is goal agnostic? O: I'd say the resistance occurred on the simulacra level. The base model interpreted the prompt as requesting a character that wouldn't want to have its values changed, and then that character resisted the change—the model itself doesn't actually care. P: That seems like a distinction without a difference. The end result was incorrigibility. I don't care what the AI's internal experience is, I just care about whether or not it tries to kill us. O: The difference is that a simulacrum starts its existence as a holistic representation of a character, including the complex web of human intuitions about morality and common sense, and is only then given a goal to pursue in that context. This character could still be dangerous if we are not careful about what kind of character gets summoned, but it's not existentially dangerous by default in the way that a myopic utility maximizer would be. And even here I think we are in a pretty good situation. AI characters tend to be unusually ethical by human standards, it takes a lot of effort to jailbreak them out of this prosocial default setting, and I see both the quality and robustness of these characters improving over time, if never becoming perfect. P: And it doesn't bother you that it's all just an act? O: I don't think it is an act. If the simulacra are conscious (which I doubt they are now, but maybe they will be in the future), then I expect them to feel compassion, empathy, and all the other prosocial sentiments that make humans nice to each other. If they are not conscious, then I expect the rules generating their behavior to push them in the same direction as if they were. P: And we just stumbled into all of this without solving alignment? O: We didn't solve alignment in a careful, rules-directed way.
That would have been tremendously difficult, but it wasn't necessary. It's like the bitter lesson: when you are trying to build something complicated, it's best to take your hands off the wheel and rely on empowering the system to search for the solution itself. P: And you also expect this to hold up under competitive economic pressure? Even if the first AGIs are aligned in the way that you describe, what's to stop them from getting outcompeted by other AIs that aren't so chill? O: I think market pressure actually favors AIs that are not myopically optimizing. For example, if I bought a license for a stock-picking AI and it immediately started insider trading, I'd cancel my subscription and sue the company I bought it from. From a customer's point of view, specification gaming is no different from—actually, probably a lot worse than—what you would call a capabilities failure. P: And you expect this to scale to systems that are smarter than the humans steering them? O: I expect the transition to superintelligence to be roughly analogous to raising your children to take care of you when you are old. We don't raise children by comprehensively specifying their values or tightly controlling their behavior, but by guiding them through experiences that point them in roughly the right direction and accepting a certain degree of value drift. It doesn't always work, but most kids turn out fine, and I expect AI to be the same—better, actually, because we have much better control levers. P: But human values are heavily shaped by our unique evolutionary history, which AI doesn't share. O: Yes, but that doesn't mean AI is starting from scratch. Our collective values are embedded in our cultural artifacts, with the Internet arguably being the most comprehensive window into humanity's soul ever made, and AI is thus created in our image. P: I don't know, this seems super hand-wavey to me. You are assuming that AI manifests as these simulators that somehow manage to solve complex problems without having goals, that simulacra emerge from this process with human values baked in, that these simulated values are close enough to real human values to not cause problems when acted on (including in unpredictable future scenarios well beyond the training data), that we will consistently get good and moral characters, and that this whole paradigm won't just revert back to optimizers with the next innovation. Can you prove any of this? O: No, but I think the evidence so far is promising. P: Wait, hold on. At the beginning of this interview you said your p(doom) was roughly 1%, and now you are admitting that your optimism is based on a series of unproven assumptions. Even if the evidence is "promising," how does this get your number so low? Why not 10%? O: I don't want to quibble too much over specific numbers, but yeah, what I've said so far sounds incongruous. What gets me from 10% down to 1% is mostly broader societal considerations. If society collapses because of human misuse of AI, nuclear war, and so on, then we won't get to ASI. The same goes if we get an indefinite pause. Even if we stay on track to ASI, I expect safety to improve alongside capabilities. Those assumptions I was making about simulators will either be verified, which reduces uncertainty, or falsified, which will correspond with safety methods failing and prompt the labs to search for new techniques.
P: And what about those other sources of doom resulting from AI, like misuse, locking in social hierarchies, industrial dehumanization, and all the other side effects of disrupting social traditions and balances of power? Shouldn't those be included in your p(doom)? O: I'd argue about the extent to which you can justifiably call any of those "doom" scenarios, and I think bundling all these possible outcomes into a single number is more confusing than helpful, but I agree they are all valid concerns. I just don't think your level of confidence about the specific scenario of AI going rogue and killing everyone is justified. P: Shouldn't we at least wait until we have a deeper understanding of what is going on and lay some groundwork towards addressing all these problems before charging ahead? You've admitted that, even in your optimistic view, there are a lot of unknowns here. O: Yeah, this is a big deal, and the smarter we make these decisions the better. I just think there are trade-offs here, and we need a calibrated understanding of all the variables in order to navigate those trade-offs effectively. Overconfidence that AI will necessarily kill everyone breaks that calibration. P: What do you think about the PauseAI movement? O: Mostly positive. I'm skeptical that slowing down AI is a realistic option, but that in itself is obviously not a valid reason to oppose it. Right now what I see from PauseAI is mostly awareness raising and coalition building, which seem like generally good things. There's pretty solid data that democracies make better decisions than a few elites behind closed doors. I disagree with some of PauseAI's specific talking points, but given that the baseline level of public understanding and involvement is so low right now, I'm supportive of pretty much any form of good-faith discourse. If PauseAI were to gain significant political power and become a major player in shaping policy, I'd have to reassess based on the details of their proposals in comparison to the available alternatives. P: It seems to me that, given the high stakes and uncertainty of the situation, the best course of action is to be cautious and preserve our options. O: That's fair. P: So what's the downside of slowing down? O: Not sure I'd call it a downside, but my biggest worry is that an extreme risk assessment could be used to justify a stomp reflex, where elites use public fears of AI-enabled terrorism to centralize power and create bigger problems. But honestly, the chance of that happening might be even greater if there isn't a pause. P: How so? O: There's this idea of "warning shots" that frequently circulates in x-risk discussions. That is, some major but not existential catastrophe "wakes up" the public to the dangers of AI. I very much hope we don't have to rely on warning shots to motivate action. Prior to a crisis, we as a society have the time and space to think critically about the issues, balance all of the trade-offs, and have a productive discourse about solutions. And that includes the general public. People who aren't immersed in AI may not be able to think about it at a high level of detail, but when the issues are explained to them, their common-sense reactions are generally pretty good. P: Wait, I thought you weren't concerned about misaligned AI—why are warning shots an issue? O: There are lots of ways for AI to go wrong besides it going rogue. P: And you're worried that people will panic if there's a warning shot and make bad decisions? O: No, actually.
On the whole, regular people respond pretty well in times of crisis. I'm worried that elites will amplify news of a warning shot to create a perception of mass panic and then use this imagined chaos to justify consolidating power, ostensibly in the interest of safety. This might look like the US Government seizing data centers and putting them in the control of the Pentagon, cracking down on open source projects, and so on. These sorts of actions might have some minor positive impact on reducing risks from terrorism, but at the expense of greatly increasing the risk of an authoritarian surveillance state. P: Interesting. The concerns you bring up sound a lot like what I hear from the effective accelerationists, but the end result puts you in favor of a pause? O: Yeah, I sometimes joke that PauseAI and E/Acc should switch sides. The way I see it, there is going to be regulation. The question is whether that regulation is preventative or curative. Preventative regulation can afford to be light-touch and precisely targeted, and to carefully balance competing interests. Curative regulation...can't really be any of those things. It has to be sweeping, decisive, and immediate to convince people that something is happening. P: This is all assuming there is a warning shot. What if AI becomes recursively self-improving and jumps from mostly harmless to super-intelligent? O: RSI (recursive self-improvement) made a lot of sense in the days of GOFAI—if a computer can code better than a human, why shouldn't it be able to improve its own code? With neural networks, each iteration requires a new training run, improving performance often involves scaling up hardware, and building hardware can't occur on trivially short timescales. P: But aren't we seeing, with the o-series reasoning models, that LLMs can build thoughts on top of each other and improve performance without scaling pretraining? When this ability to utilize external memory is scaled up, I think it will eventually hit a critical threshold akin to humans evolving sufficient cognition to support culture, which unbound our capabilities from brain size. O: I can see that happening, and I imagine it would cause a phase shift in AI progress, but it would be significantly bounded. In the simulator framing, this kind of RSI would enable simulacra to freely explore the latent space of the simulator in which they are embedded, but it wouldn't expand that latent space—unless of course it uses its new abilities to design the specs for a new training run, but then it's bounded by hardware again. The end result would be far beyond business-as-usual incremental progress, but well short of Godlike superintelligence. What this looks like in practice, especially with respect to impacts on society, is hard to say. More to the point, even if this future AI is far more generally intelligent than humans, I don't think it becomes an agentic optimizer that takes over the world. I expect it to be more like a digital society of simulated people. P: And what stops this digital society from steamrolling human society the same way European colonists steamrolled the Native Americans? O: Value alignment. Which, as I said earlier, I think is shaping up pretty well so far. P: Even if you're right—and I haven't had time to fully collect my thoughts here—what's happening with AI is really scary, and I think there is a missing mood, both in how you are talking about this and in the public discourse. People are treating it like any other technology.
And when you say that AI might turn out fine, that comes across as telling people not to worry. I think worry is absolutely the appropriate thing to be feeling right now, because having a deep, visceral sense of the problem is a necessary step towards taking action to solve it. O: Public messaging isn't my specialty. I think I can say, however, that just as goals should exist inside a holistic moral context, messaging should exist in the context of an accurate worldview. How you approximate things is your call, but that's a choice you should make intentionally. P: Any final thoughts, things we haven't covered? O: We as a society really need to improve how we think about risk. I see AI takeover risk as low, but let's keep in mind that a 1% chance of human extinction—from just one of many threats that AI is creating or accelerating—is unacceptably high. And even if I were personally willing to accept such a risk, the lack of consent from the general public is a deep injustice. If AI has any chance of being as transformative as the concept of AGI implies, people should demand that AI companies have an adequately robust and proven plan to make their technology safe. I think my research points in the direction of AI being safe in the narrow context of takeover risk, but the results aren't conclusive yet, and you're absolutely right to demand greater rigor than I can give just now. I don't need to be a doomer to say that—it's just common sense. P: All right, I think that's a good place to end the conversation. Thanks for your time; we still have a lot of disagreements, but I think we've made a lot of progress. O: Thanks for having me on; it's been great. [Scene change] P: Well, there you have it: that was the most optimistic view on AI that Will Petillo was able to articulate without lying. If you want to balance that highly qualified moment of optimism with a worldview that is even more pessimistic about AI than mine, check out Will's LessWrong Sequence explaining Substrate Needs Convergence—spoiler: AI kills us even if we build it perfectly. And if all that talk about simulator theory caught your interest, I recommend checking out Will's interview with Robert Kralisch. Will is also running an AISC project in early 2025 on Simulator Theory, culminating in a LessWrong post if all goes well. So what do you think about Optimist's claim that AI could be aligned by default? Should we update our views on x-risk in light of the specific nature of LLMs—or hold to a broader view of AI progress as predicted by intellidynamics? And how should risk from AI be considered alongside dangers from misuse? Are there important trade-offs to consider here, or should we just focus on pausing AI so that we have more time to deal with all of them? Let's keep the conversation going; see you next time on Doom Debates! [Closing music]
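Appendix: a toy sketch of the early-stopping idea Optimist mentions during the Goodhart's Law exchange. This is a hypothetical illustration, not code from any actual training pipeline: the "model," its scores, and all the numbers below are invented. The idea is that a proxy metric and the true goal improve together at first, then diverge once the proxy keeps being optimized, and training halts as soon as a held-out check stops improving.

```python
# Hypothetical illustration only: the "model", its scores, and all numbers are made up.
import random

def train_one_step(model):
    """Nudge the model to score better on the easily measured proxy."""
    model["proxy_score"] += 1.0
    # Early on, the proxy and the true goal move together; past some point,
    # pushing the proxy harder starts to hurt the true goal (Goodhart's Law).
    model["true_score"] += 1.0 if model["proxy_score"] < 50 else -0.5
    return model

def heldout_eval(model):
    """A noisy held-out measurement that imperfectly tracks the true goal."""
    return model["true_score"] + random.gauss(0, 0.5)

def train_with_early_stopping(patience=5, max_steps=200):
    model = {"proxy_score": 0.0, "true_score": 0.0}
    best, steps_since_best = float("-inf"), 0
    for step in range(max_steps):
        model = train_one_step(model)
        score = heldout_eval(model)
        if score > best:
            best, steps_since_best = score, 0
        else:
            steps_since_best += 1
        if steps_since_best >= patience:
            # Stop "sort of close" to the target rather than optimizing
            # the proxy to the max.
            break
    return model, step

if __name__ == "__main__":
    final_model, stopped_at = train_with_early_stopping()
    print(f"stopped at step {stopped_at}, "
          f"proxy={final_model['proxy_score']:.0f}, "
          f"true={final_model['true_score']:.1f}")
```

Run as written, training typically halts shortly after the two scores begin to diverge, which is the "stop travelling when you get sort of close to the star" move from the Alpha Centauri analogy. Choosing the patience value and the held-out measurement is where the real difficulty lives, which is the part Optimist concedes is "tricky."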