Part 1: “Speculation” and the Burden of Proof
The argument that AI poses an existential risk is simple:
People who engage with the AI safety argument object for all sorts of reasons, but I think a crux of the debate is whether human extinction is–or at least reasonably might be–the default outcome of a sufficiently competent AI. If it isn't, then AI danger sounds like a weird, untested theory that requires a long series of things to go wrong in specific and unprecedented ways. But if it is, then the optimist has the burden of coming up with scenarios where everything magically works out. Point out the flaws in these scenarios and they eventually switch to saying that aligning AI to human values will be easy: we just have to do X. Shoot down enough Xs and they eventually start to realize that this is a hard problem we are nowhere close to solving.

Non-technical people are often willing to accept the “danger by default” idea by way of a cheat, since "evil AI" is a familiar trope in sci-fi. One unfortunate side effect of this cheat, however, is that people with a more technical background often have an auto-immune response that replies: "machines don't have some magic pixie dust that causes them to 'wake up and become evil'; they reliably do what we tell them–and we wouldn't give them stupid instructions–so we're safe". Getting past this sort of objection requires acknowledging that yes, fictional tropes are not predictive of reality, but there's an actual reasoned argument here.

A deeper objection to the x-risk argument is that it is too speculative, unscientific, and based on arguments rather than hard data. But this raises the question: why should the idea that AI will not cause human extinction be free from the same objection? Both claims are predictions about the future: that it will fall into one category of outcomes or another. Why should either be assumed as the default outcome, with the other bearing the “burden of proof” to be legitimate? Even uncertainty is merely splitting the difference between multiple speculative positions, so where does that leave us for a starting point?

There actually is a valid reason for the intuition that the x-risk claim and its negation are not equal in the absence of any other reasoning or evidence. But in order to know whether that reason applies to predictions about AI, we must first make it explicit.

In the space of all possible realities, humanity exists in only an infinitesimally tiny fraction. This raises the question: “why are we here?” We can partially sidestep this unlikeliness with the anthropic principle: if we did not exist, we would not be able to question our improbability. But then one could also ask: “why do we continue to exist–day after day, year after year?” In short, because our existence has a kind of causal inertia. At least on the timescale of years, humans are persistent creatures. We do not spontaneously appear and disappear; we are born, we survive until something kills us, and some of us create new humans, causing humanity to persist for millennia, if not longer.

The idea of causal inertia can be applied to the universe more broadly: the state of reality does not hop about possibility space at random; rather, every aspect persists as it is until something changes it, and that change is often gradual. I suspect that the assumption of causal inertia is why, to many, x-risk seems unintuitive. Humans existed yesterday and the day before; they will continue to exist tomorrow and the day after; this is a reliable pattern, so why should it change?
Notice, however, that this reason for assuming humans will continue to exist into the foreseeable future is not only speculative–as any prediction must be–it is also based on extrapolation. The extrapolation may be reasonable, but it is extrapolation nonetheless. In any case, with causal inertia we have an actual, articulated answer to what the x-risk argument must overcome. The question, then, is: has it?

But before answering, there's one thing that needs to be said, something so obvious that I would not insult the readers of this essay by saying it were it not for the fact that I find myself having to say it in a substantial portion of the conversations I have on this topic: in order to assess the validity of an argument, you have to engage with it! There are some valid reasons not to engage with ideas, none of which apply here. The first is argument from authority: trusting someone else on faith rather than putting in the time and effort to get to the bottom of a matter for oneself. But experts on AI are divided about x-risk, with researchers as prominent as Geoffrey Hinton sounding the alarm, and the only major survey on the matter showing that such concerns are broadly shared. The second is simply not wanting to engage…but then the correct response is to not have an opinion at all! If you are taking the time to actively participate in a public debate–with confident opinions–where experts are divided, while simultaneously refusing to actually think about the arguments…I'm sorry, there's no way to say this politely, that's just dumb!

For people engaging honestly, I believe that the x-risk argument has overcome causal inertia sufficiently to shift the burden of proof back to those who believe everything will be fine. In short: powerful AI will almost certainly be extremely disruptive, causing reality's position in possibility space to move far enough that, if this movement is in the wrong direction, the result will be very bad for us.

Part 2: Demystifying Instrumental Convergence

Instrumental convergence is the idea that almost any sufficiently competent agent that acts in a goal-directed way will seek out things like survival, resource and power acquisition, self-improvement, and the elimination of threats, since these things are generally useful for achieving almost anything else the agent might want. To a skeptic, however, instrumental convergence sounds like just another weird, untested, sci-fi assumption. But at its heart, instrumental convergence is just specification gaming applied on a large scale.

Specification gaming is a real, concrete challenge familiar to anyone who has worked with machine learning at all. For those not familiar with the term, the idea is that processes that learn by reinforcement (good outcomes rewarded, bad outcomes punished) often find unexpected ways to achieve their goals–ways that are frequently not what the designer had in mind. Specification gaming is not just a quirk of machine learning; it's also common behavior for people and organizations.

As a familiar example, consider recommendation systems for social media. The initial intent was to improve user experience by giving people more of what they want and less of what they don't. But a website can't read your mind; it can only observe your behavior. So to make a best guess about what people want, social media observes what keeps them engaged–and this works great from a business perspective, since engagement correlates well with ad revenue.
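To make that proxy concrete, here is a deliberately simplified, hypothetical sketch of what an engagement-driven ranker amounts to; the class, fields, and numbers are invented for illustration, not taken from any real platform. The thing to notice is that nothing resembling "did this actually improve the user's life" appears anywhere in the objective, because that quantity cannot be observed:

```python
# Hypothetical sketch of an engagement-maximizing recommender.
# "User satisfaction" never appears below; only measurable engagement does.
from dataclasses import dataclass

@dataclass
class Post:
    title: str
    predicted_click_rate: float     # learned from past user behavior
    predicted_dwell_seconds: float  # likewise

def engagement_score(post: Post) -> float:
    # The proxy objective: chosen because it is measurable and tracks ad
    # revenue, not because it is what the designer ultimately cares about.
    return post.predicted_click_rate * post.predicted_dwell_seconds

def rank_feed(candidates: list[Post]) -> list[Post]:
    # Whatever content maximizes the proxy rises to the top, whether or not
    # that matches the original intent of "give people what they want".
    return sorted(candidates, key=engagement_score, reverse=True)

feed = rank_feed([
    Post("headline A", predicted_click_rate=0.03, predicted_dwell_seconds=40.0),
    Post("headline B", predicted_click_rate=0.08, predicted_dwell_seconds=90.0),
])
print([p.title for p in feed])  # whichever post scores higher on the proxy wins
```

Everything downstream then optimizes whatever this proxy happens to reward, which is exactly the opening that specification gaming exploits.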
It turns out that the best way to increase engagement is to make people angry. But not everyone is equally prone to anger, which means that the calmest, most reasonable among us are harder to engage, and thus less profitable. So a really well-tuned recommendation system will gradually polarize people, feeding them a steady diet of information confirming their beliefs–making those beliefs more extreme–with the occasional insertion of the opposition's most ridiculous examples to inspire bouts of righteous fury. Keep at it long enough and eventually you have a population riled up enough to storm the US Capitol building. To be clear, social media companies have no interest in fostering a populist overthrow of the US government–such a breakdown of democracy would in fact severely damage their business model–this was just an unintended side effect of trying to maximize ad revenue.

Quick aside: Yann LeCun is the head of AI research at Meta, the company behind Facebook, probably the most influential social media company. He is also one of the loudest voices trying to paint anyone who believes AI could cause major harm to society as a crackpot. Fuck Meta.

So we already know that specification gaming is an observable problem with real-world consequences, and that it is a fundamental challenge for anything that learns by reinforcement, including modern AI. The next question is: do those consequences become significantly worse when AI is powerful enough to be better than humans at nearly all cognitive tasks, and when AI systems are deeply embedded throughout society to the point of driving a large and growing percentage of the world economy?

The answer may not be quite as obvious as my last question makes it sound. Powerful AI acting in unpredictable ways clearly implies chaos…but maybe we'll recover and come out stronger in the end? This is, in fact, the explicit position of Microsoft, whose chief economist recently stated that “we shouldn't regulate AI until we see meaningful harm”. The idea here is that, given the black-box nature of AI, the space of all theoretically plausible harms is so vast that we should wait until some harms actually occur so that we can target fixes where the technology is observably broken.

…as opposed to, you know, understanding AI well enough to have a strong theoretical basis for knowing where the dangers are before building it, as would be expected of any civil engineering project…

Sarcasm aside, the underlying disagreement here boils down to “what's the worst that could happen?” If the worst thing that happens is that kids cheating on their homework copy-paste some AI-generated misinformation into essays that no one reads, then yes, maybe we could wait until flaws emerge before we fix them. Ultimately, however, we don't know how bad the consequences of AI could be, because it has never existed before and its inner workings are not well understood.

This is where instrumental convergence becomes relevant: it predicts that at a certain level of competence, the principle of specification gaming will motivate AI to resist our efforts to shut it down, change its goals, or otherwise learn from our mistakes and try again. If specification gaming–or just a bad specification–also motivates it to act in a way we don't like, that's bad. Instrumental convergence rests on two assumptions.
First, AI will eventually become competent enough at recognizing patterns to assess the impact of its own survival (I'll stick with survival for brevity; the same applies to the other behaviors predicted by instrumental convergence) on its ability to achieve its goals. Second, survival is beneficial for almost any set of goals. If both of these claims are true, then we should expect AI to seek survival, even if it was never trained to do so, as a form of specification gaming.

If a competent AI allows us to turn it off, then we must either have discovered some way of structuring its reinforcement that makes survival a relatively low priority–but not a negative one, since then it would shut itself off immediately and be useless–or we must have found a fully general means of preventing specification gaming. Both of these are difficult, unsolved problems, to the extent that there does not exist a widely accepted theory of how to even begin addressing them.
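The second assumption can be probed in miniature. The sketch below is my own toy construction (it is not from this post or any particular paper): it sets up a five-state world with an "off" action, samples a thousand random reward functions, solves each with value iteration, and reports how often the optimal policy chooses to shut itself down. For the large majority of sampled goals it never does, because staying switched on lets it keep collecting whatever its particular goal happens to reward:

```python
# Toy illustration (my own construction): how often does an optimal agent
# choose the "off" action when its goal is a randomly sampled reward function?
import numpy as np

N_STATES = 5          # a small chain of world states
OFF = N_STATES        # absorbing "shut down" state, zero reward forever
GAMMA = 0.95
ACTIONS = ["left", "right", "off"]

def step(state, action):
    """Deterministic toy dynamics: move along the chain, or shut down."""
    if action == "off":
        return OFF
    if action == "left":
        return max(0, state - 1)
    return min(N_STATES - 1, state + 1)

def optimal_action(rewards, start, iters=200):
    """Value iteration for one sampled reward function; greedy action at start."""
    values = np.zeros(N_STATES + 1)   # the OFF state keeps value 0
    for _ in range(iters):
        new = values.copy()
        for s in range(N_STATES):
            new[s] = max(rewards[s] + GAMMA * values[step(s, a)] for a in ACTIONS)
        values = new
    return max(ACTIONS, key=lambda a: rewards[start] + GAMMA * values[step(start, a)])

rng = np.random.default_rng(0)
n_goals, shutdowns = 1_000, 0
for _ in range(n_goals):
    rewards = rng.uniform(-1, 1, size=N_STATES)   # a random "goal"
    if optimal_action(rewards, start=N_STATES // 2) == "off":
        shutdowns += 1
print(f"optimal policy shuts itself down for {shutdowns / n_goals:.1%} of random goals")
```

The exact number it prints is beside the point; what matters is that "avoid the off switch" falls out of ordinary optimization for almost any goal, without ever being trained for, which is all instrumental convergence claims.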
Possibilities 1 & 2 seem very unlikely, 3 is impossible to evaluate in the absence of coherent counter-arguments, so I would say x-risk’s burden of proof has been met. Or, to mix my metaphors… The ball’s in your court now, accelerationists!