This post assumes some prior knowledge of how AI works. If you are new to the subject, this primer covers all that you need to know.
AI "Safety" RLHF (reinforcement learning from human feedback) is a standard approach to AI safety among leading labs like OpenAI. To oversimplify somewhat, RLHF is a process where an AI model, previously trained with self supervised learning to predict text, generates multiple answers to prompts, which are then rated by human evaluators—with some automation provided by Reinforcement Learning to speed up the process. The difference between the spectacular failure of Bing (sorry Sydney, you have been a good Bing) and the wildly successful ChatGPT is likely a testament to the value of RLHF, both as a control technique and in making AI a marketable product. OpenAI's approach to alignment, including its Superalignment initiative, are essentially plans to create increasingly powerful AI assistants to help such control techniques scale to contexts that are too complex for humans to act as effective evaluators. Implicit in OpenAI approach to AI safety is a definition of alignment that seems to be something along the lines of: make sure the AI says good things and not bad things. This is a very convenient definition, as it is difficult enough for OpenAI to show that it is working hard and taking safety seriously, yet easy enough as to be generally solvable. Indeed, viewed through this lens, AI alignment is going pretty well: AI chatbots mostly give safe and reasonable answers, with the exeptions requiring elaborate prompting or jailbreaks. Even then, such failings could be thought of as being aligned to the user rather than the developer, which isn't necessarily such a bad thing, depending on whether you are more worried about decentralized catastrophe or centralized tyranny. Focusing on the output of AI to assess its safety is intuitively compelling—what the AI actually does is what we care about, right? And if it's safe now, why expect it to suddenly become dangerous later? Sure, there are theories of instrumental convergence or whatever, but if these theories had any grounding in reality, surely we would be seeing evidence of them early on? And saying that the models aren't capable enough sounds like a bit of a dodge. Do we really need to wait for AI to become superhumanly intelligent before we see evidence of deception? Small children learn deception, so why not GPT-4? Alignment, Simplified I contend that AI is completely off the rails, now, in observable ways, and few of the commonly proposed solutions even attempt do anything about it. We must hold in clear view what the alignment problem actually is, and has always been:
Careful readers may notice that the above logic is a restatement and application of Goodhart's Law, the simplified version of which states that a measure that becomes a target ceases to be a good measure. Indeed, the classic problems of alignment, from specification gaming to reward hacking, are applications of Goodhart's Law.
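As a toy illustration of this dynamic, here is a small simulation of my own (not from any particular alignment benchmark, and with arbitrarily chosen Gaussian distributions): each candidate has a true value we care about and a noisy proxy score we can actually measure, and the harder we select on the proxy, the more the proxy overstates the true value of whatever gets selected.

```python
# Toy sketch of "regressional Goodhart": selecting hard on a noisy proxy
# makes the proxy an increasingly poor estimate of the true value of the
# selected items. All distributions here are assumptions for illustration.
import random

random.seed(0)

def candidate():
    true_value = random.gauss(0, 1)            # what we actually care about
    proxy = true_value + random.gauss(0, 1)    # the noisy measure we optimize
    return true_value, proxy

population = [candidate() for _ in range(100_000)]

for top_fraction in (1.0, 0.1, 0.01, 0.001):
    k = int(len(population) * top_fraction)
    # keep only the candidates with the highest proxy scores
    chosen = sorted(population, key=lambda c: c[1], reverse=True)[:k]
    avg_true = sum(c[0] for c in chosen) / k
    avg_proxy = sum(c[1] for c in chosen) / k
    print(f"select top {top_fraction:.3%}: proxy {avg_proxy:5.2f}, "
          f"true value {avg_true:5.2f}, gap {avg_proxy - avg_true:5.2f}")
```

Nothing in this setup is adversarial; the growing gap between the proxy score and the true value comes purely from optimization pressure applied to an imperfect measure.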
Learning is Inherently Goal-Directed

All this talk of goals raises the question: "Is AI goal-directed?" It's one thing to talk about the theoretical implications of optimization, but do these principles actually hold for the real AI in use today? Quite simply, yes.

Let's step back a moment and consider what it means for something to have a "goal." In Life 3.0, Max Tegmark defines goal-directed behavior as behavior that is better described in terms of its consequences than its process. By this definition, a heat-seeking missile is goal-directed because it is far simpler—and more usefully accurate—to describe its behavior as moving towards a target than to dive into the technical details of rocketry.

How does this definition apply to AI? We can divide AI goals into two types. The first are learned goals, which describe the system's behavior during deployment or a given training run. The second are training goals, which describe the way the system's behavior changes as it learns.

For an example, let's start with RL (reinforcement learning), because it is easier to imagine toy examples. Imagine an AI agent navigating a gridworld that contains a set of coins randomly scattered about. The agent gets a reward for each coin it picks up, and the episode resets after a set number of timesteps. The training goal of the system is thus to pick up as many coins as possible within the time limit, so we should expect the end result of the training session to be an agent that follows direct paths from one coin to the next. The learned goal is a bit more complicated. At first, the agent doesn't really have any learned goals at all, unless one counts "move around randomly" as a goal. Then its goal might be to follow some arbitrary series of actions that happened to result in picking up a coin more often than other arbitrary actions. Then its goal might be to move in a straight line towards the nearest coin and continue moving in that direction. Eventually, if training goes on long enough and the agent doesn't fall into a local minimum, the learned goal converges towards a sensible strategy that matches the training goal.

Note that in the above example the learned goal is never "hack the reward signal." If the AI were to stumble upon some bizarre series of actions that triggered a glitch that gave it massive reward, it would learn that strategy, and one could say that triggering the glitch was its learned goal, but this is still operating at a lower level of abstraction than seeking the reward signal itself. This distinction, however, does not point to a general principle that will necessarily hold forever. If the AI were trained on a massive array of environments, with different mechanics and reward criteria, then seeking the abstraction of the reward signal could become a better strategy than learning a multitude of behaviors specific to each environment—and thus what one should expect the AI to learn if it has the capacity to do so.

For LLMs (large language models), the training goal is fairly straightforward: accurately predict text...and also get good ratings in RLHF. This goal is extremely informative, as it tells us the general direction in which we should expect the capacity of the system to expand as it learns. The learned goal is harder to pin down, but I would be inclined to assume it is mostly the same as the training goal, modified to the past tense: generate text according to patterns observed in the training data that got good ratings during RLHF.
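Returning to the gridworld example above, here is a minimal sketch of my own (grid size, coin count, and hyperparameters are arbitrary choices, not anything from the post) using tabular Q-learning. The training goal is fixed by the reward function; the learned goal at any moment is whatever behavior the current Q-table happens to encode, which early in training is little more than random movement.

```python
# Minimal coin-collecting gridworld with a tabular Q-learning agent.
import random
from collections import defaultdict

GRID = 5          # 5x5 grid (arbitrary)
N_COINS = 3       # coins scattered at random each episode
EPISODE_LEN = 30  # episode resets after a set number of timesteps
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up

def reset():
    cells = [(x, y) for x in range(GRID) for y in range(GRID)]
    coins = frozenset(random.sample(cells, N_COINS))
    agent = random.choice([c for c in cells if c not in coins])
    return agent, coins

def step(agent, coins, action):
    dx, dy = ACTIONS[action]
    agent = (min(max(agent[0] + dx, 0), GRID - 1),
             min(max(agent[1] + dy, 0), GRID - 1))
    reward = 1.0 if agent in coins else 0.0   # the training goal: pick up coins
    return agent, coins - {agent}, reward

Q = defaultdict(float)   # (state, action) -> value; the "learned goal" lives here
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

for episode in range(5000):
    agent, coins = reset()
    for t in range(EPISODE_LEN):
        state = (agent, coins)
        # epsilon-greedy: early on this is mostly "move around randomly"
        if random.random() < EPSILON:
            action = random.randrange(len(ACTIONS))
        else:
            action = max(range(len(ACTIONS)), key=lambda a: Q[(state, a)])
        agent, coins, reward = step(agent, coins, action)
        next_state = (agent, coins)
        best_next = max(Q[(next_state, a)] for a in range(len(ACTIONS)))
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

If this environment contained a scoring glitch, the same update rule would happily learn to trigger it: the training goal cares only about reward, and the learned goal is whatever strategy reward has reinforced so far.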
Evidence of Misalignment

Given the above theory of alignment, one should expect to see misalignment at small scales, though the impact of this misalignment may well be trivial. In RL systems, misalignment looks like specification gaming: finding surprising ways to satisfy the reward criteria that are not what the designer had in mind. In LLMs, misalignment looks like sycophancy: telling the user what they want to hear. Both of these phenomena are well-documented. Disagreements about whether there is evidence for x-risk are not really about whether evidence exists so much as they are disagreements about what counts as evidence in the first place.

For the sake of fairness, I should also note that it would be inconsistent to take the view I am describing here while also pointing to issues like jailbreaks and chatbots going off the rails as evidence of misalignment—at least without qualification and context. Beyond fairness, crowing at every failure of AI companies to control their systems reinforces the idea that if they could only get their systems to say good things and not bad things, then everything would be on track to be fine. That said, failures and successes of control are not irrelevant to AI safety. For nearer-term concerns, like preventing deepfakes and the proliferation of dangerous weapons—which are entirely valid concerns!—control matters. Further, effective control implies some degree of understanding, which seems very likely to be useful for any alignment plan.

One might object that specification gaming and sycophancy seem like rather minor problems. This is where it becomes appropriate to point to the limited capabilities of current AI and extrapolate from existing trends—it isn't exactly a stretch to argue that a real and observed problem will become more significant when the system generating it becomes more powerful!

As an example that exists in the grey area between analogy and direct precedent, the societal harms currently coming from social media are essentially alignment problems. Assume for a moment that Facebook's creators were motivated by the benign intention of presenting audiences with useful, entertaining, and informative content, but the results needed to be measured somehow in order to provide feedback for the recommendation algorithm to learn and improve. Engagement is a clear and relevant metric. And maybe it works well enough at first. Sure, people are spending more time watching cat videos than informing themselves, but that's their choice and who are we to judge? But with further optimization, the algorithm discovers that the best way to maximize engagement is to make people angry...no, scratch that, the best way to maximize engagement is to polarize people so they are more susceptible to anger and therefore easier to engage.

Of course, the starting assumption in the above example was far too charitable. Zuckerberg and co. saw what was happening and encouraged the process to continue. Engagement wasn't merely an insufficient metric of quality content; it was a great metric for maximizing ad revenue, which is what they actually cared about. But this profit maximization at the expense of society is itself an alignment problem. For-profit corporations are legally obligated to maximize financial returns for shareholders, and since optimizing any metric drives all unmeasured considerations to zero, they are effectively required not to care about costs externalized onto society.
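As a deliberately simple sketch of that feedback loop (my own toy construction and made-up numbers, not Facebook's actual system): an epsilon-greedy recommender that observes only an engagement signal concentrates its recommendations on whatever content is most engaging, while a societal-value score it never sees is simply ignored.

```python
# Toy recommender trained only on an engagement signal; the "societal value"
# column is tracked by us for illustration but never enters the optimization.
import random

random.seed(1)

# content type -> (probability of engagement, unmeasured societal value)
# These numbers are invented for the sake of the example.
CONTENT = {
    "informative": (0.10, +1.0),
    "cat videos":  (0.30,  0.0),
    "outrage":     (0.55, -1.0),
}

counts = {c: 0 for c in CONTENT}
clicks = {c: 0 for c in CONTENT}

def recommend(step, epsilon=0.05):
    # epsilon-greedy bandit on observed click-through rate only
    if step < 30 or random.random() < epsilon:
        return random.choice(list(CONTENT))
    return max(CONTENT, key=lambda c: clicks[c] / max(counts[c], 1))

societal_value = 0.0
for step in range(10_000):
    choice = recommend(step)
    p_click, value = CONTENT[choice]
    counts[choice] += 1
    clicks[choice] += random.random() < p_click
    societal_value += value   # never observed by the recommender

shares = {c: round(counts[c] / sum(counts.values()), 3) for c in CONTENT}
print("share of recommendations:", shares)
print("cumulative unmeasured value:", societal_value)
```

The point is not the specific numbers but the structure: the unmeasured consideration never enters the objective at all, so nothing in the learning process can protect it.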
But even this undersells the problem. For algorithms to really maximize engagement, they would be better served by generating bespoke content rather than being tethered to what humans have already created. For corporations to really maximize profit, they need to break free from the limits of mere human ingenuity when coming up with ways to capture value or extract from the global commons. Wherever human involvement introduces a bottleneck, removing humans from the loop will increase the system's effectiveness, at the cost of all of the side effects that come with that increased effectiveness.

On the Validity of Thought Experiments

When a persuasive device is used more by one side than the other in a debate with political implications, the device itself becomes politicized. As such, thought experiments and analogies, despite having a long history in education and philosophy, are now sometimes viewed with suspicion in the context of x-risk. That said, analogies are certainly not immune to being used incorrectly, so it is worth being mindful with them.

The important thing to note about analogies is that their purpose is to illustrate, not to act as evidence in themselves. Whether it makes sense to use an analogy therefore depends on the nature of the disagreement and what one is trying to communicate. If you believe the person you are speaking to is missing some critical piece of evidence or has made a logical error, then it is best to point those out directly. If, on the other hand, you believe the disconnect comes from your audience not understanding your point of view, or if your point is obvious when viewed with the right lens, then analogies may be just the thing to bridge that gap.

Another important consideration when using analogies is being clear about what point you are trying to illustrate with them. Thinking about misalignment in the context of social media, business, or other familiar examples is great for illustrating plausibility, but it is ineffective for communicating severity, in that it makes the risks seem manageable—perhaps even acceptable when considered alongside the associated benefits. On the other extreme, Yudkowsky's thought experiments involving AI that leverages diamondoid bacteria to kill every human in the world in the same second leave no room for effective resistance, but at the cost of being more speculative (because the scenario involves technology that doesn't exist yet). Such scenarios are therefore vulnerable to criticism regarding their realism, but such criticisms miss the point. The point is to challenge the intuition—based on tropes in popular science fiction like The Terminator—that an AI takeover scenario is really scary and destructive, but ultimately humanity, through pluck, luck, and creativity, manages to eke out a narrow and dramatic victory against the machines. No, if a higher intelligence emerges and its goals are misaligned with our own, we just lose. Trying to imagine every possible failure state is fundamentally impossible, and demonstrating that a particular failure state is unlikely just means that failure won't look like that. Again, this line of reasoning is based on an acceptance of the premises of misaligned, superintelligent AI and counters the idea of effective resistance; if you disagree with those premises, then this particular argument just isn't relevant to you.

AI Optimism

I have a very negative view of most arguments against x-risk, seeing them as deliberate misunderstandings motivated by self-interest.
The most notable exceptions are those put forward by the self-titled AI Optimists, with whom I respectfully disagree. This topic could easily become an essay in itself, but I don't want to skip over it entirely, because there are nuances here that are very useful for reinforcing what AI alignment is all about. For now, I'll address them in bullet-point fashion: