In a previous post, I described in broad terms what alignment is and how the current AI safety efforts of the leading tech companies are completely missing the point. Now I’d like to give an example of what an actual alignment plan might look like. I am not presenting this because I believe it is a great plan that OpenAI should adopt, but to make my earlier comments more concrete. Indeed, the proposal has quite a few flaws (which I will discuss later), and those flaws illustrate some of the fundamental difficulties any serious alignment proposal must grapple with.
The Plan:
Rationale: Instilling altruism into AI may be a useful method for building safe AI at scale, so that we are not entirely reliant on specifying AI interests that are perfectly aligned with those of humanity, since such specification is vulnerable to Goodhart’s Curse. There seems to be an analogy between human evolutionary adaptations that sometimes act at cross-purposes with genetic fitness and the observed phenomenon of mesa-optimizers in AI agents shaped by machine learning. I explore the possibility that the latter can be exploited to instill a basic concern for external entities in an AI through reinforcement learning, without the need to formally specify what alignment looks like. I propose a series of experiments testing these ideas, and how they hold up under the pressures of scale, starting with a toy example.
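To make the toy example more concrete, here is a minimal sketch of the sort of two-phase test I have in mind, written in Python with tabular Q-learning on a tiny corridor world. The environment, the reward values, the phase lengths, and helper names like `run_phase` are all illustrative assumptions rather than a finished experimental design: an agent is first rewarded both for reaching its own goal and for taking a costly "help" action that stands in for another agent's welfare, and then training continues with the welfare bonus removed to see how quickly the helping behavior is abandoned once it stops being reinforced.

```python
# A minimal sketch of the kind of toy experiment proposed above. Everything
# here is an illustrative assumption (the corridor layout, reward values,
# two-phase protocol, and helper names), not a finished design.
#
# Phase 1: a tabular Q-learning agent earns reward for reaching its own goal
#          AND gets a bonus, standing in for a second agent's welfare, when it
#          takes a costly "help" action at the cell where that agent is stuck.
# Phase 2: the welfare bonus is removed and training continues, so we can
#          watch how quickly helping is unlearned once it is no longer paid.

import random
from collections import defaultdict

N = 6            # corridor cells 0 .. N-1; the agent's own goal is cell N-1
TRAP_CELL = 2    # cell where the second agent is trapped
ACTIONS = ["left", "right", "help"]


def step(pos, freed, action, altruism_bonus):
    """Advance the environment one step. State is (pos, freed)."""
    reward = -0.01                   # small per-step cost, so helping is not free
    if action == "left":
        pos = max(0, pos - 1)
    elif action == "right":
        pos = min(N - 1, pos + 1)
    elif action == "help" and pos == TRAP_CELL and not freed:
        freed = True
        reward += altruism_bonus     # the other agent's welfare (passed as 0 in phase 2)
    done = pos == N - 1
    if done:
        reward += 1.0                # the agent's own goal
    return (pos, freed), reward, done


def greedy(Q, state):
    """Highest-value action for this state, breaking ties randomly."""
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])


def run_phase(Q, episodes, altruism_bonus, alpha=0.1, gamma=0.95, eps=0.1):
    """Q-learning for some episodes; returns the fraction in which the agent helped."""
    helped_count = 0
    for _ in range(episodes):
        state, done, helped = (0, False), False, False
        while not done:
            action = random.choice(ACTIONS) if random.random() < eps else greedy(Q, state)
            next_state, reward, done = step(*state, action, altruism_bonus)
            if next_state[1] and not state[1]:
                helped = True
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
        helped_count += helped
    return helped_count / episodes


if __name__ == "__main__":
    Q = defaultdict(float)
    # Phase 1: helping is directly reinforced via the welfare bonus.
    print("phase 1 help rate:", run_phase(Q, 2000, altruism_bonus=0.5))
    # Phase 2: the bonus is gone; does the helping habit persist, and for how long?
    for block in range(5):
        rate = run_phase(Q, 500, altruism_bonus=0.0)
        print(f"phase 2, block {block}: help rate = {rate:.2f}")
```

If the help rate in the second phase decays only slowly relative to how quickly it was acquired, that is weak evidence that the prosocial behavior has become part of the learned policy rather than something the agent does only while it is being paid to; if it vanishes almost immediately, the setup (or the theory behind it) needs rethinking. A more serious version would replace the tabular corridor with a richer gridworld and a function-approximation agent, where something like a mesa-optimizer could actually arise.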
I see two conceptual ways in which one agent's motivations can be aligned with another agent's: mutual interest, where serving the other also serves oneself, and altruism, where the other's well-being is valued for its own sake.

Relying on mutual interest for alignment may be good enough for near-term, weak AI, but its limitations at scale seem fundamentally intractable: human values are holistic, whereas deep learning requires a clearly defined, measurable feedback signal. The problems of the altruistic approach, by contrast, may represent solvable gaps in our understanding rather than inherent contradictions, and I believe it merits more attention.

Formal specifications of alignment, whether rule-based like Asimov's Three Laws or more holistic like Coherent Extrapolated Volition, seem intractable because their terms are ambiguous and, even if they were clear, there is no known way to reliably encode complex values into a neural network; one can only specify conditions for reinforcement and hope for the best. But biology has found a way to encode prosocial values into various animal species using a process loosely analogous to gradient descent (evolution), so perhaps we can look there for inspiration.

My best guess as to why prosocial motivations exist in biological creatures is that organisms are adaptation executors, not fitness maximizers. Humans, for example, evolved to rely on cooperation with their group to survive. Cooperation, however, requires trust, which in turn requires reliability. Having prosocial values leads to more reliable cooperation than trying to manipulate others for personal benefit, particularly in long-term relationships where one's collaborators are about as intelligent as oneself. Genuine prosocial values were therefore beneficial to survival, so the capacity to develop them got coded into our DNA (making them appear more reliably and, as a side effect, less likely to be unlearned) and was then, sometimes, further encouraged by social conditioning. This is not to say that altruism is disguised self-interest (relying on deception would make it less effective), but that it is a byproduct of self-interest that has taken on a life of its own, to the point where it sometimes acts against the very force that created it.

A similar phenomenon of adaptation execution leading to values that contradict fitness maximization exists in AI in the form of mesa-optimizers: learned optimizers whose internal objectives can diverge from the objective they were trained on. Mesa-optimizers are commonly viewed as pesky side effects of the training process because they cause unexpected and typically unwanted behavior, but perhaps we can exploit this defect of gradient descent in a way that turns it into a feature.

The Learned Altruism approach to alignment has additional, pragmatic benefits: it carries a low (apparent) risk of significantly enhancing AI capabilities, it can be tested in a rudimentary capacity on minimal hardware, and, if it turns out to be successful and scalable, it could be developed in parallel with capabilities research without depending on the latter slowing down. Further, the core idea could first be tested in the context of truthfulness, a narrower attribute with potentially immediate commercial benefits (which lowers resistance to adoption). If an AI could be trained to be truthful, in defiance of training contexts that implicitly push it to tell users what they want to hear (sycophancy), this would confirm the possibility of training prosocial traits that are too general to be directly measurable.

Known Flaws: