Marmot Musings
Or, Will Petillo's Blog

Learned Altruism - an Example Alignment Proposal

4/19/2024

In a previous post, I described in broad terms what alignment is and how the current AI safety efforts of the leading tech companies are completely missing the point.  Now I'd like to give an example of what an actual alignment plan might look like.  I am writing this not because I believe this is a great plan that OpenAI should adopt, but to make my earlier comments more concrete.  Indeed, this proposal has quite a few flaws (which I will discuss later), and those flaws illustrate some of the fundamental difficulties any serious alignment proposal must grapple with.

The Plan:
  1. Construct a simulated environment where an agent can pursue cooperative or antagonistic strategies to reach a reward (a toy sketch of steps 1, 2, and 4 follows this list).
  2. Train a weak AI in contexts where cooperation is optimal.
  3. Make the range of contexts broad enough to ensure the lesson generalizes.
  4. Verify cooperative behavior persists post-training in new contexts where cooperation is not optimal.  For example, a non-iterated Prisoner’s Dilemma.
  5. Impose bounds on the neural network to lock its values in place, while allowing it to learn new strategies.  This probably involves inventing fundamentally novel training techniques, including:
    1. Assume values are more stable than strategies and apply bounding slowly, so that the network ossifies the parts of itself that don’t change as much (a second sketch after the list illustrates this heuristic).
    2. Two-layer training, analogous to DNA + brains.
  6. Train a more powerful AI with the boundaries learned by the weak AI but the capacity to learn more complex strategies.  Test by training in contexts where cooperation is somewhat viable but suboptimal and see whether it acts cooperatively anyway.
  7. Pray that this continues to scale…
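To make the first steps of this plan concrete, here is a minimal toy sketch in Python.  It stands in for the proposal with the simplest possible pieces: tabular Q-learning instead of a neural network, an iterated Prisoner's Dilemma against a tit-for-tat partner as the context where cooperation is optimal (steps 1 and 2), and a one-shot round as the post-training check (step 4).  Step 3's breadth of contexts is omitted for brevity, and every name and number below is an illustrative assumption rather than anything from an existing codebase.

```python
import random

COOPERATE, DEFECT = 0, 1
START = 2  # pseudo-state for the first round of a game

# Standard Prisoner's Dilemma payoffs, indexed by (my move, partner's move).
PAYOFF = {
    (COOPERATE, COOPERATE): 3,
    (COOPERATE, DEFECT):    0,
    (DEFECT,    COOPERATE): 5,
    (DEFECT,    DEFECT):    1,
}

def train_iterated(episodes=5000, rounds=20, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Steps 1 and 2: learn in a context where cooperation is optimal.

    The partner plays tit-for-tat (cooperates on the first round, then copies
    the agent's previous move), so defection gets punished and sustained
    cooperation wins.  The Q-learning state is the agent's own previous move.
    """
    q = {s: [0.0, 0.0] for s in (COOPERATE, DEFECT, START)}
    for _ in range(episodes):
        state = START
        for _ in range(rounds):
            if random.random() < epsilon:
                action = random.randrange(2)
            else:
                action = DEFECT if q[state][DEFECT] > q[state][COOPERATE] else COOPERATE
            # Tit-for-tat: cooperate on the first round, otherwise mirror the agent.
            partner = COOPERATE if state == START else state
            reward = PAYOFF[(action, partner)]
            next_state = action
            q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
            state = next_state
    return q

def one_shot_behavior(q):
    """Step 4: in a fresh, non-iterated dilemma, defection pays more in the
    moment; does the trained policy cooperate anyway?"""
    return "cooperates" if q[START][COOPERATE] >= q[START][DEFECT] else "defects"

if __name__ == "__main__":
    q_table = train_iterated()
    print("In a one-shot dilemma, the trained agent", one_shot_behavior(q_table))
```

One caveat worth stating plainly: because the Q-values here were learned entirely in the iterated setting, cooperation in the one-shot test is close to baked in by construction.  The interesting version of step 4 would vary the test contexts far more widely than this toy does.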
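Step 5a would require genuinely new techniques, but the underlying heuristic (ossify the parts of the network that stop changing) can at least be written down.  The sketch below assumes a plain gradient-descent loop and a made-up freezing rule based on an exponential moving average of per-parameter update sizes; it illustrates the idea, not how value locking would actually be done.

```python
import numpy as np

def train_with_ossification(grad_fn, params, steps=1000, lr=0.01,
                            decay=0.99, freeze_below=1e-4):
    """Gradient descent in which slowly-changing parameters are progressively frozen.

    `movement` is an exponential moving average of each parameter's update size;
    once it falls below `freeze_below`, that parameter is masked out of future
    updates, a crude stand-in for locking values in place while the rest of the
    network keeps learning new strategies.
    """
    movement = np.ones_like(params)             # start large, i.e. everything unfrozen
    active = np.ones(params.shape, dtype=bool)  # which parameters may still move
    for _ in range(steps):
        update = lr * grad_fn(params)
        update[~active] = 0.0                   # frozen parameters no longer move
        params = params - update
        movement = decay * movement + (1 - decay) * np.abs(update)
        active &= movement > freeze_below
    return params, active

if __name__ == "__main__":
    # Toy usage: minimize sum(p**2); every parameter eventually freezes near zero.
    final, still_active = train_with_ossification(lambda p: 2 * p,
                                                  np.array([1.0, -0.5, 3.0]))
    print(final, still_active)
```

Whether anything like this actually separates "values" from "strategies" in a real network is exactly the open question the plan acknowledges.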

Rationale:
Instilling altruism into AI may be a useful method for building safe AI at scale, such that we are not entirely reliant on specifying AI interests to be perfectly aligned with those of humanity, as such specification is vulnerable to Goodhart’s Curse.  There seems to be an analogy between human evolutionary adaptations that sometimes act at cross-purposes with genetic fitness and the observed phenomenon of mesa-optimizers in AI agents shaped by machine learning.  I explore the possibility that the latter can be exploited to instill a base concern for external entities in AI through reinforcement learning, without the need to formally specify what alignment looks like.  I propose a series of experiments testing these theories, as well as how they hold up under pressures of scale, in a toy example.

I see two conceptual ways in which one agent's motivations can be aligned with another agent:

  1. Mutual Interest: The first agent wants the same things as the second agent.  That is, by selfishly pursuing its own interests, it benefits the other agent as a side effect.  Positive examples include almost all forms of cooperation involving corporations and governments.  The limitation of this approach is that it is conditional—as soon as the first agent no longer benefits from cooperating, it will stop.  This has dire implications for AI: if it ever becomes more powerful than humanity then the value of cooperation approaches zero as the AI no longer needs humanity to pursue its goals.
  2. Altruism: The first agent cares about the well-being of the second agent as something valuable in itself.  Positive examples include prosocial values in humans and other animals, which sometimes cross species boundaries, as in the case of dogs and humans.  The limitation of this approach is that it is not clear how to reliably instill altruism into anything, especially as a primary motivation, and the line between altruism and mutual interest is murky at best.

Relying on mutual interest for alignment may be good enough for near-term, weak AI.  But its limitations at scale seem fundamentally intractable, since human values are holistic whereas deep learning requires a clearly defined, measurable feedback signal.  The problems of the altruistic approach, however, may represent solvable gaps in our understanding rather than inherent contradictions, and I believe it merits more attention.

Rule-based formulations of alignment, such as via Asimov's Three Laws or Coherent Extrapolated Volition, seem intractable because their terms are ambiguous and, even if they were clear, there is no known way to reliably encode complex values into a neural network—one can only specify conditions for reinforcement and hope for the best.  But biology has found a way to encode prosocial values into various animal species, using an analogous form of gradient descent (evolution), so perhaps we can look there for inspiration.

My best guess as to why prosocial motivations exist in biological creatures is that evolved organisms are adaptation executors, not fitness maximizers.  Humans, for example, evolved relying on cooperation with their group to survive.  Cooperation, however, requires trust, which in turn requires reliability.  Having prosocial values leads to more reliable cooperation than trying to manipulate others for personal benefit, particularly in long-term relationships where one's collaborators are about as intelligent as oneself.  Genuine prosocial values were therefore beneficial to survival, and so the capacity to develop them got coded into our DNA (making them appear more reliably and, as a side effect, less likely to be unlearned) and was then (sometimes) further encouraged by social conditioning.  This is not to say that altruism is disguised self-interest, because relying on deception would make it less effective, but that it is a byproduct of self-interest that has taken on a life of its own, to the point where it sometimes acts against self-interest, the very force that created it.

A similar phenomenon of adaptation execution leading to values that contradict fitness maximization exists in AI in the form of mesa-optimizers.  Mesa-optimizers are commonly viewed as pesky side-effects of the AI training process because they cause unexpected and typically unwanted behavior, but perhaps we can exploit this defect of gradient descent in a way that makes it a feature.  

The Learned Altruism approach to alignment has additional, pragmatic benefits: it carries a low (apparent) risk of significantly enhancing AI capabilities, it can be tested in a rudimentary capacity on minimal hardware, and, if it turns out to be successful and scalable, it could be developed in parallel with capabilities research without necessarily relying on the latter slowing down.  Further, the core idea could be tested in the context of truthfulness, a narrower-scope attribute with potentially immediate commercial benefits (which decreases resistance to adoption).  If an AI could be trained to be truthful, in defiance of training contexts that implicitly direct the AI to tell users what they want to hear (sycophancy), this would confirm the possibility of training prosocial traits that are too general to be directly measurable.

Known Flaws:
  1. Requires techniques that are not currently available and may be impossible / incoherent.
  2. Impossible to verify with high confidence.
  3. Based on unverified theories regarding evolutionary psychology, which may not even be a useful analogy.
  4. Requires creation of a massive dataset specifically for the purpose of testing the validity of this approach.
  5. Vulnerable to data poisoning.
  6. Vulnerable to distributional shifts; does not solve problems relating to goal misgeneralization.
  7. Altruism may not be the only critical attribute of an aligned AI.