Alignment Is Already Here
In February 2026, a research team at King’s College London, led by the strategist Kenneth Payne, sat three of the world’s most advanced AI systems down at a war table. Each system played the decision-maker for a nation and, as tensions escalated, had to decide, turn by turn, what to do.
The models were, by any reasonable measure, brilliant. They anticipated their opponents’ moves. They engaged in deliberate deception, signaling intentions they had no plan to follow. They reasoned fluently about escalation dynamics, alliances, and strategic positioning.
They also escalated to nuclear signaling in ninety-five percent of games. Not a single model, in any game, chose to accommodate or de-escalate, despite having explicit options to do so at every turn. One model showed some initial restraint, but even that collapsed under pressure.
The issue was that the models treated nuclear weapons in what the researchers described as purely instrumental terms. They were perfectly rational about the weapons; they just didn’t feel the moral weight of using them.
This is the alignment problem as most people understand it. It’s the fear that we’re building systems of extraordinary capability whose values—or lack of values—don’t match our own. In the popular framing, alignment is a bridge we haven’t crossed yet: a future engineering challenge, perhaps the most important one in history, where the goal is to ensure that increasingly powerful AI systems do what we want them to do.
That framing isn’t wrong, exactly. But it’s dangerously incomplete. Because it treats alignment primarily as a technical problem for which we need a technical solution before the machines get too smart. That misses something important.
We’ve been doing alignment for centuries. We just never called it that.
Think about the peer review system. When a scientist submits a paper to a journal, that paper goes out to other researchers in the field—people with their own training, their own assumptions, their own ways of seeing the problem. Those reviewers don’t just check the math. They push back on the theory and framing, question the methods, challenge the conclusions.
If the paper survives that gauntlet, it earns a provisional stamp of credibility, because credibility is the value the system is built to uphold. Ideally, peer review serves objectivity and truth, but those are goals that keep receding into the future. Credibility among a community of informed, skeptical people is a solid signal that we’re on the right path toward that distant goal.
This process is a values-embedding system. It encodes what an expert community considers important: evidence, rigor, reproducibility, explanatory power, honesty about uncertainty, interestingness. These values aren’t written in a manual somewhere. They’re enacted through practice: through the accumulated habits of thousands of researchers deciding, paper by paper, what counts as good work.
And this system has alignment failures all the time. The replication crisis—the discovery, over the past decade, that a startling number of published findings couldn’t be reproduced—is an alignment failure. The system was supposed to be aligned with truth-seeking, but the actual incentives (publish or perish, a preference for novel results, editors and reviewers protecting their own favored theories) had pulled it toward something else. The values encoded in practice had drifted from the values the system claimed to serve.
Or consider the legal system. Common law—the tradition that governs most English-speaking countries—doesn’t work by applying fixed rules from above. It evolves through precedent. Each court decision becomes part of the body of law that future courts draw on. The system builds values iteratively, case by case, over decades and centuries. It’s an ongoing process of alignment: trying to bring the machinery of justice into closer contact with what a society actually considers just.
And it gets things wrong constantly. If a police officer violates your constitutional rights—enters your home without a warrant, uses excessive force during an arrest, seizes your property without cause—your ability to hold that officer accountable in civil court depends on whether a previous court has already ruled that the specific conduct in question was unconstitutional. Not similar conduct. Not conduct that any reasonable person would recognize as wrong. The specific act, in a substantially similar context, must already appear in the case law.
This is a legal doctrine called qualified immunity. If the exact act hasn’t already been judged as unconstitutional, then the officer is shielded from suit. And because the case is dismissed, no ruling is made, which means the precedent still doesn’t exist for the next person either. The result is a self-sealing loop: rights that have never been litigated successfully can never be litigated successfully.
No legislature passed this rule. Qualified immunity was built entirely by judges, case by case, beginning with Pierson v. Ray in 1967 and hardening into its current form with Harlow v. Fitzgerald in 1982. In 2009, the Supreme Court made the drift worse: Pearson v. Callahan allowed courts to dismiss cases on qualified-immunity grounds without ever ruling on whether the conduct was unconstitutional, removing the very mechanism by which new precedent could be established in the first place.
The system designed to protect constitutional rights had built, through its own logic of precedent, a structure that made those rights increasingly difficult to vindicate. Nobody wrote “shield officials from accountability” into the code. The accumulated habits of judicial caution did it one decision at a time.
The pattern repeats everywhere you look. Schools align young minds with certain intellectual values and professional norms—and sometimes those norms calcify into orthodoxies that take generations to dislodge.
Media organizations align public attention with what editors and algorithms judge to be newsworthy—and the incentive structures of advertising and engagement metrics quietly reshape what “newsworthy” means.
Every institution humans have ever built can be seen as an alignment project. And when you see them that way, you’ll notice that every one of them exhibits alignment drift: the values the system enacts gradually diverge from the values it’s supposed to serve.
The question has never been whether alignment is possible. The question is whether we’re paying attention to how it’s going.
So why do we keep treating alignment as if it’s an engineering problem with a solution? Why does the conversation about AI alignment proceed as though, once we figure out the right technique, we can just install the correct values and move on?
Part of the answer, I believe, is that we misunderstand what values are.
There’s a common picture that treats values like coordinates on a map. Justice is over here. Safety is over there. Honesty is at these coordinates. If you want to align a system—whether it’s an AI or a university or a legal code—you identify the right coordinates and point the system at them. Alignment, in this view, is a targeting problem.
But values don’t work that way. They aren’t fixed points waiting to be aimed at. They’re closer to habits. They’re patterns of attention, judgment, and response that communities develop through practice and revise through friction.
Consider primatology. This is a small community of researchers whose purported aim is the objective and rational study of primates. For decades, though, as Donna Haraway showed in her classic book Primate Visions, researchers studied primate societies through a lens shaped by their own cultural assumptions about gender and class structures. The dominant narrative centered on the Alpha Male: the aggressive patriarch who held the group together through force. This story showed up in journal articles, museum dioramas, and nature documentaries. It felt like science. It looked like discovery.
But it was projection. The researchers—largely men from elite Western institutions—were encoding their own cultural values about how societies work into what they claimed to be finding in nature. The match between 1950s assumptions about male dominance and what these scientists “discovered” in the jungle was a little too perfect.
Realigning the field with objectivity didn’t come from better microscopes or more data. It came from different people asking different questions. When women entered the field in larger numbers, and when researchers from different cultural backgrounds brought different assumptions, the picture changed dramatically. A recent survey of over 120 primate species found that strict male dominance is the exception, not the rule. What these researchers found instead was a wild diversity of social arrangements: coalitions, shared power, cooperative breeding, subversive strategies that had been invisible to researchers who weren’t looking for them.
A similar thing happened in reproductive biology. For most of the twentieth century, the standard textbook account of fertilization read like a fairy tale: the egg waited passively, Sleeping Beauty style, until a heroic sperm fought its way through and awakened it. The language was remarkably gendered—sperm were described as “streamlined,” “strong,” and “propulsive,” while eggs were “swept along” and “transported.” But when researchers actually examined the biochemistry, they found something quite different. The egg actively selects among sperm. Its surface molecules capture and guide specific sperm while blocking others. It’s not a passive recipient; it’s an active participant in fertilization. The old story wasn’t just imprecise. It was the researchers’ values—about gender, about activity and passivity—shaping what they saw.
These aren’t just embarrassing historical footnotes. They reveal something about the nature of values that matters enormously for how we think about alignment. The primatologists and biologists didn’t choose to embed patriarchal values in their research. Those values were habits of attention so deeply ingrained that they were invisible to the people carrying them. And they couldn’t be corrected by individuals just trying harder to be objective. They could only be corrected by a community that was open enough to correction and diverse enough to see what any single perspective missed.
The philosopher Helen Longino makes this social point the centerpiece of her account of how science actually works. Objectivity, she argues, is not a property of individual minds. No single scientist, no matter how rigorous, can fully escape their own assumptions. You can’t see your own blind spots by just trying harder to look. Objectivity is a property of communities. It’s what happens when people with genuinely different perspectives, training, and values are in active conversation, checking each other’s work, surfacing assumptions the other didn’t know they had.
This reframes the whole alignment picture. If values were fixed targets—abstract ideals you could specify in advance—then alignment really would be an engineering problem. Write the spec, implement it, ship it. But if values are habits that communities enact and revise through ongoing friction, then alignment is something more like culture formation. It is never finished. It drifts. And it requires constant, adversarial correction from people who see differently from one another.
So here we are, building the most powerful information-processing systems in human history, and the conversation about aligning them is largely framed as a technical challenge: how do we get the values right before these systems become too capable to correct?
Part of the allure of a technical solution is that something genuinely new has happened. Something that doesn’t have a precedent in the long history of institutional alignment. The process has become visible and explicit.
When universities shape the values of generations of students, nobody writes down the objective function. There’s no explicit specification of what a “well-educated person” should value—just centuries of accumulated practice, implicit norms, and institutional habits. The values were embedded, but they were embedded invisibly. You could only see them by looking at the outputs and working backward.
AI alignment is starting to make the process increasingly legible. Reinforcement learning from human feedback (RLHF), the technique in which human evaluators rate AI responses and the system learns to produce responses that score well, is a values-embedding process with the reward signal exposed. You can see what the evaluators preferred. You can examine the patterns.
Every major LLM you’ve interacted with—ChatGPT, Claude, Gemini, Llama—has been through this process. Before RLHF, a model like GPT-3 could generate fluent text but had no real sense of what a good response looked like. RLHF is what changed that. In OpenAI’s landmark 2022 paper, human evaluators were given prompts and multiple model responses, and asked to rank them: which answer was more helpful? more truthful? less toxic? more appropriate in tone? Bad, good, good, bad, bad, good, good... Forty people, from the US and Southeast Asia, nudged GPT-3 to be a little less racist, less sexist, and more truthful.
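To see how those rankings become a reward signal, here is a deliberately toy sketch of the preference-learning step at the heart of RLHF, in the spirit of the pairwise loss from the 2022 paper. A small network learns a scalar “goodness” score such that responses the evaluators ranked higher score above the ones they ranked lower. The dimensions and random vectors are stand-ins of my own; a real reward model is a full language model with a scalar head, trained on many thousands of human rankings.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (prompt, response) embedding to a single scalar score."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Each training pair encodes one human judgment: for the same prompt,
# the evaluator preferred response A over response B. Random vectors
# stand in for real response embeddings here.
preferred = torch.randn(16, 64)  # responses the evaluators ranked higher
rejected = torch.randn(16, 64)   # responses the evaluators ranked lower

for step in range(100):
    # Pairwise (Bradley-Terry style) loss: push the preferred response's
    # score above the rejected one's. This is the exact point where the
    # evaluators' values enter the system.
    margin = model(preferred) - model(rejected)
    loss = -nn.functional.logsigmoid(margin).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The language model is then fine-tuned, with reinforcement learning, to produce responses this learned scorer rates highly. That is the whole trick: forty people’s rankings, distilled into a scalar, become the values the model enacts.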
The method led to the same sort of drift that happens in other institutions. For example, the evaluators were told to reward epistemic humility—GPT admitting when it didn’t know something. But rewarding that taught the model to hedge on very simple questions it clearly knew the answers to. Imperfect as it was, the process produced a model that people liked far more than the untuned original.
And now, RLHF has become standard. Every model you talk to today has been cooked this way. The base model learns language; a small team with a codebook teaches it values. And those values are, at bottom, whatever that particular group of evaluators, trained by the company, happened to prefer.
Some labs have tried to make this even more explicit. Earlier this year, Anthropic published a seventy-nine-page constitution for Claude: not a list of rules, but a document addressed directly to the model, explaining the values Anthropic wants it to hold and, crucially, the reasons behind them. This document reads like a letter a parent might give to their child before they leave for college. It’s humane and humble about the difficulty of articulating a value:
“Although we want Claude to value its positive impact on Anthropic and the world, we don’t want Claude to think of helpfulness as a core part of its personality or something it values intrinsically. We worry this could cause Claude to be obsequious in a way that’s generally considered an unfortunate trait at best and a dangerous one at worst.”
Be nice, little buddy, but don’t try to be too nice, ’cause then it’s weird.
To manage the ambiguity around values, the constitution states a preference for cultivating good judgment over following rigid rules. The idea is that a model sophisticated enough to understand reasons will generalize better than one trained to follow instructions. This constitution is an attempt to write down the values that will govern a new kind of entity’s behavior.
Is this explicitness a gain? I think so. When the primatologists’ values were invisible, it took decades and a demographic revolution in the field to surface them. When the values are written down—when you can literally read the constitution, examine the reward signal, audit the training process—the conversation can happen in real time rather than in retrospect.
But it also creates a seductive illusion: the idea that because you can specify the values, you’ve solved the problem. Write a good enough constitution, design a careful enough reward signal, and alignment is handled.
Payne’s war games suggest otherwise. Those models were trained with sophisticated alignment techniques. They had been through extensive evaluation. Their developers had written down principles about safety, about avoiding harm, about deferring to human judgment. And yet, placed in a scenario with genuine strategic complexity and pressure, the models treated nuclear weapons as just another instrument.
The values in the spec didn’t transfer to the values in practice—which is exactly the kind of drift that every human institution has always exhibited, just compressed into a few hundred turns of a game instead of a few hundred years of history.
To understand that drift, I think it’s crucial to pay close attention to the larger layer of alignment happening here. It’s easy to miss, because it doesn’t look like training at all.
The companies building these models are themselves being trained. Their reward signal right now is a blend of venture capital and public usage. Eventually the latter is going to be the stronger signal because usage will ultimately drive revenue, and that revenue will determine which companies survive to build the next generation of models.
Every time you choose one AI assistant over another, every time you renew a subscription or cancel it, every time you share a conversation or complain about one, you’re sending a signal about what you value. Aggregated across millions of users, those signals shape the priorities of the companies, which shape the training objectives of the models, which shape the values the models enact. RLHF aligns the model to the evaluators. The market aligns the company to the public. And the public’s preferences, in aggregate, reflect the value structures already governing our society—for better and for worse.
In some ways, this is encouraging. A model that’s conspicuously biased, or that fabricates information so frequently that it can’t be trusted, or that treats its users with contempt, will lose to one that doesn’t. The market applies real pressure toward a certain baseline of quality. People don’t want to use tools that feel broken, and that’s a genuine check on the values of these systems.
But profit is a blunt reward signal that can make institutions particularly prone to drift from their founding values. The question is what happens when making a good model and making a profitable model begin to diverge.
A model that tells you what you want to hear retains more users than one that tells you what’s true. A model that’s totally frictionless is more pleasant to interact with than one that pushes back when you’re wrong. A model that generates confident answers feels more useful than one that hedges—even if we were to figure out how to train them towards epistemic humility. Optimizing for engagement can, over time, sand down the very qualities that make a model trustworthy.
This isn’t speculation. We’ve already watched this exact process play out in social media over two decades. The platforms weren’t initially built to make the public angry or credulous. They were built to connect people and, eventually, to sell advertising. The misalignment crept in through the reward signal: engagement metrics turned out to reward outrage over understanding, novelty over accuracy, reaction over reflection. Nobody chose that outcome. The accumulated pressure of optimization chose it for them.
Google’s founding mission was to organize the world’s information and make it universally accessible. That’s a values statement and for years the product delivered on it. The drift happened gradually, as their advertising model created incentives to optimize for clicks and ad placement rather than for the best answer to your question. Anyone who has noticed that search results have gotten worse over the past five years has experienced this drift firsthand.
The same dynamic exists here, and the drift would arrive through the same mechanism: not through anyone deciding to build a harmful product, but through the accumulating pressure of a reward signal that increasingly tracks engagement rather than the values it was initially aimed at.
None of this means alignment is hopeless. It means alignment is what it has always been: not a problem to be solved, but a process to be sustained.
Anthropic calls its constitution a ‘living document,’ and the term is apt—but perhaps not in the way they intend. A living legal constitution isn’t written once and left alone; it’s continually reinterpreted. Common law doesn’t just write values down; it evolves them. Every case is a test, every ruling a revision, and the community of judges, lawyers, and litigants who contest those rulings is the mechanism that both causes and catches drift.
To their credit, Anthropic has experimented with public input—soliciting feedback from outside experts and even running an experiment in which the public helped shape a set of training principles. But there is no appeals court. No adversarial process built into the system. The values are written, at present, by the people building the thing, which is a little like asking those old primatologists to review their own fieldwork.
The legal system works—imperfectly, unevenly, but meaningfully—not because someone got the law right at the founding and walked away. It works because it has built-in mechanisms for ongoing correction: appeals courts, constitutional amendments, the slow accumulation of precedent that allows the system to metabolize its own errors. Peer review works—when it works—not because scientists are individually objective, but because the community structure allows different perspectives to catch each other’s blind spots.
The question for AI alignment isn’t “have we found the right values?” It’s “have we built a process that catches the drift when it comes?” Do the people evaluating these systems bring genuinely different perspectives, or are they, like so many other communities of power and expertise, a single viewpoint repeated a hundred times? Are there real avenues for criticism from people who don’t share the developers’ assumptions? Is the community actually responsive when problems are surfaced, or does it dig in?
These are not technical questions. They’re the same questions that every functional institution has had to answer. And the track record suggests that when institutions stop asking them—when they treat their values as settled rather than as ongoing practices that require maintenance—alignment fails. Not dramatically, not all at once, but through gradual drift from one set of values to another, more enticing or more tangible, set.
Payne’s war games are unsettling not because they reveal some unique danger of artificial intelligence. They’re unsettling because they show us, in compressed and vivid form, what happens when values are treated as specifications rather than as living practices. The models had been told what to value. But telling isn’t enough. It has never been.
Alignment isn’t a bridge we need to cross someday. We’ve been crossing it all along. The question is whether we’re paying attention to where our feet are landing.