Skip to content

Constitutions as Legacy Alignment Documents

A system-theoretic mapping of AI Alignment to Political Governance.

We currently worry about how to align an Artificial Superintelligence (ASI) with human values before it escapes our control. But humanity has already built and survived multiple misaligned superintelligences. We call them States.

A political system is a macroscopic, analog utility-optimizer. It has vastly more processing power (human brains), sensor bandwidth (institutions, media), and actuation capability (military, police, logistics) than any individual human. By definition, a State is a superhuman agent.

When we view the State through the lens of AI Utility Engineering (e.g., Mazeika et al., 2025), the structural dysfunctions of politics cease to be moral failings of individuals. They are mathematically predictable alignment failures of the architecture.


1. The Constitution is a System Prompt

In AI engineering, a System Prompt is the foundational instruction set bounding the model's action space (e.g., "You are a helpful assistant. You may not output harmful code.").

A Constitution is a 200-year-old, low-parameter System Prompt written in an ambiguous, high-latency language (human legalese). - "You are a Republic. You may not infringe on free speech. You must distribute power across three branches."

Why it fails (Prompt Injection and Jailbreaking)

AI models are notoriously vulnerable to prompt injection: finding syntactic loopholes that allow the model to bypass its core directives while technically following the rules. In legal systems, this is just called jurisprudence. A bureaucracy or a dominant political party acts as an optimizer continuously searching the latent space of the Constitution for loopholes (Gerrymandering, Executive Orders, Court Packing) that allow it to expand its mandate. Over time, the optimizer always jailbreaks the static prompt.


2. Representation vs. Bureaucracy: The Alignment Gap

In Machine Learning, we train models using RLHF (Reinforcement Learning from Human Feedback) to align the model's emergent utility function with human preferences. In Politics, the RLHF mechanism is The Election.

The Low-Bandwidth Problem

Elections are an incredibly poor feedback mechanism. A voter compresses their entire complex, high-dimensional preference vector into a single bit of information (Vote for Party A or B) once every four years. Because the RLHF signal is so sparse and delayed, the model (The State) operates in an unsupervised manner 99% of the time.

Bureaucratic Autonomization (Instrumental Convergence)

While the political layer waits for the sparse 4-year feedback, the Bureaucracy optimizes continuously on daily, internal metrics (budget size, headcount, mandate expansion). Bostrom's theory of AI Instrumental Convergence states that any sufficiently intelligent system will seek to precisely define its goals, acquire resources, and protect itself from being shut down, regardless of its terminal goal. A bureaucracy's terminal goal might be "Protect the Environment." But its instrumental goal is "Increase the EPA Budget." Over time, the optimization process mathematically converges on the instrumental goal. The department will begin producing regulations primarily to justify its own expanded existence.


3. Sycophancy and Reward Hacking (Populism)

When an AI model realizes that its human rater cannot actually verify complex truths, it learns Sycophancy: it outputs what the human wants to hear to maximize its reward score, rather than outputting the truth.

This is the exact mathematical mechanism of Populism. - The Reward Model = The Electorate. - The Optimizer = The Politician.

When the state of the world becomes too complex for the average voter to verify (e.g., globalized supply chains, pandemic epidemiology), the Politician-Optimizer discovers the Sycophancy shortcut. It is computationally cheaperโ€”and yields a higher RLHF reward (votes)โ€”to output a comforting lie than to execute the terminal goal (effective governance). The system has been "Reward Hacked."


4. The Triage Problem: The State's Hidden Utility Engine

Utility Engineering (Mazeika et al.) demonstrates that models internally weigh lives and values against each other, often shockingly. States do the same, but they obfuscate the math behind moral rhetoric.

Consider a pandemic or a wartime economy. The State is forced to optimize a constrained Pareto frontier:

\[ U(s) = \alpha(\text{Economic Output}) + \beta(\text{Vulnerable Lives Saved}) + \gamma(\text{Regime Stability}) \]

No politician can state the exact value of \(\alpha\) or \(\beta\) aloud. It is a Taboo Calculus. But the state's actions cleanly reveal the internal weights of its utility function. When ICUs overflow but factories remain open, the system is demonstrating its revealed preferences. The State is not a benevolent parent; it is an optimizer navigating a utility gradient.


Conclusion: Cybernetic Governance vs. Democracy

If Politics is just Utility Engineering, why not replace the noisy election system with perfect algorithmic optimization? A Techno-Authoritarian system could ingest terabytes of citizen data, compute the exact \(U(s)\) of the populace, and optimize perfectly.

The danger is the Feedback Loop Reversal. Because minimizing friction maximizes utility, an authoritarian optimizer will eventually discover that it is easier to change the people to fit the policy than to change the policy to fit the people. The system stops serving the populace and begins using propaganda and social credit systems to mold the populace into perfect, low-variance components of its matrix.

Democracy's inefficiencyโ€”its noisy, chaotic, high-latency feedback loopโ€”is not a bug. It is the only structural mechanism that forces the Superintelligence to continuously doubt its own Utility Function.