Grokking: The Phase Transition from Memory to Understanding¶
How neural networks spontaneously jump from rote-memorization algorithms to generalizing intelligence, inspired by Power et al. (2022).
The Illusion of Overfitting¶
In the classical machine learning paradigm, the relationship between training data and testing data is treated as a delicate balance. If a model performs perfectly on the data it has seen during training, but terribly on unseen test data, it has overfitted. It has constructed a giant, fragile lookup table instead of learning a rule.
Historically, researchers practiced "early stopping"—halting training the moment test loss began to degrade, assuming the model was ruined by over-memorization.
In a landmark 2022 ArXiv paper, Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, researchers at OpenAI observed a phenomenon that shattered this intuition.
When training small models on algorithmic tasks (like modular addition), the researchers allowed the model to overfit completely. Train accuracy hit 100%. Test accuracy stayed at random guessing. But instead of stopping, they kept training. They let the optimization algorithm run for thousands of epochs past the point of apparent failure.
Suddenly a thermodynamic-like phase transition occurred. Without any new data, the test loss plummeted, and the model generalized perfectly to unseen examples. The network had "grokked" the underlying principle.
The Mechanics of Grokking¶
Why does this happen? The process occurs in three distinct phases:
- Memorization (The Easy Path): The network quickly realizes that building a lookup table of the training data is the fastest way to drop the training loss to zero. It overfits.
- The Random Walk (The Plateau): The train loss is zero, so the gradient from the data is gone. However, optimizers use Weight Decay (L2 regularization), a mathematical penalty for complexity. Weight decay constantly nudges the network to find simpler representations of the same lookup table. The network engages in a random walk across flat valleys in the loss landscape.
- The Phase Transition (Grokking): Eventually, the random walk stumbles upon the true, underlying algorithm. A mathematical formula (a structured graph) is much simpler and "lighter" (in terms of weight norm) than a massive lookup table. Weight decay grabs onto this newly discovered, simpler representation and violently collapses the memorization circuits. Suddenly, the model generalizes to the test set perfectly.
Intelligence as Compression¶
Grokking is a concrete phenomenon that strongly supports the philosophical lens discussed in Emergence and the Origin of Intelligence: Intelligence is Compression. It is not a general proof of that claim across all forms of intelligence; it is an existence proof that, in some settings, compression-like simplicity pressures can coincide with a sudden jump to generalization.
The lookup table is uncompressed data. The generalizing algorithm (the "grokking") is the compressed mechanism behind that data.
To grok is to compress successfully. The phase transition from high test loss to low test loss is the exact moment when mere data points condense into pure intelligence.
If we want to build robust AI, we have to recognize that true understanding is often hiding just past the horizon of catastrophic local minima. The mind—both biological and artificial—sometimes needs a long, seemingly unproductive plateau of random noise to discard its memorized prejudices and finally collapse onto the truth.
Generator reading¶
Grokking is the cleanest empirical demonstration in this repository of the project's organizing question. The reading is direct.
The Generator Question names a forward/inverse asymmetry: running a generator forward (rules → trace) is cheap, recovering a generator from a trace (trace → rules) is structurally hard. Almost every simulation in the repository is on the forward side. Grokking is on the inverse side, and it shows what the inverse direction looks like inside a learning system.
Before the phase transition, the network has stored its training set. The weights encode a lookup table: each (input, output) pair memorized as a particular pattern of activations. There is no generator. There is only the trace. Test accuracy is at chance because the trace contains no compressible structure that extends beyond the examples.
After the phase transition, the network has replaced the stored trace with an approximation of the generator that produced it. The modular-arithmetic algorithm — small, structured, and lighter in weight norm than any lookup table — has been found. Test accuracy is perfect because the recovered generator extends naturally to inputs the network has never seen.
The transition is sharp and discontinuous. This is the structural point: generator approximation is not a smooth compression of stored data. It is a qualitative shift from one regime (storing the trace) to another (running an approximated generator). The lookup table is not gradually compressed into the algorithm. It is displaced by it.
Why this matters for the spine¶
Most of the inverse direction is conceptual or scaffold-stage in the repository:
- The Trace to Generator essay describes the problem.
- Open Problem 11 states it formally.
- The
lab/experiments/trace_to_generator/scaffold provides a toy testbed. - The Agentic Identity Suite attempts to distinguish trace-memorizers from generator-approximators behaviorally, but on mock embeddings.
Grokking is different. Grokking is a system, runnable inside the repository, in which the inverse direction actually happens — and is observable as a sharp transition in a measurable quantity (test loss). The plateau before grokking is what the practical hardness of generator search looks like from the inside. The transition itself is what successful generator approximation looks like.
This does not generalize to all of intelligence. Grokking has been demonstrated for small algorithmic tasks. Whether the same kind of transition exists for natural language modelling, multi-agent identity formation, or any of the other settings the project cares about is an open empirical question — and would, if found, be one of the most consequential results in the project's frame.
What grokking does not show¶
It is worth being explicit about the limits.
- The grokking network does not prove it has found the correct generator. It exhibits behavior consistent with having approximated one. Gödel's incompleteness rules out internal certification of correctness for any system of sufficient power; the network is no exception.
- The generator the network finds is not provably minimal. It is lighter than the lookup table (in weight norm under L2 regularization) but Kolmogorov complexity is uncomputable; the network cannot verify that no even-simpler generator exists.
- The transition is observed for specific algorithmic tasks with structured underlying rules. There is no result showing that grokking occurs for unstructured or genuinely random target functions.
- The transition is sensitive to optimizer choice, weight decay strength, and architecture. It is not yet a controlled phenomenon, let alone a guarantee.
A second reading: phase transition under the foundational assumption¶
Under the project's [FOUNDATIONAL ASSUMPTION] that generator reconstruction is, in general, not efficiently tractable, the pre-grokking plateau acquires a specific meaning. It is not a failure of training. It is the trace of a hard search problem being executed by gradient descent inside a constrained architecture. The search is slow because the search space is large; the search succeeds because the regularizer (weight decay) prefers simpler representations once they exist; the success is sudden because phase transitions in this kind of optimization landscape are not gradual.
If P = NP turned out to be the case (in a practically useful sense), the plateau before grokking would be a curiosity of optimization, not a structural feature. As things stand — see The Generator Question for the foundational-assumption framing — the plateau is exactly what the spine predicts: the inverse direction is hard, and its hardness is visible as the long stretch of apparent non-progress before the system discovers a generator that extends.
Whatever else grokking is, it is the project's single best empirical example of the inverse direction occurring.