Skip to content

Grokking: The Phase Transition from Memory to Understanding

How neural networks spontaneously jump from rote-memorization algorithms to generalizing intelligence, inspired by Power et al. (2022).


The Illusion of Overfitting

In the classical machine learning paradigm, the relationship between training data and testing data is treated as a delicate balance. If a model performs perfectly on the data it has seen during training, but terribly on unseen test data, it has overfitted. It has constructed a giant, fragile lookup table instead of learning a rule.

Historically, researchers practiced "early stopping"โ€”halting training the moment test loss began to degrade, assuming the model was ruined by over-memorization.

In a landmark 2022 ArXiv paper, Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, researchers at OpenAI observed a phenomenon that shattered this intuition.

When training small models on algorithmic tasks (like modular addition), the researchers allowed the model to overfit completely. Train accuracy hit 100%. Test accuracy stayed at random guessing. But instead of stopping, they kept training. They let the optimization algorithm run for thousands of epochs past the point of apparent failure.

Suddenly a thermodynamic-like phase transition occurred. Without any new data, the test loss plummeted, and the model generalized perfectly to unseen examples. The network had "grokked" the underlying principle.

The Mechanics of Grokking

Why does this happen? The process occurs in three distinct phases:

  1. Memorization (The Easy Path): The network quickly realizes that building a lookup table of the training data is the fastest way to drop the training loss to zero. It overfits.
  2. The Random Walk (The Plateau): The train loss is zero, so the gradient from the data is gone. However, optimizers use Weight Decay (L2 regularization), a mathematical penalty for complexity. Weight decay constantly nudges the network to find simpler representations of the same lookup table. The network engages in a random walk across flat valleys in the loss landscape.
  3. The Phase Transition (Grokking): Eventually, the random walk stumbles upon the true, underlying algorithm. A mathematical formula (a structured graph) is much simpler and "lighter" (in terms of weight norm) than a massive lookup table. Weight decay grabs onto this newly discovered, simpler representation and violently collapses the memorization circuits. Suddenly, the model generalizes to the test set perfectly.

Intelligence as Compression

Grokking is a concrete phenomenon that strongly supports the philosophical lens discussed in Emergence and the Origin of Intelligence: Intelligence is Compression. It is not a general proof of that claim across all forms of intelligence; it is an existence proof that, in some settings, compression-like simplicity pressures can coincide with a sudden jump to generalization.

The lookup table is uncompressed data. The generalizing algorithm (the "grokking") is the compressed mechanism behind that data.

To grok is to compress successfully. The phase transition from high test loss to low test loss is the exact moment when mere data points condense into pure intelligence.

If we want to build robust AI, we have to recognize that true understanding is often hiding just past the horizon of catastrophic local minima. The mindโ€”both biological and artificialโ€”sometimes needs a long, seemingly unproductive plateau of random noise to discard its memorized prejudices and finally collapse onto the truth.