The Jagged Frontier

The strawberry test

Ask a large language model how many R’s are in the word “strawberry,” and for a long time it would answer wrong. The mistake became a small internet ritual: proof, apparently, that these systems were impressive but fundamentally hollow. The interpretation was almost exactly backwards.

The model does not fail at counting because it is stupid. It fails because it never sees the letters. Before a model processes text, the text is broken into tokens — subword chunks, so that “strawberry” arrives as one or two pieces rather than ten characters. Asking the model to count the R’s is like asking a fluent reader to count the pixels in a word they recognized at a glance. The information was discarded before the question began.
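A minimal sketch of that loss of information, under loud assumptions: `TOY_VOCAB` and `toy_tokenize` are invented for illustration and do not reproduce any real tokenizer or its vocabulary. The point is only that the model's input is a short list of opaque IDs, while the letters live in a view it is never handed.

```python
# Toy illustration only: TOY_VOCAB is a made-up vocabulary, not the
# vocabulary of any real model or tokenizer library.
TOY_VOCAB = {"straw": 101, "berry": 102, "s": 1, "t": 2, "r": 3,
             "a": 4, "w": 5, "b": 6, "e": 7, "y": 8}


def toy_tokenize(text: str) -> list[int]:
    """Greedy longest-match split of `text` into toy token IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


word = "strawberry"
char_view = list(word)            # what a human reader can inspect
token_view = toy_tokenize(word)   # what the model is actually given

print(char_view.count("r"))  # counting is trivial on characters
print(token_view)            # opaque IDs: the letters are not visible here
```

In this toy the word arrives as two IDs. Nothing in those two numbers says how many R's were inside the pieces they stand for; a model can only answer if it has separately learned the spelling of each piece.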

The same system that fluffs the letter-counting can do genuinely hard reasoning a sentence later. The two facts sit side by side and refuse to average out. These systems are jagged: superhuman at some things and subhuman at others in the same moment, in a profile that resembles no human’s. Most of the confusion in the public debate comes from trying to flatten that jagged frontier onto a single axis: a number, a date, a line that gets crossed.

This survey takes the opposite stance. Intelligence, as it appears in machines, is a profile rather than a scalar. Capability does not advance like an army crossing a border on a known day. It advances like a coastline: irregular, indented, extending fast in some directions and stalled in others, with no single moment at which “the line” is passed.

What scaling actually bought

For a few years the story was simple, and the simplicity was earned. Between roughly 2020 and 2022, more compute, more data, and more parameters drove prediction error down in a way that was almost embarrassingly predictable. Double the inputs, watch the loss fall along a smooth curve. This is the period most people still have in mind when they imagine AI “getting smarter” — a dial labelled parameters, turned steadily upward.
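The smooth curve can be sketched as a power law in compute. This is an assumed functional form with invented constants (`IRREDUCIBLE_LOSS`, `SCALE`, `EXPONENT` are hypothetical and fitted to nothing); it is meant to show the shape of the claim, not to reproduce any published fit.

```python
# Illustrative constants only; they are not fitted to any real model family.
IRREDUCIBLE_LOSS = 1.7   # hypothetical floor the curve approaches
SCALE = 10.0             # hypothetical coefficient
EXPONENT = 0.1           # hypothetical power-law exponent


def predicted_loss(compute: float) -> float:
    """Loss under an assumed power law: floor + SCALE * compute ** -EXPONENT."""
    return IRREDUCIBLE_LOSS + SCALE * compute ** -EXPONENT


compute = 1.0
for _ in range(5):
    reducible = predicted_loss(compute) - IRREDUCIBLE_LOSS
    print(f"compute x{compute:>4.0f}  loss {predicted_loss(compute):.3f}  "
          f"reducible {reducible:.3f}")
    compute *= 2  # each doubling shrinks the reducible part by 2 ** -EXPONENT
```

Under this form every doubling multiplies the reducible part of the loss by the same factor, which is what makes the curve so easy to extrapolate. It also shows why returns feel expensive later: the same fractional gain costs twice as much compute each time.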

Two things then changed. The cheap returns got expensive: high-quality human text is roughly finite, and the improvement bought per dollar began to shrink, which is why “is pretraining hitting a wall” became a serious question rather than a contrarian pose. And the frontier moved off the parameter axis altogether — onto three others, none of which is model size.

It moved onto post-training with reinforcement learning, in which the model is rewarded not for predicting text but for producing answers that pass a check, especially in mathematics and code where a check is cheap and reliable. It moved onto test-time compute: letting the model think longer before answering, spending more steps on a hard problem, the approach behind the recent reasoning models. And it moved onto tool use and agency — the model calling a calculator, a search, a code interpreter, rather than carrying everything in its head.

The consequence is the lede of this whole section: intelligence stopped being a property of a model and became a property of a system — a model, plus verifiers, plus tools, plus a budget of computation. So the common worry, that “if we double the parameters we don’t know what will happen,” is half right in an instructive way. The smooth quantity, the prediction error, is quite predictable. Which specific abilities appear at which scale is not. There is even a credible argument that much of the apparent emergence — capabilities seeming to switch on suddenly at a certain size — is partly an artifact of measuring with all-or-nothing metrics. Prediction holds for the quantity we can measure and breaks for the thing we actually care about.
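The measurement argument can be illustrated with a few lines of arithmetic. The sketch assumes, purely for illustration, that an answer is `ANSWER_LENGTH` tokens long, that token errors are independent, and that per-token accuracy improves smoothly with scale; all numbers are hypothetical.

```python
ANSWER_LENGTH = 10  # hypothetical: the task needs 10 tokens all correct


def exact_match_rate(per_token_accuracy: float, length: int = ANSWER_LENGTH) -> float:
    """All-or-nothing score, assuming token errors are independent."""
    return per_token_accuracy ** length


# Hypothetical per-token accuracies for successively larger models.
for p in [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]:
    print(f"per-token {p:.2f}  ->  exact-match {exact_match_rate(p):.3f}")
```

The per-token column rises steadily, while the exact-match column sits near zero for most of the range and then climbs steeply. Scored only by exact match, the same smooth improvement would look like an ability switching on.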

The wall is not made of depth

Now the question underneath the others: are there theoretical limits, even if the networks are stacked ever deeper — “double deep,” in the phrase that keeps recurring? There are, and they are more interesting than the usual contest between hype and doom.

Depth alone does not buy what intuition promises. A single forward pass through a transformer, the architecture behind current models, sits in a low band of computational difficulty. In the language of circuit complexity, which classifies problems by how much parallel hardware they need, theoretical analyses place a forward pass near the floor (a class called TC⁰: roughly, what shallow, constant-depth circuits can do). Stacking more layers raises the ceiling on what one pass can compute, but only slowly, and does not escape that band. Some problems are believed to be inherently sequential, requiring step to follow step, and under standard complexity-theoretic conjectures no amount of width or depth in a single pass will solve them.

The escape hatch is not more depth. It is more serial computation: working step by step, writing intermediate results down, calling tools, looping. This is chain-of-thought, in which the model reasons out loud across many tokens instead of leaping to an answer. Given an unbounded number of such steps, a transformer becomes capable, in principle, of any computation a conventional computer can carry out. This is why the reasoning models work: they trade tokens, which is to say time, for the serial depth a single pass lacks. “Double deep” is the wrong lever; the lever is letting the system compute over more steps and check its own work.
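What "writing intermediate results down" buys can be sketched with a toy state-tracking task: a ball under one of three cups, and a list of swaps. The function names and the task are invented for illustration, and the sketch makes no formal complexity claim. It only shows the pattern: each step is trivial, but it needs the state the previous step wrote.

```python
def apply_swap(cups: tuple, swap: tuple) -> tuple:
    """One small step: exchange the contents of two cup positions."""
    i, j = swap
    state = list(cups)
    state[i], state[j] = state[j], state[i]
    return tuple(state)


def track_with_scratchpad(start: tuple, swaps: list) -> list:
    """Apply swaps one at a time, recording every intermediate state."""
    scratchpad = [start]
    for swap in swaps:
        # Each step reads only the last written state plus one new input.
        scratchpad.append(apply_swap(scratchpad[-1], swap))
    return scratchpad


start = ("ball", "empty", "empty")
swaps = [(0, 1), (1, 2), (0, 2), (0, 1)]
steps = track_with_scratchpad(start, swaps)
for n, state in enumerate(steps):
    print(n, state)
```

The scratchpad grows by one line per swap, so the cost is paid in length, not in the size of any single step. That is the trade the paragraph above describes: more tokens in exchange for serial depth.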

Stacking layers lifts the single-pass ceiling slowly but never clears it. Reasoning step by step trades time for the serial depth one pass lacks — and clears the same ceiling.

The deeper limit concerns what a system can learn and verify. Training by predicting the next token has strong built-in biases. Some functions are representable but, in practice, unlearnable — parity over many bits, deciding whether a long string contains an odd or even number of ones, is the textbook case. And imitation has a ceiling baked in: a model trained to predict human text is pulled toward the high-probability center of that text. It learns the average of what people wrote, not the exception.

How does a system exceed the human ceiling, then? By the move that defined AlphaZero — replacing imitation with a reward signal and letting the system discover what no human demonstrated. This is the master variable of the whole survey. Where a cheap, reliable verifier exists — a game’s outcome, a test that passes, a proof a checker can confirm — machines can climb past the best human examples. Where no such verifier exists, or where checking is as hard as doing, progress stalls. Generation has become cheap. Judgment is the bottleneck.
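The role of the verifier can be made concrete with a deliberately crude generate-and-check loop. Everything here is a stand-in chosen for illustration: `propose` is a random guesser playing the part of a cheap generator, and factoring a small number plays the part of a task whose answers are easy to check.

```python
import random


def verify(n: int, candidate: tuple) -> bool:
    """Cheap, reliable check: is candidate a nontrivial factorization of n?"""
    a, b = candidate
    return 1 < a < n and 1 < b < n and a * b == n


def propose(n: int, rng: random.Random) -> tuple:
    """Stand-in for a cheap generator: a random guess, usually wrong."""
    a = rng.randrange(2, n)
    return (a, n // a)


def search(n: int, budget: int, seed: int = 0):
    """Keep the first proposal that passes the verifier, else None."""
    rng = random.Random(seed)
    for attempt in range(1, budget + 1):
        candidate = propose(n, rng)
        if verify(n, candidate):
            return candidate, attempt
    return None


result = search(91, budget=10_000)
print(result)
```

The generator is never required to be good; the verifier does the selecting. Remove `verify`, or replace it with a check as expensive as the task itself, and the loop has no way to tell a right answer from a wrong one.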

Why Faust is the right test

This is why the question “could a machine write Goethe’s Faust” is sharper than it sounds, and why the instinct to call the result an “AI copy” is doing real analytical work. Three different questions hide inside the one.

Could a model produce text of comparable formal quality? In pastiche — convincing imitation of the style — increasingly yes. But pastiche of Faust is derivative by construction, so that answer settles nothing. Could it produce a genuinely new work of comparable originality, one a culture would choose to canonize? Much harder, and it runs straight into the verifier problem. There is no objective function for great literature, no checker that returns true. Worse, great literature is partly defined by productively breaking the patterns it inherited — which is exactly the move a system biased toward the center of the distribution is built to avoid.

The third question is the one the “AI copy” instinct has actually found, and here the limit is not in the network at all. Faust is valued partly as the trace of a particular mind wrestling with mortality across sixty years; it is read as testimony. A flawless forgery of a Vermeer becomes nearly worthless the moment its origin is known, though not one stroke of the canvas has changed. The value was never only in the pixels. “AI copy,” then, names a fact about meaning and provenance — about who made a thing and whether it was lived — not a fact about capability. No quantity of parameters changes that no one lived it. This is a limit no amount of scale can move, because it sits in the structure of human value rather than in the machine.

Limits the machine shares with us

The question of complex systems splits along the same seam. There is a real and underappreciated sense in which machines already understand some things not as humans do. A protein-folding model grasps structure in a space no person can hold in their head; a game-playing system finds moves that look alien and turn out to be correct. That non-human mode of understanding is genuine and already present — but it is domain-specific, and it depends on good signal and clean structure.

For systems that are complex in the technical sense — emergent, where the behavior of the whole is not visible in the parts; chaotic, where a tiny error in the starting point explodes; computationally irreducible, meaning there is no shortcut and the only way to know the outcome is to run the thing — there is a hard limit that has nothing to do with how the AI is built. It binds any computer exactly as it binds any human. An irreducible system cannot be predicted faster than it can be run, and chaos amplifies whatever you got wrong about its initial state.
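Sensitivity to the starting point is easy to see in a standard textbook system, the logistic map. The choice of map, the parameter `r = 4`, the starting value, and the size of the error are illustrative assumptions and not a model of any real-world system.

```python
def logistic_trajectory(x0: float, steps: int, r: float = 4.0) -> list:
    """Iterate x -> r * x * (1 - x), a standard chaotic toy system at r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


STEPS = 80
a = logistic_trajectory(0.2, STEPS)
b = logistic_trajectory(0.2 + 1e-10, STEPS)  # tiny measurement error
gaps = [abs(x - y) for x, y in zip(a, b)]

for n in (0, 10, 20, 30, 40, 50):
    print(f"step {n:2d}  gap {gaps[n]:.3e}")
```

The two runs start a ten-billionth apart and agree closely at first, then part ways entirely. The sketch also reaches step 80 only by computing the 79 steps before it, a small picture of what "run the thing" means.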

The economist Friedrich Hayek named a version of this in the 1940s, and it has aged into something larger than its original quarrel. His knowledge problem was never merely that humans cannot plan an economy from the center. It was that the relevant knowledge is dispersed across millions of people, much of it tacit — known in the doing, never written down — and continuously regenerated, so that no central aggregator can assemble it in time to act on it. A bigger model does not dissolve that. But note where the tension now lives: Hayek argued against pencil-and-paper planners, and large-scale machine learning on behavioral data has reopened the old debate about whether enough data could, in principle, centralize what he thought uncentralizable. The honest reading is that the technology has changed the applications far more than it has settled the principle. A machine can be a far better probe of a complex system — running more simulations, tracking more variables, extending human cognition outward — without making that system predictable. The limit is in the system, not the solver.

The coastline, not the wall

Which returns the survey to where it began, and to the claim about AGI — the idea of a machine that can do every task a human can. Treated as a theoretical statement, the skepticism is well founded, and the reason is structural rather than a matter of timing.

“All the tasks a human can do” is under-defined. Some tasks are bound to having a body, to occupying a social role, to being a particular person who lived a particular life — not to capability that a system could acquire. The pattern of the last few years is telling: the field operationalizes general intelligence as a list of tasks, the systems beat the list, and the field draws a new line. That cycle suggests the tasks were never a good proxy for “general intelligence” in the first place.

The jaggedness is the reason there is probably no clean crossover, no afternoon on which one curve passes another and the world is different by dinner. What the evidence suggests instead is an expanding, irregular coastline of capability — superhuman in more and more places, oddly blank in others, never collapsing into a single human-equivalent moment. The curve is real. It is simply not a wall being crossed on a date. It is a coastline, and the rest of this survey is an attempt to map it.