Index
Chapter 5 · The Jagged Frontier

The Verifier and Nature

In science the judge is either formal proof or reality itself — and its price predicts where AI transforms and where it stalls

Sheet 5 · Where nature is the judge/ The price of the verifier/ 14 min read

The cheapest judge in the world

In the summer of 2024, a system called AlphaProof sat down to the International Mathematical Olympiad — the same six problems the world’s strongest teenage mathematicians face — and scored at silver-medal level, solving the competition’s single hardest problem along the way. A year later a system reasoning in ordinary mathematical English, Gemini Deep Think, took gold, working autonomously inside the four-and-a-half-hour limit the human contestants are given. These are not pastiches of mathematical style. They are correct proofs.

The reason mathematics yields like this, when so much else does not, is the same variable that has run through the whole survey. In everything examined so far — art, games, software — the thing that judged a machine’s output judged an artifact: is it good, is it fun, is it correct. In science the judge is different and harder. It is either formal proof or nature itself. And the price of consulting that judge — how cheap it is to ask, and whether it can be reached at all — varies enormously across the sciences. That single variable predicts, with almost uncomfortable accuracy, where artificial intelligence has already transformed a field and where it has stalled.

A second fact runs underneath all four fields. Generating a candidate — a guess, a hypothesis, a conjecture — is one task. Checking whether the candidate is true is another. AI relieves the first far more easily than the second, and in most of science the checking was always the binding constraint. The science was never short of ideas. It was short of ways to tell which ideas were right.

Mathematics: the one perfect verifier

Mathematics is the extreme good case, because it possesses the only perfect, cheap verifier in existence. A proof written in a system like Lean — a proof assistant, software that checks each logical step against the axioms — is machine-verifiable. Given a fully formal statement, correctness is decidable: the checker returns true or false and is never wrong. Nothing else in science offers that.

This is what puts proof-finding in the regime that produced AlphaZero. When a cheap, reliable check exists, a machine can train against it — propose a step, see whether the checker accepts, adjust, repeat — and climb past the examples any human gave it. AlphaProof is exactly this: a reinforcement-learning agent that searches for formal proofs by interacting with Lean. The IMO results are one consequence. There are others less visible to the public. A system formalized the strong Prime Number Theorem in Lean in three weeks, a task that had stalled after more than eighteen months of expert effort. In early 2026, around fifteen problems posed by the mathematician Paul Erdős — long-open questions, some of them decades old — moved from open to solved with AI involved in most of them, the proofs Lean-verified.

But notice precisely what is automated and what is not. A proof assistant verifies a proof; it does not generate one. The search for the proof is the hard part, and the direction of the search — which lemma to try, which structure to impose — is harder still. The mathematicians closest to the work, Terence Tao among them, describe these systems consistently as assistants: valuable for verification, for testing conjectures, for routine technical labour, most useful as a supplement that leaves the research direction to a human. The creative core sits outside the verifier’s reach. Deciding which structures are worth defining, seeing that two distant fields are secretly the same subject in different clothing — the kind of reframing associated with a mathematician like Grothendieck — is taste-driven and unverifiable. There is no checker that returns true for this was the right thing to wonder about. It is the same unverifiable tail that made original art resist the machine, now wearing formal dress.

A proof assistant verifies a proof; it does not generate one. The search is the hard part, and its direction harder still.

And mathematics carries a limit no other field can claim: a proven ceiling. Undecidability — established in the 1930s — means no algorithm can prove every true statement. Some truths have no proof a machine could find, not because the machine is too small but because no such procedure exists. “Automate all of mathematics” is impossible not merely in practice but in principle. The good case still has a wall, and the wall is a theorem.

Physics: a field that splits in two

Physics does not sit at one point on the verifier axis. It splits down the middle, and the split is instructive.

On one side is computational and applied physics, where the verifier is cheapish: a simulation with known ground truth. Here AI surrogates thrive. A surrogate model is a learned approximation that stands in for an expensive calculation — trained on the slow, exact method until it can produce nearly the same answer in a fraction of the time. GraphCast and similar emulators now rival full numerical simulators for complex dynamical systems at enormous speedups, and related surrogates are doing fusion-plasma control and lattice calculations. When the truth is computable, even at great cost, a machine can learn to shortcut it.

On the other side is fundamental theoretical physics, and here the verifier is not expensive but inaccessible. To test a theory of physics at its deepest level requires observing nature at energy scales no instrument can reach. The decades-long stall of string theory is not a shortage of ideas; it is that the experiment cannot be performed. AI can generate more candidate theories than any human committee. But where the candidates cannot be tested, more of them is simply more unfalsifiable elegance — a verifier collapse in which the truth presumably exists and the instrument is missing.

Physics is also where irreducibility bites hardest, in its most rigorous form. Turbulence, chaos, and many-body quantum systems — systems of many particles whose interactions cannot be separated and solved one at a time — are computationally irreducible: there is no shortcut to their behaviour, no formula that leaps to the answer, only the running-out of the process step by step. AI finds better approximations, but it cannot abolish the irreducibility. Simulating a general quantum many-body system on an ordinary classical computer is intractable not by accident but in principle, which is the entire reason physicists want quantum computers in the first place. The limit here is not in the AI. It binds any computer, exactly as it binds any human.

Chemistry: the factory made literal

Chemistry is where the idea of an “AI factory” — a loop that generates candidates, measures them, and feeds the result back to generate better ones — is most literally real. The molecular-level verifier is good but expensive: quantum-chemical simulation, and ultimately the wet lab, where compounds are actually made.

The numbers are striking. A system called GNoME predicted 2.2 million crystal structures, including hundreds of thousands of thermodynamically stable ones, and external laboratories have already synthesized more than 700 of them. It produced roughly 380,000 stable candidates in seventeen days. And the loop has closed physically: self-driving labs — robotic systems that plan an experiment, run it, read the result, and decide what to try next without a human in the cycle — now cut materials discovery from years to weeks. They still have to sit and wait for each reaction to finish, sometimes up to an hour, before they can measure anything.

This is the generate-measure-reinforce loop with a physical verifier attached. And so the bottleneck does not disappear; it migrates. It moves to robot and reagent throughput, and to the wall-clock of chemistry itself — the time a reaction takes regardless of how clever the planner is. The limits are a funnel that narrows at every stage: a structure predicted to be stable need not be synthesizable, a compound that can be synthesized need not be useful, and the signal that finally matters — does this compound do its job in the real application — sits at the far end of the longest and most expensive verifier in the chain.

Biology: the worst judge of all

Biology is the hardest of the four, and it is the purest demonstration of the survey’s recurring theme, because biology is complex systems all the way down.

Where a verifiable signal and massive data both exist, AI has been spectacular. AlphaFold effectively solved protein folding — a fifty-year grand challenge — released more than 200 million predicted structures, and earned the 2024 Nobel Prize in Chemistry. The design frontier followed: tools now generate protein binders from scratch, de novo, that achieve nanomolar binding — very tight, specific attachment — to a majority of the novel targets they are aimed at. Protein structure had exactly the conditions the good cases require: a clean signal and a deep well of data.

But a static protein structure is one thing. The behaviour of a living cell, the development of an organism, the progression of a disease — these are emergent, multi-scale, stochastic, and context-dependent. Biology has no compact governing equations the way physics does. One cannot predict the dynamics of a cell from its genome the way one predicts a fold from a sequence. The field is correlation-rich and causation-poor: it has more patterns than it knows how to assign causes to, and establishing a cause requires intervention — changing one thing and watching what follows.

That requirement lands biology on a verifier that is slow, noisy, expensive, ethically constrained, and non-stationary — the system evolves and adapts while it is being studied, so the answer to a question can change between asking it and checking it. Clinical trials still run for years at roughly ninety percent attrition. This is the precise reason the promise that “AI will cure disease in a few years” overshoots. The binding constraint was never hypothesis generation. It is the loop through biological reality, and AI accelerates the cheap half of that loop while the expensive half stays bound to the wet lab and the calendar.

Three bottlenecks, relieved unequally

Step back, and the four fields resolve into a single structure: three distinct bottlenecks, which AI relieves very unequally.

The first is generation — the part that forms concepts and frames new questions. AI aids this everywhere and owns it nowhere. The genuinely novel reframing, the decision about what is worth asking, remains the irreducible creative tail, the same place every domain in this survey has proven hardest.

The second is verification, and this is where the fields fan out. It is worth seeing them ranked on the one axis that organizes the whole chapter.

Field The verifier Cost / accessibility AI’s reach
Mathematics Formal proof (Lean) Perfect and cheap Greatest — proof-search in the AlphaZero regime
Chemistry Simulation + wet lab Expensive but physical Large — self-driving labs, bounded by throughput
Fundamental physics Nature at unreachable scales Inaccessible Stalls — candidates outrun any test
Biology Intervention in living systems Worst: slow, noisy, non-stationary Spectacular where signal exists; choked elsewhere

The third bottleneck is the physical wall-clock of experiments. AI can make each experiment more informative — a better-chosen reaction, a better-targeted trial — but it cannot abolish real time, because reality runs at its own pace and will not be hurried.

Rank the fields by a single property — how far each compresses into compact laws — and the entire pattern falls out. Mathematics and the formal parts of fundamental physics are law-governed and AI-friendly. Biology and complex-systems physics are irreducible, and there AI is a powerful probe that extends human reach without ever becoming an oracle that renders the system predictable. The two constraints that opened this survey — Hayek’s knowledge problem and computational irreducibility — turn out to be the binding ones here too, exactly as they were at the start.

What the survey has been mapping

Two quieter threads close the case. The first is economic. There is a well-documented finding that ideas are getting harder to find: research productivity per scientist has been declining for decades, more effort buying less discovery. The current optimism is that AI might bend that curve back — and the structure of this chapter suggests where that hope is best founded, namely the high-verifier fields, where a machine can actually be told whether it is right.

The second thread is a warning. The same ease of generation that helps can also flood. AI can pour plausible, unverified results into the scientific literature faster than peer review and replication can check them. The reproducibility crisis does not ease under this pressure; it worsens. The verifier — here, replication — is the bottleneck, and it is being overwhelmed. It is the same shape as everywhere else in the survey: generation is cheap, judgment is scarce, and scarcity under flood becomes crisis.

Which is why the most cited barrier among the people actually building this is no longer technical. It is becoming organizational. These breakthroughs need full-time interdisciplinary teams and new institutional models rather than the traditional grant. That is a Coasean observation: when the technical input commoditizes, the scarce input becomes the coordination structure around it — who is in the room, how the work is funded, what the institution is shaped to do.

And so the survey closes on the note it has been approaching from the beginning. The frontier is a coastline, not a wall — advancing fast in some directions, blank in others, never crossing a single line on a single day. Its shape is set by the verifier: where one is cheap, machines surpass us; where it is missing, they stall. Where systems are irreducible, no solver of any kind escapes the cost of running them. And as the doing grows cheap, value migrates — in software it moved to the specification and the test, and in science it moves the same way: to choosing the question, judging the candidate, and reaching the judge. The same conclusion, reached twice from opposite ends. That recurrence is the strongest evidence the survey has that it was mapping something real.