The Specification Is the Work · The Jagged Frontier

A developer can ask a coding assistant for a todo app — a small program that tracks a list of tasks, lets a user add and check off items, and stores them somewhere — and have a working version in minutes. Sometimes it arrives in a single shot. Sometimes it takes a short round of specification-driven prompting. Either way, the result feels like magic, and it has launched a great many predictions that software engineering is about to be automated away.

The instinct to generalize from this is strong and almost entirely wrong. The reason the todo app is trivial is not that its code is easy. It is that its specification is free. Everyone already knows what a todo app is. The requirements are common knowledge, the non-functional demands — speed, security, scale — are near zero, and the model has ingested ten thousand examples of exactly this program. What artificial intelligence collapsed was the cost of implementation. It barely touched the cost of specification and judgment. And on most real software, the implementation was never the work.

This is an old observation wearing new evidence. In 1986 the computer scientist Fred Brooks drew a distinction that has aged well: the difference between accidental and essential complexity. Accidental complexity is the friction of expression: syntax, boilerplate, glue code, looking up which library call does what, the mechanical translation of a clear intention into working code. Essential complexity is the irreducible difficulty of the thing itself: modelling a messy domain, and deciding what the system should actually do. Brooks argued that no single tool would deliver an order-of-magnitude gain within a decade, because the accidental part was already shrinking and the essential part is, by definition, the part that does not shrink. Modern AI is a massive assault on accidental complexity and a much smaller dent in the essential kind. The entire gradient that follows, from the todo app to the flight controller, is really a gradient of how much essential complexity sits behind the keystrokes.

The survey’s opening chapter argued that the verifier is the master variable: the limit of what a machine can reliably produce is set by how cheaply you can check the result. Software is the sharpest case of that idea, because it is simultaneously the domain with the best verifier — code runs and passes its tests, or it does not — and the domain where the hard part is the least verifiable. Before the gradient, two pieces of evidence need grounding, because they say something uncomfortable about how the gains are being measured.

The broken instrument

The first comes from METR, a research group that ran a randomized controlled trial, the same design medicine uses for drugs, with each task randomly assigned to a condition in which AI tools were either allowed or disallowed. The developers were experienced, working on mature open-source repositories they had contributed to for years. With early-2025 AI tools, they were about 19 percent slower. The striking part is not the slowdown itself but the gap around it: the same developers forecast a 24 percent speedup beforehand, and afterward still believed the tools had sped them up by roughly 20 percent, while the stopwatch said they had been slowed down.

The caveats matter, and they cut in a specific direction. This was a snapshot of early-2025 tooling on familiar, high-standard, brownfield code — “brownfield” being the trade’s word for an existing, established codebase, as opposed to “greenfield” work started from nothing. That is exactly the setting where the bottleneck is not generating code but loading a large, implicit context into the model and then reviewing plausible output for subtle wrongness. METR’s own follow-up, in February 2026, found a speedup for the subset of developers using the more capable agentic tools of late 2025. So this is a moving target, not a verdict. But the durable lesson survives the moving target: subjective “feels faster” is a broken instrument. The sensation of fluency and the fact of throughput had come apart.

The amplifier

The second body of evidence comes from DORA, a long-running research program that measures software delivery across thousands of organizations. Its 2024 report found that AI adoption raised individual productivity while being associated with a roughly 1.5 percent drop in delivery throughput and a 7.2 percent drop in stability for every 25 percent increase in adoption, apparently because changes were arriving in larger, less manageable batches. The 2025 report showed teams had learned: throughput flipped positive. But the tension with stability persisted. AI accelerates the writing of code, and that acceleration exposes downstream weaknesses unless the control systems (automated testing, small batches, fast feedback) speed up to match.

Independent telemetry from Faros, spanning more than ten thousand developers, named the same pattern the “AI productivity paradox”: individuals completed 21 percent more tasks and merged 98 percent more pull requests, while organizational delivery stayed flat. The shape of this is familiar to anyone who has studied manufacturing. AI optimizes the coding step — which was rarely the constraint on the system as a whole. The Theory of Constraints, a piece of lean-manufacturing thinking, holds that speeding up anything other than the bottleneck only piles up inventory in front of it. Here the inventory is work-in-progress: more code, in bigger batches, flowing into an unchanged review-and-integration bottleneck. Local optimization, global stall.
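The bottleneck argument can be made concrete with a toy simulation. This is a minimal sketch, not a model of any real team: the two-stage pipeline, the function name `simulate`, and the rates of 10 and 20 pull requests per week are all illustrative assumptions, not figures from Faros or DORA.

```python
def simulate(coding_rate, review_rate, weeks):
    """Toy two-stage pipeline: coding feeds a review queue, review drains it.

    Rates are in pull requests per week (illustrative numbers only).
    Returns (delivered, queue), where queue is the work-in-progress
    still waiting for review at the end of the run.
    """
    queue = 0
    delivered = 0
    for _ in range(weeks):
        queue += coding_rate                # new PRs arrive from the coding step
        reviewed = min(queue, review_rate)  # review capacity is the constraint
        queue -= reviewed
        delivered += reviewed
    return delivered, queue


baseline = simulate(coding_rate=10, review_rate=10, weeks=12)
with_ai = simulate(coding_rate=20, review_rate=10, weeks=12)
print(baseline)  # (120, 0)
print(with_ai)   # (120, 120)
```

Doubling the coding rate leaves delivery unchanged at 120 and parks another 120 pull requests in front of review. In this sketch only raising `review_rate` moves the delivered figure, which is the Theory of Constraints point in miniature.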

AI is an amplifier: teams that know what to build get there sooner, and teams that do not build the wrong thing faster.

The most actionable finding in the DORA work is that last point stated plainly. AI is an amplifier. It improves user-centric teams — those organized around discovering what users actually need — and it actively harms teams without that focus, because the latter merely produce the wrong thing more efficiently. The tool does not supply judgment. It multiplies whatever judgment is already there.

The gradient

The useful way to map software is not by lines of code but by four axes that vary almost independently of size: how ambiguous the specification is, how much context the work is embedded in, how costly an error is, and how cheaply a verifier can be built. Running the same kinds of software along those axes produces a gradient, and the gradient predicts where the factory reaches and where it stalls.

Artifact	What is scarce	AI’s reach
Todo app	Nothing — the spec is common knowledge	Near-total; one-shot or short spec-driven loop
Word, PowerPoint	The long tail of the specification	Builds the easy 70%; the costly 30% resists
SAP / ERP	Agreement about reality	Marginal; the hard part is not code
Custom agile software	The learning loop with the customer	Speeds the build, not the discovery
Embedded / avionics	Guaranteed correctness	Highest value in verification, not generation

The todo app sits where all four axes are minimal: a fresh start, a complete and cheap specification, a trivial verifier, and harmless errors. This is pure factory zone. The danger here is sociological rather than technical — it is the demo that misleads an executive into believing that an enterprise system is just a bigger todo app. It is not, because the four axes are roughly independent of size.

Word and PowerPoint mark the next station. The domain is well understood, yet the essential complexity is enormous and almost entirely in the long tail: decades of feature accretion, backward compatibility, file-format edge cases, performance at scale, accessibility, internationalization. One can produce something that opens a .docx file in an afternoon. One cannot produce something that opens every .docx file anyone has ever created. This is the familiar problem that the last 30 percent of a system costs more than the first 70, raised to civilizational scale — and the missing portion is the part a demo cannot reveal.

SAP and the broader category of enterprise resource planning systems are the purest case of a different problem, and barely a coding problem at all. Such an implementation is the encoding of an organization’s actual business processes, which are tacit, politically contested, mutually inconsistent across departments, and changing during the project itself. These programs are famous for failing on organizational rather than technical grounds. This is Friedrich Hayek’s knowledge problem in enterprise clothing: the relevant knowledge is dispersed, partly unspoken, and held by people who do not fully agree with one another. AI does not help with extracting and reconciling that distributed tacit knowledge, and the part it does help with — the configuration and the routine code — was always the cheap part. The scarce input is agreement about reality, and no model manufactures that.

Customized client software, built under agile methods, is the most instructive station, because agile exists precisely as a response to the fact that the specification is undiscoverable in advance and must be found by iterating with the customer. In the vocabulary of the Cynefin framework — a way of sorting problems by how knowable they are — this is the complex domain, where the right move is to probe, sense, and respond, rather than the clear domain, where a known procedure applies. AI compresses the build inside each loop, which is real. But the binding constraint in agile was never the speed of building; it was the learning loop, the slow business of discovering what is actually valuable. Borrowing the military strategist John Boyd’s OODA loop — observe, orient, decide, act — AI shortens the act while observe and orient stay the bottleneck. The genuine prize, where it can be captured, is that cheap building permits more probes: more throwaway prototypes per unit of customer feedback, which tightens learning. That only pays where the feedback loop already exists — which is why the DORA finding on user-centricity flips the sign of AI’s impact.

Embedded systems in automobiles and aerospace are the most resistant tier, and the reasons compound. The cost of an error is catastrophic, so “plausible” is worthless and verification has to be far more rigorous, extending for the most critical parts to formal verification: mathematically proving that the code meets its specification, rather than merely testing some cases. The code is coupled to hardware, runs in real time, and is starved of resources, so the model cannot paper over a gap by importing a convenient library; it must reason about timing, memory, and physical safety.

The regulatory shape reinforces all of this. Certification regimes such as DO-178C for airborne software and ISO 26262 for road vehicles demand auditable traceability from every requirement to the code that satisfies it to the test that confirms it. Opaque AI provenance — output whose derivation cannot be shown — is therefore a liability, the same problem of unaccountable origin that recurs across this survey wherever an artifact must answer for where it came from. And the V-model and waterfall processes survive here because the discovery of an error late in the process is fatal, which is the opposite of the fast-feedback regime where AI shines.
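The traceability chain from requirement to code to test can be pictured as a data structure with a completeness check. The sketch below is purely illustrative: the `Requirement` class, the `untraced` helper, and the identifiers such as `REQ-001` are hypothetical and do not reproduce any format that DO-178C or ISO 26262 prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    """One requirement and its links (hypothetical structure, for illustration)."""
    req_id: str
    code_units: list = field(default_factory=list)  # code that implements it
    tests: list = field(default_factory=list)       # tests that confirm it


def untraced(requirements):
    """Return ids of requirements missing a link to code or to a test."""
    return [r.req_id for r in requirements if not r.code_units or not r.tests]


reqs = [
    Requirement("REQ-001", ["pitch_limiter"], ["test_pitch_limit"]),
    Requirement("REQ-002", ["stall_warning"], []),  # implemented, never tested
    Requirement("REQ-003", [], []),                 # not yet implemented
]
print(untraced(reqs))  # ['REQ-002', 'REQ-003']
```

A real audit would also need the reverse direction, confirming that no code exists without a requirement behind it. That is one reason generated code of unclear origin is awkward to certify.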

But this tier carries a twist. It is where AI’s highest-value role is not code generation but verification amplification: formalizing specifications, proposing invariants and proof obligations, generating property-based tests — tests that check a general rule across many automatically generated inputs rather than a few hand-picked cases — and comprehending undocumented legacy code that no living engineer fully understands. Pointed at the verifier instead of the generator, the most resistant tier becomes the one where AI quietly does its most valuable work.
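A property-based test is easiest to see in miniature. The sketch below hand-rolls the idea with only the standard library; the function under test, `clamp`, and the three properties checked are illustrative assumptions, not examples drawn from any certified system.

```python
import random


def clamp(value, low, high):
    """Function under test: restrict value to the range [low, high]."""
    return max(low, min(value, high))


def check_clamp_properties(trials=1000, seed=0):
    """Check general rules about clamp across many random inputs.

    Returns the first counterexample found as (value, low, high),
    or None if every trial satisfies all three properties.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        low = rng.randint(-1000, 1000)
        high = rng.randint(low, 1000)   # guarantees low <= high
        value = rng.randint(-5000, 5000)
        result = clamp(value, low, high)
        in_range = low <= result <= high
        unchanged_inside = result == value if low <= value <= high else True
        idempotent = clamp(result, low, high) == result
        if not (in_range and unchanged_inside and idempotent):
            return (value, low, high)
    return None


print(check_clamp_properties())  # None
```

The test states rules (the result stays in range, in-range values pass through unchanged, clamping twice changes nothing) instead of listing expected outputs for chosen inputs. Dedicated libraries such as Hypothesis for Python add smarter input generation and shrink a failing case to a minimal counterexample.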

Where the factory’s reach ends

The unifying claim is simple. The factory’s reach is set by the cost of the specification, the size of the context, and the cost of the verifier. The residual that never yields is essential complexity — the same irreducibility the opening chapter described, now in engineering dress. Software is the one medium that splits both ways at once. It has the best verifier of anything, so a genuine and growing slice really does industrialize. And its hard core is irreducible, so the factory hits a hard ceiling exactly where deciding-what-to-build, tacit organizational knowledge, and guaranteed correctness live.

AI raises the output rate without raising the per-step reliability of what it produces. The human effort that remains therefore scales with how many “nines” of reliability the artifact demands — how close to certain the result must be, where a flight controller asks for far more nines than a todo app ever will. That single relationship is why the gradient from todo app to flight controller tracks AI’s resistance so well.
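One way to see why the remaining effort grows with the nines is to compound per-step reliability across a chain of steps. This is a sketch under a loud simplifying assumption that is not in the argument above: every step succeeds independently with the same probability, and all steps must succeed. The function names and the 0.99 figure are illustrative.

```python
import math


def nines(reliability):
    """Count of 'nines': 0.99 gives about 2, 0.999 about 3 (needs reliability < 1)."""
    return -math.log10(1.0 - reliability)


def chain_reliability(per_step, steps):
    """Reliability of a chain of independent steps that must all succeed."""
    return per_step ** steps


# Assumed per-step reliability of 0.99; real steps are rarely independent.
for steps in (1, 10, 100):
    r = chain_reliability(0.99, steps)
    print(steps, round(r, 3), round(nines(r), 2))
```

Under these assumptions, a step that is right 99 times in 100 yields a 100-step artifact that is right only about 37 percent of the time. Producing steps faster does nothing to that figure; closing the gap to a demanding target falls to whatever checking sits around the generator.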

The conclusion follows from the premise rather than from optimism. As implementation commoditizes, the engineer’s value migrates to specification, judgment, integration, and verification. The leverage point becomes the specification and the verification harness — both of which are hard, senior, tacit skills that do not commoditize on the same curve. Specification-driven development, now fashionable, is mostly the methodological recognition that when building is cheap, the work has moved to the spec. The economist Ronald Coase’s reading would be the same: as the marginal cost of implementation approaches zero, teams and firm boundaries reorganize around the inputs that stay scarce — domain understanding, verification capacity, and accountable trust. The developer does not disappear. The definition of the job rotates away from the part the machine does well and toward the part it structurally cannot.