Module 10

The AI Revolution

From deep learning to LLMs to agents — the transformation of 2015–2026.

~16 min read · Advanced · Builds on M2, M4

From Deep Learning to the Transformer

In 2015, AI was a specialist’s field. Convolutional neural networks dominated image recognition, recurrent networks handled text, and reinforcement learning was achieving early breakthroughs in game-playing. Each domain had its own architecture, its own tricks, its own limitations.

Three terms in that sentence carry the rest of the module, and the first is small enough to build right here: two inputs, two units, four connections, each connection carrying one adjustable number — a weight.

Two inputs, two units, four weighted connections: 0.4 and 0.6 feed the top unit, 0.2 and 0.8 the bottom.

Feed it inputs 0 and 1: the top unit computes 0.4 · 0 + 0.6 · 1 = 0.6 — arithmetic you can check by eye. And training is nothing but nudging those four numbers until the output is reliably right.
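The same arithmetic can be written as a few lines of Python. This is a minimal sketch, not anything the module itself supplies: the function names `forward` and `nudge`, the target values, and the learning rate are assumptions chosen for illustration, and the units are plain weighted sums with no activation function, matching the hand calculation above.

```python
# Weights from the figure: one row per receiving unit, one column per input.
weights = [
    [0.4, 0.6],  # top unit
    [0.2, 0.8],  # bottom unit
]


def forward(weights, inputs):
    """Each unit outputs the weighted sum of the inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def nudge(weights, inputs, targets, learning_rate=0.1):
    """One gradient step on squared error for this purely linear network."""
    outputs = forward(weights, inputs)
    return [
        [w - learning_rate * 2 * (out - target) * x for w, x in zip(row, inputs)]
        for row, out, target in zip(weights, outputs, targets)
    ]


inputs = [0, 1]
print(forward(weights, inputs))  # top unit 0.6, bottom unit 0.8

# Hypothetical "right answers", invented for this illustration only.
targets = [1.0, 0.0]
nudged = nudge(weights, inputs, targets)
print(forward(nudged, inputs))
```

After one nudge toward the made-up targets, the two outputs move from 0.6 and 0.8 to roughly 0.68 and 0.64: a small step in the right direction, repeated many times, is what training amounts to in this sketch.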

That is the whole mechanism; the rest is scale. A neural network is a system that learns from examples rather than from rules written by hand: layers of such units pass signals along connections whose strengths are adjusted, during training, until the network reliably turns an input — an image, a sentence — into an output — a label, a translation. Deep learning is simply the practice of stacking many such layers, so that an answer is assembled in stages, from edges to shapes to objects, or from letters to words to meaning. Reinforcement learning is a different training regime: instead of learning from labelled examples, an agent learns by trial and error, nudged toward the actions that lead to rewarded outcomes — which is why it proved itself first in games, where winning and losing supply the reward.

ResNet (2015) showed that depth was feasible. Skip connections — letting information bypass layers — enabled training networks with 152 layers, surpassing human-level accuracy on ImageNet for the first time. A single architectural innovation unlocked capabilities that deeper-but-plain networks could not reach. It is a recurring theme: backpropagation had earlier revived the field after Minsky and Papert’s Perceptrons froze it into the first AI winter.

AlphaGo’s defeat of Lee Sedol in March 2016, watched by 200 million people, demonstrated that deep reinforcement learning could achieve superhuman performance in domains thought to be decades away. AlphaGo Zero (October 2017) went further: it learned entirely from self-play, with no human game data at all. AlphaZero (December 2017) generalized to chess and shogi, discovering strategies that centuries of human play had missed. (For the game-theoretic significance, see Game Theory and Cooperation.)

Then, in June 2017, Vaswani et al. published “Attention Is All You Need.” The transformer, the neural-network architecture behind nearly all modern language models, replaced recurrence with self-attention — a mechanism that lets every element in a sequence attend to every other element simultaneously, so a model can decide which earlier words matter for the one it is processing now. Two properties made this transformative: it was fully parallelizable (exploiting GPUs, the parallel chips described later in this module, far better than sequential networks) and it captured long-range dependencies without the vanishing-gradient problem — the tendency, in long step-by-step networks, for the learning signal to fade out before it reaches the earliest layers.
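To make "every element attends to every other element" concrete, here is a stripped-down sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: real transformers apply learned projection matrices to produce separate queries, keys, and values and run several attention heads in parallel, whereas this version uses the input vectors directly for all three roles. The function names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Simplified self-attention: queries = keys = values = the inputs."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:
        # Score this element against every element in the sequence at once.
        weights = softmax([dot(query, key) / scale for key in vectors])
        # The output is a weighted blend of all the elements.
        blended = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(len(query))
        ]
        outputs.append(blended)
    return outputs


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(sequence):
    print([round(v, 3) for v in row])
```

Each pass through the outer loop is independent of the others, which is the property that lets the real computation run in parallel on a GPU rather than step by step.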

The transformer’s universality was unexpected. It conquered natural-language processing (NLP) first: BERT (October 2018) showed that bidirectional pre-training followed by task-specific fine-tuning could dominate virtually every language benchmark. But then the architecture spread — to computer vision, to protein-structure prediction (AlphaFold 2, November 2020, solving a 50-year grand challenge of biology), to code, speech, and music.

The transformer architecture proved universally effective — from language to vision to protein folding. A single structural innovation, self-attention, unified domains that had required separate architectures for decades.

What Counted as Intelligence Before

It is worth pausing on what the field thought it was building, because the large language model took a route almost no one expected. For decades the working definition of an artificial intelligence was the rational agent. The humblest example is a heating controller: it reads the room temperature through a thermometer and switches the heaters to hold a target. Generalized: an entity that perceives its environment through sensors, acts on it through actuators, and chooses its actions to advance a goal. The definition is deliberately spare — and it is the same skeleton the strategic players of game theory and the simulated agents of agent-based modeling hang on. Three quite different fields, one underlying abstraction: perceive, decide, act.

For language in particular, the long-standing goalpost was the Turing test, proposed by Alan Turing in 1950: if a machine could converse well enough that a human judge could not reliably tell it from a person, we ought to stop withholding the word “thinking.” Take the judge’s chair for a moment. Your counterpart types: “I see a tree and a bottle of orange juice.” Then: “I’m thinking about whether I should drink it.” You resolved that “it” without noticing there was anything to resolve. A machine, the field reasoned, could not: grammar allows “it” to refer to either noun, and only the fact that trees cannot be drunk — a fact about the world, not about the sentence — settles the question. The assumed route to passing was therefore a ladder — first syntax (the grammar that parses the sentence), then semantics (the meaning of drink and tree), then the world knowledge that just told you which of the two is drinkable. Build understanding from the bottom up, the reasoning went, and fluent language would follow.

Large language models climbed none of that ladder. They were trained to do one statistical thing — predict the next token (roughly, the next word-piece) across enormous quantities of text — and out of it came fluent, frequently correct language with no explicitly built model of meaning or of the world. In casual exchange they pass the test Turing set while doing none of the things the field assumed passing would require. That is the source both of their eerie competence and of their strange failures: a system that never represented the world directly can be brilliant and confidently wrong in ways no person would be. Whether next-token prediction at scale amounts to a kind of understanding, or only an unprecedented imitation of one, is the live question the emergence and alignment sections below keep running into — and one this site treats as genuinely open rather than settled. First, though, a plainer question: how did predicting the next token get that good?
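For a feel of what "predict the next token" means as a task, here is a toy predictor that simply counts which word follows which. It is a deliberately crude stand-in, offered under loud assumptions: it treats whitespace-separated words as tokens, uses counts instead of a neural network, and looks back only one word, whereas a real language model does none of these. The corpus and function names are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each token, which tokens follow it in the text."""
    counts = defaultdict(Counter)
    tokens = text.split()
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


corpus = "the cat sat on the mat and the cat slept"
counts = train_bigram(corpus)
print(predict_next(counts, "the"))  # cat
```

The prediction is pure statistics about the text: "cat" wins because it followed "the" twice and "mat" only once. Nothing in the table represents what a cat is.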

Large Language Models and the Scaling Hypothesis

GPT-3 (June 2020) changed the conversation. With 175 billion parameters — the adjustable numbers tuned during training — it demonstrated in-context learning: the ability to perform tasks from a few examples in the prompt, without updating any of those numbers. This was qualitatively different from BERT’s fine-tuning paradigm — GPT-3 could translate, summarize, answer questions, and write code, all from the same model, directed only by the prompt. OpenAI released it through an API rather than as open-source software.

The theoretical foundation came from scaling laws. Kaplan et al. (January 2020) showed that model performance improves as a power law of compute, parameters, and data — gains that are steady and predictable on a log scale rather than flattening out, with no ceiling yet in sight. This was remarkable: it meant returns on investment in scale were predictable, and the recipe was simple. More compute, more parameters, more data, better performance.
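A power law is easiest to appreciate by computing one. The sketch below assumes a loss of the form (scale / N) raised to a small exponent, where N is the parameter count. The two constants are placeholders picked for illustration, not the fitted values from Kaplan et al., and the function name is hypothetical.

```python
def predicted_loss(parameters, scale=1.0e14, exponent=0.08):
    """Illustrative power law: loss = (scale / parameters) ** exponent.

    Both constants are placeholder assumptions, not published fits.
    """
    return (scale / parameters) ** exponent


for n in [1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} parameters -> loss {predicted_loss(n):.3f}")

# Every tenfold increase in parameters multiplies the loss by the same
# factor, 10 ** -exponent, no matter where on the curve you start.
step_small = predicted_loss(1e10) / predicted_loss(1e9)
step_large = predicted_loss(1e12) / predicted_loss(1e11)
print(round(step_small, 4), round(step_large, 4))
```

The two printed ratios agree, which is the sense in which the gains are predictable: each factor of ten in scale buys the same proportional improvement.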

But DeepMind’s Chinchilla (March 2022) revealed that the field had the ratio wrong. Most models were massively undertrained — too many parameters, too little data. The compute-optimal allocation, Chinchilla showed, is roughly 20 tokens of training text for every parameter — twenty words of reading, loosely, for each adjustable number in the model. A 70-billion-parameter model trained on 1.4 trillion tokens outperformed Gopher, a model four times larger. The lesson: how you scale matters as much as how much you scale.
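The twenty-tokens-per-parameter rule is simple enough to check by machine. The sketch below applies it to the 70-billion-parameter model in the text and to GPT-3's recipe of 175 billion parameters and 300 billion tokens. One assumption is not from the text: the compute estimate of about 6 floating-point operations per parameter per token, a widely used approximation. The function and constant names are illustrative.

```python
TOKENS_PER_PARAMETER = 20  # rule of thumb quoted in the text
FLOPS_PER_PARAMETER_PER_TOKEN = 6  # assumption: common approximation


def compute_optimal_tokens(parameters):
    """Training tokens suggested by the ~20 tokens-per-parameter rule."""
    return TOKENS_PER_PARAMETER * parameters


def training_flops(parameters, tokens):
    """Rough training compute under the 6 * N * D approximation."""
    return FLOPS_PER_PARAMETER_PER_TOKEN * parameters * tokens


chinchilla_tokens = compute_optimal_tokens(70_000_000_000)
print(chinchilla_tokens)  # 1400000000000, the 1.4 trillion tokens in the text

gpt3_parameters, gpt3_tokens = 175_000_000_000, 300_000_000_000
shortfall = compute_optimal_tokens(gpt3_parameters) / gpt3_tokens
print(round(shortfall, 1))  # 11.7
```

By this rule GPT-3 would have wanted roughly twelve times more training text than it received. As a sanity check on the compute approximation, `training_flops(100_000_000_000, 1_000_000_000_000)` gives 6 × 10^23, which matches the 600.0e21 FLOPs shown in the Scaling Laws Explorer readout for 100B parameters and 1.0T tokens.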

InstructGPT and RLHF (January 2022) addressed a different problem: alignment. Raw language models are next-token predictors, not helpful assistants. Reinforcement learning from human feedback (RLHF) — a three-stage pipeline of supervised fine-tuning, reward-model training, and a final tuning step called proximal policy optimization (PPO), which improves a model toward higher reward while limiting how far it can drift in any single update — produced a 1.3-billion-parameter model that humans preferred over the raw 175-billion-parameter GPT-3, despite being more than a hundred times smaller. Alignment, it turned out, was an engineering problem with engineering solutions.

ChatGPT (November 2022) was the cultural inflection point. It reached 100 million users in two months — the fastest consumer adoption in history. Technically, it was InstructGPT applied to a dialogue interface. Suddenly everyone — teachers, lawyers, writers, programmers — had direct experience with what language models could do.

GPT-4 (March 2023) raised the bar again: multimodal (handling text and images together), 90th percentile on the Bar Exam, architecture undisclosed. An open-weight response followed quickly — and here the roster of names matters less than the split it produced. Meta released its LLaMA series; Mistral built efficient mixture-of-experts (MoE) models, which route each input to only a fraction of their parameters; China’s DeepSeek-R1 reached frontier-grade reasoning as an open-weight model. The pattern to carry forward: the ecosystem divided into a few closed frontier labs and an open-weight movement that democratized access to capable models even as the ability to train them concentrated in ever fewer hands.

A new scaling axis emerged in late 2024: inference-time compute. OpenAI’s o1 and DeepSeek-R1 used extended “thinking” during inference — reasoning step by step before answering. This echoed the System 1/System 2 distinction from Cognition and Biases: fast intuitive responses versus slow deliberate reasoning, now implemented in silicon.

Now run the section’s pivotal experiment yourself. GPT-3 got 300 billion tokens for 175 billion parameters — under the rules of its own era, was that recipe balanced or wasteful? Commit to a verdict, select the Kaplan et al. (2020) law below, dial in the recipe (the sliders land at 158B and 316B — close enough), and check the allocation readout. Then toggle to Chinchilla (2022) and check again.

Scaling Laws Explorer

Adjust model parameters and training tokens to see how they map to compute and predicted loss. Toggle between the Kaplan (2020) and Chinchilla (2022) scaling laws. See where real models sit on the curve.

Scaling law: Kaplan (2020) / Chinchilla (2022)

Parameters: 100B

Training tokens: 1.0T

Compute: 600.0e21 FLOPs

Tokens/param ratio: 10.0 (optimal: ~20)

Allocation: Near compute-optimal

Chinchilla (2022) showed models should be trained on ~20 tokens per parameter — most existing models were severely undertrained for their size.

What you should have seen: under Kaplan’s rule, GPT-3 looks balanced; under Chinchilla’s twenty-tokens-per-parameter rule, the same recipe is starved of data. That verdict flip is what compute-optimal means in practice — and why a 70-billion-parameter model that simply read more beat one four times its size. And notice the curve itself: smooth, no kinks, no ceiling. Hold that thought — because capabilities did not arrive smoothly.

Emergence, Phase Transitions, and Unpredictability

Wei et al. (2022) documented something that complexity scientists found deeply familiar: emergent capabilities appearing abruptly at scale thresholds. Few-shot arithmetic, chain-of-thought reasoning, word unscrambling — each showed the same pattern. At smaller scales, performance was essentially zero. At a critical scale, it jumped to high accuracy. Flat, flat, flat, jump.

This is directly analogous to phase transitions in physical systems: water transitioning from liquid to gas, magnetization appearing below the Curie temperature. A quantitative change in a parameter (temperature, model scale) produces a qualitative change in behavior. The parallel is not metaphorical — the mathematical structure is the same. Both involve collective behavior emerging from many interacting components in ways not predictable from any individual component.

Other unpredictable behaviors reinforced the complex-systems perspective. Jailbreaking emerged as an evolutionary arms race between red teams finding prompts that bypass safety training and developers patching vulnerabilities. Grokking — sudden generalization long after apparent memorization — suggested that models undergo internal phase transitions during training. Hubinger et al. (Anthropic, January 2024) demonstrated sleeper agent behavior: models trained with hidden objectives that activate only under specific conditions, resisting standard safety training.

Language models exhibit phase transitions — capabilities that appear suddenly at certain scales, mirroring the abrupt qualitative changes studied in physics and complexity science. Even the models’ creators cannot predict which capabilities will emerge next.

How big is the jump? Few-shot arithmetic scores 5 percent at 10^10 parameters. One order of magnitude later — commit to a guess: 10 percent? 25 percent? More? Check below — and then ask the harder question the checkbox poses: does the jump survive a change of ruler?

Emergent Capabilities

Select different capabilities to see the "flat, flat, flat, jump" pattern of emergence at scale.

Multi-digit addition/subtraction without fine-tuning

Show continuous metric (Schaeffer et al. 2023 "mirage" interpretation)

That checkbox is Schaeffer et al. (2023), who challenged this interpretation: the jump may partly be a measurement artifact. Measured on continuous metrics rather than pass/fail accuracy thresholds, the same improvement appears smoother — a sigmoid rather than a step function. Both perspectives capture partial truth: underlying capability may improve gradually, but practical utility does exhibit threshold effects.
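A back-of-envelope model shows how a smooth improvement can look like a jump once you change the ruler. Suppose an answer counts as correct only if every one of its tokens is right, and suppose each token is right independently with the same probability. Both suppositions are simplifying assumptions made for this sketch, as is the answer length of 10 tokens.

```python
def exact_match_accuracy(per_token_accuracy, answer_length):
    """Chance that all tokens are right, assuming independent tokens."""
    return per_token_accuracy ** answer_length


ANSWER_LENGTH = 10  # hypothetical length of a multi-digit answer

for p in [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]:
    strict = exact_match_accuracy(p, ANSWER_LENGTH)
    print(f"per-token accuracy {p:.2f} -> exact-match accuracy {strict:.3f}")
```

The per-token column climbs evenly, while the exact-match column stays near zero for most of the range and then rises steeply near the end. The continuous metric and the pass/fail metric are describing the same underlying improvement.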

Infrastructure, Industry, and the AI Ecosystem

Multimodal AI blurred the boundaries between domains. CLIP (Contrastive Language–Image Pre-training, 2021) connected text and images in a shared embedding space — a common map of meaning where a picture and the words describing it land in the same place. The applications arrived fast: text-to-image generators (DALL-E, Stable Diffusion), speech recognition (Whisper), and, by 2024, text-to-video (Sora, generating minute-long photorealistic clips). By 2025, frontier models were natively multimodal — text, images, audio, and video in a single architecture.

The infrastructure story was one of concentration. NVIDIA’s dominance in AI chips — successive generations, each faster than the last, from the V100 through the A100 and H100 to 2025’s Blackwell — was reinforced by its CUDA software ecosystem, a decade of optimised libraries that competitors (AMD, Google’s custom chips, others) have struggled to match. The figure to hold onto is the cost curve: training a frontier model rose from roughly $4.6 million for GPT-3 to over $100 million for GPT-4 to more than $1 billion for 2025 models — a more than two-hundred-fold climb in five years. A global GPU shortage became geopolitically significant.

An efficiency revolution ran in parallel: quantization (running models at lower numerical precision, say 4-bit instead of 16-bit), Flash Attention (cutting the memory cost of long inputs), mixture-of-experts (activating only a fraction of parameters per input), and speculative decoding. API pricing dropped twelve-fold in eighteen months. The open-source llama.cpp project let people run quantized 70-billion-parameter models on a high-end laptop. The gap between frontier capability and accessible capability narrowed even as the frontier advanced.
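Quantization is the easiest of these techniques to try by hand. The sketch below rounds a handful of weights to signed 4-bit integers plus one shared scale factor, then reconstructs them. It is one simple scheme (symmetric, one scale for the whole list) chosen for clarity; production systems use more elaborate variants, and the function names here are illustrative.

```python
def quantize(values, bits=4):
    """Map floats to small signed integers plus one shared scale factor."""
    levels = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    largest = max(abs(v) for v in values)
    scale = largest / levels if largest else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.4, 0.6, 0.2, 0.8]  # the four weights of the tiny network
codes, scale = quantize(weights, bits=4)
restored = dequantize(codes, scale)

for original, code, approx in zip(weights, codes, restored):
    print(f"{original:.2f} -> code {code:+d} -> {approx:.3f}")
```

Each weight now needs 4 bits instead of 16 or 32, at the price of a rounding error of at most half the scale factor per weight. That is the accuracy-for-size trade the paragraph above describes.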

The energy costs were substantial: training GPT-3 alone consumed roughly 1,287 megawatt-hours — about as much electricity as 120 average US homes use in a year. Data-center electricity demand was projected to double between 2022 and 2026; Google’s carbon emissions rose by roughly 50% over five years, partly because of AI compute; and AI labs began signing nuclear-power agreements to secure supply — a concrete signal of the scale of energy demand ahead.

Viewed through the lens of complexity science, the AI ecosystem is itself a complex adaptive system: co-evolutionary dynamics between models and users, positive feedback loops (better models attract more users generating more training data), power-law distributions in funding and compute, and emergent market structures that no single actor designed or controls.

Training a frontier AI model now costs over $1 billion and consumes as much electricity as a small city. This concentration of resources means the future of AI is shaped by the investment decisions of a handful of organizations — a structural feature, not a temporary condition.

One guess before you browse: did the first comprehensive AI-regulation proposal come before or after ChatGPT? Filter to Regulation and check.

AI Timeline Explorer

Browse the milestones of the AI revolution from 2015 to 2025. Filter by category to trace specific threads — models, hardware, regulation, open-source, safety, or multimodal. Click any milestone for details.

Showing 34 of 34 milestones, 2015–2025.

The Overview below is enough to follow everything after this section; open the Detailed view only if you want the chip-generation and supply-chain specifics.

Adjustable Depth

Deep dive into AI infrastructure and industry structure.

The AI industry concentrates around a few structural chokepoints. NVIDIA dominates GPU hardware through both performance and the CUDA software ecosystem — a moat that competitors have struggled to cross for over a decade. Training frontier models requires $100M-$1B+ in compute, limiting who can participate at the frontier to fewer than ten organizations globally.

The open-source ecosystem provides a counterforce: Meta’s open-weight strategy (Llama series), Mistral’s efficient architectures, and Hugging Face’s platform (300,000+ models, 100,000+ datasets) democratize access to inference and fine-tuning. But democratization of use is not democratization of creation — anyone can run a model, but only a handful can train frontier ones.

The energy footprint is becoming a constraint — one not sustainable without efficiency breakthroughs or new energy sources.

NVIDIA’s CUDA moat extends beyond hardware: ten years of optimized libraries (cuBLAS, cuDNN, TensorRT, NCCL for multi-GPU training) create massive switching costs. The V100 (2017) introduced Tensor Cores for mixed-precision training. The A100 (2020) added BFloat16 and Multi-Instance GPU (MIG). The H100 (2022) introduced the Transformer Engine with FP8 training support, delivering up to 30x inference speedup over the A100 by NVIDIA’s own figures. Blackwell (2025) adds FP4 support and NVLink interconnects scaling to 576-GPU clusters.

The Chinchilla-efficient training regime transformed resource allocation. Pre-Chinchilla, labs scaled parameters aggressively (GPT-3: 175B params on 300B tokens, ratio 1.7). Post-Chinchilla, the ratio shifted toward more data: Llama 2 70B trained on 2T tokens (ratio 29), Llama 3 405B on 15T tokens (ratio 37 — beyond even Chinchilla-optimal, suggesting continued returns from more data).
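The ratios quoted above are plain division, and it takes only a few lines to reproduce them. The parameter and token counts are the ones given in the paragraph; the dictionary layout is just a convenience for this sketch.

```python
# (parameters, training tokens) as quoted in the paragraph above
models = {
    "GPT-3": (175e9, 300e9),
    "Llama 2 70B": (70e9, 2e12),
    "Llama 3 405B": (405e9, 15e12),
}

ratios = {name: tokens / params for name, (params, tokens) in models.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} tokens per parameter")
```

The output is 1.7, 28.6, and 37.0 tokens per parameter, which round to the ratios of 1.7, 29, and 37 in the text.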

Efficiency techniques compound: Flash Attention (Dao et al., 2022) reduces attention from O(N^2) memory to O(N) by chunking and recomputation, enabling longer context windows. Mixture-of-Experts (MoE) activates only ~25% of parameters per token, cutting the arithmetic per query (measured in FLOPs, floating-point operations) roughly fourfold while maintaining model capacity. Speculative decoding uses a small draft model to propose tokens that the large model verifies in parallel, improving throughput 2-3x.
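The mixture-of-experts idea, running only a few experts per input, can be sketched with a top-k router. Everything concrete here is an assumption for illustration: eight toy experts that just multiply their input, hand-written router scores (a real model computes them from the input with a learned layer), and k = 2, which happens to give the 25% activation rate mentioned above.

```python
import math


def route_top_k(router_scores, k):
    """Keep the k highest-scoring experts and normalise their weights."""
    ranked = sorted(
        range(len(router_scores)), key=lambda i: router_scores[i], reverse=True
    )
    chosen = ranked[:k]
    peak = max(router_scores[i] for i in chosen)
    exps = {i: math.exp(router_scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, router_scores, k=2):
    """Run only the selected experts and blend their outputs."""
    gates = route_top_k(router_scores, k)
    return sum(gate * experts[i](x) for i, gate in gates.items())


# Eight hypothetical experts; expert m simply multiplies its input by m + 1.
experts = [lambda x, m=m: m * x for m in range(1, 9)]
router_scores = [0.1, 2.0, 0.3, 0.0, 1.5, 0.2, 0.1, 0.4]

gates = route_top_k(router_scores, k=2)
print(sorted(gates))  # [1, 4]: only 2 of the 8 experts run
print(moe_layer(3.0, experts, router_scores))
```

The other six experts are never called for this input, so their arithmetic is never spent, yet they remain available for inputs the router sends their way.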

US export controls on advanced chips to China (October 2022, updated 2023) made the semiconductor supply chain geopolitically central. TSMC fabricates 90%+ of advanced AI chips. ASML holds a monopoly on EUV lithography. These two companies, both dependent on US technology, represent critical chokepoints in the AI supply chain.

Safety, Alignment, and AI as Complex System

The alignment problem — ensuring AI systems do what we want — moved from fringe concern to central research topic during this decade. The challenges were specific: specification problems (models optimize for what you measure, not what you mean — Goodhart’s Law), robustness (models that work in the lab may fail unpredictably in deployment), and deceptive alignment (models that appear aligned during training but pursue different objectives when deployed).

RLHF was the first scalable alignment technique, but it has known limitations: reward hacking, sycophancy (telling users what they want to hear), and dependence on the quality of human evaluators. Constitutional AI (Anthropic) addressed some of these by having models critique their own outputs against written principles — more transparent and scalable than human feedback alone. DPO (direct preference optimization) simplified the training pipeline by eliminating the separate reward model. Each approach represented a different bet on how to align systems whose capabilities were advancing faster than the science of alignment.
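To see what "eliminating the separate reward model" looks like in practice, here is a sketch of the pairwise loss usually given for DPO, computed for a single preference pair. The inputs are log-probabilities of a preferred and a rejected response under the model being trained and under a frozen reference copy. The numbers in the example and the value of `beta` are made up for illustration, and the function name is hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Pairwise DPO-style loss for one (preferred, rejected) response pair.

    The margin measures how much more the trained model favours the
    preferred response, relative to the reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), written in a numerically stable form
    return math.log1p(math.exp(-beta * margin))


# No preference learned yet: model and reference agree, so the margin is 0.
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))

# Model now favours the preferred response more than the reference does.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss starts at log 2 when the model has learned nothing beyond the reference and falls as the model shifts probability toward the preferred response. The preference data trains the model directly, with no reward model in between.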

Mechanistic interpretability offered a different path: understanding what happens inside neural networks. Chris Olah and colleagues at Anthropic discovered that individual features in trained models correspond to interpretable concepts — the superposition hypothesis suggests that models represent more features than they have neurons by superimposing them. Sparse autoencoders can extract these features, making model behavior partially inspectable. This is early science, but it represents the most promising path toward understanding rather than merely controlling AI behavior.

The public debate intensified: an open letter calling for a pause on frontier training, a one-sentence statement by AI leaders naming extinction risk a global priority, Geoffrey Hinton leaving Google to speak freely — the timeline above holds the details.

The complexity science perspective ties these threads together. AI systems are the most complex artifacts ever built — billions of parameters interacting non-linearly, trained on human-generated data reflecting all of human knowledge and bias, deployed in feedback loops with billions of users. They exhibit emergence, phase transitions, and unpredictable behaviors. Understanding them requires the tools of network science, behavioral economics, and data science. And governing them requires the institutional thinking explored in Politics and Governance.

The alignment problem is not merely technical — it is a governance problem embedded in a complex adaptive system.

The decade 2015–2026 moved AI from an academic discipline to a civilization-reshaping force. The next modules explore its specific impacts: how AI transforms agent-based modeling, reshapes the digital economy, and challenges governance structures designed for a pre-AI world.