Data: The New Raw Material
Governance, ethics, the modern data stack, and how data transformed complexity science.
What Data Is, and How It Becomes Knowledge
This module’s title calls data a “raw material,” and the metaphor is exact: like ore, data is worthless until refined, and most of what is dug up is waste. But before the refining, it helps to be precise about what is being refined — because three words everyday language treats as synonyms are, here, three different things.
Data is the raw record: characters, numbers, measurements, the contents of a file. Information is what the data means once read in context — “10:30” is data; “your flight boards at 10:30” is information. Knowledge is information validated and generalized enough to act on: not “this shopper bought an umbrella while it rained” but “rain predicts umbrella sales.” Data is the container, information is the content, and knowledge is the content you can rely on. Climbing that ladder — from record to meaning to something dependable — is the entire job of the discipline this module describes.
A second distinction governs everything downstream: how structured the data is. Structured data fits a predefined model — the rows and columns of a table, every field a declared type. Semi-structured data carries its own loose schema — a JSON or CSV record, recognizable but not guaranteed. Unstructured data has no model a machine can lean on: the text of an email, an image, an audio clip. By a rule of thumb long repeated in the industry, the great majority of an organization’s data — perhaps 80% — was unstructured and therefore largely unanalyzed; the central achievement of the deep-learning era, as the next module shows, was making that unread majority computable.
For decades the structured world had one dominant home: the relational database, which stores data as tables and answers questions in SQL. Its defining guarantee is ACID — atomicity, consistency, isolation, durability — the promise that when two people change the same account balance at the same instant, the database resolves the collision rather than settling into a corrupted in-between state. ACID is why banks trusted databases with money.
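The atomicity part of that guarantee can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a model of a real banking system: the `accounts` table, the names, and the amounts are all invented, and the example shows all-or-nothing updates on a single connection rather than isolation between concurrent writers.

```python
import sqlite3

# Hypothetical two-account ledger; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so that a failed debit leaves work to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so nothing changes
print(ok, failed, balances(conn))
```

The second transfer credits one account before the debit fails, yet neither change survives: the database never settles into the half-finished state.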
But ACID was designed for a single machine. When data outgrew any one computer and had to spread across many, a hard limit appeared: the CAP theorem, which says a distributed data store cannot simultaneously guarantee consistency, availability, and tolerance of network partitions. Partitions cannot be ruled out in a real network, so when the network splits, the store must give up either consistency or availability. The entire “NoSQL” generation of the 2010s was a catalogue of different answers to that forced choice. And the lakehouse architecture this module discusses below is best read as the latest answer: bolting ACID guarantees back onto the cheap, distributed storage that had abandoned them.
From Data to Knowledge: The Data-Science Method
Suppose a shop has sold a new product for a month and still does not know who is buying it. It has a table: each row a shopper, the columns recording age band, time of day, how full the cart was, the weather — and whether they bought. The question “who are our customers?” has become a question about that table: which columns predict the last one?
That is the elementary move of data science — a discipline renamed twice without changing much underneath. In the 1990s it was Knowledge Discovery in Databases (KDD); through the 2000s, data mining; since the 2010s, data science. The constant: using statistics and computation to pull from data the knowledge needed to predict or decide.
When the column you want to predict is a category (buys or doesn’t, fraud or not, malignant or benign), the task is classification. When it is a number (how many units, how much spend), it is regression. Both are supervised learning: you train the method on examples where the answer is already known (the “bought?” column), and it learns to fill that answer in where it isn’t. The 2016 worked example used a decision tree, a flowchart of nested if-then rules that a program builds automatically from the data, but the neural networks driving the AI revolution are supervised learners too, fitting far more complex nonlinear relationships to far more data. Fine-tuning a model on labelled examples, in the next module, is this same supervised move, scaled up.
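To make "a program builds the rules from the data" concrete, here is a deliberately tiny sketch: a one-question decision tree (a "stump") learned from an invented version of the shop's table. The rows, column names, and helper functions are assumptions for illustration. Real decision-tree learners recurse to several levels and usually score splits with an impurity measure such as Gini or entropy; this sketch simply keeps the single column that classifies the most training rows correctly.

```python
from collections import Counter, defaultdict

# Invented toy data in the shape of the shop's table; not from any real dataset.
rows = [
    {"age_band": "18-30", "weather": "rain", "bought": True},
    {"age_band": "18-30", "weather": "sun", "bought": False},
    {"age_band": "31-50", "weather": "rain", "bought": True},
    {"age_band": "31-50", "weather": "sun", "bought": False},
    {"age_band": "51+", "weather": "rain", "bought": True},
    {"age_band": "51+", "weather": "sun", "bought": True},
]
TARGET = "bought"


def fit_stump(rows, target):
    """Pick the one column whose values best predict the target on the training rows."""
    best = None
    features = [name for name in rows[0] if name != target]
    for feature in features:
        # Count the known answers seen for each value of this column.
        votes = defaultdict(Counter)
        for row in rows:
            votes[row[feature]][row[target]] += 1
        # The rule for each value is its majority answer.
        rule = {value: counts.most_common(1)[0][0] for value, counts in votes.items()}
        correct = sum(rule[row[feature]] == row[target] for row in rows)
        if best is None or correct > best[2]:
            best = (feature, rule, correct)
    return best[0], best[1]


def predict(stump, row, default=False):
    """Fill in the answer for a row whose target is unknown."""
    feature, rule = stump
    return rule.get(row[feature], default)


stump = fit_stump(rows, TARGET)
print(stump)
print(predict(stump, {"age_band": "18-30", "weather": "rain"}))
```

On this toy table the learner settles on the weather column, because it explains five of the six rows while age band explains only four.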
When no answer column is supplied, the task is unsupervised learning — the method looks for structure rather than a labelled target. Sorting shoppers into segments nobody defined in advance is the canonical case. (The large language models of Module 10 blur this old binary: they are self-supervised, manufacturing their own prediction targets out of raw text — a third category the 2016 taxonomy did not anticipate.)
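Segmenting shoppers without labels can be sketched with k-means, one common clustering method (the text above does not name a specific algorithm, so this choice is an assumption). The basket totals and the starting centroids are invented, and the one-dimensional version is kept only for readability.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Invented basket totals; no "segment" column is supplied anywhere.
spend = [4, 5, 6, 48, 50, 52]
centroids, segments = kmeans_1d(spend, centroids=[0, 100])
print(centroids, segments)
```

Nothing told the method that "small baskets" and "large baskets" exist; the two groups fall out of the distances alone, which is the sense in which the structure is discovered rather than taught.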
A prediction can be wrong in two different ways, and the difference often matters more than the overall error rate. Lay the model’s guesses against reality in a confusion matrix:
| | Actually buys | Actually doesn’t |
|---|---|---|
| Model predicts “buys” | Correct (true positive) | False positive (false alarm) |
| Model predicts “doesn’t” | False negative (missed sale) | Correct (true negative) |
A false positive is a false alarm — the model said “buys,” the shopper didn’t; one unit too many shipped. A false negative is a miss — the model said “won’t,” the shopper would have; a sale lost. For a shop the two errors cost about the same. For a fraud detector — or the recidivism algorithm this module examines later — they do not: a false positive freezes an innocent transaction or flags an innocent defendant, a false negative lets the real case through. Which error you are willing to tolerate is a value choice, not a technical one — and that seemingly small point is the seed of the fairness-impossibility results below.
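The two error types, and the way a cost weighting turns them into a value choice, can be sketched in a few lines. The predictions, outcomes, and cost figures below are invented for illustration.

```python
def confusion_counts(predicted, actual):
    """Tally the four cells of the confusion matrix from paired booleans."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for p, a in zip(predicted, actual):
        if p and a:
            counts["tp"] += 1
        elif p and not a:
            counts["fp"] += 1  # false alarm
        elif a:
            counts["fn"] += 1  # miss
        else:
            counts["tn"] += 1
    return counts


def rates(c):
    """Overall accuracy plus the two error rates, each over its own denominator."""
    return {
        "accuracy": (c["tp"] + c["tn"]) / sum(c.values()),
        "false_positive_rate": c["fp"] / (c["fp"] + c["tn"]),
        "false_negative_rate": c["fn"] / (c["fn"] + c["tp"]),
    }


def total_cost(c, cost_fp, cost_fn):
    """Weight each kind of error by what it costs in a given setting."""
    return c["fp"] * cost_fp + c["fn"] * cost_fn


# Invented example: model guesses against what eight shoppers actually did.
predicted = [True, True, True, False, False, False, False, True]
actual = [True, True, False, False, False, True, False, False]

counts = confusion_counts(predicted, actual)
summary = rates(counts)
shop_cost = total_cost(counts, cost_fp=1, cost_fn=1)    # both errors cost the same
fraud_cost = total_cost(counts, cost_fp=1, cost_fn=10)  # a miss is assumed ten times worse
print(counts, summary, shop_cost, fraud_cost)
```

The same eight predictions score identically on accuracy in both settings; only the cost weights, which come from outside the model, change how bad the result is.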
Finding a model that works is rarely one clean pass. In practice the method is a loop — understand the problem and the data, prepare the data (the cleaning that consumes the 60–80% of effort the quality section describes), build the model, test it against data it has never seen, deploy it — and the loop runs backwards as often as forwards, because the testing step keeps revealing that the first step needed more data. Two facts discipline the whole enterprise. Garbage in, garbage out: a model can be no better than the data beneath it. And learning has limits: computational learning theory proves that not every pattern is learnable from data at all, and that the harder the pattern and the scarcer the data, the worse any method must do.
That last fact is where a restored foundation meets a live challenge. The 2016 edition drew a firm line: the intelligence lives in the data scientist, not the algorithm, and complex systems stay unpredictable even for machine learning. The scaling-laws era has pressed hard on that line — large models now perform feature engineering that once demanded human expertise, and capabilities have emerged at scale that no one hand-designed. But pressed is not erased. Scaling has moved the frontier of what is learnable; it has not repealed the theorem that a frontier exists. A model trained on the past stays weakest exactly where the world turns genuinely novel — the same limit, in a different vocabulary, that the economics module met in Hayek’s knowledge problem. The honest reading is that AI has automated far more of the data-mining loop than the 2016 account expected, without dissolving the irreducible complexity the loop was always running up against.
Every term the AI revolution leans on — supervised and self-supervised learning, classification, fine-tuning, false positives and negatives — is defined here, in the older and humbler vocabulary of data science. The deep-learning era did not replace this foundation; it industrialized it, and ran it into the same limits at a larger scale.
From Big Data to Data Engineering
“Big data” peaked as a buzzword around 2015. Hadoop was the centerpiece — store everything in HDFS, process it with MapReduce, figure out what you want later. The promise was transformative; the reality was data swamps, operational nightmares, and clusters that cost $500,000 per year to run.
The buzzword died, but the problems it named — volume, velocity, variety — did not. Instead, they were absorbed into a maturing discipline: data engineering. Maxime Beauchemin’s 2017 blog post “The Rise of the Data Engineer” captured the shift. Data work was no longer the province of generalists running ad-hoc scripts; it was becoming a rigorous engineering practice with version control, testing, documentation, and CI/CD pipelines.
The most consequential architectural shift was ELT replacing ETL. In the traditional Extract-Transform-Load paradigm, data was cleaned and restructured before loading into the warehouse — expensive, rigid, and slow. When cloud data warehouses made compute cheap and elastic, a radical simplification became possible: load the raw data first, transform it in place using SQL. dbt (data build tool) became the transformation layer, turning SQL into testable, version-controlled, modular code.
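The load-first, transform-in-place pattern can be sketched with an in-memory SQLite database standing in for a cloud warehouse. Everything here is an assumption for illustration: the `raw_orders` table, its columns, and the messy sample rows. The sketch does not use dbt itself; it only shows the kind of SQL transformation and test that such a tool would keep under version control.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a warehouse

# Extract + Load: land the raw records untouched, everything as text.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount_cents TEXT, country TEXT)")
raw = [
    ("1", "1999", "de"),
    ("2", " 500", "DE"),
    ("3", "", "fr"),      # missing amount
    ("4", "1250", "FR "),
]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw)

# Transform: done afterwards, inside the warehouse, in SQL.
conn.execute(
    """
    CREATE VIEW orders_clean AS
    SELECT CAST(order_id AS INTEGER)           AS order_id,
           CAST(TRIM(amount_cents) AS INTEGER) AS amount_cents,
           UPPER(TRIM(country))                AS country
    FROM raw_orders
    WHERE TRIM(amount_cents) <> ''
    """
)

# A test on the transformed model, in the spirit of "SQL as testable code".
null_amounts = conn.execute(
    "SELECT COUNT(*) FROM orders_clean WHERE amount_cents IS NULL"
).fetchone()[0]

revenue = dict(
    conn.execute(
        "SELECT country, SUM(amount_cents) FROM orders_clean"
        " GROUP BY country ORDER BY country"
    )
)
print(null_amounts, revenue)
```

Because the raw table is never modified, a changed cleaning rule only means redefining the view; nothing has to be extracted again.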
The modern data stack crystallized around 2019–2020: cloud warehouses (Snowflake’s $3.4 billion IPO in September 2020; Databricks valued at $43 billion by 2024), the lakehouse paradigm bridging data lakes and warehouses (Delta Lake, Apache Iceberg, Apache Hudi adding ACID transactions to object storage), and streaming data becoming mainstream (Kafka adopted by 80%+ of the Fortune 100, Confluent’s IPO in 2021).
Hadoop’s decline was swift. Cloudera and Hortonworks merged in 2018; Cloudera was taken private for $5.3 billion in 2021. The on-premise monolith gave way to the cloud-native modular stack — separation of compute and storage as the foundational principle, with best-of-breed tools at each layer.
The most important shift was not more data, but better plumbing. Data engineering matured from ad-hoc scripts into a rigorous discipline — because organizations learned that the pipeline is the product.
Privacy, Governance, and the Right to Be Forgotten
The General Data Protection Regulation (GDPR), enforced from May 2018, was the most consequential data regulation in history. Its extraterritorial scope meant it applied to any organization processing EU residents’ data, regardless of where the organization was based. The right to erasure forced architectural changes to systems that had assumed append-only storage. Data Protection Impact Assessments, consent management, and data lineage tracking became engineering requirements — not just legal checkboxes.
The fines were substantial: Amazon received a €746 million penalty in 2021; Meta was fined €1.2 billion in 2023 for EU-US data transfers. Data sovereignty became geopolitically charged — the Schrems I decision (2015) invalidated the Safe Harbor framework, Schrems II (2020) struck down Privacy Shield, and the EU-US Data Privacy Framework (2023) represented a third attempt at a transatlantic data transfer mechanism. Data localization laws proliferated in Russia, China, India, and Vietnam. (For the political dimensions, see Politics and Governance.)
The tension between data utility and data privacy drove innovation in privacy-preserving computation. Four approaches emerged, each with different trade-offs: differential privacy (Apple deployed it in iOS 10 in 2016; the US Census used it in 2020), federated learning (Google’s Gboard, NVIDIA Clara for medical imaging), homomorphic encryption (computing on encrypted data — theoretically ideal but 1,000–10,000x overhead), and secure multi-party computation (banks using it for cross-institutional fraud detection).
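Of the four approaches, secure multi-party computation is perhaps the least intuitive, so a toy sketch may help. It uses additive secret sharing, one standard building block, to let three parties learn a joint total without revealing their own numbers. The banks, the counts, and the modulus are invented, and the sketch assumes every party follows the protocol honestly; production protocols add much more.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus; it only needs to exceed the true total


def share(value, n_parties):
    """Split `value` into n random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


# Hypothetical: three banks each hold a private count of suspicious transactions.
private_counts = [12, 7, 30]
n = len(private_counts)

# Each bank splits its count and sends share j to bank j (row i = bank i's shares).
distributed = [share(v, n) for v in private_counts]

# Each bank adds up the shares it received and publishes only that partial sum.
partial_sums = [sum(row[j] for row in distributed) % MODULUS for j in range(n)]

# The partial sums reveal the joint total, but no single bank's input.
total = sum(partial_sums) % MODULUS
print(total)
```

Any individual share is a uniformly random number, so a bank that sees one share from a competitor learns nothing about the competitor's count.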
Privacy regulation did not stop data use — it restructured it. The technical challenge shifted from “how do we collect more data?” to “how do we extract value from data we are not allowed to see?” This constraint, paradoxically, drove some of the most innovative engineering of the decade.
Privacy Trade-off Explorer
The explorer compares privacy-preserving computation techniques across five dimensions. Each has a different balance of privacy strength, computational cost, and practical maturity.
Differential privacy
- Apple iOS emoji/keyboard (2016)
- US Census 2020
- Google Chrome RAPPOR
Mature and widely deployed. The privacy-utility trade-off is tunable via the epsilon parameter: smaller epsilon means stronger privacy but noisier results.
Federated learning
- Google Gboard keyboard predictions
- NVIDIA Clara for medical imaging
- MELLODDY consortium (pharma)
Keeps data distributed across devices or institutions. Communication overhead is the bottleneck. Vulnerable to gradient inversion attacks without additional protections.
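The epsilon trade-off noted for differential privacy can be sketched with the Laplace mechanism, a textbook way of releasing a count. This is an illustration under stated assumptions, not the method used by any of the deployments listed: the true count, the epsilon values, and the helper names are invented, and the query is assumed to have sensitivity 1 (one person changes the count by at most one).

```python
import random


def laplace_noise(scale, rng):
    """One Laplace(0, scale) draw: the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Release a count with noise scaled to 1/epsilon (sensitivity assumed to be 1)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon, trials=5000, seed=0):
    """Average distance between the released and the true count over many releases."""
    rng = random.Random(seed)
    true_count = 1000  # invented value
    errors = [abs(private_count(true_count, epsilon, rng) - true_count) for _ in range(trials)]
    return sum(errors) / trials


strong_privacy_error = mean_abs_error(epsilon=0.1)  # small epsilon: more noise
weak_privacy_error = mean_abs_error(epsilon=1.0)    # large epsilon: less noise
print(strong_privacy_error, weak_privacy_error)
```

The expected absolute error of this mechanism is 1/epsilon, so tightening epsilon from 1.0 to 0.1 makes the released counts roughly ten times noisier.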
Data Quality: The 80% Problem
Data scientists spent 60–80% of their time cleaning and preparing data. “Garbage in, garbage out” applied with renewed force in the machine learning era, where model performance was directly bounded by data quality. Yet quality was treated as an afterthought — something individual analysts handled ad hoc rather than something organizations engineered systematically.
Andrew Ng’s data-centric AI movement (2021) reframed the problem. The ML community had been obsessively focused on model architecture — more layers, more parameters, more compute. Ng argued the higher-leverage intervention was often improving the data while keeping the model fixed. Landing AI’s competition demonstrated this: participants improved model performance by curating better training data, not by architectural innovation.
Data contracts emerged as the engineering response. Proposed by Andrew Jones (2022–2023), they treated data interfaces like API contracts — explicit, versioned, testable agreements between data producers and consumers. Three levels of contract captured progressively deeper guarantees: schema contracts (structural), SLA contracts (operational), and semantic contracts (meaning). The cultural shift was profound: from treating data as a byproduct of operations to treating it as a product with explicit quality guarantees.
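The first of those levels, the schema contract, can be sketched as a small check that a producer runs before publishing a record. The `Field` class, the `ORDERS_V1` contract, and the sample records are hypothetical; real data-contract tooling typically also covers the SLA and semantic levels, which this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    kind: type
    required: bool = True


# Hypothetical contract, version 1, for an "orders" feed.
ORDERS_V1 = [
    Field("order_id", int),
    Field("amount", float),
    Field("coupon", str, required=False),
]


def violations(record, contract):
    """Return a list of human-readable schema problems; an empty list means the record conforms."""
    problems = []
    for field in contract:
        value = record.get(field.name)
        if value is None:
            if field.required:
                problems.append(f"missing required field: {field.name}")
            continue
        if not isinstance(value, field.kind):
            problems.append(
                f"{field.name}: expected {field.kind.__name__}, got {type(value).__name__}"
            )
    return problems


good = violations({"order_id": 1, "amount": 9.5}, ORDERS_V1)
bad = violations({"order_id": "1"}, ORDERS_V1)
print(good, bad)
```

Running the check on the producer's side is the point: the malformed record is rejected where it is created, not discovered later in a dashboard.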
Synthetic data addressed a different quality problem: scarcity. When real data was insufficient, privacy-constrained, or too expensive, organizations could generate it. GANs (StyleGAN for images, CTGAN for tabular data), diffusion models, and LLM-based synthetic data (the Self-Instruct approach; the synthetic “textbook” corpora behind Microsoft’s Phi models) expanded the toolkit. But a new risk appeared: model collapse, a recursive degradation loop in which models trained on synthetic data generated by other models drift further from reality with each generation.
Andrew Ng’s inversion was profound: instead of asking “what model architecture works best on this data?”, ask “what data would make even a simple model work well?” The data-centric perspective mirrors complexity science — the structure of the inputs matters more than the sophistication of the processing.
Data scientists spend 60–80% of their time cleaning data. Data contracts aim to shift quality assurance upstream — catching issues at the source rather than the dashboard.
Data Ethics: Bias, Fairness, and Power
ProPublica’s 2016 investigation of COMPAS — a recidivism prediction algorithm used in US courts — demonstrated that the system was nearly twice as likely to falsely flag Black defendants as high-risk compared to white defendants. This was not an isolated failure but the first widely publicized instance of a systemic pattern.
Joy Buolamwini and Timnit Gebru’s Gender Shades study (2018) revealed that commercial facial-analysis systems misclassified dark-skinned women at error rates of up to 34.7%, versus at most 0.8% for light-skinned men. The bias cascade was clear: biased training data produces biased models, which produce biased decisions, which produce biased outcomes that in turn generate more biased training data. The loop is self-reinforcing.
The research community responded with a proliferation of fairness definitions: equalized odds, demographic parity, individual fairness, counterfactual fairness. But researchers proved that many desirable criteria are mathematically incompatible (Chouldechova 2017, Kleinberg et al. 2016). You cannot simultaneously satisfy all reasonable fairness constraints — tradeoffs are unavoidable, and choosing between them is a value judgment, not a technical decision.
Data colonialism (Couldry and Mejias, 2019) named a structural power asymmetry: data extracted from developing countries trained models whose benefits accrued to Silicon Valley. Annotation labor — the human work of labeling training data — was often performed by workers in Kenya, the Philippines, and India for less than $2 per hour, including content moderation work that exposed workers to traumatic material.
Bias in AI is not a bug that can be patched — it is a structural feature of learning from historical data. When that data reflects centuries of discrimination, the model does not correct the bias; it operationalizes it at scale and speed no human institution could match.
Several mathematical definitions of fairness have been proposed: demographic parity (equal selection rates across groups), equalized odds (equal error rates), individual fairness (similar individuals treated similarly), and counterfactual fairness (the outcome would be the same in a counterfactual world where the protected attribute differed).
The sobering discovery: Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016) independently proved that except in trivial cases, it is mathematically impossible to simultaneously satisfy calibration (predicted probabilities match actual rates) and balance (equal false positive/negative rates across groups). When base rates differ between groups — as they often do due to historical inequality — you must choose which fairness criterion to satisfy.
This means fairness is ultimately a value choice, not a technical optimization. Tools like IBM AI Fairness 360 and Microsoft Fairlearn can measure and partially mitigate bias, but they cannot eliminate the fundamental tradeoff. The question “fair according to whom?” remains irreducibly political.
The impossibility results in algorithmic fairness have precise mathematical formulations. Chouldechova (2017) proved that when base rates differ between groups, a classifier cannot simultaneously achieve predictive parity (equal positive predictive value across groups, the error-rate counterpart of calibration), equal false positive rates, and equal false negative rates. Kleinberg et al. (2016) proved a related impossibility involving calibration within groups and balance for the positive and negative classes.
COMPAS illustrates this directly: the system was calibrated — a score of 7 meant the same recidivism probability regardless of race. But because base rates differed (due to historical policing patterns and systemic factors), equal calibration necessarily produced unequal false positive rates. ProPublica focused on error rates; Northpointe focused on calibration. Both were mathematically correct — they were measuring different, incompatible criteria.
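The arithmetic behind that conflict can be sketched from the confusion matrix alone. For any classifier, the false positive rate is fixed by the base rate p, the positive predictive value (PPV), and the false negative rate (FNR): FPR = p/(1 − p) × (1 − PPV)/PPV × (1 − FNR). The sketch below uses PPV as a stand-in for calibration, which is a simplifying assumption, and the numbers are illustrative rather than the COMPAS figures.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate forced by the base rate, PPV and FNR (confusion-matrix identity)."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Illustrative numbers only: the score "means the same thing" in both groups
# (same PPV) and misses the same share of true cases (same FNR).
ppv, fnr = 0.6, 0.3

fpr_group_a = implied_fpr(base_rate=0.5, ppv=ppv, fnr=fnr)
fpr_group_b = implied_fpr(base_rate=0.3, ppv=ppv, fnr=fnr)
print(round(fpr_group_a, 3), round(fpr_group_b, 3))
```

With PPV and FNR held equal, the group with the higher base rate necessarily gets the higher false positive rate (about 0.47 against 0.20 here). Equalizing the false positive rates would require giving up equality on one of the other two quantities.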
The causal fairness framework (Kusner et al., 2017) attempts resolution by reasoning about counterfactuals: would the decision have been the same if the protected attribute had been different? This requires a causal model of the data-generating process — a significant modeling assumption.
Intersectionality compounds the challenge. Gender Shades showed that bias was worst not for women or dark-skinned individuals separately, but for dark-skinned women — the intersection. Fairness constraints applied independently along each axis may miss intersectional discrimination entirely.
Data colonialism extends ethics beyond individual fairness to structural power. Couldry and Mejias formalize “data relations” as a new form of extractive capitalism. The annotation labor pipeline exemplifies this: OpenAI’s content moderation workers in Kenya earned $1.32–$2.00 per hour to label traumatic content that made ChatGPT safer for Western users.
The deepfake detection arms race illustrates a game-theoretic dynamic: generators improve faster than detectors because generation requires only plausibility while detection requires certainty. Content provenance standards (C2PA) represent the most promising response — proving where content came from rather than detecting whether it is synthetic.
Data as the Lens for Complexity
The data revolution did not merely provide more data — it fundamentally expanded what complexity science could study. The earlier modules discussed complexity in theoretical terms: network structure, bounded rationality. Data made empirical complexity science possible at scale.
Large-scale behavioral data transformed social science. Mobile phone call detail records revealed universal patterns in human mobility (Barabási, González). Social media data enabled real-time study of information diffusion, opinion dynamics, and polarization. Wearable and IoT data opened complexity-aware health research — circadian rhythms as coupled oscillators, activity patterns as markers of systemic health.
Real-time urban data fed agent-based traffic models, smart city management, and digital twins. Singapore’s Virtual Singapore and Helsinki’s Kalasatama digital twin modeled entire neighborhoods in real time, enabling what-if scenarios for urban planning. (As live, continuously calibrated agent-based models (ABMs), digital twins are developed in full in Module 15.)
COVID-19 was the paradigm case. The Johns Hopkins CSSE dashboard provided real-time global tracking. Google and Apple mobility reports calibrated epidemiological models. GISAID enabled real-time variant tracking. Agent-based models (Covasim, OpenABM-Covid19) used granular contact data for realistic simulations. But the pandemic also revealed the limits — reporting biases, inconsistent definitions, and data delays that undermined real-time claims.
Data drove five transformations in complexity science: from toy models to calibrated models, from hypothesis to empirical validation, from simulation to hybrid approaches (physics-informed neural networks, neural ODEs), from offline to real-time, and the emergence of phenomena visible only through data instrumentation (flash crashes, viral cascades, and coordinated inauthentic behavior that exist only in the digital substrate).
The data revolution transformed complexity science from a field that could theorize about emergence to one that could observe it in real time — in financial markets, epidemics, cities, and ecosystems. Data did not replace theory; it gave theory something to push against.
The data infrastructure surveyed in this module — the pipelines, privacy frameworks, quality systems, and ethical guardrails — is the foundation on which the AI revolution and modern agent-based modeling are built. The quality of AI’s outputs cannot exceed the quality of its inputs. The ethics embedded in the data propagate through every model trained on it. In complexity terms: the substrate constrains the dynamics.