Module 9

Data: The New Raw Material

Governance, ethics, the modern data stack, and how data transformed complexity science.

~17 min read · Intermediate

What Data Is, and How It Becomes Knowledge

This module’s title calls data a “raw material,” and the metaphor is exact: like ore, data is worthless until refined, and most of what is dug up is waste. But before the refining, it helps to be precise about what is being refined — because three words everyday language treats as synonyms are, here, three different things.

Data is the raw record: characters, numbers, measurements, the contents of a file. Information is what the data means once read in context — “10:30” is data; “your flight boards at 10:30” is information. Knowledge is information validated and generalized enough to act on: not “this shopper bought an umbrella while it rained” but “rain predicts umbrella sales.” Data is the container, information is the content, and knowledge is the content you can rely on. Climbing that ladder — from record to meaning to something dependable — is the entire job of the discipline this module describes.

A second distinction governs everything downstream: how structured the data is. Structured data fits a predefined model — the rows and columns of a table, every field a declared type. Semi-structured data carries its own loose schema — a JSON or CSV record. One shopper, as JSON:

{ "age": "young", "cart": "full", "bought": "yes" }

The record names its own fields, but nothing enforces them — nothing stops the next record from giving "age" as "fairly young". Unstructured data has no model a machine can lean on: the text of an email, an image, an audio clip. Most of an organization’s data — by industry lore, 80% — was unstructured and therefore largely unanalyzed; the central achievement of the deep-learning era, as the next module shows, was making that unread majority computable.
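To make "nothing enforces them" concrete, the following is a minimal sketch of what enforcement could look like if it were added by hand. The field names come from the record above; the sets of allowed values and the `validate_record` helper are assumptions made for illustration, not part of any particular system.

```python
import json

# Hypothetical schema: field names are taken from the example record;
# the allowed values are assumptions made for this sketch.
ALLOWED = {
    "age": {"young", "middle", "old"},
    "cart": {"empty", "half", "full"},
    "bought": {"yes", "no"},
}

def validate_record(raw: str) -> list[str]:
    """Return a list of problems found in one JSON record (empty if none)."""
    record = json.loads(raw)
    problems = []
    for field, allowed in ALLOWED.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif record[field] not in allowed:
            problems.append(f"unexpected value for {field}: {record[field]!r}")
    for field in record:
        if field not in ALLOWED:
            problems.append(f"unknown field: {field}")
    return problems

good = '{ "age": "young", "cart": "full", "bought": "yes" }'
bad = '{ "age": "fairly young", "cart": "full", "bought": "yes" }'
print(validate_record(good))  # []
print(validate_record(bad))   # ["unexpected value for age: 'fairly young'"]
```

A structured store performs this kind of check on every write; with semi-structured data it happens only if someone writes it.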

For decades the structured world had one dominant home: the relational database, which stores data as tables and answers questions in SQL. What happens when two people change the same account balance at the same instant? The ACID guarantees — atomicity, consistency, isolation, durability — were invented for exactly that collision: the database resolves it rather than settling into a corrupted in-between state. ACID is why banks trusted databases with money.
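A small sketch of the atomicity and consistency halves of ACID, using Python's built-in `sqlite3` module. The `accounts` table, the names, and the `transfer` helper are hypothetical. It does not exercise isolation between truly concurrent writers; it only shows that a half-finished change is rolled back rather than left in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency rule: no overdrafts
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

print(transfer(conn, "alice", "bob", 30), balances(conn))
# True {'alice': 70, 'bob': 80}
print(transfer(conn, "alice", "bob", 500), balances(conn))
# False {'alice': 70, 'bob': 80}
```

In the second call the credit to `bob` has already been applied when the debit violates the CHECK constraint; the rollback undoes it, so no money is created.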

But ACID was designed for a single machine. When data outgrew any one computer and had to spread across many, a hard limit appeared: the CAP theorem, which says a distributed data store cannot simultaneously guarantee consistency, availability, and tolerance of network partitions. Partitions cannot be ruled out in a real network, so when the network splits, the store must give up either consistency or availability. The entire “NoSQL” generation of the 2010s was a catalogue of different answers to that forced choice. And the lakehouse architecture this module discusses below is best read as the latest answer: bolting ACID guarantees back onto the cheap, distributed storage that had abandoned them.

From Data to Knowledge: The Data-Science Method

Suppose a shop has sold a new product for a month and still does not know who is buying it. It has a table — a month of sales, say 400 shoppers: each row a shopper, five columns recording age band, time of day, how full the cart was, the weather — and whether they bought. The question “who are our customers?” has become a question about that table: which columns predict the last one?

That is the elementary move of data science — a discipline renamed twice without changing much underneath. In the 1990s it was Knowledge Discovery in Databases (KDD); through the 2000s, data mining; since the 2010s, data science. The constant: using statistics and computation to pull from data the knowledge needed to predict or decide.

When the column you want to predict is a category (buys or doesn’t, fraud or not, malignant or benign), the task is classification. When it is a number (how many units, how much spend), it is regression. Both are supervised learning: you train the method on examples where the answer is already known (the “bought?” column), and it learns to fill that answer in where it isn’t. The 2016 worked example used a decision tree, a flowchart of nested if-then rules a program builds automatically from the data, but the neural networks driving the AI revolution are supervised learners too, fitting far more complex nonlinear relationships to far more data. Fine-tuning a model on labelled examples, in the next module, is this same supervised move, scaled up.
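As a rough illustration of supervised classification, the sketch below learns a decision stump, that is, a decision tree with a single split, from a handful of labelled shoppers. It is not the 2016 worked example; the six rows, the column names, and `fit_stump` are invented for this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows: (age band, time of day, cart, weather, bought?)
ROWS = [
    ("young", "morning", "full",  "rain", "yes"),
    ("young", "evening", "full",  "sun",  "yes"),
    ("old",   "morning", "full",  "sun",  "yes"),
    ("old",   "evening", "empty", "rain", "no"),
    ("young", "morning", "empty", "sun",  "no"),
    ("old",   "morning", "empty", "rain", "no"),
]
COLUMNS = ["age", "time", "cart", "weather"]

def fit_stump(rows):
    """Pick the one column whose majority-vote rule makes the fewest training errors."""
    best = None
    for i, name in enumerate(COLUMNS):
        votes = defaultdict(Counter)
        for row in rows:
            votes[row[i]][row[-1]] += 1          # count labels per column value
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in votes.items()}
        errors = sum(rule[row[i]] != row[-1] for row in rows)
        if best is None or errors < best[0]:
            best = (errors, name, rule)
    return best

errors, column, rule = fit_stump(ROWS)
print(column, rule, errors)  # cart {'full': 'yes', 'empty': 'no'} 0
```

A full decision tree repeats the same search inside each branch; the stump stops after the first question.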

When no answer column is supplied, the task is unsupervised learning — the method looks for structure rather than a labelled target. Sorting shoppers into segments nobody defined in advance is the canonical case. (The large language models of Module 10 blur this old binary: they are self-supervised, manufacturing their own prediction targets out of raw text — a third category the 2016 taxonomy did not anticipate.)

A prediction can be wrong in two different ways, and the difference often matters more than the overall error rate. Lay the model’s guesses against reality in a confusion matrix:

	Actually buys	Actually doesn’t
Model predicts “buys”	True positive: correct	False positive: false alarm
Model predicts “doesn’t”	False negative: missed sale	True negative: correct

A false positive is a false alarm — the model said “buys,” the shopper didn’t; one unit too many shipped. A false negative is a miss — the model said “won’t,” the shopper would have; a sale lost. For a shop the two errors cost about the same. For a fraud detector — or the recidivism algorithm this module examines later — they do not: a false positive freezes an innocent transaction or flags an innocent defendant, a false negative lets the real case through. Which error you are willing to tolerate is a value choice, not a technical one — and that seemingly small point is the seed of the fairness-impossibility results below.
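The sketch below tallies the four cells of the matrix and then prices the two error types differently. The labels, the eight example predictions, and the cost figures are all hypothetical; the point is only that the same counts yield different totals once the errors stop costing the same.

```python
def confusion_counts(actual, predicted, positive="buys"):
    """Tally the four cells of a two-class confusion matrix."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def error_cost(counts, cost_fp, cost_fn):
    """Total cost of the mistakes, given a price for each error type."""
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn

actual    = ["buys", "buys", "skips", "skips", "buys", "skips", "skips", "buys"]
predicted = ["buys", "skips", "buys", "skips", "buys", "skips", "buys", "buys"]

counts = confusion_counts(actual, predicted)
print(counts)                                      # {'tp': 3, 'fp': 2, 'fn': 1, 'tn': 2}
print(error_cost(counts, cost_fp=1, cost_fn=1))    # 3  (shop: errors cost the same)
print(error_cost(counts, cost_fp=1, cost_fn=10))   # 12 (misses are ten times worse)
```

Choosing `cost_fp` and `cost_fn` is exactly the value choice the paragraph describes; no amount of tuning picks them for you.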

Finding a model that works is rarely one clean pass. In practice the method is a loop — understand the problem and the data, prepare the data (the cleaning that consumes the 60–80% of effort the quality section describes), build the model, test it against data it has never seen, deploy it — and the loop runs backwards as often as forwards. To see why, hand the shop’s trained model a shopper its training month never contained — empty cart, evening, buys anyway. The model answers “won’t buy,” confidently and wrongly: no empty-cart buyer appears anywhere in its training data. That failure is why the loop returns for more data — and why testing data must never be training data. Two facts discipline the whole enterprise. Garbage in, garbage out: a model can be no better than the data beneath it. And learning has limits: computational learning theory proves that not every pattern is learnable from data at all, and that the harder the pattern and the scarcer the data, the worse any method must do.
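A minimal sketch of the hold-out discipline, followed by the unseen-shopper failure. The `train_test_split` helper here is hand-written for illustration (it is not a library function), and the learned rule and the novel shopper are hypothetical stand-ins for the shop's model.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into disjoint train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# 400 shopper ids standing in for the month of sales.
train, test = train_test_split(range(400))
print(len(train), len(test))        # 300 100
print(set(train) & set(test))       # set(): no row is in both parts

# A rule learned from a month that contained no empty-cart buyers.
rule = {"full": "yes", "empty": "no"}
novel = {"cart": "empty", "time": "evening", "bought": "yes"}
prediction = rule[novel["cart"]]
print(prediction, prediction == novel["bought"])  # no False
```

Only a held-out test set can surface this kind of miss before deployment, and only if it contains cases the training set lacks.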

That last fact is where a restored foundation meets a live challenge. The 2016 edition drew a firm line: the intelligence lives in the data scientist, not the algorithm, and complex systems stay unpredictable even for machine learning. The scaling-laws era has pressed hard on that line — large models now perform feature engineering that once demanded human expertise, and capabilities have emerged at scale that no one hand-designed. But pressed is not erased. Scaling has moved the frontier of what is learnable; it has not repealed the theorem that a frontier exists. A model trained on the past stays weakest exactly where the world turns genuinely novel — the same limit, in a different vocabulary, that the economics module met in Hayek’s knowledge problem. The honest reading is that AI has automated far more of the data-mining loop than the 2016 account expected, without dissolving the irreducible complexity the loop was always running up against.

Every term the AI revolution leans on — supervised and self-supervised learning, classification, fine-tuning, false positives and negatives — is defined here, in the older and humbler vocabulary of data science. The deep-learning era did not replace this foundation; it industrialized it, and ran it into the same limits at a larger scale.

From Big Data to Data Engineering

“Big data” peaked as a buzzword around 2015. Hadoop was the centerpiece — store everything in HDFS, process it with MapReduce, figure out what you want later. The promise was transformative; the reality was data swamps, operational nightmares, and clusters that cost $500,000 per year to run.

The buzzword died, but the problems it named — volume, velocity, variety — did not. Instead, they were absorbed into a maturing discipline: data engineering. Maxime Beauchemin’s 2017 blog post “The Rise of the Data Engineer” captured the shift. Data work was no longer the province of generalists running ad-hoc scripts; it was becoming a rigorous engineering practice with version control, testing, documentation, and CI/CD pipelines.

The most consequential architectural shift was ELT replacing ETL. In the traditional Extract-Transform-Load paradigm, data was cleaned and restructured before loading into the warehouse — expensive, rigid, and slow. When cloud data warehouses made compute cheap and elastic, a radical simplification became possible: load the raw data first, transform it in place using SQL. dbt (data build tool) became the transformation layer, turning SQL into testable, version-controlled, modular code.
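The load-first, transform-in-place idea can be sketched with an in-memory SQLite database standing in for a cloud warehouse. The table names, columns, and sample rows are hypothetical, and a tool such as dbt would manage the SQL as versioned models rather than as an inline string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Load: land the raw rows untouched, every field as text.
conn.execute("CREATE TABLE raw_sales (shopper TEXT, amount TEXT, sold_at TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", [
    ("a1", "12.50", "2024-03-01"),
    ("a1", " 7.50", "2024-03-02"),   # stray whitespace
    ("b2", "n/a",   "2024-03-02"),   # unparseable amount
    ("b2", "20.00", "2024-03-03"),
])

# Transform: done afterwards, inside the warehouse, in SQL.
conn.execute("""
    CREATE TABLE sales_by_shopper AS
    SELECT shopper, SUM(CAST(TRIM(amount) AS REAL)) AS total
    FROM raw_sales
    WHERE TRIM(amount) GLOB '[0-9]*'
    GROUP BY shopper
""")

rows = conn.execute(
    "SELECT shopper, total FROM sales_by_shopper ORDER BY shopper"
).fetchall()
print(rows)  # [('a1', 20.0), ('b2', 20.0)]
```

Because `raw_sales` is kept as loaded, a different cleaning rule can be applied later without re-extracting anything from the source.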

The modern data stack crystallized around 2019–2020: cloud warehouses, the lakehouse paradigm bridging data lakes and warehouses (Delta Lake, Apache Iceberg, Apache Hudi adding ACID transactions to object storage), and streaming data becoming mainstream.

Hadoop’s decline was swift. The on-premise monolith gave way to the cloud-native modular stack — separation of compute and storage as the foundational principle, with best-of-breed tools at each layer.

The most important shift was not more data, but better plumbing. Data engineering matured from ad-hoc scripts into a rigorous discipline — because organizations learned that the pipeline is the product.

Exactly one of the five pipeline stages changes position between ETL and ELT — which one, and where does it move? Commit to a guess, then toggle.

Data Pipeline Explorer

Toggle between the ETL and ELT paradigms, then explore how the full data stack evolved across three eras.

Sources: Databases, APIs, files
Extract: Pull raw data
Transform: Clean, join, aggregate
Load: Write to warehouse
Warehouse: Query-ready data

Transform location: External tools (Informatica, Talend)
Schema flexibility: Rigid, must define upfront
Cost driver: Upfront compute (transform before load)

What you should have seen: Transform jumped from mid-pipeline to the end, inside the warehouse. That one reordering is the whole shift — storage became cheap enough to keep everything raw and decide later.

Privacy, Governance, and the Right to Be Forgotten

The General Data Protection Regulation (GDPR), enforced from May 2018, was arguably the most consequential data regulation to date. Its extraterritorial scope meant it applied to any organization processing EU residents’ data, regardless of where the organization was based. The right to erasure forced architectural changes to systems that had assumed append-only storage. Data Protection Impact Assessments, consent management, and data lineage tracking became engineering requirements, not just legal checkboxes.

The fines were substantial: Amazon received a €746 million penalty in 2021; Meta was fined €1.2 billion in 2023 for EU-US data transfers. Data sovereignty became geopolitically charged — EU courts twice struck down the transatlantic transfer frameworks (the Schrems decisions), and data localization laws proliferated in Russia, China, India, and Vietnam. (For the political dimensions, see Politics and Governance.)

The tension between data utility and data privacy drove innovation in privacy-preserving computation. Four approaches emerged, each with different trade-offs: differential privacy (Apple deployed it in iOS 10 in 2016; the US Census used it in 2020), federated learning (Google’s Gboard, NVIDIA Clara for medical imaging), homomorphic encryption (computing on encrypted data — theoretically ideal but 1,000–10,000x overhead), and secure multi-party computation (banks using it for cross-institutional fraud detection).
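Of the four, secure multi-party computation is the easiest to sketch in a few lines. The toy below uses additive secret sharing so that three parties learn the sum of their private values and nothing else. It is an illustration under strong assumptions (honest participants, no network, a modulus picked arbitrarily for the sketch), not a description of how any bank deploys it; `share` and `secure_sum` are hypothetical names.

```python
import secrets

MODULUS = 2**61 - 1  # arbitrary large modulus chosen for this sketch

def share(value, n_parties):
    """Split value into n random-looking shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_values):
    """Each party shares its value; only per-party subtotals are ever published."""
    n = len(private_values)
    # Row i holds the shares party i hands out, one per party.
    all_shares = [share(v, n) for v in private_values]
    # Party j adds up the j-th shares it received and publishes that subtotal.
    subtotals = [sum(row[j] for row in all_shares) % MODULUS for j in range(n)]
    return sum(subtotals) % MODULUS

fraud_counts = [120, 45, 300]    # one private number per institution
print(secure_sum(fraud_counts))  # 465
```

Any single share or subtotal is a uniformly random number on its own; the true total appears only when all subtotals are combined.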

Privacy regulation did not stop data use — it restructured it. The technical challenge shifted from “how do we collect more data?” to “how do we extract value from data we are not allowed to see?” This constraint, paradoxically, drove some of the most innovative engineering of the decade.

On paper, homomorphic encryption is the strongest guarantee of the four — so why isn’t everyone using it? Predict which of the five dimensions breaks, then add it to the comparison.

Privacy Trade-off Explorer

Select and compare privacy-preserving computation techniques across five dimensions.

Differential Privacy

• Apple iOS emoji/keyboard (2016)
• US Census 2020
• Google Chrome RAPPOR

Mature and widely deployed. The privacy-utility trade-off is tunable via the epsilon parameter — smaller epsilon means stronger privacy but noisier results.
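The epsilon trade-off can be seen in a small sketch of the Laplace mechanism, one standard way to achieve differential privacy for a counting query. This is a textbook construction, not the specific mechanism Apple, the Census Bureau, or RAPPOR uses; the function names and the count of 1000 are assumptions.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale) by inverse-CDF sampling."""
    u = rng.random() - 0.5            # uniform on [-0.5, 0.5)
    while u == -0.5:                  # avoid log(0) on the vanishingly rare edge
        u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, rng):
    """Release a count with noise of scale 1/epsilon (a count has sensitivity 1)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

def mean_abs_error(epsilon, rng, true_count=1000, trials=2000):
    """Average distance between the released and the true count."""
    return sum(abs(private_count(true_count, epsilon, rng) - true_count)
               for _ in range(trials)) / trials

for epsilon in (0.1, 1.0):
    print(epsilon, round(mean_abs_error(epsilon, random.Random(42)), 2))
```

The noise scale is `1 / epsilon`, so the run with epsilon 0.1 should show an average error roughly ten times that of the run with epsilon 1.0.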

Federated Learning

• Google Gboard keyboard predictions
• NVIDIA Clara for medical imaging
• MELLODDY consortium (pharma)

Keeps data distributed across devices or institutions. Communication overhead is the bottleneck. Vulnerable to gradient inversion attacks without additional protections.
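A minimal sketch of the federated idea, in the style of federated averaging: each client fits a one-parameter model `y = w * x` on data that never leaves it, and the server only ever sees and averages the resulting weights. The two clients, their data, the learning rate, and the function names are invented for illustration; real deployments add compression, secure aggregation, and defenses against gradient inversion.

```python
def local_update(w, data, lr=0.01, steps=50):
    """Run a few gradient-descent steps on one client's private (x, y) pairs."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w_global, clients):
    """Clients train locally; the server averages weights by sample count."""
    total = sum(len(data) for data in clients)
    updates = [(local_update(w_global, data), len(data)) for data in clients]
    return sum(w * n for w, n in updates) / total

# Each inner list stays on its own device; both follow y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]

w = 0.0
for _ in range(10):
    w = federated_round(w, clients)
print(round(w, 4))  # 2.0
```

What crosses the wire each round is one number per client, which is why communication, not computation, tends to be the bottleneck as models grow.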

You should have seen the pattern invert: the strongest privacy (5/5) carries the worst overhead (5/5) and the lowest adoption (1/5), while differential privacy wins deployment with a weaker guarantee. Maturity, not strength, decides what ships.

Data Quality: The 80% Problem

Data scientists spent 60–80% of their time cleaning and preparing data. “Garbage in, garbage out” applied with renewed force in the machine learning era, where model performance was directly bounded by data quality. Yet quality was treated as an afterthought — something individual analysts handled ad hoc rather than something organizations engineered systematically.

Andrew Ng’s data-centric AI movement (2021) reframed the problem. The ML community had been obsessively focused on model architecture — more layers, more parameters, more compute. Ng argued the higher-leverage intervention was often improving the data while keeping the model fixed. Landing AI’s competition demonstrated this: participants improved model performance by curating better training data, not by architectural innovation.

Data contracts emerged as the engineering response. Proposed by Andrew Jones (2022–2023), they treated data interfaces like API contracts — explicit, versioned, testable agreements between data producers and consumers. Three levels of contract captured progressively deeper guarantees: schema contracts (structural), SLA contracts (operational), and semantic contracts (meaning). The cultural shift was profound: from treating data as a byproduct of operations to treating it as a product with explicit quality guarantees.
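The three levels can be sketched as a single check. Everything here is hypothetical: the "orders" feed, the field names, the 24-hour freshness threshold, and the `check_contract` helper are illustrative, not a format proposed by Jones or used by any specific tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract for an "orders" feed.
CONTRACT = {
    "schema": {"order_id": str, "amount": float, "currency": str},  # structural
    "max_age": timedelta(hours=24),                                  # operational (SLA)
    "semantic": lambda r: r["amount"] >= 0 and r["currency"] in {"EUR", "USD"},
}

def check_contract(record, produced_at, now, contract=CONTRACT):
    """Return the list of contract violations for one record (empty if none)."""
    violations = []
    for field, expected in contract["schema"].items():
        if not isinstance(record.get(field), expected):
            violations.append(f"schema: {field} should be {expected.__name__}")
    schema_ok = not violations
    if now - produced_at > contract["max_age"]:
        violations.append("sla: record is stale")
    # Meaning is only checked once the structure is known to be sound.
    if schema_ok and not contract["semantic"](record):
        violations.append("semantic: amount/currency rule failed")
    return violations

now = datetime(2024, 1, 2, 12, tzinfo=timezone.utc)
fresh = now - timedelta(hours=1)
order = {"order_id": "A1", "amount": 19.99, "currency": "EUR"}

print(check_contract(order, fresh, now))                      # []
print(check_contract({**order, "amount": -5.0}, fresh, now))  # ['semantic: amount/currency rule failed']
```

Run by the producer before publishing, a check like this moves the failure upstream to the team that can fix it.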

Synthetic data addressed a different quality problem: scarcity. When real data was insufficient, privacy-constrained, or too expensive, organizations could generate it. GANs (StyleGAN for images, CTGAN for tabular data), diffusion models, and LLM-based synthetic data (Self-Instruct-style instruction generation, and the synthetic “textbook” data used to train Microsoft’s Phi models) expanded the toolkit. But a new risk appeared: model collapse, in which models trained on synthetic data generated by other models enter a recursive degradation loop where each generation drifts further from reality.

Andrew Ng’s inversion was profound: instead of asking “what model architecture works best on this data?”, ask “what data would make even a simple model work well?” The data-centric perspective mirrors complexity science — the structure of the inputs matters more than the sophistication of the processing.

The cascade simulator’s pipeline has four hand-offs, each silently shaving a few percent. Start at 95% source quality with no contracts: does decision quality land above 90% or below 85%? Guess, then slide.

Data Quality Dashboard

Explore the quality pyramid's three contract levels, then run the cascade simulator through the pipeline's four hand-offs.


Data scientists spend 60–80% of their time cleaning data. Data contracts aim to shift quality assurance upstream — catching issues at the source rather than the dashboard.

You should have seen roughly 82% with no contracts and 93% with all four. Degradation multiplies, so an error caught at the source is worth more than the same error caught at the dashboard — and model collapse is this multiplication in miniature: a pipeline fed its own degraded output drifts further with each pass.
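The multiplication can be checked by hand. The simulator's per-hand-off rates are not stated, so the sketch below backs them out from the quoted end points under the assumption that every hand-off loses the same fraction; the function names are hypothetical.

```python
def decision_quality(source_quality, retention_per_handoff, handoffs=4):
    """Quality left after each hand-off keeps the same fraction of what it received."""
    return source_quality * retention_per_handoff ** handoffs

def implied_retention(source_quality, final_quality, handoffs=4):
    """Per-hand-off retention that would turn source_quality into final_quality."""
    return (final_quality / source_quality) ** (1 / handoffs)

no_contracts = implied_retention(0.95, 0.82)
all_contracts = implied_retention(0.95, 0.93)
print(round(no_contracts, 3), round(all_contracts, 3))  # 0.964 0.995

# Round trip: the backed-out rate reproduces the quoted figure.
print(round(decision_quality(0.95, no_contracts), 2))   # 0.82
```

Under that equal-loss assumption, a loss of about 3.6% per hand-off is enough to drag 95% down to 82%, while holding each loss to about 0.5% keeps the result at 93%.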

Data Ethics: Bias, Fairness, and Power

ProPublica’s 2016 investigation of COMPAS — a recidivism prediction algorithm used in US courts — demonstrated that the system was nearly twice as likely to falsely flag Black defendants as high-risk compared to white defendants. This was not an isolated failure but the first widely publicized instance of a systemic pattern.

Joy Buolamwini and Timnit Gebru’s Gender Shades study (2018) revealed that commercial gender-classification systems built on facial analysis had error rates of up to 34.7% for darker-skinned women versus at most 0.8% for lighter-skinned men. The bias cascade was clear: biased training data produces biased models, which produce biased decisions, which produce biased outcomes that, in turn, generate more biased training data. The loop is self-reinforcing.

The research community responded with a proliferation of fairness definitions: equalized odds, demographic parity, individual fairness, counterfactual fairness. But researchers proved that many desirable criteria are mathematically incompatible (Chouldechova 2017, Kleinberg et al. 2016). You cannot simultaneously satisfy all reasonable fairness constraints — tradeoffs are unavoidable, and choosing between them is a value judgment, not a technical decision.

Data colonialism (Couldry and Mejias, 2019) named a structural power asymmetry: data extracted from developing countries trained models whose benefits accrued to Silicon Valley. Annotation labor — the human work of labeling training data — was often performed by workers in Kenya, the Philippines, and India for less than $2 per hour, including content moderation work that exposed workers to traumatic material.

Bias in AI is not a bug that can be patched — it is a structural feature of learning from historical data. When that data reflects centuries of discrimination, the model does not correct the bias; it operationalizes it at scale and speed no human institution could match.

The Overview below carries all the fairness argument the rest of the module needs; the Detailed view adds the COMPAS reconciliation and the causal framework — skipping it costs nothing downstream.

Adjustable Depth

Explore algorithmic fairness frameworks and the impossibility results.

Several mathematical definitions of fairness have been proposed: demographic parity (equal selection rates across groups), equalized odds (equal error rates), individual fairness (similar individuals treated similarly), and counterfactual fairness (the outcome would be the same in a counterfactual world where the protected attribute differed).
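Two of these definitions reduce to simple rate comparisons, sketched below on invented data. Each pair is (actual outcome, model prediction) with 1 as the positive class; the two groups and the `rates` helper are hypothetical. Individual and counterfactual fairness need more than a confusion matrix and are not covered here.

```python
def rates(rows):
    """rows: (actual, predicted) pairs. Returns selection rate, FPR and FNR."""
    pos = [p for a, p in rows if a == 1]   # predictions for actual positives
    neg = [p for a, p in rows if a == 0]   # predictions for actual negatives
    return {
        "selection_rate": sum(p for _, p in rows) / len(rows),
        "fpr": sum(neg) / len(neg),
        "fnr": sum(1 - p for p in pos) / len(pos),
    }

group_a = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0)]
group_b = [(1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]
a, b = rates(group_a), rates(group_b)

# Demographic parity asks for equal selection rates.
parity_gap = abs(a["selection_rate"] - b["selection_rate"])
# Equalized odds asks for equal error rates in both directions.
fpr_gap = abs(a["fpr"] - b["fpr"])
fnr_gap = abs(a["fnr"] - b["fnr"])
print(round(parity_gap, 3), round(fpr_gap, 3), round(fnr_gap, 3))  # 0.167 0.083 0.167
```

A gap of zero on one measure says nothing about the others, which is where the impossibility results start.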

The sobering discovery: Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016) independently proved that except in trivial cases, it is mathematically impossible to simultaneously satisfy calibration (predicted probabilities match actual rates) and balance (equal false positive/negative rates across groups). When base rates differ between groups — as they often do due to historical inequality — you must choose which fairness criterion to satisfy.

This means fairness is ultimately a value choice, not a technical optimization. Tools like IBM AI Fairness 360 and Microsoft Fairlearn can measure and partially mitigate bias, but they cannot eliminate the fundamental tradeoff. The question “fair according to whom?” remains irreducibly political.

The impossibility results in algorithmic fairness have precise mathematical formulations. Chouldechova (2017) proved that when base rates differ between groups, a classifier cannot simultaneously achieve predictive parity (equal positive predictive value across groups, a calibration-style criterion), equal false positive rates, and equal false negative rates. Kleinberg et al. (2016) proved a related impossibility involving calibration within groups and balance for the positive and negative classes.

COMPAS illustrates this directly: the system was calibrated — a score of 7 meant the same recidivism probability regardless of race. But because base rates differed (due to historical policing patterns and systemic factors), equal calibration necessarily produced unequal false positive rates. ProPublica focused on error rates; Northpointe focused on calibration. Both were mathematically correct — they were measuring different, incompatible criteria.
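The arithmetic behind that reconciliation follows from the definitions of the confusion-matrix rates. Using positive predictive value (PPV) as the calibration-style criterion, as in Chouldechova (2017), the false positive rate is pinned down by the base rate, the PPV, and the false negative rate. The numbers below are hypothetical and are not the COMPAS figures.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate forced by a group's base rate, PPV and FNR."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

# Same PPV and same FNR for both groups; only the base rate differs.
PPV, FNR = 0.6, 0.35
fpr_high_base = implied_fpr(base_rate=0.5, ppv=PPV, fnr=FNR)
fpr_low_base = implied_fpr(base_rate=0.4, ppv=PPV, fnr=FNR)
print(round(fpr_high_base, 3), round(fpr_low_base, 3))  # 0.433 0.289

# With equal base rates the conflict disappears.
print(implied_fpr(0.5, PPV, FNR) == implied_fpr(0.5, PPV, FNR))  # True
```

Holding PPV and FNR equal across groups leaves the false positive rate no freedom: the group with the higher base rate must have the higher rate of false alarms.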

The causal fairness framework (Kusner et al., 2017) attempts resolution by reasoning about counterfactuals: would the decision have been the same if the protected attribute had been different? This requires a causal model of the data-generating process — a significant modeling assumption.

Intersectionality compounds the challenge. Gender Shades showed that bias was worst not for women or dark-skinned individuals separately, but for dark-skinned women — the intersection. Fairness constraints applied independently along each axis may miss intersectional discrimination entirely.

Data colonialism extends ethics beyond individual fairness to structural power. Couldry and Mejias formalize “data relations” as a new form of extractive capitalism. The annotation labor pipeline exemplifies this: content moderation workers in Kenya, employed by the outsourcing firm Sama on OpenAI’s behalf, earned $1.32–$2.00 per hour to label traumatic content that made ChatGPT safer for Western users.

The deepfake detection arms race illustrates a game-theoretic dynamic: generators improve faster than detectors because generation requires only plausibility while detection requires certainty. Content provenance standards (C2PA) represent the most promising response — proving where content came from rather than detecting whether it is synthetic.

Data as the Lens for Complexity

The data revolution did not merely provide more data — it fundamentally expanded what complexity science could study. The earlier modules discussed complexity in theoretical terms: network structure, bounded rationality. Data made empirical complexity science possible at scale.

Large-scale behavioral data transformed social science. Mobile phone call detail records revealed universal patterns in human mobility (Barabási, González); social-media and wearable data did the same for information diffusion and for health rhythms.

Real-time urban data fed agent-based traffic models, smart city management, and digital twins. Singapore’s Virtual Singapore and Helsinki’s Kalasatama digital twin modeled entire neighborhoods in real time, enabling what-if scenarios for urban planning. (As live, continuously calibrated ABMs, digital twins are developed in full in Module 15.)

COVID-19 was the paradigm case. The Johns Hopkins CSSE dashboard provided real-time global tracking, Google and Apple mobility reports calibrated epidemiological models, and agent-based models used granular contact data for realistic simulations. But the pandemic also revealed the limits — reporting biases, inconsistent definitions, and data delays that undermined real-time claims.

Data brought five transformations to complexity science: from toy models to calibrated models, from hypothesis to empirical validation, from simulation to hybrid approaches, from offline to real-time, and the emergence of phenomena visible only through data instrumentation, such as flash crashes, viral cascades, and coordinated inauthentic behavior.

The data revolution transformed complexity science from a field that could theorize about emergence to one that could observe it in real time — in financial markets, epidemics, cities, and ecosystems. Data did not replace theory; it gave theory something to push against.

The data infrastructure surveyed in this module — the pipelines, privacy frameworks, quality systems, and ethical guardrails — is the foundation on which the AI revolution and modern agent-based modeling are built. The quality of AI’s outputs cannot exceed the quality of its inputs. The ethics embedded in the data propagate through every model trained on it. In complexity terms: the substrate constrains the dynamics.