Scientific Reasoning

Correlation vs Causation

Sharpen your ability to distinguish genuine causal relationships from misleading statistical associations by analyzing scenarios from epidemiology, economics, education, and public health. You will learn to identify confounding variables, reverse causation, collider bias, and ecological fallacies that routinely lead policymakers, journalists, and even researchers to draw invalid conclusions from correlational data.

Beginner · 15 min · Scientific Reasoning

Context

Why this exercise

The conflation of correlation with causation may be the single most consequential reasoning error in modern life. It drives bad medical decisions, bad business strategy, bad public policy, and bad personal investments. The good news is that the discipline of distinguishing the two is teachable and reduces to a small set of procedural moves: identifying which kinds of explanation can produce the observed correlation, ruling out the most plausible alternatives, and recognizing when only a controlled experiment can settle the question. This exercise drills those moves on scenarios drawn from health, business, education, and everyday life.

Before you start

The intellectual foundation here is the work of Sir Austin Bradford Hill, whose 1965 paper 'The Environment and Disease: Association or Causation?' laid out the criteria still used in epidemiology and biostatistics for inferring causation from observational data. Hill's criteria — strength of association, consistency, specificity, temporality, biological gradient, plausibility, coherence, experimental evidence, and analogy — do not amount to a proof of causation, but they let an analyst grade the credibility of a causal claim. The modern refinement of this framework comes from Judea Pearl's work on causal inference, which uses directed acyclic graphs (DAGs) and the do-calculus to clarify exactly which observational designs can and cannot identify causal effects.

When you observe a correlation between X and Y, four explanations are always on the table. First, X may cause Y (the intuitive interpretation). Second, Y may cause X (reverse causation: the observation that successful people wake early may run that way because demanding jobs require early starts, not because waking early produces success). Third, some common cause Z may produce both X and Y without any causal link between them (confounding: ice cream sales correlate with drowning rates because both rise in hot weather). Fourth, the correlation may be spurious, produced by chance, by selection bias in the sample, or by data-dredging across many comparisons. A confident causal claim from an observational study requires ruling out the last three, which is why randomized controlled trials remain the gold standard: random assignment breaks the connection between X and any confounder Z and guarantees that the assignment precedes the outcome, so in a sufficiently large trial the effect of X on Y is the only systematic explanation left.
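The confounding case can be made concrete with a small simulation. The sketch below is illustrative only: the variable names, the unit-variance Gaussian noise, and the sample size are assumptions chosen for the example, not data from any study. It builds a world in which Z drives both X and Y while X has no effect on Y at all, then compares the observational correlation with the one seen when X is assigned at random.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_confounding(n=20_000, seed=0):
    """Return (observational r, randomized r) for a toy confounded world.

    Assumed toy model: Z (say, temperature) raises both X (ice cream sales)
    and Y (drownings). X never appears in the formula for Y, so the true
    causal effect of X on Y is exactly zero.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]

    # Observational world: X follows Z, so X and Y share a common cause.
    x_observed = [zi + rng.gauss(0, 1) for zi in z]
    y_observed = [zi + rng.gauss(0, 1) for zi in z]

    # Randomized world: X is assigned independently of Z.
    x_assigned = [rng.gauss(0, 1) for _ in range(n)]
    y_after = [zi + rng.gauss(0, 1) for zi in z]

    return pearson(x_observed, y_observed), pearson(x_assigned, y_after)


if __name__ == "__main__":
    r_obs, r_rand = simulate_confounding()
    print(f"observational r = {r_obs:.3f}")
    print(f"randomized r    = {r_rand:.3f}")
```

Under these assumptions the observational correlation should come out clearly positive even though X does nothing to Y, while the randomized correlation should sit near zero. That gap is the reason random assignment is so valuable.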

Several specific patterns recur often enough to be worth memorizing. 'Children who play chess score higher on math tests' is observational and cannot distinguish chess causing math ability from a shared trait (analytical aptitude) producing both. 'Companies that hire expensive consultants grow faster' suffers from selection bias: the kinds of companies that hire expensive consultants already differ from those that do not. 'Cities that installed speed cameras saw fewer traffic fatalities' is consistent with the cameras working, but also with regression to the mean (cities tend to install cameras after a spike in fatalities, and an unusually bad year is likely to be followed by a more typical one regardless). As you work the scenarios, practice listing all four candidate explanations before accepting the causal interpretation, and notice when only a controlled experiment would resolve the question. For broader treatment, see Scientific Thinking.
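The speed-camera pattern is easy to reproduce in a toy simulation. The following is a minimal sketch under stated assumptions: every city has the same underlying fatality rate, yearly counts differ only by Gaussian noise, and no intervention has any effect. The function name and all numbers are hypothetical and chosen purely for illustration.

```python
import random


def regression_to_mean(n_cities=5000, top_fraction=0.1, seed=1):
    """Return (mean before, mean after) for the cities with the worst year 1.

    Assumed toy model: every city has the same true rate of 50 fatalities
    per year, and each year's count is that rate plus independent noise.
    No cameras are installed anywhere and nothing changes between years.
    """
    rng = random.Random(seed)
    true_rate = 50.0
    noise_sd = 10.0
    year1 = [true_rate + rng.gauss(0, noise_sd) for _ in range(n_cities)]
    year2 = [true_rate + rng.gauss(0, noise_sd) for _ in range(n_cities)]

    # Select the cities that look worst in year 1, mimicking a policy of
    # installing cameras only where fatalities recently spiked.
    k = int(n_cities * top_fraction)
    worst = sorted(range(n_cities), key=lambda i: year1[i], reverse=True)[:k]

    before = sum(year1[i] for i in worst) / k
    after = sum(year2[i] for i in worst) / k
    return before, after


if __name__ == "__main__":
    before, after = regression_to_mean()
    print(f"selected cities, year 1: {before:.1f}")
    print(f"selected cities, year 2: {after:.1f}")
```

In this sketch the selected cities should show a sharp drop from year 1 to year 2 even though nothing was done to them. A before-and-after comparison restricted to sites chosen for their bad year would credit that drop to whatever intervention happened in between, which is why a comparison group of similar untreated sites matters.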

Question 1 of 6 · 17% Complete

A widely shared health article reports: "A 12-year longitudinal study of 48,000 adults found that those who ate breakfast daily had a 23% lower risk of developing type 2 diabetes (HR = 0.77, 95% CI: 0.71-0.84, p < 0.001)." A lifestyle influencer cites this to argue that eating breakfast prevents diabetes. What is the most likely reason this causal conclusion is wrong?