Statistical Fallacies

Identify the most dangerous statistical fallacies that lead to wrongful convictions, failed policies, wasted research funding, and medical harm. These advanced scenarios test whether you can spot subtle errors involving Simpson's paradox, the prosecutor's fallacy, multiple comparisons, selection bias, and expected value traps that regularly fool judges, journalists, scientists, and executives.

Advanced · 20 min · Probability & Statistics

Context

Why this exercise

Statistical fallacies are the systematic ways that valid-looking analyses produce wrong conclusions. They show up in published research, in policy debates, in courtroom testimony, and in business analytics — often invoked with confidence by people who can recite the relevant formulas but have missed the structural issue that invalidates them. This exercise drills the most consequential advanced patterns: Simpson's paradox, the regression-to-the-mean illusion, survivorship bias, multiple comparisons, p-hacking, and the misinterpretation of confidence intervals and significance tests that has driven the replication crisis in psychology and biomedical research.

Before you start

The replication crisis, documented through systematic meta-research by John Ioannidis ('Why Most Published Research Findings Are False', 2005), Brian Nosek and the Open Science Collaboration, and others, has shown that a substantial fraction of published statistical findings fail to replicate. The causes are not usually fraud; they are the cumulative effect of small statistical errors that look defensible in isolation. P-hacking (running many analyses and reporting only the significant ones) turns a 5% nominal false-positive rate into a much higher actual rate. Multiple-comparison problems produce confident conclusions from noise: running 20 hypothesis tests at a 0.05 significance threshold when every null hypothesis is true will yield, on average, one 'significant' result by chance alone. Optional stopping (continuing to collect data until significance is reached) inflates both the false-positive rate and the reported effect sizes. And the garden of forking paths described by Andrew Gelman (making seemingly innocuous analytic choices that depend on the data) produces results that look pre-specified but are effectively cherry-picked.
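The arithmetic behind the multiple-comparisons claim can be sketched in a few lines of Python. This is a minimal illustration, not a full treatment: it assumes that every null hypothesis is true and, for the "at least one false positive" figure, that the tests are independent. The function names are hypothetical and chosen only for this example.

```python
def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Average number of 'significant' results when every null is true.

    Holds whether or not the tests are independent (linearity of expectation).
    """
    return num_tests * alpha


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across all tests.

    Assumption: the tests are independent and every null hypothesis is true.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 5, 20, 100):
    print(
        f"{m:>3} tests: expected false positives = "
        f"{expected_false_positives(m):.2f}, "
        f"P(at least one) = {familywise_error_rate(m):.3f}"
    )
```

Under these assumptions, 20 tests give an expected count of one false positive and roughly a 64% chance of at least one, which is why a single 'significant' result among many unreported tests carries little evidential weight.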

Simpson's paradox is perhaps the most counterintuitive statistical phenomenon. A pattern that holds in every subgroup of the data can reverse direction when the subgroups are aggregated. The 1973 UC Berkeley graduate-admissions case is the classic example: women had a lower acceptance rate than men in aggregate, yet most individual departments showed no bias against women, and several admitted women at a higher rate than men. The apparent aggregate bias arose because women applied disproportionately to more selective departments. Recognizing Simpson's paradox requires holding the disaggregated and aggregated views simultaneously and asking which one answers the question being asked. Regression to the mean is the related phenomenon that extreme observations tend to be followed by less extreme ones for purely statistical reasons, which means an intervention applied to cases selected for being extreme will tend to appear to work even if it has no effect.
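The reversal is easier to believe once the counts are written down. The sketch below uses invented numbers for two hypothetical departments; they are NOT the Berkeley figures and are chosen only so that women are admitted at a higher rate in each department yet at a lower rate overall. The names `APPLICATIONS`, `department_rates`, and `aggregate_rates` are illustrative.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int]  # (admitted, applied)

# Hypothetical counts, invented for illustration only.
APPLICATIONS: Dict[str, Dict[str, Counts]] = {
    "less_selective_dept": {"men": (480, 800), "women": (65, 100)},
    "more_selective_dept": {"men": (40, 200), "women": (225, 900)},
}


def department_rates(data: Dict[str, Dict[str, Counts]]) -> Dict[str, Dict[str, float]]:
    """Admission rate per group within each department."""
    return {
        dept: {group: admitted / applied for group, (admitted, applied) in groups.items()}
        for dept, groups in data.items()
    }


def aggregate_rates(data: Dict[str, Dict[str, Counts]]) -> Dict[str, float]:
    """Admission rate per group after pooling all departments."""
    totals: Dict[str, Counts] = {}
    for groups in data.values():
        for group, (admitted, applied) in groups.items():
            prev_admitted, prev_applied = totals.get(group, (0, 0))
            totals[group] = (prev_admitted + admitted, prev_applied + applied)
    return {group: admitted / applied for group, (admitted, applied) in totals.items()}


for dept, rates in department_rates(APPLICATIONS).items():
    print(dept, {group: f"{rate:.0%}" for group, rate in rates.items()})
print("pooled", {group: f"{rate:.0%}" for group, rate in aggregate_rates(APPLICATIONS).items()})
```

In this made-up data most women apply to the more selective department, so the pooled rate for women is dominated by the department where everyone's chances are low. The within-department comparison and the pooled comparison answer different questions.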

Survivorship bias deserves its own attention because it underlies confidently wrong reasoning in finance, history, and self-help. Abraham Wald's famous WWII analysis illustrates the structure: examining returning bombers, he recommended armor in the areas without bullet holes, because planes hit in those areas had not survived to be examined. Studies of successful companies, successful traders, and successful entrepreneurs all suffer from survivorship bias when they fail to compare against the much larger sample of failures. As you work the scenarios, practice asking which population the data were drawn from, whether disaggregation would change the pattern, whether multiple comparisons have inflated the false-positive rate, and whether the cited statistic answers the question that was actually asked. For broader treatment of how to evaluate evidence and avoid these traps, see Scientific Thinking and Cognitive Biases: Memory & Self.
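A small simulation can make the survivorship structure concrete. The sketch below is a toy model under stated assumptions, not a description of any real market: each hypothetical fund has zero average skill, yearly returns are drawn from a normal distribution with a 15% standard deviation, and a fund closes as soon as its value drops below 70% of its starting value. All parameter values and names are invented for illustration.

```python
import random
import statistics
from typing import Dict


def simulate_funds(num_funds: int = 10_000, years: int = 10, seed: int = 0) -> Dict[str, float]:
    """Compare the average outcome of all funds with that of survivors only."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    all_final = []
    survivor_final = []
    for _ in range(num_funds):
        value = 1.0
        survived = True
        for _ in range(years):
            # Assumption: no skill, so the mean yearly return is zero.
            value *= 1.0 + rng.gauss(0.0, 0.15)
            if value < 0.7:  # assumed closure threshold
                survived = False
                break
        all_final.append(value)
        if survived:
            survivor_final.append(value)
    return {
        "share_surviving": len(survivor_final) / num_funds,
        "mean_all_funds": statistics.mean(all_final),
        "mean_survivors": statistics.mean(survivor_final),
    }


result = simulate_funds()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Because closed funds are by construction the ones that ended below the threshold, the survivors' average is higher than the average over all funds even though no fund has any skill. An analysis that only sees the survivors would attribute that gap to ability.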

Question 1 of 6

A kidney stone treatment study finds: Treatment A is more effective than Treatment B for small stones (93% vs. 87%), AND Treatment A is more effective for large stones (73% vs. 69%). But when all patients are combined, Treatment B appears more effective overall (83% vs. 78%). The hospital board, looking only at the combined data, chooses Treatment B for all patients. What went wrong?