Scientific Reasoning

Evaluating Research Studies

Develop the skills to critically appraise scientific claims by dissecting sample sizes, placebo controls, statistical versus clinical significance, publication bias, p-hacking, and the limitations of peer review. These competencies will equip you to evaluate health news headlines, pharmaceutical marketing, and policy arguments that invoke "studies show" as their authority.

Intermediate | 18 min | Scientific Reasoning

Context

Why this exercise

Most scientific claims in everyday life reach you not as raw experiments but as published research studies summarized in news headlines. Evaluating these studies requires knowing what to look for: study design, sample size and representativeness, effect size versus statistical significance, conflicts of interest, and the difference between a result that has replicated and one that has not. This exercise drills the diagnostic checklist that distinguishes a confident reader of research from a credulous one, using realistic studies drawn from medicine, psychology, nutrition, and education research.

Before you start

The 21st century has been a difficult one for the credibility of empirical research. John Ioannidis's 2005 paper 'Why Most Published Research Findings Are False' argued, on first-principles statistical grounds, that the combination of low prior probabilities, small sample sizes, flexible analytic choices, and publication bias produces a published literature in which most positive findings are likely to be wrong. The Open Science Collaboration's 2015 replication project attempted to replicate 100 published psychology studies and succeeded in only 36% to 47% of cases, depending on the criterion used to judge success. Replication efforts in cancer biology, economics, and behavioral medicine have produced similarly humbling results. The lesson is not that research is worthless, but that the credibility of any individual finding depends on details that headlines almost always omit.
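The statistical argument can be made concrete with a small calculation. The sketch below uses the simplified, bias-free form of the positive predictive value from Ioannidis's paper, PPV = power × R / (power × R + alpha), where R is the ratio of true to false hypotheses being tested. The function name and the input values are illustrative assumptions, not figures from any particular field.

```python
def positive_predictive_value(prior_odds, power, alpha=0.05):
    """Share of statistically significant findings that reflect a true effect.

    prior_odds: ratio of true to false hypotheses under test (R).
    power: probability that a study detects a true effect (1 - beta).
    alpha: probability of a significant result when there is no effect.
    This simplified form ignores bias and multiple competing teams.
    """
    true_positives = power * prior_odds
    false_positives = alpha
    return true_positives / (true_positives + false_positives)


# Illustrative assumption: 1 true hypothesis for every 10 false ones.
low_power = positive_predictive_value(prior_odds=0.1, power=0.2)
high_power = positive_predictive_value(prior_odds=0.1, power=0.8)

print(f"PPV with 20% power: {low_power:.2f}")   # about 0.29
print(f"PPV with 80% power: {high_power:.2f}")  # about 0.62
```

Under these assumed inputs, fewer than a third of significant findings from the low-powered studies would be true, which is the sense in which "most" positive findings can be wrong even when every study is analyzed honestly.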

Several specific design features carry most of the credibility weight. Sample size determines statistical power: small studies are systematically more likely to produce both false negatives (missing real effects) and overestimated effect sizes when they do find effects. Preregistration of hypotheses and analysis plans prevents post-hoc reanalysis from inflating the apparent significance of findings. Randomization and blinding control for confounding and expectation effects. Effect size matters separately from statistical significance: a trivially small effect can be statistically significant in a large sample, and a clinically important effect can fail to reach significance in a small one. Independent replication, especially preregistered direct replication by different research teams, is the strongest evidence that an effect is real.
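The gap between effect size and statistical significance is easy to see numerically. The following is a minimal sketch, assuming a two-group comparison of means with a known standard deviation of 1 in each group (a z-test rather than the t-test a real study would usually report). The function name and the two scenarios are hypothetical.

```python
import math


def two_sided_p_value(effect_size_d, n_per_group):
    """Two-sided p-value for a two-group z-test.

    effect_size_d: true difference in means, in standard deviation units.
    n_per_group: number of participants in each group.
    Assumes the observed difference equals the true difference and that
    each group has a known standard deviation of 1.
    """
    # Standard error of the difference in means is sqrt(2 / n).
    z = effect_size_d * math.sqrt(n_per_group / 2)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# Trivial effect, huge sample: statistically significant.
tiny_effect_p = two_sided_p_value(effect_size_d=0.02, n_per_group=100_000)

# Moderate effect, small sample: not statistically significant.
moderate_effect_p = two_sided_p_value(effect_size_d=0.5, n_per_group=10)

print(f"d = 0.02, n = 100,000 per group: p = {tiny_effect_p:.2g}")
print(f"d = 0.50, n = 10 per group:      p = {moderate_effect_p:.2g}")
```

The first scenario clears the p < 0.05 threshold despite an effect too small to matter in practice, while the second misses it despite an effect twenty-five times larger. The p-value alone cannot tell you which result deserves attention.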

Several specific failure modes recur often enough to have earned names. P-hacking is the practice of trying many analyses and reporting only the significant ones, which pushes the real false-positive rate well above the nominal 5%. HARKing (Hypothesizing After the Results are Known) is the post-hoc creation of a hypothesis that fits the data, then presenting it as if it had been predicted in advance. The garden of forking paths, described by Andrew Gelman, captures the more subtle problem that seemingly innocuous analytic choices can depend on the data and effectively cherry-pick the result. Publication bias means that null results often go unpublished, leaving the published literature systematically biased toward positive findings. As you work the scenarios, practice asking about preregistration, replication, sample size, effect size, and conflicts of interest before forming an opinion about a study's claim. For deeper treatment, see Scientific Thinking.
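To get a feel for how quickly p-hacking inflates the false-positive rate, consider the chance of at least one significant result when several analyses are run on data with no real effect. This sketch assumes the analyses are statistically independent, which is a simplification: real analyses of the same dataset are usually correlated, so the actual inflation is typically smaller but still substantial. The function name is hypothetical.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent analyses
    is significant at level alpha when no real effect exists."""
    return 1 - (1 - alpha) ** num_tests


for num_tests in (1, 5, 20):
    rate = chance_of_false_positive(num_tests)
    print(f"{num_tests:>2} analyses: {rate:.0%} chance of a false positive")
```

With 20 independent looks at the data, the chance of finding something "significant" by luck alone is roughly 64%, not 5%.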

Question 1 of 6

A supplement brand's Instagram ad states: "Clinically proven! In a study at a leading university, participants who took NeuroFocus showed a 22% improvement in sustained attention scores after just two weeks (n = 16, p = 0.04)." You find the paper and discover it had no placebo control, used an unvalidated attention measure created by the company, and was funded entirely by the manufacturer. What is the most critical problem?