P Less Than .05

In 1925, Ronald Fisher published a textbook called Statistical Methods for Research Workers. Aimed at biologists and agricultural scientists, it was a practical guide to applying statistical tests, a user manual for data analysis. Somewhere in the middle of that manual, Fisher remarked of the point where p equals 0.05 that it is “convenient to take this point as a limit in judging whether a deviation is to be considered significant or not.” It was a suggestion. A working rule of thumb, offered in the way a carpenter might recommend a particular tolerance for fit: not as law, but as reasonable starting practice for someone who needs to get on with the job.

One hundred years later, that convenience threshold controls what gets published in scientific journals, which drug trials succeed, which psychological findings enter the textbooks, and which public health interventions receive regulatory approval. A number chosen arbitrarily for its tidiness became the gatekeeping criterion for what counts as knowledge.

What a P-Value Actually Measures

The p-value is a conditional probability. Specifically, it is the probability of observing data at least as extreme as what you observed, assuming the null hypothesis is true. That definition is not intuitive, and the gap between what it says and what most people hear when they encounter it is where most of the damage occurs.
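To make the definition concrete, here is a minimal sketch in Python that computes an exact two-sided p-value for a coin-flip experiment. The fair-coin null hypothesis, the helper name `binomial_two_sided_p`, and the example counts are illustrative assumptions, not anything taken from Fisher's text.

```python
from math import comb


def binomial_two_sided_p(heads: int, flips: int) -> float:
    """Exact two-sided p-value under the null hypothesis of a fair coin.

    Hypothetical helper for illustration: it adds up the null probability of
    every outcome at least as far from the expected count as the observed one.
    """
    expected = flips / 2
    observed_gap = abs(heads - expected)
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_gap
    )
    # Under a fair coin, each of the 2**flips sequences is equally likely.
    return extreme_outcomes / 2 ** flips


if __name__ == "__main__":
    for heads in (55, 60, 61):
        print(heads, round(binomial_two_sided_p(heads, 100), 4))
```

Under these assumptions, 60 heads in 100 flips gives a p-value a little above 0.05 and 61 heads gives one below it. A single extra head moves the result across the line, although the evidence has barely changed.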

What p < 0.05 does not mean: that your result is probably true. That the null hypothesis is probably false. That the effect size is meaningful. That the finding will replicate. What it means, precisely, is that if there were no real effect and you ran this experiment many times, results at least as extreme as yours would appear by chance fewer than one time in twenty. That’s it. It is a statement about data in relation to a hypothesis, not a statement about the hypothesis itself. Fisher understood this. The researchers who canonized his threshold often did not.
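The "one time in twenty" reading can be checked with a small simulation. The sketch below is an assumption-laden illustration: it repeatedly compares two groups drawn from the same normal distribution (so the null hypothesis is true by construction), applies a z-test with a known standard deviation of 1, and counts how often p falls below 0.05. The function name, group size, and number of runs are all hypothetical choices.

```python
import random
from statistics import NormalDist, fmean


def null_false_positive_rate(runs: int = 5000, n: int = 20,
                             alpha: float = 0.05, seed: int = 1) -> float:
    """Share of no-effect experiments that still reach p < alpha."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    standard_error = (2 / n) ** 0.5  # SE of a difference in means when sd = 1
    hits = 0
    for _ in range(runs):
        # Both groups come from the same distribution, so the null is true.
        group_a = [rng.gauss(0, 1) for _ in range(n)]
        group_b = [rng.gauss(0, 1) for _ in range(n)]
        z = (fmean(group_a) - fmean(group_b)) / standard_error
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
        if p_value < alpha:
            hits += 1
    return hits / runs


if __name__ == "__main__":
    print(null_false_positive_rate())
```

The returned rate should land close to 0.05. That is all the threshold guarantees: how often a true null produces a "significant" result. It says nothing about how often a significant result reflects a real effect.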

How a Guideline Became a Dogma

Fisher was not alone in building the statistical architecture of twentieth-century science. Jerzy Neyman and Egon Pearson developed a competing framework — the Neyman-Pearson approach — that formalized hypothesis testing differently, with explicit attention to error rates and the distinction between Type I errors (false positives) and Type II errors (false negatives). Fisher and Neyman had a famous professional animosity, and their frameworks were philosophically incompatible. What actually spread through scientific practice was a muddle of both, stripped of each system’s theoretical foundations and merged into a ritual: calculate p, compare to 0.05, declare or withhold significance.

This hybrid procedure had one significant practical advantage: it was legible to journal editors and grant reviewers who weren’t statisticians. A binary threshold produces a binary decision, and binary decisions are easy to administer at scale. The consequence was that an entire infrastructure of scientific judgment — peer review, funding allocation, regulatory approval — organized itself around a single number that its originator never intended to bear that weight. William James, writing decades before Fisher, was already suspicious of systems that reduced complex judgments to single convenient measurements — a thread explored in The Writings of William James.

The Structural Incentives

Once the threshold became institutional, behavior adapted to it. Careers depended on publication, publication depended on significance, and significance meant p < 0.05. The result was a set of well-documented distortions: p-hacking (running multiple analyses until one crosses the threshold), HARKing (Hypothesizing After Results are Known — presenting exploratory findings as if they were confirmatory), selective reporting of outcomes, and publication bias toward positive results. None of these required conscious fraud. They emerged naturally from a system that rewarded crossing a line.
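The arithmetic behind p-hacking is short enough to sketch. Assuming, purely for illustration, that each analysis is an independent test of a true null hypothesis, the chance that at least one of k analyses crosses the threshold is 1 - (1 - 0.05)^k. The function names below are hypothetical, and the simulation relies on the fact that p-values from a continuous test statistic are uniformly distributed when the null is true.

```python
import random


def chance_of_false_positive(analyses: int, alpha: float = 0.05) -> float:
    """Probability that at least one of several independent null analyses has p < alpha."""
    return 1 - (1 - alpha) ** analyses


def simulated_chance(analyses: int, alpha: float = 0.05,
                     runs: int = 20000, seed: int = 1) -> float:
    """Monte Carlo check: under the null, each p-value is uniform on [0, 1)."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(analyses))
        for _ in range(runs)
    )
    return hits / runs


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k, round(chance_of_false_positive(k), 3), round(simulated_chance(k), 3))
```

With one analysis the chance is 5 percent; with ten it is about 40 percent; with twenty, about 64 percent. Real analyses of a single dataset are correlated rather than independent, so the actual inflation will differ from these figures, but the direction is the same.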

The replication crisis that became visible in psychology in the early 2010s, and subsequently in medicine, economics, and nutrition science, made the structural problem undeniable. A large-scale replication effort published in 2015 re-ran 100 published psychology studies and found that fewer than half held up when independent researchers attempted to reproduce them. Nearly all of the original studies had reported results that met the p < 0.05 criterion.

The ASA Statement and What Came After

In 2016, the American Statistical Association took the unusual step of issuing a formal statement on p-values — six principles clarifying what the statistic does and doesn’t establish. The statement was direct: p-values do not measure the probability that a hypothesis is true; they do not measure the size or importance of an effect; they do not constitute a complete basis for scientific inference. The ASA was not abolishing the p-value but was explicitly rejecting the binary “significant or not” framework that had grown around it.

What the statement could not do was redesign the infrastructure. Journals, granting agencies, regulatory bodies, and university hiring committees all continued operating with thresholds because thresholds are administratively convenient. A 2019 editorial in The American Statistician, introducing a special issue on the subject, took a harder position, calling for a move “beyond p < 0.05.” The scientific community’s response was divided. Proposals for alternative thresholds (0.005 instead of 0.05), confidence intervals without dichotomization, Bayesian approaches, and pre-registration requirements all circulate with genuine advocates. None has achieved institutional adoption comparable to that of the 0.05 threshold. The broader dynamic, in which scientists set aside competing paradigms in favor of an administratively convenient consensus, mirrors what Kuhn’s The Structure of Scientific Revolutions describes, even if Kuhn’s examples came largely from the physical sciences rather than statistics.

The Deeper Problem

The technical critiques of p < 0.05 are correct but insufficient, because the root problem isn’t statistical — it’s institutional. Any bright-line criterion applied at scale will generate gaming behavior, because bright lines determine resource allocation and bright lines can be crossed. Replace p < 0.05 with p < 0.005 and you get smaller false positive rates and larger sample-size requirements, but the underlying incentive structure remains. Researchers still need to publish. Journals still preferentially accept positive results. Novelty still commands more prestige than replication.
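The sample-size cost of a stricter threshold can be estimated with a standard normal approximation for a two-sided comparison of two group means: n per group is roughly 2 * ((z_alpha + z_power) / d)^2, where d is the standardized effect size. The sketch below is illustrative only; the function name, the 80 percent power target, and the effect size of 0.5 are assumptions chosen for the example.

```python
from math import ceil
from statistics import NormalDist


def per_group_sample_size(effect_size: float, alpha: float, power: float = 0.8) -> int:
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d) ** 2,
    where d is the standardized effect size (difference in means / sd).
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # critical value for the two-sided threshold
    z_power = z(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    for alpha in (0.05, 0.005):
        print(alpha, per_group_sample_size(0.5, alpha))
```

Under these assumptions the requirement rises from about 63 participants per group at 0.05 to about 107 at 0.005, roughly 70 percent more data for the same study. The line moves; the incentive to cross it does not.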

Fisher’s original intent was something more modest: a heuristic for deciding whether an experimental result was worth pursuing further. Not a verdict, but a prompt. The transformation of that prompt into a verdict was not a statistical error — it was a sociological one. Science industrialized a shortcut and forgot the shortcut was provisional. What remained was a number, 0.05, bearing the entire weight of an institution’s claim to produce knowledge — doing a job Fisher never asked it to do, and never could.
