In 1925, Ronald Aylmer Fisher published a book called Statistical Methods for Research Workers and, almost incidentally, proposed a threshold. If the probability that your results occurred by chance was less than one in twenty — less than 5 percent — that was, he suggested, a reasonable point at which to start paying attention. He called this level of probability 0.05. He did not call it the boundary between real science and noise. He did not call it a decision rule. He explicitly said it was one convenient marker among many possible ones, context-dependent, never to be applied mechanically. Within a generation, every major scientific journal on earth was applying it mechanically.
What Fisher Actually Built
Fisher was solving a practical problem. Agricultural researchers in the 1920s needed a way to decide whether a new fertilizer was working or whether the yield difference between two plots was just weather, soil variation, and chance. He needed a decision procedure — something that would let experimenters act on data without endless deliberation. The 0.05 threshold was engineered for that context: small samples, repeated field trials, a researcher who would run the same experiment again next season if the result was borderline.
It was not engineered for a psychology researcher who runs one study on 40 undergraduates and publishes the result as a fact about human cognition. It was not engineered for a pharmaceutical trial whose result will inform drug approvals affecting millions of people. It was not engineered for a genomics study testing half a million genetic variants simultaneously. Fisher built a hand tool. Science industrialized it and called it a standard.

The Neyman-Pearson Collision
The story gets stranger. Around the same time Fisher was developing his significance testing framework, two other statisticians — Jerzy Neyman and Egon Pearson — were building a different approach to the same problem. Their framework was based on explicitly specifying two types of error: falsely concluding an effect exists when it doesn’t (Type I), and falsely concluding it doesn’t exist when it does (Type II). You set acceptable rates for both before collecting data. You define what decision you’ll make in advance. It’s a formal decision procedure, not a measure of evidence.
Fisher hated this framework. He found it philosophically confused — the idea that you could pre-specify an “alternative hypothesis” and balance error rates struck him as a corruption of what inference was for. Neyman and Pearson thought Fisher’s approach was incomplete without it. They argued in print, at conferences, and in increasingly hostile correspondence for decades. The scientific community, faced with two incompatible frameworks from the discipline’s founding figures, responded by merging them into a hybrid that neither camp endorsed and that statisticians have been trying to fix ever since.
The hybrid is what most researchers are actually using when they report a p-value. They compute a Fisher p-value and interpret it using Neyman-Pearson decision language — saying a result is “significant” as though that word means something about a pre-specified decision boundary, when it was originally just Fisher’s shorthand for “worth a second look.” The confusion is baked into the method. It connects directly to the larger question Karl Popper spent his career on: what actually separates science from non-science, and whether p-values are even the right tool for drawing that line.
Why 0.05 Specifically
Fisher’s choice of 0.05 was not derived from mathematics. It was a round number that corresponded roughly to two standard deviations from the mean in a normal distribution — convenient for hand-calculation in the pre-computer era. He wrote in 1926 that a value of p around 0.05 “is convenient to take as the limit in judging whether a deviation is to be considered significant or not.” Convenient. Not correct. Not optimal. Not theoretically motivated. Convenient.
Alternative thresholds have been proposed repeatedly. In 2017, a group of 72 statisticians published a paper in Nature Human Behaviour arguing that the threshold for claiming a new discovery should be lowered to 0.005 — one in two hundred — for findings in fields with low prior probabilities. The proposal was reasonable. It was also, implicitly, an admission that the existing threshold had been wrong all along, or at least wrong for the uses to which it had been put. The paper generated significant debate and essentially no change in standard practice.
The Institutional Lock-In
The 0.05 threshold persists not because it’s correct but because changing it would be expensive for everyone currently operating under it. Journal editors have built submission standards around it. Funding agencies have built grant evaluation criteria around it. Tenure committees have built publication-count metrics that implicitly assume it. Entire research careers have been constructed on the basis of results that cross the line. Lowering the threshold retroactively would reclassify a substantial fraction of the published literature as non-significant — which is probably accurate, and which is precisely why it won’t happen through voluntary consensus.
This is not a conspiracy. It is a coordination problem. The 0.05 threshold functions as a standard in the economic sense: like the QWERTY keyboard layout or the gauge of railroad tracks, its value comes partly from universality rather than optimality. Everyone using the same threshold — even a suboptimal one — creates legibility across studies, fields, and decades. Switching to a better standard requires everyone to switch simultaneously, which requires an authority willing to mandate the change, which science’s decentralized structure makes nearly impossible.

What Fisher Said When He Noticed
Fisher spent the later part of his career watching his threshold calcify into dogma and objecting strenuously. He wrote in 1956 that no scientific worker “has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.” He was describing, in polite language, exactly the opposite of how p = 0.05 was being used. He had built a thinking tool. He was watching it replace thinking.
The irony compounds when you consider that Fisher spent decades as one of the more prominent scientific defenders of tobacco industry research — applying his own statistical framework in ways that demonstrated, conclusively, that a sophisticated understanding of statistics is no protection against motivated reasoning. The man who built the line knew how to move it when he wanted to. It’s the same tension that runs through the Collins-Coyne debate on science and belief: technical mastery and motivated conclusions are not mutually exclusive.
You Might Also Like
- The Demarcation Problem: Karl Popper, Falsifiability, and the Boundary Between Science and Pseudoscience
- Cargo Cult Science
- Horizontal Gene Transfer: Why Darwin’s Tree of Life Is Actually a Tangled Web
Sources
- Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd.
- Fisher, R. A. (1956). Statistical Methods and Scientific Inference. Oliver and Boyd.
- Neyman, J., & Pearson, E. S. (1933). “On the Problem of the Most Efficient Tests of Statistical Hypotheses.” Philosophical Transactions of the Royal Society A, 231, 289–337.
- Benjamin, D. J., et al. (2018). “Redefine statistical significance.” Nature Human Behaviour, 2, 6–10. doi.org/10.1038/s41562-017-0189-z
- Wasserstein, R. L., & Lazar, N. A. (2016). “The ASA Statement on p-Values.” The American Statistician, 70(2). doi.org/10.1080/00031305.2016.1154108
- Gigerenzer, G. (2004). “Mindless Statistics.” Journal of Socio-Economics, 33(5), 587–606.







