Why exactly 0.05?
Credit for the choice of a significance level of 5 % is ascribed to the statistician Ronald A. Fisher (1890 – 1962). Fisher was one of the founders of modern research methodology and statistical analysis. His methods were developed for use in agricultural research and genetics, and have since been applied in a number of disciplines. He is best known for developing analysis of variance and randomised studies (2).
In 1925 he published the book Statistical methods for research workers, in which he writes that a significance level of 5 % is an appropriate choice (3): «The value for which P = .05, or 1 in 20, …; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not» (3, p. 45).
This may leave the impression that the threshold of p ≤ 0.05, and the weight it has carried in later research, stems from Ronald A. Fisher picking a significance level of 5 % more or less at random. If he had chosen 2 %, 7 % or 10 % instead, would medical research and clinical practice have looked different today? Is it true that results and conclusions from large parts of medical research depend on what number a statistician had in mind nearly one hundred years ago?
Although Ronald A. Fisher undoubtedly has had a great impact on the development of trial methods and statistics, it would be simplistic to assign him all the credit (or blame) for this choice of 5 %. Nor is it correct that he chose this level entirely at random; other statisticians were using similar values (4).
Cowles and Davis (5) investigated why Fisher chose 5 % as a significance level. They believe that he was only using what was already an established concept. Karl Pearson (1857 – 1936), another founder of modern statistics, developed methods for assessing how well data fit a mathematical probability distribution, which is part of the basis for the frequently used chi-square test of cross-tabulations. He claimed that with a probability of 10 % (i.e. p = 0.1) it is not unlikely that the observed data are random, and further that with a probability of 1 % (i.e. p = 0.01) it is highly unlikely that the observed data can be due to random variations. A suitable point between these extremes is 5 %. William Gosset (1876 – 1937), who developed the t-test, also suggested 5 % as a natural choice of significance level, although he expressed this in other statistical-mathematical terms (4, 5).
Is there anything special about a probability of 5 %? Inspired by their historical investigations of recommended significance levels, Cowles and Davis explored whether there is an intuitive and natural significance level (6). How rarely must an event occur in relation to what is expected before we recognise that the original assumption, i.e. the null hypothesis, is untrue? They provide a simple example. You and your colleague toss a coin to determine who will buy coffee for lunch, but day after day you keep losing. How many days will you be prepared to continue buying coffee for your colleague before starting to suspect that your losses are not coincidental? I would assume that many will be prepared to accept this for four (p = 0.0625) or five (p = 0.03125) days, but I believe that few would accept that only coincidence is involved if they lose ten days in a row and have to pay for the coffee (p < 0.001).
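The p-values in the coffee example follow from multiplying the probability of a single loss by itself once per day. The short Python sketch below reproduces them, assuming a fair coin and independent tosses; the function name `losing_streak_probability` is hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def losing_streak_probability(n_losses, p_loss=Fraction(1, 2)):
    """Probability of losing n_losses independent tosses in a row.

    Assumes each toss is independent and lost with probability p_loss
    (one half for a fair coin, which is the null hypothesis here).
    """
    return p_loss ** n_losses


# Days of consecutive losses mentioned in the text.
for days in (4, 5, 10):
    p = losing_streak_probability(days)
    print(f"{days} losses in a row: p = {float(p)}")
```

Exact fractions are used so that the results are not affected by floating-point rounding; ten losses in a row gives 1/1024, which is just under 0.001.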
To investigate this systematically, they developed a psychological experiment (6). Volunteers participated in a gambling game. Three cups were placed in front of them, and they were told that one of them concealed a small red button. If they chose the right cup, they would win some money. This gamble was repeated until the participants wanted to stop.
For the participants, the intuitive null hypothesis is that they have a probability of one-third of guessing the correct cup in each round of the game. The participants were unaware, however, that none of the cups concealed a red button, and that they would thus lose every time. In other words, the intuitive null hypothesis was untrue. The objective of the experiment was to investigate how many times the participants would repeat the game before starting to suspect that something was wrong, meaning that they would doubt the null hypothesis. Under the null hypothesis, the probability of losing a single round is two-thirds, so the probability of losing n rounds in a row is (2/3)^n. More than half of the participants were suspicious after six rounds of repeated losses (p = 0.088) and nearly 90 % after eight rounds (p = 0.039). The experiment indicates that many people will naturally and intuitively choose a significance level of approximately 5 %.
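The p-values quoted for the cup game can be checked in the same way. The sketch below assumes that rounds are independent and that, under the null hypothesis, each round is lost with probability two-thirds; the helper names `prob_all_losses` and `first_round_below` are hypothetical and not taken from the original study.

```python
def prob_all_losses(rounds, p_loss=2 / 3):
    """Probability of losing every one of `rounds` independent rounds."""
    return p_loss ** rounds


def first_round_below(alpha, p_loss=2 / 3):
    """Smallest number of straight losses whose probability is below alpha."""
    if not (0 < p_loss < 1 and 0 < alpha < 1):
        raise ValueError("p_loss and alpha must lie strictly between 0 and 1")
    rounds = 1
    while prob_all_losses(rounds, p_loss) >= alpha:
        rounds += 1
    return rounds


print(round(prob_all_losses(6), 3))   # six straight losses with three cups
print(round(prob_all_losses(8), 3))   # eight straight losses with three cups
print(first_round_below(0.05))        # three cups: first streak below 5 %
print(first_round_below(0.05, 0.5))   # fair coin: first streak below 5 %
```

Under these assumptions, the probability of an unbroken losing streak first drops below 5 % at the eighth round of the cup game and at the fifth toss of a fair coin.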