Some journals, and some reviewers, request a calculation of statistical power based on the observed effect size after a study has been carried out. This is fundamentally flawed.
Statistical power is the probability of rejecting the null hypothesis in a future study. After the study has been carried out, this probability is 100 % (if the null hypothesis was rejected) or 0 % (if the null hypothesis was not rejected).
Before starting a study, it is recommended to calculate the statistical power or sample size. This calculation is based on an expected effect size, or on an effect size regarded as clinically important. The statistical power is the probability that the result of a study with a given number of participants will be statistically significant (1). Alternatively, we calculate the number of participants needed to obtain a given statistical power, for example 90 %, for this effect size. The calculations are performed for a given significance level, usually 0.05.
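As a rough sketch of such a prospective calculation (not taken from this article, and using the standard normal approximation for a two-sample comparison of means with an assumed common standard deviation), the required number of participants per group can be computed like this:

```python
from math import erf, sqrt, ceil

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def norm_quantile(p, lo=-10.0, hi=10.0):
    """Inverse of norm_cdf via bisection (ample precision for this use)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.90):
    """Normal-approximation sample size per group for a two-sided
    two-sample comparison of means: n = 2 * ((z_{1-a/2} + z_power) * sigma / delta)^2."""
    z_alpha = norm_quantile(1.0 - alpha / 2.0)
    z_power = norm_quantile(power)
    return ceil(2.0 * ((z_alpha + z_power) * sigma / delta) ** 2)

# Illustrative (assumed) numbers: detect a difference of 5 units
# when the standard deviation is 10, with 90 % power at alpha = 0.05.
print(sample_size_per_group(delta=5.0, sigma=10.0))  # 85 per group
```

The effect size and standard deviation here are illustrative assumptions; in practice they come from earlier studies or from a clinically important difference, as described above.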
‘Observed power’ and p-value
After the study, it is generally recommended to report an estimate and a 95 % confidence interval for the effect, as well as a p-value. Sometimes, a journal or a reviewer requests a calculation of the ‘observed power’ in addition (2). This means a statistical power calculation based on the observed effect, as if it were the planned effect of a future study, reported as if it gave additional information about the study already performed. This is not only fundamentally flawed, it also adds no information beyond the reported p-value: For every statistical hypothesis test, there is a one-to-one correspondence between the p-value and the ‘observed power’. For a one-sided test with normally distributed outcome and known variance, this correspondence is particularly simple (2). This is illustrated in Figure 1. Here, ‘observed power’ is over 50 % if the p-value is less than 0.05, and ‘observed power’ is under 50 % if p is greater than 0.05.
Figure 1 ‘Observed power’ for a one-sided test with normally distributed outcome variable and known variance, with significance level 0.05. When the p-value equals 0.05, the ‘observed power’ equals 0.5 (2). Study A has higher ‘observed power’ than Study B. This does not imply that Study A provides stronger evidence in favour of the null hypothesis: On the contrary, Study A has lower p-value, hence stronger evidence against the null hypothesis.
‘Observed power’ provides no additional information beyond the p-value. On the contrary, it can be misleading, a fact many researchers seem unaware of. Let A and B be two studies in which the null hypothesis was not rejected, with p-values of 0.10 and 0.25, respectively (Figure 1). Study A has higher ‘observed power’ than Study B. Some may conclude that Study A provides the stronger evidence in favour of the null hypothesis, which was not rejected, but this is a fallacy. Study A has the lower p-value, and hence the stronger evidence against the null hypothesis. This kind of fallacy is called ‘the power approach paradox’.
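The one-to-one correspondence described above can be made concrete. The following sketch (assuming, as in Figure 1, a one-sided z-test with known variance at significance level 0.05) computes ‘observed power’ directly from the p-value, and reproduces the paradox for the two studies A and B:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def norm_quantile(p, lo=-10.0, hi=10.0):
    """Inverse of norm_cdf via bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def observed_power(p_value, alpha=0.05):
    """'Observed power' for a one-sided z-test with known variance:
    the power calculation with the observed effect plugged in as if it
    were the true effect. Note that it depends on the p-value alone."""
    z_obs = norm_quantile(1.0 - p_value)   # observed test statistic
    z_crit = norm_quantile(1.0 - alpha)    # critical value
    return norm_cdf(z_obs - z_crit)

print(observed_power(0.05))  # exactly 0.5 when p equals alpha
print(observed_power(0.10))  # Study A: about 0.36
print(observed_power(0.25))  # Study B: about 0.17
```

Study A has the higher ‘observed power’ (0.36 versus 0.17) precisely because it has the lower p-value, that is, the stronger evidence against the null hypothesis, which is the paradox.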
‘Observed power’ and confidence interval
Other types of retrospective power calculation have been suggested, including this one: Assume a study did not result in rejection of the null hypothesis. The question is then: Given the variability observed in the study, how large would the effect in a future study need to be to achieve a certain statistical power, for example 90 %? However, this is also logically flawed, and can lead to a version of the ‘power approach paradox’, as described in (2). A 95 % confidence interval, on the other hand, indicates the range of effect sizes that are compatible with the observed data.
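For completeness, the retrospective calculation just described can also be sketched. Assuming a one-sided z-test at significance level 0.05, the hypothetical effect giving power 1 − β with observed standard error SE is (z₁₋α + z₁₋β) · SE; the standard error value below is an assumed number for illustration only:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def norm_quantile(p, lo=-10.0, hi=10.0):
    """Inverse of norm_cdf via bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def detectable_effect(se, alpha=0.05, power=0.90):
    """Hypothetical effect size that would give the stated power in a
    future one-sided z-test, given the observed standard error `se`."""
    return (norm_quantile(1.0 - alpha) + norm_quantile(power)) * se

# Assumed observed standard error of 2.0, for illustration:
print(round(detectable_effect(se=2.0), 2))  # about 5.85
```

The computation itself is straightforward; the point of the article is that interpreting its result as evidence about the completed study is flawed, and that a confidence interval answers the relevant question directly.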
Reporting statistical power
It is good practice to report a power or sample size calculation that was performed before the start of the study. This is recommended in the CONSORT Statement (3) for randomised trials, and helps to document that the study was well planned. After the study has been conducted, a confidence interval and a p-value are the appropriate measures of uncertainty. ‘Observed power’ calculated after the study has been carried out is both superfluous and misleading.