Reducing the P-Value for Statistical Significance to Improve the Quality of Research?
Statistical significance set at P <.05 results in high rates of false-positives, "even in the absence of other experimental, procedural and reporting problems."
In a controversial and divisive article posted July 22, 2017, on the preprint server PsyArXiv, a group of 72 well-established researchers proposed to improve "statistical standards of evidence" by lowering the threshold for statistical significance from P <.05 to P <.005 in the biomedical and social sciences.1 The group, drawn from the same number of institutions across the United States, Europe, Canada, and Australia and from departments as diverse as psychology, statistics, social sciences, and economics, was led by Daniel J. Benjamin, PhD, of the Center for Economic and Social Research and Department of Economics, University of Southern California, Los Angeles. The article was published in September 2017 as a comment in Nature Human Behaviour.2
Statistical significance set at P <.05 results in high rates of false-positives, note the authors, "even in the absence of other experimental, procedural and reporting problems," and may underlie commonly encountered issues of lack of reproducibility.
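The arithmetic behind this claim can be illustrated with a short sketch. It is not taken from the article: the prior odds that a tested effect is real (1 real effect for every 10 null effects) and the 80% power are assumptions chosen only for illustration, and `false_positive_risk` is a hypothetical helper name.

```python
def false_positive_risk(alpha, power, prior_odds_true):
    """Share of 'significant' findings that are false positives.

    prior_odds_true is the ASSUMED odds that a tested effect is real,
    e.g. 0.1 means 1 real effect for every 10 null effects.
    """
    false_pos = alpha * 1.0             # null effects that reach significance
    true_pos = power * prior_odds_true  # real effects that reach significance
    return false_pos / (false_pos + true_pos)


# Illustrative inputs only (assumptions, not figures from the article).
for alpha in (0.05, 0.005):
    risk = false_positive_risk(alpha, power=0.8, prior_odds_true=0.1)
    print(f"threshold P < {alpha}: false-positive risk = {risk:.3f}")
# threshold P < 0.05: false-positive risk = 0.385
# threshold P < 0.005: false-positive risk = 0.059
```

Under these assumed inputs, more than one-third of significant findings at P <.05 would be false positives even if every study were conducted and reported flawlessly. Different assumptions about prior odds or power would change the numbers but not the direction of the effect.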
In an open science collaboration published in Science in August 2015, 270 psychologists seeking to assess reproducibility in their field endeavored to replicate a total of 100 studies published in 3 high-impact factor journals in psychology during 2008.3
"Reproducibility is a defining feature of science," remarked the investigators in the Science article's introduction. Reproducibility was assessed using 5 parameters: "significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes."
Surprisingly, the researchers found that "replication effects were half the magnitude of original effects, representing a substantial decline." Replications led to significant results in just 36% of studies, and "47% of original effect sizes were in the 95% confidence interval of the replication effect size." They conclude that "variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research."
However, in a comment on this large-scale replication study published several months later, also in Science, Daniel T. Gilbert, PhD, professor of psychology at Harvard University, Cambridge, Massachusetts, and colleagues argue that the article "contains 3 statistical errors, and provides no support for [the low rate of reproducibility in psychology studies]."4 The comment's authors contend that, because results from the replication study were not corrected for error, power, or bias, "the data are consistent with the opposite conclusion, namely, that the reproducibility of psychological science is quite high."
Similar issues were encountered by the biotechnology companies Amgen and Bayer Healthcare, which were able to replicate only 11% of 53 “landmark” preclinical studies and 25% of 67 studies (70% in oncology), respectively.5,6 One of the reasons cited in the Bayer study for this lack of reproducibility is an “incorrect or inappropriate statistical analysis of results or insufficient sample sizes, which result in potentially high numbers of irreproducible or even false results.”6
Although several measures have been proposed to tackle the root causes of the perceived, justly or not, lack of reproducibility (including low statistical power, multiple testing, and P-hacking), the authors of the "Redefine statistical significance" article believe that these measures, alone or in combination, would not adequately address the issue. Under the lower threshold, results with a P-value between .005 and .05 would be classified as "suggestive" rather than significant. The authors add that this is not a novel concept but, rather, one now endorsed by a "critical mass of researchers."7,8
This new standard for statistical significance is meant for studies that report new effects, rather than replication studies, and for studies whose statistical analyses test a null hypothesis. Although other options may also be employed in an effort to improve reproducibility, lowering the threshold is a simple measure, according to the authors, that would not require additional training of the research community and might therefore gather broad consensus.
A rebuff to the article by Benjamin and colleagues was published on the same preprint server on September 18, 2017, with the backing of an even larger group of academics, also international and multidisciplinary.9 In this article, the researchers propose an alternative to lowering the threshold for statistical significance, and recommend scientists "transparently report and justify all choices they make when designing a study, including the alpha level."
The group recognized the need to address the issue of replicability raised in the original commentary and said its members "appreciate their attempt to provide a concrete, easy-to-implement suggestion to improve science," but added: "We do not think that redefining the threshold for statistical significance to the lower, but equally arbitrary threshold of P ≤.005 is advisable."
The researchers provide 3 reasons to back their argument: a paucity of evidence supporting the notion that the current threshold for statistical significance is the "leading cause of non-reproducibility," poor arguments in favor of lowering the threshold to .005, and the fact that neither the positive nor the negative consequences of such a change have been assessed, a necessary precaution before proposing such a change. The group concludes that "researchers [should] justify their choice for an alpha level before collecting the data, instead of adopting a new uniform standard."
A Nature poll surveying almost 7000 of the publication's readers, presumably comprising a majority of scientists, and asking whether the P-value threshold should be lowered, indicated widespread support for this change, with 69% of respondents in favor.10
Clinical Pain Advisor sought the feedback of Nathaniel Katz, MD, adjunct associate professor of anesthesia, Tufts University School of Medicine, Boston, Massachusetts, and formerly chair of the Advisory Committee, Anesthesia, Critical Care, and Addiction Products Division at the US Food and Drug Administration.
Clinical Pain Advisor: What would the main implications be of lowering the P-value on clinical trials?
Dr Katz: The problem of nonreproducibility in clinical trials has a variety of root causes, 1 of which is lack of quality control procedures. Most basic science laboratories around the world, both academic and nonacademic, lack attention to experimental method and data integrity and lack accountability for animals used in basic science research, and the journal publication process lacks any type of data validation. Those are the root causes, and reducing the P-value is hardly relevant to the actual causes of nonreproducibility.
The first thing that would happen [with such a change] is that sample size requirements for clinical trials would balloon so far out of control that pharmaceutical companies may get out of clinical development for many indications, including pain.
The notion that there should be a single threshold for deciding whether a clinical trial is positive or not is immature. It is true that regulatory agencies need thresholds, as they need to have an even playing field for deciding the efficacy of a drug, but in the scientific community, the idea that the use of a threshold would limit one's vision of how to interpret a clinical trial is already a problem. It is a naive and very destructive approach to science.
Clinical Pain Advisor: How would this change affect clinical research?
Dr Katz: Pharmaceutical companies would get out of clinical development, as with a multiplication of the sample size, the cost of drug development would increase substantially. Drug development would no longer make sense financially.
On the academic side, if you multiply the sample size requirements, how is any National Institutes of Health-funded research ever going to lead to an interpretable result?
Clinical Pain Advisor: What other measure or measures would more adequately benefit the development of pain drugs?
Dr Katz: The cause of the crisis has more to do with transparency and quality control on the preclinical side. I suspect that 1 of the main reasons we don't have better pain treatments right now is because of a lack of transparency, quality control, and data validation on the preclinical/basic science side. There are guidelines for preclinical research that have recently come out that show how preclinical research should be transformed.11 On the clinical side, we need to figure out ways to accelerate patient recruitment in clinical trials, and we need to tighten up the way we conduct clinical trials in a number of different ways that are the subject of many papers by the IMMPACT group and others. Pain research could be incentivized in a variety of different ways by many branches of the government, including the US Food and Drug Administration.
Dr Katz provided Clinical Pain Advisor with a calculation of how a lowering of the P-value would affect clinical trials. For this calculation, Dr Katz opted for a standard scenario and noted that "implications will be different for a 40 patient proof-of-concept study vs a 12,000 patient global pivotal trial." According to his calculation, Dr Katz estimates that, using this standard example, sample sizes would increase by 60%, and time to complete enrollment would increase from 14 to 22 months. He noted that "Sponsors could try to compensate by increasing the number of sites, but this [would lead] to quality problems that decrease data integrity and statistical power."
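The inputs to Dr Katz's calculation are not given, but the order of magnitude can be checked with a textbook normal approximation in which the required sample size for a two-sided comparison scales with (z for the significance threshold + z for the power) squared. The sketch below is an illustration under that assumption, not a reconstruction of his scenario; the 80% and 90% power levels and the helper name `relative_sample_size` are hypothetical choices.

```python
from statistics import NormalDist


def relative_sample_size(alpha_old, alpha_new, power):
    """Ratio of required sample sizes when the two-sided threshold changes.

    Uses the normal approximation n ~ (z_{1-alpha/2} + z_{power})^2, with
    effect size and variance held fixed (an assumption of this sketch).
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    z_power = z(power)
    n_old = (z(1 - alpha_old / 2) + z_power) ** 2
    n_new = (z(1 - alpha_new / 2) + z_power) ** 2
    return n_new / n_old


# Power levels are illustrative assumptions, not taken from the article.
for power in (0.8, 0.9):
    ratio = relative_sample_size(0.05, 0.005, power)
    print(f"power {power:.0%}: sample size multiplied by {ratio:.2f}")
# power 80%: sample size multiplied by 1.70
# power 90%: sample size multiplied by 1.59
```

Under this approximation, moving from P <.05 to P <.005 raises the required sample size by roughly 60% to 70%, which is in the same range as Dr Katz's estimate.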
As may be gathered from the debate the original comment has prompted, there is no simple answer to the issue of nonreproducibility in research. Would journals that encourage the publishing of negative results in basic and clinical studies improve the issue? What other changes may be implemented to improve the quality, and as a consequence, the reproducibility, of research?
- Benjamin DJ, Berger JO, Johannesson M, et al. Redefine statistical significance [published online July 22, 2017]. PsyArXiv Preprints. doi: 10.17605/OSF.IO/MKY9J
- Benjamin DJ, Berger JO, Johannesson M, et al. Redefine statistical significance [published online September 26, 2017]. Nat Hum Behav. doi:10.1038/s41562-017-0189-z
- Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716.
- Gilbert DT, King G, Pettigrew S, Wilson TD. Comment on "Estimating the reproducibility of psychological science". Science. 2016;351(6277):1037.
- Begley CG, Ellis LM. Drug development: Raise standards for preclinical cancer research. Nature. 2012;483(7391):531-533.
- Prinz F, Schlange T, Asadullah K. Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov. 2011;10(9):712.
- Greenwald AG, Gonzalez R, Harris RJ, Guthrie D. Effect sizes and p values: what should be reported and what should be replicated? Psychophysiology. 1996;33(2):175-183.
- Johnson VE. Revised standards for statistical evidence. Proc Natl Acad Sci USA. 2013;110(48):19313-19317.
- Lakens D, Adolfi FG, Albers CJ, et al. Justify your alpha: a response to "redefine statistical significance" [published online September 18, 2017]. PsyArXiv Preprints. doi: 10.17605/OSF.IO/9S3Y6
- Chawla DS. 'One-size-fits-all' threshold for P values under fire [published online September 19, 2017]. Nature. doi: 10.1038/nature.2017.22625
- Rice A, et al. Third ACTTION Scientific Workshop Transformative Strategies-development of pain therapies. https://www.fda.gov/downloads/Drugs/DevelopmentApprovalProcess/DevelopmentResources/UCM407139.pdf. Accessed September 25, 2017.