
Hacking the p-Value

May 4, 2015

Important things are represented by single characters. Beyond the primary pronoun, "I," we have many physical and mathematical objects. In math, we have the ratio of a circle's circumference to its diameter, π; the base of natural logarithms, e; the imaginary unit, i; and the golden ratio, φ.

In chemistry, we have the elements represented by single character symbols; namely, hydrogen, boron, carbon, nitrogen, oxygen, fluorine, phosphorus, sulfur, potassium, vanadium, yttrium, iodine, tungsten, and uranium. In physics, we have the unit of elementary charge, e; the speed of light, c; the gravitational constant, G; the Planck constant, h; the Stefan-Boltzmann constant, σ; and the gas constant, R.

Psychologists and biologists have their own important single character quantity, p, for probability, since their experiments are unlike the generally quantitative experiments of the physical sciences. Often, the only way that they can make sense of their results is through the use of statistics. Not that statistics are the savior of just life science experiments. Particle physics experiments rely on statistics, also.

Physicists strive to attain results with a high confidence level. The existence of the Higgs boson was confirmed at the 4.9-sigma level, meaning that the chance of seeing such a signal by statistical fluctuation alone, if there were no Higgs, is only about one in a million. Biological systems are affected by too many environmental influences to give such clear results.
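
For readers who want to check such figures themselves, a sigma level translates to a tail probability of the standard normal distribution. The short Python sketch below (SciPy is my choice here, and the one-sided tail convention is an assumption) does the conversion:

```python
# Sketch: converting a "sigma" significance level to a tail probability,
# assuming a one-sided tail of the standard normal distribution.
from scipy.stats import norm

for sigma in (3, 4, 4.9, 5):
    p = norm.sf(sigma)          # survival function: P(Z > sigma)
    print(f"{sigma}-sigma -> p = {p:.2e} (about 1 in {1/p:,.0f})")
```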

In many scientific disciplines, Fisher's null hypothesis significance test is the path to statistical truth. The null hypothesis, as its name indicates, is the hypothesis that your experimental variable had no effect on the observed outcome. The experimenter computes the probability, p, of observing an effect at least as large as the one seen if the null hypothesis were true. If this p-value is small, conventionally below 0.05, then the null hypothesis is rejected, and it's claimed that the experimental variable does affect the observed outcome.
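
As a concrete illustration of the procedure, here's a minimal Python sketch using SciPy's two-sample t-test; the "control" and "treatment" numbers are invented purely for illustration:

```python
# Sketch of Fisher-style null hypothesis significance testing with a
# two-sample t-test. The data are invented for illustration only.
from scipy.stats import ttest_ind

control   = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2, 4.3, 3.7]   # hypothetical measurements
treatment = [4.6, 4.9, 4.3, 4.8, 4.5, 4.7, 5.0, 4.4]

t_stat, p_value = ttest_ind(treatment, control)

# Null hypothesis: the treatment has no effect on the measured outcome.
# A small p-value (conventionally p < 0.05) leads to rejecting the null.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis at the 0.05 level.")
else:
    print("Fail to reject the null hypothesis.")
```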

Sir Ronald Aylmer Fisher (1890 - 1962).


Everyone who's studied statistics is familiar with Fisher as the originator of ANOVA, the analysis of variance.

(Via Wikimedia Commons.)

The p-value has long been used to validate results in published papers, but the method can be misleading. In February 2015, the journal Basic and Applied Social Psychology declared that it won't publish papers that rely on the p-value method, or even mention it.[1] The journal warned authors in 2014 of its belief that the null hypothesis significance testing procedure (NHSTP) is invalid, but it allowed a grace period until this year.[1] Asked whether manuscripts that mention p-values will be rejected automatically, the journal responded,
"No. If manuscripts pass the preliminary inspection, they will be sent out for review. But prior to publication, authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about 'significant' differences or lack thereof, and so on)."[1]

The journal cited the p-test as an "important obstacle to creative thinking" that's dominated psychology for decades; and, it hoped that other journals would join in this ban on what's seen as an unneeded crutch.[1] Shortly thereafter, the American Statistical Association (ASA) posted a comment on its web site that it was wary that such a p-value ban might have its own negative consequences.[2] The ASA has formed a group of more than two-dozen "distinguished statistical professionals" to develop a statement on p-values.[2]

Tom Siegfried, in his blog at Science News, quotes William Rozeboom, a philosopher of science, as saying that the p-test was "surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students."[3]

A recent paper in PLoS Biology by biologists at the Australian National University (Canberra, Australia) and Macquarie University (New South Wales, Australia) concludes that scientists will sometimes "tweak" experiments and analysis methods to obtain a better p-value and thereby increase the likelihood of publication.[4-5] The authors call this technique "p-hacking," and it appears to be common in the life sciences. This conclusion is based on an analysis of more than 100,000 research papers in such diverse scientific disciplines as medicine, biology, and psychology.[4-5]

Megan Head, lead author of the p-hacking paper, in her evolutionary biology laboratory at the Australian National University.

(Australian National University photo by Regina Vega-Trejo.)

Lead author Megan Head of the Australian National University says that you can't put too much blame on the scientists who engage in p-hacking, since
"Many researchers are not aware that certain methods could make some results seem more important than they are. They are just genuinely excited about finding something new and interesting."[5]

Typical research practices leading to p-hacking include doing analyses in the middle of an experiment to decide whether to continue the experiment; recording many variables, but deciding which are significant enough to report; dropping outliers; excluding, combining, or splitting groups after analysis; and stopping data taking once an analysis gives a significant p-value.[4]
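
A small simulation shows why one of these practices, recording many variables and reporting only the one that "works," is so corrosive. The sketch below (the group sizes and variable counts are arbitrary assumptions) tests ten unrelated variables per experiment when no real effect exists:

```python
# Sketch: how testing many variables and reporting only the "significant"
# one inflates false positives, even when the null hypothesis is true for
# every variable. All numbers here are hypothetical.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments, n_variables, n_subjects = 2000, 10, 20

false_positives = 0
for _ in range(n_experiments):
    # Two groups drawn from the SAME distribution: any "effect" is noise.
    group_a = rng.normal(size=(n_subjects, n_variables))
    group_b = rng.normal(size=(n_subjects, n_variables))
    p_values = [ttest_ind(group_a[:, j], group_b[:, j]).pvalue
                for j in range(n_variables)]
    # The p-hacker reports only the best-looking variable.
    if min(p_values) < 0.05:
        false_positives += 1

print(f"False-positive rate with cherry-picking: {false_positives / n_experiments:.2f}")
# Expected: roughly 1 - 0.95**10, about 0.40, far above the nominal 0.05.
```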

One reason for p-hacking is publication pressure. Prestigious journals favor papers with statistically significant ("positive") results, and this appears to generate papers with false-positive findings that hinder scientific progress.[4-5] Early positive studies receive a lot of attention, while negative studies that contradict them receive far less.[4] In multiple studies of a pharmaceutical drug's effectiveness, too many p-hacked findings would make the drug look more effective than it is.[5]

Evidence for p-hacking in various scientific disciplines, based on p-values mined from paper abstracts. Engineering and chemistry appear to be honest disciplines, while other sciences show p-values clumped just below the 0.05 significance threshold. See ref. 4 for details. (Fig. 3B of ref. 4, licensed under a Creative Commons Attribution License.)[4]

Not surprisingly, the study found many papers reporting p-values just barely under the 0.05 significance threshold.[5] This is evidence that some scientists have adjusted their experiments and analyses to cross that important threshold. Says Head,
"This suggests that some scientists adjust their experimental design, datasets or statistical methods until they get a result that crosses the significance threshold... They might look at their results before an experiment is finished, or explore their data with lots of different statistical methods, without realizing that this can lead to bias."[5]

Funding for this research was provided by the Australian Research Council.[4]

Randall Munroe weighed in on the p-value debate in his xkcd comic of January 26, 2015 (licensed under the Creative Commons Attribution-NonCommercial 2.5 License).


References:

  1. David Trafimow and Michael Marks, "Editorial," Basic and Applied Social Psychology, vol. 37, no. 1 (February 12, 2015), pp. 1-2, DOI: 10.1080/01973533.2015.1012991.
  2. ASA Comment on a Journal's Ban on Null Hypothesis Statistical Testing.
  3. Tom Siegfried, "P value ban: small step for a journal, giant leap for science," Science News, March 17, 2015.
  4. Megan L. Head, Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions, "The Extent and Consequences of P-Hacking in Science," PLOS Biology, vol. 13, no. 3 (March 13, 2015), DOI: 10.1371/journal.pbio.1002106. This is an open access paper with a PDF file available here.
  5. Scientists unknowingly tweak experiments, Australian National University Press Release, March 18, 2015.