P-Hacking True Effects




Psychologists have been talking about a research practice that goes something like this: I have a hypothesis that people are happier after they listen to Taylor Swift’s “Shake It Off” than after they listen to that Baz Luhrmann song about sunscreen. So I play “Shake It Off” to some people and “Everybody’s Free to Wear Sunscreen” to some other people. Then, I ask everyone how happy they are. I see that the people who listened to Taylor Swift rated themselves a little higher on my happiness scale than the people who listened to Baz Luhrmann. But this difference isn’t statistically significant.

So I play each of the songs to a few more people. Then, I pool my new data with the data from before and run my statistical test again. Now the difference is significant! I have something I can publish!

This is one form of “p-hacking,” or running multiple statistical tests in order to get a significant result where there wasn’t one before. A while ago, Ryne Sherman wrote an R function that simulates this process. The details of it are over at his blog. His simulations showed that, as expected, determining sample size by looking intermittently at the data increases false positives when there’s no real difference between the groups. I’ll be using his function to look at what happens when my hypotheses are correct.

But first, just to demonstrate how it works, let’s take a look at what happens when there really is no difference between groups.

For my simulations, I start with 30 participants per condition and add 30 more per condition each time I find p >= .05, up to a maximum of 270 per condition, using a two-sided t-test. Then I repeat the whole procedure 9,999 more times, for 10,000 simulated studies in all.
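Just to make that stopping rule concrete before handing things over to Sherman’s function, here is a minimal base-R sketch of the same procedure (the function name and comments here are mine, not his; phack() below does this and more):

one_peeking_study <- function(d = 0, initialN = 30, hackrate = 30, maxN = 270, alpha = .05) {
  g1 <- rnorm(initialN, mean = d)  # the "Shake It Off" group
  g2 <- rnorm(initialN, mean = 0)  # the "Everybody's Free to Wear Sunscreen" group
  repeat {
    p <- t.test(g1, g2, alternative = "two.sided")$p.value
    if (p < alpha || length(g1) >= maxN) return(p < alpha)  # stop when significant or out of budget
    g1 <- c(g1, rnorm(hackrate, mean = d))  # otherwise add another batch to each group...
    g2 <- c(g2, rnorm(hackrate, mean = 0))  # ...and test again
  }
}

# Hit rate across many such simulated studies, e.g. under the null:
# mean(replicate(10000, one_peeking_study(d = 0)))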

Here’s what happens when the null hypothesis is true (people are just as happy after Taylor Swift as after Baz Luhrmann):

source("http://rynesherman.com/phack.r") # read in Ryne Sherman's function
set.seed(4)
res.null <- phack(initialN=30, 
                  hackrate=30, 
                  grp1M=0, # Group 1 has a mean of 0
                  grp2M=0, # So does Group 2
                  grp1SD=1, # Group 1 has an SD of 1
                  grp2SD=1, # So does Group 2
                  maxN=270, 
                  alpha=.05, 
                  alternative="two.sided", 
                  graph=FALSE, 
                  sims=10000)
## Loading required package: psych
## Proportion of Original Samples Statistically Significant = 0.049 
## Proportion of Samples Statistically Significant After Hacking = 0.1898 
## Probability of Stopping Before Reaching Significance = 0.819 
## Average Number of Hacks Before Significant/Stopping = 6.973 
## Average N Added Before Significant/Stopping = 209.19 
## Average Total N 239.19 
## Estimated r without hacking 0 
## Estimated r with hacking 0 
## Estimated r with hacking 0 (non-significant results not included)

The first line of the output tells me what proportion of times my first batch of 60 participants (30 per cell) was significant. As expected, it’s 5% of the time.

The second line tells me what proportion of times I achieved significance overall, including when I added more batches of 60 participants. That’s a much higher number, 19%.

Wow. I can nearly quadruple my hit rate (from about 5% to 19%) just by looking at the data intermittently! One in five studies now returns a hit.

The Average Total N is the average number of participants I ran per cell before I stopped collecting data. It’s 239. If I am collecting data on Mechanical Turk, getting 239 people to listen to Taylor Swift and 239 to listen to Baz Luhrmann is a cakewalk. I could collect hits very easily by running tons of studies on mTurk. I’d be very productive (in terms of publication count) this way. But all of my “hits” would be false positives, and all of my papers would be reporting on false findings.

 

But what about when the null is false?

The first simulation assumed that there really is no difference between the groups. But I probably don’t really think that is true. More likely, I think there is a difference. I expect the Taylor Swift group to score higher than the Baz Luhrmann group. I don’t know how much higher. Maybe it’s a small effect, d = .2.

So, what happens when people really are happier listening to Taylor Swift?

set.seed(4)
res.small <- phack(initialN=30, 
                   hackrate=30, 
                   grp1M=.2, # Group 1 now has a mean of .2
                   grp2M=0, 
                   grp1SD=1, 
                   grp2SD=1, 
                   maxN=270, 
                   alpha=.05, 
                   alternative="two.sided", 
                   graph=FALSE, 
                   sims=10000)
## Proportion of Original Samples Statistically Significant = 0.1205 
## Proportion of Samples Statistically Significant After Hacking = 0.744 
## Probability of Stopping Before Reaching Significance = 0.3006 
## Average Number of Hacks Before Significant/Stopping = 4.4569 
## Average N Added Before Significant/Stopping = 133.707 
## Average Total N 163.707 
## Estimated r without hacking 0.1 
## Estimated r with hacking 0.14 
## Estimated r with hacking 0.17 (non-significant results not included)

Holy hit rate, Batman! Now I’m seeing p < .05 almost 75% of the time! And this time, they are true positives!

Sure, my effect size estimate is inflated if I publish only my significant results, but I am generating significant findings at an outstanding rate.
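As an aside, the r values in that output line up with the standard equal-n conversion between d and r, r = d / sqrt(d^2 + 4). A quick check (the helper names here are mine):

d_to_r <- function(d) d / sqrt(d^2 + 4)      # equal-n two-group conversion from d to r
r_to_d <- function(r) 2 * r / sqrt(1 - r^2)  # and back again
d_to_r(.2)   # ~ .10, the "Estimated r without hacking"
r_to_d(.14)  # ~ .28, the hacked estimate back in d units
r_to_d(.17)  # ~ .35, when non-significant results are dropped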

Not only that, but I’m stopping on average after 164 participants per condition. How many participants would I need to have 75% success if I only looked at my data once? I need a power analysis for that.

library(pwr)
pwr.t.test(d = .2, 
           sig.level = 0.05, 
           power = .75, 
           type = "two.sample", 
           alternative = "two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 347.9784
##               d = 0.2
##       sig.level = 0.05
##           power = 0.75
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

348 participants per condition!! That’s more than twice as many! The other way is MUCH more efficient. My Taylor Swift = happiness paper is going to press really quickly!

 

What if the effect were moderate? Say, d = .4?

Here are the simulations for d = .4:

set.seed(4)
res.moder <- phack(initialN=30, 
                   hackrate=30, 
                   grp1M=.4, # Group 1 now has a mean of .4
                   grp2M=0, 
                   grp1SD=1, 
                   grp2SD=1, 
                   maxN=270, 
                   alpha=.05, 
                   alternative="two.sided", 
                   graph=FALSE, 
                   sims=10000)
## Proportion of Original Samples Statistically Significant = 0.3348 
## Proportion of Samples Statistically Significant After Hacking = 0.9982 
## Probability of Stopping Before Reaching Significance = 0.005 
## Average Number of Hacks Before Significant/Stopping = 1.424 
## Average N Added Before Significant/Stopping = 42.72 
## Average Total N 72.72 
## Estimated r without hacking 0.2 
## Estimated r with hacking 0.24 
## Estimated r with hacking 0.24 (non-significant results not included)

BOOM!! Batting a thousand! (Ok, .998, but that’s still really good!!)

And with only 73 participants per condition!

I’m rolling in publications! I can’t write fast enough to publish all these results.

And what would I have to do normally to get 99.8% success?

pwr.t.test(d = .4, 
           sig.level = 0.05, 
           power = .998, 
           type = "two.sample", 
           alternative = "two.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 293.5578
##               d = 0.4
##       sig.level = 0.05
##           power = 0.998
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Dang. That’s FOUR TIMES as many participants. Looking at the data multiple times wins again.

 

But am I really right all the time?

So looking at my data intermittently is actually a super effective way to reach p < .05 when I have even small true effects.[1] It could lead to faster research, more publications, and less participant time used! Those are substantial benefits. On the downside, I would get to play “Shake It Off” for fewer people.

Looking at data multiple times makes it easier to get true positives.

And I’m only studying true effects, right?

Probably not.

p-hacking only seems like a problem if I accept that I might be studying false effects.[2] Which I almost certainly am. At least some of the time.

But the problem is that I don’t know ahead of time which hypotheses are true or false. That’s why I am doing research to begin with.

It also seems that when I am studying true effects, and I am willing to collect large-ish samples,[3] intermittent looking should yield a high hit rate. And I should be able to achieve that rate without needing to do anything else, such as dropping conditions, to achieve my desired p-value.[4] If I am looking at my data intermittently, a low hit rate should make me consider that my hypothesis is wrong – or at the very least that I am studying a very small effect.
 
Edit:
Alexander Etz (@AlxEtz) pointed out that it’s possible to look at the data more than once without increasing alpha. He’s right. And it can be efficient if I’m not interested in getting a precise effect size estimate. Daniel Lakens has a great post about doing this, as does Rolf Zwaan. Alex adds:


  1. Some people much, much smarter than I am have already written about the “optimal” strategies for winning publications, and you should read their paper because it shows just how much these strategies bias publications.
  2. Or if I care about effect size estimates.
  3. Even if I am only willing to test 120 people in each condition, I find significant results 9%, 41%, and 90% of the time for d = 0, .2, and .4, respectively. For a small effect, even looking at my data just four times (at 30, 60, 90, and 120 participants per cell), my hit rate is more than quadruple that under the null hypothesis. (The calls behind these numbers are sketched just after these notes.)
  4. I also modified Sherman’s original code a bit to look at what happens if I only continue adding participants when the Taylor Swift mean is bigger (but not significantly) than the Baz Luhrmann mean. I was able to find a significant effect 9%, 59%, and 93% of the time for d = 0, .2, and .4, respectively. In other words, I can still expect to find a significant result more than half the time even for effects as small as d = .2, even if the only p-hacking I do is looking at my data intermittently.
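For anyone who wants to poke at the numbers in note 3, the same phack() call can simply be rerun with maxN = 120, so that the data get checked at 30, 60, 90, and 120 participants per cell. A sketch (exact proportions will wobble a little with the seed):

res.null.120  <- phack(initialN=30, hackrate=30, grp1M=0,  grp2M=0, grp1SD=1, grp2SD=1,
                       maxN=120, alpha=.05, alternative="two.sided", graph=FALSE, sims=10000)  # d = 0:  ~9%
res.small.120 <- phack(initialN=30, hackrate=30, grp1M=.2, grp2M=0, grp1SD=1, grp2SD=1,
                       maxN=120, alpha=.05, alternative="two.sided", graph=FALSE, sims=10000)  # d = .2: ~41%
res.moder.120 <- phack(initialN=30, hackrate=30, grp1M=.4, grp2M=0, grp1SD=1, grp2SD=1,
                       maxN=120, alpha=.05, alternative="two.sided", graph=FALSE, sims=10000)  # d = .4: ~90%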

Kahneman on Intuitive Statistics and Small Samples

Lately, I’ve been reading Thinking, Fast and Slow by Daniel Kahneman, whose work on judgment and decision-making can hardly be overstated in its importance (hey, he won a Nobel Prize for it, and there isn’t even a Nobel for psychology!).

In the book, Kahneman discusses early conversations with his long-time collaborator Amos Tversky and how he came to the realization that even people with years of statistical training and practice can fail in their statistical intuitions.

Here is Kahneman on intuitive statistics:

We had concluded in the seminar that our own intuitions were deficient. In spite of years of teaching and using statistics, we had not developed an intuitive sense of the reliability of statistical results observed in small samples. Our subjective judgments were biased: we were far too willing to believe research findings based on inadequate evidence and prone to collect too few observations in our own research.

And later:

Like most research psychologists, I had routinely chosen samples that were too small and had often obtained results that made no sense. Now I knew why: the odd results were actually artifacts of my research methods. My mistake was particularly embarrassing because I taught statistics and knew how to compute the sample size that would reduce the risk of failure to an acceptable level. But I had never chosen a sample size by computation. Like my colleagues, I had trusted tradition and my intuition in planning my experiments and had never thought seriously about the issue. When Amos visited the seminar, I had already reached the conclusion that my intuitions were deficient…

These confessions of past errors, coming from such an eminent scientist, are powerful reminders to the rest of us to question our intuitive assumptions, use larger samples, and admit to our own faults.

Digging

Champaign-Urbana is finally experiencing beautiful spring weather after a brutal (for central Illinois) winter, so I spent Saturday afternoon digging out my garden and filling in the holes that my dogs have dug in the yard. As I pulled up chunks of grass + dirt, I had the idea to transplant these hunks to the newly-filled holes. The dogs have torn up a lot of our grass, and maybe I could use this as an opportunity to patch up some of the dead spots.

Will it work? I don’t know. It’s an experiment, I told myself.

Scientist-me immediately chimed in: That’s not an experiment! Where is my control group? As a scientist, I should be more careful about how I use words like “experiment.”

Then forgiving-me added soothingly: It’s ok. This is just how people use the word “experiment” when they are not doing professional science. And right now, I am not doing professional science. I am just digging in my garden. Colloquially, an experiment is just a process whose outcome is unknown. I don’t know if the grass will grow. It probably won’t. But I will just do it and see.

Scientist-me chewed this over. Wouldn’t it be nice if scientists also did not know in advance the outcomes of their experiments? When one spends a lot of time and careful thought developing theories and deriving predictions from them, it is easy to feel like one knows what the outcome will be. And this can lead to confirming what one “knows” by dropping measurements that do not verify this knowledge, changing how one calculates statistics so that one can draw the inferences one knew all along, re-running the same experiment until the conclusions align with what one knows. And each of these sources of confirmation can feel in the moment as if they are justified: Of course these data need to be dropped, the other measures were the ones that really mattered! And so on.

But that’s not science. That’s just digging.

I hope that scientist-me can learn to be a little bit more like intuitive-gardener-me: genuinely curious about the world, open-minded about the possibility that my ideas may not work (and that they are still worth testing anyway), and seriously in love with tomatoes. Ok, the tomatoes may not help my science much, but they will make me happy anyway.

[image: gardening]
I also dig holes in the yard. But I plant things in them.