I’ve been thinking a lot about this over the past few days. First, I agree with Joe that part of doing science is building on the past, and uncertainty about the power and bias of our literature means that making use of our past is a struggle.
Those of us who believe that power is generally well under 50%, that publication bias is nearly 100%, and that flexibility in stopping rules, analyses, and reporting practices was used often enough to be of concern approach the published literature with skepticism about the size and reproducibility of reported effects.
This doesn’t mean that we assume every effect we encounter was found only through lots of trials of small studies that used flexible stopping, analysis, and reporting to achieve p < .05. But it does mean that we have some motivation to estimate the probabilities of these practices if we hope to make use of the literature we read.
It also means that, to a greater or lesser extent, some of us see sharing well-supported inferences about the power, extent of publication bias, and use of flexibility as a social good. Making informed inferences about these things can require a lot of work, and there’s no reason for every person researching in an area to duplicate that work. On the other hand, like every other part of science, these analyses benefit from replication, critique, and revision, so discussing them with others can make them better.
So that leaves me here: I am not confident that everything in our past is a reliable source of inference without investigation into its bias. Much as I love the idea of pressing the reboot button and starting over, I think that’s ultimately more wasteful than trying to make something of the past. I want to be able to do bias investigations, to share them with others, and to learn from the investigations others have done.
This is not about finding out who is good or bad or who is naughty or nice. This is about doing the best science I can do. And for that, I need to know how to interpret the past, which means I need a way to be able to talk about the strengths and weaknesses of the past with others.
Pretending like everything in the past is solid evidence is no longer an honest option for researchers who have accepted that small sample sizes, publication bias, and flexibility are threats to inference and parts of our research legacy. Yet, saying, “Gosh, I don’t quite believe that this study/paper/literature provides compelling evidence,” feels risky. It might be seen as an attack on the researchers (including one’s own collaborators if the research is one’s own), might be deemed uncivil, or might invite a bunch of social media backlash that would be a serious hassle and/or bummer. So Joe’s question is really important: How do we create a culture that makes this not an attack, not uncivil, and not a total bummer?
What Can We Do?
I have a few ideas. I don’t think any of them are easy, but I suspect that, like many things, the costs of doing them are likely not as high as we imagine.
Stop citing weak studies, or collections of weak studies, as evidence for effects
When you think the literature supporting an idea is too weak to draw a confident inference, stop citing the literature as if it strongly supports the idea. Instead of citing the evidence, cite the ideas or hypotheses. Or stop citing the classic study you no longer trust as good evidence and cite the best study. When reviewers suggest that you omitted a classic and important finding, politely push back, explaining why your alternative citation provides better evidence.
Focus on the most defensible criticism
As Jeff Sherman pointed out, it can be harder to find evidence that research makes use of research flexibility than that it exhibits low power and publication bias, and an argument about flexibility has more of a feeling of a personal attack. It’s relatively easy to show that even post-hoc power (which is likely an overestimate) is low and yet every reported finding is positive. Like all evidence, this isn’t proof that power is low and suppressed findings exist, but it’s reason to be cautious. If you can make a point for caution with power and publication bias alone, maybe don’t bring up flexibility. So long as suggesting the use of flexibility feels like a personal attack, unless there are really compelling reasons to suspect that flexible research practices were used, you might be weakening your case against the evidence by suggesting they are possible.
That’s not to say we shouldn’t discuss research flexibility where there is good evidence for it, but I think Jeff Sherman makes another good point about such criticisms: “If I suggest that lack of double-blinding may be a problem for a study, I am specifying a particular issue. If I suggest p-hacking or researcher degrees of freedom, I am making a suggestion of unknown, unspecified monkeying around. There is a big difference.” So when suggesting that flexibility may undermine the inferences from a line of research, it’s important to be as specific about the type of flexibility and as concrete in the evidence as possible.
Perhaps the safest place to start is with oneself. Michael Inzlicht and Michael Kraus have written about how some of their previous research shows signs of bias (and how they are changing things so that their future work shows less bias). They haven’t called out specific papers, but they’ve p-curved and TIVA’d and R-Indexed their prior papers and owned up to the fact that the work they’re doing now is better than the work they did in the past.
In admitting that their own research exhibits some forms of bias, they have opened the discussion and made it safer and easier for others to make similar admissions about themselves. Not that it was easy for them. Michael Inzlicht talks about fear, sadness, and pain in the process. But it is beautiful and brave that he not only performed the self-check anyway but went on to publish it publicly. And ultimately, he found the experience “humbling, yet gratifying.”
Publish commentaries on or corrections of your previous work
I’m not going to pretend that this is at all easy or likely to be rewarded. It’s hard to remember exactly all of the studies that were run in a given research line, and, unfortunately, records may not be good enough to reconstruct that. So researchers may not know precisely the extent of publication bias in their own work. But still, for those cases where one knows that bias exists, it would benefit the entire community to admit it.
I can only think of one instance where someone has done this. Joe Hilgard wrote a blog post about a paper he had come to feel reported an unlikely finding based on (actually disclosed) flexible analyses and reporting. Vox wrote up a report complimenting Joe’s confession (and it really was brave and awesome!), but the coverage kind of gave the impression that Joe’s barely-cited paper was responsible for the collapse of the entire ego depletion literature: “All of this goes to show how individual instances of p-hacking can snowball into a pile of research that collapses when its foundations are tested.” Oops.
I doubt that that would happen to the next person who publishes a similar piece. But what will happen? One comment on Joe’s blog post asks whether he plans to retract the paper. I don’t think that’s the appropriate response to the bias in our literature but others definitely do, so calls for retraction seem plausible. Another concern is reputation: Will you anger your friends and collaborators or develop a reputation as someone who backstabs your colleagues? If people see admitting to bias as a personal black mark, this is possible.
Maybe there should be no such thing as retraction, or maybe we could ban the word “retraction” and simply offer “corrections.” That would be fine with me. The point is never to “expunge the record,” it’s about correcting the record so that later scholars don’t take a mistaken claim as being true, or proven.
But, to the extent there are retractions, or corrections, or whatever you want to call them: Sure, just do it. It’s not a penalty or a punishment. I published corrections for two of my papers because I found that they were in error. That’s what you do when you find a mistake.
I’d love to see this opinion spread through psychology. As people who study people, psychologists know that bias happens; it’s just part of being human. Correct the record and move on. Start with thinking about the bias in your solo-authored papers. Begin talking about the idea with colleagues you already talk to about bias; warm them up to the idea of correcting their own work or your joint work. Then start leaving comments on PubPeer or on your blog or on http://psychdisclosure.org/. Or maybe even submit them as brief corrections to journals. If you’re an editor at a journal who would consider these kinds of corrections, invite them.
This is really an extension of what Michael Inzlicht and Michael Kraus have already done: start at home. By admitting our bias, we can set the example that it’s OK to have bias called out. But it can go a bit further by actually adding to the literature. If you include new data (e.g., dropped studies, conditions, or variables) or new analyses (e.g., an alternative specification of a DV), you are not just admitting bias but also contributing valuable new information that might make your correction into a meaningful paper in its own right.
Publish your file-drawered studies
Make some use out of all the data you’re sitting on that was never published. You can simply post the data in an archive and make it available to meta-analysts and other researchers. You can publish it yourself as a new paper or as part of a correction. If you can’t get null or inconclusive results through traditional peer review, try an alternative outlet like the Journal of Articles in Support of the Null Hypothesis or The Winnower. The Winnower has the benefit of giving your blog post a citable DOI and pushing it through to Google Scholar. If you want to use your file drawer to make a big impact, gather all of your studies on a single topic into a publication-bias-free meta-analysis and use that to create theoretical insights and make meaningful methodological recommendations.
Publish meta-scientific reviews
We already accept bias investigations in meta-analyses. Funnel plots, Egger tests, and other bias detection techniques are standard parts of meta-analysis. We are adding more and more tests to this repertoire every year.
Malte Elson brought up the idea that synthesizing whole research areas might be a more acceptable way to bring up criticisms about research flexibility, and he’s done some fantastic and detailed work cataloging flexibility in operationalizations of the CRTT. This work is specific (applies to a specific domain, a specific measure, and specific papers) but also diffuses agency across many authors. No one person is responsible for all of the flexibility, and actually attempting to figure out who has used more or less flexibility is fairly involved and just about the least interesting thing one can do with the published tools. Rather than providing, say, field-wide estimates of power, publication bias, or research flexibility, these domain-specific investigations provide the type of information needed by researchers to evaluate the papers they are using in their own work.
Publicly praise and reward people who do these things
Cite corrections. Tweet and post on Facebook about how awesome people who admit bias are. Offer them jobs and promotions. If people are going to risk their reputations and relationships in trying to help others navigate the past, do everything you can to make it worth their while.
Let me be clear, doing any and all of these things is awesome, but it’s also only a beginning. Joe’s question is really about how to create a culture so that it is ok to point out specific instances of research flexibility in others’ work without ruining either one’s own or the author’s reputations. I think that admitting our own bias and examining field-wide bias will help normalize bias discussions, but they probably won’t bring us far enough.
I don’t expect everyone to make a complete catalog of their unpublished work or reveal their original planned analyses for every study they’ve ever published. Most people don’t have the time or records to do that. But we still need to be able to talk about the potential bias in their work anyway if we want to build on it. So we have to look and we have to talk about it, and it has to be ok to do that.
Some people are already doing these investigations, but my general impression is that they are not received well. I hope that talking more about bias in ourselves and in general will bring us closer to the goal of discussing specific cases of bias, but I wonder whether there is more we can do to get us there faster.
Update: I’ve edited this page slightly for clarity and proofreading and to correct an error. Before doing so, I archived the original version of the post. You can see the revision history at the Internet Archive.
This blog post provides additional details and analyses for the poster I am presenting at the 2016 meeting of the Society for Personality and Social Psychology. If you’ll be at SPSP and want to chat, come by my poster during Poster Session E on Friday at 12-1:30pm. I’ll be at poster board 258.
Over the past several years, psychologists have become increasingly concerned about the quality and replicability of the research in their field. To what extent are the findings reported in psychology journals “false positives,” that is, reports of effects where none truly exist? Researchers have attempted to answer this question with two different approaches: by replicating previous research and by developing a series of research quality metrics that attempt to quantify the evidential value, replicability, power, and bias of the research literature.
As part of this movement, a wave of replication studies has been published, including several large-scale projects. The results of these projects have been mixed. The Many Labs 1 (ML1) project involved 36 labs all running replications of the same 13 effects (16 effects, if you count the four anchoring effects separately). In aggregate analyses on data from all of the labs, the authors found that only two failed to reject the null hypothesis of no effect. Many Labs 3 (ML3), following a similar model, attempted to replicate 10 effects (plus 3 post-hoc additions of effects from three of the replicated studies) in 21 samples. This time, aggregate analyses of the 10 planned effects failed to reject the null hypothesis of no effect for seven of the effects. (Many Labs 2 is still in progress). The Reproducibility Project: Psychology took a different approach to replication, selecting many effects from specific journals and replicating each in a single lab. Out of 97 replications, 62 failed to reject the null hypothesis of no effect, a similar rate to ML3. However, unlike ML3, these analyses were not based on large, aggregated samples. Across these three projects, in general, effect sizes shrank from original to replication. The overall replicability of psychological science remains unknown (and may not be a well-defined or readily quantifiable concept); however, it is clear that some effects can be observed relatively regularly and in many settings while others are difficult to observe, even with many subjects and carefully constructed protocols.
At the same time, concerns about research methods that inflate false positive rates and about the effect of publication bias on the veracity of reported research (as well as increasing awareness that traditional meta-analyses are threatened by publication bias) have driven researchers to develop new techniques to evaluate the literature. The p-curve, for example, tests the evidential value of a set of studies by looking at its distribution of p-values: the shape of the distribution of p-values changes depending on whether the effect under study is true or null (and with the power of the test). The Replication Index (R-Index) attempts to quantify the replicability of a set of studies based on their post-hoc power (their power to detect an effect of the observed size). The Test of Insufficient Variance (TIVA) examines publication bias by asking whether the variance in p-values (converted to z-scores) is smaller than would be expected, suggesting that some results have been censored. A negative correlation between sample size and effect size may be taken as an indication of suppressing null results. Smaller samples will produce significant results only when the effect size is larger, but they are less powerful at detecting effects at all. So there may be lots of non-significant results missing from the literature if all the small sample studies report large effects. The N-Pact factor attempts to quantify the power of a set of studies by looking at its median sample size with the assumption that more powerful studies create more reliable estimates. These tests and indices have become popular tools for evaluating the quality of research output. Researchers are using these metrics to examine the quality of journals, of their own work, and of the papers published by others.
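As a rough illustration of why small samples only reach significance with large observed effects, here is a minimal sketch. It assumes a two-group design with equal group sizes and a normal approximation to the t-test; the function name `min_significant_d` is hypothetical and is not part of any of the tools mentioned above.

```python
from math import sqrt
from statistics import NormalDist

def min_significant_d(n_per_group, alpha=0.05):
    """Smallest observed |d| that reaches two-sided significance (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # The standard error of d is roughly sqrt(2 / n) for two equal groups
    return z_crit * sqrt(2 / n_per_group)

# Compare a small, a medium, and a large study
for n in (20, 50, 200):
    print(n, round(min_significant_d(n), 2))
```

Under these assumptions, a study with 20 people per group needs an observed d of about 0.62 to be significant, while a study with 200 per group needs only about 0.20. If only significant results are published, small studies can therefore only contribute large estimates.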
Their Relationship to One Another
I’ve heard colleagues dismiss papers because they contain mostly high p-values, low variance in p-values, small sample sizes, or sample sizes correlated with effect sizes. Papers with these characteristics are perceived as unreplicable and unsound. I’ve made such inferences myself. Yet, most of these metrics are not explicitly designed to index replicability. So are these judgments about replicability justified? Do research quality metrics at the article level predict replication outcomes?
Intuitively, it makes sense to think they would. Assuming papers generally conclude that effects are real, sets of studies based on large samples, with adequate power, and not exploiting flexibility in analysis seem like they ought to contain more replicable results than sets of studies based on small samples with flexible analysis plans and low power.
But there are some reasons that these metrics may not be predictive in practice. Optimistically, if researchers are doing a bang-up job on powering their studies, we would actually observe a (perhaps modest) negative correlation between effect size and sample size, because no one would be wasting huge samples on huge effects or bothering with tiny samples for tiny effects. Pessimistically, if the literature is extremely biased or extremely underpowered, there may be no predictive power to the metrics in the present literature at all. For example, when power is very low, the distribution of p-values becomes fairly flat. (You can observe this for yourself here. Try an effect size of d = 0.3 and sample size of n = 20. This is also shown in the first p-curve paper.) Such a literature would also have small sample sizes and low post-hoc power regardless of whether effects are true, and thus little variability in the metrics. And we would expect poor replicability, even for true effects, if sample sizes for replications were not sufficiently larger than the original, dismally powered sample. So even if some of the metrics should predict replicability in principle, they may not do so in practice.
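The flattening of the p-value distribution at low power can be sketched without simulation. The code below is an illustration only: it assumes a two-group z-test with known variance (a normal approximation to the t-test), treats n = 20 as the per-group sample size, and uses hypothetical function names.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()

def prob_p_below(a, d, n_per_group):
    """P(two-sided p < a) for a two-group z-test when the true effect is d."""
    delta = d * sqrt(n_per_group / 2)  # expected z-score
    z_a = STD.inv_cdf(1 - a / 2)
    return (1 - STD.cdf(z_a - delta)) + STD.cdf(-z_a - delta)

def p_curve_shares(d, n_per_group):
    """Share of significant p-values in each .01-wide bin below .05."""
    edges = [0.01, 0.02, 0.03, 0.04, 0.05]
    cum = [prob_p_below(a, d, n_per_group) for a in edges]
    power = cum[-1]  # probability of any significant result
    bins = [cum[0]] + [cum[i] - cum[i - 1] for i in range(1, len(cum))]
    return [b / power for b in bins]

low_power = p_curve_shares(d=0.3, n_per_group=20)    # badly underpowered
high_power = p_curve_shares(d=0.3, n_per_group=175)  # roughly 80% power here
print([round(s, 2) for s in low_power])
print([round(s, 2) for s in high_power])
```

With a true effect of zero, each bin holds exactly 20% of the significant p-values. The underpowered case sits much closer to that flat line than the well-powered case does, which is why a p-curve has little to work with in a dismally powered literature.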
To test the predictive power of these metrics at the article level, I asked whether they are related to outcomes in ML1 and ML3. Why ML1 and ML3? I hoped that their large samples would yield fairly reliable outcome measures, and I wanted to have enough effects to have a hope of detecting a relationship, so using just one of them would be inadequate. If all the effects were usable, I would have a sample of 23 effects. Not very big, but powered at 80% to detect a correlation of r = .55. That might be optimistic, and it would be better to have even more effects included. But if I am going to use these metrics to make dismissive judgments about the replicability of effects in specific articles, I’d hope that the relationship is fairly strong.
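The power figure above can be checked approximately. This sketch assumes a two-sided test at alpha = .05 and uses the Fisher z approximation; the post does not say how the original calculation was done, so treat the method and the function name `correlation_power` as assumptions.

```python
from math import atanh, sqrt
from statistics import NormalDist

def correlation_power(r, n, alpha=0.05):
    """Approximate two-sided power to detect a population correlation r with n pairs."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    # Fisher z-transformed r is roughly normal with standard error 1 / sqrt(n - 3)
    delta = atanh(r) * sqrt(n - 3)
    return (1 - std.cdf(z_crit - delta)) + std.cdf(-z_crit - delta)

print(round(correlation_power(0.55, 23), 2))
```

With 23 effects and r = .55, the approximation lands close to the 80% figure. Smaller samples or weaker relationships drop power quickly, which matters once effects start being excluded.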
There are many ways to operationalize replicability. For example, one might consider whether the original study had sufficient power to detect the observed effect size of the replication (Simonsohn’s “small telescopes”). Or whether the replication effect size falls within a prediction interval based on the original and replication effect sizes. I decided to focus on the two outcomes I hear most often discussed:
Difference in effect size (continuous): How much the replication effect size (converted to Cohen’s d) differed from the original effect size (converted to the same scale).
Replication success (dichotomous): Whether the replication rejected the null hypothesis of no effect at p < .05.
These are not perfect outcome operationalizations, but I believe they represent the ways many people evaluate replications. Since the research question was driven by the kinds of judgments I and others seem to be making, it made sense to me to operationalize replication outcomes in ways that appear to be common in replication discourse.
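For concreteness, the two outcomes can be written as a small function. This is a sketch with made-up numbers, not the code from the analysis; the r-to-d formula is the standard conversion, which assumes equal group sizes, and the post does not specify which conversions were actually used.

```python
from math import sqrt

def r_to_d(r):
    """Convert a correlation to Cohen's d (standard formula, assumes equal group sizes)."""
    return 2 * r / sqrt(1 - r ** 2)

def replication_outcomes(original_d, replication_d, replication_p, alpha=0.05):
    """Return the continuous and dichotomous outcomes for one effect."""
    return {
        # Continuous: replication minus original, both on the Cohen's d scale
        "effect_size_difference": replication_d - original_d,
        # Dichotomous: did the replication reject the null at p < alpha?
        "replication_success": replication_p < alpha,
    }

# Hypothetical effect: original d = 0.60, replication d = 0.25, replication p = .003
print(replication_outcomes(0.60, 0.25, 0.003))
```

Note that the two outcomes can disagree: a replication can be significant while still showing a much smaller effect than the original.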
I again decided to use the metrics that I hear discussed frequently as predictors. These are the same metrics mentioned in the Introduction:
P-Curve: Evidential Value: Test statistic (z) for evidential value of a set of studies based on p-values. Z-scores less than -1.64 indicate evidential value.
P-Curve: Lacks Evidential Value: Test statistic (z) for a lack of evidential value (power less than 33%) of a set of studies based on p-values. Z-scores less than -1.64 indicate lack of evidential value.
R-Index: The difference between median post-hoc power of a set of studies and the “inflation” in the studies. Inflation is defined as the proportion of significant results minus the expected proportion of significant results. Higher values indicate greater replicability.
Test of Insufficient Variance (TIVA): The variance in the converted z-scores of test statistics. For heterogeneous sets of studies (i.e., studies with different sample sizes or different methods), variance should be greater than 1. Variance less than 1 indicates that some studies have been censored.
Correlation Between Effect Size and N: Pearson correlation between the observed effect sizes and sample sizes in a paper. Negative correlations may indicate publication bias.
N-Pact Factor: Median sample size of included tests. Higher values generally indicate greater power.
Because many of these indices are based on the same information (test statistics, p-values, sample sizes), we can expect them to be correlated. For this reason, I decided to evaluate them individually. Models including multiple predictors might exhibit multicollinearity and be unsuitable.
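To show how much these indices share, here is a minimal sketch that computes simplified versions of several of them from the same few numbers. Everything in it is an assumption for illustration: the data are invented, the function names are hypothetical, the p-curve statistic uses a Stouffer-style combination (suggested by the z-score threshold above), post-hoc power uses a normal approximation, and the 33%-power p-curve test and TIVA's formal significance test are left out.

```python
from math import sqrt
from statistics import NormalDist, median, variance

STD = NormalDist()
ALPHA = 0.05
Z_CRIT = STD.inv_cdf(1 - ALPHA / 2)

def p_to_z(p):
    """Convert a two-sided p-value to an absolute z-score."""
    return STD.inv_cdf(1 - p / 2)

def p_curve_right_skew_z(p_values):
    """Stouffer-combined z for right skew; values below -1.64 suggest evidential value."""
    significant = [p for p in p_values if 0 < p < ALPHA]
    if not significant:
        raise ValueError("need at least one significant p-value")
    # Under the null, p / ALPHA is uniform on (0, 1) among significant results
    pp_z = [STD.inv_cdf(p / ALPHA) for p in significant]
    return sum(pp_z) / sqrt(len(pp_z))

def observed_power(z):
    """Post-hoc power of a two-sided z-test, treating the observed z as the true effect."""
    return (1 - STD.cdf(Z_CRIT - z)) + STD.cdf(-Z_CRIT - z)

def r_index(p_values):
    """Median post-hoc power minus inflation (success rate minus median power)."""
    med_power = median([observed_power(p_to_z(p)) for p in p_values])
    success_rate = sum(p < ALPHA for p in p_values) / len(p_values)
    return med_power - (success_rate - med_power)

def tiva_variance(p_values):
    """Sample variance of the converted z-scores; values below 1 hint at censoring."""
    return variance([p_to_z(p) for p in p_values])

def pearson_r(xs, ys):
    """Plain Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical four-study paper in which every result is just barely significant
p_values = [0.04, 0.03, 0.045, 0.02]
sample_sizes = [24, 30, 20, 40]
effect_sizes = [0.62, 0.58, 0.66, 0.52]

print("p-curve z:", round(p_curve_right_skew_z(p_values), 2))
print("R-Index:", round(r_index(p_values), 2))
print("TIVA variance:", round(tiva_variance(p_values), 3))
print("r(ES, N):", round(pearson_r(effect_sizes, sample_sizes), 2))
print("N-Pact:", median(sample_sizes))
```

Every index except the N-Pact factor is derived from the same p-values here, so a paper that looks bad on one will tend to look bad on the others. That overlap is the reason for evaluating them one at a time.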
ML1 and ML3 replicate effects from 22 articles:
Anchoring (4 effects)
People’s quantitative judgments are biased after seeing too large or too small estimates
People report watching more TV when the response scale ranges from “up to half an hour” to “more than two and a half hours” than when it ranges from “up to two and a half hours” to “more than four and a half hours”
Schwarz, N., Hippler, H.-J., Deutsch, B., & Strack, F. (1985). Response scales: Effects of category range on reported behavior and comparative judgments. The Public Opinion Quarterly, 49, 388–395. http://doi.org/10.1086/268936
People endorse a quotation more strongly when it is attributed to a liked rather than a disliked figure
People are more likely to say that foreign reporters should be allowed into their home country after first being asked whether a foreign country should allow reporters from their country
Hyman, H. H., & Sheatsley, P. B. (1950). The current status of American public opinion. In J. C. Payne (Ed.), The teaching of contemporary affairs: 21st yearbook of the National Council of Social Studies (pp. 11–34). New York, NY: National Council of Social Studies.
People are more likely to brave the cold to use a ticket they bought vs. one that was free.
Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45, 867–872. http://doi.org/10.1016/j.jesp.2009.03.009
Imagining contact with people from different ethnic groups reduces prejudice towards those groups
People report more conservative attitudes after subtle exposure to the US flag than after no exposure
Carter, T. J., Ferguson, M. J., & Hassin, R. R. (2011). A single exposure to the American flag shifts support toward Republicanism up to 8 months later. Psychological Science, 22, 1011–1018. http://doi.org/10.1177/0956797611414726
People endorse the status quo more strongly after exposure to an image of money than after no exposure
Caruso, E. M., Vohs, K. D., Baxter, B., & Waytz, A. (2012). Mere exposure to money increases endorsement of free-market systems and social inequality. Journal of Experimental Psychology: General, 142, 301–306. http://doi.org/10.1037/a0029288
People are slower to name the font color of a color word when the word names a different color than when it names the same color
People interpret an ambiguous temporal statement differently depending on whether they had just completed a spatial task in a frame of reference involving an object and a stick figure labeled “you” or involving only objects
Persistence as measured by one personality measure is positively correlated with conscientiousness as measured by another
De Fruyt, F., & Van De Wiele, L. (2000). Cloninger’s psychobiological model of temperament and character and the five-factor model of personality. Personality and Individual Differences, 29, 441–452. http://doi.org/10.1016/S0191-8869(99)00204-4
Power and perspective-taking
People made to feel high in power perform poorer on a perspective-taking task
People judge a room to be warmer after reading about a communal rather than an agentic person
Szymkow, A., Chandler, J., IJzerman, H., Parzuchowski, M. & Wojciszke, B. (2013). Warmer hearts, warmer rooms: How positive communal traits increase estimates of ambient temperature. Social Psychology, 44, 167-176. http://dx.doi.org/10.1027/1864-9335/a000147
Elaboration likelihood model
People high in need for cognition differ more in their judgments of the persuasiveness of strong and weak arguments than people low in need for cognition
Cacioppo, J. T., Petty, R. E., & Morris, K. J. (1983). Effects of need for cognition on message evaluation, recall, and persuasion. Journal of Personality and Social Psychology, 45, 805–818. http://doi.org/10.1037/0022-3514.45.4.805
Self-Esteem and subjective distance
People high in self-esteem judge past positive and negative events as more different in subjective temporal distance than participants with low self-esteem
Ross, M., & Wilson, A. E. (2002). It feels like yesterday: Self-esteem, valence of personal past experiences, and judgments of subjective distance. Journal of Personality and Social Psychology, 82, 792–803. http://doi.org/10.1037//0022-3514.82.5.792
Credentials and prejudice
People are more likely to express prejudiced attitudes when they first have an opportunity to show that they are not prejudiced
Unfortunately, out of the 22 articles, several were excluded from the analyses.
One of the articles (Norm of reciprocity; Hyman & Sheatsley, 1950) was not available online or from the ML1 team. I found a catalog entry for the book at my library, but upon taking it out found it was the wrong volume of the series. I have an interlibrary loan request pending for the correct volume, but it has not yet been delivered. Since complete information for this article could not be gathered, it was excluded.
I next excluded any articles that reported only a single study: Anchoring, Allowed/Forbidden, Quote Attribution, and Persistence and conscientiousness. Some of the metrics (e.g., p-curve, correlation between effect size and sample size) require multiple studies to compute.
One further paper was excluded. The Sunk Costs effect originated in a paper (Thaler, 1985) that does not report null-hypothesis significance testing and so cannot be used for most of the metrics. The protocol used in the replication was derived from work by Oppenheimer and colleagues (2009), reported in an article testing the utility of instructional manipulation checks for studies conducted on Mechanical Turk. This paper is not suitable for inclusion because the remaining studies in the paper do not address the same theoretical question as the replicated effect.
This left 16 papers in the final sample, all of which could be evaluated on the dichotomous outcome. However, the original papers reporting the three effects involving interactions (Elaboration Likelihood Model, Self-Esteem and Subjective Distance, and Credentials and Prejudice) did not report enough information (or did not have appropriate designs) to estimate effect sizes in Cohen’s d. Thus, these three effects were excluded from analyses involving the continuous outcome, reducing the number of included effects for those analyses to 13.
Papers with Multiple Replicated Effects
Two of the sampled papers (Jacowitz & Kahneman, 1995; Nosek et al., 2002) had multiple effects included in the replication studies. Multiple outcomes from the same article should not be evaluated separately as their predictor metrics are not independent. I decided to average the outcomes for effects that are testing the same theoretical relationship and to focus on the effect that was the primary theoretical target of the original paper when the effects were testing different theoretical relationships. This meant that the Anchoring effects were averaged and the implicit/explicit attitude correlation was excluded from analyses.
All of the data and code for data manipulation and primary analyses are on Github. Please feel free to ask me questions, open issues, send pull requests, or ask for files in other formats. I am happy to share.
Before I dive into results, I want to show some important features of the data. First, let’s look at the continuous outcome. The figure below shows original effect sizes on the left and replication effect sizes on the right, for those papers reporting enough information to calculate Cohen’s d. Lines connect the effect sizes from the same effects. There’s a filter that allows you to show only those effects included in the analyses (i.e., omitting those from papers that contained only one study or which I couldn’t access).
Try switching from showing all effects to showing only those whose original articles are included in the analyses. Notice that just about all of the effects whose size grew from original to replication are excluded. This might be problematic. How often do effect sizes grow from original to replication, and how do the original papers reporting such studies differ from other papers? Unfortunately, there is not enough information here to answer these questions. But perhaps there is bias in the continuous outcome in our remaining studies.
Next, we should look at the distributions of the predictors. The figure below allows you to select which predictor you want to see and to switch among the full set of 16 effects, the set of 13 effects included in the effect size analyses, and the set of 3 effects excluded from those analyses.
One thing you might notice is that for most metrics, the scores are clustered in the “worse” end. The papers included in this sample have, mostly, low power, small sample sizes, and low variance in their z-scores. This would be consistent with a fairly biased or underpowered literature. So low predictive validity of these metrics would not be very surprising, given the small sample size and the low variability in the predictors. Models based on these metrics might be treated with suspicion, given their poor variability.
Ok, now that we’ve looked at the data a bit and considered the ways in which they are less-than-ideal (small sample, perhaps unrepresentative on at least one outcome, and low variability in predictors), we should be pretty well-prepared for the results.
Difference in Effect Size
For this blog post, I’ve used the absolute value of the difference between replication and original effect sizes rather than the raw difference. Since only one of the effects has a positive difference, and it’s a change of d = .01, this essentially changes only the signs of the relationships, not their sizes. It just makes the plots and correlations a bit more intuitive to interpret: more positive values indicate greater disparity between original and replication results. The poster uses the raw difference scores (replication – original), if you want to compare.
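To make the outcome concrete, here is a minimal sketch with made-up numbers. These are not the project's data, and all of the variable names are hypothetical. It computes the absolute difference for 13 pairs of effect sizes and correlates it with one invented predictor. A Pearson correlation on 13 pairs has 13 - 2 = 11 degrees of freedom.

```r
# Hypothetical effect sizes (Cohen's d) for 13 original/replication pairs.
# These values are simulated for illustration only; they are NOT the real data.
set.seed(42)
original    <- runif(13, min = 0.2, max = 1.2)
replication <- original - runif(13, min = 0, max = 0.8)

# Absolute discrepancy: larger values mean the two results disagree more.
es_diff <- abs(replication - original)

# A made-up article-level predictor (e.g., the median N of the original paper).
median_n <- round(runif(13, min = 20, max = 150))

# Pearson correlation; degrees of freedom = number of pairs - 2.
fit <- cor.test(es_diff, median_n)
fit$parameter  # degrees of freedom
```

Because the simulated replication effects are never larger than the originals here, taking the absolute value only flips the sign of the raw difference, which is the situation described above.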
Below is a table of correlations. The first column represents correlations with effect size difference. The degrees of freedom for all of these correlations are 11. The correlations in the other cells include all 16 of the effects and so have 14 degrees of freedom.
1. Difference in Effect Size
2. P-Curve: Evidential Value
3. P-Curve: Lacks Evidential Value
5. TIVA: Variance of Z
6. Correlation between ES and N
7. N-Pact Factor: Median N
Overall, the correlations with effect size difference are moderately strong, but notice that some of them are in the opposite direction one would predict. Z-scores for evidential value decrease as effect size discrepancy increases. But negative z-scores indicate evidential value, so we would expect higher z-scores to be associated with greater discrepancy. For the other p-curve metric, positive z-scores indicate lack of evidential value, but the correlation is negative. Instead of discrepancy rising with lack of evidential value, it falls. Higher R-Index values should predict better replicability, but here they are associated with larger effect size differences. The last two correlations are in the predicted direction. As the article-level correlation between effect size and sample size increases (as potential publication bias increases), so does the difference between original and replication outcomes. And as N-pact factor (power) increases, discrepancy decreases (a bit).
But our predictors didn’t have much variability. Let’s see what these relationships look like plotted:
We can see that only a few papers have p-curve z-scores much different from zero, or have variance in the z-scores of their studies (TIVA) greater than 1, or have sample sizes larger than 100. In general, the correlations are driven by just a few points that are outliers in their distributions.
Ok, let’s look quickly at the dichotomous outcome, with “successful” (p < .05) replications coded as 1 and “unsuccessful” (p >= .05) coded as 0. Here are the results of six different logistic regression models predicting this outcome:
1. P-Curve: Evidential Value
2. P-Curve: Lacks Evidential Value
4. TIVA: Variance in Z
5. Correlation Between Effect Size and N
6. N-Pact Factor
Now, the p-curve relationships are in the expected direction. As is R-Index. And TIVA. We’ll plot the predicted probability of success.
We can see that, at least, the p-curve values most strongly indicating evidential value are associated with replication successes. But there are also a couple of replication “failures” that come from papers with evidential value z-scores above 1.64, i.e., where there is evidential value in that set of studies (but not necessarily in the replicated study). And there are a few successes that come from papers that do not have evidential value (though there are none from the two papers that significantly lack evidential value). So while p-curve performs the best at predicting replication success from article-level information in this set of studies, it’s definitely not predicting all of the outcomes.
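For anyone who wants to fit this kind of model themselves, here is a rough sketch of a single-predictor logistic regression and its predicted probabilities. The data are simulated (16 made-up effects with one invented metric), so the numbers it produces say nothing about the actual results. The names `metric`, `success`, and `grid` are hypothetical.

```r
# Hypothetical data: 16 effects with a made-up article-level metric and a
# made-up replication outcome (1 = p < .05 in the replication, 0 = otherwise).
set.seed(7)
metric  <- rnorm(16)
success <- rbinom(16, size = 1, prob = plogis(-0.5 + metric))

# Logistic regression predicting replication success from one metric.
fit <- glm(success ~ metric, family = binomial)

# Predicted probability of success across the observed range of the metric.
grid <- data.frame(metric = seq(min(metric), max(metric), length.out = 50))
grid$p_success <- predict(fit, newdata = grid, type = "response")

# Observed outcomes as points, predicted probabilities as a curve.
plot(metric, success, xlab = "Article-level metric",
     ylab = "Replication success")
lines(grid$metric, grid$p_success)
```

With only 16 observations, a model like this can easily be dominated by one or two points, which is the same caution that applies to the correlations above.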
Some Closing Thoughts
I’m very grateful for the attention this project has received and the many helpful comments and thought-provoking questions people have sent on Twitter, Facebook, email, and Github. I hope that this blog has clarified many points about methodology and data quality. There is a lot to be desired here. And I look forward to expanding this project to make it more useful.
In particular, as a few people have suggested, it’s important to have more data. Adding replications from the Reproducibility Project, a special issue of Social Psychology, the soon-to-be-released Many Labs 2, and perhaps other sources will hopefully increase the sample size, the reliability of the estimates, and the power to detect relationships.
In addition, there may be other outcomes that would be useful. Perhaps whether the original study had the power to detect the replication effect. Or, for Many Labs studies, the proportion of labs returning significant results. That said, no replication outcome metric can replace informed scientific judgment. If a replication is poorly conducted or has low power, we should not evaluate the replicability of the effect in the same way as when the replication is competently conducted and high powered. Likewise, if an original study used invalid manipulations or measures, what does its replicability matter?
It also remains possible that article-level metrics don’t predict replication outcomes for the current psychological literature. If the literature is extremely low-powered and biased, these metrics may be capable of telling us only that the literature is low-powered and biased but not which effects are likely to replicate. Or there may be too little information contained in single papers to predict replication outcomes of their studies. Perhaps author- or journal-level metrics would be better indicators, though certainly harder to assess off-the-cuff.
For now, I would probably make three recommendations. First, I would apply caution in using these metrics to make judgments about the replicability of studies from single papers. It’s not clear that they can diagnose the replicability of studies in this way. Second, when reporting studies, include enough information to aid in meta-analytic and meta-scientific research. This means including full model specifications, cell sizes, and test statistics (if you’re using frequentist statistics, please report more than a p-value). Finally, conduct more replications.
Psychologists have been talking about a research practice that goes something like this: I have a hypothesis that people are happier after they listen to Taylor Swift’s “Shake It Off” than after they listen to that Baz Luhrmann song about sunscreen. So I play “Shake It Off” to some people and “Everybody’s Free to Wear Sunscreen” to some other people. Then, I ask everyone how happy they are. I see that the people who listened to Taylor Swift rated themselves a little higher on my happiness scale than the people who listened to Baz Luhrmann. But this difference isn’t statistically significant.
So I play each of the songs to a few more people. Then, I pool my new data with the data from before and run my statistical test again. Now the difference is significant! I have something I can publish!
This is one form of “p-hacking,” or running multiple statistical tests in order to get a significant result where there wasn’t one before. A while ago, Ryne Sherman wrote an R function that simulates this process. The details of it are over at his blog. His simulations showed that, as expected, determining sample size by looking intermittently at the data increases false positives when there’s no real difference between the groups. I’ll be using his function to look at what happens when my hypotheses are correct.
But first, just to demonstrate how it works, let’s take a look at what happens when there really is no difference between groups.
For my simulations, I am starting with 30 participants per condition and adding 30 more per condition each time I find p >= .05, up to a maximum of 270 per condition with a 2-sided t-test. Then, I’m repeating the study 9,999 more times.
Here’s what happens when the null hypothesis is true (people are just as happy after Taylor Swift as after Baz Luhrmann):
source("http://rynesherman.com/phack.r") # read in Ryne Sherman's function
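res.null <- phack(initialN=30, # start with 30 participants per condition
                  hackrate=30, # add 30 per condition after each p >= .05
                  grp1M=0, # Group 1 has a mean of 0
                  grp2M=0, # So does Group 2
                  grp1SD=1, # Group 1 has an SD of 1
                  grp2SD=1, # So does Group 2
                  maxN=270, # stop at 270 per condition
                  alpha=.05,
                  alternative="two.sided",
                  sims=10000) # hackrate, maxN, alpha, alternative, sims: names assumed from phack()'s signature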
## Loading required package: psych
## Proportion of Original Samples Statistically Significant = 0.049
## Proportion of Samples Statistically Significant After Hacking = 0.1898
## Probability of Stopping Before Reaching Significance = 0.819
## Average Number of Hacks Before Significant/Stopping = 6.973
## Average N Added Before Significant/Stopping = 209.19
## Average Total N 239.19
## Estimated r without hacking 0
## Estimated r with hacking 0
## Estimated r with hacking 0 (non-significant results not included)
The first line of the output tells me what proportion of times my first batch of 60 participants (30 per cell) was significant. As expected, it’s 5% of the time.
The second line tells me what proportion of times I achieved significance overall, including when I added more batches of 60 participants. That’s a much higher number, 19%.
Wow. I can nearly quadruple my hit rate by looking at the data intermittently! Almost one in five studies now returns a hit.
The Average Total N is the average number of participants I ran per cell before I stopped collecting data. It’s 239. If I am collecting data on Mechanical Turk, getting 239 people to listen to Taylor Swift and 239 to listen to Baz Luhrmann is a cake-walk. I could collect hits very easily by running tons of studies on mTurk. I’d be very productive (in terms of publication count) this way. But all of my “hits” would be false positives, and all of my papers would be reporting on false findings.
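If you would rather not source a remote script, the same stopping rule can be sketched in a few lines of base R. This is my own minimal version, not Sherman's code: the function name `simulate_peeking` and its arguments are hypothetical, it uses an equal-variance two-sided t-test (which may differ from what `phack()` does internally), and it runs fewer simulations to keep it quick.

```r
# Simulate one study with optional stopping: test after every batch and
# stop as soon as p < alpha or the per-group cap is reached.
simulate_peeking <- function(d = 0, initial_n = 30, step = 30,
                             max_n = 270, alpha = .05) {
  g1 <- rnorm(initial_n, mean = d)  # "Taylor Swift" group
  g2 <- rnorm(initial_n, mean = 0)  # "Baz Luhrmann" group
  p_first <- t.test(g1, g2, var.equal = TRUE)$p.value
  p <- p_first
  while (p >= alpha && length(g1) < max_n) {
    g1 <- c(g1, rnorm(step, mean = d))  # add another batch to each group
    g2 <- c(g2, rnorm(step, mean = 0))
    p <- t.test(g1, g2, var.equal = TRUE)$p.value
  }
  c(first_sig = p_first < alpha,   # significant at the first look?
    final_sig = p < alpha,         # significant at any look?
    n_per_group = length(g1))      # sample size per group when stopping
}

set.seed(1)
sims <- replicate(2000, simulate_peeking(d = 0))  # null is true
rowMeans(sims)  # proportion significant at first look, at any look, mean n
```

Under the null, the proportion significant at the first look should sit near the nominal 5%, while the proportion significant at any look should be several times higher. Changing `d` to .2 or .4 gives the true-effect versions of the same design.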
But what about when the null is false?
The first simulation assumed that there really is no difference between the groups. But I probably don’t really think that is true. More likely, I think there is a difference. I expect the Taylor Swift group to score higher than the Baz Luhrmann group. I don’t know how much higher. Maybe it’s a small effect, d = .2.
So, what happens when people really are happier listening to Taylor Swift?
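res.small <- phack(initialN=30, hackrate=30,
                   grp1M=.2, # Group 1 now has a mean of .2
                   grp2M=0, grp1SD=1, grp2SD=1, maxN=270, alpha=.05,
                   alternative="two.sided", sims=10000) # other arguments assumed unchanged from the null simulation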
## Proportion of Original Samples Statistically Significant = 0.1205
## Proportion of Samples Statistically Significant After Hacking = 0.744
## Probability of Stopping Before Reaching Significance = 0.3006
## Average Number of Hacks Before Significant/Stopping = 4.4569
## Average N Added Before Significant/Stopping = 133.707
## Average Total N 163.707
## Estimated r without hacking 0.1
## Estimated r with hacking 0.14
## Estimated r with hacking 0.17 (non-significant results not included)
Holy hit rate, Batman! Now I’m seeing p < .05 almost 75% of the time! And this time, they are true positives!
Sure, my effect size estimate is inflated if I publish only my significant results, but I am generating significant findings at an outstanding rate.
Not only that, but I’m stopping on average after 164 participants per condition. How many participants would I need to have 75% success if I only looked at my data once? I need a power analysis for that.
library(pwr) # provides pwr.t.test()
pwr.t.test(d = .2,
           sig.level = 0.05,
           power = .75,
           type = "two.sample",
           alternative = "two.sided")
## Two-sample t test power calculation
## n = 347.9784
## d = 0.2
## sig.level = 0.05
## power = 0.75
## alternative = two.sided
## NOTE: n is number in *each* group
348 participants per condition!! That’s more than twice as many! The other way is MUCH more efficient. My Taylor Swift = happiness paper is going to press really quickly!
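The same fixed-sample calculation can be checked without installing pwr, using `power.t.test()` from base R's stats package. This is just a cross-check under the same assumptions (two-sided, two-sample, SD of 1 in each group); the comparison to 164 uses the average stopping point from the simulation above.

```r
# Base R version of the power analysis; no extra package needed.
# delta is the raw mean difference; with sd = 1 it equals Cohen's d.
pa <- power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.75,
                   type = "two.sample", alternative = "two.sided")

ceiling(pa$n)        # participants needed per group with a single look
ceiling(pa$n) / 164  # relative to the average per-group n with peeking
```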
What if the effect were moderate? Say, d = .4?
Here are the simulations for d = .4:
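res.moder <- phack(initialN=30, hackrate=30,
                   grp1M=.4, # Group 1 now has a mean of .4
                   grp2M=0, grp1SD=1, grp2SD=1, maxN=270, alpha=.05,
                   alternative="two.sided", sims=10000) # other arguments assumed unchanged from the null simulation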
## Proportion of Original Samples Statistically Significant = 0.3348
## Proportion of Samples Statistically Significant After Hacking = 0.9982
## Probability of Stopping Before Reaching Significance = 0.005
## Average Number of Hacks Before Significant/Stopping = 1.424
## Average N Added Before Significant/Stopping = 42.72
## Average Total N 72.72
## Estimated r without hacking 0.2
## Estimated r with hacking 0.24
## Estimated r with hacking 0.24 (non-significant results not included)
BOOM!! Batting a thousand! (Ok, .998, but that’s still really good!!)
And with only 73 participants per condition!
I’m rolling in publications! I can’t write fast enough to publish all these results.
And what would I have to do normally to get 99.8% success?
pwr.t.test(d = .4,
sig.level = 0.05,
power = .998,
type = "two.sample",
alternative = "two.sided")
## Two-sample t test power calculation
## n = 293.5578
## d = 0.4
## sig.level = 0.05
## power = 0.998
## alternative = two.sided
## NOTE: n is number in *each* group
Dang. That’s FOUR TIMES as many participants. Looking at the data multiple times wins again.
But am I really right all the time?
So looking at my data intermittently is actually a super effective way to reach p < .05 when I have even small true effects.1 It could lead to faster research, more publications, and less participant time used! Those are substantial benefits. On the downside, I would get to play “Shake It Off” for fewer people.
Looking at data multiple times makes it easier to get true positives.
And I’m only studying true effects, right?
p-hacking only seems like a problem if I accept that I might be studying false effects.2 Which I almost certainly am. At least some of the time.
But the problem is that I don’t know ahead of time which hypotheses are true or false. That’s why I am doing research to begin with.
It also seems that when I am studying true effects, and I am willing to collect large-ish samples,3 intermittent looking should yield a high hit rate. And I should be able to achieve that rate without needing to do anything else, such as dropping conditions, to achieve my desired p-value.4 If I am looking at my data intermittently, a low hit rate should make me consider that my hypothesis is wrong – or at the very least that I am studying a very small effect.
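Intermittent looking does not have to inflate false positives if the threshold at each look is made stricter. As one illustration (my choice of method, not a recommendation from anyone cited here), Pocock's constant boundary for four equally spaced looks uses a nominal two-sided threshold of about .0182 at every look to keep the overall rate near .05. That value is derived for normally distributed test statistics, so with t-tests it is approximate. The function name `one_study` and the look schedule are hypothetical.

```r
# One study under the null with up to four looks (30, 60, 90, 120 per group).
# Returns TRUE if any look crosses the per-look threshold.
one_study <- function(looks = c(30, 60, 90, 120), alpha_per_look = 0.0182) {
  g1 <- rnorm(max(looks))  # both groups drawn from the same population
  g2 <- rnorm(max(looks))
  for (n in looks) {
    p <- t.test(g1[1:n], g2[1:n], var.equal = TRUE)$p.value
    if (p < alpha_per_look) return(TRUE)  # stop early and claim an effect
  }
  FALSE
}

set.seed(2)
corrected   <- mean(replicate(4000, one_study()))
uncorrected <- mean(replicate(4000, one_study(alpha_per_look = 0.05)))
c(corrected = corrected, uncorrected = uncorrected)
```

The corrected false positive rate should land close to the nominal 5%, while testing every look at .05 should come out well above it. The price of the correction is that each individual look needs stronger evidence before stopping.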
Alexander Etz (@AlxEtz) pointed out that it’s possible to look at the data more than once without increasing alpha. He’s right. And it can be efficient if I’m not interested in getting a precise effect size estimate. Daniel Lakens has a great post about doing this, as does Rolf Zwaan. Alex adds:
Some people much, much smarter than I am have already written about the “optimal” strategies for winning publications, and you should read their paper because it shows just how much these strategies bias publications.
Even if I am only willing to test 120 people in each condition, I find significant results 9%, 41%, and 90% of the time for d = 0, .2, and .4, respectively. For a small effect, even looking at my data just four times (at 30, 60, 90, and 120 participants per cell), my hit rate is quadruple that under the null hypothesis.
I also modified Sherman’s original code a bit to look at what happens if I only continue adding participants when the Taylor Swift mean is bigger (but not significantly) than the Baz Luhrmann mean. I was able to find a significant effect 9%, 59%, and 93% of the time for d = 0, .2, and .4, respectively. In other words, I can still expect to find a significant result more than half the time even for effects as small as d = .2, even if the only p-hacking I do is looking at my data intermittently.
Lately, I’ve been reading Thinking, Fast and Slow by Daniel Kahneman, whose work on judgment and decision-making can hardly be overstated in its importance (hey, he won a Nobel Prize for it, and there isn’t even a Nobel for psychology!).
In the book, Kahneman discusses early conversations with his long-time collaborator Amos Tversky and how he came to the realization that even people with years of statistical training and practice can fail in their statistical intuitions.
Here is Kahneman on intuitive statistics:
We had concluded in the seminar that our own intuitions were deficient. In spite of years of teaching and using statistics, we had not developed an intuitive sense of the reliability of statistical results observed in small samples. Our subjective judgments were biased: we were far too willing to believe research findings based on inadequate evidence and prone to collect too few observations in our own research.
Like most research psychologists, I had routinely chosen samples that were too small and had often obtained results that made no sense. Now I knew why: the odd results were actually artifacts of my research methods. My mistake was particularly embarrassing because I taught statistics and knew how to compute the sample size that would reduce the risk of failure to an acceptable level. But I had never chosen a sample size by computation. Like my colleagues, I had trusted tradition and my intuition in planning my experiments and had never thought seriously about the issue. When Amos visited the seminar, I had already reached the conclusion that my intuitions were deficient…
These confessions of past errors, coming from such an eminent scientist, are powerful reminders to the rest of us to question our intuitive assumptions, use larger samples, and admit to our own faults.
Champaign-Urbana is finally experiencing beautiful spring weather after a brutal (for central Illinois) winter, so I spent Saturday afternoon digging out my garden and filling in the holes that my dogs have dug in the yard. As I pulled up chunks of grass + dirt, I had the idea to transplant these hunks to the newly-filled holes. The dogs have torn up a lot of our grass, and maybe I could use this as an opportunity to patch up some of the dead spots.
Will it work? I don’t know. It’s an experiment, I told myself.
Scientist-me immediately chimed in: That’s not an experiment! Where is my control group? As a scientist, I should be more careful about how I use words like “experiment.”
Then forgiving-me added soothingly: It’s ok. This is just how people use the word “experiment” when they are not doing professional science. And right now, I am not doing professional science. I am just digging in my garden. Colloquially, an experiment is just a process whose outcome is unknown. I don’t know if the grass will grow. It probably won’t. But I will just do it and see.
Scientist-me chewed this over. Wouldn’t it be nice if scientists also did not know in advance the outcomes of their experiments? When one spends a lot of time and careful thought developing theories and deriving predictions from them, it is easy to feel like one knows what the outcome will be. And this can lead to confirming what one “knows” by dropping measurements that do not verify this knowledge, changing how one calculates statistics so that one can draw the inferences one knew all along, or re-running the same experiment until the conclusions align with what one knows. And each of these sources of confirmation can feel in the moment as if it is justified: Of course these data need to be dropped, the other measures were the ones that really mattered! And so on.
But that’s not science. That’s just digging.
I hope that scientist-me can learn to be a little bit more like intuitive-gardener-me: genuinely curious about the world, open-minded about the possibility that my ideas may not work (and that they are still worth testing anyway), and seriously in love with tomatoes. Ok, the tomatoes may not help my science much, but they will make me happy anyway.