Article Level Metrics and Many Labs Replication Outcomes

Update: I’ve edited this page slightly for clarity and proofreading and to correct an error. Before doing so, I archived the original version of the post. You can see the revision history at the Internet Archive.

This blog post provides additional details and analyses for the poster I am presenting at the 2016 meeting of the Society for Personality and Social Psychology. If you’ll be at SPSP and want to chat, come by my poster during Poster Session E on Friday at 12-1:30pm. I’ll be at poster board 258.


Over the past several years, psychologists have become increasingly concerned about the quality and replicability of the research in their field. To what extent are the findings reported in psychology journals “false positives” (reports of effects where none truly exist)? Researchers have attempted to answer this question using two different approaches: by replicating previous research and by developing a series of research quality metrics that attempt to quantify the evidential value, replicability, power, and bias of the research literature.


As part of this movement, a wave of replication studies has been published, including several large-scale projects. The results of these projects have been mixed. The Many Labs 1 (ML1) project involved 36 labs all running replications of the same 13 effects (16 effects, if you count the four anchoring effects separately). In aggregate analyses on data from all of the labs, the authors failed to reject the null hypothesis of no effect for only two of the effects. Many Labs 3 (ML3), following a similar model, attempted to replicate 10 effects (plus 3 post-hoc additions of effects from three of the replicated studies) in 21 samples. This time, aggregate analyses of the 10 planned effects failed to reject the null hypothesis of no effect for seven of the effects. (Many Labs 2 is still in progress.) The Reproducibility Project: Psychology took a different approach to replication, selecting many effects from specific journals and replicating each in a single lab. Out of 97 replications, 62 failed to reject the null hypothesis of no effect, a similar rate to ML3. However, unlike ML3, these analyses were not based on large, aggregated samples. Across these three projects, in general, effect sizes shrank from original to replication. The overall replicability of psychological science remains unknown (and may not be a well-defined or readily quantifiable concept); however, it is clear that some effects can be observed relatively regularly and in many settings while others are difficult to observe, even with many subjects and carefully constructed protocols.


At the same time, concerns about research methods that inflate false positive rates and about the effect of publication bias on the veracity of reported research (as well as increasing awareness that traditional meta-analyses are threatened by publication bias) have driven researchers to develop new techniques to evaluate the literature. The p-curve, for example, tests the evidential value of a set of studies by looking at its distribution of p-values: the shape of the distribution changes depending on whether the effect being tested is true or null (and with the power of the test). The Replication Index (R-Index) attempts to quantify the replicability of a set of studies based on their post-hoc power (their power to detect an effect of the observed size). The Test of Insufficient Variance (TIVA) examines publication bias by asking whether the variance in p-values (converted to z-scores) is smaller than would be expected, suggesting that some results have been censored. A negative correlation between sample size and effect size may be taken as an indication that null results have been suppressed. Smaller samples will produce significant results only when the observed effect size is larger, but they are less powerful at detecting effects at all. So if all the small-sample studies report large effects, there may be lots of non-significant results missing from the literature. The N-Pact factor attempts to quantify the power of a set of studies by looking at its median sample size, with the assumption that more powerful studies create more reliable estimates. These tests and indices have become popular tools for evaluating the quality of research output. Researchers are using these metrics to examine the quality of journals, of their own work, and of the papers published by others.

Their Relationship to One Another

I’ve heard colleagues dismiss papers because they contain mostly high p-values, low variance in p-values, small sample sizes, or sample sizes correlated with effect sizes. Papers with these characteristics are perceived as unreplicable and unsound. I’ve made such inferences myself. Yet, most of these metrics are not explicitly designed to index replicability. So are these judgments about replicability justified? Do research quality metrics at the article level predict replication outcomes?

Intuitively, it makes sense to think they would. Assuming papers generally conclude that effects are real, sets of studies based on large samples, with adequate power, and not exploiting flexibility in analysis seem like they ought to contain more replicable results than sets of studies based on small samples with flexible analysis plans and low power.

But there are some reasons that these metrics may not be predictive in practice. Optimistically, if researchers are doing a bang-up job on powering their studies, we would actually observe a (perhaps modest) negative correlation between effect size and sample size, because no one would be wasting huge samples on huge effects or bothering with tiny samples for tiny effects. Pessimistically, if the literature is extremely biased or extremely underpowered, there may be no predictive power to the metrics in the present literature at all. For example, when power is very low, the distribution of p-values becomes fairly flat. (You can observe this for yourself here. Try an effect size of d = 0.3 and sample size of n = 20. This is also shown in the first p-curve paper.) Such a literature would also have small sample sizes and low post-hoc power regardless of whether effects are true, and thus little variability in the metrics. And we would expect poor replicability, even for true effects, if sample sizes for replications were not sufficiently larger than the original, dismally powered sample. So even if some of the metrics should predict replicability in principle, they may not do so in practice.
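The flattening of the p-curve at low power can be checked with a few lines of arithmetic. The sketch below is an illustration rather than the p-curve app itself: it assumes a two-group comparison, treats n = 20 as the per-group sample size, and replaces the t-test with a normal approximation. The function names are made up for this example. It reports, among significant results, the share with p < .025; a perfectly flat curve gives 0.5.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_p_below(alpha, d, n_per_group):
    """Approximate P(two-tailed p < alpha) for a two-group comparison.

    Normal approximation: the test statistic is treated as
    Normal(d * sqrt(n / 2), 1) instead of a noncentral t.
    """
    ncp = d * (n_per_group / 2) ** 0.5
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - ncp)
    lower_tail = STD_NORMAL.cdf(-z_crit - ncp)
    return upper_tail + lower_tail


def share_of_small_p(d, n_per_group):
    """Among results with p < .05, the share that also have p < .025.

    A flat p-curve gives 0.5; a right-skewed curve gives more than 0.5.
    """
    return prob_p_below(0.025, d, n_per_group) / prob_p_below(0.05, d, n_per_group)


for label, d, n in [("null effect, n = 20", 0.0, 20),
                    ("d = 0.3, n = 20", 0.3, 20),
                    ("d = 0.8, n = 50", 0.8, 50)]:
    print(f"{label:>20}: P(p < .05) ~ {prob_p_below(0.05, d, n):.2f}, "
          f"share with p < .025 ~ {share_of_small_p(d, n):.2f}")
```

Under these assumptions the low-powered case sits much closer to the flat null curve than the well-powered case does, which is the pattern described above.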

To test the predictive power of these metrics at the article level, I asked whether they are related to outcomes in ML1 and ML3. Why ML1 and ML3? I hoped that their large samples would yield fairly reliable outcome measures, and I wanted to have enough effects to have a hope of detecting a relationship, so using just one of them would be inadequate. If all the effects were usable, I would have a sample of 23 effects. Not very big, but powered at 80% to detect a correlation of r = .55. That might be optimistic, and it would be better to have even more effects included. But if I am going to use these metrics to make dismissive judgments about the replicability of effects in specific articles, I’d hope that the relationship is fairly strong.
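For readers who want to check the power figure, here is a minimal sketch using the Fisher z approximation for a two-tailed test of a correlation at alpha = .05. It is an approximation (I can't say which tool produced the number above), and the function name is made up for this example.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(r, n, alpha=0.05):
    """Approximate two-tailed power to detect a population correlation r
    with n observations, using the Fisher z transformation."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, atanh(sample r) is roughly
    # Normal(atanh(r), 1 / (n - 3)), so the standardized shift is:
    shift = atanh(r) * sqrt(n - 3)
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


for n in (23, 16, 13):
    print(n, round(correlation_power(0.55, n), 2))
```

With n = 23 the approximation lands close to the 80% figure quoted above. The other two sample sizes correspond to the 16 and 13 effects that survive the exclusions described later in this post, and power drops accordingly.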


Operationalizing Replicability

There are many ways to operationalize replicability. For example, one might consider whether the original study had sufficient power to detect the observed effect size of the replication (Simonsohn’s [2015] “small telescopes”). Or whether the replication effect size falls within a prediction interval based on the original and replication effect sizes. I decided to focus on the two outcomes I hear most often discussed:

  1. Difference in effect size (continuous): How much the replication effect size (converted to Cohen’s d) differed from the original effect size (converted to the same scale).
  2. Replication success (dichotomous): Whether the replication rejected the null hypothesis of no effect at p < .05.

These are not perfect outcome operationalizations, but I believe they represent the ways many people evaluate replications. Since the research question was driven by the kinds of judgments I and others seem to be making, it made sense to me to operationalize replication outcomes in ways that appear to be common in replication discourse.
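The two outcomes can be written down as a small sketch. The class and function names are made up for illustration and are not taken from the project's code. The r-to-d conversion shown is one common formula and is only an assumption about how such a conversion might be done; the numbers in the example are invented.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class ReplicationResult:
    original_d: float      # original effect size, Cohen's d
    replication_d: float   # replication effect size, Cohen's d
    replication_p: float   # p-value of the replication test


def r_to_d(r):
    """One common conversion from a correlation r to Cohen's d."""
    return 2 * r / sqrt(1 - r ** 2)


def effect_size_difference(result):
    """Continuous outcome: replication minus original effect size."""
    return result.replication_d - result.original_d


def replication_success(result, alpha=0.05):
    """Dichotomous outcome: 1 if the replication rejected the null, else 0."""
    return 1 if result.replication_p < alpha else 0


# Invented numbers, purely for illustration.
example = ReplicationResult(original_d=0.60, replication_d=0.25, replication_p=0.001)
print(effect_size_difference(example), replication_success(example))
```

Note that the two outcomes can disagree: in the invented example the effect shrinks by more than half, yet the replication still counts as a success.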

Selecting Predictors

I again decided to use the metrics that I hear discussed frequently as predictors. These are the same metrics mentioned in the Introduction:

  1. P-Curve: Evidential Value: Test statistic (z) for evidential value of a set of studies based on p-values. Z-scores less than -1.64 indicate evidential value.
  2. P-Curve: Lacks Evidential Value: Test statistic (z) for a lack of evidential value (power less than 33%) of a set of studies based on p-values. Z-scores less than -1.64 indicate lack of evidential value.
  3. R-Index: The difference between median post-hoc power of a set of studies and the “inflation” in the studies. Inflation is defined as the proportion of significant results minus the expected proportion of significant results. Higher values indicate greater replicability.
  4. Test of Insufficient Variance (TIVA): The variance in the converted z-scores of test statistics. For heterogeneous sets of studies (i.e., studies with different sample sizes or different methods), variance should be greater than 1. Variance less than 1 indicates that some studies have been censored.
  5. Correlation Between Effect Size and N: Pearson correlation between the observed effect sizes and sample sizes in a paper. Negative correlations may indicate publication bias.
  6. N-Pact Factor: Median sample size of included tests. Higher values generally indicate greater power.
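To make metrics 3 through 6 concrete, here is a rough sketch of how they could be computed from a paper's p-values, effect sizes, and sample sizes. This is not the code used for the analyses in this post. It rests on several assumptions: all p-values are two-tailed, post-hoc power is approximated from z-scores with one tail only, and the "expected proportion of significant results" in the R-Index is taken to be the median post-hoc power. All names are made up for this example, and the p-curve tests are left out because they need more machinery.

```python
from statistics import NormalDist, median, variance

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # critical z for two-tailed alpha = .05


def p_to_z(p):
    """Convert a two-tailed p-value to an absolute z-score."""
    return STD_NORMAL.inv_cdf(1 - p / 2)


def tiva_variance(p_values):
    """Sample variance of the converted z-scores (TIVA's key quantity)."""
    return variance([p_to_z(p) for p in p_values])


def r_index(p_values):
    """Median post-hoc power minus inflation.

    Inflation = proportion significant - median post-hoc power
    (an assumption about the 'expected proportion').
    """
    powers = [1 - STD_NORMAL.cdf(Z_CRIT - p_to_z(p)) for p in p_values]
    median_power = median(powers)
    success_rate = sum(p < 0.05 for p in p_values) / len(p_values)
    return median_power - (success_rate - median_power)


def n_pact(sample_sizes):
    """N-Pact Factor: median sample size of the included tests."""
    return median(sample_sizes)


def pearson_r(xs, ys):
    """Pearson correlation, e.g. between effect sizes and sample sizes."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


# Invented values for a hypothetical four-study paper.
p_values = [0.04, 0.03, 0.045, 0.02]
sample_sizes = [24, 30, 40, 36]
effect_sizes = [0.85, 0.80, 0.64, 0.78]

print("TIVA variance:", round(tiva_variance(p_values), 3))
print("R-Index:", round(r_index(p_values), 3))
print("N-Pact:", n_pact(sample_sizes))
print("r(ES, N):", round(pearson_r(effect_sizes, sample_sizes), 3))
```

The invented paper has four just-significant results from small samples, so it scores poorly on every metric: tiny variance in z, a low R-Index, a small median N, and a negative correlation between effect size and sample size.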

I followed Simonsohn, Nelson, & Simmons’s (2015) recommendations for the inclusion of tests, using only tests of critical hypotheses. The p-curve disclosure table is available here as dataEntrySheet.csv.

Because many of these indices are based on the same information (test statistics, p-values, sample sizes), we can expect them to be correlated. For this reason, I decided to evaluate them individually. Models including multiple predictors might exhibit multicollinearity and be unsuitable.


ML1 and ML3 replicate effects from 22 articles:

ML | Effect Name | Description | Citation
1 | Anchoring (4 effects) | People’s quantitative judgments are biased after seeing too large or too small estimates | Jacowitz, K. E., & Kahneman, D. (1995). Measures of anchoring in estimation tasks. Personality and Social Psychology Bulletin, 21, 1161–1166.
1 | Allowed/Forbidden | People are less likely to endorse banning anti-democracy speeches than to fail to endorse allowing them | Rugg, D. (1941). Experiments in wording questions: II. The Public Opinion Quarterly, 5, 91–92.
1 | Retrospective gambler fallacy | People think that a rare outcome is from a longer series of events than a more common outcome | Oppenheimer, D. M., & Monin, B. (2009). The retrospective gambler’s fallacy: Unlikely events, constructing the past, and multiple universes. Judgment and Decision Making, 4, 326–334.
1 | Gain vs loss framing | People are more willing to take risks to avoid losses than to procure gains | Tversky, A., & Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science, 211, 453–458.
1 | Sex differences in implicit math attitudes (and relation between I & E attitudes) | Women have more negative implicit math attitudes than men / Implicit and explicit math attitudes are positively correlated | Nosek, B. A., Banaji, M. R., & Greenwald, A. G. (2002). Math = male, me = female, therefore math ≠ me. Journal of Personality and Social Psychology, 83, 44–59.
1 | Low vs high category scales | People report watching more TV when the response scale ranges from “up to half an hour” to “more than two and a half hours” than when it ranges from “up to two and a half hours” to “more than four and a half hours” | Schwarz, N., Hippler, H.-J., Deutsch, B., & Strack, F. (1985). Response scales: Effects of category range on reported behavior and comparative judgments. The Public Opinion Quarterly, 49, 388–395.
1 | Quote Attribution | People endorse a quotation more strongly when it is attributed to a liked rather than a disliked figure | Lorge, I., & Curtiss, C. C. (1936). Prestige, suggestion, and attitudes. The Journal of Social Psychology, 7, 386–402.
1 | Norm of reciprocity | People are more likely to say that foreign reporters should be allowed into their home country after first being asked whether a foreign country should allow reporters from their country | Hyman, H. H., & Sheatsley, P. B. (1950). The current status of American public opinion. In J. C. Payne (Ed.), The teaching of contemporary affairs: 21st yearbook of the National Council of Social Studies (pp. 11–34). New York, NY: National Council of Social Studies.
1 | Sunk Costs | People are more likely to brave the cold to use a ticket they bought vs. one that was free. | Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45, 867–872.
1 | Imagined contact | Imagining contact with people from different ethnic groups reduces prejudice towards those groups | Husnu, S., & Crisp, R. J. (2010). Elaboration enhances the imagined contact effect. Journal of Experimental Social Psychology, 46, 943–950.
1 | Flag Priming | People report more conservative attitudes after subtle exposure to the US flag than after no exposure | Carter, T. J., Ferguson, M. J., & Hassin, R. R. (2011). A single exposure to the American flag shifts support toward Republicanism up to 8 months later. Psychological Science, 22, 1011–1018.
1 | Currency Priming | People endorse the status quo more strongly after exposure to an image of money than after no exposure | Caruso, E. M., Vohs, K. D., Baxter, B., & Waytz, A. (2012). Mere exposure to money increases endorsement of free-market systems and social inequality. Journal of Experimental Psychology: General, 142, 301–306.
3 | Stroop | People are slower to name the font color of a color word when the word names a different color than when it names the same color | Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18, 643–662.
3 | Metaphoric Restructuring | People interpret an ambiguous temporal statement differently depending on whether they had just completed a spatial task in a frame of reference involving an object and a stick figure labeled “you” or involving only objects | Boroditsky, L., & Ramscar, M. (2002). The roles of body and mind in abstract thought. Psychological Science, 13, 185–189.
3 | Availability Heuristic | People overestimate the frequency of words starting with a given letter (rather than words where the letter is in the third place) because they are easier to recall | Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5, 207–232.
3 | Persistence and Conscientiousness | Persistence as measured by one personality measure is positively correlated with conscientiousness as measured by another | De Fruyt, F., & Van De Wiele, L. (2000). Cloninger’s psychobiological model of temperament and character and the five-factor model of personality. Personality and Individual Differences, 29, 441–452.
3 | Power and perspective-taking | People made to feel high in power perform poorer on a perspective-taking task | Galinsky, A. D., Magee, J. C., Inesi, M. E., & Gruenfeld, D. H. (2006). Power and perspectives not taken. Psychological Science, 17, 1068–1074.
3 | Weight embodiment | People judge an issue as more important when holding a heavier (rather than lighter) clipboard | Jostmann, N. B., Lakens, D., & Schubert, T. W. (2009). Weight as an embodiment of importance. Psychological Science, 20, 1169–1174.
3 | Warmth perceptions | People judge a room to be warmer after reading about a communal rather than an agentic person | Szymkow, A., Chandler, J., IJzerman, H., Parzuchowski, M., & Wojciszke, B. (2013). Warmer hearts, warmer rooms: How positive communal traits increase estimates of ambient temperature. Social Psychology, 44, 167–176.
3 | Elaboration likelihood model | People high in need for cognition differ more in their judgments of the persuasiveness of strong and weak arguments than people low in need for cognition | Cacioppo, J. T., Petty, R. E., & Morris, K. J. (1983). Effects of need for cognition on message evaluation, recall, and persuasion. Journal of Personality and Social Psychology, 45, 805–818.
3 | Self-Esteem and subjective distance | People high in self-esteem judge past positive and negative events as more different in subjective temporal distance than participants with low self-esteem | Ross, M., & Wilson, A. E. (2002). It feels like yesterday: Self-esteem, valence of personal past experiences, and judgments of subjective distance. Journal of Personality and Social Psychology, 82, 792–803.
3 | Credentials and prejudice | People are more likely to express prejudiced attitudes when they first have an opportunity to show that they are not prejudiced | Monin, B., & Miller, D. T. (2001). Moral credentials and the expression of prejudice. Journal of Personality and Social Psychology, 81, 33–43.



Unfortunately, out of the 22 articles, several were excluded from the analyses.

One of the articles (Norm of reciprocity; Hyman & Sheatsley, 1950) was not available online or from the ML1 team. I found a catalog entry for the book at my library, but upon taking it out found it was the wrong volume of the series. I have an interlibrary loan request pending for the correct volume, but it has not yet been delivered. Since complete information for this article could not be gathered, it was excluded.

I next excluded any articles that reported only a single study: Anchoring, Allowed/Forbidden, Quote Attribution, and Persistence and conscientiousness. Some of the metrics (e.g., p-curve, correlation between effect size and sample size) require multiple studies to compute.

One further paper was excluded. The Sunk Costs effect originated in a paper (Thaler, 1985) that does not report null-hypothesis significance testing and so cannot be used for most of the metrics. The protocol used in the replication was derived from work by Oppenheimer and colleagues (2009), reported in an article testing the utility of instructional manipulation checks for studies conducted on Mechanical Turk. This paper is not suitable for inclusion because the remaining studies in the paper do not address the same theoretical question as the replicated effect.

This left 16 papers in the final sample, all of which could be evaluated on the dichotomous outcome. However, the original papers reporting the three effects involving interactions (Elaboration Likelihood Model, Self-Esteem and Subjective Distance, and Credentials and Prejudice) did not report enough information (or did not have appropriate designs) to estimate effect sizes in Cohen’s d. Thus, these three effects were excluded from analyses involving the continuous outcome, reducing the number of included effects for those analyses to 13.

Papers with Multiple Replicated Effects

Two of the sampled papers (Jacowitz & Kahneman, 1995; Nosek et al., 2002) had multiple effects included in the replication studies. Multiple outcomes from the same article should not be evaluated separately as their predictor metrics are not independent. I decided to average the outcomes for effects that are testing the same theoretical relationship and to focus on the effect that was the primary theoretical target of the original paper when the effects were testing different theoretical relationships. This meant that the Anchoring effects were averaged and the implicit/explicit attitude correlation was excluded from analyses.
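As a sketch of the averaging step (not the project's actual code), outcomes can be grouped by source paper and averaged so that each paper contributes one row to the analysis. The paper labels and values below are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_by_paper(effects):
    """Average the outcome across effects that come from the same paper.

    `effects` is a list of (paper, outcome) pairs; the result maps each
    paper to the mean of its outcomes.
    """
    grouped = defaultdict(list)
    for paper, outcome in effects:
        grouped[paper].append(outcome)
    return {paper: mean(values) for paper, values in grouped.items()}


# Invented effect size differences, purely for illustration.
effects = [
    ("Paper A", -0.10),
    ("Paper A", -0.30),
    ("Paper B", -0.25),
]
print(average_by_paper(effects))
```

Averaging keeps one observation per paper, which matches the unit at which the predictor metrics are computed.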

All of the data and code for data manipulation and primary analyses are on GitHub. Please feel free to ask me questions, open issues, send pull requests, or ask for files in other formats. I am happy to share.

The Data

Before I dive into results, I want to show some important features of the data. First, let’s look at the continuous outcome. The figure below shows original effect sizes on the left and replication effect sizes on the right, for those papers reporting enough information to calculate Cohen’s d. Lines connect the effect sizes from the same effects. There’s a filter that allows you to show only those effects included in the analyses (i.e., omitting those from papers that contained only one study or which I couldn’t access).

Try switching from showing all effects to showing only those whose original articles are included in the analyses. Notice that just about all of the effects whose size grew from original to replication are excluded. This might be problematic. How often do effect sizes grow from original to replication, and how do the original papers reporting such studies differ from other papers? Unfortunately, there is not enough information here to answer these questions. But perhaps there is bias in the continuous outcome in our remaining studies.

Next, we should look at the distributions of the predictors. The figure below allows you to select which predictor you want to see and to switch among the full set of 16 effects, the set of 13 effects included in the effect size analyses, and the set of 3 effects excluded from those analyses.

One thing you might notice is that for most metrics, the scores are clustered in the “worse” end. The papers included in this sample have, mostly, low power, small sample sizes, and low variance in their z-scores. This would be consistent with a fairly biased or underpowered literature. So low predictive validity of these metrics would not be very surprising, given the small sample size and the low variability in the predictors. Models based on these metrics might be treated with suspicion, given their poor variability.


Ok, now that we’ve looked at the data a bit and considered the ways in which they are less-than-ideal (small sample, perhaps unrepresentative on at least one outcome, and low variability in predictors), we should be pretty well-prepared for the results.

Difference in Effect Size

For this blog post, I’ve used the absolute value of the difference between replication and original effect sizes rather than the raw difference. Since only one of the effects has a positive difference, and it’s a change of d = .01, this changes only the signs of the relationships but not their sizes. It just makes the plots and correlations a bit more intuitive to interpret: more positive values indicate greater disparity between original and replication results. The poster uses the raw difference scores (replication – original), if you want to compare.

Below is a table of correlations. The first column represents correlations with effect size difference. The degrees of freedom for all of these correlations are 11. The correlations in the other cells include all 16 of the effects and so have 14 degrees of freedom.

                                          1      2      3      4      5      6
1. Difference in Effect Size
2. P-Curve: Evidential Value           -.44
3. P-Curve: Lacks Evidential Value      .45   -.99
4. R-Index                              .46   -.89    .89
5. TIVA: Variance of Z                 -.04   -.37    .39    .35
6. Correlation between ES and N         .49   -.37    .37    .42    .40
7. N-Pact Factor: Median N             -.14    .06   -.03    .15    .27   -.06

Overall, the correlations with effect size difference are moderately strong, but notice that some of them are in the opposite direction from what one would predict. Z-scores for evidential value decrease as effect size discrepancy increases. But negative z-scores indicate evidential value, so we would expect higher z-scores to be associated with greater discrepancy. For the other p-curve metric, negative z-scores indicate lack of evidential value, but the correlation is positive. Instead of discrepancy rising with lack of evidential value, it falls. Higher R-Index values should predict better replicability, but here they are associated with larger effect size differences. The last two correlations are in the predicted direction. As the article-level correlation between effect size and sample size increases (as potential publication bias increases), so does the difference between original and replication outcomes. And as N-pact factor (power) increases, discrepancy decreases (a bit).
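To get a feel for how much uncertainty sits behind correlations based on 13 effects, one can compute approximate confidence intervals with the Fisher z transformation. This is a rough sketch, not part of the original analyses: the approximation assumes roughly bivariate-normal data, which is doubtful in a sample this small, and the function name is made up for this example.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def correlation_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation,
    computed on the Fisher z scale and transformed back."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z_crit / sqrt(n - 3)
    center = atanh(r)
    return tanh(center - half_width), tanh(center + half_width)


# Correlations with effect size difference use 13 effects (df = 11).
for label, r in [("P-Curve: Evidential Value", -0.44),
                 ("R-Index", 0.46),
                 ("Correlation between ES and N", 0.49)]:
    low, high = correlation_ci(r, n=13)
    print(f"{label}: r = {r:+.2f}, 95% CI roughly [{low:+.2f}, {high:+.2f}]")
```

Intervals computed this way are wide at this sample size, which is one more reason to read the direction of any single correlation here with caution.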

But our predictors didn’t have much variability. Let’s see what these relationships look like plotted:

Plot of Difference Between Replication and Original Effect Sizes by Bias Metrics

We can see that only a few papers have p-curve z-scores much different from zero, or have variance in the z-scores of their studies (TIVA) greater than 1, or have sample sizes larger than 100. In general, the correlations are driven by just a few points that are outliers in their distributions.

Ok, let’s look quickly at the dichotomous outcome, with “successful” (p < .05) replications coded as 1 and “unsuccessful” (p >= .05) coded as 0. Here are the results of six different logistic regression models predicting this outcome:

                                                 b      SE       z       p      OR
1. P-Curve: Evidential Value                          0.18   -1.31     .19
2. P-Curve: Lacks Evidential Value            0.23    0.18    1.33     .18
3. R-Index                                            1.64    0.56     .57
4. TIVA: Variance in Z                                0.58    0.25     .80
5. Correlation Between Effect Size and N              0.66    0.19     .85
6. N-Pact Factor                                      0.01   -0.48     .63


Now, the p-curve relationships are in the expected direction. As is R-Index. And TIVA. We’ll plot the predicted probability of success.
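For readers less familiar with logistic regression output, the table's columns are linked by simple formulas: z is b divided by its standard error, p is the two-tailed normal probability for z, and the odds ratio is exp(b). The sketch below illustrates this with the b and SE reported for the "P-Curve: Lacks Evidential Value" model. The result only approximately matches the reported z and p, presumably because the table values are rounded. The intercept used for the predicted probabilities is a placeholder, since intercepts are not reported here, and the function names are made up for this example.

```python
from math import exp
from statistics import NormalDist


def wald_summary(b, se):
    """Return (z, two-tailed p, odds ratio) for a logistic regression slope."""
    z = b / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p, exp(b)


def predicted_probability(intercept, slope, x):
    """Predicted probability of replication success at predictor value x."""
    return 1 / (1 + exp(-(intercept + slope * x)))


# b and SE as reported (rounded) for "P-Curve: Lacks Evidential Value".
z, p, odds_ratio = wald_summary(b=0.23, se=0.18)
print(f"z = {z:.2f}, p = {p:.2f}, OR = {odds_ratio:.2f}")

# Placeholder intercept of 0.0: the real intercept is not given in the post.
for x in (-4, 0, 4):
    print(x, round(predicted_probability(intercept=0.0, slope=0.23, x=x), 2))
```

A positive slope means the predicted probability of a successful replication rises with the predictor; the curves in the plot are this kind of function evaluated across each metric's observed range.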

Plot of Predicted Probability of Success from Six Bias Metrics

We can see that, at least, the p-curve values most strongly indicating evidential value are associated with replication successes. But there are also a couple of replication “failures” that come from papers with evidential value z-scores below -1.64, i.e., where there is evidential value in that set of studies (but not necessarily in the replicated study). And there are a few successes that come from papers that do not have evidential value (though there are none from the two papers that significantly lack evidential value). So while p-curve performs the best at predicting replication success from article-level information in this set of studies, it’s definitely not predicting all of the outcomes.

Some Closing Thoughts

I’m very grateful for the attention this project has received and the many helpful comments and thought-provoking questions people have sent on Twitter, Facebook, email, and GitHub. I hope that this blog post has clarified many points about methodology and data quality. There is a lot left to be desired here, and I look forward to expanding this project to make it more useful.

In particular, as a few people have suggested, it’s important to have more data. Adding replications from the Reproducibility Project, a special issue of Social Psychology, the soon-to-be-released Many Labs 2, and perhaps other sources will hopefully increase the sample size, the reliability of the estimates, and the power to detect relationships.

In addition, there may be other outcomes that would be useful. Perhaps whether the original study had the power to detect the replication effect. Or, for Many Labs studies, the proportion of labs returning significant results. That said, no replication outcome metric can replace informed scientific judgment. If a replication is poorly conducted or has low power, we should not evaluate the replicability of the effect in the same way as when the replication is competently conducted and high powered. Likewise, if an original study used invalid manipulations or measures, what does its replicability matter?

It also remains possible that article level metrics don’t predict replication outcomes for the current psychological literature. If the literature is extremely low-powered and biased, these metrics may be capable of telling us only that the literature is low-powered and biased but not which effects are likely to replicate. Or there may be too little information contained in single papers to predict replication outcomes of their studies. Perhaps author- or journal-level metrics would be better indicators, though certainly harder to assess off-the-cuff.

For now, I would probably make three recommendations. First, I would apply caution in using these metrics to make judgments about the replicability of studies from single papers. It’s not clear that they can diagnose the replicability of studies in this way. Second, when reporting studies, include enough information to aid in meta-analytic and meta-scientific research. This means including full model specifications, cell sizes, and test statistics (if you’re using frequentist statistics, please report more than a p-value). Finally, conduct more replications.
