A psychologist's thoughts on how and why we play games

Sunday, May 14, 2017

Curiously Strong effects

The reliability of scientific knowledge can be threatened by a number of bad behaviors. The problems of p-hacking and publication bias are now well understood, but there is a third problem that has received relatively little attention. This third problem currently cannot be detected through any statistical test, and its effects on theory may be stronger than those of p-hacking.

I call this problem curiously strong effects.

The Problem of Curiously Strong

Has this ever happened to you? You come across a paper with a preposterous-sounding hypothesis and a method that sounds like it would produce only the tiniest change, if any. You skim down to the results, expecting to see a bunch of barely-significant results. But instead of p = .04, d = 0.46 [0.01, 0.91], you see p < .001, d = 2.35 [1.90, 2.80]. This unlikely effect is apparently not only real, but it is four or five times stronger than most effects in psychology, and it has a p-value that borders on impregnable. It is curiously strong.

The result is so curiously strong that it is hard to believe that the effect is actually that big. In these cases, if you are feeling uncharitable, you may begin to wonder if there hasn't been some mistake in the data analysis. Worse, you might suspect that perhaps the data have been tampered with or falsified.

Curiously strong results can have lasting effects on future research. Naive researchers are likely to accept the results at face value, cite them uncritically, and attempt to expand upon them. Less naive researchers may still be reassured by the highly significant p-values and cite the work uncritically. Curiously strong results can enter meta-analyses, heavily influencing the mean effect size, Type I error rate, and any adjustments for publication bias.

Curiously strong results might, in this way, be more harmful than p-hacked results. With p-hacking, the results are often just barely significant, yielding the smallest effect size that is still statistically significant. Curiously strong results are much larger and have greater leverage on meta-analysis, especially when they have large sample sizes. Curiously strong results are also harder to detect and criticize: We can recognize p-hacking, and we can address it by asking authors to provide all their conditions, manipulations, and outcomes. We don't have such a contingency plan for curiously strong results.
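
To make the leverage point concrete, here is a minimal sketch of a fixed-effect (inverse-variance) meta-analysis. The study values are made up purely for illustration: five hypothetical studies at d = 0.20 plus one curiously strong study at d = 2.35. The variance formula is the usual large-sample approximation for Cohen's d with two independent groups, and the function names are my own.

```python
def d_variance(d, n1, n2):
    """Approximate sampling variance of Cohen's d for two independent groups."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def fixed_effect_mean(studies):
    """Inverse-variance weighted mean of a list of (d, n1, n2) tuples."""
    weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in studies]
    weighted_sum = sum(w * d for w, (d, _, _) in zip(weights, studies))
    return weighted_sum / sum(weights)


# Hypothetical literature: five ordinary studies, 50 participants per group.
ordinary = [(0.20, 50, 50)] * 5

# One curiously strong study, first at the same sample size, then a larger one.
curious_small = (2.35, 50, 50)
curious_large = (2.35, 200, 200)

pooled_without = fixed_effect_mean(ordinary)
pooled_with_small = fixed_effect_mean(ordinary + [curious_small])
pooled_with_large = fixed_effect_mean(ordinary + [curious_large])

print(f"without the curious study:      {pooled_without:.2f}")
print(f"with it (n = 50 per group):     {pooled_with_small:.2f}")
print(f"with it (n = 200 per group):    {pooled_with_large:.2f}")
```

In this toy example a single outlying study roughly doubles the pooled estimate, and it pulls harder still when its sample is large. A random-effects model would weight the studies differently, so treat this as an illustration of the mechanism rather than a general result.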

What should be done?

My question to the community is this: What can or should be done about such implausible, curiously strong results?

This is complicated, because there are a number of viable responses and explanations for such results:

1) The effect really is that big.
2) Okay, maybe the effect is overestimated because of demand effects. But the effect is probably still real, so there's no reason to correct or retract the report.
3) Here are the data, which show that the effect is this big. You're not insinuating somebody made the data up, are you?

In general, there's no clear policy on how to handle curiously strong effects, which leaves the field poorly equipped to deal with them. Peer reviewers know to raise objections when they see p = .034, p = .048, p = .041. They don't know to raise objections when they see d = 2.1 or r = 0.83 or η² = .88.

Nor is it clear that curiously strong effects should be a concern in peer review. One could imagine the problems that ensue when one starts rejecting papers or flinging accusations because the effects seem too large. Our minds and our journals should be open to the possibility of large effects.

The only solution I can see, barring some corroborating evidence that leads to retraction, is to try to replicate the curiously strong effect. Unfortunately, that takes time and expense, especially considering how replications are often expected to collect substantially more data than original studies. Even after the failure to replicate, one has to spend another 3 or 5 years arguing about why the effect was found in the original study but not in the replication. ("It's not like we p-hacked this initial result -- look at how good the p-value is!")

It would be nice if the whole mess could be nipped in the bud. But I'm not sure how it can.

A future without the curiously strong?

This may be naive of me, but it seems that in other sciences it is easier to criticize curiously strong effects, because the prior expectations on effects are more precise.

In physics, theory and measurement are well-developed enough that it is a relatively simple matter to say "You did not observe the speed of light to be 10 mph." But in psychology, one can still insist with a straight face that (to make up an example) subliminal luck priming leads to a 2 standard deviation improvement in health.

In the future, we may be able to approach this enviable state of physics. Richard, Bond Jr., and Stokes-Zoota (2003) gathered up 322 meta-analyses and concluded that the modal effect size in social psych is r = .21, approximately d = 0.42. (Note that even this is probably an overestimate considering publication bias.) Simmons, Nelson, and Simonsohn (2013) collected data on obvious-sounding effects to provide benchmark effect sizes. Together, these reports show that an effect of d > 2 is several times stronger than most effects in social psychology and stronger even than obvious effects like "men are taller than women (d = 1.85)" or "liberals see social equality as more important than conservatives (d = 0.69)".

By using our prior knowledge to describe what is within the bounds of psychological science, we could tell what effects need scrutiny. Even then, one is likely to need corroborating evidence to garner a correction, expression of concern, or retraction, and such evidence may be hard to find.

In the meantime, I don't know what to do when I see d = 2.50 other than to groan. Is there something that should be done about curiously strong effects, or is this just another way for me to indulge my motivated reasoning?


  1. Great post. This is one advantage that I see of results-blind reviewing. In my experience, many of these curiously strong effects happen with very small samples. If we were to evaluate the study without knowing the results, we would likely say it had too little power/precision to produce meaningful results. Focusing on the design takes the pressure off of having to address what produced those results.
    Of course this only applies to cases where we'd be likely to think the design is weak.

  2. Sadly, I suspect that fabrication will very often turn out to be the most parsimonious explanation. Fraudulent researchers tend to be rather incompetent, including in their understanding of what a reasonable effect size might be.

  3. Interesting and valuable blog post! A few quick comments on specific points:

    >>>The only solution I can see, barring some corroborating evidence that leads to retraction, is to try to replicate the curiously strong effect. Unfortunately, that takes time and expense, especially considering how replications are often expected to collect substantially more data than original studies.

    Yes! Independent corroboration via strong falsification attempts via replicability tests **IS** the only fail-proof way to increase one's belief confidence in a curiously strong, or any, published effect. Falsifiability is simply **NOT** optional for scientific progress to be possible: https://osf.io/preprints/psyarxiv/dv94b/

    >>>>Even after the failure to replicate, one has to spend another 3 or 5 years arguing about why the effect was found in the original study but not in the replication.

    No, this is not necessary. Our job isn't to figure out **why** certain published findings are false (there could be a thousand different reasons why original researchers got it wrong). Our job is to better understand reality by building upon each other's findings, which are assumed to be (in principle) replicable according to the specified conditions. If independent labs, in good faith, cannot demonstrate to themselves that a published effect is replicable, then they must simply move on to investigating other effects that are indeed replicable in the hope of increasing our understanding of the world.

  4. Good post. BTW, a curiously strong effect (spotted by Richard Morey) played a role in a recent Psych Science retraction. Also, as others have noted, when power is low an effect will be significant only if it exaggerates true effect size. Be deeply skeptical of low-powered, single-experiment studies with surprising results.
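
The point about low power exaggerating significant effects can be sketched with a small simulation. Everything here is an assumption chosen for illustration: a true effect of d = 0.3, 10 participants per group, normally distributed scores, and a two-tailed .05 test (critical t of 2.101 at df = 18). Only results significant in the predicted direction are kept, mimicking what would get published.

```python
import random
import statistics


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled SD."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * statistics.variance(a)
                  + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


TRUE_D = 0.3        # assumed true effect (hypothetical)
N_PER_GROUP = 10    # assumed small sample
T_CRIT = 2.101      # two-tailed .05 critical t for df = 18

rng = random.Random(2017)   # fixed seed so the run is repeatable
all_ds, significant_ds = [], []
for _ in range(5000):
    treatment = [rng.gauss(TRUE_D, 1) for _ in range(N_PER_GROUP)]
    control = [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
    d = cohens_d(treatment, control)
    all_ds.append(d)
    # With equal group sizes, t = d * sqrt(n / 2).
    t = d * (N_PER_GROUP / 2) ** 0.5
    if t > T_CRIT:
        significant_ds.append(d)

print(f"mean d, all studies:         {statistics.mean(all_ds):.2f}")
print(f"mean d, significant studies: {statistics.mean(significant_ds):.2f}")
print(f"smallest significant d:      {min(significant_ds):.2f}")
```

With 10 per group, no estimate below roughly d = 0.94 can reach significance, so every "successful" study in this setup overstates the assumed true effect by a factor of three or more.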

  5. One might at least expect authors to point out their curiously strong effects and offer their thoughts on the matter.

  6. I guess you could calculate the sample size needed for a 'curiously large effect'. It would be wonderfully small: if d = 2.5, then you would probably need only 5 per group (to get 95% power). Why not then get multiple labs to replicate using variations around these sample sizes and 'Bob's your uncle': see how many replicate... then pool them all together in a meta-analysis... add a few moderator variables... ready for publication

    1. It sounds like you and Dr. LeBel both see value in a replication attempt. The thought had occurred to me, but it seems a shame to do all the set-up only to collect N = 10. I'm also concerned about the possibility of a fighting retreat: "Well, no, so it's not d = 2.5, but maybe it's d = 0.5, which your replication wouldn't detect." Perhaps I'm overthinking it and a little replication would go a long way.

    2. Then it's no longer a 'curiously large effect' to worry about

    3. This is a great situation where Simonsohn's "small telescopes" approach (http://datacolada.org/wp-content/uploads/2015/05/Small-Telescopes-Published.pdf) is useful - you can adjust your sample size so that you can draw conclusions about whether the original study was adequately powered to detect any effect that could be there.

  7. This is a really important point to keep emphasising to the research community.

    There are some things worth checking if you see large d or r values. My initial thought on seeing such large standardised effects (usually a point estimate) is that the sample size is small and the effect size is inevitable. My second thought is to wonder how they have computed d or r: it is quite easy to use an incorrect conversion formula or some other approach that inflates d (e.g., computing d from t with a within-subject design). Third, there are artefacts of the design that distort d or r and that, combined with the choice of computational approach, produce large d or r values. Ceiling and floor effects are an example, as they can dramatically shrink the sample variance, as can ecological correlations. Finally, there are study designs that produce large effects by increasing the strength of a manipulation, such as extreme group designs.

    In most cases it really helps to get a measure of effect size that is unstandardized to give more context (or equivalently a raw data plot).
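
As a rough illustration of the "d from t in a within-subject design" problem mentioned above, the sketch below takes one hypothetical paired t-test (t = 5.0, 20 pairs, correlation of .8 between conditions; all three numbers are invented) and converts it three ways. The function names are my own, and the last conversion assumes equal standard deviations in the two conditions.

```python
import math


def d_between_from_t(t, n1, n2):
    """d from an independent-samples t: only valid for between-subject designs."""
    return t * math.sqrt(1 / n1 + 1 / n2)


def d_z_from_paired_t(t, n_pairs):
    """d_z: the mean difference standardised by the SD of the difference scores."""
    return t / math.sqrt(n_pairs)


def d_av_from_paired_t(t, n_pairs, r):
    """d in raw-score SD units, assuming equal SDs and correlation r between conditions."""
    return d_z_from_paired_t(t, n_pairs) * math.sqrt(2 * (1 - r))


t_value, n_pairs, r = 5.0, 20, 0.8   # hypothetical paired-design result

misapplied = d_between_from_t(t_value, n_pairs, n_pairs)
d_z = d_z_from_paired_t(t_value, n_pairs)
d_av = d_av_from_paired_t(t_value, n_pairs, r)

print(f"between-subjects formula (misapplied): {misapplied:.2f}")
print(f"d_z:                                   {d_z:.2f}")
print(f"d in raw-score SD units:               {d_av:.2f}")
```

Under these assumptions the misapplied formula overstates the raw-score d by a factor of 1 / sqrt(1 - r), so the more strongly the repeated measures correlate, the larger the inflation.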

  8. To address “maybe the effect is overestimated because of demand effects,” you could see if the curiously strong effect correlates with the Perceived Awareness of the Research Hypothesis scale: https://osf.io/preprints/psyarxiv/m2jgn/

  9. It might be constructive to transform the effect sizes into the r-metric, where we have more benchmarks. A d of 2.5 is going to be equivalent to a correlation of .78. That's a correlation that exceeds the average reliability of most measures used in the social sciences (.75) and would be the kind of correlation used to make the argument that the IV and DV are the same thing. That might make a more compelling argument.
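
For anyone who wants to check the conversion, here is a minimal sketch using the standard formula for turning d into a point-biserial r. It assumes two groups of equal size; with unequal groups the constant 4 changes. The function names are my own.

```python
import math


def d_to_r(d):
    """Convert Cohen's d to r, assuming two equal-sized groups."""
    return d / math.sqrt(d ** 2 + 4)


def r_to_d(r):
    """Inverse conversion, under the same equal-groups assumption."""
    return 2 * r / math.sqrt(1 - r ** 2)


r_from_curious_d = d_to_r(2.5)
print(f"d = 2.5 corresponds to r = {r_from_curious_d:.2f}")  # about .78
print(f"and back again: d = {r_to_d(r_from_curious_d):.2f}")
```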

  10. Surely, replication is the only way for science to proceed, irrespective of the size of the effect. There are three things we (should) want to know: (1) is the finding reliable? (i.e., can it be replicated in an independent sample of subjects); (2) how big is it?; and (3) how general is it? P values don't address these questions; only doing the hard work does ...

  11. Interesting post. I agree with those comments pointing the finger at study design, only to add that failing to take into account the underlying structure of the data can generate unbelievably high effect sizes. Yule pointed this out almost a hundred years ago with regard to nonsense correlations in time-series (https://www.jstor.org/stable/2341482?seq=1#page_scan_tab_contents). Dan Hruschka and I have a commentary coming out soon on a study that looked at the effect of the Ebola outbreak on voting intentions using time-series data, without accounting for autocorrelation (i.e., non-independence) of data points. This, combined with the fact that the original paper used smoothed data instead of raw data, greatly inflated the estimates of effect size.

  12. An interesting problem, but one that could be eliminated by tightening standards for publication. Sample size of submitted manuscripts should be large enough to support a split-half test of the hypothesis. If the finding of the first half is replicated by the second, then the manuscript, if otherwise acceptable, can be published. This will increase the cost of conducting research and likely delay the publication of findings. But what do we really have to lose? Replicability and replication are at the heart of empirical science. We face a replication crisis that is causing erosion of public trust in the scientific enterprise. See: A manifesto for reproducible science
    Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware & John P. A. Ioannidis
    Nature Human Behaviour 1, Article number: 0021 (2017)