

Tuesday, July 1, 2014

Can p-curve detect p-hacking through moderator trawling?

NOTE: Dr. Simonsohn has contacted me and indicated a possible error in my algorithm. The results presented here could be invalid. We are talking back and forth and I am trying to fix my code. Stay tuned!

Suppose a researcher were to conduct an experiment looking to see if Manipulation X had any effect on Outcome Y, but the result was not significant. Since nonsignificant results are harder to publish, the researcher might be motivated to find some sort of significant effect somehow. How might the researcher go about dredging up a significant p-value?

One possibility is "moderator trawling". The researcher could try potential moderating variables until one is found that provides a significant interaction. Maybe it only works for men but not women? Maybe the effect can be seen after error trials, but not after successful trials? In essence, this is slicing the data until one manages to find a subset of the data that does show the desired effect. Given the number of psychological findings that seem to depend on surprisingly nuanced moderating conditions (ESP is one of these, but there are others), I do not think this is an uncommon practice.

To demonstrate moderator trawling, here's a set of data that has no main effect.
However, when we slice up the data by one of our moderators, we do find an effect. Here the interaction is significant, and the simple slope in group 0 is also significant.
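In code, the kind of slicing I have in mind looks something like this (a minimal R sketch; the seed and variable names are arbitrary, and with most seeds nothing will turn up significant, which is exactly why one keeps trawling):

set.seed(1)                          # arbitrary seed, purely for illustration
n <- 20
x <- rnorm(n)                        # the manipulation
y <- rnorm(n)                        # the outcome, generated with no relation to x
z1 <- sample(rep(0:1, each = n / 2)) # an arbitrary two-level moderator
summary(lm(y ~ x))                   # main effect: nothing there
summary(lm(y ~ x * z1))              # check the x:z1 interaction term
summary(lm(y ~ x, subset = z1 == 0)) # simple slope within group 0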

In the long run, testing the main effect and three moderators will cause the alpha error rate to increase from 5% to 18.5%. That's the chance that at least one of the four tests comes up p < .05: 1 - (.95)^4 ≈ .185.
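As a quick check of that arithmetic (treating the four tests as independent, which they aren't quite, since they all reuse the same data):

1 - 0.95^4  # 0.1854938, the familywise error rate across four independent tests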

Because I am intensely excited by the prospect of p-curve meta-analysis, I just had to program a simulation to see whether p-curve could detect this moderator trawling in the absence of a real effect. P-curve meta-analysis is a statistical technique that examines the distribution of reported significant p-values. It relies on the property of the p-value that, when the null is true, p is uniformly distributed between 0 and 1. When an effect exists, smaller p-values are more likely than larger p-values, even for small p: p < .01 is more likely than .04 < p < .05 for a true effect. Thus, a flat p-curve indicates no effect and possible file-drawering of null findings, while a right-skewed p-curve indicates a true effect. More interesting yet, a left-skewed curve suggests p-hacking -- doing whatever it takes to achieve p < .05, alpha error be damned.
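A rough way to see that property for yourself is to simulate many two-group studies under a null and under a true effect, and then look only at the significant p-values (a sketch in R; the effect size and sample size are arbitrary choices for illustration):

set.seed(2)                                             # arbitrary seed
sim_p <- function(d, n = 20) t.test(rnorm(n, mean = d), rnorm(n))$p.value
p_null <- replicate(5000, sim_p(d = 0))                 # no effect
p_true <- replicate(5000, sim_p(d = 0.8))               # a real effect
hist(p_null[p_null < .05], breaks = seq(0, .05, .01))   # roughly flat curve
hist(p_true[p_true < .05], breaks = seq(0, .05, .01))   # piles up near zero: right-skewed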

You can find the simulation hosted on the Open Science Framework at https://osf.io/ydwef/. This is my first pass at an algorithm; I'd be happy to hear other suggestions for algorithms and parameters that simulate moderator trawling. Here's what the script does (a condensed code sketch follows the list).
1) Create independent x and y variables from a normal distribution
2) Create three two-level moderator variables z1, z2, and z3, each assigning 10 randomly chosen subjects to one level and the remaining 10 to the other
3) Fit the main effect model y ~ x. If it's statistically significant, stop and report the main effect.
4) If that doesn't come out significant, try z1, z2, and z3 each as moderators (e.g. y ~ x*z1; y ~ x*z2; y ~ x*z3). If one of these is significant, stop and plan to report the interaction.
5) Simonsohn et al. recommend using the p-value of the interaction for an attenuation interaction (e.g. "There's an effect among men that is reduced or eliminated among women"), but the p-values of each of the simple slopes for a crossover interaction (e.g. "This makes men more angry but makes women less angry."). So, determine whether it's an attenuation or crossover interaction.
5a) If just one simple slope is significant, we'll call it an attenuation interaction: a significant effect in one group that is reduced or eliminated in the other group. In this case, we report the interaction p-value.
5b) If neither simple slope is significant, or both are significant with coefficients of opposite sign, we'll call it a crossover. Both slopes significant indicates opposite effects, while neither slope significant indicates that the simple slopes aren't strong enough on their own but that their opposition is enough to power a significant interaction. In these cases, we'll report both simple slopes' p-values.
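For readers who prefer code to prose, here is a condensed R sketch of that logic (the script on the OSF is the authoritative version; the names, sample size, and edge-case handling below are simplified for illustration):

run_study <- function(n = 20) {
  x <- rnorm(n); y <- rnorm(n)                      # 1) independent x and y
  z <- replicate(3, sample(rep(0:1, each = n / 2))) # 2) three arbitrary two-level moderators
  p_main <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
  if (p_main < .05) return(p_main)                  # 3) report the main effect
  for (j in 1:3) {                                  # 4) trawl the moderators
    zj <- z[, j]
    p_int <- summary(lm(y ~ x * zj))$coefficients["x:zj", "Pr(>|t|)"]
    if (p_int < .05) {
      p0 <- summary(lm(y ~ x, subset = zj == 0))$coefficients["x", "Pr(>|t|)"]
      p1 <- summary(lm(y ~ x, subset = zj == 1))$coefficients["x", "Pr(>|t|)"]
      if (xor(p0 < .05, p1 < .05)) return(p_int)    # 5a) attenuation: report the interaction
      return(c(p0, p1))                             # 5b) crossover: report both simple slopes
    }
  }
  NULL                                              # nothing significant: into the file drawer
}
p_values <- unlist(replicate(10000, run_study(), simplify = FALSE))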

We repeat this for 10,000 hypothetical studies, export the t-tests, and put them into the p-curve app at www.p-curve.com. Can p-curve tell that these results are the product of p-hacking?




It cannot. In the limit, it seems that p-curve will conclude that the findings are very mildly informative, indicating that the p-curve is flatter than 33% power but still right-skewed, suggesting a true effect measured at about 20% power. Worse yet, it cannot detect that these p-values come from post-hoc tomfoolery and p-hacking. A few of these sprinkled into a research literature could make an effect seem to bear slightly more evidence, and be less p-hacked, than it really is.

The problem would seem to be that the p-values are aggregated across heterogeneous statistical tests: some tests of the main effect, some tests of this interaction or that interaction. Heterogeneity seems like it would be a serious problem for p-curve analysis in other ways. What happens when the p-values come from a combination of well-powered studies of a true effect and some poorly-powered, p-hacked studies of that same effect? (As best I can tell from the manuscript draft, the resulting p-curve is flat!) How does one meta-analyze across studies of different phenomena or different operationalizations or different models?

I remain excited and optimistic for the future of p-curve meta-analysis as a way to consider the strength of research findings. However, I am concerned by the ambiguities of practice and interpretation in the above case. It would be a shame if these p-hacked interactions would be interpreted as evidence of a true effect. For now, I think it best to report the data with and without moderators, preregister analysis plans, and ask researchers to report all study variables. In this way, one can reduce the alpha-inflation and understand how badly the results seem to rely upon moderator trawling.