A psychologist's thoughts on how and why we play games

## Friday, July 3, 2015

### Bayesian Perspectives on Publication Bias

I have two problems with statistics in psychological science. They are:
1. Everybody speaks in categorical yes/no answers (statistical significance) rather than continuous, probabilistic answers (probably yes, probably no, not enough data to tell).
2. There's a lot of bullshit going around. The life cycle of the bullshit is extended by publication bias (running many trials and just reporting the ones that work) and p-hacking (torturing the data until it gives you significance).
Meta-analysis is often suggested as one solution to these problems. If you average together everybody's answers, maybe you get closer to the true answer. Maybe you can winnow out truth from bullshit when looking at all the data instead of the tally of X significant results and Y nonsignificant results.

That's a nice thought, but publication bias and p-hacking make it possible that the meta-analysis just reports the degree of bias in the literature rather than the true effect. So how do we account for bias in our estimates?

### Bayesian Spike-and-Slab Shrinkage Estimates

One very simple approach would be to consider some sort of "bullshit factor". Suppose you believe, as John Ioannidis does, that half of published research findings are false. If that's all you know, then for any published result you believe that there's a 50% chance that there's an effect such as the authors report it (p(H1) = .5) and a 50% chance that the finding is false (p(H0) = .5). Just to be clear, I'm using H0 to refer to the null hypothesis, H1 to refer to the alternative hypothesis.

How might we summarize our beliefs if we wanted to estimate the effect with a single number? Let's say the authors report d = 0.60. We halfway believe in them, but we still halfway believe in the null. So on average, our belief in the true effect size delta is

delta(d | H0) * probability(H0) + (d | H1) * probability(H1)
or
delta = (0) * (0.5) + (0.6) * (0.5) = 0.3

So we've applied some shrinkage or regularization to our estimate. Because we believe that half of everything is crap, we're able to improve our estimates by adjusting our estimates accordingly.

This is roughly a Bayesian spike-and-slab regularization model: the spike refers to our belief that delta is exactly zero, while the slab is the diffuse alternative hypothesis describing likely non-zero effects. As we believe more in the null, the spike rises and the slab shrinks; as we believe more in the alternative, the spike lowers and the slab rises. By averaging across the spike and the slab, we get a single value that describes our belief.

 Bayesian Spike-and-Slab system. As evidence accumulates for a positive effect, the "spike" of belief in the null diminishes and the "slab" of belief in the alternative soaks up more probability. Moreover, the "slab" begins to take shape around the true effect.

So that's one really crude way of adjusting for meta-analytic bias as a Bayesian: just assume half of everything is crap and shrink your effect sizes accordingly. Every time a psychologist comes to you claiming that he can make you 40% more productive, estimate instead that it's probably more like 20%.

But what if you wanted to be more specific? Wouldn't it be better to shrink preposterous claims more than sensible claims? And wouldn't it be better to shrink fishy findings with small sample sizes and a lot of p = .041s moreso than a strong finding with a good sample size and p < .001?

### Bayesian Meta-Analytic Thinking by Guan & Vandekerckhove

This is exactly the approach given in a recent paper by Guan and Vandekerckhove. For each meta-analysis or paper, you do the following steps:
1. Ask yourself how plausible the null hypothesis is relative to a reasonable alternative hypothesis. For something like "violent media make people more aggressive," you might be on the fence and assign 1:1 odds. For something goofy like "wobbly chairs make people think their relationships are unstable" you might assign 20:1 odds in favor of the null.
2. Ask yourself how plausible the various forms of publication bias are. The models they present are:
1. M1: There is no publication bias. Every study is published.
2. M2: There is absolute publication bias. Null results are never published.
3. M3: There is flat probabilistic publication bias. All significant results are published, but only some percentage of null results are ever published.
4. M4: There is tapered probabilistic publication bias: everything < .05 gets published, but the chances of publishing get worse the farther you get from < .05 (e.g. p = .07 gets published more than p = .81).
3. Look at the results and see which models of publication bias look likely. If there's even a single null result, you can scratch off M2, which says null results are never published. Roughly speaking, if the p-curve looks good, M1 starts looking pretty likely. If the p-curve is flat or bent the wrong way, M3 and M4 start looking pretty likely.
4. Update your beliefs according to the evidence. If the evidence looks sound, belief in the unbiased model (M1) will rise and belief in the biased models (M2, M3, M4) will drop. If the evidence looks biased, belief in the publication bias models will rise and belief in the unbiased model will drop. If the evidence supports the hypothesis, belief in the alternative (H1) will rise and belief in the null (H0) will drop. Note that, under each publication bias model, you can still have evidence for or against the effect.
5. Average the effect size across all the scenarios, weighting by the probability of each scenario.
If you want to look at the formula for this weighted average, it's:
delta = (d | M1, H1) * p(M1, H1) + (d | M1, H0)*p(M1, H0) + (d | M2, H1)*p(M2, H1) + (d | M2, H0)*p(M2, H0) + (d | M3, H1)*p(M3, H1) + (d | M3, H0)*p(M3, H0) + (d | M4, H1)*p(M4, H1) + (d | M4, H0)*p(M4, H0)
(d | Mx, H0) is "effect size d given that publication bias model X is true and there is no effect." We can go through and set all these to zero, because when the null is true, delta is zero.

(d | Mx, H1) is "effect size d given that pubication bias model X is true and there is a true effect." Each bias model makes a different guess at the underlying true effect.    (d | M1, H1) is just the naive estimate. It assumes there's no pub bias, so it doesn't adjust at all. However, M2, M3, and M4 say there is pub bias, so they estimate delta as being smaller. Thus, (| M2, H1), (d | M3, H1), and (d | M4, H1) are shrunk-down effect size estimates.

p(M1, H1) through p(M4, H0) reflect our beliefs in each (pub-bias x H0/H1) combo. If the evidence is strong and unbiased, p(M1, H1) will be high. If the evidence is fishy, p(M1, H1) will be low and we'll assign more belief to skeptical models like p(M3, H1), which says the effect size is overestimated, or even p(M3, H0), which says that the null is true.

Then to get our estimate, we make our weighted average. If the evidence looks good, p(M1, H1) will be large, and we'll shrink d very little according to publication bias and remaining belief in the null hypothesis. If the evidence is suspect, values like p(M3, H0) will be large, so we'll end up giving more weight to the possibility that d is overestimated or even zero.

### Summary

So at the end of the day, we have a process that:
1. Takes into account how believable the hypothesis is before seeing data, gaining strength from our priors. Extraordinary claims require extraordinary evidence, while less wild claims require less evidence.
2. Takes into account how likely publication bias is in psychology, gaining further strength from our priors. Data from a pre-registered prospective meta-analysis is more trustworthy than a look backwards at the prestige journals. We could take that into account by putting low probability in pub bias models in the pre-registered case, but higher probability in the latter case.
3. Uses the available data to update beliefs about the hypothesis and publication bias both, improving our beliefs through data. If the data look unbiased, we trust it more. If the data looks like it's been through hell, we trust it less.
4. Provides a weighted average estimate of the effect size given our updated beliefs. It thereby shrinks estimates a lot when the data are flimsy and there's strong evidence of bias, but shrinks estimates less when the data are strong and there's little evidence of bias.
It's a very nuanced and rational system. Bayesian systems usually are.

That's enough for one post. I'll write a follow-up post explaining some of the implications of this method, as well as the challenges of implementing it.