A psychologist's thoughts on how and why we play games

Tuesday, October 18, 2016

Publishing the Null Shows Sensitivity to Data

Some months ago, a paper argued for the validity of an unusual measurement of aggression. According to this paper, the number of pins a participant sticks into a paper voodoo doll representing their child seems to be a valid proxy for aggressive parenting.

Normally, I might be suspicious of such a paper because the measurement sounds kind of farfetched. Some of my friends in aggression research scoffed at the research, calling bullshit. But I felt I could trust the research.

Why? The first author has published null results before.

I cannot stress enough how much an author's published null results encourage my trust in a published significant result. With some authors, the moment you read the methods section, you know what the results section will say. When every paper supports the lab's theory, one is left wondering whether there are null results hiding in the wings. One starts to worry that the tested hypotheses are never in danger of falsification.

"Attached are ten stickers you can use to harm the child.
You can stick these onto the child to get out your bad feelings.
You could think of this like sticking pins into a Voodoo doll."

In the case of the voodoo doll paper, the first author is Randy McCarthy. Years ago, I became aware of Dr. McCarthy when he carefully tried to replicate the finding that heat-related word primes influence hostile perceptions (DeWall & Bushman, 2009) and reported null results (McCarthy, 2014).

The voodoo doll paper from McCarthy and colleagues is also a replication attempt of sorts. The measure was first presented by DeWall et al. (2013); McCarthy et al. perform conceptual replications testing the measure's validity. On the whole, the replication and extension is quite enthusiastic about the measure. And that means all the more to me given my hunch that McCarthy started this project by saying "I'm not sure I trust this voodoo doll task..."

Similar commendable frankness can be seen in work from Michael McCullough's lab. In 2012, McCullough et al. reported that religious thoughts influence men's stereotypically male behavior. In 2014, one of McCullough's grad students published that she couldn't replicate the 2012 result (Hone & McCullough, 2014).

I see it as something like a Receiver Operating Characteristic curve. If the classifier has only ever given positive responses, that's probably not a very useful classifier -- you can't tell if there's any specificity to the classifier. A classifier that gives a mixture of positive and negative responses is much more likely to be useful.
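To make the analogy concrete, here is a minimal sketch in R. The track records and the `truth` vector are made up for illustration; nothing here comes from real publication data. It simply shows that a "classifier" that always says yes has perfect sensitivity but zero specificity.

```r
# Hypothetical example: 1 = effect is real / result reported as significant,
# 0 = effect is absent / result reported as null.
truth <- c(1, 1, 1, 0, 0, 0, 0, 1)

always_positive <- rep(1, length(truth))      # every paper "finds" the effect
mixed_reporter  <- c(1, 1, 0, 0, 0, 1, 0, 1)  # some nulls get reported

# Sensitivity: share of real effects called positive.
# Specificity: share of absent effects called null.
sens_spec <- function(response, truth) {
  c(sensitivity = mean(response[truth == 1] == 1),
    specificity = mean(response[truth == 0] == 0))
}

sens_spec(always_positive, truth)
sens_spec(mixed_reporter, truth)
```

The always-positive record tells you nothing about whether its author can detect an absent effect, which is the worry with labs that never report a null.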

A researcher that publishes a null now and again is a researcher I trust to call the results as they are.

[Conflict of interest statement: In the spirit of full disclosure, Randy McCarthy once gave me a small Amazon gift card for delivering a lecture to the Bayesian Interest Group at Northern Illinois University.]

Friday, August 19, 2016

Comment on Strack (2016)

Yesterday, Perspectives on Psychological Science published a 17-laboratory Registered Replication Report, totaling nearly 1900 subjects. In this RRR, researchers replicated an influential study of the Facial Feedback Effect, showing that being surreptitiously made to smile or to pout could influence emotional reactions.

The results were null, indicating that there may not be much to this effect.

The first author of the original study, Fritz Strack, was invited to comment. In his comment, Strack makes four criticisms of the replication that, in his view, undermine the results of the RRR to some degree. I am not convinced by these arguments; below, I address each in sequence.

"Hypothesis-aware subjects eliminate the effect."

First, Strack says that participants may have learned of the effect in class and thus failed to demonstrate it. To support this argument, he performs a post-hoc analysis demonstrating that the 14 studies using psychology pools found an effect size of d = -0.03, whereas the three studies using non-psychology undergrad pools found an effect size of d = 0.16, p = .037.

However, the RRR took pains to exclude hypothesis-aware subjects. Psychology students were also, we are told, recruited prior to coverage of the Strack et al. study in their classes. Neither of these steps ensures that all hypothesis-aware subjects were removed, of course, but both certainly help. And as Sanjay Srivastava points out, why would hypothesis awareness necessarily shrink the effect? It could just as well enhance it through demand characteristics.

Also, d = 0.16 is quite small -- like, 480-per-group for a one-tailed 80% power test small. If Strack is correct, and the true effect size is indeed d = 0.16, this would seem to be a very thin success for the Facial Feedback Hypothesis, and still far from consistent with the original study's effect.
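The per-group figure can be checked with base R. This is a sketch assuming a two-sample t-test, a one-tailed alpha of .05, and a standardized effect (sd = 1); those settings are my reading of "one-tailed 80% power test", not something spelled out above.

```r
# Per-group n needed to detect d = 0.16 with 80% power, one-tailed alpha = .05
pw <- power.t.test(delta = 0.16, sd = 1, sig.level = 0.05, power = 0.80,
                   type = "two.sample", alternative = "one.sided")
ceiling(pw$n)

# Normal-approximation cross-check: n = 2 * (z_alpha + z_power)^2 / d^2
n_approx <- 2 * (qnorm(0.95) + qnorm(0.80))^2 / 0.16^2
ceiling(n_approx)
```

Both calculations land in the neighborhood of 480 subjects per group, roughly half the total sample of the whole 17-lab RRR.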

"The Far Side isn't funny anymore."

Second, Strack suggests that, despite the stimulus testing data indicating otherwise, perhaps The Far Side is too 1980s to provide an effective stimulus.

I am not sure why he feels it necessary to disregard the data, which indicates that these cartoons sit nicely in the midpoint of the scale. I am also at a loss as to why the cartoons need to be unambiguously funny -- had the cartoons been too funny, one could have argued there was a ceiling effect.

"Cameras obliterate the facial feedback effect."

Third, Strack suggests that the "RRR labs deviated from the original study by directing a camera at the participants." He argues that research on objective self-awareness demonstrates that cameras induce subjective self-focus, tampering with the emotional response.

This argument would be more compelling if any studies were cited, but in either case, I feel the burden of proof rests with this novel hypothesis that the facial feedback effect is moderated by the presence of cameras.

"The RRR shows signs of small-study effects."

Finally, Strack closes by using a funnel plot to suggest that the RRR results are suffering from a statistical anomaly.

He shows a funnel plot that compares sample size and Cohen's d, arguing that it is not appropriately pyramidal. (Indeed, it looks rather frisbee-shaped.)

Further, he conducts a correlation test between sample size and Cohen's d. This result is not, strictly speaking, statistically significant (p = .069), but he interprets it all the same as a warning sign. (It bears mention here that an Egger test with an additive error term is a more appropriate test. Such a test yields p = .235, quite far from significance.)
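For readers who want to see what these two tests look like, here is a rough sketch on simulated data. These are NOT the RRR results: the 17 effect sizes are drawn from a true effect of zero with no bias, and the sample sizes are only chosen to resemble the narrow range in the RRR. Note also that the Egger test below is the classic weighted-regression version in base R, not the additive-error version mentioned above, which would need a meta-analysis package.

```r
set.seed(1)  # simulated data only; these are NOT the RRR results

k  <- 17
n  <- round(runif(k, 80, 140))     # total N per lab (assumed range)
se <- 2 / sqrt(n)                  # approximate SE of d with two equal groups
d  <- rnorm(k, mean = 0, sd = se)  # true effect of zero, no publication bias

# Correlation between sample size and effect size
cor_test <- cor.test(n, d)

# Classic Egger regression: regress d on its standard error, weighting by
# inverse variance. A slope far from zero suggests small-study effects.
egger   <- lm(d ~ se, weights = 1 / se^2)
egger_p <- summary(egger)$coefficients["se", "Pr(>|t|)"]

c(cor_p = cor_test$p.value, egger_p = egger_p)
```

Rerunning this with different seeds gives a feel for how often pure noise produces a suggestive-looking correlation when there are only 17 similarly sized studies.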

Strack says that he does not mean to insinuate that there is "reverse p-hacking" at play, but I am not sure how else we are to interpret this criticism. In any case, he recommends that "the current anomaly needs to be further explored," which I do below.

Strack's funnel plot does not appear pyramidal because the studies are all of roughly equal size, and so the default scale is way off. Here I present a funnel plot of more appropriate scale. Again, there's little variance in sample size or standard error with which to make a pyramid shape. You're used to seeing taller, more funnel-y funnels because sample sizes in social psych tend to range broadly between 40 and 400, whereas here they vary narrowly from 80 to 140.

You can also see that there's really only one of the 17 studies that contributes to the correlation, having a negative effect size and larger standard error. This study is still well within the range of all the other results, of course; together, the studies are very nicely homogeneous (I^2 = 0%, tau^2 = 0), indicating that there's no evidence this study's results measure a different true effect size.

Still, this study has influence on the funnel plot -- it has a Cook's distance of 0.46, whereas all the others have distances of 0.20 or less. Removing this one study abolishes the correlation between d and sample size (r(14) = .27, p = .304), and the resulting meta-analysis is still quite null (raw effect size = 0.04, [-0.09, 0.18]). Strack is interpreting a correlation that hinges upon one influential observation.

I am willing to bet that this purported small-study effect is a pattern detected in noise. (Not that it was ever statistically significant in the first place.)

Admittedly, I am sensitive to the suggestion that an RRR would somehow be marred by reverse p-hacking. If all the safeguards of an RRR can't stop psychologists from reaching whatever result they had predetermined, we are completely and utterly fucked, and it's time to pursue a more productive career in refrigerator maintenance.

Fortunately, that does not seem to be the case. The RRR does not show evidence of small-study effects or reverse p-hacking, and its null result is robust to exclusion of the most negative result.

Tuesday, July 19, 2016

The Failure of Fail-safe N

Fail-Safe N is a statistic suggested as a way to address publication bias in meta-analysis. Fail-Safe N describes the robustness of a significant result by calculating how many studies with effect size zero could be added to the meta-analysis before the result lost statistical significance. The original formulation is provided by Rosenthal (1979), with modifications proposed by Orwin (1983) and Rosenberg (2005).

I would like to argue that, as a way to detect and account for bias in meta-analysis, Fail-Safe N is completely useless. Others have said this before (see the bottom of the post for some links), but I needed to explore it further for my own curiosity. Altogether, I have to say that Fail-Safe N appears to be completely obsoleted by subsequent techniques, and thus is not recommended for use.

Fail-Safe N isn't for detecting bias

When we perform a meta-analysis, the question on our minds is usually "Looking at the gathered studies, how many null results were hidden from report?" Fail-Safe N does not answer that. Instead, it asks, "Looking at the gathered studies, how many more null results would you need before you'd no longer claim an effect?"

This isn't useful as a bias test. Indeed, Rosenthal never meant it as a way to test the presence of bias -- he'd billed it as an estimate of tolerance for null results, an answer to the question "How bad would bias have to be before I changed my mind?" He used it to argue that the published psych literature was not the 5% of Type I errors, while the 95% of null results languished in file drawers. Fail-Safe N was never meant to distinguish biased from unbiased literatures.

Fail-Safe N doesn't scale with bias

Although Fail-Safe N was never meant to test for bias, sometimes people will act as though a larger Fail-Safe N indicates the absence of bias. That won't work.

To see why it won't work, let's look briefly at the equation that defines FSN.

FSN = [(ΣZ)^2 / 2.706] - k

where ΣZ is the sum of z-scores from individual studies (small p-values mean large z-scores) and k is the number of studies.

This means that Fail-Safe N grows larger with each significant result. Even when each study is just barely significant (p = .050), Fail-Safe N will grow rapidly. After six p = .05 results, FSN is 30. After ten p = .05 results, FSN is 90. After twenty p = .05 results, FSN is 380. Fail-safe N rapidly becomes huge, even when the individual studies just barely cross the significance threshold.
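The arithmetic is easy to reproduce. In this sketch, `fail_safe_n` is a name I made up for the formula above, and p = .05 is taken to be one-tailed, so each study contributes z = qnorm(.95), about 1.645. I use qnorm(.95)^2 in place of the rounded 2.706.

```r
# Rosenthal's Fail-Safe N: (sum of z)^2 / z_crit^2 - k,
# where z_crit^2 = qnorm(.95)^2, approximately 2.706.
fail_safe_n <- function(z, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha)
  sum(z)^2 / z_crit^2 - length(z)
}

z_barely <- qnorm(0.95)  # z-score for a one-tailed p of exactly .05

fsn_values <- sapply(c(6, 10, 20), function(k) fail_safe_n(rep(z_barely, k)))
fsn_values
# 30, 90, 380: with every z at the threshold, FSN reduces to k^2 - k
```

Because the sum of z-scores is squared, FSN grows roughly with the square of the number of studies, which is why it balloons so quickly.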

Worse, FSN can get bigger as the literature becomes more biased.

  • For each dropped study with an effect size of exactly zero, FSN grows by one. (That's what it says on the tin -- how many dropped zeroes would be required to make p > .05.) 
  • When dropped studies have positive but non-significant effect sizes, FSN falls. 
  • When dropped studies have negative effect sizes, FSN rises.

If all the studies with estimated effect sizes less than zero are censored, FSN will quickly rise.
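The three cases above can be checked on a toy example. The z-scores below are hypothetical, picked only so that the literature contains a zero, some negatives, and some positive non-significant results; `fail_safe_n` is my own name for the formula.

```r
fail_safe_n <- function(z) sum(z)^2 / qnorm(0.95)^2 - length(z)

# Hypothetical z-scores for a complete, uncensored literature
z_all <- c(-1.5, -0.5, 0, 0.5, 1.0, 2.0, 2.5, 3.0)
z_crit <- qnorm(0.95)

fsn_all         <- fail_safe_n(z_all)
fsn_drop_zero   <- fail_safe_n(z_all[z_all != 0])   # censor the exact zero
fsn_drop_neg    <- fail_safe_n(z_all[z_all >= 0])   # censor negative results
fsn_drop_pos_ns <- fail_safe_n(z_all[!(z_all > 0 & z_all < z_crit)])  # censor positive, non-significant results

round(c(all = fsn_all, drop_zero = fsn_drop_zero,
        drop_negative = fsn_drop_neg, drop_positive_ns = fsn_drop_pos_ns), 2)
```

Dropping the zero raises FSN by exactly one, dropping the negatives more than doubles it, and dropping the positive non-significant results lowers it. The statistic rewards one kind of censoring and penalizes another, so it cannot track bias in any consistent way.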

Because Fail-Safe N doesn't behave in any particular way with bias, the following scenarios could all have the same Fail-Safe N:

  • A few honestly-reported studies on a moderate effect.
  • A lot of honest studies on a teeny-tiny effect.
  • A single study with a whopping effect size.
  • A dozen p-hacked studies on a null effect.

Fail-Safe N is often huge, even when it looks like the null is true

Publication bias and flexible analysis being what they are in social psychology, Fail-Safe N tends to return whopping huge numbers. The original Rosenthal paper provides two demonstrations. In one, he synthesizes 94 experiments examining the effects of interpersonal self-fulfilling prophecies, and concludes that 3,263 studies averaging null effects would be necessary to make the effect go away. In another analysis of k = 311 studies, he says nearly 50,000 studies would be needed.

Similarly, in the Hagger et al. meta-analysis of ego depletion, Fail-Safe N reported that 50,000 null studies would be needed to reduce the effect to non-significance. By comparison, the Egger test indicated that the literature was badly biased, and PET-PEESE indicated that the effect size was likely zero. The registered replication report also indicated that the effect size was likely zero. Even a Fail-Safe N of 50,000 does not indicate a robust result.


Fail-Safe N is not a useful bias test because:

  1. It does not tell you whether there is bias.
  2. Greater bias can lead to a greater Fail-Safe N.
  3. Hypotheses that would appear to be false have nonetheless obtained very large values of FSN.

FSN is just another way to describe the p-value at the end of your meta-analysis. If your p-value is very small, FSN will be very large; if your p-value is just barely under .05, FSN will be small.
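One way to see this is to rewrite FSN in terms of the combined z-score. The sketch below assumes the meta-analytic p-value comes from Stouffer's method (sum of z divided by the square root of k), which is the combination rule Rosenthal's formula is built on; a p-value from an effect-size meta-analysis will only track it approximately. The z-scores are hypothetical.

```r
fail_safe_n <- function(z) sum(z)^2 / qnorm(0.95)^2 - length(z)

# Hypothetical z-scores from k studies
z <- c(2.1, 1.7, 0.4, 2.6, 1.2)
k <- length(z)

stouffer_z <- sum(z) / sqrt(k)                       # combined z across studies
combined_p <- pnorm(stouffer_z, lower.tail = FALSE)  # one-tailed combined p

# FSN written purely in terms of the combined z and k
fsn_from_stouffer <- k * (stouffer_z^2 / qnorm(0.95)^2 - 1)

c(fsn = fail_safe_n(z), fsn_from_stouffer = fsn_from_stouffer, p = combined_p)
```

The two FSN values are identical, and FSN is positive exactly when the combined z exceeds 1.645. For a fixed number of studies, FSN carries no information beyond that combined p-value.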

In no case does Fail-Safe N indicate the presence or absence of bias. It only places a number on how bad publication bias would have to be, in a world without p-hacking, for the result to be a function of publication bias alone. Unfortunately, we know well that we live in a world with p-hacking. Perhaps this is why Fail-Safe N is sometimes so inappropriately large.

If you need to test for bias, I would recommend instead Begg's test, Egger's test, or p-uniform. If you want to adjust for bias, PET, PEESE, p-curve, p-uniform, or selection models might work. But don't ever try to interpret the Fail-Safe N in a way it was never meant to be used.

Related reading:
Becker (2005) recommends "abandoning Fail-Safe N in favor of other, more informative analyses."
Here Coyne agrees that Fail-Safe N is not a function of bias and does not check for bias.
The Cochrane Collaboration agrees that Fail-Safe N is mostly a function of the net effect size, and criticizes the emphasis on statistical significance over effect size.
Moritz Heene refers me to three other articles pointing out that the average Z-score of unpublished studies is probably not zero, as Fail-Safe N assumes, but rather, less than zero. Thus, the Fail-Safe N is too large. (Westfall's comment below makes a similar point.) This criticism is worth bearing in mind, but I think the larger problem is that Fail-Safe N does not answer the user's question regarding bias.

Thursday, June 23, 2016

Derailment, or The Seeing-Thinking-Doing Model

Inspired by a recent excellent lecture by Nick Brown, I decided to finally sit down and read Diederik Stapel's confessional autobiography, Ontsporing. Brown translated it from Dutch into English; it is available for free here.

In this account, Stapel describes how he came to leave theater for social psychology, how he had some initial fledgling successes, and ultimately, how his weak results and personal greed drove him to fake his data. A common theme is the complete lack of scientific oversight -- Stapel refers to his sole custody of the data as being alone with a big jar of cookies.

Doomed from the start

Poor Stapel! He based his entire research program on a theory doomed to failure. So much of what he did was based on a very simple, very crude model: Seeing a stimulus "activates" thoughts related to the stimulus. Those "activated" thoughts then influence behavior, usually at sufficient magnitude and clarity that they can be detected in a between-samples test of 15-30 samples per cell.

Say what you will about the powerful effects of the situation, but in hindsight, it's little surprise that Stapel couldn't find significant results. The stimuli were too weak, the outcomes too multiply determined, and the sample sizes too small. It's like trying to study if meditation reduces anger by treating 10 subjects with one 5-minute session and then seeing if they ever get in a car crash. Gelman might say Stapel was "driven to cheat [...] because there was nothing there to find. [...] If there's nothing there, they'll start to eat dirt."

Remarkably, Stapel writes as though he never considered that his theories could be wrong and that he should have changed course. Instead, he seems to have taken every p < .05 as gospel truth. He talks about p-hacking two studies into shape (he refers to "gray methods" like dropping conditions or outcomes) only to be devastated when the third study comes up immovably null. He didn't listen to his null results.

However, theory seemed to play a role in his reluctance to listen to his data. Indeed, he says the way he got away with it for as long as he did was by carefully reading the literature and providing the result that theory would have obviously predicted. Maybe the strong support from theory is why he always assumed there was some signal he could find through enough hacking.

He similarly placed too much faith in the significant results of other labs. He alludes to strange ideas from other labs as though they were established facts: things like the size of one's signature being a valid measure of self-esteem, or thoughts of smart people making you better at Trivial Pursuit.

Seeing-Thinking-Doing Theory

Reading the book, I had to reflect upon social psychology's odd but popular theory, which grew to prominence some thirty years ago and is just now starting to wane. This theory is the seeing-thinking-doing theory: seeing something "activates thoughts" related to the stimulus, the activation of those thoughts leads to thinking those thoughts, and thinking those thoughts leads to doing some behavior.

Let's divide the seeing-thinking-doing theory into its component stages: seeing-thinking and thinking-doing. The seeing-thinking hypothesis seems pretty obvious. It's sensible enough to believe in and study some form of lexical priming, e.g. that some milliseconds after you've shown participants the word CAT, they are faster to say HAIR than BOAT. Some consider the seeing-thinking hypothesis so obvious as to be worthy of lampoon.

But it's the thinking-doing hypothesis that seems suspicious. If incidental thoughts are to direct behavior in powerful ways, it would suggest that cognition is asleep at the wheel. There seems to be this idea that the brain has no idea what to do from moment to moment, and so it goes rummaging about looking for whatever thoughts are accessible, and then it seizes upon one at random and acts on it.

The causal seeing-thinking-doing cascade starts to unravel when you think about the strength of the manipulation. Seeing probably causes some change in thinking, but there's a lot of thinking going on, so it can't account for that much variance in thinking. Thinking is probably related to doing, but then, one often thinks about something without acting on it.

The trickle-down cascade from minimal stimulus to changes in thoughts to changes in behavior would seem to amount to little more than a sneeze in a tornado. Yet this has been one of the most powerful ideas in social psychology, leading to arguments that we can reduce violence by keeping people from seeing toy guns, stimulate intellect through thoughts of professors, and promote prosocial behavior by putting eyes on the walls.


When I read Ontsporing, I saw a lot of troubling things: lax oversight, neurotic personalities, insufficient skepticism. But it's the historical perspective on social psychology that most jumped out to me. Stapel couldn't wrap his head around the idea that words and pictures aren't magic totems in the hands of social psychologists. He set out to study a field of null results. Rather than revise his theories, he chose a life of crime.

The continuing replicability crisis is finally providing some appropriately skeptical and clear tests of the seeing-thinking-doing hypothesis. In the meantime, I wonder: What exactly do we mean when we say "thoughts" are "activated"? How strong is the evidence that the activation of a thought can later influence behavior? And are there qualitative differences between the kind of thought associated with incidental primes and the kind of thought that typically guides behavior? The latter would seem much more substantial.

Thursday, June 2, 2016

Prior elicitation for directing replication efforts

Brent Roberts suggests the replication movement solicit federal funding to organize replication daisy chains. James Coyne suggests that the replication movement has already made a grave misstep by attempting to replicate findings that were always hopelessly preposterous. Who is in the right?

It seems to me that both are correct, but the challenge is in knowing when to replicate and when to dismiss outright. Coyne and the OSF seem to be after different things: the OSF has been very careful to make the RP:P about "estimating the replicability of psychology" in general rather than establishing the truth or falsity of particular effects of note. This motivated their decision to choose a random-ish sample of 100 studies rather than target specific controversial studies.

If, in contrast, we want to direct our replication efforts to where they will have the greatest probative value, we will need to first identify which phenomena we are collectively most ambivalent about. There's no point in replicating something that's obviously trivially true or blatantly false.

How do we figure that out? Prior elicitation! We gather a diverse group of experts and ask them to divide up their probability, indicating how big they think the effect size is in a certain experimental paradigm.

If most of the probability mass is away from zero, then we don't bother with the replication -- everybody believes in the effect already.

On the other hand, if the estimates are tightly clustered around zero, we don't bother with the replication -- it's obvious nobody believes it in the first place.

It's when the prior is diffuse, or evenly divided between the spike at zero and the slab outside zero, or bimodal, that we find the topic is controversial and in need of replication. That's the kind of thing that might benefit from a RRR or a federally-funded daisy chain.

Code below:
# Each panel draws the "slab" (a scaled Cauchy density over nonzero effect sizes).
# The arrow at zero is the "spike": its height is the prior probability
# that the effect is exactly zero.
x <- seq(-2, 2, .01)

# Plot 1: spike weight .1, slab weight .9 concentrated near delta = 1
plot(x, dcauchy(x, location = 1, scale = .3) * .9, type = 'l',
     ylim = c(0, 1),
     ylab = "Probability density",
     xlab = "Effect size (delta)",
     main = "All-but-certain finding \n Little need for replication")
arrows(0, 0, 0, .1)

# Plot 2: spike weight .9, slab weight .1 hugging zero
plot(x, dcauchy(x, location = 0, scale = .25) * .1, type = 'l',
     ylim = c(0, 1),
     ylab = "Probability density",
     xlab = "Effect size (delta)",
     main = "No one believes it \n Little need for replication")
arrows(0, 0, 0, .9)

# Plot 3: belief split evenly between the spike and a diffuse slab
plot(x, dcauchy(x, location = 0, scale = 1) * .5, type = 'l',
     ylim = c(0, .75),
     ylab = "Probability density",
     xlab = "Effect size (delta)",
     main = "No one knows what to think \n Great target for replication")
arrows(0, 0, 0, .5)

# Plot 4: no spike; two slabs predicting effects in opposite directions
plot(x, dcauchy(x, location = 1, scale = 1) * .5, type = 'l',
     ylim = c(0, .75),
     ylab = "Probability density",
     xlab = "Effect size (delta)",
     main = "Competing theories \n Great target for replication")
lines(x, dcauchy(x, location = -1, scale = 1) * .5)
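
The decision rule described above could be roughed out as follows. This is only a sketch under assumptions of my own: each expert reports a single number (their probability that the effect is exactly zero), the experts are pooled by a simple average, and the cutoffs of .2 and .8 are arbitrary. `needs_replication` is a hypothetical helper, not part of any package.

```r
# Hypothetical elicited priors: each value is one expert's probability that
# the effect is exactly zero (the "spike"); the rest of their belief is
# spread over nonzero effect sizes (the "slab").
needs_replication <- function(p_null, lower = 0.2, upper = 0.8) {
  pooled <- mean(p_null)            # simple linear pooling across experts
  pooled > lower && pooled < upper  # ambivalent on average: worth replicating
}

needs_replication(c(0.05, 0.10, 0.10))        # nearly everyone believes the effect
needs_replication(c(0.90, 0.95, 0.85))        # nearly no one believes it
needs_replication(c(0.50, 0.40, 0.60))        # no one knows what to think
needs_replication(c(0.05, 0.95, 0.10, 0.90))  # experts split into two camps
```

The first two calls return FALSE and the last two return TRUE. A fuller version would elicit the whole distribution over effect sizes rather than just the spike, so that competing directional predictions could also be flagged.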

Wednesday, June 1, 2016

Extraordinary evidence

Everyone seems to agree with the saying "extraordinary claims require extraordinary evidence." But what exactly do we mean by it?

In previous years, I'd taken this to mean that an improbable claim requires a dataset with strong probative value, e.g. a very small p-value or a very large Bayes factor. Extraordinary claims have small prior probability and need strong evidence if they are to be considered probable a posteriori.

However, this is not the only variety of extraordinary claim. Suppose that someone tells you that he has discovered that astrological signs determine Big Five personality scores. You scoff, expecting that he has run a dozen tests and wrestled out a p = .048 here or there. But no, he reports strong effects on every outcome: all are p < .001, with correlations in the r = .7 range. If you take the results at face value, it is clearly strong evidence of an effect.

Is this extraordinary evidence? In a sense, yes. The Bayes factor or likelihood ratio or whatever is very strong. But nested within this extraordinary evidence is another extraordinary claim: that his study found these powerful results. These effects are unusually strong for personality psychology in general, much less for astrology and personality in particular.

What kind of extraordinary evidence is needed to support that claim? In this post-LaCour-fraud, post-Reinhart-Rogoff-Excel-error world, I would suggest that more is needed than simply a screenshot of some SPSS output.

In ascending order of rigor, authors can support their extraordinary evidence by providing the following:

  1. The post-processed data necessary to recreate the result.
  2. The pre-processed data (e.g., single-subject e-prime files; single-trial data).
  3. All processing scripts that turn the pre-processed data into the post-processed data.
  4. Born-open data, data that is organized by Git to be saved and uploaded to the cloud in an automated script. This is an extension of the above -- it provides the pre-processed data, uploaded to the central, 3rd-party GitHub server, where it is timestamped.

Providing access to the above gives greater evidence that:

  1. The data are real, 
  2. The results match the data, 
  3. The processed data are an appropriate function of the preprocessed data, 
  4. The data were collected and uploaded over time, rather than cooked up in Excel overnight, and
  5. The data were not tampered with between initial collection and final report.
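
A lightweight way to support the last point, short of a full born-open Git workflow, is to record checksums of the raw files at collection time. The sketch below is a substitute technique rather than the born-open approach itself: it uses `tools::md5sum` from base R, the function names `make_manifest` and `changed_files` are my own, and the demo file is invented. MD5 will catch accidental or casual edits but is not proof against a determined forger, and a manifest only helps if it is stored somewhere the researcher cannot quietly rewrite.

```r
# Record an MD5 checksum for every file in a raw-data directory.
make_manifest <- function(data_dir) {
  files <- list.files(data_dir, full.names = TRUE)
  sums  <- tools::md5sum(files)
  data.frame(file = basename(files), md5 = unname(sums),
             stringsAsFactors = FALSE)
}

# List files whose contents no longer match a previously saved manifest.
changed_files <- function(data_dir, manifest) {
  current <- make_manifest(data_dir)
  merged  <- merge(manifest, current, by = "file",
                   suffixes = c("_old", "_new"))
  merged$file[merged$md5_old != merged$md5_new]
}

# Demonstration on a throwaway directory with a made-up data file
demo_dir <- file.path(tempdir(), "raw_data_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("id,rt", "1,512", "2,498"), file.path(demo_dir, "subject01.csv"))

manifest <- make_manifest(demo_dir)

# Simulate tampering: one reaction time is edited after the manifest was made
writeLines(c("id,rt", "1,512", "2,398"), file.path(demo_dir, "subject01.csv"))
changed_files(demo_dir, manifest)  # "subject01.csv"
```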

If people do not encourage data archival, a frustrating pattern may emerge: Researchers report huge effect sizes with high precision. These whopping results have considerable influence on the literature, meta-analyses, and policy decisions. However, when the data are requested, it is discovered that the data were hit by a meteor, or stolen by Chechen insurgents, or chewed up by a slobbery old bulldog, or something. Nobody is willing to discard the outrageous effect size from meta-analysis for fear of introducing bias, or of appearing biased. Techniques to detect and adjust for publication bias and p-hacking, such as p-curve and PET-PEESE, would be powerless to detect and adjust for bias so long as a few high-effect-size farces remain in the dataset.

[Image: The inevitable fate of many suspiciously successful datasets.]

As Nick Brown points out, this may be the safest strategy for fraudsters. At present, psychologists are not expected to be competent custodians of their own data. Little of graduate training concerns data archival. It is not unusual for data to go missing, and I have yet to find anybody who has been censured for failure to preserve their data. In contrast, accusations of fraud or wrongdoing require strong evidence -- the kind that can only be obtained by looking at the raw data, or perhaps by finding the same mistake, made repeatedly across a lifetime of fraudulent research. Somebody could go far by making up rubbish and saying the data were stolen by soccer hooligans, or whatever.

For a stronger, more replicable science, we must do more to train scientists in data management and incentivize data storage and sharing. Open science badges are nice. They let honest researchers signal their honesty. But they are not going to save the literature so long as meta-analysis and public policy statements must tiptoe around closed-data (or the-dog-ate-my-data) studies with big, influential results.

Monday, May 16, 2016

The value-added case for open peer reviews

Last post, I talked about the benefits a manuscript enjoys in the process of scientific publication. To me, it seems that the main benefits are that an editor and some number of peer reviewers read it and give edits. Somehow, despite this part coming from volunteer labor, it still manages to cost $1500 an article.

And yet, as researchers, we can't afford to try to do without the journals. When the paper appears with a sagepub.com URL on it, readers now assume it to be broadly correct. The journal publication is part of the scientific canon, whereas the preprint was not.

Since the peer reviews are what really elevate the research from preprint to publication, I think the peer reviews should be made public, as part of the article's record. This will open the black box and encourage readers to consider: Who thinks this article is sound? What do they think are the strengths and weaknesses of the research? Why?

By comparison, the current system provides only the stamp of approval. But we readers and researchers know that the stamp of approval is imperfect. The process is capricious. Sometimes duds get published. Sometimes worthy studies are discarded. If we're going to place our trust in the journals, we need to be able to check up on the content and process of peer review.

Neuroskeptic points out that, peer review being what it is, perhaps there should be fewer journals and more blogs. The only difference between the two, in Neuro's view, is that a journal implies peer review, which implies the assent of the community. If journal publication implies peer approval, shouldn't journals show the peer reviews to back that up? And if peer approval is all it takes to make something scientific canon, couldn't a blogpost supported by peer reviews and revisions be equivalent to a journal publication?

Since peer review is all that separates blogging from journal publishing, I often fantasize about sidestepping the journals and self-publishing my science. Ideally, I would just upload a preprint to OSF. Alongside the preprint there would be the traditional 2-5 uploaded peer reviews.

Arguably, this would provide an even higher standard of peer review, in that readers could see the reviews. This would compare favorably with the current system, in which howlers are met with unanswerable questions like "Who the heck reviewed this thing?" and "Did nobody ask about this serious flaw?"

Maybe one day we'll get there. In the meantime, so long as hiring committees, tenure committees, and granting agencies are willing to accept only journal publications as legitimate, scientists will remain powerless to self-publish. Until that changes, the peer reviews should really be open. The peer reviews are what separate preprint from article, and we pay millions of dollars a year to maintain that boundary, so we might as well place greater emphasis and transparency on that piece of the product.