Scientific research is difficult to do well, and people are flawed and biased. As Carl Sagan noted, science is not just an ideal abstraction, but is very much a human endeavor, and as such is messy and imperfect. Nature itself is random and quirky and doesn’t always cooperate with our desires to penetrate its secrets.  

The power of science as a tool for understanding the world comes largely from the fact that it is self-corrective – it doesn’t always get it right the first time, but it has the potential to fix any mistakes and get it right eventually. Part of this self-corrective power comes from the humble replication – scientists repeating the work of other scientists to see if they get the same results. If a certain result represents a real effect in the world, it should generally replicate, no matter who is doing the experiment. If the result represents some methodological error, or was just a chance finding, then it will not consistently show up in replication.

This fact has particularly plagued parapsychology research, whose research paradigms have not historically survived replication. Skeptics interpret this fact quite simply – psi effects are not real, they are the product of research error. Psi proponents have fancier explanations, which sound a lot like special pleading: skeptical replicators inhibit psi phenomena, or perhaps there is a “decline effect” inherent to psi phenomena itself. It may be the nature of psi that it is elusive in the laboratory, or repeatedly looking for it makes it go away. These are not convincing arguments.  

The latest psi research to fall victim to a failure to replicate is Daryl Bem’s “feeling the future” research. A recent article in PLoS ONE reports the results of three independently carried out precise replications of Bem’s 9th study. All three report completely negative results. This paper presents a good opportunity to delve deeply into the issues surrounding replication and research error.

The Experiment  

In early 2011 Bem published a series of 9 studies in The Journal of Personality and Social Psychology, a prestigious psychology journal. All of the studies followed a similar format, reversing the usual direction of standard psychology experiments to determine if future events can affect past performance.

In the 9th study, for example, subjects were given a list of words in sequence on a computer screen. They were then asked to recall as many of the words as possible. Following that they were given two practice sessions with half of the words, chosen by the computer at random. The results were then analyzed to see if practicing the words improved the subjects’ recall for those words in the past. Bem found that it did, with the largest effect size of any of the 9 studies.

Needless to say, these results were met with widespread skepticism. There are a number of ways to assess an experiment to determine if its results are reliable. You can examine the methods and the data themselves to see if there are any mistakes. You can also replicate the experiment to see if you get the same results.

A Bayesian Analysis

A somewhat controversial way to analyze an experiment is to evaluate the conclusion for plausibility and prior probability. Bem and others are fairly dismissive of plausibility arguments and feel that scientists should be open to whatever the evidence shows. If we dismiss results because we have already decided the phenomenon is not real, then how will we ever discover new phenomena?

On the other hand, it seems like folly to ignore the results of all prior research and act as if we have no prior knowledge. There is a workable compromise – be open to new phenomena, but put any research results into the context of existing knowledge. What this means is that we make the bar for rigorous evidence proportional to the implausibility of the phenomenon being studied. (Extraordinary claims require extraordinary evidence.)  

One specific manifestation of this issue is the nature of the statistical analysis of research outcomes. Some researchers propose that we use a Bayesian analysis of data, which in essence puts the new research data into the context of prior research. A Bayesian approach essentially asks – how much does this new data affect the prior probability that an effect is real?  

Wagenmakers et al reanalyzed Bem’s data using a Bayesian analysis and concluded that the data are not sufficient to reject the null hypothesis. (Incidentally, I interviewed Wagenmakers on the SGU about this very issue.) They further claim that the currently in vogue P-value analysis tends to overcall positive results. In reply, Bem claims that Wagenmakers used a ridiculously low prior probability in his analysis. In reality, it doesn’t matter much what you think the prior probability is – the Bayesian analysis showed that Bem’s data have very little effect on the probability that retrocausal cognition is real.
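To make that last point concrete, here is a minimal sketch in Python of how Bayesian updating works. The numbers are purely illustrative – they are not the actual figures from Bem’s data or from Wagenmakers’ reanalysis. The point is simply that when the Bayes factor is close to 1 (the data barely discriminate between the hypotheses), the posterior probability stays close to whatever prior you started with.

```python
# Minimal sketch of Bayesian updating. The Bayes factor and priors below
# are made-up illustrative values, not figures from the actual analyses.

def update(prior_prob, bayes_factor):
    """Return the posterior probability of an effect being real, given a
    prior probability and a Bayes factor (evidence for effect over null)."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# A Bayes factor near 1 means the data are nearly uninformative, so the
# posterior hardly moves no matter what prior you plug in.
for prior in (0.5, 0.01, 0.000001):
    print(f"prior = {prior:g}  posterior = {update(prior, bayes_factor=1.5):.6f}")
```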

Other Criticisms of Bem

The authors of the current paper replicating Bem’s research, Stuart J. Ritchie, Richard Wiseman, and Christopher C. French, outline other criticisms of Bem’s methods. They mention the Bayesian issue, but also that an analysis of the data shows an inverse relationship between effect size and number of subjects. In other words, the fewer the subjects, the greater the effect size. This could imply a process called optional stopping.

This is potentially very problematic. Related to this is the admission by Bem, according to the article, that he peeked at the data as they were coming in. The reason peeking is frowned upon is precisely because it can result in things like optional stopping – halting the collection of data in an experiment because the results so far are looking positive. This is a subtle way of cherry picking positive data. A predetermined stopping point should be chosen in advance to prevent this sort of subtle manipulation of the data.
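A quick simulation shows why this matters. This is my own toy illustration, not anything from the Ritchie et al. paper: the data are pure noise, yet if you peek after every small batch and stop as soon as the p-value dips below .05, you will “find” an effect far more often than the nominal 5% of the time.

```python
# Optional stopping under the null hypothesis: every data point is noise,
# but peeking after each batch and stopping at the first p < .05 inflates
# the false positive rate well above 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_experiment(batch_size=10, max_batches=20, alpha=0.05):
    """Collect data in batches and stop at the first 'significant' result."""
    data = []
    for _ in range(max_batches):
        data.extend(rng.normal(0, 1, batch_size))     # null is true: no real effect
        if stats.ttest_1samp(data, 0).pvalue < alpha:
            return True                               # peeked, saw "significance", stopped early
    return False                                      # reached the predetermined stopping point

runs = 2000
false_positives = sum(peeking_experiment() for _ in range(runs))
print(f"False positive rate with optional stopping: {false_positives / runs:.2%}")
# A fixed-N design analyzed once would stay near the nominal 5%.
```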

Another issue raised was the use of multiple comparisons. Researchers can collect lots of data by looking at many variables, and then make many comparisons among those variables. Sometimes they only publish the positive correlations, and may or may not disclose that they even looked at other comparisons. Sometimes they publish all the data, but the statistical analysis treats each comparison independently. In short, what this means is that if you look at 20 comparisons, each with a 1 in 20 chance of reaching statistical significance, then on average one comparison will come out significant by chance alone. You can then declare a real effect. But what should happen is that the statistical analysis is adjusted to account for the fact that 20 different comparisons were made, which can turn apparently positive results into negative ones.
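The arithmetic behind that “1 in 20” intuition is worth spelling out. The numbers below are generic, not taken from Bem’s studies, and the Bonferroni-style adjustment shown is one standard remedy, not necessarily the one the authors used.

```python
# With 20 independent comparisons at alpha = .05 and no true effects,
# the chance of at least one spurious "significant" result is large.
alpha, comparisons = 0.05, 20
family_wise_rate = 1 - (1 - alpha) ** comparisons
print(f"P(at least one false positive) = {family_wise_rate:.2f}")       # ~0.64

# One standard remedy (a Bonferroni-style adjustment): require each
# individual comparison to beat alpha / comparisons instead of alpha.
print(f"Adjusted per-comparison threshold = {alpha / comparisons:.4f}")  # 0.0025
```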

Finally, there was a serious issue raised with how the data were handled. Subjects occasionally made spelling errors when listing the words they recalled. The result may have been a non-word (like ctt for cat) or another word (like car for cat). Researchers had to go through and correct these misspellings manually.

The authors point out that these corrections were done in a non-blinded fashion, creating the opportunity to fudge the data toward the positive through the correction choices that are made. Bem countered that even if you removed the corrected words the results would still be positive, but that is still methodologically sloppy and likely still relevant, for reasons I will now get into.

Researcher Degrees of Freedom

As we see, there were many problems with the methods and statistical analysis of Bem’s original paper. Bem argues that each problem was small and by itself would not have changed the results. This argument, however, misses a critical point, made very clear in another recent paper.

Simmons et al published a paper demonstrating how easy it is to achieve false positive results by exploiting (consciously or unconsciously) so-called “researcher degrees of freedom.” In the abstract they write:

“In this article, we accomplish two things. First, we show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis.”

In my opinion this is a seminal paper that deserves wide distribution and discussion among skeptics, scientists, and the public. The paper discusses the fact that researchers make many decisions when designing and executing a study, and analyzing and reporting the data. Each individual decision may have only a small effect on the final outcome. Each decision may be made perfectly innocently, and can be reasonably justified.  

However, the cumulative effect of these decisions (degrees of freedom) could be to systematically bias the results of a study toward the positive. The power of this effect is potentially huge, and likely results in a significant bias towards positive studies in the published literature.  

But even worse, this effect can also be invisible. As the authors point out – each individual decision can seem quite reasonable by itself. The final published paper may not reflect the fact that the researchers, for example, looked at three different statistical methods of analysis before choosing the one that gave the best results.  
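As a toy illustration of that specific scenario – my own sketch, not one of the simulations from Simmons et al. – here is what happens when two identical groups of pure noise are compared with a few “reasonable” analysis options and only the most favorable one is reported:

```python
# Each analysis choice below is defensible on its own; picking whichever
# one "works" after seeing the data is what inflates the false positive rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
runs, n = 5000, 40
hits = 0
for _ in range(runs):
    group_a = rng.normal(0, 1, n)
    group_b = rng.normal(0, 1, n)          # identical distributions: no true effect
    candidate_p_values = [
        stats.ttest_ind(group_a, group_b).pvalue,           # plain t-test
        stats.mannwhitneyu(group_a, group_b).pvalue,        # nonparametric alternative
        stats.ttest_ind(group_a[5:], group_b[5:]).pvalue,   # "drop the warm-up trials"
    ]
    if min(candidate_p_values) < 0.05:     # report whichever analysis looked best
        hits += 1

print(f"False positive rate when choosing the best of three analyses: {hits / runs:.2%}")
# Any single choice, made in advance, would hover near 5%; combining more
# such choices pushes the rate dramatically higher, as Simmons et al. show.
```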

The authors lay out some fixes for this problem, such as researchers disclosing their methods prior to collecting data (and no peeking). But another check on this kind of bias in research is replication.  

The Power of Replication  

The degrees of freedom issue is one big reason that replicating studies, especially precise replications, is so important. A precise replication should have no degrees of freedom, because all the choices were already made by the original study. If the effect being researched is real, then the results should still come out positive. If they were the result of exploiting the degrees of freedom, then they should vanish.  

There are also the other recognized benefits of replication. The most obvious is that any unrecognized quirky aspects of study execution or researcher biases should average out over multiple replications. For this reason it is critical for replications to be truly independent.  

Another often missed reason why replications are important is simply to look at a fresh set of data. It is possible for a researcher, for example, to notice a trend in data that generates a hypothesis. That trend may have been entirely due to random clustering, however. If the data in which the trend was initially observed are then used in a study testing that hypothesis, the original random clustering is carried forward, creating the false impression that the hypothesis is confirmed.

Replication involves gathering an entirely new data set, so any prior random patterns would not carry forward. Only if there is a real effect should the new data reflect the same pattern.  
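Here is a small sketch of that carry-forward problem, again my own illustration built from made-up noise data: scan a batch of random variables for whichever one correlates best with an outcome, then “test” that hypothesis once on the same data and once on freshly generated data.

```python
# All variables here are pure noise, so any correlation we "notice" during
# exploration is random clustering. Testing on the same data preserves it;
# testing on a fresh sample does not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_variables = 30, 15
outcome = rng.normal(0, 1, n)
predictors = rng.normal(0, 1, (n_variables, n))

# Exploratory step: notice the variable that correlates best with the outcome.
p_values = [stats.pearsonr(x, outcome)[1] for x in predictors]
best = int(np.argmin(p_values))
print("p-value on the data that suggested the hypothesis:", round(p_values[best], 3))

# Replication step: measure the "same" variable and outcome in a fresh sample
# (still pure noise, because there was never a real effect to begin with).
fresh_outcome = rng.normal(0, 1, n)
fresh_predictor = rng.normal(0, 1, n)
print("p-value on a fresh data set:", round(stats.pearsonr(fresh_predictor, fresh_outcome)[1], 3))
```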

Replicating #9  

To get back to the recent replication of Bem’s research - Ritchie, Wiseman, and French each replicated, as precisely as possible, Bem’s protocol for his 9th study. They point out that sometimes even trivial deviations from the original protocol are used to dismiss negative replication results, so they note any changes they did make (all minor), such as changing some words to their British equivalents. They used the exact same software for the study, provided by Bem (to his credit, he did encourage and facilitate replications of his research).

In my assessment, after reading their methods, they seem like fair and accurate replications. Their results were entirely negative. A separate fourth replication, by Eric Robinson, was also negative.

The weight of the evidence (even without considering plausibility) is heavily on the side of concluding that “feeling the future” is not a real phenomenon. This has important implications that go beyond this one study and this one topic.  

This means that studies that look clean on paper, and that can be reasonably defended from criticism, can still be completely wrong. I cannot explain exactly why Bem’s results were positive when they were probably false positives (given the negative replications), but I can give reasons to be skeptical of the results. Invisible (or sometimes visible) researcher degrees of freedom are enough to generate false positive results.

The Straw Man vs Skepticism  

This is exactly why skeptics do not take research results seriously (especially when they appear to demonstrate the impossible) until they have been independently replicated sufficiently to show a consistent pattern of positive results with rigorous methods.  

Prominently displayed at the top of the Society for Psychical Research’s website is this quote:

"I shall not commit the fashionable stupidity of regarding everything I cannot explain as a fraud." - C.G.Jung 

Clearly that quote reflects the prevailing attitude among psi researchers toward external skepticism of their claims and research. Every skeptic who has voiced their opinion has likely been met with accusations of being dismissive and closed-minded.

But this is a straw man. Skeptics are open to new discoveries, even paradigm-changing revolutionary ideas. Often I am asked specifically – what would it take to make me accept psi claims? I have given a very specific answer. It would take research simultaneously displaying the following characteristics:

1 – Solid methodology (proper blinding, fresh data set, clearly defined end points, etc.)

2 – Statistically significant results

3 – Absolute magnitude of the effect size that is greater than noise level (a sufficient signal to noise ratio)

4 – Consistent results with independent replication.  

Most importantly, it would need to display all four of these characteristics simultaneously. Psi research cannot do that, and that is why I remain skeptical. These are the same criteria that I apply to any claim in science.  

In addition, I do think that prior probability should play a role – not in accepting or rejecting any claim a priori, but in setting the threshold for the amount and quality of evidence that will be convincing. This is reasonable – it would take more evidence to convince me that someone hit Bigfoot with their car than that they hit a deer with their car. There is a word for someone who accepts the former claim with a low threshold of evidence.  

You can convince me that psi phenomena are real, but it would take evidence that is at least as solid as the evidence that implies that such phenomena are probably not possible.  

It is also important to recognize that the evidence for psi is so weak, and of such a nature, that it is reasonable to conclude psi is not real even without considering plausibility. But it is probably not a coincidence that we consistently see either poor quality or negative research in areas that have very low plausibility.

Science may be difficult and complex, but we have gained much hard won knowledge and experience in how to apply science rigorously to arrive at reliable conclusions. All skeptics want is to apply that knowledge fairly and consistently to all claims. When we apply rigorous science to psi claims the best conclusion we can come to at this time is that they are very implausible and very probably not true.

 

Steven Novella, M.D. is the JREF's Senior Fellow and Director of the JREF’s Science-Based Medicine project.

Dr. Novella is an academic clinical neurologist at Yale University School of Medicine. He is the president and co-founder of the New England Skeptical Society and the host and producer of the popular weekly science show, The Skeptics’ Guide to the Universe. He also authors the NeuroLogica Blog.