Fractal Pensive Ziztur
Freedom of the Mind.
Ziztur.com

Sunday, June 7, 2009

“Why Most Published Research Findings are False”

“Orac” over at Respectful Insolence has a writing style that’s fairly prone to offend—definitely pugnacious, and very fond of side swipes at those he dislikes (primarily alternative medicine quacks)—and I don’t blame him for his distaste, which in fact I share, but it does sometimes make his essays a bit harder to slog through. (He also has an inordinate fondness for beginning sentences with Indeed. This is one area where I can tentatively claim superiority: I can also be pugnacious and come off as offensive, but while I am no less prone than Orac to complicated sentence structure, I’ve never been accused of any such repetitive verbal tic.)

However, those foibles aside, he has written some very good stuff (he’s on my list of blogs I read daily for a reason), and this article, summarising and explaining the work of one John Ioannidis, was very interesting indeed. The claim it looks at is a puzzling one: Given a set of published clinical studies reporting positive outcomes, all at a confidence level of 95%, we should expect more than 5% to give wrong results; and, furthermore, studies of phenomena with low prior probability are more likely to give false positives than studies where the prior probabilities are high. Orac has often cited this result as a reason why we should be even more skeptical of trials of quackery like homeopathy than the confidence levels and study powers suggest, but I have to confess I never quite understood it.

I would suggest that you go read the article (or this take, referenced therein), but at the risk of being silly in summarising what is essentially a summary to begin with…here’s the issue, along with some prefatory matter for the non-statisticians:

A Type I error is a false positive: We seem to see an effect where there is no effect, simply due to random chance. This sort of thing does happen. Knowing how dice work, I may hypothesise that if you throw a pair of dice, you are not likely to throw two sixes, but one time out of every 36 (¹/₆×¹/₆), you will. I can confidently predict that you won’t roll double sixes twice in a row, but about one time in 1,296, you will. Any time we perform any experiment, we may get this sort of effect, so a statistical test, such as the one used in a medical trial, has a confidence level, where a confidence level of 95% means that, when there really is no effect, there’s a 5% chance of a Type I error.
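For anyone who wants to check the dice arithmetic, here is a minimal Python sketch. It assumes fair, independent dice, and the variable names are my own.

```python
from fractions import Fraction

# Probability of a six on one fair die.
p_six = Fraction(1, 6)

# Two independent dice both showing six on one throw of the pair.
p_double_six = p_six * p_six

# Double sixes on two consecutive throws of the pair.
p_double_six_twice = p_double_six ** 2

print(p_double_six)        # 1/36
print(p_double_six_twice)  # 1/1296
```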

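There’s also a Type II error, or false negative, where the hypothesis is true but the results just aren’t borne out on this occasion. The counterpart of the confidence level here is the study’s power, the probability of detecting an effect that really exists. Unlike the conventional 95%, though, power has no standard value: It varies from study to study with things like the sample size and the size of the effect.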

This latter observation is a bit problematic, and leads into what Ioannidis observed:
Suppose there are 1000 possible hypotheses to be tested. There are an infinite number of false hypotheses about the world and only a finite number of true hypotheses so we should expect that most hypotheses are false. Let us assume that of every 1000 hypotheses 200 are true and 800 false.

It is inevitable in a statistical study that some false hypotheses are accepted as true. In fact, standard statistical practice [i.e. using a confidence level of 95%] guarantees that at least 5% of false hypotheses are accepted as true. Thus, out of the 800 false hypotheses 40 will be accepted as "true," i.e. statistically significant

It is also inevitable in a statistical study that we will fail to accept some true hypotheses (Yes, I do know that a proper statistician would say "fail to reject the null when the null is in fact false," but that is ugly). It's hard to say what the probability is of not finding evidence for a true hypothesis because it depends on a variety of factors such as the sample size but let's say that of every 200 true hypotheses we will correctly identify 120 or 60%. Putting this together we find that of every 160 (120+40) hypotheses for which there is statistically significant evidence only 120 will in fact be true or a rate of 75% true.
Did you see that magic? Our confidence level was 95%, no statistics were abused, no mistakes were made (beyond the ones falling into that 5% gap, which we accounted for), and yet we were only 75% correct.
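The bookkeeping in the quoted example can be sketched in a few lines of Python. This is only an illustration using the quote's own numbers (1000 hypotheses, 200 of them true, a 5% false positive rate, 60% of true hypotheses detected); the function and parameter names are made up for the purpose.

```python
def true_share_of_positives(n_hypotheses, n_true, alpha, power):
    """Return (true positives, false positives, share of positives that are true)."""
    n_false = n_hypotheses - n_true
    false_positives = alpha * n_false  # false hypotheses that pass the test anyway
    true_positives = power * n_true    # true hypotheses the studies manage to detect
    share_true = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, share_true


tp, fp, share = true_share_of_positives(1000, 200, alpha=0.05, power=0.60)
print(f"{tp:.0f} true positives, {fp:.0f} false positives, {share:.0%} of positives are true")
```

Note that the 95% confidence level only enters through `alpha`; the 75% comes from combining it with how many hypotheses were true to begin with and how many of those were detected.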

At the root of this is, of course, the ubiquitous problem of publication bias: Researchers like to publish positive-outcome studies rather than negative ones, people like to read them, and so journals like to print them, because a journal detailing a long list of ideas that turned out to be wrong isn’t very exciting. Published studies are therefore biased in favour of positive outcomes. (If not, all 800 studies of false hypotheses would have been published and the problem would disappear.)

Definition time again: A prior probability is essentially a plausibility measure before we run an experiment. Plausibility sounds very vague and subjective, but can be pretty concrete. If I know that it rains on (say) 50% of all winter days in Vancouver, I can get up in the morning and assign a prior probability of 50% to the hypothesis that it’s raining. (I can then run experiments, e.g. by looking out a window, and modify my assessment based on new evidence to come up with a posterior probability.)
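To make the step from prior to posterior concrete, here is a small Python sketch of Bayes' rule applied to the rain example. Only the 50% prior comes from the text above. The two likelihoods (how often the pavement looks wet when it is or isn't raining) are hypothetical numbers picked purely for illustration, as are the names.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule: probability of the hypothesis after seeing the evidence."""
    numerator = p_evidence_if_true * prior
    return numerator / (numerator + p_evidence_if_false * (1 - prior))


prior_rain = 0.5     # from the text: rain on 50% of winter days
p_wet_if_rain = 0.9  # hypothetical: pavement looks wet when it is raining
p_wet_if_dry = 0.2   # hypothetical: pavement looks wet anyway (say, earlier rain)

# Looking out the window and seeing wet pavement raises the 50% prior.
print(f"{posterior(prior_rain, p_wet_if_rain, p_wet_if_dry):.2f}")  # 0.82
```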

Now we can go on to look at why Orac is so fond of holding hypotheses with low prior probabilities to higher standards. It’s pretty simple, really: Recall that the reason why we ended up with so many false positives above—the reason why false positives were such a large proportion of the published results—is because there were more false hypotheses than true hypotheses. The more conservative we are in generating hypotheses, the less outrageous we make them, the more likely we are to be correct, and the fewer false hypotheses we will have (in relation to true hypotheses). Put slightly differently, we’re more likely to be right in medical diagnoses if we go by current evidence and practice than if we make wild guesses.

Now we see that modalities with very low prior probability, such as ones with no plausible mechanism, should be regarded as more suspect. Recall that above, we started out with 800 false hypotheses (out of 1000 total hypotheses), ended up accepting 5% = 40 of them, and that
It's hard to say what the probability is of not finding evidence for a true hypothesis because it depends on a variety of factors such as the sample size but let's say that of every 200 true hypotheses we will correctly identify 120 or 60%. Putting this together we find that of every 160 (120+40) hypotheses for which there is statistically significant evidence only 120 will in fact be true or a rate of 75% true.
That is, the proportion of true hypotheses to false hypotheses affects the accuracy of our answer. This is very easy to see—let’s suppose that only half of the hypotheses were false; now we accept 5% of 500, that is 25 false studies, and keeping the same proportions,
…Let's say that of every [500] true hypotheses we will correctly identify [300] or 60%. Putting this together we find that of every [325 (300+25)] hypotheses for which there is statistically significant evidence only [300] will in fact be true or a rate of [92%] true.
We’re still short of that 95% measure, but we’re way better than the original 75%, simply by making more plausible guesses (within each study, the chances of a Type I or a Type II error were exactly the same as before). What matters is the ratio of true to false hypotheses: The less plausible an idea is, the higher the proportion of false hypotheses will be among all the hypotheses it generates. Wild or vague ideas (homeopathy, reiki, …) are very likely to generate false hypotheses along with any true ones they might conceivably generate. More conventional ideas will tend to generate a higher proportion of true hypotheses: If we know from long experience that Aspirin relieves pain, it’s very likely that a similar drug does likewise.
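The dependence on prior plausibility can be sketched as a one-line formula in Python. The 5% false positive rate and 60% detection rate are the ones from the quoted example, and the 50% and 20% priors are the two cases worked through above. The 5% and 1% priors are hypothetical values I have added to show what happens with less plausible ideas, and the function name is my own.

```python
def positive_predictive_value(prior, alpha=0.05, power=0.60):
    """Share of statistically significant results that are actually true."""
    true_positives = power * prior         # true hypotheses that get detected
    false_positives = alpha * (1 - prior)  # false hypotheses that pass anyway
    return true_positives / (true_positives + false_positives)


# 50% and 20% are the cases from the text; 5% and 1% are hypothetical.
for prior in (0.5, 0.2, 0.05, 0.01):
    share = positive_predictive_value(prior)
    print(f"prior {prior:.0%}: {share:.0%} of positive results are true")
```

With the same test in every case, the share of positive results that are true drops from about 92% at a 50% prior to 75% at 20%, and under these assumptions to roughly 39% at 5% and 11% at 1%.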

This is not to say that no wild ideas are ever right. Of course they sometimes are (though of course they usually aren’t). What it does mean is that not only should we be skeptical and demand evidence for them, there are sound statistical reasons to set the bar of evidence even higher for implausible than for plausible modalities.

It is also a good argument for the move away from strict EBM (evidence-based medicine) to SBM (science-based medicine) where things like prior probability are taken into account. Accepting double-blind trials at the 95% confidence level at face value isn’t good enough.


4 Comments:

Anonymous Saint Gasoline said...

Great post, Petter. I wasn't aware of these studies, though they certainly validate my intuitive feelings about the reliability of medical studies and the noise they are capable of generating.

Nevertheless, even with EBM, it is possible to recognize positive results that are likely just statistical noise through analyzing multiple studies (keeping in mind the positive publication bias, of course). It seems that science-based medicine is more of a pragmatic approach, though, and it certainly makes sense in a world where research opportunities are limited and shouldn't be wasted, and where EBM is often exploited by CAM practitioners who latch onto the few positive results. Of course, doctors and professionals should know that a few positive results don't mean anything if most results are negative, but this doesn't trickle down to the lay public, as the lay media is even worse with positive publication bias and seeking out studies that "defy explanation" and "stun scientists"---which is a large part of the reason people think acupuncture and other silly treatments work.

If only we lived in a perfect world, where EBM would suffice!

June 7, 2009 1:33 PM  
Anonymous Devysciple said...

Wow! If my math teacher had told me some ten years ago that I would one day be *fascinated* by a piece on statistics, I would have told him to go forth and multiply.
A really interesting read, thanks for summing it up and explaining it. Even those who are not savvy when it comes to statistics can follow the line of thinking.

June 7, 2009 1:56 PM  
Blogger Petter Häggholm said...

The one thing I should like to add is that while the context of all this is medical trials, it applies no less to any other research. Medical trials are very good examples of this stuff, because apart from all the usual problems in research, they also have to wrestle with vast commercial interests, emotionally charged subjects, and new and interesting confounding factors like placebo, Hawthorne effect, etc. As such, I generally suspect that in terms of statistical and “book-keeping” aspects of experimental design and analysis, medical trials are among the most difficult, and anything that applies to them applies even more easily to, say, physical experiments.

June 7, 2009 8:27 PM  
Blogger Ziztur said...

It's also really important to look at the effect size of a given statistical event. It is one thing to say "X is statistically significantly different from Y" and an entirely different thing to say to what extent they are different.

June 9, 2009 9:19 AM  
