September 12, 2011

Reproducible irreproducibility

[Image: The hunter and the mad scientist]

Marginal Revolution's "How good is published academic research?" links to some worrisome findings supporting the pharma industry rule of thumb that at least 50% of all published academic studies cannot be repeated in an industrial setting.

This is not just a pharma problem. My brother told me about his struggle to build a really good garbage collector for the Java engine his company developed. He surveyed the literature and found a number of papers describing significant improvements over other methods... except that the paper describing method A showed it beating methods B and C, while the papers on B and C showed each of them beating A. The likely reason is that the authors had carefully implemented and tuned their own methods, then ran them against fairly unoptimized versions of the competitors, or on test data that suited their own method best.

Improving reproducibility is very important. Any claim has a finite and non-negligible risk of being wrong. In the case of research, the probability of error P = P1 + P2 + P3 is (roughly, for small and independent error sources) the sum of the probability P1 of an "honest" mistake, the probability P2 of a dishonest or biased interpretation, and the probability P3 that the method itself is flawed. Repeating the experiment yourself mainly reduces P1. Having somebody else repeat it typically reduces both P1 and P2 (at least if they are disinterested in the result). And if you are lucky they do not just repeat it but do it in a different way, reducing P3 too.
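
A back-of-the-envelope sketch of this in Python, with made-up illustrative probabilities (the values of P1, P2 and P3 are assumptions, and treating each replication as independently "re-rolling" an error source is a simplification):

    # Hypothetical error probabilities for a single study (not real data).
    P1 = 0.10  # "honest" mistake
    P2 = 0.05  # dishonest or biased interpretation
    P3 = 0.03  # flawed method

    # For small, independent error sources, P(error) is roughly the sum.
    print("Single study:            %.3f" % (P1 + P2 + P3))

    # Repeating it yourself re-rolls only the honest-mistake component:
    print("Self-replication:        %.3f" % (P1**2 + P2 + P3))

    # A disinterested outsider re-rolls both P1 and P2:
    print("Outside replication:     %.3f" % (P1**2 + P2**2 + P3))

    # A different method on the same question re-rolls P3 as well:
    print("Different-method repeat: %.3f" % (P1**2 + P2**2 + P3**2))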

The amount of information we gain about a hypothesis by trying to reproduce it is surprisingly large. The probability that both you and N-1 independent replication attempts all get things wrong, even if each has the same error probability P, is P^N - it does not take many experiments to expose a discrepancy. Recognizing that it is a real discrepancy and not just bad experimental luck is much harder, of course.
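
To see how fast this shrinks, a minimal sketch (again assuming independent attempts, and an illustrative P = 0.3):

    import math

    P = 0.3  # illustrative per-experiment error probability
    for N in (1, 2, 3, 5):
        p_all_wrong = P ** N
        # -log2 gives a rough sense of the evidence, in bits, against
        # "they were all wrong" once N attempts agree.
        print("N=%d: P^N = %.4f (~%.1f bits)" % (N, p_all_wrong, -math.log2(p_all_wrong)))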

Since negative findings are systematically under-reported, repeating a finding carries lower status than making the claim first, and people often do not pay attention to updates of claims, it is not enough just to do the repeats. We need better ways of rewarding replication work and of collating the current state of the evidence clearly.

In the end, I suspect it comes down to science funding bodies realizing that they should reward the maximum amount of convergence towards truth (weighted by the importance of the question) rather than maximum coolness.

Posted by Anders3 at September 12, 2011 07:39 PM