Editorial: Just how significant is it?

by Geoff Hart

Previously published as: Hart, G. 2009. Editorial: Just how significant is it? the Exchange 16(3): 2, 7–9.

This editorial will seem a bit distant from the subject of this newsletter (scientific communication), but bear with me for a few hundred words and I hope the relevance will become clear.

A curious little article appeared in 2003 in the ordinarily staid and sober British Medical Journal: "Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials" (BMJ 327:1459-1461). In this paper, Gordon Smith and Jill Pell note, tongues firmly in cheek, that their goal was to determine whether parachutes were effective in preventing "major trauma related to gravitational challenge"—what the rest of us might call, in the vernacular, damage caused by striking the ground at high velocity after falling from an airplane. In their words:

"As with many interventions intended to prevent ill health, the effectiveness of parachutes has not been subjected to rigorous evaluation by using randomised controlled trials. Advocates of evidence based medicine have criticised the adoption of interventions evaluated by using only observational data. We think that everyone might benefit if the most radical protagonists of evidence based medicine organised and participated in a double blind, randomised, placebo controlled, crossover trial of the parachute."

Like all the best satire, their paper provides an important critique—here, one that cuts right to the heart of the modern scientific method. Currently, the gold standard for research is a placebo-controlled double-blind experimental design. The "placebo" part means that you study both what you're testing and a placebo that should, in theory, have no effect. The "double-blind" part refers to the fact that you conceal details of the treatment from both the researchers and the test subjects so that both are "blind" to the experimental conditions and therefore cannot consciously or subconsciously affect the results. For a result to be considered real, the treatment must produce better results than the placebo under these conditions. To learn more, see the Wikipedia article on this topic.

Such trials are critical in several areas of science, most notably in medicine, and have greatly advanced our knowledge by preventing many false or potentially misleading findings from entering the scientific literature. However, for various compelling reasons, it's not always possible to design such trials, as Smith and Pell so incisively observe. But not everyone accepts that argument, and some people who should know better insist on this standard of proof. It's worth noting that some of the most important scientific discoveries, including most early genetics and "natural history" (animals and plants) studies, did not use this approach at all, yet managed to produce important results. Perhaps more seriously, blinded placebo trials don't eliminate bias and error in how scientists frame their research question (i.e., you won't see what you're not looking for), and they don't eliminate problems with how researchers interpret their results.

One such problem relates to the modern dependence on the science of statistics, and specifically on the requirement for statistical replication, to provide reassurance that our observations are real. Replication involves observing the same process repeatedly under controlled conditions based on the belief that something that happens repeatedly is likely to have a predictable cause rather than arising solely from random chance. The statistical aspect of the approach relies on the mathematics of probability to provide an indication of how likely it is that we observed something purely by random chance. To learn more, see the Wikipedia article.

Again, this approach represents a tremendous step forward. Its key insight may be that it quantifies the likelihood that a result could have occurred purely as a result of random chance, thereby increasing our confidence that a result is real. Although the principle is sound and important, we must remember two key limitations:

First and most serious, statistical significance does not tell us whether a result is real or just a coincidence. It only tells us how likely it is that a result like the one we observed would arise by chance alone.
Second, statistical tests rely on an arbitrary decision about what probability level defines a significant finding.

Most science journals consider a result to be statistically significant if the likelihood of it occurring purely by chance is less than 5%. This sounds impressive until you think about what it means: at this level of significance, as many as 1 in 20 published results represent nothing more than bad luck on the part of the researchers rather than a valid result that points towards something real. Those are great odds if you're a gambler or a professional athlete in any major sport, but not so great if you believe that our world follows consistent, predictable rules that should produce consistent, predictable results. In such a world, wouldn't a 1% or 0.1% chance of error provide more confidence?
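To make the 5% figure concrete, here is a minimal Python sketch that simulates many experiments in which the treatment truly does nothing and counts how often chance alone produces a "significant" result. Everything in it is an assumption made for illustration: the function names, the sample sizes, the normally distributed measurements with a known standard deviation of 1, and the simple z-test. It shows only the false-alarm rate when no real effect exists; the share of published results that are false alarms also depends on factors the simulation does not model.

```python
import math
import random


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def false_positive_rate(n_experiments: int = 2000,
                        n_per_group: int = 30,
                        alpha: float = 0.05,
                        seed: int = 1) -> float:
    """Fraction of no-effect experiments that still look 'significant'.

    Both groups are drawn from the same distribution (mean 0, sd 1),
    so every 'significant' difference is a false alarm.
    """
    rng = random.Random(seed)
    # Standard error of the difference between two means when sd = 1.
    std_error = math.sqrt(2.0 / n_per_group)
    false_alarms = 0
    for _ in range(n_experiments):
        treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        difference = sum(treated) / n_per_group - sum(control) / n_per_group
        if two_sided_p_from_z(difference / std_error) < alpha:
            false_alarms += 1
    return false_alarms / n_experiments


if __name__ == "__main__":
    print(f"False-alarm rate at the 5% level: {false_positive_rate():.3f}")
    print(f"False-alarm rate at the 1% level: {false_positive_rate(alpha=0.01):.3f}")
```

Under these assumptions the first rate should land near 0.05, or roughly 1 in 20, and tightening the threshold to 1% lowers it accordingly.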

Please note: In no way am I suggesting that statistical replication is a bad idea, or that a 5% probability level is inadequate. In reality, many additional controls greatly reduce the likelihood of errors being published, including the fact that all research results are evaluated by a journal's editor and (most often) at least two peer reviewers in light of their knowledge of the body of research. That corpus has been validated by many other researchers who have repeated the research under different conditions, and who have tested each other's findings to ensure that they present a consistent picture. Studies that fail to produce statistically significant results are difficult to publish because there is no way to know whether the researchers simply screwed up; 1 in 20 may have. Statistically significant positive results that contradict accepted knowledge are similarly difficult to publish, at least until other researchers have replicated these results to "prove" that they are real.

Andrew Gelman and Hal Stern explored this problem and several related issues in their paper "The difference between 'significant' and 'not significant' is not itself statistically significant" (The American Statistician, November 2006). This is a subtle paper, and not an easy read, but it makes some important points. I'm not the only one to have observed that any particular threshold for statistical significance is arbitrary. Neither am I alone in pointing out that statistically significant results can have no practical significance; for example, a 1-point change in blood pressure in response to some medication may be statistically significant, but for someone who must lower their systolic blood pressure from 160 to 120, it's not a meaningful result.

Their more important point is that a small and not necessarily meaningful change in the measured values can change a statistically non-significant result (with a 5.1% significance level) into a statistically significant result (with a 4.9% significance level). It's not the arbitrariness of the 5% criterion that is important here, but rather the fact that the change in the data that is required to move a result from significant to non-significant may not be meaningful in any practical sense. This makes it risky to compare two results based on their reported significance levels: the fact that one result has a significance level of 4.9% and another has a significance level of 5.1% does not make the former result more meaningful. This is particularly true if one of the two studies reveals a small actual difference and the other a large one: the small significant result may not be meaningful, whereas the large non-significant result may be very meaningful indeed. It's often more important to compare the magnitudes of two results (do both studies say the same thing?), not just their significance levels.
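The following Python sketch works through this kind of comparison with made-up numbers: one study estimates an effect of 25 with a standard error of 10, and a second estimates 10 with the same standard error. The figures, the function names, and the assumptions that the two studies are independent and that a normal approximation applies are all chosen for illustration and do not come from any real data set.

```python
import math


def two_sided_p(estimate: float, std_error: float) -> float:
    """Two-sided p-value for an estimate, using a normal approximation."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))


def p_for_difference(est_a: float, se_a: float,
                     est_b: float, se_b: float) -> float:
    """p-value for the difference between two independent estimates."""
    difference = est_a - est_b
    # Variances add for independent estimates.
    se_difference = math.sqrt(se_a ** 2 + se_b ** 2)
    return two_sided_p(difference, se_difference)


# Hypothetical study results: (estimate, standard error).
study_a = (25.0, 10.0)
study_b = (10.0, 10.0)

p_a = two_sided_p(*study_a)
p_b = two_sided_p(*study_b)
p_diff = p_for_difference(*study_a, *study_b)

print(f"Study A: p = {p_a:.3f}")
print(f"Study B: p = {p_b:.3f}")
print(f"A minus B: p = {p_diff:.3f}")
```

With these numbers, study A comes out significant at the 5% level (p of about 0.01) and study B does not (p of about 0.32), yet the difference between them is itself far from significant (p of about 0.29). Sorting the two studies into "significant" and "not significant" would therefore overstate how much they disagree.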

So: how does all this relate to the task of scientific communication? First, as in the Smith and Pell paper, it means that we must exercise some judgment in assessing and communicating the methods used to produce experimental results. Sometimes the traditional techniques of a trained and objective observer, combined with a reliance on extensive experience with how the world works, can reveal important results, even if the methodology is not state of the art. Indeed, most engineering proceeds based on the assumption that there's no need to test design alternatives that we know will prove ineffective just to prove that effective alternatives really work better than the failures. Of course, we should never ignore an opportunity to test a conclusion using the most rigorous methods available, but neither should we insist on those methods when there are clear ethical or practical problems that prevent their use. If a research result appears important and has important consequences, it may be necessary to find a way to draw attention to that result, yet without blinding ourselves or our audience to potential flaws in the methodology that produced it.

Second, as in the Gelman and Stern paper, we must not rely exclusively on simple numerical values (levels of statistical significance) to judge whether a result is important. We must always remember the arbitrary nature of statistical levels of significance, and never assume that something is meaningful simply because it is statistically significant; neither must we assume that something is wrong and unimportant if it fails to meet such criteria. We must, of course, rigorously examine research results; problems arise most often when we use that rigor to obviate the need to actually think about the results.

For us as communicators, this can lead to thorny challenges. Because we are most often not experts in a particular subject, we must work closely with the experts to ensure that we understand what we must communicate: Is it meaningful even if the methodology was flawed or the result was not statistically significant? Is it important even if the methodology was scrupulous and the results statistically significant? How much confidence should we place in the results? In short, how significant (in all senses of the word) is the result that we're being asked to communicate?

My essays on scientific communication have now been collected in the following book:

Hart, G. 2011. Exchanges: 10 years of essays on scientific communication. Diaskeuasis Publishing, Pointe-Claire, Que. Printed version, 242 p.; eBook in PDF format, 327 p.