This article is about two common problems with “statistical significance” in medical research. Both problems are particularly rampant in the study of massage therapy, chiropractic, and alternative medicine in general, and are wonderful examples of why science is hard, “why most published research findings are false”1 and genuine robust treatment effects are rare:2
Stats are hard and scary, of course — everyone knows that. But there will be comic strips, funny videos, and diagrams. I’ll try to make it worth your while, and I’ll try to simplify dramatically without butchering underlying math. Angels fear to tread here!
Research can be statistically significant, but otherwise unimportant. Statistical significance on its own is the sound of one hand clapping. But researchers often focus on the the positive: “Hey, at least we’ve got statistical significance! Maybe!” So they summarize their findings as “significant” without telling us the size of the effect they observed, which is a little devious or sloppy. Almost everyone is fooled by this — except 98% of statisticians — because the word “significant” carries so much weight. It really sounds like a big deal, like good news.
But it’s like bragging about winning a lottery without mentioning that you only won $25.3
Statistical significance without other information really doesn’t mean all that much. It is not only possible but common to have clinically trivial results that are nonetheless statistically significant. How much is that statistical significance is worth? It depends … on details that are routinely omitted.
Which is convenient if you’re pushing a pet theory, isn’t it?
Imagine a study of a treatment for pain, which has a statistically significant effect, but it’s a tiny effect: that is, it only reduces pain slightly. You can take that result to the bank (supposedly) — it’s real! It’s statistically significant! But … no more so than a series of coin flips that yields enough heads in a row to raise your eyebrows. And the effect was still tiny. So calling these results “significant” is using math to put lipstick on a pig.
There are a lot of decorated pigs in research: “significant” results that are possibly not even that, and clinically boring in any case.
Just because a published paper presents a statistically significant result does not mean it necessarily has a biologically meaningful effect.
Science Left Behind: Feel-Good Fallacies and the Rise of the Anti-Scientific Left, Alex Berezow & Hank Campbell
If you torture data for long enough, it will confess to anything.
Ronald Harry Coase
Statistical significance is boiled down to one convenient number: the infamous, cryptic, bizarro and highly over-rated P-value. Cue Darth vader theme. This number is “diabolically difficult” to understand and explain, and so p-value illiteracy and bloopers are epidemic (Goodman identifies ““A dirty dozen: twelve p-value misconceptions””4). It seems to be hated by almost everyone who actually understands it, because almost no one else does. Many believe it to be a blight on modern science.5 Including the American Statistical Association6 — and if they don’t like it, should you?
The mathematical soul of the p-value is, frankly, not really worth knowing. It’s just not that fantastic an idea. The importance of scientific research results cannot be jammed into a single number (and nor was that ever the intent). And so really wrapping your head around it no more important than learning the gritty details of the Rotten Tomatoes algorithm when you’re trying to decide whether to see that new Godzilla (2014) movie.7
What you do need to know is the role that p-values play in research today. You need to know that “it depends” is a massive understatement, and that there are “several reasons why the p-value is an unobjective and inadequate measure of evidence”8 Because it is so often abused, it’s way more important to know what the p-value is NOT than what it IS. For instance, it’s particularly useless when applied to studies of really outlandish ideas. Because it is so often abused, it’s way more important to know what it’s not than what it is. For instance, it’s particularly useless when applied to studies of really outlandish ideas.9 And yet it’s one of the staples of pseudoscience,10 because it is such an easy way to make research look better than it is.
Above all, a good p-value is not a low chance that the results were a fluke or false alarm — which is by far the most common misinterpretation (and the first of Goodman’s Dirty Dozen). The real definition is a kind of mirror image of that:11 it’s not a low chance of a false alarm, but a low chance of an effect that actually is a false alarm. The false alarm is a given! That part of the equation is already filled in, the premise of every p-value. For better or worse, the p-value is the answer to this question: if there really is nothing going on here, what are the odds of getting these results? A low number is encouraging, but it doesn’t say the results aren’t a fluke, because it can’t12 — it was calculated by assuming they are.
The only way to actually find out if the effect is real or a fluke is to do more experiments. If they all produce results that would be unlikely if there was no real effect, then you can say the results aren’t probably real. The p-value alone can only be a reason to check again — not statistical congratulations on a job well done. And yet that’s exactly how most researchers use it. And most science journalists.13
Head hurting already? Time for the stick people to take over this tutorial.
Geeky web comic xkcd illustrated the trouble with statistical significance better than I can with my paltry words.
A “5% chance of coincidence” is actually a fairly strong chance of a coincidence. That’s one in twenty. That can happen. Something with a one in twenty chance of happening each day is going to happen more than once per month.
Randall Munroe hides an extra little joke in all his comic strips (mouse over his comic strips on his website, wait a moment, and they are revealed). This time it was:
'So, uh, we did the green study again and got no link. It was probably a--' 'RESEARCH CONFLICTED ON GREEN JELLY BEAN/ACNE LINK; MORE STUDY RECOMMENDED!'
A great deal of crap science is presented in exactly this way. It’s one of the main ways that “studies show” a lot of things help pain that aren’t actually do no such thing. It’s one of the easiest ways for the “controversy” over many alternative treatments can be extended: by citing “significant” evidence of benefit, with data exactly as absurd as in the comic strip above.
It’s actually statistically normal for studies to make bad treatments look good. It happens regularly.
A classic real world example of “statistically significant but clinically trivial” is the supposedly proven benefit of chiropractic adjustment: how much benefit, exactly? “Less than the threshold for what is clinically worthwhile,” as it turns out, according to Nefyn Williams, author of a 2011 paper in International Musculoskeletal Medicine.14 I have been pointing out that spinal adjustment benefits seem to be real-but-minor for many years now, and I explore that evidence in crazy detail in my low back pain and neck pain tutorials. It’s a big topic, with lots of complexity.
Williams’ paper offers an oblique perspective that is quite different and noteworthy: his paper is not about chiropractic adjustment, but about the concept of clinical significance itself. There are various ways of measuring improvement in scientific tests of treatments, and, as Williams explains, “when an outcome measure improves by, say, five points it is not immediately apparent what this means.” How much improvement matters? After explaining and discussing various proposed standards and methods, Williams needed a good example to make his point. It’s quite interesting that he picked spinal manipulative therapy.
Chondroitin sulfate is a “nutraceutical” — a food-like nutritional supplement that is supposedly “good for cartilage” (because it is major component of cartilage). It has been heavily studied, but there has never been any clear good scientific news about it, and it bombed a particularly large and good quality test in 2006 (see Clegg).
So it was a bit hard to believe my eyes when I read the summary of a 2011 experiment claiming that chondroitin sulfate “improves hand pain.”15 Really?!
No, not really. On a 100mm VAS (a pain scale, “visual analogue scale”), the treatment group was 8.77mm happier with their hands. With a p=.02 (a middlin’ p-value, neither high nor low). So basically what the researchers found is a chance that chondroitin makes a small difference in arthritis pain. It’s not nothing, but it is an incredibly unimpressive result — pretty much the definition of clinically insignificant. The authors’ interpretation is like taking the dog to the end of the driveway and saying you took him for a walk. Technically true ...
So that is a lovely demonstration of the abuse of statistical significance!
The first significance “problem” is almost like a trick or truth bending. It works well and fools a lot of people — sometimes, I think, even the scientists or doctors who are using it — because the usage of the term “significant” is often technically correct and literally true, but obscures and diverts attention away from the whole story. Thus it is more like a subtle and technical lie of omission than an actual error.
That’s pretty bad! But it’s not the half of it. It gets much worse: a lot of so-called significant results aren’t even technically correct. Problem #2 is an actual error — something that would get you a failing grade in a basic statistics course.
a stark statistical error so widespread it appears in about half of all the published papers surveyed from the academic neuroscience research literature.
Dr. Steven Novella also wrote about it for ScienceBasedMedicine.org recently, adding that
there is no reason to believe that it is unique to neuroscience research or more common in neuroscience than in other areas of research.
And it is not. Dr. Christopher Moyer is a psychologist who studies massage therapy:
I have been talking about this error for years, and have even published a paper on it. I critiqued a single example of it, and then discussed how the problem was rampant in massage therapy research. Based on the Nieuwenhuis paper, apparently it’s rampant elsewhere as well, and that is really unfortunate. Knowing the difference between a within-group result and a between-groups result is basic stuff.
Clinical trials are all about comparing treatments. To be considered effective, a real treatment has to work better than a fake one — a placebo. A drug must produce better results than a sugar pill. If that difference is big enough, it is “statistically significant.” There are a lot of details, but that’s the beating heart of a fair scientific test: can the treatment beat a fake?
What you can’t do is just compare the treatment to nothing at all and say, “See, it works: huge difference! Huge improvement over nothing!” The problem is that both effective medicines and placebos can beat nothing — it just doesn’t mean much until you know the treatment can also trounce a placebo. On its own, a “statistically significant” difference between treatment and nothing at all is the sound of one hand clapping. A meaningful comparison has to be a statistical ménage à trois, comparing all three to each other (analysis of variance, or ANOVA).
The error is the failure to do this. And, shockingly, Nieuwenhuis et al reported that more than half of researchers were making this mistake: comparing treatments and placebos to nothing, but not to each other.
Studies of massage therapy (and others, like chiropractic17) are particularly plagued by this error. Why? Because massage is so much “better than nothing.” The size of that difference looms large, and so it’s all too easy to mistake it for the one that matters — and fail to even compare the treatment to a placebo. And it’s really hard to come up with a meaningful placebo. It’s notoriously difficult to give a patient a fake massage. (They catch on.)
Statistics does not care about these difficulties: you still can’t compare massage to nothing, stop there, and call the difference “significant.” You still have to do your ANOVA, and massage still has to beat some kind of placebo before it can be considered more effective than pleasant grooming.
This research problem is not limited to massage, but massage is probably be the single best example of it. It crops up when you’re studying any treatment that involves a lot of interaction. The more interaction, the worse the problem gets. It’s a big deal in massage research because massage involves a lot of interaction, much of which is pleasant and emotionally engaging. It’s notoriously difficult to give a patient a fake massage. (They catch on.)Interaction with a friendly health care provider has a lot of surprisingly potent effects: people react strongly and positively to compassion, attention, and touch. The problem is that those benefits have nothing to do with any specific “active ingredient” in a massage. Grooming is just nice. It’s like pizza: even when it’s bad, it’s pretty good.
Much of the good done by therapists of all kinds is attributable to potent placebos driven by their complex interactions with patients, and not by anything in particular that they are doing to the patient. To find out how well a therapy works, it must be compared to sham treatments which are as much like the treatment as possible. This is hard to do, and it has rarely been done well. It’s much more typical to compare therapy to something too lifeless and “easy to beat,” to much like comparing it to nothing at all instead of a real placebo. And there’s the difference error: comparison to the wrong thing, statistical significance of the wrong difference.
I am a science writer, former massage therapist, and I was the assistant editor at ScienceBasedMedicine.org for several years. I have had my share of injuries and pain challenges as a runner and ultimate player. My wife and I live in downtown Vancouver, Canada. See my full bio and qualifications, or my blog, Writerly. You might run into me on Facebook or Twitter.
— Science update: citation to Pereira 2012 about the lack of large treatment effects in medicine.
This intensely intellectual paper — it’s completely, hopelessly nerdy — became one of the most downloaded articles in the history of the Public Library of Science and was described by the Boston Globe as an instant cult classic. Despite the title, the paper does not, in fact, say that “science is wrong,” but something much less sinister: that it should take rather a lot of good quality and convergent scientific evidence before we can be reasonably sure of something, and he presents good evidence that a lot of so-called conclusions are premature, not as “ready for prime time” as we would hope. This is not the least bit surprising to good scientists, who never claimed in the first place that their results are infallible or that their conclusions are “true.”
I go into much more detail here: Ioannidis: Making Medical Science Look Bad Since 2005.BACK TO TEXT
A “very large effect” in medical research is probably exaggerated, according to Stanford researchers. Small trials of medical treatments often produce results that seem impressive. However, when more and better trials are performed, the results are usually much less promising. In fact, “most medical interventions have modest effects” and “well-validated large effects are uncommon.”BACK TO TEXT
Let’s spell that out a bit: a friend approaches you, and excitedly reports that he had won some money — but rather than telling you how much, he brags only about the odds: “It was a twenty to one!” he says happily, and goes for a high five. You indulge him with a high five, and then demand to know how much he actual won.
“Um, three dollars. But I won!”
This is how silly it is to declare statistical significance without clear reference to effect size.BACK TO TEXT
BACK TO TEXT
P values have always had critics. In their almost nine decades of existence, they have been likened to mosquitoes (annoying and impossible to swat away), the emperor's new clothes (fraught with obvious problems that everyone ignores) and the tool of a “sterile intellectual rake” who ravishes science but leaves it with no progeny. One researcher suggested rechristening the methodology “statistical hypothesis inference testing”, presumably for the acronym it would yield.
Misuse of the P value — a common test for judging the strength of scientific evidence — is contributing to the number of research findings that cannot be reproduced, the American Statistical Association (ASA) warns in a statement released today. The group has taken the unusual step of issuing principles to guide use of the P value, which it says cannot determine whether a hypothesis is true or whether results are important.BACK TO TEXT
BACK TO TEXT
Reporting p values from statistical significance tests is common in psychology's empirical literature. Sir Ronald Fisher saw the p value as playing a useful role in knowledge development by acting as an "objective" measure of inductive evidence against the null hypothesis. We review several reasons why the p value is an unobjective and inadequate measure of evidence when statistically testing hypotheses. A common theme throughout many of these reasons is that p values exaggerate the evidence against H0. This, in turn, calls into question the validity of much published work based on comparatively small, including .05, p values. Indeed, if researchers were fully informed about the limitations of the p value as a measure of evidence, this inferential index could not possibly enjoy its ongoing ubiquity. Replication with extension research focusing on sample statistics, effect sizes, and their confidence intervals is a better vehicle for reliable knowledge development than using p values. Fisher would also have agreed with the need for replication research.
BACK TO TEXT
The authors illustrate the difficulties involved in obtaining a valid statistical significance in clinical studies especially when the prior probability of the hypothesis under scrutiny is low. Since the prior probability of a research hypothesis is directly related to its scientific plausibility, the commonly used frequentist statistics, which does not take into account this probability, is particularly unsuitable for studies exploring matters in various degree disconnected from science such as complementary alternative medicine (CAM) interventions. Any statistical significance obtained in this field should be considered with great caution and may be better applied to more plausible hypotheses (like placebo effect) than that examined - which usually is the specific efficacy of the intervention. Since achieving meaningful statistical significance is an essential step in the validation of medical interventions, CAM practices, producing only outcomes inherently resistant to statistical validation, appear not to belong to modern evidence-based medicine.
Like a German word with no exact English translation, there really is no way to properly define p-value without some jargon. (Or at least that’s how I justify my own past erroneous definitions.) So here’s a proper definition now, including the essential jargon — specifically the “null hypothesis” — here’s Goodman’s definition: “The probability of the observed result, plus more extreme results, if the null hypothesis were true.”
The null hypothesis is basically the bet there’s actually no effect for the experiment to find. If that bet is right, then you’re only going observe the appearance of an effect due to a fluke or error. The null hypothesis is difficult but really interesting idea, and I cover it in some detail in Why So “Negative”? Answering accusations of negativity, and my reasons and methods for debunking bad treatment options for pain and injury.BACK TO TEXT
BACK TO TEXT
P values are the most commonly used tool to measure evidence against a hypothesis or hypothesized model. Unfortunately, they are often incorrectly viewed as an error probability for rejection of the hypothesis or, even worse, as the posterior probability that the hypothesis is true. The fact that these interpretations can be completely misleading when testing precise hypotheses is first reviewed, through consideration of two revealing simulations. Then two calibrations of a ρ value are developed, the first being interpretable as odds and the second as either a (conditional) frequentist error probability or as the posterior probability of the hypothesis.