updated (first published 2011)

by Paul Ingraham, Vancouver, Canada

by Paul Ingraham, Vancouver, Canada

This article is about *two* common problems with “statistical significance” in medical research. Both problems are particularly rampant in the study of massage therapy, chiropractic, and alternative medicine in general, and are wonderful examples of why science is hard and “why most published research findings are false”:

- mixing up statistical and clinical significance and the probability of being “right”
- reporting statistical significance of
*the wrong dang thing*

Stats are hard and scary, of course — everyone knows that. But there will be comic strips, funny videos, and diagrams. I’ll try to make it worth your while, and I’ll try to simplify dramatically without butchering underlying math. Angels fear to tread here!

Two flavours of “significant”: statistical versus clinical

Research can be *statistically* significant, but otherwise unimportant. Statistical significance on its own is the sound of one hand clapping. But researchers often focus on the the positive: “Hey, at least we’ve got statistical significance! Maybe!” So they summarize their findings as “significant” without telling us the *size* of the effect they observed, which is a little devious or sloppy. Almost everyone is fooled by this — except 98% of statisticians — because the word “significant” carries so much weight. It really *sounds* like a big deal, like good news.

But it’s like bragging about winning a lottery without mentioning that you only won $25.1

Statistical significance without other information really doesn’t mean all that much. It is not only possible but common to have *clinically trivial* results that are nonetheless *statistically* significant. How much is that statistical significance is worth? *It depends* … on details that are routinely omitted.

Which is convenient if you’re pushing a pet theory, isn’t it?

Imagine a study of a treatment for pain, which has a statistically significant effect, but it’s a *tiny* effect: that is, it only reduces pain *slightly*. You can take that result *to the bank* (supposedly) — it’s real! It’s statistically significant! But … no more so than a series of coin flips that yields enough heads in a row to raise your eyebrows. And the effect was still tiny. So calling these results “significant” is using math to put lipstick on a pig.

There are a lot of decorated pigs in research: “significant” results that are possibly not even that, and clinically boring in any case.

Just because a published paper presents a statistically significant result does not mean it necessarily has a biologically meaningful effect.

Science Left Behind: Feel-Good Fallacies and the Rise of the Anti-Scientific Left, Alex Berezow & Hank Campbell

If you torture data for long enough, it will confess to anything.

Ronald Harry Coase

Statistical significance is boiled down to one convenient number: the infamous, cryptic, bizarro and highly over-rated *P-value*. Cue Darth vader theme. This number is “diabolically difficult” to understand and explain, and so *p*-value illiteracy and bloopers are epidemic (Goodman identifies ““A dirty dozen: twelve p-value misconceptions””2). It seems to be hated by almost everyone who actually understands it, because almost no one else does. Many believe it to be a blight on modern science.3

The mathematical soul of the *p*-value is, frankly, not really *worth* knowing. It’s just not that fantastic an idea. The importance of scientific research results cannot be jammed into a single number (and nor was that ever the intent). And so really wrapping your head around it no more important than learning the gritty details of the Rotten Tomatoes algorithm when you’re trying to decide whether to see that new Godzilla (2014) movie.4

What you *do* need to know is the role that *p*-values play in research today. You need to know that “it depends” is a massive understatement, and that there are “several reasons why the *p*-value is an unobjective and inadequate measure of evidence”5 Because it is so often abused, it’s way more important to know what the p-value is *NOT* than what it IS. For instance, it’s particularly useless when applied to studies of really outlandish ideas. Because it is so often abused, it’s way more important to know what it’s *not* than what it *is*. For instance, it’s particularly useless when applied to studies of really outlandish ideas.6 And yet it’s one of the staples of pseudoscience,7 because it is such an easy way to make research look better than it is.

Above all, a good *p*-value is *not* a low chance that the results were a fluke or false alarm — which is *by far* the most common misinterpretation (and the first of Goodman’s Dirty Dozen). The real definition is a kind of mirror image of that:9 it’s not a low chance of a false alarm, but a low chance of an effect that actually *is* a false alarm. The false alarm is a given! That part of the equation is already filled in, the premise of every *p*-value. For better or worse, the *p*-value is the answer to *this* question: if there really is nothing going on here, what are the odds of getting these results? A low number is encouraging, but it doesn’t say the results aren’t a fluke, because it can’t10 — *it was calculated by assuming they are*.

The only way to *actually* find out if the effect is real or a fluke is to do more experiments. If they all produce results that would be unlikely if there was no real effect, *then* you can say the results aren’t probably real. The *p*-value alone can only be a reason to check again — not statistical congratulations on a job well done. And yet that’s exactly how most researchers use it. And most science journalists.11

Head hurting already? Time for the stick people to take over this tutorial.

Geeky web comic **xkcd** illustrated the trouble with statistical significance better than I can with my paltry words.

A “5% chance of coincidence” is actually a fairly strong chance of a coincidence. That’s one in twenty. That can happen. Something with a one in twenty chance of happening each day is going to happen more than once per month.

Randall Munroe hides an extra little joke in all his comic strips (mouse over his comic strips on his website, wait a moment, and they are revealed). This time it was:

'So, uh, we did the green study again and got no link. It was probably a--' 'RESEARCH CONFLICTED ON GREEN JELLY BEAN/ACNE LINK; MORE STUDY RECOMMENDED!'

A great deal of crap science is presented in exactly this way. It’s one of the main ways that “studies show” a lot of things help pain that aren’t actually do no such thing. It’s one of the easiest ways for the “controversy” over many alternative treatments can be extended: by citing “significant” evidence of benefit, with data exactly as absurd as in the comic strip above.

It’s actually statistically normal for studies to make bad treatments look good. It happens regularly.

A classic real world example of “statistically significant but clinically trivial” is the supposedly proven benefit of chiropractic adjustment: how much benefit, exactly? “Less than the threshold for what is clinically worthwhile,” as it turns out, according to Nefyn Williams, author of a 2011 paper in International Musculoskeletal Medicine.12 I have been pointing out that spinal adjustment benefits seem to be real-but-minor for many years now, and I explore that evidence in crazy detail in my low back pain and neck pain tutorials. It’s a big topic, with lots of complexity.

Williams’ paper offers an oblique perspective that is quite different and noteworthy: his paper is not about chiropractic adjustment, but about *the concept of clinical significance itself*. There are various ways of measuring improvement in scientific tests of treatments, and, as Williams explains, “when an outcome measure improves by, say, five points it is not immediately apparent what this means.” How much improvement matters? After explaining and discussing various proposed standards and methods, Williams needed a good example to make his point. It’s quite interesting that he picked spinal manipulative therapy.

Chondroitin sulfate is a “nutraceutical” — a food-like nutritional supplement that is supposedly “good for cartilage” (because it is major *component* of cartilage). It has been heavily studied, but there has never been any clear good scientific news about it, and it bombed a particularly large and good quality test in 2006 (see Clegg).

So it was a bit hard to believe my eyes when I read the summary of a 2011 experiment claiming that chondroitin sulfate “improves hand pain.”13 *Really?!*

No, not really. On a 100mm VAS (a pain scale, “visual analogue scale”), the treatment group was 8.77mm happier with their hands. With a p=.02 (a middlin’ *p*-value, neither high nor low). So basically what the researchers found is a *chance* that chondroitin makes a *small* difference in arthritis pain. It’s not nothing, but it is an incredibly unimpressive result — pretty much the definition of clinically insignificant. The authors’ interpretation is like taking the dog to the end of the driveway and saying you took him for a walk. Technically true ...

So that is a *lovely* demonstration of the abuse of statistical significance!

Statistical significance of the wrong comparison (no analysis of variance)

The first significance “problem” is almost like a trick or truth bending. It works well and fools a lot of people — sometimes, I think, even the scientists or doctors who are using it — because the usage of the term “significant” is often technically correct and literally true, but obscures and diverts attention away from the whole story. Thus it is more like a subtle and technical lie of omission than an actual error.

That’s pretty bad! But it’s not the half of it. It gets much worse: a lot of so-called significant results aren’t even *technically* correct. Problem #2 is an actual *error* — something that would get you a failing grade in a basic statistics course.

This bomb comes from a recent analysis of neuroscience research.14 A number of writers have reported on this already. It was described by Ben Goldacre for *The Guardian* as

a stark statistical error so widespread it appears in about half of all the published papers surveyed from the academic neuroscience research literature.

Dr. Steven Novella also wrote about it for ScienceBasedMedicine.org recently, adding that

there is no reason to believe that it is unique to neuroscience research or more common in neuroscience than in other areas of research.

And it is not. Dr. Christopher Moyer is a psychologist who studies massage therapy:

I have been talking about this error for years, and have even published a paper on it. I critiqued a single example of it, and then discussed how the problem was rampant in massage therapy research. Based on the Nieuwenhuis paper, apparently it’s rampant elsewhere as well, and that is really unfortunate. Knowing the difference between a within-group result and a between-groups result is basic stuff.

Clinical trials are all about *comparing* treatments. To be considered effective, a real treatment has to work better than a fake one — a placebo. A drug must produce better results than a sugar pill. If *that* difference is big enough, it is “statistically significant.” There are a lot of details, but that’s the beating heart of a fair scientific test: can the treatment beat a fake?

What you can’t do is just compare the treatment to nothing at all and say, “See, it works: huge difference! Huge improvement over nothing!” The problem is that both effective medicines *and* placebos can beat *nothing* — it just doesn’t mean much until you know the treatment can also trounce a placebo. On its own, a “statistically significant” difference between treatment and nothing at all is the sound of one hand clapping. A meaningful comparison has to be a statistical *ménage à trois*, comparing all three to each other (analysis of variance, or ANOVA).

The error is the failure to do this. And, shockingly, Nieuwenhuis et al reported that *more than half* of researchers were making this mistake: comparing treatments and placebos to nothing, but *not to each other.*

Studies of massage therapy (and others, like chiropractic15) are particularly plagued by this error. Why? Because massage is so much “better than nothing.” The size of that difference looms large, and so it’s all too easy to mistake it for the one that matters — and fail to even compare the treatment to a placebo. And it’s really hard to come up with a meaningful placebo. It’s notoriously difficult to give a patient a fake massage. (They catch on.)

Statistics does not care about these difficulties: you still can’t compare massage to nothing, stop there, and call the difference “significant.” You still have to do your ANOVA, and massage still has to beat some kind of placebo before it can be considered more effective than pleasant grooming.

This research problem is not limited to massage, but massage is probably be the single best example of it. It crops up when you’re studying any treatment that involves a lot of interaction. The more interaction, the worse the problem gets. It’s a big deal in massage research because massage involves a *lot* of interaction, much of which is pleasant and emotionally engaging. It’s notoriously difficult to give a patient a fake massage. (They catch on.)Interaction with a friendly health care provider has a lot of surprisingly potent effects: people react strongly and positively to compassion, attention, and touch. The problem is that those benefits have nothing to do with any specific “active ingredient” in a massage. Grooming is just nice. It’s like pizza: even when it’s bad, it’s pretty good.

Much of the good done by therapists of all kinds is attributable to potent placebos driven by their complex interactions with patients, and *not* by anything in particular that they are doing to the patient. To find out how well a therapy works, it must be compared to sham treatments which are as much *like* the treatment as possible. This is hard to do, and it has rarely been done well. It’s much more typical to compare therapy to something too lifeless and “easy to beat,” to much like comparing it to nothing at all instead of a real placebo. And there’s the difference error: comparison to the wrong thing, statistical significance of *the wrong difference*.

I am a science writer, former massage therapist, and assistant editor of ScienceBasedMedicine.org. I have had my share of injuries and pain challenges as a runner and ultimate player. My wife and I live in downtown Vancouver, Canada. See my full bio and qualifications, or my blog, *Writerly*. You might run into me on Facebook and Google, but mostly Twitter.

Let’s spell that out a bit: a friend approaches you, and excitedly reports that he had won some money — but rather than telling you how much, he brags only about the odds: “It was a twenty to one!” he says happily, and goes for a high five. You indulge him with a high five, and then demand to know how much he actual won.

“Um, three dollars. But I won!”

BACK TO TEXT*This is how silly it is to declare statistical significance without clear reference to effect size.*- Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol. 2008 Jul;45(3):135–40. PubMed #18582619. PainSci #53780. BACK TO TEXT
- Nuzzo R. Scientific method: statistical errors. Nature. 2014 Feb;506(7487):150–2. PubMed #24522584.

BACK TO TEXTP values have always had critics. In their almost nine decades of existence, they have been likened to mosquitoes (annoying and impossible to swat away), the emperor's new clothes (fraught with obvious problems that everyone ignores) and the tool of a “sterile intellectual rake” who ravishes science but leaves it with no progeny. One researcher suggested rechristening the methodology “statistical hypothesis inference testing”, presumably for the acronym it would yield.

*First*, just*don’t*. It’s terrible. It’s not a gritty new “vision” of Godzilla: it’s a stilted and stale-on-arrival stinker that doesn’t even satisfy with the effects, let alone script, and how in hell did an actor of Brian Cranston’s calibre end up in there? The good reviews and the decent Tomatometer score of ~70% are really a puzzle.*Second*, the Tomatometer is an amusingly apt analogy for the*p*-value problem: an algorithm that produces a simple score that is routinely misinterpreted and fundamentally incapable of telling us what we*really*want to know, while seemed to promise exactly that. It’s even been analyzed in the same exasperated way! See Misinterpreting the Tomatometer, which concludes, “The Tomatometer is a reliable way of measuring Tomatometer ratings. That's about it.” Similarly,*p*-values seem to be, above all, an excellent indication of the*p*-value and not super reliable for anything else. BACK TO TEXT- Hubbard R, Lindsay RM. Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing. Theory & Psychology. 2008;18(1):69–88. PainSci #53786.

BACK TO TEXTReporting p values from statistical significance tests is common in psychology's empirical literature. Sir Ronald Fisher saw the p value as playing a useful role in knowledge development by acting as an "objective" measure of inductive evidence against the null hypothesis. We review several reasons why the p value is an unobjective and inadequate measure of evidence when statistically testing hypotheses. A common theme throughout many of these reasons is that p values exaggerate the evidence against H0. This, in turn, calls into question the validity of much published work based on comparatively small, including .05, p values. Indeed, if researchers were fully informed about the limitations of the p value as a measure of evidence, this inferential index could not possibly enjoy its ongoing ubiquity. Replication with extension research focusing on sample statistics, effect sizes, and their confidence intervals is a better vehicle for reliable knowledge development than using p values. Fisher would also have agreed with the need for replication research.

- Regina Nuzzo: “The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm,
*no matter what the P value is.*” And yet studies of implausible treatments is exactly where*p*-values are typically emphasized by researchers with an agenda. BACK TO TEXT - Pandolfi M, Carreras G. The faulty statistics of complementary alternative medicine (CAM). Eur J Intern Med. 2014 Sep;25(7):607–9. PubMed #24954813.

BACK TO TEXTThe authors illustrate the difficulties involved in obtaining a valid statistical significance in clinical studies especially when the prior probability of the hypothesis under scrutiny is low. Since the prior probability of a research hypothesis is directly related to its scientific plausibility, the commonly used frequentist statistics, which does not take into account this probability, is particularly unsuitable for studies exploring matters in various degree disconnected from science such as complementary alternative medicine (CAM) interventions. Any statistical significance obtained in this field should be considered with great caution and may be better applied to more plausible hypotheses (like placebo effect) than that examined - which usually is the specific efficacy of the intervention. Since achieving meaningful statistical significance is an essential step in the validation of medical interventions, CAM practices, producing only outcomes inherently resistant to statistical validation, appear not to belong to modern evidence-based medicine.

Like a German word with no exact English translation, there really is no way to

*properly*define*p*-value without some jargon. (Or at least that’s how I justify my own past erroneous definitions.) So here’s a proper definition now, including the essential jargon — specifically the “null hypothesis” — here’s Goodman’s definition: “The probability of the observed result, plus more extreme results, if the null hypothesis were true.”The null hypothesis is basically the bet there’s actually no effect for the experiment to find. If that bet is right, then you’re only going observe the appearance of an effect due to a fluke or error. The null hypothesis is difficult but really interesting idea, and I cover it in some detail in Why So “Negative”? Answering accusations of negativity, and my reasons and methods for debunking bad treatment options for pain and injury.

BACK TO TEXT- Sellke T, Bayarri MJ, Berger JO. Calibration of ρ Values for Testing Precise Null Hypotheses. The American Statistician. 2001;55(1):62–71. PainSci #53787.

BACK TO TEXTP values are the most commonly used tool to measure evidence against a hypothesis or hypothesized model. Unfortunately, they are often incorrectly viewed as an error probability for rejection of the hypothesis or, even worse, as the posterior probability that the hypothesis is true. The fact that these interpretations can be completely misleading when testing precise hypotheses is first reviewed, through consideration of two revealing simulations. Then two calibrations of a ρ value are developed, the first being interpretable as odds and the second as either a (conditional) frequentist error probability or as the posterior probability of the hypothesis.

- Including myself for years —
*mea culpa*. BACK TO TEXT - Williams NH. How important is the ‘minimal clinically important change’? International Musculoskeletal Medicine. 2011 Jun;33(2):47–82(2). PainSci #55101. BACK TO TEXT
- Gabay C, Medinger-Sadowski C, Gascon D, Kolo F, Finckh A. Symptomatic effects of chondroitin 4 and chondroitin 6 sulfate on hand osteoarthritis: a randomized, double-blind, placebo-controlled clinical trial at a single center. Arthritis Rheum. 2011 Sep 6;63(9). PubMed #21898340. BACK TO TEXT
- Nieuwenhuis S, Forstmann BU, Wagenmakers EJ. Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci. 2011;14(9):1105–7. PubMed #21878926. PainSci #55075. BACK TO TEXT
- This is a problem studying any kind of therapy where patients spend more time with health care professionals. The more time and interaction, the worse the problem. Thus, massage is by far the most difficult of them all — but it’s also a problem studying chiropractic, physical therapy, psychotherapy, osteopathy, and so on. Basically, the higher the potential for a robust placebo, the harder it gets to avoid making this research mistake. BACK TO TEXT