Many painful problems are surprisingly mysterious, and there are many theories about why people hurt. Debate can rage for years about whether or not a problem even exists. For instance, chiropractic “subluxations” have been a hot topic for decades now: are these little spinal dislocations actually real? What if five different chiropractors all looked at you, but each diagnosed different spots in your spine that were supposedly “out” and in need of adjustment?
That’s a reliability study.
Reliability studies are awesome: although the concept is obscure to most people, they are accessible and interesting, easy for anyone to understand, and very persuasive. Evidence of unreliable diagnosis can make further debate pointless. If chiropractors can’t agree on where subluxations are in the same patient (and some studies have shown that they can’t1), then the debate about whether or not subluxations actually exist gets less interesting. A reliability study with a negative result doesn’t necessarily prove anything,2 but such results are strongly suggestive, and can be a handy shortcut for consumers. Who wants a diagnosis that will probably be contradicted by each of five other therapists? No one, that’s who.
In reliability science, we talk about “raters.” A rater is a judge … of anything. One who rates. The person who makes the call. All health care professionals are raters whenever they are assessing and diagnosing.
Reliability studies are studies of “inter-rater” reliability, agreement, or concordance. In other words, how much do raters agree with each other? Not in a meeting about it later, but on their own. Do they come to similar conclusions when they assess the same patient independently?
There are formulas that express reliability as a score, such as a “concordance correlation coefficient.” For the non-statistician, that boils down to: how often are health care professionals going to come to the same or similar conclusions about the same patient? Every time? Half the time? One in ten?
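To make the idea of a reliability score concrete, here is a minimal sketch of two simple agreement measures for two raters: raw percent agreement, and Cohen's kappa, which discounts the agreement you would expect from chance alone. Kappa is a different statistic from the concordance correlation coefficient mentioned above (it is commonly used for categorical calls like "which joint is the problem?"). The function names and the ratings are hypothetical, invented purely for illustration, and do not come from any study.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of patients on which two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty lists of ratings")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability that both raters pick the same label
    # if each simply followed their own overall label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical data (an assumption for illustration only): the spinal
# segment each of two raters flagged in the same eight patients.
rater_1 = ["L3", "L4", "L5", "L4", "L3", "L5", "L4", "L5"]
rater_2 = ["L4", "L4", "L3", "L5", "L3", "L4", "L5", "L3"]

print("percent agreement:", percent_agreement(rater_1, rater_2))
print("Cohen's kappa:", cohens_kappa(rater_1, rater_2))
```

A kappa near 1 means the raters agree far more than chance would predict, a kappa near 0 means they agree about as often as two people guessing, and a negative kappa means they agree even less than that.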
This reliability thing is not subtle: you don’t need a second opinion for a gunshot wound. Ten out of ten doctors will agree: “Yep, that’s definitely a gunshot wound!” Well, almost.3
That’s high inter-rater reliability.
Lots of diagnostic challenges are much harder, of course. Humans are complex. It’s not always obvious what’s wrong with them. This is why you need second and third opinions sometimes. And it’s perfectly fine to have low reliability regarding difficult medical situations. Patients are pretty forgiving of low diagnostic reliability when professionals are candid about it. All a doctor has to say is, “I’m not sure. I don’t know. Maybe it’s this, and maybe it isn’t.”
What you have to watch out for is low reliability combined with high confidence: the professionals who claim to know, but can’t agree with each other when tested. Unfortunately, this is a common pattern in alternative medicine. And it is a strong argument that it’s actually alternative medicine practitioners who are “arrogant,” not doctors.
True story: a patient of mine, back in the day, a young woman with chronic neck pain and nausea, went to a “body work” clinic for her problem. Three deeply spiritual massage therapists hovered over her for three hours, charging $100/hour — each, for a total of $900 for the session — and provided (among some other things) a running commentary/translation of what her stomach was “trying to tell her” about her psychological issues.
True story: my eyes rolled out of their sockets. And my patient was absolutely horrified.
Obviously, if she’d gone to another gurgle-interpreter down the road, her gastric messages would have been interpreted differently.
That’s low inter-rater reliability.
There are numerous common diagnoses and theories of pain that suffer from lousy inter-rater reliability. Here are some good examples:
And so on and on. Over the months and years, I’ll add other nice examples to this list as they occur to me. For contrast, many diagnostic and testing procedures are reliable, such as testing range of motion in people with frozen shoulder.17
Supposedly a humming tuning fork applied to a stress fracture will make it ache. This analysis of studies18 since the 1950s tried to determine if tuning forks (and ultrasound) are actually useful in finding lower-limb stress fractures. Neither technique was found to be accurate: “it is recommended that radiological imaging should continue to be used” instead. Fortunately (for the sake of the elegant quirkiness of the idea), they aren’t saying that a tuning fork actually can’t work … just that it’s not reliable for confirmation, which is kind of a “well, duh” conclusion.
I am a science writer, former massage therapist, and I was the assistant editor at ScienceBasedMedicine.org for several years. I have had my share of injuries and pain challenges as a runner and ultimate player. My wife and I live in downtown Vancouver, Canada. See my full bio and qualifications, or my blog, Writerly. You might run into me on Facebook or Twitter.
— Added an example of diagnosis that is reliable.
— Added motion palpation. Added a citation about detecting craniosacral therapy. Started update logging.
Many unlogged updates.
I do enjoy reliability studies, and this is one of my favourites. Three chiropractors were given twenty patients with chronic low back pain to assess, using a complete range of common chiropractic diagnostic techniques, the works. Incredibly, assessing only a handful of lumbar joints, the chiropractors agreed which joints needed adjustment only about a quarter of the time (just barely better than guessing). That’s an oversimplification, but true in spirit: they couldn’t agree on much, and researchers concluded that all of these chiropractic diagnostic procedures “should not be seen … to provide reliable information concerning where to direct a manipulative procedure.”
This was the first test of the claim that craniosacral therapists are able to palpate change in cyclical movements of the cranium. The researchers concluded that “therapists were not able to measure it reliably,” and that “measurement error may be sufficiently large to render many clinical decisions potentially erroneous.” They also questioned the existence of craniosacral motion and suggested that CST practitioners might be imagining such motion. This prompted extensive and emphatic rebuttal from Upledger.
“Palpation of a cranial rhythmic impulse (CRI) is a fundamental clinical skill used in diagnosis and treatment” in craniosacral therapy. So, researchers put it to the test: “two registered osteopaths, both with postgraduate training in diagnosis and treatment, using cranial techniques, palpated 11 normal healthy subjects.” Unfortunately, they couldn’t agree on much: “interexaminer reliability for simultaneous palpation at the head and the sacrum was poor to nonexistent.” Emphasis mine.
This is one of those fun studies that catches clinicians in their inability to come up with the same assessment of a structural problem. Three doctors were asked to “rate forefoot alignment,” but they didn’t agree. From the abstract: “… the commonplace method of visually rating forefoot frontal plane deformities is unreliable and of questionable clinical value.”
Two examiners, using standard methods of motion palpation of the thoracic spine, could not agree at all well on the location of joint stiffness or pain in 25 patients. Simplifying the diagnostic challenge did not improve matters. Therefore, “The results for interrater reliability were poor for motion restriction and pain.” This does not bode well for manual therapists who use motion palpation to identify patients who might benefit from spinal manipulative therapy.
The study only used two examiners, which might be a serious flaw. More raters would certainly be better. Nevertheless, even a small data sample can produce meaningful information if the effect size is robust enough (see It's the effect size, stupid), which it probably is here. Even just two examiners should generate similar results, unless someone is grossly incompetent. If they differ greatly, adding more examiners probably isn’t going to change that.
Diagnosis by acupuncturists may be unreliable. In this study, “six TCM acupuncturists evaluated the same six patients on the same day,” and the researchers found that “consistency across acupuncturists regarding diagnostic details and other acupoints was poor.” The study concludes: “TCM diagnoses and treatment recommendations for specific patients with chronic low back pain vary widely across practitioners.”
This paper is a survey of the state of the art of trigger point diagnosis: can therapists be trusted to find trigger points? What science has been done so far? It’s a confusing mess, unfortunately. This paper explains that past research has not “reported the reliability of trigger point diagnosis according to the currently proposed criteria.” The authors also explain that “there is no accepted reference standard for the diagnosis of trigger points, and data on the reliability of physical examination for trigger points are conflicting.” Given these conditions, it’s hardly surprising that the conclusion of the study was disappointing: “Physical examination cannot currently be recommended as a reliable test for the diagnosis of trigger points.”
This is essentially the same conclusion as a review the year before by Myburgh et al.
This overconfidently titled paper essentially declares that there is no longer any controversy about ESWT for plantar fasciitis. However, my confidence in their conclusions is undermined by the fact that the researchers are on the payroll of a company that makes ESWT devices, and the entire study was funded by that company. As always, conflicts of interest are not necessarily a deal-breaker, but they can be, and this one seems particularly strong.
Diagnostic reliability of range of shoulder motion in patients with frozen shoulder is “acceptable.”