Cohen’s Kappa score for reliability is represented by the Kappa symbol.
Many painful problems are surprisingly mysterious, and there are an extraordinary number of theories about why people hurt, even after eliminating the sillier ones. Debate can rage for years about whether or not a problem even exists. For instance, chiropractic “spinal subluxations” have been a hot topic for decades now: are these little spinal dislocations actually real? What if five different chiropractors all looked at you, but each diagnosed different spots in your spine that were supposedly “out” and in need of adjustment?
That’s a reliability study.
Reliability studies are great: although the concept is obscure to most people, they are easily understood in principle, interesting, and very persuasive. Evidence of unreliable diagnosis can make further debate pointless. If chiropractors can’t agree on where subluxations are in the same patient (and indeed some studies have shown that they can’t1), then the debate about whether or not subluxations actually exist gets less interesting. A reliability study with a negative result doesn’t necessarily prove anything,2 but it is strongly suggestive, and can be a handy shortcut for consumers. Who wants a diagnosis that will probably be contradicted by each of five other therapists? No one, that’s who.
In reliability science, we talk about “raters.” A rater is a judge … of anything. One who rates. The person who makes the call. All health care professionals are raters whenever they are assessing and diagnosing.
Reliability studies are studies of “inter-rater” reliability, agreement, or concordance. In other words, how much do raters agree with each other? Not in a meeting about it later, but on their own. Do they come to similar conclusions when they assess the same patient independently?
There are formulas that express reliability as a score, such as a “concordance correlation coefficient.” For the non-statistician, that boils down to: how often are health care professionals going to come to the same or similar conclusions about the same patient? Every time? Half the time? One in ten?
Gunshot wound diagnosis is super reliable
This reliability thing is not subtle: you don’t need a second opinion for a gunshot wound. Ten out of ten doctors will agree: “Yep, that’s definitely a gunshot wound!” Well, almost.3
That’s high inter-rater reliability.
Lots of diagnostic challenges are much harder, of course. Humans are complex. It’s not always obvious what’s wrong with them. This is why you need second and third opinions sometimes. And it’s perfectly fine to have low reliability regarding difficult medical situations. Patients are forgiving of low diagnostic reliability when professionals are candid about it. All a doctor has to say is, “I’m not sure. I don’t know. Maybe it’s this, and maybe it isn’t.”
What you have to watch out for is low reliability combined with high confidence: the professionals who claim to know, but can’t agree with each other when tested. Unfortunately, this is a common pattern in alternative medicine. And it is a strong argument that it’s actually alternative medicine practitioners who are “arrogant,” not doctors.
Stomach gurgle interpretation is not reliable
True story: a patient of mine, back in the day, a young woman with chronic neck pain and nausea, went to a “body work” clinic for her problem. Three deeply spiritual massage therapists hovered over her for three hours, charging $100/hour — each, for a total of $900 for the session — and provided (among some other things) a running commentary/translation of what her stomach was “trying to tell her” about her psychological issues.
True story: my eyes rolled out of their sockets. And my patient was absolutely horrified.
Obviously, if she’d gone to another gurgle-interpreter down the road, her gastric messages would have been interpreted differently.
That’s low inter-rater reliability.
11 examples of unreliable diagnosis in musculoskeletal medicine
There are numerous common diagnoses and theories of pain that suffer from lousy inter-rater reliability. Here are some good examples:
- Craniosacral therapists allege that they can detect subtle defects in the circulation of your cerebrospinal fluid, but reliability testing shows that they can’t agree with each other about it.45
- It’s well known that scans and X-rays often reveal diagnostic red herrings, but it’s worse than a few false positives: a 2016 study sent one patient to get ten different MRIs, and the results were astonishingly inconsistent. The radiologists identified sixteen different findings and made an average of a dozen errors each.6 The patient would have been better off throwing darts at a list of possible results.
- Many kinds of therapists believe that the alignment of the forefoot is important, but a reliability study showed that “the commonplace method of visually rating forefoot frontal plane deformities is unreliable and of questionable clinical value.”7 I know one of these foot alignment kooks: he literally believes that “all pain” is caused by a single joint in the foot, and that he can fix it every time. Again, there’s that arrogance.8
- Many therapists, naturopathic physicians and other self-proclaimed healers use a kind of testing called “applied kinesiology” which uses a simple strength test as the primary diagnostic tool for all problems, but a simple study showed that practitioners’ efforts were “not more useful than random guessing”: not just poor reliability, but zero reliability.
- Motion palpation is used to identify patients who might benefit from spinal manipulative therapy. This is particularly common in chiropractic offices. Unfortunately, trying to detect spinal joint stiffness and/or pain using “motion palpation” didn’t go well in a 2015 test: the examiners found different “problems” in the same patients.9
- The Functional Movement Screen is a set of physical tests of coordination and strength. Although intended to be just a trouble-detection system, in practice its popularity is substantially based on reaching beyond that purpose to actually diagnose biomechanical problems and justify correctional training or treatment. Unfortunately, not only has FMS failed to reliably forecast injuries, but all FMS predictions may be “a product of specious grading.”10
- Traditional Chinese medicine acupuncturists couldn’t agree at all on what was wrong with patients who had low back pain. In six cases evaluated by six practitioners on the same day, twenty diagnoses were used at least once — which is pretty excessive. Even an “inexact science” should probably be a little more exact than that.11
- “Trigger points [muscle knots] are promoted as an important cause of musculoskeletal pain,” but after several decades we still don’t actually know whether or not professionals can reliably diagnose trigger points — the evidence is limited and ambiguous.121314 There’s almost no doubt that identifying trigger points by feel is technically unreliable, but that may not actually be a deal-breaker.15
- “Core instability” is an extremely popular thing to blame for back pain. However, you can’t very well treat core instability if you can’t diagnose it as a problem in the first place. A test of core stability testing was a clear failure: “6 clinical core stability tests are not reliable when a 4-point visual scoring assessment is used.”16 This is a bit problematic for core dogma.
- Ever been told your shoulder blade was misbehaving? “Shoulder dyskinesis” is fancy talk (elaborate parlance!) for “bad shoulder movement.” Unfortunately, therapists cannot agree on these diagnoses, and a 2013 review in the British Journal of Sports Medicine condemned them: “no physical examination test of the scapula was found to be useful in differentially diagnosing pathologies of the shoulder.”17
- Surprisingly, professionals often seem to have trouble deciding whether a given foot has a flat arch or a high arch.1819
And so on and on. For contrast, many diagnostic and testing procedures are reliable, such as testing range of motion in people with frozen shoulder.20
An odd example: tuning-fork diagnosis!
Supposedly a humming tuning fork applied to a stress fracture will make it ache. This analysis of studies21 since the 1950s tried to determine if tuning forks (and ultrasound) are actually useful in finding lower-limb stress fractures. Neither technique was found to be accurate: “it is recommended that radiological imaging should continue to be used” instead. Fortunately (for the sake of the elegant quirkiness of the idea), they aren’t saying that a tuning fork actually can’t work … just that it’s not reliable for confirmation, which is kind of a “well, duh” conclusion.
Unreliable reliability science
The old school math for reliability was just a percentage of agreement. For instance, if you and I are both trying to diagnose something and we agree half the time, that’s 50% agreement. But this isn’t really a fancy enough way to measure, because it doesn’t account for things like guessing, luck, or bias. What if we agree only because we’re both imagining the same bullshit? This stuff is tricky!
Enter Cohen’s kappa (𝛋), “a more robust measure than simple percent agreement calculation, since κ takes into account the possibility of the agreement occurring by chance.” (Though not from bias!) Like the much maligned p-value, not everyone is a fan of kappa scores. But despite all the usual expert controversy — statistics is never straightforward — Cohen’s Kappa has been more or less the standard for ages now.
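To make the difference between plain percent agreement and kappa concrete, here is a minimal Python sketch. It uses the standard textbook form of Cohen's kappa, (observed agreement minus chance agreement) divided by (1 minus chance agreement). The two raters and their "stiff"/"ok" calls on ten patients are invented for illustration and do not come from any study cited here.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of cases on which two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: for each label, the probability that both raters
    # pick it independently, based on how often each rater used it.
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_chance == 1:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical data: two raters judging the same ten patients.
rater_a = ["stiff"] * 5 + ["ok"] * 5
rater_b = ["stiff"] * 4 + ["ok"] * 4 + ["stiff"] * 2

print(f"percent agreement: {percent_agreement(rater_a, rater_b):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(rater_a, rater_b):.2f}")
```

In this made-up case the raters agree on 7 of 10 patients, but half of that agreement is what you would expect from chance given how often each rater says "stiff," so kappa comes out well below the raw 70%.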
The scores go from -1.0 to +1.0, with a zero score representing coin flipping odds of agreement. And so how good is a 𝛋 score of, say, 0.5? How do we translate these scores? In 1977, Landis and Koch suggested some descriptive words that were pure opinion.22 They have been widely cited and used ever since in the absence of any clear alternative. And so science goes.
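For reference, the descriptive words can be written as a small lookup. The cut-offs below are the bands as commonly quoted from the Landis and Koch paper; they are not spelled out in this article, so treat them as an assumption to check against the original. The function name is made up for this sketch.

```python
def landis_koch_label(kappa):
    """Map a kappa score to the descriptive word commonly quoted from Landis & Koch (1977)."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must be between -1.0 and +1.0")
    if kappa < 0:
        return "poor"            # less agreement than chance
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


for score in (-0.1, 0.1, 0.3, 0.5, 0.7, 0.9):
    print(score, landis_koch_label(score))
```

By this convention a score of 0.5 would be called "moderate." The labels are opinion, not measurement, so they are best read as rough shorthand.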
There are lots of challenges with the science of reliability. You could even argue that it’s unreliable! For instance, the odds of agreeing by chance drop when someone is biased. And that inflates the kappa coefficient — a very misleading result.23
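The bias problem is easier to see with numbers. This sketch computes kappa from a two-by-two table of yes/no calls, using the standard textbook formula. Both tables are invented: in each, two raters agree on 60 of 100 patients, but in the second one rater says "yes" much more often than the other.

```python
def kappa_from_2x2(both_yes, a_yes_b_no, a_no_b_yes, both_no):
    """Return (observed agreement, chance agreement, kappa) for two raters' yes/no calls."""
    n = both_yes + a_yes_b_no + a_no_b_yes + both_no
    p_observed = (both_yes + both_no) / n
    a_yes = (both_yes + a_yes_b_no) / n   # how often rater A says yes
    b_yes = (both_yes + a_no_b_yes) / n   # how often rater B says yes
    p_chance = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return p_observed, p_chance, (p_observed - p_chance) / (1 - p_chance)


# Hypothetical counts. Same raw agreement (60%), different rater habits.
unbiased = kappa_from_2x2(40, 20, 20, 20)   # both raters say yes 60% of the time
biased = kappa_from_2x2(40, 35, 5, 20)      # rater A says yes 75%, rater B only 45%

for name, (p_o, p_e, kappa) in (("unbiased", unbiased), ("biased", biased)):
    print(f"{name}: observed={p_o:.2f} chance={p_e:.3f} kappa={kappa:.3f}")
```

With these made-up counts the raters agree exactly as often in both tables, yet the biased pair gets the higher kappa, because their mismatched yes/no habits lower the expected chance agreement.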
But any kappa score above zero, even a low one, still represents more agreement than chance, and perfect agreement is nearly unheard of when testing the reliability of anything that needs testing. Diagnosis is hard!
About Paul Ingraham
I am a science writer in Vancouver, Canada. I was a Registered Massage Therapist for a decade and the assistant editor of ScienceBasedMedicine.org for several years. I’ve had many injuries as a runner and ultimate player, and I’ve been a chronic pain patient myself since 2015.
What’s new in this article?
Four updates have been logged for this article since publication (2009). All PainScience.com updates are logged to show a long term commitment to quality, accuracy, and currency.
2017 — New section: “Unreliable reliability science,” in which I dumb down reliability stats and their challenges.
2017 — Science update: added a reference to Herzog et al’s remarkable MRI reliability results. The list of good examples of unreliable diagnosis is now up to ten.
2016 — Added an example of diagnosis that is reliable.
2016 — Added motion palpation. Added a citation about detecting craniosacral therapy. Started update logging.
Many unlogged updates.
2009 — Publication.
- French SD, Green S, Forbes A. Reliability of chiropractic methods commonly used to detect manipulable lesions in patients with chronic low-back pain. J Manipulative Physiol Ther. 2000 May;23(4):231–8. PubMed 10820295 ❐
I do enjoy reliability studies, and this is one of my favourites. Three chiropractors were given twenty patients with chronic low back pain to assess, using a complete range of common chiropractic diagnostic techniques, the works. Incredibly, assessing only a handful of lumbar joints, the chiropractors agreed which joints needed adjustment only about a quarter of the time (just barely better than guessing). That’s an oversimplification, but true in spirit: they couldn’t agree on much, and researchers concluded that all of these chiropractic diagnostic procedures “should not be seen … to provide reliable information concerning where to direct a manipulative procedure.”
- The problem may be with the design of the test, or the training and skill of those tested, rather than with what they are looking for.
- In the first chapter of his superb book, Complications: A surgeon's notes on an imperfect science, surgeon Atul Gawande tells a fascinating story about a bullet that got lost. Some kid got shot in the butt. There was a classic entry wound. Internal bleeding. No exit wound. It was a critical situation, and they opened him up to get the bullet out, but … no bullet was ever found. Was he shot, or wasn’t he? It was never explained.
- Wirth-Pattullo V, Hayes KW. Interrater reliability of craniosacral rate measurements and their relationship with subjects' and examiners' heart and respiratory rate measurements. Phys Ther. 1994 Oct;74(10):908–16; discussion 917–20. PubMed 8090842 ❐
The first test of the claim that craniosacral therapists are able to palpate change in cyclical movements of the cranium. They concluded that “therapists were not able to measure it reliably,” and that “measurement error may be sufficiently large to render many clinical decisions potentially erroneous.” They also questioned the existence of craniosacral motion and suggested that CST practitioners might be imagining such motion. This prompted extensive and emphatic rebuttal from Upledger.
- Moran RW, Gibbons P. Intraexaminer and interexaminer reliability for palpation of the cranial rhythmic impulse at the head and sacrum. J Manipulative Physiol Ther. 2001 Mar-Apr;24(3):183–190. PubMed 11313614 ❐
“Palpation of a cranial rhythmic impulse (CRI) is a fundamental clinical skill used in diagnosis and treatment” in craniosacral therapy. So, researchers compared the diagnostic methods of “two registered osteopaths, both with postgraduate training in diagnosis and treatment, using cranial techniques, palpated 11 normal healthy subjects.” Unfortunately, they couldn’t agree on much: “interexaminer reliability for simultaneous palpation at the head and the sacrum was poor to nonexistent.” Emphasis mine.
- Herzog R, Elgort DR, Flanders AE, Moley PJ. Variability in diagnostic error rates of 10 MRI centers performing lumbar spine MRI examinations on the same patient within a 3-week period. Spine J. 2016 Nov. PubMed 27867079 ❐
People mostly assume that MRI is a reliable technology, but if you send the same patient to get ten different MRIs, interpreted by ten different radiologists from different facilities, apparently you get ten markedly different explanations for her symptoms. A 63-year-old volunteer with sciatica allowed herself to be scanned again and again and again for science. The radiologists — who did not know they were being tested — cooked up forty-nine distinct “findings.” Sixteen were unique; not one was found in all ten reports, and only one was found in nine of the ten. On average, each radiologist made about a dozen errors, seeing one or two things that weren’t there and missing about ten things that were. That’s a lot of errors, and not a lot of reliability. The authors clearly believe that some MRI providers are better than others, and that’s probably true, but we also need to ask the question: is any MRI reliable?
(See also my more informal description of this study, which includes an amazing personal example of an imaging error.)
- Cornwall MW, McPoil TG, Fishco WD, et al. Reliability of visual measurement of forefoot alignment. Foot Ankle Int. 2004 Oct;25(10):745–8. PubMed 15566707 ❐
This is one of those fun studies that catches clinicians in their inability to come up with the same assessment of a structural problem. Three doctors were asked to “rate forefoot alignment,” but they didn’t agree. From the abstract: “ … the commonplace method of visually rating forefoot frontal plane deformities is unreliable and of questionable clinical value.”
- The Not-So-Humble Healer: Cocky theories about the cause of pain are waaaay too common in massage, chiropractic, and physical therapy
- Walker BF, Koppenhaver SL, Stomski NJ, Hebert JJ. Interrater Reliability of Motion Palpation in the Thoracic Spine. Evidence-Based Complementary and Alternative Medicine. 2015;2015:6. PubMed 26170883 ❐ PainSci Bibliography 54242 ❐
Two examiners, using standard methods of motion palpation of the thoracic spine, could not agree well on the location of joint stiffness or pain in a couple dozen patients. Simplifying the diagnostic challenge did not improve matters. Therefore, “The results for interrater reliability were poor for motion restriction and pain.” This does not bode well for manual therapists who use motion palpation to identify patients who might benefit from spinal manipulation.
The study only used two examiners, which might be a serious flaw. More raters would certainly be better. Nevertheless, even a small data sample can produce meaningful information if the effect size is robust enough (any two people can agree on, say, fire hydrant locations; see It's the effect size, stupid), which it probably is here. Even just two examiners should generate more similar results, unless someone is grossly incompetent. If they differ greatly, more examiners probably isn’t going to change that.
- Whiteside D, Deneweth JM, Pohorence MA, et al. Grading the Functional Movement Screen™: A Comparison of Manual (Real-Time) and Objective Methods. J Strength Cond Res. 2014 Aug. PubMed 25162646 ❐
The results are hardly surprising, since FMS fails to take into account “several factors that contribute to musculoskeletal injury.” These concerns must be addressed “before the FMS can be considered a reliable injury screening tool.”
- Hogeboom CJ, Sherman KJ, Cherkin DC. Variation in diagnosis and treatment of chronic low back pain by traditional Chinese medicine acupuncturists. Complement Ther Med. 2001 Sep;9(3):154–66. PubMed 11926429 ❐
Diagnosis by acupuncturists may be unreliable. In this study, “six TCM acupuncturists evaluated the same six patients on the same day” and found that “consistency across acupuncturists regarding diagnostic details and other acupoints was poor.” The study concludes: “TCM diagnoses and treatment recommendations for specific patients with chronic low back pain vary widely across practitioners.”
- Myburgh C, Larsen AH, Hartvigsen J. A systematic, critical review of manual palpation for identifying myofascial trigger points: evidence and clinical significance. Arch Phys Med Rehabil. 2008 Jun;89(6):1169–76. PubMed 18503816 ❐
This 2008 review of the reliability of trigger point diagnosis resoundingly concluded that the question simply hasn’t been properly studied. The authors urge clinicians and scientists to “move toward simpler, global assessments of patient status.” Translation: “Nothing to see here, move along!”
This is essentially the same conclusion as a 2009 review by Lucas et al.
- Lucas N, Macaskill P, Irwig L, Moran R, Bogduk N. Reliability of physical examination for diagnosis of myofascial trigger points: a systematic review of the literature. Clinical Journal of Pain. 2009 Jan;25(1):80–9. PubMed 19158550 ❐
This paper is a survey of the state of the art of trigger point diagnosis as of 2009, which is a confusing mess, unfortunately. It explains that past research has not “reported the reliability of trigger point diagnosis according to the currently proposed criteria.” The authors also explain that “there is no accepted reference standard for the diagnosis of trigger points, and data on the reliability of physical examination for trigger points are conflicting.” Given these conditions, it’s hardly surprising that the conclusion of the study was disappointing: “Physical examination cannot currently be recommended as a reliable test for the diagnosis of trigger points.”
This is essentially the same conclusion as a review the year before by Myburgh et al.
- Rathbone ATL, Grosman-Rimon L, Kumbhare DA. Interrater Agreement of Manual Palpation for Identification of Myofascial Trigger Points: A Systematic Review and Meta-Analysis. Clin J Pain. 2017 Aug;33(8):715–729. PubMed 28098584 ❐
This review is called a meta-analysis, which is weird, because “only 1 study met inclusion criteria for intrarater agreement and therefore no meta-analysis was performed.” So it was just a regular old review of 6 studies of how much different experts can agree on the location of myofascial trigger points. Lacking adequate data for statistical pooling, they had to “estimate” an agreement score of 𝛋=0.452 (a rather precise estimate!). Of the criteria used to determine the location of trigger points, the most reliable were localized tenderness (.68) and pain recognition (.57). Those are actually decent reliability scores, but the authors conclude that “manual palpation for identification of MTrPs is unreliable.”
- Based on Rathbone’s estimated kappa scores, their negative conclusion is technically correct but also misleading: most attempts to detect pathologies in the body are technically “unreliable,” falling well short of a score of κ=1.0 (perfect agreement), but still much better than κ=0 (coin flipping agreement).
My conclusion is that this review was mostly inconclusive, but actually found evidence that trigger point reliability is probably not all that bad — as compared to most comparable assessment procedures.
This is why I summarized the evidence as “ambiguous.” This one is hard to call.
- Weir A, Darby J, Inklaar H, et al. Core stability: inter- and intraobserver reliability of 6 clinical tests. Clin J Sport Med. 2010 Jan;20(1):34–8. PubMed 20051732 ❐
- Wright AA, Wassinger CA, Frank M, Michener LA, Hegedus EJ. Diagnostic accuracy of scapular physical examination tests for shoulder disorders: a systematic review. Br J Sports Med. 2013 Sep;47(14):886–92. PubMed 23080313 ❐
- Sensiba PR, Coffey MJ, Williams NE, Mariscalco M, Laughlin RT. Inter- and intraobserver reliability in the radiographic evaluation of adult flatfoot deformity. Foot Ankle Int. 2010 Feb;31(2):141–5. PubMed 20132751 ❐
Although not terrible, even x-rays of the same foot get judged differently: just fine with some measures, merely okay for others. However, that’s radiologists evaluating x-rays: you would hope that would be fairly reliable. The problem is with some kinds of clinicians (see next note).
- This is a bit of a cheat: I don’t have a proper reliability study to back this up, just a professional story: when I worked as a massage therapist, it was common for people to come into my office with so-called “flat” feet, convinced by a previous massage therapist (or chiropractor) that they “have no arch left” (or some other motivating hyperbole) … when in fact I could still easily get my finger under their arch up to the first knuckle. That’s something that you simply can’t do on someone who really has flat feet. Similarly, though not so common, I have often seen people accused by another professional of having high arches, when in fact they look nothing like it to me. So take such diagnoses with a grain of salt.
- Tveitå EK, Ekeberg OM, Juel NG, Bautz-Holter E. Range of shoulder motion in patients with adhesive capsulitis; intra-tester reproducibility is acceptable for group comparisons. BMC Musculoskelet Disord. 2008;9:49. PubMed 18405388 ❐ PainSci Bibliography 53284 ❐
Diagnostic reliability of range of shoulder motion in patients with frozen shoulder is “acceptable.”
- Schneiders AG, Sullivan SJ, Hendrick PA, et al. The Ability of Clinical Tests to Diagnose Stress Fractures: A Systematic Review and Meta-analysis. J Orthop Sports Phys Ther. 2012;42(9):760–71. PubMed 22813530 ❐
- Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977 Mar;33(1):159–74. PubMed 843571 ❐
- Walker 2015, op. cit.
Walker et al. explained this in their study of spinal motion palpation, where there is a real risk of bias:
When interpreting Kappa coefficients, however, it is important to understand that both bias and prevalence have potential to influence the agreement estimates. Bias occurs when there is disagreement in the proportion of yes and no judgments between each rater. As bias increases, chance agreement decreases, resulting in inflation of the Kappa coefficient.