Statistical Approaches to Blinded Reader Monitoring in Clinical Trials

Joseph Pierro MD and David Raunig PhD

Our past blog entitled “Blinded Reader Training for Clinical Trial Imaging” discussed blinded reader selection, training and reader performance monitoring.  Over the last 10 plus years there has been a 150% increase in the number of studies that include imaging evaluation as a study outcome endpoint, averaging about 800 studies per year.  Since nearly all of these imaging evaluations need an expert reader to interpret images, the performance of those readers is a critical component of the reliability of those endpoints.

Readers must be expertly accurate and remain accurate for the length of each trial, which often spans 3-5 years.  And even though adjudication rate and inter-reader discordance are often the focus of conversations around reader performance, consistent image interpretation over the course of the trial is equally important, and indeed is highlighted by the FDA’s guidance to industry [1].

Reader performance in medical imaging has been a topic of great clinical, and statistical, concern since roentgenologists were first evaluated in 1947. Here we touch on the statistical concerns and standard tools that help researchers analyze reader performance to ensure optimal outcomes in their clinical trials that include medical imaging.

Where’s the Bias?  Take the Quiz

Before we get into performance monitoring considerations, take this quiz: Two readers disagree on 30% of the cases.  The adjudicator selected each reader exactly 50% of the time.  The adjudicator is expertly trained, blinded to the identity of the primary readers, is making a conscious choice and is not randomly choosing the preferred reader. The final result shows an obvious bias.  Where does the bias come from?

  • Reader disagreement in time-to-event oncology endpoints, such as Date of Progression, ranges from 25-40% [2]. This range agrees with studies done since 1947 on diagnostic discordance in the clinic [3, 4].
  • Adjudication rate, or reader disagreement, includes both inter- and intra-reader variability and can be used as a near-real time indicator of degraded performance in a single reader.
  • Adjudication rate that involves more than one endpoint (i.e. any disagreement on any endpoint) is misleading and depends almost entirely on how well the multiple endpoints are correlated and very little on the performance of the readers.
  • Intra-reader variability testing typically uses a percentage of the number of subjects, which can, and often does, lead to too many or too few evaluations to reliably support a decision or action. The number of intra-reader variability evaluations should instead be based on the number required to detect a change in reliability large enough to require corrective action.
  • Adjudicator selection rate, or the percentage of discordant cases in which the adjudicator sides with a given primary reader, is a good first step to single out one of the primary readers but does not consider that the adjudicator may be biased. Since the adjudicator is predicted to be involved in about 30% of the endpoint evaluations, evaluating the adjudicator is necessary, and it is highly recommended that the adjudicator participate in intra-reader variability testing.

Routine Assessment Drives Reliability

Although the FDA recommends performing intra-reader variability assessment at the end of each trial, we believe monitoring should be performed much earlier and throughout the trial so that interventions around reader drift/variability or bias can be addressed sooner. Given that these issues could significantly impact regulatory approval decisions, this more rigorous approach to monitoring is a very important step to ensuring the reliability of the endpoint measurement.
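Monitoring early is only useful if each check includes enough re-reads to support a decision. As a rough sketch, the number of intra-reader re-reads can be sized with a standard normal-approximation formula for a one-sided, one-sample test of a proportion. This is a textbook approximation that we assume here for illustration, not a method prescribed by the FDA guidance, and the baseline and alert discordance rates in the usage example are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def rereads_needed(baseline_rate: float, alert_rate: float,
                   alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of re-read cases needed to detect a rise in
    intra-reader discordance from baseline_rate to alert_rate.

    Uses the normal approximation for a one-sided, one-sample test of a
    proportion. All default values are illustrative assumptions.
    """
    if not 0.0 < baseline_rate < alert_rate < 1.0:
        raise ValueError("require 0 < baseline_rate < alert_rate < 1")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(baseline_rate * (1.0 - baseline_rate))
                 + z_beta * sqrt(alert_rate * (1.0 - alert_rate)))
    return ceil((numerator / (alert_rate - baseline_rate)) ** 2)


if __name__ == "__main__":
    # Hypothetical scenario: 10% baseline discordance, act if it reaches 25%.
    print(rereads_needed(0.10, 0.25))
    # A smaller change to detect requires many more re-reads.
    print(rereads_needed(0.10, 0.15))
```

The required count depends on the size of the change that would trigger corrective action, not on the number of subjects enrolled, which is the argument for sizing re-reads this way rather than as a fixed percentage.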

Our approach to reader monitoring includes assessment of total variability and inter-reader variability, evaluated throughout the entire study. This enables the identification of changes and differences, assuming that, at least in the short term, two readers will not vary in exactly the same way at exactly the same time. It also assumes that acute changes in adjudication rate are more likely due to a single reader than to all readers simply becoming more variable. A structured report should be the end result of reader performance evaluations since everyone, especially the sponsor, is interested in the performance of the evaluation system.
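One simple way to watch for acute changes in adjudication rate is a proportion control chart. The sketch below flags monitoring periods whose adjudication rate falls outside a band around an expected baseline. The three-standard-error band is a conventional control-chart choice that we assume here, and the baseline rate and per-period counts are made-up numbers for illustration only.

```python
from math import sqrt
from typing import Sequence


def flag_adjudication_periods(discordant: Sequence[int],
                              total: Sequence[int],
                              baseline_rate: float,
                              z: float = 3.0) -> list:
    """Return the indices of periods whose adjudication rate lies more
    than z standard errors from baseline_rate.

    discordant[i] is the number of adjudicated cases in period i and
    total[i] is the number of cases read in that period.
    """
    if len(discordant) != len(total):
        raise ValueError("discordant and total must have the same length")
    flagged = []
    for i, (d, n) in enumerate(zip(discordant, total)):
        if n == 0:
            continue  # nothing read in this period, nothing to assess
        # Binomial standard error of a proportion at the baseline rate.
        standard_error = sqrt(baseline_rate * (1.0 - baseline_rate) / n)
        if abs(d / n - baseline_rate) > z * standard_error:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Hypothetical monthly counts for one reader pair, 50 cases per month.
    discordant_cases = [15, 14, 16, 29]
    cases_read = [50, 50, 50, 50]
    print(flag_adjudication_periods(discordant_cases, cases_read, 0.30))
```

A flagged period identifies a reader pair rather than a reader. Under the assumption stated above that an acute change is more likely due to one reader, the follow-up is to compare each reader's intra-reader results and adjudicator selection rate for that period.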

Quiz Answer

The answer is that the adjudicator always chose the earlier progression date of the two primary readers, and the primary readers varied randomly as to who had the earlier date. The adjudicator was biased, and that bias affected every adjudicated evaluation, that is, 30% of all evaluations.
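The quiz scenario is easy to reproduce in a small simulation. The sketch below assumes two unbiased readers whose progression dates scatter symmetrically around the true date, and an adjudicator who always picks the earlier date. The 30-day reader scatter, the case count, and all function and key names are assumptions made for illustration.

```python
import random


def simulate_adjudication(n_cases: int = 10_000,
                          disagree_rate: float = 0.30,
                          reader_sd_days: float = 30.0,
                          seed: int = 0) -> dict:
    """Simulate an adjudicator who always selects the earlier date.

    Errors are measured in days relative to the true progression date,
    so a negative mean error means progression is called too early.
    """
    rng = random.Random(seed)
    adjudicated = 0
    chose_reader_a = 0
    error_adjudicated = 0.0
    for _ in range(n_cases):
        if rng.random() >= disagree_rate:
            continue  # readers agree; assume the agreed date has no error
        adjudicated += 1
        # Each reader's error is symmetric around zero (no reader bias).
        error_a = rng.gauss(0.0, reader_sd_days)
        error_b = rng.gauss(0.0, reader_sd_days)
        if error_a < error_b:
            chose_reader_a += 1
        error_adjudicated += min(error_a, error_b)
    if adjudicated == 0:
        raise ValueError("no adjudicated cases were simulated")
    return {
        "reader_a_selection_rate": chose_reader_a / adjudicated,
        "mean_error_adjudicated_days": error_adjudicated / adjudicated,
        "mean_error_all_days": error_adjudicated / n_cases,
    }


if __name__ == "__main__":
    for name, value in simulate_adjudication().items():
        print(f"{name}: {value:.2f}")
```

In this setup the adjudicator selects each reader about half the time, so the selection rate looks balanced, yet the final progression dates are systematically early. That is why adjudicator selection rate alone cannot rule out adjudicator bias.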

Joseph Pierro, MD is the Medical Director of Imaging at ERT and David Raunig, PhD is the Senior Principal Imaging Statistician at ERT.



  1. FDA, United States Food and Drug Administration Guidance for Industry: Standards for Clinical Trials Imaging Endpoints. U.S. Department of Health and Human Services, Editor. 2018: Rockville, MD.
  2. Ford, R., et al., Adjudication Rates between Readers in Blinded Independent Central Review of Oncology Studies. 2016. 6(289): p. 2167-0870.1000289.
  3. Birkelo, C.C., et al., Tuberculosis case finding: A comparison of the effectiveness of various roentgenographic and photofluorographic methods. Journal of the American Medical Association, 1947. 133(6): p. 359-366.
  4. Pinto, A., et al. The concept of error and malpractice in radiology. in Seminars in Ultrasound, CT and MRI. 2012. Elsevier.