One of our Imaging project managers (PMs) recently approached us with a study sponsor’s concern that the adjudication rate (AR) of 56% in their oncology study was too high. At first glance this is troubling, but we then noted two things: the adjudicator was being asked to make a judgment whenever any of three different endpoints, i.e., progression-free survival (PFS), best overall response (BOR), and duration of response (DOR), differed between the two independent readers in a 2+1 reader model, and the disagreement on any single endpoint was actually lower than rates published in the oncology literature.
What this study team did was assume that all three disagreement circles of the Venn diagram (shown below) lay completely on top of each other; in other words, that disagreements on all three endpoints coincided exactly. While this assumption was erroneous, it is also no different from applying a single-endpoint standard to metrics based on multiple endpoints. In this situation the study team’s options include either revising how the AR is defined or raising the AR threshold to allow for partially overlapping disagreements.
Study endpoint definitions should be well defined and selected to describe the efficacy and/or safety of the investigational product. A general principle that drug developers typically follow is to use the single most clinically important endpoint and to avoid multiple endpoints in registration trials. This approach aligns with regulatory and statistical advice from the Food and Drug Administration (FDA) and the European Medicines Agency (EMA)1. However, there are clinical indications or disease states where reliance on a single primary endpoint may not be appropriate or possible.
This blog will review approaches to reduce the impact of multiplicity related to multiple endpoints. The joint probability of disagreement on at least one of two different endpoints must be greater than the probability of disagreement on either endpoint alone, and so on, until there are enough endpoints that at least one disagreement essentially always occurs.
For three uncorrelated endpoints (which is close to what we have observed), each with a 30% chance of disagreement, the probability of disagreement on at least one of the three endpoints is 1 − 0.7³ = 66%! Four endpoints would give 76%, and so on toward 100%, due not to any reader or data issue but solely to the laws of probability. The percentages decrease slightly when the endpoints are correlated, but the point remains.
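A minimal sketch of this calculation: for n independent endpoints that each disagree with probability p, the chance that at least one endpoint disagrees is 1 − (1 − p)ⁿ.

```python
# For n independent endpoints, each with per-endpoint disagreement
# probability p, the probability that at least one endpoint disagrees.
def at_least_one_disagreement(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

for n in (1, 2, 3, 4):
    print(f"{n} endpoint(s): {at_least_one_disagreement(0.30, n):.0%}")
# prints 30%, 51%, 66%, and 76%
```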
Added to this is the assumption that all of the endpoints are equally weighted, which is almost never the case. To be completely correct, all of the joint probabilities would need to be calculated, the correlations estimated, and the results compared to an assumption of independence. This is a lot of work, and the results would be complex and almost certain to confuse. A far more practical solution is to calculate the disagreement rate for each endpoint individually, compare it to a predetermined threshold, monitor for trends on each endpoint, and avoid taking action on the overall adjudication rate, which will almost certainly not lead to the desired result.
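The per-endpoint approach can be sketched in a few lines. The endpoint rates and the 35% threshold below are illustrative assumptions, not values from the study.

```python
# Hypothetical per-endpoint disagreement rates and an assumed
# predetermined action threshold (illustrative values only).
observed = {"PFS": 0.28, "BOR": 0.31, "DOR": 0.25}
threshold = 0.35

# Flag only the endpoints whose individual rate exceeds the threshold,
# rather than acting on a pooled multi-endpoint rate.
flagged = [ep for ep, rate in observed.items() if rate > threshold]
print(flagged or "all endpoints within threshold")
```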
The final issue with using multiple-endpoint adjudication rates to monitor performance is the amount of additional calculation involved. Taking the above example of three endpoints, a complete and accurate picture of how the readers are performing requires:
- Each individual endpoint disagreement rate
- Each of the three pairwise comparisons of endpoints
- Both “OR” and “AND” disagreement combinations of each pair as well as all three together, and finally
- The multinomial confidence bounds on the total adjudication rate, which require all of the above to be calculated.
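The bookkeeping in the bullets above can be sketched on hypothetical per-case disagreement indicators (1 = the two readers disagreed on that endpoint; the four cases below are invented for illustration).

```python
from itertools import combinations

# Hypothetical per-case disagreement indicators for each endpoint.
cases = [
    {"PFS": 1, "BOR": 0, "DOR": 0},
    {"PFS": 0, "BOR": 1, "DOR": 1},
    {"PFS": 1, "BOR": 1, "DOR": 0},
    {"PFS": 0, "BOR": 0, "DOR": 0},
]
endpoints = ["PFS", "BOR", "DOR"]
n = len(cases)

# Bullet 1: each individual endpoint disagreement rate
individual = {ep: sum(c[ep] for c in cases) / n for ep in endpoints}

# Bullets 2-3: pairwise "OR" and "AND" disagreement combinations,
# plus all three endpoints disagreeing at once
pairwise = {
    (a, b): {
        "OR": sum(1 for c in cases if c[a] or c[b]) / n,
        "AND": sum(1 for c in cases if c[a] and c[b]) / n,
    }
    for a, b in combinations(endpoints, 2)
}
all_three = sum(1 for c in cases if all(c[ep] for ep in endpoints)) / n

# The total adjudication rate: disagreement on any endpoint
overall = sum(1 for c in cases if any(c[ep] for ep in endpoints)) / n
print(individual, overall)
```

Even for three endpoints this produces ten separate rates before any confidence bounds are computed, which illustrates why the first bullet alone is usually the practical choice.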
While these may be interesting and possibly even informative, their immediate usefulness for the clinical study team is limited: monitoring reader performance this way may be taxing and is unlikely to reveal anything that the first bullet does not accomplish on its own. This point bears restating: we believe each endpoint should be evaluated on its own.
Part II of this blog series will discuss endpoint adjudication in more detail while building on our past blog on blinded independent central review (BICR) adjudication. Regulatory agencies point to independent endpoint adjudication with standardized definitions when treatment groups are unmasked or when endpoints are complex or based on subjective assessments, as a way to avoid incorrect treatment response assessments (e.g., false-positive or false-negative classifications) that lead to biased data (i.e., lower quality) and lower study power. Part II will also explain the multiple adjudication paradigms that may be deployed to improve the validity of study results, and will help the reader understand that no single method is optimal for all situations.
1. Multiple Endpoints in Clinical Trials Guidance for Industry. FDA. [https://www.fda.gov/media/102657/download]