When Accuracy Is Not Enough: Why Error Analysis Is Necessary for Radiology AI Evaluation
Rohit Reddy
Evaluation of AI models for radiology has grown considerably more sophisticated, yet it remains heavily dependent on aggregate performance metrics. AUC, sensitivity, specificity, and calibration dominate the literature. While these metrics are necessary and statistically meaningful, they are not sufficient for determining clinical readiness.
Radiology is Not Practiced in Aggregate
Clinical risk is rarely about mean performance; it is concentrated in edge cases, where diagnostic errors carry disproportionate clinical consequences. Consider rare pathologies, ambiguous presentations, postsurgical anatomy, device-related complexity, and distributional shifts across patient populations and imaging platforms. Large imaging datasets skew toward common pathologies, so these edge cases are often exactly where models are least prepared. A model that performs well on average may still have systematic weaknesses in clinically sensitive scenarios, and aggregate metrics are simply not designed to catch them.
The Limits of Aggregate Metrics
An AUC of 0.96 for detecting pneumothorax, pneumonia, or pulmonary edema sounds reassuring, but that number doesn’t tell you whether the model systematically underperforms in supine trauma patients, struggles with portable radiographs, or overcalls pathology when there’s an indwelling device in the field. It also doesn’t tell you whether performance degrades in low-prevalence conditions or underrepresented demographic groups. Aggregate metrics collapse heterogeneous failure modes into a single summary statistic. In doing so, they mask patterns that matter clinically.
Cross-institutional variability adds another layer. Differences in scanner hardware, acquisition protocols, patient demographics, and disease prevalence can introduce distribution shifts that quietly erode performance. Without explicit stress testing across these conditions, claims of generalizability don’t hold up.
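To make the point concrete, here is a minimal sketch of subgroup-stratified evaluation in Python with scikit-learn. The file name and metadata columns (`view`, `site`) are hypothetical stand-ins for whatever a real validation set carries:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical validation table: one row per study, with the model's score,
# the ground-truth label, and metadata for stratification.
df = pd.read_csv("validation_set.csv")  # assumed columns: y_true, y_score, view, site

# Overall AUC: the single number most papers report.
print(f"Overall AUC: {roc_auc_score(df['y_true'], df['y_score']):.3f}")

# The same metric, stratified by acquisition context. A model can hold an
# excellent pooled AUC while collapsing in one stratum (e.g. portable supine
# radiographs) -- exactly the pattern that aggregate reporting hides.
for col in ["view", "site"]:
    for value, group in df.groupby(col):
        if group["y_true"].nunique() < 2:
            print(f"{col}={value}: only one class present, AUC undefined")
            continue
        auc = roc_auc_score(group["y_true"], group["y_score"])
        print(f"{col}={value}: AUC={auc:.3f} (n={len(group)})")
```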
Radiologists understand the importance of failure modes intuitively. How a system fails matters as much as how often it fails. A sporadic, random false positive is fundamentally different from a reproducible failure in a specific clinical context. One is manageable within workflow. The other introduces a systematic avenue for harm.
“Wrong for the Right Reasons” and “Right for the Wrong Reasons”
Two categories of model behavior are worth calling out here.
First, there are cases where a model is technically wrong but clinically understandable. For example, take a chest radiograph with borderline interstitial markings, low lung volumes, and motion artifact. The ground truth is negative for edema, but the model output is mild pulmonary edema. That’s not algorithmic incompetence. That’s diagnostic ambiguity. Even human readers could reasonably disagree on such a study.
Second, there are cases where a model gets the right answer for the wrong reason. A pneumothorax is correctly identified, but saliency mapping shows the model is attending to adjacent rib margins or device artifacts rather than the pleural line itself. That’s a spurious correlation baked into the training data. The output is correct, but the reasoning is fragile and probably won’t generalize.
Traditional performance metrics treat both scenarios identically. Structured error analysis doesn’t.
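One way to probe the “right for the wrong reasons” case is occlusion sensitivity: mask each image patch in turn, measure how much the model’s score drops, and ask whether the influential patches overlap the anatomy that should drive the call. A minimal sketch, assuming `predict` is a hypothetical wrapper mapping a 2-D image array to a pneumothorax probability:

```python
import numpy as np

def occlusion_map(predict, image, patch=32):
    """Score drop when each patch is replaced by the image mean.

    High values mark the regions the model actually relies on.
    """
    h, w = image.shape
    base = predict(image)
    heat = np.zeros((h // patch, w // patch))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            masked = image.copy()
            masked[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = image.mean()
            heat[i, j] = base - predict(masked)
    return heat

def roi_reliance(heat, roi_mask, top_frac=0.1):
    """Fraction of the most influential patches inside the expected ROI.

    roi_mask is a boolean array at the same patch resolution as heat,
    e.g. an annotated pleural line. A low fraction suggests the model is
    attending to ribs or devices instead -- correct output, fragile reasoning.
    """
    k = max(1, int(top_frac * heat.size))
    top = np.argsort(heat, axis=None)[-k:]
    return roi_mask.flat[top].mean()
```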
Making Error Analysis a Focus
If we’re going to integrate radiology AI responsibly into clinical workflows, evaluation must move beyond aggregate metrics. A more clinically aligned approach would include systematic categorization of false positives and false negatives, stratification by pathology prevalence and clinical context, analysis of performance in device-heavy and postoperative studies, evaluation under simulated distribution shifts, and comparative assessment of AI-alone, human-alone, and AI-assisted performance. These subgroup analyses need to be adequately powered: stratifying by clinical context only matters if the sample sizes within each stratum are large enough to support reliable conclusions. Otherwise, the breakdown is just noise dressed up as rigor.
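The power caveat can be made mechanical. One hedged sketch: bootstrap a confidence interval for sensitivity within each stratum and flag strata whose interval is too wide, or whose positive count is too small, to support a conclusion either way. The thresholds (`min_pos`, `max_ci_width`) are illustrative defaults, not standards:

```python
import numpy as np

rng = np.random.default_rng(0)

def sensitivity_ci(y_true, y_pred, n_boot=2000):
    """Bootstrap 95% CI for sensitivity within one stratum.

    y_true and y_pred are binary arrays for that stratum's studies.
    """
    pos = np.flatnonzero(np.asarray(y_true) == 1)
    y_pred = np.asarray(y_pred)
    stats = [y_pred[rng.choice(pos, size=len(pos))].mean() for _ in range(n_boot)]
    return np.percentile(stats, [2.5, 97.5])

def powered(y_true, y_pred, min_pos=30, max_ci_width=0.15):
    """Flag whether a stratum supports a reliable sensitivity estimate."""
    n_pos = int(np.sum(np.asarray(y_true) == 1))
    if n_pos < min_pos:
        return False, f"only {n_pos} positives"
    lo, hi = sensitivity_ci(y_true, y_pred)
    if hi - lo > max_ci_width:
        return False, f"95% CI width {hi - lo:.2f} too wide"
    return True, "adequately powered"
```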
These analyses shouldn’t be relegated to the supplemental material. They should be central to model validation. Understanding where a model breaks down is what enables bounded trust. When clinicians know where a system performs reliably and where it’s vulnerable, they can use it appropriately.
Trust Through Transparency
Resistance to AI in radiology often gets misinterpreted as resistance to innovation. In reality, it reflects a deep understanding of clinical fragility: radiologists work in environments where small diagnostic deviations carry significant downstream consequences. A superior AUC shouldn’t convince a radiologist of a model’s reliability. Real trust in AI is built on transparency about each model’s limitations.
A mature evaluation paradigm treats error analysis not as a post hoc explanation, but as a primary endpoint. It emphasizes degradation curves, boundary conditions, and failure maps. It asks not just “How accurate is this model?” but “Under what conditions does performance deteriorate?” and “How does this interact with human oversight?”
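A degradation curve can be sketched the same way: re-score the validation set under progressively stronger simulated shifts and track the metric against severity. Additive Gaussian noise below is a deliberately crude stand-in for realistic acquisition shifts; `predict`, `images`, and `y_true` are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def degradation_curve(predict, images, y_true, severities=(0.0, 0.05, 0.1, 0.2, 0.4)):
    """AUC as a function of corruption severity (here: Gaussian noise).

    A flat curve suggests robustness; a cliff marks a boundary condition
    that belongs in the model's documented limitations.
    """
    rng = np.random.default_rng(0)
    curve = []
    for s in severities:
        noisy = [img + rng.normal(0, s * img.std(), img.shape) for img in images]
        scores = [predict(img) for img in noisy]
        curve.append((s, roc_auc_score(y_true, scores)))
    return curve
```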
Accuracy is a starting point. Understanding the structure of failure is what enables safe deployment.
Rohit Reddy is a PGY-3 integrated DR/IR resident at the University of Miami and current resident research director for the Department of IR. He is currently working on automated post-acute care coordination and implications of generalist medical image interpretation. A past DREAM Scholarship recipient, his 50+ publications span AI in medicine, medical education, IR, and men’s health. He also serves on the Trainee Editorial Board for Radiology: Artificial Intelligence.


