Exploring Model Generalizability: Part of the Journal Vision Series
by Yunhe Gao
Despite a large body of studies exploring artificial intelligence (AI) in medical imaging, few AI solutions have been successfully deployed in clinical settings. One of the major obstacles to scaling imaging AI algorithms is limited generalizability of current models.
Generalizability pertains to model performance on data that has not been seen during training. Current data-driven models exhibit strong performance when tested on data distributed similarly to the training data. However, their performance tends to degrade in the presence of data shift, where real-world data differ in distribution from the training data. Data shift may arise from a variety of factors, including differences in patient demographics, genotypic and phenotypic characteristics, imaging equipment and protocols, and processing methods. Improving and evaluating model generalizability is crucial for the clinical deployment of AI solutions that must adapt to evolving practice environments.
This edition of Journal Vision features three recently published articles on model generalizability from the journal Radiology: Artificial Intelligence.
These articles examine the impact of data shift on deep learning model performance, the importance of addressing both narrow and broad generalization issues, and strategies for avoiding common methodological pitfalls that can compromise model generalizability.
1. External Validation of Deep Learning Algorithms for Radiologic Diagnosis: A Systematic Review
The first paper is a systematic review of studies on deep learning (DL) algorithms for image-based radiologic diagnosis that included external validation. This review revealed that data shifts substantially affect the performance of current DL models. The majority (81%) of the algorithms experienced at least some decrease in external performance compared with internal performance. Nearly half (49%) of the algorithms demonstrated at least a modest decrease (≥0.05 on the unit scale) in performance, while nearly a quarter (24%) exhibited a substantial decrease (≥0.10 on the unit scale).
This paper highlights an important finding: most DL models are prone to decreased performance on external datasets, likely owing to data shift. Researchers have generally overlooked the issue of model generalizability, as only 1.4% (83 of 6018) of the reviewed papers used external datasets for evaluation. These findings emphasize that model generalization and data shift must be given serious consideration when applying DL models in clinical settings. Out-of-distribution evaluation on external datasets that are diverse in demographics and geographic distribution is vital for building robust models that resist data shift.
2. Toward Generalizability in the Deployment of Artificial Intelligence in Radiology: Role of Computation Stress Testing to Overcome Underspecification
This paper explores two overarching types of generalization issues and how they can be addressed. The first is narrow generalization, which refers to the ability of a model to perform well on datasets that are identically distributed to the training set. One common manifestation of a narrow generalization failure is overfitting, in which a complex model fails to distinguish signal from noise within the training data. Narrow generalization can be assessed by testing on subsets generated by randomly splitting off a hold-out test dataset. Strategies to mitigate overfitting include increasing the size of the training set, using cross-validation, employing adversarial training and federated learning, and reducing model complexity through techniques such as feature selection, regularization, and neural architecture search.
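As a minimal sketch of two of these strategies, the snippet below combines k-fold cross-validation with ridge (L2) regularization to compare regularization strengths. The closed-form ridge solver, the synthetic data, and all function names are illustrative assumptions, not from the articles discussed.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cross_validate(X, y, lam, k=5, seed=0):
    """Mean held-out mean-squared error over k random folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[test] @ w - y[test]) ** 2))
    return float(np.mean(errors))

# Synthetic data: a noisy linear signal in 10 dimensions
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

# Stronger regularization trades variance for bias;
# cross-validation estimates which trade-off generalizes best.
for lam in (0.0, 1.0, 100.0):
    print(f"lam={lam}: CV error {cross_validate(X, y, lam):.4f}")
```

The held-out error, rather than the training error, is what guides the choice of model complexity here, which is the essence of assessing narrow generalization.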
The second type of generalization is broad generalization, which concerns the ability of a model to perform well in a deployment domain whose distribution may differ from that of the training domain. To achieve broad generalization, a model must make predictions from stable signals, which remain constant despite data shift after deployment. However, a standard pipeline that evaluates a model on testing data distributed similarly to the training data may not be sufficient to assess broad generalizability, leading to a phenomenon known as underspecification. To address underspecification, enriched pipelines should be designed in the testing phase to include stress tests that reproduce real-world challenges, such as shifted performance evaluation, contrastive evaluation, and stratified performance evaluation.
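One of the stress tests mentioned above, stratified performance evaluation, can be illustrated with a short sketch: compute a metric separately per subgroup (site, scanner, demographic) rather than only in aggregate. The helper function and the toy data below are hypothetical, invented for illustration.

```python
from collections import defaultdict

def stratified_accuracy(y_true, y_pred, groups):
    """Accuracy overall and per subgroup (e.g., imaging site or scanner).

    A model that looks fine in aggregate may hide large gaps between
    subgroups -- exactly what stratified stress testing is meant to expose.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group

# Hypothetical predictions pooled from two imaging sites
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 1, 0, 1, 0]
groups = ["siteA"] * 4 + ["siteB"] * 4
overall, per_group = stratified_accuracy(y_true, y_pred, groups)
print(overall, per_group)  # overall 0.625 masks a 1.0 vs 0.25 site gap
```

The aggregate number alone would pass a naive evaluation pipeline; the stratified view reveals that the model fails on one site's data, a symptom of underspecification.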
Researchers must employ thoughtful, deliberate intervention in both the training and testing phases to obtain a model that generalizes well in real deployment scenarios. Narrow generalization can be addressed by an optimized training protocol, while broad generalization requires a carefully designed evaluation pipeline. Stress testing is a specific strategy to overcome underspecification, potentially serving as a standard in imaging AI, paralleling the role of crash tests in the automotive industry.
3. Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls
This paper aims to provide a clear technical guideline to promote the development of generalizable machine learning (ML) and deep learning (DL) models that can be clinically deployed. The authors identify and investigate the impact of three methodological pitfalls on model generalizability: (a) violation of the independence assumption, (b) model evaluation with an inappropriate performance indicator or baseline for comparison, and (c) batch effect. Such errors can produce models that are not generalizable despite achieving seemingly promising results during internal evaluation, a problem that may not be evident to readers, reviewers, or even the authors themselves. The authors offer the following suggestions to avoid these pitfalls:
Prevent data leakage:
- Split data into training, validation, and test partitions before oversampling, feature selection, or data augmentation.
- Keep all data points from a given patient within a single partition; they should not be distributed across training, validation, and test sets.
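A minimal sketch of these two precautions, assuming a simple record format (`patient`, `label`) invented for illustration: split at the patient level first, then oversample the minority class using training records only.

```python
import random

def patient_level_split(records, test_frac=0.3, seed=0):
    """Split at the *patient* level so no patient appears in both sets."""
    patients = sorted({r["patient"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_frac))
    test_ids = set(patients[:n_test])
    train = [r for r in records if r["patient"] not in test_ids]
    test = [r for r in records if r["patient"] in test_ids]
    return train, test

def oversample(train, seed=0):
    """Duplicate minority-class records until classes are balanced."""
    rng = random.Random(seed)
    pos = [r for r in train if r["label"] == 1]
    neg = [r for r in train if r["label"] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    while len(minority) < len(majority):
        minority.append(rng.choice(minority))
    return majority + minority

# Hypothetical records: two images each for four patients
records = [{"patient": p, "label": int(p in ("p1", "p2"))}
           for p in ("p1", "p2", "p3", "p4") for _ in range(2)]
train, test = patient_level_split(records)
train = oversample(train)  # oversampling happens AFTER the split
train_ids = {r["patient"] for r in train}
test_ids = {r["patient"] for r in test}
print(train_ids & test_ids)  # empty set: no patient leaks across sets
```

Reversing the order (oversampling before splitting) would let duplicated copies of the same record land in both partitions, inflating the apparent test performance.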
Report appropriate performance measures:
- Accuracy should not be used as the sole performance indicator for imbalanced datasets.
- Other performance indicators, such as precision and recall, should be reported alongside accuracy.
- A baseline for acceptable model performance should be provided.
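These three points can be sketched together in a small, self-contained example (the function name and toy data are assumptions for illustration): on an imbalanced dataset, a model that never detects a positive case can still match the trivial majority-class baseline on accuracy, which precision and recall immediately expose.

```python
def evaluate(y_true, y_pred):
    """Accuracy plus precision/recall, with a majority-class baseline."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Baseline: a trivial classifier that always predicts the majority class
    majority = max(set(y_true), key=y_true.count)
    baseline_acc = sum(t == majority for t in y_true) / len(y_true)
    return accuracy, precision, recall, baseline_acc

# 95% negatives: a model that misses every positive still scores 95% accuracy
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
print(evaluate(y_true, y_pred))  # (0.95, 0.0, 0.0, 0.95)
```

Here accuracy equals the baseline while precision and recall are both zero, making clear that the model has learned nothing clinically useful.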
Select appropriate data:
- In multicenter studies, aim to keep data distributions similar across centers to avoid the batch effect.
- If possible, use an external dataset to provide an unbiased estimate of the generalization error.
Guidelines such as TRIPOD and CLAIM help ensure the quality and reproducibility of ML and DL models in medical research. However, because these guidelines focus primarily on reporting and reproducibility of research findings, they offer limited guidance on recommended methodological practices. This article supplements such efforts by providing evidence-supported, actionable steps that every researcher should check when developing generalizable ML and DL models for clinical use.
Keep an eye out for the next “Journal Vision” for more commentary on the latest publications from Radiology: Artificial Intelligence.
Yunhe Gao is a PhD candidate in Computer Science at Rutgers University, with research interests in large-scale medical image models and model robustness. He is a member of this journal’s Trainee Editorial Board.