Artificial intelligence (AI) can reduce radiologist workload for breast cancer screenings and mammograms, but radiologists' review of the results remains crucial, explained Sarah Verboom of Radboud University Medical Center.
Artificial intelligence (AI)–assisted mammogram screenings reduced radiologists' workload by nearly 40%, but radiologists still needed to review the results even when the AI was certain.
The American Journal of Managed Care® interviewed lead author Sarah Verboom, PhD candidate at Radboud University Medical Center in the Netherlands, about the certainty metric that produced these results in this retrospective analysis. The findings were published in Radiology.1
Verboom explained that although the AI model can assign a probability of malignancy to each region of the breast it flags as being of interest, along with a certainty score for each of its evaluations, its recall decisions should still be reviewed by radiologists, even when the model is completely certain.
That need for review does not stem from the confidence of the AI model, which was uncertain in 61.9% of examinations in this study, but from patients' trust in and preference for a human assessment rather than one made solely by AI.
“Therefore, it might be considered more acceptable for radiologists to review not only the examinations for which AI interpretation is uncertain but also the examinations recalled by an AI model, even if the model has high certainty in its decision,” the study authors concluded.
This transcript has been lightly edited; captions are auto-generated.
Transcript
How do you see patient trust and acceptance influencing the adoption of AI in breast cancer screenings?
AI systems are getting better all the time. I think if we communicate clearly to women what the performance of AI is and also do proper quality control of AI systems, then women would also be open to AI alone reviewing the mammograms, if we can guarantee that the quality is better than radiologists alone in some cases.
While the AI model performed well overall, its sensitivity dropped when its specificity was matched to radiologists'. How should this trade-off be balanced in clinical practice?
Because we wanted to try different things and incorporate this uncertainty metric, this model underperforms compared with state-of-the-art models. State-of-the-art models perform similarly to a single radiologist. However, in the Netherlands, we have 2 radiologists looking at every exam, and we see that most AI models are not on par with 2 radiologists.
Regarding the trade-off between sensitivity and specificity, that depends on the application, of course. If you were to have AI identify completely normal exams, then, of course, you would want a very high sensitivity so you are not missing anything. But if you use it as a safety net, you can accept a somewhat lower sensitivity, but what you want to have is a very high specificity.
The AI model predicted 19% more recalls on its own, without a radiologist's input. Do you see this as a benefit to catching important cases or creating unnecessary recalls?
The 19% you're referring to, I think, should be interpreted slightly differently. It does not mean those cases were missed by radiologists during screening; in this retrospective analysis, it means that 19% of the recalls would be made by AI and not by radiologists. That can be unfavorable because it's harder to explain to a woman why she was recalled, as there's no person behind it doing the explanation. I think in practice it may be better to have a radiologist review those few cases, because there are not many, and explain to the women why they were recalled.
What insights or surprises emerged when you tested different ways of measuring the model's confidence against entropy, the key uncertainty metric used?
I think the biggest difference is that we tried different uncertainty metrics, looking either at all regions the AI model considers possibly suspicious or at just the most suspicious region, and we expected that including all suspicious regions would give a better estimate of the uncertainty. That's not what we saw. Just using the most suspicious region was already enough, which is, of course, much faster to do, because you're only looking at one region instead of multiple, and it already gave us enough information to quantify the uncertainty.
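For readers who want a concrete picture of what a single-region uncertainty score could look like, the minimal sketch below computes the binary entropy of the most suspicious region's malignancy probability and routes high-entropy exams to radiologists. The function names, example scores, and threshold are illustrative assumptions for this article, not the exact implementation described in the Radiology study.

```python
import math

def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # fully certain either way
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def exam_uncertainty(region_scores: list[float]) -> float:
    """Uncertainty of one exam, based only on the most suspicious region.

    region_scores: hypothetical per-region probabilities of malignancy
    produced by an AI model for a single mammogram.
    """
    top_score = max(region_scores)    # most suspicious region only
    return binary_entropy(top_score)  # high near 0.5, low near 0 or 1

# Illustrative routing rule: send uncertain exams to radiologists.
UNCERTAINTY_THRESHOLD = 0.9           # made-up cutoff, not from the study
scores = [0.02, 0.05, 0.47]           # example region scores
if exam_uncertainty(scores) > UNCERTAINTY_THRESHOLD:
    print("AI uncertain: send exam to radiologists for review")
else:
    print("AI certain: exam could be handled by the AI reading strategy")
```

In this toy example, a top region score near 0.5 yields high entropy, so the exam goes to radiologists, while scores near 0 or 1 yield low entropy and could, in principle, be handled without review.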
Since the study relied on retrospective data from the Netherlands, how generalizable are these results when compared with other populations and health care systems?
I think the numbers are probably not one-to-one generalizable to other countries, especially because in the Netherlands we have double reading. That means that 2 radiologists evaluate every exam. In the US, it is mostly a single radiologist. That's going to differ; however, I think the methods that we used to address uncertainty are applicable to many different populations and AI models. I think there will be a benefit in many different screening programs, but the exact numbers will be slightly different.
References
1. Verboom SD, Kroes J, Pires S, Broeders MJM, Sechopoulos I. AI should read mammograms only when confident: a hybrid breast cancer screening reading strategy. Radiology. Published online August 19, 2025. doi:10.1148/radiol.242594