Breast | AI / Informatics | Research | Trainee
ChatGPT vs. Breast Radiologists on ACR Appropriateness Criteria: Radiologists Win on Accuracy and Consistency
ACR Appropriateness Criteria (PubMed) | Mar 31
ChatGPT-3.5, -4, and -4o all showed significant mean bias vs. ACR Appropriateness Criteria for breast imaging (biases 1.76–2.47, all p<0.001), while experienced breast radiologists had a non-significant mean bias of 0.24 (p=0.489). Human oversight remains essential.
- Single-institution diagnostic accuracy study: 4 breast radiologists and 3 ChatGPT versions (GPT-3.5, GPT-4, GPT-4o, July 2024) each rated 81 imaging decisions across 10 ACR AC breast imaging clinical variants on a 1–9 appropriateness scale.
- All ChatGPT versions also demonstrated significant slope bias (p<0.001) on Bland-Altman analysis, meaning their errors were systematic rather than random, with the size of the disagreement varying across the rating scale. Radiologists showed the smallest slope bias of any group.
- Key limitation: single institution, small panel (4 radiologists, 10 clinical variants), and no external validation; results may not generalize to other LLM versions, prompting strategies, or practice settings.
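
For readers unfamiliar with the two Bland-Altman quantities mentioned above, the following is a minimal sketch of how mean bias and slope bias can be computed for one rater against a reference. It is not the study's code: the function name `bland_altman`, the sign convention (rater minus reference), and the example ratings are all assumptions made for illustration, and the ratings are made up rather than taken from the paper. It also omits the significance tests behind the reported p-values.

```python
from statistics import mean, stdev


def bland_altman(ratings, reference):
    """Return mean bias, 95% limits of agreement, and slope bias.

    Hypothetical helper for illustration. Differences are computed as
    rater minus reference; slope bias is the least-squares slope of the
    difference regressed on the mean of each rating pair.
    """
    if len(ratings) != len(reference) or len(ratings) < 3:
        raise ValueError("need two equal-length lists with at least 3 ratings")

    diffs = [r - ref for r, ref in zip(ratings, reference)]
    avgs = [(r + ref) / 2 for r, ref in zip(ratings, reference)]

    bias = mean(diffs)                      # mean bias
    sd = stdev(diffs)                       # sample SD of the differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)

    # Ordinary least-squares slope of diffs on avgs.
    avg_bar = mean(avgs)
    sxx = sum((a - avg_bar) ** 2 for a in avgs)
    sxy = sum((a - avg_bar) * (d - bias) for a, d in zip(avgs, diffs))
    slope = sxy / sxx if sxx else 0.0       # 0.0 if all pair means are equal

    return {"bias": bias, "limits": limits, "slope": slope}


# Made-up ratings on the 1-9 appropriateness scale, NOT data from the study.
reference = [1, 2, 3, 5, 7, 8, 9]
rater = [3, 4, 4, 6, 7, 8, 8]
result = bland_altman(rater, reference)
print(result)
```

In this toy example the rater scores low-appropriateness items too high and the top item slightly too low, so the mean bias is positive and the slope is negative: the disagreement depends on where an item sits on the scale, which is what a non-zero slope bias indicates.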
RadPigeon summaries are original and for information only. They are not clinical advice.
