Neuro / Head & Neck · AI / Informatics · Research · Trainee
Multimodal LLMs Lag Expert Neuroradiologists Despite Generational Gains
American Journal of Neuroradiology (AJNR) · 2w ago
On 106 image-based neuroradiology MCQs, expert neuroradiologists reached a mean accuracy of 0.915 (95% CI 0.877–0.953). GPT-5 and Gemini 2.5 trailed by ~0.22–0.24 in mean per-item accuracy and aligned more closely with learner than expert performance; improved mean scores did not guarantee consistent executi…
- Item-level paired benchmark study (106 Radiopaedia image-based MCQs) comparing GPT-4, GPT-5, Gemini 1.5, and Gemini 2.5 against expert neuroradiologists and radiology residents, using non-parametric bootstrap CIs and paired permutation tests with false-discovery-rate correction.
- Second-generation models (GPT-5, Gemini 2.5) approximated or exceeded resident-level performance in selected comparisons, but both remained ~0.22–0.24 mean per-item accuracy below the expert reference; when contextualized against Radiopaedia community data, advanced LLMs resembled aggregate learner rather than expert performance.
- Key limitations: questions sourced from a single crowdsourced educational platform (Radiopaedia), not real clinical imaging workflows; models were not externally or prospectively validated on independent clinical datasets, limiting generalizability to operational neuroradiology practice.
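
The statistical approach named in the first bullet (non-parametric bootstrap CIs, paired permutation tests, false-discovery-rate correction) can be sketched roughly as follows. This is a minimal illustration, not the study's code: the function names, the percentile bootstrap, the sign-flip permutation scheme, the Benjamini-Hochberg procedure, and the synthetic 0/1 correctness vectors are all assumptions chosen for the example, and the numbers it produces are unrelated to the published results.

```python
import random


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean per-item accuracy (resamples items)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample items with replacement and record the mean of each replicate.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def paired_permutation_p(a, b, n_perm=5000, seed=0):
    """Two-sided paired permutation test using random sign flips
    of the item-level differences a[i] - b[i]."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            hits += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Synthetic placeholder data (NOT the study's item-level results):
# 1 = item answered correctly, 0 = incorrectly, for 106 items.
_rng = random.Random(42)
reader_a = [1 if _rng.random() < 0.9 else 0 for _ in range(106)]
reader_b = [1 if _rng.random() < 0.7 else 0 for _ in range(106)]
reader_c = [1 if _rng.random() < 0.7 else 0 for _ in range(106)]

ci_a = bootstrap_ci(reader_a)
raw_p = [
    paired_permutation_p(reader_a, reader_b),
    paired_permutation_p(reader_a, reader_c),
    paired_permutation_p(reader_b, reader_c),
]
adj_p = benjamini_hochberg(raw_p)
```

Resampling and permuting at the item level keeps each question's answers paired across readers, which is the reason a paired design can compare models and humans on the same 106 items.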
RadPigeon summaries are original and for information only. They are not clinical advice.
