Neuro / Head & Neck · AI / Informatics · Research · Trainee

Multimodal LLMs Lag Expert Neuroradiologists Despite Generational Gains

American Journal of Neuroradiology (AJNR) · 2w ago

On 106 image-based neuroradiology MCQs, expert neuroradiologists reached a mean accuracy of 0.915 (95% CI 0.877–0.953). GPT-5 and Gemini 2.5 trailed the experts by ~0.22–0.24 in mean per-item accuracy and aligned more closely with learner than with expert performance; improved mean scores did not guarantee consistent executi…

Item-level paired benchmark study (106 Radiopaedia image-based MCQs) comparing GPT-4, GPT-5, Gemini 1.5, and Gemini 2.5 against expert neuroradiologists and radiology residents, using non-parametric bootstrap CIs and paired permutation tests with false-discovery-rate correction.
The newer-generation models (GPT-5, Gemini 2.5) approximated or exceeded resident-level performance in selected comparisons, but both remained ~0.22–0.24 below the expert reference in mean per-item accuracy; when contextualized against Radiopaedia community data, the advanced LLMs resembled aggregate learner performance rather than expert performance.
Key limitations: questions sourced from a single crowdsourced educational platform (Radiopaedia), not real clinical imaging workflows; models were not externally or prospectively validated on independent clinical datasets, limiting generalizability to operational neuroradiology practice.
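
The statistical approach named above (non-parametric bootstrap CIs, paired permutation tests, false-discovery-rate correction) can be illustrated with a minimal Python sketch. This is not the study's code. The percentile bootstrap, the sign-flip permutation scheme, the Benjamini-Hochberg procedure, the resample counts, and all function names are assumptions, since the summary does not specify them. The per-item scores are synthetic and only exercise the functions.

```python
import random


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-item scores (assumed variant)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample items with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def paired_permutation_p(a, b, n_perm=5000, seed=0):
    """Two-sided sign-flip permutation test on paired per-item differences."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_perm):
        # Under the null, each item's difference is equally likely to flip sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Synthetic per-item correctness (1 = correct) for 106 items; NOT study data.
    rng = random.Random(42)
    expert = [1 if rng.random() < 0.9 else 0 for _ in range(106)]
    models = {
        "model_a": [1 if rng.random() < 0.7 else 0 for _ in range(106)],
        "model_b": [1 if rng.random() < 0.6 else 0 for _ in range(106)],
    }
    print("expert 95% CI:", bootstrap_ci(expert))
    raw = [paired_permutation_p(expert, scores) for scores in models.values()]
    for name, p_adj in zip(models, benjamini_hochberg(raw)):
        print(name, "FDR-adjusted p:", p_adj)
```

Pairing at the item level matters here: every reader and model answers the same 106 questions, so the test compares differences question by question rather than treating the two score lists as independent samples.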


RadPigeon summaries are original and for information only. They are not clinical advice.