Chest / Thoracic | General | Emergency | AI / Informatics | Research | Trainee
ChatGPT-4o Beats Clinicians on Critical Care Board Questions — but One-Third of Answers Carry Harm Risk
Radiology education & curriculum (PubMed) | 1w ago
ChatGPT-4o scored 74.9% on 183 multimodal critical care board questions vs. 71.1% for pooled clinicians (p=0.03), but 33.3% of its responses were flagged for potential clinical harm, driven largely by poor image interpretation (61.7% correct) and flawed reasoning.
- Observational study, n=183 Society of Critical Care Medicine (SCCM) board-style questions with clinical images; 14 expert reviewers (physicians, advanced practice providers, pharmacists) assessed accuracy, reasoning, and harm potential.
- Performance varied widely by domain: best in pulmonary disease (91.7%) and surgery/trauma (87.5%); worst in critical care ultrasound (51.1%) — a major gap for bedside imaging applications.
- Key limitation: the study used a simulated board-exam setting, so findings may not generalize to real-time clinical decision-making; only one LLM (GPT-4o) was tested, with no external validation.
RadPigeon summaries are original and for information only. They are not clinical advice.
