LLMs Match Human Accuracy for ILD Data Extraction From Clinic Notes
Radiology AI literature (PubMed), 3d ago
Seven of 12 LLMs achieved human-level accuracy (96.2%) extracting binary structured data from ILD clinic notes—at seconds per note and under US $10.50 per patient. Multiclass ILD classification accuracy dipped to 88–91%, demonstrating feasibility but reduced reliability for comp…
- Retrospective analysis of 100 ILD clinic notes; 12 LLMs were evaluated on 10 binary clinical questions, with the consensus of 3 ILD physicians as ground truth.
- Top-tier models (Claude 3.5 Sonnet, GPT-4o, o1, o3-mini, gpt-oss-20b, gpt-oss-120b) showed no significant accuracy differences from one another, with one exception: GPT-4o was significantly lower than o1 (Bonferroni-adjusted P=0.04).
- Limitations: single-center, small cohort, prompts optimized on a 10-note engineering set; external validity and generalizability to other ILD subtypes or institutions were not assessed.
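
The evaluation design described above (binary answers scored against a physician consensus, then pairwise model comparisons with a Bonferroni adjustment) can be sketched in a few lines of Python. This is an illustration only, not the study's code: the summary does not say which statistical test was used, so the exact McNemar test here is an assumption, and the function names, model names, and toy labels are hypothetical.

```python
from itertools import combinations
from math import comb


def accuracy(preds, truth):
    """Fraction of binary answers that match the reference labels."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


def mcnemar_exact(a, b, truth):
    """Two-sided exact McNemar p-value for two models scored on the same items.

    ASSUMPTION: the summary does not name the test; McNemar is a common
    choice for paired binary outcomes and is used here for illustration.
    """
    only_a = sum(x == t and y != t for x, y, t in zip(a, b, truth))
    only_b = sum(x != t and y == t for x, y, t in zip(a, b, truth))
    n = only_a + only_b  # discordant items: exactly one model is correct
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni_pairwise(preds_by_model, truth):
    """Bonferroni-adjusted p-values for every pair of models."""
    pairs = list(combinations(sorted(preds_by_model), 2))
    m = len(pairs)  # number of comparisons the adjustment corrects for
    return {
        (a, b): min(1.0, m * mcnemar_exact(preds_by_model[a], preds_by_model[b], truth))
        for a, b in pairs
    }


# Toy data (hypothetical): 12 binary reference answers and three models.
truth = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
preds = {
    "model_a": list(truth),                                   # all correct
    "model_b": [1 - t for t in truth[:2]] + truth[2:],        # 2 errors
    "model_c": [1 - t for t in truth[:10]] + truth[10:],      # 10 errors
}

for name, p in preds.items():
    print(name, round(accuracy(p, truth), 3))
for pair, p_adj in bonferroni_pairwise(preds, truth).items():
    print(pair, round(p_adj, 4))
```

The Bonferroni step multiplies each raw p-value by the number of comparisons and caps the result at 1, which is why an adjusted value such as P=0.04 can remain significant only when the raw difference is fairly strong.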
RadPigeon summaries are original and for information only. They are not clinical advice.
