🚨 Can your AI save a life when the doctor makes a diagnostic error?
Our team at Johns Hopkins Bloomberg School of Public Health built a first-of-its-kind benchmark to test whether LLMs can course-correct physicians' diagnostic errors during the initial patient presentation — the moment when small mistakes can have the biggest consequences.
We tested models from:
OpenAI – GPT-4o, GPT-4.5, o1
Anthropic – Claude 3.5, Claude 3.7
Google DeepMind – Gemini 2.5 Pro, Gemini 2.0 Flash
xAI – Grok 2, Grok 3
DeepSeek AI – DeepSeek-V3, DeepSeek-R1
Amazon – Nova Pro
📊 Finding: All models share a similar knowledge pattern — but differ sharply in disease-specific performance, with each excelling in some conditions and struggling in others.
💡 Opportunity: Let’s stress-test your latest models, refine this benchmark, and explore safe, high-impact clinical integration.
📩 Building the future of medical AI? My inbox is open.
📖 Stay tuned — the full manuscript with complete methods, results, and insights is coming soon.
#AI #HealthcareAI #Diagnostics #PatientSafety #LLM #Benchmark #Collaboration
📢 Huge thanks to our incredible team: Xiaoyi P., Ruxandra Irimia, Anthony Li, MD, MPH, António Bandeira, MD, MPH