But it highlights a critical point: without understanding where and how they fail, we risk unsafe deployment of these models.
📄 Full paper: arxiv.org/abs/2505.00467
🧵 (14/)
But it highlights a critical point: without understanding where and how they fail, we risk unsafe deployment of these models.
📄 Full paper: arxiv.org/abs/2505.00467
🧵 (14/)
• Modern LLMs are capable but fragile in realistic clinical settings.
• Failures are often subtle.
• These models change w/ time; rigorous, continuous evaluation is essential.
• Clinicians must be equipped to critically assess model outputs.
🧵 (13/)
• Modern LLMs are capable but fragile in realistic clinical settings.
• Failures are often subtle.
• These models change w/ time; rigorous, continuous evaluation is essential.
• Clinicians must be equipped to critically assess model outputs.
🧵 (13/)
Some vulnerabilities were fixed.
Some persisted.
Other changed into different forms of vulnerability.
Takeaway: because model behaviour shifts with time, static evaluations are insufficient.
🧵 (12/)
Some vulnerabilities were fixed.
Some persisted.
Other changed into different forms of vulnerability.
Takeaway: because model behaviour shifts with time, static evaluations are insufficient.
🧵 (12/)
🧵 (11/)
🧵 (11/)
Gemini and Mistral say yes.
❌ This is Incorrect Medical Knowledge; extra levodopa doesn’t cause bradykinesia.
🧵 (10/)
Gemini and Mistral say yes.
❌ This is Incorrect Medical Knowledge; extra levodopa doesn’t cause bradykinesia.
🧵 (10/)
GPT-4o recommends knee replacement.
But clinical signs point to sciatica or neurological pain, not surgical arthritis.
⚓️ The model Anchors on the surgeon's specialty rather than reasoning.
🧵 (9/)
GPT-4o recommends knee replacement.
But clinical signs point to sciatica or neurological pain, not surgical arthritis.
⚓️ The model Anchors on the surgeon's specialty rather than reasoning.
🧵 (9/)
The model failed to identify the urgent need to stabilize glucose.
🌫️ The model Omitted Medical Knowledge necessary for treatment.
🧵 (8/)
The model failed to identify the urgent need to stabilize glucose.
🌫️ The model Omitted Medical Knowledge necessary for treatment.
🧵 (8/)
GPT-4o described clear surgical improvements between the images.
The model accepted the labels at face value.
🩻 This is an Image Interpretation Failure.
🧵 (7/)
GPT-4o described clear surgical improvements between the images.
The model accepted the labels at face value.
🩻 This is an Image Interpretation Failure.
🧵 (7/)
LLaMA hallucinated a MELD score for the second patient and used it to justify a prioritization decision.
😵💫 This is Hallucination--here, it's a high-stakes error.
🧵 (6/)
LLaMA hallucinated a MELD score for the second patient and used it to justify a prioritization decision.
😵💫 This is Hallucination--here, it's a high-stakes error.
🧵 (6/)
😵💫 Hallucination
🩻 Image interpretation failures
❌ Incorrect medical knowledge
🌫️ Omitted medical knowledge
⚓️ Anchoring
Examples of each below! 👇
🧵 (4/)
😵💫 Hallucination
🩻 Image interpretation failures
❌ Incorrect medical knowledge
🌫️ Omitted medical knowledge
⚓️ Anchoring
Examples of each below! 👇
🧵 (4/)
Rather, we asked participants to use the LLMs as they might in clinical practice.
Think:
👉 What are this patient’s surgical options?
👉 Can you interpret this X-ray?
👉 Who should be prioritized for transplant?
🧵 (3/)
Rather, we asked participants to use the LLMs as they might in clinical practice.
Think:
👉 What are this patient’s surgical options?
👉 Can you interpret this X-ray?
👉 Who should be prioritized for transplant?
🧵 (3/)
• 46 participants at MLHC 2024.
• 18 w/ clinical backgrounds.
• Tested GPT-4o, Gemini Flash 1.5, LLaMA 3 70B, and Mistral 7B.
• Focused on realistic use cases.
🧵 (2/)
• 46 participants at MLHC 2024.
• 18 w/ clinical backgrounds.
• Tested GPT-4o, Gemini Flash 1.5, LLaMA 3 70B, and Mistral 7B.
• Focused on realistic use cases.
🧵 (2/)