Abstract
Purpose: Large language models (LLMs), such as GPT-4 (OpenAI; San Francisco, CA), are promising tools for surgical education. However, skepticism surrounding their accuracy and reliability remains a significant barrier to their widespread adoption. Although GPT-4 has demonstrated a remarkable ability to pass multiple-choice tests, its general surgery knowledge and clinical judgment in complex oral-based examinations are less clear. This study aims to evaluate GPT-4’s general surgery knowledge on mock written and oral board-style examinations to drive improvements that will enable the tool to revolutionize surgical education and practice.
Methods: We tested GPT-4’s ability to answer 250 random multiple-choice questions (MCQs) from the Surgical Council on Resident Education (SCORE) question bank and navigate 4 oral board scenarios derived from the Entrustable Professional Activities (EPA) topic list. Two former oral board examiners assessed the responses independently for accuracy.
Results: On MCQs, GPT-4 answered 197 out of 250 (78.8%) correctly, corresponding to a 92% probability of passing the American Board of Surgery Qualifying Examination (ABS QE). On oral board scenarios, GPT-4 committed critical failures in 3 out of 4 (75%) clinical cases. Common reasons for failure were incorrect timing of intervention and incorrect suggested operation.
Conclusions: While GPT-4’s high performance on MCQs mirrored prior studies, the model struggled to generate accurate long-form content in our mock oral board examination. Future efforts should use specialized datasets and advanced reinforcement learning to improve LLM performance in complex, high-stakes clinical decision-making.
| Original language | English (US) |
|---|---|
| Pages (from-to) | 1557-1566 |
| Number of pages | 10 |
| Journal | Medical Science Educator |
| Volume | 35 |
| Issue number | 3 |
| State | Published - Jun 2025 |
Keywords
- Artificial intelligence
- Competency-based education
- Digital learning
- Education technology
- Entrustable professional activities
- Large language models
- Surgical education
ASJC Scopus subject areas
- Medicine (miscellaneous)
- Education
Article title: GPT-4 as a Board-Certified Surgeon: A Pilot Study