TY - JOUR
T1 - GPT-4 as a Board-Certified Surgeon: A Pilot Study
AU - Roshal, Joshua A.
AU - Silvestri, Caitlin
AU - Sathe, Tejas
AU - Townsend, Courtney
AU - Klimberg, V. Suzanne
AU - Perez, Alexander
N1 - Publisher Copyright:
© The Author(s) under exclusive licence to International Association of Medical Science Educators 2025.
PY - 2025
Y1 - 2025
N2 - Purpose: Large language models (LLMs), such as GPT-4 (OpenAI; San Francisco, CA), are promising tools for surgical education. However, skepticism surrounding their accuracy and reliability remains a significant barrier to their widespread adoption. Although GPT-4 has demonstrated a remarkable ability to pass multiple-choice tests, its general surgery knowledge and clinical judgment in complex oral-based examinations are less clear. This study aims to evaluate GPT-4’s general surgery knowledge on mock written and oral board-style examinations to drive improvements that will enable the tool to revolutionize surgical education and practice. Methods: We tested GPT-4’s ability to answer 250 random multiple-choice questions (MCQs) from the Surgical Council on Resident Education (SCORE) question bank and navigate 4 oral board scenarios derived from the Entrustable Professional Activities (EPA) topic list. Two former oral board examiners assessed the responses independently for accuracy. Results: On MCQs, GPT-4 answered 197 out of 250 (78.8%) correctly, corresponding to a 92% probability of passing the American Board of Surgery Qualifying Examination (ABS QE). On oral board scenarios, GPT-4 committed critical failures in 3 out of 4 (75%) clinical cases. Common reasons for failure were incorrect timing of intervention and incorrect suggested operation. Conclusions: While GPT-4’s high performance on MCQs mirrored prior studies, the model struggled to generate accurate long-form content in our mock oral board examination. Future efforts should use specialized datasets and advanced reinforcement learning to improve LLM performance in complex, high-stakes clinical decision-making.
AB - Purpose: Large language models (LLMs), such as GPT-4 (OpenAI; San Francisco, CA), are promising tools for surgical education. However, skepticism surrounding their accuracy and reliability remains a significant barrier to their widespread adoption. Although GPT-4 has demonstrated a remarkable ability to pass multiple-choice tests, its general surgery knowledge and clinical judgment in complex oral-based examinations are less clear. This study aims to evaluate GPT-4’s general surgery knowledge on mock written and oral board-style examinations to drive improvements that will enable the tool to revolutionize surgical education and practice. Methods: We tested GPT-4’s ability to answer 250 random multiple-choice questions (MCQs) from the Surgical Council on Resident Education (SCORE) question bank and navigate 4 oral board scenarios derived from the Entrustable Professional Activities (EPA) topic list. Two former oral board examiners assessed the responses independently for accuracy. Results: On MCQs, GPT-4 answered 197 out of 250 (78.8%) correctly, corresponding to a 92% probability of passing the American Board of Surgery Qualifying Examination (ABS QE). On oral board scenarios, GPT-4 committed critical failures in 3 out of 4 (75%) clinical cases. Common reasons for failure were incorrect timing of intervention and incorrect suggested operation. Conclusions: While GPT-4’s high performance on MCQs mirrored prior studies, the model struggled to generate accurate long-form content in our mock oral board examination. Future efforts should use specialized datasets and advanced reinforcement learning to improve LLM performance in complex, high-stakes clinical decision-making.
KW - Artificial intelligence
KW - Competency-based education
KW - Digital learning
KW - Education technology
KW - Entrustable professional activities
KW - Large language models
KW - Surgical education
UR - http://www.scopus.com/inward/record.url?scp=105000035465&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105000035465&partnerID=8YFLogxK
U2 - 10.1007/s40670-025-02352-5
DO - 10.1007/s40670-025-02352-5
M3 - Article
AN - SCOPUS:105000035465
SN - 2156-8650
JO - Medical Science Educator
JF - Medical Science Educator
ER -