GPT-4 as a Board-Certified Surgeon: A Pilot Study

Joshua A. Roshal, Caitlin Silvestri, Tejas Sathe, Courtney Townsend, V. Suzanne Klimberg, Alexander Perez

Research output: Contribution to journal › Article › peer-review

Abstract

Purpose: Large language models (LLMs), such as GPT-4 (OpenAI; San Francisco, CA), are promising tools for surgical education. However, skepticism surrounding their accuracy and reliability remains a significant barrier to their widespread adoption. Although GPT-4 has demonstrated a remarkable ability to pass multiple-choice tests, its general surgery knowledge and clinical judgment in complex oral-based examinations are less clear. This study aims to evaluate GPT-4’s general surgery knowledge on mock written and oral board-style examinations to drive improvements that will enable the tool to revolutionize surgical education and practice.

Methods: We tested GPT-4’s ability to answer 250 random multiple-choice questions (MCQs) from the Surgical Council on Resident Education (SCORE) question bank and navigate 4 oral board scenarios derived from the Entrustable Professional Activities (EPA) topic list. Two former oral board examiners assessed the responses independently for accuracy.

Results: On MCQs, GPT-4 answered 197 out of 250 (78.8%) correctly, corresponding to a 92% probability of passing the American Board of Surgery Qualifying Examination (ABS QE). On oral board scenarios, GPT-4 committed critical failures in 3 out of 4 (75%) clinical cases. Common reasons for failure were incorrect timing of intervention and incorrect suggested operation.

Conclusions: While GPT-4’s high performance on MCQs mirrored prior studies, the model struggled to generate accurate long-form content in our mock oral board examination. Future efforts should use specialized datasets and advanced reinforcement learning to improve LLM performance in complex, high-stakes clinical decision-making.
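
The two proportions reported in the Results follow from simple arithmetic. The sketch below is illustrative only and is not part of the study: the function name `percent` is hypothetical, and the only inputs are the counts stated in the abstract. The 92% pass probability cannot be reproduced this way, because the abstract does not describe how it was derived.

```python
def percent(count: int, total: int) -> float:
    """Return count/total as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


# Counts taken directly from the abstract.
mcq_accuracy = percent(197, 250)      # MCQs answered correctly
oral_failure_rate = percent(3, 4)     # oral scenarios with a critical failure

print(f"MCQ accuracy: {mcq_accuracy}%")
print(f"Oral scenarios with critical failure: {oral_failure_rate}%")
```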

Original language: English (US)
Journal: Medical Science Educator
State: Accepted/In press - 2025

Keywords

  • Artificial intelligence
  • Competency-based education
  • Digital learning
  • Education technology
  • Entrustable professional activities
  • Large language models
  • Surgical education

ASJC Scopus subject areas

  • Medicine (miscellaneous)
  • Education
