NASSJ Literature Review
Performance of ChatGPT versus Spine Surgeons as an Emergency Department Spine Call Consultant

Diego Garmendia, BA
Yale School of Medicine, New Haven, CT

Jonathan N. Grauer, MD
Yale School of Medicine, New Haven, CT
Article Reviewed
Taka TM, Sebt S, Meng S, Cabrera A, Shin D, Yacoubian V, Chao W, Rossie D, Xu Z, Erickson M, Rocos B, Than K, Yu E, Ahn N, Bono C, Cheng W, Danisa O. (2026). Performance of ChatGPT versus spine surgeons as an emergency department spine call consultant. North American Spine Society Journal (NASSJ), 25, 100836. https://doi.org/10.1016/j.xnsj.2025.100836
Abstract
Background: Large language models (LLMs) like ChatGPT are increasingly being recognized as credible tools for use across diverse health care settings. While artificial intelligence (AI) use has previously been evaluated in emergency medicine, its use in subspecialty care, particularly spine surgery, remains underexplored. This study evaluates the clinical accuracy, management appropriateness, completeness, helpfulness, and overall quality of ChatGPT responses compared to those of board-certified spine surgeons in response to common emergency department (ED) consultations.
Methods: A 7-part questionnaire was developed based on common ED spine consultations (eg, cauda equina syndrome, compression fracture in elderly patients, purulent drainage from surgical wound, acute lumbar disc herniation, incomplete spinal cord injury, epidural abscess, and metastatic spine disease). Each case included 3–4 questions pertaining to examination, diagnosis, management, and counseling. Responses from ChatGPT and 7 board-certified spine surgeons were restricted to 3–4 sentences per question. Three emergency medicine physicians rated each de-identified questionnaire response using a 5-point Likert scale. Statistical analysis was conducted using a 2-sample t-test with unequal variance. Inter-rater reliability was assessed using the pairwise weighted Cohen’s kappa coefficient (κ).
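For readers less familiar with the two statistics named in the Methods, the following is a minimal sketch of how they can be computed. It is not the authors' analysis code. The function names are hypothetical, any ratings passed in are illustrative placeholders rather than study data, and the linear weighting scheme is an assumption, since the abstract does not state whether linear or quadratic weights were used. A 2-sample t-test with unequal variance is commonly called Welch's t-test; the sketch returns its test statistic and degrees of freedom only, without a p-value.

```python
from collections import Counter
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)
    for two independent samples with possibly unequal variances."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def weighted_kappa(r1, r2, k=5):
    """Linearly weighted Cohen's kappa for two raters scoring the same
    items on a 1..k Likert scale (linear weights are an assumption)."""
    n = len(r1)
    observed = Counter(zip(r1, r2))    # joint counts of rating pairs
    m1, m2 = Counter(r1), Counter(r2)  # each rater's marginal counts
    obs_dis = exp_dis = 0.0
    for i in range(1, k + 1):
        for j in range(1, k + 1):
            w = abs(i - j) / (k - 1)   # disagreement weight in [0, 1]
            obs_dis += w * observed[(i, j)] / n
            exp_dis += w * m1[i] * m2[j] / (n * n)
    # Undefined (division by zero) if both raters give one identical constant score.
    return 1.0 - obs_dis / exp_dis
```

With three raters, as in the study, the pairwise approach would call `weighted_kappa` once for each of the three rater pairs.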