
Abstract

Background: ChatGPT is a large-scale language model trained on various datasets to learn, analyze, and generate human-like answers to users' questions. To assess its applicability to medical education, more information is needed on whether it can provide accurate and coherent responses to examination-style questions. The aim of this study was to characterize ChatGPT responses to ophthalmology questions by subtopic, to determine where the system might be used reliably in resident education and where its performance remains weak.

Methods: Ophthalmology questions were obtained from a widely utilized study resource, OphthoQuestions. Thirteen sections, each covering a different ophthalmic subtopic, were sampled, and questions were collected from each section. Questions containing images or tables were excluded. Of 163 questions and their respective answer choices, 131 were input into ChatGPT-3.5. The accuracy of ChatGPT by subtopic was analyzed using Excel. ChatGPT responses were evaluated for properties of natural coherence. Incorrect responses were categorized as logical fallacy, informational fallacy, or explicit fallacy. Statistical significance of categorical variables was assessed using the χ2 test.
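The abstract states that categorical associations (for example, presence of logical reasoning versus response correctness) were tested with the χ2 test, with the analysis performed in Excel. As a rough illustration of an equivalent analysis, the sketch below runs a χ2 test on a 2 × 2 contingency table in Python; the counts shown are hypothetical placeholders, not the study's data.

```python
# Illustrative sketch only: the study reports using Excel and the chi-square test;
# the counts below are hypothetical placeholders, not the study's data.
from scipy.stats import chi2_contingency

# 2x2 contingency table:
# rows = response correctness (correct, incorrect),
# columns = whether logical reasoning was present in the explanation (yes, no).
table = [
    [60, 10],  # hypothetical counts for correct responses
    [40, 21],  # hypothetical counts for incorrect responses
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}, dof = {dof}")
```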

Results: ChatGPT answered 71 of 131 questions correctly (54.2%). Accuracy in each subtopic was as follows: general medicine (90%), oculoplastics (70%), retina and vitreous (70%), cornea (30%), fundamentals (40%), optics (40%), pediatrics (40%), glaucoma (50%), lens and cataract (50%), neuro-ophthalmology (60%), pathology and tumors (60%), refractive surgery (55%), and uveitis (50%). Logical reasoning, internal information, and external information were identified in 82.4%, 100%, and 83.2% of the responses, respectively. The use of logical reasoning (P = 0.003) and external information (P = 0.02) differed significantly between correct and incorrect responses.

Conclusion: ChatGPT scored higher in general medicine, oculoplastics, and retina and vitreous than in cornea, fundamentals, optics, and pediatrics. Identifying subtopics in which ChatGPT performs less well allows learners to acquire appropriate supplemental resources in these areas.

Received: 10 Aug 2024

Accepted: 4 Feb 2025
