Diagnostic Accuracy Similar Between ChatGPT, Retina Specialists in Small Study


Glaucoma specialists were outperformed by a large language model chatbot when it came to diagnostic and treatment accuracy in glaucoma cases.

A large language model (LLM) chatbot outperformed glaucoma specialists and matched retina specialists in accuracy when presented with deidentified glaucoma and retina cases and questions, according to a study published in JAMA Ophthalmology. The finding suggests that such chatbots could serve as a diagnostic aid in the future.

LLM chatbots, a form of artificial intelligence, have previously performed well on Ophthalmic Knowledge Assessment Program examinations, and research has begun to examine how they can be used in specific areas of ophthalmology. This study aimed to assess the broader capabilities of the chatbot by comparing its accuracy with that of attending-level ophthalmologists, specifically fellowship-trained glaucoma and retina specialists.

Tonometry test for eye pressure | Image credit: eyeadobestock - stock.adobe.com

The cross-sectional study took place at a single center. All patient data came from the Department of Ophthalmology at Icahn School of Medicine at Mount Sinai in New York, New York, and all specialists were practicing physicians at the same center. To test clinical knowledge, the researchers selected 10 glaucoma questions and 10 retina questions from the American Academy of Ophthalmology's Commonly Asked Questions. To test case management knowledge, 10 retina cases and 10 glaucoma cases were selected from patients in the department. All questions and patients were selected at random.

The chatbot used in the study was GPT-4 (version dated May 12, 2023). The accuracy of all answers was rated on a 10-point Likert scale, with 1 and 2 representing poor or unacceptable inaccuracies and 9 and 10 representing very good accuracy without any inaccuracies. A 6-point scale was used to evaluate how medically complete the answers were.

The specialists for retina and glaucoma answered the clinical questions and the case management questions, and their answers were compared with the answers generated by GPT-4 as the primary end point.

A total of 1271 images were rated for accuracy and 1267 for completeness. The study included 12 specialists (8 glaucoma specialists and 4 retina specialists) and 3 ophthalmology trainees. Participants had been in practice for a mean (SD) of 11.7 (13.5) years.

The LLM chatbot had a mean combined question-case accuracy rank of 506.2, whereas the glaucoma specialists had a mean rank of 403.4. The pattern was similar for completeness, with a mean rank of 528.3 for the LLM chatbot and 398.7 for the glaucoma specialists. The mean ranks for combined accuracy were closer between the LLM chatbot and the retina specialists, at 235.3 and 216.1, respectively. The mean ranks for completeness were 258.3 for the chatbot and 208.7 for the retina specialists.
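For readers unfamiliar with mean ranks, the following is a minimal sketch of how such a figure can be computed from pooled Likert ratings: all ratings are ranked together, tied scores share the average of their positions, and the ranks are then averaged within each group. The article does not describe the study's exact statistical procedure, so this is an illustration only. The function names and the ratings are hypothetical and are not data from the study.

```python
from collections import defaultdict


def average_ranks(values):
    """Return 1-based ranks; tied values get the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mean_rank_by_group(ratings):
    """ratings: list of (group, score) pairs pooled across all raters."""
    ranks = average_ranks([score for _, score in ratings])
    by_group = defaultdict(list)
    for (group, _), rank in zip(ratings, ranks):
        by_group[group].append(rank)
    return {g: sum(r) / len(r) for g, r in by_group.items()}


# Hypothetical 10-point Likert accuracy ratings, NOT data from the study.
example = [
    ("chatbot", 9), ("chatbot", 8), ("chatbot", 10), ("chatbot", 7),
    ("specialist", 6), ("specialist", 8), ("specialist", 5), ("specialist", 7),
]
result = mean_rank_by_group(example)
print(result)
```

A higher mean rank means a group's answers tended to receive higher ratings than the other group's. Because ranks run from 1 to the total number of pooled ratings, mean ranks from comparisons with different numbers of ratings are not on the same scale, which may be why the glaucoma and retina figures above differ so much in magnitude.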

“Both trainees and specialists rated the chatbot’s accuracy and completeness more favorably than those of their specialist counterparts,” the authors wrote, with specialists rating the chatbot significantly better than humans in terms of accuracy and completeness.

The study had some limitations. It took place at a single center with only 1 group of attendings, so the findings may not be generalizable to other populations. Chatbots also have limitations in decision-making, especially with complex decisions, which should be considered.

Overall, this assessment found that the LLM chatbot displayed accuracy comparable to that of retina and glaucoma specialists on both clinical questions and clinical cases, which indicates its potential use as a tool in diagnosis.


Huang AS, Hirabayashi K, Barna L, Parikh D, Pasquale LR. Assessment of a large language model’s response to questions and cases about glaucoma and retina management. JAMA Ophthalmol. Published online February 22, 2024. doi:10.1001/jamaophthalmol.2023.6917

© 2024 MJH Life Sciences
All rights reserved.