February 23, 2024

Diagnostic Accuracy Similar Between ChatGPT, Retina Specialists in Small Study

Glaucoma specialists were outperformed by a large language model chatbot when it came to diagnostic and treatment accuracy in glaucoma cases.

A large language model (LLM) chatbot was able to outperform glaucoma specialists and match retina specialists in terms of accuracy when presented with deidentified glaucoma and retina cases and questions, according to a study published in JAMA Ophthalmology. This finding indicates that it could be a diagnostic tool in the future.

LLM chatbots, a form of artificial intelligence, have previously performed well on Ophthalmic Knowledge Assessment Program examinations, and research has begun to examine how they can be used in specific areas of ophthalmology. This study aimed to assess the broader capabilities of the chatbot by comparing its accuracy with that of attending-level ophthalmologists, specifically fellowship-trained glaucoma and retina specialists.

The cross-sectional study took place at a single center. All patient data were taken from the Department of Ophthalmology at Icahn School of Medicine at Mount Sinai, New York, New York, and all specialists were practicing physicians at the same center. The researchers selected 10 glaucoma questions and 10 retina questions from the Commonly Asked Questions of the American Academy of Ophthalmology to test knowledge on clinical questions. To test case management knowledge, 10 retina cases and 10 glaucoma cases were selected from patients in the department. All selections of questions and patients were random.

The May 12, 2023, version of the GPT-4 chatbot was used for the study. A 10-point Likert scale was used to measure the accuracy of all answers, with 1 and 2 representing poor or unacceptable inaccuracies and 9 and 10 representing very good accuracy without any inaccuracies. A 6-point scale was used to evaluate how medically complete the answers were.

The retina and glaucoma specialists answered the clinical questions and the case management questions. The primary end point was the comparison of their answers with those generated by GPT-4.

A total of 1271 images were rated for accuracy and 1267 images for completeness in this study. Twelve specialists were included, 8 of them glaucoma specialists and 4 retina specialists, along with 3 ophthalmology trainees. The mean (SD) number of years that the participants had practiced was 11.7 (13.5) years.

The LLM chatbot had a mean combined question-case accuracy rank of 506.2, whereas the glaucoma specialists had a mean rank of 403.4. The chatbot also ranked higher for completeness in this comparison, with a mean rank of 528.3 vs 398.7 for the glaucoma specialists. The mean rank for combined accuracy was closer between the LLM chatbot and the retina specialists, at 235.3 and 216.1, respectively. The mean rank for completeness was 258.3 for the chatbot and 208.7 for the retina specialists.
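The article does not describe how these mean ranks were calculated. One common approach, assumed here, is to pool the ratings from both groups, rank them from lowest to highest (tied ratings share the average of their positions), and then average the ranks within each group. The sketch below illustrates that idea; the function names and the ratings are hypothetical and are not taken from the study.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the full run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mean_ranks(group_a, group_b):
    """Pool both groups, rank the pooled ratings, and return each group's mean rank."""
    pooled = list(group_a) + list(group_b)
    ranks = average_ranks(pooled)
    ranks_a = ranks[:len(group_a)]
    ranks_b = ranks[len(group_a):]
    return sum(ranks_a) / len(ranks_a), sum(ranks_b) / len(ranks_b)


# Hypothetical 10-point Likert accuracy ratings (illustrative only, not study data).
chatbot_ratings = [9, 8, 10, 7, 9]
specialist_ratings = [7, 6, 8, 9, 5]

print(mean_ranks(chatbot_ratings, specialist_ratings))  # (7.0, 4.0)
```

Under this assumption, a higher mean rank means that a group's answers tended to receive higher ratings than the other group's. The size of a mean rank also depends on how many ratings were pooled, which would explain why the glaucoma and retina figures sit on different scales.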

“Both trainees and specialists rated the chatbot’s accuracy and completeness more favorably than those of their specialist counterparts,” the authors wrote, with specialists rating the chatbot significantly better than humans in terms of accuracy and completeness.

There were some limitations to this study. It took place at a single center with only 1 group of attendings, which may limit its generalizability to other populations. Chatbot decision-making also has limitations, especially with complex decisions, which should be considered.

Overall, this assessment found that the LLM chatbot displayed accuracy comparable to that of retina and glaucoma specialists on both clinical questions and clinical cases, which indicates its potential use as a tool in diagnosis.

Reference

Huang AS, Hirabayashi K, Barna L, Parikh D, Pasquale LR. Assessment of a large language model’s response to questions and cases about glaucoma and retina management. JAMA Ophthalmol. Published online February 22, 2024. doi:10.1001/jamaophthalmol.2023.6917
