A large language model chatbot outperformed glaucoma specialists in diagnostic and treatment accuracy when presented with glaucoma cases.
A large language model (LLM) chatbot outperformed glaucoma specialists and matched retina specialists in accuracy when presented with deidentified glaucoma and retina cases and questions, according to a study published in JAMA Ophthalmology. The finding suggests the technology could one day serve as a diagnostic tool.
LLM chatbots—a form of artificial intelligence—have previously demonstrated their ability to perform well on Ophthalmic Knowledge Assessment Program examinations, and research has begun to examine how they can be used in specific areas of ophthalmology. This study aimed to assess the chatbot's broader capabilities by comparing its accuracy with that of ophthalmologists at the attending level; fellowship-trained glaucoma and retina specialists were compared with the LLM.
The cross-sectional study took place at a single center. All eye data were drawn from the Department of Ophthalmology at the Icahn School of Medicine at Mount Sinai, New York, New York, and all specialists were practicing physicians at the same center. To test knowledge of clinical questions, the researchers selected 10 glaucoma questions and 10 retina questions from the Commonly Asked Questions of the American Academy of Ophthalmology. To test case management knowledge, 10 retina cases and 10 glaucoma cases were selected from patients in the department. Questions and patients were selected at random.
The GPT-4 chatbot (May 12, 2023, version) was used for the study. A 10-point Likert scale measured the accuracy of all answers, with scores of 1 and 2 representing poor or unacceptable inaccuracies and scores of 9 and 10 representing very good accuracy without any inaccuracies. A 6-point scale was used to evaluate the medical completeness of the responses.
The specialists for retina and glaucoma answered the clinical questions and the case management questions, and their answers were compared with the answers generated by GPT-4 as the primary end point.
There were 1271 images rated for accuracy and 1267 images rated for completeness in this study. Twelve specialists were included, 8 of them glaucoma specialists and 4 retina specialists; 3 ophthalmology trainees were also included. The mean (SD) length of practice among the participants was 11.7 (13.5) years.
The LLM chatbot had a mean combined question-case accuracy rank of 506.2 vs 403.4 for the glaucoma specialists. The mean rank for completeness was 528.3 for the LLM chatbot and 398.7 for the glaucoma specialists. The mean rank for combined accuracy was closer between the LLM chatbot and the retina specialists, at 235.3 and 216.1, respectively, and the mean rank for completeness was comparable at 258.3 for the chatbot and 208.7 for the retina specialists.
“Both trainees and specialists rated the chatbot’s accuracy and completeness more favorably than those of their specialist counterparts,” the authors wrote, noting that specialists rated the chatbot significantly better than its human counterparts on both measures.
This study had some limitations. It took place at a single center with only 1 group of attendings, which may limit the generalizability of the findings to other populations. The limitations of chatbot decision-making, especially for complex decisions, should also be considered.
Overall, this assessment found that the LLM chatbot demonstrated accuracy comparable to that of retina and glaucoma specialists on both clinical questions and clinical cases, which indicates its potential use as a diagnostic tool.
Reference
Huang AS, Hirabayashi K, Barna L, Parikh D, Pasquale LR. Assessment of a large language model’s response to questions and cases about glaucoma and retina management. JAMA Ophthalmol. Published online February 22, 2024. doi:10.1001/jamaophthalmol.2023.6917