
Foundation Model Comparable to Expert Ophthalmologists for Textual Questions
Key Takeaways
- Foundation models (FMs) performed comparably to ophthalmology experts on textual questions, with Claude 3.5 Sonnet achieving 77.7% accuracy.
- Multimodal imaging questions revealed limitations in FMs, with GPT-4o achieving the highest accuracy at 57.5%, still below expert performance.
- FMs showed improved ophthalmological knowledge compared with older large language models and ophthalmology trainees.
The use of foundation models (FMs) for medical assistance was comparable to the performance of expert ophthalmologists on textual questions, according to a cross-sectional study published in JAMA Ophthalmology.1
Large language models (LLMs) are a form of artificial intelligence (AI) that use deep learning on massive data sets to understand and generate natural language.2
The researchers used a textbook that serves as preparation material for the Fellowship of the Royal College of Ophthalmologists part 2 written multiple-choice examination. The textbook provided 360 questions for the FMs to answer, of which 13 were multimodal and 345 were textual. A consultant created 27 additional multimodal imaging questions, for a total of 40 image-based questions.
A total of 7 FMs were trialed, none of which received customization, fine-tuning, or additional guidance. The questions were entered into the FMs between September 2024 and March 2025, and the multimodal questions were input as a slide deck containing all 40 questions. The 40 images were also evaluated by 10 physicians with varying levels of experience in ophthalmology.
The Claude 3.5 Sonnet model had the best performance on the textual questions, with an accuracy of 77.7%. The other models were less accurate overall: GPT-4o had an accuracy of 69.9%, Qwen2.5-Max 69.3%, DeepSeek V3 63.2%, and Gemini Advanced 62.6%. Answer agreement among the models was highest between GPT-4o and Claude 3.5 Sonnet (Cohen κ, 0.70).
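As context for the agreement statistic reported above, Cohen κ measures how often 2 answer sets match after accounting for the agreement expected by chance alone. The sketch below is only an illustration, using hypothetical answer lists rather than the study's data, of how agreement between 2 models' multiple-choice answers could be computed.

```python
# Minimal sketch (not the study's analysis): Cohen kappa for agreement
# between 2 models' multiple-choice answers. Answer lists are hypothetical.
from sklearn.metrics import cohen_kappa_score

gpt4o_answers = ["A", "C", "B", "D", "A", "B", "C", "A"]    # hypothetical
claude_answers = ["A", "C", "B", "A", "A", "B", "C", "D"]   # hypothetical

# kappa = (observed agreement - chance agreement) / (1 - chance agreement)
kappa = cohen_kappa_score(gpt4o_answers, claude_answers)
print(f"Cohen kappa: {kappa:.2f}")
```

A κ of 0.70, as reported for GPT-4o and Claude 3.5 Sonnet, is generally interpreted as substantial agreement beyond chance.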
Claude 3.5 Sonnet performed similarly to the ophthalmology experts (difference, 1.3%; 95% CI, –5.1% to 7.4%). Trainees (difference, 9.0%; 95% CI, 2.4%-15.6%) and unspecialized junior physicians (difference, 35.2%; 95% CI, 28.3%-41.9%) performed worse than Claude 3.5 Sonnet. The model also scored higher than both the mean candidate score of 66.4% and the mean official pass mark of 61.2%.
GPT-4o had the highest accuracy among the FMs on the multimodal questions, at 57.5%, followed by Claude 3.5 Sonnet at 47.5%. Ophthalmology experts had a mean score of 75.7%, compared with 42% for the FMs and 71.3% for the trainees. Among the FMs, GPT-4o and Claude 3.5 Sonnet showed the highest agreement with the physicians.
This study had some limitations. The correlation between aptitude, examination performance, and clinical reasoning is unclear; the efficacy of the prompt engineering was not evaluated; and multimodal performance could not be stratified because there were too few questions.
“Results of this cross-sectional study show that while FMs exhibit comparable ophthalmological knowledge and reasoning skills with expert ophthalmologists, their ability to incorporate and process multimodal data remains limited,” the authors concluded.
References
1. Rocha H, Chong YJ, Thirunavukarasu AJ, et al. Performance of foundation models vs physicians in textual and multimodal ophthalmological questions. JAMA Ophthalmol. Published online November 13, 2025. doi:10.1001/jamaophthalmol.2025.4255
2. Stryker C. What are large language models (LLMs)? IBM. Accessed November 18, 2025.