
Foundation Model Comparable to Expert Ophthalmologists for Textual Questions
Key Takeaways
- Foundation models (FMs) performed comparably to ophthalmology experts on textual questions, with Claude 3.5 Sonnet achieving 77.7% accuracy.
- Multimodal imaging questions revealed limitations in FMs, with GPT-4o achieving the highest accuracy at 57.5%, still below expert performance.
- FMs showed improved ophthalmological knowledge compared with older large language models and ophthalmology trainees.
The use of foundation models (FMs) for medical assistance was comparable to the performance of expert ophthalmologists on textual questions, according to a cross-sectional study published in JAMA Ophthalmology.1
Large language models (LLMs) are a form of artificial intelligence (AI) that use deep learning on massive data sets to understand and generate natural language.2
The researchers used a textbook that serves as preparation material for the Fellowship of the Royal College of Ophthalmologists part 2 written multiple-choice examination. The textbook provided 360 questions for the FMs to answer, of which 13 were multimodal and 345 were textual. A consultant created 27 additional multimodal imaging questions, for a total of 40 image-based questions.
A total of 7 FMs were trialed, none of which received customization, fine-tuning, or additional guidance. The questions were entered into the FMs between September 2024 and March 2025, and the multimodal questions were input as a slide deck containing all 40 questions. The 40 images were also evaluated by 10 physicians with varying levels of experience in ophthalmology.
The Claude 3.5 Sonnet model had the best performance on the textual questions, with an accuracy of 77.7%. The other models were less accurate overall: GPT-4o had an accuracy of 69.9%, Qwen2.5-Max 69.3%, DeepSeek V3 63.2%, and Gemini Advanced 62.6%. Answer agreement among the models was highest between GPT-4o and Claude 3.5 Sonnet (Cohen κ, 0.70).
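As context for the agreement statistic reported above, Cohen κ measures how often 2 answer sets match after accounting for the agreement expected by chance alone. The sketch below is only an illustration, using hypothetical answer lists rather than the study's data, of how agreement between 2 models' multiple-choice answers could be computed.

```python
# Minimal sketch (not the study's analysis): Cohen kappa for agreement
# between 2 models' multiple-choice answers. Answer lists are hypothetical.
from sklearn.metrics import cohen_kappa_score

gpt4o_answers = ["A", "C", "B", "D", "A", "B", "C", "A"]    # hypothetical
claude_answers = ["A", "C", "B", "A", "A", "B", "C", "D"]   # hypothetical

# kappa = (observed agreement - chance agreement) / (1 - chance agreement)
kappa = cohen_kappa_score(gpt4o_answers, claude_answers)
print(f"Cohen kappa: {kappa:.2f}")
```

A κ of 0.70, as reported for GPT-4o and Claude 3.5 Sonnet, is generally interpreted as substantial agreement beyond chance.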
Claude 3.5 Sonnet performed similarly to the ophthalmology experts (difference, 1.3%; 95% CI, –5.1% to 7.4%). Trainees (difference, 9.0%; 95% CI, 2.4%-15.6%) and unspecialized junior physicians (difference, 35.2%; 95% CI, 28.3%-41.9%) performed worse than Claude 3.5 Sonnet. The model also scored higher than both the mean candidate score of 66.4% and the mean official pass mark of 61.2%.
GPT-4o had the highest accuracy among the FMs on the multimodal questions, at 57.5%, followed by Claude 3.5 Sonnet at 47.5%. Ophthalmology experts had a mean score of 75.7%, compared with 42% for the FMs and 71.3% for the trainees. Among the FMs, GPT-4o and Claude 3.5 Sonnet showed the highest agreement with the physicians.
This study had some limitations. The correlation between aptitude, examination performance, and clinical reasoning is unclear; the efficacy of the prompt engineering was not evaluated; and multimodal performance could not be stratified because there were too few questions.
“Results of this cross-sectional study show that while FMs exhibit comparable ophthalmological knowledge and reasoning skills with expert ophthalmologists, their ability to incorporate and process multimodal data remains limited,” the authors concluded.
References
1. Rocha H, Chong YJ, Thirunavukarasu AJ, et al. Performance of foundation models vs physicians in textual and multimodal ophthalmological questions. JAMA Ophthalmol. Published online November 13, 2025. doi:10.1001/jamaophthalmol.2025.4255
2. Stryker C. What are large language models (LLMs)? IBM. Accessed November 18, 2025.