Peer-reviewed veterinary case report
Observer-Performance Comparison of ChatGPT-5 and Gemini 2.5 Pro Versus Veterinarians in Canine and Feline Fundus Interpretation: A Multi-Reader, Multi-Case Study.
- Journal: Veterinary Ophthalmology
- Year: 2026
- Authors: Kibar, Büşra et al.
- Affiliation: Faculty of Veterinary Medicine
Abstract
OBJECTIVE: To compare two large language models (LLMs; ChatGPT-5 and Gemini 2.5 Pro) with experienced and novice veterinarians on canine and feline fundus cases, and to assess the relationship between perceived case difficulty and diagnostic performance.

ANIMALS STUDIED: Forty-three client-owned cases were sampled from 200 ophthalmology records.

PROCEDURE(S): Each case included signalment, history, and fundus photographs. Two experienced veterinarians, two novice veterinarians, and two LLMs independently selected findings and chose a diagnosis from the listed options. Participants also rated the difficulty of each case (Very Easy to Hard). Group differences were tested with Kruskal-Wallis and Dunn-Bonferroni procedures; associations with difficulty were assessed with Spearman's ρ; paired proportions were compared with Cochran's Q followed by Holm-adjusted McNemar tests.

RESULTS: Experts achieved the highest accuracies (findings: 73.3% and 61.6%; diagnosis: 86.0% and 66.3%), significantly outperforming LLMs and novices (all adjusted p < 0.05). LLM finding accuracies were 52.0% (ChatGPT-5) and 49.3% (Gemini 2.5 Pro), both above novices (28.3% and 26.9%). LLM diagnosis accuracies were lower (ChatGPT-5: 37.2%; Gemini 2.5 Pro: 37.2%) but still numerically higher than novices (23.1% and 22.5%). Expert accuracy declined with increasing case difficulty, whereas LLM performance was comparatively stable (ChatGPT-5 range 2.37-3.86; Gemini 2.5 Pro 2.00-2.95). Difficulty correlated negatively with Expert 2 totals (ρ = -0.70, p < 0.0001) but not with LLM totals (|ρ| ≤ 0.17, p ≥ 0.28).

CONCLUSIONS: Experienced veterinarians are the most accurate in fundus interpretation, but their performance declines with increasing difficulty. LLMs, though less accurate, remain stable across cases and outperform novices, indicating value as training or decision-support tools. Future studies should assess whether expert-LLM collaboration enhances accuracy and efficiency.
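The pairwise step named in the Procedure (McNemar tests with Holm adjustment) can be illustrated with a minimal standard-library sketch. It is not the authors' analysis code: the function names `mcnemar_exact` and `holm_adjust`, the use of the exact binomial form of McNemar's test, and the ten-case correct/incorrect vectors are all assumptions made for illustration, and the toy vectors are invented rather than taken from the study. The Cochran's Q omnibus test that precedes these comparisons is omitted.

```python
from math import comb


def mcnemar_exact(reader_a, reader_b):
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes."""
    # Only discordant cases (one reader right, the other wrong) are informative.
    a_only = sum(1 for a, b in zip(reader_a, reader_b) if a and not b)
    b_only = sum(1 for a, b in zip(reader_a, reader_b) if b and not a)
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    # Under the null hypothesis, discordant cases split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical per-case outcomes (1 = correct diagnosis), NOT study data.
expert = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
llm = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]
novice = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]

pairs = [("expert vs LLM", expert, llm),
         ("expert vs novice", expert, novice),
         ("LLM vs novice", llm, novice)]
raw_p = [mcnemar_exact(a, b) for _, a, b in pairs]
adj_p = holm_adjust(raw_p)

for (label, _, _), p, p_adj in zip(pairs, raw_p, adj_p):
    print(f"{label}: raw p = {p:.4f}, Holm-adjusted p = {p_adj:.4f}")
```

The Holm procedure is applied across the whole family of pairwise comparisons, so a comparison that is significant on its raw p-value may no longer be after adjustment.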
Original publication: https://pubmed.ncbi.nlm.nih.gov/41485127/