Peer-reviewed veterinary study

How do AI models compare to vets in eye exams for dogs and cats?

By Kibar, Büşra et al. · Published in Veterinary Ophthalmology · 2026 · Faculty of Veterinary Medicine

PetCaseFinder translated the abstract of this peer-reviewed paper into plain English so pet owners can read it. We do not publish original research; every detail traces back to the citation above.

Original publication title: Observer-Performance Comparison of ChatGPT-5 and Gemini 2.5 Pro Versus Veterinarians in Canine and Feline Fundus Interpretation: A Multi-Reader, Multi-Case Study.

Plain-English summary

This study compared how well two AI programs, ChatGPT-5 and Gemini 2.5 Pro, could interpret images of the back of the eye (the fundus) in dogs and cats against two experienced and two novice veterinarians. The researchers used 43 cases drawn from 200 ophthalmology records; each case included the animal's basic details, its history, and fundus photographs. The experienced veterinarians were the most accurate, reaching the correct diagnosis in 86.0% and 66.3% of cases, but their accuracy fell as cases became harder. The AI programs identified findings more accurately than the novices (about 49-52% versus 27-28%) but reached the correct diagnosis in only 37.2% of cases, well below the experts and only numerically above the novices (about 23%). Unlike the experts, the AI programs performed about the same regardless of case difficulty. The authors conclude that, although less accurate than experts, these programs could be useful for training or as decision-support tools for veterinarians.

Abstract

OBJECTIVE: To compare two large language models (ChatGPT-5, Gemini 2.5 Pro) with experienced and novice veterinarians on canine and feline fundus cases, and to assess the relationship between perceived case difficulty and diagnostic performance.

ANIMALS STUDIED: Forty-three client-owned cases were sampled from 200 ophthalmology records.

PROCEDURE(S): Each case included signalment, history, and fundus photographs. Two experienced veterinarians, two novice veterinarians, and two LLMs independently selected findings and provided diagnosis from options. Participants rated difficulty (Very Easy-Hard). Group differences were tested with Kruskal-Wallis and Dunn-Bonferroni procedures; associations with difficulty used Spearman's ρ; paired proportions used Cochran's Q with Holm-adjusted McNemar tests.

RESULTS: Experts achieved the highest accuracies (findings: 73.3% and 61.6%; diagnosis: 86.0% and 66.3%), significantly outperforming LLMs and novices (all adjusted p < 0.05). LLM finding accuracies were 52.0% (ChatGPT-5) and 49.3% (Gemini 2.5 Pro), both above novices (28.3% and 26.9%). LLM diagnosis accuracies were lower (ChatGPT-5: 37.2%, Gemini 2.5 Pro: 37.2%) but still numerically higher than novices (23.1% and 22.5%). Expert accuracy declined with increasing case difficulty, whereas LLM performance was comparatively stable (ChatGPT-5 range 2.37-3.86; Gemini 2.5 Pro 2.00-2.95). Difficulty correlated negatively with Expert 2 totals (ρ = -0.70, p < 0.0001) but not with LLMs (|ρ| ≤ 0.17, p ≥ 0.28).

CONCLUSIONS: Experienced veterinarians are most accurate in fundus interpretation, but their performance declines with increasing difficulty. LLMs, though less accurate, remain stable across cases and outperform novices, indicating value as training or decision-support tools. Future studies should assess whether expert-LLM collaboration enhances accuracy and efficiency.
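For readers curious about the "Holm-adjusted McNemar tests" named in the abstract, the sketch below shows one common way such a comparison can be computed. It is an illustration only, not the authors' code: the function names `mcnemar_exact` and `holm_adjust` are made up for this example, the exact (binomial) form of McNemar's test is an assumption because the abstract does not say which variant was used, and the counts are hypothetical rather than taken from the study.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b = cases reader A got right and reader B got wrong; c = the reverse.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant cases, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_adjust(pvalues: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Adjusted values may never decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical discordant counts (b, c) for three reader pairs; NOT study data.
pairs = {
    "expert vs LLM": (14, 3),
    "expert vs novice": (20, 1),
    "LLM vs novice": (9, 5),
}
raw = [mcnemar_exact(b, c) for b, c in pairs.values()]
for name, p_raw, p_adj in zip(pairs, raw, holm_adjust(raw)):
    print(f"{name}: raw p = {p_raw:.4f}, Holm-adjusted p = {p_adj:.4f}")
```

McNemar's test looks only at the cases where two readers disagree, which suits a design like this one where every reader saw the same 43 cases. The Holm step then makes each pairwise comparison stricter to account for running several comparisons at once.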

Original publication on PubMed: https://pubmed.ncbi.nlm.nih.gov/41485127/