Year 2026 / Volume 118 / Number 2
Original
Evaluation of the clinical reasoning of GPT-4o, a multimodal generative artificial intelligence model, in 18 public gastroenterology case studies

Pages 95-100

DOI: 10.17235/reed.2025.11369/2025

Alejandro García-Rudolph, Elena Hernández-Pena, Nuria del Cacho, Claudia Teixido-Font, Marc Navarro-Berenguel, Eloy Opisso

Abstract
Introduction and aim: although generative language models have been extensively studied in the field of digestive diseases, further progress requires addressing underexplored aspects such as linguistic bias, the evaluation of the clinical reasoning underlying model responses, and the use of realistic clinical material in non-English-speaking contexts. The aim of this study was to evaluate the accuracy of GPT-4o in answering clinical questions in Spanish and to qualitatively analyze its errors. Methods: we used the most recent official board examination for specialists in Gastroenterology (Spain, 2023), focusing on its practical section, which comprises 18 real clinical cases described through text and images, totaling 50 multiple-choice questions (200 answer options in total). Forty-nine valid questions were analyzed, after excluding one withdrawn by the organizing committee. Results: GPT-4o answered 39 of the 49 questions correctly (79.6 %). No significant differences were observed between questions with clinical images (22/29 correct) and those without images (17/20 correct). For the ten incorrect answers (20.4 %), the model was prompted to provide its reasoning, which was then qualitatively analyzed by a team of experts. Errors were associated with inappropriate therapeutic generalizations, confusion regarding diagnostic or therapeutic sequencing, poor integration of contextual information, unawareness of contraindications, and omission of key temporal criteria in clinical decision-making. Conclusions: clinical images did not increase the error rate; however, the observed failures revealed that the model tends to omit information already provided (such as clinical context or temporal criteria), thereby compromising the quality of its reasoning.
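Note: the abstract gives the raw counts but does not state which statistical test supports the "no significant differences" claim. As a minimal illustrative check only (the choice of Fisher's exact test and the Python sketch below are our assumptions, not the authors' stated method), the reported 2×2 counts are compatible with that conclusion:

# Illustrative only: the authors' test is not specified in the abstract;
# Fisher's exact test is a reasonable choice for counts this small.
from scipy.stats import fisher_exact

# 2x2 contingency table built from the reported counts:
#                  correct  incorrect
# with images        22         7
# without images     17         3
table = [[22, 7], [17, 3]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.2f}")
# p comes out well above 0.05, consistent with the reported absence of a
# significant difference between image-based and text-only questions.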
Lay Summary
Artificial intelligence systems that generate text, such as large language models, are being increasingly used in healthcare. They can help explain diseases or answer clinical questions, but their reliability, especially in languages other than English, is still uncertain. This study evaluated how well one of the most advanced models, GPT-4o, performed on a real medical specialty exam in Digestive Diseases held in Spain in 2023. The exam included clinical cases based on real patients, described using both written information and diagnostic images, followed by multiple-choice questions similar to those faced by doctors in training. GPT-4o answered 80 % of the questions correctly. There were no significant differences between questions with or without medical images. However, when analyzing the incorrect answers, the research team identified several types of errors that could be clinically relevant if the model were used without medical supervision. These included confusing diagnostic or treatment steps, failing to consider important clinical details, overlooking contraindications, or ignoring time-sensitive criteria. The results suggest that, although these models may be helpful for medical learning or educational purposes, they are not yet reliable enough for use in clinical decision-making without professional oversight. The study also highlights the need to test these tools in different languages and using realistic clinical materials to better understand their limitations and safety risks in diverse healthcare settings.
Citation tools
García-Rudolph A, Hernández-Pena E, del Cacho N, Teixido-Font C, Navarro-Berenguel M, Opisso E. Evaluation of the clinical reasoning of GPT-4o, a multimodal generative artificial intelligence model, in 18 public gastroenterology case studies. Rev Esp Enferm Dig 2026;118(2):95-100. DOI: 10.17235/reed.2025.11369/2025


Publication history

Received: 28/05/2025

Accepted: 22/07/2025

Online First: 29/09/2025

Published: 09/02/2026

Article Online First time: 124 days

Article editing time: 257 days

