Year 2026 / Volume 118 / Number 2
Original
Evaluation of the clinical reasoning of GPT-4o, a multimodal generative artificial intelligence model, in 18 public gastroenterology case studies

Pages 95-100

DOI: 10.17235/reed.2025.11369/2025

Alejandro García-Rudolph, Elena Hernández-Pena, Nuria del Cacho, Claudia Teixido-Font, Marc Navarro-Berenguel, Eloy Opisso

Abstract
Introduction and aim: although generative language models have been extensively studied in the field of digestive diseases, further progress requires addressing underexplored aspects such as linguistic bias, the evaluation of the clinical reasoning underlying model responses, and the use of realistic clinical material in non-English-speaking contexts. The aim of this study was to evaluate the accuracy of GPT-4o in answering clinical questions in Spanish and to qualitatively analyze its errors. Methods: we used the most recent official board examination for specialists in Gastroenterology (Spain, 2023), focusing on its practical section, which comprises 18 real clinical cases described through text and images, totaling 50 multiple-choice questions (200 answer options in total). Forty-nine valid questions were analyzed, after excluding one withdrawn by the organizing committee. Results: GPT-4o answered 39 of the 49 questions correctly (79.6 %). No significant differences were observed between questions with clinical images (22/29 correct) and those without images (17/20 correct). For the ten incorrect answers (20.4 %), the model was prompted to provide its reasoning, which was then qualitatively analyzed by a team of experts. Errors were associated with inappropriate therapeutic generalizations, confusion regarding diagnostic or therapeutic sequencing, poor integration of contextual information, unawareness of contraindications, and omission of key temporal criteria in clinical decision-making. Conclusions: clinical images did not increase the error rate; however, the observed failures revealed that the model tends to omit information already provided (such as clinical context or temporal criteria), thereby compromising the quality of its reasoning.
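Note: the abstract gives the raw counts but does not state which statistical test supports the "no significant differences" claim. As a minimal illustrative check only (the choice of Fisher's exact test and the Python sketch below are our assumptions, not the authors' stated method), the reported 2×2 counts are compatible with that conclusion:

# Illustrative only: the authors' test is not specified in the abstract;
# Fisher's exact test is a reasonable choice for counts this small.
from scipy.stats import fisher_exact

# 2x2 contingency table built from the reported counts:
#                  correct  incorrect
# with images        22         7
# without images     17         3
table = [[22, 7], [17, 3]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.2f}")
# p comes out well above 0.05, consistent with the reported absence of a
# significant difference between image-based and text-only questions.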
Lay Summary
Artificial intelligence systems that generate text, such as large language models, are being increasingly used in healthcare. They can help explain diseases or answer clinical questions, but their reliability, especially in languages other than English, is still uncertain. This study evaluated how well one of the most advanced models, GPT-4o, performed on a real medical specialty exam in Digestive Diseases held in Spain in 2023. The exam included clinical cases based on real patients, described using both written information and diagnostic images, followed by multiple-choice questions similar to those faced by doctors in training. GPT-4o answered 80 % of the questions correctly. There were no significant differences between questions with or without medical images. However, when analyzing the incorrect answers, the research team identified several types of errors that could be clinically relevant if the model were used without medical supervision. These included confusing diagnostic or treatment steps, failing to consider important clinical details, overlooking contraindications, or ignoring time-sensitive criteria. The results suggest that, although these models may be helpful for medical learning or educational purposes, they are not yet reliable enough for use in clinical decision-making without professional oversight. The study also highlights the need to test these tools in different languages and using realistic clinical materials to better understand their limitations and safety risks in diverse healthcare settings.
Citation tools
García-Rudolph A, Hernández-Pena E, del Cacho N, Teixido-Font C, Navarro-Berenguel M, Opisso E. Evaluation of the clinical reasoning of GPT-4o, a multimodal generative artificial intelligence model, in 18 public gastroenterology case studies. Rev Esp Enferm Dig 2026;118(2):95-100. DOI: 10.17235/reed.2025.11369/2025


Publication history

Received: 28/05/2025

Accepted: 22/07/2025

Online First: 29/09/2025

Published: 09/02/2026

Article Online First time: 124 days

Article editing time: 257 days

