ChatGPT versions 3 and 4 scored only 65% and 62%, respectively, on the American College of Gastroenterology (ACG) Self-Assessment Test. The minimum passing grade is 70%.
“You might expect a physician to score 99%, or at least 95%,” lead author Arvind J. Trindade, MD, regional director of endoscopy at Northwell Health (Central Region) in New Hyde Park, New York, said in an interview.
The study was published online in the American Journal of Gastroenterology.
Dr. Trindade and colleagues undertook the study amid increasing reports of students using the tool across many academic areas, including law and medicine, and growing interest in the chatbot’s potential in medical education.
“I saw gastroenterology students typing questions into it. I wanted to know how accurate it was in gastroenterology – if it was going to be used in medical education and patient care,” said Dr. Trindade, who is also an associate professor at Feinstein Institutes for Medical Research in Manhasset, New York. “Based on our research, ChatGPT should not be used for medical education in gastroenterology at this time, and it has a way to go before it should be implemented into the health care field.”
The researchers tested the two versions of ChatGPT on both the 2021 and 2022 editions of the online ACG Self-Assessment Test, a multiple-choice exam designed to gauge how well a trainee would do on the American Board of Internal Medicine Gastroenterology board examination.
Questions that involved image selection were excluded from the study. For those that remained, the questions and answer choices were copied and pasted directly into ChatGPT, which returned answers and explanations. The corresponding answer was selected on the ACG website based on the chatbot’s response.
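Purely as an illustration of the study’s scoring logic, the Python sketch below shows how such a multiple-choice evaluation reduces to a simple accuracy count against the 70% passing bar. The `ask_model` helper is hypothetical; in the study itself, that step was the manual copy-and-paste described above.

```python
# Illustrative sketch of multiple-choice grading; not the authors' code.
PASSING_THRESHOLD = 0.70  # ACG Self-Assessment minimum passing grade

def ask_model(stem: str, choices: dict[str, str]) -> str:
    """Hypothetical stand-in for pasting a question into the chatbot
    and reading off its chosen answer; here it simply guesses 'A'."""
    return "A"

def grade(questions: list[dict]) -> float:
    """Return the fraction of questions answered correctly."""
    correct = sum(
        ask_model(q["stem"], q["choices"]) == q["answer"] for q in questions
    )
    return correct / len(questions)

# Toy example with a single invented question.
sample = [
    {"stem": "Example question?", "choices": {"A": "...", "B": "..."}, "answer": "B"},
]
score = grade(sample)
print(f"score: {score:.0%}, passing: {score >= PASSING_THRESHOLD}")
```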
Of the 455 questions posed, ChatGPT-3 correctly answered 296 (65%) and ChatGPT-4 answered 284 (62%). There was no discernible pattern in the types of questions the chatbots missed, but incorrectly answered questions included those on surveillance timing for various disease states, diagnosis, and pharmaceutical regimens.
The reasons for the tool’s poor performance could lie with the large language model underpinning ChatGPT, the researchers write. The model was trained on freely available information – not specifically on medical literature and not on materials that require paid journal subscriptions – to be a general-purpose interactive program.
Additionally, the chatbot may draw on information from a variety of sources, including non- or quasi-medical sources or out-of-date sources, which can lead to errors, they note. ChatGPT-3’s training data was last updated in June 2021 and ChatGPT-4’s in September 2021.
“ChatGPT does not have an intrinsic understanding of an issue,” Dr. Trindade said. “Its basic function is to predict the next word in a string of text to produce an expected response, regardless of whether such a response is factually correct or not.”
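That next-word mechanism is easy to caricature. The toy sketch below, a bigram model far simpler than ChatGPT’s underlying transformer and built on an invented miniature corpus, shows how purely statistical continuation produces fluent-looking text with no check on whether the result is factually correct.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word follows which in a tiny
# invented corpus, then always emit the most frequent continuation.
corpus = "the colon is examined the colon is resected the liver is examined".split()

following = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    following[word][nxt] += 1

def continue_text(prompt_word: str, length: int = 4) -> str:
    out = [prompt_word]
    for _ in range(length):
        options = following.get(out[-1])
        if not options:
            break
        # Pick the statistically most likely next word; truth never enters in.
        out.append(options.most_common(1)[0][0])
    return " ".join(out)

print(continue_text("the"))  # e.g. "the colon is examined the"
```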
In a previous study, ChatGPT was able to pass parts of the U.S. Medical Licensing Examination (USMLE).
The chatbot may have performed better on the USMLE because the information tested on the exam may have been more widely available for ChatGPT’s language training, Dr. Trindade said. “In addition, the threshold for passing [the USMLE] is lower with regard to the percentage of questions correctly answered,” he said.
ChatGPT seems to fare better at helping to inform patients than it does on medical exams. The chatbot provided generally satisfactory answers to common patient queries about colonoscopy in one study and about hepatocellular carcinoma and liver cirrhosis in another study.
For ChatGPT to be valuable in medical education, “future versions would need to be updated with medical resources such as journal articles, society guidelines, and medical databases, such as UpToDate,” Dr. Trindade said. “With directed medical training in gastroenterology, it may be a future tool for education or patient use in this field, but not currently as it is now. Before it can be used in gastroenterology, it should be validated.”
That said, he noted, medical education has evolved from being based on textbooks and print journals to include Internet-based journal data and practice guidelines on specialty websites. If properly primed, resources such as ChatGPT may be the next logical step.
This study received no funding. Dr. Trindade is a consultant for Pentax Medical, Boston Scientific, Lucid Diagnostics, and Exact Sciences and receives research support from Lucid Diagnostics.
A version of this article first appeared on Medscape.com.