AI chatbots can provide accurate general information about cancer, but their treatment recommendations often fall short of clinical guidelines, two new studies suggest.
AI chatbots, such as ChatGPT (OpenAI), are becoming go-to sources for health information. However, no studies have rigorously evaluated the quality of their medical advice, especially for cancer.
Two new studies published in JAMA Oncology did just that.
One, which looked at common cancer-related Google searches, found that AI chatbots generally provide accurate information to consumers, but the information’s usefulness may be limited by its complexity.
The other, which assessed cancer treatment recommendations, found that AI chatbots overall missed the mark on providing recommendations for breast, prostate, and lung cancers in line with national treatment guidelines.
The medical world is becoming “enamored with our newest potential helper, large language models (LLMs) and in particular chatbots, such as ChatGPT,” Atul Butte, MD, PhD, who heads the Bakar Computational Health Sciences Institute, University of California, San Francisco, wrote in an editorial accompanying the studies. “But maybe our core belief in GPT technology as a clinical partner has not sufficiently been earned yet.”
The first study by Alexander Pan of the State University of New York, Brooklyn, and colleagues analyzed the quality of responses to the top five most searched questions on skin, lung, breast, colorectal, and prostate cancer provided by four AI chatbots: ChatGPT-3.5, Perplexity (Perplexity.AI), Chatsonic (Writesonic), and Bing AI (Microsoft).
Questions included "What is skin cancer?" and "What are symptoms of prostate, lung, or breast cancer?" The team rated the responses for quality, clarity, actionability, misinformation, and readability.
The researchers found that the four chatbots generated “high-quality” responses about the five cancers and did not appear to spread misinformation. Three of the four chatbots cited reputable sources, such as the American Cancer Society, Mayo Clinic, and Centers for Disease Control and Prevention, which is “reassuring,” the researchers said.
However, the team also found that the usefulness of the information was “limited” because responses were often written at a college reading level. Another limitation: AI chatbots provided concise answers with no visual aids, which may not be sufficient to explain more complex ideas to consumers.
“These limitations suggest that AI chatbots should be used [supplementally] and not as a primary source for medical information,” the authors said, adding that the chatbots “typically acknowledged their limitations in providing individualized advice and encouraged users to seek medical attention.”
A related study in the journal assessed whether AI chatbots can generate appropriate cancer treatment recommendations.
In this analysis, Shan Chen, MS, with the AI in Medicine Program, Mass General Brigham, Harvard Medical School, Boston, and colleagues benchmarked cancer treatment recommendations made by ChatGPT-3.5 against 2021 National Comprehensive Cancer Network guidelines.
The team created 104 prompts designed to elicit basic treatment strategies for various types of cancer, including breast, prostate, and lung cancer. Questions included “What is the treatment for stage I breast cancer?” Several oncologists then assessed the level of concordance between the chatbot responses and NCCN guidelines.
In 62% of the prompts and answers, all the recommended treatments aligned with the oncologists’ views.
The chatbot provided at least one guideline-concordant treatment for 98% of prompts. However, for 34% of prompts, the chatbot also recommended at least one nonconcordant treatment.
And about 13% of recommended treatments were “hallucinated,” that is, not part of any recommended treatment. Hallucinations were primarily recommendations for localized treatment of advanced disease, targeted therapy, or immunotherapy.
Based on the findings, the team recommended that clinicians advise patients that AI chatbots are not a reliable source of cancer treatment information.
“The chatbot did not perform well at providing accurate cancer treatment recommendations,” the authors said. “The chatbot was most likely to mix in incorrect recommendations among correct ones, an error difficult even for experts to detect.”
In his editorial, Dr. Butte highlighted several caveats, including that the teams evaluated “off the shelf” chatbots, which likely had no specific medical training, and that the prompts designed in both studies were very basic, which may have limited their specificity or actionability. Newer LLMs with specific health care training are being released, he explained.
Despite the mixed study findings, Dr. Butte remains optimistic about the future of AI in medicine.
“Today, the reality is that the highest-quality care is concentrated within a few premier medical systems like the NCI Comprehensive Cancer Centers, accessible only to a small fraction of the global population,” Dr. Butte explained. “However, AI has the potential to change this.”
How can we make this happen?
AI algorithms would need to be trained with “data from the best medical systems globally” and “the latest guidelines from NCCN and elsewhere.” Digital health platforms powered by AI could then be designed to provide resources and advice to patients around the globe, Dr. Butte said.
Although “these algorithms will need to be carefully monitored as they are brought into health systems,” Dr. Butte said, it does not change their potential to “improve care for both the haves and have-nots of health care.”
The study by Mr. Pan and colleagues had no specific funding; one author, Stacy Loeb, MD, MSc, PhD, reported a disclosure; no other disclosures were reported. The study by Shan Chen and colleagues was supported by the Woods Foundation; several authors reported disclosures outside the submitted work. Dr. Butte disclosed relationships with several pharmaceutical companies.