AI Scribes or VHA Docs: Which Created Better Clinical Notes?
Artificial intelligence (AI) scribes produced lower-quality clinical notes than human clinicians and especially struggled in settings with background noise or clinicians wearing masks, a new Veterans Health Administration (VHA) study finds.
In 5 simulated clinical cases, notes written by various AI programs scored lower than notes produced by humans on the modified Physician Documentation Quality Instrument (PDQI-9), a scale that measures note quality, reported Ashok Reddy, MD, MSc, of the University of Washington and Veterans Affairs Puget Sound Health Care System, Seattle, et al in the April issue of Annals of Internal Medicine.
AI scribes scored lower compared with humans across all domains, including accuracy, thoroughness, and usefulness. There was an especially large gap in scores on the 50-point PDQI-9 in an acute low back pain case (human, 43.8 points; AI, 20.3 points; difference, 23.5 points).
“For clinicians, AI scribes should be regarded as tools for generating draft documentation that requires review and editing, rather than as a substitute for clinician-authored notes,” the authors wrote. “Although ambient AI scribes hold promise for reducing clinician burden, rigorous and ongoing evaluation of their quality is essential to ensure that these tools enhance rather than compromise the quality of clinical care.”
AI Scribe Use Is Widespread
Taylor N. Anderson, MD, a clinical informatics fellow at Oregon Health & Science University, Portland, is familiar with the study findings and noted that the use of AI scribes in medicine has grown rapidly. All major health organizations are either using them or facing “enormous pressure” from clinicians to do so, she told Federal Practitioner.
Previous research has linked the use of AI scribes for clinical notes to less electronic health record usage and documentation time for clinicians, leading to more time for patient visits. Still, the quality of clinical notes written by AI is “quite variable across vendors,” Anderson said.
Anderson led a 2025 study that examined 5 AI scribe platforms and found an average of 3.0 errors per case with “potential for moderate-to-severe harm.”
For the new study on the simulated cases, part of a VHA-sponsored “technology sprint” via Challenge.gov, researchers developed audio descriptions of 5 clinical cases reflecting common patient encounters in primary care: acute low back pain, chest pain, a new diagnosis of diabetes, a pharmacy consultation, and a follow-up with a nurse case manager for heart failure.
Two cases included non-English accents, 1 included background noise, and 1 featured speech through a medical mask. All the “patients” were played by what the authors described as “trained standardized patient actors.”
For each case, 3 humans and 11 AI scribe programs produced clinical notes. The clinical notes were then evaluated by 6 raters.
Researchers found that AI scribe-generated notes scored worse than human-generated notes across all 10 domains of the modified PDQI-9 (accuracy, thoroughness, usefulness, organization, comprehensiveness, succinctness, synthesization, internal consistency, freedom from hallucination, and freedom from bias).
There were especially large gaps between the AI and human notes in the domains of thoroughness, organization, and usefulness. Even wider gaps were observed for the encounters with noise and mask usage.
“These findings highlight that although ambient AI scribes can generate complete notes, the overall quality remains broadly below that of human-authored documentation,” the authors wrote.
No Comparison Between AI Scribes
The researchers noted that “given contractual limitations, we cannot interpret the results for specific vendors.” They also noted that the study did not use professional scribes, who may produce even higher-quality results, and the humans were not producing notes in a real-world clinical environment.
Anderson, the clinical informatics fellow, pointed out that the study does not examine the common scenario in which a clinician edits notes produced by an AI scribe. In fact, she said, no current research examines “the postediting note that would actually go into the chart.”
In an accompanying commentary, collaborative scientist Aaron Tierney, PhD, and Kristine Lee, MD, an associate executive director, both with the Permanente Medical Group, California, called for future research to focus on “real-world performance, promote the development of documentation policies that prioritize patient care over billing requirements, and systematically incorporate patient perspectives into assessments of quality.”
Why AI Misses the Mark
In an interview with Federal Practitioner, AI researcher Maxim Topaz, PhD, RN, MA, an associate professor of Nursing and Data Science at Columbia University School of Nursing, New York City, who is familiar with the study but did not participate in it, praised the research.
He pointed out that AI scribes have trouble accurately representing clinical encounters because they “tend to fill gaps with plausible-sounding language, which can mask omissions and make errors harder to catch.” Also, “ambient scribes can only document what is verbalized aloud. Physical exam findings the clinician notices but does not narrate, nonverbal cues, and patient-initiated concerns that drift past in conversation are systematically underrepresented.”
Moving forward, Topaz advised clinicians to “treat AI-generated notes as a first draft, not a finished product. Read them carefully, especially for omissions, which the current evidence suggests are by far the most common error type and which are harder to spot than fabrications because the surrounding note still reads coherently.”
Two study authors disclosed employment by the US Department of Veterans Affairs. Other authors had no disclosures. The commentary authors have no disclosures. Anderson has no disclosures. Topaz discloses