LLMs Show High Accuracy in Extracting CRC Data From VA Health Records

November 10, 2025|AVAHO

TOPLINE: Large language models (LLMs) achieve more than 95% accuracy in extracting colorectal cancer and dysplasia diagnoses from Veterans Health Administration (VHA) pathology reports, including reports for patients with Million Veteran Program (MVP) genomic data. The validated approach, which uses publicly available LLMs, performs well in both inflammatory bowel disease (IBD) and non-IBD populations.

METHODOLOGY:

Researchers analyzed 116,373 pathology reports generated in the VHA between 1999 and 2024, utilizing search term filtering followed by simple yes/no question prompts for identifying colorectal dysplasia, high-grade dysplasia and/or colorectal adenocarcinoma, and invasive colorectal cancer.

Results were compared to blinded manual chart review of 200 to 300 pathology reports for each patient cohort and diagnostic task, totaling 3,816 reviewed reports, to validate the LLM approach.
Validation was performed independently in IBD and non-IBD populations using Gemma-2 and Llama-3 LLMs without any task-specific training or fine-tuning.
Performance metrics included F1 scores, positive predictive value, negative predictive value, sensitivity, specificity, and the Matthews correlation coefficient to evaluate accuracy across different tasks.
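
The workflow described above (search-term filtering, a yes/no prompt, then scoring against manual review) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the search terms, the prompt wording, and all function names are hypothetical, and the call to the LLM itself is omitted.

```python
import math
import re

# Hypothetical search terms; the study's actual term list is not given here.
SEARCH_TERMS = ("dysplasia", "adenocarcinoma", "carcinoma")


def filter_reports(reports, terms=SEARCH_TERMS):
    """Keep only reports that mention at least one search term (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return [r for r in reports if pattern.search(r)]


def build_prompt(report_text, finding):
    """Build a simple yes/no question about one finding (wording is illustrative)."""
    return (
        f"Pathology report:\n{report_text}\n\n"
        f"Does this report describe {finding}? Answer yes or no."
    )


def parse_yes_no(answer):
    """Map a model's free-text answer to True/False, or None if it is ambiguous."""
    match = re.match(r"\W*(yes|no)\b", answer, re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


def evaluate(predicted, reference):
    """Compute the metrics named above from paired boolean labels."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    tn = sum(1 for p, r in pairs if not p and not r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)

    def ratio(num, den):
        # Return NaN rather than raising when a denominator is empty.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

In this sketch, `predicted` would hold the parsed LLM answers and `reference` the labels from blinded manual chart review. Answers that `parse_yes_no` cannot classify would need separate handling before scoring.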

TAKEAWAY:

In patients with IBD in the MVP, the LLM achieved an F1-score of 96.9% (95% confidence interval [CI], 94.0%-99.6%) for identifying dysplasia, 93.7% (95% CI, 88.2%-98.4%) for identifying high-grade dysplasia/colorectal cancer, and 98% (95% CI, 96.3%-99.4%) for identifying colorectal cancer.
In non-IBD MVP patients, the LLM achieved an F1-score of 99.2% (95% CI, 98.2%-100%) for identifying colorectal dysplasia, 96.5% (95% CI, 93.0%-99.2%) for identifying high-grade dysplasia/colorectal cancer, and 95% (95% CI, 92.8%-97.2%) for identifying colorectal cancer.
Agreement between reviewers was excellent across tasks, with a Cohen's kappa of 89%-97% for the main tasks and 78.1%-93.1% for the indefinite-for-dysplasia category in the IBD cohort.
The LLM approach maintained high accuracy when applied to full pathology reports, with an F1-score of 97.1% (95% CI, 93.5%-100%) for dysplasia detection in patients with IBD.
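
For readers unfamiliar with the reviewer-agreement statistic, Cohen's kappa corrects the observed agreement between two raters for the agreement expected by chance. The sketch below shows the standard two-rater formula. The function name and example labels are hypothetical, and it is not taken from the study's code.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only: two reviewers, four reports.
reviewer_1 = ["yes", "yes", "no", "no"]
reviewer_2 = ["yes", "no", "no", "no"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

Here the reviewers agree on 3 of 4 reports (75%), chance agreement is 50%, and kappa is (0.75 - 0.50) / (1 - 0.50) = 0.5.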

IN PRACTICE: “We have shown that LLMs are powerful, potentially generalizable tools for accurately extracting important information from clinical semistructured and unstructured text and which require little human-led development,” the authors of the study wrote.

SOURCE: The study was based on data from the Million Veteran Program and supported by the Office of Research and Development, Veterans Health Administration, and the US Department of Veterans Affairs Biomedical Laboratory. It was published online in BMJ Open Gastroenterology.

LIMITATIONS: According to the authors, the findings may be specific to the VHA system and the LLMs used. The authors acknowledged that, without long-term access to graphics processing units, they could not feasibly test larger models, which may overcome some of the shortcomings seen in smaller models. Additionally, the researchers could not rule out overlap between Million Veteran Program and Corporate Data Warehouse reports, though they stated that results in either cohort alone are sufficient validation compared with previously published work.

DISCLOSURES: The study was supported by a Merit Review Award from the United States Department of Veterans Affairs Biomedical Laboratory Research and Development Service, the AGA Research Foundation, National Institutes of Health grants, and a National Library of Medicine Training Grant. Kit Curtius reported receiving an investigator-led research grant from Phathom Pharmaceuticals. Shailja C Shah disclosed being a paid consultant for RedHill Biopharma and Phathom Pharmaceuticals, and an unpaid scientific advisory board member for Ilico Genetics, Inc.

This article was created using several editorial tools, including AI, as part of the process. Human editors reviewed this content before publication.