LLMs Show Promise, But Challenges Remain in Improving Inefficient Clinical Trial Screening

Key Takeaways

  • GPT-4 outperforms GPT-3.5 in clinical trial patient screening but is slower and more expensive, requiring human oversight due to potential errors.
  • Low patient accrual in clinical trials is a major issue, with manual screening being time-consuming and inefficient, necessitating automated solutions.

Large language models (LLMs) such as GPT-3.5 and GPT-4 may offer a solution to the costly and inefficient process of manual clinical trial screening, which is often hindered by the inability of structured electronic health record data to capture all necessary criteria.

Large language models (LLMs) such as GPT-4 can effectively analyze unstructured clinical notes to improve the efficiency of patient screening for clinical trials, according to a study published in Machine Learning: Health.1 However, although GPT-4 consistently outperformed GPT-3.5, it was slower and more costly, and both models still required human oversight because of potential errors and limited sensitivity in identifying eligible patients.

Factors impeding patient accrual include resource scarcity, inefficient manual screening processes, and limited availability of research staff. Manual eligibility screening is particularly time-consuming, often requiring more than 40 minutes per patient, yet it remains standard practice in clinical trial research.2 Previous research found that an automated screening system reduced patient screening time by 34% compared with manual methods, underscoring the inefficiency of the traditional process.

Prior work has also called for improved interdisciplinary collaboration among physicians, data scientists, and domain experts.3 That work underscores that tailoring machine learning and natural language processing approaches to specific medical contexts is difficult, given the limited availability of high-quality data for niche disorders and the ethical concerns surrounding patient privacy and data protection.

Researchers evaluated the performance of GPT-3.5 and GPT-4 in screening 74 patients for a head and neck cancer trial using electronic health record (EHR) data.1 They tested 3 prompting methods: structured output, chain of thought, and self-discover.
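
To make the prompting strategies concrete, the following is a minimal, hypothetical sketch of how a chain-of-thought eligibility check against a single criterion could be issued through the OpenAI Python client. The prompt wording, model name, and the screen_criterion helper are illustrative assumptions, not the study's actual pipeline.

```python
# Hypothetical sketch: checking one eligibility criterion against a clinical note
# with a chain-of-thought style prompt. Prompt wording, model choice, and the
# helper function are illustrative assumptions, not the study's actual pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def screen_criterion(note_text: str, criterion: str) -> str:
    """Ask the model to reason step by step, then give a final YES/NO/UNSURE."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You screen oncology patients for clinical trial eligibility."},
            {"role": "user",
             "content": (
                 f"Eligibility criterion: {criterion}\n\n"
                 f"Clinical note:\n{note_text}\n\n"
                 "Think through the relevant evidence step by step, then end with "
                 "a final line formatted exactly as 'ANSWER: YES', 'ANSWER: NO', "
                 "or 'ANSWER: UNSURE'."
             )},
        ],
        temperature=0,
    )
    return response.choices[0].message.content


# Example call with a fictional note and criterion:
# print(screen_criterion("62-year-old with p16+ oropharyngeal carcinoma...",
#                        "Histologically confirmed head and neck squamous cell carcinoma"))
```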

GPT-4 consistently outperformed GPT-3.5 across all metrics. Whereas GPT-3.5's best-performing methods achieved an accuracy of 91% and a Youden index (YI) of 0.59, GPT-4's median performance across prompting methods was notably higher, with a median accuracy of 84%, a median sensitivity of 84%, and a median specificity of 83%.

GPT-4's most effective method, the self-discover approach, yielded a superior YI of 0.73, showcasing a better balance of sensitivity and specificity. In other trials, GPT-4 maintained its lead with median accuracies of 94% and 85%, significantly surpassing GPT-3.5's median accuracies of 87% and 72% in the same contexts. Its highest scores for both accuracy and YI consistently exceeded those of GPT-3.5.
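
For context, the Youden index summarizes a classifier's balance of sensitivity and specificity as YI = sensitivity + specificity − 1, ranging from −1 to 1. Below is a minimal sketch of the calculation from a 2x2 confusion matrix; the counts are invented for illustration and are not the study's data.

```python
# Illustrative Youden index calculation from a 2x2 confusion matrix.
# The counts below are invented to match the reported median sensitivity (84%)
# and specificity (83%); they are not the study's actual counts.
def youden_index(tp: int, fn: int, tn: int, fp: int) -> float:
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1


print(round(youden_index(tp=84, fn=16, tn=83, fp=17), 2))  # 0.67
```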

When assessing patient eligibility for trial enrollment, GPT-3.5 had a median accuracy of 0.54 (95% CI, 0.50-0.61), with the structured output plus expert guidance approach achieving the best result at 0.611. While its specificity was high (median 100%), its sensitivity was very low (median 0%), indicating it was poor at identifying eligible patients.

GPT-4 performed slightly better, with a median accuracy of 0.61 (95% CI, 0.54-0.65) and a highest accuracy of 0.65 using the chain of thought plus expert approach. Similar to its predecessor, GPT-4 maintained high specificity (median 100%) but also had a low sensitivity (median 16%), showing that both models struggle to correctly identify eligible patients despite being effective at ruling out those who are ineligible.

Screening a single patient with GPT-3.5 took between 1.4 and 3 minutes at a cost of $0.02 to $0.03. In contrast, GPT-4 was significantly slower and more expensive, with screening times ranging from 7.9 to 12.4 minutes and costs from $0.15 to $0.27 per patient. The higher cost and longer processing time for GPT-4 are likely due to its increased computational demands.
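
As a rough, back-of-the-envelope extrapolation of these per-patient figures to the 74-patient cohort (the ranges come from the article; the totals below are illustrative, not study results):

```python
# Extrapolating per-patient screening time and cost (ranges from the article)
# to the 74-patient cohort. Illustrative arithmetic only, not a study result.
N_PATIENTS = 74

per_patient = {
    "GPT-3.5": {"minutes": (1.4, 3.0), "dollars": (0.02, 0.03)},
    "GPT-4": {"minutes": (7.9, 12.4), "dollars": (0.15, 0.27)},
}

for model, figures in per_patient.items():
    lo_min, hi_min = figures["minutes"]
    lo_usd, hi_usd = figures["dollars"]
    print(f"{model}: {N_PATIENTS * lo_min / 60:.1f}-{N_PATIENTS * hi_min / 60:.1f} hours, "
          f"${N_PATIENTS * lo_usd:.2f}-${N_PATIENTS * hi_usd:.2f}")
# GPT-3.5: 1.7-3.7 hours, $1.48-$2.22
# GPT-4: 9.7-15.3 hours, $11.10-$19.98
```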

A review of 42 misclassifications (21 for each model) identified 2 main types of errors. The most common issue for both models was improper processing of available information, accounting for 95% of GPT-4's errors and 71% of GPT-3.5's errors; in these cases, the model correctly identified the relevant text but misinterpreted details such as dates, locations, or clinical requirements. The second type of error, a failure to identify relevant information, occurred when the model simply failed to locate the text needed to answer the question correctly; it was more prevalent in GPT-3.5 (29% of its errors) than in GPT-4 (5%).

The study's LLM-based approach has several limitations. First, while cost-effective compared with manual screening, the use of closed-source GPT models raises concerns about ongoing costs and generalizability to open-source alternatives. Second, the system lacks metadata extraction from clinical notes, which would enable better chronological understanding. Third, generating expert guidance requires specialized domain expertise, posing a barrier to widespread adoption. Other limitations include the lack of structured data integration, limited analysis of how indexing affects outcomes, and the absence of performance data for diverse trial types. The need for human review to correct for potential LLM hallucinations is also a factor. A key limitation is that the patient sample is from a single institution with a specific documentation style, which may limit the generalizability of the findings to other health care settings and patient populations. Overall, the study's conclusions would be strengthened by external validation across a larger, more diverse set of clinical trials and institutions.

“LLM performance varies by prompt, with GPT-4 generally outperforming GPT-3.5, but at higher costs and longer processing times. LLMs should complement, not replace, manual chart reviews for matching patients to clinical trials,” study authors concluded.

References

  1. Beattie J, Owens D, Navar AM, et al. ChatGPT augmented clinical trial screening. Mach Learn Health. 2025;1(1):015005. doi:10.1088/3049-477x/adbd47
  2. Ni Y, Bermudez M, Kennebeck S, Liddy-Hicks S, Dexheimer J. A real-time automated patient screening system for clinical trials eligibility in an emergency department: design and evaluation. JMIR Med Inform. 2019;7(3):e14185. doi:10.2196/14185
  3. Khalate P, Gite S, Pradhan B, Lee CW. Advancements and gaps in natural language processing and machine learning applications in healthcare: a comprehensive review of electronic medical records and medical imaging. Front Phys. 2024;12. doi:10.3389/fphy.2024.1445204
