March 25, 2026

LLMs Show Promise in Pediatric Care but Need More Safety, Efficacy Research

Fact checked by: Rose McNulty

Key Takeaways

A 2020–2025 scoping review identified 40 eligible pediatric LLM studies, largely published in 2024–2025 across journals, preprints, brief reports, and conference proceedings.
US institutions led output, but studies also emerged from China and multiple EMEA/Asian regions; age reporting was sparse, and only nine datasets included ages 0–5 years.
Pretrained GPT models predominated, with limited pediatric fine-tuning; most evaluations compared models and measured accuracy, F1, sensitivity, and specificity against multi-annotator ground truth, sometimes using nonexperts.
Reported benefits included improved accuracy and workflow efficiency, yet hallucinations and inconsistency persisted; next steps emphasize rigorous designs, pediatric-specific models, underrepresented specialties/age groups, stakeholder input, and standardized reporting.

LLMs may improve pediatric clinical decision-making, but gaps in safety, accuracy, and pediatric-specific data remain.

Large language models (LLMs) are increasingly used to analyze clinical data, yet more research is needed to assess their safety and efficacy in diagnostic decision-making, particularly in pediatric clinical care.¹

The scoping review, recently published in JAMA Network Open, aimed to address the integration of LLMs into health care. It analyzed original research published from January 1, 2020, to July 1, 2025, that used modern transformer-based LLMs with pediatric clinical text as input. LLMs are artificial intelligence models that recognize, summarize, translate, and generate natural language content to perform a wide range of tasks. When used within electronic health records, they can help streamline clinical documentation and automate responses to patients.² However, little research distinguishes between adult and pediatric care or focuses specifically on pediatric applications.

The reviewers searched research databases including PubMed/MEDLINE, Embase, Web of Science, and Scopus, as well as preprint servers, using LLM- and pediatric-related terms. Forty studies met the inclusion criteria and were included in the final analysis. All were published between 2023 and 2025, with 21 published in 2024 and 16 in 2025. By publication type, 24 were original research articles, 5 were brief reports or research letters, 7 were preprints, and 4 were conference proceedings.¹

Regarding the location of each study, 23 studies were conducted in the US, 4 were conducted in China, and the remaining 13 studies were conducted across countries spanning Europe, the Middle East, North Africa, and Asia.

All but 2 studies focused solely on pediatric populations. Twenty-four studies did not specify pediatric subgroup ages, and only 9 included children aged from birth to 5 years. Sample sizes ranged from 10 to 172,683, with ages spanning from birth to 18 years.

LLM Clinical Application

The most common LLM used was GPT (OpenAI) in 29 studies, followed by LLaMA (Meta) in 9 studies. Furthermore, 23 studies evaluated a single LLM, whereas the remaining 17 studies compared multiple models. Thirty studies relied solely on pretrained models, 6 studies fine-tuned the LLMs on pediatric data, and 4 studies used both fine-tuned and pretrained models.

Clinical decision support was the most studied category of LLM clinical application across pediatric subspecialties, followed by clinical note generation, patient communication and education, administration and workflow, and medical research assistance. Of the 12 subcategories of LLM application, the most common were diagnostic decision support, which appeared in 24 studies, and treatment planning, in 7 studies.

All the studies compared LLM performance. Fifteen studies compared evaluation metrics across different LLMs and against other machine learning or AI models, whereas 9 studies evaluated a single LLM. The most common evaluation metrics were accuracy in 20 studies, F1 score in 12 studies, sensitivity in 9 studies, and specificity in 8 studies. Furthermore, 33 studies assessed model performance against annotated ground truth labels. Among these, 10 studies used trained nonexperts, and 1 study did not provide annotator details. In the remaining 26 studies, multiple annotators were used.
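For readers less familiar with these metrics, the sketch below shows how accuracy, sensitivity, specificity, and F1 score are conventionally computed for a binary task when model output is compared with annotated ground truth. It is an illustration only and is not drawn from the review: the function name `classification_metrics` and the example labels are hypothetical, and the included studies may have calculated their metrics differently (for example, with multiclass averaging).

```python
import math


def classification_metrics(y_true, y_pred):
    """Standard binary metrics; labels are 1 (positive) or 0 (negative)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a metric is undefined.
        return numerator / denominator if denominator else math.nan

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "f1": ratio(2 * tp, 2 * tp + fp + fn),  # harmonic mean of precision and recall
    }


# Hypothetical example: annotator labels vs. model predictions (not study data).
reference = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(reference, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the model finds 2 of 3 positive cases and correctly rules out 3 of 5 negative cases, so the four metrics diverge, which is one reason studies that report only accuracy can be hard to compare.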

Many of the study authors reported improved accuracy, time efficiency, and cost-effectiveness. However, they also reported limitations, including hallucinations and inconsistent performance.

The review was limited in that the studies it assessed focused solely on pediatric populations, whereas LLMs are more commonly used in mixed-age or general-population settings.

“Future research should prioritize rigorous study designs, pediatric-specific models, underrepresented specialties and age groups, and stakeholder input, while adhering to implementation, evaluation, and reporting standards to support safe, effective, and equitable deployment of LLMs in pediatrics,” the study authors concluded.

References

1. Huang T, Tse G, Pageler NM, Bannett Y. Large language model using clinical text in pediatrics: a scoping review. JAMA Netw Open. 2026;9(3):e262443. doi:10.1001/jamanetworkopen.2026.2443

2. Tierney AA, Reed ME, Grant RW, Doo FX, Payán DD, Liu VX. Health equity in the era of large language models. Am J Manag Care. 2025;31(3):112-117. doi:10.37765/ajmc.2025.89695