
Large Language Model Shows Promise for Automating SALT Scoring in Pediatric Alopecia Areata
Key Takeaways
- GPT-4o achieved ICC/CCC of 0.815/0.866 versus in-person SALT scoring and 0.833/0.817 versus image-based scoring, while expert interrater concordance reached 0.950/0.948.
- Bland-Altman analyses showed that GPT-4o systematically overestimated scalp hair loss relative to clinicians, a calibration bias that could affect threshold-based treatment eligibility decisions.
GPT-4o scored alopecia areata severity from scalp images with strong concordance to experienced dermatology providers, despite having no task-specific training.
An off-the-shelf large language model could serve as a useful adjunct tool for scoring hair loss severity in patients with alopecia areata (AA), according to a study published in Pediatric Dermatology.1
The study, titled GATEWAYS (Generative Artificial Intelligence Tools: Evaluating Ways to Automate Your SALT), assessed whether GPT-4o could reliably calculate Severity of Alopecia Tool (SALT) scores from standardized 4-view scalp images of pediatric patients. SALT scores, which quantify scalp hair loss on a scale from 0 to 100, are a primary outcome measure in AA clinical practice and trials, but their accuracy is limited by provider subjectivity and the time required for manual assessment.1,2
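For readers unfamiliar with how a SALT score is built from 4 views, the following is a minimal sketch, not the study's method. It assumes the commonly cited SALT regional weights (top 40%, back 24%, left 18%, right 18% of scalp area), which are not stated in the article; the function name and the example values are hypothetical.

```python
# Regional weights from the commonly cited SALT definition (share of scalp area).
# Assumption: these are not taken from the GATEWAYS paper itself.
SALT_WEIGHTS = {"top": 0.40, "back": 0.24, "left": 0.18, "right": 0.18}


def salt_score(percent_loss_by_view):
    """Combine per-view hair loss percentages (0-100) into a 0-100 SALT score."""
    missing = set(SALT_WEIGHTS) - set(percent_loss_by_view)
    if missing:
        raise ValueError(f"missing views: {sorted(missing)}")

    total = 0.0
    for view, weight in SALT_WEIGHTS.items():
        loss = percent_loss_by_view[view]
        if not 0 <= loss <= 100:
            raise ValueError(f"{view}: hair loss must be between 0 and 100")
        # Each view contributes its hair loss scaled by its share of the scalp.
        total += weight * loss
    return round(total, 1)


# Hypothetical patient: 50% loss on top, 25% on the back, 10% on each side.
example = {"top": 50, "back": 25, "left": 10, "right": 10}
print(salt_score(example))
```

Because the top view carries the largest weight, an error in estimating hair loss there shifts the total score more than the same error on a side view.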
“Accuracy in quantifying hair loss via SALT scoring is important for standardization of patient care and for clinical trials,” the authors explained.1 “However, physician-related subjectivity exists, especially in determining the extent of hair loss upon manual inspection, and the additional time needed to quantitatively assess hair loss is a major limitation in clinical practice.”
Researchers at Children's Hospital of Philadelphia identified 104 sets of 4-view scalp images from patients with AA seen at their dermatology clinic. Images were deidentified and provided to GPT-4o via a structured prompt. Of the 104 sets, 100 were ultimately included in the analysis after excluding cases in which GPT-4o failed to generate a score or in which provider scores showed implausible discrepancies attributed to data entry error.
The model demonstrated strong agreement with clinician-generated scores. Intraclass correlation coefficients (ICC) and concordance correlation coefficients (CCC) between GPT-4o and in-person provider assessments were 0.815 and 0.866, respectively. Against image-based provider assessments, ICC and CCC reached 0.833 and 0.817. By comparison, interrater concordance between 2 experienced human providers reached ICC and CCC values of 0.950 and 0.948. This benchmark underscores some room for AI improvement, the authors noted.
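As an illustration of one of the agreement statistics reported above, here is a minimal sketch of Lin's concordance correlation coefficient. It is not the authors' analysis code; the function name and the sample scores are hypothetical, and the population (divide by n) form of the variance and covariance is assumed.

```python
from statistics import fmean


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired lists with at least 2 values each")

    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when both raters give one identical constant score")
    # The squared mean difference penalizes systematic offset between raters,
    # which an ordinary correlation coefficient would ignore.
    return 2 * cov_xy / denominator


# Hypothetical SALT scores: the model tracks the provider but runs 10 points high.
provider = [10, 20, 30]
model = [20, 30, 40]
print(round(lins_ccc(provider, model), 3))
```

In this toy example the two raters are perfectly correlated, yet the constant 10-point offset pulls the CCC well below 1, which is why concordance measures are preferred over plain correlation for rater agreement.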
Notably, GPT-4o tended to overestimate hair loss relative to providers, reflected by negative mean differences in Bland-Altman plots (consistent with differences calculated as provider score minus model score). The model also struggled with left-right scalp orientation, correctly identifying "left" views only 13% of the time, compared with 97% accuracy for top and back views. The authors attributed this finding to potential misinterpretation of spatial orientation in images.
The authors acknowledged that content filtering built into GPT-4o initially blocked responses for some images, requiring that filtering be disabled to proceed, which would be a practical hurdle for real-world clinical deployment. They also noted study limitations, including the use of single-center data and the evaluation of only 1 AI platform.
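The Bland-Altman summary behind this kind of finding can be sketched as follows. This is an illustrative example only: the sign convention (provider minus model, so a negative bias means the model scores higher) is an assumption chosen to match the description above, and the function name and data are hypothetical.

```python
from statistics import fmean, stdev


def bland_altman(provider_scores, model_scores):
    """Return (bias, lower limit, upper limit) of agreement for paired scores."""
    if len(provider_scores) != len(model_scores) or len(provider_scores) < 2:
        raise ValueError("need two paired lists with at least 2 values each")

    # Assumed convention: provider minus model, so negative = model overestimates.
    diffs = [p - m for p, m in zip(provider_scores, model_scores)]
    bias = fmean(diffs)
    spread = stdev(diffs)  # sample standard deviation of the differences
    # Conventional 95% limits of agreement: bias +/- 1.96 standard deviations.
    return bias, bias - 1.96 * spread, bias + 1.96 * spread


# Hypothetical SALT scores in which the model reads consistently higher.
provider = [10, 30, 50, 70]
model = [15, 33, 58, 74]
bias, lower, upper = bland_altman(provider, model)
print(f"bias={bias:.1f}, limits of agreement=({lower:.1f}, {upper:.1f})")
```

A bias of a few SALT points matters most near a decision cutoff, where a consistent overestimate could move a patient across an eligibility threshold.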
Still, the investigators framed these findings as a proof of concept for a low-barrier, scalable approach to SALT scoring. They noted that AI-assisted tracking could benefit patients in settings with limited dermatology access and may help patients and families track treatment progress.
The authors noted that further updates to off-the-shelf large language models will most likely improve their performance. The study suggests that off-the-shelf GPT-4o compares favorably with at least 1 proprietary AI system, although that system has reported ICC values between 0.95 and 0.97,3 above the 0.815 to 0.833 reported for GPT-4o.
“These tools may also aid dermatologists in more accurately determining patient eligibility for therapies that require certain SALT score thresholds to be met for approval by insurance companies,” the authors concluded.
References
1. Gupta R, Salmasian H, Oboite M, et al. Generative artificial intelligence tools: evaluating ways to automate your SALT (GATEWAYS) scoring of alopecia areata. Pediatr Dermatol. Published online March 9, 2026. doi:10.1111/pde.70195
2. King BA, Senna MM, Ohyama M, et al. Defining severity in alopecia areata: current perspectives and a multidimensional framework. Dermatol Ther (Heidelb). 2022;12(4):825-834. doi:10.1007/s13555-022-00711-3
3. Nguyen H, Gazeau L, Wolfe J. Using artificial intelligence to compute severity of alopecia tool scores. JAAD Int. 2024;18:101-102. doi:10.1016/j.jdin.2024.04.003




