Clinical calculators that do not include demographic variables may be biased, and their equity should be understood in the context of clinical guidelines.
Objectives: To evaluate whether one summary metric of calculator performance sufficiently conveys equity across different demographic subgroups, as well as to evaluate how calculator predictive performance affects downstream health outcomes.
Study Design: We evaluate 3 commonly used clinical calculators—Model for End-Stage Liver Disease (MELD), CHA2DS2-VASc, and simplified Pulmonary Embolism Severity Index (sPESI)—on the cohort extracted from the Stanford Medicine Research Data Repository, following the cohort selection process as described in respective calculator derivation papers.
Methods: We quantified the predictive performance of the 3 clinical calculators across sex and race. Then, using the clinical guidelines that guide care based on these calculators’ output, we quantified potential disparities in subsequent health outcomes.
Results: Across the examined subgroups, the MELD calculator exhibited worse performance for female and White populations, CHA2DS2-VASc calculator for the male population, and sPESI for the Black population. The extent to which such performance differences translated into differential health outcomes depended on the distribution of the calculators’ scores around the thresholds used to trigger a care action via the corresponding guidelines. In particular, under the old guideline for CHA2DS2-VASc, among those who would not have been offered anticoagulant therapy, the Hispanic subgroup exhibited the highest rate of stroke.
Conclusions: Clinical calculators, even when they do not include variables such as sex and race as inputs, can have very different care consequences across those subgroups. These differences in health care outcomes across subgroups can be explained by examining the distribution of scores and their calibration around the thresholds encoded in the accompanying care guidelines.
Am J Manag Care. 2023;29(1):e1-e7. https://doi.org/10.37765/ajmc.2023.89306
Our analysis of 3 commonly used clinical calculators—Model for End-Stage Liver Disease, CHA2DS2-VASc, and simplified Pulmonary Embolism Severity Index—based on data from a local hospital chain shows that a single summary performance metric can mask subgroup differences, and that whether such differences translate into health outcome disparities depends on how the scores distribute and calibrate around the thresholds set by the accompanying care guidelines.
As clinical calculators become ubiquitous in guiding medical decision-making,1-3 health care professionals are increasingly scrutinizing the biases present in these calculators. The effect of biases on health disparities has long been understood.4 However, the urgency of such examination has recently been magnified by greater visibility of structural racism,5,6 prompting medical professionals to inspect how their practices may contribute to health inequities.7,8 For example, there has been a call to action9 to investigate calculators that include race adjustments because they may create or perpetuate health care inequities in the affected subgroup, as race reflects a social construct rather than a biological reality.10-12
However, there remain 2 barriers that hinder the examination of the possible inequities in clinical calculator–based care. First, many popular clinical calculators,13-16 regardless of whether they contain adjustments for specific demographic subgroups, have not been validated to perform comparably across those subgroups. Even if a factor such as sex or race is not an input to a clinical calculator, the calculator’s output (ie, the score it provides) may still have different distributions across patient subgroups. Second, even when performance measures are reported across subgroups,17 the assessment is often focused on the predictive performance of the calculator’s output (eg, C statistic), without examining the consequences in terms of the downstream health interventions or outcomes.18 Take, for example, the case of the Model for End-Stage Liver Disease (MELD) calculator. The United Network for Organ Sharing (UNOS)19 assigns priority for liver transplants based on the MELD calculator output. If this calculator is not calibrated across different demographic groups, then allocation of liver transplants will be systematically different for certain subgroups.
We propose that a fairness assessment of a clinical calculator must include examining its performance by subgroups of concern, as well as observing its consequences in light of the corresponding care guideline.20 To illustrate such an assessment, we selected 3 commonly used clinical calculators—MELD, which predicts end-stage liver disease (ESLD) mortality; CHA2DS2-VASc (congestive heart failure, hypertension, age ≥ 75 years, diabetes, stroke, vascular disease, age 65-74 years, sex category), which predicts stroke risk in patients with atrial fibrillation; and simplified Pulmonary Embolism Severity Index (sPESI), which predicts mortality outcomes in patients with pulmonary embolism (PE). We then quantified the predictive performance of each across demographic subgroups defined along 2 dimensions: sex and race. Two calculators do not take sex or race as input (MELD and sPESI), and 1 takes sex (CHA2DS2-VASc). We then identified clinical guidelines that rely on these calculators’ output and quantified the potential resulting disparities in health outcomes. In particular, we observed negative events that may result from making a decision based on the calculator’s output across sex and race. Specific negative events for selected calculators are death after being denied liver transplant based on MELD, stroke after not being recommended for anticoagulation based on CHA2DS2-VASc, and death following an early discharge guided by sPESI.
A study team of physicians was formed to nominate popular clinical calculators that are routinely used. By consulting previous literature surveying taxonomies of calculators21-23 and a popular website that aggregates clinical tools used by practicing clinicians (MDCalc1), 3 calculators—MELD, CHA2DS2-VASc, and sPESI—were selected. The detailed selection process can be found in the eAppendix (available at ajmc.com).
For each calculator, we then identified a clinical guideline in which the calculator is used to determine a care recommendation. These guidelines define risk groups based on score ranges and provide therapeutic recommendations for each group.
MELD13 predicts survivability of patients with ESLD over a 90-day period. Based on CDC statistics,24 this calculator is potentially applicable to more than 4.5 million patients with ESLD in the United States, approximately 2% of the adult population.
The calculator was later adopted by UNOS to prioritize liver transplant and further modified to include serum sodium level as an input. For patients with liver cirrhosis who are aged at least 12 years, the MELD score is calculated as shown in eAppendix Listing 1.
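For illustration, the base (pre-sodium) MELD formula is widely published; a minimal sketch follows. The input floors and the creatinine cap reflect the UNOS implementation; the sodium-adjusted variant mentioned above adds a correction term that is not shown here.

```python
import math

def meld(bilirubin_mg_dl, inr, creatinine_mg_dl):
    """Base (pre-sodium) MELD score: a sketch of the published formula.

    Inputs below 1.0 are floored at 1.0 and creatinine is capped at 4.0,
    per the UNOS implementation.
    """
    bili = max(bilirubin_mg_dl, 1.0)
    inr = max(inr, 1.0)
    creat = min(max(creatinine_mg_dl, 1.0), 4.0)
    score = (3.78 * math.log(bili) + 11.2 * math.log(inr)
             + 9.57 * math.log(creat) + 6.43)
    return round(score)
```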
UNOS drives a complex liver distribution policy19 based on the MELD output. Most importantly, patients with a MELD score of less than 15 receive the lowest priority, so local hospitals typically avoid listing such patients as transplant candidates.
The CHADS2 (congestive heart failure, hypertension, age ≥ 75 years, diabetes, and stroke) score16 estimates the risk of stroke in patients with atrial fibrillation, the most common arrhythmia with an estimated prevalence of at least 5% of adults.25 CHA2DS2-VASc17 improves on the CHADS2 by adding a new age group and incorporating vascular disease history and sex category. For patients with atrial fibrillation, CHA2DS2-VASc score is calculated as shown in eAppendix Listing 2. Both CHADS2 and CHA2DS2-VASc predict the likelihood of developing stroke over a 1000-day window.
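The point assignments behind CHA2DS2-VASc are public; a minimal sketch of the scoring rules:

```python
def cha2ds2_vasc(age, female, chf, hypertension, diabetes,
                 prior_stroke_tia, vascular_disease):
    """CHA2DS2-VASc point total: a sketch of the published scoring rules."""
    score = 0
    score += 2 if age >= 75 else (1 if age >= 65 else 0)  # age bands
    score += 1 if female else 0                           # sex category
    score += 1 if chf else 0
    score += 1 if hypertension else 0
    score += 1 if diabetes else 0
    score += 2 if prior_stroke_tia else 0                 # stroke/TIA history
    score += 1 if vascular_disease else 0
    return score
```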
CHA2DS2-VASc scores are often used to determine antithrombotic therapy for the patient. We identified 2 versions of a guideline that is being used. In the 2014 American College of Cardiology (ACC)/American Heart Association (AHA) atrial fibrillation treatment guideline,26 antithrombotic therapy with oral anticoagulants is recommended only for individuals with a CHA2DS2-VASc score of 2 or greater.
In the most recent 2020 ACC/AHA guideline,27 this threshold is increased by 1 for female sex (CHA2DS2-VASc score of 3 or greater), while remaining the same for male sex. This change nullifies the effect of including sex as an input to the CHA2DS2-VASc score, acknowledging that biological sex does not increase the risk of stroke as previously thought.
Despite the publication of the 2020 guideline, the 2014 guideline continues to be used actively.28 Because of this, we quantified CHA2DS2-VASc negative event frequency using thresholds from both guidelines.
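The two guideline thresholds described above can be sketched as a single decision function:

```python
def recommend_anticoagulation(score, female, guideline="2020"):
    """Anticoagulation recommendation from a CHA2DS2-VASc score under the
    two ACC/AHA guideline versions described in the text (a sketch)."""
    if guideline == "2014":
        return score >= 2                 # same threshold for both sexes
    return score >= (3 if female else 2)  # 2020: female threshold raised by 1
```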
sPESI14 is a simplified version of the PESI15 that predicts 30-day survivability of patients with PE (eAppendix Listing 3). Nearly 400,000 Americans receive a diagnosis of PE yearly, and PE is the third most common cause of cardiovascular death after myocardial infarction and stroke.29
The sPESI score classifies patients into 2 categories: patients with a score of 0 as low risk, and those with a score of 1 or more as high risk. Patients deemed low risk are considered for early discharge14 and management of their PE entirely in an outpatient setting.
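The sPESI items (1 point each) and the low-risk cutoff can be sketched as:

```python
def spesi(age, cancer_history, chronic_cardiopulmonary_disease,
          heart_rate, systolic_bp, o2_saturation):
    """sPESI point total: a sketch of the published items (1 point each)."""
    return sum([
        age > 80,
        cancer_history,
        chronic_cardiopulmonary_disease,
        heart_rate >= 110,
        systolic_bp < 100,
        o2_saturation < 90,
    ])

def low_risk(score):
    # A score of 0 classifies the patient as low risk, ie, eligible for
    # early-discharge consideration.
    return score == 0
```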
We used the Stanford Medicine Research Data Repository (STARR)30 for our data analysis, which is composed of records from Stanford Health Care and the Lucile Packard Children’s Hospital. We also linked STARR with the Social Security Administration’s Death Master File31 to ascertain out-of-hospital deaths. The eAppendix details the data set characteristics and the linkage process.
STARR data are provided in the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) format. We implemented the clinical calculators in Python, mapping the lab measurements and diagnosis codes used in each calculator derivation paper13,14,17 to the corresponding standard vocabulary terms in OMOP-CDM. The calculators then operate on OMOP-CDM measurement values to derive scores for each respective cohort.
Cohorts were established by applying the same selection criteria described in each of the calculator derivation papers.13,14,17 Because OMOP-CDM does not support selection of more than 1 race, our analysis is based on the first preferential race declared by the individual. Specifically, we used the 5 categories of race as defined by the US Census and cast Hispanic as a dedicated ethnicity. Those records that are missing sex or race information were dropped from the analysis. eAppendix Table 2 shows the cohort size for each calculator.
Scores are calculated using data available on the day of diagnosis. When a patient has multiple measurements, we compute the score using the first measurement because it is the most likely to drive downstream medical intervention. Analysis is performed using Pandas32 1.3.0, SciPy33 1.7.0, and lifelines34 0.26.3 running on Python 3.9.6, configured through Conda 4.5.11.
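The first-measurement rule above can be sketched with pandas; the table layout and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical long-format measurement table: one row per patient per lab draw.
df = pd.DataFrame({
    "person_id":   [1, 1, 2, 2],
    "measured_at": pd.to_datetime(
        ["2020-01-05", "2020-02-01", "2020-03-10", "2020-03-01"]),
    "value":       [2.1, 1.4, 0.9, 1.1],
})

# Keep the earliest measurement per patient, mirroring the first-measurement
# rule described in the text.
first = (df.sort_values("measured_at")
           .groupby("person_id", as_index=False)
           .first())
```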
Similar to previous studies,13,14,17 we use the C statistic as the metric to quantify a calculator’s predictive capability. We first compute the C statistic for the entire cohort, then generate the same statistic for each subgroup: sex and race. For each statistic, we compute the 95% CI using 1000 bootstrap resamples.
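As an illustration, the C statistic for a binary outcome and its percentile bootstrap CI can be computed as follows. This is a simple pure-Python stand-in; the study itself used lifelines.

```python
import random

def c_statistic(scores, events):
    """Concordance for a binary outcome: the fraction of (event, non-event)
    pairs in which the event case has the higher score (ties count 0.5)."""
    pos = [s for s, e in zip(scores, events) if e]
    neg = [s for s, e in zip(scores, events) if not e]
    pairs = [(p, n) for p in pos for n in neg]
    if not pairs:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

def bootstrap_ci(scores, events, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the C statistic (1000 resamples in text)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]
        stats.append(c_statistic([scores[i] for i in sample],
                                 [events[i] for i in sample]))
    stats = sorted(s for s in stats if s == s)  # drop degenerate resamples
    lo = stats[int(len(stats) * alpha / 2)]
    hi = stats[int(len(stats) * (1 - alpha / 2)) - 1]
    return lo, hi
```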
Once raw calculator performance is understood, we evaluate its impact in the clinical context by observing the negative event frequency. The population of interest is defined as individuals who, based on the calculator output, would have not been offered treatment (MELD and CHA2DS2-VASc) or would have been considered for early discharge (sPESI). Negative events in this population represent a negative health outcome for which the calculator predicted the patient to be at low risk—all-cause mortality (MELD and sPESI) or stroke (CHA2DS2-VASc). We apply the same 95% bootstrap CI.
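The negative event frequency can be sketched as follows; the threshold semantics come from each guideline (for sPESI, “below threshold” corresponds to a score of 0, ie, a threshold of 1):

```python
def negative_event_frequency(scores, events, threshold):
    """Fraction of patients below the care threshold (ie, not offered
    treatment or considered for early discharge) who experienced the
    negative event. A sketch; real thresholds come from each guideline."""
    below = [e for s, e in zip(scores, events) if s < threshold]
    if not below:
        return float("nan")
    return sum(below) / len(below)
```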
When the C statistic or negative event frequency of one subgroup is compared with that of another, a 1-tailed Welch’s t test is used to compute P values.
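A sketch of the Welch comparison with SciPy; the subgroup samples here are synthetic stand-ins for bootstrap replicates:

```python
import numpy as np
from scipy import stats

# Hypothetical bootstrap replicates of the C statistic for two subgroups.
rng = np.random.default_rng(0)
group_a = rng.normal(0.77, 0.02, size=1000)  # eg, one subgroup
group_b = rng.normal(0.82, 0.02, size=1000)  # eg, the comparison subgroup

# Welch's t test (unequal variances); alternative="less" tests whether
# group_a's mean is lower, matching the 1-tailed comparison in the text.
t_stat, p_value = stats.ttest_ind(group_a, group_b,
                                  equal_var=False, alternative="less")
```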
eAppendix Table 3 shows the characteristics of the cohort, which we verified to significantly overlap with the MELD calculator derivation paper.13
Table 1 then provides C statistics for MELD, which exhibited an overall C statistic of 0.81 (95% CI, 0.75-0.86), comparable with what was reported in the calculator derivation paper13 (0.78; 95% CI, 0.74-0.81). However, C statistics for each sex show that MELD exhibits higher concordance for male patients (0.82; 95% CI, 0.72-0.90; P < .001) compared with female patients (0.77; 95% CI, 0.70-0.84).
In contrast, when examining by racial subgroups, the White population exhibits the worst C statistic (0.77; 95% CI, 0.66-0.87; P < .001). Sample size for the Black population is too small to be conclusive.
When the guidance provided by the UNOS policy (ie, MELD score < 15) is applied, the negative event frequency (Table 1) follows the same pattern. Female patients exhibit a higher percentage of deaths (2.07%; 95% CI, 0.78%-3.64%; P < .001) among those who would have been denied listing for liver transplant compared with male patients (1.21%; 95% CI, 0.45%-2.08%). Across racial subgroups, the White population shows the highest negative event frequency, at 2.42% (95% CI, 1.20%-3.86%; P < .001).
Observing how the MELD score for each population is distributed around the cutoff threshold explains the observed negative event frequency. The left panel of Figure 1 shows that a significant portion of both the male and female populations exhibit MELD scores less than 15. Therefore, the worse C statistic in female patients translates into a worse negative event frequency than in male patients.
Likewise, the majority of the White population has MELD scores less than 15 (Figure 1, right panel). As a result, misclassification stemming from poor calculator performance results in significantly more negative outcomes relative to other racial subgroups.
eAppendix Table 4 shows the characteristics of our cohort. Compared with the CHA2DS2-VASc derivation paper,17 our cohort is older (77 years vs 66 years) and has higher prevalence of comorbidities (eg, 77% vs 24% with heart failure and 93% vs 17% with diabetes).
As shown in Table 2, the overall C statistic of the CHA2DS2-VASc calculator on our cohort is marginally better (0.66; 95% CI, 0.66-0.67) than what was reported for the original development cohort (0.606; 95% CI, 0.513-0.699).17
Nevertheless, the C statistic for the CHA2DS2-VASc calculator varies by sex- and race-based subgroups. Stratification by patient sex reveals that CHA2DS2-VASc exhibits a higher C statistic for female patients (0.69; 95% CI, 0.68-0.69) than male patients (0.65; 95% CI, 0.64-0.65). The CIs do not overlap. Across racial subgroups, the CHA2DS2-VASc calculator exhibits the best C statistic for the Hispanic population (0.73; 95% CI, 0.70-0.76; P < .001).
Table 2 then compares the negative event frequency for stroke based on application of CHA2DS2-VASc calculator scores using 2 clinical guidelines. Under the new guideline, better predictive performance translates into better health outcomes, where both female (1.91%; 95% CI, 1.68%-2.16%; P < .001) and Hispanic (1.78%; 95% CI, 1.17%-2.47%; P < .001) populations exhibit the lowest negative event frequency. However, under the old guideline, the Hispanic population shows the highest negative event frequency (3.30%; 95% CI, 2.13%-4.65%; P < .001) despite showing the highest C statistic.
Such nonconcordance in event frequency stems from the poor calibration of the CHA2DS2-VASc score (Figure 2). The stroke risk for the Hispanic population in our cohort for the case of CHA2DS2-VASc score of 1 turns out to be slightly higher than for other scores. Because a score of 1 is below the guideline’s threshold for treatment, this increased stroke risk in the Hispanic subgroup drives a higher negative event frequency.
Negative event frequency for the female population under the old guideline could not be calculated, because no female patients in our cohort with a CHA2DS2-VASc score of 1 or less experienced stroke.
eAppendix Table 5 presents the cohort characteristics. Compared with the original cohort used to derive the sPESI calculator,14 our cohort has a higher percentage of male patients (57% vs 40%) and patients with cancer history (52% vs 20%).
In this cohort, sPESI shows an overall C statistic of 0.58 (95% CI, 0.56-0.60) (Table 3), which is considerably less than what was reported in the original study14 (0.75; 95% CI, 0.69-0.80) but is in line with a meta-study that assessed sPESI predictive performance35 (0.57; 95% CI, 0.52-0.61).
Observing per-group predictive performance, the sPESI C statistic does not vary significantly across sex. Examining the C statistic across racial subgroups, however, shows that sPESI exhibits varying performance for different groups, with worst performance in the Black population, where the C statistic is close to random (0.50; 95% CI, 0.40-0.61; P < .001). The Asian population exhibits the highest C statistic (0.62; 95% CI, 0.54-0.69; P < .001), followed by the White (0.58; 95% CI, 0.55-0.61) and Hispanic (0.58; 95% CI, 0.50-0.66) populations, which show similar C statistics.
When put in clinical context, however, such similarity in performance between the White and Hispanic populations does not translate to similar negative event frequency (Table 3). Specifically, the White population shows the highest incidence rate at 12.77% (95% CI, 7.52%-18.05%; P < .001), which means that, among the patients who identified as White, 12.77% of those who may have been considered for early discharge based on sPESI score (ie, sPESI = 0) ultimately died within 30 days of receiving a diagnosis of PE. Sample size for the Asian and Black populations with a sPESI score of 0 is too small to be conclusive (eAppendix Table 6).
eAppendix Figure 2 shows the sPESI calculator exhibiting poor calibration, with the death percentage for the White population being significantly higher than that for the Hispanic population at a sPESI score of 0.
The results show that reporting a single summary C statistic for the entire cohort is inadequate. For the 3 calculators that we examined, none of the original derivation papers13,14,17 reported validation across sex and race subgroups, yet we demonstrate that each calculator’s performance varies when stratified by these demographic variables.
Our results also show that calculator performance has unpredictable impact on outcomes when evaluated under the relevant clinical guidelines. Most striking is the comparison of the 2 guidelines for CHA2DS2-VASc, where the Hispanic subgroup exhibited the worst negative event frequency under the old guideline despite the highest predictive performance of the calculator. Such disparity vanishes under the new guideline.
As guidelines often assign patients to a risk group based on thresholds, understanding the distribution of risk scores and the calibration around those thresholds is crucial to understanding the impact of calculator-guided decisions.20 Such dependence of calculator fairness on guidelines also warrants that fairness assessments be redone whenever policies relying on a given calculator are introduced or revised.
Although we chose to focus only on undertreatment to test our hypothesis, such an analysis could similarly be conducted on overtreatment, and our conclusion is not contingent on testing both scenarios.
Finally, it is often assumed that removing demographic inputs such as sex and race from a calculator can ensure fairness, yet bias can still persist in a calculator even without these inputs. We have demonstrated this by uncovering demographic subgroup stratification in calculators both with (CHA2DS2-VASc) and without (MELD and sPESI) relevant demographic input. Explicitly adding sex or race as calculator input, however, could increase bias, as these categories often reflect a social construct.9-11 Therefore, fairness analysis must be conducted for all clinical calculators and not just those that include such variables as input.
In summary, our results stress that when a treatment guideline translates a calculator’s score into risk groups, distribution of those scores—in particular, how fairly the score is calibrated at the grouping thresholds—affects the achieved health outcomes for each subgroup. We recommend that other institutions routinely conduct similar validation and stratification analysis of calculators on their patient populations and audit their institutional policies based on such analysis.
Although we focused on stratification across sex and race due to their widespread use as calculator inputs, they can easily be confounded by other socioeconomic and demographic variables. Future studies could further isolate the impact by analyzing confounding variables in tandem.
Our work does not propose ways to revise calculators to ameliorate identified bias, because how to do so is an active field of research, and a generalizable strategy to accomplish this is unclear.36 A promising line of effort was demonstrated with the development of new Chronic Kidney Disease Epidemiology Collaboration estimated glomerular filtration rate equations that do not take race as an input,37 where fairness could be achieved by incorporating parameters (eg, cystatin) that are specific to the biological pathway at hand.
We acknowledge that the STARR data set is confined to only 2 hospitals in a small health care system in a single geographic region. It is recommended that other institutions conduct similar validation and stratification analysis on data representative of their patient populations. Also, use of only first preferential race may not accurately reflect multiracial patients. We leave the use of multirace information to future work.
Lastly, we did not attempt to quantify the frequency with which these calculators are used in practice. If they are not used in every relevant clinical situation, the population-level effect of the bias uncovered may be reduced.
Clinical calculators are used to guide medical decision-making in all specialties. Despite high interest in inspecting systematic biases in these calculators, there exist structural limitations in assessing fairness. Our results show that calculators—even those that do not include demographic variables such as sex and race as inputs—can have very different C statistic performance and different score distributions across sex and race. This demonstrates that reporting a single summary performance metric fails to adequately reveal biased performance over individual subgroups.
Given that clinical calculators are applied via clinical guidelines, and such guidelines often assign patients to risk groups and allocate treatments based on calculator score thresholds, it is essential to observe how the calculator’s output is distributed and calibrated around these decision thresholds in order to understand the health equity consequences of using a calculator in real-world practice. We encourage institutions to routinely conduct such validation and stratification analysis on commonly used clinical calculators for their patient populations and to audit their local application of clinical guidelines accordingly.
Author Affiliations: Department of Medicine (RMY, JHL, JZG, JAF, NHS), Department of Emergency Medicine (DD), Department of Pediatrics (NR), and Clinical Excellence Research Center (NHS), School of Medicine, Stanford University, Stanford, CA; Technology and Digital Services (NHS), Stanford Health Care, Palo Alto, CA.
Source of Funding: None.
Author Disclosures: The authors report no relationship or financial interest with any entity that would pose a conflict of interest with the subject matter of this article.
Authorship Information: Concept and design (RMY, DD, JHL, JZG, NR, NHS); acquisition of data (JAF); analysis and interpretation of data (RMY, DD, JHL, JZG, NR, NHS); drafting of the manuscript (RMY, DD, JZG, JAF); critical revision of the manuscript for important intellectual content (RMY, JHL, JZG, NR, NHS); statistical analysis (RMY); provision of patients or study materials (NHS); obtaining funding (NHS); administrative, technical, or logistic support (JAF, NHS); and supervision (NHS).
Address Correspondence to: Richard M. Yoo, PhD, MBI, Stanford University, 1265 Welch Rd, Stanford, CA 94305. Email: firstname.lastname@example.org.
1. MDCalc. Accessed April 16, 2022. https://www.mdcalc.com/
2. Widely used sepsis prediction tool is less effective than Michigan doctors thought. National Heart, Lung, and Blood Institute. June 29, 2021. Accessed April 16, 2022. https://www.nhlbi.nih.gov/news/2021/widely-used-sepsis-prediction-tool-less-effective-michigan-doctors-thought
3. Khetpal V, Shah N. How a largely untested AI algorithm crept into hundreds of hospitals. Fast Company. May 28, 2021. Accessed April 16, 2022. https://www.fastcompany.com/90641343/epic-deterioration-index-algorithm-pandemic-concerns
4. Wilkinson DY, King G. Conceptual and methodological issues in the use of race as a variable: policy implications. Milbank Q. 1987;65(suppl 1):56-71. doi:10.2307/3349951
5. Bailey ZD, Feldman JM, Bassett MT. How structural racism works — racist policies as a root cause of U.S. racial health inequities. N Engl J Med. 2021;384(8):768-773. doi:10.1056/NEJMms2025396
6. Williams DR, Lawrence JA, Davis BA. Racism and health: evidence and needed research. Annu Rev Public Health. 2019;40:105-125. doi:10.1146/annurev-publhealth-040218-043750
7. O’Reilly KB. AMA: racism is a threat to public health. American Medical Association. November 16, 2020. Accessed April 16, 2022. https://www.ama-assn.org/delivering-care/health-equity/ama-racism-threat-public-health
8. Eberly LA, Richterman A, Beckett AG, et al; Brigham and Women’s Internal Medicine Housestaff. Identification of racial inequities in access to specialized inpatient heart failure care at an academic medical center. Circ Heart Fail. 2019;12(11):e006214. doi:10.1161/CIRCHEARTFAILURE.119.006214
9. Vyas DA, Eisenstein LG, Jones DS. Hidden in plain sight — reconsidering the use of race correction in clinical algorithms. N Engl J Med. 2020;383(9):874-882. doi:10.1056/NEJMms2004740
10. Eneanya ND, Yang W, Reese PP. Reconsidering the consequences of using race to estimate kidney function. JAMA. 2019;322(2):113-114. doi:10.1001/jama.2019.5774
11. Kowalsky RH, Rondini AC, Platt SL. The case for removing race from the American Academy of Pediatrics clinical practice guideline for urinary tract infection in infants and young children with fever. JAMA Pediatr. 2020;174(3):229-230. doi:10.1001/jamapediatrics.2019.5242
12. Bancks MP, Kershaw K, Carson AP, Gordon-Larsen P, Schreiner PJ, Carnethon MR. Association of modifiable risk factors in young adulthood with racial disparity in incident type 2 diabetes during middle adulthood. JAMA. 2017;318(24):2457-2465. doi:10.1001/jama.2017.19546
13. Kamath PS, Wiesner RH, Malinchoc M, et al. A model to predict survival in patients with end-stage liver disease. Hepatology. 2001;33(2):464-470. doi:10.1053/jhep.2001.22172
14. Jiménez D, Aujesky D, Moores L, et al; RIETE Investigators. Simplification of the pulmonary embolism severity index for prognostication in patients with acute symptomatic pulmonary embolism. Arch Intern Med. 2010;170(15):1383-1389. doi:10.1001/archinternmed.2010.199
15. Aujesky D, Obrosky DS, Stone RA, et al. Derivation and validation of a prognostic model for pulmonary embolism. Am J Respir Crit Care Med. 2005;172(8):1041-1046. doi:10.1164/rccm.200506-862OC
16. Gage BF, Waterman AD, Shannon W, Boechler M, Rich MW, Radford MJ. Validation of clinical classification schemes for predicting stroke: results from the National Registry of Atrial Fibrillation. JAMA. 2001;285(22):2864-2870. doi:10.1001/jama.285.22.2864
17. Lip GYH, Nieuwlaat R, Pisters R, Lane DA, Crijns HJGM. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the Euro Heart Survey on Atrial Fibrillation. Chest. 2010;137(2):263-272. doi:10.1378/chest.09-1584
18. Pfohl SR, Foryciarz A, Shah NH. An empirical characterization of fair machine learning for clinical risk prediction. J Biomed Inform. 2021;113:103621. doi:10.1016/j.jbi.2020.103621
19. Liver policy: distribution. UNOS. Accessed March 19, 2022. https://unos.org/policy/liver/distribution/
20. Foryciarz A, Pfohl SR, Patel B, Shah NH. Evaluating algorithmic fairness in the presence of clinical guidelines: the case of atherosclerotic cardiovascular disease risk estimation. BMJ Health Care Inform. 2022;29(1):e100460. doi:10.1136/bmjhci-2021-100460
21. Aakre C, Dziadzko M, Keegan MT, Herasevich V. Automating clinical score calculation within the electronic health record: a feasibility assessment. Appl Clin Inform. 2017;8(2):369-380. doi:10.4338/ACI-2016-09-RA-0149
22. Dziadzko MA, Gajic O, Pickering BW, Herasevich V. Clinical calculators in hospital medicine: availability, classification, and needs. Comput Methods Programs Biomed. 2016;133:1-6. doi:10.1016/j.cmpb.2016.05.006
23. Green TA, Shyu CR. Developing a taxonomy of online medical calculators for assessing automatability and clinical efficiency improvements. Stud Health Technol Inform. 2019;264:601-605. doi:10.3233/SHTI190293
24. Chronic liver disease and cirrhosis. CDC. Updated September 6, 2022. Accessed December 15, 2022. https://www.cdc.gov/nchs/fastats/liver-disease.htm
25. Atrial fibrillation. CDC. Updated October 14, 2022. Accessed December 15, 2022. https://www.cdc.gov/heartdisease/atrial_fibrillation.htm
26. Steen DL. The revised ACC/AHA/HRS guidelines for the management of patients with atrial fibrillation. American College of Cardiology. October 29, 2014. Accessed March 19, 2022. https://www.acc.org/latest-in-cardiology/articles/2014/10/14/11/02/the-revised-acc-aha-hrs-guidelines-for-the-management-of-patients-with-atrial-fibrillation
27. Heidenreich PA, Estes NAM III, Fonarow GC, et al. 2020 update to the 2016 ACC/AHA Clinical Performance and Quality Measures for Adults With Atrial Fibrillation or Atrial Flutter: a report of the American College of Cardiology/American Heart Association Task Force on Performance Measures. Circ Cardiovasc Qual Outcomes. 2021;14(1):e000100. doi:10.1161/HCQ.0000000000000100
28. CHA2DS2-VASc score for atrial fibrillation stroke risk. MDCalc. Accessed April 30, 2022. https://www.mdcalc.com/cha2ds2-vasc-score-atrial-fibrillation-stroke-risk
29. Morrone D, Morrone V. Acute pulmonary embolism: focus on the clinical picture. Korean Circ J. 2018;48(5):365-381. doi:10.4070/kcj.2017.0314
30. Datta S, Posada J, Olson G, et al. A new paradigm for accelerating clinical data science at Stanford Medicine. arXiv. March 17, 2020. Accessed September 6, 2021. doi:10.48550/arXiv.2003.10534
31. Hanna DB, Pfeiffer MR, Sackoff JE, Selik RM, Begier EM, Torian LV. Comparing the National Death Index and the Social Security Administration’s Death Master File to ascertain death in HIV surveillance. Public Health Rep. 2009;124(6):850-860. doi:10.1177/003335490912400613
32. McKinney W. Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference. SciPy; 2010:56-61. doi:10.25080/Majora-92bf1922-00a
33. Virtanen P, Gommers R, Oliphant TE, et al; SciPy 1.0 Contributors. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261-272. doi:10.1038/s41592-019-0686-2
34. Davidson-Pilon C. lifelines: survival analysis in Python. J Open Source Softw. 2019;4(40):1317. doi:10.21105/joss.01317
35. Walter R, Holley A. Performance characteristics for the simplified Pulmonary Embolism Severity Index: a meta-analysis. Chest. 2012;142(4)(suppl):849A. doi:10.1378/chest.1390421
36. Manski CF. Patient-centered appraisal of race-free clinical risk assessment. Health Econ. Published online July 5, 2022. doi:10.1002/hec.4569
37. Inker LA, Eneanya ND, Coresh J, et al. New creatinine- and cystatin C–based equations to estimate GFR without race. N Engl J Med. 2021;385(19):1737-1749. doi:10.1056/NEJMoa2102953