Investigating Real-world Consequences of Biases in Commonly Used Clinical Calculators

Yoo, Richard; Dash, Dev; Lu, Jonathan; Genkins, Julian; Rabbani, Naveed; Fries, Jason; Shah, Nigam

January 26, 2023

The American Journal of Managed Care

January 2023

Volume 29

Issue 1

Author(s):

Richard M. Yoo, PhD, MBI; Dev Dash, MD, MPH

Clinical calculators that do not include demographic variables may be biased, and their equity should be understood in the context of clinical guidelines.

ABSTRACT

Objectives: To evaluate whether one summary metric of calculator performance sufficiently conveys equity across different demographic subgroups, as well as to evaluate how calculator predictive performance affects downstream health outcomes.

Study Design: We evaluate 3 commonly used clinical calculators—Model for End-Stage Liver Disease (MELD), CHA₂DS₂-VASc, and simplified Pulmonary Embolism Severity Index (sPESI)—on the cohort extracted from the Stanford Medicine Research Data Repository, following the cohort selection process as described in respective calculator derivation papers.

Methods: We quantified the predictive performance of the 3 clinical calculators across sex and race. Then, using the clinical guidelines that guide care based on these calculators’ output, we quantified potential disparities in subsequent health outcomes.

Results: Across the examined subgroups, the MELD calculator exhibited worse performance for female and White populations, CHA₂DS₂-VASc calculator for the male population, and sPESI for the Black population. The extent to which such performance differences translated into differential health outcomes depended on the distribution of the calculators’ scores around the thresholds used to trigger a care action via the corresponding guidelines. In particular, under the old guideline for CHA₂DS₂-VASc, among those who would not have been offered anticoagulant therapy, the Hispanic subgroup exhibited the highest rate of stroke.

Conclusions: Clinical calculators, even when they do not include variables such as sex and race as inputs, can have very different care consequences across those subgroups. These differences in health care outcomes across subgroups can be explained by examining the distribution of scores and their calibration around the thresholds encoded in the accompanying care guidelines.

Am J Manag Care. 2023;29(1):e1-e7. https://doi.org/10.37765/ajmc.2023.89306

_____

Takeaway Points

Our analysis of 3 commonly used clinical calculators—Model for End-Stage Liver Disease, CHA₂DS₂-VASc, and simplified Pulmonary Embolism Severity Index—based on data from a local hospital chain shows the following:

A single-number summary of calculator performance (eg, C statistic) does not adequately convey performance variability across demographic subgroups. This is true even for those calculators that do not take demographic variables such as sex and race as input.
How such varying performance translates into actual health outcomes for each subgroup, however, depends on the clinical guideline in which the calculators are being used. Observing how calculator output distributes and calibrates around guideline thresholds is key to understanding calculator impact on downstream health outcomes.
We recommend that health care organizations conduct routine stratification analysis of commonly used clinical calculators on their patient populations.

_____

As clinical calculators become ubiquitous in guiding medical decision-making,^1-3 health care professionals are paying closer attention to the biases present in these calculators. The effect of biases on health disparities has long been understood.⁴ However, the urgency of such examination has recently been magnified by greater visibility of structural racism,^5,6 prompting medical professionals to inspect how their practices may contribute to health inequities.^7,8 For example, there has been a call to action⁹ to investigate calculators that include race adjustments because they may create or perpetuate health care inequities in the affected subgroup, as race reflects a social construct rather than a biological reality.^10-12

However, there remain 2 barriers that hinder the examination of possible inequities in clinical calculator–based care. First, many popular clinical calculators,^13-16 regardless of whether they contain adjustments for specific demographic subgroups, have not been shown to perform comparably across subgroups. Even if a factor such as sex or race is not an input to a clinical calculator, the calculator’s output (ie, the score it provides) may still have different distributions across patient subgroups. Second, even when performance measures are reported across subgroups,¹⁷ the assessment is often focused on the predictive performance of the calculator’s output (eg, C statistic), without examining the consequences in terms of the downstream health interventions or outcomes.¹⁸ Take, for example, the case of the Model for End-Stage Liver Disease (MELD) calculator. The United Network for Organ Sharing (UNOS)¹⁹ assigns priority for liver transplants based on the MELD calculator output. If this calculator is not calibrated across different demographic groups, then allocation of liver transplants will be systematically different for certain subgroups.

We propose that a fairness assessment of a clinical calculator must include examining its performance by subgroups of concern, as well as observing its consequences in light of the corresponding care guideline.²⁰ To illustrate such an assessment, we selected 3 commonly used clinical calculators—MELD, which predicts end-stage liver disease (ESLD) mortality; CHA₂DS₂-VASc (congestive heart failure, hypertension, age ≥ 75 years, diabetes, stroke, vascular disease, age 65-74 years, sex category), which predicts stroke risk in patients with atrial fibrillation; and simplified Pulmonary Embolism Severity Index (sPESI), which predicts mortality outcomes in patients with pulmonary embolism (PE). We then quantified the predictive performance of each across demographic subgroups defined along 2 dimensions: sex and race. Two calculators do not take sex or race as input (MELD and sPESI), and 1 takes sex (CHA₂DS₂-VASc). We then identified clinical guidelines that rely on these calculators’ output and quantified the potential resulting disparities in health outcomes. In particular, we observed negative events that may result from making a decision based on the calculator’s output across sex and race. Specific negative events for selected calculators are death after being denied liver transplant based on MELD, stroke after not being recommended for anticoagulation based on CHA₂DS₂-VASc, and death following an early discharge guided by sPESI.

METHODS

Selected Calculators

A study team of physicians was formed to nominate popular clinical calculators that are routinely used. By consulting previous literature surveying the taxonomy of calculators^21-23 and a popular website that aggregates clinical tools used by practicing clinicians (MDCalc¹), the team selected 3 calculators: MELD, CHA₂DS₂-VASc, and sPESI. The detailed selection process can be found in the eAppendix (available at ajmc.com).

For each calculator, we then identified a clinical guideline in which the calculator is used to determine a care recommendation. These guidelines define risk groups based on score ranges and provide therapeutic recommendations for each group.

MELD

MELD¹³ predicts survivability of patients with ESLD over a 90-day period. Based on CDC statistics,²⁴ this calculator is potentially applicable to more than 4.5 million patients with ESLD in the United States, approximately 2% of the adult population.

The calculator was later adopted by UNOS to prioritize liver transplant and further modified to include serum sodium level as an input. For patients with liver cirrhosis who are aged at least 12 years, the MELD score is calculated as shown in eAppendix Listing 1.

UNOS drives a complex liver distribution policy¹⁹ based on the MELD output. Most importantly, patients with a MELD score of less than 15 receive the least priority, so local hospitals typically avoid listing such a patient as a transplant candidate.

CHA₂DS₂-VASc

The CHADS₂ (congestive heart failure, hypertension, age ≥ 75 years, diabetes, and stroke) score¹⁶ estimates the risk of stroke in patients with atrial fibrillation, the most common arrhythmia with an estimated prevalence of at least 5% of adults.²⁵ CHA₂DS₂-VASc¹⁷ improves on the CHADS₂ by adding a new age group and incorporating vascular disease history and sex category. For patients with atrial fibrillation, CHA₂DS₂-VASc score is calculated as shown in eAppendix Listing 2. Both CHADS₂ and CHA₂DS₂-VASc predict the likelihood of developing stroke over a 1000-day window.

CHA₂DS₂-VASc scores are often used to determine antithrombotic therapy for the patient. We identified 2 versions of a guideline that is being used. In the 2014 American College of Cardiology (ACC)/American Heart Association (AHA) atrial fibrillation treatment guideline,²⁶ antithrombotic therapy with oral anticoagulants is recommended only for individuals with a CHA₂DS₂-VASc score of 2 or greater.

In the most recent 2020 ACC/AHA guideline,²⁷ this threshold is increased by 1 for female sex (CHA₂DS₂-VASc score of 3 or greater), while remaining the same for male sex. This change nullifies the effect of including sex as an input to the CHA₂DS₂-VASc score, acknowledging that biological sex does not increase the risk of stroke as previously thought.

Despite the publication of the 2020 guideline, the 2014 guideline continues to be used actively.²⁸ Because of this, we quantified CHA₂DS₂-VASc negative event frequency using thresholds from both guidelines.
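The way the 2 guideline versions turn the same score into different recommendations can be illustrated with a minimal Python sketch. It assumes the commonly published CHA₂DS₂-VASc point assignments (2 points each for age of at least 75 years and prior stroke, 1 point for every other component); eAppendix Listing 2 is the authoritative definition for this study. The `AfibPatient` record, its field names, and the function names are hypothetical and are not OMOP-CDM terms.

```python
from dataclasses import dataclass


@dataclass
class AfibPatient:
    """Hypothetical patient record; field names are illustrative only."""
    age: int
    female: bool
    heart_failure: bool = False
    hypertension: bool = False
    diabetes: bool = False
    prior_stroke: bool = False
    vascular_disease: bool = False


def cha2ds2_vasc(p: AfibPatient) -> int:
    """Sum the commonly published CHA2DS2-VASc points (an assumption here)."""
    score = 0
    score += 1 if p.heart_failure else 0
    score += 1 if p.hypertension else 0
    # Age contributes 2 points at >= 75 years, 1 point at 65-74 years.
    score += 2 if p.age >= 75 else (1 if p.age >= 65 else 0)
    score += 1 if p.diabetes else 0
    score += 2 if p.prior_stroke else 0
    score += 1 if p.vascular_disease else 0
    score += 1 if p.female else 0
    return score


def anticoagulation_recommended(p: AfibPatient, guideline: str) -> bool:
    """Apply the treatment thresholds described in the text."""
    score = cha2ds2_vasc(p)
    if guideline == "2014":
        return score >= 2
    if guideline == "2020":
        # The threshold is raised by 1 for female sex only.
        return score >= (3 if p.female else 2)
    raise ValueError(f"unknown guideline: {guideline}")
```

For a 70-year-old woman with no other risk factors, the score is 2, so this sketch recommends anticoagulation under the 2014 threshold but not under the 2020 threshold.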

sPESI

sPESI¹⁴ is a simplified version of the PESI¹⁵ that predicts 30-day survivability of patients with PE (eAppendix Listing 3). Nearly 400,000 Americans receive a diagnosis of PE yearly, and PE is the third most common cause of cardiovascular death after myocardial infarction and stroke.²⁹

The sPESI score classifies patients into 2 categories: patients with a score of 0 as low risk, and those with a score of 1 or more as high risk. Patients deemed low risk are considered for early discharge¹⁴ and management of their PE entirely in an outpatient setting.
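The two-category classification can be sketched in Python as follows. The 6 one-point items used here are the commonly published sPESI criteria and are an assumption relative to this study, for which eAppendix Listing 3 is authoritative. The function and parameter names are hypothetical.

```python
def spesi(age, cancer, chronic_cardiopulmonary_disease,
          heart_rate, systolic_bp, o2_saturation):
    """Count the sPESI items that are present (1 point each).

    The item definitions are the commonly published ones and are an
    assumption here, not a transcription of eAppendix Listing 3.
    """
    items = [
        age > 80,                         # years
        bool(cancer),                     # history of cancer
        bool(chronic_cardiopulmonary_disease),
        heart_rate >= 110,                # beats per minute
        systolic_bp < 100,                # mm Hg
        o2_saturation < 90,               # percent
    ]
    return sum(items)


def spesi_risk_class(score):
    """Score 0 is low risk; any score of 1 or more is high risk."""
    return "low" if score == 0 else "high"
```

Under this sketch, only patients for whom every item is absent fall into the low-risk group that is considered for early discharge.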

Data Analysis

We used the Stanford Medicine Research Data Repository (STARR)³⁰ for our data analysis, which is composed of records from Stanford Health Care and the Lucile Packard Children’s Hospital. We also linked STARR with the Social Security Administration’s Death Master File³¹ to ascertain out-of-hospital deaths. The eAppendix details the data set characteristics and the linkage process.

STARR data is provided in the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) format. Clinical calculators were implemented in Python to map lab measurements and diagnosis codes used in each calculator derivation paper^13,14,17 to corresponding standard vocabulary terms in OMOP-CDM. Calculators then operate with OMOP-CDM measurement values to derive scores for each respective cohort.

Cohorts were established by applying the same selection criteria described in each of the calculator derivation papers.^13,14,17 Because OMOP-CDM does not support selection of more than 1 race, our analysis is based on the first preferential race declared by the individual. Specifically, we used the 5 categories of race as defined by the US Census and cast Hispanic as a dedicated ethnicity. Those records that are missing sex or race information were dropped from the analysis. eAppendix Table 2 shows the cohort size for each calculator.

Scores are calculated using data available on the day of diagnosis. When a patient has multiple measurements, we compute the score using the first measurement because it is the most likely to drive downstream medical intervention. Analysis is performed using Pandas³² 1.3.0, SciPy³³ 1.7.0, and lifelines³⁴ 0.26.3 running on Python 3.9.6, configured through Conda 4.5.11.
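The first-measurement rule described above can be illustrated with a small Python sketch. The row layout `(patient_id, measured_at, value)` and the function name are hypothetical stand-ins and do not reflect the actual OMOP-CDM measurement schema or the study code.

```python
from datetime import datetime


def first_measurements(rows):
    """Return {patient_id: value}, keeping each patient's earliest value.

    rows: iterable of (patient_id, measured_at, value) tuples, where
    measured_at is any comparable timestamp (a hypothetical layout).
    """
    earliest = {}
    for patient_id, measured_at, value in rows:
        seen = earliest.get(patient_id)
        if seen is None or measured_at < seen[0]:
            earliest[patient_id] = (measured_at, value)
    return {pid: value for pid, (_, value) in earliest.items()}


rows = [
    ("p1", datetime(2020, 1, 2), 1.4),
    ("p1", datetime(2020, 1, 1), 1.1),  # earlier, so this one is kept
    ("p2", datetime(2020, 3, 5), 0.9),
]
first = first_measurements(rows)
```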

Statistical Analysis

Similar to previous studies,^13,14,17 we use the C statistic as the metric to quantify a calculator’s predictive capability. We first compute the C statistic for the entire cohort, then compute the same statistic for each subgroup defined by sex and by race. For each statistic, we compute the 95% CI using 1000 bootstrap resamples.
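A minimal sketch of this computation in pure Python is shown below. It makes 2 simplifying assumptions that may differ from the study code: the outcome is treated as binary (so the C statistic equals the area under the ROC curve, with no handling of censoring), and the interval is a percentile bootstrap. The function names are hypothetical.

```python
import random


def c_statistic(scores, events):
    """Share of (event, non-event) pairs in which the event case has the
    higher score; tied scores count as half. Returns None if undefined."""
    pos = [s for s, e in zip(scores, events) if e]
    neg = [s for s, e in zip(scores, events) if not e]
    if not pos or not neg:
        return None
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


def bootstrap_ci(scores, events, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the C statistic (an assumed CI method)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        c = c_statistic([scores[i] for i in idx], [events[i] for i in idx])
        if c is not None:  # skip resamples with only one outcome class
            stats.append(c)
    if not stats:
        return None
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi
```

Running the same 2 functions on the rows of a single subgroup gives the stratified estimate; small subgroups produce wide intervals, which is consistent with the inconclusive results reported for some populations.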

Once raw calculator performance is understood, we evaluate its impact in the clinical context by observing the negative event frequency. The population of interest is defined as individuals who, based on the calculator output, would have not been offered treatment (MELD and CHA₂DS₂-VASc) or would have been considered for early discharge (sPESI). Negative events in this population represent a negative health outcome for which the calculator predicted the patient to be at low risk—all-cause mortality (MELD and sPESI) or stroke (CHA₂DS₂-VASc). We apply the same 95% bootstrap CI.
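The negative event frequency defined above can be sketched as follows. The tuple layout `(subgroup, score, had_event)` and the function name are hypothetical. The sketch assumes a single cutoff below which no treatment is offered (for example, 15 for MELD, 2 for CHA₂DS₂-VASc under the 2014 guideline, and 1 for sPESI, where only a score of 0 is low risk); a sex-specific cutoff would need a per-patient threshold instead.

```python
def negative_event_frequency(patients, threshold):
    """Per-subgroup share of below-threshold patients who had the event.

    patients: iterable of (subgroup, score, had_event) tuples.
    Subgroups with no patient below the threshold are omitted, because
    the frequency is undefined for them.
    """
    below = {}
    events = {}
    for subgroup, score, had_event in patients:
        if score < threshold:
            below[subgroup] = below.get(subgroup, 0) + 1
            events[subgroup] = events.get(subgroup, 0) + (1 if had_event else 0)
    return {g: events[g] / below[g] for g in below}


cohort = [
    ("A", 1, True), ("A", 0, False), ("A", 3, True),
    ("B", 1, False), ("B", 1, False),
    ("C", 4, True),
]
freq = negative_event_frequency(cohort, threshold=2)
```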

When the C statistic or negative event frequency of one subgroup is compared with that of another, a 1-tailed Welch’s t test is used to compute P values.
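For reference, the Welch statistic and its approximate degrees of freedom can be computed with the Python standard library alone, as in the sketch below. It assumes that the 2 samples are the per-subgroup replicates of the statistic being compared, which is an assumption about the study's procedure; the function name is hypothetical. The 1-tailed P value then follows from the t distribution with the returned degrees of freedom, for example through SciPy's `scipy.stats.ttest_ind` with `equal_var=False` and the `alternative` argument.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t test of a versus b."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


t, df = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```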

RESULTS

MELD

eAppendix Table 3 shows the characteristics of the cohort, which we verified to significantly overlap with the MELD calculator derivation paper.¹³

Table 1¹⁹ then provides C statistics for MELD, which exhibited an overall C statistic of 0.81 (95% CI, 0.75-0.86), and is comparable with what was reported in the calculator derivation paper¹³ (0.78; 95% CI, 0.74-0.81). However, C statistics for each sex show that MELD exhibits higher concordance for male patients (0.82; 95% CI, 0.72-0.90; P < .001) compared with female patients (0.77; 95% CI, 0.70-0.84).

Across racial subgroups, the White population exhibits the worst C statistic (0.77; 95% CI, 0.66-0.87; P < .001). The sample size for the Black population is too small to be conclusive.

When the guidance provided by the UNOS policy (ie, MELD score < 15) is applied, the negative event frequency (Table 1¹⁹) follows the same pattern. Female patients exhibit a higher percentage of deaths (2.07%; 95% CI, 0.78%-3.64%; P < .001) among those who would have been denied listing for liver transplant compared with male patients (1.21%; 95% CI, 0.45%-2.08%). Across racial subgroups, the White population shows the highest negative event frequency, at 2.42% (95% CI, 1.20%-3.86%; P < .001).

Observing how the MELD score for each population is distributed around the cutoff threshold explains the observed negative event frequency. The left panel of Figure 1 shows that a significant portion of both the male and female populations exhibit MELD scores less than 15. Therefore, the worse C statistic in female patients translates into a worse negative event frequency than in male patients.

Likewise, the majority of the White population has MELD scores less than 15 (Figure 1, right panel). As a result, misclassification stemming from poor calculator performance results in significantly more negative outcomes relative to other racial subgroups.

CHA₂DS₂-VASc

eAppendix Table 4 shows the characteristics of our cohort. Compared with the CHA₂DS₂-VASc derivation paper,¹⁷ our cohort is older (77 years vs 66 years) and has higher prevalence of comorbidities (eg, 77% vs 24% with heart failure and 93% vs 17% with diabetes).

As shown in Table 2,^26,27 the overall C statistic of the CHA₂DS₂-VASc calculator on our cohort is marginally better (0.66; 95% CI, 0.66-0.67) than what was reported for the original development cohort (0.606; 95% CI, 0.513-0.699).¹⁷

Nevertheless, the C statistic for the CHA₂DS₂-VASc calculator varies by sex- and race-based subgroups. Stratification by patient sex reveals that CHA₂DS₂-VASc exhibits a higher C statistic for female patients (0.69; 95% CI, 0.68-0.69) than male patients (0.65; 95% CI, 0.64-0.65). The CIs do not overlap. Across racial subgroups, the CHA₂DS₂-VASc calculator exhibits the best C statistic for the Hispanic population (0.73; 95% CI, 0.70-0.76; P < .001).

Table 2^26,27 then compares the negative event frequency for stroke based on application of CHA₂DS₂-VASc calculator scores using 2 clinical guidelines. Under the new guideline, better predictive performance translates into better health outcomes, where both female (1.91%; 95% CI, 1.68%-2.16%; P < .001) and Hispanic (1.78%; 95% CI, 1.17%-2.47%; P < .001) populations exhibit the lowest negative event frequency. However, under the old guideline, the Hispanic population shows the highest negative event frequency (3.30%; 95% CI, 2.13%-4.65%; P < .001) despite showing the highest C statistic.

Such nonconcordance in event frequency stems from the poor calibration of the CHA₂DS₂-VASc score (Figure 2). In our cohort, the stroke risk for the Hispanic population at a CHA₂DS₂-VASc score of 1 turns out to be slightly higher than at other scores. Because a score of 1 is below the guideline’s threshold for treatment, this increased stroke risk in the Hispanic subgroup drives a higher negative event frequency.
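A calibration check of this kind amounts to tabulating the observed event rate for each subgroup at each score value and then inspecting the cells that sit just below the guideline threshold. A minimal Python sketch is given below; the tuple layout `(subgroup, score, had_event)`, the function name, and the toy data are hypothetical and are not study results.

```python
from collections import defaultdict


def observed_event_rate_by_score(patients):
    """Return {(subgroup, score): observed event rate}.

    patients: iterable of (subgroup, score, had_event) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # (subgroup, score) -> [events, total]
    for subgroup, score, had_event in patients:
        cell = counts[(subgroup, score)]
        cell[0] += 1 if had_event else 0
        cell[1] += 1
    return {key: ev / total for key, (ev, total) in counts.items()}


# Toy data for illustration only.
toy = [
    ("H", 1, True), ("H", 1, False),
    ("W", 1, False), ("W", 1, False),
    ("H", 2, True),
]
rates = observed_event_rate_by_score(toy)
```

Comparing the cells for a given score across subgroups shows whether the same score carries the same observed risk for each of them; a gap at a score just below the treatment threshold translates directly into a gap in negative event frequency.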

Negative event frequency for the female population under the old guideline could not be calculated, because no female patients in our cohort with a CHA₂DS₂-VASc score of 1 or less experienced stroke.

sPESI

eAppendix Table 5 presents the cohort characteristics. Compared with the original cohort used to derive the sPESI calculator,¹⁴ our cohort has a higher percentage of male patients (57% vs 40%) and patients with cancer history (52% vs 20%).

In this cohort, sPESI shows an overall C statistic of 0.58 (95% CI, 0.56-0.60) (Table 3¹⁴), which is considerably lower than what was reported in the original study¹⁴ (0.75; 95% CI, 0.69-0.80) but is in line with a meta-analysis that assessed sPESI predictive performance³⁵ (0.57; 95% CI, 0.52-0.61).

Observing per-group predictive performance, the sPESI C statistic does not vary significantly across sex. Examining the C statistic across racial subgroups, however, shows that sPESI exhibits varying performance for different groups, with worst performance in the Black population, where the C statistic is close to random (0.50; 95% CI, 0.40-0.61; P < .001). The Asian population exhibits the highest C statistic (0.62; 95% CI, 0.54-0.69; P < .001), followed by the White (0.58; 95% CI, 0.55-0.61) and Hispanic (0.58; 95% CI, 0.50-0.66) populations, which show similar C statistics.

When put in clinical context, however, such similarity in performance between the White and Hispanic populations does not translate to similar negative event frequency (Table 3¹⁴). Specifically, the White population shows the highest negative event frequency at 12.77% (95% CI, 7.52%-18.05%; P < .001), which means that, among the patients who identified as White, 12.77% of those who may have been considered for early discharge based on sPESI score (ie, sPESI = 0) ultimately died within 30 days of receiving a diagnosis of PE. The sample size for the Asian and Black populations with a sPESI score of 0 is too small to be conclusive (eAppendix Table 6^26,27).

eAppendix Figure 2 shows the sPESI calculator exhibiting poor calibration, with the death percentage for the White population being significantly higher than that for the Hispanic population at a sPESI score of 0.

DISCUSSION

The results show that reporting a single summary C statistic for the entire cohort is inadequate. For the 3 calculators that we examined, none of the original derivation papers^13,14,17 reported validation across sex and race subgroups, yet we demonstrate that each calculator’s output can be stratified by demographic variables.

Our results also show that calculator performance can have an unpredictable impact on outcomes when evaluated under the relevant clinical guidelines. Most striking is the comparison of the 2 guidelines for CHA₂DS₂-VASc, where the Hispanic subgroup exhibited the worst negative event frequency under the old guideline, despite the highest predictive performance of the calculator. Such disparity vanishes under the new guideline.

As guidelines often assign patients to a risk group based on thresholds, understanding the distribution of risk scores and the calibration around those thresholds is crucial to understanding the impact of calculator-guided decisions.²⁰ Such dependence of calculator fairness on guidelines also warrants that fairness assessments be redone whenever policies relying on a given calculator are introduced or revised.

Although we chose to focus only on undertreatment to test our hypothesis, such an analysis could similarly be conducted on overtreatment, and our conclusion is not contingent on testing both scenarios.

Finally, it is often assumed that removing demographic inputs such as sex and race from a calculator can ensure fairness, yet bias can still persist in a calculator even without these inputs. We have demonstrated this by uncovering demographic subgroup stratification in calculators both with (CHA₂DS₂-VASc) and without (MELD and sPESI) relevant demographic input. Explicitly adding sex or race as calculator input, however, could increase bias, as these categories often reflect a social construct.^9-11 Therefore, fairness analysis must be conducted for all clinical calculators and not just those that include such variables as input.

In summary, our results stress that when a treatment guideline translates a calculator’s score into risk groups, distribution of those scores—in particular, how fairly the score is calibrated at the grouping thresholds—affects the achieved health outcomes for each subgroup. We recommend that other institutions routinely conduct similar validation and stratification analysis of calculators on their patient populations and audit their institutional policies based on such analysis.

Limitations

Although we focused on stratification across sex and race due to their widespread use as calculator inputs, they can easily be confounded by other socioeconomic and demographic variables. Future studies could further isolate the impact by analyzing confounding variables in tandem.

Our work does not propose ways to revise calculators to ameliorate identified bias, because how to do so is an active field of research, and a generalizable strategy to accomplish this is unclear.³⁶ A promising line of effort was demonstrated with the development of new Chronic Kidney Disease Epidemiology Collaboration estimated glomerular filtration rate equations that do not take race as an input,³⁷ where fairness could be achieved by incorporating parameters (eg, cystatin C) that are specific to the biological pathway at hand.

We acknowledge that the STARR data set is confined to only 2 hospitals in a small health care system in a single geographic region. It is recommended that other institutions conduct similar validation and stratification analysis on data representative of their patient populations. Also, use of only first preferential race may not accurately reflect multiracial patients. We leave the use of multirace information to future work.

Lastly, we did not attempt to quantify the frequency with which these calculators are used in practice. If they are not used in every relevant clinical situation, the population-level effect of the bias uncovered may be reduced.

CONCLUSIONS

Clinical calculators are used to guide medical decision-making in all specialties. Despite high interest in inspecting systematic biases in these calculators, there exist structural limitations in assessing fairness. Our results show that calculators—even those that do not include demographic variables such as sex and race as inputs—can have very different C statistic performance and different score distributions across sex and race. This demonstrates that reporting a single summary performance metric fails to adequately reveal biased performance over individual subgroups.

Given that clinical calculators are applied via clinical guidelines, and such guidelines often assign patients to risk groups and allocate treatments based on calculator score thresholds, it is essential to observe how the calculator’s output is distributed and calibrated around these decision thresholds in order to understand the health equity consequences of using a calculator in real-world practice. We encourage institutions to routinely conduct such validation and stratification analysis on commonly used clinical calculators for their patient populations and to audit their local application of clinical guidelines accordingly.

Author Affiliations: Department of Medicine (RMY, JHL, JZG, JAF, NHS), Department of Emergency Medicine (DD), Department of Pediatrics (NR), and Clinical Excellence Research Center (NHS), School of Medicine, Stanford University, Stanford, CA; Technology and Digital Services (NHS), Stanford Health Care, Palo Alto, CA.

Source of Funding: None.

Author Disclosures: The authors report no relationship or financial interest with any entity that would pose a conflict of interest with the subject matter of this article.

Authorship Information: Concept and design (RMY, DD, JHL, JZG, NR, NHS); acquisition of data (JAF); analysis and interpretation of data (RMY, DD, JHL, JZG, NR, NHS); drafting of the manuscript (RMY, DD, JZG, JAF); critical revision of the manuscript for important intellectual content (RMY, JHL, JZG, NR, NHS); statistical analysis (RMY); provision of patients or study materials (NHS); obtaining funding (NHS); administrative, technical, or logistic support (JAF, NHS); and supervision (NHS).

Address Correspondence to: Richard M. Yoo, PhD, MBI, Stanford University, 1265 Welch Rd, Stanford, CA 94305. Email: rmyoo@stanford.edu.

REFERENCES

1. MDCalc. Accessed April 16, 2022. https://www.mdcalc.com/

2. Widely used sepsis prediction tool is less effective than Michigan doctors thought. National Heart, Lung, and Blood Institute. June 29, 2021. Accessed April 16, 2022. https://www.nhlbi.nih.gov/news/2021/widely-used-sepsis-prediction-tool-less-effective-michigan-doctors-thought

3. Khetpal V, Shah N. How a largely untested AI algorithm crept into hundreds of hospitals. Fast Company. May 28, 2021. Accessed April 16, 2022. https://www.fastcompany.com/90641343/epic-deterioration-index-algorithm-pandemic-concerns

4. Wilkinson DY, King G. Conceptual and methodological issues in the use of race as a variable: policy implications. Milbank Q. 1987;65(suppl 1):56-71. doi:10.2307/3349951

5. Bailey ZD, Feldman JM, Bassett MT. How structural racism works — racist policies as a root cause of U.S. racial health inequities. N Engl J Med. 2021;384(8):768-773. doi:10.1056/NEJMms2025396

6. Williams DR, Lawrence JA, Davis BA. Racism and health: evidence and needed research. Annu Rev Public Health. 2019;40:105-125. doi:10.1146/annurev-publhealth-040218-043750

7. O’Reilly KB. AMA: racism is a threat to public health. American Medical Association. November 16, 2020. Accessed April 16, 2022. https://www.ama-assn.org/delivering-care/health-equity/ama-racism-threat-public-health

8. Eberly LA, Richterman A, Beckett AG, et al; Brigham and Women’s Internal Medicine Housestaff. Identification of racial inequities in access to specialized inpatient heart failure care at an academic medical center. Circ Heart Fail. 2019;12(11):e006214. doi:10.1161/CIRCHEARTFAILURE.119.006214

9. Vyas DA, Eisenstein LG, Jones DS. Hidden in plain sight — reconsidering the use of race correction in clinical algorithms. N Engl J Med. 2020;383(9):874-882. doi:10.1056/NEJMms2004740

10. Eneanya ND, Yang W, Reese PP. Reconsidering the consequences of using race to estimate kidney function. JAMA. 2019;322(2):113-114. doi:10.1001/jama.2019.5774

11. Kowalsky RH, Rondini AC, Platt SL. The case for removing race from the American Academy of Pediatrics clinical practice guideline for urinary tract infection in infants and young children with fever. JAMA Pediatr. 2020;174(3):229-230. doi:10.1001/jamapediatrics.2019.5242

12. Bancks MP, Kershaw K, Carson AP, Gordon-Larsen P, Schreiner PJ, Carnethon MR. Association of modifiable risk factors in young adulthood with racial disparity in incident type 2 diabetes during middle adulthood. JAMA. 2017;318(24):2457-2465. doi:10.1001/jama.2017.19546

13. Kamath PS, Wiesner RH, Malinchoc M, et al. A model to predict survival in patients with end-stage liver disease. Hepatology. 2001;33(2):464-470. doi:10.1053/jhep.2001.22172

14. Jiménez D, Aujesky D, Moores L, et al; RIETE Investigators. Simplification of the pulmonary embolism severity index for prognostication in patients with acute symptomatic pulmonary embolism. Arch Intern Med. 2010;170(15):1383-1389. doi:10.1001/archinternmed.2010.199

15. Aujesky D, Obrosky DS, Stone RA, et al. Derivation and validation of a prognostic model for pulmonary embolism. Am J Respir Crit Care Med. 2005;172(8):1041-1046. doi:10.1164/rccm.200506-862OC

16. Gage BF, Waterman AD, Shannon W, Boechler M, Rich MW, Radford MJ. Validation of clinical classification schemes for predicting stroke: results from the National Registry of Atrial Fibrillation. JAMA. 2001;285(22):2864-2870. doi:10.1001/jama.285.22.2864

17. Lip GYH, Nieuwlaat R, Pisters R, Lane DA, Crijns HJGM. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the Euro Heart Survey on Atrial Fibrillation. Chest. 2010;137(2):263-272. doi:10.1378/chest.09-1584

18. Pfohl SR, Foryciarz A, Shah NH. An empirical characterization of fair machine learning for clinical risk prediction. J Biomed Inform. 2021;113:103621. doi:10.1016/j.jbi.2020.103621

19. Liver policy: distribution. UNOS. Accessed March 19, 2022. https://unos.org/policy/liver/distribution/

20. Foryciarz A, Pfohl SR, Patel B, Shah NH. Evaluating algorithmic fairness in the presence of clinical guidelines: the case of atherosclerotic cardiovascular disease risk estimation. BMJ Health Care Inform. 2022;29(1):e100460. doi:10.1136/bmjhci-2021-100460

21. Aakre C, Dziadzko M, Keegan MT, Herasevich V. Automating clinical score calculation within the electronic health record: a feasibility assessment. Appl Clin Inform. 2017;8(2):369-380. doi:10.4338/ACI-2016-09-RA-0149

22. Dziadzko MA, Gajic O, Pickering BW, Herasevich V. Clinical calculators in hospital medicine: availability, classification, and needs. Comput Methods Programs Biomed. 2016;133:1-6. doi:10.1016/j.cmpb.2016.05.006

23. Green TA, Shyu CR. Developing a taxonomy of online medical calculators for assessing automatability and clinical efficiency improvements. Stud Health Technol Inform. 2019;264:601-605. doi:10.3233/SHTI190293

24. Chronic liver disease and cirrhosis. CDC. Updated September 6, 2022. Accessed December 15, 2022. https://www.cdc.gov/nchs/fastats/liver-disease.htm

25. Atrial fibrillation. CDC. Updated October 14, 2022. Accessed December 15, 2022. https://www.cdc.gov/heartdisease/atrial_fibrillation.htm

26. Steen DL. The revised ACC/AHA/HRS guidelines for the management of patients with atrial fibrillation. American College of Cardiology. October 29, 2014. Accessed March 19, 2022. https://www.acc.org/latest-in-cardiology/articles/2014/10/14/11/02/the-revised-acc-aha-hrs-guidelines-for-the-management-of-patients-with-atrial-fibrillation

27. Heidenreich PA, Estes NAM III, Fonarow GC, et al. 2020 update to the 2016 ACC/AHA Clinical Performance and Quality Measures for Adults With Atrial Fibrillation or Atrial Flutter: a report of the American College of Cardiology/American Heart Association Task Force on Performance Measures. Circ Cardiovasc Qual Outcomes. 2021;14(1):e000100. doi:10.1161/HCQ.0000000000000100

28. CHA2DS2-VASc score for atrial fibrillation stroke risk. MDCalc. Accessed April 30, 2022. https://www.mdcalc.com/cha2ds2-vasc-score-atrial-fibrillation-stroke-risk

29. Morrone D, Morrone V. Acute pulmonary embolism: focus on the clinical picture. Korean Circ J. 2018;48(5):365-381. doi:10.4070/kcj.2017.0314

30. Datta S, Posada J, Olson G, et al. A new paradigm for accelerating clinical data science at Stanford Medicine. arXiv. March 17, 2020. Accessed September 6, 2021. doi:10.48550/arXiv.2003.10534

31. Hanna DB, Pfeiffer MR, Sackoff JE, Selik RM, Begier EM, Torian LV. Comparing the National Death Index and the Social Security Administration’s Death Master File to ascertain death in HIV surveillance. Public Health Rep. 2009;124(6):850-860. doi:10.1177/003335490912400613

32. McKinney W. Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference. SciPy; 2010:56-61. doi:10.25080/Majora-92bf1922-00a

33. Virtanen P, Gommers R, Oliphant TE, et al; SciPy 1.0 Contributors. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261-272. doi:10.1038/s41592-019-0686-2

34. Davidson-Pilon C. lifelines: survival analysis in Python. J Open Source Softw. 2019;4(40):1317. doi:10.21105/joss.01317

35. Walter R, Holley A. Performance characteristics for the simplified Pulmonary Embolism Severity Index: a meta-analysis. Chest. 2012;142(4)(suppl):849A. doi:10.1378/chest.1390421

36. Manski CF. Patient-centered appraisal of race-free clinical risk assessment. Health Econ. Published online July 5, 2022. doi:10.1002/hec.4569

37. Inker LA, Eneanya ND, Coresh J, et al. New creatinine- and cystatin C–based equations to estimate GFR without race. N Engl J Med. 2021;385(19):1737-1749. doi:10.1056/NEJMoa2102953