Kyle Morawski, MD, MPH; Yoni Dvorkis, MPH; and Craig B. Monsen, MD, MS
The healthcare system generates, collects, and stores a tremendous amount of data during the course of a patient’s clinical encounter, with one study finding an average of more than 200,000 individual data points available during a single hospital stay.1,2
These data are used to monitor a patient’s progress, coordinate care among all members of the healthcare team, and provide documentation for billing and reporting activities. Although the use of data for these purposes has been long-standing, the availability of these data has increased substantially. The Health Information Technology for Economic and Clinical Health Act of 2009 was passed in part to assist healthcare professionals’ transition to electronic health records (EHRs). A decade later, systematically collected data generated in the course of clinical care have created an opportunity to use such data to improve care practices.3,4
Retail entities have put forth strategic investments in data science, often with substantial return.5,6
Accordingly, using data stored in EHRs to improve the lives of patients and lower total medical costs is one approach to transforming care. Big data, machine learning, and predictive analytics are some of the ways that clinicians hope to anticipate patients’ needs and improve outcomes, as evidenced by the many organizations working in this field.7
However, this is an evolving field, and the accuracy and actionability of its predictions continue to improve. We need more precise prediction models and better integration of data into clinical care4,8-10 to focus care resources and, in doing so, provide higher value.11
The healthcare costs13 associated with hospital admissions underscore the need for hospitalization prevention activities, including patient outreach, review of recent discharges, and case management. Unfortunately, acute hospital care needs remain difficult to predict.9
A recent review evaluating the accuracy of EHR-based prediction modeling showed that hospitalization and service utilization were more difficult to predict than mortality or disease-specific outcomes. Whereas mortality and clinical prediction models demonstrated C statistics above 0.8, the discrimination of models built to predict hospitalization and service utilization was lower, at 0.71.8
Several approaches to improve hospitalization prediction exist, such as using new data sources, new variable types, more complete data, more timely data, or more advanced statistical methods. Data sets capable of linking EHR and claims data at the patient level remain uncommon. We hypothesized that when combined, these 2 data sources would complement each other and lead to stronger prediction than that observed previously. We set out to develop and test a model that uses EHR and claims data to predict patient hospitalizations in such a way that it can be implemented in an outpatient practice setting.
We performed a retrospective analysis of data generated in the course of clinical care and healthcare operations to develop a logistic regression model predicting a patient’s future risk of hospitalization. Data were extracted from Atrius Health’s unified data warehouse, which marries clinical data from Atrius Health’s EHR (Epic version 2015; Epic Systems; Verona, Wisconsin) to normalized administrative claims data received from Medicare, Medicaid, and commercial payers. Variables were ascertained at the patient-month level. To reflect seasonality in hospitalization outcomes, 4 dates of prediction—referred to as index dates—were selected throughout the study period: September 1, 2014; December 1, 2014; March 1, 2015; and June 1, 2015. Sensitivity testing was performed to determine how the inclusion of certain variable categories or data sources (ie, EHR vs claims) would influence model performance. The analysis was performed as part of a quality improvement effort at Atrius Health and did not undergo institutional review board review.
The study population was selected among patients seen from June 2013 to November 2015 at Atrius Health, a large multispecialty group in eastern Massachusetts. The population included patients insured under Medicare, Medicaid, and commercial contracts. Patients younger than 18 years were excluded from analysis because adult primary care was the focus of this effort.
We selected a binary outcome variable indicating if a patient had experienced any medical/surgical admission within 6 months of the index date of prediction. We chose to predict hospitalizations within 6 months to best match the prediction interval with the timeline of likely future downstream interventions. For example, to assist in the care of a complex patient, a relationship with a case manager is often established. This potential intervention requires a period of time to plausibly affect risk of hospitalization. Longer prediction intervals would potentially dilute the impact of future interventions or else necessitate interventions spanning very long time horizons. We excluded obstetrical admissions because these would not be targets for anticipated interventions.
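As a concrete sketch, the outcome label described above can be computed per patient-month as follows. This is a Python illustration under assumed data shapes (the study's analysis was performed in R), with the 6-month window approximated as 182 days:

```python
from datetime import date, timedelta

def hospitalized_within_6_months(index_date, admissions):
    """Return 1 if any non-obstetric medical/surgical admission falls
    in the 6-month window starting on the index date, else 0.
    `admissions` is a list of (admit_date, is_obstetric) tuples."""
    window_end = index_date + timedelta(days=182)  # ~6 months
    return int(any(
        index_date <= admit < window_end and not is_obstetric
        for admit, is_obstetric in admissions
    ))
```

Obstetric admissions are excluded at labeling time, mirroring the study's rule that such admissions would not be targets for anticipated interventions.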
The initial set of features included 651 variables defined among sociodemographics, diagnoses, medications, and prior utilization of both inpatient and outpatient services.
Sociodemographic variables, such as age, insurance type, body mass index, and smoking status, were for the most part obtained from the EHR. In the case of claims sensitivity testing, age and insurance status were obtained from payer roster files. Missing data were considered as a separate class within each categorical variable.
We aggregated diagnoses among EHR encounter- and claims-level International Classification of Diseases, Ninth Revision (ICD-9) and Tenth Revision (ICD-10) codes and mapped them to a smaller set of features by grouping them into 1 of 87 HHS–Hierarchical Condition Categories (HHS-HCC) diagnosis groups.14
A patient needed just 1 instance of an ICD-9 code within an HHS-HCC group at any point during the retrospective period to ascertain that categorical variable as positive. Missing data were interpreted as the patient not having the clinical condition.
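This one-hit ascertainment rule can be sketched as follows. Python is used for illustration (the study's analysis was in R), and the two ICD codes and group name shown are hypothetical stand-ins for the real 87-group HHS-HCC mapping:

```python
# Hypothetical ICD-to-HCC lookup; the real mapping covers 87 HHS-HCC groups.
ICD_TO_HCC = {
    "282.60": "HCC_sickle_cell",  # illustrative ICD-9 code
    "D57.1": "HCC_sickle_cell",   # illustrative ICD-10 code
}

def hcc_indicators(patient_codes, mapping=ICD_TO_HCC):
    """Map a patient's raw ICD codes to binary HCC-group indicators.
    A single occurrence of any code in a group sets that group to 1;
    absence of evidence is treated as absence of the condition."""
    flags = {group: 0 for group in set(mapping.values())}
    for code in patient_codes:
        group = mapping.get(code)
        if group is not None:
            flags[group] = 1
    return flags
```

The same one-hit logic applies to the medication-class features described below, with therapeutic class codes in place of HCC groups.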
Uses of medications were similarly aggregated by National Drug Code across EHR data and pharmacy claims using commercially available therapeutic class codes (First Databank, Inc; South San Francisco, California). As with diagnoses, just 1 occurrence of an order or a prescription for a medication belonging to a given class was needed to ascertain that categorical variable as positive. Missing data were interpreted as the patient not having used the medication class.
Utilization variables included indicator variables of prior admissions, emergency department (ED) visits, and outpatient visits. These variables were further categorized based on the timing of the occurrence relative to the index date. For example, hospitalization utilization variables included those indicating if the patient had been hospitalized in the past 1 month, hospitalized in the past 1 to 3 months, hospitalized in the past 3 to 6 months, and hospitalized in the past 6 to 12 months.
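A minimal sketch of this recency binning, assuming months are approximated as 30-day intervals and prior events are supplied as dates (Python for illustration; the study's analysis was in R):

```python
from datetime import date

def utilization_bins(index_date, event_dates):
    """Indicator variables for prior utilization events (admissions,
    ED visits, or outpatient visits) binned by recency relative to
    the index date, with months approximated as 30-day intervals."""
    bins = {"past_0_1m": 0, "past_1_3m": 0, "past_3_6m": 0, "past_6_12m": 0}
    for event in event_dates:
        days_ago = (index_date - event).days
        if 0 < days_ago <= 30:
            bins["past_0_1m"] = 1
        elif 30 < days_ago <= 90:
            bins["past_1_3m"] = 1
        elif 90 < days_ago <= 180:
            bins["past_3_6m"] = 1
        elif 180 < days_ago <= 365:
            bins["past_6_12m"] = 1
    return bins
```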
Any variables that did not occur in more than 30 patient-months in the data set were removed prior to model training to provide stable coefficients for the logistic regression model. For example, if there were just 10 patient-months in the sample during which any patient was on a medication represented as a binary variable, this variable was dropped from the model.
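The rare-feature filter amounts to a simple threshold on positive patient-month counts; a variable survives only if it occurs in more than 30 patient-months. Sketched here with hypothetical feature names:

```python
def drop_rare_features(feature_counts, min_patient_months=30):
    """Keep only binary features observed in more than `min_patient_months`
    patient-months, so the logistic regression coefficients are stable.
    `feature_counts` maps feature name -> number of positive patient-months."""
    return {name for name, count in feature_counts.items()
            if count > min_patient_months}
```

Note that the threshold is strict: a feature seen in exactly 30 patient-months is still dropped.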
Although EHR data are readily available within 24 hours of an index date, claims data are often received at a 3-month delay called claims lag. To simulate this claims lag, we ascertained historical variables during a 12-month period starting 15 months prior to the index date and ending 3 months prior to the index date. This avoids advantaging models with data that would not normally be available. Data from the EHR, which do not experience this lag, were obtained during a partially overlapping 12-month period starting 12 months prior to the index date and ending the day before the index date. This is illustrated in Figure 1.
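The two lookback windows can be derived from an index date with simple calendar arithmetic. This is a Python sketch (the study's analysis was in R); all index dates in the study fall on the first of a month, so no day clamping is needed:

```python
from datetime import date, timedelta

def months_before(d, n):
    """Shift a date back n calendar months."""
    years, month_index = divmod((d.year * 12 + d.month - 1) - n, 12)
    return date(years, month_index + 1, d.day)

def ascertainment_windows(index_date):
    """Lookback windows used to build historical features.
    Claims are lagged 3 months to simulate real-world availability;
    EHR data run to the day before the index date."""
    return {
        "claims": (months_before(index_date, 15), months_before(index_date, 3)),
        "ehr": (months_before(index_date, 12), index_date - timedelta(days=1)),
    }
```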
We randomly selected 80% of the data to serve as the training set, reserving the remaining 20% of the data as a testing set. We then regressed our selected variables onto our hospitalization outcome using a logistic model with the canonical link. Variables were included in the final model if their odds ratio (OR) was greater than or equal to 1 (see eAppendix [available at ajmc.com]). This decision was made to be consistent with our organization’s goal to identify predictors of increased risk of hospitalization and to aid model interpretability, as clinicians would be appropriately skeptical of a disease state conferring a protective effect. Previous unpublished work informed our approach here, as machine learning algorithms such as random forests, support vector machines, and neural networks did not consistently improve model performance and were less interpretable than the logistic regression approach. This has since been corroborated in recent literature for general outcomes such as mortality and disease-specific outcomes such as HIV incidence.8,15,16
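A simplified sketch of this fit-then-filter step, using scikit-learn in Python rather than the R implementation the authors used; the feature matrix, feature names, and split seed are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_and_keep_risk_factors(X, y, feature_names, seed=0):
    """Fit a logistic model on an 80/20 train/test split, then keep
    only features whose odds ratio (exp of the coefficient) is >= 1,
    i.e., features associated with increased hospitalization risk."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    odds_ratios = np.exp(model.coef_[0])
    kept = [name for name, odds in zip(feature_names, odds_ratios)
            if odds >= 1]
    return model, kept, (X_test, y_test)
```

In practice the model would then be refit on the retained features; the sketch stops at the filtering step the text describes.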
All analysis was performed in R version 3.2.1 (R Foundation; Vienna, Austria).
We measured performance on the training and testing data sets using area under the receiver operating characteristic curve (AUC) and model calibration.17
We calculated 95% CIs around the AUC using the DeLong method (R pROC package, version 1.10.0). For model calibration, we plotted calibration curves and calculated the Hosmer-Lemeshow statistic (R ResourceSelection package, version 0.3-2).
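These discrimination and calibration checks can be sketched as follows. The Python functions are stand-ins for the R pROC and ResourceSelection routines: the DeLong CI computation is omitted, and the Hosmer-Lemeshow statistic is a plain-Python approximation using equal-size risk bins:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hosmer_lemeshow(y_true, y_prob, n_bins=10):
    """Hosmer-Lemeshow chi-square statistic over equal-size risk bins.
    Small values (relative to a chi-square with n_bins - 2 df)
    indicate good agreement between predicted and observed risk."""
    order = np.argsort(y_prob)
    chi2 = 0.0
    for bin_idx in np.array_split(order, n_bins):
        observed = y_true[bin_idx].sum()
        expected = y_prob[bin_idx].sum()
        n = len(bin_idx)
        p = expected / n
        chi2 += (observed - expected) ** 2 / (n * p * (1 - p) + 1e-12)
    return chi2
```

Discrimination is then `roc_auc_score(y_true, y_prob)`; a well-calibrated model yields a Hosmer-Lemeshow statistic near its degrees of freedom.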
Other Statistical Tests
For continuous variables, we report means and SDs. For noncontinuous variables, we report counts and percentages. For normally distributed data, we applied the t test. For nonnormally distributed data, we applied the Wilcoxon test. For comparisons between categorical variables, we used the Fisher test.
Although the canonical model included EHR and claims data, we sought to identify which category of variables most contributed to model performance. We trained 15 models testing 2 dimensions of model characteristics.
The first dimension compared models developed from different data sources: EHR data only, claims data only, or both. The EHR data–only models used information drawn from the EHR (eg, medication use categories were ascertained as positive if the patient had a medication order placed by a provider). The claims data–only models used information drawn from claims (eg, medication use categories were ascertained as positive if a patient had a medication dispense claim in the administrative data). In the models using both data sources, a categorical feature was ascertained to be positive if there was evidence from either the EHR or claims data.
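The both-sources ascertainment rule is a logical OR across the two indicator sets, sketched here with hypothetical medication-class features:

```python
def combine_sources(ehr_flags, claims_flags):
    """Merge per-patient binary indicators from the EHR and from claims.
    A feature is positive if either source shows evidence for it."""
    features = set(ehr_flags) | set(claims_flags)
    return {f: int(ehr_flags.get(f, 0) or claims_flags.get(f, 0))
            for f in features}
```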
The second dimension considered was variable types. Separate models were trained to include demographic variables only, diagnoses only, medications only, prior utilization only, or all variables combined. Model performance was assessed for training and testing sets using the C statistic.
After exclusions, 363,855 patient-months were included for analysis, corresponding to 185,388 unique patients. Selected patient characteristics ascertained by combining EHR and claims data are summarized in Table 1. In aggregate, 5% of the study population had been hospitalized within 6 months of an index date.
After excluding variables with low counts or protective factors, 169 variables were included in the final model. Diagnoses, demographics, and prior utilization were well represented among the top predictors (Figure 2). The features with the highest ORs for predicting future hospitalization were sickle cell anemia (OR, 52.72), lipidoses and glycogenosis (OR, 8.44), heart transplant (OR, 6.12), and age 76 years or older (OR, 5.32). A full list of final features is included in the eAppendix.
Model discrimination varied widely, depending primarily on included variables. The predictive model using only prescription medications performed least well, with an AUC of 0.602. The model including all variable types, claims data, and EHR data performed best on the testing set, with an AUC of 0.846. There were no statistical differences in performance on the testing set among the 3 models including all variable types based on claims data alone (AUC, 0.840; 95% CI, 0.832-0.848), EHR data alone (AUC, 0.840; 95% CI, 0.831-0.848), or the claims and EHR data combined (AUC, 0.846; 95% CI, 0.838-0.853). Table 2 illustrates these results in more detail.
The best-performing model, which included all variable types from claims and EHR data combined, appeared to be well calibrated (Figure 3). Predicted probability of hospitalization at 6 months corresponded closely to the observed proportion of hospitalized patients when sorted into 10 bins of equal size (~7300 patients per bin). Further, the slope of the calibration was 0.96 (95% CI, 0.94-0.98) compared with a perfectly calibrated slope of 1.0. The model overestimated 6-month hospitalizations among those with the highest predicted risk.
Using a combination of EHR and claims data describing patients’ demographics, healthcare utilization behavior, medical diagnoses, and medications, we were able to develop a risk score that accurately predicted hospitalization in the ensuing 6 months. Although our results suggest some utility to combining EHR and claims data to inform predictive model creation, we find that even in scenarios in which only EHR or claims data are available, strong performance can be achieved provided that a diverse collection of variable types is represented. A variety of highly predictive characteristics were derived from all major domains evaluated. Consistent with traditional methods, age group was one of the strongest predictors, with older groups at higher risk. Prior healthcare utilization was also a strong predictor and likely covaries with many other factors in the model. Although this collinearity increases the variance of individual coefficient estimates, it may allow unmeasured factors, such as health literacy and individuals’ choices of where to seek care, to influence the prediction.18
Particular medical diagnoses also were found to be predictive, likely indicating frailty and rapid decline in health status that cannot be adequately managed in the outpatient setting. For example, those with end-stage organ damage (renal or hepatic) have little functional reserve, necessitating precision with both health behaviors and medication adjustments. They are prone to fluid or electrolyte imbalances that require inpatient monitoring and correction.
The risk prediction score was also found to be well calibrated in those less likely to be hospitalized in the next 6 months, but it did become less accurate among those at higher risk of hospitalization. The model tended to overestimate the likelihood of hospitalization in those with higher than 30% predicted risk, likely owing to the small number of patients demonstrating such high risk.
Comparison With Prior Work
Although many risk scores have been created for individual disease entities19 or certain groups of people,20-24 ours is agnostic of clinical condition or demographic. Past efforts in predicting hospitalization have been limited in addressable ways.25,26 Whereas other models are updated infrequently, as in the case of the QAdmissions model from the British National Health Service that is updated quarterly,27 the present model may be updated weekly to provide more timely information across a range of clinical applications. Another model uses a clinician’s assessment to ask whether a patient is likely to be seen in the emergency ward,25 whereas ours uses a multimodal, data-derived approach to create the risk prediction. Additionally, our model’s C statistic of 0.846 compares favorably with those of previous models (0.67-0.77), which we attribute to its incorporation of a wide array of variables (demographics, clinical diagnoses, medications, and prior utilization). We believe that our model adds to the current literature by providing an example of EHR and claims data utilization that can routinely and in real time provide risk prediction for hospitalization among patients seen in a primary care setting.
Our investigation has limitations. First, the retrospective analysis was performed using data from a single health system without an external center to validate our results. Although this threatens the generalizability of the model results, we believe the approach is one that can be reproduced at other centers to derive a more tailored model that reflects local patients, patient features, and care practices, all of which may also influence the risk of hospitalization. For instance, ED visits may occur with different frequencies and in different clinical scenarios in other parts of the country due to the geographic distribution of care providers. Other regions may have differing access to outpatient care, which may result in lower-acuity situations escalating to inpatient care. It is worth noting that we used data representing a large, diverse patient population, which offers some stability to the model coefficients and results. That said, we would expect that a given health system could apply these methods to calibrate the model for its own patients and system of care.
After creating our model, we used an internal validation strategy, testing its predictive ability on 20% of the data that were withheld during model creation. Other methods of validation include bootstrapping28 and external validation.29 We felt that the training/testing set approach was a sufficiently accurate and interpretable method for measuring discrimination, and we observe that it is commonly used in the literature.30 Because these efforts were performed to improve the quality of care in a single health system, future research would be helpful to validate our approach on an external population.31
The extent to which our predictive model can better target particular interventions and improve care remains to be proven. First, the strongest covariates in the model, such as clinical diagnoses, are nonmodifiable: a patient with sickle cell anemia or a heart transplant cannot change those factors. Second, for factors that are modifiable, such as medication use, the derived coefficients are correlative, not causative. One must be careful not to interpret a patient's use of a medication associated with hospitalization to mean that the medication is a cause of future hospitalization. The upshot is that although identifying the highest-risk patients seems a natural approach to prioritizing interventions such as postdischarge education and case management, our model provides no evidence that such patients are amenable to these interventions or that their risk of hospitalization would be responsive to them.
Despite these limitations, we believe that our model approach is a meaningful step toward identifying patients at highest risk of hospitalization. Tying the model to care interventions that are likely to modify the risk of hospitalization represents a promising area for future research.
Prediction models using EHR-only, claims-only, and combined data had similar predictive value and demonstrated strong discrimination for which patients will be hospitalized in the ensuing 6 months. The resulting model offers additional benefits of interpretability and timeliness and may be reproduced with local data for greater accuracy.