The American Journal of Managed Care January 2020

Predicting Hospitalizations From Electronic Health Record Data

Kyle Morawski, MD, MPH; Yoni Dvorkis, MPH; and Craig B. Monsen, MD, MS
The authors aimed to develop a rigorous technique for predicting hospitalizations using data that are already available to most health systems.
Feature Development

The initial set of features included 651 variables defined among sociodemographics, diagnoses, medications, and prior utilization of both inpatient and outpatient services.

Sociodemographic variables, such as age, insurance type, body mass index, and smoking status, were mostly obtained from the EHR. For claims sensitivity testing, age and insurance status were obtained from payer roster files. Missing data were treated as a separate class within each categorical variable.

We aggregated diagnoses across EHR encounter- and claims-level International Classification of Diseases, Ninth Revision (ICD-9) and Tenth Revision (ICD-10) codes and mapped them to a smaller set of features by grouping them into 1 of 87 HHS–Hierarchical Condition Categories (HHS-HCC) diagnosis groups.14 A single instance of an ICD-9 or ICD-10 code within an HHS-HCC group at any point during the retrospective period was enough to set that group's categorical variable to positive. Missing data were interpreted as the patient not having the clinical condition.
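The grouping rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the lookup table is a hypothetical 3-entry stand-in for the real HHS-HCC crosswalk, and the group labels are placeholders invented for the example.

```python
# Hypothetical code-to-group lookup. The real HHS-HCC crosswalk covers
# thousands of ICD-9/ICD-10 codes; the group labels here are placeholders.
CODE_TO_GROUP = {
    "250.00": "HCC_DIABETES",       # ICD-9
    "E11.9": "HCC_DIABETES",        # ICD-10
    "I50.9": "HCC_HEART_FAILURE",   # ICD-10
}
GROUPS = sorted(set(CODE_TO_GROUP.values()))


def diagnosis_flags(patient_codes):
    """Return one 0/1 indicator per diagnosis group.

    A single matching code anywhere in the lookback period sets the
    group to 1. Codes that are absent or unmapped leave the group at 0,
    which mirrors treating missing data as "condition not present".
    """
    present = {CODE_TO_GROUP[c] for c in patient_codes if c in CODE_TO_GROUP}
    return {group: int(group in present) for group in GROUPS}
```

Calling `diagnosis_flags(["250.00", "Z00.00"])` would flag the diabetes group and leave heart failure at 0, because the second code is not in the lookup. The same pattern would apply to medication classes keyed by National Drug Code.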

Medication use was similarly aggregated by National Drug Code across EHR data and pharmacy claims using commercially available therapeutic class codes (First Databank, Inc; South San Francisco, California). As with diagnoses, a single order or prescription for a medication belonging to a given class was enough to set that class's categorical variable to positive. Missing data were interpreted as the patient not having used the medication class.

Utilization variables included indicator variables of prior admissions, emergency department (ED) visits, and outpatient visits. These variables were further categorized based on the timing of the occurrence relative to the index date. For example, hospitalization utilization variables included those indicating if the patient had been hospitalized in the past 1 month, hospitalized in the past 1 to 3 months, hospitalized in the past 3 to 6 months, and hospitalized in the past 6 to 12 months.
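As an illustration of the timing-based indicators described above, the sketch below buckets prior admission dates into the four hospitalization windows. The feature names are hypothetical, and a month is approximated as 30 days, which is an assumption of this example rather than the paper's definition.

```python
from datetime import date

# Window edges in days before the index date. Approximating a month as
# 30 days is an assumption made for this sketch only.
WINDOWS = {
    "hosp_0_1m": (0, 30),
    "hosp_1_3m": (30, 90),
    "hosp_3_6m": (90, 180),
    "hosp_6_12m": (180, 360),
}


def utilization_flags(admit_dates, index_date):
    """Return a 0/1 indicator per window for admissions before index_date."""
    flags = {name: 0 for name in WINDOWS}
    for admitted in admit_dates:
        days_before = (index_date - admitted).days
        for name, (low, high) in WINDOWS.items():
            # Admissions on or after the index date fall in no window.
            if low < days_before <= high:
                flags[name] = 1
    return flags
```

The same function could be reused for ED and outpatient visits by passing a different list of dates and renaming the keys.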

Any variable that occurred in 30 or fewer patient-months in the data set was removed prior to model training to provide stable coefficients for the logistic regression model. For example, if there were just 10 patient-months in the sample during which any patient was on a medication represented as a binary variable, this variable was dropped from the model.
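The sparsity filter can be expressed as a simple count over patient-month rows. The sketch below assumes each row is a dictionary of binary features, which is a representation chosen for the example and not something the paper specifies.

```python
def drop_rare_features(rows, min_patient_months=30):
    """Keep only features that are positive in more than min_patient_months rows.

    rows: list of dicts mapping feature name -> 0/1, one dict per patient-month.
    Returns (filtered_rows, kept_feature_names).
    """
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (1 if value else 0)
    kept = sorted(name for name, count in counts.items() if count > min_patient_months)
    filtered = [{name: row.get(name, 0) for name in kept} for row in rows]
    return filtered, kept
```

With the default threshold, a feature that is positive in exactly 30 patient-months is dropped and one that is positive in 31 is kept.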

Although EHR data are readily available within 24 hours of an index date, claims data are often received after a 3-month delay called claims lag. To simulate this claims lag, we ascertained historical variables during a 12-month period starting 15 months prior to the index date and ending 3 months prior to the index date. This avoids advantaging models with data that would not normally be available. Data from the EHR, which do not experience this lag, were obtained during a partially overlapping 12-month period starting 12 months prior to the index date and ending the day before the index date. This is illustrated in Figure 1.
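To make the two lookback periods concrete, the following sketch computes both date ranges for a given index date. The function names are hypothetical, and whether the window endpoints are inclusive is an assumption of this example, since the text does not say.

```python
import calendar
from datetime import date, timedelta


def months_before(d, n):
    """Return the date n calendar months before d, clamping the day if needed."""
    total = d.year * 12 + (d.month - 1) - n
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def lookback_windows(index_date, claims_lag_months=3, lookback_months=12):
    """Return (start, end) date pairs for the EHR and claims lookback periods."""
    ehr = (
        months_before(index_date, lookback_months),
        index_date - timedelta(days=1),  # EHR data run up to the day before
    )
    claims = (
        months_before(index_date, lookback_months + claims_lag_months),
        months_before(index_date, claims_lag_months),  # stops where the lag begins
    )
    return {"ehr": ehr, "claims": claims}
```

For an index date of January 1, 2017, the EHR window would run from January 1, 2016, to December 31, 2016, and the claims window from October 1, 2015, to October 1, 2016, so the two periods overlap by 9 months.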

Model Development

We randomly selected 80% of the data to serve as the training set, reserving the remaining 20% as a testing set. We then regressed our hospitalization outcome on our selected variables using a logistic model with the canonical (logit) link. Variables were included in the final model if their odds ratio (OR) was greater than or equal to 1 (see eAppendix [available at]). This decision was consistent with our organization’s goal of identifying predictors of increased hospitalization risk and aided model interpretability, as clinicians would be appropriately skeptical of a disease state conferring a protective effect. Previous unpublished work informed our approach: machine learning algorithms such as random forest, support vector machines, and neural networks did not consistently improve model performance and were less interpretable than logistic regression. This has since been corroborated in recent literature for general outcomes such as mortality and disease-specific outcomes such as HIV incidence.8,15,16 All analyses were performed in R version 3.2.1 (R Foundation; Vienna, Austria).
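The split and the OR-based selection rule can be sketched as follows. The authors worked in R; this Python version is only an illustration, the coefficient names are hypothetical, and the fitting step itself is omitted. It assumes coefficients are on the log-odds scale, so OR = exp(coefficient), and it does not address whether the model was refit after selection, which the text leaves open.

```python
import math
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle rows reproducibly and split them into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def keep_risk_increasing(coefficients):
    """Return {feature: odds ratio} for features with OR >= 1.

    coefficients: feature name -> fitted log-odds coefficient
    (intercept excluded). OR >= 1 is equivalent to coefficient >= 0.
    """
    odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
    return {name: ratio for name, ratio in odds_ratios.items() if ratio >= 1.0}
```

For example, given coefficients of 0.9 for a prior-admission flag and -0.2 for a medication class, only the prior-admission flag would be retained.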

Model Performance

We measured performance on the training and testing data sets using area under the receiver operating characteristic curve (AUC) and model calibration.17 We calculated 95% CIs around the AUC using the DeLong method (R pROC package, version 1.10.0). For model calibration, we plotted calibration curves and calculated the Hosmer-Lemeshow statistic (R ResourceSelection package, version 0.3-2).
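For readers unfamiliar with the two metrics, the sketch below computes them from scratch. It is not a reimplementation of the R packages named above: the AUC uses the pairwise (Mann-Whitney) definition without the DeLong CI, and the Hosmer-Lemeshow grouping splits the sorted predictions into equal-size groups, which is a simplifying assumption.

```python
def auc(y_true, y_score):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Requires at least one positive and one negative.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Hosmer-Lemeshow chi-square statistic over equal-size risk groups."""
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        expected_pos = sum(p for p, _ in chunk)
        observed_pos = sum(y for _, y in chunk)
        expected_neg = len(chunk) - expected_pos
        observed_neg = len(chunk) - observed_pos
        # Compare observed and expected counts for events and non-events.
        if expected_pos > 0:
            stat += (observed_pos - expected_pos) ** 2 / expected_pos
        if expected_neg > 0:
            stat += (observed_neg - expected_neg) ** 2 / expected_neg
    return stat
```

A statistic near 0 indicates that observed event counts track the predicted probabilities within each risk group; larger values indicate poorer calibration. Converting the statistic to a P value would require a chi-square distribution with `groups - 2` degrees of freedom, which is left out here.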

Copyright AJMC 2006-2020 Clinical Care Targeted Communications Group, LLC. All Rights Reserved.