The authors evaluated a new "big data" analytic predictive platform that quickly and accurately analyzes large data sets to identify populations at risk of developing conditions such as metabolic syndrome.
Gregory B. Steinberg, MB, BCh; Bruce W. Church, PhD; Carol J. McCall, FSA, MAAA; Adam B. Scott, MBA; and Brian P. Kalis, MBA
The growing prevalence of metabolic syndrome in the United States, and globally, is alarming. Metabolic syndrome is generally defined as having 3 or more of 5 common biological measures out of range: waist circumference, blood pressure, elevated triglycerides, low high-density lipoproteins (HDL), and increased insulin resistance. Analysis^{1} suggests that almost one-third of US adults, or approximately 80 million people, meet the Adult Treatment Panel III criteria for metabolic syndrome, with prevalence increasing significantly with age and body weight.^{2} An additional 45%, or approximately 104 million people, have 1 or 2 risk factors for developing metabolic syndrome.
These trends have profound clinical and financial implications. Individuals with metabolic syndrome are twice as likely to develop cardiovascular disease and 5 times more likely to develop diabetes mellitus, both of which mean higher than average annual healthcare costs. Workplace participation and productivity of individuals with metabolic syndrome are also negatively impacted.^{3}
Health insurance companies have large quantities of data relevant to metabolic syndrome, including demographic data, diagnosis and procedure claim data, lab results, prescription data, and care management program data. Using “big data analytics” to interrogate large, complex data sets can generate meaningful insights about individuals with or at risk of developing metabolic syndrome.
We applied a proprietary “big data” analytic platform, Reverse Engineering and Forward Simulation (REFS), to the data set of 1 of Aetna’s larger nationwide retail customers and calculated:

The subsequent risk of metabolic syndrome, both overall and by metabolic syndrome risk factor, at both a population and individual level

The impact of incremental changes in risk factors on the overall subsequent risk of metabolic syndrome and on costs

The impact of adherence to medications and to routine, scheduled outpatient doctor visits on the subsequent risk of metabolic syndrome.
Big data analytic techniques of this type rapidly yield insights that support data-driven, targeted interventions for people with or at risk of developing metabolic syndrome. Aetna is currently piloting an intervention program based upon the results.
METHODS
The REFS platform is best used to analyze and simulate large, dynamic, multisource data sets. The platform learns by reverse engineering ensembles of models that represent the diversity of processes consistent with the data and then simulating nonparametric knowledge representations to generate accurate, granular group and individual predictions that are both actionable and generalizable. Accurate insights from available data can be generated within a few months, and new data are easily integrated. This speed-to-insight allows care providers to develop effective therapeutic programs and interventions quickly and cost-effectively, ultimately lowering the cost to serve the affected populations.
Data Sources
Data for this study were gathered from:

Insurance eligibility records

Comprehensive Metabolic Syndrome Screening (CMSS) results

Health risk assessment (HRA) responses
Study Population
The CMSS results provided the core outcome variables for the study, and measured each of the 5 metabolic syndrome factors (including systolic and diastolic blood pressure). Screenings were conducted twice: once at the beginning of 2011 and again in early 2012, for an initial cohort of 59,605 people. We then restricted the study to participants for whom we had: complete coverage records from January 1, 2010, through December 31, 2011; complete data from medical claims, pharmacy claims, or test lab results for 2010 and 2011; and valid responses to a small set of HRA questions. This resulted in a study population of 36,944, which was then randomly assigned to either an 80% training set (N = 29,527) or a 20% test set (N = 7417). The study population metabolic syndrome risk and medical cost profile is found in Figure 1. Additional demographic detail is found in eAppendix Figure 1.
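The random training/test assignment described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name, the seed, and the choice of an independent random draw per subject are assumptions. A per-subject draw is merely consistent with a training share that is close to, but not exactly, 80% (29,527 of 36,944).

```python
import random

def assign_split(subject_ids, train_fraction=0.8, seed=0):
    """Randomly assign each subject to the training or the test set.

    Hypothetical helper: each subject is assigned independently, so the
    realized training share is only approximately train_fraction.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for sid in subject_ids:
        (train if rng.random() < train_fraction else test).append(sid)
    return train, test

# Placeholder integer IDs stand in for the 36,944 study subjects.
train_ids, test_ids = assign_split(range(36944))
print(len(train_ids), len(test_ids))
```

Holding out the test set before any modeling is what allows predictions to be checked on subjects the models never saw.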
Variable Creation and Definitions
The 4291 variables in the analysis spanned 6 different data categories. The specific breakdown of data categories is found in eAppendix Table 1. Continuous variables were discretized into ranges in preparation for modeling with multivariate categorical models. The ranges of the CMSS factors were constructed from metabolic syndrome out-of-range boundaries and other clinically relevant boundaries.
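Discretizing a continuous measurement into ranges can be sketched as below. The study's actual cut points are not given in this section, so the boundary list is an assumption: 150 mg/dL is the ATP III out-of-range threshold for triglycerides, and the other values are invented for illustration. The function and constant names are also hypothetical.

```python
from bisect import bisect_right

def discretize(value, boundaries):
    """Return the index of the range that contains value.

    boundaries must be sorted in ascending order. A value equal to a
    boundary falls in the upper range, so "at or above the threshold"
    counts as the higher category.
    """
    return bisect_right(boundaries, value)

# Illustrative triglyceride cut points in mg/dL (only 150 is a real threshold).
TRIGLYCERIDE_BOUNDS = [100, 150, 200, 500]

for reading in (90, 149, 150, 600):
    print(reading, "-> range", discretize(reading, TRIGLYCERIDE_BOUNDS))
```

With 4 boundaries there are 5 ranges (indices 0 to 4); placing the out-of-range threshold among the boundaries makes it easy to later sum predicted probabilities on either side of it.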
Demographics captured 5 dimensions in addition to gender: age, body mass index (BMI), ethnicity, cigarette usage, and sleep. In addition, 4 event types were defined from claims: diagnoses, procedures, provider specialty, and prescriptions. Further detail regarding demographics and events is found in eAppendix Figure 1. An indicator variable identified the year in which an event occurred.
1. Lab results. Results from 24 common lab tests (as identified by Logical Observation Identifiers Names and Codes number) were extracted for each year. Results were discretized in up to 7 ranges.
2. Biometrics. For each of the CMSS biometric screenings conducted, 6 variables were created (the 4 single-metric metabolic syndrome factors and systolic and diastolic blood pressure values). The values were then segregated into 7 ranges for blood pressure and 6 ranges for the remaining CMSS factors. In cases where the biometric corresponded to a lab test, the same discretization was used.
3. Medication adherence. We calculated a subject’s medication possession ratio (MPR) for 4 classes of medication: antidiabetics, antihyperlipidemics, antihypertensives, and other cardiovascular medications. More detailed information on the MPR calculation is found in eAppendix Table 1. An MPR of 80% or higher was considered adherent.^{4} For each year and each category of medication, a subject was categorized as: N/A (no prescriptions of that type), once and done (1 prescription of that type), not adherent, or adherent.
4. Preventive visits. A subject was deemed to have had a preventive visit if they had at least 1 claim during each year coded as a Preventive Visit (with one of 26 specific Evaluation & Management CPT-4 codes).
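The 4-way medication adherence categorization in item 3 can be sketched as follows. The article defers the MPR details to the eAppendix, so the formula used here (total days' supply divided by days in the period, capped at 1) is the commonly used definition and is an assumption, as are the function and parameter names.

```python
def mpr_category(days_supplied, period_days=365, threshold=0.80):
    """Classify one subject-year for one medication class.

    days_supplied: list with the days' supply of each filled prescription.
    The MPR formula below is an assumed, commonly used definition.
    """
    if len(days_supplied) == 0:
        return "N/A"                # no prescriptions of that type
    if len(days_supplied) == 1:
        return "once and done"      # exactly 1 prescription of that type
    mpr = min(sum(days_supplied) / period_days, 1.0)
    return "adherent" if mpr >= threshold else "not adherent"

print(mpr_category([30] * 10))  # 300 of 365 days covered
print(mpr_category([30] * 9))   # 270 of 365 days covered
```

Treating "no prescriptions" and "once and done" as separate categories keeps those subjects from being lumped in with subjects who fill repeatedly but have a low MPR.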
Statistical Methods: Platform Analytic Methods and Simulations
The REFS platform learns by Metropolis Monte Carlo^{5} sampling from the posterior of the model-structure distribution. Model structure probabilities are computed in a Bayesian framework by marginalizing out the unknown parameter distributions against the observed data and maximum entropy parameter priors.^{6} These model structure probabilities balance the model’s fit of the data against the model’s complexity.
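To make the sampling idea concrete, here is a toy Metropolis sampler over variable subsets. It is a sketch under stated assumptions, not the REFS implementation: `toy_log_score` is a hypothetical stand-in for the Bayesian log model-structure probability, and the toggle-one-variable proposal, the function names, and the problem sizes are all illustrative choices.

```python
import math
import random
from collections import Counter

def metropolis_subsets(log_score, n_vars, max_size, n_steps, seed=0):
    """Sample variable subsets with probability proportional to exp(log_score).

    Proposal: pick one variable uniformly at random and toggle its membership.
    The proposal is symmetric, so a move is accepted with probability
    min(1, exp(new_score - old_score)). Proposals larger than max_size are
    rejected and the current subset is kept.
    """
    rng = random.Random(seed)
    current = frozenset()
    current_ls = log_score(current)
    samples = []
    for _ in range(n_steps):
        v = rng.randrange(n_vars)
        proposal = current ^ {v}  # toggle variable v in or out
        if len(proposal) <= max_size:
            proposal_ls = log_score(proposal)
            delta = proposal_ls - current_ls
            if delta >= 0 or rng.random() < math.exp(delta):
                current, current_ls = proposal, proposal_ls
        samples.append(current)
    return samples

def toy_log_score(subset):
    # Hypothetical score: variables 0 and 1 improve the fit (+3 each),
    # and every included variable costs a complexity penalty of 1.
    return 3.0 * len(subset & {0, 1}) - 1.0 * len(subset)

samples = metropolis_subsets(toy_log_score, n_vars=6, max_size=3, n_steps=20000)
most_common_subset, _ = Counter(samples).most_common(1)[0]
print(sorted(most_common_subset))
```

In this toy setting the chain should spend most of its time on the subset containing only the two useful variables, which is the fit-versus-complexity balance in miniature. The collected samples form an ensemble of structures rather than a single best model.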
Once learned, the model was interrogated by Forward Simulation (FS) to learn risk factors as well as the impact of interventions for individuals and populations. FS is a fast Monte Carlo process that samples simultaneously from the structure of the platform, the uncertainty in its parameters, and the residual uncertainty on the outcomes, and it is efficient enough to be driven interactively. Multivariate categorical models were sampled describing each of the 6 discretized metabolic syndrome components. These models included up to 16 variables chosen from the total set of all variables. For each of the 6 metabolic syndrome components, the size of the space of models sampled during reverse engineering (RE) is the number of ways to choose up to 16 distinct variables from the 4291 possible variables, or approximately 10^{44} models. Metropolis Monte Carlo can efficiently sample from these astronomically large hypothesis spaces guided only by the data, without prior knowledge to direct the search.
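The size of the model space quoted above can be checked directly by counting the ways to choose up to 16 distinct variables from 4291. The function name is an assumption, and the count includes the empty model, which does not affect the order of magnitude.

```python
from math import comb, log10

def model_space_size(n_variables, max_variables):
    """Number of distinct variable subsets with at most max_variables members."""
    return sum(comb(n_variables, k) for k in range(max_variables + 1))

size = model_space_size(4291, 16)
print(f"about 10^{log10(size):.1f} candidate models per component")
```

The result is on the order of 10^44, far too many models to enumerate, which is why a sampling method is needed.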
Two models were learned. The Metabolic Syndrome Status Model was trained on claims-based events from 2010 to predict the CMSS measurements taken at the beginning of 2011, and the Metabolic Syndrome Velocity Model used claims-based events from 2011 together with the 2011 CMSS measurements to predict 2012 CMSS results.
Simulations Conducted
Because the number of model parameters is much larger than the number of observations, there were many models consistent with the observed data. The ensemble of models learned in the reverse engineering phase is a population sample from the posterior distribution over model structures.
Individual risk simulations. Forward simulations were computed for all study subjects for each of the 5 primary metabolic syndrome factors (with blood pressure separated into systolic and diastolic components) to predict likely values of metabolic syndrome factors at the next biometrics screening. The output for each factor was the probability of each range of the discretization of the factor. The probabilities across the outcome ranges were aggregated on either side of the factor’s out-of-range boundary, and the resultant out-of-range probability was computed for each factor. The individual out-of-range probabilities were further aggregated to compute the probability of metabolic syndrome.
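The two aggregation steps described above can be sketched as follows. The article does not say how the per-factor probabilities were combined, so this sketch assumes the 5 factors are independent and computes the probability that at least 3 are out of range. That independence assumption, the function names, and the example numbers are all hypothetical; a joint simulation would capture dependence between factors.

```python
def out_of_range_probability(range_probs, out_of_range_indices):
    """Sum the predicted probabilities of the ranges beyond the boundary."""
    return sum(range_probs[i] for i in out_of_range_indices)

def prob_at_least(k, probs):
    """P(at least k events occur), assuming independent events."""
    dist = [1.0]  # dist[j] = P(exactly j of the events seen so far occurred)
    for p in probs:
        updated = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            updated[j] += q * (1 - p)   # this factor stays in range
            updated[j + 1] += q * p     # this factor is out of range
        dist = updated
    return sum(dist[k:])

# Invented predicted range probabilities for one factor; ranges 2 and 3
# are taken to lie on the out-of-range side of the boundary.
p_factor = out_of_range_probability([0.1, 0.2, 0.3, 0.4], [2, 3])

# Invented out-of-range probabilities for the 5 factors of one individual.
p_syndrome = prob_at_least(3, [p_factor, 0.2, 0.5, 0.1, 0.6])
print(round(p_syndrome, 3))
```

The threshold of 3 follows the definition of metabolic syndrome as 3 or more of 5 factors out of range.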