Predicting High-Cost Privately Insured Patients Based on Self-Reported Health and Utilization Data

The results of this study show that patient-reported data on health and healthcare can be useful in predicting high-cost patients when claims data for prior years are not available.
Published Online: July 31, 2017
Peter J. Cunningham, PhD

Objectives: To examine how well self-reported data on health, health behaviors, and healthcare utilization by a sample of privately insured patients predict whether they will incur high healthcare costs the following year. 
Study Design: A 2012 mail survey of autoworkers from Chrysler, Ford, and General Motors, with 3983 survey respondents linked to their health insurance claims data for 2012 and 2013. 

Methods: High healthcare costs are defined as being in the 75th percentile or higher of healthcare expenditures. Models that include combinations of claims-based measures of expenditures and morbidity and self-reported measures of health, health behaviors, and healthcare utilization are compared. 

Results: Claims-based measures of healthcare costs and comorbidity for 2012 were strong predictors of whether a patient would incur high healthcare costs in 2013 (C statistic = 0.78). Self-reported measures of chronic conditions, health status, health behaviors, and hospital use are also good predictors of high healthcare costs. However, even the most comprehensive model that included self-reported measures (C statistic = 0.73) was not as accurate in predicting high healthcare costs as the claims-based model.

Conclusions: Efficient targeting of high-cost patients is crucial to the success of innovative care delivery models that attempt to lower costs and improve quality of care through more intensive care management of patients. The results of this study show that in the absence of claims data on prior use and expenditures, patient-reported measures of health status and prior healthcare use are reasonable predictors of future healthcare costs for a privately insured population. 
Takeaway Points

Identifying patients who are likely to incur high healthcare costs is a crucial goal of innovative care delivery models. Insurance claims or electronic health records are often used to identify high-cost patients, but sometimes they are unavailable. The results of this study show that: 
  • Self-reported health and healthcare utilization based on survey data can be useful in predicting whether privately insured patients will incur high healthcare costs when claims-based expenditure and health information from prior years is not available. 
  • Among the survey-based measures, questions on inpatient stays and emergency department visits in the previous year were the strongest predictors of incurring high healthcare costs in the following year.
The shift to value-based payment strategies and healthcare providers assuming greater financial risk for the care of patients has generated considerable interest in risk-adjustment methods that can identify patients who will incur high healthcare costs.1,2 Such methods often rely on medical claims and electronic health records (EHRs) to predict future health needs and healthcare costs of patients based on their prior diagnoses, utilization, and costs.3

About 13% of privately insured individuals change health plans in a given year and about 9% change their usual source of care,4 and this appears to have increased since implementation of the Affordable Care Act (ACA).5 However, claims or medical records data are usually not available to private insurance plans or practitioners when they accept and treat new patients.

Due to the lack of data on health history and prior use, health needs assessments (HNAs) or health risk appraisals (HRAs) are often administered by plans to identify patients’ care needs, although it is largely unknown how useful these tools are in predicting which patients are likely to incur high healthcare costs and therefore may require more intensive care management.6,7 Among other potential limitations, HNAs and HRAs rely entirely on patient self-reporting of health conditions and medical care use, which may be subject to reporting error.

Research based on the nationally representative Medical Expenditure Panel Survey and the Medicare population shows that common measures of self-reported health status (including the single-item self-rated general health measure, the SF-12 measure, and self-reported chronic conditions) generally performed as well as risk scores derived from claims or medical records.8-11 The results of 2 studies of Medicaid enrollees showed that self-reported information on health status, health behavior, prior utilization, and other measures was also a good predictor of future high healthcare costs.12,13

Whether these results are consistent for individuals with private insurance is unknown. In addition, most studies have not directly compared the predictive power of self-reported health measures with claims-based measures of previous expenditures for the same population. The objective of this paper was to determine how well patient-reported data on health conditions, overall health status, health behaviors, and experiences with the healthcare system predict high health costs in the future. To our knowledge, this is the first study that assesses the ability of a comprehensive set of self-reported measures of health, health behaviors, and healthcare utilization to predict high healthcare costs for a privately insured population, similar to studies that have been conducted for the Medicaid population.12,13 Also, because the data include both prior-year claims data on health expenditures and self-reported measures for the same time period, the study compares self-reported measures and claims-based measures of expenditures in predicting high healthcare costs.



The data for this study were based on the 2012 Autoworker Health Care Survey, a survey of active and retired hourly wage workers from Chrysler, Ford, and General Motors. The survey was sponsored by the National Institute for Health Care Reform, a nonprofit, nonpartisan organization established by the International Union, United Autoworkers; Chrysler Group, LLC; Ford Motor Company; and General Motors. The total survey sample included 8624 hourly wage workers, retirees younger than 65 years (ie, not eligible for Medicare), and their spouses. The sample was randomly selected, with some oversampling of active workers so that the proportion of active and retired workers in the sample was about evenly split.

The survey questionnaire was administered by mail. A consent form was included that asked respondents for permission to obtain insurance claims data for themselves and their spouses. Survey respondents were asked to sign and return the consent form to the survey firm if they agreed to allow their insurance claims data to be linked to their survey data. There were 3983 survey respondents who provided consent and had claims data that we were able to link to their survey responses. The eAppendix Figure (eAppendices available at describes the process for determining the final sample for this study.

The survey response rate was 64%. Among survey responders, claims data were available for 46%, for a combined response rate of 29.4% for the sample with linked claims data. Survey weights used in this analysis adjusted for survey nonresponse and differences between individuals who provided signed consent to access their claims data and those who did not provide consent. With these statistical adjustments, the full survey sample and the linked survey/claims sample were very similar with respect to age, gender, race/ethnicity, education, income, health status, and prevalence of chronic conditions, as is shown in the eAppendix Table.
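The combined response rate quoted above is simply the product of the two stage-level rates. A minimal arithmetic check, using only the figures given in the text (the variable names are illustrative):

```python
# Stage-level rates as reported in the text.
survey_response_rate = 0.64   # share of the sample that returned the survey
claims_link_rate = 0.46       # share of responders with linked claims data

# The combined rate is the product of the two stages.
combined_rate = survey_response_rate * claims_link_rate
print(f"{combined_rate:.1%}")  # 29.4%
```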

Dependent Variable

The main dependent variable is a binary indicator of whether the individual had high healthcare costs in 2013 (the year following the survey), defined as being in the 75th percentile or higher with respect to total expenditures for 2013 (about $7000 or higher). Total expenditures include combined costs for hospital care (inpatient, emergency department [ED], outpatient), prescription drugs, office-based physicians, tests, procedures, medical equipment, and other services covered by the health plan.
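As an illustration of how such a binary indicator might be built, the sketch below flags observations at or above the 75th percentile of total expenditures. It is not the study's code: the expenditure values are made up, the nearest-rank percentile rule is an assumption, and the calculation is unweighted, whereas the study applied survey weights.

```python
import math

def percentile_threshold(values, pct):
    """Nearest-rank percentile: the smallest observed value with at least
    pct percent of observations at or below it (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def flag_high_cost(expenditures, pct=75):
    """Return the cutoff and a 0/1 indicator for being at or above it."""
    cutoff = percentile_threshold(expenditures, pct)
    return cutoff, [int(x >= cutoff) for x in expenditures]

# Hypothetical total expenditures (in dollars) for 10 people; not study data.
costs_2013 = [120, 450, 980, 1500, 2400, 3100, 5200, 7000, 9800, 42000]
cutoff, high_cost = flag_high_cost(costs_2013)
print(cutoff, high_cost)
```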


Independent Variables

Two variables were constructed from sample individuals’ 2012 claims data for this analysis. The first was a categorical measure reflecting total healthcare expenditures for 2012. Categories include being in the 90th percentile of spending or higher ($16,660 and above), between the 75th and 90th percentiles ($6610 to $16,660), between the 50th and 75th percentiles ($2431 to $6610), and below the 50th percentile ($2431 and under). The second was a Charlson comorbidity index, constructed from diagnoses based on International Classification of Diseases, Ninth Revision, Clinical Modification codes.14 This is a commonly used measure of health status based on 22 conditions that has been used in prior research to predict mortality and future healthcare costs.
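The 2012 expenditure categories can be expressed as a simple lookup on the dollar cut points reported above. This is a sketch only: the function name and labels are illustrative, and the handling of a value exactly equal to $6610 is an assumption, since the text does not say which band includes that boundary.

```python
def expenditure_category(total_2012):
    """Map total 2012 expenditures (dollars) to the percentile band in the text."""
    if total_2012 >= 16660:          # "$16,660 and above"
        return "90th percentile or higher"
    if total_2012 > 6610:            # assumption: exactly 6610 falls in the lower band
        return "75th to 90th percentile"
    if total_2012 > 2431:            # "$2431 and under" belongs to the lowest band
        return "50th to 75th percentile"
    return "below 50th percentile"

# Hypothetical spending amounts, one per band.
for amount in (900, 4000, 10000, 25000):
    print(amount, expenditure_category(amount))
```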

Measures obtained from the survey questionnaire are used to identify all other independent variables in the analysis (see Table 1 for more detailed definitions of these variables). These include demographic and socioeconomic characteristics (age, gender, race/ethnicity, educational attainment, family income). The survey also asked about a selected number of chronic conditions that respondents reported had been diagnosed by a physician, including hypertension, heart disease, congestive heart failure, diabetes, chronic obstructive pulmonary disease (COPD), arthritis, depression, high cholesterol, and cancer. Also included were distinct measures of both self-rated physical and mental health, as well as work and activity limitations, needing help with personal care, and requiring special equipment due to health problems.

The survey also asked about health behaviors or lifestyle factors that may be related to healthcare costs, including the amount of physical activity respondents engage in on a weekly basis, whether they are a current smoker, and height and weight (to identify individuals who are obese). Self-reported measures of healthcare utilization are also included, such as having a usual source of care, number of hospital stays, and ED visits.


The analysis assessed: 1) how well self-reported information on health, health behaviors, and healthcare use predicts being a high-cost patient in 2013 compared with claims-based measures of health expenditures and comorbidity and 2) which self-reported measures appear to be the most important in predicting whether a patient will be considered high-cost in 2013.

Logistic regression models for the likelihood of being a high-cost patient in 2013 were estimated using combinations of the independent variables described above. One model included only information on 2012 expenditures and comorbidity (based on claims data), as well as age, gender, and race/ethnicity. These results were compared with a series of logistic regressions that included only survey variables. These regressions sequentially added groups of survey variables to assess their relative contribution to predicting high-cost patients. A final model included all of the 2012 claims-based expenditures and survey measures described above.
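The sequential approach can be sketched as a list of nested model specifications, each adding one group of survey variables to those already included. The variable names and the order in which groups enter are assumptions made for illustration; the text does not specify either, and the model fitting itself is left to whatever statistical package is used.

```python
# Hypothetical variable names grouped as the survey measures are described;
# the order of entry is an assumption, not taken from the study.
SURVEY_GROUPS = [
    ("demographics", ["age", "gender", "race_ethnicity", "education", "family_income"]),
    ("chronic_conditions", ["hypertension", "heart_disease", "diabetes", "copd", "cancer"]),
    ("health_status", ["self_rated_physical", "self_rated_mental", "activity_limitation"]),
    ("health_behaviors", ["physical_activity", "current_smoker", "obese"]),
    ("utilization", ["usual_source_of_care", "hospital_stays", "ed_visits"]),
]

def nested_specs(groups):
    """Build cumulative predictor lists: model k holds groups 1 through k."""
    specs = []
    cumulative = []
    for name, variables in groups:
        cumulative = cumulative + variables
        specs.append((f"+ {name}", list(cumulative)))
    return specs

specs = nested_specs(SURVEY_GROUPS)
for label, predictors in specs:
    print(f"{label:22s} {len(predictors)} predictors")
```

Comparing a fit statistic across successive specifications then indicates how much each added group contributes.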

For each model, we computed a C statistic, a common measure of how much better than chance a model predicts an outcome.15 Values range from 0.5 to 1.0, with a value of 0.5 indicating that the model predicts high-cost patients no better than chance and a value of 1.0 indicating that the model perfectly predicts high-cost patients. Models that have a C statistic of 0.7 or higher are considered to be good predictors of the outcome measure.
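For a binary outcome, the C statistic equals the share of (high-cost, not high-cost) patient pairs in which the model assigns the higher predicted probability to the high-cost patient, with ties counted as half. The sketch below computes it directly from that definition; the outcomes and scores are made up, and the calculation is unweighted.

```python
def c_statistic(y_true, y_score):
    """Share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

# Hypothetical outcomes (1 = high cost) and predicted probabilities.
outcomes = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(c_statistic(outcomes, scores))  # 0.75
```

The pairwise loop is quadratic in sample size, which is fine for illustration; statistical packages typically use a rank-based equivalent.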

A pseudo R2 reflects the proportion of explained variation in the dependent variable based on a logistic regression model.16 Values can range from 0 to 1.0, with 0 indicating that the model explains none of the variation in the dependent variable and 1.0 indicating that the model explains all of the variation in the dependent variable.
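Several pseudo R2 definitions exist, and the text does not say which one was used. As one common example, the sketch below computes McFadden's version, which compares the log-likelihood of the fitted model with that of an intercept-only model. The choice of McFadden's formula and the example values are assumptions.

```python
import math

def log_likelihood(y_true, p_hat):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1 - p)
               for y, p in zip(y_true, p_hat))

def mcfadden_r2(y_true, p_hat):
    """1 - LL(model) / LL(null), where the null model predicts the base rate."""
    base_rate = sum(y_true) / len(y_true)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    ll_model = log_likelihood(y_true, p_hat)
    return 1 - ll_model / ll_null

# Hypothetical outcomes (1 = high cost) and predicted probabilities.
outcomes = [0, 0, 0, 1]
predicted = [0.1, 0.1, 0.2, 0.7]
print(round(mcfadden_r2(outcomes, predicted), 3))
```

A model that predicts only the base rate for everyone scores 0 under this definition, and values move toward 1 as predicted probabilities approach the observed outcomes.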
