Electronic Health Record Problem Lists: Accurate Enough for Risk Adjustment?

Timothy J. Daskivich, MD, MSHPM; Garen Abedi, MD, MS; Sherrie H. Kaplan, PhD, MPH; Douglas Skarecky, BS; Thomas Ahlering, MD; Brennan Spiegel, MD, MSHS; Mark S. Litwin, MD, MPH; and Sheldon Greenfield, MD

Recent study results highlight the inconsistency of different sources of data (eg, registries, claims, the electronic health record [EHR]) for identifying basic health information, such as major comorbidities.1-5 For individual physicians, these inconsistencies are less relevant because they have the opportunity to confirm this information directly with the patient. However, when used for risk adjustment for purposes of performance assessment,6-8 incorrect data may lead to misclassification and unfair comparisons. This is a major concern for health systems participating in alternative payment models, which base some portion of reimbursement on risk-adjusted quality outcomes.9-13 Comorbidity is a key component of the risk adjustment needed for fair comparisons of measures of quality, as comorbid disease burden affects readmissions,14-16 complications,17-19 quality of life,20,21 and mortality.22-24 Accurately quantifying comorbidity requires varying degrees of detail regarding number, severity, or types of conditions, depending on the measure used.25 All of these measures require robust data sources to identify the presence or absence of each of the included comorbid conditions. 

Whereas comorbidity data from inpatient medical records are reviewed by trained coders, outpatient records may be less reliable,1-5 as they often rely on “problem lists” in the EHR to identify the index condition. The problem list is a compilation of patient diagnoses entered by clinicians during patient encounters and updated at varying intervals. With increasing numbers of institutions and office-based practices using EHRs to store patient data, interest has grown in utilizing these lists as a source of comorbidity data.26 However, it is unclear whether the data in these lists are sufficiently accurate to assess patients’ total comorbid disease burden. Recent studies have attempted to validate the accuracy of the problem list by comparing it with other diagnosis lists or with short-term proximate outcomes, such as glycated hemoglobin.1 Yet the most appropriate metric for assessing the validity of these lists is long-term mortality, especially in an elderly population; a longer list of major comorbidities should be strongly associated with higher mortality.

In this study, we compared how well Charlson Comorbidity Index (CCI) scores derived from the Veterans Affairs (VA) problem list predicted mortality in an elderly population against CCI scores derived from a gold standard for comorbidity assessment: manual abstraction directly from the physician’s free-text notes. We captured mortality over a 10-year period, which was long enough to reveal the impact of even minor comorbidities over time. Because the problem list is not actively maintained, we hypothesized that it would provide poor accuracy in identifying comorbidities and would poorly predict survival compared with free-text–based assessment.


Data Sources and Study Participants

We used the California Cancer Registry to identify men newly diagnosed with prostate cancer at the Greater Los Angeles and Long Beach VA Medical Centers between 1998 and 2004 (N = 1915). We reviewed EHRs for sociodemographic, tumor risk, comorbidity, and survival data and identified all men with sufficient data to determine comorbidity and survival (n = 1596). Institutional review board approval was granted by the University of California, Los Angeles, and both VA Medical Centers. 


Comorbidity. We assessed comorbid disease burden at the time of prostate cancer diagnosis using 2 sources of data: 1) the EHR free-text notes record and 2) the EHR problem list. The interdisciplinary EHR free-text notes record contained outpatient and inpatient notes from all clinical encounters. Data from the medical record within 12 months of the diagnosis of prostate cancer were used for free-text–based comorbidity assessment. Comorbidities were coded according to the definitions originally indicated by Charlson et al,22 and age-unadjusted CCI scores were calculated. Because free-text–based comorbidity assessment was conducted primarily by 1 author (TJD), reliability was assessed on a random 5% subset of the sample by a separate author (GA). Interrater agreement in Charlson scores was 77.5% and the associated kappa statistic was 0.67. 

The VA problem list was populated and updated by clinicians variably as they deemed appropriate; diagnoses were coded by International Classification of Diseases, Ninth Revision (ICD-9) codes and dates and entered into a retrievable database. Comorbidities added to the problem list up to 12 months after the diagnosis of prostate cancer were used for EHR problem list–based comorbidity assessment. Comorbidities were coded according to the claims-based definitions indicated by Deyo et al,27 and age-unadjusted Deyo-CCI scores were calculated. Neither prostate cancer nor any complications of prostate cancer were included in comorbidity scoring for either comorbidity assessment method.
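As an illustration of the scoring step, the following is a minimal sketch of an age-unadjusted CCI calculation. The weights are those published by Charlson et al; the condition labels, the function name `charlson_score`, and the rule that a severe form supersedes its milder counterpart are assumptions made for this example and are not taken from the study's abstraction protocol. Prostate cancer is assumed to have been excluded before the conditions are passed in.

```python
# Weights from the original Charlson index (age-unadjusted).
# The dictionary keys are illustrative labels, not the study's variable names.
CHARLSON_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "peripheral_vascular_disease": 1,
    "cerebrovascular_disease": 1,
    "dementia": 1,
    "chronic_pulmonary_disease": 1,
    "connective_tissue_disease": 1,
    "ulcer_disease": 1,
    "mild_liver_disease": 1,
    "diabetes": 1,
    "hemiplegia": 2,
    "moderate_severe_renal_disease": 2,
    "diabetes_with_end_organ_damage": 2,
    "any_tumor": 2,
    "leukemia": 2,
    "lymphoma": 2,
    "moderate_severe_liver_disease": 3,
    "metastatic_solid_tumor": 6,
    "aids": 6,
}

# Assumption: when both forms are recorded, only the more severe one is scored.
SUPERSEDES = {
    "diabetes_with_end_organ_damage": "diabetes",
    "moderate_severe_liver_disease": "mild_liver_disease",
    "metastatic_solid_tumor": "any_tumor",
}


def charlson_score(conditions, apply_hierarchy=True):
    """Return the age-unadjusted Charlson score for an iterable of condition labels."""
    present = set(conditions)
    unknown = present - CHARLSON_WEIGHTS.keys()
    if unknown:
        raise ValueError(f"Unrecognized conditions: {sorted(unknown)}")
    if apply_hierarchy:
        for severe, mild in SUPERSEDES.items():
            if severe in present:
                present.discard(mild)
    return sum(CHARLSON_WEIGHTS[c] for c in present)


# Hypothetical patient: diabetes plus prior myocardial infarction.
example_score = charlson_score(["diabetes", "myocardial_infarction"])
```

The same function could be applied to conditions abstracted from free text or to conditions mapped from ICD-9 codes on the problem list, so that any difference in scores reflects the data source rather than the scoring rule.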

Survival model covariates. Age at diagnosis was coded as a continuous variable. Race/ethnicity was coded as Caucasian, African American, Hispanic, or other. Tumor characteristics included prostate-specific antigen (PSA), Gleason sum, and clinical tumor (T), node, and metastasis stage at diagnosis. Categories for PSA, Gleason sum, and clinical T stage were defined by the widely accepted D’Amico criteria, which have been shown to predict overall and cancer-specific mortality.28,29 

Mortality. Survival was measured from date of treatment until date of death. We determined date of death using a combination of the medical record and the Social Security Death Index. Cause of death was determined using the medical record with an algorithm that has been previously described.30

Statistical Analysis

We determined the prevalence of 7 major comorbidities using free-text–based and EHR problem list–based comorbidity assessment. Sensitivity and specificity for identification of major comorbidities were calculated, using free-text–based assessment as the gold standard. Agreement between assessment methods was ascertained for each comorbidity using Cohen’s kappa statistic. 
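The accuracy measures described above can be sketched from a 2 × 2 table of paired flags. This is a minimal illustration, assuming one boolean flag per patient from each source; the function name `diagnostic_agreement` and the toy data are hypothetical and do not reproduce the study's Stata analysis.

```python
def diagnostic_agreement(reference, test):
    """Sensitivity, specificity, and Cohen's kappa for one comorbidity.

    reference: flags from the gold standard (free-text abstraction)
    test:      flags from the source being evaluated (problem list)
    """
    if len(reference) != len(test):
        raise ValueError("reference and test must have the same length")
    pairs = list(zip(reference, test))
    tp = sum(1 for r, t in pairs if r and t)
    fn = sum(1 for r, t in pairs if r and not t)
    fp = sum(1 for r, t in pairs if not r and t)
    tn = sum(1 for r, t in pairs if not r and not t)
    n = tp + fn + fp + tn

    nan = float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else nan
    specificity = tn / (tn + fp) if (tn + fp) else nan

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else nan

    return {"sensitivity": sensitivity, "specificity": specificity, "kappa": kappa}


# Toy data: 10 hypothetical patients, 4 with the condition per free text.
free_text = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
problem_list = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
result = diagnostic_agreement(free_text, problem_list)
```

In the toy data, the problem list finds only 1 of 4 true cases (sensitivity 25%) while rarely flagging patients without the condition, which is the low-sensitivity, high-specificity pattern that this kind of comparison is designed to detect.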

We compared mean CCI scores based on the problem list and on free-text notes using a paired t test for difference in means. Continuous and categorical (0, 1, 2, and ≥3) versions of CCI scores for each assessment method were also compared using Pearson’s correlation and Cohen’s kappa, respectively. 
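The score-level comparisons can be sketched in the same way. The snippet below is an illustrative, dependency-free version of the categorization into 0, 1, 2, and ≥3, the Pearson correlation, the paired t statistic, and a categorical Cohen's kappa; all function names and the toy scores are assumptions for this example, and a statistical package would normally be used to obtain P values and confidence intervals.

```python
import math
from collections import Counter


def categorize(score):
    """Collapse a CCI score into 0, 1, 2, or 3, where 3 stands for '>=3'."""
    return min(score, 3)


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def paired_t(x, y):
    """Mean paired difference (x - y) and the paired t statistic."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)
    return mean_d, mean_d / math.sqrt(var_d / n)


def categorical_kappa(a, b):
    """Cohen's kappa for two equal-length lists of category labels."""
    n = len(a)
    observed = sum(1 for u, v in zip(a, b) if u == v) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / n**2
    return (observed - expected) / (1 - expected)


# Toy scores for 6 hypothetical patients.
free_text_cci = [0, 1, 2, 4, 1, 0]
problem_list_cci = [0, 0, 1, 1, 1, 0]

r = pearson_r(free_text_cci, problem_list_cci)
mean_diff, t_stat = paired_t(free_text_cci, problem_list_cci)
kappa = categorical_kappa(
    [categorize(s) for s in free_text_cci],
    [categorize(s) for s in problem_list_cci],
)
```

A positive mean difference here indicates that free-text scores exceed problem list scores on average, which is the direction of underestimation the comparison is meant to reveal.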

Multivariable competing risks regression as described by Fine and Gray31 was used to compare prediction of survival by CCI scores derived from each comorbidity assessment method. We calculated subhazard and cumulative incidence of non–prostate cancer mortality by CCI scores for each comorbidity assessment method, treating prostate cancer as a competing risk. All multivariable regression models were adjusted for age, race, VA site, clinical stage, Gleason score, PSA, and treatment type. 
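To make the idea of cumulative incidence under competing risks concrete, the sketch below computes an unadjusted, nonparametric cumulative incidence curve (the Aalen-Johansen form). It is not the multivariable Fine and Gray regression used in the study, which requires a statistical package; the function name `cumulative_incidence`, the event coding (0 = censored, 1 = non–prostate cancer death, 2 = prostate cancer death), and the toy data are assumptions for illustration only.

```python
def cumulative_incidence(times, events, cause):
    """Nonparametric cumulative incidence of one cause with competing risks.

    times:  follow-up time for each patient
    events: 0 if censored, otherwise an integer cause-of-death code
    cause:  the cause code whose cumulative incidence is wanted
    Returns a list of (time, cumulative incidence) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    overall_survival = 1.0  # probability of surviving all causes so far
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        j = i
        d_cause = d_any = 0
        # Count the events tied at time t.
        while j < len(data) and data[j][0] == t:
            if data[j][1] != 0:
                d_any += 1
                if data[j][1] == cause:
                    d_cause += 1
            j += 1
        if d_any:
            # Increment uses survival just before t, then survival is updated.
            cif += overall_survival * d_cause / n_at_risk
            overall_survival *= 1 - d_any / n_at_risk
            curve.append((t, cif))
        n_at_risk -= j - i  # remove deaths and censored patients at time t
        i = j
    return curve


# Toy cohort of 5 hypothetical patients (times in years).
years = [1, 2, 3, 4, 5]
status = [1, 2, 0, 1, 1]
other_cause_curve = cumulative_incidence(years, status, cause=1)
prostate_curve = cumulative_incidence(years, status, cause=2)
```

Unlike a Kaplan-Meier estimate that censors competing deaths, this estimator does not treat men who died of prostate cancer as still at risk of other-cause death, so the cause-specific curves sum to at most 1.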

We then conducted a sensitivity analysis to determine if year of diagnosis affected our results, as use of the EHR problem list may have changed over time. We subdivided our group into those diagnosed in 2001 or earlier (n = 810) and 2002 or later (n = 786) and repeated our analyses. 

We used P <.05 to denote statistical significance, and all tests were 2-sided. All statistical analyses were performed in Stata, version 12.0 (Stata, Inc; College Station, Texas). 

Results


The majority of the sample (N = 1596) comprised white (44%) and African American men (37%). Approximately one-half of the sample was 65 years or younger, and most had early-stage prostate cancer (Table 1). 

EHR problem list–based comorbidity assessment had poor sensitivity but high specificity for identification of common major comorbidities (Table 2). Sensitivity values for EHR problem list–based assessment (using free-text–based assessment as the gold standard) ranged from 8% for myocardial infarction to 46% for diabetes. Specificity was above 94% for all comorbidities. Agreement between EHR problem list–based and free-text–based comorbidity assessment was poor for all major comorbidities, with kappa ranging from 0.02 to 0.44. Results did not change when the sample was subdivided into men diagnosed in 2001 or earlier and men diagnosed in 2002 or later.

Comparison of the CCI scores based on the EHR problem list and on free text showed that EHR problem list–based assessment underestimated comorbidity burden (Table 3). Agreement across all scores was 53% (840/1596). Pearson correlation for continuous scores was 0.3, and kappa for categorical scores was 0.2. Among scores that were discordant, 82% (627/765) were higher using free-text–based compared with EHR problem list–based comorbidity assessment. Mean free-text–based and EHR problem list–based CCI scores were 1.1 (95% CI, 1.04-1.19) and 0.5 (95% CI, 0.46-0.56), respectively. Free-text–based scores were significantly higher by a mean of 0.6 points (95% CI, 0.53-0.67; P <.001). 

Competing risks regression analysis showed that free-text–based CCI scores predicted noncancer mortality, whereas EHR problem list–based scores did not. Higher free-text–based CCI scores were associated with a graduated increase in hazard of other-cause mortality (Table 4), and 10-year cumulative incidence of noncancer mortality associated with free-text–based scores was 41%, 57%, 64%, and 83% for CCI scores of 0, 1, 2, and ≥3, respectively (Figure, part A). Higher EHR problem list–based CCI scores were not associated with increased mortality risk (Table 4), and EHR problem list–based CCI scores did not discriminate 10-year cumulative incidence of noncancer mortality: 55%, 64%, 62%, and 58% for CCI scores of 0, 1, 2, and ≥3, respectively (Figure, part B). Results did not change when the sample was subdivided into men diagnosed in 2001 or earlier and men diagnosed in 2002 or later.

Discussion


Despite interest in capitalizing on readily available problem list data in the EHR for purposes of risk adjustment, our findings suggest that these data should be validated prior to application to performance assessment. The sensitivity of the VA problem list for identifying common major comorbidities was poor, ranging from 8% to 46%, compared with manual free-text note abstraction. This lack of accuracy led to incorrect risk adjustment and would lead to unfair comparisons of clinicians’ quality of care if applied directly to risk adjustment for performance assessment. These data may be generalizable to more contemporary EHRs because comorbidity information in problem lists and outpatient claims still is not actively maintained. Absent a valid method of adjustment for comorbidity, it is not possible to confidently distinguish between physicians or groups who provide poor care and those who disproportionately see patients with greater disease burden. Because measures of quality of care are now being tied to compensation in programs like value-based purchasing,32 the stakes are higher and the consequences of errors in performance assessment are much more substantial. These errors are magnified when the units of comparison are smaller (eg, physician groups or individual physicians).

Previous studies have shown that failure to adjust for case mix can alter quality rankings.33-37 For example, in a comparison of functional outcomes at 4 hospitals 1 year after total hip replacement, significant differences in institutional outcomes were apparent before adjustment for comorbid disease status but not after.33 In our study, CCI scores from free-text–based assessment predicted mortality, whereas EHR problem list–based scores did not, likely due to the latter’s poor sensitivity for detecting major comorbid diseases. When used for clinical purposes, such as risk adjustment of quality scores, rather than for billing and utilization purposes, comorbidity scores relying on EHR diagnoses could be misleading.

Inaccuracy in identification of major comorbid conditions and its impact on risk adjustment could have significant financial consequences for physicians and physician groups participating in accountable care organizations (ACOs). ACOs, such as those participating in the Medicare Shared Savings Program, require participating organizations to report certain quality measures to CMS.13 Only providers who meet the program’s quality performance benchmarks will be eligible to receive a share of cost savings, which can be substantial. For example, in the Physician Group Practice Demonstration, 7 of 10 groups received $108 million in shared savings payments over the course of the demonstration.38 Providers who do not meet quality benchmarks will be subject to penalties. Although early ACOs have started with fairly undemanding process benchmarks for quality, such as the Healthcare Effectiveness Data and Information Set measures for chronic disease management, which are met by a national average of 79% to 80% of groups,11 quality benchmarks will likely become more stringent over time, as evidenced by the inclusion of all-cause readmission in quality scoring in the third year of participation in the Medicare Shared Savings Program.39 There is an almost 2-fold variation in readmissions across different areas of the United States,40 and these differences have been found to be sensitive to risk adjustment for comorbidity.14-16 Although ACOs have shown prudence in their meticulous and transparent statistical methodology for risk adjustment,13 there has been little focus on the quality of the comorbidity data.

Performance assessment in the outpatient setting may be particularly susceptible to failures in risk adjustment, as outpatient claims typically used for risk adjustment are closely linked with problem list data. In contrast with current approaches to risk adjustment in hospital quality rankings, in which inpatient claims data are reviewed by trained hospital personnel, outpatient claims data may be taken directly from problem list data without corroboration by trained coding teams. Furthermore, outpatient claims may be limited to those principal diagnoses for which patients seek care in the outpatient setting and may not represent the total comorbid disease burden of the individual. This could explain the apparent underestimation of CCI scores by EHR problem list–based assessment in our study. Conversion from ICD-9 to International Classification of Diseases, Tenth Revision coding algorithms is also unlikely to remedy these deficiencies, as the new coding platform is more apt to improve the specificity, but not the sensitivity, of claims data. Previous work has highlighted the deficiencies of outpatient claims–based approaches to comorbidity assessment.3-5 Yet despite these concerns, outpatient claims data are frequently used for risk adjustment for numerous applications, including quality assessment.41,42

Although the accuracy of contemporary EHR diagnosis lists may differ from that of the VA problem list of the early 2000s, emphasis still is not placed on the accuracy of this data element more than a decade later, and it is likely that these deficiencies still persist, even outside of the VA. Clinicians must voluntarily add or subtract diagnoses through a process separate from billing. Contemporary EHR diagnosis lists are not routinely updated by administrative or clinical staff, and there is no compelling incentive for reconciliation of the data. Diagnoses are also collected cumulatively without expiration dates for time-limited or rule-out diagnoses, and some problem lists are truncated to the most current issues, which can unintentionally omit major chronic diagnoses. These facts suggest that more modern EHRs are subject to many of the drawbacks that led to the extremely poor results observed in our study. A study of the Massachusetts General Hospital Primary Care Practice-Based Research Network database from 2005 to 2011, comparing assessment of patient complexity through comprehensive EHR data (demographic, diagnostic, procedure, medication, laboratory, and utilization) versus outpatient CCI scores versus a commercial risk predictor, showed more than 60% disagreement among methods.2 The results of a study of contemporary outpatient claims data have also shown poor accuracy compared with registry data obtained from chart review.1 Taken together, these findings argue for more active management of EHR diagnosis lists (and outpatient claims) going forward.

Limitations


Our study has several limitations. First, because hospital system EHR training varies by institution, it is possible, although highly unlikely, that the accuracy of the VA problem list in the 2 centers in our study differs from that observed across the entire VA. Second, the composition of our VA sample (ie, exclusively male and disproportionately African American) may affect generalizability to other populations. Third, the Deyo-Charlson and original CCI scoring systems define comorbidities somewhat differently, so scores from the 2 systems are not strictly comparable; however, these slight differences would not explain the large disparity observed in our study. Fourth, free-text–based comorbidity assessment included chart review of the comprehensive medical record within 12 months before and after diagnosis of prostate cancer, whereas EHR problem list–based comorbidity assessment included all diagnoses until 12 months after diagnosis. This intentional difference in methodology accounts for the fact that medical notes routinely document the cumulative chronic disease burden of the individual, whereas the problem list relies on date of diagnosis. This difference would tend to enhance sensitivity of the EHR problem list–based approach, opposing the direction of effect observed in our study.

Conclusions


Although there is great enthusiasm for using readily available comorbidity data in EHRs for risk adjustment in performance assessment, the findings of this proof-of-concept study suggest that such data need to be corroborated prior to application. If these findings are replicated in modern EHRs, what can be done? One approach is to develop an automated electronic search strategy to review free-text notes for key words or phrases indicating comorbid diseases, but the programming efforts for this strategy are daunting.3 We and others have taken another approach: to gather standardized quantifiable information from patients. This strategy has been shown in numerous studies to be accurate and reliable in the assessment of comorbidity.23,24 As ACOs and other entities increasingly use achievement of quality benchmarks, such as 30-day readmissions and all-cause mortality, to allot shared savings bonuses and penalties, the quality of risk adjustment data will be increasingly important, especially in the outpatient setting, where claims data are taken from the problem list. The stakes are high for quality indicators to be unbiased and fair in order to avoid unduly penalizing providers and organizations who care for the sickest patients.