Electronic Medical Records for Clinical Research: Application to the Identification of Heart Failure

Published Online: June 01, 2007
Serguei Pakhomov, PhD; Susan A. Weston, MS; Steven J. Jacobsen, MD, PhD; Christopher G. Chute, MD, DrPH; Ryan Meverden, BS; and Véronique L. Roger, MD, MPH

Objective: To identify patients with heart failure (HF) by using language contained in the electronic medical record (EMR).

Methods: We validated 2 methods of identifying HF through the EMR, which offers transcription of clinical notes within 24 hours of the encounter. The first method was natural language processing (NLP) of the EMR text. The second method was predictive modeling based on machine learning, using the text of clinical reports. Natural language processing was compared with both manual record review and billing records. Predictive modeling was compared with manual record review.

Results: Natural language processing identified 2904 HF cases; billing records independently identified 1684 HF cases, 252 (15%) of them not identified by NLP. Review of a random sample of these 252 cases did not identify HF, yielding 100% sensitivity (95% confidence interval [CI] = 86, 100) and 97.8% specificity (95% CI = 97.7, 97.9) for NLP. Manual review confirmed 1107 of the 2904 cases identified by NLP, yielding a positive predictive value (PPV) of 38% (95% CI = 36, 40). Predictive modeling yielded a PPV of 82% (95% CI = 73, 93), 56% sensitivity (95% CI = 46, 67), and 96% specificity (95% CI = 94, 99).

Conclusions: The EMR can be used to identify HF via 2 complementary approaches. Natural language processing may be more suitable for studies requiring highest sensitivity, whereas predictive modeling may be more suitable for studies requiring higher PPV.

(Am J Manag Care. 2007;13(part 1):281-288)

Two methods, natural language processing and predictive modeling, were used to identify patients with heart failure from electronic medical records.

Both approaches enable accurate and timely case identification as soon as the text of a clinical note becomes available electronically, avoiding the delays and biases associated with manual coding.

Natural language processing may be more suitable for studies requiring the highest sensitivity, such as observational studies.

Because of its higher positive predictive value, the predictive-modeling approach is a better screening mechanism for clinical trials.

The electronic medical record (EMR) is increasingly used in healthcare.1 Its clinical goals include streamlining clinical practice and improving patient safety. In addition to improving practice, the EMR offers promising methods for identification of potential study participants, which is essential for clinical research. Indeed, although the use of manually coded patient records in clinical research is a long-standing tradition,2,3 these methods must allow for a delay between the diagnosis and the assignment of the code. In addition, coding systems have variable yields in identifying patients, depending on the disease under consideration, and are subject to shifts related to changing reimbursement incentives.4 Use of coding systems to identify patients appears particularly problematic for heart failure (HF) because of its syndromic nature, which precludes its ascertainment from a single diagnostic test.5,6 The EMR may enable efficient case identification by providing access to clinical reports as soon as they are transcribed; however, novel methods of identification that use the EMR require rigorous validation.5 Finding patient records that meet predefined clinical criteria lends itself well to statistical classification algorithms.7-14 Natural language processing (NLP) systems such as the Medical Language Extraction and Encoding System have been used to identify cases of interest either directly by defining a terminologic profile of a case or indirectly by extracting covariates for predictive modeling.13,15,16 To our knowledge, there have been no large-scale studies that examined the validity of both NLP and statistical methods for identification of patients with HF.

We report here on use of the EMR that currently is in place at the Mayo Clinic17 for prospective recruitment of patients with HF.18 The goal of our study was to validate 2 approaches to rapid prospective identification of patients with HF. One approach uses NLP of the EMR; the other uses predictive modeling.


The study design—including the data sources, processing components, data flow, and evaluation—is shown in Figure 1.

Mayo Clinic Electronic Medical Record

For this study, we used 2 data sources available as part of the Mayo Clinic EMR: clinical notes and diagnostic codes.

Clinical Notes. Clinical notes dictated by healthcare providers at the Mayo Clinic first became available electronically in 1994. These are electronic records that document each inpatient and outpatient encounter, and contain the text of the medical dictations transcribed by trained medical transcriptionists (for an example, see Figure 2). The Mayo Clinic EMR complies with the American National Standards Institute Clinical Document Architecture, which is a widely accepted standard for clinical documentation.19 Most of the Mayo Clinic clinical notes are transcribed within 24 hours of the patient-physician encounter.

Diagnostic Codes. Patient-physician encounters are coded using International Classification of Diseases, Ninth Revision (ICD-9) diagnostic codes. The codes are assigned by trained medical coders as part of the routine billing process within 30 days of the encounter.

Use of Natural Language Processing
The NLP case-finding algorithm was piloted in October 2003.20 The algorithm uses nonnegated terms indicative of HF: cardiomyopathy, heart failure, congestive heart failure, pulmonary edema, decompensated heart failure, volume overload, and fluid overload. To maximize sensitivity, all available synonyms (n = 426) for these terms were used as well. The synonyms were found by automatically searching a database of 16 million problem-list entries composed of diagnostic phrases expressed in natural language. These phrases are manually coded by trained staff as part of the Rochester Epidemiology Project,3 using a hospital adaptation of the International Classification of Diseases.21 Diagnostic phrases were considered synonymous if they were assigned the same code (eg, phrases such as heart failure, CHF [congestive heart failure], biventricular failure, and cardiopulmonary arrest were treated as synonymous).22 In addition to synonyms, the NLP algorithm relies on finding nonnegated terms by excluding those terms that have negation indicators (eg, "no," "denies," "unlikely") in their immediate context (±7 words). To identify potential cases of HF, the algorithm searched for the terms indicative of HF and their synonyms in the text of clinical notes as soon as the notes were dictated, transcribed, and available electronically. Once a term was found, its negation status was determined. If that instance of the term was negated, it was ignored as evidence of HF in the clinical note. However, the note was still identified as containing evidence of HF if another instance of the same term was found in a nonnegated context. The algorithm was implemented in the Perl programming language as an application that runs inside a JBoss Application Server. The application continually "listened" to the live stream of clinical notes generated within the Mayo Integrated Clinical Systems production environment.
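
The negation rule described above can be illustrated with a minimal sketch. This is not the study's Perl implementation: the term list is a small illustrative subset of the 7 index terms and 426 synonyms, the tokenizer is a simplifying assumption, and the function and variable names are hypothetical. A note is flagged when at least one HF term appears with no negation indicator within 7 words on either side.

```python
import re

# Illustrative subset only; the study used 7 index terms plus 426 synonyms.
HF_TERMS = [
    "congestive heart failure", "heart failure", "cardiomyopathy",
    "pulmonary edema", "volume overload", "fluid overload",
]
# Negation indicators quoted in the text; the full list is not given there.
NEGATION_CUES = {"no", "denies", "unlikely"}
WINDOW = 7  # words inspected on each side of a matched term


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def has_hf_evidence(note_text):
    """Return True if at least one HF term occurs in a nonnegated context."""
    tokens = tokenize(note_text)
    for term in HF_TERMS:
        term_tokens = term.split()
        n = len(term_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] != term_tokens:
                continue
            # Collect up to WINDOW words before and after the matched term.
            context = tokens[max(0, i - WINDOW):i] + tokens[i + n:i + n + WINDOW]
            if not NEGATION_CUES.intersection(context):
                return True  # a single nonnegated mention is sufficient
    return False
```

Under these assumptions, a note in which every mention is negated is ignored, whereas a note with one negated and one nonnegated mention is still flagged, which mirrors the sensitivity-oriented design described in the text.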

After the pilot, we conducted periodic verifications of the method to ensure that no patients with HF were being omitted by comparing the results of the algorithm with the billing codes. We extracted all unique patient identifiers using ICD-9 code 428.x (heart failure) for the period between October 10, 2003, and May 31, 2005, and compared them with the patients identified by the NLP system. All cases found by the NLP system since October 2003 were reviewed by nurse abstractors for HF criteria as part of the ongoing Heart Failure Surveillance Study.23 The results of this manual review were used for validation of the NLP method (Figure 1, Phase I).
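
The periodic verification against billing data amounts to a set comparison of patient identifiers. The sketch below assumes billing records are available as (patient identifier, ICD-9 code) pairs; the record layout and the function name are hypothetical and are not taken from the Mayo Clinic systems.

```python
def compare_case_sets(nlp_patients, billing_records):
    """Compare NLP-identified patients with patients billed under ICD-9 428.x.

    billing_records is an iterable of (patient_id, icd9_code) pairs
    (an assumed layout, used here only for illustration).
    """
    # Keep unique patients whose code falls in the 428.x (heart failure) family.
    billing_hf = {
        pid for pid, code in billing_records if code.split(".")[0] == "428"
    }
    nlp = set(nlp_patients)
    return {
        "both": nlp & billing_hf,
        "nlp_only": nlp - billing_hf,
        "billing_only": billing_hf - nlp,  # candidates possibly missed by NLP
    }
```

The "billing_only" group is the one of interest for sensitivity: these are the patients who would be sampled for manual review to check whether the NLP method had missed true cases.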

Use of Predictive Modeling
Predictive Modeling Algorithm. Prior studies have reported on using predictive modeling techniques including logistic regression, classification trees, and neural networks for clinical decision support and outcomes prediction.24,25 A comparative validation of these 3 approaches showed that logistic regression outperformed the other methods.26 Although traditional logistic regression relies on small sets of well-defined clinical covariates, predictive modeling based on the text of clinical notes draws its covariates from the open-ended vocabulary of the notes and may include more than 10 000 items whose relative contribution to the categorization decisions is unknown. Thus, large-scale predictive modeling based on clinical notes requires algorithms specifically designed to process large numbers of covariates. Naïve Bayes27 is one such approach that is robust, highly efficient, and widely used in text classification. It has been shown to be functionally equivalent to logistic regression.28 This algorithm chooses the most likely outcome given a set of predictive covariates. In the present study, the outcome is dichotomous (HF positive vs HF negative) and the covariates are words found in the clinical notes. The likelihood of an outcome is computed based on its co-occurrence frequency with each of the predictive covariates. Compared with more sophisticated techniques, naïve Bayes is also fast to train and does not require large amounts of computing resources.
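
As an illustration of how a naïve Bayes classifier chooses the most likely outcome from word co-occurrence counts, the following sketch implements a multinomial variant with add-one smoothing. The text does not state which variant or smoothing scheme was used in the study, so both are assumptions, as are the function names.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels):
    """Count how often each word co-occurs with each outcome.

    docs is a list of word lists; labels is a parallel list of outcomes.
    """
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def classify(model, words):
    """Return the outcome with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior of the outcome
        for w in words:
            if w not in vocab:
                continue  # words never seen in training carry no information
            # Add-one smoothing (an assumption) avoids zero probabilities.
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training is a single counting pass over the notes, which is why the approach scales to vocabularies of many thousands of covariates.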

Covariate Extraction. To extract covariates from text, we split the text of the clinical notes into single words listed in no particular order ("bag-of-words" representation29). We collected 2048 random clinical notes manually verified to contain evidence of HF (HF-positive examples) and 2048 random notes with no HF (HF-negative examples). Each note was then represented in terms of the vocabulary contained in all notes (see Figure 3), with the exception of 124 stop words (eg, "the," "a," "on"). We used the entire vocabulary of 10 179 covariates without any restrictions.
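
A minimal sketch of the bag-of-words representation follows. Only the 3 stop words quoted above are included (the study excluded 124), and representing each note by word counts rather than by simple presence or absence is an assumption, since the text does not specify it; all names are hypothetical.

```python
import re
from collections import Counter

# Only the 3 examples quoted in the text; the study excluded 124 stop words.
STOP_WORDS = {"the", "a", "on"}


def bag_of_words(note_text):
    """Split a note into lowercase words, ignoring order and stop words."""
    words = re.findall(r"[a-z]+", note_text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def build_vocabulary(notes):
    """Collect every non-stop word seen in any note, in sorted order."""
    vocab = set()
    for note in notes:
        vocab.update(bag_of_words(note))
    return sorted(vocab)


def to_vector(note_text, vocabulary):
    """Represent one note as word counts over the shared vocabulary."""
    counts = bag_of_words(note_text)
    return [counts[w] for w in vocabulary]
```

Each note is thus mapped to a vector with one position per vocabulary word, so that all notes share the same set of covariates regardless of their length.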

Training and Testing Data. We sampled 1000 HF-positive and 1000 HF-negative examples at random from the entire collection of 4096. We set aside 200 (20%) of each half for testing and used the remaining 800 (80%) for training (Figure 1, Phase II). The test set was created by combining one third of the HF-positive testing examples with two thirds of the HF-negative testing examples to reflect the proportion of HF-positive examples in the data. (The proportion was determined during periodic verifications of the NLP method. A little more than one third of all patients identified by the NLP method were manually confirmed to have HF.) The training set was created by combining 200 HF-positive examples with 600 HF-negative examples to force the predictive modeling algorithm to favor HF-negative cases and thus maximize the positive predictive value (PPV). A predictive model was then trained using the training set and tested on the test set.
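
The sampling scheme can be sketched as follows. The exact sizes of the final test set are not stated in the text, so rounding one third and two thirds of 200 down to whole notes is an assumption, as are the random seed and the function name.

```python
import random


def build_splits(positives, negatives, seed=0):
    """Build training and test sets with the class proportions described above."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    pos = rng.sample(positives, 1000)
    neg = rng.sample(negatives, 1000)

    # Hold out 200 (20%) of each class for testing; 800 (80%) remain for training.
    pos_test, pos_train = pos[:200], pos[200:]
    neg_test, neg_train = neg[:200], neg[200:]

    # Test set: one third of the held-out positives and two thirds of the
    # held-out negatives (integer rounding is an assumption).
    test = pos_test[:200 // 3] + neg_test[:2 * 200 // 3]

    # Training set: 200 positives and 600 negatives, which biases the model
    # toward the HF-negative outcome and so favors a higher PPV.
    train = pos_train[:200] + neg_train[:600]
    return train, test
```

Because the held-out notes are separated before either set is assembled, no note can appear in both the training set and the test set.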
