Published Online: June 01, 2007
Serguei Pakhomov, PhD; Susan A. Weston, MS; Steven J. Jacobsen, MD, PhD; Christopher G. Chute, MD, DrPH; Ryan Meverden, BS; and Véronique L. Roger, MD, MPH
This study also has unique strengths. It used a large dataset (more than 3000 patients) that was developed over a period of 3 years and involved complete manual records abstraction for validation. Another strength is that this study addressed identification of HF patients, whose diagnosis is complex and relies in part on the language found in the unrestricted text of the EMR. The 2 methods described here were tested on the same population, which made it possible to determine their respective yields in the same dataset and to define their potential application separately or in combination with predictable results.
Another advantage of the NLP approach is that it is not restricted to a specific data element or a specific location in the EMR. The Mayo Clinic EMR maintains a diagnostic problem list that is used to summarize the main findings entered into the clinical note. The notes contain problem-list entries as numbered items inside the Impression/Report/Plan and the Final Diagnosis sections. The NLP algorithm described in this article does not take advantage of the problem-list items. Instead, the search is performed across the entire text of the note in an attempt to capture symptom information. Therefore, our NLP strategy may be used in EMR systems that do not routinely use problem-list entries. The feasibility of using the NLP strategy in other EMR systems and for other conditions will be assessed in subsequent work.
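The idea of searching the whole note rather than only its problem-list sections can be illustrated with a minimal sketch. This is not the algorithm used in the study: the term list, the function names, and the sample note are hypothetical, and a simple keyword match of this kind ignores negation and context.

```python
import re

# Hypothetical term list for illustration only; not the terms used in the study.
HF_TERMS = ["heart failure", "cardiomyopathy", "pulmonary edema"]

# One case-insensitive pattern with word boundaries around each term.
HF_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in HF_TERMS) + r")\b",
    re.IGNORECASE,
)


def find_hf_mentions(note_text):
    """Return every matched term found anywhere in the note, lowercased."""
    return [match.group(0).lower() for match in HF_PATTERN.finditer(note_text)]


def flag_note(note_text):
    """Flag the note if any term appears, regardless of which section it is in."""
    return bool(find_hf_mentions(note_text))


# Hypothetical note: the mention sits in the history, not in the diagnosis list.
note = (
    "HISTORY: Patient reports dyspnea on exertion. "
    "Prior echo consistent with congestive heart failure.\n"
    "FINAL DIAGNOSIS: 1. Hypertension"
)

print(flag_note(note), find_hf_mentions(note))
```

Because the match runs over the full text, a note is flagged even when the numbered diagnosis list omits the condition, which is why such a search does not depend on problem-list entries being present.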
Implications for Clinical Research
Although the highly sensitive NLP method may be more appropriate as a screening mechanism for observational studies, the predictive modeling method may be more suitable for clinical trials, where stricter inclusion criteria may be required. Indeed, the NLP method often involves subsequent manual abstraction of medical records, and a highly sensitive screening tool will direct manual data collection. The predictive modeling method is based on selected populations and is more concerned with the efficient enrollment of patients who fit study inclusion and exclusion criteria. For clinical trials, the predictive modeling approach, with its higher PPV, is therefore the better screening mechanism.
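The trade-off between sensitivity and PPV described above can be made concrete with a short sketch. The counts below are hypothetical and chosen only to illustrate the arithmetic; they are not results from this study, and the names `sensitive_screen` and `specific_screen` are illustrative labels.

```python
def sensitivity(tp, fn):
    """Fraction of true cases that the screen flags: TP / (TP + FN)."""
    return tp / (tp + fn)


def ppv(tp, fp):
    """Fraction of flagged records that are true cases: TP / (TP + FP)."""
    return tp / (tp + fp)


def records_reviewed_per_true_case(tp, fp):
    """Flagged records a reviewer must read to confirm one true case."""
    return (tp + fp) / tp


# Hypothetical counts for 100 true cases; NOT data from the study.
sensitive_screen = {"tp": 95, "fp": 60, "fn": 5}   # misses few cases, many false alarms
specific_screen = {"tp": 80, "fp": 10, "fn": 20}   # misses more cases, few false alarms

for name, c in [("sensitive", sensitive_screen), ("specific", specific_screen)]:
    print(
        name,
        round(sensitivity(c["tp"], c["fn"]), 3),
        round(ppv(c["tp"], c["fp"]), 3),
        round(records_reviewed_per_true_case(c["tp"], c["fp"]), 2),
    )
```

Under these assumed counts, the high-sensitivity screen misses few cases but requires more manual review per confirmed case, whereas the high-PPV screen wastes less enrollment effort at the cost of missed cases.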
We acknowledge Kay Traverse, RN, and Susan Stotz, RN, for manual review of patient records.
Author Affiliations: From the Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, Minn (SP, SAW, CGC, RM); the Department of Research and Evaluation, Kaiser Permanente, Pasadena, Calif (SJJ); and the Division of Cardiovascular Diseases, Department of Medicine, Mayo Clinic, Rochester, Minn (VLR).
Funding Sources: This work was supported by NIH grants RO1-72435, GM14321, and AR30582; NLM Training Grant in Medical Informatics (T15 LM07041-19); and the NIH Roadmap Multidisciplinary Clinical Research Career Development Award Grant (K12/NICHD-HD49078).
Corresponding Author: Serguei V. Pakhomov, PhD, Department of Health Sciences Research, Mayo Clinic, 200 First St SW, Rochester, MN 55905. E-mail: firstname.lastname@example.org.
Author Disclosure: The authors (SP, SAW, SJJ, CGC, RM, VLR) report no relationship or financial interest with any entity that would pose a conflict of interest with the subject matter discussed in this manuscript.
Authorship Information: Concept and design (SP, SJJ, CGC, VLR); acquisition of data (SP, SJJ, RM, VLR); analysis and interpretation of data (SP, SAW, SJJ, CGC, RM); drafting of the manuscript (SP, SJJ); critical revision of the manuscript for important intellectual content (SP, SAW, SJJ, VLR); statistical analysis (SP, SAW, RM, VLR); provision of study materials or patients (SP); obtaining funding (SP, VLR); administrative, technical, or logistic support (SP, CGC, VLR); and supervision (SP, CGC, VLR).