Comparing Breast Cancer Case Identification Using HMO Computerized Diagnostic Data and SEER Data

The American Journal of Managed Care, April 2004, Volume 10, Issue 4

Objectives: To determine the sensitivity and positive predictive value (PPV) of computerized diagnostic data from health maintenance organizations (HMOs) in identifying incident breast cancer cases.

Study Design: An HMO without a cancer registry developed an algorithm identifying incident breast cancer cases using computerized diagnostic codes. Two other HMO sites with Surveillance, Epidemiology, and End Results (SEER) registries duplicated this case-identification approach. Using the SEER registries as the criterion standard, we determined the sensitivity and PPV of the computerized data.

Methods: Data were collected from HMO computerized databases between January 1, 1996, and December 31, 1999. Surveillance, Epidemiology, and End Results data were also used.

Results: The overall sensitivity of the HMO databases was between 0.92 (95% confidence interval [CI], 0.91-0.96) and 0.99 (95% CI, 0.98-0.99). Sensitivity was high (range, 0.94-0.98) for the first 3 (of 4) years, dropping slightly (range, 0.81-0.94) in the last year. The overall PPV ranged from 0.34 (95% CI, 0.32-0.35) to 0.44 (95% CI, 0.42-0.46). Positive predictive value was low in the first year (range, 0.18-0.20) but rose sharply thereafter, reaching 0.83 and 0.92 in the last year, because prevalent cases were excluded. Review of a random sample of 50 cases identified in the computerized databases but not by SEER data indicated that, while SEER usually identified the cases, the registry did not associate every case with the health plan.

Conclusions: Health maintenance organization computerized databases were highly sensitive for identifying incident breast cancer cases, but PPV was low in the initial year because the systems did not differentiate between prevalent and incident cases. Health maintenance organizations depending solely on SEER data for cancer case identification will miss a small percentage of cases.

(Am J Manag Care. 2004;10:257-262)

Methods commonly used to identify cancer cases in a defined population for observational research include geographically based registries such as the Surveillance, Epidemiology, and End Results (SEER) registries, hospital records, and ambulatory care data. Health maintenance organizations (HMOs) could provide an important opportunity to identify cancer cases in defined populations for which computerized medical data exist. As such, they offer promise as an important resource for cancer research in etiology, prognosis, and treatment efficacy, as well as for studies of clinical care.1 However, there is limited information regarding the accuracy of these data systems in the published literature, and we could not find a comparison to SEER registries, which are generally accepted as the most valid system of cancer case identification. While SEER has been used as the criterion standard in studies2-5 of Medicare claims data, published studies6-8 of managed care claims algorithms have not used the SEER standard, but have compared the findings of their algorithms with the medical record. Demonstrating the ability to use computerized data will enable studies on breast cancer without the age restrictions of Medicare.

While conducting a large, multisite study that identified incident breast cancer cases to determine the benefits of early screening and prophylactic mastectomy in high-risk women, we were interested in comparing breast cancer cases identified by HMO computerized diagnostic data with cases in HMOs identified by SEER registries. We therefore developed an algorithm for case identification that was duplicated by 2 HMO sites in regions with SEER registries. Using cancer registries as the criterion standard, we determined the sensitivity and positive predictive value (PPV) of the computerized data. We then looked for ways to increase the sensitivity and PPV. We also examined a sample of cases for which computerized databases disagreed with the registry data.


The participating sites are part of the Cancer Research Network, which consists of 11 HMOs and their research programs, enrollee populations, and databases. The Cancer Research Network aims to increase the effectiveness of preventive, curative, and supportive interventions that span the natural history of major cancers among diverse populations and health systems.

Three sites were involved in the breast cancer case-identification comparison. One site without a registry developed an algorithm for breast cancer case identification by searching for International Classification of Diseases, Ninth Revision (ICD-9-CM) diagnosis codes indicating breast cancer. These included codes 174.0 to 174.9 (malignant neoplasm of female breast: nipple and areola, central portion, upper inner quadrant, lower inner quadrant, upper outer quadrant, lower outer quadrant, axillary tail, other specified sites of female breast, and breast [female] unspecified) and 233.0 (carcinoma in situ of breast and genitourinary system: breast). The data were obtained from the HMO use database, which included diagnoses for each patient encounter in ambulatory and hospital settings.

This case-identification approach was then duplicated at 2 other HMOs with members residing within the catchment areas of their cancer registries. The list of women generated from the computerized identification scheme was compared with those cases identified from the SEER registries.
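The code search can be illustrated with a minimal Python sketch. The record layout (pairs of patient identifier and code string) and the function names are hypothetical and are not taken from any site's actual system; only the code range 174.0 to 174.9 and code 233.0 come from the algorithm described above.

```python
from typing import Iterable, Set, Tuple

# Hypothetical record layout: (patient_id, icd9_code) pairs from a use database extract.
Encounter = Tuple[str, str]


def is_breast_cancer_code(icd9_code: str) -> bool:
    """Return True for ICD-9-CM codes 174.0-174.9 or 233.0."""
    code = icd9_code.strip()
    if code == "233.0":
        return True
    # Accept 174.x with exactly one digit after the decimal point.
    return len(code) == 5 and code.startswith("174.") and code[4].isdigit()


def women_with_any_code(encounters: Iterable[Encounter]) -> Set[str]:
    """Collect every patient with at least one qualifying diagnosis code."""
    return {pid for pid, code in encounters if is_breast_cancer_code(code)}
```

The resulting set of identifiers is what would be compared against the list of cases received from the SEER registry.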

SEER Registries

Surveillance, Epidemiology, and End Results is a federal cancer surveillance program designed to collect information on diagnosis and treatment of all cancer patients within a geographic area. Registries contain information about incident cancer cases as reported by different sources, including health plans, hospitals, pathology departments, and death certificates. Each SEER registry identifies cases in a slightly different way. The data collected provide a basis for estimating national cancer incidence and monitoring trends. Both HMO sites regularly receive a file from SEER identifying all cancer cases for which their health plans are listed as offering the first course of treatment.

HMO Computerized Databases



The same process was used at both sites to identify women with a breast cancer diagnosis within the health plans' computerized databases between January 1, 1996, and December 31, 1999. Site 1 used the hospital information system for inpatient and short stays, and the appointment, registration, and patient accounting system for ambulatory visits. When women used services outside the health plan, diagnoses were captured from the claims system that contains billing diagnosis data from HCFA1500 (medical and surgical claims) and UB92 (institutional claims) forms. Outpatient codes (174.0-174.9 and 233.0) were pulled and then merged with hospital data to compare the HMO-generated list with that of the SEER registry. Subjects from site 2 were selected from hospital files using the same admission dates and codes.

Statistical Analysis

Using the SEER list of breast cancer cases as the criterion standard, we examined sensitivity (the proportion of SEER-identified cases that were also identified by the HMO data) and PPV (the probability that a person with a positive test result actually has the condition of interest) of the HMO-generated breast cancer cases in several ways. First, all women who had at least 1 notation of a breast cancer diagnosis code were identified. Second, we identified those who had a diagnosis code notation on at least 2 different days, because we theorized that for some women a breast cancer code might have been used initially when they were seen to rule out breast cancer. We also examined sensitivity and PPV by year of diagnosis. At site 1, data were stratified by age and stage of disease.
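The 1-notation and 2-notation rules, and the two accuracy measures, can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's actual program: the input layout (patient identifier plus date of a qualifying code) and both function names are hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Set, Tuple

# Hypothetical layout: (patient_id, date on which a breast cancer code was recorded).
Notation = Tuple[str, date]


def cases_with_min_days(notations: Iterable[Notation], min_days: int) -> Set[str]:
    """Patients with a breast cancer code on at least `min_days` different days."""
    days: Dict[str, Set[date]] = defaultdict(set)
    for pid, day in notations:
        days[pid].add(day)  # repeated codes on the same day count once
    return {pid for pid, seen in days.items() if len(seen) >= min_days}


def sensitivity_and_ppv(flagged: Set[str], registry: Set[str]) -> Tuple[float, float]:
    """Sensitivity and PPV of the flagged set, with the registry as criterion standard."""
    if not flagged or not registry:
        raise ValueError("both sets must be non-empty")
    true_positives = len(flagged & registry)
    return true_positives / len(registry), true_positives / len(flagged)
```

Calling `cases_with_min_days` with `min_days=1` and `min_days=2` yields the two candidate lists, each of which can then be passed to `sensitivity_and_ppv` together with the registry list.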

At site 1, we investigated in detail the reasons for disagreement between the SEER registry and HMO-generated list of breast cancer cases. Those identified by the computerized database (based on 2 notations) but not by SEER were checked for breast cancer identification within SEER data from previous years back to 1974. All cases for which SEER data indicated a previous diagnosis of breast cancer were assumed to be prevalent cases. Of the remaining records, for efficiency and because of cost considerations, we examined a random sample of 50 cases identified by the HMO computerized data but not by SEER, all 46 cases identified by SEER but not by computerized data, and a random sample of 12 cases identified by both methods. To determine breast cancer status and possible reasons for the discrepancies between the 2 approaches, the diagnoses noted with each visit of computerized records were examined by 2 of us (MBB, SWF). The physician reviewers, blinded as to which group each patient belonged, classified each woman's breast cancer status as "yes" (having breast cancer), "probable," "unlikely," "unknown," or "no" (not having breast cancer). In instances in which there was initial disagreement, the physicians discussed the case and reached agreement.


From January 1, 1996, to December 31, 1999, the number of incident cases identified by SEER was 1405 at site 1 and 7445 at site 2. In comparison, the HMO computerized database identified 4113 cases at site 1 and 19 610 cases at site 2 when only one notation of a breast cancer diagnosis was required. The numbers dropped substantially, to 3089 and 16 151, respectively, when at least 2 notations were required.

Table 1 presents the number of cases identified using the HMO computerized databases and the SEER registries, as well as the sensitivity and PPV of the HMO computerized data using 1- and 2-notation requirements. Using only one notation of a breast cancer diagnosis, the computerized database at site 1 identified 98.6% (1385/1405) (95% confidence interval [CI], 0.98-0.99) of cases identified by the SEER registry. However, it also identified 2728 women not in the SEER registry, for a PPV of 0.34 (95% CI, 0.32-0.35). At site 2, the database identified 95.7% (7127/7445) (95% CI, 0.95-0.96) of cases identified by the cancer registry, with a PPV of 0.36 (95% CI, 0.36-0.37).
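The site 1 figures for the 1-notation rule can be reproduced from the counts given above (1385 matched cases, 1405 SEER cases, 4113 HMO-identified cases). The text does not state how the confidence intervals were computed; the sketch below assumes a normal-approximation (Wald) interval, which is only one of several possible methods, and the function name is hypothetical.

```python
import math
from typing import Tuple


def proportion_with_wald_ci(successes: int, total: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Proportion with a normal-approximation (Wald) interval, an assumed method."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Site 1, one notation required: counts taken from the text.
sensitivity = proportion_with_wald_ci(1385, 1405)  # matched cases / SEER cases
ppv = proportion_with_wald_ci(1385, 4113)          # matched cases / HMO-identified cases

print("sensitivity: %.3f (95%% CI, %.2f-%.2f)" % sensitivity)
print("PPV: %.2f (95%% CI, %.2f-%.2f)" % ppv)
```

Under this assumption the rounded results agree with the reported values: a sensitivity of 0.986 (0.98-0.99) and a PPV of 0.34 (0.32-0.35).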

When requiring breast cancer diagnoses on at least 2 different dates, the sensitivity decreased slightly and the PPV increased. At site 1, requiring a second notation of a breast cancer diagnosis resulted in a sensitivity of 0.97 (95% CI, 0.96-0.98) and a PPV of 0.44 (95% CI, 0.42-0.46). At site 2, the sensitivity was 0.92 (95% CI, 0.91-0.96) and the PPV was 0.42 (95% CI, 0.41-0.43).

At both sites, sensitivity dropped somewhat, from 0.98 to 0.94 in the last year at site 1 and from 0.96 to 0.81 at site 2 (Table 2). When we looked for explanations, we found that some cases identified by SEER at the end of one year were not included in the HMO computerized database until early in the following year. In contrast, in later years of the study, PPV improved markedly, rising from lows of 0.20 and 0.18 in the first year to 0.83 and 0.92 in the last year.

Little variation was found in sensitivity by stage of diagnosis (stage 0, 1, 2, 3, or 4) or age at diagnosis (≤49, 50-64, or ≥65 years). Requiring at least 2 notations of breast cancer, sensitivity was 0.93 for women with stage 0, 0.98 to 0.99 for stages 1 and 2, and 0.80 for women with stage 3 disease. Small numbers may account for the variability, as there were only 16 women with stage 3 cancer. Sensitivity ranged between 0.96 and 0.97 for all age categories.

At site 1, HMO-generated data identified 1730 cases not identified by SEER. For 1222 (70.6%), review of SEER data back to 1974 revealed previous (prevalent) breast cancer. For the remaining 508 cases, we conducted blind review of a random 50 cases: 18 (36.0%) were classified as "yes," 5 (10.0%) as "probable," and 3 (6.0%) as "no," with the remaining 24 women (48.0%) as "unlikely" or "unknown" (Table 3). Of the 18 women classified as "yes," the SEER registrar found 15. Seven women received their care at affiliated hospitals of the plan (rather than at HMO-owned hospitals) and were mistakenly not reported by SEER to the health plan; these women should have been included. For 5 women, the breast cancer was a recurrence of a breast cancer diagnosed outside of the site 1 health plan and thus should not have been included as incident cases. Three were listed by SEER as having cancer other than breast cancer and therefore should not have been included.

When the 46 cases identified by SEER registries but not by the HMO-generated data were blindly reviewed, 9 (19.6%) were classified as "yes," 6 (13.0%) were "probable," 7 (15.2%) were classified as "no," 18 (39.1%) were "unknown" or "unlikely," and 6 women could not be classified because no additional information was found. Of the 9 women classified as "yes," 6 had only one notation of breast cancer and thus were not included, based on our requirement of 2 notations. These cases, therefore, were missed by the HMO algorithm. For the other 3 women, adequate data were not available to determine the reason for omission. Finally, to validate the physician review to identify incident cancer cases, we reviewed 12 concordant cases identified by both SEER and the HMO-generated data. When these 12 cases were blindly reviewed, 11 (91.7%) were classified as "yes" and 1 case was classified as "probable."


Computerized databases at 2 HMOs had high levels of sensitivity (0.92-0.99) in identifying incident breast cancer cases during 4 years, compared with cancer registries in their regions. Sensitivity dropped slightly in the last year, because some women who were identified late in a given year by the SEER registrar were entered into the health plan database in the following year.

Positive predictive value, however, was low (range, 0.34-0.44). Requiring at least 2 notations of a breast cancer diagnosis raised the PPV without substantially lowering sensitivity. Most important, PPV was much higher (range, 0.83-0.92) in the last year. By using data from the first year, we essentially prescreened for prevalent cases. Therefore, to improve PPV for incident cases, our results suggest obtaining data from at least 1 year before the actual period of interest so that most prevalent cases can be excluded (Table 2). Such an approach would minimize the resource-intensive step of medical record review for a large number of prevalent breast cancer cases.
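The look-back suggestion can be sketched as a simple filter: a woman whose breast cancer code first appears during the period of interest is kept, while one who already had a code during the look-back window is treated as a likely prevalent case. The function name, the input layout, and the choice of window are assumptions made for illustration only.

```python
from datetime import date
from typing import Iterable, Set, Tuple

# Hypothetical layout: (patient_id, date on which a breast cancer code was recorded).
Notation = Tuple[str, date]


def incident_candidates(
    notations: Iterable[Notation],
    lookback_start: date,
    study_start: date,
    study_end: date,
) -> Set[str]:
    """Patients coded during the study period but not during the look-back window."""
    seen_before: Set[str] = set()
    seen_during: Set[str] = set()
    for pid, day in notations:
        if lookback_start <= day < study_start:
            seen_before.add(pid)
        elif study_start <= day <= study_end:
            seen_during.add(pid)
    # Codes dated before the look-back window are invisible to this filter.
    return seen_during - seen_before
```

A longer look-back window would be expected to exclude more prevalent cases, at the cost of requiring more historical data and continuous enrollment information.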


We used an algorithm, based on diagnosis codes, developed at a non-SEER site to identify breast cancer cases. Other algorithms might obtain different results. Other investigators have examined algorithms to identify breast cancer cases using computerized claims data (Table 4), but none has compared HMO computerized databases with SEER data. Warren et al2 measured the accuracy of Medicare administrative data vs SEER in identifying breast cancer cases using hospitalization data with a sensitivity of 0.97 and specificity of 0.59. Later work by Warren and others3 combined Medicare inpatient hospital claims with physician claims. The addition of physician claims increased sensitivity from 0.62 to 0.76. Cooper et al4 combined Medicare inpatient data and Part B (physician claims) for case identification and obtained a sensitivity of 0.94 for breast cancer. Freeman et al5 compared a logistic regression model combining Medicare data (including inpatient and outpatient hospitalization) with physician claims to SEER and found greater than 0.90 sensitivity and 0.70 PPV.

As in the present study, Solin6,7 and Leung8 et al examined HMO computerized data to test algorithms to identify incident breast cancer cases. They used procedure and diagnosis codes. Verification was done through medical record review, but sensitivity was not checked against SEER registries. Positive predictive values ranged between 0.84 and 0.93. The inclusion of specific procedure codes (for mastectomy, excision, breast biopsy, lymphadenectomy, radiation therapy, and chemotherapy) may have increased their PPVs. We did not use Current Procedural Terminology-4 (CPT-4) codes because, with the exception of mastectomy, they are often associated with ruling out a condition in our system. Adding pathology data, if available, might also enhance the algorithm. These data were not available in the computerized database at the site developing the algorithm.

Our study suggests a limitation of SEER data given to HMOs, as some cases identified through HMO computerized data were not on the SEER list for the HMOs, even though the cases were listed in the SEER registry. Health plans that rely only on SEER data for identification of incident cancer cases will miss those patients not classified by SEER as being within their system. Our results allow us to estimate the frequency of this problem: 14.0% (7/50), or 57 of 408, HMO-generated cases not listed by SEER would be incident cases that would be missed if SEER were the sole data source. Because we identified 1405 breast cancer cases in this period, we would be missing an estimated 3.9% (57/[1405+57]) of incident breast cancer cases by relying solely on the SEER file. The proportion of missed patients might be higher in health plans that contract with outside providers more often than the HMOs in our study did. The proportion may also differ by type of cancer.4
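The extrapolation in the preceding paragraph can be written out step by step. The sketch below simply repeats that arithmetic with the figures as quoted there (7 of 50 sampled cases, 408 unmatched cases, 1405 SEER cases); the function name is hypothetical, and rounding the projected count to a whole number is an assumption about how 57 was obtained.

```python
from typing import Tuple


def estimate_missed_share(
    sample_missed: int,
    sample_size: int,
    unmatched_total: int,
    registry_cases: int,
) -> Tuple[float, int, float]:
    """Project a sample miss rate onto all unmatched cases and express it as a share."""
    rate = sample_missed / sample_size      # 7 / 50
    missed = round(rate * unmatched_total)  # projected number of missed incident cases
    share = missed / (registry_cases + missed)
    return rate, missed, share


rate, missed, share = estimate_missed_share(7, 50, 408, 1405)
print("%.1f%% of sample, %d projected cases, %.1f%% of all incident cases"
      % (rate * 100, missed, share * 100))
```

With these inputs the sketch gives 14.0%, 57 cases, and 3.9%, matching the figures in the text.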

Managed care provided health care services to 67% of the US population in 2000.9 While the plans in our study may or may not have more extensive data than others, most plans maintain computerized databases that include diagnoses.10 These databases have historically been used for administrative tracking, quality control, billing, and reimbursement purposes. These records can also be used for important epidemiological, clinical, and health services research. They can identify cancer patients and often can be used to obtain disease stage and treatment patterns, because of the inclusion of pharmacy, laboratory, and pathology records. By coupling database information with systems such as electronic physician notes to identify cases more accurately, important research opportunities may be realized. The use of such databases could broaden the population of patients included in studies of patterns of health care and outcomes beyond those found in cancer registry populations. Because most health plans are not connected with SEER registries, the addition of these HMOs' identification resources could increase the size and generalizability of these studies.


Identifying incident breast cancer cases through computerized data is feasible and sensitive; however, further medical record review is required to focus on incident cases. Positive predictive value was particularly low during the first year of data collection because many prevalent cases were identified. On the other hand, sensitivity was slightly lower in the last year of data collection, because the SEER registrar entered cases more quickly than the HMO data systems. Screening for prevalent cases before the period of interest maximized PPV. Finally, HMOs that rely solely on SEER will miss a small percentage of breast cancer cases because of misclassifications of the patients' insurer by the SEER registry.


The overall principal investigator for the Cancer Research Network is Edward H. Wagner, MD, PhD. The overall project manager is Sarah Greene, MPH. We thank the project coordinators at each site for their hard work, Mary Baker, PhD, of the Seattle-Puget Sound SEER for her assistance with case follow-up, and Jody Jackson, BSN, for assistance with editorial comments and manuscript preparation.

From the HealthPartners Research Foundation, Minneapolis, Minn (SJR, SKF, KJP); Group Health Cooperative (GH) and Department of Medicine, University of Washington School of Medicine (JGE), Seattle; Harvard Pilgrim Health Care, Boston, Mass (MBB, SWF); Kaiser Permanente Northern California, Oakland (LH, GH); Kaiser Permanente Northwest, Portland, Ore (ELH); and Kaiser Permanente Southern California, Pasadena (AMG).

This study was supported by grant 5 U19 CA79689-030 from the National Cancer Institute, Bethesda, Md.

Address for correspondence: Sharon J. Rolnick, PhD, MPH, HealthPartners Research Foundation, PO Box 1524, MS 21111R, Minneapolis, MN 55440-1524. E-mail:

1. Pearson ML, Ganz PA, McGuigan K, Malin JR, Adams J, Kahn KL. The case identification challenge in measuring quality of cancer care. J Clin Oncol. 2002;20:4353-4360.

2. Warren JL, Riley GF, McBean AM, Hakim R. Use of Medicare data to identify incident breast cancer cases. Health Care Financ Rev. 1996;18:237-246.

3. Warren JL, Feuer E, Potosky AL, Riley GF, Lynch CF. Use of Medicare hospital and physician data to assess breast cancer incidence. Med Care. 1999;37:445-456.

4. Cooper GS, Yuan Z, Stange KC, et al. The sensitivity of Medicare claims data for case ascertainment of six common cancers. Med Care. 1999;37:436-444.

5. Freeman JL, Zhang D, Freeman DH, Goodwin JS. An approach to identifying incident breast cancer cases using Medicare claims data. J Clin Epidemiol. 2000;53:605-614.

6. Solin LJ, Legorreta A, Schultz DJ, et al. Analysis of a claims database for the identification of patients with carcinoma of the breast. J Med Syst. 1994;18:23-32.

7. Solin LJ, MacPherson S, Schultz DJ, Hanchak NA. Evaluation of an algorithm to identify women with carcinoma of the breast. J Med Syst. 1997;21:189-199.

8. Leung KM, Hasan AG, Rees KS, Parker RG, Legorreta AP. Patients with newly diagnosed carcinoma of the breast: validation of a claim-based identification algorithm. J Clin Epidemiol. 1999;52:57-64.

9. Employee Benefit Research Institute. Available at: Accessed June 13, 2003.

10. Kongstvedt PR, Goldfield NI, Plocher DW. Using data and provider profiling in medical management. In: Kongstvedt PR, ed. The Managed Care Handbook. 4th ed. Gaithersburg, Md: Aspen Publishers Inc; 2001:579-588.