Applying Weighting Methodologies to a Commercial Database to Project US Census Demographic Data
Thomas Wasser, PhD, MEd; Bingcao Wu, MS; Joseph W. Yčas, PhD; and Ozgur Tunceli, PhD
Researchers are keenly interested in ascertaining the impact of disease on society. One of the central elements of this determination is knowledge about the number of individual patients with a disease or condition of interest within a specific region, age group, or gender.1-6 The exact count or even estimates of patients affected by a given disease may not always be available for a variety of reasons, including the absence of reporting requirements or a lack of organized and maintained disease registries or longitudinal patient databases.7,8 To obtain an understanding of the size of patient populations that are not well quantified and characterized, often the only workable option is to extrapolate from available data in repositories such as registries and health plan databases (among others).
Disease prevalence can be estimated in subpopulations with accessible data,9-16 but in extrapolating to the general population, systematic differences in demographic composition must be taken into account. In the United States, it is unlikely that data sets in existing commercial health insurance databases will be representative enough by themselves to present an accurate estimate of the national population.10,11,14-16 As a result, there is considerable interest in census decomposition methodologies or similar approaches that are capable of rendering the data in such nonrepresentative population samples in a form comparable to US Census data.
Commercial health plan databases are a vital and reliable source of data on disease prevalence, but they are limited in size. The objective of this study was therefore to develop a weighting framework for projecting data from commercial databases to a population matching the demographic composition encompassed by the US Census.
This study compared data, demographic structures, and characteristics from a large commercial research database, the HealthCore Integrated Research Database (HIRD), which is notable for its size and geographic breadth, with data from the 2009 US Census. To create a basis for the approximation of counts relative to the US Census data, standard statistical procedures incorporating a suitable alternative to the goodness-of-fit method were used to establish weights for the HIRD. The weighting formulation was then tested with a sample of patients from the northeast region of the United States who were diagnosed with acute coronary syndrome (ACS).
Data Source HIRD
This study utilized a large commercial administrative claims database, the HIRD, which contains a broad spectrum of medical, pharmacy, and laboratory information on more than 46 million enrollees in 14 geographically dispersed managed care plans across the United States. The broad range of service models encompassed by these plans includes health maintenance organizations, point of service, preferred provider organizations, and indemnity plans. The data queried from the HIRD were categorized into geographic regions matching those used by the US Census Bureau.
The US Census Bureau publishes the American Community Survey results every year. The American Community Survey reports population numbers in categories including age, gender, race, and geographic region. It does not collect information on disease prevalence or healthcare utilization. This study was conducted prior to the official release of the 2010 US Census data; as a result, population estimates from the US Census Bureau’s 2009 American Community Survey were used for the total count of individuals residing in the 50 US states.
Researchers had access to limited patient data in this study. Strict measures, in compliance with the 1996 Health Insurance Portability and Accountability Act (HIPAA), were observed to ensure the preservation of patient anonymity and confidentiality throughout. The study did not involve the collection, use, or transmittal of individually identifiable data. It was conducted under the Research Exception provisions of the Privacy Rule, 45 CFR 164.514(e); institutional review board sanction was not indicated.
Inclusion Criteria/Exclusion Criteria
Health plan members within the HIRD who had at least 1 day of health plan enrollment between January 1 and December 31, 2009, were eligible for inclusion in the study. This interval was selected because it corresponded to the most current US Census Bureau American Community Survey data release available at the time of the study. Patients with ACS were selected for the projection demonstration. The disease was identified with International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes 410.x1 and 411.1x in the claims database.
Goodness-of-fit statistics were not suitable for matching the samples because, with sample sizes this large, even small differences would appear statistically significant. An alternative interpretation of the fit approach was used to examine the overall fit of the lines (census-defined regions, gender, and age groups) characterizing the HIRD and the US Census data. Statistical analyses were conducted with SAS version 9.2 (SAS Institute Inc, Cary, North Carolina).
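The sample-size problem described above can be illustrated with a minimal sketch. It is not the study's SAS analysis. It assumes a two-category Pearson chi-square test on the female share, taking 49.8% as the HIRD value and 49.3% as the US Census value (the ordering is inferred from the Results), and the function name is hypothetical.

```python
def chi_square_gof(observed_share, expected_share, n):
    """Pearson chi-square statistic for a two-category goodness-of-fit test.

    observed_share / expected_share are the proportions in the first category;
    the second category is the complement. n is the sample size.
    """
    stat = 0.0
    for obs, exp in (
        (observed_share, expected_share),
        (1.0 - observed_share, 1.0 - expected_share),
    ):
        # (O - E)^2 / E with O = n * obs and E = n * exp
        stat += n * (obs - exp) ** 2 / exp
    return stat


# Chi-square critical value for 1 degree of freedom at alpha = 0.05
CRITICAL_DF1_ALPHA05 = 3.841

# Assumed inputs: 49.8% (sample) vs 49.3% (census) female share
large_sample_stat = chi_square_gof(0.498, 0.493, 14_800_000)  # HIRD-sized sample
small_sample_stat = chi_square_gof(0.498, 0.493, 1_000)       # same gap, small sample

print(large_sample_stat > CRITICAL_DF1_ALPHA05)  # gap is "significant" at HIRD scale
print(small_sample_stat > CRITICAL_DF1_ALPHA05)  # same gap is not at n = 1000
```

The same half-percentage-point gap falls well under the critical value at n = 1000 but far exceeds it at 14.8 million enrollees, which is why a significance test is uninformative for comparing distributions at this scale.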
Standard statistical procedures, comprising an alternative version of goodness-of-fit statistics, were used to establish weights for the HIRD to facilitate the approximation of counts relative to the US Census data. The linear weight for each stratum was computed as the percentage of the overall population in that stratum divided by the corresponding percentage within the HIRD. Weighting schemes enable projection from smaller known samples to larger populations in which the desired prevalence rate and other target information are not known. By using weights in a linear model along with specific variables, it is possible to make projections to the larger population by employing the relevant attributes of the smaller population.17 This equation yielded a multiplication factor that was used to compute the weighted number of patients within a geographic region, age group, and gender for a specific disease type or drug classification (Table 1). On the basis of these distributions, weights were calculated to adjust for any differences in gender, geographic region, and age distributions observed between the HIRD population and the US Census population estimates.
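The weight described above (census share divided by database share, per stratum) can be sketched as follows. This is a minimal illustration, not the study's SAS code. The function name, the stratum keys, and the toy counts are all hypothetical and chosen only so that one stratum has shares of 0.70% and 0.72%.

```python
def stratum_weights(census_counts, sample_counts):
    """Return {stratum: census_share / sample_share} for every stratum.

    Both arguments map a stratum key, here a (region, gender, age_band)
    tuple, to a head count. Keys must match between the two mappings.
    """
    census_total = sum(census_counts.values())
    sample_total = sum(sample_counts.values())
    weights = {}
    for stratum, census_n in census_counts.items():
        sample_n = sample_counts[stratum]
        if sample_n == 0:
            # A stratum absent from the sample cannot be weighted up.
            raise ValueError(f"no sample members in stratum {stratum!r}")
        census_share = census_n / census_total
        sample_share = sample_n / sample_total
        weights[stratum] = census_share / sample_share
    return weights


# Hypothetical toy counts (not study data): one stratum of interest plus
# a catch-all for everyone else.
target = ("northeast", "male", "45-49")
rest = ("rest", "all", "all")
census = {target: 700, rest: 99_300}   # target share = 0.70%
sample = {target: 72, rest: 9_928}     # target share = 0.72%

weights = stratum_weights(census, sample)
print(weights[target])
```

A weight below 1 means the stratum is overrepresented in the database relative to the census; a weight above 1 means it is underrepresented. By construction, the weighted database shares sum to 1.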
Study Populations and Demographic Comparison
During the 2009 calendar year, the HIRD included 14.8 million enrollees, and the US Census Bureau’s 2009 American Community Survey data projected in excess of 307.7 million individuals, with an estimated accuracy of 0.1% (margin of error: 0.001).18 These 2 groups served as the base populations in this study. The HIRD population was similar to the US Census estimates in gender distribution, with females comprising 49.8% and 49.3% of their totals, respectively. Relative to the US Census estimates, the HIRD population appeared overrepresented in the midwest and underrepresented in the south. The HIRD population closely matched US Census estimates for the northeast and west regions, differing by only 1% in the northeast and 0.6% in the west (Table 2).
Age Distribution Comparison
The age group distributions of the HIRD and US Census populations are shown in Figure 1. The HIRD population had relatively higher representation of age categories between 30 and 59 years; it was underrepresented in the age categories <18 years and ≥65 years relative to the US Census. Although there was close agreement between the 2 populations for ages 5 to 30 years and 55 to 70 years, the overall age group of 18 to 64 years was overrepresented in the HIRD.
Weight Computation Based on the Northeast Region
To demonstrate the weight computation model, weight calculations were applied to the northeast region. In 2009, approximately 0.70% of the US Census population was male, aged 45 to 49 years, and lived in the northeast, while around 0.72% of the HIRD population shared the same geographic region, gender, and age characteristics. Thus, the weight for the male population aged 45 to 49 years living in the northeast during that time period was 0.9690 (Table 3).
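As a worked check of this stratum, the weight can be written as the ratio of the two shares. The symbols w, p_Census, and p_HIRD are introduced here for illustration and do not appear in the original.

```latex
w = \frac{p_{\text{Census}}}{p_{\text{HIRD}}}
  \approx \frac{0.70\%}{0.72\%}
  \approx 0.972
```

The rounded percentages quoted in the text give approximately 0.972 rather than 0.9690; the reported weight was presumably computed from the unrounded percentages underlying Table 3.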
Projection of ACS Patients in the Northeast
Table 4 reports the results of weighting the number of HIRD patients with ACS in the northeast region within each age and gender stratum and projecting to the northeast US Census Bureau population. The HIRD had a total of 452 male members from the northeast region, aged 45 to 49 years, who had at least 1 claim with a diagnosis for ACS from January 1 to December 31, 2009. On the basis of the weight for this population group (0.9690), the projected number of ACS diagnoses in a representative sample of the same size as the HIRD repository (~4.82% of the US population) would be 438 patients (452 × 0.9690), and the projected number in the overall US Census Bureau population would be 9089 (438 ÷ 4.82%). Application of this weighting scheme results in a greater proportion of ACS patients in the ≥65 years age category and a smaller proportion of patients aged 30 to 64 years relative to the original HIRD estimate (Figure 2).
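The two-step arithmetic above can be sketched as follows, under the assumption that the coverage fraction is the rounded 4.82% quoted in the text. The function and variable names are hypothetical.

```python
def project_count(observed, weight, coverage_fraction):
    """Project an observed stratum count to the census population.

    Step 1: reweight the observed count so the stratum has its census share.
    Step 2: scale up by the fraction of the population the database covers.
    The weighted count is rounded to whole patients before scaling, as in
    the worked example in the text.
    """
    weighted = round(observed * weight)
    projected = weighted / coverage_fraction
    return weighted, projected


# Values quoted in the text: 452 observed ACS patients, weight 0.9690,
# database covering ~4.82% of the US population (rounded; an assumption here).
weighted_patients, projected_patients = project_count(452, 0.9690, 0.0482)
print(weighted_patients)          # 438
print(round(projected_patients))  # close to, but not exactly, the reported 9089
```

With the rounded 4.82% the second step gives a value of about 9087 rather than the reported 9089; the small gap presumably reflects an unrounded coverage fraction in the original calculation.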
Although healthcare data are not readily available for the entire US population, a considerable volume may be found in data silos such as institutional disease registries and the transactional databases of health plans, among other repositories. For healthcare planners and budget directors, access to plausible population estimates is crucial for decision making. One avenue for obtaining population-level figures for health budget projections and allocations is to extrapolate from smaller data collections. It is essential to have robust and reliable weighting tools capable of achieving the low margin of error that such projections require.19,20 Driven by this need, this study developed a simple weighting tool to project health plan data to estimate prevalence rates at the national level.
Although the HIRD represents slightly less than one-twentieth (~4.82%) of the US population, as represented by the 2009 US Census Bureau count, the data in the HIRD are remarkably representative of the entire US population. HIRD data trended in parallel with the US Census data on gender distribution and on regional distribution in the northeast and west regions. As expected, however, the HIRD overrepresented the 30 to 59 years age category because the repository consists largely of employer-insured, working-age people.
Because most people represented in the HIRD repository are enrollees of employer-sponsored commercial healthcare insurance, the population aged ≥65 years appears to be relatively underrepresented. Still, the HIRD contains a sizable sample of enrollees aged ≥65 years who may be receiving commercial employer-sponsored health benefits or Medicare Advantage, Medicare Supplement, or Part D benefits. This sample is large enough to allow the weighting methodology to extrapolate the data to the overall US population with statistically acceptable variance.
The weighting methodology developed in this study was tested on ACS patients from the northeast region as an illustration of how the weighting scheme may be applied in practice. While this example specifically addressed ACS patients, it demonstrated how the number of patients in the overall US population for any disease may be estimated from commercially derived healthcare data repositories like the HIRD. This study essentially demonstrated that by using a linear weighting methodology that accounts for differences in geographic regions, age, and gender between an accessible database and the US Census data, it was possible to estimate the prevalence of a number of important healthcare factors. Among areas that may be evaluated using this approach are disease prevalence, healthcare resource utilization, treatment patterns for therapies of interest, and current and potential use of pharmaceutical agents and other treatment modalities.
One of the key objectives of this study was the development of a projection method and a weighting scheme that could be applied to a range of disease conditions and therapeutic categories for which data were available in a repository such as the HIRD. An important strength of this approach is that it allows for adjustments in the variables or for the updating of estimates of interest with the most current or different data as needed.
Weighted estimates have important planning, resource allocation, and cost management implications for a variety of stakeholders, including patients, providers, and payers, who have to make decisions based on research results, disease prevalence, treatment availability, and drug utilization, among other factors.
The results of the weighting scheme and ACS projection example discussed in this study must be viewed against some important limitations. This study relied on secondary data from commercial health plans across the United States. These data may have some relevance to similar commercial health plans, but only limited external validity for different patient populations such as those of the US Medicaid and Medicare programs. In addition, administrative claims lack data on race, ethnicity, and risk factors capable of influencing outcomes. Administrative claims data are prone to over- and underestimations (eg, for patients, disease, medication use, other areas) because of basic assumptions about index events, inability to capture and account for all treatments received by patients, and basic coding and clerical errors. Furthermore, extrapolation was done beyond the point of observable data, contravening a standard requirement of statistical methodology and likely affecting the robustness of the results. In addition, notable differences existed between the values in the HIRD commercial database and the US Census data. Finally, the weights were calculated on the basis of 2009 American Community Survey projections, not official US Census counts.
Consistent with its commercial employment origins and characteristics, the HIRD repository, while broadly representative of US Census data, overrepresented the 30 to 59 years age category. The age groups ≥65 years were underrepresented in the HIRD but still accounted for a substantial sample size. While extrapolations beyond observable data have statistical limitations, in the absence of data on disease prevalence and treatment for the US population as a whole, commercial databases could be viable for projecting patient counts within US Census parameters. This could be invaluable to key stakeholders such as healthcare planners, policy makers, and payers.