Benchmarking Physician Performance: Reliability of Individual and Composite Measures

At least 50 quality events per physician are needed to reach a minimum level of reliability for most quality measures calculated from administrative data.
Published Online: December 01, 2008
Sarah Hudson Scholle, MPH, DrPH; Joachim Roski, PhD, MPH; John L. Adams, PhD; Daniel L. Dunn, PhD; Eve A. Kerr, MD, MPH; Donna Pillittere Dugan, MS; and Roxanne E. Jensen, BA

Objective: To examine the reliability of quality measures to assess physician performance, which are increasingly used as the basis for quality improvement efforts, contracting decisions, and financial incentives, despite concerns about the methodological challenges.

Study Design: Evaluation of health plan administrative claims and enrollment data.

Methods: The study used administrative data from 9 health plans representing more than 11 million patients. The number of quality events (patients eligible for a quality measure), mean performance, and reliability estimates were calculated for 27 quality measures. Composite scores for preventive, chronic, acute, and overall care were calculated as the weighted mean of the standardized scores. Reliability was estimated by calculating the physician-to-physician variance divided by the sum of the physician-to-physician variance plus the measurement variance, and 0.70 was considered adequate.
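The reliability ratio described above can be sketched in a few lines. This is an illustrative example, not the authors' estimation code: the variance values and the binomial form of the measurement variance are assumptions chosen to show why a minimum number of quality events matters.

```python
# Illustrative sketch (not the authors' actual method): physician-level
# reliability as physician-to-physician variance divided by the sum of
# physician-to-physician variance and measurement variance.

def reliability(between_var: float, measurement_var: float) -> float:
    """Reliability = between-physician variance / (between + measurement)."""
    return between_var / (between_var + measurement_var)

def measurement_variance(p: float, n: int) -> float:
    """Binomial sampling variance of a pass rate p over n quality events.
    Shrinks as the number of quality events n grows."""
    return p * (1 - p) / n

# Assumed values for illustration: mean performance 0.6,
# physician-to-physician variance 0.01.
between = 0.01
for n in (10, 50, 200):
    r = reliability(between, measurement_variance(0.6, n))
    print(n, round(r, 2))
```

Under these assumed variances, reliability rises from roughly 0.3 at 10 quality events to about 0.7 at 50, consistent with the 0.70 adequacy threshold and the 50-event minimum discussed in the article.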

Results: Ten quality measures had reliability estimates above 0.70 at a minimum of 50 quality events. For other quality measures, reliability was low even when physicians had 50 quality events. The largest proportions of physicians who could be reliably evaluated on a single quality measure were 8% for colorectal cancer screening and 2% for nephropathy screening among patients with diabetes mellitus. More physicians could be reliably evaluated using composite scores (≥17% for preventive care, ≥7% for chronic care, and 15%-20% for an overall composite).

Conclusions: In typical health plan administrative data, most physicians do not have adequate numbers of quality events to support reliable quality measurement. The reliability of quality measures should be taken into account when quality information is used for public reporting and accountability. Efforts to improve data available for physician profiling are also needed.

(Am J Manag Care. 2008;14(12):829-838)

When health plan administrative data are used to evaluate physician performance, most quality measures require at least 50 quality events per physician to gain a reliable estimate of physician performance (ie, to ensure that a quality measure is able to distinguish a physician’s performance from average performance).

  • Composite measures allow more physicians to be evaluated reliably but are less actionable for quality improvement.
  • The physician-level reliability of quality measures should be considered when quality information is used for public reporting and accountability.
  • Efforts to improve the quality and quantity of data available for physician profiling are also needed.
Measuring physician performance is becoming commonplace as health plans and purchasers look for ways to drive quality improvement and to increase physicians’ accountability and rewards for achieving quality goals. A recent study1 reported that, among the 89% of health maintenance organization plans using physician-oriented pay-for-performance programs, more than one-third measured and rewarded quality at the individual physician level. In addition, public and private purchasers are demanding more information about America’s physicians and hospitals to aid in value-based purchasing and selection of health plans and providers.2

However, concerns remain regarding the validity and reliability of such physician performance profiles. Several factors are needed to support fair and accurate comparisons among physicians. These include evidence-based quality measures, complete and accurate data sources, and standardized methods of data collection. Physician-level reliability of a quality measure is another key consideration in this measurement. Physician-level reliability refers to the ability of a quality measure to distinguish an individual physician’s performance from the performance of physicians overall. Good physician-level reliability requires the following 2 factors: (1) a sufficient number of patients eligible for a given quality measure and (2) performance variation across physicians on that quality measure.3-5 The greater the number of a physician’s patients who are eligible for a quality measure, the more precise the estimate of the physician’s performance. When performance variation for a given quality measure across physicians is limited, the likelihood that a physician’s performance is statistically significantly different from that of his or her peers is also decreased. Hofer and colleagues6 showed that not controlling for a quality measure’s physician-level reliability significantly misrepresented performance differences across physicians. However, adjusting performance profiles in such a manner is not commonplace across the healthcare industry.

Ensuring that measurement results are valid and reliable is important when purchasers and plans (and potentially consumers) use the data to make decisions about which physicians get financial rewards or other benefits. The stakes are particularly high when profiling results are used for public reporting or eligibility for participation in a health plan network. Paying attention to the validity and reliability of data will help to ensure that these decisions are based on real differences in performance among physicians rather than any shortcomings of the measurement.

Although performance results based on limited sample sizes could be adjusted for the reliability of individual measures,7-9 the creation of composite scores may also be a useful way to increase the reliability of physicians’ performance scores.10 Little is known about the extent to which constructing composite scores mitigates the limitations of sample size and reliability, while continuing to provide useful and understandable information.11

To date, there have been few reports regarding the reliability of physician-level performance scores associated with commonly used practices and methods in the healthcare industry. To begin to address this deficiency, this study relied on a large data set that combined patient-level administrative data from 9 large health plans to compute performance for primary care physicians (PCPs) using 27 commonly measured quality indicators. This data set is typical of data sources often used by individual health plans to profile physician performance. Specifically, we examined for each quality measure and composite score the proportion of PCPs who could be evaluated given different minimum sample size criteria and the physician-level reliability under those minimum sample size criteria. Our primary research questions were the following: (1) What is the physician-level reliability of commonly used performance measures calculated exclusively based on administrative data? (2) Can more physicians be reliably evaluated using a composite score?


Data Sources

This study used administrative data from the Ingenix Impact Pro database.12 Deidentified claims and enrollment data for individuals enrolled in 9 health plans from 9 separate geographic regions for 2003 and 2004 were available for this study. Each of these plans had at least 250,000 members and accounted for 15% to 50% of managed care enrollees in their markets (Table 1). In all, these plans covered more than 11 million unique members and many physicians and employer groups. The members included in these organizations were primarily enrolled in commercial health maintenance organization, preferred provider organization, and point-of-service health plan product designs, with fewer individuals enrolled in Medicare risk products. Pharmacy benefit status, an indicator of the general availability of pharmacy data to support measurement, ranged from 51% to 80% of the enrolled populations for each plan. Although the study population was drawn from multiple geographic census regions, most individuals were located in the northeast United States. The data were deidentified to protect patient, physician, and organization confidentiality. This study was reviewed and determined to be exempt by Chesapeake Research Review, Inc (Columbia, MD).

Because the Impact Pro database may not include complete data on all services (eg, pharmacy, laboratory, or mental health services) needed for calculating some performance measures, we conducted specific analyses to assess the completeness of the data available for the study. Using only administrative data sources, we compared performance rates based on Impact Pro data with performance data reported to the National Committee for Quality Assurance (NCQA) through the Healthcare Effectiveness Data and Information Set (HEDIS) reporting. If we found more than a 5–percentage point difference between the plan’s reported rate to the NCQA and the rate in the Impact Pro database, the data were excluded for that quality measure.
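The exclusion rule above is simple to state precisely. The sketch below is hypothetical code (the function name and rates are illustrative): a plan's data for a measure are excluded when its Impact Pro rate differs from its HEDIS-reported rate by more than 5 percentage points.

```python
# Sketch of the data-completeness check described above (names illustrative).

def exclude_measure(impact_pro_rate: float, hedis_rate: float,
                    threshold_pp: float = 5.0) -> bool:
    """Return True when the plan's data should be excluded for this measure,
    ie, when the absolute difference exceeds the percentage-point threshold."""
    return abs(impact_pro_rate - hedis_rate) > threshold_pp

print(exclude_measure(72.0, 78.5))  # 6.5 pp difference -> True, exclude
print(exclude_measure(72.0, 75.0))  # 3.0 pp difference -> False, keep
```

Note the strict inequality: a difference of exactly 5 percentage points would not trigger exclusion, matching the "more than a 5-percentage point difference" wording.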

Selection of Quality Measures

Twenty-seven quality measures often used to assess care effectiveness were calculated using study data following HEDIS specifications.13 The quality measures were identified from an environmental scan of existing physician quality measures and prioritization by an NCQA expert panel on physician profiling. The quality measure set primarily includes quality measures that have been endorsed by the AQA Alliance and the National Quality Forum. Only quality measures that could be obtained through administrative claims data were included because we were emulating efforts to profile physicians based on data commonly available and used by health plans. Quality measures for diabetes care, cervical cancer screening, and colorectal cancer screening are specified for hybrid data collection for HEDIS (ie, using medical records data to supplement claims). Relying exclusively on administrative data for performance calculations for these quality measures may not accurately reflect performance.14 The selected quality measures describe preventive, chronic, and acute care activities and were considered appropriate for supporting comparisons of PCPs. (See eAppendix Table 1.)

Identification and Attribution of Quality Events to Physicians

We identified individual physicians using the unique physician identifiers used by health plans. Because we did not have a way to link a physician’s claims in one data set to that physician’s claims in another health plan’s data set, we did not pool patients for the same physician across health plans. Most of the 9 health plans in our study did not operate in the same healthcare markets, so pooling would have a limited effect. Primary care physicians, including family physicians, general internists, and general pediatricians, were identified based on the specialty designated in the credentialing records of the participating health plans.

The 9 health plans included in this study generally did not require patients to designate a PCP. Therefore, we developed algorithms based on patient care patterns to attribute a patient’s care to 1 or more physicians. We required at least 1 claim for an outpatient visit during the measurement period to attribute a patient to a PCP for inclusion in the study. This means that some patients who were eligible for quality measures may not have been attributed (eg, a woman eligible for mammography would not be attributed if she did not have a qualifying visit during the year). Outpatient visits were defined based on coding conventions established through HEDIS to identify preventive and ambulatory health services.13

For a physician to be considered responsible for a quality event (defined as an event for which a patient is eligible for a quality measure), the patient had to have a visit with the physician during a period when the physician would have an opportunity to meet the quality indicator. We chose this less stringent approach to maximize the number of quality events assigned to each physician. Any PCP rendering 1 or more visits for a patient during the eligibility period was considered responsible for the quality measure. A specific patient may be eligible for multiple quality measures and contribute multiple quality events for the responsible physicians. Likewise, more than 1 physician could be responsible for a specific quality event.
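The attribution rule described in the preceding paragraphs can be sketched as follows. This is a hypothetical simplification (data structures and names are assumptions, not the study's actual algorithm): any PCP with at least one qualifying visit from a patient during the eligibility period is credited with that patient's quality events, so one event can be assigned to multiple physicians, and patients without a qualifying visit are not attributed at all.

```python
# Hypothetical sketch of the attribution rule described above.
from collections import defaultdict

def attribute_events(visits, quality_events):
    """visits: list of (patient_id, pcp_id) outpatient visits during the
    eligibility period. quality_events: dict patient_id -> list of measures.
    Returns dict pcp_id -> list of (patient_id, measure) quality events."""
    pcps_seen = defaultdict(set)
    for patient, pcp in visits:
        pcps_seen[patient].add(pcp)
    events = defaultdict(list)
    for patient, measures in quality_events.items():
        # A patient with no qualifying visit is not attributed to any PCP.
        for pcp in pcps_seen.get(patient, ()):
            for measure in measures:
                events[pcp].append((patient, measure))
    return dict(events)

visits = [("pt1", "drA"), ("pt1", "drB"), ("pt2", "drA")]
qe = {"pt1": ["colorectal_screen"], "pt3": ["mammography"]}
result = attribute_events(visits, qe)
# pt1's event is credited to both drA and drB; pt3 has no visit, so the
# mammography event is not attributed to anyone.
```

This mirrors the article's design choices: eligibility for multiple measures yields multiple quality events per patient, and shared responsibility means the same event can appear in more than one physician's profile.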

Statistical Analysis
