This review suggests that only a few primary care quality measures, which usually are not found in claims data, have significant clinical and financial impact.
To understand the value for payers and purchasers of primary care quality measures in an insured population, we conducted a 2-part analysis. In the first part, we reviewed the economic and clinical literature supporting 62 quality metrics spanning primary care that had been proposed for use in a physician recertification program and in a pay-for-performance program. We then ranked these metrics by both economic and clinical evidence of effectiveness. For many of the metrics, there was little clinical or economic support for inclusion in a pay-for-performance program. For the 20 with both clinical and economic evidence of effectiveness, we constructed actuarial models to understand the potential financial effect that attainment of these metrics would have in an insured population, from the perspective of a payer. Of those, 16 were found to be cost-saving in the short term with respect to direct medical costs incurred by payers. This analysis suggests that many recommended primary care quality measures may have little clinical evidence of effectiveness beyond expert opinion, and may provide scant clinical or economic benefit to payers if achieved. A minority, however, may deliver substantial savings in the short term. Given the current emphasis on pay-for-performance and pay-for-reporting programs, and recent studies showing a lack of relationship between measures and clinical/economic value, this analysis informs payers, purchasers, providers, and policymakers about the importance of choosing the right metrics and the methods for collecting them.
(Am J Manag Care. 2008;14(6):360-368)
Our research analyzed the clinical and financial value of 60 commonly used and generally approved physician quality measures from a payer-purchaser perspective.
Only a handful of those measures had a significant clinical and financial impact.
However, those measures are not routinely found in claims data, which calls into question how many resources should be devoted to large claims data aggregation efforts as opposed to other data collection efforts.
In 2007, for the first time in its history, the Medicare program tied a portion of a scheduled increase in physician fees to performance on a standard set of ambulatory care measures. This change in reimbursement strategy was prompted by (1) a recognition that measuring the value of Medicare physician spending has been, and continues to be, elusive; (2) a strong private sector movement to tie a portion of physician payment to demonstrated performance in delivering quality care; and (3) an acknowledgment that consumers deserve transparent information on the competence of physicians to meet certain quality thresholds.
As the Centers for Medicare & Medicaid Services (CMS) collects and disseminates these performance data, and as more than 100 similar efforts germinate in the private sector,1 there is a paucity of robust studies on the relationship between the achievement of ambulatory care measures and healthcare cost and quality. Prior research has shown a link between performance measures and costs and quality of care.2-4 In other related articles, physicians who received recognition by the National Committee for Quality Assurance (NCQA) for demonstrating good outcomes in the management of patients with diabetes were shown to have lower costs.5-8 These studies are consistent with other studies that demonstrate similar results.9-11 Their common denominator is the observation that a true measure of output is needed to compare the values created (or not created) by the care delivery process.
Output measures are best defined as those that most closely relate to the outcome of a patient’s care, or that have the highest correlation with that outcome. For example, an important outcome for a patient with diabetes is to avoid complications such as amputation, myocardial infarction, and renal failure. The measures that are most closely related to the avoidance of these events are the proper management of the patient’s glycosylated hemoglobin (A1C), low-density lipoprotein cholesterol (LDL-C), and blood pressure. Similarly, recent studies on the management of patients with cardiac disease demonstrate the importance of monitoring and measuring blood pressure.12
In a 2-part study, we reviewed 62 ambulatory care measures proposed for a specialty organization’s recertification program and for a pay-for-performance initiative. These measures were selected by an expert panel, and 50 of them were endorsed by the National Quality Forum (NQF), the Ambulatory Care Quality Alliance (AQA), and/or the NCQA. The measures span primary care, including coronary artery disease (CAD), heart failure (HF), diabetes mellitus, osteoarthritis, asthma, major depression, hypertension, and acute-care conditions. eAppendix Table A lists the metrics and their endorsement status (available at www.ajmc.com). The first part of the study consisted of ranking each measure according to an index that combined clinical and economic value, and the second part consisted of conducting detailed actuarial analyses of the subset of measures that had the highest index score.
Our findings imply that many payers, including CMS, should carefully consider what measures to focus on.
METHODS
To understand the benefit of each measure, we conducted a clinical and economic literature review, emphasizing meta-analyses demonstrating support for the measures. Given the preponderance of meta-analyses in our review, we captured a very large number of peer-reviewed articles. eAppendix Table B presents a review of the articles (available at www.ajmc.com). After we assembled the evidence for the measures, we created a point-based ranking system for both the clinical and the economic value of each measure. In basing our ranking systems on well-known methods published in the literature, our intent was to use an approach for capturing clinical and economic value that had been independently validated and was completely transparent. However, it is possible that our clinical and economic ranking systems, although comprehensive, did not capture all the elements of clinical and economic value that might be contained in a quality measure.
For the clinical evidence rankings, we used a methodology adapted from that of the GRADE Working Group.13 The GRADE Working Group is an international collaboration that has critiqued the assortment of evaluation tools used to rate clinical guidelines and has generated a standardized evaluation process.14 Quality of evidence was scored on a 5-point scale based on the study design for the supporting evidence:
• Meta-analysis in support—5 points.
• Single randomized controlled trial in support—3 points.
• Expert opinion in support—1 point.
Scores were reduced if there were questions of study quality, consistency, bias, directness, or imprecise/sparse data as follows:
• Serious limitations in study quality (-1).
• High probability of reporting bias (-1).
• Imprecise or sparse data (-1).
Conversely, scores were increased if there was evidence of strong association or dose response according to the following schema:
• Significant evidence of a strong association between measure and outcome (relative risk or odds ratio of >2 for morbidity or mortality outcome) (+1).
• Evidence of a dose response gradient (+1).
As a result of this scoring, the maximum number of points awarded to any measure for clinical effectiveness in our analysis was 6.
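The clinical scoring rules above can be sketched as a small function. This is an illustrative reading of the listed bullets only, not the authors' implementation: the function name, argument names, and design labels are hypothetical, and only the three deductions and two bonuses itemized in the text are modeled.

```python
# Base points by study design, as listed in the text.
BASE_POINTS = {"meta_analysis": 5, "single_rct": 3, "expert_opinion": 1}


def clinical_score(design, *, serious_quality_limitations=False,
                   high_reporting_bias=False, imprecise_or_sparse_data=False,
                   strong_association=False, dose_response=False):
    """Return the clinical evidence score for one measure.

    Assumption: each listed deduction is worth -1 and each listed bonus +1,
    applied additively to the base points for the study design.
    """
    deductions = [serious_quality_limitations, high_reporting_bias,
                  imprecise_or_sparse_data]
    bonuses = [strong_association, dose_response]
    # Booleans sum as 0/1, so this counts how many flags are set.
    return BASE_POINTS[design] - sum(deductions) + sum(bonuses)


# A measure supported by a meta-analysis with a strong association.
print(clinical_score("meta_analysis", strong_association=True))
```

Under these rules a measure backed only by expert opinion with sparse data would score 0, which is consistent with such measures falling to the bottom of the ranking.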
For the economic ranking system, we adapted the method of Chiou et al.15 Points were first allocated on the basis of strength of evidence with:
• More than 1 study showing evidence of cost-effectiveness or cost utility at <$50,000 per life-year saved; or 1 study showing cost savings in some scenarios—3 points.
• No published cost studies—1 point.
Scores were increased or decreased based on the following questions applied to the highest-scoring individual evidence:
• Was uncertainty handled by (1) statistical analysis to address random events and (2) sensitivity analysis to cover a range of assumptions? Yes +.5. No -.5.
• Was the measurement of costs appropriate and the methodology for the estimation of quantities and unit costs clearly described? Yes +.5. No -.5.
As a result of this scoring scheme, the maximum number of points allocated for financial effectiveness was 5.5; therefore, the maximum number of points for the total combined score was 33, which represents the product of the clinical and economic scores. The primary reason for using a product-based combined score was to numerically highlight the metrics that have been the subject of rigorous studies of both clinical and economic effectiveness. Moreover, a combined ranking based on the product of the separate clinical and economic scores provides a more balanced index, and avoids assigning undue weight, for example, to measures with strong clinical effectiveness scores but weak economic value, or vice versa. The distribution of metrics by total combined points was Pareto-like: 19 metrics received 20 or more points, and the remainder of the metrics received an average of 3 points or less.
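A minimal sketch of the economic adjustment and the product-based combined score follows. The function names are hypothetical. The base economic points are passed in rather than derived, because the text shown here lists only the 3-point and 1-point tiers while stating a maximum of 5.5.

```python
def economic_score(base_points, uncertainty_handled, costs_appropriate):
    """Adjust base economic points by the two quality questions.

    Each question adds 0.5 for "yes" and subtracts 0.5 for "no".
    """
    adjustment = sum(0.5 if answer else -0.5
                     for answer in (uncertainty_handled, costs_appropriate))
    return base_points + adjustment


def combined_score(clinical, economic):
    """Combined index: the product of the clinical and economic scores."""
    return clinical * economic


# The stated maxima (6 clinical, 5.5 economic) give the stated ceiling of 33.
print(combined_score(6, 5.5))

# A product penalizes imbalance: strong clinical but weak economic evidence
# ranks below moderate evidence on both dimensions.
print(combined_score(6, 1.0), combined_score(4, 4.0))
```

The second comparison shows why a product was preferred to a sum: the two measures differ by only 1 point when the scores are added (7 versus 8), but by 10 points when they are multiplied (6 versus 16).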
We then performed a cost–benefit calculation using the measures with the highest combined rankings, because the metrics with low scores had little or no evidence of economic and clinical effectiveness. The actuarial models assessed the value of reductions in adverse outcomes when high-scoring metrics were achieved. To generate each model, we calculated the per capita benefits of treatment by determining the number, type, and average cost of morbidity events prevented by attainment of each metric, as determined from the literature and validated through the Thomson Medstat MarketScan database (Thomson Medstat Inc, Ann Arbor, Michigan), a large integrated claims database of commercially insured employees of mainly large corporations. See Figure 2 for specific examples. When cost figures were outdated, we inflated them to 2006 levels using the medical Consumer Price Index. We assumed study populations were 50% male and 50% female, and where ethnicity was relevant (for the cholesterol and hypertension models), we assumed the population was 90% white and 10% black.
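The inflation adjustment can be illustrated with a simple index ratio. This is a sketch of the standard approach, assumed rather than confirmed by the text; the function name is hypothetical and the index values below are round placeholders, not actual medical Consumer Price Index figures.

```python
def inflate_cost(cost, index_source_year, index_target_year):
    """Restate a cost in target-year dollars using a price index ratio.

    Assumption: both index values come from the same series (here, the
    medical Consumer Price Index) and share the same base period.
    """
    if index_source_year <= 0:
        raise ValueError("index values must be positive")
    return cost * (index_target_year / index_source_year)


# Placeholder index values for illustration only (not real CPI data).
cost_in_source_year = 100.0
print(inflate_cost(cost_in_source_year, index_source_year=100.0,
                   index_target_year=125.0))
```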
We next calculated per capita costs, using average costs for generic versions of pharmacotherapy treatments (where available) obtained from an online pharmacy and including other related medical costs from an amalgam of likely therapies. The cost of medication side effects in these particular applications was generally not considered, with 2 exceptions: aspirin use and switching to angiotensin receptor blockers (ARBs) because of intolerance to angiotensin-converting enzyme (ACE) inhibitors. Although the incremental cost of medication used for treatment was included in the model, the incremental cost of physician time to prescribe these treatments was not considered. None of the interventions listed here would lead to codable procedures, although it is conceivable that they could increase the acuity of individual visits. (For example, a level 2 or 3 visit might be justifiably “upcoded” to a level 3 or 4.)
After we derived the per capita benefits and costs of treatment, we netted the costs against the benefits to yield the net financial effect (savings or cost) of the specific quality measure. The actuarial models were conservative, considering only the direct medical cost of morbidity to employers/payers for patients less than 65 years of age, in a 1-year time frame. Based on the literature, we varied the specific morbidity effects for each measure. In the case of hypertension, for example, the literature documents reductions in end-stage renal disease (ESRD), CAD, and stroke. In the case of ACE inhibitor/ARB treatment for left ventricular systolic dysfunction (LVSD), there was a reduction in hospitalizations for congestive HF. There were additional morbidity effects that could be expected with each metric, such as the decrease in retinopathy with blood pressure reduction.24
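The net-effect calculation can be sketched as follows. The structure mirrors the description above (events prevented multiplied by average event cost, less treatment cost), but the function names and every number in the example are hypothetical and are not taken from the study's models.

```python
def per_capita_benefit(prevented_events):
    """Sum the value of morbidity events avoided per treated patient.

    prevented_events: iterable of (events_prevented_per_patient_year,
    average_direct_medical_cost_per_event) pairs.
    """
    return sum(rate * cost for rate, cost in prevented_events)


def net_financial_effect(prevented_events, treatment_cost_per_patient):
    """Positive values are net savings; negative values are net costs."""
    return per_capita_benefit(prevented_events) - treatment_cost_per_patient


# Hypothetical inputs: two event types avoided, one annual treatment cost.
events = [
    (0.010, 20000.0),  # e.g., 1 event avoided per 100 patient-years
    (0.005, 10000.0),  # e.g., 1 event avoided per 200 patient-years
]
print(round(net_financial_effect(events, treatment_cost_per_patient=100.0), 2))
```

A measure is cost-saving in this framing whenever the result is positive, which depends as much on the treatment cost as on the size of the morbidity effect.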
This focus on direct medical costs to the employer/payer meant that we did not include several major elements of the cost of care in the savings that we estimated. We ignored any medical costs directly paid by patients as well as the indirect and intangible costs to employers or patients. Additionally, we imputed savings only for the reductions in complications and other health factors that would occur prior to age 65 years and excluded the cost of mortality. (Implementation costs were excluded as there was no consensus on what they would be. The model was constructed, however, so that these costs could be easily accounted for when determined.) Finally, we assumed full patient compliance. Although we had no basis to assume full compliance and in fact had evidence to the contrary, there were no data to suggest how compliance might vary by condition and medication. Nevertheless, the model was developed so that the compliance factor could be changed by user preference, with reductions in compliance therefore affecting achievement of the outcome.
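One way such a user-adjustable compliance factor could work is sketched below. The text does not specify how compliance enters the model, so the linear scaling of benefits (with treatment cost held fixed) is purely an assumption for illustration, as are the function name and the example figures.

```python
def net_effect_with_compliance(full_compliance_benefit, treatment_cost,
                               compliance=1.0):
    """Net per capita effect when only a fraction of patients comply.

    Assumption: benefits scale linearly with the compliance fraction,
    while the treatment cost is still incurred in full. A real model
    might scale costs as well, or use a nonlinear relationship.
    """
    if not 0.0 <= compliance <= 1.0:
        raise ValueError("compliance must be between 0 and 1")
    return full_compliance_benefit * compliance - treatment_cost


# Hypothetical figures: the default reproduces the full-compliance case.
for fraction in (1.0, 0.75, 0.5):
    print(fraction, net_effect_with_compliance(250.0, 100.0, fraction))
```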
Clearly, there are a number of limitations to our methodology. First, preventing mortality is an important goal of payers and employers, and mortality effects were not considered in our models because we could not define an acceptable method of valuing life-years saved. Second, employers have reasons to care about direct costs and indirect costs, yet we ignored indirect costs because of the difficulties in adequately measuring them. Although placing these limitations on the model may seem overly conservative by reducing the net potential savings, the goal was to capture only the direct medical costs that would be paid by employers and payers, which most payers and purchasers use as a primary means to determine the value of a program.
RESULTS
From the list of 62 metrics, only 20 (each of which is endorsed by the NQF, AQA, or NCQA) received high combined scores for clinical and economic support in our ranking scheme. Of these 20, most were shown to be cost-saving in actuarial modeling based on the conservative assumptions described above.
Approximately one third of the 62 metrics lacked conclusive clinical evidence to support their inclusion in pay-for-performance beyond expert opinion, even though all of these metrics are considered part of the standard of care in medical practice. In addition, the measures lacking supportive clinical evidence highly correlated with those lacking evidence of economic savings (data not shown). The low-scoring measures did share one important element—they tended to be process measures with distant relationships to outcomes. Although practices such as taking the patient’s medical history and performing a physical are time-honored parts of the medical evaluation and may be prerequisites to interventions of proven clinical or economic benefit, they do not reduce morbidity or mortality directly—which is the measure of output that is important to payers and purchasers—and as a result have little or no actuarial value.
Although we limited the scope of potential savings to direct medical costs for patients under age 65 years, we found that most of the quality metrics with the strongest clinical and economic evidence of effectiveness were cost-saving in the 1-year frame of analysis. The range of savings was $88 per patient per year for achieving systolic blood pressure of less than 140 mm Hg to $781 per patient per year for use of ACE inhibitors/ARBs in LVSD or LVSD with CAD. In contrast, an LDL-C value below 130 or below 100 mg/dL was not cost-saving, yielding a net cost per patient per year of $429 and $412, respectively. In the case of LDL-C reduction, the absence of cost savings was driven by the high cost of statin therapy and, to a lesser degree, by the relatively lower effect of reduced LDL-C (as opposed to blood pressure) on morbidity. The growing availability of generic statin therapy will likely change this valuation.
The short-term savings in the care of a working-age population related to compliance with these metrics also implies that their value would be far greater for CMS. In addition, because the cost of treatment needed to achieve high performance on some of these metrics is very small (eg, generic medication), and the benefits of complication avoidance grow with the time frame for analysis, a longer-term perspective would yield far greater net savings.
One important limitation of this study is that the actuarial models used to develop the net savings associated with each metric considered every metric independently of the other, and the results are not additive. For example, the savings associated with the treatment of patients with CAD who have had a prior myocardial infarction cannot be derived simply by adding the savings associated with each CAD measure.
If measuring outcomes or processes tightly linked with outcomes can result in cost-savings, why aren’t payers focusing on these measures? An important reason is that there is an inherent difficulty in systematically collecting the data to assess performance on these measures without incurring significant data collection costs.
The most ubiquitous, standardized, and inexpensive data collection process in force in the US healthcare system is the billing process. There are hundreds of billions of claims processed each year, and they can yield valuable information. For example, the International Classification of Diseases, Ninth Revision, Clinical Modification code V85 is the diagnosis code for the body mass index (BMI) of adults. V85.0 is given for a BMI less than 19, V85.1 for a BMI of 19 through 24, and so on.26 Yet there is no widely used Current Procedural Terminology (CPT) designation that allows coding for a specific heart rate, LDL-C level, or A1C level. (As of 2006 and 2007, category II CPT codes have been issued that will allow coding of clinical values and will facilitate performance management through coding, but these have yet to be widely adopted.) Most importantly, despite the preponderance of evidence suggesting that morbidity and mortality are impacted by even small variations in blood pressure, there is, for example, no widely used coding that can differentiate between a systolic blood pressure of 140 mm Hg and one of 130 mm Hg.27 The new category II CPT codes might help mitigate some of this deficiency, although it is unclear by how much.
The alternative is to use clinical data contained in medical records. However, the process of collecting those data generally requires the abstraction of paper medical records, which consumes significant time and resources for physicians, or the abstraction of data from electronic health records, which have been adopted by only 15% to 20% of physicians. The emergence of regional organizations to manage health information exchange holds the promise of automating clinical data collection and aggregation,28 but only in the future. Yet this analysis seems to suggest that private-sector and public-sector payers would be well served to invest in the collection of these data for some measures, while continuing to rely on claims for others.
2. Towers Perrin. Cardiac Care Analysis–Savings Estimates. December 29, 2003. http://www.bridgestoexcellence.org/Documents/bte_towersperrin.pdf. Accessed February 29, 2008.
4. Ingenix. Evaluation of the Diabetes Care Link. February-March 2005. http://www.bridgestoexcellence.org/Documents/DPRP_Eval_2005.pdf. Accessed February 29, 2008.
6. de Brantes F. Pay-for-performance and beyond: a recipe for improving healthcare. In: The Quality Conundrum: Practical Approaches for Enhancing Patient Care. New York: PriceWaterhouseCoopers; 2007.
8. de Brantes F. Bridges to excellence: a program to start closing the quality chasm in healthcare. J Healthc Qual. 2003;25(2):2, 11.
10. Bodenheimer T, Wagner EH, Grumbach K. Improving primary care for patients with chronic illness: the chronic care model, part 2. JAMA. 2002;288(15):1909-1914.
12. Cutler D, Long G, Berndt ER, et al. The value of antihypertensive drugs: a perspective on medical innovation. Health Aff (Millwood). 2007;26(1):97-110.
14. Atkins D, Best D, Briss PA, et al; GRADE Working Group. Grading quality of evidence and strength of recommendations. BMJ. 2004;328(7454):1490.
16. Burt V, Whelton P, Roccella EJ, et al. Prevalence of hypertension in the US adult population. Results From the Third National Health and Nutrition Examination Survey, 1988-1991 [comment in Hypertension. 1995;25(3):303-304]. Hypertension. 1995;25(3):305-313.
18. Wilson P, D’Agostino RB, Levy D, Belanger AM, Silbershatz H, Kannel WB. Prediction of coronary heart disease using risk factor categories [comments in Circulation. 1998;97(18):1761-1762 and Circulation. 1999;99(16):2219]. Circulation. 1998;97(18):1837-1847.
20. US Census Bureau. Statistical Abstract of the United States, 2003. http://www.census.gov/prod/www/statistical-abstract-2001_2005.html. Accessed February 29, 2008.
22. Corvol J, Bouzamondo A, Sirol M, Hulot JS, Sanchez P, Lechat P. Differential effects of lipid-lowering therapies on stroke prevention. Arch Intern Med. 2003;163(6):669-676.
24. Yu T, Mitchell P, Berry G, Li W, Wang JJ. Retinopathy in older persons without diabetes and its relationship to hypertension. Arch Ophthalmol. 1998;116(1):83-89.
26. ICD9. chrisendres.com. Persons without reported diagnosis encountered during examination and investigation of individuals and populations (V70-V85). http://icd9cm.chrisendres.com/index.php?action=child&recordid=10981. Accessed February 29, 2008.
28. de Brantes F, Emery DW, Overhage JM, Glaser J, Marchibroda J. The potential of HIEs as infomediaries. J Healthc Inf Manag. 2007;21(1):69-75.