GRACE Principles: Recognizing High-Quality Observational Studies of Comparative Effectiveness
Published Online: December 31, 1969
Nancy A. Dreyer, PhD; Sebastian Schneeweiss, MD; Barbara J. McNeil, MD; Marc L. Berger, MD; Alec M. Walker, MD; Daniel A. Ollendorf, MPH; and Richard E. Gliklich, MD; for the GRACE Initiative
Comparative effectiveness (CE) has been defined as “the conduct and synthesis of research comparing the benefits and harms of different interventions and strategies to prevent, diagnose, treat, and monitor health conditions in ‘real world’ settings.”1 As the demand for data to support decision making escalates, there is a growing recognition that randomized clinical trials alone will not fill the information gaps. Critics have characterized nonrandomized studies as having inferior quality of evidence because of limited internal validity, analytic challenges posed by a heterogeneous mix of patients with complex medical histories, and the lack of accepted guidance to distinguish more reliable studies.2,3 Nevertheless, observational studies are often a rich resource for meaningful information about treatment adherence, tolerance, use of concomitant therapies, and the decision-making processes and consequences of selecting or switching treatments. Real-world studies sometimes provide the only information about sensitive populations,4 sustained therapeutic effectiveness, and health services–related issues such as how the type of practitioner affects the choice of medical device.5 Noninterventional studies also can provide important information about treatment effectiveness, sometimes with surprising results, such as the lack of benefit from some types of cardiac rehabilitation in nontrial settings.6
Although the International Society for Pharmacoeconomics and Outcomes Research,7,8 the International Society for Pharmacoepidemiology,9 and the Agency for Healthcare Research and Quality have recommended good practices for observational studies and registries, they have not promulgated simple high-level principles to guide users in design and evaluation. The STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) guidelines10 and others11 address reporting, not quality. Tools such as GRADE (Grading of Recommendations Assessment, Development and Evaluation) that address quality generally rank all nonrandomized studies as “low quality,” regardless of how well an individual study was designed and conducted.12
The GRACE (Good Research for Comparative Effectiveness) principles were created to guide practitioners, researchers, journal readers, and editors in evaluating the quality of observational CE studies. The active contributors are experienced academic and private sector researchers with different perspectives on the creation and use of observational CE data.13 The GRACE principles were tested and modified through presentations and critique,14,15 including formal review by the International Society for Pharmacoepidemiology.
The GRACE principles can be used to guide the design and evaluation of studies that are based on new data collection or that use existing data, and they are consistent with good pharmacoepidemiologic practice9 and the Agency for Healthcare Research and Quality’s handbook on Registries for Evaluating Patient Outcomes.16 The GRACE principles also may be useful for CE reviews following the Cochrane principles17 or the Agency for Healthcare Research and Quality’s Methods Guide for Effectiveness and Comparative Effectiveness Reviews.18
The following 3 questions comprise the GRACE principles for evaluating nonrandomized studies of CE. Although many examples refer to drugs, the GRACE principles also apply, in large part, to medical devices, procedures, complex clinical strategies, and other elements. No scoring system is proposed or encouraged, as evidence must be weighed and the interpretation tempered in light of all available evidence. Adaptations and augmentation are anticipated as science develops.
Were the study plans (including research questions, main comparisons, outcomes, etc) specified before conducting the study?
A good study plan describes the research questions and documents the study design, target population, and intended methods for conducting the primary analyses of effectiveness and safety. The study plan also defines the diseases and conditions, patient characteristics, comparators, treatment regimens, and outcomes of interest. Creating a study plan at the outset helps assure skeptics that comparisons were not conducted iteratively until support for a preconceived conclusion was found.
The study should include clinically meaningful outcomes that would assist health professionals and patients with treatment decisions or policymakers with decisions about allocations of resources. For example, decreases in a biomarker may not affect the risk of development of clinically apparent disease, but differences in survival after invasive diagnostic procedures for acute myocardial infarction could be used to justify increasing the availability of cardiac catheterization laboratories.19 Intermediate end points can be useful when there are good data that link those end points to the long-term outcomes and when evaluation of the long-term outcomes is not feasible because of time or cost constraints. Quantitative evaluations of outcomes that are standardized, reproducible, and independently verifiable are preferable to clinical impressions or other measurements that have not been validated or have substantial interobserver variation.
Was the study conducted and analyzed in a manner consistent with good practice and reported in sufficient detail for evaluation and replication?
Observational studies of CE should be conducted, reported, and evaluated in accord with generally accepted good practices for nonrandomized research.16,20,21 Meaningful data can be collected or assembled from several sources. The challenge to their successful utilization is to understand what is recorded and why. Evaluating the validity of conclusions drawn from data collected specifically for the purposes of the study (primary data) and from data collected for other purposes (secondary data) requires an understanding of the purpose and method by which the data were assembled, enrollment and coverage factors, pathways to care, quality assurance, and other factors that may have affected the quality of the data. For example, insurance claims data may not reflect the actual clinical condition and may be coded inaccurately, imprecisely (eg, diagnosis-related groups), inconsistently, or under different constraints (eg, a treatment such as migraine medicines might be subject to pharmacy prescription limits). For prospective data collection, studies should not create an incentive for physicians to recommend specific treatments to fill recruitment quotas, and, to promote retention, study procedures should not be overly burdensome on patients or physicians. For primary and secondary data collection, it is important to assess and report which data are missing, whether their absence seems to be systematic or random, and what their potential effect on the overall results is.
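As one illustration of the reporting step described above, the following minimal Python sketch tabulates the fraction of missing values for each field within each treatment group. Markedly different missingness between groups can hint that data are missing systematically rather than at random. The field names (`treatment`, `baseline_bp`, `smoker`), the helper `missingness_by_group`, and the records are all hypothetical and are not taken from any study cited here.

```python
from collections import defaultdict


def missingness_by_group(records, group_key, fields):
    """Return {group: {field: fraction of records where field is None}}."""
    totals = defaultdict(int)
    missing = defaultdict(lambda: defaultdict(int))
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        for field in fields:
            # A field counts as missing if it is absent or explicitly None.
            if rec.get(field) is None:
                missing[group][field] += 1
    return {
        group: {field: missing[group][field] / totals[group] for field in fields}
        for group in totals
    }


# Hypothetical records, for illustration only.
records = [
    {"treatment": "A", "baseline_bp": 140, "smoker": None},
    {"treatment": "A", "baseline_bp": None, "smoker": True},
    {"treatment": "A", "baseline_bp": 150, "smoker": False},
    {"treatment": "A", "baseline_bp": 135, "smoker": True},
    {"treatment": "B", "baseline_bp": None, "smoker": None},
    {"treatment": "B", "baseline_bp": None, "smoker": False},
    {"treatment": "B", "baseline_bp": 128, "smoker": True},
    {"treatment": "B", "baseline_bp": None, "smoker": False},
]

report = missingness_by_group(records, "treatment", ["baseline_bp", "smoker"])
for group, fractions in sorted(report.items()):
    print(group, fractions)
```

In this made-up example, baseline blood pressure is missing far more often in group B than in group A, which is the kind of imbalance a report should flag and discuss. A tabulation like this describes the pattern only; it does not by itself establish why the data are missing.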
It is important to compare persons with similar risks for developing the outcomes under study and, for drugs, to consider focusing on new users (inception cohorts), as this design avoids studying only subjects who tolerate a given treatment well enough to continue treatment.22 Groups of persons whose risk for treatment or for the outcomes of interest differs may be examined using stratification and multivariable modeling techniques such as propensity scoring,23 disease risk scores,24 and instrumental variables.25 Evaluations are enhanced when adherence and compliance are accounted for.
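To make the role of stratification concrete, the sketch below contrasts a crude risk ratio with a Mantel-Haenszel risk ratio pooled across baseline-risk strata. Mantel-Haenszel pooling is used here only as one common way to stratify; the text above does not prescribe a particular estimator. The function names and all counts are hypothetical.

```python
def crude_risk_ratio(strata):
    """Risk ratio ignoring strata: pooled treated risk / pooled comparator risk."""
    a = sum(s["treated_events"] for s in strata)
    n1 = sum(s["treated_total"] for s in strata)
    c = sum(s["comparator_events"] for s in strata)
    n0 = sum(s["comparator_total"] for s in strata)
    return (a / n1) / (c / n0)


def mantel_haenszel_risk_ratio(strata):
    """Mantel-Haenszel risk ratio: sum(a_i * n0_i / N_i) / sum(c_i * n1_i / N_i)."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        n1 = s["treated_total"]
        n0 = s["comparator_total"]
        total = n1 + n0
        numerator += s["treated_events"] * n0 / total
        denominator += s["comparator_events"] * n1 / total
    return numerator / denominator


# Hypothetical counts: treated patients are concentrated in the low-risk stratum.
strata = [
    {"name": "low baseline risk", "treated_events": 4, "treated_total": 200,
     "comparator_events": 2, "comparator_total": 100},
    {"name": "high baseline risk", "treated_events": 20, "treated_total": 100,
     "comparator_events": 40, "comparator_total": 200},
]

crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)
print(f"crude RR = {crude:.2f}, stratified RR = {adjusted:.2f}")
```

In these invented data the risk is identical for treated and comparator patients within each stratum, yet the crude comparison suggests a protective effect because treated patients were mostly low risk. Stratifying on baseline risk removes that distortion, but only for the risk factors that were actually measured and used to define the strata.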
Enough information should be presented to allow others to replicate the analyses in another database or to test alternative methods of analysis in the same or a similar data set. Replication of CE in different populations and the use of alternative analytic methods can strengthen the conclusions that may be drawn from nonrandomized studies. It may also be useful to report the results of observational studies of CE in the context of how well they support existing clinical trials data.26 When the results of observational CE studies are inconsistent with those of a randomized clinical trial for similar patient subgroups, plausible explanations must be sought to avoid uncertainty about how to interpret the results of either type of study.
How valid is the interpretation of CE for the population of interest, assuming sound methods and appropriate follow-up?
A key challenge to interpreting CE studies is understanding how determinants of treatment choice are related to the expected outcomes. The highest-quality evidence comes from nonrandomized studies with the least potential for bias, especially for treatment assignment. For example, a direct way to obtain unbiased evidence about drugs would be to compare groups with similar levels of insurance coverage in which treatment decisions are driven largely by differences in benefit design and less by patient characteristics or physician preferences, as the choice of insurance (or residence, for national insurance plans) is generally unrelated to formulary decisions and treatment outcomes. The challenge in using instrumental variables like these, which have the promise of approximating randomization, is the lack of complete assurance that the variable is unrelated to outcomes directly or through patient characteristics.
Evidence of almost as high quality is derived from situations in which various treatments are commonly used and there is no good evidence favoring one treatment over another, or from situations in which the drivers of physician treatment preferences are reliably understood and those treatment determinants are independent of patient characteristics. As an illustration, consider when differences in hospital formularies discourage physicians from using a product in one hospital but promote its use in another hospital. It is unlikely that patients would choose a hospital because of its formulary, so contrasting the outcomes of similar patients treated in hospitals with different coverage for the product of interest would be unbiased.27
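The hospital formulary contrast can be sketched as a simple instrumental variable calculation. The example below uses the Wald estimator, which divides the difference in outcome rates between formulary settings by the difference in treatment rates between those settings. It is a minimal illustration that assumes the formulary affects outcomes only through treatment choice and is unrelated to patient characteristics, assumptions that, as noted earlier, can never be fully assured. The function `wald_iv_estimate` and the patient rows are hypothetical.

```python
def wald_iv_estimate(rows):
    """Wald estimator for a binary instrument.

    rows: iterable of (instrument, treated, outcome), each coded 0 or 1.
    Returns (outcome-rate difference) / (treatment-rate difference)
    between the instrument = 1 and instrument = 0 groups.
    """
    by_instrument = {0: [], 1: []}
    for z, d, y in rows:
        by_instrument[z].append((d, y))
    if not by_instrument[0] or not by_instrument[1]:
        raise ValueError("both instrument groups must contain patients")

    def mean(values):
        return sum(values) / len(values)

    d_diff = (mean([d for d, _ in by_instrument[1]])
              - mean([d for d, _ in by_instrument[0]]))
    y_diff = (mean([y for _, y in by_instrument[1]])
              - mean([y for _, y in by_instrument[0]]))
    if d_diff == 0:
        raise ValueError("instrument does not shift treatment rates")
    return y_diff / d_diff


# Hypothetical patients: instrument = 1 means the hospital formulary promotes the product.
rows = (
    [(1, 1, 1)] * 2 + [(1, 1, 0)] * 6 + [(1, 0, 1)] * 1 + [(1, 0, 0)] * 1
    + [(0, 1, 0)] * 2 + [(0, 0, 1)] * 6 + [(0, 0, 0)] * 2
)

estimate = wald_iv_estimate(rows)
print(f"IV estimate of the risk difference: {estimate:.2f}")
```

With these invented counts, treatment rates are 80% and 20% and event rates are 30% and 60% in the promoting and discouraging hospitals, respectively, so the estimate is a risk difference of about -0.5 among patients whose treatment was determined by the formulary. A real analysis would also need confidence intervals and a check that the hospitals serve comparable patients.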
The lowest-quality evidence comes from small studies and those that are less rigorous in the quality of data collected or that require assumptions about the causal inference chain that may be open to dispute. Nevertheless, such studies can identify important, previously unrecognized benefits that bear further investigation, such as the reduction in suicide from using clozapine,28 a finding that was confirmed in a trial29 and led to approval of a new indication. Studies that fall into this evidence tier may reduce some uncertainty about the magnitude of treatment effects, although it may be unclear to what extent unknown confounding factors could have artificially affected the apparent benefit of one treatment compared with another.
Generally, unless an effect is observed that is much larger than would be expected or larger than could reasonably be explained by bias or that provides new information where none was available, studies in this category of lowest evidence quality are less likely to contribute meaningfully to clinical decision making. Although there is no unanimity about how large a relative benefit is needed to be worthy of consideration as evidence for decision making, some investigators suggest that analyses showing a doubling (or more) of the relative benefit should be given serious consideration,12 while others set the bar higher.30