One of the key challenges in Comparative Effectiveness Research (CER) is synthesizing evidence from multiple studies with varying designs to arrive at an estimate of a treatment's overall effectiveness. This is the process of health technology assessment (HTA).
Systematic reviews are one means of accomplishing this and may lead to a full meta-analysis or a mixed treatment comparison (MTC), a statistical model that uses head-to-head and placebo-controlled trials to estimate the relative efficacy of multiple treatments (sort of a Fantasy Football for healthcare treatments).
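To make the idea behind an MTC concrete, here is a minimal sketch of its simplest building block, an adjusted indirect comparison: two treatments that were each tested against placebo are compared through that shared comparator. The function name and the effect sizes are made up for illustration, and a real MTC fits a full statistical model over the whole network of trials rather than a single subtraction.

```python
import math


def indirect_comparison(d_a, se_a, d_b, se_b, z=1.96):
    """Compare treatment A with treatment B through a common comparator.

    d_a, d_b   -- effects of A and B versus the shared comparator, on an
                  additive scale such as a log odds ratio
    se_a, se_b -- standard errors of those effects
    Returns the A-versus-B effect, its standard error, and a confidence interval.
    """
    d_ab = d_a - d_b
    # The two estimates come from independent trials, so their variances add.
    se_ab = math.sqrt(se_a ** 2 + se_b ** 2)
    ci = (d_ab - z * se_ab, d_ab + z * se_ab)
    return d_ab, se_ab, ci


# Illustrative numbers only (not taken from any real trial).
d_ab, se_ab, ci = indirect_comparison(d_a=-0.50, se_a=0.15, d_b=-0.20, se_b=0.20)
print(f"A vs B: {d_ab:.2f} (SE {se_ab:.2f}), 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

Note that the indirect estimate is less precise than either of the trials feeding it, which is one reason head-to-head evidence is combined with it whenever it exists.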
Systematic reviews are the core of HTAs and start with a specific question of interest to the reviewer. A standard approach to framing research questions is PICO, or "Patient, Intervention, Comparison, and Outcome". By pre-specifying the patient population, intervention(s), comparator(s), and outcome(s) of interest, a search strategy can be developed and implemented in any of the available literature search engines. For example, "Among cigarette smokers without co-morbid substance abuse or psychiatric illness (P), how does nicotine replacement (I) compare to cognitive behavioral therapy and 12-step programs (C) in achieving one-year tobacco abstinence and with respect to medical and psychiatric complications (O)?" This question may then be refined to develop search terms to collect abstracts and manuscripts.
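As a rough illustration of how a PICO question turns into search terms, the sketch below joins synonyms with OR inside each PICO element and then joins the elements with AND. The function name, the dictionary layout, and the chosen terms are all assumptions for this example; real search engines each have their own syntax and controlled vocabularies, so treat the output as a starting point rather than a finished strategy.

```python
def build_query(pico):
    """Build a generic Boolean search string from PICO terms.

    pico maps each element name to a list of search terms. Terms within an
    element are alternatives (OR); the elements must all match (AND).
    """
    groups = []
    for element in ("patient", "intervention", "comparison", "outcome"):
        terms = pico.get(element, [])
        if terms:
            quoted = [f'"{term}"' for term in terms]
            groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Hypothetical terms drawn from the smoking-cessation question above.
smoking_question = {
    "patient": ["cigarette smokers"],
    "intervention": ["nicotine replacement"],
    "comparison": ["cognitive behavioral therapy", "12-step program"],
    "outcome": ["tobacco abstinence"],
}
print(build_query(smoking_question))
```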
A systematic review then requires a framework for evaluating the quality of the studies, including those that are informative and excluding those that are not. The pre-specified search may involve inclusion and exclusion criteria based on a minimum sample size, randomization (for trials), and other design issues. However, the resulting studies must still be evaluated to rate the quality of evidence each provides. The US Preventive Services Task Force (USPSTF) uses a hierarchy of study design as the starting and ending point for rating studies. Properly conducted randomized controlled trials (RCTs) receive Class I designation as the highest quality evidence, while non-randomized trials or cohort studies receive Class II designation. Case series and expert consensus receive Class III designation. The Grading of Recommendations Assessment, Development and Evaluation (GRADE) working group employs a somewhat different approach to apply designations of High, Fair, and Poor quality. Study design still dominates, with RCTs "starting" as High and observational studies as Low quality. However, a criteria-driven process may downgrade RCTs or upgrade observational studies. Any study with a "fatal flaw" is designated as Poor quality. Both of these approaches are based on the assumption that the internal validity offered by double-blind randomization is the most important element in any question related to comparative effectiveness.
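The "start from the design, then adjust" logic can be sketched in a few lines. This is a simplification under stated assumptions: the four scale labels, the function name, and the idea of counting downgrades and upgrades as whole steps are choices made for this example, not the working group's actual procedure, and the labels should be swapped for whatever scale the rating scheme in use defines.

```python
# Assumed ordinal scale, lowest to highest (labels are illustrative).
SCALE = ("very low", "low", "moderate", "high")

# Starting point by study design, as described in the text:
# RCTs start high, observational studies start low.
START = {"rct": "high", "observational": "low"}


def rate_evidence(design, downgrades=0, upgrades=0):
    """Return a quality rating: design sets the start, criteria move it.

    downgrades and upgrades are counts of one-step adjustments (hypothetical
    inputs standing in for a reviewer's criteria-driven judgments).
    """
    level = SCALE.index(START[design]) - downgrades + upgrades
    # Keep the result on the scale: nothing rates above the top or below the bottom.
    level = max(0, min(len(SCALE) - 1, level))
    return SCALE[level]


print(rate_evidence("rct"))                      # an unadjusted RCT
print(rate_evidence("rct", downgrades=2))        # an RCT with two serious concerns
print(rate_evidence("observational", upgrades=1))
```

The point of the sketch is that design only sets the starting rung; a flawed RCT and a strong observational study can end up on the same one.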
The Agency for Healthcare Research and Quality (AHRQ) takes a somewhat different stance in its Methods Guide for Comparative Effectiveness Reviews. While acknowledging the inherent strength of RCTs to robustly answer questions of efficacy, the authors note that the strength of a design can only be assessed in the context of the question at hand. Specifically, they note that the long-term safety of a new medication may best be assessed through an observational study. Why? Because RCTs tend to attract healthier patients without many of the co-morbidities and concomitant treatments that may affect the overall safety of the medication. In other words, RCTs may have biased enrollment that prevents them from answering the question of real-world safety outside of controlled experimental settings.
I would extend this caveat to any questions of real-world effectiveness of treatments to achieve desired clinical outcomes in non-experimental settings. For example, if physicians do not commonly prescribe the same dose studied in published RCTs or in the product label, systematic reviews that favor such studies may be irrelevant to the question of comparative effectiveness in current practice. In one study, colleagues and I demonstrated that time to psychiatric hospitalization was longer with one antipsychotic compared to others. This appears to have been influenced by the fact that all of the medications tended to be initiated at sub-therapeutic doses. The comparators tended to be dosed much lower than the intervention of interest. A systematic review and meta-analysis of the RCT literature would have found no meaningful difference between treatments. In real-world practice, there was.
The experimental design of RCTs is our gold standard for determining whether a medication has sufficient biological activity to favorably affect disease. Robust methods for comparing such effect sizes across different trials are available. However, policy makers should be cautious in assuming that experimental results can be directly applied to current clinical practice. If our question relates to real-world practice settings, observational methods may be the most appropriate design to answer the question.