Building evidence-based medicine skills in gynecology
Deciphering the evidence, and emerging with the best management approach, requires knowing the basics of study design and interpretation as well as the intricacies of patient interaction. Here, a primer.
In this article
- RCTs: The good, the bad, and the ugly
- Systematic reviews: What, why, and how?
- Assessing harms
- Applying the evidence and your expertise to your patient
Surrogate outcomes
Outcomes measured in a trial should be relevant, easy to interpret and diagnose, sensitive to treatment differences, and measurable within a reasonable period of time. However, these characteristics are not always achievable for important clinical outcomes in an RCT. Therefore, a surrogate outcome may take the place of the true clinical efficacy measurement.
For example, in studies of interventions for infertility in patients with polycystic ovary syndrome (PCOS), common surrogates to the “true” desired outcome of a healthy live birth may include ovulation, implantation, or pregnancy rates. These surrogate outcomes may correlate with live birth but clearly ignore other factors extrinsic and intrinsic to PCOS that affect the chance for a healthy term delivery; the possible increased risk for miscarriage in PCOS; and increased risks of other pregnancy complications, such as preeclampsia and gestational diabetes.
Similarly, many trials of oral contraceptives that aim to study the clinical endpoint of pulmonary embolism or venous thromboembolism, which are rare events, instead use the surrogates of results of coagulation tests or levels of sex hormone-binding globulin. Clearly, caution must be exercised when interpreting studies that use surrogate outcomes. As the clinician, you must recognize that a change in a biologic or physical measurement may not be clinically relevant. Some judgment is required about causal pathways: The less that is known about the causal pathway of a disease, the less confident one should be in any surrogate outcome.
Finally, clinicians also must recognize that a valid surrogate for one treatment may not be valid for another treatment or another population.5 For example, ovulation inhibition would be an appropriate surrogate endpoint for contraceptive efficacy for a method that reliably prevents ovulation; however, this would not be a good surrogate outcome to evaluate the progestin-only pill, which fails to inhibit ovulation completely and yet is highly effective in contraceptive trials.
Avoiding pitfalls with subgroup analyses
It is common, particularly in large RCTs, to evaluate treatment effects for a specific endpoint in a subgroup of patients included in the trial. The goal is to determine whether the findings of the larger study apply more or less to a specific patient (who may differ from the total population by some important characteristic, such as age, weight, parity, or menopausal or smoking status). The variability in study results when stratified by these patient factors is known as heterogeneity of treatment effect, which may be quantitative or qualitative.6
In quantitative heterogeneity, one treatment is always better than the other, although by varying degrees depending on the subgroup. (For example, a stronger effect could be seen in those aged 65 and younger than in those older than 65.) In qualitative heterogeneity, the treatment fares better than the comparator in one subgroup but worse or no different in another subgroup. In either case, the appropriate statistical tool to identify heterogeneity of treatment effect is a test for interaction between the characteristic and the treatment effect. Heterogeneity should not be claimed on the basis of separate tests of treatment effect within the different subpopulations, because a "significant" result in one subgroup and a "nonsignificant" result in another does not show that the two effects differ from each other.
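To make the distinction concrete, here is a minimal sketch of one common form of interaction test: comparing the log odds ratios from two subgroups with a z-test. The counts are hypothetical and chosen only for illustration, and the function names are our own. A regression model with a treatment-by-subgroup interaction term would be the more usual approach in a real trial analysis.

```python
from math import log, sqrt
from statistics import NormalDist

def log_odds_ratio(events_t, n_t, events_c, n_c):
    """Log odds ratio and its standard error for one subgroup's 2x2 table."""
    a, b = events_t, n_t - events_t   # treated: events, non-events
    c, d = events_c, n_c - events_c   # control: events, non-events
    log_or = log((a * d) / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def interaction_test(subgroup_1, subgroup_2):
    """z-test for the difference between two subgroup log odds ratios."""
    lor1, se1 = log_odds_ratio(*subgroup_1)
    lor2, se2 = log_odds_ratio(*subgroup_2)
    z = (lor1 - lor2) / sqrt(se1 ** 2 + se2 ** 2)
    return z, two_sided_p(z)

# Hypothetical counts: (events treated, n treated, events control, n control)
younger = (30, 100, 45, 100)
older = (40, 100, 45, 100)

for name, counts in (("younger", younger), ("older", older)):
    lor, se = log_odds_ratio(*counts)
    print(f"{name}: log OR = {lor:.2f}, p = {two_sided_p(lor / se):.3f}")

z, p = interaction_test(younger, older)
print(f"interaction: z = {z:.2f}, p = {p:.3f}")
```

With these made-up counts, the treatment effect is nominally significant in the younger subgroup and not in the older one, yet the interaction test does not show that the two effects differ. That is the pattern separate within-subgroup tests tend to misrepresent.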
One problem with dividing the original population into smaller subpopulations is that the number of participants decreases—thus there is less power, or less statistical strength, to identify a treatment effect. More accurately, there is a greater likelihood of a type II error (a false negative) when these small subpopulations have too few patients to demonstrate a clinical treatment effect that actually may exist.
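The loss of power can be illustrated with a rough calculation. The sketch below uses the standard normal approximation for comparing two proportions with equal arm sizes. The event rates and sample sizes are hypothetical, and the function name is our own.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_treat, p_control, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions.

    Normal approximation with equal arm sizes; the negligible chance of
    rejecting in the wrong direction is ignored.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = sqrt(p_treat * (1 - p_treat) / n_per_arm
              + p_control * (1 - p_control) / n_per_arm)
    return std_normal.cdf(abs(p_treat - p_control) / se - z_crit)

# Hypothetical trial: event rate 30% with treatment vs 45% with control.
for n in (200, 100, 50):
    print(f"n per arm = {n}: power ~ {approx_power(0.30, 0.45, n):.2f}")
```

Under these assumptions, a comparison that is well powered with 200 patients per arm has far less than a 50% chance of detecting the same true difference in a subgroup of 50 per arm.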
False positives. Paradoxically, another problem with subgroup analyses is a greater chance for false positives due to the multiple statistical tests that are performed. The original study is rarely powered appropriately for these multiple comparisons (see “Error rates in subgroup analyses”). According to Wang and colleagues, “It is common practice to conduct a subgroup analysis for each of several (and often many) baseline characteristics, for each of several endpoints, or for both.”7 The more subgroup analyses performed, the more likely it is that any differences found are due to chance alone. Unfortunately, in unplanned post hoc analyses, the number of tests performed is often unreported; therefore, the error rates are unknown. There are statistical methods to try to correct for this “multiplicity” problem but, ideally, only a few key subgroup analyses are performed, and they are planned a priori in the original study design. In these cases, the study’s size can be adjusted accordingly. In most instances, findings from subgroup analyses, whether positive or negative, should be considered “hypothesis generating” and interpreted with caution.
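As an illustration of what a multiplicity correction does, the sketch below applies two widely used adjustments, Bonferroni and Holm, to a set of hypothetical subgroup p-values. The article does not specify which methods it has in mind, so these are examples rather than a recommendation, and the p-values are invented.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only where p <= alpha / k, with k the number of tests."""
    k = len(p_values)
    return [p <= alpha / k for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down procedure: test p-values from smallest to largest
    against alpha/k, alpha/(k-1), ..., stopping at the first failure."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (k - rank):
            reject[i] = True
        else:
            break  # every larger p-value is also retained
    return reject

# Hypothetical p-values from five subgroup analyses.
p_values = [0.004, 0.03, 0.012, 0.20, 0.049]

print("unadjusted:", [p <= 0.05 for p in p_values])
print("bonferroni:", bonferroni(p_values))
print("holm:      ", holm(p_values))
```

In this example, four of the five subgroup results look "significant" at the unadjusted 0.05 level, but only one survives Bonferroni correction and two survive the Holm procedure.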
Error rates in subgroup analyses
With “k” independent subgroups and no difference in treatments, the probability of at least one “significant” subgroup (that is, a false positive) is 1 – (1 – α)^k, where α is the significance level used for each test.
If α = 0.05 and there are k = 10 subgroups, then 1 – (0.95)^10 = 0.40. That is, if 10 subgroup analyses are performed, there is a 40% likelihood that at least 1 will demonstrate a “significant” difference in treatment effect, even though no difference exists.
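The same calculation can be repeated for other numbers of subgroups. This short sketch assumes, as the box does, that the tests are independent and that no true treatment difference exists. The function name is our own.

```python
def prob_any_false_positive(k, alpha=0.05):
    """Chance that at least one of k independent tests is 'significant'
    at level alpha when no true difference exists."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20):
    print(f"k = {k:2d}: {prob_any_false_positive(k):.2f}")
```

Real subgroup analyses are often correlated (the same patients appear in several subgroups), so the figure for any given trial may differ, but the trend with increasing k is the point.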