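Statistical controversies in clinical research: statistical significance, too much of a good thing
In May 2016, Annals of Oncology, the official journal of the European Society for Medical Oncology (ESMO) and the Japanese Society of Medical Oncology (JSMO), published by Oxford University Press, carried a commentary on the statistical controversy surrounding the negative result of BOLERO-1 (a randomized, double-blind trial evaluating paclitaxel plus trastuzumab with or without everolimus as first-line therapy for HER2-positive advanced breast cancer). The authors are from the International Drug Development Institute (Belgium), the University of California, Los Angeles (USA), Université Paris-Sud (France), Beijing 307 Hospital, affiliated with the Academy of Military Medical Sciences (China; Jiang Zefei), the Sarah Cannon Research Institute (USA), the Graduate School of Medicine, Kyoto University (Japan), ISAR Klinikum München (Germany) and Translational Research in Oncology, Edmonton (Canada). They argue that P-values are useful for interpreting the results of clinical research but cannot serve as the dividing line that guides clinical practice. In this trial, the benefit of everolimus was seen only in the subset of patients with hormone receptor-negative breast cancer (progression-free survival hazard ratio 0.66, P=0.0049).
The title of the paper comes from a famous line in Shakespeare's comedy As You Like It, spoken by the heroine (an English Zhu Yingtai) to the hero (an English Liang Shanbo): so how could anyone turn down a good thing? Why then, can one desire too much of a good thing?
Shakespeare did not coin the phrase, but he was the first to use it in a literary work. In the play, Rosalind and Orlando are deeply in love, but for various reasons Rosalind has to disguise herself as a man under the name Ganymede (in Greek mythology, a Trojan prince famed for his beauty, whom Zeus carried off to be cupbearer to the gods). When Orlando (Liang Shanbo) runs into Rosalind (Zhu Yingtai) in men's clothing, he does not know who the young man really is, yet the two hit it off at once and become as close as "brothers". One day Orlando confides to Ganymede how he pines for Rosalind day and night: in his eyes she is so clever, so lovely, so perfect, and he wants to marry her. Ganymede suggests that Orlando treat him as Rosalind in order to rehearse his proposal. Orlando, sincere in his love, really does address the young man as Rosalind, pours out his devotion to "her", and keeps wondering aloud how he can ever know Rosalind's heart. Ganymede, growing a little impatient, assures Orlando that even if twenty men like him declared their love, Rosalind would accept them all. Orlando is a man of good character and devoted to her; one cannot have too many such good men, so how could she bear to refuse? Why then, can one desire too much of a good thing?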
Ann Oncol. 2016 May;27(5):760-2.
Statistical Controversies in Clinical Research: Statistical significance - too much of a good thing...
Buyse M, Hurvitz SA, Andre F, Jiang Z, Burris HA, Toi M, Eiermann W, Lindsay MA, Slamon D.
International Drug Development Institute (IDDI), Louvain La Neuve, Belgium; University of California, Los Angeles (UCLA), Los Angeles, California, USA; Institut Gustave Roussy, Université Paris Sud, Villejuif, France; Beijing 307 Hospital of PLA, Beijing, China; Sarah Cannon Research Institute, Nashville, Tennessee, USA; Graduate School of Medicine, Kyoto University, Kyoto, Japan; ISAR Klinikum München, Munich, Germany; Translational Research in Oncology (TRIO), Edmonton, Canada.
The use and interpretation of P-values is a matter of debate in applied research. We argue that P-values are useful as a pragmatic guide to interpret the results of a clinical trial, not as a strict binary boundary that separates real treatment effects from lack thereof. We illustrate our point using the result of BOLERO-1, a randomized, double-blind trial evaluating the efficacy and safety of adding everolimus to trastuzumab and paclitaxel as first-line therapy for HER2+ advanced breast cancer. In this trial, the benefit of everolimus was seen only in the pre-defined subset of patients with hormone receptor-negative breast cancer at baseline (progression-free survival hazard ratio=0.66, P=0.0049). A strict interpretation of this finding, based on complex "alpha splitting" rules to assess statistical significance, led to the conclusion that the benefit of everolimus was not statistically significant either overall or in the subset. We contend that this interpretation does not do justice to the data, and we argue that the benefit of everolimus in hormone receptor-negative breast cancer is both statistically compelling and clinically relevant.
KEYWORDS: P-value; advanced breast cancer; everolimus; hormone receptor-negative; multiple comparisons; statistical significance
Key Message: "P-values are useful as a pragmatic guide to interpret the results of a clinical trial, not as a strict binary boundary that separates real treatment effects from lack thereof."
Rosalind: "Why then, can one desire too much of a good thing?"
(William Shakespeare, As You Like It, Act 4, Scene 1)
The BOLERO-1 trial
The BOLERO-1 randomized, double-blind trial evaluated the efficacy and safety of adding everolimus to trastuzumab and paclitaxel as first-line therapy for HER2+ advanced breast cancer (Clinicaltrials.gov identifier: NCT00876395). Patients were randomized 2:1 to receive either daily everolimus (10 mg/day) or placebo and weekly trastuzumab plus paclitaxel, in 4-week cycles. The two primary objectives were investigator-assessed progression-free survival (PFS) in the full study population and in the subset of patients with hormone receptor-negative (HR-) breast cancer at baseline. Table 1 shows the main PFS results of the study [1].
In the HR+ subset, the hazard ratio clearly did not differ from 1 (P = 0.67), but in the HR- subset, the hazard ratio was equal to 0.66 in favour of everolimus (P = 0.0049). However, the significance threshold for this pre-specified subset analysis (P = 0.0044) was not crossed and therefore the trial was considered "negative" [1].
Did BOLERO-1 fail to meet its objectives?
Leaving aside statistical subtleties, the interpretation of these analyses is clear enough: about two thirds of the patients receiving placebo had a PFS event at the time of the analysis. This proportion was slightly lower for patients receiving everolimus if they were HR+, but it dropped to less than half for HR- patients. The effect of everolimus in HR- patients is a reduction in the risk of progression or death of about one third (HR = 0.66), or an increase in median PFS of about one half (from about 13 months to 20 months). The benefit of everolimus is comfortably significant among HR- patients: the P-value is equal to 0.0049, i.e. there is less than 1 chance in 200 of seeing results this favourable through the play of chance alone if everolimus had no real effect. In other words, if we had carried out 200 trials identical to BOLERO-1 among HR- patients, and if everolimus truly had no benefit in such patients, we would expect only one of these trials to exhibit results as good as, or better than, those seen in BOLERO-1.
To be fair, HR- patients were only a subset of all randomized patients (though a pre-specified one), hence it would be appropriate to adjust our interpretation to account for the multiplicity of testing; we return to this issue later. It could also be argued that looking at the results among HR- patients does not use the totality of the data: an "interaction test" between the effect of everolimus and HR status would be relevant if one wanted to establish a true difference in treatment effect between patients with HR- vs. HR+ tumors. As it happens, the interaction test, for which the trial was neither designed nor powered, also reached formal statistical significance (P = 0.02). Last but not least, the larger effect of everolimus in HR- patients had already been observed in BOLERO-3 [2].
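As a rough check on the figures quoted above, the link between a hazard ratio and median PFS can be sketched under an exponential (constant hazard) survival model. That model is an assumption made here purely for illustration; the paper does not state it, and the function name is hypothetical. Under this assumption the median scales as 1/HR, and "less than 1 chance in 200" is simply P < 1/200 = 0.005.

```python
def median_under_treatment(control_median_months: float, hazard_ratio: float) -> float:
    """Median survival on treatment, assuming exponential survival in both arms.

    With a constant hazard, median = ln(2) / hazard, so multiplying the hazard
    by hazard_ratio divides the median by hazard_ratio.
    """
    if hazard_ratio <= 0:
        raise ValueError("hazard_ratio must be positive")
    return control_median_months / hazard_ratio


control_median = 13.0  # approximate placebo median PFS in HR- patients (months), from the text
hazard_ratio = 0.66    # HR- subset hazard ratio, from the text
p_value = 0.0049       # one-sided P-value in the HR- subset, from the text

treated_median = median_under_treatment(control_median, hazard_ratio)
relative_gain = treated_median / control_median - 1.0

print(f"risk reduction: {1.0 - hazard_ratio:.0%}")
print(f"implied treated median: {treated_median:.1f} months")
print(f"relative gain in median: {relative_gain:.0%}")
print(f"P = {p_value} is about 1 chance in {1.0 / p_value:.0f}")
```

The implied median of roughly 20 months and the gain of about one half match the rounded figures in the text, which suggests the quoted numbers are mutually consistent under this simple model.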
Why, then, does the main publication of BOLERO-1 portray the trial as having failed to reach its objectives, throwing back all of its findings to the unenviable status of mere conjectures? Were the results of the trial unconvincing and is another confirmatory trial needed, or did these results suggest some as yet unknown aspect of the biology of HER2-amplified tumors [3]? We believe there is overwhelming evidence for the latter interpretation, and now discuss some of the reasons why a straightforward interpretation of the trial results may have been trumped by unnecessarily rigid statistical considerations.
The origins of statistical significance
R.A. Fisher proposed the notion of statistical significance in his highly influential book Statistical Methods for Research Workers [4]. He wrote: "The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available." [4] It is interesting to note that Fisher proposed "twice the standard deviation" (which in fact corresponds to a P-value slightly lower than 0.05) as a convenient rule of thumb to assess statistical significance. And his thoughtful qualifier "even if the statistics were the only guide available" does underline the importance of considering other factors in interpreting results. Such other factors include the pre-specification of the hypothesis under consideration, the magnitude of the treatment effect, treatment effects on other endpoints, the relevance of these effects in an overall risk/benefit assessment, the biological plausibility of the effects, confirmation from other trials that these effects are reproducible and generalizable to future patients, and so on and so forth.
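Fisher's arithmetic can be checked directly. The short sketch below assumes a standard normal test statistic and a two-sided test, which is how the quoted passage is usually read; the helper name is hypothetical.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided tail probability of a standard normal deviate beyond |z|."""
    return math.erfc(abs(z) / math.sqrt(2.0))


p_at_1_96 = two_sided_p(1.96)  # the conventional 0.05 cut-off
p_at_2 = two_sided_p(2.0)      # Fisher's "twice the standard deviation"

print(f"P at 1.96 standard deviations: {p_at_1_96:.4f}")
print(f"P at 2 standard deviations:    {p_at_2:.4f}")
print(f"false indication about once in {1.0 / p_at_2:.0f} trials")
```

A deviation of exactly two standard deviations gives a P-value a little under 0.05, which is where the "once in 22 trials" in the quotation comes from.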
Problems with statistical significance
Fisher's caveats notwithstanding, P-values have become an obsession in clinical research, with the magic "P < .05" seemingly dominating all other considerations, at least in the regulatory context of granting new therapeutics market authorization. We argue here that P-values per se are not the problem, but rather an excessive reliance on P-values to dichotomize reality between "no treatment effect" and "some treatment effect". The P-value is the probability of observing data at least as extreme as the data observed, in the absence of any real treatment effect. The P-value is often misunderstood or abused, in particular to make exaggerated claims about an effect of interest [5]. The Editors of the journal Basic and Applied Social Psychology went as far as banning the use of P-values and of confidence intervals in all papers submitted for publication in their journal [6,7]. Such a staggering and rather ill-advised policy brought them much publicity, but it was otherwise wholly unjustified [8]. In clinical research, there is no question that the insistence of journal editors and regulatory agencies on using P-values and confidence intervals as a standardized way to quantify the uncertainty in the data has been hugely useful to reduce false claims about the effects of new therapies. Here we focus on the opposite situation, where excessive reverence for a P-value has led the Sponsor of a trial to claim no treatment effect in spite of rather convincing evidence that there is one.
The perils of multiple comparisons
The situation of interest here concerns multiple comparisons, and the desire to attach a significance level to each comparison performed, since there are many ways of splitting the overall (or "experiment-wise") significance level α among the different tests of interest. This problem has been known (and addressed) for a long time in the context of interim analyses, where so-called group sequential boundaries have become a popular way to split α over the successive analyses that are performed in the course of the trial [9]. The problem also occurs when a specific subset of patients is of interest, as in the BOLERO-1 trial, with prior evidence suggesting that the benefit of everolimus might be confined to HR- patients [2]. In such a case, how should α be split between the two analyses of interest, one in all randomized patients, and the other in the subset of interest? In BOLERO-1, the Sponsor took great pains to describe how α would be split between these two analyses [1]. To put things briefly, significance would be claimed overall if the one-sided P-value was less than 0.0174 (it turned out to be 0.117, Table 1), and in the subset of HR- patients if the one-sided P-value was less than 0.0044 (it turned out to be 0.0049, Table 1). The authors concluded - strictly speaking, correctly - that the trial results were not significant, either overall or in the subset [1]. One may wonder, in retrospect, why so little α was spent on the promising subset, but this criticism, made with the data already in hand, is unfair. One may also observe that the analysis of progression-free survival without any arbitrary censoring had a P-value of 0.0043. This analysis is considered more appropriate [10] than that used in BOLERO-1, which censored patients who received another anticancer therapy before confirmed progression, thereby inducing treatment-dependent censoring [11].
At any rate, the results of these two analyses show that the treatment effect may fall on either side of statistical significance depending on the analytical conventions used. Finally, the analysis made no allowance for the fact that the two analyses were by definition not independent of each other, since the set of all patients contains the subset of HR- patients - in exactly the same way as the final analysis of a trial contains the data of all preceding interim analyses [12]. This important structure of the data was not used in the analysis, which resulted in significance levels being unnecessarily conservative both for the subset and overall. All these considerations make it amply clear that a sharp distinction between a non-significant P-value of 0.0049 (which is inappropriately equated with "no treatment effect") and a significant P-value of 0.0043 (which is equated with the existence of a real treatment effect) is unhelpful at best and misleading at worst.
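The point about nested analyses can be illustrated numerically. The sketch below is not the method of reference [12] nor the analysis used in BOLERO-1; it is a minimal illustration under stated assumptions: both test statistics are standard normal under the null, the overall statistic is a weighted combination of independent statistics for the subset and its complement, and the subset carries 45% of the statistical information (a hypothetical value chosen for illustration). Only the two thresholds and the two P-values come from the text.

```python
import math
from statistics import NormalDist

ALPHA_OVERALL = 0.0174  # one-sided threshold for all patients (from the text)
ALPHA_SUBSET = 0.0044   # one-sided threshold for the HR- subset (from the text)


def familywise_error(subset_fraction: float, steps: int = 20000) -> float:
    """P(at least one false rejection) when neither population has a real effect.

    Integrates over the subset statistic s (trapezoidal rule): no rejection
    occurs when s is below its critical value and the overall statistic
    sqrt(f) * s + sqrt(1 - f) * r is below its own critical value.
    Requires 0 <= subset_fraction < 1.
    """
    std = NormalDist()
    c_overall = std.inv_cdf(1.0 - ALPHA_OVERALL)
    c_subset = std.inv_cdf(1.0 - ALPHA_SUBSET)
    w_sub = math.sqrt(subset_fraction)
    w_rest = math.sqrt(1.0 - subset_fraction)
    lower = -8.0  # the normal density below -8 is negligible
    h = (c_subset - lower) / steps
    total = 0.0
    for i in range(steps + 1):
        s = lower + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * std.pdf(s) * std.cdf((c_overall - w_sub * s) / w_rest)
    return 1.0 - total * h


sum_of_thresholds = ALPHA_OVERALL + ALPHA_SUBSET
fwer_independent = familywise_error(0.0)  # tests treated as unrelated
fwer_nested = familywise_error(0.45)      # hypothetical information fraction

print(f"sum of thresholds:        {sum_of_thresholds:.4f}")
print(f"error if independent:     {fwer_independent:.4f}")
print(f"error with nested subset: {fwer_nested:.4f}")

# The two censoring conventions fall on either side of the subset threshold.
for label, p in (("censored analysis", 0.0049), ("uncensored analysis", 0.0043)):
    print(f"{label}: P = {p}, significant = {p < ALPHA_SUBSET}")
```

Under these assumptions the chance of any false positive is lower than the simple sum of the two thresholds. This is the sense in which thresholds that ignore the nesting are conservative: accounting for the correlation would allow slightly larger thresholds for the same overall error rate.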
Statistical significance going forward
In randomized clinical trials, pre-specified criteria to gauge statistical significance should not be so broad as to be fuzzy, nor so strict as to be silly. Going forward, the proper use of statistical significance may well be just as Fisher intended it, as a pragmatic guide to inform evaluation rather than as a strict binary boundary that separates real treatment effects from lack thereof. A great deal of statistical literature has been devoted to highly sophisticated methods of splitting α [13]. While these methods are broadly helpful to control the risk of type I errors in pre-defined sets of analyses, their use should not result in putting undue emphasis on exact significance levels, since the concept of statistical significance was never intended to be used in this manner. In the case of the BOLERO-1 trial, it seems to us entirely appropriate to state that the trial achieved one of its two pre-specified co-primary objectives, which was to confirm the benefit of everolimus on PFS among HR- patients. The magnitude of the PFS benefit observed in the trial (HR = 0.66, 95% CI [0.48-0.91]) remains to be further refined in a meta-analysis of all relevant randomized trials, which should also investigate whether this effect on PFS will eventually translate into a survival improvement. But to draw no conclusion from this large, well conducted trial would be a disservice to the patients who agreed to take part in the trial, as well as to all future patients who might be deprived of an effective therapy. Statistical significance should be a guide to inform conclusions from an experiment, not a hindrance to making decisions. P-values should inform but not replace sound scientific judgment, taking all aspects of the statistical and clinical evidence into consideration.
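The hazard ratio and confidence interval quoted above can be tied back to the reported P-value with a back-of-the-envelope calculation. The sketch assumes the interval is a symmetric Wald interval on the log hazard ratio scale, which is common but not stated in the text. The result is therefore only approximate, since the inputs are rounded and the trial's P-value came from its own pre-specified test. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def wald_summary(hr: float, ci_lower: float, ci_upper: float, level: float = 0.95):
    """Recover the standard error, z-statistic and one-sided P from an HR and its CI.

    Assumes the interval is log(hr) +/- z_crit * se on the log scale.
    """
    std = NormalDist()
    z_crit = std.inv_cdf(0.5 + level / 2.0)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2.0 * z_crit)
    z = math.log(hr) / se
    p_one_sided = std.cdf(z)  # small when the HR is well below 1
    return se, z, p_one_sided


se, z, p_one_sided = wald_summary(hr=0.66, ci_lower=0.48, ci_upper=0.91)

print(f"standard error of log HR: {se:.3f}")
print(f"z-statistic: {z:.2f}")
print(f"approximate one-sided P: {p_one_sided:.4f}")
```

The reconstructed one-sided P-value is of the same order as the 0.0049 reported for the HR- subset, which is about as close as rounded inputs allow.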
Acknowledgments
The data presented in this paper were provided by the trial sponsor, Novartis Pharmaceuticals Corporation, after the publication of the primary manuscript of this trial [1]. The sponsor had no role in the writing of this paper or the interpretation of the analyses.
Funding
The BOLERO-1 trial was funded by Novartis Pharmaceuticals Corporation (no grant numbers apply).
Disclosure
The authors declare no conflicts of interest for the work presented here.
References
1. Hurvitz SA, Andre F, Jiang Z, et al. Combination of everolimus with trastuzumab plus paclitaxel as first-line treatment for HER2-positive advanced breast cancer (BOLERO-1): a phase 3, randomised, double-blind, multicentre trial. Lancet Oncol. 2015;16:816-29.
2. André F, O'Regan R, Ozguroglu M, et al. Everolimus for women with trastuzumab-resistant, HER2-positive, advanced breast cancer (BOLERO-3): a randomised, double-blind, placebo-controlled phase 3 trial. Lancet Oncol. 2014;15:580-91.
3. Von Minckwitz G. A step towards a HER2-positive breast cancer super family. Lancet Oncol. 2015;16:745-6.
4. Fisher RA. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd, 1925. ISBN 0-05-002170-2
5. Nuzzo R. Statistical errors. Nature. 2014;506:150-6.
6. Trafimow D. Editorial. Basic Appl Soc Psych. 2014;36:1-2.
7. Trafimow D, Marks M. Editorial. Basic Appl Soc Psych. 2015;37:1-2.
8. Leek JT, Peng RD. P values are just the tip of the iceberg. Nature. 2015;520:612.
9. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. New York: Chapman and Hall / CRC Press, 1999.
10. Carroll KJ. Analysis of progression-free survival in oncology trials: Some common statistical issues. Pharm Statist. 2007;6:99-113.
11. Fleming TR, Rothmann MD, Lu HL. Issues in using progression-free survival when evaluating oncology products. J Clin Oncol. 2009;27:2874-80.
12. Spiessens B, Debois M. Adjusted significance levels for subgroup analyses in clinical trials. Contemp Clin Trials. 2010;31:647-56.
13. Dmitrienko A, Tamhane AC, Bretz F. Multiple Testing Problems in Pharmaceutical Statistics. New York: Chapman and Hall / CRC Press, 2009.
PMID: 26861602
PII: mdw047
DOI: 10.1093/annonc/mdw047