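In 2016, Kenneth J. Rothman, Sander Greenland, and colleagues published a review article in the European Journal of Epidemiology, "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations". The paper catalogues and explains the misconceptions that have built up since Fisher introduced the P value as a basis for rejecting, or failing to reject, the null hypothesis H0. I will cover it in three posts. This first post discusses the misreadings of the P value and what the P value really means.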
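I. Who is Kenneth J. Rothman?
Anyone who studies public health should know Kenneth J. Rothman. He is the leading figure in contemporary epidemiology, and his book <Modern Epidemiology> is regarded as the bible of the field. Search online if you are interested. Sander Greenland is also an author of <Modern Epidemiology>.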
Misconception 1: The P value is the probability that the test hypothesis is true; for example, if a test of the null hypothesis gave P = 0.01, the null hypothesis has only a 1% chance of being true; if instead it gave P = 0.40, the null hypothesis has a 40% chance of being true. No!
Explanation 1: The P value assumes the test hypothesis is true; it is not a hypothesis probability and may be far from any reasonable probability for the test hypothesis. The P value simply indicates the degree to which the data conform to the pattern predicted by the test hypothesis and all the other assumptions used in the test (the underlying statistical model). Thus P = 0.01 would indicate that the data are not very close to what the statistical model (including the test hypothesis) predicted they should be, while P = 0.40 would indicate that the data are much closer to the model prediction, allowing for chance variation.
(I admit I do not fully understand this one either, so I will just paste the original.) The P value for the null hypothesis is the probability that chance alone produced the observed association; for example, if the P value for the null hypothesis is 0.08, there is an 8% probability that chance alone produced the association.
No! This is a common variation of the first fallacy and it is just as false. To say that chance alone produced the observed association is logically equivalent to asserting that every assumption used to compute the P value is correct, including the null hypothesis. Thus to claim that the null P value is the probability that chance alone produced the observed association is completely backwards: The P value is a probability computed assuming chance was operating alone. The absurdity of the common backwards interpretation might be appreciated by pondering how the P value, which is a probability deduced from a set of assumptions (the statistical model), can possibly refer to the probability of those assumptions. Note: One often sees “alone” dropped from this description (becoming “the P value for the null hypothesis is the probability that chance produced the observed association”), so that the statement is more ambiguous, but just as wrong.
A significant test result (P ≤ 0.05) means that the test hypothesis is false or should be rejected.
A small P value means only that such a sample would be unusual if the null hypothesis, and every other assumption used to compute the P value, were correct. The P value may be small because the random error happened to be large, or because some assumption other than the test hypothesis was violated. P ≤ 0.05 means only that a discrepancy from the null prediction (for example, no difference between two groups) at least as large as the one observed would occur no more than 5% of the time if chance alone were producing it.
No! A small P value simply flags the data as being unusual if all the assumptions used to compute it (including the test hypothesis) were correct; it may be small because there was a large random error or because some assumption other than the test hypothesis was violated (for example, the assumption that this P value was not selected for presentation because it was below 0.05). P ≤ 0.05 only means that a discrepancy from the hypothesis prediction (e.g., no difference between treatment groups) would be as large or larger than that observed no more than 5% of the time if only chance were creating the discrepancy (as opposed to a violation of the test hypothesis or a mistaken assumption).
A nonsignificant test result (P > 0.05) means that the test hypothesis is true or should be accepted.
No! A large P value only suggests that the data are not unusual if all the assumptions used to compute the P value (including the test hypothesis) were correct. The same data would also not be unusual under many other hypotheses. Furthermore, even if the test hypothesis is wrong, the P value may be large because it was inflated by a large random error or because of some other erroneous assumption (for example, the assumption that this P value was not selected for presentation because it was above 0.05). P > 0.05 only means that a discrepancy from the hypothesis prediction (e.g., no difference between treatment groups) would be as large or larger than that observed more than 5% of the time if only chance were creating the discrepancy.
A large P value is evidence in favor of the test hypothesis. No!
In fact, any P value less than 1 means that the test hypothesis is not the hypothesis most compatible with our sample; other hypotheses may fit the sample better. A P value cannot be taken as evidence for H0 except in comparison with hypotheses that have smaller P values. Moreover, a large P value often means only that the data lack the ability to discriminate among competing hypotheses. For example, many authors read P = 0.70 as showing that the treatment has no effect. In fact, P = 0.70 does not mean that the null hypothesis is the one most compatible with the data; that would be a hypothesis with P = 1. Even when P = 1, many other hypotheses remain highly compatible with the data. So a conclusion of no association cannot be drawn from the P value, however large it is.
In fact, any P value less than 1 implies that the test hypothesis is not the hypothesis most compatible with the data, because any other hypothesis with a larger P value would be even more compatible with the data. A P value cannot be said to favor the test hypothesis except in relation to those hypotheses with smaller P values. Furthermore, a large P value often indicates only that the data are incapable of discriminating among many competing hypotheses (as would be seen immediately by examining the range of the confidence interval). For example, many authors will misinterpret P = 0.70 from a test of the null hypothesis as evidence for no effect, when in fact it indicates that, even though the null hypothesis is compatible with the data under the assumptions used to compute the P value, it is not the hypothesis most compatible with the data; that honor would belong to a hypothesis with P = 1. But even if P = 1, there will be many other hypotheses that are highly consistent with the data, so that a definitive conclusion of “no association” cannot be deduced from a P value, no matter how large.
A null-hypothesis P value greater than 0.05 means that no effect was observed, or that absence of an effect was shown or demonstrated.
No! Observing P > 0.05 for the null hypothesis only means that the null is one among the many hypotheses that have P > 0.05. Thus, unless the point estimate (observed association) equals the null value exactly, it is a mistake to conclude from P > 0.05 that a study found “no association” or “no evidence” of an effect. If the null P value is less than 1 some association must be present in the data, and one must look at the point estimate to determine the effect size most compatible with the data under the assumed model.
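To make the idea of "one among the many hypotheses" concrete, here is a minimal sketch that computes the two-sided P value for several hypothesized effect sizes, not just the null. The point estimate of 3 and standard error of 2 are hypothetical numbers, and the known-standard-error normal test is an assumption made for simplicity.

```python
from statistics import NormalDist

def two_sided_p(estimate, hypothesized, se):
    """Two-sided P value from a normal test of the hypothesis
    'true effect == hypothesized', with a known standard error."""
    z = (estimate - hypothesized) / se
    return 2 * NormalDist().cdf(-abs(z))

# Hypothetical study result (illustration only).
estimate, se = 3.0, 2.0

for effect in (0.0, 1.5, 3.0, 4.5, 6.0):
    p = two_sided_p(estimate, effect, se)
    print(f"hypothesized effect {effect:>4}: P = {p:.2f}")
```

Under these assumptions the null (effect 0) has P above 0.05, but so does an effect of 6, and the hypothesis that the effect equals the point estimate has P = 1. The data do not single out the null.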
Statistical significance indicates a scientifically or substantively important relation has been detected. No!
Especially when a study is large, very minor effects or small assumption violations can lead to statistically significant tests of the null hypothesis. Again, a small null P value simply flags the data as being unusual if all the assumptions used to compute it (including the null hypothesis) were correct; but the way the data are unusual might be of no clinical interest. One must look at the confidence interval to determine which effect sizes of scientific or other substantive (e.g., clinical) importance are relatively compatible with the data, given the model.
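The role of study size can be sketched with a few lines of arithmetic. The numbers below are hypothetical, and the sketch assumes two equal-sized groups, a known common standard deviation, and the usual normal test. It contrasts a tiny difference in a very large study with a large difference in a very small one.

```python
from math import sqrt
from statistics import NormalDist

def null_test(diff, sd, n_per_group):
    """Return (P value, 95% confidence interval) for a difference in means
    between two equal-sized groups, assuming a known common SD."""
    se = sd * sqrt(2 / n_per_group)
    p = 2 * NormalDist().cdf(-abs(diff / se))
    half_width = NormalDist().inv_cdf(0.975) * se
    return p, (diff - half_width, diff + half_width)

# Hypothetical numbers, chosen only for illustration.
p_large, ci_large = null_test(diff=0.05, sd=1.0, n_per_group=20_000)
p_small, ci_small = null_test(diff=0.80, sd=1.0, n_per_group=8)

print(f"large study: P = {p_large:.1e}, 95% CI = ({ci_large[0]:.3f}, {ci_large[1]:.3f})")
print(f"small study: P = {p_small:.2f}, 95% CI = ({ci_small[0]:.2f}, {ci_small[1]:.2f})")
```

In this sketch the large study is highly "significant" although its whole interval sits near zero, while the small study is "nonsignificant" although its interval still includes large effects. The confidence interval, not the P value, shows which effect sizes remain compatible with the data.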
Lack of statistical significance indicates that the effect size is small. No!
Especially when a study is small, even large effects may be “drowned in noise” and thus fail to be detected as statistically significant by a statistical test. A large null P value simply flags the data as not being unusual if all the assumptions used to compute it (including the test hypothesis) were correct; but the same data will also not be unusual under many other models and hypotheses besides the null. Again, one must look at the confidence interval to determine whether it includes effect sizes of importance.
Misconception 9:
The P value is the probability of our sample occurring when H0 is true. For example, P = 0.05 means that, when H0 is true, the statistic we observed has a 5% probability of occurring. No!
The P value is the chance of our data occurring if the test hypothesis is true; for example, P = 0.05 means that the observed association would occur only 5% of the time under the test hypothesis.
No! The P value refers not only to what we observed, but also observations more extreme than what we observed (where “extremity” is measured in a particular way). And again, the P value refers to a data frequency when all the assumptions used to compute it are correct. In addition to the test hypothesis, these assumptions include randomness in sampling, treatment assignment, loss, and missingness, as well as an assumption that the P value was not selected for presentation based on its size or some other aspect of the results.
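The difference between "the probability of what we observed" and "the probability of what we observed or something more extreme" can be seen in a small sketch. The example is hypothetical and not from the paper: 15 heads in 20 tosses, tested against a fair coin, with extremity measured as distance from the expected 10 heads.

```python
from math import comb

n, observed = 20, 15   # hypothetical data: 15 heads in 20 tosses
p_null = 0.5           # test hypothesis: the coin is fair

def binom_pmf(k):
    """Probability of exactly k heads in n tosses under the test hypothesis."""
    return comb(n, k) * p_null**k * (1 - p_null)**(n - k)

# Probability of exactly the observed count: this is NOT the P value.
prob_exact = binom_pmf(observed)

# One-sided P value: the observed count or any more extreme count.
p_one_sided = sum(binom_pmf(k) for k in range(observed, n + 1))

# Two-sided P value: counts at least as far from n/2 as the observed count.
distance = abs(observed - n / 2)
p_two_sided = sum(binom_pmf(k) for k in range(n + 1) if abs(k - n / 2) >= distance)

print(f"P(exactly {observed} heads) = {prob_exact:.4f}")
print(f"one-sided P value         = {p_one_sided:.4f}")
print(f"two-sided P value         = {p_two_sided:.4f}")
```

The P values are tail sums over many possible outcomes, so they are larger than the probability of the single observed outcome.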
If you reject the test hypothesis because P ≤ 0.05, the chance you are in error (the chance your “significant finding” is a false positive) is 5%.
No! To see why this description is false, suppose the test hypothesis is in fact true. Then, if you reject it, the chance you are in error is 100%, not 5%. The 5% refers only to how often you would reject it, and therefore be in error, over very many uses of the test across different studies when the test hypothesis and all other assumptions used for the test are true. It does not refer to your single use of the test, which may have been thrown off by assumption violations as well as random errors. This is yet another version of misinterpretation #1.
P = 0.05 and P ≤ 0.05 mean the same thing. No!
This is like saying that “height = 2 m” and “height ≤ 2 m” are the same thing. “Height = 2 m” covers very few people, all of whom would be considered tall, whereas “height ≤ 2 m” is satisfied by almost everyone. Likewise, P = 0.05 is a borderline result, whereas P ≤ 0.05 lumps borderline results together with results that are very incompatible with H0.
This is like saying reported height = 2 m and reported height ≤ 2 m are the same thing: “height = 2 m” would include few people and those people would be considered tall, whereas “height ≤ 2 m” would include most people including small children. Similarly, P = 0.05 would be considered a borderline result in terms of statistical significance, whereas P ≤ 0.05 lumps borderline results together with results very incompatible with the model (e.g., P = 0.0001) thus rendering its meaning vague, for no good purpose.
P values are properly reported as inequalities (e.g., report “P < 0.02” when P = 0.015 or report “P > 0.05” when P = 0.06 or P = 0.70). No!
This is a bad habit, because it makes it hard for readers to interpret the statistical result. The only exception is a very small P value (for example, under 0.001), where distinguishing among such small values has little practical meaning.
This is bad practice because it makes it difficult or impossible for the reader to accurately interpret the statistical result. Only when the P value is very small (e.g., under 0.001) does an inequality become justifiable: There is little practical difference among very small P values when the assumptions used to compute P values are not known with enough certainty to justify such precision, and most methods for computing P values are not numerically accurate below a certain point.
Statistical significance is a property of the phenomenon being studied, and thus statistical tests detect significance. No!
This misinterpretation is promoted when researchers state that they have or have not found “evidence of” a statistically significant effect. The effect being tested either exists or does not exist. “Statistical significance” is a dichotomous description of a P value (that it is below the chosen cut-off) and thus is a property of a result of a statistical test; it is not a property of the effect or population being studied.
One should always use two-sided P values. No!
Two-sided P values are designed to test hypotheses that the targeted effect measure equals a specific value (e.g., zero), and is neither above nor below this value. When, however, the test hypothesis of scientific or practical interest is a one-sided (dividing) hypothesis, a one-sided P value is appropriate. For example, consider the practical question of whether a new drug is at least as good as the standard drug for increasing survival time. This question is one-sided, so testing this hypothesis calls for a one-sided P value. Nonetheless, because two-sided P values are the usual default, it will be important to note when and why a one-sided P value is being used instead.
When the same hypothesis is tested in different studies and none or a minority of the tests are statistically significant (all P > 0.05), the overall evidence supports the hypothesis.
This belief comes up often in literature reviews, and it reflects researchers' tendency to overestimate statistical power. In reality, many studies can each fail to reach statistical significance while their combination tells a different story. For example, if five studies each have P = 0.10 and their P values are combined with the Fisher formula, the overall P value is about 0.01. So a lack of statistical significance in the individual studies does not mean that the evidence as a whole is non-significant.
No! This belief is often used to claim that a literature supports no effect when the opposite is the case. It reflects a tendency of researchers to “overestimate the power of most research” [89]. In reality, every study could fail to reach statistical significance and yet when combined show a statistically significant association and persuasive evidence of an effect. For example, if there were five studies each with P = 0.10, none would be significant at 0.05 level; but when these P values are combined using the Fisher formula [9], the overall P value would be 0.01. There are many real examples of persuasive evidence for important effects when few studies or even no study reported “statistically significant” associations [90, 91]. Thus, lack of statistical significance of individual studies should not be taken as implying that the totality of evidence supports no effect.
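The five-study example can be checked with a short sketch of Fisher's method for combining independent P values. The statistic is minus twice the sum of the log P values, which is compared with a chi-square distribution with 2k degrees of freedom for k studies. The function name `fisher_combined_p` is my own, and the closed-form tail used here is valid only because the degrees of freedom are even.

```python
from math import exp, factorial, log

def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the joint null. For even degrees of freedom
    the upper tail has the closed form
    exp(-X/2) * sum_{i<k} (X/2)**i / i!.
    """
    k = len(p_values)
    half_x = -sum(log(p) for p in p_values)   # this is X / 2
    return exp(-half_x) * sum(half_x**i / factorial(i) for i in range(k))

combined = fisher_combined_p([0.10] * 5)
print(f"combined P value = {combined:.4f}")
```

With five studies at P = 0.10 each, the combined value comes out at roughly 0.01, in line with the figure quoted from the paper, even though no single study is below 0.05.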
When the same hypothesis is tested in two different populations and the resulting P values are on opposite sides of 0.05, the results are conflicting. No!
Statistical tests are sensitive to differences between study populations, such as sample size. So two studies can give very different P values and still be in agreement. For example, take two randomized controlled trials in which the standard error is 2 in A and 1 in B, and both observe an effect of 3. Then P = 0.13 in A and P = 0.003 in B, but this does not mean the two studies reached opposite conclusions. To judge whether results differ, one has to examine the difference between them directly, in particular through a confidence interval and a P value for that difference (an analysis of heterogeneity, interaction, or modification).
Statistical tests are sensitive to many differences between study populations that are irrelevant to whether their results are in agreement, such as the sizes of compared groups in each population. As a consequence, two studies may provide very different P values for the same test hypothesis and yet be in perfect agreement (e.g., may show identical observed associations). For example, suppose we had two randomized trials A and B of a treatment, identical except that trial A had a known standard error of 2 for the mean difference between treatment groups whereas trial B had a known standard error of 1 for the difference. If both trials observed a difference between treatment groups of exactly 3, the usual normal test would produce P = 0.13 in A but P = 0.003 in B. Despite their difference in P values, the test of the hypothesis of no difference in effect across studies would have P = 1, reflecting the perfect agreement of the observed mean differences from the studies. Differences between results must be evaluated directly, for example by estimating and testing those differences to produce a confidence interval and a P value comparing the results (often called analysis of heterogeneity, interaction, or modification).
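The trial A and trial B numbers can be reproduced with a minimal sketch. It assumes the usual normal test with known standard errors, as in the example, and tests the difference between the two independent estimates using the square root of the sum of the squared standard errors. The function names are my own.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(estimate, se):
    """Two-sided P value from the normal test of 'true value == 0'."""
    return 2 * NormalDist().cdf(-abs(estimate / se))

def heterogeneity_p(est_a, se_a, est_b, se_b):
    """P value for 'no difference in effect across the two studies',
    treating the two estimates as independent."""
    return two_sided_p(est_a - est_b, sqrt(se_a**2 + se_b**2))

p_a = two_sided_p(3.0, 2.0)                    # trial A: difference 3, SE 2
p_b = two_sided_p(3.0, 1.0)                    # trial B: difference 3, SE 1
p_het = heterogeneity_p(3.0, 2.0, 3.0, 1.0)    # A versus B

print(f"trial A: P = {p_a:.2f}")
print(f"trial B: P = {p_b:.3f}")
print(f"A vs B : P = {p_het:.2f}")
```

The two P values fall on opposite sides of 0.05, yet the comparison of the two trials gives P = 1 because the observed differences are identical.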
When the same hypothesis is tested in two different populations and the same P values are obtained, the results are in agreement. No!
As with the previous misconception, studies differ in their characteristics, such as sample size, so their standard errors differ too. Identical P values, even ones below 0.05, therefore do not show that two studies agree; the effect estimates may well be different. Take two randomized controlled trials: A has a standard error of 1 and a difference of 3, while B has a standard error of 4 and a difference of 12. Both give P = 0.003, yet the test of no difference between the two effects gives P = 0.03, so the results are far from identical.
Again, tests are sensitive to many differences between populations that are irrelevant to whether their results are in agreement. Two different studies may even exhibit identical P values for testing the same hypothesis yet also exhibit clearly different observed associations. For example, suppose randomized experiment A observed a mean difference between treatment groups of 3.00 with standard error 1.00, while B observed a mean difference of 12.00 with standard error 4.00. Then the standard normal test would produce P = 0.003 in both; yet the test of the hypothesis of no difference in effect across studies gives P = 0.03, reflecting the large difference (12.00 - 3.00 = 9.00) between the mean differences.
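The experiment A and experiment B figures can be checked the same way. This sketch again assumes normal tests with known standard errors and independent estimates; the variable names are my own.

```python
from math import sqrt
from statistics import NormalDist

def normal_two_sided_p(estimate, se):
    """Two-sided P value from the normal test of 'true value == 0'."""
    return 2 * NormalDist().cdf(-abs(estimate / se))

diff_a, se_a = 3.00, 1.00     # experiment A
diff_b, se_b = 12.00, 4.00    # experiment B

p_exp_a = normal_two_sided_p(diff_a, se_a)
p_exp_b = normal_two_sided_p(diff_b, se_b)

# Test of no difference in effect across the two experiments:
# the difference of the estimates over the SE of that difference.
p_between = normal_two_sided_p(diff_b - diff_a, sqrt(se_a**2 + se_b**2))

print(f"experiment A: P = {p_exp_a:.3f}")
print(f"experiment B: P = {p_exp_b:.3f}")
print(f"A vs B      : P = {p_between:.2f}")
```

Both experiments have the same z statistic (3.00/1.00 = 12.00/4.00 = 3), so their P values are identical, while the direct comparison flags the gap of 9.00 between the two estimates.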
If one observes a small P value, there is a good chance that the next study will produce a P value at least as small for the same hypothesis. No!
This is false even under the ideal condition that both studies are independent and all assumptions including the test hypothesis are correct in both studies. In that case, if (say) one observes P = 0.03, the chance that the new study will show P ≤ 0.03 is only 3%; thus the chance the new study will show a P value as small or smaller (the “replication probability”) is exactly the observed P value! If on the other hand the small P value arose solely because the true effect exactly equaled its observed estimate, there would be a 50% chance that a repeat experiment of identical design would have a larger P value [37]. In general, the size of the new P value will be extremely sensitive to the study size and the extent to which the test hypothesis or other assumptions are violated in the new study [86]; in particular, P may be very small or very large depending on whether the study and the violations are large or small.
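Both scenarios can be worked out directly for a two-sided normal test. This is a sketch under simplifying assumptions: the repeat study has an identical design, so its z statistic is normal with unit variance, centered at 0 when the test hypothesis is true and at the observed z when the true effect equals the observed estimate. The function name is hypothetical.

```python
from statistics import NormalDist

def prob_new_p_at_most(p_obs, true_z):
    """Chance that an independent repeat of a two-sided normal test gives
    P <= p_obs when the new z statistic is distributed N(true_z, 1)."""
    z_cut = NormalDist().inv_cdf(1 - p_obs / 2)   # |z| that yields exactly p_obs
    new_z = NormalDist(mu=true_z, sigma=1.0)
    return (1 - new_z.cdf(z_cut)) + new_z.cdf(-z_cut)

p_obs = 0.03
z_obs = NormalDist().inv_cdf(1 - p_obs / 2)

# Scenario 1: the test hypothesis is true in both studies.
replication_under_null = prob_new_p_at_most(p_obs, true_z=0.0)

# Scenario 2: the true effect equals the estimate observed in the first study.
replication_at_estimate = prob_new_p_at_most(p_obs, true_z=z_obs)

print(f"test hypothesis true        : {replication_under_null:.3f}")
print(f"true effect equals estimate : {replication_at_estimate:.3f}")
```

Under the test hypothesis the chance equals the observed P value itself, and when the true effect equals the first estimate it is only about one half, so the repeat study is about as likely to give a larger P value as a smaller one.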
