Statistical Analysis in fMRI Studies: FWE, RFT, FDR, and Permutation (Revised)
In many scientific fields we run into the problem of judging significance in the presence of multiple statistical comparisons. If we run only a single test, it is usually enough to set the p-threshold at 0.05, limiting our chance of being wrong to 5% - that is, to a rare event.
In neuroimaging, however, thousands upon thousands of statistical comparisons are needed. Take task fMRI as an example: to localize the brain regions associated with some cognitive function, we usually search for activation across the whole brain, which means running one statistical test at every voxel. If the brain comprises 100,000 voxels, that is 100,000 tests. Against such a huge number of tests, you are virtually guaranteed to get hundreds of significant results by chance alone, i.e., hundreds of false-positive voxels. To avoid false positives, researchers typically correct the p-threshold according to the number of statistical comparisons performed. This is FWE correction, and it effectively lowers the probability of Type I errors.
The standard FWE correction is the Bonferroni correction: divide the single-test p-threshold (usually 0.05) by the number of comparisons in the whole experiment (e.g., 100,000), and use the resulting threshold (0.05/100,000) to judge significance. This is a very conservative correction, and it has a fatal weakness for neuroimaging research. Bonferroni is meant to correct for completely independent comparisons, whereas the signals of voxels within the brain are anything but independent.
RFT (random-field theory) correction assumes that the data follow a specific pattern of spatial variance, so that the distribution of the statistic can be modeled as a smooth random field. By computing the smoothness of the actual statistic map, it estimates how likely voxels (or clusters) at a given statistical level are to arise by chance. The greatest strength of RFT correction is that it brings smoothness into the decision. When spatial correlation is high (large smoothness), Bonferroni is far too severe, whereas RFT is much more lenient, and more reasonable. RFT correction is also computationally convenient and fast. It does have drawbacks, however: it rests on quite a few assumptions. The biggest one is that the data must reach a certain level of smoothness for the method to apply - at least 2-3 times the voxel size. In studies where you do not want to sacrifice spatial resolution, RFT is not a good fit. RFT is the default correction method in the SPM package: when you choose "FWE correction", you are using RFT correction.
FWE correction aims to control the false positives that chance alone could produce across N statistical comparisons. It ensures that the probability of any false positive appearing in your results is within 5% - in other words, the results contain essentially no false positives, and every significant voxel is genuinely activated. That does not quite match reality, because every step of data analysis introduces some noise that can make a few truly inactive voxels look significant. In other words, researchers are in practice willing to accept a certain number of false positives in their results, yet that is precisely what FWE correction tries to rule out. Unlike FWE correction, FDR correction does not guarantee that your results contain no false positives; it limits false positives to a small proportion (e.g., 5%). The difference is already visible in the two names. If 100 voxels are significant after FDR correction, at most about 5 of them are false positives, and you do not know which ones. FDR correction is somewhat more lenient than RFT correction, but every researcher should understand the price paid: a few false-positive voxels in the results. Both the SPM and AFNI packages provide FDR correction.
Permutation testing is a non-parametric method that provides quite sensitive control of the FWE. Importantly, it requires no prior assumptions about the characteristics of the data. Suppose that in an fMRI experiment the variables you manipulated - your design matrix - explain the variance in the data well and yield large beta values. A parametric test (such as an F-test) can then classify the result as significant or not. Permutation testing takes a different route. You repeatedly reassign the labels in the design matrix at random, for example swapping a stimulus condition with the control condition, or swapping the patient and control group assignments. Each randomization gives you a beta value. After many iterations (say, 5,000), you have a distribution of beta values. Against this distribution you can judge whether the beta value under the true assignment is a rare event, i.e., whether it is significant. The idea of permutation testing is to make no assumptions about the statistical distribution of the data and to test significance entirely from the data themselves.
But this seemingly has nothing to do with multiple-comparison correction. The paragraph above concerns a single voxel; what if we swap that voxel for a whole-brain statistic image? We then get a pseudo-color map (each voxel has its own significance, with color coding the strength), and every such map contains a strongest point (the point with the maximal statistic, Pmax). If we collect the maximal value from each random relabeling, we get a histogram, which we can call the Pmax distribution. Under the true labeling we also have a statistic at every voxel, and the Pmax distribution lets us judge the significance of each of them. Note that Pmax is taken from the strongest point across the whole brain at each randomization. Therefore, if a voxel's statistic exceeds 95% of these extreme values, we consider it to have survived correction, i.e., significant after correction.
Permutation testing needs no prior assumptions, but it requires running the randomized analysis many times, so if the dataset is large, the whole procedure can be quite time-consuming. SPM's SnPM Toolbox and AFNI's 3dMonteCarlo program can perform permutation tests.
1. What is the multiple-comparison problem? What is familywise error correction (FWE)?
To start, Nichols and Hayasaka () provide an excellent introduction to the issue of FWE in neuroimaging in very readable fashion. You're encouraged to check it out.
Many scientific fields have had to confront the problem of assessing statistical significance in the context of multiple tests. With a single statistical test, the standard convention dictates that a statistic is significant if it is less than 5% likely to occur by chance - a p-threshold of 0.05. But in fields like DNA microarrays or neuroimaging, many thousands of tests are done at once. Each voxel in the brain constitutes a separate test, which usually means tens of thousands of tests for a given subject. If the conventional p-threshold of 0.05 is applied on a voxelwise basis, then just by chance you're almost guaranteed to have many hundreds of false-positive voxels. In order to avoid any false positives, then, researchers generally correct their p-threshold to account for how many tests they're performing. This type of correction prevents Type I error across the whole family of tests you're doing - a familywise error correction, or FWE correction.
The standard approach to FWE correction has been the Bonferroni correction - simply divide the desired p-threshold by the number of tests, and you'll maintain correct control over the FWE rate. In general, the Bonferroni correction is a pretty conservative correction, and it suffers from a fatal flaw with neuroimaging data. The Bonferroni correction demands that all the tests be independent from each other, and that demand is manifestly not fulfilled in neuroimaging data, where there is a complex, substantial and generally unknown structure of spatial correlations in the data. Essentially, the Bonferroni correction assumes there are more spatial 'degrees of freedom' than there really are; one voxel is not independent from the next, and so one only needs to correct for the 'true' number of independent tests you're doing. This effort, though, is tricky, and so a good deal of theory has been developed on ways around Bonferroni-type corrections that still control the FWE at a reasonable level.
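To make the arithmetic concrete, here is a minimal Python sketch of the uncorrected versus Bonferroni-corrected situation; the 100,000-voxel count is an illustrative assumption, and the last two lines treat the tests as fully independent, which is exactly the assumption fMRI data violate.

```python
# Minimal sketch (illustrative numbers): the multiple-comparison problem and
# the Bonferroni fix for a whole-brain voxelwise analysis.

n_tests = 100_000          # assumed number of in-brain voxels (illustrative)
alpha = 0.05               # conventional single-test threshold

# With no correction, the expected number of false positives under the
# global null is simply alpha * n_tests.
expected_false_positives = alpha * n_tests
print(f"Uncorrected: ~{expected_false_positives:.0f} false-positive voxels expected")

# Bonferroni: divide the familywise alpha by the number of tests.
bonferroni_threshold = alpha / n_tests
print(f"Bonferroni voxelwise p-threshold: {bonferroni_threshold:.1e}")

# Probability of at least one false positive across the family, assuming
# (incorrectly, for fMRI) fully independent tests:
fwe_uncorrected = 1 - (1 - alpha) ** n_tests               # essentially 1
fwe_bonferroni = 1 - (1 - bonferroni_threshold) ** n_tests  # roughly 0.049
print(f"FWE rate uncorrected: {fwe_uncorrected:.3f}, Bonferroni: {fwe_bonferroni:.3f}")
```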
2. What is Gaussian random-field theory and how does it apply to FWE?
Worsley et al. () is one of the first papers to link random-field theory with neuroimaging data, and that link has been tremendously productive in the years since. Random-field theory (RFT) corrections attempt to control the FWE rate by assuming that the data follow certain specified patterns of spatial variance - that the distributions of statistics mimic a smoothly varying random field. RFT corrections work by calculating the smoothness of the data in a given statistic image and estimating how unlikely it is that voxels (or clusters or patterns) with particular statistic levels would appear by chance in data of that local smoothness. The big advantage of RFT corrections is that they adapt to the smoothness in the data - with highly correlated data, Bonferroni corrections are far too severe, but RFT corrections are much more liberal. RFT methods are also computationally extremely efficient.
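As a rough illustration of the mechanism (not SPM's actual implementation, which works on t/F fields and estimates smoothness from the model residuals), the sketch below plugs an assumed search volume and smoothness into the standard expected-Euler-characteristic approximation for a smooth 3D Gaussian field and reads off the threshold where that expectation falls to 0.05; all of the numbers are hypothetical.

```python
# Rough sketch of the RFT idea for a 3D Gaussian statistic field, using the
# standard expected-Euler-characteristic approximation. The volume, voxel size
# and smoothness below are illustrative assumptions.
import numpy as np
from scipy.stats import norm

search_volume_voxels = 100_000      # assumed in-brain voxels
voxel_size_mm = 3.0                 # assumed isotropic voxel size
fwhm_mm = 12.0                      # assumed smoothness (4x the voxel size)

# Resolution elements (resels): the search volume measured in units of smoothness.
resels = search_volume_voxels * voxel_size_mm**3 / fwhm_mm**3

def expected_ec(u, resels):
    """Expected Euler characteristic of a 3D Gaussian field thresholded at u."""
    return resels * (4 * np.log(2))**1.5 * (2 * np.pi)**-2 * (u**2 - 1) * np.exp(-u**2 / 2)

# At high thresholds E[EC] approximates the FWE rate, so find u where it hits 0.05.
thresholds = np.arange(2.0, 7.0, 0.001)
u_rft = thresholds[np.argmax(expected_ec(thresholds, resels) <= 0.05)]
print(f"RFT-corrected Z threshold: {u_rft:.2f}")

# Compare with Bonferroni over all voxels; the smoother the data, the more
# liberal the RFT threshold becomes relative to Bonferroni.
print(f"Bonferroni Z threshold:    {norm.isf(0.05 / search_volume_voxels):.2f}")
```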
However, RFT corrections make many assumptions about the data which render the methods somewhat less palatable. Chief among these is the assumption that the data must have a minimum level of smoothness in order to fit the theory - at least 2-3 times the voxel size is recommended at minimum, and more is better. For those researchers unwilling to pay the cost in resolution that smoothing imposes, RFT methods are problematic. As well, RFT corrections are only available for statistics whose distributions in a random field have been laboriously calculated and derived - the common statistics fall in this category (F, t, minimum t, etc.), but ad hoc statistics can't be corrected in this manner. Finally, it has become clear (as Nichols and Hayasaka show) that even with the assumptions minimally satisfied, RFT corrections tend to be too conservative.
Random-field theory corrections are available by default in SPM; in SPM99 or earlier, choosing a "corrected" p-threshold means using an RFT correction, while in SPM2, choosing the "FWE" correction to your p-threshold uses these methods. I don't believe corrections of this sort are available in AFNI or BrainVoyager.
3. What is false discovery rate (FDR)? How is it different from other types of multiple-comparison correction?
RFT methods may have their flaws, but some researchers have pointed out a different problem with the whole concept of FWE correction. FWE correction in general controls the error rate for the whole family; it guarantees that there's only a 5% chance (for example) of any false positives appearing in the data. This type of correction simply doesn't fit the intuition of many neuroimaging researchers, because it suggests that every voxel activated is a truly active voxel, and most researchers correctly assume there's enough noise in every stage of the process to make a few voxels here and there look active just by chance. Indeed, it's rarely of crucial interest in a particular study whether one particular voxel is necessarily truly or falsely positive - most researchers are willing to accept that some of their signal is actually noise - but that level of inference is precisely what FWE corrections attempt to license.
Benjamini & Hochberg, faced with this conundrum, developed a new idea. Rather than controlling the FWE rate, what if you could control the amount of false-positive data you had? They developed a method to control the false discovery rate, or FDR. Genovese et. al () recently imported this method specifically into neuroimaging. The idea in controlling the FDR is not to guarantee you haveno false positives - it's to guarantee you only have a few. Setting the FDR control level to 0.05 will guarantee that no more than 5% of your active voxels are false positives. You don't know which ones they might be, and you don't even know if fully 5% are false positive. But no more than 5% are falsely active.
The big advantage of FDR is that it adapts to the level of signal present in the data. With small signal, the correction is very liberal. With huge signal, it's relatively more severe. This adaptation renders it more sensitive than an RFT correction if there's any signal present in the data. It allows a much more liberal threshold to be set than RFT, at a cost that most researchers have already mentally paid - a few false-positive voxels. It requires almost no computational effort, and doesn't require laborious derivations to be used with new statistics.
FDR is not a perfect cure-all - it does require some assumptions about the level of spatial correlation in the data. At the outer bound, allowing any arbitrary correlation structure, it is only slightly more liberal than the equivalent RFT correction. But with looser assumptions, it's a great deal more liberal. Genovese et al. have argued that fMRI data in many situations fit a very loose set of assumptions, enabling a pretty liberal correction.
The latest edition of every major neuroimaging program provides some methods for FDR control - SPM2 and BrainVoyager QX have it built-in, and AFNI's 3dFDR program does the same work. Tom Nichols has predicted FDR methods will essentially replace most FWE correction methods within a few years, and they are beginning to be widely used throughout neuroimaging literature.
4. What is permutation testing? How is it different from other types of multiple-comparison correction?
Permutation testing is a form of non-parametric testing, and Nichols and Holmes give an excellent introduction to the field in their paper (), a much better treatment than I can give it here. But here's the extreme nutshell version. Permutation tests are a sensitive way of controlling FWE that make almost no assumptions about the data, and are related to the stats/CS concept of 'bootstrapping.'
The idea is this. You hope your experimental manipulation has had some effect on the data, and to the extent that it has, your design matrix is a model that explains the data pretty well, with large beta weights for the conditions of interest. But what if your design matrix had been different? What if you randomly re-labeled your trials, so that a trial that was actually an A trial in the real experiment was re-labeled as a B, and put into the design matrix as a B, and a B trial was re-labeled and modeled as a C trial, and a C as an A, and so forth. If your experiment had a big effect, the new, randomly mixed-up design matrix won't explain it well at all - if you re-ran your model using that matrix, you'd get much smaller beta weights. Of course, on the null hypothesis, there wasn't any effect at all due to your manipulation, which means the random design matrix should explain it just as well.
And now that you've re-labeled your design matrix and re-run your stats, you mix up the design matrix again, differently, and do the same thing. And then do it again. And again, until you've run through all the possible permutations of the design matrix (or at least a lot of them). You'll end up with a distribution of beta weights for that condition from possible design matrices. And now you go back and look at the beta weight from your real experiment. If it's at the extreme end of that distribution you've created - congrats! You've got a significant effect for that condition. The idea in permutation testing is you don't make any assumptions about what the statistic distribution could be - you go out and empirically determine it, from your own real data.
But how does that help you with the multiple-comparison problem? One nice thing about permutation testing is that you aren't restricted to testing significance for stats with known distributions, like t or F. We can use these on any ad hoc statistic we like. So let's do it across the design matrices, using as our statistic the maximal T: the value of the maximum T-statistic in the whole image for that design matrix. We come up with a distribution, just like before, and we can find the t-statistic that corresponds to the 5% most extreme parts of the maximal T distribution. And now, the clever bit: we go back to our real experiment's statistical map, and threshold it at that 5% level from the maximal T. Hopefully the t-statistics from our real experiment are generally so much higher than those from the random design matrices as to mean a lot of voxels in our real experiment will have t-statistics above that level - and we don't need to correct their significance at all, because anything in that extreme part of the maximal T distribution is guaranteed to be among the most extreme possible t-statistics for any voxel for any design matrix.
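Here is a toy Python sketch of that maximal-statistic logic for a simple two-group voxelwise comparison; the data are simulated and the helper names are made up for illustration - real tools such as SnPM handle the general design-matrix case.

```python
# Toy sketch of a maximal-statistic permutation test for a two-group comparison
# at every voxel. All dimensions and data below are simulated assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_per_group, n_voxels = 12, 2000
group_a = rng.normal(0.0, 1.0, size=(n_per_group, n_voxels))
group_b = rng.normal(0.0, 1.0, size=(n_per_group, n_voxels))
group_b[:, :50] += 2.5            # plant a real effect in the first 50 voxels

data = np.vstack([group_a, group_b])
labels = np.array([0] * n_per_group + [1] * n_per_group)

def voxelwise_t(data, labels):
    """Two-sample t statistic at every voxel for a given labelling."""
    return stats.ttest_ind(data[labels == 1], data[labels == 0], axis=0).statistic

t_real = voxelwise_t(data, labels)

# Build the null distribution of the maximal T by shuffling the group labels.
n_perm = 1000
max_t = np.empty(n_perm)
for i in range(n_perm):
    max_t[i] = voxelwise_t(data, rng.permutation(labels)).max()

# The 95th percentile of the maximal-T distribution is the FWE-corrected threshold.
t_crit = np.percentile(max_t, 95)
print(f"FWE-corrected t threshold: {t_crit:.2f}, "
      f"voxels surviving: {(t_real > t_crit).sum()}")
```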
Permutation tests have the big advantage of making almost no (but not totally none - see Nichols and Holmes for details) assumptions about your data, which means they work particularly well with low degrees of freedom, where other methods' assumptions about the shape of their statistic's distribution can be violated. They also are extremely flexible - any standard or ad hoc statistic can be tested, such as maximal T, or size of structure, or voxel's favorite color - anything. But they have a big disadvantage: computational cost. Running a permutation test involves re-estimating at least 20 models to be able to guarantee a 0.05 significance level, and so in SPM for individual data, that cost can be prohibitive. For other programs, the situation's not as bad, but it can still be pretty difficult to wait. Permutation tests are available at least in SPM99 with the SnPM toolbox, and in AFNI with the 3dMonteCarlo program. Not sure about BrainVoyager.
Editor for this issue: Chen Rui
This article is a reposted translation; if you find errors or copyright issues, please leave us a message!
52brain, Connect Young Brains.