Play with R, Issue 15: Non-parametric tests
15.1 What will this chapter tell me?
Four non-parametric procedures:
• the Wilcoxon rank-sum test (the Mann–Whitney test): see 15.4
• the Wilcoxon signed-rank test: see 15.5
• the Kruskal–Wallis test: see 15.6
• Friedman’s test: see 15.7
15.2 When to use non-parametric tests?
• Non-parametric tests are used when the data break the assumptions of parametric tests (e.g., normality, homogeneity of variance, independence);
• They are less restricted by assumptions (hence sometimes called assumption-free tests), but have less power than parametric tests when those assumptions do hold, because they analyse ranks (ordinal information) rather than the raw interval/ratio scores.
15.3 Packages used in this chapter
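The analyses in this chapter rely on a few add-on packages; the list below follows Field's chapter, but treat it as an assumption if your edition differs.

```r
# Run the install line once, then load; package-to-function mapping as used in this chapter
# install.packages(c("clinfun", "pgirmess", "car", "pastecs"))
library(clinfun)    # jonckheere.test()
library(pgirmess)   # kruskalmc(), friedmanmc()
library(car)        # leveneTest()
library(pastecs)    # stat.desc()
```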
15.4 Comparing two independent conditions: the Wilcoxon rank-sum test
Wilcoxon rank-sum test/Mann–Whitney test (equivalent of the independent t-test)
• tests for differences between two conditions when different participants are used in each condition:
• i.e., whether the distributions of observations obtained from two separate groups on a dependent variable differ systematically from one another.
15.4.1 Theory of the Wilcoxon rank-sum test
1. Example:
RQ: the depressant effects of certain recreational drugs
Participants: 20 clubbers in all (10 were given an ecstasy tablet to take on a Saturday night and 10 were allowed to drink only alcohol).
Measure: levels of depression were measured using the Beck Depression Inventory (BDI) the day after and midweek. See Table 15.1.
2. Calculate it manually
Step 1: Arrange the scores in ascending order
pool data together, attach a label to remind you of group type
Step 2: Assign potential ranks starting with 1 for the lowest score and going up to the number of scores you have
Step 3: Assign actual ranks
If the same score occurs more than once in a data set, assign each of those scores the average of their potential ranks.
Step 4: Add the actual ranks for the scores that came from each group
→ sum of ranks for group 1; sum of ranks for group 2
Step 5: Compute the test statistic
Subtract the mean rank from the sum of ranks for each group. This corrects for the number of people in the group, because otherwise larger groups would automatically have larger rank sums → W for group 1 = sum of ranks for group 1 − mean rank of group 1; W for group 2 = sum of ranks for group 2 − mean rank of group 2.
Typically, we take the smallest value of these two to be the test statistic.
Step 6: Calculate the associated p-value.
There are two ways to calculate the associated p-value.
• Exact approach
(advantage: do not need to make any assumptions about the distribution;
disadvantage: take a long time).
1) Create many data sets that match the sample, but assign people to groups at random rather than to their actual groups.
2) Now the null hypothesis is true because the people were assigned to a group randomly.
3) The values are calculated based on these data in which the null hypothesis is true.
4) The p-value is then the proportion of these randomly generated results that are at least as extreme as the result from your actual analysis; if that proportion is small, your result is unlikely to have arisen when the null hypothesis is true.
• Normal approximation
1) If the sample size is larger than 40, the default in R is to use a normal approximation.
2) The normal approximation does not assume that the data are normal. Instead it assumes that the sampling distribution of the test statistic is normal, which means that a standard error can be computed that is used to calculate a z and hence a p-value.
3) If you have ties, you have to use normal approximation (exact approach cannot be used).
How to calculate the mean rank?
Mean rank = n(n + 1)/2, where n is the number of people in the group (i.e., the sum of the numbers 1 to n).
For example, if there are 10 people in the group, then mean rank = 10(10 + 1)/2 = 55.
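The six manual steps above can be sketched in R. The BDI scores below are made up for illustration (the real scores are in Table 15.1), so the resulting W will not match the chapter's output.

```r
# Hypothetical scores for two groups of 5 (illustration only)
ecstasy <- c(15, 35, 16, 18, 19)
alcohol <- c(16, 15, 20, 15, 16)
scores <- c(ecstasy, alcohol)                      # Step 1: pool the data...
group  <- rep(c("Ecstasy", "Alcohol"), each = 5)   # ...with a group label
ranks  <- rank(scores)            # Steps 2-3: ties get the average of their potential ranks
sumE <- sum(ranks[group == "Ecstasy"])             # Step 4: rank sum per group
sumA <- sum(ranks[group == "Alcohol"])
n <- 5
meanRank <- n * (n + 1) / 2                        # Step 5: correction for group size
W <- min(sumE - meanRank, sumA - meanRank)         # the smaller value is the test statistic
```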
15.4.2 Inputting data and provisional analysis
Step 1: use the gl() function to create a coding variable (e.g. drug), which contains two blocks of 10 rows of data labelled Ecstasy and Alcohol.
Step 2: create the dependent variable (BDI) measured the day after (i.e., sundayBDI)
and midweek (i.e., wedsBDI)
Step 3: tie the variables together in a dataframe called drugData
Step 4: run some exploratory analyses on the data
• Normality of the outcome: doing the Shapiro-Wilk test
We see the test statistic, labelled W (normtest.W), and the p-value (normtest.p) (see Output 15.1).
Sunday data
For the ecstasy data, W = 0.81, p < .05, the distribution appears to be non-normal.
For the alcohol data, W = 0.96, ns, the distribution appears to be normal.
Wednesday data
For the ecstasy data, W = 0.94, ns, the distribution appears to be normal.
For the alcohol data, W = 0.75, p < .01, the distribution appears to be non-normal.
This finding would alert us to the fact that the sampling distribution might also be non-normal for the Sunday and Wednesday data and that a non-parametric test should be used.
• Homogeneity of variance: doing the Levene’s test
For the Sunday data, F (1, 18) = 3.64, ns.
For Wednesday, F (1, 18) = 0.51, ns.
The assumption of homogeneity has been met (see Output 15.2).
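The data-entry and exploratory steps above can be sketched as follows. The BDI values here are invented stand-ins for Table 15.1, so the W and p-values will not reproduce Outputs 15.1 and 15.2 exactly; leveneTest() assumes the car package is installed.

```r
library(car)   # for leveneTest()

drug <- gl(2, 10, labels = c("Ecstasy", "Alcohol"))      # Step 1: coding variable
sundayBDI <- c(15, 35, 16, 18, 19, 17, 27, 16, 13, 20,   # Step 2: outcome, day after...
               16, 15, 20, 15, 16, 13, 14, 19, 18, 18)
wedsBDI   <- c(28, 35, 35, 24, 39, 32, 27, 29, 36, 35,   # ...and midweek
                5,  6, 30,  8,  9,  7,  6, 17,  3, 10)
drugData <- data.frame(drug, sundayBDI, wedsBDI)         # Step 3: combine into a dataframe

# Step 4: normality per group (Shapiro-Wilk) and homogeneity of variance (Levene)
by(drugData$sundayBDI, drugData$drug, shapiro.test)
by(drugData$wedsBDI, drugData$drug, shapiro.test)
leveneTest(drugData$sundayBDI, drugData$drug)
leveneTest(drugData$wedsBDI, drugData$drug)
```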
15.4.3 Running the analysis using R Commander (see P. 661-662)
15.4.4 Running the analysis using R
1. If you have the data for different groups stored in a single column
· new model: an object created that contains information about the model.
· outcome: a variable that contains the scores for the outcome measure (in this case sundayBDI or wedsBDI).
· predictor: a variable that tells us to which group a score belongs (in this case drug).
· dataFrame: the name of the dataframe containing the aforementioned variables.
· paired = FALSE determines whether or not you want to do the Wilcoxon test on matched (in which case include paired = TRUE) or independent samples (in which case exclude the option because this is the default or include paired = FALSE).
2. If you have the data for different groups stored in separate columns
To compute a basic Wilcoxon test for our Sunday data we could execute:
For the Wednesday data we could execute:
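Both call patterns can be sketched together; drugData is assumed to be set up as in 15.4.2, with invented scores here, so the statistics will differ from Outputs 15.3 and 15.4.

```r
# Minimal stand-in data (two groups of 5, invented values)
drugData <- data.frame(drug      = gl(2, 5, labels = c("Ecstasy", "Alcohol")),
                       sundayBDI = c(15, 35, 16, 18, 19, 16, 15, 20, 15, 16),
                       wedsBDI   = c(28, 35, 35, 24, 39,  5,  6, 30,  8,  9))

# Groups in a single column: formula interface (paired = FALSE is the default)
sunModel <- wilcox.test(sundayBDI ~ drug, data = drugData)
wedModel <- wilcox.test(wedsBDI ~ drug, data = drugData, paired = FALSE)
sunModel; wedModel
```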
15.4.5 Output from the Wilcoxon rank-sum test
• For the BDI score on Sunday (see Output 15.3), the type of drug did not significantly affect depression levels the day after, W = 35.5, p = .286.
• For the Wednesday data (see Output 15.4), the type of drug did significantly affect depression levels midweek, W = 4, p < .001.
15.4.6 Calculating an effect size
We need to know the z-score of the data, and then convert the z-score into the effect size estimate, r, using the formula r = z/√N, where N is the total number of observations.
Executing the following commands creates a function called rFromWilcox().
First command: calculate the value of z using the qnorm() function.
Second command: compute r using the equation by dividing z by the square root of N.
Final command: tell us what the model represents, what the output shows, and the value of r.
Apply this function by executing:
The resulting output:
a small to medium effect for the Sunday data (-0.25)
a huge effect for the Wednesday data (-0.78)
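The helper described above can be sketched as follows (reproduced from memory of Field's chapter, so treat the details as an assumption):

```r
rFromWilcox <- function(wilcoxModel, N){
  z <- qnorm(wilcoxModel$p.value / 2)   # convert the two-tailed p-value back to a z-score
  r <- z / sqrt(N)                      # effect size: r = z / sqrt(N)
  cat(wilcoxModel$data.name, "Effect Size, r = ", r)
}

# Usage (assuming models from 15.4.4 and 20 observations in total):
# rFromWilcox(sunModel, 20)
# rFromWilcox(wedModel, 20)
```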
15.4.7 Writing the results
Report the test statistic, its significance, and the effect size.
15.5 Comparing two related conditions: the Wilcoxon signed-rank test
Wilcoxon signed-rank test (equivalent of the dependent t-test)
• differences between conditions (same participants used in each condition):
15.5.1 Theory of the Wilcoxon signed-rank test
1. Example:
RQ: compare depression scores on Sunday to those on Wednesday for the two drugs separately
Participants: 20 clubbers in all (10 were given an ecstasy tablet to take on a Saturday night and
10 were allowed to drink only alcohol).
Measure: levels of depression were measured using the Beck Depression Inventory (BDI) the
day after and midweek. See Table 15.2.
2. Calculate it manually
Step 1: Calculating the difference between Sunday and Wednesday (that’s just Sunday’s score subtracted from Wednesday’s). If the difference is zero (i.e., the scores are the same on Sunday and Wednesday) then we exclude these data from the ranking.
Step 2: Making a note of the sign of the difference (positive or negative).
Step 3: Ranking the differences (starting with the smallest) ignoring whether they are positive or negative.
Step 4: Collecting together the ranks that came from a positive difference between the conditions, and add them up to get the sum of positive ranks (T+). We also add up the ranks that came from negative differences between the conditions to get the sum of negative ranks (T−).
So, for ecstasy, T+ = 36 and T− = 0 (in fact there were no negative ranks), and for alcohol, T+ = 8 and T− = 47.
Step 5:The test statistic, T, is the smaller of the two values, and so is 0 for ecstasy and 8 for alcohol.
Step 6: Calculating the significance of the test statistic (T)
First, calculate the mean (T̄) and standard error (SE_T̄) of T under the null hypothesis: T̄ = n(n + 1)/4 and SE_T̄ = √(n(n + 1)(2n + 1)/24). Because we used the same participants, there is only one sample size.
In both groups, n is nominally 10 (because that's how many participants were used).
However, remember that for our ecstasy group we excluded two people because they had differences of zero, therefore the sample size we use there is 8, not 10.
Secondly, we calculate the z-score: z = (T − T̄)/SE_T̄.
Finally, we compare z to the critical value for the chosen significance level.
If these values are bigger than 1.96 (ignoring the minus sign) then the test is significant at p < .05. So, it looks as though there is a significant difference between depression scores on Wednesday and Sunday for both ecstasy and alcohol.
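The signed-rank steps can be sketched numerically; the before/after scores below are invented, so T and z are for illustration only.

```r
# Hypothetical scores for 8 participants measured twice
sunday    <- c(15, 35, 16, 18, 19, 17, 27, 16)
wednesday <- c(28, 35, 35, 24, 39, 32, 27, 29)
d <- wednesday - sunday               # Step 1: difference scores (Steps 2 use their signs)
d <- d[d != 0]                        # zero differences are excluded from ranking
rks  <- rank(abs(d))                  # Step 3: rank ignoring sign (ties averaged)
Tpos <- sum(rks[d > 0])               # Step 4: sum of positive ranks
Tneg <- sum(rks[d < 0])               #         sum of negative ranks
Tstat <- min(Tpos, Tneg)              # Step 5: smaller sum is the test statistic
n    <- length(d)                     # sample size after exclusions
Tbar <- n * (n + 1) / 4                        # Step 6: mean of T under H0
seT  <- sqrt(n * (n + 1) * (2 * n + 1) / 24)   # standard error of T
z    <- (Tstat - Tbar) / seT                   # significant at p < .05 if |z| > 1.96
```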
15.5.2 Running the analysis with R Commander
Step 1: Because we want to look at the change for each drug separately, we need to use the subset command and ask R to split the file by the variable drug.
Select Data⇒Active data set⇒Subset active data set… to open the dialog box shown in Figure 15.5.
Then write drug == "Alcohol" as the subset expression (the group label must be quoted). Finally, we give the data set a new name; we'll call it alcoholData. Click OK to create the dataframe. Repeat the process to create a dataframe called ecstasyData that contains only the data from the ecstasy group.
Step 2: use R Commander to run the Wilcoxon signed-rank test.
select Statistics ⇒ Nonparametric tests ⇒ Paired-samples Wilcoxon test… to open the dialog box in Figure 15.6.
Pick the two variables that you would like to compare.
You can choose a p-value calculation method. You can leave this as the default, in which case R will use the exact method to calculate the p-value if your sample size is less than 40, and the normal approximation if it is larger than 40.
We will select the normal approximation (we have ties, so we cannot use the exact method anyway) and we will choose not to use the continuity correction.
15.5.3 Running the analysis with R
1. Our first job is to split the dataframe into two
Use the subset() function to create separate dataframes for the different drugs called alcoholData
and ecstasyData.
2. Use the wilcox.test() function
This time because our data are stored in different columns (wedsBDI and sundayBDI) we need to enter the names of the two variables we want to compare rather than a formula, and we need to
include the option paired = TRUE to tell R that the data are paired. (If we don’t include this option R will do a Wilcoxon rank-sum test.)
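Putting the two steps together (drugData as set up in 15.4.2; the scores are invented here, so the results will not match Outputs 15.6 and 15.7):

```r
# Stand-in data frame (the book uses 10 participants per drug)
drugData <- data.frame(drug      = gl(2, 5, labels = c("Ecstasy", "Alcohol")),
                       sundayBDI = c(15, 35, 16, 18, 19, 16, 15, 20, 15, 16),
                       wedsBDI   = c(28, 35, 35, 24, 39,  5,  6, 30,  8,  9))

# 1. Split the dataframe by drug
alcoholData <- subset(drugData, drug == "Alcohol")
ecstasyData <- subset(drugData, drug == "Ecstasy")

# 2. Paired Wilcoxon tests (omit paired = TRUE and you get a rank-sum test instead)
alcoholModel <- wilcox.test(alcoholData$wedsBDI, alcoholData$sundayBDI, paired = TRUE)
ecstasyModel <- wilcox.test(ecstasyData$wedsBDI, ecstasyData$sundayBDI, paired = TRUE)
```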
15.5.4 Output from the Wilcoxon signed-rank test
• Output 15.6 shows the output for the alcohol group. It reports the value of T+ (which it calls V) and that this value is significant at p = .047. Therefore, we should conclude (based on the medians) that when taking alcohol there was a significant decline in depression (as measured by the BDI) from the morning after to midweek (p = .047).
• Output 15.7 shows the results for the ecstasy group. We should conclude that when taking ecstasy there was a significant increase in depression (as measured by the BDI) from the morning after to midweek (p = .012).
15.5.5 Calculating an effect size
We can reuse the rFromWilcox() function from the Wilcoxon rank-sum test. In both the alcohol and ecstasy groups we had 20 observations (although we only used 10 people and tested them twice; it is the number of observations, not the number of people, that is important here).
The resulting output is:
For the alcohol group we find a medium to large change in depression when alcohol
is taken, r = −.45, which is between Cohen’s criteria of .3 and .5 for a medium and large
effect, respectively. For the ecstasy group, r = − .56, which represents a large change in
levels of depression when ecstasy is taken (it is above Cohen’s benchmark of .5).
15.5.6 Writing the results
Report the test statistic, its significance, and the effect size.
15.6 Differences between several independent groups: the Kruskal–Wallis test
The Kruskal–Wallis test can be used to test for differences between several independent groups when the data violate the assumptions of one-way independent ANOVA (e.g., non-normality or heterogeneity of variance).
I took 80 males and split them into four groups that varied in the number of soya meals they ate per week over a year-long period. The first group was a control group and had no soya meals at all per week (i.e., none in the whole year); the second group had one soya meal per week (that’s 52 over the year); the third group had four soya meals per week (that’s 208 over the year); and the final group had seven soya meals a week (that’s 364 over the year). At the end of the year, all of the participants were sent away to produce some sperm that I could count.
Step 1: Inputting data and provisional analysis
Sperm<-c(…)   # first enter the 80 sperm counts as a numeric vector
Soya<-gl(4, 20, labels = c("No Soya", "1 Soya Meal", "4 Soya Meals", "7 Soya Meals"))
soyaData<-data.frame(Sperm, Soya)
Alternatively, read the data directly from the file:
soyaData<-read.delim("Soya.dat", header = TRUE)
soyaData$Soya<-factor(soyaData$Soya, levels = levels(soyaData$Soya)[c(4, 1, 2, 3)])   # reorder the factor levels so that "No Soya" comes first
Step 2: Doing the test
①Doing the Kruskal–Wallis test using R Commander
Statistics⇒Nonparametric tests⇒Kruskal-Wallis test
②Doing the Kruskal–Wallis test using R
The Kruskal–Wallis test is done using the kruskal.test() function, which works in the same way as the wilcox.test() function that we used for the rank-sum test. The general form of the function is:
newModel<-kruskal.test(outcome ~ predictor, data = dataFrame, na.action = "na.action")
For the current data, we could, therefore execute:
kruskal.test(Sperm ~ Soya, data = soyaData)
To interpret the Kruskal–Wallis test, it is useful to obtain the mean rank for each group. We can do this by adding a variable called Ranks to the dataframe with the rank() function:
soyaData$Ranks<-rank(soyaData$Sperm)
This command creates a variable Ranks in the soyaData dataframe that is the ranks for the variable Sperm. We can then obtain the mean rank for each group using the by() and mean() functions:
by(soyaData$Ranks, soyaData$Soya, mean)
15.6.5 Output from the Kruskal–Wallis test
The output reports the test statistic H (labelled "Kruskal-Wallis chi-squared"), its degrees of freedom and its p-value; because this value is less than .05 we can conclude that the number of soya meals eaten per week does significantly affect sperm counts. Like a one-way ANOVA, though, this test tells us only that a difference exists; it doesn't tell us exactly where the differences lie. One way to get an idea is to look at the mean ranks (Output 15.11). These show that the ranks were lowest (27.35) in the group that had seven soya meals per week, but fairly similar in the other three groups, which implies that any differences might be between the seven soya meals group and the other three groups.
15.6.6 Post hoc tests for the Kruskal–Wallis test
Do the post hoc tests by executing:
kruskalmc(Sperm ~ Soya, data = soyaData)
kruskalmc(Sperm ~ Soya, data = soyaData, cont = 'two-tailed')
15.6.7 Testing for trends: the Jonckheere–Terpstra test
The Jonckheere–Terpstra test examines whether there is an ordered pattern in the medians of the groups being compared; as such, you should use it when you expect the groups you're comparing to follow a meaningful order of medians.
This function takes the general form:
jonckheere.test(outcome variable, group variable (as numbers))
jonckheere.test(soyaData$Sperm, as.numeric(soyaData$Soya))
In large samples (more than about eight per group) this test statistic has a sampling distribution that is normal, and a mean and standard deviation that are easily defined and calculated (the mean is 1200 and the standard deviation is 116.33). R has calculated the p-value for us, which is .013; because this value is less than .05 we have a statistically significant trend in the data. We can use the mean ranks (Output 15.11) to see that it is a decreasing trend: sperm counts go down as more soya is eaten.
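The normal approximation quoted above is easy to verify: with J = 912, mean 1200 and standard deviation 116.33,

```r
# z for the Jonckheere statistic and its two-tailed p-value
J <- 912; muJ <- 1200; sdJ <- 116.33
z <- (J - muJ) / sdJ       # approx. -2.48
p <- 2 * pnorm(-abs(z))    # approx. .013, matching the reported value
```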
15.6.9 Writing and interpreting the results
For the Kruskal–Wallis test, we need only report the test statistic (which, as we saw earlier, is denoted by H), its degrees of freedom and its significance. So, we could report something like:
✓ Sperm counts were significantly affected by eating soya meals, H(3) = 8.66, p = .034.
However, we need to report the follow-up tests as well (including their effect sizes):
✓ Sperm counts were significantly affected by eating soya meals, H(3) = 8.66, p = .034.
Focused comparisons of the mean ranks between groups showed that sperm counts were not significantly different when one soya meal (difference = 2.2) or four soya meals (difference = 2.2) were eaten per week compared to none. However, when seven soya meals were eaten per week sperm counts were significantly lower than when no soya was eaten (difference = 19). In all cases, the critical difference (α = .05 corrected for the number of tests) was 15.64. We can conclude that if soya is eaten every day it significantly reduces sperm counts compared to eating none; however, eating soya less frequently than every day has no significant effect on sperm counts.
We might also want to report our trend:
✓ Jonckheere’s test revealed a significant trend in the data: as more soya was eaten, the median sperm count decreased, J = 912, p = .013.
CRAMMING SAM’S TIPS The Kruskal–Wallis test
• The Kruskal–Wallis test compares several conditions when different participants take part in each condition and the resulting data violate an assumption of one-way independent ANOVA.
• Look at the p-value. If the value is less than .05 then the groups are significantly different.
• You can follow up the main analysis with post hoc tests (ideally, focused ones). If the column labelled difference in the output says ‘true’ then the groups differ significantly.
• If you predict that the means will increase or decrease across your groups in a certain order then do Jonckheere’s trend test.
• Report the H-statistic, the degrees of freedom and the significance value for the main analysis. Also report the medians and their corresponding ranges (or draw a boxplot).
15.7 Differences between several related groups: Friedman's ANOVA
Concept: Friedman's ANOVA compares conditions when the same participants take part in two or more conditions; it is particularly suited to studies where the assumptions of the parametric equivalent (repeated-measures ANOVA) are not met.
Example used in this section: young people (especially women) today are preoccupied with losing weight and dieting, and with the media's relentless promotion of ultra-thin figures many of us feel insecure about our bodies. The author proposes that following his own diet and routine leads to a healthy lifestyle. The data come from 10 women who felt dissatisfied with their figures and followed the author's diet for two months, with their weight compared across that period.
15.7.1 Theory of Friedman's ANOVA
Like the other tests in this chapter, Friedman's ANOVA is based on ranked data. In the data for this section, each row is one participant; the first three columns are her weight at baseline, after one month and after two months, and columns 4–6 contain the ranks of those three weights within that participant (1 = lowest, 2 = middle, 3 = highest).
Once each participant's weights have been ranked, the test statistic F_r is calculated as F_r = [12/(Nk(k + 1))] ΣR_i² − 3N(k + 1), where R_i is the sum of ranks for each condition (here, the totals of columns 4–6), N is the total sample size (here 10), and k is the number of conditions (here 3).
Substituting the specific values from this example into the formula gives the observed value of F_r.
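The ranking-and-formula steps can be sketched numerically (four hypothetical participants, k = 3 time points; not the book's data):

```r
# Weights (kg): rows = participants, columns = baseline, month 1, month 2
w <- matrix(c(64, 65, 63,
              70, 68, 69,
              55, 56, 57,
              80, 79, 81), nrow = 4, byrow = TRUE)
R <- colSums(t(apply(w, 1, rank)))   # rank within each person, then sum per condition
N <- nrow(w); k <- ncol(w)
Fr <- 12 / (N * k * (k + 1)) * sum(R^2) - 3 * N * (k + 1)   # Friedman statistic
```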
15.7.2 Inputting data and provisional analysis
With a repeated-measures design each condition is entered as a separate column; here there are three (baseline, month 1, month 2). Some exploratory analysis can be run with the stat.desc() command from Chapter 5. The Shapiro–Wilk results show that the baseline and month-1 weights deviate significantly from normality, W(10) = 0.78, p = .009 and W(10) = 0.68, p < .001, respectively, but the month-2 weights do not, W(10) = 0.87, p = .121.
15.7.3 Doing Friedman's ANOVA using R Commander
Import the data via Data⇒Import data⇒from text file, clipboard, or URL….
Then run the analysis via Statistics⇒Nonparametric tests⇒Friedman rank-sum test, and select the three levels of the repeated measure in the dialog box.
15.7.4 Doing Friedman's ANOVA using R
First, any cases with missing values must be removed; this can be done with a single command.
Then run Friedman's ANOVA with the friedman.test() function. Before doing so, the dataframe must be converted to a matrix, because friedman.test() expects its input in matrix form.
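A runnable sketch, with a tiny made-up data frame standing in for the book's diet data (column names assumed):

```r
# Invented weights for three participants over three time points
dietData <- data.frame(start  = c(64, 70, 55),
                       month1 = c(65, 68, 56),
                       month2 = c(63, 69, 57))
dietCompleteCases <- na.omit(dietData)                  # friedman.test() cannot handle NAs
model <- friedman.test(as.matrix(dietCompleteCases))    # the data must be a matrix
model
```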
15.7.5 Output from Friedman's ANOVA
The output shows a chi-square value of 0.2 with 2 degrees of freedom and p = .9048; because p is greater than .05, the author's recommended diet does not appear to work.
15.7.6 Post hoc tests for Friedman's ANOVA
Normally we would not follow up a non-significant result with post hoc tests, but to be thorough they can be run with the friedmanmc() function.
The post hoc comparisons confirm that none of the differences between time points is significant.
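With pgirmess loaded, the post hoc comparisons can be sketched as follows (same made-up stand-in data as above):

```r
library(pgirmess)             # provides friedmanmc()
diet <- matrix(c(64, 65, 63,  # invented weights: rows = participants, cols = time points
                 70, 68, 69,
                 55, 56, 57), nrow = 3, byrow = TRUE)
friedmanmc(diet)              # pairwise comparisons of the time points
```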
15.7.7 Calculating an effect size
The effect sizes here can again be obtained with the rFromWilcox() function.
Typesetting for this issue: 孟雪
Contributors to this issue:
权绍琦, Southwest Jiaotong University (Chengdu, Sichuan), Psychology
晏芹, Renmin University of China, Psycholinguistics
武永, University of Pittsburgh, Educational Psychology
孙洁, Northwest Normal University (Lanzhou, Gansu), Cognitive Neuroscience
李志刚, Zhejiang Normal University (Jinhua), Psycholinguistics