查看原文
其他

语言教学 | 普渡大学写作教学系列Research&Citation16-Writing with Statistics(2)

提示:点击上方"英语写作教学与研究"免费关注哦

征稿:二语写作

3、Descriptive Statistics

The mean, the mode, the median, the range, and the standard deviation are all examples of descriptive statistics. Descriptive statistics are used because in most cases, it isn't possible to present all of your data in any form that your reader will be able to quickly interpret.

Generally, when writing descriptive statistics, you want to present at least one form of central tendency (or average), that is, either the mean, median, or mode. In addition, you should present one form of variability, usually the standard deviation.

Measures of Central Tendency and Other Commonly Used Descriptive Statistics

The mean, median, and the mode are all measures of central tendency. They attempt to describe what the typical data point might look like. In essence, they are all different forms of 'the average.' When writing statistics, you never want to say 'average' because it is difficult, if not impossible, for your reader to understand if you are referring to the mean, the median, or the mode.

The Mean

The mean is the most common form of central tendency, and is what most people usually are referring to when the say average. It is simply the total sum of all the numbers in a data set, divided by the total number of data points. For example, the following data set has a mean of 4: {-1, 0, 1, 16}. That is, 16 divided by 4 is 4. If there isn't a good reason to use one of the other forms of central tendency, then you should use the mean to describe the central tendency.

The Median

The median is simply the middle value of a data set. In order to calculate the median, all values in the data set need to be ordered, from either highest to lowest, or vice versa. If there are an odd number of values in a data set, then the median is easy to calculate. If there is an even number of values in a data set, then the calculation becomes more difficult. Statisticians still debate how to properly calculate a median when there is an even number of values, but for most purposes, it is appropriate to simply take the mean of the two middle values. The median is useful when describing data sets that are skewed or have extreme values. Incomes of baseballs players, for example, are commonly reported using a median because a small minority of baseball players makes a lot of money, while most players make more modest amounts. The median is less influenced by extreme scores than the mean.

{-2, 1, 3, 10, 500, 1000}

Here the median is 6.5: [(3+10)/2]

{30, 30, 50, 50, 60}

Here the median is 50.

The Mode

The mode is the most commonly occurring number in the data set. The mode is best used when you want to indicate the most common response or item in a data set. For example, if you wanted to predict the score of the next football game, you may want to know what the most common score is for the visiting team, but having an average score of 15.3 won't help you if it is impossible to score 15.3 points. Likewise, a median score may not be very informative either, if you are interested in what score is most likely.

{1, 2, 3, 4, 10, 10, 10}

Here the mode is 10.

Standard Deviation

The standard deviation is a measure of variability (it is not a measure of central tendency). Conceptually it is best viewed as the 'average distance that individual data points are from the mean.' Data sets that are highly clustered around the mean have lower standard deviations than data sets that are spread out.

For example, the first data set would have a higher standard deviation than the second data set:

{1,2,3,4,5,6,7,8,9}

Standard Deviation = 2.58

{4,4,4,5,5,5,6,6,6}

Standard Deviation = 0.82

Notice that both groups have the same mean (5) and median (also 5), but the two groups contain different numbers and are organized much differently. This organization of a data set is often referred to as a distribution. Because the two data sets above have the same mean and median, but different standard deviation, we know that they also have different distributions. Understanding the distribution of a data set helps us understand how the data behave.

4、Writing with Descriptive Statistics

Usually there is no good way to write a statistic. It rarely sounds good, and often interrupts the structure or flow of your writing. Oftentimes the best way to write descriptive statistics is to be direct. If you are citing several statistics about the same topic, it may be best to include them all in the same paragraph or section.

The mean of exam two is 77.7. The median is 75, and the mode is 79. Exam two had a standard deviation of 11.6.

Overall the company had another excellent year. We shipped 14.3 tons of fertilizer for the year, and averaged 1.7 tons of fertilizer during the summer months. This is an increase over last year, where we shipped only 13.1 tons of fertilizer, and averaged only 1.4 tons during the summer months. (Standard deviations were as followed: this summer .3 tons, last summer .4 tons).

Some fields prefer to put means and standard deviations in parentheses like this:

Group A (87.5) scored higher than group B (77.9) while both had similar standard deviations (8.3 and 7.9 respectively).

If you have lots of statistics to report, you should strongly consider presenting them in tables or some other visual form. You would then highlight statistics of interest in your text, but would not report all of the statistics. See the section on statistics and visuals for more details.

If you have a data set that you are using (such as all the scores from an exam) it would be unusual to include all of the scores in a paper or article. One of the reasons to use statistics is to condense large amounts of information into more manageable chunks; presenting your entire data set defeats this purpose.

At the bare minimum, if you are presenting statistics on a data set, it should include the mean and probably the standard deviation. This is the minimum information needed to get an idea of what the distribution of your data set might look like. How much additional information you include is entirely up to you. In general, don't include information if it is irrelevant to your argument or purpose. If you include statistics that many of your readers would not understand, consider adding the statistics in a footnote or appendix that explains it in more detail.

5、Basic Inferential Statistics: Theory and Application

The heart of statistics is inferential statistics. Descriptive statistics are typically straightforward and easy to interpret. Unlike descriptive statistics, inferential statistics are often complex and may have several different interpretations.

The goal of inferential statistics is to discover some property or general pattern about a large group by studying a smaller group of people in the hopes that the results will generalize to the larger group. For example, we may ask residents of New York City their opinion about their mayor. We would probably poll a few thousand individuals in New York City in an attempt to find out how the city as a whole views their mayor. The following section examines how this is done.

Basic Inferential Statistics: Theory and Application

A population is the entire group of people you would like to know something about. In our previous example of New York City, the population is all of the people living in New York City. It should not include people from England, visitors in New York, or even people who know a lot about New York City.

A sample is a subset of the population. Just like you may sample different types of ice cream at the grocery store, a sample of a population should be just a smaller version of the population.

It is extremely important to understand how the sample being studied was drawn from the population. The sample should be as representative of the population as possible. There are several valid ways of creating a sample from a population, but inferential statistics works best when the sample is drawn at random from the population. Given a large enough sample, drawing at random ensures a fair and representative sample of a population.

Comparing two or more groups

Much of statistics, especially in medicine and psychology, is used to compare two or more groups and attempts to figure out if the two groups are different from one another.

Example: Drug X

Let us say that a drug company has developed a pill, which they think increases the recovery time from the common cold. How would they actually find out if the pill works or not? What they might do is get two groups of people from the same population (say, people from a small town in Indiana who had just caught a cold) and administer the pill to one group, and give the other group a placebo. They could then measure how many days each group took to recover (typically, one would calculate the mean of each group). Let's say that the mean recovery time for the group with the new drug was 5.4 days, and the mean recovery time for the group with the placebo was 5.8 days.

The question becomes, is this difference due to random chance, or does taking the pill actually help you recover from the cold faster? The means of the two groups alone does not help us determine the answer to this question. We need additional information.

Sample Size

If our example study only consisted of two people (one from the drug group and one from the placebo group) there would be so few participants that we would not have much confidence that there is a difference between the two groups. That is to say, there is a high probability that chance explains our results (any number of explanations might account for this, for example, one person might be younger, and thus have a better immune system). However, if our sample consisted of 1,000 people in each group, then the results become much more robust (while it might be easy to say that one person is younger than another, it is hard to say that 1,000 random people are younger than another 1,000 random people). If the sample is drawn at random from the population, then these 'random' variations in participants should be approximately equal in the two groups, given that the two groups are large. This is why inferential statistics works best when there are lots of people involved.

Be wary of statistics that have small sample sizes, unless they are in a peer-reviewed journal. Professional statisticians can interpret results correctly from small sample sizes, and often do, but not everyone is a professional, and novice statisticians often incorrectly interpret results. Also, if your author has an agenda, they may knowingly misinterpret results. If your author does not give a sample size, then he or she is probably not a professional, and you should be wary of the results. Sample sizes are required information in almost all peer-reviewed journals, and therefore, should be included in anything you write as well.

Variability

Even if we have a large enough sample size, we still need more information to reach a conclusion. What we need is some measure of variability. We know that the typical person takes about 5-6 days to recover from a cold, but does everyone recover around 5-6 days, or do some people recover in 1 day, and others recover in 10 days? Understanding the spread of the data will tell us how effective the pill is. If everyone in the placebo group takes exactly 5.8 days to recover, then it is clear that the pill has a positive effect, but if people have a wide variability in their length of recovery (and they probably do) then the picture becomes a little fuzzy. Only when the mean, sample size, and variability have been calculated can a proper conclusion be made. In our case, if the sample size is large, and the variability is small, then we would receive a small p-value (probability-value). Small p-values are good, and this term is prominent enough to warrant further discussion.

P-values

In classic inferential statistics, we make two hypotheses before we start our study, the null hypothesis, and the alternative hypothesis.

Null Hypothesis: States that the two groups we are studying are the same.

Alternative Hypothesis: States that the two groups we are studying are different.

The goal in classic inferential statistics is to prove the null hypothesis wrong. The logic says that if the two groups aren't the same, then they must be different. A low p-value indicates a low probability that the null hypothesis is correct (thus, providing evidence for the alternative hypothesis).

Remember: It's good to have low p-values.

What a p-value actually means: The p-value you obtain from a test like this tells you precisely the following: It is the probability that you would obtain these or more extreme results assuming that the null hypothesis is true. For example, if we obtained a p-value of 0.01 (or 1%) for our drug experiment, it would mean that the probability of obtaining a difference between these two groups that is this large (or larger) is 1%, assuming that the two groups are in fact NOT different. When interpreting p-values, it is important to understand that it DOES NOT tell you the probability that the null hypothesis is wrong. If you didn't fully understand this section, that's o.k. For most students, it takes multiple courses in statistics to understand the nuances presented here.

精彩回顾

您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存