Handling Outliers and Missing Values: Steps to Make Your Data Representative of the Population
Below is a complete set of steps for analyzing and handling outliers and missing values. If the English version feels heavy going, see the earlier first article on classic textbooks; the Chinese-language books listed there can help.
1. Types of Outliers
Outliers must be interpreted in the context of the study and this interpretation should be based on the types of information they provide. Depending on the source of their uniqueness, outliers can be classified into three categories:
– The first type of outlier is produced by data collection or entry errors. For example, if we ask people to indicate their household income in thousands of US dollars, some respondents may just indicate theirs in US dollars (not thousands). Obviously, there is a substantial difference between $30 and $30,000! Moreover, (as discussed before) data entry errors occur frequently. Outliers produced by data collection or entry errors should be deleted, or we need to determine the correct values by, for example, returning to the respondents.
– A second type of outlier occurs because exceptionally high or low values are a part of reality. While such observations can influence results significantly, they are sometimes highly important for researchers, because the characteristics of outliers can be insightful. Think, for example, of extremely successful companies, or users with specific needs long before most of the relevant marketplace also needs them (i.e., lead users). Deleting such outliers is not appropriate, but the impact that they have on the results must be discussed.
– A third type of outlier occurs when combinations of values are exceptionally rare. For example, if we look at income and expenditure on holidays, we may find someone who earns $1,000,000 and spends $500,000 of his/her income on holidays. Such combinations are unique and have a very strong impact on the
results (particularly the correlations). In such situations, the outlier should be retained, unless specific evidence suggests that it is not a valid member of the population under study. It is very useful to flag such outliers and discuss their impact on the results.
Detecting Outliers
In a simple form, outliers can be detected using univariate or bivariate graphs and statistics. When searching for outliers, we need to use multiple approaches to ensure that we detect all the observations that can be classified as outliers. In the following, we discuss both routes to outlier detection:
Univariate Detection
The univariate detection of outliers examines the distribution of observations of each variable with the aim of identifying those cases falling outside the range of the “usual” values. In other words, finding outliers means finding observations with very low or very high variable values. This can be achieved by calculating the minimum and maximum value of each variable, as well as the range. Another useful option for detecting outliers is by means of box plots, which are a means of visualizing the distribution of a variable and pinpointing those observations that fall outside the range of the “usual” values. We introduce the above statistics and box plots in greater detail in the Describe Data section. It is important to recognize that there will always be observations with exceptional values in one or more variables. However, we should strive to identify outliers that impact the presented results.
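The range statistics and the box-plot rule described above can be sketched in a few lines of Python. This is an illustrative stand-alone example with made-up income data (the 30000 entry mimics the dollars-vs-thousands entry error mentioned earlier); it is not code from the text itself.

```python
import statistics

def iqr_outliers(values):
    """Flag values outside the 'usual' range using the box-plot rule:
    anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical household incomes (in thousands); one entry-error outlier
income = [28, 31, 30, 29, 33, 35, 27, 30000]
print(min(income), max(income), max(income) - min(income))  # min, max, range
print(iqr_outliers(income))
```

Here the minimum, maximum, and range alone already reveal the suspicious value, and the box-plot rule isolates it formally.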
Bivariate Detection
We can also examine pairs of variables to identify observations whose combinations of variables are exceptionally rare. This is done by using a scatter plot, which plots all observations in a graph where the x-axis represents the first variable and the y-axis the second (usually dependent) variable. Observations that fall markedly outside the range of the other observations will show as isolated points in the scatter plot.
A drawback of this approach is the number of scatter plots that we need to draw. For example, with 10 variables, we need to draw 45 scatter plots to map all possible combinations of variables! Consequently, we should limit the analysis to a few relationships, such as those between a dependent and independent variable in a regression. Scatter plots with large numbers of observations are often problematic when we wish to detect outliers, as there is usually not just one dot or a few isolated dots, but a cloud of observations in which it is difficult to determine a cutoff point.
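The count of 45 plots for 10 variables is just the number of unordered pairs, k(k-1)/2, which can be checked directly:

```python
from math import comb

# Number of distinct scatter plots needed to examine every pair of k variables
def n_scatter_plots(k):
    return comb(k, 2)  # k * (k - 1) / 2 unordered pairs

print(n_scatter_plots(10))  # 45, as stated in the text
```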
Dealing with Outliers
In a final step, we need to decide whether to delete or retain outliers, which should be based on whether we have an explanation for their occurrence. If there is an explanation (e.g., because some exceptionally wealthy people were included in the sample), outliers are typically retained, because they are part of the population.
However, their impact on the analysis results should be carefully evaluated. That is, one should run an analysis with and without the outliers to assess if they influence the results. If the outliers are due to a data collection or entry error, they should be deleted. If there is no clear explanation, outliers should be retained.
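The "run the analysis with and without the outliers" advice can be illustrated with a minimal sketch. The data and the flagged value are invented for the example; the point is only to show how strongly a single extreme observation can move a simple statistic such as the mean.

```python
import statistics

# Hypothetical incomes (in thousands) with one previously flagged outlier
sample = [28, 31, 30, 29, 33, 35, 27, 30000]
flagged = [30000]

with_all = statistics.mean(sample)
without = statistics.mean([v for v in sample if v not in flagged])
print(with_all, without)  # the outlier dominates the first estimate
```

If the two results differ substantially, the outlier's influence must be discussed alongside whichever version is reported.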
2. Missing Data
Market researchers often have to deal with missing data. There are two levels at which missing data occur:
– Entire surveys are missing (survey non-response).
– Respondents have not answered all the items (item non-response).
Survey non-response (also referred to as unit non-response) occurs when entire surveys are missing. Survey non-response is very common and regularly only 5–25% of respondents fill out surveys. Although higher percentages are possible, they are not the norm in one-shot surveys. Issues such as inaccurate address lists, a lack of interest and time, people confusing market research with selling, privacy issues, and respondent fatigue also lead to dropping response rates. The issue of survey response is best solved by designing proper surveys and survey procedures.
Item non-response occurs when respondents do not provide answers to certain questions. There are different forms of missingness, including people not filling out or refusing to answer questions. Item non-response is common and 2–10% of questions usually remain unanswered. However, this number greatly depends on factors, such as the subject matter, the length of the questionnaire, and the method of administration. Non-response can be much higher in respect of questions that people consider sensitive and varies from country to country. In some countries, for
instance, reporting incomes is a sensitive issue.
The key issue with item non-response is the type of pattern that the missing data follow. Do the missing values occur randomly, or is there some type of underlying system? Once we have identified the type of missing data, we need to decide how to treat them. Figure 5.2 illustrates the process of missing data treatment.
The Three Types of Missing Data: Paradise, Purgatory, and Hell
We generally distinguish between three types of missing data:
– missing completely at random (“paradise”),
– missing at random (“purgatory”), and
– non-random missing (“hell”).
Data are missing completely at random (MCAR) when the probability of data being missing is unrelated to any other measured variable and is unrelated to the variable with missing values. MCAR data thus occur when there is no systematic reason for certain data points being missing. For example, MCAR may happen if the Internet server hosting the web survey broke down temporarily for a random reason. Why is MCAR paradise? When data are MCAR, observations with missing data are indistinguishable from those with complete data. If this is the case and little data are missing (typically less than 10% in each variable) listwise deletion can be used. Listwise deletion means that we only analyze complete cases; in most statistical software, such as Stata, this is a default option. Note that this default option in Stata only works when estimating models and only applies to the variables included in the model. When more than 10% of the data are missing, we can use multiple imputation (Eekhout et al. 2014), a more complex approach to missing data treatment that we discuss in the section Dealing with Missing Data.
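Listwise (complete-case) deletion is simple to express in code. The sketch below uses an invented toy dataset; it mirrors what Stata does by default for the variables entering a model, namely dropping any observation with at least one missing value.

```python
# Listwise (complete-case) deletion: keep only rows with no missing values.
rows = [
    {"income": 30, "age": 25},
    {"income": None, "age": 41},   # item non-response on income
    {"income": 55, "age": None},   # item non-response on age
    {"income": 42, "age": 37},
]

complete = [r for r in rows if all(v is not None for v in r.values())]
print(len(complete))  # only the complete cases remain
```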
Unfortunately, data are rarely MCAR. If a missing data point (e.g., xi) is unrelated to the observed value of xi, but depends on another observed variable, we consider the data missing at random (MAR). In this case, the probability that the data point is missing varies from respondent to respondent. The term MAR is unfortunate, because many people confuse it with MCAR; however, the label has
stuck. An example of MAR is when women are less likely to reveal their income. That is, the probability of missing data depends on the gender and not on the income. Why is MAR purgatory? When data are MAR, the missing value pattern is not random, but this can be handled by more sophisticated missing data techniques such as multiple imputation techniques. We will illustrate how to impute a dataset with missing observations.
Lastly, data are non-random missing (NRM) when the probability that a data point (e.g., xi) is missing depends on the variable x and on other unobserved factors. For example, very affluent and poor people are less likely to indicate their income. Thus, the missing income values depend on the income variable, but also on other unobserved factors that inhibit the respondents from reporting their incomes. This is the most severe type of missing data (“hell”), as even sophisticated missing data techniques do not provide satisfactory solutions. Thus, any result based on NRM data should be considered with caution. NRM data can best be prevented by extensive pretesting and consultations with experts to avoid surveys that cause problematic response behavior. For example, we could use income categories instead of querying the respondents’ income directly, or we could simply omit the income variable.
A visualization of these three missingness mechanisms can be found under https://iriseekhout.shinyapps.io/MissingMechanisms/
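The three mechanisms can also be simulated. The toy generator below is an assumption-laden sketch (invented probabilities, a gender dummy, and a normal income) meant only to make the definitions concrete: under MCAR missingness ignores everything, under MAR it depends on an observed variable (gender), and under NRM it depends on the unobserved income value itself.

```python
import random

random.seed(1)
# Toy population: gender is observed, income may go missing
people = [{"female": random.random() < 0.5,
           "income": random.gauss(50, 15)} for _ in range(1000)]

def mask(p, mechanism):
    """Return the income, or None if it goes missing under the mechanism."""
    inc = p["income"]
    if mechanism == "MCAR":          # independent of everything
        drop = random.random() < 0.2
    elif mechanism == "MAR":         # depends on another observed variable
        drop = p["female"] and random.random() < 0.4
    else:                            # NRM: depends on the value itself
        drop = (inc < 35 or inc > 65) and random.random() < 0.5
    return None if drop else inc

for mech in ("MCAR", "MAR", "NRM"):
    miss = sum(mask(p, mech) is None for p in people)
    print(mech, miss / len(people))
```

Note that under MAR every missing income belongs to a female respondent, exactly the "women are less likely to reveal their income" example above.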
Testing for the Type of Missing Data
When dealing with missing data, we must ascertain the missing data's type. If the dataset is small, we can browse through the data for obvious nonresponse patterns. However, missing data patterns become more difficult to identify with an increasing sample size and number of variables. Similarly, when we have few observations, patterns are also difficult to spot. In these cases, we should use one (or both) of the following diagnostic tests to identify missing data patterns:
– Little’s MCAR test, and
– mean difference tests.
Little’s MCAR test (Little 1998) analyzes the pattern of the missing data by comparing the observed data with the pattern expected if the data were randomly missing. If the test indicates no significant differences between the two patterns, the missing data can be classified as MCAR. Put differently, the null hypothesis is that the data are MCAR. Thus,
– if we do not reject the null hypothesis, we assume that the data are MCAR, and
– if we reject the null hypothesis, the data are either MAR or NRM.
If the data cannot be assumed to be MCAR, we need to test whether the missing pattern is caused by another variable in the dataset. Looking at group means and their differences can also reveal missing data problems. For example, we can run a two independent samples t-test to explore whether there is a significant difference in the mean of a continuous variable (e.g., income) between the group with missing values and the group without missing values. In respect of nominal or ordinal variables, we could tabulate the occurrence of non-responses against different groups’ responses. If we put the (categorical) variable about which we have concerns in one column of a table (e.g., income category), and the number of (non-)responses in another, we obtain a table similar to Table 5.2.
Using the χ2-test (pronounced as chi-square), which we discuss under nonparametric tests, we can test if there is a significant relationship between the respondents’ (non-)responses in respect of a certain variable and their income. In this example, the test indicates that there is a significant relationship between the respondents’ income and the (non-)response behavior in respect of another variable, supporting the assumption that the data are MAR. We illustrate Little’s MCAR test, together with the missing data analysis and imputation procedures.
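The χ2-test on such a cross-tabulation can be computed by hand. The counts below are invented for illustration, in the spirit of the Table 5.2 layout (income category against response vs. non-response); they are not the table's actual numbers.

```python
# Chi-square test of independence between income category and (non-)response
table = {
    "low income":    {"response": 80, "nonresponse": 20},
    "medium income": {"response": 70, "nonresponse": 30},
    "high income":   {"response": 50, "nonresponse": 50},
}
rows, cols = list(table), ["response", "nonresponse"]
total = sum(table[r][c] for r in rows for c in cols)
row_tot = {r: sum(table[r][c] for c in cols) for r in rows}
col_tot = {c: sum(table[r][c] for r in rows) for c in cols}

# chi2 = sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total
chi2 = sum(
    (table[r][c] - row_tot[r] * col_tot[c] / total) ** 2
    / (row_tot[r] * col_tot[c] / total)
    for r in rows for c in cols
)
# df = (3 - 1) * (2 - 1) = 2; the 5% critical value is 5.99
print(round(chi2, 2), chi2 > 5.99)
```

With these made-up counts the statistic clearly exceeds the critical value, so non-response is related to income, pointing to MAR rather than MCAR.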
Dealing with Missing Data
Research has suggested a broad range of approaches for dealing with missing data. We discuss the listwise deletion and the multiple imputation method. Listwise deletion uses only those cases with complete responses in respect of all the variables considered in the analysis. If any of the variables used have missing values, the observation is omitted from the computation. If many observations have some missing responses, this decreases the usable sample size substantially and hypotheses are tested with less power.
Multiple imputation is a more complex approach to missing data treatment (Rubin 1987; Carpenter and Kenward 2013). It is a simulation-based statistical technique that facilitates inference by replacing missing observations with a set of possible values (as opposed to a single value) representing the uncertainty about the missing data’s true value (Schafer 1997). The technique involves three steps.
First, the missing values are replaced by a set of plausible values not once, but m times (e.g., five times). This procedure yields m imputed datasets, each of which reflects the uncertainty about the missing data’s correct value (Schafer 1997).
Second, each of the m imputed datasets is analyzed separately by means of standard data analysis methods.
Third and finally, the imputed results from all m datasets (with imputed values) are combined into a single multiple-imputation dataset to produce statistical inferences with valid confidence intervals. This is necessary to reflect the uncertainty related to the missing values. According to the literature, deciding on the number of imputations, m, can be very challenging, especially when the patterns of the missing data are unclear. As a rule of thumb, an m of at least 5 should be sufficient to obtain valid inferences (Rubin 1987; White et al. 2011).
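The three steps can be sketched as follows. This is a deliberately crude toy (invented data, naive normal draws around the observed mean, and only the pooled point estimate from Rubin's rules); real software models the imputation distribution far more carefully, so treat this purely as a picture of the m-datasets workflow.

```python
import random
import statistics

random.seed(42)
observed = [28, 31, 30, 29, 33, 35, 27]  # observed incomes (toy data)
n_missing = 3                            # respondents who did not answer
m = 5                                    # number of imputations

estimates = []
for _ in range(m):
    # Step 1: draw a plausible value for each missing entry
    mu, sd = statistics.mean(observed), statistics.stdev(observed)
    imputed = observed + [random.gauss(mu, sd) for _ in range(n_missing)]
    # Step 2: analyze each completed dataset with the standard method
    estimates.append(statistics.mean(imputed))

# Step 3: pool the m estimates (Rubin's rules; point estimate = their mean)
pooled = statistics.mean(estimates)
print(round(pooled, 2))
```

The spread of the m estimates is what carries the extra uncertainty due to the missing values into the final confidence intervals.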
Now that we have briefly reviewed the most common approaches for handling missing data, there is still one unanswered question: Which one should you use? As shown in Fig. 5.2, if the data are MCAR, listwise deletion is recommended (Graham 2012) when the missingness is less than 10% and multiple imputation when this is greater than 10%. When the data are not MCAR but MAR, listwise deletion yields biased results. You should therefore use the multiple imputation method with an m of 5 (White et al. 2011). Finally, when the data are NRM, the multiple imputation method provides inaccurate results. Consequently, you should choose listwise deletion and acknowledge the limitations arising from the missing data. The following Table summarizes the data cleaning issues on outliers and missing data.
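The decision rule from Fig. 5.2 can be encoded as a small function. This is one possible illustrative encoding of the recommendations above, not code from the text:

```python
def missing_data_strategy(mechanism, pct_missing):
    """Pick a missing-data treatment given the mechanism and % missing."""
    if mechanism == "MCAR":
        if pct_missing < 10:
            return "listwise deletion"
        return "multiple imputation (m >= 5)"
    if mechanism == "MAR":
        # listwise deletion would yield biased results here
        return "multiple imputation (m >= 5)"
    # NRM: imputation is also unreliable; delete and discuss limitations
    return "listwise deletion, acknowledging limitations"

print(missing_data_strategy("MAR", 7))
```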
In the next issue, we will introduce concrete processing routines and datasets, which can be retrieved from the community for reference.