What is the difference between a confounder and a collider? Confounding vs. colliding
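Today, Econometrics Circle introduces the "confounder" and the "collider". Put simply, a confounder is a variable that affects both the policy treatment variable and the outcome variable; in English it is also called a common factor (common cause). Examples are easy to find. Suppose we ask whether taking part in a job training program raises a person's probability of finding a job. Possible confounders include, but are not limited to, age and gender, because a variable like age affects both whether a person joins the training program and whether that person finds a job. To identify the policy effect, we need to separate out the confounding factors as far as we can, so in empirical work we control for the confounders. Concretely, we add them to the regression equation as control variables. What if some confounders are unobservable? How can we reduce their influence on the estimated policy effect? Try to find a proxy variable, for example using education level as a proxy for individual ability.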
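The point about controlling for a confounder can be illustrated with a small simulation. This is only a sketch under assumed, made-up coefficients: `age` is a standardized confounder, the outcome `job` is treated as continuous rather than a probability for simplicity, and `TRUE_EFFECT` and the helper `ols` are hypothetical names introduced here.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients [intercept, b1, b2, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 100_000
TRUE_EFFECT = 0.5                              # assumed effect of training on the outcome

age = rng.normal(size=n)                       # confounder (hypothetical, standardized)
training = 0.8 * age + rng.normal(size=n)      # the confounder drives the treatment
job = TRUE_EFFECT * training + 1.0 * age + rng.normal(size=n)  # and the outcome

naive = ols(job, training)[1]           # confounder omitted
adjusted = ols(job, training, age)[1]   # confounder included as a control
print(f"naive: {naive:.3f}  adjusted: {adjusted:.3f}  truth: {TRUE_EFFECT}")
```

Under these assumed coefficients the naive estimate should land near 0.99, almost double the true 0.5, while the adjusted estimate should be close to 0.5.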
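Next, Econometrics Circle turns to the collider. It is the opposite of a confounder, but it matters just as much for estimating the policy effect. "Collider" refers to a collision: both the policy treatment variable and the outcome variable affect this variable. In the example above, income may be a collider, because both taking part in job training and finding a job affect a person's income. If we mistakenly add income as a control variable in the employment regression, the estimated policy effect will be wrong (collider bias). Sometimes this even makes two unrelated variables look correlated, or turns a negative (positive) correlation into a positive (negative) one. The best-known case is Berkson's paradox. In other words, whenever a variable is a collider, we should not put it in the regression equation.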
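A companion sketch shows what happens when a collider is added as a control. Again the data are simulated with made-up coefficients, `job` is treated as a continuous outcome for simplicity, and `ols` and `TRUE_EFFECT` are hypothetical names, not anything from a specific library.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients [intercept, b1, b2, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 100_000
TRUE_EFFECT = 0.5                                   # assumed effect of training

training = rng.normal(size=n)                       # treatment, no confounding here
job = TRUE_EFFECT * training + rng.normal(size=n)   # outcome
income = training + job + rng.normal(size=n)        # collider: caused by both

without_collider = ols(job, training)[1]            # correct specification
with_collider = ols(job, training, income)[1]       # collider wrongly controlled
print(f"without: {without_collider:.3f}  with collider: {with_collider:.3f}")
```

With these assumed coefficients the correct regression should recover roughly 0.5, while adding `income` should push the training coefficient to about -0.25, so the estimated effect changes sign.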
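Berkson's paradox refers to two variables that are actually unrelated showing what looks like a strong correlation. Here is an example. Suppose a school admits students who are either strong academically or strong in sports. Every applicant sits two exams: academics (Chinese, math, and a foreign language) and sports (running, jumping, and throwing). The school admits only applicants who score above 90 on at least one exam. So every admitted student scored above 90 in academics, or above 90 in sports, or above 90 in both. If we now look at the score distribution of the admitted students, we find that academic scores and sports scores are negatively correlated. The students with the best sports scores (say 100) have an average academic score of about 50 (assuming academic scores are normally distributed around that mean), whereas the students with the worst sports scores (say 10) have an average academic score of about 95, because they could only have been admitted with an academic score above 90. An analyst might therefore conclude that the better a student is at sports, the worse the academic score, and vice versa. That conclusion is clearly wrong.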
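The admissions example can be reproduced in a few lines. The sketch below assumes independent, roughly normal scores with a mean of 60 and a standard deviation of 20 (numbers chosen for illustration, not taken from the text) and applies the "above 90 on either exam" rule.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Independent exam scores for all applicants (assumed distribution), clipped to 0..100.
academic = np.clip(rng.normal(60, 20, n), 0, 100)
sports = np.clip(rng.normal(60, 20, n), 0, 100)

# Admission rule from the example: above 90 on at least one exam.
admitted = (academic > 90) | (sports > 90)

corr_all = np.corrcoef(academic, sports)[0, 1]
corr_admitted = np.corrcoef(academic[admitted], sports[admitted])[0, 1]
print(f"all applicants: {corr_all:.3f}  admitted only: {corr_admitted:.3f}")
```

Among all applicants the correlation should be close to zero, while among admitted students it should be clearly negative, even though the two scores were generated independently. Admission plays the role of the collider here.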
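Below is a concise and useful introduction in English that uses DAGs to tell confounders and colliders apart. Reading a causal diagram is a handy newer skill for causal inference.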
Background
When an exposure and an outcome independently cause a third variable, that variable is termed a ‘collider’. Inappropriately controlling for a collider variable, by study design or statistical analysis, results in collider bias. Controlling for a collider can induce a distorted association between the exposure and outcome, when in fact none exists. This bias predominantly occurs in observational studies. Because collider bias can be induced by sampling, selection bias can sometimes be considered to be a form of collider bias. The diagram below contrasts bias through confounding and collider bias.
Example
A clear example of collider bias was provided by Sackett in his 1979 paper. He analysed data from 257 hospitalized individuals and detected an association between locomotor disease and respiratory disease (odds ratio 4.06). The association seemed plausible at the time – locomotor disease could lead to inactivity, which could cause respiratory disease. But Sackett repeated the analysis in a sample of 2783 individuals from the general population and found no association (odds ratio 1.06). The original analysis of hospitalized individuals was biased because both diseases caused individuals to be hospitalized. By looking only within the stratum of hospitalized individuals, Sackett had observed a distorted association. In contrast, in the general population (including a mix of hospitalized and non-hospitalized individuals) locomotor disease and respiratory disease are not associated. In 1979, Sackett termed this phenomenon “admission rate bias”. With the help of causal diagrams (also known as directed acyclic graphs [DAGs]), this phenomenon can be explained by collider bias (Figure 1).
In this example, locomotor disease and respiratory disease are independent causes of hospitalization – the collider (since the two arrowheads collide into hospitalization). If the collider is controlled for by study design (selection bias), a distorted association will arise between locomotor and respiratory disease. This is what we see in Sackett’s 1979 example. Hypothetically, if he had statistically controlled for hospitalization in the general population dataset, he would have induced collider bias again, not through selection, but statistical error.
Figure 1. A causal diagram demonstrating collider bias. Controlling for hospitalization induces a distorted association between locomotor disease and respiratory disease.
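The selection mechanism in Figure 1 can be sketched with simulated data. The prevalences and hospitalization probabilities below are invented for illustration and are not Sackett's data; `odds_ratio` is a hypothetical helper. Note that the direction of the distortion depends on how the two diseases combine to cause admission: with the additive risks assumed here the biased odds ratio falls below 1, whereas Sackett observed one above 1.

```python
import numpy as np

def odds_ratio(a, b):
    """Odds ratio of the 2x2 table formed by two boolean arrays."""
    n11 = float(np.sum(a & b))
    n10 = float(np.sum(a & ~b))
    n01 = float(np.sum(~a & b))
    n00 = float(np.sum(~a & ~b))
    return (n11 * n00) / (n10 * n01)

rng = np.random.default_rng(3)
n = 500_000

# Two independent diseases in the general population (assumed 10% prevalence each).
locomotor = rng.random(n) < 0.10
respiratory = rng.random(n) < 0.10

# Each disease independently raises the chance of hospitalization (the collider).
p_hosp = 0.02 + 0.20 * locomotor + 0.20 * respiratory
hospitalized = rng.random(n) < p_hosp

or_population = odds_ratio(locomotor, respiratory)
or_hospitalized = odds_ratio(locomotor[hospitalized], respiratory[hospitalized])
print(f"population OR: {or_population:.2f}  hospitalized OR: {or_hospitalized:.2f}")
```

The population odds ratio should sit near 1, while restricting to hospitalized individuals should move it far from 1, which is the distortion that conditioning on a collider produces.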
A more recent example of collider bias can be seen in the ‘obesity paradox’ (Figure 2). This paradox describes an apparent preventive effect of obesity on mortality in individuals with chronic conditions such as cardiovascular disease (CVD). In fact, obesity increases mortality rates in the general population. The collider bias occurs when an investigator conditions on CVD (by design or analysis), resulting in a distorted association between obesity and other, unmeasured factors. This distorted association is what distorts the effect of obesity on mortality. Consequently, in a sample that includes only patients with CVD, obesity falsely appears to protect against mortality, whereas in the wider population (with and without CVD), obesity increases the risk of early death. There is some debate about whether collider bias completely explains the obesity paradox.
Figure 2. A causal diagram demonstrating how the obesity paradox can be explained by collider bias.
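The structure in Figure 2 can also be simulated. This is a sketch with invented probabilities, not NHANES III data: `u` stands for the unmeasured factors, `risk_ratio` is a hypothetical helper, and all effect sizes are assumptions chosen so that the reversal is visible.

```python
import numpy as np

def risk_ratio(exposed, event):
    """Risk of the event among the exposed divided by risk among the unexposed."""
    return event[exposed].mean() / event[~exposed].mean()

rng = np.random.default_rng(4)
n = 500_000

obese = rng.random(n) < 0.30
u = rng.random(n) < 0.10            # unmeasured factor (assumed), independent of obesity

# CVD is the collider: caused by obesity and by the unmeasured factor.
cvd = rng.random(n) < 0.02 + 0.10 * obese + 0.40 * u

# Mortality: obesity is harmful, the unmeasured factor much more so, CVD adds risk.
death = rng.random(n) < 0.05 + 0.02 * obese + 0.40 * u + 0.10 * cvd

rr_all = risk_ratio(obese, death)             # whole population
rr_cvd = risk_ratio(obese[cvd], death[cvd])   # conditioning on the collider
print(f"whole population RR: {rr_all:.2f}  CVD patients only RR: {rr_cvd:.2f}")
```

Under these assumptions the whole-population risk ratio should exceed 1 (obesity is harmful), while the ratio among CVD patients should fall below 1, because non-obese patients with CVD are more likely to carry the unmeasured high-risk factor.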
Impact
Collider bias can have major effects. In Sackett’s example, collider bias inflated a null effect (unbiased odds ratio 1.06) to a positive effect (biased odds ratio 4.06). In the obesity paradox example, collider bias switched an unbiased harmful effect of obesity on mortality into a biased protective effect. This was shown in an analysis of the third US National Health and Nutrition Examination Survey (NHANES III). In the unbiased analysis, the mortality risk ratio for the entire cohort was 1.24 [95% CI = 1.11, 1.39] (harmful). In the biased analysis, the stratum-specific mortality risk ratio was 0.79 [95% CI = 0.68, 0.91] (protective) in patients with CVD.
The impact of collider bias – published examples
Preventive steps
Collider bias can be prevented by carefully applying appropriate inclusion criteria – making sure that the exposure and outcome of interest do not drive inclusion or selective retention in a study.
Causal diagrams (DAGs) can help identify colliders and non-colliders (or confounders). By using these techniques in the design and analysis of observational studies, researchers can identify colliders that should be left uncontrolled and confounders that should be controlled.
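As a toy illustration of reading roles off a causal diagram, the sketch below classifies a third variable using direct arrows only. The function `classify` and the edge list are hypothetical and deliberately simplified; a real analysis would check whole paths (for example with the back-door criterion), not just direct edges.

```python
def classify(edges, exposure, outcome, var):
    """Classify `var` relative to exposure and outcome using direct edges only."""
    e = set(edges)
    if (var, exposure) in e and (var, outcome) in e:
        return "confounder"   # var -> exposure and var -> outcome: control for it
    if (exposure, var) in e and (outcome, var) in e:
        return "collider"     # exposure -> var <- outcome: leave it uncontrolled
    if (exposure, var) in e and (var, outcome) in e:
        return "mediator"     # exposure -> var -> outcome
    return "other"

# Job training example, each pair is (cause, effect).
edges = [
    ("age", "training"), ("age", "job"),
    ("training", "job"),
    ("training", "income"), ("job", "income"),
]

roles = {v: classify(edges, "training", "job", v) for v in ("age", "income")}
print(roles)
```

Here `age` comes out as a confounder and `income` as a collider, matching the job training discussion earlier in the article.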