因果推断的圣杯:合成控制法及Python应用
如何使用综合控制方法:一个使用Abadie, Diamond和Hainmueller(2015)的德国统一研究的例子
来演:A Python package for causal inference using Synthetic Controls
https://github.com/OscarEngelbrektson/SyntheticControlMethods
本文提供一个工作示例,说明如何使用包,以及如何解释不同的情节。因此,这既是一个示例,也是一个文档。
在本例中,我们使用了Abadie、Diamond和Hainmueller(2015)的数据,该数据使用综合控制方法估计了1990年德国统一对西德的经济影响。但是,请注意,这并不意味着要进行复制。
分为四个部分:
使用任何类型的合成控件的一般数据要求
使用普通合成控制,Synth()
使用惩罚合成控件,Synth(pen="auto")
使用不同的合成控制,DiffSynth(pen=0)和DiffSynth(pen="auto")
1、导入Python库
#Import packages
import pandas as pd
import numpy as np
from SyntheticControlMethods import Synth, DiffSynth
2、数据需求
#Import German Reunification data from paper
#Can be found in /datasets folder in repo
data = pd.read_csv("datasets/german_reunification.csv")
data = data.drop(columns="code", axis=1)
data.head()
结果为:
country year gdp infrate trade schooling invest60 invest70 invest80 industry
0 USA 1960 2879 NaN 9.693181 43.799999 NaN NaN NaN NaN
1 USA 1961 2929 1.075182 9.444655 NaN NaN NaN NaN NaN
2 USA 1962 3103 1.116071 9.429324 NaN NaN NaN NaN NaN
3 USA 1963 3227 1.214128 9.470706 NaN NaN NaN NaN NaN
4 USA 1964 3420 1.308615 9.725879 NaN NaN NaN NaN NaN
这是Abadie, Diamond和Hainmueller(2015)使用的数据集。但是,本笔记本的目的不是提供精确的复制,而是展示如何使用该包。因此,我将不会涉及变量的含义,以及它们是如何收集的等等,请参阅原始论文(链接在标题中)。
3、格式要求:
1、数值变量。只有单位标识符,在这种情况下,“国家”可以是分类的。
2、数据集必须先按单位排序,然后再按时间排序。也就是说,任何单位的所有观察值都应该连续出现在数据集中。在上面的例子中,美国
3、只有一个单元标识符,最好是带有单元名称的字符串形式。在上面的例子中,代码和国家列都是单元标识符。要使用包,必须排除一个包。这可以通过从数据帧中删除列或在调用Synth()或DiffSynth()时使用exclude_columns参数来实现。表和图将使用单元标识符作为标签,所以最好使用描述性标识符。例如,在这种情况下,最好使用“country”而不是“code”,因为“USA”比“1”更容易解释。
4、数据集必须包含一个处理单元和多个控制单元。该方案试图在治疗前阶段,就协变量和结果而言,找到最接近治疗单元的控制单元的加权平均。当然,只有在有多个控制单元的情况下,才能找到一个好的加权平均。
5、结果没有丢失值,但协变量没问题。合成控件使用整个结果时间序列,因此不接受丢失的值。因此,在使用Synth之前,您必须自己处理丢失的结果数据,例如通过imputation或drop units。因为综合控制方法只对每个单元的协变量的预处理平均值起作用,所以协变量的值丢失是可以的。事实上,大多数估算缺失值的方法保持平均值,因此不会影响预处理平均值。因此,在Synth()中不会触及缺失的值。如果您希望进行某种类型的缺失值赋值,则应该在将数据帧提供给Synth()之前对其进行此操作。对于DiffSynth来说,情况就不同了。因为第一个差异只定义为连续的值,所以它对缺少值非常敏感(在下面的DiffSynth()部分对此有更多介绍)。DiffSynth()自动使用线性插值来输入缺失值。如果您不喜欢这样做,可以在向DiffSynth()提供数据帧之前自己输入值。
注:变量定义可在本笔记本的底部或Abadie等人(2015)的附录a中找到。
4、使用Synth,普通合成控制法
合成控制适合使用Synth()类,它接受以下输入:
Arguments
data: Type: Pandas dataframe. A pandas dataframe containing the dataset. Each row should contain one observation for a unit at a time, including the outcome and covariates. Dataset should be ordered by unit then time.
outcome_var: Type: str. Name of outcome column in data, e.g. "gdp"
id_var: Type: str. Name of unit indicator column in data, e.g. "country"
time_var: Type: str. Name of time column in data, e.g. "year" treatment_period: Type: int. Time of first observation after the treatment took place, i.e. first observation affected by the treatment effect. E.g. 1990 for german reunification.
treated_unit: Type: str. Name of the unit that recieved treatment, data["id_var"] == treated_unit.
n_optim: Type: int. Default: 10. Number of different initialization values for which the optimization is run. Higher number means longer runtime, but a higher change of a globally optimal solution.
pen: Type: float. Default: 0. Penalization coefficient which determines the relative importance of minimizing the sum of the pairwise difference of each individual control unit in the synthetic control and the treated unit, vis-a-vis the difference between the synthetic control and the treated unit. Higher number means pairwise difference matters more. When pen=0, as is the default, the pairwise differences are completely ignored. This means that unless otherwise specificed, Synth() is generating a classic synthetic control, like the ones in Abadie et al. (2015). If pen="auto", the penalization term will be optimized over, along with V and W, using the pre-treatment data.
exclude_columns: Type: list. Default: []. List of column names to be excluded from the Synthetic Control. This is practically equivalent to dropping the columns included in the dataframe before running Synth or DiffSynth. That means in the below examples, I could have used exclude_columns=["code"] instead of dropping the column when I loaded the dataset.
random_seed: type: int. Default: 0. Random seed is used to create a numpy.random.default_rng(random_seed) object which is subsequently used in all random processes in the code. Random samples are used to initialize covariate importance matrix, V, and "pen" (if pen="auto") for optimization.
Methods:
Synth objects have 3 methods:
Synth.plot(...) contains all the plotting functionalities Synth.in_time_placebo(...) runs in-time placebo tests Synth.in_space_placebo(...) runs in-space placebo tests Attributes:
Synth objects have a 1 attribute:
original_data: Original data is an object that stores variables and results Synth(). Important are weight_df and comparison_df, which are tables with summary information (see below). It also contains e.g. variables derived in the data-processing step. You can see all of them by calling Synth.original_data. dict
We will be using each of these in the following examples.
5、拟合合成控制并解释结果
#Fit synthetic control
sc = Synth(data, "gdp", "country", "year", 1990, "West Germany", n_optim=100)
#Visualize
sc.plot(["original", "pointwise", "cumulative"], treated_label="West Germany",
synth_label="Synthetic West Germany", treatment_label="German Reunification")
结果为:
该图包含三个面板。第一个面板,“原始”,显示数据和一个反事实的预测后治疗时期。第二个面板,“逐点”,显示了观测数据和反事实预测之间的差异。这是由模型估计的点态因果效应。第三组,“累积”,将第二组的逐点贡献相加,得出干预的累积效应。
weight_df显示在合成控件中分配给每个控件的权重。在上面的例子中,美国占了合成控制的约39%。权值被限制为非负且总和为1。
所有权重为0的单位都被排除在表之外,但是可以使用Synth.original_data.w检索完整的权重矩阵
这个数据帧显示了综合控制结果的均方根预测误差,与原始处理单元相比,在本例中是西德。较低的pre_rmspe意味着合成对照在预处理阶段与处理单元很好地匹配,如果我们要相信它能很好地估计处理单元在未接受处理的情况下的结果,这一点是很重要的。
评估协变量的平衡
列显示:
self.original_data.treated_unit: Unscaled, average covariate values of the treated unit If method == DSC, then the differenced data is displayed instead
Synthetic Control: Unscaled, covariate values of the synthetic control unit If method == DSC, then the differenced data is displayed instead
WMAPE: Weighted Mean Absolute Pairwise Error. For each covariate, how different is each control unit inside the synthetic control from the treated unit, weighted by the weight assigned to each unit. This does not change even if method == DSC, as bias scales with value of difference and not change
Importance: Leading diagonal of V matrix. How important, relative to other covariates, is matching on each covariate in the optimization process? Note that this is computed after rescaling each covariate to be unit variance, whereas the other columns show the unscaled covariate values.
Control Group Average: Simple average of all the units in the control group. Not strictly necessary for anything, but it is often interesting to see how well the synthetic control is doing.
如何解释这个表:
A good synthetic control will reconstruct (a) the outcome of the treated unit and (b) the covariates of the treated unit over the pre-treatment period. The plots and the RMSPE help us evaluate (a), this table is meant to evaluate (b). If the synthetic control has good fit, the following things should be true:
Each row of the first two columns should be approximately equal. This means the synthetic control has reconstructed the treated unit values. In this case, whilst the balance is not perfect, the values are quite similar on all covariates. On it's own, in my opinion, this is not enough to convince us that we have found a strong synthetitic control. However, when considered in combination with the validity tests in the subsequent section, we can be confident we have found a strong synthetic control for West Germany. This is generally true, none of these checks are suffici'ent to show the reliability of a synthetic control, but must be evaluated together–if all point to the conclusion that the synthetic control is good, we can be confident.
The third column should be small, relative to the values in columns 1 and 2. The closer to zero, the more similar the individual control units inside the syntetic control are to the treated unit. The smaller the WMAPE, the lower the potential bias, all else equal. In this case, all of the variables are small, even an order of magnitude smaller, than those in the two first columns–indicating lower risk for bias.
There is no fixed way to interpret the importance column. Instead, it should be evaluated using domain knowledge. Is the relative importance assigned to each covariate reasonable given the context?
SC效度测试
“为了评估我们结果的可信度,我们进行了安慰剂研究,在数据中,兴趣治疗被重新分配到1990年以外的一年,或不同于西德的国家。”(Abadie等人,2015)
时间安慰剂
“我们首先比较了上述估计的西德统一效应和安慰剂效应,在我们的数据中,将德国统一重新分配到统一真正发生之前的一段时间。一个巨大的安慰剂估计将破坏我们的信心,即图2中的结果确实表明了统一的经济成本,而不仅仅是由于缺乏预测能力。”(Abadie等人,2015)
具体来说,我们称之为in_time_placebo(1982),它将1982年作为治疗后的第一个观察结果重新运行了模型,大约是在1990年真正统一之前的8年。
观察“即时安慰剂”图,我们看到合成对照和西德的结果直到1990年真正的治疗才出现分歧,直到那时才接近。这增加了我们对合成控制所提供的反事实结果的信心。
空间安慰剂
进行安慰剂研究的另一种方法是将数据中的治疗重新分配到一个比较单元。这样,我们就可以获得没有发生相关事件的国家的综合控制估计。将这一想法应用于捐助者库中的每个国家,使我们能够将德国统一对西德的估计影响与对其他国家获得的安慰剂效应的分布情况进行比较。如果对西德的估计效应相对于安慰剂效应的分布异常大,我们将认为德国统一对西德的影响是显著的。
为了执行这个空间安慰剂研究,我们使用了in_space_placebo()方法。使用“rmspe比率”曲线图可以更好地显示结果,该曲线图显示了真实治疗单元和每个安慰剂治疗单元的治疗后期rmspe /治疗前期rmspe的分布。其逻辑是,在存在较大的处理效应时,相对于预处理差异,一个单元和它的合成对应物之间的后处理差异将会很大
使用惩罚合成控制进行复制
你可以把普通的综合控制方法想象成,在混合控制单元后,试图找到综合控制单元与处理单元最大程度相似的权重。在你将控制单元混合成一个合成单元后,以及在混合之前,尝试着去寻找能够让控制单元最大程度地相似的权重。也就是说,在混合之前,你更希望合成控制中的单位与处理过的单位相似。如果结果和协变量之间的关系是非线性的,那么综合控制是有偏差的,并且在混合之前,偏差会随着差异而增长(在README.md中有更多关于这方面的内容)。
如果设置pen=“auto”,则综合控制会找到一个惩罚系数,该系数在混合前和混合后的拟合之间达到平衡,这样得到的综合控制就能够最好地预测处理单元的结果。除此之外,您与Synth(pen="auto")的交互方式与与Synth(pen=0)的交互方式完全相同,只是结果可能不同。在西德,它们看起来一模一样。