How to use the synthetic control method: an example using the German reunification study of Abadie, Diamond and Hainmueller (2015)
Source: A Python package for causal inference using Synthetic Controls
#Import packages
import pandas as pd
import numpy as np
from SyntheticControlMethods import Synth, DiffSynth
#Import German Reunification data from paper
#Can be found in /datasets folder in repo
data = pd.read_csv("datasets/german_reunification.csv")
#Drop the "code" column; columns= already implies axis=1
data = data.drop(columns="code")
data.head()
  country  year   gdp   infrate     trade  schooling  invest60  invest70  invest80  industry
0     USA  1960  2879       NaN  9.693181  43.799999       NaN       NaN       NaN       NaN
1     USA  1961  2929  1.075182  9.444655        NaN       NaN       NaN       NaN       NaN
2     USA  1962  3103  1.116071  9.429324        NaN       NaN       NaN       NaN       NaN
3     USA  1963  3227  1.214128  9.470706        NaN       NaN       NaN       NaN       NaN
4     USA  1964  3420  1.308615  9.725879        NaN       NaN       NaN       NaN       NaN
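This is the dataset used by Abadie, Diamond and Hainmueller (2015). However, the purpose of this notebook is not to provide an exact replication but to show how to use the package. I will therefore not go into what the variables mean or how they were collected; for that, see the original paper (linked in the title).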
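The outcome must not contain missing values, but missing values in the covariates are fine. Synthetic controls use the entire outcome time series and therefore do not accept missing values in it. Before using Synth, you must handle missing outcome data yourself, for example by imputation or by dropping units. Because the synthetic control method only uses the pre-treatment average of each covariate for each unit, missing covariate values are acceptable. In fact, most methods of imputing missing values preserve the mean and so would not change the pre-treatment averages. For this reason, Synth() leaves missing values untouched. If you want some form of missing-value imputation, apply it to the dataframe before passing it to Synth().
For DiffSynth the situation is different. Because the first difference is only defined for consecutive values, it is very sensitive to missing values (more on this in the DiffSynth() section below). DiffSynth() automatically uses linear interpolation to impute missing values. If you would rather not do this, you can impute the values yourself before passing the dataframe to DiffSynth().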
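The two options mentioned above for missing outcome values can be sketched in pandas. This is only an illustration: the helper names `drop_units_with_missing_outcome` and `interpolate_outcome` and the toy dataframe are hypothetical and not part of the package. Covariates are deliberately left untouched.

```python
import numpy as np
import pandas as pd

def drop_units_with_missing_outcome(df, outcome_var, id_var, time_var):
    """Option 1: remove every unit that has at least one missing outcome value."""
    bad_units = df.loc[df[outcome_var].isna(), id_var].unique()
    kept = df[~df[id_var].isin(bad_units)]
    # Order rows by unit, then time
    return kept.sort_values([id_var, time_var]).reset_index(drop=True)

def interpolate_outcome(df, outcome_var, id_var, time_var):
    """Option 2: linearly interpolate the outcome within each unit.

    Interpolation is by row position, so it assumes equally spaced periods.
    """
    out = df.sort_values([id_var, time_var]).reset_index(drop=True)
    out[outcome_var] = out.groupby(id_var)[outcome_var].transform(
        lambda s: s.interpolate()
    )
    return out

# Toy data (made up): unit "B" has a missing outcome, unit "A" a missing covariate
toy = pd.DataFrame({
    "country": ["B", "B", "B", "A", "A", "A", "C", "C", "C"],
    "year":    [1960, 1961, 1962] * 3,
    "gdp":     [2.0, np.nan, 4.0, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0],
    "trade":   [9.0, 9.1, 9.2, 8.0, np.nan, 8.4, 7.0, 7.2, 7.4],
})

dropped = drop_units_with_missing_outcome(toy, "gdp", "country", "year")
filled = interpolate_outcome(toy, "gdp", "country", "year")
```

Either result has a complete outcome series and can then be passed on; which option is appropriate depends on how much data is missing and why.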
data: Type: Pandas dataframe. A pandas dataframe containing the dataset. Each row should contain one observation for a unit at a time, including the outcome and covariates. Dataset should be ordered by unit then time.
outcome_var: Type: str. Name of outcome column in data, e.g. "gdp"
id_var: Type: str. Name of unit indicator column in data, e.g. "country"
time_var: Type: str. Name of time column in data, e.g. "year"
treatment_period: Type: int. Time of first observation after the treatment took place, i.e. first observation affected by the treatment effect. E.g. 1990 for German reunification.
treated_unit: Type: str. Name of the unit that received treatment, i.e. the rows where data[id_var] == treated_unit.
n_optim: Type: int. Default: 10. Number of different initialization values for which the optimization is run. A higher number means longer runtime, but a higher chance of a globally optimal solution.
pen: Type: float. Default: 0. Penalization coefficient which determines the relative importance of minimizing the sum of the pairwise differences between each individual control unit in the synthetic control and the treated unit, vis-a-vis the difference between the synthetic control and the treated unit. A higher number means the pairwise differences matter more. When pen=0, as is the default, the pairwise differences are completely ignored. This means that unless otherwise specified, Synth() generates a classic synthetic control, like the ones in Abadie et al. (2015). If pen="auto", the penalization term will be optimized over, along with V and W, using the pre-treatment data.
exclude_columns: Type: list. Default: []. List of column names to be excluded from the Synthetic Control. This is practically equivalent to dropping the columns included in the dataframe before running Synth or DiffSynth. That means in the below examples, I could have used exclude_columns=["code"] instead of dropping the column when I loaded the dataset.
random_seed: Type: int. Default: 0. The random seed is used to create a numpy.random.default_rng(random_seed) object, which is subsequently used in all random processes in the code. Random samples are used to initialize the covariate importance matrix, V, and "pen" (if pen="auto") for optimization.
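To make the role of pen concrete, the sketch below evaluates, for a fixed weight vector, the two quantities that the pen description trades off. It assumes squared Euclidean distances and ignores the importance matrix V; the exact loss used inside Synth() may differ, and the function name `penalized_loss` and the toy numbers are hypothetical.

```python
import numpy as np

def penalized_loss(x_treated, x_controls, w, pen=0.0):
    """Illustrative loss (assumed form, not the package's code).

    x_treated:  (k,) covariates of the treated unit
    x_controls: (k, j) covariates of the j control units, one per column
    w:          (j,) non-negative weights summing to one
    """
    synthetic = x_controls @ w
    # Difference between the synthetic control and the treated unit
    aggregate = np.sum((x_treated - synthetic) ** 2)
    # Difference between each individual control unit and the treated unit
    pairwise = np.sum((x_controls - x_treated[:, None]) ** 2, axis=0)
    return float(aggregate + pen * (w @ pairwise))

# Toy numbers (made up): two covariates, three control units
x1 = np.array([1.0, 2.0])
x0 = np.array([[0.0, 2.0, 1.0],
               [2.0, 2.0, 2.5]])
w_far = np.array([0.5, 0.5, 0.0])   # averages two units on either side of the treated unit
w_near = np.array([0.0, 0.0, 1.0])  # uses only the single closest unit

for pen in (0.0, 1.0):
    print(pen, penalized_loss(x1, x0, w_far, pen), penalized_loss(x1, x0, w_near, pen))
```

With pen=0 the averaging weights win because only the aggregate fit counts; with pen=1 the weights on the closest unit give the lower loss, which is the sense in which a higher pen makes pairwise differences matter more.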
Synth objects have 3 methods:
Synth.plot(...) contains all the plotting functionality
Synth.in_time_placebo(...) runs in-time placebo tests
Synth.in_space_placebo(...) runs in-space placebo tests
Synth objects have 1 attribute:
original_data: An object that stores the variables and results of Synth(). The most important are weight_df and comparison_df, which are tables with summary information (see below). It also contains, for example, variables derived in the data-processing step. You can see all of them by calling Synth.original_data.__dict__
We will be using each of these in the following examples.
#Fit synthetic control
sc = Synth(data, "gdp", "country", "year", 1990, "West Germany", n_optim=100)
sc.plot(["original", "pointwise", "cumulative"], treated_label="West Germany",
        synth_label="Synthetic West Germany", treatment_label="German Reunification")
self.original_data.treated_unit: Unscaled, average covariate values of the treated unit. If method == DSC, the differenced data is displayed instead.
Synthetic Control: Unscaled covariate values of the synthetic control unit. If method == DSC, the differenced data is displayed instead.
WMAPE: Weighted Mean Absolute Pairwise Error. For each covariate, how different each control unit inside the synthetic control is from the treated unit, weighted by the weight assigned to each unit. This does not change even if method == DSC, as bias scales with the value of the difference and not with the change.
Importance: Leading diagonal of V matrix. How important, relative to other covariates, is matching on each covariate in the optimization process? Note that this is computed after rescaling each covariate to be unit variance, whereas the other columns show the unscaled covariate values.
Control Group Average: Simple average of all the units in the control group. Not strictly necessary for anything, but it is often interesting to see how well the synthetic control is doing.
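The WMAPE definition above can be written out in a few lines. This follows the verbal definition only, assuming absolute differences averaged with the synthetic control weights; the function name `wmape` and the numbers are hypothetical, and the package's own computation may differ in detail.

```python
import numpy as np

def wmape(x_treated, x_controls, w):
    """Weighted mean absolute pairwise error, one value per covariate.

    x_treated:  (k,) covariate values of the treated unit
    x_controls: (k, j) covariate values of the j control units, one per column
    w:          (j,) weight assigned to each control unit
    """
    # |control - treated| for every covariate and unit, then weight across units
    return np.abs(x_controls - x_treated[:, None]) @ w

# Toy numbers (made up): two covariates, three control units
x_treated = np.array([10.0, 20.0])
x_controls = np.array([[8.0, 12.0, 30.0],
                       [20.0, 22.0, 0.0]])
w = np.array([0.5, 0.5, 0.0])

synthetic = x_controls @ w            # covariates of the synthetic control
errors = wmape(x_treated, x_controls, w)
```

In this toy case the synthetic control matches the treated unit exactly on the first covariate, yet its WMAPE is not zero, because the two weighted units sit on either side of the treated value. That is the information the WMAPE column adds to the first two columns.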
A good synthetic control will reconstruct (a) the outcome of the treated unit and (b) the covariates of the treated unit over the pre-treatment period. The plots and the RMSPE help us evaluate (a); this table is meant to evaluate (b). If the synthetic control has a good fit, the following things should be true:
Each row of the first two columns should be approximately equal. This means the synthetic control has reconstructed the treated unit's values. In this case, whilst the balance is not perfect, the values are quite similar on all covariates. On its own, in my opinion, this is not enough to convince us that we have found a strong synthetic control. However, when considered in combination with the validity tests in the subsequent section, we can be confident we have found a strong synthetic control for West Germany. This holds in general: none of these checks is sufficient on its own to show the reliability of a synthetic control; they must be evaluated together. If all point to the conclusion that the synthetic control is good, we can be confident.
The third column should be small relative to the values in columns 1 and 2. The closer to zero, the more similar the individual control units inside the synthetic control are to the treated unit. The smaller the WMAPE, the lower the potential bias, all else equal. In this case, all of the values are small, even an order of magnitude smaller than those in the first two columns, indicating a lower risk of bias.
There is no fixed way to interpret the importance column. Instead, it should be evaluated using domain knowledge. Is the relative importance assigned to each covariate reasonable given the context?
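To perform this in-space placebo study, we use the in_space_placebo() method. The results are best displayed using the "rmspe ratio" plot, which shows the distribution of post-treatment RMSPE / pre-treatment RMSPE for the real treated unit and each placebo treated unit. The logic is that, in the presence of a large treatment effect, the post-treatment difference between a unit and its synthetic counterpart will be large relative to the pre-treatment difference.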
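The ratio behind this plot can be sketched as follows. The helper names `rmspe` and `rmspe_ratio` and the numbers are hypothetical; in_space_placebo() computes these quantities itself, so the sketch only shows the arithmetic.

```python
import numpy as np

def rmspe(actual, synthetic):
    """Root mean squared prediction error between a unit and its synthetic control."""
    actual = np.asarray(actual, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    return float(np.sqrt(np.mean((actual - synthetic) ** 2)))

def rmspe_ratio(actual, synthetic, n_pre):
    """Post-treatment RMSPE divided by pre-treatment RMSPE.

    n_pre is the number of pre-treatment periods at the start of the series.
    A perfect pre-treatment fit (RMSPE of zero) would need separate handling.
    """
    pre = rmspe(actual[:n_pre], synthetic[:n_pre])
    post = rmspe(actual[n_pre:], synthetic[n_pre:])
    return post / pre

# Toy series (made up): two pre-treatment and two post-treatment periods
outcome = [1.0, 2.0, 3.0, 4.0]
synth_treated = [1.1, 1.9, 5.0, 6.0]   # close fit before, large gap after
synth_placebo = [1.1, 1.9, 3.1, 3.9]   # close fit before and after

ratio_treated = rmspe_ratio(outcome, synth_treated, n_pre=2)
ratio_placebo = rmspe_ratio(outcome, synth_placebo, n_pre=2)
```

A unit with a real treatment effect gets a ratio far above one, while a placebo unit with no effect stays near one, which is why the treated unit should sit in the tail of the distribution.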