哈佛大学新修订完成的因果推断经典大作免费下载！附数据和code！

因果推断研究小组计量经济圈 2021-10-23

凡是搞计量经济的，都关注这个号了

邮箱：econometrics666@126.com

所有计量经济圈方法论丛的code程序, 宏微观数据库和各种软件都放在社群里.欢迎到计量经济圈社群交流访问.

photo courtesy: Ivanka Trump

前些日，咱们引荐了“CSMAR所有的数据产品均可免费下载！”，受到金融财务管理领域学者的欢迎。金融领域三大中文数据库, CSMAR, CCER, Wind和CNRDS，其中CSMAR数据库于2月29日就会停止免费服务，因此要使用这一数据库的学者得加快进度了。2月19日，咱们又引荐了“疫情期EPS数据库向全社会免费开放！附细致使用指南！”，受到海内外经管学者的一致好评，其工作人员发送了“EPS最新版本使用手册”。2月20日，给各位学者引荐了三门计量课程，系统讲解了最新因果推断，时间序列，面板数据等及在Stata中的实现过程（详见，疫情期计量课程免费开放！面板数据, 因果推断, 时间序列分析与Stata应用）。2月21日，给各位学者引荐了二个数据库的使用指南疫情期Wind资讯金融终端操作指南和CEIC数据库操作指南，参考一下“清华北大经管社科数据库有哪些? 不要羡慕嫉妒恨！”。2月22日，引荐了“估计具有两个高维固定效应的泊松回归模型”，里面包括面板泊松回归、面板负二项回归、控制函数法CF、受限三次样条等等。

上一日，因果推断研究小组引荐了最清晰的内生性问题详解及软件操作方案！实证研究必备工具！，受到学者高度肯定。在一个凡事都追求“因果关系”的现代社会，理解并掌握因果推断的各种计量方法显得异常重要，甚至会成为你人生中最为重要的转折点。基于此，咱们再推荐一下这门免费公开课：疫情期计量课程免费开放！面板数据, 因果推断, 时间序列分析与Stata应用。

今天的主题，是引荐一本由哈佛大学教授于2020年02月21日新修订完成的因果推断经典大作“Causal Inference: What If”。除了这本之外，还有图灵奖得主Judea Pearl新书《The book of Why: the new science of cause and effect》也值得一读。

《因果推段》是一本书公认的自命不凡的书名。因果推段是一项复杂的科学任务，它需要研究者从多个来源挖掘有效证据，并依赖于各种方法论的恰当应用。任何一本书都不可能全面地描述跨学科因果推断的方法论。任何因果推断书的作者都必须选择他们想要强调的因果推断方法论方面。

本导言的标题反映了我们自己的选择：这本书帮助科学家——特别是健康和社会科学家——生成和分析数据，从而对因果问题和数据分析的假设做出明确的因果推论。不幸的是，科学文献中充斥着一些研究，其中没有明确说明因果问题，也没有列出研究人员无法证实的假设。这种对因果推断的随意态度，产生了大量结论混乱的研究文章。例如，由于数据分析方法不能在研究者的假设下恰当地回答因果问题，因此发现效应估计难以解释的研究并不少见。

在这本书中，我们强调需要认真对待因果问题，因此足以清楚地表达它，并描述数据和假设在因果推断中的不同作用。一旦掌握了这些基础知识，因果推断就变得不那么随意了，这有助于防止学者陷入混淆。这本书描述了各种数据分析方法，而这些方法可用于在一组特定假设下估计让我们感兴趣的因果效应。这本书的一个关键信息是，因果推断不能简化为数据分析方法的集合。

这本书分为三部分，难度越来越大：第一部分是关于没有模型的因果推断（即因果效应的非参数识别），第二部分是关于有模型的因果推断（即用参数模型估计因果效应），第三部分是关于复杂纵向数据（longitudinal data）的因果推断（即，估计时变处理的因果效应）。在整篇文章中，我们穿插了一些细枝末节和技术要点，详细阐述了正文中提到的那些主题。细致讲解部分是为所有读者设计的，而技术部分是为受过中级统计培训的读者设计的。这本书提供了一个全面的关于因果推断的概念和方法，尽管这些方法目前都还分散存在于几个学科的期刊中。我们期望这本书会引起那些对因果推断感兴趣的人的兴趣，例如流行病学家、统计学家、心理学家、经济学家、社会学家、政治科学家、计算机科学家。当然，作为因果推断研究小组的必读书籍，希望各位学者也能够下载这本书籍参阅。

重要的是，这不是一本哲学书。我们对形而上学的概念，如因果关系和原因，仍然持不可知论的态度。相反，我们侧重于确定和估计人群中的因果效应，即测量不同干预下结果分布变化的数值。例如，我们讨论在严重心力衰竭患者中，若他们接受了心脏移植，或他们没有接受心脏移植，如何估计两种情况下的死亡风险。我们的主要目标是帮助决策者做出更好的决策——可操作的因果推理。

作者还附上了书中的数据及R，Stata, Python等运行程序。

这里只展示一下Stata的一章code，R, SAS, Python可以自行下载。

/***************************************************************
PROGRAM 17.1
生存曲线的非参数估计
Section 17.1
***************************************************************/
clear
use "nhefs.dta"

/*Some preprocessing of the data*/
gen survtime = .
replace survtime = 120 if death == 0
replace survtime = (yrdth - 83)*12 + modth if death ==1
* yrdth ranges from 83 to 92*

tab death qsmk

/*Kaplan-Meier graph of observed survival over time, by quitting smoking*/
*For now, we use the stset function in Stata*
stset survtime, failure(death=1)
sts graph, by(qsmk) xlabel(0(12)120)


/***************************************************************
PROGRAM 17.2
通过风险模型对生存曲线进行参数估计
Section 17.1
Generates Figure 17.4
***************************************************************/

/**Create person-month dataset for survival analyses**/

/*We want our new dataset to include 1 observation per person per month alive, starting at time = 0*/
*Individuals who survive to the end of follow-up will have 119 time points*
*Individuals who die will have survtime - 1 time points*
clear
use "nhefs.dta"

gen survtime = .
replace survtime = 120 if death == 0
replace survtime = (yrdth - 83)*12 + modth if death ==1

*expand data to person-time*
gen time = 0
expand survtime if time == 0
bysort seqn: replace time = _n - 1

*Create event variable*
gen event = 0
replace event = 1 if time == survtime - 1 & death == 1
tab event

*Create time-squared variable for analyses*
gen timesq = time*time

*Save the dataset to your working directory for future use*
save nhefs_surv, replace

/**Hazard ratios**/
clear

use "nhefs_surv.dta"

*Fit a pooled logistic hazards model *
logistic event qsmk qsmk#c.time qsmk#c.time#c.time c.time c.time#c.time


/**Survival curves: run regression then do:**/

*Create a dataset with all time points under each treatment level*
*Re-expand data with rows for all timepoints*
drop if time != 0
expand 120 if time ==0
bysort seqn: replace time = _n - 1	 
		
*Create 2 copies of each subject, and set outcome to missing and treatment -- use only the newobs*
expand 2 , generate(interv)
replace qsmk = interv	

*Generate predicted event and survival probabilities for each person each month in copies*		 
predict pevent_k, pr
gen psurv_k = 1-pevent_k
keep seqn time qsmk interv psurv_k

*Within copies, generate predicted survival over time*
*Remember, survival is the product of conditional survival probabilities in each interval*	
sort seqn interv time
gen _t = time + 1
gen psurv = psurv_k if _t ==1		
bysort seqn interv: replace psurv = psurv_k*psurv[_t-1] if _t >1

*Display 10-year standardized survival, under interventions*
*Note: since time starts at 0, month 119 is 10-year survival*
by interv, sort: summarize psurv if time == 119

*Graph of standardized survival over time, under interventions*
*Note, we want our graph to start at 100% survival, so add an extra time point with P(surv) = 1*
expand 2 if time ==0, generate(newtime)
replace psurv  = 1 if newtime == 1
gen time2 = 0 if newtime ==1
replace time2 = time + 1 if newtime == 0
*Separate the survival probabilities to allow plotting by intervention on qsmk*
separate psurv, by(interv)
*Plot the curves*
twoway (line psurv0 time2, sort) (line psurv1 time2, sort) if interv > -1, ylabel(0.5(0.1)1.0) xlabel(0(12)120) ytitle("Survival probability") xtitle("Months of follow-up") legend(label(1 "A=0") label(2 "A=1"))


/***************************************************************
PROGRAM 17.3
Estimation of survival curves via IP weighted hazards model
Data from NHEFS
Section 17.4
Generates Figure 17.6
***************************************************************/

clear
use "nhefs_surv.dta"

keep seqn event qsmk time sex race age education smokeintensity smkintensity82_71  smokeyrs exercise active wt71
preserve

*Estimate weights*
logit qsmk sex race c.age##c.age ib(last).education c.smokeintensity##c.smokeintensity ///
c.smokeyrs##c.smokeyrs ib(last).exercise ib(last).active c.wt71##c.wt71 if time == 0
predict p_qsmk, pr
logit qsmk if time ==0
predict num, pr
gen sw=num/p_qsmk if qsmk==1
replace sw=(1-num)/(1-p_qsmk) if qsmk==0
summarize sw

*IP weighted survival by smoking cessation*
logit event qsmk qsmk#c.time qsmk#c.time#c.time c.time c.time#c.time [pweight=sw] , cluster(seqn)

*Create a dataset with all time points under each treatment level*
*Re-expand data with rows for all timepoints*
drop if time != 0
expand 120 if time ==0
bysort seqn: replace time = _n - 1		 
		
*Create 2 copies of each subject, and set outcome to missing and treatment -- use only the newobs*
expand 2 , generate(interv)
replace qsmk = interv	

*Generate predicted event and survival probabilities for each person each month in copies*		 
predict pevent_k, pr
gen psurv_k = 1-pevent_k
keep seqn time qsmk interv psurv_k

*Within copies, generate predicted survival over time*
*Remember, survival is the product of conditional survival probabilities in each interval*	
sort seqn interv time
gen _t = time + 1
gen psurv = psurv_k if _t ==1		
bysort seqn interv: replace psurv = psurv_k*psurv[_t-1] if _t >1

*Display 10-year standardized survival, under interventions*
*Note: since time starts at 0, month 119 is 10-year survival*
by interv, sort: summarize psurv if time == 119

quietly summarize psurv if(interv==0 & time ==119)
matrix input observe = (0,`r(mean)')
quietly summarize psurv if(interv==1 & time ==119)
matrix observe = (observe \1,`r(mean)')
matrix observe = (observe \3, observe[2,2]-observe[1,2])
matrix list observe


*Graph of standardized survival over time, under interventions*
*Note: since our outcome model has no covariates, we can plot psurv directly. If we had covariates we would need to stratify or average across the values*
expand 2 if time ==0, generate(newtime)
replace psurv  = 1 if newtime == 1
gen time2 = 0 if newtime ==1
replace time2 = time + 1 if newtime == 0
separate psurv, by(interv)
twoway (line psurv0 time2, sort) (line psurv1 time2, sort) if interv > -1, ylabel(0.5(0.1)1.0) xlabel(0(12)120) ytitle("Survival probability") xtitle("Months of follow-up") legend(label(1 "A=0") label(2 "A=1"))
*remove extra timepoint*
drop if newtime == 1
drop time2

restore

**Bootstraps**
save nhefs_std1 , replace
 
capture program drop bootipw_surv
program define bootipw_surv , rclass
        u nhefs_std1 , clear
		preserve
		bsample, cluster(seqn) idcluster(newseqn)	
		
		logit qsmk sex race c.age##c.age ib(last).education c.smokeintensity##c.smokeintensity ///
		c.smokeyrs##c.smokeyrs ib(last).exercise ib(last).active c.wt71##c.wt71 if time == 0
		predict p_qsmk, pr
		logit qsmk if time ==0
		predict num, pr
		gen sw=num/p_qsmk if qsmk==1
		replace sw=(1-num)/(1-p_qsmk) if qsmk==0

		logit event qsmk qsmk#c.time qsmk#c.time#c.time c.time c.time#c.time  [pweight=sw] , cluster(newseqn)
	
		drop if time != 0
		expand 120 if time ==0
		bysort newseqn: replace time = _n - 1		 
		expand 2 , generate(interv_b)
		replace qsmk = interv_b	
		
		predict pevent_k, pr
		gen psurv_k = 1-pevent_k
		keep newseqn time qsmk interv_b psurv_k

		sort newseqn interv_b time
		gen _t = time + 1
		gen psurv = psurv_k if _t ==1		
		bysort newseqn interv_b: replace psurv = psurv_k*psurv[_t-1] if _t >1
		drop if time != 119
		bysort interv_b: egen meanS_b = mean(psurv)
		keep newseqn qsmk  meanS_b
        drop if newseqn != 1  /* only need one pair */
	
		drop newseqn		
		
	    return scalar boot_0 = meanS_b[1]
	    return scalar boot_1 = meanS_b[2]
	    return scalar  boot_diff = return(boot_1) - return(boot_0)
	   restore
end		
set rmsg on
simulate PrY_a0 = r(boot_0) PrY_a1 = r(boot_1)	difference=r(boot_diff) , reps(10) seed(1)  : bootipw_surv
set rmsg off
 
matrix pe = observe[1..3, 2]'
bstat, stat(pe) n(1629)


/***************************************************************
PROGRAM 17.4
通过g公式估算生存曲线
Section 17.5
Generates Figure 17.7
***************************************************************/
clear
use "nhefs_surv.dta"

keep seqn event qsmk time sex race age education smokeintensity smkintensity82_71  smokeyrs exercise active wt71
preserve
 
quietly logistic event qsmk qsmk#c.time  qsmk#c.time#c.time time c.time#c.time  ///
	sex race c.age##c.age ib(last).education c.smokeintensity##c.smokeintensity ///
	c.smokeyrs##c.smokeyrs ib(last).exercise ib(last).active c.wt71##c.wt71 , cluster(seqn)
			
drop if time != 0
expand 120 if time ==0
bysort seqn: replace time = _n - 1		 		
expand 2 , generate(interv)
replace qsmk = interv		 
predict pevent_k, pr
gen psurv_k = 1-pevent_k
keep seqn  time qsmk interv psurv_k		       	
sort seqn interv time
gen _t = time + 1
gen psurv = psurv_k if _t ==1		
bysort seqn interv:   replace psurv = psurv_k*psurv[_t-1] if _t >1
by interv, sort: summarize psurv if time == 119

keep qsmk interv psurv time
		
bysort interv : egen meanS = mean(psurv) if time == 119
by interv: summarize meanS

quietly summarize meanS if(qsmk==0  & time ==119)
matrix input observe = ( 0,`r(mean)')
quietly summarize meanS if(qsmk==1  & time ==119)
matrix observe = (observe \1,`r(mean)')
matrix observe = (observe \2, observe[2,2]-observe[1,2])
*Add some row/column descriptions and print results to screen*
matrix rownames observe =  P(Y(a=0)=1) P(Y(a=1)=1) difference
matrix colnames observe = interv survival


*Graph standardized survival over time, under interventions*
*Note: unlike in PROGRAM 17.3, we now have covariates so we first need to average survival across strata*
bysort interv time : egen meanS_t = mean(psurv)
*Now we can continue with the graph*
expand 2 if time ==0, generate(newtime)
replace meanS_t  = 1 if newtime == 1
gen time2 = 0 if newtime ==1
replace time2 = time + 1 if newtime == 0
separate meanS_t, by(interv)
twoway (line meanS_t0 time2, sort) (line meanS_t1 time2, sort), ylabel(0.5(0.1)1.0) xlabel(0(12)120) ytitle("Survival probability") xtitle("Months of follow-up") legend(label(1 "A=0") label(2 "A=1"))
*remove extra timepoint*
drop if newtime == 1

restore

*Bootstraps*
save nhefs_std2 , replace
 
capture program drop bootstdz_surv
program define bootstdz_surv , rclass
        u nhefs_std2 , clear
		preserve
		bsample, cluster(seqn) idcluster(newseqn)		
        logistic event  qsmk qsmk#c.time  qsmk#c.time#c.time time c.time#c.time   ///
			sex race c.age##c.age ib(last).education ///
			c.smokeintensity##c.smokeintensity c.smkintensity82_71  ///
			c.smokeyrs##c.smokeyrs ib(last).exercise ib(last).active c.wt71##c.wt71
		 drop if time != 0	
		/*only predict on new version of data */
	    expand 120 if time ==0
	    bysort newseqn: replace time = _n - 1		 		
	    expand 2 , generate(interv_b)
	    replace qsmk = interv_b		 
	    predict pevent_k, pr
	    gen psurv_k = 1-pevent_k
        keep newseqn  time qsmk psurv_k		       	
		sort newseqn qsmk time
	    gen _t = time + 1
	    gen psurv = psurv_k if _t ==1	
	    bysort newseqn  qsmk:   replace psurv = psurv_k*psurv[_t-1] if _t >1
	    drop  if time != 119   /* keep only last observation */
		keep  newseqn qsmk psurv
		/* if time is in data for complete graph add time to bysort */	
	    bysort qsmk  : egen meanS_b = mean(psurv)
		keep newseqn qsmk  meanS_b
        drop if newseqn != 1  /* only need one pair */
	
		drop newseqn		
		
	    return scalar boot_0 = meanS_b[1]
	    return scalar boot_1 = meanS_b[2]
	    return scalar  boot_diff = return(boot_1) - return(boot_0)
	   restore
end		
set rmsg on
simulate PrY_a0 = r(boot_0) PrY_a1 = r(boot_1)	difference=r(boot_diff) , reps(10) seed(1)  : bootstdz_surv
set rmsg off
 
matrix pe = observe[1..3, 2]'
bstat, stat(pe) n(1629)

Source:

https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

对因果推断研究感兴趣，都可以到社群交流访问。

长按上方二维码可以读该书

拓展性阅读

前些日，咱们引荐了①“实证研究中用到的200篇文章, 社科学者常备toolkit”、②实证文章写作常用到的50篇名家经验帖, 学者必读系列、③过去10年AER上关于中国主题的Articles专辑、④AEA公布2017-19年度最受关注的十大研究话题, 给你的选题方向，⑤2020年中文Top期刊重点选题方向, 写论文就写这些。后面，咱们又引荐了①使用CFPS, CHFS, CHNS数据实证研究的精选文章专辑！，②这40个微观数据库够你博士毕业了, 反正凭着这些库成了教授，③Python, Stata, R软件史上最全快捷键合辑！，④关于(模糊)断点回归设计的100篇精选Articles专辑！，⑤关于双重差分法DID的32篇精选Articles专辑！，⑥关于合成控制法SCM的33篇精选Articles专辑！⑦最近80篇关于中国国际贸易领域papers合辑！，⑧最近70篇关于中国环境生态的经济学papers合辑！⑨使用CEPS, CHARLS, CGSS, CLHLS数据库实证研究的精选文章专辑！⑩最近50篇使用系统GMM开展实证研究的papers合辑！这些文章受到了各位学者的欢迎和热议，博士生导师纷纷将其推荐给学生参阅。

之前，咱们圈子引荐过一些数据库(当然，社群里的数据库远不止这些)，如下：1.这40个微观数据库够你博士毕业了；2.中国工业企业数据库匹配160大步骤的完整程序和相应数据；3.中国省/地级市夜间灯光数据；4.1997-2014中国市场化指数权威版本；5.1998-2016年中国地级市年均PM2.5；6.计量经济圈经济社会等数据库合集；7.中国方言,官员, 行政审批和省长数据库开放；8.2005-2015中国分省分行业CO2数据；9.国际贸易研究中的数据演进与当代问题；10.经济学研究常用中国微观数据手册。

下面这些短链接文章属于合集，可以收藏起来阅读，不然以后都找不到了。

2年，计量经济圈公众号近1000篇文章，

Econometrics Circle

数据处理：Stata | R | Python | 缺失值 | CHIP/ CHNS/CHARLS/CFPS/CGSS等 |

计量经济圈组织了一个计量社群，有如下特征：热情互助最多、前沿趋势最多、社科资料最多、社科数据最多、科研牛人最多、海外名校最多。因此，建议积极进取和有强烈研习激情的中青年学者到社群交流探讨，始终坚信优秀是通过感染优秀而互相成就彼此的。

: ， . Video Mini Program Like ，轻点两下取消赞 Wow ，轻点两下取消在看

”FAN某”的离婚财产分割判决书（全文）

”FAN某”的离婚财产分割判决书（全文）

公益慈善｜“翼行天下一生守护”慈善项目捐赠仪式圆满举行！

何炅突然高调官宣喜讯，网友恭喜：30年了，终于等到这一天！

女性最佳“绝经期”已公布，不是45岁，而是这个数，越接近越健康！

哈佛大学新修订完成的因果推断经典大作免费下载！附数据和code！

您可能也对以下帖子感兴趣

”FAN某”的离婚财产分割判决书（全文）

”FAN某”的离婚财产分割判决书（全文）

公益慈善｜“翼行天下 一生守护”慈善项目捐赠仪式圆满举行！

何炅突然高调官宣喜讯，网友恭喜：30年了，终于等到这一天！

女性最佳“绝经期”已公布，不是45岁，而是这个数，越接近越健康！

生成图片，分享到微信朋友圈

哈佛大学新修订完成的因果推断经典大作免费下载！附数据和code！

您可能也对以下帖子感兴趣

公益慈善｜“翼行天下一生守护”慈善项目捐赠仪式圆满举行！