
How to Release Confidential Data When Publishing Replication Materials

连享会, 2022-12-31


Author: 常丁祎 (Chang Dingyi, Shandong University)
Email: changdingyi1126@163.com


Contents

  • 1. Introduction
  • 2. Stata example
    • 2.1 Stata steps
    • 2.2 Checking the usability of the synthetic data
  • 3. Conclusion
  • 4. References
  • 5. Related posts




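1. Introduction

Many top economics journals (such as AER, QJE, JPE, and the AEJ series) now require authors to submit the original data and code behind a paper, and they make those materials public. This both deters academic misconduct and protects the authors' intellectual property. Open data and code help the research community advance, but they also create a problem for authors whose data are confidential or covered by an agreement that prohibits release.

One response is to construct a synthetic dataset that satisfies all privacy constraints while preserving the important structure of the original data, so that other researchers can use it to approximately reproduce the paper's main conclusions. Multiple imputation is one way to build such a dataset. The steps below follow the blog post How to come public, with private data.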

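2. Stata example

To illustrate the method, we use a readily available online dataset: an excerpt from the Swiss Labor Market Survey 1998, which is provided as example data with the Stata command oaxaca (Jann, 2008).

Suppose you have signed a confidentiality agreement to work with the Swiss survey data and are about to submit a paper, but the journal requires your data and code. Because the agreement prevents you from releasing the dataset, the suggestion here is to provide 5 synthetic datasets instead. Others can then run your code on these synthetic datasets to reproduce the paper's empirical results. The Stata steps are as follows.

2.1 Stata steps

First, load the example data oaxaca.dta into Stata. The dataset contains 15 variables and 1,647 observations.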

clear all

. use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear // load the oaxaca dataset
(Excerpt from the Swiss Labor Market Survey 1998)

. codebook, c

Variable Obs Unique Mean Min Max Label
---------------------------------------------------------------------------
lnwage 1434 675 3.357604 .507681 5.259097 log hourly wages
educ 1647 10 11.40134 5 17.5 years of education
exper 1434 563 13.15324 0 49.16667 years of work experience
tenure 1434 323 7.860937 0 44.83333 years of job tenure
isco 1434 9 4.014644 1 9 occupation (ISCO)
female 1647 2 .5391621 0 1 sex of respondent (1=female)
lfp 1647 2 .870674 0 1 labor force participation
age 1647 45 39.25379 18 62 age of respondent
agesq 1647 45 1662.489 324 3844 age squared
single 1647 2 .343048 0 1 single
married 1647 2 .5233758 0 1 married
divorced 1647 2 .1335762 0 1 divorced
kids6 1647 5 .2847602 0 4 number of childern ages 6 and younger
kids714 1647 5 .3290832 0 4 number of children ages 7 to 14
wt 1647 6 1.006181 .5302977 3.181786 sampling weights
---------------------------------------------------------------------------

. misstable summarize // report missing values in the dataset
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
        lnwage |       213               1,434  |   >500    .507681    5.259097
         exper |       213               1,434  |   >500          0    49.16667
        tenure |       213               1,434  |    323          0    44.83333
          isco |       213               1,434  |      9          1           9
  -----------------------------------------------------------------------------

. count if lfp==0
213

. list lnwage exper tenure isco lfp in 1435/1450

+--------------------------------------+
| lnwage exper tenure isco lfp |
|--------------------------------------|
1435. | . . . . 0 |
1436. | . . . . 0 |
1437. | . . . . 0 |
1438. | . . . . 0 |
1439. | . . . . 0 |
|--------------------------------------|
1440. | . . . . 0 |
1441. | . . . . 0 |
1442. | . . . . 0 |
1443. | . . . . 0 |
1444. | . . . . 0 |
|--------------------------------------|
1445. | . . . . 0 |
1446. | . . . . 0 |
1447. | . . . . 0 |
1448. | . . . . 0 |
1449. | . . . . 0 |
|--------------------------------------|
1450. | . . . . 0 |
+--------------------------------------+

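These results show that four variables in the original dataset (lnwage, exper, tenure, and isco) have missing values: each has 213 missing observations, the same as the number of observations with lfp=0, and the listed rows confirm that all four are missing where lfp=0.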
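The claim that missingness lines up with lfp=0 can be checked directly instead of being read off a listing. The following is a minimal sketch, not part of the original walkthrough. It counts, for each of the four variables, the observations where "is missing" and "lfp==0" disagree. A count of zero for every variable would confirm the pattern.

```stata
* Sketch: check whether the wage-related variables are missing exactly when lfp == 0.
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
foreach v of varlist lnwage exper tenure isco {
    * count observations where the missingness indicator and (lfp == 0) differ
    quietly count if missing(`v') != (lfp == 0)
    display "`v': " r(N) " observations where missingness and lfp==0 disagree"
}
```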

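To build the synthetic datasets, we first generate a "seed" variable, a uniform random variable on the interval from 0 to 100. It later enters the imputation models as a fully observed predictor. We also generate an id variable that numbers the observations.

. gen id = _n // number the observations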

. set seed 10101

. gen seed = runiform(0,100)

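The next step is to add as many new observations as the original dataset contains. The command expand 1648 in 1 replaces the first observation with 1,648 copies of itself, which adds 1,647 new observations, and gen(tag) marks the new observations with tag=1. We then set every variable to missing for the observations with tag=1: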

. expand 1648 in 1, gen(tag)
(1,647 observations created)

. local vlist "lnwage educ exper tenure isco female lfp age single married divorced kids6 kids714 wt"
. foreach i of varlist `vlist' {
replace `i'=. if tag==1
}
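To see why expand 1648 adds 1,647 observations and not 1,648, it can help to run the same command on a toy dataset. This is a small sketch with made-up data. The variable x and the 3-observation dataset are illustrative assumptions, not part of the original example.

```stata
* Sketch: how expand ... in 1, gen() behaves on a toy dataset.
clear
set obs 3
gen x = _n
* replace observation 1 with 4 copies of itself: 1 original + 3 new copies
expand 4 in 1, gen(tag)
* the data now hold 6 observations: the 3 originals (tag==0)
* followed by 3 new copies of observation 1 (tag==1)
list x tag
```

The new copies are appended at the end of the data, which is why the walkthrough can address them with if tag==1.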

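For the 1,647 new observations with tag=1, we then regenerate seed and lfp. The new lfp equals 1 with probability 0.87, which matches the mean of lfp in the original data (0.8707):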

. replace seed = runiform(0,100) if tag==1
(1,647 real changes made)

. replace lfp = runiform()<.87 if tag==1
(1,647 real changes made)
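The 0.87 above is typed in by hand. A possible variation, sketched here under the assumption that you prefer to read the rate from the data, stores the observed participation rate in a local macro and uses it for the draw. The macro name p_lfp is hypothetical, and the random draws differ from those in the walkthrough, so the resulting values will not match the output shown in this post.

```stata
* Sketch: draw lfp for the new rows from the observed participation rate.
* p_lfp is a hypothetical local macro name, not part of the original code.
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
set seed 10101
expand 1648 in 1, gen(tag)
* mean of lfp over the original observations only
quietly summarize lfp if tag == 0
local p_lfp = r(mean)
* Bernoulli draw with success probability p_lfp for the new rows
replace lfp = runiform() < `p_lfp' if tag == 1
* compare the distribution of lfp in original (tag==0) and new (tag==1) rows
tabulate lfp tag, column
```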

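The next step is to generate the synthetic datasets by multiple imputation, using the mi impute chained command (abbreviated to mi impute chain below). In our view the best imputation method here is pmm, predictive mean matching: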

. mi set wide

. mi register impute lnwage educ exper tenure ///
isco female age single married ///
kids6 kids714 wt

. mi impute chain ///
(pmm, knn(100)) educ female age single married kids6 kids714 wt (pmm if lfp==1, knn(100)) ///
lnwage exper tenure isco = seed lfp, add(5)
note: missing-value pattern is monotone; no iteration performed

Conditional models (monotone):
educ: pmm educ seed lfp , knn(100)
female: pmm female educ seed lfp , knn(100)
age: pmm age female educ seed lfp , knn(100)
single: pmm single age female educ seed lfp , knn(100)
married: pmm married single age female educ seed lfp , knn(100)
kids6: pmm kids6 married single age female educ seed lfp , knn(100)
kids714: pmm kids714 kids6 married single age female educ seed lfp , knn(100)
wt: pmm wt kids714 kids6 married single age female educ seed lfp , knn(100)
lnwage: pmm lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)
exper: pmm exper lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)
tenure: pmm tenure exper lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)
isco: pmm isco tenure exper lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)

Performing chained iterations ...

Multivariate imputation Imputations = 5
Chained equations added = 5
Imputed: m=1 through m=5 updated = 0

Initialization: monotone Iterations = 0
burn-in = 0

educ: predictive mean matching
female: predictive mean matching
age: predictive mean matching
single: predictive mean matching
married: predictive mean matching
kids6: predictive mean matching
kids714: predictive mean matching
wt: predictive mean matching
lnwage: predictive mean matching
exper: predictive mean matching
tenure: predictive mean matching
isco: predictive mean matching

----------------------------------------------------------
| Observations per m
|----------------------------------------------
Variable | Complete Incomplete Imputed | Total
-----------+-----------------------------------+----------
educ | 1647 1647 1647 | 3294
female | 1647 1647 1647 | 3294
age | 1647 1647 1647 | 3294
single | 1647 1647 1647 | 3294
married | 1647 1647 1647 | 3294
kids6 | 1647 1647 1647 | 3294
kids714 | 1647 1647 1647 | 3294
wt | 1647 1647 1647 | 3294
lnwage | 1434 1458 1458 | 2892
exper | 1434 1458 1458 | 2892
tenure | 1434 1458 1458 | 2892
isco | 1434 1458 1458 | 2892
----------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
of the number of filled-in observations.)

. forvalues i = 1/5 {
preserve
keep if tag==1
keep _`i'_* lfp
ren _`i'_* *
save fake_oaxaca_`i', replace
restore
}

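This produces 5 sets of imputed variables, which the loop above saves as 5 distinct synthetic datasets (fake_oaxaca_1 through fake_oaxaca_5). Because lnwage, exper, tenure, and isco are imputed only where lfp==1, each synthetic dataset has 1,458 observations with values for these variables. The synthetic datasets have a structure similar to the original data, can be used to reproduce the paper's results, and can be released publicly.

2.2 Checking the usability of the synthetic data

We now check whether the synthetic datasets are usable by estimating a simple linear regression, a quantile regression (at the 10th percentile), and a Heckman two-step model on the original data and on each synthetic dataset: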

frame create test

frame test: {
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
qui:reg lnwage educ exper tenure female
est sto m1
qui:qreg lnwage educ exper tenure female, q(10)
est sto m2
qui:heckman lnwage educ exper tenure female age, selec(lfp =educ female age single married kids6 kids714) two
est sto m3
}

forvalues i = 1/5 {
frame test: {
use fake_oaxaca_`i', clear

qui:reg lnwage educ exper tenure female
est sto m1`i'
qui: qreg lnwage educ exper tenure female, q(10)
est sto m2`i'

qui: heckman lnwage educ exper tenure female age, ///
selec(lfp =educ female age single married kids6 kids714) two
est sto m3`i'
}
}

. ** OLS
. esttab m1 m11 m12 m13 m14 m15, mtitle(Original Fake1 Fake2 Fake3 Fake4 Fake5)

----------------------------------------------------------------------------------
(1) (2) (3) (4) (5) (6)
Original Fake1 Fake2 Fake3 Fake4 Fake5
----------------------------------------------------------------------------------
educ 0.0848*** 0.0664*** 0.0773*** 0.0676*** 0.0647*** 0.0597***
(16.34) (13.73) (16.11) (12.51) (14.00) (12.73)

exper 0.0111*** 0.00908*** 0.00992*** 0.0122*** 0.00655*** 0.00584***
(7.22) (6.34) (6.98) (7.64) (4.95) (3.96)

tenure 0.00771*** 0.00747*** 0.0100*** 0.00112 0.00626*** 0.00860***
(4.10) (4.17) (5.70) (0.57) (3.63) (4.64)

female -0.0841*** -0.0508* -0.0767** -0.0914*** -0.0931*** -0.0259
(-3.35) (-2.18) (-3.28) (-3.56) (-4.05) (-1.05)

_cons 2.213*** 2.469*** 2.308*** 2.441*** 2.540*** 2.558***
(32.38) (39.69) (36.55) (33.93) (42.24) (40.47)
----------------------------------------------------------------------------------
N 1434 1458 1458 1458 1458 1458
----------------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

. ** qreg 10
. esttab m2 m21 m22 m23 m24 m25, mtitle(Original Fake1 Fake2 Fake3 Fake4 Fake5)

----------------------------------------------------------------------------------
(1) (2) (3) (4) (5) (6)
Original Fake1 Fake2 Fake3 Fake4 Fake5
----------------------------------------------------------------------------------
educ 0.103*** 0.0698*** 0.0810*** 0.0768*** 0.0717*** 0.0637***
(6.21) (6.02) (7.09) (4.15) (6.57) (6.11)

exper 0.0200*** 0.0111** 0.0111** 0.0135* 0.00645* 0.00341
(4.06) (3.23) (3.29) (2.45) (2.06) (1.04)

tenure 0.000669 0.00592 0.00987* 0.00240 0.00270 0.00460
(0.11) (1.38) (2.35) (0.36) (0.66) (1.12)

female -0.151 -0.0822 -0.0657 -0.166 -0.116* 0.0175
(-1.87) (-1.47) (-1.18) (-1.88) (-2.13) (0.32)

_cons 1.462*** 1.963*** 1.791*** 1.869*** 2.067*** 2.104***
(6.67) (13.18) (11.90) (7.58) (14.55) (14.95)
----------------------------------------------------------------------------------
N 1434 1458 1458 1458 1458 1458
----------------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

. ** heckman
. esttab m3 m31 m32 m33 m34 m35, mtitle(Original Fake1 Fake2 Fake3 Fake4 Fake5)

--------------------------------------------------------------------------------
(1) (2) (3) (4) (5) (6)
Original Fake1 Fake2 Fake3 Fake4 Fake5
--------------------------------------------------------------------------------
lnwage
educ 0.0717*** 0.0617*** 0.0717*** 0.0644*** 0.0585*** 0.0487***
(13.13) (11.83) (14.16) (11.32) (12.20) (9.84)

exper 0.00179 0.00397* 0.00246 0.00248 -0.00347* -0.00398*
(0.94) (2.18) (1.40) (1.24) (-2.11) (-2.21)

tenure 0.00200 0.00481* 0.00637*** -0.00295 0.00101 0.00290
(1.01) (2.57) (3.52) (-1.49) (0.58) (1.52)

female -0.105*** -0.0721* -0.132*** -0.211*** -0.185*** -0.0583
(-3.59) (-2.48) (-4.10) (-6.35) (-6.57) (-1.80)

age 0.0146*** 0.00740*** 0.0108*** 0.0122*** 0.0141*** 0.0149***
(7.92) (4.50) (6.81) (6.79) (9.00) (8.99)

_cons 1.991*** 2.332*** 2.087*** 2.168*** 2.246*** 2.297***
(27.12) (33.01) (29.79) (27.02) (34.05) (33.22)
--------------------------------------------------------------------------------
lfp
educ 0.149*** 0.210*** 0.148*** 0.128*** 0.136*** 0.149***
(5.37) (6.87) (5.72) (5.19) (5.59) (5.80)

female -1.785*** -1.696*** -1.662*** -1.463*** -1.510*** -1.811***
(-11.09) (-10.62) (-9.66) (-9.78) (-10.91) (-10.17)

age -0.0388*** -0.00878 -0.0170** -0.0193*** -0.0271*** -0.0289***
(-5.77) (-1.39) (-2.94) (-3.55) (-4.61) (-4.78)

single -0.0998 -0.361 -0.0764 -0.192 -0.497* -0.681***
(-0.43) (-1.63) (-0.37) (-0.96) (-2.36) (-3.31)

married -0.867*** -0.775*** -0.544*** -0.596*** -0.746*** -0.644***
(-5.48) (-4.15) (-3.43) (-3.60) (-4.22) (-3.93)

kids6 -0.716*** -0.571*** -0.499*** -0.599*** -0.671*** -0.563***
(-8.71) (-7.40) (-6.84) (-7.55) (-8.84) (-7.03)

kids714 -0.343*** -0.345*** -0.206** -0.292*** -0.258*** -0.482***
(-5.26) (-5.22) (-2.90) (-4.55) (-3.98) (-6.73)

_cons 3.543*** 1.483** 2.199*** 2.434*** 2.858*** 3.112***
(7.29) (3.15) (4.95) (5.97) (6.88) (7.06)
--------------------------------------------------------------------------------
/mills
lambda -0.123 0.00819 0.0933 0.332*** 0.185** -0.0255
(-1.88) (0.11) (1.15) (4.18) (2.70) (-0.31)
--------------------------------------------------------------------------------
N 1647 1647 1647 1647 1647 1647
--------------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

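The regressions on the 5 synthetic datasets do not match the original results exactly, but the overall picture is similar: the coefficient on educ, for example, is positive and highly significant in every column. Some individual estimates do change, such as tenure in the Fake3 OLS column, female in the Fake5 columns, and the Mills lambda in the Heckman model, so the synthetic data reproduce the main conclusions only approximately.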
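One simple way to summarize the five synthetic fits is to average a coefficient across them. In the OLS table above, the five synthetic educ coefficients (0.0664, 0.0773, 0.0676, 0.0647, 0.0597) average to about 0.067, compared with 0.0848 in the original data. The sketch below computes that average directly. It assumes the files fake_oaxaca_1.dta through fake_oaxaca_5.dta exist in the working directory, and it is only an illustrative summary, not a procedure taken from the original post.

```stata
* Sketch: average the OLS coefficient on educ over the five synthetic datasets.
* Assumes fake_oaxaca_1.dta ... fake_oaxaca_5.dta are in the working directory.
* Note: use ..., clear replaces the data currently in memory.
local total = 0
forvalues i = 1/5 {
    use fake_oaxaca_`i', clear
    quietly regress lnwage educ exper tenure female
    * accumulate the educ coefficient from this synthetic dataset
    local total = `total' + _b[educ]
}
display "Mean educ coefficient across synthetic datasets: " %6.4f `total'/5
```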

Next, we compare the means and the covariance matrices of the variables in the original data and in two of the synthetic datasets:

. frame test: {
. use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
(Excerpt from the Swiss Labor Market Survey 1998)
. mean lnwage exper tenure educ female age single married kids6 kids714

Mean estimation Number of obs = 1,434

--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
lnwage | 3.357604 .0140235 3.330096 3.385113
exper | 13.15324 .2632213 12.6369 13.66958
tenure | 7.860937 .2144401 7.440287 8.281587
educ | 11.53696 .0639585 11.4115 11.66242
female | .4762901 .0131934 .4504096 .5021706
age | 38.83891 .2915321 38.26704 39.41079
single | .3891213 .0128794 .3638568 .4143859
married | .4700139 .0131845 .4441509 .495877
kids6 | .2182706 .0151344 .1885826 .2479586
kids714 | .2782427 .0172008 .2445013 .311984
--------------------------------------------------------------
.
. corr lnwage exper tenure educ female age single married kids6 kids714 , cov
(obs=1,434)

| lnwage exper tenure educ female age single married kids6 kids714
-------------+------------------------------------------------------------------------------------------
lnwage | .28201
exper | 1.23107 99.3553
tenure | 1.03799 47.0903 65.9418
educ | .469384 -3.24851 -.510834 5.86604
female | -.043298 -.484036 -.598583 -.14532 .249612
age | 2.05353 79.3047 54.7529 1.62913 .213554 121.877
single | -.061535 -1.71735 -1.22853 -.005669 -.001235 -2.87447 .237872
married | .044889 1.05484 .938517 .089909 -.027229 1.75406 -.18302 .249275
kids6 | .030479 -.557053 -.447353 .118061 -.034249 -.953649 -.077317 .096222 .328459
kids714 | .036036 .088118 .006038 .020763 .001368 .469835 -.099274 .100813 .018081 .424272

.
. }

. forvalues i = 1/2 {
frame test: {
use fake_oaxaca_`i', clear
mean lnwage exper tenure educ female age single married kids6 kids714
corr lnwage exper tenure educ female age single married kids6 kids714 , cov
}
}

Mean estimation Number of obs = 1,458

------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
---------+--------------------------------------------
lnwage | 3.388637 .0126988 3.363727 3.413547
exper | 12.95556 .2533696 12.45855 13.45257
tenure | 7.501029 .201992 7.104803 7.897255
educ | 11.59825 .0627407 11.47518 11.72132
female | .4657064 .0130682 .4400719 .491341
age | 38.77092 .2892844 38.20346 39.33838
single | .3737997 .012675 .3489366 .3986628
married | .4835391 .013092 .457858 .5092202
kids6 | .2256516 .0151994 .1958365 .2554666
kids714 | .2921811 .0175129 .2578278 .3265343
------------------------------------------------------
(obs=1,458)

| lnwage exper tenure educ female age single married kids6 kids714
--------+------------------------------------------------------------------------------------------
lnwage | .235118
exper | 1.08476 93.598
tenure | .825123 41.0372 59.4875
educ | .370826 -1.40578 -.144862 5.73927
female | -.023944 -.422867 -.347883 -.073241 .248995
age | 1.59116 77.3443 48.6604 2.5879 .185001 122.013
single | -.051301 -1.51003 -1.02824 -.076044 .01523 -2.93971 .234234
married | .027974 .78726 .844446 .127133 -.03248 1.67021 -.180871 .2499
kids6 | .025567 -.887329 -.501605 .238281 -.025544 -1.0073 -.07068 .092598 .33683
kids714 | .014055 -.016863 -.031873 .029269 -.028408 .498002 -.08527 .109137 -.008324 .447173

(Excerpt from the Swiss Labor Market Survey 1998)

Mean estimation Number of obs = 1,458

--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
lnwage | 3.375149 .0131252 3.349403 3.400895
exper | 13.38134 .2600573 12.87122 13.89147
tenure | 7.981767 .2083602 7.573049 8.390485
educ | 11.51749 .0640525 11.39184 11.64313
female | .4718793 .0130783 .4462249 .4975337
age | 39.42593 .2911667 38.85478 39.99708
single | .3525377 .0125164 .3279856 .3770899
married | .5041152 .0130986 .4784211 .5298094
kids6 | .2139918 .0149255 .184714 .2432696
kids714 | .2716049 .0165868 .2390683 .3041415
--------------------------------------------------------------
(obs=1,458)

| lnwage exper tenure educ female age single married kids6 kids714
--------+------------------------------------------------------------------------------------------
lnwage | .25117
exper | 1.18759 98.6042
tenure | 1.00408 44.5726 63.2976
educ | .42614 -3.36842 -1.26497 5.98176
female | -.035159 -.297665 -.319233 -.127854 .24938
age | 1.90257 79.0684 49.8293 .911042 .277257 123.606
single | -.050738 -1.52906 -1.13198 -.024701 -.011356 -2.80091 .228412
married | .034221 .687831 .813819 .109228 -.007434 1.54698 -.177842 .250155
kids6 | .008237 -.775524 -.647205 .137813 -.012509 -1.08846 -.07206 .097952 .324801
kids714 | .037922 -.072246 .041389 .091506 .002839 .339282 -.087581 .090165 -.012176 .401129

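These results show that the means and covariances of the variables in the synthetic datasets are generally close to those in the original data, so the synthetic datasets preserve the main structural features of the original dataset.

3. Conclusion

As seen above, the synthetic data are far from replicating the original results perfectly. After all, the procedure builds the synthetic datasets by introducing random error, so that others can attempt to reproduce the paper's work. Although the original results cannot be reproduced exactly, they can be recovered approximately, which makes the approach worth trying. It offers one way around the problem that confidential data cannot be released.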

4. References

  • Blog: How to come public, with private data.
  • Jann, Ben (2008). The Blinder-Oaxaca decomposition for linear regression models. The Stata Journal 8(4): 453-479.
  • Jenkins, SP, Rios‐Avila, F, 2021. "Measurement error in earnings data: replication of Meijer, Rohwedder, and Wansbeek's mixture model approach to combining survey and register data." J Appl Econ 36(4): 474-483. https://doi.org/10.1002/jae.2811

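5. Related posts

Note: the list of posts below was generated with the Stata command:
lianxh
To install the latest version of lianxh:
ssc install lianxh, replace

  • Topic: Special courses
    • Live stream: My Beetle, paper walkthroughs and replication
  • Topic: Paper writing
    • 连享会: A roundup of paper replication websites
    • Stata, replicating a JPE paper: capital deepening and nonbalanced economic growth
    • Reproducible research: how to make sure your results can be reproduced?
  • Topic: Stata resources
    • Are the results of accounting journal papers reproducible?
  • Topic: Data processing
    • Reproducing Stata results: the dependencies command for version control of user-written commands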
