
Thirty Years of Microeconometrics in Theory and Application

计量经济圈 2021-10-23


Email: econometrics666@sina.cn

All the code from the 计量经济圈 methodology series, along with macro and micro databases and various software packages, is available in the community group. You are welcome to visit the 计量经济圈 community to exchange ideas.

Author: A. Colin Cameron (a leading scholar in microeconometrics)

1. Latest trends in applying machine learning in microeconometrics: regression models

2. Latest trends in applying machine learning in microeconometrics: big data and causal inference

The following is the report Professor Colin Cameron gave for Stata's 30th anniversary. Coming from a leading figure in microeconometrics, this text of roughly 15,000 characters is a true classic. Whether or not you use Stata for data analysis, the microeconometric theory it covers is a rare and valuable resource.


In addition, there are reports by leading scholars in other fields, written in accessible language, which were compiled into the volume 《我与Stata走过的30年》 (Thirty Years with Stata). The book is available in the community group and can be downloaded by those who need it.


Introduction


Microeconomics research has become much more empirically oriented over the past 30 years. This has been made possible by increased computational power. The IBM XT 286, introduced in 1986, had 640KB of RAM, a 6MHz processor, a 20MB hard disk, and a 1.2MB floppy disk drive. By contrast, a typical PC today runs more than 500 times faster, with memory and storage that are more than 10,000 times larger. This greater computer power has been accompanied by increased data availability, new methods, and the development of statistical software to implement these methods.

 

Here I discuss how theoretical and applied microeconometrics research has evolved over the past 30 years and how Stata has been part of this process. The discussion of theory is necessarily brief, with further detail provided in Cameron (2009). The role of Stata, one of several packages available to econometricians, is especially important because it is now the most commonly used package in applied microeconometrics.

 

The interplay between theory and implementation is not straightforward because considerable time can pass from the introduction of new methods to their use by applied researchers and their incorporation in a statistical package. This delay arises partly because it takes time for the usefulness of a new method to become clear. To some extent, this is a “chicken and egg problem” because methods are used much more once they are incorporated into a statistical package. Delay also arises because some methods (notably semiparametric regression, maximum simulated likelihood, and Bayesian methods) are difficult to code into a user-friendly command that will work for a wide range of problems. Because Stata is programmable, this process can be (and has been) accelerated by users developing their own code ahead of any official Stata command. In some cases, this code is made available to other Stata users as a user-written Stata ado-file. Here I mention only a few of these useful add-ons.

 

Regression and Stata

 

Many of the core regression methods now widely used in applied microeconometrics research were introduced in the late 1970s and early 1980s. These methods include sample selection models (Heckman 1976), quantile regression (Koenker and Bassett 1978), bootstrap (Efron 1979), heteroskedastic-robust standard errors (White 1980, 1982), and generalized method of moments (GMM) estimation (Hansen 1982). Additionally, several seminal books appeared in the early and mid-1980s, namely, Limited-Dependent and Qualitative Variables in Econometrics (Maddala 1983) for limited dependent variable models, Advanced Econometrics (Amemiya 1985) for nonlinear regression models, and Analysis of Panel Data (Hsiao 2003) for panel data.

 

Cox (2005) provides a brief history of the first 20 years of Stata; Baum, Schaffer, and Stillman (2011) provide a recent overview. Stata was introduced in 1985 for use on IBM PCs running under DOS rather than on a mainframe computer. The initial release of Stata was quite limited and focused primarily on tools for data management and exploratory data analysis due to both its newness and the low computing power of PCs. The only regression command in the initial release was command regress for least-squares estimation of linear models.

 

The basic limited dependent variable models were among the first regression models to be introduced into Stata—logit and probit in 1987, survival models in 1988, tobit models and multinomial logit models in 1992, and linear sample selection models and negative binomial models in 1993. Quantile regression methods became much more widely used after their incorporation in Stata in 1992. Commands for general nonlinear least-squares and maximum likelihood estimation were introduced in 1993. GMM estimation was incorporated in several linear model commands, though a general command for GMM estimation was not introduced until 2009. The basic panel-data commands, a strength of Stata, were introduced in 1995 (linear) and 1996 (nonlinear).

 

Increased computing power has enabled greater use of simulation methods. Monte Carlo experiments using a known data-generating process can be conducted in Stata via the command simulate or the command postfile. Random variables can be drawn directly from a multitude of distributions following a major Stata enhancement in 2008. These distributions include the multivariate normal and the truncated multivariate normal (using the Geweke–Hajivassiliou–Keane simulator). The Stata random-number generators include Halton and Hammersley sequences in addition to a standard random-uniform generator.
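As a concrete illustration, here is a minimal Monte Carlo sketch using simulate; the program name onerep and the data-generating process are illustrative, not taken from the text above.

    * Sampling distribution of an OLS slope estimate (illustrative DGP)
    capture program drop onerep
    program define onerep, rclass
        drop _all
        set obs 100
        generate x = rnormal()
        generate y = 1 + 2*x + rnormal()    // true slope is 2
        regress y x
        return scalar b = _b[x]
    end

    simulate slope=r(b), reps(1000) seed(10101): onerep
    summarize slope                         // mean should be close to 2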

 

Methods for simulation-based estimation of parametric models were developed in the 1980s and 1990s, especially maximum simulated likelihood estimation (McFadden 1989; Pakes and Pollard 1989) and Bayesian Markov chain Monte Carlo methods (Geman and Geman 1984). These methods have enabled the estimation of increasingly complex parametric models. In empirical microeconometrics, these are most often limited dependent variable models such as the random parameters logit model. Furthermore, Bayesian methods are generally used merely as a tool; the results are still given a frequentist interpretation rather than a Bayesian interpretation.

 

Stata initially introduced simulation-based estimation methods in particular contexts, notably with the command asmprobit, which estimates the multinomial probit model using maximum simulated likelihood, and with multiple-imputation commands that use Markov chain Monte Carlo methods. User-written code has also provided Stata front ends to the Bayesian statistical packages WinBUGS (Thompson, Palmer, and Moreno 2006) and MLwiN (Leckie and Charlton 2013). The command bayesmh, introduced in Stata 14, may lead to much greater use of Bayesian methods.
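For example, here is a minimal sketch of bayesmh (Stata 14 or later) for a normal linear regression with weakly informative priors; the variable names y, x1, and x2 are placeholders and the prior choices are illustrative.

    * Bayesian linear regression via MCMC (placeholder variable names)
    bayesmh y x1 x2, likelihood(normal({sigma2}))    ///
        prior({y: x1 x2 _cons}, normal(0, 100))      ///
        prior({sigma2}, igamma(0.01, 0.01))          ///
        rseed(14)
    bayesstats summary        // posterior means, standard deviations, credible intervals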

 

Stata avoids use of simulation-based estimation methods whenever possible. In particular, complex parametric models are often difficult to estimate because of an intractable integral. For a one-dimensional integral, such as that in the linear random-effects model, it is standard to use Gaussian quadrature rather than simulation methods. For higher dimensional integrals of the multivariate normal that appear in mixed models, Stata commands mixed and gsem use adaptive multivariate Gaussian quadrature rather than simulation methods.
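As a sketch, a random-intercept logit model can be fit by adaptive Gaussian quadrature with gsem; the variable names y and x and the group identifier id are placeholders, and intpoints() simply raises the number of quadrature points above its default.

    * Random-intercept logit estimated by adaptive quadrature, not simulation
    gsem (y <- x M1[id], logit), intpoints(12)
    estat ic                  // AIC and BIC for the fitted model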

 

An alternative strand of research has developed methods to estimate regression models that rely on relatively weak distributional assumptions. The building block is nonparametric regression on a single regressor. Several methods have been proposed in the statistics literature, beginning with kernel regression in 1964, followed by lowess, local polynomial regression, wavelets, and splines. Stata initially provided lowess estimation. Local polynomial regression, which includes kernel regression and local linear regression as special cases, appeared as command lpoly in 2007. These nonparametric regression commands and the kernel density estimation command kdensity are especially valuable for viewing data and key statistical output such as residuals.
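A short illustration on the shipped auto dataset; the degree() choices are illustrative.

    * Nonparametric views of the data
    sysuse auto, clear
    kdensity price                        // kernel density estimate
    lpoly mpg weight, degree(0)           // local-constant (kernel) regression
    lpoly mpg weight, degree(1)           // local-linear regression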

 

The single-regressor nonparametric regression methods do not extend well to models with multiple regressors because of the curse of dimensionality. Econometricians have been at the forefront of developing semiparametric models that combine a high-dimensional parametric component with a low-dimensional (usually single-dimensional) nonparametric component. The late 1980s and early 1990s saw development of estimation methods for three commonly used models: the partial linear model, the single-index model, and generalized additive models. Semiparametric methods are particularly useful for limited dependent variable models with censoring and truncation because they enable crucial parametric assumptions on unobservables to be weakened; Pagan and Ullah (1999) provide a survey. These semiparametric methods generally require selection of smoothing parameters, sometimes with deliberate undersmoothing or oversmoothing. Perhaps this is why there are no official Stata commands for semiparametric regression, though there are some Stata add-ons for some specific estimators. The lack of semiparametric regression commands in Stata is one reason that semiparametric methods (a focus of recent theoretical econometrics research) are infrequently used in applied microeconometrics.

 

In addition to obtaining regression coefficients under minimal assumptions, the econometrics literature has developed methods for statistical inference under minimal assumptions. Heteroskedastic-robust standard errors were developed by White (1980, 1982) and introduced into Stata 3 in 1992. If model errors are clustered, then default and heteroskedastic-robust standard errors can be much too small. Extensions to cluster–robust inference were made by Liang and Zeger (1986) and Arellano (1987). Including cluster–robust standard errors (Rogers 1993) in basic Stata regression commands early on greatly increased Stata’s usage. Even though Stata is at the forefront in providing robust standard errors, the inclusion of a cluster–robust option for the more advanced estimation commands took considerable time.
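In practice these variance estimates are requested through the vce() option; a minimal sketch, with y, x, and the cluster identifier id as placeholders.

    * Default, heteroskedastic-robust, and cluster-robust standard errors
    regress y x                           // default standard errors assume i.i.d. errors
    regress y x, vce(robust)              // heteroskedastic-robust (White)
    regress y x, vce(cluster id)          // cluster-robust, clustering on id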

 

When standard errors (nonrobust or robust) are not available, they can be obtained by using an appropriate bootstrap. A bootstrap command appeared in Stata in 1991 with significant enhancement in 2003. The theoretical literature has emphasized a second use of the bootstrap, namely, bootstraps with asymptotic refinement that may lead to better finite-sample inference. These latter bootstraps are seldom used in practice; a notable exception is the wild-cluster bootstrap when there are few clusters (Cameron, Gelbach, and Miller 2008). Bootstraps with refinement can also be implemented in Stata: bias-corrected confidence intervals are available as a bootstrap option, and other methods require some additional coding.
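A sketch of bootstrap standard errors and refined (bias-corrected) confidence intervals on the shipped auto dataset; the number of replications is illustrative.

    * Bootstrap standard errors and BC/BCa confidence intervals
    sysuse auto, clear
    bootstrap _b, reps(999) seed(10) bca: regress mpg weight
    estat bootstrap, bc bca percentile    // compare percentile, BC, and BCa intervals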

 

A distinguishing feature of econometrics is the desire to make causal inference from observational data. Instrumental-variables estimation and its extension to GMM were the dominant methods when Stata was introduced. Articles by Nelson and Startz (1990) and Bound, Jaeger, and Baker (1995) highlighted the need for alternative inference methods when instruments are weak. Recent results on weak-instrument asymptotics for linear models with model errors that are not independent and identically distributed, the usual case in empirical microeconomics studies, are implemented in the Stata add-ons ivreg2 (Baum, Schaffer, and Stillman 2007) and weakiv (Finlay, Magnusson, and Schaffer 2013).
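A sketch of two-stage least squares with robust standard errors and standard first-stage diagnostics using the official ivregress command; y, x, z1, and z2 are placeholders, and the add-ons named above provide further weak-instrument-robust tests.

    * 2SLS with one endogenous regressor and two instruments (placeholders)
    ivregress 2sls y (x = z1 z2), vce(robust)
    estat firststage                      // first-stage F statistic and partial R-squared
    estat overid                          // test of the overidentifying restriction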

 

A major change in causal microeconometrics research is the use of the potential outcomes framework of Rubin (1974) that has evolved into the quasi-experimental or treatment-effects literature, summarized in Angrist and Pischke (2009). Matching methods such as propensity-score matching (Rosenbaum and Rubin 1983) or inverse-probability weighting can be used when selection is on observables only. A Stata command to implement these methods (teffects) was introduced in 2013 and superseded earlier user-written add-ons. When selection is also on unobservables, most methods can be implemented using existing Stata commands. These methods include local average treatment-effects estimation (Imbens and Angrist 1994), a reinterpretation of instrumental variables when treatment effects are heterogeneous; fixed-effects panel models and their extension to differences-in-differences with repeated cross-section data; sample-selection models; and regression discontinuity design. For dynamic linear panel models with fixed effects, the methods of Arellano and Bond (1991) and extensions can be implemented using the official Stata command xtabond and the user-written add-on xtabond2 (Roodman 2009).
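A sketch under selection on observables, with y the outcome, d a binary treatment, and x1, x2, id, and year as placeholder variables; the dynamic panel example follows the Arellano and Bond approach named above.

    * Average treatment effects under selection on observables
    teffects psmatch (y) (d x1 x2)        // propensity-score matching
    teffects ipw (y) (d x1 x2, logit)     // inverse-probability weighting

    * Dynamic linear panel model with fixed effects (Arellano-Bond)
    xtset id year
    xtabond y x1, lags(1) vce(robust)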

 

Methods for spatially correlated data have been progressively developed over the past 30 years. Currently there are no official Stata commands for spatial regression, but there are several user-written Stata add-ons that handle and analyze spatial data, including the spatial regression command sppack (Drukker et al. 2011).

 

Researchers in biostatistics and in social sciences other than economics, who are also Stata users, use some regression methods that are not often used in empirical microeconometrics. Generalized linear models (command glm) and generalized estimating equations (xtgee) cover many nonlinear regression models, including those with binary or count dependent variables. Mixed models or hierarchical models (command mixed) can lead to more precise estimation than a simple random-effects model can when model errors are clustered. Other social sciences make greater use of completely specified structural models (commands sem and gsem).
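For instance, a minimal sketch of a count-data regression via glm and a panel analogue via xtgee; the variable names are placeholders.

    * Poisson regression for a count outcome (placeholder variable names)
    glm y x1 x2, family(poisson) link(log) vce(robust)

    * Population-averaged panel counterpart via generalized estimating equations
    xtset id year
    xtgee y x1 x2, family(poisson) link(log) corr(exchangeable) vce(robust)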

 

Empirical research and Stata

 

There is more to empirical research than obtaining parameter estimates and their standard errors.

 

The first step of empirical research is to simply analyze and view the data ahead of any regression analysis. Useful graphical methods are kernel density estimates and twoway scatterplots with a fitted nonparametric regression curve. Stata introduced a very rich publication-quality graphics package in 2003. Interpreting the sources of variation in grouped data is simplified by using the statsby command and xt commands such as xtsum, xttab, and xtdescribe.
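A sketch of these first looks at the data; y, x, id, and year are placeholders.

    * Graphical and panel summaries before any regression
    kdensity y
    twoway (scatter y x) (lowess y x)     // scatterplot with a fitted nonparametric curve
    xtset id year
    xtsum y x                             // overall, between, and within variation
    xtdescribe                            // participation pattern of the panel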

 

Model diagnostics and specification tests can be useful. Applied microeconometrics studies tend not to use available methods that can detect outlying observations and influential observations. This is in part due to concerns about subsequently overfitting a model, though such diagnostics can also highlight mistakes such as miscoded data. Available model specification tests are infrequently used, notable exceptions being Hausman tests and tests of overidentifying restrictions. Stata postestimation commands include these standard methods, and they also enable in-sample and out-of-sample prediction.
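As an example of the specification tests that are used, here is a sketch of a Hausman test of fixed versus random effects, followed by in-sample prediction; variable names are placeholders.

    * Hausman test comparing fixed-effects and random-effects estimates
    xtset id year
    xtreg y x, fe
    estimates store fe
    xtreg y x, re
    estimates store re
    hausman fe re
    predict yhat, xb                      // in-sample linear prediction from the last model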

 

Many applied studies in microeconometrics seek to estimate a marginal effect, such as the increase in earnings with one more year of schooling, rather than a regression model parameter per se. Marginal effects and their associated standard errors can be computed using the margins command introduced in 2009 that supplanted the user-written command margeff (Bartus 2005). Factor variables, also introduced in 2009, enable extension to models with interacted regressors.
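A sketch using the schooling example above; the variable names lnearn, school, and female are placeholders, and the quadratic in schooling simply illustrates factor-variable notation.

    * Average marginal effect of schooling on log earnings
    regress lnearn c.school##c.school i.female, vce(robust)
    margins, dydx(school)                          // averaged over the sample
    margins, dydx(school) at(school = (8 12 16))   // evaluated at chosen schooling levels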

 

Empirical microeconomics studies are increasingly based on data sources that are very complex. Complications include: 1) data may come from several different sources; 2) data may come from surveys; 3) data may have a grouped structure such as panel data or individual-level data from several villages; and 4) some data may be missing.

 

A real contribution of Stata has been its ability to handle these complications. Stata is a data-management package, in addition to a statistical package, with features including the ability to handle string variables and commands to merge and append datasets. The Stata survey commands control for weighting, clustering, and stratification. Empirical microeconometrics studies generally do not use the survey commands. Instead, regular estimation commands are used with weights, if necessary, and with appropriate cluster–robust standard errors. Stratification is ignored, with some potential loss in estimator efficiency. Grouped data can be manipulated using the by prefix command and the reshape command. Stata's estimation commands automatically handle missing data using case-deletion. If case-deletion is not valid, then weighted regression can be used if weights are available. Alternatively, one can use the Stata multiple-imputation command introduced in 2009. In practice, empirical economics researchers currently rely on case-deletion or on crude imputation methods such as hot-deck imputation, despite their limitations.
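A sketch of these data-management steps and of multiple imputation; all file and variable names (persons, households, hhid, pid, inc19xx, and so on) are placeholders.

    * Combine person-level and household-level files, then reshape to long form
    use persons, clear
    merge m:1 hhid using households       // many persons per household
    keep if _merge == 3
    reshape long inc, i(pid) j(year)      // inc1990 inc1991 ... become inc by year

    * Multiple imputation as an alternative to case-deletion
    mi set mlong
    mi register imputed x1
    mi impute regress x1 x2 x3, add(20)
    mi estimate: regress y x1 x2 x3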

 

Stata was initially limited in the size of dataset it could handle because it requires that all data be stored in memory to speed up computations. This limitation has greatly diminished over time given increases in computer memory capacity and the emergence of 64-bit PCs.

 

As empirical studies have become more complex, the need for replicability has increased. Researchers need to be able to keep track of their own work, return to it after leaving it for a considerable period of time, and potentially coordinate computations with coauthors, research assistants, and students. Furthermore, several leading journals require that data and programs be posted at their archives. Stata is well suited for producing replicable studies because it is command driven, and the resulting Stata scripts can be run on a wide range of platforms and on newer versions of Stata. As is clear from the previous section, it can take considerable time before a new method is included in a statistical package such as Stata. It is, therefore, advantageous to use software that is programmable. Stata has always been programmable, and it includes the complete matrix programming language Mata, which was introduced in 2005.
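A minimal sketch of a self-contained, replicable do-file; the file names analysis.do and mydata and the version number are placeholders.

    * analysis.do -- a reproducible script
    version 14
    set more off
    set seed 12345
    log using analysis, replace text
    use mydata, clear
    regress y x, vce(robust)
    log close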

 

The widespread use of Stata has created a community of users. Stata encourages this community through the Stata Technical Bulletin (which began in 1990 and was superseded by the Stata Journal in 2001), the Statalist listserver (1994), Stata Users Group meetings (1995), the Stata website (1996), and Stata Press books (1999). For basic applied microeconometrics, the books by Baum (2006), Cameron and Trivedi (2010), and Mitchell (2012) are especially helpful. The websites for introductory econometrics texts provide code for analysis in Stata. The Statistical Software Components (1997) website provides many Stata user-written programs that can be directly downloaded to Stata. As already noted, Stata users have provided many useful add-on programs. While some have been superseded by official Stata commands, many still fill gaps or augment official Stata commands.

 

As with any statistical package, the ubiquity of Stata also has downsides. Data analyses may be restricted to what is easily implemented in Stata. Researchers may not understand the limitations of the methods used, such as tobit model estimates relying on very strong parametric assumptions. Also, Stata may eventually become legacy software, though one with a very large user base. To date, Stata has avoided this by continuing to target academic researchers in economics, other social sciences, and biostatistics.


Recommended reading:

0. Release of various spatial weight matrix data for all Chinese prefecture-level cities

1. Complete programs and data for the 160 steps of matching the China Industrial Enterprises Database

2. Release of annual average PM2.5 data for Chinese prefecture-level cities, 1998-2016

3. Release of the authoritative version of China's marketization index, 1997-2014

4. Circulation of China's CO2 data by province and industry, 2005-2015

5. A practical guide to matching methods: 16 articles worth keeping

6. A practical guide to endogeneity problems: 22 widely circulated articles

7. A practical guide to panel data models: 16 must-read articles

8. 135 articles used in empirical research: a toolkit commonly used by social scientists


