# Machine learning methods are now appearing in top journals such as AER, JPE, and QJE!



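A few days ago we recommended **①** how to choose the right independent (control) variables so that your econometric model is no longer "dirty", **②** why ignoring interaction effects has serious consequences and angers referees, **③** a roadmap of the "highlight moments" of RCT, DID, RDD, LE, ML, DSGE, and other methods over the past thirty years, and **④** a collection of the latest empirical papers on spatial difference-in-differences (spatial DID), all of which prompted wide discussion among scholars. Today we turn to examples of machine learning applied in empirical research. **Machine learning methods are gradually appearing in top economics journals such as AER, JPE, and QJE.** To give a clearer picture of the latest applications abroad, we have selected the following 30 papers. Scholars interested in machine learning methods are welcome to join the community discussion. Frequently discussed topics include **①** the steps, tools, approaches, and visualization of text analysis, **②** a comprehensive literature review of big-data text analysis in economics and finance, and **③** text functions and regular expressions for fine-grained text analysis.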

Bandiera, O., et al. (2019). "CEO Behavior and Firm Performance." Journal of Political Economy: 000-000.

**CEO Behavior and Firm Performance**

We develop a new method to measure CEO behavior in large samples via a survey that collects high-frequency, high-dimensional diary data and a machine learning algorithm that estimates behavioral types. Applying this method to 1,114 CEOs in six countries reveals two types: “leaders,” who do multifunction, high-level meetings, and “managers,” who do individual meetings with core functions. Firms that hire leaders perform better, and it takes three years for a new CEO to make a difference. Structural estimates indicate that productivity differentials are due to mismatches rather than to leaders being better for all firms.

Kleinberg, J., et al. (2017). "Human Decisions and Machine Predictions." The Quarterly Journal of Economics **133**(1): 237-293.

**Human Decisions and Machine Predictions**

Can machine learning improve human decision making? Bail decisions provide a good test case. Millions of times each year, judges make jail-or-release decisions that hinge on a prediction of what a defendant would do if released. The concreteness of the prediction task combined with the volume of data available makes this a promising machine-learning application. Yet comparing the algorithm to judges proves complicated. First, the available data are generated by prior judge decisions. We only observe crime outcomes for released defendants, not for those judges detained. This makes it hard to evaluate counterfactual decision rules based on algorithmic predictions. Second, judges may have a broader set of preferences than the variable the algorithm predicts; for instance, judges may care specifically about violent crimes or about racial inequities. We deal with these problems using different econometric strategies, such as quasi-random assignment of cases to judges. Even accounting for these concerns, our results suggest potentially large welfare gains: one policy simulation shows crime reductions up to 24.7% with no change in jailing rates, or jailing rate reductions up to 41.9% with no increase in crime rates. Moreover, all categories of crime, including violent crimes, show reductions; these gains can be achieved while simultaneously reducing racial disparities. These results suggest that while machine learning can be valuable, realizing this value requires integrating these tools into an economic framework: being clear about the link between predictions and decisions; specifying the scope of payoff functions; and constructing unbiased decision counterfactuals.

Kleinberg, J., et al. (2015). "Prediction Policy Problems." American Economic Review **105**(5): 491-495.

**Prediction Policy Problems**

Most empirical policy work focuses on causal inference. We argue an important class of policy problems does not require causal inference but instead requires predictive inference. Solving these "prediction policy problems" requires more than simple regression techniques, since these are tuned to generating unbiased estimates of coefficients rather than minimizing prediction error. We argue that new developments in the field of "machine learning" are particularly useful for addressing these prediction problems. We use an example from health policy to illustrate the large potential social welfare gains from improved prediction.

Abadie, A. and M. Kasy (2018). "Choosing among Regularized Estimators in Empirical Economics: The Risk of Machine Learning." The Review of Economics and Statistics **101**(5): 743-762.

**Choosing among Regularized Estimators in Empirical Economics: The Risk of Machine Learning**

Many settings in empirical economics involve estimation of a large number of parameters. In such settings, methods that combine regularized estimation and data-driven choices of regularization parameters are useful. We provide guidance to applied researchers on the choice between regularized estimators and data-driven selection of regularization parameters. We characterize the risk and relative performance of regularized estimators as a function of the data-generating process and show that data-driven choices of regularization parameters yield estimators with risk uniformly close to the risk attained under the optimal (unfeasible) choice of regularization parameters. We illustrate using examples from empirical economics.
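To make "data-driven choices of regularization parameters" concrete, here is a minimal sketch that tunes a ridge-type shrinkage estimator by minimizing Stein's unbiased risk estimate (SURE) on a grid. The normal-means setup with unit noise variance, the simulated data, and every function name are illustrative assumptions, not the authors' code.

```python
import random

def sure_ridge(x, lam, sigma2=1.0):
    """Stein's unbiased risk estimate (per observation) for the ridge
    estimator m(x_i) = x_i / (1 + lam) in the normal-means model."""
    c = lam / (1.0 + lam)                      # fraction shrunk away from x_i
    mean_sq = sum(v * v for v in x) / len(x)
    return sigma2 + c * c * mean_sq - 2.0 * sigma2 * c

def choose_lambda(x, grid):
    """Pick the regularization parameter on a grid by minimizing SURE."""
    return min(grid, key=lambda lam: sure_ridge(x, lam))

# Simulated data (assumption): many parameters, one noisy observation each.
random.seed(0)
true_means = [random.gauss(0.0, 2.0) for _ in range(500)]
x = [m + random.gauss(0.0, 1.0) for m in true_means]    # x_i ~ N(mu_i, 1)

grid = [i / 100.0 for i in range(0, 501)]               # lambda in [0, 5]
lam_hat = choose_lambda(x, grid)
estimates = [v / (1.0 + lam_hat) for v in x]            # shrunken estimates
print("chosen lambda:", lam_hat)
```

In this simple model the SURE minimizer has a closed form, `1 / (mean(x**2) - 1)`, so the grid search is only there to mirror how tuning is usually done when no closed form exists.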

Arribas-Bel, D., et al. (2019). "Building(s and) cities: Delineating urban areas with a machine learning algorithm." Journal of Urban Economics: 103217.

**Building(s and) Cities: Delineating Urban Areas with a Machine Learning Algorithm**

This paper proposes a novel methodology for delineating urban areas based on a machine learning algorithm that groups buildings within portions of space of sufficient density. To do so, we use the precise geolocation of all 12 million buildings in Spain. We exploit building heights to create a new dimension for urban areas, namely, the vertical land, which provides a more accurate measure of their size. To better understand their internal structure and to illustrate an additional use for our algorithm, we also identify employment centers within the delineated urban areas. We test the robustness of our method and compare our urban areas to other delineations obtained using administrative borders and commuting-based patterns. We show that: 1) our urban areas are more similar to the commuting-based delineations than the administrative boundaries but that they are more precisely measured; 2) when analyzing the urban areas’ size distribution, Zipf’s law appears to hold for their population, surface and vertical land; and 3) the impact of transportation improvements on the size of the urban areas is not underestimated.
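The idea of grouping buildings "within portions of space of sufficient density" can be illustrated with a density-based clustering routine. The sketch below uses a plain DBSCAN implementation as a stand-in; the abstract does not name the algorithm, so the choice of DBSCAN, the parameters `eps` and `min_pts`, and the toy coordinates are all assumptions.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster += 1                     # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # j is also a core point: keep growing
                queue.extend(nbrs)
    return labels

# Hypothetical building coordinates: two dense blocks and one isolated building.
buildings = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11),
             (5, 5)]
labels = dbscan(buildings, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Each cluster id plays the role of a candidate urban area, and the isolated building is left unassigned.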

Athey, S. (2017). "Beyond prediction: Using big data for policy problems." Science **355**(6324): 483.

**Beyond Prediction: Using Big Data for Policy Problems**

Machine-learning prediction methods have been extremely productive in applications ranging from medicine to allocating fire and health inspectors in cities. However, there are a number of gaps between making a prediction and making a decision, and underlying assumptions need to be understood in order to optimize data-driven decision-making.

Athey, S. and G. W. Imbens (2017). "The State of Applied Econometrics: Causality and Policy Evaluation." Journal of Economic Perspectives **31**(2): 3-32.

**The State of Applied Econometrics: Causality and Policy Evaluation**

In this paper, we discuss recent developments in econometrics that we view as important for empirical researchers working on policy evaluation questions. We focus on three main areas, in each case, highlighting recommendations for applied work. First, we discuss new research on identification strategies in program evaluation, with particular focus on synthetic control methods, regression discontinuity, external validity, and the causal interpretation of regression methods. Second, we discuss various forms of supplementary analyses, including placebo analyses as well as sensitivity and robustness analyses, intended to make the identification strategies more credible. Third, we discuss some implications of recent advances in machine learning methods for causal effects, including methods to adjust for differences between treated and control units in high-dimensional settings, and methods for identifying and estimating heterogenous treatment effects.

Athey, S. and G. W. Imbens (2019). "Machine Learning Methods That Economists Should Know About." Annual Review of Economics **11**(1): 685-725.

**Machine Learning Methods That Economists Should Know About**

We discuss the relevance of the recent machine learning (ML) literature for economics and econometrics. First we discuss the differences in goals, methods, and settings between the ML literature and the traditional econometrics and statistics literatures. Then we discuss some specific methods from the ML literature that we view as important for empirical researchers in economics. These include supervised learning methods for regression and classification, unsupervised learning methods, and matrix completion methods. Finally, we highlight newly developed methods at the intersection of ML and econometrics that typically perform better than either off-the-shelf ML or more traditional econometric methods when applied to particular classes of problems, including causal inference for average treatment effects, optimal policy estimation, and estimation of the counterfactual effect of price changes in consumer choice models.

Athey, S., et al. (2019). "Generalized random forests." Ann. Statist. **47**(2): 1148-1178.

**Generalized Random Forests**

We propose generalized random forests, a method for nonparametric statistical estimation based on random forests (Breiman [Mach. Learn. 45 (2001) 5-32]) that can be used to fit any quantity of interest identified as the solution to a set of local moment equations. Following the literature on local maximum likelihood estimation, our method considers a weighted set of nearby training examples; however, instead of using classical kernel weighting functions that are prone to a strong curse of dimensionality, we use an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest. We propose a flexible, computationally efficient algorithm for growing generalized random forests, develop a large sample theory for our method showing that our estimates are consistent and asymptotically Gaussian and provide an estimator for their asymptotic variance that enables valid confidence intervals. We use our approach to develop new methods for three statistical tasks: nonparametric quantile regression, conditional average partial effect estimation and heterogeneous treatment effect estimation via instrumental variables. A software implementation, grf for R and C++, is available from CRAN.
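The forest-based weighting described in this abstract can be written compactly. The notation below follows the usual presentation of the method and is not spelled out in the abstract, so treat the symbols as assumptions: $L_b(x)$ is the set of training points that share a leaf with $x$ in tree $b$, $O_i$ is the observed data for unit $i$, $\psi$ is the score of the local moment condition, and $\nu$ is an optional nuisance parameter.

$$
\alpha_i(x) = \frac{1}{B}\sum_{b=1}^{B} \frac{\mathbf{1}\{X_i \in L_b(x)\}}{|L_b(x)|},
\qquad
\big(\hat\theta(x), \hat\nu(x)\big) \in \arg\min_{\theta,\nu}
\left\| \sum_{i=1}^{n} \alpha_i(x)\, \psi_{\theta,\nu}(O_i) \right\|_2
$$

The weights $\alpha_i(x)$ take the place of classical kernel weights: a training example counts more for the target point $x$ the more often it lands in the same leaf as $x$ across the $B$ trees.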

Bajari, P., et al. (2015). "Machine Learning Methods for Demand Estimation." American Economic Review **105**(5): 481-485.

**Machine Learning Methods for Demand Estimation**

We survey and apply several techniques from the statistical and computer science literature to the problem of demand estimation. To improve out-of-sample prediction accuracy, we propose a method of combining the underlying models via linear regression. Our method is robust to a large number of regressors; scales easily to very large data sets; combines model selection and estimation; and can flexibly approximate arbitrary non-linear functions. We illustrate our method using a standard scanner panel data set and find that our estimates are considerably more accurate in out-of-sample predictions of demand than some commonly used alternatives.
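"Combining the underlying models via linear regression" amounts to regressing the outcome on each model's predictions and using the fitted coefficients as combination weights. The sketch below does this for two models with a no-intercept least-squares fit. The toy numbers and function names are hypothetical, and in practice the weights would be fit on held-out predictions rather than on the data used to train the underlying models.

```python
def fit_combination_weights(pred_a, pred_b, y):
    """Least-squares weights (w_a, w_b) for y ~ w_a*pred_a + w_b*pred_b."""
    saa = sum(a * a for a in pred_a)
    sbb = sum(b * b for b in pred_b)
    sab = sum(a * b for a, b in zip(pred_a, pred_b))
    say = sum(a * t for a, t in zip(pred_a, y))
    sby = sum(b * t for b, t in zip(pred_b, y))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("model predictions are collinear")
    # Solve the 2x2 normal equations by Cramer's rule.
    w_a = (say * sbb - sby * sab) / det
    w_b = (sby * saa - say * sab) / det
    return w_a, w_b

def mse(pred, y):
    """Mean squared prediction error."""
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

# Hypothetical validation-set demand and two models' predictions of it.
demand = [10.0, 12.0, 9.0, 15.0, 11.0]
model_a = [9.0, 13.0, 10.0, 14.0, 10.0]
model_b = [11.0, 11.0, 8.0, 16.0, 13.0]

w_a, w_b = fit_combination_weights(model_a, model_b, demand)
combined = [w_a * a + w_b * b for a, b in zip(model_a, model_b)]
print("weights:", w_a, w_b)
print("MSE a, b, combined:", mse(model_a, demand), mse(model_b, demand), mse(combined, demand))
```

On the data used to fit the weights, the combination can never do worse than either model alone, because putting all weight on one model is a special case of the regression.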

Beyca, O. F., et al. (2019). "Using machine learning tools for forecasting natural gas consumption in the province of Istanbul." Energy Economics **80**: 937-949.

**Using Machine Learning Tools for Forecasting Natural Gas Consumption in the Province of Istanbul**

Commensurate with unprecedented increases in energy demand, a well-constructed forecasting model is vital to managing energy policies effectively by providing energy diversity and energy requirements that adapt to the dynamic structure of the country. In this study, we employ three alternative popular machine learning tools for rigorous projection of natural gas consumption in the province of Istanbul, Turkey's largest natural gas-consuming mega-city. These tools include multiple linear regression (MLR), an artificial neural network approach (ANN) and support vector regression (SVR). The results indicate that the SVR is much superior to ANN technique, providing more reliable and accurate results in terms of lower prediction errors for time series forecasting of natural gas consumption. This study could well serve a useful benchmarking study for many emerging countries due to the data structure, consumption frequency, and consumption behavior of consumers in various time-periods.

Chalfin, A., et al. (2016). "Productivity and Selection of Human Capital with Machine Learning." American Economic Review **106**(5): 124-127.

**Productivity and Selection of Human Capital with Machine Learning**

Economists have become increasingly interested in studying the nature of production functions in social policy applications, with the goal of improving productivity. Traditionally models have assumed workers are homogenous inputs. However, in practice, substantial variability in productivity means the marginal productivity of labor depends substantially on which new workers are hired--which requires not an estimate of a causal effect, but rather a prediction. We demonstrate that there can be large social welfare gains from using machine learning tools to predict worker productivity, using data from two important applications - police hiring and teacher tenure decisions.

Deryugina, T., et al. (2019). "The Mortality and Medical Costs of Air Pollution: Evidence from Changes in Wind Direction." American Economic Review **109**(12): 4178-4219.

**The Mortality and Medical Costs of Air Pollution: Evidence from Changes in Wind Direction**

We estimate the causal effects of acute fine particulate matter exposure on mortality, health care use, and medical costs among the US elderly using Medicare data. We instrument for air pollution using changes in local wind direction and develop a new approach that uses machine learning to estimate the life-years lost due to pollution exposure. Finally, we characterize treatment effect heterogeneity using both life expectancy and generic machine learning inference. Both approaches find that mortality effects are concentrated in about 25 percent of the elderly population.

Erel, I., et al. (2018). "Selecting Directors Using Machine Learning." SSRN Electronic Journal.

**Selecting Directors Using Machine Learning**

Galdo, V., et al. (2019). "Identifying Urban Areas by Combining Human Judgment and Machine Learning: An Application to India." Journal of Urban Economics: 103229.

**Identifying Urban Areas by Combining Human Judgment and Machine Learning: An Application to India**

We propose a methodology for identifying urban areas that combines subjective assessments with machine learning, and we apply it to India, a country where several studies see the official urbanization rate as an under-estimate. For a representative sample of cities, towns and villages, as administratively defined, we rely on human judgment of Google images to determine whether they are urban or rural in practice. We collect judgments across four groups of assessors, differing in their familiarity with India and with urban issues, following two different protocols. We then combine the judgment-based classification with data from the population census and from satellite imagery to predict the urban status of the sample. The Logit model, and LASSO and random forests methods, are applied. These approaches are then used to decide whether each of the out-of-sample administrative units in India is urban or rural in practice. We do not find that India is substantially more urban than officially claimed. However, there are important differences at more disaggregated levels, with “other towns” and “census towns” being more rural, and some southern states more urban, than is officially claimed. The consistency of human judgment across assessors and protocols, the easy availability of crowd-sourcing, and the stability of predictions across approaches, suggest that the proposed methodology is a promising avenue for studying urban issues.

Ghoddusi, H., et al. (2019). "Machine learning in energy economics and finance: A review." Energy Economics **81**: 709-727.

**Machine Learning in Energy Economics and Finance: A Review**

Machine learning (ML) is generating new opportunities for innovative research in energy economics and finance. We critically review the burgeoning literature dedicated to Energy Economics/Finance applications of ML. Our review identifies applications in areas such as predicting energy prices (e.g. crude oil, natural gas, and power), demand forecasting, risk management, trading strategies, data processing, and analyzing macro/energy trends. We critically review the content (methods and findings) of more than 130 articles published between 2005 and 2018. Our analysis suggests that Support Vector Machine (SVM), Artificial Neural Network (ANN), and Genetic Algorithms (GAs) are among the most popular techniques used in energy economics papers. We discuss the achievements and limitations of existing literature. The survey concludes by identifying current gaps and offering some suggestions for future research.

Gründler, K. and T. Krieger (2016). "Democracy and growth: Evidence from a machine learning indicator." European Journal of Political Economy **45**: 85-107.

**Democracy and Growth: Evidence from a Machine Learning Indicator**

We present a novel approach for measuring democracy based on Support Vector Machines, a mathematical algorithm for pattern recognition. The Support Vector Machines Democracy Index (SVMDI) is continuous on the [0,1] interval and enables very detailed and sensitive measurement of democracy for 185 countries in the period between 1981 and 2011. Application of the SVMDI yields results which highlight a robust positive relationship between democracy and economic growth. We argue that the ambiguity in recent studies mainly originates from the lack of sensitivity of traditional democracy indicators. Analyzing transmission channels through which democracy exerts its influence on growth, we conclude that democratic countries feature better educated populations, higher investment shares, and lower fertility rates, but not necessarily higher levels of redistribution.

Gu, S., et al. (2020). "Empirical Asset Pricing via Machine Learning." The Review of Financial Studies.

**Empirical Asset Pricing via Machine Learning**

We perform a comparative analysis of machine learning methods for the canonical problem of empirical asset pricing: measuring asset risk premiums. We demonstrate large economic gains to investors using machine learning forecasts, in some cases doubling the performance of leading regression-based strategies from the literature. We identify the best-performing methods (trees and neural networks) and trace their predictive gains to allowing nonlinear predictor interactions missed by other methods. All methods agree on the same set of dominant predictive signals, a set that includes variations on momentum, liquidity, and volatility.

Handel, B. and J. Kolstad (2017). "Wearable Technologies and Health Behaviors: New Data and New Methods to Understand Population Health." American Economic Review **107**(5): 481-485.

**Wearable Technologies and Health Behaviors: New Data and New Methods to Understand Population Health**

We study a randomized control trial in a large employer population of access to "wearable" technologies and the associated planning and monitoring tools on improved health behaviors (sleep and exercise). Both ITT and IV estimates based on actual plan enrollment for the treatment group suggest statistically significant but economically small changes in behavior after three months. We then implement machine learning-based models to assess treatment effect heterogeneity. We find little evidence for heterogeneous treatment effects based on observables. We also present detailed data on sleep patterns underscoring the value of this new data source to researchers.
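The difference between the ITT and IV estimates mentioned in this abstract can be sketched with the simple Wald estimator: the intention-to-treat effect divided by the difference in enrollment rates between the assigned and non-assigned groups. The data and variable names below are made up for illustration and are not from the paper.

```python
def mean(values):
    return sum(values) / len(values)

def itt_and_wald(assigned, enrolled, outcome):
    """Return (ITT, first stage, Wald/IV estimate) for a binary random assignment."""
    y1 = [y for z, y in zip(assigned, outcome) if z == 1]
    y0 = [y for z, y in zip(assigned, outcome) if z == 0]
    d1 = [d for z, d in zip(assigned, enrolled) if z == 1]
    d0 = [d for z, d in zip(assigned, enrolled) if z == 0]
    itt = mean(y1) - mean(y0)            # effect of being offered the program
    first_stage = mean(d1) - mean(d0)    # effect of the offer on enrollment
    if first_stage == 0:
        raise ValueError("assignment does not move enrollment")
    return itt, first_stage, itt / first_stage

# Hypothetical data: 4 employees offered a wearable, 4 not; half of those
# offered actually enroll; outcome is hours of sleep per night.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
enrolled = [1, 1, 0, 0, 0, 0, 0, 0]
sleep    = [8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0]

itt, first_stage, wald = itt_and_wald(assigned, enrolled, sleep)
print(itt, first_stage, wald)  # 0.5 0.5 1.0
```

The ITT (0.5 hours) averages over everyone offered the device, while the Wald estimate (1.0 hour) rescales it to the employees whose enrollment was actually changed by the offer.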

Kasy, M. (2018). "Optimal taxation and insurance using machine learning — Sufficient statistics and beyond." Journal of Public Economics **167**: 205-219.

**Optimal Taxation and Insurance Using Machine Learning: Sufficient Statistics and Beyond**

How should one use (quasi-)experimental evidence when choosing policies such as tax rates, health insurance copay, unemployment benefit levels, and class sizes in schools? This paper suggests an approach based on maximizing posterior expected social welfare, combining insights from (i) optimal policy theory as developed in the field of public finance, and (ii) machine learning using Gaussian process priors. We provide explicit formulas for posterior expected social welfare and optimal policies in a wide class of policy problems. The proposed methods are applied to the choice of coinsurance rates in health insurance, using data from the RAND health insurance experiment. The key trade-off in this setting is between transfers toward the sick and insurance costs. The key empirical relationship the policy maker needs to learn about is the response of health care expenditures to coinsurance rates. Holding the economic model and distributive preferences constant, we obtain much smaller point estimates of the optimal coinsurance rate (18% vs. 50%) when applying our estimation method instead of the conventional “sufficient statistic” approach.

McBride, L. and A. Nichols (2016). "Retooling Poverty Targeting Using Out-of-Sample Validation and Machine Learning." The World Bank Economic Review **32**(3): 531-550.

**Retooling Poverty Targeting Using Out-of-Sample Validation and Machine Learning**

Proxy means test (PMT) poverty targeting tools have become common tools for beneficiary targeting and poverty assessment where full means tests are costly. Currently popular estimation procedures for generating these tools prioritize minimization of in-sample prediction errors; however, the objective in generating such tools is out-of-sample prediction. We present evidence that prioritizing minimal out-of-sample error, identified through cross-validation and stochastic ensemble methods, in PMT tool development can substantially improve the out-of-sample performance of these targeting tools. We take the United States Agency for International Development (USAID) poverty assessment tool and base data for demonstration of these methods; however, the methods applied in this paper should be considered for PMT and other poverty-targeting tool development more broadly.
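The contrast between in-sample and out-of-sample error that motivates this paper can be shown with a small k-fold cross-validation sketch. The one-regressor linear model, the fold assignment rule, and the toy data are illustrative assumptions and are much simpler than an actual PMT tool.

```python
def fit_line(xs, ys):
    """OLS intercept and slope for a single regressor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def kfold_mse(xs, ys, k):
    """Average squared prediction error on held-out folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        total += sum((ys[i] - (a + b * xs[i])) ** 2 for i in test_idx)
    return total / n

# Hypothetical proxy (x) and household consumption (y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

a, b = fit_line(xs, ys)
in_sample_mse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
cv_mse = kfold_mse(xs, ys, k=5)
print("in-sample MSE:", in_sample_mse, "5-fold CV MSE:", cv_mse)
```

For least squares the cross-validated error is never below the in-sample error, which is why selecting a tool on in-sample fit alone tends to overstate how well it will target new households.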

Mittal, M., et al. (2019). "Monitoring the Impact of Economic Crisis on Crime in India Using Machine Learning." Computational Economics **53**(4): 1467-1485.

**Monitoring the Impact of Economic Crisis on Crime in India Using Machine Learning**

Trends of crimes in India keep changing with the growing population and rapid development of towns and cities. The rise in crimes at any place especially crimes against women, children and weaker sections of the society is a worrying factor for the Indian Government. In India, the crime data is maintained by National Crime Records Bureau as well as an application called Crime Criminal Information System is available to make inquiry and generate reports for the crime data. To curb crime, the Police need countless hours to go through the crime data and determine the various factors that affect it. Therefore, there is a necessity for tools which can automatically predict the factors that affect the crimes effectively and efficiently. The field of machine learning has emerged in the recent years for this purpose. In this paper, various machine learning techniques have been applied on crime data to monitor the impact of economic crisis on the crime in India. The effect of unemployment rates and Gross District Domestic Product on theft, robbery and burglary has been monitored across districts of various states in India. Further, Granger causality between crime rates and economic indicators has also been calculated. It has been observed from the experimental work that unemployment rate is the major economic factor which affects the crime rate, thus paving the path to control the crime rate by raising more opportunities for the employment.

Plakandaras, V., et al. (2015). "Forecasting Daily and Monthly Exchange Rates with Machine Learning Techniques." Journal of Forecasting **34**(7): 560-573.

**Forecasting Daily and Monthly Exchange Rates with Machine Learning Techniques**

In this paper we propose and test a forecasting model on monthly and daily spot prices of five selected exchange rates. In doing so, we combine a novel smoothing technique (initially applied in signal processing) with a variable selection methodology and two regression estimation methodologies from the field of machine learning (ML). After the decomposition of the original exchange rate series using an ensemble empirical mode decomposition (EEMD) method into a smoothed and a fluctuation component, multivariate adaptive regression splines (MARS) are used to select the most appropriate variable set from a large set of explanatory variables that we collected. The selected variables are then fed into two distinctive support vector machines (SVR) models that produce one-period-ahead forecasts for the two components. Neural networks (NN) are also considered as an alternative to SVR. The sum of the two forecast components is the final forecast of the proposed scheme. We show that the above implementation exhibits a superior in-sample and out-of-sample forecasting ability when compared to alternative forecasting models. The empirical results provide evidence against the efficient market hypothesis for the selected foreign exchange markets.

Saavedra, M. and T. Twinam (2020). "A machine learning approach to improving occupational income scores." Explorations in Economic History **75**: 101304.

**A Machine Learning Approach to Improving Occupational Income Scores**

Historical studies of labor markets frequently lack data on individual income. The occupational income score (OCCSCORE) is often used as an alternative measure of labor market outcomes. We consider the consequences of using OCCSCORE when researchers are interested in earnings regressions. We estimate race and gender earnings gaps in modern decennial Censuses as well as the 1915 Iowa State Census. Using OCCSCORE biases results towards zero and can result in estimated gaps of the wrong sign. We use a machine learning approach to construct a new adjusted score based on industry, occupation, and demographics. The new income score provides estimates closer to earnings regressions. Lastly, we consider the consequences for estimates of intergenerational mobility elasticities.

Sabahi, S. and M. M. Parast (2020). "The impact of entrepreneurship orientation on project performance: A machine learning approach." International Journal of Production Economics: 107621.

**The Impact of Entrepreneurship Orientation on Project Performance: A Machine Learning Approach**

Recent studies in project management have shown the important role of entrepreneurship orientation of the individuals in project performance. Although identifying the role of entrepreneurship orientation as a critical success factor in project performance has been considered as an important issue, it is also important to develop a measurement system for predicting performance based on the degree of an individual's entrepreneurial orientation. In this study, we use predictive analytics by proposing a machine learning approach to predict individuals' project performance based on measures of several aspects of entrepreneurial orientation and entrepreneurial attitude of the individuals. We investigated this relationship using a sample of 185 observations and a range of machine learning algorithms including lasso, ridge, support vector machines, neural networks, and random forest. Our results showed that the best method for predicting project performance is lasso. After identifying the best predictive model, we then used the Bayesian Information Criterion and the Akaike Information Criterion to identify the most significant factors. Our results identify all three aspects of entrepreneurial attitude (social self-efficacy, appearance self-efficacy, and comparativeness) and one aspect of entrepreneurial orientation (proactiveness) as the most important factors. This study contributes to the relationship between entrepreneurship skills and project performance and provides insights into the application of emerging tools in data science and machine learning in operations management and project management research.

Athey, S. and G. W. Imbens (2017). Chapter 3 - The Econometrics of Randomized Experiments. Handbook of Economic Field Experiments. A. V. Banerjee and E. Duflo, North-Holland. **1**: 73-140.

**The Econometrics of Randomized Experiments**

In this chapter, we present econometric and statistical methods for analyzing randomized experiments. For basic experiments, we stress randomization-based inference as opposed to sampling-based inference. In randomization-based inference, uncertainty in estimates arises naturally from the random assignment of the treatments, rather than from hypothesized sampling from a large population. We show how this perspective relates to regression analyses for randomized experiments. We discuss the analyses of stratified, paired, and clustered randomized experiments, and we stress the general efficiency gains from stratification. We also discuss complications in randomized experiments such as noncompliance. In the presence of noncompliance, we contrast intention-to-treat analyses with instrumental variables analyses allowing for general treatment effect heterogeneity. We consider, in detail, estimation and inference for heterogenous treatment effects in settings with (possibly many) covariates. These methods allow researchers to explore heterogeneity by identifying subpopulations with different treatment effects while maintaining the ability to construct valid confidence intervals. We also discuss optimal assignment to treatment based on covariates in such settings. Finally, we discuss estimation and inference in experiments in settings with interactions between units, both in general network settings and in settings where the population is partitioned into groups with all interactions contained within these groups.
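Randomization-based inference can be illustrated with an exact Fisher-style test: under the sharp null of no effect for any unit, the outcomes are fixed and only the assignment is random, so the p-value is the share of possible assignments that produce a difference in means at least as large as the one observed. The sketch below enumerates all assignments for a tiny, made-up experiment; the function name and data are illustrative assumptions.

```python
from itertools import combinations

def randomization_p_value(outcomes, treated_idx):
    """Exact two-sided p-value under the sharp null of no treatment effect."""
    n, m = len(outcomes), len(treated_idx)

    def diff_in_means(treated):
        treated = set(treated)
        t = [outcomes[i] for i in treated]
        c = [outcomes[i] for i in range(n) if i not in treated]
        return sum(t) / len(t) - sum(c) / len(c)

    observed = abs(diff_in_means(treated_idx))
    # Re-compute the statistic for every way of choosing m treated units.
    draws = [abs(diff_in_means(c)) for c in combinations(range(n), m)]
    return sum(d >= observed - 1e-12 for d in draws) / len(draws)

# Hypothetical experiment: units 0-2 were treated, units 3-5 were controls.
outcomes = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]
p_value = randomization_p_value(outcomes, treated_idx=[0, 1, 2])
print(p_value)  # 0.1
```

With six units and three treated there are 20 possible assignments, and only two of them give an absolute difference as large as the observed value of 4, hence the p-value of 0.1. Full enumeration is only feasible for small samples; larger experiments would typically sample assignments at random.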

Saltzman, B. and J. Yung (2018). "A machine learning approach to identifying different types of uncertainty." Economics Letters **171**: 58-62.

**A Machine Learning Approach to Identifying Different Types of Uncertainty**

We implement natural language processing techniques to extract uncertainty measures from Federal Reserve Beige Books between 1970 and 2018. Business and economic related uncertainty is associated with future weakness in output, higher unemployment, and elevated term premia. On the other hand, political and government uncertainty, while high during recent times, has no statistically significant impact on the economy.
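As a rough illustration of turning report text into separate uncertainty measures, the sketch below counts sentences that mention uncertainty and assigns them to an economic or a political category using small keyword lists. This is a deliberately crude dictionary rule, not the authors' method: the keyword sets, the sentence splitter, and the sample text are all assumptions.

```python
import re

# Hypothetical keyword lists; a real application would need far richer ones.
ECONOMIC_TERMS = {"demand", "sales", "prices", "business", "economic"}
POLITICAL_TERMS = {"government", "policy", "election", "congress", "fiscal"}

def uncertainty_shares(text):
    """Share of sentences mentioning uncertainty, split by a crude topic rule."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    counts = {"economic": 0, "political": 0}
    if not sentences:
        return {k: 0.0 for k in counts}
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence))
        if not any(w.startswith("uncertain") for w in words):
            continue                       # sentence does not mention uncertainty
        if words & POLITICAL_TERMS:
            counts["political"] += 1
        elif words & ECONOMIC_TERMS:
            counts["economic"] += 1
    return {k: v / len(sentences) for k, v in counts.items()}

sample = ("Retailers reported uncertainty about consumer demand. "
          "Contacts cited uncertain fiscal policy in Congress. "
          "Manufacturing activity expanded modestly. Hiring was steady.")
print(uncertainty_shares(sample))  # {'economic': 0.25, 'political': 0.25}
```

Computing these shares report by report would give two time series that could then be related to output, unemployment, or term premia.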

Sansone, D. (2019). "Beyond Early Warning Indicators: High School Dropout and Machine Learning." Oxford Bulletin of Economics and Statistics **81**(2): 456-485.

**Beyond Early Warning Indicators: High School Dropout and Machine Learning**

This paper combines machine learning with economic theory in order to analyse high school dropout. It provides an algorithm to predict which students are going to drop out of high school by relying only on information from 9th grade. This analysis emphasizes that using a parsimonious early warning system, as implemented in many schools, leads to poor results. It shows that schools can obtain more precise predictions by exploiting the available high-dimensional data jointly with machine learning tools such as Support Vector Machine, Boosted Regression and Post-LASSO. Goodness-of-fit criteria are selected based on the context and the underlying theoretical framework: model parameters are calibrated by taking into account the policy goal (minimizing the expected dropout rate) and the school budget constraint. Finally, this study verifies the existence of heterogeneity through unsupervised machine learning by dividing students at risk of dropping out into different clusters.
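One way to read the budget constraint is that a school can only support a fixed number of students, so predictions are judged by how many eventual dropouts fall among the students it can afford to flag. The sketch below illustrates that reading with made-up risk scores; the function names and the recall-at-budget criterion are assumptions for illustration, not the paper's exact calibration procedure.

```python
def flag_within_budget(risk_scores, budget):
    """Indices of the `budget` students with the highest predicted risk."""
    ranked = sorted(
        range(len(risk_scores)), key=lambda i: risk_scores[i], reverse=True
    )
    return set(ranked[: max(budget, 0)])


def recall_at_budget(risk_scores, dropped_out, budget):
    """Share of actual dropouts that were flagged given the budget."""
    flagged = flag_within_budget(risk_scores, budget)
    total = sum(dropped_out)
    if total == 0:
        return 0.0
    caught = sum(1 for i in flagged if dropped_out[i])
    return caught / total


# Hypothetical predicted risks and realized outcomes (1 = dropped out).
risks = [0.9, 0.2, 0.7, 0.1, 0.6]
outcomes = [1, 0, 0, 0, 1]
print(recall_at_budget(risks, outcomes, budget=2))
print(recall_at_budget(risks, outcomes, budget=3))
```

Comparing models on a criterion like this, rather than on overall accuracy, ties model selection to the policy goal: a model is better if it puts more of the true dropouts inside the group the school can actually reach.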

Sommervoll, Å. and D. E. Sommervoll (2019). "Learning from man or machine: Spatial fixed effects in urban econometrics." Regional Science and Urban Economics **77**: 239-252.

**Learning from man or machine: spatial fixed effects in urban econometrics**

Econometric models with spatial fixed effects (FE) require some kind of spatial aggregation. This aggregation may be based on postcode, school district, county or some other spatial subdivision. Common sense would suggest that the less aggregated, the better inasmuch as aggregation over larger areas tends to gloss over systematic spatial variation. On the other hand, low spatial aggregation results in thin data sets and potentially noisy spatial fixed effects. We show, however, how this trade-off can be substantially lessened if we allow for more flexible aggregations. The key insight is that if we aggregate over areas with similar location premiums, we obtain robust location premiums without glossing over too much of the spatial variation. We use machine learning in the form of a genetic algorithm to identify areas with similar location premiums. The best aggregations found by the genetic algorithm outperform a conventional FE by postcode, even with an order of magnitude fewer spatial controls. This opens the door for spatially sparse FEs, if economy in the number of variables is important. The major takeaway, however, is that the genetic algorithm can find spatial aggregations that are both refined and robust, and thus significantly lessen the trade-off between robust and refined location premium estimates.

Storm, H., et al. (2019). "Machine learning in agricultural and applied economics." European Review of Agricultural Economics.

**Machine learning in agricultural and applied economics**

This review presents machine learning (ML) approaches from an applied economist’s perspective. We first introduce the key ML methods drawing connections to econometric practice. We then identify current limitations of the econometric and simulation model toolbox in applied economics and explore potential solutions afforded by ML. We dive into cases such as inflexible functional forms, unstructured data sources and large numbers of explanatory variables in both prediction and causal analysis, and highlight the challenges of complex simulation models. Finally, we argue that economists have a vital role in addressing the shortcomings of ML when used for quantitative economic analysis.

Wager, S. and S. Athey (2018). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests." Journal of the American Statistical Association **113**(523): 1228-1242.

**Estimation and inference of heterogeneous treatment effects using random forests**

Many scientific and engineering challenges, ranging from personalized medicine to customized marketing recommendations, require an understanding of treatment effect heterogeneity. In this article, we develop a nonparametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.
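For readers who prefer symbols, the claims in this abstract can be written out roughly as follows. The notation is an assumption chosen for illustration (outcome $Y_i$, treatment indicator $W_i$, covariates $X_i$, forest estimate $\hat{\tau}(x)$ with variance estimate $\hat{\sigma}^2(x)$); the regularity conditions under which the limit holds are given in the paper and omitted here.

$$
\tau(x) = \mathbb{E}\left[\, Y_i(1) - Y_i(0) \mid X_i = x \,\right],
\qquad
\{Y_i(0),\, Y_i(1)\} \perp W_i \mid X_i ,
$$

$$
\frac{\hat{\tau}(x) - \tau(x)}{\sqrt{\operatorname{Var}\left[\hat{\tau}(x)\right]}} \Rightarrow \mathcal{N}(0, 1),
\qquad
\hat{\tau}(x) \pm z_{1-\alpha/2}\, \hat{\sigma}(x).
$$

The first line states the target (the conditional average treatment effect) and the unconfoundedness assumption; the second states the centered Gaussian limit and the resulting pointwise confidence interval.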

