# 从线性回归到机器学习, 一张图帮你文献综述

凡是搞计量经济的，都关注这个号了

**邮****箱：****econometrics666@sina.cn**

**所有计量经济圈方法论****丛的code程序****, 宏微观数据库和各种软****件都放在社群里.欢迎到计量经济圈社群交流访问****.**

**机器学习和大数据计量经济学, 你必须阅读一下这篇**”，你看过了吗？没有看过的学者，建议看看，里面非常多的文章在讲解机器学习相关方法，比如Lasso, decision tree, random forest, support vector machine等等。当然，社科学者也不应该为机器学习技术的到来感到慌张，毕竟这套技术在主流社科期刊，比如AER，ASR上应用远比想象得少。不过，ML对于金融领域的学者确实更重要，因为机器学习胜在预测，而这恰好是金融研究最需要的技术。

以下是关于机器学习的进化路线综述。作者详细介绍了机器学习是如何从线性回归模型发展到支持向量机的。

**机器学习的进化路线图**

机器学习是如何从线性回归模型发展到支持向量机

以下是文献综述，系统讲解了机器学习的进化路线图。

Machine learning is a branch of artificial intelligence whose foundational concepts were acquired over the years from contributions in the areas of computer science, mathematics, philosophy, economics, neuroscience, psychology, control theory, and more [397]. Research efforts during the last 75 years have given rise to a plethora of ML techniques [15, 169, 397, 435]. In this section, we provide a brief history of ML focusing on the techniques that have been particularly applied in the area of computer networks

The beginning of ML dates back to 1943, when the first mathematical model of NNs for computers was proposed by McCulloch [302]. This model introduced a basic unit called artificial neuron that has been at the center of NN development to this day. However, this early model required to manually establish the correct weights of the connections between neurons. This limitation was addressed in 1949 by Hebbian learning [184], a simple rule-based algorithm for updating the connection weights of the early NN model. Like the neuron unit, Hebbian learning greatly influenced the progress of NN. These two concepts led to the construction of the first NN computer in 1950, called SNARC (Stochastic Neural Analog Reinforcement Computer) [397]. In the same year, Alan Turing proposed a test –where a computer tries to fool a human into believing it is also human– to determine if a computer is capable of showing intelligent behavior. He described the challenges underlying his idea of a *“learning machine”* in [449]. These developments encouraged many researchers to work on similar approaches, resulting in two decades of enthusiastic and prolific research in the ML area.

In the 1950s, the simplest linear regression model called Ordinary Least Squares (OLS) –derived from the least squares method [266, 423] developed around the 1800s– was used to calculate linear regressions in electro-mechanical desk calculators [168]. To the best of our knowledge, this is the first evidence of using OLS in computing machines. Following this trend, two linear models for conducting classification were introduced: Maximum Entropy (MaxEnt) [215, 216] and logistic regression [105]. A different research trend centered on pattern recognition exposed two non-parametric models (i.e. not restricted to a bounded set of parameters) capable of performing regression and classification: *k*-Nearest Neighbors (*k*-NN) [147, 420] and Kernel Density Estimation (KDE) [388], also known as Parzen density [349]. The former uses a distance metric to analyze the data, while the latter applies a kernel function (usually, Gaussian) to estimate the probability density function of the data.

The 1950s also witnessed the first applications of the Naïve Bayes (NB) classifier in the fields of pattern recognition [97] and information retrieval [297]. NB, whose foundations date back to the 18th and 19th centuries [43, 261], is a simple probabilistic classifier that applies Bayes’ theorem on features with strong independence assumptions. NB was later generalized using KDE, also known as NB with Kernel Estimation (NBKE), to estimate the conditional probabilities of the features. In the area of clustering, Steinhaus [422] was the first to propose a continuous version of the to be called *k*-Means algorithm [290], to partition a heterogeneous solid with a given internal mass distribution into *k* subsets. The proposed centroid model employs a distance metric to partition the data into clusters where the distance to the centroid is minimized.

In addition, the Markov model [159, 296] (elaborated 50 years earlier) was leveraged to construct a process based on discrete-time state transitions and action rewards, named Markov Decision Process (MDP), which formalizes sequential decision-making problems in a fully observable, controlled environment [46]. MDP has been essential for the development of prevailing RL techniques [435]. Research efforts building on the initial NN model flourished too: the modern concept of perceptron was introduced as the first NN model that could learn the weights from input examples [387]. This model describes two NN classes according to the number of layers: Single-Layer Perceptron (SLP), an NN with one input layer and one output layer, and Multi-Layer Perceptron (MLP), an NN with one or more hidden layers between the input and the output layers. The perceptron model is also known as Feedforward NN (FNN) since the nodes from each layer exhibit directed connections only to the nodes of the next layer. In the remainder of the paper, MLP-NNs and NNs in general, will be denoted by the tuple (*i**n**p**u**t*_*n**o**d**e**s*,*h**i**d**d**e**n*_*l**a**y**e**r*_*n**o**d**e**s*+,*o**u**t**p**u**t*_*n**o**d**e**s*), for instance a (106,60,40,1) MLP-NN has a 160-node input layer, two hidden layers of 60 and 40 nodes respectively, and a single node output layer.

By the end of the 1950s, the term *“Machine Learning”* was coined and defined for the first time by Arthur Samuel (cf., Section 2), who also developed a checkers-playing game that is recognized as the earliest self-learning program [401]. ML research continued to flourish in the 1960s, giving rise to a novel statistical class of the Markov model, named Hidden Markov Model (HMM) [426]. An HMM describes the conditional probabilities between hidden states and visible outputs in a partially observable, autonomous environment. The Baum-Welch algorithm [41] was proposed in the mi-1960s to learn those conditional probabilities. At the same time, MDP continued to instigate various research efforts. The partially observable Markov decision process (POMDP) approach to finding optimal or near-optimal control strategies for partially observable stochastic environments, given a complete model of the environment, was first proposed by Cassandra et al. [25] in 1965, while the algorithm to find the optimal solution was only devised 5 years later [416]. Another development in MDP was the learning automata –officially published in 1973 [448]–, a Reinforcement Learning (RL) technique that continuously updates the probabilities of taking actions in an observed environment, according to given rewards. Depending on the nature of the action set, the learning automata is classified as Finite Action-set Learning Automata (FALA) or Continuous Action-set Learning Automata (CALA) [330].

In 1963, Morgan and Sonquis published Automatic Interaction Detection (AID) [323], the first regression tree algorithm that seeks sequential partitioning of an observation set into a series of mutually exclusive subsets, whose means reduces the error in predicting the dependent variable. AID marked the beginning of the first generation of Decision Trees (DT). However, the application of DTs to classification problems was only initiated a decade later by Morgan and Messenger’s Theta AID (THAID) [305] algorithm.

In the meantime, the first algorithm for training MLP-NNs with many layers –also known as Deep NN (DNN) in today’s jargon– was published by Ivakhnenko and Lapa in 1965 [210]. This algorithm marked the commencement of the Deep Learning (DL) discipline, though the term only started to be used in the 1980s in the general context of ML, and in the year 2000 in the specific context of NNs [9]. By the end of the 1960s, Minsky and Papertkey’s *Perceptrons* book [315] drew the limitations of perceptrons-based NN through mathematical analysis, marking a historical turn in AI and ML in particular, and significantly reducing the research interest for this area over the next several years [397].

Although ML research was progressing slower than projected in the 1970s [397], the 1970s were marked by milestones that greatly shaped the evolution of ML, and contributed to its success in the following years. These include the Backpropagation (BP) algorithm, the Cerebellar Model Articulation Controller (CMAC) NN model [11], the Expectation Maximization (EM) algorithm [115], the to-be-referred-to as Temporal Difference (TD) learning [478], and the Iterative Dichotomiser 3 (ID3) algorithm [373].

Werbos’s application of BP –originally a control theory algorithm from the 1960s [80, 81, 233]– to train NNs [472] resurrected the research in the area. BP is to date the most popular NN training algorithm, and comes in different variants such as Gradient Descent (GD), Conjugate Gradient (CG), One Step Secant (SS), Levenberg-Marquardt (LM), and Resilient backpropagation (Rp). Though, BP is widely used in training NNs, its efficiency depends on the choice of initial weights. In particular, BP has been shown to have slower speed of convergence and to fall into local optima. Over the years, global optimization methods have been proposed to replace BP, including Genetic Algorithms (GA), Simulated Annealing (SA), and Ant Colony (AC) algorithm [500]. In 1975, Albus proposed CMAC, a new type of NN as an alternative to MLP [11]. Although CMAC was primarily designed as a function modeler for robotic controllers, it has been extensively used in RL and classification problems for its faster learning compared to MLP.

In 1977, in the area of statistical learning, Dempster et al. proposed EM, a generalization of the previous iterative, unsupervised methods, such as the Baum-Welch algorithm, for learning the unknown parameters of statistical HMM models [115]. At the same time, Witten developed an RL approach to solve MDPs, inspired by animal behavior and learning theories [478], that was later referred to as Temporal Difference (TD) in Sutton’s work [433, 434]. In this approach the learning process is driven by the changes, or differences, in predictions over successive time steps, such that the prediction at any given time step is updated to bring it closer to the prediction of the same quantity at the next time step.

Towards the end of the 1970s, the second generation of DTs emerged as the Iterative Dichotomiser 3 (ID3) algorithm was released. The algorithm, developed by Quinlan [373], relies on a novel concept for attribute selection based on entropy maximization. ID3 is a precursor to the popular and widely used C4.5 and C5.0 algorithms.

The 1980s witnessed a renewed interest in ML research, and in particular in NNs. In the early 1980s, three new classes of NNs emerged, namely Convolutional Neural Network (CNN) [157], Self-Organizing Map (SOM) [249], and Hopfield network [195]. CNN is a feedforward NN specifically designed to be applied to visual imagery analysis and classification, and thus require minimal image preprocessing. Connectivity between neurons in CNN is inspired by the organization of the animal visual cortex –modeled by Hubel in the 1960s [200, 201]–, where the visual field is divided between neurons, each responding to stimuli only in its corresponding region. Similarly to CNN, SOM was also designed for a specific application domain; dimensionality reduction [249]. SOMs employ an unsupervised competitive learning approach, unlike traditional NNs that apply error-correction learning (such as BP with gradient descent).

In 1982, the first form of Recurrent Neural Network (RNN) was introduced by Hopfield. Named after the inventor, Hopfield network is an RNN where the weights connecting the neurons are bidirectional. The modern definition of RNN, as a network where connections between neurons exhibit one or more than one cycle, was introduced by Jordan in 1986 [226]. Cycles provide a structure for internal states or memory allowing RNNs to process arbitrary sequences of inputs. As such, RNNs are found particularly useful in Time Series Forecasting (TSF), handwriting recognition and speech recognition.

Several key concepts emerged from the 1980s’ *connectionism movement*, one of which is the concept of *distributed representation* [187]. Introduced by Hinton in 1986, this concept supports the idea that a system should be represented by many features and that each feature may have different values. Distributed representation establishes a many-to-many relationship between neurons and *(feature,value)* pairs for improved efficiency, such that a *(feature,value)* input is represented by a pattern of activity across neurons as opposed to being locally represented by a single neuron. The second half of 1980s also witnessed the increase in popularity of the BP algorithm and its successful application in training DNNs [263, 394], as well as the emergence of new classes of NNs, such as Restricted Boltzmann Machines (RBM) [413], Time-Lagged Feedforward Network (TLFN) [260], and Radial Basis Function Neural Network (RBFNN) [260].

Originally named Harmonium by Smolensky, RBM is a variant of Boltzmann machines [2] with the restriction that there are no connections within any of the network layers, whether it is visible or hidden. Therefor, neurons in RBMs form a bipartite graph. This restriction allows for more efficient and simpler learning compared to traditional Boltzmann machines. RBMs are found useful in a variety of application domains such as dimensionality reduction, feature learning, and classification, as they can be trained in both supervised and unsupervised ways. The popularity of RBMs and the extent of their applicability significantly increased after the mid-2000s as Hinton introduced in 2006 a faster learning method for Boltzmann machines called Contrastive Divergence [186] making RBMs even more attractive for deep learning [399]. Interestingly, although the use of the term “deep learning” in the ML community dates back to 1986 [111], it did not apply to NNs at that time.

Towards the end of 1980s, TLFN –an MLP that incorporates the time dimension into the model for conducting TSF [260]–, and RBFNN –an NN with a weighted set of Radial Basis Function (RBF) kernels that can be trained in unsupervised and supervised ways [78]– joined the growing list of NN classes. Indeed any of the above NNs can be employed in a DL architecture, either by implementing a larger number of hidden layers or stacking multiple simple NNs.

In addition to NNs, several other ML techniques thrived during the 1980s. Among these techniques, Bayesian Network (BN) arose as a Directed Acyclic Graph (DAG) representation model for the statistical models in use [352], such as NB and HMM –the latter considered as the simplest dynamic BN [107, 110]–. Two DT learning algorithms, similar to ID3 but developed independently, referred to as Classification And Regression Trees (CART) [76], were proposed to model classification and regression problems. Another DT algorithm, under the name of Reduced Error Pruning Tree (REPTree), was also introduced for classification. REPTree aimed at building faster and simpler tree models using information gain for splitting, along with reduced-error pruning [374].

Towards the end of 1980s, two TD approaches were proposed for reinforcement learning: TD(*λ*) [433] and Q-learning [471]. TD(*λ*) adds a discount factor (0≤*λ*≤1) that determines to what extent estimates of previous state-values are eligible for updating based on current errors, in the policy evaluation process. For example, TD(0) only updates the estimate of the value of the state preceding the current state. Q-learning, however, replaces the traditional state-value function of TD by an action-value function (i.e. Q-value) that estimates the utility of taking a specific action in specific states. As of today, Q-learning is the most well-studied and widely-used model-free RL algorithm. By the end of the decade, the application domains of ML started expending to the operation and management of communication networks [57, 217, 289].

In the 1990s, significant advances were realized in ML research, focusing primarily on NNs and DTs. Bio-inspired optimization algorithms, such as Genetic Algorithms (GA) and Particle Swarm Optimization (PSO), received increasing attention and were used to train NNs for improved performance over the traditional BP-based learning [234, 319]. Probably one of the most important achievements in NNs was the work on Long Short-Term Memory (LSTM), an RNN capable of learning long-term dependencies for solving DL tasks that involve long input sequences [192]. Today, LSTM is widely used in speech recognition as well as natural language processing. In DT research, Quinlan published the M5 algorithm in 1992 [375] to construct tree-based multivariate linear models analogous to piecewise linear functions. One well-known variant of the M5 algorithm is M5P, which aims at building trees for regression models. A year later, Quinlan published C4.5 [376], that builds on and extends ID3 to address most of its practical shortcomings, including data overfitting and training with missing values. C4.5 is to date one of the most important and widely used algorithms in ML and data mining.

Several techniques other than NN and DT also prospered in the 1990s. Research on regression analysis propounded the Least Absolute Selection and Shrinkage Operator (LASSO), which performs variable selection and regularization for higher prediction accuracy [445]. Another well-known ML technique introduced in the 1990s was Support Vector Machines (SVM). SVM enables plugging different kernel functions (e.g. linear, polynomial, RBF) to find the optimal solution in higher-dimensional feature spaces. SVM-based classifiers find a hyperplane to discriminate between categories. A single-class SVM is a binary classifier that deduces the hyperplane to differentiate between the data belonging to the class against the rest of the data, that is, *one-vs-rest*. A multi-class approach in SVM can be formulated as a series of single class classifiers, where the data is assigned to the class that maximizes an output function. SVM has been widely used primarily for classification, although a regression variant exists, known as Support Vector Regression (SVR) [70].

In the area of RL, SARSA (State-Action-Reward-State-Action) was introduced as a more realistic, however less practical, Q-learning variation [395]. Unlike Q-learning, SARSA does not update the Q-value of an action based on the maximum action-value of the next state, but instead it uses the Q-value of the action chosen in the next state.

A new emerging concept called *ensemble learning* demonstrated that the predictive performance of a single learning model can be be improved when combined with other models [397]. As a result, the poor performance of a single predictor or classifier can be compensated with ensemble learning at the price of (significantly) extra computation. Indeed the results from ensemble learning must be aggregated, and a variety of techniques have been proposed in this matter. The first instances of ensemble learning include Weighted Majority Algorithm (WMA) [279], boosting [403], bootstrap aggregating (or bagging) [75], and Random Forests (RF) [191]. RF focused explicitly on tree models and marked the beginning of a new generation of ensemble DT. In addition, some variants of the original boosting algorithm were also developed, such as Adaptive Boosting (AdaBoost) [153] and Stochastic Gradient Boosting (SGBoost) [155].

These advances in ML facilitated the successful deployment of major use cases in the 1990s, particularly, handwriting recognition [419] and data mining [3]. The latter represented a great shift to data-driven ML, and since then it has been applied in many areas (e.g., retail, finance, manufacturing, medicine, science) for processing huge amounts of data to build models with valuable use [169]. Furthermore, from a conceptual perspective, Tom Mitchell formally defined ML: “A computer program is said to learn from experience *E* with respect to some class of tasks *T* and performance measure *P*, if its performance at tasks in *T*, as measured by *P*, improves with experience *E*” [317].

The 21st century began with a new wave of increasing interest in SVM and ensemble learning, and in particular ensemble DT. Research efforts in the field generated some of the the most widely used implementations of ensemble DT as of today: Multiple Additive Regression Trees (MART) [154], extra-trees [164], and eXtreme Gradient Boosting (XGBoost) [93]. MART and XGBoost are respectively a commercial and open source implementation of Friedman’s Gradient Boosting Decision Tree (GBDT) algorithm; an ensemble DT algorithm based on gradient boosting [154, 155]. Extra-trees stands for *extremely randomized trees*, an ensemble DT algorithm that builds random trees based on *k* randomly chosen features. However instead to computing an optimal split-point for each one of the *k* features at each node as in RF, extra-trees selects a split-point randomly for reduced computational complexity.

At the same time, the popularity of DL increased significantly after the term “deep learning” was first introduced in the context of NNs in 2000 [9]. However, the attractiveness of DNN started decreasing shortly after due to the experienced difficulty of training DNNs using BP (e.g. vanishing gradient problem), in addition to the increasing competitiveness of other ML techniques (e.g. SVM) [169]. Hinton’s work on Deep Belief Networks (DBN), published in 2006 [188], gave a new breath and strength to research in DNNs. DBN introduced an efficient training strategy for deep learning models, which was further used successfully in different classes of DNNs [49, 381]. The development in ML –particularly, in DNNs– grew exponentially with advances in storage capacity and large-scale data processing (i.e. Big Data) [169]. This wave of popularity in deep learning has continued to this day, yielding major research advances over the years. One approach that is currently receiving tremendous attention is Deep RL, which incorporates deep learning models into RL for solving complex problems. For example, Deep Q-Networks (DQN) –a combination of DNN and Q-learning– was proposed for mastering video games [318]. Although the term Deep RL was coined recently, this concept was already discussed and applied 25 years ago [275, 440].

It is important to mention that the evolution in ML research has enabled improved learning capabilities which were found useful in several application domains, ranging from games, image and speech recognition, network operation and management, to self-driving cars [120].

**Source：**

*A comprehensive survey on machine learning for networking: evolution, applications and research opportunities，Journal of Internet Services and Applicationsvolume 9, Article number: 16 (2018)*

拓展阅读：

5.机器学习与Econometrics的书籍推荐, 值得拥有的经典

10.详解聚类, 回归, 分类三种统计方法, 人人都能够懂的方式

14.回归方法深度剖析(OLS, RIDGE, ENET, LASSO, SCAD, MCP, QR）

15.高维回归方法: Ridge, Lasso, Elastic Net用了吗

16.交叉验证，模型选择（cross validation）

下面这些短链接文章属于合集，可以收藏起来阅读，不然以后都找不到了。

**2年，计量经济圈公众号近1000篇文章，**

**Econometrics Circle**

**数据系列：空间矩阵 | ****工企****数据**** | ****PM2.5 | ****市场化指数 | ****CO2数据 | ****夜间灯光 | 官员方言 | 微观数据 |**

**计量系列：****匹配方法 | ****内生性 | ****工具变量 | ****DID | ****面板数据 | ****常用TOOL | 中介调节 | 时间序列 | RDD断点 | 合成控制 | **

**数据处理：Stata | R | Python | 缺失值 | CHIP/ CHNS/CHARLS/CFPS/CGSS等 |**

**干货系列：能源环境**** | 效率研究**** | 空间计量**** | ****国际经贸**** | 计量软件**** | 商科研究 | 机器学习 | SSCI | CSSCI | SSCI查询 |**

计量经济圈组织了一个计量社群，有如下特征：热情互助最多、前沿趋势最多、社科资料最多、社科数据最多、科研牛人最多、海外名校最多。因此，建议积极进取和有强烈研习激情的中青年学者到社群交流探讨，始终坚信优秀是通过感染优秀而互相成就彼此的。