Solutions to Collinearity and Over-/Non-identification Problems
Collinearity and Over-/Non-identification Problems
Background: Ordinary linear regression fitted by least squares is a basic method for modeling data. A key modeling requirement is that the error terms be independent and identically distributed with zero mean (normality is usually assumed). The t-test is used to test the significance of the fitted coefficients, and the F-test is used to test the significance of the model as a whole (analysis of variance). If normality does not hold, the t-test and F-test lose their meaning.
When modeling more complex data (for example text classification, image denoising, or genomic studies), ordinary linear regression runs into several problems:
(1) Prediction accuracy. If there is a reasonably clear linear relationship between the response and the predictors, least squares regression has little bias; in particular, when the number of observations n is much larger than the number of predictors p, it also has small variance. But when n and p are of similar size, overfitting occurs easily, and when n < p, least squares regression no longer yields a meaningful (unique) result.
(2) Model interpretability. Many of the variables included in a multiple linear regression model may be unrelated to the response; multicollinearity may also arise, i.e. several predictors may be strongly correlated with one another. Both situations increase model complexity and weaken the interpretability of the model. In such cases variable selection (feature selection) is needed.
To address these problems with OLS, there are three families of extensions on the variable-selection side:
(1) Subset selection. This is the traditional approach, including stepwise regression and best-subset selection: linear models are fitted to candidate subsets of the predictors and a criterion (AIC, BIC, Mallows' Cp, adjusted R², etc.) is used to decide on the best model (see the sketch after this list).
(2) Shrinkage methods, also known as regularization, chiefly ridge regression and lasso regression. A penalty is added to the least squares criterion so that the coefficient estimates are shrunk toward zero (with the lasso, some estimates become exactly zero).
(3) Dimension reduction, namely principal component regression (PCR) and partial least squares (PLS) regression. The p predictors are projected onto an m-dimensional space (m < p), and the uncorrelated combinations obtained from the projection are used to build a linear model.
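As a quick illustration of subset selection from the list above, the sketch below uses base R's step() function, which performs stepwise selection by AIC; the simulated data frame and variable names are assumptions made purely for illustration.

## Stepwise selection by AIC using base R's step(); simulated data for illustration only
set.seed(10)
n   <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x3 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
## direction = "both" combines forward and backward steps; AIC is the default criterion
step_fit <- step(full, direction = "both", trace = 0)
summary(step_fit)   # the uninformative predictors (here x2, x4) are typically dropped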
Ridge Regression
Ridge regression can be used to handle two kinds of problems: first, when there are fewer data points than variables; second, when there is collinearity among the variables.
When the variables are collinear, the coefficients obtained by least squares are unstable and have large variance. This is because the matrix X'X, obtained by multiplying the design matrix X by its transpose, is (nearly) singular and cannot be inverted reliably; ridge regression resolves this by introducing a parameter lambda (added to the diagonal of X'X, which makes it invertible). In R, the function lm.ridge() in the MASS package does this conveniently. Its input matrix X is always n x p, regardless of whether an intercept is included. The output in the two cases, with and without an intercept, is described below.
When an intercept is included, the function centers y using its mean, and centers and scales x using the mean and standard deviation of each variable. After this processing x and y have mean 0, so the regression plane passes through the origin, i.e. the intercept is 0. Therefore, although an intercept was requested, the coefficients reported in lmridge$coef contain no intercept term. When using the model for prediction, x and y must first be centered and scaled with the same factors that were used when the model was trained, and the result is then multiplied by the coefficients to obtain the prediction. Note that if, after fitting, you type lmridge directly at the command line, a full set of coefficients is printed; this set does include an intercept and differs from lmridge$coef, because it applies to data that have not been centered and scaled. That printed set can be used directly for prediction without centering and scaling the data.
When the model is specified without an intercept, the fit is forced through the origin, so the function assumes each variable has mean 0. It therefore does not center x or y, but it still scales x, and the scaling factor is the standard deviation of each variable computed under the assumption that its mean is 0. When predicting, if you use the coefficients in lmridge$coef the data must be scaled in the same way; if you use the coefficients printed by lmridge directly, you only need to multiply them by the raw data.
Choosing lambda in ridge regression:
You can use select(lmridge) to choose it automatically; the value that minimizes GCV is generally used, and the only restriction on lambda is that it be greater than 0.
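A minimal sketch of this workflow with MASS::lm.ridge() and select(); the simulated data and variable names are illustrative assumptions, not from the original text.

## Ridge regression with MASS::lm.ridge() on simulated, nearly collinear data
library(MASS)
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # nearly collinear with x1
x3 <- rnorm(n)
y  <- 2 * x1 + 3 * x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

## Fit over a grid of lambda values; select() reports the HKB, L-W and GCV choices
fit <- lm.ridge(y ~ x1 + x2 + x3, data = dat, lambda = seq(0, 10, by = 0.1))
select(fit)

## Refit at the lambda minimizing GCV. coef() returns coefficients on the
## original (uncentred, unscaled) scale, including the intercept, so
## prediction is a plain matrix product with the raw data.
best <- fit$lambda[which.min(fit$GCV)]
fit1 <- lm.ridge(y ~ x1 + x2 + x3, data = dat, lambda = best)
beta <- coef(fit1)
pred <- beta[1] + as.matrix(dat[, c("x1", "x2", "x3")]) %*% beta[-1]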
Principal Component Regression
When building a multiple linear regression equation, multicollinearity among the explanatory variables often makes some coefficients highly unstable: adding or removing variables can change their values dramatically, and they may even take signs that contradict reality, making it hard to give the fitted regression equation an interpretation consistent with the facts. Principal Component Regression (PCR) is a multivariate regression method designed to deal with multicollinearity among the explanatory variables.
In general, PCR is essentially a shrinkage estimator that usually retains the high-variance principal components (corresponding to the larger eigenvalues of X'X) as covariates in the model and discards the remaining low-variance components (corresponding to the smaller eigenvalues of X'X). Thus it exerts a discrete shrinkage effect on the low-variance components, nullifying their contribution to the original model completely. In contrast, the ridge regression estimator exerts a smooth shrinkage effect through the regularization parameter (or tuning parameter) inherently involved in its construction. While it does not completely discard any of the components, it shrinks all of them in a continuous manner, so that the extent of shrinkage is higher for the low-variance components and lower for the high-variance components. Frank and Friedman (1993) conclude that for the purpose of prediction itself, the ridge estimator, owing to its smooth shrinkage effect, is perhaps a better choice than the PCR estimator with its discrete shrinkage effect.
In addition, the principal components are obtained from the eigen-decomposition of X'X, which involves only the observations of the explanatory variables. Therefore, the PCR estimator obtained by using these principal components as covariates need not have satisfactory predictive performance for the outcome. A somewhat similar estimator that tries to address this issue through its very construction is the partial least squares (PLS) estimator. Like PCR, PLS also uses derived covariates of lower dimension. Unlike PCR, however, the derived covariates for PLS are obtained using both the outcome and the covariates. While PCR seeks the high-variance directions in the space of the covariates, PLS seeks the directions in the covariate space that are most useful for predicting the outcome.
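A minimal PCR sketch using the pls package (the package choice, the simulated data, and the variable names are assumptions for illustration; the original text does not name an implementation).

## Principal component regression with pls::pcr() on simulated data
library(pls)
set.seed(2)
n <- 60; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)
dat <- data.frame(y = y, X = I(X))

## Cross-validation helps choose how many components to keep
pcr_fit <- pcr(y ~ X, data = dat, ncomp = 8, scale = TRUE, validation = "CV")
summary(pcr_fit)                                      # RMSEP by number of components
pred <- predict(pcr_fit, newdata = dat, ncomp = 3)    # predict using, say, 3 components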
Partial Least Squares
Partial least squares, like ordinary least squares, finds the best functional match for a set of data by minimizing the sum of squared errors: estimates of otherwise unknowable true values are obtained as simply as possible by making the sum of squared errors a minimum. Many other optimization problems can also be expressed in least-squares form by minimizing an energy or maximizing an entropy.
Partial least squares regression (PLS regression) is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods are known as bilinear factor models. Partial least squares discriminant analysis (PLS-DA) is a variant used when Y is categorical.
PLS is used to find the fundamental relations between two matrices (X and Y), i.e. a latent variable approach to modeling the covariance structures in these two spaces. A PLS model will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. PLS regression is particularly suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among X values. By contrast, standard regression will fail in these cases (unless it is regularized).
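A matching PLS sketch with pls::plsr(), mirroring the PCR example above; again the simulated data (here with more variables than observations) is an illustrative assumption.

## Partial least squares regression with pls::plsr() on simulated p > n data
library(pls)
set.seed(3)
n <- 40; p <- 60
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)
dat <- data.frame(y = y, X = I(X))

## Unlike PCR, the PLS components are built using y as well as X
pls_fit <- plsr(y ~ X, data = dat, ncomp = 5, scale = TRUE, validation = "CV")
summary(pls_fit)
pred <- predict(pls_fit, newdata = dat, ncomp = 2)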
Lasso Regression
Similar to ridge regression, the Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the size of the regression coefficients, here through their absolute values. In addition, it can reduce variability and improve the accuracy of linear regression models. Consider the formula below:
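The formula referred to above is not reproduced in the original; a standard form of the lasso objective (stated here as an assumption about what was intended, for a centred response and standardized predictors) is

minimize over beta:  (1/(2n)) * sum_i ( y_i - x_i' beta )^2  +  lambda * sum_j | beta_j |,

where lambda >= 0 controls the strength of the absolute-value (L1) penalty.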
Lasso regression differs from ridge regression in that the penalty it uses is based on absolute values rather than squares. This causes the penalty (equivalently, a constraint on the sum of the absolute values of the estimates) to drive some parameter estimates exactly to zero; the larger the penalty, the further the estimates are shrunk toward zero. Because the absolute value is locally linear (its slope does not vanish near zero, unlike the square), lasso regression favors fits in which coefficients with norm close to zero are pushed all the way to zero.
Key points:
Apart from the constant term, the assumptions of this regression are similar to those of least squares regression;
It shrinks coefficients toward zero (and to exactly zero), which genuinely helps with feature selection;
It is a regularization method that uses L1 regularization;
If a group of predictors is highly correlated, the lasso tends to pick one of them and shrink the others to zero.
The lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. It was introduced by Robert Tibshirani in 1996, building on Leo Breiman's nonnegative garrote. The lasso was originally formulated for least squares models, and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression and best subset selection and the connections between lasso coefficient estimates and so-called soft thresholding. It also reveals that (as in standard linear regression) the coefficient estimates need not be unique if covariates are collinear.
Prior to lasso, the most widely used method for choosing which covariates to include was stepwise selection, which only improves prediction accuracy in certain cases, such as when only a few covariates have a strong relationship with the outcome. However, in other cases, it can make prediction error worse. Also, at the time, ridge regression was the most popular technique for improving prediction accuracy. Ridge regression improves prediction error by shrinking large regression coefficients in order to reduce overfitting, but it does not perform covariate selection and therefore does not help to make the model more interpretable.
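A minimal lasso sketch using the glmnet package (the package choice and the simulated data are assumptions; the original text does not name an implementation).

## Lasso with glmnet: alpha = 1 selects the pure L1 penalty
library(glmnet)
set.seed(4)
n <- 80; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 3] + rnorm(n)

cv_fit <- cv.glmnet(X, y, alpha = 1)           # lambda chosen by cross-validation
coef(cv_fit, s = "lambda.min")                 # many coefficients are exactly zero
pred <- predict(cv_fit, newx = X, s = "lambda.min")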
ElasticNet Regression
ElasticNet is a hybrid of the lasso and ridge regression techniques. It is trained with both the L1 and L2 penalties as regularizers. ElasticNet is useful when there are several correlated features: the lasso tends to pick one of them at random, whereas ElasticNet tends to keep both. A practical advantage of trading off between lasso and ridge is that it allows ElasticNet to inherit some of ridge's stability under rotation.
Key points:
It produces a grouping effect in the presence of highly correlated variables;
There is no limit on the number of variables that can be selected;
It can suffer from double shrinkage.
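A minimal ElasticNet sketch, again using glmnet as an assumed implementation: alpha mixes the two penalties (alpha = 1 is the lasso, alpha = 0 is ridge, values in between give the elastic net).

## Elastic net with glmnet: alpha = 0.5 mixes the L1 and L2 penalties
library(glmnet)
set.seed(5)
n <- 80; p <- 20
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.05)         # a highly correlated pair of predictors
y <- X[, 1] + X[, 2] + rnorm(n)

enet_cv <- cv.glmnet(X, y, alpha = 0.5)
coef(enet_cv, s = "lambda.min")                # both correlated predictors tend to be retained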
The regularized regression methods (Lasso, Ridge, and ElasticNet) work well in high-dimensional settings and when there is multicollinearity among the variables in the data set.
Note: this article was compiled by @圈圈; feel free to share it, and please leave a comment if you repost.