
Econometric Analysis with the statsmodels Library

大邓 大邓和他的Python 2022-07-09

OLS Estimation

    import statsmodels.api as sm
    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.sandbox.regression.predstd import wls_prediction_std
    %matplotlib inline

    np.random.seed(9876789)

    # generate synthetic data
    nsample = 100
    x = np.linspace(0, 10, 100)
    X = np.column_stack((x, x**2))
    beta = np.array([1, 0.1, 10])
    e = np.random.normal(size=nsample)

    # the X matrix for sm.OLS needs an explicit constant column
    X = sm.add_constant(X)
    y = np.dot(X, beta) + e

    # fit the OLS model to the synthetic data
    model = sm.OLS(y, X)
    results = model.fit()
    print(results.summary())

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       1.000
    Model:                            OLS   Adj. R-squared:                  1.000
    Method:                 Least Squares   F-statistic:                 4.020e+06
    Date:                Thu, 19 Dec 2019   Prob (F-statistic):          2.83e-239
    Time:                        17:33:16   Log-Likelihood:                -146.51
    No. Observations:                 100   AIC:                             299.0
    Df Residuals:                      97   BIC:                             306.8
    Df Model:                           2
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const          1.3423      0.313      4.292      0.000       0.722       1.963
    x1            -0.0402      0.145     -0.278      0.781      -0.327       0.247
    x2            10.0103      0.014    715.745      0.000       9.982      10.038
    ==============================================================================
    Omnibus:                        2.042   Durbin-Watson:                   2.274
    Prob(Omnibus):                  0.360   Jarque-Bera (JB):                1.875
    Skew:                           0.234   Prob(JB):                        0.392
    Kurtosis:                       2.519   Cond. No.                         144.
    ==============================================================================

    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

    # attributes and methods available on the fitted results object
    print(dir(results))

    ['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_cache', '_data_attr', '_get_robustcov_results', '_is_nested', '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 'df_model', 'df_resid', 'diagn', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 'get_robustcov_results', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']
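For instance, here is a quick sketch of pulling a few of these off the results object (all names appear in the dir() listing above) to get standard errors, confidence intervals, and Wald-type tests:

    # standard errors, p-values, and 95% confidence intervals of the coefficients
    print(results.bse)
    print(results.pvalues)
    print(results.conf_int())

    # single restriction and joint test, written as restriction matrices (testing R @ beta = 0)
    print(results.t_test([0, 0, 1]))           # H0: the coefficient on x2 is 0
    print(results.f_test(np.identity(3)[1:]))  # H0: x1 = 0 and x2 = 0 jointly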

    # just look at the coefficients and R-squared
    print('Parameters: ', results.params)
    print('R2: ', results.rsquared)

    Parameters:  [ 1.1875799   0.06751355 10.00087227]
    R2:  0.9999904584846492

OLS with a Nonlinear Curve That Is Linear in the Parameters

    nsample = 50
    sig = 0.5
    x = np.linspace(0, 20, nsample)
    X = np.column_stack((x, np.sin(x), (x-5)**2, np.ones(nsample)))
    beta = [0.5, 0.5, -0.02, 5.]
    y_true = np.dot(X, beta)
    y = y_true + sig * np.random.normal(size=nsample)

    res = sm.OLS(y, X).fit()
    print(res.summary())

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       0.946
    Model:                            OLS   Adj. R-squared:                  0.942
    Method:                 Least Squares   F-statistic:                     267.0
    Date:                Thu, 19 Dec 2019   Prob (F-statistic):           4.28e-29
    Time:                        17:39:35   Log-Likelihood:                -29.585
    No. Observations:                  50   AIC:                             67.17
    Df Residuals:                      46   BIC:                             74.82
    Df Model:                           3
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    x1             0.5031      0.024     20.996      0.000       0.455       0.551
    x2             0.5805      0.094      6.162      0.000       0.391       0.770
    x3            -0.0208      0.002     -9.908      0.000      -0.025      -0.017
    const          5.0230      0.155     32.328      0.000       4.710       5.336
    ==============================================================================
    Omnibus:                        3.931   Durbin-Watson:                   2.004
    Prob(Omnibus):                  0.140   Jarque-Bera (JB):                2.119
    Skew:                          -0.235   Prob(JB):                        0.347
    Kurtosis:                       2.108   Cond. No.                         221.
    ==============================================================================

    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

    print('Parameters: ', res.params)
    print('Standard errors: ', res.bse)
    print('Predicted values: ', res.predict())

    Parameters:  [ 0.50314123  0.58047691 -0.02084597  5.02303139]
    Standard errors:  [0.02396328 0.09420248 0.00210399 0.15537883]
    Predicted values:  [ 4.50188223  5.01926399  5.49184506  5.88798995  6.18748025  6.38483648
      6.49021831  6.52775537  6.53158285  6.54023313  6.59030507  6.7104509
      6.91666871  7.20967408  7.57478245  7.98432176  8.40217892  8.78973307
      9.11220088  9.34435135  9.47465119  9.50715972  9.46086192  9.36654926
      9.26176075  9.18461538  9.16754929  9.23198657  9.3848194   9.61727563
      9.90636005 10.21863244 10.51570167 10.7605333  10.92353419 10.98741461
     10.95002886 10.82472782 10.63816544 10.42591956 10.22664657 10.0757303
      9.99946853 10.01075234 10.10694816 10.27033045 10.47099394 10.67176702
     10.83431887 10.92545718]

To compare the true values against the OLS fitted values, we use wls_prediction_std to draw prediction intervals around the fit.

    from statsmodels.sandbox.regression.predstd import wls_prediction_std

    # prediction standard deviation, lower and upper prediction-interval bounds
    prstd, iv_l, iv_u = wls_prediction_std(res)
    fig, ax = plt.subplots(figsize=(8, 6))

    ax.plot(x, y, 'o', label="data")
    ax.plot(x, y_true, 'b-', label="True")
    ax.plot(x, res.predict(), 'r--.', label="OLS")
    ax.plot(x, iv_u, 'r--')
    ax.plot(x, iv_l, 'r--')
    ax.legend(loc='best');
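Note that wls_prediction_std lives in statsmodels' sandbox module; the results object's own get_prediction method (also visible in the dir() listing earlier) returns equivalent intervals. A minimal sketch:

    pred = res.get_prediction()
    frame = pred.summary_frame(alpha=0.05)
    # mean_ci_* columns bound the fitted mean; obs_ci_* columns bound new
    # observations, matching the dashed bands drawn above
    print(frame[['mean', 'obs_ci_lower', 'obs_ci_upper']].head())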

Handling Dummy Variables

The key call is sm.categorical(groups, drop=True):

    import numpy as np
    import statsmodels.api as sm

    groups = np.array(['a', 'b', 'a', 'a', 'c'])
    sm.categorical(groups)

    array([['a', '1.0', '0.0', '0.0'],
           ['b', '0.0', '1.0', '0.0'],
           ['a', '1.0', '0.0', '0.0'],
           ['a', '1.0', '0.0', '0.0'],
           ['c', '0.0', '0.0', '1.0']], dtype='<U32')

    sm.categorical(groups, drop=True)

    array([[1., 0., 0.],
           [0., 1., 0.],
           [1., 0., 0.],
           [1., 0., 0.],
           [0., 0., 1.]])

    x = np.array([1, 2, 3, 4, 5])
    dummy = sm.categorical(groups, drop=True)
    np.column_stack((x, dummy))

    array([[1., 1., 0., 0.],
           [2., 0., 1., 0.],
           [3., 1., 0., 0.],
           [4., 1., 0., 0.],
           [5., 0., 0., 1.]])
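As an aside, pandas can build the same dummy matrix in one line, which is worth knowing since more recent statsmodels releases have deprecated sm.categorical in favor of pandas. A quick sketch:

    import pandas as pd

    # equivalent of sm.categorical(groups, drop=True): one 0/1 column per level
    pd.get_dummies(groups, dtype=float)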

Multicollinearity

The Longley dataset is a classic example of strong multicollinearity: the explanatory variables are highly correlated with one another.

Dependent variable: TOTEMP

Independent variables:

  • GNPDEFL

  • GNP

  • UNEMP

  • ARMED

  • POP

  • YEAR

Multicollinearity undermines the stability of the OLS parameter estimates.

    import pandas as pd
    df = pd.read_csv('data/longley.csv')
    df.head()

    import statsmodels.api as sm
    import pandas as pd

    df = pd.read_csv('data/longley.csv')
    X = df[df.columns[1:]]
    X = sm.add_constant(X)
    y = df[df.columns[0]]

    ols = sm.OLS(y, X)
    ols_results = ols.fit()
    print(ols_results.summary())

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                 TOTEMP   R-squared:                       0.995
    Model:                            OLS   Adj. R-squared:                  0.992
    Method:                 Least Squares   F-statistic:                     330.3
    Date:                Thu, 19 Dec 2019   Prob (F-statistic):           4.98e-10
    Time:                        19:52:43   Log-Likelihood:                -109.62
    No. Observations:                  16   AIC:                             233.2
    Df Residuals:                       9   BIC:                             238.6
    Df Model:                           6
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const      -3.482e+06    8.9e+05     -3.911      0.004    -5.5e+06   -1.47e+06
    GNPDEFL       15.0619     84.915      0.177      0.863    -177.029     207.153
    GNP           -0.0358      0.033     -1.070      0.313      -0.112       0.040
    UNEMP         -2.0202      0.488     -4.136      0.003      -3.125      -0.915
    ARMED         -1.0332      0.214     -4.822      0.001      -1.518      -0.549
    POP           -0.0511      0.226     -0.226      0.826      -0.563       0.460
    YEAR        1829.1515    455.478      4.016      0.003     798.788    2859.515
    ==============================================================================
    Omnibus:                        0.749   Durbin-Watson:                   2.559
    Prob(Omnibus):                  0.688   Jarque-Bera (JB):                0.684
    Skew:                           0.420   Prob(JB):                        0.710
    Kurtosis:                       2.434   Cond. No.                     4.86e+09
    ==============================================================================

    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 4.86e+09. This might indicate that there are
    strong multicollinearity or other numerical problems.

The Warnings section at the end of the summary alerts us that the model's condition number is very large, which may indicate strong multicollinearity or other numerical problems.

Condition number

The condition number gauges the severity of multicollinearity. A value above 20 is generally taken as evidence of a genuine multicollinearity problem (see Greene 4.9).

    import numpy as np
    np.linalg.cond(ols.exog)

    4859257015.454868
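Beyond the condition number, a common per-variable diagnostic (not used in the original example) is the variance inflation factor; here is a sketch using statsmodels' helper:

    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # VIF of each regressor (column 0 is the constant, so skip it);
    # values far above 10 are usually read as severe multicollinearity
    for i, name in enumerate(X.columns[1:], start=1):
        print(name, variance_inflation_factor(X.values, i))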

Dropping Observations

Greene also notes that dropping even a single observation can have a dramatic effect on the OLS estimates.

    # keep only the first 14 observations
    ols_results2 = sm.OLS(y[:14], X[:14]).fit()

    print("Percentage change %4.2f%%\n"*7 % tuple([i for i in (ols_results2.params - ols_results.params)/ols_results.params*100]))

    Percentage change 4.55%
    Percentage change -105.20%
    Percentage change -3.43%
    Percentage change 2.92%
    Percentage change 3.32%
    Percentage change 97.06%
    Percentage change 4.64%

We can also look at formal statistics for this, such as DFBETAS – a standardized measure of how much each coefficient changes when a given observation is left out.

    influence = ols_results.get_influence()
    influence.summary_frame()

As a rough rule of thumb, a DFBETAS value is considered large when its absolute value exceeds 2/sqrt(N):

    2./(len(X)**.5)

    0.5

    # keep only the columns whose names contain 'dfb'
    influence.summary_frame().filter(regex="dfb")
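Putting the rule of thumb to work, a short sketch that flags the observations with at least one |DFBETAS| above the 2/sqrt(N) threshold:

    dfbetas = influence.summary_frame().filter(regex="dfb")
    threshold = 2. / len(X)**.5   # 0.5 for the 16 Longley observations
    # rows where any coefficient's DFBETAS exceeds the threshold in absolute value
    print(dfbetas[(dfbetas.abs() > threshold).any(axis=1)])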

statsmodels is under active development, so Python's econometrics tooling should only get better from here. Looking forward to it!
