Miscellaneous
Stability Analysis of Time-Series Feature Correlation Coefficients (with Code)
The following article is from 宅码 (author: Ai).
In time series problems, features can be time-sensitive: under some market regimes a stock's return is driven mainly by the company's price-to-earnings ratio, while under others the turnover rate matters more. In essence, this shows up as instability over time in the correlation between a feature and the target variable. A correlation analysis can help us find these temporally unstable features, drop them, and make the model more robust.
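To make the idea concrete, here is a minimal sketch on made-up data (the column names and numbers are purely illustrative, not from the original article): a feature that tracks the target in one period and moves against it in the next swings wildly in its per-period correlation, even though its global correlation looks unremarkable.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.normal(size=2 * n)
# First half: the feature moves with the target; second half: against it.
feature = np.concatenate([target[:n], -target[n:]]) + rng.normal(scale=0.5, size=2 * n)
demo = pd.DataFrame({'feature': feature,
                     'target': target,
                     'period': np.repeat(['2021-H1', '2021-H2'], n)})
print(demo['feature'].corr(demo['target']))                  # close to 0 overall
print(demo.groupby('period')[['feature', 'target']]
          .apply(lambda g: g['feature'].corr(g['target'])))  # roughly +0.9 and -0.9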
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Load the data
train_df = pd.read_csv('train.csv')
train_df.head()
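The snippets below assume train.csv has a Date column, a numeric target column, and numeric feature columns (the article appears to follow a Ubiquant-style layout). If you just want to run the code end to end without that file, a toy frame in the same assumed schema is enough; the feature names f_0/f_1/f_2 here are made up:
# Optional stand-in for train.csv (synthetic data, hypothetical schema).
rng = np.random.default_rng(42)
n_rows = 5000
train_df = pd.DataFrame({
    'Date': rng.choice(pd.date_range('2020-01-01', '2021-12-31', freq='D'), size=n_rows),
    'f_0': rng.normal(size=n_rows),
    'f_1': rng.normal(size=n_rows),
    'f_2': rng.normal(size=n_rows),
})
train_df['target'] = 0.3 * train_df['f_0'] + rng.normal(scale=0.5, size=n_rows)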
First, for the training data, compute the correlation between each feature and the target month by month:
# Extract (year, month) features
train_df['Date'] = pd.to_datetime(train_df['Date'])
train_df['Date_year'] = train_df['Date'].dt.year
train_df['Date_month'] = train_df['Date'].dt.month

def concat_year_month(year, month):
    return (year, month)

train_df['Date_ym'] = train_df.apply(lambda x: concat_year_month(x['Date_year'], x['Date_month']), axis=1)
# For each month, compute the correlation between every feature and the target
date_yms = train_df['Date_ym'].unique()
corr_df = []
for date_ym in date_yms:
    curr_df = train_df[train_df['Date_ym'] == date_ym]
    curr_corr_df = curr_df.corr(numeric_only=True)  # numeric_only: ignore Date / Date_ym
    curr_corr_df = curr_corr_df['target'].reset_index()
    curr_corr_df.rename(columns={'index': 'feature', 'target': 'corr'}, inplace=True)
    curr_corr_df['Date_ym'] = str(date_ym)
    corr_df.append(curr_corr_df)
corr_df = pd.concat(corr_df, axis=0).reset_index(drop=True)
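Not from the original article, but for reference, the same monthly correlations can be computed in one groupby expression; it produces a wide table (one row per month, one column per feature) instead of the long corr_df built above:
# Equivalent one-liner (sketch): month-by-month correlation of every numeric column with 'target'.
numeric_cols = list(train_df.select_dtypes('number').columns)
monthly_corr = (train_df.groupby('Date_ym')[numeric_cols]
                        .apply(lambda g: g.corr()['target']))
print(monthly_corr.head())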
Next, look at how each feature's monthly correlation changes over time:
USE_COLS = [f for f in corr_df['feature'].unique()
            if f not in ['Date', 'Date_year', 'Date_month', 'target']]  # columns used for training
corr_df = corr_df[corr_df['feature'].isin(USE_COLS)]

plt.figure(figsize=(25, 4))
sns.lineplot(data=corr_df, x='Date_ym', y='corr', hue='feature')
plt.xticks(rotation=90)  # rotate the (year, month) labels so they stay readable
plt.title('The correlation along the time axis')
plt.show()

plt.figure(figsize=(25, 4))
sns.boxplot(data=corr_df.sort_values('feature'), x='feature', y='corr')
plt.xticks(rotation=90)
plt.title('The BoxPlot of each feature')
plt.show()
In the boxplot, a wide box means that feature's correlation with the target swings heavily from month to month. Let's print the features with the largest correlation standard deviation:
top_corrStd_fnum = 10  # number of features with the largest correlation std to pick
top_corrStd_feats = (corr_df.groupby('feature')['corr'].std()
                            .reset_index()
                            .sort_values('corr', ascending=False)['feature']
                            .iloc[:top_corrStd_fnum].to_list())
print(top_corrStd_feats)
Putting it all together, we wrap the steps above into a single function:
def get_unstable_feats(df, top_fnum=10, corr_thresh=0.15):
    """For the training data, compute the feature-target correlation for each month,
    then use the standard deviation of each feature's monthly correlation to pick
    the TOP unstable features for later feature selection.
    Args:
        df (pd.DataFrame): training set
        top_fnum (int): number of top unstable features to return
        corr_thresh (float): threshold on the correlation standard deviation
    Note: to select by corr_thresh instead of top_fnum, simply set top_fnum to None.
    Returns:
        unstable_feats (list): the unstable features
    """
    # Extract (year, month) features
    df['Date'] = pd.to_datetime(df['Date'])
    df['Date_year'] = df['Date'].dt.year
    df['Date_month'] = df['Date'].dt.month

    def concat_year_month(year, month):
        return (year, month)

    df['Date_ym'] = df.apply(lambda x: concat_year_month(x['Date_year'], x['Date_month']), axis=1)
    # For each month, compute the correlation between every feature and the target
    date_yms = df['Date_ym'].unique()
    corr_df = []
    for date_ym in date_yms:
        curr_df = df[df['Date_ym'] == date_ym]
        curr_corr_df = curr_df.corr(numeric_only=True)  # numeric_only: ignore Date / Date_ym
        curr_corr_df = curr_corr_df['target'].reset_index()
        curr_corr_df.rename(columns={'index': 'feature', 'target': 'corr'}, inplace=True)
        curr_corr_df['Date_ym'] = str(date_ym)
        corr_df.append(curr_corr_df)
    corr_df = pd.concat(corr_df, axis=0).reset_index(drop=True)
    # Drop columns that are not training features
    USE_COLS = [f for f in corr_df['feature'].unique()
                if f not in ['Date', 'Date_year', 'Date_month', 'target']]
    corr_df = corr_df[corr_df['feature'].isin(USE_COLS)]
    # Pick the TOP unstable features by the std of each feature's monthly correlation
    corr_std_df = corr_df.groupby('feature')['corr'].std().reset_index().sort_values('corr', ascending=False)
    if top_fnum is not None:
        top_corrStd_feats = corr_std_df['feature'].iloc[:top_fnum].to_list()
    elif corr_thresh is not None:
        top_corrStd_feats = corr_std_df[corr_std_df['corr'] >= corr_thresh]['feature'].to_list()
    print('Features with Unstable Correlation:', top_corrStd_feats)
    return top_corrStd_feats
top_corrStd_feats = get_unstable_feats(train_df, top_fnum=10, corr_thresh=None)
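To select by the threshold instead of a fixed count, and then actually drop the flagged features before training, the call might look like this (the 0.15 threshold is just the function's default; tune it on your own data):
unstable_feats = get_unstable_feats(train_df, top_fnum=None, corr_thresh=0.15)
train_df_stable = train_df.drop(columns=unstable_feats)  # keep only the stable features for modelling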
In practice, this method has indeed helped improve scores on datasets that contain redundant features with clearly unstable correlations.
Extension: besides correlation analysis, adversarial validation, a trick commonly seen on Kaggle, can also be used to screen out such unstable features.
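As a rough sketch of that idea (not from the original article; the test set, the feature list, and the choice of LightGBM are assumptions here): label training rows 0 and test rows 1, fit a classifier to tell them apart, and treat the features it leans on most as the ones whose distribution drifts between the two periods.
import lightgbm as lgb

def adversarial_feature_ranking(train_df, test_df, feats):
    """Rank features by how much they help a classifier separate train from test."""
    X = pd.concat([train_df[feats], test_df[feats]], axis=0)
    y = np.array([0] * len(train_df) + [1] * len(test_df))
    clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
    clf.fit(X, y)
    # High importance = easy to tell the eras apart with this feature = unstable over time.
    return pd.Series(clf.feature_importances_, index=feats).sort_values(ascending=False)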
Reference: https://www.kaggle.com/competitions/ubiquant-market-prediction/discussion/312398