How to Handle Imbalanced Datasets (with Code)

生信阿拉丁, 2022-05-16

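In the previous post we introduced featuretools, a tool that quickly generates new feature combinations from the existing features of a dataset for later use in machine learning or deep learning. This time we cover another problem that comes up often in feature engineering, and how to solve it: imbalanced datasets.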

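Data imbalance means that the classes in a dataset are unevenly distributed. In a disease prediction model, for example, there may be 200 samples in the control group but only 20 in the disease group, an imbalance ratio of 10:1. If the dataset is not balanced, then when batches are later sampled to train a model such as an MLP, the algorithm may fail to converge.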
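As a quick check, the imbalance ratio in this example can be computed directly from the label counts. The sketch below uses made-up labels (`control`, `disease`) that match the numbers above; they are illustrative and do not come from a real dataset.

```python
from collections import Counter

# Hypothetical labels matching the example: 200 controls, 20 disease cases
labels = ["control"] * 200 + ["disease"] * 20

counts = Counter(labels)
majority = max(counts.values())
minority = min(counts.values())

# Ratio of the largest class to the smallest class
imbalance_ratio = majority / minority
print(counts)
print(f"imbalance ratio = {imbalance_ratio:.0f}:1")
```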

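We therefore need to deal with the imbalance before training. Below we introduce three approaches, based on two packages, to solve this problem.

Undersampling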

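Undersampling randomly removes part of the majority class (the class with more samples) so that its size matches that of the minority class (the class with fewer samples).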
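To make the idea concrete, here is a minimal plain-Python sketch of random undersampling. The helper `random_undersample` and the toy data are our own illustration, not part of any library; the library-based version follows in the rest of this section.

```python
import random
from collections import defaultdict

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class has as many as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    n_min = min(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # rng.sample draws without replacement, so it keeps a random subset
        for sample in rng.sample(group, n_min):
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Toy data: 200 majority samples (label 0) and 20 minority samples (label 1)
samples = list(range(220))
labels = [0] * 200 + [1] * 20
res_samples, res_labels = random_undersample(samples, labels)
print(res_labels.count(0), res_labels.count(1))  # prints: 20 20
```

Note that every minority sample is kept; only the majority class loses data, which is the main cost of this approach.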

First, let's generate a dataset:

from sklearn.datasets import load_iris
from imblearn.datasets import make_imbalance

# Load iris as pandas objects
iris = load_iris(as_frame=True)

# Keep 10, 20 and 47 samples of classes 0, 1 and 2
sampling_strategy = {0: 10, 1: 20, 2: 47}
X, y = make_imbalance(iris.data, iris.target, sampling_strategy=sampling_strategy)
y.value_counts()

The output shows the number of samples in each class, which matches {0: 10, 1: 20, 2: 47}. We use this data in the steps that follow.

Method 1: undersample with imblearn's string option

from imblearn.under_sampling import RandomUnderSampler

# Shrink every class except the minority class
sampling_strategy = "not minority"
rus = RandomUnderSampler(sampling_strategy=sampling_strategy)
X_res, y_res = rus.fit_resample(X, y)
y_res.value_counts()

The result is:

2    10
1    10
0    10

sampling_strategy accepts the following string values; feel free to try them:

'majority': resample only the majority class;

'not minority': resample all classes but the minority class;

'not majority': resample all classes but the majority class;

'all': resample all classes;

'auto': equivalent to 'not minority'.

Method 2: use a dict

Usage is as follows:

from imblearn.under_sampling import RandomUnderSampler

# Target number of samples for each class
sampling_strategy = {0: 10, 1: 15, 2: 20}
rus = RandomUnderSampler(sampling_strategy=sampling_strategy)
X_res, y_res = rus.fit_resample(X, y)
y_res.value_counts()

The result is:

2    20
1    15
0    10

As you can see, each class ends up with the number of samples specified in the dict.

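Oversampling

Oversampling adds samples to the minority class. This is typically a process of generating synthetic data: new minority-class samples are created at random based on the features of the existing minority-class samples.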
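One common way to create such a synthetic sample, and the step at the heart of SMOTE-style methods, is to pick a point at random on the line segment between a minority sample and one of its minority-class neighbors. The sketch below shows only this interpolation step in plain Python; the neighbor search is left out, and the helper name `interpolate_sample` and the two sample points are our own, not part of imblearn.

```python
import random

def interpolate_sample(x, neighbor, rng):
    """Return a random point on the line segment between x and neighbor."""
    u = rng.random()  # u is in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
# Two hypothetical minority-class samples with two features each
x = [1.0, 2.0]
neighbor = [3.0, 6.0]
synthetic = interpolate_sample(x, neighbor, rng)
print(synthetic)
```

Because the new point lies between two real minority samples, it stays inside the region the minority class already occupies instead of being an exact copy of either one.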

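There are many ways to oversample a dataset; the most common technique is SMOTE (Synthetic Minority Over-sampling Technique). The examples below use imblearn's RandomOverSampler, which simply duplicates randomly chosen minority-class samples. The other oversamplers in imblearn, including SMOTE, are called in the same way through fit_resample.

Method 1: oversample with imblearn's string option

from imblearn.over_sampling import RandomOverSampler

# Grow every class except the majority class
sampling_strategy = "not majority"
ros = RandomOverSampler(sampling_strategy=sampling_strategy)
X_res, y_res = ros.fit_resample(X, y)
y_res.value_counts()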

The result is:

2    47
1    47
0    47

sampling_strategy accepts the following string values; feel free to try them:

'minority': resample only the minority class;

'not minority': resample all classes but the minority class;

'not majority': resample all classes but the majority class;

'all': resample all classes;

'auto': equivalent to 'not majority'.

Method 2: use a dict

Usage is as follows:

from imblearn.over_sampling import RandomOverSampler

# Target number of samples for each class
sampling_strategy = {0: 25, 1: 35, 2: 47}
ros = RandomOverSampler(sampling_strategy=sampling_strategy)
X_res, y_res = ros.fit_resample(X, y)
y_res.value_counts()

The result is:

2    47
1    35
0    25

As you can see, each class ends up with the number of samples specified in the dict.

Rebalancing with PyTorch's built-in sampler

If you are used to working with torch, you can use the following script:

import torch
import torch.utils.data as Data

# Convert the pandas objects from above to tensors
X_torch = torch.from_numpy(X.values).float()
y_torch = torch.from_numpy(y.values).long()

# Number of samples per class, indexed by class label (0, 1, 2)
class_sample_count = torch.bincount(y_torch)
# Weight each class by the inverse of its frequency
weight = 1.0 / class_sample_count.float()
# Per-sample weight, looked up by each sample's label
sample_weight = weight[y_torch]

# Draw len(sample_weight) samples per epoch, with replacement
sampler = Data.WeightedRandomSampler(sample_weight, len(sample_weight))
train_data = Data.TensorDataset(X_torch, y_torch)
train_loader = Data.DataLoader(dataset=train_data, batch_size=200, sampler=sampler)

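Here WeightedRandomSampler does the rebalancing: each sample is drawn with a probability proportional to its weight, so samples from rare classes are drawn more often and the batches come out roughly balanced.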
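The effect of inverse-frequency weights can be checked without PyTorch. The sketch below applies the same weighting idea with Python's `random.choices`; it illustrates the principle only and is not how WeightedRandomSampler is implemented. The number of draws (30000) and the seed are arbitrary choices of ours.

```python
import random
from collections import Counter

# Labels with the same class counts as the dataset above: {0: 10, 1: 20, 2: 47}
labels = [0] * 10 + [1] * 20 + [2] * 47

class_count = Counter(labels)
# Inverse-frequency weight per sample, as in the script above
sample_weight = [1.0 / class_count[label] for label in labels]

rng = random.Random(0)
# Draw indices with replacement, proportionally to their weights
drawn = rng.choices(range(len(labels)), weights=sample_weight, k=30000)
drawn_fraction = Counter(labels[i] for i in drawn)
for label in sorted(drawn_fraction):
    print(label, round(drawn_fraction[label] / len(drawn), 3))
```

The weights of each class sum to 1, so every class is equally likely to be drawn and each should make up about one third of the samples, even though class 2 is almost five times as large as class 0.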

Conclusion

We hope you now have a better idea of how to handle imbalanced data. Stay tuned for more posts in this series.

Further reading:
https://mp.weixin.qq.com/s/zX9_ysAPPlrPdPKr6DZxtQ
https://imbalanced-learn.org/stable/auto_examples/api/plot_sampling_strategy_usage.html#sphx-glr-auto-examples-api-plot-sampling-strategy-usage-py

Author: 童蒙

Editor: amethyst
