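Balancing a numpy array with oversampling
Please help me find a clean way to create a new array out of an existing one. A class should be oversampled if it has fewer examples than the class with the most examples. The extra examples may be taken from the original array, either randomly or sequentially.
Let's say the initial array is this: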
[ 2, 29, 30, 1]
[ 5, 50, 46, 0]
[ 1, 7, 89, 1]
[ 0, 10, 92, 9]
[ 4, 11, 8, 1]
[ 3, 92, 1, 0]
The last column contains the classes:
classes = [ 0, 1, 9]
The distribution of the classes is as follows:
distrib = [2, 3, 1]
What I need is a new array in which every class has the same number of examples, with the examples drawn at random from the original array, e.g.:
[ 5, 50, 46, 0]
[ 3, 92, 1, 0]
[ 5, 50, 46, 0] # one example added
[ 2, 29, 30, 1]
[ 1, 7, 89, 1]
[ 4, 11, 8, 1]
[ 0, 10, 92, 9]
[ 0, 10, 92, 9] # two examples
[ 0, 10, 92, 9] # added
3 Answers
1
You can use the imbalanced-learn package:
import numpy as np
from imblearn.over_sampling import RandomOverSampler
data = np.array([
[ 2, 29, 30, 1],
[ 5, 50, 46, 0],
[ 1, 7, 89, 1],
[ 0, 10, 92, 9],
[ 4, 11, 8, 1],
[ 3, 92, 1, 0]
])
ros = RandomOverSampler()
# fit_resample expects two arguments: a matrix of sample data and a vector of
# sample labels. In this case, the sample data is in the first three columns of
# our array and the labels are in the last column
X_resampled, y_resampled = ros.fit_resample(data[:, :-1], data[:, -1])
# fit_resample returns a matrix of resampled data and a vector with the
# corresponding labels. Combine them into a single matrix
resampled = np.column_stack((X_resampled, y_resampled))
print(resampled)
Output:
[[ 2 29 30 1]
[ 5 50 46 0]
[ 1 7 89 1]
[ 0 10 92 9]
[ 4 11 8 1]
[ 3 92 1 0]
[ 3 92 1 0]
[ 0 10 92 9]
[ 0 10 92 9]]
RandomOverSampler offers different sampling strategies, but by default it resamples every class except the majority class.
5
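The following draws rows at random with weights chosen so that every class has the same total probability of being selected (a is the array from the question):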
distrib = np.bincount(a[:,-1])
prob = 1/distrib[a[:, -1]].astype(float)
prob /= prob.sum()
In [38]: a[np.random.choice(np.arange(len(a)), size=np.count_nonzero(distrib)*distrib.max(), p=prob)]
Out[38]:
array([[ 5, 50, 46, 0],
[ 4, 11, 8, 1],
[ 0, 10, 92, 9],
[ 0, 10, 92, 9],
[ 2, 29, 30, 1],
[ 0, 10, 92, 9],
[ 3, 92, 1, 0],
[ 1, 7, 89, 1],
[ 1, 7, 89, 1]])
Every class has the same probability of being selected, but that does not guarantee that every class appears the same number of times.
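To put numbers on that caveat, here is a minimal sketch that repeats the weighted draw with a seeded generator and counts the rows per class. The use of np.random.default_rng, the seed value, and the names class_mass and sample_counts are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np

a = np.array([[2, 29, 30, 1],
              [5, 50, 46, 0],
              [1, 7, 89, 1],
              [0, 10, 92, 9],
              [4, 11, 8, 1],
              [3, 92, 1, 0]])

labels = a[:, -1]
distrib = np.bincount(labels)      # counts indexed by label value
prob = 1.0 / distrib[labels]       # rows of rarer classes get larger weights
prob /= prob.sum()

# The total probability mass per class is equal (1/3 each for three classes).
classes = np.unique(labels)
class_mass = np.array([prob[labels == c].sum() for c in classes])

rng = np.random.default_rng(0)     # arbitrary seed (assumption)
size = np.count_nonzero(distrib) * distrib.max()
sample = a[rng.choice(len(a), size=size, p=prob)]

# Per-class counts in this particular draw; they vary from run to run.
sample_classes, sample_counts = np.unique(sample[:, -1], return_counts=True)
print(class_mass)
print(dict(zip(sample_classes.tolist(), sample_counts.tolist())))
```

The per-class masses are identical, while the printed counts are only balanced on average over many draws.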
11
The following code does what you are after:
a = np.array([[ 2, 29, 30, 1],
[ 5, 50, 46, 0],
[ 1, 7, 89, 1],
[ 0, 10, 92, 9],
[ 4, 11, 8, 1],
[ 3, 92, 1, 0]])
unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
unq_cnt = np.bincount(unq_idx)
cnt = np.max(unq_cnt)
out = np.empty((cnt*len(unq),) + a.shape[1:], a.dtype)
for j in range(len(unq)):
    indices = np.random.choice(np.where(unq_idx==j)[0], cnt)
    out[j*cnt:(j+1)*cnt] = a[indices]
>>> out
array([[ 5, 50, 46, 0],
[ 5, 50, 46, 0],
[ 5, 50, 46, 0],
[ 1, 7, 89, 1],
[ 4, 11, 8, 1],
[ 2, 29, 30, 1],
[ 0, 10, 92, 9],
[ 0, 10, 92, 9],
[ 0, 10, 92, 9]])
With numpy 1.9 or later, np.unique accepts return_counts, so the np.unique and np.bincount calls can be condensed into one:
unq, unq_idx, unq_cnt = np.unique(a[:, -1], return_inverse=True,
return_counts=True)
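As a quick check of the condensed form, the sketch below runs it on the example array and prints the three returned arrays. The values in the comments follow from the class distribution stated in the question.

```python
import numpy as np

a = np.array([[2, 29, 30, 1],
              [5, 50, 46, 0],
              [1, 7, 89, 1],
              [0, 10, 92, 9],
              [4, 11, 8, 1],
              [3, 92, 1, 0]])

unq, unq_idx, unq_cnt = np.unique(a[:, -1], return_inverse=True,
                                  return_counts=True)
print(unq)      # [0 1 9]        sorted class labels
print(unq_idx)  # [1 0 1 2 1 0]  position of each row's label within unq
print(unq_cnt)  # [2 3 1]        number of rows per class
```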
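Note that, because of the way np.random.choice works, there is no guarantee that every row of the original array will be present in the output, as the example above shows. If you need that guarantee, you can do something like this: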
unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
unq_cnt = np.bincount(unq_idx)
cnt = np.max(unq_cnt)
out = np.empty((cnt*len(unq) - len(a),) + a.shape[1:], a.dtype)
slices = np.concatenate(([0], np.cumsum(cnt - unq_cnt)))
for j in range(len(unq)):
    indices = np.random.choice(np.where(unq_idx==j)[0], cnt - unq_cnt[j])
    out[slices[j]:slices[j+1]] = a[indices]
out = np.vstack((a, out))
>>> out
array([[ 2, 29, 30, 1],
[ 5, 50, 46, 0],
[ 1, 7, 89, 1],
[ 0, 10, 92, 9],
[ 4, 11, 8, 1],
[ 3, 92, 1, 0],
[ 5, 50, 46, 0],
[ 0, 10, 92, 9],
[ 0, 10, 92, 9]])
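For reuse, the second approach could be wrapped in a function along these lines. This is only a sketch: the name oversample_keep_all, the optional rng parameter, and the use of np.random.default_rng are assumptions for illustration, and it requires numpy 1.17 or later for the Generator API.

```python
import numpy as np

def oversample_keep_all(a, rng=None):
    """Hypothetical helper: return `a` plus randomly duplicated rows so that
    every class (last column) has as many rows as the largest class."""
    rng = np.random.default_rng() if rng is None else rng
    unq, unq_idx, unq_cnt = np.unique(a[:, -1], return_inverse=True,
                                      return_counts=True)
    cnt = unq_cnt.max()
    extra = []
    for j in range(len(unq)):
        rows = np.flatnonzero(unq_idx == j)
        # Draw only the missing rows for class j (none for the largest class).
        extra.append(rng.choice(rows, size=cnt - unq_cnt[j]))
    return np.vstack((a, a[np.concatenate(extra)]))

a = np.array([[2, 29, 30, 1],
              [5, 50, 46, 0],
              [1, 7, 89, 1],
              [0, 10, 92, 9],
              [4, 11, 8, 1],
              [3, 92, 1, 0]])

out = oversample_keep_all(a, np.random.default_rng(0))  # arbitrary seed
print(np.unique(out[:, -1], return_counts=True)[1])     # [3 3 3]
```

The original rows come first and unchanged; only the appended rows are random, so every class ends up with the same count regardless of the seed.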