I am trying to balance my data: my target variable has multiple classes, and I want to oversample so that the classes are balanced

Posted 2024-06-06 14:46:47


print(x): here "x" contains the independent variables.

           Restaurant  Cuisines  Average_Cost    Rating     Votes   Reviews      Area
    0        3.526361  0.693147      5.303305  1.504077  2.564949  1.609438  7.214504
    1        1.386294  4.127134      4.615121  1.504077  2.484907  1.609438  5.905362
    2        2.772589  1.386294      5.017280  1.526056  4.605170  3.433987  6.131226
    3        3.912023  2.833213      5.525453  1.547563  5.176150  4.564348  7.643483
    4        3.526361  2.708050      5.303305  1.435085  5.948035  5.046646  6.126869
    ...           ...       ...           ...       ...       ...       ...       ...
    11089    3.912023  0.693147      5.525453  1.648659  5.789960  5.046646  3.135494
    11090    1.386294  6.028279      4.615121  1.526056  3.610918  2.833213  7.643483
    11091    1.386294  2.397895      4.615121  1.504077  3.828641  2.944439  5.814131
    11092    1.386294  6.028279      4.615121  1.410987  3.218876  2.302585  5.905362
    11093    1.386294  6.028279      4.615121  1.029619  0.000000  0.000000  5.564520
    11094 rows × 7 columns

Here "y" is the target variable, which has multiple classes. Its class counts are:

    30 minutes     7406
    45 minutes     2665
    65 minutes      923
    120 minutes      62
    20 minutes       20
    80 minutes       14
    10 minutes        4
    Name: Delivery_Time, dtype: int64

Looking at the target variable, we can see that the "30 minutes" class has a far higher count than any of the other classes.

To balance the classes, I tried SMOTETomek to oversample my data. The code I ran and the error I got are below.

from imblearn.combine import SMOTETomek
smk = SMOTETomek(ratio = 1)
x_res, y_res = smk.fit_sample(x,y)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-426e8b86623d> in <module>()
      1 from imblearn.combine import SMOTETomek
      2 smk = SMOTETomek(ratio = 1)
----> 3 x_res, y_res = smk.fit_sample(x,y)

2 frames
/usr/local/lib/python3.6/dist-packages/imblearn/utils/_validation.py in _sampling_strategy_float(sampling_strategy, y, sampling_type)
    311     if type_y != 'binary':
    312         raise ValueError(
--> 313             '"sampling_strategy" can be a float only when the type '
    314             'of target is binary. For multi-class, use a dict.')
    315     target_stats = _count_class_sample(y)

ValueError: "sampling_strategy" can be a float only when the type of target is binary. For multi-class, use a dict.

2 answers

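I think you should keep the original class proportions of the target variable. SMOTE may boost your results on the test dataset, but the model may then fail on new data entered by users (live data).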

Whether to use SMOTE or not is up to you. If you do, you can use this code:

from imblearn.over_sampling import SMOTE
# "minority" oversamples only the smallest class
smote = SMOTE(sampling_strategy="minority")
X, Y = smote.fit_resample(x_train_data, y_train_data)

You can see the actual implementation here: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/utils/_validation.py#L355

You just need to pass a dict, as the error message says. That said, the SMOTE algorithm handles the multi-class setting internally.

Run:

from imblearn.over_sampling import SMOTE
# "minority" oversamples only the smallest class
smote = SMOTE(sampling_strategy="minority")
X, Y = smote.fit_resample(x_train, y_train)
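For the dict route that the error message points to, here is a minimal sketch of what it could look like. The helper names `balanced_strategy` and `oversample` are made up for illustration, and the class counts are copied from the question. The sketch assumes a version of imbalanced-learn that has `fit_resample` and the `sampling_strategy` parameter, and it assumes that SMOTE needs more rows in a class than `k_neighbors`, so the value is capped for the tiny "10 minutes" class. I have not run it against the asker's data.

```python
from collections import Counter


def balanced_strategy(y):
    """Map every class label to the size of the largest class."""
    counts = Counter(y)
    target = max(counts.values())
    return {label: target for label in counts}


def oversample(x, y, random_state=0):
    """Hypothetical helper: SMOTETomek driven by a per-class dict."""
    # Imported lazily so balanced_strategy() works without imbalanced-learn.
    from imblearn.combine import SMOTETomek
    from imblearn.over_sampling import SMOTE

    strategy = balanced_strategy(y)
    # Assumption: k_neighbors must be smaller than the smallest class,
    # otherwise the nearest-neighbour search inside SMOTE fails.
    k = min(5, min(Counter(y).values()) - 1)
    smote = SMOTE(sampling_strategy=strategy, k_neighbors=k,
                  random_state=random_state)
    smk = SMOTETomek(sampling_strategy=strategy, smote=smote,
                     random_state=random_state)
    return smk.fit_resample(x, y)


# Class counts taken from the value counts shown in the question.
counts = {"30 minutes": 7406, "45 minutes": 2665, "65 minutes": 923,
          "120 minutes": 62, "20 minutes": 20, "80 minutes": 14,
          "10 minutes": 4}
y_demo = [label for label, n in counts.items() for _ in range(n)]
strategy = balanced_strategy(y_demo)
print(strategy["10 minutes"])  # 7406
```

With the real data you would call `x_res, y_res = oversample(x, y)`. Because the Tomek-link step removes some samples after oversampling, the final class counts may end up slightly below the target.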
