Sampling data into different ratios using logic

Customer  Orderid  item_name
A         1        orange
A         1        apple
A         1        banana
A         2        apple
A         2        carrot
A         3        orange
A         4        grape
A         4        watermelon
A         4        banana
B         1        pineapple
B         2        banana
B         3        papaya
B         3        Lime

train: should contain all customers and all item_names (70% of the complete data)

Customer  item
A         orange
A         apple
A         banana
A         carrot
A         grape
A         watermelon
B         pineapple
B         banana
B         papaya
B         Lime

validation: should contain all customers; item_names can be a subset of train (15% of the complete data)

Customer  item
A         orange
A         apple
A         banana
B         pineapple
B         banana
B         papaya
B         Lime

test: should contain all customers; item_names can be a subset of train (15% of the complete data)

Customer  item
A         carrot
A         grape
A         watermelon
B         papaya
B         Lime

1 Answer

User

#1 · Posted on 2024-05-16 20:46:30

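As @Parth mentioned in the comments, you first need a dataset that allows this kind of stratified split. Then you can create a new column that combines "Customer" and "item_name" and pass it to the "stratify" parameter of the "train_test_split" method, which is part of sklearn.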
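Whether a dataset allows such a split can be checked by counting the rows per combination. The following is a minimal sketch on the data from the question; the threshold of 3 rows is an assumption (a combination needs at least one row per split to appear in train, validation and test), and it is a necessary condition rather than a guarantee.

```python
import pandas as pd

# Order data from the question (one row per ordered item).
orders = pd.DataFrame({
    "Customer": ["A"] * 9 + ["B"] * 4,
    "item_name": ["orange", "apple", "banana", "apple", "carrot", "orange",
                  "grape", "watermelon", "banana",
                  "pineapple", "banana", "papaya", "Lime"],
})

# One stratum per (Customer, item_name) combination.
key = orders["Customer"] + "_" + orders["item_name"]
counts = key.value_counts()

# Assumption: a combination needs at least 3 rows to be able to appear
# in train, validation and test at the same time.
MIN_ROWS = 3
too_rare = counts[counts < MIN_ROWS]
print(f"{len(too_rare)} of {len(counts)} combinations have fewer than {MIN_ROWS} rows")
```

On the question's data every combination occurs at most twice, so a stratified three-way split on this key is not possible without more rows per combination.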

Below is an example:

import pandas as pd
from sklearn.model_selection import train_test_split

#Create sample data
data = {
    "Customer":["A", "A", "A", "A","A","A","A","A","A", "B", "B", "B","B", "B", "B", "B","B","B"],
    "Orderid":[1, 1, 1, 2, 2, 2, 2, 3, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2],
    "item_name":[
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple"
       ]
}
# Convert data to dataframe
df = pd.DataFrame(data)
# Create a new column combining "Customer" and "item_name" to pass as the "stratify"
# argument of train_test_split from "sklearn.model_selection"
df["CustAndItem"] = df["Customer"]+"_"+df["item_name"]

# First split the data into "train" and "test" sets. In this example 40% of the data
# is held out as "test" and 60% is kept as "train"
X_train, X_test, y_train, y_test = train_test_split(df.index,
                                                    df["CustAndItem"],
                                                    test_size=0.4,
                                                    stratify=df["CustAndItem"])

# Get actual data after split operation
df_train = df.loc[X_train].copy(True)
df_test = df.loc[X_test].copy(True)

# Now split the held-out "test" set into "validation" and "test" sets. They are split equally
# (test_size = 0.5), so each contains 20% of the full data set.
X_validate, X_test, y_validate, y_test = train_test_split(df_test.index,
                                                          df_test["CustAndItem"],
                                                          test_size= 0.5,
                                                          stratify=df_test["CustAndItem"])
# Get actual data after split
df_validate = df_test.loc[X_validate]
df_test = df_test.loc[X_test]

# Print results
print(df_train)
print(df_validate)
print(df_test)
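The same two-step approach can be adapted to the 70% / 15% / 15% ratios asked for in the question. The sketch below uses synthetic data (an assumption for illustration, with 20 rows per combination so that stratification is feasible) and a fixed random_state for reproducibility; it then checks that every Customer/item_name combination appears in each split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 2 customers x 2 items, 20 rows per combination.
pairs = [(c, i) for c in ["A", "B"] for i in ["orange", "apple"]]
df = pd.DataFrame(pairs * 20, columns=["Customer", "item_name"])
df["CustAndItem"] = df["Customer"] + "_" + df["item_name"]

# Step 1: keep 70% as train, hold out 30%.
df_train, df_hold = train_test_split(
    df, test_size=0.30, stratify=df["CustAndItem"], random_state=0)

# Step 2: split the holdout in half, giving 15% validation and 15% test.
df_validate, df_test = train_test_split(
    df_hold, test_size=0.50, stratify=df_hold["CustAndItem"], random_state=0)

# Check that no combination is missing from any split.
all_pairs = set(df["CustAndItem"])
for name, part in [("train", df_train), ("validation", df_validate), ("test", df_test)]:
    missing = all_pairs - set(part["CustAndItem"])
    print(name, len(part), "rows, missing combinations:", sorted(missing))
```

Passing the DataFrame itself to train_test_split returns the split DataFrames directly, which avoids the index lookup with df.loc used above; either form should give an equivalent result.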
