Python: computing conditional probabilities for all feature-level combinations in a pandas DataFrame

import numpy as np
import pandas as pd

df = pd.read_csv('pathToData.csv')
df = df.fillna("null")

# Assign a matrix index to every "<column>_<level>" pair.
cols = 0
col_levels = []
columns = {}
num = 0
for i in df.columns:
    cols += len(set(df[i]))
    col_levels.append(np.sort(list(set(df[i]))))
    for j in np.sort(list(set(df[i]))):
        columns[i + '_' + str(j)] = num
        num += 1

# res[r, c] = P(column level c | row level r); the diagonal stays 1.
res = np.eye(cols)
for i in range(len(df.columns)):
    for j in range(len(df.columns)):
        if i != j:
            row_feature = df.columns[i]
            col_feature = df.columns[j]
            rowLevels = col_levels[i]
            colLevels = col_levels[j]
            for ii in rowLevels:
                for jj in colLevels:
                    frst = (df[row_feature] == ii) * 1
                    scnd = (df[col_feature] == jj) * 1
                    # The small constant guards against division by zero.
                    prob = sum(frst*scnd)/(sum(frst) + 1e-9)
                    # str() matches how the keys were built above.
                    frst_ind = columns[row_feature + '_' + str(ii)]
                    scnd_ind = columns[col_feature + '_' + str(jj)]
                    res[frst_ind, scnd_ind] = prob

1 Answer

Anonymous user

#1 · Posted 2024-05-15 17:38:25

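My approach is to first collect all the unique levels in the dataset and then loop over the Cartesian product of those levels. At each step, filter the dataset down to the subset of rows where the condition holds, then count the rows in that subset where the event occurs. Here is my code.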

import pandas as pd
from itertools import product
from collections import defaultdict

df = pd.DataFrame({
    'col1': ['a', 'a', 'b'],
    'col2': ['x', 'y', 'x'],
    'col3': ['l', 'l', 'l']
})

levels = df.stack().unique()

res = defaultdict(dict)
for event, cond in product(levels, levels):

    # create a subset of rows with at least one element equal to cond
    conditional_set = df[(df == cond).any(axis=1)]
    conditional_set_size = len(conditional_set)

    # count the number of rows in the subset where at least one element is equal to event
    conditional_event_count = (conditional_set == event).any(axis=1).sum()

    res[event][cond] = conditional_event_count / conditional_set_size

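# Rows are the condition, columns are the event:
# result_df.loc[cond, event] = P(event | cond).
# Sorting both axes puts rows and columns in a fixed alphabetical order.
result_df = pd.DataFrame(res).sort_index().sort_index(axis=1)
print(result_df)

# OUTPUT
#           a         b    l         x         y
# a  1.000000  0.000000  1.0  0.500000  0.500000
# b  0.000000  1.000000  1.0  1.000000  0.000000
# l  0.666667  0.333333  1.0  0.666667  0.333333
# x  0.500000  0.500000  1.0  1.000000  0.000000
# y  1.000000  0.000000  1.0  0.000000  1.000000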

I'm sure there are faster ways to do this, but this is the first one that came to mind.
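One possible faster route, offered as a sketch rather than a tested replacement: one-hot encode every column and get all pairwise co-occurrence counts from a single matrix product, then divide each row by how often its conditioning level occurs. This is a different technique from the loop above. The names `onehot`, `joint`, `cond_counts` and `cond_prob` are made up for illustration, and the sketch assumes every column should be treated as categorical (hence the `astype(str)`).

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'a', 'b'],
    'col2': ['x', 'y', 'x'],
    'col3': ['l', 'l', 'l'],
})

# One indicator column per (feature, level) pair, named like "col1_a".
# astype(str) is an assumption: it forces every column to be treated as categorical.
onehot = pd.get_dummies(df.astype(str), prefix_sep='_').astype(float)

# joint.loc[r, c] = number of rows where indicators r and c are both 1.
joint = onehot.T @ onehot

# Number of rows in which each level occurs (the conditioning counts).
cond_counts = onehot.sum(axis=0)

# Divide each row by its own count:
# cond_prob.loc[cond, event] = P(event | cond).
cond_prob = joint.div(cond_counts, axis=0)

print(cond_prob.loc['col1_a', 'col2_x'])  # P(col2 == 'x' | col1 == 'a')
```

Because the labels carry the column name, identical values that appear in different columns stay separate, which matches the `column_level` keys used in the question. Unlike the question's matrix, entries for two different levels of the same feature come out as 0 here, since one row cannot hold both.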
