Improving a DataFrame transformation in Python

Published 2024-04-25 23:36:29


I have a pandas DataFrame in the following format:

            id2_cond1  id2_cond2  id2_cond3  id2_cond4
id2_cond1   1.000000   0.819689  -0.753702  -0.617213
id2_cond2   0.819689   1.000000  -0.554437  -0.295122
id2_cond3  -0.753702  -0.554437   1.000000   0.939336
id2_cond4  -0.617213  -0.295122   0.939336   1.000000

What I want to do is convert the DataFrame into the following form:

      cond1_cond2 cond1_cond3 cond1_cond4 cond2_cond3 cond2_cond4 cond3_cond4
id2    0.8196886  -0.7537023  -0.6172134   -0.554437  -0.2951216   0.9393364

I can do this correctly with the following script:

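# identifier (e.g. "id2") and cols must be defined beforehand
df_tmp = pd.DataFrame(index=[identifier], columns=cols)
counter = 0
for x in range(len(df)):
    for y in range(x + 1, len(df)):
        # .iloc replaces the positional use of .ix, which was removed in pandas 1.0
        df_tmp.iloc[0, counter] = df.iloc[x, y]
        counter += 1
print(df_tmp)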

The problem with this approach is that I have to predefine the columns, and I have to know their order:

cols = ["cond1_cond2", "cond1_cond3", "cond1_cond4", "cond2_cond3", "cond2_cond4", "cond3_cond4"]

Is there a better way to transform this DataFrame that creates the different combinations automatically?


2 answers

The original DataFrame:

import pandas as pd

df = pd.DataFrame({'id2_cond1': {'id2_cond1': 1.0, 'id2_cond2': 0.81968899999999989, 'id2_cond3': -0.75370200000000009, 'id2_cond4': -0.61721300000000001},
                   'id2_cond2': {'id2_cond1': 0.81968899999999989, 'id2_cond2': 1.0, 'id2_cond3': -0.55443699999999996, 'id2_cond4': -0.295122},
                   'id2_cond3': {'id2_cond1': -0.75370200000000009, 'id2_cond2': -0.55443699999999996, 'id2_cond3': 1.0, 'id2_cond4': 0.93933600000000006},
                   'id2_cond4': {'id2_cond1': -0.61721300000000001, 'id2_cond2': -0.295122, 'id2_cond3': 0.93933600000000006, 'id2_cond4': 1.0}})

First, let's extract the name (in this case, "id2"):

name = df.index[0].split("_")[0]

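Then let's get the name of each condition. I assume a name may itself contain underscores (none do in this example), so I first split on the underscore, take every element except the first, and then join them back together with underscores: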

conds = ["_".join(i.split("_")[1:]) for i in df.index]
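To illustrate why the tail is rejoined, here is a minimal sketch using a hypothetical label, "id2_cond_a", which does not appear in the question's data. It also shows that splitting only once at the first underscore should give the same result.

```python
# Hypothetical label whose condition part itself contains an underscore.
label = "id2_cond_a"

# Approach from the text: split on every underscore, then rejoin the tail.
parts = label.split("_")
name = parts[0]
cond = "_".join(parts[1:])

# Equivalent shortcut: split only once, at the first underscore.
name_alt, cond_alt = label.split("_", 1)

print(name, cond)
print(name_alt, cond_alt)
```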

Now let's use a list comprehension to generate all the name combinations:

idx = ['{0}_{1}'.format(conds[i], conds[j]) 
        for i in range(len(conds)) 
        for j in range(i + 1, len(conds))]

We'll use the same technique to flatten the data:

data = [df.iat[i, j] 
        for i in range(len(conds)) 
        for j in range(i + 1, len(conds))]

Finally, we create a Series from the information above:

corr_matrix_flat = pd.Series(data, index=idx, name=name)
>>> corr_matrix_flat
cond1_cond2    0.819689
cond1_cond3   -0.753702
cond1_cond4   -0.617213
cond2_cond3   -0.554437
cond2_cond4   -0.295122
cond3_cond4    0.939336
Name: id2, dtype: float64
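The steps above can also be written with itertools.combinations, which yields the index pairs in the same order as the nested loops. The sketch below is one possible variant, not the answer's own code: it rebuilds the frame from the rounded values shown in the question, and the names pairs, flat and wide are made up for illustration. The final to_frame().T call turns the Series into the one-row DataFrame the question asks for.

```python
from itertools import combinations

import pandas as pd

# Rounded correlation values as printed in the question.
labels = ["id2_cond1", "id2_cond2", "id2_cond3", "id2_cond4"]
values = [
    [1.000000, 0.819689, -0.753702, -0.617213],
    [0.819689, 1.000000, -0.554437, -0.295122],
    [-0.753702, -0.554437, 1.000000, 0.939336],
    [-0.617213, -0.295122, 0.939336, 1.000000],
]
df = pd.DataFrame(values, index=labels, columns=labels)

# Identifier and condition names, derived as in the steps above.
name = df.index[0].split("_")[0]
conds = ["_".join(i.split("_")[1:]) for i in df.index]

# combinations(range(n), 2) yields (0, 1), (0, 2), ..., (2, 3):
# the upper triangle, row by row, without the diagonal.
pairs = list(combinations(range(len(conds)), 2))
flat = pd.Series(
    [df.iat[i, j] for i, j in pairs],
    index=["{0}_{1}".format(conds[i], conds[j]) for i, j in pairs],
    name=name,
)

# One-row DataFrame indexed by the identifier.
wide = flat.to_frame().T
print(wide)
```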

Here is another version that uses the pandas built-in method stack.

import pandas as pd

df = pd.DataFrame({'id2_cond1': {'id2_cond1': 1.0, 'id2_cond2': 0.81968899999999989, 'id2_cond3': -0.75370200000000009, 'id2_cond4': -0.61721300000000001},
                   'id2_cond2': {'id2_cond1': 0.81968899999999989, 'id2_cond2': 1.0, 'id2_cond3': -0.55443699999999996, 'id2_cond4': -0.295122},
                   'id2_cond3': {'id2_cond1': -0.75370200000000009, 'id2_cond2': -0.55443699999999996, 'id2_cond3': 1.0, 'id2_cond4': 0.93933600000000006},
                   'id2_cond4': {'id2_cond1': -0.61721300000000001, 'id2_cond2': -0.295122, 'id2_cond3': 0.93933600000000006, 'id2_cond4': 1.0}})

Convert df to a Series with df.stack():

s = df.stack()
print(s)

Output:

id2_cond1  id2_cond1    1.000000
           id2_cond2    0.819689
           id2_cond3   -0.753702
           id2_cond4   -0.617213
id2_cond2  id2_cond1    0.819689
           id2_cond2    1.000000
           id2_cond3   -0.554437
           id2_cond4   -0.295122
id2_cond3  id2_cond1   -0.753702
           id2_cond2   -0.554437
           id2_cond3    1.000000
           id2_cond4    0.939336
id2_cond4  id2_cond1   -0.617213
           id2_cond2   -0.295122
           id2_cond3    0.939336
           id2_cond4    1.000000
dtype: float64

Next, remove the diagonal and the lower-triangular part.

ind_upper = []
for i in range(len(df)):
    for j in range(len(df)):
        if i < j:
            ind_upper.append(True)
        else:
            ind_upper.append(False)

s = s[ind_upper]
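The same boolean mask can probably be built without explicit loops by using NumPy. The sketch below is an alternative, not part of the original answer; it assumes NumPy is available and rebuilds the frame from the rounded values in the question. The names upper and s_upper are illustrative only.

```python
import numpy as np
import pandas as pd

# Rounded correlation values as printed in the question.
labels = ["id2_cond1", "id2_cond2", "id2_cond3", "id2_cond4"]
values = [
    [1.000000, 0.819689, -0.753702, -0.617213],
    [0.819689, 1.000000, -0.554437, -0.295122],
    [-0.753702, -0.554437, 1.000000, 0.939336],
    [-0.617213, -0.295122, 0.939336, 1.000000],
]
df = pd.DataFrame(values, index=labels, columns=labels)

# np.triu(..., k=1) keeps only the entries strictly above the diagonal.
n = len(df)
upper = np.triu(np.ones((n, n), dtype=bool), k=1)

# stack() walks the frame row by row, so the row-major flattened mask
# lines up with the stacked Series (the frame contains no NaN values).
s = df.stack()
s_upper = s[upper.ravel()]
print(s_upper)
```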

Next, merge the row and column labels into a single label.

index = list(s.index)
print(index)
[('id2_cond1', 'id2_cond2'), ('id2_cond1', 'id2_cond3'), ('id2_cond1', 'id2_cond4'), ('id2_cond2', 'id2_cond3'), ('id2_cond2', 'id2_cond4'), ('id2_cond3', 'id2_cond4')]

# "label" is used instead of "id" to avoid shadowing the built-in id()
index = ['_'.join(label) for label in index]
index = [label.replace('id2_', '') for label in index]
print(index)
['cond1_cond2', 'cond1_cond3', 'cond1_cond4', 'cond2_cond3', 'cond2_cond4', 'cond3_cond4']
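The replace call above hardcodes the "id2_" prefix. If the identifier is not known in advance, one option is to drop everything up to the first underscore of each label instead. This is a sketch under the assumption that every label has the form identifier_condition; strip_prefix is a hypothetical helper, not a pandas function.

```python
# The (row, column) label pairs printed above.
pairs = [
    ("id2_cond1", "id2_cond2"),
    ("id2_cond1", "id2_cond3"),
    ("id2_cond1", "id2_cond4"),
    ("id2_cond2", "id2_cond3"),
    ("id2_cond2", "id2_cond4"),
    ("id2_cond3", "id2_cond4"),
]


def strip_prefix(label):
    """Drop everything up to and including the first underscore."""
    return label.split("_", 1)[1]


# Strip the identifier from both labels of a pair, then join them.
index = ["_".join(strip_prefix(label) for label in pair) for pair in pairs]
print(index)
```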

Assign index to s:

s.index = index
print(s)
cond1_cond2    0.819689
cond1_cond3   -0.753702
cond1_cond4   -0.617213
cond2_cond3   -0.554437
cond2_cond4   -0.295122
cond3_cond4    0.939336
dtype: float64
