I'm still confused about how a MultiIndex works in pandas. I created a MultiIndex as follows:
import pandas as pd
import numpy as np
arrays = [np.array(['pearson', 'pearson', 'pearson', 'pearson', 'spearman', 'spearman',
'spearman', 'spearman', 'kendall', 'kendall', 'kendall', 'kendall']),
np.array(['PROFESSIONAL', 'PROFESSIONAL', 'STUDENT', 'STUDENT',
'PROFESSIONAL', 'PROFESSIONAL', 'STUDENT', 'STUDENT',
'PROFESSIONAL', 'PROFESSIONAL', 'STUDENT', 'STUDENT']),
np.array(['r', 'p', 'r', 'p', 'rho', 'p', 'rho', 'p', 'tau', 'p', 'tau', 'p'])]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['correlator', 'expertise', 'coeff-p'])
Then I used it to create an empty DataFrame and set the name of the columns axis to 'pair':
results_df = pd.DataFrame(index=index)
results_df.columns.names = ['pair']
Filled with some toy data (results_df['attr1-attr2'] = [1,2,3,4,5,6,7,8,9,10,11,12]), it looks like this:
pair                               attr1-attr2
correlator expertise    coeff-p
pearson    PROFESSIONAL r                    1
                        p                    2
           STUDENT      r                    3
                        p                    4
spearman   PROFESSIONAL rho                  5
                        p                    6
           STUDENT      rho                  7
                        p                    8
kendall    PROFESSIONAL tau                  9
                        p                   10
           STUDENT      tau                 11
                        p                   12
However, instead of dummy values I want to add the values from a dictionary. For each attr-attr pair, the dictionary entry looks like this:
'attr-attr': {
    'pearson': {
        'STUDENT': {
            'r': VALUE,
            'p': VALUE
        },
        'PROFESSIONAL': {
            'r': VALUE,
            'p': VALUE
        }
    },
    'spearman': {
        'STUDENT': {
            'r': VALUE,
            'p': VALUE
        },
        'PROFESSIONAL': {
            'r': VALUE,
            'p': VALUE
        }
    },
    'kendall': {
        'STUDENT': {
            'r': VALUE,
            'p': VALUE
        },
        'PROFESSIONAL': {
            'r': VALUE,
            'p': VALUE
        }
    }
}
Here is some actual sample data you can use:
correlations = {'NormNedit-NormEC_TOT': {'pearson': {'PROFESSIONAL': {'r': 0.13615071018351657, 'p': 0.0002409555504769095}}, 'spearman': {'STUDENT': {'rho': 0.10867061294616957, 'p': 0.003437711066527592}, 'PROFESSIONAL': {'tau': 0.08185775947238913, 'p': 0.003435247172206748}}, 'kendall': {'STUDENT': {'tau': 0.08185775947238913, 'p': 0.003435247172206748}}}, 'NormLiteral-NormEC_TOT': {'pearson': {'PROFESSIONAL': {'r': 0.13615071018351657, 'p': 0.0002409555504769095}, 'STUDENT': {'tau': 0.08185775947238913, 'p': 0.003435247172206748}}, 'spearman': {'STUDENT': {'rho': 0.10867061294616957, 'p': 0.003437711066527592}, 'PROFESSIONAL': {'r': 0.13615071018351657, 'p': 0.0002409555504769095}}, 'kendall': {'STUDENT': {'tau': 0.08185775947238913, 'p': 0.003435247172206748}}}, 'NormHTra-NormEC_TOT': {'pearson': {'STUDENT': {'r': 0.13615071018351657, 'p': 0.0002409555504769095}}, 'spearman': {'STUDENT': {'rho': 0.10867061294616957, 'p': 0.003437711066527592}, 'PROFESSIONAL': {'r': 0.13615071018351657, 'p': 0.0002409555504769095}}, 'kendall': {'STUDENT': {'tau': 0.08185775947238913, 'p': 0.003435247172206748}}}, 'NormScatter-NormEC_TOT': {'pearson': {'STUDENT': {'r': 0.13615071018351657, 'p': 0.0002409555504769095}}, 'spearman': {'STUDENT': {'rho': 0.10867061294616957, 'p': 0.003437711066527592}, 'PROFESSIONAL': {'tau': 0.08185775947238913, 'p': 0.003435247172206748}}, 'kendall': {'PROFESSIONAL': {'tau': 0.08185775947238913, 'p': 0.003435247172206748}}}, 'NormCrossS-NormEC_TOT': {'pearson': {'STUDENT': {'r': 0.13615071018351657, 'p': 0.0002409555504769095}, 'PROFESSIONAL': {'rho': 0.10867061294616957, 'p': 0.003437711066527592}}, 'spearman': {'STUDENT': {'rho': 0.10867061294616957, 'p': 0.003437711066527592}, 'PROFESSIONAL': {'rho': 0.10867061294616957, 'p': 0.003437711066527592}}, 'kendall': {'PROFESSIONAL': {'tau': 0.08185775947238913, 'p': 0.003435247172206748}}}, 'NormPdur-NormEC_TOT': {'pearson': {'STUDENT': {'r': 0.13615071018351657, 'p': 0.0002409555504769095}, 
'PROFESSIONAL': {'rho': 0.10867061294616957, 'p': 0.003437711066527592}}, 'spearman': {'STUDENT': {'rho': 0.10867061294616957, 'p': 0.003437711066527592}}, 'kendall': {'PROFESSIONAL': {'tau': 0.08185775947238913, 'p': 0.003435247172206748}}}}
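So, with each 'attr-attr' (the top-level key) as a column name, I want to add its values to the corresponding rows of the MultiIndex. However, I can't seem to find an efficient way to do this. Missing values should be np.nan. I tried looping over the dictionary and assigning through query() with [], but without success.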
for attr, attr_d in correlations.items():
    for correl, correl_d in attr_d.items():
        for split, split_d in correl_d.items():
            results_df.query(f"correlator == {correl} and expertise == {split} and coeff_p == 'p'")[attr] = split_d['p']
            results_df.query(f"correlator == {correl} and expertise == {split} and coeff_p != 'p'")[attr] = split_d['r'] if 'r' in split_d else split_d['rho'] if 'rho' in split_d else split['tau']
> pandas.core.computation.ops.UndefinedVariableError: name 'pearson' is not defined
I know the data is relatively complex, so please let me know if anything is unclear.
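On the error itself: the f-string inserts the correlator name without quotes, so query() sees `correlator == pearson` and looks for a variable called pearson. The index level is also named 'coeff-p', while the query refers to coeff_p. And assigning into the object that query() returns would not write back to results_df in any case. Below is a minimal sketch of a loop that assigns through .loc with the full index tuple instead. The shortened `correlations` dictionary and the column name 'attr1-attr2' are made-up stand-ins, not the real data.

```python
import numpy as np
import pandas as pd

# Rebuild the three-level index from the question.
tuples = [
    (corr, exp, coeff)
    for corr, stat in [("pearson", "r"), ("spearman", "rho"), ("kendall", "tau")]
    for exp in ("PROFESSIONAL", "STUDENT")
    for coeff in (stat, "p")
]
index = pd.MultiIndex.from_tuples(tuples, names=["correlator", "expertise", "coeff-p"])
results_df = pd.DataFrame(index=index)

# Hypothetical, shortened stand-in for the `correlations` dictionary.
correlations = {
    "attr1-attr2": {
        "pearson": {"PROFESSIONAL": {"r": 0.136, "p": 0.0002}},
        "kendall": {"STUDENT": {"tau": 0.082, "p": 0.0034}},
    }
}

for attr, attr_d in correlations.items():
    results_df[attr] = np.nan  # every row starts out as missing
    for correl, correl_d in attr_d.items():
        for split, split_d in correl_d.items():
            for coeff, value in split_d.items():
                key = (correl, split, coeff)
                # Skip entries whose labels do not exist in the index.
                if key in results_df.index:
                    results_df.loc[key, attr] = value
```

This works, but it sets one cell at a time, so it is slow for larger dictionaries.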
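You can adapt Wouter Overmeire's answer to this question to build a MultiIndex DataFrame from the nested dictionary.
If you want the columns to come from the top level of the nested dictionary (the attr-attr level), you can unstack the result.
Note: I think there is an error in your sample data, where 'PROFESSIONAL': {'STUDENT': ... appears. If that is not an error and I have simply misunderstood something, please let me know.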
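A minimal sketch of that idea, which may differ in detail from the linked answer: flatten the nested dictionary into a Series keyed by 4-tuples, unstack the pair level into columns, then reindex against the full index from the question. The two-pair `correlations` dictionary and the pair names here are made up for illustration.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the `correlations` dictionary.
correlations = {
    "attrA-attrC": {
        "pearson": {"PROFESSIONAL": {"r": 0.136, "p": 0.0002}},
        "spearman": {"STUDENT": {"rho": 0.109, "p": 0.0034}},
    },
    "attrB-attrC": {
        "pearson": {"STUDENT": {"r": 0.136, "p": 0.0002}},
        "kendall": {"PROFESSIONAL": {"tau": 0.082, "p": 0.0034}},
    },
}

# Flatten to {(pair, correlator, expertise, coeff): value}.
flat = {
    (pair, correlator, expertise, coeff): value
    for pair, by_correlator in correlations.items()
    for correlator, by_expertise in by_correlator.items()
    for expertise, coeffs in by_expertise.items()
    for coeff, value in coeffs.items()
}

# Tuple keys become a four-level MultiIndex.
series = pd.Series(flat)
series.index.names = ["pair", "correlator", "expertise", "coeff-p"]

# Move the pair level into the columns; combinations a pair lacks become NaN.
results_df = series.unstack("pair")

# Full target index from the question, so rows missing from every pair still appear.
tuples = [
    (corr, exp, coeff)
    for corr, stat in [("pearson", "r"), ("spearman", "rho"), ("kendall", "tau")]
    for exp in ("PROFESSIONAL", "STUDENT")
    for coeff in (stat, "p")
]
index = pd.MultiIndex.from_tuples(tuples, names=["correlator", "expertise", "coeff-p"])
results_df = results_df.reindex(index)
```

The reindex step also drops any row whose labels are not in the target index, for example a 'tau' entry stored under 'spearman'. Entries like that in the sample data would disappear silently, so check the data first.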