Flattening a Pandas DataFrame
I have a DataFrame that looks like this:
import pandas as pd
df = pd.DataFrame(data={"CommonIdentifier": [1234, 1235, 1236, 1237],
                        "CommonValue": ["type1", "type2", "type1", "type1"],
                        "Identifier1": ["a", "a", "b", "b"],
                        "Value1": [1, 1.5, 2, 1.3],
                        "Identifier2": ["b", "b", "c", "a"],
                        "Value2": [4, 2, 3.2, 1]})
df
   CommonIdentifier  CommonValue  Identifier1  Identifier2  Value1  Value2
0              1234        type1            a            b     1.0     4.0
1              1235        type2            a            b     1.5     2.0
2              1236        type1            b            c     2.0     3.2
3              1237        type1            b            a     1.3     1.0
I would like to reshape it into the following form:
pd.DataFrame(data={"CommonIdentifier": [1234, 1235, 1236, 1237, 1234, 1235, 1236, 1237],
                   "CommonValue": ["type1", "type2", "type1", "type1", "type1", "type2", "type1", "type1"],
                   "IdentifierJoined": ["a", "a", "b", "b", "b", "b", "c", "a"],
                   "ValueJoined": [1, 1.5, 2, 1.3, 4, 2, 3.2, 1]})
   CommonIdentifier  CommonValue  IdentifierJoined  ValueJoined
0              1234        type1                 a          1.0
1              1235        type2                 a          1.5
2              1236        type1                 b          2.0
3              1237        type1                 b          1.3
4              1234        type1                 b          4.0
5              1235        type2                 b          2.0
6              1236        type1                 c          3.2
7              1237        type1                 a          1.0
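The above is just an example; in reality I have 9 identifier/value pairs. I am wondering whether there is a concise way to do this. I came up with the following: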
num_identifiers = 2
m1 = pd.melt(df, id_vars=['CommonIdentifier'],
             value_vars=['Value1', 'Value2'])
m2 = pd.melt(df, id_vars=['CommonIdentifier'],
             value_vars=['Identifier1', 'Identifier2'])
m3 = pd.concat([df[["CommonIdentifier", "CommonValue"]]] * num_identifiers)
m3.index = range(len(m3))
pd.concat([m3[["CommonIdentifier", "CommonValue"]],
           m2[["value"]].rename(columns={"value": "IdentifierJoined"}),
           m1[["value"]].rename(columns={"value": "ValueJoined"})], axis=1)
This does the transformation, but it feels cumbersome and verbose.
2 Answers
1
Here is one approach that uses MultiIndex, stack() and merge():
import pandas as pd
df = pd.DataFrame(data={"CommonIdentifier": [1234, 1235, 1236, 1237],
                        "CommonValue": ["type1", "type2", "type1", "type1"],
                        "Identifier1": ["a", "a", "b", "b"],
                        "Value1": [1, 1.5, 2, 1.3],
                        "Identifier2": ["b", "b", "c", "a"],
                        "Value2": [4, 2, 3.2, 1]})
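common_columns = ["CommonIdentifier", "CommonValue"]
identifier_columns = ["Identifier1", "Identifier2"]
value_columns = ["Value1", "Value2"]
common_df = df[common_columns]
mix_df = df[identifier_columns + value_columns]
# Relabel the pair columns as (name, pair number) so that stack() can move the pair number into the rows.
mix_df.columns = pd.MultiIndex.from_product([["Identifier", "Value"], range(len(identifier_columns))])
# After stack() there is one row per (original row, pair); dropping the pair level
# leaves the original row label, which is what the merge joins on.
pd.merge(common_df, mix_df.stack().reset_index(level=1, drop=True),
         left_index=True, right_index=True).reset_index(drop=True)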
2
You can use df[[...]] to select the columns you want, rename(columns=...) to change the column names, and then pd.concat to stack the resulting smaller DataFrames on top of each other:
import pandas as pd
df = pd.DataFrame(data={"CommonIdentifier": [1234, 1235, 1236, 1237],
                        "CommonValue": ["type1", "type2", "type1", "type1"],
                        "Identifier1": ["a", "a", "b", "b"],
                        "Value1": [1, 1.5, 2, 1.3],
                        "Identifier2": ["b", "b", "c", "a"],
                        "Value2": [4, 2, 3.2, 1]})
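# One column group per pair: the common columns plus Identifier<i> and Value<i>.
colgroups = [['CommonIdentifier', 'CommonValue', 'Identifier%d' % i, 'Value%d' % i]
             for i in range(1, 3)]
# Map Identifier1, Identifier2 -> IdentifierJoined and Value1, Value2 -> ValueJoined.
colmap = {'{}{}'.format(col, i): '{}Joined'.format(col)
          for i in range(1, 3) for col in ('Identifier', 'Value')}
# ignore_index=True renumbers the rows 0..7 instead of repeating 0..3.
result = pd.concat([df[cols].rename(columns=colmap) for cols in colgroups],
                   ignore_index=True)
print(result)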
which yields
   CommonIdentifier  CommonValue  IdentifierJoined  ValueJoined
0              1234        type1                 a          1.0
1              1235        type2                 a          1.5
2              1236        type1                 b          2.0
3              1237        type1                 b          1.3
4              1234        type1                 b          4.0
5              1235        type2                 b          2.0
6              1236        type1                 c          3.2
7              1237        type1                 a          1.0
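Since the question mentions 9 pairs, the same select/rename/concat idea can be wrapped in a small helper that takes the number of pairs. This is only a sketch: the function name flatten_pairs and its parameters are made up for illustration, and it assumes the columns follow the Identifier<i>/Value<i> naming from the question, numbered from 1.

```python
import pandas as pd


def flatten_pairs(df, n_pairs, common=("CommonIdentifier", "CommonValue")):
    """Stack Identifier<i>/Value<i> pairs (i = 1..n_pairs) into two columns.

    Hypothetical helper: assumes the column naming used in the question.
    """
    common = list(common)
    pieces = []
    for i in range(1, n_pairs + 1):
        # Take the common columns plus the i-th pair and give the pair generic names.
        piece = df[common + [f"Identifier{i}", f"Value{i}"]].rename(
            columns={f"Identifier{i}": "IdentifierJoined",
                     f"Value{i}": "ValueJoined"}
        )
        pieces.append(piece)
    # ignore_index=True gives a fresh 0..N-1 index for the stacked result.
    return pd.concat(pieces, ignore_index=True)


df = pd.DataFrame(data={"CommonIdentifier": [1234, 1235, 1236, 1237],
                        "CommonValue": ["type1", "type2", "type1", "type1"],
                        "Identifier1": ["a", "a", "b", "b"],
                        "Value1": [1, 1.5, 2, 1.3],
                        "Identifier2": ["b", "b", "c", "a"],
                        "Value2": [4, 2, 3.2, 1]})

result = flatten_pairs(df, n_pairs=2)
print(result)
```

With the real data this would presumably be called as flatten_pairs(df, n_pairs=9).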
Another option, inspired by HYRY's solution, is to hide the common columns in the index and then use HYRY's stack trick:
import pandas as pd
df = pd.DataFrame(data={"CommonIdentifier": [1234, 1235, 1236, 1237],
                        "CommonValue": ["type1", "type2", "type1", "type1"],
                        "Identifier1": ["a", "a", "b", "b"],
                        "Value1": [1, 1.5, 2, 1.3],
                        "Identifier2": ["b", "b", "c", "a"],
                        "Value2": [4, 2, 3.2, 1]})
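df = df.set_index(['CommonIdentifier', 'CommonValue'])
# The MultiIndex labels below are assigned by position, so first put the
# columns in Identifier1, Identifier2, Value1, Value2 order.
df = df[['Identifier1', 'Identifier2', 'Value1', 'Value2']]
df.columns = pd.MultiIndex.from_product([["Identifier", "Value"], range(2)])
# stack() moves the pair number (inner column level) into the index; then drop it.
df = df.stack()
df.index = df.index.droplevel(-1)
print(df.reset_index())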
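For comparison, pandas also ships pd.wide_to_long, which is aimed at this kind of "stub name plus numeric suffix" layout. The sketch below is not from either answer. It assumes that CommonIdentifier and CommonValue together identify each row uniquely (which wide_to_long requires), and the level name "pair" is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame(data={"CommonIdentifier": [1234, 1235, 1236, 1237],
                        "CommonValue": ["type1", "type2", "type1", "type1"],
                        "Identifier1": ["a", "a", "b", "b"],
                        "Value1": [1, 1.5, 2, 1.3],
                        "Identifier2": ["b", "b", "c", "a"],
                        "Value2": [4, 2, 3.2, 1]})

# stubnames are the column prefixes; the numeric suffix (1, 2, ...) ends up
# in a new index level, here arbitrarily named "pair".
long_df = pd.wide_to_long(
    df,
    stubnames=["Identifier", "Value"],
    i=["CommonIdentifier", "CommonValue"],
    j="pair",
).reset_index()

# Rename to the column names requested in the question.
long_df = long_df.rename(columns={"Identifier": "IdentifierJoined",
                                  "Value": "ValueJoined"})
print(long_df)
```

The extra "pair" column records which pair each row came from; it can be dropped with long_df.drop(columns="pair") if it is not needed.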