根据列标签在pandas中重塑数据框
如何在pandas中重新整理下面这个数据框(dataframe)是最好的方法?这个数据框df
包含了每个样本(在这个例子中是s1
和s2
)的x,y
值,长得像这样:
In [23]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10)})
In [24]: df
Out[24]:
s1_x s1_y s2_x s2_y
0 0.913462 0.525590 -0.377640 0.700720
1 0.723288 -0.691715 0.127153 0.180836
2 0.181631 -1.090529 -1.392552 1.530669
3 0.997414 -1.486094 1.207012 0.376120
4 -0.319841 0.195289 -1.034683 0.286073
5 1.085154 -0.619635 0.396867 0.623482
6 1.867816 -0.928101 -0.491929 -0.955295
7 0.920658 -1.132057 1.701582 -0.110299
8 -0.241853 -0.129702 -0.809852 0.014802
9 -0.019523 -0.578930 0.803688 -0.881875
s1_x
和s1_y
是样本1的x/y值,s2_x
和s2_y
是样本2的值,依此类推。我们怎么能把它整理成一个只包含x
和y
列的数据框,但还要多出一列sample
,用来标识每一行是来自s1
还是s2
呢?比如:
x y sample
0 0.913462 0.525590 s1
1 0.723288 -0.691715 s1
2 0.181631 -1.090529 s1
3 0.997414 -1.486094 s1
...
5 0.396867 0.623482 s2
...
这样做对后面用Rpy2绘图很有用,因为很多R的绘图功能可以利用这个分组变量,所以我想重新整理这个数据框的原因就是这个。
我觉得Chang She给的答案不适用于像这样的有独特索引的数据框:
In [636]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10), "names": range(10)})
In [637]: df
Out[637]:
names s1_x s1_y s2_x s2_y
0 0 0.672298 0.415366 1.034770 0.556209
1 1 0.067087 -0.851028 0.053608 -0.276461
2 2 -0.674174 -0.099015 0.864148 -0.067240
3 3 0.542996 -0.813018 2.283530 2.793727
4 4 0.216633 -0.091870 -0.746411 -0.421852
5 5 0.141301 -1.537721 -0.371601 -1.594634
6 6 1.267148 -0.833120 0.369516 -0.671627
7 7 -0.231163 -0.557398 1.123155 0.865140
8 8 1.790570 -0.428563 0.668987 0.632409
9 9 -0.820315 -0.894855 0.673247 -1.195831
In [638]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
In [639]: df.stack(0).reset_index(1)
Out[639]:
level_1 x y
0 s1 0.672298 0.415366
0 s2 1.034770 0.556209
1 s1 0.067087 -0.851028
1 s2 0.053608 -0.276461
2 s1 -0.674174 -0.099015
2 s2 0.864148 -0.067240
3 s1 0.542996 -0.813018
3 s2 2.283530 2.793727
4 s1 0.216633 -0.091870
4 s2 -0.746411 -0.421852
5 s1 0.141301 -1.537721
5 s2 -0.371601 -1.594634
6 s1 1.267148 -0.833120
6 s2 0.369516 -0.671627
7 s1 -0.231163 -0.557398
7 s2 1.123155 0.865140
8 s1 1.790570 -0.428563
8 s2 0.668987 0.632409
9 s1 -0.820315 -0.894855
9 s2 0.673247 -1.195831
转换成功了,但在这个过程中,"names"
这一列丢失了。我该如何在进行列名中有_
的列的变形操作时,保留"names"
这一列呢?"names"
列只是给数据框中的每一行分配一个独特的名称。这里是数字形式,但在我的数据中,它们是字符串标识符。
谢谢。
1 个回答
12
我假设你已经有了一个数据表(DataFrame)。在这种情况下,你可以把列转换成一个多重索引(MultiIndex),然后使用堆叠(stack)功能,再重置索引(reset_index)。注意,这样做之后你还需要重新命名和调整列的顺序,并按照样本排序,才能得到你在问题中提到的那种格式:
In [4]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10)})
In [5]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
In [6]: df.stack(0).reset_index(1)
Out[6]:
level_1 x y
0 s1 0.897994 -0.278357
0 s2 -0.008126 -1.701865
1 s1 -1.354633 -0.890960
1 s2 -0.773428 0.003501
2 s1 -1.499422 -1.518993
2 s2 0.240226 1.773427
3 s1 -1.090921 0.847064
3 s2 -1.061303 1.557871
4 s1 -1.697340 -0.160952
4 s2 -0.930642 0.182060
5 s1 -0.356076 -0.661811
5 s2 0.539875 -1.033523
6 s1 -0.687861 -1.450762
6 s2 0.700193 0.658959
7 s1 -0.130422 -0.826465
7 s2 -0.423473 -1.281856
8 s1 0.306983 0.433856
8 s2 0.097279 -0.256159
9 s1 0.498057 0.147243
9 s2 1.312578 0.111837
如果你能直接创建一个带有多重索引的数据表,就可以省去转换的步骤。
补充:使用合并(merge)功能把原来的ID重新加回来。
In [59]: df
Out[59]:
names s1_x s1_y s2_x s2_y
0 0 0.732099 0.018387 0.299856 0.737142
1 1 0.914755 -0.798159 -0.732868 -1.279311
2 2 -1.063558 0.161779 -0.115751 -0.251157
3 3 -1.185501 0.095147 -1.343139 -0.003084
4 4 0.622400 -0.299726 0.198710 -0.383060
5 5 0.179318 0.066029 -0.635507 1.366786
6 6 -0.820099 0.066067 1.113402 0.002872
7 7 0.711627 -0.182925 1.391194 -2.788434
8 8 -1.124092 1.303375 0.202691 -0.225993
9 9 -0.179026 0.847466 -1.480708 -0.497067
In [60]: id = df.ix[:, ['names']]
In [61]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
In [62]: pandas.merge(df.stack(0).reset_index(1), id, left_index=True, right_index=True)
Out[62]:
level_1 x y names
0 s1 0.732099 0.018387 0
0 s2 0.299856 0.737142 0
1 s1 0.914755 -0.798159 1
1 s2 -0.732868 -1.279311 1
2 s1 -1.063558 0.161779 2
2 s2 -0.115751 -0.251157 2
3 s1 -1.185501 0.095147 3
3 s2 -1.343139 -0.003084 3
4 s1 0.622400 -0.299726 4
4 s2 0.198710 -0.383060 4
5 s1 0.179318 0.066029 5
5 s2 -0.635507 1.366786 5
6 s1 -0.820099 0.066067 6
6 s2 1.113402 0.002872 6
7 s1 0.711627 -0.182925 7
7 s2 1.391194 -2.788434 7
8 s1 -1.124092 1.303375 8
8 s2 0.202691 -0.225993 8
9 s1 -0.179026 0.847466 9
9 s2 -1.480708 -0.497067 9
另外一种方法:
In [64]: df
Out[64]:
names s1_x s1_y s2_x s2_y
0 0 0.744742 -1.123403 0.212736 0.005440
1 1 0.465075 -0.673491 1.467156 -0.176298
2 2 -1.111566 0.168043 -0.102142 -1.072461
3 3 1.226537 -1.147357 -1.583762 -1.236582
4 4 1.137675 0.224422 0.738988 1.528416
5 5 -0.237014 -1.110303 -0.770221 1.389714
6 6 -0.659213 2.305374 -0.326253 1.416778
7 7 1.524214 -0.395451 -1.884197 0.524606
8 8 0.375112 -0.622555 0.295336 0.927208
9 9 1.168386 -0.291899 -1.462098 0.250889
In [65]: df = df.set_index('names')
In [66]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
In [67]: df.stack(0).reset_index(1)
Out[67]:
level_1 x y
names
0 s1 0.744742 -1.123403
0 s2 0.212736 0.005440
1 s1 0.465075 -0.673491
1 s2 1.467156 -0.176298
2 s1 -1.111566 0.168043
2 s2 -0.102142 -1.072461
3 s1 1.226537 -1.147357
3 s2 -1.583762 -1.236582
4 s1 1.137675 0.224422
4 s2 0.738988 1.528416
5 s1 -0.237014 -1.110303
5 s2 -0.770221 1.389714
6 s1 -0.659213 2.305374
6 s2 -0.326253 1.416778
7 s1 1.524214 -0.395451
7 s2 -1.884197 0.524606
8 s1 0.375112 -0.622555
8 s2 0.295336 0.927208
9 s1 1.168386 -0.291899
9 s2 -1.462098 0.250889