Reading a hierarchical ASCII table from a file with Pandas
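This is my first post here, so I hope it isn't too long or too complicated.
I'm having trouble reading and making sense of the ASCII table below (a short excerpt from a much larger table).
Suppose the file is called "test.txt":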
A B C D E
  0     992    CEN/4 -2.657293E+00 -3.309567E+01  4.697218E-01
                1291 -3.368449E+00  7.837483E+00  2.311393E+00
                 800 -3.530800E+00 -7.392188E+01 -1.401380E+00
                 801 -1.952177E+00 -7.392114E+01 -1.367195E+00
                1290 -1.777301E+00  7.838229E+00  2.345850E+00
  0     994    CEN/4  7.270955E+00 -6.637891E+01 -1.293553E+01
                1110  5.816999E+00 -3.981042E+01 -1.504738E+01
                 785  5.535329E+00 -9.246339E+01 -1.061554E+01
                 786  8.625161E+00 -9.163719E+01 -1.092563E+01
                1109  9.080517E+00 -4.059749E+01 -1.523589E+01
With Python 2.7 and pandas 0.9.1, I can read it like this:
>>> r=pd.read_fwf('test.txt', widths=(3, 8, 9, 14, 14, 14),skiprows=1, header=None)
>>> print r
    X0   X1     X2         X3          X4          X5
0    0  992  CEN/4  -2.657293  -33.095670    0.469722
1  NaN  NaN   1291  -3.368449    7.837483    2.311393
2  NaN  NaN    800  -3.530800  -73.921880   -1.401380
3  NaN  NaN    801  -1.952177  -73.921140   -1.367195
4  NaN  NaN   1290  -1.777301    7.838229    2.345850
5    0  994  CEN/4   7.270955  -66.378910  -12.935530
6  NaN  NaN   1110   5.816999  -39.810420  -15.047380
7  NaN  NaN    785   5.535329  -92.463390  -10.615540
8  NaN  NaN    786   8.625161  -91.637190  -10.925630
9  NaN  NaN   1109   9.080517  -40.597490  -15.235890
I tried reading it directly as a "hierarchical table":
>>> r=pd.read_fwf('test.txt', widths=(3, 8, 9, 14, 14, 14), skiprows=1, index_col=[0,1,2], header=None)
>>> print r
                     X3          X4          X5
X0 X1  X2
0  992 CEN/4  -2.657293  -33.095670    0.469722
   994 1291   -3.368449    7.837483    2.311393
       800    -3.530800  -73.921880   -1.401380
       801    -1.952177  -73.921140   -1.367195
       1290   -1.777301    7.838229    2.345850
       CEN/4   7.270955  -66.378910  -12.935530
       1110    5.816999  -39.810420  -15.047380
       785     5.535329  -92.463390  -10.615540
       786     8.625161  -91.637190  -10.925630
       1109    9.080517  -40.597490  -15.235890
What I would like to get is:
>>> print r
                     X3          X4          X5
X0 X1  X2
0  992 CEN/4  -2.657293  -33.095670    0.469722
       1291   -3.368449    7.837483    2.311393
       800    -3.530800  -73.921880   -1.401380
       801    -1.952177  -73.921140   -1.367195
       1290   -1.777301    7.838229    2.345850
   994 CEN/4   7.270955  -66.378910  -12.935530
       1110    5.816999  -39.810420  -15.047380
       785     5.535329  -92.463390  -10.615540
       786     8.625161  -91.637190  -10.925630
       1109    9.080517  -40.597490  -15.235890
Is there an easy way to do this with pandas, or do I have to pre-process the ASCII table before parsing it to get the result I want? Thanks, everyone!
1 Answer
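It looks like you need to fill the NA values before setting the index. The blank cells in the first two columns are read as NaN, so forward-filling copies each parent row's values down to its child rows. Try this: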
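import pandas as pd
r = pd.read_fwf('test.txt', widths=(3, 8, 9, 14, 14, 14), skiprows=1, header=None)
r = r.fillna(method='pad')  # forward-fill: copy the last valid X0/X1 values down into the blank rows
r = r.set_index(['X0', 'X1', 'X2'])  # build the three-level index from the filled columns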
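For anyone trying this on a newer pandas, here is a minimal self-contained sketch of the same idea. It is an adaptation under a few assumptions, not the exact code above: newer versions label columns 0..5 when header=None, so the X0..X5 labels are passed explicitly through `names`; `ffill()` stands in for `fillna(method='pad')`, which newer releases deprecate; and `fixed_width_line` is a hypothetical helper that rebuilds four rows of the excerpt in memory so the example runs without test.txt.

```python
import io

import pandas as pd

WIDTHS = (3, 8, 9, 14, 14, 14)
# Assumed labels, mirroring the X0..X5 names shown in the question's output.
NAMES = ["X0", "X1", "X2", "X3", "X4", "X5"]


def fixed_width_line(fields):
    """Hypothetical helper: right-justify each field to its column width."""
    return "".join(str(field).rjust(width) for field, width in zip(fields, WIDTHS))


# Four rows taken from the question's excerpt; blank fields mark child rows.
rows = [
    ("0", "992", "CEN/4", "-2.657293E+00", "-3.309567E+01", "4.697218E-01"),
    ("", "", "1291", "-3.368449E+00", "7.837483E+00", "2.311393E+00"),
    ("0", "994", "CEN/4", "7.270955E+00", "-6.637891E+01", "-1.293553E+01"),
    ("", "", "1110", "5.816999E+00", "-3.981042E+01", "-1.504738E+01"),
]
text = "A B C D E\n" + "\n".join(fixed_width_line(row) for row in rows) + "\n"

r = pd.read_fwf(io.StringIO(text), widths=WIDTHS, skiprows=1, header=None, names=NAMES)

# Forward-fill only the two parent columns, then convert them back to integers
# (the NaN placeholders make pandas read them as floats).
for col in ("X0", "X1"):
    r[col] = r[col].ffill().astype(int)

indexed = r.set_index(["X0", "X1", "X2"])
print(indexed)
```

Filling only X0 and X1, rather than the whole frame, is a deliberate choice in this sketch: a genuinely missing measurement in X3..X5 stays NaN instead of silently inheriting the value from the row above.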