Parsing a large dataset with Python

Posted 2024-04-25 20:31:08


I have a large matrix in a gzip file that looks like this:

locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0,0.5536,0.9177,0.2929,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0,0.9536,0.8177,0.2827,0.0,0.0

So each row starts with two descriptors, followed by 10 values.

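I only want to keep the first 5 values of each row, so that I end up with a matrix like this:

locus_1 mark1 0.0,0.0,0.0,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0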

I wrote the following Python script to parse it, but it does not work:

import gzip
import numpy as np

inFile = gzip.open('/home/anish/data.gz')

inFile.next()

for line in inFile:
        cols = line.strip().replace('nan','0').split('\t')
        data = cols[2:]
        data = map(float,data)

        gfpVals =  data[:5]

        print '\t'.join(cols[:6]) + '\t' + '\t'.join(map(str,gfpVals))

I just get this error:

data = map(float,data)
ValueError: could not convert string to float: 

1 answer

Answer 1 · Posted 2024-04-25 20:31:08

You are splitting only on tabs, but the values themselves are separated by commas.

As a result, the line

locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0

is split into

['locus_1', 'mark1', '0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0']

and you then try to convert the whole string

"0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0"

to a float, which fails because it is not a valid number.
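The failure can be reproduced without the file. This is a minimal sketch in Python 3, using the first sample row and assuming the three fields are separated by tabs:

```python
# One sample row; the tab separators are an assumption about the real file.
line = "locus_1\tmark1\t0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0"

cols = line.strip().split('\t')
data = cols[2:]  # a list holding ONE comma-joined string, not ten numbers

try:
    list(map(float, data))  # float() receives the whole comma-joined string
    conversion_failed = False
except ValueError:
    conversion_failed = True
```

Here `len(data)` is 1, so `float` is handed the entire comma-joined field at once.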

You should replace:

data = cols[2:]

with:

data = cols[2].split(',')
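Putting the fix together, a Python 3 version of the script might look like the sketch below. It assumes the fields are tab-separated and that the file starts with a header line (the original script skips one). The names `first_values` and `parse_file` are hypothetical helpers introduced here for illustration.

```python
import gzip


def first_values(line, n=5):
    """Return the two descriptors and the first n values of one row."""
    cols = line.strip().split('\t')
    # The third tab-separated field holds all values, joined by commas.
    # As in the original script, 'nan' entries are treated as 0.
    values = [float(v) for v in cols[2].replace('nan', '0').split(',')]
    return cols[0], cols[1], values[:n]


def parse_file(path, n=5, skip_header=True):
    """Yield (locus, mark, values) for each non-empty row of a gzip file."""
    # 'rt' opens the gzip stream in text mode, so lines are str, not bytes.
    with gzip.open(path, 'rt') as in_file:
        if skip_header:
            next(in_file, None)  # assumed header line, as in the question
        for line in in_file:
            if line.strip():
                yield first_values(line, n)
```

It could be used as `for locus, mark, values in parse_file('/home/anish/data.gz'): print(locus, mark, ','.join(map(str, values)), sep='\t')`. Because `parse_file` is a generator, rows are processed one at a time, which avoids loading a large file into memory.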
