Reading csv files in scipy/numpy in Python
I am having trouble reading a tab-delimited csv file in Python. I use the following function:
def csv2array(filename, skiprows=0, delimiter='\t', raw_header=False, missing=None, with_header=True):
    """
    Parse a file name into an array. Return the array and additional header lines. By default,
    parse the header lines into dictionaries, assuming the parameters are numeric,
    using 'parse_header'.
    """
    f = open(filename, 'r')
    skipped_rows = []
    for n in range(skiprows):
        header_line = f.readline().strip()
        if raw_header:
            skipped_rows.append(header_line)
        else:
            skipped_rows.append(parse_header(header_line))
    f.close()
    if missing:
        data = genfromtxt(filename, dtype=None, names=with_header,
                          deletechars='', skiprows=skiprows, missing=missing)
    else:
        if delimiter != '\t':
            data = genfromtxt(filename, dtype=None, names=with_header, delimiter=delimiter,
                              deletechars='', skiprows=skiprows)
        else:
            data = genfromtxt(filename, dtype=None, names=with_header,
                              deletechars='', skiprows=skiprows)
    if data.ndim == 0:
        data = array([data.item()])
    return (data, skipped_rows)
The problem is that genfromtxt complains about my files, for example with this error:
Line #27100 (got 12 columns instead of 16)
I am not sure where these errors come from. Any suggestions?
Here is an example file that causes the problem:
#Gene 120-1 120-3 120-4 30-1 30-3 30-4 C-1 C-2 C-5 genesymbol genedesc
ENSMUSG00000000001 7.32 9.5 7.76 7.24 11.35 8.83 6.67 11.35 7.12 Gnai3 guanine nucleotide binding protein alpha
ENSMUSG00000000003 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Pbsn probasin
Is there a better way to write a generic csv2array function? Thanks.
5 Answers
0
Most likely your data file has a problem at line 27100: that line has 12 columns instead of 16. In other words, it contains:
separator,1,2,3,4,5,6,7,8,9,10,11,12,separator
whereas the parser expects something like this:
separator,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,separator
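I don't know what you want to do with your data, but if your rows have inconsistent lengths, the simplest approach is probably something like this: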
lines = f.read().split('someseparator')
for line in lines:
    splitline = line.split(',')
    # do something with splitline
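Before deciding how to handle rows of differing length, it can help to find out which lines are affected. The following is a minimal sketch, assuming the file is tab-delimited as in the question. `count_fields` is a hypothetical helper name, not part of any library, and the sample lines are made up.

```python
from collections import Counter


def count_fields(lines, delimiter="\t"):
    """Return (Counter of field counts, {line_number: field_count})."""
    counts = Counter()
    per_line = {}
    for number, line in enumerate(lines, start=1):
        n = len(line.rstrip("\r\n").split(delimiter))
        counts[n] += 1
        per_line[number] = n
    return counts, per_line


# Made-up sample: the second line is missing two fields.
sample = ["a\tb\tc\td\n", "1\t2\n", "3\t4\t5\t6\n"]
counts, per_line = count_fields(sample)

# Treat the most common field count as the expected one (an assumption).
expected = counts.most_common(1)[0][0]
bad = [number for number, n in per_line.items() if n != expected]
print(counts, bad)
```

An open file object can be passed in place of the list, since iterating over it also yields one line at a time.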
2
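Why not use the built-in csv reader? http://docs.python.org/library/csv.html
I have used it together with numpy/scipy with good results. I would share my code, but unfortunately it belongs to my employer. It should be straightforward to write your own, though.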
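As a rough illustration of how the csv module and numpy might be combined for a file shaped like the sample in the question (one ID column, several numeric columns, then two text columns), here is a minimal sketch. It is not the answerer's code: `csv_to_arrays` and its `n_text_cols` parameter are hypothetical names, and the column layout is an assumption.

```python
import csv
import io

import numpy as np


def csv_to_arrays(fileobj, delimiter="\t", n_text_cols=2):
    """Split a delimited table into header, row IDs, numeric data and text columns.

    Assumes column 0 is an ID, the last `n_text_cols` columns are text,
    and everything in between is numeric.
    """
    reader = csv.reader(fileobj, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    ids, numbers, text = [], [], []
    for record in reader:
        if len(record) != len(header):
            continue  # skip rows whose length does not match the header
        ids.append(record[0])
        numbers.append([float(x) for x in record[1:-n_text_cols]])
        text.append(record[-n_text_cols:])
    return header, ids, np.array(numbers, dtype=float), text


# Two made-up rows in the same shape as the sample file.
sample = io.StringIO(
    "#Gene\t120-1\t120-3\tgenesymbol\tgenedesc\n"
    "ENSMUSG00000000001\t7.32\t9.5\tGnai3\tguanine nucleotide binding protein alpha\n"
    "ENSMUSG00000000003\t0.0\t0.0\tPbsn\tprobasin\n"
)
header, ids, data, text = csv_to_arrays(sample)
print(data.shape)
```

Because the reader splits on tabs only, the spaces inside the description column stay within a single field.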
6
Check out the Python csv module: http://docs.python.org/library/csv.html
import csv

thereIsAHeader = True  # set to False if the file has no header row
fields = 16            # expected number of columns per record
header = []
records = []

with open("myfile.csv", newline="") as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    if thereIsAHeader:
        header = next(reader)
    for row, record in enumerate(reader):
        if len(record) != fields:
            # report the index of the malformed record and skip it
            print("Skipping malformed record %i, contains %i fields (%i expected)"
                  % (row, len(record), fields))
        else:
            records.append(record)

# do numpy stuff.
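If staying with numpy's own reader is preferred, the same skip-malformed-rows behaviour appears to be available in `genfromtxt` through its `invalid_raise` parameter. When no `delimiter` is given, `genfromtxt` splits on any run of whitespace, so a description field that contains spaces may be counted as several columns. Passing the tab explicitly avoids that. The following is a sketch under those assumptions, with made-up sample data and column names, not a drop-in replacement for `csv2array`.

```python
import io
import warnings

import numpy as np

# Made-up sample: the third line has only 2 of the 4 expected columns.
sample = io.StringIO(
    "Gene\tA\tB\tgenedesc\n"
    "g1\t7.32\t9.5\tguanine nucleotide binding protein alpha\n"
    "g2\t0.0\n"
    "g3\t1.5\t2.5\tprobasin\n"
)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warning about the skipped row
    data = np.genfromtxt(
        sample,
        delimiter="\t",       # split on tabs only, so spaces inside a field are kept
        names=True,           # take the column names from the first row
        dtype=None,           # infer a type for each column
        encoding="utf-8",     # read text columns as str rather than bytes
        invalid_raise=False,  # skip rows with the wrong number of columns
    )

print(data["A"])
```

The result is a structured array, so columns are selected by name, for example `data["genedesc"]`.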