Reading a csv file with scipy/numpy in Python

3 votes
5 answers
18245 views
Asked 2025-04-15 22:51

I'm having trouble reading a tab-delimited csv file in Python. I'm using the following function:

from numpy import array, genfromtxt

def csv2array(filename, skiprows=0, delimiter='\t', raw_header=False, missing=None, with_header=True):
    """
    Parse a file name into an array. Return the array and additional header lines. By default,
    parse the header lines into dictionaries, assuming the parameters are numeric,
    using 'parse_header'.
    """
    f = open(filename, 'r')
    skipped_rows = []
    for n in range(skiprows):
        header_line = f.readline().strip()
        if raw_header:
            skipped_rows.append(header_line)
        else:
            skipped_rows.append(parse_header(header_line))
    f.close()
    if missing:
        data = genfromtxt(filename, dtype=None, names=with_header,
                          deletechars='', skiprows=skiprows, missing=missing)
    else:
        if delimiter != '\t':
            data = genfromtxt(filename, dtype=None, names=with_header, delimiter=delimiter,
                              deletechars='', skiprows=skiprows)
        else:
            data = genfromtxt(filename, dtype=None, names=with_header,
                              deletechars='', skiprows=skiprows)
    if data.ndim == 0:
        data = array([data.item()])
    return (data, skipped_rows)

The problem is that genfromtxt complains about my files, for example with this error:

Line #27100 (got 12 columns instead of 16)

I'm not sure where these errors come from. Any suggestions?

Here is a sample file that causes the problem:

#Gene   120-1   120-3   120-4   30-1    30-3    30-4    C-1 C-2 C-5 genesymbol  genedesc
ENSMUSG00000000001  7.32    9.5 7.76    7.24    11.35   8.83    6.67    11.35   7.12    Gnai3   guanine nucleotide binding protein alpha
ENSMUSG00000000003  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Pbsn    probasin

Is there a better way to write a generic csv2array function? Thanks.

5 Answers

0

Most likely your data file has a problem at line 27100: that line has 12 columns instead of 16. In other words, it contains:

separator,1,2,3,4,5,6,7,8,9,10,11,12,separator

whereas the program expected something like this:

separator,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,separator

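I don't know how you want to process your data, but if your rows have inconsistent lengths, the simplest approach is probably something like this: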

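# 'myfile.csv' and 'someseparator' are placeholders for your file and record separator
with open('myfile.csv') as f:
    lines = f.read().split('someseparator')
for line in lines:
    splitline = line.split(',')
    # do something with splitline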
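It may also be worth checking how the fields are being counted, not only the file itself. The sketch below rebuilds the two sample rows from the question, assuming the real file separates columns with tabs as the question says, and counts the fields both ways. `count_fields` is a hypothetical helper. Splitting on any run of whitespace, which is what genfromtxt does when no delimiter is passed, breaks a multi-word description into several columns.

```python
# Two sample rows from the question, rebuilt with tab separators
# (an assumption: the page shows the tabs as runs of spaces).
rows = [
    "\t".join(["ENSMUSG00000000001", "7.32", "9.5", "7.76", "7.24", "11.35",
               "8.83", "6.67", "11.35", "7.12", "Gnai3",
               "guanine nucleotide binding protein alpha"]),
    "\t".join(["ENSMUSG00000000003"] + ["0.0"] * 9 + ["Pbsn", "probasin"]),
]

def count_fields(line, delimiter=None):
    """Count fields in a line; delimiter=None splits on any run of whitespace."""
    return len(line.split(delimiter))

whitespace_counts = [count_fields(r) for r in rows]
tab_counts = [count_fields(r, "\t") for r in rows]

print(whitespace_counts)  # [16, 12]
print(tab_counts)         # [12, 12]
```

The whitespace counts line up with the "got 12 columns instead of 16" message, and the function in the question leaves out the delimiter argument when it is '\t', so passing delimiter='\t' to genfromtxt explicitly would be the first thing to try.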

2

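May I ask why you aren't using the built-in csv reader? http://docs.python.org/library/csv.html

I've used it together with numpy/scipy to good effect. I'd like to share my code, but unfortunately it belongs to my employer; writing your own should be easy, though.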
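For what it's worth, a minimal sketch of the idea (not the code referred to above) might look like the following. The column layout, one ID column followed by numeric columns and then text columns, is assumed from the sample in the question, and `load_numeric` with its `first`/`last` parameters is a hypothetical helper.

```python
import csv
import io

import numpy as np

def load_numeric(fileobj, first=1, last=10, delimiter="\t"):
    """Read a delimited file with one header row.

    Returns (header, ids, values), where values is a 2-D float array built
    from columns first..last-1 (an assumed layout, adjust to your file).
    """
    reader = csv.reader(fileobj, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    ids, values = [], []
    for record in reader:
        ids.append(record[0])
        values.append([float(x) for x in record[first:last]])
    return header, ids, np.array(values)

# Small in-memory stand-in for a real file, so the sketch runs as-is.
sample = io.StringIO(
    "#Gene\tA\tB\tgenesymbol\tgenedesc\n"
    "g1\t7.32\t9.5\tGnai3\tguanine nucleotide binding protein alpha\n"
    "g2\t0.0\t0.0\tPbsn\tprobasin\n"
)
header, ids, values = load_numeric(sample, first=1, last=3)
print(values.shape)  # (2, 2)
```

With a real file you would pass `open("myfile.csv", newline="")` in place of the `io.StringIO` object.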

6

Take a look at Python's csv module: http://docs.python.org/library/csv.html

import csv

fields = 16            # expected number of columns
thereIsAHeader = True  # set to False if the file has no header row

header = []
records = []

# Python 3: open in text mode with newline="" as the csv docs recommend
with open("myfile.csv", newline="") as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)

    if thereIsAHeader:
        header = next(reader)

    for row, record in enumerate(reader):
        if len(record) != fields:
            print("Skipping malformed record %i, contains %i fields (%i expected)"
                  % (row, len(record), fields))
        else:
            records.append(record)

# do numpy stuff.
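One possible way to fill in the "numpy stuff" is to turn the validated records into a structured array, which gives named columns much like genfromtxt with names=True. This is only a sketch: the `header` and `records` values below are small stand-ins, and the field names, string widths, and column types are assumptions to adapt to the real file.

```python
import numpy as np

# Stand-ins for the header and records collected by the csv loop.
header = ["Gene", "x", "y", "genesymbol"]
records = [["g1", "7.32", "9.5", "Gnai3"],
           ["g2", "0.0", "0.0", "Pbsn"]]

# One (name, type) pair per column; "U32" means text of up to 32 characters.
dtype = [("Gene", "U32"), ("x", float), ("y", float), ("genesymbol", "U32")]

# csv gives strings, so numeric columns are converted explicitly.
data = np.array(
    [(r[0], float(r[1]), float(r[2]), r[3]) for r in records],
    dtype=dtype,
)

print(data["x"])           # numeric column as a float array
print(data["genesymbol"])  # text column
```

Columns can then be selected by name, for example `data["y"].max()`.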
