Python stops while processing my 1 GB CSV file


Asked 2025-04-15 17:40

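I have two files:

metadata.csv: contains an ID, followed by the vendor name, a filename, and so on.
hashes.csv: also contains an ID, followed by a hash. The ID is essentially a foreign key that links a file's metadata to its hash.

I wrote a script to quickly extract all the hashes associated with a particular vendor. However, it stops partway through hashes.csv and never finishes.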

stored_ids = []

# this file is about 1 MB
entries = csv.reader(open(options.entries, "rb"))

for row in entries:
  # row[2] is the vendor
  if row[2] == options.vendor:
    # row[0] is the ID
    stored_ids.append(row[0])

# this file is 1 GB
hashes = open(options.hashes, "rb")

# I iteratively read the file here,
# just in case the csv module doesn't do this.
for line in hashes:

  # not sure if stored_ids contains strings or ints here...
  # this probably isn't the problem though
  if line.split(",")[0] in stored_ids:

    # if it's one of the IDs we're looking for, print the file and hash to STDOUT
    print "%s,%s" % (line.split(",")[2], line.split(",")[4])

hashes.close()

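The script gets through about 2,000 entries of hashes.csv and then stops. What am I doing wrong? I thought I was processing the file line by line.

Addendum: both CSV files are in the popular HashKeeper format, and the files I am parsing are the NSRL hash sets. http://www.nsrl.nist.gov/Downloads.htm#converter

Update: a working solution is below. Thanks to everyone who commented!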

entries = csv.reader(open(options.entries, "rb"))   
stored_ids = dict((row[0],1) for row in entries if row[2] == options.vendor)

hashes = csv.reader(open(options.hashes, "rb"))
matches = dict((row[2], row[4]) for row in hashes if row[0] in stored_ids)

for k, v in matches.iteritems():
    print "%s,%s" % (k, v)


4 Answers

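Please explain what you mean by "stops". Does it hang, or does it exit? Is there a traceback?

a) It will fail on any line with too few commas, for example a line with none at all.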

>>> 'hmmm'.split(",")[2]
Traceback (most recent call last):
  File "<string>", line 1, in <string>
IndexError: list index out of range

b) Why are you splitting the line several times? Split it once instead:

tokens = line.split(",")

if len(tokens) >=5 and tokens[0] in stored_ids:
    print "%s,%s" % (tokens[2], tokens[4])

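c) Store the IDs in a dict, so that the tokens[0] in stored_ids test is faster.

d) Wrap your inner code in try/except and see whether any errors are raised.

e) Are you running this from the command line or inside an IDE?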
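Points b) through d) can be combined into one small function. This is only a sketch in Python 3 syntax (the code in the question is Python 2). The name matching_rows and the sample data are made up for illustration; the field positions 0, 2 and 4 are taken from the question. It uses a set rather than a dict, since only membership is needed.

```python
import io
import sys


def matching_rows(lines, wanted_ids):
    """Yield (filename, hash) for each line whose first field is a wanted ID."""
    wanted = set(wanted_ids)  # set membership is O(1) on average
    for lineno, line in enumerate(lines, start=1):
        # Split once and reuse the result. A plain split does not understand
        # quoted fields that contain commas.
        tokens = line.rstrip("\r\n").split(",")
        try:
            if tokens[0] in wanted:
                yield tokens[2], tokens[4]
        except IndexError:
            # A matching ID on a line with too few fields: report it, keep going.
            print("malformed line %d: %r" % (lineno, line), file=sys.stderr)


# Hypothetical sample input: one good match, a blank line, a non-match,
# and a matching ID on a line that is too short.
sample = io.StringIO(
    "1,a,file1.txt,x,HASH1\n"
    "\n"
    "2,b,file2.txt,y,HASH2\n"
    "1,short\n"
)
for name, digest in matching_rows(sample, ["1"]):
    print("%s,%s" % (name, digest))
```

Reporting the line number on stderr makes it easier to see whether the script really stops or is tripping over one bad record.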

Answered 2025-04-15 by Python大师


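This code will fail on any line that has fewer than four commas; for example, it will fail on an empty line. If you are sure you don't want to use the CSV reader, then at least handle IndexError around line.split(',')[4].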
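For comparison, here is a minimal sketch of the csv.reader route, written in Python 3 syntax (the question uses Python 2). The function name hashes_for_ids, the min_fields parameter and the sample rows are assumptions made for illustration; columns 0, 2 and 4 come from the question. Checking the row length up front sidesteps the IndexError entirely.

```python
import csv
import io


def hashes_for_ids(csv_file, wanted_ids, min_fields=5):
    """Yield (filename, hash) for rows whose ID (column 0) is in wanted_ids."""
    wanted = set(wanted_ids)
    # csv.reader pulls rows from the file lazily, so a large file is not
    # loaded into memory at once.
    for row in csv.reader(csv_file):
        if len(row) < min_fields:
            continue  # blank or truncated row (a blank line parses as [])
        if row[0] in wanted:
            yield row[2], row[4]


# Hypothetical sample: a quoted field containing a comma, a blank line,
# and a row for an ID we are not looking for.
sample = io.StringIO(
    '1,"Acme, Inc.",file1.txt,x,HASH1\n'
    "\n"
    "2,Other,file2.txt,y,HASH2\n"
)
for name, digest in hashes_for_ids(sample, ["1"]):
    print("%s,%s" % (name, digest))
```

With a real file in Python 3, the csv documentation recommends opening it in text mode with newline="", for example open(path, newline="").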

Answered 2025-04-15 by Python大师


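"Crashes" isn't a particularly precise description. What does it actually do? Is it swapping? Filling up all memory? Or just eating CPU without appearing to do anything?

Either way, as a start, use a dictionary rather than a list for stored_ids. A lookup in a dictionary usually takes O(1) time, whereas a lookup in a list takes O(n) time.

Addendum: here is a trivial microbenchmark: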

$ python -m timeit -s "l=range(1000000)" "1000001 in l"
10 loops, best of 3: 71.1 msec per loop
$ python -m timeit -s "s=set(range(1000000))" "1000001 in s"
10000000 loops, best of 3: 0.174 usec per loop

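As you can see, a set (which has the same performance characteristics as a dict) looks up a value among a million integers more than 10,000 times faster than the equivalent list: 0.174 µs versus 71.1 ms per lookup in this run, a ratio of roughly 400,000. Given that such a lookup happens for every line of your 1 GB file, you can see how much this matters.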
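The command-line benchmark above was run on Python 2, where range() returns a list. In Python 3, range() is a lazy sequence with constant-time membership tests for integers, so the same command would not measure a list scan. A rough Python 3 counterpart, with sizes and repeat counts chosen arbitrarily for illustration, might look like this:

```python
import timeit

N = 100000
as_list = list(range(N))  # materialize a real list; range() alone is lazy
as_set = set(as_list)
missing = N + 1           # absent value: the list has to be scanned in full

# Time 100 membership tests against each container.
list_time = timeit.timeit(lambda: missing in as_list, number=100)
set_time = timeit.timeit(lambda: missing in as_set, number=100)

print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

The absolute numbers depend on the machine, but the list lookup should be slower by several orders of magnitude.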

Answered 2025-04-15 by Python大师
