Python: How to read a large text file into memory

$ ls -l links.csv; file links.csv; tail links.csv
-rw-r--r-- 1 user user 469904280 30 Nov 22:42 links.csv
links.csv: ASCII text, with CRLF line terminators
4757187,59883
4757187,99822
4757187,66546
4757187,638452
4757187,4627959
4757187,312826
4757187,6143
4757187,6141
4757187,3081726
4757187,58197

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

infile=open("links.csv", "r")

edges=[]
count=0
#count the total number of lines in the file
for line in infile:
    count=count+1

total=count
print "Total number of lines: ",total

infile.seek(0)
count=0
for line in infile:
    edge=tuple(map(int,line.strip().split(",")))
    edges.append(edge)
    count=count+1
    # for every million lines print memory consumption
    if count%1000000==0:
        print "Position: ", edge
        print "Read ",float(count)/float(total)*100,"%."
        mem=sys.getsizeof(edges)
        for edge in edges:
            mem=mem+sys.getsizeof(edge)
            for node in edge:
                mem=mem+sys.getsizeof(node)

        print "Memory (Bytes): ", mem

Total number of lines: 30609720
Position: (9745, 2994)
Read 3.26693612356 %.
Memory (Bytes): 64348736
Position: (38857, 103574)
Read 6.53387224712 %.
Memory (Bytes): 128816320
Position: (83609, 63498)
Read 9.80080837067 %.
Memory (Bytes): 192553000
Position: (139692, 1078610)
Read 13.0677444942 %.
Memory (Bytes): 257873392
Position: (205067, 153705)
Read 16.3346806178 %.
Memory (Bytes): 320107588
Position: (283371, 253064)
Read 19.6016167413 %.
Memory (Bytes): 385448716
Position: (354601, 377328)
Read 22.8685528649 %.
Memory (Bytes): 448629828
Position: (441109, 3024112)
Read 26.1354889885 %.
Memory (Bytes): 512208580

3 Answers

User

#1 · edited 2024-05-16 05:51:22

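There is a recipe for sorting files larger than RAM on this page, though you would have to adapt it to your case, which involves CSV-format data. That page also links to further resources.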

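Edit: True, the file on disk is not "larger than RAM", but the in-memory representation can easily become much larger than the available RAM. For one thing, your own program does not get the entire 1GB (OS overhead and so on). For another, even if you stored the data in the most compact form available in pure Python (two lists of integers, assuming a 32-bit machine and so on), you would be using 934MB for those 30 million pairs of integers.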
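
The 934MB figure can be roughly reproduced with a little arithmetic. The sketch below assumes a 32-bit CPython 2 build in which an int object takes 12 bytes (the getsizeof figure quoted elsewhere in this thread) and each list slot is a 4-byte pointer; the row count comes from the question's output. The function name is made up for illustration, and list headers and over-allocation are ignored.

```python
ROWS = 30609720          # "Total number of lines" from the question's output
INT_OBJECT_BYTES = 12    # assumption: size of an int object on 32-bit CPython 2
POINTER_BYTES = 4        # assumption: one list slot on a 32-bit build


def two_int_lists_bytes(rows):
    """Estimate the bytes used by two lists holding `rows` ints each."""
    per_value = INT_OBJECT_BYTES + POINTER_BYTES
    return 2 * rows * per_value


total_bytes = two_int_lists_bytes(ROWS)
total_mib = total_bytes / 2**20
print("%d bytes = %.0f MiB" % (total_bytes, total_mib))
```

On a 64-bit build or on Python 3 the per-object sizes differ, so the constants would need to be measured again rather than reused.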

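numpy can also do the job, using only about 250MB. Loading this way is not particularly fast, because you have to count the rows and pre-allocate the array, but since everything is in memory it may well be the fastest option for the actual sort: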

import time
import numpy as np
import csv

start = time.time()
def elapsed():
    return time.time() - start

# count data rows, to preallocate array
f = open('links.csv', 'rb')
def count(f):
    while 1:
        block = f.read(65536)
        if not block:
             break
        yield block.count(',')

linecount = sum(count(f))
print '\n%.3fs: file has %s rows' % (elapsed(), linecount)

# pre-allocate array and load data into array
m = np.zeros(linecount, dtype=[('a', np.uint32), ('b', np.uint32)])
f.seek(0)
f = csv.reader(open('links.csv', 'rb'))
for i, row in enumerate(f):
    m[i] = int(row[0]), int(row[1])

print '%.3fs: loaded' % elapsed()
# sort in-place
m.sort(order='b')

print '%.3fs: sorted' % elapsed()

Output on my machine, using a sample file similar to the one you showed:

6.139s: file has 33253213 lines
238.130s: read into memory
517.669s: sorted

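The default sort in numpy is Quicksort. The sort() routine (which sorts in place) can also take the keyword argument kind="mergesort" or kind="heapsort", but it appears that neither of these can sort a record array. Incidentally, I used a record array because it was the only way I could see to sort the columns together; the default would sort them independently, totally scrambling your data.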
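
The difference between sorting rows together and sorting columns independently is easy to see on a tiny example. This is a minimal sketch in Python 3 syntax, assuming a reasonably recent numpy; the three sample rows are made up.

```python
import numpy as np

pairs = [(3, 30), (1, 20), (2, 10)]   # made-up sample rows

# Structured (record) array: each element is one whole (a, b) row.
m = np.array(pairs, dtype=[('a', np.uint32), ('b', np.uint32)])
bytes_per_row = m.dtype.itemsize      # two 4-byte fields
m.sort(order='b')                     # in place; rows move as units
rows_sorted = m.tolist()

# Plain 2-D array: sorting along axis 0 sorts each column on its own,
# which breaks the pairing between the two values of a row.
plain = np.array(pairs, dtype=np.uint32)
columns_sorted = np.sort(plain, axis=0).tolist()

print(bytes_per_row)
print(rows_sorted)
print(columns_sorted)
```

At 8 bytes per row, the 30,609,720 rows from the question come to 244,877,760 bytes, which matches the "about 250MB" estimate above.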

User

#2 · edited 2024-05-16 05:51:22

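Since these are all just numbers, loading them into an Nx2 array would remove some of the overhead. Use NumPy for multidimensional arrays. Alternatively, you could use two ordinary Python arrays, one per column.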
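
The two-array variant could look roughly like the following. This is only a sketch in Python 3 syntax, assuming "Python arrays" means the standard-library array module; load_columns is a hypothetical helper, and the sample lines are taken from the tail output in the question.

```python
from array import array


def load_columns(lines):
    """Parse 'a,b' text lines into two compact unsigned-int arrays."""
    col_a, col_b = array('I'), array('I')
    for line in lines:
        line = line.strip()          # also drops the CRLF terminator
        if not line:
            continue
        a, b = line.split(',')
        col_a.append(int(a))
        col_b.append(int(b))
    return col_a, col_b


sample = ["4757187,59883\r\n", "4757187,99822\r\n", "4757187,66546\r\n"]
col_a, col_b = load_columns(sample)
print(list(col_a), list(col_b))
print("bytes per stored value:", col_a.itemsize)
```

The 'I' typecode stores raw C unsigned ints (commonly 4 bytes each, though this is platform-dependent) instead of one Python object per number. Sorting both columns together by the second one still needs extra work, which is where the NumPy route is more convenient.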

User

#3 · edited 2024-05-16 05:51:22

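All Python objects carry a memory overhead on top of the data they actually store. According to getsizeof on my 32-bit Ubuntu system, a tuple has an overhead of 32 bytes and an int takes 12 bytes, so each row in your file takes 56 bytes plus a 4-byte pointer in the list. I presume it will be a lot more on a 64-bit system. This is in line with the figures you gave, and it means your 30 million rows will take 1.8 GB.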
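
The arithmetic behind the 1.8 GB estimate, and the same measurement on whatever interpreter you are running, can be sketched as follows. The constants in the first call are the ones quoted in this answer (32-bit Ubuntu, Python 2); bytes_per_row is a hypothetical helper, and the row count comes from the question's output.

```python
import struct
import sys

ROWS = 30609720   # "Total number of lines" from the question's output


def bytes_per_row(tuple_bytes, int_bytes, pointer_bytes):
    """One (a, b) row: the tuple, two int objects, and the list's pointer to it."""
    return tuple_bytes + 2 * int_bytes + pointer_bytes


# Figures quoted in this answer for a 32-bit Python 2 build.
quoted_row = bytes_per_row(32, 12, 4)
quoted_total = quoted_row * ROWS

# The same estimate measured on the current interpreter.
local_row = bytes_per_row(
    sys.getsizeof((0, 0)),       # a 2-tuple
    sys.getsizeof(4757187),      # an int of the magnitude seen in the file
    struct.calcsize('P'),        # size of a pointer on this build
)

print("quoted: %d bytes/row, %.2f GB total" % (quoted_row, quoted_total / 1e9))
print("here:   %d bytes/row, %.2f GB total" % (local_row, local_row * ROWS / 1e9))
```

Running this on a 64-bit build gives a feel for how much larger the per-row cost is there than the 32-bit figures quoted above.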

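I suggest that instead of using Python you use the Unix sort utility. I am not a Mac person, but I presume the OS X sort options are the same as in the Linux version, so this should work: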

sort -n -t, -k2 links.csv

-n means sort numerically

-t, means use a comma as the field separator

-k2 means sort on the second field

This sorts the file and writes the result to stdout. You can redirect it to another file, or pipe it into your Python program for further processing.

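Edit: If you do not want to sort the file before running your Python script, you can use the subprocess module to open a pipe to the shell sort utility and then read the sorted results from the pipe's output.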
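
A minimal sketch of the subprocess approach, in Python 3.7+ syntax and assuming a Unix-like system with sort on the PATH. The generator name iter_sorted_edges is made up for illustration; the sort options are the ones explained above, and setting LC_ALL=C is an extra assumption to keep the ordering independent of the user's locale.

```python
import os
import subprocess


def iter_sorted_edges(path):
    """Yield (a, b) int tuples from a two-column CSV, ordered by column b."""
    cmd = ['sort', '-n', '-t,', '-k2', path]
    env = dict(os.environ, LC_ALL='C')   # assumption: locale-independent ordering
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True, env=env) as proc:
        for line in proc.stdout:
            line = line.strip()          # drops LF or CRLF terminators
            if line:
                a, b = line.split(',')
                yield int(a), int(b)
    # Leaving the with-block waits for sort to exit, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
```

Because the rows are consumed one at a time from the pipe, the Python process never has to hold the whole file in memory; for example, `for a, b in iter_sorted_edges("links.csv"): ...` would process the edges in order of their second column.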
