Reading large files in Python

Asked 2025-04-18 17:57 by 数据工程师

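I need a Python script that reads a large data file (about 200 GB) line by line.

The usual line-by-line approaches I tried use a lot of memory, so I would like to read the file in chunks.

Is there a better way to load a file this large line by line, for example:

a) by explicitly capping how many lines are loaded into memory at once, or

b) by loading it in chunks of a given size, say 1024 bytes, provided that the last line of each chunk is loaded in full and not truncated?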

2 Answers

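To read a file line by line, do not use readlines, which loads every line into memory at once. Instead, iterate over the file object itself. (You may come across examples that use xreadlines; that method is deprecated, simply returns the file object itself, and no longer exists in Python 3.)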

with open(the_path, 'r') as the_file:
    for line in the_file:
        # Do stuff with the line
        pass
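As a minimal sketch of what the loop body might do, the hypothetical helper below counts the lines that contain a substring (count_matching_lines and its needle parameter are illustrative names, not part of the original answer). Only one line is held at a time, so memory use should stay roughly constant regardless of file size, assuming no single line is itself huge.

```python
def count_matching_lines(path, needle):
    """Count lines containing `needle`, holding one line in memory at a time."""
    count = 0
    with open(path, 'r') as the_file:
        for line in the_file:  # the file object yields one line per iteration
            if needle in line:
                count += 1
    return count
```

A call such as count_matching_lines(the_path, 'ERROR') would scan the whole file without ever building a list of its lines.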

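To read several lines at a time, call next on the file (a file object is an iterator), but you need to handle the StopIteration exception, which signals that no data is left: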

with open(the_path, 'r') as the_file:
    done = False
    while not done:
        the_lines = []
        for _ in range(number_of_lines):  # Use xrange on Python 2
            try:
                the_lines.append(next(the_file))
            except StopIteration:
                done = True  # Reached end of file
                break
        if the_lines:
            # Do stuff with the lines (the last batch may be shorter)
            pass
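An alternative sketch for option (a) uses itertools.islice from the standard library, which takes at most a given number of items from an iterator and simply returns fewer at end of file, so no StopIteration handling is needed. The names iter_line_batches and batch_size are hypothetical and chosen only for illustration.

```python
from itertools import islice


def iter_line_batches(path, batch_size):
    """Yield lists of at most `batch_size` lines from the file at `path`."""
    with open(path, 'r') as the_file:
        while True:
            # islice pulls up to batch_size lines and stops early at EOF.
            batch = list(islice(the_file, batch_size))
            if not batch:
                break  # nothing left to read
            yield batch
```

Each yielded list holds at most batch_size lines, so peak memory is bounded by the size of one batch rather than by the size of the file.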

Of course, you can also load the file in chunks of a given number of bytes:

# Binary mode, so that read() counts bytes; in text mode it counts characters.
with open(the_path, 'rb') as the_file:
    while True:
        data = the_file.read(the_byte_count)
        if not data:
            # All data is gone
            break
        # Do stuff with the data chunk
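A fixed-size read can cut a line in half, which is the concern raised in option (b) of the question. One possible sketch, assuming newline-terminated lines and a file opened in binary mode, reads a chunk and then calls readline to finish whatever line the chunk ended in. The function name iter_chunks_on_line_boundaries and the chunk_size default are hypothetical.

```python
def iter_chunks_on_line_boundaries(path, chunk_size=1024):
    """Yield byte chunks of roughly `chunk_size` that always end on a full line."""
    with open(path, 'rb') as the_file:
        while True:
            chunk = the_file.read(chunk_size)
            if not chunk:
                break  # end of file
            if not chunk.endswith(b'\n'):
                # The chunk stopped mid-line: read the rest of that line.
                chunk += the_file.readline()
            yield chunk
```

Chunks can exceed chunk_size by up to the length of one line, so this bounds memory only when individual lines are reasonably short. Each chunk can be split into lines with chunk.splitlines().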

Answered 2025-04-18 by Python大师

Instead of reading everything at once, try reading one line at a time:

with open("myFile.txt") as f:
    for line in f:
        # Do stuff with your line
        pass

Or, if you want to read N lines at a time:

with open("myFile.txt") as myfile:
    head = [next(myfile) for _ in range(N)]  # use xrange on Python 2
    print(head)

When you reach the end of the file, next raises StopIteration. A simple way to handle it is try/except (there are of course many other ways).

head = []
try:
    for _ in range(N):
        head.append(next(myfile))  # append one by one so nothing read is lost
except StopIteration:
    pass  # fewer than N lines were left; head holds what was read

Alternatively, read the last few lines in whatever way you prefer.
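If batches should be bounded by size rather than by line count, one option is the hint argument of the file method readlines, which stops collecting lines once their total size exceeds the hint while still returning only complete lines. The sketch below is an illustration under that assumption; iter_sized_batches and size_hint are hypothetical names.

```python
def iter_sized_batches(path, size_hint=1024 * 1024):
    """Yield lists of complete lines whose total size is roughly `size_hint`."""
    with open(path, 'r') as myfile:
        while True:
            # A positive hint limits how much is read per call;
            # a hint of 0 or less would read the whole file instead.
            lines = myfile.readlines(size_hint)
            if not lines:
                break  # end of file
            yield lines
```

A batch may slightly overshoot size_hint because the line that crosses the limit is still returned in full.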

Answered 2025-04-18 by Python大师
