Parsing a large (20 GB) text file with Python: two lines are read as one

Posted 2024-05-15 05:42:52


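I am parsing a 20 GB file and writing the lines that meet a certain condition to another file, but occasionally Python reads two lines at once and concatenates them.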

inputFileHandle = open(inputFileName, 'r')

row = 0

for line in inputFileHandle:
    row =  row + 1
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)

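I checked the line endings in the source file, and they check out as line feeds (ASCII char 10). Pulling out the problem lines and analyzing them in isolation works as expected. Am I hitting some Python limitation here? The first anomaly occurs at roughly the 4 GB mark in the file.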


2 answers

A quick Google search for "python reading files larger than 4GB" turns up many results. See here for one such example, and another one that picks up where the first leaves off.

This is a bug in Python.

Now, the explanation of the bug; it's not easy to reproduce because it depends both on the internal FILE buffer size and the number of chars passed to fread(). In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF." Oddly, there is an almost exact copy of this function in Perl source code: http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668 The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it will fail because it is unable to return the current position in a 32bit DWORD. [The fix is easy; do you see it?] At this point, the function thinks that the next read() will return the LF, but it won't because the file pointer was not moved back.

And the workaround:

But note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python); with 2.7, you may use io.open().
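A minimal sketch of that workaround, assuming the file is UTF-8 text. The function name `filter_lines` and the `keep` predicate are hypothetical stand-ins for the question's loop and its `line_meets_condition` test; only `io.open()` itself comes from the quoted advice.

```python
import io


def filter_lines(input_path, output_path, keep):
    """Copy lines for which keep(line) is true.

    Returns the 1-based numbers of the rows that were skipped.
    The utf-8 encoding is an assumption; adjust it to the real file.
    """
    ignored_rows = []
    # io.open() uses Python's own buffering and newline handling on
    # both 2.7 and 3.x, so line splitting is not left to the C runtime.
    with io.open(input_path, 'r', encoding='utf-8') as src, \
            io.open(output_path, 'w', encoding='utf-8') as dst:
        for row, line in enumerate(src, 1):
            if keep(line):
                dst.write(line)
            else:
                ignored_rows.append(row)
    return ignored_rows
```

The file is read lazily, one line at a time, so memory use should stay flat apart from the list of ignored row numbers.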

The 4 GB mark is suspiciously close to the maximum value that can be stored in a 32-bit register (2**32).
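To make the arithmetic concrete: 2**32 bytes is exactly 4 GiB, and the largest unsigned 32-bit value is 2**32 - 1. The helper below is only an illustration (the name `fits_in_dword` is made up here); it uses `struct.pack` with the `'<I'` format, which rejects anything outside that range.

```python
import struct

DWORD_MAX = 2 ** 32 - 1        # largest unsigned 32-bit value
FOUR_GIB = 4 * 1024 ** 3       # 4 GiB expressed in bytes


def fits_in_dword(position):
    """Return True if a byte position fits in an unsigned 32-bit integer."""
    try:
        struct.pack('<I', position)  # raises struct.error when out of range
        return True
    except struct.error:
        return False
```

Any file position at or beyond `FOUR_GIB` fails this check, which matches where the first anomaly was reported.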

The code you posted looks fine by itself, so I suspect there is a bug in your Python build.
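One way to test that suspicion is to re-read the file in binary mode, which skips newline translation, and track byte offsets to see which row sits at the 4 GB boundary. This is a rough diagnostic sketch; the helper names are hypothetical, and it assumes that binary mode is not affected by the same problem.

```python
import io

FOUR_GIB = 2 ** 32


def iter_lines_with_offsets(path):
    """Yield (row, byte_offset, line) with the file opened in binary mode."""
    offset = 0
    with io.open(path, 'rb') as handle:
        for row, line in enumerate(handle, 1):
            yield row, offset, line
            offset += len(line)  # bytes consumed so far, newline included


def first_row_at_or_after(path, limit=FOUR_GIB):
    """Return the first 1-based row starting at or past limit, else None."""
    for row, offset, _line in iter_lines_with_offsets(path):
        if offset >= limit:
            return row
    return None
```

If the joined lines in the text-mode run all have row numbers at or after the value returned by `first_row_at_or_after(inputFileName)`, that would point to a 32-bit offset problem and not to the data itself.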

FWIW, the snippet would be a bit cleaner if you used enumerate:

inputFileHandle = open(inputFileName, 'r')

# start at 1 so row numbers match the original loop, which counted from 1
for row, line in enumerate(inputFileHandle, 1):
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)
