What is the most efficient way to loop over a file line by line?

Asked 2025-04-30 10:44

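I have a file, dataset.nt, that isn't too large (300 MB). I also have a list of about 500 elements. For each element in the list, I want to count the number of lines in the file that contain it, and store the element and its count in a dictionary (the key is the list element's name, and the value is the number of lines in the file where it appears).

This is my first attempt at it: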

mydict = {}

for i in mylist:
    regex = re.compile(r"/Main/"+re.escape(i))
    total = 0
    with open("dataset.nt", "rb") as input:
        for line in input:
            if regex.search(line):
                total = total+1
    mydict[i] = total

But this code didn't work (that is, it never finished running), and I realized I shouldn't read the file 500 times, once per element. So I tried this:

mydict = {}

with open("dataset.nt", "rb") as input:
    for line in input:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total

But performance didn't improve, and the script still ran without finishing. So I searched online and tried this:

mydict = {}

file = open("dataset.nt", "rb")

while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total

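This version has been running for 30 minutes now, so I guess it is no better.

How can I change this code so that it finishes in a reasonable amount of time?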

3 Answers

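The other solutions are all good. However, since each element has its own regular expression, and it doesn't matter if an element appears more than once on a line, you can use re.findall to count the lines that contain the target expression.

Also, beyond a certain number of lines, if you have enough memory and no design constraints, it is better to read the whole file into memory.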

    import re

    mydict = {}
    mylist = [...]  # A list with 500 items

    # Bind the functions to local names so Python doesn't have to
    # resolve the attribute lookups on every call.
    findall = re.findall
    escape = re.escape

    # Open in text mode so the str patterns below can be matched on Python 3.
    with open("dataset.nt", "r") as infile:
        # Read the file once and keep it in memory instead of reading it
        # line by line. With many lines this is faster.
        text = infile.read()

    for elem in mylist:
        # "." doesn't match a newline, so each match stays within one line and
        # the number of matches is the number of lines containing the target.
        mydict[elem] = len(findall(".*/Main/{0}.*".format(escape(elem)), text))

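I tested this with an 800 MB file (I wanted to see how long loading a file that large into memory takes; it is faster than you might think).

I haven't tested the whole code with real data, only the findall part.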
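As a quick sanity check of the findall idea, here is a minimal sketch on a small in-memory sample. The sample text, the names, and the helper count_matching_lines are made up for illustration. The pattern has no trailing newline, so a last line without a line terminator is counted too.

```python
import re

def count_matching_lines(text, names):
    """Return {name: number of lines in text containing "/Main/" + name}."""
    counts = {}
    for name in names:
        # "." never crosses a newline, so findall yields one match per line.
        pattern = ".*/Main/{0}.*".format(re.escape(name))
        counts[name] = len(re.findall(pattern, text))
    return counts

# Hypothetical sample data: "Foo" appears twice on the first line.
sample = (
    "<http://x/Main/Foo> <p> <http://x/Main/Foo> .\n"
    "<http://x/Main/Bar> <p> <o> .\n"
    "<http://x/Main/Foo> <p> <o> ."
)

print(count_matching_lines(sample, ["Foo", "Bar", "Baz"]))
# → {'Foo': 2, 'Bar': 1, 'Baz': 0}
```

The first line contains "Foo" twice but contributes only one match, which is the per-line count the question asks for.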

Answered 2025-04-30 by Python大师

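This looks like a good fit for some map/reduce-style parallel processing. You could split your dataset file into N parts (where N is the number of processors you have), start N subprocesses that each scan one part, and then add up the results.

Of course, that doesn't stop you from optimizing the scan itself first, that is (based on sebastian's code):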
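A rough sketch of that map/reduce idea, using multiprocessing.Pool from the standard library. Rather than splitting the file into exactly N parts, it hands out chunks of lines to the workers, which is a simplification of mine; the function names, the chunk size, and the UTF-8 encoding are all assumptions. Sending the lines to the workers has its own cost, so whether this helps would need measuring.

```python
import re
from collections import Counter
from itertools import islice
from multiprocessing import Pool

def count_chunk(args):
    """Map step: count matching lines per name within one chunk of lines."""
    lines, names = args
    patterns = [(name, re.compile("/Main/" + re.escape(name))) for name in names]
    counts = Counter()
    for line in lines:
        if "/Main/" not in line:
            continue  # no pattern can match this line
        for name, pattern in patterns:
            if pattern.search(line):
                counts[name] += 1
    return counts

def merge_counts(partials, names):
    """Reduce step: add up the per-chunk counters."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return {name: total[name] for name in names}

def read_chunks(path, names, chunk_size=100_000):
    """Yield (lines, names) work items of at most chunk_size lines each."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        while True:
            lines = list(islice(handle, chunk_size))
            if not lines:
                break
            yield lines, names

def parallel_count(path, names, workers=None):
    """Scan the file with a pool of worker processes (default: one per CPU)."""
    with Pool(workers) as pool:
        partials = pool.imap_unordered(count_chunk, read_chunks(path, names))
        return merge_counts(partials, names)
```

It would be called as parallel_count("dataset.nt", mylist), from under an if __name__ == "__main__": guard on platforms that spawn rather than fork worker processes.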

import re

# Compile each pattern once, up front.
targets = [(i, re.compile(r"/Main/" + re.escape(i))) for i in mylist]
results = dict.fromkeys(mylist, 0)

# Text mode, so the str patterns and the '/Main/' test work on Python 3.
with open("dataset.nt", "r") as infile:
    for line in infile:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue

        # do the regex part
        for i, regex in targets:
            if regex.search(line):
                results[i] += 1

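Note that if you could provide a sample of your dataset, this could probably be optimized further. For example, if your dataset can be sorted by "/Main/{i}" (using the system's sort program), you wouldn't need to check every line against every value of i. Or, if the position of "/Main/" in the line is known and fixed, you could do a simple comparison on the relevant part of the string (which may be faster than a regular expression).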
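As an illustration of the fixed-position case, here is a minimal sketch that replaces the regex with str.startswith at a known offset. The function name and the offset parameter are hypothetical, and it assumes "/Main/" begins at the same character index on every line of interest, which may well not hold for your data.

```python
def count_at_offset(lines, names, offset):
    """Count, per name, the lines where "/Main/" + name starts at `offset`."""
    prefixes = [(name, "/Main/" + name) for name in names]
    counts = dict.fromkeys(names, 0)
    for line in lines:
        # Cheap test first: skip lines without "/Main/" at the expected place.
        if not line.startswith("/Main/", offset):
            continue
        for name, prefix in prefixes:
            # Plain string comparison at a fixed index, no regex involved.
            if line.startswith(prefix, offset):
                counts[name] += 1
    return counts
```

Like the regex version, this counts a line for "Foo" when it contains "/Main/FooBar" at the offset, since only the start of the name is compared.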

Answered 2025-04-30 by Python大师

I would suggest a small modification to your second version:

import re

# Start every count at 0 so it accumulates across lines.
mydict = dict.fromkeys(mylist, 0)

# Compile each pattern once, outside the loops.
re_list = [re.compile(r"/Main/" + re.escape(i)) for i in mylist]

# Text mode, so the str patterns and the '/Main/' test work on Python 3.
with open("dataset.nt", "r") as infile:
    for line in infile:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue

        # do the regex part
        for i, regex in zip(mylist, re_list):
            if regex.search(line):
                mydict[i] += 1

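As @matsjoyce already mentioned, this avoids recompiling the regular expressions on every iteration. If you really need that many different patterns, I don't think there is much more you can do.

You could also try capturing whatever follows "/Main/" with a single regular expression and then comparing the captured text against your list. That might reduce the number of regex searches that are actually needed.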
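A minimal sketch of that capture idea: one regex pulls out the token after "/Main/", and a dictionary lookup replaces the 500 searches per line. The character class that ends a token is a guess about the data format, and the names TOKEN and count_by_capture are made up. Unlike the per-name patterns, this matches whole tokens only, so "/Main/FooBar" does not count for "Foo".

```python
import re

# Assumption: a name ends at whitespace or at one of < > " /
TOKEN = re.compile(r'/Main/([^\s<>"/]+)')

def count_by_capture(lines, names):
    """Count, per name, the lines containing the exact token "/Main/" + name."""
    counts = dict.fromkeys(names, 0)
    for line in lines:
        if "/Main/" not in line:
            continue
        # A set, so a name that occurs twice on one line is counted once.
        for name in set(TOKEN.findall(line)):
            if name in counts:  # dict lookup instead of one search per name
                counts[name] += 1
    return counts
```

With this approach the work per line no longer grows with the length of the list, at the price of the stricter whole-token matching described above.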

Answered 2025-04-30 by Python大师
