使用Python实现智能过滤器

3 投票

8 回答

888 浏览

提问于 2025-04-16 08:17

你好！
我需要过滤掉所有不包含某些符号的行，这些符号来自一个很大的“必要”列表，下面是示例代码：

def any_it(iterable):
      for element in iterable:
          if element: return True
      return False

regexp = re.compile(r'fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ...] # huge list of 10 000 members
f = open("huge_file", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()

## File rows like, let's say:
# 1 djhds fruit=REDSOMETHING sdkjld
# 2 sdhfkjk fruit=GREENORANGE lkjfldk
# 3 dskjldsj fruit=YELLOWDOG sldkfjsdl
# 4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg

filtered = (line for line in lines if any_it(regexp.findall(line)[0].startswith(x) for x in necessary))

我用的是Python 2.4，所以不能使用内置的 any() 函数。
我等了很久来进行这个过滤，但有没有什么方法可以优化一下？比如说，第1行和第4行都包含“RED..”这个模式，如果我们发现“RED..”这个模式是可以的，那在处理第4行的时候，能不能跳过在10000个成员的列表中查找同样的模式呢？
有没有其他方法可以优化过滤的过程？
谢谢！
...编辑过...
更新：在这篇帖子评论中查看实际示例数据。我也想知道如何按“水果”对结果进行排序。谢谢！
...编辑结束...

数据结构数据处理排序算法模式匹配优化算法文本过滤列表查找智能过滤器

8 个回答

测试过的（但没有基准测试）代码：

import re
import fileinput

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ]

filtered = []
for line in fileinput.input(["test.txt"]):
    try:
        key = regexp.match(line).group(1)
    except AttributeError:
        continue # no match
    for p in necessary:
        if key.startswith(p):
            filtered.append(line)
            break

# "filtered" now holds your results
print "".join(filtered)

与问题中的代码的不同之处：

我们不会像使用 file.readlines() 时那样先把整个文件加载到内存中。相反，我们在读取文件时逐行处理。这里我用 fileinput 模块来简化，但你也可以用 line = file.readline() 和一个 while line: 循环来实现。
一旦找到匹配项，我们就停止遍历 necessary 列表。
我们修改了正则表达式的模式，并使用 re.match 代替 re.findall。这是因为假设每一行只会包含一个 "fruit=..." 的条目。

更新

如果输入文件的格式是一致的，你可以通过完全去掉正则表达式来提高一些性能。

try:
    # with line = "2 asdasd fruit=SOMETHING asdasd...."
    key = line.split(" ", 3)[2].split("=")[1]
except:
    continue # no match

回答于 2025-04-16 由 Python大师

分享举报

如果你把necessary这个列表组织成一种叫做字典树的结构，那么你就可以在这个字典树里查找，看看fruit这个词是否以一个有效的前缀开头。这样做应该比逐个比较fruit和每个前缀要快。

举个例子（测试得不多）：

import bisect
import re

class Node(object):
    def __init__(self):
        self.children = []
        self.children_values = []
        self.exists = False

    # Based on code at http://docs.python.org/library/bisect.html                
    def _index_of(self, ch):
        i = bisect.bisect_left(self.children_values, ch)
        if i != len(self.children_values) and self.children_values[i] == ch:
            return (i, self.children[i])
        return (i, None)

    def add(self, value):
        if len(value) == 0:
            self.exists = True
            return
        i, child = self._index_of(value[0])
        if not child:
            child = Node()
            self.children.insert(i, child)
            self.children_values.insert(i, value[0])
        child.add(value[1:])

    def contains_prefix_of(self, value):
        if self.exists:
            return True
        i, child = self._index_of(value[0])
        if not child:
            return False
        return child.contains_prefix_of(value[1:])

necessary = ['RED', 'GREEN', 'BLUE', 'ORANGE', 'BLACK',
             'LIGHTRED', 'LIGHTGREEN', 'GRAY']

trie = Node()
for value in necessary:
    trie.add(value)

# Find lines that match values in the trie
filtered = []
regexp = re.compile(r'fruit=([A-Z]+)')
for line in open('whatever-file'):
    fruit = regexp.findall(line)[0]
    if trie.contains_prefix_of(fruit):
        filtered.append(line)

这样一来，你的算法复杂度就从O(N * k)变成了O(k)（大致如此），其中N是necessary中的元素数量，k是fruit的长度。不过，这样会占用更多的内存，但在你的情况下，这可能是值得的权衡。

回答于 2025-04-16 由 Python大师

分享举报

我觉得Zach的回答是个不错的方向。出于好奇，我实现了另一个版本（结合了Zach关于用字典而不是bisect的建议），并把它整合成一个能匹配你例子的解决方案。

#!/usr/bin/env python
import re
from trieMatch import PrefixMatch # https://gist.github.com/736416

pm = PrefixMatch(['YELLOW', 'GREEN', 'RED', ]) # huge list of 10 000 members
# if list is static, it might be worth picking "pm" to avoid rebuilding each time

f = open("huge_file.txt", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
filtered = (line for line in lines if pm.match(regexp.match(line).group(1)))

为了简洁起见，PrefixMatch的实现可以在这里查看。

如果你的necessary前缀列表是固定的或者变化不大，你可以通过将PickleMatch对象保存起来并重复使用，而不是每次都重新构建它，这样可以加快后续的运行速度。

更新（关于排序结果）

根据Python 2.4的更新日志：

key应该是一个只接受一个参数的函数，这个函数接收列表中的一个元素，并返回一个用于比较的关键值。然后列表会根据这些关键值进行排序。

另外，在源代码的第1792行中：

/* Special wrapper to support stable sorting using the decorate-sort-undecorate
   pattern.  Holds a key which is used for comparisons and the original record
   which is returned during the undecorate phase.  By exposing only the key
   .... */

这意味着你的正则表达式模式对于每个条目只会被评估一次（而不是每次比较时都评估），所以这样做的开销应该不会太大：

sorted_generator = sorted(filtered, key=regexp.match(line).group(1))

回答于 2025-04-16 由 Python大师

分享举报

使用Python实现智能过滤器

8 个回答

更新

更新（关于排序结果）

撰写回答