Substring search for multi-word strings - Python

Asked on 2025-04-17 17:17

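I want to check a set of sentences to see whether certain seed keywords occur in them. However, I want to avoid a plain substring check (seed in line), because then a keyword like ring would wrongly be reported as occurring in a document that contains bring.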

I also want to check whether multi-word expressions (MWEs), such as words with spaces, occur in the documents.

I tried the following, but it is very slow. Is there a faster way?

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']

docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]

docs_seed = []
for d in docs:
  toAdd = False
  for s in seed:
    if " " in s:
      # multi-word seed: plain substring test
      if s in d:
        toAdd = True
    # single-word seed: compare against whole space-separated tokens
    if s in d.split(" "):
      toAdd = True
    if toAdd:
      docs_seed.append((s, d))
      break
print(docs_seed)

The result I am hoping for looks like this:

[('words with spaces', 'these are words with spaces but the drinks are the bar is also good'),
 ('foo', 'another sentence at the foo bar is here'),
 ('bar', 'then a bar bar black sheep')]

2 Answers

This approach should work, and it should be somewhat faster than what you are doing now:

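docs_seed = []
for d in docs:
    for s in seed:
        pos = d.find(s)
        end = pos + len(s)
        # Accept the hit only if it is bounded by a space or a string edge.
        # The edge checks come first so d[pos - 1] and d[end] are never
        # evaluated at an out-of-range or wrapped-around index.
        if (pos != -1
                and (pos == 0 or d[pos - 1] == " ")
                and (end == len(d) or d[end] == " ")):
            docs_seed.append((s, d))
            break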

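find returns the position of the seed in the document (or -1 if it is not found). We then check whether the characters immediately before and after the match are spaces (or whether the string ends right after the match). This also fixes a bug in your original code: multi-word expressions were not required to start or end on a word boundary, so your code would match "words with spaces" inside "swords with spaces", which is clearly wrong. Note that find reports only the first occurrence, so a seed whose first hit falls inside a longer word (ring in "bring the ring") is still missed.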
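Since str.find reports only the first occurrence, one possible refinement is to keep scanning until a properly bounded hit turns up. The following is a minimal sketch, not part of the original answer: the helper name contains_phrase is hypothetical, and it assumes that only a space or a string edge counts as a boundary, so a seed followed by punctuation would not match.

```python
def contains_phrase(doc, phrase):
    """Return True if phrase occurs in doc bounded by spaces or string edges."""
    start = 0
    while True:
        pos = doc.find(phrase, start)
        if pos == -1:
            return False  # no further occurrences
        end = pos + len(phrase)
        before_ok = pos == 0 or doc[pos - 1] == " "
        after_ok = end == len(doc) or doc[end] == " "
        if before_ok and after_ok:
            return True
        start = pos + 1  # resume just past the start of this rejected hit


print(contains_phrase("bring the ring", "ring"))                   # True
print(contains_phrase("i forgot to bring it", "ring"))             # False
print(contains_phrase("swords with spaces", "words with spaces"))  # False
```

In the loop over docs and seed, the boundary test could then be replaced by a single call such as contains_phrase(d, s).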

Answered on 2025-04-17 by Python大师

Consider using a regular expression:

import re

pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
pattern.findall(line)

\b matches the start or end of a "word" (a sequence of word characters).

For example:

>>> for line in docs:
...     print(pattern.findall(line))
... 
['words with spaces', 'bar']
['foo', 'bar']
['bar', 'bar']
[]
[]
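To get (seed, document) pairs like those the question asks for, one option is to call search once per document and keep the first hit. The sketch below is an assumption about what is wanted rather than part of the original answer. It sorts the seeds longest-first, because regex alternation tries alternatives from left to right, and it reports the leftmost match in each document instead of the first seed in list order. As a result the third document yields 'bar bar' rather than the 'bar' shown in the question's expected output.

```python
import re

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']

docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]

# Longest seeds first, so a multi-word seed wins over a shorter seed
# that starts at the same position.
ordered = sorted(seed, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in ordered) + r')\b')

docs_seed = []
for d in docs:
    m = pattern.search(d)  # leftmost match, or None if no seed occurs
    if m:
        docs_seed.append((m.group(0), d))

for pair in docs_seed:
    print(pair)
```

Compiling the pattern once and scanning each document a single time avoids the nested loop over every seed, which is where most of the speed-up is likely to come from.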

Answered on 2025-04-17 by Python大师
