Substring search for multi-word strings - Python

1 vote
2 answers
876 views
Asked 2025-04-17 17:17

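I want to check a set of sentences to see whether certain seed keywords appear in them. But I want to avoid using for seed in line, because that would cause a keyword like ring to be wrongly reported as present in a document that contains bring.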

I also want to check whether multi-word expressions (MWEs), such as words with spaces, appear in the documents.

I tried the following, but it is very slow. Is there a faster way?

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']

docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]

docs_seed = []
for d in docs:
  toAdd = False
  for s in seed:
    if " " in s:
      if s in d:
        toAdd = True
    if s in d.split(" "):
      toAdd = True
    if toAdd == True:
      docs_seed.append((s,d))
      break
print(docs_seed)

The result I am hoping for looks like this:

[('words with spaces', 'these are words with spaces but the drinks are the bar is also good'),
('foo', 'another sentence at the foo bar is here'),
('bar', 'then a bar bar black sheep')]

2 Answers

0

This should work, and it should be somewhat faster than what you are doing now:

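docs_seed = []
for d in docs:
    for s in seed:
        pos = d.find(s)
        end = pos + len(s)
        # Accept the match only if it is delimited by spaces or by the
        # start/end of the document.
        if (pos != -1
                and (pos == 0 or d[pos - 1] == " ")
                and (end == len(d) or d[end] == " ")):
            docs_seed.append((s, d))
            break

The find method returns the position of the seed in the document (or -1 if it is not found). We then check that the characters immediately before and after the match are spaces, or that the match sits at the very start or end of the string. This also fixes a bug in your original code, which did not require multi-word expressions to start or end on a word boundary: it would match "words with spaces" inside "swords with spaces", which is clearly wrong.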
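
One limitation worth noting: find reports only the first occurrence, so a seed that first shows up inside a longer word (for example ring in "bring the ring") would be rejected even though a standalone occurrence follows. Below is a minimal sketch that keeps scanning, assuming seeds are non-empty and words are separated by single spaces. The helper name contains_phrase is made up for illustration and is not part of the answer above.

```python
def contains_phrase(doc, phrase):
    """Return True if phrase occurs in doc delimited by spaces or string ends."""
    pos = doc.find(phrase)
    while pos != -1:
        end = pos + len(phrase)
        starts_ok = pos == 0 or doc[pos - 1] == " "
        ends_ok = end == len(doc) or doc[end] == " "
        if starts_ok and ends_ok:
            return True
        # This occurrence is inside a longer word; look for the next one.
        pos = doc.find(phrase, pos + 1)
    return False


print(contains_phrase("bring the ring", "ring"))        # True
print(contains_phrase("i forgot to bring it", "ring"))  # False
```

Replacing the inline find check with a call to such a helper would leave the rest of the loop unchanged.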

3

Consider using a regular expression:

import re

pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
pattern.findall(line)

\b matches the start or end of a "word" (a sequence of word characters).
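
A small sketch of what that boundary rule means in practice, using the ring / bring case from the question. The variable name ring and the sample sentences other than the first are invented for illustration.

```python
import re

ring = re.compile(r"\bring\b")

# "ring" is embedded in "bring": there is no boundary between "b" and "r".
print(ring.search("i forgot to bring my telephone"))  # None

# A standalone word is surrounded by boundaries.
print(ring.search("the ring is here") is not None)    # True

# \b sits between a word character and a non-word character,
# so punctuation counts as a boundary as well as spaces.
print(ring.search("nice ring, thanks") is not None)   # True
```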

Applied to the docs from the question:

>>> for line in docs:
...     print(pattern.findall(line))
... 
['words with spaces', 'bar']
['foo', 'bar']
['bar', 'bar']
[]
[]
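
findall lists every match, whereas the question asks for one (seed, document) pair per matching document. A rough sketch of one way to get that shape, assuming the leftmost match in each document is the one to report; it reuses the seed and docs lists from the question and the same pattern as above.

```python
import re

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']

docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]

# One compiled pattern for all seeds, each escaped and wrapped in \b.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')

docs_seed = []
for d in docs:
    m = pattern.search(d)  # leftmost match, or None if no seed occurs
    if m:
        docs_seed.append((m.group(0), d))

print(docs_seed)
```

Alternatives are tried in list order at each position, so bar wins over bar bar here. Sorting seed by length, longest first, before joining would prefer the longest phrase instead.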
