Substring search for multi-word strings - Python
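I want to check a set of sentences and see whether certain seed words occur in them. However, I want to avoid using for seed in line, because that would report a seed word such as "ring" as present in a document that only contains "bring".
I also want to check whether multi-word expressions (MWE), such as "word with spaces", appear in the documents.
I have tried the following, but it is extremely slow. Is there a faster way to do this?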
seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']
docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]
docs_seed = []
for d in docs:
    toAdd = False
    for s in seed:
        if " " in s:
            # multi-word expression: plain substring test
            if s in d:
                toAdd = True
        # single word: must equal one of the space-separated tokens
        if s in d.split(" "):
            toAdd = True
        if toAdd:
            docs_seed.append((s, d))
            break
print(docs_seed)
The desired output should look like this:
[('words with spaces','these are words with spaces but the drinks are the bar is also good'),
('foo','another sentence at the foo bar is here'),
('bar', 'then a bar bar black sheep')]
2 Answers
0
This should work and be somewhat faster than your current approach:
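docs_seed = []
for d in docs:
    for s in seed:
        pos = d.find(s)
        end = pos + len(s)
        # Require a space or a string boundary on each side of the match.
        # Note: only the first occurrence of s in d is examined.
        if pos != -1 and (pos == 0 or d[pos - 1] == " ") \
                and (end == len(d) or d[end] == " "):
            docs_seed.append((s, d))
            break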
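The find function gives the position of the seed value in the document (or -1 if it is not found). We then check whether the characters before and after the value are spaces (or whether the string ends right after the substring). This also fixes a bug in your original code, where multi-word expressions were not required to begin or end on a word boundary. Your original code would match "words with spaces" inside "swords with spaces", which is clearly wrong.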
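One limitation worth noting: find only reports the first occurrence, so a document such as "bring the ring" would be rejected because the first hit sits inside "bring". Below is a minimal sketch of a variant that keeps scanning after a rejected hit. The helper name contains_phrase is hypothetical and not part of the code above, and it assumes that words are separated by single spaces only.

```python
def contains_phrase(doc, phrase):
    """Return True if phrase occurs in doc delimited by spaces or string edges."""
    pos = doc.find(phrase)
    while pos != -1:
        end = pos + len(phrase)
        starts_ok = pos == 0 or doc[pos - 1] == " "
        ends_ok = end == len(doc) or doc[end] == " "
        if starts_ok and ends_ok:
            return True
        # Resume the search one character after the rejected hit.
        pos = doc.find(phrase, pos + 1)
    return False


print(contains_phrase("i forgot to bring my telephone", "ring"))   # False
print(contains_phrase("bring the ring", "ring"))                   # True
print(contains_phrase("swords with spaces", "words with spaces"))  # False
```

Punctuation is not treated as a delimiter here, so "ring," would not count as a match for "ring".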
3
Consider using a regular expression:
import re
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
pattern.findall(line)
\b matches the beginning or end of a "word" (a sequence of word characters).
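To see how this differs from a plain substring test, here is a small sketch using the ring/bring case from the question. The sample strings are made up for illustration.

```python
import re

# Plain substring test: reports a hit inside "bring".
print("ring" in "bring")                                    # True

# With \b on both sides, "ring" must stand alone as a word.
print(re.search(r"\bring\b", "bring") is not None)          # False
print(re.search(r"\bring\b", "a ring here") is not None)    # True

# Punctuation also counts as a word boundary, unlike a check for spaces.
print(re.search(r"\bbar\b", "meet me at the bar.") is not None)  # True
```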
For example:
>>> for line in docs:
...     print(pattern.findall(line))
...
['words with spaces', 'bar']
['foo', 'bar']
['bar', 'bar']
[]
[]
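Note that findall returns every match per line, while the question asks for one (seed, doc) pair per matching document. Also, regex alternation tries the alternatives from left to right, which is why "bar bar" shows up as two separate "bar" matches. A sketch of how the requested list might be built with search instead; the names build_pattern, seeded_docs and the longest_first flag are hypothetical, and taking only the first match per document is an assumption based on the desired output in the question.

```python
import re

def build_pattern(seeds, longest_first=False):
    # Optionally try longer seeds first so 'bar bar' wins over 'bar'.
    if longest_first:
        seeds = sorted(seeds, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(s) for s in seeds) + r")\b")

def seeded_docs(seeds, docs, longest_first=False):
    pattern = build_pattern(seeds, longest_first)
    result = []
    for d in docs:
        m = pattern.search(d)  # leftmost match only
        if m:
            result.append((m.group(0), d))
    return result

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']
docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]

for pair in seeded_docs(seed, docs):
    print(pair)
```

With the seed list in its original order this yields the three pairs listed in the question. Passing longest_first=True reports 'bar bar' instead of 'bar' for the third document, which may or may not be what is wanted.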