Below is my approach. I take a list of "words" and look for each one as a substring in a list of "phrases", returning, for every phrase, the words that were found in it.
import re

def is_phrase_in(phrase, text):
    # Whole-word, case-insensitive match of `phrase` inside `text`
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []  # words found in the current phrase
    for word in list_to_search:
        if is_phrase_in(word, phrase):
            searched.append(word)
    to_be_appended.append(searched)
print(to_be_appended)
# (desired and actual) output
[['my'],
['name', 'is'],
['name', 'is'],
['you'],
['name', 'is', 'your'],
['my', 'name', 'is']]
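Because the "words" list (list_to_search) has about 1,700 entries and the "phrases" list (list_to_be_searched) has about 26,561 entries, the code takes about 30 minutes to finish. I don't think the code above is Pythonic or makes use of efficient data structures. :(
Can anyone give me some advice on how to optimize it or speed it up?
Thanks!
Actually, my example above was wrong. What if the elements of list_to_search consist of multiple words?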
import re

def is_phrase_in(phrase, text):
    # Whole-word, case-insensitive match of `phrase` inside `text`
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []  # search terms found in the current phrase
    for word in list_to_search:
        if is_phrase_in(word, phrase):
            searched.append(word)
    to_be_appended.append(searched)
print(to_be_appended)
# (desired and actual) output
[['hello my'],
['name', 'is'],
['name', 'is'],
[],
['name', 'is', 'is your name', 'your'],
['name', 'is']]
Timing, first method:
%%timeit
def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word, phrase):
            searched.append(word)
    to_be_appended.append(searched)
#43.2 µs ± 346 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Second method (nested list comprehension and re.findall):
%%timeit
[[j for j in list_to_search if j in re.findall(r"\b{}\b".format(j), i)] for i in list_to_be_searched]
#40.3 µs ± 454 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The timing does improve a little, but is there a faster way? Or is this task inherently slow, given what it has to do?
Although the most direct and readable approach is a list comprehension, I wanted to see whether regex could do better.
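Using a regex on each item in list_to_be_searched did not seem to give any performance gain. Joining list_to_be_searched into one large block of text and matching it against a regex pattern built from list_to_search, however, was slightly faster. To see how this behaves on larger input, I used the 1,000 most common English words, one word per entry, as list_to_search, and the full text of David Copperfield from Project Gutenberg, one line per entry, as list_to_be_searched.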
The results were as follows:
So if you care about performance, either of the regex approaches will do. Hope that helps! :)
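The idea described above of building one pattern from list_to_search can be sketched roughly as follows. This is only an illustration, not the answerer's exact code: the names `combined` and `matches_per_phrase` are made up here, and the pattern is applied phrase by phrase rather than to one joined block of text so that the per-phrase grouping is kept. Note one behavioral difference from the loop in the question: a single alternation reports non-overlapping matches only, so for 'what is your name' it returns 'is your name' but not the overlapping 'name', 'is', or 'your'.

```python
import re

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

# Escape each term and try longer terms first, so that 'is your name'
# is attempted before the shorter 'is' at the same position.
alternatives = sorted(map(re.escape, list_to_search), key=len, reverse=True)

# One pattern, compiled once, covering every search term (hypothetical name).
combined = re.compile(r"\b(?:{})\b".format("|".join(alternatives)), re.IGNORECASE)

# findall returns the non-overlapping matches in the order they occur in the text.
matches_per_phrase = [combined.findall(phrase) for phrase in list_to_be_searched]
print(matches_per_phrase)
```

If overlapping terms must all be reported, as in the question's desired output, this single-pattern form is not a drop-in replacement and a per-term search is still needed.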
You can use a nested list comprehension.
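A minimal sketch of such a nested comprehension, not necessarily the answerer's exact code, is shown below. It assumes the search terms should be matched as whole words, case-insensitively, as in the question. The names `compiled` and `found` are illustrative. Each pattern is compiled once up front; the `re` module only keeps a bounded cache of recently compiled patterns, so with around 1,700 distinct terms, rebuilding the pattern string inside the inner loop may force repeated recompilation.

```python
import re

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

# Compile every search term exactly once, outside the loops.
# re.escape guards against terms that contain regex metacharacters.
compiled = [
    (term, re.compile(r"\b{}\b".format(re.escape(term)), re.IGNORECASE))
    for term in list_to_search
]

# Outer comprehension: one result list per phrase.
# Inner comprehension: every term whose pattern matches that phrase.
found = [
    [term for term, pattern in compiled if pattern.search(phrase)]
    for phrase in list_to_be_searched
]
print(found)
```

Unlike a single combined pattern, this keeps overlapping terms such as 'is' and 'is your name', so it produces the same lists as the loop in the question.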