Optimizing the search for matching substrings between two lists using regex in Python

import re

def is_phrase_in(phrase, text):
    # True if `phrase` occurs in `text` as a whole word or phrase, ignoring case
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word, phrase):
            searched.append(word)
    to_be_appended.append(searched)

print(to_be_appended)
# (desired and actual) output
# [['my'], ['name', 'is'], ['name', 'is'], ['you'], ['name', 'is', 'your'], ['my', 'name', 'is']]

import re

def is_phrase_in(phrase, text):
    # True if `phrase` occurs in `text` as a whole word or phrase, ignoring case
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

# Same search, but some of the search terms are now multi-word phrases
list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word, phrase):
            searched.append(word)
    to_be_appended.append(searched)

print(to_be_appended)
# (desired and actual) output
# [['hello my'], ['name', 'is'], ['name', 'is'], [], ['name', 'is', 'is your name', 'your'], ['name', 'is']]

%%timeit
def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word, phrase):
            searched.append(word)
    to_be_appended.append(searched)

# 43.2 µs ± 346 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

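2 answers

User

#1 · edited 2024-04-20 08:25:32

Although the most direct and readable approach is a list comprehension, I wanted to see whether regex could do better.

Running a regex against each item of list_to_be_searched individually does not seem to improve performance. Joining list_to_be_searched into one large block of text and matching it against a single pattern built from list_to_search gives a slight improvement: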

In [1]: import re
   ...:
   ...: list_to_search = ['my', 'name', 'is', 'you', 'your']
   ...: list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']
   ...:
   ...: def simple_method(to_search, to_be_searched):
   ...:   return [[j for j in to_search if j in i.split()] for i in to_be_searched]
   ...:
   ...: def regex_method(to_search, to_be_searched):
   ...:   word = re.compile(r'(\b(?:' + r'|'.join(to_search) + r')\b(?:\n)?)')
   ...:   blob = '\n'.join(to_be_searched)
   ...:   phrases = word.findall(blob)
   ...:   return [phrase.split(' ') for phrase in ' '.join(phrases).split('\n ')]
   ...:
   ...: def alternate_regex_method(to_search, to_be_searched):
   ...:   word = re.compile(r'(\b(?:' + r'|'.join(to_search) + r')\b(?:\n)?)')
   ...:   phrases = []
   ...:   for item in to_be_searched:
   ...:     phrases.append(word.findall(item))
   ...:   return phrases
   ...:

In [2]: %timeit -n 100 simple_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 23.1 µs per loop

In [3]: %timeit -n 100 regex_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 18.6 µs per loop

In [4]: %timeit -n 100 alternate_regex_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 23.4 µs per loop
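
One thing to watch with the two regex methods above: `findall` returns matches in the order they occur in the text and repeats a word that occurs twice, so 'john doe doe is last name' yields ['is', 'name'] rather than the ['name', 'is'] the question asks for. The sketch below is one possible way to normalize that, not a replacement for the timed functions. The name `regex_method_ordered` is made up for illustration, and the use of `re.escape` and `re.IGNORECASE` is an assumption carried over from the question. It still cannot report overlapping phrases such as 'is' and 'is your name' in the same text, because one alternation pattern only yields non-overlapping matches.

```python
import re

def regex_method_ordered(to_search, to_be_searched):
    # One alternation pattern for all terms; re.escape keeps regex
    # metacharacters in a term from being interpreted as syntax.
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(w) for w in to_search) + r')\b',
        re.IGNORECASE,
    )
    results = []
    for item in to_be_searched:
        # Collect the distinct terms found in this item (case-insensitively)
        found = {m.lower() for m in pattern.findall(item)}
        # Report them in the order of to_search, without duplicates
        results.append([w for w in to_search if w.lower() in found])
    return results

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']
print(regex_method_ordered(list_to_search, list_to_be_searched))
# [['my'], ['name', 'is'], ['name', 'is'], ['you'], ['name', 'is', 'your'], ['my', 'name', 'is']]
```

The extra set and reordering step costs some time, so the timings above would not carry over to this variant unchanged.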

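To see how this behaves with large input, I used the 1000 most common English words, one word per entry, as list_to_search, and the full text of David Copperfield from Project Gutenberg, one line per entry, as list_to_be_searched: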

In [5]: book = open('/tmp/copperfield.txt', 'r+')

In [6]: list_to_be_searched = [line for line in book]

In [7]: len(list_to_be_searched)
Out[7]: 38589

In [8]: words = open('/tmp/words.txt', 'r+')

In [9]: list_to_search = [word for word in words]

In [10]: len(list_to_search)
Out[10]: 1000
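
A caveat on the loading step: iterating over a file object yields each line with its trailing newline, so the lists built this way likely hold entries such as 'the\n'. With such entries, `j in i.split()` can never be true, and a `\b...\b` pattern built from them is unlikely to match, so the timings may largely measure non-matching work. Below is a minimal sketch of one way to strip the newlines first. The helper name `read_lines` and its `skip_blank` flag are hypothetical, and `io.StringIO` stands in for the real files.

```python
import io

def read_lines(handle, skip_blank=False):
    """Return the lines of an open text file without trailing newlines."""
    lines = [line.rstrip('\r\n') for line in handle]
    if skip_blank:
        # An empty search term would match at every word boundary, so drop it
        lines = [line for line in lines if line.strip()]
    return lines

# Stand-in for a word-list file with one word per line
sample = io.StringIO("the\nof\n\nand\n")
print(read_lines(sample, skip_blank=True))
# ['the', 'of', 'and']
```

For the word list, `skip_blank=True` seems the safer choice; for the book text, keeping blank lines preserves the line count.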

The results:

In [15]: %timeit -n 10 simple_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 31.9 s per loop

In [16]: %timeit -n 10 regex_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 4.28 s per loop

In [17]: %timeit -n 10 alternate_regex_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 4.43 s per loop

So if performance matters to you, either regex method will do. Hope this helps! :)

User

#2 · edited 2024-04-20 08:25:32

You can use a nested list comprehension:

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

[[j for j in list_to_search if j in i.split()] for i in list_to_be_searched]

[['my'],
 ['name', 'is'],
 ['name', 'is'],
 ['you'],
 ['name', 'is', 'your'],
 ['my', 'name', 'is']]
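
Note that the `i.split()` membership test only works for single-word terms; a phrase such as 'hello my' or 'is your name' from the question's second example is never equal to one split token. A minimal sketch for that case, assuming case-insensitive whole-phrase matching as in the question: compile one pattern per phrase up front and test each text against all of them. The name `find_phrases` is hypothetical.

```python
import re

def find_phrases(to_search, to_be_searched):
    # Compile each phrase once; re.escape treats the phrase as literal text
    compiled = [
        (phrase, re.compile(r'\b' + re.escape(phrase) + r'\b', re.IGNORECASE))
        for phrase in to_search
    ]
    # Testing phrases independently lets overlapping ones both be reported
    return [[phrase for phrase, pattern in compiled if pattern.search(text)]
            for text in to_be_searched]

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']
print(find_phrases(list_to_search, list_to_be_searched))
# [['hello my'], ['name', 'is'], ['name', 'is'], [], ['name', 'is', 'is your name', 'your'], ['name', 'is']]
```

This keeps the question's nested-loop structure, so its cost still grows with the number of phrases times the number of texts; it only avoids rebuilding each pattern on every call.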
