How to match strings that are in a whitelist but not in a blacklist when the two lists partially overlap

Posted 2024-05-13 21:09:23


I want to get a match if a string is in the whitelist but not in the blacklist. My problem is that the two lists can overlap. So far, the whitelist part works using

import re

whitelist = ["but"]
blacklist = ["but now"]

# Correct, I get 'this is a test but\n not really'
re.sub(r"\b(" + r"|".join(whitelist) + r")\b", "\\1\n", "this is a test but not really")

Is there an efficient way to build a regular expression from whitelist and blacklist so that I get this kind of result?

efficient_regex = f(whitelist, blacklist)
re.sub(efficient_regex, "\\1\n", "this is a test but now it does not matter")
# And not 'this is a test but\n now it does not matter'

I have been trying to work out how to do this with a regexp, but so far I have not managed to get it working.


2 Answers

I eventually found a solution that uses a single regular expression, built from a negative lookahead assertion and a negative lookbehind assertion:

whitelist = ["but", "however", "and yet"]
blacklist = ["but now", "anything but", "but it", "but they", "however it", "however they"]

# Can be combined into a single regex
import re
regex = re.compile(r"((?<!anything )but(?! now| it| they)|however(?! it| they)|and yet)")

The substitution can then be done with just that one regular expression:

>>> regex.sub("****", "this is a test but not really")
'this is a test **** not really'

>>> regex.sub("****", "this is a test but now it does not matter")
'this is a test but now it does not matter'

It should be possible to generate the regex from whitelist and blacklist, but I have not tried that yet.
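As a rough illustration of how that generation step might look, here is a minimal sketch. The helper name build_regex is hypothetical and not part of the original answer. It assumes both lists contain plain strings (not regex fragments) and that the whitelist is non-empty. For every blacklist phrase that contains a whitelist term, it places a negative lookahead in front of the term; when the phrase has text before the term, that prefix is checked with a lookbehind nested inside the lookahead.

```python
import re

def build_regex(whitelist, blacklist):
    """Hypothetical helper: compile one pattern that matches whitelist
    terms except where they occur as part of a blacklist phrase."""
    alternatives = []
    # Longer terms first, so multi-word terms are tried before shorter ones.
    for term in sorted(whitelist, key=len, reverse=True):
        guards = []
        for phrase in blacklist:
            # Visit every occurrence of the term inside the blacklist phrase.
            start = phrase.find(term)
            while start != -1:
                prefix = phrase[:start]
                suffix = phrase[start + len(term):]
                # The prefix is a fixed string, so it is a valid lookbehind.
                behind = f"(?<={re.escape(prefix)})" if prefix else ""
                # Reject this position if the whole phrase is present here.
                guards.append(f"(?!{behind}{re.escape(term + suffix)})")
                start = phrase.find(term, start + 1)
        alternatives.append("".join(guards) + re.escape(term))
    # One capturing group, so "\\1" in a replacement refers to the match.
    return re.compile(r"\b(" + "|".join(alternatives) + r")\b")

whitelist = ["but", "however", "and yet"]
blacklist = ["but now", "anything but", "but it", "but they",
             "however it", "however they"]
regex = build_regex(whitelist, blacklist)

print(regex.sub("****", "this is a test but not really"))
print(regex.sub("****", "this is a test but now it does not matter"))
```

Like the hand-written pattern above, this sketch does not require a word boundary at the end of a blacklist phrase, so "but it" also blocks "but item".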

You could try something like this:

import re

str_list = ['this is a test but not really',
            'this is a test but now it does not matter',
            'now but', 'but but but', 'but now but now']

blacklist_words = ['but now']
whitelist_words = ['but']

# building regex pattern
blacklist = re.compile('|'.join([re.escape(word) for word in blacklist_words]))
whitelist = re.compile('|'.join([re.escape(word) for word in whitelist_words]))

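# keep only strings that contain a whitelist term and no blacklist phrase
whitelisted_strs = [word for word in str_list
                    if not blacklist.search(word) and whitelist.search(word)]

print(whitelisted_strs)
# ['this is a test but not really', 'now but', 'but but but']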
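The code above filters whole strings, whereas the question asks for a substitution. One possible way to bridge the two, sketched here under assumptions, is to put the blacklist phrases first in a single alternation and pass a callback to re.sub that only rewrites whitelist matches. The function name sub_whitelisted and its repl parameter are hypothetical, and the sketch assumes both lists are non-empty and contain plain strings.

```python
import re

def sub_whitelisted(text, whitelist, blacklist, repl):
    """Hypothetical helper: apply repl to whitelist terms, leaving any
    occurrence that is part of a blacklist phrase untouched."""
    # Longest entries first, so longer phrases win over their own prefixes.
    black = "|".join(re.escape(p) for p in sorted(blacklist, key=len, reverse=True))
    white = "|".join(re.escape(t) for t in sorted(whitelist, key=len, reverse=True))
    # Blacklist comes first: a blacklisted phrase is consumed whole, so the
    # whitelist term inside it is never seen as a separate match.
    pattern = re.compile(rf"\b(?:(?P<black>{black})|(?P<white>{white}))\b")

    def replace(match):
        if match.group("white") is not None:
            return repl(match.group("white"))
        return match.group(0)  # blacklisted phrase: return it unchanged

    return pattern.sub(replace, text)

def add_newline(term):
    # Mirrors the "\\1\n" replacement from the question.
    return term + "\n"

print(repr(sub_whitelisted("this is a test but not really",
                           ["but"], ["but now"], add_newline)))
print(repr(sub_whitelisted("this is a test but now it does not matter",
                           ["but"], ["but now"], add_newline)))
```

Because matching proceeds left to right, this only protects a whitelist term when the blacklist phrase starts at or before the term. A blacklist phrase that begins in the middle of a whitelist term is not detected by this sketch.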
