# `strings` is the list of input strings to filter
location = set(['California', 'West Coast', 'Los Angeles'])
disease = set(['Measles', 'MMR', 'Pertussis'])
res = [s for s in strings if (set(s.split()) & location and set(s.split()) & disease)]
print(res)
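The comprehension above evaluates `set(s.split())` twice for every string. A minimal sketch that computes it once might look like the following; the helper name `matches_both` and the sample `strings` are illustrative assumptions, not part of the original answer. Note that whitespace splitting only yields single words, so multi-word keywords such as `West Coast` can never match this way.

```python
location = {'California', 'West Coast', 'Los Angeles'}
disease = {'Measles', 'MMR', 'Pertussis'}

def matches_both(s):
    """Return True if s shares a whole word with both keyword sets."""
    words = set(s.split())  # split once, reuse for both checks
    return bool(words & location) and bool(words & disease)

# Hypothetical sample input, for illustration only.
strings = [
    "MMR vaccination rates in California at all time low",
    "I don't live in California",
    "West Coast MMR",  # not matched: split() breaks 'West Coast' into two words
]
res = [s for s in strings if matches_both(s)]
print(res)
```

Pulling the check into a small function keeps the comprehension readable and makes the single split explicit.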
import re
strings = [
    "Measles outbreak in the U.S worse than ever.",
    "MMR vaccination rates in Los Angeles at all time low.",
    "I don't live in California.",
    "The West Coast has many cases of Pertussis.",
    "Do Californians even get Measles?",
]
# Each inner list is one keyword group; a string must match every group.
kw_sets = [
    ["California", "West Coast", "Los Angeles"],
    ["Measles", "MMR", "Pertussis"],
]
# One alternation per group; each keyword is escaped and wrapped in word boundaries.
patterns = ('|'.join(r'\b{}\b'.format(re.escape(kw)) for kw in kw_set)
            for kw_set in kw_sets)
compiled_patterns = [re.compile(pattern) for pattern in patterns]
# Keep a string only if every group's pattern is found in it.
filterfunc = lambda s: all(cp.search(s) for cp in compiled_patterns)
filtered_strings = list(filter(filterfunc, strings))
print(*filtered_strings, sep='\n')
import re
location = {'California', 'West Coast', 'Los Angeles'}
disease = {'Measles', 'MMR', 'Pertussis'}
l = ['West Coast MMR', "Measles outbreak in the U.S worse than ever", "MMR vaccination rates in California at all time low", "I don't live in California"]
# The regex catches multi-word locations that s.split() would break apart.
r = re.compile("West Coast|Los Angeles|California")
# Variant 1: test each word for membership in the disease set.
for s in l:
    if r.search(s) and any(word in disease for word in s.split()):
        print(s)
# Variant 2: the same check using set intersection.
for s in l:
    if r.search(s) and disease.intersection(s.split()):
        print(s)
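The pattern `West Coast|Los Angeles|California` has no word boundaries, so it also matches inside longer words such as `Californians`. A small sketch of the difference, where the names `plain` and `bounded` are illustrative:

```python
import re

plain = re.compile("West Coast|Los Angeles|California")
# Same alternatives, but anchored on word boundaries at both ends.
bounded = re.compile(r"\b(?:West Coast|Los Angeles|California)\b")

s = "Do Californians even get Measles?"
print(bool(plain.search(s)))    # True: 'California' is found inside 'Californians'
print(bool(bounded.search(s)))  # False: no boundary after 'California'
print(bool(bounded.search("I don't live in California.")))  # True: '.' ends the word
```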
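The list comprehension in the first snippet does what is needed. Note that the `set(s.split())` operation is performed twice per string and should be factored out into a single computation.
The second snippet is a regex solution for Python 3.x.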
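Output: with the sample `strings`, it prints only the Los Angeles and West Coast sentences, the two that contain both a location and a disease keyword.
The third approach: create the location and disease sets, split each string into words, and check whether words from the split string appear in both sets.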
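`if location.intersection(spl) and disease.intersection(spl):` is true only when the split string `spl` (presumably the result of `s.split()`) shares at least one word with each of the two sets. `r.search(s)` catches the two-word substrings from `location`.
Depending on the actual mix of entries in your `location` list, a combined set-and-regex approach may be the fastest: check the set first, then fall back to `or r.search(s)`, with the regex compiled to match only the multi-word substrings.
You may also want to use word boundaries (`\b`) so that words such as `Californian` do not match. Depending on what other words can appear, you may need further adjustments. Without knowing your actual data set, it is impossible to give a definitive or optimal answer.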
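The set-first, regex-fallback idea described above could be sketched as follows. This is only one possible reading of it: the names `single_word`, `multi_re`, `has_location`, and `matches` are assumptions introduced for illustration, and whitespace splitting does not strip punctuation, so real data may need extra normalisation.

```python
import re

location = {'California', 'West Coast', 'Los Angeles'}
disease = {'Measles', 'MMR', 'Pertussis'}

# Single-word locations can be tested by set membership;
# only the multi-word ones need a regex.
single_word = {kw for kw in location if ' ' not in kw}
multi_word = sorted(kw for kw in location if ' ' in kw)
# Assumes at least one multi-word keyword; an empty alternation would match anywhere.
multi_re = re.compile(r"\b(?:{})\b".format("|".join(re.escape(kw) for kw in multi_word)))

def has_location(s, words):
    # Cheap set check first, regex only as a fallback.
    return bool(single_word & words) or multi_re.search(s) is not None

def matches(s):
    words = set(s.split())  # split once, reuse for both checks
    return has_location(s, words) and bool(disease & words)

l = ['West Coast MMR',
     "Measles outbreak in the U.S worse than ever",
     "MMR vaccination rates in California at all time low",
     "I don't live in California"]
for s in l:
    if matches(s):
        print(s)
```

Whether this beats a single compiled regex depends on how many keywords are single words and how long the strings are, so it is worth timing on the real data.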