How do I select strings that contain keywords from two different lists?

Posted 2024-04-23 18:00:05

Suppose I have a list of strings

"Measles outbreak in the U.S worse than ever"
"MMR vaccination rates in California at all time low"
"I don't live in California"

and two lists of keywords

location = ['California', 'West Coast', 'Los Angeles']
disease = ['Measles', 'MMR', 'Pertussis']

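How can I pick out the strings that contain at least one keyword from both disease and location?

For example, the second string should be selected, but not the first or the last.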


3 answers

Assuming strings is defined as a list containing the strings to search:

location = set(['California', 'West Coast', 'Los Angeles'])
disease = set(['Measles', 'MMR', 'Pertussis'])

res = [s for s in strings if set(s.split()) & location and set(s.split()) & disease]
print(res)

This does what is needed. Note that set(s.split()) is evaluated twice per string here; it should be factored out.
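
The repeated set(s.split()) can be factored out with a small helper. The sketch below is one way to do it, not the answerer's code: the helper name matches_both and the sample strings list (copied from the question) are assumptions for illustration. Because it compares whole whitespace-separated tokens, multi-word keywords such as 'West Coast' can never match with this approach.

```python
location = {'California', 'West Coast', 'Los Angeles'}
disease = {'Measles', 'MMR', 'Pertussis'}

# Sample input taken from the question.
strings = [
    "Measles outbreak in the U.S worse than ever",
    "MMR vaccination rates in California at all time low",
    "I don't live in California",
]

def matches_both(s):
    """Return True if s has at least one whole-word keyword from each set."""
    words = set(s.split())  # split once, reuse for both checks
    return bool(words & location) and bool(words & disease)

res = [s for s in strings if matches_both(s)]
print(res)  # ['MMR vaccination rates in California at all time low']
```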

import re

strings = [
    "Measles outbreak in the U.S worse than ever.",
    "MMR vaccination rates in Los Angeles at all time low.",
    "I don't live in California.",
    "The West Coast has many cases of Pertussis.",
    "Do Californians even get Measles?",
]

kw_sets = [
    ["California", "West Coast", "Los Angeles"],
    ["Measles", "MMR", "Pertussis"],
]

# One alternation of escaped, word-bounded keywords per keyword list.
patterns = ('|'.join(r'\b{}\b'.format(re.escape(kw)) for kw in kw_set)
    for kw_set in kw_sets)
compiled_patterns = [re.compile(pattern) for pattern in patterns]
# Keep a string only if every keyword list matches somewhere in it.
filterfunc = lambda s: all(cp.search(s) for cp in compiled_patterns)
filtered_strings = list(filter(filterfunc, strings))

print(*filtered_strings, sep='\n')

The above is a regular-expression solution for Python 3.x.

Output:

MMR vaccination rates in Los Angeles at all time low.
The West Coast has many cases of Pertussis.

Create sets of the locations and diseases, split each string into words, and check whether words from the split string appear in both sets.

location = {'California', 'West Coast', 'Los Angeles'}
disease = {'Measles', 'MMR', 'Pertussis'}

l = ['West Coast MMR',"Measles outbreak in the U.S worse than ever","MMR vaccination rates in California at all time low","I don't live in California"]

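import re

# Regex over the location names, so the two-word names can match too.
r = re.compile("West Coast|Los Angeles|California")

# Variant 1: test each word of the string against the disease set.
for s in l:
    if r.search(s) and any(word in disease for word in s.split()):
        print(s)

# Variant 2: the same check written as a set intersection.
for s in l:
    if r.search(s) and disease.intersection(s.split()):
        print(s)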

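The condition if location.intersection(spl) and disease.intersection(spl): is true only when at least one word of the string appears in each of the two sets, where spl is the list of words obtained by splitting the string. r.search(s) catches the two-word substrings from location.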
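
The set-only check can be sketched as follows. This is a reconstruction under assumptions rather than the answerer's exact code: spl is taken to be s.split(), and the list name selected is made up for the example. It also shows why the regex is needed, since 'West Coast MMR' is missed when only single words are compared.

```python
location = {'California', 'West Coast', 'Los Angeles'}
disease = {'Measles', 'MMR', 'Pertussis'}

l = ['West Coast MMR',
     "Measles outbreak in the U.S worse than ever",
     "MMR vaccination rates in California at all time low",
     "I don't live in California"]

selected = []
for s in l:
    spl = s.split()  # individual words of the string
    # True only if some word is in location AND some word is in disease.
    if location.intersection(spl) and disease.intersection(spl):
        selected.append(s)

print(selected)  # ['MMR vaccination rates in California at all time low']
```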

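Depending on the mix of entries in your actual location list, a combined set-and-regex approach may be the fastest: check the sets first, then fall back to or r.search(s), with the regex compiled to match only the multi-word substrings.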
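
One way to read the set-and-regex suggestion is sketched below. The split of the location keywords into single_word and a multi_word regex, and the helper name has_location, are assumptions for illustration. Whether this is actually faster depends on the data and would need measuring.

```python
import re

disease = {'Measles', 'MMR', 'Pertussis'}
# Assumed split of the location keywords by word count.
single_word = {'California'}
multi_word = re.compile(r"West Coast|Los Angeles")

l = ['West Coast MMR',
     "Measles outbreak in the U.S worse than ever",
     "MMR vaccination rates in California at all time low",
     "I don't live in California"]

def has_location(s, words):
    # Cheap set lookup first; the regex runs only if no single word matched.
    return bool(single_word.intersection(words)) or bool(multi_word.search(s))

selected = []
for s in l:
    words = s.split()
    if has_location(s, words) and disease.intersection(words):
        selected.append(s)

print(selected)
```

With the sample list, both 'West Coast MMR' and the California string are selected, because the regex covers the two-word location that the set lookup cannot see.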

You may also want to use word boundaries so that words such as Californian do not match.
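
A small sketch of the difference word boundaries make. The pattern below is one possible spelling of the idea, not necessarily the pattern the answerer used, and the names loose and strict are made up for the comparison.

```python
import re

loose = re.compile(r"West Coast|Los Angeles|California")
# \b requires a word boundary on both sides of the alternation.
strict = re.compile(r"\b(?:West Coast|Los Angeles|California)\b")

s = "Do Californians even get Measles?"
print(bool(loose.search(s)))   # True: matches inside "Californians"
print(bool(strict.search(s)))  # False: "n" follows "California", so no boundary
print(bool(strict.search("I don't live in California.")))  # True
```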

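Depending on what other words can appear, you may need further adjustments. Without knowing your actual data set, it is not possible to give a definitive or optimal answer.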
