Check whether each value in a DataFrame column contains a word from another DataFrame column

import pandas as pd

a = pd.DataFrame({'text': ['the cat jumped over the hat',
                           'the pope pulled on the rope',
                           'i lost my dog in the fog']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})

a
                          text
0  the cat jumped over the hat
1  the pope pulled on the rope
2     i lost my dog in the fog

b
  dirty_words
0         cat
1         dog
2    parakeet

3 Answers

User

Answer 1 · edited 2024-05-15 17:20:07

Use regex matching with str.contains.

# Group the alternation so that \b applies to every word, not just the first and last
p = '|'.join(b['dirty_words'].dropna())
a[a['text'].str.contains(r'\b(?:{})\b'.format(p))]

                          text
0  the cat jumped over the hat
2     i lost my dog in the fog

The word boundaries ensure that "catch" is not matched just because it contains "cat" (thanks @DSM).
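To see how the word boundaries interact with the alternation, here is a minimal sketch using only the standard `re` module. The sample strings and the names `grouped` and `ungrouped` are assumptions made for illustration; they are not part of the original answer.

```python
import re

words = ['cat', 'dog', 'parakeet']

# Escape each word and wrap the alternation in a non-capturing group,
# so that \b applies to every alternative.
grouped = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, words))))

# Without the group the pattern reads as: \bcat  OR  dog  OR  parakeet\b
ungrouped = re.compile(r'\b{}\b'.format('|'.join(words)))

samples = ['the cat jumped', 'catch the ball', 'hotdogs for lunch']
for s in samples:
    # Print whether each pattern finds a match in the sample string
    print(s, bool(grouped.search(s)), bool(ungrouped.search(s)))
```

With the group in place, only the first sample matches. The ungrouped pattern also matches "catch" and "hotdogs", which is what the word boundaries are meant to prevent.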

User

Answer 2 · edited 2024-05-15 17:20:07

I think you can use isin after str.split.

a[pd.DataFrame(a.text.str.split().tolist()).isin(b.dirty_words.tolist()).any(axis=1)]
Out[380]: 
                          text
0  the cat jumped over the hat
2     i lost my dog in the fog
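The same idea can be spelled out step by step. This is a sketch rather than the answerer's exact code: it assumes `str.split(expand=True)` as a shortcut for building the token table, and the names `tokens` and `mask` are introduced here for illustration.

```python
import pandas as pd

a = pd.DataFrame({'text': ['the cat jumped over the hat',
                           'the pope pulled on the rope',
                           'i lost my dog in the fog']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})

# One column per token; shorter rows are padded with missing values
tokens = a['text'].str.split(expand=True)

# True wherever a token equals one of the dirty words,
# then collapse each row to a single boolean
mask = tokens.isin(b['dirty_words'].tolist()).any(axis=1)

print(a[mask])
```

Because whole tokens are compared for equality, a word such as "catch" does not count as a match for "cat".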

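User

Answer 3 · edited 2024-05-15 17:20:07

You can use a list comprehension with any after splitting the strings on whitespace. This approach will not match "catheter" just because it contains "cat".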

mask = [any(i in words for i in b['dirty_words'].values) \
        for words in a['text'].str.split().values]

print(a[mask])

                          text
0  the cat jumped over the hat
2     i lost my dog in the fog
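A possible variant of this approach, sketched under the assumption that a set lookup is acceptable: put the dirty words in a `set` and test each row's tokens against it with `isdisjoint`. The extra "catheter" row and the name `dirty` are hypothetical additions for illustration.

```python
import pandas as pd

a = pd.DataFrame({'text': ['the cat jumped over the hat',
                           'the pope pulled on the rope',
                           'i lost my dog in the fog',
                           'the catheter was replaced']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})

# Set of dirty words for fast membership tests
dirty = set(b['dirty_words'].dropna())

# A row is flagged when at least one of its whitespace-separated tokens is a dirty word
mask = [not dirty.isdisjoint(text.split()) for text in a['text']]

print(a[mask])
```

Note that splitting on whitespace leaves punctuation attached to tokens, so a token like "cat," would not equal "cat"; the regex approach may be a better fit for punctuated text.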
