Finding matching similar keywords in a Python DataFrame

Published 2024-06-16 14:25:39


joined_Gravity1.head()
Comments
____________________________________________________
0   Why the old Pike/Lyrik?
1   This is good
2   So clean
3   Looks like a Decoy
Input: type(joined_Gravity1)
Output: pandas.core.frame.DataFrame

The code below lets me select the strings that contain the keyword "ender":

joined_Gravity1[joined_Gravity1["Comments"].str.contains("ender", na=False)]

Output:

Comments
___________________________
194 We need a new Sender 😂
7   What about the sender
179 what about the sender?😏

How can I modify the code so that it also matches words similar to "sender", such as "snder" or "bnder"?


3 Answers
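from difflib import get_close_matches

def closeMatches(patterns, word):
    # Print the candidates in `patterns` that are close to `word`
    print(get_close_matches(word, patterns))

# get_close_matches expects a list of strings, so split the comments into words
list_patterns = joined_Gravity1["Comments"].dropna().str.split().explode().dropna().tolist()

word = 'Sender'
patterns = list_patterns
closeMatches(patterns, word)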

I don't see any reason why contains with regex=True wouldn't work here:

joined_Gravity1[joined_Gravity1["Comments"].str.contains(pat="ender|snder|bnder", na=False, regex=True)]

I only used "ender|snder|bnder". You can put all such words in a list, say list_words, and pass pat='|'.join(list_words) to the contains function above.
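
The pattern-building step described above can be sketched with the standard library alone. This is a minimal illustration, not the answerer's exact code: list_words and comments are made-up sample values, and re.escape is an added precaution (an assumption, not part of the original answer) so that any regex metacharacters in the words are matched literally. The resulting string can be passed as pat to str.contains with regex=True.

```python
import re

# Hypothetical sample data, for illustration only
list_words = ["ender", "snder", "bnder"]
comments = ["We need a new Sender", "can i be the snder?", "So clean", "a bnder here"]

# Escape each word, then join everything into one alternation pattern
pat = "|".join(re.escape(w) for w in list_words)

# Keep the comments in which the pattern is found anywhere in the string
matched = [c for c in comments if re.search(pat, c)]

print(pat)
print(matched)
```

Note that the pattern has to list every misspelling explicitly, so this approach only works when the variants are known in advance.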

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html

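Words like these can have a huge number of possible letter combinations. What you are trying to do is fuzzy matching between two strings. I suggest the following approach: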

#!pip install fuzzywuzzy
from fuzzywuzzy import fuzz, process

word = 'sender'
others = ['bnder', 'snder', 'sender', 'hello']

process.extractBests(word, others)
[('sender', 100), ('snder', 91), ('bnder', 73), ('hello', 18)]

Based on this, you can choose a threshold and then mark the words that score above it as matches (using the code you used above).
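
Choosing a threshold can also be tried without a third-party package. The sketch below swaps in the standard library's difflib.get_close_matches, whose cutoff is on a 0 to 1 scale rather than fuzzywuzzy's 0 to 100. It is a stand-in for illustration only, and its scores are not guaranteed to equal fuzzywuzzy's in general.

```python
from difflib import get_close_matches

word = "sender"
others = ["bnder", "snder", "sender", "hello"]

# cutoff=0.7 roughly corresponds to a threshold of 70 on a 0-100 scale
loose = get_close_matches(word, others, n=len(others), cutoff=0.7)

# A higher cutoff keeps only the nearest spellings
strict = get_close_matches(word, others, n=len(others), cutoff=0.8)

print(loose)
print(strict)
```

With these sample words, "bnder" passes the looser cutoff but is dropped by the stricter one, which is the trade-off to weigh when picking a threshold.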

Here is a way to do this for your problem statement with a single function:

import pandas as pd

df = pd.DataFrame(['hi there i am a sender', 
                   'I dont wanna be a bnder', 
                   'can i be the snder?', 
                   'i think i am a nerd'], columns=['text'])

#s = sentence, w = match word, t = match threshold
def get_match(s,w,t):
    ss = process.extractBests(w,s.split())
    return any([i[1]>t for i in ss])

#What it's doing - Match each word in each row of df.text against
#the word sender and see if any of the words has a match ratio greater
#than the threshold of 70.
df['match'] = df['text'].apply(get_match, w='sender', t=70)
print(df)

                      text  match
0   hi there i am a sender   True
1  I dont wanna be a bnder   True
2      can i be the snder?   True
3      i think i am a nerd  False

Raise t from 70 to 80 if you want stricter matches, or lower it below 70 if you want more relaxed matches.
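
The effect of t can be seen on the sample sentences. This sketch uses difflib.SequenceMatcher from the standard library as a stand-in scorer (an assumption made so the example has no third-party dependency; fuzzywuzzy's scores may differ), and has_close_word is a hypothetical helper, not part of any library.

```python
from difflib import SequenceMatcher

sentences = [
    "hi there i am a sender",
    "I dont wanna be a bnder",
    "can i be the snder?",
    "i think i am a nerd",
]

def has_close_word(sentence, target, threshold):
    """Return True if any word scores above `threshold` (0-100) against `target`."""
    for raw in sentence.lower().split():
        token = raw.strip("?!.,;:")  # drop surrounding punctuation, e.g. "snder?"
        score = 100 * SequenceMatcher(None, token, target).ratio()
        if score > threshold:
            return True
    return False

# Same sentences, two thresholds
at_70 = [has_close_word(s, "sender", 70) for s in sentences]
at_80 = [has_close_word(s, "sender", 80) for s in sentences]

print(at_70)
print(at_80)
```

Under this scorer the sentence containing "bnder" matches at 70 but not at 80, so a higher t trades recall for precision.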

Finally, you can filter for the matching rows:

df[df['match']==True][['text']]
                      text
0   hi there i am a sender
1  I dont wanna be a bnder
2      can i be the snder?
