Matching two DataFrames by substring in Python

Posted 2024-05-28 20:49:25


I have two large DataFrames (1000 rows) that I need to match by substring. For example:

df1:

Id    Title
1     The house of pump
2     Where is Andijan
3     The Joker
4     Good bars in Andijan
5     What a beautiful house

df2:

Keyword
house
andijan
joker

The expected output is:

Id    Title                    Keyword
1     The house of pump        house
2     Where is Andijan         andijan
3     The Joker                joker
4     Good bars in Andijan     andijan
5     What a beautiful house   house

So far I have written a very inefficient way to do the matching, and at the real size of the DataFrames it runs for a very long time:

import numpy as np
df1['Keyword'] = None  # start with no match for any row
for keyword in df2['Keyword']:  # one full pass over df1 per keyword
    df1['Keyword'] = np.where(df1['Title'].str.contains(keyword, case=False), keyword, df1['Keyword'])

I'm sure there is a friendlier, more efficient way to do this that runs in a reasonable time.


2 Answers

Let's try findall:

import re
df1['new'] = df1.Title.str.findall('|'.join(df2.Keyword.tolist()), flags=re.IGNORECASE).str[0]
df1
   Id                   Title      new
0   1       The house of pump    house
1   2        Where is Andijan  Andijan
2   3               The Joker    Joker
3   4    Good bars in Andijan  Andijan
4   5  What a beautiful house    house
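
Note that findall returns the text as it is spelled in the title (Andijan, Joker), while the expected output in the question uses the spelling from df2 (andijan, joker). Below is a minimal sketch of one way to get the df2 spelling, assuming the keywords are plain lowercase strings rather than regex patterns. It swaps findall(...).str[0] for str.extract, which returns the first match directly, escapes each keyword with re.escape, and lowercases the result. The frames are rebuilt from the question's example so the snippet runs on its own; the variable name pattern and the lowercasing step are my assumptions, not part of the answer above.

```python
import re
import pandas as pd

# Example data copied from the question.
df1 = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5],
    'Title': ['The house of pump', 'Where is Andijan', 'The Joker',
              'Good bars in Andijan', 'What a beautiful house'],
})
df2 = pd.DataFrame({'Keyword': ['house', 'andijan', 'joker']})

# Escape each keyword so regex metacharacters are matched literally,
# then wrap the alternation in one capture group for str.extract.
pattern = '(' + '|'.join(re.escape(k) for k in df2['Keyword']) + ')'

# extract returns the first match per row (NaN if there is none).
# Lowercasing maps it back to the df2 spelling; this only works under
# the assumption that every keyword in df2 is already lowercase.
df1['Keyword'] = (
    df1['Title']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.lower()
)
print(df1)
```

Because the whole keyword list is compiled into a single pattern, each title is scanned once instead of once per keyword, which is where the speed-up over the loop in the question comes from.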

Building on @BENY's solution so that you can get multiple keywords per title:

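import re
regex = '|'.join(df2['Keyword'])
# all case-insensitive matches per title, as a list
matches = df1['Title'].str.findall(regex, flags=re.IGNORECASE)
# one row per match; titles without any match are dropped
matches_exploded = matches.explode().dropna().rename('Keyword')
df1.merge(matches_exploded, left_index=True, right_index=True)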
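
A self-contained sketch of the same idea may make the behaviour easier to see. The example titles "The Joker house" and "No match here" are hypothetical rows added for illustration; they are not in the question's data. The sketch also escapes the keywords with re.escape, lowercases the matches, and uses DataFrame.join with how='inner' in place of merge. All three are my assumptions about what is wanted rather than part of the answer.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'Id': [1, 2, 3],
    # The second and third titles are hypothetical, added to show a
    # title with two keywords and a title with none.
    'Title': ['The house of pump', 'The Joker house', 'No match here'],
})
df2 = pd.DataFrame({'Keyword': ['house', 'andijan', 'joker']})

# One alternation pattern over all keywords, escaped to match literally.
regex = '|'.join(re.escape(k) for k in df2['Keyword'])

matches = (
    df1['Title']
    .str.findall(regex, flags=re.IGNORECASE)  # list of matches per title
    .explode()                                # one row per match; empty lists become NaN
    .dropna()                                 # drop titles with no match
    .str.lower()                              # assumes df2 keywords are lowercase
    .rename('Keyword')
)

# Index-aligned inner join: a title is repeated once per keyword found.
result = df1.join(matches, how='inner')
print(result)
```

Here the title with Id 2 appears twice in result, once for each keyword it contains, and the title with Id 3 does not appear at all.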
