Matching URL patterns between two DataFrames

Posted 2024-06-07 06:48:31

I have two DataFrames like this:

import pandas as pd

d1 = {'Domain': ['amazon.com', 'apple.com', 'amazon.com','xyz.com'], 'Pattern': ['kindle','music','subscribe-and-save',''],'Other Important Info':['a','b','c','d']}
df1 = pd.DataFrame(d1)

d2 = {'Domain': ['google.com','google.com','amazon.com','amazon.com', 'youtube.com', 'amazon.com'], 'Url': ['https://google.com/kindle','https://google.com/','https://amazon.com/subscribe-and-save','https://amazon.com/abc','https://youtube.com/music','https:amazon.com/kindle']}
df2 = pd.DataFrame(d2)

The main aim is to merge the two DataFrames on 'Domain', keeping only the rows where 'Pattern' occurs in 'Url'.

So the result should be the following DataFrame:

{'Domain':['amazon.com','amazon.com'],'Url':['https://amazon.com/subscribe-and-save','https:amazon.com/kindle'],'Other Important Info':['c','a']}

This is how I am doing it now:

def lookup_table(value, df):
    out = None
    list_items = df['Pattern'].tolist()
    for item in list_items:
        if item in value:
            out = item
            break
    return out

df2['Pattern'] = df2['Url'].apply(lambda x: lookup_table(x, df1[df1['Pattern']!='']))

merged = pd.merge(df2[df2['Pattern'].notnull()], df1[df1['Pattern']!=''],on=['Domain','Pattern'],how='left')

However, because of the for loop, the lookup_table function takes too long to run.

How can I make this faster? I am using Python 2 on Windows.

1 Answer
User
Answer 1 · Posted 2024-06-07 06:48:31

df1

       Domain             Pattern Other Important Info
0  amazon.com              kindle                    a
1   apple.com               music                    b
2  amazon.com  subscribe-and-save                    c
3     xyz.com                                        d

df2

        Domain                                    Url
0   google.com              https://google.com/kindle
1   google.com                    https://google.com/
2   amazon.com  https://amazon.com/subscribe-and-save
3   amazon.com                 https://amazon.com/abc
4  youtube.com              https://youtube.com/music
5   amazon.com                https:amazon.com/kindle

The main aim is to merge the two dataframes based on the 'Domain' and also when 'Pattern' is in 'Url'.

df = df1.merge(df2, on='Domain')
df.loc[df.apply(lambda x: x.Pattern in x.Url, axis=1)]

Output

       Domain             Pattern Other Important Info  \
2  amazon.com              kindle                    a   
3  amazon.com  subscribe-and-save                    c   

                                     Url  
2                https:amazon.com/kindle  
3  https://amazon.com/subscribe-and-save  
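
If the row-wise apply is still too slow on a large frame, one possible variation is to build the boolean mask with a plain list comprehension over the two columns. The sketch below is only an assumption about what might help, not a measured result: the names patterns, candidates, mask and result are made up for illustration, and it drops the empty pattern up front because '' is a substring of every URL.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Domain': ['amazon.com', 'apple.com', 'amazon.com', 'xyz.com'],
    'Pattern': ['kindle', 'music', 'subscribe-and-save', ''],
    'Other Important Info': ['a', 'b', 'c', 'd'],
})
df2 = pd.DataFrame({
    'Domain': ['google.com', 'google.com', 'amazon.com',
               'amazon.com', 'youtube.com', 'amazon.com'],
    'Url': ['https://google.com/kindle', 'https://google.com/',
            'https://amazon.com/subscribe-and-save', 'https://amazon.com/abc',
            'https://youtube.com/music', 'https:amazon.com/kindle'],
})

# Drop empty patterns first: '' is contained in every string,
# so it would otherwise match every URL of its domain.
patterns = df1[df1['Pattern'] != '']

# The inner join on Domain pairs each URL only with patterns of the same domain.
candidates = patterns.merge(df2, on='Domain')

# Substring test per row, done as a plain Python loop over two columns
# instead of DataFrame.apply(axis=1), which builds a Series for every row.
mask = [p in u for p, u in zip(candidates['Pattern'], candidates['Url'])]

result = candidates[mask].reset_index(drop=True)
print(result)
```

The merge still creates one row per (pattern, URL) pair within a domain, so memory use grows with the number of patterns per domain; whether this is actually faster than the apply version would need to be timed on the real data.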
