Search a different set of keywords for each ID and check whether a string contains any of them

Published 2024-04-19 10:35:42


I have two DataFrames. The first one:

   IDs    Keywords
0  1234   APPLE ABCD
1  1234   ORANGE
2  1234   LEMONS
3  5346   ORANGE
4  5346   STRAWBERRY
5  5346   BLUEBERRY
6  8793   TEA COFFEE

The second DataFrame:

   IDs    Name         
0  1234   APPLE ABCD ONE
1  5346   APPLE ABCD   
2  1234   STRAWBERRY YES 
3  8793   ORANGE AVAILABLE  
4  8793   TEA AVAILABLE
5  8793   TEA COFFEE

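I want to look up the keywords for each ID and use them to search the Name column of the second DataFrame. If a name contains any keyword that belongs to the same ID, the indicator should be True; otherwise it should be False.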

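For example, for ID 1234 the keywords are APPLE ABCD, ORANGE, and LEMONS. So in the second DataFrame, the row at index 0 (APPLE ABCD ONE) is True, because its name contains the keyword "APPLE ABCD".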

For ID 5346 the keywords are ORANGE, STRAWBERRY, and BLUEBERRY. So in the second DataFrame, the row at index 1 (APPLE ABCD) is False.

   IDs    Name               Indicator
0  1234   APPLE ABCD ONE     True
1  5346   APPLE ABCD         False
2  1234   STRAWBERRY YES     False
3  8793   ORANGE AVAILABLE   False
4  8793   TEA AVAILABLE      False
5  8793   TEA COFFEE         True
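
To make the example reproducible, the two frames can be rebuilt from the tables above. This is a minimal sketch: the variable names `df1`, `df2`, `keywords_by_id`, and `expected` are assumptions, not part of the question. It also computes the expected indicator with a plain Python loop, treating each keyword as a literal substring.

```python
import pandas as pd

# First frame: keywords per ID (values copied from the question)
df1 = pd.DataFrame({
    "IDs": [1234, 1234, 1234, 5346, 5346, 5346, 8793],
    "Keywords": ["APPLE ABCD", "ORANGE", "LEMONS", "ORANGE",
                 "STRAWBERRY", "BLUEBERRY", "TEA COFFEE"],
})

# Second frame: names to be checked (values copied from the question)
df2 = pd.DataFrame({
    "IDs": [1234, 5346, 1234, 8793, 8793, 8793],
    "Name": ["APPLE ABCD ONE", "APPLE ABCD", "STRAWBERRY YES",
             "ORANGE AVAILABLE", "TEA AVAILABLE", "TEA COFFEE"],
})

# Collect the keywords of each ID into a list: {1234: [...], 5346: [...], ...}
keywords_by_id = df1.groupby("IDs")["Keywords"].apply(list).to_dict()

# A row is True if its name contains any keyword registered for the same ID
expected = [
    any(kw in name for kw in keywords_by_id.get(id_, []))
    for id_, name in zip(df2["IDs"], df2["Name"])
]
print(expected)  # [True, False, False, False, False, True]
```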

3 answers

You can use merge, followed by groupby with a lambda, as follows:

>>> df.merge(df2).groupby(['IDs','Name']).apply(lambda x: any(x['Name'].str.contains('|'.join(x['Keywords'])))).rename('Indicator').reset_index()
    IDs              Name  Indicator
0  1234        APPLE ABCD       True
1  1234    STRAWBERRY YES      False
2  5346        APPLE ABCD      False
3  8793  ORANGE AVAILABLE      False
4  8793     TEA AVAILABLE       True
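
The merge idea can also be written so that every row of the second frame keeps its original position and keywords are compared as plain substrings instead of being joined into a regular expression. The following is a sketch under assumptions, not the answerer's code: the names `merged` and `hit` are hypothetical, and the frames are rebuilt from the question's tables.

```python
import pandas as pd

df = pd.DataFrame({
    "IDs": [1234, 1234, 1234, 5346, 5346, 5346, 8793],
    "Keywords": ["APPLE ABCD", "ORANGE", "LEMONS", "ORANGE",
                 "STRAWBERRY", "BLUEBERRY", "TEA COFFEE"],
})
df2 = pd.DataFrame({
    "IDs": [1234, 5346, 1234, 8793, 8793, 8793],
    "Name": ["APPLE ABCD ONE", "APPLE ABCD", "STRAWBERRY YES",
             "ORANGE AVAILABLE", "TEA AVAILABLE", "TEA COFFEE"],
})

# reset_index() keeps the original row label in a column named "index",
# so each (name, keyword) pair can be traced back to its row in df2.
merged = df2.reset_index().merge(df, on="IDs", how="left")

# One boolean per (name, keyword) pair; the isinstance guard covers IDs
# that have no keywords (a left merge leaves NaN in "Keywords" there).
merged["hit"] = [
    isinstance(kw, str) and kw in name
    for kw, name in zip(merged["Keywords"], merged["Name"])
]

# Collapse back to one value per original row; assignment aligns on the index
df2["Indicator"] = merged.groupby("index")["hit"].any()
print(df2)
```

With the question's data, the `Indicator` column comes out as True, False, False, False, False, True.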

You need:

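# Create a list of (ID, keyword) tuples from the first DataFrame
kw = list(zip(df1.IDs, df1.Keywords))

def func(ids, name):
    # Only the first word of the name is compared against the keywords
    return (ids, name.split(" ")[0]) in kw

# This answer calls the name column 'Names' (the question's frame calls it 'Name')
df2['Indicator'] = df2.apply(lambda x: func(x['IDs'], x['Names']), axis=1)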

Edit

Create a list of tuples that pairs each ID with its keywords:

kw = list(zip(df1.IDs, df1.Keywords))
# [(1234, 'APPLE ABCD'), (1234, 'ORANGE'), (1234, 'LEMONS'), (5346, 'ORANGE'), (5346, 'STRAWBERRY'), (5346, 'BLUEBERRY'), (8793, 'TEA COFFEE')]

unique_kw = list(df1['Keywords'].unique())
# ['APPLE ABCD', 'ORANGE', 'LEMONS', 'STRAWBERRY', 'BLUEBERRY', 'TEA COFFEE']

def samp(x):
    for u in unique_kw:
        if u in x:
            return u
    return None

# Find the first known keyword contained in each name; it is used for the comparison below
df2['indicator'] = df2['Names'].apply(samp)

# True only if that keyword is registered for the same ID
df2['indicator'] = df2.apply(lambda x: (x['IDs'], x['indicator']) in kw, axis=1)

Output:

    IDs     Names               indicator
0   1234    APPLE ABCD ONE      True
1   5346    APPLE ABCD          False
2   1234    NO STRAWBERRY YES   False
3   8793    ORANGE AVAILABLE    False
4   8793    TEA AVAILABLE       False
5   8793    TEA COFFEE          True

You can do this mostly with pandas operations, which should also be more efficient.

# Let there be two DataFrames: kw_df, name_df

# Group all keywords of each ID in a list, associate it with the names
kw_df = kw_df.groupby('IDs').aggregate({'Keywords': list})
merge_df = name_df.join(kw_df, on='IDs')

# Check if any keyword is in the name
def is_match(name, kws):
    return any(kw in name for kw in kws)

merge_df['Indicator'] = merge_df.apply(lambda row: is_match(row['Name'], row['Keywords']), axis=1)
print(merge_df)

This produces the following output:

    IDs              Name                         Keywords  Indicator
0  1234    APPLE ABCD ONE     [APPLE ABCD, ORANGE, LEMONS]       True
1  5346        APPLE ABCD  [ORANGE, STRAWBERRY, BLUEBERRY]      False
2  1234    STRAWBERRY YES     [APPLE ABCD, ORANGE, LEMONS]      False
3  8793  ORANGE AVAILABLE                     [TEA COFFEE]      False
4  8793     TEA AVAILABLE                     [TEA COFFEE]      False
5  8793        TEA COFFEE                     [TEA COFFEE]       True
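
One case the question's data does not exercise is an ID that appears in the names frame but has no keywords at all; after the join its `Keywords` cell would be NaN rather than a list. The sketch below wraps the same group-then-match idea in a function that treats such rows as False. The function name `add_indicator` and the extra row with ID 9999 are hypothetical and added only for illustration.

```python
import pandas as pd

def add_indicator(name_df, kw_df):
    """Return a copy of name_df with a boolean 'Indicator' column."""
    # One list of keywords per ID, indexed by ID
    kw_lists = kw_df.groupby("IDs")["Keywords"].agg(list)
    out = name_df.copy()
    # Look up the keyword list for each row; IDs without keywords give NaN
    per_row = out["IDs"].map(kw_lists)
    out["Indicator"] = [
        isinstance(kws, list) and any(kw in name for kw in kws)
        for name, kws in zip(out["Name"], per_row)
    ]
    return out

kw_df = pd.DataFrame({
    "IDs": [1234, 1234, 1234, 5346, 5346, 5346, 8793],
    "Keywords": ["APPLE ABCD", "ORANGE", "LEMONS", "ORANGE",
                 "STRAWBERRY", "BLUEBERRY", "TEA COFFEE"],
})
name_df = pd.DataFrame({
    # The last row (ID 9999) is a made-up ID with no keywords
    "IDs": [1234, 5346, 1234, 8793, 8793, 8793, 9999],
    "Name": ["APPLE ABCD ONE", "APPLE ABCD", "STRAWBERRY YES",
             "ORANGE AVAILABLE", "TEA AVAILABLE", "TEA COFFEE", "ORANGE JUICE"],
})

result = add_indicator(name_df, kw_df)
print(result)
```

The first six rows match the expected output in the question, and the made-up row for ID 9999 comes out as False instead of raising an error.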
