Pandas: joining on a condition with the LIKE operator

Posted 2024-05-15 07:45:02


I have two DataFrames. Users:

 user_id    position
0   201 Senior Engineer
1   207 Senior System Architect
2   223 Senior account manage
3   212 Junior Manager
4   112 junior Engineer
5   311 junior python developer

import pandas as pd

df1 = pd.DataFrame({'user_id': ['201', '207', '223', '212', '112', '311'],
                   'position': ['Senior Engineer', 'Senior System Architect', 'Senior account manage', 'Junior Manager', 'junior Engineer', 'junior python developer']})

Roles:

 role_id     role_position
0   10         %senior%
1   20         %junior%
df2 = pd.DataFrame({'role_id': ['10', '20'],
                   'role_position': ['%senior%', '%junior%']})

I want to join them and get the role_id for each row of df1, using a condition like this:

lower(df1.position) LIKE df2.role_position

I want to use the LIKE operator (as in SQL). The result should look like this (or without the role_position column, which would be even better):

user_id position                role_id  role_position
0   201 Senior Engineer           10      %senior%
1   207 Senior System Architect   10      %senior%
2   223 Senior account manage     10      %senior%
3   212 Junior Manager            20      %junior%
4   112 junior Engineer           20      %junior%
5   311 junior python developer   20      %junior%

How can I do this? Thanks for your help.


3 Answers

If the seniority level is always the first word of the position, merging directly on that word avoids some trouble:

# Join on the lower-cased first word of each position against the role keyword without '%'
print (pd.merge(df1, df2,
                left_on=df1["position"].str.split().str[0].str.lower(),
                right_on=df2["role_position"].str.strip("%")).drop("key_0", axis=1))

Otherwise, you can use pd.Series.str.extract during the merge:

import re

# One capture group listing every role keyword, e.g. '(senior|junior)'
pat = f'({"|".join(df2["role_position"].str.strip("%"))})'

print (pd.merge(df1, df2,
                left_on=df1["position"].str.extract(pat, flags=re.IGNORECASE, expand=False).str.lower(),
                right_on=df2["role_position"].str.strip("%")).drop("key_0", axis=1))

Both produce the same result:

  user_id                 position role_id role_position
0     201          Senior Engineer      10      %senior%
1     207  Senior System Architect      10      %senior%
2     223    Senior account manage      10      %senior%
3     212           Junior Manager      20      %junior%
4     112          junior Engineer      20      %junior%
5     311  junior python developer      20      %junior%
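
Both approaches above rely on every pattern having the form %keyword%. For LIKE patterns in general, one option is to translate each pattern into a regular expression and filter a cross join. The following is only a minimal sketch under stated assumptions: like_to_regex is a hypothetical helper (not part of pandas), it handles only the % and _ wildcards with no escape character, and how="cross" assumes pandas 1.2 or newer. A cross join builds len(df1) * len(df2) rows, so it is suited to small role tables.

```python
import re
import pandas as pd

def like_to_regex(pattern):
    """Hypothetical helper: translate a SQL LIKE pattern into a regex string."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches any run of characters
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)

df1 = pd.DataFrame({
    "user_id": ["201", "207", "223", "212", "112", "311"],
    "position": ["Senior Engineer", "Senior System Architect", "Senior account manage",
                 "Junior Manager", "junior Engineer", "junior python developer"],
})
df2 = pd.DataFrame({"role_id": ["10", "20"],
                    "role_position": ["%senior%", "%junior%"]})

# Pair every user with every role, then keep the pairs where
# lower(position) LIKE role_position holds.
pairs = df1.merge(df2, how="cross")
mask = [
    re.fullmatch(like_to_regex(pat), pos.lower(), flags=re.DOTALL) is not None
    for pos, pat in zip(pairs["position"], pairs["role_position"])
]
result = pairs.loc[mask].drop(columns="role_position").reset_index(drop=True)
print(result)
```

A position that matches several patterns appears once per matching role, which mirrors what a SQL join with a LIKE condition would return.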

You can use str.extract() + merge():

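# Build one alternation pattern from the role keywords, e.g. '(senior|junior)'
pat = '(' + '|'.join(df2['role_position'].str.strip('%').unique()) + ')'
# Extract the keyword from each lower-cased position and wrap it in '%' so it equals df2's key
df1['role_position'] = '%' + df1['position'].str.lower().str.extract(pat, expand=False) + '%'
# A left merge keeps every user, even when no keyword was found
df1 = df1.merge(df2, on='role_position', how='left')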

Output of df1:

user_id position                role_id  role_position
0   201 Senior Engineer           10      %senior%
1   207 Senior System Architect   10      %senior%
2   223 Senior account manage     10      %senior%
3   212 Junior Manager            20      %junior%
4   112 junior Engineer           20      %junior%
5   311 junior python developer   20      %junior%
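
Since the question prefers a result without role_position, a variant of the same idea can skip the merge and map the extracted keyword straight to role_id. This is a sketch under assumptions, not the answer's own code: the "Intern" row and the names role_by_keyword and found are made up for illustration, and every pattern is assumed to have the form %keyword%.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    "user_id": ["201", "212", "999"],
    # "Intern" is a made-up row that matches no role
    "position": ["Senior Engineer", "Junior Manager", "Intern"],
})
df2 = pd.DataFrame({"role_id": ["10", "20"],
                    "role_position": ["%senior%", "%junior%"]})

# Strip the wildcards and build a keyword -> role_id lookup.
keywords = df2["role_position"].str.strip("%")
role_by_keyword = dict(zip(keywords, df2["role_id"]))

# re.escape guards against keywords that contain regex metacharacters.
pat = "(" + "|".join(re.escape(k) for k in keywords) + ")"
found = df1["position"].str.lower().str.extract(pat, expand=False)

# Rows without a keyword get a missing role_id instead of being dropped.
result = df1.assign(role_id=found.map(role_by_keyword))
print(result)
```

Unmatched rows keep a missing value in role_id, which corresponds to the how='left' behaviour of the merge above.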

Another possibility, using fuzzy string similarity:


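from difflib import SequenceMatcher

def similar(a, b):
    # Ratio in [0, 1]; higher means the two strings are more alike
    return SequenceMatcher(None, a, b).ratio()

df1['Similarity'] = 0.0
df1['Role'] = ''

for index, row in df1.iterrows():
    for x in df2['role_position']:
        # Compare in lower case, as in the LIKE condition from the question
        z = similar(row['position'].lower(), x)
        # Keep only the best-scoring role above the 0.20 threshold
        if z >= 0.20 and z > df1.loc[index, 'Similarity']:
            df1.loc[index, 'Similarity'] = z
            df1.loc[index, 'Role'] = x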

