一种基于子串和字符串混合的模糊映射方法

2024-06-09 17:09:39 发布

您现在位置:Python中文网/ 问答频道 /正文

我有两个数据帧mappdata,如下所示

mapp = pd.DataFrame({'variable': ['d22','Studyid','noofsons','Level','d21'],'concept_id':[1,2,3,4,5]})

data = pd.DataFrame({'sourcevalue': ['d22heartabcd','Studyid','noofsons','Level','d21abcdef']})

enter image description here

enter image description here

我想从data中获取一个值,并检查它是否存在于mapp中,如果存在,则获取相应的concept_id值。优先级是首先查找exact match。如果找不到匹配项,则转到substring match。因为我要处理超过一百万条记录,任何可缩放的解决方案都是有用的

s = mapp.set_index('variable')['concept_id']
data['concept_id'] = data['sourcevalue'].map(s) 

产生如下输出

enter image description here

当我进行子串匹配时,有效记录也会变成NA,如下所示

data['concept_id'] = data['sourcevalue'].str[:3].map(s)

enter image description here

我不知道为什么它现在给有效记录NA

如何才能优雅高效地同时完成这两项检查

我希望我的输出如下所示

enter image description here


Tags: idmapdataframedatamatch记录levelvariable
3条回答

试试这个。在这种情况下,我们将在第一个映射之后定位NA值;对它们进行子字符串映射

s = mapp.set_index('variable')['concept_id']
data['concept_id'] = data['sourcevalue'].map(s)
data.loc[data['concept_id'].isnull(),'concept_id' ]= data['sourcevalue'].str[:3].map(s)

如果需要按字符串和前3个字母映射,请创建2个单独的序列,然后使用^{}^{}a中缺少的值替换为b

s = mapp.set_index('variable')['concept_id']
a = data['sourcevalue'].map(s) 
b = data['sourcevalue'].str[:3].map(s)

data['concept_id'] = a.fillna(b)
#alternative
#data['concept_id'] = a.combine_first(b)
print (data)
    sourcevalue  concept_id
0  d22heartabcd         1.0
1       Studyid         2.0
2      noofsons         3.0
3         Level         4.0
4     d21abcdef         5.0

编辑:

#all strings map Series
s = mapp.set_index('variable')['concept_id']
print (s)
variable
d22         1
Studyid     2
noofsons    3
Level       4
d21         5
Name: concept_id, dtype: int64

#first 3 letters map Series
s1 = mapp.assign(variable = mapp['variable'].str[:3]).set_index('variable')['concept_id']
print (s1)
variable
d22    1
Stu    2
noo    3
Lev    4
d21    5
Name: concept_id, dtype: int64

#first 3 letters map by all strings
print (data['sourcevalue'].str[:3].map(s))
0    1.0
1    NaN
2    NaN
3    NaN
4    5.0
Name: sourcevalue, dtype: float64

#first 3 letters match by 3 first letters map Series
print (data['sourcevalue'].str[:3].map(s1))
0    1
1    2
2    3
3    4
4    5
Name: sourcevalue, dtype: int64

使用^{}函数,我编写了:

new = fuzzy_merge(data, mapp, 'sourcevalue', 'variable')\
       .merge(mapp, left_on='matches', right_on='variable')\
       .drop(columns=['matches', 'variable'])

输出

    sourcevalue  concept_id
0  d22heartabcd           1
1       Studyid           2
2      noofsons           3
3         Level           4
4     d21abcdef           5

链接答案中使用的函数:

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match
    limit is the amount of matches will get returned, these are sorted high to low
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1

相关问题 更多 >