一种基于子串和字符串混合的模糊映射方法

mapp = pd.DataFrame({'variable': ['d22','Studyid','noofsons','Level','d21'],'concept_id':[1,2,3,4,5]}) data = pd.DataFrame({'sourcevalue': ['d22heartabcd','Studyid','noofsons','Level','d21abcdef']})

3条回答

网友

1楼 · 编辑于 2024-06-09 17:09:39

试试这个。在这种情况下，我们将在第一个映射之后定位NA值；对它们进行子字符串映射

s = mapp.set_index('variable')['concept_id']
data['concept_id'] = data['sourcevalue'].map(s)
data.loc[data['concept_id'].isnull(),'concept_id' ]= data['sourcevalue'].str[:3].map(s)

网友

2楼 · 编辑于 2024-06-09 17:09:39

如果需要按字符串和前3个字母映射，请创建2个单独的序列，然后使用^{}或^{}将a中缺少的值替换为b：

s = mapp.set_index('variable')['concept_id']
a = data['sourcevalue'].map(s) 
b = data['sourcevalue'].str[:3].map(s)

data['concept_id'] = a.fillna(b)
#alternative
#data['concept_id'] = a.combine_first(b)
print (data)
    sourcevalue  concept_id
0  d22heartabcd         1.0
1       Studyid         2.0
2      noofsons         3.0
3         Level         4.0
4     d21abcdef         5.0

编辑：

#all strings map Series
s = mapp.set_index('variable')['concept_id']
print (s)
variable
d22         1
Studyid     2
noofsons    3
Level       4
d21         5
Name: concept_id, dtype: int64

#first 3 letters map Series
s1 = mapp.assign(variable = mapp['variable'].str[:3]).set_index('variable')['concept_id']
print (s1)
variable
d22    1
Stu    2
noo    3
Lev    4
d21    5
Name: concept_id, dtype: int64

#first 3 letters map by all strings
print (data['sourcevalue'].str[:3].map(s))
0    1.0
1    NaN
2    NaN
3    NaN
4    5.0
Name: sourcevalue, dtype: float64

#first 3 letters match by 3 first letters map Series
print (data['sourcevalue'].str[:3].map(s1))
0    1
1    2
2    3
3    4
4    5
Name: sourcevalue, dtype: int64

网友

3楼 · 编辑于 2024-06-09 17:09:39

使用^{}函数，我编写了：

new = fuzzy_merge(data, mapp, 'sourcevalue', 'variable')\
       .merge(mapp, left_on='matches', right_on='variable')\
       .drop(columns=['matches', 'variable'])

输出

    sourcevalue  concept_id
0  d22heartabcd           1
1       Studyid           2
2      noofsons           3
3         Level           4
4     d21abcdef           5

链接答案中使用的函数：

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match
    limit is the amount of matches will get returned, these are sorted high to low
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1

相关问题更多 >

编程相关推荐

热门问题

热门文章

一种基于子串和字符串混合的模糊映射方法

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >