是否可以与python pandas进行模糊匹配合并?

2024-05-16 19:40:13 发布

您现在位置:Python中文网/ 问答频道 /正文

我有两个数据帧,我想根据一个列合并它们。但是,由于拼写交替、空格数不同、缺少/存在变音符号,我希望能够合并,只要它们彼此相似。

任何相似性算法都可以(soundex、Levenshtein、difflib)。

假设一个数据帧包含以下数据:

df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])

       number
one         1
two         2
three       3
four        4
five        5

df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

      letter
one        a
too        b
three      c
fours      d
five       e

然后我想得到最终的数据帧

       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e

Tags: columns数据numberdataframeindexonetoothree
3条回答

我已经编写了一个Python包来解决这个问题:

pip install fuzzymatcher

您可以找到回购here和文档here

基本用法:

给定要模糊连接的两个数据帧df_leftdf_right,可以编写以下内容:

from fuzzymatcher import link_table, fuzzy_left_join

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)

或者如果您只想链接到最接近的匹配项:

fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)

我会使用Jaro Winkler,因为它是目前可用的最高效、最精确的近似字符串匹配算法之一[Cohen, et al.],[Winkler]。

这就是我如何处理来自jellyfish包的Jaro Winkler:

def get_closest_match(x, list_strings):

  best_match = None
  highest_jw = 0

  for current_string in list_strings:
    current_score = jellyfish.jaro_winkler(x, current_string)

    if(current_score > highest_jw):
      highest_jw = current_score
      best_match = current_string

  return best_match

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

df1.join(df2)

输出:

    number  letter
one     1   a
two     2   b
three   3   c
four    4   d
five    5   e

类似于@locojay suggestion,您可以将^{}^{}应用于df2的索引,然后应用^{}

In [23]: import difflib 

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]: 
      letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]: 
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e

是的。

如果这些是列,您可以按照相同的方式应用于列,然后^{}

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)

相关问题 更多 >