Python: compare surname columns and get their maximum similarity

1 answer

User

#1 · Posted 2024-06-17 08:27:14

This should work. First, I'm just recreating your data so you can see what I'm testing against.

import pandas as pd

person_one_first_surname_column = ["Johnson", "Smith", "Scott", "Morris", "Foster"]
person_two_first_surname_column = ["Johnson", "Smith", "Garcia", "Flores", "Nelson"]
person_one_second_surname_column = ["null", "Dorrien", "null", "null", "null"] 
person_two_second_surname_column = ["null", "null", "Scott", "null", "null"]

# Build the DataFrame: two surname columns per person.
dataset = {
    "lastname1_1": person_one_first_surname_column,
    "lastname1_2": person_one_second_surname_column,
    "lastname2_1": person_two_first_surname_column,
    "lastname2_2": person_two_second_surname_column,
}
df = pd.DataFrame(data=dataset)

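In the future, it would help to include your sample data as formatted code; it saves time for the people helping you. I wasn't sure how the "null" values should be handled, so I assumed they are strings too.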

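We start by defining a function that compares two lists of names. It considers every pair (a, b), where a comes from the first list and b from the second, and includes a pair only when neither name equals "null". It runs difflib's SequenceMatcher on each pair, takes the ratio, and returns the maximum of those ratios.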

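import difflib

def get_max_similarity(list_of_user_one_names, list_of_user_two_names):
    # Score every non-"null" pair across the two lists and keep the best ratio.
    # Note: max() raises ValueError if either list contains only "null" values.
    max_similarity = max([difflib.SequenceMatcher(None, a, b).ratio() for a in list_of_user_one_names if a != "null" for b in list_of_user_two_names if b != "null"])
    return max_similarity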
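
If the real data holds genuine missing values (None or NaN) instead of the string "null", or if a row can have no usable surname on one side, something along these lines might be more forgiving. This is only a sketch: the names is_missing and get_max_similarity_safe, and the choice of 0.0 as the fallback score, are my own assumptions and not part of the function above.

```python
import difflib
import math


def is_missing(name):
    """Assumption: None, NaN, "" and the literal string "null" all mean 'no surname'."""
    if name is None:
        return True
    if isinstance(name, float) and math.isnan(name):
        return True
    return name in ("", "null")


def get_max_similarity_safe(names_one, names_two, default=0.0):
    """Best SequenceMatcher ratio over all usable pairs, or `default` if there are none."""
    ratios = (
        difflib.SequenceMatcher(None, a, b).ratio()
        for a in names_one if not is_missing(a)
        for b in names_two if not is_missing(b)
    )
    # max() accepts a `default` for empty iterables, which avoids the ValueError.
    return max(ratios, default=default)
```

It takes the same two lists as get_max_similarity, so it can be dropped into the same apply call. Whether a row with no comparable surnames should score 0.0, NaN, or something else depends on what you do with the column afterwards.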

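We now use apply with axis=1 to call the new function on each row of the DataFrame, passing the lists of names as arguments. The result is assigned to the DataFrame as a new column, "Max_similarity".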

df["Max_similarity"] = df.apply(lambda row: get_max_similarity([row["lastname1_1"], row["lastname1_2"]], [row["lastname2_1"], row["lastname2_2"]]), axis=1)

Output:

  lastname1_1 lastname1_2 lastname2_1 lastname2_2  Max_similarity
0     Johnson        null     Johnson        null        1.000000
1       Smith     Dorrien       Smith        null        1.000000
2       Scott        null      Garcia       Scott        1.000000
3      Morris        null      Flores        null        0.500000
4      Foster        null      Nelson        null        0.166667
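
To see where scores like 0.5 and 0.166667 come from: the difflib documentation defines ratio() as 2.0*M / T, where M is the number of matched characters and T is the combined length of both strings. The helper below (explain_ratio is a made-up name, purely for illustration) recomputes that by hand from the matching blocks.

```python
import difflib


def explain_ratio(a, b):
    """Return (ratio, matched_chars, total_chars) for two strings."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # Each matching block has a .size; the final sentinel block has size 0.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return matcher.ratio(), matched, total


# "Morris" vs "Flores": "or" and "s" match, so M = 3, T = 12, ratio = 0.5
print(explain_ratio("Morris", "Flores"))
# "Foster" vs "Nelson": a single "o" matches, so M = 1, T = 12, ratio = 2/12
print(explain_ratio("Foster", "Nelson"))
```

Note that the comparison is case-sensitive and depends on character order, so names that share letters in a different order can still score low.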
