试图根据三个公共列匹配两个df,但它们都不相同

2024-04-24 07:12:03 发布

您现在位置:Python中文网/ 问答频道 /正文

我有两个df

df1

    date            League                                 teams
0   201902272215    brazil cup                             foz do iguacu fcceara ce
1   201902272300    colombia primera a                     deportes tolimaatletico bucaramanga
2   201902272300    brazil campeonato gaucho 2nd division  ypiranga rsuniao frederiquense
3   201902272300    brazil campeonato gaucho 2nd division  esportivo rstupi rs
4   201902272300    brazil campeonato gaucho 2nd division  sao paulo rsgremio esportivo bage
14  201902280000    four nations women tournament (in usa) usa (w)japan (w)
25  201902280030    bolivia professional football league   real potosibolivar

df2

    date            league                     teams
0   201902280000    womens international       usa womenjapan women
1   201902280000    brazil amazonense          sul america ecrio negro am
2   201902280030    bolivia apertura           real potosibolivar
3   201902280030    brazil campeonato paulista palmeirasituano
4   201902280030    copa sudamericana          racing clubcorinthians

我想要的结果是df2中与df1接近匹配的所有行

    date            league                     teams                      near_match
0   201902280000    womens international       usa womenjapan women       1
1   201902280000    brazil amazonense          sul america ecrio negro am 0
2   201902280030    bolivia apertura           real potosibolivar         1
3   201902280030    brazil campeonato paulista palmeirasituano            0
4   201902280030    copa sudamericana          racing clubcorinthians     0 

我尝试使用for循环的一个变体,使用SequenceMatcher并将阈值设置为大于0.8的匹配,但没有任何运气

df_1['merge_teams'] = df_1['teams'] # we will use these as the merge keys
df_1['merge_date'] = df_1['date']
# df_1['merge_league'] = df_1['league']


for teams_1, date_1, league_1 in df_1[['teams','date']].values:
    for ixb, (teams_1, teams_2) in enumerate(df_2[['teams','date']].values):
        if difflib.SequenceMatcher(None,teams_1,teams_2).ratio() > .8:
            df_2.ix[ixb,'merge_teams'] = teams_1 # creates a merge key in df_2
        if difflib.SequenceMatcher(None,date_1, date_2).ratio() > .8:
            df_2.ix[ixb,'merge_date'] = date_1 # creates a merge key in df_2

# This should rturn all rows where teams,date and league all match by over 80%
# This is just for teams and date, I want to include league as well

如有任何建议或指导,将不胜感激


Tags: indffordatemergerealdivisionbrazil