How to search the rows of a DataFrame and use a lambda to return the first element found in a dictionary into a new column

Posted 2024-05-16 04:44:05


I have a DataFrame that looks like this:

print(df):

# Output
                        Text         
0  Melbourne. Adelaide. Brisbane.     
1  Sydney. Adelaide. Gold Coast.       
2  Gold Coast. Brisbane. Melbourne.   
3  Sydney. Melbourne. Brisbane.  

I also have a dictionary:

print(dict):

{'VIC': ['Melbourne'], 'SA': ['Adelaide'], 'QLD': ['Brisbane','Gold Coast'], 'NSW': ['Sydney']}

The output I am trying to create is:

print(df):

                       Text               VIC              SA              QLD                NSW
0  Melbourne. Adelaide. Brisbane.        Melbourne.       Adelaide.         Brisbane.        
1  Sydney. Adelaide. Gold Coast.                          Adelaide.         Gold Coast.      Sydney.  
2  Gold Coast. Brisbane. Melbourne.     Melbourne.                          Brisbane.        
3  Sydney. Melbourne. Brisbane.         Melbourne.                          Brisbane.        Sydney.

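When two items from the same state appear in one row (as in df index 2), I want to show Brisbane rather than Gold Coast, because Brisbane comes first in the dict.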

I used the following code, which puts a 1 in the column if any of the strings is found and a 0 otherwise:

    for index in df.index:
        for key, s_elements in dict.items():
            df.at[index, key] = (lambda: 1 if any([s in df['Text'][index] for s in s_elements]) else 0)()

print(df):

                       Text               VIC              SA              QLD                NSW
0  Melbourne. Adelaide. Brisbane.           1               1               1                  0
1  Sydney. Adelaide. Gold Coast.            0               1               1                  1
2  Gold Coast. Brisbane. Melbourne.         1               0               1                  0
3  Sydney. Melbourne. Brisbane.             1               0               1                  1       

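This works, but I am trying to get the name of the city that was found into the column (using the first matching element under each key).

Can anyone help me with what to adjust so that the names are shown instead of the numbers? I tried lambda: s_elements instead of lambda: 1, but got ValueError: Must have equal len keys and value when setting with an iterable

Any help would be great! Many thanks.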


2 Answers

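Another attempt: take the data out of pandas, clean it up, and then pull it back into a DataFrame. Since it is all strings, this should be fast. Twice as fast? Hmm. Either way, here goes: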

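import pandas as pd

# d is the dictionary from the question (it is named dict there)
d = {'VIC': ['Melbourne'], 'SA': ['Adelaide'], 'QLD': ['Brisbane','Gold Coast'], 'NSW': ['Sydney']}

#flip dictionary, so that each city (value) maps to its state (key):
repl = {}
for k,v in d.items():
    for ent in v:
        repl[ent] = k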

repl
{'Melbourne': 'VIC',
 'Adelaide': 'SA',
 'Brisbane': 'QLD',
 'Gold Coast': 'QLD',
 'Sydney': 'NSW'}

#get out of dataframe; rstrip('.') avoids an empty trailing field when a row ends with a period
res = df.Text.str.rstrip('.').str.split('.',expand=True).to_numpy().tolist()
print(res)
[['Melbourne', ' Adelaide', ' Brisbane'],
 ['Sydney', ' Adelaide', ' Gold Coast'],
 ['Gold Coast', ' Brisbane', ' Melbourne'],
 ['Sydney', ' Melbourne', ' Brisbane']]

#cleanups here
res = [[entry.strip() for entry in ent] for ent in res]
res = [[(k,repl.get(k)) for k in ent] for ent in res]
#get rid of Gold Coast if Brisbane is also in the same row
for ent in res:
    if ('Gold Coast','QLD') in ent and ('Brisbane','QLD') in ent:
        ent.remove(('Gold Coast','QLD'))
res = (zip(*ent) for ent in res) #this gets cities into a group and states in another group per row

cols = ['VIC','SA','QLD','NSW']
#get dataframe
out = pd.concat((pd.DataFrame(start,index=end).reindex(cols,axis=0).T for start,end in res), ignore_index=True)

#merge back with main df
pd.concat((df,out),axis=1)


                              Text        VIC        SA         QLD     NSW
0    Melbourne. Adelaide. Brisbane  Melbourne  Adelaide    Brisbane     NaN
1     Sydney. Adelaide. Gold Coast        NaN  Adelaide  Gold Coast  Sydney
2  Gold Coast. Brisbane. Melbourne  Melbourne       NaN    Brisbane     NaN
3      Sydney. Melbourne. Brisbane  Melbourne       NaN    Brisbane  Sydney

This is not an easy task, but one way is to first build a regex pattern for extractall:

s = {'VIC': ['Melbourne'], 'SA': ['Adelaide'], 'QLD': ['Brisbane', 'Gold Coast'], 'NSW': ['Sydney']}

pattern = "|".join(f"({x})" for x in ("|".join(i) for i in s.values()))

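This gives (Melbourne)|(Adelaide)|(Brisbane|Gold Coast)|(Sydney), but you also want Brisbane to take priority over Gold Coast, which requires this pattern instead (the negative lookahead rejects Gold Coast whenever Brisbane appears later in the same string):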

pattern2 = "(Melbourne)|(Adelaide)|(Brisbane|Gold Coast(?!.*Brisbane))|(Sydney)"
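
pattern2 is written by hand for this one dictionary. As a rough sketch of how the same idea could be generated from any dictionary, the helper below (build_pattern is a hypothetical name, not part of pandas or of the original answer) gives every city after the first in each list a negative lookahead against the cities listed before it. It assumes each row is a single line, since .* does not cross newlines.

```python
import re

s = {'VIC': ['Melbourne'], 'SA': ['Adelaide'],
     'QLD': ['Brisbane', 'Gold Coast'], 'NSW': ['Sydney']}

def build_pattern(mapping):
    """Hypothetical helper: one capture group per key, in dict order.

    Each city after the first in a list only matches when none of the
    cities listed before it occurs later in the same string.
    """
    groups = []
    for cities in mapping.values():
        alternatives = []
        for i, city in enumerate(cities):
            alt = re.escape(city)
            if i > 0:
                earlier = "|".join(re.escape(c) for c in cities[:i])
                alt += f"(?!.*(?:{earlier}))"
            alternatives.append(alt)
        groups.append("(" + "|".join(alternatives) + ")")
    return "|".join(groups)

pattern = build_pattern(s)

# Gold Coast is skipped here because Brisbane follows it in the string
matches = re.findall(pattern, "Gold Coast. Brisbane. Melbourne.")
print(matches)  # [('', '', 'Brisbane', ''), ('Melbourne', '', '', '')]
```

The generated string differs cosmetically from pattern2 (re.escape escapes the space in Gold Coast), but it should be usable with str.extractall in the same way.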

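Either way, whichever pattern you use, you can get the matches with str.extractall, flatten them to one row per original row with groupby(...).first(), rename the columns, and then concat the result with the original df: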

ref = (df["Text"].str.extractall(pattern2).reset_index()
       .groupby("level_0").first()
       .rename(columns={v:k for k,v in zip(s.keys(), range(0,4))})
       .reset_index(drop=True))

print (pd.concat([df, ref.iloc[:,1:]],axis=1))

                               Text        VIC        SA         QLD     NSW
0    Melbourne. Adelaide. Brisbane.  Melbourne  Adelaide    Brisbane     NaN
1     Sydney. Adelaide. Gold Coast.        NaN  Adelaide  Gold Coast  Sydney
2  Gold Coast. Brisbane. Melbourne.  Melbourne       NaN    Brisbane     NaN
3      Sydney. Melbourne. Brisbane.  Melbourne       NaN    Brisbane  Sydney
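
For comparison, here is a minimal sketch that stays closer to the lambda in the question. It is not taken from either answer: first_city is a hypothetical helper, and it assumes plain substring matching is good enough for this data. For each key it returns the first city in the list (in dict order) that occurs in the row's text, or None when none does, so every cell receives a single scalar rather than a list.

```python
import pandas as pd

df = pd.DataFrame({"Text": [
    "Melbourne. Adelaide. Brisbane.",
    "Sydney. Adelaide. Gold Coast.",
    "Gold Coast. Brisbane. Melbourne.",
    "Sydney. Melbourne. Brisbane.",
]})
s = {'VIC': ['Melbourne'], 'SA': ['Adelaide'],
     'QLD': ['Brisbane', 'Gold Coast'], 'NSW': ['Sydney']}

def first_city(text, cities):
    """Hypothetical helper: first entry of `cities` contained in `text`, else None."""
    return next((city for city in cities if city in text), None)

# one new column per key; cities=cities binds the current list to the lambda
for state, cities in s.items():
    df[state] = df["Text"].apply(lambda text, cities=cities: first_city(text, cities))

print(df)
```

Because the cities are checked in dict order, row 2 gets Brisbane rather than Gold Coast even though Gold Coast appears first in the text.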
