What is the best way to replace thousands of rows of IDs with their corresponding names in Python?

Posted 2024-05-23 22:45:58


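I have two datasets. One of them has 16169 rows and 5 columns, and I want to replace one of its columns with the corresponding names. Those names come from the other dataset.

For example: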

UniProtID    NAME
Q15173     PPP2R5B
P30154     PPP2R1B
P63151     PPP2R2A

DrugBankID  Name    Type        UniProtID   UniProt Name
DB00001 Lepirudin   BiotechDrug P00734  Prothrombin
DB00002 Cetuximab   BiotechDrug P00533  Epidermal growth factor receptor
DB00002 Cetuximab   BiotechDrug O75015  Low affinity immunoglobulin gamma Fc region receptor III-B

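In this example, I want to replace all of the UniProt IDs with the corresponding names from the first example table above. What is the best way to do this?

I am really new to programming and Python, so any advice or help is appreciated.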


2 Answers

I think you need map with a Series created by set_index; IDs that have no match become NaN:

# sample data changed so that some IDs match
print (df1)
  UniProtID     NAME
0    O75015  PPP2R5B
1    P00734  PPP2R1B
2    P63151  PPP2R2A

df2['UniProt Name'] = df2['UniProtID'].map(df1.set_index('UniProtID')['NAME'])
print (df2)
  DrugBankID       Name         Type UniProtID UniProt Name
0    DB00001  Lepirudin  BiotechDrug    P00734      PPP2R1B
1    DB00002  Cetuximab  BiotechDrug    P00533          NaN
2    DB00002  Cetuximab  BiotechDrug    O75015      PPP2R5B

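If you need the original values where the result is NaN:

# parentheses are needed so the chained call can continue on the next line
df2['UniProt Name'] = (df2['UniProtID'].map(df1.set_index('UniProtID')['NAME'])
                                       .fillna(df2['UniProt Name']))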
print (df2)
  DrugBankID       Name         Type UniProtID  \
0    DB00001  Lepirudin  BiotechDrug    P00734   
1    DB00002  Cetuximab  BiotechDrug    P00533   
2    DB00002  Cetuximab  BiotechDrug    O75015   

                       UniProt Name  
0                           PPP2R1B  
1  Epidermal growth factor receptor  
2                           PPP2R5B  
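
For reference, the map-based approach above can be put together as a self-contained sketch. The two sample frames are rebuilt from the printed output, and the names lookup and mapped are only illustrative, not part of the original answer.

```python
import pandas as pd

# Lookup table (ID -> short name), rebuilt from the sample output above.
df1 = pd.DataFrame({
    "UniProtID": ["O75015", "P00734", "P63151"],
    "NAME": ["PPP2R5B", "PPP2R1B", "PPP2R2A"],
})

# Drug table whose 'UniProt Name' column should be replaced where possible.
df2 = pd.DataFrame({
    "DrugBankID": ["DB00001", "DB00002", "DB00002"],
    "Name": ["Lepirudin", "Cetuximab", "Cetuximab"],
    "Type": ["BiotechDrug"] * 3,
    "UniProtID": ["P00734", "P00533", "O75015"],
    "UniProt Name": [
        "Prothrombin",
        "Epidermal growth factor receptor",
        "Low affinity immunoglobulin gamma Fc region receptor III-B",
    ],
})

# Series indexed by ID; map() looks each ID up and yields NaN when it is absent.
lookup = df1.set_index("UniProtID")["NAME"]
mapped = df2["UniProtID"].map(lookup)

# Keep the original name wherever the lookup found nothing.
df2["UniProt Name"] = mapped.fillna(df2["UniProt Name"])
print(df2[["UniProtID", "UniProt Name"]])
```

Building the lookup Series once and reusing it keeps the per-row work to a single index lookup, which should be comfortable for a table of about 16,000 rows.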

A solution with merge: it needs a left join followed by fillna or combine_first, and finally the helper column is removed with drop:

df = pd.merge(df2, df1, on="UniProtID", how='left')
df['UniProt Name'] = df['NAME'].fillna(df['UniProt Name'])
# alternative with the same result:
# df['UniProt Name'] = df['NAME'].combine_first(df['UniProt Name'])
df.drop('NAME', axis=1, inplace=True)
print (df)
  DrugBankID       Name         Type UniProtID  \
0    DB00001  Lepirudin  BiotechDrug    P00734   
1    DB00002  Cetuximab  BiotechDrug    P00533   
2    DB00002  Cetuximab  BiotechDrug    O75015   

                       UniProt Name  
0                           PPP2R1B  
1  Epidermal growth factor receptor  
2                           PPP2R5B  

Or, to take the names only from df1 and leave unmatched rows as NaN:

df = pd.merge(df2, df1, on="UniProtID", how='left')
df = df.drop('UniProt Name', axis=1).rename(columns={'NAME':'UniProt Name'})
print (df)
  DrugBankID       Name         Type UniProtID UniProt Name
0    DB00001  Lepirudin  BiotechDrug    P00734      PPP2R1B
1    DB00002  Cetuximab  BiotechDrug    P00533          NaN
2    DB00002  Cetuximab  BiotechDrug    O75015      PPP2R5B
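
One thing to watch with a left merge on real data: if an ID appears more than once in the lookup table, each matching row in the drug table is repeated once per match, so the result can end up longer than the original. The sketch below is a hedged illustration with a made-up duplicate row; the variable names naive, unique_names and merged are assumptions for the example only.

```python
import pandas as pd

# Lookup table with a hypothetical duplicated ID (P00734 appears twice).
df1 = pd.DataFrame({
    "UniProtID": ["O75015", "P00734", "P00734"],
    "NAME": ["PPP2R5B", "PPP2R1B", "PPP2R1B"],
})
df2 = pd.DataFrame({
    "DrugBankID": ["DB00001", "DB00002", "DB00002"],
    "UniProtID": ["P00734", "P00533", "O75015"],
})

# Without deduplication the P00734 row is emitted twice.
naive = pd.merge(df2, df1, on="UniProtID", how="left")

# Keep one name per ID, then ask pandas to verify that the right-hand
# keys are unique; validate= raises MergeError if they are not.
unique_names = df1.drop_duplicates(subset="UniProtID", keep="first")
merged = pd.merge(df2, unique_names, on="UniProtID", how="left",
                  validate="many_to_one")

print(len(df2), len(naive), len(merged))
```

Whether keeping the first occurrence is the right choice depends on the data; if the duplicates carry different names, they need to be resolved by hand first.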

A more general way to solve this is to perform a SQL-like join on the two tables.

Note: this may be expensive for larger datasets; I have not tested the performance.

import pandas as pd

left = pd.DataFrame({"UniProtID": ["Q15173", "P30154", "P63151"],
                     "Name": ["PPP2R5B", "PPP2R1B", "PPP2R2A"]})

right = pd.DataFrame({"UniProtID": ["Q15173", "P30154", "P63151"],
                      "UniProt Name": ["Prothrombin", "Epidermal growth factor receptor", "Low affinity immunoglobulin gamma Fc region receptor III-B"],
                      "Type": ["BiotechDrug", "BiotechDrug", "BiotechDrug"],
                      "DrugBankID": ["DB00001", "DB00002", "DB00003"]})

result = pd.merge(left, right, on="UniProtID")
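
Note that pd.merge performs an inner join by default, so rows whose UniProtID has no counterpart in the other table are dropped from the result. As a rough sketch (the frames drugs and names below are small stand-ins built from the question's sample rows, not the real data), a left join with indicator=True keeps every drug row and shows which IDs found no name:

```python
import pandas as pd

# Stand-in tables based on the sample rows in the question.
drugs = pd.DataFrame({
    "DrugBankID": ["DB00001", "DB00002", "DB00002"],
    "UniProtID": ["P00734", "P00533", "O75015"],
})
names = pd.DataFrame({
    "UniProtID": ["O75015", "P00734", "P63151"],
    "NAME": ["PPP2R5B", "PPP2R1B", "PPP2R2A"],
})

# Default how="inner": only rows with a match on both sides survive.
inner = pd.merge(drugs, names, on="UniProtID")

# how="left" keeps all drug rows; indicator=True adds a '_merge' column
# with the values 'both', 'left_only' or 'right_only'.
flagged = pd.merge(drugs, names, on="UniProtID", how="left", indicator=True)

# IDs from the drug table that have no entry in the name table.
unmatched_ids = flagged.loc[flagged["_merge"] == "left_only", "UniProtID"].tolist()
print(unmatched_ids)
```

Listing the unmatched IDs before overwriting anything makes it easier to decide whether they should keep their original names or be looked up elsewhere.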

Reference: https://pandas.pydata.org/pandas-docs/stable/merging.html#overlapping-value-columns
