How do I extract data from a column given priorities defined in another file?

Posted 2024-04-28 15:39:52


I have two DataFrames.

df1

IMPACT     Rank
HIGH        1
MODERATE    2
LOW         3
MODIFIER    4

df2['Annotation']

Annotation
A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||

A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||

A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||
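Each value in df2['Annotation'] contains multiple annotations separated by , (comma). I want to keep only one annotation per row, chosen according to the rank in df1. My expected output would be: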

df2['RANKED']

RANKED
A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)
A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999)

I tried the following code to generate the output, but it does not give the expected result:

d = df1.set_index('IMPACT')['Rank'].to_dict()
max1 = df1['Rank'].max()+1    

def f(x):
    d1 = {y: d.get(y, max1) for y in x for y in x.split(',')}
    return min(d1, key=d1.get)

df2['RANKED'] = df2['Annotation'].apply(f)

Thanks for your help.


1 Answer
User
#1 · Posted 2024-04-28 15:39:52

TL;DR

df2['RANKED'] = df2['Annotation'].str.split(',')
df2 = df2.explode(column='RANKED')
df2['IMPACT'] = df2["RANKED"].str.findall(r"|".join(df1['IMPACT'])).apply("".join)
df_merge = df2.merge(df1, how='left', on='IMPACT')
df_final = df_merge.loc[df_merge.groupby(['Annotation'])['Rank'].idxmin().sort_values()].drop(columns=['Annotation', 'IMPACT'])

Step by step

First, define the DataFrames

import pandas as pd

df1 = pd.DataFrame({'IMPACT':['HIGH', 'MODERATE', 'LOW', 'MODIFIER'], 'Rank':[1,2,3,4]})

df2 = pd.DataFrame({
    'Annotation':[
                  'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|MODERATE|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||',
                  'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|HIGH|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||',
                  'A|intron_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000341290|protein_coding||2/4||||||||||-1||HGNC|HGNC:28208||||,A|missense_variant|LOW|PERM1|ENSG00000187642|Transcript|ENST00000433179|protein_coding|1/3||||72|72|24|E/D|gaG/gaT|||-1||HGNC|HGNC:28208|YES|CCDS76083.1|deleterious(0)|probably_damaging(0.999),A|upstream_gene_variant|MODIFIER|PERM1|ENSG00000187642|Transcript|ENST00000479361|retained_intron|||||||||||4317|-1||HGNC|HGNC:28208||||']
    })

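Now the tricky part. Create a column that holds the list obtained by splitting each string in the original Annotation column on commas. Then explode this column, so that every list element gets its own row and the original Annotation string is repeated alongside each one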

df2['RANKED'] = df2['Annotation'].str.split(',')
df2 = df2.explode(column='RANKED')
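
To make the effect of split plus explode easier to see, here is a minimal sketch on toy data. The `toy` frame and its shortened strings are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical, shortened annotation strings (not the real data)
toy = pd.DataFrame({'Annotation': ['a|LOW,b|HIGH', 'c|MODIFIER']})

# Step 1: each cell becomes a list of comma-separated entries
toy['RANKED'] = toy['Annotation'].str.split(',')

# Step 2: each list element gets its own row; the Annotation value
# and the original index label are repeated for sibling rows
toy = toy.explode(column='RANKED')

print(toy)
```

Note that the exploded frame has repeated index labels (0, 0, 1 here). The later merge returns a frame with a fresh integer index, which is what lets idxmin identify a single row per group.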

Next, extract the impact word from each RANKED value

df2['IMPACT'] = df2["RANKED"].str.findall(r"|".join(df1['IMPACT'])).apply("".join)
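
The regex above matches an impact word anywhere in the string. As a possible alternative, and assuming the impact is always the third pipe-delimited field (true for the sample rows, but an assumption about the data in general), the field can be taken by position instead. The `exploded` frame below is a shortened, hypothetical stand-in for the exploded df2.

```python
import pandas as pd

# Shortened stand-in for the exploded df2 (illustrative only)
exploded = pd.DataFrame({'RANKED': [
    'A|intron_variant|MODIFIER|PERM1',
    'A|missense_variant|MODERATE|PERM1',
]})

# Assumption: IMPACT is always the third '|'-separated field.
# A single-character pattern is treated as a literal, not a regex.
exploded['IMPACT'] = exploded['RANKED'].str.split('|').str[2]

print(exploded)
```

Taking the field by position avoids accidental matches if one of the impact words happens to appear inside another field.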

Then merge df1 and df2 to get the rank of each RANKED value

df_merge = df2.merge(df1, how='left', on='IMPACT')
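
A small sketch of what the left merge does, using a made-up `toy` frame with one impact value (`UNKNOWN`) that is deliberately absent from df1. Rows without a match are kept and receive a missing Rank.

```python
import pandas as pd

df1 = pd.DataFrame({'IMPACT': ['HIGH', 'MODERATE', 'LOW', 'MODIFIER'],
                    'Rank': [1, 2, 3, 4]})

# Hypothetical rows; 'UNKNOWN' does not appear in df1 on purpose
toy = pd.DataFrame({'RANKED': ['x|HIGH', 'y|UNKNOWN'],
                    'IMPACT': ['HIGH', 'UNKNOWN']})

# how='left' keeps every row of toy; unmatched rows get NaN in Rank
merged = toy.merge(df1, how='left', on='IMPACT')

print(merged)
```

If the real data can contain impact values outside df1, it may be worth checking for missing ranks before selecting the minimum per group.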

Finally, the easy part: drop everything you do not want from the final DataFrame. This can be done with groupby

df_final = df_merge.loc[df_merge.groupby(['Annotation'])['Rank'].idxmin().sort_values()].drop(columns=['Annotation', 'IMPACT'])
RANKED                                              Rank
A|missense_variant|MODERATE|PERM1|ENSG00000187...   2
A|missense_variant|HIGH|PERM1|ENSG00000187642|...   1
A|missense_variant|LOW|PERM1|ENSG00000187642|T...   3

Or by dropping duplicates

df_final = df_merge.sort_values(['Annotation', 'Rank'], ascending=[False,True]).drop_duplicates(subset=['Annotation']).drop(columns=['Annotation', 'IMPACT'])
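
For comparison, the apply-based attempt from the question can probably be made to work as well. The sketch below is one possible correction, not the answer's method: it splits each string on commas, looks up the rank of the third pipe-delimited field (an assumption about the data layout), and returns the entry with the lowest rank. The names `pick_best`, `rank_of`, `worst`, and the shortened `toy` row are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'IMPACT': ['HIGH', 'MODERATE', 'LOW', 'MODIFIER'],
                    'Rank': [1, 2, 3, 4]})

d = df1.set_index('IMPACT')['Rank'].to_dict()
worst = df1['Rank'].max() + 1  # fallback rank for unknown impacts


def pick_best(annotation):
    """Return the comma-separated entry whose impact has the lowest Rank."""
    def rank_of(entry):
        fields = entry.split('|')
        # Assumption: IMPACT is the third '|'-separated field
        impact = fields[2] if len(fields) > 2 else None
        return d.get(impact, worst)

    # min() returns the first entry in case of a tie
    return min(annotation.split(','), key=rank_of)


# Shortened, hypothetical row in the same layout as the real data
toy = pd.DataFrame({'Annotation': [
    'A|intron_variant|MODIFIER|PERM1,'
    'A|missense_variant|HIGH|PERM1,'
    'A|upstream_gene_variant|MODIFIER|PERM1'
]})
toy['RANKED'] = toy['Annotation'].apply(pick_best)

print(toy['RANKED'].iloc[0])
```

This version avoids the explode and merge steps, at the cost of running a Python function per row.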
