Removing elements from a gene list

Posted 2024-04-30 04:14:49


I have a list like this:

['>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding', 
 '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding']

I want to create a new list with the same length and order, but keeping only the gene ID from each entry. The result should look like this:

['ENSG00000103091', 'ENSG00000196313']

I'm using Python. Does anyone know how to do this? Thanks.


Tags: list, order, grch37, known, transcript, gene, coding, cds
3 answers

Use a basic list comprehension:

lst = ['>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding', '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding']

res = [el[5:] for s in lst for el in s.split() if el.startswith('gene:')]

If you would rather do this with a regular for loop, use the following:

lst = ['>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding', '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding']

res = []
for el in lst: # for each string in your list
    l = el.split() # create a second list, of split strings
    for s in l: # for each string in the 'split strings' list
        if s.startswith('gene:'): # if the string starts with 'gene:' we know we have match
            res.append(s[5:]) # so skip the 'gene:' part of the string, and append the rest to a list
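
One caveat about "same length and order": both versions above simply skip a string that has no field starting with 'gene:', so the result could end up shorter than the input. A minimal sketch of one way to keep the positions aligned, assuming a None placeholder is acceptable for such records (the helper name gene_id_or_none and the second sample record are made up for illustration):

```python
def gene_id_or_none(header):
    """Return the gene ID from one header string, or None if no 'gene:' field exists."""
    # next() takes the first matching field; the second argument is the fallback.
    return next(
        (field[5:] for field in header.split() if field.startswith('gene:')),
        None,
    )

headers = [
    '>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding',
    '>ENST00000000000 cds:known',  # hypothetical record with no gene field
]

res = [gene_id_or_none(h) for h in headers]
print(res)  # ['ENSG00000103091', None]
```

The output has exactly one entry per input string, so index i in res always corresponds to index i in the original list.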

This is by no means the best way to do it, but it should do what you want:

l = [
    '>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding',
    '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding'
]
genes = []
for e in l:
    e = e.split('gene:')
    gene = ''
    for c in e[1]:
        if c != ' ':
            gene += c
        else:
            break
    genes.append(gene)

print(genes)

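Loop over the elements of the list, split each one on 'gene:', then collect the characters after it up to the first space into a string and append that string to the result list.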
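
The same idea (cut at 'gene:', then read up to the next space) can probably be written more compactly with str.partition. This is only a sketch: the function name gene_after_marker is made up here, and unlike the loop above, which raises an IndexError when 'gene:' is missing, it returns an empty string in that case.

```python
def gene_after_marker(header, marker='gene:'):
    """Return the text between the first `marker` and the next space."""
    # partition() splits at the first occurrence of marker only.
    _, found, rest = header.partition(marker)
    if not found:
        return ''  # marker not present in this string
    # Everything up to the first space is the gene ID.
    return rest.split(' ', 1)[0]

l = [
    '>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding',
    '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding',
]

genes = [gene_after_marker(e) for e in l]
print(genes)  # ['ENSG00000103091', 'ENSG00000196313']
```

Note that this matches 'gene:' anywhere in the string, not only at the start of a space-separated field, which is fine for the sample data but may not hold for every header.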

For each string in the list:
    Split the string on spaces (Python's str.split method)
    Find the element starting with "gene:"
    Keep the rest of the string (grab the slice [5:] of that element)

Do you know enough basic Python to work it out from there? If not, I suggest consulting the string method documentation.
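
For reference, the steps in the pseudocode above could be sketched roughly as follows. The names PREFIX and extract_gene_ids are illustrative, not from the original answer; the slice uses len(PREFIX) rather than a hard-coded 5 so that it stays correct if the prefix changes.

```python
PREFIX = 'gene:'  # assumed field prefix, taken from the sample data

def extract_gene_ids(headers):
    """Collect the gene ID from every 'gene:' field in a list of header strings."""
    ids = []
    for header in headers:                 # for each string in the list
        for field in header.split():       # split the string on whitespace
            if field.startswith(PREFIX):   # find the element starting with 'gene:'
                ids.append(field[len(PREFIX):])  # keep the rest of the string
    return ids

sample = [
    '>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding',
    '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding',
]

print(extract_gene_ids(sample))  # ['ENSG00000103091', 'ENSG00000196313']
```

On Python 3.9 or later, field.removeprefix(PREFIX) does the same job as the slice.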
