Finding the common elements between two files

2 votes
2 answers
1002 views
Asked 2025-04-18 11:09

file1.txt:
AT5G54940.1 3182
            pfam
            PF01253 SUI1#Translation initiation factor SUI1
            mf
            GO:0003743  translation initiation factor activity
            GO:0008135  translation factor activity, nucleic acid binding
            bp
            GO:0006413  translational initiation
            GO:0006412  translation
            GO:0044260  cellular macromolecule metabolic process
GRMZM2G158629_P02   4996
                pfam
                PF01575 MaoC_dehydratas#MaoC like domain
                mf
                GO:0016491  oxidoreductase activity
                GO:0033989  3alpha,7alpha,
OS08T0174000-01 560919

file2.txt:
GRMZM2G158629_P02
AT5G54940.1
OS05T0566300-01
OS08T0174000-01

My attempt:
import re

with open('file2.txt') as mylist:
    proteins = set(line.strip() for line in mylist)

with open('file1.txt') as mydict:
    with open('a.txt', 'w') as output:
        for line in mydict:
            new_list = line.strip().split()
            protein = new_list[0]
            if protein in proteins:
                if re.search(r'GO:\d+', line):
                    output.write(protein+'\t'+line)

Desired output:
AT5G54940.1 GO:0003743  translation initiation factor activity
            GO:0008135  translation factor activity, nucleic acid binding
            GO:0006413  translational initiation
            GO:0006412  translation
            GO:0044260  cellular macromolecule metabolic process
GRMZM2G158629_P02   GO:0016491  oxidoreductase activity
                    GO:0033989  3alpha,7alpha,
OS08T0174000-01

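I have two different files, shown above.

file1.txt is tab-separated, and file2.txt contains a list of protein names.

I need a program that finds the protein names that are present in file1 and, where available, prints all of the "GO:" information associated with each of them. The hard part for me is parsing the first file, because its format is somewhat odd. I have made an attempt, but other approaches are very welcome.

The format of the output does not matter, as long as it contains all of the related GO information.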

2 Answers

1

One option is to build a dictionary of lists, using the protein names as keys:

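#!/usr/bin/env python

import pprint

with open('file2.txt') as f:
    proteins = set(line.strip() for line in f)

d = {}
key = None  # wanted protein whose block we are currently inside, if any

with open('file1.txt') as f:
    for line in f:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if not line[0].isspace():
            # Unindented line: start of a new protein block.
            # Reset key so GO lines of unwanted proteins are not
            # appended to the previous protein's list.
            key = parts[0] if parts[0] in proteins else None
            if key is not None:
                d[key] = []
        elif key is not None and parts[0].startswith('GO:'):
            d[key].append(line.strip())

pprint.pprint(d)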

I used the pprint module to print the dictionary, since you said you are not particular about the format. The output is currently:

{'AT5G54940.1': ['GO:0003743  translation initiation factor activity',
                 'GO:0008135  translation factor activity, nucleic acid binding',
                 'GO:0006413  translational initiation',
                 'GO:0006412  translation',
                 'GO:0044260  cellular macromolecule metabolic process'],
 'GRMZM2G158629_P02': ['GO:0016491  oxidoreductase activity',
                       'GO:0033989  3alpha,7alpha,']}
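
Because the dictionary keys are the names that were matched in both files, plain set operations can report which requested proteins were found and which were not. The snippet below is only a sketch: it uses small in-memory stand-ins for `proteins` and `d` rather than reading the files, and the values are trimmed copies of the sample data.

```python
# Hypothetical stand-ins for the `proteins` set and the dictionary `d`
proteins = {'GRMZM2G158629_P02', 'AT5G54940.1',
            'OS05T0566300-01', 'OS08T0174000-01'}
d = {'AT5G54940.1': ['GO:0003743  translation initiation factor activity'],
     'GRMZM2G158629_P02': ['GO:0016491  oxidoreductase activity']}

found = proteins & set(d)      # names that have an entry in d
missing = proteins - set(d)    # requested names with no entry in d

print(sorted(found))
print(sorted(missing))
```

Listing the missing names can help spot typos or naming differences between the two files.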

Edit

You can also use a loop, instead of pprint, to get output closer to what the question specifies:

with open('out.txt', 'w') as out:
    for k, v in d.items():  # items() works on both Python 2 and 3
        out.write('Protein: {}\n'.format(k))
        out.write('{}\n'.format('\n'.join(v)))

out.txt

Protein: GRMZM2G158629_P02
GO:0016491  oxidoreductase activity
GO:0033989  3alpha,7alpha,
Protein: AT5G54940.1
GO:0003743  translation initiation factor activity
GO:0008135  translation factor activity, nucleic acid binding
GO:0006413  translational initiation
GO:0006412  translation
GO:0044260  cellular macromolecule metabolic process
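
If a layout closer to the one shown in the question is preferred, here is a minimal sketch that puts the protein name in the first column of its first GO line and leaves that column empty on the following lines. The function name `format_go_table` and the sample dictionary are hypothetical, used only for illustration, and the sketch assumes that a protein without any GO lines should still be listed by name.

```python
def format_go_table(d):
    """Return tab-separated text: protein name on the first GO line only."""
    rows = []
    for protein in sorted(d):
        go_lines = d[protein]
        if not go_lines:
            rows.append(protein)  # assumption: keep proteins without GO terms
            continue
        for i, go in enumerate(go_lines):
            first_col = protein if i == 0 else ''
            rows.append(first_col + '\t' + go)
    return '\n'.join(rows) + '\n'


# Hypothetical sample shaped like the dictionary of lists
sample = {
    'GRMZM2G158629_P02': ['GO:0016491  oxidoreductase activity'],
    'AT5G54940.1': ['GO:0003743  translation initiation factor activity',
                    'GO:0006412  translation'],
    'OS08T0174000-01': [],
}
text = format_go_table(sample)
print(text)
```

Sorting the keys makes the output order reproducible, which a plain dictionary loop does not guarantee on older Python versions.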
2

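Here is how I would approach this. The "groups" of lines in your input file that belong to one protein are delimited by a change from an indented line to an unindented line. Find that transition and you have your groups (or "chunks"). The first line of each group contains the protein name; the remaining lines may include lines starting with GO:.

You can detect indentation with if line.startswith(" ") (or look for "\t" if your file is indented with tabs instead).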

def get_protein_chunks(filepath):
    """Yield one list of stripped lines per protein block."""
    chunk = []
    last_indented = False
    with open(filepath) as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            # Accept both space and tab indentation
            current_indented = line.startswith((" ", "\t"))
            # An indented -> unindented transition starts a new block
            if last_indented and not current_indented:
                yield chunk
                chunk = []
            chunk.append(line.strip())
            last_indented = current_indented
    if chunk:
        yield chunk  # do not drop the last block in the file


with open('file2.txt') as f:
    look_for_proteins = set(line.strip() for line in f)


for p in get_protein_chunks("input.txt"):
    proteinname = p[0].split()[0]
    proteindata = p[1:]
    if proteinname not in look_for_proteins:
        continue
    print("Protein: %s" % proteinname)
    golines = [l for l in proteindata if l.startswith("GO:")]
    for g in golines:
        print(g)

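Here, a chunk is a list of lines with the surrounding whitespace stripped. I use a generator to extract the protein chunks from the input file. As you can see, the logic relies solely on the transition from an indented line to an unindented one.

With the generator, you can do whatever you like with the data. I simply printed it, but you could also put the data into a dictionary for further analysis.

Output: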
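
As one possible way to collect the blocks into a dictionary, the sketch below starts a new block at every unindented line, so a protein that has no indented lines at all (such as `OS08T0174000-01` in the sample) still gets its own entry. `go_terms_by_protein` is a hypothetical helper, not part of the code above; it accepts any iterable of lines so that the example runs on an in-memory sample, and it assumes header lines are never indented. `UNWANTED.1` is a made-up name standing for a protein that is not in file2.txt.

```python
def go_terms_by_protein(lines, wanted):
    """Map each wanted protein to the GO lines found in its block."""
    result = {}
    current = None  # protein whose block we are inside, or None if unwanted
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if not line[0].isspace():
            # Unindented line: header of a new protein block
            name = line.split()[0]
            current = name if name in wanted else None
            if current is not None:
                result[current] = []
        elif current is not None and line.strip().startswith('GO:'):
            result[current].append(line.strip())
    return result


sample = [
    'AT5G54940.1 3182\n',
    '            mf\n',
    '            GO:0003743  translation initiation factor activity\n',
    'UNWANTED.1 10\n',  # made-up protein that is not requested
    '            GO:0006412  translation\n',
    'OS08T0174000-01 560919\n',
]
found = go_terms_by_protein(sample, {'AT5G54940.1', 'OS08T0174000-01'})
print(found)
```

To run it on a real file, an open file object can be passed as `lines`, since iterating over a file yields its lines.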

$ python test.py 
Protein: AT5G54940.1
GO:0003743  translation initiation factor activity
GO:0008135  translation factor activity, nucleic acid binding
GO:0006413  translational initiation
GO:0006412  translation
GO:0044260  cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491  oxidoreductase activity
GO:0033989  3alpha,7alpha,
