Finding the common elements between two files

Asked 2025-04-18 11:09

AT5G54940.1 3182
            pfam
            PF01253 SUI1#Translation initiation factor SUI1
            mf
            GO:0003743  translation initiation factor activity
            GO:0008135  translation factor activity, nucleic acid binding
            bp
            GO:0006413  translational initiation
            GO:0006412  translation
            GO:0044260  cellular macromolecule metabolic process
GRMZM2G158629_P02   4996
                pfam
                PF01575 MaoC_dehydratas#MaoC like domain
                mf
                GO:0016491  oxidoreductase activity
                GO:0033989  3alpha,7alpha,
OS08T0174000-01 560919

GRMZM2G158629_P02
AT5G54940.1
OS05T0566300-01
OS08T0174000-01

import re
with open('file2.txt') as mylist:
    proteins = set(line.strip() for line in mylist)

with open('file1.txt') as mydict:
    with open('a.txt', 'w') as output:
        for line in mydict:
            new_list = line.strip().split()
            protein = new_list[0]
            if protein in proteins:
                if re.search(r'GO:\d+', line):
                    output.write(protein+'\t'+line)

AT5G54940.1 GO:0003743  translation initiation factor activity
            GO:0008135  translation factor activity, nucleic acid binding
            GO:0006413  translational initiation
            GO:0006412  translation
            GO:0044260  cellular macromolecule metabolic process
GRMZM2G158629_P02   GO:0016491  oxidoreductase activity
                    GO:0033989  3alpha,7alpha,
OS08T0174000-01

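I have two different files, shown above.

file1.txt (the first block) is tab-delimited, and file2.txt (the second block) contains different protein names.

I need a program that finds which of the protein names in file2.txt occur in file1.txt and, for each one found, prints all the "GO:" lines associated with it. The hardest part for me is parsing the first file, because its format is a bit odd. I tried the script in the third block above, but any other approach is very welcome.

The desired output can be in any format (the fourth block above is one possibility), as long as it contains all the relevant GO information.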

Tags: data processing, file parsing, tab-delimited, common elements, protein names, GO information

2 Answers

One option is to build a dictionary of lists, using the protein name as the key:

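#!/usr/bin/env python

import pprint
pp = pprint.PrettyPrinter()

# Protein names to look for, one per line in file2.txt
with open('file2.txt') as names:
    proteins = set(line.strip() for line in names)
d = {}
key = None  # wanted protein whose block we are inside, else None

with open('file1.txt') as file:
    for line in file:
        parts = line.split()
        if not parts:
            continue  # skip blank lines

        if not line[0].isspace():
            # Unindented line: a new protein block starts here
            key = parts[0] if parts[0] in proteins else None
            if key is not None:
                d[key] = []
        elif key is not None and parts[0].startswith('GO:'):
            # GO line inside the block of a wanted protein
            d[key].append(line.strip())

pp.pprint(d)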

I used the pprint module to print the dictionary, since you said you are not picky about the format. The output is:

{'AT5G54940.1': ['GO:0003743  translation initiation factor activity',
                 'GO:0008135  translation factor activity, nucleic acid binding',
                 'GO:0006413  translational initiation',
                 'GO:0006412  translation',
                 'GO:0044260  cellular macromolecule metabolic process'],
 'GRMZM2G158629_P02': ['GO:0016491  oxidoreductase activity',
                       'GO:0033989  3alpha,7alpha,']}
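Because the dictionary keys are the protein names that were matched, plain set operations show which names from file2.txt got an entry and which did not. The following is only a sketch: the names are copied from the question, and `d` is a shortened stand-in for the dictionary printed above rather than the result of reading the real files.

```python
# Names listed in file2.txt (copied from the question).
proteins = {
    "GRMZM2G158629_P02",
    "AT5G54940.1",
    "OS05T0566300-01",
    "OS08T0174000-01",
}

# Stand-in for the dictionary printed above; GO lists shortened for brevity.
d = {
    "AT5G54940.1": ["GO:0003743  translation initiation factor activity"],
    "GRMZM2G158629_P02": ["GO:0016491  oxidoreductase activity"],
}

found = proteins & set(d)     # names from file2.txt that have an entry in d
missing = proteins - set(d)   # names from file2.txt with no entry in d

print("found:  ", sorted(found))
print("missing:", sorted(missing))
```

A name landing in `missing` only means it has no entry in this particular dictionary, so it is worth checking against the real file1.txt before drawing conclusions.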

Edit

Instead of pprint, you can also use a loop to produce output closer to what the question asks for:

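with open('out.txt', 'w') as out:
    # On Python 3.7+ the proteins come out in the order they appear in file1.txt
    for k, v in d.items():  # iteritems() exists only in Python 2
        out.write('Protein: {}\n'.format(k))
        out.write('{}\n'.format('\n'.join(v)))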

out.txt:

Protein: GRMZM2G158629_P02
GO:0016491  oxidoreductase activity
GO:0033989  3alpha,7alpha,
Protein: AT5G54940.1
GO:0003743  translation initiation factor activity
GO:0008135  translation factor activity, nucleic acid binding
GO:0006413  translational initiation
GO:0006412  translation
GO:0044260  cellular macromolecule metabolic process
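If the layout from the question is preferred (protein name on the first GO line only, later GO lines lined up beneath it), a small formatter over the same kind of dictionary could look like the sketch below. The function name `format_like_question` and the use of a single tab as the separator are assumptions for illustration; the question does not pin down the exact whitespace.

```python
def format_like_question(d):
    """Return text with the protein name on the first GO line only.

    Layout assumption: name, a tab, then the GO line; further GO lines
    start with a tab so they line up under the first one.
    """
    rows = []
    for name, go_lines in d.items():
        if not go_lines:
            rows.append(name)  # protein found, but no GO lines recorded
            continue
        rows.append(name + "\t" + go_lines[0])
        rows.extend("\t" + line for line in go_lines[1:])
    return "\n".join(rows) + "\n"


example = {
    "GRMZM2G158629_P02": ["GO:0016491  oxidoreductase activity",
                          "GO:0033989  3alpha,7alpha,"],
    "OS08T0174000-01": [],
}
print(format_like_question(example), end="")
```

Returning a string instead of writing to a file keeps the function easy to check; the result can be passed to `out.write(...)` in place of the loop above.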

Answered 2025-04-18 by Python大师


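Here is how I would approach it. In your input file, the "groups" of lines that belong to one protein are delimited by the change from indented lines to an unindented line. Look for that transition and you have your groups (or "chunks"). The first line of each chunk holds the protein name; the remaining lines may include lines that start with GO:.

You can detect indentation with if line.startswith(" ") (or look for a tab, "\t", if that is how your file is indented).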
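As a quick illustration of that check, `str.startswith` also accepts a tuple of prefixes, so spaces and tabs can be handled in one call. The helper name `is_indented` is made up for this sketch, and the third sample line is a hypothetical tab-indented variant of a line from the question.

```python
def is_indented(line):
    """True if the line starts with a space or a tab."""
    # str.startswith accepts a tuple of prefixes to try.
    return line.startswith((" ", "\t"))


print(is_indented("AT5G54940.1 3182\n"))           # False: protein name line
print(is_indented("            pfam\n"))            # True: space-indented
print(is_indented("\tGO:0006412  translation\n"))   # True: tab-indented
```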

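def get_protein_chunks(filepath):
    """Yield one list of stripped lines per protein block."""
    chunk = []
    last_indented = False
    with open(filepath) as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            # Accept both space and tab indentation
            current_indented = line.startswith((" ", "\t"))
            if last_indented and not current_indented:
                # Indented -> unindented: the previous block is complete
                yield chunk
                chunk = []
            chunk.append(line.strip())
            last_indented = current_indented
    if chunk:
        yield chunk  # the last block has no following line to close it


with open('file2.txt') as names:
    look_for_proteins = set(line.strip() for line in names)


# "input.txt" here plays the role of file1.txt from the question
for p in get_protein_chunks("input.txt"):
    proteinname = p[0].split()[0]
    proteindata = p[1:]
    if proteinname not in look_for_proteins:
        continue
    print("Protein: %s" % proteinname)
    golines = [l for l in proteindata if l.startswith("GO:")]
    for g in golines:
        print(g)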

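Here, a chunk is a list of lines with the surrounding whitespace stripped. I use a generator to extract the protein chunks from the input file. As you can see, the logic relies only on the transition from indented lines to an unindented line.

With the generator in place, you can process the data however you like. I simply print it, but you could also put the data into a dictionary for further analysis.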
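One way the dictionary idea might look is sketched below: each wanted protein is mapped to the set of GO identifiers found in its chunk. The function name `chunks_to_go_ids` is hypothetical, and the chunks are written out by hand so the example runs without any files; in practice they would come from a chunking generator like the one above.

```python
import re

GO_ID = re.compile(r"GO:\d+")  # same pattern the question's script searches for


def chunks_to_go_ids(chunks, wanted):
    """Map each wanted protein to the set of GO IDs found in its chunk.

    Each chunk is a list of stripped lines whose first line starts with
    the protein name.
    """
    result = {}
    for chunk in chunks:
        name = chunk[0].split()[0]
        if name in wanted:
            result[name] = {m.group() for line in chunk[1:]
                            for m in GO_ID.finditer(line)}
    return result


# Hand-written chunks based on the sample lines in the question.
chunks = [
    ["AT5G54940.1 3182", "bp", "GO:0006413  translational initiation",
     "GO:0006412  translation"],
    ["GRMZM2G158629_P02   4996", "mf", "GO:0016491  oxidoreductase activity"],
]
ids = chunks_to_go_ids(chunks, {"AT5G54940.1"})
print(ids)
```

Storing bare identifiers in sets makes follow-up questions, such as which GO terms two proteins share, a matter of set intersection.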

Output:

$ python test.py 
Protein: AT5G54940.1
GO:0003743  translation initiation factor activity
GO:0008135  translation factor activity, nucleic acid binding
GO:0006413  translational initiation
GO:0006412  translation
GO:0044260  cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491  oxidoreductase activity
GO:0033989  3alpha,7alpha,

Answered 2025-04-18 by Python大师
