Finding the common elements between two files
file1.txt:
AT5G54940.1 3182
    pfam
    PF01253 SUI1#Translation initiation factor SUI1
    mf
    GO:0003743 translation initiation factor activity
    GO:0008135 translation factor activity, nucleic acid binding
    bp
    GO:0006413 translational initiation
    GO:0006412 translation
    GO:0044260 cellular macromolecule metabolic process
GRMZM2G158629_P02 4996
    pfam
    PF01575 MaoC_dehydratas#MaoC like domain
    mf
    GO:0016491 oxidoreductase activity
    GO:0033989 3alpha,7alpha,
OS08T0174000-01 560919
file2.txt:
GRMZM2G158629_P02
AT5G54940.1
OS05T0566300-01
OS08T0174000-01
import re

with open('file2.txt') as mylist:
    proteins = set(line.strip() for line in mylist)

with open('file1.txt') as mydict:
    with open('a.txt', 'w') as output:
        for line in mydict:
            new_list = line.strip().split()
            protein = new_list[0]
            if protein in proteins:
                if re.search(r'GO:\d+', line):
                    output.write(protein+'\t'+line)
Desired output:
AT5G54940.1 GO:0003743 translation initiation factor activity
    GO:0008135 translation factor activity, nucleic acid binding
    GO:0006413 translational initiation
    GO:0006412 translation
    GO:0044260 cellular macromolecule metabolic process
GRMZM2G158629_P02 GO:0016491 oxidoreductase activity
    GO:0033989 3alpha,7alpha,
OS08T0174000-01
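I have two different files, whose contents are shown above.
file1.txt is tab-delimited, while file2.txt contains a list of protein names.
I need a program that finds which of those protein names occur in file1 and, for each one that does, prints all of the "GO:" entries associated with it. The hardest part for me is parsing the first file, because its format is a bit odd. I have tried a few things, but I am very open to other approaches.
I don't mind what the output format looks like, as long as it contains all the relevant GO information.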
2 Answers
1
One option is to build a dictionary of lists, using the protein name as the key:
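#!/usr/bin/env python
import pprint

pp = pprint.PrettyPrinter()

# Protein names to look for, one per line.
with open('file2.txt') as names:
    proteins = set(line.strip() for line in names)

d = {}
key = None  # wanted protein whose block is being read, else None
with open('file1.txt') as infile:
    for line in infile:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if not line[0].isspace():
            # A non-indented line starts a new protein block.
            key = parts[0] if parts[0] in proteins else None
            if key is not None:
                d[key] = []
        elif key is not None and parts[0].startswith('GO:'):
            # Only keep GO lines that belong to a wanted protein.
            d[key].append(line.strip())

pp.pprint(d)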
I used the pprint module to print the dictionary, since you said you are not picky about the format. The output is:
{'AT5G54940.1': ['GO:0003743 translation initiation factor activity',
                 'GO:0008135 translation factor activity, nucleic acid binding',
                 'GO:0006413 translational initiation',
                 'GO:0006412 translation',
                 'GO:0044260 cellular macromolecule metabolic process'],
 'GRMZM2G158629_P02': ['GO:0016491 oxidoreductase activity',
                       'GO:0033989 3alpha,7alpha,']}
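If you want to experiment with the dictionary-of-lists idea without creating the two files first, a minimal sketch like the following may help. It runs on an in-memory excerpt of the sample data. The function name `collect_go_terms` is made up for illustration, and the sketch assumes that annotation lines are indented while protein lines are not.

```python
import io

# Excerpt of the sample data; annotation lines are assumed to be indented.
SAMPLE = """\
AT5G54940.1\t3182
    pfam
    PF01253 SUI1#Translation initiation factor SUI1
    mf
    GO:0003743 translation initiation factor activity
    bp
    GO:0006413 translational initiation
GRMZM2G158629_P02\t4996
    mf
    GO:0016491 oxidoreductase activity
OS08T0174000-01\t560919
"""


def collect_go_terms(lines, wanted):
    """Map each wanted protein to the list of GO lines in its block."""
    result = {}
    current = None  # wanted protein currently being read, else None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if not line[0].isspace():
            # Non-indented line: header of a new protein block.
            current = fields[0] if fields[0] in wanted else None
            if current is not None:
                result[current] = []
        elif current is not None and fields[0].startswith("GO:"):
            result[current].append(line.strip())
    return result


go_terms = collect_go_terms(io.StringIO(SAMPLE),
                            {"AT5G54940.1", "OS08T0174000-01"})
print(go_terms)
```

Because `current` is reset at every protein header, the GO line of GRMZM2G158629_P02, which is not in the wanted set here, is not attached to the preceding protein.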
Edit
Instead of pprint, you can also use a loop to produce output like that specified in the question:
with open('out.txt', 'w') as out:
    for k, v in d.items():
        out.write('Protein: {}\n'.format(k))
        out.write('{}\n'.format('\n'.join(v)))
out.txt:
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
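If you would rather have something closer to the layout sketched in the question, with the protein name only on the first line of its block, a formatter along these lines might do. This is only a sketch: `format_blocks` is a hypothetical helper, and it assumes a dictionary shaped like `d` above (protein name mapped to a list of GO lines).

```python
def format_blocks(go_by_protein):
    """Render {protein: [GO lines]} with the name on the first line only."""
    out = []
    for protein, go_lines in go_by_protein.items():
        if not go_lines:
            out.append(protein)  # protein found, but it has no GO terms
            continue
        out.append(protein + "\t" + go_lines[0])
        # Continuation lines start with a tab so the GO column lines up.
        out.extend("\t" + go for go in go_lines[1:])
    return "\n".join(out) + "\n"


example = {
    "GRMZM2G158629_P02": ["GO:0016491 oxidoreductase activity",
                          "GO:0033989 3alpha,7alpha,"],
    "OS08T0174000-01": [],
}
text = format_blocks(example)
print(text, end="")
```

A protein with an empty list is still written out on its own line, which matches the question's wish to list proteins that were found but have no GO entries.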
2
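Here is how I would approach this. In your input file, the "group" of lines belonging to one protein is delimited by the change from indented lines to a non-indented line. You only need to find that change to get your groups (or "chunks"). The first line of each chunk contains the protein name; the remaining lines may include lines starting with GO:.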
You can detect indentation with if line.startswith(" ") (or look for a tab with "\t" if your file is formatted differently).
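As a side note, `str.startswith` also accepts a tuple of prefixes, so spaces and tabs can be covered by a single test. A small sketch, where `is_indented` is just an illustrative name:

```python
def is_indented(line):
    """Return True if the line starts with a space or a tab."""
    return line.startswith((" ", "\t"))


samples = ["AT5G54940.1 3182\n", "  pfam\n", "\tmf\n", "\n"]
checks = [is_indented(s) for s in samples]
print(checks)
```

Note that a completely empty line counts as not indented here, so blank lines may need to be skipped separately.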
def get_protein_chunks(filepath):
    """Yield one list of stripped lines per protein block."""
    chunk = []
    with open(filepath) as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            # A non-indented line marks the start of a new protein block.
            if chunk and not line.startswith((" ", "\t")):
                yield chunk
                chunk = []
            chunk.append(line.strip())
    if chunk:
        yield chunk  # the last block is not followed by another header


with open('file2.txt') as names:
    look_for_proteins = set(line.strip() for line in names)

for p in get_protein_chunks("file1.txt"):
    proteinname = p[0].split()[0]
    proteindata = p[1:]
    if proteinname not in look_for_proteins:
        continue
    print("Protein: %s" % proteinname)
    golines = [l for l in proteindata if l.startswith("GO:")]
    for g in golines:
        print(g)
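Here, a chunk is a list of lines with the surrounding whitespace stripped. I use a generator to pull the protein chunks out of the input file. As you can see, the logic rests entirely on the change from indented lines to a non-indented line.
With the generator in place you can do whatever you like with the data. I simply printed it, but you could also put it into a dictionary for further analysis.
Output: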
$ python test.py
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
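The suggestion in this answer of putting the chunks into a dictionary could look roughly like this. It is a sketch rather than part of the original answer: `chunks_to_dict` is a hypothetical helper, and it expects chunks in the same form the generator yields, that is, lists of stripped lines with the protein header first.

```python
def chunks_to_dict(chunks, wanted):
    """Build {protein: [GO lines]} from an iterable of protein chunks."""
    go_by_protein = {}
    for chunk in chunks:
        name = chunk[0].split()[0]  # first field of the header line
        if name in wanted:
            go_by_protein[name] = [l for l in chunk[1:]
                                   if l.startswith("GO:")]
    return go_by_protein


# Hand-written chunks standing in for the generator's output.
chunks = [
    ["AT5G54940.1 3182", "pfam", "GO:0006412 translation"],
    ["GRMZM2G158629_P02 4996", "mf", "GO:0016491 oxidoreductase activity"],
    ["OS08T0174000-01 560919"],
]
result = chunks_to_dict(chunks, {"AT5G54940.1", "OS08T0174000-01"})
print(result)
```

Since the function only needs an iterable, the generator can be passed in directly in place of the hand-written list.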