How to build a dataset from a sequence file in Python
I have a protein sequence file whose contents look roughly like this:
>102L:A MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL -------------------------------------------------------------------------------------------------------------------------------------------------------------------XX
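The first field is the sequence name, the second is the actual protein sequence, and the third is an indicator showing whether any coordinates are missing. In this example, note the two "X" characters at the end: they mean that the last two residues of the sequence, "NL", have missing coordinates.
I want to use Python to generate a table with the following columns:
- the sequence name
- the total number of missing coordinates (the number of "X" characters)
- the range of those missing coordinates (the positions of the first and last "X")
- the sequence length
- the actual sequence
So the final result should look like this: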
>102L:A 2 163-164 164 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
This is my code so far:
total_seq = []
with open('sample.txt') as lines:
    for l in lines:
        split_list = l.split()
        # Assign the list number
        header = split_list[0] # 1
        seq = split_list[1] # 5
        disorder = split_list[2]
        # count sequence length and total residue of missing coordinates
        sequence_length = len(seq) # 4
        for x in disorder:
            counts = 0
            if x == 'X':
                counts = counts + 1
        total_seq.append([header, seq, str(counts)]) # obviously I haven't finish coding 2 & 3
with open('new_sample.txt', 'a') as f:
    for lol in total_seq:
        f.write('\n'.join(lol))
I'm new to Python. Can anyone help me?
1 Answer
Here is a modified version of your code. It now produces the output you want.
with open("sample.txt") as infile:
    # Each non-empty line holds three whitespace-separated fields:
    # header, sequence, and the disorder string
    matrix = [line.split() for line in infile if line.strip()]

with open('new_sample.txt', 'a') as f:
    for row in matrix:
        header, seq, disorder = row[:3]
        # sequence length
        sequence_length = len(seq)
        # total number of missing coordinates
        num_missing = disorder.count('X')
        # range of the missing coordinates, as 1-based positions
        # (str.find and str.rfind return 0-based indices, or -1 if 'X' is absent)
        if num_missing:
            first_X_pos = disorder.find('X') + 1
            last_X_pos = disorder.rfind('X') + 1
            range_missing = '-'.join([str(first_X_pos), str(last_X_pos)])
        else:
            # assumption: use a bare '-' as a placeholder when nothing is missing
            range_missing = '-'
        reformat_seq = " ".join([header, str(num_missing), range_missing, str(sequence_length), seq])
        f.write(reformat_seq + '\n')
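One caveat: find and rfind only give the first and last "X", so if a disorder string contains several separate stretches of missing coordinates, the reported range also covers the residues in between. If you need each stretch separately, something along these lines might work. It is only a sketch; the function names missing_ranges and format_ranges are made up for illustration, and the comma-separated output format is an assumption rather than something from your question.

```python
import re

def missing_ranges(disorder):
    """Return 1-based (start, end) pairs for every run of 'X' in disorder."""
    # m.start() is 0-based, so add 1; m.end() is exclusive, so it already
    # equals the 1-based position of the last 'X' in the run.
    return [(m.start() + 1, m.end()) for m in re.finditer(r"X+", disorder)]

def format_ranges(ranges):
    """Join the ranges as 'start-end' pairs separated by commas."""
    return ",".join(f"{start}-{end}" for start, end in ranges)

example = "--XX----X-"
print(missing_ranges(example))                 # [(3, 4), (9, 9)]
print(format_ranges(missing_ranges(example)))  # 3-4,9-9
```

For a string with a single run of "X" at the end, like the one in your sample, this gives the same range as the find/rfind approach.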
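A few more small tips:
Don't forget Python's string methods. They take care of a lot of problems for you, and the documentation is very good.
If you had searched for just part 2 or part 3 of your problem on its own, you would have found relevant results elsewhere.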
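To make the first tip concrete, here is a small sketch of the three string methods that do the work for parts 2 and 3: str.count, str.find and str.rfind. The helper name summarize_disorder is hypothetical, and returning None when nothing is missing is just one possible convention.

```python
def summarize_disorder(disorder):
    """Return (count, first, last) for the 'X' characters, using 1-based positions."""
    count = disorder.count("X")
    if count == 0:
        # find/rfind would return -1 here, so handle this case explicitly
        return 0, None, None
    # find/rfind return 0-based indices; add 1 to get residue positions
    return count, disorder.find("X") + 1, disorder.rfind("X") + 1

print(summarize_disorder("-----XX"))  # (2, 6, 7)
print(summarize_disorder("-----"))    # (0, None, None)
```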