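How to filter a file in Python using values from another file
I have a file called sequence.txt, which I have already split into lists. It looks roughly like this:
Original file: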
102L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
103L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
104L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
After splitting into lists:
['>102L', 'Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL']
['>103L', 'Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL']
['>104L', 'Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL']
I also have another file called title.txt, which contains the names/titles of all the sequences I want. It looks roughly like this:
>102L
>104L
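Based on title.txt, I want to filter out every sequence whose title is not in the title list and store the remaining sequences in another file called filter_sequence.txt. The new file should look like this: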
102L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
104L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
Notice that 103L is gone. I'm using Python, but I'm not sure how to approach this. Can anyone help? Thanks!
Here is my final code:
import string
fin = open('title.txt')
all_titles = fin.readlines()
fin.close()
all_titles = map(string.strip, all_titles)
f = open('filtered_sequence.txt', 'w')
sequence_list = open('sequence.txt')
for sequence in sequence_list:
    lists = sequence.strip() # Strip the sequence file into lists of sequence
    if lists[0] in all_titles:
        write_string = lists[0] + lists[1] + "\n\n"
        f.write(write_string)
f.close()
The contents of title.txt are:
>102L
>104L
The contents of sequence.txt are:
102L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
103L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
104L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
I want my filtered_sequence.txt to look like this:
102L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
104L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
But the filtered_sequence.txt file is empty. Can you help me?
1 Answer
You may also want to store the second file in a list.
with open("title.txt", "r") as f:
    all_titles = [line.strip() for line in f]  # Read the titles, stripping off newlines.
With that, all_titles will contain ['>102L','>104L']. Next, you only need an "is this item in the list" test:
f = open("filter_sequence.txt","w") # The file to write to.
for sequence in sequence_list:
    if sequence[0] in all_titles: # sequence[0] is the sequence title.
        write_string = str(sequence[0]) + ":\nSequence:" + str(sequence[1]) + "\n\n"
        f.write(write_string) # Write the string above.
f.close() # Close the file.
That should do it. item in list is a quick boolean test that checks whether any element of the list equals item.
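To make the membership test concrete, here is a small sketch using the two titles from title.txt. The conversion to a set at the end is an optional addition that is not part of the answer; it only matters when the title list is long, since a set lookup does not have to scan every element.

```python
# Titles as they would look after reading title.txt and stripping newlines.
all_titles = ['>102L', '>104L']

print('>102L' in all_titles)   # True: an element equal to '>102L' exists
print('>103L' in all_titles)   # False: no element equals '>103L'
print('102L' in all_titles)    # False: the match is exact, '>' included

# Optional (an assumption, not part of the answer): a set gives the same
# results with faster lookups when there are many titles.
title_set = set(all_titles)
print('>104L' in title_set)    # True
```

Because the comparison is exact, the titles in both files have to be written the same way (with or without the leading '>') for the test to succeed.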
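Note: if you want to write 102L instead of >102L, you can drop the first character of sequence[0] by writing sequence[0][1:]. That means: start at character 1 (the second character) and take everything through the end.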
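Putting the pieces together, here is a minimal end-to-end sketch for Python 3. It assumes each line of sequence.txt holds a title and a sequence separated by whitespace, and that a title may or may not carry a leading '>'. The helper name filter_sequences is hypothetical. It uses split() rather than strip(): strip() returns a string, so indexing the result with [0] gives a single character, which can never equal a full title.

```python
def filter_sequences(sequence_lines, title_lines):
    """Return the sequence lines whose title appears in title_lines.

    Hypothetical helper: titles are compared with any leading '>' removed,
    so '>102L' in title.txt matches '102L' in sequence.txt.
    """
    wanted = {t.strip().lstrip('>') for t in title_lines if t.strip()}
    kept = []
    for line in sequence_lines:
        parts = line.split(None, 1)  # [title, 'Sequence:...']
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        title, sequence = parts
        if title.lstrip('>') in wanted:
            kept.append(title.lstrip('>') + ' ' + sequence.strip())
    return kept


# Small in-memory example; the sequences are truncated for readability.
titles = ['>102L\n', '>104L\n']
sequences = [
    '102L Sequence:MNIFEMLRIDEGL\n',
    '103L Sequence:MNIFEMLRIDEGL\n',
    '104L Sequence:MNIFEMLRIDEGL\n',
]
for line in filter_sequences(sequences, titles):
    print(line)
```

With real files, the two open file objects can be passed in directly, since iterating over a file yields its lines; each returned line can then be written to the output file followed by a newline.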