How to filter a file in Python using values from another file

0 votes
1 answer
989 views
Asked 2025-04-18 12:39

I have a file called sequence.txt, and I have split it into lists. It looks roughly like this:

Original file:

102L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

103L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

104L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

After splitting into lists:

['>102L', 'Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL']

['>103L', 'Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL']

['>104L', 'Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL']

I also have another file called title.txt, which contains the names/titles of all the sequences I want. It looks roughly like this:

>102L
>104L

Based on title.txt, I want to keep only the sequences whose titles are in the title list, dropping every sequence whose title is not listed, and store the result in another file called filter_sequence.txt. The new file should look like this:

102L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

104L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

Notice that 103L is gone. I am using Python, but I am not sure how to approach this. Can anyone help? Thanks!

This is my final code:

import string

fin = open('title.txt')
all_titles = fin.readlines()
fin.close()
all_titles = map(string.strip, all_titles)

f = open('filtered_sequence.txt', 'w')
sequence_list = open('sequence.txt')
for sequence in sequence_list:
    lists = sequence.strip() # Strip the sequence file into lists of sequence
    if lists[0] in all_titles:
        write_string = lists[0] + lists[1] + "\n\n"
        f.write(write_string)

f.close()

The contents of title.txt are:

>102L
>104L

The contents of sequence.txt are:

102L     Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

103L     Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

104L     Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

I want my filtered_sequence.txt to look like this:

102L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
104L Sequence:MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSAAELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

But the filtered_sequence.txt file is empty. Can you help me?

1 answer

0

You probably want to store the second file in a list as well.

with open("title.txt", "r") as f:
    # Read every line and strip off the newline.
    # (string.strip no longer exists in Python 3, so call str.strip on each line.)
    all_titles = [line.strip() for line in f]

With that, all_titles will contain ['>102L', '>104L']. Next, you only need an "is this item in the list" test:

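# sequence_list is the list of [title, sequence] pairs from the question,
# e.g. ['>102L', 'Sequence:MNIF...'].
with open("filter_sequence.txt", "w") as f:   # The file to write to.
    for sequence in sequence_list:
        if sequence[0] in all_titles:         # sequence[0] is the sequence title.
            # sequence[1] already starts with "Sequence:", so it is not added again.
            write_string = sequence[0] + " " + sequence[1] + "\n\n"
            f.write(write_string)             # Write the string above.
# The file is closed automatically when the with block ends.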

That should do it. item in list is a quick boolean test that checks whether any element of the list is equal to item.
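As a small illustration of that membership test, here is a minimal sketch using the two titles from the question. The conversion to a set is an optional tweak I am assuming could help when title.txt is long; it is not part of the approach described above.

```python
all_titles = ['>102L', '>104L']   # sample values taken from the question

print('>102L' in all_titles)      # True: the title is listed
print('>103L' in all_titles)      # False: the title is not listed

# Optional: a set answers the same question without scanning every element.
title_set = set(all_titles)
print('>104L' in title_set)       # True
```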

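Note: if you want to write 102L instead of >102L, you can drop the first character of sequence[0] by writing sequence[0][1:]. This means "start at character 1 (that is, the second character) and take everything through the end".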
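The code in the question writes nothing because sequence.strip() returns a string rather than a list, so lists[0] is only its first character ('1'), which never equals a title such as '>102L'. The lines in sequence.txt also lack the leading '>'. Below is a rough end-to-end sketch that works directly on the files. It assumes each record in sequence.txt is a single line of the form "title, whitespace, sequence" as shown in the question; the function name filter_sequences and its parameters are hypothetical names chosen for illustration.

```python
def filter_sequences(title_path, sequence_path, output_path):
    """Copy to output_path the records whose title is listed in title_path.

    Hypothetical helper; assumes one "TITLE   Sequence:..." record per line.
    Returns the number of records written.
    """
    with open(title_path) as f:
        # Drop the newline and any leading '>' so '>102L' becomes '102L'.
        wanted = {line.strip().lstrip('>') for line in f if line.strip()}

    kept = 0
    with open(sequence_path) as src, open(output_path, 'w') as dst:
        for line in src:
            # Split on the first run of whitespace: ['102L', 'Sequence:...\n'].
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            title, sequence = parts
            title = title.lstrip('>')
            if title in wanted:
                dst.write(title + ' ' + sequence.strip() + '\n')
                kept += 1
    return kept
```

With the files from the question, calling filter_sequences('title.txt', 'sequence.txt', 'filtered_sequence.txt') should write the 102L and 104L records, one per line, and skip 103L.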
