Accessing the dict returned by SeqIO.index from multiprocessing shared memory

import sys
import os
from Bio import SeqIO
from subprocess import *
from multiprocessing import Pool, Manager

manager = Manager()
m_records = manager.dict()
#m_records2 = manager.dict()
m_kmers=manager.dict()

def do_operation(seq):
    ##do some operations with m_kmers
    return

def run_check(read_id):
    seq=str(m_records[read_id].seq)
    #seq=m_records2[read_id]
    do_operation(seq)

def check_reads(n_threads):
    read_id_list=list(m_records.keys())
    #print read_id_list
    pool = Pool(n_threads)
    m_rslt=pool.map(run_check, read_id_list)
    pool.close()
    pool.join()

if __name__ == "__main__":
    sf_reads=sys.argv[1]
    n_threads=int(sys.argv[2])
    m_records=SeqIO.index(sf_reads, "fasta")
    # for key in m_records:
    #     m_records2[key]=str(m_records[key].seq)
    check_reads(n_threads)

1 answer

Anonymous user

#1 · Posted 2024-04-25 22:32:52

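I can reproduce your problem with a smaller dataset (all proteins from E. coli), and it does indeed happen at random. The problem seems to be the combination of manager.dict() with SeqIO.index: m_records starts out as a manager.dict(), but it is then rebound to the object SeqIO.index returns, which is a different type: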

>>> print(type(m_records))
<class 'Bio.File._IndexedSeqFileDict'>

From the documentation:

Indexes a sequence file and returns a dictionary like object.

From the source code:

Note that this pseudo dictionary will not support all the methods of a true Python dictionary, for example values() is not defined since this would require loading all of the records into memory at once.

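If you use SeqIO.to_dict instead, the error goes away, but you may run out of memory. I don't know what your exact task is, but splitting the FASTA file into smaller chunks and using a full in-memory dictionary for each chunk might solve your problem.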
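The chunking idea could look roughly like the sketch below. It is only an illustration under assumptions, not a tested fix for your pipeline: batched, check_reads_in_chunks and the chunk_size default are hypothetical names and values I made up, and do_operation is a stand-in that just returns the sequence length. Instead of sharing the index object, the parent reads the FASTA file with SeqIO.parse and hands each worker plain (id, sequence string) tuples, so only one chunk is held in memory at a time.

```python
from itertools import islice
from multiprocessing import Pool


def batched(iterable, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def do_operation(item):
    """Placeholder for the real per-read work; here it only measures length."""
    read_id, seq = item
    return read_id, len(seq)


def check_reads_in_chunks(fasta_path, n_procs, chunk_size=10000):
    """Process a FASTA file chunk by chunk with a worker pool.

    Workers receive plain (id, str) tuples, never the SeqIO.index object.
    chunk_size is an assumed value; tune it to the memory you have.
    """
    # Imported here so the helpers above can be used without Biopython.
    from Bio import SeqIO

    results = []
    with Pool(n_procs) as pool:
        pairs = ((rec.id, str(rec.seq))
                 for rec in SeqIO.parse(fasta_path, "fasta"))
        for chunk in batched(pairs, chunk_size):
            results.extend(pool.map(do_operation, chunk))
    return results
```

You would call check_reads_in_chunks(sys.argv[1], int(sys.argv[2])) from inside an if __name__ == "__main__": block, as in your original script. Passing strings rather than SeqRecord objects also keeps the data sent to each worker small.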
