Handling Python processes efficiently

Posted 2024-05-29 10:38:09

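For my project I have to take a list of genes and clean it of synonyms. For example, gene A might also be known as AA, so if my original list contains both AA and A, I have to delete one of the two.

The gene list is provided by the user, and I read the synonyms from a text file.

Both are stored in dictionaries.

The list is huuuuuge (Trump joke), and I will have to call this function several times in the pipeline. So my question is: can I use multiprocessing to make it faster?

My initial approach was the following: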

import multiprocessing

my_processes = []
for g in genes:
    process = multiprocessing.Process(target=fixReoccurences, args=(g, genes, synonyms, replaced))
    my_processes.append(process)
    process.start()

# Wait for *ALL* the processes to finish.
for p in my_processes:
    p.join()

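But this approach failed quickly, because my script needed 400 processes, each running a loop of about 40,000 iterations. It literally froze my laptop.

So how can I handle the processes efficiently, using the CPU's multiple cores, to solve this problem?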


2 answers

I generated some random data and then did a straightforward replacement:

#!python3

import random
import string

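import time


def mark(msg):
    # Stand-in for the answerer's own mark_time helper, which is not shown:
    # print a message prefixed with the current Unix timestamp.
    print('[{}] {}'.format(round(time.time(), 3), msg))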

synonyms = { 'AA': 'A', 'BB': 'B'}

population = list(string.ascii_uppercase)
population[-1] = 'AA'   # replace 'Z' with AA
population[-2] = 'BB'   # replace Y with BB

mark('Building inputs')
inputs = [random.choice(population) for _ in range(40000 * 400)]
mark('... done')

print(' '.join(inputs[:100]))

mark('Building outputs')
outputs = [synonyms.get(ch, ch) for ch in inputs]
mark('... done')

print(' '.join(outputs[:100]))

My output looks like this:

[1490996255.208] Building inputs
[1490996273.388] ... done
N A U W R W H D E BB V A S B B U W U V S W V E K N Q E R H R A H I V U X V E U G A R D M R S K F O R B B G R C U M C C Q T K G S S H W AA U BB K L W T L H V BB K H J D AA K P G W BB W C U G T P G M J L S J
[1490996273.388] Building outputs
[1490996276.12] ... done
N A U W R W H D E B V A S B B U W U V S W V E K N Q E R H R A H I V U X V E U G A R D M R S K F O R B B G R C U M C C Q T K G S S H W A U B K L W T L H V B K H J D A K P G W B W C U G T P G M J L S J

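Building the input data took 18 seconds; replacing the synonyms took only 3 seconds. That is for 400 * 40,000 items. I'm not sure whether your input items are single genes or some kind of SAM sequences or whatever. More information in the question would help.

I don't think you need multiprocessing for this. Just process the data as you read the file.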
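
As a rough illustration of processing the data while reading it, here is a minimal sketch. It assumes one gene name per line and a synonyms dict that maps each alias to its preferred name, as in the example above. The function name `read_unique_genes` and the file layout are assumptions, not something given in the question.

```python
import io


def read_unique_genes(lines, synonyms):
    """Return gene names in first-seen order, with aliases collapsed.

    `lines` is any iterable of strings (an open file works); `synonyms`
    maps an alias to its preferred name. Both shapes are assumptions.
    """
    seen = set()
    unique = []
    for line in lines:
        gene = line.strip()
        if not gene:
            continue                          # skip blank lines
        canonical = synonyms.get(gene, gene)  # alias -> preferred name
        if canonical not in seen:             # set lookup, no list scan
            seen.add(canonical)
            unique.append(canonical)
    return unique


# Hypothetical input standing in for the user's gene file.
gene_file = io.StringIO("A\nAA\nB\n\nBB\nC\nA\n")
result = read_unique_genes(gene_file, {'AA': 'A', 'BB': 'B'})
print(result)
```

Each gene costs one dict lookup and one set lookup, so the whole list is handled in a single pass without comparing every gene against every other gene.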

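Update

Sorry for dropping out last night. But, beer.

Anyway, here is some code that reads in a synonym file with one pair of words per line, like "old new", and builds a dictionary mapping each old word to its new word. It then "flattens" the dictionary so that no repeated lookups are needed: every key stores its final value. I think you can use this to read your synonym file, and so on.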

def get_synonyms(synfile):
    """Read in a list of 'synonym' pairs, two words per line A -> B.
    Store the pairs in a dict. "Flatten" the dict, so that if A->B and
    B->C, the dict stores A->C and B->C directly. Return the dict.
    """

    syns = {}

    # Read entries from the file
    # Accept either a filename or an already-open file object
    with (open(synfile) if isinstance(synfile, str) else synfile) as sf:
        for line in sf:
            if not line.strip():
                continue    # skip blank lines
            k, v = line.strip().split()
            syns[k] = v

    # "flatten" the synonyms. If A -> B and B -> C, then change A -> C
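    # Note: a cycle in the file (A -> B, B -> A) would make the while
    # loop below spin forever; this code assumes there are none.
    for k, v in syns.items():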
        nv = v
        while nv in syns:
            nv = syns[nv]
        syns[k] = nv

    return syns

import io

synonyms = """
A B
B C
C D
E B
F A
AA G
""".strip()

#with open('synonyms.txt') as synfile:
with io.StringIO(synonyms) as synfile:
    thesaurus = get_synonyms(synfile)

assert sorted(thesaurus.keys()) == "A AA B C E F".split()
assert thesaurus['A'] == 'D'
assert thesaurus['B'] == 'D'
assert thesaurus['C'] == 'D'
assert thesaurus['E'] == 'D'
assert thesaurus['F'] == 'D'
assert thesaurus['AA'] == 'G'
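
If the synonym file could contain a cycle, the flattening step needs a guard. The following is only a sketch of one possible variant, not part of the code above: the name `flatten_synonyms` and the choice to stop at the first repeated name are assumptions.

```python
def flatten_synonyms(syns):
    """Return a new dict mapping every key to the end of its chain.

    The walk stops as soon as a chain revisits a name, so a cyclic
    input (A -> B, B -> A) cannot hang the loop.
    """
    flat = {}
    for key in syns:
        visited = {key}
        target = syns[key]
        while target in syns and target not in visited:
            visited.add(target)
            target = syns[target]
        flat[key] = target
    return flat


# Hypothetical pairs: one ordinary chain plus an X <-> Y cycle.
pairs = {'A': 'B', 'B': 'C', 'C': 'D', 'E': 'B', 'X': 'Y', 'Y': 'X'}
flat = flatten_synonyms(pairs)
print(flat)
```

With this stopping rule a key that sits on a cycle ends up mapped to itself, which leaves such genes unchanged instead of picking an arbitrary winner.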

Use multiprocessing.Pool

You can have a function that takes a gene and returns it, or None if it should be filtered out:

def filter_gene_if_synonym(gene, synonyms):
    return None if gene in synonyms else gene

You can use partial to bind the function's arguments:

from functools import partial

filter_gene = partial(filter_gene_if_synonym,
                      synonyms=synonyms)

The function can then be called with a single gene.
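
To make the effect of the binding concrete, here is a small self-contained sketch. The synonyms dict is made-up sample data, with 'AA' as an alias of 'A'.

```python
from functools import partial


def filter_gene_if_synonym(gene, synonyms):
    # Return None for a gene that is a known synonym, else the gene itself.
    return None if gene in synonyms else gene


# Hypothetical data: 'AA' is an alias of 'A'.
synonyms = {'AA': 'A'}
filter_gene = partial(filter_gene_if_synonym, synonyms=synonyms)

kept = filter_gene('A')      # not a key in synonyms, so it is returned
dropped = filter_gene('AA')  # a key in synonyms, so None is returned
print(kept, dropped)
```

A partial object built from a module-level function can be pickled, which is what lets a process pool send it to its workers.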

You can use a process pool to map the function over a sequence of data:

from multiprocessing import Pool

pool = Pool(processes=4)
filtered_genes = [gene for gene in pool.map(filter_gene, genes)
                  if gene is not None]

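You can also have each call process a whole chunk of data, using a function that filters a list of genes:

def filter_genes_of_synonyms(genes, synonyms):
    return [gene for gene in genes
            if gene not in synonyms]

filter_genes = partial(filter_genes_of_synonyms, synonyms=synonyms)

and:

# chunksize only batches the items sent to each worker; map still calls the
# function once per item, so build the chunks explicitly.
chunks = [genes[i:i + 50] for i in range(0, len(genes), 50)]
filtered_chunks = pool.map(filter_genes, chunks)
filtered_genes = [gene for chunk in filtered_chunks
                  for gene in chunk]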
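
Putting the pieces of this answer into one script might look like the sketch below. The helper names (`filter_chunk`, `split_into_chunks`, `filter_genes_parallel`), the chunk size and the sample data are all assumptions. The pool is sized to the number of cores rather than one process per gene, and the `__main__` guard is there because some platforms start workers by re-importing the script.

```python
import os
from functools import partial
from multiprocessing import Pool


def filter_chunk(genes, synonyms):
    # Keep only the genes in this chunk that are not known synonyms.
    return [gene for gene in genes if gene not in synonyms]


def split_into_chunks(items, size):
    # Slice a list into consecutive sublists of at most `size` items.
    return [items[i:i + size] for i in range(0, len(items), size)]


def filter_genes_parallel(genes, synonyms, chunk_size=50):
    worker = partial(filter_chunk, synonyms=synonyms)
    chunks = split_into_chunks(genes, chunk_size)
    # One worker per core instead of one process per gene.
    with Pool(processes=os.cpu_count()) as pool:
        filtered_chunks = pool.map(worker, chunks)  # results keep input order
    return [gene for chunk in filtered_chunks for gene in chunk]


if __name__ == '__main__':
    # Hypothetical data; the guard keeps worker processes from re-running
    # this block on platforms that start workers by importing the module.
    genes = ['A', 'AA', 'B', 'BB', 'C'] * 40
    print(filter_genes_parallel(genes, {'AA': 'A', 'BB': 'B'})[:6])
```

Since each membership test is very cheap, the cost of shipping chunks between processes may well outweigh the gain, so it is worth timing this against a plain single-process list comprehension first.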
