Beyond for-looping: high-performance parsing of a large, well-formatted data file

<MULTI-LINE HEADER>  # number of header lines mirrors number of data columns
<DATA BEGIN FLAG>    # the word 'DATA'
<DATA COLUMNS>       # variable number of columns
<DATA END FLAG>      # the pattern '//'
<EMPTY LINE>

# Not shown in the question: `gz` is presumably the gzip module, as imported below;
# `pd.align` (path prefix of the alignment files) and `bases` (the collection of
# valid base characters) are defined elsewhere in the asker's code.
import gzip as gz


def human_macaque_divergence(chromosome):
    """
    A function for finding the positions of human-macaque divergent sites within segments of species alignment tracts
    :param chromosome: chromosome (integer)
    :return div_dict: a dictionary with tuple(segment_start, segment_end, valid_bases_in_segment) for keys
                      and list(divergent_sites) for values
    """
    ch = str(chromosome)
    div_dict = {}

    with gz.open('{al}Compara.6_primates_EPO.chr{c}_1.emf.gz'.format(al=pd.align, c=ch), 'rb') as f:

        # key to the header fields:
        # header_flag species chromosome segment_start segment_end quality_flag chromosome_info
        # SEQ homo_sapiens 1 14163 24841 1 (chr_length=249250621)

        # flags, containers, counters and indices:
        species = []
        starts = []
        ends = []
        mismatch = []

        valid = 0
        pos = -1
        hom = None
        mac = None

        species_data = False  # a flag signalling that the lines we are viewing are alignment columns

        for line in f:

            if 'SEQ' in line:  # 'SEQ' signifies a segment info field

                assert species_data is False
                line = line.split()

                # make sure that the alignment is to the desired chromosome in humans and quality_flag is '1'
                if line[2] == ch and line[5] == '1':

                    species += [line[1]]  # collect each species in the header
                    starts += [int(line[3])]  # collect starts and ends
                    ends += [int(line[4])]

            if 'DATA' in line and {'homo_sapiens', 'macaca_mulatta'}.issubset(species):

                species_data = True

                # get the indices to scan in data columns:
                hom = species.index('homo_sapiens')
                mac = species.index('macaca_mulatta')

                pos = starts[hom]  # first homo_sapiens positional coordinate

                continue

            if species_data and '//' not in line:

                assert pos > 0

                # record the relevant bases:
                human = line[hom]
                macaque = line[mac]

                if {human, macaque}.issubset(bases):
                    valid += 1

                if human != macaque and {human, macaque}.issubset(bases):
                    mismatch += [pos]

                pos += 1

            elif species_data and '//' in line:  # '//' signifies segment boundary

                # store segment results if a boundary has been reached and data has been collected for the last segment:
                div_dict[(starts[hom], ends[hom], valid)] = mismatch

                # reset flags, containers, counters and indices
                species = []
                starts = []
                ends = []
                mismatch = []

                valid = 0
                pos = -1
                hom = None
                mac = None
                species_data = False

            elif not species_data and '//' in line:

                # reset flags, containers, counters and indices
                species = []
                starts = []
                ends = []

                pos = -1
                hom = None
                mac = None

    return div_dict

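3 Answers

User

#1 · edited 2024-05-17 16:02:27

When you have working code and need to improve its performance, use a profiler and measure the effect of one optimization at a time. (Even if you don't use a profiler, definitely do the latter.) Your current code looks reasonable; that is, I don't see anything "stupid" in it performance-wise.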
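As a minimal sketch of the profiling step, the standard-library `cProfile` and `pstats` modules can wrap a single call and report where the time goes. The `parse_lines` function and the sample lines below are hypothetical stand-ins, not the asker's parser; in practice you would pass in the real function and its arguments.

```python
import cProfile
import io
import pstats


def parse_lines(lines):
    """Hypothetical stand-in for the parsing function being tuned."""
    count = 0
    for line in lines:
        if 'SEQ' in line:
            count += 1
    return count


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats('cumulative').print_stats(5)
    return result, buffer.getvalue()


# Made-up input, only to give the profiler something to measure.
sample = ['SEQ homo_sapiens 1 14163 24841 1', 'DATA', 'AC', '//'] * 1000
result, report = profile_call(parse_lines, sample)
print(report)
```

Re-running the same call after each individual change makes it easier to attribute a speed-up (or slow-down) to that change alone.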

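That said, it may be worthwhile to use precompiled regular expressions for all of the string matching. With re.MULTILINE you can read the whole file in as a single string and pull out parts of lines. For example: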

import re

s = open('file.txt').read()
p = re.compile(r'^SEQ\s+(\w+)\s+(\d+)\s+(\d+)\s+(\d+)', re.MULTILINE)
p.findall(s)

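This returns a list of tuples, one per SEQ line, each holding the four captured fields (species, chromosome, start, end) as strings.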

You will then need to post-process this data to handle the specific conditions in your code, but the overall result may be faster.
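A rough sketch of that post-processing, using the same pattern: filter the matches by chromosome and convert the coordinates to integers. The first sample line is the header quoted in the question; the second line and the `seq_headers` helper are made up for illustration. Note that this pattern does not capture the quality flag, so that check would still need to be added.

```python
import re

# Same pattern as in the answer: species, chromosome, start, end of each SEQ line.
SEQ_PATTERN = re.compile(r'^SEQ\s+(\w+)\s+(\d+)\s+(\d+)\s+(\d+)', re.MULTILINE)

# First line: sample header from the question. Second line: invented for the demo.
text = (
    "SEQ homo_sapiens 1 14163 24841 1 (chr_length=249250621)\n"
    "SEQ macaca_mulatta 1 100 200 1 (chr_length=1000)\n"
    "DATA\n"
)


def seq_headers(text, chromosome):
    """Return (species, start, end) for every SEQ line on the given chromosome."""
    wanted = str(chromosome)
    return [
        (species, int(start), int(end))
        for species, chrom, start, end in SEQ_PATTERN.findall(text)
        if chrom == wanted
    ]


print(seq_headers(text, 1))
```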

User

#2 · edited 2024-05-17 16:02:27

Your code looks fine, but there are a few specific things that could be improved, such as using map.

For tips on good performance, see the Python guide:

https://wiki.python.org/moin/PythonSpeed/PerformanceTips

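I have used the tips above to get code running almost as fast as C code. Basically, try to avoid for loops (use map), try to use the built-in find function, and so on. Make Python do as much of the work for you as possible by using its built-in functions, most of which are written in C.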
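One possible reading of this advice, as a sketch only: use `str.find` to locate the `DATA` and `//` markers in a segment, then let comprehensions do the per-column work. The `BASES` set, both helper functions, and the sample segment are assumptions made for illustration, and whether this is actually faster than the original loop should be measured rather than assumed.

```python
BASES = frozenset('ACGT')  # assumption: valid bases are upper-case A, C, G, T


def segment_columns(segment_text):
    """Return the column lines between the 'DATA' line and the '//' terminator."""
    begin = segment_text.find('DATA\n')
    if begin == -1:
        return []
    end = segment_text.find('//', begin)
    if end == -1:
        return []
    return segment_text[begin + len('DATA\n'):end].splitlines()


def divergence(columns, hom, mac, start):
    """Return (valid_count, mismatch_positions) for the two chosen column indices."""
    pairs = [(pos, col[hom], col[mac]) for pos, col in enumerate(columns, start)]
    valid = [(pos, h, m) for pos, h, m in pairs if h in BASES and m in BASES]
    mismatch = [pos for pos, h, m in valid if h != m]
    return len(valid), mismatch


# Made-up segment: two species, four alignment columns.
segment = (
    'SEQ homo_sapiens 1 100 103 1\n'
    'SEQ macaca_mulatta 1 5 8 1\n'
    'DATA\nAA\nAC\nN-\nGG\n//\n'
)
columns = segment_columns(segment)
print(divergence(columns, hom=0, mac=1, start=100))
```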

Once you have acceptable performance, you can run the work in parallel using:

https://docs.python.org/dev/library/multiprocessing.html#module-multiprocessing
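Since the function in the question takes one chromosome at a time, a natural unit of parallel work is the chromosome. The sketch below assumes that; `run_in_parallel` and its `pool_factory` parameter are hypothetical names introduced here. The demo at the bottom uses a thread pool and `len` as a stand-in worker so that it runs anywhere without starting processes.

```python
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def run_in_parallel(worker, chromosomes, processes=4, pool_factory=Pool):
    """Apply worker to each chromosome in a pool; return {chromosome: result}.

    With the default process pool, worker must be a module-level function
    so that it can be pickled and sent to the worker processes.
    """
    chromosomes = list(chromosomes)
    with pool_factory(processes=processes) as pool:
        results = pool.map(worker, chromosomes)
    return dict(zip(chromosomes, results))


# Quick check with threads; `len` stands in for the real per-chromosome worker.
demo = run_in_parallel(len, ['chr1', 'chr22'], processes=2, pool_factory=ThreadPool)
print(demo)
```

With the real parser, the call would look something like `run_in_parallel(human_macaque_divergence, range(1, 23))`, placed under an `if __name__ == '__main__':` guard so that worker processes do not re-run it on import.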

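Edit:

I also just realized that you are opening a gzip-compressed file. I suspect a lot of the time is spent on decompression. You could try using multithreading to speed this up: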

https://code.google.com/p/threadzip/
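Before changing how the file is decompressed, it may be worth checking that suspicion by timing a pass that only reads the lines and does no parsing. The sketch below does that with the standard `gzip` module on a small throwaway file; `time_decompression` and the demo contents are made up for illustration, and the real alignment file would be passed in instead.

```python
import gzip
import os
import tempfile
import time


def time_decompression(path):
    """Return (seconds, line_count) for reading every line of a gzip file without parsing."""
    started = time.perf_counter()
    line_count = 0
    with gzip.open(path, 'rt') as handle:
        for _ in handle:
            line_count += 1
    return time.perf_counter() - started, line_count


# Self-contained demo on a temporary file with invented contents.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.emf.gz')
    with gzip.open(demo_path, 'wt') as out:
        out.write('DATA\nAC\n//\n' * 1000)
    seconds, lines = time_decompression(demo_path)

print('read {} lines in {:.4f} s'.format(lines, seconds))
```

If this read-only pass accounts for most of the total run time, speeding up the parsing logic will not help much; if it is a small fraction, the parser is the better target.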

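User

#3 · edited 2024-05-17 16:02:27

Yes, you could use a few regular expressions to extract the data in one go; this probably gives the best ratio of effort to performance.

If you need even more performance, you could use mx.TextTools to build a finite state machine. I'm fairly confident this would speed things up considerably, but the effort of writing the rules and the learning curve might put you off.

You could also split the data into chunks and process them in parallel, which may help.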
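For the chunking idea, the `//` terminator from the file format gives a natural boundary. A minimal sketch, assuming that no data column line starts with `//` and that blank lines carry no information: the generator below (a hypothetical helper, `iter_segments`) yields one list of lines per segment, and each list could then be handed to a separate worker.

```python
def iter_segments(lines):
    """Yield each segment as a list of its non-empty lines, without the '//' line."""
    chunk = []
    for line in lines:
        if line.startswith('//'):
            if chunk:
                yield chunk
            chunk = []
        elif line.strip():
            chunk.append(line.rstrip('\n'))
    if chunk:
        # Trailing lines with no closing '//' are still returned.
        yield chunk


# Made-up input with two tiny segments.
text = 'SEQ a\nDATA\nAC\n//\n\nSEQ b\nDATA\nGG\n//\n\n'
segments = list(iter_segments(text.splitlines(True)))
print(segments)
```

Because it consumes any iterable of lines, the same generator can read directly from an open file handle instead of loading the whole file into memory.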
