Answers to "Reading and filtering files in parallel in Python"

1 answer

Anonymous, 1 day ago

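I can think of a few things that would speed this up, starting with rearranging the data from the first file. Instead of turning it into 5 separate <code>list</code>s, make it a <code>dict</code> of <code>list</code>s of <code>tuple</code>s, with the <code>chr</code> value as the key:

<pre><code>import csv
import collections
import bisect

# Use a defaultdict so we don't have to worry about whether a chr already exists
foobars = collections.defaultdict(list)

# newline='' is how the csv module expects text files to be opened on Python 3
with open('file1.csv', newline='') as csvfile:
    rdr = csv.reader(csvfile)
    for (chrs, typ, name, start, end) in rdr:
        foobars[chrs].append((int(start), int(end), typ, name))
</code></pre>

Then sort each list in <code>foobars</code> (obviously you should rename it to something appropriate for your task). This sorts by the <code>start</code> value first, because we put it first in the tuple:

<pre><code>for lst in foobars.values():
    lst.sort()
</code></pre>

Now to process the second file:

<pre><code># inputFile and outputFile are assumed to be the already-open file objects
# from the question
for line in inputFile:
    line = line.rstrip('\n')
    arr = line.split('\t')
    arr1int = int(arr[1])
    # Since we rearranged our data, we only have to check one of our sublists
    search = foobars[arr[0]]
    # We use bisect to quickly find the first item where the start value
    # is higher than arr[1]
    highest = bisect.bisect(search, (arr1int + 1,))
    # Now we have a much smaller number of records to check, and we've
    # already ensured that chr is a match, and arr[1] >= start
    for (start, end, typ, name) in search[:highest]:
        if arr1int <= end:
            outputFile.write('\t'.join((arr[0], typ, str(start), str(end), name, line)) + '\n')
</code></pre>

The <code>bisect.bisect()</code> line deserves a bit of extra explanation. If you have a sorted sequence of values, <code>bisect</code> can be used to find where a new value would be inserted into the sequence. Here we use it to find the first entry whose <code>start</code> is greater than <code>arr[1]</code>, so everything before that position has <code>start &lt;= arr[1]</code>. The odd-looking <code>(arr1int + 1,)</code> value just makes sure we include all entries where <code>start == arr[1]</code>, and wraps it in a tuple so that we are comparing like with like.

This will almost certainly improve the performance of your code. By how much, I'm really not qualified to say.

Without the input data I can't really test this code, so there are almost certainly some small mistakes. Hopefully they will be easy to fix.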
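
To make the lookup easier to try without the real input files, here is a minimal self-contained sketch of the same idea on in-memory data. The function names <code>build_index</code> and <code>find_matches</code> and the sample rows are made up for illustration; they are not from the question, and the real files may need different parsing.

```python
import bisect
import collections


def build_index(rows):
    """Group (chr, typ, name, start, end) rows by chr, sorted by start."""
    index = collections.defaultdict(list)
    for chrs, typ, name, start, end in rows:
        index[chrs].append((int(start), int(end), typ, name))
    for intervals in index.values():
        intervals.sort()  # tuples sort by their first element (start) first
    return index


def find_matches(index, chrs, pos):
    """Return every (start, end, typ, name) on chrs with start <= pos <= end."""
    # .get() avoids creating an empty list for an unknown chr
    search = index.get(chrs, [])
    # The 1-tuple (pos + 1,) sorts after every entry with start <= pos and
    # before every entry with start == pos + 1, so this is the count of
    # entries whose start is <= pos.
    highest = bisect.bisect(search, (pos + 1,))
    return [item for item in search[:highest] if pos <= item[1]]


# Hypothetical sample rows standing in for file1.csv
rows = [
    ("chr1", "exon", "geneA", "100", "200"),
    ("chr1", "exon", "geneB", "150", "300"),
    ("chr1", "intron", "geneC", "400", "500"),
    ("chr2", "exon", "geneD", "100", "200"),
]
index = build_index(rows)
print(find_matches(index, "chr1", 150))
print(find_matches(index, "chr1", 250))
```

Note that the <code>bisect</code> step only removes entries that start after the position. Entries that start earlier still have to be checked against <code>end</code> one by one, so the gain depends on how the intervals are distributed.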
