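<p>I think this is a fairly efficient way to do what you want, so hopefully it will scale well. You didn't say which version of Python you're using, so it is written for version 2.x. The field delimiter used when creating the output file is a variable, so it can easily be changed.</p>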
<p>The number of matches isn't limited to 5000; it finds all of them, but a limit could be added if really necessary.</p>
<pre><code>from collections import defaultdict
TOLERANCE = 0.05
DELIM = '\t'

ref_dict = {}
with open('second_file.txt', 'rt') as inf:
    next(inf)  # skip header row
    for line in inf:
        fields = line.split()
        ref_dict[fields[0]] = float(fields[3])  # rsID to MAF

matches = defaultdict(list)
with open('first_file.txt', 'rt') as inf:
    next(inf)  # skip header row
    for line in inf:
        fields = line.split()
        rsID, MAF = fields[0], float(fields[1])
        for ref_id, ref_value in ref_dict.iteritems():
            if abs(MAF-ref_value) <= TOLERANCE:
                matches[rsID].append(ref_id)

# determine maximum number of matches for output file header row
longest = max(map(len, (v for v in matches.itervalues())))

with open("output.txt", "wt") as outf:
    outf.write('rsId' + DELIM + DELIM.join('match%d' % i
                                           for i in xrange(1, longest+1)) + '\n')
    fmt_str = '{}' + DELIM + '{}\n'
    for k,v in matches.iteritems():
        outf.write(fmt_str.format(k, (DELIM.join(v))))
</code></pre>
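<p>If a cap on the number of matches does turn out to be necessary, one possible way to add it is sketched below. The helper <code>find_matches</code> and its <code>max_matches</code> parameter are hypothetical names that are not part of the script above, and the sketch uses <code>.items()</code> rather than <code>.iteritems()</code> so that it runs on both Python 2 and Python 3.</p>

```python
from itertools import islice

TOLERANCE = 0.05


def find_matches(maf, ref_dict, tolerance=TOLERANCE, max_matches=None):
    """Return reference IDs whose MAF is within `tolerance` of `maf`.

    `max_matches` is a hypothetical cap; None means "no limit".
    """
    # Lazily yield every reference ID that falls inside the tolerance window.
    hits = (ref_id for ref_id, ref_value in ref_dict.items()
            if abs(maf - ref_value) <= tolerance)
    # islice stops consuming the generator once the cap is reached.
    return list(islice(hits, max_matches))
```

<p>Inside the main loop this could replace the inner <code>for</code>, for example <code>matches[rsID] = find_matches(MAF, ref_dict, TOLERANCE, 5000)</code>. Note that which matches survive the cap depends on the dictionary's iteration order, so this keeps an arbitrary subset, not the closest ones.</p>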
<p>Contents of <code>output.txt</code> produced from the sample data shown in the question (<code>»</code> represents a tab character):</p>
<pre class="lang-none prettyprint-override"><code>rsId» match1» match2» match3» match4
rs870123» rs908341» rs090321» rs701234» rs101098
rs9038241» rs100981
rs1234123» rs512341
rs1293048» rs090321» rs701234» rs101098
rs723904» rs100981
rs1980123» rs512341
rs3801423» rs090321» rs701234» rs101098» rs100981
rs8041239» rs512341
rs239401» rs512341
rs314234» rs090321» rs701234» rs101098
</code></pre>
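<p>On scaling: the script compares every row of the first file against every reference entry. If that becomes too slow for large files, one alternative (not part of the script above, just a sketch) is to sort the reference MAF values once and use the standard <code>bisect</code> module to locate each tolerance window. The function names <code>build_index</code> and <code>matches_within</code> are made up for illustration, and values sitting exactly on the tolerance boundary may be treated slightly differently from the <code>abs()</code> test because of floating-point rounding.</p>

```python
from bisect import bisect_left, bisect_right


def build_index(ref_dict):
    """Sort reference entries by MAF so a tolerance window is one slice."""
    pairs = sorted((value, ref_id) for ref_id, value in ref_dict.items())
    values = [value for value, _ in pairs]
    ids = [ref_id for _, ref_id in pairs]
    return values, ids


def matches_within(maf, values, ids, tolerance=0.05):
    """Return IDs whose MAF lies in [maf - tolerance, maf + tolerance]."""
    lo = bisect_left(values, maf - tolerance)   # first value >= lower bound
    hi = bisect_right(values, maf + tolerance)  # one past last value <= upper bound
    return ids[lo:hi]
```

<p>The index would be built once from <code>ref_dict</code>, after which each lookup costs a binary search plus the size of the result, instead of a full pass over the reference data.</p>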