Matching files based on one column

rsID       MAF
rs1980123  0.321
rs870123   0.142
rs314234   0.113
rs723904   0.022
rs1293048  0.098
rs1234123  0.314
rs239401   0.287
rs0928341  0.414
rs9038241  0.021
rs3801423  0.0712
rs8041239  0.312

rsID       iHS    Fst    MAF
rs701234   1.98   0.11   0.098
rs908341   1.32   0.31   0.189
rs101098   0.315  0.08   0.111
rs100981   0.093  0.123  0.023
rs7345123  0.481  0.20   0.479
rs090321   1.187  0.234  0.109
rs512341   1.89   0.092  0.324

2 Answers

User

#1 · edited 2024-05-13 09:45:40

Hopefully this gives you a good start. I just called the files file1 and file2. Very creative, I know.

import random

f1_dict = {}       # rsID -> MAF from file1
f2_dict = {}       # rsID -> MAF from file2
match_dict = {}    # file1 rsID -> list of matching file2 rsIDs
match_threshold = .05
matches_to_return = 2
skip_unmatched = True

with open("file1") as f1:
    next(f1, None)  # skip header row
    for line in f1:
        fields = line.split()
        if fields:  # ignore blank lines
            f1_dict[fields[0]] = float(fields[1])

with open("file2") as f2:
    next(f2, None)  # skip header row
    for line in f2:
        fields = line.split()
        if fields:  # ignore blank lines
            f2_dict[fields[0]] = float(fields[3])  # MAF is the fourth column

# compare every file1 entry against every file2 entry
for rsid, compare_val in f1_dict.items():
    match_dict[rsid] = [other for other, maf in f2_dict.items()
                        if abs(maf - compare_val) <= match_threshold]

# text mode ("w") so that str can be written on Python 3 as well
with open("output.txt", "w") as fo:
    for rsid, matched in match_dict.items():
        if skip_unmatched and not matched:
            continue
        random.shuffle(matched)
        fo.write(" ".join([rsid] + matched[:matches_to_return]) + "\n")

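I'm sure this could be made more efficient: it loops through the second dict once for every key in the first dict. I also set the number of matches to return to 2 so that I could test with the small dataset from the question; you can make it 5000 or whatever you like. The elements in each list are not randomized, but I haven't imposed any order beyond the natural construction of the list either. (Edit: no longer true; each match list is now shuffled before it is written.) I also made the match threshold a variable in case you want to experiment with other values.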
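
One way to cut down the repeated looping mentioned above is to sort the second file's MAF values once and then use a binary search for each lookup. What follows is only a sketch under assumptions: `find_matches` and its parameter names are hypothetical, and both inputs are assumed to be dicts mapping rsID to an MAF that has already been converted to float.

```python
from bisect import bisect_left, bisect_right


def find_matches(queries, reference, tolerance=0.05):
    """Return {query rsID: [reference rsIDs whose MAF is within tolerance]}.

    `queries` and `reference` are assumed to be dicts of rsID -> float MAF.
    """
    # Sort the reference once by MAF so each lookup is a binary search.
    ordered = sorted(reference.items(), key=lambda item: item[1])
    values = [maf for _, maf in ordered]
    ids = [rsid for rsid, _ in ordered]

    matches = {}
    for rsid, maf in queries.items():
        lo = bisect_left(values, maf - tolerance)   # first value >= maf - tolerance
        hi = bisect_right(values, maf + tolerance)  # one past last value <= maf + tolerance
        matches[rsid] = ids[lo:hi]
    return matches
```

Each lookup then costs a binary search plus the size of the matching slice instead of a full pass over the second dict. Matches come back ordered by MAF, and a value lying exactly on the tolerance boundary may be classified differently from the `abs(a - b) <= threshold` test because of floating-point rounding.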

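User

#2 · edited 2024-05-13 09:45:40

I think this is a fairly efficient way to do what you want, so hopefully it will scale well. You didn't say which version of Python you are using, so it is written for version 2.x. The field delimiter used for the output file is a variable, so it can easily be changed.

The number of matches is not limited to 5000 (it finds all of them), but a limit could be added if it is really necessary.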

from collections import defaultdict

TOLERANCE = 0.05
DELIM = '\t'

ref_dict = {}
with open('second_file.txt', 'rt') as inf:
    next(inf)  # skip header row
    for line in inf:
        fields = line.split()
        ref_dict[fields[0]] = float(fields[3])  # rsID to MAF

matches = defaultdict(list)
with open('first_file.txt', 'rt') as inf:
    next(inf)  # skip header row
    for line in inf:
        fields = line.split()
        rsID, MAF = fields[0], float(fields[1])
        # items()/values()/range run on Python 2 and also on Python 3,
        # unlike iteritems()/itervalues()/xrange
        for ref_id, ref_value in ref_dict.items():
            if abs(MAF-ref_value) <= TOLERANCE:
                matches[rsID].append(ref_id)

# determine maximum number of matches for output file header row
# (0 if nothing matched, so max() is never called on an empty sequence)
longest = max(len(v) for v in matches.values()) if matches else 0

with open("output.txt", "wt") as outf:
    header = ['rsId'] + ['match%d' % i for i in range(1, longest+1)]
    outf.write(DELIM.join(header) + '\n')
    for k, v in matches.items():
        outf.write(DELIM.join([k] + v) + '\n')

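Contents of output.txt generated from the sample data shown in the question (» represents a tab character; the row order depends on dictionary ordering and may differ):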

rsId»   match1» match2» match3» match4
rs870123»   rs908341»   rs090321»   rs701234»   rs101098
rs9038241»  rs100981
rs1234123»  rs512341
rs1293048»  rs090321»   rs701234»   rs101098
rs723904»   rs100981
rs1980123»  rs512341
rs3801423»  rs090321»   rs701234»   rs101098»   rs100981
rs8041239»  rs512341
rs239401»   rs512341
rs314234»   rs090321»   rs701234»   rs101098
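
The limit mentioned in this answer could be added as a separate step before writing the output. The sketch below is one possible approach, not part of the original answer: `cap_matches`, `limit`, and `seed` are hypothetical names, and `matches` is assumed to be a dict mapping each rsID to a list of matching rsIDs, like the one built above.

```python
import random


def cap_matches(matches, limit=5000, seed=None):
    """Return a copy of `matches` with at most `limit` ids per key.

    When a list is longer than `limit`, a random subset is kept;
    shorter lists are copied unchanged.
    """
    rng = random.Random(seed)  # pass a seed for reproducible subsets
    capped = {}
    for rsid, ids in matches.items():
        if len(ids) > limit:
            capped[rsid] = rng.sample(ids, limit)
        else:
            capped[rsid] = list(ids)
    return capped
```

Calling `cap_matches(matches, limit=5000)` before the header row is computed would also keep the header at no more than 5000 match columns.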
