Python: searching a file for a million strings and counting the occurrences of each string

Question

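I am looking for the fastest way to solve the following problem. I have a file, file1, containing about one million strings, each 6 to 40 characters long and each on its own line. I want to search for each of them in another file, file2, which contains about 80,000 strings, and count the occurrences of each one (if a small string occurs multiple times within one big string, it still counts as 1). If anyone wants to compare performance, here is the link to download file1 and file2: dropbox.com/sh/oj62918p83h8kus/sY2WejWmhu?m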
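To pin down the counting rule described above, here is a minimal sketch on made-up data. The function name `count_containing` and the example strings are hypothetical, not part of my real script; the point is only that a haystack contributes at most 1 to the count, however many times the needle occurs inside it.

```python
def count_containing(needle, haystacks):
    """Return how many haystacks contain needle at least once."""
    # `in` stops at the first match, so repeated hits inside one
    # haystack still add only 1 to the total.
    return sum(1 for haystack in haystacks if needle in haystack)


# Hypothetical toy data: "AB" occurs twice in "ABAB" but counts once.
example_haystacks = ["ABAB", "XAB", "CD"]
example_count = count_containing("AB", example_haystacks)
```

Here `example_count` is 2: one for "ABAB", one for "XAB", none for "CD".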

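What I do now is build a dictionary for file2, with the string IDs as keys and the strings themselves as values (the strings in file2 contain duplicates; only the string IDs are unique). My code is: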

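for line in file1:
    # file1 is tab-separated; the substring to search for is in the second column
    substring=line[:-1].split("\t")[1]
    IDlist=[]  # reset once per substring, not once per big string
    for ID in dictionary.keys():
        bigstring=dictionary[ID]
        if bigstring.find(substring)!=-1:
            IDlist.append(ID)
    output.write("%s\t%s\n" % (substring,str(len(IDlist))))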

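My code takes hours to finish. Can anyone suggest a faster way? file1 and file2 are both around 50 MB, and my computer has 8 GB of memory; you can use as much memory as you need to make it faster. Any method that finishes within one hour is acceptable :)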

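Below, I have tried some of the suggestions from the comments to compare their performance. Each piece of code is followed by its running time.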

Some people, such as Mark Amery, suggested a few improvements:

import sys
from Bio import SeqIO

# first, load the strings in file2 into a dictionary called var_seq
var_seq={}
handle=SeqIO.parse(file2,'fasta')
for record in handle:
    var_seq[record.id]=str(record.seq)

print len(var_seq) # prints 76827, the expected number; loading file2 into var_seq takes only about 1 second, so it is not the part to optimize

output=open(outputfilename,'w')
icount=0
input1=open(file1,'r')
for line in input1:
    icount+=1
    row=line[:-1].split("\t")
    ensp=row[0]    # ensp is just the peptide's ID
    peptide=row[1] # peptide is the substring I want to search for in file2
    num=0
    for ID,bigstring in var_seq.iteritems(): 
        if peptide in bigstring:
            num+=1

    newline="%s\t%s\t%s\n" % (ensp,peptide,str(num))
    output.write(newline)
    if icount%1000==0:
        break

input1.close()
handle.close()
output.close()

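This took 1 minute 4 seconds to finish for the first 1,000 lines of file1 (the loop breaks there), which is 20 seconds faster than my previous code.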

Next is the method suggested by entropy:

from collections import defaultdict
from Bio import SeqIO
var_seq=defaultdict(int)
handle=SeqIO.parse(file2,'fasta')
for record in handle:
    var_seq[str(record.seq)]+=1

print len(var_seq) # prints 59502: duplicates are removed, but the number of occurrences of each sequence is stored as the value
handle.close()

output=open(outputfilename,'w')
icount=0

with open(file1) as fd:
    for line in fd:
        icount+=1
        row=line[:-1].split("\t")
        ensp=row[0]
        peptide=row[1]
        num=0
        for varseq,num_occurrences in var_seq.items():
            if peptide in varseq:
                num+=num_occurrences

        newline="%s\t%s\t%s\n" % (ensp,peptide,str(num))
        output.write(newline)
        if icount%1000==0:
            break

output.close()

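This one took 1 minute 10 seconds. It is not as fast as I expected given that it avoids searching the duplicates, and I don't quite understand why.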
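As a rough check on how much deduplication could save at best, the two counts printed by the scripts can be compared. This is only a sketch under the assumption that the running time is dominated by the number of big strings scanned per peptide (string lengths are ignored); the helper name `expected_scan_reduction` is hypothetical.

```python
def expected_scan_reduction(total_records, distinct_sequences):
    """Fraction of substring checks avoided by scanning each distinct
    sequence once instead of scanning every record."""
    return 1.0 - float(distinct_sequences) / total_records


# Counts printed by the two scripts: 76827 records, 59502 distinct sequences.
fraction_saved = expected_scan_reduction(76827, 59502)
```

With these numbers, `fraction_saved` is about 0.2255, so deduplication removes roughly 22.6% of the substring checks per peptide. Under the stated assumption, that caps the gain from this change alone at about the same proportion.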

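The "haystack and needle" method suggested by Mark Amery turned out to be the fastest. The problem is that the count comes out as 0 for every substring, and I don't yet understand why.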

Here is my code implementing his method:

from Bio import SeqIO

class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

input1=open(file1,'r')
needles={}
for line in input1:
    row=line[:-1].split("\t")
    needles[row[1]]=row[0]

print len(needles)

handle=SeqIO.parse(file2,'fasta')
haystacks={}
for record in handle:
    haystacks[record.id]=str(record.seq)

print len(haystacks)

for haystack_id, haystack in haystacks.iteritems(): #should be the same as enumerate(list)
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

icount=0
output=open(outputfilename,'w')
for needle in needles:
    icount+=1
    count = search_haystack_tree(needle)
    newline="%s\t%s\t%s\n" % (needles[needle],needle,str(count))
    output.write(newline)
    if icount%1000==0:
        break

input1.close()
handle.close()
output.close()

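This took only 11 seconds to finish, much faster than the other methods. However, I don't know whether a mistake in my code makes all the counts 0 or whether there is a flaw in Mark's method itself.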
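One way to separate those two possibilities is to run the same tree logic on a few hand-made strings whose correct counts are obvious. The following is a minimal sketch, not my real script: the toy data (`toy_haystacks`, `toy_needles`) is made up, the build loop is wrapped in a hypothetical helper `build_tree`, and the search function takes the root node as an extra argument. It uses `items()` and `range` instead of `iteritems()` and `xrange` so that it also runs on Python 3.

```python
class Node(object):
    def __init__(self):
        self.words = set()  # IDs of haystacks that contain the path to this node
        self.links = {}     # char -> child Node


def build_tree(haystacks):
    """Insert every suffix of every haystack, tagging each node on the
    way with the ID of the haystack it came from."""
    base = Node()
    for haystack_id, haystack in haystacks.items():
        for i in range(len(haystack)):
            current_node = base
            for char in haystack[i:]:
                current_node = current_node.links.setdefault(char, Node())
                current_node.words.add(haystack_id)
    return base


def search_haystack_tree(base, needle):
    """Return the number of distinct haystacks containing needle."""
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0  # no haystack contains this prefix of the needle
    return len(current_node.words)


# Hypothetical toy data with known answers.
toy_haystacks = {"h1": "ABCABC", "h2": "XBCY", "h3": "QQQ"}
toy_needles = ["BC", "ABC", "QQ", "ZZ"]
toy_base = build_tree(toy_haystacks)
counts = dict((needle, search_haystack_tree(toy_base, needle))
              for needle in toy_needles)
```

On this toy input `counts` is `{"BC": 2, "ABC": 1, "QQ": 1, "ZZ": 0}`, and "BC" counts h1 only once even though it occurs there twice. If the real run still gives 0 for everything while the toy run does not, the cause may lie in how the needles or haystacks are read from the files rather than in the tree itself, but that is only a guess.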
