How can I make this code run faster? (Searching a large text for long words)
In Python, I have made a text generator that works from a few parameters, but my code is slow most of the time and performs worse than I expect. I want it to produce one sentence every 3 to 4 minutes, but it fails to comply if the database it works on is large (I am using the Project Gutenberg corpus of 18 books, and I will create my own corpus and add further books later), so performance really matters. The algorithm and the implementation are below:
Algorithm
1- Enter the trigger sentence; this is taken only once, when the program starts.
2- Find the longest word in the trigger sentence.
3- Find all sentences in the corpus that contain the word found in step 2.
4- Randomly select one of those sentences.
5- Get the sentence that follows the one selected in step 4 (call it sentA to prevent confusion), provided sentA is longer than 40 characters.
6- Go back to step 2; the trigger sentence is now the sentA of step 5.
Implementation
from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence:") #get input sentence from user

previousLongestWord = ""
listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = [] #all sentences in the related corpus
sentenceAppender = ""
longestWord = ""

#this function is not mine, code courtesy of Dave Kirby, found on the internet, about speed tricks for sorting a list without duplicates
def arraySorter(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]

def findLongestWord(longestWord):
    if(listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper()):
        longestWord = sortedSetOfValidWords[-2]
        if(listOfWords.count(longestWord) == 1):
            longestWord = sortedSetOfValidWords[-3]

doappend = corpusSentences.append

def appending():
    for mysentence in listOfSents: #sentences are organized into arrays so they can actually be read word by word
        sentenceAppender = " ".join(mysentence)
        doappend(sentenceAppender)

appending()

sentencesContainingLongestWord = []

def getSentence(longestWord, sentencesContainingLongestWord):
    for sentence in corpusSentences:
        if sentence.count(longestWord): #if the sentence contains the longest target string, push it into the sentencesContainingLongestWord list
            sentencesContainingLongestWord.append(sentence)

def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):
    while(len(corpusSentences[sentenceIndex + 1]) < 40): #in case the next sentence is shorter than 40 characters, pick another trigger sentence
        sentencesContainingLongestWord.remove(triggerSentence)
        triggerSentence = choice(sentencesContainingLongestWord)
        sentenceIndex = corpusSentences.index(triggerSentence)

while len(triggerSentence) > 0: #run the loop as long as you get a trigger sentence
    sentencesContainingLongestWord = [] #all the sentences that include the longest word are to be inserted into this set
    setOfValidWords = [] #set for words in a sentence that exist in the corpus
    split_str = triggerSentence.split() #split the sentence into words
    setOfValidWords = [word for word in split_str if listOfWords.count(word)]
    sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key = len))
    longestWord = sortedSetOfValidWords[-1]
    findLongestWord(longestWord)
    previousLongestWord = longestWord
    getSentence(longestWord, sentencesContainingLongestWord)
    triggerSentence = choice(sentencesContainingLongestWord)
    sentenceIndex = corpusSentences.index(triggerSentence)
    lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)
    triggerSentence = corpusSentences[sentenceIndex + 1] #get the sentence that is next to the previous trigger sentence
    print triggerSentence
    print "\n"
    corpusSentences.remove(triggerSentence) #in order to view the sentence index numbers, you can remove this one so index numbers are concurrent with actual gutenberg numbers

print "End of session, please rerun the program"
#reached once the while loop exits, so that the program ends without errors
The computer I run the code on is a bit old: the dual-core CPU was bought in February 2006 and the two 512 MB RAM sticks in September 2004, so I'm not sure whether my implementation is at fault or the hardware is the reason for the slowness. Any suggestions on how I can improve this? Thanks!
2 Answers
1
Perhaps Psyco would speed up the execution?
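(For what it's worth, a minimal sketch of how Psyco is typically enabled; this assumes Psyco is installed on a 32-bit Python 2 interpreter, which it requires, and uses its standard whole-program mode:)

import psyco  # JIT specializer for Python 2
psyco.full()  # compile every function it can, trading memory for speed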
4
I think the first advice I have to give is: think carefully about what each routine in your program does, and make sure its name describes exactly that. Currently you have things like:

- arraySorter, which suggests it deals with arrays and sorts them, but does neither (it implements nub; see the sketch after this list)
- findLongestWord, which suggests it finds the longest word, but actually counts words or selects them by criteria not present in the algorithm description, and then does nothing at all, since longestWord is a local variable (a parameter, as it were)
- getSentence, which suggests it gets a sentence, but actually appends an arbitrary number of sentences to a list
- appending, which sounds like it might be a state checker, but operates only through side effects
- considerable confusion between local and global variables; for example, the global sentenceAppender is never used, nor is it an actor (e.g. a function) as the name suggests
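To illustrate the first point, the de-duplication helper could simply carry a name that says what it does. The body below is unchanged from your code; the name uniqueInOrder is only my suggestion:

def uniqueInOrder(seq):
    #return the elements of seq in their original order, with duplicates dropped (a 'nub')
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]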
What you really need for this task are indexes. It might be overkill to index every word; in practice you only need index entries for the words that occur as the longest word of a sentence. A dictionary is your primary tool here, and a list is the second. Once you have those indexes, looking up a random sentence containing any given word takes only a dictionary lookup, a random.choice, and a list lookup, plus perhaps a few more list lookups due to the sentence-length restriction.
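A minimal sketch of such an index, assuming the sentences have already been joined into strings as in your code; buildIndex and pickSentence are names made up for illustration, and for simplicity this version indexes every word, although as said above the longest-word entries would suffice:

from collections import defaultdict
from random import choice

def buildIndex(sentences):
    #map each word to the indices of all sentences containing it;
    #built once up front, so later lookups never rescan the corpus
    index = defaultdict(list)
    for i, sentence in enumerate(sentences):
        for word in set(sentence.split()):
            index[word].append(i)
    return index

def pickSentence(index, sentences, word):
    #one dictionary lookup + one random.choice + one list lookup
    return sentences[choice(index[word])]

With the index built once, each iteration of your generator loop does a constant amount of work instead of scanning all 18 books again.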
This example should drive the point home: neither modern hardware nor an optimizer such as Psyco will solve an algorithmic problem.