How can I make this code run faster? (Searching a large text for long words)
In Python, I have made a text generator that works from a few parameters, but my code is slow most of the time and performs worse than I expect. I want it to produce one sentence every 3 to 4 minutes, but it fails to comply if the database it works on is large (I am using the Project Gutenberg corpus of 18 books, and I will create my own corpus and add further books later), so performance really matters. The algorithm and the implementation are below:
Algorithm
1- Enter the trigger sentence; this is taken only once, when the program starts.
2- Find the longest word in the trigger sentence.
3- Find all sentences in the corpus that contain the word found in step 2.
4- Randomly select one of those sentences.
5- Get the sentence that follows the one selected in step 4 (call it sentA to prevent confusion), provided sentA is longer than 40 characters.
6- Go back to step 2; the trigger sentence is now the sentA of step 5.
Implementation
from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence:") #get input sentence from user

previousLongestWord = ""
listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = [] #all sentences in the related corpus
sentenceAppender = ""
longestWord = ""

#this function is not mine, code courtesy of Dave Kirby, found on the internet, about speed tricks for sorting a list without duplicates
def arraySorter(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]

def findLongestWord(longestWord):
    if(listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper()):
        longestWord = sortedSetOfValidWords[-2]
        if(listOfWords.count(longestWord) == 1):
            longestWord = sortedSetOfValidWords[-3]

doappend = corpusSentences.append

def appending():
    for mysentence in listOfSents: #sentences are organized into arrays so they can actually be read word by word
        sentenceAppender = " ".join(mysentence)
        doappend(sentenceAppender)

appending()

sentencesContainingLongestWord = []

def getSentence(longestWord, sentencesContainingLongestWord):
    for sentence in corpusSentences:
        if sentence.count(longestWord): #if the sentence contains the longest target string, push it into the sentencesContainingLongestWord list
            sentencesContainingLongestWord.append(sentence)

def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):
    while(len(corpusSentences[sentenceIndex + 1]) < 40): #in case the next sentence is shorter than 40 characters, pick another trigger sentence
        sentencesContainingLongestWord.remove(triggerSentence)
        triggerSentence = choice(sentencesContainingLongestWord)
        sentenceIndex = corpusSentences.index(triggerSentence)

while len(triggerSentence) > 0: #run the loop as long as you get a trigger sentence
    sentencesContainingLongestWord = [] #all the sentences that include the longest word are to be inserted into this set
    setOfValidWords = [] #set for words in a sentence that exist in the corpus
    split_str = triggerSentence.split() #split the sentence into words
    setOfValidWords = [word for word in split_str if listOfWords.count(word)]
    sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key = len))
    longestWord = sortedSetOfValidWords[-1]
    findLongestWord(longestWord)
    previousLongestWord = longestWord
    getSentence(longestWord, sentencesContainingLongestWord)
    triggerSentence = choice(sentencesContainingLongestWord)
    sentenceIndex = corpusSentences.index(triggerSentence)
    lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)
    triggerSentence = corpusSentences[sentenceIndex + 1] #get the sentence that is next to the previous trigger sentence
    print triggerSentence
    print "\n"
    corpusSentences.remove(triggerSentence) #in order to view the sentence index numbers, you can remove this one so index numbers are concurrent with actual gutenberg numbers

print "End of session, please rerun the program"
#reached once the while loop exits, so that the program ends without errors
The computer I run the code on is a bit old: the dual-core CPU was bought in February 2006 and the two 512 MB RAM sticks in September 2004, so I'm not sure whether my implementation is at fault or the hardware is the reason for the slowness. Any suggestions on how I can improve this? Thanks!
2 Answers
1
Perhaps Psyco would speed up the execution?
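(For what it's worth, a minimal sketch of how Psyco is typically enabled; this assumes Psyco is installed on a 32-bit Python 2 interpreter, which it requires, and uses its standard whole-program mode:)

import psyco  # JIT specializer for Python 2
psyco.full()  # compile every function it can, trading memory for speed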
4
I think the first advice I have to give is: think carefully about what each routine in your program does, and make sure its name describes exactly that. Currently you have things like:

- arraySorter, which suggests it deals with arrays and sorts them, but does neither (it implements nub; see the sketch after this list)
- findLongestWord, which suggests it finds the longest word, but actually counts words or selects them by criteria not present in the algorithm description, and then does nothing at all, since longestWord is a local variable (a parameter, as it were)
- getSentence, which suggests it gets a sentence, but actually appends an arbitrary number of sentences to a list
- appending, which sounds like it might be a state checker, but operates only through side effects
- considerable confusion between local and global variables; for example, the global sentenceAppender is never used, nor is it an actor (e.g. a function) as the name suggests
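To illustrate the first point, the de-duplication helper could simply carry a name that says what it does. The body below is unchanged from your code; the name uniqueInOrder is only my suggestion:

def uniqueInOrder(seq):
    #return the elements of seq in their original order, with duplicates dropped (a 'nub')
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]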
What you really need for this task are indexes. It might be overkill to index every word; in practice you only need index entries for the words that occur as the longest word of a sentence. A dictionary is your primary tool here, and a list is the second. Once you have those indexes, looking up a random sentence containing any given word takes only a dictionary lookup, a random.choice, and a list lookup, plus perhaps a few more list lookups due to the sentence-length restriction.
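A minimal sketch of such an index, assuming the sentences have already been joined into strings as in your code; buildIndex and pickSentence are names made up for illustration, and for simplicity this version indexes every word, although as said above the longest-word entries would suffice:

from collections import defaultdict
from random import choice

def buildIndex(sentences):
    #map each word to the indices of all sentences containing it;
    #built once up front, so later lookups never rescan the corpus
    index = defaultdict(list)
    for i, sentence in enumerate(sentences):
        for word in set(sentence.split()):
            index[word].append(i)
    return index

def pickSentence(index, sentences, word):
    #one dictionary lookup + one random.choice + one list lookup
    return sentences[choice(index[word])]

With the index built once, each iteration of your generator loop does a constant amount of work instead of scanning all 18 books again.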
This example should drive the point home: neither modern hardware nor an optimizer such as Psyco will solve an algorithmic problem.