How can I make this code run faster? (searching for long words in a large text)

3 votes
2 answers
504 views
Asked 2025-04-16 16:15

In Python I've written a text generator that works from a few parameters, but my code is slow most of the time and performs worse than I expected. I'd like it to produce a sentence every 3 to 4 minutes, but it can't even meet that when the database is large. I'm using a corpus of 18 books from Project Gutenberg, and in the future I'll build my own corpus and add more books, so performance really matters. Here are my algorithm and implementation:

Algorithm

1- Enter a trigger sentence; this is done only once, at the start of the program.

2- Find the longest word in the trigger sentence.

3- Find every sentence in the corpus that contains the word found in step 2.

4- Pick one of those sentences at random.

5- Take the sentence that comes after the one chosen in step 4 (call it sentA to avoid confusion), provided sentA is longer than 40 characters.

6- Go back to step 2; the trigger sentence is now sentA from step 5.

Implementation

from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence:")#get input sentence from user

previousLongestWord = ""

listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = [] #all sentences in the related corpus

sentenceAppender = ""

longestWord = ""

#this function is not mine; courtesy of Dave Kirby, found on the internet (a speed trick for de-duplicating a list while preserving order)
def arraySorter(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]


def findLongestWord(longestWord):
    if(listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper()):
        longestWord = sortedSetOfValidWords[-2]
        if(listOfWords.count(longestWord) == 1):
            longestWord = sortedSetOfValidWords[-3]


doappend = corpusSentences.append

def appending():

    for mysentence in listOfSents: #sentences are organized into array so they can actually be read word by word.
        sentenceAppender = " ".join(mysentence)
        doappend(sentenceAppender)


appending()
sentencesContainingLongestWord = []

def getSentence(longestWord, sentencesContainingLongestWord):


    for sentence in corpusSentences:
        if sentence.count(longestWord):#if the sentence contains the longest target string, push it into the sentencesContainingLongestWord list
            sentencesContainingLongestWord.append(sentence)


def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):

    while(len(corpusSentences[sentenceIndex + 1]) < 40):#in case the next sentence is shorter than 40 characters, pick another trigger sentence
        sentencesContainingLongestWord.remove(triggerSentence)
        triggerSentence = choice(sentencesContainingLongestWord)
        sentenceIndex = corpusSentences.index(triggerSentence)

while len(triggerSentence) > 0: #run the loop as long as you get a trigger sentence

    sentencesContainingLongestWord = []#all the sentences that include the longest word are to be inserted into this set

    setOfValidWords = [] #set for words in a sentence that exists in a corpus                    

    split_str = triggerSentence.split()#split the sentence into words

    setOfValidWords = [word for word in split_str if listOfWords.count(word)]

    sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key = len))

    longestWord = sortedSetOfValidWords[-1]

    findLongestWord(longestWord)

    previousLongestWord = longestWord

    getSentence(longestWord, sentencesContainingLongestWord)

    triggerSentence = choice(sentencesContainingLongestWord)

    sentenceIndex = corpusSentences.index(triggerSentence)

    lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)

    triggerSentence = corpusSentences[sentenceIndex + 1]#get the sentence that is next to the previous trigger sentence

    print triggerSentence
    print "\n"

    corpusSentences.remove(triggerSentence)#in order to view the sentence index numbers, you can remove this one so index numbers are concurrent with actual gutenberg numbers


print "End of session, please rerun the program"
#initiated once the while loop exits, so that the program ends without errors

The computer I run the code on is a bit old: a dual-core CPU bought in February 2006 and two 512 MB RAM sticks bought in September 2004, so I'm not sure whether it's my implementation or the hardware that's making it slow. Any suggestions on how I can improve this? Thanks!

2 Answers

1

Maybe Psyco can speed up your program?

4

I think the first piece of advice I'd give is: think carefully about what each function in your program does, and make sure its name describes that accurately. At the moment you have names like these:

  • arraySorter, which neither deals with arrays nor sorts anything (it's an implementation of nub)
  • findLongestWord, which counts things or selects words by criteria not present in the algorithm description, and then ends up doing nothing at all, because longestWord is a local variable (an argument, as it were)
  • getSentence, which sounds like it fetches a sentence, but actually appends an arbitrary number of sentences to a list
  • appending, which sounds like it might be a state checker, but operates only through side effects
  • considerable confusion between local and global variables; for instance, the global sentenceAppender is never used, nor is it an actor (such as a function) as the name suggests
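As an aside, the order-preserving de-duplication ("nub") that arraySorter implements can be written more directly. A sketch (the function name is mine; it relies on dict keys preserving insertion order, which Python 3.7+ guarantees):

```python
def unique_preserving_order(items):
    # "nub": drop duplicates, keeping each element's first occurrence in place.
    # dict.fromkeys preserves insertion order in Python 3.7+.
    return list(dict.fromkeys(items))
```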

What you actually need for this task is an index. Indexing every word may be over the top; in practice you only need to index the words that occur as the longest word of some sentence. A dictionary is your primary tool here, and a list is the second. Once you have those indexes, looking up a random sentence containing any given word takes only a dictionary lookup, a random.choice, and a list lookup, plus perhaps a few more list lookups because of the sentence-length restriction.
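A minimal sketch of that indexing idea (the two-pass structure and all names here are my own illustration, not code from the question or the answer):

```python
import random
from collections import defaultdict

def build_index(sentences):
    """sentences: a list of token lists, e.g. nltk's gutenberg.sents()."""
    # Pass 1: collect every word that is the longest word of some sentence.
    longest_words = {max(tokens, key=len) for tokens in sentences if tokens}
    # Pass 2: map each such word to the indices of the sentences containing it.
    index = defaultdict(list)
    for i, tokens in enumerate(sentences):
        for word in set(tokens) & longest_words:
            index[word].append(i)
    return index

def next_sentence(trigger_tokens, sentences, index, min_len=40):
    """Pick a random indexed sentence containing the trigger's longest word,
    and return the sentence after it, if that one is at least min_len chars."""
    longest = max(trigger_tokens, key=len)
    candidates = [i for i in index.get(longest, ())
                  if i + 1 < len(sentences)
                  and len(" ".join(sentences[i + 1])) >= min_len]
    if not candidates:
        return None
    return sentences[random.choice(candidates) + 1]
```

The index is built once, up front; after that, each generation step is a dictionary lookup and a random.choice instead of a scan of the whole corpus, which is where the original code spends its time.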

This example should also illustrate nicely that algorithmic problems aren't solved by modern hardware or by optimizers such as Psyco.
