Is this a normal amount of time for building an inverted index in Python?

Posted 2024-06-16 09:55:53


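I have implemented an inverted index over a set of PDF files. The index will ultimately be used as part of a search utility. I am able to build the index, but my question is about its efficiency: building the inverted index for a single 670 KB, 71-page PDF file takes about 12-13 seconds. Since I am new to Python, I would like to know whether this time is normal. I use PyPDF2 to read the .pdf files and nltk for tokenization, normalization, and stemming. Each stemmed word is stored in a dictionary of the following form:

{term: {docID: [totalFrequency, [{page1: occurrence}, {page2: occurrence}, ..., {pageN: occurrence}]]}}

That is, for each term the index records which document the word appears in, how many times it appears in that document in total, which pages it appears on, and how many times it appears on each of those pages.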
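To make the layout above concrete, here is a minimal sketch of how one entry might be filled in. The helper name `add_occurrence`, the document ID `"doc1"`, and the sample words are hypothetical and not part of my real code. The sketch also assumes the pages of a document are processed in ascending order, so the last page dictionary in the list is always the current page.

```python
def add_occurrence(index, term, doc_id, page_number):
    """Record one occurrence of `term` on `page_number` of `doc_id`.

    Layout: {term: {doc_id: [total_frequency, [{page: count}, ...]]}}
    """
    postings = index.setdefault(term, {})
    entry = postings.setdefault(doc_id, [0, []])
    entry[0] += 1  # total frequency of the term in this document
    pages = entry[1]
    # Assumption: pages arrive in order, so only the last dict can match.
    if pages and page_number in pages[-1]:
        pages[-1][page_number] += 1
    else:
        pages.append({page_number: 1})


index = {}
sample_pages = [["index", "search"], ["index", "index"]]  # made-up stems
for page_number, words in enumerate(sample_pages):
    for word in words:
        add_occurrence(index, word, "doc1", page_number)

print(index)
```

With this layout, looking up a term returns every document that contains it, along with the per-page counts, in a single dictionary access.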

Below is a snapshot of my code. Any help or suggestions would be greatly appreciated. Thank you.

import PyPDF2 as PDFHelper
from nltk.tokenize import word_tokenize as WordHelper
from nltk.stem import PorterStemmer
from wordsegment import load, segment
load()

# for each PDF file, the following code is executed
# (excerpt from a method; pdfReader, docID and self come from the surrounding code)
# extract text from each page
for pageNumber in range(0, pdfReader.numPages):
    extractedText = ""
    pageObj = pdfReader.getPage(pageNumber)
    extractedText += " " + pageObj.extractText()

    # tokenize the extracted words
    # filter the stop words
    # stem the filtered words to get set of unique root words
    terms = WordHelper(extractedText)
    for term in terms:
        for segmentedWord in segment(term):
            if segmentedWord not in self.StopWordsList:
                rootWord = PorterStemmer().stem(segmentedWord)
                # add this rootWord/term in the term vocabulary
                if rootWord in self._term_vocabulary:
                    self._term_vocabulary[rootWord].<AddDocument(docID, pageNumber)>
                else:
                    self._term_vocabulary[rootWord] = <creating the new posting list>
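One thing that could be tried in a loop like the one above is to avoid repeating per-word work, since the same words recur across pages. The following is a minimal sketch of that idea using `functools.lru_cache`. It is not a measured fix for the 12-13 seconds: `slow_stem` is a hypothetical stand-in for the real stemmer call (it only strips a trailing "s"), and the sample pages are made up. The call counter is there only to show how many times the underlying function actually runs.

```python
from functools import lru_cache

calls = 0  # counts how often the underlying stemmer stand-in really runs


def slow_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    global calls
    calls += 1
    return word[:-1] if word.endswith("s") else word


@lru_cache(maxsize=None)
def cached_stem(word):
    # Each distinct word reaches slow_stem once; repeats are cache hits.
    return slow_stem(word)


sample_pages = [["indexes", "pages", "indexes"], ["pages", "indexes", "terms"]]
roots = [cached_stem(word) for page in sample_pages for word in page]

print(roots)
print(calls)
```

In this toy input, six words are processed but the stand-in stemmer runs only for the three distinct ones. Whether caching helps on the real PDF depends on where the time actually goes (text extraction, segmentation, or stemming), which would need to be measured separately.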
