Whoosh returning empty values
I'm using Whoosh to index and search over texts in a variety of encodings. However, when I search the indexed files and use the "highlight" feature, some of the matching results don't appear in the output. I suspect this is related to an encoding error, but I can't figure out what's preventing all of the results from being displayed. I'd be very grateful to anyone who can help solve this mystery.
Here is the script I use to create the index, and here are the files I'm indexing:
from whoosh.index import create_in
from whoosh.fields import *
import glob, os, chardet

encodings = ['utf-8', 'ISO-8859-2', 'windows-1250', 'windows-1252', 'latin1', 'ascii']

def determine_string_encoding(string):
    result = chardet.detect(string)
    string_encoding = result['encoding']
    return string_encoding

#specify a list of paths that contain all of the texts we wish to index
#(raw strings so the backslashes in the Windows paths aren't treated as escapes)
text_dirs = [
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\hume",
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\complete_pope\clean"
]

#establish the schema to be used when storing texts; storing content allows us to retrieve highlighted extracts from texts in which matches occur
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))

#check to see if we already have an index directory. If we don't, make it
if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index", schema)

#create writer object we'll use to write each of the documents in text_dirs to the index
writer = ix.writer()

#create file in which we can write the encoding of each file to disk for review
with open("encodings_log.txt", "w") as encodings_out:

    #for each directory in our list
    for i in text_dirs:

        #for each text file in that directory (j is now the path to the current file within the current directory)
        for j in glob.glob(i + "\\*.txt"):

            #first, let's grab the file's title. If the title is stored in the text file name, we can use this method:
            text_title = j.split("\\")[-1]

            #now let's read the file
            with open(j, "r") as infile:
                text_content = infile.read()

            #use the method defined above to determine the encoding of the path and of text_content
            path_encoding = determine_string_encoding(j)
            text_content_encoding = determine_string_encoding(text_content)

            #because we know the encoding of the files in this directory, let's override the previous text_content_encoding value and specify that encoding explicitly
            if "clean" in j:
                text_content_encoding = "iso-8859-1"

            #decode text_title, path, and text_content to unicode using the encodings we determined for each above
            unicode_text_title = unicode(text_title, path_encoding)
            unicode_text_path = unicode(j, path_encoding)
            unicode_text_content = unicode(text_content, text_content_encoding)

            #use the writer method to add the document to the index
            writer.add_document(title=unicode_text_title, path=unicode_text_path, content=unicode_text_content)

#after you've added all of your documents, commit changes to the index
writer.commit()
This code seems to index the texts without a problem, but when I search the index with the script below, I get three empty values in the out.txt output file: the first two lines are empty, and so is the sixth, even though I expected all three of those lines to contain text. Here is the script I'm using to search the index:
from whoosh.qparser import QueryParser
from whoosh.qparser import FuzzyTermPlugin
from whoosh.index import open_dir
import codecs

#now that we have an index, we can open it with open_dir
ix = open_dir("index")

with ix.searcher() as searcher:
    parser = QueryParser("content", schema=ix.schema)

    #to enable Levenshtein-based fuzzy matching, add the plugin
    parser.add_plugin(FuzzyTermPlugin())

    #using ~2/3 means: allow an edit distance of two (where additions, subtractions, and insertions each cost one), but only count matches whose first three letters match. Increasing this denominator greatly increases speed
    query = parser.parse(u"swallow~2/3")
    results = searcher.search(query)

    #see whoosh.query.Phrase, which describes the "slop" parameter (i.e., the number of words we can insert between any two words in our search query)

    #write query results to disk or html
    with codecs.open("out.txt", "w") as out:
        for i in results:
            title = i["title"]
            highlight = i.highlights("content")
            clean_highlight = " ".join(highlight.split())
            out.write(clean_highlight.encode("utf-8") + "\n")
I'd be grateful to anyone who can suggest why those three lines are empty.
1 Answer
Whew, I may have found the culprit! It looks like some of my text files (including both of the files whose paths contain "hume") exceed a limit that affects how Whoosh builds its highlights. If you try to highlight a file that's too large, Whoosh appears to hand back that text as a plain string rather than as unicode. So, supposing you have an index with fields for "path" (the file path), "title" (the file title), "content" (the file content), and "encoding" (the encoding of the current file), you can test whether the files in that index were indexed properly by running the following script:
from whoosh.qparser import QueryParser
from whoosh.qparser import FuzzyTermPlugin
from whoosh.index import open_dir
import codecs

#now that we have an index, we can open it with open_dir
ix = open_dir("index")

phrase_to_search = unicode("swallow")

with ix.searcher() as searcher:
    parser = QueryParser("content", schema=ix.schema)
    query = parser.parse(phrase_to_search)
    results = searcher.search(query)

    for hit in results:
        hit_encoding = hit["encoding"]

        with codecs.open(hit["path"], "r", hit_encoding) as fileobj:
            filecontents = fileobj.read()

        hit_highlight = hit.highlights("content", text=filecontents)
        hit_title = hit["title"]

        print type(hit_highlight), hit["title"]
If any of the printed values have type "str", the highlighter is treating part of the designated file as a string rather than as unicode.
There are two ways to fix this problem: 1) Split your large files (any file over 32K characters) into smaller files, each containing fewer than 32K characters, and index those smaller files. This approach requires more curation but keeps processing times reasonable. 2) Pass a parameter to your results variable to increase the maximum number of characters the highlighter will examine, so that in the example above the results print correctly to the terminal. To implement this solution in the code above, add the following line after the line that defines results:
results.fragmenter.charlimit = 100000
Once you add that line, you'll be able to print to the terminal any result that occurs within the first 100,000 characters of the designated file, although doing so significantly increases processing time. Alternatively, you can remove the character limit entirely with results.fragmenter.charlimit = None, though this greatly increases processing time when working with large files...
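The first fix above (splitting oversized files before indexing) can be sketched with a small helper. This is a minimal, hypothetical chunk_text function, assuming a simple cut on raw character boundaries; in practice you would probably want to split on paragraph or sentence boundaries instead so that highlights read naturally:

```python
def chunk_text(text, max_chars=32000):
    """Split text into pieces of at most max_chars characters each."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

#each chunk could then be added to the index as its own document, e.g.
#with titles like "essay.txt.part0", "essay.txt.part1", and so on, so
#that every indexed document stays under the 32K-character limit that
#trips up the highlighter
```
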