Whoosh returning empty values
I'm using Whoosh to index and search over texts in a variety of encodings. However, when I search the indexed files and use the "highlight" feature, some of the matching results don't appear in the output. I suspect this is related to an encoding error, but I can't figure out what's preventing all of the results from being displayed. I'd be very grateful to anyone who can help solve this mystery.
Here is the script I use to create the index, and here are the files I'm indexing:
from whoosh.index import create_in
from whoosh.fields import *
import glob, os, chardet

encodings = ['utf-8', 'ISO-8859-2', 'windows-1250', 'windows-1252', 'latin1', 'ascii']

def determine_string_encoding(string):
    result = chardet.detect(string)
    string_encoding = result['encoding']
    return string_encoding

#specify a list of paths that contain all of the texts we wish to index
#(raw strings so the backslashes in the Windows paths aren't treated as escapes)
text_dirs = [
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\hume",
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\complete_pope\clean"
]

#establish the schema to be used when storing texts; storing content allows us to retrieve highlighted extracts from texts in which matches occur
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))

#check to see if we already have an index directory. If we don't, make it
if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index", schema)

#create writer object we'll use to write each of the documents in text_dirs to the index
writer = ix.writer()

#create file in which we can write the encoding of each file to disk for review
with open("encodings_log.txt", "w") as encodings_out:

    #for each directory in our list
    for i in text_dirs:

        #for each text file in that directory (j is now the path to the current file within the current directory)
        for j in glob.glob(i + "\\*.txt"):

            #first, let's grab the file's title. If the title is stored in the text file name, we can use this method:
            text_title = j.split("\\")[-1]

            #now let's read the file
            with open(j, "r") as infile:
                text_content = infile.read()

            #use the method defined above to determine the encoding of the path and of text_content
            path_encoding = determine_string_encoding(j)
            text_content_encoding = determine_string_encoding(text_content)

            #because we know the encoding of the files in this directory, let's override the previous text_content_encoding value and specify that encoding explicitly
            if "clean" in j:
                text_content_encoding = "iso-8859-1"

            #decode text_title, path, and text_content to unicode using the encodings we determined for each above
            unicode_text_title = unicode(text_title, path_encoding)
            unicode_text_path = unicode(j, path_encoding)
            unicode_text_content = unicode(text_content, text_content_encoding)

            #use the writer method to add the document to the index
            writer.add_document(title=unicode_text_title, path=unicode_text_path, content=unicode_text_content)

#after you've added all of your documents, commit changes to the index
writer.commit()
This code seems to index the texts without a problem, but when I search the index with the script below, I get three empty values in the out.txt output file: the first two lines are empty, and so is the sixth, even though I expected all three of those lines to contain text. Here is the script I'm using to search the index:
from whoosh.qparser import QueryParser
from whoosh.qparser import FuzzyTermPlugin
from whoosh.index import open_dir
import codecs

#now that we have an index, we can open it with open_dir
ix = open_dir("index")

with ix.searcher() as searcher:
    parser = QueryParser("content", schema=ix.schema)

    #to enable Levenshtein-based fuzzy matching, add the plugin
    parser.add_plugin(FuzzyTermPlugin())

    #using ~2/3 means: allow an edit distance of two (where additions, subtractions, and insertions each cost one), but only count matches whose first three letters match. Increasing this denominator greatly increases speed
    query = parser.parse(u"swallow~2/3")
    results = searcher.search(query)

    #see whoosh.query.Phrase, which describes the "slop" parameter (i.e., the number of words we can insert between any two words in our search query)

    #write query results to disk or html
    with codecs.open("out.txt", "w") as out:
        for i in results:
            title = i["title"]
            highlight = i.highlights("content")
            clean_highlight = " ".join(highlight.split())
            out.write(clean_highlight.encode("utf-8") + "\n")
I'd be grateful to anyone who can suggest why those three lines are empty.
1 Answer
Whew, I may have found the culprit! It looks like some of my text files (including both of the files whose paths contain "hume") exceed a limit that affects how Whoosh builds its highlights. If you try to highlight a file that's too large, Whoosh appears to hand back that text as a plain string rather than as unicode. So, supposing you have an index with fields for "path" (the file path), "title" (the file title), "content" (the file content), and "encoding" (the encoding of the current file), you can test whether the files in that index were indexed properly by running the following script:
from whoosh.qparser import QueryParser
from whoosh.qparser import FuzzyTermPlugin
from whoosh.index import open_dir
import codecs

#now that we have an index, we can open it with open_dir
ix = open_dir("index")

phrase_to_search = unicode("swallow")

with ix.searcher() as searcher:
    parser = QueryParser("content", schema=ix.schema)
    query = parser.parse(phrase_to_search)
    results = searcher.search(query)

    for hit in results:
        hit_encoding = hit["encoding"]

        with codecs.open(hit["path"], "r", hit_encoding) as fileobj:
            filecontents = fileobj.read()

        hit_highlight = hit.highlights("content", text=filecontents)
        hit_title = hit["title"]

        print type(hit_highlight), hit["title"]
If any of the printed values have type "str", the highlighter is treating part of the designated file as a string rather than as unicode.
There are two ways to fix this problem: 1) Split your large files (any file over 32K characters) into smaller files, each containing fewer than 32K characters, and index those smaller files. This approach requires more curation but keeps processing times reasonable. 2) Pass a parameter to your results variable to increase the maximum number of characters the highlighter will examine, so that in the example above the results print correctly to the terminal. To implement this solution in the code above, add the following line after the line that defines results:
results.fragmenter.charlimit = 100000
Once you add that line, you'll be able to print to the terminal any result that occurs within the first 100,000 characters of the designated file, although doing so significantly increases processing time. Alternatively, you can remove the character limit entirely with results.fragmenter.charlimit = None, though this greatly increases processing time when working with large files...
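The first fix above (splitting oversized files before indexing) can be sketched with a small helper. This is a minimal, hypothetical chunk_text function, assuming a simple cut on raw character boundaries; in practice you would probably want to split on paragraph or sentence boundaries instead so that highlights read naturally:

```python
def chunk_text(text, max_chars=32000):
    """Split text into pieces of at most max_chars characters each."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

#each chunk could then be added to the index as its own document, e.g.
#with titles like "essay.txt.part0", "essay.txt.part1", and so on, so
#that every indexed document stays under the 32K-character limit that
#trips up the highlighter
```
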