Boolean search queries with proximity search

Posted 2024-04-27 00:47:59


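I run a media monitoring service. We use a Lucene index and complex (and very long) Boolean query strings to find documents that contain hits. The vast majority of these documents are PDFs.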

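Our problem is showing those hits in context. A document might be 100 pages long, for example, while the only relevant text sits on one or two pages. I have been working on a solution that automatically highlights the hits in the PDF, so that readers can easily see where the terms are mentioned.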

I have managed to write a very basic script that does this using PDFminer and PyMuPDF:

# -*- coding: utf-8 -*-
from pdfminer.high_level import extract_text
import fitz
import re

### READ IN PDF

regex = r"\b(?:growth\W+(?:\w+\W+){0,5}?hormone|hormone\W+(?:\w+\W+){0,5}?growth)\b"
file = 'test.pdf'
text = extract_text(file)
doc = fitz.open(file)

strings = []
for match in re.finditer(regex, text):
    string = match.group()
    strings.append(string)

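### HIGHLIGHT

for page in doc:
    for string in strings:
        # search_for returns one rectangle per occurrence of the string on this page
        # (snake_case names replace the deprecated pageCount / searchFor / addHighlightAnnot)
        text_instances = page.search_for(string)
        print(text_instances)
        for inst in text_instances:
            page.add_highlight_annot(inst)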


### OUTPUT

doc.save("output.pdf", garbage=4, deflate=True, clean=True)

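The main problem is that, as far as I can tell, I cannot use the Boolean query directly: we have to find the matching words in the extracted text first, and then search the PDF for instances of those words and highlight them. In the example above I use a regex that is the equivalent of the Boolean query "growth w/5 hormone". This works, but converting the entire Boolean query to regex, or to any other syntax that could drive this search, is too difficult for me. More precisely, converting the 'OR' operators is easy, but I cannot find a workable solution for proximity operators such as "W/" or "NEAR/".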
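To make the proximity problem concrete, here is a minimal sketch that generalises the hand-written "growth w/5 hormone" regex to any `left w/N right` clause. The helper names `term_to_regex` and `proximity_regex` are hypothetical. The sketch assumes that `w/N` means "at most N intervening words, in either order" (which is what the regex above encodes) and that a trailing `*` is a word-ending wildcard. The search engine's actual semantics may differ.

```python
import re


def term_to_regex(term):
    """Turn one query term into a regex fragment.

    Assumption: a trailing * means "any word ending".
    """
    if term.endswith("*"):
        return re.escape(term[:-1]) + r"\w*"
    return re.escape(term)


def proximity_regex(left, right, distance):
    """Build a pattern for `left w/distance right`, in either order.

    `left` and `right` are lists of alternative terms (the members of an
    OR group); `distance` is the maximum number of intervening words.
    """
    a = "(?:" + "|".join(term_to_regex(t) for t in left) + ")"
    b = "(?:" + "|".join(term_to_regex(t) for t in right) + ")"
    # Same gap construct as the hand-written regex: separator, then up to
    # `distance` whole words, matched lazily.
    gap = r"\W+(?:\w+\W+){0,%d}?" % distance
    return rf"\b(?:{a}{gap}{b}|{b}{gap}{a})\b"


# (lack w/3 (effect* OR efficacy OR effectiveness OR response))
pattern = re.compile(
    proximity_regex(["lack"], ["effect*", "efficacy", "effectiveness", "response"], 3),
    re.IGNORECASE,
)
match = pattern.search("There was a lack of any effect in patients")
print(match.group() if match else None)
```

Each matched string could then be passed to the highlighting loop in the same way as the entries of `strings`.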

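Below is an example of one of our Boolean queries. Any advice or suggestions would be greatly appreciated. Please also note that I am not very experienced with Python; everything I know is self-taught.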

Query string:

((abnorm* OR Abortion OR abuse* OR accidental OR ADR$1 OR "adverse effect*" OR "adverse event*" OR "adverse reaction*" OR AE$1 OR allerg* OR antibod* OR (benefit w/1 risk*) OR "birth defect*" OR carcinogen* OR "clinical failure") OR (complication* OR contamination OR congenital OR death* OR defect* OR deliver* OR "drug dispensing error*" OR (drug w/1 (effect* OR ineffect* OR interaction* OR reaction* OR resistance OR toxicity OR withdrawal))) OR (efficacy OR effectiveness OR embryo* OR epidemiology OR fatal OR fetal OR foetal OR genomic* OR genotoxic* OR hypersensitivity OR "idiosincratic reaction*" OR "idiosincratic toxicit*" OR immunogen* OR "incorrect drug administ*") OR (ineffective OR (lack w/3 (effect* OR efficacy OR effectiveness OR response)) OR lethal OR malformation* OR "medication error*" OR miscarriage* OR misuse* OR mutagen* OR "near miss" OR "occupational exposure" OR "off label") OR (overdos* OR pharmacogenetic* OR pharmacogenomic* OR "pharmacogenomic biomarker*" OR poison* OR pregnan* OR prescribing OR infection* OR SAE$1 OR safety OR "side effect*" OR SUSAR* OR teratogen*) OR ((therapeut* w/3 (decreased OR delay* OR effect OR efficacy OR effectiveness OR failure OR outcome OR response)) OR toxic OR toxicity OR (transmi* w/3 infect*)) OR ((Transmission w/3 (bacter* OR pathogen* OR viral OR virus))) OR ((treatment w/3 (delay* OR effect* OR efficacy OR effectiveness OR failure OR outcome OR response)) OR underdos* OR "undesirable effect*" OR "wrong drug administ*") OR "lack of efficacy")
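Any conversion of a query like this has to split it into its parts first. The following is a rough sketch of a tokenizer for the syntax visible in the query above: parentheses, quoted phrases, `OR`, proximity operators and plain terms. The names `TOKEN_RE` and `tokenize` are hypothetical, and the token classes are inferred only from this one example query, so other operators the search engine supports are not covered.

```python
import re

# One named group per token type; whitespace between tokens is skipped
# because no alternative matches it.
TOKEN_RE = re.compile(
    r'''
    (?P<LPAREN>\()                        # opening parenthesis
  | (?P<RPAREN>\))                        # closing parenthesis
  | (?P<PHRASE>"[^"]*")                   # quoted phrase, e.g. "near miss"
  | (?P<PROX>\b(?:[wW]|NEAR|near)/\d+)    # proximity operator, e.g. w/3
  | (?P<OR>\bOR\b)                        # Boolean OR (uppercase only)
  | (?P<TERM>[^\s()"]+)                   # plain term, e.g. abnorm* or ADR$1
    ''',
    re.VERBOSE,
)


def tokenize(query):
    """Return a list of (token_type, text) pairs for a query string."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(query)]


for token in tokenize('(drug w/1 (effect* OR toxicity)) OR "near miss"'):
    print(token)
```

A parser built on top of these tokens could walk the parentheses recursively and turn each clause into a pattern, but that step is not shown here.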
