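However, the results I get are generated line by line. How can I separate each word from the next, instead of separating groups of words line by line (see the example below)? I have tried several arguments from the PDFMiner tutorial. Both LTTextBox and LTText were tried. Also, I cannot use the start and end offsets that are normally used in text analysis.
This PDF is a good example, and it is the one used in the code below.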
http://www.africau.edu/images/default/sample.pdf
from pdfminer.layout import LAParams, LTTextBox, LTText
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.converter import PDFPageAggregator

# Imports searchable PDFs and prints x, y coordinates
# Raw string so the backslash in the Windows path is not read as an escape
fp = open(r'C:\sample.pdf', 'rb')
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)

for page in pages:
    print('--- Processing ---')
    interpreter.process_page(page)
    layout = dev.get_result()
    for lobj in layout:
        if isinstance(lobj, LTText):
            # bbox is (x0, y0, x1, y1); report the top-left corner
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            print('At %r is text: %s' % ((x, y), text))
This returns the x, y coordinates of the text in the searchable PDF, as shown below:
--- Processing ---
At (57.375, 747.903) is text: A Simple PDF File
At (69.25, 698.098) is text: This is a small demonstration .pdf file -
At (69.25, 674.194) is text: just for use in the Virtual Mechanics tutorials. More text. And more
text. And more text. And more text. And more text.
Desired result (the coordinates are placeholders for demonstration):
--- Processing ---
At (57.375, 747.903) is text: A
At (69.25, 698.098) is text: Simple
At (69.25, 674.194) is text: PDF
At (69.25, 638.338) is text: File
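With PDFMiner, after iterating over each line (as you already do), the only level left to iterate over is the individual characters in that line.
I got word-level output by recording the x, y of the first character of each word and setting a condition that splits words at every LTAnno (e.g. \n) or wherever .get_text() == ' ', that is, at whitespace.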
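The following is a minimal sketch of the character-level approach described above, not the answerer's original code. The names split_words and iter_pdf_words are hypothetical, and the sketch assumes that real glyphs (LTChar) carry a bbox while the LTAnno items inserted by layout analysis do not.

```python
from typing import Iterable, Iterator, List, Tuple

Word = Tuple[float, float, str]  # (x0, y1, text): top-left corner of the first glyph


def split_words(line: Iterable) -> Iterator[Word]:
    """Group the characters of one text line into words.

    Assumption: every item offers get_text(); real glyphs (LTChar) also
    have a bbox, while layout-inserted LTAnno items do not.
    """
    chars: List[str] = []
    x = y = 0.0
    for obj in line:
        text = obj.get_text()
        bbox = getattr(obj, "bbox", None)
        if bbox is None or text.isspace():
            # An LTAnno (e.g. '\n') or a literal space ends the current word.
            if chars:
                yield x, y, "".join(chars)
                chars = []
            continue
        if not chars:
            # First character of a new word: keep its left and top edge.
            x, y = bbox[0], bbox[3]
        chars.append(text)
    if chars:
        yield x, y, "".join(chars)


def iter_pdf_words(path: str) -> Iterator[Word]:
    """Yield (x, y, word) for every word in a searchable PDF."""
    # Imported here so split_words stays usable without pdfminer installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    manager = PDFResourceManager()
    device = PDFPageAggregator(manager, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, device)
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            for box in device.get_result():
                if not isinstance(box, LTTextBox):
                    continue
                # A text box contains text lines; a line contains characters.
                for line in box:
                    if isinstance(line, LTTextLine):
                        yield from split_words(line)
```

Looping over iter_pdf_words with the path of the sample PDF and printing each tuple in the same 'At %r is text: %s' format as the question should give one entry per word instead of one per text box.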
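You may want to refine the condition so that words are detected according to your needs and preferences, for example by cutting punctuation marks such as . ! ? from the end of a word.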
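One possible refinement, sketched under the assumption that only trailing sentence punctuation should be removed. The helper name trim_trailing_punctuation and the default set of marks are hypothetical choices, not part of PDFMiner.

```python
def trim_trailing_punctuation(word: str, marks: str = ".,;:!?") -> str:
    """Drop sentence punctuation from the end of a word."""
    trimmed = word.rstrip(marks)
    # Keep tokens made only of punctuation instead of returning an empty string.
    return trimmed or word
```

Because only the end of the word is trimmed, the x, y recorded for the first character of the word stays valid.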