<p>I use this utility function to extract all text elements from a PDF:</p>
<pre><code>from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFParser


def pdf2text(stream):
    # Parse the PDF structure from the binary stream.
    parser = PDFParser(stream)
    document = PDFDocument(parser)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    # Set up layout analysis so each page is returned as a tree of layout objects.
    resmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(resmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(resmgr, device)
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        # Yield the text of every top-level text box or text line on the page.
        for obj in device.get_result():
            if isinstance(obj, (LTTextBox, LTTextLine)):
                yield obj.get_text()
</code></pre>
<p>The <code>stream</code> parameter is a file-like object (for example, a file opened for reading in binary mode, an <code>io.BytesIO</code> instance, or something similar).</p>
<p>This example largely follows the <a href="https://euske.github.io/pdfminer/programming.html" rel="nofollow noreferrer">official example</a>.</p>
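<p>Because <code>pdf2text</code> is a generator, the stream has to stay open until every chunk has been consumed. Below is a minimal sketch of a wrapper that handles this for either a file path or raw bytes. The names <code>extract_text</code> and <code>extractor</code> are hypothetical and not part of pdfminer; the extractor is passed in as an argument, so in practice you would pass <code>pdf2text</code>.</p>

```python
import io


def extract_text(source, extractor):
    """Return all text chunks yielded by `extractor`, joined into one string.

    `source` is either raw PDF bytes or a path to a PDF file.
    `extractor` is a callable that takes a binary file-like object and
    yields strings (hypothetical parameter; pass pdf2text here).
    """
    if isinstance(source, (bytes, bytearray)):
        # In-memory data: wrap it in a file-like object.
        return "".join(extractor(io.BytesIO(source)))
    # Path on disk: open in binary mode and consume the generator
    # before the file is closed.
    with open(source, "rb") as stream:
        return "".join(extractor(stream))
```

<p>Assuming a file named <code>report.pdf</code> exists (a placeholder name), a call would look like <code>text = extract_text("report.pdf", pdf2text)</code>.</p>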