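<p>There are several Python packages for extracting text from a PDF.</p>
<h2>pdftotext</h2>
<p>The <a href="https://github.com/jalan/pdftotext" rel="noreferrer">pdftotext</a> package seems to work pretty well, but it has no option to, for example, extract bounding boxes.</p>
<h3>Installation</h3>
<p>For Ubuntu:</p>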
<pre><code>sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python3-dev
pip install pdftotext
</code></pre>
<h3>Minimal working example</h3>
<pre><code>import pdftotext

# Load the PDF
with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Iterate over all the pages
for page in pdf:
    print(page)

# Just read the second page (pages are indexed from 0)
print(pdf[1])

# Or read all the text at once
print("\n\n".join(pdf))
</code></pre>
<h2>PDFminer</h2>
<p>Install it with <code>pip install pdfminer.six</code>. A minimal working example is <a href="https://stackoverflow.com/a/22898159/562769">here</a>.</p>
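<p>If bounding boxes are needed, pdfminer.six can report them through its layout analysis. The following is a minimal sketch, not a definitive implementation: it assumes a recent pdfminer.six release that provides <code>pdfminer.high_level.extract_pages</code>, and the helper names <code>iter_text_blocks</code> and <code>format_block</code> are made up for this illustration.</p>

```python
from typing import Iterator, Tuple

# (x0, y0, x1, y1) in PDF points; pdfminer's origin is the bottom-left corner.
BBox = Tuple[float, float, float, float]


def iter_text_blocks(pdf_path: str) -> Iterator[Tuple[int, BBox, str]]:
    """Yield (page_number, bounding_box, text) for each text block in the PDF."""
    # Imported lazily so this module still loads if pdfminer.six is not installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            # Only text containers carry text; figures, lines etc. are skipped.
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text().strip()


def format_block(page_number: int, bbox: BBox, text: str) -> str:
    """Render one block as a single line showing its box and first line of text."""
    x0, y0, x1, y1 = bbox
    first_line = text.splitlines()[0] if text else ""
    return f"p{page_number} ({x0:.1f}, {y0:.1f}, {x1:.1f}, {y1:.1f}): {first_line}"
```

<p>Looping over <code>iter_text_blocks("lorem_ipsum.pdf")</code> and printing <code>format_block(...)</code> for each item gives one line per text block. If only the plain text is needed, <code>pdfminer.high_level.extract_text("lorem_ipsum.pdf")</code> returns it as a single string.</p>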