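<p>I am trying to extract data from the tables in this <a href="https://www.dropbox.com/s/y3nivxhjvvzva7d/test1.pdf?dl=0" rel="nofollow noreferrer">PDF</a>. I have tried pdfminer and pypdf with little luck; neither gets me the data out of the tables.</p>
<p>This is what one of the tables looks like: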
<img src="https://i.stack.imgur.com/3kgtx.png" alt="enter image description here"/></p>
<p>As you can see, some of the columns are marked with an "x". I am trying to turn this table into a list of objects.</p>
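<p>To make the target concrete, here is a minimal sketch of what a "list of objects" could look like. The names <code>ToolRow</code> and <code>build_rows</code> and the column labels are hypothetical and only serve as an illustration; the real column names would come from the table header in the PDF.</p>

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolRow:
    """One table row: the tool name plus which columns carry an 'x'."""
    tool: str
    marks: Dict[str, bool] = field(default_factory=dict)


def build_rows(raw_rows: List[List[str]], columns: List[str]) -> List[ToolRow]:
    """Turn rows of cell strings into ToolRow objects.

    Each raw row is [tool_name, cell_1, cell_2, ...]; a cell counts as
    marked when it equals 'x' (case-insensitive) after stripping.
    """
    rows = []
    for raw in raw_rows:
        cells = raw[1:]
        marks = {col: cell.strip().lower() == "x" for col, cell in zip(columns, cells)}
        rows.append(ToolRow(tool=raw[0].strip(), marks=marks))
    return rows
```

<p>For example, <code>build_rows([["Hammer", "x", "", "x"]], ["A", "B", "C"])</code> would give one <code>ToolRow</code> whose <code>marks</code> say that columns A and C are marked and B is not.</p>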
<p>This is the code I have so far, currently using pdfminer.</p>
<pre><code># pdfminer test (Python 3, pdfminer.six)
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def containsSubString(lines, substring):
    # return the index of the first item containing substring, or -1
    for i, s in enumerate(lines):
        if substring in s:
            return i
    return -1


def pdfToText(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ''
    maxpages = 0
    caching = True
    pagenos = set()
    records = []
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            # process page
            interpreter.process_page(page)
            lines = retstr.getvalue().splitlines()
            # empty the buffer so the next page does not repeat this page's text
            retstr.seek(0)
            retstr.truncate(0)
            # only select lines from the line containing 'Tool' to the line containing "1 The 'All'"
            idx = containsSubString(lines, 'Tool')
            lines = lines[idx + 1:]
            idx = containsSubString(lines, "1 The 'All'")
            if idx != -1:
                lines = lines[:idx]
            records.extend(lines)
    device.close()
    retstr.close()
    return records


# process pdf
fn = '../test1.pdf'
ft = 'test.txt'
text = pdfToText(fn)
with open(ft, 'w', encoding='utf-8') as outFile:
    for line in text:
        outFile.write(line + '\n')
</code></pre>
<p>It produces a text file and gets all of the text, but the spacing of the x's is not preserved. The output looks like this:
<img src="https://i.stack.imgur.com/pmV2O.png" alt="enter image description here"/></p>
<p>In the text file the x's are separated by only a single space, so their column positions are lost.</p>
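<p>One direction that might help, sketched under assumptions: pdfminer's <code>PDFPageAggregator</code> (instead of <code>TextConverter</code>) returns layout objects that carry a bounding box, so each text fragment's left edge <code>x0</code> is available. Given those coordinates, a fragment can be placed into a column by comparing <code>x0</code> with the column boundaries. The function below works on plain <code>(x0, text)</code> pairs so it can be tried without a PDF; the name <code>assign_columns</code> and the boundary values are hypothetical and would have to be measured from the actual document.</p>

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def assign_columns(fragments: Sequence[Tuple[float, str]],
                   boundaries: Sequence[float]) -> List[str]:
    """Place text fragments into table cells by their left x-coordinate.

    fragments:  (x0, text) pairs for one table row.
    boundaries: sorted x-positions where each column after the first begins
                (hypothetical values, to be measured from the real PDF).
    Returns one string per column; an empty string when nothing fell in it.
    """
    cells = [""] * (len(boundaries) + 1)
    for x0, text in sorted(fragments):
        # bisect_right counts how many boundaries lie at or left of x0,
        # which is exactly the index of the column containing x0
        col = bisect_right(boundaries, x0)
        cells[col] = (cells[col] + " " + text).strip()
    return cells
```

<p>With boundaries at 200, 260 and 320, a row made of "Hammer" at x0=50 and two x's at x0=210 and x0=330 would come out as four cells with the third one empty, which keeps the information that plain text output loses.</p>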
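<p>Right now I am only producing text output, but my goal is to produce an HTML document containing the data from the tables. I have been looking for OCR examples, and most of them seem confusing or incomplete. I am open to using C or any other language that could produce the results I am looking for.</p>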
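<p>For the HTML step, a minimal sketch using only the standard library could look like the following. It assumes the table has already been recovered as a header list plus rows of cell strings; <code>rows_to_html</code> is a hypothetical helper name, not part of pdfminer.</p>

```python
from html import escape
from typing import Sequence


def rows_to_html(header: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Render a header and rows of cell strings as a simple HTML table."""
    lines = ["<table>"]
    # header row: one <th> per column name, escaped for safety
    lines.append("  <tr>" + "".join(f"<th>{escape(h)}</th>" for h in header) + "</tr>")
    for row in rows:
        # data rows: one <td> per cell; empty cells stay empty
        lines.append("  <tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

<p>The returned string can be written to a file with <code>open(..., 'w', encoding='utf-8')</code> or embedded in a larger page template.</p>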
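<p><strong>Edit:</strong> There will be multiple PDFs like this one that I need to get the table data from. As far as I know, the headers are the same in all of the PDFs.</p>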