How do I extract text from a PDF file?

import PyPDF2

pdf_file = open('sample.pdf', 'rb')  # PDFs must be opened in binary mode
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
print(page_content)

3 Answers

User

Answer 1 · Edited 2024-04-25 16:49:44

Use textract.

It supports many file types, including PDF.

import textract
text = textract.process("path/to/file.extension")
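
As a small follow-up sketch: as far as I know, textract.process returns raw bytes rather than a str, so the result usually needs decoding before you print or search it. The helper names decode_extracted and pdf_to_text below are hypothetical, and UTF-8 is an assumed encoding, not something the answer states.

```python
def decode_extracted(raw, encoding="utf-8"):
    """Turn extractor output into a str, replacing bytes that cannot be decoded."""
    if isinstance(raw, bytes):
        return raw.decode(encoding, errors="replace")
    return raw  # already a str, nothing to do


def pdf_to_text(path):
    """Hypothetical helper: extract text from `path` with textract."""
    # Imported lazily so this module still loads when textract is not installed.
    import textract

    return decode_extracted(textract.process(path))


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py path/to/file.pdf
    if len(sys.argv) > 1:
        print(pdf_to_text(sys.argv[1]))
```

Using errors="replace" trades exactness for robustness: a stray undecodable byte becomes a replacement character instead of raising an exception.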

User

Answer 2 · Edited 2024-04-25 16:49:44

Take a look at this code:

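import PyPDF2

# Legacy PyPDF2 names (PdfFileReader, getPage, extractText);
# they no longer work in PyPDF2 3.0 and later.
pdf_file = open('sample.pdf', 'rb')  # binary mode is required
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)  # first page only
page_content = page.extractText()
print(page_content)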

The output is:

!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%

Using the same code to read 201308FCR.pdf, the output is normal.

Its documentation explains why:

def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text.  This works well for some PDF
    files, but poorly for others, depending on the generator used.  This will
    be refined in the future.  Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """
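
Since the docstring says extraction works well for some files and poorly for others, a rough check on the result can tell you when to fall back to another tool. The sketch below is only a heuristic under my own assumptions: letter_ratio, looks_garbled and the 0.5 threshold are made up for illustration, and the reader part assumes the newer pypdf package (PdfReader, reader.pages, extract_text) rather than the legacy calls shown above.

```python
def letter_ratio(text):
    """Fraction of non-whitespace characters that are letters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalpha() for ch in visible) / len(visible)


def looks_garbled(text, threshold=0.5):
    """Heuristic: mostly symbols or digits (or no text at all) counts as garbled.

    The threshold is an arbitrary assumption; number-heavy pages such as
    tables may be flagged even though they were extracted correctly.
    """
    return letter_ratio(text) < threshold


def extract_all_pages(path):
    """Return the extracted text of every page as a list of strings."""
    # Imported lazily so the heuristic can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py sample.pdf
    if len(sys.argv) > 1:
        for number, text in enumerate(extract_all_pages(sys.argv[1]), start=1):
            status = "garbled?" if looks_garbled(text) else "ok"
            print(f"page {number}: {status}")
```

Output like the symbol soup shown earlier contains no letters at all, so it falls well below the threshold, while ordinary prose sits far above it.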

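User

Answer 3 · Edited 2024-04-25 16:49:44

I was looking for a simple solution for Python 3.x on Windows. Unfortunately, textract does not seem to support that setup. If you need a simple solution for Windows and Python 3, check out the tika package, which makes reading PDFs very straightforward.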

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

from tika import parser

raw = parser.from_file('sample.pdf')
print(raw['content'])
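
One thing to watch for, in my experience: raw['content'] can be None when Tika finds no text (a scanned PDF, for example), and the text it does return often comes padded with blank lines. A defensive sketch, where clean_content and pdf_to_text are hypothetical helper names and a working Java runtime for the Tika server is assumed:

```python
def clean_content(parsed):
    """Return stripped text from a Tika result dict, or '' when no text was found."""
    content = parsed.get("content") if parsed else None
    return content.strip() if content else ""


def pdf_to_text(path):
    """Hypothetical helper: parse `path` with Tika and return clean text."""
    # Imported lazily so this module still loads when tika is not installed.
    from tika import parser

    return clean_content(parser.from_file(path))


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py sample.pdf
    if len(sys.argv) > 1:
        print(pdf_to_text(sys.argv[1]))
```

Returning an empty string instead of None keeps later string operations such as split() or len() from raising an error.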
