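UnicodeEncodeError when reading a PDF with pyPdf
Hi everyone, I posted a question earlier about pyPdf, the Python tool. Please don't treat this one as a duplicate, because I am now running into the error described below.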
import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespace
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]))
f.close()
# or print contents to the standard out stream
print convertPdf2String("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
With the first PDF file I got this error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
With this PDF file, http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf, I got the following error instead:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 38: ordinal not in range(128)
How can I fix this?
1 Answer
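I tried it myself and got the same result. Ignore my earlier comment; I hadn't noticed that you were also writing the output to a file. That is where the problem is: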
f.write(convertPdf2String(sys.argv[1]))
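convertPdf2String returns a Unicode string, but file.write can only write bytes, so the call to f.write tries to convert the Unicode string automatically using the ASCII codec. The PDF evidently contains non-ASCII characters, so that conversion fails. The fix is to encode explicitly, with something like this: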
f.write(convertPdf2String(sys.argv[1]).encode("utf-8"))
# or
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
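To see the difference between these two options without needing pyPdf or the PDF itself, here is a minimal sketch. The sample string is made up for illustration; it stands in for whatever convertPdf2String returns and includes u'\xe7', the character from the second traceback.

```python
# Hypothetical sample text standing in for the output of convertPdf2String.
text = u"fa\xe7ade \u0939\u093f"

# Encoding with the plain ASCII codec is the step that raises the error.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Option 1: UTF-8 can represent every character; the result is a byte string.
as_utf8 = text.encode("utf-8")

# Option 2: stay in ASCII, replacing each non-ASCII character
# with an XML character reference such as &#231;
as_xmlrefs = text.encode("ascii", "xmlcharrefreplace")
```

UTF-8 keeps the text readable in any editor that understands the encoding, while xmlcharrefreplace gives a pure-ASCII file that is mainly useful if the output ends up in HTML or XML.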
Edit:
Here is the working source code, with only one line changed.
# Execute with "Hindi_Book.pdf" in the same directory
import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespace
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
f.close()
# or print contents to the standard out stream
print convertPdf2String("Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
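One alternative that is not part of the code above: io.open from the standard library takes an encoding argument and accepts Unicode strings directly, so the explicit .encode() call at the write site is not needed. This is only a sketch under assumptions: the string below is a made-up stand-in for the result of convertPdf2String, and the temporary output path is chosen purely for illustration.

```python
import io
import os
import tempfile

# Hypothetical stand-in for the Unicode string returned by convertPdf2String.
content = u"fa\xe7ade \u0939\u093f"

# Hypothetical output location; the script above writes to 'a.txt' instead.
out_path = os.path.join(tempfile.mkdtemp(), "a.txt")

# io.open encodes on write, so the Unicode string is passed as-is.
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(content)

# Reading back with the same encoding returns the original text.
with io.open(out_path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

The file ends up containing UTF-8 bytes, the same as with .encode("utf-8"), but the encoding is declared once when the file is opened rather than at every write.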