代码是
from PyPDF2 import PdfFileReader
with open('HTTP_Book.pdf','rb') as file:
pdf=PdfFileReader(file)
pagedd=pdf.getPage(0)
print(pagedd.extractText())
此代码引发如下所示的错误:
^{pr2}$我在网上搜索发现了这个Troubleshooting "TypeError: ord() expected string of length 1, but int found" 但这帮不了什么忙。我知道这个错误的背景是什么,但不确定它与这里有什么关系?在
尝试改变pdf文件,它工作得很好。那么问题是什么:pdf文件或PyPDF2无法处理它?我知道根据文件,这种方法不太可靠:
This works well for some PDF files, but poorly for others, depending on the generator used
该如何处理?在
回溯:
Traceback (most recent call last):
File "pdf_reader.py", line 71, in <module>
print(pagedd.extractText())
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 2595, in ex
tractText
content = ContentStream(content, self.pdf)
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 2673, in __
init__
stream = BytesIO(b_(stream.getData()))
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\generic.py", line 841, in
getData
decoded._data = filters.decodeStreamData(self)
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 350, in
decodeStreamData
data = LZWDecode.decode(data, stream.get("/DecodeParms"))
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 255, in
decode
return LZWDecode.decoder(data).decode()
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 228, in
decode
cW = self.nextCode();
File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 205, in
nextCode
nextbits=ord(self.data[self.bytepos])
TypeError: ord() expected string of length 1, but int found
我明白了。这只是PyPDF2的一个限制。我用tika和beauthoulsoup来解析和提取文本,效果很好。尽管它只需要再多做点什么。在
相关问题 更多 >
编程相关推荐