Simple PDF text extraction
Detailed description of the Python project pdftotext
pdftotext
import pdftotext

# Load your PDF
with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)

# Read some individual pages
print(pdf[0])
print(pdf[1])

# Read all the text into one string
print("\n\n".join(pdf))
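Because the PDF object can be iterated and indexed like a sequence of page strings, ordinary Python list tools apply to it. The sketch below is illustrative only: `find_pages` and `load_pages` are hypothetical helper names, not part of pdftotext, and `load_pages` assumes the library is installed and that the file path you pass exists.

```python
def find_pages(pages, keyword):
    """Return zero-based indices of pages whose text contains keyword.

    `pages` can be any iterable of strings, such as a pdftotext.PDF
    object or a plain list. The match is case-insensitive.
    """
    needle = keyword.lower()
    return [i for i, text in enumerate(pages) if needle in text.lower()]


def load_pages(path, password=None):
    """Hypothetical helper: read a PDF and return its pages as a list of strings."""
    import pdftotext  # imported here so find_pages works without the library

    with open(path, "rb") as f:
        if password is None:
            pdf = pdftotext.PDF(f)
        else:
            pdf = pdftotext.PDF(f, password)
        return list(pdf)


# Demonstration with plain strings standing in for extracted pages
sample_pages = ["Lorem ipsum dolor", "sit amet", "IPSUM again"]
print(find_pages(sample_pages, "ipsum"))
```

With a real file, the same search would be `find_pages(load_pages("lorem_ipsum.pdf"), "ipsum")`.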
OS Dependencies
Debian, Ubuntu, and friends:
sudo apt-get update
sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev
Fedora, Red Hat, and friends:
sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config
macOS:
brew install pkg-config poppler
Conda users may also need libgcc:
conda install libgcc
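Before building, it can help to confirm that the Poppler C++ development files are visible to pkg-config. The following is a minimal sketch, assuming the pkg-config module is named `poppler-cpp` (an assumption not stated in this document); `poppler_cpp_available` is a hypothetical helper name.

```python
import shutil
import subprocess


def poppler_cpp_available(module="poppler-cpp"):
    """Check whether pkg-config can find the given module.

    Returns True or False based on pkg-config's exit status,
    or None when pkg-config itself is not on the PATH.
    The default module name is an assumption.
    """
    if shutil.which("pkg-config") is None:
        return None
    result = subprocess.run(["pkg-config", "--exists", module])
    return result.returncode == 0


status = poppler_cpp_available()
if status is None:
    print("pkg-config is not installed")
elif status:
    print("poppler-cpp development files found")
else:
    print("poppler-cpp development files not found")
```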
Install
pip install pdftotext
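After installing, one quick way to confirm that the package is importable from the current environment is to look it up with the standard library. This is a sketch only; `is_installed` is a hypothetical helper name, not part of pdftotext.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("pdftotext"):
    print("pdftotext is available")
else:
    print("pdftotext is not installed; check the OS dependencies and rerun pip")
```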