# Document Parser
A simple CLI tool that extracts all the text contained in a document.
Installation
Run the following commands before installing documentparser.
Debian/Ubuntu
- sudo apt-get update
- sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev
- apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
- pip install docparser
Mac OS X
- brew install pkg-config poppler
- brew cask install xquartz
- brew install poppler antiword unrtf tesseract swig
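Fedora/CentOS
Before you begin, be aware that there is no quick way to install DocParser on Fedora-based systems, because several dependencies are missing from the default repositories. This is probably the hardest route, but in the end you will be proud of yourself.
- yum -y update
- yum install python-pip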
Required by the .docx parser which uses lxml via python-docx.
- yum install libxml2 libxslt-devel libxml2-devel
Required by the .docx parser which uses lxml via python-docx.
- yum install libxslt
Required by the .doc and .ps parser.
- wget https://forensics.cert.org/cert-forensics-tools-release-el7.rpm
- rpm -Uvh cert-forensics-tools-release*rpm
- yum --enablerepo=forensics install antiword
- yum --enablerepo=forensics install pstotext
Required by the .pdf parser.
- yum install poppler-utils
Required by the .jpg, .png and .gif parsers.
- cd /opt
- yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel
Install AutoConf-Archive
- wget ftp://mirror.switch.ch/pool/4/mirror/epel/7/ppc64/a/autoconf-archive-2016.09.16-1.el7.noarch.rpm
- rpm -i autoconf-archive-2016.09.16-1.el7.noarch.rpm
Install Leptonica from Source
- wget http://www.leptonica.com/source/leptonica-1.75.3.tar.gz
- tar -zxvf leptonica-1.75.3.tar.gz
- cd leptonica-1.75.3
- ./autobuild
- ./configure
- make
- make install
- cd ..
Install Tesseract from Source
- wget https://github.com/tesseract-ocr/tesseract/archive/3.05.01.tar.gz
- tar -zxvf 3.05.01.tar.gz
- cd tesseract-3.05.01
- ./autogen.sh
- PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
- LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make
- make install
- ldconfig
- cd ..
Download and install tesseract language files
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/ben.traineddata
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.traineddata
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/tha.traineddata
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/osd.traineddata
- mv *.traineddata /usr/local/share/tessdata
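The language-file step above can also be sketched in Python, for example to generate the download URLs or to check which files have reached the tessdata directory. This is only an illustrative sketch: the helper names `traineddata_urls` and `missing_traineddata` are hypothetical, and the language list and base URL are simply copied from the commands above.

```python
from pathlib import Path

# Values copied from the wget commands above.
BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/3.04.00"
LANGUAGES = ["ben", "eng", "hin", "tha", "osd"]


def traineddata_urls(languages=LANGUAGES, base_url=BASE_URL):
    """Build one download URL per language code (hypothetical helper)."""
    return [f"{base_url}/{lang}.traineddata" for lang in languages]


def missing_traineddata(tessdata_dir, languages=LANGUAGES):
    """Return the language codes whose .traineddata file is absent."""
    root = Path(tessdata_dir)
    return [
        lang for lang in languages
        if not (root / f"{lang}.traineddata").is_file()
    ]


if __name__ == "__main__":
    for url in traineddata_urls():
        print(url)
    # /usr/local/share/tessdata is the target directory used above.
    print("missing:", missing_traineddata("/usr/local/share/tessdata"))
```

An empty `missing` list suggests that every language file was moved into place.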
Download Hindi Cube data
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.bigrams
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.fold
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.lm
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.nn
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.params
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.cube.word-freq
- wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/hin.tesseract_cube.nn
- mv hin.* /usr/local/share/tessdata
- ln -s /opt/tesseract-3.05.01 /opt/tesseract-latest
Required by .mp3 and .ogg parser
- yum install sox
- rm cert-forensics-tools-release-el7.rpm
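The "Required by ..." notes above pair file types with external tools, so a quick check of which formats are ready could look like the sketch below. The executable names are assumptions about what each package puts on PATH (for instance, `pdftotext` ships with poppler-utils), and `unsupported_formats` is a hypothetical helper, not part of documentparser.

```python
import shutil

# Extension -> executable, following the "Required by ..." notes above.
# The executable names are assumptions, not taken from documentparser.
REQUIRED_TOOLS = {
    ".doc": "antiword",
    ".ps": "pstotext",
    ".pdf": "pdftotext",  # provided by poppler-utils
    ".jpg": "tesseract",
    ".png": "tesseract",
    ".gif": "tesseract",
    ".mp3": "sox",
    ".ogg": "sox",
}


def unsupported_formats(required=REQUIRED_TOOLS, which=shutil.which):
    """Return {extension: tool} for every tool that is not found on PATH."""
    return {
        ext: tool for ext, tool in required.items()
        if which(tool) is None
    }


if __name__ == "__main__":
    missing = unsupported_formats()
    for ext, tool in sorted(missing.items()):
        print(f"{ext}: '{tool}' not found on PATH")
    if not missing:
        print("All external tools were found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.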
Install textract without unsupported features
- rm textract/requirements/python && cp requirements/textract/python textract/requirements/python
- cd textract && chmod +x setup.py
- python setup.py install
- yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config
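Once the installation has finished, a small smoke test can confirm that text extraction works. The sketch below assumes that the textract package installed above exposes `textract.process(path)` and that it returns bytes. The wrapper `extract_text` and the file name `sample.pdf` are hypothetical.

```python
def extract_text(path, process=None, encoding="utf-8"):
    """Extract text from a document and return it as a string.

    `process` defaults to textract.process; it is a parameter only so
    the helper can be exercised without textract installed.
    """
    if process is None:
        import textract  # imported lazily; requires the install above
        process = textract.process
    raw = process(path)
    # Decode bytes, but pass through anything that is already a string.
    return raw.decode(encoding) if isinstance(raw, bytes) else raw


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python smoke_test.py sample.pdf
    if len(sys.argv) > 1:
        print(extract_text(sys.argv[1]))
```

If the call raises an error for a given file type, the external tool for that format is the first thing to check.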