Python PDF-Layout-Scanner package - PyPI

PDF-Layout-Scanner: detailed project description

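About

This script uses pdfminer to convert PDF files to text (http://www.unixuser.org/~euske/python/pdfminer/index.html).

pdfminer is a PDF parsing library written in Python by Yusuke Shinyama.

In addition to the pdf2txt.py and dumppdf.py command-line tools, it provides a way to analyze the content tree of each page programmatically.

This project is an example of using pdfminer that picks up where the default documentation (http://www.unixuser.org/~euske/python/pdfminer/programming.html#layout) leaves off.

The code is still a work in progress, with room for improvement.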

Installation

The package is available on PyPI, so it can be installed with pip.

pip3 install pdf_layout_scanner

Advantages over pdfminer

This script extracts text from PDFs that contain multiple columns.
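
To illustrate why multi-column pages need special handling, here is a minimal sketch of one possible way to put text boxes into reading order. It is not the package's actual algorithm: the `Box` tuple, the function names, and the 20-point `tolerance` are all hypothetical, chosen only for illustration. The idea is to group boxes by their left edge, then read each column top to bottom.

```python
from typing import List, Tuple

# Hypothetical stand-in for a text box: (x0, y0, x1, y1, text).
# PDF coordinates are assumed, so a larger y0 means higher on the page.
Box = Tuple[float, float, float, float, str]


def group_into_columns(boxes: List[Box], tolerance: float = 20.0) -> List[List[Box]]:
    """Group boxes whose left edge lies within `tolerance` of a column's first box."""
    columns: List[List[Box]] = []
    # Visiting boxes from left to right means columns are created left to right.
    for box in sorted(boxes, key=lambda b: b[0]):
        for column in columns:
            if abs(column[0][0] - box[0]) <= tolerance:
                column.append(box)
                break
        else:
            # No existing column is close enough, so start a new one.
            columns.append([box])
    return columns


def read_in_order(boxes: List[Box], tolerance: float = 20.0) -> str:
    """Return the text column by column, each column read from top to bottom."""
    lines = []
    for column in group_into_columns(boxes, tolerance):
        for box in sorted(column, key=lambda b: -b[1]):
            lines.append(box[4])
    return "\n".join(lines)


sample = [
    (300.0, 700.0, 500.0, 720.0, "right top"),
    (50.0, 700.0, 250.0, 720.0, "left top"),
    (52.0, 600.0, 250.0, 620.0, "left bottom"),
    (305.0, 600.0, 500.0, 620.0, "right bottom"),
]
print(read_in_order(sample))
```

A fixed tolerance like this is fuzzy by design, and it would likely fail on pages with indented paragraphs or columns of uneven width.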

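Usage

General usage

from pdf_layout_scanner import layout_scanner

# get a list of the table of contents
layout_scanner.get_toc('/path/to/your/pdf-file.pdf')

# get the full text
layout_scanner.get_pages('/path/to/your/pdf-file.pdf')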

Example

from pdf_layout_scanner import layout_scanner

toc = layout_scanner.get_toc('/path/to/your/pdf-file.pdf')
print(len(toc))  # the number of elements in the PDF document's table of contents
print(toc[0])    # a tuple containing the ordinal sequence and the title string,
                 # for example: (1, u'Introduction')

pages = layout_scanner.get_pages('/path/to/your/pdf-file.pdf')
print(len(pages))  # should return the number of pages in the PDF document
print(pages[0])    # a string of all the text on the first page
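
Once the two calls above return, the results are ordinary Python values and can be post-processed without the library. The sketch below assumes the shapes described in the comments: a list of `(number, title)` tuples for the table of contents and a list of strings for the pages. The helper names `format_toc` and `page_word_counts` are hypothetical, and the sample data stands in for real output.

```python
from typing import List, Tuple


def format_toc(toc: List[Tuple[int, str]]) -> str:
    """Render each (number, title) entry on its own line."""
    return "\n".join(f"{number}: {title}" for number, title in toc)


def page_word_counts(pages: List[str]) -> List[int]:
    """Count whitespace-separated words on each page."""
    return [len(page.split()) for page in pages]


# Stand-in data shaped like the values described above (not real output).
sample_toc = [(1, "Introduction"), (2, "Methods")]
sample_pages = ["First page text here.", "Second page."]

print(format_toc(sample_toc))
print(page_word_counts(sample_pages))
```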

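Room for improvement

Column merging: the fuzzy heuristic I described works well for the PDF files I have parsed so far, but I can imagine more complex documents where it would break down (perhaps this is where the analysis should be more sophisticated, and not ignore so many types of pdfminer.layout.LT* objects).
Image extraction: I would like to be at least as good as pdftoimages, and save every file in the default ppm or pnm format, but I am not sure what I could be doing differently.
Title and heading capitalization: this seems to be an issue with pdfminer, since I get similar results when using the command-line tools, but it is annoying to have to go back and fix all the incorrect capitalization by hand, especially for larger documents.
Title and heading fonts and spacing: a related issue, though probably one in my own code, is that those same titles and paragraph headings are not distinguished from the rest of the text. In many cases I have to go back and add vertical spacing and font attributes manually.
Page number removal: originally I thought I could use a regex for an all-numeric value on a single physical line, but each document numbers its pages slightly differently, and it is very hard to remove them without proofreading every page by hand.
Footnotes: handling these when the note and the reference appear on the same page is hard enough, but it is worse when they span different (even consecutive) pages.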
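
The page number item above mentions a regex for an all-numeric value on a single physical line. A minimal sketch of that idea follows. The pattern and the function name `strip_numeric_lines` are assumptions for illustration, not code taken from the package.

```python
import re

# Matches a line that holds only digits, with optional surrounding whitespace.
PAGE_NUMBER_LINE = re.compile(r"^\s*\d+\s*$")


def strip_numeric_lines(page_text: str) -> str:
    """Drop every line that consists solely of digits."""
    kept = [
        line
        for line in page_text.splitlines()
        if not PAGE_NUMBER_LINE.match(line)
    ]
    return "\n".join(kept)


print(strip_numeric_lines("Introduction\n12\nBody with 3 words\n  7  "))
print(strip_numeric_lines("Page 12\nMore text"))
```

The second call shows the limitation described above: a footer such as "Page 12" is left in place, and a legitimate line that happens to contain only a number would be removed.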

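Contributions

In this forked project, I made several changes to the original.

Added support for text inside LTFigure objects.
Optimized data handling: storage changed from a plain dict to a DataFrame, which should make further contributions easier.
Added a progress bar.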

Github

https://github.com/yoshihikoueno/pdfminer-layout-scanner

PDF-Layout-Scanner 1.3.2
