DoT-Net Python package - PyPI

Document Extraction

Purpose:

Converts unstructured OCR documents into structured key-value pairs.

Required packages:

Wand
pytesseract
Tesseract
Ghostscript
ImageMagick
OpenCV
scikit-learn (sklearn)
Keras
TensorFlow

Usage

Replace the absolute path of the PDF in the main function of GETO2.0.py.

Key techniques used:

Deep learning
Ensemble learning

Machine learning architecture description

DoT-Net: DoT-Net is a CNN architecture that classifies and segments the text elements in a document.
RFClassifier: RFClassifier is an ensemble deep learning architecture used to detect TOC (table of contents) pages within the document.

Framework architecture flow chart

[Image: framework architecture flow chart]

Code:

GETO2.0.py is the interface for our framework.
Segmentation.py is the module for DoT-Net. This function is used in GETO2.0.py.
TOCclassifier.py is the module that detects the TOC in the document. This function is used in GETO2.0.py.
TESSARACT.py is used to extract text entities from the blocks of text detected in Segmentation.py. This function is used in TOCclassifier.py.
BlockParsing.py is used to extract TOC entities from the TOC pages detected in TOCclassifier. This function is used in Segmentation.py.

Code flow:

[Image: code flow diagram]

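Detailed code description:

GETO2.0.py:

GETO2.0 is the main interface for our framework. Each page of the input PDF file is converted to an image using the Wand library. Each converted image is then checked for a TOC by the TOC classifier (we only check the first N pages for a TOC).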
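
The page routing described above can be sketched as follows. This is a minimal illustration, not the code in GETO2.0.py: the names `pdf_to_images`, `route_pages`, `is_toc`, and `first_n`, and the 200 dpi resolution, are assumptions made for the example. The classifier is passed in as a plain callable so that the routing logic can be read on its own.

```python
def pdf_to_images(pdf_path, resolution=200):
    """Yield each PDF page as PNG bytes using Wand (requires ImageMagick and Ghostscript)."""
    from wand.image import Image  # imported lazily so the routing code runs without Wand

    with Image(filename=pdf_path, resolution=resolution) as pdf:
        for page in pdf.sequence:
            with Image(image=page) as page_image:
                yield page_image.make_blob("png")


def route_pages(pages, is_toc, first_n):
    """Split pages into TOC pages and non-TOC pages.

    Only the first `first_n` pages are passed to `is_toc` (a hypothetical
    stand-in for the TOC classifier); later pages are treated as non-TOC
    without being classified.
    """
    toc_pages, other_pages = [], []
    for index, page in enumerate(pages):
        if index < first_n and is_toc(page):
            toc_pages.append(page)
        else:
            other_pages.append(page)
    return toc_pages, other_pages
```

A call such as `route_pages(list(pdf_to_images(path)), is_toc=my_classifier, first_n=10)` would then give the two page lists that the following steps work on; `my_classifier` and the value 10 are placeholders.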

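[x] Page detected as TOC.
- TOCclassifier.py: TOCclassifier checks the page for a TOC. If the page is classified as a TOC, we use TESSARACT.py to extract the text of the TOC and append it to a list.
  - TESSARACT.py: TESSARACT.py uses pytesseract to extract the text from the TOC. pytesseract is a Python wrapper for Tesseract, a framework for extracting text from images.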
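
The TOC text extraction step could look roughly like the sketch below. It is an assumption-laden illustration rather than the contents of TESSARACT.py: the function name `extract_toc_lines` and the `ocr` parameter are hypothetical. By default it calls `pytesseract.image_to_string`, which needs a local Tesseract installation; passing another callable makes it possible to try the logic without one.

```python
def extract_toc_lines(page_image, ocr=None):
    """Run OCR on a TOC page image and return its non-empty text lines.

    `ocr` is any callable that maps an image to a string. When omitted,
    pytesseract.image_to_string is used.
    """
    if ocr is None:
        import pytesseract  # imported lazily; requires Tesseract to be installed
        ocr = pytesseract.image_to_string

    text = ocr(page_image)
    # Drop blank lines and surrounding whitespace left over from OCR.
    return [line.strip() for line in text.splitlines() if line.strip()]


# Accumulate TOC text across pages, as described in the text above.
toc_text = []
fake_ocr = lambda image: "Contents\n\n 1 Introduction .... 3 \n"
toc_text.extend(extract_toc_lines("page-1.png", ocr=fake_ocr))
```
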
[x] Page detected as non-TOC.
- Note: pages after the first N are also treated as non-TOC.
- Segmentation.py: Segmentation performs multiple tasks.
  - It segments the pages using image morphology methods and contour functions to find the connected components (blocks).
  - A sliding window is passed over these connected components to generate 100 * 100 tiles (DoT-Net takes 100 * 100 tiles as input to classify).
  - Data duplication or augmentation is performed on blocks smaller than 100 * 100 (headings in particular produce blocks smaller than 100 * 100), so that these blocks are not lost.
  - These 100 * 100 tiles are then classified using DoT-Net.
  - After patch classification, we use majority voting to predict the label of the block.
  - If the block label is text, we use BlockParsing.py to extract the text from the block.
  - Note: DoT-Net can detect other classes such as Table, Image, Mathematical Expression, and Line Drawing, but for this project we focus only on Text.
  - BlockParsing.py uses pytesseract to extract the text.
  - The extracted text is appended to a list.
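
The tiling, duplication, and voting steps above can be sketched in plain Python. This is a minimal illustration, not the project's Segmentation.py: the function names, the list-of-rows block representation, and the non-overlapping stride are assumptions, and the DoT-Net classifier itself is left out (tile labels are passed in directly). Partial tiles at the right and bottom edges are simply dropped in this sketch.

```python
from collections import Counter

TILE = 100  # tile side length stated in the text


def pad_by_duplication(block, size=TILE):
    """Repeat a small block (a list of pixel rows) until it is at least size x size."""
    height, width = len(block), len(block[0])
    rows = [block[r % height] for r in range(max(height, size))]
    return [[row[c % width] for c in range(max(width, size))] for row in rows]


def sliding_tiles(block, size=TILE, stride=TILE):
    """Yield size x size tiles cut from the block, padding it first if needed."""
    block = pad_by_duplication(block, size)
    height, width = len(block), len(block[0])
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            yield [row[left:left + size] for row in block[top:top + size]]


def majority_vote(tile_labels):
    """Return the most frequent tile label, used as the label of the whole block."""
    return Counter(tile_labels).most_common(1)[0][0]
```

For a heading-sized block of 30 * 250 pixels, `pad_by_duplication` repeats the rows until the block is 100 pixels high, and `sliding_tiles` then yields two full tiles. Feeding the per-tile predictions to `majority_vote` gives the block label that decides whether the block is sent on for text extraction.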
[x] Text from the TOC and from the rest of the PDF document is extracted and appended to the respective lists.
- Once both lists are filled, we use fuzzy matching and regular expression matching techniques to create the JSON files.
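
One way to combine regular expressions and fuzzy matching for this final step is sketched below. The source does not say how the matching is done, so everything here is an assumption: the TOC line pattern, the 0.8 similarity threshold, the use of `difflib` from the standard library, and the choice to map each TOC title to the body text that follows it.

```python
import json
import re
from difflib import SequenceMatcher

# Assumed TOC line shape: a title, optional dot leaders, then a page number.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.]*\d+$")


def clean_toc_entry(line):
    """Strip dot leaders and the trailing page number from a TOC line."""
    match = TOC_LINE.match(line.strip())
    return match.group("title").strip() if match else line.strip()


def best_match(title, lines, threshold=0.8):
    """Return the index of the line most similar to title, or None if below threshold."""
    best_index, best_score = None, 0.0
    for index, line in enumerate(lines):
        score = SequenceMatcher(None, title.lower(), line.lower()).ratio()
        if score > best_score:
            best_index, best_score = index, score
    return best_index if best_score >= threshold else None


def build_key_values(toc_lines, body_lines):
    """Map each TOC title to the body text between its heading and the next one."""
    starts = []
    for title in (clean_toc_entry(line) for line in toc_lines):
        index = best_match(title, body_lines)
        if index is not None:
            starts.append((index, title))
    starts.sort()
    result = {}
    for n, (index, title) in enumerate(starts):
        end = starts[n + 1][0] if n + 1 < len(starts) else len(body_lines)
        result[title] = " ".join(body_lines[index + 1:end])
    return result


toc = ["Introduction ........ 1", "Methods ..... 4"]
body = ["lntroduction", "This paper studies X.", "Methods", "We use Y.", "It works."]
pairs = build_key_values(toc, body)
print(json.dumps(pairs, indent=2))
```

The fuzzy comparison is what lets the OCR misreading "lntroduction" in the body still match the TOC entry "Introduction".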

DoT-Net 0.1.1
