Document Extraction

Detailed description of the DoT-Net Python project


Purpose:

Converts unstructured OCR documents into structured key-value pairs.

Required packages:

  • Wand
  • Pytesseract
  • Tesseract
  • Ghostscript
  • ImageMagick
  • OpenCV
  • scikit-learn (sklearn)
  • Keras
  • TensorFlow

Usage

Replace the absolute path of the PDF in the main function of GETO2.0.py.

Key techniques used:

  • Deep learning
  • Ensemble learning

Machine learning architecture description

  • DoT-Net: DoT-Net is a novel CNN architecture that classifies and segments the text elements in a document.
  • RFClassifier: RFClassifier is an ensemble deep learning architecture used to detect table of contents (TOC) pages within the document.

Framework structure flowchart

[Figure: framework structure flowchart]

The code is organized as follows:

  • GETO2.0.py is the interface for our framework.
  • Segmentation.py is the module for DoT-Net. This function is used in GETO2.0.py
  • TOCclassifier.py is the module that detects the TOC in the document. This function is used in GETO2.0.py
  • TESSARACT.py extracts text entities from the blocks of text detected in Segmentation.py. This function is used in TOCclassifier.py
  • BlockParsing.py extracts TOC entities from the TOC pages detected by TOCclassifier. This function is used in Segmentation.py

Code flow:

[Figure: code flow diagram]

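Detailed code description:

GETO2.0.py:

GETO2.0 is the main interface of our framework. Each page of the input PDF file is converted to an image using the Wand library. Each converted image is then checked for a TOC by the TOC classifier (we only check the first N pages for a TOC).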
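The page conversion and the "first N pages only" rule can be sketched as follows. This is a minimal illustration and not the project's actual code: the function names `pdf_to_png_pages` and `route_pages`, the 200 dpi resolution, and the `is_toc` callable (a stand-in for the TOC classifier) are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def pdf_to_png_pages(pdf_path: str, resolution: int = 200) -> List[bytes]:
    """Render each PDF page to PNG bytes with Wand.

    Hypothetical helper; needs ImageMagick and Ghostscript installed.
    """
    # Imported lazily so the routing logic below can run without Wand.
    from wand.image import Image

    pages = []
    with Image(filename=pdf_path, resolution=resolution) as pdf:
        for frame in pdf.sequence:          # one frame per PDF page
            with Image(image=frame) as page:
                pages.append(page.make_blob("png"))
    return pages


def route_pages(
    pages: Sequence[object],
    is_toc: Callable[[object], bool],
    first_n: int,
) -> Tuple[List[int], List[int]]:
    """Split page indices into (TOC pages, non-TOC pages).

    Only the first `first_n` pages are shown to the classifier;
    every later page is treated as non-TOC without being checked.
    """
    toc_pages, other_pages = [], []
    for index, page in enumerate(pages):
        if index < first_n and is_toc(page):
            toc_pages.append(index)
        else:
            other_pages.append(index)
    return toc_pages, other_pages
```

In this sketch the TOC indices would go to the TOC text-extraction branch and the remaining indices to the segmentation branch.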

  • [x] Pages detected as TOC.
    • TOCclassifier.py: TOCclassifier checks the page for a TOC. If the page is classified as a TOC, we use TESSARACT.py to extract the TOC's text and append it to a list.
      • TESSARACT.py: TESSARACT.py uses pytesseract (a Python wrapper for Tesseract, which is a framework for extracting text from images) to extract the text from the TOC.
  • [x] Pages detected as non-TOC.
    • Note: Pages after the first N are also treated as non-TOC.
    • Segmentation.py: Segmentation performs multiple tasks.
      • It segments the pages using image morphology methods and contour functions to find the Connected Components (blocks).
      • A sliding window is passed over these Connected Components to generate tiles of size 100 * 100 (DoT-Net takes 100 * 100 tiles as its classification input).
      • Blocks smaller than 100 * 100 (headings in particular) are enlarged by data duplication or augmentation to avoid losing data.
      • These 100 * 100 tiles are then classified using DoT-Net.
      • After tile classification, we use majority voting to predict the label of the block.
      • If the block label is text, we use BlockParsing.py to extract the text from the block.
      • Note: Our DoT-Net can detect other classes such as Table, Image, Mathematical Expressions, and Line drawings, but for this project we focus only on Text.
      • BlockParsing.py uses pytesseract to extract the text.
      • The extracted text is appended to a list.
  • [x] Text from the TOC and from the rest of the PDF document is extracted and appended to their respective lists.
    • Once the text from the TOC and from the rest of the PDF document has been extracted and appended to the lists, we use fuzzy matching and regular-expression matching techniques to create the JSON files.
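The tiling, duplication, and majority-voting steps of the segmentation branch can be sketched with plain arithmetic. This is an illustration under assumptions, not the project's implementation: the function names are hypothetical, the stride of 50 pixels is an assumed value (the description does not state one), and duplication is modeled as repeating the block whole until it reaches 100 * 100.

```python
from collections import Counter
from math import ceil
from typing import List, Sequence, Tuple

TILE = 100  # DoT-Net input size stated in the description


def duplication_factors(height: int, width: int, tile: int = TILE) -> Tuple[int, int]:
    """Number of vertical and horizontal repeats needed to reach tile x tile."""
    return ceil(tile / height), ceil(tile / width)


def tile_origins(
    height: int, width: int, tile: int = TILE, stride: int = 50
) -> List[Tuple[int, int]]:
    """(top, left) corners of the sliding-window tiles over one block.

    Blocks smaller than the tile are first enlarged by duplication.
    The stride is an assumption; any remainder at the right or bottom
    edge that does not fit a full tile is skipped in this sketch.
    """
    reps_y, reps_x = duplication_factors(height, width, tile)
    padded_h, padded_w = height * reps_y, width * reps_x
    return [
        (top, left)
        for top in range(0, padded_h - tile + 1, stride)
        for left in range(0, padded_w - tile + 1, stride)
    ]


def majority_vote(tile_labels: Sequence[str]) -> str:
    """Label of the block: the most frequent label among its tiles."""
    return Counter(tile_labels).most_common(1)[0][0]
```

With image arrays, the same duplication could be applied with `numpy.tile` before slicing each 100 * 100 window; the per-tile labels would come from the trained DoT-Net model, which is not reproduced here.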
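The final step, combining the TOC list and the body-text list into key-value output, might look like the sketch below. The description does not say which fuzzy-matching library or which patterns the project uses, so everything here is an assumption: the standard library's `difflib.SequenceMatcher` stands in for the fuzzy matcher, the `TOC_LINE` pattern assumes TOC lines of the form "title ..... page", and the 0.8 threshold and function names are hypothetical.

```python
import json
import re
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

# Assumed TOC line layout: a title, optional dot leaders, a page number.
TOC_LINE = re.compile(r"^\s*(?P<title>.+?)[\s.]*(?P<page>\d+)\s*$")


def parse_toc_line(line: str) -> Optional[Tuple[str, int]]:
    """Split one OCR'd TOC line into (title, page), or None if it does not fit."""
    match = TOC_LINE.match(line)
    if match is None:
        return None
    return match.group("title").strip(), int(match.group("page"))


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_key_values(
    toc_titles: List[str], body_lines: List[str], threshold: float = 0.8
) -> Dict[str, str]:
    """Use TOC titles as keys and the body text under each heading as values.

    A body line counts as a heading when it fuzzily matches a TOC title,
    which tolerates small OCR errors in either list.
    """
    sections: Dict[str, List[str]] = {}
    current = None
    for line in body_lines:
        best = max(toc_titles, key=lambda t: similarity(t, line), default=None)
        if best is not None and similarity(best, line) >= threshold:
            current = best
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return {title: " ".join(lines) for title, lines in sections.items()}


def to_json(sections: Dict[str, str]) -> str:
    """Serialize the key-value pairs as a JSON document."""
    return json.dumps(sections, ensure_ascii=False, indent=2)
```

Body lines that appear before the first matched heading are dropped in this sketch; a real pipeline would need to decide where such text belongs.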
