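liteocr Python package - PyPI

A lightweight OCR engine.

liteocr project description

This library provides a clean interface for segmenting and recognizing text in an image. It is optimized for printed text, such as scanned documents and website screenshots.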

Installation

pip install liteocr

The installation includes both the liteocr Python 3 library and a command-line executable.

Usage

`>> liteocr`

Performs OCR on an image file and writes the recognition results to JSON.

usage: LiteOCR [-h] [-d] [--extra-whitelist str] [--all-unicode] [--lang str]
               [--min-text-size int] [--max-text-size int]
               [--uniformity-thresh :0.0<=float<1.0]
               [--thin-line-thresh :odd int] [--conf-thresh :0<=int<100]
               [--box-expand-factor :0.0<=float<1.0]
               [--horizontal-pooling int]
               str str

positional arguments:
  str                   image file
  str                   output JSON file

optional arguments:
  -h, --help            show this help message and exit
  -d, --display         display recognized bounding boxes and text on top of the image

engine:
  parameters to liteocr.OCREngine constructor

  --extra-whitelist str
                        string of extra chars for Tesseract to consider only
                        takes effect when all_unicode is False
  --all-unicode         if True, Tesseract will consider all possible unicode
                        characters
  --lang str            language in the text. Defaults to English.

recognition:
  parameters to OCREngine.recognize() method

  --min-text-size int   min text height/width in pixels, below which will be
                        ignored
  --max-text-size int   max text height/width in pixels, above which will be
                        ignored
  --uniformity-thresh :0.0<=float<1.0
                        ignore a region if the number of pixels neither black
                        nor white < [thresh]
  --thin-line-thresh :odd int
                        remove all lines thinner than [thresh] pixels.can be
                        used to remove the thin borders of web page textboxes.
  --conf-thresh :0<=int<100
                        ignore regions with OCR confidence < thresh.
  --box-expand-factor :0.0<=float<1.0
                        expand the bounding box outwards in case certain chars
                        are cutoff.
  --horizontal-pooling int
                        result bounding boxes will be more connected with more
                        pooling, but large pooling might lower accuracy.
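
As a rough sketch of how the command line above might be driven from a script, the helper below assembles an argument list using only flags that appear in the help text. The name `build_liteocr_command` and its keyword handling are illustrative assumptions, not part of liteocr, and actually running the command requires liteocr to be installed.

```python
def build_liteocr_command(image_file, output_json, display=False, **options):
    """Assemble an argv list for the `liteocr` executable.

    Hypothetical helper: keyword names map onto the long flags in the help
    text above, e.g. conf_thresh=60 becomes "--conf-thresh 60".
    """
    cmd = ["liteocr"]
    if display:
        cmd.append("-d")
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            # boolean switches such as --all-unicode take no value
            cmd.append(flag)
        elif value is not False and value is not None:
            cmd.extend([flag, str(value)])
    # positional arguments: image file first, then output JSON file
    cmd.extend([image_file, output_json])
    return cmd


if __name__ == "__main__":
    cmd = build_liteocr_command("my_img.png", "result.json",
                                conf_thresh=60, min_text_size=8)
    print(" ".join(cmd))
    # To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Building the list separately from running it keeps the flag mapping easy to inspect before anything is executed.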

Python 3 library

from liteocr import OCREngine, load_img, draw_rect, draw_text, disp

image_file = 'my_img.png'
img = load_img(image_file)

# you can either use context manager or call engine.close() manually at the end.
with OCREngine() as engine:
    # engine.recognize() can accept a file name, a numpy image, or a PIL image.
    for text, box, conf in engine.recognize(image_file):
        print(box, '\tconfidence =', conf, '\ttext =', text)
        draw_rect(img, box)
        draw_text(img, text, box, color='bw')

# display the image with recognized text boxes overlaid
disp(img, pause=False)
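
Since `engine.recognize()` is iterated as `(text, box, conf)` tuples, ordinary generator tools should apply to its output. The sketch below filters results by confidence and joins the surviving text. `keep_confident` is a hypothetical helper, and the sample tuples only stand in for real engine output: the box values are made up and treated as opaque.

```python
def keep_confident(results, min_conf):
    """Yield only the (text, box, conf) tuples with conf >= min_conf."""
    for text, box, conf in results:
        if conf >= min_conf:
            yield text, box, conf


# Stand-in data shaped like the tuples in the example above; with liteocr
# installed, pass engine.recognize(image_file) instead.
sample = [
    ("Hello", (10, 10, 80, 20), 93),
    ("w0rld", (100, 10, 70, 20), 41),
    ("again", (10, 40, 75, 20), 78),
]

confident_text = " ".join(text for text, _, _ in keep_confident(sample, 60))
print(confident_text)
```

The same filtering can also be requested up front through the `--conf-thresh` option on the command line.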

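Notes

I have deprecated the old code and moved it to a separate folder. The old API called Tesseract directly on the entire image. The recall was poor, and I only realized later why:

The command-line Tesseract makes very strange global page segmentation decisions. It ignores some text with no apparent pattern. I tried many different combinations of the few tunable parameters, but none of them helped. My hands were tied because Tesseract is poorly documented and very few people ask about these issues on Stack Overflow.
Tesserocr is a Python package that wraps Tesseract's C++ API. It has some native API methods that can iterate through text regions, but they randomly fail with segfaults (argh!). I spent a lot of time trying to fix this, but gave up in despair…
Tesseract is the best open-source OCR engine, which means I have no other choice. I considered using Google's online OCR API, but we should not be constrained by network connectivity and API call limits.

So I ended up with a new workflow:

Apply OpenCV magic to produce a better text segmentation.
Run Tesseract on each segmented text box. This is also much more transparent than running it on the whole image.
Collect the text results and mean confidence levels (yielded as a generator).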
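
The three steps above can be sketched as a small generator pipeline. This is not liteocr's actual implementation: `segment` and `recognize_box` are hypothetical callables standing in for the OpenCV segmentation and the per-box Tesseract call, and averaging per-word confidences is an assumption about what "mean confidence" means here.

```python
def recognize_regions(image, segment, recognize_box):
    """Yield (text, box, mean_confidence) for each segmented text box.

    `segment(image)` returns an iterable of boxes, and
    `recognize_box(image, box)` returns a list of (word, confidence) pairs.
    Both are hypothetical stand-ins, not liteocr functions.
    """
    for box in segment(image):
        words = recognize_box(image, box)
        if not words:
            continue  # nothing recognized in this box
        text = " ".join(word for word, _ in words)
        mean_conf = sum(conf for _, conf in words) / len(words)
        yield text, box, mean_conf


# Toy stand-ins so the sketch runs without OpenCV or Tesseract.
def fake_segment(image):
    return [(0, 0, 50, 10), (0, 20, 50, 10), (0, 40, 50, 10)]


def fake_recognize_box(image, box):
    table = {
        (0, 0, 50, 10): [("lite", 90), ("ocr", 80)],
        (0, 20, 50, 10): [],
        (0, 40, 50, 10): [("engine", 70)],
    }
    return table[box]


results = list(recognize_regions(None, fake_segment, fake_recognize_box))
for text, box, conf in results:
    print(box, conf, text)
```

Yielding one result per box means a caller can start consuming text before the whole image has been processed, and a failure in one box stays local to that box.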

liteocr 0.2.1
