Detailed description of the django-ocr-server Python project
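Django-ocr-server lets you recognize text in images and PDFs. It uses tesseract for this: https://github.com/tesseract-ocr/tesseract
Django-ocr-server saves the result in the database. To avoid recognizing the same file twice, it also saves the hash sum of each uploaded file. When a file that already exists is uploaded again, the result is returned immediately and the recognition step is skipped, which significantly reduces the load on the server.
If recognition produces non-empty text, a searchable PDF is created.
A hash sum is calculated for the searchable PDF as well. If you upload a searchable PDF created by django-ocr-server back to the server, the file is not recognized again and the stored result is returned immediately.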
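The hash-based shortcut described above can be sketched as follows. This is a minimal illustration, not the project's code: the `OcrCache` class, its method names, the in-memory dictionary standing in for the database, and the choice of SHA-256 as the hash function are all assumptions.

```python
import hashlib


class OcrCache:
    """Hypothetical in-memory stand-in for the database of OCR results."""

    def __init__(self, recognize):
        self._recognize = recognize  # callable: bytes -> str (the slow OCR step)
        self._results = {}           # hash sum -> recognized text

    @staticmethod
    def hash_sum(content: bytes) -> str:
        # SHA-256 is an assumption; the source only says "hash sum".
        return hashlib.sha256(content).hexdigest()

    def remember(self, content: bytes, text: str) -> None:
        """Store a result under the hash of `content`.

        Calling this for a generated searchable PDF makes a later upload
        of that PDF hit the cache as well.
        """
        self._results[self.hash_sum(content)] = text

    def process(self, content: bytes) -> str:
        """Return the cached text if the file was seen before, else run OCR."""
        digest = self.hash_sum(content)
        if digest in self._results:
            return self._results[digest]
        text = self._recognize(content)
        self._results[digest] = text
        return text
```

With this shape, processing the same bytes twice calls the OCR function only once, which is where the reduction in server load comes from.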
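The server can process PDF files as well as images. For a PDF it first checks whether the file already contains real text. If it does, that text is used and the file is not recognized, which reduces the load on the server and improves the quality of the output.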
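The decision described above can be sketched as a small function. The name `text_from_pdf` and its signature are hypothetical. The installation steps build pdftotext, which suggests that is how embedded text is read, but this sketch simply takes the embedded text as an argument and makes no claim about the project's internals.

```python
from typing import Callable, Tuple


def text_from_pdf(embedded_text: str, run_ocr: Callable[[], str]) -> Tuple[str, bool]:
    """Return (text, ocr_was_used).

    If the PDF already carries real text, use it as-is and skip OCR.
    Otherwise fall back to the (slow) OCR callable.
    """
    if embedded_text.strip():
        return embedded_text, False
    return run_ocr(), True
```

Passing OCR in as a callable keeps the expensive step lazy: it only runs when the text layer is empty or contains nothing but whitespace.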
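Storage of uploaded files and of created searchable PDFs can be disabled in the settings.
For uploaded files, created searchable PDFs, and the processing results as a whole, you can specify a lifetime in the settings after which the data is deleted automatically.
To interact with django-ocr-server you can use the API or the admin interface.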
Documentation
http://django-ocr-server.readthedocs.org/en/latest This open-source app is brought to you by shmakovpn (https://github.com/shmakovpn).
Installation
Linux Mint 19 (Ubuntu bionic)
- Installing packages
- $sudo apt install g++ # needed to build pdftotext
- $sudo apt install libpoppler-cpp-dev # needed to build pdftotext
- Installing tesseract
- $sudo apt install tesseract-ocr
- $sudo apt install tesseract-ocr-rus # install the languages you want
- Installing python3.7
- $sudo apt install python3.7
- $sudo apt install python3.7-dev
- Installing pip
- $sudo apt install python-pip
- Installing virtualenv
- $pip install --user virtualenv
- $echo 'PATH=~/.local/bin:$PATH' >> ~/.bashrc
- $source ~/.bashrc
- Installing virtualenvwrapper
- $pip install --user setuptools
- $pip install --user wheel
- $pip install --user virtualenvwrapper
- $echo 'source ~/.local/bin/virtualenvwrapper.sh' >> ~/.bashrc
- $source ~/.bashrc
- Creating virtualenv for django_ocr_server
- $mkvirtualenv django_ocr_server -p /usr/bin/python3.7
- Installing django-ocr-server (on virtualenv django_ocr_server). It installs Django as a dependency
- $pip install django-ocr-server-1.0.tar.gz
- Create your Django project (on virtualenv django_ocr_server)
- $django-admin startproject ocr_server
- Go to project directory
- $cd ocr_server
- Edit ocr_server/settings.py
Add applications to INSTALLED_APPS
    INSTALLED_APPS = [
        ...
        'rest_framework',
        'rest_framework.authtoken',
        'django_ocr_server',
        'rest_framework_swagger',
    ]
- Edit ocr_server/urls.py
    from django.contrib import admin
    from django.urls import path, include
    from django.views.generic.base import RedirectView
    from rest_framework.documentation import include_docs_urls

    admin.site.site_header = 'OCR Server Administration'
    admin.site.site_title = 'Welcome to OCR Server Administration Portal'

    urlpatterns = [
        path('admin/', admin.site.urls, ),
        path('docs/', include_docs_urls(title='OCR Server API')),
        path('', include('django_ocr_server.urls'), ),
    ]
- Perform migrations (on virtualenv django_ocr_server)
- $python manage.py migrate
- Create superuser (on virtualenv django_ocr_server)
- $python manage.py createsuperuser
- Run server (on virtualenv django_ocr_server), then visit http://localhost:8000/
- $python manage.py runserver
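Before starting the server it can be useful to confirm that the tesseract executable installed in the steps above is actually on PATH. The following is a small sketch using only the standard library; the function name `missing_tools` is hypothetical and not part of django-ocr-server.

```python
import shutil


def missing_tools(tools=("tesseract",)):
    """Return the names of required executables that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All required executables were found.")
```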
Automatic installation on Linux Mint 19 (Ubuntu bionic)
- Clone django_ocr_server from github
- $git clone https://github.com/shmakovpn/django_ocr_server.git
- Run the installation script using sudo
- $sudo {your_path}/django_ocr_server/install_ubuntu.sh
The script creates an OS user named 'django_ocr_server', installs all required packages, and creates the virtual environment. It installs django_ocr_server from PyPI by default, but you can build the package from the cloned repository instead (see the section on creating a distribution package). It then creates a Django project named 'ocr_server' in the home directory of the 'django_ocr_server' OS user and edits settings.py and urls.py in ~django_ocr_server/ocr_server/ocr_server/. Finally, it applies migrations and creates a superuser named 'admin' whose password is also 'admin'.
- Run the server as the OS user django_ocr_server, then change the 'admin' password on the http://localhost:your_port/admin/ page.
- $sudo su
- $su django_ocr_server
- $cd ~/ocr_server
- $workon django_ocr_server
- $python manage.py runserver
CentOS 7
- Install epel repository
- $sudo yum install epel-release
- Install python 3.6
- $sudo yum install python36
- $sudo yum install python36-devel
- Install gcc
- $sudo yum install gcc
- $sudo yum install gcc-c++
- Install dependencies
- $sudo yum install poppler-cpp-devel
- Install tesseract
- $sudo yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
- $sudo bash -c "echo 'gpgcheck=0' >> /etc/yum.repos.d/download.opensuse.org_repositories_home_Alexander_Pozdnyakov_CentOS_7*.repo"
- $sudo yum update
- $sudo yum install tesseract
- $sudo yum install tesseract-langpack-rus # install the language pack you need
- Install pip
- $sudo yum install python-pip
- Install virtualenv
- $sudo pip install virtualenv
- Create the virtual env for django_ocr_server
- $sudo virtualenv /var/www/ocr_server/venv -p /usr/bin/python36 --distribute
- Give rights to the project folder to your user
- $sudo chown -R {your_user} /var/www/ocr_server/
- Activate virtualenv
- $source /var/www/ocr_server/venv/bin/activate
- Install postgresql 11 (PostgreSQL 9.2, which CentOS 7 installs by default, returns an error when applying migrations)
- $sudo rpm -Uvh https://yum.postgresql.org/11/redhat/rhel-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
- $sudo yum install postgresql11-server
- $sudo yum install postgresql-devel
- $sudo /usr/pgsql-11/bin/postgresql-11-setup initdb
- Edit /var/lib/pgsql/11/data/pg_hba.conf
    host all all 127.0.0.1/32 md5
    host all all ::1/128 md5
- $sudo systemctl enable postgresql-11
- $sudo systemctl start postgresql-11
- $sudo -u postgres psql
    # create database django_ocr_server encoding utf8;
    # create user django_ocr_server with password 'django_ocr_server';
    # alter database django_ocr_server owner to django_ocr_server;
    # alter user django_ocr_server createdb; -- only if you want to run tests
    # \q
- $pip install psycopg2 # (on virtualenv django_ocr_server)
- Create django project (on virtualenv django_ocr_server)
- $cd /var/www/ocr_server
- $django-admin startproject ocr_server .
- Edit ocr_server/settings.py
Add applications to INSTALLED_APPS
    INSTALLED_APPS = [
        ...
        'rest_framework',
        'rest_framework.authtoken',
        'django_ocr_server',
        'rest_framework_swagger',
    ]
Configure database connection
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'NAME': 'django_ocr_server',
            'USER': 'django_ocr_server',
            'PASSWORD': 'django_ocr_server',
            'HOST': 'localhost',
            'PORT': '',
        }
    }
- Edit ocr_server/urls.py
    from django.contrib import admin
    from django.urls import path, include
    from django.views.generic.base import RedirectView
    from rest_framework.documentation import include_docs_urls

    admin.site.site_header = 'OCR Server Administration'
    admin.site.site_title = 'Welcome to OCR Server Administration Portal'

    urlpatterns = [
        path('admin/', admin.site.urls, ),
        path('docs/', include_docs_urls(title='OCR Server API')),
        path('', include('django_ocr_server.urls'), ),
    ]
- Apply migrations (on virtualenv django_ocr_server)
- $python manage.py migrate
- Create superuser (on virtualenv django_ocr_server)
- $python manage.py createsuperuser
- Run server (on virtualenv django_ocr_server), then visit http://localhost:8000/
- $python manage.py runserver
Running tests
- Run the tests inside your django_ocr_server virtual environment
- $python manage.py test django_ocr_server.tests
API documentation
Django-ocr-server provides API documentation using rest_framework.documentation and Swagger. Visit http://localhost:8000/swagger and http://localhost:8000/docs/
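Note
It may look as if django-ocr-server is not working. Optical character recognition is a very heavy operation for the server, and it takes time. How long depends entirely on the file being recognized and on the server's hardware. For example, my computer (Ryzen 7, 64 GB RAM) needs 25 minutes to recognize a 500-page book in PDF format that has no text layer.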
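The figure quoted above works out to about three seconds per page. The sketch below does that arithmetic and uses it for a rough estimate. It assumes that processing time scales linearly with page count, which is a simplification; real timings depend on the content and on the hardware, as the note says. The function names and the 120-page example are illustrative.

```python
def seconds_per_page(total_minutes: float, pages: int) -> float:
    """Average processing time per page, in seconds."""
    return total_minutes * 60 / pages


def estimate_minutes(pages: int, per_page_seconds: float) -> float:
    """Rough linear estimate of total processing time, in minutes."""
    return pages * per_page_seconds / 60


rate = seconds_per_page(25, 500)    # the example from the note: 3.0 s/page
print(rate)                         # 3.0
print(estimate_minutes(120, rate))  # 6.0 (minutes for a 120-page scan)
```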
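License
- The code in this repository is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
NOTE: This software depends on other packages that may be licensed under different open source licenses.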
Creating a distribution package
As mentioned earlier, the automatic installation script 'install_ubuntu.sh' uses the package from the PyPI repository by default. If you want to change this behavior, or if you need your own distribution package, you can build one.
- Run the commands
- $cd {path to the project cloned from GitHub}
- $python setup.py sdist
- Look in the 'dist' directory; your package has been created there.
You can then continue with the automatic installation, and the package you built will be used.
Deploying to production
Linux Mint 19 (Ubuntu bionic)
- Installing nginx
- $sudo apt install nginx
- Installing uwsgi (on virtualenv django_ocr_server)
- $pip install uwsgi
- Create {path_to_your_project}/uwsgi.ini
    [uwsgi]
    # e.g. /home/shmakovpn/ocr_server
    chdir = {path_to_your_project}
    # e.g. ocr_server.wsgi
    module = {your_project}.wsgi
    # e.g. /home/shmakovpn/.virtualenvs/django_ocr_server
    home = {path_to_your_virtualenv}
    master = true
    processes = 10
    http = 127.0.0.1:8003
    vacuum = true
- Create /etc/nginx/sites-available/django_ocr_server.conf
    server {
        listen 80; # choose the port you want
        server_name _;
        charset utf-8;
        client_max_body_size 75M;
        location /static/rest_framework_swagger {
            alias {path_to_your_virtualenv}/lib/python3.7/site-packages/rest_framework_swagger/static/rest_framework_swagger;
        }
        location /static/rest_framework {
            alias {path_to_your_virtualenv}/lib/python3.7/site-packages/rest_framework/static/rest_framework;
        }
        location /static/admin {
            alias {path_to_your_virtualenv}/lib/python3.7/site-packages/django/contrib/admin/static/admin;
        }
        location / {
            proxy_pass http://127.0.0.1:8003;
        }
    }
- Enable the django_ocr_server site
- $sudo ln -s /etc/nginx/sites-available/django_ocr_server.conf /etc/nginx/sites-enabled/
- Remove the nginx default site
- $sudo rm /etc/nginx/sites-enabled/default
- Create the systemd service unit /etc/systemd/system/django-ocr-server.service
    [Unit]
    Description=uWSGI Django OCR Server
    After=syslog.service

    [Service]
    User={your user}
    Group={your group}
    Environment="PATH={path_to_your_virtualenv}/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ExecStart={path_to_your_virtualenv}/bin/uwsgi --ini {path_to_your_project}/uwsgi.ini
    RuntimeDirectory=uwsgi
    Restart=always
    KillSignal=SIGQUIT
    Type=notify
    StandardError=syslog
    NotifyAccess=all

    [Install]
    WantedBy=multi-user.target
- Reload systemd
- $sudo systemctl daemon-reload
- Start the django-ocr-server service
- $sudo systemctl start django-ocr-server
- Enable the django-ocr-server service to start automatically after server is booted
- $sudo systemctl enable django-ocr-server
- Start nginx
- $sudo systemctl start nginx
- Enable nginx service to start automatically after server is booted
- $sudo systemctl enable nginx
- Go to http://{your_server}:80
- You will be redirected to admin page
CentOS 7
- Installing nginx
- $sudo yum install nginx
- Installing uwsgi (on virtualenv django_ocr_server)
- $pip install uwsgi
- Create /var/www/ocr_server/uwsgi.ini
    [uwsgi]
    chdir = /var/www/ocr_server
    module = ocr_server.wsgi
    home = /var/www/ocr_server/venv
    master = true
    processes = 10
    http = 127.0.0.1:8003
    vacuum = true
- Create the systemd service unit /etc/systemd/system/django-ocr-server.service
    [Unit]
    Description=uWSGI Django OCR Server
    After=syslog.service

    [Service]
    User=nginx
    Group=nginx
    Environment="PATH=/var/www/ocr_server/venv/bin:/sbin:/bin:/usr/sbin:/usr/bin"
    ExecStart=/var/www/ocr_server/venv/bin/uwsgi --ini /var/www/ocr_server/uwsgi.ini
    RuntimeDirectory=uwsgi
    Restart=always
    KillSignal=SIGQUIT
    Type=notify
    StandardError=syslog
    NotifyAccess=all

    [Install]
    WantedBy=multi-user.target
- Reload systemd service
- $sudo systemctl daemon-reload
- Change the owner of /var/www/ocr_server to nginx
- $sudo chown -R nginx:nginx /var/www/ocr_server
- Start the django-ocr-server service
- $sudo systemctl start django-ocr-server
- Check that port is up
- $sudo netstat -anlpt | grep 8003
You should see something like this:
    tcp 0 0 127.0.0.1:8003 0.0.0.0:* LISTEN 2825/uwsgi
- Enable the django-ocr-server uwsgi service
- $sudo systemctl enable django-ocr-server
- Edit /etc/nginx/nginx.conf
    server {
        listen 80 default_server;
        listen [::]:80 default_server;
        server_name _;
        charset utf-8;
        client_max_body_size 75M;
        location /static/rest_framework_swagger {
            alias /var/www/ocr_server/venv/lib/python3.6/site-packages/rest_framework_swagger/static/rest_framework_swagger;
        }
        location /static/rest_framework {
            alias /var/www/ocr_server/venv/lib/python3.6/site-packages/rest_framework/static/rest_framework;
        }
        location /static/admin {
            alias /var/www/ocr_server/venv/lib/python3.6/site-packages/django/contrib/admin/static/admin;
        }
        location / {
            proxy_pass http://127.0.0.1:8003;
        }
    }
- Configure selinux
- $sudo semanage port -a -t http_port_t -p tcp 8003
- $sudo semanage fcontext -a -t httpd_sys_content_t '/var/www/ocr_server/venv/lib/python3.6/site-packages/rest_framework_swagger/static/rest_framework_swagger(/.*)?'
- $sudo restorecon -Rv '/var/www/ocr_server/venv/lib/python3.6/site-packages/rest_framework_swagger/static/rest_framework_swagger/'
- $sudo semanage fcontext -a -t httpd_sys_content_t '/var/www/ocr_server/venv/lib/python3.6/site-packages/rest_framework/static/rest_framework(/.*)?'
- $sudo restorecon -Rv '/var/www/ocr_server/venv/lib/python3.6/site-packages/rest_framework/static/rest_framework/'
- $sudo semanage fcontext -a -t httpd_sys_content_t '/var/www/ocr_server/venv/lib/python3.6/site-packages/django/contrib/admin/static/admin(/.*)?'
- $sudo restorecon -Rv '/var/www/ocr_server/venv/lib/python3.6/site-packages/django/contrib/admin/static/admin/'
- Start nginx service
- $sudo systemctl start nginx
- Enable nginx service
- $sudo systemctl enable nginx
- Configure firewall
- $sudo firewall-cmd --zone=public --add-service=http --permanent
- $sudo firewall-cmd --reload
- Go to http://{your_server}:80
- You will be redirected to admin page
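The "check that port is up" step in the deployment above can also be done from Python, for example in a health-check script. This is a minimal sketch using only the standard library; the function name `port_is_open` is hypothetical, and the default port 8003 is the one used in the uwsgi.ini shown above.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 8003, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print("uwsgi is listening" if port_is_open() else "nothing is listening on 8003")
```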
Usage examples
You can download all examples from https://github.com/shmakovpn/django-ocr-server/usage_examples
curl
- Use curl with ‘@’ before the path of the uploading file
    #!/usr/bin/env bash
    curl -F "file=@example.png" localhost:8000/upload/
Python
- Use requests.post function
    import requests

    with open("example.png", 'rb') as fp:
        print(requests.post("http://localhost:8000/upload/", files={'file': fp}, ).content)
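Because recognition can take a long time, a client may want an explicit timeout and an HTTP status check around the upload shown above. The following variation is a sketch only: the function name `upload_for_ocr`, the 600-second default timeout, and the injectable `post` parameter (which makes the function testable without a running server) are assumptions, and no particular response format is assumed.

```python
def upload_for_ocr(path, url="http://localhost:8000/upload/", post=None, timeout=600):
    """Upload a file to the OCR server and return the raw response body.

    `post` defaults to requests.post; pass another callable with the same
    keyword arguments to test without a network connection.
    """
    if post is None:
        import requests  # third-party package: pip install requests
        post = requests.post
    with open(path, "rb") as fp:
        response = post(url, files={"file": fp}, timeout=timeout)
    # Raise an exception for 4xx/5xx answers instead of returning an error page.
    response.raise_for_status()
    return response.content


if __name__ == "__main__":
    print(upload_for_ocr("example.png"))
```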
perl
- Use LWP::UserAgent and HTTP::Request::Common
    #!/usr/bin/perl
    use strict;
    use warnings FATAL => 'all';
    use LWP::UserAgent;
    use HTTP::Request::Common;

    my $ua = LWP::UserAgent->new;
    my $url = "http://localhost:8000/upload/";
    my $fname = "example.png";

    my $req = POST($url,
        Content_Type => 'form-data',
        Content => [
            file => [ $fname ]
        ]);

    my $response = $ua->request($req);

    if ($response->is_success()) {
        print "OK: ", $response->content;
    } else {
        print "Failed: ", $response->as_string;
    }
php
- Use the cURL functions with CURLFile
    <?php
    // Initialise the cURL handle
    $ch = curl_init();
    // Return the response from curl_exec instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Set the URL
    curl_setopt($ch, CURLOPT_URL, 'http://localhost:8000/upload/');
    // Create a POST array with the file in it
    $file = 'example.png';
    $mime = getimagesize($file)['mime'];
    $name = pathinfo($file)['basename'];
    $postData = array(
        'file' => new CURLFile($file, $mime, $name),
    );
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
    // Execute the request
    $response = curl_exec($ch);
    echo($response);
    curl_close($ch);
    ?>
Configuration
To change the behavior of django_ocr_server, you can set the following parameters in the settings.py of your Django project.
- OCR_STORE_FILES: Set it to True (default) to enable storing uploaded files on the server
- OCR_FILE_PREVIEW: Set it to True (default) to enable showing previews of uploaded images in the admin interface
- OCR_TESSERACT_LANG: Sets the priority of languages to use; defaults to 'rus+eng'
- OCR_STORE_PDF: Set it to True (default) to enable storing created searchable PDFs on the server
- OCR_FILES_UPLOAD_TO: Sets the path for uploaded files
- OCR_PDF_UPLOAD_TO: Sets the path for created searchable PDFs
- OCR_FILES_TTL: Sets the time to live for uploaded files; uploaded files older than this interval will be removed. Use a Python datetime.timedelta to set it, or 0 (default) to disable.
- OCR_PDF_TTL: Sets the time to live for created searchable PDFs; PDFs older than this interval will be removed. Use a Python datetime.timedelta to set it, or 0 (default) to disable.
- OCR_TTL: Sets the time to live for created OCRedFile models; models older than this interval will be removed. Use a Python datetime.timedelta to set it, or 0 (default) to disable.
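As an illustration, a settings.py fragment using some of these parameters might look like the sketch below. The specific values (language order, seven and thirty days) are arbitrary examples, not recommendations, and the two upload-path settings are left out because their expected format is not described here.

```python
from datetime import timedelta

# Fragment for the settings.py of your Django project (example values only).
OCR_STORE_FILES = True              # keep uploaded files on the server
OCR_FILE_PREVIEW = True             # show image previews in the admin interface
OCR_TESSERACT_LANG = 'eng+rus'      # example: list English before Russian
OCR_STORE_PDF = False               # do not keep created searchable PDFs
OCR_FILES_TTL = timedelta(days=7)   # remove uploaded files after a week
OCR_PDF_TTL = 0                     # 0 disables TTL-based removal of PDFs
OCR_TTL = timedelta(days=30)        # remove OCRedFile models after 30 days
```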
Management commands
- Run it to clean trash. It removes all uploaded files and PDFs that do not have related models in database.
- $python manage.py clean
- Run it to remove models, uploaded files and PDFs, whose time to live (TTL) has expired.
- $python manage.py ttl
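The ideas behind the two commands above can be sketched in plain Python. This is not the code of the `clean` or `ttl` commands: the helper names `find_orphans` and `is_expired`, the flat storage directory, and matching files by name are assumptions made to keep the example self-contained.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import List, Set, Union


def find_orphans(storage_dir: Path, referenced: Set[str]) -> List[Path]:
    """Files in storage_dir whose names no database record refers to.

    This mirrors the idea of `clean`: such files are "trash".
    """
    return sorted(
        path for path in storage_dir.iterdir()
        if path.is_file() and path.name not in referenced
    )


def is_expired(created: datetime, ttl: Union[timedelta, int], now: datetime) -> bool:
    """True when a TTL is set and more than `ttl` has passed since `created`.

    This mirrors the idea of `ttl`: a value of 0 disables expiry.
    """
    if not ttl:  # 0 and timedelta(0) are both falsy
        return False
    return now - created > ttl
```

For example, with a TTL of `timedelta(days=7)`, a record created on 1 January is expired on 10 January, while one created on 9 January is not.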