Planchet

Your large data processing personal assistant

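About

Planchet (pronounced /plʌ̃ʃɛ/) is a data package manager suited for processing large arrays of data items. It natively supports reading and writing CSV and JSONL data files and serves their content over a FastAPI service. It is a tool for scientists and hackers, not for production.

How it works

Planchet takes a simple approach to processing large amounts of data: it controls the reading and writing of the data rather than the processing itself. When you create a job with Planchet, you tell the service where to read, where to write, and which classes to use for that. Next, you (using the client or plain HTTP requests) ask the service for n data items, which your process works through locally. When your processing is done, it ships the items back to Planchet, which writes them to disk. All jobs, and the serving and receiving of items, are logged in a Redis instance with persistence. This ensures that if you stop processing, you only lose the work on data that was not yet sent back to Planchet. Planchet will automatically resume jobs and skip over processed items.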
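The bookkeeping described above can be illustrated with a small in-memory model. This is only a sketch of the idea, not Planchet's implementation: the ToyLedger class, its method names, and the status strings are all hypothetical, a plain dict stands in for Redis, and the behaviour of re-queuing items that were served but never returned is an assumption made for illustration.

```python
class ToyLedger:
    """Hypothetical in-memory stand-in for the job bookkeeping (not Planchet code)."""

    def __init__(self, items):
        self.items = list(items)
        self.status = {}   # item id -> 'SERVED' or 'RECEIVED' (a dict instead of Redis)
        self.output = []   # stands in for the output file on disk

    def serve(self, n):
        """Hand out up to n items that have not been served or received yet."""
        batch = []
        for id_, item in enumerate(self.items):
            if len(batch) == n:
                break
            if id_ not in self.status:
                self.status[id_] = 'SERVED'
                batch.append((id_, item))
        return batch

    def receive(self, batch):
        """Record processed items and 'write' them out."""
        for id_, item in batch:
            self.status[id_] = 'RECEIVED'
            self.output.append(item)

    def resume(self):
        """Assumed resume behaviour: forget items that were served but never returned."""
        self.status = {i: s for i, s in self.status.items() if s == 'RECEIVED'}


ledger = ToyLedger(['a', 'b', 'c', 'd', 'e'])
ledger.receive([(i, x.upper()) for i, x in ledger.serve(2)])  # processed and sent back
lost = ledger.serve(2)   # served, but the worker stops before sending back
ledger.resume()
retry = ledger.serve(2)  # the unreturned items are handed out again
```

In this toy model only the batch that was never sent back has to be reprocessed, which mirrors the guarantee stated in the paragraph above.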

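Caveat: Planchet runs in a single thread to avoid the mess of multiple processes writing to the same file. Until this is fixed (possibly never), be careful with your batch sizes: keep them neither too big nor too small.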

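Read more about Planchet on the documentation page.

Installation

Planchet works in two components: a service and a client. The service is the core that does all the work of managing the data, while the client is a light wrapper around requests that makes it easier to access the service API.

Service

You can clone the repo and get started right away like this: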

git clone git@github.com:savkov/planchet.git
cd planchet
export PLANCHET_REDIS_PWD=<some-password>
make install
make run-redis
make run

If you want to run Planchet on a different port, you can use the uvicorn command, but note that you MUST use only one worker.
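As a rough sketch of what such an invocation could look like, the helper below assembles a uvicorn command line with the worker count pinned to one. The --host, --port, and --workers flags are standard uvicorn options, but the application path "app:app" and the port 5006 are assumptions for illustration; check the repository for the actual module name.

```python
import shlex


def uvicorn_command(port, app_path="app:app", host="0.0.0.0"):
    """Build a uvicorn command line; app_path is an assumed placeholder."""
    return [
        "uvicorn", app_path,
        "--host", host,
        "--port", str(port),
        "--workers", "1",  # Planchet must run with exactly one worker
    ]


# Print the command so it can be pasted into a shell.
print(shlex.join(uvicorn_command(5006)))
```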

You can also run docker-compose from the git repo:

git clone git@github.com:savkov/planchet.git
cd planchet
export PLANCHET_REDIS_PWD=<some-password>
docker-compose up

Client

pip install planchet

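Example

On the server

On the server, we need to install Planchet and download some news headlines data into an accessible directory. Then we replicate the data 200 times, as in the command below, because the original file has only 200 lines. Don't forget to set your Redis password before you run make install-redis!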

git clone https://github.com/savkov/planchet.git
cd planchet
mkdir data
wget https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/news_headlines.jsonl -O data/news_headlines.jsonl
python -c "news=open('data/news_headlines.jsonl').read();open('data/news_headlines.jsonl', 'w').write(''.join([news for _ in range(200)]))"
export PLANCHET_REDIS_PWD=<your-redis-password>
make install
make install-redis
make run
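The data replication one-liner above is compact but hard to read. A more explicit version might look like the following sketch; the function name replicate_file is hypothetical, and the trailing-newline guard is an addition not present in the original command.

```python
from pathlib import Path


def replicate_file(path, times):
    """Overwrite `path` with `times` back-to-back copies of its current contents.

    Returns the number of lines in the resulting file.
    """
    path = Path(path)
    content = path.read_text(encoding="utf-8")
    # Make sure the last record ends with a newline so copies never fuse two records.
    if content and not content.endswith("\n"):
        content += "\n"
    path.write_text(content * times, encoding="utf-8")
    return content.count("\n") * times


# Example (commented out so nothing is overwritten by accident):
# replicate_file("data/news_headlines.jsonl", 200)
```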

Note that Planchet will run on port 5005 on the host machine.

On the client

On the client side, we need to install the Planchet client and spaCy.

pip install planchet spacy tqdm
python -m spacy download en_core_web_sm
export PLANCHET_REDIS_PWD=<your-redis-password>

Then we write the following script in a file called spacy_ner.py, making sure to fill in the placeholders.

from planchet import PlanchetClient
import spacy
from tqdm import tqdm

nlp = spacy.load("en_core_web_sm")

PLANCHET_HOST = 'localhost'  # <--- CHANGE IF NEEDED
PLANCHET_PORT = 5005

url = f'http://{PLANCHET_HOST}:{PLANCHET_PORT}'
client = PlanchetClient(url)

job_name = 'spacy-ner-job'
metadata = {  # NOTE: this assumes planchet has access to this path
    'input_file_path': './data/news_headlines.jsonl',
    'output_file_path': './data/entities.jsonl'
}

# make sure you don't use the clean_start option here
client.start_job(job_name, metadata, 'JsonlReader', writer_name='JsonlWriter')

# make sure the number of items is large enough to avoid blocking the server
n_items = 100
headlines = client.get(job_name, n_items)

while headlines:
    ents = []
    print('Processing headlines batch...')
    for id_, item in tqdm(headlines):
        item['ents'] = [ent.text for ent in nlp(item['text']).ents]
        ents.append((id_, item))
    client.send(job_name, ents)
    headlines = client.get(job_name, n_items)
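The per-batch step in the script above can be factored out so it can be tried without a running service or a spaCy model. The sketch below keeps the (id, item) pair shape used in the script; the names annotate_batch and capitalised_words are hypothetical, and the capitalised-word extractor is only a crude stand-in for spaCy's entity recogniser.

```python
def annotate_batch(batch, extract):
    """Add an 'ents' field to every item of a batch of (id, item) pairs.

    `extract` is any callable mapping a text string to a list of entity strings.
    """
    annotated = []
    for id_, item in batch:
        enriched = dict(item)  # copy, so the incoming batch is left untouched
        enriched['ents'] = extract(item['text'])
        annotated.append((id_, enriched))
    return annotated


def capitalised_words(text):
    """Crude stand-in for an NER model: return every capitalised word."""
    return [word for word in text.split() if word[:1].isupper()]


batch = [(7, {'text': 'Uber hires Dara in Seattle'})]
result = annotate_batch(batch, capitalised_words)
```

With spaCy loaded, the extractor could instead be lambda text: [ent.text for ent in nlp(text).ents], which matches the list comprehension in the script.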

Finally, we want to do some parallel processing with 8 processes. We can start each process manually, or we can use the parallel tool to start them all.

seq -w 1 8 | parallel python spacy_ner.py {}
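If the parallel tool is not available, the same effect can be approximated from Python with the standard subprocess module. This is a minimal sketch: the helper name run_workers is hypothetical, and it simply appends a worker index to the command, much like the {} placeholder in the shell command.

```python
import subprocess
import sys


def run_workers(command, n_workers):
    """Start n_workers copies of `command` at once and wait for all of them.

    Each copy gets its 1-based worker index appended as the last argument.
    Returns the list of exit codes, in worker order.
    """
    procs = [
        subprocess.Popen(command + [str(index)])
        for index in range(1, n_workers + 1)
    ]
    return [proc.wait() for proc in procs]


# Example (commented out; requires spacy_ner.py and a running Planchet service):
# exit_codes = run_workers([sys.executable, "spacy_ner.py"], 8)
```

Because each worker asks the service for its own batches, the processes do not need to coordinate with each other.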

planchet 0.4.0
