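scielo-clea: Python package on PyPI

XML front matter metadata reader/sanitizer for documents that follow the SciELO Publishing Schema

scielo-clea project description

Clea

This project is an XML front matter metadata reader for documents that almost follow the SciELO Publishing Schema. It extracts and sanitizes the values related to affiliations.

Installation

Clea can be installed with any of the following options: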

pip install scielo-clea          # Minimal
pip install scielo-clea[cli]     # Clea with CLI (recommended)
pip install scielo-clea[server]  # Clea with the testing/example server
pip install scielo-clea[all]     # Clea with both CLI and the server

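All of these commands install the same package; only the dependencies differ. The first one is the minimal installation, meant for using Clea in Python as an imported package.

Running the command line interface

The CLI is a way to use Clea as an XML to JSONL converter (one JSON output line for each XML input):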

clea -o output.jsonl article1.xml article2.xml article3.xml

The same can be done with python -m clea instead of clea. By default, the output goes to the standard output stream. See clea --help for more information.
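
Since each output line is an independent JSON document, the resulting file can be read back one line at a time. The following is a minimal sketch using only the standard library. The read_jsonl helper and the output.jsonl file name are illustrative, not part of Clea, and the keys found in each record depend on what was extracted from each XML:

```python
import json


def read_jsonl(path):
    """Yield one decoded JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if line:  # Skip blank lines, if any
                yield json.loads(line)


# Usage, assuming the file was written by "clea -o output.jsonl ...":
#
#     for record in read_jsonl("output.jsonl"):
#         print(sorted(record))  # Top-level keys of each article
```

Reading lazily with a generator keeps memory usage low when many XML files were converted into a single JSONL file.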

Running the testing server

You can run the development server using the Flask CLI. For example, to listen on every host at port 8080:

FLASK_APP=clea.server flask run -h 0.0.0.0 -p 8080

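In a production server with 4 worker processes handling requests, you can, for example:

Install Gunicorn (it is not a dependency)
Run gunicorn -b 0.0.0.0:8080 -w 4 clea.server:app

Clea as a library

A simple example to see all the extracted data is: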

from clea import Article
from pprint import pprint
art = Article("some_file.xml")
pprint(art.data_full)

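That is a dictionary of lists with all the "raw" extracted data. The keys of that dictionary can be accessed directly, which avoids extracting everything from the XML just to get specific items/attributes (e.g. art["journal_meta"][0].data_full or art.journal_meta[0].data_full instead of art.data_full["journal_meta"][0]). These items/attributes are always lists, for example:

art["aff"]: list of clea.core.Branch instances
art["sub_article"]: list of clea.core.SubArticle instances
art["contrib"][0]["contrib_name"]: list of strings

where art["contrib"][0] is a Branch instance, and all such instances behave the same way (there are no nested branches). This can be seen as another way to navigate the dictionary above: the last example should return the same list as art.data_full["contrib"][0]["contrib_name"], but without extracting everything else that appears in the art.data_full dictionary.

Some other simple things that can be done: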

len(art.aff)  # Number of <aff> entries
len(art.sub_article)  # Number of <sub-article>
art.contrib[0].data_full  # Data from the first contributor as a dict

# Something like {"type": ["translation"], "lang": ["en"]},
# the content from <sub-article> attributes
art["sub_article"][0]["article"][0].data_full

# A string with the article title, accessing just the desired content
art["article_meta"][0]["article_title"][0]

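All SubArticle, Article and Branch instances have the data_full property and the get method, the latter being used internally to get the items/attributes. Their behavior is:

Branch.get always returns a list of strings
Article.get("sub_article") returns a list of SubArticle
Article.get(...) returns a list of Branch
SubArticle behaves like Article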
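
The pattern described above, where item access and attribute access both go through a single get method, can be illustrated with a toy class. This is only a sketch of the general idea and not Clea's implementation: the LazyNode name, the plain dictionary used as a stand-in for a parsed XML node, and the choice of returning an empty list for unknown names are all assumptions made for this example.

```python
class LazyNode:
    """Toy container: items and attributes are both resolved through get()."""

    def __init__(self, raw):
        self._raw = raw  # Hypothetical stand-in for a parsed XML node

    def get(self, name):
        # Always return a list, even for a single value or a missing key
        value = self._raw.get(name, [])
        return list(value) if isinstance(value, (list, tuple)) else [value]

    def __getitem__(self, name):
        return self.get(name)

    def __getattr__(self, name):
        # Only called when the normal attribute lookup fails
        if name.startswith("_"):
            raise AttributeError(name)
        return self.get(name)

    @property
    def data_full(self):
        # Extract everything at once, as a dictionary of lists
        return {name: self.get(name) for name in self._raw}


node = LazyNode({"contrib_name": ["Ana", "Bo"], "contrib_type": "author"})
print(node["contrib_name"])  # ['Ana', 'Bo']
print(node.contrib_type)     # ['author']
```

With this layout, node["contrib_name"] touches a single key, whereas node.data_full walks every key, which mirrors the reason given above for accessing items directly.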

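The extracted information is not exhaustive! Its result should not be seen as a replacement for the raw XML.

One of the goals of this library is to help create tabular data from a given XML, using as many rows as needed, with a matching pair of <aff> and <contrib> in each row. These are the Article methods/properties that perform that matching: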

art.aff_contrib_inner_gen()
art.aff_contrib_full_gen()
art.aff_contrib_inner
art.aff_contrib_full
art.aff_contrib_inner_indices
art.aff_contrib_full_indices

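Probably the most useful is the last one, which returns a list of pairs (tuples) of indices (int), so one can use an (ai, ci) result to access the (art.aff[ai], art.contrib[ci]) pair, unless an index is -1 (not found). The ones with the _gen suffix are generator functions that yield a tuple with two Branch entries (or None), and the ones without a suffix return a list of merged dictionaries in an almost tabular format (a dictionary of lists of strings). Each list for these specific elements should usually have at most one string, but that is not always the case even for these specific elements, so one should be careful when using the data property.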
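
When consuming such index pairs, note that -1 is a valid index in Python (it selects the last element), so it has to be checked explicitly before indexing. A minimal sketch of that check follows. The resolve_pairs helper is hypothetical, and plain strings stand in for the Branch instances:

```python
def resolve_pairs(affs, contribs, index_pairs):
    """Map (ai, ci) index pairs to (aff, contrib) pairs, using None for -1."""
    resolved = []
    for ai, ci in index_pairs:
        aff = affs[ai] if ai >= 0 else None
        contrib = contribs[ci] if ci >= 0 else None
        resolved.append((aff, contrib))
    return resolved


# Plain strings stand in for Branch instances here
affs = ["aff-A", "aff-B"]
contribs = ["contrib-X"]
pairs = [(0, 0), (1, -1)]
print(resolve_pairs(affs, contribs, pairs))
# [('aff-A', 'contrib-X'), ('aff-B', None)]
```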

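The inner and full in the names refer to INNER JOIN and FULL OUTER JOIN from SQL, meaning that the unmatched elements (every <aff> or <contrib> node without a matching counterpart) are discarded in the former strategy, whereas they are kept in the latter.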
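
The difference between the two strategies can be shown on plain lists. The sketch below is not Clea's matching algorithm: matching by identifier, the match_indices name, and the order of the resulting pairs are assumptions for illustration only. It reuses the convention of -1 for a missing side:

```python
def match_indices(aff_ids, contrib_refs, full=False):
    """Pair affiliations and contributors by identifier.

    aff_ids: list of affiliation ids, one per <aff>.
    contrib_refs: list of lists with the aff ids each <contrib> refers to.
    Returns (aff_index, contrib_index) tuples; -1 marks a missing side,
    which only happens when full=True.
    """
    pairs = []
    matched_affs = set()
    for ci, refs in enumerate(contrib_refs):
        found = False
        for ai, aff_id in enumerate(aff_ids):
            if aff_id in refs:
                pairs.append((ai, ci))
                matched_affs.add(ai)
                found = True
        if full and not found:
            pairs.append((-1, ci))  # Contributor without affiliation
    if full:
        for ai in range(len(aff_ids)):
            if ai not in matched_affs:
                pairs.append((ai, -1))  # Affiliation without contributor
    return pairs


aff_ids = ["aff1", "aff2", "aff3"]
contrib_refs = [["aff1"], [], ["aff1", "aff2"]]
print(match_indices(aff_ids, contrib_refs))
# [(0, 0), (0, 2), (1, 2)]
print(match_indices(aff_ids, contrib_refs, full=True))
# [(0, 0), (-1, 1), (0, 2), (1, 2), (2, -1)]
```

In the inner result, the second contributor and the third affiliation simply disappear. In the full result, both are kept and paired with -1.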

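To print all the data extracted from an XML, including the indices of the matching <aff> and <contrib> pairs computed in the FULL OUTER JOIN sense, similar to the test server response: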

pprint({
    **article.data_full,
    "aff_contrib_pairs": article.aff_contrib_full_indices,
})

scielo-clea 0.4.0
