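sfm-utils Python package - PyPI

Utilities for lexicography data encoded using Standard Format Markers (SFM data files).

sfm-utils project description

sfm-utils is a collection of Python utilities for quickly summarizing the content of lexicography data encoded using Standard Format Markers (SFM data files) and for identifying inconsistencies in it. The utilities are mainly intended to help clean SFM data before it is converted to other formats or imported into tools such as SIL Fieldworks Language Explorer (FLEx).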

SFM files contain unnormalized data structured using tags (backslash codes). For example:

\lx déláme
\ps n
\gn petite calebasse
\ps v
\gn sorte de verre
\ge drinking bowl
\gr ɓi loonde

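The sfm-utils scripts do not ascribe meaning to tags, and are therefore agnostic to the tag set used in an SFM data file. The purpose of the SFM utilities is to ensure that tags are used consistently throughout the data file.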
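
To make the file layout concrete, here is a minimal sketch of splitting SFM text into tag/value records. The `parse_sfm` helper and the `TAG_LINE` pattern are hypothetical and are not part of sfm-utils; the sketch also assumes one tag per line and ignores values that continue over several lines.

```python
import re

# Hypothetical helper (not the sfm-utils implementation).
# A tag line is a backslash code, optionally followed by whitespace and a value.
TAG_LINE = re.compile(r"^\\(\S+)(?:\s+(.*))?$")

def parse_sfm(text):
    """Yield (line_number, tag, value) for each line that starts with a backslash code."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        match = TAG_LINE.match(line.strip())
        if match:
            tag = match.group(1)
            value = (match.group(2) or "").strip()  # empty string when the tag has no value
            yield line_no, tag, value

sample = r"""\lx déláme
\ps n
\gn petite calebasse
"""
records = list(parse_sfm(sample))
for record in records:
    print(record)
```

Because the tag name is treated as an opaque string, the same loop works for any tag set, which matches the tag-agnostic approach described above.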

Author: Gavin Falconer (gfalconer@expressivelogic.co.uk)

Installation

sfm-utils is distributed as a Python package, so it can be installed via pip (or your package manager of choice). It requires Python 3 or later:

> pip install sfm_utils

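Future proposal

Host a version of the SFM utilities in an online Jupyter notebook. For an example, see: https://jvns.ca/blog/2017/11/12/binder--an-awesome-tool-for-hosting-jupyter-notebooks/

Introduction

Use sfm-sniffer to get a quick overview of any SFM file. sfm-sniffer lists the tags used in the file and gives the number of occurrences of each tag. It also infers a type for each tag and shows the number of "exceptions", where a tag value does not match the expected type.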

> sfm-sniffer --summary my_lexicon.sfm
\gn : gloss (national)     : occurrences=2480 : type=text            : exceptions=26
\lx : lexeme               : occurrences=2474 : type=word            : exceptions=7
\sn : sense number         : occurrences=2456 : type=enumeration     : exceptions=28
\ps : part of speech       : occurrences=2450 : type=enumeration     : exceptions=79
\ge : gloss (english)      : occurrences= 511 : type=optional word   : exceptions=12
\gr : gloss (regional)     : occurrences= 500 : type=optional phrase : exceptions=11
\glo: ???                  : occurrences= 354 : type=text            : exceptions=0
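
The occurrence counts in a summary like this can be approximated with a few lines of Python. This is only a sketch of the idea, not how sfm-sniffer is implemented; `count_tags` is a hypothetical name, and the sample lines are taken from the example record earlier in this description.

```python
from collections import Counter

def count_tags(lines):
    """Count how often each backslash code appears at the start of a line."""
    counts = Counter()
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("\\"):
            # The tag is everything up to the first whitespace, minus the backslash.
            tag = stripped.split(None, 1)[0][1:]
            if tag:
                counts[tag] += 1
    return counts

sample = [
    r"\lx déláme",
    r"\ps n",
    r"\gn petite calebasse",
    r"\ps v",
    r"\gn sorte de verre",
    r"\ge drinking bowl",
    r"\gr ɓi loonde",
]
counts = count_tags(sample)
for tag, n in counts.most_common():
    print(f"\\{tag:<4}: occurrences={n:>4}")
```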

Running sfm-sniffer in full mode gives line references that pinpoint each exception:

> sfm-sniffer my_lexicon.sfm
glo: gloss (other)        : occurrences= 354: type=text   : exceptions=0
===================================
\lx : lexeme              : occurrences=2474: type=word
7 exceptions for \lx of type 'word':
line    1: \lx <no value>
line 2335: \lx eptsá - v. int. fatsa
line 2470: \lx ékséɓé, ésséɓá
line 2474: \lx ékslá, alá
line 2712: \lx fá wé...
line 4025: \lx icá  - v.int. ɗatsa
line 11051: \lx ŋá (v.int. ŋɛŋa)
====================================
\ps : part of speech      : occurrences=2451: type=enumeration
Example values:
adj,adj adv,adj num,adj poss,adj poss.,adj?,adv,adv inter,adv tm,...
79 exceptions for \ps of type 'enumeration':
line  855: \ps v. int
line 1875: \ps v. int.
line 1879: \ps <no value>
line 1947: \ps <no value>
...

The results indicate how consistently (or otherwise) each tag is used. See the example walkthrough for more details.
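
The line-referenced exception report can be pictured as a filter over parsed records. In the sketch below, the `(line_number, tag, value)` record shape, `find_exceptions` and `looks_like_word` are all hypothetical; `looks_like_word` is a deliberately simplified stand-in for a 'word' check and does not reproduce the rules sfm-sniffer applies.

```python
def find_exceptions(records, tag, is_valid):
    """Return (line_number, value) pairs for `tag` whose value fails `is_valid`."""
    return [
        (line_no, value)
        for line_no, rec_tag, value in records
        if rec_tag == tag and not is_valid(value)
    ]

def looks_like_word(value):
    """Simplified check: non-empty, with no whitespace and no comma."""
    return bool(value) and not any(ch.isspace() or ch == "," for ch in value)

# Illustrative records only; line numbers are made up for the example.
records = [
    (1, "lx", ""),
    (5, "lx", "déláme"),
    (9, "lx", "ékslá, alá"),
    (10, "ps", "n"),
]
exceptions = find_exceptions(records, "lx", looks_like_word)
for line_no, value in exceptions:
    print(f"line {line_no:>4}: \\lx {value or '<no value>'}")
```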

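Tag type deduction

Tag types are deduced by examining the values of each tag. If the majority of values conform to a known type, the tag is taken to be of that type. (The majority required for this decision can be changed by selecting one of the "stricter" options.)

Types are checked in order, with more specific types checked first. A tag will therefore be deduced as the most specific type that can be applied to the set of values used for that tag.

A tag's type can be one of the following (from most specific to least specific):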

Order	Type	Description
1	^{tt1}$	Tag never has a value.
2	^{tt2}$	Numeric value, e.g. 1, 2, 3. The tag must have a value.
3	^{tt3}$	Numeric value, or may be empty.
4	^{tt4}$	A single word or phrase drawn from a limited set of possible values. A typical example could be \ps (part of speech) accepting one of: noun, verb, adjective, adverb,… The tag must have a value.
5	^{tt5}$	As above, or may be empty.
6	^{tt6}$	A single-word value. A word may include non-alphanumeric characters, but must include at least one alphanumeric character. It may not include any whitespace, period, comma or semicolon within the value. A trailing period, comma or semicolon is acceptable. The following are all valid words: ^{tt7}$, ^{tt8}$, ^{tt9}$. The tag must have a value.
7	^{tt10}$	As above, or may be empty.
8	^{tt11}$	A single-phrase value. Like ^{tt6}$ but may contain whitespace. May not contain a period, comma or semicolon except as a trailing character. ^{tt13}$ is a valid phrase. ^{tt14}$ is not (it is assumed to be a list value). The tag must have a value.
9	^{tt15}$	As above, or may be empty.
10	^{tt16}$	A list of words or phrases (separated by commas or semicolons) where each word or phrase is drawn from a limited set of possible values. The tag must have a value.
11	^{tt17}$	Any combination of characters, words or phrases. The tag must have a value.
12	^{tt18}$	Any combination of characters, words or phrases, or may be empty. The ^{tt18}$ type is generic, and indicates that no consistent pattern of usage could be deduced for the tag.
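
The ordered, majority-based deduction described above can be sketched as follows. This is an illustration under stated assumptions, not the sfm-sniffer algorithm: the type names are borrowed from the report output shown earlier, the 0.9 threshold is an invented value (the required majority is not stated here), and the optional, enumeration and list types are left out for brevity.

```python
import re

TRAILING = (".", ",", ";")

def _core(value):
    """Drop one trailing period, comma or semicolon, which the table allows."""
    return value[:-1] if value.endswith(TRAILING) else value

def is_number(value):
    return value.isdigit()

def is_word(value):
    # At least one alphanumeric character; no whitespace, period, comma or semicolon inside.
    core = _core(value)
    return any(c.isalnum() for c in core) and not re.search(r"[\s.,;]", core)

def is_phrase(value):
    # Like a word, but whitespace is allowed.
    core = _core(value)
    return any(c.isalnum() for c in core) and not re.search(r"[.,;]", core)

def is_text(value):
    return bool(value)

# Most specific first, so the first type that reaches the threshold wins.
CHECKS = [("number", is_number), ("word", is_word), ("phrase", is_phrase), ("text", is_text)]

def deduce_type(values, threshold=0.9):
    """Return the first type matched by at least `threshold` of the values."""
    for name, check in CHECKS:
        if values and sum(check(v) for v in values) / len(values) >= threshold:
            return name
    return "optional text"  # generic fallback: no consistent pattern found

print(deduce_type(["1", "2", "3"]))
print(deduce_type(["petite calebasse", "sorte de verre", "cent"]))
```

Raising `threshold` mimics the effect of the stricter options in one respect: a tag whose values are mostly words but include a few phrases stops qualifying as a word and falls through to a more general type.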

Coming soon…

Use sfm-struct-sniffer to analyze the tree structure of an SFM file and generate a suggested schema:

> sfm-struct-sniffer my_lexicon.sfm > my_lexicon.schema

Then use sfm-struct-sniffer to verify the integrity of the SFM data against the schema:

> sfm-struct-sniffer --verify --schema=my_lexicon.schema my_lexicon.sfm
...

The generated schema is a simple text file, so it can easily be modified:

\lx
    \ps
        \ge
        \go?
        \sn?
            \ge
            \go?
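
Since the schema is plain indented text, it can be read into a tree with a small stack-based parser. The sketch below is hypothetical and not part of sfm-utils; it assumes that indentation uses spaces and that a trailing `?` marks an optional tag, which is an interpretation of the example rather than something stated here.

```python
def parse_schema(text):
    """Build a nested tree from an indented schema; a trailing '?' is read as 'optional'."""
    root = {"tag": None, "optional": False, "children": []}
    stack = [(-1, root)]  # (indent width, node) for each currently open ancestor
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        name = line.strip().lstrip("\\")
        node = {"tag": name.rstrip("?"), "optional": name.endswith("?"), "children": []}
        while stack[-1][0] >= indent:  # close siblings and deeper levels
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((indent, node))
    return root

schema = r"""\lx
    \ps
        \ge
        \go?
        \sn?
            \ge
            \go?
"""
tree = parse_schema(schema)
ps = tree["children"][0]["children"][0]
print([child["tag"] for child in ps["children"]])
```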

When an SFM file needs to be manually edited or corrected, the data can be formatted by sfm-struct-sniffer so that indentation shows the tree structure:

> sfm-struct-sniffer --format --schema=my_lexicon.schema my_lexicon.sfm
\lx déláme
    \ps n
        \gn petite calebasse
    \ps v
        \gn sorte de verre
        \ge drinking bowl
        \gr ɓi loonde
\lx deremke
    \ps num
        \gn cent
        \ge one hundred
        \gr temerre

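This also makes it easier to reason about the results of importing the data into SIL Fieldworks Language Explorer (FLEx).

Future proposal

sfm-struct-sniffer could embed comments in the file to highlight unexpected or ambiguous tree elements, for example: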

\lx déláme
   \ps n
# >>> unexpected \sn
      \sn 1
# <<<

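Features

Works with any SFM file. The inferred types are the result of statistical analysis of the SFM file's contents. No semantic assumptions are made, and no prior knowledge is assumed.

Usage

Use --help to display usage information and options for sfm-sniffer. See also the example walkthrough.

Usage: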

sfm-sniffer [--tags=<dictionary>] [--summary] [--normal|--stricter|--strictest] <file>
sfm-sniffer --dumptags
sfm-sniffer (-h | --help)
sfm-sniffer --version

Options:

`-t, --tags=file`
	Read a dictionary file that maps tags to labels. If unspecified, the default MDF tag labels will be used. [1]
`-s, --summary`	Output a summary report only.
`-1, --normal`	Apply normal type deduction rules.
`-2, --stricter`	Apply stricter type deduction rules.
`-3, --strictest`
	Apply strictest type deduction rules.
`-d, --dumptags`	Print the default SFM tag dictionary in the format used by --tags.
`-h, --help`	Show this screen.
`--version`	Show version.

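Applying stricter type deduction rules favors more specific types (e.g. number or word) over more general types (e.g. optional text). However, stricter type deduction rules are more likely to produce a large number of exceptions.

Similarly, for sfm-struct-sniffer:

Usage: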

sfm-struct-sniffer [--tags=<dictionary>] <file>
sfm-struct-sniffer --dumptags
sfm-struct-sniffer (-h | --help)
sfm-struct-sniffer --version

Options:

`-t, --tags=file`
	Read a dictionary file that maps tags to labels. If unspecified, the default MDF tag labels will be used. [1]
`-d, --dumptags`	Print the default SFM tag dictionary in the format used by --tags.
`-h, --help`	Show this screen.
`--version`	Show version.

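Repository contents

To do

See also

SOLID is an existing graphical utility from SIL for checking, cleaning and converting SDF files.

References

Making Dictionaries: A guide to lexicography and the Multi-Dictionary Formatter (Coward & Grimes, 2000): covers MDF (the Multi-Dictionary Formatter) and defines a common set of SFM backslash codes.
Technical Notes on SFM Database Import (Ken Zook, 2010): provides further information on issues that may be encountered when working with SFM files.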

sfm-utils 0.1.0rc1.post1
