Tools for reading OPUS data
Detailed description of the opustools-pkg Python project
OpusTools
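Tools for accessing and processing OPUS data.
- opus_read: read parallel data sets and convert them to different output formats
- opus_express: create test/dev/train sets from OPUS data
- opus_cat: extract a given OPUS document from release data
- opus_get: download files from OPUS
- opus_langid: add language ids to sentences in XML files in zip archives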
opus_read
Usage
usage: opus_read [-h] -d corpus_name -s langid -t langid [-r version]
[-p {raw,xml,parsed}] [-m M] [-S S] [-T T] [-a attribute]
[-tr TR] [-ln] [-w file_name [file_name ...]]
[-wm {normal,moses,tmx,links}] [-pn] [-f] [-rd path_to_dir]
[-af path_to_file] [-sz path_to_zip] [-tz path_to_zip]
[-cm delimiter] [-pa] [-sa attribute [attribute ...]]
[-ta attribute [attribute ...]] [-ca delimiter]
[--src_cld2 lang_id score] [--trg_cld2 lang_id score]
[--src_langid lang_id score] [--trg_langid lang_id score]
[-id file_name] [-q]
arguments:
-h, --help show this help message and exit
-d corpus_name Corpus name
-s langid Source language
-t langid Target language
-r version Release (default=latest)
-p {raw,xml,parsed} Pre-process-type (raw, xml or parsed, default=xml)
-m M Maximum number of alignments
-S S Number of source sentences in alignments (range is
allowed, eg. -S 1-2)
-T T Number of target sentences in alignments (range is
allowed, eg. -T 1-2)
-a attribute Set attribute for filtering
-tr TR Set threshold for an attribute
-ln Leave non-alignments out
-w file_name [file_name ...]
Write to file. To print moses format in separate
files, enter two file names. Otherwise enter one file
name.
-wm {normal,moses,tmx,links}
Set write mode
-pn Print file names when using moses format
-f Fast parsing. Faster than normal parsing, if you print
a small part of the whole corpus, but requires the
sentence ids in alignment files to be in sequence.
-rd path_to_dir Change root directory (default=/proj/nlpl/data/OPUS/)
-af path_to_file Use given alignment file
-sz path_to_zip Use given source zip file
-tz path_to_zip Use given target zip file
-cm delimiter Change moses delimiter (default=tab)
-pa Print annotations, if they exist
-sa attribute [attribute ...]
Set source sentence annotation attributes to be
printed, e.g. -sa pos lem. To print all available
attributes use -sa all_attrs (default=pos,lem)
-ta attribute [attribute ...]
Set target sentence annotation attributes to be
printed, e.g. -ta pos lem. To print all available
attributes use -ta all_attrs (default=pos,lem)
-ca delimiter Change annotation delimiter (default=|)
--src_cld2 lang_id score
Filter source sentences by their cld2 language id
labels and confidence score, e.g. en 0.9
--trg_cld2 lang_id score
Filter target sentences by their cld2 language id
labels and confidence score, e.g. en 0.9
--src_langid lang_id score
Filter source sentences by their langid.py language id
labels and confidence score, e.g. en 0.9
--trg_langid lang_id score
Filter target sentences by their langid.py language id
labels and confidence score, e.g. en 0.9
-id file_name Write sentence ids to a file.
-q Download necessary files without prompting "(y/n)"
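As a rough illustration of the -wm moses, -w and -cm options above, the sketch below formats aligned sentence pairs either as one delimited line per pair or as two parallel lists of lines. It is not OpusTools code: the function names, the sample sentences, and the choice to join multi-sentence alignments with a space are assumptions made for illustration.

```python
def moses_single(pairs, delimiter="\t"):
    """One line per alignment: source and target separated by the delimiter."""
    return [
        " ".join(src) + delimiter + " ".join(trg)
        for src, trg in pairs
    ]


def moses_parallel(pairs):
    """Two line lists (source, target) with matching line numbers."""
    src_lines = [" ".join(src) for src, _ in pairs]
    trg_lines = [" ".join(trg) for _, trg in pairs]
    return src_lines, trg_lines


# Hypothetical alignments: each side is a list of sentences (2:1, then 1:1).
pairs = [
    (["Hello.", "How are you?"], ["Hei, mitä kuuluu?"]),
    (["Thanks."], ["Kiitos."]),
]

single = moses_single(pairs)
src_lines, trg_lines = moses_parallel(pairs)
print(single[1])
```

Keeping line numbers identical across the two lists is what makes the two-file variant usable as parallel text.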
Examples:
Read sentence alignment in XCES align format:
opus_read -d Books -s en -t fi
Print alignments with alignment certainty greater than LinkThr=0:
opus_read -d MultiUN -s en -t es -a certainty -tr 0
Print the first 10 alignment pairs:
opus_read -d Books -s en -t fi -m 10
Print all 1:1 sentence alignments in XCES align format:
opus_read -d Books -s en -t fi -S 1 -T 1 -wm links
You can also import the module into your Python script:
In your_script.py, first import the package:
import opustools_pkg
If you want to give the arguments on the command line, initialize OpusRead with an empty argument list:
opus_reader = opustools_pkg.OpusRead([])
opus_reader.printPairs()
and then run:
python3 your_script.py -d Books -s en -t fi
Alternatively, you can initialize OpusRead with the arguments in a list:
opus_reader = opustools_pkg.OpusRead(["-d", "Books", "-s", "en", "-t", "fi"])
opus_reader.printPairs()
and then run:
python3 your_script.py
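Description
opus_read is a script that reads sentence alignments stored in XCES align format and prints the aligned sentences to STDOUT. It requires monolingual alignments of sentences in linked XML files. The linked XML files are specified in the "toDoc" and "fromDoc" attributes (see below).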
<cesAlign version="1.0">
<linkGrp targType="s" toDoc="source1.xml" fromDoc="target1.xml">
<link certainty="0.88" xtargets="s1.1 s1.2;s1.1" id="SL1" />
....
<linkGrp targType="s" toDoc="source2.xml" fromDoc="target2.xml">
<link certainty="0.88" xtargets="s1.1;s1.1" id="SL1" />
Several parameters can be set to filter the alignments and to print only certain types of alignments.
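To make the filtering concrete, here is a minimal sketch that parses a fragment like the one above with the standard library and keeps only links that pass a certainty threshold or a sentence-count constraint, loosely mirroring -a certainty -tr and -S/-T. It is not the OpusTools implementation: the function names and the links SL2 and SL3 are hypothetical, and treating the threshold as inclusive (>=) is an assumption.

```python
import xml.etree.ElementTree as ET

# Sample modelled on the XCES fragment above, closed so that it parses.
XCES = """<cesAlign version="1.0">
<linkGrp targType="s" toDoc="source1.xml" fromDoc="target1.xml">
<link certainty="0.88" xtargets="s1.1 s1.2;s1.1" id="SL1" />
<link certainty="0.10" xtargets="s2.1;s2.1" id="SL2" />
<link certainty="0.95" xtargets="s3.1;s3.1" id="SL3" />
</linkGrp>
</cesAlign>"""


def split_xtargets(xtargets):
    """Split 's1.1 s1.2;s1.1' into (['s1.1', 's1.2'], ['s1.1'])."""
    src, trg = xtargets.split(";")
    return src.split(), trg.split()


def filter_links(xml_text, threshold=None, n_src=None, n_trg=None):
    """Return ids of links that pass the certainty and size filters."""
    kept = []
    for link in ET.fromstring(xml_text).iter("link"):
        src_ids, trg_ids = split_xtargets(link.get("xtargets"))
        # Assumption: a link passes when its certainty is >= the threshold.
        if threshold is not None and float(link.get("certainty")) < threshold:
            continue
        if n_src is not None and len(src_ids) != n_src:
            continue
        if n_trg is not None and len(trg_ids) != n_trg:
            continue
        kept.append(link.get("id"))
    return kept


print(filter_links(XCES, threshold=0.5))     # certainty filter only
print(filter_links(XCES, n_src=1, n_trg=1))  # 1:1 alignments only
```

The part of xtargets before the semicolon lists the sentence ids on one side of the alignment and the part after it lists the ids on the other side, so counting the ids on each side gives the alignment type (1:1, 2:1 and so on).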
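opus_read can also be used to filter XCES alignment files and to print the remaining links in the same XCES align format. Set the "-wm" flag to "links" to enable this mode.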
opus_read reads the alignments from zip files. Starting up the script may take some time if the zip files are large (for example OpenSubtitles in OPUS).
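opus_read uses ExhaustiveSentenceParser by default. This means that each time a <linkGrp> tag is found, the corresponding source and target documents are read through and every sentence is stored in a hash map with the sentence id as the key. This allows the reader to handle alignment files whose sentence ids are not in sequential order. Each time a <linkGrp> tag is found, the script briefly pauses printing in order to read through the source and target documents. The duration of the pause depends on the size of the source and target documents.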
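The hash-map approach described above can be sketched as follows. This only illustrates the idea and is not the ExhaustiveSentenceParser code: the function names and the sample document are hypothetical.

```python
def build_index(sentences):
    """Store every (id, text) pair in a dict keyed by sentence id."""
    return {sid: text for sid, text in sentences}


def lookup(index, ids):
    """Fetch sentences by id in any order; missing ids are skipped."""
    return [index[sid] for sid in ids if sid in index]


# Hypothetical document content, as (sentence id, text) pairs.
source_doc = [("s1", "First."), ("s2", "Second."), ("s3", "Third.")]
index = build_index(source_doc)

# Ids arrive out of order, as they might in an alignment file.
result = lookup(index, ["s3", "s1"])
print(result)
```

Building the index costs one full pass over the document, which corresponds to the pause at each <linkGrp> tag; after that, every lookup is independent of the order of the ids.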
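Using the "-f" flag enables SentenceParser, which is faster than ExhaustiveSentenceParser when only a small part of the corpus is read. SentenceParser does not store the sentences in a hash map. Instead, when it finds a <link> tag, it iterates through the sentence file until it reaches a sentence id that matches the id found in the <link> tag. SentenceParser cannot go backwards, which means that if the ids in the alignment file are not in sequential order, the parser will not find alignment pairs after the sentence id sequence breaks. SentenceParser is less reliable than ExhaustiveSentenceParser, but the "-f" flag is beneficial when the whole corpus does not need to be scanned, in other words, when the "-m" flag is used.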
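The forward-only behaviour can be illustrated with a plain Python iterator. This is a sketch of the idea rather than the SentenceParser code: the function name and the sample document are hypothetical.

```python
def find_forward(sentence_iter, wanted_id):
    """Advance the iterator until wanted_id is found; never rewinds."""
    for sid, text in sentence_iter:
        if sid == wanted_id:
            return text
    return None  # iterator exhausted: the id was behind us or absent


source_doc = [("s1", "First."), ("s2", "Second."), ("s3", "Third.")]

# Ids in sequence: every lookup succeeds.
it = iter(source_doc)
in_order = [find_forward(it, sid) for sid in ["s1", "s3"]]

# Sequence breaks after s3: s1 is already behind the read position.
it = iter(source_doc)
out_of_order = [find_forward(it, sid) for sid in ["s3", "s1"]]

print(in_order)
print(out_of_order)
```

In the second run the search for s1 consumes the rest of the iterator, so any later lookups would fail as well. In exchange, nothing has to be read or stored beyond the last sentence requested, which is why this strategy pays off when only the first few alignments are printed.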
opus_express
Usage
usage: opus_express [-h] [-f] -s lang_id -t lang_id
[-c [coll_name [coll_name ...]]]
[--root-dir /path/to/OPUS] [--test-override /path/to/file]
[--test-quota num_sents] [--dev-quota num_sents]
[--doc-bounds] [--quality-aware]
[--overlap-threshold min_pct] [--shuffle]
[--test-set filename] [--dev-set filename]
[--train-set filename]
arguments:
-h, --help show this help message and exit
-f, --force suppress warnings (default: False)
-s lang_id, --src-lang lang_id
source language (e.g. `en')
-t lang_id, --tgt-lang lang_id
target language (e.g. `pt')
-c [coll_name [coll_name ...]], --collections [coll_name [coll_name ...]]
OPUS collection(s) to fetch (default: `OpenSubtitles')
Collections list: ['ALL', 'ada83', 'Bianet', 'bible-
uedin', 'Books', 'CAPES', 'DGT', 'DOGC', 'ECB',
'EhuHac', 'Elhuyar', 'EMEA', 'EUbookshop', 'EUconst',
'Europarl', 'Finlex', 'fiskmo', 'giga-fren',
'GlobalVoices', 'GNOME', 'hrenWaC', 'JRC-Acquis',
'KDE4', 'KDEdoc', 'MBS', 'memat', 'MontenegrinSubs',
'MPC1', 'MultiUN', 'News-Commentary', 'OfisPublik',
'OpenOffice', 'OpenSubtitles', 'ParaCrawl', 'PHP',
'QED', 'RF', 'sardware', 'SciELO', 'SETIMES', 'SPC',
'Tanzil', 'Tatoeba', 'TED2013', 'TedTalks', 'TEP',
'TildeMODEL', 'Ubuntu', 'UN', 'UNPC', 'wikimedia',
'Wikipedia', 'WikiSource', 'WMT-News', 'XhosaNavy']
--root-dir /path/to/OPUS
Root directory for OPUS
(default:`/proj/nlpl/data/OPUS')
--test-override /path/to/file
path to file containing resource IDs to reserve for
the test set (default: None)
--test-quota num_sents
test set size in sentences (default: 10000)
--dev-quota num_sents
development set size in sentences (default: 10000)
--doc-bounds preserve document blocks (also marks document
boundaries) (default: False)
--quality-aware reserve one-to-one aligned samples with high overlap
for test/dev sets (incompatible with `--doc-bounds')
(default: False)
--overlap-threshold min_pct
threshold for alignment overlap in `--quality-aware'
mode (default: 0.8)
--shuffle shuffle samples (incompatible with `--doc-bounds')
(default: False)
--test-set filename filename stub for output test set (default: `test')
--dev-set filename filename stub for output development set (default:
`dev')
--train-set filename filename stub for output training set (default:
`train')
Description
All aboard the OPUS Express! Create test/dev/train sets from OPUS data.
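The quota options can be pictured with the minimal sketch below, which reserves a fixed number of samples for the test and development sets and leaves the rest for training. This is an assumption about the general shape of such a split, not the opus_express algorithm: it ignores --doc-bounds and --quality-aware, and the function name, the seed parameter, and the sample data are hypothetical.

```python
import random


def split_samples(samples, test_quota, dev_quota, shuffle=False, seed=0):
    """Reserve test_quota samples for test, dev_quota for dev, rest for train."""
    samples = list(samples)
    if shuffle:
        # Seeded so that the split is reproducible.
        random.Random(seed).shuffle(samples)
    test = samples[:test_quota]
    dev = samples[test_quota:test_quota + dev_quota]
    train = samples[test_quota + dev_quota:]
    return test, dev, train


# Hypothetical sentence pairs; the documented default quotas are 10000 each.
samples = [(f"src {i}", f"trg {i}") for i in range(10)]
test, dev, train = split_samples(samples, test_quota=2, dev_quota=3)
print(len(test), len(dev), len(train))
```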
opus_cat
Usage
usage: opus_cat [-h] -d D -l L [-i] [-m M] [-p] [-f F] [-r R] [-pa]
[-sa SA [SA ...]] [-ca CA]
arguments:
-h, --help show this help message and exit
-d D Corpus name
-l L Language
-i Print without ids when using -p
-m M Maximum number of sentences
-p Print in plain txt
-f F File name (if not given, prints all files)
-r R Release (default=latest)
-pa Print annotations, if they exist
-sa SA [SA ...] Set sentence annotation attributes to be printed, e.g. -sa
pos lem. To print all available attributes use -sa
all_attrs (default=pos,lem)
-ca CA Change annotation delimiter (default=|)
You can also import the module into your Python script:
In your_script.py, first import the package:
import opustools_pkg
If you want to give the arguments on the command line, initialize OpusCat with an empty argument list:
opus_cat = opustools_pkg.OpusCat([])
opus_cat.printSentences()
and then run:
python3 your_script.py -d Books -l en
Alternatively, you can initialize OpusCat with the arguments in a list:
opus_cat = opustools_pkg.OpusCat(["-d", "Books", "-l", "en"])
opus_cat.printSentences()
and then run:
python3 your_script.py
Description
Read a document from OPUS and print it to STDOUT.
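The annotation options (-pa, -sa, -ca) suggest output in which each word is followed by its attribute values, separated by a delimiter. The sketch below shows one plausible rendering. The exact output layout of opus_cat is not specified here, so the word|pos|lem layout, the function name, and the sample tokens are all assumptions.

```python
def annotate(tokens, attributes=("pos", "lem"), delimiter="|"):
    """Join each word with the requested attribute values."""
    out = []
    for token in tokens:
        parts = [token["word"]]
        # Attributes missing from a token are skipped.
        parts += [token[a] for a in attributes if a in token]
        out.append(delimiter.join(parts))
    return " ".join(out)


# Hypothetical parsed sentence; attribute values are made up.
sentence = [
    {"word": "Dogs", "pos": "NOUN", "lem": "dog"},
    {"word": "bark", "pos": "VERB", "lem": "bark"},
]

default_line = annotate(sentence)
pos_only = annotate(sentence, attributes=["pos"], delimiter="/")
print(default_line)
print(pos_only)
```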
opus_get
Usage
usage: opus_get [-h] -s S [-t T] [-d D] [-r R] [-p {raw,xml,parsed}] [-l]
[-dl DL] [-q]
arguments:
-h, --help show this help message and exit
-s S Source language
-t T Target language
-d D Corpus name
-r R Release
-p {raw,xml,parsed} Pre-process type
-l List resources
-dl DL Set download directory (default=current directory)
-q Download necessary files without prompting "(y/n)"
Description
Download files from OPUS.
opus_langid
Usage
usage: opus_langid [-h] -f F [-t T] [-v] [-s]
arguments:
-h, --help show this help message and exit
-f F File path
-t T Target file path. By default, the original file is edited
-v Verbosity. -v: print current xml file
-s Suppress error messages in language detection
Description
Add language ids to sentences in plain XML files or in XML files inside zip archives, using pycld2 and langid.py.
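The general idea of attaching a language id to each sentence element can be sketched with the standard library. This is not the opus_langid code: to stay self-contained it uses a trivial stand-in detector instead of calling pycld2 or langid.py, and the attribute names langid and langidconf, the function names, and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def detect_stub(text):
    """Placeholder detector returning (language id, confidence)."""
    # Assumption: stands in for pycld2 / langid.py, which are not called here.
    return ("fi", 0.9) if "ä" in text else ("en", 0.9)


def add_language_ids(xml_text, detect):
    """Annotate every <s> element with a language id and a confidence."""
    root = ET.fromstring(xml_text)
    for s in root.iter("s"):
        lang, conf = detect("".join(s.itertext()))
        # Attribute names are illustrative, not confirmed by this document.
        s.set("langid", lang)
        s.set("langidconf", str(conf))
    return root


DOC = '<text><s id="1">Hello there.</s><s id="2">Hyvää päivää.</s></text>'
root = add_language_ids(DOC, detect_stub)
langs = [s.get("langid") for s in root.iter("s")]
print(langs)
```

Passing the detector in as a function keeps the XML handling separate from the language identification, so a real detector could be substituted without changing the traversal.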