SudachiPy Python package - PyPI

Python version of Sudachi, a Japanese morphological analyzer

SudachiPy project description

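SudachiPy

SudachiPy is the Python version of Sudachi, a Japanese morphological analyzer.

Sudachi and SudachiPy are developed at the WAP Tokushima Laboratory of AI and NLP, an institute under Works Applications that focuses on natural language processing (NLP).

Warning: some functions are still not compatible with the Java version of Sudachi.

Breaking changes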

v0.3.0

The resources/ directory has been moved into sudachipy/.

v0.2.2

The SudachiPy package is distributed via PyPI
- pip install SudachiPy

v0.2.0

Added the user dictionary feature

Easy setup

SudachiPy requires Python 3.5+.

Step 1: Install SudachiPy

SudachiPy is distributed from PyPI. You can install it by running pip install SudachiPy from the command line.

$ pip install SudachiPy

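SudachiPy (>= v0.3.0) uses the system.dic from SudachiDict_core by default. That dictionary is not included in the SudachiPy package, so proceed to Step 2 to install the dictionary package.

Step 2: Install SudachiDict_core

The default dictionary package, SudachiDict_core, is distributed from our download site. Run pip install as follows: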

$ pip install https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_core-20190718.tar.gz
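
As a quick sanity check after the two steps above, the following sketch reports which of the two packages cannot be found for import. It assumes the import names are sudachipy and sudachidict_core (the lowercased package names); the helper missing_packages is hypothetical and not part of SudachiPy.

```python
import importlib.util


def missing_packages(names=("sudachipy", "sudachidict_core")):
    """Return the top-level module names from `names` that cannot be found."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("SudachiPy and its default dictionary package are importable.")
```

The check only looks the modules up and does not load the dictionary, so it will not catch a corrupted or unlinked dictionary file.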

Usage

As a command

After installing SudachiPy, you can also use it from the terminal via the sudachipy command.

You can run sudachipy on standard input as follows:

$ sudachipy

sudachipy has 4 subcommands (tokenize is the default):

$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v] [file [file ...]]

Tokenize Text

positional arguments:
  file           text written in utf-8

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -a             print all of the fields
  -d             print the debug information
  -v, --version  print sudachipy version

$ sudachipy link -h
usage: sudachipy link [-h] [-t {small,core,full}] [-u]

Link Default Dict Package

optional arguments:
  -h, --help            show this help message and exit
  -t {small,core,full}  dict dict
  -u                    unlink sudachidict

$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format

$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

optional arguments:
  -h, --help  show this help message and exit
  -d string   description comment to be embedded on dictionary
  -o file     output file (default: user.dic)
  -s file     system dictionary (default: ${SUDACHIPY}/resouces/system.dic)
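
If you want to drive the tokenize subcommand from a script, the sketch below builds its argument list from the flags shown in the help text above and pipes text in through standard input. The helper names build_tokenize_command and tokenize_text are hypothetical, and the subprocess call assumes Python 3.6+ and that the sudachipy command is on PATH.

```python
import shutil
import subprocess


def build_tokenize_command(files=(), mode=None, output=None, all_fields=False):
    """Build an argv list for `sudachipy tokenize` using -m, -o and -a."""
    if mode is not None and mode not in ("A", "B", "C"):
        raise ValueError("mode must be one of 'A', 'B', 'C'")
    command = ["sudachipy", "tokenize"]
    if mode is not None:
        command += ["-m", mode]
    if output is not None:
        command += ["-o", str(output)]
    if all_fields:
        command.append("-a")
    command += [str(f) for f in files]
    return command


def tokenize_text(text, **options):
    """Send `text` to the command on standard input and return its stdout."""
    if shutil.which("sudachipy") is None:
        raise RuntimeError("the sudachipy command is not on PATH")
    result = subprocess.run(
        build_tokenize_command(**options),
        input=text,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

For example, tokenize_text("国家公務員", mode="A") would run sudachipy tokenize -m A with the text on standard input. The output format is whatever the installed command prints and is not parsed here.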

As a Python package

Here is a usage example:

from sudachipy import tokenizer
from sudachipy import dictionary

tokenizer_obj = dictionary.Dictionary().create()

# Multi-granular tokenization
# using `system_core.dic` or `system_full.dic` version 20190718
# you may not be able to replicate this particular example due to the dictionary you use

mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家公務員']

mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務員']

mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務', '員']

# Morpheme information

m = tokenizer_obj.tokenize("食べ", mode)[0]

m.surface()  # => '食べ'
m.dictionary_form()  # => '食べる'
m.reading_form()  # => 'タベ'
m.part_of_speech()  # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']

# Normalization

tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
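
The morpheme accessors used in the example can be gathered into one record per token, which is often easier to log or serialize. This is a minimal sketch: the describe helper is hypothetical, and it only calls the methods that appear in the example, with the same tokenize(text, mode) argument order.

```python
def describe(tokenizer_obj, text, mode):
    """Tokenize `text` and return one dict of morpheme fields per token."""
    return [
        {
            "surface": m.surface(),
            "dictionary_form": m.dictionary_form(),
            "reading_form": m.reading_form(),
            "normalized_form": m.normalized_form(),
            "part_of_speech": m.part_of_speech(),
        }
        for m in tokenizer_obj.tokenize(text, mode)
    ]


# Usage (requires SudachiPy and a dictionary package to be installed):
#
#   from sudachipy import dictionary, tokenizer
#   tokenizer_obj = dictionary.Dictionary().create()
#   rows = describe(tokenizer_obj, "国家公務員", tokenizer.Tokenizer.SplitMode.C)
```

The tokenizer object and split mode are passed in so that one tokenizer can be reused across calls.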

Installing dictionary packages

You can download and install the built dictionaries from Python packages · WorksApplications/SudachiDict.

$ pip install SudachiDict_full-20190718.tar.gz

You can change the default dictionary package by running the link command.

$ sudachipy link -t full

You can also remove the default dictionary setting.

$ sudachipy link -u

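Customized dictionary

If you need to apply a customized system.dic, place sudachi.json anywhere you like and overwrite the systemDict value with the relative path from sudachi.json to your system.dic.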

{
    "systemDict" : "relative/path/to/system.dic",
    ...
}

Then specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json
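
The relative path can be computed rather than typed by hand. The sketch below updates systemDict in an existing copy of sudachi.json; the helper set_system_dict is hypothetical, and it assumes the path is meant to be relative to the directory that contains sudachi.json.

```python
import json
import os


def set_system_dict(settings_path, system_dic_path):
    """Point `systemDict` in a sudachi.json copy at `system_dic_path`.

    The stored value is relative to the directory holding the settings file.
    All other keys in the file are kept unchanged.
    """
    settings_dir = os.path.dirname(os.path.abspath(settings_path))
    with open(settings_path, encoding="utf-8") as f:
        settings = json.load(f)
    settings["systemDict"] = os.path.relpath(
        os.path.abspath(system_dic_path), settings_dir
    )
    with open(settings_path, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False, indent=4)
    return settings["systemDict"]
```

Note that os.path.relpath uses the platform's path separator, so the stored value contains backslashes on Windows.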

Eventually, we would like to provide a way to fetch these resources from code, as NLTK (e.g., import nltk; nltk.download()) and spaCy (e.g., $ python -m spacy download en) do.

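User defined dictionary

If you need to apply a user defined dictionary, user.dic, place sudachi.json anywhere you like and add a userDict value containing the relative path from sudachi.json to your user.dic.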

{
    "userDict" : ["relative/path/to/user.dic"],
    ...
}
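
Since userDict is a list, a script can append an entry without disturbing the ones already there. This is a sketch under the same assumption as for systemDict (paths relative to the directory that contains sudachi.json); the helper add_user_dict is hypothetical.

```python
import json
import os


def add_user_dict(settings_path, user_dic_path):
    """Append `user_dic_path` to the `userDict` list in a sudachi.json copy."""
    settings_dir = os.path.dirname(os.path.abspath(settings_path))
    with open(settings_path, encoding="utf-8") as f:
        settings = json.load(f)
    relative = os.path.relpath(os.path.abspath(user_dic_path), settings_dir)
    # Create the list if the key is missing, and avoid duplicate entries.
    user_dicts = settings.setdefault("userDict", [])
    if relative not in user_dicts:
        user_dicts.append(relative)
    with open(settings_path, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False, indent=4)
    return user_dicts
```

Calling it twice with the same dictionary leaves a single entry in the list.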

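You can also build a user dictionary with the ubuild subcommand.

For the file format, please see the linked document (written in Japanese; no English documentation is available yet).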

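For developers

Code format

You can run ./scripts/format.sh to check whether your code follows the rules. flake8, flake8-import-order, and flake8-builtins are required. See requirements.txt.

Test

You can run ./scripts/test.sh to check whether your changes cause regressions.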

SudachiPy 0.3.12
