pybo Python package - PyPI

Python utilities for processing Tibetan

pybo project description

PYBO - Tibetan NLP in Python

Overview

pybo tokenizes Tibetan text into words.

Basic usage

Getting started

Requires Python 3 to be installed.

pip3 install pybo

Tokenizing a string

drupchen@drupchen:~$ pybo tok-string "༄༅། །རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྻ་ཨ་བ་ཏ་ར། བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། །སངས་རྒྱས་དང་བྱང་ཆུབ་སེམས་དཔའ་ཐམས་ཅད་ལ་ཕྱག་འཚལ་ལོ། །བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང༌། །ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །བདེ་གཤེགས་སྲས་ཀྱི་སྡོམ་ལ་འཇུག་པ་ནི། །ལུང་བཞིན་མདོར་བསྡུས་ནས་ནི་བརྗོད་པར་བྱ། །"
Loading Trie... (2s.)
༄༅།_། རྒྱ་གར་ སྐད་ དུ །_ བོ་ དྷི་ སཏྭ་ ཙརྻ་ ཨ་བ་ ཏ་ ར །_ བོད་སྐད་ དུ །_ བྱང་ཆུབ་ སེམས་དཔ འི་ སྤྱོད་པ་ ལ་ འཇུག་པ །_། སངས་རྒྱས་ དང་ བྱང་ཆུབ་
སེམས་དཔའ་ ཐམས་ཅད་ ལ་ ཕྱག་ འཚལ་ ལོ །_། བདེ་གཤེགས་ ཆོས་ ཀྱི་ སྐུ་ མངའ་ སྲས་ བཅས་ དང༌ །_། ཕྱག་འོས་ ཀུན་ ལ འང་ གུས་པ ར་ ཕྱག་ འཚལ་
ཏེ །_། བདེ་གཤེགས་ སྲས་ ཀྱི་ སྡོམ་ ལ་ འཇུག་པ་ ནི །_། ལུང་ བཞིན་ མདོར་བསྡུས་ ནས་ ནི་ བརྗོད་པ ར་ བྱ །_།
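The output above is a single space-delimited string in which underscores stand in for spaces of the original input. A minimal sketch of post-processing such output in Python follows; `split_tokens` is a hypothetical helper for illustration, not part of pybo, and it assumes the input contained no literal underscores, since those cannot be told apart from replaced spaces.

```python
def split_tokens(tokenized: str) -> list[str]:
    """Split space-delimited tokenizer output into a list of tokens.

    Underscores are turned back into spaces, on the assumption that
    every underscore marks a space from the original input.
    """
    return [token.replace("_", " ") for token in tokenized.split()]


# A short excerpt of the output shown above.
sample = "བོད་སྐད་ དུ །_ བྱང་ཆུབ་"
tokens = split_tokens(sample)
print(len(tokens))
print(tokens)
```

Joining the returned tokens with an empty string gives back the original text for this excerpt, because the only spaces left are the ones restored from underscores.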

Tokenizing a file

The output is written to a file with the same name, suffixed with _pybo.

The file that will be tokenized:
drupchen@drupchen:~$ head text.txt
བཀྲ་ཤི་ས་བདེ་ལེགས་ཕུན་སུམ་ཚོགས། །རྟག་ཏུ་བདེ་བ་ཐོབ་པར་ཤོག། །

drupchen@drupchen:~$ pybo tok-file text.txt
parsing text.txt...
Loading Trie... (2s.)done

The output file:
drupchen@drupchen:~$ head text_pybo.txt
བཀྲ་ ཤི་ ས་ བདེ་ལེགས་ ཕུན་སུམ་ ཚོགས །_། རྟག་ ཏུ་ བདེ་བ་ ཐོབ་པ ར་ ཤོག །_།
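The naming convention seen here (text.txt becomes text_pybo.txt) can be reproduced with pathlib when a script needs to locate the output file. This is only a sketch that mirrors the names shown in this document: `pybo_output_path` is a hypothetical helper, not a pybo function, and how pybo itself handles files with several suffixes or none is not covered here.

```python
from pathlib import Path


def pybo_output_path(in_file: Path) -> Path:
    """Return the sibling path with '_pybo' inserted before the extension."""
    return in_file.with_name(f"{in_file.stem}_pybo{in_file.suffix}")


print(pybo_output_path(Path("text.txt")))  # text_pybo.txt
```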

Using pybo as a Python library

>>> from pybo import Text

>>> # input is a multi-line input string
>>> in_str = """ལེ གས། བཀྲ་ཤིས་མཐའི་ ༆ ཤི་བཀྲ་ཤིས་  tr 
... བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། 
... མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ།"""

### STEP1: instantiating Text

>>> # A. on a string
>>> t = Text(in_str)

>>> # B. on a file
... # note all following operations can be applied to files in this way.
>>> from pathlib import Path
>>> in_file = Path.cwd() / 'test.txt'

>>> # file content:
>>> in_file.read_text()
'བཀྲ་ཤིས་བདེ་ལེགས།།\n'

>>> t = Text(in_file)
>>> t.tokenize_chunks_plaintext

>>> # checking an output file has been written:
... # BOM is added by default so that notepad in Windows doesn't scramble the line breaks
>>> out_file = Path.cwd() / 'test_pybo.txt'
>>> out_file.read_text()
'\ufeffབཀྲ་ ཤིས་ བདེ་ ལེགས །།'

### STEP2: properties will perform actions on the input string:
### note: original spaces are replaced by underscores.

>>> # OUTPUT1: chunks are meaningful groups of chars from the input string.
... # see how punctuation, numerals, non-bo and syllables are all neatly grouped.
>>> t.tokenize_chunks_plaintext
'ལེ_གས །_ བཀྲ་ ཤིས་ མཐའི་ _༆_ ཤི་ བཀྲ་ ཤིས་__ tr_\n བདེ་་ ལེ_གས །_ བཀྲ་ ཤིས་ བདེ་ ལེགས་ ༡༢༣ ཀཀ །_\n མཐའི་ རྒྱ་ མཚོར་ གནས་ པའི་ ཉས་ ཆུ་ འཐུང་ །།_།། མཁའ །'

>>> # OUTPUT2: could as well be achieved by in_str.split(' ')
>>> t.tokenize_on_spaces
'ལེ གས། བཀྲ་ཤིས་མཐའི་ ༆ ཤི་བཀྲ་ཤིས་ tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ།'

>>> # OUTPUT3: segments in words.
... # see how བདེ་་ལེ_གས was still recognized as a single word, even with the space and the double tsek.
... # the affixed particles are separated from the hosting word: མཐ འི་ རྒྱ་མཚོ ར་ གནས་པ འི་ ཉ ས་
>>> t.tokenize_words_raw_text
Loading Trie... (2s.)
'ལེ_གས །_ བཀྲ་ཤིས་ མཐ འི་ _༆_ ཤི་ བཀྲ་ཤིས་_ tr_ བདེ་་ལེ_གས །_ བཀྲ་ཤིས་ བདེ་ལེགས་ ༡༢༣ ཀཀ །_ མཐ འི་ རྒྱ་མཚོ ར་ གནས་པ འི་ ཉ ས་ ཆུ་ འཐུང་ །།_།། མཁའ །'
>>> t.tokenize_words_raw_lines
'ལེ_གས །_ བཀྲ་ཤིས་ མཐ འི་ _༆_ ཤི་ བཀྲ་ཤིས་__ tr_\n བདེ་་ལེ_གས །_ བཀྲ་ཤིས་ བདེ་ལེགས་ ༡༢༣ ཀཀ །_\n མཐ འི་ རྒྱ་མཚོ ར་ གནས་པ འི་ ཉ ས་ ཆུ་ འཐུང་ །།_།། མཁའ །'

>>> # OUTPUT4: segments in words, then calculates the number of occurrences of each word found
... # by default, it counts in_str's substrings in the output, which is why we have བདེ་་ལེ གས	1, བདེ་ལེགས་	1
... # this behaviour can easily be modified to take into account the words that pybo recognized instead (see advanced usage)
>>> print(t.list_word_types)
འི་	3
།	2
བཀྲ་ཤིས་	2
མཐ	2
ལེགས	1
༆	1
ཤི་	1
བཀྲ་ཤིས་	1
tr \n	1
བདེ་་ལེགས	1
བདེ་ལེགས་	1
༡༢༣	1
ཀཀ	1
། \n	1
རྒྱ་མཚོ	1
ར་	1
གནས་པ	1
ཉ	1
ས་	1
ཆུ་	1
འཐུང་	1
།།།།	1
མཁའ	1
།	1
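Because the output file is written with a BOM, reading it back with a plain UTF-8 decoder leaves a leading '\ufeff' in the string, as the session above shows. One way to drop it is Python's standard utf-8-sig codec. The sketch below assumes the file is UTF-8 encoded and simulates pybo's output with a temporary file rather than calling pybo; `read_without_bom` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def read_without_bom(path: Path) -> str:
    """Read a UTF-8 text file, discarding a leading BOM if one is present."""
    return path.read_text(encoding="utf-8-sig")


with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for a file written by pybo: a BOM followed by tokenized text.
    demo_file = Path(tmp_dir) / "test_pybo.txt"
    demo_file.write_text("\ufeffབཀྲ་ ཤིས་ བདེ་ ལེགས །།", encoding="utf-8")

    raw = demo_file.read_text(encoding="utf-8")  # BOM kept as '\ufeff'
    clean = read_without_bom(demo_file)          # BOM stripped

print(raw.startswith("\ufeff"))    # True
print(clean.startswith("\ufeff"))  # False
```

The utf-8-sig codec also reads files that have no BOM, so the helper is safe to use on either kind of file.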

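Acknowledgements

pybo is an open-source library for Tibetan NLP.

We are always open to cooperation on introducing new features, tool integrations and testing solutions.

Many thanks to the companies and organizations that have supported pybo's development, especially:

Khyentse Foundation, for contributing USD 22,000 to kickstart the project
{a7}, for sponsoring training data curation
BDRC, for contributing 2 staff for 6 months for data curation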

Maintenance

Build the source distribution:

rm -rf dist/
python3 setup.py clean sdist

and upload it with twine (version >= 1.11.0):

twine upload dist/*

License

Contributors:


pybo 0.6.8
