tok Python package - PyPI

A fast and customizable tokenizer

Detailed description of the tok Python project

## tok

[![pypi](https://img.shields.io/pypi/v/tok.svg?style=flat-square)](https://pypi.python.org/pypi/tok/) [![pypi](https://img.shields.io/pypi/pyversions/tok.svg?style=flat-square)](https://pypi.python.org/pypi/tok/)

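Fastest and most complete/customizable tokenizer in Python.

Roughly 25x faster than spacy's and nltk's regex-based tokenizers.

Using the Aho-Corasick algorithm is what makes it novel, and allows it to be both explainable and fast in how it splits.

The heavy lifting is done by [textsearch](https://github.com/kootenpv/textsearch) and [pyahocorasick](https://github.com/WojciechMula/pyahocorasick), which allows tok itself to be written in only about 200 lines of code.

Contrary to regex-based approaches, it goes over each character in a text only once. Read [below](#how-it-works) about how this works.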

### Installation

pip install tok

### Usage

By default it handles contractions, http, (float) numbers and currencies.

```python
from tok import word_tokenize

word_tokenize("I wouldn't do that.... would you?")
# ['I', 'would', 'not', 'do', 'that', '...', 'would', 'you', '?']
```

Or configure it yourself:

```python
from tok import Tokenizer

tokenizer = Tokenizer(protected_words=["some.thing"])  # still using the defaults
tokenizer.word_tokenize("I want to protect some.thing")
# ['I', 'want', 'to', 'protect', 'some.thing']
```

Split by sentence:

```python
from tok import sent_tokenize

sent_tokenize("I wouldn't do that.... would you?")
# [['I', 'would', 'not', 'do', 'that', '...'], ['would', 'you', '?']]
```

For more options, check the documentation of the `Tokenizer`.

### Further customization

Given:

```python
from tok import Tokenizer

t = Tokenizer(protected_words=["some.thing"])  # still using the defaults
```

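You can use:

- `t.keep(x, reason)`: Whenever it finds x, it will not add whitespace. This prevents direct tokenization.
- `t.split(x, reason)`: Whenever it finds x, it will surround it with whitespace, thus creating a token.
- `t.drop(x, reason)`: Whenever it finds x, it will remove it but add a split.
- `t.strip(x, reason)`: Whenever it finds x, it will remove it without splitting.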

```python
t.drop("bla", "bla is not needed")
t.word_tokenize("Please remove bla, thank you")
# ['Please', 'remove', ',', 'thank', 'you']
```
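The four rule kinds above can be read as four different replacement strings for a match. The following is a minimal sketch of that reading, not tok's implementation: `replacement` and `apply_rules` are hypothetical helper names, and the naive `str.replace` loop makes one pass per rule, whereas tok scans the text once.

```python
def replacement(kind, x):
    """Return the text that replaces a match of x (hypothetical helper, not part of tok)."""
    if kind == "keep":
        return x            # unchanged, no whitespace added
    if kind == "split":
        return f" {x} "     # surrounded by whitespace, so it becomes its own token
    if kind == "drop":
        return " "          # removed, but a split remains
    if kind == "strip":
        return ""           # removed without splitting
    raise ValueError(f"unknown rule kind: {kind}")


def apply_rules(text, rules):
    """Apply (kind, pattern) rules naively, longest pattern first, then split on whitespace."""
    for kind, x in sorted(rules, key=lambda rule: -len(rule[1])):
        text = text.replace(x, replacement(kind, x))
    return text.split()


tokens = apply_rules("Please remove bla, thank you", [("drop", "bla"), ("split", ",")])
print(tokens)  # ['Please', 'remove', ',', 'thank', 'you']
```

Under this reading, a dropped pattern leaves a token boundary behind, while a stripped pattern joins its neighbours: `apply_rules("e-mail", [("strip", "-")])` gives `['email']`.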

### Explainable

Explain what happened:

```python
t.explain("bla")
# [{'from': 'bla', 'to': ' ', 'explanation': 'bla is not needed'}]
```

See everything in there (this will help you understand how it works):

```python
t.explain_dict
```
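One way to picture the explanation output is a table that stores, for every pattern, what it is replaced with and why. The sketch below assumes that shape purely from the output shown above; `ExplainableRules` and its `add` method are hypothetical names, and tok's actual internals may differ.

```python
class ExplainableRules:
    """Hypothetical stand-in for illustration only, not tok's Tokenizer."""

    def __init__(self):
        # pattern -> record with the same keys as the explain output shown above
        self.explain_dict = {}

    def add(self, pattern, replacement, reason):
        """Register a rule together with the reason it exists."""
        self.explain_dict[pattern] = {
            "from": pattern,
            "to": replacement,
            "explanation": reason,
        }

    def explain(self, text):
        """Return the records of all rules whose pattern occurs in text."""
        return [entry for pattern, entry in self.explain_dict.items() if pattern in text]


rules = ExplainableRules()
rules.add("bla", " ", "bla is not needed")
print(rules.explain("bla"))
```

Keeping the reason next to each rule is what makes every split traceable back to the rule that caused it.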

### How it works

It will always keep only the longest match. Introducing a space into a match is what causes it to be split into tokens.

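To see how this plays out for the tokenization of `.`, consider:

- When it finds ` A.`, it keeps it as ` A.` (single-letter abbreviations)
- When it finds `.0`, it keeps it as `.0` (numbers)
- When it finds `.`, it turns it into ` . ` (thus creating a split)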

If you want to make sure something that contains a dot stays intact, you can use for example:

t.keep("cool.")
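The longest-match behaviour described in this section can be sketched with a plain trie. This is an illustration under stated assumptions, not tok's code: `build_trie` and `rewrite` are hypothetical names, the rule table is a guess at the dot rules listed above, and a plain trie may re-read a few characters after a failed partial match, which the Aho-Corasick automaton used by pyahocorasick avoids.

```python
import string


def build_trie(rules):
    """Build a nested-dict trie; the key None marks the end of a pattern."""
    root = {}
    for pattern, replacement in rules.items():
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[None] = replacement
    return root


def rewrite(text, trie):
    """Scan left to right, replacing the longest pattern that starts at each position."""
    out = []
    i = 0
    while i < len(text):
        node, j, best = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                best = (j, node[None])  # remember the longest match so far
        if best is None:
            out.append(text[i])         # no rule starts here, copy the character
            i += 1
        else:
            out.append(best[1])
            i = best[0]
    return "".join(out)


# Assumed rule table mirroring the dot rules above.
rules = {".": " . "}                                                      # split on a bare dot
rules.update({"." + d: "." + d for d in string.digits})                   # keep ".0" to ".9"
rules.update({" " + c + ".": " " + c + "." for c in string.ascii_uppercase})  # keep " A."
rules["cool."] = "cool."                                                  # like t.keep("cool.")

trie = build_trie(rules)
print(rewrite("That is cool. It costs 3.50.", trie).split())
# ['That', 'is', 'cool.', 'It', 'costs', '3.50', '.']
```

Because `cool.` and `.5` are longer matches than the bare `.`, they win and stay intact; only the final dot falls through to the split rule.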

### Contributing

Contributions to this library are greatly appreciated.

It would also be great to add [contractions](https://github.com/kootenpv/contractions) for other languages.

tok 0.1.14
