Python wordsegmentation package - PyPI

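English word segmentation.

Detailed description of the wordsegmentation Python project

WordSegmentation is an Apache2-licensed module for English word segmentation, written in pure Python and based on a trillion-word corpus.

Inspired by Grant Jenks' https://pypi.python.org/pypi/wordsegment. Based on the word-weighting algorithm from the chapter "Natural Language Corpus Data" by Peter Norvig, in the book Beautiful Data (Segaran and Hammerbacher, 2009).

As described there, the data files are derived from the Google Web Trillion Word Corpus by Thorsten Brants and Alex Franz, distributed by the Linguistic Data Consortium.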

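Features

Pure Python
Uses a divide-and-conquer segmentation algorithm, so there is no maximum length limit on the input text.
The segmentation algorithm uses dynamic programming to achieve polynomial time complexity.
Uses the Google trillion-word corpus to score segmentations.
Developed on Python 2.7
Tested on CPython 2.6, 2.7, 3.4.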

Quick start

Installing wordsegmentation with pip is simple:

$ pip install wordsegmentation

The networkx dependency is required:

$ pip install networkx

Tutorial

In your own Python programs, you will mostly want to use segment to split a phrase into a list of its parts:

>>> from wordsegmentation import WordSegment
>>> ws = WordSegment()

>>> ws.segment('universityofwashington')
['university', 'of', 'washington']
>>> ws.segment('thisisatest')
['this', 'is', 'a', 'test']
>>> ws.segment('thisisanURLcontaining123345and-&**^&butitstillworks')
['this', 'is', 'an', 'url', 'containing', '123345', 'and', '-&**^&', 'but', 'it', 'still', 'works']
>>> ws.segment('NoMatterHowLongThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextMightBe')
['no', 'matter', 'how', 'long', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'might', 'be']
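
Since segment returns a plain list of words, a small wrapper can turn an unspaced string such as a hashtag into a readable phrase. The following is a minimal sketch; the helper name to_phrase is hypothetical and not part of the package, and it accepts any callable that maps a string to a list of words.

```python
def to_phrase(text, segment):
    """Segment an unspaced string and join the pieces with single spaces.

    `segment` is any callable mapping a string to a list of words,
    for example the bound method WordSegment().segment.
    `to_phrase` itself is a hypothetical helper, not part of the package.
    """
    # Drop a leading '#' so hashtags can be passed in directly.
    return " ".join(segment(text.lstrip("#")))


# Usage with the package (assumes wordsegmentation is installed):
# from wordsegmentation import WordSegment
# to_phrase('#universityofwashington', WordSegment().segment)
```

Passing the segmenter as an argument keeps the helper independent of how the WordSegment object is created.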

Bug reports

Weihan@Github

weihan.github at gmail.com

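Technical details

In the code, the segmentation algorithm consists of the following steps:

Divide and conquer: safely split the input string into substrings. This sidesteps the input length problem, where long strings would otherwise dramatically lower performance. For example, "facebook123helloworld" is treated as 3 subproblems: "facebook", "123", and "helloworld".
For each substring, I use dynamic programming to compute the optimal words.
Merge the subproblems and return the result for the original string.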
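
The divide step can be illustrated with a short sketch. It assumes that a "safe" split point is any boundary between a run of letters, a run of digits, and a run of other characters; the module's actual splitting rules may differ, and the name split_chunks is hypothetical.

```python
import re

# Maximal runs of ASCII letters, digits, or anything else.
# This boundary rule is an assumption made for illustration.
_CHUNK = re.compile(r"[a-z]+|[0-9]+|[^a-z0-9]+")


def split_chunks(text):
    """Split lowercased text into runs of letters, digits, and other characters."""
    return _CHUNK.findall(text.lower())


print(split_chunks("facebook123helloworld"))
# ['facebook', '123', 'helloworld']
```

Only the letter runs need word segmentation; the digit and symbol runs can be passed through unchanged when the subproblems are merged.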

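The segmentation algorithm used in this module achieves O(n^2) time complexity. Compared with existing segmentation algorithms, this module is better in the following respects:

It can handle very long inputs. No arbitrary maximum length is imposed on the input string.
Segmentation completes in polynomial time through dynamic programming.
By default, the algorithm uses a filtered Google corpus that contains only English words found in a dictionary.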
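
The dynamic programming idea can be sketched as follows. This is not the module's code: the counts are invented toy values standing in for the corpus data, the length penalty for unknown words follows the approach in Norvig's chapter, and the names COUNTS, log_prob, and segment are chosen for illustration.

```python
import math

# Toy unigram counts, invented for illustration; the real module loads
# counts derived from the Google Web Trillion Word Corpus.
COUNTS = {"this": 500, "is": 800, "a": 900, "test": 120, "at": 300, "his": 200}
TOTAL = sum(COUNTS.values())


def log_prob(word):
    """Log10 probability of a word; unseen words are penalized by length."""
    if word in COUNTS:
        return math.log10(COUNTS[word] / TOTAL)
    # log10 of 10 / (TOTAL * 10**len(word)), computed in log space.
    return 1.0 - math.log10(TOTAL) - len(word)


def segment(text):
    """Return the highest-scoring split of text into words."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(end):    # O(n^2) candidate (start, end) pairs
            score = best[start] + log_prob(text[start:end])
            if score > best[end]:
                best[end], back[end] = score, start
    # Walk the back-pointers from the end to recover the words.
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


print(segment("thisisatest"))
# ['this', 'is', 'a', 'test']
```

Each prefix is scored once from the best scores of shorter prefixes, so the number of candidate words examined grows quadratically with the input length rather than exponentially.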

An extreme example is solving the classic English scriptio continua segmentation problem, as shown below:

>>> ws.segment('[...]esseditistheblightmanwasbornforitismargaretyoumournfor')

Our algorithm solves this problem in polynomial time, and the result is:

['margaret', 'are', 'you', 'grieving', 'over', 'golden', 'grove', 'un', 'leaving', 'leaves', 'like', 'the', 'things', 'of', 'man', 'you', 'with', 'your', 'fresh', 'thoughts', 'care', 'for', 'can', 'you', 'a', 'has', 'the', 'he', 'art', 'grows', 'older', 'it', 'will', 'come', 'to', 'such', 'sights', 'colder', 'by', 'and', 'by', 'nor', 'spa', 're', 'a', 'sigh', 'though', 'worlds', 'of', 'wan', 'wood', 'leaf', 'me', 'allie', 'and', 'yet', 'you', 'will', 'weep', 'and', 'know', 'why', 'now', 'no', 'matter', 'child', 'the', 'name', 'sorrows', 'springs', 'are', 'the', 'same', 'nor', 'mouth', 'had', 'non', 'or', 'mind', 'expressed', 'what', 'he', 'art', 'heard', 'of', 'ghost', 'guessed', 'it', 'is', 'the', 'blight', 'man', 'was', 'born', 'for', 'it', 'is', 'margaret', 'you', 'mourn', 'for']


wordsegmentation 0.3.5
