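Python wordsegment package - PyPI

English word segmentation.

wordsegment Python project description

WordSegment is an Apache2 licensed module for English word segmentation, written in pure Python, and based on a trillion-word corpus.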

Based on code from the chapter "Natural Language Corpus Data" by Peter Norvig from the book "Beautiful Data" (Segaran and Hammerbacher, 2009).

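Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, the bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.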

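Features

Pure Python
Fully documented
100% test coverage
Includes unigram and bigram data
Command line interface for batch processing
Easy to hack (e.g. different scoring, new data, different language)
Developed on Python 2.7
Tested on CPython 2.6, 2.7, 3.2, 3.3, 3.4, 3.5, 3.6 and PyPy, PyPy3
Tested on Windows, Mac OS X, and Linux
Tested using Travis CI and AppVeyor CI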

Quickstart

Installing WordSegment is simple with pip:

$ pip install wordsegment

You can access documentation in the interpreter with Python's built-in help function:

>>> import wordsegment
>>> help(wordsegment)

Tutorial

In your own Python programs, you will mostly want to use segment to divide a phrase into a list of its parts:

>>> from wordsegment import load, segment
>>> load()
>>> segment('thisisatest')
['this', 'is', 'a', 'test']

The load function reads the unigram and bigram data from disk. The data only needs to be loaded once.
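To make the idea behind segment more concrete, here is a minimal sketch of Norvig-style unigram segmentation with memoization. It is an illustration under stated assumptions, not the package's implementation: TOY_COUNTS is a made-up table, the names score, best_split, and toy_segment are hypothetical, and the unknown-word penalty is one common choice. The real module also draws on bigram data.

```python
from functools import lru_cache
from math import log10

# Hypothetical toy counts for illustration only; the real unigram table
# is read from disk by load() and is far larger.
TOY_COUNTS = {'this': 50.0, 'is': 80.0, 'a': 100.0, 'test': 20.0,
              'at': 30.0, 'his': 25.0}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = 24  # the documented maximum segmented word length


def score(word):
    """Log10 probability of a word; unknown words are penalized by length."""
    if word in TOY_COUNTS:
        return log10(TOY_COUNTS[word] / TOTAL)
    return log10(10.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def best_split(text):
    """Return (log score, words) for the highest-scoring split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for size in range(1, min(len(text), MAX_WORD_LEN) + 1):
        head, tail = text[:size], text[size:]
        tail_score, tail_words = best_split(tail)
        candidates.append((score(head) + tail_score, (head,) + tail_words))
    return max(candidates)


def toy_segment(text):
    """Segment text into a list of words using the toy counts."""
    return list(best_split(text)[1])


print(toy_segment('thisisatest'))
```

Memoizing best_split keeps the search tractable, because every suffix of the input is scored only once.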

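WordSegment also provides a command-line interface for batch processing. This interface accepts two arguments: in-file and out-file. Lines from in-file are iteratively segmented, joined by a space, and written to out-file. Input and output default to stdin and stdout respectively.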

$ echo thisisatest | python -m wordsegment
this is a test
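The batch behavior described above can be sketched in a few lines of Python. This is an approximation of the described line-by-line loop, not the package's command-line code: segment_stream and fake_segment are hypothetical names, and fake_segment is a stand-in so the sketch runs without the package installed. In practice you would pass wordsegment.segment after calling load().

```python
import io


def segment_stream(reader, writer, segment):
    """Segment each input line and write its words joined by spaces."""
    for line in reader:
        words = segment(line.strip())
        writer.write(' '.join(words) + '\n')


def fake_segment(text):
    """Hypothetical stand-in for wordsegment.segment (lookup table only)."""
    table = {'thisisatest': ['this', 'is', 'a', 'test']}
    return table.get(text, [text])


source = io.StringIO('thisisatest\nunknown\n')
sink = io.StringIO()
segment_stream(source, sink, fake_segment)
print(sink.getvalue(), end='')
```

Passing the segmenter in as an argument keeps the loop independent of where the data comes from, so the same function works with files, pipes, or in-memory buffers.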

If you want to run WordSegment as a kind of server process, use Python's -u option for unbuffered output. You can also set PYTHONUNBUFFERED=1 in the environment.

>>> import subprocess as sp
>>> wordsegment = sp.Popen(
        ['python', '-um', 'wordsegment'],
        stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.STDOUT)
>>> wordsegment.stdin.write('thisisatest\n')
>>> wordsegment.stdout.readline()
'this is a test\n'
>>> wordsegment.stdin.write('workswithotherlanguages\n')
>>> wordsegment.stdout.readline()
'works with other languages\n'
>>> wordsegment.stdin.close()
>>> wordsegment.wait()  # Process exit code.
0

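The maximum segmented word length is 24 characters. Neither the unigram nor the bigram data contains words exceeding that length. The corpus also excludes punctuation, and all letters have been lowercased. Before segmenting text, clean is called to transform the input to a canonical form: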

>>> from wordsegment import clean
>>> clean('She said, "Python rocks!"')
'shesaidpythonrocks'
>>> segment('She said, "Python rocks!"')
['she', 'said', 'python', 'rocks']
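The canonical form can be approximated as follows. This sketch assumes that the canonical form keeps only lowercase ASCII letters and digits; clean_sketch and ALLOWED are hypothetical names, and the package's own clean may differ in detail.

```python
import string

# Assumption: only lowercase ASCII letters and digits survive cleaning.
ALLOWED = set(string.ascii_lowercase + string.digits)


def clean_sketch(text):
    """Lowercase the text and drop every character outside ALLOWED."""
    return ''.join(ch for ch in text.lower() if ch in ALLOWED)


print(clean_sketch('She said, "Python rocks!"'))
```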

Sometimes it is interesting to explore the unigram and bigram counts themselves. These are stored in Python dictionaries mapping word to count.

>>> import wordsegment as ws
>>> ws.load()
>>> ws.UNIGRAMS['the']
23135851162.0
>>> ws.UNIGRAMS['gray']
21424658.0
>>> ws.UNIGRAMS['grey']
18276942.0

Above we see that the spelling gray is more common than the spelling grey.
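The comparison can also be expressed as a relative share. The sketch below copies the two counts shown above into a small dictionary so that it runs without the package; the share helper is a hypothetical name introduced for illustration.

```python
# Counts copied from the interpreter session above.
counts = {'gray': 21424658.0, 'grey': 18276942.0}


def share(word, table):
    """Fraction of the table's total count that belongs to word."""
    return table[word] / sum(table.values())


preferred = max(counts, key=counts.get)
print(preferred, round(share(preferred, counts), 3))
```

With the loaded data, the same helper could be applied to any group of spelling variants by building the dictionary from ws.UNIGRAMS.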

Bigrams are joined by a space:

>>> import heapq
>>> from pprint import pprint
>>> from operator import itemgetter
>>> pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
[('of the', 2766332391.0),
 ('in the', 1628795324.0),
 ('to the', 1139248999.0),
 ('on the', 800328815.0),
 ('for the', 692874802.0),
 ('and the', 629726893.0),
 ('to be', 505148997.0),
 ('is a', 476718990.0),
 ('with the', 461331348.0),
 ('from the', 428303219.0)]

Some bigrams begin with <s>. This indicates the start of a bigram:

>>> ws.BIGRAMS['<s> where']
15419048.0
>>> ws.BIGRAMS['<s> what']
11779290.0
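Since bigram keys are two words joined by a space, with <s> standing in for a missing previous word, a small lookup helper can hide the key format. This is a sketch: bigram_count and SAMPLE_BIGRAMS are hypothetical names, and the sample values are copied from the sessions above so the code runs without the package.

```python
# Counts copied from the examples above; the full table lives in ws.BIGRAMS.
SAMPLE_BIGRAMS = {
    'of the': 2766332391.0,
    '<s> where': 15419048.0,
    '<s> what': 11779290.0,
}


def bigram_count(word, previous='<s>', bigrams=SAMPLE_BIGRAMS):
    """Return the count for 'previous word', or 0.0 when it is absent."""
    return bigrams.get('{0} {1}'.format(previous, word), 0.0)


print(bigram_count('where'))       # no previous word, so '<s>' is used
print(bigram_count('the', 'of'))   # ordinary two-word bigram
```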

The unigram and bigram data are stored in the unigrams.txt and bigrams.txt files respectively.
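If you want to supply your own data, a parser along these lines may be a useful starting point. It assumes each line holds a key and a count separated by a tab, which is an assumption about the file layout rather than something stated here; parse_counts is a hypothetical name, and the sample text reuses counts shown earlier.

```python
import io


def parse_counts(reader):
    """Parse 'key<TAB>count' lines into a dict mapping key to float count."""
    counts = {}
    for line in reader:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        key, number = line.split('\t')
        counts[key] = float(number)
    return counts


# In-memory stand-in for a data file; assumed tab-separated layout.
sample = io.StringIO('the\t23135851162\nof the\t2766332391\n')
table = parse_counts(sample)
print(table['of the'])
```

Splitting on the tab rather than on whitespace keeps bigram keys such as "of the" intact.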

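User Guide

References

WordSegment License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at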

http://www.apache.org/licenses/LICENSE-2.0

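Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.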

wordsegment 1.3.1