JpTokenPreprocessing: Python package on PyPI

JpTokenPreprocessing is a Python library for token preprocessing.

JpTokenPreprocessing project description


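JpTokenPreprocessing - Japanese token preprocessing

JpTokenPreprocessing is a Python library for token preprocessing. It supports filtering noise (for example, tokens that are too short, or tokens made up only of digits or symbols) and normalization (letter case and Unicode normalization). These are common preprocessing steps in natural language processing (NLP).

Usage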
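To make the normalization step concrete, here is a minimal sketch that uses only the standard library. It is not the package's own code; the helper name `normalize_token` is hypothetical, and applying Unicode normalization before lowercasing is an assumption.

```python
import unicodedata


def normalize_token(token, form="NFKC"):
    """Hypothetical helper: Unicode-normalize a token, then lowercase it."""
    # NFKC folds full-width Latin letters into their ASCII counterparts.
    return unicodedata.normalize(form, token).lower()


print(normalize_token("ＰＹＴＨＯＮ"))  # python
print(normalize_token("Word"))  # word
```

Compatibility forms (NFKC, NFKD) are the ones that convert full-width characters; NFC and NFD leave them unchanged.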

# coding: utf-8
# Python 3
from jp_token_preprocessing import JpTokenPreprocessing
import MeCab


# Return Japanese word tokens using the morphological analyzer MeCab,
# keeping only nouns.
def tokenize(text):
    tagger = MeCab.Tagger()
    node = tagger.parseToNode(text)
    while node:
        if '名詞' in node.feature:
            surface = node.surface
            yield surface
        node = node.next


if __name__ == '__main__':
    text = """
    これは自然言語処理に必須な前処理のためのモジュールです。
    形態素解析や、n-gramでトークン化した後のフィルタリング、正規化を補助します。
    一語だけのトークンや'1234'のような数字だけのトークン、'!!'のような記号だけのトークンのフィルタリング、
    全角文字'ＰＹＴＨＯＮ'の半角化、英単語'Word'の小文字化といった正規化も行えます。
    さらに必ず除外したいトークンをストップワードに設定することもできます。
    """
    stopwords = ['これ', 'こと']
    tokens = tokenize(text)
    """
    >>> print(list(tokens))

    ['', '', '言語', '処理', '必須', '前', '処理', 'ため', 'モジュール', '形態素',
    '解析', 'n', '-', 'gram', 'トー', 'クン', '化', '後', 'フィルタ', 'リング', '正規',
    '化', '補助', '一語', 'トーク', 'ン', "'", '1234', "'", 'よう', '数字','トー',
    'クン', "'!!'", 'よう', '記号', 'トー', 'クン', 'フィルタ', 'リング', '全角',
    '文字', "'", 'ＰＹＴＨＯＮ', "'", '半角', '化', '英単語', "'", 'Word',"'", '小文字',
    '化', '正規', '化', '除外', 'トーク', 'ン', 'ストップ', 'ワード', '設定', 'こと']
    """
    tokens = tokenize(text)
    preprocessor = JpTokenPreprocessing(number=False, symbol=False, case='lower',
                                        unicode='NFKC', min_len=2, stopwords=stopwords)
    tokens = preprocessor.preprocessing(tokens)
    # Returns an iterator of tokens; list() is used below only to print the sample.
    """
    >>> print(list(tokens))
    ['言語', '処理', '必須', '処理', 'ため', 'モジュール', '形態素', '解析', 'gram',
    'トー', 'クン', 'フィルタ', 'リング', '正規', '補助', '一語', 'トーク', 'よう',
    '数字', 'トー', 'クン', 'よう', '記号', 'トー', 'クン', 'フィルタ', 'リング',
    '全角', '文字', 'python', '半角', '英単語', 'word', '小文字', '正規', '除外',
    'トーク', 'ストップ', 'ワード', '設定']
    """

Installation

pip install JpTokenPreprocessing

MeCab for Python
To install and use the MeCab module with Python 3, apply the patch below. (As of 2014/09/07, MeCab 0.996)
https://code.google.com/p/mecab/issues/detail?id=7

Methods

JpTokenPreprocessing(args)

number=bool (default: False)
Allow tokens that consist only of digits. When False, such tokens are filtered out.
symbol=bool (default: False)
Allow tokens that consist only of symbols. When False, such tokens are filtered out.
case='lower' or 'upper' or 'capital'
Normalize alphabet case.
unicode='NFC' or 'NFKC' or 'NFD' or 'NFKD' (default: 'NFKC')
Normalize Unicode strings with unicodedata.normalize().
min_len=int (default: 2)
Filter out tokens with too few characters. If min_len = 2, tokens that have only 1 or 0 characters are filtered out.
stopwords=list (default: [])
Filter out any token contained in the stopword list.
JpTokenPreprocessing.preprocessing(iterable)
Return an iterator over the preprocessed tokens.
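The options above can be pictured with a small standard-library sketch. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, "number-only" is approximated with str.isdigit(), "symbol-only" with Unicode categories P* and S*, the order of the steps is a guess, and the 'capital' case option is left out.

```python
import unicodedata


def is_symbol_only(token):
    """Assumption: a symbol token has only punctuation (P*) or symbol (S*) characters."""
    return bool(token) and all(
        unicodedata.category(ch)[0] in ("P", "S") for ch in token
    )


def sketch_preprocessing(tokens, number=False, symbol=False, case="lower",
                         form="NFKC", min_len=2, stopwords=()):
    """Hypothetical stand-in that mirrors the documented options."""
    stop = set(stopwords)
    for token in tokens:
        # Normalize first so that the filters see the canonical form.
        token = unicodedata.normalize(form, token)
        if case == "lower":
            token = token.lower()
        elif case == "upper":
            token = token.upper()
        if len(token) < min_len:
            continue  # too short
        if not number and token.isdigit():
            continue  # digits only
        if not symbol and is_symbol_only(token):
            continue  # symbols only
        if token in stop:
            continue  # stopword
        yield token


sample = ["", "n", "-", "gram", "1234", "'!!'", "ＰＹＴＨＯＮ", "Word", "こと", "言語"]
result = list(sketch_preprocessing(sample, stopwords=["これ", "こと"]))
print(result)  # ['gram', 'python', 'word', '言語']
```

Writing the function as a generator keeps it lazy, which matches the documented behavior of returning an iterator rather than a list.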

Future work

Add hook points so that users can extend the library with their own preprocessing.

Author

Kenta Kase kesin1202000@gmail.com

License

MIT License
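Until such hooks exist, one possible workaround is to wrap the iterator returned by preprocessing() in your own generator. The sketch below is only an illustration: `drop_hiragana_only` is a hypothetical helper, and a plain iterator stands in for the library's output so that the snippet runs without MeCab installed.

```python
import re

# Matches tokens written entirely in hiragana (Unicode block U+3041 to U+309F).
HIRAGANA_ONLY = re.compile(r"[\u3041-\u309f]+")


def drop_hiragana_only(tokens):
    """Hypothetical extra step: skip tokens made up solely of hiragana."""
    for token in tokens:
        if not HIRAGANA_ONLY.fullmatch(token):
            yield token


# Stand-in for the iterator that preprocessing() would return.
preprocessed = iter(["言語", "ため", "モジュール", "よう", "python"])
filtered = list(drop_hiragana_only(preprocessed))
print(filtered)  # ['言語', 'モジュール', 'python']
```

Because each step consumes and yields an iterator, several custom steps can be chained in sequence without building intermediate lists.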


JpTokenPreprocessing 0.1.5a2
