SimpleText Python package - PyPI

A package for managing text data in a simple way.

SimpleText project description

Install with:

pip install SimpleText

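1) Preprocess function

This function takes a string as input and outputs a list of tokens. Several parameters are available to help preprocess a string quickly.

Parameters:

text (string): a string of text

n_grams (tuple, default=(1, 1)): specifies the range of n-grams, e.g. (1, 2) gives unigrams and bigrams, (2, 2) gives bigrams only

remove_accents (boolean, default=False): removes accents

lower (boolean, default=False): lowercases text

remove_less_than (int, default=0): removes words shorter than X letters

remove_more_than (int, default=20): removes words longer than X letters

remove_punct (boolean, default=False): removes punctuation

remove_alpha (boolean, default=False): removes non-alphabetic tokens

remove_stopwords (boolean, default=False): removes stop words

remove_custom_stopwords (list, default=[]): removes custom stop words

lemma (boolean, default=False): lemmatises tokens (via the WordNet Lemmatizer algorithm)

stem (boolean, default=False): stems tokens (via the Porter Stemmer algorithm)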
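To make the n_grams range concrete, here is a minimal, self-contained sketch of how an inclusive (min, max) range might expand into n-grams. It is not SimpleText's implementation: the helper name ngrams_in_range is hypothetical, and returning unigrams as plain strings and longer n-grams as tuples is an assumption modelled on the preprocess example output in this description.

```python
def ngrams_in_range(tokens, n_range=(1, 1)):
    """Return the n-grams of `tokens` for every n in the inclusive range.

    Assumption: unigrams are returned as plain strings, longer n-grams as tuples.
    """
    low, high = n_range
    result = []
    for n in range(low, high + 1):
        # Slide a window of size n over the token list.
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            result.append(window[0] if n == 1 else tuple(window))
    return result


tokens = ["the", "weather", "this", "year"]
unigrams_and_bigrams = ngrams_in_range(tokens, (1, 2))
bigrams_only = ngrams_in_range(tokens, (2, 2))
```

With (1, 2) the result holds the four unigrams followed by the three bigrams; with (2, 2) only the bigrams remain.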

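In the example below we preprocess the string by:

lowercasing letters
removing punctuation
removing stop words
removing words of more than 15 letters and fewer than 1 letter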

[The code for this example is missing from the source.]

The output will be:

['last', 'went', 'shops', 'week']
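The four steps listed for this example can be sketched in plain Python as follows. This is an illustration only, not the code SimpleText runs: the function name simple_preprocess, the tiny STOPWORDS set, and the input sentence are all hypothetical, and the sketch makes no attempt to reproduce the exact output shown above.

```python
import string

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOPWORDS = {"i", "to", "the", "a", "and"}


def simple_preprocess(text, min_len=1, max_len=15):
    # 1. Lowercase the text.
    text = text.lower()
    # 2. Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Split on whitespace, then drop stop words and
    # 4. words outside the inclusive length bounds.
    return [
        token
        for token in text.split()
        if token not in STOPWORDS and min_len <= len(token) <= max_len
    ]


result = simple_preprocess("Last week, I went to the shops.")
```

Here result keeps the tokens in their original order; a real stop-word list (for example the one shipped with NLTK) would be far larger than the one assumed here.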

In the second example we preprocess the string by:

generating unigrams and bigrams
stemming
removing URLs
removing accents
lowercasing letters

from SimpleText.preprocessor import preprocess

text = "I'm loving the weather this year in españa! https://en.tutiempo.net/spain.html"

preprocess(text, n_grams=(1, 2), remove_accents=True, lower=True, remove_less_than=0,
           remove_more_than=20, remove_punct=False, remove_alpha=False, remove_stopwords=False,
           remove_custom_stopwords=[], lemma=False, stem=True, remove_url=True)

This outputs:

["i'm", 'love', 'the', 'weather', 'thi', 'year', 'in', 'espana!', ("i'm", 'loving'), ('loving', 'the'), ('the', 'weather'),
 ('weather', 'this'), ('this', 'year'), ('year', 'in'), ('in', 'espana!')]
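Two of the steps in this example, accent removal and URL removal, can be approximated with the standard library alone. The sketch below is an assumption about how such steps are commonly written, not SimpleText's own code; the function names carry a _sketch suffix to mark them as hypothetical, and the URL pattern only covers http and https links.

```python
import re
import unicodedata


def strip_accents_sketch(text):
    # Decompose accented characters (e.g. "ñ" becomes "n" plus a combining tilde),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_url_sketch(text):
    # Remove http(s) URLs and trim the whitespace they leave behind.
    return re.sub(r"https?://\S+", "", text).strip()


cleaned = strip_url_sketch(
    strip_accents_sketch("weather in españa! https://en.tutiempo.net/spain.html")
)
```

Because punctuation is left in place here, the exclamation mark stays attached to the last word, which matches the 'espana!' token in the output above.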

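2) Preprocess text individually

Alternatively, the preprocessing steps can be applied individually, without using the whole preprocess function. The available functions include: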

from SimpleText.preprocessor import (lowercase, strip_accents, strip_punctuation, strip_url,
    tokenise, strip_alpha_numeric_characters, strip_stopwords, strip_min_max_tokens,
    lemantization, stemming, get_ngrams)

lowercase("Hi again") # outputs "hi again"

strip_accents("Hi ágain") # outputs "Hi again"

strip_punctuation("Hi again!") # outputs "Hi again"

strip_url("Hi again https://example.example.com/example/example") # outputs "Hi again"

tokenise("Hi again") # outputs ["Hi", "again"]

strip_alpha_numeric_characters(["Hi", "again", "@", "#", "*"]) # outputs ["Hi", "again"]

strip_stopwords(["Hi", "again"], ["Hi"]) # outputs ["again"]

strip_min_max_tokens(["consult", "consulting", "a"], 2, 8) # outputs ['consult']

lemantization(["bats", "feet"]) # outputs ["bat", "foot"]

stemming(["consult", "consultant", "consulting"]) # outputs ["consult", "consult", "consult"]

get_ngrams("hi all I'm", (1,3)) # outputs [('hi', 'all'), ('all', "I'm"), ('hi', 'all', "I'm")]
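The token-list filters above are simple to reason about once written out. The following sketch shows one plausible reading of two of them, checked against the documented outputs: symbol-only tokens are dropped, and length bounds are treated as inclusive. Both helper names are hypothetical, and the exact rules SimpleText applies (alphanumeric versus alphabetic, inclusive versus exclusive bounds) are assumptions.

```python
def keep_alphanumeric_tokens(tokens):
    # Assumption: a token is kept only if every character is a letter or digit.
    return [token for token in tokens if token.isalnum()]


def filter_by_length(tokens, min_len, max_len):
    # Assumption: both bounds are inclusive.
    return [token for token in tokens if min_len <= len(token) <= max_len]


without_symbols = keep_alphanumeric_tokens(["Hi", "again", "@", "#", "*"])
within_bounds = filter_by_length(["consult", "consulting", "a"], 2, 8)
```

Under these assumptions the two calls return ["Hi", "again"] and ["consult"], in line with the outputs listed for strip_alpha_numeric_characters and strip_min_max_tokens.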

SimpleText 1.0.3
