plane: a Python package on PyPI

A text preprocessing library

Detailed description of the plane Python project

plane

Plane is a tool for shaping wood using muscle power to force the cutting blade over the wood surface.
from Wikipedia


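This package extracts or replaces specific parts of text, such as URLs, emails, HTML tags, and telephone numbers. It also supports punctuation normalization and removal.

See the full documentation.

Installation

Python 3.x only.

pip

    pip install plane

Install from source

    python setup.py install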

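Features

No other dependencies
Built-in regex patterns: plane.pattern.Regex
Custom regex patterns
Pattern combination
Extract and replace patterns
Segment sentences
Chained function calls: plane.plane.Plane
Pipeline: plane.Pipeline

Usage

Quick start

Use a regex to extract or replace: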

    from plane import EMAIL, extract, replace

    text = 'fake@no.com & fakefake@nothing.com'

    emails = extract(text, EMAIL)  # this returns a generator object
    for e in emails:
        print(e)

    >>> Token(name='Email', value='fake@no.com', start=0, end=11)
    >>> Token(name='Email', value='fakefake@nothing.com', start=14, end=34)

    print(EMAIL)

    >>> Regex(name='Email', pattern='([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-]+)', repl='<Email>')

    replace(text, EMAIL)  # replace(text, Regex, repl); if repl is not provided, Regex.repl is used

    >>> '<Email> & <Email>'

    replace(text, EMAIL, '')

    >>> ' & '
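To make the behavior above concrete, here is a minimal sketch of how extract and replace could be approximated with the standard re module. It is not plane's implementation: the Token and Regex tuples and both functions are re-created here for illustration only, reusing the EMAIL pattern printed above.

```python
import re
from collections import namedtuple

# Illustrative stand-ins for plane's types (assumption: the field names
# follow the printed output shown in the quick start).
Token = namedtuple('Token', ['name', 'value', 'start', 'end'])
Regex = namedtuple('Regex', ['name', 'pattern', 'repl'])

EMAIL = Regex('Email', r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+)', '<Email>')


def extract(text, regex):
    """Lazily yield one Token per match, with its character offsets."""
    for match in re.finditer(regex.pattern, text):
        yield Token(regex.name, match.group(), match.start(), match.end())


def replace(text, regex, repl=None):
    """Substitute every match; fall back to regex.repl when repl is None."""
    return re.sub(regex.pattern, regex.repl if repl is None else repl, text)


text = 'fake@no.com & fakefake@nothing.com'
tokens = list(extract(text, EMAIL))
replaced = replace(text, EMAIL)
emptied = replace(text, EMAIL, '')
```

Because extract is written as a generator here, the matches are only computed when the result is iterated, which mirrors the generator object mentioned in the quick start.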

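Pattern

Regex is a namedtuple with 3 fields:

name
pattern: the regular expression
repl: the replacement tag, which replaces every match of the regular expression when the replace function is used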

    # create new pattern
    from plane import build_new_regex
    custom_regex = build_new_regex('my_regex', r'(\d{4})', '<my-replacement-tag>')

You can also build new patterns by combining the default patterns.

Note: this should only be used for language ranges.

    from plane import extract, build_new_regex, CHINESE_WORDS

    ASCII = build_new_regex('ascii', r'[a-zA-Z0-9]+', ' ')
    WORDS = ASCII + CHINESE_WORDS
    print(WORDS)

    >>> Regex(name='ascii_Chinese_words', pattern='[a-zA-Z0-9]+|[\\U00004E00-\\U00009FFF\\U00003400-\\U00004DBF\\U00020000-\\U0002A6DF\\U0002A700-\\U0002B73F\\U0002B740-\\U0002B81F\\U0002B820-\\U0002CEAF\\U0002CEB0-\\U0002EBEF]+', repl=' ')

    text = "自然语言处理太难了！who can help me? (╯▔?▔)╯"
    print(' '.join([t.value for t in list(extract(text, WORDS))]))

    >>> "自然语言处理太难了 who can help me"

    from plane import CHINESE, ENGLISH, NUMBER

    CN_EN_NUM = sum([CHINESE, ENGLISH, NUMBER])
    text = "佛是虚名，道亦妄立。एवं मया श्रुतम्। 1999 is not the end of the world. "
    print(' '.join([t.value for t in extract(text, CN_EN_NUM)]))

    >>> "佛是虚名，道亦妄立。 1999 is not the end of the world."
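The printed result of ASCII + CHINESE_WORDS suggests that combining two patterns joins their names with '_' and their regular expressions with '|'. The sketch below re-creates that idea on a plain namedtuple subclass. It is an illustration under assumptions, not plane's code: keeping the left-hand repl, treating 0 as the starting value for sum(), and the HAN and NUMBER patterns (HAN covers only the basic CJK Unified Ideographs block) are all choices made here.

```python
import re
from collections import namedtuple


class Regex(namedtuple('Regex', ['name', 'pattern', 'repl'])):
    """Hypothetical re-creation of a combinable pattern tuple."""
    __slots__ = ()

    def __add__(self, other):
        # Join names with '_' and patterns with '|'; keep the left-hand repl.
        return Regex(f'{self.name}_{other.name}',
                     f'{self.pattern}|{other.pattern}',
                     self.repl)

    def __radd__(self, other):
        # sum() starts from the integer 0, so 0 + Regex returns the Regex itself.
        return self if other == 0 else NotImplemented


ASCII = Regex('ascii', r'[a-zA-Z0-9]+', ' ')
HAN = Regex('han', r'[\u4E00-\u9FFF]+', ' ')  # assumption: basic block only
NUMBER = Regex('number', r'[0-9]+', ' ')

WORDS = ASCII + HAN
ALL = sum([ASCII, HAN, NUMBER])

text = '自然语言处理太难了！who can help me?'
words = re.findall(WORDS.pattern, text)
```

Joining with '|' means a match is whatever either sub-pattern accepts, which is why the full-width '！' and the '?' drop out of the extracted words.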

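Default regex patterns: Details

URL: ASCII only
EMAIL: local-part@domain
TELEPHONE: such as xxx-xxxx-xxxx
SPACE: ' ', \t, \n, \r, \f, \v
HTML: HTML tags, script sections, and CSS sections
ASCII_WORD: English words, numbers, <tag>, and so on
CHINESE: all Chinese characters (Han characters and punctuation only)
CJK: all Chinese, Japanese, and Korean (CJK) characters and punctuation
THAI: all Thai characters and punctuation
VIETNAMESE: all Vietnamese characters and punctuation
ENGLISH: all English characters and punctuation
NUMBER: 0-9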

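Each of the following default patterns carries a default replacement (its repl field), which replace uses when no repl argument is given: URL, EMAIL, TELEPHONE, SPACE, HTML, ASCII_WORD, CHINESE, and CJK. For example, EMAIL is replaced with '<Email>'.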

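segment

segment can be used to split a sentence. English words and numbers such as 'ps4' are kept whole, while other characters, such as the Chinese '中文', are split into single characters: ['中', '文'].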

    from plane import segment

    segment('你看起来guaiguai的。<EOS>')

    >>> ['你', '看', '起', '来', 'guaiguai', '的', '。', '<EOS>']
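The splitting rule described above can be approximated with a single regular expression. This is a rough sketch under an assumed token rule (tags in angle brackets and ASCII alphanumeric runs stay whole, every other non-space character becomes its own token); segment_sketch is a hypothetical name and plane's own segment may differ in edge cases.

```python
import re

# Assumed token rule: <tags> and ASCII alphanumeric runs stay whole;
# every other non-space character becomes its own token.
TOKEN_RE = re.compile(r'<[^<>\s]+>|[A-Za-z0-9]+|\S')


def segment_sketch(text):
    """Return the tokens of text in order, dropping whitespace."""
    return TOKEN_RE.findall(text)


tokens = segment_sketch('你看起来guaiguai的。<EOS>')
mixed = segment_sketch('ps4 中文')
```

The order of the alternatives matters: the tag branch is tried first so that '<EOS>' is not broken into '<', 'EOS', and '>'.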

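punctuation

punc.remove replaces all Unicode punctuation with ' ', or with the repl argument passed to the function. punc.normalize normalizes some Unicode punctuation to English punctuation.

Note: '+', '^', '$', '~', and some other characters are not punctuation.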

    from plane import punc

    text = 'Hello world!'
    punc.remove(text)

    >>> 'Hello world '

    # replace punctuation with special string
    punc.remove(text, '<P>')

    >>> 'Hello world<P>'

    # normalize punctuations
    punc.normalize('你读过那本《边城》吗？什么编程？！人生苦短，我用 Python。')

    >>> '你读过那本(边城)吗?什么编程?!人生苦短,we 用 Python.'
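One plausible way to get this behavior with the standard library is to treat a character as punctuation when its Unicode general category starts with 'P'. Under that assumption '+', '^', '$', and '~' are left alone, since Unicode classifies them as symbols, which agrees with the note above. The function names and the tiny normalization table below are hypothetical; plane's real tables are not shown in this document.

```python
import unicodedata

# Hypothetical, deliberately tiny mapping used only for this illustration.
NORMALIZE_MAP = str.maketrans({
    '《': '(', '》': ')', '？': '?', '！': '!', '，': ',', '。': '.',
})


def remove_punc(text, repl=' '):
    """Replace every character whose Unicode category starts with 'P'."""
    return ''.join(repl if unicodedata.category(ch).startswith('P') else ch
                   for ch in text)


def normalize_punc(text):
    """Map the full-width punctuation in NORMALIZE_MAP to ASCII equivalents."""
    return text.translate(NORMALIZE_MAP)


removed = remove_punc('Hello world!')
tagged = remove_punc('Hello world!', '<P>')
symbols = remove_punc('1+1=2 ~ $5^2')
normalized = normalize_punc('你读过那本《边城》吗？什么编程？！人生苦短，我用 Python。')
```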

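Chain function

Plane provides extract, replace, segment, punc.remove, and punc.normalize, and these methods can be called in a chain. Because segment returns a list, it can only be called at the end of the chain.

Plane.text holds the processed text, and Plane.values holds the extraction results.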

    from plane import Plane
    from plane.pattern import EMAIL

    p = Plane()
    p.update('My email is my@email.com.').replace(EMAIL, '').text  # update() will init Plane.text and Plane.values

    >>> 'My email is .'

    p.update('My email is my@email.com.').replace(EMAIL).segment()

    >>> ['My', 'email', 'is', '<Email>', '.']

    p.update('My email is my@email.com.').extract(EMAIL).values

    >>> [Token(name='Email', value='my@email.com', start=12, end=24)]
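Chaining of this kind usually works by having each method return the object itself. The toy class below sketches that pattern with the re module. MiniPlane, its method signatures (raw pattern strings instead of Regex tuples), and the simplified segment rule are assumptions for illustration, not plane's Plane class.

```python
import re


class MiniPlane:
    """Toy chainable text holder; not plane's Plane class."""

    def __init__(self):
        self.text = ''
        self.values = []

    def update(self, text):
        # Reset state so one instance can be reused across inputs.
        self.text, self.values = text, []
        return self

    def replace(self, pattern, repl):
        self.text = re.sub(pattern, repl, self.text)
        return self

    def extract(self, pattern):
        self.values.extend(m.group() for m in re.finditer(pattern, self.text))
        return self

    def segment(self):
        # Returns a list rather than self, so it has to come last in a chain.
        return re.findall(r'<[^<>\s]+>|[A-Za-z0-9]+|\S', self.text)


EMAIL_PATTERN = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+'
p = MiniPlane()
cleaned = p.update('My email is my@email.com.').replace(EMAIL_PATTERN, '').text
tokens = p.update('My email is my@email.com.').replace(EMAIL_PATTERN, '<Email>').segment()
found = p.update('My email is my@email.com.').extract(EMAIL_PATTERN).values
```

Every method except segment returns self, which is what makes the calls composable and also why a list-returning method ends the chain.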

Pipeline

You can use Pipeline if you prefer.

segment and extract can only appear at the end.

    from plane import Pipeline, replace, segment
    from plane.pattern import URL

    pipe = Pipeline()
    pipe.add(replace, URL, '')
    pipe.add(segment)
    pipe('http://www.guokr.com is online.')

    >>> ['is', 'online', '.']
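A pipeline like the one above can be pictured as a list of (function, arguments) steps applied in order. The following is a minimal sketch of that idea, not plane's Pipeline: MiniPipeline, the local replace and segment helpers, and the simplified URL_PATTERN are all hypothetical.

```python
import re


class MiniPipeline:
    """Toy pipeline: stores (function, extra args) steps and applies them in order."""

    def __init__(self):
        self.steps = []

    def add(self, func, *args):
        self.steps.append((func, args))

    def __call__(self, text):
        result = text
        for func, args in self.steps:
            # Each step receives the previous result as its first argument.
            result = func(result, *args)
        return result


def replace(text, pattern, repl):
    return re.sub(pattern, repl, text)


def segment(text):
    return re.findall(r'[A-Za-z0-9]+|\S', text)


URL_PATTERN = r'https?://\S+'  # simplified, hypothetical URL pattern

pipe = MiniPipeline()
pipe.add(replace, URL_PATTERN, '')
pipe.add(segment)
tokens = pipe('http://www.guokr.com is online.')
```

Since each step is fed the previous step's output, a step that returns a list instead of a string, such as segment, only makes sense as the final step.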

License: BSD License (BSD 3-Clause)
Maintainer: keming
plane 0.2.0
