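Python chinese package - PyPI

Chinese text analyzer

Project description for the chinese Python package

chinese is a Chinese text analyzer.

Note: Python 2.* is not supported.

Getting Started

Install chinese using pip: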

$ pip install chinese
$ pynlpir update

Start analyzing Chinese text:

>>> from chinese import ChineseAnalyzer
>>> analyzer = ChineseAnalyzer()
>>> result = analyzer.parse('我很高兴认识你')
>>> result.tokens()
['我', '很', '高兴', '认识', '你']
>>> result.pinyin()
'wǒ hěn gāoxìng rènshi nǐ'
>>> result.pprint()
{'original': '我很高兴认识你',
 'parsed': [{'dict_data': [{'definitions': ['I', 'me', 'my'],
                            'kind': 'Simplified',
                            'match': '我',
                            'pinyin': ['wo3']}],
             'token': ('我', 0, 1)},
            {'dict_data': [{'definitions': ['(adverb of degree)',
                                            'quite',
                                            'very',
                                            'awfully'],
                            'kind': 'Simplified',
                            'match': '很',
                            'pinyin': ['hen3']}],
             'token': ('很', 1, 2)},
            {'dict_data': [{'definitions': ['happy',
                                            'glad',
                                            'willing (to do sth)',
                                            'in a cheerful mood'],
                            'kind': 'Simplified',
                            'match': '高興',
                            'pinyin': ['gao1', 'xing4']}],
             'token': ('高兴', 2, 4)},
            {'dict_data': [{'definitions': ['to know',
                                            'to recognize',
                                            'to be familiar with',
                                            'to get acquainted with sb',
                                            'knowledge',
                                            'understanding',
                                            'awareness',
                                            'cognition'],
                            'kind': 'Simplified',
                            'match': '認識',
                            'pinyin': ['ren4', 'shi5']}],
             'token': ('认识', 4, 6)},
            {'dict_data': [{'definitions': ['you (informal, as opposed to '
                                            'courteous 您[nin2])'],
                            'kind': 'Simplified',
                            'match': '你',
                            'pinyin': ['ni3']}],
             'token': ('你', 6, 7)}]}
>>> result = analyzer.parse('我喜歡這個味道', traditional=True)
>>> print(result)
{'味道': [{'definitions': ['flavor', 'smell', 'hint of'],
         'kind': 'Traditional',
         'match': '味道',
         'pinyin': ['wei4', 'dao5']}],
 '喜歡': [{'definitions': ['to like', 'to be fond of'],
         'kind': 'Traditional',
         'match': '喜欢',
         'pinyin': ['xi3', 'huan5']}],
 '我': [{'definitions': ['I', 'me', 'my'],
        'kind': 'Traditional',
        'match': '我',
        'pinyin': ['wo3']}],
 '這個': [{'definitions': ['this', 'this one'],
         'kind': 'Traditional',
         'match': '这个',
         'pinyin': ['zhe4', 'ge5']}]}

Features

parse() returns a ChineseAnalyzerResult object.

>>> from chinese import ChineseAnalyzer
>>> analyzer = ChineseAnalyzer()
# Basic usage.
>>> result = analyzer.parse('你好世界')
# If the traditional option is set to True, the analyzer tries to parse the
# provided text as 繁体字 (traditional characters).
>>> result = analyzer.parse('你好世界', traditional=True)
# The default tokenizer is jieba's. You can also use pynlpir's to tokenize.
>>> result = analyzer.parse('你好世界', using=analyzer.tokenizer.pynlpir)
# In addition, a custom tokenizer can be passed to the method.
>>> from chinese.tokenizer import TokenizerInterface
>>> class MyTokenizer(TokenizerInterface):
...     # A custom tokenizer must inherit from TokenizerInterface
...     # and must implement the tokenize() method.
...     def tokenize(self, string):
...         # tokenize() must return a list of tuples containing at least
...         # a string as a first element.
...         # For example: [('token1', ...), ('token2', ...), ...].
...         # Illustrative placeholder body: one token per character.
...         return [(char,) for char in string]
...
>>> my_tokenizer = MyTokenizer()
>>> result = analyzer.parse('你好世界', using=my_tokenizer)
# You can also specify the dictionary used for looking up each token.
# Pass the path to a dictionary file; the file must follow the
# CC-CEDICT dictionary file structure.
# The CC-CEDICT dictionary is used for lookups by default.
>>> result = analyzer.parse('你好世界', dictionary='path/to/dict')
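To illustrate the tuple shape a custom tokenizer is expected to return, here is a minimal sketch that emits single non-ASCII characters and runs of ASCII letters or digits, each with start and end offsets. The name `simple_tokenize` and the `(token, start, end)` layout are assumptions made for this example; the description above only requires a string as the first element. In real use the same logic would sit inside the tokenize() method of a TokenizerInterface subclass.

```python
import re

# Hypothetical helper, not part of the chinese package.
# Matches a run of ASCII letters/digits, or else any single non-space character.
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+|\S")


def simple_tokenize(string):
    """Return a list of (token, start, end) tuples for the given string."""
    return [(m.group(), m.start(), m.end()) for m in _TOKEN_RE.finditer(string)]


print(simple_tokenize("我喜欢Python"))
```

A character-level split like this is far cruder than jieba or PyNLPIR, which group characters into words; it only demonstrates the return format.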

original() returns the provided text as is.

>>> result = analyzer.parse('我最喜欢吃水煮肉片')
>>> result.original()
'我最喜欢吃水煮肉片'

tokens() returns the tokens in the provided text.

>>> result = analyzer.parse('我的汉语马马虎虎')
>>> result.tokens()
['我', '的', '汉语', '马马虎虎']
# If the details option is set to True, additional information is also attached.
# In this case, the positions of the tokens are included.
>>> result.tokens(details=True)
[('我', 0, 1), ('的', 1, 2), ('汉语', 2, 4), ('马马虎虎', 4, 8)]
>>> result = analyzer.parse('的的的的的在的的的的就以和和和')
# You can get a unique collection of tokens using the unique option.
>>> result.tokens(unique=True)
['的', '在', '就', '以', '和']
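The output above suggests that unique=True keeps each token once, in the order it first appears. A minimal sketch of that behavior using only the standard library follows; the helper name `unique_tokens` is hypothetical, the token list is built by hand, and this is not necessarily how the package implements the option.

```python
def unique_tokens(tokens):
    """Drop repeated tokens while keeping first-seen order."""
    # dict preserves insertion order (Python 3.7+), so its keys are the
    # tokens in the order they first appeared.
    return list(dict.fromkeys(tokens))


# One token per character here, standing in for a real tokenizer's output.
tokens = list("的的的的的在的的的的就以和和和")
print(unique_tokens(tokens))
```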

freq() returns a Counter object that counts the number of occurrences of each token.

>>> result = analyzer.parse('的的的的的在的的的的就以和和和')
>>> result.freq()
Counter({'的': 9, '和': 3, '在': 1, '就': 1, '以': 1})
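Because the returned object is a Counter, the usual collections.Counter methods should apply to it. The sketch below builds an equivalent Counter from a hand-written token list (a stand-in for result.tokens(), so it runs without the package installed) and queries it.

```python
from collections import Counter

# Stand-in for result.tokens(): one token per character.
tokens = list("的的的的的在的的的的就以和和和")
counts = Counter(tokens)

# The two most frequent tokens, as (token, count) pairs.
print(counts.most_common(2))
# Tokens that never occurred count as zero instead of raising KeyError.
print(counts["不"])
```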

sentences() returns a list of the sentences in the provided text.

>>> s = '''您好。请问小美在家吗？
...
... 在。请稍等。'''
>>> result = analyzer.parse(s)
>>> result.sentences()
['您好', '请问小美在家吗', '在', '请稍等']
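The result above is consistent with splitting on sentence-ending punctuation and discarding the punctuation itself. The following sketch reproduces that output with a regular expression; the punctuation set and the name `split_sentences` are assumptions for illustration, and the package may split differently.

```python
import re

# Assumed set of sentence terminators; line breaks also end a sentence here.
_SENTENCE_END = re.compile(r"[。！？!?\n]+")


def split_sentences(text):
    """Split text on terminators and drop empty pieces."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


s = "您好。请问小美在家吗？\n\n在。请稍等。"
print(split_sentences(s))
```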

search() returns a list of the sentences that contain the argument string.

>>> s = '自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此，这一领域的研究将涉及自然语言，即人们日常使用的语言，所以它与语言学的研究有着密切的联系，但又有重要的区别。自然语言处理并不是一般地研究自然语言，而在于研制能有效地实现自然语言通信的计算机系统，特别是其中的软件系统。因而它是计算机科学的一部分。'
>>> result = analyzer.parse(s)
>>> result.search('数学')
['自然语言处理是一门融语言学、计算机科学、数学于一体的科学']
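Conceptually this is a substring filter over the sentence list. A minimal sketch, assuming the sentences have already been split; `search_sentences` is a hypothetical name and the two sample sentences are taken from the text above.

```python
def search_sentences(sentences, keyword):
    """Return the sentences that contain keyword as a substring."""
    return [sentence for sentence in sentences if keyword in sentence]


sentences = [
    "自然语言处理是一门融语言学、计算机科学、数学于一体的科学",
    "因而它是计算机科学的一部分",
]
print(search_sentences(sentences, "数学"))
```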

paragraphs() returns a list of the paragraphs in the provided text.

>>> s = '''您好。请问小美在家吗？
...
... 在。请稍等。'''
>>> result = analyzer.parse(s)
>>> result.paragraphs()
['您好。请问小美在家吗？', '在。请稍等。']
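Here the blank line acts as the paragraph separator and punctuation is kept. A small sketch that yields the same list, assuming paragraphs are delimited by one or more blank lines; `split_paragraphs` is a hypothetical helper, not the package's code.

```python
import re


def split_paragraphs(text):
    """Split on blank lines (which may contain whitespace) and drop empties."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


s = "您好。请问小美在家吗？\n\n在。请稍等。"
print(split_paragraphs(s))
```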

pinyin() returns a pinyin representation of the provided text.

>>> result = analyzer.parse('我喜欢Python。')
>>> result.pinyin()
'wǒ xǐhuan Python.'
>>> result = analyzer.parse('下个月我去涩谷')
# Sometimes the analyzer cannot find a corresponding pinyin.
>>> result.pinyin()
'xiàgèyuè wǒ qù 涩谷'
# The force option forces it to try to convert an unknown word to pinyin.
>>> result.pinyin(force=True)
'xiàgèyuè wǒ qù sègǔ'
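The dictionary entries shown in this document store pinyin with tone numbers (for example 'gao1', 'xing4'), while pinyin() prints tone marks. The sketch below shows one common way to convert a numbered syllable to its marked form using the standard placement rule. The function name `numbered_to_marked` is hypothetical, only lowercase syllables are handled, and nothing here is claimed to match the package's internals.

```python
# Tone-marked forms of each vowel, indexed by tone number 1-4.
_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def numbered_to_marked(syllable):
    """Convert e.g. 'gao1' to 'gāo'; tone 5 (neutral) gets no mark."""
    # CC-CEDICT writes ü as 'u:'; 'v' is another common ASCII stand-in.
    syllable = syllable.replace("u:", "ü").replace("v", "ü")
    if not syllable or not syllable[-1].isdigit():
        return syllable
    base, tone = syllable[:-1], int(syllable[-1])
    if tone not in (1, 2, 3, 4):
        return base
    # Placement rule: 'a' or 'e' takes the mark; otherwise the 'o' in 'ou';
    # otherwise the last vowel of the syllable.
    if "a" in base:
        idx = base.index("a")
    elif "e" in base:
        idx = base.index("e")
    elif "ou" in base:
        idx = base.index("o")
    else:
        positions = [i for i, ch in enumerate(base) if ch in _MARKS]
        if not positions:
            return base
        idx = positions[-1]
    return base[:idx] + _MARKS[base[idx]][tone - 1] + base[idx + 1:]


print("".join(numbered_to_marked(s) for s in ["gao1", "xing4"]))
```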

pprint() prints a formatted description of the parsed text.

>>> result = analyzer.parse('我爱看书')
>>> result.pprint()
{'original': '我爱看书',
 'parsed': [{'dict_data': [{'definitions': ['I', 'me', 'my'],
                            'kind': 'Simplified',
                            'match': '我',
                            'pinyin': ['wo3']}],
             'token': ('我', 0, 1)},
            {'dict_data': [{'definitions': ['to love',
                                            'to be fond of',
                                            'to like',
                                            'affection',
                                            'to be inclined (to do sth)',
                                            'to tend to (happen)'],
                            'kind': 'Simplified',
                            'match': '愛',
                            'pinyin': ['ai4']}],
             'token': ('爱', 1, 2)},
            {'dict_data': [{'definitions': ['to read', 'to study'],
                            'kind': 'Simplified',
                            'match': '看書',
                            'pinyin': ['kan4', 'shu1']}],
             'token': ('看书', 2, 4)}]}

say() converts the provided text to Chinese speech (macOS only).

>>> result = analyzer.parse('您好，我叫Ting-Ting。我讲中文普通话。')
>>> result.say()  # Output the speech.
>>> result.say(out='say.aac')  # Save the speech to out.

Get the number of tokens.

>>> result = analyzer.parse('我是中国人')
>>> result.tokens()
['我', '是', '中国', '人']
>>> len(result)
4

Check whether a token is in the result.

>>> result = analyzer.parse('我是中国人')
>>> '中国' in result
True
>>> '我是' in result
False
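Both len(result) and the `in` test go through Python's container protocol, and membership is checked against whole tokens, which is why '我是' is not found even though it is a substring of the text. The class below is a hypothetical stand-in (not the package's result class) showing how `__len__` and `__contains__` give that behavior.

```python
class TokenResult:
    """Hypothetical stand-in for a parse result; holds a list of tokens."""

    def __init__(self, tokens):
        self._tokens = list(tokens)

    def __len__(self):
        # len(result) is the number of tokens, not the number of characters.
        return len(self._tokens)

    def __contains__(self, token):
        # Membership compares against whole tokens, not substrings of the text.
        return token in self._tokens


result = TokenResult(["我", "是", "中国", "人"])
print(len(result), "中国" in result, "我是" in result)
```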

Extract the lookup result.

>>> result = analyzer.parse('你叫什么名字？')
>>> result.tokens()
['你', '叫', '什么', '名字', '？']
>>> shenme = result['什么']  # It's just a list of lookup results.
>>> len(shenme)  # It has only one entry.
1
>>> print(shenme[0])  # Print that entry.
{'definitions': ['what?', 'something', 'anything'],
 'kind': 'Simplified',
 'match': '什麼',
 'pinyin': ['shen2', 'me5']}
>>> shenme_info = shenme[0]
>>> shenme_info.definitions  # Definitions of the token.
['what?', 'something', 'anything']
>>> shenme_info.match  # The corresponding 繁体字 (traditional form).
'什麼'
>>> shenme_info.pinyin  # The pinyin of the token.
['shen2', 'me5']
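The fields in a lookup entry map closely onto one line of a CC-CEDICT file, whose layout is `Traditional Simplified [pin1 yin1] /definition 1/definition 2/`. As a rough sketch of how such a line could be read, and of the structure a custom dictionary file passed via the dictionary option would need to follow, consider the parser below. The function name and the dictionary keys are chosen for this example and differ from the package's own keys.

```python
import re

# CC-CEDICT entry layout: Traditional Simplified [pin1 yin1] /def 1/def 2/
_ENTRY_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")


def parse_cedict_line(line):
    """Parse one CC-CEDICT line into a dict, or return None if it is not an entry."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment
    match = _ENTRY_RE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, definitions = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin.split(),
        "definitions": definitions.split("/"),
    }


entry = parse_cedict_line("什麼 什么 [shen2 me5] /what?/something/anything/")
print(entry["pinyin"], entry["definitions"])
```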

License

MIT License

Thanks

jieba and PyNLPIR are used to tokenize Chinese text.

CC-CEDICT is used to look up information for tokens.


chinese 0.2.1
