polyglot is a natural language pipeline that supports massive multilingual applications.
Detailed description of the polyglot Python project
- Free software: GPLv3 license
- Documentation: http://polyglot.readthedocs.org
- GitHub: https://github.com/aboSamoor/polyglot
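Features
- Tokenization (165 languages)
- Language detection (196 languages)
- Named entity recognition (40 languages)
- Part-of-speech tagging (16 languages)
- Sentiment analysis (136 languages)
- Word embeddings (137 languages)
- Morphological analysis (135 languages)
- Transliteration (69 languages)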
Developer
- Rami Al-Rfou @ rmyeid gmail com
Quick Tutorial
import polyglot
from polyglot.text import Text, Word
Language Detection
# Wrap the raw string; the language is detected from the text itself.
text = Text("Bonjour, Mesdames.")
print("Language Detected: Code={}, Name={}\n".format(text.language.code, text.language.name))
Language Detected: Code=fr, Name=French
Tokenization
zen = Text("Beautiful is better than ugly. "
           "Explicit is better than implicit. "
           "Simple is better than complex.")
# Word-level tokens, punctuation included.
print(zen.words)
[u'Beautiful', u'is', u'better', u'than', u'ugly', u'.', u'Explicit', u'is', u'better', u'than', u'implicit', u'.', u'Simple', u'is', u'better', u'than', u'complex', u'.']
print(zen.sentences)
[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]
Part-of-Speech Tagging
text = Text(u"O primeiro uso de desobediência civil em massa ocorreu em setembro de 1906.")
print("{:<16}{}".format("Word", "POS Tag") + "\n" + "-" * 30)
# pos_tags yields (word, tag) pairs in sentence order.
for word, tag in text.pos_tags:
    print(u"{:<16}{:>2}".format(word, tag))
Word            POS Tag
------------------------------
O               DET
primeiro        ADJ
uso             NOUN
de              ADP
desobediência   NOUN
civil           ADJ
em              ADP
massa           NOUN
ocorreu         ADJ
em              ADP
setembro        NOUN
de              ADP
1906            NUM
.               PUNCT
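Once the (word, tag) pairs are available, a common next step is to summarise them. The sketch below is not part of polyglot: it copies the pairs from the example output above into a plain list and counts them with the standard library. The helper names tag_counts and words_with_tag are hypothetical and exist only for this illustration.

```python
from collections import Counter

# (word, tag) pairs copied from the example output above; in real use they
# would come from iterating over text.pos_tags.
pos_tags = [
    ("O", "DET"), ("primeiro", "ADJ"), ("uso", "NOUN"), ("de", "ADP"),
    ("desobediência", "NOUN"), ("civil", "ADJ"), ("em", "ADP"),
    ("massa", "NOUN"), ("ocorreu", "ADJ"), ("em", "ADP"),
    ("setembro", "NOUN"), ("de", "ADP"), ("1906", "NUM"), (".", "PUNCT"),
]


def tag_counts(pairs):
    """Count how often each POS tag occurs (hypothetical helper)."""
    return Counter(tag for _, tag in pairs)


def words_with_tag(pairs, wanted):
    """Return the words carrying the given tag, in sentence order."""
    return [word for word, tag in pairs if tag == wanted]


counts = tag_counts(pos_tags)
for tag, n in counts.most_common():
    print("{:<6}{}".format(tag, n))
print(words_with_tag(pos_tags, "NOUN"))
```

Because the helpers only expect an iterable of two-item pairs, the hard-coded list could be swapped for the tagger's output without other changes.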
Named Entity Recognition
text = Text(u"In Großbritannien war Gandhi mit dem westlichen Lebensstil vertraut geworden")
# Each entity is a tagged chunk of one or more words.
print(text.entities)
[I-LOC([u'Gro\xdfbritannien']), I-PER([u'Gandhi'])]
Polarity
print("{:<16}{}".format("Word", "Polarity") + "\n" + "-" * 30)
# Each word carries a polarity of -1 (negative), 0 (neutral) or 1 (positive).
for w in zen.words[:6]:
    print("{:<16}{:>2}".format(w, w.polarity))
Word            Polarity
------------------------------
Beautiful        0
is               0
better           1
than             0
ugly            -1
.                0
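Word-level polarities are often rolled up into a score for a longer span. The following is a minimal sketch of one way to do that, assuming a simple average over the non-neutral words; it is not polyglot's own sentiment aggregation, and polarity_summary is a hypothetical helper. The input pairs are copied from the example output above.

```python
# (word, polarity) pairs copied from the example output above; with polyglot
# they would come from ((w, w.polarity) for w in zen.words[:6]).
word_polarities = [
    ("Beautiful", 0), ("is", 0), ("better", 1),
    ("than", 0), ("ugly", -1), (".", 0),
]


def polarity_summary(pairs):
    """Summarise word-level polarities (hypothetical helper, not polyglot API)."""
    positive = [w for w, p in pairs if p > 0]
    negative = [w for w, p in pairs if p < 0]
    scored = len(positive) + len(negative)
    # Average only over words that carry polarity; 0.0 when none do.
    mean = sum(p for _, p in pairs if p != 0) / scored if scored else 0.0
    return {"positive": positive, "negative": negative, "mean": mean}


summary = polarity_summary(word_polarities)
print(summary)
```

Averaging over non-neutral words keeps long neutral passages from diluting the score; averaging over all words would be an equally defensible choice.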
Embeddings
word = Word("Obama", language="en")
print("Neighbors (Synonyms) of {}".format(word) + "\n" + "-" * 30)
# neighbors lists the words closest to this one in the embedding space.
for w in word.neighbors:
    print("{:<16}".format(w))
print("\n\nThe first 10 dimensions out of the {} dimensions\n".format(word.vector.shape[0]))
print(word.vector[:10])
Neighbors (Synonyms) of Obama
------------------------------
Bush
Reagan
Clinton
Ahmadinejad
Nixon
Karzai
McCain
Biden
Huckabee
Lula


The first 10 dimensions out of the 256 dimensions

[-2.57382345  1.52175975  0.51070285  1.08678675 -0.74386948 -1.18616164
  2.92784619 -0.25694436 -1.40958667 -2.39675403]
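To give a feel for what "neighbors" means in an embedding space, here is a small self-contained sketch that ranks words by cosine similarity. This is an assumption made for illustration: the source does not state which distance polyglot uses internally, the three-dimensional vectors are made-up toy values rather than real polyglot embeddings, and both helper functions are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, vectors, top_k=3):
    """Rank every other word by similarity to the query word, best first."""
    q = vectors[query]
    scored = [(cosine_similarity(q, v), w) for w, v in vectors.items() if w != query]
    scored.sort(reverse=True)
    return [w for _, w in scored[:top_k]]


# Toy 3-dimensional vectors invented for this sketch (real vectors in the
# example above have 256 dimensions).
toy_vectors = {
    "Obama": [0.9, 0.1, 0.0],
    "Bush": [0.8, 0.2, 0.1],
    "Clinton": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.1, 0.9],
}

print(nearest_neighbors("Obama", toy_vectors, top_k=2))
```

With these toy values the two politician vectors rank ahead of the unrelated word, which mirrors the kind of grouping seen in the neighbor list above.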
Morphology
# Take the first word of the sentence and split it into morphemes.
word = Text("Preprocessing is an essential step.").words[0]
print(word.morphemes)
[u'Pre', u'process', u'ing']
Transliteration
from polyglot.transliteration import Transliterator
# Transliterate from English into Russian (Cyrillic script).
transliterator = Transliterator(source_lang="en", target_lang="ru")
print(transliterator.transliterate(u"preprocessing"))
препрокессинг
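History
"14.11" (2014-01-11)
- First release on PyPI.
"15.5.2" (2015-05-02)
- Polyglot is feature complete.
"15.10.03" (2015-10-03)
- Changed the polyglot models mirror to the Stony Brook University DSL lab instead of Google Cloud Storage.
"16.07.04" (2016-07-03)
- New features:
  - Support for transfer POS tagging.
  - Support for supplying a hint language code for a text.
- Bug fixes:
  - Improved sentence concatenation (PR 34)
  - Fixed rare Unicode encoding errors (PR 35)
  - Fixed transliteration for languages other than English (PR 46)
  - Added a link to GitHub in the README (PR 49)
  - Made the handling of paths more consistent (RP 55)
  - Fixed a bug where normalizing embeddings in NER broke POS tagging (issue 60, PR 62)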