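A TextRank-based package for text summarization and keyword extraction

summa: detailed Python project description


TextRank implementation for text summarization and keyword extraction in Python 3, with optimizations on the similarity function.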

Features

  • Text summarization
  • Keyword extraction

Examples

Text summarization:

>>> text = """Automatic summarization is the process of reducing a text document with a \
computer program in order to create a summary that retains the most important points \
of the original document. As the problem of information overload has grown, and as \
the quantity of data has increased, so has interest in automatic summarization. \
Technologies that can make a coherent summary take into account variables such as \
length, writing style and syntax. An example of the use of summarization technology \
is search engines such as Google. Document summarization is another."""

>>> from summa import summarizer
>>> print(summarizer.summarize(text))
Automatic summarization is the process of reducing a text document with a computer
program in order to create a summary that retains the most important points of the
original document.

Keyword extraction:

>>> from summa import keywords
>>> print(keywords.keywords(text))
document
summarization
writing
account

Note that line breaks in the input are used as sentence separators, so be sure to preprocess your text accordingly.
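
One way to do that preprocessing is sketched below. It assumes the input is hard-wrapped text in which blank lines mark paragraph boundaries; the helper name `unwrap_lines` is hypothetical and not part of summa. It joins the wrapped lines of each paragraph so that a sentence is not cut in the middle by a stray newline.

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines; keep one newline between paragraphs."""
    # Assumption: a blank line (possibly containing spaces) ends a paragraph.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, collapse every whitespace run, newlines included,
    # into a single space.
    return "\n".join(" ".join(p.split()) for p in paragraphs)


wrapped = (
    "Automatic summarization reduces a text\n"
    "to its most important points.\n"
    "\n"
    "It is used by search engines."
)
print(unwrap_lines(wrapped))
```

The cleaned string can then be passed to `summarizer.summarize` or `keywords.keywords` as in the examples above.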

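Installation

This software is available in PyPI. It depends on NumPy and Scipy, two Python libraries for scientific computing. pip will automatically install them along with summa: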

pip install summa

For better keyword extraction performance, install Pattern.

More examples

  • Command-line usage:

    textrank -t FILE
    
  • Define the length of the summary as a proportion of the text (also available in keywords):

    >>> from summa.summarizer import summarize
    >>> summarize(text, ratio=0.2)
    
  • Define the length of the summary by an approximate number of words (also available in keywords):

    >>> summarize(text, words=50)
    
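  • Define the input text language (also available in keywords).

    The available languages are arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, polish, porter, portuguese, romanian, russian, spanish and swedish: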

    >>> summarize(text, language='spanish')
    
  • Get the result as a list (also available in keywords):

    >>> summarize(text, split=True)
    ['Automatic summarization is the process of reducing a text document with a
    computer program in order to create a summary that retains the most important
    points of the original document.']
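
The options above can be combined in a single call. The sketch below is one possible way to do it, not part of summa: `build_options` and `summary_and_keywords` are hypothetical helper names, and the language check is an extra safeguard added here, using the language names from the list above. The summa import sits inside the function so that the option helper can be used on its own.

```python
# Language names copied from the list above.
SUPPORTED_LANGUAGES = frozenset({
    "arabic", "danish", "dutch", "english", "finnish", "french", "german",
    "hungarian", "italian", "norwegian", "polish", "porter", "portuguese",
    "romanian", "russian", "spanish", "swedish",
})


def build_options(ratio=0.2, words=None, language="english", split=False):
    """Bundle the keyword arguments shown above (hypothetical helper)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError("unsupported language: %r" % (language,))
    return {"ratio": ratio, "words": words, "language": language, "split": split}


def summary_and_keywords(text, **options):
    """Return (summary, keywords) for text using the same options for both."""
    # Imported here so build_options works even without summa installed.
    from summa import keywords, summarizer

    opts = build_options(**options)
    return summarizer.summarize(text, **opts), keywords.keywords(text, **opts)
```

With summa installed, `summary_and_keywords(text, language='spanish', split=True)` would return the summary and the keywords as two lists.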
    

References

Cite this work:

@article{DBLP:journals/corr/BarriosLAW16,
  author    = {Federico Barrios and
             Federico L{\'{o}}pez and
             Luis Argerich and
             Rosa Wachenchauzer},
  title     = {Variations of the Similarity Function of TextRank for Automated Summarization},
  journal   = {CoRR},
  volume    = {abs/1602.03606},
  year      = {2016},
  url       = {http://arxiv.org/abs/1602.03606},
  archivePrefix = {arXiv},
  eprint    = {1602.03606},
  timestamp = {Wed, 07 Jun 2017 14:40:43 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/BarriosLAW16},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

summa is open source software released under The MIT License (MIT).

Copyright (C) 2014–present Summa NLP.
