A TextRank-based package for text summarization and keyword extraction

Detailed description of the summa Python project


A TextRank implementation for text summarization and keyword extraction in Python 3, with optimizations on the similarity function.

Features

  • Text summarization
  • Keyword extraction

Examples

Text summarization:

>>> text = """Automatic summarization is the process of reducing a text document with a \
computer program in order to create a summary that retains the most important points \
of the original document. As the problem of information overload has grown, and as \
the quantity of data has increased, so has interest in automatic summarization. \
Technologies that can make a coherent summary take into account variables such as \
length, writing style and syntax. An example of the use of summarization technology \
is search engines such as Google. Document summarization is another."""

>>> from summa import summarizer
>>> print(summarizer.summarize(text))
'Automatic summarization is the process of reducing a text document with a computer
program in order to create a summary that retains the most important points of the
original document.'

Keyword extraction:

>>> from summa import keywords
>>> print(keywords.keywords(text))
document
summarization
writing
account

Note that line breaks in the input are used as sentence separators, so make sure to preprocess your text accordingly.
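Because line breaks act as sentence separators, hard-wrapped input (for example, text copied from an e-mail or a PDF) may be split in the middle of sentences. The following is a minimal sketch of one possible preprocessing step, not something shipped with summa: `unwrap_hard_breaks` is a hypothetical helper name, and it assumes that blank lines mark paragraph boundaries while single line breaks are just wrapping.

```python
import re


def unwrap_hard_breaks(raw: str) -> str:
    """Join hard-wrapped lines so each paragraph sits on a single line.

    Assumption: a blank line separates paragraphs; any other line break
    is treated as wrapping and replaced by a single space.
    """
    # Split on one or more blank lines (possibly containing whitespace).
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    # Collapse every run of whitespace inside a paragraph to one space.
    cleaned = [" ".join(paragraph.split()) for paragraph in paragraphs]
    # Keep one newline between paragraphs and drop empty ones.
    return "\n".join(paragraph for paragraph in cleaned if paragraph)
```

The cleaned string can then be passed to `summarizer.summarize` or `keywords.keywords` in place of the raw text. Whether paragraph boundaries should remain as newlines depends on the input; replacing the final `"\n".join` with `" ".join` would remove them as well.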

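Installation

This software is available on PyPI. It depends on NumPy and SciPy, two Python libraries for scientific computing. pip will install them automatically along with summa: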

pip install summa

For better keyword extraction performance, install Pattern.

More examples

  • Command-line usage:

    textrank -t FILE
    
  • Define the length of the summary as a proportion of the text (also available in keywords):

    >>> from summa.summarizer import summarize
    >>> summarize(text, ratio=0.2)
    
  • Define the length of the summary by an approximate number of words (also available in keywords):

    >>> summarize(text, words=50)
    
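  • Define the language of the input text (also available in keywords).

    The available languages are arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, polish, porter, portuguese, romanian, russian, spanish and swedish: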

    >>> summarize(text, language='spanish')
    
  • Get the results as a list (also available in keywords):

    >>> summarize(text, split=True)
    ['Automatic summarization is the process of reducing a text document with a
    computer program in order to create a summary that retains the most important
    points of the original document.']
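The list above states that the ratio, words, language and split options are also available in keywords. As a rough sketch of what that could look like, the helper below calls `keywords.keywords` with each option in turn. `keyword_variants` is a hypothetical name used only for illustration, and the keyword-argument names are assumed to match those of `summarize`, as the list suggests.

```python
def keyword_variants(text, language="english"):
    """Call summa's keyword extractor with the options listed above.

    Requires the summa package (pip install summa). The import sits
    inside the function so that defining it does not need summa.
    """
    from summa import keywords

    return {
        # Size the result as a proportion, mirroring summarize(text, ratio=0.2).
        "by_ratio": keywords.keywords(text, ratio=0.2, language=language),
        # Size the result by an approximate number of words instead.
        "by_words": keywords.keywords(text, words=5, language=language),
        # Ask for a Python list instead of one newline-separated string.
        "as_list": keywords.keywords(text, split=True, language=language),
    }
```

With the sample text from the Examples section, `keyword_variants(text)["as_list"]` should be a list of keyword strings, while the other two entries are newline-separated strings like the output shown earlier.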
    

References

To cite this work:

@article{DBLP:journals/corr/BarriosLAW16,
  author    = {Federico Barrios and
             Federico L{\'{o}}pez and
             Luis Argerich and
             Rosa Wachenchauzer},
  title     = {Variations of the Similarity Function of TextRank for Automated Summarization},
  journal   = {CoRR},
  volume    = {abs/1602.03606},
  year      = {2016},
  url       = {http://arxiv.org/abs/1602.03606},
  archivePrefix = {arXiv},
  eprint    = {1602.03606},
  timestamp = {Wed, 07 Jun 2017 14:40:43 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/BarriosLAW16},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

summa is open source software released under The MIT License (MIT).

Copyright (C) 2014–now Summa NLP.
