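A text summarization and keyword extraction package based on TextRank
Detailed description of the summa Python project
An implementation of TextRank for text summarization and keyword extraction in Python 3, with optimizations on the similarity function.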
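To give a rough idea of what TextRank does, the sketch below scores sentences with a PageRank-style iteration over a sentence-similarity graph, using the word-overlap similarity from the original TextRank paper (Mihalcea and Tarau, 2004). It is an illustration under simplifying assumptions (whitespace tokenization, no stemming or stop-word removal, a fixed number of iterations) and is not summa's actual implementation; the function names `overlap_similarity` and `textrank_scores` are hypothetical.

```python
import math


def overlap_similarity(s1, s2):
    """Shared-word count normalized by the log of both sentence lengths."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    # log(1) == 0, so one-word (or empty) sentences can make the denominator zero.
    denom = math.log(len(w1)) + math.log(len(w2)) if w1 and w2 else 0.0
    if denom <= 0.0:
        return 0.0
    return len(set(w1) & set(w2)) / denom


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Return one score per sentence (hypothetical helper, not summa's API)."""
    n = len(sentences)
    # Weighted, undirected graph: weights[i][j] is the similarity of i and j.
    weights = [[overlap_similarity(a, b) if i != j else 0.0
                for j, b in enumerate(sentences)]
               for i, a in enumerate(sentences)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge between j and i.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores
```

Sentences that share vocabulary with many others end up with higher scores; a summary can then be built from the top-ranked sentences.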
Features
- Text summarization
- Keyword extraction
Examples
Text summarization:
>>> text = """Automatic summarization is the process of reducing a text document with a \
computer program in order to create a summary that retains the most important points \
of the original document. As the problem of information overload has grown, and as \
the quantity of data has increased, so has interest in automatic summarization. \
Technologies that can make a coherent summary take into account variables such as \
length, writing style and syntax. An example of the use of summarization technology \
is search engines such as Google. Document summarization is another."""
>>> from summa import summarizer
>>> print(summarizer.summarize(text))
Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.
Keyword extraction:
>>> from summa import keywords
>>> print(keywords.keywords(text))
document
summarization
writing
account
Note that line breaks in the input are used as sentence separators, so be sure to preprocess your text accordingly.
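Because line breaks act as sentence separators, hard-wrapped text (for example, text copied from an email or a PDF) should have its wrapping newlines removed first. The following is a minimal sketch using only the standard library; the helper name `unwrap_lines` is hypothetical and not part of summa, and treating blank lines as paragraph boundaries is an assumption about the input.

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines so only paragraph breaks remain as newlines.

    Hypothetical helper, not part of summa. Assumes paragraphs are
    separated by one or more blank lines.
    """
    # Split on blank lines (which may contain whitespace) to get paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside each paragraph, collapse any run of whitespace, including
    # single newlines, into one space.
    joined = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Keep a single newline between paragraphs and drop empty ones.
    return "\n".join(p for p in joined if p)
```

The cleaned string can then be passed to `summarizer.summarize` or `keywords.keywords` in place of the raw text.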
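Installation
This software is available on PyPI. It depends on NumPy and SciPy, two Python libraries for scientific computing. pip will install them automatically along with summa: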
pip install summa
For better keyword extraction performance, install Pattern.
More examples
Command-line usage:
textrank -t FILE
Define the length of the summary as a proportion of the text (also available in keywords):
>>> from summa.summarizer import summarize
>>> summarize(text, ratio=0.2)
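Define the input text language (also available in keywords). The available languages are arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, polish, porter, portuguese, romanian, russian, spanish and swedish: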
>>> summarize(text, language='spanish')
Get the results as a list (also available in keywords):
>>> summarize(text, split=True)
['Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.']
Define the length of the summary by an approximate number of words:
>>> summarize(text, words=50)
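References
- Mihalcea, R., Tarau, P.: "TextRank: Bringing Order into Texts". In: Lin, D., Wu, D. (eds.) Proceedings of EMNLP 2004. pp. 404-411. Association for Computational Linguistics, Barcelona, Spain. July 2004.
- Barrios, F., López, F., Argerich, L., Wachenchauzer, R.: "Variations of the Similarity Function of TextRank for Automated Summarization". Anales de las 44JAIIO. Jornadas Argentinas de Informática, Argentine Symposium on Artificial Intelligence, 2015.
To cite this work: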
@article{DBLP:journals/corr/BarriosLAW16,
  author        = {Federico Barrios and Federico L{\'{o}}pez and Luis Argerich and Rosa Wachenchauzer},
  title         = {Variations of the Similarity Function of TextRank for Automated Summarization},
  journal       = {CoRR},
  volume        = {abs/1602.03606},
  year          = {2016},
  url           = {http://arxiv.org/abs/1602.03606},
  archivePrefix = {arXiv},
  eprint        = {1602.03606},
  timestamp     = {Wed, 07 Jun 2017 14:40:43 +0200},
  biburl        = {https://dblp.org/rec/bib/journals/corr/BarriosLAW16},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
summa is open-source software released under The MIT License (MIT).
Copyright (C) 2014–present Summa NLP.