hybridtfidf Python package - PyPI

An implementation of the Hybrid TF-IDF microblog summarization algorithm proposed by David Inouye and Jugal K. Kalita.

Project description for hybridtfidf

Hybrid TF-IDF

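This is an implementation of the Hybrid TF-IDF algorithm proposed by David Inouye and Jugal K. Kalita (2011).

Hybrid TF-IDF is designed with Twitter data in mind, where documents are short. It is an approach to generating a multiple-post summary of a collection of documents.

Install it with: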

pip install hybridtfidf

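Load some short texts as a list of strings, one string per document. The documents list in the NLTK example below has this form.

The algorithm works best on tokenized data with stop words removed, but this is not required. You can tokenize your documents in any way you like. Here is an example using the popular NLTK package: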

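import nltk
nltk.download('stopwords')  # stop-word lists
nltk.download('punkt')      # tokenizer models used by word_tokenize (newer NLTK releases name this resource 'punkt_tab')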

documents = ["This is one example of a short text.",
            "Designed for twitter posts, a typical 'short document' will have fewer than 280 characters!"
            ]

stop_words = set(nltk.corpus.stopwords.words('english'))

tokenized_documents = []

for document in documents:
    tokens = nltk.tokenize.word_tokenize(document)
    tokenized_document = [i for i in tokens if i not in stop_words]
    tokenized_documents.append(tokenized_document)

# tokenized_documents[0] = ['This','one','example','short','text','.']

However, the algorithm requires each document to be a single string. If you use NLTK's tokenizer, make sure to re-join each document into a string.

tokenized_documents = [' '.join(document) for document in tokenized_documents]

# tokenized_documents[0] = 'This one example short text .'
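Since any tokenization scheme is acceptable, the same preprocessing can also be done without NLTK. The following is a minimal sketch using only the standard library. The STOP_WORDS set and the simple_preprocess helper are hypothetical names made up for this illustration, and the stop-word list is deliberately tiny.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration only;
# a real list (such as NLTK's) is much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "for", "will", "have", "than", "this"}


def simple_preprocess(document):
    """Lowercase, split into alphanumeric tokens, drop stop words, re-join."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)


documents = ["This is one example of a short text.",
             "Designed for twitter posts, a typical 'short document' will have fewer than 280 characters!"]

# Each entry is a single string again, which is the form fit() expects.
preprocessed_documents = [simple_preprocess(document) for document in documents]
```

Unlike the NLTK example, this sketch lowercases the text and discards punctuation, so the resulting strings differ slightly from the ones shown above.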

Create a HybridTfidf object and fit it to the data:

from hybridtfidf import HybridTfidf

hybridtfidf = HybridTfidf(threshold=7)
hybridtfidf.fit(tokenized_documents)

# The threshold value affects how strongly the algorithm biases towards longer documents
# A higher threshold will make longer documents have a higher post weight
# (see the next snippets of code for what post weight does)

Transform the documents into their Hybrid TF-IDF vector representations, and get the saliency value of each document:

document_vectors = hybridtfidf.transform(tokenized_documents)
document_weights = hybridtfidf.transform_to_weights(tokenized_documents)

The document vectors represent the documents embedded in the Hybrid TF-IDF vector space. Any linear algebra technique can be applied to them!
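As one example of such a technique, cosine similarity measures how closely two document vectors point in the same direction. The sketch below uses made-up vectors rather than real output from transform(), and it assumes each document vector is an equal-length sequence of numbers. The cosine_similarity helper is written here for illustration and is not part of the package.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for rows of document_vectors.
vector_a = [1.0, 0.0, 2.0]
vector_b = [2.0, 0.0, 4.0]  # same direction as vector_a
vector_c = [0.0, 3.0, 0.0]  # shares no terms with vector_a

similarity_ab = cosine_similarity(vector_a, vector_b)
similarity_ac = cosine_similarity(vector_a, vector_c)
```

Documents that use the same terms in the same proportions score close to 1, and documents with no terms in common score 0.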

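The document weights list gives a single number for each document, reflecting how important that document is (how much it contributes to the discussion of a topic). In theory, spam documents will have a low post saliency weight.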
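To give a feel for where such a weight can come from, here is a rough sketch of a Hybrid TF-IDF style post weight. It assumes the formulation usually attributed to the paper: term frequency is counted over the whole collection, inverse document frequency is counted per post, and the sum of word weights is divided by max(threshold, post length). This is an illustration under those assumptions, not the package's own code, and hybrid_tfidf_weights is a hypothetical name.

```python
import math
from collections import Counter


def hybrid_tfidf_weights(documents, threshold):
    """Return one weight per document (each document is a space-separated string)."""
    tokenized = [document.split() for document in documents]
    total_words = sum(len(tokens) for tokens in tokenized)
    # How often each word occurs across the whole collection.
    term_counts = Counter(token for tokens in tokenized for token in tokens)
    # In how many documents each word occurs.
    document_counts = Counter(token for tokens in tokenized for token in set(tokens))
    number_of_documents = len(tokenized)

    def word_weight(token):
        tf = term_counts[token] / total_words
        idf = number_of_documents / document_counts[token]
        return tf * math.log2(idf)

    weights = []
    for tokens in tokenized:
        if not tokens:
            weights.append(0.0)
            continue
        # Short posts are divided by the threshold instead of their own length.
        normalizer = max(threshold, len(tokens))
        weights.append(sum(word_weight(token) for token in tokens) / normalizer)
    return weights


documents = ["storm hits coast",
             "storm warning issued for coast towns today",
             "buy cheap watches"]

weights_low_threshold = hybrid_tfidf_weights(documents, threshold=1)
weights_high_threshold = hybrid_tfidf_weights(documents, threshold=7)
```

Under these assumptions, raising the threshold from 1 to 7 lowers the weight of the three-word posts while leaving the seven-word post unchanged, which matches the bias towards longer documents described for the threshold parameter.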

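Finally, Inouye and Kalita suggest using Hybrid TF-IDF to summarize a collection of documents. We select the 'k' most relevant/salient documents and, to avoid redundancy, we do not select any document whose cosine similarity to a previously selected document is too high. In effect, we pick the most important documents while skipping those that discuss the same topic, i.e. we summarize the collection into 'k' representative documents.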

# Get the indices of the most significant documents. 
from hybridtfidf.utils import select_salient_documents

most_significant = select_salient_documents(document_vectors,document_weights, k = 5, similarity_threshold = 0.5)

for i in most_significant:
    print(documents[i])         # Prints the 'k' most significant documents that are each about a separate topic
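The selection step described above can be sketched as a greedy loop: visit documents from highest to lowest weight, and keep a document only if it is not too similar to any document already kept. This is an illustration of the idea, not the package's implementation. The greedy_select name, the made-up vectors and weights, and the choice to skip a document when its similarity exceeds the threshold are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_select(vectors, weights, k, similarity_threshold):
    """Return indices of up to k high-weight, mutually dissimilar documents."""
    # Document indices ordered from most to least salient.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    selected = []
    for index in ranked:
        if len(selected) == k:
            break
        # Keep the document only if it is dissimilar to everything kept so far.
        if all(cosine_similarity(vectors[index], vectors[chosen]) <= similarity_threshold
               for chosen in selected):
            selected.append(index)
    return selected


# Made-up data: documents 0 and 1 are near-duplicates, document 2 is unrelated.
example_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
example_weights = [0.9, 0.8, 0.5]

chosen_indices = greedy_select(example_vectors, example_weights, k=2, similarity_threshold=0.5)
```

Here document 1 is skipped even though it has the second-highest weight, because it is nearly identical to document 0, so the result is indices 0 and 2.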

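Note: the indices of the fit() input (the original document list), the document vectors and the document weights are all aligned. Make sure not to reorder one without reordering the others.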
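One way to keep everything aligned is to compute a single list of indices and apply it to every sequence. The sketch below sorts documents by weight using made-up values; the same index list would be applied to the document vectors as well.

```python
documents = ["first post", "second post", "third post"]
document_weights = [0.2, 0.9, 0.5]  # made-up weights for illustration

# Compute the new order once, as a list of original indices.
order = sorted(range(len(documents)), key=lambda i: document_weights[i], reverse=True)

# Apply the same order to every aligned sequence.
sorted_documents = [documents[i] for i in order]
sorted_weights = [document_weights[i] for i in order]
```

Sorting documents and document_weights separately would break the pairing between each document and its weight.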

hybridtfidf 1.0.6
