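Python hashedindex package - PyPI

An inverted index implementation using hash tables (dictionaries)

hashedindex project description

A fast and simple inverted index implementation using hash tables (Python dictionaries).

Supports Python 3.5+

Free software: BSD license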

Installing
Features
Text Parsing
Integration with Numpy and Pandas
Reporting Bugs

Installing

The easiest way to install hashedindex is through PyPI:

pip install hashedindex

Features

hashedindex provides a simple-to-use inverted index structure that is flexible enough to work with all kinds of use cases.

Basic usage:

    import hashedindex

    index = hashedindex.HashedIndex()

    # Record one occurrence of each term in a document.
    index.add_term_occurrence('hello', 'document1.txt')
    index.add_term_occurrence('world', 'document1.txt')

    index.get_documents('hello')
    # Counter({'document1.txt': 1})

    index.items()
    # {'hello': Counter({'document1.txt': 1}),
    #  'world': Counter({'document1.txt': 1})}

    example = 'The Quick Brown Fox Jumps Over The Lazy Dog'

    for term in example.split():
        index.add_term_occurrence(term, 'document2.txt')

hashedindex is not limited to strings; any hashable object can be indexed.

    index.add_term_occurrence('foo', 10)
    index.add_term_occurrence(('fire', 'fox'), 90.2)

    index.items()
    # {'foo': Counter({10: 1}), ('fire', 'fox'): Counter({90.2: 1})}
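To make the underlying data structure concrete, here is a minimal sketch of an inverted index built from a dictionary of Counter objects, matching the shape of the output shown above. The class name TinyInvertedIndex is hypothetical and exists only for illustration; it is not how hashedindex is necessarily implemented, and it covers only two of the methods used in the examples.

```python
from collections import Counter, defaultdict


class TinyInvertedIndex:
    """Hypothetical stand-in for illustration; not hashedindex's actual class."""

    def __init__(self):
        # term -> Counter mapping each document to its occurrence count
        self._terms = defaultdict(Counter)

    def add_term_occurrence(self, term, document):
        # Both term and document only need to be hashable.
        self._terms[term][document] += 1

    def get_documents(self, term):
        # Return a copy so callers cannot mutate the index by accident.
        return Counter(self._terms.get(term, {}))


index = TinyInvertedIndex()
for term in "the quick brown fox jumps over the lazy dog".split():
    index.add_term_occurrence(term, "document2.txt")

# Non-string keys work too, because tuples and floats are hashable.
index.add_term_occurrence(("fire", "fox"), 90.2)

the_docs = index.get_documents("the")
```

Because every lookup is a dictionary access, finding the documents for a term takes constant time on average regardless of how many terms are stored.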

Text Parsing

The hashedindex module comes with a textparser module that contains methods for splitting text into tokens.

    from hashedindex import textparser

    list(textparser.word_tokenize("hello cruel world"))
    # [('hello',), ('cruel',), ('world',)]

Tokens are wrapped in tuples because you can request n-grams of any size:

    list(textparser.word_tokenize("Life is about making an impact, not making an income.", ngrams=2))
    # [('life', 'is'), ('is', 'about'), ('about', 'making'), ('making', 'an'),
    #  ('an', 'impact'), ('impact', 'not'), ('not', 'making'), ('making', 'an'),
    #  ('an', 'income')]

Take a look at the function's docstring for information on how to use stopwords and how to specify min_length or ignore_numeric.
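The following sketch shows one way such a tokenizer could combine n-grams with stopword, minimum-length, and numeric filtering. The function sketch_word_tokenize is hypothetical, and its defaults and regular expression are assumptions chosen for illustration; the real textparser.word_tokenize may behave differently, so consult its docstring for the actual semantics.

```python
import re


def sketch_word_tokenize(text, stopwords=frozenset(), ngrams=1,
                         min_length=0, ignore_numeric=True):
    """Yield n-gram tuples of lowercased words (illustrative only)."""
    # Lowercase the text and keep runs of ASCII letters and digits as words.
    words = re.findall(r"[a-z0-9]+", text.lower())

    # Drop stopwords, words that are too short, and (optionally) pure numbers.
    kept = [
        w for w in words
        if w not in stopwords
        and len(w) >= min_length
        and not (ignore_numeric and w.isdigit())
    ]

    # Slide a window of size `ngrams` across the surviving words.
    for i in range(len(kept) - ngrams + 1):
        yield tuple(kept[i:i + ngrams])


bigrams = list(sketch_word_tokenize(
    "Life is about making an impact, not making an income.", ngrams=2))

filtered = list(sketch_word_tokenize(
    "The 3 quick foxes", stopwords={"the"}, min_length=2))
```

In this sketch the filters run before the n-gram window, so a removed stopword makes its two neighbours adjacent; filtering after windowing would be an equally reasonable design.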

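Integration with Numpy and Pandas

The idea behind hashedindex is to provide a quick and easy way of generating matrices for machine learning, in combination with numpy, pandas and scikit-learn. For example: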

    from hashedindex import textparser
    import hashedindex
    import numpy as np
    from sklearn.svm import SVC

    index = hashedindex.HashedIndex()

    documents = ['spam1.txt', 'ham1.txt', 'spam2.txt']
    for doc in documents:
        with open(doc, 'r') as fp:
            for term in textparser.word_tokenize(fp.read()):
                index.add_term_occurrence(term, doc)

    # You *probably* want to use scipy.sparse.csr_matrix for better performance
    X = np.asarray(index.generate_feature_matrix(mode='tfidf'))

    # Label each document: 1 if its name contains 'spam', otherwise 0.
    y = []
    for doc in index.documents():
        y.append(1 if 'spam' in doc else 0)
    y = np.asarray(y)

    classifier = SVC(kernel='linear')
    classifier.fit(X, y)
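For readers unfamiliar with TF-IDF, the sketch below computes a small document-by-term matrix from an index shaped like the one shown earlier (term mapped to a Counter of documents). The weighting used here, term frequency normalised by document length multiplied by the natural log of inverse document frequency, is one common variant and is an assumption; the exact formula behind generate_feature_matrix(mode='tfidf') may differ. The function name sketch_tfidf_matrix and the toy data are hypothetical.

```python
import math
from collections import Counter

# Toy index in the shape shown earlier: term -> Counter({document: count}).
toy_index = {
    ("buy",): Counter({"spam1.txt": 2, "spam2.txt": 1}),
    ("hello",): Counter({"ham1.txt": 1, "spam1.txt": 1, "spam2.txt": 1}),
    ("meeting",): Counter({"ham1.txt": 3}),
}


def sketch_tfidf_matrix(index):
    """Return (terms, documents, rows) with one row per document."""
    terms = sorted(index)
    documents = sorted({doc for counts in index.values() for doc in counts})

    # Total number of term occurrences in each document.
    doc_lengths = Counter()
    for counts in index.values():
        doc_lengths.update(counts)

    rows = []
    for doc in documents:
        row = []
        for term in terms:
            # Counter returns 0 for documents that never contain the term.
            tf = index[term][doc] / doc_lengths[doc]
            idf = math.log(len(documents) / len(index[term]))
            row.append(tf * idf)
        rows.append(row)
    return terms, documents, rows


terms, documents, matrix = sketch_tfidf_matrix(toy_index)
```

A term that appears in every document, such as ('hello',) here, gets an inverse document frequency of log(1) = 0, so its column carries no weight; that is the property that makes TF-IDF useful for separating spam from ham.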

You can also turn your feature matrix into a more descriptive pandas DataFrame:

    import pandas as pd

    X = index.generate_feature_matrix(mode='tfidf')
    df = pd.DataFrame(X, columns=index.terms(), index=index.documents())

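All methods within the code have high test coverage, so you can be confident that everything works as expected.

Reporting Bugs

Found a bug? Nice, a bug found is a bug fixed. Open an issue or, better yet, open a pull request.

History

0.5.0 (2019-07-21)

Dropped support for Python 2.7 and 3.4

0.1.0 (2015-01-11)

First release on PyPI.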
