SparseLSH
A locality-sensitive hashing library with an emphasis on large, highly-dimensional (sparse) datasets.
Features
- Fast and memory-efficient calculations using sparse matrices.
- Built-in support for key-value storage backends: pure-Python, Redis (memory-bound), LevelDB, BerkeleyDB.
- Multiple hash index support (based on Kay Zhu's lshash).
- Built-in support for common distance/objective functions for ranking outputs.
Details
SparseLSH is based on a fork of Kay Zhu's lshash, and is suitable for datasets that won't fit in main memory or that are highly dimensional. Using sparse matrices allows for speedups of easily over an order of magnitude compared to using dense, list-based or numpy-array-based vector math. Sparse matrices also make it possible to deal with these datasets purely in memory, using Python dicts or Redis. When that isn't appropriate, a disk-based key-value store can be used instead: LevelDB and BerkeleyDB are supported. Serialization is done using cPickle (for raw C speedups), falling back to pure-Python pickle if it isn't available.
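As a quick illustration of why sparse storage matters (plain scipy, not SparseLSH-specific code), compare the memory footprint of a dense array versus a CSR representation of a mostly-zero vector:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 1 x 100,000-dimensional vector with only five non-zero entries,
# typical of the high-dimensional sparse data SparseLSH targets.
dense = np.zeros((1, 100000))
dense[0, [3, 17, 42, 999, 55000]] = 1.0
sparse = csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(dense_bytes, sparse_bytes)  # the CSR copy is orders of magnitude smaller
```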
BTC donations: 1NejrUgQDm34CFyMHuaff9PNsd8zhd7SgR
Installation
The easy way:
pip install sparselsh
Or you can clone this repo and follow these instructions:
SparseLSH depends on the following libraries:
Optional (for in-memory and disk-based persistence):
To install (minimal install):
python setup.py install
If you want to use the LevelDB or Redis storage backends, you can install their dependencies from optional-requirements.txt:
pip install -r optional-requirements.txt
Quickstart
To create a 4-bit hash for 7-dimensional input data:
from sparselsh import LSH
from scipy.sparse import csr_matrix

X = csr_matrix([
    [3, 0, 0, 0, 0, 0, -1],
    [0, 1, 0, 0, 0, 0,  1],
    [1, 1, 1, 1, 1, 1,  1]])

# One class number for each input point
y = [0, 3, 10]

X_sim = csr_matrix([[1, 1, 1, 1, 1, 1, 0]])

lsh = LSH(4,
          X.shape[1],
          num_hashtables=1,
          storage_config={"dict": None})

for ix in range(X.shape[0]):
    x = X.getrow(ix)
    c = y[ix]
    lsh.index(x, extra_data=c)

# Find the point in X nearest to X_sim
points = lsh.query(X_sim, num_results=1)
The query will return a list of tuples, each containing the matrix-like matched point and its similarity score. In this case, a lower score is more similar:
[((<1x7 sparse matrix of type '<type 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>, 10), 1)]
We can look at the most similar match by accessing the sparse array and calling its todense function:
In [11]: print points[0][0][0].todense()
[[1 1 1 1 1 1 1]]
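Since each result is a ((point, extra_data), score) tuple, it can also be unpacked directly. The snippet below builds a stand-in result with the same shape as the output shown above (it does not call SparseLSH itself):

```python
from scipy.sparse import csr_matrix

# A stand-in for lsh.query output: a list of ((point, extra_data), score) tuples
points = [((csr_matrix([[1, 1, 1, 1, 1, 1, 1]]), 10), 1)]

(match, label), score = points[0]
print(match.todense())  # [[1 1 1 1 1 1 1]]
print(label, score)     # 10 1
```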
Main Interface
Most of the parameters are supplied at class init time:
LSH( hash_size,
input_dim,
num_hashtables=1,
storage_config=None,
matrices_filename=None,
overwrite=False)
参数:
hash_size:
The length of the resulting binary hash. This controls how many "buckets"
there will be for items to be sorted into.
input_dim:
The dimension of the input vector. This needs to be the same as the input
points.
num_hashtables = 1:
(optional) The number of hash tables used. More hashtables increases the
probability of hash collisions, making it more likely that similar items
will be found for a query item.
storage_config = None:
(optional) A dict representing the storage backend and configuration
options. The following storage backends are supported with the following
configurations:
In-Memory Python Dictionary:
{"dict": None} # Takes no options
Redis:
{"redis": {"host": "127.0.0.1", "port": 6379, "db": 0}}
LevelDB:
{"leveldb":{"db": "ldb"}}
Where "ldb" specifies the directory to store the LevelDB database.
(In this case it will be `./ldb/`)
Berkeley DB:
{"berkeleydb":{"filename": "./db"}}
Where "filename" is the location of the database file.
matrices_filename = None:
(optional) Specify the path to the .npz file random matrices are stored
or to be stored if the file does not exist yet. If you change the input
dimensions or the number of hashtables, you'll need to set the following
option, overwrite, to True, or delete this file.
overwrite = False:
(optional) Whether to overwrite the matrices file if it already exists.
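To give intuition for how hash_size maps points into 2**hash_size buckets, here is a minimal sketch of the random-projection hashing idea commonly used by LSH libraries. The plane shapes and bucket-key format here are illustrative assumptions, not SparseLSH's exact internals:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
hash_size, input_dim = 4, 7

# One random hyperplane per output bit
planes = rng.normal(size=(hash_size, input_dim))

def bucket_key(x):
    # x is a 1 x input_dim CSR matrix; the sign of each projection yields one bit
    projections = np.asarray(x.dot(planes.T)).ravel()
    return "".join("1" if p > 0 else "0" for p in projections)

key = bucket_key(csr_matrix([[3, 0, 0, 0, 0, 0, -1]]))
print(key)  # a 4-character binary key, one of 2**4 = 16 possible buckets
```

Points whose keys collide land in the same bucket, which is why a longer hash means more (and therefore smaller) buckets.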
Index (add a point to the hash table):
To index a data point for a given LSH instance:
lsh.index(input_point, extra_data=None)
参数:
input_point:
The input data point is an array or tuple of numbers of input_dim.
extra_data = None:
(optional) Extra data to be added along with the input_point.
This can be used to hold values like class labels, URIs, titles, etc.
This function returns nothing.
Query (search for similar points):
To query a data point against a given LSH instance:
lsh.query(query_point, num_results=None, distance_func="euclidean")
参数:
query_point:
The query data point is a sparse CSR matrix.
num_results = None:
(optional) Integer, specifies the max amount of results to be
returned. If not specified all candidates will be returned as a
list in ranked order.
NOTE: You do not save processing by limiting the results. Currently,
a similarity ranking and sort is performed on all items in the hashtable
regardless of this parameter.
distance_func = "euclidean":
(optional) Distance function to use to rank the candidates. By default
the euclidean distance function will be used.
Returns a list of tuples, each containing the original input point (a tuple of (CSR matrix, extra_data), or just the CSR matrix if no extra data was provided) and its similarity score.
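The default euclidean ranking can be sketched with plain scipy. This is the standard definition of euclidean distance, given here for illustration rather than as SparseLSH's exact code:

```python
import numpy as np
from scipy.sparse import csr_matrix

def euclidean(a, b):
    # Standard euclidean distance between two 1 x n sparse rows
    diff = (a - b).toarray()
    return float(np.sqrt((diff ** 2).sum()))

q = csr_matrix([[1, 1, 1, 1, 1, 1, 0]])
c = csr_matrix([[1, 1, 1, 1, 1, 1, 1]])
print(euclidean(q, c))  # 1.0 -- the vectors differ in only the last component
```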