SparseLSH
A locality-sensitive hashing library with an emphasis on large, highly-dimensional (sparse) datasets.
Features
- Fast and memory-efficient calculations using sparse matrices.
- Built-in support for key-value storage backends: pure-Python, Redis (memory-bound), LevelDB, BerkeleyDB.
- Multiple hash index support (based on Kay Zhu's lshash).
- Built-in support for common distance/objective functions for ranking outputs.
Details
SparseLSH is based on a fork of Kay Zhu's lshash, and is suitable for datasets that won't fit in main memory or that are highly dimensional. Using sparse matrices allows for speedups of easily over an order of magnitude compared to using dense, list-based or numpy-array-based vector math. Sparse matrices also make it possible to deal with these datasets purely in memory, using Python dicts or Redis. When that isn't appropriate, a disk-based key-value store can be used instead: LevelDB and BerkeleyDB are supported. Serialization is done using cPickle (for raw C speedups), falling back to pure-Python pickle if it isn't available.
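As a quick illustration of why sparse storage matters (plain scipy, not SparseLSH-specific code), compare the memory footprint of a dense array versus a CSR representation of a mostly-zero vector:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 1 x 100,000-dimensional vector with only five non-zero entries,
# typical of the high-dimensional sparse data SparseLSH targets.
dense = np.zeros((1, 100000))
dense[0, [3, 17, 42, 999, 55000]] = 1.0
sparse = csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(dense_bytes, sparse_bytes)  # the CSR copy is orders of magnitude smaller
```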
BTC donations: 1NejrUgQDm34CFyMHuaff9PNsd8zhd7SgR
Installation
The easy way:
pip install sparselsh
Or you can clone this repo and follow these instructions:
SparseLSH depends on the following libraries:
Optional (for in-memory and disk-based persistence):
To install (minimal install):
python setup.py install
If you want to use the LevelDB or Redis storage backends, you can install their dependencies from optional-requirements.txt:
pip install -r optional-requirements.txt
Quickstart
To create a 4-bit hash for 7-dimensional input data:
from sparselsh import LSH
from scipy.sparse import csr_matrix

X = csr_matrix([
    [3, 0, 0, 0, 0, 0, -1],
    [0, 1, 0, 0, 0, 0,  1],
    [1, 1, 1, 1, 1, 1,  1]])

# One class number for each input point
y = [0, 3, 10]

X_sim = csr_matrix([[1, 1, 1, 1, 1, 1, 0]])

lsh = LSH(4,
          X.shape[1],
          num_hashtables=1,
          storage_config={"dict": None})

for ix in range(X.shape[0]):
    x = X.getrow(ix)
    c = y[ix]
    lsh.index(x, extra_data=c)

# Find the point in X nearest to X_sim
points = lsh.query(X_sim, num_results=1)
The query will return a list of tuples, each containing the matrix-like matched point and its similarity score. In this case, a lower score is more similar:
[((<1x7 sparse matrix of type '<type 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>, 10), 1)]
We can look at the most similar match by accessing the sparse array and calling its todense function:
In [11]: print points[0][0][0].todense()
[[1 1 1 1 1 1 1]]
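Since each result is a ((point, extra_data), score) tuple, it can also be unpacked directly. The snippet below builds a stand-in result with the same shape as the output shown above (it does not call SparseLSH itself):

```python
from scipy.sparse import csr_matrix

# A stand-in for lsh.query output: a list of ((point, extra_data), score) tuples
points = [((csr_matrix([[1, 1, 1, 1, 1, 1, 1]]), 10), 1)]

(match, label), score = points[0]
print(match.todense())  # [[1 1 1 1 1 1 1]]
print(label, score)     # 10 1
```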
Main Interface
Most of the parameters are supplied at class init time:
LSH( hash_size,
input_dim,
num_hashtables=1,
storage_config=None,
matrices_filename=None,
overwrite=False)
参数:
hash_size:
The length of the resulting binary hash. This controls how many "buckets"
there will be for items to be sorted into.
input_dim:
The dimension of the input vector. This needs to be the same as the input
points.
num_hashtables = 1:
(optional) The number of hash tables used. More hashtables increases the
probability of hash collisions, making it more likely that similar items
will be found for a query item.
storage_config = None:
(optional) A dict representing the storage backend and configuration
options. The following storage backends are supported with the following
configurations:
In-Memory Python Dictionary:
{"dict": None} # Takes no options
Redis:
{"redis": {"host": "127.0.0.1", "port": 6379, "db": 0}}
LevelDB:
{"leveldb":{"db": "ldb"}}
Where "ldb" specifies the directory to store the LevelDB database.
(In this case it will be `./ldb/`)
Berkeley DB:
{"berkeleydb":{"filename": "./db"}}
Where "filename" is the location of the database file.
matrices_filename = None:
(optional) Specify the path to the .npz file random matrices are stored
or to be stored if the file does not exist yet. If you change the input
dimensions or the number of hashtables, you'll need to set the following
option, overwrite, to True, or delete this file.
overwrite = False:
(optional) Whether to overwrite the matrices file if it already exists.
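To give intuition for how hash_size maps points into 2**hash_size buckets, here is a minimal sketch of the random-projection hashing idea commonly used by LSH libraries. The plane shapes and bucket-key format here are illustrative assumptions, not SparseLSH's exact internals:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
hash_size, input_dim = 4, 7

# One random hyperplane per output bit
planes = rng.normal(size=(hash_size, input_dim))

def bucket_key(x):
    # x is a 1 x input_dim CSR matrix; the sign of each projection yields one bit
    projections = np.asarray(x.dot(planes.T)).ravel()
    return "".join("1" if p > 0 else "0" for p in projections)

key = bucket_key(csr_matrix([[3, 0, 0, 0, 0, 0, -1]]))
print(key)  # a 4-character binary key, one of 2**4 = 16 possible buckets
```

Points whose keys collide land in the same bucket, which is why a longer hash means more (and therefore smaller) buckets.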
Index (add a point to the hash table):
To index a data point for a given LSH instance:
lsh.index(input_point, extra_data=None)
参数:
input_point:
The input data point is an array or tuple of numbers of input_dim.
extra_data = None:
(optional) Extra data to be added along with the input_point.
This can be used to hold values like class labels, URIs, titles, etc.
This function returns nothing.
Query (search for similar points):
To query a data point against a given LSH instance:
lsh.query(query_point, num_results=None, distance_func="euclidean")
参数:
query_point:
The query data point is a sparse CSR matrix.
num_results = None:
(optional) Integer, specifies the max amount of results to be
returned. If not specified all candidates will be returned as a
list in ranked order.
NOTE: You do not save processing by limiting the results. Currently,
a similarity ranking and sort is performed on all items in the hashtable
regardless of this parameter.
distance_func = "euclidean":
(optional) Distance function to use to rank the candidates. By default
the euclidean distance function will be used.
Returns a list of tuples, each containing the original input point (a tuple of (CSR matrix, extra_data), or just the CSR matrix if no extra data was provided) and its similarity score.
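The default euclidean ranking can be sketched with plain scipy. This is the standard definition of euclidean distance, given here for illustration rather than as SparseLSH's exact code:

```python
import numpy as np
from scipy.sparse import csr_matrix

def euclidean(a, b):
    # Standard euclidean distance between two 1 x n sparse rows
    diff = (a - b).toarray()
    return float(np.sqrt((diff ** 2).sum()))

q = csr_matrix([[1, 1, 1, 1, 1, 1, 0]])
c = csr_matrix([[1, 1, 1, 1, 1, 1, 1]])
print(euclidean(q, c))  # 1.0 -- the vectors differ in only the last component
```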