Approximate string matching using LSH

1 answer

Anonymous user

#1 · Posted 2024-05-31 23:26:36

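The best academic resource I've found on this is Chapter 3 of Mining of Massive Datasets, which gives an excellent overview of locality-sensitive hashing (LSH) and minhashing.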

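Very briefly, the idea is to take several strings, vectorize them, and then pass a sliding window over the resulting vectors. If two vectors have the same values at the same window position, they are marked as candidates for a more fine-grained similarity analysis. (In MinHash LSH the vectors are MinHash signatures and the windows are bands of consecutive signature rows.)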
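As a rough illustration of that idea, here is a minimal pure-Python sketch. It is not how datasketch is implemented: the helper names (`shingles`, `minhash_signature`, `candidate_pairs`), the SHA-1-based hash family, and the 32-hash / 16-band split are all assumptions picked for the example, and the "windows" here are fixed, non-overlapping bands.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Return the set of character n-grams in text (assumes len(text) >= n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(items, num_hashes=32):
    """Vectorize a set: keep the minimum hash value under each seeded hash."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        ))
    return signature


def candidate_pairs(signatures, bands=16):
    """Return pairs of keys whose signatures agree on every row of some band."""
    rows = len(next(iter(signatures.values()))) // bands
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            # Signatures with identical values in this band share a bucket
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(key)
        for keys in buckets.values():
            for i in range(len(keys)):
                for j in range(i + 1, len(keys)):
                    pairs.add((keys[i], keys[j]))
    return pairs


texts = {
    "a": "locality sensitive hashing",
    "b": "locality sensitive hashing",   # exact duplicate of "a"
    "c": "zzzzqqqqxxxx",                 # shares no 3-gram with the others
    "d": "locality sensitive hashes",    # near-duplicate of "a"
}
signatures = {key: minhash_signature(shingles(text)) for key, text in texts.items()}
pairs = candidate_pairs(signatures)
print(sorted(pairs))
```

Exact duplicates always land in the same bucket, and strings with no shared 3-grams essentially never do. Whether a near-duplicate such as "d" is flagged is probabilistic: more bands with fewer rows each make a match more likely, at the cost of more false candidates.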

There is a nice implementation in the Python datasketch library (pip install datasketch). Here is an example showing that it can catch fuzzy string similarity:

from datasketch import MinHash, MinHashLSH
from nltk import ngrams

data = [
    'minhash is a probabilistic data structure for estimating the similarity between datasets',
    'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets',
    'weights controls the relative importance between minizing false positive',
    'wfights cfntrols the rflative ifportance befween minizing fflse posftive',
]

# Create a MinHashLSH index tuned for a Jaccard threshold of 0.4
# that accepts MinHash objects built with 128 permutation functions
lsh = MinHashLSH(threshold=0.4, num_perm=128)

# Build one MinHash per string from its character 3-grams
minhashes = {}
for key, text in enumerate(data):
    minhash = MinHash(num_perm=128)
    for gram in ngrams(text, 3):
        minhash.update("".join(gram).encode('utf-8'))
    lsh.insert(key, minhash)
    minhashes[key] = minhash

# Query the index with each string's own MinHash
# (the message says "> 0.5", but the index above was built with threshold=0.4)
for key in range(len(minhashes)):
    result = lsh.query(minhashes[key])
    print("Candidates with Jaccard similarity > 0.5 for input", key, ":", result)

This returns:

Candidates with Jaccard similarity > 0.5 for input 0 : [0, 1]
Candidates with Jaccard similarity > 0.5 for input 1 : [0, 1]
Candidates with Jaccard similarity > 0.5 for input 2 : [2, 3]
Candidates with Jaccard similarity > 0.5 for input 3 : [2, 3]
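The index only returns candidates, so the finer-grained similarity analysis still has to be done separately. One possible way to do it, sketched here with the standard library only, is to compute the exact Jaccard similarity of the character 3-gram sets. The `trigrams` and `jaccard` helpers are hypothetical names for this example, and the `candidates` dict simply copies the output reported above.

```python
def trigrams(text):
    """Return the set of character 3-grams in text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def jaccard(a, b):
    """Exact Jaccard similarity of the 3-gram sets of two strings."""
    set_a, set_b = trigrams(a), trigrams(b)
    return len(set_a & set_b) / len(set_a | set_b)


data = [
    'minhash is a probabilistic data structure for estimating the similarity between datasets',
    'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets',
    'weights controls the relative importance between minizing false positive',
    'wfights cfntrols the rflative ifportance befween minizing fflse posftive',
]

# Candidate lists copied from the LSH output reported above
candidates = {0: [0, 1], 1: [0, 1], 2: [2, 3], 3: [2, 3]}

for query, hits in candidates.items():
    for hit in hits:
        if hit != query:  # skip the trivial self-match
            print(query, hit, round(jaccard(data[query], data[hit]), 3))
```

Pairs whose exact score falls below whatever cutoff the application needs can then be discarded as false positives.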
