Cython Damerau-Levenshtein加速

2 投票

6 回答

1679 浏览

提问于 2025-04-16 15:16

我有一个用cython写的程序，用来计算两个字符串之间的Damerau–Levenshtein距离，参考了这篇维基百科文章。不过现在这个程序运行得太慢了，不符合我的需求。我有大约60万个字符串的列表，需要在这个列表中找出错别字。

如果有人能建议一些算法上的改进，或者一些python/cython的技巧，能让我这个程序运行得更快，我会非常感激。我不太在乎它占用多少内存，只关心计算所需的时间。

根据我对程序的分析，使用大约2000个字符串时，程序在damerauLevenshteinDistance这个函数上花费了80%的总运行时间（30秒中有24秒），我已经没有其他想法来让它更快了。

def damerauLevenshteinDistance(a, b, h):
    """
    a = source sequence
    b = comparing sequence
    h = matrix to store the metrics (currently nested list)
    """
    cdef int inf,lena,lenb,i,j,x,i1,j1,d,db
    alphabet = getAlphabet((a,b))
    lena = len(a)
    lenb = len(b)
    inf = lena + lenb + 1
    da = [0 for x in xrange(0, len(alphabet))]
    for i in xrange(1, lena+1):
        db = 0
        for j in xrange(1, lenb+1):
            i1 = da[alphabet[b[j-1]]]
            j1 = db
            d = 1
            if (a[i-1] == b[j-1]):
                d = 0
                db = j
            h[i+1][j+1] = min(
                h[i][j]+d,
                h[i+1][j]+1,
                h[i][j+1]+1,
                h[i1][j1]+(i-i1-1)+1+(j-j1-1)
            )
        da[alphabet[a[i-1]]] = i
    return h[lena+1][lenb+1]

cdef getAlphabet(words):
    """
    construct an alphabet out of the lists found in the tuple words with a
    sequential identifier for each word
    """
    cdef int i
    alphabet = {}
    i = 0
    for wordList in words:
        for letter in wordList:
            if letter not in alphabet:
                alphabet[letter] = i
                i += 1
    return alphabet

内存占用算法优化性能提升 cython 运行时间分析 damerau-levenshtein 字符串距离错别字检测

6 个回答

看起来你可以把更多的代码设置为静态类型，这样会提高运行速度。

你也可以看看用Cython实现的Levenshtein距离的例子，链接在这里： http://hackmap.blogspot.com/2008/04/levenshtein-in-cython.html

回答于 2025-04-16 由 Python大师

分享举报

我猜，提升你当前代码性能的最大关键在于使用C语言的数组，而不是用列表的列表来表示h矩阵。

回答于 2025-04-16 由 Python大师

分享举报

对于较长的字符串，使用一种不同的算法可以提高性能，这种算法不需要计算整个lena⋅lenb矩阵中的所有值。举个例子，通常不需要计算矩阵左上角的确切成本[lena][0]，因为它代表的是删除字符串a中所有字符的成本。

一个更好的算法是，总是查看到目前为止计算出的最低权重的点，然后从那里向所有方向再走一步。这样，你可能就能在不检查矩阵中所有位置的情况下，直接到达目标位置：

这个算法的实现可以使用优先队列，代码大致如下：

from heapq import heappop, heappush

def distance(a, b):
   pq = [(0,0,0)]
   lena = len(a)
   lenb = len(b)
   while True:
      (wgh, i, j) = heappop(pq)
      if i == lena and j == lenb:
         return wgh
      if i < lena:
         # deleted
         heappush(pq, (wgh+1, i+1, j))
      if j < lenb:
         # inserted
         heappush(pq, (wgh+1, i, j+1))
      if i < lena and j < lenb:
         if a[i] == b[i]:
            # unchanged
            heappush(pq, (wgh, i+1, j+1))
         else:
            # changed
            heappush(pq, (wgh+1, i+1, j+1))
      # ... more possibilities for changes, like your "+(i-i1-1)+1+(j-j1-1)"

这只是一个粗略的实现，还有很多可以改进的地方：

在将新坐标添加到队列时，要检查：
- 如果这些坐标之前已经处理过，就不要再添加了
- 如果这些坐标当前已经在队列中，只保留权重更好的那一个
使用用C语言实现的优先队列，而不是heapq模块

回答于 2025-04-16 由 Python大师

分享举报

Cython Damerau-Levenshtein加速

6 个回答

撰写回答