A Cython implementation of the true Damerau-Levenshtein algorithm.
fastDamerauLevenshtein
A Cython implementation of the true Damerau-Levenshtein edit distance, which allows an item to be edited more than once. More information from Wikipedia:
In information theory and computer science, the Damerau-Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a string metric for measuring the edit distance between two sequences. Informally, the Damerau-Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.
The Damerau-Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions).
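This implementation is based on James M. Jensen II's explanation and allows the cost of each operation to be specified.
Requirements
This code requires Python 2.7 or 3.4+ and a C compiler such as GCC.
Installation
fastDamerauLevenshtein is available on PyPI at https://pypi.python.org/pypi/fastDamerauLevenshtein.
Install with pip:
pip install fastDamerauLevenshtein
Install from source: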
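To make the weighted recurrence concrete, here is a minimal pure-Python sketch of the true (unrestricted) Damerau-Levenshtein distance with per-operation costs. It follows the standard textbook formulation and is not the package's Cython code. The function name `damerau_levenshtein` and the parameter names `delete_w`, `insert_w`, `replace_w` and `swap_w` are hypothetical, chosen only for this illustration.

```python
def damerau_levenshtein(a, b, delete_w=1, insert_w=1, replace_w=1, swap_w=1):
    """Weighted true Damerau-Levenshtein distance between two sequences.

    Illustrative sketch only; names and defaults are assumptions,
    not the fastDamerauLevenshtein API.
    """
    la, lb = len(a), len(b)
    inf = float("inf")

    # Table shifted by one row and one column: d[i + 1][j + 1] holds the
    # cost of turning a[:i] into b[:j]. Row 0 and column 0 are sentinels.
    d = [[inf] * (lb + 2) for _ in range(la + 2)]
    for i in range(la + 1):
        d[i + 1][1] = i * delete_w      # delete every item of a[:i]
    for j in range(lb + 1):
        d[1][j + 1] = j * insert_w      # insert every item of b[:j]

    last_row = {}  # item -> last row (1-based) of `a` where it occurred
    for i in range(1, la + 1):
        last_match_col = 0  # last column in this row where a[i-1] matched
        for j in range(1, lb + 1):
            i1 = last_row.get(b[j - 1], 0)
            j1 = last_match_col
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_match_col = j
            else:
                cost = replace_w
            # Transposition: swap a[i1-1] with a[i-1], deleting the items
            # between them in `a` and inserting the items between the
            # matching positions in `b`.
            transposition = (
                d[i1][j1]
                + (i - i1 - 1) * delete_w
                + swap_w
                + (j - j1 - 1) * insert_w
            )
            d[i + 1][j + 1] = min(
                d[i][j] + cost,          # match or replace
                d[i + 1][j] + insert_w,  # insert b[j-1]
                d[i][j + 1] + delete_w,  # delete a[i-1]
                transposition,
            )
        last_row[a[i - 1]] = i
    return d[la + 1][lb + 1]


print(damerau_levenshtein("ca", "abc"))            # 2
print(damerau_levenshtein("ab", "ba", swap_w=5))   # 2 (two replacements beat one costly swap)
```

Because the lookup table is keyed by item, the sketch works on any sequence of hashable items, such as a list of strings.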
python setup.py install
or
pip install .
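Usage
The available method is called damerauLevenshtein. It computes the distance between two sequences of hashable items (strings, lists of strings, etc.). The method accepts the following parameters:
- firstObject
- secondObject
- similarity
  - If this parameter is False, the method returns the total cost of the edits; otherwise it returns a score from 0.0 to 1.0 indicating how similar the two objects are. Defaults to True.
- deleteWeight
  - Cost of a deletion.
- insertWeight
  - Cost of an insertion.
- replaceWeight
  - Cost of a replacement.
- swapWeight
  - Cost of a swap (transposition).
The operation weights must be int values. By default, all of them are 1.
Basic usage: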
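The relationship between the edit cost and the 0.0 to 1.0 score can be illustrated with a small sketch. It assumes the score is one minus the distance divided by the length of the longer input, which matches the expected results in the basic usage examples when all weights are 1. That formula is an assumption: the library's actual normalization, especially with non-default weights, may differ. The helper name `similarity_from_distance` is hypothetical.

```python
def similarity_from_distance(distance, first, second):
    """Map an edit distance to a score in [0.0, 1.0].

    Assumed normalization (unit weights): 1 - distance / max(len).
    This is an illustration, not the library's documented formula.
    """
    longest = max(len(first), len(second))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - distance / longest


# One insertion turns 'car' into 'cars'; the longer input has 4 items.
print(similarity_from_distance(1, "car", "cars"))          # 0.75
# One deletion turns ['ab', 'bc'] into ['ab']; the longer input has 2 items.
print(similarity_from_distance(1, ["ab", "bc"], ["ab"]))   # 0.5
```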
from fastDamerauLevenshtein import damerauLevenshtein
damerauLevenshtein('ca', 'abc', similarity=False)  # expected result: 2.0
damerauLevenshtein('car', 'cars', similarity=True)  # expected result: 0.75
damerauLevenshtein(['ab', 'bc'], ['ab'], similarity=False)  # expected result: 1.0
damerauLevenshtein(['ab', 'bc'], ['ab'], similarity=True)  # expected result: 0.5
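Benchmark
Other Python Damerau-Levenshtein and OSA (optimal string alignment) implementations:
- pyxDamerauLevenshtein (restricted edit distance, no custom weights)
- jellyfish (true Damerau-Levenshtein, but no custom weights)
- editdistance (restricted edit distance, no custom weights)
- textdistance (true Damerau-Levenshtein, but no custom weights)
Timings in seconds for 100,000 calls, measured with Python 3.7 on an Intel i5 6500: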
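For readers unfamiliar with the distinction, the restricted edit distance (OSA) never edits a substring more than once, so it can report a larger value than the true Damerau-Levenshtein distance. The sketch below is a textbook OSA implementation with unit costs, written only for illustration. The name `osa_distance` is hypothetical and the code is not taken from any of the listed libraries.

```python
def osa_distance(a, b):
    """Optimal string alignment (restricted edit) distance, unit costs."""
    la, lb = len(a), len(b)
    # d[i][j] is the cost of turning a[:i] into b[:j].
    d = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        d[i][0] = i
    for j in range(lb + 1):
        d[0][j] = j
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Only directly adjacent items may be transposed, and the
            # transposed pair cannot be edited again afterwards.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[la][lb]


print(osa_distance("ca", "abc"))  # 3
```

For the same pair, the basic usage example lists a true Damerau-Levenshtein distance of 2.0: transpose 'ca' to 'ac', then insert 'b' between the two characters, which OSA forbids.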
>>> import timeit
>>> #fastDamerauLevenshtein:
... timeit.timeit(setup="import fastDamerauLevenshtein; text1='afwafghfdowbihgp'; text2='goagumkphfwifawpte'", stmt="fastDamerauLevenshtein.damerauLevenshtein(text1, text2)", number=100000)
0.43
>>> #pyxDamerauLevenshtein:
... timeit.timeit(setup="from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance; text1='afwafghfdowbihgp'; text2='goagumkphfwifawpte'", stmt="normalized_damerau_levenshtein_distance(text1, text2)", number=100000)
2.44
>>> #jellyfish
... timeit.timeit(setup="import jellyfish; text1='afwafghfdowbihgp'; text2='goagumkphfwifawpte'", stmt="jellyfish.damerau_levenshtein_distance(text1, text2)", number=100000)
0.20
>>> #editdistance
... timeit.timeit(setup="import editdistance; text1='afwafghfdowbihgp'; text2='goagumkphfwifawpte'", stmt="editdistance.eval(text1, text2)", number=100000)
0.22
>>> #textdistance
... timeit.timeit(setup="import textdistance; text1='afwafghfdowbihgp'; text2='goagumkphfwifawpte'", stmt="textdistance.damerau_levenshtein.distance(text1, text2)", number=100000)
0.70
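The measurements above could be repeated on other hardware with a small helper like the following sketch. The helper `time_call` and the stand-in function `mismatch_count` are hypothetical. In a real run the stand-in would be replaced by the library function under test, for example damerauLevenshtein. Note that timing through a lambda adds a little call overhead compared with the string statements used above, so absolute numbers will not match exactly.

```python
import timeit


def time_call(func, *args, number=100000):
    """Return the total time in seconds for `number` calls of func(*args)."""
    return timeit.timeit(lambda: func(*args), number=number)


def mismatch_count(a, b):
    """Hypothetical stand-in workload: count positions where a and b differ."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


text1 = "afwafghfdowbihgp"
text2 = "goagumkphfwifawpte"
elapsed = time_call(mismatch_count, text1, text2, number=1000)
print("{:.4f} s for 1000 calls".format(elapsed))
```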
License
It is released under the MIT license.
Copyright (c) 2019 Robert Grigoroiu
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.