How does the difflib.get_close_matches() function in Python work?

Posted 2024-06-08 15:53:17


Here are two arrays:

import difflib
import scipy
import numpy

a1=numpy.array(['198.129.254.73','134.55.221.58','134.55.219.121','134.55.41.41','198.124.252.101'], dtype='|S15')
b1=numpy.array(['198.124.252.102','134.55.41.41','134.55.219.121','134.55.219.137','134.55.220.45', '198.124.252.130'],dtype='|S15')

difflib.get_close_matches(a1[-1],b1,2)

Output:

['198.124.252.130', '198.124.252.102']

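Shouldn't '198.124.252.102' be the closest match to '198.124.252.101'?

I went through the documentation, which mentions some float-valued weights, but it says nothing about the algorithm being used.

I need to find out whether the absolute difference between the last octets of two addresses is 1 (provided the first three octets are the same).

So I first find the closest string and then check whether that closest string satisfies the condition above.

Is there any other function or approach that achieves this? Also, how does get_close_matches() actually behave?

ipaddr doesn't seem to offer this kind of operation on IPs.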


2 answers

Well, there is a passage in the docs that explains your issue:

This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.

To get the result you expect, you could use the Levenshtein distance.
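As a rough illustration of the Levenshtein suggestion, here is a minimal pure-Python sketch of the classic dynamic-programming edit distance. The function name `levenshtein` is my own for illustration; it is not part of difflib or the standard library.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (illustrative helper)."""
    # previous[j] is the distance between the already processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

word = '198.124.252.101'
candidates = ['198.124.252.102', '134.55.41.41', '134.55.219.121',
              '134.55.219.137', '134.55.220.45', '198.124.252.130']
# Rank the candidates by edit distance to the word (smallest first).
ranked = sorted(candidates, key=lambda c: levenshtein(word, c))
print(ranked[0])  # 198.124.252.102
```

With this measure '198.124.252.102' is at distance 1 and '198.124.252.130' at distance 2, which matches the intuition in the question.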

But for comparing IPs, I would suggest using integer comparison:

>>> parts = [int(s) for s in '198.124.252.130'.split('.')]
>>> parts2 = [int(s) for s in '198.124.252.101'.split('.')]
>>> from operator import sub
>>> diff = sum(d * 10**(3-pos) for pos,d in enumerate(map(sub, parts, parts2)))
>>> diff
29

You can build a comparison function in the same style:

from functools import cmp_to_key, partial
from operator import sub

def compare_ips(base, ip1, ip2):
    # Score each IP by the weighted sum of its per-octet differences from base.
    base = [int(s) for s in base.split('.')]
    parts1 = (int(s) for s in ip1.split('.'))
    parts2 = (int(s) for s in ip2.split('.'))
    test1 = sum(abs(d * 10**(3-pos)) for pos,d in enumerate(map(sub, base, parts1)))
    test2 = sum(abs(d * 10**(3-pos)) for pos,d in enumerate(map(sub, base, parts2)))
    # cmp() was removed in Python 3; this expression is its usual replacement.
    return (test1 > test2) - (test1 < test2)

base = '198.124.252.101'
test_list = ['198.124.252.102','134.55.41.41','134.55.219.121',
             '134.55.219.137','134.55.220.45', '198.124.252.130']
# sorted() no longer accepts cmp= in Python 3, so wrap the comparator with cmp_to_key.
sorted(test_list, key=cmp_to_key(partial(compare_ips, base)))
# yields:
# ['198.124.252.102', '198.124.252.130', '134.55.219.121', '134.55.219.137', 
#  '134.55.220.45', '134.55.41.41']
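One caveat about the weights used above: 10**(3-pos) is not the positional weight of an IPv4 octet (each octet is a base-256 digit), so a large difference in a lower octet can outweigh a small difference in a higher one. As a variation on the idea, not the method from this answer, here is a sketch that treats each address as a 32-bit integer using the standard `ipaddress` module (Python 3.3+). The helper name `numeric_distance` is made up for illustration.

```python
import ipaddress

def numeric_distance(base, ip):
    """Absolute difference between two IPv4 addresses viewed as 32-bit integers."""
    return abs(int(ipaddress.IPv4Address(base)) - int(ipaddress.IPv4Address(ip)))

base = '198.124.252.101'
test_list = ['198.124.252.102', '134.55.41.41', '134.55.219.121',
             '134.55.219.137', '134.55.220.45', '198.124.252.130']
# Sort by numeric closeness to base; IPv4Address() also rejects malformed input.
by_distance = sorted(test_list, key=lambda ip: numeric_distance(base, ip))
print(by_distance[:2])  # ['198.124.252.102', '198.124.252.130']
```

The two nearest addresses come out the same as before, but the ordering of the more distant ones can differ from the 10**(3-pos) weighting.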

Some hints from difflib:

SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching". The basic idea is to find the longest contiguous matching subsequence that contains no "junk" elements (R-O doesn't address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people.
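To see how that description plays out on the strings from the question, here is a small sketch that prints the matching blocks and the similarity ratio for the two top candidates. It uses plain str values instead of the NumPy byte strings from the question, which is my own simplification.

```python
import difflib

word = '198.124.252.101'
candidates = ['198.124.252.102', '198.124.252.130']

ratios = {}
for candidate in candidates:
    # ratio() is defined as 2 * M / T, where M is the number of matched
    # characters and T is the total length of both strings.
    matcher = difflib.SequenceMatcher(None, candidate, word)
    blocks = [candidate[m.a:m.a + m.size]
              for m in matcher.get_matching_blocks() if m.size]
    ratios[candidate] = matcher.ratio()
    print(candidate, blocks, ratios[candidate])

# Both candidates match 14 of 15 characters, so both ratios are 2 * 14 / 30.
print(ratios[candidates[0]] == ratios[candidates[1]])  # True
```

For '198.124.252.102' the only block is '198.124.252.10'; for '198.124.252.130' the blocks are '198.124.252.1' plus a stray '0'. The scores are therefore tied, and as far as I can tell from CPython's implementation, get_close_matches() ranks (score, string) pairs, so a tie is broken by plain string comparison, which puts '198.124.252.130' first.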

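As for your need to compare IPs with custom logic: you should first validate that each string is a well-formed IP address. After that, writing the comparison with simple integer arithmetic should be an easy task that meets your requirement. No library is needed at all.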
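A minimal sketch of that idea, with no library involved, might look like the following. The function names `parse_ipv4` and `is_adjacent` are hypothetical, and the validation is deliberately simple (four decimal fields, each at most 255).

```python
def parse_ipv4(text):
    """Return the four octets of a dotted-quad string as ints, or raise ValueError."""
    parts = text.split('.')
    if len(parts) != 4 or not all(p.isdigit() and int(p) <= 255 for p in parts):
        raise ValueError('not a valid IPv4 address: %r' % (text,))
    return [int(p) for p in parts]

def is_adjacent(ip1, ip2):
    """True if the first three octets match and the last octets differ by exactly 1."""
    a, b = parse_ipv4(ip1), parse_ipv4(ip2)
    return a[:3] == b[:3] and abs(a[3] - b[3]) == 1

a1 = ['198.129.254.73', '134.55.221.58', '134.55.219.121',
      '134.55.41.41', '198.124.252.101']
b1 = ['198.124.252.102', '134.55.41.41', '134.55.219.121',
      '134.55.219.137', '134.55.220.45', '198.124.252.130']
# Check every pair directly instead of going through fuzzy string matching.
pairs = [(x, y) for x in a1 for y in b1 if is_adjacent(x, y)]
print(pairs)  # [('198.124.252.101', '198.124.252.102')]
```

This checks the stated condition directly, so no "closest string" step is needed first.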
