Filtering a set for matching string permutations

3 answers

User

#1 · Edited 2024-05-21 08:17:14

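Problem category

The problem you are solving is best described as testing for anagram matches.

Solution using sort

The traditional solution is to sort the target string, sort the candidate string, and test for equality.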

>>> def permutations_in_dict(string, words):
        target = sorted(string)
        return sorted(word for word in words if sorted(word) == target)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']

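Solution using multisets

Another approach is to use collections.Counter() for a multiset equality test. This is algorithmically superior to the sort solution (O(n) versus O(n log n)), but it tends to lose in practice unless the strings are large, because of the cost of hashing every character.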

>>> from collections import Counter
>>> def permutations_in_dict(string, words):
        target = Counter(string)
        return sorted(word for word in words if Counter(word) == target)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
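
To illustrate why a multiset rather than a plain set is needed here, a short sketch. The example strings are my own and do not come from the question:

```python
from collections import Counter

# Counter keeps per-character counts, so repeated letters are respected.
print(Counter('aab') == Counter('aba'))   # True: same letters, same counts
print(Counter('aab') == Counter('abb'))   # False: same letters, different counts
print(set('aab') == set('abb'))           # True: a plain set loses the counts
```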

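Solution using a perfect hash

A unique anagram signature, or perfect hash, can be constructed by assigning a prime number to each possible character and multiplying together the primes for the characters in the string.

The commutative property of multiplication guarantees that the hash value is the same for every permutation of a given string. The uniqueness of the hash value is guaranteed by the fundamental theorem of arithmetic (also known as the unique prime factorization theorem).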

>>> from functools import reduce    # reduce() is not a builtin in Python 3
>>> from operator import mul
>>> primes = [2, 3, 5, 7, 11]
>>> primes += [p for p in range(13, 1620) if all(pow(b, p-1, p) == 1 for b in (5, 11))]
>>> anagram_hash = lambda s: reduce(mul, (primes[ord(c)] for c in s))
>>> def permutations_in_dict(string, words):
        target = anagram_hash(string)
        return sorted(word for word in words if anagram_hash(word) == target)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
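
For readers on Python 3, here is a minimal standalone sketch of the same idea. It makes a few assumptions that are not in the code above: the helper first_primes() is a hypothetical name and builds the table by plain trial division instead of the Fermat test, and reduce() is given an initial value of 1 so that the empty string hashes to 1 instead of raising an error. Like the original, it only covers characters with ord(c) below 256.

```python
from functools import reduce
from operator import mul

def first_primes(count):
    """Return the first `count` primes by trial division (hypothetical helper)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        # Only primes up to sqrt(candidate) need to be checked as divisors.
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes

PRIMES = first_primes(256)   # one prime per 8-bit character code

def anagram_hash(s):
    """Multiply the primes assigned to each character of s."""
    return reduce(mul, (PRIMES[ord(c)] for c in s), 1)

print(anagram_hash('act') == anagram_hash('cat'))   # True: same characters
print(anagram_hash('act') == anagram_hash('rat'))   # False: different characters
```

The products grow quickly with string length. Python integers have arbitrary precision, so the result stays exact, but the multiplications get slower as the numbers get bigger.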

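Solution using permutations

When the string is small, it is reasonable to search over the permutations of the target string using itertools.permutations(). Generating the permutations of a string of length n yields n factorial candidates.

The good news is that when n is small and the number of words is large, this approach runs very fast (because set membership testing is O(1)):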

>>> from itertools import permutations
>>> def permutations_in_dict(string, words):
        perms = set(map(''.join, permutations(string)))
        return sorted(word for word in words if word in perms)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']

As the OP surmised, the pure Python search loop can be sped up to C speed by using set.intersection():

>>> def permutations_in_dict(string, words):
        perms = set(map(''.join, permutations(string)))
        return sorted(words & perms)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
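
One caveat: the & operator needs both operands to be sets. A minimal sketch, assuming the words might arrive as a list or some other iterable. The function name permutations_in_words is made up for illustration; set.intersection() is used because it accepts any iterable:

```python
from itertools import permutations

def permutations_in_words(string, words):
    """Like the set-intersection version, but `words` may be any iterable."""
    perms = set(map(''.join, permutations(string)))
    # set.intersection() accepts lists, tuples, generators, etc.
    return sorted(perms.intersection(words))

print(permutations_in_words('act', ['cat', 'rat', 'dog', 'act']))
# ['act', 'cat']
```

Duplicates in the input collapse, since the result comes from a set.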

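Best solution

Which solution is best depends on the length of the search string and the number of words. Timings will show which is best for a particular problem.

Here are comparative timings for the various approaches using two different string sizes: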

Timings with string_size=5 and words_size=1000000
-------------------------------------------------
0.01406    match_sort
0.06827    match_multiset
0.02167    match_perfect_hash
0.00224    match_permutations
0.00013    match_permutations_set

Timings with string_size=20 and words_size=1000000
--------------------------------------------------
2.19771    match_sort
8.38644    match_multiset
4.22723    match_perfect_hash
<takes "forever"> match_permutations
<takes "forever"> match_permutations_set

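The results show that for small strings, the fastest approach searches the permutations of the target string using set intersection.

For larger strings, the fastest approach is the traditional sort-and-compare solution.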
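
The "takes forever" rows follow directly from factorial growth. A quick sketch of the candidate counts; candidate_count is a name I made up for illustration:

```python
from math import factorial

def candidate_count(n):
    """Number of tuples itertools.permutations() yields for a length-n string."""
    return factorial(n)

for n in (3, 5, 10, 20):
    print('%2d -> %d' % (n, candidate_count(n)))
```

At string_size=5 there are only 120 candidates to build, while at string_size=20 there are 2432902008176640000.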

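Hope you found this little algorithmic study as interesting as I did. The takeaways are:

Sets, itertools, and collections make short work of problems like this.
Big-O running times matter (n factorial blows up for large n).
Constant overhead matters (sorting beats multisets because of the hashing overhead).
Discrete mathematics is a treasure trove of ideas.
It is hard to know what is best until you do the analysis and run the timings :-)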

Timing setup

FWIW, here is the test setup I used to run the comparative timings:

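from collections import Counter
from functools import reduce                  # reduce() is not a builtin in Python 3
from itertools import permutations
from string import ascii_letters as letters   # string.letters was removed in Python 3
from random import choice
from operator import mul
from time import time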

def match_sort(string, words):
    target = sorted(string)
    return sorted(word for word in words if sorted(word) == target)

def match_multiset(string, words):
    target = Counter(string)
    return sorted(word for word in words if Counter(word) == target)

primes = [2, 3, 5, 7, 11]
primes += [p for p in range(13, 1620) if all(pow(b, p-1, p) == 1 for b in (5, 11))]
anagram_hash = lambda s: reduce(mul, (primes[ord(c)] for c in s))

def match_perfect_hash(string, words):
    target = anagram_hash(string)
    return sorted(word for word in words if anagram_hash(word) == target)

def match_permutations(string, words):
    perms = set(map(''.join, permutations(string)))
    return sorted(word for word in words if word in perms)

def match_permutations_set(string, words):
    perms = set(map(''.join, permutations(string)))
    return sorted(words & perms)

string_size = 5
words_size = 1000000

population = letters[: string_size+2]
words = set()
for i in range(words_size):
    word = ''.join([choice(population) for i in range(string_size)])
    words.add(word)
string = word                # Arbitrarily use the last word as the target

print('Timings with string_size=%d and words_size=%d' % (string_size, words_size))
for func in (match_sort, match_multiset, match_perfect_hash, match_permutations, match_permutations_set):
    start = time()
    func(string, words)
    end = time()
    print('%-10.5f %s' % (end - start, func.__name__))
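
If single time() measurements turn out to be noisy, one option is the standard timeit module. A minimal sketch, assuming Python 3; best_time is a hypothetical helper that is not part of the setup above, and the tiny word set is only there to keep the example self-contained:

```python
from timeit import repeat

def match_sort(string, words):
    target = sorted(string)
    return sorted(word for word in words if sorted(word) == target)

def best_time(func, string, words, repeats=3):
    """Return the fastest of `repeats` single runs, in seconds."""
    return min(repeat(lambda: func(string, words), number=1, repeat=repeats))

words = {'cat', 'rat', 'dog', 'act'}
print('%-10.5f %s' % (best_time(match_sort, 'act', words), match_sort.__name__))
```

Taking the minimum of several runs is a common way to reduce the influence of other activity on the machine.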

User

#2 · Edited 2024-05-21 08:17:14

Apparently you want the output sorted alphabetically, so this should do it:

return sorted(set(''.join(p) for p in itertools.permutations(string)) & words)

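User

#3 · Edited 2024-05-21 08:17:14

You can simply use collections.Counter() to compare each of the words against the string, without creating all the permutations (whose number explodes with the length of the string):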

from collections import Counter

def permutations_in_dict(string, words):
    c = Counter(string)
    return [w for w in words if c == Counter(w)]

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['cat', 'act']

Note: sets are unordered, so if you need a specific order you may have to sort the result, e.g. return sorted(...)
