URL的最长前缀匹配

11 投票

4 回答

7962 浏览

提问于 2025-04-16 14:25

我需要一些关于可以用来在网址上进行“最长前缀匹配”的标准Python库的信息。我查过两个标准库，分别是http://packages.python.org/PyTrie/#pytrie.StringTrie和'http://pypi.python.org/pypi/trie/0.1.1'，但它们似乎对网址的最长前缀匹配任务没有帮助。

举个例子，如果我的网址集合有这些：1->http://www.google.com/mail，2->http://www.google.com/document，3->http://www.facebook.com，等等。

现在如果我搜索'http://www.google.com/doc'，那么应该返回2；如果搜索'http://www.face'，应该返回3。

我想确认一下是否有任何标准的Python库可以帮助我完成这个任务，还是说我应该自己实现一个Trie树来进行前缀匹配。

我不想要那种正则表达式的解决方案，因为随着网址数量的增加，这种方法不够灵活。

非常感谢。

数据结构字符串匹配算法优化标准库 URL匹配前缀搜索最长前缀匹配 trie树

4 个回答

如果你愿意用更多的内存来换取更快的运行速度，那么SuffixTree可能会对你有帮助。它有一些很不错的算法特性，比如可以在很短的时间内解决最长公共子串的问题。

如果你总是要查找某个前缀，而不是随便的子串，那么在填充SubstringDict()的时候，你可以添加一个独特的前缀：

from SuffixTree import SubstringDict

substr_dict = SubstringDict()
for url in URLS: # urls must be ascii (valid urls are)
    assert '\n' not in url
    substr_dict['\n'+url] = url #NOTE: assume that '\n' can't be in a url

def longest_match(url_prefix, _substr_dict=substr_dict):
    matches = _substr_dict['\n'+url_prefix]
    return max(matches, key=len) if matches else ''

这种使用SuffixTree的方法看起来不是最优的，但根据我尝试的数据，它的速度比@StephenPaulger的解决方案（基于.startswith()）快20到150倍（不算SubstringDict()的构建时间），这可能已经足够好了。

要安装SuffixTree，你可以运行：

pip install SuffixTree -f https://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees

回答于 2025-04-16 由 Python大师

分享举报

性能比较

`suffixtree` vs. `pytrie` vs. `trie` vs. `datrie` vs. `startswith` 函数

设置

记录的时间是进行1000次搜索的3次重复中最短的时间。构建trie的时间也算在内，并在所有搜索中分摊。搜索是在从1到1000000个主机名的集合中进行的。

有三种类型的搜索字符串：

non_existent_key - 这个字符串没有匹配项
rare_key - 每百万中大约有20个匹配
frequent_key - 出现次数与集合大小相当

结果

对于一百万个网址的最大内存消耗：

| function    | memory, | ratio |
|             |     GiB |       |
|-------------+---------+-------|
| suffix_tree |   0.853 |   1.0 |
| pytrie      |   3.383 |   4.0 |
| trie        |   3.803 |   4.5 |
| datrie      |   0.194 |   0.2 |
| startswith  |   0.069 |   0.1 |
#+TBLFM: $3=$2/@3$2;%.1f

要重现这些结果，可以运行trie基准测试代码。

稀有键/不存在的键情况

如果网址数量少于10000，则datrie是最快的；当数量大于10000时，suffixtree更快，而startswith的平均速度明显较慢。

rare_key

坐标轴：
- 纵轴（时间）大约是1秒（2的20次方微秒）
- 横轴显示每种情况下的网址总数：N= 1, 10, 100, 1000, 10000, 100000, 和 1000000（百万）。

nonexistent_key

频繁键

在N达到100000时，datrie是最快的（对于一百万个网址，时间主要由trie构建时间决定）。

找到最长匹配项所花的时间最多。因此，所有函数的表现都和预期相似。

frequent_key

startswith的时间性能与键的类型无关。

trie和pytrie的表现相似。

不考虑trie构建时间的性能

datrie - 最快，内存消耗适中
startswith在这里更处于劣势，因为其他方法不受构建trie所需时间的影响。
datrie、pytrie、trie - 对于稀有/不存在的键几乎是O(1)（常数时间）

rare_key_no_trie_build_time nonexistent_key_no_trie_build_time

frequent_key_no_trie_build_time

为了比较，拟合（近似）已知函数的多项式（与图中的相同对数/对数比例）：

| Fitting polynom              | Function          |
|------------------------------+-------------------|
| 0.15  log2(N)   +      1.583 | log2(N)           |
| 0.30  log2(N)   +      3.167 | log2(N)*log2(N)   |
| 0.50  log2(N)   +  1.111e-15 | sqrt(N)           |
| 0.80  log2(N)   +  7.943e-16 | N**0.8            |
| 1.00  log2(N)   +  2.223e-15 | N                 |
| 2.00  log2(N)   +  4.446e-15 | N*N               |

回答于 2025-04-16 由 Python大师

分享举报

这个例子适合处理小的URL列表，但对于大规模的数据来说效果不太好。

def longest_prefix_match(search, urllist):
    matches = [url for url in urllist if url.startswith(search)]
    if matches:
        return max(matches, key=len)
    else:
        raise Exception("Not found")

这里有一个使用trie模块的实现。

import trie


def longest_prefix_match(prefix_trie, search):
    # There may well be a more elegant way to do this without using
    # "hidden" method _getnode.
    try:
        return list(node.value for node in prefix_trie._getnode(search).walk())
    except KeyError:
        return list()

url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = trie.Trie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print "'%s' ->" % search, longest_prefix_match(url_trie, search)

结果：

'http' -> ['http://www.facebook.com', 'http://www.google.com/document', 'http://www.google.com/mail']
'http://www.go' -> ['http://www.google.com/document', 'http://www.google.com/mail']
'http://www.fa' -> ['http://www.facebook.com']
'http://fail' -> []

或者使用PyTrie，它能得到相同的结果，但列表的顺序不同。

from pytrie import StringTrie


url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = StringTrie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print "'%s' ->" % search, url_trie.values(prefix=search)

我开始觉得从内存使用的角度来看，使用基数树/帕特里夏树会更好。下面是基数树的样子：

基数树示例URL

而trie的样子更像是： trie示例URL

回答于 2025-04-16 由 Python大师

分享举报

URL的最长前缀匹配

4 个回答

性能比较

suffixtree vs. pytrie vs. trie vs. datrie vs. startswith 函数

设置

结果

不考虑trie构建时间的性能

撰写回答

`suffixtree` vs. `pytrie` vs. `trie` vs. `datrie` vs. `startswith` 函数