How do I match multiple overlapping regular expression patterns?

3 votes

2 answers

803 views

数据工程师

Asked 2025-04-16 21:35

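Background

I have a string containing mixed MP3 information, and I need to match it against a pattern made up of arbitrary strings and tags. The process goes like this:

The program shows the user a string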

the Beatles_Abbey_Road-SomeWord-1969

The user enters a pattern to help the program parse that string

the %Artist_%Album-SomeWord-%Year

Then I want to display the matches (this is where I need your help)

Found 2 possible matches:
[1] {'Artist': 'Beatles', 'Album':'Abbey_Road', 'Year':1969}
[2] {'Artist': 'Beatles_Abbey', 'Album':'Road', 'Year':1969}

Problem

As an example, suppose the pattern is an artist name followed by a title (the separator is '-').

Example 1:

>>> artist = 'Bob Marley'
>>> title = 'Concrete Jungle'
>>> re.findall(r'(.+)-(.+)', '%s-%s' % (artist,title))
[('Bob Marley', 'Concrete Jungle')]

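So far, so good. But...
I have no control over which separator is used, and there is no guarantee that it will not appear inside a tag, so trickier cases come up:

Example 2: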

>>> artist = 'Bob-Marley'
>>> title = 'Roots-Rock-Reggae'
>>> re.findall(r'(.+)-(.+)', '%s-%s' % (artist,title))
[('Bob-Marley-Roots-Rock', 'Reggae')]

As expected, this does not give me what I need in this case.

How can I generate all possible artist/title combinations?

[('Bob', 'Marley-Roots-Rock-Reggae'),
 ('Bob-Marley', 'Roots-Rock-Reggae'),
 ('Bob-Marley-Roots', 'Rock-Reggae'),
 ('Bob-Marley-Roots-Rock', 'Reggae')]

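Are regular expressions the right tool for this job?

Keep in mind that the number of tags to match and the separators between them are not fixed. They are defined by the user, so any regex used has to be built dynamically.
I have tried greedy versus minimal (lazy) matching and lookahead assertions, without success.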
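
For reference, here is a minimal sketch of what the greedy and lazy variants return on the string from Example 2. Each call gives only one of the two extreme splits, because `re.findall` reports non-overlapping matches and a single match already consumes the whole string. The variable names are just for illustration.

```python
import re

s = "Bob-Marley-Roots-Rock-Reggae"

# Greedy first group: grabs as much as possible, so the split is at the last '-'.
greedy = re.findall(r"(.+)-(.+)", s)

# Lazy first group: grabs as little as possible, so the split is at the first '-'.
lazy = re.findall(r"(.+?)-(.+)", s)

print(greedy)  # [('Bob-Marley-Roots-Rock', 'Reggae')]
print(lazy)    # [('Bob', 'Marley-Roots-Rock-Reggae')]
```

Neither form yields the two splits in the middle.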

Thanks for your help.


2 Answers

How about trying this approach instead of a regular expression?

text = "Bob-Marley-Roots-Rock-Reggae"

def allSplits(text, sep):
    """Return every (left, right) pair obtained by cutting text at one sep."""
    results = []
    # Split on the separator that was passed in, not a hard-coded '-'.
    chunks = text.split(sep)
    for i in range(len(chunks) - 1):
        results.append((
            sep.join(chunks[:i + 1]),
            sep.join(chunks[i + 1:]),
        ))

    return results

print(allSplits(text, '-'))

[('Bob', 'Marley-Roots-Rock-Reggae'),
 ('Bob-Marley', 'Roots-Rock-Reggae'),
 ('Bob-Marley-Roots', 'Rock-Reggae'),
 ('Bob-Marley-Roots-Rock', 'Reggae')]
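
The same idea can be extended to more than two fields. This is only a sketch: the name `all_splits_n` and its `parts` parameter are made up for illustration and are not part of the answer above. It picks `parts - 1` cut positions among the gaps between chunks using `itertools.combinations`.

```python
from itertools import combinations

def all_splits_n(text, sep, parts=2):
    """Yield every way to cut text into `parts` fields at occurrences of sep."""
    chunks = text.split(sep)
    # Choose parts-1 cut positions among the len(chunks)-1 gaps between chunks.
    for cuts in combinations(range(1, len(chunks)), parts - 1):
        bounds = (0,) + cuts + (len(chunks),)
        yield tuple(sep.join(chunks[a:b]) for a, b in zip(bounds, bounds[1:]))

three = list(all_splits_n("a-b-c-d", "-", parts=3))
print(three)  # [('a', 'b', 'c-d'), ('a', 'b-c', 'd'), ('a-b', 'c', 'd')]
```

With `parts=2` it produces the same pairs as the function above. If the text contains two separators in a row, some fields will come out empty or start with a separator, so you may want to filter those.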

Answered 2025-04-16 by Python大师


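This solution appears to work. In addition to the regular expression, you need a list of tuples describing the pattern, with one element per capture group in the regex.

For the Beatles example you mentioned, it would look like this: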

pattern = r"the (.+_.+)-SomeWord-(.+)"
groups = [(("Artist", "Album"), "_"), ("Year", None)]

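Because Artist and Album are separated by only a single delimiter, they are captured together in one group. The first item in the list says that the first capture group is split into Artist and Album, using _ as the separator. The second item says that the second capture group is used directly as Year, because the second element of that tuple is None. You can then call the function like this: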

>>> get_mp3_info(groups, pattern, "the Beatles_Abbey_Road-SomeWord-1969")
[{'Artist': 'Beatles', 'Album': 'Abbey_Road', 'Year': '1969'}, {'Artist': 'Beatles_Abbey', 'Album': 'Road', 'Year': '1969'}]
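
To see what the regex alone contributes before any splitting happens, this small sketch prints the raw capture groups for the Beatles string. The variable names `captured` and `pieces` are only for illustration.

```python
import re

pattern = r"the (.+_.+)-SomeWord-(.+)"
title = "the Beatles_Abbey_Road-SomeWord-1969"

# The regex only isolates the combined Artist/Album text and the year.
captured = re.match(pattern, title).groups()
print(captured)  # ('Beatles_Abbey_Road', '1969')

# Splitting the first group on '_' gives the chunks that get recombined.
pieces = captured[0].split("_")
print(pieces)  # ['Beatles', 'Abbey', 'Road']
```

The ambiguity sits entirely in how those three chunks are distributed between Artist and Album, which is what the function enumerates.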

Here is the code:

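import re
from itertools import combinations

def get_mp3_info(groups, pattern, title):
    # groups[i] describes capture group i+1:
    # either (tag, None) or ((tag, tag, ...), separator).
    match = re.match(pattern, title)
    if not match:
        return []
    result = [{}]
    for i, v in enumerate(groups):
        if v[1] is None:
            # The whole capture group is the value of a single tag.
            for r in result:
                r[v[0]] = match.group(i+1)
        else:
            # Several tags share this capture group: try every way to cut it.
            splits = match.group(i+1).split(v[1])
            before = [d.copy() for d in result]
            for comb in combinations(range(1, len(splits)), len(v[0])-1):
                temp = [d.copy() for d in before]
                # None at both ends turns the cut points into open-ended slices.
                comb = (None,) + comb + (None,)
                for j, split in enumerate(zip(comb, comb[1:])):
                    for t in temp:
                        t[v[0][j]] = v[1].join(splits[split[0]:split[1]])

                # The first combination replaces result; later ones are appended.
                if v[0][0] in result[0]:
                    result.extend(temp)
                else:
                    result = temp
    return result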

And here is another example, with Bob Marley:

>>> import pprint
>>> pprint.pprint(get_mp3_info([(("Artist", "Title"), "-")],
...               r"(.+-.+)", "Bob-Marley-Roots-Rock-Reggae"))
[{'Artist': 'Bob', 'Title': 'Marley-Roots-Rock-Reggae'},
 {'Artist': 'Bob-Marley', 'Title': 'Roots-Rock-Reggae'},
 {'Artist': 'Bob-Marley-Roots', 'Title': 'Rock-Reggae'},
 {'Artist': 'Bob-Marley-Roots-Rock', 'Title': 'Reggae'}]
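
If you would rather not hand-write the regex and the groups list for each user pattern, one alternative is to match the template directly with a small backtracking search. This is a rough sketch, not a drop-in replacement for the function above: `parse_template` and `all_matches` are hypothetical helpers, and it assumes tags look like `%Name` with ASCII letters only and that every tag value is non-empty.

```python
import re

def parse_template(template):
    """Turn 'the %Artist_%Album' into [('lit', 'the '), ('tag', 'Artist'), ...]."""
    tokens = []
    # re.split with one capture group alternates literal text and tag names.
    for i, piece in enumerate(re.split(r"%([A-Za-z]+)", template)):
        if i % 2:
            tokens.append(("tag", piece))
        elif piece:
            tokens.append(("lit", piece))
    return tokens

def all_matches(tokens, text, pos=0, found=None):
    """Yield every dict of tag values under which tokens consume all of text."""
    found = found or {}
    if not tokens:
        if pos == len(text):
            yield dict(found)
        return
    (kind, value), rest = tokens[0], tokens[1:]
    if kind == "lit":
        # A literal must appear exactly at the current position.
        if text.startswith(value, pos):
            yield from all_matches(rest, text, pos + len(value), found)
    else:
        # Try every non-empty length for this tag, shortest first.
        for end in range(pos + 1, len(text) + 1):
            yield from all_matches(rest, text, end, {**found, value: text[pos:end]})

tokens = parse_template("the %Artist_%Album-SomeWord-%Year")
beatles = list(all_matches(tokens, "the Beatles_Abbey_Road-SomeWord-1969"))
print(beatles)
# [{'Artist': 'Beatles', 'Album': 'Abbey_Road', 'Year': '1969'},
#  {'Artist': 'Beatles_Abbey', 'Album': 'Road', 'Year': '1969'}]
```

The search is exponential in the worst case, which should be acceptable for short file names but is worth keeping in mind for long inputs with many tags.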

Answered 2025-04-16 by Python大师
