Finding the last group in a regular expression

数据工程师

Asked 2025-04-16 02:26

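My string consists of three parts separated by underscores:

first part (letters and digits)
middle part (letters, digits and underscores)
last part (letters and digits, optional)

Note: I need to access these parts by name, not by position.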

For example:

String : abc_def
first : abc
middle : def
last : None

String : abc_def_xyz
first : abc
middle : def
last : xyz

String : abc_def_ghi_jkl_xyz
first : abc
middle : def_ghi_jkl
last : xyz

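I can't find a regular expression that works.

So far I have two ideas:

Optional group

(?P<first>[a-z]+)_(?P<middle>\w+)(_(?P<last>[a-z]+))?

But the middle part then matches all the way to the end of the string: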

String : abc_def_ghi_jkl_xyz
first : abc
middle : def_ghi_jkl_xyz
last : None

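Using '|'

(?P<first>[a-z]+)_(?P<middle>\w+)_(?P<last>[a-z]+)|(?P<first>[a-z]+)_(?P<middle>\w+)

This expression is invalid: first and middle are declared twice. I thought I could write an expression that reuses what the first alternative matched:

(?P<first>[a-z]+)_(?P<middle>\w+)_(?P<last>[a-z]+)|(?P=first)_(?P=middle)

This expression is valid, but a string such as abc_def, which has only a first and a middle part, is not matched.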
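Both observations about the '|' attempts can be checked directly. This is a minimal sketch, assuming Python's standard re module; the variable names are illustrative only:

```python
import re

# Attempt with '|': the names 'first' and 'middle' are defined twice,
# so compiling the pattern raises re.error.
duplicate = (r"(?P<first>[a-z]+)_(?P<middle>\w+)_(?P<last>[a-z]+)"
             r"|(?P<first>[a-z]+)_(?P<middle>\w+)")
try:
    re.compile(duplicate)
    compiled_ok = True
except re.error:
    compiled_ok = False
print(compiled_ok)

# Attempt with backreferences: the pattern compiles, but when the second
# branch is tried the groups 'first' and 'middle' have not matched
# anything, so the backreferences to them cannot succeed.
backref = re.compile(r"(?P<first>[a-z]+)_(?P<middle>\w+)_(?P<last>[a-z]+)"
                     r"|(?P=first)_(?P=middle)")
no_match = backref.search("abc_def")
full_match = backref.search("abc_def_xyz")
print(no_match)
print(full_match.groupdict())
```

A backreference only repeats text that a group has already captured in the current match attempt, so it cannot stand in for a second definition of the group.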

Note

These strings are actually part of paths that I need to match. Possible paths are:

/my/path/to/abc_def
/my/path/to/abc_def/
/my/path/to/abc_def/some/other/stuf
/my/path/to/abc_def/some/other/stuf/
/my/path/to/abc_def_ghi_jkl_xyz
/my/path/to/abc_def_ghi_jkl_xyz/
/my/path/to/abc_def_ghi_jkl_xyz/some/other/stuf
/my/path/to/abc_def_ghi_jkl_xyz/some/other/stuf/
...

Is there a way to solve this with a regular expression alone? Post-processing the matched parts is not an option.

Thanks a lot!


5 Answers

Try this regular expression:

^(?P<first>[a-z]+)_(?P<middle>[a-z]+(?:_[a-z]+)*?)(?:_(?P<last>[a-z]+))?$

Here is a test case:

import re

strings = ['abc_def', 'abc_def_xyz', 'abc_def_ghi_jkl_xyz']
pattern = r'^(?P<first>[a-z]+)_(?P<middle>[a-z]+(?:_[a-z]+)*?)(?:_(?P<last>[a-z]+))?$'
for string in strings:
    m = re.match(pattern, string)
    # groupdict() maps each group name to its text (None if the group did not match)
    print(m.groupdict())

Output:

{'first': 'abc', 'middle': 'def', 'last': None}
{'first': 'abc', 'middle': 'def', 'last': 'xyz'}
{'first': 'abc', 'middle': 'def_ghi_jkl', 'last': 'xyz'}
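The pattern in this answer accepts lowercase letters only, while the question describes the parts as letters and digits. If digits are needed, one possible variant widens each character class. This is a sketch under that assumption, not part of the original answer; the class [a-z0-9] and the name pattern_with_digits are hypothetical choices:

```python
import re

# Assumed variant: every [a-z] is widened to [a-z0-9] so digits are accepted.
pattern_with_digits = re.compile(
    r"^(?P<first>[a-z0-9]+)_"
    r"(?P<middle>[a-z0-9]+(?:_[a-z0-9]+)*?)"
    r"(?:_(?P<last>[a-z0-9]+))?$"
)

for text in ["a1_b2", "a1_b2_c3_d4"]:
    print(pattern_with_digits.match(text).groupdict())
```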

Answered 2025-04-16 by Python大师


Use

^(?P<first>[a-z]+)_(?P<middle>\w+?)(_(?P<last>[a-z]+))?$

^ and $ anchor the match to the start and the end of the string.

Making \w+? lazy means it matches as few characters as possible (but at least one).
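The difference between the greedy \w+ from the question and the lazy \w+? can be seen side by side. A minimal sketch, assuming Python's re module; the names greedy and lazy are illustrative:

```python
import re

# Identical patterns except for the quantifier on 'middle'.
greedy = re.compile(r"^(?P<first>[a-z]+)_(?P<middle>\w+)(_(?P<last>[a-z]+))?$")
lazy = re.compile(r"^(?P<first>[a-z]+)_(?P<middle>\w+?)(_(?P<last>[a-z]+))?$")

text = "abc_def_ghi_jkl_xyz"
greedy_groups = greedy.match(text).groupdict()
lazy_groups = lazy.match(text).groupdict()

# Greedy: 'middle' takes everything, so the optional 'last' group stays empty.
print(greedy_groups["middle"], greedy_groups["last"])
# Lazy: 'middle' grows only as far as needed for the rest of the pattern to match.
print(lazy_groups["middle"], lazy_groups["last"])
```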

Edit:

For your updated requirement, where the string is preceded and followed by path components, this works:

^(.*?/)(?P<first>[a-z]+)_(?P<middle>\w+?)(_(?P<last>[a-z]+))?(/.*)?$

Example code (Python 3.1):

import re
paths = ["/my/path/to/abc_def",
         "/my/path/to/abc_def/",
         "/my/path/to/abc_def/some/other/stuf",
         "/my/path/to/abc_def/some/other/stuf/",
         "/my/path/to/abc_def_ghi_jkl_xyz",
         "/my/path/to/abc_def_ghi_jkl_xyz/",
         "/my/path/to/abc_def_ghi_jkl_xyz/some/other/stuf",
         "/my/path/to/abc_def_ghi_jkl_xyz/some/other/stuf/"]

regex = re.compile(r"^(.*?/)(?P<first>[a-z]+)_(?P<middle>\w+?)(_(?P<last>[a-z]+))?(/.*)?$")

for path in paths:
    match = regex.match(path)
    # group(1) is the leading path, group(6) the optional trailing path
    print("{}:\nBefore: {}\nFirst: {}\nMiddle: {}\nLast: {}\nAfter: {}\n".format(
          path, match.group(1), match.group("first"), match.group("middle"),
          match.group("last"), match.group(6)))

Output:

/my/path/to/abc_def:
Before: /my/path/to/
First: abc
Middle: def
Last: None
After: None

/my/path/to/abc_def/:
Before: /my/path/to/
First: abc
Middle: def
Last: None
After: /

/my/path/to/abc_def/some/other/stuf:
Before: /my/path/to/
First: abc
Middle: def
Last: None
After: /some/other/stuf

/my/path/to/abc_def/some/other/stuf/:
Before: /my/path/to/
First: abc
Middle: def
Last: None
After: /some/other/stuf/

/my/path/to/abc_def_ghi_jkl_xyz:
Before: /my/path/to/
First: abc
Middle: def_ghi_jkl
Last: xyz
After: None

/my/path/to/abc_def_ghi_jkl_xyz/:
Before: /my/path/to/
First: abc
Middle: def_ghi_jkl
Last: xyz
After: /

/my/path/to/abc_def_ghi_jkl_xyz/some/other/stuf:
Before: /my/path/to/
First: abc
Middle: def_ghi_jkl
Last: xyz
After: /some/other/stuf

/my/path/to/abc_def_ghi_jkl_xyz/some/other/stuf/:
Before: /my/path/to/
First: abc
Middle: def_ghi_jkl
Last: xyz
After: /some/other/stuf/
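Since the question asks for access by name rather than by position, the leading and trailing path groups could be named as well. The following is a sketch under that assumption; the group names before and after, the constant PATH_RE and the helper split_path are hypothetical and not part of the answer above:

```python
import re

# Same pattern as above, with the two unnamed path groups given the
# hypothetical names 'before' and 'after'; the wrapper around 'last'
# is made non-capturing so that every capturing group has a name.
PATH_RE = re.compile(
    r"^(?P<before>.*?/)"
    r"(?P<first>[a-z]+)_(?P<middle>\w+?)"
    r"(?:_(?P<last>[a-z]+))?"
    r"(?P<after>/.*)?$"
)

def split_path(path):
    """Return a dict of the named parts, or None if the path does not match."""
    match = PATH_RE.match(path)
    return match.groupdict() if match else None

parts = split_path("/my/path/to/abc_def_ghi_jkl_xyz/some/other/stuf")
print(parts["before"], parts["first"], parts["middle"], parts["last"], parts["after"])
```

Returning None for a non-matching path avoids the AttributeError that calling group() on a failed match would raise.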

Answered 2025-04-16 by Python大师


Make the middle part non-greedy and add start-of-string and end-of-string anchors:

^(?P<first>[a-z]+)_(?P<middle>\w+?)(_(?P<last>[a-z]+))?$

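By default, \w+ matches as much as possible, so it swallows the rest of the string. With ? appended, it matches as little as possible instead.

Thanks to Tim Pietzcker for pointing out that the anchors are required.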
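Why the anchors matter with a lazy quantifier can be shown with a short sketch, assuming Python's re module (the names unanchored and anchored are illustrative). Without $, nothing forces the lazy \w+? to grow, so it stops after a single character and the optional last group is skipped:

```python
import re

unanchored = re.compile(r"(?P<first>[a-z]+)_(?P<middle>\w+?)(_(?P<last>[a-z]+))?")
anchored = re.compile(r"^(?P<first>[a-z]+)_(?P<middle>\w+?)(_(?P<last>[a-z]+))?$")

text = "abc_def_xyz"
loose = unanchored.match(text)
strict = anchored.match(text)

# Without anchors the overall match ends as early as possible.
print(loose.group(0), loose.groupdict())
# With anchors the whole string must be consumed, so 'middle' has to expand.
print(strict.group(0), strict.groupdict())
```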

Answered 2025-04-16 by Python大师

