Capturing repeated groups in Python with a regular expression (see example)

Asked 2025-04-18 11:55

I am writing a regular expression in Python to extract the contents of an SSI tag.

The tag I want to parse is:

<!--#include file="/var/www/localhost/index.html" set="one" -->

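I want to split it into the following parts:

The tag function (for example: include, echo or set)
The attribute name, which appears before the = sign
The attribute value, which appears between the " characters

The problem is that I don't know how to capture these repeated groups, because a name/value pair can appear one or more times in a tag. I have spent hours on this.

This is my current regular expression: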

^\<\!\-\-\#([a-z]+?)\s([a-z]*\=\".*\")+? \-\-\>$

It captures include in the first group and file="/var/www/localhost/index.html" set="one" in the second, but what I want is this:

group 1: "include"
group 2: "file"
group 3: "/var/www/localhost/index.html"
group 4 (optional): "set"
group 5 (optional): "one"

(continue for every other name="value" pair)

I am developing my regular expression on this site

5 Answers

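The third-party regex library keeps every capture of a repeated group (the built-in re library keeps only the last one). That means you can solve the problem directly, without an extra loop to parse the groups.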

import regex

string = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'
rgx = regex.compile(
    r'<!--#(?<fun>[a-z]+)(\s+(?<key>[a-z]+)\s*=\s*"(?<val>[^"]*)")+')

match = rgx.match(string)
keys, values = match.captures('key', 'val')
print(match['fun'], *map(' = '.join, zip(keys, values)), sep='\n  ')

This gives you the result you want:

include
  file = /var/www/localhost/index.html
  set = one
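
To see why the built-in re module is not enough on its own, here is a minimal sketch. It assumes a simplified variant of the pattern above, rewritten with re's (?P<name>...) syntax; it is an illustration, not part of the original answer. When a group sits inside a repetition, re retains only the text from the final iteration:

```python
import re

string = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'

# Same shape as the regex-library pattern, but using re's (?P<name>...) syntax.
rgx = re.compile(
    r'<!--#(?P<fun>[a-z]+)(?:\s+(?P<key>[a-z]+)\s*=\s*"(?P<val>[^"]*)")+')

match = rgx.match(string)

# Each repeated group holds only its last capture; "file" is no longer reachable.
last_key = match.group('key')
last_val = match.group('val')
print(match.group('fun'), last_key, last_val)
```

With re, the earlier pair (file) is overwritten, which is why the other answers fall back to a second pass over the attribute text.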

Answered 2025-04-18 by Python大师


Unfortunately, Python's built-in re module does not support recursive regular expressions.
However, you can do this:

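import re

string = '''<!--#include file="/var/www/localhost/index.html" set="one" set2="two" -->'''
# Raw string, so the backslashes reach the regex engine unchanged.
# The closing quote sits outside the "value" group so it is not captured.
regexString = r'''<!--#(?P<tag>\w+)\s(?P<name>\w+)="(?P<value>.*?)"\s(?P<keyVal>.*)\s-->'''
regex = re.compile(regexString)
match = regex.match(string)
tag = match.group('tag')
name = match.group('name')
value = match.group('value')
# Splitting on whitespace assumes the remaining values contain no spaces.
keyVal = match.group('keyVal').split()
for item in keyVal:
    key, val = item.split('=', 1)
    val = val.strip('"')  # drop the surrounding quotes
    # You can now do whatever you want with the key=val pair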

Answered 2025-04-18 by Python大师


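I would not recommend using one complex regular expression to capture every item in a repeated group. I don't know Python well, so I'll explain in Java, which I am familiar with. My suggestion is to extract all of the attributes first and then process each item one at a time, like this: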

import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class AllAttributesInTagWithRegexLoop  {
   public static final void main(String[] ignored)  {
      String input = "<!--#include file=\"/var/www/localhost/index.html\" set=\"one\" -->";

      Matcher m = Pattern.compile(
         "<!--#(include|echo|set) +(.*)-->").matcher(input);

      m.matches();

      String tagFunc = m.group(1);
      String allAttrs = m.group(2);

      System.out.println("Tag function: " + tagFunc);
      System.out.println("All attributes: " + allAttrs);

      m = Pattern.compile("(\\w+)=\"([^\"]+)\"").matcher(allAttrs);
      while(m.find())  {
         System.out.println("name=\"" + m.group(1) + 
            "\", value=\"" + m.group(2) + "\"");
      }
   }
}

Output:

Tag function: include
All attributes: file="/var/www/localhost/index.html" set="one"
name="file", value="/var/www/localhost/index.html"
name="set", value="one"
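
Since the question is about Python, the same two-step idea can be sketched with the standard re module. This is only an illustrative translation of the Java code above; the function name parse_ssi_tag is hypothetical and chosen for this example:

```python
import re

def parse_ssi_tag(tag):
    """Return (function, [(name, value), ...]) for one SSI tag, or None."""
    # Step 1: grab the tag function and the raw attribute text.
    outer = re.fullmatch(r'<!--#(include|echo|set) +(.*)-->', tag)
    if outer is None:
        return None
    func, all_attrs = outer.groups()
    # Step 2: collect every name="value" pair from the attribute text.
    pairs = re.findall(r'(\w+)="([^"]+)"', all_attrs)
    return func, pairs

tag = '<!--#include file="/var/www/localhost/index.html" set="one" -->'
print(parse_ssi_tag(tag))
```

re.findall returns a list of tuples when the pattern has more than one group, so no explicit loop is needed to collect the pairs.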

Here is an answer that may help you: https://stackoverflow.com/a/23062553/2736496

Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference.

Answered 2025-04-18 by Python大师


Using the new Python regex module:

#!/usr/bin/python

import regex

s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'

p = r'''(?x)
    (?>
        \G(?<!^)
      |
        <!-- \# (?<function> [a-z]+ )
    )
    \s+
    (?<key> [a-z]+ ) \s* = \s* " (?<val> [^"]* ) "
'''

matches = regex.finditer(p, s)

for m in matches:
    if m.group("function"):
        print ("function: " + m.group("function"))
    print (" key:   " + m.group("key") + "\n value: " + m.group("val") + "\n")

Using the re module:

#!/usr/bin/python

import re

s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'

p = r'''(?x)
    <!-- \# (?P<function> [a-z]+ )
    \s+
    (?P<params> (?: [a-z]+ \s* = \s* " [^"]* " \s*? )+ )
    -->
'''

matches = re.finditer(p, s)

for m in matches:
    print ("function: " + m.group("function"))
    for param in re.finditer(r'[a-z]+|"([^"]*)"', m.group("params")):
        if param.group(1) is not None:  # also treats an empty "" as a value
            print (" value: " + param.group(1) + "\n")
        else:
            print (" key:   " + param.group())

Answered 2025-04-18 by Python大师


Capture everything that can repeat as one chunk, then parse the items one at a time. This might be a good case for using named groups!

import re

data = """<!--#include file="/var/www/localhost/index.html" set="one" reset="two" -->"""
pat = r'''^<!--#([a-z]+) ([a-z]+)="(.*?)" ((?:[a-z]+?=".+")+?) -->'''

result = re.match(pat, data)
result.groups()
('include', 'file', '/var/www/localhost/index.html', 'set="one" reset="two"')

Then iterate over the contents:

g1, g2, g3, g4 = result.groups()
for keyvalue in g4.split(): # split on whitespace (assumes values contain no spaces)
    key, value = keyvalue.split('=', 1)
    value = value.strip('"') # drop the surrounding quotes
    # do something with them
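
The answer mentions named groups without showing them, so here is a rough sketch of what that could look like. The group names func, attrs, key and value, and the two pattern variables, are assumptions made for this example rather than anything from the original answer:

```python
import re

data = '<!--#include file="/var/www/localhost/index.html" set="one" reset="two" -->'

# Named groups replace the positional g1..g4 unpacking.
tag_pat = re.compile(r'^<!--#(?P<func>[a-z]+)\s+(?P<attrs>.*?)\s*-->$')
attr_pat = re.compile(r'(?P<key>[a-z]+)="(?P<value>[^"]*)"')

tag = tag_pat.match(data)
func = tag.group('func')
# Build a dict from every key="value" pair; the quotes are not captured.
attrs = {m.group('key'): m.group('value')
         for m in attr_pat.finditer(tag.group('attrs'))}
print(func, attrs)
```

Matching pairs with finditer rather than splitting on whitespace also lets a value contain spaces, as long as it contains no double quote.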

Answered 2025-04-18 by Python大师
