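My regex in Python isn't recursing properly

2 votes

3 answers

507 views

Asked 2025-04-15 12:03

I want to capture everything inside a tag plus the lines that follow it, but the match should stop when it reaches the next bracketed tag. What am I doing wrong?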

import re #regex

regex = re.compile(r"""
         ^                    # Must start in a newline first
         \[\b(.*)\b\]         # Get what's enclosed in brackets 
         \n                   # only capture bracket if a newline is next
         (\b(?:.|\s)*(?!\[))  # should read: anyword that doesn't precede a bracket
       """, re.MULTILINE | re.VERBOSE)

haystack = """
[tab1]
this is captured
but this is suppose to be captured too!
@[this should be taken though as this is in the content]

[tab2]
help me
write a better RE
"""
m = regex.findall(haystack)
print(m)

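What I want to get is:
[('tab1', 'this is captured\nbut this is suppose to be captured too!\n@[this should be taken though as this is in the content]\n', '[tab2]', 'help me\nwrite a better RE\n')]

Edit: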

regex = re.compile(r"""
             ^           # Must start in a newline first
             \[(.*?)\]   # Get what's enclosed in brackets 
             \n          # only capture bracket if a newline is next
             ([^\[]*)    # stop reading at opening bracket
        """, re.MULTILINE | re.VERBOSE)

This seems to work, but it also cuts the content off at any bracket that appears inside it.
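A minimal reproduction of that truncation, as a sketch: the pattern is the one from the edit above, while the shortened `haystack` is made up for illustration. The negated class `[^\[]*` rejects every `[` character, so the content group ends at the `@[` line instead of at the next tag.

```python
import re

# Pattern from the edit above: the content group is "anything that is not '['".
regex = re.compile(r"""
    ^           # start of a line
    \[(.*?)\]   # tag name in brackets
    \n          # the tag must be followed by a newline
    ([^\[]*)    # content: stops at ANY '[' character
""", re.MULTILINE | re.VERBOSE)

# Shortened sample text (an assumption, not the original haystack).
haystack = """
[tab1]
this is captured
@[this bracket is part of the content]

[tab2]
help me
"""

matches = regex.findall(haystack)
print(matches)
```

The first tuple's content ends right after the `@`, which is the behaviour described above.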


3 Answers

Does this do what you need?

regex = re.compile(r"""
         ^                      # Must start in a newline first
         \[\b(.*)\b\]           # Get what's enclosed in brackets 
         \n                     # only capture bracket if a newline is next
         ([^[]*)                # content: everything up to the next '['
       """, re.MULTILINE | re.VERBOSE)

This returns a list of tuples, one per match, each with two elements. If you want a single flat tuple instead, you can write:

m = sum(regex.findall(haystack), ())
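For comparison, here is a small sketch of the same flattening done with `itertools.chain.from_iterable`, which avoids building an intermediate tuple on every addition. The one-line pattern and the sample text are simplified assumptions, not the code from the question.

```python
import re
from itertools import chain

# Simplified pattern and sample text, assumed for illustration only.
regex = re.compile(r"^\[(.*?)\]\n([^\[]*)", re.MULTILINE)
haystack = "[tab1]\nfirst\n\n[tab2]\nsecond\n"

pairs = regex.findall(haystack)          # list of (tag, content) tuples

flat_sum = sum(pairs, ())                # repeated tuple concatenation
flat_chain = tuple(chain.from_iterable(pairs))  # same result in one pass

print(flat_sum)
print(flat_chain)
```

Both forms give the same flat tuple; `sum` is fine for a handful of matches, while `chain.from_iterable` scales better on long inputs.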

Answered 2025-04-15 by Python大师


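First of all, why use a regex for parsing at all? As you've seen, you can't track down the source of the problem yourself, because a regex gives you no feedback. Also, nothing in this regex uses recursion.

Make your life simpler: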

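def ini_parse(src):
    """Parse INI-like text into a dict mapping section name to its body."""
    in_block = None   # current section name; None before the first header
    contents = {}
    for line in src.split("\n"):
        if line.startswith('[') and line.endswith(']'):
            # A line that is entirely "[name]" opens a new section.
            in_block = line[1:-1]
            contents[in_block] = ""
        elif in_block is not None:
            contents[in_block] += line + "\n"
        elif line.strip() != "":
            # Non-blank text before any header is an error.
            raise ValueError("content out of block")
    return contents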

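As a bonus, errors surface as exceptions and you can step through the execution in a debugger. You also get a dictionary as the result, which lets you deal with duplicate sections while processing. My result: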

{'tab2': 'help me\nwrite a better RE\n\n',
 'tab1': 'this is captured\nbut this is suppose to be captured too!\n@[this should be taken though as this is in the content]\n\n'}
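As written, `ini_parse` resets a section's body whenever its header appears again. One possible way to keep duplicates is sketched below; the name `ini_parse_multi` and the list-per-section layout are assumptions for illustration, not part of the answer.

```python
from collections import defaultdict


def ini_parse_multi(src):
    """Hypothetical variant of ini_parse: each repeated header adds a new body."""
    contents = defaultdict(list)   # section name -> list of bodies, one per occurrence
    in_block = None
    for line in src.split("\n"):
        if line.startswith("[") and line.endswith("]"):
            in_block = line[1:-1]
            contents[in_block].append("")        # start a fresh body for this occurrence
        elif in_block is not None:
            contents[in_block][-1] += line + "\n"
        elif line.strip():
            raise ValueError("content out of block: %r" % line)
    return dict(contents)


# Made-up sample input in which section "a" appears twice.
sample = "[a]\none\n[b]\ntwo\n[a]\nthree\n"
parsed = ini_parse_multi(sample)
print(parsed)
```

Text before the first header still raises `ValueError`, so malformed input is reported rather than silently dropped.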

Regexes are overused these days...

Answered 2025-04-15 by Python大师


As far as I know, Python's regular expressions (the standard re module) don't support recursion.

Edit: in your case, though, this would do:

regex = re.compile(r"""
         ^           # Must start in a newline first
         \[(.*?)\]   # Get what's enclosed in brackets 
         \n          # only capture bracket if a newline is next
         ([^\[]*)    # stop reading at opening bracket
    """, re.MULTILINE | re.VERBOSE)

Edit 2: Yes, that doesn't work properly. This should:

import re

regex = re.compile(r"""
    (?:^|\n)\[             # tag's opening bracket  
        ([^\]\n]*)         # 1. text between brackets
    \]\n                   # tag's closing bracket
    (.*?)                  # 2. text between the tags
    (?=\n\[[^\]\n]*\]\n|$) # until tag or end of string but don't consume it
    """, re.DOTALL | re.VERBOSE)

haystack = """[tag1]
this is captured [not a tag[
but this is suppose to be captured too!
[another non-tag

[tag2]
help me
write a better RE[[[]
"""

print(regex.findall(haystack))

But I agree with viraptor. Regexes are cool, but you can't use them to check a file for errors. Maybe a hybrid? :P

# Find every line that consists solely of "[tag]".
tag_re = re.compile(r'^\[([^\]\n]*)\]$', re.MULTILINE)
tags = list(tag_re.finditer(haystack))

result = {}
for i, mo in enumerate(tags):
    # Content runs to the start of the next tag, or to the end of the text.
    end = tags[i + 1].start() if i + 1 < len(tags) else len(haystack)
    result[mo.group(1)] = haystack[mo.end():end].strip()

print(result)

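Edit 3: That's because ^ inside a character class such as [^...] means negation, while anywhere else it means the start of the string (or the start of a line with re.MULTILINE). A negated class only excludes single characters; regexes have no equally simple way to say "anything except this string".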
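To make the character-versus-string distinction concrete, here is a rough sketch that contrasts a negated class with a per-line negative lookahead. The lookahead pattern is one common workaround and an assumption of this example, not something the answer proposes; the sample text is made up, and the pattern expects every content line to end with a newline.

```python
import re

# Made-up sample: the first section contains a bracket that is not a tag line.
text = "[tag1]\nkeep [this] bracket\nmore\n[tag2]\nlast\n"

# Negated class: excludes single characters, so content stops at the first '['.
by_class = re.findall(r"^\[([^\]\n]*)\]\n([^\[]*)", text, re.MULTILINE)

# Negative lookahead: consume whole lines as long as the line is not itself a tag.
by_lookahead = re.findall(
    r"^\[([^\]\n]*)\]\n((?:(?!\[[^\]\n]*\]$).*\n)*)",
    text,
    re.MULTILINE,
)

print(by_class)
print(by_lookahead)
```

The lookahead version is harder to read than the negated class, which is arguably the point made above: excluding a whole string takes noticeably more machinery than excluding a character.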

Answered 2025-04-15 by Python大师
