python正则表达式替换编译带标志

0 投票

2 回答

1518 浏览

提问于 2025-04-18 07:36

我想用编译的方式运行一个子进程。我已经把子逻辑搞定了（下面的第一个例子），这个函数可以按预期打印出“，peanut，”基本上，这个子程序可以处理任何长度的强调HTML标签，不管有没有属性，然后把它们替换成“，”不过，我想让第二个版本也能工作，虽然这个版本的代码比较长，但修改起来更简单，因为我只需要添加“tagname”来增加一个新的强调标签，而不是像第一个版本那样需要写“|tagname|/tagname”来表示开闭标签。我知道答案是要用编译的方式，但我搜索了很久也没找到答案。

有效的代码：

def cut_out_emphasis():
    str1 = "<b class='boldclass'>peanut</b>
    str1 = re.sub(r'<(b|\/b|i|\/i|ul|\/ul)[^>]*>', ', ', str1, flags=re.I)
    print str1

无效的代码：

def cut_out_emphasis():
    str1 = "<b class='boldclass'>peanut</b>
    list1 = ["b", "i", "ul"]
    str2 = ""
    for x in list1:
        str2 = '%s|%s|/%s' % (str2, x, x)
    str2 = "r'<(%s)[^>]*>'" % (str2, )
    str1 = re.sub(re.compile(str2, re.IGNORECASE), ', ', str1, flags=re.I)
    print str1

正则表达式代码优化编译子进程 HTML标签标签替换强调标签

2 个回答

首先，所有复杂的正则表达式应该用自由间距模式来写，也就是要注意缩进和多加注释。这样做可以让你更容易地看到和修改需要去掉的标签列表——也就是说，不需要把标签放在一个固定的列表里，直接在正则表达式里加一行就可以了。

import re
re_strip_tags = re.compile(r"""
    # Match open or close tag from list of tag names.
    <                # Tag opening "<" delimiter.
    /?               # Match close tags too (which begin with "</").
    (?:              # Group list of tag names.
      b              # Bold tags.
    | i              # Italic tags.
    | em             # Emphasis tags.
#   | othertag       # Add additional tags here...
    )                # End list of tag names.
    (?:              # Non-capture group for optional attribute(s).
      \s+            # Attributes must be separated by whitespace.
      [\w\-.:]+      # Attribute name is required for attr=value pair.
      (?:            # Non-capture group for optional attribute value.
        \s*=\s*      # Name and value separated by "=" and optional ws.
        (?:          # Non-capture group for attrib value alternatives.
          "[^"]*"    # Double quoted string (Note: may contain "&<>").
        | '[^']*'    # Single quoted string (Note: may contain "&<>").
        | [\w\-.:]+  # Non-quoted attrib value can be A-Z0-9-._:
        )            # End of attribute value
      )?             # Attribute value is optional.
    )*               # Zero or more tag attributes.
    \s* /?           # Optional whitespace and "/" before ">".
    >                # Tag closing ">" delimiter.
    """, re.VERBOSE | re.IGNORECASE)

def cut_out_emphasis(str):
    return re.sub(re_strip_tags, ', ', str)

print (cut_out_emphasis("<b class='boldclass'>peanut</b>"))

在写Python的正则表达式时，为了避免转义字符（反斜杠）带来的问题，最好使用 r"原始字符串" 或 r"""原始多行字符串""" 这种写法。注意，上面的代码中，正则表达式只编译一次，但可以用很多次。（不过要说明的是，这其实并不是一个优势，因为Python内部会缓存编译好的正则表达式。）

还要提到的是，用正则表达式解析HTML一般来说是不太被推荐的。

回答于 2025-04-18 由 Python大师

分享举报

编译过的正则表达式（RE）并不是这样工作的。你应该这样做：

def cut_out_emphasis(str1):
    list1 = ["b", "i", "ul"]
    str2 = ""
    for x in list1:
        str2 = r'%s|%s|\/%s' % (str2, x, x)
    str2 = r'<(%s)[^>]*>' % (str2, )
    re_compiled = re.compile(str2, re.IGNORECASE)
    str1 = re_compiled.sub(', ', str1)
    return str1

不过，编译正则表达式是可选的。如果你多次使用同一个正则表达式，编译会提高性能。在你的情况下，你可以继续使用这个：

def cut_out_emphasis(str1):
    list1 = ["b", "i", "ul"]
    str2 = ""
    for x in list1:
        str2 = r'%s|%s|\/%s' % (str2, x, x)
    str2 = r'<(%s)[^>]*>' % (str2, )
    str1 = re.sub(str2, ', ', str1, flags=re.I)
    return str1

回答于 2025-04-18 由 Python大师

分享举报

python正则表达式替换编译带标志

2 个回答

撰写回答