Extracting elements inside and outside brackets

1 vote

3 answers

1280 views

Asked 2025-04-17 01:14

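I have a string and want to extract the elements inside it (such as xx="yy") as well as the content enclosed by the bracketed tags. Here is an example:

[caption id="get this" align="and this" width="and this" caption="and this"]this too please[/caption]

I tried the following code, but I am still not very familiar with regular expressions.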

re.sub(r'\[caption id="(.*)" align="(.*)" width="(.*)" caption="(.*)"\](.*)\[\/caption\]', "tokens: %1 %2 %3 %4 %5", self.content, re.IGNORECASE)

Thanks a lot!

Tags: regular expressions, string processing, programming tips, data extraction, text parsing, HTML tags

3 Answers

Could you try something like this?

text = '[caption id="get this" align="and this" width="and this" caption="and this"]this too please[/caption]'
# Ruby: scan yields the captured name and value of every xx="yy" pair
text.scan(/([a-z]+)="(.*?)"/i) do |name, value|
  puts "#{name} = #{value}"
end
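
For readers working in Python (the language of the question), a rough equivalent of the snippet above might look like this. It is only a sketch: the names `text` and `pairs` are chosen here for illustration, and it assumes attribute values never contain a double quote.

```python
import re

text = ('[caption id="get this" align="and this" width="and this" '
        'caption="and this"]this too please[/caption]')

# Each match is a (name, value) tuple. The non-greedy (.*?) stops at the
# first closing quote, so this assumes values hold no embedded double quotes.
pairs = re.findall(r'([a-z]+)="(.*?)"', text, flags=re.IGNORECASE)

for name, value in pairs:
    print(f"{name} = {value}")
```

Because `re.findall` returns a list of tuples when the pattern has several groups, `dict(pairs)` would give a name-to-value mapping if that is more convenient.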

Answered 2025-04-17 by Python大师


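You can take advantage of Python's built-in SGML/HTML/XML parsing modules: if it is safe to replace "[]" with "<>", you can make that substitution to produce valid XML and then parse it with the standard library's XML functions: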

from xml.etree import ElementTree as ET

text = '[caption id="get this" align="and this" width="and this" caption="and this"]this too please[/caption]'
# Convert to XML by swapping the square brackets for angle brackets
xml_text = text.translate(str.maketrans('[]', '<>'))
parsed_text = ET.fromstring(xml_text)  # Parse into an Element

# Extracted information
print("Text part:", parsed_text.text)
print("Values:", list(parsed_text.attrib.values()))

On Python 3.7 or later, this prints:

Text part: this too please
Values: ['get this', 'and this', 'and this', 'and this']

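This approach has three advantages: (1) it uses standard modules that many people already know; (2) it shows clearly what you are trying to do; (3) it makes it easy to extract more information and to handle more complex values (including values that contain double quotes), and so on.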
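
To illustrate the third point, here is a small sketch of looking attributes up by name once the text has been parsed. The variable names (`element`, `caption_id`, `attributes`, `body`) are illustrative, and the sketch assumes the input contains no other square brackets and no raw `<`, `>` or `&` characters, since those would break the conversion to XML.

```python
from xml.etree import ElementTree as ET

text = ('[caption id="get this" align="and this" width="and this" '
        'caption="and this"]this too please[/caption]')

# Assumption: '[' and ']' appear only as tag delimiters in the input.
xml_text = text.translate(str.maketrans('[]', '<>'))
element = ET.fromstring(xml_text)

caption_id = element.get('id')     # a single attribute, looked up by name
attributes = dict(element.attrib)  # all attributes as a plain dict
body = element.text                # the text between the opening and closing tags

print(element.tag, caption_id, attributes, body)
```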

Answered 2025-04-17 by Python大师


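You are probably running into trouble because .* is greedy. Try [^"]* instead; [^"] means any character except a double quote. Also, as you mentioned in the comments, the backreference syntax in the replacement string is \\n, not %n. Note too that the fourth positional argument of re.sub is count, so re.IGNORECASE has to be passed as flags=. Try this:

re.sub(r'\[caption id="([^"]*)" align="([^"]*)" width="([^"]*)" caption="([^"]*)"\](.*)\[\/caption\]', "tokens: \\1 \\2 \\3 \\4 \\5", self.content, flags=re.IGNORECASE)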
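
As a quick check of the difference, the sketch below runs both a greedy and a negated-class pattern against the sample string from the question, then applies the substitution. The variable names are illustrative, and a raw string (`r'\1'`) is used for the replacement, which is equivalent to writing `"\\1"`.

```python
import re

text = ('[caption id="get this" align="and this" width="and this" '
        'caption="and this"]this too please[/caption]')

# Greedy: .* runs to the LAST double quote in the string.
greedy = re.search(r'id="(.*)"', text).group(1)
# Negated class: [^"]* stops at the first closing quote.
careful = re.search(r'id="([^"]*)"', text).group(1)

pattern = (r'\[caption id="([^"]*)" align="([^"]*)" width="([^"]*)" '
           r'caption="([^"]*)"\](.*)\[/caption\]')
# flags must be passed by keyword; the fourth positional argument is count.
result = re.sub(pattern, r'tokens: \1 \2 \3 \4 \5', text, flags=re.IGNORECASE)

print(greedy)
print(careful)
print(result)
```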

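Does the content of the caption tag span multiple lines? If so, .* will not match the newline characters. You would need something like [^\x00]* instead, where [^\x00] means any character except the null character.

re.sub(r'\[caption id="([^"]*)" align="([^"]*)" width="([^"]*)" caption="([^"]*)"\]([^\x00]*)\[\/caption\]', "tokens: \\1 \\2 \\3 \\4 \\5", self.content, flags=re.IGNORECASE)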

If your string really might contain null characters, you will need to use .* together with the re.DOTALL flag instead.
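
A minimal sketch of the re.DOTALL variant follows. The multi-line sample is made up for illustration (the question's example with a newline added to the body), and the non-greedy (.*?) is an extra precaution, not part of the answer above, so that the match stops at the first closing tag.

```python
import re

# Hypothetical input: the question's sample with a line break in the body.
text = ('[caption id="get this" align="and this" width="and this" '
        'caption="and this"]this too\nplease[/caption]')

pattern = (r'\[caption id="([^"]*)" align="([^"]*)" width="([^"]*)" '
           r'caption="([^"]*)"\](.*?)\[/caption\]')

# Without DOTALL, '.' does not match '\n', so the pattern fails to match.
without_dotall = re.search(pattern, text, flags=re.IGNORECASE)

# With DOTALL, '.' matches any character, including newlines.
with_dotall = re.search(pattern, text, flags=re.IGNORECASE | re.DOTALL)
body = with_dotall.group(5)

print(without_dotall)
print(repr(body))
```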

Answered 2025-04-17 by Python大师
