Python regular expression parsing

Asked 2025-04-15 11:21

I have an array of strings in Python, and each string looks like this:

<r n="Foo Bar" t="5" s="10" l="25"/>

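After a lot of searching, the best approach I found was to adapt a regular expression written for HTML links to my needs.

But since I don't know much about regular expressions, I haven't gotten it to work. This is my attempt so far: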

string = '<r n="Foo Bar" t="5" s="10" l="25"/>'
print re.split("<r\s+n=(?:\"(^\"]+)\").*?/>", string)

What is the best way to extract the values of n, t, s, and l from this string?

2 Answers

<r n="Foo Bar" t="5" s="10" l="25"/>

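This source looks like XML, so the "best way" would be to use an XML parsing module. If it is not strictly XML, BeautifulSoup (specifically the BeautifulSoup.BeautifulStoneSoup class) may be a better fit, since it copes well with markup that is only XML-like and possibly not well-formed.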

>>> from BeautifulSoup import BeautifulStoneSoup
>>> soup = BeautifulStoneSoup("""<r n="Foo Bar" t="5" s="10" l="25"/>""")

# grab the "r" element (you could also use soup.findAll("r") if there are multiple)
>>> soup.find("r")
<r n="Foo Bar" t="5" s="10" l="25"></r>

# get a specific attribute
>>> soup.find("r")['n']
u'Foo Bar'
>>> soup.find("r")['t']
u'5'

# Get all attributes, or turn them into a regular dictionary
>>> soup.find("r").attrs
[(u'n', u'Foo Bar'), (u't', u'5'), (u's', u'10'), (u'l', u'25')]
>>> dict(soup.find("r").attrs)
{u's': u'10', u'l': u'25', u't': u'5', u'n': u'Foo Bar'}
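
The session above uses the older BeautifulSoup 3 API under Python 2. If the strings are well-formed XML, the same values can also be read with the standard library alone. Here is a minimal sketch under that assumption; the helper name parse_row and the rows list are illustrative and not part of the original question.

```python
import xml.etree.ElementTree as ET

# Sample input in the shape shown in the question (assumed well-formed XML).
rows = ['<r n="Foo Bar" t="5" s="10" l="25"/>']


def parse_row(row):
    """Return the attributes of a single XML element as a plain dict.

    Hypothetical helper: raises ET.ParseError if the string is not
    well-formed XML.
    """
    element = ET.fromstring(row)
    return dict(element.attrib)


parsed = [parse_row(row) for row in rows]
print(parsed[0]["n"])  # Foo Bar
print(parsed[0]["t"])  # 5
```

All values come back as strings, so convert them with int() where numbers are needed. For input that is not well-formed, a lenient parser such as BeautifulSoup remains the more forgiving choice.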

Answered 2025-04-15 by Python大师

This gets you most of the way there:

>>> import re
>>> string = '<r n="Foo Bar" t="5" s="10" l="25"/>'
>>> print(re.findall(r'(\w+)="(.*?)"', string))
[('n', 'Foo Bar'), ('t', '5'), ('s', '10'), ('l', '25')]

re.split and re.findall complement each other.

Whenever you think "I want everything that looks like X", use re.findall. When you think "I want the data between and around each X", use re.split.
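
To make the contrast concrete, here is a small sketch of both calls side by side. The comma-separated string passed to re.split is a made-up input chosen only for illustration; it does not come from the question.

```python
import re

line = '<r n="Foo Bar" t="5" s="10" l="25"/>'

# findall: "I want everything that looks like name="value"".
# Two capture groups mean each match is returned as a (name, value) tuple.
pairs = re.findall(r'(\w+)="(.*?)"', line)
values = dict(pairs)
print(values["n"])  # Foo Bar

# split: "I want the data between and around each comma".
# The pattern describes the separator, not the data that is kept.
fields = re.split(r'\s*,\s*', 'n, t ,s,l')
print(fields)  # ['n', 't', 's', 'l']
```

Note that the findall pattern assumes double-quoted values with no embedded double quotes; it is a quick extraction, not a full XML parser.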

Answered 2025-04-15 by Python大师
