Python regex: matching multiple tags


Asked 2025-04-15 12:08

How can I get the matched text from every <p> tag, not just the first one?

import re
htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.match('<p[^>]*size="[0-9]">(.*?)</p>', htmlText).groups())

The result is:

('item1',)

What I need is:

('item1', 'item2', 'item3')


5 Answers

Using Beautiful Soup for this is definitely a good idea. The code is cleaner and easier to read. Once it is installed, fetching all the tags looks roughly like this.

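# BeautifulSoup 3 and urllib2 are Python 2 libraries; on Python 3 the
# counterparts are bs4 (`from bs4 import BeautifulSoup`) and urllib.request.
from BeautifulSoup import BeautifulSoup
import urllib2


def getTags(tag):
    # Download the page and parse it into a tree.
    f = urllib2.urlopen("http://cnn.com")
    soup = BeautifulSoup(f.read())
    # Return every element with the given tag name.
    return soup.findAll(tag)


if __name__ == '__main__':
    tags = getTags('p')
    for tag in tags:
        print(tag.contents)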

This code prints the contents of every <p> tag.

Answered 2025-04-15 by Python大师


For this kind of problem, use a DOM parser rather than regular expressions.

I often see Beautiful Soup recommended for this kind of work in Python.
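If installing a third-party package is not an option, the same idea can be sketched with the standard library. Note the swap: this uses Python 3's `html.parser`, which is an event-driven parser rather than a DOM builder, and the names `ParagraphTextCollector` and `paragraph_texts` are made up for this illustration. It is a minimal sketch that assumes the <p> tags are not nested inside one another.

```python
from html.parser import HTMLParser


class ParagraphTextCollector(HTMLParser):
    """Collects the text found inside each <p>...</p> element."""

    def __init__(self):
        super().__init__()
        self.items = []       # one string per completed <p> element
        self._buffer = None   # list of text chunks while inside a <p>, else None

    def handle_starttag(self, tag, attrs):
        # Attribute order, spacing and values do not matter here.
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.items.append("".join(self._buffer))
            self._buffer = None

    def handle_data(self, data):
        # Text inside nested inline tags is collected as well.
        if self._buffer is not None:
            self._buffer.append(data)


def paragraph_texts(html_text):
    """Hypothetical helper: return the text of every <p> in html_text."""
    parser = ParagraphTextCollector()
    parser.feed(html_text)
    parser.close()
    return parser.items


if __name__ == "__main__":
    html_text = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
    print(paragraph_texts(html_text))  # ['item1', 'item2', 'item3']
```

Because the parser tokenizes the tag before the handler sees it, the result does not depend on where `size` appears among the attributes or on what value it holds.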

Answered 2025-04-15 by Python大师


The regex answer is extremely fragile. Here is the proof (along with a working BeautifulSoup example).

from BeautifulSoup import BeautifulSoup

# Here's your HTML
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'

# Here's some simple HTML that breaks your accepted 
# answer, but doesn't break BeautifulSoup.
# For each example, the regex will ignore the first <p> tag.
html2 = '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>'
html3 = '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>'
html4 = '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>'

# This BeautifulSoup code works for all the examples.
paragraphs = BeautifulSoup(html).findAll('p')
items = [''.join(p.findAll(text=True)) for p in paragraphs]
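To see the breakage directly, the sketch below runs a regex over the same four strings. The accepted answer is not reproduced in this thread, so this assumes it applied the pattern from the question with `re.findall`; the names `PATTERN`, `SAMPLES` and `regex_items` are made up for this illustration, and the code targets Python 3.

```python
import re

# Assumption: the question's pattern, used with re.findall instead of re.match.
PATTERN = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

SAMPLES = {
    # Original input: size is the last attribute and a single digit.
    "html": '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>',
    # size is no longer the last attribute of the first tag.
    "html2": '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>',
    # A space precedes the closing > of the first tag.
    "html3": '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>',
    # size has two digits in the first tag.
    "html4": '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>',
}


def regex_items(html_text):
    """Return the captured text of every non-overlapping match."""
    return PATTERN.findall(html_text)


if __name__ == "__main__":
    for name, text in SAMPLES.items():
        print(name, regex_items(text))
```

Only `html` yields all three items. For `html2`, `html3` and `html4` the first tag fails to match and the result is `['item2', 'item3']`, with no error to signal that anything was skipped.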

Use BeautifulSoup.

Answered 2025-04-15 by Python大师

