Parsing all item elements and their children from an RSS feed with BeautifulSoup

Asked 2025-04-17 06:48

How can I extract everything inside each item tag of an RSS feed?

Here is a simplified input example:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Test</title>
<item>
  <title>Hello world1</title>
  <comments>Hi there</comments>
  <pubDate>Tue, 21 Nov 2011 20:10:10 +0000</pubDate>
</item>
<item>
  <title>Hello world2</title>
  <comments>Good afternoon</comments>
  <pubDate>Tue, 22 Nov 2011 20:10:10 +0000</pubDate>
</item>
<item>
  <title>Hello world3</title>
  <comments>blue paint</comments>
  <pubDate>Tue, 23 Nov 2011 20:10:10 +0000</pubDate>
</item>
</channel>
</rss>

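I need a Python function that processes this RSS file (I am currently using BeautifulSoup) and loops over each item. On every iteration, I need a variable that holds everything inside that item tag.

This is what the result should look like on the first iteration: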

<title>Hello world1</title>
<comments>Hi there</comments>
<pubDate>Tue, 21 Nov 2011 20:10:10 +0000</pubDate>

This code gives me the first result, but how do I get all the subsequent ones?

html_data = BeautifulSoup(xml)
print html_data.channel.item

Tags: data extraction, beautifulsoup, rss, xml parsing, loop iteration, content storage, item tag

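1 Answer

Using BeautifulSoup 4:

import bs4 as bs
doc = bs.BeautifulSoup(xml, 'xml')  # the 'xml' parser requires lxml
for item in doc.find_all('item'):
    for elt in item:
        # Skip the whitespace strings between tags; print only child elements
        if isinstance(elt, bs.Tag):
            print(elt)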
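
The question asks for a variable that holds everything inside each item, so the loop can be wrapped in a function that collects one string per item. This is only a minimal sketch: the function name `item_contents` is hypothetical, and it assumes both bs4 and lxml are installed.

```python
from bs4 import BeautifulSoup, Tag


def item_contents(xml):
    """Return one string per <item>, holding the markup of its child tags."""
    soup = BeautifulSoup(xml, "xml")  # the "xml" parser requires lxml
    results = []
    for item in soup.find_all("item"):
        # Keep only child tags; skip the whitespace strings between them
        children = [str(child) for child in item.children if isinstance(child, Tag)]
        results.append("\n".join(children))
    return results
```

Calling `item_contents(xml)` on the sample feed should give a list with three strings, one per item, which can then be looped over or indexed as needed.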

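You can also do the same thing with lxml:

import lxml.etree as ET
# Pass bytes: lxml rejects a str that carries an XML encoding declaration
doc = ET.fromstring(xml.encode('utf-8') if isinstance(xml, str) else xml)
for item in doc.xpath('//item'):
    for elt in item.xpath('descendant::*'):
        # encoding='unicode' makes tostring return str instead of bytes
        print(ET.tostring(elt, encoding='unicode'))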
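
If neither third-party library is available, a similar loop can be sketched with the standard library's `xml.etree.ElementTree`. This is a swapped-in alternative, not part of the original answer. The function name `items_as_dicts` is hypothetical, and the sketch assumes each child tag appears at most once per item.

```python
import xml.etree.ElementTree as ET


def items_as_dicts(xml):
    """Map each <item> to a {child tag: text} dict."""
    root = ET.fromstring(xml)
    # iter("item") walks the whole tree and yields every <item> element
    return [
        {child.tag: child.text for child in item}
        for item in root.iter("item")
    ]
```

A dict per item makes individual fields easy to reach, for example `items_as_dicts(xml)[0]["title"]`. If a tag repeats within one item, later values overwrite earlier ones, so a list of `(tag, text)` pairs may be a better fit for such feeds.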

Answered 2025-04-17 by Python大师
