BeautifulSoup nested tags

6 votes

3 answers

12572 views

Asked 2025-04-16 09:24

I'm trying to parse an XML file with BeautifulSoup, but I'm having trouble using the "recursive" argument with findAll().

My XML has a rather odd format, shown below:

<?xml version="1.0"?>
<catalog>
   <book>
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
      <book>true</book>
   </book>
   <book>
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
      <book>false</book>
   </book>
 </catalog>

As you can see, the book tag is repeated inside the book tag, which causes an error when I try something like:

from BeautifulSoup import BeautifulStoneSoup as BSS

catalog = "catalog.xml"


def open_rss():
    f = open(catalog, 'r')
    return f.read()

def rss_parser():
    rss_contents = open_rss()
    soup = BSS(rss_contents)
    items = soup.findAll('book', recursive=False)

    for item in items:
        print item.title.string

rss_parser()

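In my soup.findAll call I added recursive=False, which in theory should make it skip the descendants of each item it finds and move on to the next one.

This doesn't seem to work, though, because I always get the following error: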

  File "catalog.py", line 17, in rss_parser
    print item.title.string
AttributeError: 'NoneType' object has no attribute 'string'

I know I'm probably doing something wrong here, and I'd be very grateful if someone could help me solve it.

Changing the XML structure is not an option, and this code needs to perform well because it may have to parse a large XML file.
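To make the expected behaviour concrete, here is a minimal sketch of the difference between a recursive search and a direct-children-only search. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, and a trimmed copy of the catalog; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the catalog from the question (two records only).
CATALOG_XML = """<catalog>
   <book>
      <title>XML Developer's Guide</title>
      <book>true</book>
   </book>
   <book>
      <title>Midnight Rain</title>
      <book>false</book>
   </book>
</catalog>"""

root = ET.fromstring(CATALOG_XML)

# Recursive search: every <book> at any depth, including the nested ones.
all_books = list(root.iter("book"))

# Non-recursive search: only <book> elements that are direct children of <catalog>.
top_level_books = root.findall("book")

# Only the top-level records carry a <title>, so this lookup never hits None.
titles = [book.findtext("title") for book in top_level_books]
print(len(all_books), len(top_level_books), titles)
```

The recursive search also returns the inner book elements, which have no title child; that is the kind of element on which a title lookup comes back as None.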

3 Answers

-1

BeautifulSoup 3 is slow and no longer maintained, so I'd suggest using lxml instead. :)

>>> from lxml import etree
>>> rss = open('/tmp/catalog.xml')
>>> items = etree.parse(rss).xpath('//book/title/text()')
>>> items
["XML Developer's Guide", 'Midnight Rain']
>>>
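If the file is too large to load comfortably in one go, a streaming parse may be worth considering. The following is a rough sketch using the standard library's xml.etree.ElementTree.iterparse (lxml offers a similar function); the helper name iter_titles and the depth-tracking approach are assumptions made for illustration, not part of the answer above.

```python
import io
import xml.etree.ElementTree as ET

def iter_titles(source):
    """Yield the title of each top-level <book>, one record at a time."""
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        # On "end" the element is complete. <catalog> sits at depth 1,
        # its direct <book> children at depth 2, nested <book> flags at depth 3.
        if depth == 2 and elem.tag == "book":
            yield elem.findtext("title")
            elem.clear()  # release the record's children once it has been read
        depth -= 1

# Small in-memory stand-in for catalog.xml.
SAMPLE = b"""<?xml version="1.0"?>
<catalog>
   <book>
      <title>XML Developer's Guide</title>
      <book>true</book>
   </book>
   <book>
      <title>Midnight Rain</title>
      <book>false</book>
   </book>
</catalog>"""

streamed_titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(streamed_titles)
```

Tracking the depth is what keeps the nested book elements from being mistaken for records; for a real file, pass an open file object or a path instead of the BytesIO stand-in.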

Answered 2025-04-16 by Python大师

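The problem is the nested book tags. BeautifulSoup has a predefined set of tags that are allowed to nest (BeautifulSoup.NESTABLE_TAGS), but it doesn't know that book can be nested, so the parse goes wrong.

The "Customizing the Parser" section of the BeautifulSoup documentation explains what is going on and how you can subclass BeautifulStoneSoup to customize the nestable tags. Here is how we can use it to solve your problem: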

from BeautifulSoup import BeautifulStoneSoup

class BookSoup(BeautifulStoneSoup):
  NESTABLE_TAGS = {
      'book': ['book']
  }

soup = BookSoup(xml) # xml string omitted to keep this short
for book in soup.find('catalog').findAll('book', recursive=False):
  print book.title.string

Running this code produces:

XML Developer's Guide
Midnight Rain
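Once only the direct children are selected, the inner book element can be read like any other field. Below is a small sketch of that idea using the standard library's xml.etree.ElementTree instead of BeautifulSoup; the function name read_records, the "flag" key, and the reading of the inner book as a true/false flag are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def read_records(xml_text):
    """Return one dict per top-level <book>, keeping the nested <book> value."""
    root = ET.fromstring(xml_text)
    records = []
    for book in root.findall("book"):  # direct children of <catalog> only
        records.append({
            "title": book.findtext("title"),
            # Assumption: the nested <book> holds the text "true" or "false".
            "flag": book.findtext("book") == "true",
        })
    return records

# Trimmed-down copy of the catalog from the question.
SAMPLE_XML = """<catalog>
   <book>
      <title>XML Developer's Guide</title>
      <book>true</book>
   </book>
   <book>
      <title>Midnight Rain</title>
      <book>false</book>
   </book>
</catalog>"""

records = read_records(SAMPLE_XML)
for record in records:
    print(record["title"], record["flag"])
```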

Answered 2025-04-16 by Python大师

soup.findAll('catalog', recursive=False) returns a list containing only your top-level "catalog" tag. Since that tag has no "title" child, item.title is None.

You could try soup.findAll("book") or soup.find("catalog").findChildren().

Edit: OK, the problem wasn't what I thought it was. Try this:

BSS.NESTABLE_TAGS["book"] = []
soup = BSS(open("catalog.xml"))
soup.catalog.findChildren(recursive=False)

Answered 2025-04-16 by Python大师
