Find all elements that have a namespaced attribute

0 votes

2 answers

654 views

Asked 2025-04-17 18:06

If I have content like this:

<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p>

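How can I use BeautifulSoup to select the elements that have an attribute in the foo namespace?

For example, I want the second and third <p> elements returned.

html-parsing namespaces beautifulsoup element-selection

2 Answers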

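BeautifulSoup (both version 3 and version 4) does not treat namespace prefixes as anything special. An attribute name that contains a colon-separated namespace prefix is simply handled as an ordinary attribute name.

So, to find the <p> elements that have an attribute in the foo namespace, just iterate over each element's attribute keys and check whether the attribute name starts with the foo prefix: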

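from bs4 import BeautifulSoup

content = '''\
<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p>'''

# html.parser keeps the prefixed name (e.g. 'foo:bar') as a plain attribute key
soup = BeautifulSoup(content, 'html.parser')
for p in soup.find_all('p'):
    for attr in p.attrs.keys():
        # match the 'foo:' prefix, not just any name that starts with 'foo'
        if attr.startswith('foo:'):
            print(p)
            break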

This yields

<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p>
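The same check can be wrapped in a small helper. This is only a sketch: the function name `elements_with_prefixed_attrs` and the extra `foobar` test element are hypothetical, and it assumes Beautiful Soup 4 with the built-in `html.parser`. It tests for the prefix plus the colon so that an unrelated attribute such as `foobar` is not picked up.

```python
from bs4 import BeautifulSoup


def elements_with_prefixed_attrs(soup, prefix):
    """Return every tag with at least one attribute named '<prefix>:...'.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    marker = prefix + ":"
    return [
        tag
        for tag in soup.find_all(True)  # True matches every tag
        if any(name.startswith(marker) for name in tag.attrs)
    ]


content = '''\
<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p>
<p foobar="something">blah</p>'''

soup = BeautifulSoup(content, "html.parser")
matches = elements_with_prefixed_attrs(soup, "foo")
for tag in matches:
    print(tag)
```

Only the two elements with `foo:`-prefixed attributes should be printed; the `foobar` element is skipped because its name lacks the colon.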

With lxml, you can search with XPath, which supports matching attributes by namespace:

import lxml.etree as ET
content = '''\
<root xmlns:foo="bar">
<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p></root>'''

root = ET.XML(content)
for p in root.xpath('p[@foo:*]', namespaces={'foo':'bar'}):
    print(ET.tostring(p, encoding='unicode', with_tail=False))

This also yields

<p xmlns:foo="bar" foo:bar="something">blah</p>
<p xmlns:foo="bar" foo:xxx="something">blah</p>
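If you would rather not write XPath, a rough alternative is to inspect the attribute keys directly. lxml stores namespaced attribute names in the `{namespace-uri}localname` form, so the sketch below filters on the `{bar}` prefix. The variable names `ns_marker` and `matches` are made up for this example.

```python
import lxml.etree as ET

content = '''\
<root xmlns:foo="bar">
<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p></root>'''

root = ET.XML(content)

# Namespaced attribute keys look like '{bar}bar', where 'bar' in braces
# is the namespace URI that the foo prefix is bound to.
ns_marker = "{bar}"
matches = [
    el
    for el in root.iter("p")
    if any(key.startswith(ns_marker) for key in el.attrib.keys())
]

for el in matches:
    print(dict(el.attrib))
```

This matches on the namespace URI rather than on the prefix, so it keeps working if the document binds the same URI to a different prefix.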

Answered 2025-04-17 by Python大师

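From the documentation:

Beautiful Soup provides a special argument called attrs that you can use in certain situations. attrs is a dictionary that acts just like the keyword arguments: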

soup.findAll(id=re.compile("para$"))
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

soup.findAll(attrs={'id' : re.compile("para$")})
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

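Use attrs if you need to put restrictions on attributes whose names are Python reserved words, such as class, for, or import; or on attributes whose names are non-keyword arguments to the Beautiful Soup search methods: name, recursive, limit, text, or attrs itself.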

from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)

xmlSoup.findAll(name="Alice")
# []

xmlSoup.findAll(attrs={"name" : "Alice"})
# [<parent rel="mother" name="Alice"></parent>]
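The quoted documentation example uses the Beautiful Soup 3 API. As a rough Beautiful Soup 4 equivalent (an assumption on my part, using `find_all` and the built-in `html.parser`, with the tags closed explicitly), the same name clash can be reproduced like this:

```python
from bs4 import BeautifulSoup

xml = '<person name="Bob"><parent rel="mother" name="Alice"></parent></person>'
soup = BeautifulSoup(xml, "html.parser")

# The name keyword refers to the tag name, so this looks for <Alice> tags.
by_tag_name = soup.find_all(name="Alice")

# Going through attrs targets the attribute that happens to be called "name".
by_attribute = soup.find_all(attrs={"name": "Alice"})

print(by_tag_name)
print(by_attribute)
```

The first search should come back empty and the second should find the `parent` element, which is the behaviour the documentation excerpt describes.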

So for the example you gave:

soup.findAll(attrs={ "foo" : re.compile(".*") })
# or
soup.findAll(attrs={ re.compile("foo:.*") : re.compile(".*") })
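As far as I can tell, Beautiful Soup compares the keys of attrs as literal attribute names rather than as patterns, so the two calls above may not match foo:bar or foo:xxx. A minimal sketch that moves the pattern into a function filter instead, assuming Beautiful Soup 4 and `html.parser` (the names `foo_attr` and `has_foo_attr` are invented for this example):

```python
import re

from bs4 import BeautifulSoup

content = '''\
<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p>'''

soup = BeautifulSoup(content, "html.parser")

# Pattern applied to attribute NAMES, not to attribute values.
foo_attr = re.compile(r"^foo:")


def has_foo_attr(tag):
    """True when any attribute name on the tag starts with 'foo:'."""
    return any(foo_attr.match(name) for name in tag.attrs)


# find_all calls the function once per tag and keeps the tags
# for which it returns True.
matches = soup.find_all(has_foo_attr)
for tag in matches:
    print(tag)
```

Passing a function keeps the search inside `find_all`, so it can still be combined with arguments such as `limit` or `recursive`.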

Answered 2025-04-17 by Python大师