Extracting the contents of an "extract" tag with BeautifulSoup

3 votes

1 answer

703 views

Asked 2025-04-17 20:24

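I have an XML file that contains a tag named <EXTRACT>, but that name is a keyword in BeautifulSoup. I want to get the contents of this tag. When I write entry.extract.text it raises an error, and when I use entry.extract on its own the whole content gets extracted.

As far as I understand, BeautifulSoup normalizes the case of tag names. If there is a way to work around that, it would also help.

For now, I have solved the problem like this: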

extra = entry.find('extract')
absts.write(str(extra.text))

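But I would like to know whether there is a way to use it like any other tag, as in entry.tagName.

1 Answer

According to the BS source code, tag.tagname calls tag.find("tagname") behind the scenes. Here is what the __getattr__() method of the Tag class looks like: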

def __getattr__(self, tag):
    if len(tag) > 3 and tag.endswith('Tag'):
        # BS3: soup.aTag -> soup.find("a")
        tag_name = tag[:-3]
        warnings.warn(
            '.%sTag is deprecated, use .find("%s") instead.' % (
                tag_name, tag_name))
        return self.find(tag_name)
    # We special case contents to avoid recursion.
    elif not tag.startswith("__") and not tag=="contents":
        return self.find(tag)
    raise AttributeError(
        "'%s' object has no attribute '%s'" % (self.__class__, tag))

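As you can see, it is built entirely on find(). Python only calls __getattr__() when normal attribute lookup fails, and Tag already has a real extract() method (it removes the tag from the tree), so entry.extract returns that method and never reaches find(). That is why entry.extract.text raises an error. In your case, using tag.find("extract") is perfectly fine: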

from bs4 import BeautifulSoup


data = """<test><EXTRACT>extract text</EXTRACT></test>"""
soup = BeautifulSoup(data, 'html.parser')
test = soup.find('test')
print(test.find("extract").text)  # prints 'extract text'
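
To see why attribute access works for most tag names but not for extract, here is a minimal sketch that does not use bs4 at all. ToyTag is a hypothetical stand-in written for illustration, not the library's Tag class; it only mimics the two relevant pieces, a real extract() method and a __getattr__() fallback that delegates to find().

```python
class ToyTag:
    """Hypothetical stand-in for bs4's Tag, for illustration only."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def find(self, name):
        # Return the first direct child with a matching name, else None.
        for child in self.children:
            if child.name == name:
                return child
        return None

    def extract(self):
        # A real method, standing in for Tag.extract() in bs4.
        return self

    def __getattr__(self, name):
        # Python calls this only after normal attribute lookup has failed.
        if name.startswith("__"):
            raise AttributeError(name)
        return self.find(name)


entry = ToyTag("entry", [ToyTag("title"), ToyTag("extract")])

title_via_attr = entry.title      # not a real attribute, so __getattr__ runs find("title")
extract_via_attr = entry.extract  # a real method, so __getattr__ is never called
extract_via_find = entry.find("extract")

print(title_via_attr.name)
print(callable(extract_via_attr))
print(extract_via_find.name)
```

In this sketch entry.title gives back the child tag, entry.extract gives back a bound method, and only the explicit find("extract") call reaches the child named extract.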

Alternatively, you could use test.extractTag.text, but that form is deprecated and I would not recommend it.

Hope this helps.

Answered 2025-04-17 by Python大师
