Don't let BeautifulSoup automatically add html, head and body tags

41 votes

9 answers

15466 views

Asked 2025-04-17 15:33

I'm using BeautifulSoup with html5lib, and it automatically adds html, head and body tags:

BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>

Is there an option to turn this behavior off?

9 Answers

First, let's create a sample soup:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<head></head><body><p>content</p></body>", "lxml")

You can reach the child elements of html and body by writing soup.body.<tag>:

# python3: get body's first child
print(next(soup.body.children))

# if first child's tag is rss
print(soup.body.rss)
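One caveat worth illustrating: the first child of body is not always a tag, because whitespace between tags is kept as a text node. The sketch below is only an illustration; the markup and the variable names are made up for the example, and html.parser is used so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with a newline right after <body>
markup = "<html><body>\n<p>first</p>\n<rss>feed</rss></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# The first child is the whitespace text node, not the <p> tag
first_node = next(soup.body.children)

# find(True, recursive=False) returns the first direct child that is a tag
first_tag = soup.body.find(True, recursive=False)

print(repr(first_node))
print(first_tag)
print(soup.body.rss)
```

If you only care about tags, find(True, recursive=False) or find_all(True, recursive=False) is usually the safer choice than iterating .children directly.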

You can also use the unwrap() method to strip the body, head and html tags.

soup.html.body.unwrap()
if soup.html.find('head', recursive=False) is not None:
    soup.html.head.unwrap()
soup.html.unwrap()
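The same idea can be wrapped in a small helper that tolerates missing wrappers. This is a rough sketch, not the answerer's exact code: strip_wrappers is a hypothetical name, and the input already contains html, head and body so that html.parser (which adds nothing itself) has something to unwrap.

```python
from bs4 import BeautifulSoup


def strip_wrappers(soup):
    """Unwrap <body>, <head> and <html> in place, skipping any that are absent."""
    for name in ("body", "head", "html"):
        tag = soup.find(name)
        if tag is not None:
            # unwrap() removes the tag itself but keeps its children in place
            tag.unwrap()
    return soup


doc = "<html><head></head><body><p>content</p></body></html>"
soup = strip_wrappers(BeautifulSoup(doc, "html.parser"))
print(soup)
```

Note that unwrapping head keeps whatever was inside it, so a title or meta tag would end up next to your content; call decompose() on head instead if you want it gone entirely.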

If you are loading an XML file, running bs4.diagnose.diagnose(data) will point you to the lxml-xml parser, which does not wrap your soup in html and body tags.

>>> BS('<foo>xxx</foo>', 'lxml-xml')
<foo>xxx</foo>

Answered 2025-04-17 by Python大师


This aspect of BeautifulSoup has always annoyed me.

Here is how I deal with it:

# Parse the initial html-formatted string
soup = BeautifulSoup(html, 'lxml')

# Do stuff here

# Extract a string repr of the parse html object, without the <html> or <body> tags
html = "".join([str(x) for x in soup.body.children])

Broken down step by step:

# Iterator object of all tags within the <body> tag (your html before parsing)
soup.body.children

# Turn each element into a string object, rather than a BS4.Tag object
# Note: inclusive of html tags
str(x)

# Get a List of all html nodes as string objects
[str(x) for x in soup.body.children]

# Join all the string objects together to recreate your original html
"".join()

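I still don't love this, but it gets the job done. I run into this every time I use BS4 to filter certain elements or attributes out of an HTML document and then need to return the whole thing as a string rather than as a BS4 parsed object.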

Hopefully the next time I search for this problem, I'll find my own answer here.
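For what it's worth, the join over soup.body.children can probably be shortened with Tag.decode_contents(), which serializes a tag's children without the tag itself. The helper below is a sketch under that assumption; inner_html is a hypothetical name, and the fallback to the soup itself covers parsers such as html.parser that add no body.

```python
from bs4 import BeautifulSoup


def inner_html(markup: str, parser: str = "html.parser") -> str:
    """Return the markup inside <body>, or the whole tree if there is no <body>."""
    soup = BeautifulSoup(markup, parser)
    root = soup.body if soup.body is not None else soup
    # decode_contents() renders the children only, not the enclosing tag
    return root.decode_contents()


with_body = inner_html("<html><head></head><body><h1>FOO</h1><p>bar</p></body></html>")
without_body = inner_html("<h1>FOO</h1>")
print(with_body)
print(without_body)
```

Passing "lxml" or "html5lib" as the parser should give the same string for a bare fragment, since both of those add a body that the helper then looks inside.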

Answered 2025-04-17 by Python大师


In [35]: import bs4 as bs

In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>

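This parses the HTML with Python's built-in HTML parser.

Quoting the docs:

Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.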
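To see the difference on your own machine, something like the following can help. It is only a sketch: render_with is a hypothetical helper, and it relies on bs4 raising FeatureNotFound when lxml or html5lib is not installed, so missing parsers are reported instead of crashing the script.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def render_with(parser: str, markup: str = "<h1>FOO</h1>"):
    """Serialize the parsed tree, or return None if the parser is unavailable."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None


results = {name: render_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, rendered in results.items():
    print(f"{name:12} {rendered if rendered is not None else '(not installed)'}")
```

Only html.parser ships with Python, so that row is the one you can count on seeing; the other two depend on what is installed.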

Alternatively, you can keep the html5lib parser and simply select the element that follows the <body> tag:

In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')

In [62]: soup.body.next_element
Out[62]: <h1>FOO</h1>

Answered 2025-04-17 by Python大师
