What's wrong with this html5lib script?
I'm trying to process a very simple HTML5 document and render it with html5lib.
import html5lib
html = '''<!DOCTYPE html>
<html lang="en">
<head>
<title>Hi</title>
</head>
<body>
<script src="a.js"></script>
<script src="b.js"></script>
</body>
</html>
'''
parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"))
walker = html5lib.treewalkers.getTreeWalker("lxml")
serializer = html5lib.serializer.HTMLSerializer()
document = parser.parse(html)
stream = walker(document)
theHTML = serializer.render(stream)
print(theHTML)
The output looks like this:
<!DOCTYPE html><html lang=en><head>
<title>Hi</title>
</head>
<body>
<script src=a.js></script>
<script src=b.js></script>
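That's right. It just cuts off partway through. Switching the tree builder from lxml to dom makes no difference. Tweaking the HTML does change the output, but the result is still bad.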
1 Answer
So the key seems to be omit_optional_tags=False. Without this setting, the serializer leaves out optional tags, and the last part of the output gets eaten.
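import html5lib

# `html` is the document string from the question.
parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"))
document = parser.parse(html)
walker = html5lib.treewalkers.getTreeWalker("lxml")
stream = walker(document)
# Keep optional tags such as </body> and </html> in the output.
s = html5lib.serializer.HTMLSerializer(omit_optional_tags=False)
output_generator = s.serialize(stream)
# serialize() yields the output piece by piece; print each piece on its own line.
for item in output_generator:
    print(item)
Each serialized piece is printed on its own line: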
<!DOCTYPE html>
<html lang=en>
<head>
<title>
Hi
</title>
</head>
<body>
<script src=a.js>
</script>
<script src=b.js>
</script>
</body>
</html>
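To see the effect of the flag in isolation, here is a minimal sketch that serializes the same tree twice, once with the defaults and once with omit_optional_tags=False. It assumes an html5lib release that exposes the top-level html5lib.parse and html5lib.serialize helpers, and it uses the default etree tree instead of lxml so that lxml does not need to be installed. The names without_optional and with_optional are only for illustration.

```python
import html5lib

html = '''<!DOCTYPE html>
<html lang="en">
<head>
<title>Hi</title>
</head>
<body>
<script src="a.js"></script>
<script src="b.js"></script>
</body>
</html>
'''

# Parse into the default xml.etree tree (assumption: lxml is not needed here).
document = html5lib.parse(html)

# Default settings: the serializer may leave out optional tags.
without_optional = html5lib.serialize(document, tree="etree")

# Same tree, but optional tags are kept in the output.
with_optional = html5lib.serialize(document, tree="etree", omit_optional_tags=False)

print(without_optional)
print(with_optional)
```

The first string should lack the closing </body> and </html>, both of which HTML permits omitting, so a browser still reads it as the same document. The second string keeps them.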