Beautiful Soup 4: removing a comment tag and its contents

Posted 2024-06-01 00:54:20


The page I'm scraping contains the HTML below. How do I remove the comment tag <!-- --> together with its contents using bs4?

<div class="foo">
cat dog sheep goat
<!-- 
<p>NewPP limit report
Preprocessor node count: 478/300000
Post‐expand include size: 4852/2097152 bytes
Template argument size: 870/2097152 bytes
Expensive parser function count: 2/100
ExtLoops count: 6/100
</p>
-->

</div>

3 Answers

From this answer, in case you are looking for a solution for BeautifulSoup version 3 (BS3 Docs - Comment):

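import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2 only

soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
# Find the comment node by its text so we can recover the Comment class
comment = soup.find(text=re.compile("nice"))
Comment = comment.__class__
# Remove every comment node from the tree
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
print soup.prettify()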

You can use extract() (the solution is based on this answer):

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.

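from bs4 import BeautifulSoup, Comment

data = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

# Name the parser explicitly so results do not depend on what is installed
soup = BeautifulSoup(data, 'html.parser')

div = soup.find('div', class_='foo')
# Select only the Comment nodes inside the div and detach each one
for element in div(string=lambda text: isinstance(text, Comment)):
    element.extract()

print(soup.prettify())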

As a result, your div no longer contains the comment:

<div class="foo">
    cat dog sheep goat
</div>
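
The quoted note says that extract() returns whatever it removed. A minimal sketch of that behaviour, assuming bs4 with the built-in html.parser (the sample markup and variable names here are made up for illustration):

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<div>cat<!-- hidden note --></div>", "html.parser")

# Locate the first Comment node in the document
comment = soup.find(string=lambda s: isinstance(s, Comment))

# extract() detaches the node from the tree and hands it back
removed = comment.extract()

print(repr(str(removed)))  # the comment's text is still available
print(removed.parent)      # None: the node is no longer attached to the tree
print(soup)                # the div is left without the comment
```

This can be handy if you want to strip comments from the page but still log or inspect what was in them.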

Usually there is no need to modify the bs4 parse tree. You can simply get the div's text, if that is all you want:

soup.body.div.text
Out[18]: '\ncat dog sheep goat\n\n'
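
For a version you can run on its own, here is a small sketch of the same idea using get_text(), again assuming the built-in html.parser. The strip=True argument is my addition to drop the surrounding newlines; the sample markup mirrors the question.

```python
from bs4 import BeautifulSoup

html = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="foo")

# get_text() gathers ordinary strings only; Comment nodes are left out.
# strip=True trims each string and drops the whitespace-only ones.
text = div.get_text(strip=True)
print(text)
```

The parse tree itself is untouched here, so the comment is still present if you print the soup afterwards.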

bs4 keeps comments separate from the regular text. However, if you really do need to modify the parse tree:

from bs4 import Comment

# Iterate over a copy, since extract() changes the div's children in place
for child in list(soup.body.div.children):
    if isinstance(child, Comment):
        child.extract()
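
Note that .children only covers direct children of the div. If comments may be nested deeper, something along these lines might help. It is only a sketch: strip_comments is a hypothetical helper name, not part of bs4, and the sample markup is invented.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(soup):
    """Remove every Comment node under `soup`; return how many were removed."""
    # find_all() returns a list, so extracting while looping over it is safe
    comments = soup.find_all(string=lambda s: isinstance(s, Comment))
    for comment in comments:
        comment.extract()
    return len(comments)


html = "<div class='foo'>cat <!-- a --><span>dog<!-- b --></span></div>"
soup = BeautifulSoup(html, "html.parser")

removed = strip_comments(soup)
print(removed)
print(soup)
```

You can pass a single tag instead of the whole soup if you only want to clean one part of the page.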
