How to remove everything between two HTML comments using BeautifulSoup

Published 2024-05-16 13:30:04


<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<div><span id="company">Apple</span> Chats:</div>
<div>abcdefg<span>xvfdadsad</span>sdfsdfsdf</div>
<div>
<li>(<span>7</span>sadsafasf<span>vdvdsfdsfds</span></li>
<li>(<span>8</span>) <span>Reim</span></li>
</div>
<!-- Ad -->
<a href="#">

I want to use bs4 to remove everything between the two comment lines, so that the file becomes the following:

<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<!-- Ad -->
<a href="#">

2 answers

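You can use the decompose() method to remove the div elements. Comments are of type Comment (a NavigableString subclass), not Tag, so tag searches such as find_all("div") never return them. The loop below walks the elements that follow the span with id="company" and removes them until it reaches the a tag: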

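from bs4 import BeautifulSoup

# Assumption: `html` holds the snippet from the question
soup = BeautifulSoup(html, "html.parser")

# Find all the elements after the tag with `id="company"`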
for tag in soup.find("span", id="company").next_elements:
    # Break once we encounter an `a` since all the comments have finished
    if tag.name == "a":
        break
    else:
        try:
            tag.previous_sibling.decompose()
        except AttributeError:
            continue

print(soup.prettify())

Output:

<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<!-- Ad -->
<a href="#">
</a>
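
The point about comments not being tags can be checked with a small sketch. The markup here is a shortened stand-in for the question's snippet (an assumption for brevity), and it uses the built-in "html.parser" so that no extra parser is needed:

```python
from bs4 import BeautifulSoup, Comment

# Shortened stand-in for the question's markup (illustrative only).
html = """<!-- Top Plans -->
<div><span id="company">Apple</span> Chats:</div>
<!-- Ad -->
<a href="#"></a>"""

soup = BeautifulSoup(html, "html.parser")

# Tag searches only return Tag objects, so the comments are not among them.
tag_names = [tag.name for tag in soup.find_all(True)]

# A string filter that checks the node type picks out the comments instead.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]

print(tag_names)
print(comment_texts)
```

A Comment compares equal to a plain string holding its inner text (including the surrounding spaces), which is what the second answer relies on when matching the start and end comments.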

First, be careful with HTML snippets taken out of context. If you print the souped snippet, you get:

<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<html>
 <body>
  <div>
   <span id="company">
   ...

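Whoops: BS placed the comment above the <html> tag. That is clearly not what you intended, because an algorithm that removes the elements between the two comments would inevitably remove the entire document (this is why including your code matters…)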

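For the main task, element.decompose() or element.extract() removes an element from the tree (extract() also returns it, a minor subtlety). The elements to remove should be collected in a separate list during the walk and removed only after the traversal has finished, so that the tree is not modified while it is being iterated.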

from bs4 import BeautifulSoup, Comment

html = """
<body>
<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<div><span id="company">Apple</span> Chats:</div>
<div>abcdefg<span>xvfdadsad</span>sdfsdfsdf</div>
<div>
<li>(<span>7</span>sadsafasf<span>vdvdsfdsfds</span></li>
<li>(<span>8</span>) <span>Reim</span></li>
</div>
<!-- Ad -->
<a href="#">
"""
start_comment = " Top Plans & Programs: Most Common User Phrases - List Bucket 6 "
end_comment = " Ad "
soup = BeautifulSoup(html, "lxml")
to_extract = []
between_comments = False

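# Walk every node in document order (descendants is the current name
# for the older recursiveChildGenerator()).
for x in soup.descendants:
    # Collect tags (not strings or comments) seen after the start comment
    if between_comments and not isinstance(x, str):
        to_extract.append(x)

    if isinstance(x, Comment):
        if start_comment == x:
            between_comments = True
        elif end_comment == x:
            break

# Remove only after the walk, so the tree is not modified mid-iteration
for x in to_extract:
    x.decompose()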

print(soup.prettify())

Output:

<html>
 <body>
  <!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
  <!-- Ad -->
  <a href="#">
  </a>
 </body>
</html>

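Note that if the end comment is not at the same level as the start comment, this will destroy all parent elements of the end comment, because those parents are visited (and collected) before the end comment itself is reached. If you do not want that, you need to walk back up the parent chain until you reach the level of the start comment.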
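
One way to sidestep the level problem is to iterate only over the siblings of the start comment. The sketch below is a different technique from the descendant walk above, not the answer's own code: the function name remove_between_comments and the shortened markup are hypothetical, and it simply refuses to remove anything when the end comment is not a sibling of the start comment, rather than walking up the parent chain.

```python
from bs4 import BeautifulSoup, Comment

# Shortened, hypothetical markup with both comments at the same level.
html = """<body>
<!-- start -->
<div><span>one</span></div>
<p>two</p>
<!-- end -->
<a href="#">keep</a>
</body>"""


def remove_between_comments(soup, start_text, end_text):
    """Remove the siblings between two same-level comments.

    Returns the number of nodes removed, or 0 if the start comment is
    missing or the end comment is not one of its following siblings.
    """
    start = soup.find(string=lambda s: isinstance(s, Comment) and s == start_text)
    if start is None:
        return 0

    doomed = []
    for node in start.next_siblings:
        if isinstance(node, Comment) and node == end_text:
            break
        doomed.append(node)
    else:
        # The loop ended without finding the end comment at this level:
        # leave the tree untouched.
        return 0

    # Detach only after the sibling walk has finished.
    for node in doomed:
        node.extract()
    return len(doomed)


soup = BeautifulSoup(html, "html.parser")
removed = remove_between_comments(soup, " start ", " end ")
print(soup)
```

Because only siblings are collected, text nodes between the comments are removed as well, and nothing outside the start comment's parent is ever touched.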

Another solution, using find() and next-element traversal (same imports, HTML string, and output as above):

start_comment = " Top Plans & Programs: Most Common User Phrases - List Bucket 6 "
end_comment = " Ad "
soup = BeautifulSoup(html, "lxml")
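# string= is the current name for the older text= argument (bs4 4.4.0+)
el = soup.find(string=lambda x: isinstance(x, Comment) and start_comment == x)
end = el.find_next(string=lambda x: isinstance(x, Comment) and end_comment == x)
to_extract = []

# Step through the document node by node until the end comment is reached
while el and end and el is not end:
    if not isinstance(el, str):
        to_extract.append(el)

    # next_element is the full name of the .next alias
    el = el.next_element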

for x in to_extract:
    x.decompose()

print(soup.prettify())
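
Both solutions call decompose(); the difference from extract() mentioned earlier can be seen in a minimal sketch. The two-paragraph markup is a made-up example, not taken from the question:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>first</p><p>second</p></div>", "html.parser")
first, second = soup.find_all("p")

# extract() detaches the element and returns it, so it can still be used.
detached = first.extract()
detached_text = detached.get_text()

# decompose() detaches the element and destroys it; it returns nothing.
result = second.decompose()

remaining = str(soup)
print(detached_text, result, remaining)
```

If the removed nodes are never needed again, either call works here; extract() is the one to choose when the removed fragment should be inspected or reinserted elsewhere.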
