How to remove everything between two HTML comments using BeautifulSoup

2 answers

User

Answer 1 · edited 2024-05-16 13:30:04

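You can use the decompose() method to remove the divs. Because the comments are of type Comment rather than Tag, a tag search such as find_all() will not see them, so walk the elements after the span and remove the divs: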

# Find all the elements after the tag with `id="company"`
for tag in soup.find("span", id="company").next_elements:
    # Break once we encounter an `a` since all the comments have finished
    if tag.name == "a":
        break
    else:
        try:
            tag.previous_sibling.decompose()
        except AttributeError:
            continue

print(soup.prettify())

Output:

<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<!-- Ad -->
<a href="#">
</a>
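
As a small illustration of the point about comment nodes, here is a minimal sketch. The sample HTML is made up for this example, and it assumes the built-in html.parser rather than whatever parser the original soup was built with. It shows that a plain tag search returns only tags, while comments have to be requested through a string filter:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup, not the asker's document
html = "<div><!-- Ad --><span id='company'>Apple</span></div>"
soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every Tag, and nothing but Tags
tags = soup.find_all(True)

# Comments are strings of type Comment, so filter on the string type
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print([t.name for t in tags])
print(comments)
```

Note that the comment text keeps its surrounding spaces, which matters when comparing it against a literal string.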

User

Answer 2 · edited 2024-05-16 13:30:04

First, be careful with HTML snippets taken out of context. If you print the soupified snippet, you get:

<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<html>
 <body>
  <div>
   <span id="company">
   ...

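Whoops: BS placed the comment above the <html> tag. That is clearly not what you intended, because an algorithm that removes the elements between the two comments would then inevitably remove the whole document (this is why including your code matters...).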

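For the main task, element.decompose() or element.extract() removes an element from the tree (extract() also returns it, a minor subtlety). Elements to be removed during the walk need to be collected in a separate list and removed only after the traversal has finished, since changing the tree while iterating over it disrupts the iteration.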
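
To make the difference between the two calls concrete, here is a minimal sketch on a made-up two-paragraph document (the markup and the variable names are assumptions for illustration, and it uses the built-in html.parser):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
first, second = soup.find_all("p")

# extract() detaches the element and hands it back, still usable
removed = first.extract()

# decompose() detaches the element and destroys it; it returns nothing
second.decompose()

print(removed)
print(soup)
```

decompose() is the usual choice when the removed nodes are not needed again, which is the case for the solution in this answer.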

from bs4 import BeautifulSoup, Comment

html = """
<body>
<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<div><span id="company">Apple</span> Chats:</div>
<div>abcdefg<span>xvfdadsad</span>sdfsdfsdf</div>
<div>
<li>(<span>7</span>sadsafasf<span>vdvdsfdsfds</span></li>
<li>(<span>8</span>) <span>Reim</span></li>
</div>
<!-- Ad -->
<a href="#">
"""
start_comment = " Top Plans & Programs: Most Common User Phrases - List Bucket 6 "
end_comment = " Ad "
soup = BeautifulSoup(html, "lxml")
to_extract = []
between_comments = False

# .descendants is the current name for the deprecated recursiveChildGenerator()
for x in soup.descendants:
    if between_comments and not isinstance(x, str):
        to_extract.append(x)

    if isinstance(x, Comment):
        if start_comment == x:
            between_comments = True
        elif end_comment == x:
            break

for x in to_extract:
    x.decompose()

print(soup.prettify())

Output:

<html>
 <body>
  <!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
  <!-- Ad -->
  <a href="#">
  </a>
 </body>
</html>

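Note that if the end comment is not at the same level as the start comment, this will destroy all parent elements of the end comment. If you don't want that, you need to walk back up the parent chain until you reach the level of the start comment.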
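
A rough sketch of that parent-chain walk follows. Everything specific here is an assumption made for illustration: the sample HTML, the short comment texts, the find_comment helper, and the use of html.parser. It also assumes the end comment sits somewhere below the start comment's parent. The idea is to climb from the end comment to the ancestor that is a sibling of the start comment, then remove only the siblings in between:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup where the end comment is nested one level deeper
html = """<body>
<!-- start -->
<div>remove me</div>
<p>remove me too</p>
<section><!-- end --><a href="#">keep</a></section>
</body>"""

soup = BeautifulSoup(html, "html.parser")


def find_comment(root, text):
    """Hypothetical helper: first Comment whose stripped text equals `text`."""
    return root.find(string=lambda s: isinstance(s, Comment) and s.strip() == text)


start = find_comment(soup, "start")
end = find_comment(soup, "end")

# Climb from the end comment until we reach a sibling of the start comment
stop = end
while stop is not None and stop.parent is not start.parent:
    stop = stop.parent
if stop is None:
    raise ValueError("end comment is not inside the start comment's parent")

# Collect first, remove afterwards, so the sibling iteration is not disturbed
to_remove = []
for sibling in start.next_siblings:
    if sibling is stop:  # identity check: Tag == compares content, not identity
        break
    to_remove.append(sibling)

for node in to_remove:
    node.extract()

print(soup)
```

In this sketch the section element and its link survive, because only the nodes strictly between the start comment and the end comment's ancestor are removed. Anything that precedes the end comment inside that ancestor is left alone, which may or may not be what you want.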

Another solution using find() and next-element traversal (same imports, HTML string, and output as above):

start_comment = " Top Plans & Programs: Most Common User Phrases - List Bucket 6 "
end_comment = " Ad "
soup = BeautifulSoup(html, "lxml")
# string= replaces the deprecated text= keyword
el = soup.find(string=lambda x: isinstance(x, Comment) and start_comment == x)
end = el.find_next(string=lambda x: isinstance(x, Comment) and end_comment == x)
to_extract = []

while el and end and el is not end:
    if not isinstance(el, str):
        to_extract.append(el)

    el = el.next_element  # current name for the deprecated .next

for x in to_extract:
    x.decompose()

print(soup.prettify())
