How to ignore empty lines when using .next_sibling in Python's BeautifulSoup4

11 votes

4 answers

4004 views

Asked 2025-04-18 03:54

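I want to remove duplicate placeholders from an HTML page, so I am using BeautifulSoup's .next_sibling attribute. As long as the duplicates are on the same line, this works fine (see data). But sometimes there are empty lines between them, and in that case I want .next_sibling to ignore those empty lines (see data2).

Here is the code: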

from bs4 import BeautifulSoup, Tag
data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
data2 = """<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
"""
soup = BeautifulSoup(data)
string = 'method-removed-here'
for p in soup.find_all("p"):
    while isinstance(p.next_sibling, Tag) and p.next_sibling.name == 'p' and p.text == string:
        p.next_sibling.decompose()
print(soup)

For data, the output is what I expected:

<html><head></head><body><p>method-removed-here</p></body></html>

The output for data2 (this is what needs fixing):

<html><head></head><body><p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
</body></html>

I could not find anything useful in the BeautifulSoup4 documentation, and .next_element is not what I want either.


4 Answers

A slight improvement on neurosnap's answer to make it more general:

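from bs4 import NavigableString

def next_elem(element, func):
    # func is an attribute name such as 'next_sibling' or 'previous_sibling'
    new_elem = getattr(element, func)
    # Skip any whitespace-only text node ("\n", "\n\n", "  "), not just a lone "\n"
    if isinstance(new_elem, NavigableString) and not new_elem.strip():
        return next_elem(new_elem, func)
    return new_elem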

Now you can use it with any navigation attribute, for example:

next_elem(element, 'previous_sibling')

Answered 2025-04-18 by Python大师


This is not a great solution either, but it worked for me.

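from bs4 import NavigableString

def get_sibling(element):
    sibling = element.next_sibling
    # Skip any whitespace-only text node, not just a lone "\n"
    # (in data2 the text between tags is "\n\n")
    if isinstance(sibling, NavigableString) and not sibling.strip():
        return get_sibling(sibling)
    return sibling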
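
A loop-based sketch of the same idea, for comparison. It walks `.next_siblings` and returns the first node that is not a whitespace-only string. The name `next_meaningful_sibling` is made up for this illustration and is not part of BeautifulSoup. Unlike `find_next_sibling()`, it still returns text nodes that contain visible text.

```python
from bs4 import BeautifulSoup, NavigableString

def next_meaningful_sibling(element):
    """Return the next sibling that is not a whitespace-only string, or None.

    Hypothetical helper written for this example only.
    """
    for sibling in element.next_siblings:
        if isinstance(sibling, NavigableString) and not sibling.strip():
            continue  # skip "\n", "\n\n", "  " and similar
        return sibling
    return None

soup = BeautifulSoup("<div><p>a</p>\n\n<p>b</p> tail</div>", "html.parser")
first = soup.find("p")
second = next_meaningful_sibling(first)
print(second)  # <p>b</p>
```

Because it uses a loop rather than recursion, a long run of whitespace nodes cannot hit Python's recursion limit.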

Answered 2025-04-18 by Python大师


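Use find_next_sibling() instead of next_sibling, and find_previous_sibling() instead of previous_sibling.

The reason: next_sibling does not necessarily return the next HTML tag; it returns the next "soup element". Usually that element is the whitespace between tags, but it can be other content as well. find_next_sibling(), on the other hand, returns only the next HTML tag and ignores whitespace and anything else that sits between tags.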

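I adjusted your code slightly for the demonstration. Hopefully it is semantically the same.

The code using next_sibling shows the same behavior you described (works for data, but not for data2):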

from bs4 import BeautifulSoup, Tag
data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
data2 = """<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>

<p>method-removed-here</p>
"""
soup = BeautifulSoup(data, 'html.parser')
string = 'method-removed-here'
for p in soup.find_all("p"):
    while True:
        ns = p.next_sibling
        if isinstance(ns, Tag) and ns.name == 'p' and p.text == string:
            ns.decompose()
        else:
            break
print(soup)

The code using find_next_sibling(), which works for both data and data2:

soup = BeautifulSoup(data, 'html.parser')
string = 'method-removed-here'
for p in soup.find_all("p"):
    while True:
        ns = p.find_next_sibling()
        if isinstance(ns, Tag) and ns.name == 'p' and p.text == string:
            ns.decompose()
        else:
            break
print(soup)
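
To try the same approach on input with blank lines between the tags, here is a sketch wrapped in a function. The name `dedupe_paragraphs` and the sample input are assumptions made for this example. It also differs from the code above in two ways: it removes a following `<p>` only when that paragraph itself contains the placeholder, and it advances with `find_next("p")` so it never touches a tag that has already been decomposed.

```python
from bs4 import BeautifulSoup

def dedupe_paragraphs(html, placeholder):
    """Collapse runs of adjacent <p> tags whose text equals `placeholder`.

    Hypothetical helper for illustration; whitespace between tags is ignored
    when deciding adjacency because find_next_sibling() only returns tags.
    """
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find("p")
    while p is not None:
        if p.get_text() == placeholder:
            nxt = p.find_next_sibling()
            while nxt is not None and nxt.name == "p" and nxt.get_text() == placeholder:
                nxt.decompose()              # drop the duplicate
                nxt = p.find_next_sibling()  # look again from the kept tag
        p = p.find_next("p")                 # move on to the next live <p>
    return soup

data2 = "<p>x</p>\n\n<p>x</p>\n\n<p>x</p>\n"
print(len(dedupe_paragraphs(data2, "x").find_all("p")))  # 1
```

The whitespace text nodes themselves stay in the tree; only the duplicate tags are removed.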

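Additional information:

.children and .contents also return the whitespace between tags. Use .find_all(True) instead, which returns only tags. Note that it returns all descendant tags by default; pass recursive=False to limit it to direct children.

For more, see: BeautifulSoup .children or .content without whitespace between tags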
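
A short sketch of the difference, using a made-up list snippet and the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul>\n<li>a</li>\n<li><b>b</b></li>\n</ul>", "html.parser")
ul = soup.ul

# .contents includes the "\n" strings between the <li> tags
print(len(ul.contents))                                      # 5

# find_all(True) returns only tags, but descends into children
print([t.name for t in ul.find_all(True)])                   # ['li', 'li', 'b']

# recursive=False limits the result to direct child tags
print([t.name for t in ul.find_all(True, recursive=False)])  # ['li', 'li']
```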

Answered 2025-04-18 by Python大师


I found a workaround for this problem. The issue is described in the BeautifulSoup Google group, where they suggest running the HTML through a preprocessor first:

import re

def bs_preprocess(html):
    """Remove distracting whitespace and newline characters."""
    pat = re.compile(r'(^\s+)|(\s+$)', re.MULTILINE)
    html = re.sub(pat, '', html)        # remove leading and trailing whitespace on each line
    html = re.sub(r'\n', ' ', html)     # convert newlines to spaces
                                        # this preserves newline delimiters
    html = re.sub(r'\s+<', '<', html)   # remove whitespace before any tag
    html = re.sub(r'>\s+', '>', html)   # remove whitespace after any tag
    return html

It is not the best solution, but it is one way to do it.
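
An alternative in the same spirit, sketched here as an assumption rather than something from the discussion group: instead of editing the raw HTML with regular expressions, parse first and then remove whitespace-only text nodes from the tree. The helper name `strip_whitespace_nodes` is hypothetical.

```python
from bs4 import BeautifulSoup, Tag

def strip_whitespace_nodes(soup):
    """Remove text nodes that contain only whitespace (hypothetical helper)."""
    # find_all(string=True) returns a list, so extracting while looping is safe
    for node in soup.find_all(string=True):
        if not node.strip():
            node.extract()
    return soup

data2 = "<p>x</p>\n\n<p>x</p>\n\n<p>x</p>\n"
soup = strip_whitespace_nodes(BeautifulSoup(data2, "html.parser"))
first = soup.find("p")
print(isinstance(first.next_sibling, Tag))  # True
```

After this cleanup, a plain `.next_sibling` check sees the neighbouring tag directly. Like the regex preprocessor, it also drops whitespace that may matter, for example inside `<pre>` or between inline elements, so it is best applied only when that spacing is irrelevant.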

Answered 2025-04-18 by Python大师

