Parsing multiple paragraphs in a single loop with BeautifulSoup

Asked 2025-04-18 15:49

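I'm parsing the comments section of a blog. Unfortunately, the markup is rather inconsistent.

I run into two cases:

The first comment is split across several paragraphs.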

 <p>My first paragraph.<br />But this a second line</p>
 <p>And this is a third line</p>

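The second comment, on the other hand, is a single paragraph.

I want each comment in a single string variable. But when I run the following code,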

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
<body>

<div id="firstDiv">
     <br></br>
     <p>First comment and first line</p>
     <p>First comment and second line</p>
     <div id="secondDiv">
          <b>Date1</b>
     </div> 
     <br></br>  
     <p>Second comment</p>
     <div id="secondDiv">
          <b>Date2</b>
     </div> 
     <br></br>
     </div>
     <br></br>
 </div>

</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")  # name the parser explicitly

for p in soup.find(id="firstDiv").find_all("p"):
    print("Print comment: " + p.get_text())
    print("End of loop")

the program handles the first two paragraphs in separate loop iterations and prints

Print comment: First comment and first line
End of loop
Print comment: First comment and second line
End of loop
Print comment: Second comment
End of loop

How can I print the first two paragraphs in the same iteration of the loop?

2 Answers

When you call soup.find(id="firstDiv").find_all("p"), it returns a list of the matching elements:

[<p>First comment and first line</p>, <p>First comment and second line</p>, <p>Second comment</p>]

Iterating over that list therefore runs the loop body three times, once per <p> element, which is exactly what you are seeing. If you already know that the first two paragraphs belong together, you can collect the text of every paragraph and join it by position:

soup = BeautifulSoup(html_doc, "html.parser")
text = [''.join(s.find_all(string=True)) for s in soup.find_all('p')]
print(", ".join(text[:2]))
print(" ".join(text[2:]))

This prints:

First comment and first line, First comment and second line
Second comment

Answered 2025-04-18 by Python大师

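What you want to do isn't a natural fit for soup, because you're dealing with flat data whose structure is not reflected in the HTML. So use soup as far as it will take you, then switch to iterating by hand.

The simplest way to get the <p> and <div> children of the parent <div> is to fetch all of its child elements. We only want the HTML nodes, not the strings between them, so we can call find_all() with no arguments. (That call also returns nested tags such as <b>, which is harmless here because only <p> and <div> are acted on.) Like this: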

def chunkify(parent):
    """yields groups of <p> nodes separated by <div> siblings"""
    chunk = []
    for element in parent.find_all():
        if element.name == 'p':
            chunk.append(element)
        elif element.name == 'div':
            yield chunk
            chunk = []
    if chunk:
        yield chunk

for paras in chunkify(soup.find(id="firstDiv")):
    print("Print comment: " + '\n'.join(p.get_text() for p in paras))
    print("End of loop")

The output will be:

Print comment: First comment and first line
First comment and second line
End of loop
Print comment: Second comment
End of loop

That's what you wanted, right?
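For readers who want something to paste and run on Python 3, here is a minimal sketch of the same idea that returns one string per comment. The names `HTML` and `comments_as_strings` are made up for this illustration, the markup is a trimmed copy of the question's, and `recursive=False` is a variation on the answer's code (it restricts the scan to direct children), not part of it.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (assumption: stray tags removed).
HTML = """
<div id="firstDiv">
  <p>First comment and first line</p>
  <p>First comment and second line</p>
  <div id="secondDiv"><b>Date1</b></div>
  <p>Second comment</p>
  <div id="secondDiv"><b>Date2</b></div>
</div>
"""

def comments_as_strings(parent):
    """Return one string per comment; a <div> child ends the current comment."""
    comments, lines = [], []
    # recursive=False visits only direct child tags, so the nested <b> is skipped.
    for element in parent.find_all(recursive=False):
        if element.name == "p":
            lines.append(element.get_text())
        elif element.name == "div" and lines:
            comments.append("\n".join(lines))
            lines = []
    if lines:  # paragraphs that were not followed by a closing <div>
        comments.append("\n".join(lines))
    return comments

soup = BeautifulSoup(HTML, "html.parser")
for comment in comments_as_strings(soup.find(id="firstDiv")):
    print("Print comment: " + comment)
    print("End of loop")
```

Building plain strings instead of lists of nodes matches the stated goal of having each comment in a single string variable; if the nodes themselves are needed later, the generator version above is the better fit.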

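If you know itertools, you can write this function more concisely and readably... but I wanted to write it first in a way that is easier for a novice to follow, even if it is a bit clunky. Here is a shorter version: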

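from itertools import groupby

def chunkify(parent):
    """yields groups of <p> nodes separated by <div> siblings"""
    grouped = groupby(parent.find_all(), lambda element: element.name != 'div')
    groups = (g for k, g in grouped if k)
    chunks = ([node for node in g if node.name == 'p'] for g in groups)
    # Skip runs that contain no <p> at all (e.g. the tags after the last <div>).
    return (chunk for chunk in chunks if chunk)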

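You could also replace the first two lines with a higher-level function that does the splitting for you. more-itertools provides split_at (from more_itertools import split_at), which takes a predicate that is true for the separator elements, so the test is inverted compared with the groupby call above:

    groups = split_at(parent.find_all(), lambda element: element.name == 'div')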
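To see what the groupby-based grouping does without needing any HTML, the sketch below runs it over a hypothetical flat list of tag names standing in for the result of parent.find_all(). The list `names` is an assumption made up for illustration; it is not output captured from the question's page.

```python
from itertools import groupby

# Hypothetical flat sequence of tag names, standing in for parent.find_all().
names = ["br", "p", "p", "div", "b", "br", "p", "div", "b", "br"]

# The key is True for everything except the separator, so each stretch of
# consecutive non-"div" names becomes one run.
runs = [(is_run, list(group))
        for is_run, group in groupby(names, lambda name: name != "div")]

# Keep the non-separator runs, then keep only the "p" entries inside each run.
chunks = [[name for name in run if name == "p"] for is_run, run in runs if is_run]

# A run without any "p" (here, the tags after the last "div") becomes an empty list.
non_empty = [chunk for chunk in chunks if chunk]

print(chunks)     # [['p', 'p'], ['p'], []]
print(non_empty)  # [['p', 'p'], ['p']]
```

The trailing empty list is worth noticing: a run of tags with no <p> in it still forms a group, so filtering out empty chunks avoids printing an empty comment.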

Answered 2025-04-18 by Python大师
