BeautifulSoup 4: extracting child elements that are not wrapped by a parent element

Question

Goal

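Trevor wants to extract page content where the relevant content is not wrapped in a single containing element, but instead sits next to heading elements.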

In the example below, Trevor needs a Python data structure with four elements, each holding a "header" name-value pair and a "body" name-value pair.

Details

The best way to explain is with an example:

<h2>Alpha blurb</h2>

* content here one
* content here two

<h2>Bravo blurb</h2>

* content here one
* content here two
* content here tree
* content here four
* content here fyve
* content here seeks

<h2>Charlie blurb</h2>

* content here four
* content here fyve
* content here seeks

<h2>Delta blurb</h2>

* blah

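From what Trevor has seen so far, the usual BeautifulSoup scraping strategy is to find a container element, then iterate over it and drill down.

In this scenario, however, Trevor wants to extract each heading and its associated content, even though that content is not wrapped in a containing element.

The only indication of where one content section ends and the next begins is the position of the heading tags.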

1 Answer

User

#1 · Posted 2024-04-26 12:46:53

Trevor needs to move sideways through the tree and use .next_siblings here. Example:

from bs4 import BeautifulSoup


page = """
<div>
<h2>Alpha blurb</h2>

* content here one
* content here two

<h2>Bravo blurb</h2>

* content here one
* content here two
* content here tree
* content here four
* content here fyve
* content here seeks

<h2>Charlie blurb</h2>

* content here four
* content here fyve
* content here seeks

<h2>Delta blurb</h2>

* blah
</div>
"""
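# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page, "html.parser")

for h2 in soup.find_all("h2"):
    print(h2.text)

    # loop over siblings until the next h2 is met (or no more siblings left)
    for item in h2.next_siblings:
        if item.name == "h2":
            break

        # siblings here are bare text nodes; str() keeps this safe if a tag shows up
        print(str(item).strip())

    print("  ")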

Prints:

Alpha blurb
* content here one
* content here two
  
Bravo blurb
* content here one
* content here two
* content here tree
* content here four
* content here fyve
* content here seeks
  
Charlie blurb
* content here four
* content here fyve
* content here seeks
  
Delta blurb
* blah
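
The question asks for a data structure rather than printed output. The same sibling walk can collect the results instead of printing them. The following is only a sketch under a few assumptions: the helper name split_sections is made up for illustration, the body is kept as a single stripped string (the question does not say whether it should be a string or a list of lines), and the page is the sample from the answer above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

page = """
<div>
<h2>Alpha blurb</h2>

* content here one
* content here two

<h2>Bravo blurb</h2>

* content here one
* content here two
* content here tree
* content here four
* content here fyve
* content here seeks

<h2>Charlie blurb</h2>

* content here four
* content here fyve
* content here seeks

<h2>Delta blurb</h2>

* blah
</div>
"""


def split_sections(html):
    """Hypothetical helper: one {"header", "body"} dict per <h2> in html."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for h2 in soup.find_all("h2"):
        parts = []
        # Walk sideways from the heading until the next <h2> (or the end).
        for item in h2.next_siblings:
            if isinstance(item, Tag):
                if item.name == "h2":
                    break
                parts.append(item.get_text())
            else:
                # Bare text node between headings.
                parts.append(str(item))
        sections.append({
            "header": h2.get_text(strip=True),
            "body": "".join(parts).strip(),
        })
    return sections


sections = split_sections(page)
for section in sections:
    print(section["header"], "->", repr(section["body"]))
```

If a list of lines is more convenient than one string, calling splitlines() on each "body" value gives one entry per bullet.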
