BeautifulSoup: find the next class

<h4 class="month-header">April 15, 2016</h4>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<h4 class="month-header">April 16, 2016</h4>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<h4 class="month-header">April 17, 2016</h4>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>

from bs4 import BeautifulSoup
import requests
import json

headers = {
    #'Cookie': 'X-Mapping-fjhppofk=FF3085BC452778AD1F6476C56E952C7A; _gat=1; __qca=P0-293756458-1459822661767; _gat_cToolbarTracker=1; _ga=GA1.2.610207006.1459822661',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36,(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept': '*/*',
    'Connection': 'keep-alive',
    'Content-Length': '0',  # recent versions of requests expect string header values
}

response = requests.get('http://sneakernews.com/air-jordan-release-dates/', headers=headers).text
soup = BeautifulSoup(response)

for tag in soup.findAll('h4', attrs={'class': 'month-header'}):
    print(tag.nextSibling.nextSibling.nextSibling)

2 Answers

User

Answer 1 · edited 2024-05-15 16:31:16

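You can use the find_next_siblings() method plus a simple slice to return the sections that immediately follow each h4 tag.

Demo using the sample HTML document: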

In [32]: from bs4 import BeautifulSoup 

In [33]: result = []

In [34]: html = """<h4 class="month-header">April 15, 2016</h4>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <h4 class="month-header">April 16, 2016</h4>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <h4 class="month-header">April 17, 2016</h4>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>"""

In [35]: soup = BeautifulSoup(html, 'html.parser')

In [36]: for header in soup.find_all('h4', class_='month-header'):
   ....:     d = {}
   ....:     d['month'] = header.get_text()
   ....:     d['released'] = [s.get_text() for s in header.find_next_siblings('section', class_='sneaker-post-main')[:3]]
   ....:     result.append(d)
   ....:     

In [37]: result
Out[37]: 
[{'month': 'April 15, 2016', 'released': ['...', '...', '...']},
 {'month': 'April 16, 2016', 'released': ['...', '...', '...']},
 {'month': 'April 17, 2016', 'released': ['...', '...', '...']}]
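
The slice matters because find_next_siblings() returns every later matching sibling, not only those that come before the next h4. A small check of that behaviour, using a made-up two-header document (the texts "a", "b", "c" are illustrative and not from the original page):

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two sections under the first header, one under the second.
html = """<h4 class="month-header">April 15, 2016</h4>
<section class="sneaker-post-main">a</section>
<section class="sneaker-post-main">b</section>
<h4 class="month-header">April 16, 2016</h4>
<section class="sneaker-post-main">c</section>"""

soup = BeautifulSoup(html, 'html.parser')
first = soup.find('h4', class_='month-header')

# Without a slice, sections belonging to later headers are included too.
all_after = [s.get_text() for s in
             first.find_next_siblings('section', class_='sneaker-post-main')]

# Slicing keeps only the sections that belong to the first header.
first_two = all_after[:2]
```

So the slice length has to match the number of sections under each header, which only works when that number is fixed.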

Update:

If the number of section elements per header is not constant, you can use a generator function like this (it may not be very efficient).

[code block not preserved in the source]

The generator function takes a single argument, the soup.
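
One possible shape for such a generator is sketched below. This is not the answer's original code: the name gen is taken from the usage further down, while the body, the dict keys (borrowed from the demo above), and the sample HTML are assumptions.

```python
from bs4 import BeautifulSoup


def gen(soup):
    """Yield one dict per h4.month-header with the sections that follow it."""
    for header in soup.find_all('h4', class_='month-header'):
        released = []
        # Walk the following sibling tags and stop at the next h4,
        # so each header only collects its own sections.
        for sib in header.find_next_siblings(True):
            if sib.name == 'h4':
                break
            if sib.name == 'section' and 'sneaker-post-main' in sib.get('class', []):
                released.append(sib.get_text())
        yield {'month': header.get_text(), 'released': released}


# Hypothetical sample with a different number of sections per header.
html = """<h4 class="month-header">April 15, 2016</h4>
<section class="sneaker-post-main">a</section>
<section class="sneaker-post-main">b</section>
<h4 class="month-header">April 16, 2016</h4>
<section class="sneaker-post-main">c</section>"""

groups = list(gen(BeautifulSoup(html, 'html.parser')))
```

Because each header rescans its following siblings, the work grows with the number of headers times the number of siblings, which fits the "may not be very efficient" caveat.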

from bs4 import BeautifulSoup, SoupStrainer, Tag

wanted_tag = SoupStrainer(['h4', 'section']) # only parse h4 and section tags 
soup = BeautifulSoup(response, 'html.parser', parse_only = wanted_tag)

for tag in soup(['script', 'style', 'img']):
    tag.decompose() #  Just to clean up little bit

for d in gen(soup):
    print(d)  # do something with each group

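User

Answer 2 · edited 2024-05-15 16:31:16

Reverse the logic: get all the section.sneaker-post-main elements, then for each one find its previous h4 sibling and use that header's text as the dict key for grouping: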

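import requests
from bs4 import BeautifulSoup
from collections import defaultdict

url = "http://sneakernews.com/air-jordan-release-dates/"  # URL from the question
ua = {"User-Agent": "Mozilla/5.0"}  # placeholder headers; an assumption

cont = requests.get(url, headers=ua).content

soup = BeautifulSoup(cont, "lxml")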

d = defaultdict(list)
sections = soup.select("div.release-post-list  section.sneaker-post-main")
for section in sections:
    h4 = section.find_previous_sibling("h4",{"class":"month-header"})
    d[h4.text.strip()].append(section)

print(d["April 15, 2016"])

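Using the first header's text as the key, you can see the correct first three sneaker-post-main sections:

[output not preserved in the source]

Each h4.month-header can have many section.sneaker-post-main siblings, but each section.sneaker-post-main has only one previous h4.month-header sibling that it belongs to.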
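
The same grouping can be tried offline. The sketch below assumes a small stand-in document (the div.release-post-list wrapper matches the selector above, but the section texts are invented) and uses html.parser so that lxml is not required:

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page.
html = """<div class="release-post-list">
<h4 class="month-header">April 15, 2016</h4>
<section class="sneaker-post-main">a</section>
<section class="sneaker-post-main">b</section>
<h4 class="month-header">April 16, 2016</h4>
<section class="sneaker-post-main">c</section>
</div>"""

soup = BeautifulSoup(html, "html.parser")

grouped = defaultdict(list)
for section in soup.select("div.release-post-list section.sneaker-post-main"):
    # The nearest preceding h4.month-header is the header this section belongs to.
    h4 = section.find_previous_sibling("h4", class_="month-header")
    if h4 is not None:  # skip sections that appear before any header
        grouped[h4.get_text(strip=True)].append(section.get_text(strip=True))
```

Storing the text instead of the tag is a choice made here for readability; appending section itself, as in the answer, keeps the full element available for further parsing.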
