BeautifulSoup: finding the next class

Posted 2024-05-15 16:31:16


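So basically, I have two classes. One holds the release date of the shoes. The other holds the shoes released on that day. They are two completely separate classes, which is why I'm struggling to scrape them together. "month-header" contains all the dates, and the next class, sneaker-post-main, contains all the shoes for the date in the month-header above it. But they are two different classes with nothing linking them. So I tried moving to the next element after each h4 to reach my "section" class. It doesn't work that way.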

<h4 class="month-header">April 15, 2016</h4>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<h4 class="month-header">April 16, 2016</h4>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<h4 class="month-header">April 17, 2016</h4>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>

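Also, in case my HTML doesn't make sense, this is the site I'm scraping: http://sneakernews.com/air-jordan-release-dates/ I'd like the output to be a dictionary in which each date is a key and the value is the list of shoes released on that date.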

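I'm trying to do this with BeautifulSoup, but I can't figure it out. In the HTML above, April 15, 2016 is the release date, and each ... holds the shoe information and so on (there is a list of shoes under each date, not just one shoe).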

from bs4 import BeautifulSoup
import requests
import json


headers = {
    #'Cookie': 'X-Mapping-fjhppofk=FF3085BC452778AD1F6476C56E952C7A; _gat=1; __qca=P0-293756458-1459822661767; _gat_cToolbarTracker=1; _ga=GA1.2.610207006.1459822661',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36,(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept': '*/*',
    'Connection': 'keep-alive',
    'Content-Length': 0
}
response = requests.get('http://sneakernews.com/air-jordan-release-dates/',headers=headers).text
soup = BeautifulSoup(response)
for tag in soup.findAll('h4', attrs = {'class':'month-header'}): 
    print tag.nextSibling.nextSibling.nextSibling

This is my code so far!


2 Answers

You can use the find_next_siblings() method and a simple slice to return the sections that immediately follow each h4 tag.

Demo using the sample HTML document:

In [32]: from bs4 import BeautifulSoup 

In [33]: result = []

In [34]: html = """<h4 class="month-header">April 15, 2016</h4>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <h4 class="month-header">April 16, 2016</h4>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <h4 class="month-header">April 17, 2016</h4>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>"""

In [35]: soup = BeautifulSoup(html, 'html.parser')

In [36]: for header in soup.find_all('h4', class_='month-header'):
   ....:     d = {}
   ....:     d['month'] = header.get_text()
   ....:     d['released'] = [s.get_text() for s in header.find_next_siblings('section', class_='sneaker-post-main')[:3]]
   ....:     result.append(d)
   ....:     

In [37]: result
Out[37]: 
[{'month': 'April 15, 2016', 'released': ['...', '...', '...']},
 {'month': 'April 16, 2016', 'released': ['...', '...', '...']},
 {'month': 'April 17, 2016', 'released': ['...', '...', '...']}]
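The slice matters because find_next_siblings() returns every matching sibling that comes later in the document, not only those up to the next h4. A minimal sketch on a shortened, hypothetical version of the sample HTML (the section texts a, b, c are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample: two sections under the first date, one under the second.
sample_html = """<h4 class="month-header">April 15, 2016</h4>
<section class="sneaker-post-main">a</section>
<section class="sneaker-post-main">b</section>
<h4 class="month-header">April 16, 2016</h4>
<section class="sneaker-post-main">c</section>"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
first_header = sample_soup.find("h4", class_="month-header")

# Without a slice, sections belonging to later dates are included too.
all_following = [
    s.get_text()
    for s in first_header.find_next_siblings("section", class_="sneaker-post-main")
]
print(all_following)

# Slicing keeps only the first two, which is correct here only because
# we know in advance how many sections follow this header.
only_first_date = all_following[:2]
print(only_first_date)
```

So the fixed slice is only safe when every date has the same, known number of sections.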

Update

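If the number of "section" elements isn't constant, you can use a generator function instead (this may not be very efficient). The generator function, gen, takes one argument, the "soup".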
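A minimal sketch of what such a generator could look like, offered as an illustration rather than the answerer's exact function. It assumes that each date's sections are the siblings between one h4 and the next, and that yielding one dict per header is the desired shape; the sample HTML and shoe names are hypothetical.

```python
from bs4 import BeautifulSoup


def gen(soup):
    """Yield one dict per h4.month-header with the sections that follow it."""
    for header in soup.find_all("h4", class_="month-header"):
        released = []
        # Walk the following h4/section siblings and stop at the next date header.
        for sibling in header.find_next_siblings(["h4", "section"]):
            if sibling.name == "h4":
                break
            released.append(sibling.get_text(strip=True))
        yield {"month": header.get_text(strip=True), "released": released}


# Hypothetical sample with a different number of sections per date.
sample_html = """<h4 class="month-header">April 15, 2016</h4>
<section class="sneaker-post-main">Shoe A</section>
<h4 class="month-header">April 16, 2016</h4>
<section class="sneaker-post-main">Shoe B</section>
<section class="sneaker-post-main">Shoe C</section>"""

# Collapse the yielded dicts into the date -> list-of-shoes mapping from the question.
releases = {
    d["month"]: d["released"]
    for d in gen(BeautifulSoup(sample_html, "html.parser"))
}
print(releases)
```

Each header rescans its following siblings, which is why this approach may be slow on long pages.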

from bs4 import BeautifulSoup, SoupStrainer, Tag

wanted_tag = SoupStrainer(['h4', 'section']) # only parse h4 and section tags 
soup = BeautifulSoup(response, 'html.parser', parse_only = wanted_tag)

for tag in soup(['script', 'style', 'img']):
    tag.decompose() #  Just to clean up little bit

for d in gen(soup):
    print(d)  # do something with each dict

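Reverse the logic: select all the section.sneaker-post-main elements, then for each one look up its previous h4 sibling and use that header's text as the key to group by in a dict: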

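import requests
from bs4 import BeautifulSoup
from collections import defaultdict

# URL taken from the question; ua is a placeholder headers dict (an assumption).
url = "http://sneakernews.com/air-jordan-release-dates/"
ua = {"User-Agent": "Mozilla/5.0"}

cont = requests.get(url, headers=ua).content

soup = BeautifulSoup(cont, "lxml")

d = defaultdict(list)
sections = soup.select("div.release-post-list section.sneaker-post-main")
for section in sections:
    # The nearest preceding h4.month-header is the date this section belongs to.
    h4 = section.find_previous_sibling("h4", {"class": "month-header"})
    d[h4.text.strip()].append(section)

print(d["April 15, 2016"])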

Using the first header's text as the key returns the correct first three sneaker-post-main sections.

Each h4.month-header can have many section.sneaker-post-main siblings, but each section.sneaker-post-main has only one previous h4.month-header sibling that it belongs to.
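The same grouping can be tried offline. The following is a sketch under assumptions: the sample HTML and shoe names are hypothetical, find_all() stands in for the CSS selector, html.parser replaces lxml so nothing extra needs installing, and the None check is an added guard for a section that appears before any date header.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical sample: one stray section before any header, then uneven groups.
sample_html = """<section class="sneaker-post-main">Orphan</section>
<h4 class="month-header">April 15, 2016</h4>
<section class="sneaker-post-main">Shoe A</section>
<section class="sneaker-post-main">Shoe B</section>
<h4 class="month-header">April 16, 2016</h4>
<section class="sneaker-post-main">Shoe C</section>"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

grouped = defaultdict(list)
for section in sample_soup.find_all("section", class_="sneaker-post-main"):
    # find_previous_sibling returns the closest earlier match, or None if there is none.
    header = section.find_previous_sibling("h4", class_="month-header")
    if header is None:
        continue  # no date header precedes this section, so skip it
    grouped[header.get_text(strip=True)].append(section.get_text(strip=True))

print(dict(grouped))
```

Storing the section text instead of the Tag objects gives the date-to-shoe-list dictionary the question asks for; keep the Tag objects if further fields need to be extracted.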
