BeautifulSoup: find the elements directly above and below a heading with a specific string

Posted 2024-05-21 08:06:55


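How can I scrape just the h3 and h4 elements above the h5 whose string is "Prem League", together with the div that follows it?

I want the text of the h3 and h4, and inside the div I need a span and the text inside that span.

So whenever the h5 string is "Prem League", I want the h4 and h3 directly above it, and I also need various elements from the fixres__item div directly below that h5.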

<div class="fixres__body" data-url="" data-view="fixture-update" data-controller="fixture-update" data-fn="live-refresh" data-sport="football" data-lite="true" id="widgetLite-6">
    <h3 class="fixres__header1">November 2018</h3>          
    <h4 class="fixres__header2">Saturday 24th November</h4>             
    <h5 class="fixres__header3">Prem League</h5>
    <div class="fixres__item">stuff in here</div>

    <h4 class="fixres__header2">Wednesday 28th November</h4>
    <h5 class="fixres__header3">UEFA Champ League</h5>
    <div class="fixres__item">stuff in here</div>

    <h3 class="fixres__header1">December 2018</h3>          
    <h4 class="fixres__header2">Sunday 2nd December</h4>                
    <h5 class="fixres__header3">Prem League</h5>
    <div class="fixres__item">stuff in here</div>

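This is my code so far, but it also includes data from below the h5 with the string "UEFA Champ League", which I don't want. I only want data from the divs below h5 headings that read "Prem League". For example, I don't want PSG in the output, because it comes from the div below the h5 heading "UEFA Champ League".

My code: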

import requests
from bs4 import BeautifulSoup


def squad_fixtures():
    team_table = ['https://someurl.com/liverpool-fixtures']
    team_home_namesall = []

    for url in team_table:
        squad_r = requests.get(url)
        premier_squad_soup = BeautifulSoup(squad_r.text, 'html.parser')
        premier_fix_body = premier_squad_soup.find('div', {'class': 'fixres__body'})

        # This grabs every fixres__item, whichever h5 it sits under
        premier_fix_divs = premier_fix_body.find_all('div', {'class': 'fixres__item'})

        for item in premier_fix_divs:
            team_home = item.find_all('span', {'class': 'matches__item-col matches__participant matches__participant--side1'})
            for span in team_home:
                team_home_names = span.find('span', {'class': 'swap-text--bp30'})['title']
                team_home_namesall.append(team_home_names)
    print(team_home_namesall)

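Output:
['Watford', 'Paris Saint-Germain', 'Liverpool', 'Burnley', "B'mouth", 'Liverpool', 'Liverpool', 'Wolves', 'Liverpool', 'Liverpool', 'Man City', 'Brighton', 'Liverpool', 'Liverpool', 'West Ham', 'Liverpool', 'Man Utd', 'Liverpool', 'Everton', 'Liverpool', 'Fulham', 'Liverpool', 'Southampton', 'Liverpool', 'Cardiff', 'Liverpool', 'Newcastle', 'Liverpool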


1 answer
User
#1 · Posted 2024-05-21 08:06:55

It seems your challenge is restricting the scrape to the Premier League <h5> and the content associated with it.

Note: Your question states the string of the h5 should be Prem League, but it in fact appears to be Premier League when I look at the response.
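
If it is unclear which label the page actually serves, one option is a pattern that accepts both spellings. This is only a sketch: the pattern name `league_regex` and the exact regex are my own assumptions, not something taken from the page, and the three labels are just the strings mentioned in this thread.

```python
import re

# Hypothetical pattern: matches "Prem League" as well as "Premier League",
# and nothing else (anchored so longer strings are rejected).
league_regex = re.compile(r"^\s*Prem(?:ier)?\s+League\s*$")

for label in ["Prem League", "Premier League", "UEFA Champ League"]:
    # bool(...) turns the match object (or None) into True/False
    print(label, "->", bool(league_regex.search(label)))
```

A compiled pattern like this can be passed to `find_all` in place of a plain string.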

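This HTML looks very flat, with nothing structural to separate one group from the next, so the best approach seems to be walking the previous and next siblings starting from the h5, which is itself easy to locate: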

import re

import requests
from bs4 import BeautifulSoup, Tag

prem_league_regex = re.compile(r"Premier League")


def squad_fixtures():
    team_table = ['https://www.skysports.com/liverpool-fixtures']

    for url in team_table:
        squad_r = requests.get(url)
        soup = BeautifulSoup(squad_r.text, 'html.parser')
        body = soup.find('div', {'class': 'fixres__body'})
        # string= replaces the older text= keyword (Beautiful Soup 4.4.0+)
        h5s = body.find_all('h5', {'class': 'fixres__header3'}, string=prem_league_regex)
        for h5 in h5s:
            # the h4 (date) is expected directly above the h5
            prev_tag = find_previous(h5)
            if prev_tag is not None and prev_tag.name == 'h4':
                print(prev_tag.text)
                # the h3 (month) only sits directly above the first h4 of a month
                prev_tag = find_previous(prev_tag)
                if prev_tag is not None and prev_tag.name == 'h3':
                    print(prev_tag.text)
            fixres_item_div = find_next(h5)
            # get the things you need from fixres__item now that you have it...


def find_previous(tag):
    """Return the nearest previous sibling that is a Tag, or None."""
    prev_tag = tag.previous_sibling
    while prev_tag is not None and not isinstance(prev_tag, Tag):
        prev_tag = prev_tag.previous_sibling
    return prev_tag


def find_next(tag):
    """Return the nearest next sibling that is a Tag, or None."""
    next_tag = tag.next_sibling
    while next_tag is not None and not isinstance(next_tag, Tag):
        next_tag = next_tag.next_sibling
    return next_tag
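
As a variation on the sibling walk above, Beautiful Soup's own `find_previous_sibling` and `next_siblings` can do most of the work. The sketch below runs against the sample markup from the question rather than the live page, so the structure of the real response is an assumption. The helper name `league_fixtures` and the `HEADINGS` set are hypothetical names of my own. It looks up the nearest h4 and h3 above each matching h5, even when they are not directly adjacent, and collects every fixres__item up to the next heading.

```python
import re

from bs4 import BeautifulSoup, Tag

# Sample markup copied from the question; the live page may differ.
SAMPLE_HTML = """
<div class="fixres__body">
    <h3 class="fixres__header1">November 2018</h3>
    <h4 class="fixres__header2">Saturday 24th November</h4>
    <h5 class="fixres__header3">Prem League</h5>
    <div class="fixres__item">stuff in here</div>

    <h4 class="fixres__header2">Wednesday 28th November</h4>
    <h5 class="fixres__header3">UEFA Champ League</h5>
    <div class="fixres__item">stuff in here</div>

    <h3 class="fixres__header1">December 2018</h3>
    <h4 class="fixres__header2">Sunday 2nd December</h4>
    <h5 class="fixres__header3">Prem League</h5>
    <div class="fixres__item">stuff in here</div>
</div>
"""

HEADINGS = {"h3", "h4", "h5"}  # hypothetical: tags that start a new group


def league_fixtures(html, league_pattern):
    """Return a (month, date, item_texts) tuple per matching h5 (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_="fixres__body")
    results = []
    for h5 in body.find_all("h5", class_="fixres__header3", string=league_pattern):
        # Nearest preceding headings, even if other tags sit in between
        h4 = h5.find_previous_sibling("h4")
        h3 = h5.find_previous_sibling("h3")
        items = []
        for sibling in h5.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip whitespace text nodes
            if sibling.name in HEADINGS:
                break  # the next heading belongs to a different group
            if "fixres__item" in sibling.get("class", []):
                items.append(sibling.get_text(strip=True))
        results.append((
            h3.get_text(strip=True) if h3 else None,
            h4.get_text(strip=True) if h4 else None,
            items,
        ))
    return results


fixtures = league_fixtures(SAMPLE_HTML, re.compile(r"Prem(ier)? League"))
for month, date, items in fixtures:
    print(month, "|", date, "|", items)
```

Inside the loop over `next_siblings` you would swap `get_text` for whatever span lookups you need from each fixres__item.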
