Unable to grab the tabular contents attached to different participants

Posted 2024-06-09 00:16:40


I'm trying to grab the tabular contents attached to different participants from a webpage. The information I'm after is marked in the image below, for reference. At the moment my script only gives me the names of the different participants; I would like to parse the data attached to those participants as well.

Website Address

As the content is generated dynamically, I had to make use of one of the site's public APIs, which can be found using the browser's dev tools.

The image shows how the information is displayed on that page. I would like to grab each row in one line.

This is how the API response looks.
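The screenshots themselves are not preserved here. Judging from the regexes in the answers below (NA= for names, OD= for fractional odds, BC= for timestamps, with ;-separated fields and |-separated records), a fragment of the response presumably looks something like this, with hypothetical teams and values:

NA=Team A v Team B;BC=20190608190000;...|NA=1;...;OD=5/2;...|NA=2;...;OD=4/6;...|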

Here is what I have tried so far:

import re
import requests

url = 'https://www.bet365.com.au/SportsBook.API/web?'

params = {
    'lid': '30',
    'zid': '0',
    'pd': '#AC#B151#C1#D50#E2#F163#',
    'cid': '13',
    'ctid': '13'
}

r = requests.get(url, params=params, headers={'User-Agent': 'Mozilla/5.0'})
games = re.finditer(r'NA=(.*?);', r.text)
for game in games:
    if 'v' not in game.group():
        continue
    print(game.group(1))

The output I'm getting is like (partial):

(output not preserved)

The output I wish to get is like (partial):

(expected output not preserved)

How can I grab the tabular contents attached to different participants?

The information visible there may not match, as it is updated every few minutes. I would like to get this done with requests, the way I have already tried above.


Tags: in, image, import, re, api, information, game, url
3 Answers

If you want to use the Bet365 API, you will need to figure out how to decode the site's output and how the JS side renders what we see on the actual site, and I don't think that's an easy task. That's why I suggest loading the site in a browser tab with Selenium and then working on the final HTML with BeautifulSoup, which lowers the complexity of extracting the content.

Below is an example of how to use Chrome in headless mode to scrape tournaments, dates and matches.

PS: the cookie part is not required, but it helps with automatically loading the page we are trying to scrape.

You first need to install: pip install webdriver-manager, and then:

import pickle
import time
from collections import defaultdict
from pprint import pprint
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup as bs

CHROME_OPTIONS = Options()
CHROME_OPTIONS.add_argument("--headless")  # run Chrome without a visible browser window

class Bet365:
    DRIVER = webdriver.Chrome(ChromeDriverManager().install(), options=CHROME_OPTIONS)
    DUMMY_URL = 'https://www.bet365.com'
    URL = 'https://www.bet365.com/#/AC/B1/C1/D13/E37628398/F2/:/AC/B1/C1/D13/E42294995/F2/:/AC/B1/C1/D13/E42535433/F2/'
    COOKIES_FILE = 'cookies.pkl'

    def __init__(self):
        self.DRIVER.get(self.DUMMY_URL)
        # Comment the next line if cookies file is not set
        self.setup_cookies()
        self.DRIVER.get(self.URL)
        # self.DRIVER.maximize_window()
        # Wait for JS to populate the page
        time.sleep(15)
        self.source = self.DRIVER.page_source
        # Store new cookies for next run
        self.dump_cookies()

    def dump_cookies(self):
        """Store cookies"""
        pickle.dump(self.DRIVER.get_cookies(), open(self.COOKIES_FILE, "wb"))

    def setup_cookies(self):
        """Add cookies"""
        cookies = pickle.load(open(self.COOKIES_FILE, "rb"))
        for cookie in cookies:
            if 'expiry' in cookie:
                del cookie['expiry']
            self.DRIVER.add_cookie(cookie)

    def get_source(self):
        """Get page HTML source"""
        return bs(self.source, "html.parser")

    def is_last_child(self, event):
        """Is last child"""
        out = {}
        out['last_child'] = 'sl-MarketCouponAdvancedBase_LastChild' in event['class']
        event_date = event.find('div', {'class': 'sl-CouponParticipantWithBookCloses_BookCloses'})
        out['date'] = event_date.get_text() if event_date else 'None'
        teams = event.findAll('div', {'class': 'sl-CouponParticipantWithBookCloses_Name'})
        if len(teams) > 1:
            out['teams'] = ' v '.join(k.text for k in teams)
        elif len(teams) == 1:
            out['teams'] = teams[0].text
        else:
            out['teams'] = 'None'
        return out

    def get_events(self, data):
        """Return all events"""
        dates, teams = [], []
        for event in data.findAll('div', {'class': 'sl-MarketCouponFixtureLabelBase gll-Market_General gll-Market_HasLabels'}):
            dates = [elm.text for elm in event.find_all('div', {'class': lambda x: all(k in x for k in 'gll-MarketColumnHeader sl-MarketHeaderLabel sl-MarketHeaderLabel_Date'.split())})]
            teams_events = event.findAll("div", {'class': lambda x: x and x.startswith("sl-CouponParticipantWithBookCloses sl-CouponParticipantIPPGBase")})
            teams = [self.is_last_child(elm) for elm in teams_events]
            if len(dates) == 1:
                if teams:
                    teams[-1]['last_child'] = True
        return dates, teams

    def pretty_print_events(self, dates, teams):
        """Pretty print events"""
        def groupby_last_child(data):
            out, tmp = [], []
            for elm in data:
                tmp.append(elm)
                if elm['last_child']:
                    out.append(tmp)
                    tmp = []
            return out

        out = defaultdict(list)
        for date, groupped in zip(dates, groupby_last_child(teams)):
            # use += instead of append in order to get a flat list
            # instead of a list of lists
            out[date] += groupped
        return dict(out)

    def scrape_events(self):
        """Return all ligues"""
        for block in self.get_source().findAll('div', {'class': 'gll-MarketGroup cm-CouponMarketGroup cm-CouponMarketGroup_Open'}):
            ligue_name = block.find('span', {'class': 'cm-CouponMarketGroupButton_Text'}).get_text()
            dates, teams = self.get_events(block)
            out = self.pretty_print_events(dates, teams)
            yield ligue_name, out

    def to_dict(self):
        """Scrape events and return a dict"""
        return dict(self.scrape_events())


if __name__ == '__main__':
    instance = Bet365()
    out = instance.to_dict()
    pprint(out)

Output:

(output not preserved)
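One caveat with the snippet above: on the very first run cookies.pkl does not exist yet, so setup_cookies() will raise FileNotFoundError unless the call is commented out as noted in the code. A minimal self-guarding variant of that method might look like this (a sketch of my own, not part of the original answer):

import os

def setup_cookies(self):
    """Add cookies, skipping the step on a first run when no file exists yet"""
    if not os.path.exists(self.COOKIES_FILE):
        return
    with open(self.COOKIES_FILE, "rb") as f:
        cookies = pickle.load(f)
    for cookie in cookies:
        # 'expiry' can make add_cookie fail, so drop it if present
        cookie.pop('expiry', None)
        self.DRIVER.add_cookie(cookie)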

I've fixed up the code from your first attempt. Although the other two answers use Selenium, it isn't necessary here, since the site's API endpoint can be used to find the games, and this approach should be faster than Selenium. The rest of the information can again be parsed with regular expressions. However, on the live site I wasn't able to find a '1-1' score like the one in your expected output. Hope this helps. The times may be off, I'm not entirely sure.

Code

import re
import requests
from datetime import datetime, timedelta
import pandas as pd

url = 'https://www.bet365.com.au/SportsBook.API/web?'

params = {
    'lid': '30',
    'zid': '0',
    'pd': '#AC#B151#C1#D50#E2#F163#',
    'cid': '13',
    'ctid': '13'
}

r = requests.get(url, params=params, headers={'User-Agent': 'Mozilla/5.0'})

# Match participant names of the form "Team A v Team B"
games = re.finditer(r'NA=([\w\s\-._]+? v [\w\s\-._]+?);', r.text)
col_games = []
for game in games:
    # if 'v' in game.group() and '-' not in game.group():
    col_games.append(game.group(1))

# Odds for column "1": text between an NA=1; marker and the next NA=
prices_text = re.finditer(r'NA=1;.*?((?:OD=\d+/\d+;(?:.*?))+?)NA=', r.text)
col_1 = []
for text in prices_text:
    segments = text.group(1).split('|')
    for segment in segments:
        price = re.search(r'OD=(\d+/\d+);', segment)
        if price:
            # convert fractional odds (e.g. 5/2) to decimal odds, truncated to 2 dp
            price = int(eval(price.group(1) + '+1') * 100) / 100
            col_1.append(price)

# Odds for column "2": same pattern, between NA=2; and the next NA=
prices_text = re.finditer(r'NA=2;.*?((?:OD=\d+/\d+;(?:.*?))+?)NA=', r.text)
col_2 = []
for text in prices_text:
    segments = text.group(1).split('|')
    for segment in segments:
        price = re.search(r'OD=(\d+/\d+);', segment)
        if price:
            # same fractional-to-decimal conversion as above
            price = int(eval(price.group(1) + '+1') * 100) / 100
            col_2.append(price)

# BC= carries the book-closes timestamp as YYYYMMDDHHMMSS; drop the seconds
times = re.finditer(r'BC=(\d+);', r.text)
col_times = []
for match in times:
    datetime_time = datetime.strptime(match.group(1)[:-2], '%Y%m%d%H%M')
    # shift back one hour (presumably a timezone adjustment)
    datetime_time = datetime_time + timedelta(hours=-1)
    col_times.append(datetime_time.time())


df = pd.DataFrame({'Time': col_times, "Games": col_games, "1": col_1, "2": col_2})
print(df)

Output

(output not preserved)
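As an aside, eval on scraped text works but is easy to break; fractions.Fraction can do the same fractional-to-decimal odds conversion more safely. A small sketch of a drop-in helper (my suggestion, not part of the original answer):

from fractions import Fraction

def fractional_to_decimal(odds: str) -> float:
    """Convert fractional odds such as '5/2' into decimal odds (3.5),
    truncated to 2 decimal places like the snippet above."""
    return int((Fraction(odds) + 1) * 100) / 100

print(fractional_to_decimal('5/2'))  # 3.5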

You can use selenium:

from selenium import webdriver
from bs4 import BeautifulSoup as soup

d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.bet365.com.au/#/AC/B151/C1/D50/E2/F163/')

def scrape_block(b):
    # Date header for the block
    p = {'date': b.find('div', {'class': 'gll-MarketColumnHeader sl-MarketHeaderLabel sl-MarketHeaderLabel_Date '}).text}
    # c1: pre-game participants, c2: in-play participants (with clock and score)
    c1 = b.find_all('div', {'class': 'sl-CouponParticipantWithBookCloses sl-CouponParticipantWithBookCloses_NoAdditionalMarkets sl-CouponParticipantIPPGBase '})
    c2 = b.find_all('div', {'class': 'sl-CouponParticipantWithBookCloses sl-CouponParticipantWithBookCloses_NoAdditionalMarkets sl-CouponParticipantIPPGBase sl-CouponParticipantWithBookCloses_ClockPaddingLeft '})
    if c1:
        pl = [[i.find('div', {'class': 'sl-CouponParticipantWithBookCloses_BookCloses '}).text,
               i.find('div', {'class': 'sl-CouponParticipantWithBookCloses_Name '}).text] for i in c1]
    else:
        pl = [[i.find('div', {'class': 'pi-CouponParticipantClockInPlay '}).text,
               i.find('div', {'class': 'sl-CouponParticipantWithBookCloses_Name '}).text,
               i.find('div', {'class': 'pi-ScoreVariantDefault '}).text] for i in c2]
    # The two odds columns ("1" and "2")
    odds1, odds2 = [[i.text for i in c.find_all('div', {'class': 'gll-ParticipantOddsOnlyDarker gll-Participant_General gll-ParticipantOddsOnly '})]
                    for c in b.find_all('div', {'class': 'sl-MarketCouponValuesExplicit2 gll-Market_General gll-Market_PWidth-15-4 '})]
    return {**p, 'data': [{'player': a, 1: b, 2: c}
                          for a, b, c in zip(pl, [None] if not odds1 else odds1, [None] if not odds2 else odds2)]}

new_d = list(map(scrape_block, soup(d.page_source, 'html.parser').find_all('div', {'class': 'gll-MarketGroupContainer gll-MarketGroupContainer_HasLabels '})))
final_result = list(filter(lambda x: bool(x['data']), new_d))

Output:

(output not preserved)
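Note that the page is rendered by JavaScript, so d.page_source may be read before the coupons exist. An explicit wait before the parsing step would make this more reliable; a minimal sketch (the class name is taken from the selectors above, and the 15 s timeout is an arbitrary choice):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block until at least one market group container has been rendered,
# then it is safe to read d.page_source.
WebDriverWait(d, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'gll-MarketGroupContainer'))
)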
