Selenium loads the page but does not print all of the HTML

Posted 2024-05-23 22:33:51


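I am trying to use Python and Selenium to scrape dynamically loaded data from a website. The problem is that only about half of the data is reported as present, when in fact all of it should be there. Pausing before printing the page content, or simply finding elements by class, does not seem to help. The site is at https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909 and, as you can see, it has 13 main sections, but I can only retrieve data from the first four games. To illustrate the problem, I am attaching code that prints the content of the entire page (its innerText), which shows the difference between the data that loaded and the data that did not.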

from selenium import webdriver
import requests

url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
driver = webdriver.Chrome()
driver.get(url)
print(driver.execute_script("return document.documentElement.innerText;"))

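Edit: The problem is not the wait time, since I am running this line by line and waiting for the page to load completely. It seems to come down to Selenium not picking up all of the JS-loaded text on the page, as shown in the console output in the answer below.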


2 Answers

@sudonym's analysis is correct. You need to induce WebDriverWait before trying to extract the desired elements through the execute_script() method, as follows:

  • Code block:

    # -*- coding: UTF-8 -*-
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    
    url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
    driver = webdriver.Chrome()
    driver.get(url)
    WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//h2[contains(.,'USA - National Football League')]//following::section//span[3]")))
    print(driver.execute_script("return document.documentElement.innerText;"))
    
  • Console output:

    SPORTSBOOK REVIEW
    Home
    Best Sportsbooks
    Rating Guide
    Blacklist
    Bonuses
    BETTING ODDS
    FREE PICKS
    Sports Picks
    NFL
    College Football
    NBA
    NCAAB
    MLB
    NHL
    More Sports
    How to Bet
    Tools
    FORUM
    Home
    Players Talk
    Sportsbooks & Industry
    Newbie Forum
    Handicapper Think Tank
    David Malinsky's Point Blank
    Service Plays
    Bitcoin Sports Betting
    NBA Betting
    NFL Betting
    NCAAF Betting
    MLB Betting
    NHL Betting
    CONTESTS
    EARN BETPOINTS
    What Are Betpoints?
    SBR Sportsbook
    SBR Casino
    SBR Racebook
    SBR Poker
    SBR Store
    Today
    NFL
    NBA
    NHL
    MLB
    College Football
    NCAA Basketball
    Soccer
    Soccer Odds
    Major League Soccer
    UEFA Champions League
    UEFA Nations League
    UEFA Europa League
    English Premier League
    World Cup 2022
    Tennis
    Tennis Odds
    ATP
    WTA
    UFC
    Boxing
    More Sports
    CFL
    WNBA
    AFL
    Betting Odds/NFL Odds/Consensus
    TODAY
    |
    YESTERDAY
    |
    DATE
    ?
    Login
    ?
    Settings
    ?
    Bet Tracker
    ?
    Bet Card
    ?
    Favorites
    NFL Consensus for Sep 09, 2018
    USA - National Football League
    Sunday Sep 09, 2018
    01:00 PM
    /
    Pittsburgh vs Cleveland
    453
    Pittsburgh
    454
    Cleveland
    Current Line
    -3½+105
    +3½-115
    Wagers Placed
    10040
    54.07%
    8530
    45.93%
    Amount Wagered
    $381,520.00
    56.10%
    $298,550.00
    43.90%
    Average Bet Size
    $38.00
    $35.00
    SBR Contest Best Bets
    22
    9
    01:00 PM
    /
    San Francisco vs Minnesota
    455
    San Francisco
    456
    Minnesota
    Current Line
    +6-102
    -6-108
    Wagers Placed
    6250
    41.25%
    8900
    58.75%
    Amount Wagered
    $175,000.00
    29.50%
    $418,300.00
    70.50%
    Average Bet Size
    $28.00
    $47.00
    SBR Contest Best Bets
    5
    19
    01:00 PM
    /
    Cincinnati vs Indianapolis
    457
    Cincinnati
    458
    Indianapolis
    Current Line
    -1-104
    +1-106
    Wagers Placed
    11640
    66.36%
    5900
    33.64%
    Amount Wagered
    $1,338,600.00
    85.65%
    $224,200.00
    14.35%
    Average Bet Size
    $115.00
    $38.00
    SBR Contest Best Bets
    23
    12
    01:00 PM
    /
    Buffalo vs Baltimore
    459
    Buffalo
    460
    Baltimore
    Current Line
    +7½-103
    -7½-107
    Wagers Placed
    5220
    33.83%
    10210
    66.17%
    Amount Wagered
    $78,300.00
    16.79%
    $387,980.00
    83.21%
    Average Bet Size
    $15.00
    $38.00
    SBR Contest Best Bets
    5
    17
    01:00 PM
    /
    Jacksonville vs N.Y. Giants
    461
    Jacksonville
    462
    N.Y. Giants
    01:00 PM
    /
    Tampa Bay vs New Orleans
    463
    Tampa Bay
    464
    New Orleans
    01:00 PM
    /
    Houston vs New England
    465
    Houston
    466
    New England
    01:00 PM
    /
    Tennessee vs Miami
    467
    Tennessee
    468
    Miami
    04:05 PM
    /
    Kansas City vs L.A. Chargers
    469
    Kansas City
    470
    L.A. Chargers
    04:25 PM
    /
    Seattle vs Denver
    471
    Seattle
    472
    Denver
    04:25 PM
    /
    Dallas vs Carolina
    473
    Dallas
    474
    Carolina
    04:25 PM
    /
    Washington vs Arizona
    475
    Washington
    476
    Arizona
    08:20 PM
    /
    Chicago vs Green Bay
    477
    Chicago
    478
    Green Bay
    Media
    Site Map
    Terms of use
    Contact Us
    Privacy Policy
    DMCA
    18+. Gamble Responsibly.
    © Sportsbook Review. All Rights Reserved.
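
Note that in the output above only the first four games include the Current Line and Wagers Placed blocks, so a wait on a single locator may still return before every game is populated. One possible variation, sketched here under assumptions, is to give WebDriverWait a custom callable that polls the page text until a marker string appears an expected number of times. The class name `text_appears_n_times`, the marker "Current Line", and the count of 13 are illustrative choices, not part of the Selenium API, and the thread does not establish that the remaining games ever load without further interaction such as scrolling.

```python
INNER_TEXT_JS = "return document.documentElement.innerText;"


class text_appears_n_times:
    """Hypothetical wait condition: truthy once `marker` occurs `count` times."""

    def __init__(self, marker, count):
        self.marker = marker
        self.count = count

    def __call__(self, driver):
        # WebDriverWait.until() calls this repeatedly with the driver and
        # stops as soon as the return value is truthy.
        text = driver.execute_script(INNER_TEXT_JS)
        if text.count(self.marker) >= self.count:
            return text
        return False


# Usage with a real driver (assumes 13 games, as stated in the question):
# text = WebDriverWait(driver, 30).until(text_appears_n_times("Current Line", 13))
# print(text)
```

If the wait times out, that is itself useful evidence: the missing blocks are probably not a timing problem.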
    

This solution is only worth considering if there are many WebDriverWait calls and reducing runtime matters; otherwise, go with DebanjanB's approach.

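You need to allow some time for the HTML to load completely. You can also set a timeout for script execution. To add an unconditional wait around driver.get(URL) in Selenium, use driver.set_page_load_timeout(n) together with a retry loop: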

import logging
import random
from time import sleep

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

URL = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
n = 30                                   # Timeout in seconds (example value, adjust as needed)

driver = webdriver.Chrome()
driver.set_page_load_timeout(n)          # Set timeout of n seconds for page load
loading_finished = False
while not loading_finished:              # Repeat until the page has loaded
    try:
        sleep(random.uniform(0.1, 0.5))  # Wait a short random time
        driver.get(URL)                  # Try to load for up to n seconds
        loading_finished = True          # Success: leave the loop
        logger.info("website loaded")
    except TimeoutException:
        logger.warning("timeout - retry")

driver.set_script_timeout(n)             # Set timeout of n seconds for scripts
script_finished = False
while not script_finished:               # Second loop
    try:
        print(driver.execute_script("return document.documentElement.innerText;"))
        script_finished = True
        logger.info("script done")
    except TimeoutException:
        logger.warning("script timeout")

logger.info("if you're still missing html here, increase timeout")
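
The loops above retry forever if the page never loads. A minimal sketch of a bounded variant follows. The helper name `retry`, its parameters, and the default of catching every exception are assumptions made for illustration and are not part of Selenium. With a real driver you would probably pass `exceptions=(TimeoutException,)` from `selenium.common.exceptions`.

```python
import time


def retry(action, attempts=3, delay=0.5, exceptions=(Exception,)):
    """Call `action` until it succeeds or `attempts` tries have failed."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # Pause before the next try
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Example with Selenium (driver and URL as defined in the answer):
# retry(lambda: driver.get(URL), attempts=5, exceptions=(TimeoutException,))
```

Raising after the last attempt makes a persistent failure visible instead of leaving the script spinning.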
