Scraping data from a MarineTraffic page

Posted 2024-06-17 07:59:26

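I am trying to scrape data from this web page: MarineTraffic

I tried ordinary scraping with Python and Selenium, but I could not locate any of the target data (latitude, longitude, speed).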

Is there some special format I am missing?

This is the code I started with:

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless') 
driver = webdriver.Chrome("C:/webdrivers/chromedriver.exe", options=options)
page = driver.page_source

However, a simple CTRL+F text search through the page source does not turn up anything useful.

Do you know how to scrape it?

Thanks


3 answers

First, to use Selenium in headless mode, you need to set the window size explicitly:

options.add_argument('--window-size=1920,1080')

To get the coordinates and the speed, you can use the following:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 20)

coordinates = wait.until(EC.visibility_of_element_located((By.XPATH, "//p[contains(text(),'Latitude')]/b"))).text

speed =  wait.until(EC.visibility_of_element_located((By.XPATH, "//p[contains(text(),'Speed')]/b"))).text

Also, since you are running in headless mode, these settings may be useful:

options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
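Chrome switches like these are easy to mistype, so one option is to collect them in a small helper. This is only a sketch: `build_chrome_args` and `apply_args` are hypothetical names that are not part of Selenium, and the code assumes only that the options object exposes `add_argument`, as `webdriver.ChromeOptions` does.

```python
from typing import List


def build_chrome_args(headless: bool = True, width: int = 1920, height: int = 1080) -> List[str]:
    """Return the Chrome command-line switches for a scraping session."""
    args = [f"--window-size={width},{height}"]
    if headless:
        # The extra switches suggested above for headless runs.
        args += ["--headless", "--no-sandbox", "--disable-dev-shm-usage"]
    return args


def apply_args(options, args: List[str]) -> None:
    """Add each switch to an options object, rejecting any without the leading '--'."""
    for arg in args:
        if not arg.startswith("--"):
            raise ValueError(f"Chrome switch should start with '--': {arg!r}")
        options.add_argument(arg)
```

With Selenium installed, this would be used as `options = webdriver.ChromeOptions()` followed by `apply_args(options, build_chrome_args())`.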

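There are a few things to take care of:

  1. You need to click the accept-cookies button
  2. You need to click the X button, which is sometimes visible and sometimes not
  3. You also need explicit waits

Sample code: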

options = webdriver.ChromeOptions()
options.add_argument("--disable-infobars")
options.add_argument("--start-maximized")
options.add_argument("--disable-extensions")
options.add_argument('--window-size=1920,1080')
options.add_argument("--headless")
# Set both prefs in one call: a second add_experimental_option("prefs", ...) would replace the first.
options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2,
    "profile.default_content_settings.cookies": 2,
})
driver = webdriver.Chrome(options=options)
driver.implicitly_wait(30)
driver.maximize_window()
driver.get("https://www.marinetraffic.com/en/ais/details/ships/shipid:371441/mmsi:310554000/imo:9312456/vessel:STENA_PERROS")
wait = WebDriverWait(driver, 20)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[aria-label='AGREE']"))).click()
try:
    if(len(driver.find_elements(By.XPATH, "//*[name()='svg' and @class='MuiSvgIcon-root']/ancestor::button[contains(@class,'jss17')]"))) >0:
        print("X is visible")
        wait.until(EC.visibility_of_element_located((By.XPATH, "//*[name()='svg' and @class='MuiSvgIcon-root']/ancestor::button[contains(@class,'jss17')]"))).click()
        print("done clicking")
    else:
        print("X was not visible")
except Exception:
    # Keep going even if the optional X button could not be handled.
    print("something went wrong")

print(wait.until(EC.visibility_of_element_located((By.XPATH, "//b//a[contains(@href,'/en/ais/hom')]"))).text)
print(wait.until(EC.visibility_of_element_located((By.XPATH, "//b//a[contains(@href,'/en/ais/hom')]/ancestor::p/following-sibling::p/b"))).text)
print(wait.until(EC.visibility_of_element_located((By.XPATH, "//b//a[contains(@href,'/en/ais/hom')]/ancestor::p/following-sibling::p[2]/b"))).text)

Imports:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Output:

X is visible
done clicking
-1.53057° / -48.77838°
Underway using Engine
1.7 kn / 250 °

Process finished with exit code 0
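The values printed above are plain strings, so they usually need converting before further use. Here is a minimal sketch of that step, assuming the text always has the shapes shown in the output ('-1.53057° / -48.77838°' and '1.7 kn / 250 °'). The function names are hypothetical, and treating the second number on the speed line as a direction in degrees is an assumption.

```python
import re
from typing import Tuple

# Signed decimal number, e.g. -48.77838 or 250
_NUMBER = r"[-+]?\d+(?:\.\d+)?"


def parse_coordinates(text: str) -> Tuple[float, float]:
    """Parse a string such as '-1.53057° / -48.77838°' into (latitude, longitude)."""
    match = re.fullmatch(rf"\s*({_NUMBER})\s*°\s*/\s*({_NUMBER})\s*°\s*", text)
    if match is None:
        raise ValueError(f"unexpected coordinate format: {text!r}")
    return float(match.group(1)), float(match.group(2))


def parse_speed_and_direction(text: str) -> Tuple[float, float]:
    """Parse a string such as '1.7 kn / 250 °' into (knots, degrees)."""
    match = re.fullmatch(rf"\s*({_NUMBER})\s*kn\s*/\s*({_NUMBER})\s*°\s*", text)
    if match is None:
        raise ValueError(f"unexpected speed format: {text!r}")
    return float(match.group(1)), float(match.group(2))
```

Raising `ValueError` on an unexpected format makes a change in the page layout show up immediately instead of silently producing wrong numbers.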

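If you open the page in a browser and record its network traffic, you will notice several XHR (HTTP GET) requests to various API endpoints. Their responses are JSON and contain the information you are looking for. All you have to do is imitate those requests; there is no need for BeautifulSoup or Selenium: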

def get_ship_position(ship_id):
    import requests

    url = "https://www.marinetraffic.com/vesselDetails/latestPosition/shipid:{}".format(ship_id)

    headers = {
        "accept": "application/json",
        "accept-encoding": "gzip, deflate",
        "user-agent": "Mozilla/5.0",
        "x-requested-with": "XMLHttpRequest"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    return response.json()


def main():

    from datetime import datetime

    data = get_ship_position("371441")
    ts = datetime.utcfromtimestamp(data["lastPos"])
    print("Last known position: {} / {} @ {}".format(data["lat"], data["lon"], ts))
    
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

Last known position: -1.53057 / -48.77838 @ 2021-08-04 10:33:33
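If the position is going to be passed around, it may help to turn the decoded JSON into a typed record first. The sketch below assumes the response contains the keys `lat`, `lon` and `lastPos` used above, and that `lastPos` is a Unix timestamp in seconds. `ShipPosition` and `parse_position` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Mapping


@dataclass(frozen=True)
class ShipPosition:
    latitude: float
    longitude: float
    timestamp: datetime  # timezone-aware, in UTC


def parse_position(data: Mapping[str, Any]) -> ShipPosition:
    """Build a ShipPosition from the decoded JSON, checking the expected keys."""
    missing = [key for key in ("lat", "lon", "lastPos") if key not in data]
    if missing:
        raise KeyError(f"response is missing keys: {missing}")
    return ShipPosition(
        latitude=float(data["lat"]),
        longitude=float(data["lon"]),
        # lastPos is assumed to be Unix epoch seconds, as in the answer above.
        timestamp=datetime.fromtimestamp(float(data["lastPos"]), tz=timezone.utc),
    )
```

It would be called as `parse_position(get_ship_position("371441"))`. Using a timezone-aware `datetime` avoids ambiguity about whether the time is local or UTC.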