How to reveal hidden information on a web page

3 answers

User

Answer 1 · edited 2024-06-01 05:59:23

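What you are running into is a common problem in web scraping.

The page at https://pub.fsa.gov.ru/ral/view/8/applicant loads a JavaScript file from https://pub.fsa.gov.ru/main.73d6a501bd7bda31d5ec.js, and that file is responsible for loading the content dynamically.

The root of the problem is that urllib3, requests, and every other plain HTTP client in Python do not execute the JavaScript on the page. You only get the initial response the server sends, which in many cases does not contain the information you need.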
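
To make this concrete, here is a minimal sketch that uses only the standard library. `SAMPLE_INITIAL_HTML` is a hypothetical stand-in for the kind of initial response a JavaScript-driven page tends to return (an empty `fgis-root` element plus a script tag). It is not the actual response from pub.fsa.gov.ru, and `StaticPageInspector` and `inspect_static_html` are names made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical initial response: an empty application root plus the script
# that would fill it in once a browser executes it.
SAMPLE_INITIAL_HTML = """<!doctype html>
<html>
  <head><title>FSA</title></head>
  <body>
    <fgis-root></fgis-root>
    <script src="main.73d6a501bd7bda31d5ec.js"></script>
  </body>
</html>"""


class StaticPageInspector(HTMLParser):
    """Collects script URLs and visible body text from static HTML."""

    def __init__(self):
        super().__init__()
        self.script_sources = []
        self.body_text = []
        self._in_body = False
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag == "script":
            self._in_script = True
            src = dict(attrs).get("src")
            if src:
                self.script_sources.append(src)

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Keep only non-blank text that sits in <body> outside of <script>.
        if self._in_body and not self._in_script and data.strip():
            self.body_text.append(data.strip())


def inspect_static_html(html):
    """Return (script_sources, body_text) for the given HTML string."""
    inspector = StaticPageInspector()
    inspector.feed(html)
    inspector.close()
    return inspector.script_sources, inspector.body_text


if __name__ == "__main__":
    scripts, text = inspect_static_html(SAMPLE_INITIAL_HTML)
    print("scripts referenced:", scripts)
    print("visible body text:", text)
```

For markup like this sample, the parser finds the script reference but no visible text, which is the situation a plain HTTP client leaves you in.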

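The solution is to use Selenium. It lets you programmatically drive a real browser such as Chrome or Firefox, and the browser actually renders the result.

You did not say which specific information you are trying to scrape from this site, so my advice is to use an explicit wait until the element you expect to find is present in the DOM. You can find more about waits in the Selenium documentation.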
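
As a rough mental model, an explicit wait is a polling loop with a deadline: check a condition, sleep briefly, and give up once the timeout has passed. The sketch below illustrates that idea in plain Python. It is not Selenium's implementation, and `wait_until` and its parameters are hypothetical names chosen for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if no truthy value was seen before the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


if __name__ == "__main__":
    attempts = {"count": 0}

    def element_appears_on_third_check():
        # Stand-in for "look the element up in the DOM".
        attempts["count"] += 1
        return "element" if attempts["count"] >= 3 else None

    print(wait_until(element_appears_on_third_check, timeout=2.0, poll_interval=0.1))
```

With Selenium you would not write this loop yourself; `WebDriverWait(driver, 10).until(...)` plays that role and raises `TimeoutException` when the condition is never met.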

Usage example

You should adapt this code to scrape the data you actually want.

# Imports
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Constants
URL = 'https://pub.fsa.gov.ru/ral/view/8/applicant'
ELEMENT_XPATH = '/html/body/fgis-root/div/fgis-ral/fgis-card-view/div/div/fgis-view-applicant/fgis-card-block/div/div[2]'

def main():
    options = Options()
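    # Run Chrome without a visible window. The older `options.headless = True`
    # setter was removed in later Selenium 4 releases.
    options.add_argument('--headless')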
    driver = webdriver.Chrome(options=options)
    driver.get(URL)
    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, ELEMENT_XPATH))
        )
        print(element.text) 
    except TimeoutException:
        print("Could not find the desired element")
    finally:
        driver.quit()

if __name__ == '__main__':
    main()

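User

Answer 2 · edited 2024-06-01 05:59:23

You can reproduce the GET request yourself. The details below come from the network traffic observed in the Network tab of the browser dev tools (F12) while the page loads. The authorization token and session id may be time-limited. You can use a Session to take care of the cookies by first requesting the page URL in that same Session.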

import requests
import urllib3

# Silence the warning that requests emits for every verify=False call below.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


headers = {
    'Pragma': 'no-cache',
    'DNT': '1',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'lkId': '',
    'Accept': 'application/json, text/plain, */*',
    'Cache-Control': 'no-cache',
    'Authorization': 'Bearer eyJhbGciOiJIUzUxMiJ9.eyJpc3MiOiI5ZDhlNWJhNy02ZDg3LTRiMWEtYjZjNi0xOWZjMDJlM2QxZWYiLCJzdWIiOiJhbm9ueW1vdXMiLCJleHAiOjE1NjMyMzUwNjZ9.OnUcjrEXUsrmFyDBpgvhzznHMFicEknSDkjCyxaugO5z992H-McRRD9bfwNl7xMI3dm2HtdAPuTu3nnFzgCLuQ',
    'Connection': 'keep-alive',
    'Referer': 'https://pub.fsa.gov.ru/ral/view/8/applicant',
    'orgId': '',
}

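with requests.Session() as s:
    # Load the page first so the session picks up any cookies the API expects.
    s.get('https://pub.fsa.gov.ru/ral/view/8/applicant', verify=False)
    # Then call the JSON endpoint that the page itself requests.
    r = s.get('https://pub.fsa.gov.ru/api/v1/ral/common/companies/8', headers=headers, verify=False)
    print(r.json())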
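
Since the token may be time-limited, it can help to check when it expires before reusing it. The Authorization value above has the three-part shape of a JWT, so a minimal sketch, assuming the middle segment is base64url-encoded JSON with an `exp` claim, could look like the following. `decode_jwt_payload` and `is_expired` are hypothetical helper names, and the signature is not verified.

```python
import base64
import json
import time

# Token copied from the Authorization header above, without the "Bearer " prefix.
TOKEN = (
    "eyJhbGciOiJIUzUxMiJ9."
    "eyJpc3MiOiI5ZDhlNWJhNy02ZDg3LTRiMWEtYjZjNi0xOWZjMDJlM2QxZWYiLCJzdWIiOiJhbm9ueW1vdXMiLCJleHAiOjE1NjMyMzUwNjZ9."
    "OnUcjrEXUsrmFyDBpgvhzznHMFicEknSDkjCyxaugO5z992H-McRRD9bfwNl7xMI3dm2HtdAPuTu3nnFzgCLuQ"
)


def decode_jwt_payload(token):
    """Return the claims of a JWT as a dict, without verifying the signature."""
    payload_segment = token.split(".")[1]
    # base64url segments in a JWT drop their "=" padding, so restore it.
    padding = "=" * (-len(payload_segment) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_segment + padding))


def is_expired(token, now=None):
    """Return True if the token's `exp` claim (Unix seconds) is in the past."""
    exp = decode_jwt_payload(token).get("exp")
    if exp is None:
        return False  # no expiry claim to check
    current = time.time() if now is None else now
    return current >= exp


if __name__ == "__main__":
    print(decode_jwt_payload(TOKEN))
    print("expired:", is_expired(TOKEN))
```

For this particular token the `exp` claim is a Unix timestamp from July 2019, so it has long expired. Copy a fresh value from the Network tab before running the request.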

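User

Answer 3 · edited 2024-06-01 05:59:23

Because the content you are looking for is generated by JavaScript, you need to emulate a browser. You can use selenium to do the following: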

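from selenium import webdriver
from selenium.webdriver.common.by import By

with webdriver.Firefox() as driver:  # e.g. using the Firefox webdriver
    driver.get('your_url_here')
    # find_elements_by_tag_name was removed in Selenium 4; use find_elements with By.TAG_NAME.
    elements = driver.find_elements(By.TAG_NAME, 'fgis-root')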

You can also check the selenium documentation for all the available methods it provides for locating elements on a page.

