How do I trigger a file download from a website using Python?

Posted 2024-04-20 03:21:51


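I'm trying to set up a script that pulls data from a website every day, but I'm struggling to get Python to actually read the table (I'm not a professional programmer). I've tried two approaches:

1) Scraping the table with Beautiful Soup (headers, rows, etc.), and

2) Using the site's Excel export button

Here is the exact site: https://scgenvoy.sempra.com/index.html#nav=/Public/ViewExternalLowOFO.getLowOFO%3Frand%3D200

My code so far: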

#Imports
import requests
import urllib.request
import pandas as pd
from lxml import html
import lxml.html as lh
from bs4 import BeautifulSoup
URL = 'https://scgenvoy.sempra.com/index.html#nav=/Public/ViewExternalLowOFO.getLowOFO%3Frand%3D200'

#Create a handle, page, to handle the contents of the website
requests.packages.urllib3.disable_warnings()
page = requests.get(URL, verify=False)

I think the easiest approach would be to click the export link, which sits at this XPath:

//*[@id="content"]/form/div[2]/div/table/tbody/tr/td[4]/table/tbody/tr/td[1]/a

Any help is greatly appreciated!


3 Answers

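I would try to identify the API behind "Export to Excel" and call it directly. You can find it with your browser's developer tools. For example, here is what Google Chrome's Copy as cURL gives: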

curl 'https://scgenvoy.sempra.com/Public/ViewExternalLowOFO.submitLowOfoSaveAs' -H 'Connection: keep-alive' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache' -H 'Origin: https://scgenvoy.sempra.com' -H 'Upgrade-Insecure-Requests: 1' -H 'DNT: 1' -H 'Content-Type: application/x-www-form-urlencoded' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3' -H 'Referer: https://scgenvoy.sempra.com/index.html' -H 'Accept-Encoding: gzip, deflate, br' -H 'Accept-Language: en-US,en;q=0.9' -H 'Cookie: FAROFFSession=537EB1587E4A063416D5F2206890A2B6.managed2' --data 'FileName=LowOFO05302019Cycle2&Class=com.sempra.krypton.common.saveas.constants.FancyExcelExportType&pageSize=letter&pageOrientation=portrait&HiddenGasFlowDateField=05%2F30%2F2019&HiddenCycleField=2&gasFlowDate=05%2F30%2F2019&cycle=2' --compressed

The API URL is https://scgenvoy.sempra.com/Public/ViewExternalLowOFO.submitLowOfoSaveAs

The input parameters are:

FileName: LowOFO05302019Cycle2
Class: com.sempra.krypton.common.saveas.constants.FancyExcelExportType
pageSize: letter
pageOrientation: portrait
HiddenGasFlowDateField: 05/30/2019
HiddenCycleField: 2
gasFlowDate: 05/30/2019
cycle: 2

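The request method is POST.

You can now make this request with Python's requests library, passing appropriate values for the parameters. (BeautifulSoup only parses HTML; it does not send requests.)

This is meant to give you the idea rather than a finished solution.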
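As a rough illustration of the idea above, here is a minimal sketch that rebuilds the captured form body and posts it. It uses only the standard library; the same dictionary could be passed to `requests.post(EXPORT_URL, data=form)` instead. The helper names `build_export_form` and `download_export` are hypothetical, the `FileName` pattern is inferred from the single captured example, and whether the server also requires the session cookie from the curl command has not been checked here.

```python
from datetime import date
from urllib.parse import urlencode
from urllib.request import Request, urlopen

EXPORT_URL = "https://scgenvoy.sempra.com/Public/ViewExternalLowOFO.submitLowOfoSaveAs"


def build_export_form(flow_date: date, cycle: int) -> dict:
    """Build the form fields seen in the captured browser request."""
    date_text = flow_date.strftime("%m/%d/%Y")
    return {
        # Assumption: the name follows LowOFO + MMDDYYYY + "Cycle" + cycle,
        # as in the one captured example (LowOFO05302019Cycle2).
        "FileName": f"LowOFO{flow_date.strftime('%m%d%Y')}Cycle{cycle}",
        "Class": "com.sempra.krypton.common.saveas.constants.FancyExcelExportType",
        "pageSize": "letter",
        "pageOrientation": "portrait",
        "HiddenGasFlowDateField": date_text,
        "HiddenCycleField": str(cycle),
        "gasFlowDate": date_text,
        "cycle": str(cycle),
    }


def download_export(flow_date: date, cycle: int, out_path: str) -> None:
    """POST the form and write the raw response bytes to out_path."""
    body = urlencode(build_export_form(flow_date, cycle)).encode("ascii")
    request = Request(
        EXPORT_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    # Assumption: the endpoint answers without the browser's session cookie.
    with urlopen(request, timeout=30) as response, open(out_path, "wb") as out_file:
        out_file.write(response.read())
```

Calling `download_export(date(2019, 5, 30), 2, out_path)` with a path of your choice would request the same export as the curl command. The file extension to use depends on what the server actually returns.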

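Your website adds the table data and the export button dynamically, so you need the Selenium package to handle the dynamic content. Download the Selenium web driver that matches your browser.

For the Chrome browser: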

http://chromedriver.chromium.org/downloads

Install the web driver for Chrome:

unzip ~/Downloads/chromedriver_linux64.zip -d ~/Downloads
chmod +x ~/Downloads/chromedriver
sudo mv -f ~/Downloads/chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver

Selenium tutorial:

https://selenium-python.readthedocs.io/

Export the Excel file:

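from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

# Selenium 4 style: the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service('/usr/bin/chromedriver'))
driver.get('https://scgenvoy.sempra.com/index.html#nav=/Public/ViewExternalLowOFO.getLowOFO%3Frand%3D200')
time.sleep(3)  # give the dynamic table time to load
excel_button = driver.find_element(By.XPATH, "//div[@id='content']/form/div[2]/div/table/tbody/tr/td[4]/table/tbody/tr/td[2]/a")

# click() returns None, so there is nothing useful to print
excel_button.click()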

Here "/usr/bin/chromedriver" is the path to the Chrome web driver.
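To see why the plain `requests.get` call in the question cannot return the table, note that everything after `#` in the URL is a fragment. Browsers keep the fragment on the client side and never send it to the server, so the request only fetches `/index.html` and the page's JavaScript loads the table afterwards. A small sketch using the standard library to split the question's URL:

```python
from urllib.parse import urlsplit, unquote

URL = "https://scgenvoy.sempra.com/index.html#nav=/Public/ViewExternalLowOFO.getLowOFO%3Frand%3D200"

parts = urlsplit(URL)

# The part the server actually receives in the request line
print(parts.path)

# The part that only the browser's JavaScript ever sees
print(unquote(parts.fragment))
```

This is why a real browser driven by Selenium, or a direct call to the export endpoint, is needed here.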

Here is my code:

import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

## Input parameters
start_date = '5/28/19'
end_date = '5/31/19'

#### Loops through date range and pulls data
## Date Range ##
datelist = pd.date_range(start=start_date, end=end_date, freq='D')
print(datelist)

# Opens Chrome and opens up Gas Envoy (Selenium 4 style driver setup)
driver = webdriver.Chrome(service=Service('C:/Users/tmrt/Documents/chromedriver_win32/chromedriver.exe'))

driver.get('https://scgenvoy.sempra.com/index.html#nav=/Public/ViewExternalLowOFO.getLowOFO%3Frand%3D200')

# Pause to give the page time to load
time.sleep(5)

# Loops through the dates
for d in datelist:
    # Finds Date Box and Date Box Go Button
    date_box = driver.find_element(By.XPATH, '//*[@id="content"]/form/div[2]/table/tbody/tr/td[1]/table/tbody/tr/td[2]/input')
    date_clicker = driver.find_element(By.XPATH, '//*[@id="content"]/form/div[2]/table/tbody/tr/td[2]/table/tbody/tr/td/a')

    # Input date into datebox
    date_box.clear()
    date_box.send_keys(d.strftime("%m/%d/%Y"))

    # Click the Go button next to the date box
    date_clicker.click()

    # Pause to allow to load
    time.sleep(5)

    # Clicks download
    csv_button = driver.find_element(By.XPATH, '//*[@id="content"]/form/div[2]/div/table/tbody/tr/td[4]/table/tbody/tr/td[1]/a')
    csv_button.click()

driver.close()
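One thing the script above leaves open is whether the last download has finished before `driver.close()` runs. Below is a minimal sketch of one way to check, assuming Chrome saves into a folder you know and uses its usual `.crdownload` suffix for files still in progress. `wait_for_download` and its parameters are hypothetical names, not part of Selenium.

```python
import time
from pathlib import Path


def wait_for_download(folder, known_files, timeout=30.0, poll=0.5):
    """Return the first new, fully written file in folder, or None on timeout.

    known_files is the set of file names present before the click, so
    older downloads are not mistaken for the new one.
    """
    folder = Path(folder)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new_files = [
            p for p in folder.iterdir()
            if p.is_file()
            and p.name not in known_files
            # Chrome keeps this suffix until the download is complete
            and p.suffix != ".crdownload"
        ]
        if new_files:
            return new_files[0]
        time.sleep(poll)
    return None
```

In the loop, one could record `known = {p.name for p in Path(download_dir).iterdir()}` before `csv_button.click()` and call `wait_for_download(download_dir, known)` right after it, where `download_dir` is whichever folder your Chrome profile saves to.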
