I want to scrape the table at:
https://www2.sgx.com/securities/annual-reports-financial-statements
I understand this is possible by inspecting the network requests and finding an API call like https://api.sgx.com/financialreports/v1.0?pagestart=3&pagesize=250&params=id,companyName,documentDate,securityName,title,url but I would like to know whether I can get all the data straight from the table without going that route, because otherwise I would need to parse 16 JSON files.
When I try Selenium, I can only reach the end of the visible table (after clicking "Clear All" on the left, the table becomes much larger, and that is all the data I need).
Any ideas are welcome!
Edit: here is the code; it returns only 144 of the thousands of cells in the table:
from time import sleep # to wait for stuff to finish.
from selenium import webdriver # to interact with our site.
from selenium.common.exceptions import WebDriverException # url is wrong
from webdriver_manager import chrome # to install and find the chromedriver executable
BASE_URL = 'https://www2.sgx.com/securities/annual-reports-financial-statements'
driver = webdriver.Chrome(executable_path=chrome.ChromeDriverManager().install())
driver.maximize_window()
try:
driver.get(BASE_URL)
except WebDriverException:
print("Url given is not working, please try again.")
exit()
# clicking away pop-up
sleep(5)
header = driver.find_element_by_id("website-header")
driver.execute_script("arguments[0].click();", header)
# clicking the clear all button, to clear the calendar
sleep(2)
clear_field = driver.find_element_by_xpath('/html/body/div[1]/main/div[1]/article/template-base/div/div/sgx-widgets-wrapper/widget-filter-listing/widget-filter-listing-financial-reports/section[2]/div[1]/sgx-filter/sgx-form/div[2]/span[2]')
clear_field.click()
# clicking to select only Annual Reports
sleep(2)
driver.find_element_by_xpath("/html/body/div[1]/main/div[1]/article/template-base/div/div/sgx-widgets-wrapper/widget-filter-listing/widget-filter-listing-financial-reports/section[2]/div[1]/sgx-filter/sgx-form/div[1]/div[1]/sgx-input-select/label/span[2]/input").click()
sleep(1)
driver.find_element_by_xpath("//span[text()='Annual Report']").click()
rows = driver.find_elements_by_class_name("sgx-table-cell")
print(len(rows))
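The table appears to be virtualized: only the rows inside the visible viewport exist in the DOM at any moment, which would explain why only 144 cells are found. One workaround is to scroll the table's inner container in steps and collect the rendered rows after each step. A minimal sketch; the scrollable element and its selector are assumptions that have to be confirmed against the live DOM:

```python
def scroll_offsets(total_height, viewport_height):
    """Successive scrollTop values that cover the whole table, one viewport at a time."""
    offsets, offset = [], 0
    while offset < total_height:
        offsets.append(offset)
        offset += viewport_height
    return offsets

# Hypothetical usage with the driver from the question; the "sgx-table"
# selector is an assumption -- inspect the DOM to find the real scroll container.
# body = driver.find_element_by_css_selector("sgx-table")
# total = driver.execute_script("return arguments[0].scrollHeight;", body)
# view = driver.execute_script("return arguments[0].clientHeight;", body)
# for off in scroll_offsets(total, view):
#     driver.execute_script("arguments[0].scrollTop = arguments[1];", body, off)
#     sleep(1)  # give the widget time to render the newly visible rows
#     # collect the currently rendered cells here
```

Because a virtualized table recycles DOM nodes, the cells collected at each step must be de-duplicated (for example, keyed by document id) before counting.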
I know you asked for a way that does not use the API, but I think using it is the cleaner approach.
(This returns 3709 documents.)
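For reference, a sketch of paging through that API with requests. The endpoint and field list come from the URL in the question, but the shape of the JSON response (rows under a "data" key) is an assumption and should be checked against one real response:

```python
import requests

API = "https://api.sgx.com/financialreports/v1.0"
FIELDS = "id,companyName,documentDate,securityName,title,url"

def page_url(page_start, page_size=250):
    """Rebuild the paginated URL seen in the browser's network tab."""
    return f"{API}?pagestart={page_start}&pagesize={page_size}&params={FIELDS}"

def fetch_all(page_size=250):
    """Request page after page until the API stops returning rows."""
    rows, page = [], 0
    while True:
        resp = requests.get(page_url(page, page_size), timeout=30)
        resp.raise_for_status()
        batch = resp.json().get("data", [])  # "data" key is an assumption
        if not batch:
            break
        rows.extend(batch)
        page += 1
    return rows
```

At 250 rows per page, 3709 documents means 15 non-empty pages, which lines up with the "16 JSON files" estimate in the question.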