I have been following a tutorial on how to scrape http://kanview.ks.gov/PayRates/PayRates_Agency.aspx. The tutorial can be found here: https://medium.freecodecamp.org/better-web-scraping-in-python-with-selenium-beautiful-soup-and-pandas-d6390592e251. It has a similar layout to the site I want to collect information from: https://www.giiresearch.com/topics/TL11.shtml. My only problem is that the report title links on the giiresearch site do not follow a sequential order. For example, here are two links from giiresearch:
<a href="/report/an806147-fixed-mobile-convergence-from-challenger-operators.html">Fixed-Mobile Convergence from Challenger Operators: Case Studies and Analysis</a>
<a href="/annual/an378138-convergence-strategies.html">Convergence Strategies</a>
The links on the kanview site follow a sequence, for example:
<a id="MainContent_uxLevel2_JobTitles_uxJobTitleBtn_1" href="javascript:__doPostBack('ctl00$MainContent$uxLevel2_JobTitles$ctl03$uxJobTitleBtn','')">Academic Advisor</a>
<a id="MainContent_uxLevel2_JobTitles_uxJobTitleBtn_2" href="javascript:__doPostBack('ctl00$MainContent$uxLevel2_JobTitles$ctl04$uxJobTitleBtn','')">Academic Program Specialist</a>
This means I can't use the method from this line of their code in my project:
python_button = driver.find_element_by_id('MainContent_uxLevel2_JobTitles_uxJobTitleBtn_' + str(x))
I tried finding the elements by class name, but all the links share the same class name "list_title", so the for loop only opens the first link and never moves on. Any ideas?
I'm thinking there should be a way to store the report title links in a list, so that I can open them one by one, retrieve more information about each report, and save it to an Excel sheet.
This is for a project where I want to compile a market-analysis Excel sheet of a competitor's reports with stats such as title, price, publisher, and publication date.
Here is my code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
from tabulate import tabulate
import os
#launch url
url = "https://www.giiresearch.com/topics/TL11.shtml"
# create a new Chrome session
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.implicitly_wait(30)
driver.get(url)
#Selenium hands the page source to Beautiful Soup
soup_level1=BeautifulSoup(driver.page_source, 'lxml')
datalist = [] #empty list
x = 0 #counter
#Beautiful Soup finds all Job Title links on the agency page and the loop begins
for link in soup_level1.find_all("div", {"class": "list_title"}):
    #Selenium visits each report page
    python_button = driver.find_element_by_class_name('list_title')
    python_button.click() #click link
    #Selenium hands off the source of the specific report page to Beautiful Soup
    soup_level2 = BeautifulSoup(driver.page_source, 'lxml')
    #Beautiful Soup grabs the HTML table on the page
    table = soup_level2.find_all('table')[0]
    #Giving the HTML table to pandas to put in a dataframe object
    df = pd.read_html(str(table), header=0)
    #Store the dataframe in a list
    datalist.append(df[0])
    #Ask Selenium to click the back button
    driver.execute_script("window.history.go(-1)")
    #increment the counter variable before starting the loop over
    x += 1
#end loop block
#loop has completed
#end the Selenium browser session
driver.quit()
#combine all pandas dataframes in the list into one big dataframe
result = pd.concat([pd.DataFrame(datalist[i]) for i in range(len(datalist))],ignore_index=True)
#convert the pandas dataframe to JSON
json_records = result.to_json(orient='records')
#pretty print to CLI with tabulate
#converts to an ascii table
print(tabulate(result, headers=["Report Title","Publisher","Published Date","Price"],tablefmt='psql'))
#get current working directory
path = os.getcwd()
#open, write, and close the file
f = open(path + "\\fhsu_payroll_data.json","w") #FHSU
f.write(json_records)
f.close()
You can store the report title links by using a CSS selector that expresses the child relationship between the `a` tag elements and their parent class. Although you can grab each title from the page you visit, you can also collect (url, title) tuples in a list: `[(link.get_attribute('href'), link.text) for link in .......` and then unzip them into separate tuples you can loop over. You can then loop over the `links` and call `driver.get` on each URL. That way you also have the title and date information in case you want to do anything else with it. For example:
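Putting the answer's pieces together, a minimal sketch might look like this. The selector `'.list_title a'` is my assumption, inferred from the `list_title` class in the question, and the function names are my own, not from the original post:

```python
# A sketch of the answer's suggestion: gather every report link up front,
# then navigate to each stored URL instead of clicking and going back.

def collect_report_links(driver):
    """Gather (url, title) tuples for every report link on the current page.

    Assumes each report link is an <a> inside <div class="list_title">,
    so the CSS selector '.list_title a' matches the child <a> of each div.
    """
    return [(a.get_attribute('href'), a.text)
            for a in driver.find_elements_by_css_selector('.list_title a')]

def scrape_reports(url="https://www.giiresearch.com/topics/TL11.shtml"):
    # Imports kept local so the helper above stays usable without Selenium installed.
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager

    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get(url)
    # Collect the links ONCE, before navigating away, so no element goes stale.
    links = collect_report_links(driver)
    urls, titles = zip(*links)  # unzip into two parallel tuples if needed
    for link_url, title in links:
        driver.get(link_url)    # visit each report page directly instead of clicking
        # ... hand driver.page_source to BeautifulSoup here and scrape the details ...
    driver.quit()
    return titles
```

Because every URL is captured before any navigation happens, the loop no longer depends on sequential element IDs or on clicking and pressing back.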