I have been following a tutorial on how to scrape http://kanview.ks.gov/PayRates/PayRates_Agency.aspx. The tutorial can be found here: https://medium.freecodecamp.org/better-web-scraping-in-python-with-selenium-beautiful-soup-and-pandas-d6390592e251. It has a similar layout to the site I want to collect information from: https://www.giiresearch.com/topics/TL11.shtml. My only problem is that the report title links on the giiresearch site do not follow a sequential order. For example, here are two links from giiresearch:
<a href="/report/an806147-fixed-mobile-convergence-from-challenger-operators.html">Fixed-Mobile Convergence from Challenger Operators: Case Studies and Analysis</a>
<a href="/annual/an378138-convergence-strategies.html">Convergence Strategies</a>
The links on the kanview site follow a sequence, for example:
<a id="MainContent_uxLevel2_JobTitles_uxJobTitleBtn_1" href="javascript:__doPostBack('ctl00$MainContent$uxLevel2_JobTitles$ctl03$uxJobTitleBtn','')">Academic Advisor</a>
<a id="MainContent_uxLevel2_JobTitles_uxJobTitleBtn_2" href="javascript:__doPostBack('ctl00$MainContent$uxLevel2_JobTitles$ctl04$uxJobTitleBtn','')">Academic Program Specialist</a>
This means I can't use the method from this line of their code in my project:
python_button = driver.find_element_by_id('MainContent_uxLevel2_JobTitles_uxJobTitleBtn_' + str(x))
I tried finding the elements by class name, but all the links share the same class name "list_title", so the for loop only opens the first link and never moves on. Any ideas?
I'm thinking there should be a way to store the report title links in a list, so that I can open them one by one, retrieve more information about each report, and save it to an Excel sheet.
This is for a project where I want to compile a market-analysis Excel sheet of a competitor's reports with stats such as title, price, publisher, and publication date.
Here is my code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
from tabulate import tabulate
import os
#launch url
url = "https://www.giiresearch.com/topics/TL11.shtml"
# create a new Chrome session
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.implicitly_wait(30)
driver.get(url)
#Selenium hands the page source to Beautiful Soup
soup_level1=BeautifulSoup(driver.page_source, 'lxml')
datalist = [] #empty list
x = 0 #counter
#Beautiful Soup finds all Job Title links on the agency page and the loop begins
for link in soup_level1.find_all("div", {"class": "list_title"}):
    #Selenium visits each report page
    python_button = driver.find_element_by_class_name('list_title')
    python_button.click() #click link
    #Selenium hands off the source of the specific report page to Beautiful Soup
    soup_level2 = BeautifulSoup(driver.page_source, 'lxml')
    #Beautiful Soup grabs the HTML table on the page
    table = soup_level2.find_all('table')[0]
    #Giving the HTML table to pandas to put in a dataframe object
    df = pd.read_html(str(table), header=0)
    #Store the dataframe in a list
    datalist.append(df[0])
    #Ask Selenium to click the back button
    driver.execute_script("window.history.go(-1)")
    #increment the counter variable before starting the loop over
    x += 1
#end loop block
#loop has completed
#end the Selenium browser session
driver.quit()
#combine all pandas dataframes in the list into one big dataframe
result = pd.concat([pd.DataFrame(datalist[i]) for i in range(len(datalist))],ignore_index=True)
#convert the pandas dataframe to JSON
json_records = result.to_json(orient='records')
#pretty print to CLI with tabulate
#converts to an ascii table
print(tabulate(result, headers=["Report Title","Publisher","Published Date","Price"],tablefmt='psql'))
#get current working directory
path = os.getcwd()
#open, write, and close the file
f = open(path + "\\fhsu_payroll_data.json","w") #FHSU
f.write(json_records)
f.close()
You can store the report title links by using a CSS selector that expresses the child relationship between the `a` tag elements and their parent class. Although you can grab each title from the page you visit, you can also collect (url, title) tuples in a list: `[(link.get_attribute('href'), link.text) for link in .......` and then unzip them into separate tuples you can loop over. You can then loop over the `links` and call `driver.get` on each URL. That way you also have the title and date information in case you want to do anything else with it. For example:
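Putting the answer's pieces together, a minimal sketch might look like this. The selector `'.list_title a'` is my assumption, inferred from the `list_title` class in the question, and the function names are my own, not from the original post:

```python
# A sketch of the answer's suggestion: gather every report link up front,
# then navigate to each stored URL instead of clicking and going back.

def collect_report_links(driver):
    """Gather (url, title) tuples for every report link on the current page.

    Assumes each report link is an <a> inside <div class="list_title">,
    so the CSS selector '.list_title a' matches the child <a> of each div.
    """
    return [(a.get_attribute('href'), a.text)
            for a in driver.find_elements_by_css_selector('.list_title a')]

def scrape_reports(url="https://www.giiresearch.com/topics/TL11.shtml"):
    # Imports kept local so the helper above stays usable without Selenium installed.
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager

    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get(url)
    # Collect the links ONCE, before navigating away, so no element goes stale.
    links = collect_report_links(driver)
    urls, titles = zip(*links)  # unzip into two parallel tuples if needed
    for link_url, title in links:
        driver.get(link_url)    # visit each report page directly instead of clicking
        # ... hand driver.page_source to BeautifulSoup here and scrape the details ...
    driver.quit()
    return titles
```

Because every URL is captured before any navigation happens, the loop no longer depends on sequential element IDs or on clicking and pressing back.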