Summary: given the web response to a query string submitted via selenium, I can't get requests to fetch the hrefs, nor page through the thousands of results (only the first 20 articles are shown).
I'm using my local library's website to connect to a paid online subscription database run by Infotrac, called the "Florida Newspaper Database". Initially, I use Python and selenium to run a webdriver instance: log in to the local library site to pick up its params, then open the main Infotrac site to capture its params, open the Florida Newspaper Database site, and submit a search string. I turned to selenium because I couldn't get requests to do the job.
All of this is inelegant, to say the least. Once I get a response back from the Florida Newspaper Database, though, I face two obstacles I haven't been able to overcome. The response to my query, in this case "byline john romano", produces more than 3,000 articles, all of which I'd like to download programmatically. I've tried to get requests to handle the downloads, but with no success so far.
The initial response page for the search string only shows links (hrefs) to the first 20 articles. Using BeautifulSoup I can capture those URLs in a list, but I haven't succeeded in using requests to fetch the href pages. And even if I could, I'd still face the pagination problem: 20 articles shown at a time, out of thousands.
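One way to frame the pagination obstacle: if the results URL exposes an offset or page parameter, requests can walk it in a loop. The parameter name below ("currentPosition") and the page size of 20 are assumptions for illustration only, not Gale's documented interface; inspect the site's actual "next page" links to find the real parameter.

```python
def page_params(base_params, total_hits, page_size=20):
    """Yield one params dict per results page, differing only in the offset.

    "currentPosition" is a placeholder name, not a confirmed Gale parameter.
    """
    for start in range(1, total_hits + 1, page_size):
        params = dict(base_params)
        params["currentPosition"] = start
        yield params

# For a hypothetical 90-hit search, the pages would start at offsets:
offsets = [p["currentPosition"] for p in page_params({"searchId": "R1"}, 90)]
# offsets == [1, 21, 41, 61, 81]
```

Each dict from the generator would then be passed as `params=` to `session.get()` to fetch one results page at a time.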
While I like the idea of requests, it has been a chore to learn and work with. Reading the docs only goes so far. I bought a requests book from Packt Publishing and found it dreadful. Does anyone have a requests reading list?
import requests
from requests import Session
from bs4 import BeautifulSoup
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
# opening the library page and finding the input elements
browser = webdriver.Firefox()
browser.get("https://pals.polarislibrary.com/polaris/logon.aspx")
username = browser.find_element_by_id("textboxBarcodeUsername")
password = browser.find_element_by_id("textboxPassword")
button = browser.find_element_by_id("buttonSubmit")
# inputing username and password
username.send_keys("25913000925235")
password.send_keys("9963")
button.send_keys(Keys.ENTER)
# opening the infotract page with the right cookies in the browser url
browser.get("http://infotrac.galegroup.com/itweb/palm83799?db=SP19")
# finding the input elements, first username
idFLNDB = browser.find_element_by_name("id")
idFLNDB.send_keys("25913000925235")
# finding the "Proceed" button by xpath because there's no name or id and clicking it
submit = browser.find_element_by_xpath("//input[@type='submit']")
submit.send_keys(Keys.ENTER)
# now get the Florida Newspaper Database page, find input element
searchBox = browser.find_element_by_id("inputFieldValue_0")
homepage = browser.find_element_by_id("homepage_submit")
# input your search string
searchTopic = input("Type in your search string: ")
searchBox.send_keys(searchTopic)
homepage.send_keys(Keys.ENTER)
# get the cookies from selenium's webbrowser instance
cookies = browser.get_cookies()
# open up a requests session
s = requests.Session()
# get the cookies from selenium to requests
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])
searchTopic1 = searchTopic.replace(' ', '+')
# This is the param from the main search page
payload = {
"inputFieldValue(0)": searchTopic1,
"inputFieldName(0)": "OQE",
"nwf": "y",
"searchType": "BasicSearchForm",
"userGroupName": "palm83799",
"prodId": "SPJ.SP19",
"method": "doSearch",
"dblist": "",
"standAloneLimiters": "LI",
}
current_url = browser.current_url
# use params= so the payload is sent as the query string of the GET request
response = s.get(current_url, params=payload)
print("This is the status code:", response.status_code)
print("This is the current url:", current_url)
# This gives you BeautifulSoup object
soup = BeautifulSoup(response.content, "lxml")
# This gives you all of the article tags
links = soup.find_all(class_="documentLink")
# This next portion gives you the href values from the article tags as a list titled linksUrl
linksUrl = []
for link in links:
    linksUrl.append(link['href'])
# These are the param's from the article links off of the basic search page
payload2 = {
"sort": "DA-SORT",
"docType": "Column",
"tabID": "T004",
"prodId": "SPJ.SP19",
"searchId": "R1",
"resultType": "RESULT_LIST",
"searchType": "BasicSearchForm"
}
# These are the request headers from a single article that I opened
articlePayload = {
    "Host": "code.jquery.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip,deflate",
    "Referer": "http://askalibrarian.org/widgets/gale/statewide",
    "Connection": "keep-alive"
}
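Once the hrefs have been collected, a sketch like the following could fetch each article through the already-authenticated session. The hrefs on result pages are often relative, so they are resolved against the page they came from first; the example URLs here are placeholders, not real Gale endpoints.

```python
from urllib.parse import urljoin

def resolve_links(current_url, hrefs):
    """Resolve possibly-relative hrefs against the page they came from."""
    return [urljoin(current_url, h) for h in hrefs]

def download_articles(session, urls):
    """Fetch each article URL with the authenticated requests session.

    Returns the HTML body of every page that responded with a 2xx status.
    """
    pages = []
    for url in urls:
        r = session.get(url)
        if r.ok:
            pages.append(r.text)
    return pages

# The resolution step, shown with placeholder URLs (no network needed):
resolved = resolve_links("http://example.com/ps/i.do",
                         ["retrieve.do?docId=1", "/other/page"])
# resolved == ["http://example.com/ps/retrieve.do?docId=1",
#              "http://example.com/other/page"]
```

`download_articles(s, resolved)` would then reuse the cookies copied from selenium, which is the same idea the cookie-transfer loop above relies on.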
I created a PoC to help you understand how to use the requests library.
You can adapt the code to grab the specific data you're interested in.
The code is commented, so I won't explain much beyond it. If you have any further questions, though, let me know.
It will output something like the following (this is only a sample of the output, to avoid pasting too much data):
Finally, as suggested in the comments, you can:
I hope this helps you better understand how the requests library works.
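The answer's PoC code did not survive extraction, but the pattern it describes, handing selenium's cookies to a `requests.Session` and reusing that session for follow-up requests, can be sketched minimally. The cookie dicts below mirror the shape returned by `webdriver.get_cookies()`; the values are made up for illustration.

```python
import requests

def session_from_cookies(cookie_list):
    """Build a requests.Session preloaded with selenium-style cookies.

    cookie_list: a list of dicts with at least 'name' and 'value' keys,
    as returned by webdriver.get_cookies().
    """
    s = requests.Session()
    for c in cookie_list:
        s.cookies.set(c['name'], c['value'])
    return s

# Illustrative cookie dict only; a real one comes from browser.get_cookies():
s = session_from_cookies([{'name': 'sid', 'value': 'abc123'}])
# s.cookies.get('sid') == 'abc123'
```

From there, `s.get(...)` carries the authenticated cookies on every request, which is what lets requests take over once selenium has handled the login.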