无法通过Requests跨过分页

1条回答

网友

1楼 · 发布于 2024-04-19 19:59:32

我创建了一个PoC来帮助您理解如何使用请求库。在

This script only scrapes:
title and link of every news/article within every page of the search results for the provided keyword(s)

您可以调整代码以获取您感兴趣的特定数据。在

代码有注释，所以我不会在代码之外解释太多。不过，如果你还有什么问题，请告诉我。在

from lxml import html
from requests import Session

## Setting some vars
LOGIN_URL = "http://infotrac.galegroup.com/default/palm83799?db=SP19"
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"

## Payload for LOGIN_URL page
payload = {
    'db':'SP19',
    'locpword':'25913000925235',
    'proceed':'Authenticate',
}

## Headers to be set for every request with our requests.Session()
headers = {
    'User-Agent':USER_AGENT
}

## requests.Session insance
s = Session()

## Updating/setting headers to be used in every request within our Session()
s.headers.update(headers)

## Making first request to our LOGIN_URL page to get Cookies and Sessions we will need later
s.get(LOGIN_URL)

def extractTitlesAndLinksFromPaginatePageResponse(response, page):
    ## Creating a dictionary with the following structure
    ## {
    ##     page: { ## this value is the page number
    ##         "news": None # right now we leave it as None until we have all the news (dict), from this page, scraped
    ##     }
    ## }
    ##
    ## e.g.
    ##
    ## {
    ##     1: {
    ##        "news": None # right now we leave it as None until we have all the news (dict), from this page, scraped
    ##     }
    ## }
    ##
    news = {page: dict(news=None)}

    ## count = The result's number. e.g. The first result from this page will be 1, the second result will be 2, and so on until 20.
    count = 1

    ## Parsing the HTML from response.content
    tree = html.fromstring(response.content)

    ## Creating a dictionary with the following structure
    ## {
    ##     count: { ## count will be the result number for the current page
    ##            "title": "Here goes the news title",
    ##            "link": "Here goes the news link",
    ##     }
    ## }
    ##
    ## e.g.
    ##
    ## {
    ##     1: {
    ##        "title": "Drought swept aside; End-of-angst story? This is much more.",
    ##        "link": "http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=1921&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA138024966&contentSet=GALE%7CA138024966",
    ##     },
    ##     2: {
    ##        "title": "The Fast Life.",
    ##        "link": "http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=1922&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA137929858&contentSet=GALE%7CA137929858",
    ##     },
    ##     ...and so on...
    ## }
    tmp_dict = dict()

    ## Applying some xPATHs to extract every result from the current page
    ## Adding "http://go.galegroup.com/ps/" prefix to every result's link we extract
    ## Adding results to tmp_dict
    ## Count increment +1
    for result in tree.xpath('//li[@class="citation-view"]'):
        link, title = result.xpath('.//div[@class="titleWrapper"]/span[@class="title"]/a/@href | .//div[@class="titleWrapper"]/span[@class="title"]/a/text()')
        link = "{}{}".format("http://go.galegroup.com/ps/", link)
        tmp_dict[count] = dict(title=title, link=link)
        count += 1

    ## Asigning tmp_dict as value of news[page]["news"]
    news[page]["news"] = tmp_dict

    ## Returning news dictionary with all of the results from the current page
    return news


def searchKeyWord(search_string):
    ## Creating a dictionary with the following structure
    ## {
    ##     "keyword": search_string,  ## in this case 'search_string' is "byline john romano"
    ##     "pages": None              ## right now we leave it as None until we have all the pages scraped
    ## }
    full_news = dict(keyword=search_string, pages=None)

    ## This will be a temporary dictionary which will contain all the pages and news inside. This is the dict that will be the value of full_news["pages"]
    tmp_dict = dict()

    ## Replacing spaces with 'plus' sign to match the website's behavior
    search_string = search_string.replace(' ', '+')
    ## URL of the first page for every search request
    search_url = "http://go.galegroup.com/ps/basicSearch.do?inputFieldValue(0)={}&inputFieldName(0)=OQE&inputFieldName(0)=OQE&nwf=y&searchType=BasicSearchForm&userGroupName=palm83799&prodId=SPJ.SP19&method=doSearch&dblist=&standAloneLimiters=LI".format(search_string)

    ##
    ## count = Number of the page we are currently scraping
    ## response_code = The response code we should match against every request we make to the pagination endpoint. Once it returns a 500 response code, it means we have reached the last page
    ## currentPosition = It's like an offset var, which contains the value of the next results to be rendered. We will increment its value in 20 for each page we request.
    ##
    count = 1 ## Don't change this value. It should always be 1.
    response_code = 200 ## Don't change this value. It should always be 200.
    currentPosition = 21 ## Don't change this value. It should always be 21.

    ## Making a GET request to the search_url (first results page)
    first_page_response = s.get(search_url)
    ## Calling extractTitlesAndLinksFromPaginatePageResponse() with the response and count (number of the page we are currently scraping)
    first_page_news = extractTitlesAndLinksFromPaginatePageResponse(first_page_response, count)
    ## Updating our tmp_dict with the dict of news returned by extractTitlesAndLinksFromPaginatePageResponse()
    tmp_dict.update(first_page_news)

    ## If response code of last pagination request is not 200 we stop looping
    while response_code == 200:
        count += 1
        paginate_url = "http://go.galegroup.com/ps/paginate.do?currentPosition={}&inPS=true&prodId=SPJ.SP19&searchId=R1&searchResultsType=SingleTab&searchType=BasicSearchForm&sort=DA-SORT&tabID=T004&userGroupName=palm83799".format(currentPosition)
        ## Making a request to the next paginate page with special headers to make sure our script follows the website's behavior
        next_pages_response = s.get(paginate_url, headers={'X-Requested-With':'XMLHttpRequest', 'Referer':search_url})
        ## Updating response code to be checked before making the next paginate request
        response_code = next_pages_response.status_code
        ## Calling extractTitlesAndLinksFromPaginatePageResponse() with the response and count (number of the page we are currently scraping)
        pagination_news = extractTitlesAndLinksFromPaginatePageResponse(next_pages_response, count)
        ## Updating dict with pagination's current page results
        tmp_dict.update(pagination_news)
        ## Updating our offset/position
        currentPosition += 20

    ## Deleting results from 500 response code
    del tmp_dict[count]

    ## When the while loop has finished making requests and extracting results from every page
    ## Pages dictionary, with all the pages and their corresponding results/news, becomes a value of full_news["pages"]
    full_news["pages"] = tmp_dict
    return full_news

## This is the POST request to LOGIN_URL with our payload data and some extra headers to make sure everything works as expected
login_response = s.post(LOGIN_URL, data=payload, headers={'Referer':'http://infotrac.galegroup.com/default/palm83799?db=SP19', 'Content-Type':'application/x-www-form-urlencoded'})

## Once we are logged in and our Session has all the website's cookies and sessions
## We call searchKeyWord() function with the text/keywords we want to search for
## Results will be stored in all_the_news var
all_the_news = searchKeyWord("byline john romano")

## Finally you can
print all_the_news
## Or do whatever you need to do. Like for example, loop all_the_news dictionary to make requests to every news url and scrape the data you are interested in.
## You can also adjust the script (add one more function) to scrape every news detail page data, and call it from inside of extractTitlesAndLinksFromPaginatePageResponse()

它将输出如下内容：（这只是一个示例输出，以避免粘贴过多的数据）：

^{pr2}$

最后，正如评论中所建议的，您可以：

循环所有新闻字典，向每个新闻url发出请求，并从中获取您感兴趣的数据。在
调整脚本（再添加一个函数）以获取每个新闻详细信息页数据，并从ExtractTitlesAndLinksFromPaginePageResponse（）内部调用它

我希望这能帮助您更好地理解请求库是如何工作的。在

相关问题更多 >

编程相关推荐

热门问题

热门文章

无法通过Requests跨过分页

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >