Problem scraping data across multiple pages

Posted 2024-04-20 12:38:24


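I wrote a Python script to fetch data from a web page. The site spreads its content over 60 pages. My scraper can parse the data from the second page, but as soon as I change the page number in the payload or wrap the request in a loop to fetch several pages, it breaks. How can I correct the script so that it fetches data from all pages, not just the second one? Thanks in advance.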

  1. Link to the site with the data: Page_link
  2. Link to substitute into the script below: page_url

I think the page number is set here:

ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages:1

Here is the full script (it only works for page 2):

import requests
from bs4 import BeautifulSoup

url = "Link to replace with the above url"  # replace with the page_url link (item 2 above)

formdata = {
    'searchEntity':'FundServiceProvider',
    'searchType':'Name',
    'searchText':'',
    'registers':'6,29,44,45',
    'AspxAutoDetectCookieSupport':'1'
}
req = requests.get(url,params=formdata,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(req.text,"lxml")

VIEWSTATE = soup.select("#__VIEWSTATE")[0]['value']
EVENTVALIDATION = soup.select("#__EVENTVALIDATION")[0]['value']

payload = {
    '__EVENTTARGET':'',
    '__EVENTARGUMENT':'',
    '__LASTFOCUS':'',
    '__VIEWSTATE':VIEWSTATE,
    '__SCROLLPOSITIONX':'0',
    '__SCROLLPOSITIONY':'541',
    '__EVENTVALIDATION':EVENTVALIDATION,
    'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages':1,
    'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$btnNext.x':'260',
    'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$btnNext.y':'11'
}

with requests.session() as session:
    session.headers = {"User-Agent":"Mozilla/5.0"}
    response = session.post(req.url,data=payload)
    soup = BeautifulSoup(response.text,"lxml")
    tabd = soup.select(".searchresults")[0]
    for items in tabd.select("tr")[:-1]:
        data = ' '.join([item.text for item in items.select("th,td")])
        print(data)

1 answer
User
#1 · Posted 2024-04-20 12:38:24

You only need to remove the last two fields (the btnNext.x and btnNext.y entries) from the payload:

payload = {
    '__EVENTTARGET':'',
    '__EVENTARGUMENT':'',
    '__LASTFOCUS':'',
    '__VIEWSTATE':VIEWSTATE,
    '__SCROLLPOSITIONX':'0',
    '__SCROLLPOSITIONY':'541',
    '__EVENTVALIDATION':EVENTVALIDATION,
    'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages':1
}

instead of

payload = {
    '__EVENTTARGET':'',
    '__EVENTARGUMENT':'',
    '__LASTFOCUS':'',
    '__VIEWSTATE':VIEWSTATE,
    '__SCROLLPOSITIONX':'0',
    '__SCROLLPOSITIONY':'541',
    '__EVENTVALIDATION':EVENTVALIDATION,
    'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages':1,
    'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$btnNext.x':'260',
    'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$btnNext.y':'11'
}

After that, changing the value of ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages returns the data for the corresponding page.
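
To tie this back to the original goal of fetching every page, here is a minimal sketch of a loop built around the trimmed payload. The names build_payload, scrape_pages and PAGE_FIELD are hypothetical helpers introduced for illustration. Two things are assumptions rather than facts from the answer: that the hidden __VIEWSTATE and __EVENTVALIDATION values should be re-read from each response before the next POST (common on ASP.NET WebForms pages, but untested against this site), and that the GET and the POSTs share one session. Whether the ddlPages values start at 0 or 1 is also not stated, so check the option values of the dropdown in the page source.

```python
PAGE_FIELD = 'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages'


def build_payload(viewstate, eventvalidation, page):
    """Return the POST form data for one results page (no btnNext fields)."""
    return {
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        '__LASTFOCUS': '',
        '__VIEWSTATE': viewstate,
        '__SCROLLPOSITIONX': '0',
        '__SCROLLPOSITIONY': '541',
        '__EVENTVALIDATION': eventvalidation,
        PAGE_FIELD: page,
    }


def scrape_pages(url, formdata, pages):
    """Yield (page, row_text) for each table row of each requested page."""
    # Third-party imports are local so build_payload can be used on its own.
    import requests
    from bs4 import BeautifulSoup

    with requests.Session() as session:
        session.headers.update({"User-Agent": "Mozilla/5.0"})
        # Initial GET, as in the question, to obtain the hidden form fields.
        response = session.get(url, params=formdata)
        post_url = response.url
        for page in pages:
            soup = BeautifulSoup(response.text, "lxml")
            # Assumption: the hidden fields change with every response,
            # so they are re-read before each POST.
            viewstate = soup.select("#__VIEWSTATE")[0]['value']
            eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
            payload = build_payload(viewstate, eventvalidation, page)
            response = session.post(post_url, data=payload)
            soup = BeautifulSoup(response.text, "lxml")
            table = soup.select(".searchresults")[0]
            # The last row is skipped, as in the question's script.
            for row in table.select("tr")[:-1]:
                cells = row.select("th,td")
                yield page, ' '.join(cell.text for cell in cells)


# Example call (performs network requests; url and formdata as in the question):
# for page, row in scrape_pages(url, formdata, range(1, 61)):
#     print(page, row)
```

The range(1, 61) in the commented call simply mirrors the 60 pages mentioned in the question; adjust it once the real option values are known.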
