Logging in to IMDb with Scrapy

import scrapy


class lisTopSpider(scrapy.Spider):
    name = 'ImdbListsSpider'
    allowed_domains = ['imdb.com']
    titleLinkNum = 'tt0120852'
    start_urls = [
        'https://www.imdb.com/lists/' + titleLinkNum
    ]

    # def ???(self, response):
    #     return scrapy.FormRequest.from_response(
    #         formdata={"username": "example@gmail.com", "password": "example",}
    #         callback=self.parse)

    # lists related to given title
    def parse(self, response):
        listsLinks = response.xpath('//div[2]/strong')
        for link in listsLinks:
            list_url = response.urljoin(link.xpath('.//a/@href').get())
            yield scrapy.Request(list_url, callback=self.parse_list, meta={'list_url': list_url})

        next_page_url = response.xpath('//a[@class="flat-button next-page "]/@href').get()
        if next_page_url is not None:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(next_page_url, callback=self.parse)

    # Link of each list
    def parse_list(self, response):
        list_url = response.meta['list_url']
        myRatings = response.xpath('//div[@class="ipl-rating-star small"]/span[2]/text()').getall()
        yield {
            'list': list_url,
            'ratings': myRatings,
        }

1 Answer

User

#1 · Posted 2024-05-16 12:58:56

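Most likely what you want is start_requests, which lets you, rather than Scrapy, compose the initial Request objects that kick off the crawl. The example in the Scrapy documentation more or less matches your pseudocode.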

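Alternatively, instead of submitting credentials with FormRequest, you can log in separately through a browser, grab the authentication cookies, and supply them via start_requests. This helps when the login flow involves anything unusual (reCaptcha, two-factor authentication, and so on):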

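def start_requests(self):
    # start_requests must return an iterable of Requests, so yield instead of returning a bare Request
    yield scrapy.Request(self.start_urls[0],
                         cookies={'whatever-cookie': 'whatever-value'})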

Also make sure cookies are enabled in your spider's settings.py (in Scrapy this is the COOKIES_ENABLED setting).
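Cookies copied from a browser's developer tools usually arrive as a single Cookie header string, while the cookies argument of Request expects a dict. A small sketch of the conversion using only the standard library; the helper name cookie_header_to_dict is hypothetical, and the cookie names and values are the placeholders from the answer, not real IMDb cookies.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(header: str) -> dict:
    """Convert a raw Cookie header value into a name -> value dict."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


# Placeholder header; paste the value copied from the browser instead.
cookies = cookie_header_to_dict(
    "whatever-cookie=whatever-value; other-cookie=other-value"
)
```

The resulting dict can then be passed as the cookies argument in start_requests.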
