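I have started learning Scrapy and am trying to scrape Letterboxd. I can't get Splash and Scrapy to work together. How do I scrape the first 2000 pages?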

    import scrapy
    from scrapy_splash import SplashRequest


    class Test4basicSpider(scrapy.Spider):
        name = 'test4Basic'
        allowed_domains = ['letterboxd.com']
        start_urls = ['https://letterboxd.com/films/popular/size/small/page/1/']
        # start_urls = ['https://letterboxd.com/film/tales-from-the-darkside-the-movie/']

        script1 = '''
            function main(splash, args)
                splash.private_mode_enabled = false
                url = args.url
                assert(splash:go(args.url))
                assert(splash:wait(0.5))
                return {
                    html = splash:html()
                }
            end
        '''

        def start_requests(self):
            yield SplashRequest(
                url='https://letterboxd.com/films/popular/size/small/page/1/',
                callback=self.parse,
                endpoint="execute",
                args={
                    'lua_source': self.script1
                })

        def parse(self, response):
            for movie in response.xpath("//li[@class='listitem poster-container']"):
                movie_url = movie.xpath("(//a[@class='frame'])[1]/@href").get()
                yield scrapy.Request(
                    url=f'https://letterboxd.com{movie_url}',
                    callback=self.parse_movie,
                )

            next_page = response.xpath("//a[@class='next']/@href").get()
            if next_page:
                yield SplashRequest(
                    url=f'https://letterboxd.com{next_page}',
                    endpoint='execute',
                    args={
                        'lua_source': self.script1
                    },
                    callback=self.parse
                )

        def parse_movie(self, response):
            yield {
                'title': response.xpath('//section[@id="featured-film-header"]/h1/text()').get(),
                'year': response.xpath('//small[@class="number"]/a/text()').get(),
                'duration': response.xpath('(//p[@class="text-link text-footer"]/text())[1]').get(),
                'genre': response.xpath('//div[@class="text-sluglist capitalize"]/p/a/text()').getall(),
                'rating': response.xpath('//a[contains(@class, "tooltip display-rating")]/text()').get(),
                'language': response.xpath('((//span[contains(text(), "Language")]/parent::node()/following::node())/p/a/text())[1]').get()
            }

1 Answer

User
#1 · Posted 2024-05-13 03:50:51

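Regarding the next-page problem: it sounds like what you need is to keep issuing requests for the next-page link. Have a think about how you could keep requesting the next page link until there are no links left.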
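The "keep asking until there is no next link" idea can be sketched without Scrapy at all. This is only a minimal illustration: `crawl_pages` and `fetch_page` are hypothetical names, and `fetch_page` stands in for whatever downloads a page and returns its items plus the href of the next-page link (or None on the last page). The `max_pages` cap reflects the 2000 pages mentioned in the question.

```python
from urllib.parse import urljoin


def crawl_pages(fetch_page, start_url, max_pages=2000):
    """Follow next-page links until none is left or max_pages is reached.

    fetch_page is a hypothetical callable: it takes a URL and returns
    (items, next_href), where next_href is None on the last page.
    """
    url = start_url
    pages_seen = 0
    while url is not None and pages_seen < max_pages:
        items, next_href = fetch_page(url)
        pages_seen += 1
        yield from items
        # A relative href such as '/films/popular/size/small/page/2/'
        # is resolved against the URL of the page it was found on.
        url = urljoin(url, next_href) if next_href else None
```

In a Scrapy spider the same loop is expressed by having the parse callback yield a new request for the next-page link with itself as the callback, which is what the spider in the question already does.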
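To help you along a little with the ratings: there is no need to use Splash to get them.
If you look at the requests the browser makes when you inspect the page, you can see an AJAX request that returns some HTML containing the rating. As you suggested, it is loaded by JavaScript.
I tend to copy this request and paste it into curl.trillworks.com, which converts a cURL command into Python. You can then play around with the request to see whether you can get the data without any headers.
In fact, you don't even need headers, parameters or cookies to make the request. To get the rating information, you just need to make a simple HTTP GET request to https://letterboxd.com/csi/film/joker-2019/rating-histogram/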
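Outside Scrapy, pulling the rating out of that HTML fragment could look roughly like the sketch below, which uses only the standard library. `RatingParser` and `extract_rating` are hypothetical names, the sample markup is made up for illustration (only the class name `tooltip display-rating` comes from the selector used in this answer), and the actual download is left out: you would feed in the body of the GET response.

```python
from html.parser import HTMLParser


class RatingParser(HTMLParser):
    """Collect the text of the first <a> whose class is 'tooltip display-rating'."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.rating = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("class") == "tooltip display-rating":
            self._inside = True

    def handle_data(self, data):
        # Keep only the first non-empty text found inside the matching link.
        if self._inside and self.rating is None and data.strip():
            self.rating = data.strip()

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False


def extract_rating(html):
    """Return the rating text, or None if the fragment has no rating link."""
    parser = RatingParser()
    parser.feed(html)
    return parser.rating


# Made-up fragment standing in for the body of the histogram response.
SAMPLE = '<div><a class="tooltip display-rating">3.8</a></div>'
print(extract_rating(SAMPLE))
```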
Code example
    # Inside the spider class; Scrapy reads the attribute named start_urls.
    start_urls = ['https://letterboxd.com/csi/film/joker-2019/rating-histogram/']

    def parse(self, response):
        rating = response.xpath('//a[@class="tooltip display-rating"]/text()').get()
Output
3.8
For any other film, replace joker-2019 with the corresponding part of that film's page URL.
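As a small sketch of that substitution, the helper below takes a film link and builds the matching histogram URL. `rating_histogram_url` is a hypothetical name, and it assumes the film slug is always the last path segment of the link, as in /film/joker-2019/.

```python
def rating_histogram_url(film_path):
    """Build the rating-histogram URL for a film link such as '/film/joker-2019/'."""
    # '/film/joker-2019/' -> '/film/joker-2019' -> last segment 'joker-2019'
    slug = film_path.rstrip("/").split("/")[-1]
    return f"https://letterboxd.com/csi/film/{slug}/rating-histogram/"


print(rating_histogram_url("/film/joker-2019/"))
```

It accepts either the relative href scraped from the listing page or the full film URL, since only the last path segment is used.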
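Update based on the comments
In fact, you were almost there. Your code for the next page is correct. I think the XPath selector for the individual film links was slightly wrong.
Updated code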
    for movie in response.xpath("//li[@class='listitem poster-container']"):
        movie_url = movie.xpath(".//a[@class='frame']/@href").get()
        print(movie_url)
        yield scrapy.Request(
            url=f'https://letterboxd.com{movie_url}',
            callback=self.parse_movie,
            dont_filter=True
        )
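Correction
Note that it should be .// rather than //. The .// form searches relative to each list item returned by response.xpath("//li[@class='listitem poster-container']"), whereas // searches the whole document again. It's an easy mistake to make, and we all missed it.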
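The difference is easy to see on a tiny made-up listing. The sketch below uses the standard-library ElementTree rather than Scrapy's selectors; ElementTree does not accept an absolute // path on an element, so a search from the document root stands in for it here. The markup and the second film slug are invented for illustration, and `collect_links` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up listing with two poster items.
SAMPLE = """
<ul>
  <li class="listitem poster-container"><a class="frame" href="/film/joker-2019/">Joker</a></li>
  <li class="listitem poster-container"><a class="frame" href="/film/parasite-2019/">Parasite</a></li>
</ul>
"""


def collect_links(markup):
    """Return (links found from the document root, links found relative to each item)."""
    root = ET.fromstring(markup.strip())
    items = root.findall(".//li[@class='listitem poster-container']")
    # Searching from the root on every iteration always hits the same first link.
    from_root = [root.find(".//a[@class='frame']").get("href") for _ in items]
    # Searching relative to each item yields that item's own link.
    relative = [li.find(".//a[@class='frame']").get("href") for li in items]
    return from_root, relative


from_root, relative = collect_links(SAMPLE)
print(from_root)
print(relative)
```

The first list repeats the first film's link once per item, which is the symptom the absolute selector produced in the spider; the second list contains one distinct link per item.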
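I wasn't quite sure about the XPath selector
'(//a[@class='frame'])[1]/@href'
I changed it to '//a[@class='frame']/@href', and that worked.
Scrapy was also filtering out the requests. I put that down to them all sharing the base URL letterboxd.com; the more likely cause is the duplicate filter, since the absolute selector returns the same first link on every iteration and Scrapy drops requests for URLs it has already seen. Either way, set dont_filter=True in scrapy.Request to make sure every request is processed.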
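Update 2: Incorporating the rating into the code
See the main body of the answer for the reasoning, but here is the implementation. We build the part of the link that has to be fed into the histogram URL, which gives us the rating. We then use a callback to fetch the rating and pass that value from the rating method on to the parse_movie method.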
    def parse(self, response):
        for movie in response.xpath("//li[@class='listitem poster-container']"):
            movie_url = movie.xpath(".//a[@class='frame']/@href").get()
            # The film slug is the second-to-last segment, e.g. 'joker-2019'.
            partial = movie_url.split('/')[-2]
            rating_url = f'https://letterboxd.com/csi/film/{partial}/rating-histogram/'
            # Fetch the rating first and carry the film link along to the callback.
            yield scrapy.Request(
                url=rating_url,
                callback=self.rating,
                cb_kwargs={'movie_url': movie_url},
            )

        next_page = response.xpath("//a[@class='next']/@href").get()
        if next_page:
            yield SplashRequest(
                url=f'https://letterboxd.com{next_page}',
                endpoint='execute',
                args={
                    'lua_source': self.script1
                },
                callback=self.parse
            )

    def rating(self, response, movie_url):
        rating = response.xpath('//a[@class="tooltip display-rating"]/text()').get()
        # Hand the rating to parse_movie via cb_kwargs. Requests run concurrently,
        # so per-request state is safer than an attribute on the spider, and
        # assigning to self.rating would replace this method.
        yield scrapy.Request(
            url=f'https://letterboxd.com{movie_url}',
            callback=self.parse_movie,
            cb_kwargs={'rating': rating},
            dont_filter=True,
        )

    def parse_movie(self, response, rating):
        yield {
            'title': response.xpath('//section[@id="featured-film-header"]/h1/text()').get(),
            'year': response.xpath('//small[@class="number"]/a/text()').get(),
            'duration': response.xpath('(//p[@class="text-link text-footer"]/text())[1]').get(),
            'genre': response.xpath('//div[@class="text-sluglist capitalize"]/p/a/text()').getall(),
            'rating': rating,
            'language': response.xpath('((//span[contains(text(), "Language")]/parent::node()/following::node())/p/a/text())[1]').get()
        }
