How do I return to the calling parse function when using yield in Scrapy?

0 votes
2 answers
2865 views
Asked 2025-04-18 14:26

What I want to achieve:

class Hello(Spider):
    # some stuff
    def parse(self, response):
        # get a list of city urls using pickle and store them in a list
        # for each city url, get the list of monuments (using selenium)
        # via the loops below
        for c in cities:
            # get the monuments division with selenium and iterate through
            # each monument url it contains
            divs = sel.xpath('some xpath/div')
            for div in divs:
                monument_url = ''.join(div.xpath('some xpath'))
                # for each monument url, get the response and scrape the information
                yield Request(monument_url, self.parse_monument)

    def parse_monument(self, response):
        # scrape some information and return to the loop (i.e. back to "for div in divs:")

What actually happens:
1. Before the yield statement executes, I get the list of all monuments for all the cities.
2. Whenever the yield statement executes, it jumps to the parse_monument function and never returns to the loop; it only scrapes the monuments present in the first city.

Is there any way to do this? Is there a way to get the response object that the Request passes to parse_monument without entering the parse_monument method, so that I can use selectors to pick the elements I need directly from the response?

Thanks!!

2 Answers

0

Request is an object, not a method call. Scrapy processes the Request objects you yield and executes the callbacks asynchronously. You can loosely think of a Request as a thread object.

The solution is to do it the other way around: pass the data you need from the parse method into the Request, so you can process it inside parse_monument.

class Hello(Spider):

    def parse(self, response):
        for c in cities:
            divs = sel.xpath('some xpath/div')
            for div in divs:
                monument_url = ''.join(div.xpath('some xpath'))

                data = ...   # set the data that you need from this loop

                # pass the data into the request's meta
                yield Request(monument_url, self.parse_monument, meta={'data': data})

    def parse_monument(self, response):
        # retrieve the data from the response's meta
        data = response.meta.get('data')
        ...
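The meta hand-off itself can be demonstrated without running a crawl. Below is a minimal pure-Python sketch using stub FakeRequest/FakeResponse classes (hypothetical stand-ins, not Scrapy's real classes) to show how the dict attached to the request comes back out on the response inside the callback:

```python
class FakeRequest:
    """Stub standing in for scrapy.Request: carries a url, a callback, and meta."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class FakeResponse:
    """Stub standing in for scrapy's Response: exposes the request's meta."""
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta  # Scrapy makes request.meta available as response.meta

def parse_monument(response):
    # retrieve the data that the generating loop attached to the request
    return (response.url, response.meta.get('data'))

req = FakeRequest('http://example.com/monument/1',
                  parse_monument,
                  meta={'data': {'city': 'Paris'}})
resp = FakeResponse(req)
print(req.callback(resp))  # ('http://example.com/monument/1', {'city': 'Paris'})
```

As a side note, in Scrapy 1.7 and later the documented preference for passing user data to callbacks is the cb_kwargs argument of Request, which delivers the values as extra keyword arguments to the callback; meta still works and remains common in older code.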
0

I don't think you can call back into a function like that. Here is an improved version:

import scrapy
from scrapy import Request

class HelloSpider(scrapy.Spider):
    name = "hello"
    allowed_domains = ["hello.com"]
    start_urls = (
        'http://hello.com/cities',  # trailing comma needed: without it this is a plain string, not a tuple
    )

    def parse(self, response):
        cities = ['London', 'Paris', 'New-York', 'Shanghai']
        for city in cities:
            xpath_exp = 'some xpath[city="' + city + '"]/div/some xpath'
            for monument_url in response.xpath(xpath_exp).extract():
                yield Request(monument_url, callback=self.parse_monument)

    def parse_monument(self, response):
        pass
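The behaviour described in the question follows from how Scrapy consumes parse as a generator: each yield hands a Request to the engine's scheduler and control resumes at the next loop iteration, while the callback runs later, when the response arrives. A toy model (a deliberately simplified, synchronous stand-in for the real engine) makes the ordering visible:

```python
from collections import deque

class ToyRequest:
    """Hypothetical stand-in for scrapy.Request."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

visited = []

def parse_monument(url):
    # callback: runs only when the "engine" processes the queued request
    visited.append(url)

def parse():
    # yields requests; after each yield, control returns HERE to continue
    # the loop -- it never waits for the callback to run
    for url in ['http://a', 'http://b']:
        yield ToyRequest(url, parse_monument)

# toy engine: drain the generator first (the loop in parse() completes),
# then fire the callbacks one by one
queue = deque(parse())
for req in queue:
    req.callback(req.url)

print(visited)  # ['http://a', 'http://b']
```

Both monuments are visited, but only after the generating loop has finished; in real Scrapy the same decoupling happens concurrently, which is why yield appears to "jump away" without returning a value to the loop.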
