How to create a thread pool

    def yield1(self, url):
        print("inside function")
        yield scrapy.Request(url, callback=self.parse_product)

    def parse(self, response):
        print("in herre")
        self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall()
        print(self.product_url)
        for pu in self.product_url:
            print("inside the loop")
            with ThreadPoolExecutor(max_workers=10) as executor:
                print("inside thread")
                executor.map(self.yield1, response.urljoin(pu))

1 answer

User

#1 · Posted 2024-05-14 09:39:36

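yield1 is a generator function. Calling it only creates a generator object; to get it to produce a value you have to call next on that object. Change it to return the value instead: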

def yield1(self, url):
    print("inside function")
    return scrapy.Request(url, callback=self.parse_product)
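
The difference can be seen without Scrapy at all. The following is a minimal sketch, not the asker's spider: lazy_request, eager_request, the calls list and the example.com URLs are hypothetical names made up for illustration. It shows that ThreadPoolExecutor.map over a generator function only hands back generator objects whose bodies have not run yet, while the version that uses return runs immediately.

```python
from concurrent.futures import ThreadPoolExecutor

calls = []  # records which function bodies have actually executed


def lazy_request(url):
    # Generator function (hypothetical stand-in for yield1):
    # nothing below runs until next() is called on the returned generator.
    calls.append(("lazy", url))
    yield f"request:{url}"


def eager_request(url):
    # Plain function: the body runs as soon as the executor calls it.
    calls.append(("eager", url))
    return f"request:{url}"


urls = ["https://example.com/a", "https://example.com/b"]

with ThreadPoolExecutor(max_workers=2) as executor:
    # Each call returns an unstarted generator object, not a value.
    lazy_results = list(executor.map(lazy_request, urls))
    lazy_calls = list(calls)  # snapshot taken before any next() call

    # Each call returns the finished value, in input order.
    eager_results = list(executor.map(eager_request, urls))

# Only now does the body of the first generator run.
first_value = next(lazy_results[0])

print(lazy_calls)
print(eager_results)
print(first_value)
```

This is why "inside function" is never printed in the question: the executor creates the generators, but nothing ever advances them.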

Caveat: I don't really know Scrapy.

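The Overview in the docs says that requests are made asynchronously. Your code doesn't look like the examples given in those docs. The example in the overview shows follow-up requests being made from within the parse method using response.follow. Your code looks like it is trying to extract links from a page and then scrape those links asynchronously, parsing them with a different method. Since it appears that Scrapy will do this for you and handle the asynchronicity (?), I think you just need to define another parse method in your spider and use response.follow to schedule the additional asynchronous requests. You shouldn't need concurrent.futures; the new requests should all be processed asynchronously.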

I have no way of testing this, but I think your spider should look more like this:

import scrapy

class TempSpider(scrapy.Spider):
    name = 'foo'
    start_urls = [
        'https://www.jny.com/collections/jackets',
    ]

    def parse(self, response):
        # Collect the product links from the collection grid.
        self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall()
        for pu in self.product_url:
            print("inside the loop")
            # response.follow accepts relative URLs and schedules the request.
            yield response.follow(pu, self.parse_product)

    def parse_product(self, response):
        '''parses the product urls'''

This assumes that self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall() does what it is supposed to do.

You could perhaps even have a separate spider to parse the follow-up links, or use CrawlSpider.
