Passing scrapy's start_url on to subsequent requests


For three days now I have been trying to store the respective start_url in the meta attribute so that it gets passed along, as part of the item, to the subsequent requests in scrapy; that way I can use the start_url as a key into a dictionary and add extra data to the output. It really ought to be simple, since it is explained in the documentation...
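
As far as I understand the documentation, the basic mechanism is just to put the value into a request's meta dict and read it back from response.meta in the callback, copying it onto every follow-up request by hand. A minimal sketch of what I think that looks like with a plain spider (hypothetical names, not my real code):

import urlparse

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector


class MetaSketchSpider(BaseSpider):
    name = 'meta_sketch'
    start_urls = ['http://www.example.com']

    def make_requests_from_url(self, url):
        # Tag the very first request with its own URL.
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse(self, response):
        # The value comes back in response.meta ...
        start_url = response.meta['start_url']

        # ... but has to be copied onto every follow-up request explicitly.
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            yield Request(url, callback=self.parse,
                          meta={'start_url': start_url})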

There is a thread about this in the scrapy Google group, and a question here as well, but I cannot get it to work :(

I am new to scrapy and I think it is a great framework, but for my project I need to know the start_url of every request, and that turns out to be surprisingly complicated.

I would really appreciate your help!

At the moment my code looks like this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

# testItem is the project's Item subclass, defined in items.py


class example(CrawlSpider):

    name = 'example'
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/blablabla/', )), callback='parse_item'),
    )

    def parse(self, response):
        # Re-attach the start_url meta to every request the CrawlSpider generates.
        for request_or_item in super(example, self).parse(response):
            if isinstance(request_or_item, Request):
                request_or_item = request_or_item.replace(
                    meta={'start_url': response.meta['start_url']})
            yield request_or_item

    def make_requests_from_url(self, url):
        # Tag the initial requests with their own URL.
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = testItem()
        print response.request.meta, response.url

1 Answer

I wanted to delete this answer because it does not solve the OP's problem, but I am leaving it here as a rough scrapy example.


Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

Use BaseSpider instead:

import urlparse
from datetime import datetime

from scrapy.spider import BaseSpider

# `settings` (which exposes the database connection as settings.db) and
# `items` (which defines the Item class) are the project's own modules.


class Spider(BaseSpider):

    name = "domain_spider"

    def start_requests(self):
        last_domain_id = 0
        chunk_size = 10
        cursor = settings.db.cursor()

        while True:
            # Fetch the next chunk of domains that have not been scraped yet.
            cursor.execute("""
                    SELECT domain_id, domain_url
                    FROM domains
                    WHERE domain_id > %s AND scraping_started IS NULL
                    LIMIT %s
                """, (last_domain_id, chunk_size))
            self.log('Requesting %s domains after %s' % (chunk_size, last_domain_id))
            rows = cursor.fetchall()
            if not rows:
                self.log('No more domains to scrape.')
                break

            for domain_id, domain_url in rows:
                last_domain_id = domain_id
                request = self.make_requests_from_url(domain_url)
                # Build an item that remembers where the crawl started and
                # attach it to the request's meta so callbacks can read it.
                item = items.Item()
                item['start_url'] = domain_url
                item['domain_id'] = domain_id
                item['domain'] = urlparse.urlparse(domain_url).hostname
                request.meta['item'] = item

                # Mark the domain as started so it is not picked up again.
                cursor.execute("""
                        UPDATE domains
                        SET scraping_started = %s
                        WHERE domain_id = %s
                    """, (datetime.now(), domain_id))

                yield request

    ...
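
The parse callback itself is elided above; roughly, it would pull the item back out of response.meta and hand it on to any follow-up requests, so every later page still knows its start_url and domain_id. A sketch of what that might look like, assuming Request and HtmlXPathSelector are imported from scrapy.http and scrapy.selector:

def parse(self, response):
    # Pick the item back up that start_requests() attached to this request.
    item = response.meta['item']
    hxs = HtmlXPathSelector(response)

    # Hand the same item on to every follow-up request, so later callbacks
    # still know which start_url / domain_id they belong to.
    for href in hxs.select('//a/@href').extract():
        url = urlparse.urljoin(response.url, href)
        yield Request(url, callback=self.parse, meta={'item': item})

    # ... then fill in the item's fields from this response and yield the
    # item once it is complete.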
