Understanding callbacks in Scrapy

Published 2024-05-18 23:41:28


I am new to Python and Scrapy, and I have not worked with callback functions before. What I am trying to do is the code below: the first request is executed, and its response is passed to the callback function given as the second argument:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item

I cannot understand the following:

  1. How does item get populated?
  2. Does the request.meta line execute before the response.meta line in parse_page2?
  3. Where does the item returned by parse_page2 go?
  4. Why is the return request statement needed in parse_page1? I thought the extracted item had to be returned from there.

3 Answers

  1. Yes, Scrapy uses a Twisted reactor to call spider functions, so a single loop running in a single thread ensures the calls happen in order.
  2. The caller of a spider function expects items and/or requests in return; requests are put into a queue for future processing, and items are sent to the configured pipelines.
  3. Storing an item (or any other data) in the request meta only makes sense if it needs further processing once a response is obtained; otherwise it is better to simply return it from parse_page1 and avoid the extra HTTP request.
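Point 3 above can be shown concretely: if nothing from the second page is needed, the callback can return the item directly and skip the extra request. This is a minimal sketch; MyItem and FakeResponse are illustrative stand-ins, not real Scrapy classes:

    class MyItem(dict):
        """Stand-in for a scrapy.Item; a plain dict works for illustration."""

    class FakeResponse:
        """Stand-in for a scrapy Response, carrying only a url."""
        def __init__(self, url):
            self.url = url

    def parse_page1(response):
        # No second page is needed, so return the item directly;
        # the engine then sends it straight to the item pipelines.
        item = MyItem()
        item['main_url'] = response.url
        return item

    item = parse_page1(FakeResponse("http://www.example.com.html"))

Here `item['main_url']` ends up as "http://www.example.com.html" without a second HTTP round trip.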

In scrapy: understanding how do items and requests work between callbacks, eLRuLL's answer is excellent.

I want to add the part about how the item is handed along. First, we should be clear that a callback function only runs after the response for its request has been downloaded.

In the code given in the Scrapy docs, the URL and request for page1 are not declared. Let's set page1's URL to "http://www.example.com.html".

[parse_page1] is the callback of

scrapy.Request("http://www.example.com.html", callback=parse_page1)

[parse_page2] is the callback of

scrapy.Request("http://www.example.com/some_page.html", callback=parse_page2)

When the response for page1 has been downloaded, parse_page1 is called to generate the request for page2:

item['main_url'] = response.url # send "http://www.example.com.html" to item
request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2)
request.meta['item'] = item  # store item in request.meta

When the response for page2 has been downloaded, parse_page2 is called to return an item:

item = response.meta['item']
# response.meta is the same dict as request.meta, so here
# item['main_url'] == "http://www.example.com.html"

item['other_url'] = response.url  # response.url == "http://www.example.com/some_page.html"

return item  # finally, the item records the URLs of both page1 and page2
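The whole round trip can be simulated without Scrapy to see that response.meta really is the request.meta dict and carries the half-built item between callbacks. FakeRequest and FakeResponse below are stand-ins for the real Scrapy classes, not Scrapy API:

    class FakeRequest:
        def __init__(self, url, callback):
            self.url, self.callback, self.meta = url, callback, {}

    class FakeResponse:
        def __init__(self, request):
            # response.meta mirrors the originating request.meta
            self.url, self.meta = request.url, request.meta

    def parse_page1(response):
        item = {'main_url': response.url}
        request = FakeRequest("http://www.example.com/some_page.html", parse_page2)
        request.meta['item'] = item        # stash the half-built item on the request
        return request

    def parse_page2(response):
        item = response.meta['item']       # the same dict parse_page1 created
        item['other_url'] = response.url
        return item

    # Drive the chain by hand, the way the engine would after each download:
    first = FakeRequest("http://www.example.com.html", parse_page1)
    second = first.callback(FakeResponse(first))   # parse_page1 returns a request
    item = second.callback(FakeResponse(second))   # parse_page2 returns the item

After both callbacks run, item holds both URLs: {'main_url': 'http://www.example.com.html', 'other_url': 'http://www.example.com/some_page.html'}.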

Read the docs:

For spiders, the scraping cycle goes through something like this:

  1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

    The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.

  2. In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.

  3. In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.

  4. Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.

Answer:

How is the 'item' populated? Does the request.meta line execute before the response.meta line in parse_page2?

Spiders are managed by the Scrapy engine. The engine first makes requests from the URLs specified in start_urls and passes them to a downloader. When downloading finishes, the callback specified in the request is called. If the callback returns another request, the same cycle is repeated. If the callback returns an Item, the item is passed to a pipeline to save the scraped data.
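That request/item dispatch loop can be sketched as a toy engine. This is a dependency-free model of the behavior described above, not Scrapy's actual implementation; FakeRequest and FakeResponse are illustrative stand-ins:

    from collections import deque

    class FakeRequest:
        def __init__(self, url, callback):
            self.url, self.callback, self.meta = url, callback, {}

    class FakeResponse:
        def __init__(self, request):
            self.url, self.meta = request.url, request.meta

    def toy_engine(start_requests):
        """Toy model of the engine loop: schedule requests, pipeline items."""
        queue = deque(start_requests)
        scraped = []                              # stands in for the item pipelines
        while queue:
            request = queue.popleft()
            response = FakeResponse(request)      # "download" the page
            result = request.callback(response)   # invoke the request's callback
            if isinstance(result, FakeRequest):
                queue.append(result)              # another request: repeat the cycle
            elif result is not None:
                scraped.append(result)            # an item: pass to the pipeline
        return scraped

    def parse_page1(response):
        req = FakeRequest("http://www.example.com/some_page.html", parse_page2)
        req.meta['item'] = {'main_url': response.url}
        return req

    def parse_page2(response):
        item = response.meta['item']
        item['other_url'] = response.url
        return item

    items = toy_engine([FakeRequest("http://www.example.com.html", parse_page1)])

Running the toy engine with the two callbacks from the question yields one pipelined item containing both URLs.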

Where is the returned item from parse_page2 going?

What is the need of the return request statement in parse_page1? I thought the extracted items needed to be returned from there?

As stated in the docs, each callback (both parse_page1 and parse_page2) can return a Request or an Item (or an iterable of them). parse_page1 returns a Request rather than an Item because additional information needs to be scraped from another URL. The second callback, parse_page2, returns an item because all the information has been scraped and is ready to be passed to a pipeline.
