Understanding callbacks in Scrapy

Published 2024-05-18 23:41:28


I am new to Python and Scrapy, and I have not worked with callback functions before. What I am trying to do is the code below: the first request is executed, and its response is passed to the callback function given as the second argument:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item

I cannot understand the following:

  1. How does item get populated?
  2. Does the request.meta line execute before the response.meta line in parse_page2?
  3. Where does the item returned by parse_page2 go?
  4. Why is the return request statement needed in parse_page1? I thought the extracted item had to be returned from there.

3 Answers

  1. Yes, Scrapy uses a Twisted reactor to call spider functions, so a single loop running in a single thread ensures the calls happen in order.
  2. The caller of a spider function expects items and/or requests in return; requests are put into a queue for future processing, and items are sent to the configured pipelines.
  3. Storing an item (or any other data) in the request meta only makes sense if it needs further processing once a response is obtained; otherwise it is better to simply return it from parse_page1 and avoid the extra HTTP request.
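Point 3 above can be shown concretely: if nothing from the second page is needed, the callback can return the item directly and skip the extra request. This is a minimal sketch; MyItem and FakeResponse are illustrative stand-ins, not real Scrapy classes:

    class MyItem(dict):
        """Stand-in for a scrapy.Item; a plain dict works for illustration."""

    class FakeResponse:
        """Stand-in for a scrapy Response, carrying only a url."""
        def __init__(self, url):
            self.url = url

    def parse_page1(response):
        # No second page is needed, so return the item directly;
        # the engine then sends it straight to the item pipelines.
        item = MyItem()
        item['main_url'] = response.url
        return item

    item = parse_page1(FakeResponse("http://www.example.com.html"))

Here `item['main_url']` ends up as "http://www.example.com.html" without a second HTTP round trip.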

In scrapy: understanding how do items and requests work between callbacks, eLRuLL's answer is excellent.

I want to add the part about how the item is handed along. First, we should be clear that a callback function only runs after the response for its request has been downloaded.

In the code given in the Scrapy docs, the URL and request for page1 are not declared. Let's set page1's URL to "http://www.example.com.html".

[parse_page1] is the callback of

scrapy.Request("http://www.example.com.html", callback=parse_page1)

[parse_page2] is the callback of

scrapy.Request("http://www.example.com/some_page.html", callback=parse_page2)

When the response for page1 has been downloaded, parse_page1 is called to generate the request for page2:

item['main_url'] = response.url # send "http://www.example.com.html" to item
request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2)
request.meta['item'] = item  # store item in request.meta

When the response for page2 has been downloaded, parse_page2 is called to return an item:

item = response.meta['item']
# response.meta is the same dict as request.meta, so here
# item['main_url'] == "http://www.example.com.html"

item['other_url'] = response.url  # response.url == "http://www.example.com/some_page.html"

return item  # finally, the item records the URLs of both page1 and page2
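The whole round trip can be simulated without Scrapy to see that response.meta really is the request.meta dict and carries the half-built item between callbacks. FakeRequest and FakeResponse below are stand-ins for the real Scrapy classes, not Scrapy API:

    class FakeRequest:
        def __init__(self, url, callback):
            self.url, self.callback, self.meta = url, callback, {}

    class FakeResponse:
        def __init__(self, request):
            # response.meta mirrors the originating request.meta
            self.url, self.meta = request.url, request.meta

    def parse_page1(response):
        item = {'main_url': response.url}
        request = FakeRequest("http://www.example.com/some_page.html", parse_page2)
        request.meta['item'] = item        # stash the half-built item on the request
        return request

    def parse_page2(response):
        item = response.meta['item']       # the same dict parse_page1 created
        item['other_url'] = response.url
        return item

    # Drive the chain by hand, the way the engine would after each download:
    first = FakeRequest("http://www.example.com.html", parse_page1)
    second = first.callback(FakeResponse(first))   # parse_page1 returns a request
    item = second.callback(FakeResponse(second))   # parse_page2 returns the item

After both callbacks run, item holds both URLs: {'main_url': 'http://www.example.com.html', 'other_url': 'http://www.example.com/some_page.html'}.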

Read the docs:

For spiders, the scraping cycle goes through something like this:

  1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

    The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.

  2. In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.

  3. In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.

  4. Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.

Answer:

How is the 'item' populated? Does the request.meta line execute before the response.meta line in parse_page2?

Spiders are managed by the Scrapy engine. The engine first makes requests from the URLs specified in start_urls and passes them to a downloader. When downloading finishes, the callback specified in the request is called. If the callback returns another request, the same cycle is repeated. If the callback returns an Item, the item is passed to a pipeline to save the scraped data.
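That request/item dispatch loop can be sketched as a toy engine. This is a dependency-free model of the behavior described above, not Scrapy's actual implementation; FakeRequest and FakeResponse are illustrative stand-ins:

    from collections import deque

    class FakeRequest:
        def __init__(self, url, callback):
            self.url, self.callback, self.meta = url, callback, {}

    class FakeResponse:
        def __init__(self, request):
            self.url, self.meta = request.url, request.meta

    def toy_engine(start_requests):
        """Toy model of the engine loop: schedule requests, pipeline items."""
        queue = deque(start_requests)
        scraped = []                              # stands in for the item pipelines
        while queue:
            request = queue.popleft()
            response = FakeResponse(request)      # "download" the page
            result = request.callback(response)   # invoke the request's callback
            if isinstance(result, FakeRequest):
                queue.append(result)              # another request: repeat the cycle
            elif result is not None:
                scraped.append(result)            # an item: pass to the pipeline
        return scraped

    def parse_page1(response):
        req = FakeRequest("http://www.example.com/some_page.html", parse_page2)
        req.meta['item'] = {'main_url': response.url}
        return req

    def parse_page2(response):
        item = response.meta['item']
        item['other_url'] = response.url
        return item

    items = toy_engine([FakeRequest("http://www.example.com.html", parse_page1)])

Running the toy engine with the two callbacks from the question yields one pipelined item containing both URLs.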

Where is the returned item from parse_page2 going?

What is the need of the return request statement in parse_page1? I thought the extracted items needed to be returned from there?

As stated in the docs, each callback (both parse_page1 and parse_page2) can return a Request or an Item (or an iterable of them). parse_page1 returns a Request rather than an Item because additional information needs to be scraped from another URL. The second callback, parse_page2, returns an item because all the information has been scraped and is ready to be passed to a pipeline.
