Answers to the question: Understanding callbacks in Scrapy



1 Answer

Anonymous, 1 day ago

Read the <a href="http://doc.scrapy.org/en/latest/topics/spiders.html">docs</a>:
<blockquote> For spiders, the scraping cycle goes through something like this: <ol> <li>You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests. The first requests to perform are obtained by calling the <code>start_requests()</code> method which (by default) generates <code>Request</code> for the URLs specified in the <code>start_urls</code> and the <code>parse</code> method as callback function for the Requests.</li> <li>In the callback function, you parse the response (web page) and return either <code>Item</code> objects, <code>Request</code> objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.</li> <li>In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.</li> <li>Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.</li> </ol> </blockquote>
Answers:
<blockquote> How is the <code>'item'</code> populated does the <code>request.meta</code> line executes before <code>response.meta</code> line in <code>parse_page2</code>? </blockquote>
Spiders are managed by the Scrapy engine. The engine first makes requests for the URLs specified in <code>start_urls</code> and passes them to the downloader. When a download finishes, the callback specified in the request is called with the response. If the callback returns another request, the same cycle repeats. If the callback returns an <code>Item</code>, the item is passed to a pipeline, which saves the scraped data. So the <code>request.meta</code> line runs first: <code>parse_page2</code> is only called after the request that carries the item has been downloaded, and <code>response.meta</code> then exposes the data that was stored on that request.
<blockquote> Where is the returned item from <code>parse_page2</code> going? What is the need of <code>return request</code> statement in <code>parse_page1</code>? I thought the extracted items need to be returned from here?
</blockquote>
As stated in the docs, each callback (both <code>parse_page1</code> and <code>parse_page2</code>) can return either a <code>Request</code> or an <code>Item</code> (or an iterable of them). <code>parse_page1</code> returns a <code>Request</code> rather than an <code>Item</code> because additional information still has to be scraped from another URL. The second callback, <code>parse_page2</code>, returns an item because by then all the information has been scraped and is ready to be passed to a pipeline.
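To make the order of events concrete, here is a toy model of the cycle described above, written with the standard library only. It is a sketch and not Scrapy's implementation: the <code>Request</code> and <code>Response</code> classes, <code>run_engine</code>, <code>FAKE_PAGES</code> and the example.com URLs are all hypothetical stand-ins. In a real spider you would build requests with <code>scrapy.Request(url, callback=...)</code> and read <code>response.meta</code>, and the engine would do the scheduling and downloading for you.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Request:
    """Stand-in for scrapy.Request: a URL, a callback, and a meta dict."""
    url: str
    callback: Callable
    meta: Dict = field(default_factory=dict)


@dataclass
class Response:
    """Stand-in for a downloaded response; it carries the request's meta."""
    url: str
    body: str
    meta: Dict


# Hypothetical page contents, used here instead of real downloads.
FAKE_PAGES = {
    "http://example.com/page1": "main page",
    "http://example.com/page2": "other page",
}


def parse_page1(response):
    # Start an item, attach it to the next request, and return that request.
    item = {"main_url": response.url}
    request = Request("http://example.com/page2", callback=parse_page2)
    request.meta["item"] = item  # runs now, before page2 is downloaded
    return request


def parse_page2(response):
    # Called later, once page2 has been "downloaded".
    item = response.meta["item"]  # the same dict that parse_page1 stored
    item["other_url"] = response.url
    return item


def run_engine(start_urls):
    """Toy engine loop: download, call the callback, route what it returns."""
    queue = deque(Request(url, callback=parse_page1) for url in start_urls)
    pipeline = []
    while queue:
        request = queue.popleft()
        response = Response(request.url, FAKE_PAGES[request.url], request.meta)
        result = request.callback(response)
        if isinstance(result, Request):
            queue.append(result)     # another request: repeat the cycle
        elif result is not None:
            pipeline.append(result)  # an item: hand it to the "pipeline"
    return pipeline


items = run_engine(["http://example.com/page1"])
print(items)
```

The item is created in <code>parse_page1</code>, travels on the request's <code>meta</code>, is completed in <code>parse_page2</code>, and only then reaches the list that plays the role of the pipeline. That is why <code>parse_page1</code> returns the request and not the half-filled item.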
