<p>Read the <a href="http://doc.scrapy.org/en/latest/topics/spiders.html">docs</a>:</p>
<blockquote>
<p>For spiders, the scraping cycle goes through something like this:</p>
<ol>
<li><p>You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response
downloaded from those requests.</p>
<p>The first requests to perform are obtained by calling the
<code>start_requests()</code> method which (by default) generates <code>Request</code> for the
URLs specified in the <code>start_urls</code> and the <code>parse</code> method as callback
function for the Requests.</p></li>
<li><p>In the callback function, you parse the response (web page) and return either <code>Item</code> objects, <code>Request</code> objects, or an iterable of both.
Those Requests will also contain a callback (maybe the same) and will
then be downloaded by Scrapy and then their response handled by the
specified callback.</p></li>
<li><p>In callback functions, you parse the page contents, typically using <em>Selectors</em> (but you can also use BeautifulSoup, lxml or whatever
mechanism you prefer) and generate items with the parsed data.</p></li>
<li><p>Finally, the items returned from the spider will be typically persisted to a database (in some <em>Item Pipeline</em>) or written to a file
using <em>Feed exports</em>.</p></li>
</ol>
</blockquote>
<p>Answers:</p>
<blockquote>
<p>How is the <code>'item'</code> populated does the <code>request.meta</code> line executes before <code>response.meta</code> line in <code>parse_page2</code>?</p>
</blockquote>
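<p>Spiders are managed by the Scrapy engine. It first makes requests from the URLs specified in <code>start_urls</code> and passes them to the downloader. When a download finishes, the callback specified in the request is called with the response. If the callback returns another request, the same cycle repeats. If the callback returns an <code>Item</code>, the item is passed to a pipeline, which saves the scraped data.</p>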
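<p>The cycle can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Scrapy's actual engine: <code>run_engine</code>, <code>FAKE_PAGES</code>, <code>parse_list</code> and <code>parse_detail</code> are made-up names, and the "download" is a dictionary lookup so the example runs without network access.</p>

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Request:
    url: str
    callback: Callable


@dataclass
class Response:
    url: str
    body: str


# Hypothetical in-memory "web" standing in for the downloader.
FAKE_PAGES = {
    "http://example.com/list": "detail-1 detail-2",
    "http://example.com/detail-1": "first",
    "http://example.com/detail-2": "second",
}


def parse_list(response):
    # First callback: no item yet, only follow-up requests.
    for name in response.body.split():
        yield Request(f"http://example.com/{name}", callback=parse_detail)


def parse_detail(response):
    # Second callback: all data is available, so return an item.
    return {"url": response.url, "text": response.body}


def run_engine(start_urls, first_callback):
    queue = deque(Request(url, first_callback) for url in start_urls)
    pipeline = []  # stands in for an item pipeline
    while queue:
        request = queue.popleft()
        response = Response(request.url, FAKE_PAGES[request.url])  # "download"
        result = request.callback(response)
        # A callback may return one object or an iterable of objects.
        if isinstance(result, (Request, dict)):
            result = [result]
        for obj in result or []:
            if isinstance(obj, Request):
                queue.append(obj)      # requests go back to the scheduler
            else:
                pipeline.append(obj)   # items go to the pipeline
    return pipeline


items = run_engine(["http://example.com/list"], parse_list)
```

<p>Requests returned by a callback go back into the queue, and anything else is treated as an item and handed to the pipeline.</p>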
<blockquote>
<p>Where is the returned item from <code>parse_page2</code> going?</p>
<p>What is the need of <code>return request</code> statement in <code>parse_page1</code>? I thought the extracted items need to be returned from here ?</p>
</blockquote>
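<p>As stated in the docs, each callback (both <code>parse_page1</code> and <code>parse_page2</code>) can return either a <code>Request</code> or an <code>Item</code> (or an iterable of them). <code>parse_page1</code> returns a <code>Request</code> rather than an <code>Item</code> because additional information still needs to be scraped from another URL. The second callback, <code>parse_page2</code>, returns an item because by then all the information has been scraped and is ready to be passed to a pipeline.</p>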