<p>You should define a callback for the rule. Here is an example that collects all links from the <code>twitter.com</code> home page (with <code>follow=False</code>):</p>
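<pre><code># Note: these scrapy.contrib import paths come from older Scrapy releases; newer
# releases expose LinkExtractor in scrapy.linkextractors and CrawlSpider/Rule in scrapy.spiders.
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = 'twitter.com'
    allowed_domains = ['twitter.com']
    start_urls = ['http://www.twitter.com']
    # follow=False: call parse_url for each link found on the start page,
    # but do not extract further links from those pages.
    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=False), )

    def parse_url(self, response):
        # Emit one item per crawled link, recording its URL.
        item = MyItem()
        item['url'] = response.url
        return item
</code></pre>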
<p>Then, in the output file, I see:</p>
<pre><code>http://status.twitter.com/
https://twitter.com/
http://support.twitter.com/forums/26810/entries/78525
http://support.twitter.com/articles/14226-how-to-find-your-twitter-short-code-or-long-code
...
</code></pre>
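<p>To make it clearer what the rule does on the start page, here is a rough, standard-library-only sketch of the link extraction step. It is not how Scrapy implements its link extractor; <code>AnchorCollector</code>, <code>extract_links</code>, and the sample HTML are made up for illustration.</p>

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper, not part of Scrapy)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute, de-duplicated links in document order."""
    collector = AnchorCollector()
    collector.feed(html)
    seen = set()
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative hrefs against the page URL
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Made-up sample markup standing in for the downloaded start page.
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="http://status.twitter.com/">Status</a>
<a href="/about">About again</a>
"""

links = extract_links("http://www.twitter.com", SAMPLE_HTML)
for link in links:
    print(link)
```

<p>In the spider above, Scrapy requests each extracted link and passes the response to <code>parse_url</code>. Because of <code>follow=False</code>, it does not extract further links from those pages.</p>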
<p>Hope that helps.</p>