How do I crawl an entire website with Scrapy?
I can't crawl a whole website; Scrapy only scrapes the surface, and I want to crawl deeper. I've been searching online for 5 to 6 hours but haven't found a solution. Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log


class ExampleSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    rules = [Rule(SgmlLinkExtractor(allow=()),
                  follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
             ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
2 Answers
When parsing the pages in start_urls, you can find deeper URLs in the href attributes of the links on each page, and then issue requests for those deeper pages from inside parse(). Here is a simple example; the most important part of the source code is below:
from scrapy.spiders import Spider
from scrapy.http import Request
from tutsplus.items import TutsplusItem
import re


class MySpider(Spider):
    name = "tutsplus"
    allowed_domains = ["code.tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/"]

    def parse(self, response):
        # Collect every link on the current page
        links = response.xpath('//a/@href').extract()

        # Links already queued from this page, so we don't request them twice
        crawledLinks = []

        # Only follow tutorial listing pages such as /tutorials?page=2
        linkPattern = re.compile(r"^/tutorials\?page=\d+")

        for link in links:
            # If the link matches the pattern and has not been queued yet,
            # record it and hand the absolute URL back to the spider
            if linkPattern.match(link) and link not in crawledLinks:
                crawledLinks.append(link)
                yield Request("http://code.tutsplus.com" + link, self.parse)

        # Extract the post titles on this page and emit them as items
        titles = response.xpath('//a[contains(@class, "posts__post-title")]/h1/text()').extract()
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            yield item
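The spider above imports TutsplusItem from the project's tutsplus.items module, which the answer does not show. Below is a minimal sketch of what that module might look like, assuming the spider only needs the title field it populates here:

# tutsplus/items.py -- hypothetical item definition used by the spider above
import scrapy


class TutsplusItem(scrapy.Item):
    # The only field the example spider fills in
    title = scrapy.Field()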
Rules short-circuit: the first rule that a link satisfies is the one that gets applied, so your second rule (the one with the callback) will never be called.
Change your rules to this:
rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]
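Note that scrapy.contrib and SgmlLinkExtractor were removed in later Scrapy releases. Below is a sketch of the same fix against the current API, using scrapy.spiders.CrawlSpider and scrapy.linkextractors.LinkExtractor; otherwise it mirrors the spider from the question:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    # One rule that both follows every extracted link and sends
    # each matching response to parse_item
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        self.logger.info('A response from %s just arrived!', response.url)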