Scrapy: Rule SgmlLinkExtractor concept
Can someone tell me how to write a Rule with SgmlLinkExtractor? I'm a bit confused and can't make sense of the English documentation. I want to crawl many pages that follow this pattern:
http://abctest.com/list.php?c=&&page=1
http://abctest.com/list.php?c=&&page=2
http://abctest.com/list.php?c=&&page=3 ...
Here is my code:
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import re

class Spider(CrawlSpider):
    name = "find"
    start_urls = ["http://abctest.com/list.php?c=&&page=1",]
    # crawl 2 pages to test if the data is normal allow=('?c=&&page=/d+')
    rules = [Rule(SgmlLinkExtractor(allow=('?c=&&page=2')), callback='parse_item', follow=True)]

    # get the page1 item
    def parse(self, response):
        sel = Selector(response)
        sites = sel.css("div#list table tr")
        for site in sites:
            item = LAItem()  # LAItem comes from the project's items module (not shown)
            item['day'] = site.css("td.date::text").extract()
            item['URL'] = site.css("td.subject a::attr(href)").extract()
            yield item

    # get the page2 item
    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.css("div#list table tr")
        for site in sites:
            item = LAItem()
            item['day'] = site.css("td.date::text").extract()
            item['URL'] = site.css("td.subject a::attr(href)").extract()
            yield item
1 Answer
You don't actually need a LinkExtractor or a CrawlSpider here; a plain Spider is enough. What you need to do is define a start_requests() method and generate the requests from it:
from scrapy import Request, Spider
from scrapy.exceptions import CloseSpider
from scrapy.selector import Selector

URL = 'http://abctest.com/list.php?c=&&page={page}'

class Spider(Spider):
    # let 404 responses reach parse() instead of being filtered out by the middleware
    handle_httpstatus_list = [404]
    name = "find"

    def start_requests(self):
        # generate one request per page, counting up from page 1
        index = 1
        while True:
            yield Request(URL.format(page=index))
            index += 1

    def parse(self, response):
        # the first missing page signals the end of the listing: stop the crawl
        if response.status == 404:
            raise CloseSpider("Met the page which doesn't exist")

        sel = Selector(response)
        sites = sel.css("div#list table tr")
        for site in sites:
            item = LAItem()  # LAItem comes from the project's items module (not shown)
            item['day'] = site.css("td.date::text").extract()
            item['URL'] = site.css("td.subject a::attr(href)").extract()
            yield item
The key here is to keep requesting pages until the first 404 response, i.e. "page not found", comes back; raising CloseSpider at that point stops the crawl. This way you can handle any number of pages.
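Both snippets reference an LAItem that isn't shown in the question. A minimal sketch of what it could look like in the project's items.py, assuming it only needs the two fields the spiders populate:

from scrapy.item import Item, Field

class LAItem(Item):
    # minimal sketch: only the two fields the spiders above actually fill in
    day = Field()
    URL = Field()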
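For completeness, if you did want to keep the CrawlSpider approach from the question: the allow argument takes regular expressions, so the literal ? must be escaped and the page number matched with \d+ (not /d+). A hedged sketch of what that rule could look like:

from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# allow takes regexes: escape the literal '?' and match the page number with \d+
rules = [Rule(SgmlLinkExtractor(allow=(r'\?c=&&page=\d+',)),
              callback='parse_item', follow=True)]

Note that a CrawlSpider uses parse() internally, so you must not override it (put your extraction logic in the Rule callback instead); that is one reason the plain Spider above is the simpler choice here.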