Scrapy: Rule and SgmlLinkExtractor concepts


Please tell me how to write a Rule using SgmlLinkExtractor.
I'm a bit confused and can't make sense of the English documentation.

I want to crawl many pages.
The URL pattern is:

 http://abctest.com/list.php?c=&&page=1  
 http://abctest.com/list.php?c=&&page=2  
 http://abctest.com/list.php?c=&&page=3 ...

Here is my code:

from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import re

class Spider(CrawlSpider):
    name = "find"
    start_urls = ["http://abctest.com/list.php?c=&&page=1",]
    #crawl 2 pages to test if the data is normal  allow=('?c=&&page=/d+')
    rules = [Rule(SgmlLinkExtractor(allow=('?c=&&page=2')),callback='parse_item',follow=True)]


    #get the page1 item
    def parse(self, response):
        sel = Selector(response)
        sites = sel.css("div#list table tr ")
        for site in sites:
            item = LAItem()
            item['day']  = site.css("  td.date::text ").extract()
            item['URL'] = site.css("  td.subject a::attr(href) ").extract()
            yield item

    #get the page2 item
    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.css("div#list table tr ")
        for site in sites:
            item = LAItem()
            item['day']  = site.css("  td.date::text ").extract()
            item['URL'] = site.css("  td.subject a::attr(href) ").extract()
            yield item
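
Note: the code references LAItem without showing its definition; presumably it is a Scrapy Item declared in the project's items.py. A minimal sketch consistent with the two fields used above (field names taken from the code, everything else assumed):

from scrapy.item import Item, Field

class LAItem(Item):
    day = Field()  # date text extracted from td.date
    URL = Field()  # link href extracted from td.subject a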

1 Answer


You actually don't need a LinkExtractor or a CrawlSpider here; an ordinary Spider is enough. All you need to do is define a start_requests() method and yield the requests from it:

from scrapy import Request, Spider
from scrapy.exceptions import CloseSpider
from scrapy.selector import Selector

URL = 'http://abctest.com/list.php?c=&&page={page}'


class Spider(Spider):
    handle_httpstatus_list = [404]  # let 404 responses reach parse() instead of being filtered out
    name = "find"

    def start_requests(self):
        # Yield list-page requests indefinitely; the crawl stops when
        # parse() hits the first 404 (see below).
        index = 1
        while True:
            yield Request(URL.format(page=index))
            index += 1

    def parse(self, response):
        if response.status == 404:
            raise CloseSpider("Met the page which doesn't exist")

        sel = Selector(response)
        sites = sel.css("div#list table tr ")
        for site in sites:
            item = LAItem()
            item['day']  = site.css("  td.date::text ").extract()
            item['URL'] = site.css("  td.subject a::attr(href) ").extract()
            yield item

The key here is to keep requesting pages until you hit the first response with a 404 status, i.e. "page not found". That way the spider handles an arbitrary number of pages.
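
As an aside, if you did want to keep the CrawlSpider/SgmlLinkExtractor route from the question, the allow pattern would need fixing: ? is a regex metacharacter and must be escaped, and the page number needs \d+ (the question had /d+). A hedged sketch, keeping the question's deprecated scrapy.contrib imports and assuming LAItem as sketched earlier:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector


class FindSpider(CrawlSpider):  # renamed so it doesn't shadow scrapy's Spider
    name = "find"
    start_urls = ["http://abctest.com/list.php?c=&&page=1"]

    # Escape '?' and match the page number with \d+. Also note that a
    # CrawlSpider must not override parse(), or the rules never run.
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'\?c=&&page=\d+',)),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        sel = Selector(response)
        for site in sel.css("div#list table tr"):
            item = LAItem()  # LAItem as sketched earlier, from the project's items module
            item['day'] = site.css("td.date::text").extract()
            item['URL'] = site.css("td.subject a::attr(href)").extract()
            yield item

Keep in mind this only follows page links that actually appear in the crawled HTML, which is why the start_requests() approach above is the more reliable choice for plain page=1,2,3,... pagination.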
