Scrapy CrawlSpider doesn't follow deny rules

2 votes
1 answer
1655 views
Asked 2025-04-18 14:16

I've searched StackOverflow and other Q&A sites, but couldn't find a suitable answer to my problem.

I wrote a spider to crawl nautilusconcept.com. The site's category structure is very poorly organized, so I have to apply rules while extracting every link, and I use an if statement in the parse_item method to decide which URLs should actually be parsed. The problem is that the spider doesn't seem to respect my deny rules and keeps trying to crawl links containing (?brw....).

Here is my spider code:

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from vitrinbot.items import ProductItem
from vitrinbot.base import utils
import hashlib

removeCurrency = utils.removeCurrency
getCurrency = utils.getCurrency

class NautilusSpider(CrawlSpider):
    name = 'nautilus'
    allowed_domains = ['nautilusconcept.com']
    start_urls = ['http://www.nautilusconcept.com/']
    xml_filename = 'nautilus-%d.xml'
    xpaths = {
        'category' :'//tr[@class="KategoriYazdirTabloTr"]//a/text()',
        'title':'//h1[@class="UrunBilgisiUrunAdi"]/text()',
        'price':'//hemenalfiyat/text()',
        'images':'//td[@class="UrunBilgisiUrunResimSlaytTd"]//div/a/@href',
        'description':'//td[@class="UrunBilgisiUrunBilgiIcerikTd"]//*/text()',
        'currency':'//*[@id="UrunBilgisiUrunFiyatiDiv"]/text()',
        'check_page':'//div[@class="ayrinti"]'
    }

    rules = (

        Rule(
            LinkExtractor(allow=('com/[\w_]+',),

                          deny=('asp$',
                                'login\.asp',
                                'hakkimizda\.asp',
                                'musteri_hizmetleri\.asp',
                                'iletisim_formu\.asp',
                                'yardim\.asp',
                                'sepet\.asp',
                                'catinfo\.asp\?brw',
                          ),
            ),
            callback='parse_item',
            follow=True
        ),

    )


    def parse_item(self, response):
        i = ProductItem()
        sl = Selector(response=response)

        if not sl.xpath(self.xpaths['check_page']):
            return i

        i['id'] = hashlib.md5(response.url.encode('utf-8')).hexdigest()
        i['url'] = response.url
        i['category'] = " > ".join(sl.xpath(self.xpaths['category']).extract()[1:-1])
        i['title'] = sl.xpath(self.xpaths['title']).extract()[0].strip()
        i['special_price'] = i['price'] = sl.xpath(self.xpaths['price']).extract()[0].strip().replace(',','.')

        images = []
        for img in sl.xpath(self.xpaths['images']).extract():
            images.append("http://www.nautilusconcept.com/"+img)
        i['images'] = images

        i['description'] = (" ".join(sl.xpath(self.xpaths['description']).extract())).strip()

        i['brand'] = "Nautilus"

        i['expire_timestamp']=i['sizes']=i['colors'] = ''

        i['currency'] = sl.xpath(self.xpaths['currency']).extract()[0].strip()

        return i

Here is a snippet of the Scrapy log:

2014-07-22 17:39:31+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=-1&order=&src=&typ=> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=&offset=&order=&src=&stock=1)
2014-07-22 17:39:31+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=&offset=&order=&src=&stock=1)
2014-07-22 17:39:32+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=name&src=&stock=1&typ=> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=&offset=&order=&src=&stock=1)
2014-07-22 17:39:32+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=2&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:32+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=2&kactane=100&mrk=1&offset=-1&order=name&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:33+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&chkBeden=&chkMarka=&chkRenk=&cid=64&cmp=&direction=1&grp=&kactane=100&model=&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:33+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=1&chkBeden=&chkMarka=&chkRenk=&cid=64&cmp=&direction=1&grp=&kactane=100&model=&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:33+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=1&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=name&src=&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=1&chkBeden=&chkMarka=&chkRenk=&cid=64&cmp=&direction=1&grp=&kactane=100&model=&mrk=1&offset=-1&order=name&src=&stock=1&typ=7)

The spider does crawl the correct pages as well, but it shouldn't be trying to crawl links containing (catinfo.asp?brw...).

I'm using Scrapy==0.24.2 and Python 2.7.6.

1 Answer

2

This is a canonicalization issue. By default, LinkExtractor returns canonicalized URLs, but the regular expressions from deny and allow are applied before canonicalization.
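You can see the effect with a quick check in a Python shell. This is a minimal sketch assuming Scrapy 0.24, where canonicalize_url should be importable from scrapy.utils.url (the same helper the link extractor uses); the sample URL is made up for illustration:

import re
from scrapy.utils.url import canonicalize_url

# A raw href as it might appear in the page source: "brw" is not the
# first query parameter, so the deny pattern 'catinfo\.asp\?brw' misses it.
raw = 'http://www.nautilusconcept.com/catinfo.asp?cid=64&brw=1&order=name'
print re.search(r'catinfo\.asp\?brw', raw)            # None

# Canonicalization sorts the query parameters alphabetically, which is why
# the logged (already canonicalized) URLs show "brw" right after the "?".
canonical = canonicalize_url(raw)
print canonical
# http://www.nautilusconcept.com/catinfo.asp?brw=1&cid=64&order=name
print re.search(r'catinfo\.asp\?brw', canonical)      # now it matches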

I suggest using these rules instead:

rules = (

    Rule(
        LinkExtractor(allow=('com/[\w_]+',),

                      deny=('asp$',
                            'login\.asp',
                            'hakkimizda\.asp',
                            'musteri_hizmetleri\.asp',
                            'iletisim_formu\.asp',
                            'yardim\.asp',
                            'sepet\.asp',
                            'catinfo\.asp\?.*brw',
                      ),
        ),
        callback='parse_item',
        follow=True
    ),

)
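Because of the .*, the pattern matches "brw" wherever it appears in the query string, so it catches the raw hrefs regardless of parameter order. A quick sanity check with made-up URLs:

import re

pattern = re.compile(r'catinfo\.asp\?.*brw')
for url in ('http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64',
            'http://www.nautilusconcept.com/catinfo.asp?cid=64&brw=0'):
    print url, bool(pattern.search(url))   # both print True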
