Scrapy deny rules syntax

Posted on 2024-04-29 02:50:02

I'm a complete newcomer to Scrapy and Python, but my project and my knowledge are making good progress, thanks to the awesome people here! To finish my crawler I only need to configure a few URL fragments (for example, every URL containing bottom.htm or actionbar, or anything like ?*) that Scrapy should use to filter out links. But I think I'm struggling with the regular-expression syntax, because the crawler does run over the pages, yet no filtering seems to happen. Can anyone explain what I'm doing wrong?

Here is the spider:

import scrapy
from scrapy.loader import ItemLoader
from ..items import NorisbankItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class NorisbankSpider(CrawlSpider):
    name = "nbtest"
    allowed_domains = ['norisbank.de']
    start_urls = ['https://www.norisbank.de']
    custom_settings = {
        'FEED_URI': "norisbank_%(time)s.json",
        'FEED_FORMAT': 'json',
    }

    rules = (
        Rule(
            LinkExtractor(allow=(''),
                          deny=('\*start\.do\?*',
                                '\*WT\.mc_id*',
                                '\*.js',
                                '\*.ico',
                                '\*_frame\.htm*',
                                '\*actionbar*',
                                '\*actionframe*',
                                '\*bottom\.htm*',
                                '\*navbar_m\.html',
                                '\*top\.htm*',
                                '\*expandsection*\.*',
                                '\*\?*\?*',
                                '\*\.xml',
                                '\*kid=*',
                                '\*\/dienste\/*',
                                '\*\.do',
                                '\*\.db',
                                '\*redirect',
                                '\*.html\?pi_*',
                          ),
            ),
            callback='parse_item',
            follow=True
        ),
    )

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'nbtest-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
        # Content extraction
        print(response.url)
        l = ItemLoader(NorisbankItem(), response=response)
        l.add_xpath('sitename', "//meta[@property='og:site_name']/@content")
        l.add_xpath('siteurl', "//link[@rel='canonical']/@href")
        l.add_xpath('dbCategory', "//meta[@name='dbCategory']/@content")
        l.add_css('title', 'title::text')
        l.add_xpath('descriptions', "normalize-space(//meta[@name='description']/@content)")
        l.add_xpath('date', "//meta[@name='date']/@content")
        l.add_xpath('version', "//meta[@name='v']/@content")
        l.add_xpath('time', "//meta[@name='time']/@content")
        l.add_xpath('sitecontent', '//body//p//text()')
        yield l.load_item()

        all_pages = response.xpath('//a[contains(@href, "html")]/@href').getall()
        for next_page in all_pages:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
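
One point worth noting: CrawlSpider implements its rule handling inside its own parse() method, so a spider that overrides parse(), as the code above does, bypasses the Rule/LinkExtractor machinery entirely and the deny patterns are never consulted. A minimal sketch of moving the extraction into the method named by callback='parse_item' (abbreviated here to a single field; the full field list from parse() would carry over unchanged):

    def parse_item(self, response):
        # CrawlSpider reserves parse() for applying its rules, so item
        # extraction belongs in the callback declared in the Rule.
        l = ItemLoader(NorisbankItem(), response=response)
        l.add_css('title', 'title::text')
        yield l.load_item()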

1 answer

#1 · posted 2024-04-29 02:50:02

Basically, these are the rules I want to use, but the syntax does not seem to work:

rules = (
    Rule(
        LinkExtractor(allow=(''),
                      deny=('\*start\.do\?*',
                            '\*WT\.mc_id*',
                            '\*.js',
                            '\*.ico',
                            '\*_frame\.htm*',
                            '\*actionbar*',
                            '\*actionframe*',
                            '\*bottom\.htm*',
                            '\*navbar_m\.html',
                            '\*top\.htm*',
                            '\*expandsection*\.*',
                            '\*\?*\?*',
                            '\*\.xml',
                            '\*kid=*',
                            '\*\/dienste\/*',
                            '\*\.do',
                            '\*\.db',
                            '\*redirect',
                            '\*.html\?pi_*',
                      ),
        ),
       callback='parse_item',
       follow=True
        ),
     )
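
For reference: LinkExtractor's allow and deny arguments take Python regular expressions that are searched anywhere in the absolute URL, so glob-style wildcards are unnecessary, and an escaped leading \* actually demands a literal asterisk in the URL, which is why nothing ever matches and nothing gets filtered. A minimal sketch of the same deny list rewritten as raw-string regexes; the exact patterns (for example, the $ anchors) are assumptions based on the URL fragments mentioned in the question:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    Rule(
        LinkExtractor(
            allow=(),                  # empty allow = extract every link
            deny=(
                r'start\.do\?',
                r'WT\.mc_id',
                r'\.js$',
                r'\.ico$',
                r'_frame\.htm',
                r'actionbar',
                r'actionframe',
                r'bottom\.htm',
                r'navbar_m\.html',
                r'top\.htm',
                r'expandsection',
                r'\?.*\?',             # URLs containing two "?" markers
                r'\.xml$',
                r'kid=',
                r'/dienste/',
                r'\.do$',
                r'\.db$',
                r'redirect',
                r'\.html\?pi_',
            ),
        ),
        callback='parse_item',
        follow=True,
    ),
)

Any extracted URL that matches at least one deny pattern is dropped before a request is scheduled, regardless of the allow setting.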
