I'm a complete newcomer to Scrapy and Python; nevertheless, my project and my knowledge are making good progress, thanks to the awesome people here! To finish my crawler, I just need to configure a few URL parts (for example, all URLs containing bottom.htm or actionbar, or something like ??*) that Scrapy should use to filter links out. But I think I'm struggling with the regular-expression syntax: the crawler does at least run over the pages, but no filtering seems to happen. Can someone explain what I'm doing wrong?
Here is the spider:
import scrapy
from scrapy.loader import ItemLoader
from ..items import NorisbankItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class NorisbankSpider(CrawlSpider):
    name = "nbtest"
    allowed_domains = ['norisbank.de']
    start_urls = ['https://www.norisbank.de']
    custom_settings = {
        'FEED_URI': "norisbank_%(time)s.json",
        'FEED_FORMAT': 'json',
    }
    rules = (
        Rule(
            LinkExtractor(
                allow=(''),
                deny=('\*start\.do\?*',
                      '\*WT\.mc_id*',
                      '\*.js',
                      '\*.ico',
                      '\*_frame\.htm*',
                      '\*actionbar*',
                      '\*actionframe*',
                      '\*bottom\.htm*',
                      '\*navbar_m\.html',
                      '\*top\.htm*',
                      '\*expandsection*\.*',
                      '\*\?*\?*',
                      '\*\.xml',
                      '\*kid=*',
                      '\*\/dienste\/*',
                      '\*\.do',
                      '\*\.db',
                      '\*redirect',
                      '\*.html\?pi_*',
                      ),
            ),
            callback='parse_item',
            follow=True
        ),
    )

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'nbtest-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
        # Content extraction
        print(response.url)
        l = ItemLoader(NorisbankItem(), response=response)
        l.add_xpath('sitename', "//meta[@property='og:site_name']/@content")
        l.add_xpath('siteurl', "//link[@rel='canonical']/@href")
        l.add_xpath('dbCategory', "//meta[@name='dbCategory']/@content")
        l.add_css('title', 'title::text')
        l.add_xpath('descriptions', "normalize-space(//meta[@name='description']/@content)")
        l.add_xpath('date', "//meta[@name='date']/@content")
        l.add_xpath('version', "//meta[@name='v']/@content")
        l.add_xpath('time', "//meta[@name='time']/@content")
        l.add_xpath('sitecontent', '//body//p//text()')
        yield l.load_item()

        all_pages = response.xpath('//a[contains(@href, "html")]/@href').getall()
        for next_page in all_pages:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
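For context, LinkExtractor's `deny` entries are Python regular expressions matched with `re.search`, not shell-style globs: `\*` matches a literal asterisk, and a bare trailing `*` only repeats the preceding character, which is presumably why nothing is being filtered. (Note also that, per the Scrapy docs, a CrawlSpider must not override `parse`, since the rules machinery depends on it.) Below is a minimal, standalone sketch of what some of the deny patterns might look like as real regexes, checked with plain `re` so it runs without Scrapy; the pattern list is an illustrative subset, not a verified fix:

```python
import re

# Hypothetical regex versions of a few of the deny patterns above.
# LinkExtractor applies these with re.search(), so a pattern matches
# anywhere in the URL; '$' anchors a match to the end of the URL.
deny_patterns = [
    r'start\.do\?',   # any URL containing "start.do?"
    r'WT\.mc_id',     # tracking parameter
    r'\.js$',         # JavaScript files
    r'\.ico$',        # favicons
    r'_frame\.htm',   # frame pages
    r'actionbar',
    r'bottom\.htm',
    r'\?.*\?',        # two "?" anywhere in the URL (the "??*" case)
    r'\.xml$',
    r'/dienste/',
]

def is_denied(url):
    """Return True if any deny pattern matches the URL (re.search semantics)."""
    return any(re.search(p, url) for p in deny_patterns)

print(is_denied('https://www.norisbank.de/bottom.htm'))  # True
print(is_denied('https://www.norisbank.de/konto.html'))  # False
```

The same strings, passed as the `deny=` tuple of a `LinkExtractor`, should then actually exclude those links.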
Basically, these are the rules (the `rules = (...)` block above) that I want to use, but the syntax doesn't seem to work.