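I am new to web scraping and have been trying to scrape http://www.icbse.com/schools/state/maharashtra, but I have run into a problem. Out of the total number of school links available, the page shows only 50 at a time, in no particular order.
However, if the page is reloaded, it shows a new list of 50 school links. Some of these links differ from the ones shown before the refresh, while others stay the same.
What I want to do is add the links to a set(), and once len(set) reaches the total number of schools, send that set to a parse function.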
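There are two things I don't understand about solving this problem: how to make the page reload, and where to place the set so that it retains the links instead of being reset every time parse() is called.
Here is my current code: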
import scrapy
import re
from icbse.items import IcbseItem


class IcbseSpider(scrapy.Spider):
    name = "icbse"
    allowed_domains = ["www.icbse.com"]
    start_urls = [
        "http://www.icbse.com/schools/",
    ]

    def parse(self, response):
        for i in xrange(20):  # I thought if i iterate the start URL,
            # I could probably have the page reload.
            # It didn't work though.
            for href in response.xpath(
                    '//div[@class="row"]/div[3]//span[@class="list-group-item"]'
                    '/a/@href').extract():
                url = response.urljoin(href)
                yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # total number of schools found on page
        pages = response.xpath(
            "//div[@class='container']/strong/text()").extract()[0]
        self.captured_schools_set = set()  # Placing the Set here doesn't work!
        while len(self.captured_schools_set) != int(pages):
            yield scrapy.Request(response.url, callback=self.reload_url)
        for school in self.captured_schools_set:
            yield scrapy.Request(school, callback=self.scrape_school_info)

    def reload_url(self, response):
        for school_href in response.xpath(
                "//h4[@class='school_name']/a/@href").extract():
            self.captured_schools_set.add(response.urljoin(school_href))

    def scrape_school_info(self, response):
        item = IcbseItem()
        try:
            item["School_Name"] = response.xpath(
                '//td[@class="tfield"]/strong/text()').extract()[0]
        except:
            item["School_Name"] = ''
            pass
        try:
            item["streetAddress"] = response.xpath(
                '//td[@class="tfield"]')[1].xpath(
                "//span[@itemprop='streetAddress']/text()").extract()[0]
        except:
            item["streetAddress"] = ''
            pass
        yield item

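You are iterating over an empty set (the for school in self.captured_schools_set loop in parse_dir_contents), so the school requests are never fired.
To reload the page, fire the http://www.icbse.com/schools/ request with dont_filter=True, because with the default settings Scrapy filters out duplicate requests.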
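However, it looks like you are not firing requests for http://www.icbse.com/schools/ at all, but for the "/state/name" pages (for example http://www.icbse.com/schools/state/andaman-nicobar). In parse_dir_contents, the scrapy.Request(response.url, callback=self.reload_url) line requests response.url, which is the problem; request /schools/ instead.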