Recursive scraping of Craigslist with Scrapy
I've been working on improving my Python skills and recently started building spiders with Scrapy, since it supports multithreading and download delays. I used bs4 before, and I can now write a basic spider that outputs data to a CSV file. However, when I tried to add recursion, I ran into problems. I tried to follow the advice on recursively downloading content with Scrapy, but I keep getting the following error:
DEBUG: Retrying <GET http://medford.craigslist.org%20%5Bu'/cto/4359874426.html'%5D> DNS lookup failed: address not found
This makes me think I'm joining the links incorrectly, since extra characters are being inserted into the URL, but I can't work out how to fix it. Any suggestions?
Here is my code:
#-------------------------------------------------------------------------------
# Name: module1
# Purpose:
#
# Author: CD
#
# Created: 02/03/2014
# Copyright: (c) CD 2014
# Licence: <your licence>
#-------------------------------------------------------------------------------
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import *
class PageSpider(BaseSpider):
    name = "cto"
    start_urls = ["http://medford.craigslist.org/cto/"]

    rules = (Rule(SgmlLinkExtractor(allow=("index\d00\.html",),
                                    restrict_xpaths=('//p[@class="nextpage"]',)),
                  callback="parse", follow=True),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        for titles in titles:
            item = CraigslistSampleItem()
            item['title'] = titles.select("a/text()").extract()
            item['link'] = titles.select("a/@href").extract()
            url = "http://medford.craiglist.org %s" % item['link']
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)

    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item
1 Answer
It turns out that this line of your code:
url = "http://medford.craiglist.org %s" % item['link']
produces:
http://medford.craigslist.org [u'/cto/4359874426.html']
In your code, item['link'] returns a list, not the string you expect. You need to do this instead:
url = 'http://medford.craiglist.org{}'.format(''.join(item['link']))
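A sturdier alternative is to resolve the relative href against the URL of the page being scraped, using the standard library's urljoin. That avoids hand-built strings entirely (and with them the stray space and the "craiglist" typo above). Here is a minimal sketch of the question's parse method rewritten that way; this is not the answerer's code, and it assumes the same spider and item classes as the question:

from urlparse import urljoin  # Python 2 stdlib, matching the Scrapy version used here

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        for title in titles:
            item = CraigslistSampleItem()
            item['title'] = title.select("a/text()").extract()
            links = title.select("a/@href").extract()
            if not links:            # skip rows without a link
                continue
            item['link'] = links[0]  # extract() returns a list; take the first string
            # urljoin(response.url, '/cto/4359874426.html')
            #   -> 'http://medford.craigslist.org/cto/4359874426.html'
            yield Request(url=urljoin(response.url, links[0]),
                          meta={'item': item}, callback=self.parse_item_page)

Because urljoin handles both relative and absolute hrefs, this keeps working even if the site starts emitting full URLs in its links.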