Scrapy: crawled but not scraped
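Following the suggestions I was given, and after a lot of trial and error, I finally got the crawler working on a single page. Now I am trying to change the code to use multiple rules, but the results don't look good. Here is a brief description of what I am trying to do:
For the start URL http://sfbay.craigslist.org/, I use parse_items_1 to identify http://sfbay.craigslist.org/npo and parse that link.
At the second level, for the links in http://sfbay.craigslist.org/npo, I need to use parse_items_2 to identify links such as http://sfbay.craigslist.org/npo/index100.html and parse them.
The spider is able to crawl (I can see the output being printed), but the links are not being scraped.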
2013-02-13 11:23:55+0530 [craigs] DEBUG: Crawled (200) <GET http://sfbay.craigslist.org/npo/index100.html> (referer: http://sfbay.craigslist.org/npo/)
('**parse_items_2:', [u'Development Associate'], [u'http://sfbay.craigslist.org/eby/npo/3610841951.html'])
('**parse_items_2:', [u'Resource Development Assistant'], [u'http://sfbay.craigslist.org/eby/npo/3610835088.html'])
However, when the links and titles are scraped, their values are empty.
2013-02-13 11:23:55+0530 [craigs] DEBUG: Scraped from <200 http://sfbay.craigslist.org/npo/index100.html>
{'link': [], 'title': []}
2013-02-13 11:23:55+0530 [craigs] DEBUG: Scraped from <200 http://sfbay.craigslist.org/npo/index100.html>
{'link': [], 'title': []}
Code details:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from myspider.items import CraigslistSampleItem

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("index\d00\.html")), callback="parse_items_2", follow= True),
        Rule(SgmlLinkExtractor(allow=(r'sfbay.craigslist.org/npo')), callback="parse_items_1", follow= True),
    )

    def __init__(self, *a, **kw):
        super(MySpider, self).__init__(*a, **kw)
        self.items = []
        self.item = CraigslistSampleItem()

    def parse_items_1(self, response):
        # print response.url
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div")
        for title in titles:
            self.item ["title"] = title.select("//li/a/text()").extract()
            self.item ["link"] = title.select("//li/a/@href").extract()
            print ('**parse-items_1:', self.item["title"])
            self.items.append(self.item)
        return self.items

    def parse_items_2(self, response):
        # print response.url
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        for title in titles:
            self.item ["title"] = title.select("a/text()").extract()
            self.item ["link"] = title.select("a/@href").extract()
            print ('**parse_items_2:', self.item["title"], self.item["link"])
            self.items.append(self.item)
        return self.items
Any help is greatly appreciated!
Thanks.
1 Answer
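In the Scrapy tutorial, items are created inside the callback and then returned to be passed further along, rather than being bound to the spider instance. In your code, every callback writes into the same self.item object and appends it to self.items, so every entry in the list is the same object and holds whatever the last loop iteration stored (here, empty lists). Removing the __init__ part and rewriting some of the callback code seems to fix the problem.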
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import CraigslistSampleItem

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/"]

    rules = (
        # Raw strings keep the regex backslashes from being read as string escapes.
        Rule(SgmlLinkExtractor(allow=(r"index\d00\.html")), callback="parse_items_2", follow=True),
        Rule(SgmlLinkExtractor(allow=(r'sfbay.craigslist.org/npo')), callback="parse_items_1", follow=True),
    )

    def parse_items_1(self, response):
        items = []  # local to this call, not shared on the spider
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div")
        for title in titles:
            item = CraigslistSampleItem()  # a fresh item for every iteration
            # Note: a leading "//" searches the whole document, not just this <div>.
            item["title"] = title.select("//li/a/text()").extract()
            item["link"] = title.select("//li/a/@href").extract()
            print ('**parse-items_1:', item["title"])
            items.append(item)
        return items

    def parse_items_2(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        items = []
        for title in titles:
            item = CraigslistSampleItem()  # a fresh item for every iteration
            # Relative paths: only <a> elements that are direct children of this <p>.
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            print ('**parse_items_2:', item["title"], item["link"])
            items.append(item)
        return items
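To see why the original version ended up with empty items, here is a minimal sketch that needs no Scrapy at all. It uses plain dicts as a stand-in for CraigslistSampleItem, and the function names collect_shared and collect_fresh are made up for illustration. The second row assumes that the last <p> on the page has no <a> child, which would explain the empty lists in your log, though I have not checked the page itself.

```python
# Plain dicts stand in for CraigslistSampleItem; the aliasing behaves the same way.

def collect_shared(rows):
    """Mimics the original spider: one item object reused for every row."""
    item = {}
    items = []
    for title, link in rows:
        item["title"] = title
        item["link"] = link
        items.append(item)  # appends another reference to the same dict
    return items


def collect_fresh(rows):
    """Mimics the rewritten spider: a new item object for every row."""
    items = []
    for title, link in rows:
        item = {"title": title, "link": link}
        items.append(item)
    return items


rows = [
    (["Development Associate"],
     ["http://sfbay.craigslist.org/eby/npo/3610841951.html"]),
    ([], []),  # assumed: a trailing <p> without an <a> child yields empty lists
]

shared = collect_shared(rows)
fresh = collect_fresh(rows)

print(shared)  # both entries are the same dict, holding the last row's values
print(fresh)   # each entry keeps the values of its own row
```

With the shared object, the first row's title and link are overwritten by the empty second row, so every entry in the list comes out empty. With a fresh object per iteration, the first entry keeps its data.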
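To test it, I saved the scraped items to a file (scrapy crawl craigs -t json -o items.json). I noticed that there are sometimes empty entries, as well as a lot of "terms of use" links. This suggests your extraction XPaths could be tightened, but other than that it seems to be working.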
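Until the XPaths are tightened, one stopgap is to filter the items after extraction. The sketch below is only an illustration under a few assumptions: the items behave like dicts with "title" and "link" lists (as in the log output above), the helper names is_useful and clean_items are hypothetical, and matching on the substring "terms" is a guess at what those unwanted links look like. The third sample entry is invented for the demonstration.

```python
def is_useful(item):
    """Return True if the item has a title and a link and is not a terms-of-use link."""
    titles = item.get("title") or []
    links = item.get("link") or []
    if not titles or not links:
        return False  # drop the empty entries
    # Hypothetical heuristic: "terms" in the URL is an assumption about how
    # the unwanted links look; check the real URLs before relying on it.
    return not any("terms" in link.lower() for link in links)


def clean_items(items):
    """Keep only the items that pass is_useful, preserving their order."""
    return [item for item in items if is_useful(item)]


sample = [
    {"title": ["Development Associate"],
     "link": ["http://sfbay.craigslist.org/eby/npo/3610841951.html"]},
    {"title": [], "link": []},
    # Invented entry, standing in for the "terms of use" links:
    {"title": ["terms of use"], "link": ["http://example.com/about/terms.of.use"]},
]

cleaned = clean_items(sample)
print(cleaned)  # only the "Development Associate" item remains
```

In a real project the same check would probably fit better in an item pipeline that raises scrapy.exceptions.DropItem for unwanted items, but fixing the selectors so these entries are never extracted is the cleaner solution.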