How do I correctly use Rules and restrict_xpaths in Scrapy to crawl and parse URLs?

Question

I am trying to write a crawl spider that crawls a website's RSS feeds and then parses the meta tags of each article.

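The first RSS page lists the RSS categories. I managed to extract the links because each one sits inside a td tag with class xmlLink. It looks like this: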

        <tr>
           <td class="xmlLink">
             <a href="http://feeds.example.com/subject1">subject1</a>
           </td>   
        </tr>
        <tr>
           <td class="xmlLink">
             <a href="http://feeds.example.com/subject2">subject2</a>
           </td>
        </tr>

Clicking one of those links takes you to the articles for that RSS category, which look like this:

        <li class="regularitem">
           <h4 class="itemtitle">
             <a href="http://example.com/article1">article1</a>
           </h4>
        </li>
        <li class="regularitem">
           <h4 class="itemtitle">
             <a href="http://example.com/article2">article2</a>
           </h4>
        </li>

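As you can see, I can again get the link with XPath if I use the h4 tag with class itemtitle. I want my crawler to follow the link inside that tag and parse the meta tags for me.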
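As a quick sanity check outside Scrapy, the two XPath expressions can be tried against the fragments above. This is only a minimal sketch: it uses the standard library's xml.etree.ElementTree, which supports a limited subset of XPath, and the wrapping root element and the helper name extract_links are assumptions made for illustration, not part of the spider.

```python
import xml.etree.ElementTree as ET

# The two fragments from the question, wrapped in a hypothetical <root>
# element so that they parse as well-formed XML.
CATEGORY_PAGE = """
<root>
  <tr><td class="xmlLink"><a href="http://feeds.example.com/subject1">subject1</a></td></tr>
  <tr><td class="xmlLink"><a href="http://feeds.example.com/subject2">subject2</a></td></tr>
</root>
"""

FEED_PAGE = """
<root>
  <li class="regularitem"><h4 class="itemtitle"><a href="http://example.com/article1">article1</a></h4></li>
  <li class="regularitem"><h4 class="itemtitle"><a href="http://example.com/article2">article2</a></h4></li>
</root>
"""


def extract_links(markup, xpath):
    """Return the href of every <a> element matched by xpath (hypothetical helper)."""
    root = ET.fromstring(markup.strip())
    return [anchor.get("href") for anchor in root.findall(xpath)]


# Same restrictions as the two rules, written in ElementTree's XPath subset.
category_links = extract_links(CATEGORY_PAGE, ".//td[@class='xmlLink']/a")
article_links = extract_links(FEED_PAGE, ".//h4[@class='itemtitle']/a")

print(category_links)
print(article_links)
```

If both lists come back with the expected URLs, the expressions themselves match the markup, which would suggest the problem lies in how the spider follows the links rather than in the XPath.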

Here is my spider code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import exampleItem


class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss'] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//h4[@class="itemtitle"]')), callback='parse_articles')]

    def parse_articles(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items

But when I run the spider, this is the output:

DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject1> (referer: http://example.com/tools/rss)
DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject2> (referer: http://example.com/tools/rss)

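What am I doing wrong here? I have been reading the documentation over and over, but I feel like I keep overlooking something. Any help would be much appreciated.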

EDIT: Added items.append(item). I forgot to include it in the original post.

EDIT: I have also tried this, and the output is still the same:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from reuters.items import exampleItem
from scrapy.http import Request

class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss'] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(allow=[r'.*',], restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
             Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//h4[@class="itemtitle"]')),follow=True),]


    def parse(self, response):       
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//td[@class="xmlLink"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback = self.parse_link)


    def parse_link(self, response):       
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//h4[@class="itemtitle"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback = self.parse_again)    

    def parse_again(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items
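
For reference, the per-tag extraction that the last callback is meant to perform can be reproduced without Scrapy. The following is a rough sketch using only the standard library's html.parser. The sample page, the MetaCollector class, and the collect_meta function are all hypothetical names made up for illustration. Unlike the selector's extract(), which returns lists, this version stores plain strings (or None when an attribute is missing).

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from every <meta> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attrs is a list of (name, value) tuples.
        if tag == "meta":
            attr_map = dict(attrs)
            self.pairs.append((attr_map.get("name"), attr_map.get("content")))


# A made-up article page standing in for http://example.com/article1.
SAMPLE_ARTICLE = """<html><head>
<meta name="description" content="Example article">
<meta name="keywords" content="rss, scrapy">
</head><body><p>text</p></body></html>"""


def collect_meta(html, url):
    """Build one dict per <meta> tag, mirroring the item fields used in the spider."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return [
        {"link": url, "meta_name": name, "meta_value": content}
        for name, content in parser.pairs
    ]


items = collect_meta(SAMPLE_ARTICLE, "http://example.com/article1")
for item in items:
    print(item)
```

This only shows the shape of the data the callback should produce once an article page is actually reached. It does not address why the article links are not being followed.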
