How do I correctly use rules and restrict_xpaths to crawl and parse URLs with Scrapy?

Posted 2024-04-29 08:49:05


I am trying to write a CrawlSpider that crawls a website's RSS feeds and then parses the meta tags of each article.

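The first RSS page lists the RSS categories. I managed to extract those links because each <a> tag sits inside a <td class="xmlLink"> tag. It looks like this: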

        <tr>
           <td class="xmlLink">
             <a href="http://feeds.example.com/subject1">subject1</a>
           </td>   
        </tr>
        <tr>
           <td class="xmlLink">
             <a href="http://feeds.example.com/subject2">subject2</a>
           </td>
        </tr>

Following one of those links takes you to the articles for that RSS category, which look like this:

   <li class="regularitem">
    <h4 class="itemtitle">
        <a href="http://example.com/article1">article1</a>
    </h4>
  </li>
  <li class="regularitem">
     <h4 class="itemtitle">
        <a href="http://example.com/article2">article2</a>
     </h4>
  </li>

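As you can see, I can again get the links with XPath by using the <h4 class="itemtitle"> tag. I want my crawler to follow the links inside that tag and parse the meta tags of each article for me.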
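A rough sketch of what that second selection step picks out, using the article markup shown above. It relies on the standard library's xml.etree.ElementTree instead of Scrapy so that it runs on its own, and ElementTree supports only a subset of XPath. The wrapping <ul> element and the names PAGE and article_links are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the category page, wrapped in a <ul> root
# (an assumption) so that it parses as well-formed XML.
PAGE = """
<ul>
  <li class="regularitem">
    <h4 class="itemtitle">
      <a href="http://example.com/article1">article1</a>
    </h4>
  </li>
  <li class="regularitem">
    <h4 class="itemtitle">
      <a href="http://example.com/article2">article2</a>
    </h4>
  </li>
</ul>
"""


def article_links(markup):
    """Return the href of every <a> directly under <h4 class="itemtitle">."""
    root = ET.fromstring(markup)
    # Same idea as restricting link extraction to //h4[@class="itemtitle"].
    return [a.get("href") for a in root.findall(".//h4[@class='itemtitle']/a")]


print(article_links(PAGE))
```

Only links under the matching <h4> elements are returned; any other links on the page would be ignored, which is the effect the XPath restriction is meant to have.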

Here is my spider code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import exampleItem


class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss'] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//h4[@class="itemtitle"]')), callback='parse_articles')]

    def parse_articles(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items

However, this is the output when I run the spider:

DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject1> (referer: http://example.com/tools/rss)
DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject2> (referer: http://example.com/tools/rss)

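What am I doing wrong here? I have read the documentation over and over, but I feel like I keep overlooking something. Any help would be greatly appreciated.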

Edit: added items.append(item), which I forgot in the original post.

Edit: I also tried the following, which gives the same output:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from reuters.items import exampleItem
from scrapy.http import Request

class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss'] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(allow=[r'.*',], restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
             Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//h4[@class="itemtitle"]')),follow=True),]


    def parse(self, response):       
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//td[@class="xmlLink"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback = self.parse_link)


    def parse_link(self, response):       
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//h4[@class="itemtitle"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback = self.parse_again)    

    def parse_again(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items


1 Answer

#1 · Posted 2024-04-29 08:49:05

You were returning an empty items list. You need to append each item to items.
You can also yield each item inside the loop instead.
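A minimal sketch of the yield variant, not the asker's actual spider. To keep it runnable without Scrapy, it uses the standard library's html.parser as a stand-in for Scrapy selectors. The names MetaCollector, parse_article and SAMPLE are hypothetical, and plain dicts take the place of exampleItem.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # attrs is a list of (name, value) pairs.
            self.metas.append(dict(attrs))


def parse_article(url, html):
    """Yield one item per <meta> tag instead of building and returning a list."""
    collector = MetaCollector()
    collector.feed(html)
    for attrs in collector.metas:
        yield {
            "link": url,
            "meta_name": attrs.get("name"),
            "meta_value": attrs.get("content"),
        }


# Hypothetical article markup, used only for illustration.
SAMPLE = (
    '<html><head>'
    '<meta name="author" content="someone">'
    '<meta name="keywords" content="rss, news">'
    '</head></html>'
)

items = list(parse_article("http://example.com/article1", SAMPLE))
print(items)
```

Because each item is yielded as soon as it is built, there is no list to forget to append to. A Scrapy callback can presumably take the same shape, with the loop running over the selected meta nodes.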
