Handling DIV classes in Scrapy

5 votes

1 answer

6336 views

Asked 2025-04-17 13:32

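I'm new to Scrapy and have only just started learning Python. I'm trying to write a spider that extracts article titles, links, and descriptions from a web page, almost like an RSS feed, to help with my thesis. I wrote the spider below, but when I run it and export the results to a .txt file, the file is empty. I think I may need to add an item loader, but I'm not sure.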

Items.py

from scrapy.item import Item, Field

class NorthAfricaItem(Item):
    title = Field()
    link = Field()
    desc = Field()

Spider

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from northafricatutorial.items import NorthAfricaItem

class NorthAfricaItem(BaseSpider):
    name = "northafrica"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["north-africa.com"]
    start_urls = [
        "http://www.north-africa.com/naj_news/news_na/index.1.html",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = NorthAfricaItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

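Update

Thanks to Talvalin's help, and after some experimenting, I finally solved this. I had been using a ready-made script I found online. After using the command-line tool, though, I found the right tags and got the information I needed. I ended up with: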

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from northafrica.items import NorthAfricaItem

class NorthAfricaSpider(BaseSpider):
    name = "northafrica"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["north-africa.com"]
    start_urls = [
        "http://www.north-africa.com/naj_news/news_na/index.1.html",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = NorthAfricaItem()
            item['title'] = site.select('//div[@class="short_holder"]/h2/a/text()').extract()
            item['link'] = site.select('//div[@class="short_holder"]/h2/a/@href').extract()
            item['desc'] = site.select('//span[@class="summary"]/text()').extract()
            items.append(item)
        return items

If anyone sees any mistakes here, please let me know... but it works now.
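One thing that may be worth checking in the spider above: an XPath that starts with `//` is evaluated against the whole document, even when it is called on a node inside a loop, so every item can end up holding every title, link, and summary on the page. A common alternative is to loop over each article container and query relative to it. The sketch below illustrates that idea using the standard library's `xml.etree.ElementTree` instead of Scrapy's selectors, so it runs on its own. The `SAMPLE` markup and the `extract_items` helper are hypothetical, and the assumption that the summary span sits inside the `short_holder` div has not been verified against the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real page structure may differ.
SAMPLE = """
<ul>
  <li>
    <div class="short_holder">
      <h2><a href="/a1.html">First title</a></h2>
      <span class="summary">First summary</span>
    </div>
  </li>
  <li>
    <div class="short_holder">
      <h2><a href="/a2.html">Second title</a></h2>
      <span class="summary">Second summary</span>
    </div>
  </li>
</ul>
"""


def extract_items(markup):
    """Return one dict per article, querying relative to each container."""
    root = ET.fromstring(markup)
    items = []
    # Select each article container once...
    for holder in root.iterfind(".//div[@class='short_holder']"):
        # ...then use paths relative to that container only.
        link = holder.find("h2/a")
        summary = holder.find("span[@class='summary']")
        items.append({
            "title": link.text,
            "link": link.get("href"),
            "desc": summary.text if summary is not None else None,
        })
    return items


items = extract_items(SAMPLE)
for item in items:
    print(item)
```

In Scrapy itself the same idea would usually mean selecting the containers first and then prefixing the inner expressions with `.` so they stay relative to the current node.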


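1 Answer

The important thing about this code is that it raises an error. If you run the spider from the command line, you will see an error message like this: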

        exceptions.TypeError: 'NorthAfricaItem' object does not support item assignment

2013-01-24 16:43:35+0000 [northafrica] INFO: Closing spider (finished)

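This error occurs because you named both the spider class and the item class NorthAfricaItem.

In your spider code, when you create an instance of NorthAfricaItem to assign values to it (such as the title, link, and description), the spider's version of NorthAfricaItem takes precedence, because the class definition in the spider module rebinds the name that was imported from the items module. Since the spider's version of NorthAfricaItem is not actually an item type, the assignment fails.

To fix this, rename the spider class to something like NorthAfricaSpider.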
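The name clash can be reproduced without Scrapy installed. The following is a minimal sketch in plain Python: `Item` and `BaseSpider` here are hypothetical stand-ins for the Scrapy classes (a dict subclass and an empty class), not the real ones, and `try_parse` is a helper added only for the demonstration.

```python
class Item(dict):
    """Hypothetical stand-in for scrapy's Item: supports item assignment."""


class BaseSpider:
    """Hypothetical stand-in for scrapy's BaseSpider."""


# Plays the role of "from ...items import NorthAfricaItem".
class NorthAfricaItem(Item):
    pass


# Defining the spider under the same name rebinds NorthAfricaItem,
# so the item class is no longer reachable through that name.
class NorthAfricaItem(BaseSpider):
    def parse(self):
        item = NorthAfricaItem()    # resolves to this spider class
        item['title'] = 'example'   # a plain object has no __setitem__
        return item


def try_parse():
    """Run parse() and return the error message, or None if it succeeded."""
    try:
        NorthAfricaItem().parse()
    except TypeError as exc:
        return str(exc)
    return None


error_message = try_parse()
print(error_message)
```

Renaming the second class to `NorthAfricaSpider` leaves `NorthAfricaItem` bound to the item class, and the assignment then succeeds.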

Answered 2025-04-17 by Python大师
