Scrapy: spider returns no results

0 votes

2 answers

1791 views

Asked 2025-04-18 10:16

This is my first time writing a spider, and despite my efforts it returns nothing when I export to a CSV file. My code is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class Emag(CrawlSpider):
    name = "emag"
    allowed_domains = ["emag.ro"]
    start_urls = [
        "http://www.emag.ro/"]

    rules = (Rule(SgmlLinkExtractor(allow=(r'www.emag.ro')), callback="parse", follow= True))

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//a/@href').extract()
        for site in sites:
            site = str(site)

        for clean_site in site:
            name = clean_site.xpath('//[@id=""]/span').extract()
            return name

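If I print sites, I get a list of URLs, which is fine. If I look up the name for a single URL in the scrapy shell, it is found. The problem is that when I try to get the names from all crawled links, nothing comes back. I run the spider with "scrapy crawl emag>emag.csv".

Could you give me a hint about what I am doing wrong?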


2 answers

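One possible cause is that the site's robots.txt file disallows the pages you are crawling. You can confirm this by checking the crawl log. If that is the case, open your settings.py and set ROBOTSTXT_OBEY to False. That is what solved the problem for me.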
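To see how a robots.txt rule ends up blocking a URL, here is a minimal sketch using Python's standard urllib.robotparser rather than Scrapy itself. The robots.txt content, the user agent name "mybot", the example.com URLs, and the helper is_allowed are all assumptions made up for illustration; they are not taken from emag.ro.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration;
# it is not the real file served by any site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS_TXT, "mybot", "http://example.com/private/page"))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "mybot", "http://example.com/public/page"))
```

If a check like this reports that the start URL is disallowed, a spider with ROBOTSTXT_OBEY enabled would skip that request, which can look like a spider that "returns nothing".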

Answered 2025-04-18 by Python大师


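There are several problems in the spider:

rules should be an iterable; a comma is missing before the last closing parenthesis
No Item is specified: you need to define an Item class and return or yield it from the spider's callback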
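The first point is plain Python behaviour and can be checked without Scrapy. In this small sketch, FakeRule and is_iterable are hypothetical names invented for the illustration; FakeRule merely stands in for a non-iterable object such as a Rule.

```python
class FakeRule:
    """Hypothetical stand-in for a rule object; it is not iterable."""


def is_iterable(obj) -> bool:
    """Return True if obj can be looped over with a for statement."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True


without_comma = (FakeRule())   # parentheses only: still a single FakeRule
with_comma = (FakeRule(),)     # trailing comma: a one-element tuple

print(is_iterable(without_comma))
print(is_iterable(with_comma))
```

Without the trailing comma the parentheses are just grouping, so rules would hold a single object instead of a collection that can be looped over.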

Here is a fixed version of the spider:

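# These import paths match the old Scrapy release used in the question;
# newer releases provide scrapy.spiders and scrapy.linkextractors.LinkExtractor.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Field, Item


class MyItem(Item):
    # Single field holding the extracted name
    name = Field()


class Emag(CrawlSpider):
    name = "emag"
    allowed_domains = ["emag.ro"]
    start_urls = [
        "http://www.emag.ro/"]

    # The trailing comma makes rules a one-element tuple (an iterable).
    # The callback is not named "parse" because CrawlSpider uses parse() internally.
    rules = (Rule(SgmlLinkExtractor(allow=(r'www.emag.ro')), callback="parse_item", follow=True), )

    def parse_item(self, response):
        sel = Selector(response)
        sites = sel.xpath('//a/@href')
        for site in sites:
            item = MyItem()
            # The id value is left empty as in the question; '*' matches any element
            item['name'] = site.xpath('//*[@id=""]/span').extract()
            yield item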
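The switch from return to yield matters too: the loop in the question used return, which ends the callback on the first iteration, while yield lets the callback emit one result per link. A minimal sketch in plain Python, with hypothetical function names and sample links:

```python
def first_only(links):
    """Mimics the question's loop: return exits on the first link."""
    for link in links:
        return {"name": link}


def one_per_link(links):
    """Mimics the fixed loop: yield produces one result per link."""
    for link in links:
        yield {"name": link}


links = ["/a", "/b", "/c"]  # hypothetical sample links

first = first_only(links)
all_items = list(one_per_link(links))

print(first)
print(len(all_items))
```

As for the export itself, Scrapy's feed export option (scrapy crawl emag -o emag.csv) is the usual way to write items to CSV; redirecting with > only captures what the process prints to standard output.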

Answered 2025-04-18 by Python大师
