Scrapy spider not scraping correctly

Asked 2025-04-18 12:02

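I'm using the 64-bit build of Python 2.7 from Python.org, from the command line on Windows Vista. I have Scrapy installed and it seems to run stably. However, I copied this simple piece of code: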

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
        name = "craig"
        allowed_domains = ["craigslist.org"]
        start_urls = ["http://sfbay.craigslist.org/sfc/npo/"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            titles = hxs.select("//p")
            for titles in titles:
                title = titles.select("a/text()").extract()
                link = titles.select("a/@href").extract()
                print title, link

The code comes from this YouTube video:

http://www.youtube.com/watch?v=1EFnX1UkXVU

When I run it, I get these warnings:

    hxs = HtmlXPathSelector(response)
C:\Python27\mrscrap\mrscrap\spiders\test.py:11: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
  titles = hxs.select("//p")
c:\Python27\lib\site-packages\scrapy\selector\unified.py:106: ScrapyDeprecationWarning: scrapy.selector.HtmlXPathSelector is deprecated, ins
.Selector instead.
  for x in result]
C:\Python27\mrscrap\mrscrap\spiders\test.py:13: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
  title = titles.select("a/text()").extract()
C:\Python27\mrscrap\mrscrap\spiders\test.py:14: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
  link = titles.select("a/@href").extract()

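Has Scrapy's syntax changed recently so that .extract() is no longer valid? I tried using .xpath() in its place, but got an error saying that .xpath() needs two arguments, and I'm not sure what to pass.

Any suggestions?

Thanks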

3 Answers

The code should look like this (tested). Aufziehvogel's code got me 95% of the way there.

    from scrapy.spider import BaseSpider
    from craigslist_sample.items import CraigslistSampleItem

    class MySpider(BaseSpider):
        name = "craig"
        allowed_domains = ["craigslist.org"]
        start_urls = ["http://sfbay.craigslist.org/search/npo"]

        def parse(self, response):
            # response.selector replaces the deprecated HtmlXPathSelector.
            rows = response.selector.xpath("//p")
            items = []
            for row in rows:
                item = CraigslistSampleItem()
                # extract() returns a list of the matching strings.
                item["title"] = row.xpath("a/text()").extract()
                item["link"] = row.xpath("a/@href").extract()
                items.append(item)
            return items

Answered 2025-04-18 by Python大师

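The problem isn't extract, which is still valid; the problem is select. The selector API changed recently, as 1478963 mentioned in the comments (time flies, so "recently" may already be a year ago...).

We no longer use HtmlXPathSelector. Instead there is a generic Selector that provides both an xpath() and a css() method. With it you can choose either style, or even mix the two by calling one of these methods on the result of the other.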

With the new API, your example should look like this (untested):

from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/sfc/npo/"]

    def parse(self, response):
        # response.selector is the generic Selector; no HtmlXPathSelector needed.
        rows = response.selector.xpath("//p")
        for row in rows:
            # xpath() takes the query; extract() returns the matching strings.
            title = row.xpath("a/text()").extract()
            link = row.xpath("a/@href").extract()
            print title, link

Answered 2025-04-18 by Python大师

Regarding the other answers, the line should be:

title = titles.xpath("a/text()").extract()

Answered 2025-04-18 by Python大师
