Web scraping with Scrapy

2 votes

2 answers

624 views

Asked 2025-04-18 07:44

I'm trying to use Scrapy in more depth, but I can only get the titles of the items I'm scraping, not any of their details. Here is the code I have so far:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tcgplayer1.items import Tcgplayer1Item

class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["http://www.tcgplayer.com/"]
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div[@class='magicCard']")
        vendor = hxs.select("//tr[@class='vendor']")
        items = []

        for titles in titles:
            item = Tcgplayer1Item()
            item ["cardname"] = titles.select("//li[@class='cardName']/a/text()").extract()
            item ["price"] = vendor.select("//td[@class='price']/br/text()").extract()
            item ["quantity"] = vendor.select("//td[@class='quantity']/td/text()").extract()
            items.append(item)
        return items

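I can't get any results for price and quantity. Each card has multiple vendors, and each vendor has its own price and quantity. I suspect that is where the problem lies. Any help would be greatly appreciated.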


2 Answers

First, you can change

item ["price"] = vendor.select("//td[@class='price']/br/text()").extract()
item ["quantity"] = vendor.select("//td[@class='quantity']/td/text()").extract()

to:

item["price"] = titles.select(".//td[@class='price']/br/text()").extract()
item["quantity"] = titles.select(".//td[@class='quantity']/td/text()").extract()

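This ensures you only get the price and quantity rows for the card you want, provided the XPath is relative (starts with .//). An expression that starts with // searches the whole document no matter which selector it is called on.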

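You will probably also need to drop /br and /td from the selectors, since a br element has no text of its own and a td is not normally nested directly inside another td. Your code would then look like this: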

item["price"] = titles.select(".//td[@class='price']/text()").extract()
item["quantity"] = titles.select(".//td[@class='quantity']/text()").extract()
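
The effect of scoping a search to one card can be sketched without Scrapy or a network request. The example below is only an illustration under stated assumptions: it uses Python's standard-library xml.etree.ElementTree instead of Scrapy's selectors, and the two-card markup and prices are invented, not taken from the real page. Searching from the document root returns every card's prices, which is also what a Scrapy XPath beginning with // does; searching with .// from a single card element returns only that card's rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two cards, each with its own vendor rows.
SAMPLE = """
<html><body>
  <div class="magicCard">
    <ul><li class="cardName"><a>Card A</a></li></ul>
    <table>
      <tr class="vendor"><td class="price">1.00</td></tr>
      <tr class="vendor"><td class="price">1.25</td></tr>
    </table>
  </div>
  <div class="magicCard">
    <ul><li class="cardName"><a>Card B</a></li></ul>
    <table>
      <tr class="vendor"><td class="price">9.99</td></tr>
    </table>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)
cards = root.findall(".//div[@class='magicCard']")

# Searching from the root ignores card boundaries: all prices come back.
all_prices = [td.text for td in root.findall(".//td[@class='price']")]

# Searching from each card element stays inside that card.
prices_by_card = {
    card.find(".//li[@class='cardName']/a").text: [
        td.text for td in card.findall(".//td[@class='price']")
    ]
    for card in cards
}

print(all_prices)
print(prices_by_card)
```

The same grouping is what the spider needs: one item per card, each holding only its own vendors' values.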

Answered 2025-04-18 by Python大师


First, here is a corrected version of the code:

from scrapy.spider import BaseSpider  # newer Scrapy releases expose this class as scrapy.Spider
from scrapy.selector import Selector
from tcgplayer1.items import Tcgplayer1Item


class MySpider(BaseSpider):
    name = "tcg"
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ["tcgplayer.com"]
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

    def parse(self, response):
        hxs = Selector(response)
        # one selector per card on the page
        titles = hxs.xpath("//div[@class='magicCard']")
        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]

            # vendor rows for this card only; the class value ends with a space
            vendor = title.xpath(".//tr[@class='vendor ']")
            # each call returns one cleaned string per vendor row
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            yield item

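There were several problems in the code:

The vendor class name needs a trailing space, "vendor ", because @class='...' compares the whole attribute string. This one is hard to spot.
Each card has multiple vendors, so vendor has to be defined inside the loop.
You rebound the titles variable inside the loop (for titles in titles).
XPath expressions inside the loop should be relative, starting with .//.
Use Selector instead of the deprecated HtmlXPathSelector.
Use xpath() instead of the deprecated select().
Use normalize-space() to strip the newlines and extra whitespace from the price and quantity results.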
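
Two of the points above, the exact match on the class attribute and the whitespace cleanup, can be checked offline. The sketch below is an illustration only: it uses the standard-library xml.etree.ElementTree in place of Scrapy's Selector, and the table row is invented. ElementTree has no normalize-space() function, so the helper normalize_space (a hypothetical name introduced here) approximates it with str.split.

```python
import xml.etree.ElementTree as ET

# Invented row: note the trailing space in class="vendor " and the
# newlines and padding around the cell text.
ROW = """
<table>
  <tr class="vendor ">
    <td class="price">
        $0.25
    </td>
    <td class="quantity">
        12
    </td>
  </tr>
</table>
""".strip()

table = ET.fromstring(ROW)

# [@class='...'] compares the whole attribute string, so the version
# without the trailing space matches nothing.
without_space = table.findall(".//tr[@class='vendor']")
with_space = table.findall(".//tr[@class='vendor ']")


def normalize_space(text):
    """Collapse whitespace runs and trim both ends, like XPath normalize-space()."""
    return " ".join((text or "").split())


row = with_space[0]
raw_price = row.find("td[@class='price']").text
price = normalize_space(raw_price)
quantity = normalize_space(row.find("td[@class='quantity']").text)

print(len(without_space), len(with_space))
print(repr(raw_price), repr(price), repr(quantity))
```

If the stray space in the page's markup ever changes, a more tolerant XPath 1.0 test such as contains(concat(' ', normalize-space(@class), ' '), ' vendor ') would match either form in Scrapy; ElementTree's limited XPath subset does not support it.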

Answered 2025-04-18 by Python大师

