Scrapy - how to combine data from different parts of a website

1 vote
2 answers
2159 views
Asked 2025-05-10 15:37

I'm building a crawler. I want it to go through every available page on the site and, [i] fill in several data fields for each product, then [ii] for each product, follow the product's URL and fill in a few more data fields. I want all the data for each product to end up in the same {}. However, the crawler currently does all of [i] first and only then [ii], which leaves the data from [ii] in a separate {}.

I'm trying to find a way to feed the data from [i] into [ii]. request.meta['item'] = item looks like it should help, but I haven't managed to get it working yet.
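As I understand it, the pattern would look roughly like this (a rough sketch only; parse_list, product_url and the field names are placeholders, not my actual code):

def parse_list(self, response):
    item = CrawlerItem()
    item['title'] = response.xpath('//h1/text()').extract_first()
    # attach the half-filled item to the detail-page request
    yield Request(url=product_url, callback=self.parse_detail, meta={'item': item})

def parse_detail(self, response):
    # pick the item back up and keep filling it in
    item = response.meta['item']
    yield item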

I have the following code:

# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy import Spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request
from maxine.items import CrawlerItem



class Crawler1Spider(CrawlSpider):
    name = "crawler1"
    allowed_domains = ["website.com"]
    start_urls = (
        'starturl.com',
    )


    rules = [
        # visit each page
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="listnavpagenum"]')), callback='parse_item', follow=True),
        # click on each product link
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="exhib_status exhib_status_interiors"]')), callback='parse_detail', follow=True),
    ]


    def parse_item(self, response):
        sel = Selector(response)
        elements = sel.xpath('//div[@class="ez_listitem_wrapper"]')
        n = 0
        for element in elements:
            item = CrawlerItem()
            n = n + 1
            # work out how to put images into image folder
            item['title'] = element.css('a.exhib_status.exhib_status_interiors').xpath('text()').extract_first()
            item['title_code'] = element.xpath('.//div[@class="ez_merge8"]/text()').extract_first()
            # relative xpath (.//) so each element yields its own URL
            item['item_url'] = element.xpath('.//div[@class="ez_merge4"]/a/@href').extract_first()
            item['count'] = n
            yield item



    def parse_detail(self, response):
        item = CrawlerItem()
        # raw strings so the regex escapes are passed through unchanged
        item['telephone'] = response.xpath('//div[@id="ez_entry_contactinfo"]//text()').re(r'[0-9]{4,}\s*[0-9]{4,}')
        item['website'] = response.xpath('//div[@id="ez_entry_contactinfo"]//text()').re(r'(?:http://)?www\.[a-z0-9/?_\- ]+\.[0-9a-z]+')
        yield item

Any advice on how to get all the data for each product into a single {} would be much appreciated.

Update: 20/11/15

I've revised the code as follows:

# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy import Spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request
from maxine.items import CrawlItem



class Crawler1Spider(CrawlSpider):
    name = "test"
    allowed_domains = ["website.com"]
    start_urls = (
        'starturl.com',
    )

    rules = [
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="listnavpagenum"]')), callback='parse_item', follow=True),
    ]


    def parse_item(self, response):
        item = CrawlItem()
        sel = Selector(response)
        elements = sel.xpath('//div[@class="ez_listitem_wrapper"]')
        n = 0
        for element in elements:
            n = n + 1
            # work out how to put images into image folder
            #item['image_urls'] = selector.xpath('//a[@class="exhib_status exhib_status_interiors"]/img/@src').extract()
            item['title'] = element.css('a.exhib_status.exhib_status_interiors').xpath('text()').extract_first()
            item['title_code'] = element.xpath('.//div[@class="ez_merge8"]/text()').extract_first()
            item['item_url'] = element.xpath('.//div[@class="ez_merge4"]/a/@href').extract_first()
            item['count'] = n
            item_detail_url = item['item_url']
        # crawl the item and pass the item to the following request with *meta*
        yield Request(url=item_detail_url, callback=self.parse_detail, meta=dict(item=item))


    def parse_detail(self, response):
        # get the item from the previous passed meta
        item = response.meta['item']
        # keep populating the item
        item['telephone'] = response.xpath('//div[@id="ez_entry_contactinfo"]//text()').re(r'[0-9]{4,}\s*[0-9]{4,}')
        item['website'] = response.xpath('//div[@id="ez_entry_contactinfo"]//text()').re(r'(?:http://)?www\.[a-z0-9/?_\- ]+\.[0-9a-z]+')
        yield item

The data now ends up in the same {}, but the spider only extracts data for the last item on each page. Any further suggestions?


2 Answers

0

Try creating a new item = CrawlItem() object inside the loop in your parse_item method.
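In other words, something like this (a rough sketch based on your updated code; note that both the item and the Request move inside the loop):

    def parse_item(self, response):
        elements = response.xpath('//div[@class="ez_listitem_wrapper"]')
        n = 0
        for element in elements:
            # a fresh item per element, so each product gets its own {}
            item = CrawlItem()
            n = n + 1
            item['title'] = element.css('a.exhib_status.exhib_status_interiors').xpath('text()').extract_first()
            item['title_code'] = element.xpath('.//div[@class="ez_merge8"]/text()').extract_first()
            item['item_url'] = element.xpath('.//div[@class="ez_merge4"]/a/@href').extract_first()
            item['count'] = n
            # yield inside the loop: one detail request per product
            yield Request(url=item['item_url'], callback=self.parse_detail, meta=dict(item=item))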

0

I'm afraid you can't use rules in this case, because by the time each request reaches the site you want to scrape, it is independent of the others.

You need to define your own flow, starting from start_requests:

def start_requests(self):
    yield Request(url=myinitialurl, callback=self.parse)

def parse(self, response):
    # crawl the initial page and then do something with that info
    yield Request(url=producturl, callback=self.parse_item)

def parse_item(self, response):
    item = CrawlerItem()
    # crawl the item and pass the item to the following request with *meta*
    yield Request(url=item_detail_url, callback=self.parse_detail, meta=dict(item=item))

def parse_detail(self, response):
    # get the item from the previous passed meta
    item = response.meta['item']
    # keep populating the item
    yield item
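As a side note going beyond the original answers: meta still works, but Scrapy 1.7 and later added Request.cb_kwargs, which is the cleaner channel for passing your own data between callbacks, since meta is also used by Scrapy internals. The same hand-off would look like this (item_detail_url is a placeholder as above):

def parse_item(self, response):
    item = CrawlerItem()
    # pass the item as a keyword argument to the next callback
    yield Request(url=item_detail_url, callback=self.parse_detail, cb_kwargs={'item': item})

def parse_detail(self, response, item):
    # the item arrives as a normal parameter instead of via response.meta
    yield item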
