Getting KeyError: 'link' when trying to scrape email addresses from TripAdvisor

Posted 2024-04-27 00:35:33


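Here is my code so far. It is supposed to scrape the link, the restaurant name, and the restaurant's email address. Everything worked fine until I added the email part, although it does return the email address.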

import scrapy
from scrapy import Request


class RestaurantSpider(scrapy.Spider):
    name = 'restaurant'
    start_urls = [
        'https://www.tripadvisor.com.my/Restaurants-g298570-Kuala_Lumpur_Wilayah_Persekutuan.html#EATERY_OVERVIEW_BOX']

def parse is where I collect all the listings from the main page and then go through each results page to visit every restaurant's page:

    def parse(self, response):
        listings = response.xpath(
            '//div[@class="restaurants-list-ListCell__cellContainer--2mpJS"]')

        for listing in listings:
            link = listing.xpath(
                './/a[@class="restaurants-list-ListCell__restaurantName--2aSdo"]/@href').extract_first()
            text = listing.xpath(
                './/a[@class="restaurants-list-ListCell__restaurantName--2aSdo"]/text()').extract_first()
            yield scrapy.Request(url=response.urljoin(link),
                                 callback=self.parse_listing,
                                 meta={
                                     'Link': link,
                                     'Text': text
            }
            )

        next_urls = response.xpath(
            '//*[@class="nav next rndBtn ui_button primary taLnk"]/@href').extract()
        for next_url in next_urls:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)

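def parse_listing is where I visit a specific restaurant's page to get its email address and then yield the required data, which will later be stored in a .csv file: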

    def parse_listing(self, response):
        link = response.meta['link']
        text = response.meta['text']

        email = response.xpath(
            '//a[contains(@href, "mailto")]/@href').extract_first()

        yield {
            'Link': link,
            'Text': text,
            'Email': email
        }

2 Answers

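In parse() you defined meta={'Link': link, 'Text': text}, but in parse_listing() you read the key 'link', which does not exist. Dictionary keys are case-sensitive, so that lookup raises the KeyError. Your XPaths are also brittle, because they depend on the full generated class names.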
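The mismatch can be reproduced without Scrapy at all. The snippet below is a minimal sketch that uses a plain dict with the same key spelling as the meta dict built in parse(); the values are placeholders, not real TripAdvisor data.

```python
# Same key spelling as the meta dict built in parse(): capitalised 'Link' and 'Text'.
# The values are placeholders for illustration only.
meta = {'Link': '/Restaurant_Review-example.html', 'Text': 'Example Cafe'}

try:
    link = meta['link']        # lowercase key, as read in parse_listing()
except KeyError as exc:
    missing = exc.args[0]      # the key that could not be found
    link = None

print('missing key:', missing)
print('value under the correct key:', meta['Link'])
```

Either spelling works as long as the key written in parse() and the key read in parse_listing() are identical.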

Try the following to get it working:

import scrapy


class RestaurantSpider(scrapy.Spider):
    name = 'restaurant'

    start_urls = [
        'https://www.tripadvisor.com.my/Restaurants-g298570-Kuala_Lumpur_Wilayah_Persekutuan.html#EATERY_OVERVIEW_BOX'
    ]

    def parse(self, response):
        # Match on a stable fragment of each class name instead of the full generated class.
        for listing in response.xpath('//div[contains(@class,"__cellContainer")]'):
            link = listing.xpath('.//a[contains(@class,"__restaurantName")]/@href').get()
            text = listing.xpath('.//a[contains(@class,"__restaurantName")]/text()').get()
            complete_url = response.urljoin(link)
            yield scrapy.Request(
                url=complete_url,
                callback=self.parse_listing,
                # These keys must match the ones read in parse_listing().
                meta={'link': complete_url, 'text': text}
            )

        # Follow the "next" pagination link, if there is one.
        next_url = response.xpath('//*[contains(@class,"pagination")]/*[contains(@class,"next")]/@href').get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)

    def parse_listing(self, response):
        link = response.meta['link']
        text = response.meta['text']
        # .get() returns the raw href (including the "mailto:" prefix), or None if no such link exists.
        email = response.xpath('//a[contains(@href, "mailto:")]/@href').get()
        yield {'Link': link, 'Text': text, 'Email': email}
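Because the XPath selects the href attribute of a mailto link, the Email value will still carry the "mailto:" prefix and possibly a "?subject=..." suffix. If only the bare address is wanted in the .csv file, something along these lines could be applied before yielding. This is a sketch using only the standard library; email_from_mailto is a hypothetical helper name, not part of Scrapy.

```python
from urllib.parse import unquote, urlsplit


def email_from_mailto(href):
    """Return the address part of a mailto: href, or None if href is not a mailto link.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    if not href:
        return None
    parts = urlsplit(href.strip())
    if parts.scheme != 'mailto':
        return None
    # parts.path is the text between "mailto:" and any "?subject=..." query string.
    return unquote(parts.path) or None


print(email_from_mailto('mailto:info@example.com?subject=Hello'))
```

In parse_listing() this would be used as email = email_from_mailto(email) just before the yield, which also leaves None untouched for pages without a mailto link.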

Replace 'link' with 'href'.

I could not reproduce your code, but it looks like there is no link attribute, so grab 'href' instead:

<a href="/Restaurant_Review-g298570-d15211507-Reviews-Vintage_1988_Cafe-Kuala_Lumpur_Wilayah_Persekutuan.html" class="restaurants-list-ListCell__restaurantName--2aSdo" target="_blank">Vintage 1988 Cafe</a>


link = response.meta['href']
