Extracting the property ID from an attribute using XPath

<div class="corner-ribbon"> <span class="ribbon-green">NEW!</span> </div> <a href="Details?id=182519" title="view this property"> <img class="img-responsive img-prop" src="https://kwsadocuments.blob.core.windows.net/devblob/24c21aa4-ae17-41d1-8719-5abf8f24c766.jpg" alt="Living close to Nature"> </a>

response.xpath('//a[@title="view this property"]/@href').getall()
response.xpath('//*[@id="divListingResults"]/div/div/a/@href').getall()
response.xpath('//*[@class="corner-ribbon"]/a/@href').getall()

1 answer

User

#1 · Posted on 2024-05-29 00:03:52

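First, you need to understand how this page works. It loads the property listings with JavaScript (press Ctrl+U in your browser to check the page source), and, as you know, Scrapy cannot execute JavaScript. That is why the XPath expressions that match in the browser's rendered DOM return nothing in the response Scrapy receives.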

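However, if you check the page source, you will find that all the information you need is "hidden" in the <input id="propertyJson" name="ListingResults.JsonResult" > tag. So you only need to get its value attribute and parse it with the json module: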

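import json

import scrapy


class PropertySpider(scrapy.Spider):
    name = 'property_spider'
    start_urls = ['https://www.kwsouthafrica.co.za/Property/RouteUrl?ids=P22%2C&ForSale=ForSale&PropertyTypes=&Beds=Any&Baths=Any&MinPrice=Any&MaxPrice=Any']

    def parse(self, response):
        # The listing data is embedded as JSON in a hidden <input> element.
        property_json = response.xpath('//input[@id="propertyJson"]/@value').get()
        if property_json is None:
            # .get() returns None when nothing matches; json.loads(None) would raise.
            self.logger.warning('propertyJson input not found on %s', response.url)
            return
        # Uncomment to save the raw JSON for inspection:
        # with open('Samples/Properties.json', 'w', encoding='utf-8') as f:
        #     f.write(property_json)
        property_data = json.loads(property_json)
        # "listing" avoids shadowing the built-in name "property".
        for listing in property_data:
            yield {
                'id': listing['Id'],
                'title': listing['Title'],
            }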
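
The parsing step can also be checked offline, without running the spider or hitting the site. The following is only a minimal sketch using the standard library instead of Scrapy: the sample markup, the helper names (`HiddenInputParser`, `extract_listings`, `id_from_href`) and the JSON values are invented for illustration, and it assumes the real JSON is a list of objects with an `Id` key, as the spider above suggests. It also shows one way to pull the ID out of a link such as `Details?id=182519` from the question, in case you do end up with the `href` values.

```python
import json
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class HiddenInputParser(HTMLParser):
    """Collects the value attribute of <input id="propertyJson">."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("id") == "propertyJson":
            # HTMLParser has already unescaped entities such as &quot;
            self.value = attributes.get("value")


def extract_listings(html):
    """Return the decoded listing list, or [] if the input is missing."""
    parser = HiddenInputParser()
    parser.feed(html)
    if parser.value is None:
        return []
    return json.loads(parser.value)


def id_from_href(href):
    """Return the id query parameter of a link, or None if absent."""
    values = parse_qs(urlparse(href).query).get("id")
    return values[0] if values else None


# Hypothetical sample markup; the real page's JSON has more fields.
SAMPLE_HTML = (
    '<input id="propertyJson" name="ListingResults.JsonResult" '
    'value="[{&quot;Id&quot;: 182519, '
    '&quot;Title&quot;: &quot;Living close to Nature&quot;}]">'
)

listings = extract_listings(SAMPLE_HTML)
ids = [item["Id"] for item in listings]
print(ids)
print(id_from_href("Details?id=182519"))
```

Note that `id_from_href` returns the ID as a string, because query-string values are always text; convert with `int()` if you need a number.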
