Scrapy duplicating results

Posted 2024-05-23 23:38:15


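I am trying to parse various data items for each advert on a page such as https://www.pistonheads.com/classifieds?Category=used-cars&M=1044&ResultsPerPage=750

My code captures most of the items correctly. However, I have run into two problems: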

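  1. The output in the Year column is the same for every row, even though the XPath is identical to the one used for the title column, which works correctly.
  2. Every row in my output has a value for Transmission, which cannot be right, because not every advert has this field filled in.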

General comments on my code are also appreciated. Perhaps I should be using ItemLoaders for this (I haven't learned how they work yet).

import scrapy
from datetime import date


class SuperScraper(scrapy.Spider):
    name = 'ss22'

    def start_requests(self):
        urls = 'https://www.pistonheads.com/classifieds?Category=used-cars&M=1044&ResultsPerPage=750'
        yield scrapy.Request(urls, callback = self.parse_data)

    def parse_data( self, response ):
        advert = response.xpath( '//*[@class="ad-listing"]')
        title = advert.xpath( './/*[@class="listing-headline"]//h3/text()' ).extract()
        year = advert.xpath( './/*[@class="listing-headline"]//h3/text()' ).extract()
        price = advert.xpath( './/*[@class="price"]/text()' ).extract()
        mileage = advert.xpath( './/*[contains(@class, "flaticon solid gauge-1")]/following-sibling::text()' ).extract()
        mileage = [item.strip() for item in mileage]
        mileage = [item.replace(',','') for item in mileage]
        mileage = [item.replace(' miles','') for item in mileage]
        timestamp = str(date.today()).split('.')[0] 
        timestamps = [timestamp for i in range(len(title))]
        model = response.xpath('//head/title/text()').extract()
        model = [item.replace("Used ","") for item in model]
        model = [item.replace(" cars for sale with PistonHeads","") for item in model]
        models = [model for i in range(len(title))]
        transmission = advert.xpath('.//*[contains(@class, "flaticon solid location-pin-4")]/following-sibling::text()').extract()
        transmission = [item.strip() for item in transmission]
        link = advert.xpath( './/*[@class="listing-headline"]/a/@href' ).extract()
        link = ['https:\\www.pistonheads.com' + i for i in link]

        for item in zip(timestamps,link,models,title,year,price,mileage,transmission):
            price_data = {
                    'timestamp' : item[0],
                    'link' :item[1],
                    'model' : item[2],
                    'title' : item[3],
                    'year' : year[4],
                    'price' : item[5],
                    'mileage' : item[6],
                    'transmission' :item[7]

            }
            yield price_data 

1 answer
Answer 1 · posted 2024-05-23 23:38:15
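  1. You have 'year' : year[4], which indexes the full year list instead of the zipped tuple item, so yes, it gives you the same value on every row. It should be item[4].

  2. You have 70 transmission values but 73 items. zip pairs the lists by position and stops at the shortest one, so transmissions get attached to the wrong items. I suggest doing it this way instead: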

import scrapy
from datetime import date


class SuperScraper(scrapy.Spider):
    name = 'ss22'

    def start_requests(self):
        url = 'https://www.pistonheads.com/classifieds?Category=used-cars&M=1044&ResultsPerPage=750'
        yield scrapy.Request(url, callback=self.parse_data)

    def parse_data(self, response):
        # The model comes from the page title, so it is the same for every advert.
        model = response.xpath('//head/title/text()').get('')
        model = model.replace("Used ", "").replace(" cars for sale with PistonHeads", "")
        # Query each advert on its own: a missing field yields '' instead of shifting later rows.
        for row in response.xpath('//*[@class="ad-listing"]'):
            transmission = row.xpath('.//*[contains(@class, "flaticon solid location-pin-4")]/following-sibling::text()').get('')
            mileage = row.xpath('.//*[contains(@class, "flaticon solid gauge-1")]/following-sibling::text()').get('')
            price_data = {
                'timestamp': date.today().isoformat(),
                'link': 'https://www.pistonheads.com' + row.xpath('.//*[@class="listing-headline"]/a/@href').get(''),
                'model': model,
                'title': row.xpath('.//*[@class="listing-headline"]//h3/text()').get('').strip(),
                # Same XPath as 'title', as in the question, so this holds the headline text.
                'year': row.xpath('.//*[@class="listing-headline"]//h3/text()').get(''),
                'price': row.xpath('.//*[@class="price"]/text()').get('').strip(),
                'mileage': mileage.replace(',', '').replace(' miles', '').strip(),
                'transmission': transmission.strip(),
            }
            yield price_data

Here we iterate advert by advert, so we always know whether a transmission value is present for that particular advert.
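To see why the flat lists go wrong, here is a minimal sketch in plain Python with no Scrapy involved. The adverts list and its values are made up for illustration and only stand in for what the XPath queries would return. It compares page-wide extraction followed by zip with per-advert extraction:

```python
# Hypothetical toy data standing in for three adverts; the middle one
# has no transmission field, like some adverts on the real page.
adverts = [
    {"title": "Car A", "transmission": "Manual"},
    {"title": "Car B"},
    {"title": "Car C", "transmission": "Automatic"},
]

# Page-wide extraction: one flat list per field. A missing value leaves
# no placeholder behind, so the transmissions list is one entry shorter.
titles = [ad["title"] for ad in adverts]
transmissions = [ad["transmission"] for ad in adverts if "transmission" in ad]

# zip() pairs by position and stops at the shortest list, so "Car B" gets
# the value that belongs to "Car C", and "Car C" is dropped entirely.
flat_rows = list(zip(titles, transmissions))

# Per-advert extraction: a missing field becomes an empty string and the
# remaining rows stay aligned.
per_row = [(ad["title"], ad.get("transmission", "")) for ad in adverts]

print(flat_rows)
print(per_row)
```

The first print shows two rows with "Automatic" attached to Car B. The second shows all three rows with an empty transmission for Car B, which is the behaviour the per-row loop in the spider gives you.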
