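I am trying to parse various data items for each advert on a page, e.g. https://www.pistonheads.com/classifieds?Category=used-cars&M=1044&ResultsPerPage=750
My code correctly captures most of the items. However, I have run into two problems:
- The output in the Year column is the same for every row, even though the xpath is exactly the same as the one used for the title column, which works correctly.
- The values for Transmission cannot be right, because not every advert has this field populated.
General comments on my code are also appreciated. Perhaps I should be using ItemLoaders for this (I have not yet learned how they work).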
import scrapy
from datetime import date

class SuperScraper(scrapy.Spider):
    name = 'ss22'

    def start_requests(self):
        urls = 'https://www.pistonheads.com/classifieds?Category=used-cars&M=1044&ResultsPerPage=750'
        yield scrapy.Request(urls, callback = self.parse_data)

    def parse_data( self, response ):
        advert = response.xpath( '//*[@class="ad-listing"]')
        title = advert.xpath( './/*[@class="listing-headline"]//h3/text()' ).extract()
        year = advert.xpath( './/*[@class="listing-headline"]//h3/text()' ).extract()
        price = advert.xpath( './/*[@class="price"]/text()' ).extract()
        mileage = advert.xpath( './/*[contains(@class, "flaticon solid gauge-1")]/following-sibling::text()' ).extract()
        mileage = [item.strip() for item in mileage]
        mileage = [item.replace(',','') for item in mileage]
        mileage = [item.replace(' miles','') for item in mileage]
        timestamp = str(date.today()).split('.')[0]
        timestamps = [timestamp for i in range(len(title))]
        model = response.xpath('//head/title/text()').extract()
        model = [item.replace("Used ","") for item in model]
        model = [item.replace(" cars for sale with PistonHeads","") for item in model]
        models = [model for i in range(len(title))]
        transmission = advert.xpath('.//*[contains(@class, "flaticon solid location-pin-4")]/following-sibling::text()').extract()
        transmission = [item.strip() for item in transmission]
        link = advert.xpath( './/*[@class="listing-headline"]/a/@href' ).extract()
        link = ['https:\\www.pistonheads.com' + i for i in link]
        for item in zip(timestamps,link,models,title,year,price,mileage,transmission):
            price_data = {
                'timestamp' : item[0],
                'link' :item[1],
                'model' : item[2],
                'title' : item[3],
                'year' : year[4],
                'price' : item[5],
                'mileage' : item[6],
                'transmission' :item[7]
            }
            yield price_data
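You have
'year' : year[4],
so yes, it always gives you the same value: year[4] indexes the whole year list instead of the current tuple, and it should be item[4].
The transmission problem is different. You have 70 transmission values but 73 adverts, so zip pairs the transmissions with the wrong adverts and stops at the end of the shortest list. I would therefore iterate over the adverts instead of zipping separate lists.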
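To see how the misalignment comes about, here is a small sketch with made-up lists. The titles and transmission values are hypothetical and not taken from the page; the only point is how zip behaves when one list is shorter because an advert has no value.

```python
# Hypothetical data: three adverts, but only two of them show a transmission.
titles = ['Car A', 'Car B', 'Car C']
transmissions = ['Manual', 'Automatic']  # 'Car B' actually has no transmission listed

# zip pairs elements purely by position and stops at the shortest input,
# so it cannot know that the second value belongs to the third advert.
rows = list(zip(titles, transmissions))

for title, transmission in rows:
    print(title, transmission)
```

Here 'Automatic' ends up attached to 'Car B', and 'Car C' is dropped entirely, which is the same effect as 70 transmissions being zipped against 73 adverts.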
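Iterating advert by advert, and extracting every field relative to that advert, means we never lose track of whether a transmission is shown for that particular advert.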