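I am trying to scrape data from a website, but for some URLs the make and model are two words each.
For a Honda Civic I get:
make = honda
model = civic
For a Land Rover I get:
make = land
model = rover
whereas it should be:
make = landrover
model = rangerover
This is what I tried.
The file scala.txt contains these URLs: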
https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208
https://www.redbook.com.au/cars/details/2019-holden-astra-rs-black-edition-bk-auto-my19/SPOT-ITM-524534
http://www.redbook.com.au/cars/research/used/details/2014-land-rover-range-rover-evoque-ed4-pure-tech-manual-my15/SPOT-ITM-410126
http://www.redbook.com.au/cars/research/used/details/2014-land-rover-range-rover-evoque-sd4-pure-tech-auto-4x4-my15/SPOT-ITM-410136
import requests
from lxml import html

cars = []

# scala.txt holds one URL per line
with open('scala.txt') as f:
    urls = f.read().splitlines()

headers = {'User-Agent': 'Mozilla/5.0'}

for url in urls:
    car_data = {}
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    car_data['url'] = url
    # xpath() returns a list; make sure it is non-empty before indexing
    titles = tree.xpath('//h1[@class="details-title"]/text()')
    if titles:
        full_car_name = titles[0]
        car_data['naming'] = full_car_name
        print(full_car_name)
        car_data['id'] = url.split("SPOT-ITM-")[1].replace("/", "")
        # Splitting on spaces takes exactly one word each for make and model
        car_data['year'] = full_car_name.split(" ")[0]
        car_data['make'] = full_car_name.split(" ")[1]
        car_data['model'] = full_car_name.split(" ")[2]
    cars.append(car_data)
This works for the first two URLs, but from the third URL on the make and model consist of several words.
Output:
{'id': '524208',
'make': 'Honda',
'model': 'Civic',
'naming': '2019 Honda Civic 50 Years Edition Auto MY19',
'url': 'https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208',
'year': '2019'}
{'id': '410136',
'make': 'Land',
'model': 'Rover',
'naming': '2014 Land Rover Range Rover Evoque SD4 Pure Tech Auto 4x4 MY15',
'url': 'http://www.redbook.com.au/cars/research/used/details/2014-land-rover-range-rover-evoque-sd4-pure-tech-auto-4x4-my15/SPOT-ITM-410136',
'year': '2014'}
For the Land Rover, make should be land rover and model should be range rover.
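Try using try/except. Some elements have no img, so when the code tries to take the image URL from index [0], there is nothing there: you are effectively asking for the first element of an empty list. Wrapping the image lookup in a skeleton try/except handles that case.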
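As a minimal sketch of that try/except skeleton: the helper name `first_or_default` is hypothetical, and the plain lists below stand in for what `tree.xpath(...)` returns (a list of matches, empty when nothing matches). The actual XPath for the image is not shown here, so none is assumed.

```python
def first_or_default(matches, default=None):
    """Return the first item of an XPath result list, or `default` if it is empty."""
    try:
        return matches[0]
    except IndexError:
        # Nothing matched, e.g. the page has no <img> element
        return default


# Stand-ins for tree.xpath(...) results (assumed values, for illustration only)
image_matches = []  # page without an image
title_matches = ['2019 Honda Civic 50 Years Edition Auto MY19']

image_url = first_or_default(image_matches)
title = first_or_default(title_matches)
```

With this in place a missing image leaves `image_url` as `None` instead of raising `IndexError` and stopping the loop.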
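Here is also some help fixing the JSON key:value pairs. You get these results because you split on whitespace. In the page text the name is 'land rover range rover', not 'landrover rangerover'. Leaving the year aside, the split therefore returns ['land', 'rover', 'range', 'rover'], and you take the elements at index 0 and 1, namely 'land' and 'rover'. If the text were 'landrover rangerover', the split would give ['landrover', 'rangerover'], and taking the elements at index 0 and 1 would work the way you want.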
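One possible way around the whitespace split, sketched under assumptions: keep a small lookup of names that contain a space and match them as prefixes before falling back to the first word. The lists `MULTI_WORD_MAKES` and `MULTI_WORD_MODELS` and the helpers `match_prefix` and `parse_car_name` are hypothetical names introduced for illustration, and the lists would need to be extended for other cars.

```python
# Hypothetical lookups of names that contain a space; extend as needed.
MULTI_WORD_MAKES = ("Land Rover", "Alfa Romeo", "Aston Martin")
MULTI_WORD_MODELS = ("Range Rover",)


def match_prefix(text, known_names):
    """Return the known multi-word name that `text` starts with, else its first word."""
    lowered = text.lower()
    for name in known_names:
        key = name.lower()
        if lowered == key or lowered.startswith(key + " "):
            return text[:len(name)]
    return text.split(" ")[0] if text else ""


def parse_car_name(full_car_name):
    """Split a title such as '2014 Land Rover Range Rover Evoque ...' into parts."""
    year, _, rest = full_car_name.partition(" ")
    make = match_prefix(rest, MULTI_WORD_MAKES)
    rest = rest[len(make):].strip()
    model = match_prefix(rest, MULTI_WORD_MODELS)
    return {"year": year, "make": make, "model": model}


land_rover = parse_car_name(
    "2014 Land Rover Range Rover Evoque SD4 Pure Tech Auto 4x4 MY15")
honda = parse_car_name("2019 Honda Civic 50 Years Edition Auto MY19")
```

If the joined lowercase form from the question is wanted, `land_rover["make"].replace(" ", "").lower()` turns "Land Rover" into "landrover".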