raise JSONDecodeError("Expecting value", s, err.value)

Posted 2024-04-20 01:03:27


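I am trying to scrape data, but my code throws an error at json.loads. When I traced the error back, I found that the elements in the loop are None, so json.loads cannot parse them.

Is there a way to fix this?

Here is my code: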

import json
from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime

start_time = datetime.now()


data = []
 
op = webdriver.ChromeOptions()
op.add_argument('--ignore-certificate-errors')
op.add_argument('--incognito')
op.add_argument('--headless')
driver = webdriver.Chrome(executable_path='D:/Desktop/Query/chromedriver.exe',options=op)
driver.get('https://www.cdiscount.com/f-1175520-MIS2008813786478.html')
link = 'https://www.cdiscount.com/f-1175520-MIS2008813786478.html'
soup = BeautifulSoup(driver.page_source, 'html.parser')
b = soup.prettify()
product_title = soup.find('title').getText()
reviews = soup.find_all("script",type="application/ld+json")
for element in reviews:
    json_string = element.getText()
    json_dict = json.loads(json_string)
    data.append(json_dict)

1 Answer
User
#1 · Posted 2024-04-20 01:03:27

You can try reading the JSON through the element's contents instead:

for element in reviews:
    json_string = ' '.join(element.contents)
    json_dict = json.loads(json_string)
    data.append(json_dict)

From the Beautiful Soup documentation:

If you only want the human-readable text inside a document or tag, you can use the get_text() method.

...

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be 'text', since those tags are not part of the human-visible content of the page.

That is why getText returns an empty string in your case, and why you need to use contents instead.
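Whichever way the script text is read, it may still help to guard json.loads against empty or missing input, since an empty string is what triggers the "Expecting value" error. The following is a minimal sketch and not part of the original answer. The helper name load_json_blocks and the sample strings are hypothetical. It assumes the raw text of each script tag has already been collected into a list of strings, for example with ' '.join(element.contents).

```python
import json


def load_json_blocks(raw_blocks):
    """Parse each raw string as JSON.

    Hypothetical helper: returns (parsed, skipped), where skipped holds
    the indexes of blocks that were None, blank, or not valid JSON.
    """
    parsed = []
    skipped = []
    for index, raw in enumerate(raw_blocks):
        # None or whitespace-only input would make json.loads fail,
        # so record the index and move on.
        if raw is None or not raw.strip():
            skipped.append(index)
            continue
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError:
            # The block had text, but it was not valid JSON.
            skipped.append(index)
    return parsed, skipped


# Sample input standing in for the text of several script tags (made up).
raw_blocks = ['{"@type": "Product", "name": "Example"}', "", None, "not json"]
parsed, skipped = load_json_blocks(raw_blocks)
print(parsed)
print(skipped)
```

Returning the skipped indexes, rather than silently dropping them, makes it easier to see which script tags on the page produced no usable JSON.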
