Extracting a value from a script tag with Beautiful Soup

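User

Reply 1 · edited 2024-04-25 00:16:38

I think you can use an id. I am assuming tier 1 comes right after shop in the navigation tree. Otherwise, I don't see that value in the script[type="application/ld+json"] tag. I do see it in an ordinary script tag (one without type="application/ld+json"), but there are a lot of regex matches for tier 1 in there.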

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')
# Select the breadcrumb element by its id and read its text.
data = soup.select_one("#bdCrumbDesktopUrls_0").text
print(data)

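User

Reply 2 · edited 2024-04-25 00:16:38

Here are the steps I used to get the output:

Use find_all and take the 10th script tag. This script tag contains the tier1Category value.
Take the script text from the first occurrence of { up to the last occurrence of ;. This gives us a proper JSON text.
Load the text with json.loads.
Work out the structure of the JSON and find the path to the tier1Category value.

Code: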

import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = BeautifulSoup(r.text, 'html.parser')
# The 10th script tag (index 9) holds the tier1Category value.
script_text = soup.find_all('script')[9].text
# Slice from the first '{' up to (not including) the last ';'.
start = script_text.index('{')
end = script_text.rindex(';')
proper_json_text = script_text[start:end]
our_json = json.loads(proper_json_text)
print(our_json['product']['results']['productInfo']['tier1Category'])

Output:

Medicines & Treatments
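
The slicing step above can be tried offline on a made-up script body. This is only a sketch: the sample string and the helper name extract_json_object are hypothetical, and the string merely mimics the shape the answer describes (an assignment that ends in a semicolon). The real page's script may look different.

```python
import json

# Hypothetical script body, shaped like `window.something = {...};`.
# It is not copied from the real page.
sample_script_text = (
    'window.__EXAMPLE_STATE__ = '
    '{"product": {"results": {"productInfo": '
    '{"tier1Category": "Medicines & Treatments"}}}};'
)


def extract_json_object(script_text):
    """Parse the text between the first '{' and the last ';'."""
    start = script_text.index('{')   # first opening brace
    end = script_text.rindex(';')    # last semicolon, excluded from the slice
    return json.loads(script_text[start:end])


data = extract_json_object(sample_script_text)
tier1 = data['product']['results']['productInfo']['tier1Category']
print(tier1)
```

Note that str.index and str.rindex raise ValueError when the character is missing, and json.loads raises json.JSONDecodeError if the slice is not valid JSON, so a script with a different layout fails loudly rather than silently.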

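User

Reply 3 · edited 2024-04-25 00:16:38

Bitto and I have a similar approach, but I would rather not rely on knowing which script contains the matching pattern, or on knowing the structure of that script.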

import json
import requests
from collections import abc
from bs4 import BeautifulSoup as bs

def nested_dict_iter(nested):
    # Recursively yield (key, value) pairs from nested mappings.
    for key, value in nested.items():
        if isinstance(value, abc.Mapping):
            yield from nested_dict_iter(value)
        else:
            yield key, value

r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')
for script in soup.find_all('script'):
    # Only parse scripts that mention the key we are after.
    if 'tier1Category' in script.text:
        text = script.text
        j = json.loads(text[text.index('{'):text.rindex(';')])
        for k, v in nested_dict_iter(j):
            if k == 'tier1Category':
                print(v)
