Unable to scrape the names of several products from a webpage using the requests module

Asked 2025-04-14 17:15

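I am trying to scrape the sofa names from this webpage with the requests module; my code is below. When I inspect the network activity for this request, the logic I use looks much the same as what the page itself sends, yet I always get status code 400. How can I scrape the sofa names from this webpage using the requests module?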

import json
import requests

url = "https://www.wayfair.de/graphql?hash=22125c9a747b989fb251b014c2628213%23adda13419b757ac36f2b6c49a3bcd81c%23d0e1538c8199ca3253f0ba6c6f96b840%23f905cea77203534e928de4366ee3779d%2356bc66914e917daca7b7b6fbf0a6bcbb%2385a60692d411b1b94fcaf7e769b299f7%23937b5986d523ec2cd8ea2e68291b69e4%23226dd1f891ead00134d0c77687add073%2369fcc9e7651b988309f3197763ba4579%23d878a5fca245086d6cd18dec8a364f30%2383d1f1132338d12cf533926fa527c5a7%233d54fb8f7ef35c34629aa853236e0fb2%23201c0252e9b59dc9bb102e35e1c303fe%2382032c3ee9d609c60f37d9172e2dfebe%2370ac03a548872795355b4b2d9ee86698%23c5a76122f3b3d9dddfaccf682f3c5e19%23adc79abec9f62457c811fbde536bcc59%23e5ba723f1dbb69724c115115fd923a47%23787e38482534a0d1832e19a21a36700e%239c61a681637cd2ae4bc71227fe3cf711%2348c76b09e216b294dd5eaba432f05598%23ddeed8a890c5e7e562f69812d05c1cc7"

payload = {"variables":{"categoryId":479496,"browseInput":{"sortId":189,"filters":[],"pagination":{"page":2,"itemsPerPage":48},"boostedSkus":[],"isAjax":True,"skipLoadPricingModel":False},"usePricingField":True},"_pageId":"i1g5H18AMh07TKM2idhtCw==","_isPageRequest":True}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept': 'application/json',
    'Origin': 'https://www.wayfair.de',
    'Referer': 'https://www.wayfair.de/moebel/sb0/sofas-c479496.html',
    'X-Parent-Txid': 'I/bHHmXsMEuIX3rEX/ejAg==',
    'Apollographql-Client-Name': '@wayfair/sf-ui-browse',
    'Apollographql-Client-Version': '6c7f9310e161d153c0d6b90b5f2961dae465981e',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.post(url, json=payload)
    print(res.status_code)
    print(res.json())

1 Answer


It looks like the HTML page already contains the same JSON data that the GraphQL endpoint sends, so you can get the data directly from there:

import json
import re

import requests


def get_data(page_no):
    url = f"https://www.wayfair.de/moebel/sb0/sofas-c479496.html?curpage={page_no}"

    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:123.0) Gecko/20100101 Firefox/123.0"
    }

    html_text = requests.get(url, headers=headers).text

    data = re.search(
        r'window\["WEBPACK_ENTRY_DATA"\]=({"application":.*);', html_text
    ).group(1)
    data = json.loads(data)

    # print(json.dumps(data, indent=4))

    objs = data["application"]["props"]["browse"]["browse_grid_objects"]
    return objs


for p in range(1, 3):  # <-- increase number of pages here
    objs = get_data(p)

    for o in objs:
        print(f"{o['sku']:<15} {o['product_name']}")
    print()

This prints:


...

D100155833      3-Sitzer Sofa Anease
D001649432      Sofa Dantzler
D004037217      Sofa Forsyth
D003327297      Zweisitzer Maxen
D110028633      Zweiersofa Bricyn
D100155840      3-Sitzer Sofa Anease
DOID7952        Vidaxl 3-Sitzer-Sofa Mit Hocker 180 Cm Stoff 214
D100169432      Sofa Maurizia
D003971364      Sofa Abhinaya mit Bettfunktion
VOX2313         Sofa Rodeo aus Echtleder

...
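
The regex above assumes the page always contains the WEBPACK_ENTRY_DATA assignment and that it ends with a semicolon on the same line; if the markup changes, re.search returns None and .group(1) raises an AttributeError. A slightly more defensive variation might look like the sketch below. The names MARKER, extract_entry_data and product_names are hypothetical helpers, not part of the site or of requests, and sample_html is a made-up miniature page (reusing one SKU from the output above) so the snippet runs without a network call. It assumes the key path is the same one used in get_data.

```python
import json
import re

# Marker taken from the regex in the answer above.
MARKER = re.compile(r'window\["WEBPACK_ENTRY_DATA"\]\s*=\s*')


def extract_entry_data(html_text):
    """Return the embedded JSON object as a dict, or None if it is absent."""
    match = MARKER.search(html_text)
    if match is None:
        return None
    # raw_decode parses one JSON value at the start of the string and
    # ignores whatever follows it, so no trailing ";" has to be matched.
    data, _end = json.JSONDecoder().raw_decode(html_text[match.end():])
    return data


def product_names(data):
    """Yield (sku, product_name) pairs, tolerating missing keys."""
    browse = (data or {}).get("application", {}).get("props", {}).get("browse", {})
    for obj in browse.get("browse_grid_objects", []):
        yield obj.get("sku"), obj.get("product_name")


# Hypothetical miniature page used only to exercise the helpers.
sample_html = (
    '<script>window["WEBPACK_ENTRY_DATA"]={"application":{"props":{"browse":'
    '{"browse_grid_objects":[{"sku":"D001649432","product_name":"Sofa Dantzler"}]}}}};'
    'var other = 1;</script>'
)

names = list(product_names(extract_entry_data(sample_html)))
print(names)
```

In get_data, the same helper could be called on html_text in place of the re.search / json.loads pair; a None result then signals that the page layout has changed or that the request was blocked, rather than surfacing as an AttributeError.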
