Web scraping: "ERROR for site owner: Invalid domain for site key"

Posted 2024-06-01 02:49:08

I am trying to fetch the content of this URL: https://www.zillow.com/homedetails/131-Avenida-Dr-Berkeley-CA-94708/24844204_zpid/ I am using Scrapy. Here is my code:

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.zillow.com/homedetails/131-Avenida-Dr-Berkeley-CA-94708/24844204_zpid/',
    ]
    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

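I opened the scraped data (test.html) and found that it contains the message "ERROR for site owner: Invalid domain for site key" instead of the page content. I looked for a solution and tried the suggestions from the existing question "ERROR for site owner: Invalid domain for site key", but that did not solve my problem.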


1 Answer
User
#1 · Posted 2024-06-01 02:49:08

First, try this approach and see whether it works:

import scrapy

# Browser-like request headers, passed explicitly with the request below.
Headerz = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "no-cache",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "pragma": "no-cache",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "cross-site",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
}

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.zillow.com/homedetails/131-Avenida-Dr-Berkeley-CA-94708/24844204_zpid/',
    ]

    def start_requests(self):
        # start_urls is a class attribute, so it must be accessed through self.
        yield scrapy.Request(self.start_urls[0], callback=self.parse, headers=Headerz)

    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

The spider does not receive the page that a normal browser shows because it is not sending the request headers that a browser always sends.

You need to add these headers, either per request as in the code above or project-wide by updating the headers in settings.py.
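For the settings.py route, a minimal sketch might look like the following. It uses Scrapy's `USER_AGENT` and `DEFAULT_REQUEST_HEADERS` settings; the header values are simply copied from the answer's `Headerz` dict, and which subset is actually needed is an assumption. It is not guaranteed to get past the site's bot protection.

```python
# settings.py (sketch): applied to every request the project sends.

# User agent string taken from the Headerz dict in the answer.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
)

# Default headers merged into each request unless the request overrides them.
DEFAULT_REQUEST_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "no-cache",
    "pragma": "no-cache",
    "upgrade-insecure-requests": "1",
}
```

With these settings in place, the spider from the question can keep its plain `start_urls` list and does not need a custom `start_requests`.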

A better approach is to use a rotating proxy service together with a rotating user-agent library.
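The answer does not name specific packages, so the sketch below illustrates the idea with a hand-written downloader middleware instead of a third-party library. The class name `RotatingIdentityMiddleware`, the `USER_AGENTS` and `PROXIES` pools, and the proxy URLs are all hypothetical placeholders. It relies on two Scrapy conventions: a downloader middleware's `process_request(request, spider)` hook, and `request.meta["proxy"]` for choosing a proxy per request.

```python
import random

# Hypothetical pools: replace with user agents and proxies you actually control.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0 Safari/605.1.15",
]
PROXIES = [
    "http://proxy1.example.com:8000",  # placeholder address
    "http://proxy2.example.com:8000",  # placeholder address
]


class RotatingIdentityMiddleware:
    """Downloader middleware sketch: random user agent and proxy per request."""

    def process_request(self, request, spider):
        # Overwrite whatever User-Agent the request currently carries.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Scrapy's built-in proxy handling reads the proxy from request.meta.
        request.meta["proxy"] = random.choice(PROXIES)
        # Returning None lets Scrapy continue processing the request.
        return None
```

To try it, the class would be registered under `DOWNLOADER_MIDDLEWARES` in settings.py, for example `{"myproject.middlewares.RotatingIdentityMiddleware": 543}`, where the module path is an assumption about the project layout.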
