Scrapy request returns 403 while a plain Python 'get' request works fine
I am using Scrapy to crawl content from several websites, but they all return a 403 (Forbidden) response code. However, when I request the same sites with the 'get' call below, they all work fine:
import requests

url = "https://www.name_of_website.com/"
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
}
response = requests.get(url, headers=headers)
print(response.status_code)
These sites also load fine in the Chrome browser. I tried setting DEFAULT_REQUEST_HEADERS in Scrapy to Chrome's request headers, but that still failed.
I don't understand why Scrapy fails where a plain requests.get() succeeds. This happens with many websites. I have also tried scrapy-fake-useragent and custom middleware (settings sketched below), with no success.
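For reference, the scrapy-fake-useragent wiring I tried was roughly the standard setup from the project's README (a sketch; the priority numbers and provider list are taken from those docs and may need adjusting):

# settings.py -- typical scrapy-fake-useragent setup (sketch)
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in user-agent and retry middlewares
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
    # Enable the fake-useragent replacements
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
    "scrapy_fake_useragent.middleware.RetryUserAgentMiddleware": 401,
}

FAKEUSERAGENT_PROVIDERS = [
    "scrapy_fake_useragent.providers.FakeUserAgentProvider",
    "scrapy_fake_useragent.providers.FakerProvider",
    "scrapy_fake_useragent.providers.FixedUserAgentProvider",
]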
Any hints or solutions would be much appreciated.
I have looked at similar questions here, but they did not help, so I am hoping for fresh ideas from people experienced in this area.
Thanks
Edit (in reply to @ewoks and @Lakshmanrao Simhadri):
These are the URLs I am using for my research, together with the response codes I mentioned:
https://www.fastcompany.com/ - 403
https://www.ft.com/ - 200
https://www.theinformation.com/ - 200
https://www.pcmag.com/ - 403
https://www.thestreet.com/ - 403
None of these URLs works with Scrapy.
My Scrapy code is very simple:
import scrapy

class TheinformationSpider(scrapy.Spider):
    name = "theinformation"
    allowed_domains = ["www.theinformation.com"]
    start_urls = ["https://www.theinformation.com/"]

    def parse(self, response):
        print(response)
For now I only want to see the response code.
My updated settings are as follows:
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Linux; x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "http://www.google.com",
}
When crawling, I get the following output:
2024-03-08 15:15:54 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.theinformation.com/> (referer: http://www.google.com)
2024-03-08 15:15:54 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.theinformation.com/>: HTTP status code is not handled or not allowed
2024-03-08 15:15:54 [scrapy.core.engine] INFO: Closing spider (finished)
Total articles scrapped by "theinformation" = 0, null data = 0
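Side note: as the log shows, the HttpError spider middleware drops the 403 before parse() is ever called. To at least inspect the 403 response while debugging, something along these lines should work (a minimal sketch using Scrapy's handle_httpstatus_list; it does not fix the 403 itself):

import scrapy

class TheinformationSpider(scrapy.Spider):
    name = "theinformation"
    allowed_domains = ["www.theinformation.com"]
    start_urls = ["https://www.theinformation.com/"]
    # Let 403 responses reach parse() instead of being filtered out
    # by the HttpError spider middleware.
    handle_httpstatus_list = [403]

    def parse(self, response):
        print(response.status)
        print(response.headers.get("Server"))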
1 Answer
I tried making the Scrapy request with the exact headers copied from the Chrome browser, but it still failed. However, by going through a proxy I was able to get a response. Please take a look at the solution below and let me know what you think.
from urllib.parse import urlencode

import scrapy

# Get your own api_key from ScrapeOps or some other proxy vendor
API_KEY = "api_key"


def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url


class FastCompany(scrapy.Spider):
    name = "fastcompany"

    def start_requests(self):
        urls = ["https://www.fastcompany.com/"]
        for url in urls:
            proxy_url = get_scrapeops_url(url)
            yield scrapy.Request(url=proxy_url, callback=self.parse)

    def parse(self, response):
        print(response)
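If you would rather use a regular HTTP proxy instead of an API-style endpoint, Scrapy's built-in HttpProxyMiddleware also honors a per-request proxy meta key; a minimal sketch (the proxy address below is a placeholder):

import scrapy

class FastCompanyViaProxy(scrapy.Spider):
    name = "fastcompany_via_proxy"
    start_urls = ["https://www.fastcompany.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                # Placeholder address; Scrapy's HttpProxyMiddleware routes
                # the request through the value of the 'proxy' meta key.
                meta={"proxy": "http://user:pass@proxy.example.com:8000"},
                callback=self.parse,
            )

    def parse(self, response):
        print(response.status)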