403 error when scraping a website


Asked 2025-04-18 03:01

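I am trying to scrape a website.

The page itself shows the message "Error, query failed". In the browser, I click the "Find USED" tab and then the search button to get results. That search button actually sends a request that fetches the data.

This is my spider code: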

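    # Module-level imports needed by the spider methods below.
    from scrapy.http import Request
    from scrapy.selector import Selector

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        # Session cookies copied from the browser.
        cookies = {
            'PHPSESSID': '0a94ce3bf2484d5102a047b86f5b6c17',
            '__utm': '154876456.1461047540.1397668365.1397668365.1397668365.1',
        }
        return Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        sel = Selector(response)
        print(sel)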

This is the response I get:

2014-04-16 21:04:27+0300 [XXX] DEBUG: Crawled (403) <POST http://website> (referer: None)

What am I doing wrong?

I inspected the request that is sent when the search button is clicked. Here it is:

http://www.autodealer.ae/plugins/ad/buy.php?q=used+cars+dubai



User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: PHPSESSID=0a94ce3bf2484d5102a047b86f5b6c17;
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 131

Is there a problem with my spider code? If not, how should I scrape this site?


1 Answer

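The site requires some form of authentication, and the PHPSESSID cookie you are sending in the request is fake. Your Python code should authenticate first and only then send further requests to the site.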
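As a rough illustration of "authenticate first, then request", the sketch below uses only the Python standard library rather than Scrapy. It assumes that visiting a landing page (the hypothetical `LANDING_URL`) is what makes the server issue a fresh `PHPSESSID`; the question does not confirm this, and the site may need a real login step instead. The form body is not shown in the question, so `example_field` is a placeholder and not one of the site's actual fields.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Assumption: a page that hands out a fresh session cookie. Not confirmed by the question.
LANDING_URL = "http://www.autodealer.ae/"
# Taken from the request captured in the question.
SEARCH_URL = "http://www.autodealer.ae/plugins/ad/buy.php?q=used+cars+dubai"
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) "
              "Gecko/20100101 Firefox/28.0")


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_search_request(url, form_fields):
    """Build (but do not send) a form-encoded POST resembling the browser's."""
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    headers = {
        "User-Agent": USER_AGENT,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(url, data=body, headers=headers,
                                  method="POST")


def fetch_search_results(form_fields):
    """Get a session cookie from the server first, then send the search POST."""
    opener, _jar = make_session()
    # Step 1: let the server set its own cookies instead of hard-coding one.
    opener.open(LANDING_URL, timeout=30).close()
    # Step 2: the opener resends those cookies automatically with the POST.
    request = build_search_request(SEARCH_URL, form_fields)
    with opener.open(request, timeout=30) as response:
        return response.read()


if __name__ == "__main__":
    # "example_field" is hypothetical; replace it with the real form fields
    # observed in the browser's network tab.
    print(fetch_search_results({"example_field": "value"})[:200])
```

In Scrapy the same idea would be to request the landing page first and yield the search request (for example a `FormRequest`) from its callback, so that the built-in cookie handling carries the session forward instead of a hard-coded cookie.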

-------- EDIT ----------

Sending a request to that URL results in a 403 error.

$ curl -X POST "[URL redacted]"

 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
 <html><head>
 <title>403 Forbidden</title>
 </head><body>
 <h1>Forbidden</h1>
 <p>You don't have permission to access /plugins/ad/buy.php
  on this server.</p>
 <p>Additionally, a 404 Not Found
    error was encountered while trying to use an ErrorDocument to handle the request.

 </p>
 </body></html>

Answered 2025-04-18 by Python大师
