How can I make my Scrapy spider follow an intermediate redirect page through to its destination?

Posted 2024-05-19 01:07:46


I'm trying to scrape data from the apartment links on this Chinese website.

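The problem is that every link I follow seems to go through a redirect page meant to stop me from scraping it. When I click one of the links in Chrome, such as this link, everything works: the redirect page loads very quickly and I land on the apartment description page. But when my spider runs, every response it gets is the redirect page itself (its HTML title is 跳转, which according to Google Translate means "redirect"). I want my spider to behave like a normal user would, that is, wait for the redirect page to resolve and then carry on to its destination.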
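To make the symptom checkable in code, here is a minimal sketch that recognises the redirect page by its HTML title. The helper names `page_title` and `is_redirect_page` are hypothetical, and matching on the 跳转 title is only an assumption based on the single response I have seen; the site may serve other variants.

```python
import re

# Matches the contents of the first <title> element, across line breaks.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(html):
    """Return the stripped text of the <title> element, or None if absent."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def is_redirect_page(html):
    """Assumption: the interstitial page is the one whose title starts with 跳转."""
    title = page_title(html)
    return title is not None and title.startswith("跳转")


sample = "<html><head><title>跳转...</title></head><body></body></html>"
print(is_redirect_page(sample))
```

In a callback this could be called as `is_redirect_page(response.text)` to tell the interstitial apart from a real apartment page before trying to extract anything.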

I'm a beginner at web scraping, but I have followed the Scrapy tutorial and the following threads:

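However, these teach me how to stop redirects, whereas it seems to me that my spider will never reach its destination unless it follows the redirect link. I also checked http://scraping.pro/7-ways-protect-website-scraping-bypass-protection and couldn't find anything matching the defense mechanism this Chinese website uses.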
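For context, this is roughly what "stopping redirects" looks like, as far as I understand it. The sketch below only builds the keyword arguments for a request; `request_kwargs` is a hypothetical helper, while `dont_redirect` and `handle_httpstatus_list` are the `Request.meta` keys that Scrapy's redirect and HTTP-error middlewares read.

```python
def request_kwargs(url, follow_redirects=True):
    """Build keyword arguments for scrapy.Request (hypothetical helper).

    With follow_redirects=False, the 3xx response is handed to the callback
    as-is instead of being followed by the redirect middleware.
    """
    meta = {}
    if not follow_redirects:
        meta["dont_redirect"] = True               # do not follow 3xx responses
        meta["handle_httpstatus_list"] = [301, 302]  # let the callback see them
    return {"url": url, "meta": meta}


kwargs = request_kwargs(
    "https://yz.esf.fang.com/chushou/3_369807146.htm", follow_redirects=False
)
print(kwargs["meta"])
```

Inside a spider this would be used as `scrapy.Request(**kwargs, callback=self.parse_detail)`. It shows why those threads do not help here: the callback would receive the bare 302 response rather than the apartment page.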

My spider looks like this:

import scrapy


# A plain scrapy.Spider is used rather than CrawlSpider: no crawl rules are
# defined, and CrawlSpider reserves parse() for its own rule handling.
class YangzhouSpider(scrapy.Spider):
    name = 'fangtry'
    allowed_domains = ['fang.com']
    start_urls = ['https://yz.esf.fang.com']

    def parse(self, response):
        print("This is HTML for main page : \n", response.text)

        # this should match every apartment
        all_apartments = response.xpath("//dl[@dataflag='bg']")

        # this gets link for the first apartment :
        first_apartment_link = all_apartments[0].xpath(".//h4[@class='clearfix']/a/@href").get()
        #  ------------
        #       ╰---> equals '/chushou/3_369807146.htm'

        follow_url = response.urljoin(first_apartment_link)
        # ------
        #   ╰---> equals 'https://yz.esf.fang.com/chushou/3_369807146.htm'

        yield scrapy.Request(follow_url, callback=self.parse_detail)

    def parse_detail(self, response):
        crappy_html = response.text
        print("this HTML is bad : \n", crappy_html)
        # crude debugging stop after the first detail page
        exit(420)

The HTML for the main page is fine, but crappy_html contains nothing about the page I'm interested in. The console gives me:

2019-07-22 16:34:06 [fangtry] INFO: Spider opened: fangtry
2019-07-22 16:34:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-22 16:34:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://yz.esf.fang.com> (referer: None)
This is HTML for main page : 

<!DOCTYPE html>
<html>
...
</html>

And here is the bad part:

2019-07-22 16:34:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://search.fang.com/captcha-verify/redirect?h=https://yz.esf.fang.com/chushou/3_373806230.htm> from <GET https://yz.esf.fang.com/chushou/3_373806230.htm>
2019-07-22 16:34:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://search.fang.com/captcha-verify/redirect?h=https://yz.esf.fang.com/chushou/3_373806230.htm> (referer: None)
this HTML is bad : 
 <html xmlns="http://www.w3.org/1999/xhtml" lang="UTF-8"><head>
<meta name="mobile-agent" content="format=html5;url=https://m.fang.com/news/bj.html">
<meta http-equiv="content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>跳转...</title>
...
</html>
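The log shows that the 302 points at a `captcha-verify/redirect` URL which carries the page I originally asked for in its `h` query parameter. A minimal sketch of pulling that target back out with the standard library follows; `original_target` is a hypothetical helper, and reading `h` as "the original URL" is my assumption from this one log line. It only recovers the address and does not get past whatever check the intermediate page performs.

```python
from urllib.parse import urlsplit, parse_qs


def original_target(redirect_url):
    """Return the URL carried in the 'h' query parameter, or None if absent."""
    query = parse_qs(urlsplit(redirect_url).query)
    values = query.get("h")
    return values[0] if values else None


# URL copied from the redirect log line above.
redirect_url = (
    "http://search.fang.com/captcha-verify/redirect"
    "?h=https://yz.esf.fang.com/chushou/3_373806230.htm"
)
print(original_target(redirect_url))
```

In `parse_detail`, calling `original_target(response.url)` would at least tell me which apartment page a given interstitial response was standing in for.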

So there you have it. Sorry if this has already been answered elsewhere; I did try to find a matching thread but couldn't. Any advice would be much appreciated.


0 answers

No answers yet.
