Scraping multiple pages from a website without getting blocked

2017-05-11 15:37:16 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.liberation.fr/debats/2017/05/03/pourquoi-marine-le-pen-peut-gagner-et-pourquoi-il-faut-le-dire_1566941http://www.liberation.fr/france/2017/05/05/calais-et-grande-synthe-deux-visages-des-migrations-en-france_1567534http://www.liberation.fr/elections-presidentielle-legislatives-2017/2017/05/04/a-l-etranger-un-scrutin-scrute_1567355> (referer: None)
2017-05-11 15:37:16 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.liberation.fr/debats/2017/05/03/pourquoi-marine-le-pen-peut-gagner-et-pourquoi-il-faut-le-dire_1566941http://www.liberation.fr/france/2017/05/05/calais-et-grande-synthe-deux-visages-des-migrations-en-france_1567534http://www.liberation.fr/elections-presidentielle-legislatives-2017/2017/05/04/a-l-etranger-un-scrutin-scrute_1567355>: HTTP status code is not handled or not allowed
2017-05-11 15:37:16 [scrapy.core.engine] INFO: Closing spider (finished)
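Note that the log above shows a single GET request whose URL is three article URLs run together, which is enough to produce a 404 whether or not the site is blocking anything. The spider's code is not shown, so the following is only a guess at one possible cause: if the URL list was written without commas between the string literals, Python silently joins adjacent literals into one string. The variable names here are hypothetical.

```python
# Hypothetical reconstruction: the spider's real URL list is not shown in the question.
# Adjacent string literals with no comma between them are joined into ONE string.
urls_missing_commas = [
    "http://www.liberation.fr/debats/2017/05/03/pourquoi-marine-le-pen-peut-gagner-et-pourquoi-il-faut-le-dire_1566941"
    "http://www.liberation.fr/france/2017/05/05/calais-et-grande-synthe-deux-visages-des-migrations-en-france_1567534"
    "http://www.liberation.fr/elections-presidentielle-legislatives-2017/2017/05/04/a-l-etranger-un-scrutin-scrute_1567355"
]

# With commas, the list holds three separate URLs, one request each.
urls_with_commas = [
    "http://www.liberation.fr/debats/2017/05/03/pourquoi-marine-le-pen-peut-gagner-et-pourquoi-il-faut-le-dire_1566941",
    "http://www.liberation.fr/france/2017/05/05/calais-et-grande-synthe-deux-visages-des-migrations-en-france_1567534",
    "http://www.liberation.fr/elections-presidentielle-legislatives-2017/2017/05/04/a-l-etranger-un-scrutin-scrute_1567355",
]

print(len(urls_missing_commas))  # 1
print(len(urls_with_commas))     # 3
```

If the list in the spider looks like the first one, adding the commas should make Scrapy issue three separate requests instead of one.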

1 Answer

User

#1 · Posted 2024-06-17 09:39:23

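Sites usually have a good reason for blocking scraping, so always check first whether they can provide the information through an API or some other kind of feed. In my experience they often will, provided your reason for needing the information is serious and valid.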

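Otherwise, your solution is the TOR network, which lets your requests go out under changing IP addresses (TOR switches circuits periodically or on demand, not strictly on every request). Here is a short article: https://deshmukhsuraj.wordpress.com/2015/03/08/anonymous-web-scraping-using-python-and-tor/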
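As a rough illustration of the TOR route, here is a minimal sketch using the `requests` library with a SOCKS proxy. It is not necessarily the approach taken in the linked article. It assumes a Tor client is already running on the local machine and listening on its default SOCKS port 9050, and that `requests` was installed with SOCKS support (`pip install "requests[socks]"`). The helper names `tor_proxies` and `fetch_via_tor` are made up for this example.

```python
TOR_SOCKS_HOST = "127.0.0.1"  # assumption: Tor runs on this machine
TOR_SOCKS_PORT = 9050         # assumption: Tor's default SOCKS port


def tor_proxies(host=TOR_SOCKS_HOST, port=TOR_SOCKS_PORT):
    """Build a proxies mapping that sends HTTP and HTTPS traffic through Tor."""
    # "socks5h" (rather than "socks5") asks the proxy to resolve hostnames,
    # so DNS lookups also go through Tor.
    proxy_url = f"socks5h://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_tor(url, timeout=30):
    """Download one page through the local Tor SOCKS proxy and return its text."""
    # Imported here so tor_proxies() can be used without requests installed.
    import requests  # needs: pip install "requests[socks]"

    response = requests.get(url, proxies=tor_proxies(), timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of returning an error page
    return response.text
```

Calling `fetch_via_tor("http://www.liberation.fr/")` would then fetch the page through Tor. Expect it to be noticeably slower than a direct request, and keep the request rate low either way.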
