Being a good citizen and web scraping

2 answers

User

Answer 1 · edited 2024-04-20 04:18:42

Is there possibly a way to do things incrementally

I use Scrapy's HTTP cache feature to scrape sites incrementally:

HTTPCACHE_ENABLED = True

Or you can use the new 0.14 feature, Jobs: pausing and resuming crawls.
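To make the two options above concrete, here is a minimal settings sketch. The setting names are Scrapy's own; the JOBDIR path is a made-up example, and the other values are only illustrative choices, not recommendations from the original answer.

```python
# settings.py (sketch; values are illustrative assumptions)

# Store every response on disk so repeated runs re-read cached pages
# instead of hitting the site again.
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"      # Scrapy's default cache directory name
HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached responses never expire

# Persist crawl state (pending requests, seen URLs) so a crawl can be
# paused and resumed later. The path below is hypothetical.
JOBDIR = "crawls/example-run-1"
```

JOBDIR is often passed on the command line instead, for example `scrapy crawl somespider -s JOBDIR=crawls/somespider-1`, so that each run gets its own state directory.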

or put a pause in between different requests?

Check these settings:

DOWNLOAD_DELAY    
RANDOMIZE_DOWNLOAD_DELAY
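As a rough sketch of how these two settings might look in a project, assuming a 2-second base delay (the number is an arbitrary example, pick whatever the target site can comfortably handle):

```python
# settings.py (sketch; the delay value is an assumption)

# Base wait, in seconds, between consecutive requests to the same site.
DOWNLOAD_DELAY = 2.0

# When enabled, Scrapy waits a random time between 0.5x and 1.5x
# DOWNLOAD_DELAY, so here roughly 1 to 3 seconds per request.
RANDOMIZE_DOWNLOAD_DELAY = True
```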

is there a method with Scrapy to test a crawler without placing undue stress on a site?

You can try debugging your code in the Scrapy shell.

I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?

Also, you can call scrapy.shell.inspect_response at any point in your spider.

Any advice or resources would be greatly appreciated.

The Scrapy documentation is the best resource.

User

Answer 2 · edited 2024-04-20 04:18:42

You have to start crawling and log everything. If you get banned, you can add sleep() before each page request.
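Outside of Scrapy, the "log everything and sleep between requests" idea could be sketched roughly like this. The function name `crawl_politely` and its `fetch` parameter are made up for the example; `fetch` stands in for whatever download function you actually use.

```python
import logging
import time
from typing import Callable, Dict, Iterable

logger = logging.getLogger("crawler")


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
) -> Dict[str, str]:
    """Fetch each URL in turn, logging every request and pausing between them."""
    results: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            time.sleep(delay_seconds)
        logger.info("requesting %s", url)
        try:
            results[url] = fetch(url)
        except Exception:
            # Record the failure with a traceback and move on.
            logger.exception("request failed: %s", url)
    return results
```

Inside a Scrapy project, the DOWNLOAD_DELAY setting is usually the better fit, since a blocking sleep() stalls Scrapy's event loop.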

Changing the user agent is also good practice (http://www.user-agents.org/, http://www.useragentstring.com/).
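In Scrapy the user agent is controlled by the USER_AGENT setting. A small sketch, where the bot name and contact URL are invented placeholders:

```python
# settings.py (sketch; the string below is a hypothetical example)

# Identify the crawler and give site owners a way to find out who is
# behind it. Replace the name and URL with your own.
USER_AGENT = "examplebot/1.0 (+https://example.com/bot-info)"
```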

If you get banned by IP, use a proxy to get around it. Cheers.
