HTTP 403 responses when using Python Scrapy
I'm using the 64-bit build of Python 2.7 from Python.org on 64-bit Windows Vista. I'm testing the following Scrapy code, which is meant to recursively crawl every page of www.whoscored.com, a football statistics site:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/"]

    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
    ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for script in scripts:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')

execute(['scrapy', 'crawl', 'goal3'])
The code runs without errors, but of the 4,623 pages crawled, 217 returned HTTP status 200, 2 returned 302, and the remaining 4,404 all returned 403. Can anyone see anything obviously wrong in the code that would cause this? Or is it an anti-scraping measure taken by the site? Is throttling the number of requests the usual way to prevent this from happening?
Thanks
2 Answers
9
I don't know whether this still works, but I needed to add the following lines to the settings.py file:
HTTPERROR_ALLOWED_CODES = [404]
USER_AGENT = 'quotesbot (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
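Note that settings.py is an ordinary Python module, so when `USER_AGENT` is assigned twice like this, only the last assignment takes effect; the browser-style string is the one Scrapy actually sends. A minimal sketch of that behavior:

```python
# settings.py is plain Python: a later assignment to the same name
# simply replaces the earlier one, so the 'quotesbot' line is dead code.
USER_AGENT = 'quotesbot (+http://www.yourdomain.com)'
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36")

print(USER_AGENT)  # only the Mozilla string survives
```

In practice you would keep only one of the two lines; presumably it is the browser-like user agent that stops the site from answering with 403.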
Hope this helps.
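On the asker's throttling question: request rate is also controlled from settings.py. A sketch of the relevant built-in Scrapy options (the values below are illustrative, not tuned for this particular site):

```python
# settings.py -- illustrative politeness/throttling options
DOWNLOAD_DELAY = 1.0                 # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # cap parallel requests per domain
AUTOTHROTTLE_ENABLED = True          # adapt the delay to the server's response latency
ROBOTSTXT_OBEY = True                # skip URLs disallowed by the site's robots.txt
```

Slowing the crawl down this way makes it less likely to trip rate-based blocking, though a site that filters on the user agent will still return 403 regardless of speed.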