503 error when accessing Google Patents with Python

1 vote
2 answers
4202 views
Asked 2025-04-17 19:39

Earlier today I used the code below to pull some data from the Google Patents site.

import urllib2

url = 'http://www.google.com/search?tbo=p&q=ininventor:"John-Mudd"&hl=en&tbm=pts&source=lnt&tbs=ptso:us'
req = urllib2.Request(url, headers={'User-Agent' : "foobar"})

response = urllib2.urlopen(req)

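Now, when I run the same code again, I get a 503 error. I had only run it in a loop about 30 times, because I wanted to fetch all the patents held by 30 people.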

HTTPError                                 Traceback (most recent call last)
<ipython-input-4-01f83e2c218f> in <module>()
----> 1 response = urllib2.urlopen(req)

C:\Python27\lib\urllib2.pyc in urlopen(url, data, timeout)
    124     if _opener is None:
    125         _opener = build_opener()
--> 126     return _opener.open(url, data, timeout)
    127 
    128 def install_opener(opener):

C:\Python27\lib\urllib2.pyc in open(self, fullurl, data, timeout)
    404         for processor in self.process_response.get(protocol, []):
    405             meth = getattr(processor, meth_name)
--> 406             response = meth(req, response)
    407 
    408         return response

C:\Python27\lib\urllib2.pyc in http_response(self, request, response)
    517         if not (200 <= code < 300):
    518             response = self.parent.error(
--> 519                 'http', request, response, code, msg, hdrs)
    520 
    521         return response

C:\Python27\lib\urllib2.pyc in error(self, proto, *args)
    436             http_err = 0
    437         args = (dict, proto, meth_name) + args
--> 438         result = self._call_chain(*args)
    439         if result:
    440             return result

C:\Python27\lib\urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    376             func = getattr(handler, meth_name)
    377 
--> 378             result = func(*args)
    379             if result is not None:
    380                 return result

C:\Python27\lib\urllib2.pyc in http_error_302(self, req, fp, code, msg, headers)
    623         fp.close()
    624 
--> 625         return self.parent.open(new, timeout=req.timeout)
    626 
    627     http_error_301 = http_error_303 = http_error_307 = http_error_302

C:\Python27\lib\urllib2.pyc in open(self, fullurl, data, timeout)
    404         for processor in self.process_response.get(protocol, []):
    405             meth = getattr(processor, meth_name)
--> 406             response = meth(req, response)
    407 
    408         return response

C:\Python27\lib\urllib2.pyc in http_response(self, request, response)
    517         if not (200 <= code < 300):
    518             response = self.parent.error(
--> 519                 'http', request, response, code, msg, hdrs)
    520 
    521         return response

C:\Python27\lib\urllib2.pyc in error(self, proto, *args)
    442         if http_err:
    443             args = (dict, 'default', 'http_error_default') + orig_args
--> 444             return self._call_chain(*args)
    445 
    446 # XXX probably also want an abstract factory that knows when it makes

C:\Python27\lib\urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    376             func = getattr(handler, meth_name)
    377 
--> 378             result = func(*args)
    379             if result is not None:
    380                 return result

C:\Python27\lib\urllib2.pyc in http_error_default(self, req, fp, code, msg, hdrs)
    525 class HTTPDefaultErrorHandler(BaseHandler):
    526     def http_error_default(self, req, fp, code, msg, hdrs):
--> 527         raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    528 
    529 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 503: Service Unavailable

2 Answers

4

Google's Terms of Service prohibit automated queries, sad to say. It has almost certainly detected that you are "up to something."

Source: https://support.google.com/websearch/answer/86640?hl=en

1

Just a guess:

Have you checked whether the response includes a "Retry-After" header? With a 503 error, it is quite likely to be present.
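A minimal sketch of that check, written for Python 3, where `urllib.request` and `urllib.error` replace the `urllib2` module used in the question. The function name `fetch_status` and its `opener` parameter are hypothetical, added only so the example can be exercised without contacting Google. It does not get around the block; it only shows whether the server says how long to wait.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent="foobar", opener=urllib.request.urlopen):
    """Return (status_code, retry_after) for a GET request to url.

    retry_after is the raw Retry-After header value (a string) when the
    server answers with an HTTP error that carries one, otherwise None.
    fetch_status and opener are illustrative names, not part of urllib.
    """
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(req) as response:
            return response.status, None
    except urllib.error.HTTPError as err:
        # HTTPError exposes the response headers, so a 503 can be inspected
        # instead of only surfacing as a traceback.
        return err.code, err.headers.get("Retry-After")
```

Calling `fetch_status(url)` with the URL from the question would return a pair such as `(503, None)` when the header is absent. Whether Google actually sends `Retry-After` in this situation is not confirmed anywhere in this thread.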

According to RFC 2616:

14.37 Retry-After

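The Retry-After response-header field can be used with a 503 (Service Unavailable) response to indicate how long the service is expected to be unavailable to the requesting client. This field MAY also be used with any 3xx (Redirection) response to indicate the minimum time the user-agent is asked to wait before issuing the redirected request. The value of this field can be either an HTTP-date or an integer number of seconds (in decimal) after the time of the response.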

Retry-After = "Retry-After" ":" ( HTTP-date | delta-seconds )

Two examples of its use are:
Retry-After: Fri, 31 Dec 1999 23:59:59 GMT
Retry-After: 120

In the latter example, the delay is 2 minutes.
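Since the header value can take either of the two forms quoted above, a client has to handle both before sleeping. The following is a rough sketch in Python 3, not a definitive implementation; the function name `retry_after_seconds` and its `now` parameter are assumptions made for illustration, and it relies only on the standard library's `email.utils.parsedate_to_datetime` to read the HTTP-date form.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value into a delay in seconds.

    Accepts either delta-seconds ("120") or an HTTP-date
    ("Fri, 31 Dec 1999 23:59:59 GMT"). Dates in the past yield 0.
    retry_after_seconds and now are illustrative names only.
    """
    value = value.strip()
    if value.isdecimal():
        # delta-seconds form: the number is already the delay.
        return int(value)
    # HTTP-date form: compute the distance from the current time.
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        # Treat a date without zone information as UTC (an assumption).
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

The result could be passed to `time.sleep()` before the next request, although waiting does not change the fact that automated queries may be against the site's terms.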
