urllib2.URLError: reading server response codes (Python)
I have a list of URLs. I want to check the server response code of each one to find broken links. I can identify server errors (500) and broken links (404), but as soon as the code hits an address that isn't a website (e.g. "notawebsite_broken.com"), it fails. I've searched everywhere but haven't found a solution... I hope someone can help.
Here is my code:
import urllib2

# List of URLs. The third URL is not a website
urls = ["http://www.google.com", "http://www.ebay.com/broken-link",
        "http://notawebsite_broken"]
# Empty list to store the output
response_codes = []
# Run "for" loop: get server response code and save results to response_codes
for url in urls:
    try:
        connection = urllib2.urlopen(url)
        response_codes.append(connection.getcode())
        connection.close()
        print url, ' - ', connection.getcode()
    except urllib2.HTTPError, e:
        response_codes.append(e.getcode())
        print url, ' - ', e.getcode()
print response_codes
The output of this code is...
http://www.google.com - 200
http://www.ebay.com/broken-link - 404
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    connection = urllib2.urlopen(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
Does anyone know how to fix this, or can point me in the right direction?
3 Answers
1
When urllib2.urlopen() cannot connect to the server at all, or cannot resolve the host's IP address, it raises a URLError rather than an HTTPError. You need to handle both urllib2.URLError and urllib2.HTTPError to cover these cases.
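A minimal sketch of that pattern (shown here with Python 3's urllib.request/urllib.error, where urllib2 was merged; in Python 2 the same structure applies with urllib2.urlopen, urllib2.HTTPError, and urllib2.URLError):

```python
import urllib.error
import urllib.request

def response_code(url):
    """Return the HTTP status code for url, or None if no HTTP
    response could be obtained at all."""
    try:
        connection = urllib.request.urlopen(url)
        code = connection.getcode()
        connection.close()
        return code
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (404, 500, ...).
        return e.getcode()
    except urllib.error.URLError as e:
        # No HTTP response at all: unresolvable hostname,
        # refused connection, unknown URL scheme, ...
        print("{} - {}".format(url, e.reason))
        return None
```

Note the ordering: HTTPError is a subclass of URLError, so the HTTPError clause must come first or it would never be reached.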
1
The urllib2 interface is a pain to work with. Many people, myself included, strongly recommend the requests package instead. One nice thing about requests is that all request-related failures derive from a single base exception class. When you use urllib2 directly, failures surface as many different exceptions: not just from urllib2 itself, but also from the socket module and possibly others (I don't remember exactly; it's a mess). In short, just use the requests library.
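For example, every requests failure can be caught with the single base class requests.exceptions.RequestException; a minimal sketch, assuming requests is installed:

```python
import requests

def response_code(url):
    """Return the status code for url, or None on any request failure."""
    try:
        return requests.get(url, timeout=5).status_code
    except requests.exceptions.RequestException as e:
        # One base class covers DNS failures, timeouts,
        # refused connections, invalid URLs, ...
        print("{} - {}".format(url, e))
        return None
```

Server error statuses such as 404 and 500 are still returned as normal responses here; only actual request failures raise.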
3
You can use the requests library:

import requests

urls = ["http://www.google.com", "http://www.ebay.com/broken-link",
        "http://notawebsite_broken"]

for u in urls:
    try:
        r = requests.get(u)
        print "{} {}".format(u, r.status_code)
    except Exception, e:
        print "{} {}".format(u, e)
http://www.google.com 200
http://www.ebay.com/broken-link 404
http://notawebsite_broken HTTPConnectionPool(host='notawebsite_broken', port=80): Max retries exceeded with url: /