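How to skip Unicode errors in URLs
I'm trying to deal with Unicode errors in Python and want to skip them. I think I need try and except to handle UnicodeError, but I don't know what to put in the except clause so that the offending URL is skipped and the scraper keeps going. Here is the traceback: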
File "imagescraper.py", line 24, in <module>
urllib.urlretrieve(image, "image0"+str(page)+str(i)+".jpg")
File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 94, in urlretrieve
return _urlopener.retrieve(url, filename, reporthook, data)
File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 228, in retrieve
url = unwrap(toBytes(url))
File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1055, in toBytes
" contains non-ASCII characters")
UnicodeError: URL u'http://blogging.com/wp-content/uploads/2013/11/design-p\xe1gina-de-fans.png' contains non-ASCII characters
Any ideas?
2 Answers
Rather than skipping the URL, encode it into a valid one:
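import urllib, urlparse
# Split the URL so that only the path component gets percent-encoded
parts = urlparse.urlsplit(image)
# Encode the path as UTF-8 bytes, then percent-quote the non-ASCII bytes
parts = parts._replace(path=urllib.quote(parts.path.encode('utf8')))
image = parts.geturl()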
This turns:
http://blogging.com/wp-content/uploads/2013/11/design-página-de-fans.png
into
http://blogging.com/wp-content/uploads/2013/11/design-p%C3%A1gina-de-fans.png
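For anyone on Python 3, where urlparse and urllib.quote moved into urllib.parse, the same idea might look roughly like the sketch below. It also shows the try/except the question asked about, as a fallback for URLs that still fail after quoting. The helper names quote_url_path and download_all, the retrieve parameter, and the "image%d.jpg" filename pattern are made up for illustration; they are not part of the original scraper.

```python
# Python 3 sketch; the snippet above targets Python 2's urllib/urlparse.
from urllib.parse import quote, urlsplit
from urllib.request import urlretrieve


def quote_url_path(url):
    """Percent-encode non-ASCII characters in the path (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() encodes str input as UTF-8 and leaves "/" untouched by default.
    parts = parts._replace(path=quote(parts.path))
    return parts.geturl()


def download_all(urls, retrieve=urlretrieve):
    """Download each URL; skip any that raise UnicodeError.

    Returns the list of skipped URLs. `retrieve` is injectable so the
    loop can be exercised without network access (an assumption made
    for this sketch).
    """
    skipped = []
    for i, url in enumerate(urls):
        try:
            retrieve(quote_url_path(url), "image%d.jpg" % i)
        except UnicodeError:
            # Only the path is quoted above, so a URL with non-ASCII
            # characters elsewhere may still fail; record it and move on.
            skipped.append(url)
            continue
    return skipped
```

Note that quote() also escapes "%", so a path that is already percent-encoded would get encoded a second time. If the scraped URLs are a mix of both, passing safe="/%" to quote() is one possible way around that.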