I got a URL, https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions, from BeautifulSoup:
url=u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
I want to pass it back to urllib2.urlopen, but I get this error:
UnicodeEncodeError: 'gbk' codec can't encode character u'\xae' in position 43: illegal multibyte sequence
So I tried:
source = urllib2.urlopen(url.encode("utf-8")).read()
This fetches a page source, but it is not the same as the one returned for the original URL:
originalUrl = 'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions'
originalSource = urllib2.urlopen(originalUrl).read()
originalSource == source
The result is False. Is there a way to fix this URL? How can I convert u'\xae' back to the original ®?
The URL must be a valid bytestring, with any non-ASCII code points correctly encoded. Encode the URL to UTF-8, then URL-quote its path.
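The encode-then-quote step could look roughly like the sketch below. It is not the answer's original code. The helper name quote_url_path is hypothetical, and the try/except import is an assumption added so the same snippet runs on Python 2 (as in the question) and Python 3. It assumes the path does not already contain percent-escapes, because quote would escape the % sign again.

```python
try:  # Python 2, as used in the question
    from urllib import quote
    from urlparse import urlsplit, urlunsplit
except ImportError:  # Python 3 equivalents
    from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Hypothetical helper: percent-encode the path of a unicode URL as UTF-8."""
    parts = urlsplit(url)
    # Encode the path to UTF-8 bytes, then percent-encode every byte that is
    # not safe in a URL path ('/' is left alone by default).
    quoted_path = quote(parts.path.encode('utf-8'))
    # Reassemble the URL with only the path component changed.
    return urlunsplit((parts.scheme, parts.netloc, quoted_path,
                       parts.query, parts.fragment))


url = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
fixed = quote_url_path(url)
print(fixed)
```

The ® character (U+00AE) is the two bytes C2 AE in UTF-8, so it appears as %C2%AE in the result, and the resulting ASCII-only URL can then be passed to urllib2.urlopen.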