How do I handle ® in a URL for urllib2.urlopen?

Posted 2024-05-19 21:38:03


I have a URL, https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions, which I extracted with BeautifulSoup:

url=u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'

I want to feed it back into urllib2.urlopen.


The error I get is:

UnicodeEncodeError: 'gbk' codec can't encode character u'\xae' in position 43: illegal multibyte sequence

So I tried:

source = urllib2.urlopen(url.encode("utf-8")).read()

This does retrieve a page source, but it is not the same as what I get from the original URL:

originalUrl = 'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions'
originalSource = urllib2.urlopen(originalUrl).read()
originalSource == source

The comparison evaluates to False. Is there a way to fix this URL? How do I convert u'\xae' back to the original ®?


1 Answer
User
Reply #1 · posted 2024-05-19 21:38:03

A URL must be a valid bytestring, with any non-ASCII code points properly encoded. You need to encode the URL to UTF-8 and then URL-quote its path:

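import urllib
import urllib2
import urlparse

originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
# Encode to a UTF-8 bytestring first, then split into scheme, netloc, path, etc.
parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
# Percent-encode only the path; '/' is left untouched by default.
parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
encoded_link = parsed_link.geturl()
# The URL is now pure ASCII and safe to pass to urlopen.
source = urllib2.urlopen(encoded_link).read()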
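
For readers on Python 3, where urllib2 and urlparse no longer exist, the same approach can be sketched with urllib.parse. This is only an illustration under that assumption. The helper name quote_url_path is made up for this example and is not part of any library. It does not make a network request, it only shows what the encoded URL looks like:

```python
from urllib.parse import urlsplit, quote


def quote_url_path(url):
    """Return `url` with non-ASCII characters in its path percent-encoded.

    Hypothetical helper for illustration; the name is not a library API.
    """
    parts = urlsplit(url)
    # quote() encodes str input as UTF-8 by default and leaves '/' as is.
    quoted_path = quote(parts.path)
    # SplitResult is a namedtuple, so _replace() swaps in the new path.
    return parts._replace(path=quoted_path).geturl()


url = 'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
encoded_link = quote_url_path(url)
print(encoded_link)
```

The ® character (U+00AE) is two bytes in UTF-8, 0xC2 0xAE, so it appears in the result as %C2%AE. The resulting ASCII-only string could then be passed to urllib.request.urlopen in place of urllib2.urlopen.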

