What is the best way to convert a Unicode URL to ASCII (UTF-8 percent-escaped) in Python?

Asked 2025-04-15 11:19

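Is there a simple way, ideally in the standard library, to convert a URL that contains Unicode characters (in the domain name and the path, for example) into the equivalent ASCII URL? Following RFC 3986, the domain should be encoded as IDNA and the path should be percent-encoded.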

I get a UTF-8 URL from the user. If the user types http://➡.ws/♥, I get 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5' in Python. What I want is the ASCII version: 'http://xn--hgi.ws/%E2%99%A5'.
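The two encodings described above can be tried separately. The following is a minimal sketch, assuming Python 3 (the code in this thread is Python 2), where the `idna` codec is built in and `quote` lives in `urllib.parse`. The variable names are illustrative only.

```python
from urllib.parse import quote

host = '➡.ws'
path = '/♥'

# The built-in 'idna' codec converts each non-ASCII label to its "xn--" form.
ascii_host = host.encode('idna').decode('ascii')

# quote() encodes the text as UTF-8 and percent-escapes it; '/' is left alone by default.
ascii_path = quote(path)

print('http://' + ascii_host + ascii_path)
```

This only shows the building blocks. Splitting a real URL into host, path, and query is the harder part, which is what the question is about.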

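What I do at the moment is split the URL into its parts with a regular expression, IDNA-encode the domain by hand, and then encode the path and the query string separately with different urllib.quote() calls.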

# url is UTF-8 here, eg: url = u'http://➡.ws/㉌'.encode('utf-8')
match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
                 r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
if not match:
    raise BadURLException(url)
protocol, domain, port, path, query = match.groups()

try:
    domain = unicode(domain, 'utf-8')
except UnicodeDecodeError:
    return ''  # bad UTF-8 chars in domain
domain = domain.encode('idna')

if port is None:
    port = ''

path = urllib.quote(path)

if query is None:
    query = ''
else:
    query = urllib.quote(query, safe='=&?/')

url = protocol + '://' + domain + port + path + query
# url is ASCII here, eg: url = 'http://xn--hgi.ws/%E3%89%8C'

Is this correct? Are there any better suggestions? Is there a simple standard-library function that does this?

unicode ascii standard library regex url encoding idna rfc 3986 percent encoding

5 Answers

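There is some RFC 3986 URL parsing work under way (for example, as part of the Summer of Code), but as far as I know nothing has reached the standard library yet. The same goes for URI encoding: again, as far as I know, not much has happened there. So you might as well go with MizardX's elegant approach.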

Answered 2025-04-15 by Python大师

The code MizardX gave is not entirely correct. This example does not work:

example.com/folder/?page=2

Take a look at iri_to_uri() in django.utils.encoding, which converts a Unicode URL into an ASCII URL.

http://docs.djangoproject.com/en/dev/ref/unicode/
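The answer does not say why example.com/folder/?page=2 fails. One plausible reason, offered here as an assumption rather than something the answer states, is that the URL has no scheme, so the standard splitter treats the host as part of the path and an IDNA step applied to the netloc never sees it. A minimal Python 3 sketch (`urllib.parse` is the Python 3 home of `urlparse`):

```python
from urllib.parse import urlsplit

parts = urlsplit('example.com/folder/?page=2')

# Without a leading "scheme://", nothing is recognised as the network location.
print(repr(parts.scheme))  # ''
print(repr(parts.netloc))  # ''
print(parts.path)          # example.com/folder/
print(parts.query)         # page=2
```

If scheme-less input has to be supported, one option is to prepend a default scheme before splitting, though whether that is appropriate depends on where the URLs come from.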

Answered 2025-04-15 by Python大师

Code:

import urlparse, urllib

def fixurl(url):
    # turn string into unicode
    if not isinstance(url,unicode):
        url = url.decode('utf8')

    # parse it
    parsed = urlparse.urlsplit(url)

    # divide the netloc further
    userpass,at,hostport = parsed.netloc.rpartition('@')
    user,colon1,pass_ = userpass.partition(':')
    host,colon2,port = hostport.partition(':')

    # encode each component
    scheme = parsed.scheme.encode('utf8')
    user = urllib.quote(user.encode('utf8'))
    colon1 = colon1.encode('utf8')
    pass_ = urllib.quote(pass_.encode('utf8'))
    at = at.encode('utf8')
    host = host.encode('idna')
    colon2 = colon2.encode('utf8')
    port = port.encode('utf8')
    path = '/'.join(  # could be encoded slashes!
        urllib.quote(urllib.unquote(pce).encode('utf8'),'')
        for pce in parsed.path.split('/')
    )
    query = urllib.quote(urllib.unquote(parsed.query).encode('utf8'),'=&?/')
    fragment = urllib.quote(urllib.unquote(parsed.fragment).encode('utf8'))

    # put it back together
    netloc = ''.join((user,colon1,pass_,at,host,colon2,port))
    return urlparse.urlunsplit((scheme,netloc,path,query,fragment))

print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/%2F')
print fixurl(u'http://Åsa:abc123@➡.ws:81/admin')
print fixurl(u'http://➡.ws/admin')

Output:

http://xn--hgi.ws/%E2%99%A5
http://xn--hgi.ws/%E2%99%A5/%2F
http://%C3%85sa:abc123@xn--hgi.ws:81/admin
http://xn--hgi.ws/admin

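Edits:

Fixed the case of already-quoted characters in the string.
Changed urlparse/urlunparse to urlsplit/urlunsplit.
Don't encode the user and port information together with the host name. (Thanks, Jehiah)
When "@" is missing, don't treat the host/port as user/password! (Thanks, hupf)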
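The fixurl() code in this answer is Python 2 (urlparse, urllib.quote, unicode, print statements). The following is a rough Python 3 port, offered as a sketch and not as the answer author's own code. It assumes the input is either str or UTF-8 bytes, and like the Python 2 version it does not handle IPv6 literals in the host.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def fixurl(url):
    """Return an ASCII-only URL: IDNA host, percent-encoded everything else."""
    # Accept UTF-8 bytes as well as str.
    if isinstance(url, bytes):
        url = url.decode('utf-8')

    parsed = urlsplit(url)

    # Divide the netloc further into user, password, host and port.
    userpass, at, hostport = parsed.netloc.rpartition('@')
    user, colon1, password = userpass.partition(':')
    host, colon2, port = hostport.partition(':')

    # quote() encodes non-ASCII text as UTF-8 before escaping it.
    user = quote(user)
    password = quote(password)
    host = host.encode('idna').decode('ascii')

    # Unquote first so existing escapes are normalised instead of double-encoded.
    # safe='' keeps an encoded slash (%2F) inside a path segment.
    path = '/'.join(
        quote(unquote(segment), safe='')
        for segment in parsed.path.split('/')
    )
    query = quote(unquote(parsed.query), safe='=&?/')
    fragment = quote(unquote(parsed.fragment))

    # Put it back together.
    netloc = ''.join((user, colon1, password, at, host, colon2, port))
    return urlunsplit((parsed.scheme, netloc, path, query, fragment))


print(fixurl('http://➡.ws/♥'))
print(fixurl(b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/%2F'))
print(fixurl('http://Åsa:abc123@➡.ws:81/admin'))
```

The encode() calls on the scheme, colons, and port from the Python 2 version are dropped because Python 3 str values need no conversion before being joined.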

Answered 2025-04-15 by Python大师
