Opening a UTF-16 URL with Python's urllib

Published 2024-04-20 00:41:33


I am trying to use the Google Translate API to translate text in Kannada (and therefore encoded as UTF-16) into English. I build the URL by hand, inserting my Google API key: https://www.googleapis.com/language/translate/v2?key=key#&q=

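The problem, however, is that this URL is UTF-16 encoded. When I try to open the URL with urllib, I get the error message below. Any suggestions on how to proceed, or on another way to go about this, would be much appreciated.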

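Edit: I believe the problem can be solved by calling urllib.parse.quote_plus(text), where text is the UTF-16 text, and substituting the function's return value for that text in the URL.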

Traceback (most recent call last):
  File "<pyshell#19>", line 1, in <module>
    urllib.request.urlopen(url)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 469, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 487, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 447, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 1283, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/http/client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/http/client.py", line 1089, in _send_request
    self.putrequest(method, url, **skips)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/http/client.py", line 953, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 73-79: ordinal not in range(128)

1 answer

The problem is, however, that this url is utf16 encoded

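UTF-16 is not what you think it is. It is an encoding of Unicode characters to bytes that some systems (such as the Win32 API) use internally for their string types. UTF-16 is almost never used on the web, because it is not compatible with ASCII.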
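To illustrate the ASCII incompatibility, here is a minimal sketch that encodes the same ASCII-only string both ways. The sample string "key" is only an illustration.

```python
# Compare how the ASCII text "key" is encoded in UTF-8 and in UTF-16.
text = "key"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte-order mark

print(utf8_bytes)   # same bytes as the ASCII encoding
print(utf16_bytes)  # each character is followed by a zero byte

# UTF-8 agrees with ASCII for ASCII text; UTF-16 does not.
assert utf8_bytes == text.encode("ascii")
assert utf16_bytes != text.encode("ascii")
```

A parser that expects ASCII, such as an HTTP request line parser, would trip over the zero bytes in the UTF-16 form, which is why URLs use percent-encoded UTF-8 instead.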

https://www.googleapis.com/language/translate/v2?key=key#&q=ಚಿಂಚೋಳಿ&source=kn&target=en

This is not a URI: a URI may contain only ASCII characters. It is an IRI, which may contain other Unicode characters.

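urllib, however, does not support IRIs. Some Python libraries do support IRIs directly; alternatively, you can convert any IRI into the equivalent URI, which urllib will accept. To do so, encode any non-ASCII characters in the host name with the IDNA algorithm, and encode any non-ASCII characters in the rest of the address (including query parameters) by percent-encoding their UTF-8 representation. That gives you this: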

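https://www.googleapis.com/language/translate/v2?key=key#&q=%E0%B2%9A%E0%B2%BF%E0%B2%82%E0%B2%9A%E0%B3%8B%E0%B2%B3%E0%B2%BF&source=kn&target=en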
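The conversion described above can be sketched with the standard library alone. The helper name iri_to_uri is hypothetical, and the sketch makes simplifying assumptions: it drops any user:password part of the address and treats the listed delimiter characters as already valid.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: URI delimiters plus "%" so that sequences
# that are already percent-encoded are not encoded a second time.
SAFE = "/%:@!$&'()*+,;=?~"


def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI (hypothetical helper, simplified)."""
    parts = urlsplit(iri)

    # Host name: IDNA encoding. Userinfo is ignored in this sketch.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host
    if parts.port is not None:
        netloc += ":" + str(parts.port)

    # Everything else: percent-encode the UTF-8 bytes of non-ASCII characters.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE)
    fragment = quote(parts.fragment, safe=SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


iri = ("https://www.googleapis.com/language/translate/v2"
       "?key=key#&q=ಚಿಂಚೋಳಿ&source=kn&target=en")
uri = iri_to_uri(iri)
print(uri)
uri.encode("ascii")  # no UnicodeEncodeError: the result is pure ASCII
```

quote() encodes str input as UTF-8 by default, which is what the conversion calls for; no UTF-16 step is involved anywhere.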

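The use of # here looks wrong, though. It is a client-side mechanism for passing data within the browser, and it does not work for requests to a server.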
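A quick way to see what the # does is to split the URL with urllib.parse. In this sketch the Kannada text is replaced by the ASCII placeholder "text" to keep the output easy to read.

```python
from urllib.parse import parse_qs, urlsplit

url = ("https://www.googleapis.com/language/translate/v2"
       "?key=key#&q=text&source=kn&target=en")
parts = urlsplit(url)

print(parts.query)            # key=key
print(parts.fragment)         # &q=text&source=kn&target=en
print(parse_qs(parts.query))  # {'key': ['key']}
```

Everything after the # lands in the fragment, so q, source and target are not part of the query string at all. HTTP clients generally do not send the fragment to the server.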

Normally you would do something like this:

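import urllib.parse

key = 'YOUR_API_KEY'  # placeholder: substitute your own Google API key
baseurl = 'https://www.googleapis.com/language/translate/v2'
text = 'ಚಿಂಚೋಳಿ'
# On Python 3, urlencode percent-encodes str values as UTF-8 by default
url = baseurl + '?' + urllib.parse.urlencode(dict(
    source='kn', target='en',
    q=text,
    key=key
))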
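As a sanity check, the following sketch builds the query string on Python 3 and confirms two things: the resulting URL is pure ASCII, and the text survives a round trip. The value of api_key is a placeholder, not a real key, and no request is sent.

```python
from urllib.parse import parse_qs, quote_plus, urlencode, urlsplit

text = "ಚಿಂಚೋಳಿ"
api_key = "YOUR_API_KEY"  # placeholder, not a real key

query = urlencode({"key": api_key, "q": text, "source": "kn", "target": "en"})
url = "https://www.googleapis.com/language/translate/v2?" + query

# The URL is pure ASCII, so http.client can encode the request line.
url.encode("ascii")

# urlencode quotes each value the same way quote_plus does.
assert "q=" + quote_plus(text) in url

# Decoding the query string recovers the original text.
decoded = parse_qs(urlsplit(url).query)
assert decoded["q"] == [text]
```

This also shows that the quote_plus idea from the question's edit amounts to the same encoding; urlencode simply applies it to every parameter and joins the pairs with &.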
