UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' - when using urllib.request in Python 3

5 votes
2 answers
4181 views
Asked 2025-04-18 00:27

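I'm writing a script that visits a list of links and parses information from each page.

The script works for most sites, but on some it fails with "UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)".

The error is raised in client.py (http.client), which Python 3's urllib uses to send the request.

The specific link that fails is: http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

There are quite a few similar posts here, but none of the answers have worked for me.

My code is: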

import socket
from urllib import request
from urllib.error import HTTPError, URLError

def __request(link, debug=0):
    try:
        # Long timeout because many requests were timing out.
        html = request.urlopen(link, timeout=35).read()
        unicode_html = html.decode('utf-8', 'ignore')

    # NOTE: except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
    except HTTPError as e:
        if debug:
            print("The server couldn't fulfill the request for " + link)
            print('Error code: ', e.code)
        return ''
    except URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('timeout')
        # Return an empty page for any URLError, not only timeouts.
        return ''
    else:
        return unicode_html

This code calls the request function:

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'

page = __request(link)

The traceback is:

Traceback (most recent call last):
  File "<string>", line 250, in run_nodebug
  File "C:\reader\get_news.py", line 276, in <module>
    main()
  File "C:\reader\get_news.py", line 255, in main
    body = get_article_body(item['link'],debug=0)
  File "C:\reader\get_news.py", line 155, in get_article_body
    page = __request('na',url)
  File "C:\reader\get_news.py", line 50, in __request
    html = request.urlopen(link, timeout=35).read()
  File "C:\Python33\Lib\urllib\request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\Lib\urllib\request.py", line 469, in open
    response = self._open(req, data)
  File "C:\Python33\Lib\urllib\request.py", line 487, in _open
    '_open', req)
  File "C:\Python33\Lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\Lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python33\Lib\urllib\request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python33\Lib\http\client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "C:\Python33\Lib\http\client.py", line 1089, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python33\Lib\http\client.py", line 953, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)

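Any help is appreciated. This is driving me crazy; I think I've tried every combination of x.decode and similar calls.

(I could simply drop the offending characters, if that is possible.)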

2 Answers

3

Your URL contains characters that cannot be represented in ASCII.
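To see where "position 13" comes from, here is a minimal sketch that rebuilds the request line by hand. It assumes http.client sends a line of the form `GET <path> HTTP/1.1` and encodes it as ASCII, which is what the `request.encode('ascii')` frame in the traceback suggests; the `path` and `request_line` names are illustrative, not taken from the library.

```python
# Illustrative reconstruction of the HTTP request line (not the library's own code).
path = '/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
request_line = 'GET %s HTTP/1.1' % path

try:
    request_line.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that ASCII cannot encode.
    error = exc

print(error)
print(request_line[error.start])
```

The index in the error message counts from the start of the request line, not from the start of the URL, which is why it points into the path.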

You need to make sure every character is properly URL-encoded, for example with urllib.parse.quote_plus, which encodes non-ASCII characters as UTF-8 before percent-escaping them.
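As a rough illustration of the difference between `quote` and `quote_plus`, the sketch below encodes a made-up path fragment (`segment` is hypothetical and contains a space on purpose). Both functions default to UTF-8; they differ in how they treat spaces and `/`, so either one is best applied to a single URL component rather than to the whole URL.

```python
from urllib.parse import quote, quote_plus

# Hypothetical fragment with a space, chosen only to show the difference.
segment = 'cafés growing'

path_style = quote(segment)       # caf%C3%A9s%20growing (space becomes %20)
form_style = quote_plus(segment)  # caf%C3%A9s+growing   (space becomes +)

# Applied to a full URL, quote_plus also escapes ':' and '/'.
whole_url = quote_plus('http://finance.yahoo.com/')

print(path_style)
print(form_style)
print(whole_url)
```

For a path such as the one in the question, `quote` is probably the safer choice, since `+` only means "space" in form-encoded query strings.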

5

Use a percent-encoded URL:

link = 'http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html'

I found the percent-encoded URL above by entering

http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

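in a browser, loading the page, and then copying the encoded URL the browser shows back into a text editor. You can also generate the percent-encoded URL programmatically: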

from urllib import parse

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'

scheme, netloc, path, query, fragment = parse.urlsplit(link)
path = parse.quote(path)
link = parse.urlunsplit((scheme, netloc, path, query, fragment))

This yields

http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html
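If the script receives a mix of already-encoded and unencoded links, the same idea can be wrapped in a small helper. This is only a sketch: `encode_url` is a hypothetical name, and keeping `%` in the `safe` set assumes that any `%` already in the link belongs to a valid escape. It does not handle non-ASCII host names.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_url(link):
    """Percent-encode the path and query of a URL, leaving existing %XX escapes alone."""
    scheme, netloc, path, query, fragment = urlsplit(link)
    # '%' is kept safe so links that are already encoded are not encoded twice.
    path = quote(path, safe='/%')
    # Keep the usual query separators; this set is an assumption, adjust as needed.
    query = quote(query, safe='=&%+')
    return urlunsplit((scheme, netloc, path, query, fragment))

raw = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
encoded = encode_url(raw)
print(encoded)
```

The result can then be passed to `request.urlopen` in place of the original link; calling `encode_url` on an already-encoded link returns it unchanged.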
