Processing a page with UTF-8

>>> import urllib2
>>> from urllib2 import HTTPError, URLError
>>> import BaseHTTPServer
>>> opener = urllib2.OpenerDirector()
>>> opener.add_handler(urllib2.HTTPHandler())
>>> opener.add_handler(urllib2.HTTPDefaultErrorHandler())
>>> response = opener.open('http://www.columbia.edu/~fdc/utf8/')
>>> content = response.read(700)

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape(content)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 390, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

2 answers

User

#1 · edited 2024-05-17 19:14:02

You need to decode the page from UTF-8 to Unicode first; it contains UTF-8 byte sequences right next to HTML entities (the non-breaking spaces):

>>> print h.unescape(content.decode('utf8'))
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<BASE href="http://kermit.columbia.edu">
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>UTF-8 Sampler</title>
</head>
<body bgcolor="#ffffff" text="#000000">
<h1><tt>UTF-8 SAMPLER</tt></h1>

<big><big>  ¥ · £ · € · $ · ¢ · ₡ · ₢ · ₣ · ₤ · ₥ · ₦ · ₧ · ₨ · ₩ · ₪ · ₫ · ₭ · ₮ · ₯ · &#8377</big></big>



<p>
<blockquote>
Frank da Cruz<br>
<a hre

You were confusing encoding with decoding; the content is already UTF-8 encoded, so it has to be decoded.
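For readers on Python 3, the same decode-then-unescape order can be sketched with the standard `html` module. This is only an illustration under stated assumptions: the `raw` bytes are a hypothetical stand-in for what `response.read()` returns, and `html.unescape` takes the place of the Python 2 `HTMLParser.unescape` call used in this thread.

```python
import html

# Hypothetical sample standing in for the bytes returned by response.read():
# UTF-8 encoded yen sign, middle dot and euro sign, separated by &nbsp; entities.
raw = b'\xc2\xa5&nbsp;\xc2\xb7&nbsp;\xe2\x82\xac'

text = raw.decode('utf-8')    # decoding: bytes -> text
clean = html.unescape(text)   # entities such as &nbsp; -> characters

print(clean)
```

Decoding comes first so that the entity replacement operates on text rather than on raw bytes.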

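Note that &#8377 is an error in the page itself: the ; has been omitted. An HTML5 parser or a browser will probably assume that the ; can be added and decode the reference anyway.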
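The lenient HTML5 handling of a missing ; can be observed in Python 3 (3.4 or later), where `html.unescape` is documented to follow the HTML5 rules for both valid and invalid character references. The sketch below is not the code from the original answer; the `snippet` string is a hypothetical excerpt modeled on the page.

```python
import html

# Hypothetical excerpt: one well-formed entity, one reference missing its ';'
snippet = '&nbsp;&#8377</big>'

# html.unescape accepts the unterminated numeric reference
decoded = html.unescape(snippet)

print(decoded)
```

The Python 2 `HTMLParser.unescape` used in this thread requires the trailing ;, which is why the reference survives undecoded in the output above.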

With HTMLParser, such entities must first be repaired with a regular expression:

>>> import re
>>> brokenrefs = re.compile(r'(&#x?[a-f0-9]+)\b(?!;)', re.I)
>>> print h.unescape(brokenrefs.sub(r'\1;', content.decode('utf8')))
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<BASE href="http://kermit.columbia.edu">
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>UTF-8 Sampler</title>
</head>
<body bgcolor="#ffffff" text="#000000">
<h1><tt>UTF-8 SAMPLER</tt></h1>

<big><big>  ¥ · £ · € · $ · ¢ · ₡ · ₢ · ₣ · ₤ · ₥ · ₦ · ₧ · ₨ · ₩ · ₪ · ₫ · ₭ · ₮ · ₯ · ₹</big></big>



<p>
<blockquote>
Frank da Cruz<br>
<a hre
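The repair step can also be wrapped in a small helper. This is a minimal sketch rather than the answerer's exact code: the names `BROKEN_REF` and `repair_refs` are hypothetical, the pattern separates decimal from hexadecimal references, and the `(?!;)` lookahead is an assumption added so that references that already end in ; are left alone.

```python
import re

# Numeric character reference (decimal or hex) that is not followed by ';'
BROKEN_REF = re.compile(r'(&#(?:x[0-9a-f]+|[0-9]+))\b(?!;)', re.I)

def repair_refs(text):
    """Append the missing ';' to unterminated numeric character references."""
    return BROKEN_REF.sub(r'\1;', text)

print(repair_refs('&#8377</big>'))   # unterminated reference gets a ';'
print(repair_refs('&#8377;</big>'))  # well-formed reference is unchanged
```

The repaired text can then be passed to an unescape function that insists on the terminating ;.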

User

#2 · edited 2024-05-17 19:14:02

You are misreading your output. There is no HTML encoding here; when you simply type content in the REPL, it shows the repr()-ed version of the text.

Doing print content will give you what you want:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<BASE href="http://kermit.columbia.edu">
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>UTF-8 Sampler</title>
</head>
<body bgcolor="#ffffff" text="#000000">
<h1><tt>UTF-8 SAMPLER</tt></h1>

<big><big>&nbsp;&nbsp;¥&nbsp;·&nbsp;£&nbsp;·&nbsp;€&nbsp;·&nbsp;$&nbsp;·&nbsp;¢&nbsp;·&nbsp;₡&nbsp;·&nbsp;₢&nbsp;·&nbsp;₣&nbsp;·&nbsp;₤&nbsp;·&nbsp;₥&nbsp;·&nbsp;₦&nbsp;·&nbsp;₧&nbsp;·&nbsp;₨&nbsp;·&nbsp;₩&nbsp;·&nbsp;₪&nbsp;·&nbsp;₫&nbsp;·&nbsp;₭&nbsp;·&nbsp;₮&nbsp;·&nbsp;₯&nbsp;·&nbsp;&#8377</big></big>



<p>
<blockquote>
Frank da Cruz<br>
<a hre
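The difference between the repr() view and the printed view can be illustrated in Python 3, where the closest analogue of a Python 2 byte string is a `bytes` object. The sample `text` below is hypothetical and is not taken from the page.

```python
# Hypothetical sample: yen sign, no-break space, middle dot
text = '\u00a5\u00a0\u00b7'
raw = text.encode('utf-8')   # the UTF-8 bytes as a server would send them

# What the REPL echoes for a bare expression: the escaped repr() form
print(repr(raw))

# What print shows once the bytes are decoded: the characters themselves
print(raw.decode('utf-8'))
```

The escaped form shows byte values such as \xc2\xa5, which are UTF-8 byte sequences and not HTML entities.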
