How do I get the original characters back in Python?

Asked 2025-04-18 11:38

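I'm building a personal RSS reader with lxml's etree, but I'm having trouble converting characters back to their original form. I'd like to see "World Cup 2014: With Júlio César’s Help":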

url = 'rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'
xml = etree.parse(url)
for x in xml.findall('.//item'):
    text = x.find('.//description').text
    print text
    # 'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
    text = text.encode('utf-8')
    print text
    # 'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
    text = text.decode('utf-8')
    # Error: 'UnicodeEncodeError: 'ascii' codec can't encode character....'

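I've read the Python Unicode HOWTO and Joel's introduction to Unicode, but I feel like I'm still missing something.

Edit: almost there, many thanks to unutbu... I just need help converting the \u2019: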

content = 'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
html = LH.fromstring(content)
text = html.text_content()
print text
print(repr(text))
print text.encode('utf-8')

##RESULTS##
World Cup 2014: With Júlio César\u2019s Help
u'World Cup 2014: With J\xfalio C\xe9sar\\u2019s Help'
World Cup 2014: With Júlio César\u2019s Help

Tags: text processing, lxml, unicode, character encoding, data conversion, etree, RSS reader

2 Answers

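Your string mixes Latin-1 byte escapes (such as \xfa) with a Unicode escape (such as \u2019). Because the literal has no u prefix, Python 2 leaves \u2019 as six literal characters, so a plain utf-8 or latin-1 decode does not handle the mix.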
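One possible way around this, sketched in Python 3 syntax: the standard `unicode_escape` codec decodes bytes as Latin-1 and also interprets `\uXXXX` escapes. The `raw` value below is an assumption about what the data looks like (a literal backslash followed by `u2019`), based on the repr output in the question. Note that this codec would also interpret any other backslash sequences in the text, so it only suits input known to be in this shape.

```python
# Assumed input: Latin-1 bytes plus a literal backslash-u escape sequence.
raw = b'World Cup 2014: With J\xfalio C\xe9sar\\u2019s Help'

# unicode_escape treats plain bytes as Latin-1 and expands \uXXXX escapes.
text = raw.decode('unicode_escape')

# ascii() keeps the output printable on any terminal encoding.
print(ascii(text))

# Encode only at the output boundary.
utf8_bytes = text.encode('utf-8')
```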

Answered 2025-04-18 by Python大师

I believe text is a unicode at the point where the UnicodeEncodeError occurs:

text = u'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
text = text.decode('utf-8')

This code reproduces the error message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 22: ordinal not in range(128)

In Python2, lxml sometimes returns text as str and sometimes as unicode. Indeed, if you run the following script, you will see this unfortunate behavior:

import lxml.etree as ET
import urllib2

url = 'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'
xml = ET.parse(urllib2.urlopen(url))
for x in xml.findall('.//item'):
    text = x.find('.//description').text
    print(type(text))

It prints

<type 'str'>
<type 'str'>
<type 'str'>
<type 'unicode'>
<type 'str'>
<type 'unicode'>
...

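However, it only returns a str when the text consists of plain ASCII values (that is, byte values between 0 and 127).

Although in general you should never encode a str, encoding a str made up of byte values in the range 0-127 (ASCII) with utf-8 leaves the str unchanged.

So you can handle both str and unicode by encoding either one with utf-8, as if text were always unicode.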
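A small check of why the utf-8 step is harmless for ASCII-only text. It is written for Python 3, where the text type is str and the encoded result is bytes, so it illustrates the property of the encoding rather than Python 2's str/unicode behavior. The sample strings are fragments of the headline in the question.

```python
ascii_text = 'World Cup 2014'
accented = 'J\xfalio C\xe9sar\u2019s'

# UTF-8 is a superset of ASCII: ASCII-only text encodes to the same bytes.
ascii_bytes = ascii_text.encode('utf-8')
same_as_ascii = ascii_bytes == ascii_text.encode('ascii')
print(same_as_ascii)

# Characters outside ASCII become multi-byte sequences.
accented_bytes = accented.encode('utf-8')
print(accented_bytes)
```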

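Since text is actually HTML, the code below uses lxml.html to convert the HTML to plain text content. The result can again be either str or unicode. The text object is then encoded before it is printed: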

import lxml.etree as ET
import lxml.html as LH
import urllib2

url = 'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'
xml = ET.parse(urllib2.urlopen(url))
for x in xml.findall('.//item'):
    content = x.find('.//description').text
    html = LH.fromstring(content)
    text = html.text_content()
    print(text.encode('utf-8'))

Note that in Python3, lxml always returns unicode, so things are clearer.
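For comparison, here is a rough Python 3 sketch of the same pipeline. To stay self-contained it swaps lxml for the standard library (`xml.etree.ElementTree` and `html.parser`), and `SAMPLE_RSS`, `TextExtractor`, `html_to_text` and `descriptions` are hypothetical names with an invented inline feed standing in for the NYT document. It is not the code from the answer above.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical inline feed standing in for the real RSS document.
SAMPLE_RSS = """<rss><channel>
<item><description>World Cup 2014: With J&#250;lio C&#233;sar&#8217;s Help&lt;br/&gt;</description></item>
<item><description>Plain ASCII &lt;b&gt;headline&lt;/b&gt;</description></item>
</channel></rss>"""


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


def descriptions(rss_source):
    root = ET.fromstring(rss_source)
    return [html_to_text(item.findtext("description", default=""))
            for item in root.iter("item")]


texts = descriptions(SAMPLE_RSS)
for text in texts:
    # Every value is str (text); encode only when writing bytes out.
    print(type(text).__name__, ascii(text), text.encode("utf-8"))
```

Every description comes back as str regardless of whether it contains non-ASCII characters, so the encode step is applied uniformly at the output boundary.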

How the UnicodeEncodeError happens:

text = u'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
text = text.decode('utf-8')
# Error: 'UnicodeEncodeError: 'ascii' codec can't encode character....'

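First, note that this is a UnicodeEncodeError even though you asked Python to decode text. Also note that the error message says Python is trying to use the ascii codec.

This usually indicates that the problem is related to the automatic conversion between str and unicode in Python2.

Suppose text is a unicode. If you call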

text.decode('utf-8')

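then you are asking Python to do something it should not do: decode a unicode. Python2, however, silently tries to encode the unicode with the ascii codec first and then decode the result with utf-8. This automatic conversion between str and unicode was meant to make it convenient to work with values that are all in the ASCII range, but it also causes confusion: it lets programmers forget the difference between str and unicode, and it only works while the values stay in the ASCII range. Once a value falls outside that range you get an error, which is what happened here.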
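The implicit two-step conversion can be spelled out explicitly. The sketch below is written in Python 3 and only emulates what Python 2 does behind the scenes; `py2_style_decode` is a hypothetical helper name, not a real API.

```python
text = 'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'


def py2_style_decode(value, encoding):
    """Emulate Python 2's unicode.decode(): encode with ascii, then decode."""
    return value.encode('ascii').decode(encoding)


# ASCII-only text survives the hidden encode step.
ascii_result = py2_style_decode('World Cup 2014', 'utf-8')

# The first non-ASCII character makes the hidden ascii encode fail,
# which is why a decode call reports an *encode* error.
error = None
try:
    py2_style_decode(text, 'utf-8')
except UnicodeEncodeError as exc:
    error = exc

print(type(error).__name__, error.encoding, error.start)
```

The failing position reported here is the index of the ú character, the same position 22 that appears in the error message quoted earlier.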

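In Python3 there is no automatic conversion between bytes and str (the counterparts of Python2's str and unicode, respectively). If you try to encode a bytes or decode a str, Python raises an error immediately. Things become clearer, at the cost of forcing the programmer to pay attention to data types. As this question shows, though, that cost is unavoidable even in Python2.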
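A brief Python 3 sketch of that strictness, using a fragment of the headline as sample data. In Python 3 the wrong-direction methods simply do not exist on the two types, and mixing the types raises a TypeError instead of triggering a hidden conversion.

```python
text = 'J\xfalio C\xe9sar'          # str: text
data = text.encode('utf-8')         # bytes: encoded form

# str has no decode method and bytes has no encode method.
str_can_decode = hasattr(text, 'decode')
bytes_can_encode = hasattr(data, 'encode')
print(str_can_decode, bytes_can_encode)

# Mixing the two types fails loudly rather than converting implicitly.
try:
    text + data
    mixing_error = None
except TypeError as exc:
    mixing_error = exc

# The only way across is an explicit, named conversion.
round_trip = data.decode('utf-8')
```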

Answered 2025-04-18 by Python大师
