Encoding error when parsing RSS with lxml

9 votes

3 answers

7821 views

Asked 2025-04-16 16:33

I want to parse a downloaded RSS feed with lxml, but I don't know how to handle this UnicodeDecodeError.

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)
tree = etree.parse(response, parser)

But I get an error:

tree   = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)

encoding-error unicode-decode rss-parsing

3 Answers

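It is usually simpler to load the string and sort out its encoding first, then hand it to lxml's fromstring, than to call lxml.etree.parse() directly, because the encoding options of the latter are harder to manage.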

This particular RSS file starts with an encoding declaration, so everything should just work:

<?xml version="1.0" encoding="utf-8"?>

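The code below shows a few variations you can use to make etree handle different encodings. You can also ask it to output a different encoding, which then appears in the declaration at the top of the output.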

import lxml.etree
import urllib2

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request).read()
print [response]
        # ['<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\xc5\x9bci...']

uresponse = response.decode("utf8")
print [uresponse]    
        # [u'<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\u015bci...']

tree = lxml.etree.fromstring(response)
res = lxml.etree.tostring(tree)
print [res]
        # ['<feed xmlns="http://www.w3.org/2005/Atom">\n<title>Wiadomo&#347;ci...']

lres = lxml.etree.tostring(tree, encoding="latin1")
print [lres]
        # ["<?xml version='1.0' encoding='latin1'?>\n<feed xmlns=...<title>Wiadomo&#347;ci..."]


# works because the 38 character encoding declaration is sliced off
print lxml.etree.fromstring(uresponse[38:])   

# throws ValueError(u'Unicode strings with encoding declaration are not supported.',)
print lxml.etree.fromstring(uresponse)

You can try the code here: http://scraperwiki.com/scrapers/lxml_and_encoding_declarations/edit/#
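For anyone on Python 3, here is a rough sketch of the same variations without the network call. The sample document and the names `DECLARATION`, `text` and `raw` are made up for illustration and stand in for the downloaded feed; it assumes lxml is installed.

```python
from lxml import etree

# Hypothetical stand-in for the downloaded feed: UTF-8 bytes with a declaration.
DECLARATION = '<?xml version="1.0" encoding="utf-8"?>'
text = DECLARATION + '\n<feed xmlns="http://www.w3.org/2005/Atom"><title>Wiadomości</title></feed>'
raw = text.encode("utf-8")

# Bytes input: the parser reads the declaration and decodes the content itself.
root = etree.fromstring(raw)
title = root[0].text

# Default serialization is ASCII, so non-ASCII characters become character references.
ascii_out = etree.tostring(root)

# Asking for encoding="unicode" returns a str without an XML declaration.
unicode_out = etree.tostring(root, encoding="unicode")

# str input that still carries the declaration is the case reported above as rejected.
try:
    etree.fromstring(text)
    rejected = False
except ValueError:
    rejected = True

# Dropping the declaration (and the newline after it) makes str input acceptable.
root_from_str = etree.fromstring(text[len(DECLARATION):].lstrip())
```

Slicing by `len(DECLARATION)` rather than a hard-coded 38 keeps the sketch working if the declaration is spelled differently. In practice, passing the raw bytes straight to `fromstring` avoids the issue entirely.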

Answered 2025-04-16 by Python大师

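I ran into a similar problem, and it turned out to have nothing to do with encoding. What happens is that lxml throws a completely unrelated error. In this case, the error occurs because .parse expects a filename or a URL, not a string containing the content. However, when it tries to print the error message, it chokes on the non-ASCII characters in it and ends up showing a confusing message instead. This is really unfortunate, and others have commented on it here: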

https://mailman-mail5.webfaction.com/pipermail/lxml/2009-February/004393.html

Fortunately, your problem is easy to fix. Just replace .parse with .fromstring and you should be good to go:

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)

## lxml Y U NO MAKE SENSE!!!
tree = etree.fromstring(response, parser)

I just tested this on my machine and it works fine. Hope this helps!
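To make the parse/fromstring distinction concrete, here is a minimal Python 3 sketch. The `payload` bytes are a made-up stand-in for `response.read()`, and lxml is assumed to be installed. If you do want to keep using parse, wrapping the bytes in a file-like object is one way to do it.

```python
import io
from lxml import etree

# Hypothetical stand-in for response.read(): the document itself, as bytes.
payload = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<rss><channel><title>Wiadomości</title></channel></rss>'
).encode("utf-8")

parser = etree.XMLParser(ns_clean=True, recover=True)

# fromstring() takes the document content and returns the root element.
root = etree.fromstring(payload, parser)

# parse() takes a source (filename, URL or file-like object) and returns an
# ElementTree, so in-memory content has to be wrapped first.
tree = etree.parse(io.BytesIO(payload), parser)

title_from_string = root.findtext("channel/title")
title_from_parse = tree.getroot().findtext("channel/title")
```

Both calls end up with the same data. The only difference is that parse hands back an ElementTree, so you need getroot() to reach the root element.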

Answered 2025-04-16 by Python大师


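You should probably specify the character encoding only as a last resort, since the encoding is already clear from the XML declaration at the start of the document (if not from the HTTP headers). In any case, passing an encoding to etree.XMLParser is unnecessary unless you want to override the declared encoding; so drop the encoding argument and it should work.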

Edit: OK, the problem actually seems to be with lxml. The following works, for whatever reason:

parser = etree.XMLParser(ns_clean=True, recover=True)
etree.parse('http://wiadomosci.onet.pl/kraj/rss.xml', parser)
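As a small offline illustration of the point about the declaration, the Python 3 sketch below parses a made-up ISO-8859-2 document without passing any encoding to the parser. The sample bytes are hypothetical, and it assumes lxml is installed and that the underlying libxml2 build supports ISO-8859-2, which is the usual case.

```python
from lxml import etree

# Hypothetical sample: a feed stored as ISO-8859-2 and declared as such.
declared = (
    '<?xml version="1.0" encoding="iso-8859-2"?>'
    '<rss><title>Wiadomości</title></rss>'
).encode("iso-8859-2")

# No encoding argument: the parser takes the encoding from the declaration.
parser = etree.XMLParser(ns_clean=True, recover=True)
root = etree.fromstring(declared, parser)
title = root.findtext("title")

# docinfo exposes the encoding recorded for the parsed document.
detected = root.getroottree().docinfo.encoding
```

Guessing with chardet and forcing that guess on the parser can only make things worse when the guess disagrees with a correct declaration.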

Answered 2025-04-16 by Python大师
