Problem encoding websites with Python: 'charmap' codec can't encode character '\x9f' in position

Asked 2025-04-18 16:14

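I want to build my own RSS feed reader, so I have started working on one.

The page I am testing with is this link: 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'.

It is a German site, so I chose "iso-8859-1" for decoding. Here is my code.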

import re
import time

# opener and tagFilter are defined elsewhere in the script (not shown in the question).

def main():
    counter = 0
    try:
        page = 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'
        sourceCode = opener.open(page).read().decode('iso-8859-1')
    except Exception as e:
        print(str(e))
        #print sourceCode
    try:
        titles = re.findall(r'<title>(.*?)</title>',sourceCode)
        links = re.findall(r'<link>(.*?)</link>',sourceCode)
    except Exception as e:
        print(str(e))
    rssFeeds = []
    for link in links:
        if "rss." in link:
            rssFeeds.append(link)
    for feed in rssFeeds:
        if ('html' in feed) or ('htm' in feed):
            try:
                print("Besuche " + feed+ ":")
                feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
            except Exception as e:
                print(str(e))
            content = re.findall(r'<p>(.*?)</p>', feedSource)
            try:
                tempTxt = open("feed" + str(counter)+".txt", "w")
                for line in content:
                    tempTxt.write(tagFilter(line))
            except Exception as e:
                print(str(e))
            finally:
                tempTxt.close()
                counter += 1
                time.sleep(10)

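First I open the site mentioned above; so far that works without any problem.
After decoding the page, I search it for the contents of all link tags.
Then I pick the links that contain "rss" and store them in a new list.
Next I go through that list, open each link, and look for its content.

This is where the problem starts. I decode those pages, which are still German sites, but I get errors like these:

'charmap' codec can't encode character '\x9f' in position 339.
'charmap' codec can't encode character '\x9c' in position 43.
'charmap' codec can't encode character '\x80' in position 131.

I really have no idea why this happens. The data collected before the error occurs is written to a text file.

Example of the collected data:

Log in to heise online Top topics: After Google released a 64-bit beta of the Chrome browser at the beginning of this month, the internet giant is now turning its attention to OS X as well. Testers report that Google is automatically rolling out the 64-bit version through its Canary/Dev channels, provided the user has a compatible computer.

I hope someone can help me. Any hints or information that would help me build my own RSS feed reader are also very welcome.

Regards, Templum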

network-programming rss encoding web-scraping link-extraction website-parsing iso-8859-1 text-file-writing

1 Answer

Following the comments by miko and Wooble:

iso-8859-1 should be changed to utf-8, because the XML that comes back declares its encoding as UTF-8:

In [71]: sourceCode = opener.open(page).read()

In [72]: sourceCode[:100]
Out[72]: "<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet type='text/xsl' href='http://heise.de.feedspo"
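
A minimal sketch of both points, assuming Python 3: read the encoding from the XML declaration instead of guessing it, and see what goes wrong when UTF-8 bytes are decoded as iso-8859-1. The helper declared_encoding and the sample bytes are made up for illustration, and cp1252 is only an assumed stand-in for the Windows default encoding, which the question does not state.

```python
import re


def declared_encoding(raw, default="utf-8"):
    """Return the encoding named in the XML declaration, or `default`."""
    match = re.match(br"""<\?xml[^>]*?encoding=["']([A-Za-z0-9._-]+)["']""", raw)
    return match.group(1).decode("ascii") if match else default


# Hypothetical sample: a UTF-8 encoded document containing a German sharp s.
raw = "<?xml version='1.0' encoding='UTF-8'?><title>Straße</title>".encode("utf-8")

encoding = declared_encoding(raw)
right = raw.decode(encoding)        # decoded with the declared encoding
wrong = raw.decode("iso-8859-1")    # the two UTF-8 bytes of "ß" become "Ã" and "\x9f"

try:
    # Assumption: cp1252 stands in for the platform default used when writing.
    wrong.encode("cp1252")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(encoding)
print(error_message)
```

Decoding with iso-8859-1 never fails, because every byte maps to some character. That is why the mistake only shows up later, when the text is encoded again for a file or the console.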

Also, you really should use an XML parser such as lxml or BeautifulSoup to parse XML. Relying on the re module alone is more error-prone.
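
As a rough sketch of the parser approach, the example below uses xml.etree.ElementTree from the standard library instead of lxml or BeautifulSoup, purely so that it runs without extra installs. The sample feed and the function name item_links are invented for illustration and assume a plain RSS 2.0 layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for the downloaded feed bytes.
SAMPLE_FEED = b"""<?xml version='1.0' encoding='UTF-8'?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.invalid/</link>
    <item>
      <title>First item</title>
      <link>http://example.invalid/rss.first.html</link>
    </item>
    <item>
      <title>Second item</title>
      <link>http://example.invalid/other.html</link>
    </item>
  </channel>
</rss>"""


def item_links(feed_bytes):
    """Return (title, link) pairs for every <item> in the feed."""
    # Passing bytes lets the parser honour the encoding in the XML declaration.
    root = ET.fromstring(feed_bytes)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


entries = item_links(SAMPLE_FEED)
for title, link in entries:
    print(title, link)
```

Unlike the regular expressions in the question, this picks up only the title and link of each item, not the channel's own title and link, and it leaves the decoding to the parser.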

feedSource is of type unicode, because it is the result of a decode:

        feedSource = opener.open(feed).read().decode("utf-8","replace")

So line is unicode as well:

    content = re.findall(r'<p>(.*?)</p>', feedSource)
    for line in content:
        ...

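tempTxt is a plain file handle opened without an explicit encoding (unlike a file opened with io.open and an encoding argument). In Python 2 such a handle expects bytes (str), not unicode; in Python 3 it is a text file that encodes with the platform default, which on Windows is typically a 'charmap' code page such as cp1252.

So line has to be encoded before it is written to the file (in Python 3 the file must then be opened in binary mode, "wb"):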

        for line in content:
            tempTxt.write(line.encode('utf-8'))

Alternatively, create tempTxt with io.open and specify the encoding:

import io
with io.open(filename, "w", encoding='utf-8') as tempTxt:
    for line in content:
        tempTxt.write(line)
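
Here is a self-contained version of the same idea that can be run as is. The sample lines and the temporary file location are assumptions made for the sketch; the real script would write the text extracted from each feed.

```python
import io
import os
import tempfile

# Hypothetical extracted text containing non-ASCII characters.
lines = [u"Größe", u"Übersicht", u"Straße"]

# Write into a throwaway directory so the example leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), "feed0.txt")

# The explicit encoding makes the result independent of the platform default.
with io.open(path, "w", encoding="utf-8") as temp_txt:
    for line in lines:
        temp_txt.write(line + u"\n")

# Read the file back with the same encoding to confirm the round trip.
with io.open(path, "r", encoding="utf-8") as temp_txt:
    read_back = temp_txt.read().splitlines()

print(read_back == lines)
```

The with statement also closes the file when an exception occurs, so the try/finally around tempTxt.close() in the question is no longer needed.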

By the way, catching every exception is not a good idea unless you are prepared to handle every one of them:

    except Exception as e:
        print(str(e))

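Also, if all you do is print the error message, Python carries on with the code that follows, even though variables assigned inside the try block may never have been defined. For example,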

    try:
        print("Besuche " + feed+ ":")
        feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
    except Exception as e:
        print(str(e))   
    content = re.findall(r'<p>(.*?)</p>', feedSource)

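The call to re.findall uses feedSource, so if an exception occurs before feedSource has been assigned, a NameError may be raised (or, on later loop iterations, the value left over from the previous feed is silently reused).

If you want Python to skip this feed and move on to the next one, add a continue statement to the except suite: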

    except Exception as e:
        print(str(e))   
        continue
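
A small runnable sketch of that pattern follows. The fetch function is a hypothetical stand-in for opener.open(feed).read(), the URLs are invented, and catching only IOError and UnicodeDecodeError is one possible choice of specific exceptions, not the only reasonable one.

```python
import re


def fetch(url):
    """Hypothetical stand-in for opener.open(url).read(); fails for 'broken' URLs."""
    if "broken" in url:
        raise IOError("could not open " + url)
    return b"<p>Text from " + url.encode("ascii") + b"</p>"


feeds = [
    "http://example.invalid/rss.one.html",
    "http://example.invalid/rss.broken.html",
    "http://example.invalid/rss.two.html",
]

results = {}
for feed in feeds:
    try:
        feed_source = fetch(feed).decode("utf-8")
    except (IOError, UnicodeDecodeError) as exc:
        # Report the problem and skip this feed; nothing below runs for it.
        print("Skipping {}: {}".format(feed, exc))
        continue
    results[feed] = re.findall(r"<p>(.*?)</p>", feed_source)

print(sorted(results))
```

Because of the continue, feed_source is only ever used in an iteration where it was assigned successfully.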

Answered 2025-04-18 by Python大师
