Encoding error when printing BeautifulSoup4 output with Python 3.3

1 vote
1 answer
1974 views
Asked 2025-04-17 15:57

Running this code:

from bs4 import BeautifulSoup
soup = BeautifulSoup (open("my.html"))
print(soup.prettify())

produces this error:

Traceback (most recent call last):
  File "soup.py", line 5, in <module>
    print(soup.prettify())
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u25ba' in position 9001: character maps to <undefined>

I then tried:

print(soup.encode('UTF-8').prettify())

But that fails too, because soup.encode('UTF-8') returns a bytes object, which has no prettify method:

Traceback (most recent call last):
  File "soup.py", line 11, in <module>
    print(soup.encode('UTF-8').prettify())
AttributeError: 'bytes' object has no attribute 'prettify'

I'm not sure how to fix this. Any suggestions would be much appreciated.

1 Answer

3

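Your (Windows) console uses the cp437 encoding, and your data contains a character (U+25BA) that cp437 cannot represent. By default, print() raises an error when that happens, but you can change how unencodable characters are handled by re-wrapping sys.stdout with a different error handler: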

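import io
import sys
from bs4 import BeautifulSoup

# Re-wrap stdout so characters cp437 cannot encode are printed as \uXXXX escapes.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='cp437', errors='backslashreplace')
with open("my.html") as f:
    soup = BeautifulSoup(f)
print(soup.prettify())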
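
To see what the error handler changes without involving the console at all, here is a minimal sketch. The sample string is made up for illustration; only the character U+25BA comes from the traceback in the question.

```python
# Hypothetical sample text containing U+25BA, the character from the traceback.
text = "play \u25ba"

# Strict encoding (the default) raises, just as print() did.
try:
    text.encode("cp437")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'backslashreplace' substitutes an escape sequence instead of raising.
escaped = text.encode("cp437", errors="backslashreplace")

# 'replace' substitutes a question mark, which loses the original character.
replaced = text.encode("cp437", errors="replace")

print(strict_failed)
print(escaped)
print(replaced)
```

The backslashreplace handler keeps the code point visible in the output, which is usually more helpful for debugging than a bare question mark.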

Alternatively, write the data to a file and open it in an editor that supports the file's encoding:

# On Windows, utf-8-sig will allow the file to be read by Notepad.
with open('out.txt', 'w', encoding='utf-8-sig') as f:
    f.write(soup.prettify())
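
As a rough illustration of what utf-8-sig does, the sketch below writes a made-up sample string (standing in for soup.prettify()) to a temporary file, checks for the byte order mark at the start, and reads the text back. The temporary path and the sample text are assumptions for the example, not part of the original code.

```python
import os
import tempfile

# Hypothetical sample standing in for soup.prettify().
text = "play \u25ba"

# Write to a throwaway file so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(text)

# The file starts with the UTF-8 byte order mark (EF BB BF).
with open(path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(b"\xef\xbb\xbf")

# Reading with utf-8-sig strips the mark again.
with open(path, encoding="utf-8-sig") as f:
    round_tripped = f.read()

print(has_bom)
print(ascii(round_tripped))
```

Reading the file back with plain utf-8 would leave a leading '\ufeff' in the string, so it is safest to use utf-8-sig on both sides.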
