How can I easily strip troublesome Unicode characters?


Asked 2025-04-16 13:16

Here is what I did:

>>> soup = BeautifulSoup (html)
>>> soup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
>>> 
>>> soup.find('div')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
>>> 
>>> soup.find('span')
<span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
>>>

How can I easily strip those troublesome Unicode characters out of html?
Or is there a simpler solution?


4 Answers

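First, the "troublesome" Unicode characters may well be letters in some language. But if you don't need to care about non-English characters, you can use a Python library to convert the Unicode text to ASCII. Have a look at the answers to this question: How do I convert a file's format from Unicode to ASCII using Python?

The accepted answer there looks like a good solution (one I wasn't aware of before).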
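
For illustration, here is a minimal sketch of one common standard-library way to reduce text to ASCII. It is not necessarily the method from the linked answer. The helper name `to_ascii` is made up for this example, and the code is written for Python 3 (the same calls also work on `unicode` objects in Python 2). It decomposes accented letters with `unicodedata.normalize` and then drops whatever still cannot be encoded as ASCII.

```python
import unicodedata

def to_ascii(text):
    """Return an ASCII-only approximation of text (hypothetical helper)."""
    # NFKD splits characters such as 'é' into 'e' plus a combining accent.
    decomposed = unicodedata.normalize('NFKD', text)
    # 'ignore' silently drops anything that has no ASCII encoding,
    # including the combining accents and symbols such as '®'.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii(u'caf\xe9 \xae'))
```

Keep in mind that this is lossy: characters with no ASCII counterpart simply disappear.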

Answered 2025-04-16 by Python大师


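The error you're seeing happens because repr(soup) ends up mixing Unicode and byte strings. When the two are mixed, Python 2 implicitly converts between them using the ASCII codec, which fails as soon as a non-ASCII character is involved.

Compare: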

>>> u'1' + '©'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

And:

>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'

Here is an example using classes:

>>> class A:
...     def __repr__(self):
...         return u'copyright ©'.encode('utf-8')
... 
>>> A()
copyright ©
>>> class B:
...     def __repr__(self):
...         return u'copyright ©'
... 
>>> B()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
>>> class C:
...     def __repr__(self):
...         return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)

Something similar happens with BeautifulSoup:

>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)

To work around it, convert explicitly instead of relying on repr():

>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'
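
If what you actually need is ASCII-only markup, a possible alternative to dropping characters is to encode with the standard `xmlcharrefreplace` error handler, which turns each non-ASCII character into a numeric character reference. This is only a sketch: it works on a plain string rather than on a soup object, the helper name `to_ascii_markup` is invented for the example, and it assumes Python 3.

```python
def to_ascii_markup(text):
    """Encode text as ASCII, keeping non-ASCII characters as &#NNN; references."""
    # 'xmlcharrefreplace' is a built-in codec error handler (encoding only).
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')

markup = u'<p>\xa9</p>'
print(to_ascii_markup(markup))  # prints <p>&#169;</p>
```

A browser renders the reference as the original character, so nothing is lost, while the string itself stays safe for ASCII-only output.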

Answered 2025-04-16 by Python大师


Try this: soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
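
One caveat worth illustrating: with `decode`, the `'ignore'` handler only discards bytes that are not valid UTF-8. Correctly encoded non-ASCII characters such as © or ® are kept, so this step alone may not make the original error go away. The sketch below assumes Python 3 and uses made-up sample bytes; the second step shows how to drop non-ASCII characters as well, if that is really what you want.

```python
# Sample input: a valid UTF-8 copyright sign followed by a stray invalid byte.
raw = b'<p>\xc2\xa9 \xff</p>'

# Decoding with 'ignore' drops the invalid byte but keeps the valid '©'.
text = raw.decode('utf-8', 'ignore')

# Encoding to ASCII with 'ignore' then removes the remaining non-ASCII characters.
ascii_only = text.encode('ascii', 'ignore').decode('ascii')

print(repr(text))
print(repr(ascii_only))
```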

Answered 2025-04-16 by Python大师
