Efficiently replacing bad characters

数据工程师

Asked 2025-04-16 21:00

I often work with UTF-8 text that contains certain special characters, such as:

\xc2\x99

\xc2\x95

\xc2\x85

and so on.

These characters confuse the other libraries I use, so I need to replace them.

Is there a more efficient way to do this than:

text.replace('\xc2\x99', ' ').replace('\xc2\x85', '...')

Tags: character encoding, utf-8, data cleaning, character replacement, special character handling

6 Answers

If you want to strip all non-ASCII characters from a string, you can use the following code:

text.encode("ascii", "ignore")
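In Python 3, `encode` returns a `bytes` object, so a decode step is needed to get text back. The following is a minimal sketch, assuming the input is already a `str`; the helper name `strip_non_ascii` is hypothetical. Keep in mind that this approach drops every non-ASCII character, including legitimate accented letters.

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range from a str."""
    # encode(..., "ignore") silently skips characters ASCII cannot represent;
    # decoding the resulting bytes turns them back into a str.
    return text.encode("ascii", "ignore").decode("ascii")


print(strip_non_ascii("Hello\x95 world\x99"))  # prints: Hello world
```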

Answered 2025-04-16 by Python大师

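I think there's an underlying problem here, and it would be better to investigate and fix it than to just paper over the symptoms.

\xc2\x95 is the UTF-8 encoding of the character U+0095, which is a C1 control character (MESSAGE WAITING). It's no surprise that your library can't handle it. But how did this character end up in your data?

One likely scenario is that it started out as byte 0x95 (a bullet) in Windows-1252, was wrongly decoded to U+0095 instead of the correct U+2022, and was then encoded as UTF-8. (The Japanese word mojibake describes this kind of error.)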
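The scenario described above can be reproduced in a few lines. This is only a sketch: it assumes the wrong decoding step used Latin-1 (ISO-8859-1), which the original post does not state, and the variable names are illustrative.

```python
raw = b"\x95"  # a bullet as stored in Windows-1252

# Wrong decoding (assumed here to be Latin-1): byte 0x95 becomes
# the control character U+0095.
wrong = raw.decode("latin-1")
# Re-encoding that control character as UTF-8 yields the bytes from the question.
mojibake = wrong.encode("utf-8")
print(mojibake)  # b'\xc2\x95'

# Correct decoding: Windows-1252 maps byte 0x95 to U+2022 BULLET.
right = raw.decode("windows-1252")
print(right.encode("utf-8"))  # b'\xe2\x80\xa2'
```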

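If this guess is right, you can recover the original character by turning it back into a Windows-1252 byte and then decoding it to Unicode correctly this time. (These examples use Python 3.3; in Python 2 the operations are somewhat different.)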

>>> b'\x95'.decode('windows-1252')
'\u2022'
>>> import unicodedata
>>> unicodedata.name(_)
'BULLET'

If you want to apply this correction to every character in the range 0x80–0x9F that is valid in Windows-1252, you can use the following:

import re

def restore_windows_1252_characters(s):
    """Replace C1 control characters in the Unicode string s by the
    characters at the corresponding code points in Windows-1252,
    where possible.

    """
    def to_windows_1252(match):
        try:
            # Reinterpret the code point as a single Windows-1252 byte.
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    # U+0080 to U+009F is the full C1 control range.
    return re.sub(r'[\u0080-\u009f]', to_windows_1252, s)

For example:

>>> restore_windows_1252_characters('\x95\x99\x85')
'•™…'
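Since the question asks about efficiency, the same mapping could also be precomputed once and applied with `str.translate`, which avoids calling a Python function for every match. This is a swapped-in technique, not the answer's own method; the names `build_c1_table`, `C1_TABLE`, and `restore_with_translate` are hypothetical, and the table covers the full C1 range U+0080 to U+009F.

```python
def build_c1_table():
    """Map each C1 control character to its Windows-1252 counterpart."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("windows-1252")
        except UnicodeDecodeError:
            # Byte is undefined in Windows-1252: None makes translate() delete it.
            table[code] = None
    return table


C1_TABLE = build_c1_table()  # built once, reused for every string


def restore_with_translate(s):
    """Apply the precomputed table to a str."""
    return s.translate(C1_TABLE)


data = b"\xc2\x95 item \xc2\x85"   # UTF-8 bytes like those in the question
text = data.decode("utf-8")         # decode first, so the work happens on str
print(restore_with_translate(text))
```

Note that the input must be decoded from UTF-8 to `str` before either version is applied; both operate on code points, not on raw bytes.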

Answered 2025-04-16 by Python大师

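You can always use a regular expression; just put all the characters you want to replace inside square brackets, like this:

import re
print(re.sub(r'[\xc2\x99]', " ", "Hello\xc2There\x99"))

This outputs 'Hello There ', with each unwanted character replaced by a space. Note that a bracketed class matches \xc2 and \x99 individually, not the two-byte sequence \xc2\x99.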
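The difference between a character class and a literal sequence matters on real UTF-8 data, because \xc2 is also the lead byte of other characters such as the pound sign. The sketch below works on `bytes` in Python 3; the sample data and variable names are made up for illustration.

```python
import re

# UTF-8 bytes for a pound sign, the digit 5, and the control character U+0099.
data = "\u00a35\u0099".encode("utf-8")   # b'\xc2\xa35\xc2\x99'

# Character class: replaces every \xc2 and every \x99 byte on its own,
# which also destroys the lead byte of the pound sign.
by_class = re.sub(b"[\xc2\x99]", b" ", data)
print(by_class)                  # b' \xa35  ' (no longer valid UTF-8)

# Literal sequence: replaces only the exact two-byte sequence \xc2\x99.
by_sequence = re.sub(b"\xc2\x99", b" ", data)
print(by_sequence.decode("utf-8"))
```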

Alternatively, if you want a different replacement for each unwanted character, you can do this:

import re

# Remove annoying characters.
# Keys are UTF-8 byte sequences and values are ASCII replacements, so this
# operates on UTF-8 encoded bytes (a plain str in Python 2).
chars = {
    b'\xc2\x82': b',',        # High code comma
    b'\xc2\x84': b',,',       # High code double comma
    b'\xc2\x85': b'...',      # Triple dot
    b'\xc2\x88': b'^',        # High caret
    b'\xc2\x91': b'\x27',     # Forward single quote
    b'\xc2\x92': b'\x27',     # Reverse single quote
    b'\xc2\x93': b'\x22',     # Forward double quote
    b'\xc2\x94': b'\x22',     # Reverse double quote
    b'\xc2\x95': b' ',
    b'\xc2\x96': b'-',        # High hyphen
    b'\xc2\x97': b'--',       # Double hyphen
    b'\xc2\x99': b' ',
    b'\xc2\xa0': b' ',
    b'\xc2\xa6': b'|',        # Split vertical bar
    b'\xc2\xab': b'<<',       # Double less than
    b'\xc2\xbb': b'>>',       # Double greater than
    b'\xc2\xbc': b'1/4',      # One quarter
    b'\xc2\xbd': b'1/2',      # One half
    b'\xc2\xbe': b'3/4',      # Three quarters
    b'\xca\xbf': b'\x27',     # c-single quote
    b'\xcc\xa8': b'',         # Modifier - under curve
    b'\xcc\xb1': b'',         # Modifier - under line
}

# One alternation over all keys, compiled once; re.escape guards against
# any byte that happens to be a regex metacharacter.
pattern = re.compile(b'|'.join(re.escape(key) for key in chars))

def replace_chars(match):
    return chars[match.group(0)]

def clean(text):
    """Replace every byte sequence listed in chars within the bytes object text."""
    return pattern.sub(replace_chars, text)
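If the text has already been decoded to `str`, a per-character mapping like the one above could also be expressed as a translation table. This is a sketch under stated assumptions: it uses a small hypothetical subset of the mapping, keyed by code point instead of by UTF-8 bytes, and the names `REPLACEMENTS` and `clean_text` are illustrative.

```python
# Hypothetical subset of the mapping, keyed by character instead of UTF-8 bytes.
# str.maketrans accepts single-character keys and replacement strings of any length.
REPLACEMENTS = str.maketrans({
    "\u0085": "...",   # control character that often stands in for an ellipsis
    "\u0095": " ",
    "\u0099": " ",
    "\u00a0": " ",     # no-break space
    "\u00bd": "1/2",   # one half
})


def clean_text(text):
    """Replace each mapped character in a str in a single pass."""
    return text.translate(REPLACEMENTS)


raw = b"\xc2\xbd cup\xc2\x85"              # UTF-8 bytes
print(clean_text(raw.decode("utf-8")))     # prints: 1/2 cup...
```

The table is built once at import time, so each call is a single pass over the string with no regular expression involved.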

Answered 2025-04-16 by Python大师
