Efficiently replacing bad characters

3 Answers

User

Reply #1 · edited 2024-05-23 08:05:49

If you want to remove all non-ASCII characters from a string, you can use

text.encode("ascii", "ignore")
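On Python 3 the call above returns a bytes object rather than a str. A minimal sketch, assuming Python 3 and a hypothetical helper name strip_non_ascii (not from the original answer), that decodes the result back to text:

```python
def strip_non_ascii(text):
    # encode() with "ignore" silently drops every non-ASCII character;
    # decode() turns the remaining ASCII bytes back into a str.
    return text.encode("ascii", "ignore").decode("ascii")


print(repr(strip_non_ascii("caf\u00e9 \x95 ok")))
```

Keep in mind that this discards the characters outright, so anything meaningful they carried (accents, bullets, quotes) is lost.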

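User

Reply #2 · edited 2024-05-23 08:05:49

I think there is an underlying problem here, and it might be a good idea to investigate and solve it rather than just trying to cover up the symptoms.

\xc2\x95 is the UTF-8 encoding of the character U+0095, which is a C1 control character (MESSAGE WAITING). It's not surprising that your library can't handle it. But the question is, how did it get into your data?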
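To confirm what that byte sequence decodes to, a quick check along these lines may help. It is only a sketch assuming Python 3; the variable names are illustrative.

```python
import unicodedata

raw = b"\xc2\x95"
ch = raw.decode("utf-8")

# The two UTF-8 bytes decode to a single code point.
print(hex(ord(ch)))

# "Cc" is the Unicode general category for control characters.
print(unicodedata.category(ch))
```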

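One very likely possibility is that it started out as the character 0x95 (BULLET) in the Windows-1252 encoding, was wrongly decoded as U+0095 instead of the correct U+2022, and was then encoded as UTF-8. (The Japanese term mojibake describes this kind of mistake.)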
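The chain of events described here can be reproduced in a few lines. This sketch assumes the wrong decoder was ISO-8859-1 (Latin-1), which maps byte 0x95 straight to U+0095; the original data may have taken a different route to the same result.

```python
original = b"\x95"  # BULLET as stored in Windows-1252

# Assumption: the data was misread as Latin-1, giving the C1 control U+0095.
wrong = original.decode("latin-1")

# The intended reading gives U+2022 BULLET.
right = original.decode("windows-1252")

# Re-encoding the misread character as UTF-8 yields the bytes seen in the data.
corrupted = wrong.encode("utf-8")

print(repr(wrong), repr(right), corrupted)
```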

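If this is right, you can recover the original characters by putting them back into Windows-1252 and then decoding them to Unicode properly this time. (These examples use Python 3.3; the operations are somewhat different in Python 2.)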

>>> b'\x95'.decode('windows-1252')
'\u2022'
>>> import unicodedata
>>> unicodedata.name(_)
'BULLET'

To apply this correction to all the valid Windows-1252 characters in the range 0x80–0x9F, you can use the following:

import re

def restore_windows_1252_characters(s):
    """Replace C1 control characters in the Unicode string s by the
    characters at the corresponding code points in Windows-1252,
    where possible.

    """
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    # U+0080 to U+009F is the full C1 control range.
    return re.sub(r'[\u0080-\u009f]', to_windows_1252, s)

For example:

>>> restore_windows_1252_characters('\x95\x99\x85')
'•™…'
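The same repair can also be expressed with a precomputed translation table and str.translate instead of a regular expression. This is an alternative sketch, not the answer's method; it assumes Python 3, covers the full C1 range U+0080 to U+009F, and the names build_c1_table, C1_TABLE and restore_with_translate are hypothetical.

```python
def build_c1_table():
    """Map each C1 code point to its Windows-1252 character, or None."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("windows-1252")
        except UnicodeDecodeError:
            # Byte is undefined in Windows-1252: None makes translate() delete it.
            table[code] = None
    return table


C1_TABLE = build_c1_table()


def restore_with_translate(s):
    return s.translate(C1_TABLE)


data = b"\xc2\x95 item \xc2\x99"
print(restore_with_translate(data.decode("utf-8")))
```

Building the table once avoids a codec call per match, which may matter on large inputs.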

User

Reply #3 · edited 2024-05-23 08:05:49

There are always regular expressions; just list all the offending characters inside square brackets, like this:

import re
print(re.sub(r'[\xc2\x99]', " ", "Hello\xc2There\x99"))

This prints "Hello There " (with a trailing space where \x99 was): the unwanted characters are replaced by spaces.
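One caveat when this pattern is applied to a decoded Python 3 str: the class then matches code points, so \xc2 would also strip a legitimate "Â". A hedged sketch that targets only the C1 control range instead, with the hypothetical names C1_CONTROLS and blank_c1_controls:

```python
import re

# Matches only the C1 control code points U+0080 to U+009F.
C1_CONTROLS = re.compile(r"[\x80-\x9f]")


def blank_c1_controls(s):
    # Replace each C1 control character with a single space.
    return C1_CONTROLS.sub(" ", s)


print(repr(blank_c1_controls("Hello\x95There\x99 \u00c2ge")))
```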

Or, if each character needs a different replacement:

import re

# Replace annoying characters. Keys are UTF-8 byte sequences; bytes
# literals keep this working on both Python 2 and Python 3.
chars = {
    b'\xc2\x82': b',',        # High code comma
    b'\xc2\x84': b',,',       # High code double comma
    b'\xc2\x85': b'...',      # Triple dot
    b'\xc2\x88': b'^',        # High caret
    b'\xc2\x91': b'\x27',     # Forward single quote
    b'\xc2\x92': b'\x27',     # Reverse single quote
    b'\xc2\x93': b'\x22',     # Forward double quote
    b'\xc2\x94': b'\x22',     # Reverse double quote
    b'\xc2\x95': b' ',
    b'\xc2\x96': b'-',        # High hyphen
    b'\xc2\x97': b'--',       # Double hyphen
    b'\xc2\x99': b' ',
    b'\xc2\xa0': b' ',
    b'\xc2\xa6': b'|',        # Split vertical bar
    b'\xc2\xab': b'<<',       # Double less than
    b'\xc2\xbb': b'>>',       # Double greater than
    b'\xc2\xbc': b'1/4',      # one quarter
    b'\xc2\xbd': b'1/2',      # one half
    b'\xc2\xbe': b'3/4',      # three quarters
    b'\xca\xbf': b'\x27',     # c-single quote
    b'\xcc\xa8': b'',         # modifier - under curve
    b'\xcc\xb1': b'',         # modifier - under line
}

def replace_chars(match):
    # Look up the replacement for the matched byte sequence.
    return chars[match.group(0)]

def clean(text):
    # Illustrative wrapper; `text` must be UTF-8 encoded bytes, not a str.
    return re.sub(b'(' + b'|'.join(chars.keys()) + b')', replace_chars, text)
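If the text has already been decoded correctly to a Python 3 str, a similar per-character mapping can be written with str.maketrans and str.translate rather than a regex over byte sequences. This is a different technique from the table above, shown only as a sketch; the mapping is a small illustrative subset and the names REPLACEMENTS, TABLE and asciify_punctuation are hypothetical.

```python
# Illustrative subset: typographic characters mapped to ASCII stand-ins.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # no-break space
}

# maketrans accepts single-character keys and replacement strings of any length.
TABLE = str.maketrans(REPLACEMENTS)


def asciify_punctuation(s):
    return s.translate(TABLE)


print(asciify_punctuation("\u201cHi\u201d \u2014 ok\u2026"))
```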
