Removing control characters from a string in Python

69 votes

11 answers

80753 views

Asked 2025-04-16 07:53

I currently have the following code:

def removeControlCharacters(line):
    i = 0
    for c in line:
        if (c < chr(32)):
            line = line[:i - 1] + line[i+1:]
            i += 1
    return line

It does not work if more than one character needs to be removed.

Tags: string processing, data cleaning, control characters

11 Answers

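If you want a regular expression character class that matches every Unicode control character (general category Cc), you can use [\x00-\x1f\x7f-\x9f].

You can verify it like this: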

>>> import unicodedata, re, sys
>>> all_chars = [chr(i) for i in range(sys.maxunicode + 1)]  # +1 so the last code point is included
>>> control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
>>> expanded_class = ''.join(c for c in all_chars if re.match(r'[\x00-\x1f\x7f-\x9f]', c))
>>> control_chars == expanded_class
True

So to strip these control characters with re, just use:

>>> re.sub(r'[\x00-\x1f\x7f-\x9f]', '', 'abc\02de')
'abcde'
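If the substitution runs many times, it may be convenient to compile the pattern once and wrap it in a helper. The following is a minimal sketch; the names CONTROL_CHARS_RE and strip_control_chars and the replacement parameter are illustrative assumptions, not part of the answer above.

```python
import re

# Same character class as in the answer: C0 controls, DEL, and C1 controls.
CONTROL_CHARS_RE = re.compile(r'[\x00-\x1f\x7f-\x9f]')


def strip_control_chars(text, replacement=''):
    """Return text with every character matched by CONTROL_CHARS_RE replaced.

    `replacement` is a hypothetical convenience parameter: pass ' ' to keep
    word boundaries instead of gluing the neighbouring characters together.
    """
    return CONTROL_CHARS_RE.sub(replacement, text)


print(repr(strip_control_chars('abc\x02de')))        # 'abcde'
print(repr(strip_control_chars('a\tb', ' ')))        # 'a b'
```

Note that this class also matches tab, newline, and carriage return, so they are removed along with everything else.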

Answered 2025-04-16 by Python大师


You can use the str.translate method with a suitable mapping table, for example:

>>> mpa = dict.fromkeys(range(32))
>>> 'abc\02de'.translate(mpa)
'abcde'
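The table above covers only code points 0 to 31. As a sketch of how the same approach could be extended, the variant below also covers DEL and the C1 range (0x7F to 0x9F) while leaving tab, newline, and carriage return in place. The names KEEP, TABLE, and strip_controls_keep_whitespace are hypothetical, and which characters to keep is an assumption about typical text cleaning.

```python
# Characters we assume should survive the cleanup (adjust as needed).
KEEP = "\t\n\r"

# Map every C0/C1 control code point (plus DEL) to None, except those in KEEP.
# str.translate deletes any character whose code point maps to None.
TABLE = {
    cp: None
    for cp in [*range(0x00, 0x20), *range(0x7F, 0xA0)]
    if chr(cp) not in KEEP
}


def strip_controls_keep_whitespace(text):
    """Delete control characters but keep tab, newline, and carriage return."""
    return text.translate(TABLE)


print(repr(strip_controls_keep_whitespace('abc\x02de\n\x9f')))  # 'abcde\n'
```

Building the table once and reusing it avoids rebuilding the mapping on every call.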

Answered 2025-04-16 by Python大师



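Unicode has hundreds of control characters. If you are sanitizing data from the web or any other source that may contain non-ASCII characters, you will need Python's unicodedata module. Its unicodedata.category(…) function returns the Unicode category code (e.g., control character, whitespace, letter) of any character. For control characters, the category code always starts with "C": Cc for control characters proper and Cf for formatting characters.

The following code removes all control characters from a string.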

import unicodedata
def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")

Some examples of Unicode categories:

>>> from unicodedata import category
>>> category('\r')      # carriage return --> Cc : control character
'Cc'
>>> category('\0')      # null character ---> Cc : control character
'Cc'
>>> category('\t')      # tab --------------> Cc : control character
'Cc'
>>> category(' ')       # space ------------> Zs : separator, space
'Zs'
>>> category(u'\u200A') # hair space -------> Zs : separator, space
'Zs'
>>> category(u'\u200b') # zero width space -> Cf : control character, formatting
'Cf'
>>> category('A')       # letter "A" -------> Lu : letter, uppercase
'Lu'
>>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
'Lo'
>>> category(',')       # comma  -----------> Po : punctuation
'Po'
>>>
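Because remove_control_characters drops every character whose category starts with "C", it also removes tab, newline, and formatting characters such as the zero width space. The sketch below shows one way to make that choice explicit. The function name remove_by_category and its prefixes and keep parameters are hypothetical, introduced only for illustration.

```python
import unicodedata


def remove_by_category(s, prefixes=("Cc",), keep="\t\n\r"):
    """Drop characters whose Unicode category starts with one of `prefixes`.

    Characters listed in `keep` are always preserved. Both parameters are
    illustrative assumptions, not part of the answer above.
    """
    return "".join(
        ch for ch in s
        if ch in keep or not unicodedata.category(ch).startswith(prefixes)
    )


# Default: only Cc is removed; tab and the zero width space (Cf) survive.
print(repr(remove_by_category("a\x00b\tc\u200bd")))

# prefixes=("C",) with keep="" mirrors the behaviour of the function above.
print(repr(remove_by_category("a\u200bb\n", prefixes=("C",), keep="")))
```

str.startswith accepts a tuple of prefixes, so several categories can be listed at once, for example ("Cc", "Cf").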

Answered 2025-04-16 by Python大师

