Making a Python loop faster

import unicodedata

def cleanup(s):
    strng = ''
    good = ['\t', '\r', '\n']
    for char in s:
        if unicodedata.category(char)[0] != "C":
            strng += char
        elif char in good:
            strng += char
        elif char not in good:
            strng += ' '
    return strng

2 answers

User

#1 · edited 2024-04-25 09:43:23

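If I understand correctly, you want to convert all Unicode control characters to spaces, except tab, carriage return and newline. You can do this with the string's translate method: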

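import sys
import unicodedata

# Python 2 code: build the translation table once, at import time.
good = map(ord, '\t\r\n')
TBL_CONTROL_TO_SPACE = {
    i: u' '
    for i in xrange(sys.maxunicode + 1)  # + 1 so the last code point is included
    if unicodedata.category(unichr(i))[0] == "C" and i not in good
}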

def cleanup(s):
    return s.translate(TBL_CONTROL_TO_SPACE)
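The code above targets Python 2 (xrange, unichr). As a rough sketch only, a Python 3 port of the same idea might look like the following. The names KEEP and CONTROL_TO_SPACE are illustrative and not from the answer. Note that the test on the first category letter "C" also matches format, surrogate, private-use and unassigned code points, so the table is large.

```python
import sys
import unicodedata

# Code points that must survive unchanged: tab, carriage return, line feed.
KEEP = {ord(c) for c in "\t\r\n"}

# Map every code point whose general category starts with "C" to a space.
# Built once at import time; scanning all code points takes a moment.
CONTROL_TO_SPACE = {
    i: " "
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i))[0] == "C" and i not in KEEP
}

def cleanup(s):
    # str.translate looks up each character's code point in the table
    # and leaves characters without an entry untouched.
    return s.translate(CONTROL_TO_SPACE)
```

The table costs memory and start-up time, but each later call is a single pass done in C.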

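User

#2 · edited 2024-04-25 09:43:23

If I understand your task correctly, you want to replace all Unicode control characters with spaces, except \t, \n and \r.

Here is how to do that more efficiently with a regular expression instead of a loop.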

import re

# make a string of all unicode control characters 
# EXCEPT \t - chr(9), \n - chr(10) and \r - chr(13)
control_chars = ''.join(map(unichr, range(0,9) + \
                            range(11,13) + \
                            range(14,32) + \
                            range(127,160)))

# build your regular expression
cc_regex = re.compile('[%s]' % re.escape(control_chars))

def cleanup(s):
    # substitute all control characters in the regex 
    # with spaces and return the new string
    return cc_regex.sub(' ', s)

By adjusting the ranges that make up the control_chars variable, you can control which characters are included or excluded. See List of Unicode characters.
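To make the point about adjusting ranges concrete, here is a minimal Python 3 sketch that builds the character class from a list of inclusive code point ranges. CONTROL_RANGES and build_control_regex are hypothetical names introduced for illustration. The ranges mirror the ones used above.

```python
import re

# Inclusive (first, last) code point ranges to replace: the C0 and C1
# control characters minus tab (9), line feed (10) and carriage return (13).
CONTROL_RANGES = [(0, 8), (11, 12), (14, 31), (127, 159)]

def build_control_regex(ranges):
    # \UXXXXXXXX escapes keep raw control characters out of the pattern text.
    parts = ["\\U{:08x}-\\U{:08x}".format(first, last) for first, last in ranges]
    return re.compile("[" + "".join(parts) + "]")

CC_REGEX = build_control_regex(CONTROL_RANGES)

def cleanup(s):
    # Replace every character matched by the class with a single space.
    return CC_REGEX.sub(" ", s)
```

Adding a pair such as (0x200b, 0x200f) to the list would extend the replacement to zero-width and directional marks, and removing a pair would exclude that range.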

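Edit: timing results.

Out of curiosity, I ran some timing tests to see which of the three approaches so far is fastest.

I wrote three functions: cleanup_op(s), a copy of the OP's code; cleanup_loop(s), which is Cristian Ciupitu's answer; and cleanup_regex(s), which is my code.

Here is what I ran: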

from timeit import default_timer as timer

sample = u"this is a string with some characters and \n new lines and \t tabs and \v and other stuff"*1000

start = timer();cleanup_op(sample);end = timer();print end-start
start = timer();cleanup_loop(sample);end = timer();print end-start
start = timer();cleanup_regex(sample);end = timer();print end-start

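The results were:

cleanup_op finished in about 1.1 seconds

cleanup_loop finished in about 0.02 seconds

cleanup_regex finished in about 0.004 seconds

So either answer is a significant improvement over the original code. I think @CristianCiupitu gave the more elegant and Pythonic answer, while the regex is faster.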
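A single start/stop measurement can be noisy. As a hedged sketch, the comparison could be repeated in Python 3 with timeit roughly as follows. The three functions are compact Python 3 re-implementations written for this example, not the exact code that was timed, and absolute numbers will differ by machine and Python version.

```python
import re
import sys
import timeit
import unicodedata

KEEP = "\t\r\n"

def cleanup_op(s):
    # Character-by-character loop, in the spirit of the question's code.
    out = ""
    for ch in s:
        if unicodedata.category(ch)[0] != "C" or ch in KEEP:
            out += ch
        else:
            out += " "
    return out

# Translation table, as in the first answer (built once, up front).
TABLE = {
    i: " "
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i))[0] == "C" and chr(i) not in KEEP
}

def cleanup_loop(s):
    return s.translate(TABLE)

# Character class covering the same ranges as the second answer.
CC_REGEX = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")

def cleanup_regex(s):
    return CC_REGEX.sub(" ", s)

sample = ("this is a string with some characters and \n new lines "
          "and \t tabs and \v and other stuff") * 1000

for fn in (cleanup_op, cleanup_loop, cleanup_regex):
    runs = 5
    total = timeit.timeit(lambda: fn(sample), number=runs)
    print("%s: %.5f s per call" % (fn.__name__, total / runs))
```

The regex version only covers the listed ranges, while the other two treat every category starting with "C" as replaceable, so the outputs agree on this sample but may differ on other input.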
