Efficiently finding invalid characters in Python

1 vote

9 answers

17691 views

Asked 2025-04-16 15:56

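I'm building a forum application in Django and want to make sure users can't enter certain characters in their posts. I need an efficient way to scan the whole message for these disallowed characters. Here is what I have so far; it doesn't work correctly, and I suspect the approach isn't very efficient either.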

def clean_topic_message(self):
    topic_message = self.cleaned_data['topic_message']
    words = topic_message.split()
    if (topic_message == ""):
        raise forms.ValidationError(_(u'Please provide a message for your topic'))
    for word in words:
        if (re.match(r'[^<>/\{}[]~`]$',topic_message)):
            raise forms.ValidationError(_(u'Topic message cannot contain the following: <>/\{}[]~`'))
    return topic_message

Thanks for your help.

django, input validation, invalid characters, character validation, forum application

9 answers

If efficiency is a major concern, I suggest compiling the regular expression with re.compile(), since you will be using the same expression many times.
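
A minimal sketch of that suggestion, assuming the forbidden set from the question (`<>/\{}[]~` plus the backtick). The names `INVALID_CHARS_RE` and `has_invalid_chars` are made up for illustration. Note that the `re` module also caches recently used patterns internally, so the gain from precompiling is usually modest.

```python
import re

# Compiled once at import time and reused for every post.
# In the class, \\ matches a literal backslash; \[ and \] match literal brackets.
INVALID_CHARS_RE = re.compile(r"[<>/\\{}\[\]~`]")


def has_invalid_chars(message):
    """Return True if the message contains any forbidden character."""
    return INVALID_CHARS_RE.search(message) is not None


print(has_invalid_chars("This topic is a-ok"))      # False
print(has_invalid_chars("This topic is <not>-ok"))  # True
```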

Answered 2025-04-16 by Python大师


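You have to be very careful with regular expressions, because they are full of pitfalls.

For example, in [^<>/\{}[]~] the first ] closes the character class, which is probably not what you intended. To include a literal ] in a character class, either escape it as \] or make it the first character after the opening [ (or after the ^ in a negated class), for example []^<>/\{}[~]. In that example the ^ is no longer first, so it is treated as a literal character too.

A quick test demonstrates this.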

>>> import re
>>> re.search("[[]]","]")
>>> re.search("[][]","]")
<_sre.SRE_Match object at 0xb7883db0>
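
The same point can be checked in Python 3, where escaping the bracket with a backslash is an alternative to placing it first. This is only a sketch: `re.escape` is used here as one possible way to build the class from a plain string of forbidden characters, and the variable names are illustrative.

```python
import re

# Escaping "]" keeps it inside the character class wherever it appears.
escaped = re.compile(r"[\[\]]")
print(bool(escaped.search("]")))  # True
print(bool(escaped.search("[")))  # True

# Building the class from a plain string: re.escape() backslash-escapes
# the characters that are special in a pattern, including "[", "]" and "\".
forbidden = "<>/\\{}[]~`"
pattern = re.compile("[" + re.escape(forbidden) + "]")
print(bool(pattern.search("a]b")))    # True
print(bool(pattern.search("plain")))  # False
```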

Regular expressions are overkill for this problem, though.

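def clean_topic_message(self):
    topic_message = self.cleaned_data['topic_message']
    # Raw string so the backslash stays a literal character.
    invalid_chars = r'^<>/\{}[]~`$'
    if topic_message == "":
        raise forms.ValidationError(_(u'Please provide a message for your topic'))
    # A non-empty intersection means at least one forbidden character is present.
    if set(invalid_chars).intersection(topic_message):
        # Interpolate after translating so the message lookup uses the fixed string.
        raise forms.ValidationError(_(u'Topic message cannot contain the following: %s') % invalid_chars)
    return topic_message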
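
Outside of Django, the set-based check can be tried on its own. A sketch under assumptions: `find_invalid_chars` is a hypothetical helper, and the forbidden set is copied from the form method above.

```python
# Forbidden characters, copied from the form method; the raw string keeps the backslash literal.
INVALID_CHARS = frozenset(r'^<>/\{}[]~`$')


def find_invalid_chars(message):
    """Return the forbidden characters present in message, sorted for stable output."""
    return sorted(INVALID_CHARS.intersection(message))


print(find_invalid_chars("This topic is a-ok"))  # []
print(find_invalid_chars("cost: $5 <approx>"))   # ['$', '<', '>']
```

Returning the offending characters rather than a bare boolean makes it easy to tell the user exactly what to remove.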

Answered 2025-04-16 by Python大师


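For a regex solution, there are two approaches:

1. Find a single invalid character in the string.
2. Verify that every character in the string is valid.

Here is a script that implements both: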

import re
topic_message = 'This topic is a-ok'

# Option 1: Invalidate one char in string.
re1 = re.compile(r"[<>/{}[\]~`]")
if re1.search(topic_message):
    print("RE1: Invalid char detected.")
else:
    print("RE1: No invalid char detected.")

# Option 2: Validate all chars in string.
re2 = re.compile(r"^[^<>/{}[\]~`]*$")
if re2.match(topic_message):
    print("RE2: All chars are valid.")
else:
    print("RE2: Not all chars are valid.")

Either one will do.

Note: the original regex has an unescaped right square bracket inside the character class; it needs to be escaped.
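
If the error message should name the offending characters, Option 1 extends naturally with `findall`. This is a sketch rather than part of the original answer, and `offending_chars` is a hypothetical helper name.

```python
import re

# Same character class as Option 1, with both brackets escaped for clarity.
INVALID_RE = re.compile(r"[<>/{}\[\]~`]")


def offending_chars(message):
    """Return the distinct invalid characters in message, in order of first appearance."""
    seen = []
    for ch in INVALID_RE.findall(message):
        if ch not in seen:
            seen.append(ch)
    return seen


print(offending_chars("This topic is <not>-ok. This topic is {not}-ok."))
# ['<', '>', '{', '}']
```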

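Benchmark: after seeing gnibbler's interesting set() solution, I was curious which approach was faster, so I measured it. Below are the benchmark data, the measured statements, and the timeit results:

Test data: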

r"""
TEST topic_message STRINGS:
ok:  'This topic is A-ok.     This topic is     A-ok.'
bad: 'This topic is <not>-ok. This topic is {not}-ok.'

MEASURED PYTHON STATEMENTS:
Method 1: 're1.search(topic_message)'
Method 2: 're2.match(topic_message)'
Method 3: 'set(invalid_chars).intersection(topic_message)'
"""

Results:

r"""
Seconds to perform 1000000 Ok-match/Bad-no-match loops:
Method  Ok-time  Bad-time
1        1.054    1.190
2        1.830    1.636
3        4.364    4.577
"""

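The benchmark shows that Option 1 is somewhat faster than Option 2 (about 1.05 s versus 1.83 s on the ok string), and both are considerably faster than the set().intersection() approach (about 4.4 s). This holds for both matching and non-matching strings.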
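
The exact benchmark script was not posted, so the following is only a sketch of how the three statements could be timed with `timeit`. The value of `invalid_chars` is assumed from the set() answer, the brackets in the patterns are escaped for clarity, `run_benchmark` is a hypothetical helper, and the default loop count is lower than the 1000000 used above to keep the run short. Absolute timings will differ by machine and Python version.

```python
import re
import timeit

re1 = re.compile(r"[<>/{}\[\]~`]")
re2 = re.compile(r"^[^<>/{}\[\]~`]*$")
invalid_chars = r'^<>/\{}[]~`$'  # assumed; taken from the set() answer

ok = 'This topic is A-ok.     This topic is     A-ok.'
bad = 'This topic is <not>-ok. This topic is {not}-ok.'

statements = {
    1: 're1.search(topic_message)',
    2: 're2.match(topic_message)',
    3: 'set(invalid_chars).intersection(topic_message)',
}


def run_benchmark(number=100000):
    """Time each statement on the ok and bad strings; returns {method: (ok_s, bad_s)}."""
    results = {}
    for method, stmt in statements.items():
        timings = []
        for topic_message in (ok, bad):
            # Names the timed statement is allowed to see.
            namespace = {'re1': re1, 're2': re2,
                         'invalid_chars': invalid_chars,
                         'topic_message': topic_message}
            timings.append(timeit.timeit(stmt, globals=namespace, number=number))
        results[method] = tuple(timings)
    return results


if __name__ == "__main__":
    for method, (ok_time, bad_time) in run_benchmark().items():
        print("Method %d: ok %.3f s, bad %.3f s" % (method, ok_time, bad_time))
```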

Answered 2025-04-16 by Python大师
