What are the pros and cons of using UTF-8 strings instead of unicode for regular expressions?

Question

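When working with international languages, the usual Python best practice is to use unicode: convert any input to unicode as early as possible, and encode back to a byte string (UTF-8 in most cases) only at the very end.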

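But when I need to run regular expressions on unicode, I find the process not very friendly. For example, if I want to find the character 'é' followed by one or more whitespace characters, I have to write the following (note: my command line and Python files are set to UTF-8):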

re.match('(?u)\xe9\s+', unicode)

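So I have to write out the unicode code for 'é'. That is not very convenient, and things get complicated if I need to build the regex from a variable. For example: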

word_to_match = 'Élisa™'.decode('utf-8')  # this returns a unicode object
regex = '(?u)%s\s+' % word_to_match
re.match(regex, unicode)
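
A minimal sketch of building such a pattern from a variable, assuming the word should be matched literally. The helper name `build_prefix_regex` is hypothetical, and the `u''`/`b''` prefixes are used so the snippet should run on both Python 2.7 and Python 3. Passing the word through `re.escape` keeps any regex metacharacters in it from being interpreted.

```python
# -*- coding: utf-8 -*-
import re

def build_prefix_regex(word):
    """Compile a pattern for `word` followed by one or more whitespace characters.

    `build_prefix_regex` is a hypothetical helper, not part of the original post.
    """
    # re.escape neutralises regex metacharacters that may appear in the word.
    return re.compile(re.escape(word) + u'\\s+', re.UNICODE)

# UTF-8 bytes for 'Élisa™', decoded to a unicode object as early as possible.
word_to_match = b'\xc3\x89lisa\xe2\x84\xa2'.decode('utf-8')
pattern = build_prefix_regex(word_to_match)

# The subject string is unicode too, so pattern and subject are the same type.
matched = pattern.match(u'\xc9lisa\u2122   rest') is not None
not_matched = pattern.match(u'\xc9lisa\u2122rest') is None
```

Keeping both the pattern and the subject as unicode avoids any implicit conversion between the two types.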

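This is just a simple example. If you have many regexes to run one after another, with special characters in them, I find it simpler and more natural to apply the regex directly to the UTF-8 encoded string. For example: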

re.match('Élisa\s+', string)
re.match('Geneviève\s+', string)
re.match('DrØshtit\s+', string)

Am I missing something? Are there any drawbacks to the UTF-8 approach?
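
For reference, a small sketch of cases where matching on UTF-8 bytes and matching on unicode appear to diverge. Plain literals such as the ones above behave the same either way. The differences show up once a quantifier, a character class, or `\w` touches a multi-byte character. The sample strings are made up for illustration, and the `\w` result on bytes assumes no locale flag is in effect.

```python
# -*- coding: utf-8 -*-
import re

text = u'\xe9\xe9\xe9 x'              # three 'é' characters, a space, then 'x'
raw = text.encode('utf-8')            # the same text as UTF-8 bytes

# Quantifier: on unicode, + repeats the whole character.
unicode_run = re.match(u'\xe9+', text).group()
# On UTF-8 bytes, + binds only to the last byte (\xa9) of the encoded 'é'.
byte_run = re.match(b'\xc3\xa9+', raw).group()

# Character class: on bytes, [éè] is really a set of single bytes,
# so it also accepts the first byte of 'ê' (encoded as \xc3\xaa).
byte_class_hit = re.match(b'[\xc3\xa9\xc3\xa8]', u'\xea'.encode('utf-8'))
unicode_class_hit = re.match(u'[\xe9\xe8]', u'\xea')

# \w: with re.UNICODE it covers accented letters; on bytes it does not here.
unicode_word = re.match(u'\\w+', u'\xc9lisa', re.UNICODE).group()
byte_word = re.match(b'\\w+', u'\xc9lisa'.encode('utf-8'))
```

In this sketch `unicode_run` covers all three characters while `byte_run` stops after the first one, and `byte_class_hit` matches half of a character that is not in the class at all.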

Update

OK, I found the problem. I was testing in ipython, but unfortunately it seems to mess up the encoding. For example:

In the Python command line

>>> string_utf8 = 'Test « with theses » quotes Éléments'
>>> string_utf8
'Test \xc2\xab with theses \xc2\xbb quotes \xc3\x89l\xc3\xa9ments'
>>> print string_utf8
Test « with theses » quotes Éléments
>>>
>>> unicode_string = u'Test « with theses » quotes Éléments'
>>> unicode_string
u'Test \xab with theses \xbb quotes \xc9l\xe9ments'
>>> print unicode_string
Test « with theses » quotes Éléments
>>>
>>> unicode_decoded_from_utf8 = string_utf8.decode('utf-8')
>>> unicode_decoded_from_utf8
u'Test \xab with theses \xbb quotes \xc9l\xe9ments'
>>> print unicode_decoded_from_utf8
Test « with theses » quotes Éléments

In ipython

In [1]: string_utf8 = 'Test « with theses » quotes Éléments'

In [2]: string_utf8
Out[2]: 'Test \xc2\xab with theses \xc2\xbb quotes \xc3\x89l\xc3\xa9ments'

In [3]: print string_utf8
Test « with theses » quotes Éléments

In [4]: unicode_string = u'Test « with theses » quotes Éléments'

In [5]: unicode_string
Out[5]: u'Test \xc2\xab with theses \xc2\xbb quotes \xc3\x89l\xc3\xa9ments'

In [6]: print unicode_string
Test Â« with theses Â» quotes ÃlÃ©ments

In [7]: unicode_decoded_from_utf8 = string_utf8.decode('utf-8')

In [8]: unicode_decoded_from_utf8
Out[8]: u'Test \xab with theses \xbb quotes \xc9l\xe9ments'

In [9]: print unicode_decoded_from_utf8
Test « with theses » quotes Éléments

As you can see, ipython messes up the encoding when the u'' notation is used. That was the cause of my problem. The bug is reported here: https://bugs.launchpad.net/ipython/+bug/339642
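
The garbled value in Out[5] looks as if each UTF-8 byte of the literal was turned into its own code point, which is what a Latin-1 decode does. That mechanism is my assumption from the output, not something taken from the bug report. Under that assumption, a minimal sketch that reproduces the garbled value and reverses it:

```python
# -*- coding: utf-8 -*-
# The text the u'' literal was meant to produce.
expected = u'Test \xab with theses \xbb quotes \xc9l\xe9ments'
utf8_bytes = expected.encode('utf-8')

# Assumed mechanism: every UTF-8 byte becomes one code point (a Latin-1 decode).
garbled = utf8_bytes.decode('latin-1')

# Reversing that step recovers the intended text.
repaired = garbled.encode('latin-1').decode('utf-8')
```

Here `garbled` equals the value shown in Out[5], and `repaired` equals the value the plain Python command line produced.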

