python - Problem with regular expressions and unicode

Asked on 2025-04-15 13:51

Hi, I have run into a problem in Python. Let me illustrate it with an example.

I have a string like this:

>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ

What I want to do is replace every character except Ñ, Ã and ï with "" (the empty string).

I tried:

>>> import re
>>> rePat = re.compile('[^ÑÃï]',re.UNICODE)
>>> print rePat.sub("",string)
�Ñ�����������������������������ï�������������������Ã

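But what I get is the strange symbol �. I think this happens because Python represents each of these characters with two bytes; for example, \xc3\x91 stands for Ñ.

So when I apply the regular expression, none of the \xc3 bytes get removed. How can I do this kind of replacement?

Thanks!

Franco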


1 Answer

You need to make sure your strings are Unicode strings, not plain strings (a plain str behaves like an array of bytes).

For example:

>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> type(string)
<type 'str'>

# do this instead:
# (note the u in front of the ', this marks the character sequence as a unicode literal)
>>> string = u'\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\xc0\xc1\xc2\xc3'
# or:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'.decode('utf-8')
# ... but be aware that the latter will only work if the terminal (or source file) has utf-8 encoding
# ... it is a best practice to use the \xNN form in unicode literals, as in the first example

>>> type(string)
<type 'unicode'>
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ

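# in a unicode pattern each character is a single code point: \xd1 is Ñ, \xc3 is Ã, \xef is ï
>>> import re
>>> rePat = re.compile(u'[^\xd1\xc3\xef]', re.UNICODE)
>>> print rePat.sub(u"", string)
ÑïÃ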
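To make the difference concrete, here is a minimal sketch that runs the same substitution once on bytes and once on text. It is meant to run on Python 2.7 and Python 3, and it assumes the original terminal used UTF-8, as the \xc3\x91 in the question suggests. The names text, raw, byte_pat, char_pat, leftover and kept are made up for this illustration.

```python
import re

# Build the 52 characters from the question (U+00D0..U+00FF, then U+00C0..U+00C3).
# Latin-1 maps every byte value to the code point with the same number, so
# decoding these bytes as Latin-1 yields exactly those characters.
code_points = list(range(0xD0, 0x100)) + list(range(0xC0, 0xC4))
text = bytes(bytearray(code_points)).decode('latin-1')

# Assumption: the question's plain string held this text as UTF-8 bytes.
raw = text.encode('utf-8')

# Byte-level pattern: the class contains the individual bytes of Ñ, Ã and ï
# (\xc3\x91, \xc3\x83, \xc3\xaf), so every \xc3 lead byte survives the sub().
byte_pat = re.compile(b'[^\xc3\x91\xc3\x83\xc3\xaf]')
leftover = byte_pat.sub(b'', raw)

# Character-level pattern: one code point per character.
char_pat = re.compile(u'[^\xd1\xc3\xef]', re.UNICODE)
kept = char_pat.sub(u'', text)

# 52 characters, 104 bytes; 55 bytes left over versus 3 characters kept.
summary = (len(text), len(raw), len(leftover), len(kept))
```

The 55 leftover bytes are the 52 stray \xc3 lead bytes plus the three trailing bytes \x91, \x83 and \xaf, which is why the question's output shows Ñ, ï and Ã surrounded by replacement symbols.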

When you read from a file, string = open('filename.txt').read() gives you a sequence of bytes.

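To get Unicode content, do this instead: string = unicode(open('filename.txt').read(), 'encoding'). Or: string = open('filename.txt').read().decode('encoding'). Here 'encoding' is a placeholder for the file's actual encoding, such as 'utf-8'.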
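As a rough sketch of that read-then-decode step: the snippet below writes a small temporary file so that it is self-contained, then reads it back in binary mode, which returns raw bytes on both Python 2.7 and Python 3. The file contents and the names raw, text and mojibake are assumptions made for the illustration.

```python
import os
import tempfile

# Hypothetical sample file containing the UTF-8 bytes of "ÑÃï".
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(u'\xd1\xc3\xef'.encode('utf-8'))

# Binary mode: read() hands back the bytes exactly as stored on disk.
with open(path, 'rb') as f:
    raw = f.read()

# Decoding with the file's real encoding recovers the three characters.
text = raw.decode('utf-8')

# Decoding with the wrong encoding does not fail for Latin-1, but it turns
# each of the six bytes into its own character.
mojibake = raw.decode('latin-1')

os.remove(path)
```

The decode call only gives the right text when the encoding name matches how the file was actually written, so that name has to come from knowledge about the file rather than from Python.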

The codecs module can decode Unicode streams (such as files) on the fly.
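A small sketch of what that can look like with codecs.open, which encodes on write and decodes on read so the program only ever sees Unicode strings. The temporary file and the variable names are assumptions made for the illustration.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file for the demonstration.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

# Writing: the unicode string is encoded to UTF-8 as it goes to disk.
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\xd1\xc3\xef\n')

# Reading: each line comes back already decoded to a unicode string.
with codecs.open(path, 'r', encoding='utf-8') as f:
    lines = [line for line in f]

os.remove(path)
```

io.open accepts the same encoding argument and is the usual choice on Python 3, where it is the built-in open.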

You can search Google for "python unicode". Python's Unicode handling can be hard to grasp at first, but reading up on it pays off.

I follow this principle: "Software should only work with Unicode strings internally, converting to a particular encoding on output." (from http://www.amk.ca/python/howto/unicode)
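One way to read that principle, sketched under the assumption that input arrives as bytes in a known encoding: decode once at the boundary, do all the work on Unicode strings, and encode once on the way out. The function clean and the pattern name DROP are hypothetical names for this illustration.

```python
import re

# Matches every character except Ñ, Ã and ï (one code point each).
DROP = re.compile(u'[^\xd1\xc3\xef]', re.UNICODE)

def clean(raw, in_encoding='utf-8', out_encoding='utf-8'):
    """Decode at the input boundary, work on text, encode at the output boundary."""
    text = raw.decode(in_encoding)   # bytes -> unicode
    text = DROP.sub(u'', text)       # all processing happens on unicode
    return text.encode(out_encoding) # unicode -> bytes

sample = u'\xd0\xd1\xd2\xef\xc3'     # ÐÑÒïÃ

# The same logic works for any input encoding once the bytes are decoded.
from_utf8 = clean(sample.encode('utf-8'))
from_latin1 = clean(sample.encode('latin-1'), 'latin-1', 'latin-1')
```

Because the regular expression only ever sees decoded text, it does not need to know whether a character took one byte or two in the original data.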

I also recommend this link: http://www.joelonsoftware.com/articles/Unicode.html

Answered on 2025-04-15 by Python大师
