检测Unicode字符串中的非ASCII字符

1条回答

网友

1楼 · 发布于 2024-05-16 13:57:41

The ultimate goal here is to compile a list of characters in the data that cannot encode to ascii.

我能想到的最有效的方法是使用^{}来去掉任何有效的ASCII字符，这会给您留下一个包含所有非ASCII字符的字符串。

这只会去掉可打印的字符。。。

>>> import re
>>> print re.sub('[ -~]', '', u'£100 is worth more than €100')
£€

…或者如果要包含不可打印字符，请使用此。。。

>>> print re.sub('[\x00-\x7f]', '', u'£100 is worth more than €100')
£€

要消除重复，只需创建返回字符串的set()。。。

>>> print set(re.sub('[\x00-\x7f]', '', u'£€£€'))
set([u'\xa3', u'\u20ac'])