How do I check whether a string is Unicode or ASCII?

Question

What do I need to do in Python to find out which encoding a string uses?

Answer 1

In Python 3.x, every string is a sequence of Unicode characters, so it is enough to check whether the object is a str (which always means a Unicode string):

isinstance(x, str)

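In Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.

If you want a single check for whether an object is string-like, though, you can do this (Python 2 only; basestring is the shared base class of str and unicode and does not exist in Python 3):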

isinstance(x, basestring)
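
If the same code has to run on both Python 2 and Python 3, one possible sketch is to pick the right type tuple once and reuse it. The names string_types and is_string_like are made up for this illustration and are not part of the standard library.

```python
# Hypothetical compatibility shim: choose the "string-like" types once.
try:
    string_types = (basestring,)  # Python 2: covers both str and unicode
except NameError:
    string_types = (str,)         # Python 3: str is the only text type


def is_string_like(x):
    """Return True if x is a text string on the running interpreter."""
    return isinstance(x, string_types)
```

Note that on Python 3 this returns False for bytes objects, which is usually what you want because bytes are not text there.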

Answer 2

How to tell whether an object is a Unicode string or a byte string

You can check with type or isinstance.

In Python 2:

>>> type(u'abc')  # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')   # Python 2 byte string literal
<type 'str'>

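In Python 2, str is just a sequence of bytes, and Python does not know what encoding those bytes are in. The unicode type is the safer way to store text. If you want to dig deeper into this, I recommend http://farmdev.com/talks/unicode/.

In Python 3: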

>>> type('abc')   # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')  # Python 3 byte string literal
<class 'bytes'>

In Python 3, str is like Python 2's unicode and is used to store text. What was called str in Python 2 is called bytes in Python 3.
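
To make the str/bytes relationship concrete, here is a small Python 3 sketch that converts between the two. The variable names are only for illustration.

```python
text = "\u00dc"               # str: one Unicode code point, the letter 'Ü'
data = text.encode("utf-8")   # bytes: the UTF-8 encoding of that text

print(type(text).__name__, len(text))  # str 1   (one character)
print(type(data).__name__, len(data))  # bytes 2 (two bytes in UTF-8)
print(data == b"\xc3\x9c")             # True
print(data.decode("utf-8") == text)    # True: decoding restores the text
```

The lengths differ because len counts code points for str and individual bytes for bytes.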

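How to tell whether a byte string is valid utf-8 or ascii

You can call decode. If it raises a UnicodeDecodeError exception, the byte string is not valid in that encoding. The session below is from Python 2; in Python 3 the successful decode prints 'Ü' instead of u'\xdc'.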

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
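
The same idea can be wrapped in a small helper. This is only a sketch for Python 3, and is_valid_encoding is a name invented here, not a built-in.

```python
def is_valid_encoding(data, encoding):
    """Return True if the bytes object decodes cleanly with the given codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b"\xc3\x9c"  # UTF-8 representation of the letter 'Ü'
print(is_valid_encoding(u_umlaut, "utf-8"))  # True
print(is_valid_encoding(u_umlaut, "ascii"))  # False
print(is_valid_encoding(b"abc", "ascii"))    # True
```

Keep in mind that a successful decode only shows the bytes are valid in that encoding, not that they were written in it. For example, latin-1 accepts every possible byte sequence, so this check tells you nothing useful for that codec.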

Answer 3

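In Python 3, all strings are sequences of Unicode characters. There is also a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which with code something like this: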

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

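This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist entirely of characters in the ASCII range, and a byte string may contain ASCII, encoded Unicode, or even non-text data.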
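
If what you actually want to know is whether the content stays inside the ASCII range, regardless of the Python type, something like the following Python 3 sketch could work. The helper name is_ascii_only is hypothetical.

```python
def is_ascii_only(value):
    """Hypothetical helper: True if a str or bytes value holds only ASCII."""
    if isinstance(value, str):
        try:
            value.encode("ascii")  # fails on any code point above 127
        except UnicodeEncodeError:
            return False
        return True
    if isinstance(value, (bytes, bytearray)):
        # Iterating over bytes in Python 3 yields integers 0..255.
        return all(byte < 128 for byte in value)
    raise TypeError("expected str or bytes")


print(is_ascii_only("abc"))          # True
print(is_ascii_only("\u00dc"))       # False
print(is_ascii_only(b"abc"))         # True
print(is_ascii_only(b"\xc3\x9c"))    # False
```

On Python 3.7 and later, str and bytes also have a built-in isascii() method that performs the same check.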
