Handling un-encodable mp4 tag names in UTF-8 Python code

Asked 2025-04-17 21:27

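For reasons I don't quite understand, some mp4 files use tag names that contain non-printable characters, at least as far as mutagen is concerned. The one giving me trouble is '\xa9wrt', which is the tag name for the composer field (!?).

If I run '\xa9wrt'.encode('utf-8') in a Python console, I get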

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte

I want to access this value from a Python file that uses some forward-compatibility measures, including:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

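I don't even know how to get the string '\xa9wrt' into my code file, because everything in that file is interpreted as utf-8, and the string I'm interested in apparently can't be written in utf-8. Also, once I have the string '\xa9wrt' in a variable (say, from mutagen), it is awkward to work with. For example, "{}".format(the_variable) fails, because "{}" is interpreted as u"{}", which once again tries to encode the string as utf-8.

Simply typing '\xa9wrt' gives me u'\xa9wrt', which is not the same thing, and nothing else I've tried has worked either: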

>>> u'\xa9wrt' == '\xa9wrt'
False
>>> str(u'\xa9wrt')
'\xc2\xa9wrt'
>>> str(u'\xa9wrt') == '\xa9wrt'
False

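Note that this output comes from the console, where I seem to be able to type non-Unicode literals. I'm using Spyder on Mac OS, with sys.version = 2.7.6 |Anaconda 1.8.0 (x86_64)| (default, Nov 11 2013, 10:49:09)\n[GCC 4.0.1 (Apple Inc. build 5493)].

How can I work with this string in a Unicode world? Is utf-8 simply unable to do it?

Update: Thanks, @tsroten, for the answer. It gave me a clearer understanding of the problem, but I still can't get the result I want. Here is a more explicit question: how can I reach the two lines marked '??' without resorting to the tricks I'm using now?

Note that the str I'm working with is handed to me by a library. I have to accept it as that type.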

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

tagname = 'a9777274'.decode('hex') # This value comes from a library as a str, not a unicode
if u'\xa9wrt' == tagname:
    # ??: What test could I run that would get me here without resorting to writing my string in hex?
    print("You found the tag you're looking for!")
else:
    print("Keep looking!")

print(str("This will work: {}").format(tagname))
try:
    print("This will throw an exception: {}".format(tagname))
    # ??: Can I reach this line without resorting to converting my format string to a str?
except UnicodeDecodeError:
    print("Threw exception")

Update 2:

I don't think any of the strings you (@tsroten) constructed are equal to the one I get from mutagen. That string still seems to cause problems:

>>> u = u'\xa9wrt'
>>> s = u.encode('utf-8')
>>> s2 = '\xa9wrt'
>>> s3 = 'a9777274'.decode('hex')
>>> s2 == s
False
>>> s2 == s3
True
>>> match_tag(s)
We have a match! tagname == ©wrt
Look! We printed tagname and no exception was raised.
>>> match_tag(s2)
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte

Tags: unicode, character encoding, utf-8, compatibility, tag handling, mp4, mutagen, non-printable characters

3 Answers

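I finally found a way to represent the string I want in a utf-8 file with unicode_literals. I convert the string to hex and then convert it back. Specifically, in the console (which is evidently not in unicode_literals mode), I ran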

"".join(["{0:02x}".format(ord(c)) for c in '\xa9wrt'])  # 02x keeps two hex digits per byte

and then, in my source file, I can use

'a9777274'.decode('hex')

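to create the string I want.

But is this really the right way to do it? For one thing, if my console fully supported unicode, I don't know whether I could still type the string '\xa9wrt' directly and have Python tell me the hex sequence that represents this byte string.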
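As a possible alternative to the hex round trip, here is a minimal sketch. It relies on the b'' prefix, which Python 2.6 and later accept and which yields a byte string whether or not unicode_literals is in effect. The variable names (tag_bytes, hex_form, round_trip) are only illustrative; binascii.hexlify and binascii.unhexlify are the standard-library counterparts of the manual join above.

```python
import binascii

# The b prefix forces a byte string, with or without unicode_literals.
tag_bytes = b'\xa9wrt'

# hexlify emits exactly two hex digits per byte.
hex_form = binascii.hexlify(tag_bytes)

# unhexlify reverses it; on Python 2 this matches 'a9777274'.decode('hex').
round_trip = binascii.unhexlify(b'a9777274')

print(hex_form == b'a9777274')
print(round_trip == tag_bytes)
```

With a literal like this in the source file, a comparison such as tagname == b'\xa9wrt' should not need any hex conversion at all.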

Answered 2025-04-17 by Python大师


This string is encoded in Latin-1, so if you want to store it in a UTF-8 file, or compare it with a UTF-8 string, just do this:

>>> '\xa9wrt'.decode('latin-1').encode('utf-8')
'\xc2\xa9wrt'

Or, if you want to compare it with a Unicode string:

>>> '\xa9wrt'.decode('latin-1') == u'©wrt'
True
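To illustrate why Latin-1 is a convenient choice here, the sketch below assumes the library hands over the raw bytes b'\xa9wrt' (the variable names are made up for the example). Latin-1 maps every byte value 0 to 255 to the code point with the same number, so the decode step cannot fail and the round trip is lossless.

```python
raw_tag = b'\xa9wrt'                    # bytes as handed over by the library

text_tag = raw_tag.decode('latin-1')    # each byte becomes one code point
utf8_tag = text_tag.encode('utf-8')     # what a UTF-8 file would hold

print(text_tag == u'\xa9wrt')                    # True
print(utf8_tag == b'\xc2\xa9wrt')                # True
print(text_tag.encode('latin-1') == raw_tag)     # True: lossless round trip
```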

Answered 2025-04-17 by Python大师


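\xa9 is the copyright sign. For more information, see C1 Controls and Latin-1 Supplement, which is part of the Unicode standard.

Perhaps the tag ©wrt means "copyright" rather than "composer"?

When you run '\xa9wrt'.encode('utf-8'), you get a UnicodeDecodeError because encode() expects a unicode object but you gave it a str. Python first converts your str to unicode, and it assumes the str is encoded as 'ascii' (or whatever the default encoding is). That is why a decode error shows up while encoding. To fix this, use unicode: u'\xa9wrt'.encode('utf-8').

In the Python interpreter, type('') returns <type 'str'> by default. If you first enter from __future__ import unicode_literals in the interpreter, type('') returns <type 'unicode'> instead. You said that simply typing '\xa9wrt' gives you u'\xa9wrt', which is not the same thing. That is true in some cases and false in others: whether u'\xa9wrt' == '\xa9wrt' is True or False depends on whether you have imported unicode_literals.

Copy, paste, and save the following into a file (for example test.py), then run python test.py from the command line.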
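The two-step behaviour described above can be spelled out by hand. This is only a sketch of what Python 2 does implicitly, assuming 'ascii' is the default encoding and, following the Latin-1 answer, that the bytes are really Latin-1; the variable names are illustrative.

```python
raw = b'\xa9wrt'

# Step 1 of the implicit conversion: decode with the default encoding.
try:
    raw.decode('ascii')
    implicit_step_failed = False
except UnicodeDecodeError as exc:
    implicit_step_failed = True
    print("decode step failed: %s" % exc)

# Naming the source encoding explicitly lets both steps succeed.
encoded = raw.decode('latin-1').encode('utf-8')
print(encoded == b'\xc2\xa9wrt')
```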

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

tag1 = u'\xa9wrt'
tag2 = '\xa9wrt'
print("tag1 = u'\\xa9wrt'")
print("tag2 = '\\xa9wrt'")
print("tag1: %s" % tag1)
print("tag2: %s" % tag2)
print("type(tag1): %s" % type(tag1))
print("type(tag2): %s" % type(tag2))
print("tag1 == tag2: %s" % (tag1 == tag2))
try:
    print("str(tag1): %s" % str(tag1))
except UnicodeEncodeError:
    print("str(tag1): raises UnicodeEncodeError")
print("tag1.encode('utf-8'): ".encode('utf-8') + tag1.encode('utf-8'))

After copying the code above into a file and running it with Python 2.7, I got the following output:

tag1 = u'\xa9wrt'
tag2 = '\xa9wrt'
tag1: ©wrt
tag2: ©wrt
type(tag1): <type 'unicode'>
type(tag2): <type 'unicode'>
tag1 == tag2: True
str(tag1): raises UnicodeEncodeError
tag1.encode('utf-8'): ©wrt

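Edit:

Life is much simpler if your code uses unicode internally. That means converting input to unicode as soon as you receive it, and converting back to str on output only if you need to. So when you receive a tagname of type str from somewhere, convert it to unicode first.

For example, here are the contents of test.py: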

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

def match_tag(tagname):
    if isinstance(tagname, str):
        # tagname comes in as str, so let's convert it
        tagname = tagname.decode('utf-8')  # enter the correct encoding here

    # Now that we have a unicode tag, we can deal with it easily:
    if tagname == '\xa9wrt':
        print("We have a match! tagname == %s" % tagname)
        print("Look! We printed tagname and no exception was raised.")

Then we run it:

>>> from test import match_tag
>>> u = u'\xa9wrt'
>>> s = u.encode('utf-8')
>>> type(u)
<type 'unicode'>
>>> type(s)
<type 'str'>
>>> match_tag(u)
We have a match! tagname == ©wrt
Look! We printed tagname and no exception was raised.
>>> match_tag(s)
We have a match! tagname == ©wrt
Look! We printed tagname and no exception was raised.

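So you need to find out which encoding your input string uses. Once you know that, you can convert the str to unicode and your code will run smoothly.

Edit 2:

If you just want s2 = '\xa9wrt' to work, you need to decode it correctly first. s2 is a byte str, and any implicit conversion uses the default encoding (check sys.getdefaultencoding() to see which one; it is probably ascii). \xa9 is not an ASCII character, so that implicit decode fails. That is the problem with s2. Try decoding it explicitly before passing it to match_tag():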
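As a sketch of that idea, the helper below decodes incoming bytes with an encoding the caller chooses. The name tag_matches and its parameters are hypothetical, and the latin-1 default is an assumption taken from the Latin-1 answer in this thread, not something mutagen is known to guarantee.

```python
def tag_matches(tagname, wanted=u'\xa9wrt', encoding='latin-1'):
    """Return True if tagname (bytes or text) equals the wanted text tag."""
    if isinstance(tagname, bytes):          # bytes is str on Python 2
        tagname = tagname.decode(encoding)
    return tagname == wanted

library_tag = b'\xa9wrt'                    # stand-in for the library's value

print(tag_matches(library_tag))                        # True
print(tag_matches(u'\xa9wrt'))                         # True
print(tag_matches(b'\xc2\xa9wrt', encoding='utf-8'))   # True

# Formatting works once both sides are text.
message = u"Found tag: {}".format(library_tag.decode('latin-1'))
print(message == u'Found tag: \xa9wrt')                # True
```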

>>> s2 = '\xa9wrt'
>>> s2_decoded = s2.decode('unicode_escape')
>>> type(s2_decoded)  # This is unicode, just like we want.
<type 'unicode'>
>>> match_tag(s2_decoded)
We have a match! tagname == ©wrt
Look! We printed tagname and no exception was raised.
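One caveat, shown as a small sketch: the unicode_escape codec treats ordinary bytes as Latin-1, which is why it works for this tag, but it also interprets backslash sequences inside the data. If the bytes could ever contain a literal backslash, decoding as latin-1 directly is probably the safer choice. The sample values below are invented for the illustration.

```python
plain = b'\xa9wrt'
# For bytes without backslashes, both codecs give the same text.
print(plain.decode('unicode_escape') == plain.decode('latin-1'))   # True

tricky = b'a\\x41b'   # six bytes: a, backslash, x, 4, 1, b
print(tricky.decode('latin-1') == u'a\\x41b')        # True: left untouched
print(tricky.decode('unicode_escape') == u'aAb')     # True: \x41 became 'A'
```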

Answered 2025-04-17 by Python大师

