Decoding (extended) URL-encoded binary strings in Python

0 votes

3 answers

3054 views

Asked 2025-04-15 13:42

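For analysis purposes, I need to decode URL-encoded binary strings (which may contain non-printable characters). Unfortunately, the strings come in an extended URL-encoding form, e.g. "%u616f". I want to store them in a file that contains the raw binary values, e.g. 0x61 0x6f.

How can I convert these to binary data in Python? (urllib.unquote only handles the "%HH" form.)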

Tags: file storage, URL encoding, binary strings, decoding, extended URL encoding, non-printable characters

3 Answers

Here is a regex-based approach:

import re

# The replacement function converts each pair of hex digits to a
# character and concatenates the two results.
repfunc = lambda m: chr(int(m.group(1), 16)) + chr(int(m.group(2), 16))

# The last argument is the text you want to convert.
result = re.sub(r'%u([0-9a-fA-F]{2})([0-9a-fA-F]{2})', repfunc, '%u616f')
print(result)

The result is

ao
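If the goal is raw bytes written to a file, the same regex idea can be applied to `bytes` directly. The following is a minimal Python 3 sketch, not part of the original answer; the function name `unescape_to_bytes` is made up for illustration, and it assumes each "%uHHHH" escape should become two bytes in big-endian order, as in the question's example.

```python
import re

# Matches either %uHHHH (group 1) or %HH (group 2).
_ESCAPE = re.compile(rb'%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})')

def unescape_to_bytes(data: bytes) -> bytes:
    """Replace %uHHHH with two raw bytes and %HH with one raw byte."""
    def repl(m):
        hex_digits = m.group(1) or m.group(2)
        # bytes.fromhex expects a str of hex digits.
        return bytes.fromhex(hex_digits.decode('ascii'))
    return _ESCAPE.sub(repl, data)

print(unescape_to_bytes(b'hello%20%u616f%00'))  # b'hello ao\x00'
```

Anything that does not look like a well-formed escape is left untouched, which may or may not be what you want for malformed input.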

Answered 2025-04-15 by Python大师


Unfortunately, the strings come in an extended URL-encoding form, e.g. "%u616f"

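Incidentally, that has nothing to do with URL encoding. It is an arbitrary, made-up format produced by JavaScript's escape() function and used by almost nothing else. If you can, the best fix is to change the JavaScript to use encodeURIComponent instead. That gives you a standard, UTF-8-based URL-encoded string.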
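If the JavaScript side can be switched to encodeURIComponent, the result can be decoded with the standard library alone. Here is a small Python 3 sketch; the sample string is what encodeURIComponent would produce for the two characters "é慯", chosen purely for illustration.

```python
from urllib.parse import unquote, unquote_to_bytes

# Standard percent-encoding of the UTF-8 bytes of "é慯".
encoded = "%C3%A9%E6%85%AF"

text = unquote(encoded)          # decodes as UTF-8 by default
raw = unquote_to_bytes(encoded)  # keeps the raw bytes instead

print(text)  # é慯
print(raw)   # b'\xc3\xa9\xe6\x85\xaf'
```

`unquote_to_bytes` is the one to reach for when the payload is arbitrary binary data rather than text.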

e.g. "%u616f". I want to store them in a file that contains the raw binary values, e.g. 0x61 0x6f.

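Are you sure 0x61 0x6f (the letters "ao") is the byte stream you want to store? That would mean you are using the UTF-16BE encoding; are you handling all of your strings that way?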
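To see why 0x61 0x6f corresponds to UTF-16BE, it may help to encode the single character U+616F in a few encodings. This is a short Python 3 illustration only:

```python
ch = "\u616f"  # the single character 慯

print(ch.encode("utf-16-be"))  # b'ao'  (0x61 0x6f)
print(ch.encode("utf-16-le"))  # b'oa'  (0x6f 0x61)
print(ch.encode("utf-8"))      # b'\xe6\x85\xaf'
```

Only the big-endian UTF-16 form matches the byte order asked for in the question.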

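Usually you would want to decode the input to Unicode and then write it out in a suitable encoding such as UTF-8 or UTF-16LE. Here is a quick way to do that, using a trick that gets Python to read '%u1234' as the string-escape form u'\u1234':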

>>> ex= 'hello %e9 %u616f'
>>> ex.replace('%u', r'\u').replace('%', r'\x').decode('unicode-escape')
u'hello \xe9 \u616f'

>>> print _
hello é 慯

>>> _.encode('utf-8')
'hello \xc3\xa9 \xe6\x85\xaf'
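The session above relies on Python 2, where a byte string has a .decode method. A rough Python 3 equivalent is sketched below; the helper name `unescape_js` is hypothetical, and the sketch assumes the input is pure ASCII and contains no literal backslashes, since either would confuse the unicode-escape step.

```python
def unescape_js(text: str) -> str:
    """Turn %uHHHH and %HH escapes into characters via unicode-escape."""
    # Rewrite the escapes as Python string escapes: \uHHHH and \xHH.
    escaped = text.replace('%u', r'\u').replace('%', r'\x')
    # unicode-escape decodes bytes, so go through ASCII bytes first.
    return escaped.encode('ascii').decode('unicode-escape')

decoded = unescape_js('hello %e9 %u616f')
print(decoded)                  # hello é 慯
print(decoded.encode('utf-8'))  # b'hello \xc3\xa9 \xe6\x85\xaf'
```

A stray "%" that is not followed by valid hex digits will raise an error here, so untrusted input would need validation first.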

Answered 2025-04-15 by Python大师


I think you will need to write the decoding function yourself. Here is an example implementation to get you started:

def decode(file):
    while True:
        c = file.read(1)
        if c == "":
            # End of file
            break
        if c != "%":
            # Not an escape sequence
            yield c
            continue
        c = file.read(1)
        if c != "u":
            # One hex-byte
            yield chr(int(c + file.read(1), 16))
            continue
        # Two hex-bytes
        yield chr(int(file.read(2), 16))
        yield chr(int(file.read(2), 16))

Usage:

input = open("/path/to/input-file", "r")
output = open("/path/to/output-file", "wb")
output.writelines(decode(input))
output.close()
input.close()
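The generator above depends on Python 2 string semantics, where chr() returns a byte. A possible Python 3 adaptation is sketched below under a few assumptions: the input is opened in binary mode, "%uHHHH" maps to two bytes in the order written, and truncated or malformed escapes are not handled carefully. The name `decode_stream` is made up for this example, and io.BytesIO stands in for real files.

```python
import io

def decode_stream(stream):
    """Yield raw bytes decoded from a binary stream of %HH / %uHHHH escapes."""
    while True:
        c = stream.read(1)
        if c == b"":
            break  # end of input
        if c != b"%":
            yield c  # ordinary byte, pass through
            continue
        c = stream.read(1)
        if c != b"u":
            # %HH: one byte from two hex digits
            yield bytes.fromhex((c + stream.read(1)).decode("ascii"))
            continue
        # %uHHHH: two bytes from four hex digits
        yield bytes.fromhex(stream.read(4).decode("ascii"))

# In-memory stand-ins for open(path, "rb") and open(path, "wb").
src = io.BytesIO(b"hello%20%u616f%00")
out = io.BytesIO()
out.writelines(decode_stream(src))
print(out.getvalue())  # b'hello ao\x00'
```

With real files, `with open(...) as f:` blocks would take care of closing both handles.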

Answered 2025-04-15 by Python大师
