Python: replacing specific Unicode entities with entries from a dictionary

2 votes

1 answer

973 views

Asked 2025-04-17 05:11

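I have read a lot of questions about backslash escaping in Python strings (and about how Python recognizes backslashes under different encodings), as well as material on using backslashes in regular expressions, but I still can't solve my problem. I would greatly appreciate any help (links, code examples, etc.).

I want to use the re module to replace hexadecimal codes in a string with certain elements from a dictionary. The codes have the form '\uhhhh', where hhhh is a hexadecimal number.

I select the strings from an sqlite3 database; by default they are read as unicode, not as "raw" unicode strings.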

import re
pattern_xml = re.compile(r"""
(.*?)                       
([\\]u[0-9a-fA-F]{4})
(.*?)                           
""", re.VERBOSE | re.IGNORECASE | re.DOTALL)
uni_code=['201C','201D']
decoded=['"','"']
def repl_xml(m):
    item=m.group(2)
    try: decodeditem=decoded[uni_code.index(item.lstrip('\u').upper())]
    except: decodeditem=item
    return m.group(1) + "".join(decodeditem) + m.group(3)

#input        
text = u'Try \u201cquotated text should be here\u201d try'
#text after replacement
decoded_text=pattern_xml.subn(repl_xml,text)[0]
#desired outcome
desired_text=u'Try "quotated text should be here" try'

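So I want decoded_text to equal desired_text.

I have not managed to replace the single backslash with a double backslash, or to force Python to treat the text as a raw unicode string (so that backslashes are handled as ordinary characters rather than escape characters). I also tried re.escape(text) and setting re.UNICODE, but neither helped in my case.
I am using Python 2.7.2.

Is there a solution to this problem?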

Edit:
I actually found a possible solution on StandardEncodings and PythonUnicodeIntegration, by applying the following encoding to the input:

text.encode('unicode_escape')

Is there anything else that needs to be done?
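For reference, the full round trip with unicode_escape might look roughly like the sketch below. It is only an illustration under some assumptions: REPLACEMENTS, ESCAPE_RE and replace_escaped are made-up names, the replacement values contain no backslashes, and the code is written so that it runs on Python 2.7 as well as Python 3.

```python
import re

# Hypothetical lookup table: hex code (upper case) -> replacement text.
REPLACEMENTS = {u'201C': u'"', u'201D': u'"'}

# Matches a literal backslash, 'u', and exactly four hex digits.
ESCAPE_RE = re.compile(r'\\u([0-9a-fA-F]{4})')


def replace_escaped(text):
    # Turn non-ASCII characters into literal \uhhhh sequences.
    escaped = text.encode('unicode_escape').decode('ascii')

    def repl(match):
        # Fall back to the untouched escape when the code is not in the table.
        return REPLACEMENTS.get(match.group(1).upper(), match.group(0))

    replaced = ESCAPE_RE.sub(repl, escaped)
    # Turn the remaining escapes back into real characters.
    return replaced.encode('ascii').decode('unicode_escape')


text = u'Try \u201cquotated text should be here\u201d try'
decoded_text = replace_escaped(text)
print(decoded_text == u'Try "quotated text should be here" try')  # True
```

One caveat: if the input already contains real backslashes, unicode_escape doubles them, and this simple pattern can then match in the wrong place, so the sketch is only safe for text without literal backslashes.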

regex, string processing, text replacement, backslash escaping, sqlite3, unicode encoding, python2.7, hex codes

1 Answer

This sample text does not contain any backslashes. \u201c is just one way of writing a single unicode character:

>>> text = u'Try \u201cquotated text should be here\u201d try'
>>> '\\' in text
False
>>> print text
Try “quotated text should be here” try
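To make the distinction concrete, here is a small sketch (not part of the original session) that compares the escape sequence with a string that really does contain a backslash. The variable names are made up for illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
# One character: the parser turns the \u201c escape into U+201C.
as_escape = u'\u201c'

# Six characters: a real backslash followed by 'u201c'.
as_literal = u'\\u201c'

print(len(as_escape))        # 1
print(len(as_literal))       # 6
print(u'\\' in as_escape)    # False
print(u'\\' in as_literal)   # True
print(hex(ord(as_escape)))   # 0x201c
```

A pattern such as [\\]u[0-9a-fA-F]{4} can only ever match the second kind of string, which is why it finds nothing in text.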

There is really no need for a regular expression here. Just translate the target unicode characters into whatever you want:

>>> table = {0x201c: u'"', 0x201d: u'"'}
>>> text.translate(table)
u'Try "quotated text should be here" try'
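If the replacements already live in parallel lists like uni_code and decoded from the question, the table could be built from them instead of being typed by hand. A minimal sketch, assuming every code is a hex string; build_table is a hypothetical helper name, and the code runs on both Python 2.7 and Python 3.

```python
def build_table(hex_codes, replacements):
    """Map each code point (given as a hex string) to its replacement."""
    return {int(code, 16): repl for code, repl in zip(hex_codes, replacements)}


uni_code = ['201C', '201D']
decoded = [u'"', u'"']
table = build_table(uni_code, decoded)

text = u'Try \u201cquotated text should be here\u201d try'
result = text.translate(table)
print(result)  # Try "quotated text should be here" try
```

Characters that are not keys in the table pass through unchanged, and mapping a code point to None deletes it.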

Answered 2025-04-17 by Python大师
