Handling Unicode characters with Python regular expressions

2 votes

4 answers

799 views

Asked 2025-04-16 12:25

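I'm writing a simple application that replaces certain words with other words. However, I've run into problems with words that contain an apostrophe, such as aren't, ain't, and isn't.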

I have a text file with the following content:

aren’t=ain’t
hello=hey

I parse this text file and build a dictionary from it:

u'aren\u2019t' = u'ain\u2019t'
u'hello' = u'hey'

Then I try to replace every matching word in a given text:

text = u"aren't"

def replace_all(text, dict):
    for i, k in dict.iteritems():
        #replace all whole words of I with K in lower cased text, regex = \bSTRING\b
        text = re.sub(r"\b" + i + r"\b", k , text.lower())
    return text

The problem is that re.sub() does not match u'aren\u2019t' against u"aren't".

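What can I do so that my replace_all() function matches both "hello" and "aren't" and replaces them with the corresponding text? Is there something I can do in Python so that my dictionary doesn't contain Unicode? Or can I convert my text to use the Unicode characters, or can I modify the regular expression so that it matches the Unicode characters as well as all the other text?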

Tags: regex, text-processing, unicode, dictionary, text-parsing, encoding, character-replacement, pattern-matching

4 Answers

u"aren\u2019t" == u"aren't"

False

u"aren\u2019t" == u"aren’t"

True
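The first comparison is False because the dictionary key uses U+2019 (right single quotation mark) while the text uses U+0027 (the ASCII apostrophe), and these are different code points. Here is a minimal sketch, assuming it is acceptable to fold the typographic quote into the ASCII one before comparing. `normalize_apostrophes` is a hypothetical helper name, not something from the original code:

```python
# -*- coding: utf-8 -*-

def normalize_apostrophes(s):
    """Replace the typographic apostrophe U+2019 with the ASCII apostrophe U+0027."""
    return s.replace(u"\u2019", u"'")

curly = u"aren\u2019t"     # as read from the replacements file
straight = u"aren't"       # as typed in the source code

print(curly == straight)                         # False
print(normalize_apostrophes(curly) == straight)  # True
```

Applying the same helper to both the dictionary keys and the input text would make the two spellings compare equal.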

Answered 2025-04-16 by Python大师


Try saving your file with UTF-8 encoding.
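If the file is saved as UTF-8, it also has to be decoded as UTF-8 when it is read. The following is only a sketch of how the parsing step might look; `load_replacements` is a hypothetical name, and the temporary file stands in for the real replacements file from the question:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

def load_replacements(path):
    """Parse lines of the form old=new from a UTF-8 file into a dict."""
    replacements = {}
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and lines without a separator.
            if not line or u"=" not in line:
                continue
            old, new = line.split(u"=", 1)
            replacements[old.strip()] = new.strip()
    return replacements

# Demo: write a sample file like the one in the question, then load it.
tmp = tempfile.NamedTemporaryFile(suffix=".txt", delete=False)
tmp.close()
with io.open(tmp.name, "w", encoding="utf-8") as f:
    f.write(u"aren\u2019t=ain\u2019t\nhello=hey\n")

replacements = load_replacements(tmp.name)
os.remove(tmp.name)

print(replacements == {u"aren\u2019t": u"ain\u2019t", u"hello": u"hey"})  # True
```

`io.open` accepts the `encoding` argument on both Python 2 and Python 3, so the keys and values come back as Unicode strings either way.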

Answered 2025-04-16 by Python大师


I think your problem is that you have:

text = u"aren't"

instead of:

text = u"aren’t"

(Notice that the quotation marks are different?)

Here is your code, modified so that it works:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

d = {
    u'aren’t': u'ain’t',
    u'hello': u'hey'
    }
#text = u"aren't"
text = u"aren’t"


def replace_all(text, d):
    # Lowercase once, then replace each whole-word key with its value.
    text = text.lower()
    for old, new in d.items():
        # re.escape keeps regex metacharacters in a key literal; regex = \bSTRING\b
        text = re.sub(r"\b" + re.escape(old) + r"\b", new, text, flags=re.UNICODE)
    return text

if __name__ == '__main__':
    newtext = replace_all(text, d)
    print(newtext)

Output:

ain’t
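The code above only works when the input text already uses the typographic apostrophe. If the text may contain either form, one option is to let the regular expression accept both. This is a rough sketch rather than a definitive fix: `key_to_pattern` is a hypothetical helper, and it uses `re.IGNORECASE` instead of lowercasing the whole text, which differs from the original code:

```python
# -*- coding: utf-8 -*-
import re

# Character class matching either the ASCII or the typographic apostrophe.
APOSTROPHES = u"['\u2019]"

def key_to_pattern(key):
    """Build a whole-word pattern in which any apostrophe matches both forms."""
    parts = re.split(APOSTROPHES, key)
    body = APOSTROPHES.join(re.escape(p) for p in parts)
    return re.compile(r"\b" + body + r"\b", re.UNICODE | re.IGNORECASE)

def replace_all(text, d):
    for old, new in d.items():
        # A function replacement inserts `new` literally (no backslash processing).
        text = key_to_pattern(old).sub(lambda m, new=new: new, text)
    return text

d = {u"aren\u2019t": u"ain\u2019t", u"hello": u"hey"}

result = replace_all(u"Hello, they aren't here", d)
print(result == u"hey, they ain\u2019t here")  # True
```

With this approach the dictionary can stay exactly as it was parsed from the file, and the text does not need to be converted beforehand.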

Answered 2025-04-16 by Python大师

