How do I replace Unicode values using re in Python?

0 votes

4 answers

1554 views

数据工程师

Asked 2025-04-16 20:51

How can I replace Unicode values using re in Python?

I am looking for something along these lines:

line.replace('Ã','')
line.replace('¢','')
line.replace('Ã¢','')

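Or is there a way to replace all non-ASCII characters in a file? I converted a PDF file to ASCII, but some non-ASCII characters remain, such as the bullet points from the PDF.

Please help.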

text-processing unicode file-conversion ascii regex

4 Answers

1

Why replace these characters at all, when you can re-encode the text instead:

title.decode('latin-1').encode('utf-8')

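Or, if you would rather have undecodable bytes substituted than raise an error (pass errors='ignore' to drop them instead):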

unicode(title, errors='replace')
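The two one-liners above are Python 2. As a rough Python 3 sketch of the same idea, the snippet below decodes a byte string three ways so the effect of each error handler is visible. The sample bytes and variable names are made up for illustration; they are not from the question.

```python
# Python 3 sketch; the sample bytes are an assumption chosen for illustration.
raw = b"caf\xe9"  # 'café' encoded as Latin-1, which is not valid UTF-8

as_latin1 = raw.decode("latin-1")                 # every byte maps to a code point
replaced = raw.decode("utf-8", errors="replace")  # the bad byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")    # the bad byte is dropped

# Re-encode the correctly decoded text as UTF-8, like the first one-liner.
utf8_bytes = as_latin1.encode("utf-8")

print(repr(as_latin1), repr(replaced), repr(ignored), utf8_bytes)
```

If the source encoding is known, decoding with it and re-encoding keeps the characters; the error handlers only help when some bytes cannot be decoded.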

Answered 2025-04-16 by Python大师

1

You need to encode your Unicode string to ASCII and ignore any errors that come up along the way. Here is how:

>>> u'uéa&à'.encode('ascii', 'ignore')
'ua&'
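The session above is Python 2, where encode returns a str. Under Python 3 it returns bytes, so a decode step is needed to get text back. A minimal sketch, with strip_non_ascii as a hypothetical helper name:

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    # encode() yields bytes in Python 3; decode() turns them back into str.
    return text.encode("ascii", "ignore").decode("ascii")


print(strip_non_ascii("uéa&à"))  # ua&
```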

Answered 2025-04-16 by Python大师


1

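Edited based on feedback in the comments.

Another solution is to check the numeric value of each character and keep only those below 128, since ASCII covers the range 0 to 127. Like so: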

# coding=utf-8

def removeUnicode():
    text = "hejsanäöåbadasd wodqpwdk"
    asciiText = ""
    for char in text:
        if(ord(char) < 128):
            asciiText = asciiText + char

    return asciiText

import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())
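Since the question asks about re specifically, the same "keep only code points below 128" filter can also be written as a regular expression. This is a Python 3 sketch, not part of the original answer; NON_ASCII and remove_non_ascii are hypothetical names.

```python
import re

# Matches one or more consecutive characters outside the ASCII range 0-127.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def remove_non_ascii(text: str, replacement: str = "") -> str:
    """Replace each run of non-ASCII characters with `replacement`."""
    return NON_ASCII.sub(replacement, text)


print(remove_non_ascii("hejsanäöåbadasd wodqpwdk"))  # hejsanbadasd wodqpwdk
```

Passing a replacement such as " " instead of the empty string keeps words from running together where a bullet or other symbol was removed.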

Here is a modified version of jd's answer, with a benchmark attached:

# coding=utf-8

def removeUnicode():
    text = u"hejsanäöåbadasd wodqpwdk"
    if(isinstance(text, str)):
        return text.decode('utf-8').encode("ascii", "ignore")
    else:
        return text.encode("ascii", "ignore")        

import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())

Output of the first solution with a str string as input:

computer:~ Ancide$ python test1.py
Time taken: 5.88719677925

Output of the first solution with a unicode string as input:

computer:~ Ancide$ python test1.py
Time taken: 7.21077990532

Output of the second solution with a str string as input:

computer:~ Ancide$ python test1.py
Time taken: 2.67580914497

Output of the second solution with a unicode string as input:

computer:~ Ancide$ python test1.py
Time taken: 1.740680933

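Conclusion

Encoding is the faster solution, by a factor of roughly 2 to 4 in the runs above, and it takes less code; so it is the better solution.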
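The timings above come from Python 2. To repeat the comparison under Python 3, something like the sketch below could be used. The function names are hypothetical, the loop count is an arbitrary choice (the original used timeit's default of one million calls), and no timings are claimed here since they depend on the machine and interpreter.

```python
import timeit

TEXT = "hejsanäöåbadasd wodqpwdk"


def filter_by_ord(text: str = TEXT) -> str:
    """First solution: keep characters whose code point is below 128."""
    return "".join(ch for ch in text if ord(ch) < 128)


def filter_by_encode(text: str = TEXT) -> str:
    """Second solution: encode to ASCII, ignoring unencodable characters."""
    return text.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    for func in (filter_by_ord, filter_by_encode):
        seconds = timeit.timeit(func, number=100_000)
        print(f"{func.__name__}: {seconds:.3f} s")
```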

Answered 2025-04-16 by Python大师
