Python: Solving Unicode Hell with Unidecode

12 votes

2 answers

7245 views

Asked 2025-04-17 23:10

I have been looking for a way to convert text to ASCII, for example turning ā into a, ñ into n, and so on.

unidecode does this very well:

# -*- coding: utf-8 -*-
from unidecode import unidecode
print(unidecode(u"ā, ī, ū, ś, ñ"))
print(unidecode(u"Estado de São Paulo"))

The output is:

a, i, u, s, n
Estado de Sao Paulo

However, I cannot get the same result when the data comes from an input file.

The contents of test.txt are:

ā, ī, ū, ś, ñ
Estado de São Paulo

# -*- coding: utf-8 -*-
from unidecode import unidecode
with open("test.txt", 'r') as inf:
    for line in inf:
        print unidecode(line.strip())

The output is:

A, A<<, A<<, A, A+-
Estado de SAPSo Paulo

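It also prints:

RuntimeWarning: Argument is not an unicode object.
Passing an encoded string will likely have unexpected results.

Question: how can I read these lines as unicode so that I can pass them to unidecode?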

Tags: text processing, unicode, string manipulation, encoding conversion, data reading, ascii, runtime warning, unidecode

2 Answers

import codecs
with codecs.open('test.txt', encoding='whicheveronethefilewasencodedwith') as f:
    ...

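Among other things, the codecs module provides a function that opens a file and handles Unicode encoding and decoding automatically, so every line you read is already a unicode string.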
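As a rough illustration, here is a minimal sketch that assumes the file is UTF-8 encoded (the question does not say so explicitly). The `to_ascii` and `transliterate_file` helpers and the temporary file are made up for this example. The helper uses `unidecode` when it is installed and otherwise falls back to a standard-library approximation that only strips combining accents, which is far less complete than unidecode.

```python
import codecs
import os
import tempfile
import unicodedata

try:
    from unidecode import unidecode  # third-party package, may not be installed
except ImportError:
    unidecode = None


def to_ascii(text):
    """Hypothetical helper: transliterate already-decoded text to ASCII."""
    if unidecode is not None:
        return unidecode(text)
    # Fallback approximation: split accented letters into base letter plus
    # combining mark, then drop every character that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def transliterate_file(path, encoding="utf-8"):
    """Read `path` with codecs.open so each line arrives already decoded."""
    with codecs.open(path, "r", encoding=encoding) as inf:
        return [to_ascii(line.strip()) for line in inf]


# Build a throwaway copy of test.txt so the sketch runs on its own.
sample = u"\u0101, \u012b, \u016b, \u015b, \u00f1\nEstado de S\u00e3o Paulo\n"
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(sample.encode("utf-8"))

result = transliterate_file(sample_path)
os.remove(sample_path)
print(result)
```

The key point is that decoding happens while reading, so the transliteration step never sees raw bytes.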

Answered 2025-04-17 by Python大师


with codecs.open("test.txt", 'r', 'utf-8') as inf:

Note: the above applies to Python 2.x. In Python 3 you do not need codecs, because the built-in open function accepts an encoding parameter.

with open("test.txt", 'r', encoding='utf-8') as inf:
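
To make the Python 3 variant concrete, the sketch below writes a small UTF-8 file and reads it back twice: once with the correct encoding and once as Latin-1, which is one plausible way to end up with stray characters like those in the question. The file contents and the `read_lines` name are assumptions for illustration only.

```python
import os
import tempfile


def read_lines(path, encoding):
    """Hypothetical helper: return the stripped lines of a text file."""
    with open(path, "r", encoding=encoding) as inf:
        return [line.strip() for line in inf]


# Write a throwaway UTF-8 file so the example does not depend on test.txt.
sample = "ā, ñ\nSão Paulo\n"
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as out:
    out.write(sample)

good = read_lines(sample_path, "utf-8")    # decoded with the right codec
bad = read_lines(sample_path, "latin-1")   # each UTF-8 byte becomes one character
os.remove(sample_path)

print(good)
print(bad)
```

With the wrong codec, São comes back as SÃ£o, because the two bytes of ã are decoded as two separate characters. That appears consistent with the SAPSo in the question's output, where the raw bytes were transliterated one by one.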

Answered 2025-04-17 by Python大师
