How do I make this Python 2.6 function work with Unicode?

Question

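I have this function, which I adapted from chapter 1 of the online NLTK book. It has been very useful to me, but even after reading the chapter on Unicode I am still confused.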

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    tokens = nltk.wordpunct_tokenize(rawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab

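The other day I tried it on Also Sprach Zarathustra, and it mangled the words that have an umlaut over the o and u. I'm sure some of you know why that happened, and I'm also sure it is easy to fix. I know it has something to do with calling a function that re-encodes the tokens into unicode strings. If so, it seems to me the problem might not be in that function definition at all, but here, where I prepare to write to the file: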

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = '\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0
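
For what it's worth, the mangled umlauts can be reproduced without NLTK at all. Below is a minimal sketch, assuming the tokenizer splits on a pattern along the lines of `\w+|[^\w\s]+` (the pattern and the sample word are assumptions for illustration, not taken from my book files). When the pattern runs over raw UTF-8 bytes, only ASCII letters count as word characters, so the two bytes of an o-umlaut get split off as if they were punctuation:

```python
import re

# Assumed stand-in for the tokenizer's pattern: runs of word characters,
# or runs of non-word, non-space characters.
PATTERN = r"\w+|[^\w\s]+"

text = u"sch\xf6n"            # sample word with an o-umlaut, as unicode text
raw = text.encode("utf-8")    # the same word as UTF-8 bytes, as read from disk

# Matching on bytes: the two bytes of the umlaut are not word characters,
# so the word falls apart into three pieces.
byte_tokens = re.findall(PATTERN.encode("ascii"), raw)

# Matching on unicode text with re.UNICODE keeps the word whole.
text_tokens = re.findall(PATTERN, text, re.UNICODE)

print(len(byte_tokens))
print(len(text_tokens))
```

So the damage seems to happen at tokenization time, before anything is written to a file.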

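I've heard that what I need to do involves decoding the string into unicode after reading it from the file. I tried amending the function like so: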

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    unirawness = rawness.decode('utf-8')
    tokens = nltk.wordpunct_tokenize(unirawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab

But that produced the following error when I used it on Hungarian. When I used it on German, there were no errors.

>>> import bookroutines
>>> elles1 = bookroutines.openbookreturnvocab("lk1-les1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 9, in openbookreturnvocab
    nltktext = nltk.Text(tokens)
  File "/usr/lib/pymodules/python2.6/nltk/text.py", line 285, in __init__
    self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)
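
The traceback points at `map(str, tokens[:8])` inside `nltk/text.py`. In Python 2, calling `str()` on a unicode object implicitly encodes it with the default ASCII codec, which would explain why a book whose first eight tokens happen to be plain ASCII goes through without complaint. Here is a small sketch of that implicit step written out explicitly (the sample word is a hypothetical token, not one from my file):

```python
word = u"h\xe1z"   # hypothetical Hungarian token containing u'\xe1'

# Explicit equivalent of what Python 2 does implicitly in str(word):
# encode with the ASCII codec, which cannot represent u'\xe1'.
try:
    word.encode("ascii")
    failed = False
except UnicodeEncodeError as err:
    failed = True
    print(err)

# Encoding explicitly with UTF-8 succeeds and yields a byte string.
encoded = word.encode("utf-8")
```

Since the function only lowercases and sorts the tokens, one possible way around the error would be to iterate over `tokens` directly instead of wrapping them in `nltk.Text`.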

I fixed the function that stores the data like so:

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = u'\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0

However, I got this error when I tried to store the German:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 23, in jotindex
    filemydata.write(jottedf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 414: ordinal not in range(128)
>>>

...and this is what you get when you try to write the u'\n'.join'ed data:

>>> jottedf = u'/n'.join(elles1)
>>> filemydata.write(jottedf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 504: ordinal not in range(128)
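
Both write errors look like the same thing: a file opened with plain `open()` in Python 2 expects byte strings, so the unicode result of the join is implicitly encoded as ASCII. A possible direction, sketched under the assumption that the files are UTF-8, is to let `io.open` (available since Python 2.6) do the encoding and decoding. The `encoding` parameter, the `readwords` helper and the sample words are additions for illustration only:

```python
import io
import os
import tempfile

def jotindex(jotted, filename, readmethod="w", encoding="utf-8"):
    """Write one unicode string per line; io.open encodes on the way out."""
    with io.open(filename, readmethod, encoding=encoding) as filemydata:
        filemydata.write(u"\n".join(jotted))

def readwords(filename, encoding="utf-8"):
    """Hypothetical helper: read the file back as a list of unicode strings."""
    with io.open(filename, "r", encoding=encoding) as source:
        return source.read().split(u"\n")

# Round trip with two sample words that contain non-ASCII characters.
words = [u"h\xe1z", u"sch\xf6n"]
path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
jotindex(words, path)
roundtrip = readwords(path)
```

The same `io.open(book, encoding='utf-8')` call could presumably replace the separate `open()` and `.decode('utf-8')` steps in `openbookreturnvocab`.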

