Python: open and read a file containing German umlauts as Unicode

Asked 2025-04-17 20:33

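I wrote a program that reads words from a text file, stores them in an sqlite database, and also handles them as strings. However, some of the words I need to process contain German umlauts and the sharp s: ä, ö, ü, ß.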

I tried both # -*- coding: iso-8859-15 -*- and # -*- coding: utf-8 -*- as the source encoding declaration, but it made no difference (!).

Here is a prepared example:

    # -*- coding: iso-8859-15 -*-
    import sqlite3

    dbname = 'sampledb.db'
    filename = 'text.txt'

    con = sqlite3.connect(dbname)
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,name)''')

    # Reading from the file (commented out for now):
    #f=open(filename)
    #text = f.readlines()
    #f.close()

    text = u'süß'

    print (text)
    cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,))

    con.commit()

    sentence = "The name is: %s" %(text,)

    print (sentence)
    con.close()

The code above works fine. But I need to read 'text' from a file that contains the word 'süß'. When I uncomment the three lines (f=open(filename) ...) and comment out text = u'süß', I get this error:

    sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.

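I tried reading the file with the codecs module, both as utf-8 and as iso-8859-15. But I could not decode the result into the string 'süß' that I need to complete my sentence at the end of the code.

I also tried decoding to utf-8 before inserting into the database. That works, but then I cannot use the value as a string.

Is there a way to import süß from a file so that I can both insert it into sqlite and use it as a string?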

More details:

Here are some more details to clarify. I had already used codecs.open. The text file containing the word süß is saved as utf-8. With f=codecs.open(filename, 'r', 'utf-8') and text=f.read(), I read the file and get the unicode value u'\ufeffs\xfc\xdf'. Inserting this unicode value into sqlite3 goes smoothly: cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,)).

The problem is this: sentence = "The name is: %s" %(text,) gives u'The name is: \ufeffs\xfc\xdf', and I also need print(text) to output süß, but print(text) fails with UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>.

Thanks.


2 Answers

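In Python 2, when you open and read a file the ordinary way, you get 8-bit strings, not Unicode strings. To get Unicode strings in Python 2, open the file with codecs.open:

    f=codecs.open(filename, 'r', 'utf-8')

Hopefully you have moved to Python 3, though, where the encoding is a parameter of the regular open call. There you always get Unicode strings unless you open the file in binary mode with the 'b' flag, in which case you get bytes. If you don't specify an encoding, a platform-dependent default is used.

    f=open(filename, 'r', encoding='utf-8')

Of course, depending on how the file was written, you may need the 'iso-8859-15' encoding instead.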
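To make the difference between the two encodings concrete, here is a minimal Python 3 sketch. The temporary files and variable names are made up for illustration; the point is only that the bytes on disk differ while the decoded text is the same, provided the encoding passed to open matches the one the file was written with.

```python
import os
import tempfile

# Hypothetical sample data: the same word stored in two different encodings.
word = "süß"
tmpdir = tempfile.mkdtemp()
utf8_path = os.path.join(tmpdir, "utf8.txt")
latin9_path = os.path.join(tmpdir, "latin9.txt")

with open(utf8_path, "w", encoding="utf-8") as f:
    f.write(word)
with open(latin9_path, "w", encoding="iso-8859-15") as f:
    f.write(word)

# Text mode with the matching encoding returns str (Unicode) objects.
with open(utf8_path, "r", encoding="utf-8") as f:
    from_utf8 = f.read()
with open(latin9_path, "r", encoding="iso-8859-15") as f:
    from_latin9 = f.read()

# Binary mode returns the raw bytes, which differ between the two files.
with open(utf8_path, "rb") as f:
    raw_utf8 = f.read()
with open(latin9_path, "rb") as f:
    raw_latin9 = f.read()

# Decoding with the wrong codec does not raise here; it yields garbled text.
with open(utf8_path, "r", encoding="iso-8859-15") as f:
    wrong = f.read()

print(ascii(from_utf8), raw_utf8, raw_latin9)
```

Passing the encoding explicitly avoids depending on the platform default, which is usually where "works on my machine" surprises with umlauts come from.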

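Edit: There is one big difference between your test code and the commented-out code: the data read from the file (with readlines) is a list, while the test code uses a single string. Your problem may have nothing to do with Unicode at all. Try making this substitution in your test code and see whether it produces the same error:

    text = [u'süß']

Unfortunately I don't have enough experience with SQL in Python to help any further.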
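The list-versus-string point can be checked in isolation. The following is a small Python 3 sketch using an in-memory database and made-up sample lines. It assumes that a list is not a type sqlite3 can bind, so the exact exception class and message may vary between Python versions; the sketch therefore only catches the common base class sqlite3.Error.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in-memory database, for illustration only
cur = con.cursor()
cur.execute("create table table1 (id INTEGER PRIMARY KEY, name)")

# readlines() returns a list of lines, each still carrying its newline.
lines = ["süß\n", "Straße\n"]

# Binding the whole list to a single placeholder is rejected by sqlite3.
try:
    cur.execute("insert into table1 (id, name) VALUES (NULL, ?)", (lines,))
    list_error = None
except sqlite3.Error as exc:
    list_error = exc

# Binding one string per row works.
words = [line.strip() for line in lines]
cur.executemany(
    "insert into table1 (id, name) VALUES (NULL, ?)",
    [(w,) for w in words],
)
con.commit()

stored = [row[0] for row in cur.execute("select name from table1 order by id")]
con.close()

print(type(list_error).__name__, len(stored))
```

If this is what happens in the original program, iterating over the lines (or using read() instead of readlines()) should remove the binding error without any change to the encoding handling.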

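Also, when you print a list instead of a single string, the Unicode characters are replaced by their escape sequences. If you want to see what the strings really look like, print them one at a time. In case you are interested, this is the difference between __str__ and __repr__.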
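The same distinction still exists in Python 3, although repr() there no longer escapes printable characters such as ü. A short sketch, using a made-up value with a leading BOM so the effect remains visible:

```python
word = "\ufeffsüß"  # hypothetical value with a leading BOM

as_str = str(word)          # what print(word) would write: the characters themselves
as_repr = repr(word)        # quoted form; non-printable characters are escaped
list_view = str([word])     # containers use repr() for their items
ascii_view = ascii("süß")   # escapes every non-ASCII character

print(ascii_view)
```

So seeing \ufeff or \xfc inside a printed list does not mean the string is damaged; it is only the repr form of the same characters.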

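Edit 2: The character u'\ufeff' is called a byte order mark (BOM). Some editors insert it to indicate that the file really is UTF-8. You should remove it before using the string. It should only appear at the very beginning of the file. See the question "Reading Unicode file data with BOM chars in Python".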
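Two ways of dropping the BOM are sketched below in Python 3: decoding with the utf-8-sig codec, which skips a leading BOM if one is present, or stripping the character from the decoded string. The temporary file is a stand-in for the asker's text file.

```python
import codecs
import os
import tempfile

# Hypothetical file: UTF-8 text preceded by a BOM, as some editors write it.
path = os.path.join(tempfile.mkdtemp(), "bom.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "süß".encode("utf-8"))

# Plain utf-8 keeps the BOM as the first character of the text.
with open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# utf-8-sig decodes UTF-8 and drops a leading BOM if there is one.
with open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

# Alternative: strip the BOM from an already decoded string.
stripped = with_bom.lstrip("\ufeff")

print(ascii(with_bom), ascii(without_bom))
```

In Python 2 the same codec name should work with codecs.open(filename, 'r', 'utf-8-sig').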

Answered 2025-04-17 by Python大师

I solved the problem. Thanks for all the help.

Here is the solution:

    # -*- coding: iso-8859-1 -*-

    import sys
    import codecs
    import sqlite3

    f = codecs.open("suess_sweet.txt", "r", "utf-8")    # suess_sweet.txt file contains two
    text_in_unicode = f.read()                          # comma-separated words: süß, sweet
    f.close()

    stdout_encoding = sys.stdout.encoding or sys.getfilesystemencoding()

    con = sqlite3.connect('dict1.db')
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,German,English)''')

    [ger,eng] = text_in_unicode.split(',')

    cur.execute('''insert into table1 (id,German,English) VALUES (NULL,?,?)''',(ger,eng))

    con.commit()

    sentence = "The German word is: %s" %(ger,)

    print sentence.encode(stdout_encoding)

    con.close()
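For comparison, here is a rough Python 3 sketch of the same flow. It is not the code above: it swaps in the utf-8-sig codec to drop the BOM, strips whitespace around the words, and uses a temporary file and an in-memory database in place of suess_sweet.txt and dict1.db.

```python
import os
import sqlite3
import tempfile

# Hypothetical input file with a BOM and two comma-separated words.
path = os.path.join(tempfile.mkdtemp(), "suess_sweet.txt")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("süß, sweet \n")

# utf-8-sig decodes UTF-8 and drops a leading BOM if present.
with open(path, "r", encoding="utf-8-sig") as f:
    text = f.read()

# Split into the two words and remove surrounding whitespace and newlines.
ger, eng = [part.strip() for part in text.split(",")]

con = sqlite3.connect(":memory:")  # in-memory database for the sketch
cur = con.cursor()
cur.execute(
    "create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY, German, English)"
)
cur.execute(
    "insert into table1 (id, German, English) VALUES (NULL, ?, ?)", (ger, eng)
)
con.commit()

row = cur.execute("select German, English from table1").fetchone()
con.close()

# The same value is usable as an ordinary string.
sentence = "The German word is: %s" % (ger,)
print(ascii(sentence))
```

In Python 3, print(sentence) takes care of encoding for the console, so the explicit encode step is normally not needed.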

I got some help from this page (it is in German).

The output is:

    The German word is: ?süß

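There is still one small problem, though: the question mark ('?'). I first assumed that the unicode prefix u' was being turned into '?' by the encoding. The value of sentence is:

    >>> sentence
    u'The German word is: \ufeffs\xfc\xdf '

and the encoded sentence is:

    >>> sentence.encode(stdout_encoding)
    'The German word is: ?s\xfc\xdf '

So it was not what I had assumed: the '?' appears exactly where \ufeff was.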

A simple workaround I came up with to get rid of the question mark is the replace function:

    sentence = "The German word is: %s" %(ger,)
    to_print = sentence.encode(stdout_encoding)
    to_print = to_print.replace('?','')

    >>> print(to_print)
    The German word is: süß
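One caveat with this workaround is that replace('?','') removes every question mark, including ones that belong to the text. The Python 3 sketch below compares it with removing the BOM before encoding. The cp1252 encoding and the sample sentence are assumptions for illustration, and errors="replace" is used to reproduce the '?' substitution.

```python
# Hypothetical sentence: a leading BOM in the word plus a legitimate '?'.
sentence = "The German word is: \ufeffsüß. Sweet?"

# The BOM has no cp1252 equivalent, so errors="replace" turns it into b"?".
encoded = sentence.encode("cp1252", errors="replace")

# Blanket replacement also deletes the question mark that belongs to the text.
blanket = encoded.replace(b"?", b"")

# Removing the BOM from the Unicode string first keeps everything else intact.
clean = sentence.replace("\ufeff", "").encode("cp1252")

print(encoded, blanket, clean)
```

Removing the BOM once, right after reading the file, also keeps it out of the database.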

Thanks, Stack Overflow :)

Answered 2025-04-17 by Python大师
