Python: Converting an RTF file to Unicode?

2 votes

1 answer

4401 views

Asked 2025-04-15 18:50

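I'm trying to convert each line of an RTF file into a series of unicode strings and then run a regex match against those lines. (I need them as unicode so that I can write them to another file.)

However, my regex match isn't succeeding, and I suspect that's because the lines aren't being converted to unicode correctly.

Here is my code: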

import re

usefulLines = []
textData = {}

# the regex pattern for an entry in the db (e.g. SUF 76,22): it's sufficient for us to match on three upper-case characters plus a space
entryPattern = r'^([A-Z]{3})[\s].*$'

f = open('textbase_1a.rtf', 'Ur')
fileLines = f.readlines()

# get the matching line numbers, and store in usefulLines
for i, line in enumerate(fileLines):
    #line = line.decode('utf-16be') # this causes an error: I don't really know what file encoding the RTF file is in...
    line = line.decode('mac_roman')
    print line
    if re.match(entryPattern, line):
        # now retrieve the following lines, all the way up until we get a blank line
        print "match: " + str(i)
        usefulLines.append(i)

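Now, this code prints out all the lines, but it doesn't print any matches, even though it should find some. Also, the lines inexplicably have '/par' at the start. When I try to print them to the output file, they look odd there too.

Part of the problem is that I don't know which encoding to specify. How can I find out what the encoding is?

If I use entryPattern = '^.*$', then I do get matches.

Can anyone help?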

Tags: regex, text processing, unicode, character encoding, data conversion, encoding problems, file output, rtf

1 Answer

You aren't decoding the RTF file at all. An RTF file is not a simple text file. For example, a file containing "äöü" has the following content:

{\rtf1\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fswiss\fcharset0 Arial;}}

{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20\'e4\'f6\'fc\par

}

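That is what you see when you open it in a text editor. So the characters "äöü" are encoded as windows-1252, as declared at the start of the file by \ansicpg1252 (äöü = 0xE4 0xF6 0xFC).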
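To make the encoding point concrete, here is a minimal Python 3 sketch (not the asker's Python 2) that reads the code page from the \ansicpg keyword and decodes the \'xx escapes with it. The helper names `declared_codec` and `decode_hex_escapes` are made up for illustration, and the sketch assumes a single-byte code page that Python knows under the name "cp" plus the number, which holds for 1252 but not for every code page an RTF header can declare.

```python
import re

# Shortened version of the sample file: header plus the escaped "äöü".
sample = r"{\rtf1\ansi\ansicpg1252\deff0\deflang1031 \'e4\'f6\'fc\par}"

def declared_codec(rtf_source, default="cp1252"):
    # Look for the \ansicpgN keyword and turn N into a Python codec name.
    # Assumption: Python has a codec called "cp" + N for this code page.
    match = re.search(r"\\ansicpg(\d+)", rtf_source)
    return "cp" + match.group(1) if match else default

def decode_hex_escapes(rtf_source, codec):
    # Replace every \'xx escape with the character that byte stands for.
    # Assumption: a single-byte code page, so each escape is one character.
    return re.sub(
        r"\\'([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]).decode(codec),
        rtf_source,
    )

codec = declared_codec(sample)
decoded = decode_hex_escapes(sample, codec)
print(codec)
print(decoded)
```

Note that this only resolves the escaped bytes. The control words such as \par are still in the string, which is why it is not yet usable as plain text.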

To read an RTF file, you first need something that converts RTF to text (this has already been asked here).

Answered 2025-04-15 by Python大师
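As a rough illustration of what such a conversion step involves, here is a deliberately small Python 3 sketch that strips the markup and then applies the entry pattern from the question. The function `rtf_to_text`, the `SKIPPED_DESTINATIONS` set and the sample string are all invented for this example. It ignores \uN Unicode escapes, multi-byte code pages and most of the RTF specification, so for real files a maintained converter is the safer choice.

```python
import re

# Groups whose content is not document text (a small, incomplete list).
SKIPPED_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

TOKEN = re.compile(
    r"\\([a-z]+)(-?\d+)? ?"    # control word, optional number, optional space
    r"|\\'([0-9a-fA-F]{2})"    # hex-escaped byte such as \'e4
    r"|\\([^a-z])"             # control symbol such as \* \{ \} \\
    r"|([{}])"                 # group delimiters
    r"|[\r\n]+"                # raw line breaks carry no meaning in RTF
    r"|(.)"                    # anything else is text
)

def rtf_to_text(rtf_source, codec="cp1252"):
    out = []
    stack = []         # "skipping" state of the enclosing groups
    skipping = False   # True while inside a group we do not want
    for m in TOKEN.finditer(rtf_source):
        word, _param, hexbyte, symbol, brace, char = m.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif symbol == "*":
            skipping = True            # \* marks an optional destination
        elif skipping:
            continue
        elif word in SKIPPED_DESTINATIONS:
            skipping = True
        elif word in ("par", "line"):
            out.append("\n")           # paragraph or line break
        elif word == "tab":
            out.append("\t")
        elif hexbyte:
            out.append(bytes([int(hexbyte, 16)]).decode(codec))
        elif symbol in ("{", "}", "\\"):
            out.append(symbol)         # escaped literal character
        elif char:
            out.append(char)
    return "".join(out)

# Hypothetical RTF content with two database-style entries.
rtf_source = (
    r"{\rtf1\ansi\ansicpg1252 SUF 76,22 some text\par "
    r"not an entry\par ABC 1,2 f\'fcr\par}"
)

entry_pattern = re.compile(r"^([A-Z]{3})\s.*$")
text = rtf_to_text(rtf_source)
matches = [i for i, line in enumerate(text.splitlines())
           if entry_pattern.match(line)]
print(matches)
```

Once the markup is gone, the lines start with the actual entry text instead of control words like \par, so a pattern anchored with ^ has a chance to match.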
