Replacing Unicode characters only once in Python

Posted 2024-06-02 08:23:20


I'm trying to write a small script that replaces a set of characters in a file, like this:

# coding=utf-8

import codecs
import os
import sys

args = sys.argv

if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"

    replacePairs = {
        u"ã": "ă",
        u"Ã": "Ă",
        u"º": "ș",
        u"ª": "Ș",
        u"þ": "ț",
        u"Þ": "Ț",
    }

    if os.path.isfile(subtitleFileName):
        oldSubtitleFile = codecs.open(subtitleFileName, "rb", "ISO-8859-1")

        subtitleContent = oldSubtitleFile.read()
        subtitleContent = codecs.encode(subtitleContent, "utf-8")

        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(codecs.encode(key, "utf-8"), value)

        oldSubtitleFile.close()

        newSubtitleFile = open(newSubtitleFileName, "wb")
        newSubtitleFile.write(subtitleContent)
        newSubtitleFile.close()

        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)

        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"

The first run works fine.

So if I have a file containing Eºti sigur cã vrei sã ºtergi fiºierele?, after running the script on it I get Ești sigur că vrei să ștergi fișierele?, which is what I want. But if I run it several more times, I get:

EÈti sigur cÄ vrei sÄ Ètergi fiÈierele?

EĂÂti sigur cĂÂ vrei sĂÂ ĂÂtergi fiĂÂierele?

EÄÂĂÂti sigur cÄÂĂÂ vrei sÄÂĂÂ ÄÂĂÂtergi fiÄÂĂÂierele?

EĂÂĂÂÄÂĂÂti sigur cĂÂĂÂÄÂĂÂ vrei sĂÂĂÂÄÂĂÂ ĂÂĂÂÄÂĂÂtergi fiĂÂĂÂÄÂĂÂierele?

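I don't understand why. How does it find characters to replace that are no longer in the file? And why does it replace them with yet other characters?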


3 Answers

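It's simple: on the first run you read ISO-8859-1 and write UTF-8. On the second run you do exactly the same thing, even though the input is now UTF-8 rather than ISO-8859-1. On subsequent runs, the search and replace no longer works as intended.

This test simulates your second iteration, interpreting UTF-8 as ISO-8859-1: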

# -*- coding: utf-8 -*-
print "Ești sigur că vrei să ștergi fișierele?".decode("ISO-8859-1")
>> EÈti sigur cÄ vrei sÄ Ètergi fiÈierele?

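The next iteration repeats the same misinterpretation on the re-encoded output, so every run garbles the text further.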
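The repeated runs can be reproduced without touching any file. The following is a minimal sketch in Python 3 syntax (the thread's own code is Python 2); `run_once` and `REPLACE_PAIRS` are names made up for this illustration, and the function only mimics what the question's script does to the file's bytes.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for the question's replacePairs dictionary.
REPLACE_PAIRS = {"ã": "ă", "Ã": "Ă", "º": "ș", "ª": "Ș", "þ": "ț", "Þ": "Ț"}


def run_once(file_bytes):
    """Mimic one run of the script: read as ISO-8859-1, replace, write UTF-8."""
    text = file_bytes.decode("iso-8859-1")  # always assumes ISO-8859-1 input
    for old, new in REPLACE_PAIRS.items():
        text = text.replace(old, new)
    return text.encode("utf-8")  # the output is UTF-8, not ISO-8859-1


original = "Eºti sigur cã vrei sã ºtergi fiºierele?".encode("iso-8859-1")
first = run_once(original)   # correct: input really was ISO-8859-1
second = run_once(first)     # wrong: UTF-8 bytes are decoded as ISO-8859-1
third = run_once(second)     # wrong again, on already garbled text

print(first.decode("utf-8"))
# Every non-ASCII character is re-encoded on each run, so the data keeps growing.
print(len(original), len(first), len(second), len(third))
```

Each extra run turns every non-ASCII character into two characters, which is why the output in the question gets longer every time.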

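Follow @Daniel's advice: decode once, replace Unicode with Unicode, then encode once. I've also been told that it's preferable to use io.open() rather than codecs.open(), because it is compatible with Python 3 and fixes issues with universal newlines.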
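As a rough illustration of "decode once, replace, encode once" with io.open(), here is a sketch written for Python 3. The function name `fix_subtitles`, the `REPLACEMENTS` table and the default source encoding are assumptions made for this example, not something taken from the question.

```python
# -*- coding: utf-8 -*-
import io

# Hypothetical replacement table, same pairs as in the question.
REPLACEMENTS = {"ã": "ă", "Ã": "Ă", "º": "ș", "ª": "Ș", "þ": "ț", "Þ": "Ț"}


def fix_subtitles(src_path, dst_path, src_encoding="iso-8859-1"):
    """Decode once, replace text with text, encode once."""
    # newline="" leaves the file's line endings untouched in both directions.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()  # decoded text, no bytes from here on

    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)

    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)  # encoding happens only here
```

Because the result is UTF-8, a second pass over the output would have to be called with src_encoding="utf-8"; running it again with the ISO-8859-1 default would reproduce the problem from the question.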
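
Using the "ISO-8859-1" character encoding on "utf-8" content is incorrect: the first time you run the script, it takes a text file (presumably "ISO-8859-1"-encoded) and saves it as "utf-8", replacing some Unicode characters along the way.

The second run then takes the "utf-8" content and tries to interpret it as "ISO-8859-1", which is wrong.

To avoid the confusion, do the replacements separately from changing the character encoding. That way the replacements are idempotent.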
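The idempotency claim can be checked on decoded text alone. This is a small sketch in Python 3, where str.translate() plays the role of Python 2's unicode.translate(); `TABLE` and `replace_chars` are names invented for the example.

```python
# -*- coding: utf-8 -*-
# str.maketrans() turns single-character keys into the code-point keys
# that str.translate() expects.
TABLE = str.maketrans({"ã": "ă", "Ã": "Ă", "º": "ș", "ª": "Ș", "þ": "ț", "Þ": "Ț"})


def replace_chars(text):
    """Apply the replacements to already decoded text; no encoding involved."""
    return text.translate(TABLE)


once = replace_chars("Eºti sigur cã vrei sã ºtergi fiºierele?")
twice = replace_chars(once)

print(once)
print(once == twice)  # a second pass finds nothing left to replace
```

None of the replacement characters is itself a key in the table, so applying the function again leaves the text unchanged.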

To make the replacements, you could use the fileinput module and unicode.translate():

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Replace some characters in 'iso-8859-1'-encoded files."""
from __future__ import print_function  # lets print() take end= on Python 2 too

import fileinput  # read files given on the command-line and/or stdin

replacements = {
    u"ã": u"ă",
    u"Ã": u"Ă",
    u"º": u"ș",
    u"ª": u"Ș",
    u"þ": u"ț",
    u"Þ": u"Ț",
}
# translate() expects code points as keys: key => ord(key)
replacements = {ord(key): value for key, value in replacements.items()}
for line in fileinput.input(openhook=fileinput.hook_encoded("iso-8859-1")):
    # each line keeps its own newline, so don't let print() add a second one
    print(line.translate(replacements), end="")

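To control the encoding of the output, you could set the PYTHONIOENCODING environment variable, for example to utf-8 in bash, when running the script with its stdout redirected to a new file.

Run that way, the script replaces the characters and converts the input from "iso-8859-1" to "utf-8".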

If the input filename.txt is corrupted (no single character encoding decodes it correctly), you could try the ftfy module to fix common encoding errors:

$ ftfy filename.txt >filename.utf8.txt
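For the specific damage described in this thread (UTF-8 bytes read as ISO-8859-1 exactly once, with no replacement having touched the garbled text), one layer can also be undone with the standard library alone. This is only a sketch of that narrow case, not how ftfy works; `undo_one_misdecode` is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
def undo_one_misdecode(garbled):
    """Reverse one 'UTF-8 bytes read as ISO-8859-1' step.

    Returns the input unchanged if it does not look like that kind of damage.
    """
    try:
        # Map the characters back to the bytes they came from, then decode
        # those bytes with the encoding they were really written in.
        return garbled.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled


good = "Ești sigur că vrei să ștergi fișierele?"
garbled = good.encode("utf-8").decode("iso-8859-1")  # simulate the damage
print(undo_one_misdecode(garbled) == good)
```

Text that went through more runs of the question's script has also had characters such as Ã replaced, so this simple round trip would not be enough there.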

Don't work with encoded content. Encode only when writing the new file:

import codecs
import os
import sys

args = sys.argv

if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"

    replacePairs = {
        u"ã": u"ă",
        u"Ã": u"Ă",
        u"º": u"ș",
        u"ª": u"Ș",
        u"þ": u"ț",
        u"Þ": u"Ț",
    }

    if os.path.isfile(subtitleFileName):
        with codecs.open(subtitleFileName, "rb", "ISO-8859-1") as oldSubtitleFile:
            subtitleContent = oldSubtitleFile.read()

        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(key, value)

        with codecs.open(newSubtitleFileName, "wb", "utf-8") as newSubtitleFile:
            newSubtitleFile.write(subtitleContent)

        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)

        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"
