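I am new to Python and to coding in general, so any help is greatly appreciated.
I have more than 3,000 text files in one directory, in a variety of encodings. I need to convert them to a single encoding (e.g. UTF-8) for further NLP work. When I checked the file types from the shell, I found the following encodings: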
Algol 68 source text, ISO-8859 text, with very long lines
Algol 68 source text, Little-endian UTF-16 Unicode text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
ASCII text
ASCII text, with very long lines
data
diff output text, ASCII text
ISO-8859 text, with very long lines
ISO-8859 text, with very long lines, with LF, NEL line terminators
Little-endian UTF-16 Unicode text, with very long lines
Non-ISO extended-ASCII text
Non-ISO extended-ASCII text, with very long lines
Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
UTF-8 Unicode (with BOM) text, with CRLF line terminators
UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators
UTF-8 Unicode text, with very long lines, with CRLF line terminators
How can I convert text files with the encodings listed above into UTF-8 text files?
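I ran into the same problem and solved it in two steps:
First, use the chardet package to identify the encoding of each text file.
Second, if the detected encoding is not UTF-8, rewrite the text as UTF-8 into an output directory.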
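The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the original answer. The function names `guess_encoding` and `convert_to_utf8` are hypothetical, as is the choice to write every file (including ones that are already UTF-8) to the output directory so that it ends up complete. It assumes `chardet` is installed (`pip install chardet`); the small BOM/UTF-8/Latin-1 fallback is only a crude stand-in so the sketch still runs without it.

```python
import codecs
from pathlib import Path

try:
    import chardet  # third-party package: pip install chardet
except ImportError:  # keep the sketch runnable without chardet
    chardet = None


def guess_encoding(raw: bytes) -> str:
    """Step 1: guess the encoding of a file's raw bytes."""
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]
        if guess:
            return guess
    # Crude stand-in, used only when chardet is missing or has no guess.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"  # decodes any byte sequence, possibly incorrectly


def convert_to_utf8(src_dir: str, dst_dir: str) -> dict:
    """Step 2: rewrite every file in src_dir as UTF-8 inside dst_dir.

    Returns a mapping of file name to the encoding that was guessed.
    """
    out_root = Path(dst_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    detected = {}
    for path in sorted(Path(src_dir).iterdir()):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        encoding = guess_encoding(raw)
        try:
            # errors="replace" keeps going if the guess is slightly off.
            text = raw.decode(encoding, errors="replace")
        except LookupError:
            # The guessed name is not a codec Python knows about.
            text = raw.decode("latin-1")
        if text.startswith("\ufeff"):
            text = text[1:]  # drop a leftover byte order mark
        # Writing bytes avoids any newline translation (CRLF stays CRLF).
        (out_root / path.name).write_bytes(text.encode("utf-8"))
        detected[path.name] = encoding
    return detected
```

A call such as `convert_to_utf8("corpus", "corpus_utf8")` (both directory names are placeholders) would leave the originals untouched and return the guessed encoding per file, which is worth reviewing: detection is statistical, so short files and files that the shell reports as `data` or `Non-ISO extended-ASCII` may be guessed wrongly and need a manual check.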
Hope this helps! Thanks.