Answers to: Python: read binary file with non-ASCII name



I am trying to write a script that searches a folder for duplicate files and returns them in a dictionary like {filehash1: [dirfile1, dirfile2], filehash2: [dirfile3]} (dirfile1 and dirfile2 are the same file under a different name/location). The first version of the code:
<pre><code>import glob
import hashlib

def getallfolders(dir):
    print dir+"*\\"
    folders = glob.glob(dir+"*\\")
    return folders

def getallfiles(dir):
    folders = glob.glob(dir+"*.*")
    return folders

def filehash(file):
    BLOCKSIZE = 65536
    hasher = hashlib.sha1()
    with open(file, 'rb') as afile:
        buf = afile.read(BLOCKSIZE)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(BLOCKSIZE)
    return hasher.hexdigest()

def double_files(dir):
    mil = {}
    folders = getallfolders(dir)
    for folder in folders:
        mil.update(double_files(folder))
    files = getallfiles(dir)
    for file in files:
        fhash = filehash(file)
        if fhash in mil.keys():
            mil[fhash] = mil[fhash] + [file]
        else:
            mil[fhash] = [file]
    return mil

print double_files("E:\\not organised\\")
</code></pre>
But when I try to run it, it crashes with an error:
<pre><code>IOError: [Errno 2] No such file or directory: file
</code></pre>
This happens because not all the file names are in English, so I tried to fix it, and now the code looks like this:
<pre><code># -*- coding: utf-8 -*-
import glob
import hashlib
import codecs

def getallfolders(dir):
    print dir+"*\\"
    folders = glob.glob(dir+"*\\")
    return folders

def getallfiles(dir):
    folders = glob.glob(dir+"*.*")
    return folders

def filehash(file):
    BLOCKSIZE = 65536
    hasher = hashlib.sha1()
    file = file.decode("utf8")
    with codecs.open(file, "rb", encoding="utf8") as afile:
        buf = afile.read(BLOCKSIZE)
        while len(buf) > 0:
            buf = buf.encode("ISO-8859-1")
            hasher.update(buf)
            buf = afile.read(BLOCKSIZE)
    return hasher.hexdigest()

def double_files(dir):
    mil = {}
    folders = getallfolders(dir)
    for folder in folders:
        mil.update(double_files(folder))
    files = getallfiles(dir)
    for file in files:
        fhash = filehash(file)
        if fhash in mil.keys():
            mil[fhash] = mil[fhash] + [file]
        else:
            mil[fhash] = [file]
    return mil

print double_files("E:\\not organised\\")
</code></pre>
I added <code># -*- coding: utf-8 -*-</code> and changed <code>with open(file, 'rb')</code> to <code>with open(file, encoding='utf-8')</code>, but now I get an error:
<pre><code>UnicodeDecodeError: 'utf8' codec can't decode byte .. in position ..: invalid start byte
</code></pre>
(The .. means the value is not always the same.) It happens on the line <code>buf = afile.read(BLOCKSIZE)</code>. I know the file has been opened, but as soon as I call the read function it crashes with this error. I don't know how to fix it... please help.


1 answer

Anonymous, 1 day ago

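You seem quite confused about encodings...
<ul>
<li><code># -*- coding: utf-8 -*-</code> only applies to non-ASCII string literals in your source code.</li>
<li><code>file = file.decode("utf8")</code> decodes the string <code>file</code> (which would be better named <code>filename</code>) from UTF-8 to unicode. This only works if the filesystem's encoding (for file and folder names) is UTF-8, or rather as long as every file and folder name can be interpreted as valid UTF-8. And of course it does nothing for the file's contents.</li>
<li><code>codecs.open(file, "rb", encoding="utf8")</code>: this only makes sense if the file's content actually is UTF-8 encoded text. You are obviously reading arbitrary binary data, so it is no surprise that you get spurious decoding errors.</li>
<li><code>buf.encode("ISO-8859-1")</code>: just plain useless, <code>hashlib.sha1()</code> works fine with UTF-8 encoded byte strings.</li>
</ul>
Long story short: none of your "fixes" makes sense.

Back to the root of the problem:
<blockquote> IOError: [Errno 2] No such file or directory: file It caused because not all the files name in english </blockquote>
I really think you are assuming too much here. If you got a non-ASCII (not "non-English") filename from browsing the filesystem, then your filesystem should support that non-ASCII encoding (well, it is Windows, so something special may be going on here, but I can tell you that I have never had such a problem on Linux or MacOS). At worst, if your "non-English" (non-ASCII) filenames are UTF-8, you could try using only <code>file = file.decode('utf-8')</code> and see whether it works better, but that is still <a href="https://en.wikipedia.org/wiki/Programming_by_permutation" rel="nofollow">programming by accident</a>.

Actually, since you did not post a proper traceback (with the full filename), it is hard to tell what exactly goes wrong with the original code, so your best bet is to switch back to the first implementation and read the full traceback carefully. You can then use the interactive Python shell or <a href="https://docs.python.org/2/library/pdb.html" rel="nofollow">the step debugger</a> to inspect the problem further.

Oh, and yes: I assume you are using Python 2.x. Python 3 handles encodings a bit differently.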
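To make the distinction between file names and file contents concrete, here is a minimal sketch, assuming Python 3 (where file names are already text) rather than the Python 2 the question uses. The names <code>file_hash</code>, <code>find_duplicates</code> and <code>demo</code> are illustrative and not from the question, and <code>os.walk</code> is swapped in for the recursive <code>glob</code> calls. The content is read in <code>'rb'</code> mode and handed to the hasher as raw bytes, so no codec ever touches it.

```python
import hashlib
import os
import tempfile


def file_hash(filename, block_size=65536):
    """Return the SHA-1 hex digest of a file's raw bytes."""
    hasher = hashlib.sha1()
    # 'rb' yields bytes; no text decoding is applied to the content.
    with open(filename, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


def find_duplicates(root):
    """Map each content hash to the list of paths that share it."""
    by_hash = {}
    # os.walk recurses into subfolders; a text root yields text names.
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            by_hash.setdefault(file_hash(path), []).append(path)
    return by_hash


def demo():
    """Build a throwaway tree with a non-ASCII file name and scan it."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "sub"))
        payload = b"\xff\xfe\x00binary"  # deliberately not valid UTF-8
        for rel in ("caf\u00e9.bin", os.path.join("sub", "copy.bin")):
            with open(os.path.join(root, rel), "wb") as out:
                out.write(payload)
        with open(os.path.join(root, "other.bin"), "wb") as out:
            out.write(b"different")
        groups = find_duplicates(root)
        # Sizes of the hash groups, smallest first.
        return sorted(len(paths) for paths in groups.values())
```

Calling <code>find_duplicates</code> on a real folder returns the same {hash: [paths]} shape the question asks for; entries whose list holds more than one path are the duplicates. On Python 2, the usual approach is to pass a unicode root (for example <code>u"E:\\not organised\\"</code>) so that <code>os.walk</code> returns unicode names, though that has not been checked against the asker's setup.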
