Extracting a single file from a corrupted GZ (TAR) archive

2 votes

2 answers

3025 views

数据工程师

Asked 2025-04-16 08:00

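This is my first post on Stack Overflow, and I have a question about extracting a single file from a TAR archive compressed with GZ. I'm not very experienced with Python, so I may be going about this the wrong way; any help would be much appreciated.

Scenario:

I receive a corrupted *.tar.gz file. The first file in the archive contains important information for obtaining the system's serial number (SN). The serial number identifies the machine, so that we can notify the administrator that the file is corrupt.

Problem:

With the regular UNIX tar command I can extract the README file from the archive, even though the archive is incomplete and a full extraction reports errors. In Python, however, I cannot extract just that one file: even when I name only that file, it always raises an exception.

Current workaround:

I am using "os.popen" to call the UNIX tar command so that I get only the README file.

Desired solution:

I would like to use Python's tarfile package to extract the single file.

Example errors:

UNIX (works):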

[root@athena tmp]# tar -xvzf bundle.tar.gz README
README

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
[root@athena tmp]# 
[root@athena tmp]# ls
bundle.tar.gz  README

Python:

>>> import tarfile
>>> tar = tarfile.open("bundle.tar.gz")
>>> data = tar.extractfile("README").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib64/python2.4/tarfile.py", line 1364, in extractfile
    tarinfo = self.getmember(member)
  File "/usr/lib64/python2.4/tarfile.py", line 1048, in getmember
    tarinfo = self._getmember(name)
  File "/usr/lib64/python2.4/tarfile.py", line 1762, in _getmember
    members = self.getmembers()
  File "/usr/lib64/python2.4/tarfile.py", line 1059, in getmembers
    self._load()        # all members, we first have to
  File "/usr/lib64/python2.4/tarfile.py", line 1778, in _load
    tarinfo = self.next()
  File "/usr/lib64/python2.4/tarfile.py", line 1588, in next
    self.fileobj.seek(self.offset)
  File "/usr/lib64/python2.4/gzip.py", line 377, in seek
    self.read(1024)
  File "/usr/lib64/python2.4/gzip.py", line 225, in read
    self._read(readsize)
  File "/usr/lib64/python2.4/gzip.py", line 273, in _read
    self._read_eof()
  File "/usr/lib64/python2.4/gzip.py", line 309, in _read_eof
    raise IOError, "CRC check failed"
IOError: CRC check failed
>>> print data
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'data' is not defined

Python (handling the exception):

>>> tar = tarfile.open("bundle.tar.gz")
>>> try:
...     data = tar.extractfile("README").read()
... except:
...     pass
... 
>>> print(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'data' is not defined

error handling unix commands file extraction archive management tar gz compression data recovery system identification

2 Answers

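With the manual Unix approach, gzip keeps decompressing the file until it hits a problem.

Python's gzip (or tar) module stops as soon as it finds that the archive is corrupt, because it detects a failed CRC check.

Just an idea: you could first run the damaged archives through gzip and recompress them, which fixes the CRC problem.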

gunzip < damaged.tar.gz | gzip > corrected.tar.gz

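This gives you a corrected .tar.gz file containing all the data up to the point where the archive was damaged. You should now be able to use Python's tar/gzip libraries without hitting the CRC error.

Note that this command decompresses and then recompresses the file, which costs storage space and CPU time, so don't do it for every archive.

For efficiency, run it only when you hit the IOError: CRC check failed error.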
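
If shelling out is not an option, the same decompress-and-recompress step could be approximated in Python itself. The following is only a sketch and assumes Python 3; the function name `recompress_salvageable` and the `chunk_size` parameter are made up for illustration. It uses `zlib.decompressobj` in gzip mode, which hands back data chunk by chunk instead of failing the whole read.

```python
import gzip
import zlib


def recompress_salvageable(src_path, dst_path, chunk_size=64 * 1024):
    """Write every byte that still decompresses from src_path to a new gzip file.

    Returns the number of decompressed bytes recovered.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    recovered = 0
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        while not decomp.eof:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # input ended early: the archive was truncated
            try:
                data = decomp.decompress(chunk)
            except zlib.error:
                break  # corrupt data: keep what was already written
            dst.write(data)
            recovered += len(data)
    return recovered
```

When a chunk fails to decompress, whatever that chunk contained is dropped, so a smaller `chunk_size` loses less data near the damaged spot. The tar stream inside the new file is still cut short, so reading members one at a time (for example with `TarFile.next()`) is safer than asking for the full member list.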

Answered 2025-04-16 by Python大师


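You could try something like this: decompress the gzip file into a temporary in-memory buffer, then extract the file you need from that. In the example below I am fairly aggressive about trying to read the whole file; depending on the block size of the gzipped data, you could probably get away with reading at most 128 to 256 KB. My gut feeling is that gzip works in blocks of at most 64 KB, but I make no promises.

This method does everything in memory, with no intermediate files or writes to disk, but it also keeps all of the decompressed data in memory, so... I'm not kidding, you will need to tune this for your specific use case.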

#!/usr/bin/python

import gzip
import io
import tarfile

# Depending on how your tar file is constructed, you might need to specify
# './README' as your magic_file

magic_file = 'README'

f = gzip.open('corrupt', 'rb')

# In-memory buffer for whatever can be decompressed
t = io.BytesIO()

try:
    while True:
        block = f.read(1024)
        if not block:
            break  # clean end of stream; without this the loop never ends
        t.write(block)
except Exception as e:
    # A truncated or corrupt stream ends up here; keep what was read so far
    print(str(e))
    print('%d bytes decompressed' % t.tell())

t.seek(0)
tarball = tarfile.open(name=None, mode='r', fileobj=t)

try:
    # Walk the members lazily instead of calling getmember(), which scans
    # the whole archive and can fail on the truncated tail
    for member in tarball:
        if member.name == magic_file:
            # extractfile() returns a file object with the member's
            # contents; tobuf() would only return its header block
            magic_data = tarball.extractfile(member).read()

            # search magic data for serial number or print out the
            # file
            print(magic_data)
            break
except Exception as e:
    print(e)
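
One possible way to do the tuning mentioned above is to cap how much data gets decompressed, since only the first file is needed. This is a sketch under assumptions, not a tested recipe for every archive: it assumes Python 3, the function name `read_first_member` and the `limit` parameter are hypothetical, and the 256 KB default simply reuses the upper estimate given above.

```python
import gzip
import io
import tarfile
import zlib


def read_first_member(path, limit=256 * 1024):
    """Return (name, data) for the first regular file in a damaged .tar.gz.

    Returns None if nothing usable fits within `limit` decompressed bytes.
    """
    buf = io.BytesIO()
    try:
        with gzip.open(path, "rb") as gz:
            while buf.tell() < limit:
                block = gz.read(min(1024, limit - buf.tell()))
                if not block:
                    break  # clean end of stream
                buf.write(block)
    except (EOFError, OSError, zlib.error):
        pass  # truncated or corrupt stream: keep what was read so far

    buf.seek(0)
    try:
        # "r:" means a plain, uncompressed tar stream.
        with tarfile.open(fileobj=buf, mode="r:") as tar:
            for member in tar:  # members are parsed lazily, in archive order
                if member.isfile():
                    return member.name, tar.extractfile(member).read()
    except tarfile.ReadError:
        pass  # the buffer ended before a complete member was available
    return None
```

Calling `read_first_member('bundle.tar.gz')` would then return the name and bytes of the first file, or `None` when the file does not fit within the limit, in which case the limit can be raised.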

Answered 2025-04-16 by Python大师
