Reading the contents of a tar file in a Python script without extracting it

112 votes

6 answers

110186 views

数据工程师

Asked 2025-04-15 17:44

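I have a tar file that contains many files.
I need to write a Python script that reads the contents of those files and computes the total number of characters, including letters, spaces, newlines and so on, without extracting the tar file first.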

Tags: file reading, character counting, compressed file handling, tar files

6 answers

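Previously, this post showed an example that used "dict(zip(...))" to combine the member names with the member list. That was silly and caused excessive reads of the archive. To achieve the same result, we can use a dictionary comprehension: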

index = {i.name: i for i in my_tarfile.getmembers()}

More information on using tarfile

Extracting a member of a tar file

#!/usr/bin/env python3
import tarfile

my_tarfile = tarfile.open('/path/to/mytarfile.tar')

print(my_tarfile.extractfile('./path/to/file.png').read())
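
A slightly more defensive variant of the snippet above, as a minimal sketch: extractfile() returns None for members that are not regular files, and getmember() raises KeyError for unknown names, so both cases are handled here. The helper name read_member and the demo file hello.txt are made up for illustration and are not part of tarfile.

```python
import io
import tarfile


def read_member(archive, member_name):
    """Return the bytes of one member, or None if it cannot be read.

    `archive` is a path string or a seekable binary file object.
    `read_member` is a hypothetical helper, not part of the tarfile module.
    """
    kwargs = {"name": archive} if isinstance(archive, str) else {"fileobj": archive}
    with tarfile.open(mode="r", **kwargs) as tar:
        try:
            member = tar.getmember(member_name)
        except KeyError:
            return None  # no member with that name
        fileobj = tar.extractfile(member)  # None for directories, devices, ...
        if fileobj is None:
            return None
        with fileobj:
            return fileobj.read()


# Demo on a tiny archive built in memory, so nothing touches the disk.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"hello world\n"
    info = tarfile.TarInfo("hello.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)
print(read_member(buf, "hello.txt"))
```

When a file object is passed in, tarfile starts reading at its current position, so rewind it with seek(0) before each call.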

Indexing a tar file

#!/usr/bin/env python3
import tarfile
import pprint

my_tarfile = tarfile.open('/path/to/mytarfile.tar')

index = my_tarfile.getnames()  # a list of strings, one per member name
# or
# index = {i.name: i for i in my_tarfile.getmembers()}

pprint.pprint(index)

Indexing, reading and dynamically extracting from a tar file

#!/usr/bin/env python3

import tarfile
import base64
import textwrap
import random

# note, indexing a tar file requires reading it completely once
# if we want to do anything after indexing it, it must be a file
# that can be seeked (not a stream), so here we open a file we
# can seek
my_tarfile = tarfile.open('/path/to/mytar.tar')


# getmembers() returns TarInfo objects, loosely comparable to os.stat()
# results: each has the member name (i.name) plus these attributes:
#
# chksum,devmajor,devminor,gid,gname,linkname,linkpath,
# mode,mtime,name,offset,offset_data,path,pax_headers,
# size,sparse,tarfile,type,uid,uname
#
# here we use a dictionary comprehension to index all TarInfo
# members by the member name
index = {i.name: i for i in my_tarfile.getmembers()}

print(index.keys())

# pick your member
# note: if you can pick your member before indexing the tar file,
# you don't need to index it to read that file, you can directly
# my_tarfile.extractfile(name)
# or my_tarfile.getmember(name)

# pick your filename from the index dynamically
# dict views are not indexable in Python 3, so build a list first
my_file_name = random.choice(list(index))

my_file_tarinfo = index[my_file_name]
my_file_size = my_file_tarinfo.size
my_file_buf = my_tarfile.extractfile( 
    my_file_name
    # or my_file_tarinfo
)

print('file_name: {}'.format(my_file_name))
print('file_size: {}'.format(my_file_size))
print('----- BEGIN FILE BASE64 -----')
print(
    textwrap.fill(
        base64.b64encode(
            my_file_buf.read()
        ).decode(),
        72
    )
)
print('----- END FILE BASE64 -----')
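
The comments above point out that indexing needs a seekable file. For input that cannot be seeked (a pipe or a socket, say), tarfile also offers stream modes such as "r|*". The following is a rough sketch under that assumption: each member is read while the loop is positioned on it, and no index is built. The function name iter_stream_members and the demo file names are invented for this example.

```python
import io
import tarfile


def iter_stream_members(stream):
    """Yield (name, size, data) for each regular file in a tar stream.

    Opens `stream` in tarfile's non-seekable mode "r|*", so every member
    has to be consumed in archive order while the loop is positioned on it.
    """
    with tarfile.open(fileobj=stream, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and device nodes
            data = tar.extractfile(member).read()
            yield member.name, member.size, data


# Demo: a BytesIO stands in for a pipe or socket here.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in (("a.txt", b"abc"), ("b.txt", b"de")):
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)
members = list(iter_stream_members(buf))
print(members)
```

In a real script the stream could be something like sys.stdin.buffer. The trade-off is that going back to an earlier member is not possible in this mode.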

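Handling a tar file with duplicate members

If we have a strangely created tar file, for example one where several versions of the same file were added to the same archive, we can handle that case carefully. I have annotated which member contains which text. Suppose we want the fourth member (index 3), "capturetheflag\n".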

tar -tf mybadtar.tar 
mymember.txt  # "version 1\n"
mymember.txt  # "version 1\n"
mymember.txt  # "version 2\n"
mymember.txt  # "capturetheflag\n"
mymember.txt  # "version 3\n"

#!/usr/bin/env python3

import tarfile
my_tarfile = tarfile.open('mybadtar.tar')

# >>> my_tarfile.getnames()
# ['mymember.txt', 'mymember.txt', 'mymember.txt', 'mymember.txt', 'mymember.txt']

# if we use extractfile() on a name, we get the last entry: getmember()
# treats the last occurrence of a duplicated name as the most
# up-to-date version

# >>> my_tarfile.extractfile('mymember.txt').read()
# b'version 3\n'

# >>> my_tarfile.extractfile(my_tarfile.getmembers()[3]).read()
# b'capturetheflag\n'
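
If every version is wanted rather than one particular index, the members can be grouped by name. This is only a sketch: versions_by_name is a made-up helper, and the in-memory archive below is a shortened stand-in for mybadtar.tar.

```python
import collections
import io
import tarfile


def versions_by_name(tar):
    """Map each member name to the list of its contents, in archive order."""
    versions = collections.defaultdict(list)
    for member in tar.getmembers():
        if member.isfile():
            versions[member.name].append(tar.extractfile(member).read())
    return dict(versions)


# Build a small archive that stores the same name three times.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for text in (b"version 1\n", b"capturetheflag\n", b"version 3\n"):
        info = tarfile.TarInfo("mymember.txt")
        info.size = len(text)
        tar.addfile(info, io.BytesIO(text))
buf.seek(0)

with tarfile.open(fileobj=buf) as tar:
    history = versions_by_name(tar)
    latest = tar.extractfile("mymember.txt").read()  # lookup by name: last entry

print(history["mymember.txt"])
print(latest)
```

Passing TarInfo objects to extractfile() is what keeps the versions apart. A lookup by name always resolves to the last one.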

Alternatively, we can iterate over the tar file

#!/usr/bin/env python3

import tarfile
my_tarfile = tarfile.open('mybadtar.tar')
# note: once anything has forced a full read of the archive,
# my_tarfile.next() returns None, so call next() in a loop as the
# first thing you do if you want to iterate this way

while True:
    my_member = my_tarfile.next()
    if not my_member:
        break
    print((my_member.offset, my_tarfile.extractfile(my_member).read(),))

# (0, b'version 1\n')
# (1024, b'version 1\n')
# (2048, b'version 2\n')
# (3072, b'capturetheflag\n')
# (4096, b'version 3\n')
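
A TarFile object can also be used directly in a for loop. As far as I can tell from the CPython implementation, the iterator falls back to the members that are already loaded, so it keeps working after a full read, unlike a manual next() loop. A small sketch with an invented two-file archive:

```python
import io
import tarfile

# Build a two-member archive in memory for the demonstration.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, text in (("a.txt", b"alpha\n"), ("b.txt", b"beta\n")):
        info = tarfile.TarInfo(name)
        info.size = len(text)
        tar.addfile(info, io.BytesIO(text))
buf.seek(0)

with tarfile.open(fileobj=buf) as tar:
    tar.getmembers()  # forces a full read of the archive
    after_full_read = tar.next()  # None: the end of the archive was reached
    # Iterating the TarFile itself still visits every member.
    seen = [(m.name, tar.extractfile(m).read()) for m in tar]

print(after_full_read)
print(seen)
```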

Answered 2025-04-15 by Python大师

You need the tarfile module. Specifically, open the file with an instance of the TarFile class, then call TarFile.getnames() to get the names of the files inside it.

 |  getnames(self)
 |      Return the members of the archive as a list of their names. It has
 |      the same order as the list returned by getmembers().

To read the contents of a member, use this method:

 |  extractfile(self, member)
 |      Extract a member from the archive as a file object. `member' may be
 |      a filename or a TarInfo object. If `member' is a regular file, a
 |      file-like object is returned. If `member' is a link, a file-like
 |      object is constructed from the link's target. If `member' is none of
 |      the above, None is returned.
 |      The file-like object is read-only and provides the following
 |      methods: read(), readline(), readlines(), seek() and tell()
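
Putting the two methods together for the original question (total character count without extracting), here is a minimal sketch. It assumes the members are text in a single known encoding, UTF-8 by default. The function name count_characters and the choice to replace undecodable bytes instead of raising are decisions made for this example only.

```python
import io
import tarfile


def count_characters(archive, encoding="utf-8"):
    """Total number of characters in all regular files of a tar archive.

    `archive` is a seekable binary file object (an assumption of this sketch).
    Undecodable bytes are replaced instead of raising an error.
    """
    total = 0
    with tarfile.open(fileobj=archive, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and device nodes
            raw = tar.extractfile(member).read()
            total += len(raw.decode(encoding, errors="replace"))
    return total


# Demo archive: 6 characters in a.txt (7 bytes) plus 4 in b.txt.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, text in (("a.txt", "h\u00e9llo\n"), ("b.txt", "a b\n")):
        data = text.encode("utf-8")
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)
total = count_characters(buf)
print(total)
```

Decoding matters because read() returns bytes. Summing len() of the raw bytes would give the byte count, which differs from the character count as soon as the text contains multi-byte characters.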

Answered 2025-04-15 by Python大师

You can use the getmembers() method.

>>> import tarfile
>>> tar = tarfile.open("test.tar")
>>> tar.getmembers()

After that, you can use extractfile() to extract the members as file objects. Here is an example.

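import os
import tarfile

os.chdir("/tmp/foo")
# The context manager closes the archive even if a read fails.
with tarfile.open("test.tar") as tar:
    for member in tar.getmembers():
        f = tar.extractfile(member)
        if f is None:  # directories and other non-file members
            continue
        content = f.read()  # bytes, so count byte patterns
        print("%s has %d newlines" % (member.name, content.count(b"\n")))
        print("%s has %d spaces" % (member.name, content.count(b" ")))
        # len() counts bytes here; decode() first for non-ASCII text
        print("%s has %d characters" % (member.name, len(content)))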

In the example above, the file object f supports read(), readlines() and similar methods.
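
For large members it may be preferable not to call read() on the whole file. One option, sketched here under the assumption that the member is UTF-8 text, is to wrap the file object in io.TextIOWrapper and count line by line. text_stats and notes.txt are names invented for this example.

```python
import io
import tarfile


def text_stats(tar, member, encoding="utf-8"):
    """Count lines, spaces and characters of one member, line by line."""
    raw = tar.extractfile(member)
    if raw is None:  # not a regular file
        return None
    lines = spaces = chars = 0
    # newline="" keeps "\r\n" intact so the character count matches the file.
    with io.TextIOWrapper(raw, encoding=encoding, newline="") as text:
        for line in text:
            lines += 1
            spaces += line.count(" ")
            chars += len(line)
    return {"lines": lines, "spaces": spaces, "characters": chars}


# Demo on a single in-memory member with mixed line endings.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"a b\r\nc d e\n"
    info = tarfile.TarInfo("notes.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

with tarfile.open(fileobj=buf) as tar:
    stats = text_stats(tar, "notes.txt")
print(stats)
```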

Answered 2025-04-15 by Python大师
