Reading a file in Python and converting its byte order
Someone recently asked on StackOverflow how to read a file in Python, and the accepted answer suggested something like:
with open('x.txt') as x: f = x.read()
How can I read the file and convert the byte order of the data?
For example, I have a 1 GB binary file containing a bunch of single-precision floats stored in big-endian format, and I want to convert them to little-endian and put them into a numpy array. Below is a function I wrote to do this, along with some code that actually calls it. I use struct.unpack for the endian conversion, and I tried to speed things up with mmap.
My question is: am I using mmap and struct.unpack correctly? Is there a simpler, faster way to do this? What I have now works, but I would really like to learn how to do it better.
Thanks in advance!
#!/usr/bin/python
from struct import unpack
import mmap
import numpy as np

def mmapChannel(arrayName, fileName, channelNo, line_count, sample_count):
    """
    We need to read in the asf internal file and convert it into a numpy array.
    It is stored as a single row, and is binary. The number of lines (rows), samples (columns),
    and channels all come from the .meta text file.
    Also, internal format files are packed big endian, but most systems use little endian, so we need
    to make that conversion as well.
    Memory mapping seemed to improve the ingestion speed a bit.
    """
    # memory-map the file; size 0 means the whole file
    # length = line_count * sample_count * arrayName.itemsize
    print "\tMemory Mapping..."
    with open(fileName, "rb") as f:
        map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        map.seek(channelNo * line_count * sample_count * arrayName.itemsize)

        for i in xrange(line_count * sample_count):
            arrayName[0, i] = unpack('>f', map.read(arrayName.itemsize))[0]

        # Same method as above, just more verbose for the maintenance programmer.
        # for i in xrange(line_count * sample_count):  # row
        #     be_float = map.read(arrayName.itemsize)  # arrayName.itemsize should be 4 for float32
        #     le_float = unpack('>f', be_float)[0]     # > for big endian, < for little endian
        #     arrayName[0, i] = le_float

        map.close()
    return arrayName
print "Initializing the Amp HH HV, and Phase HH HV arrays..."
HHamp = np.ones((1, line_count*sample_count), dtype='float32')
HHphase = np.ones((1, line_count*sample_count), dtype='float32')
HVamp = np.ones((1, line_count*sample_count), dtype='float32')
HVphase = np.ones((1, line_count*sample_count), dtype='float32')
print "Ingesting HH_Amp..."
HHamp = mmapChannel(HHamp, 'ALPSRP042301700-P1.1__A.img', 0, line_count, sample_count)
print "Ingesting HH_phase..."
HHphase = mmapChannel(HHphase, 'ALPSRP042301700-P1.1__A.img', 1, line_count, sample_count)
print "Ingesting HV_AMP..."
HVamp = mmapChannel(HVamp, 'ALPSRP042301700-P1.1__A.img', 2, line_count, sample_count)
print "Ingesting HV_phase..."
HVphase = mmapChannel(HVphase, 'ALPSRP042301700-P1.1__A.img', 3, line_count, sample_count)
print "Reshaping...."
HHamp_orig = HHamp.reshape(line_count, -1)
HHphase_orig = HHphase.reshape(line_count, -1)
HVamp_orig = HVamp.reshape(line_count, -1)
HVphase_orig = HVphase.reshape(line_count, -1)
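Taken together, the whole ingestion above can collapse to a single numpy.fromfile call plus a reshape. A minimal runnable sketch (the tiny dimensions and the synthetic demo file are made up for the example, standing in for the real .meta values and .img file), assuming the file is laid out channel-major as the seek offsets above imply:

```python
import numpy as np

# Made-up dimensions for the demo; the real values come from the .meta file.
line_count, sample_count, channel_count = 3, 4, 4

# Write a synthetic big-endian, channel-major file like the .img above.
np.arange(channel_count * line_count * sample_count, dtype='>f4').tofile('demo.img')

# One read for all channels, then reshape to (channels, lines, samples).
raw = np.fromfile('demo.img', dtype='>f4')
channels = raw.reshape(channel_count, line_count, sample_count)

HHamp, HHphase, HVamp, HVphase = channels  # one (line_count, sample_count) array each
```

Each unpacked channel already has the line_count x sample_count shape, so the separate reshape step at the end would become unnecessary.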
4 Answers
0
I would have expected something like this to be faster:
arrayName[0] = unpack('>'+'f'*line_count*sample_count, map.read(arrayName.itemsize*line_count*sample_count))
Also, please don't use map as a variable name; it shadows the built-in function.
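For illustration, here is a small self-contained sketch of that batched unpack (the three sample values are made up): a single unpack call decodes an entire big-endian buffer instead of calling unpack once per float.

```python
from struct import pack, unpack

# Pack three big-endian floats back to back (a stand-in for map.read(...)).
n = 3
payload = pack('>' + 'f' * n, 1.0, 2.5, -4.0)

# Decode the whole buffer with one format string, as the answer suggests.
values = unpack('>' + 'f' * n, payload)
# values == (1.0, 2.5, -4.0)
```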
7
A slight modification of @Alex Martelli's answer:
arr = numpy.fromfile(filename, numpy.dtype('>f4'))
# no byteswap is needed regardless of the endianness of the machine
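A quick self-contained check of that one-liner (the file name and sample values are made up): the '>f4' dtype describes the on-disk layout, so numpy returns correct values on any host without an explicit byteswap.

```python
import numpy as np

# Write a few big-endian float32 values to a scratch file.
np.array([1.0, 2.0, 3.5], dtype='>f4').tofile('be_demo.bin')

# '>f4' = big-endian 4-byte float; numpy handles the byte order on read.
arr = np.fromfile('be_demo.bin', dtype='>f4')
# arr.tolist() == [1.0, 2.0, 3.5]
```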
6
with open(fileName, "rb") as f:
    arrayName = numpy.fromfile(f, numpy.float32)
arrayName.byteswap(True)
This is hard to beat for speed and simplicity ;-). For byteswap see here (the True argument means "in place"); for fromfile see here.
This works as-is on little-endian machines (since the data is big-endian, the byteswap is needed). You can test whether the machine is little-endian and make the byteswap conditional, changing the last line from an unconditional byteswap call into, for example:
if struct.pack('=f', 2.3) == struct.pack('<f', 2.3):
    arrayName.byteswap(True)
That is, the byteswap is called only when the test shows the machine is little-endian.
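An equivalent endianness test can use sys.byteorder. A minimal runnable sketch of the conditional-byteswap idea (the file name and sample values are made up for the demo):

```python
import sys
import numpy as np

# Synthetic big-endian input, standing in for the real data file.
np.array([1.0, -2.0, 0.5], dtype='>f4').tofile('swap_demo.bin')

with open('swap_demo.bin', 'rb') as f:
    arrayName = np.fromfile(f, np.float32)  # read as native-order float32

# Swap only on little-endian hosts; big-endian hosts already match the file.
if sys.byteorder == 'little':
    arrayName.byteswap(True)
# arrayName.tolist() == [1.0, -2.0, 0.5] on any host
```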