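Accessing data ranges with h5py
I have an h5 file that contains 62 different attributes. I would like to see the data range of each one.
To explain in more detail what I am doing: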
import h5py
the_file = h5py.File("myfile.h5","r")
data = the_file["data"]
att = data.keys()
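The code above gives me a list of attributes such as "U", "T", "H", and so on.
Suppose I want to know the minimum and maximum values of "U". How would I do that?
This is the output of running "h5dump -H":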
HDF5 "myfile.h5" {
GROUP "/" {
GROUP "data" {
ATTRIBUTE "datafield_names" {
DATATYPE H5T_STRING {
STRSIZE 8;
STRPAD H5T_STR_SPACEPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 62 ) / ( 62 ) }
}
ATTRIBUTE "dimensions" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
}
ATTRIBUTE "time_variables" {
DATATYPE H5T_IEEE_F64BE
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
}
DATASET "Temperature" {
DATATYPE H5T_IEEE_F64BE
DATASPACE SIMPLE { ( 256, 512, 1024 ) / ( 256, 512, 1024 ) }
}
4 Answers
Did you mean data.attrs rather than data itself? If so,
import h5py
with h5py.File("myfile.h5", "w") as the_file:
dset = the_file.create_dataset('MyDataset', (100, 100), 'i')
dset.attrs['U'] = (0,1,2,3)
dset.attrs['T'] = (2,3,4,5)
with h5py.File("myfile.h5", "r") as the_file:
data = the_file["MyDataset"]
print({key:(min(value), max(value)) for key, value in data.attrs.items()})
gives
{u'U': (0, 3), u'T': (2, 5)}
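The built-in min and max in the snippet above only work when every attribute is a sequence of numbers. A minimal sketch of a more tolerant variant follows; the helper name attr_ranges, the in-memory demo file, and the extra "scale" and "label" attributes are made up for illustration and are not part of the original file.

```python
import h5py
import numpy as np


def attr_ranges(obj):
    """Return {attribute name: (min, max)} for the numeric attributes of an h5py object."""
    ranges = {}
    for name, value in obj.attrs.items():
        arr = np.asarray(value)
        # Skip strings and other non-numeric attributes, and empty arrays.
        if arr.size == 0 or not np.issubdtype(arr.dtype, np.number):
            continue
        ranges[name] = (arr.min().item(), arr.max().item())
    return ranges


# In-memory demo file (nothing is written to disk); all names are hypothetical.
with h5py.File("attr_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("MyDataset", (100, 100), "i")
    dset.attrs["U"] = (0, 1, 2, 3)
    dset.attrs["T"] = (2, 3, 4, 5)
    dset.attrs["scale"] = 2.5          # scalar attribute
    dset.attrs["label"] = "unitless"   # string attribute, skipped by the helper
    demo_ranges = attr_ranges(dset)

print(demo_ranges)
```

Converting each value with np.asarray means scalar attributes are handled the same way as arrays.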
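Because h5py datasets behave much like numpy arrays, you can use the numpy.min and numpy.max functions for this:
import numpy
maxItem = numpy.max(data['U'][:]) # Find the max of item 'U'
minItem = numpy.min(data['H'][:]) # Find the min of item 'H'
Note the '[:]': slicing an h5py dataset this way reads its contents into a numpy array.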
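Reading with '[:]' loads the whole dataset into memory, which can be a lot for an array the size of the 256 x 512 x 1024 one in the question. As a rough sketch, the range can also be accumulated one slab at a time. The helper name dataset_range, the slab size, and the small demo dataset are assumptions for illustration, and the sketch assumes a non-empty dataset with at least one dimension and no NaN values.

```python
import h5py
import numpy as np


def dataset_range(dset, slab=16):
    """Running (min, max) of an h5py dataset, reading `slab` rows of axis 0 at a time."""
    lo, hi = np.inf, -np.inf
    for start in range(0, dset.shape[0], slab):
        block = dset[start:start + slab]  # only this slab is loaded into memory
        lo = min(lo, block.min())
        hi = max(hi, block.max())
    return float(lo), float(hi)


# In-memory demo file with a small stand-in for a dataset named "U".
with h5py.File("range_demo.h5", "w", driver="core", backing_store=False) as f:
    values = np.arange(4 * 5 * 6, dtype="f8").reshape(4, 5, 6)
    dset = f.create_dataset("U", data=values)
    u_range = dataset_range(dset, slab=3)

print(u_range)
```

The result is the same as numpy.min and numpy.max over the full array; only the peak memory use differs.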
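This may be a difference in terminology, but HDF5 attributes are accessed through the attrs property of a dataset object. I would call what you are describing variables or datasets. Anyway...
From your description, I assume these attributes are just arrays. You should be able to get the data for each one as follows and then compute the minimum and maximum as you would for any numpy array: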
attr_data = data["U"][:] # gets a copy of the array
min_value = attr_data.min() # avoid shadowing the built-in min/max
max_value = attr_data.max()
So if you want the minimum and maximum of every attribute, you can loop over the attribute names, or you can use
for attr_name, attr_value in data.items():
    values = attr_value[:] # read this dataset into a numpy array
    print(attr_name, values.min(), values.max())
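The loop above can be wrapped into a small helper that collects the ranges instead of printing them. This is only a sketch: the function name group_ranges and the demo group contents are invented for illustration, and it assumes every dataset in the group is numeric and small enough to read in full.

```python
import h5py
import numpy as np


def group_ranges(group):
    """Return {dataset name: (min, max)} for the datasets directly inside `group`."""
    ranges = {}
    for name, item in group.items():
        if not isinstance(item, h5py.Dataset):
            continue  # skip subgroups
        arr = item[...]  # read the whole dataset into memory
        ranges[name] = (arr.min().item(), arr.max().item())
    return ranges


# In-memory demo file; the names "U", "T" and "extra" are hypothetical.
with h5py.File("group_demo.h5", "w", driver="core", backing_store=False) as f:
    data = f.create_group("data")
    data.create_dataset("U", data=np.array([1.0, -2.0, 3.5]))
    data.create_dataset("T", data=np.array([[280, 300], [290, 310]]))
    data.create_group("extra")
    all_ranges = group_ranges(data)

print(all_ranges)
```

The isinstance check matters because items() on a group yields subgroups as well as datasets, and groups cannot be sliced.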
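Edit in response to your first comment:
h5py objects can be used like Python dictionaries. So when you use 'keys()' you are not actually getting the data; you are getting the names (or keys) of that data. For example, if you run the_file.keys(), you get a list of every item (group or dataset) at the root of that HDF5 file. If you keep following a path, you eventually reach the dataset that holds the actual binary data. So, for example, you might start out in the interpreter like this: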
import h5py
the_file = h5py.File("myfile.h5", "r")
print(list(the_file.keys()))
# this will result in a list of keys, maybe ["raw_data", "meta_data"] or something
print(list(the_file["raw_data"].keys()))
# this will result in another list of keys, maybe ["temperature", "humidity"]
# eventually you'll get to the dataset that actually has the data or attributes you are looking for
# think of this process as going through a directory structure or a path to get to a file (or a dataset/variable in this case)
the_data_var = the_file["raw_data"]["temperature"]
the_data_array = the_data_var[:]
print(list(the_data_var.attrs.keys()))
# this will result in a list of attribute names/keys
an_attr_of_the_data = the_data_var.attrs["measurement_time"]
# So now you have "the_data_array", which is a numpy array, and "an_attr_of_the_data", which is whatever it happened to be
# you can get the min/max of the data the same way as before
print(the_data_array.min())
print(the_data_array.max())
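Instead of stepping through keys() level by level, the whole hierarchy can be walked in one call with visititems. The sketch below assumes a made-up layout that mirrors the hypothetical "raw_data" and "meta_data" names used above; list_datasets is an illustrative helper, not something the original file or h5py provides.

```python
import h5py
import numpy as np


def list_datasets(h5obj):
    """Collect (path, shape, dtype) for every dataset below a file or group."""
    found = []

    def visitor(path, obj):
        if isinstance(obj, h5py.Dataset):
            found.append((path, obj.shape, str(obj.dtype)))
        # Returning None tells visititems to keep walking.

    h5obj.visititems(visitor)
    return found


# In-memory demo file with a hypothetical layout.
with h5py.File("walk_demo.h5", "w", driver="core", backing_store=False) as f:
    raw = f.create_group("raw_data")
    raw.create_dataset("temperature", data=np.zeros((2, 3), dtype="f8"))
    raw.create_dataset("humidity", data=np.zeros(4, dtype="f4"))
    f.create_group("meta_data")
    datasets = list_datasets(f)

for path, shape, dtype in datasets:
    print(path, shape, dtype)
```

Each path returned this way can be passed straight back to the file object, for example the_file["raw_data/temperature"].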
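Edit 2 - Why do people format their HDF files this way? It defeats the purpose.
If possible, you may need to talk to whoever made this file. If you made it yourself, then you can answer my questions yourself. First, are you sure that data.keys() in your original example returned "U", "T", and so on? Unless h5py is doing something magical, or you did not provide all of the h5dump output, that cannot be your output. I'll explain what the h5dump output tells me, but please try to understand what I am doing rather than just copying and pasting it into your terminal.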
# Get a handle to the "data" Group
data = the_file["data"]
# As you can see from the dump this data group has 3 attributes and 1 dataset
# The names of the attributes are "datafield_names", "dimensions", "time_variables"
# This should result in a list of those names:
print(list(data.attrs.keys()))
# The name of the dataset is "Temperature" and should be the only item in the list returned by:
print(list(data.keys()))
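To check this reading of the dump without the real file, a tiny stand-in with the same layout can be built in memory. Everything below is an assumption for illustration: the shapes are shrunk, native dtypes are used instead of the big-endian ones shown by h5dump, and describe is a hypothetical helper.

```python
import h5py
import numpy as np


def describe(group):
    """Return ({attribute name: shape}, [member names]) for an h5py group."""
    attr_shapes = {name: np.shape(value) for name, value in group.attrs.items()}
    members = list(group.keys())
    return attr_shapes, members


# In-memory stand-in for the "data" group from the dump (sizes reduced).
with h5py.File("layout_demo.h5", "w", driver="core", backing_store=False) as f:
    data = f.create_group("data")
    data.attrs["datafield_names"] = np.array([b"U", b"T", b"H"], dtype="S8")
    data.attrs["dimensions"] = np.array([3, 2, 4, 8], dtype="i4")
    data.attrs["time_variables"] = np.array([0.0, 1.5], dtype="f8")
    data.create_dataset("Temperature", shape=(2, 4, 8), dtype="f8")
    attr_shapes, members = describe(data)

print(attr_shapes)
print(members)
```

The attributes show up only under attrs, and the only key of the group is the dataset, which matches the structure h5dump reports.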
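From the h5dump output you can see that there are 62 datafield_names (strings), 4 dimensions (32-bit integers, I think), and 2 time_variables (64-bit floats). It also tells me that Temperature is a 3-dimensional array of size 256 x 512 x 1024 (64-bit floats). Do you see where I got this information? Now for the hard part: you need to determine how datafield_names corresponds to the Temperature array. That was decided by whoever made the file, so you need to figure out what each row/column of the Temperature array represents. My first guess was that each row of the Temperature array is one of the datafield_names, maybe with 2 more for each time? But that does not work, because the array has far too many rows. Maybe the dimensions fit in there somehow? Finally, here is how to get each piece of information (continuing from above):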
# Get the temperature array (a single [:] reads the whole dataset as well)
temp_array = data["Temperature"][:, :, :]
# Get all of the datafield_names (array of 62 strings)
datafields = data.attrs["datafield_names"][:]
# Get all of the dimensions (array of 4 integers)
dims = data.attrs["dimensions"][:]
# Get all of the time variables (array of 2 floats)
time_variables = data.attrs["time_variables"]
# If you want the min/max of the entire temperature array this should work:
print(temp_array.min())
print(temp_array.max())
# If you knew that row 0 of the array had the temperatures you wanted to analyze
# then this would work, but it all depends on how the creator organized the data/file:
print(temp_array[0].min())
print(temp_array[0].max())
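The dump declares datafield_names as fixed-width, space-padded 8-character ASCII strings, so the values will most likely come back as byte strings with trailing spaces. A minimal sketch for turning them into clean Python strings follows; the helper name field_names and the three demo entries are assumptions, and whether a name's position corresponds to an index into Temperature is still something only the file's creator can confirm.

```python
import h5py
import numpy as np


def field_names(group, attr="datafield_names"):
    """Decode a fixed-width byte-string attribute into a list of clean Python strings."""
    raw = np.atleast_1d(group.attrs[attr])
    names = []
    for item in raw:
        text = item.decode("ascii") if isinstance(item, bytes) else str(item)
        names.append(text.strip())  # drop the space padding
    return names


# In-memory demo with three hypothetical space-padded names.
with h5py.File("names_demo.h5", "w", driver="core", backing_store=False) as f:
    data = f.create_group("data")
    padded = np.array([b"U       ", b"T       ", b"H       "], dtype="S8")
    data.attrs["datafield_names"] = padded
    names = field_names(data)

print(names)
```

With the cleaned list, names.index("U") gives the position of "U" among the data fields, which can then be tried against the entries of the dimensions attribute.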
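Sorry I can't be of more help, but without the actual file and the meaning of each field, this is about all I can do. Try to understand how I used h5py to read the information, and how I translated the header information (the h5dump output) into h5py calls that I could actually use. If you know how the data is organized in the array, you should be able to do what you want. Good luck, and I'll help more if I can.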