How do I convert JSON data into a TensorFlow dataset?

Posted 2024-05-15 13:15:44


I have a JSON file containing training data with the following structure:

[
    {
        "audio_path": "common_voice_de_22136164.wav",
        "label": "Diese pyromanen ... Vertrauen."
    },
    {
        "audio_path": "common_voice_de_19872706.wav",
        "label": "Die einzelnen Unterar...

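My goal is to feed this JSON data into a TensorFlow Dataset object after converting the audio paths into waveforms. I am trying to recreate something similar to this tutorial on tensorflow.org: https://www.tensorflow.org/tutorials/audio/simple_audio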

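My approach is to convert the JSON data into Python lists, feed them into a tf.data.Dataset, and apply a function with the .map() method that converts the audio files into waveforms.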
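A minimal sketch of that pipeline, not the code from this question. It assumes each row is an [audio_path, label] pair and that the clips are mono 16-bit PCM WAV files. The helper names (`write_silent_wav`, `build_waveform_dataset`, `load_waveform_and_label`) and the placeholder rows are hypothetical. Paths and labels are paired before mapping because the function passed to .map() receives tensors rather than Python strings.

```python
import os
import tempfile
import wave

import tensorflow as tf


def write_silent_wav(path, num_samples=1600, sample_rate=16000):
    """Write a mono 16-bit PCM WAV of silence (stand-in for a real clip)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * num_samples)


def build_waveform_dataset(rows, audio_dir):
    """rows: list of [audio_path, label] pairs (hypothetical layout)."""
    paths = [os.path.join(audio_dir, row[0]) for row in rows]
    labels = [row[1] for row in rows]

    def load_waveform_and_label(file_path, label):
        # Inside map(), file_path and label are string tensors, not Python
        # str objects, so only TensorFlow ops are used here.
        audio_binary = tf.io.read_file(file_path)
        audio, _ = tf.audio.decode_wav(audio_binary, desired_channels=1)
        return tf.squeeze(audio, axis=-1), label

    dataset = tf.data.Dataset.from_tensor_slices((paths, labels))
    return dataset.map(load_waveform_and_label)


# Placeholder data: two silent clips with made-up labels.
demo_dir = tempfile.mkdtemp()
demo_rows = [["clip_a.wav", "placeholder label a"],
             ["clip_b.wav", "placeholder label b"]]
for name, _ in demo_rows:
    write_silent_wav(os.path.join(demo_dir, name))

waveform_ds = build_waveform_dataset(demo_rows, demo_dir)
first_waveform, first_label = next(iter(waveform_ds))
```

Because each element already carries its label, nothing has to be looked up by file name inside the mapped function.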

Here is the code that loads the JSON files (train, test) into Python lists:

import pandas as pd


def load_json_into_lists(train_ds: str, test_ds: str, validation_size=0.09):

    # read json containing training data
    train_data = pd.read_json(train_ds, lines=False)

    # read json containing test data
    test_data = pd.read_json(test_ds, lines=False)

    # store json training data into a python list
    train_data_list = train_data.values.tolist()

    # store json test data into a python list
    test_data_list = test_data.values.tolist()

    # split train into train and validation
    new_train_data_list, validation_data_list = split_validation_from_train(train_data_list,
                                                                            validation_size=validation_size)

    print(f"Ex.:{new_train_data_list[0]}, Len.:{len(new_train_data_list)}, Type:{type(new_train_data_list)}")
    print(f"Ex.:{validation_data_list[0]}, Len.:{len(validation_data_list)}, Type:{type(validation_data_list)}")
    print(f"Ex.:{test_data_list[0]}, Len.:{len(test_data_list)}, Type:{type(test_data_list)}")

    return new_train_data_list, validation_data_list, test_data_list
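For reference, a minimal standard-library sketch of the row layout the later functions rely on (`sublist[0]` is the audio path, `sublist[1]` is the label). It assumes the file is a JSON array of objects with the keys `audio_path` and `label`, as in the excerpt above. The function name `load_rows` is hypothetical, and the label texts are placeholders rather than the real transcripts.

```python
import json
import os
import tempfile


def load_rows(json_path):
    """Return [[audio_path, label], ...] from a JSON array of objects."""
    with open(json_path, encoding="utf-8") as handle:
        records = json.load(handle)
    return [[record["audio_path"], record["label"]] for record in records]


# Hypothetical two-record file in the same shape as the data above;
# the label texts are placeholders, not the real transcripts.
sample_records = [
    {"audio_path": "common_voice_de_22136164.wav", "label": "placeholder transcript 1"},
    {"audio_path": "common_voice_de_19872706.wav", "label": "placeholder transcript 2"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "train.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        json.dump(sample_records, handle)
    rows = load_rows(sample_path)
```

Selecting the keys by name keeps the [path, label] order explicit, so it does not depend on the column order of a DataFrame.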

Here is the code that splits the training list into training and validation lists:

def split_validation_from_train(train_data_list: list, validation_size: float):

    calculate_validation_size = round(len(train_data_list) * validation_size)
    print("Calculated Validation Dataset size: ", calculate_validation_size)
    # all elements til 178728
    train_data_list_new = train_data_list[:(len(train_data_list)-calculate_validation_size)]
    # all elements from 178729 to 194404 (17676)
    validation_data_list = train_data_list[len(train_data_list_new):]

    print("Validation Dataset size: ", len(validation_data_list))
    print("New Train Dataset size: ", len(train_data_list_new))

    return train_data_list_new, validation_data_list
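To make the arithmetic concrete, here is a compact sketch of the same tail split on toy data. The name `split_tail` and the 100-row list are hypothetical. As in the function above, the validation items are taken from the end of the list without shuffling.

```python
def split_tail(items, validation_size):
    """Hold out the last round(len(items) * validation_size) items."""
    holdout = round(len(items) * validation_size)
    cut = len(items) - holdout
    return items[:cut], items[cut:]


# Toy data: 100 made-up [path, label] rows.
toy_rows = [[f"clip_{i}.wav", f"label {i}"] for i in range(100)]
toy_train, toy_validation = split_tail(toy_rows, validation_size=0.09)
```

With 100 rows and validation_size=0.09, the last 9 rows become the validation list and the first 91 stay in the training list.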

Then I have some waveform conversion functions, inspired by the TensorFlow tutorial mentioned above:

import tensorflow as tf

# Audio Processing
def decode_audio(audio_binary):
    audio_, _ = tf.audio.decode_wav(audio_binary)
    return tf.squeeze(audio_, axis=-1)


#@tf.function
def get_label(file_path):

    # get the loaded lists
    train_list, _, _ = load_json_into_lists(TRAIN_DS_PATH, TEST_DS_PATH)

    for sublist in train_list:
        if sublist[0] == str(file_path):

            return sublist[1]


#@tf.function
def get_waveform(file_path):

    # get the loaded lists
    train_list, _, _ = load_json_into_lists(TRAIN_DS_PATH, TEST_DS_PATH)

    for sublist in train_list:
        if sublist[0] == str(file_path):
            file_to_read = str("/de/cv_valid_data/" + sublist[0])
            audio_binary = tf.io.read_file(file_to_read)
            waveform = decode_audio(audio_binary)

            return waveform


def get_waveform_and_label(file_path):

    # get label
    label_ = get_label(file_path)

    # get waveform
    waveform_ = get_waveform(file_path)

    return waveform_, label_

Finally, the code that applies the get_waveform_and_label function via .map() to obtain the waveform dataset:

This raises the following error (I really don't know what causes it):

Traceback (most recent call last):
  File "training_test_3.py", line 123, in <module>
    tf_waveform_ds = convert_lists_into_tf_ds()
  File "/Users/pietmuller/miniforge3/envs/tensorM1_new_3/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 620, in wrapper
    return func(*args, **kwargs)
  File "training_test_3.py", line 111, in convert_lists_into_tf_ds
    tf_waveform_ds_ = tf_train_ds.map(get_waveform_and_label)
  File "/Users/pietmuller/miniforge3/envs/tensorM1_new_3/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1805, in map
    return MapDataset(self, map_func, preserve_cardinality=True)
  File "/Users/pietmuller/miniforge3/envs/tensorM1_new_3/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4208, in __init__
    variant_tensor = gen_dataset_ops.map_dataset(
  File "/Users/pietmuller/miniforge3/envs/tensorM1_new_3/lib/python3.8/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 3028, in map_dataset
    _ops.raise_from_not_ok_status(e, name)
  File "/Users/pietmuller/miniforge3/envs/tensorM1_new_3/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 6862, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'output_shapes' of 0 must be at least minimum 1
    ; NodeDef: {{node MapDataset}}; Op<name=MapDataset; signature=input_dataset:variant, other_arguments: -> handle:variant; attr=f:func; attr=Targuments:list(type),min=0; attr=output_types:list(type),min=1; attr=output_shapes:list(shape),min=1; attr=use_inter_op_parallelism:bool,default=true; attr=preserve_cardinality:bool,default=false> [Op:MapDataset]

Thanks for your answers.

