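I have to read several JSON Lines (.jsonl) files from Google Cloud Storage. To do this, I built a dataset from the records I want to read: a numpy array of [[<gs:// url>, id], ...], where id is the line number, used to check whether a line belongs to the train, test, or validation split.
The main function creates a TF Dataset from a generator that yields the entries of the np.ndarray described above, and then runs a map function to download and parse the files: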
import numpy as np
import tensorflow as tf

def load_dataset(records: np.ndarray) -> tf.data.Dataset:
    """Create a TensorFlow Dataset (generator-backed) from a list of gs:// data URLs.

    Args:
        records (np.ndarray): List of strings, which are gs://<foo>/foo<N>/*.jsonl.gz files
    Returns:
        tf.data.Dataset: MapDataset generator which can be used for training Keras models.
    """
    dataset = tf.data.Dataset.from_generator(lambda: _generator(records), (tf.string, tf.int8))
    return dataset

def _generator(records):
    # Each record is a [url, line_index] pair.
    for r in records:
        yield r[0], r[1]
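To make the shape of the records concrete, here is a plain-Python sketch of the same iteration, without TensorFlow. The function name `iter_records` and the example URLs are hypothetical. It also checks one thing worth keeping in mind: the dataset declares the line index as `tf.int8`, and a signed 8-bit integer only covers -128 to 127, so larger line numbers would not fit that dtype.

```python
from typing import Iterator, List, Tuple

INT8_MIN, INT8_MAX = -128, 127  # value range of a signed 8-bit integer

def iter_records(records: List[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Yield (url, line_index) pairs, rejecting indices that cannot fit in int8."""
    for url, line_index in records:
        line_index = int(line_index)
        if not INT8_MIN <= line_index <= INT8_MAX:
            raise ValueError(f"line index {line_index} does not fit in int8")
        yield url, line_index

# Hypothetical records; the bucket and file names are placeholders.
example_records = [
    ("gs://example-bucket/part0/data.jsonl.gz", 0),
    ("gs://example-bucket/part0/data.jsonl.gz", 5),
]
pairs = list(iter_records(example_records))
```

If the files can have more than 128 lines, a wider dtype such as `tf.int32` in the `from_generator` call is probably the safer choice.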
As you can see, the generator just iterates over the np.ndarray to yield the URL and the 'line index'.
Then I have to load and preprocess the file from the URL to get a list of json -> Dict objects:
import zlib
from typing import Dict, Iterable, List

def _load_and_preprocess(filepath, selected_sample):
    """Read a file from GCS or a local path and process it into a tensor.

    Args:
        filepath (tensor): path string, pointer to GCS or local path
        selected_sample (tensor): line index to keep from the file
    Returns:
        tensor: processed input
    """
    sample_raw_input = tf.io.read_file(filepath)
    uncompressed_inputs = tf.py_function(_get_uncompressed_inputs, [sample_raw_input], tf.string)
    sample = tf.py_function(_load_sampled_sample, [uncompressed_inputs, selected_sample], tf.float32)  # This `tf.float32` is definitely wrong
    return sample  # This is not a tensor, but a List of Dictionaries which I will process later

def _get_uncompressed_inputs(record):
    # 16 + zlib.MAX_WBITS makes zlib expect a gzip header and trailer.
    return zlib.decompress(record.numpy(), 16 + zlib.MAX_WBITS)

def _load_sampled_sample(inputs: Iterable, selected_sample: List[int]) -> List[Dict[str, str]]:
    if not tf.executing_eagerly():
        raise RuntimeError("TensorFlow must be executing eagerly.")
    inputs = inputs.numpy()
    selected_sample = selected_sample.numpy()
    sample = _load__sampled_sample_from_jsonl(inputs, selected_sample)
    return sample

def _load__sampled_sample_from_jsonl(jsonl: bytes, selected_sample: List[int]) -> List[Dict[str, str]]:
    json_lines = _read_jsonl(jsonl).split("\n")
    sample = list()
    for n, sample_json in enumerate(json_lines):
        sample_obj = _read_json(sample_json) if n in selected_sample else None
        if sample_obj:
            sample.append(sample_obj)
    return sample

def _read_jsonl(jsonl: bytes) -> str:
    return jsonl.decode()
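The decompress-and-select step above can be tried in isolation, without TensorFlow or GCS. The following is a minimal sketch using only the standard library; the function name `select_jsonl_lines` and the in-memory payload are hypothetical stand-ins, and `json.loads` is assumed to play the role of the `_read_json` helper, which is not shown in the post.

```python
import gzip
import json
import zlib
from typing import Any, Dict, Iterable, List

def select_jsonl_lines(compressed: bytes, selected: Iterable[int]) -> List[Dict[str, Any]]:
    """Decompress gzip bytes and parse only the JSON lines whose index is selected."""
    wanted = {int(i) for i in selected}
    # 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer.
    text = zlib.decompress(compressed, 16 + zlib.MAX_WBITS).decode("utf-8")
    return [
        json.loads(line)
        for n, line in enumerate(text.split("\n"))
        if n in wanted and line.strip()  # skip the empty string after the last newline
    ]

# Hypothetical payload standing in for the bytes returned by tf.io.read_file.
payload = gzip.compress(b'{"id": "a"}\n{"id": "b"}\n{"id": "c"}\n')
rows = select_jsonl_lines(payload, [0, 2])
```

Here `rows` holds the parsed dicts for lines 0 and 2 only. The result is still a Python list of dicts, which is exactly the kind of value that cannot be returned from `tf.py_function` as a numeric tensor.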
Then I create the dataset with the code above and try to retrieve a single sample from it for testing:
val_ds = load_dataset(validation_records)
samples = tf.data.experimental.get_single_element(
    val_ds
)  # This should be a list of Dicts
This raises the following error:
InvalidArgumentError: ValueError: Attempt to convert a value ({...}) with an unsupported type (<class 'dict'>) to a Tensor.
# ... are the dict values, which is really big so I've shortened it to `...`
Traceback (most recent call last):
File "/home/victor/anaconda3/lib/python3.8/site-packages/tensorflow/python/ops/script_ops.py", line 242, in __call__
return func(device, token, args)
File "/home/victor/anaconda3/lib/python3.8/site-packages/tensorflow/python/ops/script_ops.py", line 140, in __call__
outputs = [
File "/home/victor/anaconda3/lib/python3.8/site-packages/tensorflow/python/ops/script_ops.py", line 141, in <listcomp>
_maybe_copy_to_context_device(self._convert(x, dtype=dtype),
File "/home/victor/anaconda3/lib/python3.8/site-packages/tensorflow/python/ops/script_ops.py", line 120, in _convert
return ops.convert_to_tensor(value, dtype=dtype)
File "/home/victor/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1499, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/victor/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 338, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/home/victor/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 263, in constant
return _constant_impl(value, dtype, shape, name, verify_shape=False,
File "/home/victor/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 275, in _constant_impl
return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
File "/home/victor/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 300, in _constant_eager_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "/home/victor/anaconda3/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 98, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Attempt to convert a value ({...}) with an unsupported type (<class 'dict'>) to a Tensor.
# ... are the dict values, which is really big so I've shortened it to `...`
[[{{node EagerPyFunc_1}}]] [Op:DatasetToSingleElement]
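Is there any way to process the list of dicts without eager execution (which TF Datasets do not allow)?
This list of dicts is not the input to my model. However, I cannot use it in my preprocessing function, because the error is raised before the value can be passed to any other function.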
Versions: Python 3.8, TensorFlow 2.3.1
OK, I think I have fixed it by running the dataset.map function eagerly:
dataset.map(lambda file, samples: tf.py_function(_load_and_preprocess, [file, samples], tf.variant))
This is described here: How can you map values in a tf.data.Dataset using a dictionary
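A different option, not taken from the post, is to avoid passing dicts through the pipeline at all. A tensor cannot hold a Python dict, but it can hold a byte string, so the list of dicts can be serialized to one JSON string and decoded later. The sketch below shows only the serialization round trip with the standard library; the helper names `encode_sample` and `decode_sample` are hypothetical.

```python
import json
from typing import Any, Dict, List

def encode_sample(sample: List[Dict[str, Any]]) -> bytes:
    """Pack a list of dicts into a single UTF-8 JSON byte string."""
    return json.dumps(sample).encode("utf-8")

def decode_sample(blob: bytes) -> List[Dict[str, Any]]:
    """Recover the list of dicts from the byte string."""
    return json.loads(blob.decode("utf-8"))

# Hypothetical sample standing in for the parsed JSON lines.
original = [{"text": "hello", "label": "1"}, {"text": "world", "label": "0"}]
blob = encode_sample(original)
restored = decode_sample(blob)
```

Under this approach, the function wrapped by `tf.py_function` would return the bytes from `encode_sample` with `tf.string` as the output type, and the decoding would happen on the consumer side on the value obtained from `.numpy()`. Whether that is preferable to `tf.variant` depends on how the dicts are used afterwards.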