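TPU connection problem when training a TF model in Google Colab
I have successfully built a working TensorFlow neural network model on CPU and GPU. Because the dataset is large, I now want to train this model on a TPU. I initialized the TPU strategy in the usual way: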
tpu_resolver = tf.distribute.cluster_resolver.TPUClusterResolver() # Automatically detects the TPU
tf.config.experimental_connect_to_cluster(tpu_resolver) # Connects to the TPU cluster
tf.tpu.experimental.initialize_tpu_system(tpu_resolver) # Initializes the TPU system
strategy = tf.distribute.TPUStrategy(tpu_resolver)
tpu_device = tpu_resolver.master() # Retrieves the TPU device URI
print("Running on TPU:", tpu_device)
This printed the following output:
Running on TPU: grpc://10.74.203.82:8470
However, when I train my model under strategy.scope(), the following error is raised and training stops.
err: File "/content/SeniorHonoursProject/BaCoN-II/train.py", line 169, in my_train
err: new_history = model.fit(train_dataset.dataset, epochs=epochs,
err: File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
err: raise e.with_traceback(filtered_tb) from None
err: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py", line 362, in _numpy
err: raise core._status_to_exception(e) from None # pylint: disable=protected-access
err: tensorflow.python.framework.errors_impl.InternalError: 8 root error(s) found.
err: (0) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[TPUReplicate/_compile/_9902494219978988908/_4/_384]]
err: (1) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[Pad_3/_250]]
err: (2) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[Pad_15/_466]]
err: (3) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}] ... [truncated]
err: Exception ignored in atexit callback: <function async_wait at 0x7ec0f4a74790>
err: Traceback (most recent call last):
err: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/context.py", line 2833, in async_wait
err: context().sync_executors()
err: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/context.py", line 749, in sync_executors
err: pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
err: tensorflow.python.framework.errors_impl.InternalError: 8 root error(s) found.
... [root errors (0) to (3) repeat here verbatim, exactly as in the first traceback above]
Does anyone know how to solve this? This is how I initialize the model and run the training:
with strategy.scope():
    n_batches_eff = training_dataset.n_batches // strategy.num_replicas_in_sync
    lr_fn = tf.optimizers.schedules.ExponentialDecay(FLAGS.lr, n_batches_eff, FLAGS.decay)
    optimizer = tf.keras.optimizers.Adam(lr_fn)
with strategy.scope():
    model = make_model(...)  # custom model-building function; arguments omitted
    if FLAGS.bayesian:
        loss = BayesianLoss(
            n_train_examples=training_dataset.n_batches * training_dataset.batch_size,
            n_val_examples=validation_dataset.n_batches * validation_dataset.batch_size,
            TPU=FLAGS.TPU)
        loss.set_model(model)
    else:
        if FLAGS.TPU:
            loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
        else:
            loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
    model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
with strategy.scope():
    val_steps_per_epoch = val_dataset.n_batches // strategy.num_replicas_in_sync
    train_steps_per_epoch = train_dataset.n_batches // strategy.num_replicas_in_sync
    new_history = model.fit(train_dataset.dataset, epochs=epochs,
                            validation_data=val_dataset.dataset,
                            callbacks=[callback], steps_per_epoch=train_steps_per_epoch,
                            validation_steps=val_steps_per_epoch, initial_epoch=last_epoch)
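I have a relatively complex data-processing pipeline, but I believe the dataset creation should happen on the CPU. I then cache the dataset in memory so that the TPU can access it: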
with self.strategy.scope():
    if self.shuffle:
        dataset = dataset.shuffle(buffer_size=len(list_IDs))
    # cache() returns a new dataset, so the result has to be reassigned
    dataset = dataset.cache()
    global_batchsize = self.batch_size * self.strategy.num_replicas_in_sync
    global_batchsize = tf.cast(global_batchsize, dtype=tf.int64)
    dataset = dataset.batch(global_batchsize)
    dataset = dataset.map(self.normalize_and_onehot, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    dataset = self.strategy.experimental_distribute_dataset(dataset)
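For reference, the same pipeline shape can be sketched in a form that runs without a TPU. This is only an illustration under assumptions: `tf.distribute.get_strategy()` stands in for the TPU strategy (it returns the default single-replica strategy when none is active), and `per_replica_batch_size`, `to_example`, and the `Dataset.range` source are hypothetical placeholders, not the pipeline from the question.

```python
import tensorflow as tf

# Default single-replica strategy; on a TPU runtime this would be the TPUStrategy.
strategy = tf.distribute.get_strategy()

per_replica_batch_size = 4  # hypothetical value
# Each replica receives global_batch_size / num_replicas_in_sync examples per step.
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync


def to_example(i):
    # Hypothetical preprocessing applied to a whole batch of indices:
    # a float feature and a 3-class one-hot label.
    return tf.cast(i, tf.float32), tf.one_hot(i % 3, depth=3)


dataset = tf.data.Dataset.range(16)
dataset = dataset.cache()                      # every transformation returns a new dataset
dataset = dataset.shuffle(buffer_size=16)      # shuffling after cache() reshuffles each epoch
dataset = dataset.batch(global_batch_size, drop_remainder=True)
dataset = dataset.map(to_example, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.prefetch(tf.data.AUTOTUNE)

dist_dataset = strategy.experimental_distribute_dataset(dataset)

for features, labels in dist_dataset:
    print(features.shape, labels.shape)
```

Two choices in the sketch are deliberate. Every `tf.data` transformation returns a new dataset, so each result is reassigned. Placing `cache()` before `shuffle()` means the cached elements are reshuffled on every epoch, whereas caching after shuffling would freeze the first epoch's order.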
1 Answer
The problem was that the dataset was being produced by a generator function. After switching to from_tensor_slices, the problem was resolved.
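A likely explanation, offered as an assumption rather than a confirmed diagnosis: `tf.data.Dataset.from_generator` runs the generator through a Python callback that must execute in the same process as the program that created the dataset. A remote TPU worker cannot reach that callback, which would be consistent with the "Connection refused" on 127.0.0.1 in the traceback. A minimal sketch of the switch follows, assuming the whole dataset fits in host memory. `sample_generator` and its shapes are hypothetical stand-ins for the original generator.

```python
import numpy as np
import tensorflow as tf


def sample_generator(n_samples=32, n_features=4, n_classes=3):
    """Hypothetical stand-in for the original Python generator."""
    rng = np.random.default_rng(0)
    for _ in range(n_samples):
        features = rng.normal(size=n_features).astype(np.float32)
        label = int(rng.integers(n_classes))
        yield features, label


# Run the generator once on the host and stack its output into arrays.
# Assumption: everything fits in host memory.
features, labels = zip(*sample_generator())
features = np.stack(features)
labels = np.asarray(labels, dtype=np.int32)

# from_tensor_slices stores the arrays in the dataset itself, so iterating
# over it does not call back into Python.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=len(labels)).batch(8, drop_remainder=True)

for batch_features, batch_labels in dataset.take(1):
    print(batch_features.shape, batch_labels.shape)
```

For data too large to materialize this way, writing it to TFRecord files and reading them with `tf.data.TFRecordDataset` is a commonly used alternative that also avoids the Python callback.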