TPU connection problem when training a TF model on Google Colab

Question

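I have built a working TensorFlow neural network model that trains successfully on both CPU and GPU. Because the dataset is large, I now want to train the model on a TPU. I initialized the TPU strategy in the usual way: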

tpu_resolver = tf.distribute.cluster_resolver.TPUClusterResolver()  # Automatically detects the TPU
tf.config.experimental_connect_to_cluster(tpu_resolver)  # Connects to the TPU cluster
tf.tpu.experimental.initialize_tpu_system(tpu_resolver)  # Initializes the TPU system
strategy = tf.distribute.TPUStrategy(tpu_resolver)
tpu_device = tpu_resolver.master()  # Retrieves the TPU device URI
print("Running on TPU:", tpu_device)

This printed the following:

Running on TPU: grpc://10.74.203.82:8470

However, when I train my model under strategy.scope(), the following error appears and training stops.

err: File "/content/SeniorHonoursProject/BaCoN-II/train.py", line 169, in my_train
err: new_history = model.fit(train_dataset.dataset, epochs=epochs,
err: File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
err: raise e.with_traceback(filtered_tb) from None
err: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py", line 362, in _numpy
err: raise core._status_to_exception(e) from None  # pylint: disable=protected-access
err: tensorflow.python.framework.errors_impl.InternalError: 8 root error(s) found.
err: (0) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[TPUReplicate/_compile/_9902494219978988908/_4/_384]]
err: (1) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[Pad_3/_250]]
err: (2) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[Pad_15/_466]]
err: (3) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}] ... [truncated]
err: Exception ignored in atexit callback: <function async_wait at 0x7ec0f4a74790>
err: Traceback (most recent call last):
err: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/context.py", line 2833, in async_wait
err: context().sync_executors()
err: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/context.py", line 749, in sync_executors
err: pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
err: tensorflow.python.framework.errors_impl.InternalError: 8 root error(s) found.
err: (0) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[TPUReplicate/_compile/_9902494219978988908/_4/_384]]
err: (1) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[Pad_3/_250]]
err: (2) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[Pad_15/_466]]
err: (3) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}] ... [truncated]
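Every root error in the log above names the same endpoint, ipv4:127.0.0.1:50188, which is a localhost address rather than the grpc://10.74.203.82:8470 address printed for the TPU. To make that easier to see in a long traceback, a small helper along these lines could list the distinct endpoints that refused a connection. This is only a sketch: the function name `refused_endpoints` and the regular expression are illustrative and are not part of train.py.

```python
import re

# Hypothetical helper (not part of the original script): matches the
# "ipv4:<host>:<port>: Failed to connect to remote host" fragments that
# gRPC writes into the TensorFlow error message.
ENDPOINT_RE = re.compile(
    r"ipv4:(\d{1,3}(?:\.\d{1,3}){3}):(\d+): Failed to connect to remote host"
)


def refused_endpoints(log_text):
    """Return the distinct (host, port) pairs that refused a connection,
    in the order they first appear in the log text."""
    seen = []
    for host, port in ENDPOINT_RE.findall(log_text):
        endpoint = (host, int(port))
        if endpoint not in seen:
            seen.append(endpoint)
    return seen


# One line copied from the traceback above, used as sample input.
sample = (
    "(0) INTERNAL: {{function_node __inference_train_function_11526}} "
    "failed to connect to all addresses; last error: UNKNOWN: "
    "ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused"
)

print(refused_endpoints(sample))  # [('127.0.0.1', 50188)]
```

Running it over the full traceback instead of the single sample line would show whether any endpoint other than the localhost one is involved.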

Does anyone know how to fix this? This is how I initialize the model and run training:

with strategy.scope():
  n_batches_eff = training_dataset.n_batches // strategy.num_replicas_in_sync
  lr_fn = tf.optimizers.schedules.ExponentialDecay(FLAGS.lr, n_batches_eff, FLAGS.decay)
  optimizer = tf.keras.optimizers.Adam(lr_fn)

with strategy.scope():
    model = make_model(...)  # custom model-building function; arguments omitted
    if FLAGS.bayesian:
        loss = BayesianLoss(
            n_train_examples=training_dataset.n_batches * training_dataset.batch_size,
            n_val_examples=validation_dataset.n_batches * validation_dataset.batch_size,
            TPU=FLAGS.TPU,
        )
        loss.set_model(model)
    else:
        if FLAGS.TPU:
            # No reduction on TPU: the loss is returned per example
            loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
        else:
            loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
    model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])




with strategy.scope():
    val_steps_per_epoch = val_dataset.n_batches // strategy.num_replicas_in_sync
    train_steps_per_epoch = train_dataset.n_batches // strategy.num_replicas_in_sync
    new_history = model.fit(
        train_dataset.dataset, epochs=epochs,
        validation_data=val_dataset.dataset,
        callbacks=[callback],
        steps_per_epoch=train_steps_per_epoch,
        validation_steps=val_steps_per_epoch,
        initial_epoch=last_epoch,
    )
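The step counts passed to model.fit can be sanity-checked without a TPU. The sketch below assumes that `n_batches` counts batches of the per-replica size `batch_size`, so dividing by `num_replicas_in_sync` gives the number of global steps per epoch. The function names and the numbers (10,000 examples, batch size 32, 8 replicas) are made up for illustration.

```python
def steps_from_examples(n_examples, per_replica_batch_size, num_replicas):
    """Global steps per epoch when each step consumes one global batch
    (per-replica batch size times number of replicas). Incomplete
    global batches are dropped by the floor division."""
    global_batch_size = per_replica_batch_size * num_replicas
    return n_examples // global_batch_size


def steps_from_n_batches(n_batches, num_replicas):
    """Same quantity, computed the way the training code above does it:
    number of per-replica batches divided by the number of replicas."""
    return n_batches // num_replicas


# Illustrative numbers only (assumptions, not taken from the real dataset).
n_examples = 10_000
per_replica_batch_size = 32
num_replicas = 8

n_batches = n_examples // per_replica_batch_size  # 312 per-replica batches
print(steps_from_n_batches(n_batches, num_replicas))  # 39
print(steps_from_examples(n_examples, per_replica_batch_size, num_replicas))  # 39
```

Both routes agree under that assumption. If `n_batches` were instead counted in global batches, dividing by the replica count again would request too few steps per epoch.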

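I have a fairly complex data-processing pipeline, but I believe the dataset creation itself should run on the CPU. I then cache the dataset in memory so that the TPU can access it: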

with self.strategy.scope():
    if self.shuffle:
        dataset = dataset.shuffle(buffer_size=len(list_IDs))
    dataset.cache()  # note: the returned dataset is not assigned back to `dataset`
    global_batchsize = self.batch_size * self.strategy.num_replicas_in_sync
    global_batchsize = tf.cast(global_batchsize, dtype=tf.int64)
    dataset = dataset.batch(global_batchsize)
    dataset = dataset.map(self.normalize_and_onehot, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    dataset = self.strategy.experimental_distribute_dataset(dataset)

data processing, tensorflow, neural network, tpu, training strategy, google colab, model optimization
