gcloud ML Engine Keras not running on GPU


I just started using Google Cloud Machine Learning Engine, and I am trying to train a Keras-based image-classification deep learning model on gcloud. To get a GPU on gcloud, I included 'tensorflow-gpu' in the install_requires of my setup.py. My cloud-gpu.yaml looks like this:

trainingInput:
  scaleTier: BASIC_GPU
  runtimeVersion: "1.0"
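
For reference, a minimal setup.py along those lines might look like the sketch below (the package name and version are placeholders; only the tensorflow-gpu requirement reflects what is described above):

from setuptools import find_packages, setup

setup(
    name='trainer',  # placeholder package name
    version='0.1',
    packages=find_packages(),
    # Pulls the GPU build of TensorFlow onto the training workers.
    install_requires=['tensorflow-gpu'],
)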

In my code I added:

with tf.device('/gpu:0'):

at the beginning, before any Keras code.
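
One quick way to check whether TensorFlow actually sees the GPU and places ops on it is sketched below (TF 1.x-era APIs; the session wiring is an assumption on my part, not something from the original post):

import tensorflow as tf
from tensorflow.python.client import device_lib

# List the devices TensorFlow can see; with tensorflow-gpu installed
# correctly, a '/device:GPU:0' entry should appear here.
print(device_lib.list_local_devices())

# Have TensorFlow log which device each op is placed on, and make
# Keras use that session.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
tf.keras.backend.set_session(sess)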

The result is that gcloud recognizes the GPU but does not use it, as you can see from the actual cloud training logs:

INFO    2018-11-18 12:19:59 -0600   master-replica-0        Epoch 1/20
INFO    2018-11-18 12:20:56 -0600   master-replica-0          1/219 [..............................] - ETA: 4:17:12 - loss: 0.8846 - acc: 0.5053 - f1_measure: 0.1043
INFO    2018-11-18 12:21:57 -0600   master-replica-0          2/219 [..............................] - ETA: 3:51:32 - loss: 0.8767 - acc: 0.5018 - f1_measure: 0.1013
INFO    2018-11-18 12:22:59 -0600   master-replica-0          3/219 [..............................] - ETA: 3:46:49 - loss: 0.8634 - acc: 0.5039 - f1_measure: 0.1010
INFO    2018-11-18 12:23:58 -0600   master-replica-0          4/219 [..............................] - ETA: 3:44:59 - loss: 0.8525 - acc: 0.5045 - f1_measure: 0.0991
INFO    2018-11-18 12:24:48 -0600   master-replica-0          5/219 [..............................] - ETA: 3:41:17 - loss: 0.8434 - acc: 0.5031 - f1_measure: 0.0992Sun Nov 18 18:24:48 2018       
INFO    2018-11-18 12:24:48 -0600   master-replica-0        +-----------------------------------------------------------------------------+
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |-------------------------------+----------------------+----------------------+
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |===============================+======================+======================|
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | N/A   32C    P0    56W / 149W |  10955MiB / 11441MiB |      0%      Default |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        +-------------------------------+----------------------+----------------------+
INFO    2018-11-18 12:24:48 -0600   master-replica-0                                                                                       
INFO    2018-11-18 12:24:48 -0600   master-replica-0        +-----------------------------------------------------------------------------+
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | Processes:                                                       GPU Memory |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |  GPU       PID   Type   Process name                             Usage      |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |=============================================================================|
INFO    2018-11-18 12:24:48 -0600   master-replica-0        +-----------------------------------------------------------------------------+

Basically, GPU utilization stays at 0% for the whole training run. How is that possible?


1 Answer

I'd suggest using standard_gpu, which gives you the same n1-standard-8 machine with one K80 GPU as the BASIC_GPU tier in your cloud-gpu.yaml:

trainingInput:
  scaleTier: CUSTOM
  # standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs
  masterType: standard_gpu
  runtimeVersion: "1.5"
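
With that config saved, the job is submitted along these lines (a sketch; the job name, module path, and bucket are placeholders):

gcloud ml-engine jobs submit training my_keras_job \
    --module-name trainer.task \
    --package-path trainer/ \
    --region us-central1 \
    --config cloud-gpu.yaml \
    --job-dir gs://my-bucket/output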

Also, this:

with tf.device('/gpu:0'):

should be:

with tf.device('/device:GPU:0'):

I'd recommend reviewing cnn_with_keras.py for a better example.
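
Putting it together, a minimal sketch of building a Keras model under an explicit GPU device scope (TF 1.x-style; the model architecture here is a hypothetical placeholder, not taken from the question):

import tensorflow as tf

with tf.device('/device:GPU:0'):
    # Hypothetical toy classifier; substitute the real image model.
    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
                               input_shape=(64, 64, 3)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

# model.fit(...) would then run with the graph ops pinned to the GPU.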
