Blas GEMM launch failed when using TensorFlow GPU with Keras

Posted 2024-06-16 10:24:06


Pretty self-explanatory. Like countless people before and after me, I get a Blas GEMM launch failed error message when trying to call model.fit().

This is the output of nvidia-smi before calling model.compile():

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   45C    P0    74W / 149W |      0MiB / 11441MiB |    100%      Default |   <<<--- 0% Memory usage
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |   <<<--- nothing running
+-----------------------------------------------------------------------------+

And the output of nvidia-smi after calling model.compile() but before model.fit():

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   45C    P0    72W / 149W |  10942MiB / 11441MiB |      0%      Default |   <<<--- 96% Memory usage
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1811      C   /usr/bin/python3                           10929MiB |   <<<--- TF model here
+-----------------------------------------------------------------------------+

It looks like the compiled TensorFlow model is claiming 96% of the GPU memory for itself. I don't know whether that is normal, or whether it could be the cause of the error that occurs later when trying to train the model.

The error message itself looks like this:

tensorflow/stream_executor/stream.cc:2041] attempting to perform BLAS operation using StreamExecutor without BLAS support

InternalError: Blas GEMM launch failed : a.shape=(32, 116032), b.shape=(116032, 256), m=32, n=256, k=116032 [[node dense_1/MatMul (defined at /home/ubuntu/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]] [Op:__inference_keras_scratch_graph_1645]

Function call stack: keras_scratch_graph

Output of tf.config.experimental.list_physical_devices():

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

The model is built with:

  • Keras 2.3.1 (using keras.models.Sequential)
  • TensorFlow GPU 2.1.0
  • CUDA 10.1
  • cuDNN 7.6.4
  • Ubuntu 18.04
  • AWS p2.xlarge instance (with a Tesla K80 GPU)

I have been through countless GitHub issues, blog posts, and S.O. questions, all about making sure no previously running process is still active on the GPU when starting a new one, or adding the CUPTI location to LD_LIBRARY_PATH, or using various TF options... none of which has fixed the problem. If you know what causes this and how to solve it, I would be very grateful.
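For reference, one of the TF options commonly suggested in those threads is enabling memory growth, so that TensorFlow allocates GPU memory on demand instead of reserving nearly all of it when the first op runs. A minimal sketch of that option (which, to be clear, is one of the things that did not fix it for me):

```python
import tensorflow as tf

# Ask TensorFlow to grow GPU memory allocations on demand instead of
# pre-allocating almost all device memory at startup.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    try:
        tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before the GPU has been initialized
        print(e)
```

On a machine without a GPU the loop simply does nothing, so the snippet is safe to leave in shared code.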


1 Answer

I had the same problem. I went through many answers and tried much of the suggested code, but nothing helped me.

For me, the problem was GPU memory usage, so I limited the amount of memory TensorFlow allocates on the GPU with the following code:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)

This comes from https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth. It solved my problem, and I hope it solves yours too.
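If you are on a newer TensorFlow release, the experimental calls above were later promoted to stable names, so (if I am not mistaken, from roughly TF 2.4 onward) the same 1 GB cap can be written as:

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Cap the first GPU at 1GB by exposing it as a single logical device
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        # Logical devices must be configured before the GPU is initialized
        print(e)
```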
