Same Keras script works in WSL (Ubuntu) but not in Conda?

Posted 2024-06-08 21:28:48


I'm training a CNN to classify images. I have a Keras script that runs successfully in both a CPU and a GPU environment, but the network only learns in the CPU environment. In the GPU environment, the loss decreases after the first epoch but then stays constant for every epoch after that. Why?

Using Python 3.6, I have a tensorflow environment running in WSL (Ubuntu) and a tensorflow-gpu environment running in Conda. I've tried different architectures and different optimizers, but whatever the problem is, it only affects the GPU environment.

Edit: I created a CPU Conda environment and hit the same problem, so this seems to be a Conda-vs-WSL issue rather than a GPU-vs-CPU one. Also, a CPU epoch in Conda takes twice as long as a CPU epoch in WSL.

I commented out model.fit() to reduce the verbose output.

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, MaxPooling2D
from tensorflow.keras.models import Model

import utils  # local helper module (not shown) that loads the image dataset

(x_train, y_train), _, (x_test, y_test) = utils.load_data(limit=36)

input_image = Input(shape=(256, 256, 3))    
x = Conv2D(32, (3, 3), padding='same', activation='relu')(input_image)
x = Conv2D(32, (3, 3), activation='relu')(x)
x = MaxPooling2D()(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
output = Dense(9, activation='softmax')(x)
model = Model(inputs=input_image, outputs=output)

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])

# model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
# repeatedly train on the full 32-sample training set as a single batch
for i in range(25):
    print(str(i) + ': ' + str(model.train_on_batch(x_train, y_train)))
model.evaluate(x_train, y_train)
model.evaluate(x_test, y_test)
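For anyone trying to reproduce this, the `utils` module isn't shown. A purely hypothetical stand-in for `utils.load_data` could look like the following, assuming 256×256 RGB images and 9 one-hot classes to match the model's input and output shapes, and the 32/4 train/test split visible in the logs:

```python
import numpy as np

def load_data(limit=36, num_classes=9, size=256, seed=0):
    # Hypothetical stand-in for the unshown utils.load_data: returns
    # ((x_train, y_train), _, (x_test, y_test)) filled with random data.
    rng = np.random.RandomState(seed)

    def make_split(n):
        x = rng.rand(n, size, size, 3).astype("float32")
        labels = rng.randint(num_classes, size=n)
        y = np.eye(num_classes, dtype="float32")[labels]  # one-hot encode
        return x, y

    # 32 training and 4 test samples, matching the progress bars above
    return make_split(limit - 4), None, make_split(4)
```

Random data obviously won't reproduce the exact losses, but it makes the script runnable end to end in both environments.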

CPU training results:

0: [17.486359, 0.6875]
1: [61761.203, 0.28125]
2: [2228.4707, 0.71875]
3: [4440.558, 0.28125]
4: [1062.5581, 0.71875]
5: [481.29315, 0.71875]
6: [234.01581, 0.4375]
7: [170.98215, 0.71875]
8: [38.968575, 0.6875]
9: [8.086919, 0.75]
10: [5.7502546, 0.375]
11: [72.89319, 0.71875]
12: [13.203195, 0.6875]
13: [1.4184309, 0.875]
14: [9.258236, 0.46875]
15: [23.165062, 0.71875]
16: [8.963888, 0.78125]
17: [3.1053305, 0.84375]
18: [1.0664859, 0.96875]
19: [0.039992813, 1.0]
20: [0.023323938, 1.0]
21: [0.019487603, 1.0]
22: [0.01734325, 1.0]
23: [0.015670585, 1.0]
24: [0.014209943, 1.0]
32/32 [==============================] - 1s 19ms/sample - loss: 0.0129 - acc: 1.0000
4/4 [==============================] - 0s 20ms/sample - loss: 2.3463 - acc: 0.7500

I expected to see something similar to the above, but instead I got this strange result. GPU training results:

0: [8.630159, 0.1875]
1: [4.5332146, 0.71875]
2: [4.5332146, 0.71875]
3: [4.5332146, 0.71875]
4: [4.5332146, 0.71875]
5: [4.5332146, 0.71875]
6: [4.5332146, 0.71875]
7: [4.5332146, 0.71875]
8: [4.5332146, 0.71875]
9: [4.5332146, 0.71875]
10: [4.5332146, 0.71875]
11: [4.5332146, 0.71875]
12: [4.5332146, 0.71875]
13: [4.5332146, 0.71875]
14: [4.5332146, 0.71875]
15: [4.5332146, 0.71875]
16: [4.5332146, 0.71875]
17: [4.5332146, 0.71875]
18: [4.5332146, 0.71875]
19: [4.5332146, 0.71875]
20: [4.5332146, 0.71875]
21: [4.5332146, 0.71875]
22: [4.5332146, 0.71875]
23: [4.5332146, 0.71875]
24: [4.5332146, 0.71875]
32/32 [==============================] - 0s 4ms/sample - loss: 4.5332 - acc: 0.7188
4/4 [==============================] - 0s 19ms/sample - loss: 4.0295 - acc: 0.7500

I can't wait to see what silly mistake I've made.


1 Answer

I don't know what the actual problem was, but I do know that updating tensorflow-gpu from 1.13.1 fixed it.

In the Conda shell I ran pip install tf-nightly-gpu, and the network now trains as expected. I'd bet I didn't need the nightly build and could have just specified 1.14.0 (the tensorflow-gpu build I use in WSL), but whatever.
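Since the fix turned out to be a version bump from 1.13.1, a cheap guard at the top of the script can catch a stale environment early. A minimal sketch, where the 1.14.0 threshold is just the version that happened to work here, not a documented requirement:

```python
def version_tuple(v):
    # "1.13.1" -> (1, 13, 1); ignores pre-release suffixes like "1.14.0-rc1"
    return tuple(int(p) for p in v.split("-")[0].split(".") if p.isdigit())

MIN_OK = (1, 14, 0)  # assumption: the version that worked in WSL

def env_is_recent_enough(installed_version):
    return version_tuple(installed_version) >= MIN_OK

print(env_is_recent_enough("1.13.1"))  # -> False: the Conda build that misbehaved
print(env_is_recent_enough("1.14.0"))  # -> True: the WSL build that worked
```

In practice you would pass `tf.__version__` to the check and warn or raise when it fails, before spending time on a training run that silently won't learn.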
