NaN in summary histogram during model training: model/Training/dense/kernel/gradients
I am building a dataset with 45 features, all of them numeric and normalized so that the values range between -1 and 1.
Here is the normalization step:
def normalize(train, test, cv):
    normalized_train = (train - train.mean()) / train.std()
    normalized_test = (test - test.mean()) / test.std()
    normalized_cv = (cv - cv.mean()) / cv.std()
    return normalized_train, normalized_test, normalized_cv

X_train, X_test, X_cv = normalize(X_train, X_test, X_cv)
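(As an aside, a common convention is to compute the mean and standard deviation on the training split only and reuse them for the test and CV splits, so that all three share the same scale. A minimal NumPy sketch of that variant; the function name and sample arrays here are made up for illustration:)

```python
import numpy as np

def normalize_with_train_stats(train, test, cv):
    # Compute the statistics on the training split only, then reuse
    # them for the other splits so all three share the same scale.
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (test - mean) / std, (cv - mean) / std

train = np.array([[1.0, 10.0], [3.0, 30.0]])
test = np.array([[2.0, 20.0]])
cv = np.array([[3.0, 10.0]])
n_train, n_test, n_cv = normalize_with_train_stats(train, test, cv)
```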
Next, I build the TensorFlow dataset and iterator and pass them to my model. Here is the model:
with tf.name_scope('model'):
    regularizer = tf.contrib.layers.l2_regularizer(scale=0.1)
    net = tf.layers.dense(features, 40, activation=tf.nn.relu, kernel_regularizer=regularizer,
                          kernel_initializer=tf.contrib.layers.xavier_initializer())
    net = tf.layers.dense(net, 60, activation=tf.nn.relu, kernel_regularizer=regularizer,
                          kernel_initializer=tf.contrib.layers.xavier_initializer())
    net = tf.layers.dense(net, 30, activation=tf.nn.relu, kernel_regularizer=regularizer,
                          kernel_initializer=tf.contrib.layers.xavier_initializer())
    net = tf.layers.dense(net, 12, activation=tf.nn.relu, kernel_regularizer=regularizer,
                          kernel_initializer=tf.contrib.layers.xavier_initializer())
    prediction = tf.layers.dense(net, 2, activation=tf.nn.sigmoid)
Finally, I set up the loss function, the optimizer, and the gradient computation, and then apply the gradients:
with tf.name_scope('Loss'):
    loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=prediction)
    tf.summary.scalar('Loss', loss)

with tf.name_scope('Training'):
    opt = tf.train.AdamOptimizer(learning_rate=learning_rate)
    grads = opt.compute_gradients(loss)
    for grad, var in grads:
        if grad is not None:
            tf.summary.histogram(var.op.name + '/gradients', grad)
    train_op = opt.apply_gradients(grads)
When I run this, I get the following error:
Caused by op 'model/Training/dense/kernel/gradients', defined at:
File "c:\Users\123456\Google Drive\Projects\GIT\Churn_TF\churn_1.2_local_dataset.py", line 103, in <module>
tf.summary.histogram(var.op.name + '/gradients', grad)
File "C:\Users\123456\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\summary\summary.py", line 193, in histogram
tag=tag, values=values, name=scope)
File "C:\Users\123456\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\ops\gen_logging_ops.py", line 215, in _histogram_summary
"HistogramSummary", tag=tag, values=values, name=name)
File "C:\Users\123456\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\123456\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 3160, in create_op
op_def=op_def)
File "C:\Users\123456\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Nan in summary histogram for: model/Training/dense/kernel/gradients
[[Node: model/Training/dense/kernel/gradients = HistogramSummary[T=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/device:CPU:0"](model/Training/dense/kernel/gradients/tag, model/Training/gradients/model/dense/MatMul_grad/tuple/control_dependency_1/_101)]]
The problem is the NaN in the summary histogram for model/Training/dense/kernel/gradients.
From what I understand, this could be an exploding-gradient problem, but how do I debug it? After all, the whole point of creating these histograms is to watch how my gradients evolve.
Also, since I normalized the inputs and added regularization, I am surprised by this result... or could it be that my gradients are getting too small instead?
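One way to distinguish exploding from vanishing gradients is to log a single scalar per step: the global gradient norm (in TF 1.x this is what tf.global_norm computes, and it can be fed to tf.summary.scalar). A minimal NumPy sketch of the quantity itself, independent of TensorFlow and using made-up gradient arrays:

```python
import numpy as np

def global_norm(grads):
    # Square root of the sum of squared entries over all gradient
    # tensors: the same scalar tf.global_norm reports.
    return float(np.sqrt(sum(np.sum(np.square(g)) for g in grads)))

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
print(global_norm(grads))  # sqrt(9 + 16) = 5.0
```

If this scalar grows without bound over training, the gradients are exploding; if it collapses toward zero, they are vanishing.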
I also tried replacing tf.nn.relu with tf.nn.leaky_relu, but that resulted in a float64-to-float32 conversion error that I don't know how to resolve... Any suggestions on how to fix this?
1 Answer
Judging from this line of your code

prediction = tf.layers.dense(net, 2, activation=tf.nn.sigmoid)

I assume you are working on a binary classification problem and that the activation of your output layer is a sigmoid. However, your loss function is tf.losses.softmax_cross_entropy. So, first of all, I would suggest using tf.losses.sigmoid_cross_entropy instead. Note that this function (like tf.losses.softmax_cross_entropy) expects (unscaled) logits, so in your case you should take the final layer's output before the sigmoid nonlinearity is applied.
So I suggest changing the line

prediction = tf.layers.dense(net, 2, activation=tf.nn.sigmoid)

to

logits = tf.layers.dense(net, 2)
prediction = tf.nn.sigmoid(logits)  # only needed if you use the predictions elsewhere

and the loss to

loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=labels, logits=logits)

(note that the labels keyword of tf.losses.sigmoid_cross_entropy is multi_class_labels, not onehot_labels). Maybe this alone fixes your problem. If not, what learning rate are you using? I usually run into this error when the learning rate is too large.
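For some intuition on why these losses want unscaled logits: internally, tf.nn.sigmoid_cross_entropy_with_logits evaluates the cross-entropy in a rearranged form that never takes the log of a saturated sigmoid. A NumPy sketch of that idea (the function names here are hypothetical):

```python
import numpy as np

def sigmoid_xent_with_logits(logits, labels):
    # Stable rearrangement of -z*log(sigmoid(x)) - (1-z)*log(1-sigmoid(x)):
    #   max(x, 0) - x*z + log(1 + exp(-|x|))
    # Never evaluates log(0), even for large-magnitude logits where
    # sigmoid(x) saturates to exactly 0.0 or 1.0 in floating point.
    x, z = np.asarray(logits, float), np.asarray(labels, float)
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

def naive_xent(logits, labels):
    # Naive version: apply the sigmoid first, then take logs.
    x, z = np.asarray(logits, float), np.asarray(labels, float)
    p = 1.0 / (1.0 + np.exp(-x))
    with np.errstate(divide='ignore'):
        return -z * np.log(p) - (1 - z) * np.log(1 - p)

x = np.array([0.0, 50.0, -50.0])  # logits, including saturating ones
z = np.array([1.0, 0.0, 1.0])     # labels
print(sigmoid_xent_with_logits(x, z))  # finite everywhere
print(naive_xent(x, z))                # contains inf where sigmoid saturates
```

This is also why the NaNs show up in the gradient histograms first: once the loss goes non-finite, every gradient flowing back from it does too.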