Adding custom features to an LSTM in Keras

Posted 2024-05-08 04:30:03


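I am using Keras with the TensorFlow backend on Python 3.4.

I am trying to add custom features to my LSTM, which currently uses only word embeddings from Word2Vec.

I built an embedding matrix with Word2Vec. It is used as the embedding layer of my LSTM.

I now also have some additional features for both the training and test sets, and I would like to feed them into the model.

I want to merge these features into the LSTM model, but I do not know how to reshape the data.

Here is the relevant part of the code so far: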

labels = train['is_duplicate'].tolist()
ids = test['test_id'].tolist()

re_weight = True #for imbalance

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(q1 + q2 + q1_test + q2_test)

sequences_1 = tokenizer.texts_to_sequences(q1)
sequences_2 = tokenizer.texts_to_sequences(q2)

test_sequences_1 = tokenizer.texts_to_sequences(q1_test)
test_sequences_2 = tokenizer.texts_to_sequences(q2_test)

word_index = tokenizer.word_index
print('Found %s unique tokens' % len(word_index))
data_1 = pad_sequences(sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
data_2 = pad_sequences(sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
labels = np.array(labels)

print ("Elapsed time till loading word vectors", time()-start)

test_data_1 = pad_sequences(test_sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
test_data_2 = pad_sequences(test_sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
ids = np.array(ids)

data_1_train = np.vstack((data_1, data_2))
data_2_train = np.vstack((data_2, data_1))
train_features = np.vstack((other_features_train, other_features_train))
labels_train = np.concatenate((labels, labels))

nb_words = min(MAX_NB_WORDS, len(word_index))+1

nb_features = train_features.size

embedding_matrix = np.load('../data/embedding_matrix_lstm.npy')

print ("Elapsed time till creating embedding matrix", time()-start_wv)

print("Elapsed time", time()-start)

start_model = time()

embedding_layer = Embedding(nb_words,
        EMBEDDING_DIM,
        weights=[embedding_matrix],
        input_length=MAX_SEQUENCE_LENGTH,
        trainable=False)
lstm_layer = LSTM(num_lstm, dropout=rate_drop_lstm, recurrent_dropout=rate_drop_lstm)

sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences_1 = embedding_layer(sequence_1_input)
x1 = lstm_layer(embedded_sequences_1)

sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences_2 = embedding_layer(sequence_2_input)
y1 = lstm_layer(embedded_sequences_2)

other_features = Input(shape=(nb_features, ))

merged = concatenate([x1, y1, other_features])
merged = Dropout(rate_drop_dense)(merged)
merged = BatchNormalization()(merged)

merged = Dense(num_dense, activation=act)(merged)
merged = Dropout(rate_drop_dense)(merged)
merged = BatchNormalization()(merged)

preds = Dense(1, activation='sigmoid')(merged)


model = Model(inputs=[sequence_1_input, sequence_2_input, other_features],
        outputs=preds)
model.compile(loss='binary_crossentropy',
        optimizer='nadam',
        metrics=['acc'])

hist = model.fit([data_1_train, data_2_train, other_features], labels_train,
        validation_split=0.1,
        epochs=200, batch_size=2048, shuffle=True,
        class_weight=class_weight, callbacks=[early_stopping, model_checkpoint])

model.load_weights(bst_model_path)
bst_val_score = min(hist.history['val_loss'])

# Predict

preds = model.predict([test_data_1, test_data_2, other_features_test], batch_size=8192, verbose=1)
preds += model.predict([test_data_2, test_data_1, other_features_test], batch_size=8192, verbose=1)
preds /= 2

other_features_train and other_features_test are NumPy arrays of shape (train_length, 5) and (test_length, 5).
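Given those shapes, it may help to check what `nb_features = train_features.size` actually evaluates to. The sketch below uses a small zero-filled array as a hypothetical stand-in for other_features_train (4 rows instead of train_length) and compares `.size` with `.shape[1]`. `.size` counts every element in the array, while `Input(shape=(nb_features,))` expects the number of columns per sample.

```python
import numpy as np

# Hypothetical stand-in for other_features_train: 4 samples, 5 features each.
other_features_train = np.zeros((4, 5))

# Stack the features twice, as the question does for the doubled pairs.
train_features = np.vstack((other_features_train, other_features_train))

# .size is rows * columns; .shape[1] is the per-sample feature count.
total_elements = train_features.size
nb_features = train_features.shape[1]

print(train_features.shape, total_elements, nb_features)
```

With the real arrays, `.size` would be 2 * train_length * 5, so the Input layer would be declared far wider than the 5 columns each sample actually carries.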

This is the error I get:

[error output not preserved in the source]

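I suspect this is a problem with the shape of the input data, because I stack data_1 and data_2 twice, in opposite orders. The reasoning is that if q1 is a duplicate of q2, then q2 is also a duplicate of q1. I do not know how to express this to the model.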
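The double stacking described above can be sketched in isolation to confirm that every array ends up with the same number of rows. `symmetrize` is a hypothetical helper name, and the toy arrays are made up for illustration. The sketch also assumes the extra features do not depend on the order of q1 and q2; if they did, the second copy would need the swapped version of each feature.

```python
import numpy as np

def symmetrize(data_1, data_2, features, labels):
    """Return both orderings of each pair, with features and labels row-aligned."""
    d1 = np.vstack((data_1, data_2))         # (q1, q2) rows followed by (q2, q1) rows
    d2 = np.vstack((data_2, data_1))
    f = np.vstack((features, features))      # assumes order-independent features
    y = np.concatenate((labels, labels))     # duplicate status is symmetric
    return d1, d2, f, y

# Toy inputs: 2 pairs, sequences of length 3, 2 extra features per pair.
data_1 = np.array([[1, 2, 3], [4, 5, 6]])
data_2 = np.array([[7, 8, 9], [10, 11, 12]])
features = np.array([[0.1, 0.2], [0.3, 0.4]])
labels = np.array([1, 0])

d1, d2, f, y = symmetrize(data_1, data_2, features, labels)
print(d1.shape, d2.shape, f.shape, y.shape)
```

Row i of all four outputs then describes the same pair, which is what a multi-input model needs when the arrays are passed to it together.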

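I am not sure what the "Dimension" type is. I just want to understand how to shape the features so they can be used together with the word embeddings (the embedding matrix).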
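On the shaping question: in the model above, the extra features never touch the embedding matrix. The embeddings are consumed by the LSTM, and the features are joined with the LSTM outputs in the `concatenate` call. The NumPy sketch below imitates only that concatenation step, using made-up sizes and constant arrays as stand-ins for the LSTM outputs; it is not a Keras model.

```python
import numpy as np

# Made-up sizes for illustration only.
batch, num_lstm, nb_features = 3, 4, 5

# Stand-ins for the two LSTM outputs (one vector per question per sample).
x1 = np.ones((batch, num_lstm))
y1 = np.ones((batch, num_lstm))

# Stand-in for one batch of the extra features: one row per sample.
other = np.zeros((batch, nb_features))

# Joining on the last axis gives 2 * num_lstm + nb_features columns.
merged = np.concatenate([x1, y1, other], axis=-1)
print(merged.shape)
```

So the features only need shape (number_of_samples, 5), with rows in the same order as the sequence arrays. Separately, `model.fit` expects arrays as its inputs, so the third entry would presumably be train_features rather than the `other_features` Input placeholder.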

