Design choices in a PyTorch LSTM cell implementation

import math
import torch as th
import torch.nn as nn


class LSTM(nn.Module):

    def __init__(self, input_size, hidden_size, bias=True):
        super(LSTM, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        self.i2h = nn.Linear(input_size, 4 * hidden_size, bias=bias)
        self.h2h = nn.Linear(hidden_size, 4 * hidden_size, bias=bias)
        self.reset_parameters()

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

    def forward(self, x, hidden):
        h, c = hidden
        h = h.view(h.size(1), -1)
        c = c.view(c.size(1), -1)
        x = x.view(x.size(1), -1)

        # Linear mappings
        preact = self.i2h(x) + self.h2h(h)

        # activations
        gates = preact[:, :3 * self.hidden_size].sigmoid()
        g_t = preact[:, 3 * self.hidden_size:].tanh()
        i_t = gates[:, :self.hidden_size]
        f_t = gates[:, self.hidden_size:2 * self.hidden_size]
        o_t = gates[:, -self.hidden_size:]

        c_t = th.mul(c, f_t) + th.mul(i_t, g_t)

        h_t = th.mul(o_t, c_t.tanh())

        h_t = h_t.view(1, h_t.size(0), -1)
        c_t = c_t.view(1, c_t.size(0), -1)
        return h_t, (h_t, c_t)

1 answer

User

Reply #1 · Posted on 2024-04-20 05:28:57

1- Why multiply the hidden size by 4 for both self.i2h and self.h2h (in the init method)

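In the equations you included, the input x and the hidden state h are each used in four calculations, and every one of those is a matrix multiplication with a weight. Whether you perform four separate matrix multiplications, or concatenate the weights, perform one larger matrix multiplication and split the result afterwards, the outcome is the same.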

import torch

input_size = 5
hidden_size = 10

input = torch.randn((2, input_size))

# Two different weights
w_c = torch.randn((hidden_size, input_size))
w_i = torch.randn((hidden_size, input_size))

# Concatenated weights into one tensor
# with size:[2 * hidden_size, input_size]
w_combined = torch.cat((w_c, w_i), dim=0)

# Output calculated by using separate matrix multiplications
out_c = torch.matmul(w_c, input.transpose(0, 1))
out_i = torch.matmul(w_i, input.transpose(0, 1))

# One bigger matrix multiplication with the combined weights
out_combined = torch.matmul(w_combined, input.transpose(0, 1))
# The first hidden_size number of rows belong to w_c
out_combined_c = out_combined[:hidden_size]
# The second hidden_size number of rows belong to w_i
out_combined_i = out_combined[hidden_size:]

# Using torch.allclose because they are equal besides floating point errors.
torch.allclose(out_c, out_combined_c) # => True
torch.allclose(out_i, out_combined_i) # => True

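By setting the output size of the linear layer to 4 * hidden_size, the layer holds four weight blocks of hidden_size rows each, so one layer is needed instead of four. There is no real advantage to doing this apart from a possible small performance gain, mostly for smaller inputs that would not fully use the available parallelism if the four multiplications were done separately.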
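The same point can be checked on an actual nn.Linear layer rather than on raw tensors. The following is only a small sketch: the sizes are arbitrary, and the layer named i2h is a stand-in for self.i2h from the question, not the module itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
input_size, hidden_size, batch_size = 5, 10, 2  # arbitrary sizes for illustration

# One layer producing all four pre-activations at once, like self.i2h.
i2h = nn.Linear(input_size, 4 * hidden_size)
x = torch.randn(batch_size, input_size)
combined = i2h(x)  # shape: [batch_size, 4 * hidden_size]

# Split the stacked weight and bias into four per-gate blocks.
weights = torch.chunk(i2h.weight, 4, dim=0)  # each [hidden_size, input_size]
biases = torch.chunk(i2h.bias, 4, dim=0)     # each [hidden_size]

# Apply each block as if it were its own small linear layer.
separate = [F.linear(x, w, b) for w, b in zip(weights, biases)]

# Laid side by side, the four small outputs match the single large one.
matches = torch.allclose(torch.cat(separate, dim=1), combined, atol=1e-6)
print(matches)
```

Note that nn.Linear stores its weight as [out_features, in_features], which is why the blocks are split along dim=0 of the weight but end up along dim=1 of the output.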

4- I'm also confused about the column bounds in the activations part of the forward method. As an example, why do we upper bound with 3 * self.hidden_size for gates?

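Here the output is split so that each part corresponds to the output of one of the four individual calculations. The output is the concatenation [i_t; f_t; o_t; g_t], before sigmoid and tanh respectively have been applied.

The same split can be obtained by using torch.chunk to divide the output into four chunks: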

i_t, f_t, o_t, g_t = torch.chunk(preact, 4, dim=1)

After the split, however, you still have to apply torch.sigmoid to i_t, f_t and o_t, and torch.tanh to g_t.
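To make the column bounds concrete, here is a short sketch comparing the slicing used in the forward method with the torch.chunk version. The tensor preact is filled with random values as a stand-in for self.i2h(x) + self.h2h(h), and the sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
batch_size, hidden_size = 2, 10  # arbitrary sizes for illustration

# Stand-in for self.i2h(x) + self.h2h(h); random values are enough here.
preact = torch.randn(batch_size, 4 * hidden_size)

# Slicing as in the forward method: the first 3 * hidden_size columns are
# the three sigmoid gates, the last hidden_size columns are the tanh part.
gates = preact[:, :3 * hidden_size].sigmoid()
g_slice = preact[:, 3 * hidden_size:].tanh()
i_slice = gates[:, :hidden_size]
f_slice = gates[:, hidden_size:2 * hidden_size]
o_slice = gates[:, -hidden_size:]

# The same split with torch.chunk, applying the activations afterwards.
i_raw, f_raw, o_raw, g_raw = torch.chunk(preact, 4, dim=1)
i_t = torch.sigmoid(i_raw)
f_t = torch.sigmoid(f_raw)
o_t = torch.sigmoid(o_raw)
g_t = torch.tanh(g_raw)

pairs = [(i_slice, i_t), (f_slice, f_t), (o_slice, o_t), (g_slice, g_t)]
same = all(torch.allclose(a, b) for a, b in pairs)
print(same)
```

The slicing version applies sigmoid once to a [batch_size, 3 * hidden_size] block instead of three times to smaller blocks, which is why the upper bound for gates is 3 * self.hidden_size.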

5- Where are all the parameters of the LSTM? I'm talking about the Us and Ws here:

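The parameters W are the weights of the linear layer self.i2h and the parameters U are the weights of the linear layer self.h2h, in both cases stored in concatenated form.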

W_i, W_f, W_o, W_c = torch.chunk(self.i2h.weight, 4, dim=0)

U_i, U_f, U_o, U_c = torch.chunk(self.h2h.weight, 4, dim=0)
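As a rough illustration of where the parameters end up, the sketch below builds two stand-alone layers with the same shapes as self.i2h and self.h2h (arbitrary sizes, not the question's module) and inspects the chunks. The names b_i, b_f, b_o and b_c are introduced here only for illustration: with bias=True both layers carry their own bias, so the effective bias of each gate is the sum of the two.

```python
import torch
import torch.nn as nn

input_size, hidden_size = 5, 10  # arbitrary sizes for illustration

# Stand-ins for self.i2h and self.h2h.
i2h = nn.Linear(input_size, 4 * hidden_size, bias=True)
h2h = nn.Linear(hidden_size, 4 * hidden_size, bias=True)

# Input-to-hidden weights W, one [hidden_size, input_size] block per gate.
W_i, W_f, W_o, W_c = torch.chunk(i2h.weight, 4, dim=0)
# Hidden-to-hidden weights U, one [hidden_size, hidden_size] block per gate.
U_i, U_f, U_o, U_c = torch.chunk(h2h.weight, 4, dim=0)
# Effective per-gate biases (hypothetical names): both layer biases add up.
b_i, b_f, b_o, b_c = torch.chunk(i2h.bias + h2h.bias, 4, dim=0)

# Count every learnable scalar held by the two layers.
total = sum(p.numel() for layer in (i2h, h2h) for p in layer.parameters())
expected = 4 * hidden_size * (input_size + hidden_size + 2)
print(W_i.shape, U_i.shape, b_i.shape, total, expected)
```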

3- Why do we use view for h, c, and x in the forward method?

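According to h_t = h_t.view(1, h_t.size(0), -1) near the end, the hidden states have size [1, batch_size, hidden_size]. The call h = h.view(h.size(1), -1) gets rid of the leading singleton dimension, giving size [batch_size, hidden_size]. The same can be achieved with torch.squeeze.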
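A small sketch of the shape handling, assuming a hidden state of size [1, batch_size, hidden_size] with arbitrary sizes. It compares the view call from the forward method with squeeze(0), which is one alternative for dropping a leading singleton dimension.

```python
import torch

batch_size, hidden_size = 3, 10  # arbitrary sizes for illustration
h = torch.randn(1, batch_size, hidden_size)

# Drop the leading singleton dimension, as in the forward method.
h_view = h.view(h.size(1), -1)  # [batch_size, hidden_size]
# Alternative with the same result when the first dimension has size 1.
h_squeezed = h.squeeze(0)       # [batch_size, hidden_size]

# Add the dimension back, as done for h_t and c_t before returning.
h_restored = h_view.view(1, h_view.size(0), -1)  # [1, batch_size, hidden_size]
h_unsqueezed = h_squeezed.unsqueeze(0)           # [1, batch_size, hidden_size]

print(h_view.shape, h_squeezed.shape, h_restored.shape)
```

One difference worth keeping in mind: if the first dimension is not 1, view(h.size(1), -1) still reshapes the tensor (for example [2, 3, 10] becomes [3, 20]), whereas squeeze(0) leaves it unchanged.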

2- I don't understand the reset method for the parameters. In particular, why do we reset parameters in this way?

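Parameter initialization can have a large impact on how well the model is able to learn. The general rule for initialization is to keep the values close to zero without making them too small. A common initialization is to draw from a normal distribution with mean 0 and variance 1/n, where n is the number of neurons, which in turn means a standard deviation of 1/sqrt(n).

In this case a uniform distribution is used instead of a normal distribution, but the general idea is similar. The minimum/maximum value is determined from the number of neurons while avoiding making it too small. If the minimum/maximum value were 1/n, the values would become very small, so 1/sqrt(n) is more appropriate. For example, with 256 neurons: 1/256 = 0.0039, whereas 1/sqrt(256) = 0.0625.

Initializing neural networks provides some explanations of different initializations with interactive visualizations.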
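The numbers above can be reproduced with the standard library alone. This is just an illustrative sketch: n = 256 is the example value from the text, and the sample count is an arbitrary choice. It also uses the fact that a uniform distribution on [-a, a] has standard deviation a/sqrt(3), so the uniform draw is somewhat narrower than a normal distribution with standard deviation 1/sqrt(n).

```python
import math
import random

random.seed(0)
n = 256  # number of neurons, the example value used in the text

bound_inverse = 1.0 / n                  # 1/n bound: very small values
bound_inverse_sqrt = 1.0 / math.sqrt(n)  # 1/sqrt(n) bound used by reset_parameters

# Draw values the way reset_parameters does, from U(-bound, bound).
samples = [random.uniform(-bound_inverse_sqrt, bound_inverse_sqrt)
           for _ in range(100_000)]  # arbitrary sample count

mean = sum(samples) / len(samples)
empirical_std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))

# Standard deviation of a uniform distribution on [-a, a] is a / sqrt(3).
theoretical_std = bound_inverse_sqrt / math.sqrt(3)

print(round(bound_inverse, 4), bound_inverse_sqrt)
print(empirical_std, theoretical_std)
```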
