RuntimeError: Function AddmmBackward returned an invalid gradient at index 1 (shape mismatch)

Posted 2024-04-23 22:24:21


I am trying to implement a VAE and I am running into a problem when computing the gradients of the model. I believe it happens in the decoder. The exact error message is: Function AddmmBackward returned an invalid gradient at index 1 - got [10, 32] but expected shape compatible with [10, 1024]. Here is the decoder model:

class decoderNW(nn.Module):
    def __init__(self):
        super(decoderNW,self).__init__()

        channels = 32
        kernelSize = 4
        padding = (2,0)            # note: defined but never passed to the layers below
        stride = (2,2)
        outputpadding = (1,0)

        self.FC1 = nn.Linear(channels, 1024)

        self.FC2 = nn.Linear(channels, 10656)

        self.deConv3x301 = nn.ConvTranspose2d(channels, 64, kernel_size=kernelSize, stride=stride, output_padding=outputpadding)
        nn.init.xavier_uniform_(self.deConv3x301.weight)

        self.deConv3x302 = nn.ConvTranspose2d(64, 128, kernel_size=kernelSize, stride=stride, output_padding=outputpadding)
        nn.init.xavier_uniform_(self.deConv3x302.weight)

        self.deConv3x303 = nn.ConvTranspose2d(128, 64, kernel_size=kernelSize, stride=stride, output_padding=outputpadding)
        nn.init.xavier_uniform_(self.deConv3x303.weight)

        self.deConv3x304 = nn.ConvTranspose2d(64, 3, kernel_size=kernelSize, stride=stride)
        nn.init.xavier_uniform_(self.deConv3x304.weight)

        self.bn1 = nn.BatchNorm1d(1024)
        self.bn2 = nn.BatchNorm2d(64)
        self.bn3 = nn.BatchNorm2d(128)
        self.bn4 = nn.BatchNorm2d(64)
 


        self.ReLU = nn.ReLU(inplace=True)

        self.sigmoid = nn.Sigmoid()

    def forward(self,x):

        x = self.FC1(x)
        x = self.bn1(x)
        x = self.ReLU(x)
        # Shape of x => 10x1024

        x = self.FC2(x)

        # Shape of x => 10x10656
        # Reshape x as 10x32x9x37 (32 * 9 * 37 = 10656)
        x = x.view(x.size(0),32,9,37)

        x = self.deConv3x301(x)
        x = self.bn2(x)
        x = self.ReLU(x)

        x = self.deConv3x302(x)
        x = self.bn3(x)
        x = self.ReLU(x)

        x = self.deConv3x303(x)
        x = self.bn4(x)
        x = self.ReLU(x)

        x = self.deConv3x304(x)
        x = self.sigmoid(x)

        return(x)
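For reference, here is a minimal sketch of the shape chain through the two FC layers, using the batch size 10 from the error message (my own reconstruction of the declared sizes, not the original run): FC1 maps 32 features to 1024, while FC2 is declared with in_features=channels, i.e. 32, yet receives the 1024-feature output of FC1. Recent PyTorch versions reject this already at the forward call; the [10, 32] vs. [10, 1024] gradient shapes in the error message match FC2's declared input size against the tensor it actually received.

import torch
import torch.nn as nn

channels = 32
fc1 = nn.Linear(channels, 1024)    # 32 features in -> 1024 features out
fc2 = nn.Linear(channels, 10656)   # declared to take 32 features in

x = torch.randn(10, channels)      # latent input, batch of 10
h = fc1(x)
print(h.shape)                     # torch.Size([10, 1024])

try:
    fc2(h)                         # 1024-feature tensor into in_features=32
except RuntimeError as e:
    print(e)                       # shapes cannot be multiplied (10x1024 and 32x10656)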

I believe this happens when I reshape the tensor coming out of the FC layers into a 2D, image-like tensor for the deconv layers.

I have tried using reshape() instead of view(), but the same problem persists. I am not sure where I am going wrong. Any help is much appreciated.
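For what it's worth, view() and reshape() only reinterpret the same elements (reshape additionally copies if the tensor is non-contiguous), so swapping one for the other cannot change what the preceding Linear layers expect. A quick sketch:

import torch

x = torch.randn(10, 10656)               # 32 * 9 * 37 = 10656, so the element counts match
a = x.view(x.size(0), 32, 9, 37)         # [10, 32, 9, 37]
b = x.reshape(x.size(0), 32, 9, 37)      # identical result for a contiguous tensor
print(a.shape, torch.equal(a, b))        # torch.Size([10, 32, 9, 37]) True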

Thanks

PS: I hit this error when calling backward(). Here is the relevant code snippet:

            optimizerVAE.zero_grad()
            variationalAE.train()

            vaeT = vaeT.to('cuda')

            mu, sigma, xHat, z = variationalAE(srcClrT)

            loss = vaeLoss(srcClrT, mu, sigma, xHat, z)

            loss.backward()
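In case it helps with debugging errors that only surface in backward(): torch.autograd.set_detect_anomaly(True) makes autograd attach a traceback of the forward operation whose backward failed. A self-contained sketch with a toy model (the names below are illustrative, not from the training code above):

import torch
import torch.nn as nn

torch.autograd.set_detect_anomaly(True)   # debugging aid only; slows training down

model = nn.Sequential(nn.Linear(32, 1024), nn.ReLU(), nn.Linear(1024, 10656))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(10, 32)
optimizer.zero_grad()
loss = model(x).pow(2).mean()             # dummy loss
loss.backward()                           # failures here now carry a forward traceback
optimizer.step()                          # note: the snippet above stops before this call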

Edit 1: adding the code for my VAE loss

class getVAELoss(torch.nn.Module):
    def __init__(self):
        super(getVAELoss, self).__init__()

    def forward(self, x, mu, sigma, xHat, z):
        # Calculate the ELBO
        # ELBO = KL divergence - reconstruction loss

        # Reconstruction loss:
        # compute the log-probability of x under the decoded normal distribution.
        # Note: logScale is re-created on every call, so it stays fixed at 0
        # instead of being learned.
        logScale = nn.parameter.Parameter(torch.Tensor([0.0]).to('cuda'))
        scale = torch.exp(logScale)
        dist = torch.distributions.Normal(xHat, scale)
        logProbXZ = dist.log_prob(x)
        logProbXZ = logProbXZ.sum(dim=(1, 2, 3))
        reconstructionLoss = logProbXZ

        # KL divergence:
        # create two distributions p and q, where p is the reference
        # distribution with zero mean and unit sigma
        p = torch.distributions.Normal(torch.zeros_like(mu), torch.ones_like(sigma))
        q = torch.distributions.Normal(mu, sigma)

        # calculate the log-probabilities of the sample z
        logQZX = q.log_prob(z)
        logPz = p.log_prob(z)

        KL = logQZX - logPz
        KL = KL.sum(-1)

        elbo = KL - reconstructionLoss
        elbo = elbo.mean()

        return(elbo)
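As a reading aid (this is the standard VAE formulation, not something specific to my code): the ELBO is E_q[log p(x|z)] - KL(q(z|x) || p(z)), estimated here with the single sample z, so the value returned above, KL - reconstructionLoss, is the negative ELBO, i.e. the quantity being minimized.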

The VAE loss is very similar to the one shown here.

Edit 2: Looking at several VAE network architectures, I noticed that only one FC layer is typically used in the decoder, so removing the second FC layer and changing the size of the first FC eliminated the error. But I don't understand why this works:

self.FC1 = nn.Linear(channels, 1024*4*13)

#self.FC2 = nn.Linear(channels, 10656)
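One reading of why this fixed it (inferred from the shapes in the error message, not verified against the full model): the problem was never the number of FC layers, but FC2's declared input size. nn.Linear(channels, 10656) expects a 32-feature input, yet it received the 1024-feature output of FC1. Keeping both layers with chained feature sizes should work as well, along these lines:

import torch
import torch.nn as nn

channels = 32
fc1 = nn.Linear(channels, 1024)   # [B, 32]   -> [B, 1024]
fc2 = nn.Linear(1024, 10656)      # in_features now matches fc1's out_features

z = torch.randn(10, channels)
h = fc2(fc1(z))
print(h.shape)                    # torch.Size([10, 10656]), ready for view(10, 32, 9, 37)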
