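Cannot move a PyTorch model to a device (.to(device))

0 votes

3 answers

67 views

Asked 2025-04-14 18:11

I'm writing my first autoencoder. Here is the code (it may look a bit odd, but I believe it is written correctly):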

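import torch
import torch.nn as nn

# NOTE: dim_code (the latent dimension) must be defined before the class is instantiated.

class Autoencoder(nn.Module):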
    def __init__(self):
        super(Autoencoder, self).__init__()
        
        self.flatten = nn.Flatten()
        
        self.enc_conv0 = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=(1, 1)),
            nn.ReLU(),
            nn.BatchNorm2d(64),

            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=(1, 1)),
            nn.ReLU(),
            nn.BatchNorm2d(128)
        )
        
        self.enc_conv1 = nn.Sequential(
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, padding=(1, 1)),
            nn.ReLU(),
            nn.BatchNorm2d(256),

            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, padding=(1, 1)),
            nn.ReLU(),
            nn.BatchNorm2d(512)
        )
        
        self.enc_fc = nn.Sequential(
            nn.Linear(in_features=512*64*64, out_features=4096),
            nn.ReLU(),
            nn.BatchNorm1d(4096),
            
            nn.Linear(in_features=4096, out_features=2048),
            nn.ReLU(),
            nn.BatchNorm1d(2048),
            
            nn.Linear(in_features=2048, out_features=dim_code)
        )
        
        self.dec_fc = nn.Sequential(
            nn.Linear(in_features=dim_code, out_features=2048),
            nn.ReLU(),
            nn.BatchNorm1d(2048),
            
            nn.Linear(in_features=2048, out_features=4096),
            nn.ReLU(),
            nn.BatchNorm1d(4096),
            
            nn.Linear(in_features=4096, out_features=512*64*64),
            nn.ReLU(),
            nn.BatchNorm1d(512*64*64)
        )
        
        self.dec_conv0 = nn.Sequential(
            nn.ConvTranspose2d(in_channels=512, out_channels=256, kernel_size=(3,3), padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(256),
            
            nn.ConvTranspose2d(in_channels=256, out_channels=128, kernel_size=(3,3), padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(128),
        )
        
        self.dec_conv1 = nn.Sequential(
            nn.ConvTranspose2d(in_channels=128, out_channels=64, kernel_size=(3,3), padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(64),
            
            nn.ConvTranspose2d(in_channels=64, out_channels=3, kernel_size=(3,3), padding=1)
        )

    def forward(self, x):
        e0 = self.enc_conv0(x)
        e1 = self.enc_conv1(e0)
        latent_code = self.enc_fc(self.flatten(e1))
        
        d0 = self.dec_fc(latent_code)
        d1 = self.dec_conv0(d0.view(-1, 512, 64, 64))
        reconstruction = self.dec_conv1(d1)

        return reconstruction, latent_code

Then I set up training with the following code:

`device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

criterion = nn.BCELoss()
print('crit')

autoencoder = Autoencoder().to(device)
print('deviced')`

After running this code, the output shows:

cuda 'crit'

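Then the program just hangs there, eating up memory and CPU (I'm working in a Kaggle notebook). I have no idea why this happens. :(

I tried running the same notebook on Google Colab, but it crashed outright, with an error about trying to allocate resources that were not available.

I also thought the problem might be related to the first line after the class initialization, so I replaced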

def __init__(self):
        super().__init__()

with

def __init__(self):
        super(Autoencoder, self).__init__()

as I had seen in some tutorials (honestly, I don't know what this code does, but it appears in every similar example).

That didn't help either.
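For reference, in Python 3 the two spellings of the super() call are equivalent, so swapping one for the other should not change how the model is built or moved to a device. A minimal sketch, using hypothetical classes `Base`, `ZeroArgSuper` and `ExplicitSuper` that are not part of the question's code:

```python
class Base:
    def __init__(self):
        # Stands in for the setup work a parent class such as nn.Module performs
        self.initialized = True


class ZeroArgSuper(Base):
    def __init__(self):
        # Python 3 shorthand: the class and instance are inferred automatically
        super().__init__()


class ExplicitSuper(Base):
    def __init__(self):
        # Older explicit spelling, still valid in Python 3
        super(ExplicitSuper, self).__init__()


# Both subclasses end up running Base.__init__ exactly once
print(ZeroArgSuper().initialized, ExplicitSuper().initialized)
```

Either way the parent initializer runs before any layers are assigned, which is what a module subclass needs.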

memory-management debugging deep-learning pytorch cuda resource-allocation autoencoder device-transfer

3 Answers

-2

Here is an updated version of your training code that incorporates some of the suggestions mentioned earlier:

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

criterion = nn.MSELoss()  # Use Mean Squared Error for image reconstruction
print('crit')

autoencoder = Autoencoder().to(device)
print('deviced')

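If the problem persists, try going over the points mentioned above. If you run into a specific error, or have more details about the problem, let me know.

Answered 2025-04-14 by Python大师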

I see you have already found the problem, but this answer may still help others.

First, work out how many parameters each layer of your model has:

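Linear(in_features=N, out_features=M) has MxN weights (plus M biases, which the counts below leave out);
Flatten and ReLU have no weights;
BatchNorm1d(C) and BatchNorm2d(C) each have 2xC weights (2 per channel);
Conv2d and ConvTranspose2d(in_channels=N, out_channels=M, kernel_size=K, ...) each have M distinct filters of size NxKxK. Each filter usually also has a bias (which is a weight too), so a Conv2d layer has MxNxKxK + M weights in total.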
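The rules above can be written down as a few small helpers. This is only a sketch of the counting convention used in this answer: the function names are made up for illustration, square kernels are assumed, and Linear biases are left out by default to match the figures in this answer (PyTorch's nn.Linear does include a bias unless told otherwise).

```python
def linear_params(in_features: int, out_features: int, bias: bool = False) -> int:
    """Fully connected layer: one weight per (input, output) pair."""
    return in_features * out_features + (out_features if bias else 0)


def batchnorm_params(num_features: int) -> int:
    """BatchNorm1d / BatchNorm2d: a learnable scale and shift per channel."""
    return 2 * num_features


def conv_params(in_channels: int, out_channels: int, kernel_size: int,
                bias: bool = True) -> int:
    """Conv2d / ConvTranspose2d with a square kernel: M filters of N x K x K, plus M biases."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


print(conv_params(3, 64, 3))               # first encoder convolution
print(batchnorm_params(64))                # the BatchNorm2d(64) that follows it
print(linear_params(512 * 64 * 64, 4096))  # first fully connected layer
```

The last call already shows where the trouble comes from: a single Linear layer fed by a flattened 512x64x64 feature map dwarfs every convolution in the model.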

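If you add up the weights of the whole model, you should get 8,599,888,384 + 2048 x dim_code + 8,604,081,155 + 2048 x dim_code = 17,203,969,539 + 4,096 x dim_code distinct parameters (hopefully I didn't miscalculate; the full calculation is at the bottom of this answer). That is more than some recent large language models have, such as Mistral7B (which, as the name suggests, has roughly 7B parameters).

Since PyTorch uses the float32 data type for tensors by default, your model needs more than 17.2B x 32 bit ≈ 64 GiB of memory for its weights alone. This estimate leaves out the 4,096 x dim_code term, assuming dim_code is relatively small. So make sure your machine has enough memory (RAM if you run on the CPU, or VRAM if you run on the GPU).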
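The memory estimate is simple arithmetic and can be reproduced as below. This is a rough sketch: `weights_memory_gib` is a hypothetical helper name, and it covers only the stored weights, so gradients and optimizer state during training would come on top of it.

```python
BYTES_PER_FLOAT32 = 4  # float32 takes 32 bits per value


def weights_memory_gib(num_params: int, bytes_per_param: int = BYTES_PER_FLOAT32) -> float:
    """Memory needed just to store the parameters, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


# Total from the answer, without the small 4,096 x dim_code term
total_params = 17_203_969_539

print(f"{weights_memory_gib(total_params):.1f} GiB")  # prints "64.1 GiB"
```

Typical Kaggle and Colab accelerators offer far less memory than this, which fits the hang and the allocation error described in the question.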

Parameter count calculation

The encoder has 8,599,888,384 + 2048 x dim_code parameters:

enc_conv0 contains:
- Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1) with 64x3x3x3+64 = 1,792 weights;
- BatchNorm2d(64) with 2x64 = 128 weights;
- Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1) with 128x64x3x3+128 = 73,856 weights;
- BatchNorm2d(128) with 2x128 = 256 weights;
enc_conv1 contains:
- Conv2d(in_channels=128, out_channels=256, kernel_size=3, padding=1) with 256x128x3x3+256 = 295,168 weights;
- BatchNorm2d(256) with 2x256 = 512 weights;
- Conv2d(in_channels=256, out_channels=512, kernel_size=3, padding=1) with 512x256x3x3+512 = 1,180,160 weights;
- BatchNorm2d(512) with 2x512 = 1,024 weights;
enc_fc contains:
- Linear(in_features=512*64*64, out_features=4096) with 512x64x64x4096 = 8,589,934,592 weights;
- BatchNorm1d(4096) with 2x4096 = 8,192 weights;
- Linear(in_features=4096, out_features=2048) with 4096x2048 = 8,388,608 weights;
- BatchNorm1d(2048) with 2x2048 = 4,096 weights;
- Linear(in_features=2048, out_features=dim_code) with 2048 x dim_code weights.

The decoder has 8,604,081,155 + 2048 x dim_code parameters:

dec_fc contains:
- Linear(in_features=dim_code, out_features=2048) with dim_code x 2048 weights;
- BatchNorm1d(2048) with 2x2048 = 4,096 weights;
- Linear(in_features=2048, out_features=4096) with 2048x4096 = 8,388,608 weights;
- BatchNorm1d(4096) with 2x4096 = 8,192 weights;
- Linear(in_features=4096, out_features=512*64*64) with 4096x512x64x64 = 8,589,934,592 weights;
- BatchNorm1d(512*64*64) with 2x512x64x64 = 4,194,304 weights;
dec_conv0 contains:
- ConvTranspose2d(in_channels=512, out_channels=256, kernel_size=3, padding=1) with 256x512x3x3 + 256 = 1,179,904 weights;
- BatchNorm2d(256) with 2x256 = 512 weights;
- ConvTranspose2d(in_channels=256, out_channels=128, kernel_size=3, padding=1) with 128x256x3x3 + 128 = 295,040 weights;
- BatchNorm2d(128) with 2x128 = 256 weights;
dec_conv1 contains:
- ConvTranspose2d(in_channels=128, out_channels=64, kernel_size=3, padding=1) with 64x128x3x3 + 64 = 73,792 weights;
- BatchNorm2d(64) with 2x64 = 128 weights;
- ConvTranspose2d(in_channels=64, out_channels=3, kernel_size=3, padding=1) with 3x64x3x3 + 3 = 1,731 weights.
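As a sanity check, the per-layer figures listed above can simply be added up. The sketch below copies those numbers into two lists (the variable names are only illustrative) and leaves out the dim_code terms, as the totals in this answer do.

```python
# Per-layer counts in the order listed above, without the 2048 x dim_code layers
encoder_layers = [
    1_792, 128, 73_856, 256,                    # enc_conv0
    295_168, 512, 1_180_160, 1_024,             # enc_conv1
    8_589_934_592, 8_192, 8_388_608, 4_096,     # enc_fc
]
decoder_layers = [
    4_096, 8_388_608, 8_192, 8_589_934_592, 4_194_304,  # dec_fc
    1_179_904, 512, 295_040, 256,                        # dec_conv0
    73_792, 128, 1_731,                                  # dec_conv1
]

encoder_total = sum(encoder_layers)
decoder_total = sum(decoder_layers)
grand_total = encoder_total + decoder_total

# Share of the total taken by the two Linear layers touching the 512*64*64 vector
big_linear_share = 2 * 8_589_934_592 / grand_total

print(f"encoder: {encoder_total:,}")
print(f"decoder: {decoder_total:,}")
print(f"total:   {grand_total:,}")
print(f"two largest Linear layers: {big_linear_share:.2%} of all weights")
```

The two Linear layers that connect to the flattened 512x64x64 feature map account for more than 99% of the weights, so those are the layers to shrink.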

Answered 2025-04-14 by Python大师

So the problem was the size of the model. As soon as I tried a smaller model, all the problems went away.

Answered 2025-04-14 by Python大师
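The post does not say how the model was made smaller. One common option, assumed here purely for illustration, is to downsample the feature map before flattening it (for example with stride-2 convolutions or nn.MaxPool2d(2)), since each halving of the spatial size cuts the first Linear layer by a factor of four. The helper name `first_fc_weights` and the sizes tried are hypothetical:

```python
def first_fc_weights(channels: int, side: int, hidden: int = 4096) -> int:
    """Weights of a Linear layer fed by a flattened channels x side x side feature map."""
    return channels * side * side * hidden


# Spatial size 64 is the original model; smaller sizes assume extra downsampling
for side in (64, 32, 16, 8):
    weights = first_fc_weights(512, side)
    gib = weights * 4 / 2**30  # float32 storage for this one layer
    print(f"{side:>2}x{side:<2} feature map: {weights:>13,} weights, {gib:g} GiB")
```

Under these assumptions, three 2x downsampling steps bring that layer from 32 GiB down to 0.5 GiB; reducing the channel count or the hidden width would shrink it further.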
