PyTorch torch.no_grad() versus requires_grad=False

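    import torch
    import torch.nn as nn
    from transformers import BertModel

    bert = BertModel.from_pretrained('bert-base-uncased')

    class BERTGRUSentiment(nn.Module):
        def __init__(self, bert):
            super().__init__()
            self.bert = bert

        def forward(self, text):
            # Run BERT without recording an autograd graph
            with torch.no_grad():
                embedded = self.bert(text)[0]
            # (the rest of forward is not shown in the question)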

       with torch.no_grad   requires_grad = False   Trainable parameters   Ran
       ------------------   ---------------------   --------------------   ------------------
    a. Yes                  Yes                     3M                     Successfully
    b. Yes                  No                      112M                   Successfully
    c. No                   Yes                     3M                     Successfully
    d. No                   No                      112M                   CUDA out of memory

1 Answer

User

#1 · Posted 2024-04-20 06:04:37

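This is an older discussion that has shifted slightly over the years (mainly because of the purpose of with torch.no_grad() as a pattern). An excellent answer that also more or less answers your question can already be found on Stack Overflow.
However, since the original question is quite different, I will refrain from marking this as a duplicate, especially because of the second part about memory.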

An initial explanation of no_grad is given in that answer:

with torch.no_grad() is a context manager and is used to prevent calculating gradients [...].

requires_grad, on the other hand, is used

to freeze part of your model and train the rest [...].

Source: again the SO post.
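
To make the two quotes concrete, here is a minimal sketch of both mechanisms. The names `encoder` and `head` are hypothetical stand-ins (two small linear layers) and not the BERT model from the question; the only assumption is that the behavior of autograd is the same for any module.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: `encoder` plays the role of the pretrained part,
# `head` the part that is trained on top of it.
encoder = nn.Linear(8, 4)
head = nn.Linear(4, 1)
x = torch.randn(2, 8)

# 1) torch.no_grad(): nothing computed inside the block is tracked by autograd.
with torch.no_grad():
    features = encoder(x)
no_grad_tracked = features.requires_grad          # False
no_grad_has_graph = features.grad_fn is not None  # False

# 2) requires_grad=False: freeze the encoder's parameters, train the rest.
for p in encoder.parameters():
    p.requires_grad = False

loss = head(encoder(x)).sum()
loss.backward()

frozen_grads = [p.grad for p in encoder.parameters()]  # every entry is None
trained_grads = [p.grad for p in head.parameters()]    # every entry is a tensor
```

In the first case the output is a plain tensor with no history, so calling backward through it is not possible. In the second case backward still works, but only the parameters of `head` receive gradients.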

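In essence, requires_grad=False only switches off gradient computation for part of the network, whereas no_grad records nothing for a backward pass at all, since you are most likely using it for inference rather than training.
To analyze how your combinations of settings behave, let us look at what happens in each case:

a) and b) do not store anything for gradients at all, so no matter how many parameters there are, you have far more memory available, because nothing is retained for a potential backward pass.
c) has to store the forward pass for later backpropagation, but only for a limited number of parameters (3 million), which keeps it manageable.
d), however, has to store the forward pass for all 112 million parameters, which causes you to run out of memory.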
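
The four cases can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not the code from the question: `TinySentiment`, `count_trainable` and `run` are hypothetical names, and two small linear layers stand in for BERT and for the trainable layers. It reports, for each combination, how many parameters are trainable and whether the encoder output is attached to an autograd graph.

```python
import torch
import torch.nn as nn


def count_trainable(model: nn.Module) -> int:
    """Number of parameters that autograd will compute gradients for."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


class TinySentiment(nn.Module):
    """Hypothetical miniature of the question's model (not real BERT)."""

    def __init__(self, use_no_grad: bool):
        super().__init__()
        self.encoder = nn.Linear(16, 16)  # stand-in for self.bert
        self.head = nn.Linear(16, 1)      # stand-in for the trainable layers
        self.use_no_grad = use_no_grad

    def forward(self, x):
        if self.use_no_grad:
            with torch.no_grad():
                embedded = self.encoder(x)
        else:
            embedded = self.encoder(x)
        # Record whether the encoder output is attached to an autograd graph.
        self.encoder_tracked = embedded.grad_fn is not None
        return self.head(embedded)


def run(use_no_grad: bool, freeze: bool):
    model = TinySentiment(use_no_grad)
    if freeze:
        for p in model.encoder.parameters():
            p.requires_grad = False
    model(torch.randn(4, 16))
    return count_trainable(model), model.encoder_tracked


results = {
    "a": run(use_no_grad=True, freeze=True),
    "b": run(use_no_grad=True, freeze=False),
    "c": run(use_no_grad=False, freeze=True),
    "d": run(use_no_grad=False, freeze=False),
}
for name, (n_trainable, tracked) in results.items():
    print(f"{name}) trainable={n_trainable:4d}  encoder output tracked={tracked}")
```

In this sketch only d) leaves the encoder output attached to a graph, which matches the table: it is the one configuration in which the whole encoder has to be kept around for the backward pass. Note that b) still counts the encoder's parameters as trainable, but because the forward pass ran under no_grad they would never receive gradients.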
